Is it possible to change the token split rules of the spaCy tokenizer?
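Question:
By default, the (German) spaCy tokenizer does not split on slashes, underscores, or asterisks, which is exactly what I need (so "der/die" results in a single token).
However, it does split on parentheses, so "dies(und)das" is split into 5 tokens. Is there a (simple) way to tell the default tokenizer not to split on parentheses either when they have letters on both sides and no surrounding spaces?
How exactly is the split on parentheses defined for the tokenizer?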
Answer:
The split on parentheses is defined in this line, which splits on a parenthesis that sits between two letters:
https://github.com/explosion/spaCy/blob/23ec07debdd568f09c7c83b10564850f9fa67ad4/spacy/lang/de/punctuation.py#L18
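To see why only parentheses flanked by letters are affected, the shape of that rule can be tried with Python's `re` module alone. This is a simplified sketch: `ALPHA` below is a hypothetical stand-in covering ASCII letters and German umlauts, not spaCy's much larger `ALPHA` class, the quote characters are left out, and `infix_hits` is a helper invented for the demo.

```python
import re

# Hypothetical stand-in for spaCy's ALPHA character class (assumption:
# ASCII letters plus German umlauts are enough for this demo).
ALPHA = "a-zA-ZäöüÄÖÜß"

# Same shape as the linked infix rule, minus the quote characters:
# a bracket or parenthesis with a letter directly before and after it.
paren_infix = re.compile(r"(?<=[{a}])[\)\]\(\[](?=[{a}])".format(a=ALPHA))


def infix_hits(text):
    """Return the characters the pattern would treat as infixes."""
    return [m.group() for m in paren_infix.finditer(text)]


print(infix_hits("dies(und)das"))    # ['(', ')']
print(infix_hits("dies (und) das"))  # []
print(infix_hits("der/die"))         # []
```

The lookbehind and lookahead are what restrict the rule to parentheses with a letter on each side; a parenthesis next to a space is left to the prefix and suffix rules instead.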
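There is no simple way to remove individual infix patterns, but you can define a custom tokenizer that does what you want. One way is to copy the infix definition from spacy/lang/de/punctuation.py and modify it: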
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, LIST_ELLIPSES, LIST_ICONS
from spacy.lang.de.punctuation import _quotes
from spacy.util import compile_infix_regex


def custom_tokenizer(nlp):
    # Infix rules copied from spacy/lang/de/punctuation.py. The only change
    # is that \( and \) are dropped from the bracket/quote rule below.
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
            r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
            r'(?<=[{a}])[:<>=](?=[{a}])'.format(a=ALPHA),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])([{q}\]\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
            r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
            r"(?<=[0-9])-(?=[0-9])",
        ]
    )
    infix_re = compile_infix_regex(infixes)

    # Reuse the default prefix, suffix, and exception handling unchanged.
    return Tokenizer(
        nlp.vocab,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=infix_re.finditer,
        token_match=nlp.tokenizer.token_match,
        rules=nlp.Defaults.tokenizer_exceptions,
    )


# 'de' is the model shortcut used at the time of writing; newer spaCy
# releases need a full package name such as 'de_core_news_sm'.
nlp = spacy.load('de')
nlp.tokenizer = custom_tokenizer(nlp)
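
The effect of dropping `\(` and `\)` from the bracket rule can be checked without loading a model. The following is a rough sketch, not spaCy's actual algorithm: `split_on_infixes` is a hypothetical helper that cuts a string at every infix match, `ALPHA` is a small stand-in for spaCy's character class, and the quote characters are omitted from both rules.

```python
import re

# Hypothetical stand-in for spaCy's ALPHA class (assumption for this demo).
ALPHA = "a-zA-ZäöüÄÖÜß"

# Bracket rule shaped like the linked source line (quotes omitted), and the
# trimmed version used in custom_tokenizer, which keeps only [ and ].
default_rule = re.compile(r"(?<=[{a}])[\)\]\(\[](?=[{a}])".format(a=ALPHA))
custom_rule = re.compile(r"(?<=[{a}])[\]\[](?=[{a}])".format(a=ALPHA))


def split_on_infixes(text, rule):
    """Cut text at each infix match, keeping the infix as its own piece."""
    pieces, start = [], 0
    for m in rule.finditer(text):
        if m.start() > start:
            pieces.append(text[start:m.start()])
        pieces.append(m.group())
        start = m.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_on_infixes("dies(und)das", default_rule))
# ['dies', '(', 'und', ')', 'das']
print(split_on_infixes("dies(und)das", custom_rule))
# ['dies(und)das']
print(split_on_infixes("dies[und]das", custom_rule))
# ['dies', '[', 'und', ']', 'das']
```

On a real pipeline, the corresponding check would be to inspect `[t.text for t in nlp.tokenizer("dies(und)das")]` after installing the custom tokenizer.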