deeppavlov.models.tokenizers

class deeppavlov.models.tokenizers.lazy_tokenizer.LazyTokenizer(**kwargs)[source]

Tokenizes if there is something to tokenize.
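A minimal sketch of the "tokenize only if there is something to tokenize" idea. The helper name `lazy_tokenize` and the use of `str.split()` as the tokenizer are assumptions made for illustration; the actual class may use a different tokenizer and different checks.

```python
from typing import List, Union


def lazy_tokenize(batch: List[Union[str, List[str]]]) -> List[Union[str, List[str]]]:
    """Tokenize a batch only when it holds raw strings (hypothetical helper)."""
    # An empty batch or already-tokenized input is returned unchanged.
    if len(batch) > 0 and isinstance(batch[0], str):
        # str.split() stands in for a real word tokenizer here.
        return [utterance.split() for utterance in batch]
    return batch


print(lazy_tokenize(["hello world"]))      # raw strings get tokenized
print(lazy_tokenize([["hello", "world"]]))  # token lists pass through
```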

class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer

escape

whether to escape characters for use in HTML markup

tokenizer

tokenizer instance from nltk.tokenize.moses

detokenizer

detokenizer instance from nltk.tokenize.moses

Parameters:escape – whether to escape characters for use in HTML markup
__call__(batch: List[Union[List[str], str]]) → List[Union[List[str], str]][source]

Tokenize a given batch of strings, or detokenize a given batch of lists of tokens

Parameters:batch – list of text samples or list of lists of tokens
Returns:list of lists of tokens or list of text samples
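The two directions of `__call__` and the effect of `escape` can be sketched without NLTK. This is an illustration only: the function `moses_like_call` is hypothetical, a simple regex stands in for the Moses tokenizer, and `html.escape` stands in for Moses escaping, which uses its own entity table.

```python
import html
import re
from typing import List, Union


def moses_like_call(batch: List[Union[List[str], str]],
                    escape: bool = False) -> List[Union[List[str], str]]:
    """Tokenize strings or detokenize token lists (hypothetical stand-in)."""
    if isinstance(batch[0], str):
        # Strings in: split into word and punctuation tokens.
        tokenized = [re.findall(r"\w+|[^\w\s]", text) for text in batch]
        if escape:
            # Escape each token so it is safe to embed in HTML markup.
            tokenized = [[html.escape(tok) for tok in tokens] for tokens in tokenized]
        return tokenized
    # Token lists in: join back into strings, undoing the escaping if requested.
    texts = [" ".join(tokens) for tokens in batch]
    if escape:
        texts = [html.unescape(text) for text in texts]
    return texts


tokens = moses_like_call(["Tom & Jerry"], escape=True)
print(tokens)
print(moses_like_call(tokens, escape=True))
```

A real detokenizer also reattaches punctuation to the preceding word; the plain `" ".join` above does not.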
class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using NLTK

Parameters:
  • tokenizer – name of the tokenization function to use from nltk.tokenize
  • download – whether to download nltk data
tokenizer

tokenizer instance from nltk.tokenize

__call__(batch: List[str]) → List[List[str]][source]

Tokenize given batch

Parameters:batch – list of text samples
Returns:list of lists of tokens
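Since the `tokenizer` parameter is a string such as 'wordpunct_tokenize', one plausible way to turn it into a callable is a name lookup on a module. The sketch below assumes that design and uses a hypothetical stand-in namespace instead of nltk.tokenize so that it runs without NLTK data; the names `fake_tokenize_module`, `resolve_tokenizer`, `tokenize_batch`, and `whitespace` are all illustrative.

```python
import re
from types import SimpleNamespace
from typing import Callable, List

# Hypothetical stand-in for the nltk.tokenize namespace.
fake_tokenize_module = SimpleNamespace(
    # Words and runs of punctuation become separate tokens.
    wordpunct_tokenize=lambda text: re.findall(r"\w+|[^\w\s]+", text),
    # Illustrative second mode: split on whitespace only.
    whitespace=lambda text: text.split(),
)


def resolve_tokenizer(name: str = "wordpunct_tokenize") -> Callable[[str], List[str]]:
    """Look up a tokenization function by name, failing loudly if it is missing."""
    tokenizer = getattr(fake_tokenize_module, name, None)
    if not callable(tokenizer):
        raise AttributeError(f"Tokenizer {name!r} is not defined")
    return tokenizer


def tokenize_batch(batch: List[str], name: str = "wordpunct_tokenize") -> List[List[str]]:
    """Apply the named tokenizer to every text sample in the batch."""
    tokenizer = resolve_tokenizer(name)
    return [tokenizer(text) for text in batch]


print(tokenize_batch(["Hello, world!"]))
```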
class deeppavlov.models.tokenizers.ru_sent_tokenizer.RuSentTokenizer(shortenings: Set[str] = {'co', 'corp', 'inc', 'авт', 'адм', 'барр', 'букв', 'внутр', 'га', 'гос', 'дифф', 'дол', 'долл', 'ед', 'жен', 'зав', 'зам', 'искл', 'коп', 'корп', 'куб', 'лат', 'мин', 'мн', 'муж', 'накл', 'о', 'обл', 'обр', 'повел', 'прим', 'проц', 'р', 'ред', 'руб', 'рус', 'русск', 'сан', 'сек', 'тыс', 'шутл', 'эт', 'яз'}, joining_shortenings: Set[str] = {'dr', 'mr', 'mrs', 'ms', 'vs', 'англ', 'араб', 'г', 'греч', 'д', 'евр', 'им', 'итал', 'кит', 'корп', 'пер', 'пл', 'рис', 'св', 'слав', 'см', 'сокр', 'стр', 'тел', 'ул', 'устар', 'яп'}, paired_shortenings: Set[Tuple[str, str]] = {('и', 'о'), ('н', 'э'), ('т', 'е'), ('т', 'п'), ('у', 'е')}, **kwargs)[source]

Rule-based sentence tokenizer for the Russian language. https://github.com/deepmipt/ru_sentence_tokenizer

Parameters:
  • shortenings – set of known shortenings. Use the default value when working on news or fiction texts
  • joining_shortenings – set of shortenings after which a sentence split is not possible (e.g. “ул”). Use the default value when working on news or fiction texts
  • paired_shortenings – set of known paired shortenings (e.g. “т. е.”). Use the default value when working on news or fiction texts
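To illustrate what a joining shortening does, here is a deliberately simplified sketch: it splits on sentence-final punctuation followed by whitespace, but refuses to split after a period that follows a joining shortening. It is not the algorithm of ru_sentence_tokenizer, it ignores `shortenings` and `paired_shortenings`, and the function name `split_sentences` is hypothetical.

```python
import re
from typing import List, Set


def split_sentences(text: str, joining_shortenings: Set[str]) -> List[str]:
    """Naive sentence splitter that never splits after a joining shortening."""
    sentences: List[str] = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Find the word immediately before the punctuation mark.
        previous = re.search(r"(\w+)$", text[start:match.start()])
        word = previous.group(1).lower() if previous else ""
        # A period after a joining shortening (e.g. "ул.") does not end a sentence.
        if text[match.start()] == "." and word in joining_shortenings:
            continue
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    # Whatever remains after the last split point is the final sentence.
    if start < len(text):
        sentences.append(text[start:])
    return sentences


print(split_sentences("Он живет на ул. Ленина. Это рядом.", {"ул"}))
```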
class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)[source]

Generates an utterance’s tokens using Python’s plain str.split().

Doesn’t have any parameters.
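For reference, this is how `str.split()` with no arguments behaves when applied to each utterance in a batch. The wrapper function `split_batch` is a hypothetical name used only for this example.

```python
from typing import List


def split_batch(batch: List[str]) -> List[List[str]]:
    """Split every utterance on runs of whitespace (hypothetical helper)."""
    # With no arguments, str.split() treats any run of whitespace as one
    # separator and drops leading and trailing whitespace.
    return [utterance.split() for utterance in batch]


print(split_batch(["  hello   world\n", "don't stop, now"]))
```

Punctuation stays attached to the neighbouring word, so "stop," is a single token.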

class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Optional[Iterable[str]] = None, stopwords: Optional[List[str]] = None, batch_size: Optional[int] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, spacy_model: str = 'en_core_web_sm', **kwargs)[source]

Tokenize or lemmatize a list of documents. The default spaCy model is en_core_web_sm. Returns a list of tokens or lemmas for each whole document. If called on a batch of token lists (List[List[str]]), performs detokenization instead.

Parameters:
  • disable – spaCy pipeline elements to disable, which serves to improve performance; if nothing
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • batch_size – a batch size for spaCy buffering
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • lowercase – whether to lowercase tokens; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
  • spacy_model – the name of the spaCy model to use; DeepPavlov searches for this name among the downloaded spaCy models; the default model is en_core_web_sm, which is downloaded automatically during DeepPavlov installation
stopwords

a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

model

a loaded spacy model

batch_size

a batch size for spaCy buffering

ngram_range

size of ngrams to create; only unigrams are returned by default

lemmas

whether to perform lemmatizing or not

lowercase

whether to lowercase tokens; performed by default by the _tokenize() and _lemmatize() methods

alphas_only

whether to filter out non-alpha tokens; performed by default by the _filter() method
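The interplay of `stopwords`, `lowercase`, `alphas_only` and `ngram_range` can be sketched in plain Python. The helper names below are hypothetical, and two details are assumptions not stated on this page: `ngram_range` is treated as an inclusive [min, max] pair, and n-grams are joined with a single space.

```python
from typing import Iterable, List, Sequence


def filter_tokens(tokens: Iterable[str], stopwords: Iterable[str] = (),
                  lowercase: bool = True, alphas_only: bool = True) -> List[str]:
    """Lowercase tokens, then drop non-alphabetic tokens and stopwords."""
    stop = set(stopwords)
    kept: List[str] = []
    for token in tokens:
        if lowercase:
            token = token.lower()
        if alphas_only and not token.isalpha():
            continue
        if token in stop:
            continue
        kept.append(token)
    return kept


def ngramize(tokens: Sequence[str], ngram_range: Sequence[int]) -> List[str]:
    """Build all n-grams with n in [ngram_range[0], ngram_range[1]] (assumed inclusive)."""
    low, high = ngram_range
    ngrams: List[str] = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = filter_tokens(["The", "cat", "sat", "!", "42"], stopwords=["the"])
print(tokens)
print(ngramize(tokens, [1, 2]))
```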

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]

Tokenize or detokenize strings, depending on the type structure of the passed arguments.

Parameters:batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns:a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises:TypeError – If the first element of batch is neither a List nor a str.
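The dispatch described above, including the documented TypeError, might look roughly like the following. This is a sketch under assumptions: the function name is hypothetical, whitespace splitting and joining stand in for spaCy processing, and the batch is assumed to be non-empty.

```python
from typing import List, Union


def tokenize_or_detokenize(
        batch: Union[List[str], List[List[str]]]) -> Union[List[List[str]], List[str]]:
    """Choose the direction by inspecting the first element of the batch."""
    first = batch[0]  # assumes a non-empty batch
    if isinstance(first, str):
        # Documents in: return one list of tokens per document.
        return [document.split() for document in batch]
    if isinstance(first, list):
        # Token lists in: return one detokenized string per list.
        return [" ".join(tokens) for tokens in batch]
    raise TypeError(
        f"Expected a batch of str or list elements, got {type(first).__name__}")


print(tokenize_or_detokenize(["a short document"]))
print(tokenize_or_detokenize([["a", "short", "document"]]))
```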
class deeppavlov.models.tokenizers.ru_tokenizer.RussianTokenizer(stopwords: Optional[List[str]] = None, ngram_range: List[int] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, **kwargs)[source]

Tokenize or lemmatize a list of documents in Russian. The default models are the ToktokTokenizer tokenizer and the pymorphy2 lemmatizer. Returns a list of tokens or lemmas for each whole document. If called on a batch of token lists (List[List[str]]), performs detokenization instead.

Parameters:
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • lowercase – whether to lowercase tokens; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
stopwords

a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

tokenizer

an instance of ToktokTokenizer tokenizer class

lemmatizer

an instance of pymorphy2.MorphAnalyzer lemmatizer class

ngram_range

size of ngrams to create; only unigrams are returned by default

lemmas

whether to perform lemmatizing or not

lowercase

whether to lowercase tokens; performed by default by the _tokenize() and _lemmatize() methods

alphas_only

whether to filter out non-alpha tokens; performed by default by the _filter() method

tok2morph

token-to-lemma cache
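A token-to-lemma cache such as tok2morph avoids repeating morphological analysis for tokens that have already been seen. The sketch below shows the general idea only; the class `CachedLemmatizer` and the toy `strip_plural` function are hypothetical and stand in for a real lemmatizer such as pymorphy2.MorphAnalyzer.

```python
from typing import Callable, Dict, List


class CachedLemmatizer:
    """Wrap a lemmatizing function with a token-to-lemma dictionary cache."""

    def __init__(self, lemmatize: Callable[[str], str]) -> None:
        self._lemmatize = lemmatize
        self.tok2morph: Dict[str, str] = {}

    def __call__(self, tokens: List[str]) -> List[str]:
        lemmas = []
        for token in tokens:
            # Run the (potentially slow) analysis only on a cache miss.
            if token not in self.tok2morph:
                self.tok2morph[token] = self._lemmatize(token)
            lemmas.append(self.tok2morph[token])
        return lemmas


calls = []


def strip_plural(token: str) -> str:
    """Toy lemmatizer: records each call and strips a trailing 's'."""
    calls.append(token)
    return token[:-1] if token.endswith("s") else token


lemmatizer = CachedLemmatizer(strip_plural)
print(lemmatizer(["cats", "dogs", "cats"]))
print(len(calls))  # the repeated token was served from the cache
```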

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]

Tokenize or detokenize strings, depending on the type structure of the passed arguments.

Parameters:batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns:a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises:TypeError – If the first element of batch is neither a List nor a str.