
class deeppavlov.models.tokenizers.lazy_tokenizer.LazyTokenizer(**kwargs)[source]

Tokenizes if there is something to tokenize.

class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer.


Attributes:
  • escape – whether to escape characters for use in HTML markup
  • tokenizer – tokenizer instance from nltk.tokenize.moses
  • detokenizer – detokenizer instance from nltk.tokenize.moses

Parameters: escape – whether to escape characters for use in HTML markup
__call__(batch: List[Union[str, List[str]]]) → List[Union[str, List[str]]][source]

Tokenize a given batch of strings, or detokenize a given batch of lists of tokens.

Parameters: batch – list of text samples or list of lists of tokens
Returns: list of lists of tokens or list of text samples
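
The two directions of __call__ can be illustrated with a minimal sketch. It is not the library's implementation: the function name tokenize_or_detokenize is hypothetical, plain whitespace splitting and joining stand in for the Moses tokenizer and detokenizer, and choosing the direction from the type of the first batch element is an assumption.

```python
from typing import List, Union

Batch = List[Union[str, List[str]]]


def tokenize_or_detokenize(batch: Batch) -> Batch:
    """Hypothetical stand-in for a tokenizer that works in both directions."""
    if not batch:
        return []
    # Assumption: the direction is decided by the first element of the batch.
    if isinstance(batch[0], str):
        # Strings in, lists of tokens out (whitespace split replaces Moses).
        return [sample.split() for sample in batch]
    # Lists of tokens in, strings out (whitespace join replaces Moses).
    return [" ".join(sample) for sample in batch]


tokens = tokenize_or_detokenize(["a small test", "second sample"])
texts = tokenize_or_detokenize(tokens)
```

With this stand-in, feeding the output of one call back into the function restores the original strings, which is the round trip the signature above suggests.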
class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using NLTK.

Parameters:
  • tokenizer – tokenization mode for nltk.tokenize ('wordpunct_tokenize' by default)
  • download – whether to download NLTK data

Attributes:
  • tokenizer – tokenizer instance from nltk.tokenize

__call__(batch: List[str]) → List[List[str]][source]

Tokenize the given batch.

Parameters: batch – list of text samples
Returns: list of lists of tokens
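
To give a feel for what the default 'wordpunct_tokenize' mode produces, here is a standard-library sketch. It assumes the regular expression used by NLTK's WordPunctTokenizer (runs of word characters, or runs of non-space punctuation); the helper name wordpunct_batch is hypothetical and the code does not call NLTK or DeepPavlov.

```python
import re
from typing import List

# Assumed pattern of NLTK's WordPunctTokenizer: alphanumeric runs
# or runs of punctuation that contain no whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_batch(batch: List[str]) -> List[List[str]]:
    """Tokenize every text sample in the batch independently."""
    return [WORDPUNCT.findall(text) for text in batch]


result = wordpunct_batch(["Hello, world!"])
# result == [["Hello", ",", "world", "!"]]
```

Punctuation becomes separate tokens here, which is the main difference from a plain whitespace split.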
class deeppavlov.models.tokenizers.ru_sent_tokenizer.RuSentTokenizer(shortenings: Set[str] = {'co', 'corp', 'inc', 'авт', 'адм', 'барр', 'букв', 'внутр', 'га', 'гос', 'дифф', 'дол', 'долл', 'ед', 'жен', 'зав', 'зам', 'искл', 'коп', 'корп', 'куб', 'лат', 'мин', 'мн', 'муж', 'накл', 'о', 'обл', 'обр', 'повел', 'прим', 'проц', 'р', 'ред', 'руб', 'рус', 'русск', 'сан', 'сек', 'тыс', 'шутл', 'эт', 'яз'}, joining_shortenings: Set[str] = {'dr', 'mr', 'mrs', 'ms', 'vs', 'англ', 'араб', 'г', 'греч', 'д', 'евр', 'им', 'итал', 'кит', 'корп', 'пер', 'пл', 'рис', 'св', 'слав', 'см', 'сокр', 'стр', 'тел', 'ул', 'устар', 'яп'}, paired_shortenings: Set[Tuple[str, str]] = {('и', 'о'), ('н', 'э'), ('т', 'е'), ('т', 'п'), ('у', 'е')}, **kwargs)[source]

Rule-based sentence tokenizer for the Russian language.

Parameters:
  • shortenings – set of known shortenings. Use the default value when working with news or fiction texts
  • joining_shortenings – set of shortenings after which a sentence split is not possible (e.g. “ул”). Use the default value when working with news or fiction texts
  • paired_shortenings – set of known paired shortenings (e.g. “т. е.”). Use the default value when working with news or fiction texts
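
The role of joining_shortenings can be sketched as follows. This is a deliberately simplified illustration, not the rules the class applies: the function split_sentences is hypothetical, it only looks at the word before a period, and it ignores shortenings and paired_shortenings entirely.

```python
import re
from typing import List, Set


def split_sentences(text: str, joining_shortenings: Set[str]) -> List[str]:
    """Split after '.', '!' or '?' unless a joining shortening precedes a period."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Word that immediately precedes the punctuation mark.
        preceding = re.search(r"(\w+)$", text[start:match.start()])
        word = preceding.group(1).lower() if preceding else ""
        if text[match.start()] == "." and word in joining_shortenings:
            continue  # e.g. "ул. Ленина": no sentence boundary here
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # trailing sentence
    return sentences


parts = split_sentences("Он живёт на ул. Ленина. Это в центре.", {"ул"})
# parts == ["Он живёт на ул. Ленина.", "Это в центре."]
```

Without "ул" in the set, the same text would be cut after "ул.", which is the kind of false boundary the parameter is meant to prevent.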
class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)[source]

Generates an utterance’s tokens using Python’s str.split().

Takes no parameters.
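
A minimal sketch of the behaviour described above, assuming the tokenizer is applied to each utterance of a batch; the helper name split_batch is hypothetical.

```python
from typing import List


def split_batch(batch: List[str]) -> List[List[str]]:
    """Split every utterance on runs of whitespace, as str.split() does."""
    return [utterance.split() for utterance in batch]


split_result = split_batch(["Hello, world!", "  extra   spaces  "])
# split_result == [["Hello,", "world!"], ["extra", "spaces"]]
```

Note that punctuation stays attached to the neighbouring word, and leading, trailing and repeated whitespace is discarded.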

class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Optional[List[str]] = None, stopwords: Optional[List[str]] = None, batch_size: Optional[int] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, n_threads: Optional[int] = None, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, spacy_model: str = 'en_core_web_sm', **kwargs)[source]

Tokenize or lemmatize a list of documents. The default spacy model is en_core_web_sm. Returns a list of tokens or lemmas for each whole document. If called on List[List[str]], performs detokenization instead.

Parameters:
  • disable – spacy pipeline elements to disable, serves a purpose of performing; if nothing
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • batch_size – a batch size for inner spacy multi-threading
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • n_threads – a number of threads for inner spacy multi-threading
  • lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
  • spacy_model – the name of the spacy model to use; DeepPavlov searches for this name among the downloaded spacy models; the default model is en_core_web_sm, which is downloaded automatically during DeepPavlov installation

Attributes:
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • model – a loaded spacy model
  • tokenizer – a loaded spacy tokenizer from the model
  • batch_size – a batch size for inner spacy multi-threading
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • n_threads – a number of threads for inner spacy multi-threading
  • lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]

Tokenize or detokenize strings, depending on the type structure of the passed arguments.

Parameters: batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns: a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises: TypeError – if the first element of batch is neither a List nor a str.
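
The ngram_range parameter can be illustrated with a short sketch. The function make_ngrams is hypothetical; treating the range as inclusive on both ends and joining the tokens of an n-gram with a single space are assumptions, not confirmed details of the class.

```python
from typing import List, Sequence


def make_ngrams(tokens: Sequence[str], ngram_range: Sequence[int] = (1, 1)) -> List[str]:
    """Build all n-grams for every n in the (assumed inclusive) range."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


unigrams = make_ngrams(["deep", "learning", "rocks"])
uni_and_bigrams = make_ngrams(["deep", "learning", "rocks"], [1, 2])
# uni_and_bigrams == ["deep", "learning", "rocks", "deep learning", "learning rocks"]
```

With the default range only unigrams come back, matching the note that only unigrams are returned by default.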
class deeppavlov.models.tokenizers.ru_tokenizer.RussianTokenizer(stopwords: Optional[List[str]] = None, ngram_range: List[int] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, **kwargs)[source]

Tokenize or lemmatize a list of documents in Russian. The default models are the ToktokTokenizer tokenizer and the pymorphy2 lemmatizer. Returns a list of tokens or lemmas for each whole document. If called on List[List[str]], performs detokenization instead.

Parameters:
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method

Attributes:
  • stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
  • tokenizer – an instance of the ToktokTokenizer tokenizer class
  • lemmatizer – an instance of the pymorphy2.MorphAnalyzer lemmatizer class
  • ngram_range – size of ngrams to create; only unigrams are returned by default
  • lemmas – whether to perform lemmatizing or not
  • lowercase – whether to perform lowercasing or not; performed by default by the _tokenize() and _lemmatize() methods
  • alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
  • tok2morph – token-to-lemma cache

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]

Tokenize or detokenize strings, depending on the type structure of the passed arguments.

Parameters: batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns: a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises: TypeError – if the first element of batch is neither a List nor a str.
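
How stopwords, lowercase and alphas_only might interact can be sketched as below. The function filter_tokens is hypothetical, and several details are assumptions: None is treated as "enabled" (one reading of "performed by default"), lowercasing happens before the other checks, and stopwords are compared against the lowercased token.

```python
from typing import Iterable, List, Optional


def filter_tokens(tokens: Iterable[str],
                  stopwords: Optional[List[str]] = None,
                  lowercase: Optional[bool] = None,
                  alphas_only: Optional[bool] = None) -> List[str]:
    """Lowercase tokens, drop non-alphabetic ones, then drop stopwords."""
    do_lower = lowercase is None or lowercase      # assumed: None means enabled
    do_alpha = alphas_only is None or alphas_only  # assumed: None means enabled
    stop = set(stopwords or [])
    kept = []
    for token in tokens:
        if do_lower:
            token = token.lower()
        if do_alpha and not token.isalpha():
            continue  # punctuation, numbers, mixed tokens
        if token in stop:
            continue
        kept.append(token)
    return kept


filtered = filter_tokens(["Я", "и", "ты", ",", "2024"], stopwords=["и"])
# filtered == ["я", "ты"]
```

Passing lowercase=False and alphas_only=False leaves the tokens untouched apart from stopword removal.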