deeppavlov.models.tokenizers¶

class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)[source]¶

Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer

escape¶: whether to escape characters for use in HTML markup

tokenizer¶: tokenizer instance from nltk.tokenize.moses

detokenizer¶: detokenizer instance from nltk.tokenize.moses

Parameters: escape – whether to escape characters for use in HTML markup

__call__(batch: List[Union[str, List[str]]]) → List[Union[str, List[str]]][source]¶

Tokenize a given batch of strings, or detokenize a given batch of lists of tokens

Parameters: batch – list of text samples or list of lists of tokens
Returns: list of lists of tokens or list of text samples
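
To make the escape flag concrete, here is a minimal sketch that runs without NLTK. It assumes whitespace splitting as a stand-in for the Moses tokenizer and the standard library's html.escape as a stand-in for Moses escaping; the helper name tokenize_batch is hypothetical, and the exact set of characters the real tokenizer escapes is not stated above and may differ.

```python
import html
from typing import List


def tokenize_batch(batch: List[str], escape: bool = False) -> List[List[str]]:
    """Split each sample on whitespace, optionally HTML-escaping every token.

    Illustration only: str.split() and html.escape() stand in for the
    Moses tokenizer and its escaping rules (an assumption, not the real API).
    """
    result = []
    for text in batch:
        tokens = text.split()  # stand-in for Moses tokenization
        if escape:
            tokens = [html.escape(token) for token in tokens]
        result.append(tokens)
    return result


plain = tokenize_batch(["a < b & c"])
escaped = tokenize_batch(["a < b & c"], escape=True)
```

With escape=False the tokens come back unchanged; with escape=True characters such as < and & are replaced by HTML entities, so the tokens can be embedded in markup safely.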

class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)[source]¶

Class for splitting texts into tokens using NLTK

Parameters

tokenizer – tokenization mode for nltk.tokenize
download – whether to download nltk data

tokenizer¶: tokenizer instance from nltk.tokenize

__call__(batch: List[str]) → List[List[str]][source]¶

Tokenize given batch

Parameters: batch – list of text samples
Returns: list of lists of tokens
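
As an illustration of the default wordpunct_tokenize mode, the sketch below reproduces the splitting rule with the standard library only. It assumes the pattern \w+|[^\w\s]+, which is the regular expression NLTK uses for its word-punct tokenizer; the helper name wordpunct_batch is hypothetical, and other tokenization modes behave differently.

```python
import re
from typing import List

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_batch(batch: List[str]) -> List[List[str]]:
    """Tokenize every text sample in the batch with the word-punct rule."""
    return [WORDPUNCT.findall(text) for text in batch]


tokens = wordpunct_batch(["Hello, world!", "It's 3.5"])
```

Note that this rule splits contractions and decimal numbers into separate tokens, which may or may not suit a given task.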

class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)[source]¶

Generates an utterance’s tokens using plain Python str.split().

Takes no parameters.
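
The behaviour described above can be sketched in a few lines. This is an approximation that assumes the tokenizer is applied to a batch of utterances, like the other tokenizers on this page; the function name split_batch is hypothetical.

```python
from typing import List


def split_batch(batch: List[str]) -> List[List[str]]:
    """Split every utterance on runs of whitespace, as str.split() does."""
    return [utterance.split() for utterance in batch]


tokens = split_batch(["Hello,  world!", " two\twords\n"])
```

Because str.split() only looks at whitespace, punctuation stays attached to the neighbouring word ("Hello," rather than "Hello" and ",").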

class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Optional[Iterable[str]] = None, filter_stopwords: bool = False, batch_size: Optional[int] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, spacy_model: str = 'en_core_web_sm', **kwargs)[source]¶

Tokenize or lemmatize a list of documents. The default spaCy model is en_core_web_sm. Returns a list of tokens or lemmas for a whole document. When called on a batch of token lists (List[List[str]]), performs detokenization instead.

Parameters

disable – spaCy pipeline elements to disable for performance reasons; if nothing …
filter_stopwords – whether to ignore stopwords during tokenizing/lemmatizing and ngrams creation
batch_size – a batch size for spaCy buffering
ngram_range – size of ngrams to create; only unigrams are returned by default
lemmas – whether to perform lemmatizing or not
lowercase – whether to perform lowercasing; performed by default by the _tokenize() and _lemmatize() methods
alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method
spacy_model – name of the spaCy model to use; DeepPavlov searches for this name among the downloaded spaCy models; the default model is en_core_web_sm, which is downloaded automatically during DeepPavlov installation

stopwords¶: a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

model¶: a loaded spacy model

batch_size¶: a batch size for spaCy buffering

ngram_range¶: size of ngrams to create; only unigrams are returned by default

lemmas¶: whether to perform lemmatizing or not

lowercase¶: whether to perform lowercasing; performed by default by the _tokenize() and _lemmatize() methods

alphas_only¶: whether to filter out non-alpha tokens; performed by default by the _filter() method
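
How the stopwords, ngram_range, lowercase and alphas_only attributes could interact is easiest to see in code. The following is a sketch under stated assumptions, not the library's implementation: the function name postprocess, the order of the steps, the reading of ngram_range as an inclusive [min, max] pair, and joining ngrams with a single space are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Optional, Sequence


def postprocess(tokens: Sequence[str],
                stopwords: Optional[Iterable[str]] = None,
                ngram_range: Optional[Sequence[int]] = None,
                lowercase: bool = True,
                alphas_only: bool = True) -> List[str]:
    """Lowercase, filter and ngramize an already tokenized document."""
    items = list(tokens)
    if lowercase:
        items = [t.lower() for t in items]
    if alphas_only:
        items = [t for t in items if t.isalpha()]
    if stopwords:
        blocked = set(stopwords)
        items = [t for t in items if t not in blocked]

    # Assumption: [min_n, max_n], both inclusive; unigrams only by default.
    min_n, max_n = ngram_range or (1, 1)
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(items) - n + 1):
            ngrams.append(" ".join(items[start:start + n]))
    return ngrams


unigrams = postprocess(["The", "Cat", "!"])
bigrams = postprocess(["The", "Cat", "sat", "!"],
                      stopwords={"the"}, ngram_range=[1, 2])
```

In this sketch filtering happens before ngram creation, so a removed stopword never appears inside a bigram; the real tokenizer may order these steps differently.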

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]¶

Tokenize or detokenize strings, depending on the type of the passed batch.

Parameters: batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns: a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises: TypeError – if the first element of batch is neither a List nor a str.
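
The type-based dispatch described for __call__ can be sketched as follows. This is an illustration only: str.split() and " ".join() stand in for the real tokenizing and detokenizing steps, the function name process is hypothetical, and returning an empty list for an empty batch is an assumption that the text above does not cover.

```python
from typing import List, Union


def process(batch: Union[List[str], List[List[str]]]
            ) -> Union[List[List[str]], List[str]]:
    """Tokenize a batch of strings or detokenize a batch of token lists."""
    if not batch:
        return []  # assumption: empty input gives empty output
    first = batch[0]
    if isinstance(first, str):
        # Batch of documents: tokenize each one (stand-in: whitespace split).
        return [document.split() for document in batch]
    if isinstance(first, list):
        # Batch of token lists: detokenize each one (stand-in: space join).
        return [" ".join(tokens) for tokens in batch]
    raise TypeError(
        f"expected str or list as first element, got {type(first).__name__}")


tokenized = process(["a b", "c"])
detokenized = process([["a", "b"], ["c"]])
```

Only the first element is inspected, so a batch that mixes strings and lists is not detected by this check.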

class deeppavlov.models.tokenizers.ru_tokenizer.RussianTokenizer(stopwords: Optional[List[str]] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, **kwargs)[source]¶

Tokenize or lemmatize a list of documents in Russian. The default models are the ToktokTokenizer tokenizer and the pymorphy2 lemmatizer. Returns a list of tokens or lemmas for a whole document. When called on a batch of token lists (List[List[str]]), performs detokenization instead.

Parameters

stopwords – a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
ngram_range – size of ngrams to create; only unigrams are returned by default
lemmas – whether to perform lemmatizing or not
lowercase – whether to perform lowercasing; performed by default by the _tokenize() and _lemmatize() methods
alphas_only – whether to filter out non-alpha tokens; performed by default by the _filter() method

stopwords¶: a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation

tokenizer¶: an instance of ToktokTokenizer tokenizer class

lemmatizer¶: an instance of pymorphy2.MorphAnalyzer lemmatizer class

ngram_range¶: size of ngrams to create; only unigrams are returned by default

lemmas¶: whether to perform lemmatizing or not

lowercase¶: whether to perform lowercasing; performed by default by the _tokenize() and _lemmatize() methods

alphas_only¶: whether to filter out non-alpha tokens; performed by default by the _filter() method

tok2morph¶: token-to-lemma cache

__call__(batch: Union[List[str], List[List[str]]]) → Union[List[List[str]], List[str]][source]¶

Tokenize or detokenize strings, depending on the type of the passed batch.

Parameters: batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
Returns: a batch of lists of tokens/lemmas; or a batch of detokenized strings
Raises: TypeError – if the first element of batch is neither a List nor a str.
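
The tok2morph attribute above is described as a token-to-lemma cache. A minimal sketch of that idea follows, assuming a plain dictionary keyed by token; the class CachingLemmatizer and the toy lemmatizer are hypothetical stand-ins, with toy_lemmatize taking the place of a real morphological analyzer such as pymorphy2.

```python
from typing import Callable, Dict, List


class CachingLemmatizer:
    """Wrap a slow token -> lemma function with a dictionary cache."""

    def __init__(self, lemmatize: Callable[[str], str]) -> None:
        self._lemmatize = lemmatize
        self.tok2morph: Dict[str, str] = {}  # token-to-lemma cache

    def __call__(self, tokens: List[str]) -> List[str]:
        lemmas = []
        for token in tokens:
            if token not in self.tok2morph:
                # Cache miss: analyze the token once and remember the result.
                self.tok2morph[token] = self._lemmatize(token)
            lemmas.append(self.tok2morph[token])
        return lemmas


calls: List[str] = []


def toy_lemmatize(token: str) -> str:
    """Toy stand-in for a morphological analyzer; records each call."""
    calls.append(token)
    return token[:-1] if token.endswith("s") else token


lemmatizer = CachingLemmatizer(toy_lemmatize)
lemmas = lemmatizer(["cats", "dogs", "cats"])
```

The repeated token is analyzed only once, which matters when the underlying analyzer is expensive and the corpus contains many repeated word forms.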