deeppavlov.models.tokenizers

class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer

escape

whether to escape characters for use in HTML markup

tokenizer

tokenizer instance from nltk.tokenize.moses

detokenizer

detokenizer instance from nltk.tokenize.moses

Parameters

escape – whether to escape characters for use in HTML markup

__call__(batch: List[Union[str, List[str]]]) List[Union[str, List[str]]][source]

Tokenize the given batch of strings or detokenize the given batch of token lists

Parameters

batch – list of text samples or list of lists of tokens

Returns

list of lists of tokens or list of text samples
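
A minimal usage sketch based on the signatures above; the exact tokens produced depend on the installed NLTK version (nltk.tokenize.moses was later split out into the separate sacremoses package):

    from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

    tokenizer = NLTKMosesTokenizer(escape=False)

    # A batch of strings is tokenized into a batch of token lists.
    tokens = tokenizer(['Hello, world!'])
    # e.g. [['Hello', ',', 'world', '!']]

    # A batch of token lists is detokenized back into strings.
    texts = tokenizer(tokens)
    # e.g. ['Hello, world!']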

class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)[source]

Class for splitting texts into tokens using NLTK

Parameters
  • tokenizer – tokenization mode for nltk.tokenize

  • download – whether to download nltk data

tokenizer

tokenizer instance from nltk.tokenize

__call__(batch: List[str]) List[List[str]][source]

Tokenize the given batch

Parameters

batch – list of text samples

Returns

list of lists of tokens
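
A minimal sketch using the default wordpunct_tokenize mode; the output shown is what NLTK's wordpunct_tokenize typically produces (it splits on word/punctuation boundaries):

    from deeppavlov.models.tokenizers.nltk_tokenizer import NLTKTokenizer

    # download=True fetches NLTK data on first use, if the chosen mode needs it.
    tokenizer = NLTKTokenizer(tokenizer='wordpunct_tokenize', download=True)

    print(tokenizer(["Can't stop, won't stop."]))
    # [['Can', "'", 't', 'stop', ',', 'won', "'", 't', 'stop', '.']]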

class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)[source]

Splits an utterance into tokens using plain Python str.split().

Doesn’t have any parameters.
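
A minimal sketch; because tokens come from str.split(), runs of whitespace are collapsed and punctuation stays attached to words:

    from deeppavlov.models.tokenizers.split_tokenizer import SplitTokenizer

    tokenizer = SplitTokenizer()
    print(tokenizer(['a quick   brown fox.']))
    # [['a', 'quick', 'brown', 'fox.']]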

class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Optional[Iterable[str]] = None, filter_stopwords: bool = False, batch_size: Optional[int] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, spacy_model: str = 'en_core_web_sm', **kwargs)[source]

Tokenize or lemmatize a list of documents. The default spaCy model is en_core_web_sm. Returns a list of tokens or lemmas for each whole document. If called on a List[List[str]] (a batch of token lists), performs detokenization instead.

Parameters
  • disable – spaCy pipeline components to disable, for faster processing; if nothing is specified, a default set of components is disabled

  • filter_stopwords – whether to ignore stopwords during tokenization/lemmatization and n-gram creation

  • batch_size – a batch size for spaCy buffering

  • ngram_range – size of ngrams to create; only unigrams are returned by default

  • lemmas – whether to perform lemmatizing or not

  • lowercase – whether to lowercase tokens; lowercasing is performed by default in the _tokenize() and _lemmatize() methods

  • alphas_only – whether to filter out non-alphabetic tokens; filtering is performed by default in the _filter() method

  • spacy_model – name of the spaCy model to use; DeepPavlov looks this name up among the downloaded spaCy models; the default model, en_core_web_sm, is downloaded automatically during DeepPavlov installation

stopwords

a list of stopwords to ignore during tokenization/lemmatization and n-gram creation

model

a loaded spacy model

batch_size

a batch size for spaCy buffering

ngram_range

size of ngrams to create; only unigrams are returned by default

lemmas

whether to perform lemmatizing or not

lowercase

whether to lowercase tokens; lowercasing is performed by default in the _tokenize() and _lemmatize() methods

alphas_only

whether to filter out non-alphabetic tokens; filtering is performed by default in the _filter() method

__call__(batch: Union[List[str], List[List[str]]]) Union[List[List[str]], List[str]][source]

Tokenize or detokenize strings, depending on the type structure of the passed batch.

Parameters

batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing

Returns

a batch of lists of tokens/lemmas; or a batch of detokenized strings

Raises

TypeError – if the first element of batch is neither a list nor a str
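
A minimal usage sketch covering both directions of __call__; the lemma output is illustrative and depends on the version of the en_core_web_sm model:

    from deeppavlov.models.tokenizers.spacy_tokenizer import StreamSpacyTokenizer

    tokenizer = StreamSpacyTokenizer(lemmas=True, lowercase=True,
                                     alphas_only=True,
                                     spacy_model='en_core_web_sm')

    # A batch of strings is lemmatized; non-alphabetic tokens
    # (here, the final period) are filtered out.
    print(tokenizer(['The striped bats were hanging.']))
    # e.g. [['the', 'striped', 'bat', 'be', 'hang']]

    # A batch of token lists is detokenized back into strings.
    print(tokenizer([['hello', 'world']]))
    # e.g. ['hello world']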