deeppavlov.models.preprocessors

class deeppavlov.models.preprocessors.assemble_embeddings_matrix.EmbeddingsMatrixAssembler(embedder, vocab, character_level=False, emb_dim=None, estimate_by_n=10000, *args, **kwargs)[source]

Assembles a matrix of embeddings obtained from an embedder.
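As a rough illustration of what assembling such a matrix involves, the sketch below builds one row per vocabulary entry by looking each token up in an embedder. It is not DeepPavlov's implementation: `assemble_matrix`, the dict-based `toy_embedder`, and the zero-vector fallback for unknown tokens are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def assemble_matrix(embedder: Dict[str, List[float]],
                    vocab: Sequence[str],
                    emb_dim: int) -> List[List[float]]:
    """Return a matrix whose i-th row is the embedding of vocab[i]."""
    matrix = []
    for token in vocab:
        # Assumption: tokens missing from the embedder get a zero vector.
        row = embedder.get(token, [0.0] * emb_dim)
        matrix.append(list(row))
    return matrix


# Hypothetical 2-dimensional embedder, used only for this example.
toy_embedder = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
toy_vocab = ["<UNK>", "cat", "dog"]
matrix = assemble_matrix(toy_embedder, toy_vocab, emb_dim=2)
```

The row order follows the vocabulary order, so a token's vocabulary index can be used directly as a row index into the matrix.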

class deeppavlov.models.preprocessors.assemble_embeddings_matrix.RandomEmbeddingsMatrix(vocab_len, emb_dim, *args, **kwargs)[source]

Assembles a matrix of random embeddings.
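The shape of the result can be pictured with a small standard-library sketch. The helper name `random_embeddings_matrix`, the `seed` argument, and the choice of a standard normal distribution are assumptions for illustration. The documentation above does not state which distribution the class uses.

```python
import random
from typing import List, Optional


def random_embeddings_matrix(vocab_len: int,
                             emb_dim: int,
                             seed: Optional[int] = None) -> List[List[float]]:
    """Return a vocab_len x emb_dim matrix of random values."""
    rng = random.Random(seed)
    # Assumption: entries are drawn from a standard normal distribution.
    return [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)]
            for _ in range(vocab_len)]


matrix = random_embeddings_matrix(vocab_len=5, emb_dim=3, seed=0)
```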

class deeppavlov.models.preprocessors.capitalization.CapitalizationPreprocessor(pad_zeros=True, *args, **kwargs)[source]

Featurizer useful for the NER task. It detects the following patterns:
  • no capitals
  • single capital single character
  • single capital multiple characters
  • all capitals multiple characters
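One way to read the four patterns is as a 4-dimensional one-hot feature per token. The sketch below follows that reading. The function name `capitalization_features`, the exact rule for each pattern, and the all-zero vector for tokens matching none of them (such as mixed-case words) are assumptions, not the library's confirmed behaviour.

```python
from typing import List


def capitalization_features(token: str) -> List[int]:
    """Map a token to a one-hot vector over four capitalization patterns."""
    features = [0, 0, 0, 0]
    if not token:
        return features
    if not any(ch.isupper() for ch in token):
        features[0] = 1  # no capitals
    elif len(token) == 1:
        features[1] = 1  # single capital, single character
    elif token[0].isupper() and not any(ch.isupper() for ch in token[1:]):
        features[2] = 1  # single (leading) capital, multiple characters
    elif token.isupper():
        features[3] = 1  # all capitals, multiple characters
    return features


sentence = ["hello", "A", "Hello", "NASA", "iPhone"]
encoded = [capitalization_features(tok) for tok in sentence]
```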

class deeppavlov.models.preprocessors.capitalization.LowercasePreprocessor(to_lower=True, append_case='first', *args, **kwargs)[source]

class deeppavlov.models.preprocessors.char_splitter.CharSplitter(**kwargs)[source]

This component transforms a batch of token sequences into a batch of sequences of character sequences.
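The transformation described above adds one level of nesting: each token becomes a list of its characters. A minimal sketch, with `split_chars` as a hypothetical helper name rather than the component's API:

```python
from typing import List


def split_chars(tokens_batch: List[List[str]]) -> List[List[List[str]]]:
    """Turn every token in every sequence into a list of its characters."""
    return [[list(token) for token in tokens] for tokens in tokens_batch]


batch = [["Hi", "all"], ["ok"]]
chars = split_chars(batch)
```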

class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(*args, **kwargs)[source]

Implements preprocessing of English texts with a low level of literacy, such as comments.

__call__(batch: List[str], **kwargs) → List[str][source]

Preprocess the given batch.

Parameters:
  • batch – list of text samples
  • **kwargs – additional arguments
Returns: list of preprocessed text samples

class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)[source]

Takes a batch of tokens and returns masks of the corresponding lengths.
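A common way to realise such masks is a rectangular matrix with ones over real tokens and zeros over padding. The sketch below assumes that layout. `make_mask` is a hypothetical helper, and the use of floats and of padding to the longest sequence are assumptions for this example.

```python
from typing import List


def make_mask(tokens_batch: List[List[str]]) -> List[List[float]]:
    """Return a batch_size x max_len mask: 1.0 for tokens, 0.0 for padding."""
    max_len = max((len(tokens) for tokens in tokens_batch), default=0)
    return [[1.0] * len(tokens) + [0.0] * (max_len - len(tokens))
            for tokens in tokens_batch]


mask = make_mask([["a", "b", "c"], ["d"]])
```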

class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = True, *args, **kwargs)[source]

One-hot featurizer with zero-padding.

Parameters:
  • depth – the depth for one-hotting
  • pad_zeros – whether to pad elements of the batch with zeros

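To make the two parameters concrete, the sketch below one-hot encodes sequences of integer indices and, when `pad_zeros` is set, pads shorter sequences with all-zero vectors. The function `one_hot_batch` and the choice to pad up to the longest sequence in the batch are assumptions, not the component's confirmed behaviour.

```python
from typing import List


def one_hot_batch(batch: List[List[int]],
                  depth: int,
                  pad_zeros: bool = True) -> List[List[List[int]]]:
    """One-hot encode index sequences, optionally zero-padding them."""
    def one_hot(index: int) -> List[int]:
        vec = [0] * depth
        vec[index] = 1
        return vec

    encoded = [[one_hot(i) for i in seq] for seq in batch]
    if pad_zeros:
        # Assumption: pad every sequence to the longest one in the batch.
        max_len = max((len(seq) for seq in encoded), default=0)
        encoded = [seq + [[0] * depth for _ in range(max_len - len(seq))]
                   for seq in encoded]
    return encoded


padded = one_hot_batch([[0, 2], [1]], depth=3)
ragged = one_hot_batch([[0, 2], [1]], depth=3, pad_zeros=False)
```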
class deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer(*args, **kwargs)[source]

Class for lemmatization using PyMorphy.

class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical=True, nums=False, *args, **kwargs)[source]

Removes all combining characters, such as diacritical marks, from tokens.
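Removing combining characters can be sketched with the standard `unicodedata` module. The helper `strip_combining` is hypothetical, and decomposing to NFD first (so that precomposed letters such as "é" expose their combining mark) is an assumption about how one might do this, not a description of the class internals.

```python
import unicodedata


def strip_combining(token: str) -> str:
    """Drop combining marks (e.g. accents) from a token."""
    # Assumption: decompose first so precomposed letters split into
    # a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # unicodedata.combining returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned = [strip_combining(tok) for tok in ["café", "naïve", "plain"]]
```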

class deeppavlov.models.preprocessors.str_lower.StrLower(*args, **kwargs)[source]

Component for converting strings to lowercase at any level of list nesting.

__call__(batch: Union[str, list, tuple])[source]

Recursively search for strings in a list and convert them to lowercase

Parameters: batch – a string or a list containing strings at some level of nesting
Returns: the same structure where all strings are converted to lowercase
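The recursion described above can be sketched in a few lines. `str_lower` is a hypothetical stand-alone function, and in this sketch tuples come back as lists, which is a simplification and may differ from the component.

```python
from typing import Union


def str_lower(batch: Union[str, list, tuple]) -> Union[str, list]:
    """Recursively lowercase every string in a nested list or tuple."""
    if isinstance(batch, str):
        return batch.lower()
    # Simplification: nested tuples are returned as lists.
    return [str_lower(item) for item in batch]


lowered = str_lower(["Hello", ["WORLD", ["DeepPavlov"]]])
```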