deeppavlov.models.preprocessors
-
class deeppavlov.models.preprocessors.assemble_embeddings_matrix.EmbeddingsMatrixAssembler(embedder, vocab, character_level=False, emb_dim=None, estimate_by_n=10000, *args, **kwargs)
    Assembles a matrix of embeddings obtained from an embedder.
-
class deeppavlov.models.preprocessors.assemble_embeddings_matrix.RandomEmbeddingsMatrix(vocab_len, emb_dim, *args, **kwargs)
    Assembles a matrix of random embeddings.
-
class deeppavlov.models.preprocessors.capitalization.CapitalizationPreprocessor(pad_zeros=True, *args, **kwargs)
    Featurizer useful for the NER task. It detects the following patterns:
    - no capitals
    - single capital single character
    - single capital multiple characters
    - all capitals multiple characters
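The four patterns listed above can be illustrated with a small standalone sketch. It is not the DeepPavlov implementation: the function name `capitalization_pattern`, the label strings, and the extra `"other"` fallback for mixed-case tokens are assumptions made for illustration only.

```python
def capitalization_pattern(token: str) -> str:
    """Classify a token into one of the four patterns, or 'other' (hypothetical labels)."""
    if not any(ch.isupper() for ch in token):
        return "no_capitals"
    if len(token) == 1:
        # The only character is a capital letter.
        return "single_capital_single_char"
    if token[0].isupper() and not any(ch.isupper() for ch in token[1:]):
        # Capitalized word: only the first character is a capital.
        return "single_capital_multi_char"
    if all(ch.isupper() for ch in token if ch.isalpha()):
        return "all_capitals_multi_char"
    # Mixed case such as "iPhone"; not one of the four documented patterns.
    return "other"


patterns = [capitalization_pattern(t) for t in ["the", "A", "Moscow", "NASA"]]
```

A featurizer for NER would typically turn such labels into a one-hot feature per token; how the actual class encodes them is not stated here.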
-
class deeppavlov.models.preprocessors.capitalization.LowercasePreprocessor(to_lower=True, append_case='first', *args, **kwargs)
-
class deeppavlov.models.preprocessors.char_splitter.CharSplitter(**kwargs)
    Transforms a batch of token sequences into a batch of sequences of character sequences.
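The shape of this transformation is easiest to see on a tiny batch. The following is a minimal sketch in plain Python, assuming tokens are strings; the helper name `split_into_chars` is hypothetical and is not part of DeepPavlov.

```python
from typing import List


def split_into_chars(batch: List[List[str]]) -> List[List[List[str]]]:
    """Turn each token in each sequence into a list of its characters."""
    return [[list(token) for token in tokens] for tokens in batch]


batch = [["hi", "all"], ["ok"]]
chars = split_into_chars(batch)
# chars == [[["h", "i"], ["a", "l", "l"]], [["o", "k"]]]
```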
-
class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(*args, **kwargs)
    Preprocesses English texts with a low level of literacy, such as comments.
-
class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)
    Takes a batch of tokens and returns masks of the corresponding length.
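One common reading of "masks of corresponding length" is a row of ones per sequence, as long as that sequence, padded with zeros up to the longest sequence in the batch. The sketch below follows that reading in plain Python; the helper name `build_masks` and the padding behaviour are assumptions, not a statement about the class's actual output type.

```python
from typing import List


def build_masks(tokens_batch: List[List[str]]) -> List[List[int]]:
    """Return one 0/1 mask per sequence, padded to the longest sequence."""
    max_len = max((len(tokens) for tokens in tokens_batch), default=0)
    return [
        [1] * len(tokens) + [0] * (max_len - len(tokens))
        for tokens in tokens_batch
    ]


masks = build_masks([["a", "b", "c"], ["d"]])
# masks == [[1, 1, 1], [1, 0, 0]]
```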
-
class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = True, *args, **kwargs)
    One-hot featurizer with zero-padding.

    Parameters:
    - depth – the depth for one-hotting
    - pad_zeros – whether to pad elements of the batch with zeros
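To make the roles of `depth` and `pad_zeros` concrete, here is a minimal sketch of one-hot encoding with zero-padding in plain Python. It assumes the input is a batch of sequences of integer indices in `[0, depth)`; the function name `one_hot_batch` and the range check are illustrative and not taken from DeepPavlov.

```python
from typing import List


def one_hot_batch(batch: List[List[int]], depth: int,
                  pad_zeros: bool = True) -> List[List[List[int]]]:
    """One-hot encode each index; optionally pad sequences with all-zero rows."""
    max_len = max((len(seq) for seq in batch), default=0)
    result = []
    for seq in batch:
        rows = []
        for idx in seq:
            if not 0 <= idx < depth:
                raise ValueError(f"index {idx} is outside [0, {depth})")
            row = [0] * depth
            row[idx] = 1
            rows.append(row)
        if pad_zeros:
            # Pad shorter sequences so every element of the batch has max_len rows.
            rows.extend([0] * depth for _ in range(max_len - len(seq)))
        result.append(rows)
    return result


padded = one_hot_batch([[0, 2], [1]], depth=3)
unpadded = one_hot_batch([[0, 2], [1]], depth=3, pad_zeros=False)
```

With padding enabled, every sequence in the result has the same number of rows, which is what allows the batch to be stacked into a single dense array.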
-
class deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer(*args, **kwargs)
    Class for lemmatization using PyMorphy.
-
class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical=True, nums=False, *args, **kwargs)
    Removes all combining characters, such as diacritical marks, from tokens.
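Removing combining characters can be sketched with the standard `unicodedata` module: decompose each token so that accents become separate combining code points, then drop those code points. This is a sketch under that assumption, not the class's actual code; the helper name `strip_combining` is hypothetical, and the `nums` option is not covered because its behaviour is not described here.

```python
import unicodedata
from typing import List


def strip_combining(tokens: List[str]) -> List[str]:
    """Drop combining marks (e.g. accents) from every token."""
    cleaned = []
    for token in tokens:
        # NFD splits precomposed letters into a base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", token)
        cleaned.append(
            "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        )
    return cleaned


sanitized = strip_combining(["caf\u00e9", "na\u00efve", "plain"])
# sanitized == ["cafe", "naive", "plain"]
```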