deeppavlov.models.preprocessors
class deeppavlov.models.preprocessors.assemble_embeddings_matrix.EmbeddingsMatrixAssembler(embedder: Union[deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder, deeppavlov.models.embedders.glove_embedder.GloVeEmbedder], vocab: deeppavlov.core.data.simple_vocab.SimpleVocabulary, character_level: bool = False, emb_dim: int = None, estimate_by_n: int = 10000, *args, **kwargs)

For a given vocabulary, assembles a matrix of embeddings obtained from some Embedder. The class can also assemble embeddings of characters, using averaged embeddings of the words that contain each character.
Parameters:
- embedder – an instance of the class that converts tokens to vectors, for example FasttextEmbedder or GloVeEmbedder
- vocab – an instance of SimpleVocabulary. The matrix of embeddings is assembled with one row per token in the vocabulary; the matrix indexing matches the vocabulary indexing.
- character_level – whether to perform assembling on the character level. In this case the matrix contains an embedding for every character, computed as the averaged embeddings of the words that contain this character.
- emb_dim – dimensionality of the resulting embeddings. If not None, it must be less than or equal to the dimensionality of the embeddings provided by the Embedder. The dimensionality reduction is performed by taking the principal components (PCA).
- estimate_by_n – how many samples to use to estimate the covariance matrix for PCA. 10000 samples are usually enough.
dim – dimensionality of the embeddings (can be less than the dimensionality of the embeddings produced by the Embedder).
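The assembling and PCA steps described above can be sketched in plain NumPy. The toy `embed` function below stands in for a real Embedder (a real setup would use `FasttextEmbedder` or `GloVeEmbedder`); the variable names are illustrative, not the library's internals.

```python
import numpy as np

# Toy stand-in for an Embedder: maps a token to a fixed-size vector.
rng = np.random.default_rng(0)
source_dim = 8
lexicon = {w: rng.standard_normal(source_dim) for w in ["the", "cat", "sat"]}

def embed(token):
    return lexicon.get(token, np.zeros(source_dim))

# Index in this list = row index in the assembled matrix.
vocab = ["<PAD>", "the", "cat", "sat"]

# Assemble: one row per vocabulary token, aligned with vocabulary indexing.
matrix = np.stack([embed(w) for w in vocab])

# Optional reduction to emb_dim dimensions: project the vectors onto the
# top principal components of their covariance matrix.
emb_dim = 4
centered = matrix - matrix.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, np.argsort(eigvals)[::-1][:emb_dim]]
reduced = centered @ components

print(matrix.shape, reduced.shape)  # (4, 8) (4, 4)
```

In the real component, the covariance is estimated from at most `estimate_by_n` sample vectors rather than the whole vocabulary.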
class deeppavlov.models.preprocessors.capitalization.CapitalizationPreprocessor(pad_zeros: bool = True, *args, **kwargs)

A featurizer useful for the NER task. It detects the following patterns in words:
- no capitals
- single capital, single character
- single capital, multiple characters
- all capitals, multiple characters
Parameters:
- pad_zeros – whether to pad the capitalization features batch with zeros up to the maximal length

dim – dimensionality of the feature vectors produced by the featurizer.
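The four patterns above can be detected with a few string predicates. The sketch below is illustrative only; the real featurizer's exact encoding and edge-case handling may differ.

```python
def capitalization_features(word):
    """Return a one-hot list over the four patterns:
    [no capitals, single capital single char,
     single capital multiple chars, all capitals multiple chars].
    Words matching none of the patterns (e.g. 'iPhone') yield all zeros.
    """
    features = [0, 0, 0, 0]
    if not any(c.isupper() for c in word):
        features[0] = 1                       # no capitals
    elif len(word) == 1 and word.isupper():
        features[1] = 1                       # single capital, single char
    elif word.isupper():
        features[3] = 1                       # all capitals, multiple chars
    elif word[0].isupper():
        features[2] = 1                       # single capital, multiple chars
    return features

print(capitalization_features("cat"))  # [1, 0, 0, 0]
print(capitalization_features("USA"))  # [0, 0, 0, 1]
```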
deeppavlov.models.preprocessors.capitalization.process_word(word: str, to_lower: bool = False, append_case: str = None) → Tuple[str]

Converts a word to a tuple of symbols, optionally lowercasing it and adding a capitalization label.
Parameters:
- word – input word
- to_lower – whether to lowercase the word
- append_case – whether to add a case mark ('<FIRST_UPPER>' for a leading capital, '<ALL_UPPER>' for all caps)

Returns: a preprocessed word
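A minimal sketch of the behaviour described above (not the library source; the real function also controls where the mark is placed via the `append_case` value):

```python
def process_word(word, to_lower=False, append_case=None):
    """Convert a word to a tuple of symbols; optionally lowercase it
    and prepend a capitalization label."""
    mark = None
    if append_case is not None:
        if word.isupper() and len(word) > 1:
            mark = "<ALL_UPPER>"
        elif word[:1].isupper():
            mark = "<FIRST_UPPER>"
    if to_lower:
        word = word.lower()
    symbols = tuple(word)
    return ((mark,) + symbols) if mark else symbols

print(process_word("Word", to_lower=True, append_case="first"))
# ('<FIRST_UPPER>', 'w', 'o', 'r', 'd')
```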
class deeppavlov.models.preprocessors.capitalization.LowercasePreprocessor(to_lower: bool = True, append_case: str = 'first', *args, **kwargs)

A callable wrapper over process_word(). Takes a batch of sentences as input and returns a batch of preprocessed sentences.
class deeppavlov.models.preprocessors.char_splitter.CharSplitter(**kwargs)

Transforms a batch of token sequences into a batch of sequences of character sequences.
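The transformation itself is a nested split; the one-liner below is an illustrative sketch of it, not the component's source:

```python
def split_chars(batch):
    """Turn a batch of token sequences into a batch of sequences of
    character sequences: each token becomes a list of its characters."""
    return [[list(token) for token in tokens] for tokens in batch]

print(split_chars([["hi", "cat"]]))
# [[['h', 'i'], ['c', 'a', 't']]]
```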
class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(*args, **kwargs)

Implements preprocessing of English texts with a low level of literacy, such as user comments.
class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)

Takes a batch of tokens and returns masks of the corresponding length.
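A sketch of what such a masking component computes, assuming masks are padded to the longest sequence in the batch (an assumption, not stated in the docstring):

```python
def make_masks(batch_tokens):
    """Build masks: 1.0 for a real token, 0.0 for padding positions,
    padded to the length of the longest sequence in the batch."""
    max_len = max(len(tokens) for tokens in batch_tokens)
    return [[1.0] * len(tokens) + [0.0] * (max_len - len(tokens))
            for tokens in batch_tokens]

print(make_masks([["a", "b"], ["a"]]))
# [[1.0, 1.0], [1.0, 0.0]]
```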
class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = True, *args, **kwargs)

One-hot featurizer with zero-padding.

Parameters:
- depth – the depth for one-hotting
- pad_zeros – whether to pad the elements of a batch with zeros
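An illustrative NumPy sketch of one-hotting with zero-padding (the function name and exact output layout are assumptions, not the component's API):

```python
import numpy as np

def one_hot_batch(batch, depth, pad_zeros=True):
    """One-hot encode a batch of index sequences. With pad_zeros=True,
    every sequence is padded with all-zero rows up to the longest one."""
    max_len = max(len(seq) for seq in batch)
    result = []
    for seq in batch:
        length = max_len if pad_zeros else len(seq)
        mat = np.zeros((length, depth))
        mat[np.arange(len(seq)), seq] = 1.0  # set one position per token
        result.append(mat)
    return result

encoded = one_hot_batch([[0, 2], [1]], depth=3)
print(encoded[1])  # second row is all zeros (padding)
```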
class deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer(*args, **kwargs)

Class for lemmatization of Russian text using PyMorphy.
class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)

Removes all combining characters, such as diacritical marks, from tokens.

Parameters:
- diacritical – whether to remove diacritical signs (e.g. circumflexes and stress marks)
- nums – whether to replace all digits with 1
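Both options can be expressed with the standard `unicodedata` module: NFD normalization splits accented characters into a base character plus combining marks, which can then be dropped. A sketch under these assumptions (not the component's source):

```python
import unicodedata

def sanitize(token, diacritical=True, nums=False):
    """Remove combining characters (diacritics) from a token and
    optionally replace every digit with '1'."""
    if diacritical:
        # NFD decomposes 'é' into 'e' + a combining acute accent;
        # filtering combining characters leaves only the base letters.
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(c for c in decomposed if not unicodedata.combining(c))
    if nums:
        token = "".join("1" if c.isdigit() else c for c in token)
    return token

print(sanitize("café"))               # cafe
print(sanitize("room42", nums=True))  # room11
```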
class deeppavlov.models.preprocessors.str_lower.StrLower(*args, **kwargs)

Component for converting strings to lowercase at any level of list nesting.
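"Any level of list nesting" suggests a simple recursion; the sketch below illustrates the idea (it is not the component's source):

```python
def str_lower(batch):
    """Lowercase strings at any level of list nesting: strings are
    lowercased, lists are traversed recursively."""
    if isinstance(batch, str):
        return batch.lower()
    return [str_lower(item) for item in batch]

print(str_lower([["Hello", ["WORLD"]], "Mixed"]))
# [['hello', ['world']], 'mixed']
```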
class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, *args, **kwargs)

Makes chunks from a document or a list of documents. If needed, keeps sentences intact instead of splitting them across chunks.

Parameters:
- sentencize_fn – a function for sentence segmentation
- keep_sentences – whether to keep sentences intact between chunks (instead of splitting them)
- tokens_limit – the number of tokens in a single chunk (usually this number corresponds to the SQuAD model limit)
- flatten_result – whether to flatten the resulting list of lists of chunks
keep_sentences – whether to keep sentences intact between chunks

tokens_limit – the number of tokens in a single chunk

flatten_result – whether to flatten the resulting list of lists of chunks
__call__(batch_docs: List[Union[List[str], str]]) → List[Union[List[str], List[List[str]]]]

Makes chunks from a batch of documents; each batch element can be a single document or a list of documents.

Parameters:
- batch_docs – a batch of documents or a batch of lists of documents

Returns: chunks of docs, flattened or not
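The core chunking logic (greedily packing sentences into chunks of at most `tokens_limit` tokens without splitting a sentence) can be sketched as follows. This assumes whitespace tokenization and operates on a pre-sentencized document; it is an illustration of the `keep_sentences=True` behaviour, not the component's source.

```python
def chunk_document(sentences, tokens_limit=400):
    """Group sentences into chunks of at most tokens_limit tokens,
    keeping every sentence intact (a sentence never spans two chunks)."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # whitespace token count (an assumption)
        if current and current_len + n > tokens_limit:
            chunks.append(" ".join(current))  # close the current chunk
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sents = ["one two three.", "four five.", "six seven eight nine."]
print(chunk_document(sents, tokens_limit=5))
```

A sentence longer than `tokens_limit` still becomes its own oversized chunk here, since it is never torn apart.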