deeppavlov.models.preprocessors

class deeppavlov.models.preprocessors.assemble_embeddings_matrix.EmbeddingsMatrixAssembler(embedder: deeppavlov.models.embedders.abstract_embedder.Embedder, vocab: deeppavlov.core.data.simple_vocab.SimpleVocabulary, character_level: bool = False, emb_dim: int = None, estimate_by_n: int = 10000, *args, **kwargs)[source]
For a given Vocabulary, assembles a matrix of embeddings obtained from some Embedder. This class can also assemble embeddings of characters, using averaged embeddings of the words that contain them.
Parameters:
  • embedder – an instance of the class that converts tokens to vectors, for example FasttextEmbedder or GloVeEmbedder
  • vocab – an instance of SimpleVocabulary. The matrix of embeddings will be assembled relying on every token in the vocabulary. The indexing will match the vocabulary indexing.
  • character_level – whether to perform assembling on the character level. This procedure will assemble a matrix with embeddings for every character, using averaged embeddings of the words that contain this character.
  • emb_dim – dimensionality of the resulting embeddings. If not None, it should be less than or equal to the dimensionality of the embeddings provided by the Embedder. The reduction of dimensionality is performed by taking the principal components (PCA).
  • estimate_by_n – how many samples to use to estimate the covariance matrix for PCA. 10000 seems to be enough.
dim

dimensionality of the embeddings (can be less than the dimensionality of the embeddings produced by the Embedder)
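
A minimal sketch of the assembly step: embed every vocabulary token and stack the results row by row, so that row i matches vocabulary index i. The embed callable here is a hypothetical stand-in for an Embedder, not the actual implementation:

    import numpy as np

    def assemble_matrix(vocab_tokens, embed, emb_dim):
        """Stack one embedding per token; row i corresponds to vocab index i."""
        matrix = np.zeros((len(vocab_tokens), emb_dim), dtype=np.float32)
        for i, token in enumerate(vocab_tokens):
            matrix[i] = embed(token)  # embed: token -> vector of size emb_dim
        return matrix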

class deeppavlov.models.preprocessors.capitalization.CapitalizationPreprocessor(pad_zeros: bool = True, *args, **kwargs)[source]

Featurizer useful for the NER task. It detects the following patterns in words:
  • no capitals
  • single capital, single character
  • single capital, multiple characters
  • all capitals, multiple characters

Parameters:pad_zeros – whether to pad the batch of capitalization features with zeros up to the maximal length or not.
dim

dimensionality of the feature vectors produced by the featurizer
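
For illustration, the four patterns above can be detected with plain string checks; this helper and its pattern indices are assumptions for the sketch, not the component's actual code:

    def capitalization_pattern(word: str) -> int:
        """Return the index of the matched capitalization pattern (0..3)."""
        if len(word) > 1 and word.isupper():
            return 3  # all capitals, multiple characters
        if len(word) > 1 and word[0].isupper():
            return 2  # single capital, multiple characters
        if len(word) == 1 and word.isupper():
            return 1  # single capital, single character
        return 0      # no capitals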

deeppavlov.models.preprocessors.capitalization.process_word(word: str, to_lower: bool = False, append_case: Optional[str] = None) → Tuple[str][source]

Converts a word to a tuple of symbols; optionally converts it to lowercase and adds a capitalization label.

Parameters:
  • word – input word
  • to_lower – whether to lowercase
  • append_case – whether to add case mark (‘<FIRST_UPPER>’ for first capital and ‘<ALL_UPPER>’ for all caps)
Returns:

a preprocessed word
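
The described behaviour can be sketched as follows (a simplified illustration, not the library's exact code; the 'first'/'last' placement values are assumed from LowercasePreprocessor's append_case parameter below):

    def process_word_sketch(word, to_lower=False, append_case=None):
        if len(word) > 1 and word.isupper():
            case_mark = '<ALL_UPPER>'
        elif word[0].isupper():
            case_mark = '<FIRST_UPPER>'
        else:
            case_mark = None
        if to_lower:
            word = word.lower()
        symbols = list(word)
        if case_mark is not None and append_case == 'first':
            symbols = [case_mark] + symbols
        elif case_mark is not None and append_case == 'last':
            symbols = symbols + [case_mark]
        return tuple(symbols)

    process_word_sketch('Moscow', to_lower=True, append_case='first')
    # ('<FIRST_UPPER>', 'm', 'o', 's', 'c', 'o', 'w')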

class deeppavlov.models.preprocessors.capitalization.LowercasePreprocessor(to_lower: bool = True, append_case: str = 'first', *args, **kwargs)[source]

A callable wrapper over process_word(). Takes as input a batch of tokenized sentences and returns a batch of preprocessed sentences.

class deeppavlov.models.preprocessors.char_splitter.CharSplitter(**kwargs)[source]

This component transforms a batch of sequences of tokens into a batch of sequences of character sequences.
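
A hypothetical one-line equivalent, for intuition:

    def split_chars(batch):
        # [['Hello', 'hi']] -> [[['H', 'e', 'l', 'l', 'o'], ['h', 'i']]]
        return [[list(token) for token in utterance] for utterance in batch]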

class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)[source]

Class implements preprocessing of English texts with a low level of literacy, such as comments.

__call__(batch: List[str], **kwargs) → List[str][source]

Preprocess given batch

Parameters:
  • batch – list of text samples
  • **kwargs – additional arguments
Returns:

list of preprocessed text samples
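
What such preprocessing might involve can be sketched with a couple of regular expressions; only the remove_punctuation flag is documented above, so the other steps here are assumptions:

    import re
    import string

    def clean_comment(text: str, remove_punctuation: bool = True) -> str:
        text = text.lower()
        text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
        if remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))
        return text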

class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)[source]

Takes a batch of tokens and returns masks of corresponding length
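
Presumably the mask marks real tokens with ones and pads each row with zeros up to the longest sequence in the batch; a sketch under that assumption:

    import numpy as np

    def make_masks(batch_tokens):
        max_len = max(len(utt) for utt in batch_tokens)
        masks = np.zeros((len(batch_tokens), max_len), dtype=np.float32)
        for i, utt in enumerate(batch_tokens):
            masks[i, :len(utt)] = 1.0
        return masks

    make_masks([['a', 'b', 'c'], ['a']])
    # array([[1., 1., 1.],
    #        [1., 0., 0.]], dtype=float32)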

class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)[source]

One-hot featurizer with zero-padding. If single_vector is set, only one vector per sample is returned; it can have several elements equal to 1.

Parameters:
  • depth – the depth for one-hotting
  • pad_zeros – whether to pad elements of batch with zeros
  • single_vector – whether to return one vector for the sample (sum of each one-hotted vectors)
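
Both modes can be illustrated with numpy (a schematic reimplementation, not the component's code):

    import numpy as np

    def one_hot(indices, depth, single_vector=False):
        vectors = np.eye(depth, dtype=np.float32)[indices]  # one row per index
        if single_vector:
            return vectors.sum(axis=0)  # single vector, several 1s possible
        return vectors

    one_hot([0, 2], depth=4)                      # shape (2, 4)
    one_hot([0, 2], depth=4, single_vector=True)  # array([1., 0., 1., 0.])
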
class deeppavlov.models.preprocessors.random_embeddings_matrix.RandomEmbeddingsMatrix(vocab_len: int, emb_dim: int, *args, **kwargs)[source]

Assembles matrix of random embeddings.

Parameters:
  • vocab_len – length of the vocabulary (number of tokens in it)
  • emb_dim – dimensionality of the embeddings
dim

dimensionality of the embeddings
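
One common way to build such a matrix (the normal initialization and 1/sqrt(emb_dim) scaling are a conventional choice, not necessarily what the library does):

    import numpy as np

    def random_embeddings(vocab_len: int, emb_dim: int, seed: int = 42):
        rng = np.random.default_rng(seed)
        # scale so that vectors have roughly unit norm
        matrix = rng.standard_normal((vocab_len, emb_dim)) / np.sqrt(emb_dim)
        return matrix.astype(np.float32)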

class deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer(*args, **kwargs)[source]

Class for lemmatization using PyMorphy.
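
Lemmatization with pymorphy2, which this component presumably wraps, looks roughly like this:

    import pymorphy2  # pip install pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def lemmatize(tokens):
        # take the normal form of the most probable parse for each token
        return [morph.parse(token)[0].normal_form for token in tokens]

    lemmatize(['машины'])  # e.g. ['машина']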

class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)[source]

Remove all combining characters like diacritical marks from tokens

Parameters:
  • diacritical – whether to remove diacritical signs or not. Diacritical signs are something like hats and stress marks.
  • nums – whether to replace all digits with 1 or not
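
Removing combining marks is conventionally done via Unicode NFD decomposition; a sketch of both options (the sanitize helper is an illustration, not the component itself):

    import re
    import unicodedata

    def sanitize(token: str, diacritical: bool = True, nums: bool = False) -> str:
        if diacritical:
            # decompose, then drop combining code points (accents, stress marks, ...)
            decomposed = unicodedata.normalize('NFD', token)
            token = ''.join(c for c in decomposed if not unicodedata.combining(c))
        if nums:
            token = re.sub(r'\d', '1', token)
        return token

    sanitize('naïve')                # 'naive'
    sanitize('agent007', nums=True)  # 'agent111'
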
class deeppavlov.models.preprocessors.siamese_preprocessor.SiamesePreprocessor(save_path: str = './tok.dict', load_path: str = './tok.dict', max_sequence_length: int = None, dynamic_batch: bool = False, padding: str = 'post', truncating: str = 'post', use_matrix: bool = True, num_context_turns: int = 1, num_ranking_samples: int = 1, tokenizer: deeppavlov.core.models.component.Component = None, vocab: deeppavlov.core.models.estimator.Estimator = 'simple_vocab', embedder: deeppavlov.core.models.component.Component = 'fasttext', sent_vocab: deeppavlov.core.models.estimator.Estimator = 'simple_vocab', **kwargs)[source]

Preprocessing of data samples containing a few text strings to feed them into siamese networks.

The first num_context_turns strings in each data sample correspond to the dialogue context, and the remaining string(s) in the sample are the response(s).

Parameters:
  • save_path – The parameter is only needed to initialize the base class Serializable.
  • load_path – The parameter is only needed to initialize the base class Serializable.
  • max_sequence_length – A maximum length of text sequences in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • dynamic_batch – Whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequences lengths from this batch, but not higher than max_sequence_length.
  • padding – Padding. Possible values are pre and post. If set to pre, a sequence will be padded at the beginning. If set to post, it will be padded at the end.
  • truncating – Truncating. Possible values are pre and post. If set to pre, a sequence will be truncated at the beginning. If set to post, it will be truncated at the end.
  • use_matrix – Whether to use a trainable matrix with token (word) embeddings.
  • num_context_turns – A number of context turns in data samples.
  • num_ranking_samples – A number of candidates for ranking, including the positive one.
  • tokenizer – An instance of one of the deeppavlov.models.tokenizers.
  • vocab – An instance of deeppavlov.core.data.simple_vocab.SimpleVocabulary.
  • embedder – An instance of one of the deeppavlov.models.embedders.
  • sent_vocab – An instance of deeppavlov.core.data.simple_vocab.SimpleVocabulary. It is used to store all responses and to find the best response to the user context in the interact mode.
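
With num_context_turns=2 and num_ranking_samples=2, a single data sample would be laid out as follows (made-up strings, illustrating the convention described above):

    sample = [
        'hi, can you help me?',       # context turn 1
        'sure, what do you need?',    # context turn 2
        'i need to book a flight',    # response candidate 1 (positive)
        'the weather is nice today',  # response candidate 2
    ]
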
class deeppavlov.models.preprocessors.str_lower.StrLower(*args, **kwargs)[source]

Component for converting strings to lowercase at any level of list nesting

__call__(batch: Union[str, list, tuple])[source]

Recursively search for strings in a list and convert them to lowercase

Parameters:batch – a string or a list containing strings at some level of nesting
Returns:the same structure where all strings are converted to lowercase
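
An equivalent standalone function (a hypothetical helper mirroring the documented behaviour):

    def str_lower(batch):
        if isinstance(batch, str):
            return batch.lower()
        return [str_lower(item) for item in batch]

    str_lower([['Hello', 'World'], 'OK'])  # [['hello', 'world'], 'ok']
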
class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)[source]

Component for converting strings to strings with reversed token positions

Parameters:tokenized – The parameter is only needed to reverse tokenized strings.
__call__(batch: Union[str, list, tuple]) → Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]][source]

Recursively search for strings in a list and convert them to strings with reversed token positions

Parameters:batch – a string or a list containing strings
Returns:the same structure where the tokens of all strings are reversed
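
For a plain (untokenized) string the operation amounts to the following (a hypothetical equivalent):

    def reverse_tokens(s: str) -> str:
        return ' '.join(reversed(s.split()))

    reverse_tokens('the quick brown fox')  # 'fox brown quick the'
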
class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)[source]

Component for encoding all strings to utf8 codes

Parameters:
  • max_word_length – Max length of words of input and output batches.
  • pad_special_char_use – Whether to use special char for padding or not.
  • word_boundary_special_char_use – Whether to add word boundaries by special chars or not.
  • sentence_boundary_special_char_use – Whether to add sentence boundaries by special chars or not.
  • reversed_sentense_tokens – Whether to use reversed sequences of tokens or not.
  • bos – Name of a special token of the begin of a sentence.
  • eos – Name of a special token of the end of a sentence.
__call__(batch: Union[List[str], Tuple[str]]) → Union[List[str], List[Union[List[str], List[StrUTF8EncoderInfo]]]][source]

Recursively search for strings in a list and utf8-encode them

Parameters:batch – a string or a list containing strings
Returns:the same structure where all strings are utf8 encoded
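
At its core, encoding one token to UTF-8 codes with fixed-length padding might look like this (a sketch; the actual padding and boundary special-char values are implementation details not given here):

    def utf8_encode(token: str, max_word_length: int = 50, pad_id: int = 0):
        codes = list(token.encode('utf-8'))[:max_word_length]
        return codes + [pad_id] * (max_word_length - len(codes))

    utf8_encode('héllo', max_word_length=8)
    # [104, 195, 169, 108, 108, 111, 0, 0]
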
class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = ..., keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, *args, **kwargs)[source]

Make chunks from a document or a list of documents. If needed, keeps sentences intact instead of tearing them up between chunks.

Parameters:
  • sentencize_fn – a function for sentence segmentation
  • keep_sentences – whether to keep sentences intact, i.e. not to tear them up between chunks
  • tokens_limit – a number of tokens in a single chunk (usually this number corresponds to the squad model limit)
  • flatten_result – whether to flatten the resulting list of lists of chunks
  • paragraphs – whether to split the document by paragraphs; if set to True, tokens_limit is ignored
keep_sentences

whether to keep sentences intact, i.e. not to tear them up between chunks

tokens_limit

a number of tokens in a single chunk

flatten_result

whether to flatten the resulting list of lists of chunks

paragraphs

whether to split the document by paragraphs; if set to True, tokens_limit is ignored

__call__(batch_docs: List[Union[str, List[str]]]) → List[Union[List[str], List[List[str]]]][source]

Make chunks from a batch of documents. There can be several documents in each batch.

Parameters:batch_docs – a batch of documents / a batch of lists of documents

Returns:chunks of docs, flattened or not
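
The sentence-preserving strategy can be sketched as greedy packing: add whole sentences to the current chunk until tokens_limit would be exceeded. The naive splitting below stands in for sentencize_fn and a real tokenizer:

    def chunk_document(doc: str, tokens_limit: int = 400):
        sentences = doc.split('. ')  # naive stand-in for sentencize_fn
        chunks, current, current_len = [], [], 0
        for sent in sentences:
            n_tokens = len(sent.split())
            if current and current_len + n_tokens > tokens_limit:
                chunks.append(' '.join(current))
                current, current_len = [], 0
            current.append(sent)
            current_len += n_tokens
        if current:
            chunks.append(' '.join(current))
        return chunks
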
class deeppavlov.models.preprocessors.odqa_preprocessors.StringMultiplier(**kwargs)[source]

Make a list of strings from a provided string. The length of the resulting list equals the length of the provided reference argument.

__call__(batch_s: List[str], ref: List[str]) → List[List[str]][source]

Multiply each string in a provided batch of strings.

Parameters:
  • batch_s – a batch of strings to be multiplied
  • ref – a reference to obtain a length of the resulting list
Returns:

the multiplied strings as lists
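
A minimal equivalent, assuming each string is repeated once per element of its reference item (e.g. a question repeated for every retrieved document):

    def multiply_strings(batch_s, ref):
        return [[s] * len(r) for s, r in zip(batch_s, ref)]

    multiply_strings(['What is AI?'], [['doc1', 'doc2']])
    # [['What is AI?', 'What is AI?']]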