deeppavlov.models.preprocessors

class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)[source]

Implements preprocessing of English texts with a low level of literacy, such as user comments

__call__(batch: List[str], **kwargs) List[str][source]

Preprocess the given batch

Parameters
  • batch – list of text samples

  • **kwargs – additional arguments

Returns

list of preprocessed text samples
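
For illustration, the kind of preprocessing this component performs can be sketched in plain Python. The sketch below is an assumption-based approximation (lowercasing, collapsing repeated characters, optional punctuation removal), not the library implementation, and the function name is hypothetical:

```python
import re
from typing import List

def dirty_comments_preprocess(batch: List[str], remove_punctuation: bool = True) -> List[str]:
    """Illustrative sketch of low-literacy comment cleanup (not the library code)."""
    processed = []
    for text in batch:
        text = text.lower()
        # Collapse runs of 3+ identical characters, e.g. "sooooo" -> "soo"
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)
        if remove_punctuation:
            text = re.sub(r"[^\w\s]", " ", text)
        # Normalize whitespace
        text = re.sub(r"\s+", " ", text).strip()
        processed.append(text)
    return processed
```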

class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)[source]

Takes a batch of tokens and returns masks of the corresponding lengths
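
The behavior can be sketched as follows, assuming (as is typical for batching) that masks are zero-padded to the length of the longest sample; this is an illustrative sketch, not the library implementation:

```python
from typing import List

def make_masks(batch_tokens: List[List[str]]) -> List[List[int]]:
    """Sketch: 1 for every real token, 0 for padding up to the longest sample."""
    max_len = max(len(tokens) for tokens in batch_tokens)
    return [[1] * len(tokens) + [0] * (max_len - len(tokens))
            for tokens in batch_tokens]
```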

class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)[source]

One-hot featurizer with zero-padding. If single_vector is set, returns a single vector per sample which can have several elements equal to 1.

Parameters
  • depth – the depth for one-hotting

  • pad_zeros – whether to pad elements of batch with zeros

  • single_vector – whether to return one vector per sample (the sum of its one-hot vectors)
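
The parameters above can be illustrated with a minimal sketch of one-hot featurization (assumed semantics, without zero-padding, not the library code):

```python
from typing import List

def one_hot(batch: List[List[int]], depth: int, single_vector: bool = False) -> List:
    """Sketch: one-hot encode index sequences; optionally sum into one multi-hot vector."""
    result = []
    for sample in batch:
        vectors = [[1 if i == idx else 0 for i in range(depth)] for idx in sample]
        if single_vector:
            # Sum the per-token one-hot vectors into a single multi-hot vector
            result.append([sum(col) for col in zip(*vectors)])
        else:
            result.append(vectors)
    return result
```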

class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)[source]

Remove all combining characters like diacritical marks from tokens

Parameters
  • diacritical – whether to remove diacritical signs; diacritical signs are marks such as circumflexes and stress marks

  • nums – whether to replace all digits with 1 or not
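
A sketch of this sanitization in plain Python (assumed semantics, using Unicode NFD decomposition to strip combining marks; not the library implementation):

```python
import re
import unicodedata
from typing import List

def sanitize(tokens: List[str], diacritical: bool = True, nums: bool = False) -> List[str]:
    """Sketch: strip combining characters and optionally replace digits with 1."""
    out = []
    for token in tokens:
        if diacritical:
            # Decompose, then drop combining marks (Unicode category Mn)
            token = "".join(c for c in unicodedata.normalize("NFD", token)
                            if unicodedata.category(c) != "Mn")
        if nums:
            token = re.sub(r"\d", "1", token)
        out.append(token)
    return out
```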

deeppavlov.models.preprocessors.str_lower.str_lower(batch: Union[str, list, tuple])[source]

Recursively search for strings in a list and convert them to lowercase

Parameters

batch – a string or a list containing strings at some level of nesting

Returns

the same structure where all strings are converted to lowercase
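
The documented recursion can be sketched directly (an illustrative re-implementation of the described behavior; note this sketch returns lists for any nested sequence):

```python
from typing import Union

def str_lower(batch: Union[str, list, tuple]):
    """Recursively lowercase all strings in an arbitrarily nested structure."""
    if isinstance(batch, str):
        return batch.lower()
    # Recurse into nested sequences (tuples come back as lists in this sketch)
    return [str_lower(item) for item in batch]
```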

class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)[source]

Component for converting strings to strings with reversed token positions

Parameters

tokenized – The parameter is only needed to reverse tokenized strings.

__call__(batch: Union[str, list, tuple]) Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]][source]

Recursively search for strings in a list and convert them to strings with reversed token positions

Parameters

batch – a string or a list containing strings

Returns

the same structure where the token positions in all strings are reversed
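
A minimal sketch of the described token reversal, assuming that with tokenized=True each sample is already a list of tokens (an approximation, not the library code):

```python
from typing import Union

def reverse_tokens(batch: Union[str, list], tokenized: bool = False):
    """Sketch: reverse token order in strings, recursing into nested lists."""
    if isinstance(batch, str):
        # Split on whitespace, reverse, and re-join
        return " ".join(reversed(batch.split()))
    if tokenized:
        # Each sample is already a list of tokens: reverse it directly
        return [list(reversed(sample)) for sample in batch]
    return [reverse_tokens(item) for item in batch]
```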

class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)[source]

Component for encoding all strings to UTF-8 byte codes

Parameters
  • max_word_length – Max length of words of input and output batches.

  • pad_special_char_use – Whether to use special char for padding or not.

  • word_boundary_special_char_use – Whether to add word boundaries by special chars or not.

  • sentence_boundary_special_char_use – Whether to add sentence boundaries by special chars or not.

  • reversed_sentense_tokens – Whether to use reversed sequences of tokens or not.

  • bos – Name of the special token marking the beginning of a sentence.

  • eos – Name of the special token marking the end of a sentence.

__call__(batch: Union[List[str], Tuple[str]]) Union[List[str], List[Union[List[str], List[StrUTF8EncoderInfo]]]][source]

Recursively search for strings in a list and encode them to UTF-8

Parameters

batch – a string or a list containing strings

Returns

the same structure where all strings are UTF-8 encoded
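
The core encoding step can be sketched as follows. This is an assumption-based simplification (UTF-8 byte codes per word, truncated to max_word_length) that omits the special boundary and padding characters; the function names are hypothetical:

```python
from typing import List

def utf8_encode_word(word: str, max_word_length: int = 50) -> List[int]:
    """Sketch: a word as its UTF-8 byte codes, truncated to max_word_length."""
    return list(word.encode("utf-8"))[:max_word_length]

def utf8_encode_batch(batch: List[List[str]], max_word_length: int = 50) -> List[List[List[int]]]:
    """Sketch: apply the word encoder to every token of every sample."""
    return [[utf8_encode_word(w, max_word_length) for w in sample] for sample in batch]
```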

class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = nltk.sent_tokenize, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, number_of_paragraphs: int = -1, *args, **kwargs)[source]

Make chunks from a document or a list of documents. Can keep sentences intact if needed.

Parameters
  • sentencize_fn – a function for sentence segmentation

  • keep_sentences – whether to keep sentences intact, i.e. avoid splitting a sentence between two chunks

  • tokens_limit – a number of tokens in a single chunk (usually this number corresponds to the squad model limit)

  • flatten_result – whether to flatten the resulting list of lists of chunks

  • paragraphs – whether to split the document by paragraphs; if set to True, tokens_limit is ignored

keep_sentences

whether to keep sentences intact, i.e. avoid splitting a sentence between two chunks

tokens_limit

a number of tokens in a single chunk

flatten_result

whether to flatten the resulting list of lists of chunks

paragraphs

whether to split the document by paragraphs; if set to True, tokens_limit is ignored

__call__(batch_docs: List[Union[str, List[str]]], batch_docs_ids: Optional[List[Union[str, List[str]]]] = None) Union[Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]], List[str], List[List[str]]][source]

Make chunks from a batch of documents. There can be several documents in each batch.

Parameters
  • batch_docs – a batch of documents / a batch of lists of documents

  • batch_docs_ids – (optional) a batch of document ids / a batch of lists of document ids

Returns

chunks of docs (flattened or not) and, if batch_docs_ids was passed, chunks of doc ids (flattened or not)
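
The sentence-preserving chunking described above can be sketched for a single document. This is an assumption-based sketch, not the library implementation; it uses a naive period-based splitter instead of nltk.sent_tokenize to stay dependency-free:

```python
from typing import Callable, List, Optional

def chunk_document(doc: str, tokens_limit: int = 400,
                   sentencize_fn: Optional[Callable] = None) -> List[str]:
    """Sketch: chunks of at most tokens_limit tokens without breaking sentences."""
    if sentencize_fn is None:
        # Naive stand-in for nltk.sent_tokenize
        sentencize_fn = lambda text: [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sent in sentencize_fn(doc):
        n_tokens = len(sent.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and current_len + n_tokens > tokens_limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```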

class deeppavlov.models.preprocessors.odqa_preprocessors.StringMultiplier(**kwargs)[source]

Make a list of strings from a provided string. The length of the resulting list equals the length of the provided reference argument.

__call__(batch_s: List[str], ref: List[str]) List[List[str]][source]

Multiply each string in a provided batch of strings.

Parameters
  • batch_s – a batch of strings to be multiplied

  • ref – a reference to obtain a length of the resulting list

Returns

a batch of lists, each containing the corresponding string repeated
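
The behavior can be sketched as follows, under the assumption that each string is repeated to match the length of its corresponding reference item (an illustrative sketch, not the library implementation):

```python
from typing import List, Sequence

def string_multiplier(batch_s: List[str], ref: List[Sequence]) -> List[List[str]]:
    """Sketch: repeat each string as many times as its reference item is long."""
    return [[s] * len(r) for s, r in zip(batch_s, ref)]
```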