deeppavlov.models.preprocessors

class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)

Implements preprocessing of English texts with a low level of literacy, such as comments

__call__(batch: List[str], **kwargs) → List[str]

Preprocess the given batch

Parameters

batch – list of text samples
**kwargs – additional arguments

Returns

list of preprocessed text samples

class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)

Takes a batch of tokens and returns masks of the corresponding length
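The masking idea can be illustrated with a minimal standalone sketch. It does not use DeepPavlov; the function name make_masks and the choice to pad every mask to the longest sample in the batch are assumptions made only for this illustration.

```python
from typing import List


def make_masks(tokens_batch: List[List[str]]) -> List[List[int]]:
    """Return one 0/1 mask per sample: 1 for each real token, 0 for padding.

    Assumption: masks are padded to the length of the longest sample.
    """
    max_len = max((len(tokens) for tokens in tokens_batch), default=0)
    return [
        [1] * len(tokens) + [0] * (max_len - len(tokens))
        for tokens in tokens_batch
    ]


masks = make_masks([["hello", "world"], ["hi"]])
```

Here the second sample has one token, so its mask ends with a zero that marks the padded position.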

class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)

One-hot featurizer with zero-padding. If single_vector is set, returns a single vector per sample, which may have several elements equal to 1.

Parameters

depth – the depth for one-hotting
pad_zeros – whether to pad elements of batch with zeros
single_vector – whether to return one vector per sample (the sum of its one-hot vectors)
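The effect of depth and single_vector can be sketched in plain Python. This is an illustration of the described behavior, not the library's code; the helper name one_hot is hypothetical, and pad_zeros is left out of the sketch.

```python
from typing import List, Union


def one_hot(indices: List[int], depth: int,
            single_vector: bool = False) -> Union[List[int], List[List[int]]]:
    """One-hot encode a sample given as a list of class indices."""
    vectors = [[1 if i == idx else 0 for i in range(depth)] for idx in indices]
    if single_vector:
        # Element-wise sum of the per-index one-hot vectors.
        summed = [0] * depth
        for vector in vectors:
            summed = [a + b for a, b in zip(summed, vector)]
        return summed
    return vectors


per_index = one_hot([0, 2], depth=3)
combined = one_hot([0, 2], depth=3, single_vector=True)
```

With single_vector=True the two one-hot vectors collapse into one vector with two elements equal to 1.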

class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)

Remove all combining characters, such as diacritical marks, from tokens

Parameters

diacritical – whether to remove diacritical signs (marks such as hats and stress marks)
nums – whether to replace all digits with 1
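A rough standalone sketch of this kind of sanitizing, using only the standard library, is shown below. The function name sanitize and the use of NFD decomposition before dropping combining characters are assumptions of the sketch, not a statement about how Sanitizer is implemented.

```python
import re
import unicodedata


def sanitize(token: str, diacritical: bool = True, nums: bool = False) -> str:
    """Strip combining marks and/or replace digits with 1."""
    if diacritical:
        # Decompose characters so that marks become separate code points,
        # then drop every code point with a non-zero combining class.
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if nums:
        token = re.sub(r"[0-9]", "1", token)
    return token


cleaned = sanitize("caf\u00e9")             # "café" with a precomposed e-acute
numbered = sanitize("room 42", nums=True)
```

Note that this approach also strips marks that are part of ordinary letters in some alphabets, so it is best suited to text where such marks are noise.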

deeppavlov.models.preprocessors.str_lower.str_lower(batch: Union[str, list, tuple])

Recursively search for strings in a list and convert them to lowercase

Parameters: batch – a string or a list containing strings at some level of nesting
Returns: the same structure where all strings are converted to lowercase
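The recursion can be sketched as follows. This is a standalone illustration rather than the library function; the name lower_nested is hypothetical, and preserving the container type (list or tuple) is an assumption of the sketch.

```python
from typing import Union


def lower_nested(batch: Union[str, list, tuple]) -> Union[str, list, tuple]:
    """Lowercase every string found at any nesting level."""
    if isinstance(batch, str):
        return batch.lower()
    # Rebuild the container with the same type, recursing into each item.
    return type(batch)(lower_nested(item) for item in batch)


lowered = lower_nested(["Hello", ["WORLD", ("Foo",)]])
```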

class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)

Component for converting strings to strings with reversed token positions

Parameters: tokenized – whether the input strings are already tokenized; needed only to reverse tokenized strings.

__call__(batch: Union[str, list, tuple]) → Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]]

Recursively search for strings in a list and convert them to strings with reversed token positions

Parameters: batch – a string or a list containing strings
Returns: the same structure where the tokens of every string are reversed
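For untokenized input, the reversal can be pictured with the sketch below. It is a standalone illustration; the function name reverse_token_order and whitespace splitting as the tokenization rule are assumptions, and the tokenized mode is not covered.

```python
from typing import Union


def reverse_token_order(batch: Union[str, list, tuple]) -> Union[str, list]:
    """Reverse the order of whitespace-separated tokens in every string."""
    if isinstance(batch, str):
        return " ".join(reversed(batch.split()))
    return [reverse_token_order(item) for item in batch]


reversed_batch = reverse_token_order(["a b c", ["hello world"]])
```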

class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)

Component for encoding all strings as UTF-8 codes

Parameters

max_word_length – Max length of words of input and output batches.
pad_special_char_use – Whether to use special char for padding or not.
word_boundary_special_char_use – Whether to add word boundaries by special chars or not.
sentence_boundary_special_char_use – Whether to add sentence boundaries by special chars or not.
reversed_sentense_tokens – Whether to use reversed sequences of tokens or not.
bos – Name of the special token marking the beginning of a sentence.
eos – Name of the special token marking the end of a sentence.

__call__(batch: Union[List[str], Tuple[str]]) → Union[List[str], List[Union[List[str], List[StrUTF8EncoderInfo]]]]

Recursively search for strings in a list and UTF-8 encode them

Parameters: batch – a string or a list containing strings
Returns: the same structure where all strings are UTF-8 encoded
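The general idea of turning words into fixed-length sequences of UTF-8 codes can be sketched as below. Everything here is an assumption made for illustration: the function names, the padding value PAD_CODE, whitespace tokenization, and truncation at max_word_length. The special boundary characters and the bos/eos tokens of the real component are not modeled.

```python
from typing import List

PAD_CODE = 0  # hypothetical padding value, chosen only for this sketch


def encode_word(word: str, max_word_length: int = 50) -> List[int]:
    """Encode one word as UTF-8 byte values, truncated and padded to a fixed length."""
    codes = list(word.encode("utf-8"))[:max_word_length]
    return codes + [PAD_CODE] * (max_word_length - len(codes))


def encode_batch(batch: List[str], max_word_length: int = 50) -> List[List[List[int]]]:
    """Encode every whitespace-separated word of every sentence in the batch."""
    return [
        [encode_word(word, max_word_length) for word in sentence.split()]
        for sentence in batch
    ]


encoded = encode_batch(["hi \u00e9"], max_word_length=3)
```

Non-ASCII characters occupy more than one code: the single character é becomes the two byte values 195 and 169.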

class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = nltk.sent_tokenize, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, number_of_paragraphs: int = -1, *args, **kwargs)

Make chunks from a document or a list of documents. Sentences can be kept intact if needed.

Parameters

sentencize_fn – a function for sentence segmentation
keep_sentences – whether to keep sentences intact instead of tearing them up between chunks
tokens_limit – a number of tokens in a single chunk (usually this number corresponds to the squad model limit)
flatten_result – whether to flatten the resulting list of lists of chunks
paragraphs – whether to split document by paragraphs; if set to True, tokens_limit is ignored

keep_sentences: whether to keep sentences intact instead of tearing them up between chunks

tokens_limit: a number of tokens in a single chunk

flatten_result: whether to flatten the resulting list of lists of chunks

paragraphs: whether to split document by paragraphs; if set to True, tokens_limit is ignored

__call__(batch_docs: List[Union[str, List[str]]], batch_docs_ids: Optional[List[Union[str, List[str]]]] = None) → Union[Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]], List[str], List[List[str]]]

Make chunks from a batch of documents. There can be several documents in each batch.

Parameters

batch_docs – a batch of documents / a batch of lists of documents
batch_docs_ids (optional) – a batch of documents ids / a batch of lists of documents ids

Returns: chunks of docs, flattened or not and chunks of docs ids, flattened or not if batch_docs_ids were passed
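Sentence-preserving chunking under a token limit can be sketched as follows. This is a simplified standalone illustration for a single document, not the component's code: the function names are hypothetical, a regex splitter stands in for nltk.sent_tokenize, tokens are counted by whitespace splitting, and a sentence longer than the limit is assumed to become its own oversized chunk.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a sentence segmentation function."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_document(doc: str, tokens_limit: int = 400) -> List[str]:
    """Group whole sentences into chunks of at most tokens_limit tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(doc):
        n_tokens = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_len + n_tokens > tokens_limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_document(
    "One two three. Four five. Six seven eight nine.", tokens_limit=5
)
```

The first two sentences fit within five tokens and share a chunk; the third would push the count to nine, so it opens a new chunk.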

class deeppavlov.models.preprocessors.odqa_preprocessors.StringMultiplier(**kwargs)

Make a list of strings from a provided string. The length of the resulting list equals the length of the provided reference argument.

__call__(batch_s: List[str], ref: List[str]) → List[List[str]]

Multiply each string in a provided batch of strings.

Parameters

batch_s – a batch of strings to be multiplied
ref – a reference whose length determines the length of the resulting list

Returns

a list of lists, each containing the corresponding string from batch_s repeated
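The multiplication can be pictured with a short standalone sketch. The function name multiply_strings is hypothetical, and it assumes that each string is repeated as many times as the length of the matching element of ref.

```python
from typing import List, Sequence


def multiply_strings(batch_s: List[str], ref: List[Sequence]) -> List[List[str]]:
    """Repeat each string so its list matches the length of the paired reference."""
    return [[s] * len(r) for s, r in zip(batch_s, ref)]


multiplied = multiply_strings(["q1", "q2"], [["c1", "c2", "c3"], ["c4"]])
```

This pattern is useful when one question has to be paired with several chunks of context, as in the example where q1 is repeated three times.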