
class deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer(tokenizer: deeppavlov.core.models.component.Component, hash_size=16777216, doc_index: Optional[dict] = None, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)[source]

Create a tfidf matrix of size [n_documents X n_features(hash_size)] from a collection of documents.

  • tokenizer – a tokenizer class
  • hash_size – a hash size, power of two
  • doc_index – a dictionary of document ids and their titles
  • save_path – a path to .npz file where tfidf matrix is saved
  • load_path – a path to .npz file where tfidf matrix is loaded from

Attributes:

  • hash_size – a hash size
  • tokenizer – an instance of a tokenizer class
  • term_freqs – a dictionary with tfidf terms and their frequencies
  • doc_index – document ids, either provided by the user or generated automatically
  • rows – tfidf matrix rows corresponding to terms
  • cols – tfidf matrix cols corresponding to docs
  • data – tfidf matrix data corresponding to tfidf values
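
The role of hash_size can be illustrated with a small sketch of the hashing trick. The hash function the class actually uses is not stated here, so zlib.crc32 is an assumption chosen only because it is deterministic; hash_token is a hypothetical helper, not part of the library.

```python
import zlib


def hash_token(token: str, hash_size: int = 2 ** 24) -> int:
    """Map a token to a bucket index in the range [0, hash_size)."""
    # Assumption: crc32 stands in for whatever hash the vectorizer really uses.
    digest = zlib.crc32(token.encode("utf-8"))
    # When hash_size is a power of two, masking with (hash_size - 1)
    # gives the same result as digest % hash_size.
    return digest & (hash_size - 1)


# Identical tokens always land in the same bucket; different tokens may collide.
buckets = [hash_token(t, hash_size=16) for t in ["deep", "pavlov", "deep"]]
```

A larger hash_size lowers the chance that two different terms share a bucket, at the cost of a wider feature space.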

__call__(questions: List[str]) → scipy.sparse.csr.csr_matrix[source]

Transform input list of documents to tfidf vectors.

Parameters: questions – a list of input strings
Returns: transformed documents as a csr_matrix with shape [n_documents X hash_size]

fit_batch(docs: List[str], doc_ids: List[Any]) → None[source]

Fit batch of documents.

  • docs – a list of input documents
  • doc_ids – a list of document ids corresponding to input documents


fit_batches(iterator:, batch_size: int) → None[source]

Generate batches of the given size from the iterator and fit the vectorizer on them.

  • iterator – an instance of an iterator class
  • batch_size – a size of a generated batch
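
As a rough illustration of batch generation, the sketch below chunks a stream of (document, id) pairs into batches that could then be passed to fit_batch. The real iterator interface is not described here, so gen_batches and the pair layout are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def gen_batches(pairs: Iterable[Tuple[str, int]],
                batch_size: int) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (docs, doc_ids) batches of at most batch_size items each."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Split the list of pairs into two parallel lists.
        docs, doc_ids = zip(*chunk)
        yield list(docs), list(doc_ids)


batches = list(gen_batches([("a", 0), ("b", 1), ("c", 2)], batch_size=2))
```

The last batch may be shorter than batch_size when the input does not divide evenly.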


get_count_matrix(row: List[int], col: List[int], data: List[int], size: int) → scipy.sparse.csr.csr_matrix[source]

Get count matrix.

  • row – tfidf matrix rows corresponding to terms
  • col – tfidf matrix cols corresponding to docs
  • data – tfidf matrix data corresponding to tfidf values
  • size – doc_index size

Returns: a count csr_matrix
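
To make the row/col/data layout concrete, here is a pure-Python sketch that turns such triplets into CSR arrays. It assumes rows index terms and columns index documents, as the parameter descriptions suggest. SciPy's csr_matrix sums duplicate (row, col) entries when built from triplets, and the sketch mirrors that. build_count_matrix is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_count_matrix(row: List[int], col: List[int], data: List[int],
                       n_rows: int, n_cols: int
                       ) -> Tuple[List[int], List[int], List[int]]:
    """Return CSR arrays (indptr, indices, values) for an n_rows x n_cols matrix."""
    cells: Dict[Tuple[int, int], int] = defaultdict(int)
    for r, c, v in zip(row, col, data):
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            raise ValueError("index out of bounds")
        cells[(r, c)] += v  # duplicate coordinates are summed

    indptr = [0] * (n_rows + 1)
    indices: List[int] = []
    values: List[int] = []
    for r, c in sorted(cells):  # row-major order, as CSR requires
        indptr[r + 1] += 1
        indices.append(c)
        values.append(cells[(r, c)])
    for r in range(n_rows):  # turn per-row counts into offsets
        indptr[r + 1] += indptr[r]
    return indptr, indices, values


indptr, indices, values = build_count_matrix(
    row=[0, 2, 2, 0], col=[0, 0, 1, 0], data=[1, 3, 2, 4], n_rows=3, n_cols=2)
```

Row r of the matrix is stored in values[indptr[r]:indptr[r + 1]], with matching column ids in indices.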

get_counts(docs: List[str], doc_ids: List[Any]) → Generator[[Tuple[KeysView, ValuesView, List[int]], Any], None][source]

Get term counts for a list of documents.

  • docs – a list of input documents
  • doc_ids – a list of document ids corresponding to input documents

Yields: a tuple of term hashes, count values and column ids
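
A minimal sketch of what such a generator might look like follows. The whitespace tokenizer, the crc32 hash and the use of each doc id as the column id are all assumptions made for illustration; term_counts is a hypothetical function.

```python
import zlib
from collections import Counter
from typing import Iterator, List, Tuple


def term_counts(docs: List[str], doc_ids: List[int], hash_size: int = 16
                ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (term hashes, counts, column ids) for each document."""
    for doc, doc_id in zip(docs, doc_ids):
        tokens = doc.lower().split()  # stand-in for the configured tokenizer
        counts = Counter(zlib.crc32(t.encode("utf-8")) % hash_size for t in tokens)
        hashes = list(counts.keys())
        values = list(counts.values())
        # Every entry of one document belongs to the same column.
        col_ids = [doc_id] * len(hashes)
        yield hashes, values, col_ids


result = list(term_counts(["a b a", "c"], [0, 1]))
```

The three lists are parallel, so they can be appended directly to row, col and data style accumulators.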



get_index2doc() → Dict[Any, int][source]

Invert doc_index.

Returns: inverted doc_index dict

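
Inverting the index amounts to swapping keys and values. The example assumes doc_index maps document ids to integer positions; the ids shown are made up.

```python
# Hypothetical doc_index: document id -> position in the matrix.
doc_index = {"intro.txt": 0, "usage.txt": 1}

# Swap keys and values so a position can be mapped back to its document id.
index2doc = {index: doc_id for doc_id, index in doc_index.items()}
```

The inversion is only lossless when the positions are unique, which holds when each document receives its own index.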
static get_tfidf_matrix(count_matrix: scipy.sparse.csr.csr_matrix) → Tuple[scipy.sparse.csr.csr_matrix, numpy.core.multiarray.array][source]

Convert a count matrix into a tfidf matrix.

Parameters: count_matrix – a count matrix
Returns: a tuple of tfidf matrix and term frequencies

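
The weighting formula is not given here, so the sketch below uses one common variant as an assumption: tf = log(1 + count) and idf = log((N - n_t + 0.5) / (n_t + 0.5)), clipped at zero, where n_t is the number of documents containing term t. The dict-based sparse layout and the name tfidf_from_counts are hypothetical.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (term row, document column)


def tfidf_from_counts(counts: Dict[Cell, int], n_docs: int
                      ) -> Tuple[Dict[Cell, float], Dict[int, int]]:
    """Return (tfidf values, per-term document frequencies)."""
    doc_freqs: Dict[int, int] = {}
    for (term, _doc), count in counts.items():
        if count > 0:
            doc_freqs[term] = doc_freqs.get(term, 0) + 1

    tfidf: Dict[Cell, float] = {}
    for (term, doc), count in counts.items():
        if count <= 0:
            continue
        n_t = doc_freqs[term]
        # Assumed idf variant; terms found in most documents get weight 0.
        idf = max(0.0, math.log((n_docs - n_t + 0.5) / (n_t + 0.5)))
        tfidf[(term, doc)] = math.log1p(count) * idf
    return tfidf, doc_freqs


counts = {(0, 0): 2, (0, 1): 1, (0, 2): 1, (1, 0): 1}
tfidf, doc_freqs = tfidf_from_counts(counts, n_docs=3)
```

In this example term 0 occurs in every document and is weighted to zero, while term 1 occurs in a single document and keeps a positive weight.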
load() → Tuple[scipy.sparse.csr.csr_matrix, Dict][source]

Load a tfidf matrix as csr_matrix.

Returns: a tuple of tfidf matrix and csr data.
Raises: FileNotFoundError if load_path doesn’t exist.

reset() → None[source]

Clear rows, cols and data

save() → None[source]

Save tfidf matrix into .npz format.

class deeppavlov.models.vectorizers.word_vectorizer.DictionaryVectorizer(save_path: str, load_path: Union[str, List[str]], min_freq: int = 1, unk_token: str = None, **kwargs)[source]

Transforms each word into a 0-1 vector of its possible tags, read from a vocabulary file. Each line of the vocabulary must have the format word<TAB>tag_1<SPACE>…<SPACE>tag_k.

  • save_path – path to save the vocabulary
  • load_path – path to the vocabulary(-ies)
  • min_freq – minimal frequency of a tag for it to be memorized
  • unk_token – unknown token to be yielded for unknown words

__call__(data: List) → numpy.ndarray

Transforms words to one-hot encoding according to the dictionary.

Parameters: data – the batch of words
Returns: a 3D array. answer[i][j][k] = 1 iff data[i][j] is the k-th word in the dictionary.

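
The vocabulary format and the resulting 0-1 vectors can be sketched as follows. Following the class description, index k is taken here to address a tag. The helpers parse_vocabulary and vectorize are hypothetical, and nested lists stand in for the numpy array.

```python
from typing import Dict, Iterable, List, Set


def parse_vocabulary(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Read lines of the form word<TAB>tag_1<SPACE>...<SPACE>tag_k."""
    word_tags: Dict[str, Set[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, tags = line.split("\t", 1)
        word_tags.setdefault(word, set()).update(tags.split(" "))
    return word_tags


def vectorize(batch: List[List[str]], word_tags: Dict[str, Set[str]],
              tags: List[str]) -> List[List[List[int]]]:
    """Return answer[i][j][k] = 1 iff word j of sentence i can carry tags[k]."""
    tag_index = {tag: k for k, tag in enumerate(tags)}
    answer = []
    for sentence in batch:
        rows = []
        for word in sentence:
            vec = [0] * len(tags)
            for tag in word_tags.get(word, ()):  # unknown words stay all-zero
                if tag in tag_index:
                    vec[tag_index[tag]] = 1
            rows.append(vec)
        answer.append(rows)
    return answer


vocab = parse_vocabulary(["run\tNOUN VERB", "cat\tNOUN"])
answer = vectorize([["cat", "run", "xyz"]], vocab, tags=["NOUN", "VERB"])
```

The sketch does not pad sentences of different lengths, which a real 3D array would require.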
load() → None[source]

Loads the dictionary from self.load_path

save() → None[source]

Saves the dictionary to self.save_path

class deeppavlov.models.vectorizers.word_vectorizer.PymorphyVectorizer(save_path: str, load_path: str, max_pymorphy_variants: int = -1, **kwargs)[source]

Transforms each Russian word into a 0-1 vector of its possible Universal Dependencies tags. Tags are obtained using the Pymorphy analyzer and transformed to UD2.0 format using the russian-tagsets library. All UD2.0 tags that are compatible with the produced tags are memorized. The list of possible Universal Dependencies tags is read from a file that contains all the labels occurring in the UD2.0 SynTagRus dataset.

  • save_path – path to save the tags list
  • load_path – path to load the list of tags
  • max_pymorphy_variants – maximal number of pymorphy parses to be used. If -1, all parses are used.

__call__(data: List) → numpy.ndarray

Transforms words to one-hot encoding according to the dictionary.

Parameters: data – the batch of words
Returns: a 3D array. answer[i][j][k] = 1 iff data[i][j] is the k-th word in the dictionary.

find_compatible(tag: str) → List[int][source]

Transforms a Pymorphy tag to a list of indexes of compatible UD tags.

Parameters: tag – input Pymorphy tag
Returns: indexes of compatible UD tags

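
The exact compatibility rule is not defined here. The sketch below assumes one plausible reading: the input has already been converted to a UD-style string of the form POS,Feat=Val|Feat=Val, and a known tag is compatible when the part of speech matches and none of its features contradicts the input. The tag layout and both helper names are assumptions.

```python
from typing import Dict, List, Tuple


def parse_tag(tag: str) -> Tuple[str, Dict[str, str]]:
    """Split 'POS,Feat=Val|Feat=Val' into the POS and a feature dict."""
    pos, _, feats = tag.partition(",")
    pairs = dict(f.split("=", 1) for f in feats.split("|") if f)
    return pos, pairs


def find_compatible_indexes(tag: str, known_tags: List[str]) -> List[int]:
    """Return indexes of known tags that do not contradict the given tag."""
    pos, feats = parse_tag(tag)
    result = []
    for i, candidate in enumerate(known_tags):
        cand_pos, cand_feats = parse_tag(candidate)
        if cand_pos != pos:
            continue
        # A feature absent from the input is treated as unconstrained.
        if all(feats.get(name, value) == value
               for name, value in cand_feats.items()):
            result.append(i)
    return result


known = ["NOUN,Case=Nom|Number=Sing", "NOUN,Case=Gen|Number=Sing",
         "VERB,Number=Sing", "NOUN"]
exact = find_compatible_indexes("NOUN,Number=Sing|Case=Nom", known)
loose = find_compatible_indexes("NOUN,Number=Sing", known)
```

A less specific input matches more known tags, which is why several positions of the output vector can be set for a single parse.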
load() → None[source]

Loads the dictionary from self.load_path

save() → None[source]

Saves the dictionary to self.save_path