deeppavlov.models.vectorizers

class deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer(tokenizer: deeppavlov.core.models.component.Component, hash_size=16777216, doc_index: Optional[dict] = None, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)[source]

Create a tfidf matrix of size [n_documents X n_features(hash_size)] from a collection of documents.

Parameters:
  • tokenizer – a tokenizer class
  • hash_size – a hash size, power of two
  • doc_index – a dictionary of document ids and their titles
  • save_path – a path to the .npz file where the tfidf matrix is saved
  • load_path – a path to the .npz file from which the tfidf matrix is loaded
hash_size

a hash size

tokenizer

instance of a tokenizer class

term_freqs

a dictionary with tfidf terms and their frequencies

doc_index

document ids provided by the user or generated automatically

rows

tfidf matrix rows corresponding to terms

cols

tfidf matrix cols corresponding to docs

data

tfidf matrix data corresponding to tfidf values
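The role of hash_size can be illustrated with a minimal sketch of feature hashing: each token is mapped to one of hash_size buckets, so the number of features is fixed in advance and no vocabulary has to be stored. The helper names hash_token and hashed_counts and the use of md5 are assumptions made for illustration only; the hash function the class actually uses is not stated here.

```python
import hashlib
from collections import Counter
from typing import Iterable


def hash_token(token: str, hash_size: int = 2 ** 24) -> int:
    """Map a token to a bucket in [0, hash_size).

    md5 is an illustrative choice, not necessarily what the library uses.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_size


def hashed_counts(tokens: Iterable[str], hash_size: int = 2 ** 24) -> Counter:
    """Count how many tokens fall into each hash bucket."""
    return Counter(hash_token(token, hash_size) for token in tokens)


# 2 ** 24 equals 16777216, the default hash_size in the signature above.
counts = hashed_counts(["deep", "pavlov", "deep"])
```

Different tokens can land in the same bucket; a larger hash_size makes such collisions less likely at the cost of a wider matrix.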

__call__(questions: List[str]) → scipy.sparse.csr.csr_matrix[source]

Transform input list of documents to tfidf vectors.

Parameters: questions – a list of input strings
Returns: transformed documents as a csr_matrix with shape [n_documents X hash_size]
fit_batch(docs: List[str], doc_ids: List[Any]) → None[source]

Fit batch of documents.

Parameters:
  • docs – a list of input documents
  • doc_ids – a list of document ids corresponding to input documents
Returns:

None

fit_batches(iterator: deeppavlov.core.data.data_fitting_iterator.DataFittingIterator, batch_size: int) → None[source]

Generate a batch to be fit to a vectorizer.

Parameters:
  • iterator – an instance of an iterator class
  • batch_size – a size of a generated batch
Returns:

None

get_count_matrix(row: List[int], col: List[int], data: List[int], size: int) → scipy.sparse.csr.csr_matrix[source]

Get count matrix.

Parameters:
  • row – tfidf matrix rows corresponding to terms
  • col – tfidf matrix cols corresponding to docs
  • data – tfidf matrix data corresponding to tfidf values
  • size – doc_index size
Returns:

a count csr_matrix
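How the row, col and data lists combine into a count matrix can be sketched in pure Python. The function sketch_count_matrix below is a hypothetical stand-in that stores entries in a dict keyed by (row, col) instead of a scipy csr_matrix. It sums values that share the same coordinates, which mirrors what scipy does when a csr_matrix is built from (data, (row, col)) triplets with duplicates.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sketch_count_matrix(row: List[int], col: List[int], data: List[int],
                        size: int, hash_size: int = 2 ** 24) -> Dict[Tuple[int, int], int]:
    """Assemble (row, col, value) triplets into a sparse dict-of-keys matrix.

    Rows index hashed terms and columns index documents, following the
    parameter descriptions above. This is an illustration, not the library code.
    """
    if not (len(row) == len(col) == len(data)):
        raise ValueError("row, col and data must have the same length")
    matrix: Dict[Tuple[int, int], int] = defaultdict(int)
    for r, c, value in zip(row, col, data):
        if not (0 <= r < hash_size and 0 <= c < size):
            raise IndexError(f"entry ({r}, {c}) is outside the matrix")
        matrix[(r, c)] += value  # duplicate coordinates are summed
    return dict(matrix)


counts = sketch_count_matrix(row=[5, 5, 9], col=[0, 0, 1], data=[1, 2, 4], size=2)
```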

get_counts(docs: List[str], doc_ids: List[Any]) → Generator[[Tuple[KeysView, ValuesView, List[int]], Any], None][source]

Get term counts for a list of documents.

Parameters:
  • docs – a list of input documents
  • doc_ids – a list of document ids corresponding to input documents
Yields:

a tuple of term hashes, count values and column ids

Returns:

None
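The yielded triple can be sketched as follows, one triple per document. Everything here is an assumption for illustration: sketch_get_counts is a hypothetical name, tokenization is a plain whitespace split rather than the configured tokenizer, md5 stands in for the real hash function, and doc_index is taken to map a document id to its column number.

```python
import hashlib
from collections import Counter
from typing import Any, Dict, Iterator, List, Tuple


def _bucket(token: str, hash_size: int) -> int:
    """Hash a token into [0, hash_size); md5 is an illustrative choice."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_size


def sketch_get_counts(docs: List[str], doc_ids: List[Any], doc_index: Dict[Any, int],
                      hash_size: int = 2 ** 24) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (term hashes, count values, column ids) for each document."""
    for doc, doc_id in zip(docs, doc_ids):
        tokens = doc.lower().split()  # stand-in for the real tokenizer
        counts = Counter(_bucket(token, hash_size) for token in tokens)
        # Every entry of this document goes into the same column.
        col = [doc_index[doc_id]] * len(counts)
        yield list(counts.keys()), list(counts.values()), col


results = list(sketch_get_counts(["a b a", "c"], ["d1", "d2"], {"d1": 0, "d2": 1}))
```

Concatenating the three lists over all documents gives row, col and data lists in the layout described for get_count_matrix.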

get_index2doc() → Dict[Any, int][source]

Invert doc_index.

Returns: inverted doc_index dict
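Inverting a dictionary amounts to swapping its keys and values. A minimal sketch, assuming the values are unique and hashable; the sample ids below are made up:

```python
# Hypothetical doc_index mapping document ids to positions.
doc_index = {"doc_a": 0, "doc_b": 1, "doc_c": 2}

# Swap keys and values; this only round-trips if the values are unique.
index2doc = {value: key for key, value in doc_index.items()}
```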
static get_tfidf_matrix(count_matrix: scipy.sparse.csr.csr_matrix) → Tuple[scipy.sparse.csr.csr_matrix, numpy.core.multiarray.array][source]

Convert a count matrix into a tfidf matrix.

Parameters: count_matrix – a count matrix
Returns: a tuple of tfidf matrix and term frequencies
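The conversion from counts to tf-idf weights can be sketched with a textbook formula. The exact weighting used by this method is not stated here, so the variant below (tf = log(1 + count), idf = log(n_docs / document frequency)) is an assumption, as are the name sketch_tfidf and the dict-of-keys matrix layout.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def sketch_tfidf(counts: Dict[Tuple[int, int], int],
                 n_docs: int) -> Tuple[Dict[Tuple[int, int], float], Dict[int, int]]:
    """Turn {(term, doc): count} into {(term, doc): tf-idf} plus document frequencies."""
    # Number of documents in which each term occurs at least once.
    doc_freqs = Counter(term for (term, _doc) in counts)
    tfidf = {}
    for (term, doc), count in counts.items():
        tf = math.log1p(count)                     # dampened term frequency
        idf = math.log(n_docs / doc_freqs[term])   # rarer terms weigh more
        tfidf[(term, doc)] = tf * idf
    return tfidf, dict(doc_freqs)


# Term 7 occurs in both documents, term 9 only in document 0.
counts = {(7, 0): 2, (7, 1): 1, (9, 0): 1}
tfidf, doc_freqs = sketch_tfidf(counts, n_docs=2)
```

With this variant a term that occurs in every document gets an idf of zero and so contributes nothing to any document vector.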
load() → Tuple[scipy.sparse.csr.csr_matrix, Dict][source]

Load a tfidf matrix as csr_matrix.

Returns: a tuple of tfidf matrix and csr data.
Raises: FileNotFoundError if load_path doesn’t exist.
reset() → None[source]

Clear rows, cols and data

Returns: None
save() → None[source]

Save tfidf matrix into .npz format.

Returns: None
class deeppavlov.models.vectorizers.tfidf_vectorizer.TfIdfVectorizer(save_path: str = None, load_path: str = None, **kwargs)[source]

Sentence vectorizer which produces a sparse vector with TF-IDF values for each word in a sentence

Parameters:
  • save_path – path to save the model
  • load_path – path to load the model
Returns:

None

__call__(questions: List[str]) → scipy.sparse.csr.csr_matrix[source]

Vectorize sentence into TF-IDF values

Parameters: questions – list of sentences
Returns: vectorized sentences as a sparse csr_matrix
fit(x_train: List[str]) → None[source]

Train TF-IDF vectorizer

Parameters: x_train – list of sentences for training
Returns: None
load() → None[source]

Load TF-IDF vectorizer

save() → None[source]

Save TF-IDF vectorizer

class deeppavlov.models.vectorizers.sentence2vector_w2v_tfidf.SentenceW2vVectorizerTfidfWeights(save_path: str = None, load_path: str = None, **kwargs)[source]

Sentence vectorizer which produces one vector as a tf-idf weighted sum of the word vectors in a sentence

Parameters:
  • save_path – path to save the model
  • load_path – path to load the model
Returns:

None

__call__(questions: List[str], tokens_fasttext_vectors: List) → List[source]

Vectorize list of sentences

Parameters:
  • questions – list of questions/sentences
  • tokens_fasttext_vectors – fasttext vectors for sentences
Returns:

List of vectorized sentences
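The weighting idea can be sketched for a single sentence as follows. The function weighted_sentence_vector, the weights dictionary and the default_weight fallback are hypothetical; whether the class normalizes the result or how it treats unseen tokens is not stated here.

```python
from typing import Dict, List, Sequence


def weighted_sentence_vector(tokens: Sequence[str],
                             vectors: Sequence[Sequence[float]],
                             weights: Dict[str, float],
                             default_weight: float = 1.0) -> List[float]:
    """Sum the word vectors of one sentence, each scaled by its token's tf-idf weight."""
    if not vectors:
        return []
    result = [0.0] * len(vectors[0])
    for token, vector in zip(tokens, vectors):
        # Tokens without a learned weight fall back to default_weight (an assumption).
        weight = weights.get(token, default_weight)
        for i, value in enumerate(vector):
            result[i] += weight * value
    return result


sentence_vector = weighted_sentence_vector(
    tokens=["a", "b"],
    vectors=[[1.0, 0.0], [0.0, 1.0]],
    weights={"a": 2.0, "b": 0.5},
)
```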

fit(x_train: List) → None[source]

Train tf-idf weights

Parameters: x_train – training sentences
Returns: None
load() → None[source]

Load model

save() → None[source]

Save model

class deeppavlov.models.vectorizers.sentence2vector_w2v_avg.SentenceAvgW2vVectorizer(**kwargs)[source]

Sentence vectorizer which produces one vector as the average of the word vectors in a sentence

__call__(questions: List[str], tokens_fasttext_vectors: List) → List[source]

Vectorize list of sentences

Parameters:
  • questions – list of questions/sentences
  • tokens_fasttext_vectors – fasttext vectors for sentences
Returns:

List of vectorized sentences
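Averaging word vectors for one sentence can be sketched as below. The name average_sentence_vector is hypothetical, and returning an empty list for an empty sentence is an assumption about edge-case handling rather than documented behavior.

```python
from typing import List, Sequence


def average_sentence_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of the word vectors of one sentence."""
    if not vectors:
        return []  # assumption: an empty sentence yields an empty vector
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


sentence_vector = average_sentence_vector([[1.0, 2.0], [3.0, 4.0]])
```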