deeppavlov.models.embedders

class deeppavlov.models.embedders.bow_embedder.BoWEmbedder(depth: int, with_counts: bool = False, **kwargs)[source]

Performs one-hot encoding of tokens based on a pre-built vocabulary of tokens.

Parameters:
  • depth – size of output numpy vector.
  • with_counts – flag that denotes whether to use binary encoding (zeros and ones) or to use token counts as the representation.

Example

>>> bow = BoWEmbedder(depth=3)

>>> bow([[0, 1], [1], []])
[array([1, 1, 0], dtype=int32),
 array([0, 1, 0], dtype=int32),
 array([0, 0, 0], dtype=int32)]
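
With with_counts=True, repeated token indices are counted instead of being clipped to one. A minimal sketch (the expected output is shown for illustration and assumes the same int32 dtype as above):

>>> bow = BoWEmbedder(depth=3, with_counts=True)
>>> bow([[1, 1, 2]])
[array([0, 2, 1], dtype=int32)]
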
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements the fastText embedding model

Parameters:
  • load_path – path to load the pre-trained embedding model from
  • pad_zero – whether to pad samples or not

Attributes:
  • model – fastText model instance
  • tok2emb – dictionary with already embedded tokens
  • dim – dimension of embeddings
  • pad_zero – whether to pad sequences of tokens with zeros or not
  • load_path – path to the pre-trained fastText binary model

__call__(batch: List[List[str]], mean: bool = None) → List[Union[list, numpy.ndarray]]

Embed sentences from batch

Parameters:
  • batch – list of tokenized text samples
  • mean – whether to return mean embedding of tokens per sample
Returns:

embedded batch
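
A minimal usage sketch (the model path is a hypothetical placeholder; a real pre-trained fastText binary is required):

>>> fasttext = FasttextEmbedder(load_path="/path/to/fasttext_model.bin")  # hypothetical path
>>> vectors = fasttext([["hello", "world"], ["hello"]], mean=True)  # one vector of dimension `dim` per sample
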

__iter__() → Iterator[str][source]

Iterate over all words from fastText model vocabulary

Returns: iterator
class deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder(spec: str, elmo_output_names: Optional[List] = None, dim: Optional[int] = None, pad_zero: bool = False, concat_last_axis: bool = True, max_token: Optional[int] = None, mini_batch_size: int = 32, **kwargs)[source]

ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See the paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

Parameters:
  • spec – A ModuleSpec defining the Module to instantiate, or a path to load a ModuleSpec from via tensorflow_hub.load_module_spec using TensorFlow Hub.
  • elmo_output_names – A list of ELMo output names. You can use any combination of ["word_emb", "lstm_outputs1", "lstm_outputs2", "elmo"], or use ["default"] on its own. See TensorFlow Hub for more information.
  • dim – Dimensionality of the output token embeddings of the ELMo model.
  • pad_zero – Whether to pad samples with zeros or not.
  • concat_last_axis – A boolean that enables/disables concatenation along the last axis. It is not used for elmo_output_names = ["default"].
  • max_token – The maximum number of tokens per batch line.
  • mini_batch_size – Size of the mini-batches; used to reduce the memory requirements of the device.

Examples

You can use ELMo models from DeepPavlov as a usual TensorFlow Hub Module.

>>> import tensorflow_hub as hub
>>> elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
                      trainable=True)
>>> embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]

You can also embed tokenized sentences.

>>> tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
>>> tokens_length = [3, 1]
>>> embeddings = elmo(
        inputs={
                "tokens": tokens_input,
                "sequence_len": tokens_length
                },
        signature="tokens",
        as_dict=True)["elmo"]

You can also get a hub.text_embedding_column, as described in the TensorFlow Hub documentation.

__call__(batch: List[List[str]], *args, **kwargs) → Union[List[numpy.ndarray], numpy.ndarray][source]

Embed sentences from a batch.

Parameters: batch – A list of tokenized text samples.
Returns: A batch of ELMo embeddings.
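
A sketch of using the class directly (the output names are illustrative; the spec URL is the one used in the examples above):

>>> elmo_embedder = ELMoEmbedder(spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
                                 elmo_output_names=["elmo"])
>>> vectors = elmo_embedder([["мама", "мыла", "раму"], ["рама"]])
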
__iter__() → Iterator[source]

Iterate over all words from the ELMo model vocabulary. The ELMo model vocabulary consists of ['<S>', '</S>', '<UNK>'].

Returns: An iterator over the three elements ['<S>', '</S>', '<UNK>'].
class deeppavlov.models.embedders.glove_embedder.GloVeEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements the GloVe embedding model

Parameters:
  • load_path – path to load the pre-trained embedding model from
  • pad_zero – whether to pad samples or not

Attributes:
  • model – GloVe model instance
  • tok2emb – dictionary with already embedded tokens
  • dim – dimension of embeddings
  • pad_zero – whether to pad sequences of tokens with zeros or not
  • load_path – path to the pre-trained GloVe model

__call__(batch: List[List[str]], mean: bool = None) → List[Union[list, numpy.ndarray]]

Embed sentences from batch

Parameters:
  • batch – list of tokenized text samples
  • mean – whether to return mean embedding of tokens per sample
Returns:

embedded batch
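
A minimal usage sketch (the model path is a hypothetical placeholder; a real pre-trained GloVe file is required):

>>> glove = GloVeEmbedder(load_path="/path/to/glove_model.txt")  # hypothetical path
>>> vectors = glove([["hello", "world"], ["hello"]], mean=True)  # one vector of dimension `dim` per sample
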

__iter__() → Iterator[str][source]

Iterate over all words from GloVe model vocabulary

Returns: iterator
class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: deeppavlov.core.models.component.Component, tokenizer: deeppavlov.core.models.component.Component = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: str = None, vectorizer: deeppavlov.core.models.component.Component = None, counter_vocab_path: str = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)[source]

The class embeds a sentence as an average of its token embeddings weighted by special coefficients (see the sketch after the parameter list). The coefficients can be taken from the TF-IDF vectorizer given in vectorizer, or calculated as TF-IDF from the counter vocabulary given in counter_vocab_path.

One can also pass tags_vocab_path, a path to a vocabulary with tag weights; in this case a batch of tags should be given as the second input to the __call__ method.

Parameters:
  • embedder – embedder instance
  • tokenizer – tokenizer instance; should be able to detokenize a sentence
  • pad_zero – whether to pad samples or not
  • mean – whether to return the mean token embedding
  • tags_vocab_path – optional path to a vocabulary with tag weights
  • vectorizer – vectorizer instance; should be trained with analyzer="word"
  • counter_vocab_path – path to the counter vocabulary
  • idf_base_count – minimal idf value (tokens occurring fewer times are not counted)
  • log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  • min_idf_weight – minimal idf weight
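
The weighted averaging described above can be sketched as follows (illustrative only; the library's weight lookup and normalization details may differ):

>>> import numpy as np
>>> token_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # embeddings of three tokens
>>> weights = np.array([0.2, 0.5, 0.3])                             # e.g. TF-IDF coefficients of those tokens
>>> sentence_vector = (weights[:, None] * token_vectors).sum(axis=0) / weights.sum()
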

Attributes:
  • embedder – embedder instance
  • tokenizer – tokenizer instance; should be able to detokenize a sentence
  • dim – dimension of embeddings
  • pad_zero – whether to pad samples or not
  • mean – whether to return the mean token embedding
  • tags_vocab – vocabulary with weights for tags
  • vectorizer – vectorizer instance
  • counter_vocab_path – path to the counter vocabulary
  • counter_vocab – counter vocabulary
  • idf_base_count – minimal idf value (tokens occurring fewer times are not counted)
  • log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
  • min_idf_weight – minimal idf weight

__call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: bool = None, *args, **kwargs) → List[Union[list, numpy.ndarray]][source]

Infer on the given data

Parameters:
  • batch – tokenized text samples
  • tags_batch – optional batch of corresponding tags
  • mean – whether to return mean token embedding (does not depend on self.mean)
  • *args – additional arguments
  • **kwargs – additional arguments

Returns:

embedded batch
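
A composition sketch (the embedder and vectorizer variables are illustrative placeholders for already constructed components; the vectorizer must be trained with analyzer="word"):

>>> weighted = TfidfWeightedEmbedder(embedder=fasttext_embedder, vectorizer=tfidf_vectorizer, mean=True)
>>> vectors = weighted([["мама", "мыла", "раму"]])
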