deeppavlov.models.embedders

class deeppavlov.models.embedders.bow_embedder.BoWEmbedder(depth: int, with_counts: bool = False, **kwargs)[source]

Performs one-hot encoding of tokens based on a pre-built token vocabulary.

Parameters
  • depth – size of output numpy vector.

  • with_counts – flag denoting whether to use binary encoding (zeros and ones) or token counts as the representation.

Example

>>> bow = BoWEmbedder(depth=3)

>>> bow([[0, 1], [1], []])
[array([1, 1, 0], dtype=int32),
 array([0, 1, 0], dtype=int32),
 array([0, 0, 0], dtype=int32)]
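
With with_counts=True the vectors hold token counts instead of binary indicators (a sketch assuming the same call convention as above):

>>> bow_counts = BoWEmbedder(depth=3, with_counts=True)

>>> bow_counts([[0, 1, 1], [2, 2, 2]])
[array([1, 2, 0], dtype=int32),
 array([0, 0, 3], dtype=int32)]
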
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements fastText embedding model

Parameters
  • load_path – path where to load pre-trained embedding model from

  • pad_zero – whether to pad samples or not

model

fastText model instance

tok2emb

dictionary with already embedded tokens

dim

dimension of embeddings

pad_zero

whether to pad sequence of tokens with zeros or not

load_path

path with pre-trained fastText binary model

__call__(batch: List[List[str]], mean: Optional[bool] = None) → List[Union[list, numpy.ndarray]]

Embed sentences from batch

Parameters
  • batch – list of tokenized text samples

  • mean – whether to return mean embedding of tokens per sample

Returns

embedded batch

__iter__() → Iterator[str][source]

Iterate over all words from fastText model vocabulary

Returns

iterator
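
Example

A minimal usage sketch (the binary model path below is a placeholder):

>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext = FasttextEmbedder(load_path='/data/embeddings/wiki.ru.bin')
>>> token_vectors = fasttext([['мама', 'мыла', 'раму']])               # one vector per token
>>> sample_vectors = fasttext([['мама', 'мыла', 'раму']], mean=True)   # one mean vector per sample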

class deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder(*args, **kwargs)[source]

ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See a paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

Parameters
  • spec – A ModuleSpec defining the Module to instantiate, or a path from which to load a ModuleSpec via tensorflow_hub.load_module_spec using TensorFlow Hub.

  • elmo_output_names

    A list of ELMo output names. You can use any combination of ["word_emb", "lstm_outputs1", "lstm_outputs2", "elmo"], or ["default"] on its own.

    Where,

    • word_emb - CNN embedding (default dim 512)

    • lstm_outputs* - outputs of the LSTM layers (default dim 1024)

    • elmo - weighted sum of the CNN and LSTM outputs (default dim 1024)

    • default - mean ELMo vector for the sentence (default dim 1024)

    See TensorFlow Hub for more information about it.

  • dim – Can be used to reduce the dimensionality of the output embeddings when elmo_output_names != ["default"]

  • pad_zero – Whether to pad samples or not.

  • concat_last_axis – A boolean that enables/disables last axis concatenation. It is not used for elmo_output_names = ["default"].

  • max_token – The maximum number of tokens per batch line.

  • mini_batch_size – Size of the mini-batches the input batch is split into; used to reduce the device memory requirements.

If some required packages are missing, install all the requirements by running in command line:

python -m deeppavlov install <path_to_config>

where <path_to_config> is a path to one of the provided config files or its name without an extension, for example:

python -m deeppavlov install elmo_ru-news

Examples

>>> from deeppavlov.models.embedders.elmo_embedder import ELMoEmbedder
>>> elmo = ELMoEmbedder("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz")
>>> elmo([['вопрос', 'жизни', 'Вселенной', 'и', 'вообще', 'всего'], ['42']])
array([[ 0.00719104,  0.08544601, -0.07179783, ...,  0.10879009,
        -0.18630421, -0.2189409 ],
       [ 0.16325025, -0.04736076,  0.12354863, ..., -0.1889013 ,
         0.04972512,  0.83029324]], dtype=float32)
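
Separate outputs can be requested with elmo_output_names (a sketch; the module URL is the same as above):

>>> elmo_parts = ELMoEmbedder("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
        elmo_output_names=["word_emb", "elmo"])
>>> elmo_parts([['вопрос', 'жизни']])   # per-token outputs instead of the sentence-level "default"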

You can use ELMo models from DeepPavlov as usual TensorFlow Hub Module.

>>> import tensorflow as tf
>>> import tensorflow_hub as hub
>>> elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
trainable=True)
>>> sess = tf.Session()
>>> sess.run(tf.global_variables_initializer())
>>> embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]
>>> sess.run(embeddings)
array([[[ 0.05817392,  0.22493343, -0.19202903, ..., -0.14448944,
         -0.12425567,  1.0148407 ],
        [ 0.53596294,  0.2868537 ,  0.28028542, ..., -0.08028372,
          0.49089077,  0.75939953]],
       [[ 0.3433637 ,  1.0031182 , -0.1597258 , ...,  1.2442509 ,
          0.61029315,  0.43388373],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293]]], dtype=float32)

TensorFlow Hub module also supports tokenized sentences in the following format.

>>> tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
>>> tokens_length = [3, 1]
>>> embeddings = elmo(
        inputs={
                "tokens": tokens_input,
                "sequence_len": tokens_length
                },
        signature="tokens",
        as_dict=True)["elmo"]
>>> sess.run(embeddings)
array([[[ 0.6040001 , -0.16130011,  0.56478846, ..., -0.00376141,
         -0.03820051,  0.26321286],
        [ 0.01834148,  0.17055789,  0.5311495 , ..., -0.5675535 ,
          0.62669843, -0.05939034],
        [ 0.3242596 ,  0.17909613,  0.01657108, ...,  0.1866098 ,
          0.7392496 ,  0.08285746]],
       [[ 1.1322289 ,  0.19077688, -0.17811403, ...,  0.42973226,
          0.23391506, -0.01294377],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293]]], dtype=float32)

You can also get hub.text_embedding_column as described in the TensorFlow Hub documentation.
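
A sketch of building such a feature column (the key name is arbitrary):

>>> elmo_column = hub.text_embedding_column(
        key="sentence",
        module_spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
        trainable=False)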

__call__(batch: List[List[str]], *args, **kwargs) → Union[List[numpy.ndarray], numpy.ndarray][source]

Embed sentences from a batch.

Parameters

batch – A list of tokenized text samples.

Returns

A batch of ELMo embeddings.

__iter__() → Iterator[source]

Iterate over all words from an ELMo model vocabulary. The ELMo model vocabulary consists of ['<S>', '</S>', '<UNK>'].

Returns

An iterator of three elements ['<S>', '</S>', '<UNK>'].

class deeppavlov.models.embedders.glove_embedder.GloVeEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]

Class implements GloVe embedding model

Parameters
  • load_path – path where to load pre-trained embedding model from

  • pad_zero – whether to pad samples or not

model

GloVe model instance

tok2emb

dictionary with already embedded tokens

dim

dimension of embeddings

pad_zero

whether to pad sequence of tokens with zeros or not

load_path

path with pre-trained GloVe model

__call__(batch: List[List[str]], mean: Optional[bool] = None) → List[Union[list, numpy.ndarray]]

Embed sentences from batch

Parameters
  • batch – list of tokenized text samples

  • mean – whether to return mean embedding of tokens per sample

Returns

embedded batch

__iter__() → Iterator[str][source]

Iterate over all words from GloVe model vocabulary

Returns

iterator
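
Example

A minimal usage sketch (the model path below is a placeholder):

>>> from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder
>>> glove = GloVeEmbedder(load_path='/data/embeddings/glove.6B.100d.txt')
>>> glove([['deep', 'learning']], mean=True)   # one averaged vector per sample
>>> vocab = set(glove)                         # __iter__ yields the model vocabulary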

class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: deeppavlov.core.models.component.Component, tokenizer: Optional[deeppavlov.core.models.component.Component] = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: Optional[str] = None, vectorizer: Optional[deeppavlov.core.models.component.Component] = None, counter_vocab_path: Optional[str] = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)[source]
The class embeds a sentence as an average of its token embeddings weighted by special coefficients. The coefficients can be taken from the TFIDF-vectorizer given in vectorizer or calculated as TFIDF from the counter vocabulary given in counter_vocab_path.

One can also provide tags_vocab_path pointing to a vocabulary with tag weights. In this case, a batch of tags should be given as the second input to the __call__ method (see the sketch after the Examples below).

Parameters
  • embedder – embedder instance

  • tokenizer – tokenizer instance, should be able to detokenize sentence

  • pad_zero – whether to pad samples or not

  • mean – whether to return mean token embedding

  • tags_vocab_path – optional path to vocabulary with tags weights

  • vectorizer – vectorizer instance; should be trained with analyzer="word"

  • counter_vocab_path – path to counter vocabulary

  • idf_base_count – minimal idf value (tokens occurring fewer times are not counted)

  • log_base – logarithm base for TFIDF-coefficient calculation from the counter vocabulary

  • min_idf_weight – minimal idf weight

embedder

embedder instance

tokenizer

tokenizer instance, should be able to detokenize sentence

dim

dimension of embeddings

pad_zero

whether to pad samples or not

mean

whether to return mean token embedding

tags_vocab

vocabulary with weights for tags

vectorizer

vectorizer instance

counter_vocab_path

path to counter vocabulary

counter_vocab

counter vocabulary

idf_base_count

minimal idf value (tokens occurring fewer times are not counted)

log_base

logarithm base for TFIDF-coefficient calculation from the counter vocabulary

min_idf_weight

minimal idf weight

Examples

>>> from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fastTextTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
        counter_vocab_path='/data/vocabs/counts_wiki_lenta.txt')
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']])
[array([ 1.99135890e-01, -7.14746421e-02,  8.01428872e-02, -5.32840924e-02,
         5.05212297e-02,  2.76053832e-01, -2.53270134e-01, -9.34443950e-02,
         ...
         1.18385439e-02,  1.05643446e-01, -1.21904516e-03,  7.70555378e-02])]
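
When tags_vocab_path is provided, a batch of tags goes in as the second argument (a sketch; the vocabulary path and tags below are placeholders):

>>> taggedTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
        tags_vocab_path='/data/vocabs/tag_weights.txt')
>>> taggedTfidf([['большой', 'бегемот']], [['ADJ', 'NOUN']])
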
__call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: Optional[bool] = None, *args, **kwargs) → List[Union[list, numpy.ndarray]][source]

Infer on the given data

Parameters
  • batch – tokenized text samples

  • tags_batch – optional batch of corresponding tags

  • mean – whether to return mean token embedding (does not depend on self.mean)

  • *args – additional arguments

  • **kwargs – additional arguments

Returns

embedded batch

class deeppavlov.models.embedders.transformers_embedder.TransformersBertEmbedder(load_path: Union[str, pathlib.Path], bert_config_path: Optional[Union[str, pathlib.Path]] = None, truncate: bool = False, **kwargs)[source]

Transformers-based BERT model for embedding tokens, subtokens and sentences

Parameters
  • load_path – path to a pretrained BERT pytorch checkpoint

  • bert_config_path – path to a BERT configuration file

  • truncate – whether to remove zero-paddings from returned data

__call__(subtoken_ids_batch: Collection[Collection[int]], startofwords_batch: Collection[Collection[int]], attention_batch: Collection[Collection[int]]) → Tuple[Collection[Collection[Collection[float]]], Collection[Collection[Collection[float]]], Collection[Collection[float]], Collection[Collection[float]], Collection[Collection[float]]][source]

Predict embeddings values for a given batch

Parameters
  • subtoken_ids_batch – padded indexes for every subtoken

  • startofwords_batch – a mask matrix with 1 for the first subtoken of every token and 0 for every other subtoken

  • attention_batch – a mask matrix with 1 for every significant subtoken and 0 for paddings
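
For illustration, a hypothetical input for a single sample (the checkpoint path and subtoken ids are made up; real values come from a BERT tokenizer):

>>> bert_embedder = TransformersBertEmbedder(load_path='/data/models/bert_base')   # placeholder path
>>> subtoken_ids_batch = [[2, 1301, 1302, 1401, 3, 0]]   # illustrative ids: [CLS], word1 (two subtokens), word2, [SEP], padding
>>> startofwords_batch = [[0, 1,    0,    1,    0, 0]]   # 1 marks the first subtoken of every token
>>> attention_batch    = [[1, 1,    1,    1,    1, 0]]   # 1 marks real subtokens, 0 marks padding
>>> outputs = bert_embedder(subtoken_ids_batch, startofwords_batch, attention_batch)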