Pre-trained embeddings

BERT

We are publishing several pre-trained BERT models:

  • RuBERT for the Russian language

  • Slavic BERT for Bulgarian, Czech, Polish, and Russian

  • Conversational BERT for informal English

  • Conversational RuBERT for informal Russian

  • Sentence Multilingual BERT for encoding sentences in 101 languages

  • Sentence RuBERT for encoding sentences in Russian

Descriptions of these models are available in the BERT section of the docs.

License

The pre-trained models are distributed under the Apache 2.0 license.

Downloads

The TensorFlow models can be run with the original BERT repo code, while the PyTorch models can be run with the HuggingFace Transformers library. The download links are listed below; a minimal loading sketch follows the table.

Description | Model parameters | Download links
RuBERT | vocab size = 120K, parameters = 180M, size = 632MB | [pytorch], [tensorflow]
Slavic BERT | vocab size = 120K, parameters = 180M, size = 632MB | [pytorch], [tensorflow]
Conversational BERT | vocab size = 30K, parameters = 110M, size = 385MB | [pytorch], [tensorflow]
Conversational RuBERT | vocab size = 120K, parameters = 180M, size = 630MB | [pytorch], [tensorflow]
Sentence Multilingual BERT | vocab size = 120K, parameters = 180M, size = 630MB | [pytorch], [tensorflow]
Sentence RuBERT | vocab size = 120K, parameters = 180M, size = 630MB | [pytorch], [tensorflow]
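
As an illustration, a downloaded and unpacked PyTorch archive can be loaded with the Transformers library roughly as follows. This is a minimal sketch, not part of the official instructions: the directory path is a placeholder, and it assumes a recent Transformers version whose from_pretrained can read the unpacked config, vocabulary, and weight files.

import torch
from transformers import BertModel, BertTokenizer

# Placeholder path: point this at the directory containing the unpacked
# archive (config, vocabulary, and PyTorch weight files).
model_dir = "path/to/rubert_pytorch"

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
model.eval()

# Encode a sample sentence and take the token-level hidden states.
inputs = tokenizer("это предложение", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # (batch, sequence length, hidden size)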

ELMo

ELMo embeddings can be used from Python as follows:

import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo module from the DeepPavlov file server.
elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz", trainable=True)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# The default signature accepts untokenized strings; the "elmo" key of the
# returned dict holds the contextual token embeddings.
embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]
sess.run(embeddings)

The TensorFlow Hub module also accepts pre-tokenized sentences in the following format: shorter sentences are padded with empty strings, and the real lengths are passed in sequence_len.

tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
tokens_length = [3, 1]
embeddings = elmo(
    inputs={"tokens": tokens_input, "sequence_len": tokens_length},
    signature="tokens",
    as_dict=True,
)["elmo"]
sess.run(embeddings)

Downloads

The models can be downloaded and run as TensorFlow Hub modules from the links below:

Description | Dataset parameters | Perplexity | TensorFlow Hub module
ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB | 43.692 | module_spec
ELMo on Russian WMT News | lines = 63M, tokens = 946M, size = 12GB | 49.876 | module_spec
ELMo on Russian Twitter | lines = 104M, tokens = 810M, size = 8.5GB | 94.145 | module_spec

fastText

We are publishing pre-trained word vectors for the Russian language. Several models were trained on a joint Russian Wikipedia and Lenta.ru corpus. We also provide one model for conversational Russian, trained on a Russian Twitter corpus.

All vectors are 300-dimensional. The vectors were trained with fastText skip-gram (see Bojanowski et al. (2016)) using various preprocessing options (see below).

The vectors are available in both binary (bin) and text (vec) fastText formats; a loading sketch is given after the download table below.

License

The pre-trained word vectors are distributed under the Apache 2.0 license.

Downloads

The pre-trained fastText skipgram models can be downloaded from:

Domain | Preprocessing | Vectors
Wiki+Lenta | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | bin, vec
Wiki+Lenta | tokenize (nltk word_tokenize), lowercasing | bin, vec
Wiki+Lenta | tokenize (nltk wordpunct_tokenize) | bin, vec
Wiki+Lenta | tokenize (nltk word_tokenize) | bin, vec
Wiki+Lenta | tokenize (nltk word_tokenize), remove stopwords | bin, vec
Twitter | tokenize (nltk word_tokenize) | bin, vec
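
A downloaded binary model can then be loaded from Python, for example with gensim's fastText loader. This is a minimal sketch under the assumption that gensim 3.8 or newer is installed; the file name is a placeholder for whichever model was downloaded above.

from gensim.models.fasttext import load_facebook_model

# Placeholder file name: any of the .bin models from the table above.
ft = load_facebook_model("path/to/ft_model.bin")

# All vectors are 300-dimensional; out-of-vocabulary words still get a vector
# built from character n-grams.
vector = ft.wv["мама"]
print(vector.shape)  # (300,)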

Word vectors training parameters

These word vectors were trained with the following parameters (values in […] are the defaults); a training sketch using the fastText Python bindings follows the list.

fastText (skipgram)

  • lr [0.1]

  • lrUpdateRate [100]

  • dim 300

  • ws [5]

  • epoch [5]

  • neg [5]

  • loss [softmax]

  • pretrainedVectors []

  • saveOutput [0]
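
For reference, roughly the same setup can be reproduced with the fastText Python bindings. This is a minimal sketch, not the exact command used for training: the corpus and output paths are placeholders, and parameters shown in brackets above are simply repeated or left at their defaults.

import fasttext

# Train skip-gram vectors mirroring the parameters listed above.
model = fasttext.train_unsupervised(
    "corpus.txt",   # placeholder: one preprocessed sentence per line
    model="skipgram",
    dim=300,        # vector dimensionality used for all published vectors
    lr=0.1,         # remaining values repeat the list above
    ws=5,
    epoch=5,
    neg=5,
)

model.save_model("vectors.bin")  # placeholder output path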