BERT in DeepPavlov

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer pre-trained on the masked language modeling and next sentence prediction tasks. This approach showed state-of-the-art results on a wide range of NLP tasks in English.

Google Research BERT repository:

Google Research has released several pre-trained BERT models; more details about these pre-trained models can be found here:

We have trained BERT-base models for other languages and domains:

The deeppavlov_pytorch models are designed to be run with the HuggingFace Transformers library.

RuBERT was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took the multilingual version of BERT-base as the initialization for RuBERT [1].

SlavicBERT was trained on Russian News and four Wikipedias: Bulgarian, Czech, Polish, and Russian. The subtoken vocabulary was built using this data. Multilingual BERT was used as the initialization for SlavicBERT. The model is described in our ACL paper [2].

Conversational BERT was trained on the English part of Twitter, Reddit, DailyDialog [4], OpenSubtitles [5], Debates [6], Blogs [7], and Facebook News Comments. We used this training data to build the vocabulary of English subtokens and took the English cased version of BERT-base as the initialization for English Conversational BERT.

Conversational RuBERT was trained on OpenSubtitles [5], Dirty, Pikabu, and the Social Media segment of the Taiga corpus [8]. We assembled a new vocabulary for the Conversational RuBERT model on this data and initialized the model with RuBERT.

Conversational DistilRuBERT (6 transformer layers) and DistilRuBERT-tiny (2 transformer layers) were trained on the same data as Conversational RuBERT, following the approach of DistilBERT [3]. Namely, the Distil* models (students) used the pre-trained Conversational RuBERT as the teacher and were trained with a linear combination of the following losses:

  1. Masked language modeling loss (between student output logits for tokens and their true labels)

  2. Kullback-Leibler divergence (between student and teacher output logits)

  3. Cosine embedding loss (between averaged hidden states of the teacher and hidden states of the student)

  4. Mean squared error loss (between averaged attention maps of the teacher and attention maps of the student)
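The combination of the four losses above can be sketched in plain Python for a single token position. This is only an illustration under stated assumptions, not the training code used for the Distil* models: the loss weights, the temperature, the direction of the KL term (teacher relative to student) and all function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(student_logits, true_id):
    """1. Cross-entropy between the student's prediction and the true token."""
    return -math.log(softmax(student_logits)[true_id])

def kl_loss(student_logits, teacher_logits, temperature=1.0):
    """2. KL(teacher || student); direction and temperature are assumptions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def average(vectors):
    """Element-wise mean of several teacher vectors (e.g. a group of layers)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_loss(teacher_hidden, student_hidden):
    """3. One minus the cosine similarity of two hidden-state vectors."""
    dot = sum(t * s for t, s in zip(teacher_hidden, student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)

def mse_loss(teacher_attn, student_attn):
    """4. Mean squared error between two flattened attention maps."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_attn, student_attn)]
    return sum(squared) / len(squared)

def distillation_loss(student_logits, teacher_logits, true_id,
                      teacher_hiddens, student_hidden,
                      teacher_attns, student_attn,
                      weights=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the four losses; the weights are hypothetical."""
    terms = (
        mlm_loss(student_logits, true_id),
        kl_loss(student_logits, teacher_logits),
        cosine_loss(average(teacher_hiddens), student_hidden),
        mse_loss(average(teacher_attns), student_attn),
    )
    return sum(w * t for w, t in zip(weights, terms))
```

Because the teacher has more layers than the student, several teacher hidden states (or attention maps) are averaged before being compared with one student layer. When the student matches the teacher exactly, the last three terms vanish and only the masked language modeling loss remains.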

Sentence Multilingual BERT is a representation-based sentence encoder for the 101 languages of Multilingual BERT. It is initialized with Multilingual BERT and then fine-tuned on English MultiNLI [9] and on the dev set of multilingual XNLI [10]. Sentence representations are mean-pooled token embeddings, in the same manner as in Sentence-BERT [12].

Sentence RuBERT is a representation-based sentence encoder for Russian. It is initialized with RuBERT and fine-tuned on SNLI [11] translated to Russian with Google Translate and on the Russian part of the XNLI dev set [10]. Sentence representations are mean-pooled token embeddings, in the same manner as in Sentence-BERT [12].
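Mean pooling of token embeddings, as used by both sentence encoders, can be illustrated with a minimal sketch. The function name and the use of an attention mask to skip padding positions are assumptions made for the example; the models themselves operate on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, ignoring padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for embedding, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(embedding):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in summed]

# Two real tokens and one padding position that must not affect the result.
sentence_vector = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
```

The resulting vector has the same dimensionality as a single token embedding, so sentences of different lengths can be compared directly.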

Here, in DeepPavlov, we made it easy to use pre-trained BERT for downstream tasks like classification, tagging, question answering and ranking. We also provide pre-trained models and examples of how to use BERT with DeepPavlov.

BERT as Embedder

TransformersBertEmbedder allows using BERT model outputs as token-, subtoken- and sentence-level embeddings.

Additionally, the embeddings can be easily used in DeepPavlov. To get text-level, token-level and subtoken-level representations, you can use or modify a BERT embedder configuration:

from deeppavlov.core.common.file import read_json
from deeppavlov import build_model, configs

# Load the embedder config and point it at a local BERT directory.
bert_config = read_json(configs.embedder.bert_embedder)
bert_config['metadata']['variables']['BERT_PATH'] = 'path/to/bert/directory'

# Build the embedder pipeline from the modified config.
m = build_model(bert_config)

# One call returns token-, subtoken- and sentence-level representations.
texts = ['Hi, i want my embedding.', 'And mine too, please!']
tokens, token_embs, subtokens, subtoken_embs, sent_max_embs, sent_mean_embs, bert_pooler_outputs = m(texts)

BERT for Classification

TorchTransformersClassifierModel provides a solution to the classification problem using pre-trained BERT on PyTorch. You can use any of the pre-trained English, multilingual and Russian BERT models listed above. TorchTransformersClassifierModel also supports any Transformer-based model from the Transformers library.

The two main components of the BERT classifier pipeline in DeepPavlov are TorchTransformersPreprocessor and TorchTransformersClassifierModel. Raw texts should be passed to torch_transformers_preprocessor, which tokenizes them into subtokens, encodes the subtokens with their indices, and creates token and segment masks.
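The three preprocessing steps can be sketched with a toy WordPiece-style tokenizer. This is not the code of torch_transformers_preprocessor: the vocabulary, the function names and the greedy longest-match strategy are assumptions chosen to show what subtokens, indices and masks look like for a BERT-style model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest subtokens found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no subtoken matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_len):
    """Return subtokens, their indices, a token mask and a segment mask."""
    subtokens = ["[CLS]"]
    for word in text.lower().split():
        subtokens.extend(wordpiece(word, vocab))
    subtokens = subtokens[:max_len - 1] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in subtokens]
    attention_mask = [1] * len(input_ids)   # 1 for real subtokens
    token_type_ids = [0] * len(input_ids)   # single segment: all zeros
    padding = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding         # 0 for padding
    token_type_ids += [0] * padding
    return subtokens, input_ids, attention_mask, token_type_ids

# Hypothetical toy vocabulary.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "embed": 4, "##ding": 5, "my": 6}
subtokens, input_ids, attention_mask, token_type_ids = encode(
    "my embedding", toy_vocab, max_len=6)
```

In practice the vocabulary comes from the pre-trained model, and the tokenizer also handles punctuation, casing and sentence pairs, which this sketch omits.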

torch_transformers_classifier applies a dense layer, whose output size equals the number of classes, to the pooled outputs of the Transformer encoder. The layer is followed by a softmax activation (or sigmoid if the multilabel parameter is set to true in the config).
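A minimal sketch of such a classification head, in plain Python rather than PyTorch, is given here. The function names, the weight layout (one row per class) and the toy numbers are assumptions for illustration only.

```python
import math

def dense(pooled, weights, bias):
    """Linear layer: one output logit per class (one weight row per class)."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Probabilities over mutually exclusive classes."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(pooled, weights, bias, multilabel=False):
    """Apply the dense layer, then softmax (single-label) or sigmoid (multilabel)."""
    logits = dense(pooled, weights, bias)
    if multilabel:
        return [sigmoid(z) for z in logits]
    return softmax(logits)

# Toy pooled output of size 2 and three classes.
pooled_output = [1.0, -1.0]
class_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
class_bias = [0.0, 0.0, 0.0]
single_label = classify(pooled_output, class_weights, class_bias)
multi_label = classify(pooled_output, class_weights, class_bias, multilabel=True)
```

With softmax the class probabilities sum to one, so exactly one class is favored; with sigmoid each class is scored independently, which is what multilabel classification requires.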

BERT for Named Entity Recognition (Sequence Tagging)

A pre-trained BERT model can be used for sequence tagging. Examples of BERT application to sequence tagging can be found here. The module used for tagging is torch_transformers_sequence_tagger:TorchTransformersSequenceTagger. The tags are obtained by applying a dense layer to the representation of the first subtoken of each word. There is also an optional CRF layer on top. You can choose among different Transformers architectures by modifying the TRANSFORMER variable in the corresponding configuration files. The possible choices are DistilBert, Albert, Camembert, XLMRoberta, Bart, Roberta, Bert, XLNet, Flaubert, and XLM.
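The first-subtoken rule can be illustrated with a short sketch. It assumes BERT-style WordPiece output, where continuation subtokens start with "##"; other architectures mark word boundaries differently. The function names, weights and tag set are hypothetical, and the optional CRF layer is left out.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def first_subtoken_mask(subtokens):
    """1 for the first subtoken of each word, 0 for continuations and specials."""
    return [0 if t in SPECIAL_TOKENS or t.startswith("##") else 1
            for t in subtokens]

def tag_words(subtoken_vectors, subtokens, weights, bias, tag_names):
    """Apply a dense layer to first-subtoken vectors and pick the best tag."""
    tags = []
    mask = first_subtoken_mask(subtokens)
    for vector, keep in zip(subtoken_vectors, mask):
        if not keep:
            continue  # continuation and special subtokens receive no tag
        scores = [sum(w * x for w, x in zip(row, vector)) + b
                  for row, b in zip(weights, bias)]
        tags.append(tag_names[scores.index(max(scores))])
    return tags

# Toy example: two words, the first one split into two subtokens.
toy_subtokens = ["[CLS]", "mos", "##cow", "is", "[SEP]"]
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_tags = tag_words(
    toy_vectors, toy_subtokens,
    weights=[[0.0, 1.0], [1.0, 0.0]],  # one row per tag
    bias=[0.0, 0.0],
    tag_names=["O", "B-LOC"],
)
```

The output contains one tag per word rather than one per subtoken, so it lines up with the original tokenization of the sentence.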

The Multilingual BERT model allows zero-shot transfer across languages. To use our 19-tag NER model for over a hundred languages, see ner_multi_bert.

BERT for Context Question Answering (SQuAD)

Context Question Answering on the SQuAD dataset is the task of finding the answer to a question in a given context. The task can be formalized as predicting the start and end positions of the answer in the context. torch_transformers_squad:TorchTransformersSquad on PyTorch uses two linear transformations to predict the probability that the current subtoken is the start or end position of an answer. For details, see the Context Question Answering documentation page.
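The two linear transformations can be sketched as follows. The decoding step, which picks the span with the highest product of start and end probabilities subject to start <= end, is a common choice assumed here for illustration and is not necessarily what TorchTransformersSquad does; all names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Normalize logits into a probability distribution over positions."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vectors, weight, bias):
    """One logit per subtoken: dot product with a weight vector plus a bias."""
    return [sum(w * x for w, x in zip(weight, v)) + bias for v in vectors]

def predict_span(subtoken_vectors, w_start, b_start, w_end, b_end,
                 max_answer_len=30):
    """Return (start, end, score) for the most probable valid answer span."""
    p_start = softmax(linear(subtoken_vectors, w_start, b_start))
    p_end = softmax(linear(subtoken_vectors, w_end, b_end))
    best_start, best_end, best_score = 0, 0, -1.0
    for s, ps in enumerate(p_start):
        # The end position may not precede the start position.
        for e in range(s, min(len(p_end), s + max_answer_len)):
            score = ps * p_end[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end, best_score

# Toy context of four subtokens with two-dimensional representations.
context_vectors = [[0.0, 0.0], [5.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
start, end, score = predict_span(
    context_vectors, w_start=[1.0, 0.0], b_start=0.0,
    w_end=[0.0, 1.0], b_end=0.0)
```

The predicted subtoken positions are then mapped back to character offsets in the original context to extract the answer text.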

Using custom BERT in DeepPavlov

The previous sections describe the BERT-based models implemented in DeepPavlov. To change the BERT model used for initialization in any of the downstream tasks mentioned above, the following parameters of the config file must be changed to match the new BERT path:

  • download URL in the part of the config

  • bert_config_file, pretrained_bert in the BERT based Component. In case of PyTorch BERT, pretrained_bert can be assigned to the string name of any Transformer-based model (e.g. "bert-base-uncased", "distilbert-base-uncased") and then bert_config_file is set to None.

  • vocab_file in the torch_transformers_preprocessor. vocab_file can be assigned to the string name of the pre-trained BERT being used (e.g. "bert-base-uncased").
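The parameter changes listed above could be applied programmatically, as in this sketch. It assumes the config has been loaded into a dictionary whose components sit in a list under chainer.pipe; the helper name and the toy config are hypothetical, and the download URL is not handled.

```python
import copy

def use_custom_bert(config, pretrained_bert):
    """Return a copy of the config with BERT parameters pointing at a new model."""
    new_config = copy.deepcopy(config)  # leave the original config untouched
    for component in new_config["chainer"]["pipe"]:
        if "pretrained_bert" in component:
            component["pretrained_bert"] = pretrained_bert
            if "bert_config_file" in component:
                # A model referenced by its string name needs no config file.
                component["bert_config_file"] = None
        if "vocab_file" in component:
            component["vocab_file"] = pretrained_bert
    return new_config

# Hypothetical, heavily reduced config used only to demonstrate the helper.
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "torch_transformers_preprocessor",
             "vocab_file": "path/to/vocab.txt"},
            {"class_name": "torch_transformers_classifier",
             "pretrained_bert": "path/to/model",
             "bert_config_file": "path/to/config.json"},
        ]
    }
}
updated = use_custom_bert(toy_config, "distilbert-base-uncased")
```

Real configs often reference the model through variables in the metadata section; in that case changing the variable is simpler than editing each component.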


[1] Kuratov, Y., Arkhipov, M. (2019). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv preprint arXiv:1905.07213.

[2] Arkhipov M., Trofimova M., Kuratov Y., Sorokin A. (2019). Tuning Multilingual Transformers for Language-Specific Named Entity Recognition. ACL anthology W19-3712.

[3] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[4] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. IJCNLP 2017.

[5] P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

[6] Justine Zhang, Ravi Kumar, Sujith Ravi, Cristian Danescu-Niculescu-Mizil. Proceedings of NAACL, 2016.

[7] J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

[8] Shavrina T., Shapovalova O. (2017) To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In proc. of “CORPORA2017”, international conference, Saint-Petersbourg, 2017.

[9] Williams A., Nangia N. & Bowman S. (2017) A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. arXiv preprint arXiv:1704.05426

[10] Williams A., Bowman S. (2018) XNLI: Evaluating Cross-lingual Sentence Representations. arXiv preprint arXiv:1809.05053

[11] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326

[12] N. Reimers, I. Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084