BERT in DeepPavlov

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer pre-trained on the masked language modeling and next sentence prediction tasks. This approach showed state-of-the-art results on a wide range of NLP tasks in English.

Google Research BERT repository:

Google Research has released several pre-trained BERT models. More details about these pre-trained models can be found here.

  • BERT-base, English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google], [deeppavlov]

  • BERT-base, English, uncased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google], [deeppavlov]

  • BERT-large, English, cased, 24-layer, 1024-hidden, 16-heads, 340M parameters: download from [google]

  • BERT-base, multilingual, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: download from [google], [deeppavlov]

  • BERT-base, Chinese, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google]

We have trained BERT-base models for other languages and domains:

  • RuBERT, Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov]

  • SlavicBERT, Slavic (bg, cs, pl, ru), cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov]

  • Conversational BERT, English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: [deeppavlov]

RuBERT was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took the multilingual version of BERT-base as the initialization for RuBERT [1].

SlavicBERT was trained on Russian news and four Wikipedias: Bulgarian, Czech, Polish, and Russian. The subtoken vocabulary was built from this data, and multilingual BERT was used as the initialization for SlavicBERT.

Conversational BERT was trained on the English part of Twitter, Reddit, DailyDialogues [3], OpenSubtitles [4], Debates [5], Blogs [6], and Facebook News Comments. We used this training data to build a vocabulary of English subtokens and took the English cased version of BERT-base as the initialization for English Conversational BERT.

In DeepPavlov, we made it easy to use pre-trained BERT for downstream tasks such as classification, tagging, question answering, and ranking. We also provide pre-trained models and examples of how to use BERT with DeepPavlov.

BERT for Classification

BertClassifierModel provides an easy-to-use solution for classification problems using pre-trained BERT. Any of the pre-trained English, multilingual, and Russian BERT models listed above can be used.

The two main components of the BERT classifier pipeline in DeepPavlov are BertPreprocessor and BertClassifierModel. Raw texts should be passed to bert_preprocessor, which splits them into subtokens, encodes the subtokens with their vocabulary indices, and creates token and segment masks. If the pipeline uses one-hot encoded classes, set one_hot_labels to true.
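The preprocessing steps described above can be illustrated with a minimal pure-Python sketch. It is not DeepPavlov's implementation: the toy vocabulary, the function names wordpiece and preprocess, and the max_seq_length argument are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; a real BERT vocab file has tens of thousands of entries.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "Deep": 4, "##Pav": 5, "##lov": 6, "is": 7, "great": 8,
}


def wordpiece(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match-first split of one word into subtokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Continuation pieces carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def preprocess(text: str, max_seq_length: int,
               vocab: Dict[str, int] = TOY_VOCAB) -> Tuple[List[int], List[int], List[int]]:
    """Return (subtoken ids, token mask, segment mask), padded to max_seq_length."""
    subtokens = ["[CLS]"]
    for word in text.split():
        subtokens.extend(wordpiece(word, vocab))
    subtokens = subtokens[: max_seq_length - 1] + ["[SEP]"]

    input_ids = [vocab[t] for t in subtokens]
    input_mask = [1] * len(input_ids)   # 1 marks real subtokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # a single text belongs entirely to segment 0

    pad = max_seq_length - len(input_ids)
    return input_ids + [0] * pad, input_mask + [0] * pad, segment_ids + [0] * pad


input_ids, input_mask, segment_ids = preprocess("DeepPavlov is great", max_seq_length=8)
```

Here the word "DeepPavlov" is split into three subtokens, and the last position of each output list is padding.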

bert_classifier applies a dense layer, with output size equal to the number of classes, to the pooled output of the Transformer encoder. The layer is followed by a softmax activation (or a sigmoid if the multilabel parameter is set to true in the config).
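As a rough illustration of this classification head, the sketch below applies a dense layer to a pooled vector and then softmax or sigmoid. The weights, the pooled vector, and the function names are made up for the example; the real model learns its weights and operates on the encoder output.

```python
import math
from typing import List


def dense(pooled: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One logit per class: weights has shape [n_classes][hidden_size]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Mutually exclusive classes: probabilities sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """Multilabel case: each class gets an independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def classify(pooled: List[float], weights: List[List[float]], bias: List[float],
             multilabel: bool = False) -> List[float]:
    logits = dense(pooled, weights, bias)
    return sigmoid(logits) if multilabel else softmax(logits)


# Illustrative values: hidden size 3, two classes.
pooled = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
bias = [0.0, 0.1]

probs = classify(pooled, weights, bias)                         # softmax
multi_probs = classify(pooled, weights, bias, multilabel=True)  # sigmoid
```

With softmax the outputs compete with each other, whereas with sigmoid several classes can receive a high probability at once, which is why the multilabel setting switches the activation.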

BERT for Named Entity Recognition (Sequence Tagging)

A pre-trained BERT model can be used for sequence tagging. Examples of usage of BERT for sequence tagging can be found here. The module used for tagging is BertNerModel. To tag each word, the representation of its first subtoken is extracted, so exactly one vector is produced per word. These representations are passed to a dense layer or a Bi-RNN layer to produce a distribution over tags. An optional CRF layer can be placed on top.
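The first-subtoken selection can be sketched as follows. This is an illustrative example, not the BertNerModel code: the "##" continuation prefix follows the WordPiece convention, and the helper names and the stand-in vectors are assumptions.

```python
from typing import List

SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def first_subtoken_mask(subtokens: List[str]) -> List[int]:
    """1 for the first subtoken of each word, 0 for continuations and special tokens."""
    return [0 if t.startswith("##") or t in SPECIAL_TOKENS else 1 for t in subtokens]


def select_word_vectors(subtoken_vectors: List[List[float]], mask: List[int]) -> List[List[float]]:
    """Keep one encoder vector per word: the one at the word's first subtoken."""
    return [vec for vec, keep in zip(subtoken_vectors, mask) if keep == 1]


subtokens = ["[CLS]", "Deep", "##Pav", "##lov", "is", "great", "[SEP]"]
# Stand-in encoder outputs: position i is represented by the vector [i, i].
subtoken_vectors = [[float(i), float(i)] for i in range(len(subtokens))]

mask = first_subtoken_mask(subtokens)
word_vectors = select_word_vectors(subtoken_vectors, mask)
```

Three words yield three vectors (positions 1, 4, and 5), which would then be fed to the dense or Bi-RNN tagging layer.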

The multilingual BERT model allows zero-shot transfer across languages. To use our 19-tag NER model for over a hundred languages, see Multilingual BERT Zero-Shot Transfer.

BERT for Context Question Answering (SQuAD)

Context question answering on the SQuAD dataset is the task of finding the answer to a question in a given context. It can be formalized as predicting the start and end positions of the answer within the context. BertSQuADModel uses two linear transformations to predict the probability that the current subtoken is the start or end position of an answer. For details, see the Context Question Answering documentation page.
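A minimal sketch of this formulation is given below: two linear transformations produce start and end scores per subtoken, and a softmax over positions turns them into probabilities. The final span search (maximizing the product of start and end probabilities with start not after end) is a common decoding choice assumed here, not something stated for BertSQuADModel; all weights and names are illustrative.

```python
import math
from typing import List, Tuple


def linear(vectors: List[List[float]], w: List[float], b: float) -> List[float]:
    """Apply one linear transformation to every subtoken vector, giving one score each."""
    return [sum(wi * xi for wi, xi in zip(w, vec)) + b for vec in vectors]


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(p_start: List[float], p_end: List[float], max_answer_len: int) -> Tuple[int, int]:
    """Assumed decoding rule: maximize p_start[s] * p_end[e] subject to s <= e."""
    best, best_score = (0, 0), -1.0
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_answer_len, len(p_end))):
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Stand-in encoder outputs for four subtokens with hidden size 2.
hidden = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
w_start, b_start = [1.0, 0.0], 0.0  # first linear transformation (start position)
w_end, b_end = [0.0, 1.0], 0.0      # second linear transformation (end position)

p_start = softmax(linear(hidden, w_start, b_start))
p_end = softmax(linear(hidden, w_end, b_end))
span = best_span(p_start, p_end, max_answer_len=3)
```

In this toy input the start probability peaks at position 1 and the end probability at position 2, so the predicted answer covers subtokens 1 through 2.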

BERT for Ranking

There are two main approaches to text ranking. The first is interaction-based, which is relatively accurate but slow; the second is representation-based, which is less accurate but faster [2]. In DeepPavlov, interaction-based ranking with BERT is implemented by two main components, BertRankerPreprocessor and BertRankerModel, and representation-based ranking by the components BertSepRankerPreprocessor and BertSepRankerModel. The additional components BertSepRankerPredictorPreprocessor and BertSepRankerPredictor are intended for the interact mode, where the ranking task is to retrieve the best possible response from a provided response base with the help of the trained model. Working examples with the trained models are given here. Statistics are available here.
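The difference between the two approaches can be sketched in a few lines. In the interaction-based case the context and the candidate response are packed into a single input, so the encoder must run once per pair; in the representation-based case each text is encoded separately, so response vectors can be computed in advance and only compared at query time. The function names, the cosine similarity measure, and the stand-in vectors below are assumptions for illustration, not the DeepPavlov components themselves.

```python
import math
from typing import List, Tuple


def pack_pair(context: List[str], response: List[str]) -> Tuple[List[str], List[int]]:
    """Interaction-based input: both texts in one sequence, told apart by segment ids."""
    tokens = ["[CLS]"] + context + ["[SEP]"] + response + ["[SEP]"]
    segment_ids = [0] * (len(context) + 2) + [1] * (len(response) + 1)
    return tokens, segment_ids


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_responses(context_vec: List[float], response_vecs: List[List[float]]) -> List[int]:
    """Representation-based ranking: compare a context vector with precomputed
    response vectors and return response indices from best to worst."""
    scores = [cosine(context_vec, r) for r in response_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


tokens, segment_ids = pack_pair(["how", "are", "you"], ["fine", "thanks"])

# Stand-in vectors; in practice these would come from the separately applied encoder.
context_vec = [1.0, 0.0]
response_base = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
ranking = rank_responses(context_vec, response_base)
```

Because the response vectors do not depend on the context, a large response base only needs to be encoded once, which is where the speed advantage of the representation-based approach comes from.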

Using custom BERT in DeepPavlov

The previous sections describe the BERT-based models implemented in DeepPavlov. To change the BERT model used for initialization in any of the downstream tasks mentioned above, the following parameters of the config file must be changed to match the new BERT path:

  • download URL in the part of the config

  • bert_config_file, pretrained_bert in the BERT-based component

  • vocab_file in the bert_preprocessor
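As a hedged sketch of the changes listed above, the snippet below rewrites the three path parameters in a config that has been loaded into a Python dict. The config layout (a "chainer" section with a "pipe" list of components), the helper name use_custom_bert, and all file paths are assumptions for illustration; the download URL is left untouched because its place in the config is not shown here.

```python
import copy
from typing import Any, Dict


def use_custom_bert(config: Dict[str, Any], new_paths: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the config where every component that defines one of the
    given parameters points to the new BERT files. Hypothetical helper."""
    updated = copy.deepcopy(config)  # leave the original config untouched
    for component in updated["chainer"]["pipe"]:  # assumed config layout
        for key, value in new_paths.items():
            if key in component:
                component[key] = value
    return updated


# Illustrative config fragment with placeholder paths.
config = {
    "chainer": {
        "pipe": [
            {"class_name": "bert_preprocessor", "vocab_file": "old_bert/vocab.txt"},
            {
                "class_name": "bert_classifier",
                "bert_config_file": "old_bert/bert_config.json",
                "pretrained_bert": "old_bert/bert_model.ckpt",
            },
        ]
    }
}

# Placeholder paths to the custom BERT files.
new_paths = {
    "vocab_file": "my_bert/vocab.txt",
    "bert_config_file": "my_bert/bert_config.json",
    "pretrained_bert": "my_bert/bert_model.ckpt",
}

custom_config = use_custom_bert(config, new_paths)
```

All three parameters have to be switched together: a vocabulary that does not match the checkpoint would map subtokens to the wrong embedding rows.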


[1] Kuratov, Y., Arkhipov, M. (2019). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv preprint arXiv:1905.07213.

[2] McDonald, R., Brokos, G. I., Androutsopoulos, I. (2018). Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.

[3] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S. (2017). DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. IJCNLP 2017.

[4] Lison, P., Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

[5] Zhang, J., Kumar, R., Ravi, S., Danescu-Niculescu-Mizil, C. (2016). Conversational Flow in Oxford-style Debates. Proceedings of NAACL, 2016.

[6] Schler, J., Koppel, M., Argamon, S., Pennebaker, J. (2006). Effects of Age and Gender on Blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.