Multi-task BERT in DeepPavlov

Multi-task BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper "Knowledge Transfer Between Tasks and Languages in the Multi-task Encoder-agnostic Transformer-based Models".

The idea is to share the BERT body between several tasks. This is useful when a model pipeline has several components that use BERT and the amount of GPU memory is limited. Each task has its own "head" attached to the output of the BERT encoder. If multi-task BERT has \(T\) heads, one training iteration consists of

  • composing \(T\) lists of examples, one for each task,

  • \(T\) gradient steps, one gradient step for each task.

By default, on every training step the lists of examples for all but one task are empty, as in the original MT-DNN repository.

When one of the BERT heads is being trained, the parameters of the other heads do not change. On each training step, the parameters of both the trained head and the BERT body are modified.

Currently, multi-task BERT heads support classification, regression, NER, and multiple-choice tasks.

On this page, multi-task BERT usage is explained using a toy configuration file of a model that is trained for single-sentence classification, sentence-pair classification, regression, multiple choice, and NER. The config for this model is multitask_example.
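The interplay between the shared body and the per-task heads can be illustrated with a toy sketch. The scalar model below (prediction = head × body × x) and the names train_step and train_iteration are assumptions made for illustration, not DeepPavlov code. The sketch only shows that a gradient step for one task updates the shared body and that task's head, while the other heads stay untouched.

```python
def train_step(body, heads, task, examples, lr=0.1):
    """One gradient step on mean squared error for a single task.

    Toy model (an assumption for illustration): pred = heads[task] * body * x.
    """
    n = len(examples)
    grad_body = sum(2 * (heads[task] * body * x - y) * heads[task] * x
                    for x, y in examples) / n
    grad_head = sum(2 * (heads[task] * body * x - y) * body * x
                    for x, y in examples) / n
    new_heads = list(heads)
    new_heads[task] -= lr * grad_head  # only this task's head changes
    return body - lr * grad_body, new_heads  # the shared body changes too


def train_iteration(body, heads, task_examples, lr=0.1):
    """One multi-task iteration over T lists of examples (one list per task)."""
    for task, examples in enumerate(task_examples):
        if examples:  # an empty list means no gradient step for this task
            body, heads = train_step(body, heads, task, examples, lr)
    return body, heads


body, heads = 1.0, [1.0, 1.0, 1.0]
# Only the second task has examples on this iteration.
new_body, new_heads = train_iteration(body, heads, [[], [(1.0, 2.0)], []])
```

After the call, new_heads[0] and new_heads[2] still hold their initial values, while new_body and new_heads[1] have moved.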

Other examples of using multitask models can be found in mt_glue.

Train config

When using the multitask_transformer component, you can use the same config file for inference as for training.

Data reading and iteration are performed by MultiTaskReader and MultiTaskIterator. These classes are composed of task readers and iterators and generate batches that contain data from heterogeneous datasets. The example below demonstrates the usage of the multitask dataset reader:

"dataset_reader": {
  "class_name": "multitask_reader",
  "task_defaults": {
    "class_name": "huggingface_dataset_reader",
    "path": "glue",
    "train": "train",
    "valid": "validation",
    "test": "test"
  },
  "tasks": {
    "cola": {"name": "cola"},
    "copa": {
      "path": "super_glue",
      "name": "copa"
    },
    "conll": {
      "class_name": "conll2003_reader",
      "use_task_defaults": false,
      "data_path": "{DOWNLOADS_PATH}/conll2003/",
      "dataset_name": "conll2003",
      "provide_pos": false
    }
  }
}

Nested dataset readers are listed in the tasks section. By default, the parameters of nested readers are taken from the task_defaults section. Values from tasks can complement these parameters, like the name parameter in dataset_reader.tasks.cola, or override default values, like the path parameter in dataset_reader.tasks.copa. In dataset_reader.tasks.conll, use_task_defaults is false. This special parameter forces multitask_reader to ignore task_defaults when creating the nested reader, so the dataset reader for the conll task uses only the parameters from dataset_reader.tasks.conll.
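The merging rule described above can be sketched in a few lines of Python. The function name resolve_task_params is hypothetical, and the sketch assumes that use_task_defaults defaults to true and is not passed on to the nested reader; the actual multitask_reader may handle these details differently.

```python
def resolve_task_params(task_defaults, task_params):
    """Merge one task's reader parameters with the shared defaults."""
    params = dict(task_params)
    # Assumption: the flag defaults to True and is not passed to the reader.
    use_defaults = params.pop("use_task_defaults", True)
    if not use_defaults:
        return params
    # Task values complement the defaults and win on key conflicts.
    return {**task_defaults, **params}


task_defaults = {"class_name": "huggingface_dataset_reader", "path": "glue",
                 "train": "train", "valid": "validation", "test": "test"}
tasks = {
    "cola": {"name": "cola"},
    "copa": {"path": "super_glue", "name": "copa"},
    "conll": {"class_name": "conll2003_reader", "use_task_defaults": False,
              "data_path": "{DOWNLOADS_PATH}/conll2003/",
              "dataset_name": "conll2003", "provide_pos": False},
}
resolved = {name: resolve_task_params(task_defaults, params)
            for name, params in tasks.items()}
```

Here resolved["cola"] inherits path "glue" and adds name, resolved["copa"] overrides path with "super_glue", and resolved["conll"] contains only its own parameters.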

The same principle with default values applies to multitask_iterator.

Batches generated by multitask_iterator are tuples of two elements: the inputs of the model and the labels. Both inputs and labels are lists of tuples. The inputs have the following format: [(first_task_inputs[0], second_task_inputs[0], ...), (first_task_inputs[1], second_task_inputs[1], ...), ...] where first_task_inputs, second_task_inputs, and so on are the x values of batches from the task dataset iterators. The labels in the second element have a similar format.

If task datasets have different sizes, then for smaller datasets the lists are padded with None values. For example, if the first task dataset inputs are [0, 1, 2, 3, 4, 5, 6], the second task dataset inputs are [7, 8, 9], and the batch size is 2, then multi-task input mini-batches will be [(0, 7), (1, 8)], [(2, 9), (3, None)], [(4, None), (5, None)], [(6, None)].
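The padding example above can be reproduced with a short sketch based on itertools.zip_longest. The function name multitask_batches is hypothetical, and the code only mirrors the behavior described in the example; it is not the MultiTaskIterator implementation.

```python
from itertools import zip_longest


def multitask_batches(task_inputs, batch_size):
    """Zip per-task inputs into tuples, padding shorter tasks with None."""
    rows = list(zip_longest(*task_inputs, fillvalue=None))
    # Cut the padded rows into mini-batches of at most batch_size items.
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


first_task = [0, 1, 2, 3, 4, 5, 6]
second_task = [7, 8, 9]
batches = multitask_batches([first_task, second_task], batch_size=2)
# batches == [[(0, 7), (1, 8)], [(2, 9), (3, None)],
#             [(4, None), (5, None)], [(6, None)]]
```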

In this tutorial, there are 5 datasets. Considering the batch structure, chainer inputs in multitask_example are:

"in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
"in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_conll"]

Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, a model input element could consist of two strings. If such a variable needs to be split, the InputSplitter component can be used. Data preparation in the multitask setting can be similar to the preparation in the single-task setting, except for the names of the variables.

To streamline the config, however, input_splitter and tokenizer can be unified into multitask_pipeline_preprocessor. This preprocessor takes either a single preprocessor class name for all tasks through the preprocessor parameter, or a list of preprocessor names through the preprocessors parameter. After the input is split by possible_keys_to_extract, every preprocessor (initialized beforehand) processes its input. Note that if the strict parameter (default: False) is set to True, the preprocessor always tries to split the data. Here is the definition of multitask_pipeline_preprocessor from the multitask_example:
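As a rough illustration of what splitting a multi-element input means, the sketch below turns a batch of string pairs into one list per position. The function split_by_keys is hypothetical and only mimics the idea; it does not reproduce the InputSplitter API.

```python
def split_by_keys(batch, keys):
    """Split a batch of multi-element samples into one list per key."""
    return [[sample[key] for sample in batch] for key in keys]


# Each sample is a pair of strings, e.g. the two sentences of a pair task.
pair_batch = [("A man is eating.", "Someone eats."),
              ("It rains.", "The sun shines.")]
first_sentences, second_sentences = split_by_keys(pair_batch, keys=[0, 1])
```

The same helper works for dictionary samples if the keys are field names instead of positions.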

"class_name": "multitask_pipeline_preprocessor",
"possible_keys_to_extract": [0, 1],
"preprocessors": [
"do_lower_case": true,
"n_task": 5,
"vocab_file": "{BACKBONE}",
"max_seq_length": 200,
"max_subword_length": 15,
"token_masking_prob": 0.0,
"return_features": true,
"in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
"out": [

The multitask_transformer component has common and task-specific parameters. Task-specific parameters are provided inside the tasks parameter, a dictionary whose keys are task names and whose values are task-specific parameters (type, options). Common parameters are backbone_model (the same parameter as in the tokenizer) and all parameters from torch_bert. The order of tasks MATTERS.

Here is the definition of multitask_transformer from the multitask_example:

"id": "multitask_transformer",
"class_name": "multitask_transformer",
"optimizer_parameters": {"lr": 2e-5},
"gradient_accumulation_steps": "{GRADIENT_ACC_STEPS}",
"learning_rate_drop_patience": 2,
"learning_rate_drop_div": 2.0,
"return_probas": true,
"backbone_model": "{BACKBONE}",
"save_path": "{MODEL_PATH}",
"load_path": "{MODEL_PATH}",
"tasks": {
  "cola": {
    "type": "classification",
    "options": 2
  },
  "rte": {
    "type": "classification",
    "options": 2
  },
  "stsb": {
    "type": "regression",
    "options": 1
  },
  "copa": {
    "type": "multiple_choice",
    "options": 2
  },
  "conll": {
    "type": "sequence_labeling",
    "options": "#vocab_conll.len"
  }
},
"in": [
"in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_ids_conll"],
"out": [

Note that proba2labels can now take several arguments.

  "in":["y_cola_pred_probas", "y_rte_pred_probas", "y_copa_pred_probas"],
  "out":["y_cola_pred_ids", "y_rte_pred_ids", "y_copa_pred_ids"],

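Conceptually, passing several arguments means that each batch of probabilities is converted to label ids independently. The sketch below shows this with a plain argmax; the function probas_to_ids is hypothetical, and the real proba2labels component may apply other rules (for example, thresholds) depending on its configuration.

```python
def probas_to_ids(*probas_batches):
    """Convert each batch of probability vectors to a batch of argmax ids."""
    return [[max(range(len(probas)), key=probas.__getitem__)
             for probas in batch]
            for batch in probas_batches]


cola_probas = [[0.9, 0.1], [0.2, 0.8]]
rte_probas = [[0.4, 0.6]]
copa_probas = [[0.7, 0.3], [0.1, 0.9]]
cola_ids, rte_ids, copa_ids = probas_to_ids(cola_probas, rte_probas, copa_probas)
```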
You may need to create your own metric for early stopping. In this example, the target metric is the average of ROC AUC for the insults and sentiment tasks and F1 for the NER task:

from deeppavlov.metrics.fmeasure import ner_f1
from deeppavlov.metrics.roc_auc_score import roc_auc_score

def roc_auc__roc_auc__ner_f1(true_onehot1, pred_probas1, true_onehot2, pred_probas2, ner_true3, ner_pred3):
    # ROC AUC for the two classification tasks
    roc_auc1 = roc_auc_score(true_onehot1, pred_probas1)
    roc_auc2 = roc_auc_score(true_onehot2, pred_probas2)
    # ner_f1 is divided by 100 to bring it to the same [0, 1] scale as ROC AUC
    ner_f1_3 = ner_f1(ner_true3, ner_pred3) / 100
    return (roc_auc1 + roc_auc2 + ner_f1_3) / 3

If the code above is saved in a module named custom_metric, the metric can be used in the config as custom_metric:roc_auc__roc_auc__ner_f1 (module.submodules:function_name reference format).

You can make an inference-only config. Such a config needs no dataset reader or dataset iterator. The train field and the components that prepare in_y are removed. In the multitask_transformer component configuration, all training parameters (learning rate, optimizer, etc.) are omitted.
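Part of this conversion can be sketched as a function over a config loaded into a Python dict. The function to_inference_config and the TRAINING_ONLY list are hypothetical, and the sketch assumes the usual layout with a chainer section containing a pipe list. Components that only prepare in_y still have to be removed by hand, because the sketch cannot tell which ones they are.

```python
import copy

# Hypothetical list of training-only parameter names, taken from the
# multitask_transformer example on this page.
TRAINING_ONLY = ("optimizer_parameters", "gradient_accumulation_steps",
                 "learning_rate_drop_patience", "learning_rate_drop_div")


def to_inference_config(config):
    """Return a copy of a train config with training-only parts removed."""
    cfg = copy.deepcopy(config)
    for key in ("dataset_reader", "dataset_iterator", "train"):
        cfg.pop(key, None)
    chainer = cfg.get("chainer", {})
    chainer.pop("in_y", None)  # labels are not fed at inference time
    for component in chainer.get("pipe", []):
        if component.get("class_name") == "multitask_transformer":
            for name in TRAINING_ONLY:
                component.pop(name, None)
            component.pop("in_y", None)
    return cfg


train_config = {
    "dataset_reader": {}, "dataset_iterator": {}, "train": {},
    "chainer": {"in": ["x_cola"], "in_y": ["y_cola"], "pipe": [
        {"class_name": "multitask_transformer", "backbone_model": "{BACKBONE}",
         "optimizer_parameters": {"lr": 2e-5}, "in_y": ["y_cola"]}]},
}
inference_config = to_inference_config(train_config)
```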

Here are the results of deeppavlov/configs/multitask/mt_glue.json compared to the analogous single-task configs, according to the test server.

[Results table: the scores are not recoverable from the source. Surviving row labels: Multitask config, Singletask config. Surviving metric headers (from server): Matthew's Corr, F1 / Accuracy, Pearson/Spearman Corr.]