Configuration file

An NLP pipeline config is a JSON file that contains one required element chainer:

{
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      ...
    ],
    "out": ["y_predicted"]
  }
}

Chainer is a core concept of the DeepPavlov library: the chainer builds a pipeline from heterogeneous components (rule-based/ML/DL) and allows you to train or infer from the pipeline as a whole. Each component in the pipeline specifies its inputs and outputs as arrays of names, for example: "in": ["tokens", "features"] and "out": ["token_embeddings", "features_embeddings"], and you can chain outputs of one component with inputs of other components:

{
  "class_name": "deeppavlov.models.preprocessors.str_lower:str_lower",
  "in": ["x"],
  "out": ["x_lower"]
},
{
  "class_name": "nltk_tokenizer",
  "in": ["x_lower"],
  "out": ["x_tokens"]
},

Pipeline elements can be child classes of Component or functions.

Each Component in the pipeline must implement the __call__() method and have a class_name parameter, which is either its registered codename or the full name of any python class in the form "module_name:ClassName". It can also have any other parameters that repeat its __init__() method arguments. Default values of __init__() arguments will be overridden with the config values during the initialization of a class instance.
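Below is a minimal sketch of such a component, assuming the Component base class lives at deeppavlov.core.models.component; the codename my_lowercaser and the strip parameter are purely illustrative:

from deeppavlov.core.common.registry import register
from deeppavlov.core.models.component import Component

@register('my_lowercaser')  # registered codename to use as "class_name" in a config
class MyLowercaser(Component):
    def __init__(self, strip: bool = False, **kwargs):
        # "strip" can be overridden from the config entry, e.g.
        # {"class_name": "my_lowercaser", "strip": true, "in": ["x"], "out": ["x_lower"]}
        self.strip = strip

    def __call__(self, batch):
        # receives the batch(es) named in "in" and returns values for "out"
        return [(s.strip() if self.strip else s).lower() for s in batch]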

You can reuse components in the pipeline to process different parts of the data with the help of the id and ref parameters:

{
  "class_name": "nltk_tokenizer",
  "id": "tokenizer",
  "in": ["x_lower"],
  "out": ["x_tokens"]
},
{
  "ref": "tokenizer",
  "in": ["y"],
  "out": ["y_tokens"]
},

Nested configuration files

Any configuration file can be used inside another configuration file as an element of the chainer or as a field of another component using the config_path key. Any field of the nested configuration file can be overwritten using the overwrite field:

"chainer": {
  "pipe": {
    ...
    {
      "class_name": "ner_chunk_model",
      "ner": {
        "config_path": "{CONFIGS_PATH}/ner/ner_ontonotes_bert.json",
        "overwrite": {
          "chainer.out": ["x_tokens", "tokens_offsets", "y_pred", "probas"]
        }
      },
      ...
    }
  ]
}

In this example, ner_ontonotes_bert.json is used as the ner argument value in the ner_chunk_model component, and the chainer.out value is overwritten with a new list. Names of overwritten fields are given in dot notation, where numeric fields are treated as list indexes. For example, to change the class_name value of the second element of pipe to ner_chunker (1 is the index of the second element), use the "chainer.pipe.1.class_name": "ner_chunker" key-value pair.
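Put together, an overwrite mapping combining both cases might look like this (shown here as a Python dict; in a config file it is the same structure in JSON, and the values are taken from the example above):

overwrite = {
    # replace the outputs of the nested chainer
    "chainer.out": ["x_tokens", "tokens_offsets", "y_pred", "probas"],
    # replace class_name of the second element of the nested pipe (index 1)
    "chainer.pipe.1.class_name": "ner_chunker",
}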

Variables

As of version 0.1.0, every string value in a configuration file is interpreted as a format string whose fields are evaluated from the metadata.variables element:

{
  "chainer": {
    "in": ["x"],
    "pipe": [
      {
        "class_name": "my_component",
        "in": ["x"],
        "out": ["x"],
        "load_path": "{MY_PATH}/file.obj"
      },
      {
        "in": ["x"],
        "out": ["y_predicted"],
        "config_path": "{CONFIGS_PATH}/classifiers/insults_kaggle_bert.json"
      }
    ],
    "out": ["y_predicted"]
  },
  "metadata": {
    "variables": {
      "MY_PATH": "/some/path",
      "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs"
    }
  }
}

The variable DEEPPAVLOV_PATH is always preset to the path of the deeppavlov python module.

You can override configuration variables using environment variables with the prefix DP_. So the environment variable DP_VARIABLE_NAME will override VARIABLE_NAME inside a configuration file.

For example, adding DP_ROOT_PATH=/my_path/to/large_hard_drive will make most configs use this path for downloading and reading embeddings/models/datasets.
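As a quick sketch, the same override can be made from a Python script by setting the environment variable before any DeepPavlov configuration is loaded:

import os

# must be set before any DeepPavlov config is loaded in this process
os.environ["DP_ROOT_PATH"] = "/my_path/to/large_hard_drive"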

Training

There are two abstract classes for trainable components: Estimator and NNModel.

An Estimator is fit once on the data, with no batching or early stopping, so fitting can safely be done at pipeline initialization time. The fit() method has to be implemented for each Estimator. One example is Vocab.

An NNModel requires more complex training. It can only be trained in a supervised mode (as opposed to an Estimator, which can be trained in both supervised and unsupervised settings). This process takes multiple epochs with periodic validation and logging. The train_on_batch() method has to be implemented for each NNModel.

Training is triggered by the train_model() function.
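A minimal sketch of triggering training from Python (in recent versions train_model() is importable from the top-level deeppavlov package; the config path is illustrative):

from deeppavlov import train_model

# accepts a path to a config file or an already loaded config dict
model = train_model("path/to/my_config.json", download=True)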

Train config

Estimators that are trained should also have a fit_on parameter that contains a list of input parameter names. An NNModel should have the in_y parameter, which contains a list of ground truth answer names. For example:

[
  {
    "id": "classes_vocab",
    "class_name": "default_vocab",
    "fit_on": ["y"],
    "level": "token",
    "save_path": "vocabs/classes.dict",
    "load_path": "vocabs/classes.dict"
  },
  {
    "in": ["x"],
    "in_y": ["y"],
    "out": ["y_predicted"],
    "class_name": "intent_model",
    "save_path": "classifiers/intent_cnn",
    "load_path": "classifiers/intent_cnn",
    "classes_vocab": {
      "ref": "classes_vocab"
    }
  }
]

The config for training the pipeline should have three additional elements: dataset_reader, dataset_iterator and train:

{
  "dataset_reader": {
    "class_name": ...,
    ...
  },
  "dataset_iterator": {
    "class_name": ...,
    ...
  },
  "chainer": {
    ...
  },
  "train": {
    ...
  }
}

A simplified version of the training pipeline contains two elements: dataset and train. The dataset element currently can be used for training on classification data in csv and json formats.

Train Parameters

The train element can contain a class_name parameter that references a trainer class (the default value is torch_trainer). All other parameters will be passed as keyword arguments to the trainer class’s constructor.
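As a hedged sketch, a train element might look like the following (shown here as a Python dict; the exact set of supported keyword arguments depends on the trainer class, and all values are illustrative):

train = {
    "class_name": "torch_trainer",
    "metrics": ["f1", "accuracy"],   # see the Metrics section below
    "epochs": 10,                    # passed to the trainer's constructor as kwargs
    "batch_size": 64,
    "validation_patience": 5,        # early stopping on the first metric
    "val_every_n_epochs": 1,
}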

Metrics

"train": {
  "class_name": "torch_trainer",
  "metrics": [
    "f1",
    {
      "name": "accuracy",
      "inputs": ["y", "y_labels"]
    },
    {
      "name": "sklearn.metrics:accuracy_score",
      "alias": "unnormalized_accuracy",
      "inputs": ["y", "y_labels"],
      "normalize": false
    }
  ],
  ...
}

The first metric in the list is used for early stopping.

Each metric can be described as a JSON object with name, alias and inputs properties, where:

  • name is either a registered name of a metric function or module.submodules:function_name.

  • alias is an alternative name to report the metric under. The default value is the name value.

  • inputs is a list of parameter names from chainer’s inner memory that will be passed to the metric function. Default value is a concatenation of chainer’s in_y and out parameters.

All other arguments are interpreted as kwargs when the metric is called. If a metric is given as a string, this string is interpreted as a metric name, i.e. "f1" in the example above is equivalent to {"name": "f1"}.
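A custom metric function can also be registered under its own codename and then referenced from the metrics list; here is a minimal sketch, assuming the register_metric decorator from deeppavlov.core.common.metrics_registry and an illustrative codename exact_match:

from deeppavlov.core.common.metrics_registry import register_metric

@register_metric('exact_match')  # usable as "name" in the "metrics" list
def exact_match(y_true, y_predicted) -> float:
    # positional arguments are filled from the metric's "inputs" list
    matches = sum(t == p for t, p in zip(y_true, y_predicted))
    return matches / max(len(y_true), 1)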

DatasetReader

A DatasetReader class reads data and returns it in a specified format. A concrete DatasetReader class should inherit from this base class and be registered with a codename:

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.dataset_reader import DatasetReader

@register('conll2003_reader')
class Conll2003DatasetReader(DatasetReader):
    def read(self, data_path: str, **kwargs) -> dict:
        # should return a dict with 'train', 'valid' and 'test' keys,
        # each holding a list of (x, y) samples
        ...

DataLearningIterator and DataFittingIterator

DataLearningIterator forms the sets of data (‘train’, ‘valid’, ‘test’) needed for training/inference and divides them into batches. A concrete DataLearningIterator class should be registered and can inherit from the deeppavlov.core.data.data_learning_iterator.DataLearningIterator class. This is a base class and can be used as a DataLearningIterator as well.

DataFittingIterator iterates over the provided dataset without train/valid/test splitting and is useful for Estimators that do not require training.
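A minimal sketch of a custom iterator, assuming the base class path above and that subclasses customize splitting by overriding split() (the codename my_iterator and the 90/10 split are illustrative):

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('my_iterator')
class MyIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        # e.g. carve a validation set out of 'train' when the dataset has no 'valid' split
        if not self.valid:
            split_point = int(len(self.train) * 0.9)
            self.train, self.valid = self.train[:split_point], self.train[split_point:]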

Inference

All components inherited from the Component abstract class can be used for inference. The __call__() method should return the standard output of the component. For example, a tokenizer should return tokens, a NER model should return recognized entities, a bot should return an utterance. The particular format of the returned data should be defined in __call__().

Inference is triggered by the interact_model() function. There is no need for a separate JSON config for inference.
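A pipeline can also be built and called directly from Python; a minimal sketch (build_model() is available from the top-level deeppavlov package in recent versions, and the config path is illustrative):

from deeppavlov import build_model

model = build_model("path/to/my_config.json", download=True)

# __call__ consumes the batches named in the chainer's "in" and returns its "out" values
predictions = model(["What is the weather like today?"])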

Model Configuration

Each DeepPavlov model is determined by its configuration file. You can use existing config files or create your own. You can also choose a config file and modify the preprocessors/tokenizers/embedders/vectorizers there. The components within each group below share the same interface and are responsible for the same functions, therefore they can be used in the same parts of a config pipeline.

Here is a list of useful Components aimed at preprocessing, postprocessing and vectorizing your data.

Preprocessors

A preprocessor is a component that processes a batch of samples.

  • Already implemented universal preprocessors of tokenized texts (each sample is a list of tokens):

    • Mask (registered as mask) returns a binary mask of corresponding length (padding up to the maximum length per batch).

    • Sanitizer (registered as sanitizer) removes all combining characters like diacritical marks from tokens.

  • Already implemented universal preprocessors of non-tokenized texts (each sample is a string):

    • DirtyCommentsPreprocessor (registered as dirty_comments_preprocessor) preprocesses samples by converting them to lowercase, expanding English combinations with the apostrophe ', and replacing runs of more than three identical characters with two characters.

    • str_lower() converts samples to lowercase.

  • Already implemented universal preprocessors of other types of features:

    • OneHotter (registered as one_hotter) performs a one-hot encoding operation on a batch of samples, where each sample is an integer label or a list of integer labels (both can be combined in one batch). If the multi_label parameter is set to True, it returns one one-dimensional vector per sample with several elements equal to 1.

Tokenizers

A tokenizer is a component that processes a batch of samples (each sample is a text string).

  • NLTKTokenizer (registered as nltk_tokenizer) tokenizes using tokenizers from nltk.tokenize, e.g. nltk.tokenize.wordpunct_tokenize.

  • NLTKMosesTokenizer (registered as nltk_moses_tokenizer) tokenizes and detokenizes using nltk.tokenize.moses.MosesDetokenizer, nltk.tokenize.moses.MosesTokenizer.

  • StreamSpacyTokenizer (registered as stream_spacy_tokenizer) tokenizes or lemmatizes texts with the spaCy en_core_web_sm model by default.

  • SplitTokenizer (registered as split_tokenizer) tokenizes using the string split() method.

Embedders

An embedder is a component that converts every token in a tokenized batch to a vector of a particular dimension (optionally, it returns a single vector per sample).

  • FasttextEmbedder (registered as fasttext) reads an embedding file in fastText format. If mean is set, it returns one vector per sample: the mean of the embedding vectors of its tokens.

  • TfidfWeightedEmbedder (registered as tfidf_weighted) accepts an embedder, a tokenizer (for detokenization; by default, tokens are joined with spaces), and a TF-IDF vectorizer or counter vocabulary; it optionally accepts a tags vocabulary (to assign additional multiplicative weights to particular tags). If mean is set, it returns one vector per sample: the mean of the embedding vectors of its tokens.

Vectorizers

A vectorizer is a component that converts a batch of text samples to a batch of vectors.

  • SklearnComponent (registered as sklearn_component) is a DeepPavlov wrapper for most sklearn estimators, vectorizers, etc. For example, to get a TF-IDF vectorizer, set model_class in the config to sklearn.feature_extraction.text:TfidfVectorizer, set infer_method to transform, and pass load_path, save_path and other sklearn model parameters (see the sketch after this list).

  • HashingTfIdfVectorizer (registered as hashing_tfidf_vectorizer) implements a hashing version of the usual TF-IDF vectorizer. It creates a TF-IDF matrix of size [n_documents x n_features (hash_size)] from a collection of documents.
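A hedged sketch of such a TF-IDF vectorizer entry (shown here as a Python dict; in a config file it is the same structure in JSON, and the data names and paths are illustrative):

tfidf_vectorizer = {
    "class_name": "sklearn_component",
    "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
    "infer_method": "transform",
    "fit_on": ["x"],
    "in": ["x"],
    "out": ["x_tfidf"],
    "save_path": "vectorizers/tfidf_vectorizer.pkl",
    "load_path": "vectorizers/tfidf_vectorizer.pkl",
    # any other TfidfVectorizer parameters, e.g.
    "ngram_range": [1, 2],
}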