Features

Models

NER model [docs]

There are two models for the Named Entity Recognition task in DeepPavlov: BERT-based and Bi-LSTM+CRF. Both models predict tags (in BIO format) for the tokens of the input.

The BERT-based model is described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The second model reproduces the architecture from the paper Application of a Hybrid Bi-LSTM-CRF Model to the Task of Russian Named Entity Recognition, which is inspired by the Bi-LSTM+CRF architecture from https://arxiv.org/pdf/1603.01360.pdf.

| Dataset | Lang | Model | Test F1 |
|---------|------|-------|---------|
| Persons-1000 dataset with additional LOC and ORG markup (Collection 3) | Ru | ner_rus_bert.json | 98.1 |
| | | ner_rus.json | 95.1 |
| Ontonotes | Multi | ner_ontonotes_bert_mult.json | 88.8 |
| | En | ner_ontonotes_bert.json | 88.6 |
| | | ner_ontonotes.json | 87.1 |
| CoNLL-2003 | | ner_conll2003_bert.json | 91.7 |
| | | ner_conll2003.json | 89.9 |
| DSTC2 | | ner_dstc2.json | 97.1 |
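
For instance, a pre-trained NER model can be run from Python (a minimal sketch; the config name follows the table above, and the weights are downloaded on first use):

    from deeppavlov import build_model, configs

    # Download and build the multilingual BERT NER model
    ner = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

    # The model returns tokens and their BIO tags
    tokens, tags = ner(["Bob Ross lived in Florida"])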

Slot filling models [docs]

Based on fuzzy Levenshtein search, these models extract normalized slot values from text. They either rely on NER results or perform a needle-in-a-haystack search over the whole input.

| Dataset | Slots Accuracy |
|---------|----------------|
| DSTC 2  | 98.85          |
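
A slot-filling model can be used the same way (a sketch assuming the slotfill_dstc2 config shipped with DeepPavlov):

    from deeppavlov import build_model, configs

    # Build the DSTC 2 slot-filling pipeline
    slotfill = build_model(configs.ner.slotfill_dstc2, download=True)

    # Returns a dict of normalized slot values found in the utterance
    print(slotfill(["I want some chinese food in the south part of town"]))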

Classification model [docs]

Models for classification tasks (intents, sentiment, etc.) on the word level. Shallow-and-wide CNN, Deep CNN, BiLSTM, BiLSTM with self-attention, and other architectures are available, and multilabel classification of texts is also supported. Several pre-trained models are presented in the table below.

| Task | Dataset | Lang | Model | Metric | Valid | Test | Downloads |
|------|---------|------|-------|--------|-------|------|-----------|
| 28 intents | DSTC 2 | En | DSTC 2 emb | Accuracy | 0.7613 | 0.7733 | 800 Mb |
| | | | Wiki emb | | 0.9629 | 0.9617 | 8.5 Gb |
| | | | BERT | | 0.9673 | 0.9636 | 800 Mb |
| 7 intents | SNIPS-2017 [1] | | DSTC 2 emb | F1-macro | 0.8591 | – | 800 Mb |
| | | | Wiki emb | | 0.9820 | – | 8.5 Gb |
| | | | Tfidf + SelectKBest + PCA + Wiki emb | | 0.9673 | – | 8.6 Gb |
| | | | Wiki emb weighted by Tfidf | | 0.9786 | – | 8.5 Gb |
| Insult detection | Insults | | Reddit emb | ROC-AUC | 0.9263 | 0.8556 | 6.2 Gb |
| | | | English BERT | | 0.9255 | 0.8612 | 1200 Mb |
| | | | English Conversational BERT | | 0.9389 | 0.8941 | 1200 Mb |
| 5 topics | AG News | | Wiki emb | Accuracy | 0.8922 | 0.9059 | 8.5 Gb |
| Intent | Yahoo-L31 | | Yahoo-L31 on conversational BERT | ROC-AUC | 0.9436 | – | 1200 Mb |
| Sentiment | SST | | 5-classes SST on conversational BERT | Accuracy | 0.6456 | 0.6715 | 400 Mb |
| | | | 5-classes SST on multilingual BERT | | 0.5738 | 0.6024 | 660 Mb |
| | Yelp | | 5-classes Yelp on conversational BERT | | 0.6925 | 0.6842 | 400 Mb |
| | | | 5-classes Yelp on multilingual BERT | | 0.5896 | 0.5874 | 660 Mb |
| Sentiment | Twitter mokoron | Ru | RuWiki+Lenta emb w/o preprocessing | | 0.9965 | 0.9961 | 6.2 Gb |
| | | | RuWiki+Lenta emb with preprocessing | | 0.7823 | 0.7759 | 6.2 Gb |
| | RuSentiment | | RuWiki+Lenta emb | F1-weighted | 0.6541 | 0.7016 | 6.2 Gb |
| | | | Twitter emb super-convergence [2] | | 0.7301 | 0.7576 | 3.4 Gb |
| | | | ELMo | | 0.7519 | 0.7875 | 700 Mb |
| | | | Multi-language BERT | | 0.6809 | 0.7193 | 1900 Mb |
| | | | Conversational RuBERT | | 0.7548 | 0.7742 | 657 Mb |
| Intent | Ru like Yahoo-L31 | | Conversational vs Informational on ELMo | ROC-AUC | 0.9412 | – | 700 Mb |

[1] Coucke A. et al. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces // arXiv preprint arXiv:1805.10190. 2018.

[2] Smith L. N., Topin N. Super-convergence: Very fast training of residual networks using large learning rates. 2018.

Since no intent recognition results had been published for the DSTC-2 data, the comparison of the presented model is given on the SNIPS dataset. Model scores were evaluated in the same way as in [3] so that they can be compared with the results reported by the authors of the dataset. The results were achieved by tuning parameters and using embeddings trained on a Reddit dataset.

| Model | AddToPlaylist | BookRestaurant | GetWeather | PlayMusic | RateBook | SearchCreativeWork | SearchScreeningEvent |
|-------|---------------|----------------|------------|-----------|----------|--------------------|----------------------|
| api.ai | 0.9931 | 0.9949 | 0.9935 | 0.9811 | 0.9992 | 0.9659 | 0.9801 |
| ibm.watson | 0.9931 | 0.9950 | 0.9950 | 0.9822 | 0.9996 | 0.9643 | 0.9750 |
| microsoft.luis | 0.9943 | 0.9935 | 0.9925 | 0.9815 | 0.9988 | 0.9620 | 0.9749 |
| wit.ai | 0.9877 | 0.9913 | 0.9921 | 0.9766 | 0.9977 | 0.9458 | 0.9673 |
| snips.ai | 0.9873 | 0.9921 | 0.9939 | 0.9729 | 0.9985 | 0.9455 | 0.9613 |
| recast.ai | 0.9894 | 0.9943 | 0.9910 | 0.9660 | 0.9981 | 0.9424 | 0.9539 |
| amazon.lex | 0.9930 | 0.9862 | 0.9825 | 0.9709 | 0.9981 | 0.9427 | 0.9581 |
| Shallow-and-wide CNN | 0.9956 | 0.9973 | 0.9968 | 0.9871 | 0.9998 | 0.9752 | 0.9854 |

[3] https://www.slideshare.net/KonstantinSavenkov/nlu-intent-detection-benchmark-by-intento-august-2017
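
As an illustration, a pre-trained intent classifier can be queried from Python (a minimal sketch; intents_snips is one of the classifier configs shipped with DeepPavlov, and the weights are fetched on first run):

    from deeppavlov import build_model, configs

    # Build the SNIPS intent classifier used in the tables above
    classifier = build_model(configs.classifiers.intents_snips, download=True)

    # Returns the predicted intent label for each input text
    print(classifier(["Add this song to my playlist", "Will it rain tomorrow?"]))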

Automatic spelling correction model [docs]

Pipelines that use candidate search in a static dictionary and an ARPA language model to correct spelling errors.

Note: about 4.4 GB of disk space is required for the Russian language model and about 7 GB for the English one.

Comparison on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |
|-------------------|-----------|--------|-----------|---------------------|
| Yandex.Speller | 83.09 | 59.86 | 69.59 | – |
| Damerau Levenshtein 1 + lm | 53.26 | 53.74 | 53.50 | 29.3 |
| Brill Moore top 4 + lm | 51.92 | 53.94 | 52.91 | 0.6 |
| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |
| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |
| Brill Moore top 1 | 41.29 | 37.26 | 39.17 | 2.4 |
| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |
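
A spelling-correction pipeline can be run from Python like the other models (a sketch; brillmoore_wikitypos_en is taken from the DeepPavlov spelling-correction configs, check the docs for the current list):

    from deeppavlov import build_model, configs

    # Brill-Moore error model with a dictionary of Wikipedia typos (English)
    corrector = build_model(configs.spelling_correction.brillmoore_wikitypos_en, download=True)

    # Returns corrected sentences
    print(corrector(["helllo", "this is a sentnce with mistaks"]))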

Ranking model [docs]

The main neural ranking model is based on LSTM-based deep learning models for non-factoid answer selection. The model ranks responses or contexts from a database by their relevance to the given context.

There are three alternative neural architectures available as well:

  • Sequential Matching Network (SMN): based on Wu, Yu, et al. "Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots". ACL. 2017.

  • Deep Attention Matching Network (DAM): based on Xiangyang Zhou, et al. "Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.

  • Deep Attention Matching Network + Universal Sentence Encoder v3 (DAM-USE-T): our newly proposed architecture, based on the DAM work above and Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil. 2018. "Universal Sentence Encoder for English".

Available pre-trained models for ranking:

| Dataset | Model config | Val R10@1 | Test R10@1 | Test R10@2 | Test R10@5 | Downloads |
|---------|--------------|-----------|------------|------------|------------|-----------|
| InsuranceQA v1 | ranking_insurance_interact | 72.0 | 72.2 | – | – | 8374 MB |
| Ubuntu V2 | ranking_ubuntu_v2_mt_word2vec_dam_transformer | 74.32 | 74.46 | 86.77 | 97.38 | 2457 MB |
| Ubuntu V2 | ranking_ubuntu_v2_mt_word2vec_dam | 71.20 | 71.54 | 83.66 | 96.33 | 1645 MB |
| Ubuntu V2 | ranking_ubuntu_v2_mt_word2vec_smn | 68.56 | 67.91 | 81.49 | 95.63 | 1609 MB |
| Ubuntu V2 | ranking_ubuntu_v2_bert_uncased | 66.5 | 66.6 | – | – | 396 MB |
| Ubuntu V2 | ranking_ubuntu_v2_bert_sep | 66.5 | 66.5 | – | – | 396 MB |
| Ubuntu V2 | ranking_ubuntu_v2_interact | 52.9 | 52.4 | – | – | 8913 MB |
| Ubuntu V2 | ranking_ubuntu_v2_mt_interact | 59.2 | 58.7 | – | – | 8906 MB |
| Ubuntu V1 | ranking_ubuntu_v1_mt_word2vec_dam_transformer | – | 79.57 | 89.32 | 97.34 | 2439 MB |
| Ubuntu V1 | ranking_ubuntu_v1_mt_word2vec_dam | – | 77.95 | 88.07 | 97.06 | 1645 MB |
| Ubuntu V1 | ranking_ubuntu_v1_mt_word2vec_smn | – | 75.90 | 87.16 | 96.80 | 1591 MB |
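
For example, one of the response-ranking configs above can be loaded from Python (a sketch; the attribute path under configs.ranking follows the config file names in the table and may differ between releases):

    from deeppavlov import build_model, configs

    # Load the SMN ranker trained on Ubuntu Dialogue Corpus v2;
    # the model scores candidate responses by relevance to a multi-turn context
    # (see the DeepPavlov docs for the exact input format)
    ranker = build_model(configs.ranking.ranking_ubuntu_v2_mt_word2vec_smn, download=True)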

Available pre-trained models for paraphrase identification:

| Dataset | Model config | Val (accuracy) | Test (accuracy) | Val (F1) | Test (F1) | Val (log_loss) | Test (log_loss) | Downloads |
|---------|--------------|----------------|-----------------|----------|-----------|----------------|-----------------|-----------|
| paraphraser.ru | paraphrase_ident_paraphraser_ft | 83.8 | 75.4 | 87.9 | 80.9 | 0.468 | 0.616 | 5938M |
| paraphraser.ru | paraphrase_ident_paraphraser_elmo | 82.7 | 76.0 | 87.3 | 81.4 | 0.391 | 0.510 | 5938M |
| paraphraser.ru | paraphrase_ident_paraphraser_tune | 82.9 | 76.7 | 87.3 | 82.0 | 0.392 | 0.479 | 5938M |
| paraphraser.ru | paraphrase_bert_multilingual | 87.4 | 79.3 | 90.2 | 83.4 | – | – | 1330M |
| paraphraser.ru | paraphrase_rubert | 90.2 | 84.9 | 92.3 | 87.9 | – | – | 1325M |
| Quora Question Pairs | paraphrase_ident_qqp_bilstm | 87.1 | 87.0 | 83.0 | 82.6 | 0.300 | 0.305 | 8134M |
| Quora Question Pairs | paraphrase_ident_qqp | 87.7 | 87.5 | 84.0 | 83.8 | 0.287 | 0.298 | 8136M |

Comparison with other models on the InsuranceQA V1:

| Model | Validation (Recall@1) | Test1 (Recall@1) |
|-------|-----------------------|------------------|
| Architecture II (HLQA(200) CNNQA(4000) 1-MaxPooling Tanh) | 61.8 | 62.8 |
| QA-LSTM basic-model (max pooling) | 64.3 | 63.1 |
| ranking_insurance | 72.0 | 72.2 |

Comparison with other models on the Ubuntu Dialogue Corpus v1 (test):

| Model | R@1 | R@2 | R@5 |
|-------|-----|-----|-----|
| SMN last [Wu et al., 2017] | 0.723 | 0.842 | 0.956 |
| SMN last [DeepPavlov ranking_ubuntu_v1_mt_word2vec_smn] | 0.754 | 0.869 | 0.967 |
| DAM [Zhou et al., 2018] | 0.767 | 0.874 | 0.969 |
| DAM [DeepPavlov ranking_ubuntu_v1_mt_word2vec_dam] | 0.779 | 0.880 | 0.970 |
| MRFN-FLS [Tao et al., 2019] | 0.786 | 0.886 | 0.976 |
| IMN [Gu et al., 2019] | 0.777 | 0.880 | 0.974 |
| IMN Ensemble [Gu et al., 2019] | 0.794 | 0.893 | 0.978 |
| DAM-USE-T [DeepPavlov ranking_ubuntu_v1_mt_word2vec_dam_transformer] | 0.7957 | 0.8932 | 0.9734 |

Comparison with other models on the Ubuntu Dialogue Corpus v2 (test):

| Model | R@1 | R@2 | R@5 |
|-------|-----|-----|-----|
| SMN last [Wu et al., 2017] | – | – | – |
| SMN last [DeepPavlov ranking_ubuntu_v2_mt_word2vec_smn] | 0.6791 | 0.8149 | 0.9563 |
| DAM [Zhou et al., 2018] | – | – | – |
| DAM [DeepPavlov ranking_ubuntu_v2_mt_word2vec_dam] | 0.7154 | 0.8366 | 0.9633 |
| MRFN-FLS [Tao et al., 2019] | – | – | – |
| IMN [Gu et al., 2019] | 0.771 | 0.886 | 0.979 |
| IMN Ensemble [Gu et al., 2019] | 0.791 | 0.899 | 0.982 |
| DAM-USE-T [DeepPavlov ranking_ubuntu_v2_mt_word2vec_dam_transformer] | 0.7446 | 0.8677 | 0.9738 |

References:

  • Yu Wu, Wei Wu, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. In ACL, pages 372–381. https://www.aclweb.org/anthology/P17-1046

  • Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1118–1127, ACL. http://aclweb.org/anthology/P18-1103

  • Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019. Multi-Representation Fusion Network for Multi-turn Response Selection in Retrieval-based Chatbots. In WSDM '19. https://dl.acm.org/citation.cfm?id=3290985

  • Jia-Chen Gu, Zhen-Hua Ling, and Quan Liu. 2019. Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots. https://arxiv.org/abs/1901.01824

TF-IDF Ranker model [docs]

Based on Reading Wikipedia to Answer Open-Domain Questions. The model solves the task of document retrieval for a given query.

| Dataset | Model | Wiki dump | Recall@5 | Downloads |
|---------|-------|-----------|----------|-----------|
| SQuAD-v1.1 | doc_retrieval | enwiki (2018-02-11) | 75.6 | 33 GB |
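
The ranker can be used from Python as follows (a sketch; en_odqa-style config names come from the DeepPavlov doc_retrieval configs, and the Wikipedia download is large, see the table above):

    from deeppavlov import build_model, configs

    # TF-IDF document ranker over an English Wikipedia dump
    ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)

    # Returns the top Wikipedia articles for the query
    print(ranker(["Who is Ivan Pavlov?"]))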

Question Answering model [docs]

Models in this section solve the task of finding the answer to a question in a given context (SQuAD task format). There are two models for this task in DeepPavlov: BERT-based and R-Net. Both models predict the start and end positions of the answer in the given context.

The BERT-based model is described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The R-Net model is based on R-NET: Machine Reading Comprehension with Self-matching Networks.

| Dataset | Model config | Lang | EM (dev) | F-1 (dev) | Downloads |
|---------|--------------|------|----------|-----------|-----------|
| SQuAD-v1.1 | DeepPavlov BERT | en | 80.88 | 88.49 | 806Mb |
| SQuAD-v1.1 | DeepPavlov R-Net | en | 71.49 | 80.34 | ~2.5Gb |
| SDSJ Task B | DeepPavlov RuBERT | ru | 66.30±0.24 | 84.60±0.11 | 1325Mb |
| SDSJ Task B | DeepPavlov multilingual BERT | ru | 64.35±0.39 | 83.39±0.08 | 1323Mb |
| SDSJ Task B | DeepPavlov R-Net | ru | 60.62 | 80.04 | ~5Gb |
| DRCD | DeepPavlov multilingual BERT | ch | 84.18 | 84.08 | 630Mb |
| DRCD | DeepPavlov Chinese BERT | ch | 85.13 | 85.15 | 362Mb |

For the case when the answer is not necessarily present in the given context, there is the squad_noans model, which outputs an empty string if the context contains no answer.
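
A pre-trained SQuAD model takes a batch of contexts and a batch of questions (a sketch; squad_bert matches the SQuAD configs shipped with DeepPavlov):

    from deeppavlov import build_model, configs

    # BERT-based reading comprehension model trained on SQuAD v1.1
    model = build_model(configs.squad.squad_bert, download=True)

    # Returns the answer span extracted from the context
    print(model(["DeepPavlov is a library for NLP and dialog systems."],
                ["What is DeepPavlov?"]))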

Morphological tagging model [docs]

We have a BERT-based model for Russian and character-based models for 11 languages. The character-based model follows Heigold et al., 2017, "An extensive empirical evaluation of character-based morphological tagging for 14 languages". It is a state-of-the-art model for Russian and near state of the art for several other languages. The model takes tokenized sentences as input and outputs the corresponding sequence of morphological labels in UD format. The table below contains word and sentence accuracy on UD2.0 datasets. For more scores see the full table.

| Dataset | Model | Word accuracy | Sent. accuracy | Download size (MB) |
|---------|-------|---------------|----------------|--------------------|
| UD2.3 (Russian) | UD Pipe 2.3 (Straka et al., 2017) | 93.5 | – | – |
| | UD Pipe Future (Straka et al., 2018) | 96.90 | – | – |
| | BERT-based model | 97.83 | 72.02 | 661 |
| UD2.0 (Russian) | Pymorphy + russian_tagsets (first tag) | 60.93 | 0.00 | – |
| | UD Pipe 1.2 (Straka et al., 2017) | 93.57 | 43.04 | – |
| | Basic model | 95.17 | 50.58 | 48.7 |
| | Pymorphy-enhanced model | 96.23 | 58.00 | 48.7 |
| UD2.0 (Czech) | UD Pipe 1.2 (Straka et al., 2017) | 91.86 | 42.28 | – |
| | Basic model | 94.35 | 51.56 | 41.8 |
| UD2.0 (English) | UD Pipe 1.2 (Straka et al., 2017) | 92.89 | 55.75 | – |
| | Basic model | 93.00 | 55.18 | 16.9 |
| UD2.0 (German) | UD Pipe 1.2 (Straka et al., 2017) | 76.65 | 10.24 | – |
| | Basic model | 83.83 | 15.25 | 18.6 |
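
A tagger can be applied from Python like the other models (a sketch; the exact attribute under configs.morpho_tagger depends on the DeepPavlov release, so treat the config name below as indicative and check the docs):

    from deeppavlov import build_model, configs

    # Russian morphological tagger (config path may differ between releases)
    tagger = build_model(configs.morpho_tagger.UD2_0.morpho_ru_syntagrus_pymorphy,
                         download=True)

    # Returns one morphological label in UD format per input token
    print(tagger(["Мама мыла раму"]))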

Syntactic parsing model [docs]

We have a biaffine model for syntactic parsing based on RuBERT. It achieves the highest known labeled attachment score (LAS) of 93.7% on the ru_syntagrus Russian corpus (version UD 2.3).

| Dataset | Model | UAS | LAS |
|---------|-------|-----|-----|
| UD2.3 (Russian) | UD Pipe 2.3 (Straka et al., 2017) | 90.3 | 89.0 |
| | UD Pipe Future (Straka, 2018) | 93.0 | 91.5 |
| | UDify (multilingual BERT) (Kondratyuk, 2018) | 94.8 | 93.1 |
| | our BERT model | 95.2 | 93.7 |
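
The parser can be called like the other models (a sketch; syntax_ru_syntagrus_bert is given as a plausible config name under configs.syntax, check the docs for the one in your release):

    from deeppavlov import build_model, configs

    # RuBERT-based biaffine dependency parser (config name is indicative)
    parser = build_model(configs.syntax.syntax_ru_syntagrus_bert, download=True)

    # Outputs CoNLL-U-style dependency trees for the input sentences
    print(parser(["Мама мыла раму"]))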

Frequently Asked Questions (FAQ) model [docs]

A set of pipelines for the FAQ task: classifying an incoming question against a set of known questions and returning a prepared answer. Different pipelines can be built from tf-idf, weighted fastText, cosine similarity, and logistic regression components.
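
For example (a sketch assuming the tfidf_logreg_en_faq config from the DeepPavlov FAQ configs):

    from deeppavlov import build_model, configs

    # tf-idf features + logistic regression over a prepared FAQ dataset
    faq = build_model(configs.faq.tfidf_logreg_en_faq, download=True)

    # Returns the prepared answer for the closest known question
    print(faq(["How can I get a visa?"]))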

Skills

Goal-oriented bot [docs]

Based on the Hybrid Code Networks (HCNs) architecture from Jason D. Williams, Kavosh Asadi, Geoffrey Zweig, Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning, 2017. It predicts responses in a goal-oriented dialog. The model is customizable: embeddings, the slot filler, and the intent classifier can be switched on and off on demand.

Available pre-trained models and their comparison with existing benchmarks:

| Dataset | Lang | Model | Metric | Test | Downloads |
|---------|------|-------|--------|------|-----------|
| DSTC 2 (modified) | En | basic bot | Turn Accuracy | 0.380 | 10 Mb |
| | | bot with slot filler | | 0.542 | 400 Mb |
| | | bot with slot filler, intents & attention | | 0.553 | 8.5 Gb |
| DSTC 2 | | Bordes and Weston (2016) | | 0.411 | – |
| | | Eric and Manning (2017) | | 0.480 | – |
| | | Perez and Liu (2016) | | 0.487 | – |
| | | Williams et al. (2017) | | 0.556 | – |
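
Besides the CLI commands shown under Examples at the end of this section, the bot can be driven from Python (a minimal sketch using the gobot_dstc2 config):

    from deeppavlov import build_model, configs

    # Goal-oriented bot trained on (modified) DSTC 2
    bot = build_model(configs.go_bot.gobot_dstc2, download=True)

    # Each call takes a batch of user utterances and returns bot responses
    print(bot(["Hi, I want some cheap food in the north of town"]))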

Seq2seq goal-oriented bot [docs]

The dialogue agent predicts responses in a goal-oriented dialog and is able to handle multiple domains (the pre-trained bot supports calendar scheduling, weather information retrieval, and point-of-interest navigation). The model is end-to-end differentiable and does not need to explicitly model dialogue state or belief trackers.

Comparison of the DeepPavlov pre-trained model with others:

| Dataset | Lang | Model | Valid BLEU | Test BLEU | Downloads |
|---------|------|-------|------------|-----------|-----------|
| Stanford Kvret | En | KvretNet | 0.131 | 0.132 | 10 Gb |
| | | KvretNet, Mihail Eric et al. (2017) | – | 0.132 | – |
| | | CopyNet, Mihail Eric et al. (2017) | – | 0.110 | – |
| | | Attn Seq2Seq, Mihail Eric et al. (2017) | – | 0.102 | – |
| | | Rule-based, Mihail Eric et al. (2017) | – | 0.066 | – |

ODQA [docs]

An open-domain question answering skill. It accepts free-form questions about the world and outputs an answer based on its Wikipedia knowledge.

| Dataset | Model config | Wiki dump | F1 | Downloads |
|---------|--------------|-----------|----|-----------|
| SQuAD-v1.1 | ODQA | enwiki (2018-02-11) | 35.89 | 9.7Gb |
| SQuAD-v1.1 | ODQA | enwiki (2016-12-21) | 37.83 | 9.3Gb |
| SDSJ Task B | ODQA | ruwiki (2018-04-01) | 28.56 | 7.7Gb |
| SDSJ Task B | ODQA with RuBERT | ruwiki (2018-04-01) | 37.83 | 4.3Gb |
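
An ODQA pipeline can be run end to end from Python (a sketch; en_odqa_infer_wiki is the config name used in the DeepPavlov ODQA docs, and the full Wikipedia download is large, see the table above):

    from deeppavlov import build_model, configs

    # Full ODQA pipeline: TF-IDF retrieval over Wikipedia + reading comprehension
    odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)

    # Returns a free-form answer extracted from retrieved Wikipedia articles
    print(odqa(["Where did guinea pigs originate?"]))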

AutoML

Hyperparameters optimization [docs]

Hyperparameter optimization (either by cross-validation or neural evolution) for DeepPavlov models; it requires only small changes in a config file.

Embeddings

Pre-trained embeddings [docs]

Word vectors for the Russian language trained on joint Russian Wikipedia and Lenta.ru corpora.

Examples of some models

  • Run goal-oriented bot with Telegram interface:

    python -m deeppavlov telegram gobot_dstc2 -d -t <TELEGRAM_TOKEN>
    
  • Run goal-oriented bot with console interface:

    python -m deeppavlov interact gobot_dstc2 -d
    
  • Run goal-oriented bot with REST API:

    python -m deeppavlov riseapi gobot_dstc2 -d
    
  • Run slot-filling model with Telegram interface:

    python -m deeppavlov telegram slotfill_dstc2 -d -t <TELEGRAM_TOKEN>
    
  • Run slot-filling model with console interface:

    python -m deeppavlov interact slotfill_dstc2 -d
    
  • Run slot-filling model with REST API:

    python -m deeppavlov riseapi slotfill_dstc2 -d
    
  • Predict intents on every line in a file:

    python -m deeppavlov predict intents_snips -d --batch-size 15 < /data/in.txt > /data/out.txt
    

Watch the video demo of deploying a goal-oriented bot and a slot-filling model with the Telegram UI.