Neural Ranking

1. Introduction to the task

This model solves the tasks of ranking and paraphrase identification based on semantic similarity and is trained with siamese neural networks. The trained network can retrieve the response that is semantically closest to a given context from a database, or decide whether two sentences are paraphrases of each other. Such neural architectures make it possible to build automatic semantic FAQ systems.

2. Get started with the model

First, make sure you have the DeepPavlov Library installed. More info about the first installation is available in the library documentation.

[ ]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
!python -m deeppavlov install ranking_ubuntu_v2_torch_bert_uncased

ranking_ubuntu_v2_torch_bert_uncased is the name of the model's config file. What is a Config File?

There are alternative ways to install the model's packages that do not require executing a separate command – see the options in the next sections of this page. The full list of neural ranking models with their config names can be found in the table below.

3. Models list

Config                                               | Language | Dataset        | Transformer model
---------------------------------------------------- | -------- | -------------- | --------------------------------------------------
ranking/ranking_ubuntu_v2_torch_bert_uncased.json    | En       | Ubuntu v2      | bert-base-uncased
classifiers/paraphraser_rubert.json                  | Ru       | paraphraser.ru | DeepPavlov/rubert-base-cased
classifiers/paraphraser_convers_distilrubert_2L.json | Ru       | paraphraser.ru | DeepPavlov/distilrubert-tiny-cased-conversational
classifiers/paraphraser_convers_distilrubert_6L.json | Ru       | paraphraser.ru | DeepPavlov/distilrubert-base-cased-conversational

4. Use the model for prediction

4.1 Predict using Python

English

[ ]:
from deeppavlov import build_model

# download=True fetches the pretrained model files;
# install=True installs the model's requirements
ranking = build_model("ranking_ubuntu_v2_torch_bert_uncased", download=True, install=True)
[ ]:
ranking([["Forrest Gump is a 1994 American epic comedy-drama film directed by Robert Zemeckis.",
          "Robert Zemeckis directed Forrest Gump.",
          "Robert Lee Zemeckis was born on May 14, 1952, in Chicago."]])

Input: List[List[sentence1, sentence2, …]], where the sentences from the second to the last are ranked by similarity to the first sentence.

Output: List[List[scores]] - the similarity scores of the sentences from the second to the last with respect to the first sentence.
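
For example, the scores can be used to pick the best response for the query. A minimal sketch, assuming the ranking model built above and the output shape described here (one list of scores per inner list):

[ ]:
# A sketch: pick the candidate with the highest similarity score.
batch = [["Forrest Gump is a 1994 American epic comedy-drama film directed by Robert Zemeckis.",
          "Robert Zemeckis directed Forrest Gump.",
          "Robert Lee Zemeckis was born on May 14, 1952, in Chicago."]]
scores = ranking(batch)[0]                # one score per candidate sentence
best = max(range(len(scores)), key=scores.__getitem__)
print(batch[0][best + 1], scores[best])   # +1 skips the query sentence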

Russian

[ ]:
from deeppavlov import build_model

ranking = build_model("paraphraser_rubert", download=True, install=True)
[ ]:
ranking(["Форрест Гамп - комедийная драма, девятый полнометражный фильм режиссёра Роберта Земекиса."],
        ["Роберт Земекис был режиссером фильма «Форрест Гамп»."])

Input: Tuple[List[sentences1], List[sentences2]], where each element of the sentences1 list is compared with the corresponding element of the sentences2 list.

Output: List[labels] - each label is 1 if the sentence from the first list is a paraphrase of the corresponding sentence from the second list, and 0 otherwise.
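
Several pairs can be checked at once by passing two aligned lists. A minimal sketch with hypothetical sentence pairs:

[ ]:
# A sketch with hypothetical pairs; both lists must have the same length,
# since the model compares them element-wise.
pairs = [
    ("Кошка сидит на ковре.", "На ковре сидит кошка."),
    ("Сегодня идёт дождь.", "Роберт Земекис родился в Чикаго."),
]
sentences1, sentences2 = map(list, zip(*pairs))
print(ranking(sentences1, sentences2))  # e.g. [1, 0]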

4.2 Predict using CLI

English

The class deeppavlov.models.torch_bert.torch_bert_ranker.TorchBertRankerModel is not intended for use in the interact mode, so it is better to run the ranking/ranking_ubuntu_v2_torch_bert_uncased.json config from Python, as shown above.

Russian

You can also get predictions in an interactive mode through the CLI (Command Line Interface). The -d flag downloads the model files before the first run.

[ ]:
! python -m deeppavlov interact paraphraser_rubert -d

5. Customize the model

English

To train the ranking model on your own data, you should make a dataset in the following format:

  • the dataset should consist of train.csv, valid.csv and test.csv files;

  • the train.csv file should contain the columns Context, Utterance and Label, where Context and Utterance are two texts and Label (0 or 1) indicates whether the utterance is relevant to the context;

  • the valid.csv and test.csv files should contain the columns Context, Ground Truth Utterance, Distractor_0, Distractor_1, …, Distractor_N, where the distractor utterances are negative samples (utterances irrelevant to the context); a sketch of building such files is shown after this list.
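
A minimal sketch that writes toy files in this layout (the texts and the data/ directory are hypothetical):

[ ]:
# Toy data illustrating the expected csv layout; "data/" is hypothetical.
import os
import pandas as pd

os.makedirs("data", exist_ok=True)

pd.DataFrame({
    "Context": ["how do i upgrade all packages ?"],
    "Utterance": ["run sudo apt-get update && sudo apt-get upgrade"],
    "Label": [1],
}).to_csv("data/train.csv", index=False)

eval_split = pd.DataFrame({
    "Context": ["how do i upgrade all packages ?"],
    "Ground Truth Utterance": ["run sudo apt-get update && sudo apt-get upgrade"],
    "Distractor_0": ["what is your favourite movie ?"],
})
eval_split.to_csv("data/valid.csv", index=False)
eval_split.to_csv("data/test.csv", index=False)  # same column layout as valid.csv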

Then put the train.csv, valid.csv and test.csv files into the directory specified in the data_path parameter of the dataset reader in the config, and launch training of the model:

[ ]:
!python -m deeppavlov train ranking_ubuntu_v2_torch_bert_uncased
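
Training can also be launched from Python, overriding data_path in the config first. A sketch, assuming the files sit in a hypothetical data/ directory:

[ ]:
# A sketch: point the dataset reader at your own data, then train.
from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

config = read_json(configs.ranking.ranking_ubuntu_v2_torch_bert_uncased)
config["dataset_reader"]["data_path"] = "data/"  # hypothetical directory
model = train_model(config)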

Russian

To train the ranking model on your own data, you should make a dataset with two files: paraphrases.xml (for training) and paraphrases_gold.xml (for testing).

The xml files should have the following format:

<?xml version='1.0' encoding='UTF8'?>
<data>
  <head>
    <title>Russian Paraphrase Corpus</title>
    <description>This file contains a collection of sentence pairs with crowdsourced annotation. Paraphrase classes: -1: non-paraphrases, 0: loose paraphrases, 1: strict paraphrases.</description>
    <reference>http://paraphraser.ru</reference>
    <version>1.0 beta</version>
    <date>2015-11-28</date>
  </head>
  <corpus>
    <paraphrase>
      <value name="id">1</value>
      <value name="id_1">201</value>
      <value name="id_2">8159</value>
      <value name="text_1">text 1</value>
      <value name="text_2">text 2</value>
      <value name="jaccard">0.65</value>
      <value name="class">0</value>
    </paraphrase>
    <paraphrase>
      ...
    </paraphrase>
  </corpus>
</data>
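
A minimal sketch that generates a paraphrases.xml with toy values in this format (the <head> block is omitted here; add it if your copy of the corpus reader expects it):

[ ]:
# Toy ids and texts; writes only the <corpus> part of the format above.
import xml.etree.ElementTree as ET

data = ET.Element("data")
corpus = ET.SubElement(data, "corpus")
pair = ET.SubElement(corpus, "paraphrase")
for name, text in [("id", "1"), ("id_1", "201"), ("id_2", "8159"),
                   ("text_1", "text 1"), ("text_2", "text 2"),
                   ("jaccard", "0.65"), ("class", "0")]:
    ET.SubElement(pair, "value", name=name).text = text
ET.ElementTree(data).write("paraphrases.xml", encoding="UTF-8", xml_declaration=True)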

Place the paraphrases.xml and paraphrases_gold.xml files into the directory specified in the data_path parameter of the dataset reader in the config, and launch training of the model:

[ ]:
!python -m deeppavlov train paraphraser_rubert