TF-IDF Ranking

1. Introduction to the task

This is an implementation of a passage ranker based on tf-idf vectorization. The implementation follows the ranker from the DrQA project. By default, the ranker takes a batch of queries as input and returns, for each query, 100 passage titles sorted by relevance.
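To make the idea concrete, here is a minimal, self-contained sketch of tf-idf ranking over a toy corpus. It is not the DeepPavlov or DrQA implementation: the tokenizer, the smoothed idf formula, the toy documents, and the names `build_index`, `rank` and `rank_batch` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Toy tokenizer (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build per-document term counts and idf weights from {doc_id: text}."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    # One common smoothed idf variant (assumption, not DeepPavlov's formula).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return counts, idf


def rank(query, counts, idf, top_n=100):
    """Return up to top_n document ids sorted by descending tf-idf dot product."""
    q = Counter(tokenize(query))
    scores = {}
    for doc_id, c in counts.items():
        score = sum(
            (math.log1p(q_tf) * idf[t]) * (math.log1p(c[t]) * idf[t])
            for t, q_tf in q.items()
            if t in c
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def rank_batch(queries, counts, idf, top_n=100):
    # Like the ranker described above: a batch of queries in, one ranked list per query out.
    return [rank(query, counts, idf, top_n) for query in queries]


# Hypothetical toy corpus keyed by document id.
docs = {
    1: "Ivan Pavlov was a Russian physiologist",
    2: "The Battle of Kulikovo took place in 1380",
    3: "Pavlov studied conditioned reflexes in dogs",
}
counts, idf = build_index(docs)
result = rank_batch(["Who is Ivan Pavlov?"], counts, idf)
print(result)
```

Document 1 matches both "ivan" and "pavlov", document 3 matches only "pavlov", and document 2 shares no query term, so it is left out of the ranking.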

2. Get started with the model

First, make sure the DeepPavlov library is installed. See the installation guide for more information.

[ ]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
!python -m deeppavlov install en_ranker_tfidf_wiki

en_ranker_tfidf_wiki is the name of the model’s config file. See the DeepPavlov documentation on config files for details.

There are alternative ways to install the model’s packages that do not require executing a separate command; see the options in the next sections of this page. The full list of tf-idf ranking models, with their config names, can be found in the table below.

3. Models list

| Config | Language | Description | RAM |
|---|---|---|---|
| doc_retrieval/en_ranker_tfidf_wiki.json | En | Config for TF-IDF ranking over Wikipedia | 2.9 GB |
| doc_retrieval/en_ranker_pop_wiki.json | En | Config for TF-IDF ranking, followed by popularity ranking, over Wikipedia | 8.1 GB |
| doc_retrieval/ru_ranker_tfidf_wiki.json | Ru | TF-IDF ranking config over Wikipedia | 8.4 GB |

4. Use the model for prediction

4.1 Predict using Python

English

Building (if you don’t have your own data)

[ ]:
from deeppavlov import build_model, configs

ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True, install=True)

Inference

[ ]:
result = ranker(['Who is Ivan Pavlov?'])
print(result[0][:5])
[18155097, 628663, 17123727, 628662, 19097375]

Russian

[ ]:
from deeppavlov import build_model, configs

ranker = build_model(configs.doc_retrieval.ru_ranker_tfidf_wiki, download=True, install=True)

[ ]:
result = ranker(['Когда произошла Куликовская битва?'])
print(result[0][:5])
[4902620, 1900377, 11129584, 1720563, 1720658]

The text for the output titles can then be extracted with the deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab class.
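As a rough illustration of what such a lookup involves, the sketch below maps ranked ids back to stored text using only the standard sqlite3 module. It does not use WikiSQLiteVocab and makes no claim about its interface. The `documents` table, its `id` and `text` columns, the toy rows, and the helper `fetch_texts` are all hypothetical.

```python
import sqlite3

# Hypothetical schema (assumption): one table mapping a document id to its text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO documents (id, text) VALUES (?, ?)",
    [(1, "first toy document"), (2, "second toy document")],
)
conn.commit()


def fetch_texts(conn, doc_ids):
    """Return the stored text for each id, keeping the ranker's order.

    Ids missing from the database yield None instead of raising.
    """
    texts = []
    for doc_id in doc_ids:
        row = conn.execute(
            "SELECT text FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        texts.append(row[0] if row is not None else None)
    return texts


# Ids are looked up in ranked order, so the most relevant text comes first.
print(fetch_texts(conn, [2, 1]))
```

Looking ids up one at a time keeps the ranked order intact. A single `WHERE id IN (...)` query would be faster for long lists but returns rows in arbitrary order, so the results would need to be reordered afterwards.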

4.2 Predict using CLI

You can also get predictions interactively through the CLI (Command Line Interface).

[ ]:
! python -m deeppavlov interact en_ranker_tfidf_wiki -d

5. Customize the model

5.1 Fit on Wikipedia

Run the following to fit the ranker on English Wikipedia:

[ ]:
!python -m deeppavlov train en_ranker_tfidf_wiki

Run the following to fit the ranker on Russian Wikipedia:

[ ]:
!python -m deeppavlov train ru_ranker_tfidf_wiki

Training the ranker produces a SQLite database and a tf-idf matrix.

5.2 Download, parse new Wikipedia dump, build database and index

enwiki.db SQLite database consists of ~21 M Wikipedia articles and is built by the following steps:

enwiki_tfidf_matrix.npz is the tf-idf matrix for the full English Wikipedia. Its size is hash_size x number of documents, that is, \(2^{24}\) x 21 M. The matrix is built with the deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer class.

ruwiki.db SQLite database consists of ~12 M Wikipedia articles and is built by the following steps:

ruwiki_tfidf_matrix.npz is the tf-idf matrix for the full Russian Wikipedia. Its size is hash_size x number of documents, that is, \(2^{24}\) x 12 M. The matrix is built with the deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer class.
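The hash_size x number of documents shape comes from the hashing trick: each token is mapped to one of \(2^{24}\) row indices, so the number of rows is fixed regardless of vocabulary size. The sketch below shows that mapping for a single document column. It is only an illustration, not the HashingTfIdfVectorizer code: the MD5-based hash, the whitespace tokenizer, and the names `hash_token` and `hashed_column` are assumptions, and the idf weighting is left out for brevity.

```python
import hashlib
import math
from collections import Counter

HASH_SIZE = 2 ** 24  # number of matrix rows, as quoted above


def hash_token(token, hash_size=HASH_SIZE):
    # Stable hash via MD5 (assumption; the real vectorizer may hash differently).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_size


def hashed_column(text, hash_size=HASH_SIZE):
    """Sparse column for one document: {row_index: log(1 + term count)}."""
    rows = Counter(hash_token(tok, hash_size) for tok in text.lower().split())
    return {row: math.log1p(count) for row, count in rows.items()}


column = hashed_column("pavlov studied reflexes and pavlov studied dogs")
print(len(column), "non-zero rows out of", HASH_SIZE)

# With ~21 M documents, a dense matrix of this shape would be impractical,
# which is why only the non-zero entries are stored.
dense_entries = HASH_SIZE * 21_000_000
print("dense entries:", dense_entries)
```

Each document touches only as many rows as it has distinct tokens, so the matrix is extremely sparse. Distinct tokens can collide on the same row, which is the accepted trade-off of a fixed hash size.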