TF-IDF Ranking¶
Table of contents¶
1. Introduction to the task¶
This is an implementation of a passage ranker based on tf-idf vectorization. The ranker is based on the DrQA project. The default ranker takes a batch of queries as input and returns, for each query, 100 passage titles sorted by relevance.
2. Get started with the model¶
First make sure you have the DeepPavlov Library installed. More info about the first installation.
[ ]:
!pip install -q deeppavlov
Then make sure that all the required packages for the model are installed.
[ ]:
!python -m deeppavlov install en_ranker_tfidf_wiki
en_ranker_tfidf_wiki is the name of the model's config_file. What is a Config File?
There are alternative ways to install the model's packages that do not require executing a separate command – see the options in the next sections of this page. The full list of models for TF-IDF ranking with their config names can be found in the table below.
3. Models list¶
Config | Language | Description | RAM
---|---|---|---
doc_retrieval/en_ranker_tfidf_wiki.json | En | Config for TF-IDF ranking over Wikipedia | 2.9 GB
doc_retrieval/en_ranker_pop_wiki.json | En | Config for TF-IDF ranking, followed by popularity ranking, over Wikipedia | 8.1 GB
doc_retrieval/ru_ranker_tfidf_wiki.json | Ru | Config for TF-IDF ranking over Wikipedia | 8.4 GB
4. Use the model for prediction¶
4.1 Predict using Python¶
English¶
Building (if you don’t have your own data)
[ ]:
from deeppavlov import build_model, configs
ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True, install=True)
Inference
[ ]:
result = ranker(['Who is Ivan Pavlov?'])
print(result[0][:5])
[18155097, 628663, 17123727, 628662, 19097375]
Russian¶
[ ]:
from deeppavlov import build_model, configs
ranker = build_model(configs.doc_retrieval.ru_ranker_tfidf_wiki, download=True, install=True)
[ ]:
result = ranker(['Когда произошла Куликовская битва?'])
print(result[0][:5])
[4902620, 1900377, 11129584, 1720563, 1720658]
Text for the output titles can be further extracted with the deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab class.
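For illustration, here is a minimal sketch of such extraction (the database path is an assumption and depends on where the ranker's data was downloaded, typically under ~/.deeppavlov/downloads/; use enwiki.db for the English ranker and ruwiki.db for the Russian one):
[ ]:
from deeppavlov.vocabs.wiki_sqlite import WikiSQLiteVocab

# Assumed location of the Wikipedia SQLite database downloaded together with the English ranker;
# adjust the path to wherever enwiki.db actually resides on your machine.
wiki = WikiSQLiteVocab(load_path='~/.deeppavlov/downloads/odqa/enwiki.db')

# The vocab takes a batch of title lists, as returned by the ranker,
# and returns the corresponding article texts.
titles_batch = [[18155097, 628663, 17123727, 628662, 19097375]]  # English ranker output from above
texts = wiki(titles_batch)
print(texts[0][:300])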
4.2 Predict using CLI¶
You can also get predictions in an interactive mode through the CLI (Command Line Interface).
[ ]:
!python -m deeppavlov interact en_ranker_tfidf_wiki -d
5. Customize the model¶
5.1 Fit on Wikipedia¶
Run the following to fit the ranker on English Wikipedia:
[ ]:
!python -m deeppavlov train en_ranker_tfidf_wiki
Run the following to fit the ranker on Russian Wikipedia:
[ ]:
!python -m deeppavlov train ru_ranker_tfidf_wiki
As a result of ranker training, a SQLite database and a tf-idf matrix are created.
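If you prefer to stay in Python, the fitting can also be launched with deeppavlov.train_model; a sketch mirroring the prediction example above:
[ ]:
from deeppavlov import train_model, configs

# Fit the TF-IDF ranker on English Wikipedia; download=True fetches the Wikipedia data first.
ranker = train_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)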
5.2 Download, parse new Wikipedia dump, build database and index¶
The enwiki.db SQLite database consists of ~21 M Wikipedia articles and is built by the following steps:
Download a Wikipedia dump file. We took the latest enwiki dump.
Unpack and extract the articles with WikiExtractor (with the --json, --no-templates, --filter_disambig_pages options); a sketch of the command is shown right after these steps.
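A sketch of the extraction command (the dump file name and output directory are placeholders, and the exact invocation depends on your WikiExtractor version):
[ ]:
!python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 \
    --json --no-templates --filter_disambig_pages -o extracted_wiki/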
enwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of size hash_size x number of documents, i.e. \(2^{24}\) x ~21 M. The matrix is built with the deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer class.
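To illustrate the idea behind such a hashing tf-idf index, here is an analogy built with scikit-learn's HashingVectorizer rather than DeepPavlov's HashingTfIdfVectorizer itself:
[ ]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = ["Ivan Pavlov was a Russian physiologist.",
        "The Battle of Kulikovo took place in 1380."]

# Hash token n-grams into a fixed number of buckets (2**24, as in the matrix above),
# so the vocabulary never has to be stored explicitly.
hasher = HashingVectorizer(n_features=2 ** 24, ngram_range=(1, 2), norm=None, alternate_sign=False)
counts = hasher.transform(docs)

# Re-weight the hashed counts with tf-idf; the result is a sparse matrix of shape
# (number of documents) x (hash_size), i.e. the transpose of the layout described above.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (2, 16777216)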
The ruwiki.db SQLite database consists of ~12 M Wikipedia articles and is built by the following steps:
Download a Wikipedia dump file. We took the latest ruwiki dump.
Unpack and extract the articles with WikiExtractor (with the --json, --no-templates, --filter_disambig_pages options).
ruwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of size hash_size x number of documents, i.e. \(2^{24}\) x ~12 M. The matrix is built with the deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer class.