TF-IDF Ranker

This is an implementation of a document ranker based on tf-idf vectorization, following the DrQA [1] project. By default, the ranker takes a batch of queries as input and returns the ids of the 5 most relevant documents as output.

:: Who is Ivan Pavlov?
>> ['Ivan Pavlov (lawyer)', 'Ivan Pavlov', 'Pavlovian session', 'Ivan Pavlov (film)', 'Vladimir Bekhterev']

The text for the returned ids can then be extracted with the WikiSQLiteVocab class.
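
For illustration, here is a minimal sketch of pulling article text straight from the database with Python's sqlite3 module. It assumes a DrQA-style schema (a documents table with id and text columns); the table and column names are an assumption, so verify them against your database first.

import sqlite3

def fetch_texts(db_path, doc_ids):
    """Return the raw text for each document id (None if missing).

    Assumes a DrQA-style schema: a `documents` table with
    `id` and `text` columns -- an assumption, not a guarantee.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        texts = []
        for doc_id in doc_ids:
            cur.execute("SELECT text FROM documents WHERE id = ?", (doc_id,))
            row = cur.fetchone()
            texts.append(row[0] if row else None)
        return texts
    finally:
        conn.close()

print(fetch_texts("enwiki.db", ["Ivan Pavlov"]))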

Configuration

The default ranker config for English is doc_retrieval/en_ranker_tfidf_wiki.json

The default ranker config for Russian is doc_retrieval/ru_ranker_tfidf_wiki.json

Running the Ranker

Note

Training the ranker and running inference with it requires ~16 GB of RAM.

Training

Run the following to fit the ranker on English Wikipedia:

cd deeppavlov/
python deep.py train configs/doc_retrieval/en_ranker_tfidf_wiki.json

Run the following to fit the ranker on Russian Wikipedia:

cd deeppavlov/
python deep.py train configs/doc_retrieval/ru_ranker_tfidf_wiki.json
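
Training can also be launched from Python. A minimal sketch, assuming the train_model helper exported by recent DeepPavlov releases (check the API of your installed version):

from deeppavlov import train_model

# Fit the English ranker; this builds the SQLite database
# and the tf-idf matrix described below.
ranker = train_model('configs/doc_retrieval/en_ranker_tfidf_wiki.json', download=True)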

Interacting

When run interactively, the ranker returns the titles of the relevant documents.

Run the following to interact with the English ranker:

cd deeppavlov/
python deep.py interact configs/doc_retrieval/en_ranker_tfidf_wiki.json -d

Run the following to interact with the Russian ranker:

cd deeppavlov/
python deep.py interact configs/doc_retrieval/ru_ranker_tfidf_wiki.json -d
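
The trained ranker can likewise be loaded and queried from Python. A minimal sketch, assuming the build_model helper exported by recent DeepPavlov releases:

from deeppavlov import build_model

# download=True fetches the pretrained DB and tf-idf matrix if missing.
ranker = build_model('configs/doc_retrieval/en_ranker_tfidf_wiki.json', download=True)

# The ranker takes a batch of queries and returns document ids per query.
print(ranker(['Who is Ivan Pavlov?']))
# e.g. [['Ivan Pavlov (lawyer)', 'Ivan Pavlov', ...]]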

As a result of ranker training, a SQLite database and a tf-idf matrix are created.

Available Data and Pretrained Models

The Wikipedia DB and pretrained tf-idf matrices are downloaded to the deeppavlov/download/odqa folder by default.

enwiki.db

The enwiki.db SQLite database contains 5,159,530 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest enwiki dump (from 2018-02-11).
  2. Unpack and extract the articles with WikiExtractor [2] (using the --json, --no-templates and --filter_disambig_pages options); see the example invocation after this list.
  3. Build the database during training (see the Training section above).
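
For reference, an invocation with these options might look as follows (the dump filename and output folder here are placeholders):

python WikiExtractor.py enwiki-20180211-pages-articles.xml.bz2 \
    --json --no-templates --filter_disambig_pages -o extracted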

enwiki_tfidf_matrix.npz

enwiki_tfidf_matrix.npz is a tf-idf matrix over the full English Wikipedia, of size hash_size x number of documents, i.e. 2**24 x 5,180,368. The matrix is built with the HashingTfIdfVectorizer class.
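
The hashing trick maps each token (unigrams and bigrams, as in DrQA) to a column index by hashing modulo 2**24, so no explicit vocabulary has to be stored. A rough sketch of the same idea using scikit-learn, purely for illustration (DeepPavlov's HashingTfIdfVectorizer is a separate implementation):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = ["Pavlov studied classical conditioning in dogs.",
        "The dog salivated at the sound of the bell."]

# Hash uni- and bigrams into 2**24 buckets, then apply tf-idf weighting.
hasher = HashingVectorizer(n_features=2 ** 24, ngram_range=(1, 2),
                           norm=None, alternate_sign=False)
transformer = TfidfTransformer()
doc_tfidf = transformer.fit_transform(hasher.transform(docs))

# Rank documents for a query by dot product in the shared tf-idf space.
query_tfidf = transformer.transform(hasher.transform(["Who is Pavlov?"]))
scores = doc_tfidf.dot(query_tfidf.T).toarray().ravel()
print(scores.argmax())  # index of the best-matching document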

ruwiki.db

The ruwiki.db SQLite database contains 1,463,888 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest ruwiki dump (from 2018-04-01).
  2. Unpack and extract the articles with WikiExtractor (using the --json, --no-templates and --filter_disambig_pages options), as for enwiki.
  3. Build the database during training (see the Training section above).

ruwiki_tfidf_matrix.npz

ruwiki_tfidf_matrix.npz is a tf-idf matrix over the full Russian Wikipedia, of size hash_size x number of documents, i.e. 2**24 x 1,463,888. The matrix is built with the HashingTfIdfVectorizer class.
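
To inspect a downloaded matrix, it can be rebuilt as a SciPy CSR matrix, assuming the .npz stores the CSR components under data, indices, indptr and shape keys (these key names are an assumption; list the file's contents with numpy.load(...).files first):

import numpy as np
import scipy.sparse as sp

loader = np.load("ruwiki_tfidf_matrix.npz", allow_pickle=True)
# Key names assumed; adjust to what the file actually contains.
matrix = sp.csr_matrix((loader["data"], loader["indices"], loader["indptr"]),
                       shape=tuple(loader["shape"]))
print(matrix.shape)  # expected: (2**24, 1463888)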

Comparison

Scores for the TF-IDF Ranker model:

Model        Dataset       Wiki dump            Recall (top 5)
DeepPavlov   SQuAD (dev)   enwiki (2018-02-11)  75.6
DrQA [1]     SQuAD (dev)   enwiki (2016-12-21)  77.8
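
Recall (top 5) follows the DrQA evaluation: the percentage of SQuAD dev questions for which at least one of the top-5 retrieved documents contains the ground-truth answer. A minimal sketch of the computation (simplified to id matching for brevity; DrQA actually checks whether the answer string occurs in the retrieved text):

def recall_at_k(retrieved, relevant, k=5):
    """Percentage of queries with at least one relevant doc in the top k.

    retrieved: list of ranked doc-id lists, one per query
    relevant:  list of sets of doc ids that answer each query
    (Simplified: DrQA matches answer strings against document text.)
    """
    hits = sum(bool(set(docs[:k]) & rel)
               for docs, rel in zip(retrieved, relevant))
    return 100.0 * hits / len(retrieved)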