This is an implementation of a document ranker based on tf-idf vectorization. The ranker implementation is based on the DrQA project. The default ranker implementation takes a batch of queries as input and returns 5 document ids as output.
:: Who is Ivan Pavlov? >> ['Ivan Pavlov (lawyer)', 'Ivan Pavlov', 'Pavlovian session', 'Ivan Pavlov (film)', 'Vladimir Bekhterev']
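To make the ranking idea concrete, here is a minimal sketch of a hashed tf-idf ranker in plain Python. It is not the DeepPavlov or DrQA code: the function names, the crc32-based hashing, the tokenizer, and the particular tf and idf formulas (`log(1 + count)` and `log(1 + N / df)`) are all assumptions chosen for illustration.

```python
import math
import re
import zlib
from collections import Counter

HASH_SIZE = 2 ** 24  # number of hash buckets (assumed value for this sketch)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def bucket(token):
    """Map a token to a stable bucket index (crc32 is an illustrative choice)."""
    return zlib.crc32(token.encode("utf-8")) % HASH_SIZE


def fit(docs):
    """Build tf-idf vectors (as dicts) for a {doc_id: text} mapping."""
    doc_ids = list(docs)
    counts = [Counter(bucket(t) for t in tokenize(docs[d])) for d in doc_ids]
    # Document frequency: in how many documents each bucket occurs.
    df = Counter(b for c in counts for b in c)
    n_docs = len(doc_ids)
    idf = {b: math.log(1 + n_docs / f) for b, f in df.items()}
    vectors = [{b: math.log1p(c) * idf[b] for b, c in cnt.items()} for cnt in counts]
    return doc_ids, vectors, idf


def rank(queries, index, top_n=5):
    """Return up to top_n document ids for every query in the batch."""
    doc_ids, vectors, idf = index
    results = []
    for query in queries:
        q_counts = Counter(bucket(t) for t in tokenize(query))
        # Buckets never seen during fitting get zero weight.
        q_vec = {b: math.log1p(c) * idf.get(b, 0.0) for b, c in q_counts.items()}
        scores = [sum(w * vec.get(b, 0.0) for b, w in q_vec.items()) for vec in vectors]
        order = sorted(range(len(doc_ids)), key=lambda i: -scores[i])
        results.append([doc_ids[i] for i in order[:top_n]])
    return results


docs = {
    "Ivan Pavlov": "Ivan Pavlov was a Russian physiologist known for classical conditioning",
    "Paris": "Paris is the capital of France",
    "Python": "Python is a programming language",
}
index = fit(docs)
print(rank(["Who is Ivan Pavlov?"], index))
```

On a real Wikipedia dump the per-document dicts would be replaced by a sparse matrix, since a dense matrix with one row per hash bucket would not fit in memory.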
Text for the output ids can be further extracted with
The default ranker config for English is doc_retrieval/en_ranker_tfidf_wiki.json
The default ranker config for Russian is doc_retrieval/ru_ranker_tfidf_wiki.json
Running the Ranker
Training the ranker and running inference with it require ~16 GB of RAM.
Run the following to fit the ranker on English Wikipedia:
cd deeppavlov/
python deep.py train configs/doc_retrieval/en_ranker_tfidf_wiki.json
Run the following to fit the ranker on Russian Wikipedia:
cd deeppavlov/
python deep.py train configs/doc_retrieval/ru_ranker_tfidf_wiki.json
In interactive mode, the ranker returns the titles of the relevant documents.
Run the following to interact with the English ranker:
cd deeppavlov/
python deep.py interact configs/doc_retrieval/en_ranker_tfidf_wiki.json -d
Run the following to interact with the Russian ranker:
cd deeppavlov/
python deep.py interact configs/doc_retrieval/ru_ranker_tfidf_wiki.json -d
As a result of ranker training, a SQLite database and a tf-idf matrix are created.
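The role of the SQLite database can be illustrated with a short sketch that stores documents and looks up their text by id using Python's standard `sqlite3` module. The table name `documents`, its `id` and `text` columns, and the helper names are assumptions made for this example and may not match the schema that training actually produces.

```python
import sqlite3


def build_db(conn, docs):
    """Create a hypothetical documents table and fill it from {doc_id: text}."""
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?)", docs.items())
    conn.commit()


def get_doc_text(conn, doc_id):
    """Return the text stored for doc_id, or None if the id is unknown."""
    row = conn.execute(
        "SELECT text FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
build_db(conn, {"Ivan Pavlov": "Ivan Pavlov was a Russian physiologist."})
print(get_doc_text(conn, "Ivan Pavlov"))
```

Keeping the article text in a database and only the tf-idf weights in the matrix means the ranker can score documents without loading their full text.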
Available Data and Pretrained Models
The Wikipedia DB and pretrained tf-idf matrices are downloaded to the deeppavlov/download/odqa folder by default.
enwiki.db SQLite database consists of 5159530 Wikipedia articles and is built by the following steps:
enwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of size hash_size x number of documents, which is 2**24 x 5180368. This matrix is built with
ruwiki.db SQLite database consists of 1463888 Wikipedia articles and is built by the following steps:
Scores for TF-IDF Ranker model:
| Model      | Dataset     | Wiki dump           | Recall (top 5), % |
|------------|-------------|---------------------|-------------------|
| DeepPavlov | SQuAD (dev) | enwiki (2018-02-11) | 75.6              |
| DrQA       | SQuAD (dev) | enwiki (2016-12-21) | 77.8              |
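For reference, a top-k recall figure of this kind can be computed along the following lines. This is a sketch under an assumption: a query counts as a hit when at least one of its relevant documents appears among the top k ranked ids. The exact criterion used to produce the scores in the table may differ, and the function name is hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Percentage of queries with at least one relevant id in their top k.

    ranked_ids:   list of ranked id lists, one per query
    relevant_ids: list of sets of relevant ids, aligned with ranked_ids
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("ranked_ids and relevant_ids must have the same length")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


ranked = [["Ivan Pavlov (lawyer)", "Ivan Pavlov"], ["Paris", "France"]]
relevant = [{"Ivan Pavlov"}, {"Berlin"}]
print(recall_at_k(ranked, relevant, k=5))
```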