This is an implementation of a document ranker based on tf-idf vectorization. The ranker implementation is based on the DrQA [1] project. The default ranker implementation takes a batch of queries as input and returns 25 document titles sorted by relevance.
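To make the idea concrete, here is a minimal pure-Python sketch of hashed tf-idf ranking: tokens are hashed into a fixed number of buckets, each document becomes a sparse tf-idf vector, and documents are ordered by their dot product with the query vector. It is an illustration only, not the DeepPavlov or DrQA code; the function names, the CRC32 hash, the small `HASH_SIZE`, and the log-based weighting are all assumptions chosen for brevity.

```python
import math
import re
import zlib
from collections import Counter

# Illustrative bucket count (assumption); a full-Wikipedia setup would use far more.
HASH_SIZE = 2 ** 16


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bucket(token, hash_size=HASH_SIZE):
    """Map a token to a bucket with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(token.encode("utf-8")) % hash_size


def term_counts(text):
    """Count how often each hash bucket occurs in the text."""
    return Counter(bucket(t) for t in tokenize(text))


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every seen bucket."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(term_counts(text)))
    return {b: math.log((n_docs + 1) / (df + 1)) + 1.0 for b, df in doc_freq.items()}


def vectorize(text, idf):
    """Build a sparse tf-idf vector; buckets unseen during fitting get weight 0."""
    return {b: math.log1p(tf) * idf.get(b, 0.0) for b, tf in term_counts(text).items()}


def rank(queries, docs, top_n=25):
    """Return, for each query, up to top_n titles sorted by descending score."""
    idf = fit_idf(docs)
    doc_vecs = {title: vectorize(text, idf) for title, text in docs.items()}
    results = []
    for query in queries:
        q_vec = vectorize(query, idf)
        scores = {
            title: sum(w * vec.get(b, 0.0) for b, w in q_vec.items())
            for title, vec in doc_vecs.items()
        }
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return results


# Toy corpus (hypothetical data, for illustration only).
EXAMPLE_DOCS = {
    "Ivan Pavlov": "Ivan Pavlov was a Russian physiologist known for classical conditioning",
    "Tf-idf": "tf idf is a numerical statistic used in information retrieval",
    "SQLite": "SQLite is a relational database engine",
}

if __name__ == "__main__":
    print(rank(["Who is Ivan Pavlov?"], EXAMPLE_DOCS, top_n=2))
```

Hashing keeps the vector width fixed regardless of vocabulary size, at the cost of occasional collisions between unrelated tokens.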
Before using the model, make sure that all required packages are installed by running the command:
python -m deeppavlov install en_ranker_tfidf_wiki
Training and building (if you have your own data)
from deeppavlov import configs, train_model
ranker = train_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)
Building (if you don’t have your own data)
from deeppavlov import build_model, configs
ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)
result = ranker(['Who is Ivan Pavlov?'])
>> ['Ivan Pavlov (lawyer)', 'Ivan Pavlov', 'Pavlovian session', 'Ivan Pavlov (film)', 'Vladimir Bekhterev']
Text for the output titles can be further extracted with
The default ranker config for English is doc_retrieval/en_ranker_tfidf_wiki.json
The default ranker config for Russian is doc_retrieval/ru_ranker_tfidf_wiki.json
Running the Ranker
About 16 GB of RAM is required.
Run the following to fit the ranker on English Wikipedia:
python -m deeppavlov train en_ranker_tfidf_wiki
Run the following to fit the ranker on Russian Wikipedia:
python -m deeppavlov train ru_ranker_tfidf_wiki
As a result of ranker training, a SQLite database and tf-idf matrix are created.
When interacting, the ranker returns document titles of the relevant documents.
Run the following to interact with the English ranker:
python -m deeppavlov interact en_ranker_tfidf_wiki -d
Run the following to interact with the Russian ranker:
python -m deeppavlov interact ru_ranker_tfidf_wiki -d
Available Data and Pretrained Models
By default, the Wikipedia DB is downloaded to the
~/.deeppavlov/downloads/odqa folder and the pre-trained tf-idf matrices are downloaded to the
~/.deeppavlov/models/odqa folder.
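Once the data is downloaded, the database file can be inspected with Python's standard sqlite3 module. The sketch below lists every table and its row count without assuming anything about the schema. The `describe_db` helper is hypothetical, and the default path is an assumption that combines the download folder above with the enwiki.db file name.

```python
import sqlite3
from pathlib import Path

# Assumed location: default download folder plus the enwiki.db file name.
DEFAULT_DB = Path.home() / ".deeppavlov" / "downloads" / "odqa" / "enwiki.db"


def describe_db(db_path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    conn = sqlite3.connect(str(db_path))
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()


if __name__ == "__main__":
    # Check for the file first: sqlite3.connect would create an empty one otherwise.
    if DEFAULT_DB.exists():
        print(describe_db(DEFAULT_DB))
    else:
        print(f"{DEFAULT_DB} not found; download the model data first")
```

Counting rows in a database of several million articles can take a while, so this is meant as a one-off sanity check rather than something to run on every start.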
enwiki.db SQLite database consists of 5180368 Wikipedia articles and is built by the following steps:
enwiki_tfidf_matrix.npz is a full Wikipedia tf-idf matrix of
size hash_size x number of documents, which is
2**24 x 5180368. This matrix is built with
ruwiki.db SQLite database consists of 1463888 Wikipedia articles and is built by the following steps:
Scores for TF-IDF Ranker model: