TF-IDF Ranker

This is an implementation of a document ranker based on tf-idf vectorization. It is based on the DrQA [1] project. The default ranker implementation takes a batch of queries as input and returns 5 document ids as output.

:: Who is Ivan Pavlov?
>> ['Ivan Pavlov (lawyer)', 'Ivan Pavlov', 'Pavlovian session', 'Ivan Pavlov (film)', 'Vladimir Bekhterev']

The text for the returned ids can then be extracted with the WikiSQLiteVocab class.
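As a rough illustration of that lookup step, the sketch below reads article text straight from an SQLite database using Python's sqlite3 module. It does not use the WikiSQLiteVocab API. The table name `documents` and its `id` and `text` columns are assumptions modelled on a DrQA-style database, and `fetch_texts` is a hypothetical helper.

```python
import sqlite3


def fetch_texts(conn, doc_ids):
    """Return the article text for each id, or None when an id is missing."""
    texts = []
    for doc_id in doc_ids:
        row = conn.execute(
            "SELECT text FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        texts.append(row[0] if row else None)
    return texts


# Tiny in-memory stand-in for the real database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id PRIMARY KEY, text)")
conn.execute(
    "INSERT INTO documents VALUES (?, ?)",
    ("Ivan Pavlov", "Ivan Petrovich Pavlov was a Russian physiologist."),
)
conn.commit()

print(fetch_texts(conn, ["Ivan Pavlov", "Missing page"]))
```

To try this against a downloaded database, replace `":memory:"` with the path to the .db file and skip the CREATE and INSERT statements, after checking that the schema matches the assumption.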

Config

The default ranker config for the English language is ranking/en_ranker_tfidf_wiki.json

The default ranker config for the Russian language is ranking/ru_ranker_tfidf_wiki.json

Config Structure

  • dataset_iterator - downloads the Wikipedia DB and creates batches for ranker fitting
    • data_dir - the directory to download the DB to
    • data_url - a URL to download the Wikipedia DB from
    • shuffle - whether to shuffle when iterating over the DB
  • chainer - pipeline manager
    • in - pipeline input data (questions)
    • out - pipeline output data (Wikipedia article ids)
  • tfidf_ranker - the ranker class
    • top_n - the number of documents to return (when top_n=1, only the most relevant document is returned)
    • in - ranker input data (queries)
    • out - ranker output data (Wikipedia article ids)
    • fit_on_batch - passes the fitting call through to the vectorizer
    • vectorizer - a vectorizer class
      • fit_on_batch - fit the vectorizer on batches of Wikipedia articles
      • save_path - a path to serialize a vectorizer to
      • load_path - a path to load a vectorizer from
      • tokenizer - a tokenizer class
        • lemmas - whether to lemmatize tokens or not
        • ngram_range - ngram range for vectorizer features
  • train - parameters for vectorizer fitting
    • validate_best - ignored; any value is accepted
    • test_best - ignored; any value is accepted
    • batch_size - how many Wikipedia articles the dataset_iterator should return in a single batch
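To make the nesting of the list above easier to see, here is a minimal sketch that mirrors it as a Python dictionary and prints it as JSON. Only the key names come from the list. Every value is a placeholder or an assumption (apart from top_n=5, which matches the default of 5 returned ids), the shipped config files may contain additional keys, and `key_paths` is a hypothetical helper.

```python
import json

# Placeholder values only; the key names and nesting mirror the list above.
config_sketch = {
    "dataset_iterator": {
        "data_dir": "<directory for the DB>",
        "data_url": "<URL of the Wikipedia DB>",
        "shuffle": False,
    },
    "chainer": {"in": ["questions"], "out": ["doc_ids"]},
    "tfidf_ranker": {
        "top_n": 5,
        "in": ["questions"],
        "out": ["doc_ids"],
        "fit_on_batch": ["questions"],  # placeholder value
        "vectorizer": {
            "fit_on_batch": ["articles"],  # placeholder value
            "save_path": "<path to write the vectorizer>",
            "load_path": "<path to read the vectorizer>",
            "tokenizer": {"lemmas": True, "ngram_range": [1, 2]},  # assumed values
        },
    },
    "train": {"validate_best": False, "test_best": False, "batch_size": 1000},
}


def key_paths(node, prefix=""):
    """Yield a dotted path for every key in a nested dict."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from key_paths(value, path)


print(json.dumps(config_sketch, indent=2))
```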

Running the Ranker

Note

Training the ranker and running inference with it each require ~16 GB of RAM.

Training

Run the following to fit the ranker on English Wikipedia:

cd deeppavlov/
python deep.py train deeppavlov/configs/ranking/en_ranker_tfidf_wiki.json

Run the following to fit the ranker on Russian Wikipedia:

cd deeppavlov/
python deep.py train deeppavlov/configs/ranking/ru_ranker_tfidf_wiki.json

Interacting

In interactive mode, the ranker returns the titles of the relevant documents.

Run the following to interact with the English ranker:

cd deeppavlov/
python deep.py interact deeppavlov/configs/ranking/en_ranker_tfidf_wiki.json -d

Run the following to interact with the Russian ranker:

cd deeppavlov/
python deep.py interact deeppavlov/configs/ranking/ru_ranker_tfidf_wiki.json -d

Available Data and Pretrained Models

The Wikipedia DB and the pretrained tf-idf matrices are downloaded to the deeppavlov/download/odqa folder by default.

enwiki.db

The enwiki.db SQLite database contains 5159530 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest enwiki dump available at the time (2018-02-11).
  2. Unpack and extract the articles with WikiExtractor [2] (with the --json, --no-templates, --filter_disambig_pages options).
  3. Build a database with the DrQA script.
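The last step can be pictured roughly as follows. This is not the DrQA script itself, only a sketch under stated assumptions: each input line is one JSON object with `title` and `text` fields (the shape expected from the --json extraction), the article title serves as the document id, and the table layout (`documents` with `id` and `text`) is assumed. `build_db` is a hypothetical helper.

```python
import json
import sqlite3


def build_db(json_lines, conn):
    """Insert one (title, text) row per JSON line; return the number of rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents (id PRIMARY KEY, text)")
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        page = json.loads(line)
        rows.append((page["title"], page["text"]))
    conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Two made-up records standing in for extracted articles.
sample_lines = [
    '{"id": "1", "url": "u1", "title": "Ivan Pavlov", "text": "Physiologist."}',
    "",
    '{"id": "2", "url": "u2", "title": "Paris", "text": "Capital of France."}',
]
db = sqlite3.connect(":memory:")
print(build_db(sample_lines, db))
```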

enwiki_tfidf_matrix.npz

enwiki_tfidf_matrix.npz is the full Wikipedia tf-idf matrix of size hash_size x number of documents, that is, 2**24 x 5159530. The matrix is built with the HashingTfIdfVectorizer class.
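To show why the matrix has hash_size rows rather than one row per vocabulary entry, here is a small pure-Python sketch of hashed tf-idf with top-n ranking. It is not the HashingTfIdfVectorizer implementation. The md5-based hash, the regex tokenizer, the log-scaled term frequency, the smoothed idf formula, and the dict-of-dicts storage (instead of a sparse matrix file) are all assumptions chosen for brevity.

```python
import hashlib
import math
import re
from collections import Counter

HASH_SIZE = 2 ** 24  # number of rows (hash buckets), as stated above


def bucket(token):
    """Map a token to a row index in [0, HASH_SIZE) with a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % HASH_SIZE


def count_buckets(text):
    """Count how often each hash bucket occurs in a text."""
    return Counter(bucket(tok) for tok in re.findall(r"\w+", text.lower()))


def build_matrix(docs):
    """Return ({row: {col: weight}}, {row: idf}) for a list of documents."""
    counts = [count_buckets(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    # Smoothed idf; one common choice, assumed here.
    idf = {row: math.log((n_docs + 1) / (n + 1)) + 1.0 for row, n in doc_freq.items()}
    matrix = {}
    for col, c in enumerate(counts):
        for row, tf in c.items():
            matrix.setdefault(row, {})[col] = math.log1p(tf) * idf[row]
    return matrix, idf


def rank(query, matrix, idf, top_n=5):
    """Score documents by the dot product with the query vector."""
    scores = Counter()
    for row, tf in count_buckets(query).items():
        q_weight = math.log1p(tf) * idf.get(row, 0.0)
        for col, weight in matrix.get(row, {}).items():
            scores[col] += q_weight * weight
    return [col for col, _ in scores.most_common(top_n)]


docs = [
    "Ivan Pavlov was a physiologist",
    "A film about a lawyer",
    "The capital of France is Paris",
]
matrix, idf = build_matrix(docs)
print(rank("Who is Ivan Pavlov?", matrix, idf))
```

Because tokens are hashed, the number of rows is fixed in advance and no vocabulary has to be stored; the price is that two different tokens can occasionally share a row.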

ruwiki.db

The ruwiki.db SQLite database contains 1463888 Wikipedia articles and is built with the following steps:

  1. Download a Wikipedia dump file. We took the latest ruwiki dump available at the time (2018-04-01).
  2. Unpack and extract the articles with WikiExtractor (with the --json, --no-templates, --filter_disambig_pages options).
  3. Build a database with the DrQA script.

ruwiki_tfidf_matrix.npz

ruwiki_tfidf_matrix.npz is the full Wikipedia tf-idf matrix of size hash_size x number of documents, that is, 2**24 x 1463888. The matrix is built with the HashingTfIdfVectorizer class.

Comparison

Scores for TF-IDF Ranker model:

Model       Dataset      Wiki dump            Recall (top 5), %
DeepPavlov  SQuAD (dev)  enwiki (2018-02-11)  75.6
DrQA [1]    SQuAD (dev)  enwiki (2016-12-21)  77.8
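For readers who want to compute a comparable number on their own data, the sketch below shows one way a top-5 recall could be calculated. The evaluation protocol behind the scores above is not described in this section, so the hit criterion used here (a query counts as a hit when any of its gold document ids appears among the top k returned ids) is an assumption, and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(ranked_ids, gold_ids, k=5):
    """Percentage of queries with at least one gold id among the top k results."""
    if not gold_ids:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if set(ranked[:k]) & set(gold):
            hits += 1
    return 100.0 * hits / len(gold_ids)


# Made-up ranker output and gold labels for two queries.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = [["c"], ["q"]]
print(recall_at_k(ranked, gold, k=5))
```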