deeppavlov.models.doc_retrieval

Document retrieval classes.

class deeppavlov.models.doc_retrieval.tfidf_ranker.TfidfRanker(vectorizer: deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer, top_n=5, active: bool = True, **kwargs)

Rank documents according to input strings.

Parameters:
  • vectorizer – an instance of the vectorizer class
  • top_n – the number of doc ids to return
  • active – whether to return only the top_n ids (True) or all ids (False)
top_n

a number of doc ids to return

vectorizer

an instance of vectorizer class

active

whether to return a number specified by top_n or all ids

index2doc

the inverse of doc_index (maps an index back to its doc id)

iterator

a dataset iterator used for generating batches while fitting the vectorizer

__call__(questions: List[str]) → Tuple[List[Any], List[float]]

Rank documents and return the top n doc ids with their scores.

Parameters: questions – list of queries used in ranking
Returns: a tuple of selected doc ids and their scores
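
The selection step controlled by top_n and active can be illustrated with a minimal sketch. It is not the library's implementation: select_top_docs is a hypothetical helper, the scores are assumed to be precomputed for a single query (the real ranker derives them from the vectorizer), and returning all ids in sorted order when active is False is an assumption of this sketch.

```python
from typing import Any, Dict, List, Tuple


def select_top_docs(scores: List[float],
                    index2doc: Dict[int, Any],
                    top_n: int = 5,
                    active: bool = True) -> Tuple[List[Any], List[float]]:
    """Hypothetical helper: pick doc ids for one query by descending score.

    scores[i] is assumed to be the similarity between the query and the
    document stored at index i; index2doc maps that index to a doc id.
    """
    # Order document indices from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # active=True keeps only the best top_n; active=False keeps every id.
    if active:
        order = order[:top_n]
    return [index2doc[i] for i in order], [scores[i] for i in order]


doc_ids, doc_scores = select_top_docs(
    [0.1, 0.7, 0.3], {0: "Doc A", 1: "Doc B", 2: "Doc C"}, top_n=2)
print(doc_ids, doc_scores)
```

With active=False the same call would return all three ids, still ordered by score.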
class deeppavlov.models.doc_retrieval.logit_ranker.LogitRanker(squad_model: deeppavlov.core.models.component.Component, batch_size: int = 50, sort_noans: bool = False, **kwargs)

Select the best answer using squad model logits. Each item of the input batch (the contexts and questions for one query) is split into several smaller batches; each smaller batch is sent to the squad model separately, and a single best answer is selected for each item.

Parameters:
  • squad_model – a loaded squad model
  • batch_size – batch size to use with squad model
  • sort_noans – whether to downgrade noans (no-answer) tokens among the most probable answers
squad_model

a loaded squad model

batch_size

batch size to use with squad model

__call__(contexts_batch: List[List[str]], questions_batch: List[List[str]]) → Tuple[List[str], List[float]]

Sort the results obtained from the squad reader by logit and return the answer with the maximum logit.

Parameters:
  • contexts_batch – a batch of contexts which should be treated as a single batch in the outer JSON config
  • questions_batch – a batch of questions which should be treated as a single batch in the outer JSON config
Returns:

a batch of best answers and their scores
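
The chunk-and-select logic described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the Reader type, best_answer and rank_batch are hypothetical names, and the reader is assumed to return one (answer, logit) pair per context, which may differ from the output format of an actual squad model. The sort_noans option is not modelled.

```python
from typing import Callable, List, Tuple

# Hypothetical reader interface: takes parallel lists of contexts and
# questions and returns one (answer, logit) pair per context.
Reader = Callable[[List[str], List[str]], List[Tuple[str, float]]]


def best_answer(reader: Reader, contexts: List[str], questions: List[str],
                batch_size: int = 50) -> Tuple[str, float]:
    """Run the reader in chunks and return the pair with the highest logit."""
    results: List[Tuple[str, float]] = []
    # Feed the reader at most batch_size contexts at a time.
    for start in range(0, len(contexts), batch_size):
        results.extend(reader(contexts[start:start + batch_size],
                              questions[start:start + batch_size]))
    # The default for an empty input is a choice made for this sketch only.
    return max(results, key=lambda pair: pair[1],
               default=("", float("-inf")))


def rank_batch(reader: Reader,
               contexts_batch: List[List[str]],
               questions_batch: List[List[str]],
               batch_size: int = 50) -> Tuple[List[str], List[float]]:
    """Return one best answer and its logit for every item of the outer batch."""
    answers: List[str] = []
    scores: List[float] = []
    for contexts, questions in zip(contexts_batch, questions_batch):
        answer, score = best_answer(reader, contexts, questions, batch_size)
        answers.append(answer)
        scores.append(score)
    return answers, scores
```

Any callable with the assumed signature can stand in for the reader, which makes the selection logic easy to check without loading a model.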

class deeppavlov.models.doc_retrieval.pop_ranker.PopRanker(pop_dict_path: str, load_path: str, top_n: int = 3, active: bool = True, **kwargs)

Rank documents according to their tfidf scores and popularities. It is not a standalone ranker; it should be used to re-rank the results of the TF-IDF ranker.

Based on a logistic regression classifier trained on three features:

  • tfidf score of the article
  • popularity of the article, obtained via the Wikimedia REST API as the mean number of views over the period from 2017/11/05 to 2018/11/05
  • the product of the two features above
Parameters:
  • pop_dict_path – path to a JSON file that maps article titles to article popularity
  • load_path – path to the saved logistic regression classifier
  • top_n – the number of doc ids to return
  • active – whether to return only the top_n ids (True) or all ids (False)
pop_dict

a map of article titles to their popularity

mean_pop

mean of all popularity values in pop_dict; used as a fallback when an article's popularity is not found

clf

a loaded logistic regression classifier

top_n

a number of doc ids to return

active

whether to return a number specified by top_n or all ids

__call__(input_doc_ids: List[List[Any]], input_doc_scores: List[List[float]]) → Tuple[List[List], List[List]]

Take the doc ids and scores produced by the tfidf ranker, re-rank them by applying the logistic regression classifier, and output the pop ranker's doc ids and scores.

Parameters:
  • input_doc_ids – top input doc ids of tfidf ranker
  • input_doc_scores – top input doc scores of tfidf ranker corresponding to doc ids
Returns: top doc ids of pop ranker and their corresponding scores
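
The re-ranking described for this class can be sketched for a single query as below. This is a rough illustration, not the library's implementation: rerank and sigmoid are hypothetical helpers, and the weights and bias are made-up stand-ins for the parameters of the saved classifier loaded from load_path. Only the three features and the mean_pop fallback come from the description above.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rerank(doc_ids: List[Any], doc_scores: List[float],
           pop_dict: Dict[Any, float], mean_pop: float,
           weights: Sequence[float], bias: float,
           top_n: int = 3, active: bool = True) -> Tuple[List[Any], List[float]]:
    """Hypothetical helper: re-rank one query's tfidf results by popularity."""
    scored = []
    for doc_id, tfidf in zip(doc_ids, doc_scores):
        # Fall back to the mean popularity when the article is unknown.
        pop = pop_dict.get(doc_id, mean_pop)
        # The three features: tfidf score, popularity, and their product.
        features = (tfidf, pop, tfidf * pop)
        z = bias + sum(w * x for w, x in zip(weights, features))
        scored.append((doc_id, sigmoid(z)))
    # Highest classifier probability first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if active:
        scored = scored[:top_n]
    return [d for d, _ in scored], [s for _, s in scored]


ids, scores = rerank(["A", "B", "C"], [0.5, 0.4, 0.45],
                     pop_dict={"A": 10.0, "B": 1000.0}, mean_pop=100.0,
                     weights=(1.0, 0.001, 0.0), bias=0.0, top_n=2)
print(ids, scores)
```

In this example "C" has no entry in pop_dict, so it is scored with mean_pop. The actual __call__ operates on batches, so a loop over the outer lists would wrap a helper like this one.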