deeppavlov.models.entity_linking

class deeppavlov.models.kbqa.entity_linking.NerChunker(max_chunk_len: int = 300, batch_size: int = 30, **kwargs)[source]

Class to split documents into chunks of at most max_chunk_len symbols so that the length does not exceed the maximal sequence length which can be fed into BERT

__init__(max_chunk_len: int = 300, batch_size: int = 30, **kwargs)[source]
Parameters
  • max_chunk_len – maximal length of chunks into which the document is split

  • batch_size – how many chunks are in batch

__call__(docs_batch: List[str]) → Tuple[List[List[str]], List[List[int]]][source]

This method splits each document in the batch into chunks with the maximal length of max_chunk_len

Parameters

docs_batch – batch of documents

Returns

  • batch of lists of document chunks for each document

  • batch of lists of document numbers which correspond to the chunks
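
A minimal usage sketch based on the signatures above (the sample document is illustrative):

    from deeppavlov.models.kbqa.entity_linking import NerChunker

    chunker = NerChunker(max_chunk_len=300, batch_size=30)
    docs = ["First sentence of a long document. " * 20]  # illustrative input
    chunks_batch, doc_nums_batch = chunker(docs)
    # chunks_batch: for each input document, its chunks of at most 300 symbols
    # doc_nums_batch: for each chunk, the number of the document it was taken from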

class deeppavlov.models.kbqa.entity_linking.EntityLinker(load_path: str, word_to_idlist_filename: str, entities_list_filename: str, entities_ranking_filename: str, vectorizer_filename: str, faiss_index_filename: str, chunker: deeppavlov.models.kbqa.entity_linking.NerChunker = None, ner: deeppavlov.core.common.chainer.Chainer = None, ner_parser: deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser = None, entity_ranker: deeppavlov.models.kbqa.rel_ranking_bert_infer.RelRankerBertInfer = None, num_faiss_candidate_entities: int = 20, num_entities_for_bert_ranking: int = 50, num_faiss_cells: int = 50, use_gpu: bool = True, save_path: str = None, fit_vectorizer: bool = False, max_tfidf_features: int = 1000, include_mention: bool = False, ngram_range: List[int] = None, num_entities_to_return: int = 10, lang: str = 'ru', use_descriptions: bool = True, lemmatize: bool = False, **kwargs)[source]

Class for linking entity substrings in a document to entities in Wikidata

__init__(load_path: str, word_to_idlist_filename: str, entities_list_filename: str, entities_ranking_filename: str, vectorizer_filename: str, faiss_index_filename: str, chunker: deeppavlov.models.kbqa.entity_linking.NerChunker = None, ner: deeppavlov.core.common.chainer.Chainer = None, ner_parser: deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser = None, entity_ranker: deeppavlov.models.kbqa.rel_ranking_bert_infer.RelRankerBertInfer = None, num_faiss_candidate_entities: int = 20, num_entities_for_bert_ranking: int = 50, num_faiss_cells: int = 50, use_gpu: bool = True, save_path: str = None, fit_vectorizer: bool = False, max_tfidf_features: int = 1000, include_mention: bool = False, ngram_range: List[int] = None, num_entities_to_return: int = 10, lang: str = 'ru', use_descriptions: bool = True, lemmatize: bool = False, **kwargs) → None[source]
Parameters
  • load_path – path to folder with inverted index files

  • word_to_idlist_filename – file with a dict of words (keys) mapped to the start and end indices of the corresponding entity ids in entities_list_filename

  • entities_list_filename – file with the list of entity ids from the knowledge base

  • entities_ranking_filename – file with a dict of entity ids (keys) and the number of Wikidata relations for each entity

  • vectorizer_filename – filename with TfidfVectorizer data

  • faiss_index_filename – file with Faiss index of words

  • chunker – component deeppavlov.models.kbqa.entity_linking.NerChunker

  • ner – config for entity detection

  • ner_parser – component deeppavlov.models.kbqa.entity_detection_parser

  • entity_ranker – component deeppavlov.models.kbqa.rel_ranking_bert_infer

  • num_faiss_candidate_entities – number of nearest neighbors for the entity substring from the text

  • num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context

  • num_faiss_cells – number of Voronoi cells for Faiss index

  • use_gpu – whether to use GPU for faster search of candidate entities

  • save_path – path to folder with inverted index files

  • fit_vectorizer – whether to build the index with the Faiss library

  • max_tfidf_features – maximal number of features for TfidfVectorizer

  • include_mention – whether to leave entity mention in the context (during BERT ranking)

  • ngram_range – char ngrams range for TfidfVectorizer

  • num_entities_to_return – number of candidate entities returned for each entity substring

  • lang – language of the documents, Russian ('ru') or English ('en')

  • use_descriptions – whether to perform entity ranking by context and description

  • lemmatize – whether to lemmatize tokens

  • **kwargs

__call__(docs_batch: List[str])[source]
Parameters

docs_batch – batch of documents

Returns

batch of lists of candidate entity ids
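
Because the linker needs index files and several sub-components, it is normally built from a DeepPavlov config rather than instantiated by hand. A sketch, assuming an entity-linking config is available (the config name below is hypothetical; check the configs shipped with your DeepPavlov version):

    from deeppavlov import build_model

    # "kbqa_entity_linking" is a placeholder config name used for illustration
    el_model = build_model("kbqa_entity_linking", download=True)

    docs = ["Москва — столица России."]
    entity_ids_batch = el_model(docs)
    # entity_ids_batch: for each document, lists of candidate Wikidata entity ids
    # (up to num_entities_to_return candidates per detected substring)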

class deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser(entity_tags: List[str], type_tag: str, o_tag: str, tags_file: str, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]

This class parses per-token tag probabilities to extract entity substrings from the text.

__init__(entity_tags: List[str], type_tag: str, o_tag: str, tags_file: str, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]
Parameters
  • entity_tags – tags for entities

  • type_tag – tag for types

  • o_tag – tag for tokens which are neither entities nor types

  • tags_file – filename with NER tags

  • return_entities_with_tags – whether to return a dict of tags (keys) and list of entity substrings (values) or simply a list of entity substrings

  • thres_proba – if the probability of a tag is less than thres_proba, the token is assigned the tag 'O'

__call__(question_tokens: List[List[str]], token_probas: List[List[List[float]]]) → Tuple[List[Union[List[str], Dict[str, List[str]]]], List[List[str]], List[Union[List[int], Dict[str, List[List[int]]]]]][source]
Parameters
  • question_tokens – tokenized questions

  • token_probas – list of probabilities of question tokens

Returns

  • batch of dicts where keys are tags and values are substrings corresponding to the tags

  • batch of substrings which correspond to entity types

  • batch of lists of token indices in the text which correspond to entities
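
A sketch of the call interface. The tags file, tag names and probabilities below are invented for illustration; a real setup takes them from the NER model used for entity detection:

    from deeppavlov.models.kbqa.entity_detection_parser import EntityDetectionParser

    # hypothetical tag set: index 0 -> O-TAG, 1 -> E-TAG (entity), 2 -> T-TAG (type);
    # "tags.txt" is a placeholder path to a file listing these NER tags
    parser = EntityDetectionParser(entity_tags=["E-TAG"], type_tag="T-TAG",
                                   o_tag="O-TAG", tags_file="tags.txt")

    question_tokens = [["who", "is", "pushkin", "?"]]
    token_probas = [[[0.90, 0.05, 0.05],   # "who"     -> O-TAG
                     [0.90, 0.05, 0.05],   # "is"      -> O-TAG
                     [0.05, 0.90, 0.05],   # "pushkin" -> E-TAG
                     [0.90, 0.05, 0.05]]]  # "?"       -> O-TAG
    entities_batch, types_batch, positions_batch = parser(question_tokens, token_probas)
    # "pushkin" is expected to be returned as an entity substring (token index 2)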

class deeppavlov.models.kbqa.entity_detection_parser.QuestionSignChecker(**kwargs)[source]

This class appends a question mark to a question if it is absent, or replaces a final dot with a question mark

__init__(**kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

__call__(questions: List[str]) → List[str][source]

Call self as a function.
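
A minimal sketch of the expected behavior (the questions are illustrative):

    from deeppavlov.models.kbqa.entity_detection_parser import QuestionSignChecker

    checker = QuestionSignChecker()
    fixed = checker(["Кто такой Пушкин.", "Где родился Лермонтов"])
    # per the class description, both questions now end with a question mark:
    # ["Кто такой Пушкин?", "Где родился Лермонтов?"]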