deeppavlov.models.entity_linking

class deeppavlov.models.kbqa.entity_linking.NerChunker(max_chunk_len: int = 300, batch_size: int = 30, **kwargs)[source]

Class to split documents into chunks of at most max_chunk_len symbols so that the length does not exceed the maximal sequence length which can be fed into BERT

__init__(max_chunk_len: int = 300, batch_size: int = 30, **kwargs)[source]
Parameters
  • max_chunk_len – maximal length of chunks into which the document is split

  • batch_size – how many chunks are in batch

__call__(docs_batch: List[str]) → Tuple[List[List[str]], List[List[int]]][source]

This method splits each document in the batch into chunks with the maximal length of max_chunk_len

Parameters

docs_batch – batch of documents

Returns

  • batch of lists of document chunks for each document

  • batch of lists of document numbers which correspond to the chunks
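
A minimal usage sketch based on the signatures above (the sample document is illustrative):

    from deeppavlov.models.kbqa.entity_linking import NerChunker

    chunker = NerChunker(max_chunk_len=300, batch_size=30)
    docs = ["First sentence of a long document. " * 20]  # illustrative input
    chunks_batch, doc_nums_batch = chunker(docs)
    # chunks_batch: for each input document, its chunks of at most 300 symbols
    # doc_nums_batch: for each chunk, the number of the document it was taken from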

class deeppavlov.models.kbqa.entity_linking.EntityLinker(load_path: str, word_to_idlist_filename: str, entities_list_filename: str, entities_ranking_filename: str, vectorizer_filename: str, faiss_index_filename: str, chunker: deeppavlov.models.kbqa.entity_linking.NerChunker = None, ner: deeppavlov.core.common.chainer.Chainer = None, ner_parser: deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser = None, entity_ranker: deeppavlov.models.kbqa.rel_ranking_bert_infer.RelRankerBertInfer = None, num_faiss_candidate_entities: int = 20, num_entities_for_bert_ranking: int = 50, num_faiss_cells: int = 50, use_gpu: bool = True, save_path: str = None, fit_vectorizer: bool = False, max_tfidf_features: int = 1000, include_mention: bool = False, ngram_range: List[int] = None, num_entities_to_return: int = 10, lang: str = 'ru', use_descriptions: bool = True, lemmatize: bool = False, **kwargs)[source]

Class for linking entity substrings in a document to entities in Wikidata

__init__(load_path: str, word_to_idlist_filename: str, entities_list_filename: str, entities_ranking_filename: str, vectorizer_filename: str, faiss_index_filename: str, chunker: deeppavlov.models.kbqa.entity_linking.NerChunker = None, ner: deeppavlov.core.common.chainer.Chainer = None, ner_parser: deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser = None, entity_ranker: deeppavlov.models.kbqa.rel_ranking_bert_infer.RelRankerBertInfer = None, num_faiss_candidate_entities: int = 20, num_entities_for_bert_ranking: int = 50, num_faiss_cells: int = 50, use_gpu: bool = True, save_path: str = None, fit_vectorizer: bool = False, max_tfidf_features: int = 1000, include_mention: bool = False, ngram_range: List[int] = None, num_entities_to_return: int = 10, lang: str = 'ru', use_descriptions: bool = True, lemmatize: bool = False, **kwargs) → None[source]
Parameters
  • load_path – path to folder with inverted index files

  • word_to_idlist_filename – file with a dict of words (keys) mapped to the start and end indices of the corresponding entity ids in entities_list_filename

  • entities_list_filename – file with the list of entity ids from the knowledge base

  • entities_ranking_filename – file with a dict of entity ids (keys) and the number of Wikidata relations for each entity

  • vectorizer_filename – filename with TfidfVectorizer data

  • faiss_index_filename – file with Faiss index of words

  • chunker – component deeppavlov.models.kbqa.entity_linking.NerChunker

  • ner – config for entity detection

  • ner_parser – component deeppavlov.models.kbqa.entity_detection_parser

  • entity_ranker – component deeppavlov.models.kbqa.rel_ranking_bert_infer

  • num_faiss_candidate_entities – number of nearest neighbors for the entity substring from the text

  • num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context

  • num_faiss_cells – number of Voronoi cells for Faiss index

  • use_gpu – whether to use GPU for faster search of candidate entities

  • save_path – path to folder with inverted index files

  • fit_vectorizer – whether to build the index with the Faiss library

  • max_tfidf_features – maximal number of features for TfidfVectorizer

  • include_mention – whether to leave entity mention in the context (during BERT ranking)

  • ngram_range – char ngrams range for TfidfVectorizer

  • num_entities_to_return – number of candidate entities returned for each entity substring

  • lang – language of the documents, Russian ('ru') or English ('en')

  • use_descriptions – whether to perform entity ranking by context and description

  • lemmatize – whether to lemmatize tokens

  • **kwargs

__call__(docs_batch: List[str])[source]
Parameters

docs_batch – batch of documents

Returns

batch of lists of candidate entity ids
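
Because the linker needs index files and several sub-components, it is normally built from a DeepPavlov config rather than instantiated by hand. A sketch, assuming an entity-linking config is available (the config name below is hypothetical; check the configs shipped with your DeepPavlov version):

    from deeppavlov import build_model

    # "kbqa_entity_linking" is a placeholder config name used for illustration
    el_model = build_model("kbqa_entity_linking", download=True)

    docs = ["Москва — столица России."]
    entity_ids_batch = el_model(docs)
    # entity_ids_batch: for each document, lists of candidate Wikidata entity ids
    # (up to num_entities_to_return candidates per detected substring)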

class deeppavlov.models.kbqa.entity_detection_parser.EntityDetectionParser(entity_tags: List[str], type_tag: str, o_tag: str, tags_file: str, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]

This class parses per-token tag probabilities to extract entity substrings from the text.

__init__(entity_tags: List[str], type_tag: str, o_tag: str, tags_file: str, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]
Parameters
  • entity_tags – tags for entities

  • type_tag – tag for types

  • o_tag – tag for tokens which are neither entities nor types

  • tags_file – filename with NER tags

  • return_entities_with_tags – whether to return a dict of tags (keys) and list of entity substrings (values) or simply a list of entity substrings

  • thres_proba – if the probability of a tag is less than thres_proba, the token is assigned the tag 'O'

__call__(question_tokens: List[List[str]], token_probas: List[List[List[float]]]) → Tuple[List[Union[List[str], Dict[str, List[str]]]], List[List[str]], List[Union[List[int], Dict[str, List[List[int]]]]]][source]
Parameters
  • question_tokens – tokenized questions

  • token_probas – list of probabilities of question tokens

Returns

  • batch of dicts where keys are tags and values are substrings corresponding to the tags

  • batch of substrings which correspond to entity types

  • batch of lists of token indices in the text which correspond to entities
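
A sketch of the call interface. The tags file, tag names and probabilities below are invented for illustration; a real setup takes them from the NER model used for entity detection:

    from deeppavlov.models.kbqa.entity_detection_parser import EntityDetectionParser

    # hypothetical tag set: index 0 -> O-TAG, 1 -> E-TAG (entity), 2 -> T-TAG (type);
    # "tags.txt" is a placeholder path to a file listing these NER tags
    parser = EntityDetectionParser(entity_tags=["E-TAG"], type_tag="T-TAG",
                                   o_tag="O-TAG", tags_file="tags.txt")

    question_tokens = [["who", "is", "pushkin", "?"]]
    token_probas = [[[0.90, 0.05, 0.05],   # "who"     -> O-TAG
                     [0.90, 0.05, 0.05],   # "is"      -> O-TAG
                     [0.05, 0.90, 0.05],   # "pushkin" -> E-TAG
                     [0.90, 0.05, 0.05]]]  # "?"       -> O-TAG
    entities_batch, types_batch, positions_batch = parser(question_tokens, token_probas)
    # "pushkin" is expected to be returned as an entity substring (token index 2)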

class deeppavlov.models.kbqa.entity_detection_parser.QuestionSignChecker(**kwargs)[source]

This class appends a question mark to a question if it is absent, or replaces a final dot with a question mark

__init__(**kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

__call__(questions: List[str]) → List[str][source]

Call self as a function.
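
A minimal sketch of the expected behavior (the questions are illustrative):

    from deeppavlov.models.kbqa.entity_detection_parser import QuestionSignChecker

    checker = QuestionSignChecker()
    fixed = checker(["Кто такой Пушкин.", "Где родился Лермонтов"])
    # per the class description, both questions now end with a question mark:
    # ["Кто такой Пушкин?", "Где родился Лермонтов?"]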