deeppavlov.models.entity_extraction

class deeppavlov.models.entity_extraction.ner_chunker.NerChunker(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, max_chunk_len: int = 180, batch_size: int = 2, **kwargs)[source]

Class to split documents into chunks of at most max_chunk_len symbols, so that the length of a chunk does not exceed the maximal sequence length that can be fed into BERT

__init__(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, max_chunk_len: int = 180, batch_size: int = 2, **kwargs)[source]
Parameters
  • max_chunk_len – maximal length of chunks into which the document is split

  • batch_size – how many chunks are in batch

__call__(docs_batch: List[str]) → Tuple[List[List[str]], List[List[int]], List[List[List[Tuple[int, int]]]], List[List[List[str]]]][source]

This method splits each document in the batch into chunks with a maximal length of max_chunk_len symbols

Parameters

docs_batch – batch of documents

Returns

  • batch of lists of document chunks for each document

  • batch of lists of indices of the source documents corresponding to the chunks
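
A minimal usage sketch, assuming vocab_file points to the vocabulary file of the BERT model in use (the path below is a placeholder); the names of the four unpacked outputs are illustrative, following the return annotation of __call__:

    from deeppavlov.models.entity_extraction.ner_chunker import NerChunker

    # Placeholder path: substitute the vocabulary file of your BERT model.
    chunker = NerChunker(vocab_file="/path/to/bert/vocab.txt", max_chunk_len=180)

    docs = ["First document. It can be arbitrarily long ...",
            "Second document ..."]

    # Four outputs, per the return annotation: chunks for each document,
    # indices of the source documents, sentence offsets, and sentences.
    chunks_batch, doc_nums_batch, sent_offsets_batch, sentences_batch = chunker(docs)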

class deeppavlov.models.entity_extraction.entity_linking.EntityLinker(load_path: str, entities_database_filename: str, entity_ranker=None, num_entities_for_bert_ranking: int = 50, wikidata_file: Optional[str] = None, num_entities_to_return: int = 10, max_text_len: int = 300, lang: str = 'en', use_descriptions: bool = True, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, max_paragraph_len: int = 250, **kwargs)[source]

Class for linking entity substrings in a document to entities in Wikidata

__init__(load_path: str, entities_database_filename: str, entity_ranker=None, num_entities_for_bert_ranking: int = 50, wikidata_file: Optional[str] = None, num_entities_to_return: int = 10, max_text_len: int = 300, lang: str = 'en', use_descriptions: bool = True, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, max_paragraph_len: int = 250, **kwargs) → None[source]
Parameters
  • load_path – path to folder with inverted index files

  • entities_database_filename – file with sqlite database with Wikidata entities index

  • entity_ranker – component for ranking candidate entities by context and description, e.g. deeppavlov.models.torch_bert.torch_transformers_el_ranker.TorchTransformersEntityRankerInfer

  • num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context

  • wikidata_file – .hdt file with Wikidata graph

  • num_entities_to_return – number of candidate entities returned for each substring

  • max_text_len – max length of context for entity ranking by description

  • lang – language of the text, Russian or English

  • use_descriptions – whether to perform entity ranking by context and description

  • use_tags – whether to use ner tags for entity filtering

  • lemmatize – whether to lemmatize tokens

  • full_paragraph – whether to use full paragraph for entity ranking by context and description

  • use_connections – whether to rank entities by the number of connections in Wikidata

  • max_paragraph_len – maximum length of paragraph for ranking by context and description

  • **kwargs – additional keyword arguments

__call__(entity_substr_batch: List[List[str]], entity_tags_batch: Optional[List[List[str]]] = None, sentences_batch: Optional[List[List[str]]] = None, entity_offsets_batch: Optional[List[List[List[int]]]] = None, sentences_offsets_batch: Optional[List[List[Tuple[int, int]]]] = None) → Tuple[Union[List[List[List[str]]], List[List[str]]], Union[List[List[List[Any]]], List[List[Any]]], Union[List[List[List[str]]], List[List[str]]]][source]

Links entity substrings in the batch to entities in Wikidata.
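
A hedged construction-and-call sketch. The load path and database filename are placeholders for the downloaded index files, and the names of the three returned batches are illustrative, inferred from the return annotation of __call__:

    from deeppavlov.models.entity_extraction.entity_linking import EntityLinker

    # Placeholder paths: substitute the folder with inverted index files
    # and the SQLite database with the Wikidata entities index.
    linker = EntityLinker(
        load_path="/path/to/entity_linking_data",
        entities_database_filename="entities.db",
        num_entities_to_return=3,
        lang="en",
    )

    entity_substr_batch = [["forrest gump"]]
    entity_tags_batch = [["WORK_OF_ART"]]
    sentences_batch = [["Forrest Gump is a comedy-drama film."]]

    # Three batches, per the return annotation: candidate entity ids,
    # their confidences, and an additional per-entity output (e.g. labels).
    ids_batch, conf_batch, labels_batch = linker(
        entity_substr_batch, entity_tags_batch, sentences_batch
    )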

class deeppavlov.models.entity_extraction.entity_detection_parser.EntityDetectionParser(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]

This class parses the probabilities of tokens being part of an entity substring.

__init__(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)[source]
Parameters
  • o_tag – tag for tokens which are neither entities nor types

  • tags_file – filename with NER tags

  • entity_tags – tags for entities

  • ignore_points – whether to treat dots as separate symbols

  • return_entities_with_tags – whether to return a dict of tags (keys) and list of entity substrings (values) or simply a list of entity substrings

  • thres_proba – if the probability of a tag is less than thres_proba, the token is assigned the tag 'O'

__call__(question_tokens_batch: List[List[str]], tokens_info_batch: List[List[List[float]]], tokens_probas_batch: numpy.ndarray) → Tuple[List[Union[List[str], Dict[str, List[str]]]], List[List[str]], List[Union[List[int], Dict[str, List[List[int]]]]]][source]
Parameters
  • question_tokens_batch – batch of tokenized questions

  • tokens_probas_batch – batch of probabilities of question tokens

Returns

  • batch of dicts where keys are tags and values are substrings corresponding to tags

  • batch of substrings which correspond to entity types

  • batch of lists of token indices in the text which correspond to entities
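
The thres_proba rule from __init__ can be illustrated in isolation. The following is a minimal sketch of the documented rule, not the class's actual implementation: the most probable tag is kept only if its probability reaches the threshold, otherwise the token is labeled with o_tag:

    import numpy as np

    def assign_tags(tag_names, tokens_probas, thres_proba=0.8, o_tag="O"):
        """Per-token tag assignment following the documented thres_proba rule."""
        tags = []
        for probas in tokens_probas:
            best = int(np.argmax(probas))
            # Fall back to 'O' when the most probable tag is not confident enough.
            tags.append(tag_names[best] if probas[best] >= thres_proba else o_tag)
        return tags

    tag_names = ["O", "B-PER", "I-PER"]
    probas = np.array([[0.10, 0.85, 0.05],   # confident B-PER
                       [0.40, 0.35, 0.25]])  # below threshold -> 'O'
    print(assign_tags(tag_names, probas))    # ['B-PER', 'O']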

deeppavlov.models.entity_extraction.entity_detection_parser.question_sign_checker(questions: List[str]) → List[str][source]

Adds a question mark if it is absent or replaces trailing dots with a question mark.
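
A quick usage sketch; the expected output follows from the description above:

    from deeppavlov.models.entity_extraction.entity_detection_parser import question_sign_checker

    questions = ["Who directed Forrest Gump", "Who directed Forrest Gump."]
    print(question_sign_checker(questions))
    # ['Who directed Forrest Gump?', 'Who directed Forrest Gump?']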