deeppavlov.models.entity_extraction

class deeppavlov.models.entity_extraction.ner_chunker.NerChunker(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]

Class to split documents into chunks of at most max_seq_len symbols so that the chunks do not exceed the maximal sequence length that can be fed into BERT

__init__(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]
Parameters
  • vocab_file – vocab file of pretrained transformer model

  • max_seq_len – maximal length of chunks into which the document is split

  • lowercase – whether to lowercase text

  • batch_size – number of chunks in a batch

__call__(docs_batch: List[str]) Tuple[List[List[str]], List[List[int]], List[List[Union[List[Union[Tuple[int, int], Tuple[Union[int, Any], Union[int, Any]]]], List[Tuple[Union[int, Any], Union[int, Any]]], List[Tuple[int, int]]]]], List[List[Union[List[Any], List[str]]]], List[List[str]]][source]

This method splits each document in the batch into chunks with a maximal length of max_seq_len

Parameters

docs_batch – batch of documents

Returns

  • batch of lists of document chunks for each document

  • batch of lists of numbers of the documents to which the chunks correspond
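
A minimal usage sketch (not part of the original docs), assuming NerChunker is constructed directly and that vocab_file accepts either a local vocab path or a pretrained tokenizer name:

    from deeppavlov.models.entity_extraction.ner_chunker import NerChunker

    # Assumption: vocab_file may be a Hugging Face tokenizer name or a local path.
    chunker = NerChunker(vocab_file="bert-base-uncased", max_seq_len=400, batch_size=2)

    docs = ["First long document ...", "Second long document ..."]
    # The first two outputs are the chunk texts and, for each chunk, the number of
    # the source document; the remaining outputs carry offset information.
    chunks_batch, doc_nums_batch, *offsets = chunker(docs)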

class deeppavlov.models.entity_extraction.entity_linking.EntityLinker(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs)[source]

Class for linking entity substrings in a document to entities in Wikidata

__init__(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs) None[source]
Parameters
  • load_path – path to folder with inverted index files

  • entity_ranker – component deeppavlov.models.kbqa.rel_ranking_bert

  • entities_database_filename – filename of the database with the entities index

  • words_dict_filename – filename with words and corresponding tags

  • ngrams_matrix_filename – filename of the character-level TF-IDF matrix

  • num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context

  • num_entities_for_conn_ranking – number of candidate entities for ranking using connections in the knowledge graph

  • num_entities_to_return – number of candidate entities returned for each substring

  • max_text_len – maximal length of entity context

  • max_paragraph_len – maximal length of context paragraphs

  • lang – language, 'ru' (Russian) or 'en' (English)

  • use_descriptions – whether to perform entity ranking by context and description

  • alias_coef – coefficient which is multiplied by the substring matching confidence if the substring is the title of the entity

  • use_tags – whether to filter candidate entities by tags

  • lemmatize – whether to lemmatize tokens

  • full_paragraph – whether to use full paragraph for entity context

  • use_connections – whether to rank entities by connections in the knowledge graph

  • kb_filename – filename with the knowledge base in HDT format

  • prefixes – entity and title prefixes

  • **kwargs – additional keyword arguments

__call__(substr_batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, probas_batch: Optional[List[List[float]]] = None, sentences_batch: Optional[List[List[str]]] = None, offsets_batch: Optional[List[List[List[int]]]] = None, sentences_offsets_batch: Optional[List[List[Tuple[int, int]]]] = None, entities_to_link_batch: Optional[List[List[int]]] = None)[source]

Links each batch of entity substrings to entities in the knowledge base, returning candidate entities for every substring.
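
A hedged construction sketch: the load path and file names below are placeholders for artifacts shipped with DeepPavlov's entity-linking configs, not files the class creates, and the call follows the documented __call__ signature:

    from deeppavlov.models.entity_extraction.entity_linking import EntityLinker

    linker = EntityLinker(
        load_path="~/.deeppavlov/downloads/entity_linking_eng",  # placeholder path
        entities_database_filename="el_db_eng.db",               # placeholder file
        words_dict_filename="words_dict.pickle",                 # placeholder file
        ngrams_matrix_filename="ngrams_matrix.npz",              # placeholder file
        lang="en",
        num_entities_to_return=5,
    )

    # Positional arguments follow __call__: substrings, tags, confidences, sentences.
    candidates = linker(
        [["forrest gump"]],                  # substr_batch
        [["WORK_OF_ART"]],                   # tags_batch
        [[1.0]],                             # probas_batch
        [["Forrest Gump is a 1994 film."]],  # sentences_batch
    )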

class deeppavlov.models.entity_extraction.entity_detection_parser.EntityDetectionParser(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]

This class parses the probabilities of tokens being part of an entity substring.

__init__(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]
Parameters
  • o_tag – tag for tokens which are neither entities nor types

  • tags_file – filename with NER tags

  • entity_tags – tags for entities

  • ignore_points – whether to consider dots as separate symbols

  • thres_proba – if the probability of a tag is lower than thres_proba, the tag ‘O’ is assigned instead

  • make_tags_from_probas – whether to derive token tags from the confidences of the sequence tagging model

  • lang – language of texts

  • ignored_tags – entity tags which should be ignored

__call__(question_tokens_batch: List[List[str]], tokens_info_batch: List[List[List[float]]], tokens_probas_batch: ndarray) Tuple[List[dict], List[dict], List[dict]][source]
Parameters
  • question_tokens_batch – tokenized questions

  • tokens_info_batch – list of tags of question tokens

  • tokens_probas_batch – list of probabilities of question tokens

Returns

  • batch of dicts where keys are tags and values are substrings corresponding to the tags

  • batch of substrings which correspond to entity types

  • batch of lists of token indices in the text which correspond to entities
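
A toy sketch of the parser. The tags file path is a placeholder (in the shipped configs it comes with the downloaded NER model), the probability array holds made-up values, and string tags are passed for tokens_info_batch as the parameter description suggests:

    import numpy as np

    from deeppavlov.models.entity_extraction.entity_detection_parser import EntityDetectionParser

    parser = EntityDetectionParser(
        o_tag="O",
        tags_file="~/.deeppavlov/models/ner/tag.dict",  # placeholder path
        thres_proba=0.8,
        lang="en",
    )

    tokens_batch = [["Who", "directed", "Forrest", "Gump", "?"]]
    tags_batch = [["O", "O", "B-WORK_OF_ART", "I-WORK_OF_ART", "O"]]
    # Made-up per-token probability distributions over a 3-tag set
    # (shape: batch x tokens x tags).
    probas_batch = np.full((1, 5, 3), 0.1)

    entities_batch, types_batch, positions_batch = parser(tokens_batch, tags_batch, probas_batch)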

class deeppavlov.models.entity_extraction.entity_detection_parser.QuestionSignChecker(delete_brackets: bool = False, **kwargs)[source]