Entity Extraction

1. Introduction to the task

Entity Detection is the task of identifying entity mentions in text and assigning each mention an entity type. Entity Detection models in DeepPavlov split the input text into fragments shorter than 512 tokens and find entities with BERT-based models.

Entity Linking is the task of finding knowledge base entity IDs for entity mentions in text. Entity Linking in DeepPavlov supports Wikidata and Wikipedia. The Entity Linking component performs the following steps:

  • extraction of candidate entities from SQLite database;

  • candidate entities sorting by entity tags (if entity tags are provided);

  • ranking of candidate entities by connections in Wikidata knowledge graph of candidate entities for different mentions;

  • candidate entities ranking by context and descriptions using a Transformer model: bert-small in the English config and distilrubert-tiny in the Russian config.

Entity Extraction configs perform Entity Detection and then Entity Linking of the extracted entity mentions.

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation.

[ ]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
!python -m deeppavlov install entity_extraction_en

entity_extraction_en is the name of the model’s config_file. What is a Config File?

There are alternative ways to install the model’s packages that do not require executing a separate command – see the options in the next sections of this page. The full list of models for entity detection, linking and extraction with their config names can be found in the table.

3. Models list

The table presents a list of all of the models for entity detection, linking and extraction available in the DeepPavlov Library.

Config name             Language   RAM      GPU
entity_detection_en     En         2.5 GB   3.7 GB
entity_detection_ru     Ru         2.5 GB   5.3 GB
entity_linking_en       En         2.4 GB   1.2 GB
entity_linking_ru       Ru         2.2 GB   1.1 GB
entity_extraction_en    En         2.5 GB   3.7 GB
entity_extraction_ru    Ru         2.5 GB   5.3 GB

4. Use the model for prediction

4.1 Predict using Python

After installing the model, build it from the config and predict.

Entity Detection

For English:

[ ]:
import warnings
warnings.filterwarnings('ignore')

from deeppavlov import build_model

ed_en = build_model('entity_detection_en', download=True, install=True)

The output elements:

  • entity substrings

  • entity offsets (indices of start and end symbols of entities in text)

  • entity positions (indices of entity tokens in text)

  • entity tags

  • sentences offsets

  • list of sentences in text

  • confidences of detected entities

[ ]:
ed_en(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'])

For Russian:

[ ]:
ed_ru = build_model('entity_detection_ru', download=True, install=True)
ed_ru(['Москва — столица России, центр Центрального федерального округа и центр Московской области.'])

Entity Linking

For English:

[ ]:
el_en = build_model('entity_linking_en', download=True, install=True)

The input elements:

  • entity substrings

  • entity tags (optional argument)

  • confidences of entity substrings (optional argument)

  • sentences (context) of the entities (optional argument)

  • entity offsets (optional argument)

  • sentences offsets (optional argument)

The output elements:

  • entity ids

  • entity confidences (for each entity - the list with three confidences: substring matching confidence, popularity ranking confidence and context ranking confidence)

  • entity pages in Wikipedia

  • entity labels in Wikidata

[ ]:
el_en([['forrest gump', 'robert zemeckis', 'eric roth']],
      [['WORK_OF_ART', 'PERSON', 'PERSON']],
      [[1.0, 1.0, 1.0]],
      [['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.']],
      [[(0, 12), (48, 63), (79, 88)]],
      [[(0, 89)]])
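
The entity offsets and sentence offsets passed in the call above are character spans with an exclusive end index. The sketch below shows one way such spans could be derived from the raw text, assuming a plain case-insensitive substring search. The helper find_offsets is hypothetical and not part of DeepPavlov; in practice the spans normally come from the entity detection output.

```python
def find_offsets(text, substrings):
    """Return (start, end) character spans of each substring in text.

    Hypothetical helper for illustration: matches case-insensitively and
    searches for each substring after the end of the previous match.
    """
    lowered = text.lower()
    offsets, cursor = [], 0
    for substr in substrings:
        start = lowered.find(substr.lower(), cursor)
        if start == -1:
            raise ValueError(f"{substr!r} not found in text")
        end = start + len(substr)  # exclusive end index
        offsets.append((start, end))
        cursor = end
    return offsets


text = 'Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'
entity_offsets = find_offsets(text, ['forrest gump', 'robert zemeckis', 'eric roth'])
sentence_offsets = [(0, len(text))]  # a single sentence covering the whole text
print(entity_offsets)
print(sentence_offsets)
```

Slicing the text with a span, for example text[48:63], gives back the original mention, which is a quick way to check hand-written offsets before passing them to the model.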

For Russian:

[ ]:
el_ru = build_model('entity_linking_ru', download=True, install=True)

el_ru([['москва', 'россии', 'центрального федерального округа', 'московской области']],
      [['CITY', 'COUNTRY', 'LOC', 'LOC']],
      [[1.0, 1.0, 1.0, 1.0]],
      [['Москва — столица России, центр Центрального федерального округа и центр Московской области.']],
      [[(0, 6), (17, 23), (31, 63), (72, 90)]],
      [[(0, 91)]])

Entity Extraction

For English:

[ ]:
ex_en = build_model('entity_extraction_en', download=True, install=True)

The output elements:

  • entity substrings

  • entity tags

  • entity offsets

  • entity ids in the knowledge base

  • entity linking confidences

  • entity pages

  • entity labels

[ ]:
ex_en(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis and written by Eric Roth.'])

For Russian:

[ ]:
ex_ru = build_model('entity_extraction_ru', download=True, install=True)

ex_ru(['Москва — столица России, центр Центрального федерального округа и центр Московской области.'])

4.2 Predict using CLI

You can also get predictions in interactive mode through the CLI (Command Line Interface).

[ ]:
! python -m deeppavlov interact entity_extraction_en -d

5. Customize the model

5.1 Description of config parameters

Parameters of ner_chunker component:

  • batch_size: int - each text from the input batch is split into chunks shorter than the length threshold (because Transformer-based models for entity detection work with input sequences of limited length), then all chunks are concatenated into one list and the list is split into batches of size batch_size;

  • max_seq_len: int - maximum length of chunk (in wordpiece tokens);

  • vocab_file: str - vocab file of Transformer tokenizer, which is used to tokenize the text for further splitting into chunks.
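
The interplay of max_seq_len and batch_size described above can be sketched as follows. This is a simplified illustration, not the ner_chunker implementation: it assumes whitespace tokens instead of wordpiece tokens and cuts chunks at fixed positions, and the function chunk_and_batch is a hypothetical name.

```python
def chunk_and_batch(token_lists, max_seq_len, batch_size):
    """Split each tokenized text into chunks, then group all chunks into batches."""
    chunks = []
    for tokens in token_lists:
        # Cut every text into pieces of at most max_seq_len tokens.
        for start in range(0, len(tokens), max_seq_len):
            chunks.append(tokens[start:start + max_seq_len])
    # Chunks from all texts share one list, which is then split into batches.
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


texts = ["a b c d e", "f g h"]
token_lists = [t.split() for t in texts]  # assumption: whitespace tokenization
batches = chunk_and_batch(token_lists, max_seq_len=2, batch_size=2)
print(batches)
```

Note that a batch may mix chunks from different input texts, so a real component also has to remember which text each chunk came from in order to reassemble the results.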

Parameters of entity_detection_parser component:

  • thres_proba: float - the NER models return tag confidences for each token; if the probability of “O” tag (which is used for tokens not related to entities) for the token is lower than the thres_proba, the tag with the maximum probability from entity tags list is chosen;

  • o_tag: str - tag for non-entity tokens (“O” by default);

  • tags_file: str - the filename with the list of tags used in the NER model.
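
The thres_proba rule can be illustrated with a small sketch for a single token. It assumes the probabilities are given in the same order as the tag list and that the token keeps the “O” tag when the condition is not met; the function choose_tag and the example tag list are hypothetical.

```python
def choose_tag(tag_probas, tags, thres_proba, o_tag="O"):
    """Pick a tag for one token from its per-tag probabilities."""
    o_index = tags.index(o_tag)
    if tag_probas[o_index] < thres_proba:
        # "O" is not confident enough: take the most probable entity tag.
        entity_scores = [(p, t) for p, t in zip(tag_probas, tags) if t != o_tag]
        return max(entity_scores, key=lambda pair: pair[0])[1]
    return o_tag


tags = ["O", "B-PERSON", "B-LOC"]   # hypothetical tag list
probas = [0.5, 0.4, 0.1]            # made-up probabilities for one token
high_threshold_tag = choose_tag(probas, tags, thres_proba=0.6)
low_threshold_tag = choose_tag(probas, tags, thres_proba=0.4)
print(high_threshold_tag, low_threshold_tag)
```

Raising thres_proba makes the parser accept more tokens as entities even when “O” is the single most probable tag, which trades precision for recall.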

Parameters of ner_chunk_model component:

  • ner: deeppavlov.core.common.chainer:Chainer - the config for entity recognition, which defines entity tags (or “O” tag) and tag probabilities for each token in the input text;

  • ner_parser: deeppavlov.models.entity_extraction.entity_detection_parser:EntityDetectionParser - the component which processes the tags and tag probabilities returned by the entity recognition model and defines entity substrings;

  • ner2: deeppavlov.core.common.chainer:Chainer - (optional) an additional entity recognition config, which can improve the quality of entity recognition in the case of joint usage with ner config;

  • ner_parser2: deeppavlov.models.entity_extraction.entity_detection_parser:EntityDetectionParser - (optional) an additional config for processing entity recognition output.

Parameters of entity_linker component:

  • load_path: str - the path to the folder with the inverted index;

  • entity_ranker - the component for ranking of candidate entities by descriptions;

  • entities_database_filename: str - file with the inverted index (the mapping between entity titles and entity IDs);

  • words_dict_filename: str - file with mapping of entity titles to the tags of entity detection model;

  • ngrams_matrix_filename: str - matrix of char ngrams of words from entity titles from the knowledge base;

  • num_entities_for_bert_ranking: int - number of candidate entities which are re-ranked by context and description using Transformer-based model;

  • num_entities_for_conn_ranking: int - number of candidate entities which are re-ranked by connections in the knowledge graph between entities for different mentions in the text;

  • num_entities_to_return: int - the number of entity IDs, returned for each entity mention in text;

  • max_paragraph_len: int - maximum length of context used for ranking of entities by description;

  • lang: str - language of the entity linking model (Russian or English);

  • use_descriptions: bool - whether to perform ranking of candidate entities by similarity of their descriptions to the context;

  • alias_coef: float - the coefficient which is multiplied by the substring matching score of the entity if the entity mention in the text matches with the entity title;

  • use_tags: bool - whether to search only those entity IDs in the inverted index, which have the same tag as the entity mention;

  • lemmatize: bool - whether to lemmatize entity mentions before searching candidate entity IDs in the inverted index;

  • full_paragraph: bool - whether to use full context for ranking of entities by descriptions or cut the paragraph to one sentence with entity mention;

  • use_connections: bool - whether to use connections between candidate entities for different mentions for ranking;

  • kb_filename: str - file with the knowledge base in .hdt format;

  • prefixes: Dict[str, Any] - prefixes in the knowledge base for entities and relations.
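
One way to change these parameters without editing the config file is to modify the parsed config dictionary before building the model. The sketch below works on a heavily trimmed, hypothetical config dict: only the chainer/pipe layout and the class_name key follow the DeepPavlov config format, the parameter values are placeholders, and override_component is a hypothetical helper.

```python
import copy


def override_component(config, class_name, **params):
    """Return a copy of config with params updated on matching pipe components."""
    updated = copy.deepcopy(config)
    matched = False
    for component in updated["chainer"]["pipe"]:
        if component.get("class_name") == class_name:
            component.update(params)
            matched = True
    if not matched:
        raise KeyError(f"no component with class_name={class_name!r}")
    return updated


# Hypothetical, trimmed config used for illustration only.
config = {
    "chainer": {
        "pipe": [
            {"class_name": "entity_linker",
             "num_entities_to_return": 3,
             "use_descriptions": True}
        ]
    }
}
custom = override_component(config, "entity_linker",
                            num_entities_to_return=5,
                            use_descriptions=False)
print(custom["chainer"]["pipe"][0])
```

The modified dictionary can then be passed to build_model in place of the config name, on the assumption that the installed DeepPavlov version accepts a config dictionary there.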

5.2 Training entity detection model

The configs entity_detection_en and entity_extraction_en use the ner_ontonotes_bert model to detect entity mentions; the configs entity_detection_ru and entity_extraction_ru use the ner_rus_bert_probas model. How to train a NER model.

5.3 Using custom knowledge base

The database filename is set with the entities_database_filename parameter in entity linking configs. The file is an SQLite database that uses the FTS5 extension for full-text search of entities by entity mention. The database file should contain the inverted_index table with the following columns:

  • title - entity title (name or alias) in the knowledge base;

  • entity_id - entity ID in the knowledge base;

  • num_rels - number of relations of the entity with other entities in the knowledge graph;

  • ent_tag - entity tag of the entity detection model (for example, CITY, PERSON, WORK_OF_ART, etc.);

  • page - page title of the entity (for Wikidata entities - the Wikipedia page);

  • label - entity label in the knowledge base;

  • descr - entity description in the knowledge base.
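
A minimal sketch of creating and querying such a table with Python's built-in sqlite3 module is shown below. It assumes an SQLite build with FTS5 enabled. Marking every column except title as UNINDEXED is an assumption made for this example, and the rows (including the IDs E1 and E2) are made-up placeholders rather than real knowledge base entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real database

# Full-text index over entity titles; the other columns are stored but not searched.
conn.execute(
    "CREATE VIRTUAL TABLE inverted_index USING fts5("
    "title, entity_id UNINDEXED, num_rels UNINDEXED, ent_tag UNINDEXED, "
    "page UNINDEXED, label UNINDEXED, descr UNINDEXED)"
)

# Placeholder rows for illustration only.
rows = [
    ("forrest gump", "E1", 10, "WORK_OF_ART", "Forrest Gump", "Forrest Gump", "film"),
    ("robert zemeckis", "E2", 5, "PERSON", "Robert Zemeckis", "Robert Zemeckis", "film director"),
]
conn.executemany("INSERT INTO inverted_index VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Look up candidate entities for a mention with an FTS5 phrase query.
matches = conn.execute(
    "SELECT entity_id, ent_tag FROM inverted_index WHERE inverted_index MATCH ?",
    ('"forrest gump"',),
).fetchall()
print(matches)
```

Each alias of an entity gets its own row with the same entity_id, so a single entity can be found under several titles.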

Tags of entities in the knowledge base should correspond to the tags of the custom NER model or of the default ner_ontonotes_bert or ner_rus_bert_probas models. The ner_ontonotes_bert tags are listed in the tags.dict file in the ~/.deeppavlov/models/ner_ontonotes_bert_torch_crf directory, and the ner_rus_bert_probas tags are listed in the tags.dict file in the ~/.deeppavlov/models/wiki_ner_rus_bert directory.