Named Entity Recognition (NER)

1. Introduction to the task

Named Entity Recognition (NER) is the task of assigning a tag (from a predefined set of tags) to each token in a given sequence. In other words, NER consists of identifying named entities in the text and classifying them into types (e.g. person name, organization, location, etc.).

The BIO encoding scheme is usually used in NER. It uses three tag prefixes: B for the beginning of an entity, I for the inside of an entity, and O for non-entity tokens. The second part of the tag (after the prefix) denotes the entity type.

Here is an example of a tagged sequence:

Elon   Musk   founded  Tesla  in  2003    .
B-PER  I-PER  O        B-ORG  O   B-DATE  O

Here we can see three extracted named entities: Elon Musk (a person's name), Tesla (an organization) and 2003 (a date). To see more examples, try out our Demo.
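
To make the schema concrete, here is a minimal sketch (plain Python, independent of DeepPavlov) that groups a BIO-tagged sequence into (entity text, entity type) pairs; the helper name bio_to_entities is ours and is used only for illustration.

[ ]:
def bio_to_entities(tokens, tags):
    """Group a BIO-tagged token sequence into (entity_text, entity_type) pairs."""
    entities, span, span_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):             # a new entity begins
            if span:
                entities.append((' '.join(span), span_type))
            span, span_type = [token], tag[2:]
        elif tag.startswith('I-') and span:  # continuation of the current entity
            span.append(token)
        else:                                # 'O' closes any open entity
            if span:
                entities.append((' '.join(span), span_type))
            span, span_type = [], None
    if span:
        entities.append((' '.join(span), span_type))
    return entities

tokens = ['Elon', 'Musk', 'founded', 'Tesla', 'in', '2003', '.']
tags = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-DATE', 'O']
print(bio_to_entities(tokens, tags))
# [('Elon Musk', 'PER'), ('Tesla', 'ORG'), ('2003', 'DATE')]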

The list of possible entity types may vary depending on your dataset domain. The tags used in DeepPavlov's models can be found in the table in section 7 (NER-tags list).

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation.

[ ]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
!python -m deeppavlov install ner_ontonotes_bert

ner_ontonotes_bert is the name of the model's config_file. What is a Config File?

A configuration file defines the model and describes its hyperparameters. To use another model, change the name of the config_file here and below. The full list of NER models with their config names can be found in the table in section 3.

There are alternative ways to install the model's packages that do not require executing a separate command; see the options in the next sections of this page.
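
As an optional sanity check (a minimal sketch, not required for installation), you can parse a config by its name from Python and inspect its top-level sections; parse_config is the same helper used in section 6.1, and the exact set of keys may vary between configs.

[ ]:
from deeppavlov.core.commands.utils import parse_config

config = parse_config('ner_ontonotes_bert')

# top-level sections of the config, e.g. dataset_reader, chainer, train, metadata
print(list(config.keys()))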

3. Models list

The table below lists all of the NER models available in the DeepPavlov Library.

Config name                      Dataset         Language  Model Size  F1 (ner_f1)  F1 (ner_token_f1)
ner_case_agnostic_mdistilbert    CoNLL-2003      En        1.6 GB      89.9         91.6
ner_conll2003_bert               CoNLL-2003      En        1.3 GB      91.9         93.4
ner_ontonotes_bert               OntoNotes       En        1.3 GB      89.2         92.7
ner_collection3_bert             Collection3     Ru        2.1 GB      98.5         98.9
ner_rus_bert                     Collection3     Ru        2.1 GB      97.6         98.5
ner_rus_convers_distilrubert_2L  Collection-rus  Ru        1.3 GB      92.9         96.6
ner_rus_convers_distilrubert_6L  Collection-rus  Ru        1.6 GB      96.7         98.5
ner_rus_bert_probas              Wiki-NER-rus    Ru        2.1 GB      72.6         79.5
ner_ontonotes_bert_mult          OntoNotes       Multi     2.1 GB      88.9         92.0

4. Use the model for prediction

4.1 Predict using Python

After installing the model, build it from the config and predict.

[ ]:
from deeppavlov import build_model

ner_model = build_model('ner_ontonotes_bert', download=True, install=True)

The download argument defines whether it is necessary to download the files listed in the download section of the config: usually these are links to the train and test data, to pretrained models, or to embeddings.

Setting the install argument to True is equivalent to executing the command-line install command: all of the required packages are installed before the model is built.

Input: List[sentences]

Output: List[tokenized sentences, corresponding NER-tags]

[ ]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])
[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
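
Since the output is a pair of parallel lists (tokenized sentences and their tag sequences), a common pattern is to unpack it and zip each sentence with its tags; this is just a usage sketch, not a separate library API.

[ ]:
tokens_batch, tags_batch = ner_model(['Bob Ross lived in Florida'])
for tokens, tags in zip(tokens_batch, tags_batch):
    for token, tag in zip(tokens, tags):
        print(token, tag)
# Bob B-PERSON
# Ross I-PERSON
# lived O
# in O
# Florida B-GPE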

4.2 Predict using CLI

You can also get predictions in an interactive mode through the CLI (Command Line Interface).

[ ]:
! python -m deeppavlov interact ner_ontonotes_bert -d

-d is an optional download flag (the alternative to download=True in Python code). It downloads the pre-trained model along with embeddings and all other files needed to run the model.

Or make predictions for samples from a file.

[ ]:
! python -m deeppavlov predict ner_ontonotes_bert -f <file-name>

5. Evaluate

There are two metrics used to evaluate a NER model in DeepPavlov:

ner_f1 is measured on the entity level (predicted text spans must match the gold spans exactly)

ner_token_f1 is measured on the token level (correctly tagged tokens from partially extracted entities still count as true positives, TPs)
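
To see why the two scores can differ, consider a toy example in which the gold entity Great Britain is only partially extracted. The snippet below illustrates the idea only; it is not DeepPavlov's metric implementation.

[ ]:
# gold: "Great Britain" is a single LOC entity; the model extracts only "Great"
gold = ['B-LOC', 'I-LOC', 'O']
pred = ['B-LOC', 'O', 'O']

# Entity level (ner_f1): the predicted span ("Great") does not exactly match
# the gold span ("Great Britain"), so it is a false positive and the gold
# entity is a false negative -> precision = 0, recall = 0, F1 = 0.

# Token level (ner_token_f1): the token "Great" carries the correct non-O tag
# (a true positive), while "Britain" is missed (a false negative).
precision, recall = 1 / 1, 1 / 2
print(2 * precision * recall / (precision + recall))  # ≈ 0.67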

5.1 Evaluate from Python

[ ]:
from deeppavlov import evaluate_model

model = evaluate_model('ner_ontonotes_bert', download=True)

5.2 Evaluate from CLI

[ ]:
! python -m deeppavlov evaluate ner_ontonotes_bert

6. Customize the model

6.1 Train your model from Python

Provide your data path

To train the model on your data, you need to change the path to the training data in the config_file.

Parse the config_file and change the path to your data from Python.

[ ]:
from deeppavlov import train_model
from deeppavlov.core.commands.utils import parse_config

model_config = parse_config('ner_ontonotes_bert')

# dataset that the model was trained on
print(model_config['dataset_reader']['data_path'])
~/.deeppavlov/downloads/ontonotes/

Provide a data_path to your own dataset.

[ ]:
# download and unzip a new example dataset
!wget http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz
!tar -xzvf "conll2003_v2.tar.gz"
[ ]:
# provide a path to the train file
model_config['dataset_reader']['data_path'] = 'contents/train.txt'
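
Depending on the config, you may also want to adjust training parameters before launching training. The keys used below (epochs, batch_size) are common in DeepPavlov train sections but are an assumption here; inspect model_config['train'] to see what your config actually exposes.

[ ]:
# inspect the current training parameters
print(model_config['train'])

# example: shorten training for a quick experiment (assumed keys, check your config)
model_config['train']['epochs'] = 3
model_config['train']['batch_size'] = 16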

Train dataset format

To train the model, you need to have a txt-file with a dataset in the following format:

EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O

The source text is tokenized and tagged. For each token, there is a tag with BIO markup. Tags are separated from tokens with whitespace. Sentences are separated by empty lines.
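
As a quick sanity check for your data, the sketch below (plain Python, not DeepPavlov's dataset reader) parses a file in this format into parallel token and tag lists; the function name read_conll is ours.

[ ]:
def read_conll(path):
    """Read whitespace-separated token/tag lines into (tokens, tags) pairs per sentence."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                  # an empty line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()     # token and tag are separated by whitespace
            tokens.append(token)
            tags.append(tag)
    if tokens:                            # last sentence if the file has no trailing blank line
        sentences.append((tokens, tags))
    return sentences

# read_conll('contents/train.txt')[0]
# -> (['EU', 'rejects', 'the', ...], ['B-ORG', 'O', 'O', ...])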

Train the model using the new config

[ ]:
ner_model = train_model(model_config)

Use your model for prediction.

[ ]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])
[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]

6.2 Train your model from CLI

[ ]:
! python -m deeppavlov train ner_ontonotes_bert

7. NER-tags list

The table below lists all of the NER entity tags used in DeepPavlov's NER models.

PERSON        People, including fictional
NORP          Nationalities or religious or political groups
FACILITY      Buildings, airports, highways, bridges, etc.
ORGANIZATION  Companies, agencies, institutions, etc.
GPE           Countries, cities, states
LOCATION      Non-GPE locations, mountain ranges, bodies of water
PRODUCT       Vehicles, weapons, foods, etc. (not services)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK OF ART   Titles of books, songs, etc.
LAW           Named documents made into laws
LANGUAGE      Any named language
DATE          Absolute or relative dates or periods
TIME          Times smaller than a day
PERCENT       Percentage (including “%”)
MONEY         Monetary values, including unit
QUANTITY      Measurements, such as weight or distance
ORDINAL       “first”, “second”, etc.
CARDINAL      Numerals that do not fall under another type