Named Entity Recognition (NER)

Train and use the model

There are two main types of models available: standard RNN-based and BERT-based. For details about BERT-based models, see here. Any pre-trained model can be used for inference from both the Command Line Interface (CLI) and Python. Before using a model, make sure that all required packages are installed using the command:

python -m deeppavlov install ner_ontonotes_bert

To use a pre-trained model from the CLI, use the following command:

python deeppavlov/deep.py interact ner_ontonotes_bert [-d]

where ner_ontonotes_bert is the name of the config and -d is an optional download key. The key -d is used to download the pre-trained model along with embeddings and all other files needed to run the model. Other possible commands are train, evaluate, and download.
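
For example, with the same config name these commands follow the same pattern:

python deeppavlov/deep.py train ner_ontonotes_bert
python deeppavlov/deep.py evaluate ner_ontonotes_bert
python deeppavlov/deep.py download ner_ontonotes_bert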

Here is the list of all available configs:

Model                    Dataset          Language  Embeddings Size  Model Size  F1 score
ner_rus_bert             Collection3 [1]  Ru        700 MB           1.4 GB      98.1
ner_rus                  Collection3 [1]  Ru        1.0 GB           5.6 MB      95.1
ner_ontonotes_bert_mult  OntoNotes        Multi     700 MB           1.4 GB      88.8
ner_ontonotes_bert       OntoNotes        En        400 MB           800 MB      88.6
ner_ontonotes            OntoNotes        En        331 MB           7.8 MB      86.7
ner_conll2003_bert       CoNLL-2003       En        400 MB           850 MB      91.7
ner_conll2003            CoNLL-2003       En        331 MB           3.1 MB      89.9
ner_dstc2                DSTC2            En        -                626 KB      97.1

Models can be used from Python using the following code:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)

ner_model(['Bob Ross lived in Florida'])
>>> [[['Bob', 'Ross', 'lived', 'in', 'Florida']], [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]]

The model can also be trained from Python:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_ontonotes_bert)

The data for training should be placed in the folder provided in the config:

from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

config_dict = parse_config(configs.ner.ner_ontonotes_bert)

print(config_dict['dataset_reader']['data_path'])
>>> '~/.deeppavlov/downloads/ontonotes'

There must be three text files: train.txt, valid.txt, and test.txt. Furthermore, the data_path can be changed from code. The format of the data is described in the Training data section.
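
For example, here is a minimal sketch of pointing the config at a custom data folder before training, assuming train_model accepts a parsed config dictionary; the path ~/my_ner_data is a placeholder for a folder containing the three files:

from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

config_dict = parse_config(configs.ner.ner_ontonotes_bert)

# '~/my_ner_data' is a placeholder: a folder with train.txt, valid.txt, and test.txt
config_dict['dataset_reader']['data_path'] = '~/my_ner_data'

ner_model = train_model(config_dict)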

Multilingual BERT Zero-Shot Transfer

Multilingual BERT models make it possible to perform zero-shot transfer from one language to another. The model ner_ontonotes_bert_mult was trained on the OntoNotes corpus, which has 19 entity types in its markup schema. The model performance was evaluated on the Russian corpus Collection3 [1]. Results of the transfer are presented in the table below.

Tag    F1
TOTAL  79.39
PER    95.74
LOC    82.62
ORG    55.68

The following Python code can be used for inference:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

ner_model(['Curling World Championship will be held in Antananarivo'])
>>> [[['Curling', 'World', 'Championship', 'will', 'be', 'held', 'in', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'O', 'B-GPE']]]

ner_model(['Mistrzostwa Świata w Curlingu odbędą się w Antananarivo'])
>>> [[['Mistrzostwa', 'Świata', 'w', 'Curlingu', 'odbędą', 'się', 'w', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'B-GPE']]]

ner_model(['Чемпионат мира по кёрлингу пройдёт в Антананариву'])
>>> [[['Чемпионат', 'мира', 'по', 'кёрлингу', 'пройдёт', 'в', 'Антананариву']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'B-GPE']]]

The list of available tags and their descriptions is presented below.

PERSON        People, including fictional
NORP          Nationalities or religious or political groups
FACILITY      Buildings, airports, highways, bridges, etc.
ORGANIZATION  Companies, agencies, institutions, etc.
GPE           Countries, cities, states
LOCATION      Non-GPE locations, mountain ranges, bodies of water
PRODUCT       Vehicles, weapons, foods, etc. (not services)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK OF ART   Titles of books, songs, etc.
LAW           Named documents made into laws
LANGUAGE      Any named language
DATE          Absolute or relative dates or periods
TIME          Times smaller than a day
PERCENT       Percentage (including “%”)
MONEY         Monetary values, including unit
QUANTITY      Measurements, as of weight or distance
ORDINAL       “first”, “second”, etc.
CARDINAL      Numerals that do not fall under another type

NER task

Named Entity Recognition (NER) is one of the most common tasks in natural language processing. In most cases, the NER task can be formulated as follows:

Given a sequence of tokens (words, and possibly punctuation symbols), provide a tag from a predefined set of tags for each token in the sequence.

For the NER task, there are some common types of entities used as tags:

  • persons

  • locations

  • organizations

  • expressions of time

  • quantities

  • monetary values

Furthermore, to distinguish adjacent entities with the same tag, many applications use the BIO tagging scheme. Here “B” denotes the beginning of an entity, “I” stands for “inside” and is used for all words comprising the entity except the first one, and “O” means the absence of an entity. Example with dropped punctuation:

Bernhard        B-PER
Riemann         I-PER
Carl            B-PER
Friedrich       I-PER
Gauss           I-PER
and             O
Leonhard        B-PER
Euler           I-PER

In the example above, PER means the person tag, and “B-” and “I-” are prefixes identifying the beginnings and continuations of entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
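
To make the scheme concrete, the following self-contained sketch (illustrative code, not part of DeepPavlov) decodes a BIO-tagged sequence back into entity spans:

def bio_to_entities(tokens, tags):
    """Collect (entity_type, text) pairs from a BIO-tagged token sequence."""
    entities = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        inside = tag.startswith('I-') and tag[2:] == ent_type
        if not inside and start is not None:
            # a 'B-' tag, an 'O' tag, or a type switch closes the open entity
            entities.append((ent_type, ' '.join(tokens[start:i])))
            start, ent_type = None, None
        if tag.startswith('B-'):
            start, ent_type = i, tag[2:]
    if start is not None:
        entities.append((ent_type, ' '.join(tokens[start:])))
    return entities

tokens = ['Bernhard', 'Riemann', 'Carl', 'Friedrich', 'Gauss', 'and', 'Leonhard', 'Euler']
tags = ['B-PER', 'I-PER', 'B-PER', 'I-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
print(bio_to_entities(tokens, tags))
# [('PER', 'Bernhard Riemann'), ('PER', 'Carl Friedrich Gauss'), ('PER', 'Leonhard Euler')]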

Training data

To train the neural network, you need to have a dataset in the following format:

EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O

...

The source text is tokenized and tagged. For each token, there is a tag with BIO markup. Tags are separated from tokens with whitespace. Sentences are separated by empty lines.
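
As an illustration, a minimal reader for this format could look as follows; read_conll is a hypothetical helper, not a DeepPavlov function:

def read_conll(path):
    """Read a whitespace-separated token/tag file into a list of sentences,
    where each sentence is a list of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # an empty line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, tag = line.split()
                current.append((token, tag))
    if current:  # flush the last sentence if the file lacks a trailing empty line
        sentences.append(current)
    return sentences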

The dataset is a text file or a set of text files. The dataset must be split into three parts: train, validation, and test. The train set is used for training the network, namely adjusting the weights with gradient descent. The validation set is used for monitoring learning progress and early stopping. The test set is used for the final evaluation of model quality. A typical partition of a dataset into train, validation, and test is 80%, 10%, and 10%, respectively.
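
A minimal sketch of such a partition, assuming the sentences were already loaded into a list (for instance with the hypothetical read_conll above); the shuffle-then-slice approach is one common choice, not a DeepPavlov requirement:

import random

def split_dataset(sentences, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle sentences and split them into train, validation, and test parts."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n_train = int(len(sentences) * train_frac)
    n_valid = int(len(sentences) * valid_frac)
    return (sentences[:n_train],                      # train: 80% by default
            sentences[n_train:n_train + n_valid],     # validation: 10%
            sentences[n_train + n_valid:])            # test: the remainder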

Few-shot Language-Model based

It is possible to get a cold-start baseline from just a few samples of labeled data in a couple of seconds. The solution is based on a Language Model (LM) trained on an open-domain corpus. On top of the LM, an SVM classification layer is placed. It is possible to start with as few as 10 sentences containing entities of interest.

The data for training this model should be collected in the following way. Given a collection of N sentences without markup, sequentially mark up sentences until the total number of sentences with the entity of interest becomes equal to K. Both sentences with and without markup are used during training.
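
A sketch of this collection procedure; has_entity is a hypothetical predicate that checks whether a sentence mentions an entity of interest:

def collect_few_shot_data(sentences, has_entity, k=10):
    """Take sentences in order until k of them contain an entity of interest.
    Sentences without entities are kept too, since both kinds are used in training."""
    selected, n_with_entity = [], 0
    for sentence in sentences:
        selected.append(sentence)
        if has_entity(sentence):
            n_with_entity += 1
        if n_with_entity == k:
            break
    return selected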

Mean chunk-wise F1 scores for the Russian language on 10 sentences with entities:

Tag  F1
PER  84.85
LOC  68.41
ORG  32.63

(The total number of training sentences is larger and is determined by the distribution of sentences with and without entities.)

The model can be trained using the CLI:

python -m deeppavlov train ner_few_shot_ru

You have to provide the train.txt, valid.txt, and test.txt files in the format described in the Training data section. The files must be in the ner_few_shot_data folder, as described in the dataset_reader part of the config ner/ner_few_shot_ru_train.json.

To train and use the model from Python code, the following snippet can be used:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_few_shot_ru, download=True)

ner_model(['Example sentence'])

Warning! This model can take a lot of time and memory if the number of sentences is greater than 1000!

If a lot of data is available, the few-shot setting can be simulated with a special dataset_iterator. For this purpose, the config ner/ner_few_shot_ru_simulate.json can be used. The following code runs this simulation:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_few_shot_ru_simulate, download=True)

In this config the Collection dataset is used. However, if the files train.txt, valid.txt, and test.txt are present in the ner_few_shot_data folder, they will be used instead.

To use an existing few-shot model, the following Python interface can be used:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_few_shot_ru)

ner_model([['Example', 'sentence']])  # pre-tokenized input
ner_model(['Example sentence'])  # raw string input

Literature

[1] Mozharova V., Loukachevitch N. Two-stage approach in Russian named entity recognition. International FRUCT Conference on Intelligence, Social Media and Web, ISMW FRUCT 2016, Saint Petersburg, Russian Federation. DOI: 10.1109/FRUCT.2016.7584769.