Relation Extraction

1. Introduction to the task

Relation extraction (RE) is the task of detecting and classifying the relationship between two entities in text. DeepPavlov provides document-level relation extraction, meaning that a relation can be detected between entities that do not occur in the same sentence.

RE Model Architecture

We based our model on the Adaptive Thresholding and Localized Context Pooling model (ATLOP; Zhou et al., 2021) and used NER entity tags as an additional input. The two core ideas of this model, illustrated by the sketch after this list, are:

  • Adaptive Threshold

The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold (TH) class, which learns an entity-pair-dependent threshold value, is introduced and trained like all other classes. During prediction, the positive classes (i.e. the relations that actually hold in the sample) are those with logits higher than that of the TH class, while all others are negative.

  • Localized Context Pooling

The embedding of each entity pair is enhanced with an additional local context embedding related to both entities. Such a representation, attending to the parts of the document relevant to both entities, helps decide the relation for this particular entity pair. The attention heads of the pre-trained Transformer are used directly to locate this relevant context.
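
A minimal PyTorch sketch of both ideas follows. It is not DeepPavlov's actual implementation; the tensor names, shapes and the TH-class index are illustrative assumptions.

import torch

def localized_context_embedding(H, attn_subj, attn_obj):
    # Localized Context Pooling, simplified. H: (seq_len, hidden) token
    # embeddings from the last Transformer layer; attn_subj / attn_obj:
    # (seq_len,) attention of the subject's / object's mentions over all
    # tokens, taken directly from the Transformer's attention heads.
    q = attn_subj * attn_obj       # keep tokens attended by BOTH entities
    q = q / (q.sum() + 1e-10)      # renormalize to a distribution
    return H.t() @ q               # (hidden,) local context for this pair

def adaptive_threshold_predict(logits, th_index=0):
    # Adaptive Thresholding, simplified. logits: (n_classes,) scores for one
    # entity pair; the class at th_index is the learnable TH class. Positive
    # relations are exactly those scored above the TH class.
    positive = (logits > logits[th_index]).nonzero().flatten().tolist()
    return positive or None        # None: no relation holds for this pair

During training, the loss pushes the logits of the relations that hold above the TH logit and all others below it, so the threshold adapts to each entity pair.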

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation is available in the DeepPavlov documentation.

[ ]:
!pip install -q deeppavlov

Before using the model, make sure that all required packages are installed by running the command:

[ ]:
!python -m deeppavlov install re_docred

3. Models list

The table below lists all relation extraction models available in the DeepPavlov Library.

Config                               Language   Dataset
----------------------------------   --------   -------
relation_extraction/re_docred.json   En         DocRED
relation_extraction/re_rured.json    Ru         RuRED

Some details on the DocRED corpus the English RE model was trained on

The English RE model was trained on the DocRED corpus. It was constructed from Wikipedia and Wikidata and is currently the largest human-annotated dataset for document-level RE from plain text.

As the original DocRED test set contains only unlabeled data, while we need labeled data in order to perform evaluation, we decided to:

1. merge the train and dev data (i.e. the labeled data);
2. split it into new train, dev and test datasets.

Currently, two types of splitting are supported:

  • the user can set the relative size of the dev and test data (e.g. 1/7)

  • the user can set the absolute size of the dev and test data (e.g. 2000 samples)

In our experiments, we set the absolute size of the dev and test data to 150 initial documents each, which resulted in approximately 3500 samples per set.

We additionally generated negative samples where necessary to obtain the following proportions (a sketch of this sampling follows the tables below):

  • for the train set: twice as many negative samples as positive ones;

  • for the dev & test sets: as many negative samples as positive ones.

Train    Dev    Test
------   ----   ----
130650   3406   3545

Train Positive   Train Negative   Dev Positive   Dev Negative   Test Positive   Test Negative
--------------   --------------   ------------   ------------   -------------   -------------
44823            89214            1239           1229           1043            1036
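
A minimal sketch of this negative sampling, assuming a sample is an ordered entity pair and positive_pairs lists the annotated ones (the helper below is hypothetical, not the library's code):

import itertools
import random

def sample_negative_pairs(n_entities, positive_pairs, ratio, seed=42):
    # Candidate negatives: all ordered entity pairs with no annotated relation.
    positives = set(positive_pairs)
    candidates = [pair for pair in itertools.permutations(range(n_entities), 2)
                  if pair not in positives]
    random.Random(seed).shuffle(candidates)
    # ratio=2 reproduces the train proportion above, ratio=1 the dev/test one.
    return candidates[:int(ratio * len(positives))]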

Some details on the RuRED corpus the Russian RE model was trained on

In the case of RuRED, we used the train, dev and test sets from the original RuRED setting. We additionally generated negative samples where necessary to obtain the following proportions:

  • for the train set: twice as many negative samples as positive ones;

  • for the dev & test sets: as many negative samples as positive ones.

Train   Dev    Test
-----   ----   ----
12855   1076   1072

Train Positive   Train Negative   Dev Positive   Dev Negative   Test Positive   Test Negative
--------------   --------------   ------------   ------------   -------------   -------------
4285             8570             538            538            536             536

4. Use the model for prediction

4.1 Predict using Python

English

[ ]:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_docred, download=False)  # set download=True on first run
[ ]:
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
[['P26'], ['spouse']]

Model Input:

  • list of tokens of a text document

  • list of entity positions (i.e. all start and end positions of both entities’ mentions)

  • list of NER tags of both entities.

As NER tags, we adopted the ones used in the DocRED corpus, which are, in turn, inherited from Tjong Kim Sang and De Meulder (2003).

The whole list of 6 English NER tags:

Tag    Description
----   -----------
PER    People, including fictional
ORG    Companies, universities, institutions, political or religious groups, etc.
LOC    Geographically defined locations (mountains, waters, etc.); politically defined locations (countries, cities, states, streets, etc.); facilities (buildings, museums, stadiums, hospitals, factories, airports, etc.)
TIME   Absolute or relative dates or periods
NUM    Percents, money, quantities
MISC   Products (vehicles, weapons, etc.); events (elections, battles, sporting events, etc.); laws, cases, languages, etc.

Model Output: one or several of the 97 relations found between the given entities; each is returned as its Wikidata relation id (e.g. ‘P26’) and its relation name (‘spouse’).

Russian

[ ]:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_rured)  # pass download=True on first run
[ ]:
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
entity_pos = [[[(0, 2)], [(4, 5)]]]
entity_tags = [["PERSON", "CITY"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
[['P495'], ['страна происхождения']]

Model Input:

  • list of tokens of a text document

  • list of entity positions (i.e. all start and end positions of both entities’ mentions)

  • list of NER tags of both entities.

Model Output: one or several of the 30 relations found between the given entities; a Russian relation name (e.g. “участник”, ‘participant’), or an English one if no Russian name is available, and, if applicable, its id in Wikidata (e.g. ‘P710’).

4.2 Predict using CLI

You can also get predictions in an interactive mode through CLI.

[ ]:
! python -m deeppavlov interact re_docred [-d]
! python -m deeppavlov interact re_rured [-d]

-d is an optional download flag (an alternative to download=True in Python code). It is used to download the pre-trained model along with embeddings and all other files needed to run it.

5. Customize the model

5.1 Description of config parameters

Parameters of re_preprocessor component:

  • ner_tags: List[str] - NER tags of the entities, which are one-hot encoded and concatenated to the entity embeddings in the output of the Transformer;

  • special_token: str - the token added before and after the mentions of the entities (the subject and object of the triplet);

  • default_tag: str - the default NER tag used if no tags are provided;

  • do_lower_case: bool - set to True if lowercasing is needed.

Parameters of re_classifier component:

  • n_classes: int - the number of relations the model supports;

  • num_ner_tags: int - the number of NER tags;

  • return_probas: bool - whether to return confidences of the predicted relations.

Parameters of re_postprocessor component:

  • rel2id_path: str - the file with the mapping of knowledge-base relation IDs to relation numbers (for example, “P19”: 24);

  • rel2label_path: str - the file with mapping of relation IDs to relation labels.
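
A config can also be loaded and adjusted in Python before building the model. The sketch below is hedged: it assumes the components sit in config['chainer']['pipe'] and are identified by class_name, which may differ between DeepPavlov versions.

from deeppavlov import build_model
from deeppavlov.core.commands.utils import parse_config

config = parse_config('re_docred')
for component in config['chainer']['pipe']:
    if component.get('class_name') == 're_classifier':
        component['return_probas'] = True  # return confidences, not labels

re_model = build_model(config, download=True)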

5.2 Train Relation Extraction on custom data

There are two kinds of dataset readers for relation extraction in the DeepPavlov Library:

  • docred_reader, which takes into account the partition of the text into sentences and multiple mentions of a single entity in the text;

  • rured_reader, a simplified dataset reader.

Train with docred_reader

You should prepare train_annotated.json, dev.json, test.json in the following format:

{
  "vertexSet": [
    [
      {
        "name": entity1_mention1,
        "pos": [mention1 start token index, mention1 end token index],
        "sent_id": ID of the sentence with the entity1 mention1,
        "type": ner tag
      },
      {
        "name": entity1_mention2,
        ...
      },
      ...
    ],
    [ ... ]
  ],
  "labels": [
    {
      "r": relation ID,
      "h": index of head entity of the triplet in the vertexSet list,
      "t": index of tail entity of the triplet in the vertexSet list,
      "evidence": [
        indices of the sentences with the triplet
      ]
    },
    ...
  ],
  "title": doc title,
  "sentences": [
    list of tokens of sentence 1,
    list of tokens of sentence 2,
    ...
  ],
  ...
}

For example,

{
  "vertexSet": [
    [
      {
        "name": "Elon Musk",
        "pos": [0, 2],
        "sent_id": 0,
        "type": "PER"
      }
    ],
    [
      {
        "name": "Seattle",
        "pos": [4, 5],
        "sent_id": 0,
        "type": "CITY"
      }
    ]
  ],
  "labels": [
    {
      "r": "P551",
      "h": 0,
      "t": 1,
      "evidence": [0]
    }
  ],
  "title": "title1",
  "sentences": [
    ["Elon", "Musk", "lives", "in", "Seattle", "."]
  ]
}
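
To create the split files programmatically, dump a list of such documents into each file (assuming, as in the original DocRED release, that every file holds a JSON list of documents):

import json

doc = {
    "vertexSet": [
        [{"name": "Elon Musk", "pos": [0, 2], "sent_id": 0, "type": "PER"}],
        [{"name": "Seattle", "pos": [4, 5], "sent_id": 0, "type": "CITY"}],
    ],
    "labels": [{"r": "P551", "h": 0, "t": 1, "evidence": [0]}],
    "title": "title1",
    "sentences": [["Elon", "Musk", "lives", "in", "Seattle", "."]],
}

for name in ("train_annotated.json", "dev.json", "test.json"):
    with open(name, "w") as f:
        json.dump([doc], f)  # each file: a JSON list of documents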

Train with rured_reader

You should prepare train.json, dev.json, test.json in the following format:

{
    "token": list of text tokens,
    "relation": relation ID,
    "subj_start": index of the token of the subject start in the list,
    "subj_end": index of the token of the subject end in the list,
    "obj_start": index of the token of the object start in the list,
    "obj_end": index of the token of the object end in the list,
    "subj_type": ner tag of the subject entity,
    "obj_type": ner tag of the object entity,
},

for example:

{
    "token": ["Илон", "Маск", "живет", "в", "Сиэттле", "."],
    "relation": "P551",
    "subj_start": 0,
    "subj_end": 2,
    "obj_start": 4,
    "obj_end": 5,
    "subj_type": "PERSON",
    "obj_type": "CITY"
}
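
Analogously for rured_reader, assuming each split file holds a JSON list of such samples:

import json

sample = {
    "token": ["Илон", "Маск", "живет", "в", "Сиэттле", "."],
    "relation": "P551",
    "subj_start": 0, "subj_end": 2,
    "obj_start": 4, "obj_end": 5,
    "subj_type": "PERSON", "obj_type": "CITY",
}

for name in ("train.json", "dev.json", "test.json"):
    with open(name, "w", encoding="utf-8") as f:
        json.dump([sample], f, ensure_ascii=False)  # keep Cyrillic readable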

Train the model using Python:

[ ]:
from deeppavlov import train_model

train_model("re_docred")

or using CLI:

[ ]:
! python -m deeppavlov train re_docred

6. Relations list

6.1 Relations used in the English model

Relation id   Relation
-----------   --------
P6            head of government
P17           country
P19           place of birth
P20           place of death
P22           father
P25           mother
P26           spouse
P27           country of citizenship
P30           continent
P31           instance of
P35           head of state
P36           capital
P37           official language
P39           position held
P40           child
P50           author
P54           member of sports team
P57           director
P58           screenwriter
P69           educated at
P86           composer
P102          member of political party
P108          employer
P112          founded by
P118          league
P123          publisher
P127          owned by
P131          located in the administrative territorial entity
P136          genre
P137          operator
P140          religion
P150          contains administrative territorial entity
P155          follows
P156          followed by
P159          headquarters location
P161          cast member
P162          producer
P166          award received
P170          creator
P171          parent taxon
P172          ethnic group
P175          performer
P176          manufacturer
P178          developer
P179          series
P190          sister city
P194          legislative body
P205          basin country
P206          located in or next to body of water
P241          military branch
P264          record label
P272          production company
P276          location
P279          subclass of
P355          subsidiary
P361          part of
P364          original language of work
P400          platform
P403          mouth of the watercourse
P449          original network
P463          member of
P488          chairperson
P495          country of origin
P527          has part
P551          residence
P569          date of birth
P570          date of death
P571          inception
P576          dissolved, abolished or demolished
P577          publication date
P580          start time
P582          end time
P585          point in time
P607          conflict
P674          characters
P676          lyrics by
P706          located on terrain feature
P710          participant
P737          influenced by
P740          location of formation
P749          parent organization
P800          notable work
P807          separated from
P840          narrative location
P937          work location
P1001         applies to jurisdiction
P1056         product or material produced
P1198         unemployment rate
P1336         territory claimed by
P1344         participant of
P1365         replaces
P1366         replaced by
P1376         capital of
P1412         languages spoken, written or signed
P1441         present in work
P3373         sibling

6.2 Relations used in the Russian model

Relation                    Relation id   Russian relation
-------------------------   -----------   ----------------
MEMBER                      P710          участник
WORKS_AS                    P106          род занятий
WORKPLACE
OWNERSHIP                   P1830         владеет
SUBORDINATE_OF
TAKES_PLACE_IN              P276          местонахождение
EVENT_TAKES_PART_IN         P1344         участвовал в
SELLS_TO
ALTERNATIVE_NAME
HEADQUARTERED_IN            P159          расположение штаб-квартиры
PRODUCES                    P1056         продукция
ABBREVIATION
DATE_DEFUNCT_IN             P576          дата прекращения существования
SUBEVENT_OF                 P361          часть от
DATE_FOUNDED_IN             P571          дата основания/создания/возн-я
DATE_TAKES_PLACE_ON         P585          момент времени
NUMBER_OF_EMPLOYEES_FIRED
ORIGINS_FROM                P495          страна происхождения
ACQUINTANCE_OF
PARENT_OF                   P40           дети
ORGANIZES                   P664          организатор
FOUNDED_BY                  P112          основатель
PLACE_RESIDES_IN            P551          место жительства
BORN_IN                     P19           место рождения
AGE_IS
RELATIVE
NUMBER_OF_EMPLOYEES         P1128         число сотрудников
SIBLING                     P3373         брат/сестра
DATE_OF_BIRTH               P569          дата рождения