Relation Extraction

1. Introduction to the task

Relation extraction (RE) is the task of detecting and classifying the relationship between two entities in text. DeepPavlov provides document-level relation extraction, meaning that a relation can be detected between entities that do not occur in the same sentence.

RE Model Architecture

We based our model on the Adaptive Thresholding and Localized Context Pooling model (ATLOP; Zhou et al., 2021) and used NER entity tags as an additional input. The two core ideas of this model, illustrated by the sketch after this list, are:

  • Adaptive Threshold

The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold (TH) class, which learns an entity-pair-dependent threshold value, is introduced and trained like all other classes. During prediction, the positive classes (i.e. the relations that actually hold in the sample) are those with logits higher than that of the TH class, while all others are negative.

  • Localized Context Pooling

The embedding of each entity pair is enhanced with an additional local context embedding related to both entities. Such a representation, attending to the parts of the document relevant to both entities, helps decide the relation for this particular entity pair. The attention heads of the pre-trained Transformer are used directly to locate this relevant context.
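
A minimal PyTorch sketch of both ideas follows. It is not DeepPavlov's actual implementation; the tensor names, shapes and the TH-class index are illustrative assumptions.

import torch

def localized_context_embedding(H, attn_subj, attn_obj):
    # Localized Context Pooling, simplified. H: (seq_len, hidden) token
    # embeddings from the last Transformer layer; attn_subj / attn_obj:
    # (seq_len,) attention of the subject's / object's mentions over all
    # tokens, taken directly from the Transformer's attention heads.
    q = attn_subj * attn_obj       # keep tokens attended by BOTH entities
    q = q / (q.sum() + 1e-10)      # renormalize to a distribution
    return H.t() @ q               # (hidden,) local context for this pair

def adaptive_threshold_predict(logits, th_index=0):
    # Adaptive Thresholding, simplified. logits: (n_classes,) scores for one
    # entity pair; the class at th_index is the learnable TH class. Positive
    # relations are exactly those scored above the TH class.
    positive = (logits > logits[th_index]).nonzero().flatten().tolist()
    return positive or None        # None: no relation holds for this pair

During training, the loss pushes the logits of the relations that hold above the TH logit and all others below it, so the threshold adapts to each entity pair.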

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation is available in the DeepPavlov documentation.

[ ]:
!pip install -q deeppavlov

Before using the model, make sure that all required packages are installed by running the command:

[ ]:
!python -m deeppavlov install re_docred

3. Models list

The table below lists all relation extraction models available in the DeepPavlov Library.

Config                               Language   Dataset
----------------------------------   --------   -------
relation_extraction/re_docred.json   En         DocRED
relation_extraction/re_rured.json    Ru         RuRED

Some details on the DocRED corpus the English RE model was trained on

The English RE model was trained on the DocRED corpus. It was constructed from Wikipedia and Wikidata and is currently the largest human-annotated dataset for document-level RE from plain text.

As the original DocRED test set contains only unlabeled data, while we need labeled data in order to perform evaluation, we decided to:

1. merge the train and dev data (i.e. the labeled data);
2. split it into new train, dev and test datasets.

Currently, two types of splitting are supported:

  • the user can set the relative size of the dev and test data (e.g. 1/7)

  • the user can set the absolute size of the dev and test data (e.g. 2000 samples)

In our experiments, we set the absolute size of the dev and test data to 150 initial documents each, which resulted in approximately 3500 samples per set.

We additionally generated negative samples where necessary to obtain the following proportions (a sketch of this sampling follows the tables below):

  • for the train set: twice as many negative samples as positive ones;

  • for the dev & test sets: as many negative samples as positive ones.

Train    Dev    Test
------   ----   ----
130650   3406   3545

Train Positive   Train Negative   Dev Positive   Dev Negative   Test Positive   Test Negative
--------------   --------------   ------------   ------------   -------------   -------------
44823            89214            1239           1229           1043            1036
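
A minimal sketch of this negative sampling, assuming a sample is an ordered entity pair and positive_pairs lists the annotated ones (the helper below is hypothetical, not the library's code):

import itertools
import random

def sample_negative_pairs(n_entities, positive_pairs, ratio, seed=42):
    # Candidate negatives: all ordered entity pairs with no annotated relation.
    positives = set(positive_pairs)
    candidates = [pair for pair in itertools.permutations(range(n_entities), 2)
                  if pair not in positives]
    random.Random(seed).shuffle(candidates)
    # ratio=2 reproduces the train proportion above, ratio=1 the dev/test one.
    return candidates[:int(ratio * len(positives))]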

Some details on the RuRED corpus the Russian RE model was trained on

In the case of RuRED, we used the train, dev and test sets from the original RuRED setting. We additionally generated negative samples where necessary to obtain the following proportions:

  • for the train set: twice as many negative samples as positive ones;

  • for the dev & test sets: as many negative samples as positive ones.

Train   Dev    Test
-----   ----   ----
12855   1076   1072

Train Positive   Train Negative   Dev Positive   Dev Negative   Test Positive   Test Negative
--------------   --------------   ------------   ------------   -------------   -------------
4285             8570             538            538            536             536

4. Use the model for prediction

4.1 Predict using Python

English

[ ]:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_docred, download=False)  # set download=True on first run
[ ]:
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
[['P26'], ['spouse']]

Model Input:

  • list of tokens of a text document

  • list of entity positions (i.e. all start and end positions of both entities’ mentions)

  • list of NER tags of both entities.

As NER tags, we adopted the ones used in the DocRED corpus, which are, in turn, inherited from Tjong Kim Sang and De Meulder (2003).

The whole list of 6 English NER tags:

Tag    Description
----   -----------
PER    People, including fictional
ORG    Companies, universities, institutions, political or religious groups, etc.
LOC    Geographically defined locations (mountains, waters, etc.); politically defined locations (countries, cities, states, streets, etc.); facilities (buildings, museums, stadiums, hospitals, factories, airports, etc.)
TIME   Absolute or relative dates or periods
NUM    Percents, money, quantities
MISC   Products (vehicles, weapons, etc.); events (elections, battles, sporting events, etc.); laws, cases, languages, etc.

Model Output: one or several of the 97 relations found between the given entities; each is returned as its Wikidata relation id (e.g. ‘P26’) and its relation name (‘spouse’).

Russian

[ ]:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_rured)  # pass download=True on first run
[ ]:
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
entity_pos = [[[(0, 2)], [(4, 5)]]]
entity_tags = [["PERSON", "CITY"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
[['P495'], ['страна происхождения']]

Model Input:

  • list of tokens of a text document

  • list of entity positions (i.e. all start and end positions of both entities’ mentions)

  • list of NER tags of both entities.

Model Output: one or several of the 30 relations found between the given entities; a Russian relation name (e.g. “участник”, ‘participant’), or an English one if no Russian name is available, and, if applicable, its id in Wikidata (e.g. ‘P710’).

4.2 Predict using CLI

You can also get predictions in an interactive mode through CLI.

[ ]:
! python -m deeppavlov interact re_docred [-d]
! python -m deeppavlov interact re_rured [-d]

-d is an optional download flag (an alternative to download=True in Python code). It is used to download the pre-trained model along with embeddings and all other files needed to run it.

5. Customize the model

5.1 Description of config parameters

Parameters of re_preprocessor component:

  • ner_tags: List[str] - NER tags of the entities, which are one-hot encoded and concatenated to the entity embeddings in the output of the Transformer;

  • special_token: str - the token added before and after the mentions of the entities (the subject and object of the triplet);

  • default_tag: str - the default NER tag used if no tags are provided;

  • do_lower_case: bool - set to True if lowercasing is needed.

Parameters of re_classifier component:

  • n_classes: int - the number of relations the model supports;

  • num_ner_tags: int - the number of NER tags;

  • return_probas: bool - whether to return confidences of the predicted relations.

Parameters of re_postprocessor component:

  • rel2id_path: str - the file with the mapping of knowledge-base relation IDs to relation numbers (for example, “P19”: 24);

  • rel2label_path: str - the file with mapping of relation IDs to relation labels.
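
A config can also be loaded and adjusted in Python before building the model. The sketch below is hedged: it assumes the components sit in config['chainer']['pipe'] and are identified by class_name, which may differ between DeepPavlov versions.

from deeppavlov import build_model
from deeppavlov.core.commands.utils import parse_config

config = parse_config('re_docred')
for component in config['chainer']['pipe']:
    if component.get('class_name') == 're_classifier':
        component['return_probas'] = True  # return confidences, not labels

re_model = build_model(config, download=True)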

5.2 Train Relation Extraction on custom data

There are two kinds of dataset readers for relation extraction in the DeepPavlov Library:

  • docred_reader, which takes into account the partition of the text into sentences and multiple mentions of a single entity in the text;

  • rured_reader, a simplified dataset reader.

Train with docred_reader

You should prepare train_annotated.json, dev.json, test.json in the following format:

{
  "vertexSet": [
    [
      {
        "name": entity1_mention1,
        "pos": [mention1 start token index, mention1 end token index],
        "sent_id": ID of the sentence with the entity1 mention1,
        "type": ner tag
      },
      {
        "name": entity1_mention2,
        ...
      },
      ...
    ],
    [ ... ]
  ],
  "labels": [
    {
      "r": relation ID,
      "h": index of head entity of the triplet in the vertexSet list,
      "t": index of tail entity of the triplet in the vertexSet list,
      "evidence": [
        indices of the sentences with the triplet
      ]
    },
    ...
  ],
  "title": doc title,
  "sentences": [
    list of tokens of sentence 1,
    list of tokens of sentence 2,
    ...
  ],
  ...
}

For example,

{
  "vertexSet": [
    [
      {
        "name": "Elon Musk",
        "pos": [0, 2],
        "sent_id": 0,
        "type": "PER"
      }
    ],
    [
      {
        "name": "Seattle",
        "pos": [4, 5],
        "sent_id": 0,
        "type": "CITY"
      }
    ]
  ],
  "labels": [
    {
      "r": "P551",
      "h": 0,
      "t": 1,
      "evidence": [0]
    }
  ],
  "title": "title1",
  "sentences": [
    ["Elon", "Musk", "lives", "in", "Seattle", "."]
  ]
}
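
To create the split files programmatically, dump a list of such documents into each file (assuming, as in the original DocRED release, that every file holds a JSON list of documents):

import json

doc = {
    "vertexSet": [
        [{"name": "Elon Musk", "pos": [0, 2], "sent_id": 0, "type": "PER"}],
        [{"name": "Seattle", "pos": [4, 5], "sent_id": 0, "type": "CITY"}],
    ],
    "labels": [{"r": "P551", "h": 0, "t": 1, "evidence": [0]}],
    "title": "title1",
    "sentences": [["Elon", "Musk", "lives", "in", "Seattle", "."]],
}

for name in ("train_annotated.json", "dev.json", "test.json"):
    with open(name, "w") as f:
        json.dump([doc], f)  # each file: a JSON list of documents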

Train with rured_reader

You should prepare train.json, dev.json, test.json in the following format:

{
    "token": list of text tokens,
    "relation": relation ID,
    "subj_start": index of the token of the subject start in the list,
    "subj_end": index of the token of the subject end in the list,
    "obj_start": index of the token of the object start in the list,
    "obj_end": index of the token of the object end in the list,
    "subj_type": ner tag of the subject entity,
    "obj_type": ner tag of the object entity,
},

for example:

{
    "token": ["Илон", "Маск", "живет", "в", "Сиэттле", "."],
    "relation": "P551",
    "subj_start": 0,
    "subj_end": 2,
    "obj_start": 4,
    "obj_end": 5,
    "subj_type": "PERSON",
    "obj_type": "CITY"
}
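
Analogously for rured_reader, assuming each split file holds a JSON list of such samples:

import json

sample = {
    "token": ["Илон", "Маск", "живет", "в", "Сиэттле", "."],
    "relation": "P551",
    "subj_start": 0, "subj_end": 2,
    "obj_start": 4, "obj_end": 5,
    "subj_type": "PERSON", "obj_type": "CITY",
}

for name in ("train.json", "dev.json", "test.json"):
    with open(name, "w", encoding="utf-8") as f:
        json.dump([sample], f, ensure_ascii=False)  # keep Cyrillic readable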

Train the model using Python:

[ ]:
from deeppavlov import train_model

train_model("re_docred")

or using CLI:

[ ]:
! python -m deeppavlov train re_docred

6. Relations list

6.1 Relations used in the English model

Relation id   Relation
-----------   --------
P6            head of government
P17           country
P19           place of birth
P20           place of death
P22           father
P25           mother
P26           spouse
P27           country of citizenship
P30           continent
P31           instance of
P35           head of state
P36           capital
P37           official language
P39           position held
P40           child
P50           author
P54           member of sports team
P57           director
P58           screenwriter
P69           educated at
P86           composer
P102          member of political party
P108          employer
P112          founded by
P118          league
P123          publisher
P127          owned by
P131          located in the administrative territorial entity
P136          genre
P137          operator
P140          religion
P150          contains administrative territorial entity
P155          follows
P156          followed by
P159          headquarters location
P161          cast member
P162          producer
P166          award received
P170          creator
P171          parent taxon
P172          ethnic group
P175          performer
P176          manufacturer
P178          developer
P179          series
P190          sister city
P194          legislative body
P205          basin country
P206          located in or next to body of water
P241          military branch
P264          record label
P272          production company
P276          location
P279          subclass of
P355          subsidiary
P361          part of
P364          original language of work
P400          platform
P403          mouth of the watercourse
P449          original network
P463          member of
P488          chairperson
P495          country of origin
P527          has part
P551          residence
P569          date of birth
P570          date of death
P571          inception
P576          dissolved, abolished or demolished
P577          publication date
P580          start time
P582          end time
P585          point in time
P607          conflict
P674          characters
P676          lyrics by
P706          located on terrain feature
P710          participant
P737          influenced by
P740          location of formation
P749          parent organization
P800          notable work
P807          separated from
P840          narrative location
P937          work location
P1001         applies to jurisdiction
P1056         product or material produced
P1198         unemployment rate
P1336         territory claimed by
P1344         participant of
P1365         replaces
P1366         replaced by
P1376         capital of
P1412         languages spoken, written or signed
P1441         present in work
P3373         sibling

6.2 Relations used in the Russian model

Relation                    Relation id   Russian relation
-------------------------   -----------   ----------------
MEMBER                      P710          участник
WORKS_AS                    P106          род занятий
WORKPLACE
OWNERSHIP                   P1830         владеет
SUBORDINATE_OF
TAKES_PLACE_IN              P276          местонахождение
EVENT_TAKES_PART_IN         P1344         участвовал в
SELLS_TO
ALTERNATIVE_NAME
HEADQUARTERED_IN            P159          расположение штаб-квартиры
PRODUCES                    P1056         продукция
ABBREVIATION
DATE_DEFUNCT_IN             P576          дата прекращения существования
SUBEVENT_OF                 P361          часть от
DATE_FOUNDED_IN             P571          дата основания/создания/возн-я
DATE_TAKES_PLACE_ON         P585          момент времени
NUMBER_OF_EMPLOYEES_FIRED
ORIGINS_FROM                P495          страна происхождения
ACQUINTANCE_OF
PARENT_OF                   P40           дети
ORGANIZES                   P664          организатор
FOUNDED_BY                  P112          основатель
PLACE_RESIDES_IN            P551          место жительства
BORN_IN                     P19           место рождения
AGE_IS
RELATIVE
NUMBER_OF_EMPLOYEES         P1128         число сотрудников
SIBLING                     P3373         брат/сестра
DATE_OF_BIRTH               P569          дата рождения