dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader[source]

Class for reading classification datasets in .csv or .json format

read(data_path: str, url: str = None, format: str = 'csv', class_sep: str = None, *args, **kwargs) → dict[source]

Read dataset from data_path directory. The files to be read are all data_types + extension (i.e. for data_types=[“train”, “valid”] the files “train.csv” and “valid.csv” from data_path will be read)

Parameters
  • data_path – directory with files

  • url – URL to download data files from if data_path does not exist or is empty

  • format – extension of files. Possible values: "csv", "json"

  • class_sep – string separator of labels in the column with labels. Default: None -> only one class per sample

  • sep (str) – delimiter for "csv" files

  • header (int) – row number to use as the column names

  • names (array) – list of column names to use

  • orient (str) – indication of expected JSON string format

  • lines (boolean) – read the file as one JSON object per line. Default: False

Returns

dictionary with fields from data_types. Each field is a list of (x_i, y_i) tuples
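A minimal usage sketch (the directory layout and file contents are hypothetical):

    from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

    reader = BasicClassificationDatasetReader()
    # assumes ./my_data contains train.csv and valid.csv;
    # sep is forwarded to the csv parsing as documented above
    data = reader.read(data_path="./my_data", format="csv", class_sep=",", sep=",")
    x_i, y_i = data["train"][0]  # a single (text, labels) tuple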

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader[source]

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications were made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (consisting of two of the policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'dstc2-trn.jsonlist', 'valid' field with dialogs from 'dstc2-val.jsonlist' and 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
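A minimal usage sketch (the target directory is hypothetical; the first call triggers the download):

    from deeppavlov.dataset_readers.dstc2_reader import DSTC2DatasetReader

    data = DSTC2DatasetReader.read(data_path="./dstc2", dialogs=False)
    print(list(data.keys()))      # ['train', 'valid', 'test']
    x_i, y_i = data["train"][0]   # a single (x_i, y_i) turn pair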

class deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The same modifications as in DSTC2DatasetReader were applied to the original dataset (see the list above).

classmethod read(data_path: str, dialogs: bool = False, encoding='utf-8') → Dict[str, List][source]

Downloads the 'simple_dstc2.tar.gz' archive from the internet, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

  • encoding – encoding of the dataset files (default: 'utf-8')

Returns

dictionary that contains 'train' field with dialogs from 'simple-dstc2-trn.json', 'valid' field with dialogs from 'simple-dstc2-val.json' and 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge(domain_knowledge_di: Dict)[source]

A DTO-like class that stores the domain knowledge from the domain yaml config.

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.MD_YAML_DialogsDatasetReader[source]

Reads dialogs from a dataset composed of stories.md, nlu.md and domain.yml.

stories.md provides the dialogue dataset for the model to train on. The dialogues are represented as sequences of user message labels and system response labels (not texts, just action labels). This separates the NLU and NLG tasks from the actual dialogue scripting: one should be able to describe just the scripts of dialogues to the system.

nlu.md, by contrast, provides the NLU training set irrespective of the dialogue scripts.

domain.yml describes the task-specific domain and serves two purposes: it provides the NLG templates and some specific configuration of the NLU.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]
Parameters
  • data_path – path to read dataset from

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'stories-trn.md', 'valid' field with dialogs from 'stories-val.md' and 'test' field with dialogs from 'stories-tst.md'. Each field is a list of tuples (x_i, y_i).
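A minimal usage sketch (the directory is hypothetical and is expected to contain the stories-*.md, nlu.md and domain.yml files described above):

    from deeppavlov.dataset_readers.md_yaml_dialogs_reader import MD_YAML_DialogsDatasetReader

    data = MD_YAML_DialogsDatasetReader.read(data_path="./dialogs_data", dialogs=False)
    x_i, y_i = data["train"][0]   # a single (x_i, y_i) pair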

class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader[source]

Reader for FAQ dataset

read(data_path: str = None, data_url: str = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]

Read an FAQ dataset from a specified csv file or a remote url

Parameters
  • data_path – path to csv file of FAQ

  • data_url – url to csv file of FAQ

  • x_col_name – name of Question column in csv file

  • y_col_name – name of Answer column in csv file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
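A minimal usage sketch (the csv file and its column names are hypothetical):

    from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader

    reader = FaqDatasetReader()
    data = reader.read(data_path="faq.csv", x_col_name="Question", y_col_name="Answer")
    question, answer = data["train"][0]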

class deeppavlov.dataset_readers.file_paths_reader.FilePathsReader[source]

Find all file paths by a data path glob

read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict[source]

Find all file paths by a data path glob

Parameters
  • data_path – directory with data

  • train – data path glob relative to data_path

  • valid – data path glob relative to data_path

  • test – data path glob relative to data_path

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
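A minimal usage sketch (the directory layout and globs are hypothetical):

    from deeppavlov.dataset_readers.file_paths_reader import FilePathsReader

    reader = FilePathsReader()
    # collects ./data/train/*.txt into "train" and ./data/valid/*.txt into "valid"
    data = reader.read(data_path="./data", train="train/*.txt", valid="valid/*.txt")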

class deeppavlov.dataset_readers.insurance_reader.InsuranceReader[source]

The class to read the InsuranceQA V1 dataset from files.

Please see https://github.com/shuzi/insuranceQA.

class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader[source]

A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.

Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.

For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'kvret_public.tar.gz' archive, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save data

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json' and 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).

class deeppavlov.dataset_readers.line_reader.LineReader[source]

Read a txt file line by line

read(data_path: str = None, *args, **kwargs) → Dict[source]

Read lines from txt file

Parameters

data_path – path to txt file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
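A minimal usage sketch (the file name is hypothetical):

    from deeppavlov.dataset_readers.line_reader import LineReader

    reader = LineReader()
    data = reader.read(data_path="sentences.txt")
    lines = data["train"]   # presumably the lines land under the "train" key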

class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader[source]

Class to read training datasets in UD format

read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]

Reads UD dataset from data_path.

Parameters
  • data_path – either (1) a directory containing files, in which case the file for data_type mode is data_path / {language}-ud-{mode}.conllu, or (2) a list of files containing the same number of items as data_types

  • language – a language used to determine the filename when it is not given explicitly

  • data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned

Returns

a dictionary containing dataset fragments (see read_infile) for given data types

deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str[source]

Extracts language from typical UD filename

deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]][source]

Reads input file in CONLL-U format

Parameters
  • infile – a path to a file

  • word_column – column containing words (default=1)

  • pos_column – column containing part-of-speech labels (default=3)

  • tag_column – column containing fine-grained tags (default=5)

  • head_column – column containing syntactic head position (default=6)

  • dep_column – column containing syntactic dependency label (default=7)

  • max_sents – maximal number of sentences to read

  • read_only_words – whether to read only words

  • read_syntax – whether to return heads and deps alongside tags. Ignored if read_only_words is True

Returns

a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
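A minimal usage sketch of read_infile with default arguments (the CoNLL-U file name is hypothetical):

    from deeppavlov.dataset_readers.morphotagging_dataset_reader import read_infile

    sentences = read_infile("en_ewt-ud-train.conllu")
    words, tags = sentences[0]   # a word sequence and its tag sequence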

class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader[source]

The class to read the paraphraser.ru dataset from files.

Please see https://paraphraser.ru.

class deeppavlov.dataset_readers.paraphraser_pretrain_reader.ParaphraserPretrainReader[source]

The class to read the pretraining dataset for the paraphrase identification task from files.

class deeppavlov.dataset_readers.quora_question_pairs_reader.QuoraQuestionPairsReader[source]

The class to read the Quora Question Pairs dataset from files.

Please see https://www.kaggle.com/c/quora-question-pairs/data.

Parameters
  • data_path – A path to a folder with dataset files.

  • seed – Random seed.

class deeppavlov.dataset_readers.siamese_reader.SiameseReader[source]

The class to read dataset for ranking or paraphrase identification with Siamese networks.

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader[source]

Downloads dataset files and prepares train/valid split.

SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html

MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from the original Wikipedia article.

MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.

read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]][source]
Parameters
  • dir_path – path to save data

  • dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'

  • url – link to an archive with the dataset; use the url argument if a non-default dataset is used

Returns

dataset split into train and valid parts

Raises

RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
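A minimal usage sketch (the target directory is hypothetical; that the returned parts follow the raw SQuAD json structure is an assumption here):

    from deeppavlov.dataset_readers.squad_dataset_reader import SquadDatasetReader

    reader = SquadDatasetReader()
    data = reader.read(dir_path="./squad", dataset="SQuAD")
    train_part = data["train"]   # assumption: a raw SQuAD-style dictionary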

class deeppavlov.dataset_readers.typos_reader.TyposCustom[source]

Base class for reading spelling corrections dataset files

static build(data_path: str) → pathlib.Path[source]

Base method that interprets data_path argument.

Parameters

data_path – path to the tsv-file containing erroneous and corrected words

Returns

the same path as a Path object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov[source]

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → pathlib.Path[source]

Download the misspellings list from GitHub

Parameters

data_path – target directory to download the data to

Returns

path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia[source]

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → pathlib.Path[source]

Download and parse common misspellings list from Wikipedia

Parameters

data_path – target directory to download the data to

Returns

path to the resulting tsv-file
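A minimal usage sketch for the typos readers, using TyposWikipedia (the target directory is hypothetical; read() is inherited from TyposCustom):

    from deeppavlov.dataset_readers.typos_reader import TyposWikipedia

    # the first call downloads and parses the misspellings list into ./typos_data
    data = TyposWikipedia.read(data_path="./typos_data")
    wrong, correct = data["train"][0]   # an (erroneous word, corrected word) pair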

class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader[source]

The class to read the Ubuntu V2 dataset from csv files.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • positive_samples – if True, only positive context-response pairs will be taken for the train part

class deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader[source]

The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

Returns

Dictionary with keys “train”, “valid” and “test”, whose values are the corresponding parts of the dataset.
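A minimal usage sketch (the directory with the csv files is hypothetical):

    from deeppavlov.dataset_readers.ubuntu_v2_mt_reader import UbuntuV2MTReader

    reader = UbuntuV2MTReader()
    data = reader.read(data_path="./ubuntu_v2", num_context_turns=3, padding="post")
    turns, label = data["train"][0]   # a list of utterance strings and an int label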