dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader

Class that reads datasets in .csv or .json format

read(data_path: str, url: str = None, format: str = 'csv', class_sep: str = ', ', *args, **kwargs) → dict

Read dataset from the data_path directory. The files read are each entry of data_types plus the extension (e.g. for data_types=["train", "valid"] the files "train.csv" and "valid.csv" from data_path will be read)

Parameters:

data_path – directory with files
url – URL to download data files from if data_path does not exist or is empty
format – extension of files. Allowed values: "csv", "json"
class_sep – string separator of labels in column with labels
sep (str) – delimiter for "csv" files. Default: ","
header (int) – row number to use as the column names
names (array) – list of column names to use
orient (str) – indication of expected JSON string format
lines (boolean) – read the file as a json object per line. Default: False

Returns:

dictionary with one field per entry of data_types. Each field of the dictionary is a list of tuples (x_i, y_i)
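
The file naming and the shape of the returned dictionary can be illustrated with a small standard-library sketch. This is not the DeepPavlov implementation: the function name read_classification_csv and the column names "text" and "labels" are assumptions made for the example, and only the "csv" case is covered.

```python
import csv
import tempfile
from pathlib import Path


def read_classification_csv(data_path, data_types=("train", "valid"),
                            class_sep=",", sep=","):
    """Return {data_type: [(text, [label, ...]), ...]} for each existing file.

    Hypothetical helper: column names "text" and "labels" are assumed.
    """
    data = {}
    for data_type in data_types:
        # File name is the data type plus the extension, e.g. "train.csv".
        file_path = Path(data_path) / f"{data_type}.csv"
        if not file_path.exists():
            continue  # this sketch silently skips missing splits
        with file_path.open(newline="", encoding="utf8") as f:
            reader = csv.DictReader(f, delimiter=sep)
            data[data_type] = [
                (row["text"], row["labels"].split(class_sep))
                for row in reader
            ]
    return data


# Build a tiny throwaway dataset and read it back.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train", "valid"):
        with (Path(tmp) / f"{name}.csv").open("w", newline="", encoding="utf8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "labels"])
            writer.writerow(["book a table", "request|restaurant"])
    dataset = read_classification_csv(tmp, class_sep="|")

print(dataset["train"])  # [('book a table', ['request', 'restaurant'])]
```

Each sample is a tuple (x_i, y_i) where y_i is the list obtained by splitting the label column on class_sep.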

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications were made to the original dataset:

added api calls to restaurant database

example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}.

new actions

bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

new train/dev/test split

the original DSTC2 consisted of dialogs from three different MDP policies; the original train and dev datasets (covering two of the policies) were merged and randomly split into train/dev/test

minor fixes

fixed several dialogs, where actions were wrongly annotated

uppercased first letter of bot responses

unified punctuation for bot responses
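
The "new actions" modification above can be sketched as a small function. This is an illustrative reconstruction based only on the two examples given, not the script used to build the dataset; the name merge_dialog_acts and the rule for turns with several acts and several slot keys at once are assumptions.

```python
def merge_dialog_acts(turn):
    """Collapse a turn's acts (and slot keys, if any) into one action name.

    Hypothetical helper: acts come first, slot keys after, joined with "_".
    """
    parts = list(turn.get("dialog_acts", [])) + list(turn.get("slot_vals", []))
    # Keep every other field of the turn, but drop the consumed slot keys.
    merged = {key: value for key, value in turn.items() if key != "slot_vals"}
    merged["dialog_acts"] = ["_".join(parts)]
    return merged


print(merge_dialog_acts({"dialog_acts": ["ask", "request"]}))
# {'dialog_acts': ['ask_request']}
print(merge_dialog_acts({"dialog_acts": ["ask"], "slot_vals": ["area"]}))
# {'dialog_acts': ['ask_area']}
```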

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List]

Downloads 'dstc2_v2.tar.gz' archive from ipavlov internal server, decompresses and saves files to data_path.

Parameters:

data_path – path to save DSTC2 dataset
dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns:

dictionary that contains `'train'` field with dialogs from `'dstc2-trn.jsonlist'`, `'valid'` field with dialogs from `'dstc2-val.jsonlist'` and `'test'` field with dialogs from `'dstc2-tst.jsonlist'`. Each field is a list of tuples `(x_i, y_i)`.
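
A quick way to sanity-check a dictionary of this shape is to count the samples per split. The helper below is a hypothetical sketch (summarize_splits is not part of DeepPavlov), and the toy data only imitates the documented layout of `(x_i, y_i)` tuples; the contents of x_i and y_i are made up for illustration.

```python
def summarize_splits(data):
    """Check the documented layout and return the number of samples per split."""
    sizes = {}
    for split in ("train", "valid", "test"):
        samples = data[split]
        for sample in samples:
            if len(sample) != 2:
                raise ValueError(f"{split}: expected (x_i, y_i) pairs")
        sizes[split] = len(samples)
    return sizes


# Toy stand-in for the reader's output; the field contents are invented.
toy = {
    "train": [({"text": "cheap food"}, {"text": "What area?"}),
              ({"text": "south"}, {"text": "Api_call."})],
    "valid": [({"text": "hello"}, {"text": "Hello."})],
    "test": [],
}

print(summarize_splits(toy))  # {'train': 2, 'valid': 1, 'test': 0}
```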

class deeppavlov.dataset_readers.insurance_reader.InsuranceReader

read(data_path: str, **kwargs) → Dict[str, List[Dict[str, Union[int, typing.List[int]]]]]

Reads the InsuranceQA data from files and forms the dataset.

Parameters:

data_path – A path to a folder where dataset files are stored.
**kwargs – Other parameters.

Returns:

A dictionary containing training, validation and test parts of the dataset obtainable via `train`, `valid` and `test` keys.

class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader

A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.

Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.

For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List]

Downloads 'kvrest_public.tar.gz', decompresses, saves files to data_path.

Parameters:

data_path – path to save data
dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns:

dictionary with `'train'` containing dialogs from `'kvret_train_public.json'`, `'valid'` containing dialogs from `'kvret_valid_public.json'`, `'test'` containing dialogs from `'kvret_test_public.json'`. Each field is a list of tuples `(x_i, y_i)`.

class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader

Class to read training datasets in UD format

class deeppavlov.dataset_readers.ontonotes_reader.OntonotesReader

Class to read training datasets in OntoNotes format

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader

Stanford Question Answering Dataset (https://rajpurkar.github.io/SQuAD-explorer/) and the dataset from SDSJ Task B (https://www.sdsj.ru/ru/contest.html).

Downloads dataset files and prepares train/valid split.

read(dir_path: str, dataset: str = 'SQuAD', *args, **kwargs) → Dict[str, Dict[str, Any]]

Parameters:

dir_path – path to save data
dataset – dataset name: `'SQuAD'` or `'SberSQuAD'`

Returns:

dataset split into train/valid

class deeppavlov.dataset_readers.typos_reader.TyposCustom

Base class for reading spelling corrections dataset files

static build(data_path: str) → pathlib.Path

Base method that interprets data_path argument.

Parameters:	data_path – path to the tsv-file containing erroneous and corrected words
Returns:	the same path as a `Path` object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]]

Read train data for spelling correction algorithms

Parameters:	data_path – path that needs to be interpreted with `build()`
Returns:	train data to pass to a `TyposDatasetIterator`
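
The split between `build()` (resolve or fetch the file) and `read()` (parse it) can be sketched with the standard library. TsvTyposReader below is a hypothetical stand-in, not the DeepPavlov class; the `'train'` key, the column order (erroneous word first, correction second) and the absence of a header row are assumptions.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


class TsvTyposReader:
    """Hypothetical stand-in that mirrors the documented build()/read() split."""

    @staticmethod
    def build(data_path: str) -> Path:
        # Base behaviour: interpret the argument as a path to the tsv-file.
        return Path(data_path)

    @classmethod
    def read(cls, data_path: str, *args, **kwargs) -> Dict[str, List[Tuple[str, str]]]:
        # cls.build lets a subclass download or convert data before parsing.
        fname = cls.build(data_path)
        with fname.open(newline="", encoding="utf8") as tsvfile:
            reader = csv.reader(tsvfile, delimiter="\t")
            pairs = [(row[0], row[1]) for row in reader if len(row) >= 2]
        return {"train": pairs}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "typos.tsv"
    path.write_text("recieve\treceive\nadress\taddress\n", encoding="utf8")
    typos = TsvTyposReader.read(str(path))

print(typos)  # {'train': [('recieve', 'receive'), ('adress', 'address')]}
```

Because `read()` calls `cls.build()`, a subclass only has to override `build()` to point the same parsing logic at a different data source.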

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → pathlib.Path

Download misspellings list from GitHub

Parameters:	data_path – target directory to download the data to
Returns:	path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]]

Read train data for spelling correction algorithms

Parameters:	data_path – path that needs to be interpreted with `build()`
Returns:	train data to pass to a `TyposDatasetIterator`

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → pathlib.Path

Download and parse common misspellings list from Wikipedia

Parameters:	data_path – target directory to download the data to
Returns:	path to the resulting tsv-file