dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader[source]

Class for reading classification datasets in .csv or .json format

read(data_path: str, url: str = None, format: str = 'csv', class_sep: str = None, *args, **kwargs) → dict[source]

Read a dataset from the data_path directory. The files to read are all data_types + extension (i.e. for data_types=["train", "valid"] the files "train.csv" and "valid.csv" from data_path will be read)

Parameters
  • data_path – directory with files

  • url – URL to download data files from if data_path does not exist or is empty

  • format – extension of files. Set of Values: "csv", "json"

  • class_sep – string separator of labels in the column with labels. Default: None -> only one class per sample

  • sep (str) – delimiter for "csv" files

  • header (int) – row number to use as the column names

  • names (array) – list of column names to use

  • orient (str) – indication of expected JSON string format

  • lines (boolean) – read the file as a json object per line. Default: False

Returns

dictionary with fields from data_types. Each field of the dictionary is a list of tuples (x_i, y_i)
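A minimal usage sketch (the directory ./my_data, its file names and the column layout are assumptions for illustration, not part of the API):

    from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

    reader = BasicClassificationDatasetReader()
    # assumes ./my_data contains train.csv and valid.csv in the reader's expected column layout
    data = reader.read(
        data_path="./my_data",
        format="csv",
        sep=",",         # csv delimiter
        class_sep=",",   # labels in the label column are comma-separated
    )
    # each split is a list of (x_i, y_i) tuples
    print(len(data["train"]), data["train"][0])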

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader[source]

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications have been made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area="south" food="dontcare" pricerange="cheap"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • the original dstc2 data covered three different MDP policies; the original train and dev sets (covering two of the policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag indicating whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'dstc2-trn.jsonlist', 'valid' field with dialogs from 'dstc2-val.jsonlist' and 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
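A minimal usage sketch (the target directory is an assumption for the example):

    from deeppavlov.dataset_readers.dstc2_reader import DSTC2DatasetReader

    # downloads and unpacks the archive into ./dstc2, then reads the three splits
    data = DSTC2DatasetReader.read(data_path="./dstc2", dialogs=False)
    print(list(data.keys()))   # ['train', 'valid', 'test']
    print(data["train"][0])    # a single (x_i, y_i) turn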

class deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications have been made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area="south" food="dontcare" pricerange="cheap"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • the original dstc2 data covered three different MDP policies; the original train and dev sets (covering two of the policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'simple_dstc2.tar.gz' archive from the internet, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag indicating whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'simple-dstc2-trn.json', 'valid' field with dialogs from 'simple-dstc2-val.json' and 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).
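A minimal usage sketch showing the dialogs flag (the target directory is an assumption for the example):

    from deeppavlov.dataset_readers.dstc2_reader import SimpleDSTC2DatasetReader

    # with dialogs=True each item of a split is a whole dialog rather than a single turn
    data = SimpleDSTC2DatasetReader.read(data_path="./simple_dstc2", dialogs=True)
    print(len(data["train"]))  # number of training dialogs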

class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader[source]

Reader for FAQ dataset

read(data_path: str = None, data_url: str = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]

Read the FAQ dataset from a specified csv file or a remote url

Parameters
  • data_path – path to csv file of FAQ

  • data_url – url to csv file of FAQ

  • x_col_name – name of Question column in csv file

  • y_col_name – name of Answer column in csv file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
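A minimal usage sketch (the file name and column names are assumptions for the example):

    from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader

    reader = FaqDatasetReader()
    # faq.csv is assumed to have "Question" and "Answer" columns
    data = reader.read(data_path="faq.csv", x_col_name="Question", y_col_name="Answer")
    print(data["train"][0])    # one training sample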

class deeppavlov.dataset_readers.file_paths_reader.FilePathsReader[source]

Find all file paths by a data path glob

read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict[source]

Find all file paths by a data path glob

Parameters
  • data_path – directory with data

  • train – data path glob relative to data_path

  • valid – data path glob relative to data_path

  • test – data path glob relative to data_path

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
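A minimal usage sketch (the directory layout and glob patterns are assumptions for the example):

    from deeppavlov.dataset_readers.file_paths_reader import FilePathsReader

    reader = FilePathsReader()
    data = reader.read(
        data_path="./corpus",
        train="train/*.txt",   # glob relative to data_path
        valid="valid/*.txt",
        test="test/*.txt",
    )
    print(data["train"][:3])   # paths matched by the "train" glob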

class deeppavlov.dataset_readers.insurance_reader.InsuranceReader[source]

The class to read the InsuranceQA V1 dataset from files.

Please see https://github.com/shuzi/insuranceQA.

class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader[source]

A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.

Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.

For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'kvrest_public.tar.gz' archive, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save data

  • dialogs – flag indicating whether to output a list of turns or a list of dialogs

Returns

dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json', 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).
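A minimal usage sketch (the target directory is an assumption for the example):

    from deeppavlov.dataset_readers.kvret_reader import KvretDatasetReader

    # downloads the archive into ./kvret and reads the three public splits
    data = KvretDatasetReader.read(data_path="./kvret", dialogs=False)
    print(len(data["train"]), len(data["valid"]), len(data["test"]))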

class deeppavlov.dataset_readers.line_reader.LineReader[source]

Read a txt file line by line

read(data_path: str = None, *args, **kwargs) → Dict[source]

Read lines from txt file

Parameters

data_path – path to txt file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
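A minimal usage sketch (the file name is an assumption for the example):

    from deeppavlov.dataset_readers.line_reader import LineReader

    reader = LineReader()
    # vocab.txt is an assumed plain-text file, read line by line
    data = reader.read(data_path="vocab.txt")
    print(data["train"][:5])   # first five lines of the file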

class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader[source]

Class to read training datasets in UD format

read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]

Reads UD dataset from data_path.

Parameters
  • data_path – either (1) a directory containing the files, in which case the file for data_type 'mode' is data_path / {language}-ud-{mode}.conllu, or (2) a list of files containing the same number of items as data_types

  • language – language used to construct the filename when an explicit file list is not given

  • data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned

Returns

a dictionary containing dataset fragments (see read_infile) for given data types
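A minimal usage sketch (the directory layout and language are assumptions for the example):

    from deeppavlov.dataset_readers.morphotagging_dataset_reader import MorphotaggerDatasetReader

    reader = MorphotaggerDatasetReader()
    # assumes ./ud contains en-ud-train.conllu and en-ud-dev.conllu
    data = reader.read(data_path="./ud", language="en", data_types=["train", "dev"])
    print(list(data.keys()))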

deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str[source]

Extracts language from typical UD filename

deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]][source]

Reads an input file in CoNLL-U format

Parameters
  • infile – a path to a file

  • word_column – column containing words (default=1)

  • pos_column – column containing part-of-speech labels (default=3)

  • tag_column – column containing fine-grained tags (default=5)

  • head_column – column containing syntactic head position (default=6)

  • dep_column – column containing syntactic dependency label (default=7)

  • max_sents – maximal number of sentences to read

  • read_only_words – whether to read only words

  • read_syntax – whether to return heads and deps alongside tags. Ignored if read_only_words is True

Returns

a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
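A minimal usage sketch (the file name is an assumption for the example):

    from deeppavlov.dataset_readers.morphotagging_dataset_reader import read_infile

    # sample.conllu is an assumed CoNLL-U file; read at most 10 sentences
    sentences = read_infile("sample.conllu", max_sents=10)
    words, tags = sentences[0]   # word sequence and its tag sequence
    print(words)
    print(tags)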

class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader[source]

The class to read the paraphraser.ru dataset from files.

Please see https://paraphraser.ru.

class deeppavlov.dataset_readers.paraphraser_pretrain_reader.ParaphraserPretrainReader[source]

The class to read the pretraining dataset for the paraphrase identification task from files.

class deeppavlov.dataset_readers.quora_question_pairs_reader.QuoraQuestionPairsReader[source]

The class to read the Quora Question Pairs dataset from files.

Please see https://www.kaggle.com/c/quora-question-pairs/data.

Parameters
  • data_path – A path to a folder with dataset files.

  • seed – Random seed.

class deeppavlov.dataset_readers.siamese_reader.SiameseReader[source]

The class to read dataset for ranking or paraphrase identification with Siamese networks.

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader[source]

Downloads dataset files and prepares train/valid split.

SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html

MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from original Wikipedia article.

MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.

read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]][source]
Parameters
  • dir_path – path to save data

  • dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'

  • url – link to an archive with the dataset; use the url argument if a non-default dataset is used

Returns

dataset split into train/valid parts

Raises

RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
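A minimal usage sketch (the target directory is an assumption for the example):

    from deeppavlov.dataset_readers.squad_dataset_reader import SquadDatasetReader

    reader = SquadDatasetReader()
    # downloads SQuAD into ./squad and returns the train/valid split
    data = reader.read(dir_path="./squad", dataset="SQuAD")
    print(list(data.keys()))   # ['train', 'valid']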

class deeppavlov.dataset_readers.typos_reader.TyposCustom[source]

Base class for reading spelling corrections dataset files

static build(data_path: str) → pathlib.Path[source]

Base method that interprets data_path argument.

Parameters

data_path – path to the tsv-file containing erroneous and corrected words

Returns

the same path as a Path object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read training data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov[source]

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → pathlib.Path[source]

Download the misspellings list from GitHub

Parameters

data_path – target directory to download the data to

Returns

path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read training data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator
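A minimal usage sketch (the target directory is an assumption, and the 'train' key is assumed to hold the resulting word pairs):

    from deeppavlov.dataset_readers.typos_reader import TyposKartaslov

    # downloads the kartaslov misspellings list into ./typos and reads it
    data = TyposKartaslov.read(data_path="./typos")
    print(data["train"][:3])   # (erroneous word, corrected word) pairs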

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia[source]

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → pathlib.Path[source]

Download and parse the common misspellings list from Wikipedia

Parameters

data_path – target directory to download the data to

Returns

path to the resulting tsv-file

class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader[source]

The class to read the Ubuntu V2 dataset from csv files.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • positive_samples – if True, only positive context-response pairs will be taken for training
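A minimal usage sketch (the data directory is an assumption for the example):

    from deeppavlov.dataset_readers.ubuntu_v2_reader import UbuntuV2Reader

    reader = UbuntuV2Reader()
    # ./ubuntu_v2 is assumed to hold the dataset csv files
    data = reader.read(data_path="./ubuntu_v2", positive_samples=True)
    sample, label = data["train"][0]   # a list of strings and an int label
    print(sample, label)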

class deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader[source]

The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – padding position for context sentences: “post” or “pre”

read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – padding position for context sentences: “post” or “pre”

Returns

Dictionary with keys “train”, “valid” and “test”, whose values are the corresponding parts of the dataset
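A minimal usage sketch (the data directory is an assumption for the example):

    from deeppavlov.dataset_readers.ubuntu_v2_mt_reader import UbuntuV2MTReader

    reader = UbuntuV2MTReader()
    # keep up to 10 context turns, padding shorter contexts at the front ("pre")
    data = reader.read(
        data_path="./ubuntu_v2",   # assumed location of the csv files
        num_context_turns=10,
        padding="pre",
    )
    sample, label = data["train"][0]
    print(len(sample), label)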