dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader[source]

Class for reading classification datasets in .csv or .json format

read(data_path: str, url: str = None, format: str = 'csv', class_sep: str = None, *args, **kwargs) → dict[source]

Read dataset from data_path directory. The files to be read are all data_types + extension (i.e. for data_types=[“train”, “valid”] the files “train.csv” and “valid.csv” from data_path will be read)

Parameters
  • data_path – directory with files

  • url – URL to download data files from if data_path does not exist or is empty

  • format – extension of files. Possible values: "csv", "json"

  • class_sep – string separator of labels in the column with labels. Default: None -> only one class per sample

  • sep (str) – delimiter for "csv" files

  • header (int) – row number to use as the column names

  • names (array) – list of column names to use

  • orient (str) – indication of expected JSON string format

  • lines (boolean) – read the file as one JSON object per line. Default: False

Returns

dictionary with fields from data_types. Each field is a list of (x_i, y_i) tuples
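A minimal usage sketch (the directory layout and file contents are hypothetical):

    from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

    reader = BasicClassificationDatasetReader()
    # assumes ./my_data contains train.csv and valid.csv;
    # sep is forwarded to the csv parsing as documented above
    data = reader.read(data_path="./my_data", format="csv", class_sep=",", sep=",")
    x_i, y_i = data["train"][0]  # a single (text, labels) tuple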

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader[source]

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications were made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (consisting of two of the policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'dstc2-trn.jsonlist', 'valid' field with dialogs from 'dstc2-val.jsonlist' and 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
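A minimal usage sketch (the target directory is hypothetical; the first call triggers the download):

    from deeppavlov.dataset_readers.dstc2_reader import DSTC2DatasetReader

    data = DSTC2DatasetReader.read(data_path="./dstc2", dialogs=False)
    print(list(data.keys()))      # ['train', 'valid', 'test']
    x_i, y_i = data["train"][0]   # a single (x_i, y_i) turn pair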

class deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The same modifications as in DSTC2DatasetReader were applied to the original dataset (see the list above).

classmethod read(data_path: str, dialogs: bool = False, encoding='utf-8') → Dict[str, List][source]

Downloads the 'simple_dstc2.tar.gz' archive from the internet, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

  • encoding – encoding of the dataset files (default: 'utf-8')

Returns

dictionary that contains 'train' field with dialogs from 'simple-dstc2-trn.json', 'valid' field with dialogs from 'simple-dstc2-val.json' and 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge(domain_knowledge_di: Dict)[source]

A DTO-like class that stores the domain knowledge from the domain yaml config.

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.MD_YAML_DialogsDatasetReader[source]

Reads dialogs from a dataset composed of stories.md, nlu.md and domain.yml.

stories.md provides the dialogue dataset for the model to train on. The dialogues are represented as sequences of user message labels and system response labels (not texts, just action labels). This separates the NLU and NLG tasks from the actual dialogue scripting: one should be able to describe just the scripts of dialogues to the system.

nlu.md, by contrast, provides the NLU training set irrespective of the dialogue scripts.

domain.yml describes the task-specific domain and serves two purposes: it provides the NLG templates and some specific configuration of the NLU.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]
Parameters
  • data_path – path to read dataset from

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'stories-trn.md', 'valid' field with dialogs from 'stories-val.md' and 'test' field with dialogs from 'stories-tst.md'. Each field is a list of tuples (x_i, y_i).
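A minimal usage sketch (the directory is hypothetical and is expected to contain the stories-*.md, nlu.md and domain.yml files described above):

    from deeppavlov.dataset_readers.md_yaml_dialogs_reader import MD_YAML_DialogsDatasetReader

    data = MD_YAML_DialogsDatasetReader.read(data_path="./dialogs_data", dialogs=False)
    x_i, y_i = data["train"][0]   # a single (x_i, y_i) pair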

class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader[source]

Reader for FAQ dataset

read(data_path: str = None, data_url: str = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]

Read an FAQ dataset from a specified csv file or a remote url

Parameters
  • data_path – path to csv file of FAQ

  • data_url – url to csv file of FAQ

  • x_col_name – name of Question column in csv file

  • y_col_name – name of Answer column in csv file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
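A minimal usage sketch (the csv file and its column names are hypothetical):

    from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader

    reader = FaqDatasetReader()
    data = reader.read(data_path="faq.csv", x_col_name="Question", y_col_name="Answer")
    question, answer = data["train"][0]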

class deeppavlov.dataset_readers.file_paths_reader.FilePathsReader[source]

Find all file paths by a data path glob

read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict[source]

Find all file paths by a data path glob

Parameters
  • data_path – directory with data

  • train – data path glob relative to data_path

  • valid – data path glob relative to data_path

  • test – data path glob relative to data_path

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
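A minimal usage sketch (the directory layout and globs are hypothetical):

    from deeppavlov.dataset_readers.file_paths_reader import FilePathsReader

    reader = FilePathsReader()
    # collects ./data/train/*.txt into "train" and ./data/valid/*.txt into "valid"
    data = reader.read(data_path="./data", train="train/*.txt", valid="valid/*.txt")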

class deeppavlov.dataset_readers.insurance_reader.InsuranceReader[source]

The class to read the InsuranceQA V1 dataset from files.

Please see https://github.com/shuzi/insuranceQA.

class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader[source]

A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.

Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.

For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'kvret_public.tar.gz' archive, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save data

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json' and 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).

class deeppavlov.dataset_readers.line_reader.LineReader[source]

Read a txt file line by line

read(data_path: str = None, *args, **kwargs) → Dict[source]

Read lines from txt file

Parameters

data_path – path to txt file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
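A minimal usage sketch (the file name is hypothetical):

    from deeppavlov.dataset_readers.line_reader import LineReader

    reader = LineReader()
    data = reader.read(data_path="sentences.txt")
    lines = data["train"]   # presumably the lines land under the "train" key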

class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader[source]

Class to read training datasets in UD format

read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]

Reads UD dataset from data_path.

Parameters
  • data_path – either (1) a directory containing files, in which case the file for data_type mode is data_path / {language}-ud-{mode}.conllu, or (2) a list of files containing the same number of items as data_types

  • language – a language used to determine the filename when it is not given explicitly

  • data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned

Returns

a dictionary containing dataset fragments (see read_infile) for given data types

deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str[source]

Extracts language from typical UD filename

deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]][source]

Reads input file in CONLL-U format

Parameters
  • infile – a path to a file

  • word_column – column containing words (default=1)

  • pos_column – column containing part-of-speech labels (default=3)

  • tag_column – column containing fine-grained tags (default=5)

  • head_column – column containing syntactic head position (default=6)

  • dep_column – column containing syntactic dependency label (default=7)

  • max_sents – maximal number of sentences to read

  • read_only_words – whether to read only words

  • read_syntax – whether to return heads and deps alongside tags. Ignored if read_only_words is True

Returns

a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
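A minimal usage sketch of read_infile with default arguments (the CoNLL-U file name is hypothetical):

    from deeppavlov.dataset_readers.morphotagging_dataset_reader import read_infile

    sentences = read_infile("en_ewt-ud-train.conllu")
    words, tags = sentences[0]   # a word sequence and its tag sequence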

class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader[source]

The class to read the paraphraser.ru dataset from files.

Please see https://paraphraser.ru.

class deeppavlov.dataset_readers.paraphraser_pretrain_reader.ParaphraserPretrainReader[source]

The class to read the pretraining dataset for the paraphrase identification task from files.

class deeppavlov.dataset_readers.quora_question_pairs_reader.QuoraQuestionPairsReader[source]

The class to read the Quora Question Pairs dataset from files.

Please see https://www.kaggle.com/c/quora-question-pairs/data.

Parameters
  • data_path – A path to a folder with dataset files.

  • seed – Random seed.

class deeppavlov.dataset_readers.siamese_reader.SiameseReader[source]

The class to read dataset for ranking or paraphrase identification with Siamese networks.

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader[source]

Downloads dataset files and prepares train/valid split.

SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html

MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from the original Wikipedia article.

MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.

read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]][source]
Parameters
  • dir_path – path to save data

  • dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'

  • url – link to an archive with the dataset; use the url argument if a non-default dataset is used

Returns

dataset split into train and valid parts

Raises

RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
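A minimal usage sketch (the target directory is hypothetical; that the returned parts follow the raw SQuAD json structure is an assumption here):

    from deeppavlov.dataset_readers.squad_dataset_reader import SquadDatasetReader

    reader = SquadDatasetReader()
    data = reader.read(dir_path="./squad", dataset="SQuAD")
    train_part = data["train"]   # assumption: a raw SQuAD-style dictionary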

class deeppavlov.dataset_readers.typos_reader.TyposCustom[source]

Base class for reading spelling corrections dataset files

static build(data_path: str) → pathlib.Path[source]

Base method that interprets data_path argument.

Parameters

data_path – path to the tsv-file containing erroneous and corrected words

Returns

the same path as a Path object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov[source]

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → pathlib.Path[source]

Download the misspellings list from GitHub

Parameters

data_path – target directory to download the data to

Returns

path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling correction algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia[source]

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → pathlib.Path[source]

Download and parse common misspellings list from Wikipedia

Parameters

data_path – target directory to download the data to

Returns

path to the resulting tsv-file
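A minimal usage sketch for the typos readers, using TyposWikipedia (the target directory is hypothetical; read() is inherited from TyposCustom):

    from deeppavlov.dataset_readers.typos_reader import TyposWikipedia

    # the first call downloads and parses the misspellings list into ./typos_data
    data = TyposWikipedia.read(data_path="./typos_data")
    wrong, correct = data["train"][0]   # an (erroneous word, corrected word) pair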

class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader[source]

The class to read the Ubuntu V2 dataset from csv files.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • positive_samples – if True, only positive context-response pairs will be taken for the train part

class deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader[source]

The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

Returns

Dictionary with keys “train”, “valid” and “test”, whose values are the corresponding parts of the dataset.
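A minimal usage sketch (the directory with the csv files is hypothetical):

    from deeppavlov.dataset_readers.ubuntu_v2_mt_reader import UbuntuV2MTReader

    reader = UbuntuV2MTReader()
    data = reader.read(data_path="./ubuntu_v2", num_context_turns=3, padding="post")
    turns, label = data["train"][0]   # a list of utterance strings and an int label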