dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader[source]

Class for reading classification datasets in .csv or .json format

read(data_path: str, url: Optional[str] = None, format: str = 'csv', class_sep: Optional[str] = None, *args, **kwargs) → dict[source]

Reads the dataset from the data_path directory. The files read are all data_types plus the extension (i.e. for data_types=["train", "valid"] the files "train.csv" and "valid.csv" from data_path will be read)

Parameters
  • data_path – directory with files

  • url – URL to download data files from if data_path does not exist or is empty

  • format – extension of files. Set of Values: "csv", "json"

  • class_sep – string separator of labels in the column with labels. Default: None -> only one class per sample

  • sep (str) – delimiter for "csv" files. Default: None

  • header (int) – row number to use as the column names

  • names (array) – list of column names to use

  • orient (str) – indication of expected JSON string format

  • lines (boolean) – read the file as a json object per line. Default: False

Returns

dictionary with types from data_types. Each field of dictionary is a list of tuples (x_i, y_i)
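As an illustration of the returned structure, the following sketch parses an in-memory CSV into the {"train": [(x_i, y_i), ...]} shape described above. The column names text and labels, the helper read_split and the "|" class separator are hypothetical, chosen for the example; this is not the reader's actual implementation:

```python
import csv
import io

def read_split(csv_text, x_col="text", y_col="labels", class_sep=None):
    """Parse one data_type file into a list of (x_i, y_i) tuples.

    With class_sep=None each sample keeps a single string label;
    otherwise the label column is split into a list of labels.
    """
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        y = row[y_col].split(class_sep) if class_sep else row[y_col]
        samples.append((row[x_col], y))
    return samples

train_csv = "text,labels\nhello there,greet\nbye now,goodbye|closing\n"
data = {"train": read_split(train_csv, class_sep="|")}
```

With class_sep="|" the second sample yields the multi-label list ["goodbye", "closing"], while the first keeps a single-element list.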

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader[source]

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications have been made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • original dstc2 consisted of three different MDP policies, the original train and dev datasets (consisting of two policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs, where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output list of turns or list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'dstc2-trn.jsonlist', 'valid' field with dialogs from 'dstc2-val.jsonlist' and 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
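To illustrate the dialogs flag, here is a hedged sketch of the difference between the two output shapes: a list of dialogs (each a list of (user turn, system turn) tuples) versus a flat list of turns. The helper to_turns is hypothetical, and the actual reader may additionally insert dialog-boundary markers; this only shows the basic shape:

```python
def to_turns(dialogs):
    """Flatten a list of dialogs, each a list of (user, system) turn
    tuples, into a single list of turns (roughly dialogs=False)."""
    return [turn for dialog in dialogs for turn in dialog]

dialogs = [
    [("hi", "Hello, welcome!"), ("cheap food", "api_call pricerange=\"cheap\"")],
    [("bye", "Goodbye.")],
]
turns = to_turns(dialogs)
```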

class deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader[source]

Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).

The following modifications have been made to the original dataset:

  1. added api calls to restaurant database

    • example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}.

  2. new actions

    • bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})

    • if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})

  3. new train/dev/test split

    • original dstc2 consisted of three different MDP policies, the original train and dev datasets (consisting of two policies) were merged and randomly split into train/dev/test

  4. minor fixes

    • fixed several dialogs, where actions were wrongly annotated

    • uppercased first letter of bot responses

    • unified punctuation for bot responses

classmethod read(data_path: str, dialogs: bool = False, encoding='utf-8') → Dict[str, List][source]

Downloads the 'simple_dstc2.tar.gz' archive from the internet, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save DSTC2 dataset

  • dialogs – flag which indicates whether to output list of turns or list of dialogs

Returns

dictionary that contains 'train' field with dialogs from 'simple-dstc2-trn.json', 'valid' field with dialogs from 'simple-dstc2-val.json' and 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge(domain_knowledge_di: Dict)[source]

A DTO-like class storing the domain knowledge from the domain yaml config.

classmethod from_yaml(domain_yml_fpath: Union[str, pathlib.Path] = 'domain.yml')[source]

Parses the domain.yml domain config file into a DomainKnowledge object.

Parameters

domain_yml_fpath – path to the domain config file, defaults to domain.yml

Returns

the loaded DomainKnowledge object

class deeppavlov.dataset_readers.md_yaml_dialogs_reader.MD_YAML_DialogsDatasetReader[source]

Reads dialogs from a dataset composed of stories.md, nlu.md and domain.yml.

stories.md provides the dialogue dataset for the model to train on. Dialogues are represented as sequences of user message labels and system response labels (not texts, just action labels). This separates the NLU-NLG tasks from the dialogue scripting itself: one should be able to describe just the scripts of the dialogues to the system.

nlu.md, in contrast, provides the NLU training set irrespective of the dialogue scripts.

domain.yml describes the task-specific domain and serves two purposes: it provides the NLG templates and some specific configuration of the NLU.

classmethod augment_form(form_name: str, domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, intent2slots2text: Dict) → List[str][source]

Replaces the form mention in stories.md with the actual turns relevant to the form.

Parameters
  • form_name – the name of the form to generate turns for

  • domain_knowledge – the domain knowledge (see domain.yml in RASA) relevant to the processed config

  • intent2slots2text – the mapping of intents and particular slots onto text

Returns

the story turns relevant to the passed form

classmethod augment_slot(known_responses: List[str], known_intents: List[str], slot_name: str, form_name: str) → List[str][source]

Given the slot name, generates a sequence of a system turn asking for the slot and a user's turn providing it

Parameters
  • known_responses – responses known to the system from domain.yml

  • known_intents – intents known to the system from domain.yml

  • slot_name – the name of the slot to augment for

  • form_name – the name of the form for which the turn is augmented

Returns

the list of stories.md alike turns

classmethod augment_user_turn(intent2slots2text, line: str, slot_name2text2value) → List[Dict[str, Any]][source]

Given the turn information, generates all the possible stories representing it.

Parameters
  • intent2slots2text – the intents and slots to natural language utterances mapping known to the system

  • line – the line representing the user utterance in stories.md format

  • slot_name2text2value – the slot names to values mapping known to the system

Returns

the batch of all the possible dstc2 representations of the passed intent

classmethod get_augmented_ask_intent_utter(known_intents: List[str], slot_name: str) → Optional[str][source]

If the system knows the inform_{slot} intent, returns this intent name, otherwise returns None.

Parameters
  • known_intents – intents known to the system

  • slot_name – the slot to find the inform intent for

Returns

the slot informing intent or None

classmethod get_augmented_ask_slot_utter(form_name: str, known_responses: List[str], slot_name: str)[source]

If the system knows the ask_{slot} action, returns this action name, otherwise returns None.

Parameters
  • form_name – the name of the currently processed form

  • known_responses – actions known to the system

  • slot_name – the slot to find the asking action for

Returns

the slot asking action or None

classmethod get_last_users_turn(curr_story_utters: List[Dict]) → Dict[source]

Given the dstc2 story, returns the last user utterance from it.

Parameters

curr_story_utters – the dstc2-formatted story

Returns

the last user utterance from the passed story

classmethod parse_system_turn(domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, line: str) → Dict[source]

Given a RASA stories.md line, returns the dstc2-formatted json (dict) for this line.

Parameters
  • domain_knowledge – the domain knowledge relevant to the processed stories config (from which the line is taken)

  • line – the line from stories.md representing the system's story step

Returns

the dstc2-formatted passed turn

classmethod read(data_path: str, dialogs: bool = False, ignore_slots: bool = False) → Dict[str, List][source]
Parameters
  • data_path – path to read dataset from

  • dialogs – flag which indicates whether to output list of turns or list of dialogs

  • ignore_slots – whether to ignore slots information provided in stories.md or not

Returns

dictionary that contains 'train' field with dialogs from 'stories-trn.md', 'valid' field with dialogs from 'stories-val.md' and 'test' field with dialogs from 'stories-tst.md'. Each field is a list of tuples (x_i, y_i).
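As a rough illustration of the stories.md layout this reader consumes, here is a hedged sketch that parses a minimal RASA Markdown story (user intent labels prefixed with "* ", system action labels with "- ") into per-story turn lists. The helper parse_stories_md is hypothetical; the actual reader also handles forms, slots and other RASA constructs:

```python
def parse_stories_md(text):
    """Parse a minimal RASA Markdown stories file into a list of
    stories, each a list of ("user"|"system", label) turns."""
    stories, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("## "):      # "## name" opens a new story
            current = []
            stories.append(current)
        elif line.startswith("* "):     # user intent label
            current.append(("user", line[2:]))
        elif line.startswith("- "):     # system action label
            current.append(("system", line[2:]))
    return stories

sample = """## happy path
* greet
  - utter_greet
* goodbye
  - utter_goodbye
"""
stories = parse_stories_md(sample)
```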

class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader[source]

Reader for FAQ dataset

read(data_path: Optional[str] = None, data_url: Optional[str] = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]

Read FAQ dataset from specified csv file or remote url

Parameters
  • data_path – path to csv file of FAQ

  • data_url – url to csv file of FAQ

  • x_col_name – name of Question column in csv file

  • y_col_name – name of Answer column in csv file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.

class deeppavlov.dataset_readers.file_paths_reader.FilePathsReader[source]

Find all file paths by a data path glob

read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict[source]

Find all file paths by a data path glob

Parameters
  • data_path – directory with data

  • train – data path glob relative to data_path

  • valid – data path glob relative to data_path

  • test – data path glob relative to data_path

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
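The behaviour described above can be sketched with pathlib globbing, assuming a hypothetical helper read_file_paths that mirrors the train/valid/test split keys (a sketch of the idea, not the reader's actual code):

```python
import tempfile
from pathlib import Path

def read_file_paths(data_path, train=None, valid=None, test=None):
    """Collect file paths matching each split's glob relative to data_path."""
    data_path = Path(data_path)
    return {
        split: sorted(data_path.glob(pattern)) if pattern else []
        for split, pattern in (("train", train), ("valid", valid), ("test", test))
    }

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "train_0.txt").write_text("a")
    (Path(d) / "train_1.txt").write_text("b")
    (Path(d) / "valid_0.txt").write_text("c")
    data = read_file_paths(d, train="train_*.txt", valid="valid_*.txt")
```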

class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader[source]

A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.

Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.

For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.

classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List][source]

Downloads the 'kvret_public.tar.gz' archive, decompresses it and saves the files to data_path.

Parameters
  • data_path – path to save data

  • dialogs – flag which indicates whether to output a list of turns or a list of dialogs

Returns

dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json', 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).

class deeppavlov.dataset_readers.line_reader.LineReader[source]

Read txt file by lines

read(data_path: Optional[str] = None, *args, **kwargs) → Dict[source]

Read lines from txt file

Parameters

data_path – path to txt file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.

class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader[source]

Class to read training datasets in UD format

read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]

Reads UD dataset from data_path.

Parameters
  • data_path – can be either: 1) a directory containing files, in which case the file for data_type 'mode' is data_path / {language}-ud-{mode}.conllu, or 2) a list of files containing the same number of items as data_types

  • language – a language to detect filename when it is not given

  • data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned

Returns

a dictionary containing dataset fragments (see read_infile) for given data types

deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str[source]

Extracts language from typical UD filename

deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]][source]

Reads an input file in CoNLL-U format

Parameters
  • infile – a path to a file

  • word_column – column containing words (default=1)

  • pos_column – column containing part-of-speech labels (default=3)

  • tag_column – column containing fine-grained tags (default=5)

  • head_column – column containing syntactic head position (default=6)

  • dep_column – column containing syntactic dependency label (default=7)

  • max_sents – maximal number of sentences to read

  • read_only_words – whether to read only words

  • read_syntax – whether to return heads and deps alongside tags. Ignored if read_only_words is True

Returns

a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
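A simplified sketch of CoNLL-U parsing, matching the default column indices above (tab-separated columns, word in column 1, POS in column 3, blank line between sentences). The helper read_conllu is hypothetical and omits the tag, head and dep columns handled by read_infile:

```python
def read_conllu(text, word_column=1, pos_column=3):
    """Extract (words, tags) pairs from CoNLL-U text, one per sentence."""
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        if line.startswith("#"):          # skip comment lines
            continue
        columns = line.split("\t")
        words.append(columns[word_column])
        tags.append(columns[pos_column])
    if words:                             # flush a trailing sentence
        sentences.append((words, tags))
    return sentences

sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
sentences = read_conllu(sample)
```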

class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader[source]

The class to read the paraphraser.ru dataset from files.

Please, see https://paraphraser.ru.

class deeppavlov.dataset_readers.siamese_reader.SiameseReader[source]

The class to read dataset for ranking or paraphrase identification with Siamese networks.

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader[source]

Downloads dataset files and prepares train/valid split.

SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html

MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from original Wikipedia article.

MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.

read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]][source]
Parameters
  • dir_path – path to save data

  • dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'

  • url – link to archive with dataset, use url argument if non-default dataset is used

Returns

dataset split on train/valid

Raises

RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
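For orientation, the SQuAD files downloaded by this reader follow the standard SQuAD JSON schema (data → paragraphs → qas → answers). The following hedged sketch walks that structure; the helper iter_qas is hypothetical, not part of the reader:

```python
def iter_qas(squad_dict):
    """Yield (context, question, answer_text) triples from a SQuAD-format dict."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield context, qa["question"], answer["text"]

squad = {"data": [{"paragraphs": [{
    "context": "DeepPavlov is an open-source NLP library.",
    "qas": [{"question": "What is DeepPavlov?",
             "answers": [{"text": "an open-source NLP library",
                          "answer_start": 14}]}],
}]}]}
triples = list(iter_qas(squad))
```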

class deeppavlov.dataset_readers.typos_reader.TyposCustom[source]

Base class for reading spelling corrections dataset files

static build(data_path: str) → pathlib.Path[source]

Base method that interprets data_path argument.

Parameters

data_path – path to the tsv-file containing erroneous and corrected words

Returns

the same path as a Path object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling corrections algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator
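A hedged sketch of the tsv format mentioned above: tab-separated (erroneous, corrected) word pairs, collected into a {'train': [...]} dictionary of the kind a TyposDatasetIterator consumes. The helper read_typos_tsv is hypothetical:

```python
def read_typos_tsv(tsv_text):
    """Parse tab-separated (erroneous, corrected) word pairs into
    a {'train': [(wrong, correct), ...]} dictionary."""
    pairs = []
    for line in tsv_text.splitlines():
        if not line.strip():              # skip blank lines
            continue
        wrong, correct = line.split("\t")
        pairs.append((wrong, correct))
    return {"train": pairs}

data = read_typos_tsv("teh\tthe\nrecieve\treceive\n")
```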

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov[source]

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → pathlib.Path[source]

Download misspellings list from github

Parameters

data_path – target directory to download the data to

Returns

path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]

Read train data for spelling corrections algorithms

Parameters

data_path – path that needs to be interpreted with build()

Returns

train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia[source]

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → pathlib.Path[source]

Download and parse common misspellings list from Wikipedia

Parameters

data_path – target directory to download the data to

Returns

path to the resulting tsv-file

class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader[source]

The class to read the Ubuntu V2 dataset from csv files.

Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • positive_samples – if True, only positive context-response pairs will be taken for train

class deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader[source]

The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]

Read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.

Parameters
  • data_path – A path to a folder with dataset csv files.

  • num_context_turns – A maximum number of dialogue context turns.

  • padding – “post” or “pre” padding of context sentences

Returns

Dictionary with keys “train”, “valid”, “test” and parts of the dataset as their values
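To illustrate the num_context_turns and padding parameters, here is a hedged sketch of trimming/padding a multi-turn context to a fixed number of turns. The helper pad_context and the empty-string padding value are assumptions for the example; the reader's exact padding semantics may differ:

```python
def pad_context(turns, num_context_turns, padding="post"):
    """Trim or pad a list of context turns to exactly num_context_turns
    entries, padding with empty strings 'pre' or 'post'."""
    turns = turns[-num_context_turns:]            # keep the most recent turns
    pad = [""] * (num_context_turns - len(turns))
    return pad + turns if padding == "pre" else turns + pad

ctx = ["hi", "hello, how can I help?", "my sound is broken"]
```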

class deeppavlov.dataset_readers.intent_catcher_reader.IntentCatcherReader[source]

Reader for Intent Catcher dataset in json format