dataset_readers

Concrete DatasetReader classes.

class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader

Reads classification datasets from .csv or .json files.

read(data_path: str, url: Optional[str] = None, format: str = 'csv', class_sep: Optional[str] = None, *args, **kwargs) → dict

Read the dataset from the data_path directory. The files read are named data_type + extension for each entry in data_types (e.g. for data_types=["train", "valid"] the files "train.csv" and "valid.csv" from data_path will be read).

Parameters

data_path – directory with files
url – URL to download the data files from if data_path does not exist or is empty
format – extension of the files. Supported values: "csv", "json"
class_sep – string that separates labels in the label column
sep (str) – delimiter for "csv" files. Default: None -> only one class per sample
header (int) – row number to use as the column names
names (array) – list of column names to use
orient (str) – indication of expected JSON string format
lines (boolean) – read the file as a json object per line. Default: False

Returns

dictionary with one key per type in data_types. Each value is a list of tuples (x_i, y_i)
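
The directory layout and return shape described above can be sketched with the standard library alone. This is not DeepPavlov's implementation: `read_classification_dir` is a hypothetical helper, the column names `text` and `labels` are assumptions made for the illustration, and skipping missing files is also an assumption.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, List[str]]


def read_classification_dir(
    data_path: str,
    data_types: Tuple[str, ...] = ("train", "valid", "test"),
    class_sep: Optional[str] = None,
) -> Dict[str, List[Sample]]:
    """Hypothetical helper: read <data_type>.csv files into (x_i, y_i) tuples."""
    data: Dict[str, List[Sample]] = {}
    for data_type in data_types:
        file = Path(data_path) / f"{data_type}.csv"
        if not file.exists():
            # Assumption: a missing split is simply skipped.
            continue
        with file.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        # Assumed column names: "text" holds x_i, "labels" holds y_i.
        data[data_type] = [
            (row["text"], row["labels"].split(class_sep) if class_sep else [row["labels"]])
            for row in rows
        ]
    return data


# Usage: build a throwaway dataset directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text(
        "text,labels\nhello world,greeting|smalltalk\nbye,farewell\n", encoding="utf-8"
    )
    data = read_classification_dir(tmp, class_sep="|")
```

Only "train.csv" exists in the temporary directory, so the result has a single "train" key whose value is a list of (text, labels) tuples.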

class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader

Class to read training datasets in CoNLL-2003 format

class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader

Reader for FAQ dataset

read(data_path: Optional[str] = None, data_url: Optional[str] = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict

Read FAQ dataset from specified csv file or remote url

Parameters

data_path – path to csv file of FAQ
data_url – url to csv file of FAQ
x_col_name – name of Question column in csv file
y_col_name – name of Answer column in csv file

Returns

A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
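
As a rough illustration of how x_col_name and y_col_name select the question and answer columns, the sketch below parses CSV text with the standard library. `read_faq_csv` is a hypothetical helper, not the library's code, and placing every pair under the train key while leaving valid and test empty is an assumption.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_faq_csv(
    csv_text: str, x_col_name: str = "x", y_col_name: str = "y"
) -> Dict[str, List[Tuple[str, str]]]:
    """Hypothetical helper: turn FAQ rows into (question, answer) pairs."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = [(row[x_col_name], row[y_col_name]) for row in rows]
    # Assumption: every pair goes to "train"; "valid" and "test" stay empty.
    return {"train": pairs, "valid": [], "test": []}


# Usage with custom column names.
faq = read_faq_csv(
    "Question,Answer\nHow do I reset my password?,Use the reset link.\n",
    x_col_name="Question",
    y_col_name="Answer",
)
```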

class deeppavlov.dataset_readers.line_reader.LineReader

Read txt file by lines

read(data_path: Optional[str] = None, *args, **kwargs) → Dict

Read lines from txt file

Parameters: data_path – path to txt file
Returns: A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
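
A minimal sketch of the same idea, assuming that all lines end up under the train key and that the other parts stay empty. `read_lines` is a hypothetical stand-in, not the LineReader implementation.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def read_lines(data_path: str) -> Dict[str, List[str]]:
    """Hypothetical helper: load a text file as a list of lines under "train"."""
    with open(data_path, encoding="utf-8") as f:
        # Strip only the trailing newline so inner whitespace is preserved.
        lines = [line.rstrip("\n") for line in f]
    # Assumption: all lines are training data; the other parts stay empty.
    return {"train": lines, "valid": [], "test": []}


# Usage: write a two-line file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.txt"
    path.write_text("first line\nsecond line\n", encoding="utf-8")
    parts = read_lines(str(path))
```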

class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader

The class to read the paraphraser.ru dataset from files.

See https://paraphraser.ru.

class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader

Downloads dataset files and prepares train/valid split.

SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

SQuAD2.0: Stanford Question Answering Dataset, version 2.0 https://rajpurkar.github.io/SQuAD-explorer/

SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html

MultiSQuAD: SQuAD dataset with additional contexts retrieved (by TF-IDF) from the original Wikipedia article.

MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by a TF-IDF document ranker from the full Wikipedia.

read(data_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]]

Parameters

data_path – path to save data
dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'
url – link to archive with dataset, use url argument if non-default dataset is used

Returns

dataset split into train and valid parts

Raises

RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
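
The interplay between dataset, url and the RuntimeError can be sketched as follows. `resolve_archive_url` and `DEFAULT_DATASETS` are hypothetical names, the URL values are placeholders rather than real download links, and letting an explicit url override the default after the name check is an assumption.

```python
from typing import Optional

# Names taken from the parameter description above; the URL values are
# placeholders, not real download links.
DEFAULT_DATASETS = {
    "SQuAD": "<archive URL for SQuAD>",
    "SberSQuAD": "<archive URL for SberSQuAD>",
    "MultiSQuAD": "<archive URL for MultiSQuAD>",
}


def resolve_archive_url(dataset: Optional[str] = "SQuAD", url: Optional[str] = None) -> str:
    """Hypothetical helper: pick the archive to download for a dataset name."""
    if dataset not in DEFAULT_DATASETS:
        # Mirrors the documented RuntimeError for unknown dataset names.
        raise RuntimeError(f"Dataset {dataset!r} is not one of {sorted(DEFAULT_DATASETS)}")
    if url is not None:
        # Assumption: an explicit url replaces the default archive location.
        return url
    return DEFAULT_DATASETS[dataset]
```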

class deeppavlov.dataset_readers.typos_reader.TyposCustom

Base class for reading spelling corrections dataset files

static build(data_path: str) → Path

Base method that interprets data_path argument.

Parameters: data_path – path to the tsv-file containing erroneous and corrected words
Returns: the same path as a Path object

classmethod read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]]

Read train data for spelling corrections algorithms

Parameters: data_path – path that needs to be interpreted with build()
Returns: train data to pass to a TyposDatasetIterator
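
The build()/read() split can be illustrated with a small standalone sketch. The functions below are hypothetical and are not the TyposCustom code: the column order (misspelling first, correction second), the absence of a header row, and the "train" key are all assumptions.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def build(data_path: str) -> Path:
    """Interpret data_path; for a custom file this only wraps it in a Path."""
    return Path(data_path)


def read(data_path: str) -> Dict[str, List[Tuple[str, str]]]:
    """Hypothetical reader: one (erroneous, corrected) pair per tsv row."""
    path = build(data_path)
    with path.open(newline="", encoding="utf-8") as f:
        # Assumption: column 0 is the misspelling, column 1 the correction,
        # and the file has no header row.
        pairs = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t") if len(row) >= 2]
    # Assumption: the pairs are exposed under a "train" key.
    return {"train": pairs}


# Usage: write a tiny tsv file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "typos.tsv"
    tsv.write_text("recieve\treceive\nadress\taddress\n", encoding="utf-8")
    typos = read(str(tsv))
```

A subclass that downloads its data would override only build(), which is the pattern the two implementations below follow.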

class deeppavlov.dataset_readers.typos_reader.TyposKartaslov

Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov

static build(data_path: str) → Path

Download the misspellings list from GitHub

Parameters: data_path – target directory to download the data to
Returns: path to the resulting csv-file

static read(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]]

Read train data for spelling corrections algorithms

Parameters: data_path – path that needs to be interpreted with build()
Returns: train data to pass to a TyposDatasetIterator

class deeppavlov.dataset_readers.typos_reader.TyposWikipedia

Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings

static build(data_path: str) → Path

Download and parse common misspellings list from Wikipedia

Parameters: data_path – target directory to download the data to
Returns: path to the resulting tsv-file

class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader

The class to read the Ubuntu V2 dataset from csv files.

See https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]]

Read the Ubuntu V2 dataset from csv files.

Parameters

data_path – A path to a folder with dataset csv files.
positive_samples – if True, only positive context-response pairs are used for training
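
The effect of positive_samples can be sketched on in-memory data shaped like the return type above. `select_train_samples` is a hypothetical helper, and treating label 1 as positive and label 0 as negative is an assumption.

```python
from typing import List, Tuple

Sample = Tuple[List[str], int]


def select_train_samples(samples: List[Sample], positive_samples: bool = False) -> List[Sample]:
    """Hypothetical helper: optionally keep only positive context-response pairs."""
    if not positive_samples:
        return list(samples)
    # Assumption: label 1 marks a positive pair, label 0 a negative one.
    return [(texts, label) for texts, label in samples if label == 1]


# Usage: one positive and one negative pair for the same context.
train = [
    (["how do I install vim?", "sudo apt-get install vim"], 1),
    (["how do I install vim?", "try rebooting"], 0),
]
only_positive = select_train_samples(train, positive_samples=True)
```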

class deeppavlov.dataset_readers.multitask_reader.MultiTaskReader

Class to read several datasets simultaneously.

read(tasks: Dict[str, Dict[str, dict]], task_defaults: Optional[dict] = None, **kwargs)

Creates a dataset reader for each task and returns what each reader's read() method returns.

Parameters

tasks – dictionary whose keys are task names and whose values are dictionaries of parameter name-value pairs used to initialize the nested dataset readers. If a task contains the key-value pair 'use_task_defaults': False, task_defaults are ignored for that task's dataset reader.
task_defaults – default task parameters.

Returns

dictionary whose keys are task names and whose values are what the corresponding task readers' read() methods returned.
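
The way task_defaults and 'use_task_defaults' interact can be sketched as below. Everything here is illustrative: `read_tasks`, the `READERS` registry, the `echo_reader` entry and the `class_name` key are hypothetical, and letting task-specific values override the defaults is an assumption.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical registry standing in for real dataset reader classes.
READERS: Dict[str, Callable[..., dict]] = {
    "echo_reader": lambda **params: {"train": [params], "valid": [], "test": []},
}


def read_tasks(
    tasks: Dict[str, Dict[str, Any]],
    task_defaults: Optional[dict] = None,
) -> Dict[str, dict]:
    """Hypothetical sketch: merge defaults into each task, then call its reader."""
    result = {}
    for task_name, task_params in tasks.items():
        params = dict(task_params)
        use_defaults = params.pop("use_task_defaults", True)
        if use_defaults and task_defaults:
            # Assumption: task-specific values override the shared defaults.
            params = {**task_defaults, **params}
        # Assumption: "class_name" selects the nested reader.
        reader = READERS[params.pop("class_name")]
        result[task_name] = reader(**params)
    return result


# Usage: "sentiment" inherits the defaults, "ner" opts out of them.
out = read_tasks(
    {
        "sentiment": {"data_path": "sentiment/"},
        "ner": {"class_name": "echo_reader", "data_path": "ner/", "use_task_defaults": False},
    },
    task_defaults={"class_name": "echo_reader", "format": "csv"},
)
```

Because "ner" sets 'use_task_defaults': False, it must name its own reader and does not receive the shared format value.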