dataset_iterators

Concrete DatasetIterator classes.

class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, split_seed: Optional[int] = None, stratify: Optional[bool] = None, shot: Optional[int] = None, *args, **kwargs)[source]

The class gets a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.

Parameters
  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)

  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge

  • merged_field – name of field (out of "train", "valid", "test") to which the merged fields are saved

  • field_to_split – name of field (out of "train", "valid", "test") to split

  • split_fields – list of fields (out of "train", "valid", "test") to which the split field is saved

  • split_proportions – list of corresponding proportions for splitting

  • seed – random seed for iterating

  • shuffle – whether to shuffle examples in batches

  • split_seed – random seed for splitting dataset, if split_seed is None, division is based on seed.

  • stratify – whether to use stratified split

  • shot – number of examples to sample for each class in training data. If None, all examples will remain in data.

  • *args – additional positional arguments

  • **kwargs – additional keyword arguments

data

dictionary of data with fields “train”, “valid” and “test” (or some of them)
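
A minimal usage sketch: the toy data below and the 80/20 proportions are illustrative, not part of the API.

    from deeppavlov.dataset_iterators.basic_classification_iterator import \
        BasicClassificationDatasetIterator

    # Each field is a list of (x, y) pairs, as produced by a classification
    # DatasetReader (toy data, illustrative only).
    data = {
        "train": [
            ("good movie", ["positive"]),
            ("boring plot", ["negative"]),
            ("great acting", ["positive"]),
            ("weak script", ["negative"]),
            ("loved it", ["positive"]),
        ],
        "valid": [],
        "test": [],
    }

    # Split the "train" field 80/20 into new "train" and "valid" fields.
    iterator = BasicClassificationDatasetIterator(
        data,
        field_to_split="train",
        split_fields=["train", "valid"],
        split_proportions=[0.8, 0.2],
        split_seed=23,
        seed=42,
        shuffle=True,
    )

    for x_batch, y_batch in iterator.gen_batches(batch_size=2, data_type="valid"):
        print(x_batch, y_batch)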

class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]

The class contains methods for iterating over a dataset for ranking in training, validation, and test modes.

class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]

Iterate over an SQLite database. Generate batches from SQLite data. Get document ids and documents.

Parameters
  • load_path – a path to local DB file

  • batch_size – a number of samples in a single batch

  • shuffle – whether to shuffle data during batching

  • seed – random seed for data shuffling

connect

a DB connection

db_name

a DB name

doc_ids

DB document ids

doc2index

a dictionary of document indices and their titles

batch_size

a number of samples in a single batch

shuffle

whether to shuffle data during batching

random

an instance of the Random class.
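
A minimal usage sketch; the database path is an assumption, and get_doc_content is part of the iterator's public API not listed in this summary.

    from deeppavlov.dataset_iterators.sqlite_iterator import SQLiteDataIterator

    # The DB path is illustrative; point load_path at an existing SQLite file.
    iterator = SQLiteDataIterator(
        load_path="~/.deeppavlov/downloads/odqa/enwiki.db", batch_size=1000)

    print(len(iterator.doc_ids))                          # number of documents
    text = iterator.get_doc_content(iterator.doc_ids[0])  # one document's text

    # Iterate over the data in batches; the batch structure follows the
    # parent DataFittingIterator.
    for batch in iterator.gen_batches(batch_size=1000):
        ...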

class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]

SquadIterator allows iterating over examples in SQuAD-like datasets. SquadIterator is used to train torch_transformers_squad:TorchTransformersSquad.

It extracts context, question, answer_text, and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).

train

train examples

valid

validation examples

test

test examples
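
For illustration, one example in this format (the text and offsets are made up; list-valued answers reflect SQuAD's multiple reference answers and are an assumption here):

    # An illustrative ((context, question), (answer_text, answer_start)) pair.
    context = "DeepPavlov is an open-source conversational AI library."
    question = "What kind of library is DeepPavlov?"
    answer_text = ["an open-source conversational AI library"]
    answer_start = [context.find(answer_text[0])]

    example = ((context, question), (answer_text, answer_start))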

class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]

Implementation of DataLearningIterator used for training ErrorModel

split(test_ratio: float = 0.0, *args, **kwargs)[source]

Split all data into train and test

Parameters

test_ratio – proportion of the data to place in the test set, from 0.0 to 1.0
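
A minimal sketch with made-up (misspelled, correct) pairs:

    from deeppavlov.dataset_iterators.typos_iterator import TyposDatasetIterator

    # Toy typo-correction pairs (illustrative only).
    data = {"train": [("helo", "hello"), ("wrold", "world"), ("teh", "the"),
                      ("recieve", "receive"), ("adress", "address"),
                      ("wich", "which")]}

    iterator = TyposDatasetIterator(data, seed=42, shuffle=True)
    iterator.split(test_ratio=0.5)   # move roughly half of train into test
    print(len(iterator.train), len(iterator.test))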

class deeppavlov.dataset_iterators.multitask_iterator.MultiTaskIterator(data: dict, num_train_epochs: int, tasks: dict, batch_size: int = 8, sampling_mode: str = 'plain', gradient_accumulation_steps: int = 1, steps_per_epoch: int = 0, one_element_tuples: bool = True, task_defaults: Optional[dict] = None, seed: int = 42, **kwargs)[source]

The class merges data from several dataset iterators. When used for batch generation, batches from the merged dataset iterators are united into one batch. If the sizes of the merged datasets differ, smaller datasets are repeated until their size equals that of the largest dataset.

Parameters
  • data – dictionary whose keys are task names and values are dictionaries with fields "train", "valid", "test".

  • num_train_epochs – number of training epochs

  • tasks – dictionary whose keys are task names and values are init params of dataset iterators. If a task has the key-value pair 'use_task_defaults': False, task_defaults for this task's dataset iterator will be ignored.

  • batch_size – number of samples in a single batch

  • sampling_mode – mode of sampling we use. It can be plain, uniform or anneal.

  • gradient_accumulation_steps – number of gradient accumulation steps. Default is 1

  • steps_per_epoch – number of steps per epoch. Necessary if gradient_accumulation_steps > 1

  • iterator_class_name – name of iterator class.

  • use_label_name, seed, features – parameters for the iterator class

  • one_element_tuples – if True, a tuple of x consisting of one element is returned as that element. Default: True

  • task_defaults – default task parameters.

  • seed – random seed for sampling

data

dictionary of data with fields “train”, “valid” and “test” (or some of them)
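
A constructor sketch for two hypothetical classification tasks; the task names, toy data, and the "basic_classification_iterator" registry name are assumptions for illustration.

    from deeppavlov.dataset_iterators.multitask_iterator import MultiTaskIterator

    # Per-task data as produced by the corresponding dataset readers
    # (toy examples, illustrative only).
    data = {
        "sentiment": {"train": [("good", ["pos"]), ("bad", ["neg"])],
                      "valid": [], "test": []},
        "topic": {"train": [("match report", ["sport"])],
                  "valid": [], "test": []},
    }

    iterator = MultiTaskIterator(
        data,
        num_train_epochs=3,
        tasks={
            "sentiment": {"iterator_class_name": "basic_classification_iterator"},
            "topic": {"iterator_class_name": "basic_classification_iterator"},
        },
        batch_size=8,
        sampling_mode="uniform",
        task_defaults={"iterator_class_name": "basic_classification_iterator"},
        seed=42,
    )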

gen_batches(batch_size: int, data_type: str = 'train', shuffle: Optional[bool] = None) → Iterator[Tuple[tuple, tuple]][source]

Generates batches and expected output to train neural networks. If there are not enough samples from any task, samples are padded with None.

Parameters
  • batch_size – number of samples in batch

  • data_type – can be either ‘train’, ‘test’, or ‘valid’

  • shuffle – whether to shuffle dataset before batching

Yields

A tuple of a batch of inputs and a batch of expected outputs. Inputs and outputs are tuples. Each element of inputs or outputs is a tuple whose elements are the values of the merged tasks, in the order the tasks appear in the tasks argument of the __init__ method.
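
Continuing the two-task sketch above, one way to unpack the yielded batches (the per-element unpacking is an assumption consistent with the description above):

    for x_batch, y_batch in iterator.gen_batches(batch_size=8, data_type="train"):
        for x, y in zip(x_batch, y_batch):
            # Each element spans the tasks in the order given in `tasks`;
            # None marks padding where a task ran out of samples.
            sentiment_x, topic_x = x
            sentiment_y, topic_y = y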

get_instances(data_type: str = 'train')[source]

Returns a tuple of inputs and outputs from all datasets. Lengths of inputs and outputs are equal to the size of the largest dataset. Smaller datasets are padded with Nones until their sizes are equal to the size of the largest dataset.

Parameters

data_type – can be either ‘train’, ‘test’, or ‘valid’

Returns

A tuple of all inputs for a data type and all expected outputs for a data type.
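
For example, continuing the two-task sketch above:

    # Per-task values padded with Nones up to the size of the largest dataset.
    x_all, y_all = iterator.get_instances(data_type="valid")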

class deeppavlov.dataset_iterators.multitask_iterator.SingleTaskBatchGenerator(dataset_iterator: DataLearningIterator, batch_size: int, data_type: str, shuffle: bool, n_batches: Optional[int] = None, size_of_last_batch: Optional[int] = None)[source]

Batch generator for a single task. If there are no elements in the dataset to form another batch, Nones are returned.

Parameters
  • dataset_iterator – dataset iterator from which batches are drawn

  • batch_size – size of the batch

  • data_type – “train”, “valid”, or “test”

  • shuffle – whether dataset will be shuffled

  • n_batches – the number of batches that will be generated
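
A constructor sketch; this class is mainly used internally by MultiTaskIterator, and the wrapped iterator below is a toy assumption.

    from deeppavlov.dataset_iterators.basic_classification_iterator import \
        BasicClassificationDatasetIterator
    from deeppavlov.dataset_iterators.multitask_iterator import SingleTaskBatchGenerator

    # A toy wrapped iterator (illustrative only).
    task_iterator = BasicClassificationDatasetIterator(
        {"train": [("good", ["pos"]), ("bad", ["neg"])], "valid": [], "test": []})

    # Up to 100 train batches of size 8; once the task's data is exhausted,
    # batches are padded with Nones.
    generator = SingleTaskBatchGenerator(
        task_iterator, batch_size=8, data_type="train", shuffle=True, n_batches=100)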