dataset_iterators

Concrete DatasetIterator classes.

class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Class gets a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.

Parameters:
  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge
  • merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
  • field_to_split – name of the field (out of "train", "valid", "test") to split
  • split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
  • split_proportions – list of corresponding proportions for splitting
  • seed – random seed
  • shuffle – whether to shuffle examples in batches
  • *args – additional positional arguments
  • **kwargs – additional keyword arguments
data

dictionary of data with fields “train”, “valid” and “test” (or some of them)
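
A minimal sketch of constructing this iterator directly, assuming an in-memory data dictionary of (text, label) pairs; the field names and split proportions below are illustrative:

    from deeppavlov.dataset_iterators.basic_classification_iterator import \
        BasicClassificationDatasetIterator

    # Toy data dictionary; in practice it comes from a DatasetReader.
    data = {'train': [('good movie', 'pos'), ('bad movie', 'neg')] * 50,
            'valid': [],
            'test': []}

    # Split the "train" field into new "train" and "valid" fields (80/20).
    iterator = BasicClassificationDatasetIterator(
        data,
        field_to_split='train',
        split_fields=['train', 'valid'],
        split_proportions=[0.8, 0.2],
        seed=42,
        shuffle=True)

    # gen_batches yields (texts, labels) tuples.
    for texts, labels in iterator.gen_batches(batch_size=16, data_type='train'):
        pass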

class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Iterates over dialog data, generates batches where one sample is one dialog.

A subclass of DataLearningIterator.

train

list of training dialogs (tuples (context, response))

valid

list of validation dialogs (tuples (context, response))

test

list of dialogs used for testing (tuples (context, response))
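
A rough usage sketch, assuming DSTC2-style data where each turn is a pair of dicts and 'episode_done' marks the start of a new dialog; the keys below follow DSTC2DatasetReader and are shown for illustration only:

    from deeppavlov.dataset_iterators.dialog_iterator import DialogDatasetIterator

    # Toy data: two turns forming one dialog (keys are illustrative).
    data = {'train': [({'text': 'hi', 'episode_done': True},
                       {'text': 'hello, how can I help?', 'act': 'welcomemsg'}),
                      ({'text': 'cheap food'},
                       {'text': 'what part of town?', 'act': 'request_area'})],
            'valid': [], 'test': []}

    iterator = DialogDatasetIterator(data, seed=42, shuffle=True)

    # One sample in a batch is one whole dialog:
    # (list of contexts, list of responses).
    for dialog_contexts, dialog_responses in iterator.gen_batches(batch_size=1):
        pass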

class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Iterates over dialog data and outputs a list of all 'db_result' fields (if present).

The class helps to build a list of all 'db_result' values present in a dataset.

Inherits key methods and attributes from DataLearningIterator.

train

list of tuples (db_result dictionary, '') from “train” data

valid

list of tuples (db_result dictionary, '') from “valid” data

test

list of tuples (db_result dictionary, '') from “test” data
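
For illustration, after initialization the train attribute could look like the following (the db_result values are made up):

    # Each sample pairs one db_result dictionary with an empty string.
    train = [({'name': 'the golden curry', 'pricerange': 'expensive'}, ''),
             ({'name': 'nandos', 'pricerange': 'cheap'}, '')]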

class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Class gets a data dictionary from a DSTC2DatasetReader instance, constructs intents from act and slots, merges fields if necessary, and splits a field if necessary.

Parameters:
  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge
  • merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
  • field_to_split – name of the field (out of "train", "valid", "test") to split
  • split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
  • split_proportions – list of corresponding proportions for splitting
  • seed – random seed
  • shuffle – whether to shuffle examples in batches
  • *args – additional positional arguments
  • **kwargs – additional keyword arguments
data

dictionary of data with fields “train”, “valid” and “test” (or some of them)
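
The intent construction can be pictured roughly as follows; this is an illustration of the idea, not the exact implementation:

    # An utterance annotated with act 'inform' and slot ('food', 'italian')
    # yields an intent label that combines the act with the slot name.
    act, slots = 'inform', [('food', 'italian')]
    intents = ['{}_{}'.format(act, name) for name, value in slots]
    # intents == ['inform_food']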

class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator[source]

Iterates over data for the DSTC2 NER task. The iterator takes a dict with fields ‘train’, ‘valid’ and ‘test’; a list of samples ((x, y) pairs) is stored in each field.

Parameters:
  • data – list of (x, y) pairs, samples from the dataset: x as well as y can be a tuple of different input features.
  • dataset_path – path to dataset
  • seed – value for random seed
  • shuffle – whether to shuffle the data
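
An illustrative (x, y) sample for this task, where x is a tokenized utterance and y is the matching list of BIO slot tags (values made up):

    x = ['i', 'want', 'cheap', 'italian', 'food']
    y = ['O', 'O', 'B-pricerange', 'B-food', 'O']
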
class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Takes data from DSTC2DatasetReader, constructs a dialog history for each turn, and generates batches (one sample is one turn).

Inherits key methods and attributes from DataLearningIterator.

train

list of “train” (context, response) tuples

valid

list of “valid” (context, response) tuples

test

list of “test” (context, response) tuples
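
Schematically, a two-turn dialog yields one sample per turn, with each context carrying the history of the preceding turns (a sketch, not the exact format):

    # turn 1: (context_1, response_1)
    # turn 2: (history(context_1, response_1) + context_2, response_2)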

class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data, seed=None, shuffle=True, validation_split=0.2, bucket=True)[source]

Iterates over data for Morphological Tagging. A subclass of DataLearningIterator.

class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(data_dir: str = '', data_url: str = 'http://files.deeppavlov.ai/datasets/wikipedia/enwiki.db', batch_size: int = None, shuffle: bool = None, seed: int = None, **kwargs)[source]

Iterates over a SQLite database: generates batches from the SQLite data and retrieves document ids and document contents.

Parameters:
  • data_dir – a directory to save the downloaded DB to
  • data_url – a URL to download the DB from
  • batch_size – a number of samples in a single batch
  • shuffle – whether to shuffle data when batching
  • seed – random seed for data shuffling
connect

a DB connection

db_name

a DB name

doc_ids

DB document ids

doc2index

a dictionary of document indices and their titles

batch_size

a number of samples in a single batch

shuffle

whether to shuffle data when batching

random

an instance of the Random class
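
A short usage sketch. Note that with the default data_url, instantiating the iterator downloads the full English Wikipedia DB; a smaller local DB is assumed below. The get_doc_content call is taken from the DeepPavlov source and fetches one document's text:

    from deeppavlov.dataset_iterators.sqlite_iterator import SQLiteDataIterator

    # Points at a local directory for the SQLite DB; with the default
    # data_url, enwiki.db is downloaded there first.
    iterator = SQLiteDataIterator(data_dir='odqa_data', shuffle=False)

    print(len(iterator.doc_ids))               # number of documents in the DB
    first_id = iterator.doc_ids[0]
    text = iterator.get_doc_content(first_id)  # text of one document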

class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Implementation of DataLearningIterator used for training ErrorModel.

split(test_ratio: float = 0.0, *args, **kwargs)[source]

Split all data into train and test.

Parameters:
  • test_ratio – ratio of test data to train data, from 0.0 to 1.0
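
A minimal sketch with toy (misspelled, correct) pairs; real data comes from a typos dataset reader:

    from deeppavlov.dataset_iterators.typos_iterator import TyposDatasetIterator

    data = {'train': [('helo', 'hello'), ('wrold', 'world'),
                      ('teh', 'the'), ('adress', 'address')],
            'valid': [], 'test': []}

    iterator = TyposDatasetIterator(data, seed=42, shuffle=True)
    iterator.split(test_ratio=0.25)  # hold out a quarter of the pairs for testing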
class deeppavlov.dataset_iterators.ranking_iterator.RankingIterator(data: Dict[str, List], sample_candidates_pool: bool = False, sample_candidates_pool_valid: bool = True, sample_candidates_pool_test: bool = True, num_negative_samples: int = 10, num_ranking_samples_valid: int = 10, num_ranking_samples_test: int = 10, seed: int = None, shuffle: bool = False, len_vocab: int = 0, pos_pool_sample: bool = False, pos_pool_rank: bool = True, random_batches: bool = False, batches_per_epoch: int = None, triplet_mode: bool = True, hard_triplets_sampling: bool = False, num_positive_samples: int = 5)[source]

The class contains methods for iterating over a dataset for ranking in training, validation and test mode.

Note

Each sample in data['train'] is arranged as follows: {'context': 21507, 'response': 7009, 'pos_pool': [7009, 7010], 'neg_pool': None}. The context is stored under the 'context' key and is represented by a single integer. The correct response is stored under the 'response' key; its value is also always a single integer. The list of all correct responses (there may be several) is stored under the 'pos_pool' key, and the value of 'response' must be equal to one of the items in this list. The list of candidate negative responses (there can be many of them, 100–10000) is stored under the 'neg_pool' key. Its value is None when global sampling is used, or a list of fixed length when sampling from predefined negative responses is used. It is important that the values in 'pos_pool' and 'neg_pool' do not overlap. The single integers in 'context', 'response', 'pos_pool' and 'neg_pool' are mapped to lists of integers by an integer-to-list-of-integers dictionary, and these lists of integers are in turn converted to lists of tokens by an integer-to-token dictionary. Samples in data['valid'] and data['test'] have almost the same representation as the train sample shown above.
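
The sample layout from the note above, written out as a Python dictionary (the integer ids are illustrative):

    # One training sample: all values are integer ids that external
    # dictionaries later map to lists of integers and then to tokens.
    sample = {'context': 21507,          # the context
              'response': 7009,          # the correct response
              'pos_pool': [7009, 7010],  # all correct responses; contains 'response'
              'neg_pool': None}          # None => global negative sampling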

Parameters:
  • data – A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
  • sample_candidates_pool – Whether to sample candidates from a predefined pool of candidates for each sample in training mode. If False, negative sampling from the whole data will be performed.
  • sample_candidates_pool_valid – Whether to validate a model on a predefined pool of candidates for each sample. If False, sampling from the whole data will be performed for validation.
  • sample_candidates_pool_test – Whether to test a model on a predefined pool of candidates for each sample. If False, sampling from the whole data will be performed for test.
  • num_negative_samples – A size of a predefined pool of candidates or a size of data subsample from the whole data in training mode.
  • num_ranking_samples_valid – A size of a predefined pool of candidates or a size of data subsample from the whole data in validation mode.
  • num_ranking_samples_test – A size of a predefined pool of candidates or a size of data subsample from the whole data in test mode.
  • seed – Random seed.
  • shuffle – Whether to shuffle data.
  • len_vocab – A length of a vocabulary to perform sampling in training, validation and test mode.
  • pos_pool_sample – Whether to sample the response from pos_pool each time a batch is generated. If False, the value of the 'response' key will be used.
  • pos_pool_rank – Whether to count samples from the whole pos_pool as correct answers in test / validation mode.
  • random_batches – Whether to choose batches randomly or iterate over data sequentially in training mode.
  • batches_per_epoch – A number of batches to choose per each epoch in training mode. Only required if random_batches is set to True.
  • triplet_mode – Whether to use a model with triplet loss. If False, a model with cross-entropy loss will be used.
  • hard_triplets_sampling – Whether to use hard triplets method of sampling in training mode.
  • num_positive_samples – A number of contexts to choose from pos_pool for each context. Only required if hard_triplets_sampling is set to True.
gen_batches(batch_size: int, data_type: str = 'train', shuffle: bool = True) → Tuple[List[List[Tuple[int, int]]], List[int]][source]

Generate batches of inputs and expected outputs to train neural networks.

Parameters:
  • batch_size – number of samples in batch
  • data_type – can be ‘train’, ‘test’ or ‘valid’
  • shuffle – whether to shuffle dataset before batching
Returns:

A tuple of a batch of inputs and a batch of expected outputs.

Inputs and expected outputs have different structure and meaning depending on the values of class attributes and on data_type.
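
A sketch of training-mode iteration; train_samples, valid_samples and test_samples are assumed to follow the sample layout shown in the note above:

    from deeppavlov.dataset_iterators.ranking_iterator import RankingIterator

    iterator = RankingIterator({'train': train_samples,
                                'valid': valid_samples,
                                'test': test_samples},
                               seed=42, len_vocab=30000)

    for x_batch, y_batch in iterator.gen_batches(batch_size=32, data_type='train'):
        # The structure of x_batch and y_batch depends on triplet_mode,
        # the candidate-pool attributes and data_type (see above).
        pass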

class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

SquadIterator iterates over examples in SQuAD-like datasets and is used to train SquadModel.

It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).
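
A minimal sketch with one toy SQuAD-like example; the exact shape of the answer fields follows the dataset reader and is shown here for illustration:

    from deeppavlov.dataset_iterators.squad_iterator import SquadIterator

    data = {'train': [(('Norman is in Oklahoma.', 'Where is Norman?'),
                       ('Oklahoma', 13))],
            'valid': [], 'test': []}

    iterator = SquadIterator(data, seed=42, shuffle=True)

    for x_batch, y_batch in iterator.gen_batches(batch_size=1, data_type='train'):
        (context, question), = x_batch          # one (context, question) pair
        (answer_text, answer_start), = y_batch  # its expected answer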

train

train examples

valid

validation examples

test

test examples