
Concrete DatasetIterator classes.

class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, split_seed: int = None, stratify: bool = None, *args, **kwargs)

Takes the data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.

  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)

  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge

  • merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields

  • field_to_split – name of the field (out of "train", "valid", "test") to split

  • split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field

  • split_proportions – list of corresponding proportions for splitting

  • seed – random seed for iterating

  • shuffle – whether to shuffle examples in batches

  • split_seed – random seed for splitting the dataset; if split_seed is None, the split is based on seed

  • stratify – whether to use stratified split

  • *args – additional positional arguments

  • **kwargs – additional keyword arguments


dictionary of data with fields “train”, “valid” and “test” (or some of them)
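
The merge and split steps described above can be sketched in plain Python. This is a minimal illustration, not DeepPavlov's implementation: the helper names merge_fields and split_field are hypothetical, the sample data is invented, and stratified splitting is not modelled.

```python
import random
from typing import Dict, List, Optional


def merge_fields(data: Dict[str, list], fields_to_merge: List[str],
                 merged_field: str) -> Dict[str, list]:
    """Concatenate the listed fields and store the result under merged_field."""
    merged = []
    for field in fields_to_merge:
        merged.extend(data.get(field, []))
    result = dict(data)
    result[merged_field] = merged
    return result


def split_field(data: Dict[str, list], field_to_split: str,
                split_fields: List[str], split_proportions: List[float],
                seed: Optional[int] = None) -> Dict[str, list]:
    """Shuffle one field and cut it into consecutive parts by proportion."""
    samples = list(data[field_to_split])
    random.Random(seed).shuffle(samples)
    result = dict(data)
    start = 0
    for i, (name, share) in enumerate(zip(split_fields, split_proportions)):
        if i == len(split_fields) - 1:
            # The last part takes the remainder so no sample is lost to rounding.
            end = len(samples)
        else:
            end = start + int(round(share * len(samples)))
        result[name] = samples[start:end]
        start = end
    return result


# Invented toy data: 8 "train" samples and 2 "valid" samples.
data = {"train": [(f"text {i}", "label") for i in range(8)],
        "valid": [(f"text {i}", "label") for i in range(8, 10)]}
data = merge_fields(data, ["train", "valid"], "train")
data = split_field(data, "train", ["train", "valid"], [0.8, 0.2], seed=42)
print(len(data["train"]), len(data["valid"]))
```

Using a dedicated random generator for the split mirrors the idea behind split_seed: the split can be made reproducible independently of the seed used for shuffling during iteration.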

class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Iterates over dialog data, generates batches where one sample is one dialog.

A subclass of DataLearningIterator.


list of training dialogs (tuples (context, response))


list of validation dialogs (tuples (context, response))


list of dialogs used for testing (tuples (context, response))
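
To illustrate "one sample is one dialog", the sketch below groups a flat list of (context, response) turns into dialogs. It assumes, hypothetically, that each context is a dict and that the first turn of every dialog carries an "episode_done" flag; neither the flag nor the helper name group_into_dialogs is taken from this page.

```python
from typing import Any, Dict, List, Tuple

Turn = Tuple[Dict[str, Any], Dict[str, Any]]


def group_into_dialogs(turns: List[Turn]) -> List[Tuple[list, list]]:
    """Group a flat list of (context, response) turns into dialogs.

    Assumption: the hypothetical "episode_done" flag marks the first turn
    of every dialog.
    """
    dialogs: List[Tuple[list, list]] = []
    for context, response in turns:
        if context.get("episode_done") or not dialogs:
            # Start a new dialog: one list of contexts, one list of responses.
            dialogs.append(([], []))
        dialogs[-1][0].append(context)
        dialogs[-1][1].append(response)
    return dialogs


# Invented toy turns covering two dialogs.
turns = [
    ({"text": "hi", "episode_done": True}, {"text": "hello"}),
    ({"text": "cheap food"}, {"text": "what area?"}),
    ({"text": "bye", "episode_done": True}, {"text": "goodbye"}),
]
dialogs = group_into_dialogs(turns)
print(len(dialogs))
```

A batch of size n would then contain n such (contexts, responses) pairs rather than n individual turns.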

class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIndexingIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Iterates over dialog data, generates batches where one sample is one dialog. Assigns a unique index value to each turn item of each dialog.

A subclass of DataLearningIterator.


list of training dialogs (tuples (context, response))


list of validation dialogs (tuples (context, response))


list of dialogs used for testing (tuples (context, response))

class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Iterates over dialog data and outputs a list of all 'db_result' fields (if present).

The class helps to build a list of all 'db_result' values present in a dataset.

Inherits key methods and attributes from DataLearningIterator.


list of tuples (db_result dictionary, '') from “train” data


list of tuples (db_result dictionary, '') from “valid” data


list of tuples (db_result dictionary, '') from “test” data
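
Collecting the 'db_result' values can be sketched as follows. The sketch assumes that both elements of a (context, response) tuple are dicts and that either may carry a 'db_result' key; the helper name collect_db_results and the sample data are hypothetical.

```python
from itertools import chain
from typing import Any, Dict, List, Tuple

Turn = Tuple[Dict[str, Any], Dict[str, Any]]


def collect_db_results(turns: List[Turn]) -> List[Tuple[Any, str]]:
    """Return (db_result, '') for every utterance dict that has a 'db_result' key."""
    utterances = chain.from_iterable(turns)  # flatten the (context, response) pairs
    return [(u["db_result"], "") for u in utterances if "db_result" in u]


# Invented toy turns: only the first response carries a database result.
turns = [
    ({"text": "cheap chinese food"},
     {"text": "try this one", "db_result": {"name": "golden wok", "pricerange": "cheap"}}),
    ({"text": "thanks"}, {"text": "you are welcome"}),
]
db_results = collect_db_results(turns)
print(db_results)
```

The empty string in each tuple keeps the (x, y) sample shape that the rest of the pipeline expects, even though there is no target value.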

class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

Takes the data dictionary from a DSTC2DatasetReader instance, constructs intents from acts and slots, merges fields if necessary, and splits a field if necessary.

  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)

  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge

  • merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields

  • field_to_split – name of the field (out of "train", "valid", "test") to split

  • split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field

  • split_proportions – list of corresponding proportions for splitting

  • seed – random seed

  • shuffle – whether to shuffle examples in batches

  • *args – additional positional arguments

  • **kwargs – additional keyword arguments


dictionary of data with fields “train”, “valid” and “test” (or some of them)
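
One plausible reading of "constructs intents from acts and slots" is sketched below. The input layout (a list of dicts with an "act" string and a "slots" list of [name, value] pairs), the underscore-joined label format, and the helper name build_intents are all assumptions made for illustration, not details confirmed by this page.

```python
from typing import Any, Dict, List


def build_intents(acts: List[Dict[str, Any]]) -> List[str]:
    """Turn dialog acts into flat intent labels (hypothetical label format)."""
    labels = []
    for act in acts:
        if not act["slots"]:
            # An act without slots becomes a label on its own.
            labels.append(act["act"])
        for slot_name, _slot_value in act["slots"]:
            # An act with slots yields one label per slot name.
            labels.append(f'{act["act"]}_{slot_name}')
    return labels


# Invented toy annotation of a single user utterance.
acts = [{"act": "inform", "slots": [["food", "chinese"], ["area", "north"]]},
        {"act": "thankyou", "slots": []}]
intents = build_intents(acts)
print(intents)
```

Because one utterance can yield several labels, the resulting task is naturally multi-label classification.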

class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator(data: Dict[str, List[Tuple]], slot_values_path: str, seed: int = None, shuffle: bool = False)

Iterates over data for the DSTC2 NER task. The dataset takes a dict with fields ‘train’, ‘test’, ‘valid’. A list of samples (pairs x, y) is stored in each field.

  • data – list of (x, y) pairs, samples from the dataset: x as well as y can be a tuple of different input features.

  • dataset_path – path to dataset

  • seed – value for random seed

  • shuffle – whether to shuffle the data

class deeppavlov.dataset_iterators.elmo_file_paths_iterator.ELMoFilePathsIterator(data: Dict[str, List[Union[str, pathlib.Path]]], load_path: Union[str, pathlib.Path], seed: Optional[int] = None, shuffle: bool = True, unroll_steps: Optional[int] = None, n_gpus: Optional[int] = None, max_word_length: Optional[int] = None, bos: str = '<S>', eos: str = '</S>', *args, **kwargs)

Dataset iterator for tokenized datasets like the 1 Billion Word Benchmark. It gets lists of file paths from the data dictionary and returns batches of lines from each file.

  • data – dict with keys 'train', 'valid' and 'test' whose values are lists of file paths

  • load_path – path from which to load the vocabulary

  • seed – random seed for data shuffling

  • shuffle – whether to shuffle data during batching

  • unroll_steps – number of unrolling steps

  • n_gpus – number of GPUs to use

  • max_word_length – maximum word length

  • bos – beginning-of-sentence tag

  • eos – end-of-sentence tag

class deeppavlov.dataset_iterators.file_paths_iterator.FilePathsIterator(data: Dict[str, List[Union[str, pathlib.Path]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)

Dataset iterator for datasets like the 1 Billion Word Benchmark. It gets lists of file paths from the data dictionary and returns lines from each file.

  • data – dict with keys 'train', 'valid' and 'test' whose values are lists of file paths

  • seed – random seed for data shuffling

  • shuffle – whether to shuffle data during batching
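
The idea of iterating over file paths and returning the lines of each file can be sketched with the standard library alone. The function name iter_lines, the shard file names, and their contents are hypothetical; the real class may batch and shuffle differently.

```python
import random
import tempfile
from pathlib import Path
from typing import Iterator, List, Optional, Union


def iter_lines(paths: List[Union[str, Path]], shuffle: bool = True,
               seed: Optional[int] = None) -> Iterator[List[str]]:
    """Yield the lines of each file, visiting files in optionally shuffled order."""
    paths = list(paths)
    if shuffle:
        random.Random(seed).shuffle(paths)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield [line.rstrip("\n") for line in f]


# Build three small invented shards in a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    shards = []
    for i in range(3):
        shard = Path(tmp) / f"shard-{i}.txt"
        shard.write_text(f"line a{i}\nline b{i}\n", encoding="utf-8")
        shards.append(shard)
    batches = list(iter_lines(shards, shuffle=True, seed=0))

print(len(batches))
```

Keeping only paths in memory and opening one file at a time is what makes this pattern suitable for corpora that are split into many shards.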

class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Takes data from DSTC2DatasetReader, constructs the dialog history for each turn, and generates batches (one sample is a turn).

Inherits key methods and attributes from DataLearningIterator.


list of “train” (context, response) tuples


list of “valid” (context, response) tuples


list of “test” (context, response) tuples
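
Constructing a dialog history for each turn could look roughly like the sketch below, where every sample is a single turn that carries all earlier utterances and responses. The "text" and "history" keys, the helper name turns_with_history, and the sample dialog are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, object], str]


def turns_with_history(dialog: List[Tuple[str, str]]) -> List[Sample]:
    """Turn one dialog into per-turn samples that include the preceding history."""
    samples: List[Sample] = []
    history: List[str] = []
    for utterance, response in dialog:
        # Copy the history so later turns do not mutate earlier samples.
        samples.append(({"text": utterance, "history": list(history)}, response))
        history.extend([utterance, response])
    return samples


# Invented two-turn dialog.
dialog = [("where is the gas station", "two miles away"),
          ("thanks", "you are welcome")]
samples = turns_with_history(dialog)
print(samples[1][0]["history"])
```

Compared with dialog-level iteration, this yields more samples per dialog, and each sample is self-contained.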

deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]]

Processes all words in data using process_word().

  • data – a list of pairs (words, tags), each pair corresponds to a single sentence

  • to_lower – whether to lowercase the words

  • append_case – whether to add a case mark


a list of preprocessed sentences
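
A simplified picture of this preprocessing is sketched below. The process_word defined here is a hypothetical stand-in, not the library function: it splits a word into characters and, when lowercasing, adds a case mark at the position named by append_case. The mark strings and the "last" option are assumptions.

```python
from typing import List, Optional, Tuple


def process_word(word: str, to_lower: bool = True,
                 append_case: Optional[str] = "first") -> Tuple[str, ...]:
    """Hypothetical stand-in: split a word into characters plus an optional case mark."""
    if len(word) > 1 and word.isupper():
        case_mark = "<ALL_UPPER>"      # assumed mark name
    elif word[:1].isupper():
        case_mark = "<FIRST_UPPER>"    # assumed mark name
    else:
        case_mark = None
    symbols = list(word.lower() if to_lower else word)
    # The mark only carries information once the original case has been removed.
    if to_lower and case_mark is not None:
        if append_case == "first":
            symbols.insert(0, case_mark)
        elif append_case == "last":
            symbols.append(case_mark)
    return tuple(symbols)


def preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True,
                    append_case: Optional[str] = "first"):
    """Apply process_word to every word; tags are passed through unchanged."""
    return [([process_word(w, to_lower, append_case) for w in words], tags)
            for words, tags in data]


# Invented one-sentence dataset.
data = [(["Anna", "sleeps"], ["PROPN", "VERB"])]
processed = preprocess_data(data)
print(processed[0][0][0])
```

This matches the return type in the signature above: each sentence becomes a list of tuples of strings, paired with its unchanged list of tags.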

class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, min_train_fraction: float = 0.0, validation_split: float = 0.2)

Iterates over data for Morphological Tagging. A subclass of DataLearningIterator.

  • seed – random seed for data shuffling

  • shuffle – whether to shuffle data during batching

  • validation_split – the fraction of validation data (used only if there is no valid subset in data)

  • min_train_fraction – minimum fraction of train data in the train+dev dataset. For fair comparison with UDPipe, it is set to 0.9 for UD experiments. It is actually used only for Turkish data.
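
The role of validation_split can be illustrated with a short sketch: a validation subset is carved out of the training data only when none is supplied. The helper name ensure_valid_subset is hypothetical, and the effect of min_train_fraction is not modelled here.

```python
import random
from typing import Dict, Optional


def ensure_valid_subset(data: Dict[str, list], validation_split: float = 0.2,
                        seed: Optional[int] = None) -> Dict[str, list]:
    """Create a 'valid' subset from 'train' only if no valid data is present."""
    if data.get("valid"):
        return data  # an existing validation subset is left untouched
    samples = list(data["train"])
    random.Random(seed).shuffle(samples)
    n_valid = int(len(samples) * validation_split)
    result = dict(data)
    result["valid"] = samples[:n_valid]
    result["train"] = samples[n_valid:]
    return result


# Invented toy data with no validation subset.
data = {"train": [(["word"], ["TAG"])] * 10, "valid": []}
data = ensure_valid_subset(data, validation_split=0.2, seed=1)
print(len(data["train"]), len(data["valid"]))
```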

class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

The class contains methods for iterating over a dataset for ranking in training, validation and test mode.

class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)

Iterates over an SQLite database. Generates batches from SQLite data. Gets document ids and documents.

  • load_path – path to a local DB file

  • batch_size – number of samples in a single batch

  • shuffle – whether to shuffle data during batching

  • seed – random seed for data shuffling


a DB connection


a DB name


DB document ids


a dictionary of document indices and their titles


a number of samples in a single batch


whether to shuffle data during batching


an instance of Random class.
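
Batching over an SQLite database can be sketched with the standard sqlite3 module. The table name documents, its id and text columns, and the function name iter_batches are assumptions for illustration only; the schema used by the real class is not given on this page.

```python
import random
import sqlite3
from typing import Iterator, List, Optional, Tuple


def iter_batches(conn: sqlite3.Connection, batch_size: int, shuffle: bool = False,
                 seed: Optional[int] = None) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (documents, document ids) batches from a hypothetical 'documents' table."""
    doc_ids = [row[0] for row in conn.execute("SELECT id FROM documents ORDER BY id")]
    if shuffle:
        random.Random(seed).shuffle(doc_ids)
    for start in range(0, len(doc_ids), batch_size):
        batch_ids = doc_ids[start:start + batch_size]
        # Fetch document texts one id at a time to preserve the batch order.
        texts = [conn.execute("SELECT text FROM documents WHERE id = ?",
                              (doc_id,)).fetchone()[0]
                 for doc_id in batch_ids]
        yield texts, batch_ids


# An in-memory database with five invented documents stands in for the DB file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(f"doc{i}", f"text of document {i}") for i in range(5)])
batches = list(iter_batches(conn, batch_size=2))
conn.close()
print(len(batches))
```

Loading only the ids up front and fetching documents per batch keeps memory use low when the database is large.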

class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

SquadIterator iterates over examples in SQuAD-like datasets. It is used to train SquadModel.

It extracts context, question, answer_text and answer_start position from the dataset. An example from the dataset is a tuple of (context, question) and (answer_text, answer_start).


train examples


validation examples


test examples
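
The extraction step can be sketched against the public SQuAD JSON layout (data, paragraphs, qas, answers). The helper name extract_examples is hypothetical, the sample passage is invented, and collecting answers as parallel lists is an assumption about how multiple reference answers are represented.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Tuple[str, str], Tuple[List[str], List[int]]]


def extract_examples(dataset: Dict[str, Any]) -> List[Example]:
    """Flatten SQuAD-style JSON into ((context, question), (answer_text, answer_start))."""
    examples: List[Example] = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # A question may have several reference answers.
                answer_text = [a["text"] for a in qa["answers"]]
                answer_start = [a["answer_start"] for a in qa["answers"]]
                examples.append(((context, qa["question"]), (answer_text, answer_start)))
    return examples


# Invented one-question dataset in the SQuAD layout.
context = "DeepPavlov is an open-source library."
answer = "an open-source library"
dataset = {"data": [{"paragraphs": [{
    "context": context,
    "qas": [{"question": "What is DeepPavlov?",
             "answers": [{"text": answer, "answer_start": context.index(answer)}]}],
}]}]}
examples = extract_examples(dataset)
print(examples[0][0][1])
```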

class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Implementation of DataLearningIterator used for training ErrorModel.

split(test_ratio: float = 0.0, *args, **kwargs)

Splits all data into train and test.

  • test_ratio – ratio of test data to train, from 0.0 to 1.0