dataset_iterators

Concrete DatasetIterator classes.

class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, split_seed: int = None, stratify: bool = None, *args, **kwargs)[source]

Gets a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.

Parameters:
  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge
  • merged_field – name of field (out of "train", "valid", "test") to which save merged fields
  • field_to_split – name of field (out of "train", "valid", "test") to split
  • split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
  • split_proportions – list of corresponding proportions for splitting
  • seed – random seed for iterating
  • shuffle – whether to shuffle examples in batches
  • split_seed – random seed for splitting dataset, if split_seed is None, division is based on seed.
  • stratify – whether to use stratified split
  • *args – arguments
  • **kwargs – arguments
data

dictionary of data with fields “train”, “valid” and “test” (or some of them)
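The merge and split behaviour described above can be sketched in plain Python. This is an illustrative approximation only, not the class's actual implementation; the helper names `merge_fields` and `split_field` are hypothetical.

```python
import random

def merge_fields(data, fields_to_merge, merged_field):
    """Concatenate the listed fields into one merged field."""
    merged = []
    for field in fields_to_merge:
        merged.extend(data.get(field, []))
    result = dict(data)
    result[merged_field] = merged
    return result

def split_field(data, field_to_split, split_fields, split_proportions, seed=42):
    """Shuffle one field and split it into parts by the given proportions."""
    rng = random.Random(seed)
    samples = list(data.get(field_to_split, []))
    rng.shuffle(samples)
    result = dict(data)
    start = 0
    for name, proportion in zip(split_fields, split_proportions):
        size = int(round(proportion * len(samples)))
        result[name] = samples[start:start + size]
        start += size
    result[split_fields[-1]].extend(samples[start:])  # rounding remainder
    return result

data = {"train": [(f"text {i}", "label") for i in range(100)], "valid": [], "test": []}
data = split_field(data, "train", ["train", "valid"], [0.8, 0.2])
```

With `split_proportions=[0.8, 0.2]` the 100 training samples end up as an 80/20 train/valid split.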

class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Iterates over dialog data, generates batches where one sample is one dialog.

A subclass of DataLearningIterator.

train

list of training dialogs (tuples (context, response))

valid

list of validation dialogs (tuples (context, response))

test

list of dialogs used for testing (tuples (context, response))
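The batching unit here is a whole dialog rather than a single turn. A minimal sketch of that idea, with toy data and a hypothetical `gen_batches` helper (the real class inherits its batching from DataLearningIterator):

```python
from itertools import islice

# Toy data: each sample is one whole dialog, i.e. a list of (context, response) turns.
dialogs = [
    [("hi", "hello! how can I help?"), ("book a table", "for how many people?")],
    [("what's the weather?", "sunny and warm")],
    [("bye", "goodbye!")],
]

def gen_batches(samples, batch_size):
    """Yield batches where one sample is a whole dialog, not a single turn."""
    it = iter(samples)
    while batch := list(islice(it, batch_size)):
        yield batch

batches = list(gen_batches(dialogs, batch_size=2))
```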

class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Iterates over dialog data, outputs list of all 'db_result' fields (if present).

The class helps to build a list of all 'db_result' values present in a dataset.

Inherits key methods and attributes from DataLearningIterator.

train

list of tuples (db_result dictionary, '') from “train” data

valid

list of tuples (db_result dictionary, '') from “valid” data

test

list of tuples (db_result dictionary, '') from “test” data
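A sketch of how the (db_result, '') tuples might be collected; the turn structure below is hypothetical toy data, not the actual DSTC2 format:

```python
# Hypothetical turns: some responses carry a 'db_result' field.
turns = [
    ({"text": "find a thai restaurant"},
     {"text": "how about Bangkok Kitchen?",
      "db_result": {"name": "Bangkok Kitchen", "food": "thai"}}),
    ({"text": "thanks"}, {"text": "you are welcome"}),
]

def collect_db_results(turns):
    """Collect all 'db_result' values present in the data as (db_result, '') tuples."""
    return [(response["db_result"], "") for _, response in turns if "db_result" in response]

results = collect_db_results(turns)
```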

class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Gets a data dictionary from a DSTC2DatasetReader instance, constructs intents from acts and slots, merges fields if necessary, and splits a field if necessary.

Parameters:
  • data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
  • fields_to_merge – list of fields (out of "train", "valid", "test") to merge
  • merged_field – name of field (out of "train", "valid", "test") to which save merged fields
  • field_to_split – name of field (out of "train", "valid", "test") to split
  • split_fields – list of fields (out of "train", "valid", "test") in which to save the split field
  • split_proportions – list of corresponding proportions for splitting
  • seed – random seed
  • shuffle – whether to shuffle examples in batches
  • *args – arguments
  • **kwargs – arguments
data

dictionary of data with fields “train”, “valid” and “test” (or some of them)

class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator[source]

Iterates over data for the DSTC2 NER task. The dataset is a dict with fields ‘train’, ‘test’ and ‘valid’; each field stores a list of samples (pairs x, y).

Parameters:
  • data – list of (x, y) pairs, samples from the dataset: x as well as y can be a tuple of different input features.
  • dataset_path – path to dataset
  • seed – value for random seed
  • shuffle – whether to shuffle the data
class deeppavlov.dataset_iterators.elmo_file_paths_iterator.ELMoFilePathsIterator(data: Dict[str, List[Union[str, pathlib.Path]]], load_path: Union[str, pathlib.Path], seed: Optional[int] = None, shuffle: bool = True, unroll_steps: Optional[int] = None, n_gpus: Optional[int] = None, max_word_length: Optional[int] = None, bos: str = '<S>', eos: str = '</S>', *args, **kwargs)[source]

Dataset iterator for tokenized datasets like the 1 Billion Word Benchmark. It gets lists of file paths from the data dictionary and returns batches of lines from each file.

Parameters:
  • data – dict with keys 'train', 'valid' and 'test' whose values are lists of file paths
  • load_path – path to load the vocabulary from
  • seed – random seed for data shuffling
  • shuffle – whether to shuffle data during batching
  • unroll_steps – number of unrolling steps
  • n_gpus – number of GPUs to use
  • max_word_length – maximum length of a word
  • bos – beginning-of-sentence tag
  • eos – end-of-sentence tag
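The bos/eos parameters wrap every tokenized sentence in sentinel tags, which the language model uses to detect sentence boundaries. A minimal sketch (the helper name `tag_sentence` is hypothetical; the default tags match the signature above):

```python
def tag_sentence(tokens, bos="<S>", eos="</S>"):
    """Wrap a tokenized sentence with begin- and end-of-sentence tags."""
    return [bos] + tokens + [eos]

tagged = tag_sentence("the cat sat".split())
```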
class deeppavlov.dataset_iterators.file_paths_iterator.FilePathsIterator(data: Dict[str, List[Union[str, pathlib.Path]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]

Dataset iterator for datasets like the 1 Billion Word Benchmark. It gets lists of file paths from the data dictionary and returns lines from each file.

Parameters:
  • data – dict with keys 'train', 'valid' and 'test' whose values are lists of file paths
  • seed – random seed for data shuffling
  • shuffle – whether to shuffle data during batching
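The file-path-to-lines behaviour can be sketched as follows; this is an illustrative approximation with stand-in shard files, not the iterator's actual implementation:

```python
import pathlib
import tempfile

def iter_lines(paths):
    """Yield lines from each file path in turn."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

# Two tiny shard files stand in for 1 Billion Word Benchmark shards.
shard_dir = pathlib.Path(tempfile.mkdtemp())
for i, text in enumerate(["first shard line", "second shard line"]):
    (shard_dir / f"shard_{i}.txt").write_text(text + "\n", encoding="utf-8")

data = {"train": sorted(shard_dir.glob("shard_*.txt"))}
lines = list(iter_lines(data["train"]))
```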
class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Takes data from DSTC2DatasetReader, constructs a dialog history for each turn, and generates batches (one sample is one turn).

Inherits key methods and attributes from DataLearningIterator.

train

list of “train” (context, response) tuples

valid

list of “valid” (context, response) tuples

test

list of “test” (context, response) tuples
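Constructing a per-turn dialog history might look like the sketch below; the `build_histories` helper is hypothetical and only illustrates the idea of pairing each response with the accumulated history:

```python
def build_histories(dialog):
    """For each turn, pair the running dialog history with the response."""
    samples, history = [], []
    for context, response in dialog:
        history.append(context)
        samples.append((list(history), response))
        history.append(response)
    return samples

dialog = [("hi", "hello"), ("find a gas station", "the nearest one is 2 miles away")]
samples = build_histories(dialog)
```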

deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]][source]

Processes all words in data using process_word().

Parameters:
  • data – a list of pairs (words, tags), each pair corresponds to a single sentence
  • to_lower – whether to lowercase
  • append_case – whether to add case mark
Returns:

a list of preprocessed sentences
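A heavily simplified sketch of what lowercasing plus a case mark could look like. Both the `process_word` behaviour shown here and the "<FIRST_UPPER>" token name are assumptions for illustration only; consult the actual process_word() for the real symbol inventory:

```python
def process_word(word, to_lower=True, append_case="first"):
    """Hypothetical sketch: split a word into symbols, lowercase it, and
    prepend a case-mark token when the original word was capitalized."""
    symbols = tuple(word.lower() if to_lower else word)
    if append_case == "first" and word[:1].isupper():
        symbols = ("<FIRST_UPPER>",) + symbols
    return symbols

def preprocess_data(data, to_lower=True, append_case="first"):
    return [([process_word(w, to_lower, append_case) for w in words], tags)
            for words, tags in data]

sentences = [(["The", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
processed = preprocess_data(sentences)
```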

class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, min_train_fraction: float = 0.0, validation_split: float = 0.2)[source]

Iterates over data for Morphological Tagging. A subclass of DataLearningIterator.

Parameters:
  • seed – random seed for data shuffling
  • shuffle – whether to shuffle data during batching
  • validation_split – the fraction of validation data (is used only if there is no valid subset in data)
  • min_train_fraction – minimal fraction of train data in the train+dev dataset. For a fair comparison with UDPipe it is set to 0.9 for UD experiments. It is currently used only for Turkish data.
class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List], seed: int = None, shuffle: bool = False, num_samples: int = None, random_batches: bool = False, batches_per_epoch: int = None, *args, **kwargs)[source]

The class contains methods for iterating over a dataset for ranking in training, validation and test modes.

Parameters:
  • data – A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
  • seed – Random seed.
  • shuffle – Whether to shuffle data.
  • num_samples – The number of data samples to use in train, validation and test modes.
  • random_batches – Whether to choose batches randomly or iterate over data sequentially in training mode.
  • batches_per_epoch – The number of batches to choose per epoch in training mode. Only required if random_batches is set to True.
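The random_batches/batches_per_epoch interplay can be sketched like this; an illustrative approximation with a hypothetical `random_train_batches` helper, not the class's actual code:

```python
import random

def random_train_batches(samples, batch_size, batches_per_epoch, seed=0):
    """With random_batches=True, draw a fixed number of random batches per epoch."""
    rng = random.Random(seed)
    for _ in range(batches_per_epoch):
        yield rng.sample(samples, batch_size)

samples = [(f"query {i}", f"response {i}") for i in range(100)]
epoch = list(random_train_batches(samples, batch_size=8, batches_per_epoch=5))
```

Without random_batches, an epoch would instead walk the data sequentially once; with it, the epoch length is fixed by batches_per_epoch regardless of dataset size.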
class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]

Iterates over a SQLite database: generates batches from SQLite data and retrieves document ids and documents.

Parameters:
  • load_path – a path to local DB file
  • batch_size – a number of samples in a single batch
  • shuffle – whether to shuffle data during batching
  • seed – random seed for data shuffling
connect

a DB connection

db_name

a DB name

doc_ids

DB document ids

doc2index

a dictionary of document indices and their titles

batch_size

a number of samples in a single batch

shuffle

whether to shuffle data during batching

random

an instance of the Random class.
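The id/content lookups can be sketched with the standard sqlite3 module. The schema and helper names below are toy stand-ins; the actual DB layout produced by DeepPavlov's ODQA tooling may differ:

```python
import sqlite3

# In-memory stand-in for the on-disk DB file named by load_path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [("doc1", "first document text"), ("doc2", "second document text")])

def get_doc_ids(conn):
    """Return all document ids."""
    return [row[0] for row in conn.execute("SELECT id FROM documents ORDER BY id")]

def get_doc_content(conn, doc_id):
    """Return the text of one document by id."""
    row = conn.execute("SELECT text FROM documents WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else None

doc_ids = get_doc_ids(conn)
```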

class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

SquadIterator iterates over examples in SQuAD-like datasets. It is used to train SquadModel.

It extracts context, question, answer_text and the answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).

train

train examples

valid

validation examples

test

test examples
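The example tuple shape described above, with toy content (answer_start is a character offset into the context):

```python
# Shape of one SQuAD-style example: ((context, question), (answer_text, answer_start)).
context = "DeepPavlov is an open-source conversational AI library."
question = "What kind of library is DeepPavlov?"
answer_text = "an open-source conversational AI library"
answer_start = context.find(answer_text)

example = ((context, question), (answer_text, answer_start))
```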

class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)[source]

Implementation of DataLearningIterator used for training ErrorModel.

split(test_ratio: float = 0.0, *args, **kwargs)[source]

Splits all data into train and test parts.

Parameters:
  • test_ratio – ratio of test data to train data, from 0.0 to 1.0
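A sketch of such a split on (misspelling, correction) pairs; an illustrative approximation under the assumption that the last test_ratio share of samples becomes the test part, not the method's actual implementation:

```python
def split(data, test_ratio=0.0):
    """Move the last test_ratio share of 'train' samples into 'test'."""
    samples = data.get("train", [])
    n_test = int(round(test_ratio * len(samples)))
    cut = len(samples) - n_test
    return {"train": samples[:cut], "valid": [], "test": samples[cut:]}

pairs = [("teh", "the"), ("adress", "address"), ("recieve", "receive"), ("wich", "which")]
split_data = split({"train": pairs}, test_ratio=0.25)
```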