DatasetReader, Vocab, DataLearningIterator and DataFittingIterator classes.


class DatasetReader

An abstract class for reading data from some location and constructing a dataset.

class DataFittingIterator(data: List[str], doc_ids: List[Any] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

Dataset iterator for fitting estimator models such as vocabularies, kNN, and vectorizers. Data is passed as a list of strings (documents). The iterator generates batches, which is useful for large datasets.

  • data – list of documents

  • doc_ids – provided document ids

  • seed – random seed for data shuffling

  • shuffle – whether to shuffle data during batching


Attributes:

  • shuffle – whether to shuffle data during batching

  • random – instance of Random initialized with a seed

  • data – list of documents

  • doc_ids – ids provided by the user or generated automatically
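
The batching behaviour described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the class name SketchFittingIterator, the method name gen_batches, and the fallback to positional ids are all assumptions made for the example.

```python
from random import Random
from typing import Any, Iterator, List, Optional, Tuple


class SketchFittingIterator:
    """Hypothetical stand-in for a fitting iterator over documents."""

    def __init__(self, data: List[str], doc_ids: Optional[List[Any]] = None,
                 seed: Optional[int] = None, shuffle: bool = True) -> None:
        self.data = data
        # Assumption: when no ids are provided, positions are used as ids.
        self.doc_ids = doc_ids if doc_ids is not None else list(range(len(data)))
        self.shuffle = shuffle
        self.random = Random(seed)

    def gen_batches(self, batch_size: int) -> Iterator[Tuple[List[str], List[Any]]]:
        """Yield (documents, ids) batches, optionally in shuffled order."""
        order = list(range(len(self.data)))
        if self.shuffle:
            self.random.shuffle(order)
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield [self.data[i] for i in idx], [self.doc_ids[i] for i in idx]


docs = ["first document", "second document", "third document"]
iterator = SketchFittingIterator(docs, seed=42, shuffle=False)
batches = list(iterator.gen_batches(batch_size=2))
```

With shuffle disabled the batches follow the input order; with shuffle enabled, the same seed gives the same order on every run.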

class DataLearningIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Dataset iterator for learning models, e.g. neural networks.

  • data – list of (x, y) pairs for every data type in 'train', 'valid' and 'test'

  • seed – random seed for data shuffling

  • shuffle – whether to shuffle data during batching


Attributes:

  • shuffle – whether to shuffle data during batching

  • random – instance of Random initialized with a seed
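
A learning iterator of this shape might look like the sketch below. It assumes a hypothetical class SketchLearningIterator with a gen_batches method; neither name comes from the text above, and the real class may batch differently.

```python
from random import Random
from typing import Any, Dict, Iterator, List, Optional, Tuple

Pair = Tuple[Any, Any]


class SketchLearningIterator:
    """Hypothetical stand-in for an iterator over (x, y) pairs per data type."""

    def __init__(self, data: Dict[str, List[Pair]],
                 seed: Optional[int] = None, shuffle: bool = True) -> None:
        self.shuffle = shuffle
        self.random = Random(seed)
        # Assumption: a data type missing from the input is treated as empty.
        self.data = {split: list(data.get(split, []))
                     for split in ("train", "valid", "test")}

    def gen_batches(self, batch_size: int,
                    data_type: str = "train") -> Iterator[Tuple[tuple, tuple]]:
        """Yield (xs, ys) tuples of at most batch_size elements each."""
        pairs = list(self.data[data_type])
        if self.shuffle:
            self.random.shuffle(pairs)
        for start in range(0, len(pairs), batch_size):
            chunk = pairs[start:start + batch_size]
            xs, ys = zip(*chunk)  # split the pairs into inputs and targets
            yield xs, ys


dataset = {"train": [("a", 0), ("b", 1), ("c", 0)], "valid": [("d", 1)]}
learning_iterator = SketchLearningIterator(dataset, seed=0, shuffle=False)
train_batches = list(learning_iterator.gen_batches(batch_size=2))
```

Keeping the Random instance on the object, rather than using the module-level generator, makes the shuffling reproducible per iterator.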

class Sqlite3Database(save_path: str, primary_keys: List[str], keys: List[str] = None, table_name: str = 'mytable', unknown_value: str = 'UNK', *args, **kwargs)

Loads and trains an SQLite table of arbitrary items, stored under the name table_name at the path save_path.

Primary (unique) keys must be specified; all other keys are inferred from the data. A batch here is a list of dictionaries, where each dictionary corresponds to one item. If an item doesn’t contain values for all keys, the missing values are stored as unknown_value.

  • save_path – sqlite database path.

  • primary_keys – list of table primary keys’ names.

  • keys – all table keys’ names.

  • table_name – name of the sqlite table.

  • unknown_value – value assigned to missing item values.

  • **kwargs – parameters passed to parent Estimator class.
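
The storage rule for missing values can be illustrated with Python's standard sqlite3 module. The helper store_items below is hypothetical and only sketches the idea under simple assumptions: all columns are stored as text, and table and column names come from trusted code because they are interpolated into the SQL.

```python
import sqlite3
from typing import Any, Dict, List, Optional


def store_items(conn: sqlite3.Connection, table_name: str,
                primary_keys: List[str], items: List[Dict[str, Any]],
                keys: Optional[List[str]] = None,
                unknown_value: str = "UNK") -> List[str]:
    """Insert a batch of item dictionaries, filling gaps with unknown_value."""
    if keys is None:
        # Infer the keys from the data, in order of first appearance.
        keys = []
        for item in items:
            for key in item:
                if key not in keys:
                    keys.append(key)
    columns = list(primary_keys) + [k for k in keys if k not in primary_keys]
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    pk = ", ".join(f'"{c}"' for c in primary_keys)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs}, PRIMARY KEY ({pk}))'
    )
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(str(item.get(c, unknown_value)) for c in columns)
            for item in items]
    conn.executemany(
        f'INSERT OR REPLACE INTO "{table_name}" VALUES ({placeholders})', rows
    )
    conn.commit()
    return columns


conn = sqlite3.connect(":memory:")
batch = [{"id": "1", "name": "Ann", "city": "Oslo"}, {"id": "2", "name": "Bob"}]
store_items(conn, "mytable", primary_keys=["id"], items=batch)
stored = conn.execute('SELECT id, name, city FROM mytable ORDER BY id').fetchall()
```

The second item has no city, so its row holds the unknown_value placeholder in that column.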

class SimpleVocabulary(special_tokens: Tuple[str, ...] = (), max_tokens: int = 1073741824, min_freq: int = 0, pad_with_zeros: bool = False, unk_token: Optional[str] = None, freq_drop_load: Optional[bool] = None, *args, **kwargs)

Implements a simple vocabulary.

  • special_tokens – tuple of tokens that shouldn’t be counted.

  • max_tokens – upper bound for number of tokens in the vocabulary.

  • min_freq – minimal count of a token (except special tokens).

  • pad_with_zeros – if True, each batch of elements is padded with zeros up to the length of the longest element in the batch.

  • unk_token – label assigned to unknown tokens.

  • freq_drop_load – if True, the frequencies of tokens are set to min_freq when the model is loaded.
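
How these parameters might interact is sketched below. The functions build_vocab and encode are hypothetical helpers written for this example, not the class's methods. The sketch assumes that special tokens take the first indices, that remaining tokens are ordered by frequency, and that unk_token is itself one of the special tokens.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_vocab(token_batches: Iterable[List[str]],
                special_tokens: Tuple[str, ...] = (),
                max_tokens: int = 2 ** 30,
                min_freq: int = 0) -> Dict[str, int]:
    """Map tokens to indices; special tokens are not counted and come first."""
    counts = Counter(tok for tokens in token_batches for tok in tokens
                     if tok not in special_tokens)
    vocab = list(special_tokens)
    for token, freq in counts.most_common():
        if len(vocab) >= max_tokens:
            break
        if freq >= min_freq:
            vocab.append(token)
    return {tok: i for i, tok in enumerate(vocab)}


def encode(batch: List[List[str]], token_to_idx: Dict[str, int],
           unk_token: str, pad_with_zeros: bool = False) -> List[List[int]]:
    """Convert tokens to indices; unknown tokens get the unk_token index."""
    unk_idx = token_to_idx[unk_token]
    encoded = [[token_to_idx.get(tok, unk_idx) for tok in tokens]
               for tokens in batch]
    if pad_with_zeros:
        longest = max((len(ids) for ids in encoded), default=0)
        encoded = [ids + [0] * (longest - len(ids)) for ids in encoded]
    return encoded


vocab = build_vocab([["a", "b", "a"], ["a", "c"]],
                    special_tokens=("<PAD>", "<UNK>"), min_freq=2)
indices = encode([["a", "b", "a"], ["c"]], vocab,
                 unk_token="<UNK>", pad_with_zeros=True)
```

Because padding uses zero, placing a padding token first among the special tokens keeps index 0 from colliding with a real token.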