deeppavlov.core.data

DatasetReader, Vocab, DataLearningIterator and DataFittingIterator classes.

class deeppavlov.core.data.dataset_reader.DatasetReader: An abstract class for reading data from some location and constructing a dataset.

class deeppavlov.core.data.data_fitting_iterator.DataFittingIterator(data: List[str], doc_ids: List[Any] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)

Dataset iterator for fitting estimator models such as vocabularies, kNN models and vectorizers. Data is passed as a list of strings (documents). Generates batches (for large datasets).

Parameters:
    data – list of documents
    doc_ids – provided document ids
    seed – random seed for data shuffling
    shuffle – whether to shuffle data during batching

shuffle: whether to shuffle data during batching

random: instance of Random initialized with a seed

data: list of documents

doc_ids: document ids, either provided by the user or generated automatically
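The behaviour described above (seeded shuffling, automatically generated ids, batching over a list of documents) can be illustrated with a small stand-in class. This is a minimal sketch and not the DeepPavlov implementation: the class name `FittingIteratorSketch`, the method name `gen_batches`, and the choice of list positions as automatic ids are all assumptions made for illustration.

```python
import random
from typing import Any, Iterator, List, Optional, Tuple


class FittingIteratorSketch:
    """Hypothetical stand-in for a fitting iterator; not the library class."""

    def __init__(self, data: List[str], doc_ids: Optional[List[Any]] = None,
                 seed: Optional[int] = None, shuffle: bool = True) -> None:
        self.shuffle = shuffle
        self.random = random.Random(seed)  # Random instance initialized with the seed
        self.data = data
        # Assumption: when no ids are provided, use positions 0..len(data)-1.
        self.doc_ids = doc_ids if doc_ids is not None else list(range(len(data)))

    def gen_batches(self, batch_size: int) -> Iterator[Tuple[List[str], List[Any]]]:
        """Yield (documents, ids) batches, shuffling the order first if requested."""
        order = list(range(len(self.data)))
        if self.shuffle:
            self.random.shuffle(order)
        for start in range(0, len(order), batch_size):
            chunk = order[start:start + batch_size]
            yield [self.data[i] for i in chunk], [self.doc_ids[i] for i in chunk]


docs = ["first document", "second document", "third document"]
iterator = FittingIteratorSketch(docs, seed=42, shuffle=False)
batches = list(iterator.gen_batches(batch_size=2))
```

With `shuffle=False` the batches follow the input order; with `shuffle=True`, two iterators built with the same seed produce the same order.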

class deeppavlov.core.data.data_learning_iterator.DataLearningIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)

Dataset iterator for learning models, e.g. neural networks.

Parameters:
    data – list of (x, y) pairs for every data type in `'train'`, `'valid'` and `'test'`
    seed – random seed for data shuffling
    shuffle – whether to shuffle data during batching

shuffle: whether to shuffle data during batching

random: instance of Random initialized with a seed
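To make the expected shape of `data` concrete, the following sketch builds a dictionary with `'train'`, `'valid'` and `'test'` lists of (x, y) pairs and splits one of them into batches. It is an illustration only: the helper `iter_batches` and its `data_type` and `batch_size` parameters are hypothetical names, not the DeepPavlov API.

```python
import random
from typing import Any, Dict, Iterator, List, Optional, Tuple

Pair = Tuple[Any, Any]


def iter_batches(data: Dict[str, List[Pair]], data_type: str, batch_size: int,
                 seed: Optional[int] = None, shuffle: bool = True
                 ) -> Iterator[Tuple[Tuple[Any, ...], Tuple[Any, ...]]]:
    """Yield (xs, ys) batches from one data type of a train/valid/test dict."""
    rng = random.Random(seed)
    pairs = list(data.get(data_type, []))  # copy so the caller's list is not reordered
    if shuffle:
        rng.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        # Separate a slice of (x, y) pairs into parallel tuples of inputs and targets.
        xs, ys = zip(*pairs[start:start + batch_size])
        yield xs, ys


data = {
    "train": [("good film", "pos"), ("bad film", "neg"), ("fine", "pos")],
    "valid": [("awful", "neg")],
    "test": [],
}
train_batches = list(iter_batches(data, "train", batch_size=2, shuffle=False))
```

An empty data type, such as `'test'` here, simply yields no batches.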

class deeppavlov.core.data.sqlite_database.Sqlite3Database(save_path: str, table_name: str, primary_keys: List[str], keys: List[str] = None, unknown_value: str = 'UNK', *args, **kwargs)

Loads and fits an SQLite table of arbitrary items (table name table_name, database path save_path).

Primary (unique) keys must be specified; all other keys are inferred from the data. A batch here is a list of dictionaries, where each dictionary corresponds to one item. If an item doesn’t contain values for all keys, the missing values are stored as unknown_value.

Parameters:
    save_path – sqlite database path.
    table_name – name of the sqlite table.
    primary_keys – list of table primary keys’ names.
    keys – all table keys’ names.
    unknown_value – value assigned to missing item values.
    **kwargs – parameters passed to parent `Estimator` class.
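The storage scheme described above (explicit primary keys, remaining keys inferred from the batch, missing values filled with `unknown_value`) can be sketched with Python's standard `sqlite3` module. This is not the `Sqlite3Database` implementation: the helpers `infer_keys` and `store_items`, the `TEXT` column type, and the example table are assumptions chosen for illustration, and the identifier quoting is kept deliberately simple.

```python
import sqlite3
from typing import Dict, List


def infer_keys(primary_keys: List[str], items: List[Dict[str, str]]) -> List[str]:
    """Primary keys first, then every other key in order of first appearance."""
    keys = list(primary_keys)
    for item in items:
        for key in item:
            if key not in keys:
                keys.append(key)
    return keys


def store_items(conn: sqlite3.Connection, table_name: str, primary_keys: List[str],
                items: List[Dict[str, str]], unknown_value: str = "UNK") -> List[str]:
    """Create the table if needed and insert a batch, filling missing values."""
    keys = infer_keys(primary_keys, items)
    columns = ", ".join(f'"{k}" TEXT' for k in keys)
    pk = ", ".join(f'"{k}"' for k in primary_keys)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns}, PRIMARY KEY ({pk}))'
    )
    placeholders = ", ".join("?" for _ in keys)
    # Any key absent from an item is stored as unknown_value.
    rows = [tuple(str(item.get(k, unknown_value)) for k in keys) for item in items]
    conn.executemany(
        f'INSERT OR REPLACE INTO "{table_name}" VALUES ({placeholders})', rows
    )
    conn.commit()
    return keys


conn = sqlite3.connect(":memory:")  # in-memory database, so nothing is written to disk
batch = [
    {"name": "pizzeria", "city": "Rome", "price": "cheap"},
    {"name": "bistro", "city": "Paris"},  # no "price" value
]
keys = store_items(conn, "places", ["name"], batch)
rows = conn.execute("SELECT name, city, price FROM places ORDER BY name").fetchall()
```

Here the second item has no `price`, so its row is stored with `'UNK'` in that column.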

class deeppavlov.core.data.vocab.DefaultVocabulary(save_path: str, load_path: str, level: str = 'token', special_tokens: List[str] = [], default_token: str = None, tokenizer: Callable = None, min_freq: int = 0, **kwargs)

Implements a vocabulary of tokens, chars or other structures.

Parameters:
    level – level of operation: tokens (`'token'`) or chars (`'char'`).
    special_tokens – tuple of tokens that shouldn’t be counted.
    default_token – label assigned to unknown tokens.
    tokenizer – callable used to get tokens out of string.
    min_freq – minimal count of a token (except special tokens).
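How these parameters interact can be sketched with a small frequency-based vocabulary builder. It is a hedged illustration, not the `DefaultVocabulary` code: the functions `build_vocab` and `lookup`, the whitespace default tokenizer, and the index assignment order (special tokens first, then by descending frequency) are assumptions, and the sketch expects `default_token` to be one of the special tokens.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Sequence


def build_vocab(texts: Iterable[str], level: str = "token",
                special_tokens: Sequence[str] = (),
                tokenizer: Optional[Callable[[str], Sequence[str]]] = None,
                min_freq: int = 0) -> Dict[str, int]:
    """Map tokens (or chars) to indices, keeping only units seen >= min_freq times."""
    tokenize = tokenizer or str.split  # assumption: whitespace split by default
    counts: Counter = Counter()
    for text in texts:
        units = tokenize(text) if level == "token" else list(text)
        counts.update(units)
    vocab: Dict[str, int] = {}
    # Special tokens always get an index, regardless of how often they occur.
    for tok in special_tokens:
        vocab.setdefault(tok, len(vocab))
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def lookup(vocab: Dict[str, int], token: str, default_token: Optional[str] = None) -> int:
    """Return the index of token, falling back to default_token for unknown tokens."""
    if token in vocab:
        return vocab[token]
    if default_token is not None:
        return vocab[default_token]
    raise KeyError(token)


texts = ["the cat sat", "the dog sat", "the bird"]
vocab = build_vocab(texts, special_tokens=("<UNK>",), min_freq=2)
unknown_index = lookup(vocab, "cat", default_token="<UNK>")
```

In this example only `the` and `sat` reach `min_freq=2`, so `cat` falls back to the index of `<UNK>`.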

class deeppavlov.core.data.simple_vocab.SimpleVocabulary(special_tokens: Tuple[str, ...] = (), max_tokens: int = 1073741824, min_freq: int = 0, pad_with_zeros: bool = False, unk_token: Optional[str] = None, freq_drop_load: Optional[bool] = None, *args, **kwargs): Implements a simple vocabulary.