deeppavlov.models.classifiers

class deeppavlov.models.classifiers.torch_classification_model.TorchTextClassificationModel(n_classes: int, kernel_sizes_cnn: List[int], filters_cnn: int, dense_size: int, dropout_rate: float = 0.0, embedding_size: Optional[int] = None, multilabel: bool = False, criterion: str = 'CrossEntropyLoss', embedded_tokens: bool = True, vocab_size: Optional[int] = None, return_probas: bool = True, **kwargs)

Implements a torch model for text classification. The input can be either embedded tokenized texts or indices of words in the vocabulary. The number of tokens is not fixed, but the samples in a batch should be padded to the same length (e.g. that of the longest sample).
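The padding requirement can be illustrated with a minimal sketch. It assumes zero vectors are used as padding and uses plain Python lists in place of ndarray inputs; the helper name pad_batch is hypothetical and is not part of DeepPavlov.

```python
from typing import List

Sample = List[List[float]]  # one text: a list of token embeddings


def pad_batch(batch: List[Sample], embedding_size: int) -> List[Sample]:
    """Pad every sample with zero vectors up to the longest sample in the batch.

    Assumption: zero vectors are an acceptable padding value.
    The batch must contain at least one sample.
    """
    max_len = max(len(sample) for sample in batch)
    return [
        sample + [[0.0] * embedding_size for _ in range(max_len - len(sample))]
        for sample in batch
    ]


batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 tokens
    [[0.7, 0.8]],                          # 1 token
]
padded = pad_batch(batch, embedding_size=2)
print([len(sample) for sample in padded])
```

After padding, every sample in the batch has the same number of token vectors, so the batch can be stacked into a single tensor.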

Parameters

n_classes – number of classes
kernel_sizes_cnn – list of kernel sizes of convolutions
filters_cnn – number of filters for convolutions
dense_size – number of units for dense layer
dropout_rate – dropout rate, applied after convolutions and between dense layers
embedding_size – size of vector representation of words
multilabel – whether the classification is multi-label (if so, sigmoid activation is used; otherwise, softmax)
criterion – criterion name from torch.nn
embedded_tokens – True if the input contains embedded tokenized texts; False if the input contains indices of words in the vocabulary
vocab_size – vocabulary size, used when embedded_tokens=False and the embedding is a layer in the network
return_probas – whether to return probabilities or index of classes (only for multilabel=False)
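The difference between the two output activations selected by multilabel can be sketched as follows. This is a plain-Python illustration of softmax and sigmoid, not the model's own code; the function names are chosen for the example only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Single-label case: scores compete with each other and sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """Multi-label case: every class gets an independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.0, 0.1]
single_label = softmax(logits)
multi_label = sigmoid(logits)
print(sum(single_label), sum(multi_label))
```

With softmax the probabilities always sum to 1, so exactly one class is favoured; with sigmoid several classes can have a high probability at once, which is what multi-label classification needs.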

model: torch model itself

epochs_done: number of epochs that were done

criterion: torch criterion instance

__call__(texts: List[ndarray], *args) → Union[List[List[float]], List[int]]

Infer on the given data.

Parameters

texts – list of tokenized text samples
labels – labels
*args – additional arguments

Returns

for each sentence, either a vector of probabilities of belonging to each class or the index of the class the sentence belongs to

train_on_batch(texts: List[List[ndarray]], labels: list) → Union[float, List[float]]

Train the model on the given batch.

Parameters

texts – vectorized texts
labels – list of labels

Returns

metric values on the given batch

class deeppavlov.models.classifiers.cos_sim_classifier.CosineSimilarityClassifier(top_n: int = 1, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)

Classifier based on cosine similarity between vectorized sentences.

Parameters

save_path – path to save the model
load_path – path to load the model

__call__(q_vects: Union[csr_matrix, List]) → Tuple[List[str], List[int]]

Find the most similar answers for the input vectorized questions

Parameters: q_vects – vectorized questions
Returns: tuple of answers and scores
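The idea behind the lookup can be sketched in plain Python. This is an illustration of cosine-similarity ranking under stated assumptions, not the class's actual implementation: it uses dense lists instead of csr_matrix, returns float scores, and the names cosine and most_similar are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(
    q_vect: Sequence[float],
    x_train_vects: Sequence[Sequence[float]],
    y_train: Sequence[str],
    top_n: int = 1,
) -> Tuple[List[str], List[float]]:
    """Return the top_n answers whose training questions are closest to q_vect."""
    scores = [cosine(q_vect, x) for x in x_train_vects]
    # Indices of training questions, best match first
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [y_train[i] for i in order], [scores[i] for i in order]


x_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_train = ["answer A", "answer B", "answer C"]
answers, scores = most_similar([0.9, 0.1], x_train, y_train, top_n=2)
print(answers, scores)
```

Here the training vectors play the role of x_train_vects passed to fit, and the answers play the role of y_train; the query is compared against every stored question and the best top_n matches are returned.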

fit(x_train_vects: Tuple[Union[csr_matrix, List]], y_train: Tuple[str]) → None

Train classifier

Parameters

x_train_vects – vectorized questions from the training dataset
y_train – answers from the training dataset

Returns

None

load() → None: Load classifier parameters

save() → None: Save classifier parameters

class deeppavlov.models.classifiers.proba2labels.Proba2Labels(max_proba: Optional[bool] = None, confidence_threshold: Optional[float] = None, top_n: Optional[int] = None, is_binary: bool = False, **kwargs)

Converts probabilities to labels in one of the following ways: choosing the one index or the top_n indices with maximal probability, or choosing every index whose probability is higher than the given confidence threshold.

Parameters

max_proba – whether to choose label with maximal probability
confidence_threshold – boundary probability value for a sample to belong to the class (best used for multi-label classification)
top_n – how many top labels with the highest probabilities to return

max_proba: whether to choose label with maximal probability

confidence_threshold: boundary probability value for a sample to belong to the class (best used for multi-label classification)

top_n: how many top labels with the highest probabilities to return

__call__(*args, **kwargs)

Process probabilities to labels

Parameters: *args – every argument is a list of vectors with probability distributions

Returns: list of labels (single-label classification) or list of lists of labels (multi-label classification), or a list of such lists, one for every argument (in a multitask setting)
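The three selection modes described for this class can be sketched in plain Python. This is an illustration, not the library's code: the function name proba_to_labels is hypothetical, it handles a single list of probability vectors, and the order in which the options are checked is an assumption.

```python
from typing import List, Optional, Sequence, Union

Labels = Union[int, List[int]]


def proba_to_labels(
    probas: Sequence[Sequence[float]],
    max_proba: Optional[bool] = None,
    confidence_threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[Labels]:
    """Convert each probability vector to a label index or a list of indices."""
    labels: List[Labels] = []
    for p in probas:
        if confidence_threshold is not None:
            # Multi-label: keep every class above the threshold
            labels.append([i for i, v in enumerate(p) if v > confidence_threshold])
        elif max_proba:
            # Single-label: index of the most probable class
            labels.append(max(range(len(p)), key=lambda i: p[i]))
        elif top_n:
            # top_n most probable classes, best first
            order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
            labels.append(order[:top_n])
        else:
            raise ValueError("set max_proba, confidence_threshold or top_n")
    return labels


probas = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.55]]
by_max = proba_to_labels(probas, max_proba=True)
by_threshold = proba_to_labels(probas, confidence_threshold=0.5)
by_top_n = proba_to_labels(probas, top_n=2)
print(by_max, by_threshold, by_top_n)
```

The threshold mode can return a different number of labels for each sample, including none, which is why it suits multi-label classification; the other two modes always return a fixed number of labels per sample.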