deeppavlov.models.ranking

Ranking classes.

class deeppavlov.models.ranking.ranking_model.RankingModel(vocab_name, hard_triplets_sampling: bool = False, hardest_positives: bool = False, semi_hard_negatives: bool = False, num_hardest_negatives: int = None, update_embeddings: bool = False, interact_pred_num: int = 3, **kwargs)[source]

Class to perform ranking.

Parameters:
  • vocab_name – A keyword that indicates which subclass of deeppavlov.models.ranking.ranking_dict.RankingDict to use.
  • hard_triplets_sampling – Whether to use hard triplets sampling to train the model, i.e., to choose negative samples close to positive ones.
  • hardest_positives – Whether to use only the single hardest positive sample for each anchor sample.
  • semi_hard_negatives – Whether hard negative samples should be further away from anchor samples than positive samples.
  • update_embeddings – Whether to store and update context and response embeddings or not.
  • interact_pred_num – The number of the most relevant contexts and responses that the model returns in the interact regime.
  • **kwargs – Other parameters.
load()[source]

Load the model from the last checkpoint.

save()[source]

Save the model.

train_on_batch(x: List[List[Tuple[int, int]]], y: List[int])[source]

Train the model on a batch.

__call__(batch: Union[List[List[Tuple[int, int]]], List[str]]) → Union[numpy.ndarray, Dict[str, List[str]]][source]

Make a prediction on a batch.
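
A minimal usage sketch in the interact regime, assuming the model has been built from one of the bundled ranking configs (the config name below is an assumption for illustration, not part of this API):

    from deeppavlov import build_model, configs

    # Build the ranking pipeline from a bundled config (hypothetical config name).
    ranker = build_model(configs.ranking.ranking_insurance, download=True)

    # With raw strings as input, the model returns for each query the
    # interact_pred_num most relevant contexts and responses.
    predictions = ranker(['How can I change my coverage plan?'])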

class deeppavlov.models.ranking.ranking_network.RankingNetwork(toks_num: int, chars_num: int, emb_dict: deeppavlov.models.ranking.emb_dict.EmbDict, max_sequence_length: int, max_token_length: int = None, learning_rate: float = 0.001, device_num: int = 0, seed: int = None, shared_weights: bool = True, triplet_mode: bool = True, margin: float = 0.1, distance: str = 'cos_similarity', token_embeddings: bool = True, use_matrix: bool = False, tok_dynamic_batch: bool = False, embedding_dim: int = 300, char_embeddings: bool = False, char_dynamic_batch: bool = False, char_emb_dim: int = 32, highway_on_top: bool = False, reccurent: str = 'bilstm', hidden_dim: int = 300, max_pooling: bool = True)[source]

Class to perform context-response matching with neural networks.

Parameters:
  • toks_num – The size of the tok2int vocabulary used to build the token embedding layer.
  • chars_num – The size of the char2int vocabulary used to build the character-level embedding layer.
  • learning_rate – Learning rate.
  • device_num – The number of the device to perform model training on, if several devices are available in the system.
  • seed – Random seed.
  • shared_weights – Whether to use shared weights in the model to encode contexts and responses.
  • triplet_mode – Whether to use a model with triplet loss. If False, a model with cross-entropy loss will be used.
  • margin – A margin parameter for triplet loss. Only required if triplet_mode is set to True.
  • distance – Distance metric (similarity measure) used to compare context and response representations in the model. Possible values are cos_similarity (cosine similarity), euqlidian (Euclidean distance) and sigmoid (1 minus sigmoid).
  • token_embeddings – Whether to use token (word) embeddings in the model.
  • use_matrix – Whether to use trainable matrix with token (word) embeddings.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequence lengths from this batch, but not higher than max_sequence_length.
  • embedding_dim – Dimensionality of token (word) embeddings.
  • char_embeddings – Whether to use character-level token (word) embeddings in the model.
  • max_token_length – A maximum length of a token for representing it by a character-level embedding.
  • char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum length of a token for a batch will be equal to the maximum of all token lengths from this batch, but not higher than max_token_length.
  • char_emb_dim – Dimensionality of character-level embeddings.
  • reccurent – The type of the RNN cell. Possible values are lstm and bilstm.
  • hidden_dim – Dimensionality of the hidden state of the RNN cell. If reccurent equals bilstm, hidden_dim should be doubled to get the actual dimensionality.
  • max_pooling – Whether to use max-pooling operation to get context (response) vector representation. If False, the last hidden state of the RNN will be used.
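
For intuition, a minimal NumPy sketch of the margin-based triplet objective that triplet_mode and margin refer to (the standard formulation, not the library's internal code):

    import numpy as np

    def cos_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two encoded vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def triplet_loss(anchor, positive, negative, margin=0.1):
        # The positive must score at least `margin` higher than the negative
        # in similarity to the anchor; otherwise a loss is incurred.
        return max(0.0, margin
                   - cos_similarity(anchor, positive)
                   + cos_similarity(anchor, negative))

In these terms, a semi-hard negative is one that already scores below the positive but still falls inside the margin.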
class deeppavlov.models.ranking.ranking_dict.RankingDict(save_path: str, load_path: str, max_sequence_length: int, max_token_length: int, padding: str = 'post', truncating: str = 'post', token_embeddings: bool = True, char_embeddings: bool = False, char_pad: str = 'post', char_trunc: str = 'post', tok_dynamic_batch: bool = False, char_dynamic_batch: bool = False, update_embeddings: bool = False)[source]

Class to encode characters, tokens, and whole contexts and responses with vocabularies, and to pad and truncate them.

Parameters:
  • save_path – A path including filename to store the instance of deeppavlov.models.ranking.ranking_dict.RankingDict.
  • load_path – A path including filename to load the instance of deeppavlov.models.ranking.ranking_dict.RankingDict.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequence lengths from this batch, but not higher than max_sequence_length.
  • padding – Padding. Possible values are pre and post. If set to pre, a sequence will be padded at the beginning; if set to post, it will be padded at the end.
  • truncating – Truncating. Possible values are pre and post. If set to pre, a sequence will be truncated at the beginning; if set to post, it will be truncated at the end.
  • max_token_length – A maximum length of a token for representing it by a character-level embedding.
  • char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum length of a token for a batch will be equal to the maximum of all token lengths from this batch, but not higher than max_token_length.
  • char_pad – Character-level padding. Possible values are pre and post. If set to pre, a token will be padded at the beginning; if set to post, it will be padded at the end.
  • char_trunc – Character-level truncating. Possible values are pre and post. If set to pre, a token will be truncated at the beginning; if set to post, it will be truncated at the end.
  • update_embeddings – Whether to store and update context and response embeddings or not.
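
To illustrate the pre/post padding and truncating semantics listed above, a small standalone sketch (not the class's actual implementation):

    def pad_truncate(seq, max_len, padding='post', truncating='post', pad_value=0):
        # Truncate `seq` down to `max_len`, then pad it back up to `max_len`.
        if len(seq) > max_len:
            seq = seq[:max_len] if truncating == 'post' else seq[-max_len:]
        pad = [pad_value] * (max_len - len(seq))
        return seq + pad if padding == 'post' else pad + seq

    assert pad_truncate([1, 2, 3], 5) == [1, 2, 3, 0, 0]
    assert pad_truncate([1, 2, 3], 5, padding='pre') == [0, 0, 1, 2, 3]
    assert pad_truncate([1, 2, 3, 4, 5, 6], 4, truncating='pre') == [3, 4, 5, 6]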
class deeppavlov.models.ranking.emb_dict.EmbDict(save_path: str, load_path: str, embeddings_path: str, max_sequence_length: int, embedding_dim: int = 300, embeddings: str = 'word2vec', seed: int = None, use_matrix: bool = False)[source]

The class that provides token (word) embeddings.

Parameters:
  • save_path – A path including filename to store the instance of deeppavlov.models.ranking.emb_dict.EmbDict.
  • load_path – A path including filename to load the instance of deeppavlov.models.ranking.emb_dict.EmbDict.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • seed – Random seed.
  • embeddings – A type of embeddings. Possible values are fasttext, word2vec and random.
  • embeddings_path – A path to an embeddings model including filename. The type of the model should coincide with the type of embeddings defined by the embeddings parameter.
  • embedding_dim – Dimensionality of token (word) embeddings.
  • use_matrix – Whether to use trainable matrix with token (word) embeddings.
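
As an illustration of what such a class provides, a hedged sketch of building an embedding matrix from a word2vec model with gensim (the path, vocabulary, and out-of-vocabulary fallback are assumptions):

    import numpy as np
    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format('embeddings.bin', binary=True)  # hypothetical path
    tok2int = {'<pad>': 0, 'hello': 1, 'world': 2}  # hypothetical vocabulary
    emb_dim = 300

    matrix = np.zeros((len(tok2int), emb_dim))
    for token, idx in tok2int.items():
        if token in w2v:
            matrix[idx] = w2v[token]
        else:
            # Out-of-vocabulary tokens get random vectors; the `random`
            # embeddings type would initialize the whole matrix this way.
            matrix[idx] = np.random.uniform(-0.1, 0.1, emb_dim)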
class deeppavlov.models.ranking.tfidf_ranker.TfidfRanker(vectorizer: deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer, top_n=5, active: bool = True, **kwargs)[source]

Rank documents according to input strings.

Parameters:
  • vectorizer – An instance of a vectorizer class.
  • top_n – The number of doc ids to return.
  • active – Whether to return the number of ids specified by top_n (True) or all ids (False).
top_n

The number of doc ids to return.

vectorizer

An instance of the vectorizer class.

active

Whether to return the number of ids specified by top_n (True) or all ids (False).

index2doc

The inverted doc_index: a mapping from document indices back to document ids.

iterator

A dataset iterator used for generating batches while fitting the vectorizer.

__call__(questions: List[str]) → Tuple[List[Any], List[float]][source]

Rank documents and return top n document titles with scores.

Parameters: questions – A list of queries used in ranking.
Returns: A tuple of selected doc ids and their scores.
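
A usage sketch based on the __call__ signature above (building and fitting the vectorizer are omitted; the variable names are illustrative):

    from deeppavlov.models.ranking.tfidf_ranker import TfidfRanker

    # `vectorizer` is assumed to be a fitted HashingTfIdfVectorizer instance.
    ranker = TfidfRanker(vectorizer=vectorizer, top_n=5, active=True)
    doc_ids, scores = ranker(['Who invented the telephone?'])
    # doc_ids holds the ids of the top_n highest-scoring documents for each
    # query, and scores holds the corresponding TF-IDF similarity scores.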
class deeppavlov.models.ranking.logit_ranker.LogitRanker(squad_model, **kwargs)[source]

Select the best answer using squad model logits. For a single outer batch, form several inner batches, send each to the squad model separately, and get a single best answer for each.

Parameters: squad_model – A loaded squad model.
squad_model

A loaded squad model.

__call__(contexts_batch: List[List[str]], questions_batch: List[List[str]]) → List[str][source]

Sort the results obtained from the squad reader by logits and return the answer with the maximum logit.

Parameters:
  • contexts_batch – A batch of contexts that should be treated as a single batch in the outer JSON config.
  • questions_batch – A batch of questions that should be treated as a single batch in the outer JSON config.
Returns:

A batch of best answers.
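
For intuition, a minimal sketch of the selection logic described above, with a hypothetical squad_model callable that maps parallel lists of contexts and questions to (answers, logits):

    from typing import Callable, List

    def best_answers(contexts_batch: List[List[str]],
                     questions_batch: List[List[str]],
                     squad_model: Callable) -> List[str]:
        # For each group of (context, question) pairs, keep the answer
        # whose logit is maximal.
        results = []
        for contexts, questions in zip(contexts_batch, questions_batch):
            answers, logits = squad_model(contexts, questions)
            results.append(max(zip(answers, logits), key=lambda p: p[1])[0])
        return results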