deeppavlov.models.ranking

Ranking classes.

class deeppavlov.models.ranking.ranking_model.RankingModel(vocab_name, hard_triplets_sampling: bool = False, hardest_positives: bool = False, semi_hard_negatives: bool = False, num_hardest_negatives: int = None, update_embeddings: bool = False, interact_pred_num: int = 3, **kwargs)[source]

Class to perform ranking.

Parameters:
  • vocab_name – A keyword that indicates which subclass of deeppavlov.models.ranking.ranking_dict.RankingDict to use.
  • hard_triplets_sampling – Whether to use hard triplets sampling to train the model, i.e., to choose negative samples close to positive ones.
  • hardest_positives – Whether to use only the single hardest positive sample for each anchor sample.
  • semi_hard_negatives – Whether hard negative samples should be further away from anchor samples than positive samples.
  • update_embeddings – Whether to store and update context and response embeddings or not.
  • interact_pred_num – The number of the most relevant contexts and responses that the model returns in the interact regime.
  • **kwargs – Other parameters.
load()[source]

Load the model from the last checkpoint.

save()[source]

Save the model.

train_on_batch(x: List[List[Tuple[int, int]]], y: List[int])[source]

Train the model on a batch.

__call__(batch: Union[List[List[Tuple[int, int]]], List[str]]) → Union[numpy.ndarray, Dict[str, List[str]]][source]

Make a prediction on a batch.
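
A minimal usage sketch in the interact regime, assuming the model has been built from one of the bundled ranking configs (the config name below is an assumption for illustration, not part of this API):

    from deeppavlov import build_model, configs

    # Build the ranking pipeline from a bundled config (hypothetical config name).
    ranker = build_model(configs.ranking.ranking_insurance, download=True)

    # With raw strings as input, the model returns for each query the
    # interact_pred_num most relevant contexts and responses.
    predictions = ranker(['How can I change my coverage plan?'])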

class deeppavlov.models.ranking.ranking_network.RankingNetwork(toks_num: int, chars_num: int, emb_dict: deeppavlov.models.ranking.emb_dict.EmbDict, max_sequence_length: int, max_token_length: int = None, learning_rate: float = 0.001, device_num: int = 0, seed: int = None, shared_weights: bool = True, triplet_mode: bool = True, margin: float = 0.1, distance: str = 'cos_similarity', token_embeddings: bool = True, use_matrix: bool = False, tok_dynamic_batch: bool = False, embedding_dim: int = 300, char_embeddings: bool = False, char_dynamic_batch: bool = False, char_emb_dim: int = 32, highway_on_top: bool = False, reccurent: str = 'bilstm', hidden_dim: int = 300, max_pooling: bool = True)[source]

Class to perform context-response matching with neural networks.

Parameters:
  • toks_num – The size of the tok2int vocabulary used to build the token embedding layer.
  • chars_num – The size of the char2int vocabulary used to build the character-level embedding layer.
  • learning_rate – Learning rate.
  • device_num – The number of the device to perform model training on, if several devices are available in the system.
  • seed – Random seed.
  • shared_weights – Whether to use shared weights in the model to encode contexts and responses.
  • triplet_mode – Whether to use a model with triplet loss. If False, a model with cross-entropy loss will be used.
  • margin – A margin parameter for triplet loss. Only required if triplet_mode is set to True.
  • distance – Distance metric (similarity measure) used to compare context and response representations in the model. Possible values are cos_similarity (cosine similarity), euqlidian (Euclidean distance) and sigmoid (1 minus sigmoid).
  • token_embeddings – Whether to use token (word) embeddings in the model.
  • use_matrix – Whether to use trainable matrix with token (word) embeddings.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequence lengths from this batch, but not higher than max_sequence_length.
  • embedding_dim – Dimensionality of token (word) embeddings.
  • char_embeddings – Whether to use character-level token (word) embeddings in the model.
  • max_token_length – A maximum length of a token for representing it by a character-level embedding.
  • char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum length of a token for a batch will be equal to the maximum of all token lengths from this batch, but not higher than max_token_length.
  • char_emb_dim – Dimensionality of character-level embeddings.
  • reccurent – The type of the RNN cell. Possible values are lstm and bilstm.
  • hidden_dim – Dimensionality of the hidden state of the RNN cell. If reccurent equals bilstm, hidden_dim should be doubled to get the actual dimensionality.
  • max_pooling – Whether to use max-pooling operation to get context (response) vector representation. If False, the last hidden state of the RNN will be used.
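
For intuition, a minimal NumPy sketch of the margin-based triplet objective that triplet_mode and margin refer to (the standard formulation, not the library's internal code):

    import numpy as np

    def cos_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two encoded vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def triplet_loss(anchor, positive, negative, margin=0.1):
        # The positive must score at least `margin` higher than the negative
        # in similarity to the anchor; otherwise a loss is incurred.
        return max(0.0, margin
                   - cos_similarity(anchor, positive)
                   + cos_similarity(anchor, negative))

In these terms, a semi-hard negative is one that already scores below the positive but still falls inside the margin.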
class deeppavlov.models.ranking.ranking_dict.RankingDict(save_path: str, load_path: str, max_sequence_length: int, max_token_length: int, padding: str = 'post', truncating: str = 'post', token_embeddings: bool = True, char_embeddings: bool = False, char_pad: str = 'post', char_trunc: str = 'post', tok_dynamic_batch: bool = False, char_dynamic_batch: bool = False, update_embeddings: bool = False)[source]

Class to encode characters, tokens, and whole contexts and responses with vocabularies, and to pad and truncate them.

Parameters:
  • save_path – A path including filename to store the instance of deeppavlov.models.ranking.ranking_dict.RankingDict.
  • load_path – A path including filename to load the instance of deeppavlov.models.ranking.ranking_dict.RankingDict.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequence lengths from this batch, but not higher than max_sequence_length.
  • padding – Padding. Possible values are pre and post. If set to pre, a sequence will be padded at the beginning; if set to post, it will be padded at the end.
  • truncating – Truncating. Possible values are pre and post. If set to pre, a sequence will be truncated at the beginning; if set to post, it will be truncated at the end.
  • max_token_length – A maximum length of a token for representing it by a character-level embedding.
  • char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum length of a token for a batch will be equal to the maximum of all token lengths from this batch, but not higher than max_token_length.
  • char_pad – Character-level padding. Possible values are pre and post. If set to pre, a token will be padded at the beginning; if set to post, it will be padded at the end.
  • char_trunc – Character-level truncating. Possible values are pre and post. If set to pre, a token will be truncated at the beginning; if set to post, it will be truncated at the end.
  • update_embeddings – Whether to store and update context and response embeddings or not.
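
To illustrate the pre/post padding and truncating semantics listed above, a small standalone sketch (not the class's actual implementation):

    def pad_truncate(seq, max_len, padding='post', truncating='post', pad_value=0):
        # Truncate `seq` down to `max_len`, then pad it back up to `max_len`.
        if len(seq) > max_len:
            seq = seq[:max_len] if truncating == 'post' else seq[-max_len:]
        pad = [pad_value] * (max_len - len(seq))
        return seq + pad if padding == 'post' else pad + seq

    assert pad_truncate([1, 2, 3], 5) == [1, 2, 3, 0, 0]
    assert pad_truncate([1, 2, 3], 5, padding='pre') == [0, 0, 1, 2, 3]
    assert pad_truncate([1, 2, 3, 4, 5, 6], 4, truncating='pre') == [3, 4, 5, 6]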
class deeppavlov.models.ranking.emb_dict.EmbDict(save_path: str, load_path: str, embeddings_path: str, max_sequence_length: int, embedding_dim: int = 300, embeddings: str = 'word2vec', seed: int = None, use_matrix: bool = False)[source]

The class that provides token (word) embeddings.

Parameters:
  • save_path – A path including filename to store the instance of deeppavlov.models.ranking.emb_dict.EmbDict.
  • load_path – A path including filename to load the instance of deeppavlov.models.ranking.emb_dict.EmbDict.
  • max_sequence_length – A maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
  • seed – Random seed.
  • embeddings – A type of embeddings. Possible values are fasttext, word2vec and random.
  • embeddings_path – A path to an embeddings model including filename. The type of the model should coincide with the type of embeddings defined by the embeddings parameter.
  • embedding_dim – Dimensionality of token (word) embeddings.
  • use_matrix – Whether to use trainable matrix with token (word) embeddings.
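
As an illustration of what such a class provides, a hedged sketch of building an embedding matrix from a word2vec model with gensim (the path, vocabulary, and out-of-vocabulary fallback are assumptions):

    import numpy as np
    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format('embeddings.bin', binary=True)  # hypothetical path
    tok2int = {'<pad>': 0, 'hello': 1, 'world': 2}  # hypothetical vocabulary
    emb_dim = 300

    matrix = np.zeros((len(tok2int), emb_dim))
    for token, idx in tok2int.items():
        if token in w2v:
            matrix[idx] = w2v[token]
        else:
            # Out-of-vocabulary tokens get random vectors; the `random`
            # embeddings type would initialize the whole matrix this way.
            matrix[idx] = np.random.uniform(-0.1, 0.1, emb_dim)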
class deeppavlov.models.ranking.tfidf_ranker.TfidfRanker(vectorizer: deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer, top_n=5, active: bool = True, **kwargs)[source]

Rank documents according to input strings.

Parameters:
  • vectorizer – An instance of a vectorizer class.
  • top_n – The number of doc ids to return.
  • active – Whether to return the number of ids specified by top_n (True) or all ids (False).
top_n

The number of doc ids to return.

vectorizer

An instance of the vectorizer class.

active

Whether to return the number of ids specified by top_n (True) or all ids (False).

index2doc

The inverted doc_index: a mapping from document indices back to document ids.

iterator

A dataset iterator used for generating batches while fitting the vectorizer.

__call__(questions: List[str]) → Tuple[List[Any], List[float]][source]

Rank documents and return top n document titles with scores.

Parameters: questions – A list of queries used in ranking.
Returns: A tuple of selected doc ids and their scores.
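
A usage sketch based on the __call__ signature above (building and fitting the vectorizer are omitted; the variable names are illustrative):

    from deeppavlov.models.ranking.tfidf_ranker import TfidfRanker

    # `vectorizer` is assumed to be a fitted HashingTfIdfVectorizer instance.
    ranker = TfidfRanker(vectorizer=vectorizer, top_n=5, active=True)
    doc_ids, scores = ranker(['Who invented the telephone?'])
    # doc_ids holds the ids of the top_n highest-scoring documents for each
    # query, and scores holds the corresponding TF-IDF similarity scores.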
class deeppavlov.models.ranking.logit_ranker.LogitRanker(squad_model, **kwargs)[source]

Select the best answer using squad model logits. For a single outer batch, form several inner batches, send each to the squad model separately, and get a single best answer for each.

Parameters: squad_model – A loaded squad model.
squad_model

A loaded squad model.

__call__(contexts_batch: List[List[str]], questions_batch: List[List[str]]) → List[str][source]

Sort the results obtained from the squad reader by logits and return the answer with the maximum logit.

Parameters:
  • contexts_batch – A batch of contexts that should be treated as a single batch in the outer JSON config.
  • questions_batch – A batch of questions that should be treated as a single batch in the outer JSON config.
Returns:

A batch of best answers.
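
For intuition, a minimal sketch of the selection logic described above, with a hypothetical squad_model callable that maps parallel lists of contexts and questions to (answers, logits):

    from typing import Callable, List

    def best_answers(contexts_batch: List[List[str]],
                     questions_batch: List[List[str]],
                     squad_model: Callable) -> List[str]:
        # For each group of (context, question) pairs, keep the answer
        # whose logit is maximal.
        results = []
        for contexts, questions in zip(contexts_batch, questions_batch):
            answers, logits = squad_model(contexts, questions)
            results.append(max(zip(answers, logits), key=lambda p: p[1])[0])
        return results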