deeppavlov.models.ranking

Ranking classes.
class deeppavlov.models.ranking.ranking_model.RankingModel(vocab_name, hard_triplets_sampling: bool = False, hardest_positives: bool = False, semi_hard_negatives: bool = False, num_hardest_negatives: int = None, update_embeddings: bool = False, interact_pred_num: int = 3, **kwargs)

Class to perform ranking.

Parameters:
- vocab_name – A key word that indicates which subclass of deeppavlov.models.ranking.ranking_dict.RankingDict to use.
- hard_triplets_sampling – Whether to use hard triplets sampling to train the model, i.e. to choose negative samples close to positive ones.
- hardest_positives – Whether to use only the single hardest positive sample per anchor sample.
- semi_hard_negatives – Whether hard negative samples should be further away from anchor samples than positive samples.
- update_embeddings – Whether to store and update context and response embeddings.
- interact_pred_num – The number of the most relevant contexts and responses the model returns in the interact regime.
- **kwargs – Other parameters.
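A hypothetical configuration fragment showing how the sampling options above might be passed to the component when a DeepPavlov pipeline is assembled from a dict/JSON config. The registered component name and the example values are assumptions; only the parameter names come from the RankingModel signature.

    # Hypothetical config fragment; "ranking_model" as the registered name is an
    # assumption, only the keys mirror the RankingModel signature above.
    ranking_component = {
        "name": "ranking_model",
        "vocab_name": "insurance",        # which RankingDict subclass to use (example value)
        "hard_triplets_sampling": True,   # pick negatives that lie close to positives
        "hardest_positives": False,
        "semi_hard_negatives": True,      # negatives farther from the anchor than positives
        "update_embeddings": False,
        "interact_pred_num": 3,           # how many candidates to return in interact mode
    }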
class deeppavlov.models.ranking.ranking_network.RankingNetwork(toks_num: int, chars_num: int, emb_dict: deeppavlov.models.ranking.emb_dict.EmbDict, max_sequence_length: int, max_token_length: int = None, learning_rate: float = 0.001, device_num: int = 0, seed: int = None, shared_weights: bool = True, triplet_mode: bool = True, margin: float = 0.1, distance: str = 'cos_similarity', token_embeddings: bool = True, use_matrix: bool = False, tok_dynamic_batch: bool = False, embedding_dim: int = 300, char_embeddings: bool = False, char_dynamic_batch: bool = False, char_emb_dim: int = 32, highway_on_top: bool = False, reccurent: str = 'bilstm', hidden_dim: int = 300, max_pooling: bool = True)

Class to perform context-response matching with neural networks.

Parameters:
- toks_num – The size of the tok2int vocabulary used to build the embedding layer.
- chars_num – The size of the char2int vocabulary used to build the character-level embedding layer.
- learning_rate – Learning rate.
- device_num – The number of the device to perform model training on if several devices are available in the system.
- seed – Random seed.
- shared_weights – Whether to use shared weights in the model to encode contexts and responses.
- triplet_mode – Whether to use a model with triplet loss. If False, a model with cross-entropy loss will be used.
- margin – A margin parameter for triplet loss. Only used if triplet_mode is set to True (see the sketch after this parameter list).
- distance – Distance metric (similarity measure) used to compare context and response representations in the model. Possible values are cos_similarity (cosine similarity), euqlidian (Euclidean distance) and sigmoid (1 minus sigmoid).
- token_embeddings – Whether to use token (word) embeddings in the model.
- use_matrix – Whether to use a trainable matrix with token (word) embeddings.
- max_sequence_length – The maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
- tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum sequence length for a batch will be equal to the longest sequence in that batch, but not higher than max_sequence_length.
- embedding_dim – Dimensionality of token (word) embeddings.
- char_embeddings – Whether to use character-level token (word) embeddings in the model.
- max_token_length – The maximum length of a token when it is represented by a character-level embedding.
- char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum token length for a batch will be equal to the longest token in that batch, but not higher than max_token_length.
- char_emb_dim – Dimensionality of character-level embeddings.
- reccurent – The type of the RNN cell. Possible values are lstm and bilstm.
- hidden_dim – Dimensionality of the hidden state of the RNN cell. If reccurent equals bilstm, hidden_dim should be doubled to get the actual dimensionality.
- max_pooling – Whether to use a max-pooling operation to obtain the context (response) vector representation. If False, the last hidden state of the RNN will be used.
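The margin and distance parameters interact as in a standard margin-based triplet loss. The sketch below is an illustration rather than the library's implementation; it assumes the cos_similarity variant is converted to a distance as 1 minus cosine similarity.

    import numpy as np

    def cosine_distance(a, b):
        # 1 minus cosine similarity (assumed reading of distance='cos_similarity')
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def triplet_loss(anchor, positive, negative, margin=0.1, dist=cosine_distance):
        # Standard triplet loss: the positive must be closer to the anchor
        # than the negative by at least `margin`, otherwise a penalty is paid.
        return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

    rng = np.random.default_rng(0)
    context, good_response, bad_response = rng.normal(size=(3, 300))
    print(triplet_loss(context, good_response, bad_response, margin=0.1))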
class deeppavlov.models.ranking.ranking_dict.RankingDict(save_path: str, load_path: str, max_sequence_length: int, max_token_length: int, padding: str = 'post', truncating: str = 'post', token_embeddings: bool = True, char_embeddings: bool = False, char_pad: str = 'post', char_trunc: str = 'post', tok_dynamic_batch: bool = False, char_dynamic_batch: bool = False, update_embeddings: bool = False)

Class to encode characters, tokens, and whole contexts and responses with vocabularies, and to pad and truncate them.

Parameters:
- save_path – A path, including the filename, to store the instance of deeppavlov.models.ranking.ranking_network.RankingNetwork.
- load_path – A path, including the filename, to load the instance of deeppavlov.models.ranking.ranking_network.RankingNetwork.
- max_sequence_length – The maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
- tok_dynamic_batch – Whether to use dynamic batching. If True, the maximum sequence length for a batch will be equal to the longest sequence in that batch, but not higher than max_sequence_length.
- padding – Padding. Possible values are pre and post. If set to pre, a sequence will be padded at the beginning; if set to post, it will be padded at the end (see the sketch after this parameter list).
- truncating – Truncating. Possible values are pre and post. If set to pre, a sequence will be truncated at the beginning; if set to post, it will be truncated at the end.
- max_token_length – The maximum length of a token when it is represented by a character-level embedding.
- char_dynamic_batch – Whether to use dynamic batching for character-level embeddings. If True, the maximum token length for a batch will be equal to the longest token in that batch, but not higher than max_token_length.
- char_pad – Character-level padding. Possible values are pre and post. If set to pre, a token will be padded at the beginning; if set to post, it will be padded at the end.
- char_trunc – Character-level truncating. Possible values are pre and post. If set to pre, a token will be truncated at the beginning; if set to post, it will be truncated at the end.
- update_embeddings – Whether to store and update context and response embeddings.
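A minimal, dependency-free sketch of what pre/post padding and truncation mean for a sequence of token ids. The helper name pad_truncate is illustrative and not part of the library.

    def pad_truncate(seq, max_len, padding="post", truncating="post", pad_id=0):
        # Illustrative helper (not the library's API): truncate, then pad, a list of ids.
        if len(seq) > max_len:
            seq = seq[-max_len:] if truncating == "pre" else seq[:max_len]
        pad = [pad_id] * (max_len - len(seq))
        return pad + seq if padding == "pre" else seq + pad

    print(pad_truncate([3, 7, 9], 5))                             # [3, 7, 9, 0, 0]
    print(pad_truncate([3, 7, 9], 5, padding="pre"))              # [0, 0, 3, 7, 9]
    print(pad_truncate([1, 2, 3, 4, 5, 6], 5, truncating="pre"))  # [2, 3, 4, 5, 6]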
class deeppavlov.models.ranking.emb_dict.EmbDict(save_path: str, load_path: str, embeddings_path: str, max_sequence_length: int, embedding_dim: int = 300, embeddings: str = 'word2vec', seed: int = None, use_matrix: bool = False)

The class that provides token (word) embeddings.

Parameters:
- save_path – A path, including the filename, to store the instance of deeppavlov.models.ranking.ranking_network.RankingNetwork.
- load_path – A path, including the filename, to load the instance of deeppavlov.models.ranking.ranking_network.RankingNetwork.
- max_sequence_length – The maximum length of a sequence in tokens. Longer sequences will be truncated and shorter ones will be padded.
- seed – Random seed.
- embeddings – The type of embeddings. Possible values are fasttext, word2vec and random.
- embeddings_path – A path to an embeddings model, including the filename. The type of the model should coincide with the type of embeddings defined by the embeddings parameter.
- embedding_dim – Dimensionality of token (word) embeddings.
- use_matrix – Whether to use a trainable matrix with token (word) embeddings.
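To illustrate the three embedding modes, the sketch below builds a token-to-vector matrix indexed by tok2int ids. The function and variable names are hypothetical, not the library's API; the pre-trained word2vec/fasttext case is only indicated in a comment so the snippet stays runnable without external files.

    import numpy as np

    def build_embedding_matrix(tok2int, embedding_dim=300, embeddings="random",
                               embeddings_path=None, seed=None):
        # Hypothetical illustration of the lookup EmbDict provides, not the library's code.
        rng = np.random.default_rng(seed)
        # Row 0 is reserved for padding; other rows start as small random vectors.
        matrix = rng.uniform(-0.1, 0.1, size=(len(tok2int) + 1, embedding_dim))
        if embeddings in ("word2vec", "fasttext") and embeddings_path is not None:
            # With a pre-trained model one would overwrite rows for known tokens here,
            # e.g. using gensim loaders (omitted to keep the sketch self-contained).
            pass
        return matrix

    tok2int = {"hello": 1, "world": 2}
    print(build_embedding_matrix(tok2int, embedding_dim=4, seed=0).shape)  # (3, 4)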
class deeppavlov.models.ranking.tfidf_ranker.TfidfRanker(vectorizer: deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer, top_n=5, active: bool = True, **kwargs)

Rank documents according to input strings.

Parameters:
- vectorizer – a vectorizer class
- top_n – the number of document ids to return
- active – whether to return the number of ids specified by top_n (True) or all ids (False)

Attributes:
- top_n – the number of document ids to return
- vectorizer – an instance of the vectorizer class
- index2doc – inverted doc_index
- iterator – a dataset iterator used for generating batches while fitting the vectorizer
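The core idea is to score every indexed document against the query in TF-IDF space and keep the top_n highest-scoring document ids. The sketch below uses scikit-learn's TfidfVectorizer instead of DeepPavlov's HashingTfIdfVectorizer, so it is an analogy rather than the component's actual code.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "dogs chase cats", "quantum computing basics"]

    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)          # (n_docs, n_features), sparse

    def rank(query, top_n=2, active=True):
        # Score all documents against the query and sort best-first.
        scores = (vectorizer.transform([query]) @ doc_matrix.T).toarray()[0]
        order = np.argsort(scores)[::-1]
        ids = order[:top_n] if active else order         # active=False returns all ids
        return list(ids), scores[ids].tolist()

    print(rank("cats and dogs"))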
class deeppavlov.models.ranking.logit_ranker.LogitRanker(squad_model, **kwargs)

Select the best answer using SQuAD model logits. Split a single input batch into several batches, send each of them to the SQuAD model separately, and keep a single best answer for each batch.

Parameters:
- squad_model – a loaded SQuAD model

Attributes:
- squad_model – a loaded SQuAD model

__call__(contexts_batch: List[List[str]], questions_batch: List[List[str]]) → List[str]

Sort the results obtained from the SQuAD reader by their logits and return the answer with the maximum logit.

Parameters:
- contexts_batch – a batch of contexts which should be treated as a single batch in the outer JSON config
- questions_batch – a batch of questions which should be treated as a single batch in the outer JSON config

Returns: a batch of best answers
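A toy illustration of the selection rule in __call__: each (answer, logit) pair is assumed to come back from a separate SQuAD-model call, and the ranker keeps the answer with the highest logit. The helper name and the data are made up for the example.

    from typing import List, Tuple

    def best_answer(chunks: List[Tuple[str, float]]) -> str:
        # Each tuple is (answer, logit) from one SQuAD-model call for one sub-batch;
        # keep the answer whose logit is largest, as described above.
        return max(chunks, key=lambda pair: pair[1])[0]

    results = [("in 1969", 3.2), ("Neil Armstrong", 7.8), ("the Moon", 5.1)]
    print(best_answer(results))   # Neil Armstrong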