deeppavlov.models.torch_bert

class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchTransformersPreprocessor(vocab_file: str, do_lower_case: bool = True, max_seq_length: int = 512, **kwargs)[source]

Tokenize text into subtokens, encode the subtokens with their indices, and create token and segment masks.
- Parameters
vocab_file – A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co or a path to a directory containing vocabulary files required by the tokenizer.
do_lower_case – set True if lowercasing is needed
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens

tokenizer – instance of Bert FullTokenizer
__call__(texts_a: List, texts_b: Optional[List[str]] = None) → Union[List[transformers.data.processors.utils.InputFeatures], Tuple[List[transformers.data.processors.utils.InputFeatures], List[List[str]]]][source]

Tokenize and create masks. texts_a and texts_b are separated by the [SEP] token.

- Parameters
texts_a – list of texts
texts_b – list of texts; may be None, e.g. for a single-sentence classification task
- Returns
batch of transformers.data.processors.utils.InputFeatures with subtokens, subtoken ids, subtoken mask, and segment mask, or a tuple of the batch of InputFeatures and the batch of subtokens
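Example (a minimal usage sketch; the tokenizer id "bert-base-uncased" and the sample sentences are illustrative placeholders, not values prescribed by the library):

    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import (
        TorchTransformersPreprocessor,
    )

    # Any tokenizer id hosted on huggingface.co or a local vocabulary directory works here.
    preprocessor = TorchTransformersPreprocessor(vocab_file="bert-base-uncased",
                                                 do_lower_case=True,
                                                 max_seq_length=64)

    # Single-sentence mode: texts_b stays None.
    features = preprocessor(["All work and no play makes Jack a dull boy."])

    # Sentence-pair mode: texts_a and texts_b are joined with the [SEP] token.
    pair_features = preprocessor(["What is the capital of France?"],
                                 ["Paris is the capital of France."])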
class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchTransformersNerPreprocessor(vocab_file: str, do_lower_case: bool = False, max_seq_length: int = 512, max_subword_length: Optional[int] = None, token_masking_prob: float = 0.0, provide_subword_tags: bool = False, subword_mask_mode: str = 'first', return_features: bool = False, **kwargs)[source]

Takes tokens, splits them into BERT subtokens, and encodes the subtokens with their indices. Creates a mask of subtokens (one for the first subtoken of a word, zero for the others). If tags are provided, calculates tags for subtokens.
- Parameters
vocab_file – path to vocabulary
do_lower_case – set True if lowercasing is needed
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens
max_subword_length – replace a token with <unk> if its length is larger than this value (defaults to None, which is equivalent to +infinity)
token_masking_prob – probability of masking token while training
provide_subword_tags – output tags for subwords or for words
subword_mask_mode – subword to select inside word tokens, can be “first” or “last” (default=”first”)
return_features – if True, returns answer in features format
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens

max_subword_length – max length of a BERT subtoken

tokenizer – instance of Bert FullTokenizer
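Example (a construction sketch using only the parameters documented above; the tokenizer id and the tokenized sentence are illustrative, and the call with pre-tokenized words follows the class description rather than an explicit signature):

    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import (
        TorchTransformersNerPreprocessor,
    )

    ner_preprocessor = TorchTransformersNerPreprocessor(
        vocab_file="bert-base-cased",   # illustrative tokenizer id
        do_lower_case=False,            # tagging usually keeps the original casing
        max_subword_length=15,          # tokens longer than this become <unk>
        subword_mask_mode="first",      # mark the first subtoken of each word
    )

    # The preprocessor consumes batches of already tokenized sentences
    # (and, per the class description, optionally their tags).
    tokens_batch = [["John", "lives", "in", "New", "York"]]
    processed = ner_preprocessor(tokens_batch)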
class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchBertRankerPreprocessor(vocab_file: str, do_lower_case: bool = True, max_seq_length: int = 512, **kwargs)[source]

Tokenize text into subtokens, encode the subtokens with their indices, and create token and segment masks for ranking. Builds features for a pair of the context with each of the response candidates.
__call__(batch: List[List[str]]) → List[List[transformers.data.processors.utils.InputFeatures]][source]

Tokenize and create masks.
- Parameters
batch – list of elements where the first element is the batch of contexts and the remaining elements are batches of response candidates
- Returns
list of feature batches with subtokens, subtoken ids, subtoken mask, segment mask.
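Example (a sketch of the documented call convention: the first element of the batch holds the contexts, each remaining element holds one batch of response candidates; the tokenizer id and texts are placeholders):

    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import (
        TorchBertRankerPreprocessor,
    )

    ranker_preprocessor = TorchBertRankerPreprocessor(vocab_file="bert-base-uncased",
                                                      max_seq_length=128)

    contexts = ["How do I reset my password?"]
    candidates_1 = ["Click 'Forgot password' on the login page."]
    candidates_2 = ["Our office is open from 9 to 5."]

    # First element: contexts; the rest: one batch per response candidate.
    features = ranker_preprocessor([contexts, candidates_1, candidates_2])
    # features is a list of InputFeatures batches, one per candidate.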
class deeppavlov.models.torch_bert.torch_transformers_classifier.TorchTransformersClassifierModel(n_classes, pretrained_bert, multilabel: bool = False, return_probas: bool = False, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, bert_config_file: Optional[str] = None, is_binary: Optional[bool] = False, num_special_tokens: Optional[int] = None, **kwargs)[source]

BERT-based model for text classification on PyTorch. It uses the output of the [CLS] token and predicts labels with a linear transformation.
- Parameters
n_classes – number of classes
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
multilabel – set True if it is multi-label classification
return_probas – set True to return class probabilities instead of the most probable label
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
bert_config_file – path to Bert configuration file (not used if pretrained_bert is key title)
is_binary – whether classification task is binary or multi-class
num_special_tokens – number of special tokens used by classification model
__call__(features: Dict[str, torch.tensor]) → Union[List[int], List[List[float]]][source]

Make predictions for the given features (texts).
- Parameters
features – batch of InputFeatures
- Returns
predicted classes or probabilities of each class
train_on_batch(features: Dict[str, torch.tensor], y: Union[List[int], List[List[int]]]) → Dict[source]

Train the model on the given batch. This method calls train_op using features and y (labels).
- Parameters
features – batch of InputFeatures
y – batch of labels (class id or one-hot encoding)
- Returns
dict with loss and learning_rate values
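Example (a sketch that mirrors the standard DeepPavlov pipeline wiring of preprocessor and classifier; the checkpoint id, texts, labels, and save_path are illustrative assumptions, with save_path assumed to be accepted through **kwargs as the usual serialization option of the base model):

    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import (
        TorchTransformersPreprocessor,
    )
    from deeppavlov.models.torch_bert.torch_transformers_classifier import (
        TorchTransformersClassifierModel,
    )

    preprocessor = TorchTransformersPreprocessor(vocab_file="bert-base-uncased")
    model = TorchTransformersClassifierModel(
        n_classes=2,
        pretrained_bert="bert-base-uncased",       # placeholder checkpoint id
        return_probas=True,
        save_path="./clf_checkpoint/model",        # assumed base-class serialization kwarg
    )

    features = preprocessor(["this movie was great", "this movie was terrible"])
    probas = model(features)                            # class probabilities per text
    metrics = model.train_on_batch(features, y=[1, 0])  # dict with loss and learning_rate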
class deeppavlov.models.torch_bert.torch_transformers_sequence_tagger.TorchTransformersSequenceTagger(n_tags: int, pretrained_bert: str, bert_config_file: Optional[str] = None, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, use_crf: bool = False, **kwargs)[source]

Transformer-based model on PyTorch for text tagging. It predicts a label for every token (not subtoken) in the text. You can use it for sequence labeling tasks, such as morphological tagging or named entity recognition.
- Parameters
n_tags – number of distinct tags
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
bert_config_file – path to Bert configuration file, or None, if pretrained_bert is a string name
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
use_crf – whether to use a Conditional Random Field to decode tags
__call__(input_ids: Union[List[List[int]], numpy.ndarray], input_masks: Union[List[List[int]], numpy.ndarray], y_masks: Union[List[List[int]], numpy.ndarray]) → Tuple[List[List[int]], List[numpy.ndarray]][source]

Predicts tag indices for a given batch of subword tokens.
- Parameters
input_ids – indices of the subwords
input_masks – mask that determines where to attend and where not to
y_masks – mask which marks the first subword unit of each word
- Returns
Label indices or class probabilities for each token (not subtoken)
train_on_batch(input_ids: Union[List[List[int]], numpy.ndarray], input_masks: Union[List[List[int]], numpy.ndarray], y_masks: Union[List[List[int]], numpy.ndarray], y: List[List[int]], *args, **kwargs) → Dict[str, float][source]

- Parameters
input_ids – batch of indices of subwords
input_masks – batch of masks which determine what should be attended
args – arguments passed to _build_feed_dict and corresponding to additional input and output tensors of the derived class.
kwargs – keyword arguments passed to _build_feed_dict and corresponding to additional input and output tensors of the derived class.
- Returns
dict with fields ‘loss’, ‘head_learning_rate’, and ‘bert_learning_rate’
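Example (a sketch with toy subword indices; in a real pipeline the ids and masks come from TorchTransformersNerPreprocessor, and the checkpoint id, toy index values, tag ids, and save_path are illustrative assumptions):

    from deeppavlov.models.torch_bert.torch_transformers_sequence_tagger import (
        TorchTransformersSequenceTagger,
    )

    tagger = TorchTransformersSequenceTagger(
        n_tags=9,                              # e.g. a BIO tag set of nine labels
        pretrained_bert="bert-base-cased",     # placeholder checkpoint id
        use_crf=False,
        save_path="./ner_checkpoint/model",    # assumed base-class serialization kwarg
    )

    # Toy batch of one sentence already encoded into subword ids ([CLS] ... [SEP]).
    input_ids   = [[101, 2198, 3268, 1999, 2047, 2259, 102]]
    input_masks = [[1, 1, 1, 1, 1, 1, 1]]      # attend to every subword
    y_masks     = [[0, 1, 1, 1, 1, 1, 0]]      # first-subword positions of real words

    predictions = tagger(input_ids, input_masks, y_masks)   # per-token tag indices / probabilities

    # Training takes one gold tag id per word (not per subword).
    metrics = tagger.train_on_batch(input_ids, input_masks, y_masks, y=[[3, 0, 0, 5, 6]])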
class deeppavlov.models.torch_bert.torch_transformers_squad.TorchTransformersSquad(pretrained_bert: str, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, bert_config_file: Optional[str] = None, psg_cls: bool = False, batch_size: int = 10, **kwargs)[source]

BERT-based model on PyTorch for the SQuAD-like problem setting: it predicts the start and end positions of the answer for a given question and context.

The [CLS] token is used as no_answer: if the model selects the [CLS] token as the most probable answer, there is no answer in the given context. Start and end positions of the answer are predicted by a linear transformation of the BERT outputs.
- Parameters
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
bert_config_file – path to Bert configuration file, or None, if pretrained_bert is a string name
psg_cls – whether to use a separate linear layer to define if a passage contains the answer to the question
batch_size – batch size for inference of squad model
__call__(features_batch: List[List[transformers.data.processors.utils.InputFeatures]]) → Tuple[List[List[int]], List[List[int]], List[List[float]], List[List[float]], List[int]][source]

Get predictions using features as input.
- Parameters
features_batch – batch of InputFeatures instances
- Returns
start_pred_batch – answer start positions
end_pred_batch – answer end positions
logits_batch – answer logits
scores_batch – answer confidences
ind_batch – indices of paragraph pieces where the answer was found
train_on_batch(features: List[List[transformers.data.processors.utils.InputFeatures]], y_st: List[List[int]], y_end: List[List[int]]) → Dict[source]

Train the model on the given batch. This method calls train_op using features and labels from y_st and y_end.
- Parameters
features – batch of InputFeatures instances
y_st – batch of lists of ground truth answer start positions
y_end – batch of lists of ground truth answer end positions
- Returns
dict with loss and learning_rate values
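Example (a construction sketch; the checkpoint id and save_path are placeholders, and the features are assumed to come from the SQuAD preprocessor of the DeepPavlov pipeline, which is not documented in this section):

    from deeppavlov.models.torch_bert.torch_transformers_squad import TorchTransformersSquad

    squad_model = TorchTransformersSquad(
        pretrained_bert="bert-base-uncased",     # placeholder checkpoint id
        batch_size=10,
        save_path="./squad_checkpoint/model",    # assumed base-class serialization kwarg
    )

    # Inference: squad_model(features_batch) returns
    #   start_pred_batch, end_pred_batch, logits_batch, scores_batch, ind_batch,
    # where features_batch is a List[List[InputFeatures]], one list of
    # paragraph-piece features per question, built by the pipeline's preprocessor.
    # Training: squad_model.train_on_batch(features_batch, y_st=..., y_end=...)
    # returns a dict with loss and learning_rate values.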
class deeppavlov.models.torch_bert.torch_bert_ranker.TorchBertRankerModel(pretrained_bert: Optional[str] = None, bert_config_file: Optional[str] = None, n_classes: int = 2, return_probas: bool = True, **kwargs)[source]

BERT-based model for interaction-based text ranking on PyTorch.

A linear transformation is trained over the BERT pooled output of the [CLS] token. The predicted class probabilities are used as a similarity measure for ranking.
- Parameters
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
bert_config_file – path to Bert configuration file (not used if pretrained_bert is key title)
n_classes – number of classes
return_probas – set True if class probabilities are returned instead of the most probable label
__call__(features_li: List[List[transformers.data.processors.utils.InputFeatures]]) → Union[List[int], List[List[float]]][source]

Calculate scores for the given context over candidate responses.
- Parameters
features_li – list of elements where each element contains the batch of features for contexts with particular response candidates
- Returns
predicted scores for contexts over response candidates
train_on_batch(features_li: List[List[transformers.data.processors.utils.InputFeatures]], y: Union[List[int], List[List[int]]]) → Dict[source]

Train the model on the given batch.
- Parameters
features_li – list with the single element containing the batch of InputFeatures
y – batch of labels (class id or one-hot encoding)
- Returns
dict with loss and learning rate values
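Example (a sketch wiring TorchBertRankerPreprocessor from above to the ranker model, as in the DeepPavlov ranking pipelines; the checkpoint id, texts, and save_path are illustrative assumptions):

    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import (
        TorchBertRankerPreprocessor,
    )
    from deeppavlov.models.torch_bert.torch_bert_ranker import TorchBertRankerModel

    preprocessor = TorchBertRankerPreprocessor(vocab_file="bert-base-uncased",
                                               max_seq_length=128)
    ranker = TorchBertRankerModel(
        pretrained_bert="bert-base-uncased",     # placeholder checkpoint id
        n_classes=2,
        return_probas=True,
        save_path="./ranker_checkpoint/model",   # assumed base-class serialization kwarg
    )

    contexts = ["How do I reset my password?"]
    candidates_1 = ["Click 'Forgot password' on the login page."]
    candidates_2 = ["Our office is open from 9 to 5."]

    # One feature batch per candidate for the same contexts.
    features_li = preprocessor([contexts, candidates_1, candidates_2])
    scores = ranker(features_li)   # predicted scores of each candidate for the context
    # Note: train_on_batch expects a features_li with a single element (see above).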