Using BERT for next sentence prediction

Question

Google's BERT is pretrained on the next sentence prediction task, but I'm wondering if it's possible to call the next sentence prediction function on new data.

The idea is: given a sentence A and a sentence B, I want a probabilistic label for whether or not sentence B follows sentence A. BERT is pretrained on a huge dataset, so I was hoping to use this next sentence prediction on new sentence data. I can't figure out whether this next sentence prediction function can be called and, if so, how. Thanks for your help!

Aerin · Answer

Hugging Face did it for you: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L854

    class BertForNextSentencePrediction(BertPreTrainedModel):
        """BERT model with next sentence prediction head.
        This module comprises the BERT model followed by the next sentence classification head.

        Params:
            config: a BertConfig class instance with the configuration to build a new model.

        Inputs:
            `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
                with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
                `extract_features.py`, `run_classifier.py` and `run_squad.py`)
            `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
                types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
                a `sentence B` token (see BERT paper for more details).
            `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
                selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
                input sequence length in the current batch. It's the mask that we typically use for attention when
                a batch has varying length sentences.
            `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size]
                with indices selected in [0, 1].
                0 => next sentence is the continuation, 1 => next sentence is a random sentence.

        Outputs:
            if `next_sentence_label` is not `None`:
                Outputs the total_loss which is the sum of the masked language modeling loss and the next
                sentence classification loss.
            if `next_sentence_label` is `None`:
                Outputs the next sentence classification logits of shape [batch_size, 2].

        Example usage:
        ```python
        # Already been converted into WordPiece token ids
        input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
        input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
        token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

        config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
            num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

        model = BertForNextSentencePrediction(config)
        seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
        ```
        """
        def __init__(self, config):
            super(BertForNextSentencePrediction, self).__init__(config)
            self.bert = BertModel(config)
            self.cls = BertOnlyNSPHead(config)
            self.apply(self.init_bert_weights)

        def forward(self, input_ids, token_type_ids=None, attention_mask=None, next_sentence_label=None):
            _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                         output_all_encoded_layers=False)
            seq_relationship_score = self.cls(pooled_output)

            if next_sentence_label is not None:
                loss_fct = CrossEntropyLoss(ignore_index=-1)
                next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
                return next_sentence_loss
            else:
                return seq_relationship_score
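The three inputs described in the docstring can be sketched in plain Python, without the library. This is only an illustration of the layout, not the library's own preprocessing. The helper name `build_pair_inputs` is hypothetical, the WordPiece ids in the example are made up, and the default ids for `[CLS]` (101), `[SEP]` (102) and padding (0) are assumptions that match the standard BERT vocabularies but should be checked against your tokenizer.

```python
from typing import Dict, List


def build_pair_inputs(
    ids_a: List[int],
    ids_b: List[int],
    max_len: int,
    cls_id: int = 101,  # assumed id of [CLS]
    sep_id: int = 102,  # assumed id of [SEP]
    pad_id: int = 0,    # assumed id of [PAD]
) -> Dict[str, List[int]]:
    """Lay out one sentence pair as [CLS] A [SEP] B [SEP], padded to max_len."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    if len(input_ids) > max_len:
        raise ValueError("sentence pair is longer than max_len")

    # Type 0 covers [CLS], sentence A and its [SEP];
    # type 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)

    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


# Hypothetical WordPiece ids for sentence A and sentence B.
example = build_pair_inputs([7, 8, 9], [10, 11], max_len=10)
for name, values in example.items():
    print(name, values)
```

Each list would then be wrapped in a `torch.LongTensor` of shape [batch_size, sequence_length] before being passed to the model.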

Bram Vanroy · Answer

Aerin's answer is outdated. The Hugging Face library (now called transformers) has changed a lot over the last couple of months. Here is an example of how to use the next sentence prediction (NSP) model and how to extract probabilities from it.

    from torch.nn.functional import softmax
    from transformers import BertForNextSentencePrediction, BertTokenizer


    seq_A = 'I like cookies !'
    seq_B = 'Do you like them ?'

    # load pretrained model and a pretrained tokenizer
    model = BertForNextSentencePrediction.from_pretrained('bert-base-cased')
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    # encode the two sequences. Particularly, make clear that they must be
    # encoded as "one" input to the model by using 'seq_B' as the 'text_pair'
    encoded = tokenizer.encode_plus(seq_A, text_pair=seq_B, return_tensors='pt')
    print(encoded)
    # {'input_ids': tensor([[  101,   146,  1176, 18621,   106,   102,  2091,  1128,  1176,  1172,   136,   102]]),
    # 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]),
    # 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
    # NOTE how the token_type_ids are 0 for all tokens in seq_A and 1 for seq_B,
    # this way the model knows which token belongs to which sequence

    # a model's output is a Tuple, we only need the output tensor containing
    # the relationships which is the first item in the Tuple
    seq_relationship_logits = model(**encoded)[0]

    # we still need softmax to convert the logits into probabilities
    # index 0: sequence B is a continuation of sequence A
    # index 1: sequence B is a random sequence
    probs = softmax(seq_relationship_logits, dim=1)

    print(seq_relationship_logits)
    print(probs)
    # tensor([[9.9993e-01, 6.7607e-05]], grad_fn=<SoftmaxBackward>)
    # very high value for index 0: high probability of seq_B being a continuation of seq_A
    # which is what we expect!
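To make the last step concrete, here is a minimal dependency-free sketch of what the softmax does to the two NSP logits. The function name `nsp_probabilities` and the logit values are hypothetical and chosen only for illustration; they are not output copied from the model.

```python
import math
from typing import List


def nsp_probabilities(logits: List[float]) -> List[float]:
    """Softmax over the two NSP logits.

    Index 0: sequence B is a continuation of sequence A.
    Index 1: sequence B is a random sequence.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single sentence pair.
hypothetical_logits = [6.0, -3.6]
p_next, p_random = nsp_probabilities(hypothetical_logits)
print(f"P(B follows A) = {p_next:.5f}")
print(f"P(B is random) = {p_random:.5f}")
```

The first value is the probabilistic label asked for in the question: the closer it is to 1, the more confident the model is that sentence B follows sentence A.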