Lecture Notes - Advanced Language Model - BERT, GPT
The Transformer architecture was introduced to address the challenges faced by traditional Seq2Seq models. Some of these challenges were increased computational and memory needs
when working with long sequences.
The Transformer takes advantage of the best features of two proven approaches: attention
mechanisms for capturing dependencies between tokens and the parallel processing that CNNs offer. By combining these
concepts, the Transformer can analyze input sequences in parallel and generate a context vector that
reflects the dependencies between tokens.
The Google Brain team introduced the Transformer architecture in the paper "Attention Is All You Need",
which sparked a major shift in the advancement of natural language understanding (NLU).
A transformer has multiple encoders and decoders stacked on top of each other (the original paper uses a stack of six
identical layers on each side). The key components of the architecture are:
1. Positional encoding
2. Multi-head attention block
3. Normalization layer
4. Position-wise feedforward NNs
Positional Encoding
The input sentence is first converted into word embeddings and then fed to the encoder.
Word embedding is the process of converting text into a numerical representation that a machine
can understand. To store the relevant information, the embedding layer generates a matrix that contains
the embeddings of all the tokens. This matrix is usually of the shape (vocab_size, embedding_dim). Here,
vocab_size is the number of unique tokens, and embedding_dim is the number of dimensions in the
embedding space. Each dimension in the embedding space represents a specific feature of the token.
The positional encodings are computed using sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
Here, pos is the position of the word, i indexes the dimensions of the word embedding, and d_model is the
number of dimensions in the embeddings.
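Below is a minimal NumPy sketch of the sinusoidal positional encoding described above. The function name and the example shapes are illustrative assumptions, not part of the original notes.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # pos is the token position, i indexes the embedding dimensions
    positions = np.arange(seq_len)[:, np.newaxis]             # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # shape (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # cosine on odd dimensions
    return pe

# The positional encodings are added to the word embeddings before the encoder:
# pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)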
Self-Attention
The self-attention mechanism used by the encoder allows it to selectively focus on important parts of the
input while processing it. This allows the model to determine the relevance of different sections of the
input and make informed decisions when translating languages.
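As a rough illustration of the idea, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of self-attention. The function names are illustrative, and the query, key, and value projections are only hinted at in the comments.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (seq_len, d_k) matrices derived from the input embeddings
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block attention to masked positions
    weights = softmax(scores, axis=-1)              # attention weights sum to 1 per query
    return weights @ V                              # weighted sum of values = context vectors

# In self-attention, Q, K, and V all come from the same sequence, e.g.
# x = embeddings + positional_encodings; Q = x @ W_q; K = x @ W_k; V = x @ W_v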
Multi-Head Attention
The original paper recommends the use of eight attention heads, referred to as multi-head attention.
Multi-head attention runs several attention layers in parallel, each of which can capture a different
aspect of the input, improving both the speed of context vector calculation and the model's performance.
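A minimal sketch of how several heads can be combined, reusing the scaled_dot_product_attention function and the NumPy import from the previous sketch; the projection matrices W_q, W_k, W_v, and W_o are illustrative placeholders.

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    # x: (seq_len, d_model); each projection matrix: (d_model, d_model)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    head_outputs = []
    for h in range(num_heads):
        # each head attends over its own d_head-dimensional slice of Q, K, V
        sl = slice(h * d_head, (h + 1) * d_head)
        head_outputs.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
    return concat @ W_o                              # final linear projection

In practice, the heads are computed in parallel rather than in a Python loop, which is where the speed-up mentioned above comes from.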
In addition to the components used in the encoder, the decoder relies on two attention-related mechanisms:
1. Look-ahead mask
2. Cross-attention (encoder-decoder attention)
In the transformer decoder, the model must generate a new word based on the context of all the previous
words in the input sequence, but it should not have access to information about the future words that have
not been generated yet. To ensure that the model only has access to the relevant information, a look-ahead
mask is applied in the decoder's self-attention layer; it blocks attention to positions that come after the current one.
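A minimal sketch of how such a look-ahead (causal) mask can be built with NumPy; a value of 0 marks a position the decoder is not allowed to attend to, and the mask can be passed to the scaled_dot_product_attention sketch shown earlier.

import numpy as np

def look_ahead_mask(seq_len):
    # lower-triangular matrix: position i may attend only to positions 0..i
    return np.tril(np.ones((seq_len, seq_len)))

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]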
● Once the attention layers are applied to the current positions, a normalization layer is applied, just
like it is done in the encoder.
Cross-Attention
After applying the normalization layer, cross-attention (encoder-decoder attention) is applied to the output
received from the previous layer of the decoder and the output of the encoder (context vector).
Cross-attention obtains its queries (Q) from the previous decoder layer and the keys (K) and values (V) from
the encoder output. This allows every position in the decoder to look over all the positions in the input
sequence (similar to the typical encoder-decoder architecture).
The sketch below illustrates how cross-attention is calculated.
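This is a minimal sketch of the calculation, reusing the scaled_dot_product_attention function from the self-attention sketch; decoder_states and encoder_output are illustrative placeholders with made-up shapes.

import numpy as np

d_model = 8
decoder_states = np.random.randn(3, d_model)   # output of the previous decoder layer (3 tokens so far)
encoder_output = np.random.randn(5, d_model)   # context vectors from the encoder (5 input tokens)

# Queries come from the decoder; keys and values come from the encoder output.
Q = decoder_states        # in practice: decoder_states @ W_q
K = encoder_output        # in practice: encoder_output @ W_k
V = encoder_output        # in practice: encoder_output @ W_v

cross_attended = scaled_dot_product_attention(Q, K, V)
print(cross_attended.shape)   # (3, 8): one context vector per decoder position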
Hugging Face
The Hugging Face library is known for its NLP platform, which provides a collection of pre-trained machine
learning models for various NLP tasks, such as text classification, text generation, question-answering, and
more. This platform is accessible through various programming languages, including Python, and it offers
easy access to the latest and most advanced NLP models.
The models available on the Hugging Face platform have been trained on extensive text data and can be
further improved by fine-tuning them on specific tasks.
Pipelines provide an effortless and convenient solution for utilizing a variety of models for inference. These
pipelines are designed as objects that encapsulate the complicated code from the Hugging Face library,
making it simple to perform various NLP tasks with a dedicated API. These tasks include recognizing named
entities, performing masked language modeling, analyzing sentiment, extracting features, and answering
questions.
● For text generation, you can use pipeline() in the following manner.
from transformers import pipeline

generator = pipeline("text-generation")
generator("In the galaxy far far")
Output
● The code to perform the task of masked language modeling (where the downloaded model predicts
masked words) would look like this.
unmasker = pipeline("fill-mask")
unmasker("You are going to <mask> about a wonderful library today.", top_k=2)
In the above code, we are asking the model to make the top two predictions for the <mask> token.
Output
● For sentiment analysis, you can create a pipeline and pass it a list of input sentences.
classifier = pipeline("sentiment-analysis")
classifier(input_sentences)
Output
A pipeline is composed of a series of components, each of which performs a specific NLP task. The
following are some of the common components that can be used inside a pipeline: a tokenizer, model,
named entity recognizer, sentiment analyzer, question answering model, text generation model, and text
summarizer.
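For example, instead of relying on the default model for a task, a pipeline can be assembled from an explicitly chosen checkpoint and tokenizer. The checkpoint name below is only an illustration.

from transformers import pipeline, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

classifier = pipeline("sentiment-analysis", model=checkpoint, tokenizer=tokenizer)
classifier("Learning NLP is so much rewarding")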
The first component inside the pipeline() function is pre-processing. Pre-processing is done by a tokenizer
inside the transformer API. This is how an input text is pre-processed:
● Tokenization
The transformer tokenizers convert the input text into the numerical representation of each token.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Learning NLP is so much rewarding")
Output
{'input_ids': [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0,
0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Here, the input text is transformed into a dictionary consisting of the following three keys:
“input_ids,” “token_type_ids,” and “attention_mask.”
● In case the tokenized outputs are not of the same length, the tokenizer function allows us to control
the output using padding and truncation, as shown in the sketch below.
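A short example of what this could look like with the tokenizer loaded above; the sentences and the max_length value are illustrative.

batch = tokenizer(
    ["Learning NLP is so much rewarding", "Transformers!"],
    padding=True,       # pad the shorter sequence to the length of the longest in the batch
    truncation=True,    # cut off sequences longer than max_length
    max_length=16,
)
print([len(ids) for ids in batch["input_ids"]])   # both sequences now have the same length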
BERT
BERT stands for Bidirectional Encoder Representations from Transformers. BERT is a transformer-based
architecture that uses an attention mechanism to learn the contextual relationships between words in a
sentence.
During pre-training, BERT is exposed to a large corpus of text, such as books and articles, and is optimized
to predict missing words or words that have been masked in the input. This pre-training allows BERT to
build a representation of language and its context, which can then be fine-tuned for specific NLP tasks. The
main innovation of BERT is its bidirectional approach, which considers the context to the left and right of
each word, giving the model a more nuanced understanding of the context.
In the module, you tried to solve the following problem statement using the BERT model.
Predict whether any given two sentences (questions) are semantically similar to each other. You used the
Quora Question Pair (QQP) data set, which is part of the GLUE benchmark.
You performed the following key steps to solve the problem statement:
● You downloaded the data set. There were 363,846 entries of data with four columns: question1,
question2, label, and idx.
import pandas as pd

train = pd.read_csv('/content/drive/MyDrive/sentence_pair_classification_data/train.csv')
train.sample(5)
● You used the load_dataset() function to automatically convert the given data into a dictionary of datasets (a DatasetDict).
from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': '/content/drive/MyDrive/sentence_pair_classification_data/train.csv',
    'valid': '/content/drive/MyDrive/sentence_pair_classification_data/val.csv',
    'test': '/content/drive/MyDrive/sentence_pair_classification_data/test.csv'})
● Next, you loaded the tokenizer corresponding to the chosen checkpoint.
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
● After the tokenizer was loaded, you applied it to the entire data set, as shown below.
def preprocess_function(records):
    return tokenizer(records['question1'], records['question2'], truncation=True,
                     return_token_type_ids=True, max_length=75)

encoded_dataset = dataset.map(preprocess_function, batched=True)
Output
DatasetDict({
train: Dataset({
features: ['question1', 'question2', 'label', 'idx', 'input_ids',
'token_type_ids', 'attention_mask'],
num_rows: 363846
})
valid: Dataset({
features: ['question1', 'question2', 'label', 'idx', 'input_ids',
'token_type_ids', 'attention_mask'],
num_rows: 40430
})
test: Dataset({
features: ['question1', 'question2', 'label', 'idx', 'input_ids',
'token_type_ids', 'attention_mask'],
num_rows: 390965
})
})
● The encoded data set still contains the original columns; once the data set is encoded, we identify
the columns added by the tokenizer so that only those (plus the label) are passed to the model.
pre_tokenizer_columns = set(dataset["train"].features)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)
● The next step is to convert the train and validation splits into TensorFlow data sets.
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["labels"], shuffle=True,
    batch_size=batch_size, collate_fn=data_collator,
)
tf_validation_dataset = encoded_dataset["valid"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["labels"], shuffle=False,
    batch_size=batch_size, collate_fn=data_collator,
)
● The data is now in a format compatible with the chosen TensorFlow framework. This
was done using the to_tf_dataset() function.
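Note that the snippet above assumes that batch_size and data_collator have already been defined. A minimal sketch of those definitions (the batch size value is an assumption):

from transformers import DataCollatorWithPadding

batch_size = 32   # assumed value
# Dynamically pads each batch to its longest sequence and returns TensorFlow tensors.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")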
● Next, you downloaded the model and defined its hyperparameters.
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.losses import SparseCategoricalCrossentropy

num_labels = 2   # QQP is a binary task: duplicate or not duplicate
model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint,
                                                             num_labels=num_labels)

num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0,
    decay_steps=num_train_steps, power=2
)
opt = Adam(learning_rate=lr_scheduler)
loss = SparseCategoricalCrossentropy(from_logits=True)
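The notes jump from defining the hyperparameters to saving the training history; a minimal sketch of the intermediate compile-and-fit step, assuming the objects defined above, could look like this:

model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
history = model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
)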
● You can save the model and its training history to your drive and reuse the saved model when needed.
● After loading the model, you can apply a custom function to infer the model’s performance on a
custom input.
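A hypothetical helper for such custom inference; the function name and the example questions are illustrative, not from the original notebook.

import tensorflow as tf

def predict_similarity(question1, question2):
    # Tokenize the sentence pair exactly as during training, then run the fine-tuned model.
    inputs = tokenizer(question1, question2, return_tensors="tf",
                       truncation=True, max_length=75)
    logits = model(inputs).logits
    probs = tf.nn.softmax(logits, axis=-1).numpy()[0]
    # Assuming label 1 means "duplicate", following the GLUE QQP convention.
    return {"not_duplicate": float(probs[0]), "duplicate": float(probs[1])}

predict_similarity("How do I learn NLP?", "What is the best way to learn NLP?")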
This wraps up our use case of fine-tuning a BERT model for finding sentence-pair similarity.
● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of the content for any other commercial/unauthorized purposes
in any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
● No graphics, images, or photographs from any accompanying text in this document will be used separately
for unauthorized purposes.
● No material in this document will be modified, adapted, or altered in any way.
● No part of this document or UpGrad content may be reproduced or stored in any other website or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
● Any rights not expressly granted in these terms are reserved.