BERT Architecture
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Anashua Dastidar, Teaching Assistant
Introduction
BERT is a language representation model pre-trained on a very large corpus of unlabeled text using several pre-training tasks. It was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) [1].
The main idea behind BERT was to create a pre-trained language model that could be adapted to various downstream tasks in either of two ways:
1. Fine-tuning
2. A feature-based approach (using BERT's contextual embeddings as input to a separate task-specific model)
Why Bidirectional?
Some of the popular language models before the BERT paper was released were ELMo [2] and OpenAI's GPT [3]. However, the authors of the BERT paper argued that the techniques used to create these models restricted the power of the pre-trained representations, the major limitation being that they were unidirectional. They argued that such techniques are suboptimal when applied to sentence-level tasks and can be harmful when applied to token-level tasks such as question answering.
ELMo concatenates the outputs of a left-to-right model and a right-to-left model, which is a shallow approach.
How does BERT overcome this issue?
● BERT alleviates the previously mentioned unidirectionality constraint by using a pre-training task called masked language modelling (MLM). The idea is to randomly mask some tokens of the input sequence and have the model predict the original tokens based only on their context.
○ Masked language modelling enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.
● In addition to masked language modelling, a second pre-training task, next sentence prediction (NSP), is used to jointly pre-train text-pair representations.
BERT Architecture
● There are two steps in the BERT framework: the first is pre-training on a large corpus, and the second is fine-tuning for various downstream tasks.
● BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017).
● The paper presents two model sizes, BERTBASE and BERTLARGE. BERTBASE is similar in size to GPT and is used for comparisons, while BERTLARGE is used to achieve the state-of-the-art results reported in the paper.
● Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks): twelve for the Base version and twenty-four for the Large version.
● They also have a larger hidden size (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the original paper (6 encoder layers, 512 hidden units, and 8 attention heads).
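For concreteness, the two configurations can be written down in code. This is a minimal sketch, assuming the Hugging Face transformers library (the slides do not prescribe any implementation); it builds randomly initialised encoders of the right shape, without the pre-training itself.

# Sketch: BERT-Base and BERT-Large sized encoders via Hugging Face transformers
# (library choice is an assumption, not part of the original slides).
from transformers import BertConfig, BertModel

# BERT-Base: 12 encoder layers, hidden size 768, 12 attention heads (~110M parameters)
base_config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,   # feed-forward inner dimension
)

# BERT-Large: 24 encoder layers, hidden size 1024, 16 attention heads (~340M parameters)
large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)

bert_base = BertModel(base_config)     # randomly initialised; pre-training not included
bert_large = BertModel(large_config)

print(sum(p.numel() for p in bert_base.parameters()))   # roughly 110M
print(sum(p.numel() for p in bert_large.parameters()))  # roughly 340M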
BERT Model inputs
● So that the model can handle a variety of downstream tasks, its input representation is able to unambiguously represent either a single sentence or a pair of sentences in one token sequence.
● The first token of every sequence is a special classification token, [CLS]. The final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks.
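A minimal sketch of this input format, assuming the Hugging Face transformers tokenizer (not mentioned in the slides): it shows how a sentence pair is packed into one sequence with [CLS] and [SEP], and how segment ids distinguish the two sentences.

# Sketch: packing a sentence pair into a single BERT input sequence
# (Hugging Face tokenizer is an assumption; the slides only describe the format).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is hairy", "he likes playing fetch")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', 'likes', 'playing', 'fetch', '[SEP]']
print(encoded["token_type_ids"])
# 0 for every sentence-A position, 1 for every sentence-B position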
BERT Model Pre-training: Masked Language Modelling
In order to train a deep bidirectional representation, some percentage of the input tokens are masked at random and the model must then predict the masked tokens. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard language model.
In all of their experiments, the authors mask 15% of all WordPiece tokens in each sequence at random. Although this allows them to obtain a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, they do not always replace "masked" words with the actual [MASK] token.
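As a rough illustration of the prediction step, the sketch below (PyTorch is an assumption, and this is deliberately simplified: the real BERT head adds an extra transform layer and ties its weights to the token embeddings) projects the final hidden vectors to vocabulary logits and computes the loss only at masked positions.

# Simplified sketch of a masked-LM output head (PyTorch assumed).
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden_size, vocab_size)      # output layer over the vocabulary

# final hidden states from the encoder for a batch of sequences
hidden_states = torch.randn(2, 128, hidden_size)   # (batch, seq_len, hidden)
logits = mlm_head(hidden_states)                   # (batch, seq_len, vocab)

# loss is computed only at the masked positions; -100 marks positions to ignore
labels = torch.full((2, 128), -100)
labels[0, 4] = 12345                               # original token id of one masked position
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.view(-1, vocab_size), labels.view(-1))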
BERT Model Pre-training: Masked Language Modelling
Assume the unlabeled sentence is "my dog is hairy" and that the random masking procedure chose the 4th token (which corresponds to hairy). The masking procedure is then:
1. 80% of the time: replace the word with the [MASK] token,
e.g., my dog is hairy → my dog is [MASK]
2. 10% of the time: replace the word with a random word,
e.g., my dog is hairy → my dog is apple
3. 10% of the time: keep the word unchanged,
e.g., my dog is hairy → my dog is hairy
The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability.
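The 80/10/10 procedure above can be sketched in a few lines of plain Python; the mask_tokens helper and its toy vocabulary are illustrative assumptions (real pre-training operates on WordPiece ids in batches).

# Sketch of the 80/10/10 masking procedure described above.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None where no prediction is needed."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not selected, no loss computed here
    return masked, labels

tokens = "my dog is hairy".split()
print(mask_tokens(tokens, vocab=["apple", "car", "happy"]))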
BERT Model Pre-training: Next Sentence Prediction
● Many important downstream tasks such as Question Answering (QA) and Natural Language Inference
(NLI) are based on understanding the relationship between two sentences.
● In order to train a model that understands sentence relationships, BERT is also pre-trained on a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
● Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).
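A minimal sketch of how such IsNext / NotNext pairs could be generated from a corpus. The make_nsp_example helper is hypothetical and samples single sentences, whereas the paper actually samples longer spans.

# Sketch: generating next-sentence-prediction training pairs from a corpus.
import random

def make_nsp_example(documents):
    """documents: list of documents, each a list of consecutive sentences."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"              # real next sentence
    else:
        other = random.choice(documents)
        sentence_b, label = random.choice(other), "NotNext"   # random sentence
    return sentence_a, sentence_b, label

docs = [["He went to the store.", "He bought a gallon of milk."],
        ["Penguins are flightless birds.", "They live mostly in the Southern Hemisphere."]]
print(make_nsp_example(docs))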
BERT Model Pre-training: Next Sentence Prediction
To help the model distinguish between the two sentences in training, the input is processed in the following
way before entering the model:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the
end of each sentence.
2. A sentence (segment) embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings, but with a vocabulary of size 2.
3. A positional embedding is added to each token to indicate its position in the sequence. The concept of positional embeddings comes from the Transformer paper, although BERT learns its position embeddings rather than using the fixed sinusoidal encodings.
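The three embeddings can be sketched as follows (PyTorch is an assumption; the ids and sizes are illustrative): each input position receives the sum of a token embedding, a segment (A/B) embedding, and a learned position embedding.

# Sketch: the three embeddings summed for every input position.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30522, 768, 512
token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)         # "vocabulary of 2": sentence A or B
position_emb = nn.Embedding(max_len, hidden_size)  # BERT learns its position embeddings

input_ids = torch.tensor([[101, 2026, 3899, 102, 2002, 3446, 102]])  # illustrative ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])                  # A = 0, B = 1
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)   # (1, 7, 768): one summed vector per input position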
To predict whether the second sentence is indeed connected to the first, the following steps are performed:
1. The entire input sequence goes through the Transformer encoder.
2. The final hidden state of the [CLS] token is fed into a simple classification layer.
3. A softmax over that layer's output gives the probability of IsNext versus NotNext.
● For the pre-training corpus, the authors use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).
● To generate each training input sequence, they sample two spans of text from the corpus, which they refer to as "sentences" even though they are typically much longer than single sentences (but can also be shorter).
● The first sentence receives the A embedding and the second receives the B embedding. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%.
● Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERTLARGE on 16 Cloud TPUs (64 TPU chips total). Each pre-training run took 4 days to complete.
BERT Fine Tuning
BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model:
1. Classification tasks such as sentiment analysis are handled similarly to next sentence prediction: a classification layer is added on top of the Transformer output for the [CLS] token (a minimal sketch follows below).
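A minimal fine-tuning sketch for case 1, assuming the Hugging Face transformers classes (not named in the slides); BertForSequenceClassification adds exactly such a classification layer over the [CLS] output. The optimizer loop is omitted.

# Sketch: sentiment classification by fine-tuning a classification layer on [CLS].
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("a surprisingly warm and funny film", return_tensors="pt")
labels = torch.tensor([1])                  # 1 = positive, 0 = negative (illustrative)
outputs = model(**inputs, labels=labels)    # loss + logits from the [CLS] classification head
outputs.loss.backward()                     # one fine-tuning step (optimizer omitted)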
BERT Fine Tuning
2. In question answering tasks (e.g. SQuAD v1.1), the model receives a question about a text passage and is required to mark the answer span within the passage. With BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer (see the first sketch after this list).
3. In named entity recognition (NER), the model receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. With BERT, an NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label (see the second sketch after this list).
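For the question-answering case (2) above, a simplified PyTorch sketch of the two extra vectors: a start vector and an end vector are scored against every token's final hidden state, and the highest-scoring positions mark the answer span. The tensor shapes and random hidden states are illustrative assumptions.

# Sketch: span prediction with a learned start vector and end vector.
import torch
import torch.nn as nn

hidden_size, seq_len = 768, 64
hidden_states = torch.randn(1, seq_len, hidden_size)  # encoder output for [CLS] question [SEP] passage [SEP]

start_vector = nn.Parameter(torch.randn(hidden_size))
end_vector = nn.Parameter(torch.randn(hidden_size))

start_logits = hidden_states @ start_vector   # (1, seq_len): score for "answer starts here"
end_logits = hidden_states @ end_vector       # (1, seq_len): score for "answer ends here"

start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
print(start.item(), end.item())               # predicted answer span indices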
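For the NER case (3) above, a simplified PyTorch sketch of token-level classification: one shared linear layer maps each token's output vector to entity-label logits. The label count and random hidden states are illustrative assumptions.

# Sketch: NER as per-token classification over the encoder outputs.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 9          # e.g. B-/I- tags for Person, Org, Loc, Misc, plus O
classifier = nn.Linear(hidden_size, num_labels)

hidden_states = torch.randn(1, 16, hidden_size)   # encoder output for a 16-token sentence
logits = classifier(hidden_states)                # (1, 16, num_labels)
predicted_labels = logits.argmax(dim=-1)          # one entity label per token
print(predicted_labels)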
Feature based approach with BERT
The fine-tuning approach isn't the only way to use BERT. Just like ELMo, you can use the pre-trained BERT model to create contextualized word embeddings and then feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named entity recognition.
The feature-based approach here consists of extracting the activations (also called contextual embeddings, token representations, or features) from one or more of the 12 layers, without fine-tuning any parameters of BERT.
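A minimal sketch of this feature-based usage, assuming the Hugging Face transformers API (output_hidden_states=True returns the embedding-layer output plus all 12 encoder layers); concatenating the last four layers is one of the combinations the paper evaluated for NER.

# Sketch: extracting contextual embeddings from frozen BERT as features.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()                                   # BERT's parameters are not fine-tuned

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states          # tuple of 13 tensors: embeddings + 12 layers
# e.g. concatenate the last four layers per token and feed the result to your own model
features = torch.cat(hidden_states[-4:], dim=-1)   # (1, seq_len, 4 * 768)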
[1] https://arxiv.org/abs/1810.04805 (Devlin et al., 2018, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
[2] https://www.youtube.com/playlist?list=PLam9sigHPGwOBuH4_4fr-XvDbe5uneaf6
[3] https://www.youtube.com/watch?v=-9evrZnBorM
[4] https://trishalaneeraj.github.io/2020-04-04/feature-based-approach-with-bert
[5] https://jalammar.github.io/illustrated-bert/