BERT Architecture
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Anashua Dastidar, Teaching Assistant
Introduction
BERT is a language representation model pre-trained on a very large corpus of unlabeled text using several pre-training tasks. It was proposed in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) [1].
The main idea behind BERT was to create a pre-trained language model that could be adapted to various downstream tasks in either of two ways:
1. Fine-tuning
2. A feature-based approach (using BERT's contextual embeddings as input to a separate task-specific model)
Why Bidirectional?
Some of the popular language models before the BERT paper was released were ELMo [2] and OpenAI's GPT [3]. However, the authors of the BERT paper argued that the techniques used to create these models restricted the power of the pre-trained representations, the major limitation being that they were unidirectional. They argued that such techniques are suboptimal when applied to sentence-level tasks and can be harmful when applied to token-level tasks such as question answering.
ELMo concatenates the outputs of a left-to-right model and a right-to-left model, which is a shallow approach.
How does BERT overcome this issue?
● BERT alleviates the previously mentioned unidirectionality constraint by using a pre-training task called masked language modelling (MLM). The idea is to randomly mask some tokens of the input sequence and have the model predict the original tokens based only on their context.
○ Masked language modelling enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.
● In addition to masked language modelling, a second pre-training task, next sentence prediction (NSP), is used to jointly pre-train text-pair representations.
BERT Architecture
● There are two steps in the BERT framework: the first is pre-training on a large corpus, and the second is fine-tuning for various downstream tasks.
● BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017).
● The paper presents two model sizes, BERTBASE and BERTLARGE. BERTBASE is similar in size to GPT and is used for comparisons, while BERTLARGE is used to achieve the state-of-the-art results reported in the paper.
● Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks): twelve for the Base version and twenty-four for the Large version.
● They also have a larger hidden size (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the original paper (6 encoder layers, 512 hidden units, and 8 attention heads).
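For concreteness, the two configurations can be written down in code. This is a minimal sketch, assuming the Hugging Face transformers library (the slides do not prescribe any implementation); it builds randomly initialised encoders of the right shape, without the pre-training itself.

# Sketch: BERT-Base and BERT-Large sized encoders via Hugging Face transformers
# (library choice is an assumption, not part of the original slides).
from transformers import BertConfig, BertModel

# BERT-Base: 12 encoder layers, hidden size 768, 12 attention heads (~110M parameters)
base_config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,   # feed-forward inner dimension
)

# BERT-Large: 24 encoder layers, hidden size 1024, 16 attention heads (~340M parameters)
large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
)

bert_base = BertModel(base_config)     # randomly initialised; pre-training not included
bert_large = BertModel(large_config)

print(sum(p.numel() for p in bert_base.parameters()))   # roughly 110M
print(sum(p.numel() for p in bert_large.parameters()))  # roughly 340M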
BERT Model inputs
● So that the model can handle a variety of downstream tasks, its input representation is able to unambiguously represent either a single sentence or a pair of sentences in one token sequence.
● The first token of every sequence is a special classification token, [CLS]. The final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks.
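A minimal sketch of this input format, assuming the Hugging Face transformers tokenizer (not mentioned in the slides): it shows how a sentence pair is packed into one sequence with [CLS] and [SEP], and how segment ids distinguish the two sentences.

# Sketch: packing a sentence pair into a single BERT input sequence
# (Hugging Face tokenizer is an assumption; the slides only describe the format).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is hairy", "he likes playing fetch")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', 'likes', 'playing', 'fetch', '[SEP]']
print(encoded["token_type_ids"])
# 0 for every sentence-A position, 1 for every sentence-B position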
BERT Model Pre-training: Masked Language Modelling
In order to train a deep bidirectional representation, some percentage of the input tokens are masked at random and the model must then predict the masked tokens. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard language model.
In all of their experiments, the authors mask 15% of all WordPiece tokens in each sequence at random. Although this allows them to obtain a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, they do not always replace "masked" words with the actual [MASK] token.
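As a rough illustration of the prediction step, the sketch below (PyTorch is an assumption, and this is deliberately simplified: the real BERT head adds an extra transform layer and ties its weights to the token embeddings) projects the final hidden vectors to vocabulary logits and computes the loss only at masked positions.

# Simplified sketch of a masked-LM output head (PyTorch assumed).
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522
mlm_head = nn.Linear(hidden_size, vocab_size)      # output layer over the vocabulary

# final hidden states from the encoder for a batch of sequences
hidden_states = torch.randn(2, 128, hidden_size)   # (batch, seq_len, hidden)
logits = mlm_head(hidden_states)                   # (batch, seq_len, vocab)

# loss is computed only at the masked positions; -100 marks positions to ignore
labels = torch.full((2, 128), -100)
labels[0, 4] = 12345                               # original token id of one masked position
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.view(-1, vocab_size), labels.view(-1))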
BERT Model Pre-training: Masked Language Modelling
Assume the unlabeled sentence is "my dog is hairy" and that the random masking procedure chose the 4th token (which corresponds to hairy). The masking procedure is then:
1. 80% of the time: replace the word with the [MASK] token,
e.g., my dog is hairy → my dog is [MASK]
2. 10% of the time: replace the word with a random word,
e.g., my dog is hairy → my dog is apple
3. 10% of the time: keep the word unchanged,
e.g., my dog is hairy → my dog is hairy
The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability.
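The 80/10/10 procedure above can be sketched in a few lines of plain Python; the mask_tokens helper and its toy vocabulary are illustrative assumptions (real pre-training operates on WordPiece ids in batches).

# Sketch of the 80/10/10 masking procedure described above.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None where no prediction is needed."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not selected, no loss computed here
    return masked, labels

tokens = "my dog is hairy".split()
print(mask_tokens(tokens, vocab=["apple", "car", "happy"]))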
BERT Model Pre-training: Next Sentence Prediction
● Many important downstream tasks such as Question Answering (QA) and Natural Language Inference
(NLI) are based on understanding the relationship between two sentences.
● In order to train a model that understands sentence relationships, BERT is also pre-trained on a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
● Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).
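A minimal sketch of how such IsNext / NotNext pairs could be generated from a corpus. The make_nsp_example helper is hypothetical and samples single sentences, whereas the paper actually samples longer spans.

# Sketch: generating next-sentence-prediction training pairs from a corpus.
import random

def make_nsp_example(documents):
    """documents: list of documents, each a list of consecutive sentences."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"              # real next sentence
    else:
        other = random.choice(documents)
        sentence_b, label = random.choice(other), "NotNext"   # random sentence
    return sentence_a, sentence_b, label

docs = [["He went to the store.", "He bought a gallon of milk."],
        ["Penguins are flightless birds.", "They live mostly in the Southern Hemisphere."]]
print(make_nsp_example(docs))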
BERT Model Pre-training: Next Sentence Prediction
To help the model distinguish between the two sentences in training, the input is processed in the following
way before entering the model:
1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the
end of each sentence.
2. A sentence (segment) embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings, but with a vocabulary of size 2.
3. A positional embedding is added to each token to indicate its position in the sequence. The concept of positional embeddings comes from the Transformer paper, although BERT learns its position embeddings rather than using the fixed sinusoidal encodings.
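The three embeddings can be sketched as follows (PyTorch is an assumption; the ids and sizes are illustrative): each input position receives the sum of a token embedding, a segment (A/B) embedding, and a learned position embedding.

# Sketch: the three embeddings summed for every input position.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 30522, 768, 512
token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(2, hidden_size)         # "vocabulary of 2": sentence A or B
position_emb = nn.Embedding(max_len, hidden_size)  # BERT learns its position embeddings

input_ids = torch.tensor([[101, 2026, 3899, 102, 2002, 3446, 102]])  # illustrative ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])                  # A = 0, B = 1
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(embeddings.shape)   # (1, 7, 768): one summed vector per input position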
To predict whether the second sentence is indeed connected to the first, the following steps are performed:
1. The entire input sequence goes through the Transformer encoder.
2. The final hidden state of the [CLS] token is fed into a simple classification layer.
3. A softmax over that layer's output gives the probability of IsNext versus NotNext.
● For the pre-training corpus, the authors use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).
● To generate each training input sequence, they sample two spans of text from the corpus, which they refer to as "sentences" even though they are typically much longer than single sentences (but can also be shorter).
● The first sentence receives the A embedding and the second receives the B embedding. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%.
● Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERTLARGE on 16 Cloud TPUs (64 TPU chips total). Each pre-training run took 4 days to complete.
BERT Fine Tuning
BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model:
1. Classification tasks such as sentiment analysis are handled similarly to next sentence prediction: a classification layer is added on top of the Transformer output for the [CLS] token (a minimal sketch follows below).
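A minimal fine-tuning sketch for case 1, assuming the Hugging Face transformers classes (not named in the slides); BertForSequenceClassification adds exactly such a classification layer over the [CLS] output. The optimizer loop is omitted.

# Sketch: sentiment classification by fine-tuning a classification layer on [CLS].
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("a surprisingly warm and funny film", return_tensors="pt")
labels = torch.tensor([1])                  # 1 = positive, 0 = negative (illustrative)
outputs = model(**inputs, labels=labels)    # loss + logits from the [CLS] classification head
outputs.loss.backward()                     # one fine-tuning step (optimizer omitted)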
BERT Fine Tuning
2. In question answering tasks (e.g. SQuAD v1.1), the model receives a question about a text passage and is required to mark the answer span within the passage. With BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer (see the first sketch after this list).
3. In named entity recognition (NER), the model receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. With BERT, an NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label (see the second sketch after this list).
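For the question-answering case (2) above, a simplified PyTorch sketch of the two extra vectors: a start vector and an end vector are scored against every token's final hidden state, and the highest-scoring positions mark the answer span. The tensor shapes and random hidden states are illustrative assumptions.

# Sketch: span prediction with a learned start vector and end vector.
import torch
import torch.nn as nn

hidden_size, seq_len = 768, 64
hidden_states = torch.randn(1, seq_len, hidden_size)  # encoder output for [CLS] question [SEP] passage [SEP]

start_vector = nn.Parameter(torch.randn(hidden_size))
end_vector = nn.Parameter(torch.randn(hidden_size))

start_logits = hidden_states @ start_vector   # (1, seq_len): score for "answer starts here"
end_logits = hidden_states @ end_vector       # (1, seq_len): score for "answer ends here"

start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
print(start.item(), end.item())               # predicted answer span indices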
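For the NER case (3) above, a simplified PyTorch sketch of token-level classification: one shared linear layer maps each token's output vector to entity-label logits. The label count and random hidden states are illustrative assumptions.

# Sketch: NER as per-token classification over the encoder outputs.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 9          # e.g. B-/I- tags for Person, Org, Loc, Misc, plus O
classifier = nn.Linear(hidden_size, num_labels)

hidden_states = torch.randn(1, 16, hidden_size)   # encoder output for a 16-token sentence
logits = classifier(hidden_states)                # (1, 16, num_labels)
predicted_labels = logits.argmax(dim=-1)          # one entity label per token
print(predicted_labels)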
Feature based approach with BERT
The fine-tuning approach isn't the only way to use BERT. Just like ELMo, you can use the pre-trained BERT model to create contextualized word embeddings and then feed these embeddings to your existing model, a process the paper shows yields results not far behind fine-tuning BERT on a task such as named entity recognition.
The feature-based approach here consists of extracting the activations (also called contextual embeddings, token representations, or features) from one or more of the 12 layers, without fine-tuning any parameters of BERT.
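A minimal sketch of this feature-based usage, assuming the Hugging Face transformers API (output_hidden_states=True returns the embedding-layer output plus all 12 encoder layers); concatenating the last four layers is one of the combinations the paper evaluated for NER.

# Sketch: extracting contextual embeddings from frozen BERT as features.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()                                   # BERT's parameters are not fine-tuned

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states          # tuple of 13 tensors: embeddings + 12 layers
# e.g. concatenate the last four layers per token and feed the result to your own model
features = torch.cat(hidden_states[-4:], dim=-1)   # (1, seq_len, 4 * 768)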
[1] https://arxiv.org/abs/1810.04805 (Devlin et al., 2018, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
[2] https://www.youtube.com/playlist?list=PLam9sigHPGwOBuH4_4fr-XvDbe5uneaf6
[3] https://www.youtube.com/watch?v=-9evrZnBorM
[4] https://trishalaneeraj.github.io/2020-04-04/feature-based-approach-with-bert
[5] https://jalammar.github.io/illustrated-bert/