
Preprint of Seminar II, October 17, 2023

Student in charge: Urbaneja Jesus

Supervisor: 小林 広明

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

1 Introduction

Language models are essential tools for natural language processing (NLP) tasks, as they enable machines to interpret human languages much like humans do. However, existing language models, such as OpenAI GPT and ELMo, have limitations due to their unidirectional architectures.
The objective of this paper is to introduce a new language representation model called BERT, which has proven to outperform its predecessors in various tasks such as question answering and language inference.
BERT achieves new state-of-the-art results on important benchmark collections for evaluating and analyzing natural language. This paper shows the results obtained on benchmarks such as GLUE and SQuAD v1.1 and discusses their implications.

2 Conventional NLP models

Conventional language models apply a method known as pre-training to NLP tasks. This involves training a language model on an extensive amount of text data and aims to capture the intricacies and regularities of languages. The pre-training method enables the model to produce coherent responses that align with a given context.
There are two existing strategies for applying pre-trained language representations to downstream tasks: the fine-tuning approach and the feature-based approach. The fine-tuning approach introduces minimal task-specific parameters and trains the model on the downstream task by simply adjusting all of the pre-trained parameters. The feature-based approach, on the other hand, uses task-specific architectures that include the pre-trained representations as additional features. Generally, the representations used in feature-based approaches serve to extract relevant information or features from a text; some possible applications include word frequency counts, syntactic feature analysis, and word dependency parsing.
However, the efficacy of the current techniques is constrained by their unidirectional approach. For example, OpenAI GPT employs a left-to-right architecture, where every token can only attend to previous tokens during pre-training. This can lead to a suboptimal fine-tuning phase of the model due to the lack of global context.
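As a rough illustration of the two transfer strategies described above (not code from the paper), the following sketch assumes PyTorch and the Hugging Face transformers library; the two-class task head and the learning rate are arbitrary choices for the example.

# Illustrative sketch (not from the paper): the two strategies for applying a
# pre-trained language representation to a downstream task.
import torch
from transformers import BertModel

def build(strategy="fine-tuning"):
    encoder = BertModel.from_pretrained("bert-base-uncased")
    head = torch.nn.Linear(encoder.config.hidden_size, 2)  # hypothetical 2-class task head

    if strategy == "feature-based":
        # Feature-based: the pre-trained representations are treated as fixed
        # features, so the encoder is frozen and only the small task head is trained.
        for p in encoder.parameters():
            p.requires_grad = False
        trainable = list(head.parameters())
    else:
        # Fine-tuning: all parameters (encoder and head) are updated on labeled data.
        trainable = list(encoder.parameters()) + list(head.parameters())

    optimizer = torch.optim.AdamW(trainable, lr=2e-5)
    return encoder, head, optimizer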
3 BERT

BERT introduces a new approach that processes the entire sequence of words simultaneously, making it a bidirectional model. There are two steps in its framework: pre-training and fine-tuning (Figure 1). In the pre-training phase the model is trained on unlabeled data over different tasks, and the parameters obtained are later passed to the fine-tuning phase for specialization.

Figure 1: Overall pre-training and fine-tuning procedures for BERT

3.1 Model Structure

BERT is a multi-layer bidirectional Transformer encoder. This language model builds on the Transformer, which employs an attention mechanism to learn contextual relationships among the words in a text. In the original Transformer, the encoder consists of a stack of identical layers responsible for processing the input sentence, while the decoder is responsible for generating the corresponding output; BERT itself uses only the encoder stack.
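As an illustration of this encoder-only structure, the sketch below stacks identical self-attention layers using PyTorch's generic Transformer modules; the layer count and dimensions are arbitrary and do not correspond to BERT's actual configuration.

# Illustrative encoder-only stack of identical self-attention layers (not BERT's
# own implementation or configuration).
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(
    d_model=256,          # size of each token embedding (illustrative)
    nhead=4,              # attention heads per layer (illustrative)
    dim_feedforward=1024,
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # stack of identical layers

tokens = torch.randn(1, 10, 256)   # one sequence of 10 token embeddings
contextual = encoder(tokens)       # every output position attends to all positions
print(contextual.shape)            # torch.Size([1, 10, 256])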
3.2 Input/Output Representation

First, BERT takes the text input and tokenizes the sentences into words or subwords using WordPiece tokenization. Then, each token is represented as an embedding vector that captures the characteristics of the token and is mapped to the corresponding entry in the model's vocabulary. In addition, BERT includes special tokens that are used for structural purposes rather than for semantic ones. Among them are [CLS] and [SEP]: [CLS] marks the beginning of a sequence, and [SEP] indicates the separation between sentences. Moreover, it is important to point out that additional information is added to the embeddings to capture the position of words within the text.
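The input representation described above can be inspected with a short example, assuming the Hugging Face transformers library and its pre-trained bert-base-uncased vocabulary (the two sentences are invented for illustration):

# Illustrative example of BERT's input representation using WordPiece tokenization
# and the special [CLS]/[SEP] tokens (assumes the Hugging Face transformers library).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat on the mat.", "It was very comfortable.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', ...]
print(encoded["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
# Inside the model, learned position embeddings are added to the token and segment
# embeddings to capture the position of each word in the text.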
3.3 Pre-training

BERT introduces two distinctive tasks during the pre-training phase, aimed at exploiting its bidirectional architecture. These tasks are known as MLM (Masked Language Model) and NSP (Next Sentence Prediction).

3.3.1 Masked Language Model

In order to train a bidirectional representation, a certain percentage of the input tokens in a sentence or text are randomly chosen and replaced with a special [MASK] token; hence the task is named Masked LM. The objective of the model is to predict the original tokens that were masked, based on the surrounding context within the same sentence. This task helps the model learn a deep contextualized understanding of the language by training it to predict missing words in a sentence, which requires capturing the overall context.
In this task, the output vectors at the masked positions are used to compute a probability distribution over the vocabulary, and the model outputs the word with the highest probability for each masked position.
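A simplified, self-contained sketch of the masking step is given below. It is not the authors' code: the original paper masks 15% of the tokens and additionally keeps or randomly replaces a small fraction of the selected positions, a refinement omitted here.

# Simplified Masked LM input corruption: a fraction of tokens is replaced by [MASK]
# and the model is trained to recover the original tokens at those positions.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly hide ~mask_rate of the tokens and remember the answers."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok           # the token the model must predict
            masked.append(mask_token)  # the input only sees the placeholder
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked)    # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)   # e.g. {1: 'cat'}: positions whose original tokens must be predicted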
3.3.2 Next Sentence Prediction

In this task, the model is presented with pairs of sentences and learns to predict whether the second sentence in the pair logically follows the first sentence or not. This task helps the model understand the relationships between sentences and aids in various downstream tasks like text classification and question answering.
The combination of these two tasks, MLM and NSP, during pre-training allows the model to capture both word-level and sentence-level contextual information, making it a powerful language representation model.
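A minimal sketch of how such sentence pairs can be generated is given below (illustrative code, not the authors'); the even split between actual next sentences and randomly chosen ones follows the procedure described in the original paper.

# Illustrative construction of Next Sentence Prediction training pairs: half keep
# the true next sentence ("IsNext"), the other half use a random sentence ("NotNext").
import random

def make_nsp_pairs(documents):
    """documents: a list of documents, each given as a list of sentences."""
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for first, second in zip(documents[0], documents[0][1:]) if False else []:
        pass  # placeholder removed below
    for doc in documents:
        for first, second in zip(doc, doc[1:]):
            if random.random() < 0.5:
                pairs.append((first, second, "IsNext"))  # real continuation
            else:
                pairs.append((first, random.choice(all_sentences), "NotNext"))
    return pairs

docs = [["Sentence one.", "Sentence two.", "Sentence three."]]
print(make_nsp_pairs(docs))  # e.g. [('Sentence one.', 'Sentence two.', 'IsNext'), ...]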
3.4 Fine-tuning BERT

After the initial pre-training phase, the model undergoes further training to specialize in a specific task. This specialization can be achieved by fine-tuning BERT using labeled task-specific data. For instance, the model could be trained for sentiment analysis of movie reviews using a dataset of text samples labeled with their corresponding sentiments. Furthermore, in the fine-tuning process there is the flexibility to either replace the output layer or introduce task-specific layers atop the BERT architecture, thereby allowing for a more tailored customization of the model.
The overall goal of this phase remains the optimization of the parameters, ultimately achieving remarkable performance on the specific task.
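As a rough sketch of this procedure, the example below fine-tunes BERT for sentiment classification with a task-specific classification layer on top, assuming the Hugging Face transformers library and PyTorch; the two movie reviews, their labels, and the learning rate are invented for illustration.

# Illustrative fine-tuning step: a classification head on top of pre-trained BERT,
# trained on labeled sentiment data (assumes Hugging Face transformers + PyTorch).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

reviews = ["A wonderful, moving film.", "Two hours I will never get back."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (invented examples)

model.train()
batch = tokenizer(reviews, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss computed from the [CLS] representation
outputs.loss.backward()                  # one illustrative gradient update
optimizer.step()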

4 Performance Evaluation
This section presents the fine-tuning results of BERT on two NLP benchmarks: GLUE and SQuAD v1.1. Here the paper introduces two different versions of BERT: BERT-Base and BERT-Large. BERT-Base is a modest version composed of 12 layers (12 transformer blocks) with 110 million parameters (a size similar to the OpenAI GPT architecture), while BERT-Large is a larger version with 24 layers and 340 million parameters.
4.1 Experimental Conditions

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine different NLP tasks. These tasks encompass a wide range of aspects of human language processing. The Stanford Question Answering Dataset (SQuAD v1.1), on the other hand, is a collection of 100k crowdsourced question/answer pairs. In simple terms, the aim of this task is to predict the answer text span in a passage, given a question and a Wikipedia article.
4.2 Experimental Results
Table 1: GLUE Test results
Table 2: Results of SQuAD v1.1

Table 1 shows the results of BERT-Base and BERT-Large, which exhibit remarkable performance across all tasks, surpassing all other systems by a considerable margin. They achieve average accuracy improvements of 4.5% and 7.0%, respectively, over the prior state of the art.
Table 2 shows how BERT-Large outperforms the next-best system by +1.5 F1 as an ensemble and +1.3 F1 as a single system. Here, BERT-Large beats the other models thanks to its larger architecture, which captures more detailed contextual information and allows it to perform better on tasks that require a deep understanding of language.

4.3 Importance of Pre-Training Tasks: MLM and NSP

This section examines the impact of the specific pre-training tasks on BERT's effectiveness by comparing the full model with a version trained without the NSP task and with a Left-to-Right (LTR & No NSP) model. Removing NSP significantly impairs BERT's performance on the SQuAD benchmark, as the model loses part of its ability to understand context and sentence relationships. The LTR model proves to be less effective than the MLM model across tasks, particularly struggling with token predictions in SQuAD due to the absence of right-side context. Incorporating a Bidirectional Long Short-Term Memory (BiLSTM) component on top enhances its performance, but it still falls short of the results achieved by BERT. Additionally, the paper points out that combining separate LTR and RTL models may improve results, but at the cost of doubled memory consumption and a lack of intuitiveness for tasks like question answering, while still not being as powerful as BERT.

5 Conclusions

BERT is a language model designed to capture contextual information in a bidirectional way. This bidirectional understanding of language enables BERT to excel in various NLP tasks compared to unidirectional models, whose architecture limits their ability to analyze long inputs of text.
The results presented in Section 4 suggest that even low-resource tasks can benefit from deep unidirectional architectures, while the more challenging tasks are left to bidirectional models. Nonetheless, it is important to point out that the major contribution of the paper is the generalization of these results to deep bidirectional architectures, which opens up a wide range of possible applications. This indicates that a diverse set of NLP tasks can now be carried out by the same pre-trained model.
