
Preprint of Seminar II, October 17, 2023

Student in charge: Urbaneja Jesus

Supervisor: 小林 広明

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

1 Introduction

Language models are essential tools for natural language processing (NLP) tasks, as they enable machines to interpret human languages much like humans do. However, existing language models, such as OpenAI GPT and ELMo, have limitations due to their unidirectional architectures.
The objective of this paper is to introduce a new language representation model called BERT, which has proven to outperform its predecessors in various tasks such as question answering and language inference.
BERT achieves new state-of-the-art results on important benchmark collections for evaluating and analyzing natural language. This paper shows the results obtained on benchmarks such as GLUE and SQuAD v1.1 and discusses their implications.

2 Conventional NLP models

Conventional language models apply a method known as pre-training to NLP tasks. This involves training a language model on an extensive amount of text data and aims to capture the intricacies and regularities of languages. The pre-training method enables the model to produce coherent responses that align with a given context.
There are two existing strategies for applying pre-trained language representations to downstream tasks: the fine-tuning approach and the feature-based approach. The fine-tuning approach introduces minimal task-specific parameters and trains the model on the downstream task by simply adjusting all of the pre-trained parameters. The feature-based approach, on the other hand, uses task-specific architectures that include the pre-trained representations as additional features. Generally, the representations used in feature-based approaches serve to extract relevant information or features from a text; some possible applications include word frequency counts, syntactic feature analysis, and word dependency parsing.
However, the efficacy of the current techniques is constrained by their unidirectional approach. For example, OpenAI GPT employs a left-to-right architecture, where every token can only attend to previous tokens during pre-training. This can lead to a suboptimal fine-tuning phase of the model due to the lack of global context.
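As a rough illustration of the two transfer strategies described above (not code from the paper), the following sketch assumes PyTorch and the Hugging Face transformers library; the two-class task head and the learning rate are arbitrary choices for the example.

# Illustrative sketch (not from the paper): the two strategies for applying a
# pre-trained language representation to a downstream task.
import torch
from transformers import BertModel

def build(strategy="fine-tuning"):
    encoder = BertModel.from_pretrained("bert-base-uncased")
    head = torch.nn.Linear(encoder.config.hidden_size, 2)  # hypothetical 2-class task head

    if strategy == "feature-based":
        # Feature-based: the pre-trained representations are treated as fixed
        # features, so the encoder is frozen and only the small task head is trained.
        for p in encoder.parameters():
            p.requires_grad = False
        trainable = list(head.parameters())
    else:
        # Fine-tuning: all parameters (encoder and head) are updated on labeled data.
        trainable = list(encoder.parameters()) + list(head.parameters())

    optimizer = torch.optim.AdamW(trainable, lr=2e-5)
    return encoder, head, optimizer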
3 BERT

BERT introduces a new approach that processes the entire sequence of words simultaneously, making it a bidirectional model. There are two steps in its framework: pre-training and fine-tuning (Figure 1). In the pre-training phase the model is trained on unlabeled data over different tasks, and the parameters obtained are later passed to the fine-tuning phase for specialization.

Figure 1: Overall pre-training and fine-tuning procedures for BERT

3.1 Model Structure

BERT is a multi-layer bidirectional Transformer encoder. This language model builds on the Transformer, which employs an attention mechanism to learn contextual relationships among the words in a text. In the original Transformer, the encoder consists of a stack of identical layers responsible for processing the input sentence, while the decoder is responsible for generating the corresponding output; BERT itself uses only the encoder stack.
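As an illustration of this encoder-only structure, the sketch below stacks identical self-attention layers using PyTorch's generic Transformer modules; the layer count and dimensions are arbitrary and do not correspond to BERT's actual configuration.

# Illustrative encoder-only stack of identical self-attention layers (not BERT's
# own implementation or configuration).
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(
    d_model=256,          # size of each token embedding (illustrative)
    nhead=4,              # attention heads per layer (illustrative)
    dim_feedforward=1024,
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # stack of identical layers

tokens = torch.randn(1, 10, 256)   # one sequence of 10 token embeddings
contextual = encoder(tokens)       # every output position attends to all positions
print(contextual.shape)            # torch.Size([1, 10, 256])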
3.2 Input/Output Representation

First, BERT takes the text input and tokenizes the sentences into words or subwords using WordPiece tokenization. Then, each token is represented as an embedding vector that captures the characteristics of the token and is mapped to the corresponding entry in the model's vocabulary. In addition, BERT includes special tokens that are used for structural purposes rather than for semantic ones. Among them are [CLS] and [SEP]: [CLS] marks the beginning of a sequence, and [SEP] indicates the separation between sentences. Moreover, it is important to point out that additional information is added to the embeddings to capture the position of words within the text.
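The input representation described above can be inspected with a short example, assuming the Hugging Face transformers library and its pre-trained bert-base-uncased vocabulary (the two sentences are invented for illustration):

# Illustrative example of BERT's input representation using WordPiece tokenization
# and the special [CLS]/[SEP] tokens (assumes the Hugging Face transformers library).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat on the mat.", "It was very comfortable.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', ...]
print(encoded["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
# Inside the model, learned position embeddings are added to the token and segment
# embeddings to capture the position of each word in the text.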
3.3 Pre-training

BERT introduces two distinctive tasks during the pre-training phase, aimed at exploiting its bidirectional architecture. These tasks are known as MLM (Masked Language Model) and NSP (Next Sentence Prediction).

3.3.1 Masked Language Model

In order to train a bidirectional representation, a certain percentage of the input tokens in a sentence or text are randomly chosen and replaced with a special [MASK] token; hence the task is named Masked LM. The objective of the model is to predict the original tokens that were masked, based on the surrounding context within the same sentence. This task helps the model learn a deep contextualized understanding of the language by training it to predict missing words in a sentence, which requires capturing the overall context.
In this task, the output vectors at the masked positions are used to compute a probability distribution over the vocabulary, and the model outputs the word with the highest probability for each masked position.
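A simplified, self-contained sketch of the masking step is given below. It is not the authors' code: the original paper masks 15% of the tokens and additionally keeps or randomly replaces a small fraction of the selected positions, a refinement omitted here.

# Simplified Masked LM input corruption: a fraction of tokens is replaced by [MASK]
# and the model is trained to recover the original tokens at those positions.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly hide ~mask_rate of the tokens and remember the answers."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok           # the token the model must predict
            masked.append(mask_token)  # the input only sees the placeholder
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked)    # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)   # e.g. {1: 'cat'}: positions whose original tokens must be predicted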
3.3.2 Next Sentence Prediction

In this task, the model is presented with pairs of sentences and learns to predict whether the second sentence in the pair logically follows the first sentence or not. This task helps the model understand the relationships between sentences and aids in various downstream tasks like text classification and question answering.
The combination of these two tasks, MLM and NSP, during pre-training allows the model to capture both word-level and sentence-level contextual information, making it a powerful language representation model.
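A minimal sketch of how such sentence pairs can be generated is given below (illustrative code, not the authors'); the even split between actual next sentences and randomly chosen ones follows the procedure described in the original paper.

# Illustrative construction of Next Sentence Prediction training pairs: half keep
# the true next sentence ("IsNext"), the other half use a random sentence ("NotNext").
import random

def make_nsp_pairs(documents):
    """documents: a list of documents, each given as a list of sentences."""
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for first, second in zip(documents[0], documents[0][1:]) if False else []:
        pass  # placeholder removed below
    for doc in documents:
        for first, second in zip(doc, doc[1:]):
            if random.random() < 0.5:
                pairs.append((first, second, "IsNext"))  # real continuation
            else:
                pairs.append((first, random.choice(all_sentences), "NotNext"))
    return pairs

docs = [["Sentence one.", "Sentence two.", "Sentence three."]]
print(make_nsp_pairs(docs))  # e.g. [('Sentence one.', 'Sentence two.', 'IsNext'), ...]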
3.4 Fine-tuning BERT

After the initial pre-training phase, the model undergoes further training to specialize in a specific task. This specialization can be achieved by fine-tuning BERT using labeled task-specific data. For instance, the model could be trained for sentiment analysis of movie reviews using a dataset of text samples labeled with their corresponding sentiments. Furthermore, in the fine-tuning process there is the flexibility to either replace the output layer or introduce task-specific layers atop the BERT architecture, thereby allowing for a more tailored customization of the model.
The overall goal of this phase remains the optimization of the parameters, ultimately achieving remarkable performance on the specific task.
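As a rough sketch of this procedure, the example below fine-tunes BERT for sentiment classification with a task-specific classification layer on top, assuming the Hugging Face transformers library and PyTorch; the two movie reviews, their labels, and the learning rate are invented for illustration.

# Illustrative fine-tuning step: a classification head on top of pre-trained BERT,
# trained on labeled sentiment data (assumes Hugging Face transformers + PyTorch).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

reviews = ["A wonderful, moving film.", "Two hours I will never get back."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (invented examples)

model.train()
batch = tokenizer(reviews, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss computed from the [CLS] representation
outputs.loss.backward()                  # one illustrative gradient update
optimizer.step()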

4 Performance Evaluation
This section presents the fine-tuning results of BERT on two NLP benchmarks: GLUE and SQuAD v1.1. Here the paper introduces two different versions of BERT: BERT-Base and BERT-Large. BERT-Base is a modest version composed of 12 layers (12 transformer blocks) with 110 million parameters (a size similar to the OpenAI GPT architecture), while BERT-Large is a larger version with 24 layers and 340 million parameters.
4.1 Experimental Conditions

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine different NLP tasks. These tasks encompass a wide range of aspects of human language processing. The Stanford Question Answering Dataset (SQuAD v1.1), on the other hand, is a collection of 100k crowdsourced question/answer pairs. In simple terms, the aim of this task is to predict the answer text span in a passage, given a question and a Wikipedia article.
4.2 Experimental Results
Table 1: GLUE Test results
Table 2: Results of SQuAD v1.1

Table 1 shows the results of BERT-Base and BERT-Large, which exhibit remarkable performance across all tasks, surpassing all other systems by a considerable margin. They achieve average accuracy improvements of 4.5% and 7.0%, respectively, over the prior state of the art.
Table 2 shows how BERT-Large outperforms the next-best system by +1.5 F1 as an ensemble and +1.3 F1 as a single system. Here, BERT-Large beats the other models thanks to its larger architecture, which captures more detailed contextual information and allows it to perform better on tasks that require a deep understanding of language.

4.3 Importance of Pre-Training Tasks: MLM and NSP

This section examines the impact of the specific pre-training tasks on BERT's effectiveness by comparing the full model with a version trained without the NSP task and with a Left-to-Right (LTR & No NSP) model. Removing NSP significantly impairs BERT's performance on the SQuAD benchmark, as the model loses part of its ability to understand context and sentence relationships. The LTR model proves to be less effective than the MLM model across tasks, particularly struggling with token predictions in SQuAD due to the absence of right-side context. Incorporating a Bidirectional Long Short-Term Memory (BiLSTM) component on top enhances its performance, but it still falls short of the results achieved by BERT. Additionally, the paper points out that combining separate LTR and RTL models may improve results, but at the cost of doubled memory consumption and a lack of intuitiveness for tasks like question answering, while still not being as powerful as BERT.

5 Conclusions

BERT is a language model designed to capture contextual information in a bidirectional way. This bidirectional understanding of language enables BERT to excel in various NLP tasks compared to unidirectional models, whose architecture limits their ability to analyze long inputs of text.
The results presented in Section 4 suggest that even low-resource tasks can benefit from deep unidirectional architectures, while the more challenging tasks are left to bidirectional models. Nonetheless, it is important to point out that the major contribution of the paper is the generalization of these results to deep bidirectional architectures, which opens up a wide range of possible applications. This indicates that a diverse set of NLP tasks can now be carried out by the same pre-trained model.
