4 Performance Evaluation
This section presents the fine-tuning results of BERT on two NLP benchmarks: GLUE and SQuAD v1.1. It also introduces the two versions of the model that are evaluated: BERTBASE and BERTLARGE. BERTBASE is the more modest version, composed of 12 layers (12 transformer blocks) and 110 million parameters (comparable in size to the OpenAI GPT architecture), while BERTLARGE is a larger version with 24 layers and 340 million parameters.
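To make the two configurations concrete, here is a minimal sketch that builds both model sizes with the Hugging Face transformers library and reports their parameter counts. The hyperparameter values (hidden size, attention heads, feed-forward size) are the ones given in the original BERT paper; the variable names are illustrative only.

```python
# Sketch: instantiate BERT-Base and BERT-Large sized encoders and count parameters.
# Assumes the Hugging Face `transformers` library; values follow the original paper
# (BASE: L=12, H=768, A=12; LARGE: L=24, H=1024, A=16).
from transformers import BertConfig, BertModel

configs = {
    "BERTBASE": BertConfig(
        num_hidden_layers=12, hidden_size=768,
        num_attention_heads=12, intermediate_size=3072,
    ),
    "BERTLARGE": BertConfig(
        num_hidden_layers=24, hidden_size=1024,
        num_attention_heads=16, intermediate_size=4096,
    ),
}

for name, cfg in configs.items():
    model = BertModel(cfg)  # randomly initialized; no pre-trained weights are loaded here
    # Prints counts close to the 110M / 340M figures quoted above
    # (the exact number depends on which output heads are included).
    print(f"{name}: {model.num_parameters() / 1e6:.0f}M parameters")
```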
4.1 Experimental Conditions
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine different NLP tasks that together cover a wide range of aspects of human language processing. The Stanford Question Answering Dataset (SQuAD v1.1), on the other hand, is a collection of 100k crowdsourced question/answer pairs. In simple terms, the aim of this task is to predict the text span within a Wikipedia passage that answers a given question.
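To make the span-prediction setup concrete, the sketch below shows how a BERT model with a question-answering head selects an answer span. It assumes the Hugging Face transformers library and the publicly released bert-large-uncased-whole-word-masking-finetuned-squad checkpoint; the question and passage are invented examples, not taken from the dataset.

```python
# Sketch: extracting an answer span with a BERT question-answering head.
# Assumes the Hugging Face `transformers` library and a SQuAD-fine-tuned checkpoint.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertForQuestionAnswering.from_pretrained(checkpoint)

question = "Where is the Eiffel Tower located?"
passage = "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France."

# BERT receives the question and the passage packed into a single input sequence.
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model scores, for every token, how likely it is to start or end the answer.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
answer_ids = inputs["input_ids"][0, start : end + 1]
print(tokenizer.decode(answer_ids))  # expected to print a span such as "paris, france"
```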
4.2 Experimental Results
Table 1 shows the GLUE results of BERTBASE and BERTLARGE, which exhibit remarkable performance across all tasks and surpass all other systems by a considerable margin, achieving average accuracy improvements of 4.5% and 7.0%, respectively, over the prior state of the art.
Table 2 shows that on SQuAD, BERTLARGE outperforms the next-best system by +1.5 F1 as an ensemble and by +1.3 F1 as a single system. BERTLARGE beats the other models thanks to its larger architecture, which captures more detailed contextual information and allows it to perform better on tasks that require a deep understanding of language.
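Since the SQuAD comparison is reported in F1, the short sketch below illustrates how the token-overlap F1 between a predicted answer and a gold answer is typically computed. It is a simplified version offered only to make the metric concrete; the official SQuAD evaluation additionally lower-cases and strips punctuation and articles before comparing.

```python
# Sketch: token-overlap F1 between a predicted and a gold answer span,
# the metric used in the SQuAD comparison above (normalization omitted).
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("in Paris France", "Paris France"))  # 0.8
```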
4.3 Importance of Pre-Training Tasks: MLM and NSP
This section examines the impact of the individual pre-training tasks on BERT's effectiveness by comparing the full model against a version trained without the NSP task and against a Left-to-Right (LTR & No NSP) model. Removing NSP significantly impairs BERT's performance on the SQuAD benchmark, since the model loses part of its ability to capture relationships between sentences. The LTR model proves less effective than the MLM model across tasks, struggling in particular with token-level predictions in SQuAD because it has no access to right-side context. Adding a Bidirectional Long Short-Term Memory (BiLSTM) component on top improves its performance, but the results still fall short of those achieved by BERT. The paper also points out that combining separate LTR and RTL models could improve results, but this doubles memory consumption, is unintuitive for tasks like question answering, and is still not as powerful as BERT.
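For context on what is being ablated, here is a hedged sketch of the two pre-training objectives side by side, using the Hugging Face BertForPreTraining head, which exposes both the masked-token (MLM) logits and the is-next-sentence (NSP) logits. The sentence pair is an invented example.

```python
# Sketch: the two pre-training objectives discussed above. BertForPreTraining
# exposes both heads: `prediction_logits` for the masked language model (MLM)
# and `seq_relationship_logits` for next sentence prediction (NSP).
import torch
from transformers import BertTokenizerFast, BertForPreTraining

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

sentence_a = "The cat sat on the [MASK]."
sentence_b = "It then fell asleep in the sun."  # in real pre-training, 50% of pairs use a random sentence

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# MLM head: predict the masked token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.prediction_logits[0, mask_pos].argmax().item()
print("MLM fills [MASK] with:", tokenizer.decode([predicted_id]))

# NSP head: index 0 means "B follows A", index 1 means "B is a random sentence".
print("NSP says B follows A:", outputs.seq_relationship_logits.argmax().item() == 0)
```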
5 Conclusions
BERT is a language model designed to capture contextual information in a bidirectional way. This bidirectional understanding of language enables BERT to excel at a wide range of NLP tasks compared to unidirectional models, whose architecture limits their ability to analyze long inputs of text.
The results presented in section 4 suggest that low-resource tasks may benefit from deep unidirectional architectures, while the more challenging tasks are better left to bidirectional models. Nonetheless, it is important to point out that the major contribution of the paper is the generalization of these results to deep bidirectional architectures, which opens up a wide range of possible applications: a diverse set of NLP tasks can now be carried out by the same pre-trained model.