Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining
August 8, 2023
Abstract
Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP)
that focuses on identifying and extracting subjective information from textual material. This can include
determining the overall sentiment of a piece of text (e.g., positive or negative) as well as identifying the
specific emotions or opinions expressed in it, and it typically relies on advanced machine learning and
deep learning techniques. Recently, transformer-based language models have made this task of human
emotion analysis more intuitive, thanks to the attention mechanism and parallel computation. These
advantages make such models very powerful on linguistic tasks, unlike recurrent neural networks, which
spend a lot of time on sequential processing and are therefore prone to fail when processing long text.
The scope of our paper is to study the behaviour of cutting-edge transformer-based language models on
opinion mining and to provide a high-level comparison between them that highlights their key particularities.
Additionally, our comparative study gives production engineers leads on which approach to focus on, and
provides researchers with guidelines for future research subjects.
1 Introduction
Over the past few years, interest in natural language processing (NLP) [1] has increased significantly. Today, several
applications are investing massively in this new technology, such as extending recommender systems [2], [3], uncovering
new insights in the health industry [4], [5], and unraveling e-reputation and opinion mining [6], [7]. Opinion mining
is an approach to computational linguistics and NLP that automatically identifies the emotional tone, sentiment, or
thoughts behind a body of text. As a result, it plays a vital role in driving business decisions in many industries.
However, achieving customer satisfaction is costly. Mining user feedback about the products offered is the most
accurate way to adapt strategies and future business plans. In recent years, opinion mining has seen considerable
progress, with applications in social media and review websites. Recommendations may be staff-oriented [2] or
user-oriented [8] and should be tailored to meet customer needs and behaviors.
Nowadays, analyzing people’s emotions has become more intuitive thanks to the availability of many large pre-trained
language models such as bidirectional encoder representations from transformers (BERT) [9] and its variants. These
models use the seminal transformer architecture [10], which is based solely on attention mechanisms, to build robust
language models for a variety of semantic tasks, including text classification. Moreover, there has been a surge in
opinion mining text datasets, specifically designed to challenge NLP models and enhance their performance. These
datasets are aimed at enabling models to imitate or even exceed human level performance, while introducing more
complex features.
Even though many papers have addressed NLP topics for opinion mining using high-performance deep learning models,
it is still challenging to determine their performance concretely and accurately due to variations in technical environments
and datasets. To address these issues, our paper studies the behaviour of cutting-edge transformer-based models on
textual material and reveals their differences. It focuses on applying both transformer encoders and decoders, such as
BERT [9] and the generative pre-trained transformer (GPT) [11], respectively, and their improvements, on a benchmark
dataset. This enables a credible assessment of their performance and an understanding of their advantages, allowing
subject matter experts to clearly rank the models. Furthermore, through ablations, we show the impact of configuration
choices on the final results.
2 Background
2.1 Transformer
The transformer [10], as illustrated in Figure 1, is an encoder-decoder model that dispenses entirely with recurrence and
convolutions. Instead, it leverages the attention mechanism to compute high-level contextualized embeddings. Being the
first model to rely solely on attention mechanisms, it addresses the issues commonly associated with recurrent neural
networks, which factor computation along the symbol positions of the input and output sequences and therefore preclude
parallelization within samples. In contrast, the transformer is highly parallelizable and requires significantly less
time to train. In the upcoming sections, we highlight the recent breakthroughs in NLP built on the transformer that
changed the field overnight, such as BERT [9] and its improvements.
2.2 BERT
BERT [9] is pre-trained using a combination of Masked Language Modeling (MLM) and Next Sentence Prediction
(NSP) objectives. It provides high-level contextualized embeddings grasping the meaning of words in different contexts
through global attention. As a result, the pre-trained BERT model can be fine-tuned for a wide range of downstream
tasks, such as question answering and text classification, without substantial task-specific architecture modifications.
BERT and its variants allow the training of modern data-intensive models. Moreover, they are able to capture the
contextual meaning of each piece of text in a way that traditional language models are unfit to do, while being quicker
to develop and yielding better results with less data. On the other hand, BERT and other large neural language models
are very expensive and computationally intensive to train, fine-tune, and run inference with.
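For illustration, the snippet below is a minimal sketch (not the paper's code) of how a pre-trained BERT checkpoint can be equipped with a classification head through the Hugging Face transformers API; the checkpoint name and example sentence are placeholders.

# Minimal sketch: a pre-trained BERT encoder with a sequence classification head,
# fine-tunable for binary sentiment without task-specific architecture changes.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary sentiment: negative / positive
)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
outputs = model(**inputs)               # logits over the two sentiment classes
print(outputs.logits.softmax(dim=-1))   # class probabilities (untrained head)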
2.3 GPT

GPT [11] is the first causal (autoregressive) transformer-based model, pre-trained using language modeling on a large
corpus with long-range dependencies. Its bigger and optimized successor, GPT-2 [12], was pre-trained on WebText.
Likewise, GPT-3 [13] is architecturally similar to its predecessors; its higher accuracy is attributed to its increased
capacity and greater number of parameters, and it was pre-trained on Common Crawl. The OpenAI GPT family has taken
pre-trained language models by storm: these models are very powerful at realistic human text generation and many other
miscellaneous NLP tasks. A small amount of input text can therefore be used to generate a large amount of high-quality
text, while maintaining semantic and syntactic understanding of each word.
2.4 ALBERT
A Lite BERT (ALBERT) [14] was proposed to address the problems associated with large models. It was specifically
designed to provide contextualized natural language representations that improve results on downstream tasks.
Increasing model size to pre-train such representations, however, becomes harder due to memory limitations and longer
training times, and ALBERT was introduced precisely to overcome these obstacles.
ALBERT is a lighter version of BERT, in which next sentence prediction (NSP) is replaced by sentence order prediction
(SOP). In addition to that, it employs two parameter-reduction techniques to reduce memory consumption and improve
training time of BERT without hurting performance:
• Splitting the embedding matrix into two smaller matrices: ALBERT decouples the hidden layer size from the
vocabulary embedding size by factorizing the vocabulary embedding matrix, so the hidden size can grow without a
large increase in parameters (see the sketch after this list).
• Sharing parameters across layers split among groups, which prevents the parameter count from growing with the
depth of the network.
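As an illustration of the first technique, the following is a conceptual sketch (not ALBERT's actual implementation) of factorized embedding parameterization in PyTorch, using the ALBERT-base sizes E=128 and H=768 as assumed values.

# Conceptual sketch of ALBERT's factorized embedding parameterization.
import torch
import torch.nn as nn

vocab_size, E, H = 30000, 128, 768            # vocabulary, embedding size, hidden size

# BERT-style embedding: vocab_size x H parameters.
bert_embedding = nn.Embedding(vocab_size, H)

# ALBERT-style: vocab_size x E + E x H parameters (much smaller when E << H).
albert_embedding = nn.Sequential(
    nn.Embedding(vocab_size, E),              # token id -> small embedding
    nn.Linear(E, H, bias=False),              # project up to the hidden size
)

token_ids = torch.tensor([[101, 2023, 2003, 102]])
print(bert_embedding(token_ids).shape)        # torch.Size([1, 4, 768])
print(albert_embedding(token_ids).shape)      # torch.Size([1, 4, 768])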
2.5 RoBERTa
The choice of language model hyper-parameters has a substantial impact on the final results. Hence, the robustly optimized
BERT pre-training approach (RoBERTa) [15] was introduced to investigate the impact of many key hyper-parameters,
along with data size, on model performance. RoBERTa is based on Google's BERT [9] model and modifies key
hyper-parameters: the masked language modeling objective is made dynamic and the NSP objective is removed. It is
an improved version of BERT, pre-trained with much larger mini-batches and learning rates on a large corpus using
self-supervised learning.
2.6 XLNET
The bidirectional property of transformer encoders, such as BERT [9], helps them achieve better performance than
autoregressive language modeling approaches. Nevertheless, BERT ignores the dependencies between masked positions
and suffers from a pretrain-finetune discrepancy because it relies on corrupting the input with masks. In view of these
pros and cons, XLNet [16] was proposed. XLNet is a generalized autoregressive pretraining approach that learns
bidirectional dependencies by maximizing the expected likelihood over all permutations of the factorization order.
Furthermore, it overcomes the drawbacks of BERT [9] thanks to its causal, autoregressive formulation, inspired by
Transformer-XL [17].
2.7 DistilBERT
Unfortunately, the outstanding performance of large-scale pretrained models does not come cheap. Operating them on
edge devices under constrained computational training or inference budgets remains challenging. Against this backdrop,
DistilBERT [18] (distilled BERT) was introduced to address these issues by leveraging knowledge distillation [19].
DistilBERT is similar to BERT, but smaller, faster, and cheaper. It has 40% fewer parameters than BERT base, runs
40% faster, and preserves over 95% of BERT's performance. It is trained by distilling the pretrained BERT base model.
2.8 XLM-RoBERTa
Pre-trained multilingual models at scale, such as multilingual BERT (mBERT) [9] and cross-lingual language models
(XLMs) [20], have led to considerable performance improvements for a wide variety of cross-lingual transfer tasks,
including question answering, sequence labeling, and classification. The multilingual version of RoBERTa [15], called
XLM-RoBERTa [21] and pre-trained on the newly created 2.5TB multilingual CommonCrawl corpus covering 100
different languages, has pushed the performance even further, with strong improvements on low-resource languages
compared to previous multilingual models.
2.9 BART
The bidirectional and auto-regressive transformer (BART) [22] is a generalization of BERT [9] and GPT [11] that builds
on the standard transformer [10]. Concretely, it uses a bidirectional encoder and a left-to-right decoder. It is trained by
corrupting text with an arbitrary noising function and learning a model to reconstruct the original text. BART
has shown phenomenal success when fine-tuned on text generation tasks such as translation, but also performs well for
comprehension tasks like question answering and classification.
2.10 ConvBERT
While BERT [9] and its variants have recently achieved impressive performance gains on many NLP tasks compared
to previous models, BERT suffers from a large computation cost and memory footprint due to its reliance on the global
self-attention block. Despite all its attention heads, BERT was found to be computationally redundant, since some
heads only need to learn local dependencies. Therefore, ConvBERT [23] improves on BERT [9] by replacing some
self-attention blocks with new mixed blocks that leverage convolutions to better model global and local context.
2.11 Reformer
Large transformer [10] models consistently achieve state-of-the-art results on a wide variety of linguistic tasks, but
training them on long sequences is costly and challenging. To address this issue, the Reformer [24] was introduced to
improve the efficiency of transformers while retaining their high performance and smooth training. Reformer is more
efficient than the transformer [10] thanks to locality-sensitive hashing attention, reversible residual layers instead of
standard residuals, axial position encodings, and other optimizations.
2.12 T5
Transfer learning has emerged as one of the most influential techniques in NLP. Its efficiency in transferring knowledge
to downstream tasks through fine-tuning has given birth to a range of innovative approaches. One of these approaches
is transfer learning with a unified text-to-text transformer (T5) [25], which consists of a bidirectional encoder and a
left-to-right decoder. This approach reshapes the transfer learning landscape by being pre-trained on a combination of
unsupervised and supervised tasks and by reframing every NLP task into a text-to-text format.
2.13 ELECTRA
Masked language modeling (MLM) approaches like BERT [9] have proven effective when transferred to downstream
NLP tasks, but they are expensive and require large amounts of compute. Efficiently learning an encoder that classifies
token replacements accurately (ELECTRA) [26] is a new pre-training approach that aims to overcome
these computation problems by training two Transformer models: the generator and the discriminator. ELECTRA trains
on a replaced token detection objective, using the discriminator to identify which tokens were replaced by the generator
in the sequences. Unlike MLM-based models, ELECTRA is defined over all input tokens rather than just a small subset
that was masked, making it a more efficient pre-training approach.
2.14 Longformer
While previous transformers focused on changing the pre-training methods, the long-document transformer
(Longformer) [27] changes the transformer's self-attention mechanism itself. It has become a de facto standard for
tackling a wide range of complex NLP tasks, with a new attention mechanism that scales linearly with sequence length
and can therefore easily process longer sequences. Longformer's attention mechanism is a drop-in replacement for the
standard self-attention and combines a local windowed attention with a task-motivated global attention. In short, it
replaces the transformer's [10] attention matrices with sparse matrices for higher training efficiency.
2.15 DeBERTa
DeBERTa [28] stands for decoding-enhanced BERT with disentangled attention. It is a pre-training approach that
extends Google's BERT [9] and builds on RoBERTa [15]. Despite being trained on only half of the data used for
RoBERTa, DeBERTa improves the efficiency of pre-trained models through two novel techniques:
• Disentangled attention (DA): an attention mechanism that computes the attention weights among words using
disentangled matrices based on two vectors that encode the content and the relative position of each word,
respectively.
• Enhanced mask decoder (EMD): a pre-training technique that replaces the output softmax layer and incorporates
absolute positions in the decoding layer to predict masked tokens during pre-training.
3 Approach
Transformer-based pre-trained language models have led to substantial performance gains, but carefully comparing
different approaches is challenging. Therefore, we extend our study to uncover insights regarding their fine-tuning
process and main characteristics. Our paper first studies the behavior of these models from two perspectives: a
data-centric view focusing on the data state and quality, and a model-centric view giving more attention to the model
tweaks. We examine how data processing affects their performance and how the adjustments and improvements made
to the models over time change their performance. We then conclude with takeaways regarding the optimal setup for
cross-validating a transformer-based model, in particular the tuning hyper-parameters and data quality.
3.1 Models

In this section, we present the details of the base versions of the models introduced previously, as shown in Table A1.
We aim to provide a fair comparison based on the following criteria: L, the number of transformer layers; H, the hidden
state size or model dimension; A, the number of attention heads; the total number of parameters; the tokenization
algorithm; the data used for pre-training; the training devices and computational cost; the training objectives; the tasks
on which the model performs well; and a short description of the model's key points [29]. All this information helps in
understanding the performance and behavior of the different transformer-based models and in making the appropriate
choice depending on the task and resources.
3.2 Configuration
It should be noted that we used almost the same architectural building blocks for all our implemented models, as shown
in Figure 2 and Figure 3 for encoder-based and decoder-based models, respectively. In contrast, seq2seq models like
BART are simply a bidirectional encoder followed by an autoregressive decoder. Each model is fed with the three
required inputs, namely input ids, token type ids, and attention mask. However, for some models, the position
embeddings are optional and can sometimes be ignored entirely (e.g., RoBERTa); for this reason, we have blurred them
slightly in the figures. Furthermore, we lowercased the dataset and tokenized it with tokenizers based on the WordPiece
[30], SentencePiece [31], and byte-pair encoding [32] algorithms.
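For illustration, a minimal sketch of these three inputs using a WordPiece-based Hugging Face tokenizer is shown below; the checkpoint name, example sentence, and printed slices are illustrative placeholders rather than our exact preprocessing code.

# Illustrative sketch of the three model inputs mentioned above, using a
# WordPiece-based tokenizer (other checkpoints use SentencePiece or BPE).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # lowercases internally
encoded = tokenizer(
    "A masterpiece of modern cinema!",
    padding="max_length",
    truncation=True,
    max_length=384,
)
print(encoded["input_ids"][:10])       # token ids of the review
print(encoded["token_type_ids"][:10])  # segment ids (all zeros for a single sequence)
print(encoded["attention_mask"][:10])  # 1 for real tokens, 0 for padding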
In our experiments, we used a highly optimized setup relying only on the base version of each pre-trained language model.
For training and validation, we set batch sizes of 8 and 4, respectively, and fine-tuned the models for 4 epochs over the
data with a maximum sequence length of 384, chosen to match the length of the majority of reviews and our computational
capabilities. The AdamW optimizer is used to optimize the models with a learning rate of 3e-5, and the epsilon (eps) used
to improve numerical stability is set to 1e-6, its default value. Furthermore, the weight decay is set to 0.001, while bias,
LayerNorm.bias, and LayerNorm.weight are excluded from weight decay during fine-tuning by setting their decay to
0.000. We implemented all of our models using PyTorch and the transformers library from Hugging Face, and ran them
on an NVIDIA Tesla P100-PCIE GPU (Persistence-M, 51G GPU RAM).
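The following is a condensed sketch of this optimization setup, assuming the Hugging Face transformers and PyTorch APIs; it is not the authors' exact training script, and the model checkpoint is a placeholder.

# Sketch of the fine-tuning setup described above: AdamW with lr=3e-5, eps=1e-6,
# weight decay 0.001, excluding bias and LayerNorm parameters from the decay.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.001,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(grouped_parameters, lr=3e-5, eps=1e-6)
# Training loop skeleton (omitted): batch size 8 for training, 4 for validation,
# 4 epochs, sequences truncated/padded to 384 tokens as in the tokenization sketch.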
3.3 Evaluation
Dataset. To fine-tune our models, we used the IMDb movie review dataset [33], a binary sentiment classification dataset
of 50K highly polar movie reviews, labelled in a balanced way between positive and negative. We chose it for our study
because it is often used in research and is a very popular resource for researchers working on NLP and ML tasks,
particularly sentiment analysis and text classification, due to its accessibility, size, balance, and pre-processing. In other
words, it is easily accessible and widely available, with over 50K well-balanced reviews containing an equal number of
positive and negative examples, as shown in Figure 4. This helps prevent biases in the trained model. Additionally, it has
already been pre-processed, with the text of each review cleaned and normalized.
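For reference, a minimal sketch of loading the dataset with the Hugging Face datasets library is shown below; this is one possible access path, the original release by Maas et al. [33] can also be used directly.

# Loading the IMDb reviews dataset (sketch using the Hugging Face `datasets` library).
from datasets import load_dataset

imdb = load_dataset("imdb")  # 25k train / 25k test reviews, labels 0=negative, 1=positive
print(imdb)
print(imdb["train"][0]["text"][:200], imdb["train"][0]["label"])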
Metrics. To assess the performance of the fine-tuned transformers on the IMDb movie reviews dataset, tracking the loss
and accuracy learning curves for each model is an effective method. These curves can help detect incorrect predictions
and potential overfitting, which are crucial factors to consider in the evaluation process. Moreover, widely-used metrics,
namely accuracy, recall, precision, and F1-score are valuable to consider when dealing with classification problems.
These metrics can be defined as:
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \text{and} \quad
F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\tag{1}
\]
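These metrics can be computed from the validation predictions, for example with scikit-learn, as in the following sketch; the label vectors here are toy placeholders.

# Computing the reported metrics from predictions (sketch using scikit-learn).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1]  # gold sentiment labels
y_pred = [1, 0, 1, 0, 0, 1]  # model predictions

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
accuracy = accuracy_score(y_true, y_pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} Acc={accuracy:.3f}")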
Table 1: Transformer-based language models validation performance on the opinion mining IMDb dataset
Model Recall Precision F1 Accuracy
BERT 93.9 94.3 94.1 94.0
GPT 92.4 51.8 66.4 53.2
GPT-2 51.1 54.8 52.9 54.5
ALBERT 94.1 91.9 93.0 93.0
RoBERTa 96.0 94.6 95.3 95.3
XLNet 94.7 95.1 94.9 94.8
DistilBERT 94.3 92.7 93.5 93.4
XLM-RoBERTA 83.1 71.7 77.0 75.2
BART 96.0 93.3 94.6 94.6
ConvBERT 95.5 93.7 94.6 94.5
DeBERTa 95.2 95.0 95.1 95.1
ELECTRA 95.8 95.4 95.6 95.6
Longformer 95.9 94.3 95.1 95.0
Reformer 54.6 52.1 53.3 52.2
T5 94.8 93.4 94.0 93.9
4 Results
In this section, we present the main fine-tuning results of our implemented transformer-based language models on the
opinion mining task using the IMDb movie reviews dataset. Overall, all the fine-tuned models perform well with fairly
high scores, except the three autoregressive models GPT, GPT-2, and Reformer, as shown in Table 1. The best model,
ELECTRA, provides an F1-score of 95.6 points, followed by RoBERTa, Longformer, and DeBERTa, with F1-scores of
95.3, 95.1, and 95.1 points, respectively. On the other hand, the worst model, GPT-2, provides an F1-score of 52.9 points,
as shown in Figure 5 and Figure 6. From the results, it is clear that purely autoregressive models do not perform well on
comprehension tasks like sentiment classification, where sequences may require access to bidirectional context for better
word representations and, therefore, good classification accuracy. With autoencoding models, which take advantage of
both left and right contexts, we observed good performance gains. For instance, the autoregressive XLNet model is our
fourth best model in Table 1 with an F1-score of 94.9%: it incorporates modelling techniques from autoencoding models
into autoregressive models while avoiding and addressing the limitations of encoders. The code and fine-tuned models
are available at [34].
5 Ablation Study
In Table 2 and Figure 7, we demonstrate the importance of configuration choices through controlled trials and ablation
experiments. The maximum sequence length and data cleaning are particularly crucial. To make our ablation study
credible, we fine-tuned our BERT model with the same setup, first changing only the sequence length (max-len) and then
cleaning the data (cd), to observe how each affects the performance of the model.
Table 2: Validation results of the BERT model under different configurations, where cd stands for cleaned data,
meaning that the last model (BERTmax-len=384, cd) is trained on exhaustively cleaned text
Model Recall Precision F1 Accuracy
BERTmax-len=64 86.8% 84.7% 85.8% 85.6%
BERTmax-len=384 93.9% 94.3% 94.1% 94.0%
BERTmax-len=384, cd 92.6% 91.6% 92.1% 92.2%
The gap between the performance of BERTmax-len=64 and BERTmax-len=384 on the IMDb dataset is a striking 8.3 F1
points, as shown in Table 2, demonstrating how important this parameter is. Visualizing the distribution of token or word
counts is therefore the most reliable way to define the optimal value of the maximum length parameter, one that covers
most of the training data points. Figure 8 illustrates the distribution of the number of tokens in the IMDb movie reviews
dataset and shows that the majority of reviews are between 100 and 400 tokens long. In this context, we chose 384 as the
maximum length for studying the effect of this parameter, because it covers the majority of review lengths while
conserving memory and saving computational resources. It should be noted that the BERT model can process texts up to
512 tokens in length; this limit is a consequence of the model architecture and cannot be adjusted directly.
Figure 8: Distribution of the number of tokens for a better selection of the maximum sequence length
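A minimal sketch of how such a token-length distribution can be inspected is given below; the tokenizer checkpoint, sample size, and percentiles are illustrative choices, not the authors' exact procedure.

# Sketch for inspecting the token-length distribution used to pick max_length=384.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
reviews = load_dataset("imdb", split="train")

# Sample a subset of reviews for speed; count WordPiece tokens per review.
lengths = [len(tokenizer.encode(t, truncation=False)) for t in reviews["text"][:2000]]
print(np.percentile(lengths, [50, 90, 95, 99]))  # how many reviews fit under a given max length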
Traditional machine learning algorithms require extensive data cleaning before vectorizing the input sequence and
feeding it to the model, with the aim of improving both the reliability and the quality of the data, so that the model can
focus on important features during training. In contrast, performance dropped by 2 F1 points when we cleaned the data
for the BERT model. The cleaning we carried out aims to normalize the words of each review. It includes lemmatization
to group together the different forms of the same word, stemming to reduce a word to its root by removing suffixes and
prefixes, deletion of URLs, punctuation, and patterns that do not contribute to the sentiment, as well as the elimination of
all stop words except "no", "nor", and "not", because their contribution to the sentiment can be tricky. For instance,
"Black Panther is boring" is a negative review, but "Black Panther is not boring" is a positive review. This drop can be
explained by the fact that BERT and other attention-based models need all the words of the sequence to capture the
meaning of words in context. With cleaning, words may be represented differently from their meaning in the original
sequence. Note that "not boring" and "boring" are completely different in meaning, but if the stop word "not" is removed,
we end up with two similar sequences, which is problematic in a sentiment analysis context.
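The following sketch illustrates this kind of cleaning pipeline using NLTK; it is an assumed reconstruction for illustration, not necessarily the exact preprocessing code used in the experiments.

# Sketch of the cleaning evaluated in the ablation: URL and punctuation removal,
# stop-word removal keeping "no", "nor", "not", then lemmatization and stemming.
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

stop_words = set(stopwords.words("english")) - {"no", "nor", "not"}
lemmatizer, stemmer = WordNetLemmatizer(), PorterStemmer()

def clean(review: str) -> str:
    review = re.sub(r"https?://\S+", " ", review.lower())               # drop URLs
    review = review.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [w for w in review.split() if w not in stop_words]           # keep negations
    tokens = [stemmer.stem(lemmatizer.lemmatize(w)) for w in tokens]
    return " ".join(tokens)

print(clean("Black Panther is not boring!"))  # the negation "not" is preserved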
Carefully observing the accuracy and loss learning curves in Figure 9 and Figure 10, we notice that the validation
loss starts to creep upward while the validation accuracy starts to go down; the model progressively loses its ability to
generalize to unseen data. In fact, the model is relatively biased due to the effect of
the training data and data-drift issues related to the fine-tuning data. In this context, we assume that the model starts to
overfit. Setting different dropouts, reducing the learning rate, or trying larger batches does not help; these strategies
sometimes give worse results and an even more severe overfitting problem. For this reason, pre-training these models on
in-domain (industry) data and vocabulary and then fine-tuning them may be the best solution.
6 Conclusion
In this paper, we presented a detailed comparison highlighting the main characteristics of transformer-based pre-trained
language models and what differentiates them from each other. We then studied their performance on the opinion mining
task. From this, we observe the power of fine-tuning and how it leverages the pre-trained models' knowledge to achieve
high accuracy on downstream tasks, even with the bias they carry from the pre-training data. Experimental results show
how well these models perform: the highest F1-score, 95.6 points on the IMDb dataset, was obtained with the ELECTRA
model. Similarly, we found that access to both left and right contexts is necessary for comprehension tasks like sentiment
classification. Autoregressive models such as GPT, GPT-2, and Reformer perform poorly and fail to achieve high
accuracy, whereas XLNet reaches good results despite being autoregressive, because it incorporates ideas taken from
bidirectional encoders. Indeed, most performances were close, including DistilBERT, which achieves strong performance
in less training time thanks to knowledge distillation. For example, over 4 epochs, BERT took 70 minutes to train while
DistilBERT took 35 minutes, losing only 0.6 F1 points but saving half of the time taken by BERT. Moreover, our ablation
study shows that the maximum sequence length is one of the parameters with a significant impact on the final results and
must be carefully analyzed and adjusted. Likewise, data quality is essential for good performance, and the data should
not require heavy processing, since extensive cleaning may prevent the model from capturing local and global contexts
in sequences whose words were removed or trimmed during cleaning. Besides, we notice that the majority of the models
we fine-tuned on the IMDb dataset start to overfit after a certain number of epochs, which can lead to biased models.
Good-quality data alone is not enough, however: pre-training a model on large amounts of domain-specific data and
vocabulary may help prevent it from making wrong predictions and help it reach a higher level of generalization.
Acknowledgments
We are grateful to the Hugging Face team for their role in democratizing state-of-the-art machine learning and natural
language processing technology through open-source tools. Their commitment to providing valuable resources to the
research community is highly appreciated, and we acknowledge their vital contribution to the development of our
article.
Appendix
Appendix for “Analysis of the Evolution of Advanced Transformer-based Language Models: Experiments on Opinion
Mining"
Table A1: Summary and comparison of transformer-based models. For each model we report L (number of transformer layers), H (hidden state size), A (number of attention heads), the attention type, the total number of parameters, the tokenization algorithm, the pre-training data, the computational cost, the training objectives, the tasks on which the model performs well, and a short description.

GPT: L=12, H=512, A=12, global attention, 110M parameters, byte-pair encoding [32]. Pre-trained on Books Corpus (800M words); computational cost not reported. Training objective: autoregressive (decoder). Performs well on zero-shot settings, text summarization, question answering, and translation. The first transformer-based autoregressive (causal masking) model.

BERT: L=12, H=768, A=12, global attention, 110M parameters, WordPiece [30]. Pre-trained on Books Corpus (800M words) and English Wikipedia (2,500M words); 4 days on 4 Cloud TPUs in Pod configuration. Training objectives: autoencoding encoder (MLM and NSP). Performs well on text classification, natural language inference, and question answering. The first transformer-based autoencoding model, using global attention to provide high-level bidirectional contextualization.

GPT-2: L=12, H=1600, A=12, global attention, 117M parameters, byte-pair encoding. Pre-trained on WebText (10B words); computational cost not reported. Training objective: autoregressive (decoder). Performs well on zero-shot settings, text summarization, question answering, and translation. Optimized and bigger than GPT; performs well in zero-shot settings.

GPT-3: L=96, H=12288, A=96, global attention, 175B parameters, byte-pair encoding. Pre-trained on filtered Common Crawl, WebText2, Books1, Books2, and Wikipedia (300B words); computational cost not reported. Training objective: autoregressive (decoder). Performs well on text summarization, question answering, translation, and zero-, one-, and few-shot settings. Bigger than its predecessors.

ALBERT: L=12, H=768, A=12, global attention, 11M parameters, SentencePiece [31]. Pre-trained on Books Corpus [35] and English Wikipedia; Cloud TPU V3, with the number of TPUs ranging from 64 to 512 (32h for ALBERT-xxlarge). Training objectives: autoencoding encoder with sentence-order prediction (SOP). Performs well on semantic similarity, semantic relevance, question answering, and reading comprehension. Smaller than and similar to BERT, with minimal tweaks including cross-layer parameter sharing by splitting layers into groups, making it faster and reducing its memory footprint.

DistilBERT: L=6, H=768, A=12, global attention, 66M parameters, WordPiece. Pre-trained on English Wikipedia and the Toronto Book Corpus; 90 hours on 8 16GB V100 GPUs. Training objective: autoencoding encoder (MLM). Performs well on semantic similarity, semantic relevance, question answering, and textual entailment. Pre-training leverages knowledge distillation to deliver BERT-like results with lower latency; similar to BERT but smaller.

RoBERTa: L=12, H=1024, A=12, global attention, 125M parameters, byte-pair encoding. Pre-trained on Book Corpus [35], CC-News, OpenWebText, and Stories [36]; 8 × 32GB Nvidia V100 GPUs. Training objective: autoencoding encoder (dynamic MLM, no NSP). Performs well on text classification, language inference, and question answering. Pre-trained with large batches using tricks for diverse learning such as dynamic masking, where tokens are masked differently at each epoch.

XLM: L=12, H=2048, A=8, global attention, parameter count not reported, byte-pair encoding. Pre-trained on the Wikipedias of the XNLI languages; 64 Volta GPUs for the language modeling tasks and 8 GPUs for the MT tasks. Training objectives: autoencoding encoder with causal language modeling (CLM), masked language modeling (MLM), and translation language modeling (TLM). Performs well on translation tasks and cross-lingual NLU benchmarks. By training on several pre-training objectives over a multilingual corpus, XLM shows that multilingual pre-training methods have a strong impact, especially on multilingual tasks.

XLM-RoBERTa: L=12, H=768, A=8, global attention, 270M parameters, SentencePiece. Pre-trained on the CommonCrawl corpus in 100 languages; 100 32GB Nvidia V100 GPUs. Training objective: autoencoding encoder (MLM). Performs well on translation tasks and cross-lingual NLU benchmarks. Using only the masked language modeling objective, XLM-RoBERTa applies RoBERTa's tricks to the XLM approach and is able to detect the input language by itself (100 languages).

ELECTRA: L=12, H=768, A=12, global attention, 110M parameters, WordPiece. Pre-trained on Wikipedia, BooksCorpus, Gigaword 5 [37], ClueWeb 2012-B, and Common Crawl; 4 days on 1 GPU. Training objectives: generator (autoregressive, replaced token detection) and discriminator (ELECTRA: predicting masked tokens). Performs well on sentiment analysis and language inference tasks. Replaced token detection is a pre-training objective that addresses MLM issues and results in efficient performance.

DeBERTa: L=12, H=768, A=12, global (disentangled) attention, 125M parameters, byte-pair encoding. Pre-trained on Wikipedia, BooksCorpus, Reddit content, and Stories; 10 days on 64 V100 GPUs. Training objectives: autoencoding with a disentangled attention mechanism and an enhanced mask decoder. DeBERTa was the first pretrained model to beat HLP on the SuperGLUE benchmark [38]. DeBERTa builds on RoBERTa with disentangled attention and an enhanced mask decoder to significantly improve performance on many downstream tasks, while being trained on only half of the data used for RoBERTa large.

XLNet: L=12, H=768, A=12, global attention, 110M parameters, SentencePiece. Pre-trained on Wikipedia, BooksCorpus, Gigaword 5 [37], ClueWeb 2012-B, and Common Crawl; 5.5 days on 512 TPU v3 chips. Training objective: autoregressive (decoder). XLNet achieved state-of-the-art results and outperformed BERT on 20 downstream tasks including sentiment analysis, question answering, reading comprehension, and document ranking. XLNet incorporates ideas from Transformer-XL [17] and addresses BERT's pretrain-finetune discrepancy, being more capable of grasping dependencies between masked tokens.

BART: L=12, H=768, A=16, global attention, 139M parameters, byte-pair encoding. Pre-trained on Wikipedia and BooksCorpus; computational cost not reported. Training objectives: generative sequence-to-sequence encoder-decoder with token masking, token deletion, text infilling, sentence permutation, and document rotation. BART beats its predecessors on generation tasks such as translation with state-of-the-art results, while performing similarly to RoBERTa on discriminative tasks including question answering and classification. Trained to map corrupted text to the original using an arbitrary noising function.

ConvBERT: L=12, H=768, A=12, global attention, 124M parameters, WordPiece. Pre-trained on OpenWebText [39]; GPU and TPU. Training objective: autoencoding encoder. With fewer parameters and lower costs, ConvBERT consistently outperforms BERT on various downstream tasks with less training cost. For reduced redundancy and better modeling of global and local context, BERT's self-attention blocks are replaced by mixed-attention blocks incorporating self-attention and span-based dynamic convolutions.

Reformer: L=12, H=1024, A=8, attention with locality-sensitive hashing, 149M parameters, SentencePiece. Pre-trained on OpenWebText [39]; parallelization across 8 GPUs or 8 TPU v3 cores. Training objective: autoregressive (decoder). Performs well under pragmatic requirements thanks to the reduction of the attention complexity. An efficient and faster transformer that spends less time on long sequences thanks to two optimization techniques: locality-sensitive hashing attention and axial position encoding.

T5: L=12, H=768, A=12, global attention, 220M parameters, SentencePiece. Pre-trained on the Colossal Clean Crawled Corpus (C4); Cloud TPU Pods. Training objective: generative sequence-to-sequence encoder-decoder. Performs well on entailment, coreference challenges, and question answering tasks via the SuperGLUE benchmark. To cover the variety of linguistic tasks, T5 is pre-trained on a mix of supervised and unsupervised tasks in a text-to-text format.

Longformer: L=12, H=768, A=12, local + global attention, 149M parameters, byte-pair encoding. Pre-trained on Books Corpus, English Wikipedia, and the RealNews dataset [40]; computational cost not reported. Training objective: autoregressive (decoder). Longformer achieved state-of-the-art results on the WikiHop and TriviaQA benchmark datasets. For higher training efficiency on long documents, Longformer uses sparse matrices instead of full attention matrices, scaling linearly with sequences of up to 4,096 tokens.
References
[1] KR Chowdhary. Natural language processing. In Fundamentals of artificial intelligence, pages 603–649. Springer, 2020. doi:
10.1007/978-81-322-3972-7_19.
[2] Maryem Rhanoui, Mounia Mikram, Siham Yousfi, Ayoub Kasmi, and Naoufel Zoubeidi. A hybrid recommender system
for patron driven library acquisition and weeding. In Journal of King Saud University-Computer and Information Sciences,
volume 34, pages 2809–2819. Elsevier, 2020. doi: 10.1016/j.jksuci.2020.10.017.
[3] Fatima Zohra Trabelsi, Amal Khtira, and Bouchra El Asri. Hybrid recommendation systems: A state of art. In ENASE, pages
281–288, 2021. doi: 10.5220/0010452202810288.
[4] Babita Pandey, Devendra Kumar Pandey, Brijendra Pratap Mishra, and Wasiur Rhmann. A comprehensive survey of deep
learning in the field of medical imaging and medical natural language processing: Challenges and research directions. In
Journal of King Saud University-Computer and Information Sciences, volume 34, pages 5083–5099. Elsevier, 2021. doi:
10.1016/j.jksuci.2021.01.007.
[5] Ayoub Harnoune, Maryem Rhanoui, Mounia Mikram, Siham Yousfi, Zineb Elkaimbillah, and Bouchra El Asri. Bert based
clinical knowledge extraction for biomedical knowledge graph construction and analysis. In Computer Methods and Programs
in Biomedicine Update, volume 1, page 100042. Elsevier, 2021. doi: 10.1016/j.cmpbup.2021.100042.
[6] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. In Ain Shams
engineering journal, volume 5, pages 1093–1113. Elsevier, 2014. doi: 10.1016/j.asej.2014.04.011.
[7] Shiliang Sun, Chen Luo, and Junyu Chen. A review of natural language processing techniques for opinion mining systems. In
Information fusion, volume 36, pages 10–25. Elsevier, 2017. doi: 10.1016/j.inffus.2016.10.004.
[8] Siham Yousfi, Maryem Rhanoui, and Dalila Chiadmi. Mixed-profiling recommender systems for big data environ-
ment. In First International Conference on Real Time Intelligent Systems, pages 79–89. Springer, 2017. doi: 10.1007/
978-3-319-91337-7_8.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers
for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota, USA, June 2019.
Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, pages
5998–6008, Long Beach, California, USA, December 2017. Curran Associates. doi: 10.48550/arXiv.1706.03762.
[11] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised
learning. Proceedings of the 2018 Conference on Neural Information Processing Systems, 2018.
[12] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised
multitask learners. 2019.
[13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav
Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In Proceedings of
the 34th International Conference on Neural Information Processing Systems, volume 33 of NIPS’20, pages 1877–1901, Red
Hook, NY, USA, 2020. Curran Associates Inc. doi: 10.48550/arXiv.2005.14165.
[14] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert
for self-supervised learning of language representations. International Conference on Learning Representations, 2019. doi:
10.48550/arXiv.1909.11942.
[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and
Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. In arXiv preprint arXiv:1907.11692, volume
abs/1907.11692, aug 2019. doi: 10.48550/arXiv.1907.11692.
[16] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized
autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. doi:
10.48550/arXiv.1906.08237.
[17] Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models
beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 2978–2988, jul 2019. doi: 10.18653/v1/P19-1285.
[18] Victor Sanh, L Debut, J Chaumond, and T Wolf. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. arxiv
2019. In arXiv preprint arXiv:1910.01108, volume abs/1910.01108, 2019. doi: 10.48550/arXiv.1910.01108.
[19] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In arXiv preprint
arXiv:1503.02531, volume abs/1503.02531, 2015. doi: 10.48550/arXiv.1503.02531.
[20] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. In Neural Information Processing Systems,
volume 32, page 11, Red Hook, NY, USA, 2019. Curran Associates Inc. doi: 10.48550/arXiv.1901.07291.
[21] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard
Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. pages
8440–8451. Association for Computational Linguistics, 01 2020. doi: 10.18653/v1/2020.acl-main.747.
[22] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and
Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and
comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.703.
[23] Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. Convbert: Improving bert with
span-based dynamic convolution. In Proceedings of the 34th International Conference on Neural Information Processing
Systems, NIPS’20, page 12, Red Hook, NY, USA, 2020. Curran Associates Inc. doi: 10.48550/arXiv.2008.02496.
[24] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In arXiv preprint arXiv:2001.04451,
volume abs/2001.04451, 2020. doi: 10.48550/arXiv.2001.04451.
[25] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu,
et al. Exploring the limits of transfer learning with a unified text-to-text transformer. In Journal of Machine Learning Research,
volume 21, pages 1–67. JMLR.org, 2020. doi: 10.48550/arXiv.1910.10683.
[26] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators
rather than generators. In arXiv preprint arXiv:2003.10555, volume abs/2003.10555, 2020. doi: 10.48550/arXiv.2003.10555.
[27] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. In arXiv preprint
arXiv:2004.05150, volume abs/2004.05150, 2020. doi: 10.48550/arXiv.2004.05150.
[28] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention.
In arXiv preprint arXiv:2006.03654, volume abs/2006.03654, 2020. doi: 10.48550/arXiv.2006.03654.
[29] Sushant Singh and Ausif Mahmood. The nlp cookbook: modern recipes for transformer based deep learning architectures. In
IEEE Access, volume 9, pages 68675–68702. IEEE, 2021. doi: 10.1109/access.2021.3077350.
[30] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao,
Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine
translation. In arXiv preprint arXiv:1609.08144, volume abs/1609.08144, 2016. doi: 10.48550/arXiv.1609.08144.
[31] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for
neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pages 66–71. Association for Computational Linguistics, 2018. doi: 10.18653/v1/D18-2012.
[32] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1,
pages 1715–1725, Berlin, Germany, 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162.
[33] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word
vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies, volume 1, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational
Linguistics.
[34] Nour Eddine Zekaoui. Opinion Transformers, 2023. [Online]. Available: https://github.com/zekaouinoureddine/
Opinion-Transformers.
[35] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning
books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE
international conference on computer vision (ICCV), pages 19–27, 2015. doi: 10.1109/ICCV.2015.11.
[36] Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. In arXiv preprint arXiv:1806.02847, volume
abs/1806.02847, 2018. doi: 10.48550/arXiv.1806.02847.
[37] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword fifth edition, linguistic data
consortium. In Google Scholar, 2011. doi: 10.35111/wk4f-qt80.
[38] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman.
Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in neural information
processing systems, volume 32, Red Hook, NY, USA, 2019. Curran Associates Inc. doi: 10.48550/arXiv.1905.00537.
[39] Aaron Gokaslan and Vanya Cohen. Openwebtext corpus, 2019. Accessed: Jan. 2, 2023.
[40] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending
against neural fake news. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Proceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019. doi: 10.48550/arXiv.1905.12616.