Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining
Corresponding Author:
Nour Eddine Zekaoui
Meridian Team, LYRICA Laboratory, School of Information Sciences
Rabat, Morocco
Email: [email protected], [email protected]
1. INTRODUCTION
Over the past few years, interest in natural language processing (NLP) [1] has increased significantly.
Today, several applications are investing massively in this new technology, such as extending recommender
systems [2], [3], uncovering new insights in the health industry [4], [5], and unraveling e-reputation and opin-
ion mining [6], [7]. Opinion mining is an approach to computational linguistics and NLP that automatically
identifies the emotional tone, sentiment, or thoughts behind a body of text. As a result, it plays a vital role
in driving business decisions in many industries. However, assessing customer satisfaction is costly.
Indeed, mining user feedback regarding the products offered is the most accurate way to adapt strategies and
future business plans. In recent years, opinion mining has seen considerable progress, with applications in
social media and review websites. Recommendation may be staff-oriented [2] or user-oriented [8] and should
be tailored to meet customer needs and behaviors.
Nowadays, analyzing people’s emotions has become more intuitive thanks to the availability of many
large pre-trained language models such as bidirectional encoder representations from transformers (BERT) [9]
and its variants. These models use the seminal transformer architecture [10], which is based solely on attention
mechanisms, to build robust language models for a variety of semantic tasks, including text classification.
Moreover, there has been a surge in opinion mining text datasets, specifically designed to challenge NLP
models and enhance their performance. These datasets are aimed at enabling models to imitate or even exceed
human level performance, while introducing more complex features.
Even though many papers have addressed NLP topics for opinion mining using high-performance
deep learning models, it is still challenging to determine their performance concretely and accurately due to
variations in technical environments and datasets. Therefore, to address these issues, our paper aims to study
the behavior of cutting-edge transformer-based models on textual material and reveal their differences.
Specifically, it focuses on applying both transformer encoders and decoders, such as BERT [9] and the generative
pre-trained transformer (GPT) [11], respectively, and their improvements, on a benchmark dataset. This enables
a credible assessment of their performance and an understanding of their advantages, allowing subject matter experts
to clearly rank the models. Furthermore, through ablations, we show the impact of configuration choices on
the final results.
2. BACKGROUND
2.1. Transformer
The transformer [10], as illustrated in Figure 1, is an encoder-decoder model dispensing entirely with
recurrence and convolutions. Instead, it leverages the attention mechanism to compute high-level contextual-
ized embeddings. Being the first model to rely solely on attention mechanisms, it is able to address the issues
commonly associated with recurrent neural networks, which factor computation along the symbol positions of the
input and output sequences and thereby preclude parallelization within samples. As a result, the transformer is
highly parallelizable and requires significantly less time to train. In the upcoming sections, we highlight the
recent breakthroughs in NLP built on the transformer, which changed the field almost overnight, by presenting
its descendants, such as BERT [9] and its improvements.
2.2. BERT
BERT [9] is pre-trained using a combination of masked language modeling (MLM) and next sentence
prediction (NSP) objectives. It provides high-level contextualized embeddings grasping the meaning of words
in different contexts through global attention. As a result, the pre-trained BERT model can be fine-tuned for a
wide range of downstream tasks, such as question answering and text classification, without substantial task-
specific architecture modifications.
BERT and its variants allow the training of modern data-intensive models. Moreover, they are able to
capture the contextual meaning of each piece of text in a way that traditional language models are unfit to do,
while being quicker to develop and yielding better results with less data. On the other hand, BERT and other
large neural language models are very expensive and computationally intensive to train, fine-tune, and run
inference with.
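To make the fine-tuning idea concrete, the following minimal sketch, assuming the Hugging Face transformers library and the illustrative bert-base-uncased checkpoint, loads a pre-trained BERT encoder with a classification head for binary sentiment prediction; it omits the training loop itself.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal sketch: a pre-trained BERT encoder with a randomly initialized
# classification head, as used for binary sentiment classification.
# The "bert-base-uncased" checkpoint is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("Black Panther is not boring", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): one score per sentiment class
```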
2.4. ALBERT
A lite BERT (ALBERT) [14] was proposed to address the problems associated with large models. It
was specifically designed to provide contextualized natural language representations that improve results on
downstream tasks, at a time when increasing the model size to pre-train richer embeddings was becoming harder
due to memory limitations and longer training times.
ALBERT is a lighter version of BERT, in which next sentence prediction (NSP) is replaced by sentence
order prediction (SOP). In addition to that, it employs two parameter-reduction techniques to reduce memory
consumption and improve training time of BERT without hurting performance:
− Factorized embedding parameterization: splitting the embedding matrix into two smaller matrices separates
the hidden layer size from the vocabulary embedding size, allowing the hidden size to grow with fewer
parameters.
− Cross-layer parameter sharing: layers are split among groups that share parameters, preventing the parameter
count from growing with the depth of the network.
2.5. RoBERTa
The choice of language model hyper-parameters has a substantial impact on the final results. Hence,
robustly optimized BERT pre-training approach (RoBERTa) [15] is introduced to investigate the impact of
many key hyper-parameters along with data size on model performance. RoBERTa is based on Google’s BERT
[9] model and modifies key hyper-parameters, where the masked language modeling objective is dynamic and
the NSP objective is removed. It is an improved version of BERT, pre-trained with much larger mini-batches
and learning rates on a large corpus using self-supervised learning.
2.6. XLNet
The bidirectional property of transformer encoders, such as BERT [9], helps them achieve better per-
formance than autoregressive language modeling approaches. Nevertheless, BERT ignores dependencies
between the masked positions and suffers from a pretrain-finetune discrepancy by relying on corrupting the
input with masks. In view of these pros and cons, XLNet [16] has been proposed. XLNet is a generalized
autoregressive pretraining approach that learns bidirectional dependencies by maximizing the expected
likelihood over all permutations of the factorization order. Furthermore, it overcomes the drawbacks of
BERT [9] thanks to its causal, autoregressive formulation, inspired by the transformer-XL [17].
2.7. DistilBERT
Unfortunately, the outstanding performance that comes with large-scale pretrained models is not
cheap. In fact, operating them on edge devices under constrained computational training or inference bud-
gets remains challenging. Against this backdrop, DistilBERT [18] (or Distilled BERT) was introduced to
address these issues by leveraging knowledge distillation [19].
DistilBERT is similar to BERT, but it is smaller, faster, and cheaper. It has 40% fewer parameters than
BERT base and runs 60% faster, while preserving over 95% of BERT’s performance. It is trained by distilling
the pre-trained BERT base model.
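To illustrate the distillation idea of [19], the sketch below shows a generic Hinton-style distillation loss; the temperature T and weighting alpha are illustrative values, and DistilBERT's actual training objective additionally combines a masked language modeling term and a cosine embedding term.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic Hinton-style distillation loss; T and alpha are illustrative values."""
    # Soft targets: KL divergence between temperature-softened distributions,
    # rescaled by T^2 so gradients keep a comparable magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```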
2.8. XLM-RoBERTa
Pre-trained multilingual models at scale, such as multilingual BERT (mBERT) [9] and cross-lingual
language models (XLMs) [20], have led to considerable performance improvements for a wide variety of
cross-lingual transfer tasks, including question answering, sequence labeling, and classification. Building on this, the
multilingual version of RoBERTa [15], called XLM-RoBERTa [21], pre-trained on a newly created 2.5 TB
multilingual CommonCrawl corpus containing 100 different languages, has further pushed the performance. It
has shown strong improvements on low-resource languages compared to previous multilingual models.
2.9. BART
Bidirectional and auto-regressive transformer (BART) [22] is a generalization of BERT [9] and GPT
[11] that takes advantage of the standard transformer architecture [10]. Concretely, it uses a bidirectional encoder and a
left-to-right decoder. It is trained by corrupting text with an arbitrary noising function and learning a model to
reconstruct the original text. BART has shown phenomenal success when fine-tuned on text generation tasks
such as translation, but also performs well for comprehension tasks like question answering and classification.
2.10. ConvBERT
While BERT [9] and its variants have recently achieved incredible performance gains in many NLP
tasks compared to previous models, BERT suffers from large computation cost and memory footprint due to
reliance on the global self-attention block. Despite its many attention heads, BERT was found to be compu-
tationally redundant, since some heads only need to learn local dependencies. Therefore, ConvBERT [23]
is an improved version of BERT [9], where self-attention blocks are replaced with new mixed ones that leverage
convolutions to better model global and local context.
2.11. Reformer
Consistently, large transformer [10] models achieve state-of-the-art results in a wide variety of linguis-
tic tasks, but training them on long sequences is costly and challenging. To address this issue, the Reformer [24]
was introduced to improve the efficiency of transformers while retaining high performance and smooth
training. Reformer is more efficient than the transformer [10] thanks to locality-sensitive hashing attention,
reversible residual layers in place of the standard residuals, axial position encodings, and other optimizations.
2.12. T5
Transfer learning has emerged as one of the most influential techniques in NLP. Its efficiency in trans-
ferring knowledge to downstream tasks through fine-tuning has given birth to a range of innovative approaches.
One of these approaches is transfer learning with a unified text-to-text transformer (T5) [25], which consists
of a bidirectional encoder and a left-to-right decoder. This approach is reshaping the transfer learning land-
scape by leveraging the power of being pre-trained on a combination of unsupervised and supervised tasks and
reframing every NLP task into text-to-text format.
2.13. ELECTRA
Masked language modeling (MLM) approaches like BERT [9] have proven to be effective when trans-
ferred to downstream NLP tasks, although they are expensive and require large amounts of compute. Effi-
ciently learning an encoder that classifies token replacements accurately (ELECTRA) [26] is a pre-training
approach that aims to overcome these computation problems by training two transformer models: the gener-
ator and the discriminator. ELECTRA trains on a replaced token detection objective, using the discriminator
to identify which tokens were replaced by the generator in the sequences. Unlike MLM-based models, ELEC-
TRA is defined over all input tokens rather than just a small subset that was masked, making it a more efficient
pre-training approach.
2.14. Longformer
While previous transformers were focusing on making changes to the pre-training methods, the long-
document transformer (Longformer) [27] changes the transformer’s self-attention mechanism itself. It has
become the de facto standard for tackling a wide range of complex NLP tasks, thanks to a new attention mecha-
nism that scales linearly with sequence length and can therefore easily process longer sequences. Long-
former’s new attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Simply, it replaces the transformer [10] attention
matrices with sparse matrices for higher training efficiency.
2.15. DeBERTa
DeBERTa [28] stands for decoding-enhanced BERT with disentangled attention. It is a pre-training
approach that extends Google’s BERT [9] and builds on RoBERTa [15]. Despite being trained on only half
of the data used for RoBERTa, DeBERTa has been able to improve the efficiency of pre-trained models through
the use of two novel techniques:
− Disentangled attention (DA): an attention mechanism that computes the attention weights among words
using disentangled matrices based on two vectors that encode the content and the relative position of
each word, respectively (a rough formulation is sketched after this list).
− Enhanced mask decoder (EMD): a pre-training technique that replaces the output softmax layer and
incorporates absolute positions in the decoding layer to predict masked tokens during model pre-training.
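As a rough sketch of the disentangled attention score introduced in [28] (notation follows the DeBERTa paper; this is our summary rather than a verbatim reproduction), the attention score between tokens i and j combines content-to-content, content-to-position, and position-to-content terms:

$$\tilde{A}_{i,j} = Q^{c}_{i}\,{K^{c}_{j}}^{\top} + Q^{c}_{i}\,{K^{r}_{\delta(i,j)}}^{\top} + K^{c}_{j}\,{Q^{r}_{\delta(j,i)}}^{\top},$$

where $Q^{c}, K^{c}$ are projections of the content vectors, $Q^{r}, K^{r}$ are projections of the relative-position embeddings, and $\delta(i,j)$ is the bucketed relative distance between positions $i$ and $j$; the scores are scaled by $1/\sqrt{3d}$ before the softmax.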
3. APPROACH
Transformer-based pre-trained language models have led to substantial performance gains, but careful
comparison between different approaches is challenging. Therefore, we extend our study to uncover insights
regarding their fine-tuning process and main characteristics. Our paper first aims to study the behavior of
these models, following two approaches: a data-centric view focusing on the data state and quality, and a
model-centric view giving more attention to the models’ tweaks. Indeed, we will see how data processing
affects their performance and how adjustments and improvements made to the models over time change
their performance. Thus, we seek to end with some takeaways regarding the optimal setup for cross-
validating a transformer-based model, specifically its tuning hyper-parameters and data quality.
3.2. Configuration
It should be noted that we have used almost the same architecture building blocks for all our imple-
mented models as shown in Figure 2 and Figure 3 for both encoder and decoder based models, respectively.
In contrast, seq2seq models like BART are merely a bidirectional encoder followed by an autoregressive de-
coder. Each model is fed with the three required inputs, namely input ids, token type ids, and attention mask.
However, for some models, the position embeddings are optional and can sometimes be completely ignored
(e.g., RoBERTa); for this reason, we have blurred them slightly in the figures. Furthermore, it is important to note
that we uniformly lowercased the dataset and tokenized it with tokenizers based on the WordPiece [30],
SentencePiece [31], and byte-pair encoding [32] algorithms.
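As an illustration of these inputs, the following sketch, assuming the Hugging Face transformers library and BERT's WordPiece tokenizer, produces the three fields listed above; RoBERTa's byte-pair encoding and ALBERT's or XLNet's SentencePiece tokenizers return analogous fields.

```python
from transformers import AutoTokenizer

# Illustrative sketch of the model inputs described above, using BERT's
# WordPiece tokenizer as an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(
    "The movie was surprisingly good".lower(),
    max_length=384,
    padding="max_length",
    truncation=True,
)
print(enc["input_ids"][:10])       # sub-word vocabulary indices
print(enc["token_type_ids"][:10])  # segment ids (all zeros for a single sentence)
print(enc["attention_mask"][:10])  # 1 for real tokens, 0 for padding
```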
In our experiments, we used a highly optimized setup using only the base version of each pre-trained
language model. For training and validation, we set batch sizes of 8 and 4, respectively, and fine-tuned the
models for 4 epochs over the data with a maximum sequence length of 384, chosen to cover the majority of
review lengths within our computational capabilities. The AdamW optimizer is used to optimize
the models with a learning rate of 3e-5, and the epsilon (eps) used to improve numerical stability is set to 1e-6,
which is the default value. Furthermore, the weight decay is set to 0.001, while bias, LayerNorm.bias,
and LayerNorm.weight are excluded from weight decay during fine-tuning by setting their decay to 0.000.
We implemented all of our models using PyTorch and the transformers library from Hugging Face, and ran them
on an NVIDIA Tesla P100-PCIE GPU-Persistence-M (51G) GPU RAM.
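A minimal sketch of this optimizer setup, assuming `model` is one of the fine-tuned models (for example, the BERT classifier sketched in section 2.2), is given below; it reproduces the parameter grouping described above.

```python
from torch.optim import AdamW

# Sketch of the optimizer setup described above, assuming `model` is a loaded
# transformer model: weight decay of 0.001 everywhere except biases and
# LayerNorm parameters, learning rate 3e-5, eps 1e-6.
no_decay = ("bias", "LayerNorm.bias", "LayerNorm.weight")
grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.001,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(grouped_parameters, lr=3e-5, eps=1e-6)
```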
3.3. Evaluation
Dataset: to fine-tune our models, we used the IMDb movie review dataset [33], a binary sentiment
classification dataset of 50K highly polar movie reviews, labeled in a balanced way between positive
and negative. We chose it for our study because it is widely used in research and is a very popular
resource for researchers working on NLP and ML tasks, particularly those related to sentiment analysis and text
classification due to its accessibility, size, balance, and pre-processing. In other words, it is easily accessible
and widely available, with 50K well-balanced reviews containing an equal number of positive and negative
reviews, as shown in Figure 4. This helps prevent biases in the trained model. Additionally, it has already been
pre-processed with the text of each review cleaned and normalized.
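As an aside, the dataset can be obtained, for example, through the Hugging Face datasets hub; the "imdb" identifier below is an assumption on our part rather than the exact distribution used in [33]. The sketch loads it and checks the label balance.

```python
from collections import Counter
from datasets import load_dataset  # Hugging Face datasets library

# Sketch: load the IMDb review dataset from the Hugging Face hub and verify
# that the labels are balanced.
imdb = load_dataset("imdb")
print(imdb)                             # 25K train / 25K test reviews
print(Counter(imdb["train"]["label"]))  # expected: 12,500 positive, 12,500 negative
```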
Metrics: to assess the performance of the fine-tuned transformers on the IMDb movie reviews dataset,
tracking the loss and accuracy learning curves for each model is an effective method. These curves can help
detect incorrect predictions and potential overfitting, which are crucial factors to consider in the evaluation
process. Moreover, widely-used metrics, namely accuracy, recall, precision, and F1-score are valuable to
consider when dealing with classification problems. These metrics can be defined as:
$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \quad \mathrm{Recall} = \dfrac{TP}{TP + FN}, \quad \text{and} \quad F_1 = 2 \times \dfrac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (1)
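A small sketch of how these metrics can be computed, here with scikit-learn as an illustrative choice, is shown below.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # Binary precision, recall, and F1-score as defined in (1), plus accuracy.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Example: compute_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```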
4. RESULTS
In this section, we present the fine-tuning main results of our implemented transformer-based lan-
guage models on the opinion mining task on the IMDb movie reviews dataset. Typically, all the fine-tuned
models perform well with fairly high performance, except the three autoregressive models: GPT, GPT-2,
and Reformer, as shown in Table 1. The best model, ELECTRA, provides an F1-score of 95.6 points, fol-
lowed by RoBERTa, Longformer, and DeBERTa, with an F1-score of 95.3, 95.1, and 95.1 points, respectively.
On the other hand, the worst model, GPT-2, provides an F1-score of 52.9 points, as shown in Figure 5 and
Figure 6. From the results, it is clear that purely autoregressive models do not perform well on comprehension
tasks like sentiment classification, where sequences may require access to bidirectional contexts for better word
representation and, therefore, good classification accuracy. In contrast, with autoencoding models taking advantage
of left and right contexts, we saw good performance gains. For instance, the autoregressive XLNet model is our
fourth best model in Table 1, with an F1-score of 94.9%; it incorporates modeling techniques from autoencoding
models into autoregressive models while addressing the limitations of encoders. The code and
fine-tuned models are available at [34].
Table 1. Transformer-based language models validation performance on the opinion mining IMDb dataset
Model Recall (%) Precision (%) F1 (%) Accuracy (%)
BERT 93.9 94.3 94.1 94.0
GPT 92.4 51.8 66.4 53.2
GPT-2 51.1 54.8 52.9 54.5
ALBERT 94.1 91.9 93.0 93.0
RoBERTa 96.0 94.6 95.3 95.3
XLNet 94.7 95.1 94.9 94.8
DistilBERT 94.3 92.7 93.5 93.4
XLM-RoBERTA 83.1 71.7 77.0 75.2
BART 96.0 93.3 94.6 94.6
ConvBERT 95.5 93.7 94.6 94.5
DeBERTa 95.2 95.0 95.1 95.1
ELECTRA 95.8 95.4 95.6 95.6
Longformer 95.9 94.3 95.1 95.0
Reformer 54.6 52.1 53.3 52.2
T5 94.8 93.4 94.0 93.9
5. ABLATION STUDY
In Table 2 and Figure 7, we demonstrate the importance of configuration choices through controlled
trials and ablation experiments. Indeed, the maximum length of the sequence and data cleaning are particularly
crucial. Thus, to make our ablation study credible, we fine-tuned our BERT model with the same setup,
changing only the maximum sequence length (max-len) in one experiment and cleaning the data (cd) in another
to observe how each affects the performance of the model.
Table 2. Validation results of the BERT model based on different configurations, where cd stands for cleaned
data, meaning that the latest model (BERTmax-len=384, cd ) is trained on an exhaustively cleaned text
Model Recall Precision F1 Accuracy
BERTmax-len=64 86.8% 84.7% 85.8% 85.6%
BERTmax-len=384 93.9% 94.3% 94.1% 94.0%
BERTmax-len=384, cd 92.6% 91.6% 92.1% 92.2%
As Table 2 shows, increasing the maximum sequence length from 64 to 384 tokens improves the F1-score by more
than 8 points, since longer inputs cover most reviews in full and the model can focus on the important features
during training. Conversely, the performance dropped by 2 F1 points when we cleaned the data for the BERT
model. Indeed, the cleaning carried
out aims to normalize the words of each review. It includes lemmatization to group together the different forms
of the same word, stemming to reduce a word to its root by stripping suffixes and prefixes, deletion of
URLs, punctuation, and patterns that do not contribute to the sentiment, as well as the elimination of all stop
words, except the words “no”, “nor”, and “not”, because their contribution to the sentiment can be tricky. For
instance, “Black Panther is boring” is a negative review, but “Black Panther is not boring” is a positive review.
This drop can be justified by the fact that BERT and other attention-based models need all the words of a sequence
to better capture the context of each word. However, with cleaning, words may be represented differently
from their meaning in the original sequence. Note that “not boring” and “boring” are opposite in
meaning, but if the stop word “not” is removed, we end up with two similar sequences, which is harmful in a
sentiment analysis context.
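A hypothetical reconstruction of this cleaning pipeline, using NLTK as an illustrative toolkit (the exact implementation may differ), is sketched below; note how the negation "not" is deliberately kept.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Hypothetical reconstruction of the cleaning used in the "cd" ablation:
# lower-casing, URL/punctuation removal, stop-word removal (keeping the
# negations "no", "nor", "not"), lemmatization, and stemming. Requires the
# NLTK "stopwords" and "wordnet" corpora to be downloaded beforehand.
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
negations = {"no", "nor", "not"}
stop_words = set(stopwords.words("english")) - negations

def clean(review: str) -> str:
    review = re.sub(r"https?://\S+", " ", review.lower())  # drop URLs
    review = re.sub(r"[^a-z\s]", " ", review)               # drop punctuation and digits
    tokens = [w for w in review.split() if w not in stop_words]
    return " ".join(stemmer.stem(lemmatizer.lemmatize(w)) for w in tokens)

print(clean("Black Panther is not boring!"))  # the negation "not" is preserved
```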
Figure 8. Distribution of the number of tokens for a better selection of the maximum sequence length
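The token-count distribution of Figure 8 can be approximated with a few lines, assuming the datasets and transformers libraries and the bert-base-uncased tokenizer as illustrative choices; inspecting the percentiles motivates the choice of max-len = 384.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Rough way to reproduce the token-count distribution of Figure 8 and motivate
# max-len = 384: tokenize every training review and inspect length percentiles.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
reviews = load_dataset("imdb")["train"]["text"]
lengths = sorted(len(tokenizer.encode(r, truncation=False)) for r in reviews)
print("median length:", lengths[len(lengths) // 2])
print("95th percentile:", lengths[int(0.95 * len(lengths))])
```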
6. CONCLUSION
In this paper, we presented a detailed comparison to highlight the main characteristics of transformer-
based pre-trained language models and what differentiates them from each other. Then, we studied their perfor-
mance on the opinion mining task. Thereby, we showed the power of fine-tuning and how it helps leverage
the pre-trained models’ knowledge to achieve high accuracy on downstream tasks, even with the biases they carry
from the pre-training data. Experimental results show how performant these models are. We observed
the highest F1-score with the ELECTRA model, with 95.6 points on the IMDb dataset. Similarly, we found
that access to both left and right contexts is necessary when it comes to comprehension tasks like sentiment
classification. We have seen that autoregressive models like GPT, GPT-2, and Reformer perform poorly and
fail to achieve high accuracy. Nevertheless, XLNet has reached good results even though it is an autoregressive
model, because it incorporates ideas taken from encoders, characterized by their bidirectional property. Indeed,
all performances were close, including that of DistilBERT, which achieves strong performance in less train-
ing time thanks to knowledge distillation. For example, for 4 epochs, BERT took 70 minutes to train, while
DistilBERT took 35 minutes, losing only 0.6 F1 points but saving half the time taken by BERT. Moreover, our
ablation study shows that the maximum length of the sequence is one of the parameters having a significant im-
pact on the final results and must be carefully analyzed and adjusted. Likewise, data quality is a must for good
performance: the data should require little processing, since extensive data cleaning may prevent
the model from capturing local and global contexts in sequences whose words have been removed or trimmed
during cleaning. Besides, we notice that the majority of the models we fine-tuned on the IMDb dataset start
to overfit after a certain number of epochs, which can lead to biased models. However, good quality data alone is
not enough; pre-training a model on large amounts of data and vocabulary from the target business problem may
help prevent it from making wrong predictions and may help it reach a high level of generalization.
ACKNOWLEDGMENTS
We are grateful to the Hugging Face team for their role in democratizing state-of-the-art machine
learning and natural language processing technology through open-source tools. Their commitment to pro-
viding valuable resources to the research community is highly appreciated, and we acknowledge their vital
contribution to the development of our article.
APPENDIX
Appendix for ”Analysis of the evolution of advanced transformer-based language models: experiments on opinion mining”.
Table A1. Summary and comparison of transformer-based models (L: number of layers, H: hidden size, A: attention heads)

GPT: L=12, H=512, A=12, attention type: global, total params: 110M, tokenization: byte-pair encoding, training data: BooksCorpus (800M words), computational cost: -, training objectives: autoregressive, decoder. Performance: zero-shot learning, text summarization, question answering. Short description: the first transformer-based autoregressive language model.

DeBERTa: L=12, H=768, A=12, attention type: global (disentangled), total params: 125M, tokenization: byte-pair encoding, training data: Wikipedia, BooksCorpus, Reddit content, Stories, computational cost: 10 days on 64 V100 GPUs, training objectives: autoencoding, disentangled attention mechanism, and enhanced mask decoder. Performance: DeBERTa was the first pre-trained model to beat human-level performance on the SuperGLUE benchmark [38]. Short description: DeBERTa uses RoBERTa with disentangled attention and an enhanced mask decoder to significantly improve model performance on many downstream tasks, while being trained on only half of the data used for the RoBERTa large version.

XLNet: L=12, H=768, A=12, attention type: global, total params: 110M, tokenization: SentencePiece, training data: Wikipedia, BooksCorpus, Giga5 [37], ClueWeb 2012-B, and Common Crawl, computational cost: 5.5 days on 512 TPU v3 chips, training objectives: autoregressive, decoder. Performance: XLNet achieved state-of-the-art results and outperformed BERT on 20 downstream tasks. Short description: XLNet incorporates ideas from transformer-XL [17] and addresses BERT’s pretrain-finetune discrepancy.

BART: L=12, H=768, A=16, attention type: global, total params: 139M, tokenization: byte-pair encoding, training data: Wikipedia, BooksCorpus, computational cost: -, training objectives: generative sequence-to-sequence, encoder-decoder, token masking. Performance: BART beats its predecessors on generation tasks such as translation and achieved state-of-the-art results. Short description: trained to map corrupted text to the original using an arbitrary noising function.
REFERENCES
[1] K. R. Chowdhary, “Natural language processing,” in Fundamentals of artificial intelligence, 2020, pp. 603–649, doi: 10.1007/978-
81-322-3972-7 19.
[2] M. Rhanoui, M. Mikram, S. Yousfi, A. Kasmi, and N. Zoubeidi, “A hybrid recommender system for patron driven library acquisition
and weeding,” in Journal of King Saud University-Computer and Information Sciences, 2020, vol. 34, no. 6, Part A, pp. 2809–2819,
doi: 10.1016/j.jksuci.2020.10.017.
[3] F. Z. Trabelsi, A. Khtira, and B. El Asri, “Hybrid recommendation systems: a state of art.,” in Proceedings of the 16th
International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE), 2021, pp. 281–288, doi:
10.5220/0010452202810288.
[4] B. Pandey, D. K. Pandey, B. P. Mishra, and W. Rhmann, “A comprehensive survey of deep learning in the field of medical imaging
and medical natural language processing: challenges and research directions,” in Journal of King Saud University-Computer and
Information Sciences, 2021, vol. 34, no. 8, Part A, pp. 5083–5099, doi: 10.1016/j.jksuci.2021.01.007.
[5] A. Harnoune, M. Rhanoui, M. Mikram, S. Yousfi, Z. Elkaimbillah, and B. El Asri, “BERT based clinical knowledge extraction for
biomedical knowledge graph construction and analysis,” in Computer Methods and Programs in Biomedicine Update, 2021, vol. 1,
p. 100042, doi: 10.1016/j.cmpbup.2021.100042.
[6] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: a survey,” in Ain Shams engineering
journal, 2014, vol. 5, no. 4, pp. 1093–1113, doi: 10.1016/j.asej.2014.04.011.
[7] S. Sun, C. Luo, and J. Chen, “A review of natural language processing techniques for opinion mining systems,” in Information
fusion, 2017, vol. 36, pp. 10–25, doi: 10.1016/j.inffus.2016.10.004.
[8] S. Yousfi, M. Rhanoui, and D. Chiadmi, “Mixed-profiling recommender systems for big data environment,” in First International
Conference on Real Time Intelligent Systems, 2017, pp. 79–89, doi: 10.1007/978-3-319-91337-7 8.
[9] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understand-
ing,” in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies - Proceedings of the Conference, 2019, vol. 1, pp. 4171–4186, doi: 10.18653/v1/N19-1423.
[10] A. Vaswani et al., “Attention is all you need,” in Proceedings of the 31st Conference on Neural Information Processing Systems,
Dec. 2017, pp. 5998–6008, doi: 10.48550/arXiv.1706.03762.
[11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised learning,” Pro-
ceedings of the 2018 Conference on Neural Information Processing Systems, 2018.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI
blog, vol. 1, no. 8, p. 9, 2019.
[13] T. B. Brown et al., “Language models are few-shot learners,” in Proceedings of the 34th International Conference on Neural
Information Processing Systems, 2020, vol. 33, pp. 1877–1901, doi: 10.48550/arXiv.2005.14165.
[14] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language
representations,” International Conference on Learning Representations, 2019, doi: 10.48550/arXiv.1909.11942.
[15] Y. Liu et al., “RoBERTa: a robustly optimized BERT pretraining approach,” arXiv:1907.11692, 2019, doi:
10.48550/arXiv.1907.11692.
[16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: generalized autoregressive pre-
training for language understanding,” in Advances in neural information processing systems, 2019, pp. 5753–5763, doi:
10.48550/arXiv.1906.08237.
[17] Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: attentive language models beyond a fixed-length
context,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Jul. 2019, pp. 2978–2988,
doi: 10.18653/v1/P19-1285.
[18] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv
2019,” in arXiv preprint arXiv:1910.01108, 2019, doi: 10.48550/arXiv.1910.01108.
[19] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in arXiv preprint arXiv:1503.02531, vol. 2,
no. 7, Mar. 2015, doi: 10.48550/arXiv.1503.02531.
[20] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” arXiv:1901.07291, 2019, doi:
10.48550/arXiv.1901.07291.
[21] A. Conneau et al., “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020, pp. 8440–8451, doi: 10.18653/v1/2020.acl-main.747.
[22] M. Lewis et al., “Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehen-
sion,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880, doi:
10.18653/v1/2020.acl-main.703.
[23] Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan, “ConvBERT: Improving BERT with span-based dynamic convo-
lution,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, p. 12, doi:
10.48550/arXiv.2008.02496.
[24] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: the efficient transformer,” arXiv:2001.04451, 2020, doi:
10.48550/arXiv.2001.04451.
[25] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer.,” in Journal of Machine Learning
Research, 2020, vol. 21, no. 140, pp. 1–67, doi: 10.48550/arXiv.1910.10683.
[26] K. Clark, M.-T. Luong, Q. V Le, and C. D. Manning, “Electra: pre-training text encoders as discriminators rather than generators,”
arXiv:2003.10555, p. 18, 2020, doi: 10.48550/arXiv.2003.10555.
[27] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: the long-document transformer,” arXiv:2004.05150, 2020, doi:
10.48550/arXiv.2004.05150.
[28] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: decoding-enhanced BERT with disentangled attention,” arXiv:2006.03654, 2020,
doi: 10.48550/arXiv.2006.03654.
[29] S. Singh and A. Mahmood, “The NLP cookbook: modern recipes for transformer based deep learning architectures,” in IEEE
Access, 2021, vol. 9, pp. 68675–68702, doi: 10.1109/access.2021.3077350.
Analysis of the evolution of advanced transformer-based language models: ... (Nour Eddine Zekaoui)
2010 r ISSN: 2252-8938
[30] Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” in arXiv
preprint arXiv:1609.08144, 2016, doi: 10.48550/arXiv.1609.08144.
[31] T. Kudo and J. Richardson, “SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural
Text Processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demon-
strations, 2018, pp. 66–71, doi: 10.18653/v1/D18-2012.
[32] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, vol. 1, pp. 1715–1725, doi:
10.18653/v1/P16-1162.
[33] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Jun. 2011, vol. 1,
pp. 142–150.
[34] N. E. Zekaoui, “Opinion transformers.” 2023, [Online]. Available: https://github.com/zekaouinoureddine/Opinion-Transformers
(Accessed Jan. 2, 2023).
[35] Y. Zhu et al., “Aligning books and movies: towards story-like visual explanations by watching movies and reading books,” in Pro-
ceedings of the IEEE International Conference on Computer Vision, 2015, vol. 2015 Inter, pp. 19–27, doi: 10.1109/ICCV.2015.11.
[36] T. H. Trinh and Q. V Le, “A simple method for commonsense reasoning,” in arXiv:1806.02847, 2018, doi:
10.48550/arXiv.1806.02847.
[37] R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword fifth edition, linguistic data consortium,” 2011, doi:
10.35111/wk4f-qt80.
[38] A. Wang et al., “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” in Advances in neural
information processing systems, 2019, vol. 32, doi: 10.48550/arXiv.1905.00537.
[39] A. Gokaslan and V. Cohen, “OpenWebText Corpus,” 2019. http://skylion007.github.io/OpenWebTextCorpus (Accessed Jan. 2,
2023).
[40] R. Zellers et al., “Defending against neural fake news,” Advances in Neural Information Processing Systems, vol. 32, p. 12, 2019,
doi: 10.48550/arXiv.1905.12616.
BIOGRAPHIES OF AUTHORS
Nour Eddine Zekaoui holds an Engineering degree in Knowledge and Data Engineer-
ing from School of Information Sciences, Morocco in 2021. He is currently a Machine Learn-
ing Engineer in a tech company. His research focuses on the areas of natural language process-
ing and artificial intelligence, including information retrieval, question answering, semantic simi-
larity, and bioinformatics. He can be contacted at email: [email protected] or [email protected].
Siham Yousfi is a Professor of Computer Sciences and Big Data at the School of Informa-
tion Sciences, Rabat since 2011. She is a PhD holder from Mohammadia School of engineering of
Mohammed V University in Rabat (2019). Her research interests include big data, natural language
processing and artificial intelligence. She can be contacted at email: [email protected].