
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3365742


A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges

MOHAIMENUL AZAM KHAN RAIAAN1, MD. SADDAM HOSSAIN MUKTA1, KANIZ FATEMA2, NUR MOHAMMAD FAHAD1, SADMAN SAKIB1, MOST. MARUFATUL JANNAT MIM1, JUBAER AHMAD1, MOHAMMED EUNUS ALI3, and SAMI AZAM2

1 Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
2 Faculty of Science and Technology, Charles Darwin University, Australia
3 Department of CSE, Bangladesh University of Engineering and Technology (BUET), Dhaka-1000, Bangladesh

Corresponding author: Md. Saddam Hossain Mukta (e-mail: [email protected]).

ABSTRACT Large Language Models (LLMs) have recently demonstrated extraordinary capabilities in natural language processing (NLP) tasks such as language translation, text generation, and question answering. LLMs have become an essential part of computerized language processing, as they are able to understand complex verbal patterns and generate coherent, contextually appropriate replies. Although this success has prompted a substantial increase in research contributions, the rapid growth of the field has made it difficult to grasp the overall impact of these developments, and the research community would therefore benefit from a concise yet thorough review of recent advances in this area. This article provides such an overview of LLMs, covering their history, architectures, transformers, resources, training methods, applications, impacts, and challenges. The paper begins by discussing the fundamental concepts of LLMs together with the traditional pipeline of the LLMs training phase. It then provides an overview of existing works, the history of LLMs, their evolution over time, the architecture of transformers in LLMs, the different resources of LLMs, and the training methods that have been used to train them, and it describes the datasets utilized in the surveyed studies. The paper then discusses the wide range of LLMs applications, including biomedical and healthcare, education, social, business, and agricultural domains. It also illustrates how LLMs shape society and the future of AI and how they can be used to solve real-world problems, and it explores the open issues and challenges involved in deploying LLMs in real-world scenarios. Our review aims to help practitioners, researchers, and experts thoroughly understand the evolution of LLMs, their pre-trained architectures, applications, challenges, and future goals.

INDEX TERMS Large Language Models, Natural Language Processing, Evolution, Transformer, Pre-
trained models, Taxonomy, Application

I. INTRODUCTION
Language is a remarkable tool for human expression and communication, one that begins to emerge in infancy and matures throughout a lifetime [1], [2]. Nevertheless, machines are unable to understand and speak human language without the help of sophisticated artificial intelligence (AI) [3]. Therefore, a long-standing scientific challenge and aim has been to achieve human-like reading, writing, and communication skills in machines [4]. However, advances in deep learning approaches, the availability of immense computational resources, and the availability of vast quantities of training data have all contributed to the emergence of large language models (LLMs): a category of language models that utilize neural networks containing billions of parameters, trained on enormous quantities of unlabeled text data using a self-supervised learning approach


[5]. It is considered a huge step forward in natural language processing (NLP) and AI [6]. Frequently pre-trained on large corpora from the web, these models may learn complicated patterns, language subtleties, and semantic linkages. Besides, they have proved their ability in various language-related tasks, including text synthesis, translation, summarization, question-answering, and sentiment analysis, by leveraging deep learning techniques and large datasets. Moreover, fine-tuning these models on specific downstream tasks has been quite promising, with state-of-the-art performance in several benchmarks [7]. LLMs have their roots in the early development of language models and neural networks. Statistical approaches and n-gram models were used in earlier attempts to develop language models [8], but these models have shortcomings in expressing long-term interdependence and context in language. After that, researchers began to explore more complex approaches with the development of neural networks and the availability of larger datasets. The creation of the Recurrent Neural Network (RNN) [9], which allowed for the modeling of sequential data, including language, was a crucial milestone. However, RNNs were limited in their efficacy due to vanishing gradients and long-term dependencies. The significant advancement in LLMs systems occurred when the transformer architecture was introduced in the seminal work [10]. The transformer model is built around the self-attention mechanism, enabling parallelization and efficient handling of long-range dependencies. Furthermore, it served as the basis for models such as Google's Bidirectional Encoder Representations from Transformers [11] and OpenAI's Generative Pre-trained Transformer (GPT) series, which excelled at various language tasks.

FIGURE 1. Pipeline of the LLMs training phase

The pipeline of the basic LLMs architecture is shown in Figure 1. It receives text data from multiple sources and then forwards it to the subsequent stage for preprocessing. It then completes its training process by executing a series of stages, including random parameter initialization, numerical data input, loss function calculation, parameter optimization, and iterative training. Following the training phase, such models offer text translation, text summarization, sentiment analysis, and other services.
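To make the Figure 1 pipeline concrete, the following is a minimal, illustrative PyTorch-style sketch of the stages named above: random parameter initialization, feeding numerical (tokenized) data, loss calculation, parameter optimization, and iteration. It is a toy next-token model with placeholder random data, not the training code of any specific LLM.

    import torch
    import torch.nn as nn

    vocab_size, d_model = 1000, 64
    model = nn.Sequential(                      # randomly initialized parameters
        nn.Embedding(vocab_size, d_model),
        nn.Linear(d_model, vocab_size),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):                               # iterative training
        tokens = torch.randint(0, vocab_size, (8, 33))    # stand-in for tokenized text
        inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token
        logits = model(inputs)                            # (batch, seq, vocab)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                   # loss calculation -> gradients
        optimizer.step()                                  # parameter optimization

A production-scale run differs mainly in the model size, the data pipeline, and the distributed training strategy, not in the overall shape of this loop.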
Prior research has shown the potential of LLMs in many NLP tasks, including specialized applications in domains such as the medical and health sciences [12] and politics [13]. Moreover, after the invention of the most sophisticated GPT models [14], the development of state-of-the-art models (LLaMA and Bard [15]), and the exploration of their capabilities through derivatives such as Alpaca and GPTHuggingface [16], it has become a crucial and impactful domain. As a result, a trustworthy assessment of current LLMs research is becoming increasingly important; prior research has shown the potential and superiority of LLMs in NLP tasks, yet only a few studies have thoroughly reviewed the latest LLMs developments, possibilities, and limitations.

Besides, researchers have presented various aspects of the LLMs domain in several studies [3], [17]-[19], but their work still has several limitations. For example, there is a lack of introduction to the core architecture and configurations of the LLMs model, a lack of proper explanation of the taxonomy of LLMs, of differentiation based on ML, of domain-specific applications, of API applications, and of descriptions of LLMs datasets. Furthermore, the vast majority of LLMs review papers are not peer-reviewed works. The absence of these key points indicates that a thorough investigation is missing in the current literature. Given the significant extent of these constraints, it is possible to mitigate them by thoroughly analyzing and successfully resolving them. Thus, the motivation of this paper is to comprehensively explore the current review papers, identify their limitations, and outline the current state-of-the-art methods that address these vital challenges. Therefore, our primary objective is to explore, comprehend, and evaluate LLMs, encompassing domains, evolution, classification, the structure of pre-trained models, resources, and real-time applications. Additionally, our comprehensive review discusses open issues and challenges associated with LLMs, including security, ethical, privacy, economic, and environmental considerations. In addition, we present a set of guidelines to explore future research and development in the effective use of LLMs. We hope that this study will contribute to a better understanding and use of LLMs. The list of contributions of this paper is as follows:

• Providing a complete overview of LLMs, including their evolution, classification, and transformer architecture. The history of LLMs provides a brief account of their evolution from its origins (1940) to the present (2023), as well as a taxonomy of LLMs based on pre-trained and API-based models and major LLMs structures.
• Describing the comparison of different pre-trained model designs in LLMs, along with their own systems that show how the model architectures differ.
• Explaining the influence of ML models on LLMs, demonstrating the significance of ML in various LLMs domains.
• Providing a brief overview of the datasets used in the training phase to differentiate between the models in existing works.
• Presenting a thorough explanation of the hardware implementation in terms of LLMs.
• Defining insights into the potential of LLMs and their impact on society and demonstrating applications in five practical domains, including bio-medical and healthcare, education, social media, business, and agriculture.
• Investigating LLMs' diverse set of open issues, challenges, and future opportunities. This section focuses on identifying key challenges and future opportunities that can aid in advancing knowledge in this area.

FIGURE 2. Section organization of the review

The remaining sections of the paper are organized as depicted in Figure 2. In Section II, the literature review is discussed. Section III illustrates the history of LLMs; Section IV describes the methodology; Section V explains the core concepts of large language models; Section VI describes the resources of LLMs; Section VII demonstrates the domain-specific applications of LLMs; Section VIII explains the societal impact of LLMs; the industrial significance of LLMs is highlighted in Section IX; Section X discusses the open issues and challenges regarding this study; Section XI discusses the future prospects of LLMs; Section XII acknowledges the limitations; and Section XIII concludes the paper.

II. LITERATURE REVIEW
The growing number of LLMs is an extraordinary development in AI. In recent years, the prevalence of these models has skyrocketed, and numerous studies have been conducted to investigate and evaluate their expanding capabilities. Researchers from various fields have conducted exhaustive studies on the rise of LLMs, shedding light on their remarkable advancements, diverse applications, and potential to revolutionize tasks from text generation and comprehension to demonstrating reasoning skills. Collectively, these studies contribute to our comprehension of LLMs' significant role in shaping the landscape of AI-driven language processing and problem-solving.

Huang et al. [17] presented a study on reasoning in large language models that comprehensively summarizes the current state of LLMs reasoning capabilities. It examines various aspects of reasoning in LLMs, such as techniques to enhance and extract reasoning abilities, methodologies and criteria for assessing these abilities, insights from prior research, and suggestions for future directions. The primary concern is the extent to which LLMs can demonstrate reasoning skills. This paper aims to provide an in-depth and up-to-date examination of this topic, fostering fruitful discussions and guiding future research in LLMs-based reasoning. In another study, Zhao et al. [3] survey LLMs and illustrate a comprehensive examination of the evolution and impact of

TABLE 1. Comparison between state-of-the-art research. Coverage columns, in order: LLMs Model | LLMs API | LLMs Dataset | Domain-Specific LLMs | ML-Based Comparison (Domain Specific) | LLMs Taxonomy | LLMs Architecture | LLMs Configurations | Performance of LLMs | Parameters and Hardware Specification.

Huang et al. (2022) [17]
Coverage: ✓ X X X X X X X X X
Scope: Reasoning in LLMs
Key findings: Aims to provide a critical analysis of LLMs capabilities, methods for improving and evaluating reasoning, conclusions from earlier research, and future directions.
Methodology and approach: Review and analysis of reasoning abilities in LLMs.

Zhao et al. (2023) [3]
Coverage: ✓ X ✓ X ✓ X ✓ X X X
Scope: Evolution and impact of LLMs
Key findings: Explores the historical journey of LLMs, including pre-trained language models (PLMs); discusses LLMs' unique capabilities; offers insights into LLMs development resources; and highlights significant contributions of LLMs to AI and NLP research areas.
Methodology and approach: Survey and analysis of LLMs evolution and impact.

Fan et al. (2023) [18]
Coverage: ✓ X X X X X X X X X
Scope: Bibliometric review of LLMs research
Key findings: Presents a comprehensive overview of LLMs research from 2017 to 2023, tracking research trends and advancements, and provides insights into the dynamic nature of LLMs research and its impact in various domains.
Methodology and approach: Bibliometric analysis of over 5,000 LLMs publications.

Chang et al. (2023) [19]
Coverage: ✓ X ✓ X ✓ X X X X X
Scope: Assessment of LLMs
Key findings: Investigates the methodologies employed in evaluating LLMs, with a specific focus on what, where, and how to conduct evaluations, and identifies potential risks and future challenges.
Methodology and approach: Survey and analysis of LLMs evaluation approaches.

OURS
Coverage: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Scope: Detailed review on LLMs
Key findings: Our research investigates the history, resources, architectural configurations, domain-specific analysis, ML-based differentiation, broad-level open issues, challenges, and future scope of large language models.
Methodology and approach: Broad review and analysis of LLMs considering all the key aspects.

LLMs in the field of artificial intelligence and natural language processing. It traces the historical journey from early language models to the recent emergence of pre-trained language models (PLMs) with billions of parameters. Notably, the paper discusses LLMs' unique capabilities as they scale in size, including in-context learning. The authors highlight the significant contributions of LLMs to the AI community and the launch of ChatGPT, a prominent AI chatbot powered by LLMs. The survey is structured around four key aspects of LLMs: pre-training, adaptation tuning, utilization, and capacity evaluation. Additionally, the paper provides insights into available resources for LLMs development and identifies further research and development areas.

A recent study by Fan et al. [18] conducted a bibliometric review of LLMs research from 2017 to 2023, encompassing over 5,000 publications. The study aims to provide researchers, practitioners, and policymakers with an overview of the evolving landscape of LLMs research. It tracks research trends during the specified time period, including advancements in fundamental algorithms, prominent NLP tasks, and applications in disciplines such as medicine, engineering, the social sciences, and the humanities. In addition to highlighting the dynamic and swiftly changing nature of LLMs research, the study offers insights into its current status, impact, and potential in the context of scientific and technological advancements. Another study by Chang et al. [19] focuses on the assessment of LLMs. Their research examines the increasing prevalence of LLMs in academia and industry due to their exceptional performance in various applications. It highlights the growing significance of evaluating LLMs at both the task and societal levels in order to comprehend potential risks. The paper thoroughly analyzes LLMs evaluation methods, focusing on three critical dimensions: what to evaluate, where to evaluate, and how to evaluate. It includes tasks such as natural language processing, reasoning, medical applications, ethics, and education. The article examines evaluation methods and benchmarks for assessing LLMs performance, emphasizing successful and unsuccessful cases. It underlines future challenges in LLMs evaluation and emphasizes the significance of evaluating LLMs as a fundamental discipline to support the development of more competent LLMs.

Table 1 illustrates the comparison between different review papers based on critical factors such as LLMs model, LLMs API, LLMs dataset, domain-specific LLMs, ML-based comparison of LLMs, taxonomy, LLMs architecture, LLMs performance, LLMs hardware specifications, and LLMs configurations. Huang et al. [17] lack information on LLMs API, LLMs dataset, domain-specific LLMs, taxonomy, LLMs architecture, and LLMs configurations. In contrast, Zhao et al. [3] lack information on LLMs API, domain-specific LLMs, taxonomy, LLMs architecture, and LLMs configurations. Moreover, Fan et al. [18] and Chang et al. [19] lack information on LLMs API, domain-specific LLMs, taxonomy, LLMs architecture, and LLMs configurations. On the contrary, our research offers a considerably broader perspective on the LLMs context. In addition to incorporating every parameter specified in the table, we provide an elaborate account of the hardware implementation and LLMs datasets. Previous research frequently focuses on particular aspects of LLMs, such as their historical development, bibliometric patterns, or assessment approaches; our investigation surpasses these limitations. A thorough examination is conducted on each of these aspects, resulting in a comprehensive understanding of the strengths and weaknesses of LLMs. Furthermore, our research addresses the crucial element of reasoning capabilities in LLMs, thereby providing a significant addition to the body of knowledge in the field. By giving thorough information, such as descriptions of datasets and hardware implementations, our paper stands out as an innovative resource for LLMs practitioners and researchers. Furthermore, we briefly discuss open issues in LLMs research, such as ethical and responsible AI, multimodal integration, energy

efficiency, privacy and data protection, generalization and few-shot learning, and cross-lingual and low-resource settings. We also highlight key challenges, including data complexity and scale, tokenization sensitivity, computational resource demands, fine-tuning complexity, real-time responsiveness, contextual constraints, bias and undesirable output, knowledge temporality, and evaluation complexity. Our review suggests future research directions to tackle these issues, making it a pioneering resource for LLMs researchers and practitioners. Our extensive systematic review, which encompasses a broad spectrum of subjects, makes a substantial contribution to the field of LLMs research, and our study addresses the constraints associated with current cutting-edge research in this domain.

III. HISTORY OF LARGE LANGUAGE MODELS
LLMs refer to a category of AI models developed specifically to comprehend and produce human language [20]. LLMs have significantly transformed the field of AI and have been implemented in diverse areas, including education, communication, content generation, article composition, healthcare, research, entertainment, and information dissemination, among others [20], [21]. The origins of LLMs can be attributed to the emergence and advancement of neural network-based methodologies in the field of natural language processing (NLP) [21]. In order to process language, early NLP systems utilized rule-based techniques and statistical models. However, they frequently encountered difficulties in comprehending the textual context and producing cohesive and contextually pertinent discourse [22]. This section provides a high-level overview of LLMs, including their background, development, training, and functioning. Figure 3 depicts the history of language models.

FIGURE 3. Brief history of language models.

In the 1940s, Warren McCulloch and Walter Pitts introduced the world to the idea of artificial neural networks (ANNs) [23]. Following this, the 1950s and 1960s saw the development of the first language models [24]. These models included early neural networks as well as rule-based models. The processing of language was facilitated by their utilization of precisely established linguistic rules and features [25]. These models experienced limitations in their abilities and encountered difficulties in managing the complexities of complicated language assignments. The models were predominantly employed for tasks involving binary classification, and their efficacy in dealing with the intricacies inherent in natural language was limited [25].

Statistics-based models of language were created in the 1980s and 1990s. These models belong to a category of models utilized in the fields of NLP and machine learning (ML) with the purpose of capturing and quantifying the statistical patterns and correlations within language data [22]. The models employed probabilistic techniques to assess the probability of a sequence of words or phrases inside a specific context. Statistical language models have significance in several applications, such as predictive text input, text generation, speech recognition, and spam detection. They were superior in terms of accuracy to early neural networks and rule-based models, as they were able to process large amounts of data with ease [22]. Although statistical language models have been successful in many applications of NLP, they still have limits when it comes to comprehending the semantic relationships and context of language, and they have difficulty dealing with long-range dependencies [26].

During the mid-2000s, the field of NLP witnessed the introduction of word embeddings, which were recognized as a notable breakthrough and subsequently acquired considerable attention [27]. Word embedding refers to the process of representing words in a continuous vector space. The approach captures the semantic relationships among words by representing them in a vector space, and this representation reduces the computational cost by mapping the words to a lower-dimensional space. Word2Vec and GloVe are widely recognized word embedding models in the domain [28]. These models are mostly utilized for assessing word similarity and assisting in the clustering and representation of words within semantic domains. Although not classified as LLMs, these embeddings have significantly contributed to the progress of natural language comprehension and have set the path for the development of more complex models. Nevertheless, these models possess several limitations, such as their difficulty in effectively dealing with words that have many meanings or words that sound the same, as well as their inability to comprehend contextual information [27].

The introduction of neural language models in the mid-2010s marked a significant advancement in large language modeling [29]. These models employed deep learning approaches to acquire knowledge of language patterns from extensive textual data and additionally utilized artificial neural networks to comprehend, produce, or forecast human language. Furthermore, they have demonstrated exceptional outcomes in a wide range of language-related tasks. The initial neural language model to be introduced was the recurrent neural network language model (RNNLM) in 2010 [30]. The purpose of its development was to capture the sequential dependencies present in textual data. The utilization of a hidden state allows for the retention and propagation of information from preceding words in a particular sequence. RNNLM has been employed in several applications such as text production, speech recognition, machine translation, and language modeling. The RNNLM demonstrated the capability to effectively capture the contextual information of words, resulting in the generation of text that exhibits a higher degree of naturalness compared to earlier models. Although the RNNLM offers certain advantages, it is not without its drawbacks. Some of these limitations include a limited short-term memory capacity, extended training time requirements, and vulnerability to overfitting [31].

In 2015, Google unveiled the first large neural language model that employed deep learning methodologies. The technology was referred to as the Google Neural Machine Translation (GNMT) model [32]. The model underwent training using huge quantities of multilingual textual

data. This development signified a notable progression in the field of machine translation [33]. The model demonstrated exceptional performance on machine translation tasks, departing from traditional rule-based and statistical techniques in favor of neural network-based methodologies. When compared to earlier language models, it was able to tackle complex natural language tasks with ease. The utilization of this model resulted in enhanced translation accuracy and the generation of meaningful translations, while also mitigating errors associated with intricate linguistic constructions [32].

The advancement of language models persisted with the emergence of the Transformer model in 2017 [34]. The transformer model has had a significant impact on the field of NLP and has played a crucial role in the development of language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT) [35]. These models employ a self-attention mechanism that enables them to assess the relative significance of individual words in a sentence, thereby encoding complex relationships within the text [35]. The primary objective behind the development of the Transformer model was to overcome the inherent constraints observed in earlier models such as RNNs and Long Short-Term Memory (LSTM) networks. The Transformer models possess notable advantages in comparison to other models due to their ability to capture longer-term dependencies in language and facilitate concurrent training on many Graphical Processing Units (GPUs) with a vast number of parameters, enabling the construction of much larger models [36]. Parallelization capabilities and scalability are further benefits that have resulted in notable progress across many NLP activities [34].

The introduction of BERT in 2018 by Google AI represents a noteworthy advancement in the domain of NLP [18]. The underlying framework utilized in this work was the transformer architecture. Before the introduction of BERT, the preceding language models rooted in NLP had constraints in understanding contextual information due to their reliance on unidirectional language modeling. BERT was introduced by Google as a solution to address this particular constraint [37]. The employed methodology involved the utilization of deep bidirectional representations, which were conditioned on both the left and right contexts across all layers [38]. The pre-trained BERT model was able to undergo fine-tuning by incorporating an additional output layer, hence enabling its applicability to diverse tasks such as question answering and language inference. Due to the widespread adoption of BERT, several versions and subsequent models, such as RoBERTa, T5, and DistilBERT, have been developed to effectively address diverse tasks across multiple domains [38].

Following the advent of transformers, subsequent years saw the development of scaled-up LLMs through the expansion of training data and parameter counts [21]. OpenAI significantly contributed to the development of LLMs in 2018. During the same year, GPT, an additional transformer-based architecture, was developed. Multiple iterations of the GPT models, developed by OpenAI, underwent pre-training using extensive datasets comprising excerpts from the Internet, novels, and various other textual sources [39]. The first version of the GPT model was referred to as GPT-1 [40]. The introduction of GPT-1 was a notable progression in the field of NLP. GPT-1 effectively produces words that are contextually appropriate, showcasing the transformative capabilities of transformers in significantly advancing natural language processing tasks. This proficiency is attributed

to its extensive training on a vast number of parameters, specifically 117 million. The model underwent a two-step procedure consisting of unsupervised pre-training followed by supervised fine-tuning [21]. The initial iteration of GPT did not attain the same level of popularity as BERT due to several inherent limitations [41]. These drawbacks include a restricted context window, absence of bi-directionality, and occasional generation of biased content. Despite the inherent limits of GPT-1, this model played a crucial role in paving the way for later, more advanced models. As a result, it sparked a new era of AI research and intensified competition in the development of LLMs.

The subsequent version of the GPT series, known as GPT-2, was designed with the purpose of addressing the limitations observed in its predecessor, GPT-1 [41]. Similar to GPT-1, GPT-2 was developed utilizing the Transformer architecture. In 2019, Alec Radford introduced GPT-2, a language model that was developed on a deep neural network consisting of 1.5 billion parameters [42]. The GPT-2 model includes a transformer design, which incorporates self-attention processes to extract information from different positions within the input sequence. Despite the high computing cost associated with training and executing the model, its substantial magnitude facilitates the comprehension and generation of a wide range of linguistic subtleties and diversified outputs [41]. The GPT-2 model has played a pivotal role in the advancement of LLMs and the execution of NLP activities. The influence of GPT-2 has had a significant impact on successor models like GPT-3 and GPT-4, leading to additional advancements in the field of language processing and generation [43].

In 2019, NVIDIA produced Megatron-LM, which is an LLM [44]. Similar to GPT, this model is built on the transformer architecture. The model possesses a total of 8.3 billion parameters, a notably larger quantity compared to the parameter counts of GPT-1 and GPT-2 [18]. This magnitude facilitates the model's capacity to acquire and produce intricate linguistic structures. Nevertheless, Megatron-LM has certain limitations, primarily due to its substantial size, which necessitates substantial computational resources for both the training and inference processes [44].

In 2020, OpenAI introduced GPT-3 as the successor to GPT-2 [41]. GPT-3 was trained on an extensive collection of textual data and demonstrated the ability to generate text that exhibited a high degree of coherence and naturalness. Similar to GPT-1 and GPT-2, this model also utilizes the Transformer architecture [21]. The potential of LLMs for various NLP applications was exemplified by GPT-3. This particular LLM was trained on a deep neural network with an enormous 175 billion parameters, surpassing the size of any other LLM available at that particular time [18]. The ability to produce natural language text of superior quality with less fine-tuning is facilitated by sophisticated methodologies, including a more significant number of layers and a wider range of training data. One of the most essential characteristics of GPT-3 is its capacity to engage in few-shot and zero-shot learning, hence mitigating the necessity for extensive data in order to generate natural language text of superior quality. The advent of GPT-3 has catapulted the domain of natural language processing to new heights [41].

In 2023, OpenAI introduced GPT-4, the subsequent version of their language model, following the achievements of GPT-3 [21]. Similar to its predecessor, GPT-4 is a transformer-based model. The system has the capability to analyze both textual and visual data to produce textual outputs [18]. The performance of the system was assessed using a range of standardized professional and academic examinations specifically intended for human test-takers. GPT-4 exhibited a level of performance comparable to that of humans on the majority of examinations. Significantly, it achieved a ranking inside the highest decile of participants on a simulated iteration of the Uniform Bar Examination [45]. GPT-4 has greater size and efficacy compared to its predecessor, GPT-3, as it possesses the capacity to generate text that is even more comprehensive and exhibits a heightened level of naturalness [21].

The development of large language models presents additional prospects for innovation, knowledge acquisition, and experimentation across diverse domains such as healthcare, education, and research. The utilization of AI and NLP in these models has significantly transformed how we engage with machines.

IV. METHODOLOGY
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline is crucial for drafting review papers, as it assists systematic reviews in conducting transparent meta-analyses, accurately reporting the aims and conclusions of the study, and ensuring adequate reliability and relevance of the findings [46]. Therefore, this review work adopts the PRISMA technique in analyzing the design, configurations, applications, and challenges of LLMs.

Initial Searching: The research materials employed in this study have been acquired from recognized scientific journals and conferences from January 2020 to August 2023, through the Google Scholar platform. A comprehensive selection of scholarly research articles has been specified, encompassing various reputable academic sources such as IEEE Xplore, ScienceDirect, ACM Digital Library, Wiley Online Library, Springer Link, MDPI, and patents. Initially, 355 papers were selected based on their relevance to the topic and keywords. Table 2 describes the identification of the materials from various electronic sources.

Searching Query and Keywords: Using a combination of the appropriate search queries and keywords, enlisted in Table 3, helps to perform a proper literature search. To conduct a thorough search of the articles for our LLMs-based review work, we use the following terms: "LLMs AND machine learning OR deep learning OR models," "LLMs AND machine learning OR deep learning OR API," "LLMs

AND machine learning OR deep learning OR Dataset," and "LLMs AND machine learning OR deep learning OR tools." These specific search techniques help to extract eligible, high-quality research papers.

TABLE 2. Electronic database search

Electronic Database | Type | URL
IEEE Xplore | Digital Library | https://ieeexplore.ieee.org/Xplore/home.jsp (accessed on 18 September, 2023)
Springer | Digital Library | https://www.springer.com/gp (accessed on 18 September, 2023)
Google Scholar | Search Engine | https://scholar.google.com.au (accessed on 18 September, 2023)
Science Direct—Elsevier | Digital Library | https://www.sciencedirect.com (accessed on 18 September, 2023)
MDPI | Digital Library | https://www.mdpi.com (accessed on 18 September, 2023)
ACM | Digital Library | https://www.researchgate.net (accessed on 18 September, 2023)

TABLE 3. Search queries used for the review paper.

SQ1 | "LLMs" AND machine learning OR deep learning OR models
SQ2 | "LLMs" AND machine learning OR deep learning OR API
SQ3 | "LLMs" AND machine learning OR deep learning OR Dataset
SQ4 | "LLMs" AND machine learning OR deep learning OR tools

TABLE 4. Inclusion and exclusion criteria.

Inclusion Criteria (IC)
IC1 | Should contain at least one of the keywords
IC2 | Must be included in one of the selected databases
IC3 | Published within the last three years (2020–2023)
IC4 | Publication in a journal or conference is required
IC5 | The research being examined should have a matching title, abstract, and full text
Exclusion Criteria (EC)
EC1 | Redundant items
EC2 | Whole text of the paper cannot be obtained
EC3 | Purpose of the paper is not related to LLMs
EC4 | Non-English documents

FIGURE 4. PRISMA flow diagram of the review.

Inclusion and Exclusion Criteria Set: To acquire the final research papers, PRISMA protocols and principles were adhered to in formulating a standard set of Inclusion Criteria (IC) and Exclusion Criteria (EC). The inclusion criteria define the standards a paper needs to meet in order to be included, while the exclusion criteria eliminate articles that do not meet the inclusion scope. Thus, this manual screening process improves the transparency of the selection process. Table 4 presents the inclusion and exclusion criteria set for the proposed study.
PRISMA Diagram: Figure 4 depicts the PRISMA flow diagram utilized in selecting papers for the study. It also provides the numbers of included and excluded papers for better understanding. The diagram begins by identifying
articles from electronic databases using keywords, queries,
resulting in 355 papers. After applying the screening method
to exclude duplicated, low-quality, and irrelevant journal eration, text analysis, translation, sentiment analysis, ques-
papers, the total number of papers for review is reduced to tion answering, and other related functions. GPT-3, GPT-
294. Following a thorough analysis of the titles and abstracts, 4, PaLM, and LaMDA are extensively used transformer-
a total of 207 papers were selected. The final screening based LLMs models trained on a large amount of textual
method involves the application of inclusion and exclusion data. In terms of architectural properties, these models show
criteria. Following this process, a total of 135 papers were variations in size and depth. For example, GPT-3 generates
ultimately selected for the final review. The process begins parameters of 175 billion, distributed across 96 levels, while
with an extensive collection of papers and reduces to the final PaLM has an even larger parameter number of 540 billion,
selection that meets the pre-defined selection criteria for the organized across 106 layers. All of these models have distinct
systematic review. configurations. The configurations of GPT-3 and PaLM differ
in terms of their techniques for generating output. LLMs
V. LARGE LANGUAGE MODELS have evaluated several datasets within Wikipedia, code repos-
Large language models (LLMs) refer to a specific type of AI itories, books, question sets, and social media data. They
algorithm that holds the capability to execute a diverse range have demonstrated their ability to execute diverse activities
of NLP operations. The most common tasks entail text gen- successfully. Consequently, LLMs have drawn significant

attention for their effective contribution in different domains, including education, healthcare, media marketing, and other customer services. A particular LLMs program may have superior performance in a specific domain compared to others: for example, GPT-3 has gained recognition for its proficiency in generating text in different styles, whereas LaMDA demonstrates superior performance in providing accurate responses to factual inquiries. LLMs are an emerging technological innovation that holds the potential to bring about transformative changes across various sectors.

A. BACKGROUND OF LARGE LANGUAGE MODELS
In this section, the necessary background to comprehend the essential aspects associated with LLMs is presented. Large language model research requires a comprehensive explanation of several crucial concepts. Various vital aspects, such as tokenization, encoding techniques, layer normalization, etc., are covered in the following background section.
Tokenization: The primary emphasis is on tokenization, a crucial preprocessing stage of LLMs that involves parsing text into discrete units referred to as tokens [47]. Characters, subwords, symbols, or words may serve as tokens, contingent upon the language model's size and nature [48], [49]. Various tokenization algorithms are utilized in LLMs, such as WordPiece, UnigramLM, and Byte Pair Encoding (BPE). Each algorithm has a distinct technique for tokenizing the input, and the resulting tokens are then used for the specific tasks [48]-[50].
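To make the idea concrete, the following is a minimal, illustrative Python sketch of greedy longest-match subword tokenization (a simplified, WordPiece-style scheme; the tiny vocabulary below is a hand-made assumption, not the learned vocabulary of any released tokenizer such as BPE or UnigramLM):

    # Toy subword tokenizer: greedy longest-match against a small, hypothetical vocabulary.
    VOCAB = {"trans", "form", "er", "token", "ization", "un", "believ", "able"}

    def tokenize_word(word):
        """Split a single word into the longest known pieces, left to right."""
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start and word[start:end] not in VOCAB:
                end -= 1
            if end == start:            # no known piece: emit an unknown marker
                pieces.append("[UNK]")
                break
            pieces.append(word[start:end])
            start = end
        return pieces

    print(tokenize_word("transformer"))   # ['trans', 'form', 'er']
    print(tokenize_word("tokenization"))  # ['token', 'ization']

Real tokenizers learn their subword vocabularies from large corpora so that frequent strings become single tokens and rare words decompose into smaller pieces.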
Attention Mechanism: The attention mechanisms used in LLMs are a crucial topic, since they contribute to improvements in both architecture and performance. An attention mechanism helps to build the representation of an input sequence by forming links between the various tokens. Several attention variants are available: in self-attention, the queries, keys, and values all come from the same encoder or decoder block; full attention is the naive, unrestricted form of self-attention; and when the output of an encoder block is used as the query of the following decoder block, the mechanism is called cross-attention [10], [51].
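The core computation shared by these variants is scaled dot-product attention. Below is a minimal NumPy sketch (illustrative only; production implementations add learned projections, multiple heads, masking, and batching):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: (seq_len, d) arrays. Returns the attended values."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)       # similarity between every query and key
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ V                  # weighted sum of the values

    # Self-attention: queries, keys, and values all come from the same sequence X.
    X = np.random.randn(4, 8)               # 4 tokens, 8-dimensional embeddings
    out = scaled_dot_product_attention(X, X, X)
    print(out.shape)                         # (4, 8)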
Activation Function: The activation functions play a vital role in the curve-fitting capacities of LLMs architectures [52]. Several activation functions, such as ReLU, GeLU, and other GLU variants, are explored to determine their significance in current research on LLMs [53], [54].
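As an illustration, the following NumPy sketch implements ReLU, the commonly used tanh approximation of GELU, and a SwiGLU-style gated unit; the weight matrices here are random placeholders rather than trained parameters:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def gelu(x):
        # Widely used tanh approximation of GELU.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def swiglu(x, W, V):
        # GLU variant: a SiLU-gated branch multiplied element-wise with a linear branch.
        gate = x @ W
        silu = gate * (1.0 / (1.0 + np.exp(-gate)))
        return silu * (x @ V)

    x = np.random.randn(4, 8)
    W, V = np.random.randn(8, 16), np.random.randn(8, 16)
    print(relu(x).shape, gelu(x).shape, swiglu(x, W, V).shape)   # (4, 8) (4, 8) (4, 16)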
Normalization Layer: Layer normalization is essential for achieving faster convergence in LLMs, and it also affects stability during training. Different approaches exist, such as LayerNorm, DeepNorm, and RMSNorm. These layer normalization techniques offer distinct advantages and contribute to the regularization of LLMs applications like GPT-3, BERT, T5, etc., facilitating effective training [55].
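A minimal NumPy sketch of the two most common variants, standard LayerNorm and RMSNorm (the learnable gain and bias parameters are omitted, i.e., left at their identity values, for brevity):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize each vector to zero mean and unit variance over its features.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def rms_norm(x, eps=1e-5):
        # RMSNorm skips mean-centering and rescales by the root-mean-square only.
        rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
        return x / rms

    h = np.random.randn(4, 8)
    print(layer_norm(h).mean(axis=-1))   # approximately 0 for each token
    print(rms_norm(h).shape)             # (4, 8)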
Training Methods and Frameworks: LLMs training employs different distributed methodologies, including data parallelism, pipeline parallelism, tensor parallelism, model parallelism, and optimizer parallelism [44], [56]. These techniques contribute to practical and scalable training. Additionally, different libraries and frameworks, including Transformers, DeepSpeed, PyTorch, TensorFlow, MXNet, and MindSpore, are used frequently for training and further implementation [56].
Data Preprocessing: The approaches used to preprocess data focus on the significance of quality filtering, data deduplication, and privacy reduction in preparing training data for LLMs. The filtering technique helps to remove low-quality and irrelevant data; it also reduces compute cost by discarding useless patterns in the input. Duplicate samples are removed using deduplication, which also helps to avoid the overfitting tendency of the model. Finally, privacy reduction ensures the security and compliance of the data and upholds the preservation of personal information.
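A minimal Python sketch of two of these steps, exact-duplicate removal by hashing and a simple length-based quality filter (the threshold and rules here are illustrative assumptions, not the heuristics of any particular LLM data pipeline):

    import hashlib

    def preprocess(documents, min_words=5):
        """Drop exact duplicates and very short, low-quality documents."""
        seen, cleaned = set(), []
        for doc in documents:
            text = " ".join(doc.split())                  # normalize whitespace
            if len(text.split()) < min_words:             # crude quality filter
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:                            # exact deduplication
                continue
            seen.add(digest)
            cleaned.append(text)
        return cleaned

    docs = ["A short line.",
            "Large language models are trained on enormous text corpora.",
            "Large language models are trained on  enormous text corpora."]
    print(preprocess(docs))   # keeps only one copy of the long sentence

Production pipelines typically add near-duplicate detection (e.g., MinHash-style fingerprinting) and classifier- or heuristic-based quality scoring on top of such basic steps.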
Parameter Tuning: Researchers explore the various stages of adaptation for LLMs, starting from pre-training and progressing to fine-tuning for downstream tasks. These approaches serve as a guide for customizing models to suit specific applications. Several model adaptation and parameter-efficient tuning techniques, such as prefix tuning, prompt tuning, and adapter tuning, provide strategies for achieving effective fine-tuning while minimizing resource usage [57]-[59].
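As one example of this parameter-efficient family, the sketch below shows a bottleneck adapter layer in NumPy: only the two small projection matrices would be trained, while the frozen backbone weights stay untouched. The dimensions and the residual placement are illustrative assumptions rather than the configuration of any specific adapter paper.

    import numpy as np

    class Adapter:
        """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
        def __init__(self, d_model=768, d_bottleneck=16):
            self.W_down = np.random.randn(d_model, d_bottleneck) * 0.01
            self.W_up = np.random.randn(d_bottleneck, d_model) * 0.01

        def __call__(self, h):
            z = np.maximum(0.0, h @ self.W_down)   # ReLU in the bottleneck
            return h + z @ self.W_up               # residual connection

    h = np.random.randn(4, 768)                    # hidden states from a frozen layer
    adapter = Adapter()
    print(adapter(h).shape)                        # (4, 768)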

This background section aims to provide a thorough understanding of the underlying concepts and approaches that form the basis of language models, which are constantly evolving.

The transformer is widely employed in most advanced LLMs as the basic structure because its architecture, scalability, and pretraining approach render it the optimal framework for constructing robust large language models. In addition, the self-attention mechanism of transformers is significantly effective for capturing and representing long-range relationships in language. Consequently, transformer-based LLMs have significantly improved the state of the art in NLP-related tasks. In Section V-A1, a comprehensive overview of transformer architectures and configurations is provided for building highly scalable, optimized, and cost-efficient LLMs. Figure 5 depicts the visualization of the LLMs background.

FIGURE 5. Background of LLMs

1) What is Transformer?
The transformer architecture is considered the basic building block of LLMs. It is intended for neural networks to efficiently handle sequential data [10]. This architecture does not use recurrent processing; instead, it employs an attentional approach to determine global input-output dependencies. This allows the model to grow in size and performance, resulting in substantial parallelization and reduced training time in NLP. Furthermore, it can take input of varying lengths and can change its focus depending on the length of the sequence. As a result, it has become the go-to architecture in many fields, often replacing sophisticated recurrent or convolutional neural networks with a much more efficient structure [60]. In this regard, it is particularly important for LLMs applications. Figure 6 illustrates the architecture of the transformer model.

FIGURE 6. Architecture of a Transformer model

The transformer architecture consists of seven main components. A demonstration of each component is given below.
Inputs and Input Embeddings
The ML models utilize tokens, which are units of text such as words or subwords, as training data. However, these models process numbers. Tokenization begins this translation process by breaking the input text down into meaningful components. A unique numerical identifier is assigned to each token, connecting the linguistic information to the numerical realm. This numerical format is known as "input embeddings." These input embeddings are numerical representations of words, which ML models may subsequently process. The embeddings function similarly to a dictionary, assisting the model in understanding the meaning of words by arranging them in a mathematical space where comparable phrases are situated close together. The model is trained to generate these embeddings so that vectors of the same size represent words with similar meanings. Figure 6A illustrates the inputs and input embeddings.
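A minimal NumPy sketch of this token-to-vector lookup (the vocabulary and embedding matrix below are tiny, randomly initialized placeholders; in a real model the embedding matrix is learned during training):

    import numpy as np

    vocab = {"large": 0, "language": 1, "models": 2, "[UNK]": 3}
    d_model = 8
    embedding_matrix = np.random.randn(len(vocab), d_model)    # one row per token id

    def embed(tokens):
        ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # tokens -> integer ids
        return np.array(ids), embedding_matrix[ids]            # ids -> embedding rows

    ids, vectors = embed(["large", "language", "models"])
    print(ids)            # [0 1 2]
    print(vectors.shape)  # (3, 8)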
Positional Encoding
The sequence of words in a sentence frequently conveys important semantic information, and the same set of words in a different order can convey a completely different meaning. In this regard, understanding the word order in a sentence is essential in NLP to identify the correct meaning of an utterance. Neural networks, however, do not inherently perceive the order of their inputs. To remedy this problem, positional encoding can be used to encode the position of each word in the input sequence as a collection of numbers. The transformer model combines the input embeddings with positional encodings to help models such as GPT understand sentence word order and produce grammatically accurate and semantically appropriate output [61]. The positional encoding part is shown in Figure 6B.
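The original transformer paper uses fixed sinusoidal positional encodings; the NumPy sketch below reproduces that formulation and adds it to the token embeddings. Learned positional embeddings, as used by GPT-style models, are an equally common alternative.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
        positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    token_embeddings = np.random.randn(4, 8)                     # 4 tokens, d_model = 8
    x = token_embeddings + sinusoidal_positional_encoding(4, 8)  # order-aware input
    print(x.shape)                                               # (4, 8)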
Encoder
The encoder is a crucial component of the neural network responsible for processing the input text. Its primary function is to generate a series of hidden states that represent the input text in a meaningful way [62]. To do so, it uses a stack of self-attention layers, whose complex and powerful ability to capture relationships between the different elements of the input is sometimes described metaphorically as "voodoo magic." In the transformer, the encoder is used in more than one layer. This section is depicted comprehensively in Figure 6C.
Outputs (shifted right)
During the training process, the decoder in the transformer model learns to predict the next word in a sequence by analyzing the preceding words. This is achieved through a mechanism known as autoregressive training. The decoder's ability to predict the next word is critical for generating coherent and contextually relevant sequences. Additionally, GPT models such as GPT-3 are trained on massive amounts of text data, which helps them generate sensible text. Several corpora, including the Common Crawl web corpus, the BooksCorpus dataset, and the English Wikipedia, are commonly used for this purpose. Figure 6D highlights the transformer's outputs (shifted right) module.
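The "shifted right" arrangement simply means that, during training, the target at each position is the next token of the same sequence. A minimal Python sketch of building such (input, target) pairs for autoregressive training; the token ids and the beginning-of-sequence marker are illustrative assumptions:

    # Teacher forcing for autoregressive training: the model sees tokens [0..t]
    # and is trained to predict token t+1 at every position.
    BOS = 0                                   # assumed beginning-of-sequence id
    sequence = [11, 27, 5, 42, 9]             # token ids of one training sentence

    decoder_input = [BOS] + sequence[:-1]     # the sequence shifted right by one
    targets = sequence                        # what the model must predict

    for inp, tgt in zip(decoder_input, targets):
        print(f"given ...{inp:>3}  ->  predict {tgt}")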

Output Embeddings
The output tokens, like the inputs, are text that the model cannot process directly; therefore, the output must also be converted into a numerical format known as "output embeddings." Similar to input embeddings, output embeddings undergo positional encoding, enabling the model to understand the order of words in a sentence [63]. In machine learning, the loss function evaluates the difference between a model's prediction and the target value. Loss functions are essential for complex GPT-style language models: the loss function drives updates to the model that increase accuracy by reducing the discrepancy between predictions and targets, and this adjustment improves the overall performance of the model. The loss is calculated during training, and the model parameters are modified accordingly. In the inference process, the output text is created by mapping the predicted probability of each token in the model to the corresponding token in the vocabulary. The output embedding part is illustrated in Figure 6E.
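For language models this loss is typically the cross-entropy between the predicted next-token distribution and the actual next token. A minimal NumPy sketch (the logits are random placeholders standing in for the model's output scores):

    import numpy as np

    def cross_entropy(logits, target_ids):
        """Average negative log-likelihood of the correct next tokens."""
        logits = logits - logits.max(axis=-1, keepdims=True)        # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        picked = log_probs[np.arange(len(target_ids)), target_ids]  # log p(correct token)
        return -picked.mean()

    vocab_size, seq_len = 100, 4
    logits = np.random.randn(seq_len, vocab_size)    # model scores for each position
    targets = np.array([7, 42, 3, 99])               # the true next-token ids
    print(cross_entropy(logits, targets))            # scalar training loss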
Decoder by the activation of the accurate output probability using the
The decoder processes both positionally encoded input and softmax activation function.
output embeddings. Positional encoding is crucial for the
model to understand the sequential order of the tokens in both B. HARDWARE SPECIFICATIONS FOR LARGE
the input and output sequences. The positional information LANGUAGE MODELS
helps the decoder effectively capture the structure within Understanding the computing resources and training dura-
the sequences. The decoder has an attention mechanism that tions needed for various language models is crucial. It allows
helps to improve the output’s quality by leveraging contextual for informed decision-making when choosing a model for
information received from the encoder. The primary function specific tasks. To choose a model that is appropriate for a
of the decoder is to create output sequences based on the given task, a clear understanding of the training times and
encoded input sequences. It generates a sequence of tokens, computational resources is a must. Table 5 shows the hard-
often representing words or sub-words, as its output. The ware specifications, parameters number, training duration
dependency between the encoder-decoder in a transformer is and other configurations of individual LLMs model.
significant where the encoder processes the input sequence GPT-3: GPT-3 uses Nvidia A100 GPUs to pre-train on a
and based on this representation, the decoder provides the large 300 billion token set, generating around 175 billion
desired output sequence. In addition, GPT is a decoder- parameters [66].It is not stated about the specific training du-
only transformer [64]. The decoder part of GPT uses a ration. Besides, it has context learning features, enables itself
masked self-attention mechanism which can process the in- to understand the words reasoning, sentence, and language
put sequence without requiring encoder explicitly. Figure 6F properly.
demonstrates the decoder component of a transformer. Bert: Trained on an unspecified data scale, the Bert model
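As a minimal illustration of the masked (causal) self-attention pattern described above (a sketch with our own naming and shapes, not code from any reviewed system), each position is prevented from attending to later positions:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_head). Each position may only attend to itself
    and to earlier positions, which is what 'masked self-attention' refers to."""
    seq_len = q.size(1)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)       # (batch, seq, seq)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # hide future tokens
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                            # (batch, seq, d_head)
```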
Linear Layer and Softmax
The linear layer is a fully connected neural network layer that projects the output embedding into a higher-dimensional space, as required to map the representation back onto the original vocabulary space. This transformation enhances the expressiveness of the representation, allowing the model to capture more complex patterns and relationships in the data. The softmax function then generates a probability distribution over each output token in the vocabulary, allowing the model to produce probabilistic output tokens [65]. Figure 6G shows the process by which the features are propagated through a linear layer, followed by the softmax activation function that yields the output probabilities.
TABLE 5. Hardware specifications for the LLMs model.

Model's Name | Parameters | Pre-trained Data Scale | Hardware Specifications | Training Duration | Context Learning
GPT-3 [66] | 175 Billion (B) | 300 B tokens | Nvidia A100 GPU | - | Yes
BERT [67] | 340 Million (M) | - | Nvidia A100, V100 | Depends on model parameter scale | Yes
RoBERTa [68] | 340 M | - | 6144 TPU v4 | Nearly 2 weeks | Yes
T5 [69] | 11 B | 1 Trillion (T) tokens | 1024 TPU v3 | - | Yes
PaLM [70] | 540 B | 780 B tokens | 6144 TPU v4 | 120 Days | Yes
LaMDA [71] | 137 B | 768 B tokens | 1024 TPU v3 | 57.7 Days | Yes
GLM-130B [72] | 130 B | 400 B tokens | 1024 TPU v4 | 60 Days | Yes
Gopher [73] | 280 B | 300 B tokens | 4096 TPU v3 | 920 Hours | Yes
Jurassic-1 [74] | 178 B | 300 B tokens | 800 GPU | - | Yes
MT-NLG [75] | 530 B | 270 B tokens | 4480 80G A100 | - | Yes
LLaMA [76] | 65 B | 1.4 T tokens | 2048 80G A100 | 21 Days | Yes
LLaMA 2 [77] | 70 B | 2 T tokens | 2000 80G A100 | 25 Days | Yes
Falcon [78] | 40 B | 1.3 T tokens | - | - | Yes
Chinchilla [79] | 70 B | 1.4 T tokens | - | - | Yes
OPT [80] | 175 B | 180 B tokens | 992 80G A100 | - | Yes
Galactica [81] | 120 B | 106 B tokens | - | - | Yes
BLOOM [56] | 176 B | 366 B tokens | 384 80G A100 | 105 Days | Yes
PanGU-α [82] | 207 B | 1.1 TB | 2048 Ascend 910 | - | Yes

B. HARDWARE SPECIFICATIONS FOR LARGE LANGUAGE MODELS
Understanding the computing resources and training durations needed for various language models is crucial, as it allows for informed decision-making when choosing a model for specific tasks. To choose a model that is appropriate for a given task, a clear understanding of the training times and computational resources is a must. Table 5 shows the hardware specifications, parameter counts, training durations, and other configurations of individual LLMs.
GPT-3: GPT-3 uses Nvidia A100 GPUs to pre-train on a large set of 300 billion tokens, resulting in around 175 billion parameters [66]. The specific training duration is not stated. It also has in-context learning features, enabling it to understand word, sentence, and language-level reasoning properly.
BERT: Trained on an unspecified data scale, the BERT model has a variable parameter count that depends on batch size and the number of hidden layers of the corresponding model; it is usually around 340 million. Nvidia A100 and V100 GPUs are used for training, and the length of training depends on the scale of the model's parameters [67]. Contextual learning is incorporated in the model as well.
RoBERTa: RoBERTa, an improved version of BERT, has a parameter count of 340 million and conducts pre-training on a specific amount of data. The training process was completed on 6144 TPU v4 units, running for around two weeks [68]. The model also employs a context learning feature.
T5: T5 uses 1024 TPU v3 units and has 11 billion parameters. It has been pre-trained on 1 trillion tokens [69]. There is no information available on the training time. It also enables contextual learning and provides accurate results.
PaLM: PaLM comprises a substantial number of parameters, around 540 billion, and undergoes pre-training on a large dataset of 780 billion tokens. This pre-training process is carried out utilizing 6144 TPU v4 units [70]. The training period extends for 120 days, and the model also incorporates contextual learning.
LaMDA: LaMDA uses 1024 TPU v3 units during training, and the model is pre-trained on 768 billion tokens, yielding 137 billion parameters [71]. It has a long training time of 57.7 days.
GLM-130B: The GLM-130B model possesses an immense 130 billion parameters and has undergone pre-training on a huge dataset of 400 billion tokens. This training was conducted utilizing 1024 TPU v4 units, and the training session lasted 60 days [72].
Gopher: Gopher is a language model that has been pre-trained on 300 billion tokens and requires 4096 TPU v3 units for the experiment. It has a total of 280 billion parameters [73]. The training period is precisely stated as 920 hours. Furthermore, the model integrates context learning to demonstrate an effective outcome.
Jurassic-1: Jurassic-1 is a model with an impressive capacity of 178 billion parameters. It has been pre-trained on a massive dataset of 300 billion tokens, utilizing the computational power of 800 GPUs [74]. No information regarding the duration of training is available.
MT-NLG: MT-NLG has an impressive size of 530 billion parameters. It has been trained on a massive dataset of 270 billion tokens, utilizing 4480 80GB A100 GPUs [75]. No data regarding the training duration are available. The model integrates context learning features.
LLaMA: LLaMA is a language model with an enormous capacity of 65 billion parameters. It has undergone pre-training on a massive dataset consisting of 1.4 trillion tokens. This training process was carried out utilizing 2048 high-performance 80GB A100 GPUs [76]. The training period is explicitly stated as 21 days.
LLaMA 2: LLaMA 2 is equipped with 70 billion parameters and has performed pre-training on 2 trillion tokens, utilizing 2000 80GB A100 GPUs [77]. The training period is stated as 25 days, and the model supports context-based learning.
Falcon: Falcon, equipped with 40 billion parameters, undergoes pre-training on a large dataset of 1.3 trillion tokens [78]. No details regarding the training duration are given, and it also has context learning features.
Chinchilla: Chinchilla is a language model that has 70 billion parameters and has been pre-trained on 1.4 trillion tokens [79]. There are no details regarding the training duration.
OPT: OPT, equipped with 175 billion parameters, conducts pre-training on 180 billion tokens utilizing 992 A100 GPUs with a capacity of 80GB each [80]. No details regarding the training duration are given.
Galactica: Galactica possesses 120 billion parameters and has undergone pre-training using 106 billion tokens [81]. Details regarding the training duration are not given.
BLOOM: BLOOM has a remarkable capacity of 176 billion parameters and has undergone pre-training on 366 billion tokens utilizing 384 80GB A100 GPUs [56]. The training period lasts 105 days, and the model incorporates contextual learning.
PanGU-α: PanGU-α is a language model that has been pre-trained on a massive amount of data, specifically 1.1 TB, employing 2048 Ascend 910 processing units [82]. It has an impressive parameter count of 207 billion. No details regarding the training duration are given.
This comprehensive analysis helps to determine the hardware requirements and computational complexity of each model, and thus helps researchers navigate the implementation of these models precisely and improve the performance of their research outcomes.
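As a rough, hedged illustration of how Table 5's parameter and token counts translate into training compute, one can apply the common C ≈ 6·N·D rule of thumb from the scaling-law literature (this heuristic is not part of the reviewed paper and ignores hardware efficiency, so treat the results as order-of-magnitude estimates only):

```python
# N = parameters, D = training tokens, C ≈ 6 * N * D floating-point operations.
models = {
    "GPT-3": (175e9, 300e9),   # values taken from Table 5
    "LLaMA": (65e9, 1.4e12),
    "PaLM":  (540e9, 780e9),
}
for name, (params, tokens) in models.items():
    flops = 6 * params * tokens
    print(f"{name}: ~{flops:.2e} training FLOPs")
```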
lions of parameters. LLaMA normalizes the input for every
C. DEEP NEURAL NETWORK ARCHITECTURES OF transformer sub-layer rather than the output [76]. To increase
LLMS performance, it employs the RMSNorm normalizing function
A deep neural network is utilized in LLMs to understand and and the SwiGLU activation function rather than the ReLU.
generate new content more accurately and efficiently. In this Single models are utilized by LaMDA to execute multiple
section, we include a summary of various DNN architectures duties. The model architecture is a decoder-only Transformer
of different large language models with respect to various language model. The Transformer is comprised of 64 layers,
literature studies as well as various real-time applications of a d(model) value of 8192, gated-GELU as the activation
large language models using various DNN models. function, and relative attention the same as T5 LLMs [71].
AlphaCode employs an encoder-decoder transformer archi-
1) Comparison Between State-of-The-Art studies tecture in which input tokens are passed to the encoder,
A large language model is a dynamic model capable of and one token is extracted from the decoder until an end-
performing various tasks, such as creating coherent text and of-code token is generated [86]. When contrasting encoder-
summarizing text. This is possible because of its pre-training decoder architectures with decoder-only architectures, the
on extensive text data. A defining feature of a language model encoder-decoder architecture provides the advantage of en-
is its ability to anticipate the subsequent word by analyzing abling bidirectional description representation and provides
the preceding text. The deep neural network (DNN) frame- additional flexibility by separating the encoder structure from
work that is utilized in LLMs enhances this process to be the decoder. It employs an asymmetric architecture with 1536
more akin to human-like understanding [3], [83]. LLMs use encoder tokens but only 768 decoder tokens. It makes use of
different DNN models in their architecture to enhance task multi-query attention to lower sampling costs. Cache update
performance. costs and memory utilization are greatly reduced when all
The transformer architecture serves as the core component query heads are used but only shared for key and value
of all language models. GPT-1, the initial version of GPT heads in each attention block. It employed a SentencePiece
employs the Transformer decoder architecture [67]. In GPT- tokenizer for tokenization, trained on a combination of Code-
1 the decoder structure operates independently from the Contests and GitHub data, with a vocabulary size of 8,000
encoder, therefore eliminating the Multi-Head Attention and tokens. Through the usage of DNNs, all of these LLMs have
Layer Norm components that are linked to the encoder. The demonstrated remarkable performance on various NLP tasks
pre-trained GPT model consists of 12 transformer blocks, like as language understanding and generation.
each with a d(model) value of 768 and a total of 110 million
parameters. GPT-2, the second version of GPT, employs 2) Applications of LLMs using various DNN models
the transformer decoder architecture like GPT-1 [67]. GPT- Pre-training Transformer models have led to the proposal
2 employs 50,257 BPE tokens and ensures that the Masked of LLMs with impressive capacities in addressing a vari-
Multi-Head component is preceded by the Layer Norm. In ety of NLP tasks, including question-answering, document
GPT-2, an additional Layer Norm is included subsequent summarization, and language translation [3]. Due to their
to the last block. There are four pre-trained GPT-2 models remarkable abilities in basic tasks of language processing
available, each with a unique quantity of decoder blocks. The and creation, they have completely transformed the fields
largest model, which has a d(model) value of 1600 and 48 of NLP and AI. Various DNN models have been employed
blocks, comprises a total of 1.5 billion model parameters. in different industries, such as technology, healthcare, and
BERT employs the transformer encoder structure, in contrast retail to increase performance. DNNs have made substantial
to the Transformer decoder structure utilized by GPT-1 and progress in improving the capabilities of LLMs [88]. DNN
GPT-2 [84]. Following the final encoder block is composed models, such as convolutional neural networks (CNNs), re-
of two fully connected output layers separated by a Layer current neural networks (RNNs), generative adversarial net-
Norm component. The calculation of the likelihood of each works (GANs), capsule networks (CapsNets), transformers,
token’s output depends on both the previous and next tokens, and BERT, have been extensively employed in diverse appli-
TABLE 6. Comparison of applications of LLMs using various DNN models

Study | DNN model | Application
Koizumi et al. [87] | Transformer (decoder) | Assessed the use of a pre-trained large-scale language model in audio captioning.
Fan et al. [88] | Transformer (encoder-decoder) | Discuss the significance of Recommender Systems in web applications and e-commerce.
Bai et al. [89] | Non-autoregressive attention-based encoder-decoder | Propose a non-autoregressive speech recognition model named LASO (Listen Attentively and Spell Once).
Sun et al. [90] | Decoder + SOCIALSENSE (belief-centered graph) | Forecast the impact of news releases and attempt to mitigate potential adverse consequences by automatically anticipating news media responses.
Drossos et al. [91] | RNN | Propose a method for sound event detection which takes a sequence of audio frames as input and predicts the activities of sound events in each frame.
Chiu et al. [92] | Transformer (encoder) | Propose a method called TPBERT to improve the reranking of N-best hypotheses in automatic speech recognition.
Elhafsi et al. [93] | Encoder structure | Propose a monitoring system for dealing with semantic anomalies in robotic systems.
Shen et al. [94] | Transformer (decoder) | Propose a self-regulating edge AI system that autonomously plans and adjusts itself to fulfill the needs of users.

2) Applications of LLMs using various DNN models
Pre-trained Transformer models have led to the proposal of LLMs with impressive capacities in addressing a variety of NLP tasks, including question answering, document summarization, and language translation [3]. Due to their remarkable abilities in basic tasks of language processing and creation, they have completely transformed the fields of NLP and AI. Various DNN models have been employed in different industries, such as technology, healthcare, and retail, to increase performance, and DNNs have made substantial progress in improving the capabilities of LLMs [88]. DNN models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), capsule networks (CapsNets), transformers, and BERT, have been extensively employed in diverse applications of LLMs [95]. Numerous studies [87]–[94] suggest that DNN models are utilized in several types of LLMs-based applications to increase task efficiency.
Koizumi et al. [87] introduce an innovative method to address the issue of insufficient training data in audio captioning, utilizing a pre-trained large language model based on a deep neural network as a decoder for generating captions. The findings of the study demonstrate the effectiveness of the proposed methodology in utilizing LLMs for audio captioning. Significantly, this approach outperforms traditional approaches that are trained from scratch.
In a recent study, Fan et al. [88] discuss the significance of Recommender Systems in web applications and the shortcomings of current DNN approaches in comprehending user desires and integrating textual side information efficiently. They discuss the capacity of LLMs to tackle these difficulties and also highlight the necessity of a systematic evaluation of recommender systems.
LASO (Listen Attentively and Spell Once) is an end-to-end non-autoregressive speech recognition model developed in a recent study by Bai et al. [89] to improve the speed of inference by predicting all tokens simultaneously. The proposed model utilizes attention methods to combine decoded speech information into hidden representations for every token. Moreover, they suggest using cross-modal transfer learning to increase the performance of the speech-modal LASO model by utilizing a text-modal language model to align the semantic meaning of tokens.
In another study, Sun et al. [90] provide a new methodology to predict the effect of news releases and try to minimize potential negative consequences by automatically forecasting responses in news media. Utilizing an LLM built on a deep neural network, their method creates a belief-centered graph on top of an existing social network and uses graph-based propagation to analyze social dynamics. The proposed framework shows its efficiency in predicting responses.
Drossos et al. [91] present a technique that enables a recurrent neural network (RNN) to acquire LLMs for sound event detection. The proposed approach adjusts the input of the RNN based on the activity of classes in the preceding time step and is evaluated on three distinct datasets: the TUT-SED Synthetic 2016, TUT Sound Events 2016, and TUT Sound Events 2017 datasets.
In a separate study, Chiu et al. [92] present an efficient method called TPBERT (based on BERT) for improving the reranking of N-best hypotheses in automatic speech recognition. This approach uses task-specific topic information to increase the BERT model's ability to create accurate embeddings of the N-best hypotheses.
In another study, Elhafsi et al. [93] propose a monitoring methodology that utilizes large language models to tackle the issue of semantic anomalies in robotic systems. The efficiency of LLMs-based monitoring in recognizing semantic anomalies and aligning with human reasoning is demonstrated through tests on autonomous driving.
Shen et al. [94] present a self-regulating edge artificial intelligence system that utilizes a deep neural network to autonomously plan and adjust itself to fulfill the needs of users. The proposed system uses a hierarchical cloud-edge-client design, where the primary language model is located in the cloud. By leveraging the robust capabilities of GPT in language comprehension and code creation, they introduce a methodology that effectively manages edge AI models to meet users' requirements while automatically generating new code for training new models through edge federated learning.
Table 6 gives a brief overview of these DNN-oriented application studies of LLMs. These studies suggest that employing deep neural networks in language models increases the performance of LLMs-based applications in several industries.
D. ARCHITECTURAL OVERVIEW OF LARGE LANGUAGE MODELS
In Table 7, a description and architecture of LLMs such as GPT-1, BERT, RoBERTa, and T5 are presented. This table will assist researchers in selecting the optimal model for a natural language processing task. GPT-1, BERT base, and BERT large contain 12, 12, and 24 layers, respectively. RoBERTa is an enhanced variant of BERT, while T5 is an encoder-decoder transformer. The BERT diagram illustrates input token processing, context-aware embedding, and the masked language modeling task, in which the model is intended to predict the masked words. The T5 diagram demonstrates the sequential layers of the transformer model, including the feedforward neural network and self-attention, and explains how information flows through and structures text. GPT-1 passes input embeddings and positional encodings through multiple transformer layers.

E. COMPARISON BETWEEN CONFIGURATIONS OF LLMS
Table 8 provides an extensive overview of various Large Language Models (LLMs), highlighting their configuration details and optimization settings. These LLMs have played a crucial role in advancing natural language comprehension and generation tasks, making them a focal point in artificial intelligence and natural language processing. This analysis compares and contrasts these LLMs based on critical parameters, including model size, learning rate, category, activation function, batch size, bias, number of layers, optimizer, number of attention heads, hidden state size, dropout rate, and maximum training context length. GPT-4 stands out as the most prominent model on display, with a staggering 1.8 trillion parameters. It is comparatively faster than the prior GPT versions and provides many advanced features; besides its fast prompt response, it generates more accurate output and has substantially reduced the biases present in the model. GPT-1, despite being smaller with 125 million parameters, demonstrates the significant development of LLMs over the years. An increased number of parameters in LLMs enhances the model's ability to comprehend intricate patterns and produce text that is more contextually appropriate and reminiscent of human language. GPT-3's selection of a modest learning rate of 6 × 10⁻⁵ is notable, which highlights the significance of cautious hyperparameter selection. Models are categorized as causal decoder (CD), autoregressive (AR), encoder-decoder (ED), and prefix decoder (PD) to illustrate architectural diversity. Activation functions vary, influencing the models' expressive strength, from GeLU in GPT-3 to SwiGLU in LLaMA and LLaMA-2. All versions of GPT employ GeLU as the activation function, as it mitigates the vanishing gradient problem and facilitates smoother gradients throughout the training process. The utilization of SwiGLU as the activation function is observed in models such as PaLM and LLaMA versions 1 and 2, as its gating mechanisms enhance the ability to capture intricate correlations within the data. Models like BERT, OPT, and T5 use ReLU as the activation function. The formulas of these activation functions are given below [7], [60]:

ReLU(x) = max(0, x), i.e., ReLU(x) = x if x ≥ 0, and 0 if x < 0    (1)

GeLU(x) = 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])    (2)

SwiGLU(x, W, V) = Swish_β(xW) ⊗ (xV), where Swish_β(x) = x · Sigmoid(βx)    (3)
aware embedding, and masked language modeling tasks, SwiGLU (x) = x.Sigmoid(βx).xV (3)
where the masked words are intended to predict the model.
T5 demonstrates the sequential layers of the transformer BARD is recognized for its informative response. It fea-
model, including the feedforward neural network, and self- tures 24 attention heads and facilitates its contextually re-
attention. It explains how information flows and structures lated response. BERT size is identical to BARD of 340M.
text. GPT-1 passes data input embedding and positional The key advantage of BERT is understanding the context
encoding through multiple transformer layers. of words. It has effective training settings with a proper
learning rate, batch size, and a dropout value of 0.1, lever-
E. COMPARISON BETWEEN CONFIGURATIONS OF ages the convergence of the model, and contributes to the
LLMS NLP-based tasks precisely. PanGU BLOOM, Galactica, and
Table 8 provides an extensive overview of various Large Chinchilla are also LLMs but possess distinct configurations
Language Models (LLMs), highlighting their configuration and challenges. Usually, PanGU is highly effective for the
details and optimization settings. These LLMs have played Chinese language, whereas Galactica performs well with
a crucial role in advancing natural language comprehension repeated data. Chinchilla is a scaling strategy constrained by
and generation tasks, making them a focal point in artificial data limitations and creates efficient resource allocation for
intelligence and natural language processing. This analysis training and generating output. Falcon and T5 are compact
compares and contrasts these LLMs based on critical param- compared to other LLMs, and both are transformer-based
eters, including model size, learning rate, category, activa- models. However, they have some unique differences, such
tion function, batch size, bias, number of layers, optimizer, as Falcon is a decoder-based model whereas T5 integrated
number of attention heads, hidden state size, dropout rate, both encoder-decoders. Additionally, Falcon utilizes multi-
and maximum training context length. GPT-4 stands out as head query attention to increase the scalability of the model.
the most prominent model on display, with a staggering 1.8 LLaMA-2 is the updated version of LLaMA. It is an en-
trillion parameters. It is comparatively faster than the prior hanced fine-tuned version that exploits the hardware utiliza-
GPT versions and provide many advanced features. Besides, tion for efficient training sessions. MT-NLG and PaLM have
it has fast prompt response, generate more accurate output substantial parameter sizes of 530B and 540B, respectively.
and it has reduced the biases presented in the model sub- Both of them also use the casual decoder technique. How-
stantially. GPT-1, despite being lesser with 125 million pa- ever, they have some architectural differences, such as PaLM
rameters, demonstrates the significant development of LLMs uses a SwiGLU activation function and adafactor optimizer.
over the years. An increased number of parameters in LLMs Moreover, it uses a higher learning rate and batch size of 1
enhances the model’s ability to comprehend intricate patterns × 102 and 1000K. On the contrary, MT-NLG uses a lower
and produce text that is more contextually appropriate and learning rate and batch size of 5 × 105 and 64K, respectively.
reminiscent of human language. GPT3’s selection of a mod- GLM-130B and LaMDA are also effective LLMs, widely
est learning rate of 6 is notable, which highlights the sig- used for NLP-based tasks, including question answering, text
nificance of cautious hyperparameter selection. Models are generation, etc. Both of them use the Gated GLU (GeGLU)
categorized as Causal decoder (CD), Autoregressive (AR), activation function, a GLU variant. The following equation is
Encoder-decoder (ED), and Prefix decoder (PD) to illustrate used to express the GeGLU operation [100].
architectural diversity. Activation functions vary, influencing
the models’ expressive strength from GeLU in GPT-3 to GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) (4)
SwiGLU in LLaMA and LLaMA-2. All versions of GPT
employ the GeLU as its activation function as it mitigates However, there are noticeable differences between GLM-
the vanishing gradient problem and facilitates the generation 130B and LaMDA in terms of their decoder mechanisms.
of smoother gradients throughout the training process. The GLM-130B employs a prefix decoder, whereas LaMDA
utilization of SwiGLU as the activation function is observed adopts a casual decoder technique. In addition, the GLM-
in models such as PaLM and LLaMA versions 1 and 2, as it 130B model employs a larger batch size compared to the
has gating mechanisms that enhance its ability to capture in- LaMDA model. In addition, the presence or absence of
tricate correlations within the data. Models like BERT, OPT, biased terms in models, such as Falcon, T5, LLaMA 1,2, and
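Eq. (4) can be sketched in the same way (again with illustrative weight matrices and biases, not taken from any specific model implementation):

```python
import torch
import torch.nn.functional as F

def geglu(x, W, V, b, c):
    # Eq. (4): GELU-gated linear unit, GELU(xW + b) ⊗ (xV + c)
    return F.gelu(x @ W + b) * (x @ V + c)
```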

TABLE 7. Architectural overview of different LLMs

Model | Description
GPT-1 [67] | Twelve-layer decoder transformer that uses twelve masked self-attention heads.
BERT [11] | BERT is a transformer architecture with two model sizes: BERT base has 12 layers in its encoder stack, and BERT large has 24 layers in its encoder stack.
RoBERTa [68] | Optimized version of the BERT model.
T5 [85] | The model consists of an encoder and a decoder transformer, each with many layers.


TABLE 8. Various LLMs with configuration details and optimization settings (Here, LR = learning rate, CG = Category, AF = the activation function, bs = batch size,
NL = the number of layers, NAH = the number of attention heads, SHS = the size of the hidden states, MCLDT = the maximum context length during training, CD =
causal decoder, ED = encoder-decoder, PD = prefix decoder, and AR = autoregressive)

Model Size LR CG AF BS Bias NL Optimizer NAH SHS Dropout MCLDT


GPT-4 [96] 1.8 T − CD GeLU − Yes 120 Adam 120-150 20000 − 32768
GPT-3 [66] 175B 6 × 10−5 CD GeLU 32K-3200K Yes 96 Adam 96 12288 − 2048
GPT-2 [97] 1.5B 1 × 10−4 AR GeLU 16K-64K Yes 48 Adam 24 1280 0.1 1024
GPT-1 [67] 125M 1 × 10−4 AR GeLU 16K-64K Yes 12 Adam 12 768 0.1 512
BARD [98] 340M − − ReLU 64K Yes 24 − 24 768 − 512
BERT [67] 340M 1 × 10−5 − ReLU 16K-64K Yes 24 Adam 16 1024 0.1 512
PanGU-α [82] 207B 2 × 10−5 CD GeLU − Yes 64 Adam 128 16384 − 1024
BLOOM [56] 176B 6 × 10−5 CD GeLU 4000K Yes 70 Adam 112 14336 0 2048
Galactica [99] 120B 7 × 10−6 CD GeLU 2000K No 96 AdamW 80 10240 0.1 2048
OPT [80] 175B 1.2 × 10−4 CD ReLU 2000K Yes 96 AdamW 96 12288 0.1 2048
Chinchilla [79] 70B 1 × 10−4 CD − 1500K-3000K − 80 AdamW 64 8192 − −
Falcon [78] 40B 1.85 × 10−4 CD GeLU 2000K No 60 AdamW 64 8192 − 2048
T5 [69] 11B 1 × 10−2 ED ReLU 64K No 24 AdaFactor 128 1024 0.1 512
LLaMA [76] 65B 1.5 × 10−4 CD SwiGLU 4000K No 80 AdamW 64 8192 − 2048
LLaMA-2 [77] 70B 1.5 × 10−4 CD SwiGLU 4000K No 80 AdamW 64 8192 − 4096
MT-NLG [75] 530B 5 × 10−5 CD − 64K-3750K − 105 Adam 128 20480 − 2048
Jurassic-1 [74] 178B 6 × 10−5 CD GeLU 32K-3200K Yes 76 − 96 13824 − 2048
Gopher [73] 280B 4 × 10−5 CD − 3000K-6000K − 80 Adam 128 16384 − 2048
GLM-130B [72] 130B 8 × 10−5 PD GeGLU 400k-8250K Yes 70 AdamW 96 12288 0.1 2048
LaMDA [71] 137B − CD GeGLU 256K − 64 − 128 8192 − −
PaLM [70] 540B 1 × 10−2 CD SwiGLU 1000K-4000K No 118 Adafactor 48 18432 0.1 2048

In addition, the presence or absence of bias terms in the models (for example, the "No" entries for Falcon, T5, LLaMA 1 and 2, and Galactica) highlights the complexity of the choices made. From 12 for GPT-1 to 118 for PaLM, the number of layers affects a model's ability to capture intricate patterns. Optimizers are also diverse, with Adam, AdamW, and AdaFactor playing crucial roles. All GPT variants employ Adam as the optimizer, although models such as Galactica, OPT, and Falcon utilize AdamW, and both the T5 and PaLM models utilize the Adafactor optimizer in their respective architectures. These variations highlight the significance of selecting models and configurations that are tailored to particular tasks, with performance, computational resources, and task requirements playing a central role.
The number of attention heads also varies across models. GPT-1 is equipped with a total of 12 attention heads, whilst GPT-4 boasts a much larger number, ranging from 120 to 150. Additional attention heads enable the model to attend concurrently to several segments of the input sequence, hence expediting the model's training process. In order to enhance the efficacy of LLMs, researchers employ diverse dimensions for the hidden states within their models: larger hidden-state dimensions enable the capture of complex patterns within the text. Both GPT-4 and MT-NLG employ hidden state sizes of approximately 20,000, which is significantly greater than the hidden state sizes of the other LLMs included in the table. Certain LLMs incorporate a dropout value of 0.1 to prevent overfitting issues, whereas others do not employ any dropout. The maximum context length denotes the number of tokens that can be remembered by the model during training. Increasing the size of the context window boosts the model's ability to grasp distant relationships between texts; consequently, the model is able to generate text outputs with great coherence. Table 8 reports that GPT-4 has a context length of 32768, which is the maximum among all the LLMs. This substantial length indicates the capability of GPT-4 to remember much longer token sequences during training. LLaMA-2 has the second-highest context length of 4096. Most of the models have a context length of 2048, meaning they can handle a maximum of 2048 tokens simultaneously during text generation. A few compact models, including BARD, BERT, and T5, possess a maximum context length of 512. This table presents a qualitative architectural comparison among the most popular LLMs and provides comprehensive knowledge about the configurations and strengths of these models. These variations highlight the significance of selecting models for particular tasks while considering performance and computational resources.

F. COMPARISON BETWEEN DATASETS OF LLMS
Different LLMs utilized different datasets for the training phase, distinguishing the models from one another. A concise overview of these datasets is provided in this section. Moreover, it explicitly exhibits the diverse range of datasets used by each model, since understanding these datasets facilitates the development and training of the models and boosts performance. The datasets used to train various large language models (LLMs) and their compatibility with each model are detailed in Table 9.

TABLE 9. Dataset for large language models

Dataset categories: Webpages (C4, OpenWebText, Wikipedia); Conversation Data (the Pile - Stack Exchange); Books and News (BookCorpus, Gutenberg, CC-Stories-R, CC-News, RealNews); Scientific Data (the Pile - PubMed Abstracts, the Pile - ArXiv); Code (BigQuery, the Pile - GitHub)
LLMs ↓ | C4 | OpenWebText | Wikipedia | the Pile - Stack Exchange | BookCorpus | Gutenberg | CC-Stories-R | CC-News | RealNews | the Pile - PubMed Abstracts | the Pile - ArXiv | BigQuery | the Pile - GitHub
T5 [69] ✓ ✓ ✓ X X X X X X X X X X
Falcon [78] ✓ ✓ ✓ X X X X X X X X X X
LLaMA [76] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
GPT-3 [66] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X X X
GPT-4 [96] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X X X
MT-NLG [75] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Gopher [73] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X ✓ ✓
Chinchilla [79] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X ✓ ✓
GLaM [101] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X X X
PaLM [70] ✓ ✓ ✓ X ✓ ✓ ✓ ✓ ✓ X X ✓ ✓
LaMDA [71] ✓ ✓ ✓ X X X X X X ✓ ✓ ✓ ✓
Galactica [99] ✓ ✓ ✓ X X X X X X ✓ ✓ ✓ ✓
GPT-NeoX [102] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CodeGen [103] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
AlphaCode [86] X X X X X X X X X X X ✓ ✓
GPT-1 [86] X ✓ ✓ X ✓ X X X X X X ✓ ✓
GPT-2 [86] X ✓ ✓ X ✓ X X X X X X X X
BARD [86] ✓ ✓ ✓ X ✓ X X X X X X X X
BERT [86] ✓ ✓ ✓ X ✓ X X X X X X X X
PanGU-α [86] ✓ ✓ ✓ X ✓ X X X X X X X X
BLOOM [86] ✓ ✓ ✓ X X X X X X X X X X
OPT [86] X ✓ ✓ X ✓ ✓ X X X X X X X
GLM-130 [86] X ✓ ✓ X X X X ✓ ✓ X X ✓ ✓
Size 800GB 38GB 21GB 800GB 5GB - 31GB 78GB 120GB 800GB 800GB - 800GB
CommonCrawl RedditLinks Wikipedia Other Books Books CommonCrawl CommonCrawl CommonCrawl Other Other Codes Other
Source
(April 2019) (March 2023) (March 2023) (Dec 2020) (Dec 2015) (Dec 2021) (Sep 2019) (Feb 2019) (April 2019) (Dec 2020) (Dec 2020) (March 2023) (Dec 2020)

Table 9 shows that the datasets are divided into multiple categories: webpages, conversation data, books and news, scientific data, and code. This classification enables us to comprehend the variety of data sources that contribute to LLMs training. C4, OpenWebText, and Wikipedia are examples of datasets that belong to the "Webpages" category, while BookCorpus, Gutenberg, CC-Stories-R, CC-News, and RealNews are examples of datasets that belong to the "Books and News" category. These categories reflect the richness and diversity of text data used to train LLMs, including web content, novels, news articles, scientific literature, and code.
From the ✓ marks, it can be seen that LLaMA has been trained on a wide range of data sources, with significant exposure to webpages (87%), conversation data (5%), books and news (2%), scientific data (3%), and code (5%). This makes LLaMA a versatile model suitable for a wide array of natural language processing tasks that involve these data types. In contrast, models such as GPT-3 and AlphaCode have restricted data exposure. GPT-1 and GPT-2 focus on webpages (70%) and books and news (30%) data to train the model. GPT-3 is proficient with web pages (84%) and literature and news (16%) but would require additional training with conversation data, scientific data, and code. This diverse dataset range enables the GPT models to generate more contextual information across various domains. Specifically, the webpages, books, and news datasets help them comprehend formal and structured language; consequently, GPT models achieve the capability of responding in a more informative and accurate way. AlphaCode, as its name suggests, is solely focused on code (100%) and does not utilize any other data sources. This feature uniquely distinguishes AlphaCode from other models and emphasizes the significance of this model for code-based tasks. BARD, BERT, and PanGU-α exhibit identical traits, with each of them concentrating on the extensive textual data obtained from webpage contents and books for pretraining. BLOOM and OPT primarily emphasize data from books and websites, such as Wikipedia or other online sources. On the other hand, GLM-130 not only analyzes books and web data but also incorporates computer code data to provide further technological benefits. The LaMDA, Galactica, and CodeGen models use scientific data sources for training, which helps these models adapt to scientific knowledge and terminology; hence, these models can produce more accurate responses in scientific domains. These findings highlight the significance of this table, which helps distinguish the models based on their training data sources and task suitability. AlphaCode and GLM-130 are the models of choice for code-related tasks, whereas LLaMA and BERT excel in diverse text data applications. Most of the LLMs, such as T5, the GPT models, Gopher, GLaM, PaLM, and BLOOM, heavily utilize web-sourced data, which helps them automate various practical tasks such as content creation, data analysis, and virtual chatbots for question answering. On the contrary, some models, such as Falcon, OPT, and different versions of the GPT models, utilize books and news data, which facilitates educational-material-based applications such as document summarization and article writing. The models trained on scientific data have several use cases in the scientific domain. In addition, Table 9 provides contextual information about the datasets to maintain the transparency of the comparison and to serve as an effective guide for future model implementation. The "Size" and "Source" columns of the table list this additional information. The size of the datasets ranges from 5GB (BookCorpus) to a massive 800GB (several datasets), indicating the sheer magnitude of data required to train these LLMs. The source information reveals when and where the data were collected, which is essential for comprehending the training data's temporal relevance and potential biases. Table 9 thus provides a multitude of information regarding the datasets used to train LLMs and how each model leverages these datasets. This information is invaluable for natural language processing researchers, developers, and practitioners, as it enables them to make informed decisions about which LLMs to use for specific tasks and casts light on the breadth and depth of data that powers these cutting-edge language models.
G. PERFORMANCE ANALYSIS OF LLMS
Large language models perform the majority of natural language processing tasks and are trained on vast quantities of data. Over time, numerous large language models, including GPT-1 through GPT-4, Bing, ChatGPT, and BERT, have been developed and contribute jointly to industry and academia. As a result of the scarcity of adequate data pertaining to large language models, we solely present performance outcomes for diverse tasks of publicly accessible LLMs in Table 10. All GPT series models, including GPT-1, GPT-2, GPT-3, GPT-3.5, and GPT-4, are evaluated using a variety of metrics, including the Stanford Question Answering Dataset (SQuAD), the LAMBADA language modeling benchmark, and the General Language Understanding Evaluation (GLUE), as shown in Table 10. GPT-1 obtains a score of 68.4 on GLUE, while GPT-2, GPT-3, GPT-3.5, and GPT-4 attain scores of 84.6, 93.2, 93.5, and 94.2, respectively. The GLUE results indicate that GPT-4 outperforms prior versions of GPT. The GPT-4 scores on SQuAD and LAMBADA are 93.6 and 82.4, respectively. As shown in the table, GPT-4 outperforms its predecessors in both LAMBADA and SQuAD. As GPT-4 outperforms its predecessors in all three benchmark metrics and exhibits robust performance, it can be concluded that GPT-4 is significantly more effective than its predecessors in tasks involving language comprehension and language modeling. The Vietnamese High School Graduation Examination (VNHSGE) English dataset was utilized to analyze various LLMs, including GPT-3.5, BingChat, and BARD. Based on the accuracy presented in Table 10, it is evident that BingChat outperforms the other two models, achieving an accuracy of 92.4% on this dataset. LLMs such as ChatGPT and Bing were also evaluated using average intraclass correlation coefficient (ICC) values: the ICC value for Bing was 0.975, whereas for ChatGPT it was 0.858. The higher mean ICC value indicates that Bing exhibited more robust performance and consistency in comparison to ChatGPT. Table 10 shows that all of the LLMs mentioned in the table have been analyzed and tested on multiple performance metrics and datasets to validate the robustness and reliability of these language models.

VI. RESOURCES OF LARGE LANGUAGE MODELS
Large Language Models (LLMs) have a wide range of potential applications and resources available for their development, deployment, and utilization. In Figure 7, we present an LLMs taxonomy that categorizes Large Language Models into two main branches: those based on pre-trained models and those based on APIs. This taxonomy allows for a comprehensive exploration of these two distinct aspects of Large Language Models. Some key resources associated with LLMs are presented below.

A. PRETRAINED MODELS
Pretrained language models play a pivotal role in natural language processing due to their ability to encapsulate broad language understanding and generation skills gleaned from diverse text sources. They offer a substantial advantage by minimizing the computational resources and data required for fine-tuning on specific tasks. Some of the most common pre-trained LLMs are depicted in Table 11.
1) Generative Pretrained Transformer (GPT)
The Generative Pre-trained Transformer [66] is an influential breakthrough in artificial intelligence, particularly in natural language processing (NLP). Developed by OpenAI, GPT leverages the Transformer architecture and extensive pre-training on vast internet text data to achieve a deep understanding of human language. This generative model excels at tasks like text generation, translation, question answering, and more, making it a versatile tool across various NLP domains. GPT's capacity to capture intricate language patterns and context, coupled with its iterative improvements, has profoundly impacted academia and industry, revolutionizing the landscape of language understanding and generation.
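As a hedged, minimal illustration of decoder-style text generation, the openly downloadable GPT-2 checkpoint can stand in for the GPT family (GPT-3/4 themselves are only reachable through the OpenAI API); the interface shown is the Hugging Face transformers pipeline, and the prompt is ours:

```python
from transformers import pipeline

# GPT-2 is used here only as a freely downloadable stand-in for the GPT family.
generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30)
print(result[0]["generated_text"])
```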
2) BERT
BERT [11], short for "Bidirectional Encoder Representations from Transformers," is a language model with a distinctive approach. Unlike previous models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by considering both left and right context in all layers. The pre-trained BERT model can be fine-tuned with minimal adjustments to create cutting-edge models for various tasks like question answering and language inference, eliminating the need for extensive task-specific modifications. BERT is both conceptually straightforward and remarkably effective, achieving state-of-the-art results on eleven different natural language processing tasks. Notable accomplishments include raising the GLUE score to 80.5% (an impressive 7.7% absolute improvement), boosting MultiNLI accuracy to 86.7% (a 4.6% absolute improvement), and significantly improving the SQuAD v1.1 question answering Test F1 to 93.2 (a 1.5 point absolute improvement) and the SQuAD v2.0 Test F1 to 83.1 (a remarkable 5.1 point absolute improvement).
In our analysis, we have exclusively considered versions of BERT (Bidirectional Encoder Representations from Transformers) that are inherently Large Language Models (LLMs). Specifically, we focused on variants of BERT that are pre-trained on extensive text corpora and possess the characteristics of LLMs, enabling them to understand and generate natural language comprehensively. This deliberate choice ensures that the models we have included in our study harness the full spectrum of language understanding and generation capabilities, thereby aligning with the core objective of our research in exploring the impact and advancements of LLMs in the field of natural language processing. Non-LLM versions of BERT, or those with significantly reduced model sizes, were excluded from our analysis to maintain consistency and relevance in our investigation of the transformative potential of Large Language Models.
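As a brief, hedged illustration of BERT's masked-language-modeling interface, a publicly released BERT checkpoint can be queried through the Hugging Face transformers fill-mask pipeline (the model choice and prompt are ours):

```python
from transformers import pipeline

# bert-large-uncased is one of the publicly released BERT checkpoints.
unmasker = pipeline("fill-mask", model="bert-large-uncased")
for candidate in unmasker("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```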
TABLE 10. Accuracy of various LLMs on different datasets.

LLMs | Accuracy | Task
GPT-1 [67] | 68.4%, 48.4%, and 82.0% | Scores of GPT-1 on the standard NLP modeling tasks GLUE, LAMBADA, and SQuAD, respectively.
GPT-2 [97] | 84.6%, 60.1%, and 89.5% | Scores of GPT-2 on GLUE, LAMBADA, and SQuAD, respectively.
GPT-3 [66] | 93.2%, 69.6%, and 92.4% | Scores of GPT-3 on GLUE, LAMBADA, and SQuAD, respectively.
GPT-3.5 [104] | 93.5%, 79.3%, and 92.4% | Scores of GPT-3.5 on GLUE, LAMBADA, and SQuAD, respectively.
GPT-3.5 [104] | 79.20% | Performance of GPT-3.5 on the VNHSGE English dataset.
GPT-4 [96] | 85.50% | 3-shot accuracy on MMLU across languages (English).
GPT-4 [96] | 94.2%, 82.4%, and 93.6% | Scores of GPT-4 on GLUE, LAMBADA, and SQuAD, respectively.
ChatGPT [105] | 71% and 68% | A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface; ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively.
ChatGPT [105] | 75.1% (SD 3%) and 64.5% (SD 5%) | The 5-year average percentage of correct answers for ChatGPT was 75.1% (SD 3%) for basic knowledge questions and 64.5% (SD 5%) for general questions.
ChatGPT [105] | 0.858 (95% CI: 0.777 to 0.91, p<0.0001) | The average intraclass correlation coefficient (ICC) value for ChatGPT across a total of 77 cases (answering case vignettes in physiology).
BingChat [106] | 92.40% | Performance on the VNHSGE English dataset.
Bard [98] | 86% | Performance on the VNHSGE English dataset.
Bing [107] | 0.975 (95% CI: 0.961 to 0.984, p<0.0001) | The average intraclass correlation coefficient (ICC) value for Bing across a total of 77 cases (answering case vignettes in physiology).
BERT [67] | Dev 86.6%, Test 86.3% | BERT (large) performance on SWAG (Situations With Adversarial Generations).
BERT [67] | 82.1% average; grammaticality 60.5%, sentiment analysis 94.9%, similarity 86.5%, paraphrase 89.3%, question similarity 72.1%, contradiction 86.7%, answerability 92.7%, and entailment 70.1% | BERT (large) performance on the GLUE (General Language Understanding Evaluation) sub-tasks.

FIGURE 7. Taxonomy of LLMs

3) RoBERTa
RoBERTa [68] is a replication study of the BERT pre-training approach outlined by Devlin et al. in 2019, in which the authors meticulously assess the influence of various critical hyperparameters and training data sizes. It is worth noting that BERT was initially trained with room for improvement, yet it can perform on par with, or even surpass, the performance of subsequent models that have been published. As a result, RoBERTa achieves top-tier results on the GLUE, RACE, and SQuAD evaluations.

20 VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3365742

Author et al.: A Survey on Large Language Models

TABLE 11. Description of LLMs

Model Name | Description | Key Features | Training Data | Fine-Tuning Data | Fine-Tuning Tasks | Applications
GPT (Generative Pretrained Transformer) [66] | Transformative LLMs by OpenAI for versatile NLP tasks. | Extensive pre-training, deep language understanding, iterative improvements, impact on academia/industry | Internet text data | Custom datasets | Text generation, translation, QA, and more | Chatbots, content generation, NLP domains
BERT (Bidirectional Encoder Representations from Transformers) [11] | Google AI's NLP model excelling with bidirectional context learning. | Deep bidirectional representations, conceptually straightforward, minimal task-specific adjustments | BookCorpus, Wikipedia | Task-specific datasets | Various NLP tasks | Question answering, language inference
RoBERTa [68] | BERT-based model with refined hyperparameters. | Significance of design decisions, publicly available, top-tier NLP results | BookCorpus, Wikipedia | Task-specific datasets | Various NLP tasks | Benchmark improvements, research
XLNet [108] | Combines autoregressive pretraining with bidirectional context learning. | Bidirectional context learning, versatile approach | Internet text data | Task-specific datasets | Diverse NLP tasks | Research, applications
Speech-XLNet [109] | Unsupervised acoustic model with robust regularization. | Robust regularizer, improved recognition accuracy | Speech datasets | TIMIT, WSJ datasets | Speech recognition | Speech recognition systems
DialogXL [110] | Improved dialogue handling with dialog-aware self-attention. | Enhanced conversation modeling, outperforms baselines | Internet text data | Dialogue datasets | Dialogue understanding | Chatbots, customer support
T5 (Text-to-Text Transfer Transformer) [85] | Google's unified text-to-text NLP model. | Unified framework, extensive pre-training, versatile tool | Internet text data | Task-specific datasets | Text classification, translation, and more | Language translation, summarization
BioGPT [111] | Specialized biomedical LLMs with state-of-the-art results. | Biomedical literature pretraining, excels in biomedical tasks | Biomedical literature | Biomedical datasets | Biomedical text analysis | Biomedical text analysis, research

These outcomes underscore the significance of design decisions that were previously overlooked and prompt inquiries into the origins of recently reported advancements. The authors have made their models and code available for public use.

4) XLNet
XLNet [108] represents a versatile autoregressive pretraining approach that achieves bidirectional context learning by optimizing the expected likelihood across all possible permutations of factorization orders. It addresses the constraints of BERT through its autoregressive design and incorporates insights from Transformer-XL, a leading autoregressive model. In practical experiments with consistent conditions, XLNet consistently surpasses BERT on 20 diverse tasks, frequently by a substantial margin. These tasks encompass question answering, natural language inference, sentiment analysis, and document ranking, among others.

5) Speech-XLNet
Speech-XLNet [109] is a method for training unsupervised acoustic models to learn speech representations using a Self-Attention Network (SAN), which is subsequently fine-tuned within the hybrid SAN/HMM framework. The hypothesis is that, by rearranging the order of speech frames, the permutation technique in Speech-XLNet acts as a robust regularizer, encouraging the SAN to make inferences by prioritizing global structures through its attention mechanisms. Moreover, Speech-XLNet enables the model to explore bidirectional contexts, enhancing the effectiveness of speech representation learning. Experimental results on the TIMIT and WSJ datasets demonstrate that Speech-XLNet significantly enhances the performance of the SAN/HMM system in terms of both convergence speed and recognition accuracy compared to systems trained from randomly initialized weights. The best models achieve an impressive relative improvement of 11.9% and 8.3% on the TIMIT and WSJ tasks, respectively. Notably, the top-performing system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, according to the authors, is the lowest PER achieved by a single system.

6) DialogXL
DialogXL [110] introduces enhancements to tackle longer historical context and multi-party structures in dialogues. Initially, alterations are made to how XLNet manages recurrence, transitioning from segment-level to utterance-level,
thereby improving its effectiveness in modeling conversational data. Secondly, the integration of dialog-aware self-attention, as opposed to the standard self-attention in XLNet, enables capturing crucial dependencies within and between speakers. While training DialogXL, a comprehensive set of experiments is conducted on four ERC benchmarks, comparing DialogXL with mainstream models. The experimental results consistently demonstrate that DialogXL outperforms the baseline models across all datasets.
7) T5
T5 (Text-to-Text Transfer Transformer) [85] is a ground-breaking large language model developed by Google Research that has had a major impact on natural language processing (NLP). T5's innovation lies in framing all NLP tasks as text-to-text tasks, simplifying the NLP pipeline and unifying various tasks under a single framework. Built upon the Transformer architecture, T5 utilizes multi-head self-attention to capture intricate language relationships. Its extensive pre-training on vast text data, followed by fine-tuning on specific tasks, empowers T5 to excel in text classification, translation, summarization, question answering, and more. With consistently state-of-the-art results across NLP benchmarks, T5 has reshaped the field, offering researchers and developers a versatile tool for comprehensive language understanding and generation tasks.
ing potentially sensitive content. Several API exists, and each
8) BioGPT possesses distinct features and functions.
BioGPT [111] is a large-scale language model that was Microsoft Azure Language APIs: These APIs support
constructed by the Allen Institute for AI (AI2) with the many activities, including sentiment analysis, text summa-
explicit purpose of undertaking training on biomedical text. rization, and other related tasks [117]. Developers use REST-
It was trained on an extensive corpus of biomedical literature, ful endpoints to include Azure LLMs APIs. Microsoft pro-
including PubMed abstracts and full-text articles, and is vides useful SDKs and code examples in other programming
based on the GPT architecture. It has been demonstrated languages, including Python, Java, etc. to facilitate the uti-
that BioGPT outperforms alternative biomedical language lization of these APIs.
models across a range of tasks, such as query answering, IBM Watson Natural Language: The IBM Watson API is a
relation extraction, and named entity recognition. The pre- robust tool for investigating and extracting valuable informa-
trained weights of the model are accessible to the public, tion from textual data. This API offers developers a variety
enabling researchers to optimize it using their biomedical of functionalities, encompassing sentiment analysis, emotion
text data. BioGPT has the capacity to substantially drive analysis, and additional features [118]. Due to its provision of
biomedical research forward by facilitating the analysis of multilingual support and a user-friendly API, this technology
vast quantities of biomedical text data in a more precise and enables developers to effectively include sophisticated text
efficient manner [112], [113]. analytics into their programs.
In summary, pre-trained LLMs are foundational in NLP, Amazon Comprehend API: The Amazon Comprehend API
providing a starting point for various applications without is a powerful NLP service provided by Amazon Web Ser-
the need for extensive training from scratch. They are widely vices [119]. This tool evaluates textual data, allowing the
used and have democratized access to advanced language un- researchers to acquire significant knowledge, such as en-
derstanding and generation capabilities. However, responsi- tity recognition, language detection, sentiment analysis, and
ble use and ethical considerations are essential when working topic modeling. Due to its ability to accommodate many
with these models to ensure fair and unbiased outcomes. languages and simple integration, this tool displays adapt-
ability in addressing a range of use cases, including customer
B. API OF LLMS feedback analysis and others. The utilization of this API can
In this section, we discuss the APIs of LLMs, which have prove to be a significant resource for enterprises’ marketing
been described in Table 12. to extract practical insights from unstructured textual data.
Open AI API: The API provided by OpenAI offers access Facebook AI’s Fairseq: The Fairseq framework developed
to GPT models that may be utilized for a wide range of by Facebook AI is a comprehensive tool for performing
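To make the access patterns discussed in this section concrete, the following minimal Python sketch shows how a pre-trained text-to-text checkpoint such as T5 can be run locally with the open-source transformers library. The checkpoint name (t5-small), prompts, and generation settings are illustrative assumptions rather than configurations taken from the reviewed studies.

```python
# Minimal sketch: local inference with a pre-trained text-to-text checkpoint.
# Assumptions: the `transformers` library is installed and the publicly
# hosted "t5-small" checkpoint is used; any T5-family model could be swapped in.
from transformers import pipeline

# T5 frames every task as text-to-text, so the task is expressed as a prompt prefix.
t5 = pipeline("text2text-generation", model="t5-small")

summary = t5(
    "summarize: Large language models are transformer networks pre-trained on "
    "massive text corpora and later fine-tuned for downstream tasks.",
    max_new_tokens=40,
)[0]["generated_text"]

translation = t5(
    "translate English to German: The model was fine-tuned on domain-specific data.",
    max_new_tokens=40,
)[0]["generated_text"]

print(summary)
print(translation)
```

Frameworks such as Fairseq expose comparable fine-tuning and generation workflows for the sequence-to-sequence models mentioned above.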
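For the hosted services compared in Table 12, access is typically a matter of an authenticated REST call. The sketch below, which is illustrative rather than drawn from the reviewed papers, shows the general shape of such requests for the OpenAI Chat Completions endpoint and the Hugging Face Inference API; the model names, prompts, and environment-variable names are assumptions.

```python
# Minimal sketch of the hosted-API access pattern: authenticated REST calls.
# Assumptions: valid API keys are stored in the environment variables below,
# and the endpoint/payload shapes follow the providers' public documentation.
import os
import requests

def openai_chat(prompt: str) -> str:
    """Call the OpenAI Chat Completions endpoint using system/user roles."""
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

def hf_inference(text: str, model_id: str = "gpt2"):
    """Query a hosted model through the Hugging Face Inference API."""
    response = requests.post(
        f"https://api-inference.huggingface.co/models/{model_id}",
        headers={"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"},
        json={"inputs": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(openai_chat("List three NLP tasks that hosted LLM APIs are commonly used for."))
    print(hf_inference("Large language models can"))
```

The Google Cloud, Azure, IBM Watson, and Amazon Comprehend services described above follow the same pattern through their respective REST endpoints and SDKs.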
TABLE 12. Comparison of LLMs APIs

API Name | Provider | Languages Supported | Access Type | Application Area | Advantages | Constraints
OpenAI API [114] | OpenAI | Multiple languages | API key | NLP, text generation, chatbots | State-of-the-art models, versatility, GPT architecture | API rate constraints, cost considerations
Hugging Face Transformers [115] | Hugging Face | Multiple languages | Open source | NLP, model fine-tuning, research | Large model repository, extensive community support | Self-hosting complexity, no official support
Google Cloud AI-Language [116] | Google Cloud | Multiple languages | API key | Sentiment analysis, entity recognition, translation | Google's robust infrastructure, easy integration | Cost may vary based on usage
Microsoft Azure Language [117] | Microsoft Azure | Multiple languages | API key | Sentiment analysis, entity recognition, language understanding | Integration with Azure services, comprehensive APIs | Pricing based on usage
IBM Watson NLU [118] | IBM Watson | Multiple languages | API key | Sentiment analysis, emotion analysis, keyword extraction | IBM's AI expertise, customization options | Costs may add up for high usage
Amazon Comprehend [119] | Amazon AWS | Multiple languages | API key | Entity recognition, sentiment analysis, topic modeling, document classification | Integration with AWS, scalability | Costs may vary based on usage
Facebook AI's Fairseq [122] | Facebook AI | Multiple languages | Open source | Neural machine translation, language modeling, research, development | Research-oriented, flexibility, open source | Self-hosting and maintenance complexity

In this study, we have provided a comprehensive overview of seven popular APIs in Table 12 that leverage the capabilities of LLMs for NLP-based functionalities. However, our taxonomy also revealed the presence of several other APIs that are associated with text analysis but do not utilize LLMs. These APIs, including TextBlob, TextRazor, Sapling AI, MonkeyLearn, and Aylien, rely on traditional machine learning, statistical methods, and rule-based NLP techniques instead of extensive pre-trained LLMs. Since the primary focus of this study has been on describing the tools that specifically utilize LLMs for advanced text analysis, generation, and comprehension, we have refrained from discussing these APIs in depth.

VII. DOMAIN SPECIFIC APPLICATION
Since several pre-trained LLMs are available, all of them are trained further or fine-tuned to perform well-defined tasks according to the requirements of different fields. Numerous research studies have consistently contributed by applying LLMs in diverse domains such as healthcare, finance, education, forecasting, and natural language processing. The extensive experiments with different LLMs contribute to revolutionizing the use of AI across these diverse domains. This section demonstrates the potential contribution of LLM applications in different domains. Table 13 illustrates the major contributions of LLMs in specific domains, as well as outlining their prospective limitations and future directions.
Bio-Medical and Healthcare: As previously stated, GPT has several versions, ranging from GPT-1 to GPT-4. GPT-3 is extremely useful in the healthcare industry since it can be trained to support customer service with minimal effort. It can gather all required information through a conversation rather than an intake form, and many systems might be built to
TABLE 13. Domain Specific Machine learning-based study comparison in LLMs

Domain Author Major Contributions Limitations Future Research Direction


I. Assess the state-of-the-art performance
I. To support this study’s findings,
of biomedical LLMs for the purpose of I. Data limitation due to privacy
need to experiment using real clinical
Chen et al., [123] classifying and reasoning tasks on clinical concern of biomedical data.
data.
(2023) text data. II. Emphasizes the vulnerability II. Did not evaluate the performance
II. Optimize the models to make them
of LLMs performance in relation to prompts of the model in an out-of-domain task.
more robust and resource-efficient.
and addresses it.
I. Investigates the possible utilization of LLMs, I. Lack of data resulted in the
specifically ChatGPT and its variety within the post-training process, raising concerns
I. Reducing operational costs by fine-tuning
domain of dentistry. about the model’s reliability.
Huang et al., [124] the model and enhancing efficiency.
II. Design a MultiModal LLMs system for II. The possibility of data breaches has
(2023) II. Explore diverse medical data to provide
clinical dentistry application and address no strict security method.
personalized dental care.
critical challenges to revolutionize III. Requires intensive computational
dental diagnosis. cost.
I. Evaluating the efficacy of ChatGPT-3.5 as I. More diverse sample of breast tumor cases
I. Conducting the experiment with a small
a supporting tool for facilitating clinical to increase ChatGPT’s performance and
sample size leads to performance bias in
Sorin et al., [125] decision-making in breast tumor cases. generalizability.
the model.
(2023) II. Outlines the implementation of a grading II. Introducing a multimodal approach to
II. Human errors in the grading system can
system for evaluating the responses generated increase the reliability of clinical
potentially add biases to the system.
by ChatGPT. recommendations.
I. Emphasis on integrating more recent and
I. Focuses on the energy and environmental
I. Inaccuracies observed in the responses up-to-date training data.
impact of training LLMs models such as
provided to queries due to the lack of II. Further investigation should strive to
GPT-3 and GPT-4 and emphasize cost
Thirunavukarasu et al., updates on the training data. enhance the transparency and interpretability
reduction to make them more accessible.
[126] (2023) II. Lack of interpretability of LLMs model of LLMs.
II. Examines the utilization of LLMs models
since it is a black box, hence the concept was III. Including the feasibility of implementing
in the medical domain, specifically focusing
frequently misunderstood. randomized trials to evaluate the effects of
on medical education and medical research.
LLMs on medical outcomes.
I. Conversational AI like GPT-3 will not
I. Discuss the benefits and potential pitfalls of I. Analyze GPT’s impact on real-world
replace human interaction in healthcare
Korngiebel et al., [127] NLP technologies in eHealth. healthcare settings to assess its performance.
soon, despite extensive development.
(2021) II. Discuss the benefits of using GPT in the II. Provide personalized healthcare by
II. Examines GPT’s applicability in a
medical domain. analyzing a variety of medical data.
certain medical domain.
I. examine LLMs’ ethical and practical issues,
I. The addition of a detectable-by-design
focusing on medicinal use and public health. I. An experiment using real clinical data is
the technique may slow LLMs development
Angelis et al., [128] II. Discuss how ChatGPT can provide false or needed to support the findings.
and AI business acceptance.
(2023) misleading information. II. Further research should be conducted to
Medical

II. Experimental data has been limited due


III. Suggest the detectable-by-design technique speed up the entire procedure.
to medical data privacy concerns.
to spot fake news or information.
I. Copyright issues, bias based on the training
I. Accountability, honesty, transparency, and
dataset, plagiarism, over-detailed content, lack
integrity must be considered in scientific
of scientific accuracy, limited updated
I. Saves time in scientific research through research.
knowledge, and lack of ability to critically
code delivery and literature review.
discuss the results in using ChatGPT in scientific
II. Makes the publication process faster by II. To enhance healthcare and academics,
research.
providing better research ideas and results. ChatGPT should uphold ethical principles.
Sallam et al., [129] II. Unable to understand the complexity
III. Reduces potential costs and increases Potential dangers and other issues must also
(2023) of biological systems, lack of emotional and
efficiency in healthcare delivery. be considered.
personal perspective,
IV. Enhances communication skills in
inaccurate content, bias, and transparency issues
healthcare education through proper III. An AI editor and an AI reviewer in
in healthcare practice.
academic mentoring. academic writing to advance academic
III. Copyright issues, inaccurate references,
research, given the previous shortcomings
limited updated knowledge, and plagiarism in
of the editorial and peer review process.
healthcare education.
I. Enhance the ability to answer medical
I. Generates answers that sound plausible but questions and provide the context for
Cascella et al., [130] I. Support of clinical practice
may be incorrect or meaningless and biased understanding complex relationships
(2023) II. Scientific writing
based on trained data. between various medical conditions
and treatments.
I. The investigation of AI within the context
of medical education. I. The experiment is conducted on a small I. To evaluate the efficacy of ChatGpt in
II. Assessment of ChatGPT’s Performance in input size. real-world clinical practice by assessing
Kung et al., [131]
Clinical Decision-making. II. Human adjudication variability and bias. its performance and impact.
(2023)
III. Explore the demands of AI in medical III. The absence of real-life instructional II. A comprehensive analysis of ChatGPT’s
education to standardize methods and readouts scenarios. effectiveness in relation to subject taxonomy.
and quantify human-AI interactions
I. Shows that domain-specific pretraining from I. Explore the applicability only in a fixed I. An Investigation and analysis into
scratch outperforms mixed-domain in Biomedical Domain. pretraining strategies.
Gu et al., [132]
biomedical NLP. II. Future modifications of the benchmark may II. The addition of Biomedical NLP tasks.
(2021)
II. Formulate a new dataset using the be required to reflect the effectiveness of the III. Exploring other domains for comparative
Biomedical set of diverse tasks. research. analysis.
I. Should include metrics, and comparative I. Integrating input from healthcare specialists
I. Introduced a foresight application based on
analysis in real-world clinical scenarios to and consistently updating the model with the
Kraljevic et al., [133] electronic health records.
evaluate Foresight’s performance. latest medical data.
(2022) II. Develop a multifunctional model.
II. Integrate enough security on health records II. Implement a real-life scenario to investigate
III. Conduct experiments in different hospitals.
to protect the privacy of the patients. the clinical application of Foresight.
Tourism

I. Highlights how ChatGPT is contributing to


I. Transparency and accountability issues: the
Mich et al., [134] the tourism sector by identifying new target I. Applications should increase user trust and
dataset is not updated, and can not see the logic
(2023) markets, implementing the marketing strategy fact-checking.
of what is wrong and what is right.
designs, and improving customer service.

TABLE 13. (Continued) Domain Specific Machine learning-based study comparison in LLMs

Domain Author Major Contributions Limitations Future Research Direction


I. Examines how LLMs can use their superior I. SP500 and Russell 2000 stock indexes
I. The study utilizes a small amount of data
knowledge and reasoning to predict financial will be added to the research.
samples.
time series. II. The research will use macro-economy
Yu et al., [135] II. Data is collected from only one specific
Industry

II. Focuses on NASDAQ-100 stocks using time series, stock trading volumes, and social
(2023) domain.
publicly available historical stock price data. network data.
III. Utilizing a small sample size during
III. To prove LLMs can solve problems III. To improve reasoning, larger public models
experiments cause performance bias.
comprehensively, experiments are conducted. like 30B will be refined.
I. Discusses the uses and concerns with ChatGPT I. Analyze how ChatGPT can enhance the
I.A limited amount of data is used in the
in supply chains. supply chain efficiency.
Frederico et al., [136] experiment.
II.Provide supply chain specialists II. Discuss supply chain ChatGPT
(2023) II. Did not assess the efficacy of ChatGPT
advice about ChatGPT’s effects and implementation
in practical industrial settings.
usage. issues and success factors.
I. Examines the efficacy of employing LLMs
as a gaming tool.
Gaming

II. Assess the performance of GPT in the I. They did not employ a well-curated set of
I. Assess the performance of LLMs by
Sobieszek et al., [137] context of the Turing test. targeted questions.
administering inquiries across diverse
(2022) III. Analyze the boundaries of LLMs. II. It may produce answers that are either
domains.
IV. Discuss the challenges these models erroneous or lack significance.
encounter in accurately conveying
information.
I. Utilized network science and cognitive
I. Putting a priority on integrating data from
psychology to study biases toward math and I. Commercial GPT systems can be tested
training that is up-to-date.
STEM across language models. by researchers but not replicated by everyone
Abramski et al., [43] II. Investigating several other fields for the
II. Behavioral Forma Mentis Networks due to their structure.
(2023) purpose of comparative research.
(BFMNs) are used to understand how LLMs II. The old interface or API system no longer
Education

III. More information from students at


comprehend arithmetic, STEM, and similar allows public access to GPT-3.
different institutions will be gathered.
concepts.
I. Helpful only for English-speaking people,
I. Creating an age-appropriate user interface
I.Helps students develop critical thinking in but also for people of other languages cannot
that maximizes the benefits and minimizes
reading and writing, provides practice problems enjoy the benefits.
the pitfalls of interaction with AI-based tools.
and quizzes, helps improve research skills, and II.Consumes high energy and financial
Kasneci et al., [20] II. To guarantee equity for all educational entities
improves various developmental skills. cost of maintenance.
(2023) interested in current technologies, government
II.Provides guidance to teachers on how to improve III. Negative effect on critical thinking and
organizations may regulate financial obstacles to
student learning in each aspect problem-solving skills of students and teachers.
accessing, training, and maintaining large
of teaching and helps develop teaching materials. IV.Privacy and security risks to students’
language models.
personal and sensitive information.
I. Helps students save labor and time by assigning
assignments and helps teachers automate the grading
process, and provides detailed feedback to students,
which reduces their workload.
II.Aid decision-making, problem-solving and promote
learning in medical education. I. Improving the accuracy and performance
Hadi et al., [138] I. Bias, reasoning errors, counting errors,
III.Provides financial advice based on their queries to of LLMs, addressing their limitations, and
(2023) information hallucination, LLMs explainability.
improve customer service, and provides various steps exploring new ways to utilize them.
based on financial algorithms to reduce risk by
analyzing past market data.
IV. Saves software engineers time and increases
overall efficiency by providing code snippets,
identifying and generating test cases, etc.
I. Training instructors on how to effectively
I. Negative effect on critical thinking and
Lo et al., [139] I. Helps students in learning and assessment and helps use ChatGPT and identify student intelligence.
problem-solving skills of students and
(2023) teachers in teaching preparation and assessment. Also, educate students about the uses and
teachers.
limitations of ChatGPT.
I. Highlights the challenges, opportunities, and impacts of I. The generated text is hard to understand I. Teaching, learning, and scholarly research,
Dwivedi et al., [140] ChatGPT in education, business, and society, as well as and can’t answer questions correctly unless digital transformation organization and society,
(2023) investigates important research questions asked of phrased a certain way, lacks updated information, knowledge, transparency, and ethics to enhance
ChatGPT across the education, business, and society sectors. and doesn’t automatically update the actual data. ChatGPT’s efficiency in all these areas.

assist numerous patients at the same time [127]. Besides, clinics and hospitals are places to cure illness, but it is also true that various contagious viruses are brought into these places. Patients and healthcare providers can be better protected from infection by replacing a human receptionist with a robot, which became increasingly important during the COVID-19 epidemic [141]. Since clinics and hospitals often see a high volume of patients on a daily basis, an optimized and lightweight system may submit several queries per patient to create acceptable output. Consequently, GPT models can also aid in cost reduction in the medical industry. Furthermore, biomedical and clinical text mining has always been an essential and major challenge due to the complex nature of domain corpora and the continually expanding number of documents. As a result, using BERT models improves the performance of biomedical and clinical text mining systems [142]. Sallam et al. [129] and Korngiebel et al. [127] demonstrate the substantial advantages of ChatGPT in the domains of healthcare, clinical research, and practice, while simultaneously underscoring the imperative necessity for proactive inspection and ethical transparency. Several studies [126], [130], [132], [133] investigate the prospective utilities and constraints of LLMs such as ChatGPT within the healthcare domain, namely in the context of clinical practice, research, and public health. In their study, Kung et al. [131] conducted an evaluation of ChatGPT's performance on the United States Medical Licensing Examination (USMLE), and the outcomes indicate the potential of LLMs to support clinical decision-making and medical education. Sorin et al. [125] evaluated ChatGPT-3.5 as a decision-support tool for breast tumor boards, comparing the tumor board's explanations and summaries with those of ChatGPT-3.5, and showed that ChatGPT-3.5 and the tumor board had a high degree of decisional alignment. Huang et al. [124] investigate the prospective applications of LLMs, with a specific emphasis on ChatGPT, in the field of dentistry, mainly focusing on automated dental diagnosis and highlighting the efficacy of LLMs in dental diagnosis. Furthermore, XLNet contributes to better clinical note representation by adding temporal information and a realistic prediction setup [143]. In addition, various other LLMs also assist the medical industry by making such procedures easier than before.
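As a small illustration of how the biomedical models discussed earlier can be queried, the sketch below prompts the publicly released BioGPT checkpoint [111] through the transformers library. The checkpoint identifier, prompt, and decoding settings are assumptions made for demonstration, not a protocol used in the studies reviewed here.

```python
# Minimal sketch: prompting a biomedical language model locally.
# Assumptions: a recent `transformers` release with BioGPT support is installed,
# and the publicly released "microsoft/biogpt" checkpoint is used.
from transformers import pipeline

biogpt = pipeline("text-generation", model="microsoft/biogpt")

prompt = "Metformin is a first-line therapy for"
outputs = biogpt(prompt, max_new_tokens=30, num_return_sequences=1)

# Generated continuations are drafts for expert review, not clinical advice.
print(outputs[0]["generated_text"])
```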

Education: Educators have long struggled with matching unequal educational resources to student demand across disciplines. One of the significant challenges is a shortage of accessible educational resources for pupils to study outside of school. Although online instructional videos are helping to alleviate the problem, society still hopes that AI will deliver individualized teaching services to satisfy the learning demands of each student and increase teaching efficiency. LLMs are therefore very significant and have the potential to revolutionize many facets of learning, teaching, and educational research in the education sector [141]. For example, GPT models aid students in converting math word problems into representative equations [144]. Kasneci et al. [20] highlighted the substantial impact of LLMs in education in facilitating personalized learning, automating the grading process, and improving the accessibility of educational resources. Hadi et al. [138] present a thorough analysis of LLMs, covering their historical development, wide-ranging applications in domains such as medicine, engineering, and education, and their potential impact on the trajectory of AI. Lo et al. [139] and Dwivedi et al. [140] investigate the prospective uses of ChatGPT within the realm of education and identify the primary obstacles that have arisen during its initial deployment. Besides, in terms of writing authentic texts in distinct formats, including essays, summaries, and articles, these models help to accomplish this with few errors, whereas the manual process may introduce human errors into the documentation; in this case, GPT models help to address the problem. In addition, XLNet also excels at understanding texts and documents, which can be employed in the education sector [39]. Furthermore, other models significantly impact the education system, making it more engaging, accessible, and productive for both students and teachers.
Social Media: LLMs have revolutionized several aspects of the social media industry regarding content production, moderation, sentiment analysis, etc. Crucial applications of LLMs in the social media sector include writing content, generating images, classifying and generating text, and even producing full blogs and articles for social media. These models can also perform named entity recognition (NER) and text classification [145], [146]. Models such as GPT, XLNet, and BERT aid writers and content producers in generating a consistent flow of excellent material. They also provide content suggestions, and, to create a safer online environment, they are employed to assist in discovering and filtering out dangerous and improper content. In their study, Abramski et al. [43] utilized network science and the principles of cognitive psychology to evaluate biases present in LLMs. Sobieszek et al. [137] present a critical examination of the stated semantic capabilities of GPT-3, aiming to challenge the current view of its dismissal. Moreover, LLMs assist in determining public opinion on certain topics by analyzing public interest and demand.
Business: In business, LLMs help companies improve their decision-making processes, product manufacturing processes, operations, and customer interactions. Communicating with customers and providing 24/7 customer service by answering their queries, assisting them in their work, and providing advanced advice related to customers' areas of interest is crucial for business progress. Moreover, it is also important to analyze customer sentiment, market trends, risk factors, and competitive intelligence [21]. In this case, LLMs help to fulfill all these requirements within a short period. LLMs such as GPT, XLNet, and BERT play a vital role in creating customer documents and product details and in efficiently maintaining the entire business by saving time and reducing laborious tasks. Frederico et al. [136] present an initial investigation into the potential applications and effects of ChatGPT in the domain of supply chain management, providing significant insights for professionals engaged in this domain. Mich et al. [134] present an initial investigation of potential hazards associated with the implementation of ChatGPT in the business domain. Yu et al. [135] presented an analysis of the capabilities of LLMs, specifically GPT-4, in the context of financial time-series forecasting; their findings reveal that LLMs outperform other traditional models.
Agriculture: In agriculture, variants of GPT models, as well as BERT and XLNet models, play a significant role [147]–[149]. They are able to analyze large hubs of soil, crop, and weather data along with satellite imagery. They can provide recommendations on planting times, irrigation, and fertilizer application, and help optimize fields and resources. Farmers can obtain current updates and market requirements, predict crop prices, anticipate natural disasters, and document farmer and crop details. Manual agricultural management can be time-consuming and laborious, but these models can help handle such issues.

VIII. IMPACT OF LARGE LANGUAGE MODELS ON SOCIETY
Large Language Models (LLMs) and similar AI technologies have had a profound impact on society across various domains. While these technologies offer many benefits, they also raise important ethical, social, and economic considerations. Here is an overview of the impact of LLMs on society:
1. Advancements in Natural Language Processing (NLP): LLMs have significantly advanced the field of NLP, making it possible to automate and scale a wide range of language-related tasks such as translation, summarization, sentiment analysis, and more. In recent years, NLP has witnessed significant advancements, primarily driven by the emergence of LLMs. These advancements, exemplified by models such as BERT [11], RoBERTa [68], and XLNet [108], have transformed the NLP landscape. Notably, LLMs have
been fine-tuned for various specific NLP tasks, enabling remarkable performance improvements. Multilingual models like mBERT [150] and cross-lingual models like XLM-R [151] have facilitated language understanding across diverse linguistic contexts. Additionally, there has been a focus on creating more efficient versions of LLMs such as DistilBERT [152] and ALBERT [153]. These developments have not only expanded the applicability of NLP but have also raised ethical considerations, prompting research in bias mitigation [154] and responsible AI. LLMs have enabled breakthroughs in applications like conversational AI, few-shot and zero-shot learning, and domain-specific NLP in fields like healthcare and finance. These advancements underscore the pivotal role of LLMs in advancing the capabilities of NLP and continue to shape the future of language understanding and generation.

FIGURE 8. Visual representation of the impact of LLMs.

2. Automation and Efficiency: LLMs are used to automate tasks that were previously time-consuming and labor-intensive, leading to increased efficiency in industries such as customer support, content generation, and data analysis. The automation and efficiency of Large Language Models (LLMs), driven by models like BERT and GPT, have revolutionized industries and applications. These models have automated intricate language-related tasks, from sentiment analysis to language translation, making them more efficient and accessible. LLMs such as DialoGPT [155] and ChatGPT have powered conversational AI, streamlining customer support and interactions. Moreover, they excel in few-shot and zero-shot learning, as demonstrated by GPT-3 [156], automating tasks with minimal examples. Multilingual LLMs like mBERT have automated language tasks across various languages, enhancing global accessibility. Efficiency has further advanced through models like DistilBERT and ALBERT, which maintain performance while reducing computational resources. These models can be fine-tuned for specific domains, such as healthcare [157], making them indispensable in automating domain-specific tasks efficiently.
3. Content Generation: LLMs are capable of generating human-like text, which has implications for content creation, including automated news articles, marketing materials, and creative writing.
4. Language Translation: LLMs have improved machine translation systems, making communication across languages more accessible and accurate.
5. Virtual Assistants and Chatbots: LLMs power virtual assistants and chatbots, enhancing customer service and providing round-the-clock support in various industries.
6. Medical and Scientific Research: LLMs are used to analyze and summarize vast amounts of medical and scientific literature, aiding researchers in finding relevant information quickly.
7. Accessibility: LLMs have the potential to improve accessibility by providing real-time translation and transcription services for individuals with hearing impairments or language barriers.
8. Personalization: LLMs enable personalized recommendations and content curation on platforms such as social media, e-commerce, and news websites.
9. Creative Tools: LLMs are used as creative tools in various art forms, including generating poetry, music, and visual art.
10. Ethical Concerns: Bias and fairness issues in LLMs have raised ethical concerns. LLMs may perpetuate or amplify biases present in training data, leading to unfair or discriminatory outcomes.
11. Misinformation and Disinformation: LLMs can generate realistic-sounding fake text, raising concerns about the spread of misinformation and disinformation.
12. Job Displacement: The automation capabilities of LLMs may lead to job displacement in certain industries, particularly in routine data-entry and content-generation roles.
13. Data Privacy: The use of LLMs often involves processing large amounts of user-generated text data, which raises data privacy concerns, especially regarding sensitive or personal information.
14. Economic Impact: The adoption of LLMs can disrupt traditional business models and create economic shifts as industries adapt to automation and AI technologies.
15. Regulation and Accountability: Policymakers and regulators are grappling with the need to establish guidelines and regulations for the responsible use of LLMs, including addressing issues of bias, transparency, and accountability.
16. Education and Skill Development: The rise of LLMs underscores the importance of education and skill development in AI and data science, as these technologies become increasingly integral to various industries.
The impact of LLMs on society is multifaceted, and it is important to consider both the positive and negative consequences. As these technologies continue to evolve, stakeholders, including governments, businesses, researchers, and the general public, must work together to harness the benefits of LLMs while addressing their challenges and ethical implications. Figure 8 visually summarizes this impact, outlining the benefits of LLMs on the left and their adverse impacts on the right, providing a clear and easily understandable depiction of LLMs' effects across different domains.
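As a concrete illustration of the routine task automation described in this section, the short sketch below runs off-the-shelf sentiment analysis with a distilled BERT classifier. The checkpoint name is an assumption (a widely used, publicly available DistilBERT model fine-tuned on SST-2), and any comparable classifier could be substituted.

```python
# Minimal sketch: automating sentiment analysis with a distilled BERT model.
# Assumption: the "distilbert-base-uncased-finetuned-sst-2-english" checkpoint
# is available on the Hugging Face Hub; any comparable classifier would work.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

feedback = [
    "The chatbot resolved my issue in minutes.",
    "The generated summary missed the main point entirely.",
]

for text, result in zip(feedback, classifier(feedback)):
    # Each result carries a predicted label and a confidence score.
    print(f"{result['label']:>8}  {result['score']:.2f}  {text}")
```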
IX. INDUSTRIAL SIGNIFICANCE OF LARGE LANGUAGE MODELS
Large Language Models (LLMs) have gained substantial popularity in various industries, bringing about significant transformations. Their industrial importance can be understood through several key facets:
1. Enhancing Natural Language Processing (NLP) Applications: LLMs have ushered in a revolution in NLP applications [158] across sectors like customer service, chatbots, and sentiment analysis. They contribute to more precise and efficient interactions with users, leading to increased customer satisfaction and reduced response times.
2. Enabling Data Analysis and Information Extraction: LLMs play a pivotal role in extracting valuable insights from unstructured text data [159]. This is particularly critical in fields like finance, market research [160], and healthcare, where deciphering market trends, sentiment in news, or medical records holds paramount significance.
3. Facilitating Translation Services: Industries heavily reliant on multilingual communication [161], such as e-commerce, travel, and international business, benefit from LLMs that streamline automated translation. This saves both time and resources while ensuring high-quality translations across multiple languages.
4. Empowering Content Generation: LLMs are harnessed for content generation [162], which encompasses automated article writing, social media posts [163], product descriptions, and more. This automation simplifies content creation processes and allows for scalable production of top-tier content.
5. Revolutionizing Healthcare: LLMs find applications in medical record analysis [130], diagnosis assistance, and drug discovery. They empower healthcare professionals to access and comprehend extensive medical literature and patient data, thereby enhancing healthcare decision-making.
6. Revamping Education: The education sector [164] leverages LLMs for automated grading, ensuring prompt feedback to students. These models also contribute to the development of intelligent tutoring systems and personalized learning platforms.
7. Aiding Legal Practices: Legal practitioners [165] benefit from LLMs for contract analysis, legal research, and document review. These models assist in efficiently extracting pertinent information and identifying potential legal concerns.
8. Assisting Human Resources: LLMs support HR professionals [166] in tasks like candidate screening, resume parsing, and identifying potential job candidates. They streamline time-consuming processes within the recruitment phase.
9. Empowering Financial Services: In the realm of financial services [167], LLMs come into play for activities like sentiment analysis of news articles, algorithmic trading, risk assessment, and fraud detection. They are instrumental in making informed investment choices and managing financial risks.
10. Boosting E-commerce: LLMs enable personalized product recommendations [168], chatbots for customer support, and efficient inventory management. These enhancements result in enriched user experiences and heightened sales.
11. Illuminating Customer Insights: LLMs analyze customer reviews [169], feedback, and social media data, furnishing businesses with insights into customer preferences, opinions, and sentiments. This invaluable information aids companies in customizing their products and services.
As LLMs continue to advance, their industrial importance is on a steady rise. They streamline operations, enhance decision-making, and bolster efficiency across diverse domains, positioning them as a transformative technology in the contemporary business landscape.

X. OPEN ISSUES AND CHALLENGES
This section provides a critical analysis of the open issues and challenges of LLMs.

A. OPEN ISSUES
In this section, we delve into the critical open issues surrounding LLMs. These concerns are at the vanguard of artificial intelligence research and development. They emphasize the need for ongoing research and innovation to resolve issues that have emerged alongside the rapid development of LLMs. Our discussion will cast light on the significance of these unresolved issues, highlighting their impact on various applications and the AI landscape as a whole.
• Issue 1: Ethical and Responsible AI
The question of how to ensure the ethical use of large language models remains unresolved. Filtering, moderation, and accountability concerns regarding AI-generated content remain troublesome. Misinformation, hate speech, and biased content generated by LLMs necessitate continuous research and development [170].
• Issue 2: Multimodal Integration
While LLMs are predominantly concerned with text, there is a growing demand for multimodal models that can comprehend and generate content that includes text, images, and other media types [171]. Integrating multiple modalities into a single model poses difficulties in data acquisition, training, and evaluation.
• Issue 3: Energy Efficiency
The environmental impact of training and deploying large language models is still an urgent concern [172]. It is essential to develop more energy-efficient training methods, model architectures, and hardware solutions to reduce the carbon footprint of LLMs.
• Issue 4: Security and Adversarial Attacks
LLMs are vulnerable to adversarial attacks, where slight input modifications can lead to unexpected and potentially harmful outputs [173]. Improving model robustness and security against such attacks is a crucial
area of study, particularly for cybersecurity and content moderation applications.
• Issue 5: Privacy and Data Protection
As LLMs become more competent, user privacy and data protection concerns increase. Finding methods for users to interact with these models without compromising their personal information is an ongoing challenge. There is a need for research on privacy-preserving techniques and regulatory compliance [174].
• Issue 6: Generalization and Few-Shot Learning
LLMs excel when there is abundant data but struggle with tasks requiring few examples or domain-specific knowledge. Improving their capacity to generalize and perform well with limited training data is a crucial area of research [175].
• Issue 7: Cross-Lingual and Low-Resource Settings
It is an ongoing challenge to make LLMs more accessible and effective in languages and regions with limited resources and data [176]. Global applications require developing techniques for cross-lingual transfer learning and low-resource language support.

B. CHALLENGES
LLMs have rapidly evolved from being non-existent to becoming a ubiquitous presence in the field of machine learning within just a few years. Their extraordinary ability to generate text that resembles that of a human has garnered significant attention and applications in numerous fields. However, this meteoric rise in prominence has also revealed many challenges and concerns that must be addressed to realize the potential of these models fully. In this discussion, we will examine ten of the most significant challenges pertaining to LLMs.
• Challenge 1: Data Complexity and Scale
In the era of LLMs, the size and complexity of the datasets on which they are trained is one of the most significant challenges. These models are typically trained on enormous corpora of Internet-sourced text data. These datasets are so extensive that it is nearly impossible to comprehend or investigate the totality of their information. This raises concerns regarding the quality and biases of the training data and the potential for the unintentional dissemination of detrimental or inaccurate information [177].
• Challenge 2: Tokenization Sensitivity
For analysis, LLMs rely significantly on tokenization, dividing text into smaller units (tokens) [178]. Tokenization is essential for language processing and comprehension but can also present challenges. For instance, the meaning of a sentence can alter significantly based on the choice of tokens or the ordering of words. This sensitivity to input phrasing can lead to unintended outcomes when generating text, such as adversarial attacks and output variations based on minute input changes.
• Challenge 3: Computational Resource Demands
The training of LLMs is a computationally intensive procedure that requires substantial hardware and energy resources [179]. It is necessary to have access to supercomputing clusters or specialized hardware in order to train large models, and the environmental impact of such resource-intensive training has raised concerns. Significant energy consumption is associated with training LLMs at scale, contributing to the AI industry's overall carbon footprint.
• Challenge 4: Fine-Tuning Complexity
While pre-training gives LLMs a broad comprehension of language, fine-tuning is required to adapt these models to specific tasks [180]. Fine-tuning entails training the model on a smaller dataset, frequently requiring human annotators to label examples. As it involves the construction of task-specific datasets and extensive human intervention, this process can be both time-consuming and costly.
• Challenge 5: Real-Time Responsiveness
The remarkable training capabilities of LLMs come at the expense of inference speed. Real-time response or prediction generation with these models can be sluggish, limiting their applicability in applications such as chatbots or recommendation systems where low-latency responses are crucial for user satisfaction.
• Challenge 6: Contextual Constraints
LLMs can only evaluate a limited number of preceding tokens when generating text due to their limited context window [181]. This limitation presents difficulties when working with lengthy documents or holding lengthy conversations. Maintaining coherence and relevance over lengthy text sequences can be challenging because the model may neglect or lose track of pertinent information.
• Challenge 7: Bias and Undesirable Output
In their output, LLMs can display biases or undesirable characteristics. This is due to the inherent biases in the training data, which are assimilated by the model and reflected in its responses [182]. Such biases can manifest as objectionable, discriminatory, or harmful content, making it imperative to address and mitigate these concerns to ensure the responsible deployment of AI.
• Challenge 8: Knowledge Temporality
LLMs learn using historical data from the Internet, and their knowledge is restricted to what is available as of a particular date. Consequently, they may lack access to the most recent information or events. This can be problematic when users expect up-to-date responses or when the conversation involves recent events.
• Challenge 9: Evaluation Complexity
Evaluation of LLMs presents significant difficulties. Many extant evaluation metrics are insufficient to capture the nuances of model performance, which raises questions about their efficacy. Additionally, these metrics can be susceptible to manipulation or gaming,
which may provide an inaccurate image of a model's capabilities. To assess LLMs' actual performance and limitations, robust and reliable evaluation methodologies are required.
• Challenge 10: Dynamic Evaluation Needs
Frequently, evaluating LLMs entails comparing their outputs to static benchmarks or human-authored ground truth. However, language is dynamic and evolves, and preset evaluation data may not adequately reflect a model's adaptability to language and context change. This difficulty underscores the need for evaluation frameworks that are more dynamic and continually updated.

XI. FUTURE RESEARCH PROSPECTS ON LLMS
In the ever-evolving realm of Large Language Models (LLMs), several key research focuses and directions are emerging that promise to address and resolve the challenges and open issues discussed earlier. These endeavors will play a pivotal role in harnessing the full potential of LLMs while ensuring their responsible and ethical utilization in our dynamic AI landscape.
Enhancing Bias Mitigation: Researchers are dedicated to refining training data to minimize bias, devising effective debiasing techniques, and establishing guidelines for responsible AI development [183]. They are also focused on integrating continuous monitoring and auditing mechanisms into AI pipelines, thereby helping to ensure the fairness and impartiality of the system. This commitment to mitigating bias ensures that LLMs not only advance in capability but do so in a way that upholds ethical standards.
Efficiency Optimization: A core concern driving research is the quest for more efficient training techniques. Researchers are delving into innovative methods like federated learning, which enables the distribution of training across decentralized data sources [184]. They are also exploring knowledge distillation techniques for model compression and finding ways to reduce the substantial computational and environmental costs associated with LLMs. This optimization paves the way for more sustainable and resource-efficient AI models.
Dynamic Context Handling: LLMs are being endowed with enhanced context management capabilities. This empowers them to comprehend longer context windows and seamlessly handle extensive documents or conversations. Such enhancements significantly expand their utility in various applications and resolve previous limitations.
Continuous Learning: To keep LLMs up to date, researchers are focusing on developing techniques that enable these models to adapt to evolving language and knowledge over time. This ensures that LLMs remain valuable and accurate sources of information, consistently overcoming challenges of obsolescence.
Interpretable AI: The research community is committed to making LLMs' outputs more transparent and interpretable. This fosters confidence and comprehension in AI decision-making processes, addressing the interpretability issues that have been a concern [185].
Multimodal LLMs: Researchers are pioneering the development of LLMs that incorporate text, vision, and other modalities [186]. These models can understand and generate text from images, videos, and audio, creating new avenues for AI applications and effectively addressing the need for multi-sensory comprehension.
Human-AI Collaboration: Research on how humans and LLMs can collaborate effectively, with AI assisting and augmenting human tasks, is a crucial focal point. This collaboration bridges the gap between AI capabilities and human needs, thereby resolving previous challenges and issues in deployment.
Dynamic Evaluation Metrics and Relevant Benchmarks: Researchers are working on dynamic evaluation metrics that adapt to changing language and context, ensuring that LLMs' performance is accurately assessed [187]. This includes the development of relevant and up-to-date benchmarks that address earlier shortcomings in assessing AI capabilities.
Personalization and Customization: Techniques to customize LLM interactions to individual user preferences and needs are gaining prominence. This personalization boosts user satisfaction and resolves issues related to one-size-fits-all AI interactions.
Ethical and Legal Frameworks: In response to evolving AI regulation, researchers are diligently developing ethical and legal regulatory frameworks. These frameworks serve as guiding principles for the responsible use of LLMs and ensure compliance with data protection and privacy regulations, effectively addressing previous concerns about ethical AI deployment [188].
These forward-looking research directions stand as beacons of progress, poised to overcome challenges and open issues and ultimately lead to the maximization of LLMs' potential while upholding the highest standards of accountability and ethics in our evolving AI landscape.

XII. LIMITATIONS
While conducting a thorough examination of LLMs, which includes analyzing their application taxonomies, comparing configurations, and addressing concerns and obstacles, it is essential to recognize the existence of some limitations that should be considered. A primary limitation of this study is the unavailability of review papers that directly relate to the topic of LLMs. Although we have made diligent attempts to address the available research thoroughly, the limited quantity of papers in this field restricts our ability to perform broad comparisons and evaluations. However, we have addressed the limitations of the existing review papers and introduced many new aspects to give a comprehensive overview of LLMs. While endeavoring to offer a broad perspective on LLM concepts, we recognize that this analysis predominantly focuses on the ground-level concepts of LLM configurations and applications. Limited resources,
While conducting a thorough examination of LLMs, which includes analyzing their application taxonomies, comparing configurations, and addressing concerns and obstacles, it is essential to recognize several limitations of this study. A primary limitation is the scarcity of review papers that directly relate to the topic of LLMs. Although we have made diligent attempts to cover the available research thoroughly, the limited quantity of papers in this field restricts our ability to perform broad comparisons and evaluations. Nevertheless, we have addressed the shortcomings of the existing review papers and introduced many new aspects to give a comprehensive overview of LLMs. While endeavoring to offer a broad perspective on LLM concepts, we recognize that this analysis predominantly focuses on the ground-level concepts of LLM configurations and applications. Limited resources, time, and page constraints also restrict the depth with which individual LLM architectures are explored. Our goal is not to provide an exhaustive treatment of any single LLM but rather to trace the evolution of LLMs and their applications across various domains; consequently, readers looking for detailed analyses of specific architectures and advanced topics will not find them thoroughly covered here. Furthermore, although the impact of LLMs across various domains, including education, health, and the economy, is highlighted, assessing their practical impact in many of these domains can be complex and subjective, especially with respect to social aspects.

XIII. CONCLUSION
The field of LLMs has witnessed a remarkable evolution and expansion, resulting in extraordinary capabilities in natural language processing (NLP) and a wide range of applications. Built on neural networks and the transformative transformer architecture, these models have revolutionized our approach to machine language comprehension and generation. This review has provided an insightful overview of LLMs, encompassing their historical development, architectural foundations, training methods, and the resources that drive their advancement. It has also examined the applications of LLMs in disciplines such as healthcare, education, the social sciences, business, and agriculture, demonstrating their potential to address real-world issues. In addition, the review has delved into the societal effects of LLMs, discussing how they shape the future of AI and how they can be utilized to address complex problems. However, it has not shied away from the pressing challenges and ethical considerations associated with deploying LLMs, including model biases, privacy concerns, and the need for enhanced robustness and controllability. As LLM research continues to evolve swiftly, this review serves as a valuable resource for practitioners, researchers, and experts seeking a comprehensive understanding of LLMs' past, present, and future. It emphasizes the significance of ongoing efforts to improve the efficacy and dependability of LLMs, as well as the need for ethical development and deployment practices. LLMs represent a pivotal advancement in AI and NLP, with the potential to transform a variety of domains and solve complex problems. This article provides a comprehensive foundation for future research and development in the dynamic and rapidly advancing field of Large Language Models.
MOHAIMENUL AZAM KHAN RAIAAN earned his Bachelor of Science in Computer Science and Engineering from United International University (UIU) in 2023. Currently, he holds the position of Research Assistant within the Computer Science and Engineering Department at UIU. His professional pursuits are marked by active involvement in diverse research areas such as computer vision, health informatics, explainable artificial intelligence, and graph optimization. Notably, Raiaan has made significant contributions to the field, as evidenced by his multiple research papers published in prestigious journals indexed by Scopus and categorized under the Q1 ranking.

MD. SADDAM HOSSAIN MUKTA received the Ph.D. degree from the Data Science and Engineering Research Laboratory (Data Laboratory), BUET, in 2018. He is currently an Associate Professor and an Undergraduate Program Coordinator with the Department of Computer Science and Engineering. He has a number of quality publications in both national and international conferences and journals. His research interests include deep learning, machine learning, data mining, and social computing.

KANIZ FATEMA has completed her bachelor's degree in Computer Science and Engineering from Daffodil International University, Dhaka, Bangladesh. She is actively involved in research activities, especially in health informatics, computer vision, machine learning, deep learning, and artificial intelligence-based systems. She is currently working as a Research Assistant (RA) at Charles Darwin University. She has published several research papers in journals (Scopus) and international conferences.
NUR MOHAMMAD FAHAD holds a Bachelor's degree from the Department of Computer Science and Engineering at United International University (UIU). During his undergraduate years, he contributed to the academic community as an undergraduate teaching assistant at UIU's Department of Computer Science and Engineering in Bangladesh. In addition to his teaching role, Fahad is deeply engaged in cutting-edge research across several domains, including computer vision, machine learning, deep learning, health informatics, graph theory, and mental health modeling.

SADMAN SAKIB acquired a bachelor's degree in computer science and engineering from the Department of Computer Science and Engineering at United International University in 2023. He was also a teaching assistant for undergraduate students in the Department of Computer Science and Engineering at United International University, Bangladesh. Besides this, he is actively involved in machine learning, deep learning, artificial intelligence, computer vision, and health informatics research.

MOST. MARUFATUL JANNAT MIM is currently pursuing a degree in the Computer Science and Engineering Department at United International University (UIU). She is actively involved in research activities related to computer vision, deep learning, graph theory, and human-computer interaction. Her passion lies in pioneering innovative research in computer science. Apart from her studies, she is involved in co-curricular activities at the UIU APP Forum, where she currently serves as president and demonstrates strong leadership by organizing various seminars and workshops for computer science students.

JUBAER AHMAD received the B.Sc. degree in computer science and engineering from United International University (UIU), Dhaka, Bangladesh, in 2022. He is currently working as a Research Assistant with the IAR project, United International University, Bangladesh. His research interests include Computer Vision, NLP, Big Data, and Distributed Learning.

MOHAMMED EUNUS ALI is a Professor with the Department of CSE, Bangladesh University of Engineering and Technology (BUET). He is the Group Leader of the Data Science and Engineering Research Laboratory (DataLab), CSE, BUET. His research papers have been published in top-ranking journals and conferences, such as the VLDB Journal, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, DMKD, Information Systems Journal, WWWJ, DKE, ICDE, CIKM, EDBT, PVLDB, and UbiComp. His research interests include database systems and information management, including spatial databases, practical machine learning, and social media analytics. He served as a Program Committee Member for many prestigious conferences, including SIGMOD, VLDB, AAAI, and SIGSPATIAL.

SAMI AZAM is a leading researcher and Professor at the Faculty of Science and Technology, Charles Darwin University, Australia. He is actively involved in research fields relating to Computer Vision, Signal Processing, Artificial Intelligence, and Biomedical Engineering. Dr. Azam has a number of publications in peer-reviewed journals and international conference proceedings.