0% found this document useful (0 votes)
30 views21 pages

Recent Advances in Generative AI and Large Language Models Current Status Challenges and Perspectives-3

This article discusses recent advancements in generative artificial intelligence (AI) and large language models (LLMs), highlighting their transformative capabilities in natural language processing (NLP) across various domains. It provides a comprehensive overview of the current state of these technologies, their applications, and the challenges they face, while emphasizing the importance of ethical integration and addressing research gaps. The authors aim to guide future research and collaboration within the AI community to foster responsible development and deployment of these technologies.

Uploaded by

Maritess Reyes
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views21 pages

Recent Advances in Generative AI and Large Language Models Current Status Challenges and Perspectives-3

This article discusses recent advancements in generative artificial intelligence (AI) and large language models (LLMs), highlighting their transformative capabilities in natural language processing (NLP) across various domains. It provides a comprehensive overview of the current state of these technologies, their applications, and the challenges they face, while emphasizing the importance of ethical integration and addressing research gaps. The authors aim to guide future research and collaboration within the AI community to foster responsible development and deployment of these technologies.

Uploaded by

Maritess Reyes
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO.

12, DECEMBER 2024 5873

Recent Advances in Generative AI and


Large Language Models: Current Status,
Challenges, and Perspectives
Desta Haileselassie Hagos , Member, IEEE, Rick Battle , and Danda B. Rawat , Senior Member, IEEE

Abstract—The emergence of generative artificial intelligence domains. We believe that this research serves as a roadmap for
(AI) and large language models (LLMs) has marked a new era the AI community, pushing toward an ethical, inclusive, and
of natural language processing (NLP), introducing unprecedented impactful future. It empowers diverse domains with transforma-
capabilities that are revolutionizing various domains. This article tive technologies, creating a robust landscape for the responsible
explores the current state of these cutting-edge technologies, evolution of AI.
demonstrating their remarkable advancements and wide-ranging Index Terms—Decoder, encoder, generative artificial intelli-
applications. Our article contributes to providing a holistic per- gence (AI), large language models (LLMs), long-sequence lan-
spective on the technical foundations, practical applications, and guage models, machine translation, natural language processing
emerging challenges within the evolving landscape of generative (NLP), transformers.
AI and LLMs. We believe that understanding the generative
capabilities of AI systems and the specific context of LLMs
is crucial for researchers, practitioners, and policymakers to
collaboratively shape the responsible and ethical integration Acronym Definition
of these technologies into various domains. Furthermore, we AI Artificial intelligence
identify and address main research gaps, providing valuable ASR Automatic speech recognition
insights to guide future research endeavors within the AI research
BERTs Bidirectional encoder representations from transformers
community.
CLIP Contrastive language-image pretraining
Impact Statement—Understanding the full potentials and lim-
CNNs Convolutional neural networks
itations of generative AI and LLMs shapes the future of
DCGANs Deep convolutional generative adversarial networks
NLP and its impact on various industries and societies. This
DL Deep learning
article explores the transformative potential of advanced NLP
DNNs Deep neural networks
tools such as generative AI and LLMs, shaping the future of
DPO Direct policy optimization
communication and understanding across diverse domains. Our
DSM Denoising score matching
article not only addresses the current state of generative AI and
ELBO Evidence lower bound
LLMs in language understanding, machine translation, question
answering, text summarization, and code Completion but also FFN Positionwise feed-forward network
makes a significant contribution in addressing some of the critical GANs Generative adversarial networks
research gaps of generative AI and LLMs. By addressing issues GELU Gaussian error linear unit
of bias and fairness, interpretability, fine-tuning and adaptability, GPT Generative pretrained transformer
domain adaptation, data privacy and security, computational GPUs Graphics processing units
cost, deepfake generation, human-AI collaboration, long-term HMMs Hidden Markov models
planning, limited context window, long-term memory, etc., our KL Kullback–Leibler
work aims to pave the way for responsible, ethical, and impactful LLMs Large language models
integration of these transformative technologies across diverse LSTM Long short-term memory
ML Machine learning
MLM Masked language modeling
MoEs Mixture of experts
Manuscript received 30 March 2024; revised 5 July 2024; accepted NCE Noise-contrastive estimation
10 August 2024. Date of publication 19 August 2024; date of current version NLG Natural language generation
10 December 2024. This work was supported by the U.S. DoD Center of NLP Natural language processing
Excellence in AI/ML at Howard University under Contract W911NF-20-2- NLU Natural language understanding
0277 with the U.S. Army Research Laboratory (ARL). This article was recom- ReLU Rectified linear unit
mended for publication by Associate Editor Sriparna Saha upon evaluation of
RL Reinforcement learning
the reviewers’ comments. (Corresponding author: Desta Haileselassie Hagos.)
Desta Haileselassie Hagos and Danda B. Rawat are with the DoD Center RLHF Reinforcement learning from human feedback
of Excellence in Artificial Intelligence and Machine Learning (CoE-AIML), RNNs Recurrent neural networks
Department of Electrical Engineering and Computer Science, College of TPUs Tensor processing units
Engineering and Architecture (CEA), Howard University, Washington, DC XAI Explainable artificial intelligence
20059 USA (e-mail: [email protected]; [email protected]). VAEs Variational autoencoders
Rick Battle is with VMware AI Labs by Broadcom, Palo Alto, CA 94304 ViT Vision transformer
USA (e-mail: [email protected]). WGANs Wasserstein GANs
Digital Object Identifier 10.1109/TAI.2024.3444742

2691-4581 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5874 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

I. INTRODUCTION approach to training LLMs is proposed in terms of computation


and data usage. The authors suggest that for optimal LLM scal-
I N today’s data-driven world, the ability to effectively pro-
cess and understand natural language is becoming increas-
ingly important. Generative AI and LLMs have emerged as
ing, it is essential to equally scale the model size and training
dataset size. This implies that having a sufficiently large dataset
is vital for achieving the best performance.
powerful tools that are expanding the boundaries of NLP, of-
fering unprecedented capabilities across a variety of domains.
LLMs, being a specific application of generative AI, play C. Deep Learning (DL) Advances
a foundational role in the broader landscape of generative New machine learning (ML) algorithms, such as DL, have
capabilities of AI, demonstrating remarkable abilities in un- been developed to learn complex patterns from data. DL tech-
derstanding and generating human language, opening up many niques, especially DNNs with many layers, have made remark-
opportunities across a wide range of domains. Their ability able advancements [14]. Innovations such as recurrent neural
to process and analyze vast amounts of text data has enabled networks (RNNs) [15], [16], convolutional neural networks
them to tackle complex linguistic tasks such as machine trans- (CNNs) [17], and transformers [18] have paved the way for
lation [1], [2], text summarization [3], question answering [4], more advanced and capable models. The transformer architec-
mathematical reasoning [5], and code generation [6] with un- ture, in particular, played a significant role in the development
precedented accuracy [7]. Recent AI advancements have rev- of LLMs [18].
olutionized our ability to understand, create, and engage with
human language [4], [8]. Overcoming the challenges related to
D. Transfer Learning and Pretraining
understanding and generating human language has been one of
the main goals of AI research. This progress has been made LLMs are trained on massive datasets of text, giving them a
possible through the development of new state-of-the-art LLMs broad understanding of the world and how language is used.
and generative AI models. This rapid advancement is the result For example, the generative pre-trained transformer (GPT)-3
of several factors, some of which are listed as follows. language model was trained on a dataset of 175 billion words
[4]. Transfer learning plays a critical role in the development
of highly efficient and effective LLMs and generative AI mod-
A. Advances in Computational Power
els [19]. Models such as bidirectional encoder representations
The explosion of data and the increasing computational from transformers (BERT) [20], GPT [21], and their variants
power accessible to researchers, organizations, and companies are pretrained on massive text corpora, giving them a broad
have enabled the training of complex neural networks [9]. As understanding of language. This pretrained knowledge can be
computational power has increased, larger and more complex leveraged for various downstream tasks without the need for
neural networks have become possible, leading to the devel- retraining the model from scratch, which can be both computa-
opment of LLMs and generative AI models that can perform tionally expensive and time-consuming [19]. Transfer learning
tasks that were previously impossible, such as generating re- enables the use of pretrained models that have already been
alistic text and images. These powerful computing resources trained on a large dataset. This reduces the amount of training
are essential for processing and modeling the vast amount of data that we need for our specific task. For example, if we want
data required to train LLMs and generative AI, enabling them to train a model to translate text from English to Chinese, we
to learn the patterns and relationships necessary for their tasks. can fine-tune a pretrained language model that was trained on
The development of powerful new computing hardware, such a dataset of English and Chinese text. This approach is particu-
as graphics processing units (GPUs), has facilitated the training larly useful in scenarios where obtaining large labeled datasets
of AI models on massive datasets of text and code [10]. The is challenging and expensive since it reduces the amount of
increasing availability of computational power has also reduced training data that we need to collect and label. Transfer learning
the time and cost of training LLMs and generative AI models, significantly reduces the computational and data requirements
making it more feasible for researchers and companies to de- for developing effective language models. Instead of training
velop and deploy them [11]. a separate model for each specific task, a pretrained language
model can be fine-tuned on a smaller task-specific dataset. This
B. Datasets Availability and Scale fine-tuning process is faster and requires less data, making it a
practical approach for a wide range of applications [22].
The increasing availability of data has enabled the training
of LLMs and generative AI models on larger and more diverse
E. Modern Neural Network Architectures
datasets, significantly improving their performance [12]. The
vast amounts of text, audio, images, and video content produced The emergence of neural network architectures, such as the
in the digital age provide valuable resources for training AI GPT [21] and variational autoencoders (VAEs) [23], has led to
models, which rely on these massive datasets to learn the com- the development of modern LLMs and generative AI. LLMs
plexities of human language and content creation. The work in need to be able to learn long-range dependencies in text to
[12] indicates that dataset size is a key factor in determining the generate coherent and meaningful text in a variety of for-
performance of LLMs and that larger datasets lead to significant mats [4]. Traditional RNNs [16], e.g., long short-term mem-
improvements in model performance. In [13], a more efficient ory (LSTM), are not well suited for this task because they

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5875

have difficulty learning long-range dependencies beyond a few II. GENERATIVE AI


words. However, the transformer architecture can learn long-
Generative AI refers to a class of algorithms and models
range dependencies more effectively [21]. The work in [18]
within AI and NLP that are designed to generate new, pre-
demonstrates that the transformer architecture outperformed
viously unseen data that are similar to existing examples by
RNNs on a variety of NLP tasks, including machine translation
employing a variety of techniques [21]. These models learn the
and text summarization [2], [24].
underlying patterns and structures present in the training data
and use that knowledge to create novel instances that resemble
F. Community Collaboration and Open-Source Initiatives the original data. It has the potential to revolutionize many
industries and creative fields. Generative AI models are trained
The AI research community, through collaborative efforts on large datasets of existing content. Generative models aim
and open-source initiatives, such as OpenAI [4], Hugging Face to capture the underlying distribution of data, enabling them
[25], and Google AI [26], has significantly contributed to to generate new samples that are statistically similar to the
the advancement of state-of-the-art LLMs and generative AI. training data. To achieve this, generative models employ a latent
This progress is the result of joint collaboration among AI space, denoted as Z, which represents a hidden or underlying
researchers and developers from various organizations and re- representation of the data. This latent space is then mapped
search institutions. These collaborations have facilitated the to the actual data space, denoted as X, through a generator
sharing of knowledge, expertise, and resources, enabling rapid function, represented by Gθ (Z). The parameter θ represents
progress. The open-source movement has played a critical role the adjustable parameters of the generative model, which are
in accelerating the development of LLMs and generative AI. optimized during the training process. The goal of training
By making source codes, data, and models publicly available, a generative model is to make the generated samples Gθ (Z)
open-source initiatives have allowed researchers and developers virtually indistinguishable from real data samples by focusing
to build upon each other’s work, leading to faster innovation and on maximizing the probability of generating the observed data
more robust models. Open-source platforms such as Hugging samples. The objective function for training a generative model,
Face and GitHub serve as hubs for sharing pretrained mod- without specifying a particular architecture, is expressed in (1),
els, datasets, and fine-tuning scripts. Additionally, open-source where N is the number of training x(i) represents the
 (i)samples,

projects and community efforts have made substantial corpora ith training sample, and pmodel x ; θ denotes the probability
of text data available for training robust language models, such assigned by the generative model to the ith training sample
as Wikipedia, Common Crawl, and Project Gutenberg.

N  
max log pmodel x(i) ; θ . (1)
θ
G. Contributions i=1

In this article, we make the following main contributions.


1) We provide a holistic perspective on the current landscape A. Generative Adversarial Networks (GANs)
of the generative capabilities of AI systems and the spe- GANs are a type of generative AI model that consists of
cific context of LLMs. two neural networks: a generator and a discriminator [27]. The
2) We demonstrate the significant progress and unprece- generator is responsible for creating new realistic and high-
dented capabilities introduced by the emergence of gen- quality data, including images, text, and music, by learning the
erative AI and LLMs. underlying distribution of the data [28]. The discriminator, on
3) We provide valuable insights that can guide future re- the other hand, is responsible for distinguishing whether the
search endeavors within the AI research community. new data are real or fake [28]. The fundamental principle behind
4) Finally, we identify and address several key research gaps GANs involves a generator network creating realistic data, such
in the field of generative AI and LLMs. as images, and a discriminator network evaluating the generated
data by distinguishing between real and fake data [28]. Over
time, the generator improves its ability to create realistic data
H. Organization by attempting to deceive the discriminator, which enhances its
The rest of this article is organized as follows. Section II ability to distinguish between real and generated data [28]. The
introduces the overview of generative models and explores the training process of a GAN network, as shown in (2), involves
applications of generative AI. In Section III, we discuss the tra- optimizing the parameters of both the generator (represented by
ditional and modern approaches to language modeling and the G) and discriminator (represented by D) networks [28]. Here,
applications of LLMs. Section IV provides a detailed discussion pdata (x) denotes the distribution of real data, pz (z) represents
of the challenges associated with generative AI and LLMs and the distribution of random noise in the latent space, x denotes
potential solutions. The impact of identified research gaps and a real data point, G(z) is a data point generated from random
future directions on the ethical and responsible integration of noise z, D(x) is the discriminator’s output indicating the prob-
generative AI and LLMs is presented in Section V. Finally, ability that x is real, and log refers to the natural logarithm. The
Section VI concludes our article and suggests directions for objective is to minimize the log probability of the discriminator
future research work. correctly identifying whether a sample is real or generated,

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5876 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

while simultaneously maximizing the log probability of the μφ (x) and σφ (x) represent the mean and standard deviation of
generator producing data that the discriminator perceives as real the distribution, respectively. In (5), the parameters μθ (z) and
σθ (z) represent the mean and standard deviation of the latent
min max V (D, G) = Ex∼pdata (x) [log D(x)]
G D space distribution, which are learned by the decoder neural
+ Ez∼pz (z) [log(1 − D(G(z)))]. (2) network during training
 
Within the adversarial setting, various classes of GANs z = qφ (z | x) = N μφ (x), σφ (x)2 (4)
have emerged over the years, each tailored to specific tasks  
pθ (x | z) = N μθ (z), σθ (z)2 . (5)
in the generative modeling space. For example, the work of
Radford et al. [29] presents a deep convolutional GANs (DC- The reparameterization trick, introduced in VAEs to facilitate
GANs) by extending the GANs architecture, an extension of the backpropagation through the sampling process [23], addresses
original GAN architecture proposed by Goodfellow et al. [28]. the challenge of applying backpropagation to inherently random
DCGANs employ CNNs in both the generator and discrimi- sampling operations. While backpropagation is a fundamental
nator, enabling the generation of high-quality images. CNNs algorithm for training neural networks, its direct application
are known to perform well at capturing spatial relationships in to sampling is problematic due to the randomness involved.
data [29], making them well suited for image generation tasks. The reparameterization trick provides an alternative approach
Addressing the training instability issues of [28], Arjovsky et al. to sample from a distribution while maintaining the necessary
introduced the wasserstein GANs (WGANs) algorithm [30]. connections for backpropagation [23]. In VAEs, this technique
WGANs replace the binary cross-entropy loss with the Wasser- is employed to sample the latent variable z from a simple
stein distance, leading to improved stability and convergence distribution, typically a standard normal distribution. These
during training [30]. In the context of GANs, the Wasserstein samples are then transformed to match the distribution produced
distance defines the objective function between two distribu- by the encoder, as described in (6). This transformation en-
tions, denoted as A and B, as shown in (3). Here, W (A, B) rep- sures that the sampled latent variables remain consistent with
resents the Wasserstein distance between distributions A and B, the encoder’s understanding of the data while preserving the
inf denotes the infimum, which represents the minimum value, randomness required for generating new samples. In (6), the
γ refers to a joint distribution defined on the product space of  represents a random noise vector sampled from a standard
A and B, and Π(A, B) is the set of all joint distributions with normal distribution,  represents the elementwise product oper-
marginals A and B. The terms (x, y represent samples from the ation, σθ (x) represents the standard deviation of the distribution
joint distribution γ, and d(x, y) denotes the distance between x produced by the encoder, and μθ (x) represents the mean of the
and y in the metric space distribution produced by the encoder

W (A, B) = inf E(x,y)∼γ [d(x, y)]. (3) z = μθ (x) + σθ (x)  , where  ∼ N (0, 1). (6)
γ∈Π(A,B)
The main objective for training a VAE is to maximize
To tackle the challenges associated with training high-
the ELBO [23], [33]. Maximizing the evidence lower bound
resolution GANs, Karras et al. [31] proposed a progressive
(ELBO) during training encourages the VAE to learn a mean-
growth of GANs. This algorithm employs a progressive training
ingful and smooth latent space representation for the input data
strategy that gradually increases the resolution of the gener-
[23], [33]. By maximizing the ELBO, the VAE is trained to
ated images throughout the training process. This approach
learn a latent space that captures the underlying structure of
allows the algorithm to capture finer details and produce high-
the data while also allowing for the efficient generation of new
resolution images with enhanced stability and scalability [31].
samples [23], [33]. The ELBO, as shown in (7), comprises
two terms: the reconstruction loss of the data given the la-
B. Variational Autoencoder (VAE) Models tent variable [log pθ (x | z)], which measures the expected log-
VAEs are generative models that learn a probabilistic map- likelihood of the data given the latent variable, and the kullback-
ping from the data space to a latent space, a lower dimensional leibler (KL) divergence between the approximate posterior (en-
representation of the data that capture its essential features, coder) and the prior distribution [DKL (qφ (z | x)p(z))]. The
enabling the generation of new samples through sampling from KL divergence encourages the latent distribution learned by the
the learned latent space [23]. This process involves two key encoder to be similar to the prior distribution, which is typically
components: encoders and decoders. In the VAEs framework, a standard normal distribution. This constraint helps prevent
encoders and decoders play important roles in the process of the encoder from learning overly complex or entangled latent
learning and generating data. The encoder is implemented using representations. In (7), L denotes the overall loss function
a neural network and it is responsible for mapping the input data L(θ, φ; x) = Eqφ (z|x) [log pθ (x | z)] − DKL (qφ (z | x)p(z)) .
x to a probability distribution in the latent space z, as shown (7)
in (4). Similar to the encoder, the decoder is also implemented
using a neural network, and it reconstructs the original data
from this latent representation z as illustrated in (5). The en- C. Autoregressive Models
coder and decoder are trained jointly using a technique called In the context of generative AI, autoregressive models are a
variational inference [23], [32]. Variational inference minimizes class of likelihood models that generate new sequential data by
two losses: a reconstruction loss and a regularization loss. In (4), predicting the next value in a sequence based on the previous
Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5877

values. These models involve modeling the probability distri- simpler neural network architectures. Developing more efficient
bution of each element in a sequence given the entire history of training algorithms specifically tailored for MoE models can
previous elements, P (xt |xt1 , xt2 , . . . , x1 ). This ability makes help to reduce computational intensity. The overall MoE model
autoregressive models well suited for a variety of NLP tasks architecture can be broken down into several key components
where the ability to understand and generate coherent sequences as follows.
is essential [34]. They are also widely used in capturing the 1) Expert Networks: One of the main features of the MoE
dynamics of time series data [34]. An autoregressive model of model is the presence of multiple expert networks. These expert
order p can be generally represented as shown in (8) where Xt networks play a critical role in learning specific patterns or
denotes the value of the time series at time t, c is a constant term, features within the input data and serve as the core models of
φi are the autoregressive coefficients, representing the influence the MoE system. Each expert network is tailored to specialize
of the previous ith observations on the current observation, and in a particular aspect or subset of the input problem space.
t is an error term, which represents the random noise in the 2) Gating Network: The gating network mechanism is a
data. The parameters of the model (c, φi ) are typically estimated crucial component that analyzes the input data and decides
from the observed data using methods such as maximum like- which expert network is most suitable for a given instance [40].
lihood estimation It assigns weights to each expert, indicating their relevance or
 p contribution to the current input. The gating network typically
Xt = c + φi Xt−i + t . (8) outputs a probability distribution over available experts, reflect-
i=1 ing the relevance of each expert to the current input [40]. There
are two main types of MoE routing strategies in MoE systems:
D. Mixture of Expert (MoE) Models
dense routing and sparse routing. In dense routing, every input
A MoEs model represents a neural network architecture that is directed to all experts, and the final output is a weighted com-
combines the strengths of specialized expert networks with bination of all expert predictions based on the gating network’s
a gating mechanism to perform complex tasks [35], [36]. In output. On the other hand, sparse routing is a more efficient
the context of NLP architectures, MoE models are applied approach where the gating network selects only a subset of ex-
to enhance the capabilities and efficiency of the underlying perts for each input, reducing computational cost [35], [42]. The
language generation architecture [35], [37]. Within the realm MoE model dynamically combines the predictions of multiple
of MoE models, these architectures optimize resource utiliza- experts based on learned gating coefficients, allowing it to adap-
tion by selectively activating relevant experts for specific tasks, tively switch between different experts depending on the input
demonstrating adaptability to different domains through the data. This mechanism enables the model to capture complex
integration of domain-specific expert models [38]. Moreover, patterns and improve performance compared to a single expert
MoE architectures offer scalability, allowing the addition of model. The gating network is generally represented as shown
more expert networks to handle a broader range of tasks [39]. in (9) where gk (x) denotes the gating function for gate k, σ
The advantages of MoE models extend beyond their archi- is an activation function (usually sigmoid or softmax), and Wg
tectural complexities. Recent studies, such as the work pre- represents the parameters of the gating network
sented in [39], emphasize their scalability, enabling the addition  T 
of more expert networks to handle a broader range of tasks. gk (x) = σ Wgk x. (9)
Furthermore, these models have demonstrated the ability to
achieve superior model quality compared to their dense coun- 3) Output Computation: When the experts are activated,
terparts, even with significantly reduced training costs. How- they process input data and generate individual predictions.
ever, despite these advantages, MoE models pose some critical These predictions are then combined to form the final output of
challenges. MoE models are sensitive to small changes in the the MoE model. The specific method of combining predictions
gating network weights. Since the gating network determines depends on the task and MoE architecture. In the weighted
the contribution of each expert to the final prediction, even averaging approach, predictions from each expert are weighted
slight changes in these weights can lead to significant shifts in based on the output of the gating network, and the weighted
the model’s training stability and cause unpredictable behav- average is taken as the final output. In classification tasks,
ior [35]. This sensitivity can make training and optimization experts can vote for the most likely class, and the majority
of the model more challenging. To mitigate this, techniques vote becomes the final prediction [43]. The output of a MoE
such as sparse routing have been proposed [40], [41]. Regu- model, denoted as y(x), is computed using (10), representing
larization techniques such as weight decay and dropout also a weighted sum of the expert outputs. The final output y(x)
help mitigate sensitivity to small changes in gating network is computed by aggregating the contributions of all experts. It
weights by preventing overfitting and promoting smoother de- sums up the weighted outputs of all experts based on the gating
cision boundaries [36]. Additionally, training MoE models can values, resulting in the MoE’s prediction. This output is often
be computationally intensive, especially when dealing with a passed through additional layers, such as fully connected layers
large number of experts or complex gating functions. Each for- and activation functions, depending on the specific task. Here,
ward pass through the network involves evaluating the outputs Ei (x) denotes the output of expert i, x represents an input to
of multiple experts and updating the parameters of both the the model, and N is the number of experts [35]. Gating weights
expert and gating networks. This computational overhead can gi (x), detailed in (11), are computed using a softmax function,
make training slower and require more resources compared to with ai (x) representing the activation for an expert i given the
Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5878 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

input x. The gating network uses the input data to determine DALL-E 2 [57], [58], Imagen [59], and stable diffusion [60], are
which expert is best suited for the task a class of probabilistic models that describe the evolution of an
image from a simple initial distribution to the desired complex

N
y(x) = gi (x) · Ei (x) (10) distribution [61]
i=1
xt = f (xt−1 , θt , t ). (12)
exp (ai (x))
gi (x) = N , i = 1, 2, . . . , N. (11)
j=1 exp (aj (x))
1) Stable Diffusion: Text-to-image generation involves cre-
ating visual content based on textual descriptions [62]. Stable
diffusion is an open-source text-to-image diffusion model that
E. Model Merging
generates diverse and high-quality images based on textual
Model merging is a technique used to combine the parame- prompts.1 This model operates by taking a noisy image as
ters of multiple task-specific pretrained LLMs to create a new input and gradually denoising it to generate the desired output.
and improved language model [44]. Initially, this involves the The denoising process is guided by a text prompt, providing
process of selecting base models and aligning the architectures information about the desired content and style of the image.
of chosen models to ensure compatibility. Techniques such 2) Midjourney: Midjourney is a text-to-image diffusion
as parameter averaging [45] and knowledge distillation [46], model that, such as stable diffusion [60], leverages prompts to
[47] are then employed to integrate the knowledge from these generate unique and artistic images [63]. However, it is a closed-
models. Additionally, various algorithms, including task vector source generative AI project requiring a paid subscription. This
arithmetic [48], TIES [44], and DARE [49], can be used for setup consequently may discourage community collaboration
parameter merging, each with its own advantages and consid- and development, leaving some users with less control over the
erations, such as computational complexity and the ability to underlying model compared to open-source alternatives such as
handle models trained on different tasks. Following integration, stable diffusion [60].
the merged model undergoes fine-tuning task-specific data to
refine its representations and potentially optimize overall per-
formance. The resulting merged model retains the knowledge G. Multimodal Generative Models
and capabilities of its constituent models, leading to enhanced Multimodal generative models represent a significant ad-
performance and capabilities across tasks compared to tradi- vancement in AI. These models possess the capability to un-
tional methods of training a single model from scratch, as well derstand and create content by leveraging various data types,
as improved robustness and resource efficiency [50]. However, such as text, images, and audio [64], [65]. This integration of
challenges such as ensuring compatibility between models, different data modalities enables these models to capture a more
managing computational complexity, and avoiding performance comprehensive understanding of concepts [66]. By utilizing
degradation must be addressed [50], [51]. information from these diverse sources, multimodal generative
models aim to overcome the limitations inherent in traditional
F. Diffusion Models models that focus solely on a single data type [65]. Unimodal
methods, traditional approaches that primarily focus on a single
Diffusion models are specifically designed for generating modality, such as text and images, have limitations in captur-
images and data samples [52]. These models are trained to ing the full complexity of real-world data [65]. For example,
generate realistic samples by modeling the diffusion process text-based models may lack the ability to incorporate visual
of a data distribution. Different approaches such as noise- or emotional context into their understanding, while image-
contrastive estimation (NCE) [53] and score-based generative based models might lack textual or semantic understanding
modeling [54] exist within the domain of diffusion models in [65]. Multimodal generative models address these limitations
generative AI. They operate by iteratively adding noise to a by integrating information from different modalities, such as
given initial image and subsequently learning to reverse this text, images, and audio. This allows them to achieve a better
process to generate new, realistic, and high-quality images of understanding of the data and subsequently generate content
varying styles and complexities [55], [56]. As shown in (12), that reflects the richness of human expression and experience.
the general idea is to model the data distribution as a diffusion However, training multimodal models comes with its own set of
process, where the data is transformed from a simple distribu- challenges. These models can be computationally expensive to
tion to the target distribution through a series of steps. Here, train and require large amounts of labeled data for each modal-
xt represents the data at time step t, f denotes a diffusion ity [65]. Additionally, finding effective techniques to seamlessly
process that transforms the data from xt−1 to xt , θt represents integrate information from different modalities remains an ac-
the parameters of the diffusion process at time step t, and t rep- tive area of research [67]. There are two main architectures used
resents a sample from a noise distribution t. This approach has for multimodal learning: early fusion and late fusion [68]. Early
led to the development of generative models such as denoising fusion combines data from all modalities at the beginning of the
score matching (DSM) and diffusion probabilistic models. The model, while late fusion processes each modality independently
underlying idea is to transform a simple distribution through before combining the results. The ability of multimodal gener-
a series of steps to match the target distribution of real data. ative models to understand and create content across different
The generative process involves reversing these steps to gen-
erate new samples. Diffusion-based generative models, such as 1 https://stablediffusionweb.com/

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5879

data types makes them invaluable for a wide range of tasks architecture, featuring separate encoders for processing images
requiring a deep understanding of multimodal data [69]. Some and text. This architectural design allows CLIP to indepen-
real-world applications include generating realistic product de- dently encode visual and textual inputs into distinct feature
scriptions with images for e-commerce platforms or creating spaces, facilitating effective cross-modal understanding. For
personalized music recommendations based on a user’s au- image processing, CLIP often employs CNNs or vision trans-
dio preferences and listening history. In addition to this, these former (ViT) to extract visual features [74]. The image encoder
models have demonstrated remarkable capabilities in various within CLIP processes visual inputs, such as images, using
tasks, including medical imaging analysis, image captioning, CNNs. Through pretraining on large-scale image datasets, the
text-to-image synthesis, video understanding, and audio-visual image encoder learns to extract hierarchical visual features that
storytelling [69]. By overcoming the limitations of unimodal capture important characteristics of the input images. These fea-
models and offering new possibilities for creative content gen- tures are then encoded into a high-dimensional representation
eration, multimodal generative models will play a significant space. On the other hand, the text encoder in CLIP processes
role in the future of AI. textual inputs, such as captions and descriptions, using trans-
former architectures [18], [20]. Transformers are capable of
modeling sequential data such as text, allowing the text encoder
H. Applications of Generative AI
to capture semantic information and contextual relationships
Generative AI models are powerful tools for understanding within textual inputs. Through pretraining on large-scale text
and generating data with applications in various domains, in- corpora, the text encoder learns to encode textual inputs into a
cluding the following. corresponding feature space. Despite having separate encoders
1) Image Generation and Analysis: Advanced generative for images and text, CLIP achieves cross-modal understanding
AI models have demonstrated remarkable capabilities in gen- by mapping both image and text embeddings into a shared em-
erating high-quality images, such as photorealistic faces and bedding space. This shared space facilitates direct comparisons
scenes [21]. Generative AI models have been employed in between visual and textual representations, enabling CLIP to
developing complex systems capable of generating and un- determine the semantic similarity between them [69]. During
derstanding multimodal data such as text and images. For pretraining, CLIP leverages contrastive learning objectives to
example, the work in [70] proposes a large-scale autore- align similar pairs of image-text embeddings while maximiz-
gressive model that generates high-quality and content-rich ing the distance between dissimilar pairs, thereby enhancing
images from text descriptions. Additionally, DALL-E is a gen- its ability to understand and relate visual and textual inputs
erative model introduced by Ramesh et al. [57], [58], which effectively [69].
produces images from textual descriptions. Unlike traditional 2) Video Generation: Advanced generative AI models have
image generation models that rely on pixel-level manipulations not only demonstrated remarkable capabilities in generating
or predefined templates, DALL-E operates at a semantic level, high-quality images but have also begun to tackle the challenge
understanding textual prompts and synthesizing correspond- of video generation. Recent advancements in AI, such as Sora
ing images. The work in [71] introduces a novel architecture developed by OpenAI [75], [76], have enabled the generation of
specifically designed for generating high-quality facial images. realistic and dynamic video content from textual descriptions.
This architecture utilizes a style-based generator, demonstrating Similar to its image counterpart DALL-E [57], Sora operates at
advancements in synthesizing diverse and realistic images. Fur- a semantic level, understanding textual prompts and synthesiz-
thermore, generative AI models can also be employed in image- ing corresponding video sequences [75], [76]. Video generation
to-image translation [72], which involves converting images involves creating coherent and visually appealing sequences of
from one domain to another, such as enabling the conversion frames that align with the provided textual instructions [76].
of satellite images into maps or black-and-white photos into These models typically employ architectures designed to cap-
color. The work by Zhu et al. [73] presents a model designed ture temporal dependencies (i.e., relationships between frames
for unpaired image-to-image translation. This model utilizes over time) and spatial relationships (i.e., relationships between
cycle-consistent adversarial networks to learn mappings be- objects within a single frame). By understanding the semantic
tween two image domains without requiring paired training context of the text, these models generate videos that accurately
examples, making it versatile for various applications [73]. reflect described scenes while exhibiting smooth transitions and
Unlike DALL-E [58], which primarily focuses on generating realistic motion. In addition to video generation, as explained
images, contrastive language-image pre-training (CLIP) learns above, AI models are capable of multimodal generation, where
to understand the relationships between text and images in a textual prompts can result in the synthesis of both images
paired manner [69]. Through contrastive learning, CLIP pre- and videos. This capability enhances the quality of generated
trains vast amounts of image-text pairs, enabling it to encode content, enabling diverse applications in storytelling, content
both modalities into a shared embedding space [69]. CLIP’s creation, and multimedia production. Video generation has the
cross-modal understanding enables a wide range of applica- potential to revolutionize various domains, including the en-
tions beyond traditional image analysis tasks. By associat- tertainment industry, education and training, augmented reality
ing images with their textual descriptions, CLIP can perform and virtual reality applications, automation of video editing
tasks such as image classification, object detection, and even tasks, etc.
zero-shot learning, where it recognizes objects or concepts not 3) Text Generation: Advances in generative AI models
seen during training [69]. CLIP is built upon a dual-encoder can generate human-quality text, including translations, and
Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5880 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

responses to natural language questions [4], [77]. Text genera- probability of the words that came before it [89]. These models
tion models learn patterns and relationships in language from are trained on large corpora of text, and they use statistical meth-
vast amounts of text data [4], [77]. ods to learn the probabilities of different sequences of words.
4) Code Generation: Widely adopted AI tools utilize gen- Such models, including n-gram models and models based on
erative AI techniques to analyze the context of the code being maximum entropy, often use conditional probability to estimate
written and suggest relevant code completions that can signif- the likelihood of a word given its context [90], [91]. Equation
icantly improve programmers’ and engineers’ productivity by (14) is derived from the maximum likelihood estimation, where
reducing the time spent manually typing codes [6]. the probability of a word given its context is estimated by the
5) Drug Discovery: Generative AI models are increasingly ratio of the count of the specific context-word pair to the count
being utilized in various aspects of drug discovery, providing of the context alone. In (14), P (w1 , w2 , . . . , wn ) denotes the
innovative approaches to developing new drugs and accelerat- conditional probability of the word, given the preceding word
ing the identification and design of novel therapeutic agents wn−1 , C (wn−1 , wn ) represents the count of occurrences of
[78], [79]. Furthermore, generative AI models have demon- the bigram (word wn−1 , word wn ) in the training data, and
strated the capability to identify new applications for drug the C (wn−1 ) represents the count of occurrences of the word
repurposing [80]. wn−1 in the training data. For higher order n-gram models, the
6) Material Discovery: Advanced ML and DL techniques, equation is extended to consider a longer history of words as
particularly generative models, are being employed to explore shown in (15)
and predict novel materials with desirable properties [81]. The C (wn−1 , wn )
application of generative AI models in material science [82] P (wn | wn−1 ) = (14)
C (wn−1 )
can significantly accelerate the material discovery process by
guiding experimental efforts, predicting new materials, and op- P (wn | wn−1 , wn−2 , . . . , w1 )
timizing existing materials [83]. C (wn−1 , wn−2 , . . . , w1 , wn )
7) Fraud Detection: Generative AI models have proven = . (15)
C (wn−1 , wn−2 , . . . , w1 )
effective in detecting fraud by identifying patterns indicative
of fraudulent activity [84]. Furthermore, these models can also B. Neural Network Language Models
be employed in identifying anomalies in data [85], [86].
8) Personalization: Generative AI models can be used in Neural network language models, particularly those based
personalization to tailor content, recommendations, or user on RNNs or transformer architectures, model the probability
experiences based on individual preferences [87], [88]. This of a word given its context using a neural network. Actual
customization can involve generating personalized recommen- neural network language models can have variations based on
dations or creating personalized user experiences. For example, the specific architecture used (e.g., recurrent and transformer
Netflix uses generative AI to recommend movies and TV shows based). However, the simplified representation of such mod-
to its users, while Spotify leverages generative AI to create els can be broken down into the hidden state calculation and
custom playlists. softmax calculation as shown in (16) and (17), respectively.
Equation (16) shows the hidden state calculation where hn−1
III. LANGUAGE MODELING denotes the hidden state of the neural network at time step
n − 1, Wh denotes the weight matrix for the hidden state
The use of language models is pervasive in various modern transition, Uh shows the weight matrix for the word embedding
NLP applications. In these models, the probability of different transition, En−2 denotes the embedding vector of the word
sequences of words is often modeled as the product of local wn−2 , and tanh is the hyperbolic tangent activation function.
probabilities, as expressed in (13), where wi represents the ith Equation (17) shows the softmax output calculation which com-
word in the sequence, and hi represents the word history pre- putes the conditional probability distribution over the vocabu-
ceding wi . The formulation in (13) summarizes the conditional lary for the next word wn where P (wn | wn−1 , wn−2 , . . . , w1 )
dependencies between words in a sequence, allowing language denotes the Conditional probability of the word wn given the
models to capture complex linguistic patterns. Leveraging such history wn−1 , wn−2 , . . . , w1 , Wo shows the weight matrix for
models has proven instrumental in tasks ranging from machine the output layer, hn1 is the hidden state of the neural network
translation and speech recognition to text generation and senti- at time step n − 1, where the softmax is the softmax function,
ment analysis [1], [2] converting the network’s output into probabilities

n
P (w1 , w2 , . . . , wn ) = P (wi | hi ) . (13) hn−1 = tanh (Wh · hn−2 + Uh · En−2 ) (16)
i=1
P (wn | wn−1 , wn−2 , . . . , w1 )
The following are some of the main approaches to traditional
and modern approaches to language modeling. = softmax (Wo · tanh (hn−1 )). (17)

A. Statistical Language Models C. Transformer Language Models


Statistical language models are based on the idea that the Transformer language models are based on the idea of at-
probability of a word appearing in a sentence is related to the tention, which allows the model to focus on the most relevant

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5881

parts of the input sequence when making predictions [4], [18], transformer model uses multiple self-attention heads in par-
[20]. Such models leverage pretraining to achieve strong perfor- allel across multiple heads to capture different aspects of the
mance across various NLP tasks. According to [18], the trans- relationships within the input sequence instead of perform-
former architecture offers several advantages over traditional ing a single attention function with dmodel -dimensional keys,
recurrent or CNNs. It enables significantly more parallelization value vectors, and queries. This allows the model to learn
for faster training, achieves state-of-the-art results in machine more complex representations of the input, which can im-
translation with shorter training times, reduces the complexity prove performance on a variety of NLP tasks. As shown
of relating distant input positions, and effectively models long- in (21), the outputs of these heads are concatenated and
range dependencies while handling variable-length sequences linearly transformed [18] where the transformations are pa-
[18]. The transformer model achieves state-of-the-art results rameter matrices WiQ ∈ Rdmodel ×dk , WiK ∈ Rdmodel ×dk , WiV ∈
in machine translation by employing attention mechanisms, Rdmodel ×dv , and W O ∈ Rhdv ×dmodel . Here, WiQ , WiK , WiV , and
enabling it to capture long-range dependencies and process W O are learned weight matrices. This allows the model to
variable-length sequences without padding or truncation [18]. learn a wider range of relationships between words in the input
Moreover, it simplifies the computation of relationships be- sequence
tween distant positions, leading to enhanced parallelization,
MultiHead(Q, K, V ) = Concat (head1 , . . . , headh ) .W O
faster training, and superior performance compared to tradi-
tional neural networks. Where
 
1) Self-Attention Mechanism: The transformer architec- headi = SelfAttention QWiQ , KWiK , V WiV . (21)
ture revolutionized sequence modeling by introducing a self-
attention mechanism, eliminating the need for recurrent or 3) Position-Wise Feed Forward Network (FFN): The FFN
convolutional structures. The self-attention mechanism essen- is an important component of the transformer architecture. It is
tially computes a weighted sum of input representations, where responsible for processing information from the self-attention
each position in the input sequence is allowed to attend to all mechanism across all positions in the input sequence [18]. The
other positions with different weights. This mechanism allows FFN consists of two fully connected linear transformations with
the model to capture long-range dependencies between distant a rectified linear unit (ReLU) activation function in between
words in a sentence, which is important for tasks such as them. This structure allows the FFN to learn complex nonlin-
machine translation and text summarization. Given an input ear relationships between the input features [18]. The FFN is
sequence X = {x1 , x2 , . . . , xn }, the self-attention mechanism applied independently to each position in the input sequence,
computes the output vector Y = {y1 , y2 , . . . , yn }. As shown ensuring that each position can interact with all other positions
in (18), the attention mechanism computes a set of attention [18]. This parallelized approach makes the FFN computation-
scores, which are then used to calculate a weighted sum of ally efficient and scalable to long input sequences. The output
the input vectors. Here, Qi , Kj , and vj are the query, key, of the self-attention mechanism is then passed through a FNN
and value vectors for the ith output element and jth input as shown in (22), where the learned parameters W1 and W2
element, respectively, and dk is the dimension of the key vectors are learned weight matrices, while b1 and b2 are the learned
[18]. The attention score aij for the ith element in the output bias vectors. As shown in (23), other works have proposed
sequence and the jth element in the input sequence is computed replacing the ReLU activation function with other nonlinear
as shown in (19). Here, eij , commonly represented as QTi · Kj , activation functions such as gaussian error linear unit (GELU)
is the attention energy or compatibility function between the (x) = xΦ(x) [92] where Φ(x) is the standard Gaussian cumu-
ith element in the output sequence and the jth element in the lative distribution function, and Swishβ (x) = xσ(βx) [93]
input sequence. Once the attention scores are computed, the FFN(x) = max(0, x.W1 + b1 ).W2 + b2 (22)
weighted sum of the input vectors is calculated to obtain the
context vector for each output element as shown in (20) where FFNGELU (x, W1 , W2 ) = GELU (xW1 ) W2
Vj is the value vector for the jth input element
FFNSwish (x, W1 , W2 ) = Swish1 (xW1 ) W2 . (23)

n
exp (eij ) (Qi · Kj ) In the context of language models, the transformer architec-
yi = n · vj , where eij = √ (18)
k=1 exp (eik ) dk ture facilitates the training of LLMs, such as GPT [28]. LLMs
j=1
are a type of generative AI model that is specifically trained on
exp (eij ) large corpora of text data. In recent years, LLMs have emerged
aij = n (19)
k=1 exp (eik )
as transformative breakthroughs in the field of AI, natural lan-
guage generation (NLG), and natural language understanding

n
ci = aij · Vj . (20) (NLU) [106] due to their remarkable capabilities in understand-
j=1
ing and generating humanlike text and other forms of content
[21]. LLMs are trained on massive datasets comprising text and
2) Multihead Self-Attention: The multihead self-attention code, and they exhibit the ability to learn and perform a wide
mechanism is a variant of the self-attention mechanism that range of language tasks, including text generation, language
introduces multiple attention heads to capture different as- translation [107], text summarization [99], sentiment analysis
pects of the relationships in the input sequence [18]. The [108], and question answering [109]. These models are more

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5882 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

TABLE I
LIST OF SOME OF THE STATE-OF-THE-ART LLMS WELL SUITED FOR A WIDE RANGE OF NLP TASKS

Number of Learning Rate


Year of Release LLMs Number of Training Tokens Developer
Parameters (Default)
2017 Transformer [18] 530 Million Not explicitly stated 1x10−3 Google AI
2018 BERT [20] 340 Million 250 Billion 5x10−5 Google AI
2019 GPT-2 [94] 1.5 Billion 40 Billion 1x10−5 OpenAI
2020 T5 [19] 11 Billion 1 Trillion 5x10−5 Google AI
2020 GPT-3 [4] 175 Billion 300 Billion 6x10−5 OpenAI
2020 Gopher [95] 280 Billion 300 Billion 4x10−5 Google AI
2021 Jurassic-1 Jumbo [96] 178 Billion 300 Billion 6x10−5 AI21 Labs
2021 Megatron-Turing NLG [97] 530 Billion 270 Billion 5x10−5 NVIDIA
2022 Chinchilla [98] 70 Billion 1.4 Trillion 1.25x10−4 Deep Mind
2022 LaMDA [26] 137 Billion 768 Billion Not explicitly stated Google AI
2022 GPT-3.5 (InstructGPT) [99] 175 Billion Not explicitly stateda Not explicitly stated OpenAI
2022 GPT-3.5 (ChatGPT) 175 Billionb Not explicitly statedc 5x10−5 OpenAI
2022 PaLM [100], [101] 540 Billion 780 Billion Not explicitly stated Google AI
2023 LLaMA [102] 65 Billion 1.4 Trillion 1.5x10−4 Meta AI
2023 Llama 2 [102] 70 Billion 2 Trillion 1.5x10−4 Meta AI
2023 PaLM 2 [103] 340 Billiond 3.6 Trillion Not explicitly stated Google AI
2023 GPT-4 [104] 1–1.76 Trillione Not explicitly statedf Not explicitly stated OpenAI
2023 Gemini [105] Not explicitly statedg Not explicitly statedh Not explicitly stated Google AI
a OpenAI has not officially stated the exact number of training tokens used for GPT-3.5 (InstructGPT) until the publication of this work. However,
it is rumored to be in the range of 600–700 billion tokens.
b OpenAI has not officially stated the size of GPT-3.5 until the publication of this work. However, it is rumored to be 175 billion parameters.
c OpenAI has not officially stated the exact number of training tokens used for GPT-3.5 (ChatGPT) until the publication of this work. However, it is
rumored to be in the range of 600–700 billion tokens.
d Google has not officially stated the size of PaLM 2 until the publication of this work. However, it is rumored to be 340 billion parameters.
e OpenAI has not officially stated the size of GPT-4 until the publication of this work. However, it is rumored to be between 1 and 1.76 trillion
parameters.
f OpenAI has not officially stated the exact number of training tokens used for GPT-4 until the publication of this work. However, it is rumored to
be in the range of 10–100 trillion tokens.
g Google has not officially stated the size of Gemini Pro or Ultra until the publication of this work. However, Nano has two versions at 1.8 billion
and 3.25 billion parameters.
h Google has not officially stated the exact number of training tokens for any of the Gemini models until the publication of this work. However, they
do follow the approach of [98].

powerful and versatile than traditional language models. LLMs LaMDA [102], etc. These models have demonstrated the power
have revolutionized the way we interact with and leverage nat- of pretrained, massive neural networks for NLP tasks. For ex-
ural language data, and they are now used in a wide variety ample, GPT can be used to generate realistic and coherent text,
of applications, including chatbots [88], machine translation while BERT can be used to extract complex meaning from text.
systems [1], [7], and search engines. These models have expe- Table II shows a performance comparison of some of the state-
rienced significant growth in terms of scale, complexity, and of-the-art LLMs well suited for a wide range of NLP tasks, as
performance. Recently, several LLMs have been introduced, reported on PapersWithCode.2
with some of the largest dense language models that have scaled
to billions of model sizes [4], [26], [95], [96], [97]. These D. Architecture of Transformer Models
powerful models demonstrate the capability to perform a wide
Transformer architectures have revolutionized NLP tasks,
range of innovative NLP tasks, including machine translation,
such as sequence modeling, by effectively capturing long-range
text summarization, question answering, and code completion.
dependencies and modeling relationships between words. The
To provide a comprehensive comparison of some well-known
advantages of the transformer architecture include enhanced
state-of-the-art LLMs, we have presented a list in Table I. Fig. 1
parallelization, faster training, and the ability to model long-
shows a trend of some of the LLMs and their corresponding
range dependencies efficiently. The attention mechanism allows
number of parameters (model sizes). Some of the well-known
state-of-the-art LLMs include GPT [28], T5 [19], Gopher [95], 2 https://paperswithcode.com/

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5883

Fig. 1. Timeline and model size of LLMs (M = millions, B = billions).

TABLE II and creating a contextualized representation that encompasses


PERFORMANCE COMPARISON OF SOME OF THE STATE-OF-THE-ART semantic and syntactic details of the input. Subsequently, the
LLMS WELL SUITED FOR A WIDE RANGE OF NLP TASKS,
AS REPORTED ON PAPERSWITHCODE decoder network, in turn, utilizes this contextualized represen-
tation from the encoder to generate the output sequence step by
Model MMLU GSM8K ARC (Challenge) step. At each step, the decoder attends to various parts of the
Gemini Pro 79.1 86.5 – encoder’s output, facilitating the alignment of source and tar-
Gemini Ultra 90.0 94.4 –
get language information. Both the encoder and decoder com-
ponents typically employ the self-attention mechanism [18].
GPT-3 53.9 55.0 53.2
This mechanism enables the model to weigh the importance of
GPT-3.5 (ChatGPT) 70.0 57.1 85.2 different positions in the input sequence during the generation
GPT-4 86.5 92.0 96.3 of the output sequence, thereby allowing for the capture of long-
LLaMA (65B) 63.4 50.9 56.0 range dependencies. Encoder–decoder architectures are com-
Llama 2 (70B) 68.9 56.8 67.3 monly trained in a supervised and unsupervised manner, where
the model is first pretrained on a large corpus, then fine-tuned on
PaLM (540B) 69.3 82.1 87.1
provided with pairs of input sequences and corresponding target
PaLM 2 81.2 91.0 95.1
output sequences. The model learns to map input sequences to
Note: MMLU = massive multitask language understanding, GSM8K = grade output sequences by minimizing a suitable loss function [18].
school math, ARC = abstraction and reasoning corpus. However, the field has witnessed a significant shift with the
emergence of decoder-only architectures, indicating a transition
toward more flexible and potentially more powerful models.
the model to focus on relevant parts of the input sequence, con- 2) Decoder-Only Architecture: The decoder-only architec-
tributing to its success in handling variable-length sequences ture utilizes only the decoder component of the transformer
without sacrificing performance. Recognizing the shift from model [21], [110]. In this architecture, the model generates
encoder–decoder to decoder-only architectures, understanding output sequences autoregressively, predicting one token at a
pretraining strategies, and the advantages of transformer models time based on the preceding tokens without relying on an ex-
provide a more nuanced perspective on their capabilities in plicit encoder [110]. The absence of an encoder implies that the
generative AI and various NLP tasks. Here, we will distin- model does not receive direct information about the input se-
guish between the original encoder–decoder architecture and quence but instead uses its autoregressive nature to capture de-
the decoder-only architecture and the pretraining strategies of pendencies within the generated sequence itself. Decoder-only
transformer models. architectures leverage a specific variant of the self-attention
1) Encoder–Decoder Architecture: The encoder–decoder mechanism. This mechanism allows the model to attend to
architecture serves as a fundamental structure in transformer different positions within the already generated sequence while
models, employed for sequence-to-sequence tasks such as ma- predicting each new token, effectively capturing the necessary
chine translation, where an input sequence (source language) is contextual information for generating coherent output [110].
transformed into an output sequence [18], [110]. In an encoder– These models are typically pretrained on massive text corpora in
decoder architecture, the model consists of two main compo- an unsupervised manner [21]. During this pretraining phase, the
nents featuring multiple layers of self-attention and feedforward model learns general language representations, capturing both
layers: an encoder and a decoder network. The encoder network syntactic and semantic information [21]. Subsequently, fine-
processes the input sequence, capturing relevant information tuning specific tasks with labeled data allows the model to adapt

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5884 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

to various downstream applications. One well-known example 1) Transformer-XL: Transformer-XL is an extension of the
of the decoder-only architecture is the GPT [4]. GPT employs a standard transformer model designed to overcome the limita-
stack of transformer decoder layers for autoregressive sequence tions of fixed-length contexts in traditional models [114]. It
generation [4]. addresses the inherent limitation of the standard transformer
model, which employs a fixed-length context window, by em-
ploying two advanced mechanisms. These mechanisms en-
E. Pretraining Strategies in Transformer Language Models
able the model to learn dependencies beyond a fixed length
One of the key factors behind the success of transformer- in language modeling and retain information from previous
based language models is their pretraining on massive amounts segments of the input sequence, thus enhancing its ability
of text data using self-supervised learning techniques [18]. This to process longer sequences more effectively [114]. The first
pretraining stage equips the models with a robust understand- mechanism, segment-level recurrence, allows the model to
ing of language structure and semantics, enabling exceptional reuse hidden states from previous segments by propagating
performance on various downstream NLP tasks [20], [21]. them through recurrent connections. This enables information
Transformer language models, leveraging pretraining, have flow across segments, facilitating the retention of context from
demonstrated outstanding performance across diverse NLP previous segments and extending the context beyond a fixed
tasks. In machine translation, the transformer’s attention mech- length. Incorporating recurrence at the segment level empowers
anism allows it to capture long-range dependencies, yielding Transformer-XL to capture longer term dependencies in the
state-of-the-art results without the need for excessive padding or data [114]. In addition to the segment-level recurrence mech-
truncation [18]. Beyond translation, decoder-only architectures anism, Transformer-XL employs a novel relative positional
such as GPT have proven effective in tasks such as sentiment encoding scheme [114]. This encoding scheme is crucial for en-
analysis, named entity recognition, and text completion [21]. abling state reuse without causing temporal confusion, thereby
1) Self-Supervised Learning for Pretraining: Unlike tra- allowing the model to effectively capture dependencies across
ditional supervised learning methods that demand extensive longer sequences. By utilizing relative positional encodings
labeled data, self-supervised learning leverages the unlabeled instead of absolute ones, Transformer-XL ensures that informa-
nature of textual data. Common pretraining objectives in trans- tion can be propagated across longer sequences without sacrific-
formers involve tasks such as predicting the next word in a se- ing temporal coherence. This encoding scheme plays a vital role
quence, also known as masked language modeling (MLM) [20], in allowing the model to learn dependencies that extend beyond
[111], or reconstructing a sentence where certain words are the fixed context length [114]. Furthermore, Transformer-XL
replaced with special tokens (masked tokens) [20]. By tackling incorporates a state reuse mechanism by caching a sequence
these tasks, the model learns contextual relationships between of hidden states from previous segments, which can be reused
words and develops a strong understanding of grammatical during evaluation. As demonstrated in [114], this state reuse
structures. The pretraining phase serves as a critical foundation mechanism significantly accelerates evaluation and enables the
for downstream NLP tasks. The learned representations from model to maintain context from earlier segments, contributing
vast amounts of text data can be fine-tuned for specific tasks to its ability to capture long-term dependencies in sequences.
such as sentiment analysis, question answering, and machine 2) XLNet Architecture: XLNet represents a pretraining
translation. This approach requires significantly less labeled method for NLU tasks [115]. Building upon BERT’s bidirec-
data compared to training a model from scratch [112]. Con- tional context modeling [20], XLNet addresses its limitations,
sequently, self-supervised learning not only improves the effi- such as the fixed-length context constraint. Unlike BERT, which
ciency of NLP model training but also enables them to perform relies on MLM, XLNet achieves bidirectional context learning
effectively on tasks where obtaining large amounts of labeled by maximizing the expected likelihood over all permutations of
data might be challenging. the factorization order [115]. By utilizing an autoregressive for-
mulation, XLNet ensures consistency between pretraining and
fine-tuning stages, a limitation observed in BERT. As shown in
F. Long-Sequence Language Models (24), instead of predicting the next word in a sentence given all
Long-sequence language models are neural network architec- previous words, XLNet predicts based on randomly chosen per-
tures specifically designed to effectively handle long textual in- mutations of the input sequence. This approach encourages the
put sequences by leveraging the transformer architecture [113]. model to consider all possible input permutations, effectively
While various architectures can handle longer sequences, trans- capturing the dependencies within the sequence. This involves
formers are dominant due to their self-attention mechanisms, randomly shuffling the order of elements in the sequence and
enabling parallel processing and capturing long-range depen- then predicting each element based on its permuted context. In
dencies, overcoming the sequential limitations of RNNs. This, (24), x1 , x2 , . . . , xn represents theinput sequence, π is a ran-
unlike traditional language models, enables long-sequence lan- dom permutation of indices, and P xπ(i) | x1 , x2 , . . . , xπ(i−1)
guage models to efficiently capture long-range dependencies is the conditional probability of predicting the ith token given
and relationships between words [113]. Several long-sequence the previously predicted tokens and the current input sequence.
language models address the limitations of standard transform- During training, XLNet receives permuted sequences as input
ers by introducing modifications and additional features to their and predicts each element based on the surrounding elements
architectures. in the shuffled order [115]. This forces the model to learn

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5885

contextual representations that are not dependent on the order Qi KjT


of elements. Additionally, as shown in (25), x1 , x2 , . . . , xn = softmax √ + mask_matrixij · Vj . (26)
dk
denotes the input sequence, and P (xi | x1 , x2 , . . . , xi−1 ) is
the conditional probability of predicting the ith token given 4) Sparse Transformers: The standard transformer’s atten-
the previously predicted tokens x1 , x2 , . . . , xn . Equation (25) tion mechanism calculates attention scores for all pairs of posi-
represents the probability of generating the entire sequence tions in a sequence, leading to quadratic time complexity [18].
x1 , x2 , . . . , xn by factorizing it into conditional probabilities Sparse transformers address this issue by considering only a
conditioned on the previously generated tokens. XLNet incor- subset of positions during attention computation [117]. This in-
porates a generalized autoregressive objective, similar to the troduces sparsity, significantly reducing memory requirements
one used in GPT models, allowing diverse and coherent text and computational load, making them suitable for longer se-
generation. In (25), Integrating ideas from Transformer-XL quences [117]. As shown in (27), the sparse transformer is a
enhances XLNet’s ability to handle long-range dependencies modified version of the standard attention mechanism used in
and capture contextual information efficiently [115]. XLNet can transformers [117]. Sparse transformer’s attention, given a se-
be applied across a wide range of NLP tasks including question quence of input embeddings X with dimensions N × d, where
answering, natural language inference, sentiment analysis, and N is the sequence length and d is the embedding dimension,
document ranking [115]. Its generalized autoregressive pre- the attention scores for position i attending to position j can
training method enables effective handling of bidirectional con- be computed as shown in (27), where Sp represents the sparse
texts, long-range dependencies, and ensures consistency across attention, Qi represents the query vector for position i, Kj
pretraining and fine-tuning stages. Furthermore, XLNet’s inte- denotes the key vector for position j, Qi KjT represents the
gration of Transformer-XL and advanced architectural designs dot product of query and the key vectors, capturing the pair-
improves performance on tasks involving longer text sequences wise interactions between positions in the input sequence, Vj
and explicit reasoning represents the value vector for position j, and Mij denotes

n
  a binary mask element indicating whether vector position i
P (x1 , x2 , . . . , xn ) = P xπ(i) | x1 , x2 , . . . , xπ(i−1) (24) attends to vector position j. This demonstrates that the attention
i=1 mechanism is computed for each pair of positions i and j

n based on their corresponding query, key, and value vectors. In
P (x1 , x2 , . . . , xn ) = P (xi | x1 , x2 , . . . , xi−1 ) . (25) global sparse attention, the mask Mij is generated by randomly
i=1 selecting a fixed number of positions for each position i to
3) Longformer: In the context of long-sequence language attend to. This introduces sparsity by limiting the attention to
models, the Longformer is a specialized architecture designed a small subset of positions in the sequence [117]. However, for
to improve the processing of long textual inputs [116]. It local sparse attention, the mask Mij ensures that each position
shares the transformer architecture’s foundation but introduces attends to a nearby local neighborhood. This reduces the com-
modifications to the attention mechanism to accommodate the putational complexity associated with attending to all positions
challenges posed by long sequences. It uses a locality-sensitive and helps capture√short-range dependencies efficiently [117].
attention mechanism where each token attends only to its rel- The division by dk serves as a scaling factor for numerical
evant local context and a few globally important tokens [113], stability, where dk represents the dimensionality of the key
[116]. This attention only considers relevant subsequences vectors. Additionally, the binary mask Mij controls the sparsity
around each token, improving efficiency for long sequences pattern by allowing only certain positions to contribute to the
and it is adjusted as shown in (26) where Qi , Kj , and Vj attention scores. The softmax function is applied to the masked
are the query, key, and value vectors for positions i and j in and scaled dot product to normalize the scores, and finally, the
the input sequence, respectively. The mask_matrix is used to result is multiplied elementwise with the value matrix Vj . Some
mask certain positions, such as preventing attending to future variations of sparse transformers incorporate adaptively deter-
positions during√training or ignoring padding positions. In (26), mined sparsity based on the input sequence, task, or training
the division by dk is a scaling factor that helps stabilize the phase [118], enhancing the model’s flexibility and performance
gradients during training, where dk is the dimensionality of the in handling diverse sequences [118]
key vectors. As described in [113] and [116], Longformer’s
attention mechanism scales linearly with the sequence length, Qi KjT · Mij
Sp (Qi , Kj , Vj , Mij ) = softmax √ · Vj . (27)
making it feasible to process long documents efficiently. It dk
combines local windowed attention with task-motivated global
attention. Local attention is primarily used to build contextual
representations, while global attention allows Longformer to G. Applications of LLMs
create full sequence representations for prediction [116]. In
standard transformers, the self-attention mechanism considers LLMs are a specific type of generative AI designed primarily
interactions between all pairs of positions in the input sequence, for generating and understanding human language. In addition
leading to quadratic complexity to the applications of generative AI explained in Section II,
LLMs can be employed for various other important tasks, such
Attention (Qi , Kj , Vj , mask_matrix) as the following.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5886 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

1) Language Understanding: In the context of NLU, LLMs features into text but lack the broader language understanding
are employed to extract meaning from human language. LLMs capabilities of LLMs. While traditional ASR systems often
are being used for a variety of NLU [106] and other language- rely on specialized architectures, the use of LLMs, particularly
related tasks, including sentiment analysis and named entity transformer-based models, has gained attention for end-to-end
recognition [4]. These models can analyze and comprehend the speech recognition [18]. LLMs can analyze the output of ASR
context of a given text, making them valuable for a wide range models and suggest corrections based on their understanding of
of applications. language and context, improving the accuracy of transcriptions,
2) Machine Translation: In the context of machine transla- especially in noisy environments or with unclear pronunciations
tion, LLMs are used to automatically translate text between dif- [125]. Additionally, LLMs can be leveraged to provide context
ferent languages [119]. For example, Google Translate utilizes to the speech recognition process. By considering surrounding
LLMs to seamlessly translate text, documents, and websites text or information about the speaker and situation, LLMs can
from one language into another. This capability demonstrates assist ASR models in making better decisions about what is
the practical utility of LLMs in bridging language barriers and being said.
enhancing communication, achieved through training on exten- 6) Text Summarization: LLMs have demonstrated success-
sive multilingual datasets [7]. The quality of translation relies ful applications in various text summarization tasks, such as
on the underlying capabilities of LLMs for NLU and genera- summarizing documents and news articles [24], [126]. For
tion. The work in [107] introduced the concept of attention to example, the work presented in [3] introduces a sequence-
neural machine translation architecture, leading to significant to-sequence pretraining model, which has proven highly ef-
advancements in language translation quality. fective in abstractive summarization tasks. Modern LLMs,
3) Question Answering: LLMs are effectively employed in empowered with powerful NLP capabilities, can understand
question-answering tasks across a variety of topics, enabling the context of a document and generate concise and coherent
them to provide relevant answers to user queries [3], [4]. This summaries quickly while preserving the overall meaning of the
capability has applications in virtual assistants, information original text.
retrieval systems, and educational platforms. For example, the 7) Code Completion: In addition to the capabilities of LLMs
AI assistant from Google can answer questions about a variety to generate humanlike text and perform various NLP tasks,
of topics, such as current events, history, and science. LLMs have also demonstrated the ability to understand the
4) Chatbots: The NLP capabilities of LLMs contribute sig- context of code and generate relevant and accurate code sugges-
nificantly to the development of intelligent chatbots [99]. This tions [127]. Code completion with LLMs involves predicting
adaptability enhances the overall user experience, making in- the next set of characters in a code snippet based on the provided
teractions with virtual assistants more intuitive and effective. context [128]. These models leverage their extensive pretrained
LLMs are widely employed in creating chatbots for customer knowledge of programming languages and coding patterns to
support and other interactive applications, enabling these intel- generate pertinent code suggestions [129]. This approach has
ligent virtual assistants to engage with humans, answer queries, been shown to improve developer productivity [130].
and help in a natural and informative way [26]. The ability
of LLMs to understand and respond to natural languages has IV. CHALLENGES OF GENERATIVE AI AND LLMS
opened up new possibilities in customer service, education, Despite their wide range of immense potential for society,
entertainment, and healthcare [120]. For example, companies generative AI and LLMs also pose several critical challenges
such as Facebook and Microsoft have successfully integrated that need to be carefully considered and addressed. These chal-
LLMs into their chatbot systems, such as Facebook’s Mes- lenges include the following.
senger platform and Microsoft’s Azure Bot Service. These
platforms utilize the power of LLMs to provide users with A. Bias and Fairness
personalized and context-aware responses, demonstrating the
practical applications of these models in real-world interactive One of the main challenges associated with generative AI
environments. and LLMs is the inheriting biases from the training data,
5) Speech Recognition: Older speech recognition systems which can lead to biased, unfair, and discriminatory outputs.
often relied on RNNs or hybrid models combining hidden Biased outputs from generative AI and LLMs can have sig-
markov models (HMMs) with DNNs [121], [122]. However, nificant real-world consequences. For example, biased hiring
these approaches faced limitations. RNNs process input se- algorithms may discriminate against certain job applicants. Po-
quences one element at a time, leading to slow processing tential bias problems like these can be mitigated by developing
and difficulties handling long-range dependencies in audio sig- algorithms that are explicitly designed to be fair and unbiased
nals [123]. Additionally, hybrid models were complex and re- by using approaches such as fairness-aware training [131],
quired careful integration of separate components. To address counterfactual analysis [132], [133], [134], and adversarial
these limitations, researchers have explored and applied LLMs debiasing [135].
to speech recognition tasks, yielding promising results [124].
B. Interpretability
The core technology for speech recognition remains automatic
speech recognition (ASR) models specifically trained on vast Understanding and interpreting the decision-making process
amounts of speech data. These models excel at converting audio of LLMs presents a significant challenge. The inherent lack of

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5887

interpretability in these models raises serious concerns, espe- as a starting point for the fine-tuning process, synthesizing
cially in critical applications that require explainable decision- additional data from the new domain to supplement the existing
making. Addressing interpretability challenges in generative data, and simultaneous multitask training involving both the
AI and LLMs involves several approaches. One solution is original and new tasks.
to design LLMs with inherent explainability features, such as
employing interpretable model architectures and incorporating E. Data Privacy and Security
constraints that promote understandable decision-making. An- LLMs are trained on massive and diverse datasets that may
other approach is to develop advanced techniques that provide contain sensitive personal information. The potential for unin-
insights into the inner workings of LLMs, such as saliency tentional disclosure of private or sensitive information during
maps [136], attention mechanisms [18], and feature attribution text generation is a significant concern. For instance, when ap-
methods. Additionally, implementing post hoc interpretability plied in healthcare, the use of LLMs raises concerns regarding
methods [137], [138] including feature importance analysis patient privacy and the potential for misdiagnosis. There is also
and model-agnostic interpretation techniques can offer valuable a risk of AI systems being exploited for malicious purposes,
insights into the factors influencing the outputs of the model. such as generating fake identities, which raises privacy con-
cerns. This, for example, has caused ChatGPT to be temporarily
C. Fine-Tuning and Adaptability outlawed in Italy.3,4 Addressing privacy concerns in generative
Fine-tuning large LLMs for specific domains is challenging AI and LLMs requires a multifaceted approach that includes
due to their inherent limitations in generalization. In addition to enhancing model training with privacy-preserving techniques,
their limited ability to generalize, LLMs may face difficulty in such as federated learning, homomorphic encryption, and dif-
understanding and reasoning complex concepts, hindering their ferential privacy [144], [145]. Additionally, fine-tuning mod-
ability to adapt to new tasks. Addressing the challenges asso- els on curated datasets that exclude sensitive information can
ciated with fine-tuning and adaptability in generative AI and help to minimize the risk of unintentional disclosures. Ethical
LLMs involves exploring various approaches. One approach guidelines and regulations specific to AI applications, such
involves employing transfer learning techniques that leverage as in healthcare, can provide further safeguards against pri-
knowledge from pretrained models on diverse datasets, allow- vacy breaches [146], [147]. LLMs should be able to handle
ing the model to capture a broader range of knowledge by ac- adversarial attacks, noisy data, and out-of-distribution inputs.
celerating learning and improving generalization [139], [140]. In addition to this, it is worth mentioning that beyond model
Additionally, incorporating domain-specific data during fine- privacy, addressing concerns related to the privacy and security
tuning can enhance the model’s adaptability to particular tasks, of the training and deployment data itself is important.
ensuring it learns domain-specific patterns and relationships.
Incorporating symbolic reasoning capabilities into LLMs can F. Computational Cost
also enhance their ability to understand and manipulate abstract Training and deploying LLMs demand significant computa-
concepts [141]. Leveraging metalearning techniques to enable tional resources. This poses infrastructure challenges, energy
LLMs to learn how to quickly learn also improves their ability consumption particularly for large-scale deployments, and ac-
to adapt to new tasks and data distributions [142]. cessibility of high-performance computing resources. As shown
in Fig. 1, the increase in model sizes comes with challenges
D. Domain Adaptation related to computational requirements and resource accessi-
Most of the high-performing models being released are al- bility. Reducing the computational cost of LLMs involves
ready fine-tuned for instruction-following. However, adapting several approaches. First, optimizing model architectures and
these pretrained LLMs, which have been specifically fine-tuned algorithms can enhance efficiency, reducing the computational
for a specific domain (such as chat), to a new task (such as burden without compromising performance. Second, leveraging
generating text formats and answering your questions) not for- distributed computing frameworks and specialized hardware
matted for instruction-following without compromising its per- accelerators, such as GPUs and tensor processing units (TPUs),
formance in the original domain is challenging. The challenge can significantly improve training speed and resource utilization
lies in preserving the model’s ability to understand and follow [148]. In addition to this, employing quantization techniques
instructions while also enabling it to generate coherent and [149] to models that have already been trained5 is also impor-
informative text in the new domain. This requires careful con- tant.
sideration of the training data, the model architecture, and the
fine-tuning process. However, fine-tuning LLMs for an entirely G. Deepfake Generation
new domain introduces the risk of negative transfer [143]. This Generative AI models are widely used for deepfake genera-
occurs when the model’s new knowledge conflicts with its exist- tion [150]. Deepfakes utilize various generative models, includ-
ing knowledge. Additionally, domain adaptation often requires ing GANs, to manipulate or generate realistic-looking content,
access to a large amount of high-quality data from the new 3 https://www.bbc.com/news/technology-65139406
domain. This can be challenging to obtain, especially for spe- 4 https://www.theverge.com/2023/4/28/23702883/chatgpt-italy-ban-lifted-
cialized domains. Potential strategies for addressing this chal- gpdp-data-protection-age-verification
lenge include leveraging the weights of the pretrained LLMs 5 https://huggingface.co/TheBloke

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5888 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

primarily for image or video production [151], [152]. Despite context window defines the number of tokens considered by
their potential applications in various domains including edu- the model during prediction, and a smaller context window can
cation and entertainment, deepfakes also pose several potential limit the model’s ability to understand and generate a contex-
risks due to their potential misuse, including the spread of mis- tually relevant text, especially in long passages or documents.
information and identity theft [153]. The deepfakes technology Several techniques can be employed to address the challenge
can be exploited to create fake videos or audio recordings of of a limited context window. A common approach involves
individuals, leading to the spread of misinformation or disinfor- using hierarchical attention, which enables models to focus on
mation, which can have devastating consequences for individ- different levels of context [163]. Additionally, the parallel con-
uals and society. It is, therefore, important to develop advanced text window approach allows for parallel processing of multiple
techniques to mitigate the risks associated with deepfakes. context windows [165]. This feature allows the models to store
information beyond the immediate context window, enabling
H. Human-AI Collaboration better handling of long-term dependencies [166].
LLMs should be designed to enable seamless human-AI col-
laboration, enabling them to effectively understand and respond K. Long-Term Memory
to human instructions and provide clear explanations for their LLMs are trained on a massive corpus of text and code, but
outputs [154]. To achieve effective human-AI collaboration, their completely stateless nature limits their ability to store and
it is important to integrate humans into the design process retrieve information from past experiences [167]. This inherent
of LLMs to ensure that they are aligned with human needs lack of explicit memory restricts their ability to maintain context
and expectations [155], [156]. To incorporate human feedback and engage in natural conversations, leading to less coherent re-
into the training process, we can utilize techniques such as sponses, especially across multiturn dialogues or tasks requiring
reinforcement learning from human feedback (RLHF) [157], information retention. Without the ability to remember past in-
[158] and direct policy optimization (DPO) [159] for train- teractions, LLMs cannot personalize their responses to specific
ing reinforcement learning (RL) agents using human feedback. users. This means they cannot adapt their communication style
Additionally, employing explainable AI (XAI) techniques for based on the user’s preferences, interests, or previous interac-
LLMs can enhance the transparency and understandability of tions. Challenges associated with this limitation include issues
their decision-making processes [160]. Developing natural lan- of consistency and task continuity. To address these challenges,
guage interfaces that facilitate natural human-LLM interactions various approaches and techniques can be considered. Beyond
is another key aspect of enhancing human-AI collaboration context window techniques, integrating external memory mech-
[161]. Conversational AI, intelligent chatbots, and voice assis- anisms such as memory networks and attention mechanisms
tants are examples of technologies that enable intuitive human- with an external memory matrix can enhance the model’s ability
AI interactions. to access and update information across different turns [168].
Alternatively, designing applications that externally maintain
I. Long-Term Planning session-based context allows the model to reference past in-
teractions within a session. Additionally, retrieval-based tech-
Generative models, particularly autoregressive models that niques enable LLMs to access relevant information from past
generate text one token at a time, face challenges in long-term conversations or external sources during inference, enhancing
planning [162]. These models tend to focus on the immediate their ability to maintain context and deliver more consistent
local context, making it difficult to maintain consistency over responses [169].
longer text passages. This limitation comes from the model’s
lack of a global view of the entire sequence it generates. Addi-
tionally, autoregressive models struggle to plan for situations L. Measuring Capability and Quality
with future uncertainties. To address the long-term planning Traditional statistical quality measures, such as Accuracy and
challenge with LLMs, we can employ several approaches in- F-score do not easily translate to generative tasks [170], espe-
cluding hierarchical attention, which allows LLMs to focus cially long-form generative tasks. Furthermore, the accessibility
on different parts of the input at different times that can help of test sets in numerous benchmark datasets provides an avenue
the models capture long-range dependencies [163]. Equipping for the potential manipulation of leaderboards by unethical
LLMs with memory that allows them to store information about practitioners. This involves the inappropriate training of models
the past, which can be used to inform future decisions, is another on the test set, a practice likely employed by researchers seeking
approach to address this challenge [164]. funding through achieving top positions on public leaderboards,
such as Hugging Face’s Open LLM Leaderboard.6 At the time
J. Limited Context Window of writing this article, a seven billion parameter model is outper-
forming numerous 70 billion parameter models. A prospective
Having a limited context window is a fundamental challenge and pragmatic approach to appraising model outputs is to utilize
for LLMs since they can only process a limited amount of an auxiliary model for evaluating the generated content from the
text at a time. This limitation comes from their reliance on original model [171]. However, this methodology may prove
attention mechanisms [18], which allow them to focus on the
most relevant parts of the text when generating content. The 6 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5889

ineffective if the judgment model lacks training within the research gaps in areas such as bias, interpretability, deepfakes,
specific domain it is employed to assess. and human-AI collaboration, our work paves the way for an
impactful, ethical, and inclusive future of NLP. We envision this
M. A Concerning Trend Toward “Closed” Science research serving as a roadmap for the AI community, empower-
ing diverse domains with transformative tools and establishing
As models transition from experimental endeavors to com- a clear path for the responsible evolution of AI.
mercially viable products, there is a diminishing inclination In our future work, we aim to explore advanced techniques
to openly share the progress achieved within research labo- for identifying and mitigating bias in both training data and
ratories.7 This shift poses a significant obstacle to the col- algorithms to enhance fairness in AI systems. Additionally, we
laborative advancement of knowledge, hindering the ability to plan to investigate explainable AI approaches and develop new
build upon established foundations when essential details are strategies to improve the interpretability of AI models. Building
withheld. Furthermore, replicating published results becomes upon our previous line of research on human-autonomy team-
arduous when the prompts employed in the experimentation are ing, we will delve into the development of models that facilitate
not disclosed, since subtle alterations to prompts can, in some seamless collaboration and interaction between humans and
cases, significantly affect the performance of the model. Com- AI. We hope this work encourages researchers across multiple
pounding these concerns, accessing the necessary resources disciplines of the AI community, from both academia and in-
to reproduce results often entails financial obligations to the dustry, to further explore the broader domain of generative AI
publishers of the models, creating yet another barrier to entry and LLMs.
into the scientific landscape for low-resource researchers. This
situation prompts reflection on the current situation and the
potential impediments it imposes on the pursuit of knowledge ACKNOWLEDGMENT
and innovation. Any opinions, findings, conclusions, or recommendations
expressed in this article are those of the authors and should
V. BRIDGING RESEARCH GAPS AND FUTURE DIRECTIONS not be interpreted as representing the official policies, either
Our research has identified several key areas that require expressed or implied, of the funding agencies.
attention to ensure the ethical integration of generative AI and
LLMs. These areas include addressing issues such as bias and REFERENCES
fairness in outputs, the necessity for models to provide ex- [1] N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation
planations for their reasoning, and the challenges associated models,” in Proc. Conf. Empirical Methods Natural Lang. Process.,
with adapting these models to diverse situations and domains. 2013, pp. 1700–1709.
[2] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method
Furthermore, considerations regarding data privacy, security, for automatic evaluation of machine translation,” in Proc. 40th Annu.
and the potential for misuse in areas such as deepfakes re- Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
quire careful attention. Addressing these challenges through [3] M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and comprehension,” 2019,
advancements in areas we have proposed, such as improved arXiv:1910.13461.
bias detection and the development of interpretable models, [4] T. Brown et al., “Language models are few-shot learners,” in Proc.
holds significant promise. Proactively tackling these issues is Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1877–1901.
[5] K. Cobbe et al., “Training verifiers to solve math word problems,”
essential to ensuring that AI development is not only technically 2021, arXiv:2110.14168.
advanced but also beneficial to society. This includes develop- [6] M. Chen et al., “Evaluating large language models trained on code,”
ing clear metrics to assess model performance, enhancing their 2021, arXiv:2107.03374.
[7] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, “Large lan-
interpretability, and prioritizing user privacy and security. By guage models in machine translation,” in Proc. Joint Conf. Empirical
incorporating ethical considerations into AI development, we Methods Natural Lang. Process. Comput. Natural Lang. Learn., 2007,
pave the way for their responsible deployment across various pp. 858–867.
[8] C. D. Manning, “Human language understanding & reasoning,”
domains, including healthcare, recruitment, and content cre- Daedalus, vol. 151, no. 2, pp. 127–138, 2022.
ation. This will foster a future where AI serves as a positive [9] A. Svyatkovskiy, J. Kates-Harbeck, and W. Tang, “Training distributed
force for societal good, promoting inclusivity and making a deep recurrent neural networks with mixed precision on GPU clusters,”
in Proc. Mach. Learn. HPC Environ., 2017, pp. 1–8.
real impact. [10] B. Li et al., “Large scale recurrent neural network on GPU,” in Proc.
Int. Joint Conf. Neural Netw. (IJCNN), Piscataway, NJ, USA: IEEE
VI. CONCLUSION Press, 2014, pp. 4062–4069.
[11] M. Isaev, N. McDonald, and R. Vuduc, “Scaling infrastructure to
This article explores the transformative potential of genera- support multi-trillion parameter LLM training,” in Proc. Archit. Syst.
Support Transformer Models (ASSYST@ ISCA), 2023, pp. 1–5.
tive AI and LLMs, highlighting their advancements, technical [12] J. Kaplan et al., “Scaling laws for neural language models,” 2020,
foundations, and practical applications across diverse domains. arXiv:2001.08361.
We argue that understanding the full potential and limitations [13] J. Hoffmann et al., “An empirical analysis of compute-optimal large
language model training,” in Proc. Adv. Neural Inf. Process. Syst.,
of generative AI and LLMs is crucial for shaping the respon- vol. 35, 2022, pp. 30016–30030.
sible integration of these technologies. By addressing critical [14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
vol. 521, no. 7553, pp. 436–444, 2015.
7 https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch- [15] L. R. Medsker and L. Jain, “Recurrent neural networks,” Des. Appl.,
closed-research-ilya-sutskever-interview vol. 5, nos. 64–67, p. 2, 2001.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5890 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

[16] R. Socher et al., “Recursive deep models for semantic compositionality [43] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture
over a sentiment treebank,” in Proc. Conf. Empirical Methods Natural of experts,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8,
Lang. Process., 2013, pp. 1631–1642. pp. 1177–1193, Aug. 2012.
[17] Y. LeCun et al., “Convolutional networks for images, speech, and [44] P. Yadav et al., “Resolving interference when merging models,” 2023,
time series,” The Handbook of Brain Theory and Neural Networks, arXiv:2306.01708.
Cambridge, MA, USA: MIT Press, vol. 3361, no. 10, pp. 255–258, [45] M. S. Matena and C. A. Raffel, “Merging models with fisher-weighted
1995. averaging,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022,
[18] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. pp. 17703–17716.
Process. Syst., vol. 30, 2017, pp. 1–15. [46] S. Khanuja, M. Johnson, and P. Talukdar, “MergeDistill: Merging pre-
[19] C. Raffel et al., “Exploring the limits of transfer learning with a trained language models using distillation,” 2021, arXiv:2106.02834.
unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 1, [47] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
pp. 5485–5551, 2020. neural network,” 2015, arXiv:1503.02531.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [48] G. Ilharco et al., “Editing models with task arithmetic,” 2022,
training of deep bidirectional transformers for language understanding,” arXiv:2212.04089.
in Proc. NAACL-HLT, 2019, pp. 1–16. [49] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super
[21] A. Radford et al., “Improving language understanding by generative mario: Absorbing abilities from homologous models as a free lunch,”
pre-training,” Open AI, San Francisco, CA, USA, Technical Report, 2023, arXiv:2311.03099.
pp. 1–12, 2018. [50] F. Wan et al., “Knowledge fusion of large language models,” 2024,
[22] N. Houlsby et al., “Parameter-efficient transfer learning for NLP,” in arXiv:2401.10491.
Proc. Int. Conf. Mach. Learn., PMLR, 2019, pp. 2790–2799. [51] S. K. Ainsworth, J. Hayase, and S. Srinivasa, “Git re-basin: Merging
[23] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” models modulo permutation symmetries,” 2022, arXiv:2209.04836.
2013, arXiv:1312.6114. [52] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion
[24] C. Feng, F. Cai, H. Chen, and M. de Rijke, “Attentive encoder-based models,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
extractive text summarization,” in Proc. 27th ACM Int. Conf. Inf. pp. 21696–21707.
Knowl. Manage., 2018, pp. 1499–1502. [53] C. Zach, “Fully variational noise-contrastive estimation,” in Proc.
[25] T. Wolf et al., “Transformers: State-of-the-art natural language process- Scand. Conf. Image Anal. (SCIA), New York, NY, USA: Springer-
ing,” in Proc. Conf. Empirical Methods Natural Lang. Process. Syst. Verlag, 2023, pp. 175–190.
Demonstrations, 2020, pp. 38–45. [54] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and
[26] R. Thoppilan et al., “LaMDA: Language models for dialog applica- B. Poole, “Score-based generative modeling through stochastic differ-
tions,” 2022, arXiv:2201.08239. ential equations,” 2020, arXiv:2011.13456.
[27] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sen- [55] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image
gupta, and A. A. Bharath, “Generative adversarial networks: An synthesis,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
overview,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, pp. 8780–8794.
Jan. 2018. [56] Y. Song and S. Ermon, “Generative modeling by estimating gradients
[28] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural of the data distribution,” in Proc. Adv. Neural Inf. Process. Syst.,
Inf. Process. Syst., vol. 27, 2014, pp. 1–9. vol. 32, 2019, pp. 1–13.
[29] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation [57] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hier-
learning with deep convolutional generative adversarial networks,” archical text-conditional image generation with clip latents,” 2022,
2015, arXiv:1511.06434. arXiv:2204.06125.
[30] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative [58] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. Int.
adversarial networks,” in Proc. Int. Conf. Mach. Learn., PMLR, 2017, Conf. Mach. Learn., PMLR, 2021, pp. 8821–8831.
pp. 214–223. [59] C. Saharia et al., “Photorealistic text-to-image diffusion models with
[31] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive grow- deep language understanding,” in Proc. Adv. Neural Inf. Process. Syst.,
ing of GANs for improved quality, stability, and variation,” 2017, vol. 35, 2022, pp. 36479–36494.
arXiv:1710.10196. [60] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Om-
[32] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, mer, “High-resolution image synthesis with latent diffusion models,”
and M. Welling, “Improved variational inference with inverse in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022,
autoregressive flow,” in Proc. Adv. Neural Inf. Process. Syst., pp. 10684–10695.
vol. 29, 2016, pp. 1–9. [61] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli,
[33] D. Rezende and S. Mohamed, “Variational inference with normalizing “Deep unsupervised learning using nonequilibrium thermodynamics,”
flows,” in Proc. Int. Conf. Mach. Learn., PMLR, 2015, pp. 1530–1538. in Proc. Int. Conf. Mach. Learn., PMLR, 2015, pp. 2256–2265.
[34] C. Meek, D. M. Chickering, and D. Heckerman, “Autoregressive tree [62] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
models for time-series analysis,” in Proc. SIAM Int. Conf. Data Mining, “Generative adversarial text to image synthesis,” in Proc. Int. Conf.
Philadelphia, PA, USA: SIAM, 2002, pp. 229–244. Mach. Learn., PMLR, 2016, pp. 1060–1069.
[35] N. Shazeer et al., “Outrageously large neural networks: The sparsely- [63] “Midjourney,” Midjourney. Accessed: Aug. 23, 2024. [Online]. Avail-
gated mixture-of-experts layer,” 2017, arXiv:1701.06538. able: https://www.midjourney.com/
[36] X. Wang et al., “Deep mixture of experts via shallow embedding,” in [64] M. Wu and N. Goodman, “Multimodal generative models for scalable
Proc. Uncertainty in AI, PMLR, 2020, pp. 552–562. weakly-supervised learning,” in Proc. Adv. Neural Inf. Process. Syst.,
[37] N. Du et al., “GLaM: Efficient scaling of language models with vol. 31, 2018, pp. 1–11.
mixture-of-experts,” in Proc. Int. Conf. Mach. Learn., PMLR, 2022, [65] M. Suzuki and Y. Matsuo, “A survey of multimodal deep generative
pp. 5547–5569. models,” Adv. Robot., vol. 36, nos. 5–6, pp. 261–278, 2022.
[38] S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettle- [66] Y. Shi et al., “Variational mixture-of-experts autoencoders for multi-
moyer, “DEMix layers: Disentangling domains for modular language modal deep generative models,” in Proc. Adv. Neural Inf. Process. Syst.,
modeling,” 2021, arXiv:2108.05036. vol. 32, 2019.
[39] S. Rajbhandari et al., “DeepSpeed-MoE: Advancing mixture-of-experts [67] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
inference and training to power next-generation AI scale,” in Proc. Int. “Multimodal deep learning,” in Proc. 28th Int. Conf. Mach. Learn.
Conf. Mach. Learn., PMLR, 2022, pp. 18332–18346. (ICML-11), 2011, pp. 689–696.
[40] Y. Zhou et al., “Mixture-of-experts with expert choice routing,” in Proc. [68] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 7103–7114. learning: A survey and taxonomy,” IEEE Trans. pattern Anal. Mach.
[41] Z. Chi et al., “On the representation collapse of sparse mixture of Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
experts,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, [69] A. Radford et al., “Learning transferable visual models from natural
pp. 34600–34613. language supervision,” in Proc. Int. Conf. Mach. Learn., PMLR, 2021,
[42] Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li, “Towards understanding pp. 8748–8763.
the mixture-of-experts layer in deep learning,” in Proc. Adv. Neural [70] J. Yu et al., “Scaling autoregressive models for content-rich text-to-
Inf. Process. Syst., vol. 35, 2022, pp. 23049–23062. image generation,” 2022, arXiv:2206.10789.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5891

[71] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture [100] S. Narang and A. Chowdhery, “Pathways language model (PaLM):
for generative adversarial networks,” in Proc. IEEE/CVF Conf., 2019, Scaling to 540 billion parameters for breakthrough performance,”
pp. 4401–4410. Google AI Blog. Accessed: Aug. 23, 2024. [Online]. Avail-
[72] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image able: https://blog.research.google/2022/04/pathways-language-model-
synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF palm-scaling-to.html
Conf., 2019, pp. 2337–2346. [101] A. Chowdhery et al., “PaLM: Scaling language modeling with path-
[73] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image ways,” 2022, arXiv:2204.02311.
translation using cycle-consistent adversarial networks,” in Proc. IEEE [102] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat
Int. Conf. Comput. Vis., 2017, pp. 2223–2232. models,” 2023, arXiv:2307.09288.
[74] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers [103] R. Anil et al., “PaLM 2 technical report,” 2023, arXiv:2305.10403.
for image recognition at scale,” 2020, arXiv:2010.11929. [104] OpenAI, “GPT-4 Technical Report,” 2023. Available: https://openai.
[75] “Sora: OpenAI’s platform for multimodal AI,” OpenAI. Accessed: Aug. com/research/gpt-4
23, 2024. [Online]. Available: https://openai.com/sora [105] “Gemini: A family of highly capable multimodal models,” Google.
[76] T. Brooks et al., “Video generation models as world simulators,” [Online]. Available: https://storage.googleapis.com/deepmind-media/
OpenAI. Accessed: Aug. 23, 2024. [Online]. Available: https://openai. gemini/gemini_1_report.pdf
com/blog/videoworldsimulators2024/ [106] M. McShane, “Natural language understanding (NLU, not NLP) in
[77] J. Manyika, “An overview of Bard: An early experiment with generative cognitive systems,” AI Mag., vol. 38, no. 4, pp. 43–56, 2017.
AI,” AI. Google Static Documents. Accessed: Aug. 23, 2024. [Online]. [107] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
Available: https://ai.google/static/documents/google-about-bard.pdf jointly learning to align and translate,” 2014, arXiv:1409.0473.
[78] X. Zeng et al., “Deep generative molecular design reshapes drug [108] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability
discovery,” Cell Rep. Med., pp. 1–13, 2022. using natural language processing,” in Proc. 2nd Int. Conf. Knowl.
[79] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data Capture, 2003, pp. 70–77.
drug discovery with one-shot learning,” ACS Central Sci., vol. 3, [109] M. A. Di Gangi, M. Negri, and M. Turchi, “Adapting transformer to
no. 4, pp. 283–293, 2017. end-to-end spoken language translation,” in Proc. INTERSPEECH. Int.
[80] A. Aliper et al., “Deep learning applications for predicting pharmaco- Speech Communication Assoc. (ISCA), 2019, pp. 1133–1137.
logical properties of drugs and drug repurposing using transcriptomic [110] J. Wu et al., “On decoder-only architecture for speech-to-text and large
data,” Mol. Pharmaceutics, vol. 13, no. 7, pp. 2524–2530, 2016. language model integration,” in Proc. IEEE Autom. Speech Recognit.
[81] A. Merchant et al., “Scaling deep learning for materials discovery,” Understanding Workshop (ASRU), Piscataway, NJ, USA: IEEE Press,
Nature, vol. 624, no. 7990, pp. 80–85, 2023. 2023, pp. 1–8.
[82] C. P. Gomes et al., “Artificial intelligence for materials discovery,” MRS [111] A. Yamaguchi, G. Chrysostomou, K. Margatina, and N. Aletras, “Frus-
Bull., vol. 44, no. 7, pp. 538–544, 2019. tratingly simple pretraining alternatives to masked language modeling,”
[83] E. O. Pyzer-Knapp et al., “Accelerating materials discovery using 2021, arXiv:2109.01819.
artificial intelligence, high performance computing and robotics,” npj [112] J. Howard and S. Ruder, “Universal language model fine-tuning for
Comput. Mater., vol. 8, no. 1, 2022, Art. no. 84. text classification,” 2018, arXiv:1801.06146.
[84] A. Langevin, T. Cody, S. Adams, and P. Beling, “Generative adversarial [113] M. Zaheer et al., “Big bird: Transformers for longer sequences,” in
networks for data augmentation and transfer in credit card fraud Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 17283–17297.
detection,” J. Oper. Res. Soc., vol. 73, no. 1, pp. 153–180, 2022. [114] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov,
[85] T. Schlegl et al., “Unsupervised anomaly detection with generative “Transformer-XL: Attentive language models beyond a fixed-length
adversarial networks to guide marker discovery,” in Proc. Int. Conf. context,” 2019, arXiv:1901.02860.
Inf. Process. Med. Imag., New York, NY, USA: Springer-Verlag, 2017,
[115] Z. Yang et al., “XLNet: Generalized autoregressive pretraining for
pp. 146–157.
language understanding,” in Proc. Adv. Neural Inf. Process. Syst.,
[86] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-
vol. 32, 2019, pp. 1–18.
Erfurth, “f-AnoGAN: Fast unsupervised anomaly detection with gen-
[116] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
erative adversarial networks,” Med. Image Anal., vol. 54, pp. 30–44,
document transformer,” 2020, arXiv:2004.05150.
May 2019.
[117] R. Child et al., “Generating long sequences with sparse transformers,”
[87] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, “Generative
2019, arXiv:1904.10509.
adversarial user model for reinforcement learning based recommen-
dation system,” in Proc. Int. Conf. Mach. Learn., PMLR, 2019, [118] G. M. Correia, V. Niculae, and A. F. Martins, “Adaptively sparse
pp. 1052–1061. transformers,” 2019, arXiv:1909.00015.
[88] D. Adiwardana et al., “Towards a human-like open-domain chatbot,” [119] M. Johnson et al., “Google’s multilingual neural machine translation
2020, arXiv:2001.09977. system: Enabling zero-shot translation,” Trans. Assoc. Comput. Lin-
[89] X. Liu and W. B. Croft, “Statistical language modeling for information guistics, vol. 5, pp. 339–351, 2017.
retrieval.” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005. [120] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F.
[90] B. Roark, M. Saraclar, and M. Collins, “Discriminative n-gram lan- Tan, and D. S. W. Ting, “Large language models in medicine,” Nature
guage modeling,” Comput. Speech Lang., vol. 21, no. 2, pp. 373– Med., vol. 29, no. 8, pp. 1930–1940, 2023.
392, 2007. [121] G. Hinton et al., “Deep neural networks for acoustic modeling in speech
[91] S. Khudanpur and J. Wu, “Maximum entropy techniques for exploiting recognition: The shared views of four research groups,” IEEE Signal
syntactic, semantic and collocational dependencies in language model- Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
ing,” Comput. Speech Lang., vol. 14, no. 4, pp. 355–372, 2000. [122] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
[92] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust.,
2016, arXiv:1606.08415. Speech Signal Process., Piscataway, NJ, USA: IEEE Press, 2013,
[93] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation pp. 6645–6649.
functions,” 2017, arXiv:1710.05941. [123] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu,
[94] A. Radford et al., “Language models are unsupervised multitask “Exploring the limits of language modeling,” 2016, arXiv:1602.02410.
learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019. [124] C. Shan et al., “Investigating end-to-end speech recognition for
[95] J. W. Rae et al., “Scaling language models: Methods, analysis & mandarin-english code-switching,” in Proc. IEEE Int. Conf. Acoust.,
insights from training gopher,” 2021, arXiv:2112.11446. Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE Press,
[96] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical 2019, pp. 6056–6060.
details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021. [125] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for
[97] S. Smith et al., “Using DeepSpeed and megatron to train megatron- connectionist temporal classification in speech recognition,” in Proc.
turing NLG 530B, a large-scale generative language model,” 2022, IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Piscataway,
arXiv:2201.11990. NJ, USA: IEEE Press, 2019, pp. 7115–7119.
[98] J. Hoffmann et al., “Training compute-optimal large language models,” [126] J. Krantz, W. Spokane, and J. Kalita, “Abstractive summarization using
2022, arXiv:2203.15556. attentive neural techniques,” in Proc. 15th Int. Conf. Natural Lang.
[99] L. Ouyang et al., “Training language models to follow instructions Process., 2018, p. 1.
with human feedback,” in Proc. Adv. Neural Inf. Process. Syst., [127] R. Li et al., “StarCoder: May the source be with you!” 2023,
vol. 35, 2022, pp. 27730–27744. arXiv:2305.06161.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
5892 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 12, DECEMBER 2024

[128] A. Svyatkovskiy, Y. Zhao, S. Fu, and N. Sundaresan, “Pythia: [155] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza, “Power to the
AI-assisted code completion system,” in Proc. 25th ACM SIGKDD people: The role of humans in interactive machine learning,” AI Mag.,
Int. Conf. Knowl. Discovery Data Mining, 2019, pp. 2727–2735. vol. 35, no. 4, pp. 105–120, 2014.
[129] Z. Feng et al., “CodeBERT: A pre-trained model for programming and [156] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-
natural languages,” 2020, arXiv:2002.08155. Bascarán, and Á. Fernández-Leal, “Human-in-the-loop machine learn-
[130] S. Lu et al., “CodeXGLUE: A machine learning benchmark dataset for ing: A state of the art,” Artif. Intell. Rev., vol. 56, no. 4, pp. 3005–
code understanding and generation,” 2021, arXiv:2102.04664. 3054, 2023.
[131] D. Xu, S. Yuan, L. Zhang, and X. Wu, “FairGAN: Fairness-aware [157] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L.
generative adversarial networks,” in Proc. IEEE Int. Conf. Big Data Thomaz, “Policy shaping: Integrating human feedback with reinforce-
(Big Data), Piscataway, NJ, USA: IEEE Press, 2018, pp. 570–575. ment learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 26, 2013,
[132] A. Feder, N. Oved, U. Shalit, and R. Reichart, “CausaLM: Causal pp. 1–9.
model explanation through counterfactual language models,” Comput. [158] J. MacGlashan et al., “Interactive learning from policy-dependent
Linguistics, vol. 47, no. 2, pp. 333–386, 2021. human feedback,” in Proc. Int. Conf. Mach. Learn., PMLR, 2017,
[133] Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richard- pp. 2285–2294.
son, “DISCO: Distilling counterfactuals with large language mod- [159] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C.
els,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics, 2023, Finn, “Direct preference optimization: Your language model is secretly
pp. 5514–5528. a reward model,” 2023, arXiv:2305.18290.
[134] P.-S. Huang et al., “Reducing sentiment bias in language models via [160] A. Rosenfeld and A. Richardson, “Explainability in human–agent
counterfactual evaluation,” 2019, arXiv:1911.03064. systems,” Auton. Agents Multi-Agent Syst., vol. 33, pp. 673–705,
[135] B. H. Zhang, B. Lemoine, and M. Mitchell, “Mitigating unwanted May 2019.
biases with adversarial learning,” in Proc. AAAI/ACM Conf. AI, Ethics, [161] M. T. Ribeiro, S. Singh, and C. Guestrin, “ “Why should i trust
Soc., 2018, pp. 335–340. you?” Explaining the predictions of any classifier,” in Proc. 22nd
[136] S. Ding and P. Koehn, “Evaluating saliency methods for neural lan- ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
guage models,” 2021, arXiv:2104.05824. pp. 1135–1144.
[137] A. Madsen et al., “Post-hoc interpretability for neural NLP: A survey,” [162] K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On
ACM Comput. Surv., vol. 55, no. 8, pp. 1–42, 2022. the planning abilities of large language models–A critical investiga-
[138] N. Kroeger, D. Ley, S. Krishna, C. Agarwal, and H. Lakkaraju, “Are tion,” 2023, arXiv:2305.15771.
large language models post hoc explainers?” 2023, arXiv:2310.05797. [163] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy,
[139] A. Chronopoulou, C. Baziotis, and A. Potamianos, “An embarrassingly “Hierarchical attention networks for document classification,” in Proc.
simple approach for transfer learning from pretrained language mod- Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang.
els,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics Technol., 2016, pp. 1480–1489.
(Long and Short Papers), vol. 1, 2019, pp. 2089–2095. [164] E. Grave, A. Joulin, and N. Usunier, “Improving neural language mod-
[140] K. You, Z. Kou, M. Long, and J. Wang, “Co-tuning for transfer els with a continuous cache,” in Proc. Int. Conf. Learn. Representations,
learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, 2016, pp. 1–9.
pp. 17236–17246. [165] N. Ratner et al., “Parallel context windows for large language models,”
[141] J. Zhang and Y. Moshfeghi, “ELASTIC: Numerical reasoning with in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Long Papers),
adaptive symbolic compiler,” in Proc. Adv. Neural Inf. Process. Syst., vol. 1, 2023, pp. 6383–6402.
vol. 35, 2022, pp. 12647–12661. [166] L. Kaiser et al., “One model to learn them all,” 2017,
[142] Z. Hou, J. Salazar, and G. Polovets, “Meta-learning the difference: arXiv:1706.05137.
Preparing large language models for efficient adaptation,” Trans. Assoc. [167] M. Ghodsi, X. Liu, J. Apfel, R. Cabrera, and E. Weinstein, “RNN-
Comput. Linguistics, vol. 10, pp. 1249–1265, 2022. transducer with stateless prediction network,” in Proc. IEEE Int. Conf.
[143] Z. Wang, Z. Dai, B. Póczos, and J. Carbonell, “Characterizing and Acoust., Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE
avoiding negative transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Press, 2020, pp. 7049–7053.
Pattern Recognit., 2019, pp. 11293–11302. [168] S. Sukhbaatar et al., “End-to-end memory networks,” in Proc. Adv.
[144] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Neural Inf. Process. Syst., vol. 28, 2015.
Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., 2015, [169] Z. Azerbayev et al., “LLEMMA: An open language model mathemat-
pp. 1310–1321. ics,” 2023, arXiv:2310.10631.
[145] M. Abadi et al., “Deep learning with differential privacy,” in [170] D. Deutsch, R. Dror, and D. Roth, “On the limitations of reference-free
Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 308– evaluations of generated text,” 2022, arXiv:2210.12563.
318. [171] L. Zhu, X. Wang, and X. Wang, “JudgeLM: Fine-tuned large language
[146] J. Morley, A. Elhalal, F. Garcia, L. Kinsey, J. Mökander, and L. Floridi, models are scalable judges,” 2023, arXiv:2310.17631.
“Ethics as a service: A pragmatic operationalisation of AI ethics,”
Minds Mach., vol. 31, no. 2, pp. 239–256, 2021.
[147] J. Borenstein and A. Howard, “Emerging challenges in AI and the need
for AI ethics education,” AI Ethics, vol. 1, no. 1, pp. 61–65, 2021. Desta Haileselassie Hagos (Member, IEEE) re-
[148] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- ceived the B.Sc. degree in computer science from
erations for modern deep learning research,” in Proc. AAAI Conf. Artif. Mekelle University, Mekelle, Tigray, in 2008, the
Intell., 2020, vol. 34, no. 9, pp. 13693–13696. M.Sc. degree in computer science and engineering
[149] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “AWQ: with a specialization in mobile systems from the
Activation-aware weight quantization for LLM compression and ac- Department of Computer Science Electrical and
celeration,” 2023, arXiv:2306.00978. Space Engineering, Luleå University of Technology,
[150] R. Tolosana et al., “Deepfakes and beyond: A survey of face ma- Luleå, Sweden, in 2012, and the Ph.D. degree in
nipulation and fake detection,” Inf. Fusion, vol. 64, pp. 131–148, computer science from the Faculty of Mathematics
Dec. 2020. and Natural Sciences, University of Oslo, Oslo,
[151] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Norway, in 2020.
Nießner, “Face2face: Real-time face capture and reenactment of RGB Currently, he is a Postdoctoral Research Fellow with the U.S. Department
videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, of Defense (DoD), Center of Excellence in Artificial Intelligence and Machine
pp. 2387–2395. Learning (CoE-AIML), Department of Electrical Engineering and Computer
[152] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormahlen, P. Perez, and C. Science, College of Engineering and Architecture (CEA), Howard University,
Theobalt, “Automatic face reenactment,” in Proc. IEEE Conf. Comput. Washington, DC, USA. Previously, he was a Postdoctoral Research Fellow
Vis. Pattern Recognit., 2014, pp. 4217–4224. with the Division of Software and Computer Systems (SCS), Department of
[153] M. Brundage et al., “The malicious use of artificial intelligence: Computer Science, School of Electrical Engineering and Computer Science
Forecasting, prevention, and mitigation,” 2018, arXiv:1802.07228. (EECS), KTH Royal Institute of Technology, Stockholm, Sweden, worked on
[154] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey the H2020-EU project, ExtremeEarth: From Copernicus Big Data to Extreme
on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, Earth Analytics. His research interests include the areas of machine learning,
pp. 52138–52160, 2018. deep learning, and artificial intelligence.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.
HAGOS et al.: RECENT ADVANCES IN GENERATIVE AI AND LARGE LANGUAGE MODELS 5893

Rick Battle received the Bachelor of Science de- Dr. Rawat has secured over $110 million as a PI and over $18 million
gree in computer engineering from Virginia Tech, as a Co-PI in research funding from the U.S. National Science Foundation
Blacksburg, VA, USA, in 2009, and the Mas- (NSF), U.S. Department of Homeland Security (DHS), U.S. National Security
ter of Science degree in computer science with Agency (NSA), U.S. Department of Energy, National Nuclear Security Ad-
a specialization in machine learning from the ministration (NNSA), National Institute of Health (NIH), U.S. Department of
Naval Postgraduate School, Monterey, CA, USA, Defense (DoD) and DoD Research Labs, Industry (Microsoft, Intel, VMware,
in 2013. PayPal, Mastercard, Meta, BAE, Raytheon, etc.), and private Foundations. He
He is a Staff Machine Learning Engineer with was the recipient of the U.S. NSF CAREER Award, the U.S. Department
VMware by Broadcom, Palo Alto, CA, USA. He is of Homeland Security (DHS) Scientific Leadership Award, the President’s
the Head of NLP Research, VMware AI Labs, Palo Medal of Achievement Award (2023) at Howard University, the Provost’s
Alto, CA, USA. His research interests include the Distinguished Service Award 2021, the U.S. Air Force Research Laboratory
areas of the application of large language models to real-world use cases and (AFRL) Summer Faculty Visiting Fellowship 2017, the Outstanding Research
information retrieval. Faculty Award (award for excellence in scholarly activity), and several
best paper awards. He has been an Editor/a Guest Editor for over 100
international journals, including an Associate Editor of IEEE TRANSACTIONS
ON INFORMATION FORENSICS AND SECURITY, an Associate Editor of IEEE

Danda B. Rawat (Senior Member, IEEE) received TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, an
the Ph.D. degree in electrical and computer engi- Associate Editor of IEEE TRANSACTIONS OF SERVICES COMPUTING, an Editor
neering from Old Dominion University, Norfolk, of IEEE INTERNET OF THINGS JOURNAL, an Editor of IEEE COMMUNICATIONS
VA, USA, in 2010. LETTERS, an Associate Editor of IEEE TRANSACTIONS ON NETWORK SCIENCE
AND ENGINEERING, and a Technical Editors of IEEE NETWORK. He has been
He is an Associate Dean of the Research and
Graduate Studies, a Full Professor with the De- in organizing committees for several IEEE flagship conferences, such as
partment of Electrical Engineering and Computer IEEE INFOCOM, IEEE CNS, IEEE ICC, and IEEE GLOBECOM. He served
Science (EECS), the Founding Director of the Data as a Technical Program Committee (TPC) Member for several international
Science & Cybersecurity Center, the Founding Di- conferences, including IEEE INFOCOM, IEEE GLOBECOM, IEEE CCNC,
rector of the Department of Defense (DoD) Center IEEE GreenCom, IEEE ICC, IEEE WCNC, and IEEE VTC conferences. He
of Excellence in Artificial Intelligence and Machine served as the Vice Chair of the Executive Committee of the IEEE Savannah
Learning (CoE-AIML), and the Director of Cyber-Security and Wireless Net- Section from 2013 to 2017. He is a Lifetime Professional Senior Member of
working Innovations (CWiNs) Research Lab, Howard University, Washington, ACM, a Lifetime Member of the Association for the Advancement of Artificial
DC, USA. He successfully led and established the Research Institute for Intelligence (AAAI), a Lifetime Member of SPIE, a Member of ASEE and
Tactical Autonomy (RITA), the 15th University Affiliated Research Center AAAS, and a Fellow of the Institution of Engineering and Technology (IET).
(UARC) of the U.S. Department of Defense as a Principal Investigator and the He is an ACM Distinguished Speaker and an IEEE Distinguished Lecturer
Founding Executive Director of Howard University. He is engaged in research (FNTC and VTS).
and teaching in the areas of cybersecurity, machine learning, big data analytics,
and wireless networking for emerging networked systems, including cyber–
physical systems (eHealth, energy, and transportation), Internet-of-Things,
multidomain operations, smart cities, software-defined systems, and vehicular
networks.

Authorized licensed use limited to: University of the Philippines - Open University. Downloaded on April 18,2025 at 04:21:52 UTC from IEEE Xplore. Restrictions apply.

You might also like