Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives
Abstract—The emergence of generative artificial intelligence (AI) and large language models (LLMs) has marked a new era of natural language processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This article explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our article contributes to providing a holistic perspective on the technical foundations, practical applications, and emerging challenges within the evolving landscape of generative AI and LLMs. We believe that understanding the generative capabilities of AI systems and the specific context of LLMs is crucial for researchers, practitioners, and policymakers to collaboratively shape the responsible and ethical integration of these technologies into various domains. Furthermore, we identify and address main research gaps, providing valuable insights to guide future research endeavors within the AI research community.

Impact Statement—Understanding the full potential and limitations of generative AI and LLMs shapes the future of NLP and its impact on various industries and societies. This article explores the transformative potential of advanced NLP tools such as generative AI and LLMs, shaping the future of communication and understanding across diverse domains. Our article not only addresses the current state of generative AI and LLMs in language understanding, machine translation, question answering, text summarization, and code completion but also makes a significant contribution in addressing some of the critical research gaps of generative AI and LLMs. By addressing issues of bias and fairness, interpretability, fine-tuning and adaptability, domain adaptation, data privacy and security, computational cost, deepfake generation, human-AI collaboration, long-term planning, limited context window, long-term memory, etc., our work aims to pave the way for responsible, ethical, and impactful integration of these transformative technologies across diverse domains. We believe that this research serves as a roadmap for the AI community, pushing toward an ethical, inclusive, and impactful future. It empowers diverse domains with transformative technologies, creating a robust landscape for the responsible evolution of AI.

Index Terms—Decoder, encoder, generative artificial intelligence (AI), large language models (LLMs), long-sequence language models, machine translation, natural language processing (NLP), transformers.

Acronym Definition
AI Artificial intelligence
ASR Automatic speech recognition
BERT Bidirectional encoder representations from transformers
CLIP Contrastive language-image pretraining
CNNs Convolutional neural networks
DCGANs Deep convolutional generative adversarial networks
DL Deep learning
DNNs Deep neural networks
DPO Direct policy optimization
DSM Denoising score matching
ELBO Evidence lower bound
FFN Positionwise feed-forward network
GANs Generative adversarial networks
GELU Gaussian error linear unit
GPT Generative pretrained transformer
GPUs Graphics processing units
HMMs Hidden Markov models
KL Kullback–Leibler
LLMs Large language models
LSTM Long short-term memory
ML Machine learning
MLM Masked language modeling
MoEs Mixture of experts
NCE Noise-contrastive estimation
NLG Natural language generation
NLP Natural language processing
NLU Natural language understanding
ReLU Rectified linear unit
RL Reinforcement learning
RLHF Reinforcement learning from human feedback
RNNs Recurrent neural networks
TPUs Tensor processing units
XAI Explainable artificial intelligence
VAEs Variational autoencoders
ViT Vision transformer
WGANs Wasserstein GANs

Manuscript received 30 March 2024; revised 5 July 2024; accepted 10 August 2024. Date of publication 19 August 2024; date of current version 10 December 2024. This work was supported by the U.S. DoD Center of Excellence in AI/ML at Howard University under Contract W911NF-20-2-0277 with the U.S. Army Research Laboratory (ARL). This article was recommended for publication by Associate Editor Sriparna Saha upon evaluation of the reviewers' comments. (Corresponding author: Desta Haileselassie Hagos.)

Desta Haileselassie Hagos and Danda B. Rawat are with the DoD Center of Excellence in Artificial Intelligence and Machine Learning (CoE-AIML), Department of Electrical Engineering and Computer Science, College of Engineering and Architecture (CEA), Howard University, Washington, DC 20059 USA (e-mail: [email protected]; [email protected]).

Rick Battle is with VMware AI Labs by Broadcom, Palo Alto, CA 94304 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TAI.2024.3444742
while simultaneously maximizing the log probability of the generator producing data that the discriminator perceives as real

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].    (2)

Within the adversarial setting, various classes of GANs have emerged over the years, each tailored to specific tasks in the generative modeling space. For example, Radford et al. [29] present deep convolutional GANs (DCGANs), an extension of the original GAN architecture proposed by Goodfellow et al. [28]. DCGANs employ CNNs in both the generator and discriminator, enabling the generation of high-quality images. CNNs are known to perform well at capturing spatial relationships in data [29], making them well suited for image generation tasks. Addressing the training instability issues of [28], Arjovsky et al. introduced the Wasserstein GANs (WGANs) algorithm [30]. WGANs replace the binary cross-entropy loss with the Wasserstein distance, leading to improved stability and convergence during training [30]. In the context of GANs, the Wasserstein distance defines the objective function between two distributions, denoted as A and B, as shown in (3). Here, W(A, B) represents the Wasserstein distance between distributions A and B, inf denotes the infimum (the greatest lower bound), Π(A, B) is the set of all joint distributions γ with marginals A and B, (x, y) represents a sample from the joint distribution γ, and d(x, y) denotes the distance between x and y in the metric space

W(A, B) = inf_{γ∈Π(A,B)} E_{(x,y)∼γ}[d(x, y)].    (3)

To tackle the challenges associated with training high-resolution GANs, Karras et al. [31] proposed a progressive growing of GANs. This algorithm employs a progressive training strategy that gradually increases the resolution of the generated images throughout the training process. This approach allows the algorithm to capture finer details and produce high-resolution images with enhanced stability and scalability [31].
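To make the objective in (2) concrete, the following minimal sketch, assuming PyTorch and toy fully connected networks that are not prescribed by the article, alternates a discriminator ascent step on log D(x) + log(1 − D(G(z))) with a generator update using the common non-saturating variant.

# Minimal sketch of the GAN objective in (2); network sizes, data, and
# hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def training_step(x_real):
    z = torch.randn(x_real.size(0), latent_dim)

    # Discriminator ascends V(D, G): log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_loss = -(torch.log(D(x_real) + 1e-8).mean()
               + torch.log(1 - D(G(z).detach()) + 1e-8).mean())
    d_loss.backward()
    opt_d.step()

    # Generator update (non-saturating variant: maximize log D(G(z))).
    opt_g.zero_grad()
    g_loss = -torch.log(D(G(z)) + 1e-8).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with synthetic "real" data drawn from a shifted Gaussian.
print(training_step(torch.randn(64, data_dim) + 3.0))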
B. Variational Autoencoder (VAE) Models

VAEs are generative models that learn a probabilistic mapping from the data space to a latent space, a lower dimensional representation of the data that captures its essential features, enabling the generation of new samples through sampling from the learned latent space [23]. This process involves two key components: encoders and decoders. In the VAE framework, encoders and decoders play important roles in the process of learning and generating data. The encoder is implemented using a neural network, and it is responsible for mapping the input data x to a probability distribution in the latent space z, as shown in (4). Similar to the encoder, the decoder is also implemented using a neural network, and it reconstructs the original data from this latent representation z, as illustrated in (5). The encoder and decoder are trained jointly using a technique called variational inference [23], [32]. Variational inference minimizes two losses: a reconstruction loss and a regularization loss. In (4), μ_φ(x) and σ_φ(x) represent the mean and standard deviation of the approximate posterior distribution, respectively. In (5), the parameters μ_θ(z) and σ_θ(z) represent the mean and standard deviation of the distribution over the reconstructed data, which are learned by the decoder neural network during training

q_φ(z | x) = N(μ_φ(x), σ_φ(x)^2)    (4)

p_θ(x | z) = N(μ_θ(z), σ_θ(z)^2).    (5)

The reparameterization trick, introduced in VAEs to facilitate backpropagation through the sampling process [23], addresses the challenge of applying backpropagation to inherently random sampling operations. While backpropagation is a fundamental algorithm for training neural networks, its direct application to sampling is problematic due to the randomness involved. The reparameterization trick provides an alternative approach to sample from a distribution while maintaining the necessary connections for backpropagation [23]. In VAEs, this technique is employed to sample the latent variable z from a simple distribution, typically a standard normal distribution. These samples are then transformed to match the distribution produced by the encoder, as described in (6). This transformation ensures that the sampled latent variables remain consistent with the encoder's understanding of the data while preserving the randomness required for generating new samples. In (6), ε represents a random noise vector sampled from a standard normal distribution, ⊙ represents the elementwise product operation, σ_φ(x) represents the standard deviation of the distribution produced by the encoder, and μ_φ(x) represents the mean of the distribution produced by the encoder

z = μ_φ(x) + σ_φ(x) ⊙ ε, where ε ∼ N(0, 1).    (6)

The main objective for training a VAE is to maximize the evidence lower bound (ELBO) [23], [33]. Maximizing the ELBO during training encourages the VAE to learn a meaningful and smooth latent space representation for the input data [23], [33]. By maximizing the ELBO, the VAE is trained to learn a latent space that captures the underlying structure of the data while also allowing for the efficient generation of new samples [23], [33]. The ELBO, as shown in (7), comprises two terms: the reconstruction term E[log p_θ(x | z)], which measures the expected log-likelihood of the data given the latent variable, and the Kullback–Leibler (KL) divergence between the approximate posterior (encoder) and the prior distribution, D_KL(q_φ(z | x) ‖ p(z)). The KL divergence encourages the latent distribution learned by the encoder to be similar to the prior distribution, which is typically a standard normal distribution. This constraint helps prevent the encoder from learning overly complex or entangled latent representations. In (7), L denotes the overall objective function

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x | z)] − D_KL(q_φ(z | x) ‖ p(z)).    (7)
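The reparameterization step in (6) and the ELBO in (7) can be written compactly as follows. This is a hedged sketch assuming PyTorch, a Bernoulli reconstruction likelihood, and illustrative layer sizes rather than any implementation described in the article.

# Sketch of a VAE forward pass with the reparameterization trick (6) and the
# negative ELBO (7); layer sizes and the likelihood choice are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(data_dim, 2 * latent_dim)  # outputs mu_phi(x) and log sigma_phi(x)^2
        self.dec = nn.Linear(latent_dim, data_dim)      # outputs reconstruction logits

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                      # epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps          # reparameterization, (6)
        recon = self.dec(z)
        # Reconstruction term of the ELBO (Bernoulli likelihood assumed here).
        rec = -F.binary_cross_entropy_with_logits(recon, x, reduction="sum")
        # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return -(rec - kl)                              # negative ELBO, minimized by the optimizer

vae = TinyVAE()
x = torch.rand(32, 784)                                 # toy data in [0, 1]
loss = vae(x)
loss.backward()
print(float(loss))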
C. Autoregressive Models

In the context of generative AI, autoregressive models are a class of likelihood models that generate new sequential data by predicting the next value in a sequence based on the previous values. These models involve modeling the probability distribution of each element in a sequence given the entire history of previous elements, P(x_t | x_{t−1}, x_{t−2}, ..., x_1). This ability makes autoregressive models well suited for a variety of NLP tasks where the ability to understand and generate coherent sequences is essential [34]. They are also widely used in capturing the dynamics of time series data [34]. An autoregressive model of order p can be generally represented as shown in (8), where X_t denotes the value of the time series at time t, c is a constant term, φ_i are the autoregressive coefficients, representing the influence of the ith previous observation on the current observation, and ε_t is an error term, which represents the random noise in the data. The parameters of the model (c, φ_i) are typically estimated from the observed data using methods such as maximum likelihood estimation

X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t.    (8)
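As a small illustration of (8), the sketch below, assuming NumPy and arbitrarily chosen coefficients, simulates an AR(2) series and recovers c and φ_i by least squares, which coincides with the conditional maximum-likelihood estimate under Gaussian noise.

# Simulate an AR(2) process and estimate (c, phi_1, phi_2); the coefficients,
# noise level, and series length are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
c_true, phi_true = 0.5, np.array([0.6, -0.3])
n, p = 2000, 2

x = np.zeros(n)
for t in range(p, n):
    # x[t-p:t][::-1] gives [X_{t-1}, ..., X_{t-p}], matching (8).
    x[t] = c_true + phi_true @ x[t - p:t][::-1] + rng.normal(scale=0.1)

# Design matrix: a constant column plus the p lagged values X_{t-1}, ..., X_{t-p}.
X = np.column_stack([np.ones(n - p)] + [x[p - i:n - i] for i in range(1, p + 1)])
y = x[p:]
c_hat, *phi_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(c_hat, phi_hat)   # should be close to 0.5 and [0.6, -0.3]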
D. Mixture of Experts (MoE) Models

An MoE model represents a neural network architecture that combines the strengths of specialized expert networks with a gating mechanism to perform complex tasks [35], [36]. In the context of NLP architectures, MoE models are applied to enhance the capabilities and efficiency of the underlying language generation architecture [35], [37]. Within the realm of MoE models, these architectures optimize resource utilization by selectively activating relevant experts for specific tasks, demonstrating adaptability to different domains through the integration of domain-specific expert models [38]. Moreover, MoE architectures offer scalability, allowing the addition of more expert networks to handle a broader range of tasks [39].

The advantages of MoE models extend beyond their architectural complexities. Recent studies, such as the work presented in [39], emphasize their scalability, enabling the addition of more expert networks to handle a broader range of tasks. Furthermore, these models have demonstrated the ability to achieve superior model quality compared to their dense counterparts, even with significantly reduced training costs. However, despite these advantages, MoE models pose some critical challenges. MoE models are sensitive to small changes in the gating network weights. Since the gating network determines the contribution of each expert to the final prediction, even slight changes in these weights can lead to significant shifts in the model's training stability and cause unpredictable behavior [35]. This sensitivity can make training and optimization of the model more challenging. To mitigate this, techniques such as sparse routing have been proposed [40], [41]. Regularization techniques such as weight decay and dropout also help mitigate sensitivity to small changes in gating network weights by preventing overfitting and promoting smoother decision boundaries [36]. Additionally, training MoE models can be computationally intensive, especially when dealing with a large number of experts or complex gating functions. Each forward pass through the network involves evaluating the outputs of multiple experts and updating the parameters of both the expert and gating networks. This computational overhead can make training slower and require more resources compared to simpler neural network architectures. Developing more efficient training algorithms specifically tailored for MoE models can help to reduce computational intensity. The overall MoE model architecture can be broken down into several key components as follows.

1) Expert Networks: One of the main features of the MoE model is the presence of multiple expert networks. These expert networks play a critical role in learning specific patterns or features within the input data and serve as the core models of the MoE system. Each expert network is tailored to specialize in a particular aspect or subset of the input problem space.

2) Gating Network: The gating network mechanism is a crucial component that analyzes the input data and decides which expert network is most suitable for a given instance [40]. It assigns weights to each expert, indicating their relevance or contribution to the current input. The gating network typically outputs a probability distribution over available experts, reflecting the relevance of each expert to the current input [40]. There are two main routing strategies in MoE systems: dense routing and sparse routing. In dense routing, every input is directed to all experts, and the final output is a weighted combination of all expert predictions based on the gating network's output. On the other hand, sparse routing is a more efficient approach where the gating network selects only a subset of experts for each input, reducing computational cost [35], [42]. The MoE model dynamically combines the predictions of multiple experts based on learned gating coefficients, allowing it to adaptively switch between different experts depending on the input data. This mechanism enables the model to capture complex patterns and improve performance compared to a single expert model. The gating network is generally represented as shown in (9), where g_k(x) denotes the gating function for gate k, σ is an activation function (usually sigmoid or softmax), and W_{g_k} represents the parameters of the gating network

g_k(x) = σ(W_{g_k}^T x).    (9)

3) Output Computation: When the experts are activated, they process input data and generate individual predictions. These predictions are then combined to form the final output of the MoE model. The specific method of combining predictions depends on the task and MoE architecture. In the weighted averaging approach, predictions from each expert are weighted based on the output of the gating network, and the weighted average is taken as the final output. In classification tasks, experts can vote for the most likely class, and the majority vote becomes the final prediction [43]. The output of an MoE model, denoted as y(x), is computed using (10), representing a weighted sum of the expert outputs. The final output y(x) is computed by aggregating the contributions of all experts. It sums up the weighted outputs of all experts based on the gating values, resulting in the MoE's prediction. This output is often passed through additional layers, such as fully connected layers and activation functions, depending on the specific task. Here, E_i(x) denotes the output of expert i, x represents an input to the model, and N is the number of experts [35]. Gating weights g_i(x), detailed in (11), are computed using a softmax function, with a_i(x) representing the activation for expert i given the input x. The gating network uses the input data to determine which expert is best suited for the task

y(x) = Σ_{i=1}^{N} g_i(x) · E_i(x)    (10)

g_i(x) = exp(a_i(x)) / Σ_{j=1}^{N} exp(a_j(x)), i = 1, 2, ..., N.    (11)
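A minimal sketch of the computation in (9)-(11) is given below, assuming PyTorch, linear experts, and an optional top-k sparse-routing variant; these choices are illustrative and not taken from the article.

# Softmax gating over N experts with a dense weighted-sum combination (10)-(11),
# plus a simple top-k sparse-routing option; sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_in=16, d_out=8, n_experts=4, top_k=None):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)   # produces activations a_i(x)
        self.top_k = top_k                       # None = dense routing

    def forward(self, x):
        gates = F.softmax(self.gate(x), dim=-1)                     # g_i(x), (11)
        if self.top_k is not None:                                  # sparse routing variant
            topv, topi = gates.topk(self.top_k, dim=-1)
            mask = torch.zeros_like(gates).scatter(-1, topi, topv)
            gates = mask / mask.sum(dim=-1, keepdim=True)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)    # E_i(x)
        return (outs * gates.unsqueeze(1)).sum(dim=-1)              # y(x), (10)

moe = TinyMoE(top_k=2)
print(moe(torch.randn(3, 16)).shape)   # torch.Size([3, 8])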
E. Model Merging

Model merging is a technique used to combine the parameters of multiple task-specific pretrained LLMs to create a new and improved language model [44]. Initially, this involves the process of selecting base models and aligning the architectures of the chosen models to ensure compatibility. Techniques such as parameter averaging [45] and knowledge distillation [46], [47] are then employed to integrate the knowledge from these models. Additionally, various algorithms, including task vector arithmetic [48], TIES [44], and DARE [49], can be used for parameter merging, each with its own advantages and considerations, such as computational complexity and the ability to handle models trained on different tasks. Following integration, the merged model undergoes fine-tuning on task-specific data to refine its representations and potentially optimize overall performance. The resulting merged model retains the knowledge and capabilities of its constituent models, leading to enhanced performance and capabilities across tasks compared to traditional methods of training a single model from scratch, as well as improved robustness and resource efficiency [50]. However, challenges such as ensuring compatibility between models, managing computational complexity, and avoiding performance degradation must be addressed [50], [51].
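The simplest of the merging techniques named above, uniform parameter averaging, can be sketched as follows, assuming PyTorch models that share an identical architecture; the toy models and uniform weights are illustrative assumptions.

# Element-wise weighted averaging of compatible state dicts.
import torch
import torch.nn as nn

def average_state_dicts(state_dicts, weights=None):
    """Return a weighted average of state dicts with identical keys and shapes."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Two "task-specific" toy models with the same architecture.
model_a = nn.Linear(4, 2)
model_b = nn.Linear(4, 2)
merged_model = nn.Linear(4, 2)
merged_model.load_state_dict(average_state_dicts([model_a.state_dict(),
                                                   model_b.state_dict()]))
print(merged_model.weight)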
F. Diffusion Models

Diffusion models are specifically designed for generating images and data samples [52]. These models are trained to generate realistic samples by modeling the diffusion process of a data distribution. Different approaches such as noise-contrastive estimation (NCE) [53] and score-based generative modeling [54] exist within the domain of diffusion models in generative AI. They operate by iteratively adding noise to a given initial image and subsequently learning to reverse this process to generate new, realistic, and high-quality images of varying styles and complexities [55], [56]. As shown in (12), the general idea is to model the data distribution as a diffusion process, where the data is transformed from a simple distribution to the target distribution through a series of steps. Here, x_t represents the data at time step t, f denotes a diffusion process that transforms the data from x_{t−1} to x_t, θ_t represents the parameters of the diffusion process at time step t, and ε_t represents a sample from a noise distribution at time step t. This approach has led to the development of generative models such as denoising score matching (DSM) and diffusion probabilistic models. The underlying idea is to transform a simple distribution through a series of steps to match the target distribution of real data. The generative process involves reversing these steps to generate new samples. Diffusion-based generative models, such as DALL-E 2 [57], [58], Imagen [59], and stable diffusion [60], are a class of probabilistic models that describe the evolution of an image from a simple initial distribution to the desired complex distribution [61]

x_t = f(x_{t−1}, θ_t, ε_t).    (12)

1) Stable Diffusion: Text-to-image generation involves creating visual content based on textual descriptions [62]. Stable diffusion is an open-source text-to-image diffusion model that generates diverse and high-quality images based on textual prompts (https://stablediffusionweb.com/). This model operates by taking a noisy image as input and gradually denoising it to generate the desired output. The denoising process is guided by a text prompt, providing information about the desired content and style of the image.

2) Midjourney: Midjourney is a text-to-image diffusion model that, like stable diffusion [60], leverages prompts to generate unique and artistic images [63]. However, it is a closed-source generative AI project requiring a paid subscription. This setup consequently may discourage community collaboration and development, leaving some users with less control over the underlying model compared to open-source alternatives such as stable diffusion [60].
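One common instantiation of the step x_t = f(x_{t−1}, θ_t, ε_t) in (12) is the DDPM-style forward (noising) process sketched below; the linear β schedule, number of steps, and tensor shapes are assumptions for illustration.

# Forward (noising) diffusion: each step mixes the previous sample with Gaussian
# noise, so a clean input is gradually driven toward a standard normal.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # theta_t here is just the schedule value beta_t

def forward_diffusion(x0):
    x, trajectory = x0, [x0]
    for t in range(T):
        eps = torch.randn_like(x)              # eps_t ~ N(0, I)
        x = torch.sqrt(1.0 - betas[t]) * x + torch.sqrt(betas[t]) * eps
        trajectory.append(x)
    return trajectory

x0 = 2.0 + 0.1 * torch.randn(1, 3, 32, 32)     # stand-in for a clean image
traj = forward_diffusion(x0)
print(traj[0].mean().item(), traj[-1].mean().item(), traj[-1].std().item())
# the mean shrinks toward 0 and the standard deviation approaches 1

A denoising network would then be trained to reverse these steps, which is the generative direction described above.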
G. Multimodal Generative Models

Multimodal generative models represent a significant advancement in AI. These models possess the capability to understand and create content by leveraging various data types, such as text, images, and audio [64], [65]. This integration of different data modalities enables these models to capture a more comprehensive understanding of concepts [66]. By utilizing information from these diverse sources, multimodal generative models aim to overcome the limitations inherent in traditional models that focus solely on a single data type [65]. Unimodal methods, traditional approaches that primarily focus on a single modality, such as text or images, have limitations in capturing the full complexity of real-world data [65]. For example, text-based models may lack the ability to incorporate visual or emotional context into their understanding, while image-based models might lack textual or semantic understanding [65]. Multimodal generative models address these limitations by integrating information from different modalities, such as text, images, and audio. This allows them to achieve a better understanding of the data and subsequently generate content that reflects the richness of human expression and experience. However, training multimodal models comes with its own set of challenges. These models can be computationally expensive to train and require large amounts of labeled data for each modality [65]. Additionally, finding effective techniques to seamlessly integrate information from different modalities remains an active area of research [67]. There are two main architectures used for multimodal learning: early fusion and late fusion [68]. Early fusion combines data from all modalities at the beginning of the model, while late fusion processes each modality independently before combining the results. The ability of multimodal generative models to understand and create content across different data types makes them invaluable for a wide range of tasks requiring a deep understanding of multimodal data [69]. Some real-world applications include generating realistic product descriptions with images for e-commerce platforms or creating personalized music recommendations based on a user's audio preferences and listening history. In addition to this, these models have demonstrated remarkable capabilities in various tasks, including medical imaging analysis, image captioning, text-to-image synthesis, video understanding, and audio-visual storytelling [69]. By overcoming the limitations of unimodal models and offering new possibilities for creative content generation, multimodal generative models will play a significant role in the future of AI.

H. Applications of Generative AI

Generative AI models are powerful tools for understanding and generating data with applications in various domains, including the following.

1) Image Generation and Analysis: Advanced generative AI models have demonstrated remarkable capabilities in generating high-quality images, such as photorealistic faces and scenes [21]. Generative AI models have been employed in developing complex systems capable of generating and understanding multimodal data such as text and images. For example, the work in [70] proposes a large-scale autoregressive model that generates high-quality and content-rich images from text descriptions. Additionally, DALL-E is a generative model introduced by Ramesh et al. [57], [58], which produces images from textual descriptions. Unlike traditional image generation models that rely on pixel-level manipulations or predefined templates, DALL-E operates at a semantic level, understanding textual prompts and synthesizing corresponding images. The work in [71] introduces a novel architecture specifically designed for generating high-quality facial images. This architecture utilizes a style-based generator, demonstrating advancements in synthesizing diverse and realistic images. Furthermore, generative AI models can also be employed in image-to-image translation [72], which involves converting images from one domain to another, such as enabling the conversion of satellite images into maps or black-and-white photos into color. The work by Zhu et al. [73] presents a model designed for unpaired image-to-image translation. This model utilizes cycle-consistent adversarial networks to learn mappings between two image domains without requiring paired training examples, making it versatile for various applications [73]. Unlike DALL-E [58], which primarily focuses on generating images, contrastive language-image pretraining (CLIP) learns to understand the relationships between text and images in a paired manner [69]. Through contrastive learning, CLIP pretrains on vast amounts of image-text pairs, enabling it to encode both modalities into a shared embedding space [69]. CLIP's cross-modal understanding enables a wide range of applications beyond traditional image analysis tasks. By associating images with their textual descriptions, CLIP can perform tasks such as image classification, object detection, and even zero-shot learning, where it recognizes objects or concepts not seen during training [69]. CLIP is built upon a dual-encoder architecture, featuring separate encoders for processing images and text. This architectural design allows CLIP to independently encode visual and textual inputs into distinct feature spaces, facilitating effective cross-modal understanding. For image processing, CLIP often employs CNNs or a vision transformer (ViT) to extract visual features [74]. The image encoder within CLIP processes visual inputs, such as images, using CNNs. Through pretraining on large-scale image datasets, the image encoder learns to extract hierarchical visual features that capture important characteristics of the input images. These features are then encoded into a high-dimensional representation space. On the other hand, the text encoder in CLIP processes textual inputs, such as captions and descriptions, using transformer architectures [18], [20]. Transformers are capable of modeling sequential data such as text, allowing the text encoder to capture semantic information and contextual relationships within textual inputs. Through pretraining on large-scale text corpora, the text encoder learns to encode textual inputs into a corresponding feature space. Despite having separate encoders for images and text, CLIP achieves cross-modal understanding by mapping both image and text embeddings into a shared embedding space. This shared space facilitates direct comparisons between visual and textual representations, enabling CLIP to determine the semantic similarity between them [69]. During pretraining, CLIP leverages contrastive learning objectives to align similar pairs of image-text embeddings while maximizing the distance between dissimilar pairs, thereby enhancing its ability to understand and relate visual and textual inputs effectively [69].
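The contrastive alignment described above can be sketched as follows, with plain linear projections standing in for CLIP's image and text encoders and a symmetric cross-entropy loss over the pairwise similarity matrix; the temperature and dimensions are assumed values.

# Shared-embedding contrastive objective: matching image-text pairs sit on the
# diagonal of the similarity matrix and are pulled together.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_img, d_txt, d_shared, batch = 512, 384, 128, 8
img_proj = nn.Linear(d_img, d_shared)
txt_proj = nn.Linear(d_txt, d_shared)
temperature = 0.07                                   # assumed value, a common choice

img_feats = torch.randn(batch, d_img)                # stand-ins for image-encoder outputs
txt_feats = torch.randn(batch, d_txt)                # stand-ins for text-encoder outputs

img_emb = F.normalize(img_proj(img_feats), dim=-1)
txt_emb = F.normalize(txt_proj(txt_feats), dim=-1)

logits = img_emb @ txt_emb.t() / temperature         # pairwise cosine similarities
targets = torch.arange(batch)                        # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(float(loss))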
2) Video Generation: Advanced generative AI models have not only demonstrated remarkable capabilities in generating high-quality images but have also begun to tackle the challenge of video generation. Recent advancements in AI, such as Sora developed by OpenAI [75], [76], have enabled the generation of realistic and dynamic video content from textual descriptions. Similar to its image counterpart DALL-E [57], Sora operates at a semantic level, understanding textual prompts and synthesizing corresponding video sequences [75], [76]. Video generation involves creating coherent and visually appealing sequences of frames that align with the provided textual instructions [76]. These models typically employ architectures designed to capture temporal dependencies (i.e., relationships between frames over time) and spatial relationships (i.e., relationships between objects within a single frame). By understanding the semantic context of the text, these models generate videos that accurately reflect described scenes while exhibiting smooth transitions and realistic motion. In addition to video generation, as explained above, AI models are capable of multimodal generation, where textual prompts can result in the synthesis of both images and videos. This capability enhances the quality of generated content, enabling diverse applications in storytelling, content creation, and multimedia production. Video generation has the potential to revolutionize various domains, including the entertainment industry, education and training, augmented reality and virtual reality applications, automation of video editing tasks, etc.
3) Text Generation: Advances in generative AI models can generate human-quality text, including translations and responses to natural language questions [4], [77]. Text generation models learn patterns and relationships in language from vast amounts of text data [4], [77].

4) Code Generation: Widely adopted AI tools utilize generative AI techniques to analyze the context of the code being written and suggest relevant code completions that can significantly improve programmers' and engineers' productivity by reducing the time spent manually typing code [6].

5) Drug Discovery: Generative AI models are increasingly being utilized in various aspects of drug discovery, providing innovative approaches to developing new drugs and accelerating the identification and design of novel therapeutic agents [78], [79]. Furthermore, generative AI models have demonstrated the capability to identify new applications for drug repurposing [80].

6) Material Discovery: Advanced ML and DL techniques, particularly generative models, are being employed to explore and predict novel materials with desirable properties [81]. The application of generative AI models in material science [82] can significantly accelerate the material discovery process by guiding experimental efforts, predicting new materials, and optimizing existing materials [83].

7) Fraud Detection: Generative AI models have proven effective in detecting fraud by identifying patterns indicative of fraudulent activity [84]. Furthermore, these models can also be employed in identifying anomalies in data [85], [86].

8) Personalization: Generative AI models can be used in personalization to tailor content, recommendations, or user experiences based on individual preferences [87], [88]. This customization can involve generating personalized recommendations or creating personalized user experiences. For example, Netflix uses generative AI to recommend movies and TV shows to its users, while Spotify leverages generative AI to create custom playlists.

III. LANGUAGE MODELING

The use of language models is pervasive in various modern NLP applications. In these models, the probability of different sequences of words is often modeled as the product of local probabilities, as expressed in (13), where w_i represents the ith word in the sequence, and h_i represents the word history preceding w_i. The formulation in (13) summarizes the conditional dependencies between words in a sequence, allowing language models to capture complex linguistic patterns. Leveraging such models has proven instrumental in tasks ranging from machine translation and speech recognition to text generation and sentiment analysis [1], [2]

P(w_1, w_2, ..., w_n) = Π_{i=1}^{n} P(w_i | h_i).    (13)

The following are some of the main traditional and modern approaches to language modeling.

A. Statistical Language Models

Statistical language models assign a probability to a word based on the probability of the words that came before it [89]. These models are trained on large corpora of text, and they use statistical methods to learn the probabilities of different sequences of words. Such models, including n-gram models and models based on maximum entropy, often use conditional probability to estimate the likelihood of a word given its context [90], [91]. Equation (14) is derived from the maximum likelihood estimation, where the probability of a word given its context is estimated by the ratio of the count of the specific context-word pair to the count of the context alone. In (14), P(w_n | w_{n−1}) denotes the conditional probability of the word w_n given the preceding word w_{n−1}, C(w_{n−1}, w_n) represents the count of occurrences of the bigram (w_{n−1}, w_n) in the training data, and C(w_{n−1}) represents the count of occurrences of the word w_{n−1} in the training data. For higher order n-gram models, the equation is extended to consider a longer history of words as shown in (15)

P(w_n | w_{n−1}) = C(w_{n−1}, w_n) / C(w_{n−1})    (14)

P(w_n | w_{n−1}, w_{n−2}, ..., w_1) = C(w_{n−1}, w_{n−2}, ..., w_1, w_n) / C(w_{n−1}, w_{n−2}, ..., w_1).    (15)
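The bigram estimate in (14) amounts to dividing pair counts by context counts, as the toy example below illustrates.

# Bigram maximum-likelihood estimate on a tiny corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
unigrams = Counter(corpus[:-1])                       # counts C(w_{n-1})
bigrams = Counter(zip(corpus[:-1], corpus[1:]))       # counts C(w_{n-1}, w_n)

def p_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

print(p_bigram("the", "cat"))   # 2/3: "the" is followed by "cat" twice out of three times
print(p_bigram("cat", "sat"))   # 1/2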
B. Neural Network Language Models

Neural network language models, particularly those based on RNNs or transformer architectures, model the probability of a word given its context using a neural network. Actual neural network language models can have variations based on the specific architecture used (e.g., recurrent and transformer based). However, the simplified representation of such models can be broken down into the hidden state calculation and softmax calculation, as shown in (16) and (17), respectively. Equation (16) shows the hidden state calculation, where h_{n−1} denotes the hidden state of the neural network at time step n − 1, W_h denotes the weight matrix for the hidden state transition, U_h is the weight matrix for the word embedding transition, E_{n−2} denotes the embedding vector of the word w_{n−2}, and tanh is the hyperbolic tangent activation function. Equation (17) shows the softmax output calculation, which computes the conditional probability distribution over the vocabulary for the next word w_n, where P(w_n | w_{n−1}, w_{n−2}, ..., w_1) denotes the conditional probability of the word w_n given the history w_{n−1}, w_{n−2}, ..., w_1, W_o is the weight matrix for the output layer, h_{n−1} is the hidden state of the neural network at time step n − 1, and softmax is the softmax function, converting the network's output into probabilities

h_{n−1} = tanh(W_h · h_{n−2} + U_h · E_{n−2})    (16)

P(w_n | w_{n−1}, w_{n−2}, ..., w_1) = softmax(W_o · tanh(h_{n−1})).    (17)
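A hedged sketch of the simplified recurrent formulation in (16) and (17) follows, assuming PyTorch and illustrative dimensions; a real neural language model would add training, batching, and a more capable architecture.

# Recurrent hidden-state update followed by a softmax over the vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_emb, d_hidden = 1000, 32, 64
E = nn.Embedding(vocab_size, d_emb)                    # word embeddings
W_h = nn.Linear(d_hidden, d_hidden, bias=False)
U_h = nn.Linear(d_emb, d_hidden, bias=False)
W_o = nn.Linear(d_hidden, vocab_size, bias=False)

def next_word_distribution(word_ids):
    """Run (16) over a history of word ids and apply (17) to the last state."""
    h = torch.zeros(1, d_hidden)
    for w in word_ids:
        emb = E(torch.tensor([w]))
        h = torch.tanh(W_h(h) + U_h(emb))              # hidden-state update, (16)
    return F.softmax(W_o(torch.tanh(h)), dim=-1)       # P(w_n | history), (17)

probs = next_word_distribution([4, 17, 256])
print(probs.shape, float(probs.sum()))                 # torch.Size([1, 1000]) 1.0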
parts of the input sequence when making predictions [4], [18], [20]. Such models leverage pretraining to achieve strong performance across various NLP tasks. According to [18], the transformer architecture offers several advantages over traditional recurrent or convolutional networks. It enables significantly more parallelization for faster training, achieves state-of-the-art results in machine translation with shorter training times, reduces the complexity of relating distant input positions, and effectively models long-range dependencies while handling variable-length sequences [18]. The transformer model achieves state-of-the-art results in machine translation by employing attention mechanisms, enabling it to capture long-range dependencies and process variable-length sequences without padding or truncation [18]. Moreover, it simplifies the computation of relationships between distant positions, leading to enhanced parallelization, faster training, and superior performance compared to traditional neural networks.

1) Self-Attention Mechanism: The transformer architecture revolutionized sequence modeling by introducing a self-attention mechanism, eliminating the need for recurrent or convolutional structures. The self-attention mechanism essentially computes a weighted sum of input representations, where each position in the input sequence is allowed to attend to all other positions with different weights. This mechanism allows the model to capture long-range dependencies between distant words in a sentence, which is important for tasks such as machine translation and text summarization. Given an input sequence X = {x_1, x_2, ..., x_n}, the self-attention mechanism computes the output vectors Y = {y_1, y_2, ..., y_n}. As shown in (18), the attention mechanism computes a set of attention scores, which are then used to calculate a weighted sum of the input vectors. Here, Q_i, K_j, and V_j are the query, key, and value vectors for the ith output element and jth input element, respectively, and d_k is the dimension of the key vectors [18]. The attention score a_{ij} for the ith element in the output sequence and the jth element in the input sequence is computed as shown in (19). Here, e_{ij}, computed as the scaled dot product Q_i · K_j / √d_k, is the attention energy or compatibility function between the ith element in the output sequence and the jth element in the input sequence. Once the attention scores are computed, the weighted sum of the input vectors is calculated to obtain the context vector for each output element, as shown in (20), where V_j is the value vector for the jth input element

y_i = Σ_{j=1}^{n} [exp(e_{ij}) / Σ_{k=1}^{n} exp(e_{ik})] · V_j, where e_{ij} = (Q_i · K_j) / √d_k    (18)

a_{ij} = exp(e_{ij}) / Σ_{k=1}^{n} exp(e_{ik})    (19)

c_i = Σ_{j=1}^{n} a_{ij} · V_j.    (20)
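The scaled dot-product attention of (18)-(20) reduces to a few tensor operations, as sketched below with random projection matrices standing in for learned parameters.

# Scaled dot-product self-attention over a toy sequence.
import torch
import torch.nn.functional as F

n, d_model, d_k = 5, 16, 16
X = torch.randn(n, d_model)                 # input sequence x_1..x_n
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
energies = Q @ K.t() / d_k ** 0.5           # e_ij = (Q_i . K_j) / sqrt(d_k), (18)
attn = F.softmax(energies, dim=-1)          # a_ij, (19)
context = attn @ V                          # c_i = sum_j a_ij V_j, (20)
print(attn.shape, context.shape)            # torch.Size([5, 5]) torch.Size([5, 16])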
2) Multihead Self-Attention: The multihead self-attention mechanism is a variant of the self-attention mechanism that introduces multiple attention heads to capture different aspects of the relationships in the input sequence [18]. Instead of performing a single attention function with d_model-dimensional keys, values, and queries, the transformer model uses multiple self-attention heads in parallel to capture different aspects of the relationships within the input sequence. This allows the model to learn more complex representations of the input, which can improve performance on a variety of NLP tasks. As shown in (21), the outputs of these heads are concatenated and linearly transformed [18], where the transformations are parameter matrices W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h d_v × d_model}. Here, W_i^Q, W_i^K, W_i^V, and W^O are learned weight matrices. This allows the model to learn a wider range of relationships between words in the input sequence

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = SelfAttention(Q W_i^Q, K W_i^K, V W_i^V).    (21)

3) Position-Wise Feed-Forward Network (FFN): The FFN is an important component of the transformer architecture. It is responsible for further processing the information produced by the self-attention mechanism at every position in the input sequence [18]. The FFN consists of two fully connected linear transformations with a rectified linear unit (ReLU) activation function in between them. This structure allows the FFN to learn complex nonlinear relationships between the input features [18]. The FFN is applied independently and identically to each position in the input sequence, while interactions between positions are handled by the self-attention layers [18]. This parallelized approach makes the FFN computationally efficient and scalable to long input sequences. The output of the self-attention mechanism is then passed through an FFN as shown in (22), where W_1 and W_2 are learned weight matrices, while b_1 and b_2 are learned bias vectors. As shown in (23), other works have proposed replacing the ReLU activation function with other nonlinear activation functions such as the Gaussian error linear unit, GELU(x) = xΦ(x) [92], where Φ(x) is the standard Gaussian cumulative distribution function, and Swish_β(x) = xσ(βx) [93]

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2    (22)

FFN_GELU(x, W_1, W_2) = GELU(x W_1) W_2,  FFN_Swish(x, W_1, W_2) = Swish_1(x W_1) W_2.    (23)
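The position-wise FFN in (22) and its GELU variant in (23) can be sketched as follows; the hidden width of 4 × d_model follows common practice and is an assumption rather than something specified here.

# Two-layer position-wise feed-forward network applied to every row (position).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, seq_len = 64, 256, 10
W1, b1 = nn.Parameter(torch.randn(d_model, d_ff) * 0.02), nn.Parameter(torch.zeros(d_ff))
W2, b2 = nn.Parameter(torch.randn(d_ff, d_model) * 0.02), nn.Parameter(torch.zeros(d_model))

def ffn_relu(x):
    return torch.clamp(x @ W1 + b1, min=0) @ W2 + b2    # max(0, xW1 + b1)W2 + b2, (22)

def ffn_gelu(x):
    return F.gelu(x @ W1) @ W2                           # GELU(xW1)W2, (23), bias-free variant

x = torch.randn(seq_len, d_model)                        # one position per row
print(ffn_relu(x).shape, ffn_gelu(x).shape)              # both torch.Size([10, 64])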
In the context of language models, the transformer architecture facilitates the training of LLMs, such as GPT [28]. LLMs are a type of generative AI model that is specifically trained on large corpora of text data. In recent years, LLMs have emerged as transformative breakthroughs in the field of AI, natural language generation (NLG), and natural language understanding (NLU) [106] due to their remarkable capabilities in understanding and generating humanlike text and other forms of content [21]. LLMs are trained on massive datasets comprising text and code, and they exhibit the ability to learn and perform a wide range of language tasks, including text generation, language translation [107], text summarization [99], sentiment analysis [108], and question answering [109]. These models are more powerful and versatile than traditional language models. LLMs have revolutionized the way we interact with and leverage natural language data, and they are now used in a wide variety of applications, including chatbots [88], machine translation systems [1], [7], and search engines. These models have experienced significant growth in terms of scale, complexity, and performance. Recently, several LLMs have been introduced, with some of the largest dense language models having scaled to billions of parameters [4], [26], [95], [96], [97]. These powerful models demonstrate the capability to perform a wide range of innovative NLP tasks, including machine translation, text summarization, question answering, and code completion. To provide a comprehensive comparison of some well-known state-of-the-art LLMs, we have presented a list in Table I. Fig. 1 shows a trend of some of the LLMs and their corresponding number of parameters (model sizes). Some of the well-known state-of-the-art LLMs include GPT [28], T5 [19], Gopher [95], LaMDA [102], etc. These models have demonstrated the power of pretrained, massive neural networks for NLP tasks. For example, GPT can be used to generate realistic and coherent text, while BERT can be used to extract complex meaning from text. Table II shows a performance comparison of some of the state-of-the-art LLMs well suited for a wide range of NLP tasks, as reported on PapersWithCode (https://paperswithcode.com/).

TABLE I. List of some of the state-of-the-art LLMs well suited for a wide range of NLP tasks.

D. Architecture of Transformer Models

Transformer architectures have revolutionized NLP tasks, such as sequence modeling, by effectively capturing long-range dependencies and modeling relationships between words. The advantages of the transformer architecture include enhanced parallelization, faster training, and the ability to model long-range dependencies efficiently. The attention mechanism allows
to various downstream applications. One well-known example of the decoder-only architecture is the GPT [4]. GPT employs a stack of transformer decoder layers for autoregressive sequence generation [4].

E. Pretraining Strategies in Transformer Language Models

One of the key factors behind the success of transformer-based language models is their pretraining on massive amounts of text data using self-supervised learning techniques [18]. This pretraining stage equips the models with a robust understanding of language structure and semantics, enabling exceptional performance on various downstream NLP tasks [20], [21]. Transformer language models, leveraging pretraining, have demonstrated outstanding performance across diverse NLP tasks. In machine translation, the transformer's attention mechanism allows it to capture long-range dependencies, yielding state-of-the-art results without the need for excessive padding or truncation [18]. Beyond translation, decoder-only architectures such as GPT have proven effective in tasks such as sentiment analysis, named entity recognition, and text completion [21].

1) Self-Supervised Learning for Pretraining: Unlike traditional supervised learning methods that demand extensive labeled data, self-supervised learning leverages the unlabeled nature of textual data. Common pretraining objectives in transformers involve tasks such as predicting the next word in a sequence, or reconstructing a sentence where certain words are replaced with special tokens (masked tokens), known as masked language modeling (MLM) [20], [111]. By tackling these tasks, the model learns contextual relationships between words and develops a strong understanding of grammatical structures. The pretraining phase serves as a critical foundation for downstream NLP tasks. The learned representations from vast amounts of text data can be fine-tuned for specific tasks such as sentiment analysis, question answering, and machine translation. This approach requires significantly less labeled data compared to training a model from scratch [112]. Consequently, self-supervised learning not only improves the efficiency of NLP model training but also enables models to perform effectively on tasks where obtaining large amounts of labeled data might be challenging.
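The masked-token objective described above can be illustrated with a small masking routine; the 15% masking rate, [MASK] id, and vocabulary size are assumptions.

# Randomly replace a fraction of token ids with a [MASK] id; labels are kept
# only at masked positions (PyTorch's default ignore_index of -100 elsewhere).
import torch

VOCAB_SIZE, MASK_ID, MASK_PROB = 30000, 103, 0.15

def mask_tokens(input_ids):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < MASK_PROB      # choose positions to mask
    labels[~mask] = -100                                 # ignore unmasked positions in the loss
    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID                            # replace chosen tokens with [MASK]
    return corrupted, labels

tokens = torch.randint(0, VOCAB_SIZE, (2, 12))           # toy batch of token ids
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)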
F. Long-Sequence Language Models

Long-sequence language models are neural network architectures specifically designed to effectively handle long textual input sequences by leveraging the transformer architecture [113]. While various architectures can handle longer sequences, transformers are dominant due to their self-attention mechanisms, enabling parallel processing and capturing long-range dependencies, overcoming the sequential limitations of RNNs. This, unlike traditional language models, enables long-sequence language models to efficiently capture long-range dependencies and relationships between words [113]. Several long-sequence language models address the limitations of standard transformers by introducing modifications and additional features to their architectures.

1) Transformer-XL: Transformer-XL is an extension of the standard transformer model designed to overcome the limitations of fixed-length contexts in traditional models [114]. It addresses the inherent limitation of the standard transformer model, which employs a fixed-length context window, by employing two advanced mechanisms. These mechanisms enable the model to learn dependencies beyond a fixed length in language modeling and retain information from previous segments of the input sequence, thus enhancing its ability to process longer sequences more effectively [114]. The first mechanism, segment-level recurrence, allows the model to reuse hidden states from previous segments by propagating them through recurrent connections. This enables information flow across segments, facilitating the retention of context from previous segments and extending the context beyond a fixed length. Incorporating recurrence at the segment level empowers Transformer-XL to capture longer term dependencies in the data [114]. In addition to the segment-level recurrence mechanism, Transformer-XL employs a novel relative positional encoding scheme [114]. This encoding scheme is crucial for enabling state reuse without causing temporal confusion, thereby allowing the model to effectively capture dependencies across longer sequences. By utilizing relative positional encodings instead of absolute ones, Transformer-XL ensures that information can be propagated across longer sequences without sacrificing temporal coherence. This encoding scheme plays a vital role in allowing the model to learn dependencies that extend beyond the fixed context length [114]. Furthermore, Transformer-XL incorporates a state reuse mechanism by caching a sequence of hidden states from previous segments, which can be reused during evaluation. As demonstrated in [114], this state reuse mechanism significantly accelerates evaluation and enables the model to maintain context from earlier segments, contributing to its ability to capture long-term dependencies in sequences.
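The segment-level recurrence idea can be sketched roughly as follows, caching hidden states from earlier segments and prepending them without gradient to the current segment; a generic PyTorch encoder layer stands in for the actual Transformer-XL layer, and relative positional encodings are omitted for brevity.

# Process a long input segment by segment while reusing a bounded memory of
# previous hidden states, so attention can look beyond the current segment.
import torch
import torch.nn as nn

d_model, mem_len = 64, 32
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

def step(segment, memory):
    # Prepend cached states from previous segments (stop-gradient, as in [114]).
    context = segment if memory is None else torch.cat([memory.detach(), segment], dim=1)
    hidden = layer(context)[:, -segment.size(1):]        # keep outputs for the current segment
    new_memory = context[:, -mem_len:]                   # cache the most recent states
    return hidden, new_memory

memory = None
for _ in range(3):
    segment = torch.randn(2, 16, d_model)
    hidden, memory = step(segment, memory)
print(hidden.shape, memory.shape)                        # torch.Size([2, 16, 64]) torch.Size([2, 32, 64])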
1) Language Understanding: In the context of NLU, LLMs features into text but lack the broader language understanding
are employed to extract meaning from human language. LLMs capabilities of LLMs. While traditional ASR systems often
are being used for a variety of NLU [106] and other language- rely on specialized architectures, the use of LLMs, particularly
related tasks, including sentiment analysis and named entity transformer-based models, has gained attention for end-to-end
recognition [4]. These models can analyze and comprehend the speech recognition [18]. LLMs can analyze the output of ASR
context of a given text, making them valuable for a wide range models and suggest corrections based on their understanding of
of applications. language and context, improving the accuracy of transcriptions,
2) Machine Translation: In the context of machine transla- especially in noisy environments or with unclear pronunciations
tion, LLMs are used to automatically translate text between dif- [125]. Additionally, LLMs can be leveraged to provide context
ferent languages [119]. For example, Google Translate utilizes to the speech recognition process. By considering surrounding
LLMs to seamlessly translate text, documents, and websites text or information about the speaker and situation, LLMs can
from one language into another. This capability demonstrates assist ASR models in making better decisions about what is
the practical utility of LLMs in bridging language barriers and being said.
3) Question Answering: LLMs are effectively employed in question-answering tasks across a variety of topics, enabling them to provide relevant answers to user queries [3], [4]. This capability has applications in virtual assistants, information retrieval systems, and educational platforms. For example, the AI assistant from Google can answer questions about a variety of topics, such as current events, history, and science.

4) Chatbots: The NLP capabilities of LLMs contribute significantly to the development of intelligent chatbots [99]. This adaptability enhances the overall user experience, making interactions with virtual assistants more intuitive and effective. LLMs are widely employed in creating chatbots for customer support and other interactive applications, enabling these intelligent virtual assistants to engage with humans, answer queries, and help in a natural and informative way [26]. The ability of LLMs to understand and respond to natural language has opened up new possibilities in customer service, education, entertainment, and healthcare [120]. For example, companies such as Facebook and Microsoft have successfully integrated LLMs into their chatbot systems, such as Facebook's Messenger platform and Microsoft's Azure Bot Service. These platforms utilize the power of LLMs to provide users with personalized and context-aware responses, demonstrating the practical applications of these models in real-world interactive environments.

5) Speech Recognition: Older speech recognition systems often relied on RNNs or hybrid models combining hidden Markov models (HMMs) with DNNs [121], [122]. However, these approaches faced limitations. RNNs process input sequences one element at a time, leading to slow processing and difficulties handling long-range dependencies in audio signals [123]. Additionally, hybrid models were complex and required careful integration of separate components. To address these limitations, researchers have explored and applied LLMs to speech recognition tasks, yielding promising results [124]. The core technology for speech recognition remains automatic speech recognition (ASR) models specifically trained on vast amounts of speech data. These models excel at converting audio features into text but lack the broader language understanding capabilities of LLMs. While traditional ASR systems often rely on specialized architectures, the use of LLMs, particularly transformer-based models, has gained attention for end-to-end speech recognition [18]. LLMs can analyze the output of ASR models and suggest corrections based on their understanding of language and context, improving the accuracy of transcriptions, especially in noisy environments or with unclear pronunciations [125]. Additionally, LLMs can be leveraged to provide context to the speech recognition process. By considering surrounding text or information about the speaker and the situation, LLMs can assist ASR models in making better decisions about what is being said.

6) Text Summarization: LLMs have demonstrated successful applications in various text summarization tasks, such as summarizing documents and news articles [24], [126]. For example, the work presented in [3] introduces a sequence-to-sequence pretraining model, which has proven highly effective in abstractive summarization tasks. Modern LLMs, empowered with powerful NLP capabilities, can understand the context of a document and generate concise and coherent summaries quickly while preserving the overall meaning of the original text.
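To make the summarization workflow concrete, the sketch below applies a pretrained sequence-to-sequence model of the kind described in [3] through the Hugging Face transformers pipeline; the checkpoint name, the input text, and the length limits are assumptions chosen only for illustration.

```python
# Abstractive summarization sketch with a pretrained BART-style seq2seq model,
# in the spirit of [3]. The checkpoint and length limits are illustrative
# assumptions, not the authors' experimental setup.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = (
    "Generative AI and large language models have introduced new capabilities "
    "in translation, question answering, and dialogue. At the same time, they "
    "raise concerns about bias, privacy, and the computational cost of training "
    "and deployment."
)

summary = summarizer(document, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```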
7) Code Completion: In addition to generating humanlike text and performing various NLP tasks, LLMs have also demonstrated the ability to understand the context of code and generate relevant and accurate code suggestions [127]. Code completion with LLMs involves predicting the next set of characters in a code snippet based on the provided context [128]. These models leverage their extensive pretrained knowledge of programming languages and coding patterns to generate pertinent code suggestions [129]. This approach has been shown to improve developer productivity [130].
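A minimal code-completion sketch is shown below, using a causal language model trained on source code (for example, a StarCoder-family checkpoint [127]); the exact model identifier and the decoding settings are assumptions for illustration.

```python
# Code-completion sketch: a causal LM trained on code continues a partial
# snippet. The checkpoint name and decoding settings are illustrative
# assumptions; any code-oriented causal LM can be used instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = 'def fibonacci(n):\n    """Return the nth Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=48, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```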
IV. CHALLENGES OF GENERATIVE AI AND LLMS

Despite their immense potential for society, generative AI and LLMs also pose several critical challenges that need to be carefully considered and addressed. These challenges include the following.

A. Bias and Fairness

One of the main challenges associated with generative AI and LLMs is that they inherit biases from the training data, which can lead to biased, unfair, and discriminatory outputs. Biased outputs from generative AI and LLMs can have significant real-world consequences. For example, biased hiring algorithms may discriminate against certain job applicants. Potential bias problems like these can be mitigated by developing algorithms that are explicitly designed to be fair and unbiased by using approaches such as fairness-aware training [131], counterfactual analysis [132], [133], [134], and adversarial debiasing [135].
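As a hedged illustration of counterfactual analysis, the sketch below perturbs a demographic attribute in otherwise identical inputs and compares a sentiment classifier's scores; the classifier checkpoint and the template sentence are assumptions, and a real audit would use a much larger, carefully designed test suite.

```python
# Counterfactual bias probe (simplified sketch): swap a demographic term in
# otherwise identical sentences and compare model scores. The classifier
# checkpoint and templates are illustrative assumptions only.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

template = "The {group} applicant was interviewed for the engineering role."
groups = ["male", "female"]

for group in groups:
    text = template.format(group=group)
    result = classifier(text)[0]
    print(f"{group:>6}: label={result['label']}, score={result['score']:.4f}")

# Large score gaps between counterfactual pairs would flag a potential bias
# that fairness-aware training or adversarial debiasing should then address.
```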
B. Interpretability

Understanding and interpreting the decision-making process of LLMs presents a significant challenge.
The inherent lack of interpretability in these models raises serious concerns, especially in critical applications that require explainable decision-making. Addressing interpretability challenges in generative AI and LLMs involves several approaches. One solution is to design LLMs with inherent explainability features, such as employing interpretable model architectures and incorporating constraints that promote understandable decision-making. Another approach is to develop advanced techniques that provide insights into the inner workings of LLMs, such as saliency maps [136], attention mechanisms [18], and feature attribution methods. Additionally, implementing post hoc interpretability methods [137], [138], including feature importance analysis and model-agnostic interpretation techniques, can offer valuable insights into the factors influencing the outputs of the model.
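The sketch below illustrates one simple gradient-based saliency method (gradient norms taken at the embedding layer) for a small pretrained classifier; the checkpoint, the scoring choice, and the example text are assumptions, and more principled attribution methods are surveyed in [136] and [137].

```python
# Gradient-based saliency sketch: score each input token by the norm of the
# gradient of the predicted-class logit with respect to its embedding.
# Checkpoint and example text are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

text = "The new model writes fluent but occasionally unfaithful summaries."
enc = tokenizer(text, return_tensors="pt")

# Detach the embedding lookup so the embeddings become a leaf tensor whose
# gradient we can inspect after the backward pass.
embeddings = model.get_input_embeddings()(enc["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"])
predicted = int(outputs.logits.argmax())
outputs.logits[0, predicted].backward()

scores = embeddings.grad.norm(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for token, score in zip(tokens, scores.tolist()):
    print(f"{token:>12}: {score:.4f}")
```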
C. Fine-Tuning and Adaptability

Fine-tuning large LLMs for specific domains is challenging due to their inherent limitations in generalization. In addition to their limited ability to generalize, LLMs may face difficulty in understanding and reasoning about complex concepts, hindering their ability to adapt to new tasks. Addressing the challenges associated with fine-tuning and adaptability in generative AI and LLMs involves exploring various approaches. One approach involves employing transfer learning techniques that leverage knowledge from pretrained models on diverse datasets, allowing the model to capture a broader range of knowledge, accelerate learning, and improve generalization [139], [140]. Additionally, incorporating domain-specific data during fine-tuning can enhance the model's adaptability to particular tasks, ensuring it learns domain-specific patterns and relationships. Incorporating symbolic reasoning capabilities into LLMs can also enhance their ability to understand and manipulate abstract concepts [141]. Leveraging metalearning techniques that enable LLMs to learn how to learn quickly also improves their ability to adapt to new tasks and data distributions [142].

D. Domain Adaptation

Most of the high-performing models being released are already fine-tuned for instruction-following. However, adapting these pretrained LLMs, which have been fine-tuned for a specific domain (such as chat), to a new task that is not formatted for instruction-following (such as open-ended text generation or question answering) without compromising performance in the original domain is challenging. The challenge lies in preserving the model's ability to understand and follow instructions while also enabling it to generate coherent and informative text in the new domain. This requires careful consideration of the training data, the model architecture, and the fine-tuning process. However, fine-tuning LLMs for an entirely new domain introduces the risk of negative transfer [143]. This occurs when the model's new knowledge conflicts with its existing knowledge. Additionally, domain adaptation often requires access to a large amount of high-quality data from the new domain, which can be challenging to obtain, especially for specialized domains. Potential strategies for addressing this challenge include leveraging the weights of the pretrained LLMs as a starting point for the fine-tuning process, synthesizing additional data from the new domain to supplement the existing data, and simultaneous multitask training involving both the original and new tasks.
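To illustrate the multitask data-mixing strategy mentioned above, the sketch below interleaves examples from the original instruction-following task with examples from the new domain when building fine-tuning batches; the mixing ratio and record format are assumptions for illustration, not a prescription from the cited works.

```python
# Simplified data-mixing sketch for domain adaptation: combine examples from
# the original (instruction-following) task with new-domain examples so the
# model is fine-tuned on both and is less prone to forgetting. The 70/30
# mixing ratio and the record format are illustrative assumptions.
import random

original_task = [
    {"prompt": "Answer the question: What is NLP?", "response": "..."},
    {"prompt": "Summarize the following text: ...", "response": "..."},
]
new_domain = [
    {"prompt": "Draft a radiology report for ...", "response": "..."},
    {"prompt": "Explain this lab result to a patient: ...", "response": "..."},
]

def mixed_batches(batch_size=8, new_domain_fraction=0.3, num_batches=100):
    """Yield batches that mix original-task and new-domain examples."""
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            pool = new_domain if random.random() < new_domain_fraction else original_task
            batch.append(random.choice(pool))
        yield batch

for batch in mixed_batches(num_batches=1):
    print(f"Built a batch of {len(batch)} mixed fine-tuning examples")
```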
E. Data Privacy and Security

LLMs are trained on massive and diverse datasets that may contain sensitive personal information. The potential for unintentional disclosure of private or sensitive information during text generation is a significant concern. For instance, when applied in healthcare, the use of LLMs raises concerns regarding patient privacy and the potential for misdiagnosis. There is also a risk of AI systems being exploited for malicious purposes, such as generating fake identities, which raises privacy concerns. This, for example, caused ChatGPT to be temporarily banned in Italy.3,4 Addressing privacy concerns in generative AI and LLMs requires a multifaceted approach that includes enhancing model training with privacy-preserving techniques, such as federated learning, homomorphic encryption, and differential privacy [144], [145]. Additionally, fine-tuning models on curated datasets that exclude sensitive information can help to minimize the risk of unintentional disclosures. Ethical guidelines and regulations specific to AI applications, such as in healthcare, can provide further safeguards against privacy breaches [146], [147]. LLMs should also be able to handle adversarial attacks, noisy data, and out-of-distribution inputs. Beyond model privacy, it is worth mentioning that addressing concerns related to the privacy and security of the training and deployment data itself is important.

3 https://www.bbc.com/news/technology-65139406
4 https://www.theverge.com/2023/4/28/23702883/chatgpt-italy-ban-lifted-gpdp-data-protection-age-verification
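As a hedged illustration of differential privacy in training, the sketch below implements a simplified DP-SGD-style step (per-example gradient clipping plus Gaussian noise) for a small PyTorch model; the clipping bound, noise multiplier, and toy data are assumptions, and production systems would typically rely on a dedicated library and a formal privacy accountant.

```python
# Simplified DP-SGD sketch: clip each example's gradient, average, add
# Gaussian noise, then take an optimizer step. Hyperparameters and toy data
# are illustrative assumptions; no formal privacy accounting is done here.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

clip_norm = 1.0          # per-example gradient clipping bound (assumed)
noise_multiplier = 1.1   # noise scale relative to clip_norm (assumed)

x = torch.randn(8, 16)   # toy batch
y = torch.randint(0, 2, (8,))

summed = [torch.zeros_like(p) for p in model.parameters()]
for xi, yi in zip(x, y):                      # per-example gradients
    model.zero_grad()
    loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
    for s, g in zip(summed, grads):
        s.add_(g * scale)                     # accumulate clipped gradients

model.zero_grad()
for p, s in zip(model.parameters(), summed):
    noise = torch.randn_like(s) * noise_multiplier * clip_norm
    p.grad = (s + noise) / x.shape[0]         # noisy average gradient
optimizer.step()
print("Performed one differentially private update.")
```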
F. Computational Cost

Training and deploying LLMs demand significant computational resources. This poses challenges in terms of infrastructure, energy consumption (particularly for large-scale deployments), and access to high-performance computing resources. As shown in Fig. 1, the increase in model sizes comes with challenges related to computational requirements and resource accessibility. Reducing the computational cost of LLMs involves several approaches. First, optimizing model architectures and algorithms can enhance efficiency, reducing the computational burden without compromising performance. Second, leveraging distributed computing frameworks and specialized hardware accelerators, such as GPUs and tensor processing units (TPUs), can significantly improve training speed and resource utilization [148]. In addition, applying quantization techniques [149] to models that have already been trained5 is also important.

5 https://huggingface.co/TheBloke
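The sketch below shows one lightweight form of post-training quantization (PyTorch dynamic quantization of linear layers to int8) applied to a small pretrained transformer; the checkpoint is an assumption, and weight-only schemes such as AWQ [149] use more sophisticated, calibration-based quantization.

```python
# Post-training dynamic quantization sketch: convert the linear layers of a
# pretrained transformer to int8 to shrink memory and speed up CPU inference.
# The checkpoint is an illustrative assumption; methods such as AWQ [149]
# apply more advanced, calibration-based weight quantization.
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(module):
    """Approximate serialized size of a module's weights in megabytes."""
    torch.save(module.state_dict(), "tmp_weights.pt")
    mb = os.path.getsize("tmp_weights.pt") / 1e6
    os.remove("tmp_weights.pt")
    return mb

print(f"fp32 size: {size_mb(model):.1f} MB, int8 size: {size_mb(quantized):.1f} MB")
```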
G. Deepfake Generation

Generative AI models are widely used for deepfake generation [150]. Deepfakes utilize various generative models, including GANs, to manipulate or generate realistic-looking content, primarily for image or video production [151], [152].
Despite their potential applications in various domains, including education and entertainment, deepfakes also pose serious risks due to their potential misuse, including the spread of misinformation and identity theft [153]. Deepfake technology can be exploited to create fake videos or audio recordings of individuals, leading to the spread of misinformation or disinformation, which can have devastating consequences for individuals and society. It is, therefore, important to develop advanced techniques to mitigate the risks associated with deepfakes.

H. Human-AI Collaboration

LLMs should be designed to enable seamless human-AI collaboration, allowing them to effectively understand and respond to human instructions and provide clear explanations for their outputs [154]. To achieve effective human-AI collaboration, it is important to integrate humans into the design process of LLMs to ensure that they are aligned with human needs and expectations [155], [156]. To incorporate human feedback into the training process, we can utilize techniques such as reinforcement learning from human feedback (RLHF) [157], [158] and direct policy optimization (DPO) [159] for training reinforcement learning (RL) agents using human feedback. Additionally, employing explainable AI (XAI) techniques for LLMs can enhance the transparency and understandability of their decision-making processes [160]. Developing natural language interfaces that facilitate natural human-LLM interactions is another key aspect of enhancing human-AI collaboration [161]. Conversational AI, intelligent chatbots, and voice assistants are examples of technologies that enable intuitive human-AI interactions.
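To make the DPO objective mentioned above concrete, the sketch below computes the DPO loss from per-sequence log-probabilities of preferred and dispreferred responses under the trainable policy and a frozen reference model; the numeric values and the beta temperature are placeholder assumptions, and the formulation follows the objective described in [159].

```python
# DPO loss sketch: given sequence log-probs of the preferred ("chosen") and
# dispreferred ("rejected") responses under the trainable policy and a frozen
# reference model, the loss pushes the policy to widen the preference margin.
# The numeric values and beta are placeholder assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct preference-based policy optimization loss for preference pairs."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Placeholder per-sequence log-probabilities for a batch of 3 preference pairs.
policy_chosen = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
policy_rejected = torch.tensor([-11.9, -9.5, -14.2], requires_grad=True)
ref_chosen = torch.tensor([-12.8, -9.0, -15.6])
ref_rejected = torch.tensor([-11.5, -9.4, -14.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # in practice, these gradients would update the policy model
print(f"DPO loss: {loss.item():.4f}")
```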
I. Long-Term Planning

Generative models, particularly autoregressive models that generate text one token at a time, face challenges in long-term planning [162]. These models tend to focus on the immediate local context, making it difficult to maintain consistency over longer text passages. This limitation comes from the model's lack of a global view of the entire sequence it generates. Additionally, autoregressive models struggle to plan for situations with future uncertainties. To address the long-term planning challenge with LLMs, we can employ several approaches, including hierarchical attention, which allows LLMs to focus on different parts of the input at different times and can help the models capture long-range dependencies [163]. Equipping LLMs with memory that allows them to store information about the past, which can be used to inform future decisions, is another approach to address this challenge [164].

J. Limited Context Window

Having a limited context window is a fundamental challenge for LLMs since they can only process a limited amount of text at a time. This limitation comes from their reliance on attention mechanisms [18], which allow them to focus on the most relevant parts of the text when generating content. The context window defines the number of tokens considered by the model during prediction, and a smaller context window can limit the model's ability to understand and generate contextually relevant text, especially in long passages or documents. Several techniques can be employed to address the challenge of a limited context window. A common approach involves using hierarchical attention, which enables models to focus on different levels of context [163]. Additionally, the parallel context window approach allows for parallel processing of multiple context windows [165]. This allows the models to store information beyond the immediate context window, enabling better handling of long-term dependencies [166].
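A much simpler workaround for a small context window is sketched below: a long document is split into overlapping chunks that fit the window, each chunk is processed independently, and the partial results are combined in a second pass; the window size, overlap, and the summarizer stub are assumptions, and the parallel context window method of [165] operates inside the model rather than around it.

```python
# Chunked-context sketch: process a document that exceeds the model's context
# window by splitting it into overlapping chunks and combining the per-chunk
# outputs. Window size, overlap, and the `summarize_chunk` stub are assumed.
def split_into_chunks(tokens, window=512, overlap=64):
    """Split a token list into overlapping chunks of at most `window` tokens."""
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        start += window - overlap
    return chunks

def summarize_chunk(chunk_tokens):
    # Placeholder for a call to an LLM or summarization model on one chunk.
    return f"<summary of {len(chunk_tokens)} tokens>"

document_tokens = [f"tok{i}" for i in range(2000)]  # toy "long document"
partial_summaries = [summarize_chunk(c) for c in split_into_chunks(document_tokens)]

# A second pass over the concatenated partial results (which now fit inside
# the window) produces the final answer in a map-reduce fashion.
final_input = " ".join(partial_summaries)
print(f"{len(partial_summaries)} chunks -> second-pass input: {final_input[:80]}...")
```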
K. Long-Term Memory

LLMs are trained on a massive corpus of text and code, but their completely stateless nature limits their ability to store and retrieve information from past experiences [167]. This inherent lack of explicit memory restricts their ability to maintain context and engage in natural conversations, leading to less coherent responses, especially across multiturn dialogues or tasks requiring information retention. Without the ability to remember past interactions, LLMs cannot personalize their responses to specific users. This means they cannot adapt their communication style based on the user's preferences, interests, or previous interactions. Challenges associated with this limitation include issues of consistency and task continuity. To address these challenges, various approaches and techniques can be considered. Beyond context window techniques, integrating external memory mechanisms such as memory networks and attention mechanisms with an external memory matrix can enhance the model's ability to access and update information across different turns [168]. Alternatively, designing applications that externally maintain session-based context allows the model to reference past interactions within a session. Additionally, retrieval-based techniques enable LLMs to access relevant information from past conversations or external sources during inference, enhancing their ability to maintain context and deliver more consistent responses [169].
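A hedged sketch of such a retrieval-based memory is shown below: past dialogue turns are embedded and stored, and the turns most similar to a new user query are retrieved and prepended to the prompt; the embedding checkpoint, the similarity measure, and the top-k setting are assumptions for illustration.

```python
# Retrieval-based conversational memory sketch: store past turns as embeddings
# and retrieve the most similar ones to give the LLM relevant history. The
# embedding checkpoint, cosine similarity, and top_k value are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

memory_texts = [
    "User prefers concise answers with bullet points.",
    "User is preparing a survey on generative AI challenges.",
    "User asked for a summary of diffusion models last session.",
]
memory_vecs = encoder.encode(memory_texts, normalize_embeddings=True)

def retrieve(query, top_k=2):
    """Return the top_k stored memories most similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = memory_vecs @ q  # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]
    return [memory_texts[i] for i in best]

query = "Can you recap what we covered about generative models?"
context = "\n".join(retrieve(query))
prompt = f"Relevant history:\n{context}\n\nUser: {query}\nAssistant:"
print(prompt)
```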
L. Measuring Capability and Quality

Traditional statistical quality measures, such as accuracy and F-score, do not easily translate to generative tasks [170], especially long-form generative tasks. Furthermore, the accessibility of test sets in numerous benchmark datasets provides an avenue for the potential manipulation of leaderboards by unethical practitioners. This involves the inappropriate training of models on the test set, a practice likely employed by researchers seeking funding through achieving top positions on public leaderboards, such as Hugging Face's Open LLM Leaderboard.6 At the time of writing this article, a seven billion parameter model is outperforming numerous 70 billion parameter models. A prospective and pragmatic approach to appraising model outputs is to utilize an auxiliary model for evaluating the generated content from the original model [171].
However, this methodology may prove ineffective if the judgment model lacks training within the specific domain it is employed to assess.

6 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
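A hedged sketch of this auxiliary-judge idea is given below: a second, instruction-tuned model is prompted to rate a candidate answer on a small rubric; the judge checkpoint, the prompt wording, and the score parsing are assumptions, and fine-tuned judge models such as the one in [171] are trained specifically for this purpose.

```python
# LLM-as-judge sketch: ask an auxiliary instruction-tuned model to score a
# candidate answer. The judge checkpoint, rubric prompt, and score parsing
# are illustrative assumptions, not the setup of [171].
import re
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

question = "What is the main purpose of a context window in an LLM?"
candidate = "It sets how many tokens the model can attend to at once."

prompt = (
    "You are a strict grader. Rate the answer from 1 to 10 for correctness "
    "and completeness, and reply with 'Score: <number>'.\n"
    f"Question: {question}\nAnswer: {candidate}\n"
)

output = judge(prompt, max_new_tokens=64)[0]["generated_text"]
match = re.search(r"Score:\s*(\d+)", output)
print(f"Judge score: {match.group(1) if match else 'not found'}")
```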
M. A Concerning Trend Toward "Closed" Science

As models transition from experimental endeavors to commercially viable products, there is a diminishing inclination to openly share the progress achieved within research laboratories.7 This shift poses a significant obstacle to the collaborative advancement of knowledge, hindering the ability to build upon established foundations when essential details are withheld. Furthermore, replicating published results becomes arduous when the prompts employed in the experimentation are not disclosed, since subtle alterations to prompts can, in some cases, significantly affect the performance of the model. Compounding these concerns, accessing the necessary resources to reproduce results often entails financial obligations to the publishers of the models, creating yet another barrier to entry into the scientific landscape for low-resource researchers. This situation prompts reflection on the current state of affairs and the potential impediments it imposes on the pursuit of knowledge and innovation.

7 https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview

V. BRIDGING RESEARCH GAPS AND FUTURE DIRECTIONS

Our research has identified several key areas that require attention to ensure the ethical integration of generative AI and LLMs. These areas include addressing issues such as bias and fairness in outputs, the necessity for models to provide explanations for their reasoning, and the challenges associated with adapting these models to diverse situations and domains. Furthermore, considerations regarding data privacy, security, and the potential for misuse in areas such as deepfakes require careful attention. Addressing these challenges through advancements in areas we have proposed, such as improved bias detection and the development of interpretable models, holds significant promise. Proactively tackling these issues is essential to ensuring that AI development is not only technically advanced but also beneficial to society. This includes developing clear metrics to assess model performance, enhancing their interpretability, and prioritizing user privacy and security. By incorporating ethical considerations into AI development, we pave the way for the responsible deployment of these technologies across various domains, including healthcare, recruitment, and content creation. This will foster a future where AI serves as a positive force for societal good, promoting inclusivity and making a real impact.

VI. CONCLUSION

This article explores the transformative potential of generative AI and LLMs, highlighting their advancements, technical foundations, and practical applications across diverse domains. We argue that understanding the full potential and limitations of generative AI and LLMs is crucial for shaping the responsible integration of these technologies. By addressing critical research gaps in areas such as bias, interpretability, deepfakes, and human-AI collaboration, our work paves the way for an impactful, ethical, and inclusive future of NLP. We envision this research serving as a roadmap for the AI community, empowering diverse domains with transformative tools and establishing a clear path for the responsible evolution of AI.

In our future work, we aim to explore advanced techniques for identifying and mitigating bias in both training data and algorithms to enhance fairness in AI systems. Additionally, we plan to investigate explainable AI approaches and develop new strategies to improve the interpretability of AI models. Building upon our previous line of research on human-autonomy teaming, we will delve into the development of models that facilitate seamless collaboration and interaction between humans and AI. We hope this work encourages researchers across multiple disciplines of the AI community, from both academia and industry, to further explore the broader domain of generative AI and LLMs.

ACKNOWLEDGMENT

Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies.

REFERENCES

[1] N. Kalchbrenner and P. Blunsom, "Recurrent continuous translation models," in Proc. Conf. Empirical Methods Natural Lang. Process., 2013, pp. 1700–1709.
[2] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[3] M. Lewis et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," 2019, arXiv:1910.13461.
[4] T. Brown et al., "Language models are few-shot learners," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 1877–1901.
[5] K. Cobbe et al., "Training verifiers to solve math word problems," 2021, arXiv:2110.14168.
[6] M. Chen et al., "Evaluating large language models trained on code," 2021, arXiv:2107.03374.
[7] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in Proc. Joint Conf. Empirical Methods Natural Lang. Process. Comput. Natural Lang. Learn., 2007, pp. 858–867.
[8] C. D. Manning, "Human language understanding & reasoning," Daedalus, vol. 151, no. 2, pp. 127–138, 2022.
[9] A. Svyatkovskiy, J. Kates-Harbeck, and W. Tang, "Training distributed deep recurrent neural networks with mixed precision on GPU clusters," in Proc. Mach. Learn. HPC Environ., 2017, pp. 1–8.
[10] B. Li et al., "Large scale recurrent neural network on GPU," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Piscataway, NJ, USA: IEEE Press, 2014, pp. 4062–4069.
[11] M. Isaev, N. McDonald, and R. Vuduc, "Scaling infrastructure to support multi-trillion parameter LLM training," in Proc. Archit. Syst. Support Transformer Models (ASSYST@ISCA), 2023, pp. 1–5.
[12] J. Kaplan et al., "Scaling laws for neural language models," 2020, arXiv:2001.08361.
[13] J. Hoffmann et al., "An empirical analysis of compute-optimal large language model training," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 30016–30030.
[14] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[15] L. R. Medsker and L. Jain, "Recurrent neural networks," Des. Appl., vol. 5, nos. 64–67, p. 2, 2001.
[16] R. Socher et al., “Recursive deep models for semantic compositionality [43] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture
over a sentiment treebank,” in Proc. Conf. Empirical Methods Natural of experts,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8,
Lang. Process., 2013, pp. 1631–1642. pp. 1177–1193, Aug. 2012.
[17] Y. LeCun et al., “Convolutional networks for images, speech, and [44] P. Yadav et al., “Resolving interference when merging models,” 2023,
time series,” The Handbook of Brain Theory and Neural Networks, arXiv:2306.01708.
Cambridge, MA, USA: MIT Press, vol. 3361, no. 10, pp. 255–258, [45] M. S. Matena and C. A. Raffel, “Merging models with fisher-weighted
1995. averaging,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022,
[18] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. pp. 17703–17716.
Process. Syst., vol. 30, 2017, pp. 1–15. [46] S. Khanuja, M. Johnson, and P. Talukdar, “MergeDistill: Merging pre-
[19] C. Raffel et al., “Exploring the limits of transfer learning with a trained language models using distillation,” 2021, arXiv:2106.02834.
unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 1, [47] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
pp. 5485–5551, 2020. neural network,” 2015, arXiv:1503.02531.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [48] G. Ilharco et al., “Editing models with task arithmetic,” 2022,
training of deep bidirectional transformers for language understanding,” arXiv:2212.04089.
in Proc. NAACL-HLT, 2019, pp. 1–16. [49] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language models are super
[21] A. Radford et al., “Improving language understanding by generative mario: Absorbing abilities from homologous models as a free lunch,”
pre-training,” Open AI, San Francisco, CA, USA, Technical Report, 2023, arXiv:2311.03099.
pp. 1–12, 2018. [50] F. Wan et al., “Knowledge fusion of large language models,” 2024,
[22] N. Houlsby et al., “Parameter-efficient transfer learning for NLP,” in arXiv:2401.10491.
Proc. Int. Conf. Mach. Learn., PMLR, 2019, pp. 2790–2799. [51] S. K. Ainsworth, J. Hayase, and S. Srinivasa, “Git re-basin: Merging
[23] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” models modulo permutation symmetries,” 2022, arXiv:2209.04836.
2013, arXiv:1312.6114. [52] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational diffusion
[24] C. Feng, F. Cai, H. Chen, and M. de Rijke, “Attentive encoder-based models,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
extractive text summarization,” in Proc. 27th ACM Int. Conf. Inf. pp. 21696–21707.
Knowl. Manage., 2018, pp. 1499–1502. [53] C. Zach, “Fully variational noise-contrastive estimation,” in Proc.
[25] T. Wolf et al., “Transformers: State-of-the-art natural language process- Scand. Conf. Image Anal. (SCIA), New York, NY, USA: Springer-
ing,” in Proc. Conf. Empirical Methods Natural Lang. Process. Syst. Verlag, 2023, pp. 175–190.
Demonstrations, 2020, pp. 38–45. [54] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and
[26] R. Thoppilan et al., “LaMDA: Language models for dialog applica- B. Poole, “Score-based generative modeling through stochastic differ-
tions,” 2022, arXiv:2201.08239. ential equations,” 2020, arXiv:2011.13456.
[27] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sen- [55] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image
gupta, and A. A. Bharath, “Generative adversarial networks: An synthesis,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
overview,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, pp. 8780–8794.
Jan. 2018. [56] Y. Song and S. Ermon, “Generative modeling by estimating gradients
[28] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural of the data distribution,” in Proc. Adv. Neural Inf. Process. Syst.,
Inf. Process. Syst., vol. 27, 2014, pp. 1–9. vol. 32, 2019, pp. 1–13.
[29] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation [57] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hier-
learning with deep convolutional generative adversarial networks,” archical text-conditional image generation with clip latents,” 2022,
2015, arXiv:1511.06434. arXiv:2204.06125.
[30] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative [58] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. Int.
adversarial networks,” in Proc. Int. Conf. Mach. Learn., PMLR, 2017, Conf. Mach. Learn., PMLR, 2021, pp. 8821–8831.
pp. 214–223. [59] C. Saharia et al., “Photorealistic text-to-image diffusion models with
[31] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive grow- deep language understanding,” in Proc. Adv. Neural Inf. Process. Syst.,
ing of GANs for improved quality, stability, and variation,” 2017, vol. 35, 2022, pp. 36479–36494.
arXiv:1710.10196. [60] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Om-
[32] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, mer, “High-resolution image synthesis with latent diffusion models,”
and M. Welling, “Improved variational inference with inverse in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022,
autoregressive flow,” in Proc. Adv. Neural Inf. Process. Syst., pp. 10684–10695.
vol. 29, 2016, pp. 1–9. [61] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli,
[33] D. Rezende and S. Mohamed, “Variational inference with normalizing “Deep unsupervised learning using nonequilibrium thermodynamics,”
flows,” in Proc. Int. Conf. Mach. Learn., PMLR, 2015, pp. 1530–1538. in Proc. Int. Conf. Mach. Learn., PMLR, 2015, pp. 2256–2265.
[34] C. Meek, D. M. Chickering, and D. Heckerman, “Autoregressive tree [62] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
models for time-series analysis,” in Proc. SIAM Int. Conf. Data Mining, “Generative adversarial text to image synthesis,” in Proc. Int. Conf.
Philadelphia, PA, USA: SIAM, 2002, pp. 229–244. Mach. Learn., PMLR, 2016, pp. 1060–1069.
[35] N. Shazeer et al., “Outrageously large neural networks: The sparsely- [63] “Midjourney,” Midjourney. Accessed: Aug. 23, 2024. [Online]. Avail-
gated mixture-of-experts layer,” 2017, arXiv:1701.06538. able: https://www.midjourney.com/
[36] X. Wang et al., “Deep mixture of experts via shallow embedding,” in [64] M. Wu and N. Goodman, “Multimodal generative models for scalable
Proc. Uncertainty in AI, PMLR, 2020, pp. 552–562. weakly-supervised learning,” in Proc. Adv. Neural Inf. Process. Syst.,
[37] N. Du et al., “GLaM: Efficient scaling of language models with vol. 31, 2018, pp. 1–11.
mixture-of-experts,” in Proc. Int. Conf. Mach. Learn., PMLR, 2022, [65] M. Suzuki and Y. Matsuo, “A survey of multimodal deep generative
pp. 5547–5569. models,” Adv. Robot., vol. 36, nos. 5–6, pp. 261–278, 2022.
[38] S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettle- [66] Y. Shi et al., “Variational mixture-of-experts autoencoders for multi-
moyer, “DEMix layers: Disentangling domains for modular language modal deep generative models,” in Proc. Adv. Neural Inf. Process. Syst.,
modeling,” 2021, arXiv:2108.05036. vol. 32, 2019.
[39] S. Rajbhandari et al., “DeepSpeed-MoE: Advancing mixture-of-experts [67] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
inference and training to power next-generation AI scale,” in Proc. Int. “Multimodal deep learning,” in Proc. 28th Int. Conf. Mach. Learn.
Conf. Mach. Learn., PMLR, 2022, pp. 18332–18346. (ICML-11), 2011, pp. 689–696.
[40] Y. Zhou et al., “Mixture-of-experts with expert choice routing,” in Proc. [68] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 7103–7114. learning: A survey and taxonomy,” IEEE Trans. pattern Anal. Mach.
[41] Z. Chi et al., “On the representation collapse of sparse mixture of Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019.
experts,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, [69] A. Radford et al., “Learning transferable visual models from natural
pp. 34600–34613. language supervision,” in Proc. Int. Conf. Mach. Learn., PMLR, 2021,
[42] Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li, “Towards understanding pp. 8748–8763.
the mixture-of-experts layer in deep learning,” in Proc. Adv. Neural [70] J. Yu et al., “Scaling autoregressive models for content-rich text-to-
Inf. Process. Syst., vol. 35, 2022, pp. 23049–23062. image generation,” 2022, arXiv:2206.10789.
[71] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture [100] S. Narang and A. Chowdhery, “Pathways language model (PaLM):
for generative adversarial networks,” in Proc. IEEE/CVF Conf., 2019, Scaling to 540 billion parameters for breakthrough performance,”
pp. 4401–4410. Google AI Blog. Accessed: Aug. 23, 2024. [Online]. Avail-
[72] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image able: https://blog.research.google/2022/04/pathways-language-model-
synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF palm-scaling-to.html
Conf., 2019, pp. 2337–2346. [101] A. Chowdhery et al., “PaLM: Scaling language modeling with path-
[73] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image ways,” 2022, arXiv:2204.02311.
translation using cycle-consistent adversarial networks,” in Proc. IEEE [102] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat
Int. Conf. Comput. Vis., 2017, pp. 2223–2232. models,” 2023, arXiv:2307.09288.
[74] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers [103] R. Anil et al., “PaLM 2 technical report,” 2023, arXiv:2305.10403.
for image recognition at scale,” 2020, arXiv:2010.11929. [104] OpenAI, “GPT-4 Technical Report,” 2023. Available: https://openai.
[75] “Sora: OpenAI’s platform for multimodal AI,” OpenAI. Accessed: Aug. com/research/gpt-4
23, 2024. [Online]. Available: https://openai.com/sora [105] “Gemini: A family of highly capable multimodal models,” Google.
[76] T. Brooks et al., “Video generation models as world simulators,” [Online]. Available: https://storage.googleapis.com/deepmind-media/
OpenAI. Accessed: Aug. 23, 2024. [Online]. Available: https://openai. gemini/gemini_1_report.pdf
com/blog/videoworldsimulators2024/ [106] M. McShane, “Natural language understanding (NLU, not NLP) in
[77] J. Manyika, “An overview of Bard: An early experiment with generative cognitive systems,” AI Mag., vol. 38, no. 4, pp. 43–56, 2017.
AI,” AI. Google Static Documents. Accessed: Aug. 23, 2024. [Online]. [107] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
Available: https://ai.google/static/documents/google-about-bard.pdf jointly learning to align and translate,” 2014, arXiv:1409.0473.
[78] X. Zeng et al., “Deep generative molecular design reshapes drug [108] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability
discovery,” Cell Rep. Med., pp. 1–13, 2022. using natural language processing,” in Proc. 2nd Int. Conf. Knowl.
[79] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data Capture, 2003, pp. 70–77.
drug discovery with one-shot learning,” ACS Central Sci., vol. 3, [109] M. A. Di Gangi, M. Negri, and M. Turchi, “Adapting transformer to
no. 4, pp. 283–293, 2017. end-to-end spoken language translation,” in Proc. INTERSPEECH. Int.
[80] A. Aliper et al., “Deep learning applications for predicting pharmaco- Speech Communication Assoc. (ISCA), 2019, pp. 1133–1137.
logical properties of drugs and drug repurposing using transcriptomic [110] J. Wu et al., “On decoder-only architecture for speech-to-text and large
data,” Mol. Pharmaceutics, vol. 13, no. 7, pp. 2524–2530, 2016. language model integration,” in Proc. IEEE Autom. Speech Recognit.
[81] A. Merchant et al., “Scaling deep learning for materials discovery,” Understanding Workshop (ASRU), Piscataway, NJ, USA: IEEE Press,
Nature, vol. 624, no. 7990, pp. 80–85, 2023. 2023, pp. 1–8.
[82] C. P. Gomes et al., “Artificial intelligence for materials discovery,” MRS [111] A. Yamaguchi, G. Chrysostomou, K. Margatina, and N. Aletras, “Frus-
Bull., vol. 44, no. 7, pp. 538–544, 2019. tratingly simple pretraining alternatives to masked language modeling,”
[83] E. O. Pyzer-Knapp et al., “Accelerating materials discovery using 2021, arXiv:2109.01819.
artificial intelligence, high performance computing and robotics,” npj [112] J. Howard and S. Ruder, “Universal language model fine-tuning for
Comput. Mater., vol. 8, no. 1, 2022, Art. no. 84. text classification,” 2018, arXiv:1801.06146.
[84] A. Langevin, T. Cody, S. Adams, and P. Beling, “Generative adversarial [113] M. Zaheer et al., “Big bird: Transformers for longer sequences,” in
networks for data augmentation and transfer in credit card fraud Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 17283–17297.
detection,” J. Oper. Res. Soc., vol. 73, no. 1, pp. 153–180, 2022. [114] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov,
[85] T. Schlegl et al., “Unsupervised anomaly detection with generative “Transformer-XL: Attentive language models beyond a fixed-length
adversarial networks to guide marker discovery,” in Proc. Int. Conf. context,” 2019, arXiv:1901.02860.
Inf. Process. Med. Imag., New York, NY, USA: Springer-Verlag, 2017,
[115] Z. Yang et al., “XLNet: Generalized autoregressive pretraining for
pp. 146–157.
language understanding,” in Proc. Adv. Neural Inf. Process. Syst.,
[86] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-
vol. 32, 2019, pp. 1–18.
Erfurth, “f-AnoGAN: Fast unsupervised anomaly detection with gen-
[116] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
erative adversarial networks,” Med. Image Anal., vol. 54, pp. 30–44,
document transformer,” 2020, arXiv:2004.05150.
May 2019.
[117] R. Child et al., “Generating long sequences with sparse transformers,”
[87] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, “Generative
2019, arXiv:1904.10509.
adversarial user model for reinforcement learning based recommen-
dation system,” in Proc. Int. Conf. Mach. Learn., PMLR, 2019, [118] G. M. Correia, V. Niculae, and A. F. Martins, “Adaptively sparse
pp. 1052–1061. transformers,” 2019, arXiv:1909.00015.
[88] D. Adiwardana et al., “Towards a human-like open-domain chatbot,” [119] M. Johnson et al., “Google’s multilingual neural machine translation
2020, arXiv:2001.09977. system: Enabling zero-shot translation,” Trans. Assoc. Comput. Lin-
[89] X. Liu and W. B. Croft, “Statistical language modeling for information guistics, vol. 5, pp. 339–351, 2017.
retrieval.” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005. [120] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F.
[90] B. Roark, M. Saraclar, and M. Collins, “Discriminative n-gram lan- Tan, and D. S. W. Ting, “Large language models in medicine,” Nature
guage modeling,” Comput. Speech Lang., vol. 21, no. 2, pp. 373– Med., vol. 29, no. 8, pp. 1930–1940, 2023.
392, 2007. [121] G. Hinton et al., “Deep neural networks for acoustic modeling in speech
[91] S. Khudanpur and J. Wu, “Maximum entropy techniques for exploiting recognition: The shared views of four research groups,” IEEE Signal
syntactic, semantic and collocational dependencies in language model- Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
ing,” Comput. Speech Lang., vol. 14, no. 4, pp. 355–372, 2000. [122] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
[92] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust.,
2016, arXiv:1606.08415. Speech Signal Process., Piscataway, NJ, USA: IEEE Press, 2013,
[93] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation pp. 6645–6649.
functions,” 2017, arXiv:1710.05941. [123] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu,
[94] A. Radford et al., “Language models are unsupervised multitask “Exploring the limits of language modeling,” 2016, arXiv:1602.02410.
learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019. [124] C. Shan et al., “Investigating end-to-end speech recognition for
[95] J. W. Rae et al., “Scaling language models: Methods, analysis & mandarin-english code-switching,” in Proc. IEEE Int. Conf. Acoust.,
insights from training gopher,” 2021, arXiv:2112.11446. Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE Press,
[96] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical 2019, pp. 6056–6060.
details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021. [125] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for
[97] S. Smith et al., “Using DeepSpeed and megatron to train megatron- connectionist temporal classification in speech recognition,” in Proc.
turing NLG 530B, a large-scale generative language model,” 2022, IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Piscataway,
arXiv:2201.11990. NJ, USA: IEEE Press, 2019, pp. 7115–7119.
[98] J. Hoffmann et al., “Training compute-optimal large language models,” [126] J. Krantz, W. Spokane, and J. Kalita, “Abstractive summarization using
2022, arXiv:2203.15556. attentive neural techniques,” in Proc. 15th Int. Conf. Natural Lang.
[99] L. Ouyang et al., “Training language models to follow instructions Process., 2018, p. 1.
with human feedback,” in Proc. Adv. Neural Inf. Process. Syst., [127] R. Li et al., “StarCoder: May the source be with you!” 2023,
vol. 35, 2022, pp. 27730–27744. arXiv:2305.06161.
[128] A. Svyatkovskiy, Y. Zhao, S. Fu, and N. Sundaresan, “Pythia: [155] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza, “Power to the
AI-assisted code completion system,” in Proc. 25th ACM SIGKDD people: The role of humans in interactive machine learning,” AI Mag.,
Int. Conf. Knowl. Discovery Data Mining, 2019, pp. 2727–2735. vol. 35, no. 4, pp. 105–120, 2014.
[129] Z. Feng et al., “CodeBERT: A pre-trained model for programming and [156] E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-
natural languages,” 2020, arXiv:2002.08155. Bascarán, and Á. Fernández-Leal, “Human-in-the-loop machine learn-
[130] S. Lu et al., “CodeXGLUE: A machine learning benchmark dataset for ing: A state of the art,” Artif. Intell. Rev., vol. 56, no. 4, pp. 3005–
code understanding and generation,” 2021, arXiv:2102.04664. 3054, 2023.
[131] D. Xu, S. Yuan, L. Zhang, and X. Wu, “FairGAN: Fairness-aware [157] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L.
generative adversarial networks,” in Proc. IEEE Int. Conf. Big Data Thomaz, “Policy shaping: Integrating human feedback with reinforce-
(Big Data), Piscataway, NJ, USA: IEEE Press, 2018, pp. 570–575. ment learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 26, 2013,
[132] A. Feder, N. Oved, U. Shalit, and R. Reichart, “CausaLM: Causal pp. 1–9.
model explanation through counterfactual language models,” Comput. [158] J. MacGlashan et al., “Interactive learning from policy-dependent
Linguistics, vol. 47, no. 2, pp. 333–386, 2021. human feedback,” in Proc. Int. Conf. Mach. Learn., PMLR, 2017,
[133] Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richard- pp. 2285–2294.
son, “DISCO: Distilling counterfactuals with large language mod- [159] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C.
els,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics, 2023, Finn, “Direct preference optimization: Your language model is secretly
pp. 5514–5528. a reward model,” 2023, arXiv:2305.18290.
[134] P.-S. Huang et al., “Reducing sentiment bias in language models via [160] A. Rosenfeld and A. Richardson, “Explainability in human–agent
counterfactual evaluation,” 2019, arXiv:1911.03064. systems,” Auton. Agents Multi-Agent Syst., vol. 33, pp. 673–705,
[135] B. H. Zhang, B. Lemoine, and M. Mitchell, “Mitigating unwanted May 2019.
biases with adversarial learning,” in Proc. AAAI/ACM Conf. AI, Ethics, [161] M. T. Ribeiro, S. Singh, and C. Guestrin, “ “Why should i trust
Soc., 2018, pp. 335–340. you?” Explaining the predictions of any classifier,” in Proc. 22nd
[136] S. Ding and P. Koehn, “Evaluating saliency methods for neural lan- ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
guage models,” 2021, arXiv:2104.05824. pp. 1135–1144.
[137] A. Madsen et al., “Post-hoc interpretability for neural NLP: A survey,” [162] K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On
ACM Comput. Surv., vol. 55, no. 8, pp. 1–42, 2022. the planning abilities of large language models–A critical investiga-
[138] N. Kroeger, D. Ley, S. Krishna, C. Agarwal, and H. Lakkaraju, “Are tion,” 2023, arXiv:2305.15771.
large language models post hoc explainers?” 2023, arXiv:2310.05797. [163] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy,
[139] A. Chronopoulou, C. Baziotis, and A. Potamianos, “An embarrassingly “Hierarchical attention networks for document classification,” in Proc.
simple approach for transfer learning from pretrained language mod- Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang.
els,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics Technol., 2016, pp. 1480–1489.
(Long and Short Papers), vol. 1, 2019, pp. 2089–2095. [164] E. Grave, A. Joulin, and N. Usunier, “Improving neural language mod-
[140] K. You, Z. Kou, M. Long, and J. Wang, “Co-tuning for transfer els with a continuous cache,” in Proc. Int. Conf. Learn. Representations,
learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, 2016, pp. 1–9.
pp. 17236–17246. [165] N. Ratner et al., “Parallel context windows for large language models,”
[141] J. Zhang and Y. Moshfeghi, “ELASTIC: Numerical reasoning with in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Long Papers),
adaptive symbolic compiler,” in Proc. Adv. Neural Inf. Process. Syst., vol. 1, 2023, pp. 6383–6402.
vol. 35, 2022, pp. 12647–12661. [166] L. Kaiser et al., “One model to learn them all,” 2017,
[142] Z. Hou, J. Salazar, and G. Polovets, “Meta-learning the difference: arXiv:1706.05137.
Preparing large language models for efficient adaptation,” Trans. Assoc. [167] M. Ghodsi, X. Liu, J. Apfel, R. Cabrera, and E. Weinstein, “RNN-
Comput. Linguistics, vol. 10, pp. 1249–1265, 2022. transducer with stateless prediction network,” in Proc. IEEE Int. Conf.
[143] Z. Wang, Z. Dai, B. Póczos, and J. Carbonell, “Characterizing and Acoust., Speech Signal Process. (ICASSP), Piscataway, NJ, USA: IEEE
avoiding negative transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Press, 2020, pp. 7049–7053.
Pattern Recognit., 2019, pp. 11293–11302. [168] S. Sukhbaatar et al., “End-to-end memory networks,” in Proc. Adv.
[144] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Neural Inf. Process. Syst., vol. 28, 2015.
Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., 2015, [169] Z. Azerbayev et al., “LLEMMA: An open language model mathemat-
pp. 1310–1321. ics,” 2023, arXiv:2310.10631.
[145] M. Abadi et al., “Deep learning with differential privacy,” in [170] D. Deutsch, R. Dror, and D. Roth, “On the limitations of reference-free
Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 308– evaluations of generated text,” 2022, arXiv:2210.12563.
318. [171] L. Zhu, X. Wang, and X. Wang, “JudgeLM: Fine-tuned large language
[146] J. Morley, A. Elhalal, F. Garcia, L. Kinsey, J. Mökander, and L. Floridi, models are scalable judges,” 2023, arXiv:2310.17631.
“Ethics as a service: A pragmatic operationalisation of AI ethics,”
Minds Mach., vol. 31, no. 2, pp. 239–256, 2021.
[147] J. Borenstein and A. Howard, “Emerging challenges in AI and the need
for AI ethics education,” AI Ethics, vol. 1, no. 1, pp. 61–65, 2021. Desta Haileselassie Hagos (Member, IEEE) re-
[148] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- ceived the B.Sc. degree in computer science from
erations for modern deep learning research,” in Proc. AAAI Conf. Artif. Mekelle University, Mekelle, Tigray, in 2008, the
Intell., 2020, vol. 34, no. 9, pp. 13693–13696. M.Sc. degree in computer science and engineering
[149] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “AWQ: with a specialization in mobile systems from the
Activation-aware weight quantization for LLM compression and ac- Department of Computer Science Electrical and
celeration,” 2023, arXiv:2306.00978. Space Engineering, Luleå University of Technology,
[150] R. Tolosana et al., “Deepfakes and beyond: A survey of face ma- Luleå, Sweden, in 2012, and the Ph.D. degree in
nipulation and fake detection,” Inf. Fusion, vol. 64, pp. 131–148, computer science from the Faculty of Mathematics
Dec. 2020. and Natural Sciences, University of Oslo, Oslo,
[151] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Norway, in 2020.
Nießner, “Face2face: Real-time face capture and reenactment of RGB Currently, he is a Postdoctoral Research Fellow with the U.S. Department
videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, of Defense (DoD), Center of Excellence in Artificial Intelligence and Machine
pp. 2387–2395. Learning (CoE-AIML), Department of Electrical Engineering and Computer
[152] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormahlen, P. Perez, and C. Science, College of Engineering and Architecture (CEA), Howard University,
Theobalt, “Automatic face reenactment,” in Proc. IEEE Conf. Comput. Washington, DC, USA. Previously, he was a Postdoctoral Research Fellow
Vis. Pattern Recognit., 2014, pp. 4217–4224. with the Division of Software and Computer Systems (SCS), Department of
[153] M. Brundage et al., “The malicious use of artificial intelligence: Computer Science, School of Electrical Engineering and Computer Science
Forecasting, prevention, and mitigation,” 2018, arXiv:1802.07228. (EECS), KTH Royal Institute of Technology, Stockholm, Sweden, worked on
[154] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey the H2020-EU project, ExtremeEarth: From Copernicus Big Data to Extreme
on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, Earth Analytics. His research interests include the areas of machine learning,
pp. 52138–52160, 2018. deep learning, and artificial intelligence.
Rick Battle received the Bachelor of Science de- Dr. Rawat has secured over $110 million as a PI and over $18 million
gree in computer engineering from Virginia Tech, as a Co-PI in research funding from the U.S. National Science Foundation
Blacksburg, VA, USA, in 2009, and the Mas- (NSF), U.S. Department of Homeland Security (DHS), U.S. National Security
ter of Science degree in computer science with Agency (NSA), U.S. Department of Energy, National Nuclear Security Ad-
a specialization in machine learning from the ministration (NNSA), National Institute of Health (NIH), U.S. Department of
Naval Postgraduate School, Monterey, CA, USA, Defense (DoD) and DoD Research Labs, Industry (Microsoft, Intel, VMware,
in 2013. PayPal, Mastercard, Meta, BAE, Raytheon, etc.), and private Foundations. He
He is a Staff Machine Learning Engineer with was the recipient of the U.S. NSF CAREER Award, the U.S. Department
VMware by Broadcom, Palo Alto, CA, USA. He is of Homeland Security (DHS) Scientific Leadership Award, the President’s
the Head of NLP Research, VMware AI Labs, Palo Medal of Achievement Award (2023) at Howard University, the Provost’s
Alto, CA, USA. His research interests include the Distinguished Service Award 2021, the U.S. Air Force Research Laboratory
areas of the application of large language models to real-world use cases and (AFRL) Summer Faculty Visiting Fellowship 2017, the Outstanding Research
information retrieval. Faculty Award (award for excellence in scholarly activity), and several
best paper awards. He has been an Editor/a Guest Editor for over 100
international journals, including an Associate Editor of IEEE TRANSACTIONS
ON INFORMATION FORENSICS AND SECURITY, an Associate Editor of IEEE
Danda B. Rawat (Senior Member, IEEE) received TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, an
the Ph.D. degree in electrical and computer engi- Associate Editor of IEEE TRANSACTIONS OF SERVICES COMPUTING, an Editor
neering from Old Dominion University, Norfolk, of IEEE INTERNET OF THINGS JOURNAL, an Editor of IEEE COMMUNICATIONS
VA, USA, in 2010. LETTERS, an Associate Editor of IEEE TRANSACTIONS ON NETWORK SCIENCE
AND ENGINEERING, and a Technical Editors of IEEE NETWORK. He has been
He is an Associate Dean of the Research and
Graduate Studies, a Full Professor with the De- in organizing committees for several IEEE flagship conferences, such as
partment of Electrical Engineering and Computer IEEE INFOCOM, IEEE CNS, IEEE ICC, and IEEE GLOBECOM. He served
Science (EECS), the Founding Director of the Data as a Technical Program Committee (TPC) Member for several international
Science & Cybersecurity Center, the Founding Di- conferences, including IEEE INFOCOM, IEEE GLOBECOM, IEEE CCNC,
rector of the Department of Defense (DoD) Center IEEE GreenCom, IEEE ICC, IEEE WCNC, and IEEE VTC conferences. He
of Excellence in Artificial Intelligence and Machine served as the Vice Chair of the Executive Committee of the IEEE Savannah
Learning (CoE-AIML), and the Director of Cyber-Security and Wireless Net- Section from 2013 to 2017. He is a Lifetime Professional Senior Member of
working Innovations (CWiNs) Research Lab, Howard University, Washington, ACM, a Lifetime Member of the Association for the Advancement of Artificial
DC, USA. He successfully led and established the Research Institute for Intelligence (AAAI), a Lifetime Member of SPIE, a Member of ASEE and
Tactical Autonomy (RITA), the 15th University Affiliated Research Center AAAS, and a Fellow of the Institution of Engineering and Technology (IET).
(UARC) of the U.S. Department of Defense as a Principal Investigator and the He is an ACM Distinguished Speaker and an IEEE Distinguished Lecturer
Founding Executive Director of Howard University. He is engaged in research (FNTC and VTS).
and teaching in the areas of cybersecurity, machine learning, big data analytics,
and wireless networking for emerging networked systems, including cyber–
physical systems (eHealth, energy, and transportation), Internet-of-Things,
multidomain operations, smart cities, software-defined systems, and vehicular
networks.