Chapter 2 Literature Review
This chapter identifies and discusses the AI methods employed in the creation and
implementation of chatbots. It concentrates on the field of natural language processing (NLP)
and traces the features and progress of language models. The
implementation techniques. Then, the concepts of pre-trained and fine-tuned language models
are detailed before elaborating on scaling the pre-trained language models. Further, the chapter
presents a detailed case of the launch of the widely-used LLM-powered chatbot, ChatGPT,
Contrary to rule-based systems, AI-based chatbots utilize machine learning models and
natural language processing technologies to obtain and analyze user input and generate responses
(Singh & Thakur, 2020). AI-based chatbots do not rely on pre-established responses for every
user statement, precise pattern matching, or creating new rules through human coding. Rather,
these chatbots undergo training using large databases of conversational data, employing machine
learning algorithms that analyze and learn from past interactions, eventually enhancing their
intelligence (Rudolph et al., 2023). Training on large datasets with machine learning and NLP
methodologies improves the flexibility of chatbots, enabling them to encompass a wide range of
instead of focusing on the present statement each time. There are two types of AI chatbots:
information-retrieval chatbots and generative chatbots. However, hybrid chatbots also combine
to users’ inputs (Makhalova et al., 2019). Upon assessing the user input, information retrieval
chatbots pick the most suitable response from the existing answers. Typically, these chatbots rely
on a dataset of question-answer pairs to build their knowledge base. Subsequently, they create a
chat index containing a list of possible responses based on the user’s input (Rukhiran & Netinant,
2022). The user’s inputs or questions are matched against the most relevant entries in the chat
index, and the chatbot provides the corresponding answer. The increased popularity of retrieval-based chatbots
can be attributed to the widespread use of social web or participatory web technology and the
caution that building a rich knowledge base and training these models with significant
conversational data can be costly, laborious, and time-consuming (Rukhiran & Netinant, 2022).
According to Makhalova et al. (2019), the retrieval-based technique may restrict the capacity to
develop chatbots with distinctive characters because responses can only be drawn from a
predetermined corpus. Such inflexibility may hinder the advancement of dynamic and engaging
conversational agents, especially chatbots designed for social encounters with a unique persona
(Khadija et al., 2023). Overcoming these shortcomings requires employing novel
methods for constructing knowledge bases and generating responses that balance quality and
adaptability.
In contrast, generative chatbots are open-domain programs that generate original outputs
rather than selecting answers from pre-defined responses (Esfandiari et al., 2023). These chatbots
primarily utilize deep neural networks, such as Encoder/Decoder models, to produce original
responses based on user input (Zielinski et al., 2023). There is growing research interest in
generative chatbots, considering their promising reliability in open-domain discourse (Aydın &
Karaarslan, 2023). Generative chatbots are trained on large datasets of sentences obtained from
real conversations. Through this training, the model learns to generate responses that are
linguistically coherent, accurate, and relevant (Mulla & Gharpure, 2023). The next section
discusses the chatbot design and implementation techniques.
The focus of this section is to detail the techniques for designing and implementing
chatbots. Specific focus is limited to language modeling, natural language processing, and neural
network models such as transformers, deep Seq2Seq, and recurrent neural networks. AI-powered
chatbots exploit these techniques to understand user inputs, generate contextually relevant
predetermined responses, adjusting their outputs in real time while including desirable
emulating the human brain’s cognitive processes (Gao et al., 2019). Neural networks are
machine learning (ML) and deep learning (DL) models used in generative-based and information
retrieval (IR)-based chatbot implementation techniques. Since they are trained on extensive
datasets with labeled or unlabeled information, these models can accurately represent the
connections and patterns between input and output data in natural language (Alaloul & Qureshi,
2020). As a result, artificial neural networks are capable of generating appropriate responses. In
provide replies by sequentially generating words based on probability calculations using the
given vocabulary (Mulla & Gharpure, 2023). Hybrid techniques compare candidate
responses and select the one that scores highest (Gao & Jiang, 2021). As discussed in
subsequent subsections, some examples of artificial neural networks include deep sequence-to-
sequence (seq2seq) models, transformers, seq2seq models, long short-term memory networks,
designed to process time series or sequential time series data. An RNN, derived from ANNs and
modified from recursive neural networks, is designed to store and recall prior input sequences of
varying lengths (Lalapura et al., 2021). Unlike a typical feed-forward neural network, an RNN
passes these sequences to neurons for processing (Hidasi et al., 2020). Figure 1 shows RNN
architecture containing hidden layers (green blocks) that incorporate a looping mechanism,
allowing them to retain and transfer knowledge from previous inputs throughout the network.
Within each block, the blue circles are defined by vector a, hidden units or nodes, where the
At a given time t, the RNN architecture considers events that occurred at t-1 by combining
the hidden state h from the previous time step with the input x at time t (Khuong, 2019). This
internal memory can then be utilized for subsequent processing. RNNs can retain and utilize past
data by storing the output of one time step and feeding it back as input to the next (Hidasi et
al., 2020). Furthermore, RNNs capture the order of the input sequence, which can be
advantageous for overall processing. However, during the training phase, the gradients used for
updating the weights of the RNN can shrink toward zero (vanishing) or grow very large
(exploding), making the learning process unstable and inefficient (Lalapura et al., 2021).
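The recurrent update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the formulation of any cited study: the tanh activation is the conventional choice for a basic RNN, and the weight values, sizes, and function names (`matvec`, `rnn_step`, `rnn_forward`) are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)       # contribution of the current input x_t
    from_memory = matvec(W_hh, h_prev)   # contribution of the previous state h_{t-1}
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_memory, b_h)]

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run the RNN over a whole sequence and return every hidden state."""
    h = [0.0] * len(b_h)                 # empty memory before the first input
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Toy weights (hypothetical values): 2-dimensional inputs, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(xs, W_xh, W_hh, b_h)
for t, h in enumerate(states, start=1):
    print(t, [round(v, 3) for v in h])
```

Because the same matrix `W_hh` is applied at every step, the influence of an early input is multiplied by it repeatedly, which is one way to see why gradients can vanish or explode over long sequences.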
Because of vanishing gradients, an RNN struggles to make use of information from early in
a sequence. This obstructs the learning of long-term dependencies, making it
challenging for a model to use information from the distant past effectively (Hidasi et al., 2020).
Long Short-Term Memory (LSTM) seeks to overcome this RNN problem. As a type of RNN,
LSTM can capture and retain long-term dependencies in sequential input (Qin et al., 2023).
Besides, LSTMs can handle and examine sequential data, such as time series, text, and speech
(Zhang et al., 2023). LSTM achieves this by providing a network with mechanisms that regulate
indicates that LSTMs utilize “gates” to control the flow of information to the cell state and
subsequent network units. Navarro et al. (2020) clarified that these gates assign a value between
0 and 1. A value of 0 blocks the information entirely, whereas a value of 1 lets it pass in full.
The input gate selects the information that must be kept, while the forget gate decides what
should be removed, thus ensuring the network’s state stays current. The output gate governs the
such as gender (female, male), color (blue, purple, violet), or type of pet (cat, dog) (Qin et al.,
2023). Using previous information, LSTM can identify and pick the correct attributes in
categorical or continuous variables. By resolving these concerns, LSTM-based networks attain
enhanced performance. For this reason, chatbot design frequently incorporates LSTM variants,
such as the Gated Recurrent Unit (GRU), which combines the input and forget gates into a single
“update gate.” Although LSTMs can process variable-length sequences, they are typically
applied in tasks where the input and output sequences have the same length (e.g., sequence
classification, sequence labeling) (Hidasi et al., 2020).
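The gating mechanism can be illustrated with a one-unit LSTM cell. The sketch below assumes the standard LSTM equations (sigmoid gates, tanh candidate) and uses scalar weights for readability. All parameter names and values are hypothetical and do not come from the cited studies.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p is a dict of (hypothetical) weights."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate value
    c_t = f * c_prev + i * g        # keep part of the old cell state, add new content
    h_t = o * math.tanh(c_t)        # expose a gated view of the cell state
    return h_t, c_t, (f, i, o)

params = {
    "wf": 0.5, "uf": 0.5, "bf": 1.0,
    "wi": 0.5, "ui": 0.5, "bi": 0.0,
    "wo": 0.5, "uo": 0.5, "bo": 0.0,
    "wg": 0.5, "ug": 0.5, "bg": 0.0,
}

h, c = 0.0, 0.0
history = []
for x_t in [1.0, -1.0, 0.5]:
    h, c, gates = lstm_step(x_t, h, c, params)
    history.append((h, c, gates))
    print(round(h, 3), round(c, 3), [round(g, 3) for g in gates])
```

A gate value near 0 suppresses the quantity it multiplies, and a value near 1 passes it almost unchanged. This is how the cell decides what to forget, what to store, and what to output.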
Seq2Seq models are more flexible because they can handle input and output sequences of
different lengths (Raman et al., 2022). As a result, Palasundram et al. (2020) remarked that
Seq2Seq is considered more suited for tasks like dialogue systems, text summarization, and
machine translation. The Seq2Seq neural model was designed primarily for machine translation
tasks: it takes an input sentence in the source language and generates a translated output in the
target language. Seq2Seq has since been widely adopted for conversation modeling in chatbots
such as Meena by Google, DialoGPT by Microsoft, and BlenderBot by Facebook AI Research.
As such, this makes Seq2Seq an emerging standard approach for modern NLP tasks
The Seq2Seq model is typically built from recurrent neural networks such as LSTMs or
GRUs. These RNNs are used to model longer sentences, although transformer networks are used
as alternatives (Soltan et al., 2023). Figure 3 illustrates that the Seq2Seq model includes an
encoder, which processes the input sequence and condenses it into a vector, and a decoder, which
takes that vector and generates the desired output. The model aims to generate the most likely
response, considering the context provided by the previous turn or input sentence (Palasundram
et al., 2020). First, the input text is passed to the RNN encoder and processed word by word,
with each word updating the hidden state of the network. The encoder’s final state serves as the
context vector representing the input sentence. This vector is then fed into the decoder to
produce an output, one element at a time, using an appropriate probability function.
Figure 3: An attention-based seq2seq model (Shi et al., 2021)
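The encoder-decoder flow described above can be sketched as follows. This is a deliberately simplified illustration: it uses a plain RNN without attention, so it is simpler than the model in Figure 3. The parameters are random and untrained, so the generated tokens are meaningless. The vocabulary, sizes, and function names are hypothetical.

```python
import math
import random

random.seed(0)
VOCAB = ["<sos>", "<eos>", "hello", "how", "are", "you", "fine", "thanks"]
H = 4  # hidden size (illustrative)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x, h, W_x, W_h):
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Untrained, randomly initialised parameters (assumption for illustration only).
E = rand_matrix(len(VOCAB), H)                    # one embedding row per token
enc_Wx, enc_Wh = rand_matrix(H, H), rand_matrix(H, H)
dec_Wx, dec_Wh = rand_matrix(H, H), rand_matrix(H, H)
W_out = rand_matrix(len(VOCAB), H)                # hidden state -> vocabulary scores

def encode(tokens):
    """Read the input word by word; the final hidden state is the context vector."""
    h = [0.0] * H
    for tok in tokens:
        h = rnn_step(E[VOCAB.index(tok)], h, enc_Wx, enc_Wh)
    return h

def decode(context, max_len=6):
    """Generate one token at a time, greedily picking the most probable one."""
    h, tok, out = context, "<sos>", []
    for _ in range(max_len):
        h = rnn_step(E[VOCAB.index(tok)], h, dec_Wx, dec_Wh)
        probs = softmax(matvec(W_out, h))
        tok = VOCAB[probs.index(max(probs))]
        if tok == "<eos>":
            break
        out.append(tok)
    return out

context = encode(["how", "are", "you"])
reply = decode(context)
print(reply)
```

The decoder stops on its own end-of-sequence token or at `max_len`, so the reply length is independent of the input length. This is the property that distinguishes Seq2Seq from a single RNN that labels each input position.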
2.2.1.4 Deep Seq2Seq Models
A deep Seq2Seq model is a variant of the traditional Seq2Seq models, which incorporates
more layers or depth in its architecture’s encoder and decoder components (Yin & Wan, 2022).
Using multiple stacked layers of LSTM or GRU in the decoder and encoder components enables
the deep Seq2Seq model to learn more complex representations and capture long-range
dependencies in the input and output sequences (Soltan et al., 2023). Deep Seq2Seq facilitates
the generation of efficient chatbots that exhibit human-like intelligence characteristics. Lower
layers can capture basic characteristics, whereas higher layers can acquire more conceptual and
Figure 4 presents a multi-stage encoder-based Seq2seq deep learning model. During the
deep Seq2Seq model design process, the output is consistently propagated from the previous to
the next layers. The initial sequence is fed into the first encoder layer (Zhang & Deng, 2019).
Next, the LSTM encoder aims to convert each word into a vector. The output is sequentially
propagated via each LSTM encoder layer until it reaches the final LSTM encoder layer. From
there, it is directed to the first LSTM decoder layer (Zhang & Deng, 2019). The process
concludes with the target sequence computed by the corresponding probability function.
Figure 4: Multi-stage encoder-based Seq2Seq model (Zhang & Deng, 2019).
2.2.1.5 Transformer Models
Seq2Seq models have some limitations. Because RNNs process sequences one element at a
time, parallelization is difficult, resulting in slower training and inference. Long-range
dependencies are also a problem for Seq2Seq models due to the inherent limitations of their
recurrent nature, such as exploding or vanishing gradients. Since the encoded input
longer sequences. As a result, a Seq2Seq-based chatbot may give wrong answers. Transformer
models address this problem by employing self-attention mechanisms across the whole input
sequence. This enables transformers to preserve more detailed representations of the input
In 2017, a team of researchers from Google Brain introduced the Transformer
architecture in their seminal paper titled “Attention Is All You Need” (Vaswani et al., 2017). The
proposed architecture made five main contributions to model structure. First, the Transformer
introduced a self-attention mechanism that allows the model to weigh the importance of different
parts of the input sequence during encoding and decoding. This self-attention mechanism
captures longer-range dependencies more effectively than earlier neural network models (Berg
et al., 2021; Vaswani et al., 2017).
Second, the Transformer model lacks recurrent or convolutional layers. Thus, it relies on
positional encoding to incorporate information regarding the position of each token in the
sequence. Such positional encoding enables the model to comprehend the token order (Vaswani
et al., 2017). Third, multi-head attention enables the model to attend jointly to information from
different representation subspaces, which improves its ability to capture diverse contextual
information (Wang et al., 2020). Fourth, parallelization is possible in the Transformer
architecture since it does not rely on sequential operations as RNNs do. Such parallelization
contributes to faster training and inference on modern hardware (such as GPUs). Fifth, the
Transformer model is based on the encoder-decoder structure: the encoder maps the input
sequence to a continuous representation, while the decoder creates the output sequence from this
representation (Vaswani et al., 2017).
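One concrete form of positional encoding is the sinusoidal scheme proposed by Vaswani et al. (2017), in which even dimensions use a sine and odd dimensions a cosine of the position scaled by a power of 10000. The sketch below is a minimal pure-Python version. The function name and the sizes used are illustrative assumptions.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position signals."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):            # i runs over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimension: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimension: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is added to the embedding of the token at that position, so two identical words at different positions receive different input vectors even though the model has no recurrence.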
Figure 5 presents the architecture of the standard transformer proposed by Vaswani et al.
(2017). Each component in a sequence has its own representation. To determine the relationship
between these components, three matrices are computed: Q (query), K (key), and V (value).
These matrices are generated using linear projections of the input sequence. The query matrix
pertains to the current component, the key matrix pertains to the other components, and the
value matrix holds the information to be collected (Vaswani et al., 2017). As detailed by the
Google researchers, the similarity between the query and key matrices is determined by
calculating the dot product, which defines the correlation weight of the current item with the
others. Subsequently, the softmax function is applied to normalize the similarities so that the
weights for each item sum to 1 (Vaswani et al., 2017). The new weights are allocated to
incorporating information about the current word’s relationship with the others.
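The query-key-value computation can be made concrete with a small example of scaled dot-product attention, in which the dot products are divided by the square root of the key dimension before the softmax (Vaswani et al., 2017). In this sketch the Q, K, and V rows are supplied directly as toy numbers. In a real transformer they would come from learned linear projections of the input, which are omitted here.

```python
import math

def softmax(scores):
    """Normalise a list of scores into weights that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key: dot product scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with 2-dimensional toy vectors (hypothetical numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention repeats this computation several times with different projections and concatenates the results. The single-head version above is enough to show how each token's output becomes a mixture of the values of all tokens.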
advanced. These models gain a greater understanding of meaning by being trained on large
amounts of data. Greater comprehension allows these models to deliver more accurate and
complete representations of phrases or sentences in both the support and query sets. Multi-head
“heads.” Each head has its own learnable parameters and can focus on different types of
information, such as semantic or syntactic correlations (see Figure 6 for a visual representation).
The knowledge acquired by each head can be combined into a unified representation to get
is used as a sentence encoder related to “The hacker gained access to sensitive information by
executing a media-less attack on Android” (Ma et al., 2023, p. 103). The sentence is processed as
“[CLS] [e1] The hacker [/e1] gained access to [e2] sensitive information [/e2]. [SEP]”. After
concatenating these vectors to “[e1]” and “[e2]” places, BERT creates sentence representations
with entity information. Transformer and BERT present neural machine translation capabilities
that are useful for many NLP applications, creating a “foundation” architecture with multiple
implementations. Examples of this include Google’s BERT and OpenAI’s Generative Pre-
Trained Transformers.
Figure 6: Example of sentence encoder output showing how an input sentence is converted to a
numerical representation for processing (Ma et al., 2023).
2.2.2 Natural Language Processing
In the past seven decades, AI researchers have attempted to study how to improve human
2020). NLP techniques, such as those based on linguistic rules and knowledge bases, are better
equipped to handle these exceptions and irregularities than ANNs (which rely on statistical
patterns in the available data) (Geetha et al., 2023). NLP systems can also be improved by
integrating domain-specific knowledge, such as rule bases, ontologies, and lexicons
(Agarwal, 2019). In contrast, ANNs mainly rely on training data and cannot leverage external
The complex characteristics of human language have prompted the creation of numerous
advanced AI algorithms and strategies aimed at helping machines comprehend and manage
natural language communications. As an AI field, NLP helps researchers explore how computer
systems process and understand human language, spoken or written (Zhang & Zong, 2019). The
study of this discipline has provided useful knowledge in language technology, allowing for the
development of methods that enhance the efficient manipulation of human language and the
accomplishment of desired tasks. Generally, these strategies rely on machine learning techniques
that include natural language understanding (NLU) to comprehend the meaning of natural
language content and natural language generation (NLG) to produce meaningful, non-repetitive,
full, and precise natural language output (Jackson, 2020; Seo et al., 2021).
Given a text, NLU enables computers to understand its meaning. Once they comprehend the
meaning and context, computers can interact naturally with humans. When a human asks a
chatbot a question, NLU aims to comprehend the query. Such comprehension results in a
semantic representation of the supplied text (see Figure 7 for some aspects of text that NLU
captures). The representation is subsequently passed to other interconnected systems to
produce an appropriate response. NLU components are essential for implementing natural user
interfaces like chatbots (Rebala et al., 2019). The system attempts to extract context
and relevant information from the user’s unstructured statements and provide an appropriate
units called tokens, which can be words, phrases, or punctuation marks), parsing, intent detection
and categorization, and entity extraction (Khurana et al., 2023). These operations take into
account the surrounding context. Three components are important for NLU: intent, entity,
and context. First, intent refers to the user’s purpose or objective. The user’s input is
linked to the necessary actions that the chatbot must perform, which may involve parameters to
provide a more specific response. For example, Figure 8 displays structural analysis with NLP
(Dhaduk, 2023). Texts like “find the location of lecture” could be labeled as an intent to identify
where lectures occur in a given semester (Schmitt et al., 2019). NLP chatbot frameworks provide
pre-defined intents for common functions during conversations. A list of inputs can also be used
to create intents. Machine learning classification makes it possible to identify intents even when
the inputs are phrased differently from the pre-defined ones, making this model better
that should be mentioned in a user’s utterance and is tied to intent. In the sentence “find lectures,”
the user intends to get clarification about where lectures are being hosted or taking place in a
System-defined entities, such as the system entity “lecture,” are used to express conventional
date references like “12 May 2024” or “the 12th of May”. In contrast, developer-defined entities
Third, contexts are strings that include contextual information about the object the user
mentions. For instance, in Figure 9, an input like “get it resolved” denotes there could be an
allusion to a previously mentioned object, such as “Tom, the customer care is unable to give a
definite answer.” Specifically, NLP prioritizes the importance of keeping in mind the context of
a “definite answer” so that when the user says, “I hope this gets resolved,” the intention to “gets
resolved” can be associated with the context of “denied insurance claim.” In this context, NLU
technology displays important capabilities such as in-depth analysis, multiple phrase support,
rapid response and interpretation, ease of integration and usability options, and automated
feedback actions.
In the last two decades, research efforts have been growing to develop advanced language
intelligence for computer programs. A language model is a statistical model that represents the
probability distribution of a natural language or how phrases and words are presented in a text
(Laato et al., 2023). The modeling process entails utilizing the generative likelihood of a
sequence of words to estimate the chance of future or absent tokens (Doan et al., 2023) based on
a previous collection of unannotated texts (Cerf, 2023). Consequently, the language models can
forecast a word by considering its surrounding context (Brodnik et al., 2023). A common use of
language models (LMs) is in speech recognition and NLU, where they distinguish between words
and phrases that sound similar but have distinct meanings, such as “sent,” “scent,” and “cent,” or as the case with
interconnected experiments when he applied his statistical model to letter sequences in Eugene
Onegin, Alexander Pushkin’s tale in verse (Ahuja et al., 2023). These experiments established
the fundamental basis for Markov chains and Markov processes. Markov Models are statistical
models that adhere to the Markov property, which posits that the future state of a system is only
determined by its current state and is independent of the sequence of preceding events. Claude
letter sequencing in English text to illustrate his theory of information concerning actual
language (Abedi et al., 2023). Shannon introduced the concept of n-gram models.
symbols, numbers, words, and punctuation. The probabilities for N-gram models are typically
computed from the relative counts of N-grams in the training text. To illustrate, in a
bigram model, the likelihood of the word “soccer” occurring after the word “play” can be
determined by dividing the frequency of the bigram “play soccer” by the frequency of the
unigram “play” in the training dataset (Zheng et al., 2023). Figure 10 presents an illustration
of N-gram modeling, which includes unigrams (one word), bigrams (two words), and
N-1 words. In the 1980s, language modeling was mostly used in automatic speech recognition
systems to improve understanding of the relationship between words and the acoustic signal (Wu
et al., 2021). During the 1990s, researchers created statistical language models that relied on the
Markov assumption to predict a word depending on the word that came before it (Takemoto,
2023). These models were extensively embraced for various NLP applications, such as
part-of-speech tagging, machine translation, and optical character recognition. Models based on
the Markov assumption were also soon introduced and exploited for research in information retrieval.
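The count-based bigram estimate discussed in this section, in which the probability of “soccer” after “play” is the count of “play soccer” divided by the count of “play”, can be reproduced on a toy corpus. The corpus and the function name below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical miniature training text, already tokenised on whitespace.
corpus = "i play soccer . you play soccer . we play chess .".split()

unigrams = Counter(corpus)                      # count of each single word
bigrams = Counter(zip(corpus, corpus[1:]))      # count of each adjacent word pair

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev) = count(prev word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0                              # unseen context: no estimate available
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("play", "soccer"))
print(bigram_prob("play", "chess"))
print(bigram_prob("play", "tennis"))
```

In this corpus “play” occurs three times and is followed by “soccer” twice, so the first estimate is 2/3. The pair “play tennis” never occurs, so it receives probability 0, a small instance of the data sparsity that count-based models face.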
describe the syntax of a language. Chomsky (2011) argued that finite-state grammars, which
analyze text sequentially from left to right and are based on finite Markov chains or n-gram
models, have limited capabilities and are restricted in their ability to represent languages. The
“curse of dimensionality” (i.e., the need for an exponentially large number of transition
probabilities to capture the vast potential word sequences) potentially contributes to data sparsity
issues (Antoulas et al., 2024). As the language model’s complexity grows with the vocabulary
size and sequence length, accurately learning the parameters becomes increasingly challenging
due to the limited availability of training data covering all possible contexts. Figure 11 illustrates
the curse of dimensionality, showing that classifier performance decreases as the number of
Considering the curse of dimensionality, neural language models emerged in the early
2000s and showed promising outcomes in mitigating the challenges posed by dimensionality
issues and data scarcity in statistical language models (Antoulas et al., 2024). Neural networks
are utilized to predict the probability of word sequences, reducing the number of model
parameters (Gao et al., 2019). Specifically, neural networks are utilized in more intricate natural
language processing tasks, such as machine translation. During the initial phases of training
language models using deep learning techniques, researchers utilized recurrent neural networks
(RNNs) and incorporated long short-term memory (LSTM) neural networks due to their
advantageous gating mechanism. In addition, the use of distributed representation of words, also
known as word embeddings, in neural language models has been shown to improve the
efficiency of word representation (Hussain S. et al., 2023). This strategy involves incorporating
word vectors into the models, a powerful technique (Camacho et al., 2018; Guo et al., 2016).
representations. Word embedding is a technique that involves representing words and phrases as
real number vectors, as described by Chitty-Venkata et al. (2022). The vectors represent the
meaning and relationships between words in a way that places words with similar meanings
closer together in the vector space (Liu et al., 2021; Seo et al., 2021). Figure 12 displays the
representation of the word embedding space. The vector values are then fed into
the model to obtain syntactic and semantic information from textual data. Learning algorithms
subsequently use this knowledge for text processing. Numerous chatbot
systems have utilized word embedding techniques to address various natural language processing
tasks.
Figure 12: Visualization of the word embedding space (Liu et al., 2021)
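The idea that words with similar meanings lie closer together in the vector space is commonly measured with cosine similarity. The sketch below uses three hand-written 3-dimensional vectors. They are invented for illustration and are not taken from Word2Vec, GloVe, FastText, or any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; real ones typically have hundreds of dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

With these toy vectors, “cat” is far more similar to “dog” than to “car”, which mirrors the clustering of related words that a visualization of an embedding space is meant to convey.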
Mansurov and Mansurov (2020) employed word embeddings to analyze the Cyrillic
variety of the Uzbek language. The researchers trained these embeddings using the Word2Vec,
GloVe, and FastText methods on a high-quality web crawl corpus. Tulu (2022)
GloVe, and FastText to measure word-level semantic text similarity in Turkish. The results
revealed that GloVe and FastText word vectors exhibit superior correlation in word-level
and Spearman correlation in the SimTurk and AnlamVer datasets. Borah et al. (2021) found that
FastText is the most stable word embedding method, followed by GloVe and Word2Vec, and
that their stability impacts clustering and fairness evaluation in various datasets. These language
modeling methods have resulted in pre-trained word embeddings that are seen as initial
Efforts to pre-train models on machine learning tasks predate the wide adoption of deep
neural networks. Nevertheless, the significance of training language models on a broader scale
became prominent when pre-trained contextual embeddings were introduced. Nagda et al.
(2020) investigated the rise of pre-trained state-of-the-art language models. The researchers
suggest that pre-trained models such as ULMFiT, ELMo, BERT, Transformer-XL, XLNet, and
OpenAI’s GPT-2 have greatly enhanced natural language processing tasks such as speech
synthesis, phrase paraphrasing, and question answering (Nagda et al., 2020).
The development of computers with high computational power has facilitated increased uptake of
these models. In addition, the advent of the novel Transformer architecture has promoted
research and pre-training efforts focused on enhancing the language models (Kadavath et al.,
2021). Pre-trained language transformer-based models have garnered significant attention and
have been widely adopted as a standard solution for various NLP tasks. According to Ragusa et
al. (2020), they excel in encoding extensive linguistic knowledge from diverse data and
certain architecture, such as transformer-based (Chen et al., 2020). The model’s learning process
commences with pre-training, wherein a vast corpus of unlabeled text data is employed to train
the model parameters using unsupervised (or self-supervised) learning (Nagda et al., 2020).
Afterward, it proceeds to refine its performance for a particular task by utilizing labeled data
through supervised learning. This enables the adjustment of its parameters to optimize its
Pre-trained language models (PLMs) are language models that undergo self-supervised
training on extensive datasets. These pre-trained models are language prediction algorithms
based on neural networks created using the transformer architecture (Tufano et al., 2023). The
models take natural language queries, called prompts, and use their language comprehension to
predict the most suitable response. PLMs based on the transformer architecture are
grouped into three categories with distinct training objectives. First, decoder-only models are
trained using autoregressive methods. Second, encoder-only models are trained using Masked
Language Modeling. Third, encoder-decoder models are trained using Masked Language
Modeling or other denoising objectives (Li et al., 2024). Figure 13 illustrates the three classes of
Decoder-only models are pre-trained on a large collection of language data, often
including a significant amount of text from the internet. The main objective during this initial
training phase is to predict the next word in every text sequence (Wang et al., 2023). The
model predicts the next word by using multiple layers of masked multi-head self-attention.
The masking prevents the decoder from accessing future input words, so the prediction is
made unidirectionally, from left to right, and considers only the preceding
words (Zubiaga, 2024). A prominent example is the GPT series of models, which demonstrated
the capability to execute various NLP tasks with minimal need for fine-tuning, using few-shot or
Encoder-only models are composed of numerous layers stacked on each other, including
progressively enhance the representation of the input text (Kaneko, 2020). The greater the depth
of the stack, the more complex the comprehension of the language. These encoder-only models
employ Masked Language Modeling (MLM) to predict a masked word by considering every other
word in the sequence (Dalmia, 2023). Under the MLM training objective, randomly selected
word tokens are replaced with a mask token, and the model is trained to restore the original
words. The method works by collecting contextual information in both directions (from left to
right and from right to left).
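The masking step of MLM can be sketched as a small data-preparation routine. This is a simplified illustration: it replaces each selected token with `[MASK]` and records the original word as the prediction target. The `[MASK]` token and the 15% masking rate follow BERT, whose additional rule of sometimes substituting a random or unchanged token is omitted here. The function name, default seed, and example sentence are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; return the masked sequence and the prediction targets."""
    rng = random.Random(seed)             # fixed seed so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")       # the model sees only the mask token
            labels.append(tok)            # and must recover the original word
        else:
            masked.append(tok)
            labels.append(None)           # unmasked positions are not predicted
    return masked, labels

tokens = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

During pre-training, the encoder receives `masked` and is scored only on the positions where `labels` holds a word. It can use context on both sides of each gap because nothing in the input is hidden except the masked tokens themselves.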
The masking process for encoder-only models is similar to filling in the missing pieces of
a puzzle (Zubiaga, 2024). Google’s BERT is a representative example; it additionally uses a
next-sentence prediction objective to evaluate the relationship between sentences. The training
process takes pairs of sentences as inputs and learns to predict whether the second
sentence follows the first in the original text. This particular method of pre-training is highly
the study by Nguyen et al. (2020) are DistilBERT, XLM-RoBERTa, and XLNet. Encoder-only
language models are commonly employed for natural language understanding (NLU) tasks and
employed for a distinct type of sequence modeling, where the output sequence is an intricate
function of the complete input sequence. In this case, the training involves converting a sequence
of input words or tokens into tags that are not simply direct mappings from individual words
(Dalmia et al., 2019). The encoder-decoder class is pre-trained using masking or other techniques
to corrupt words in the input sequence and restore them in the output sequence, a process known
as denoising (Li et al., 2024). The approach consists of two subcategories. The first subcategory
In this version, the bidirectional encoder and the unidirectional decoder are pre-trained
simultaneously with joint model parameters (Fu et al., 2023). Some examples of the encoder-
Generator Networks, Recurrent Seq2Seq Models, and Multimodal Seq2Seq Models (Soltan et
abilities, making them important resources in several industries. Multiple use cases of
PLMs have been utilized in various Natural Language Processing (NLP) tasks, such as
conversation (Nguyen et al., 2020). Most chatbot applications rely on the GPT-2 transformer
proficiency in executing various conversational tasks (Wang et al., 2023). Meena chatbot by
Google has been reported to score 79% on Sensibleness and Specificity Average (SSA), which is
23% higher in absolute SSA than other chatbots (e.g., XiaoIce, Mitsuku) (Freitas et al., 2020).
Yet, if perplexity optimization is improved, Meena shows potential for human-level SSA scores
of up to 86% (Freitas et al., 2020). However, Zhou et al. (2018) reported that XiaoIce is a more
empathetic social chatbot with an average Conversation-turns Per Session (CPS) of 23,
Google indicated that its Meena model has 2.6 billion parameters and was trained on
341 GB of text filtered from public-domain social media conversations. Compared with the
existing state-of-the-art generative model, OpenAI GPT-2, Meena has 1.7 times the model
capacity and was trained on 8.5 times more data (Augustine, 2020). The process
of fine-tuning enhances pre-trained transformer models’ effectiveness in generating content that
is both safe and factually accurate (Mosin et al., 2023). Based on access to the substantial
dataset, fine-tuning improved the Meena model’s capacity to generate logical, precise, and
engaging replies.
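
The SSA score reported for Meena above is, as its name indicates, an average of two human-rated proportions: the share of responses judged sensible and the share judged specific. The sketch below computes it for a handful of made-up ratings. The function name `ssa`, the representation of each rating as a pair of booleans, and the sample data are assumptions for illustration only; the published evaluation protocol includes labeling details that are not reproduced here.

```python
def ssa(ratings):
    """Compute a Sensibleness and Specificity Average.

    ratings: list of (sensible, specific) booleans, one pair per rated response.
    Returns the mean of the sensible fraction and the specific fraction.
    """
    n = len(ratings)
    sensibleness = sum(1 for sensible, _ in ratings if sensible) / n
    specificity = sum(1 for _, specific in ratings if specific) / n
    return (sensibleness + specificity) / 2


# Hypothetical ratings: 3 of 4 responses sensible, 2 of 4 specific.
example = [(True, True), (True, False), (False, False), (True, True)]
print(ssa(example))  # (0.75 + 0.5) / 2 = 0.625
```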
the fluency of responses and contextual problems that arise from previous neural methods (Tay
et al., 2021). These models have achieved outstanding results in engagingness and human-like
interactions between the system and users (Yu & Ettinger, 2021). Their remarkable proficiency
them to engage in discussions with individuals across a wide range of subjects and exhibit
answers that resemble those of humans (Tay et al., 2021). These capabilities are crucial for
Using pre-trained neural language models on extensive corpora and refining data for
enhanced response quality, comprehension of context, and the development of more human-like
conversational characteristics for both task-oriented and non-task-oriented chatbots (Yu &
Ettinger, 2021). Nevertheless, there are still unresolved challenges that need to be addressed.
These challenges include concerns about user privacy, the need for more user engagement
through empathetic responses, and improving reasoning abilities to reduce inconsistencies (Liu et
al., 2023). Additionally, there are issues with repetitive responses in different conversations,
instances of knowledge hallucinations, and the need for safer responses that address toxic
tokens and extensive text corpora. Yet, these pre-trained LLMs undergo the same repetitive
training process once new data becomes accessible (Li et al., 2023). An alternative approach that
is considerably more effective involves scaling and consistently pre-training these LLM models,
language models (PLMs) such as LLaMA with 13B parameters, Chinchilla with 70B parameters,
GPT-3 with 175B parameters, and PaLM with 540B parameters (Touvron et al., 2023). These
experiments used publicly available datasets to investigate different factors, including dataset
size, model parameters, and training computation. The larger PLMs have demonstrated
remarkable performance in complex tasks compared to previous PLMs like BERT with 330M
Researchers have used the term “large language models” (LLMs) to refer to models that
have more than 100 billion parameters (Burgess, 2020). Some examples of LLMs include PanGu-Alpha
by Huawei with 200B, Megatron-Turing NLG by Microsoft and Nvidia with 530B, and Wu Dao
2.0 by Beijing Academy of AI with 1.75 trillion parameters (Liu et al., 2022; Zeng et al., 2021).
Additional examples with more than 100B include GPT-3 and GPT-4 by OpenAI, PaLM and
LaMDA by Google, and Galactica and LLaMA by Meta AI (Touvron et al., 2023). Figure 14
shows that LLMs are becoming increasingly large. A common characteristic of LLMs is that
their structure is founded on the transformer architecture. Also, these LLMs have the same pre-
LLMs employ multi-head attention layers within a deep neural network structure,
typically consisting of hundreds of billions of parameters. These models are trained on huge
volumes of textual data. Such scale is crucial because pre-training encodes general knowledge from
large-scale corpora into the model parameters (Touvron et al., 2023). The pre-training objective
has a significant impact on the generated text’s fluency in language models with the causal
decoder-only design, which has been the most widely used backbone by LLMs recently (Wang
et al., 2023). These models, which are typically trained with an autoregressive LM objective,
continue the text sequences by responding to prompts and predict the following words by
keeping track of all the words that have come before (Wang et al., 2022). Results indicate that
pre-training with the LM task leads to more advanced capabilities of LLMs and that it improves
performance in zero-shot and few-shot learning tasks, especially when combined with particular
Large pre-training tokens and LLM parameter scaling significantly improve arithmetic
reasoning, code generation, and instruction following. These capabilities are further improved by
supervised fine-tuning, where it has been shown that expanding the sizes of the models, datasets,
and total computation significantly improves the performance of LLMs (Zhou et al., 2022). In
particular, when a model’s parameter count reaches a certain threshold, there is a notable
and unexpected increase in performance, along with the appearance of so-called emergent abilities (Hahn &
Goyal, 2023). These abilities are specific to LLMs and allow them to carry out a variety of tasks.
According to Berglund et al. (2023), standard situations include step-by-step reasoning, in-
Scaled LLMs can provide output that more closely matches human responses to natural
language inquiries through the commonly used fine-tuning technique known as instruction-
tuning, which frequently results in human-level performance on a variety of testbeds (Zhou et al.,
2022). Instruction tuning involves supervised learning on labeled (input, output) pairs. Still, fine-
tuning can be accomplished using almost any machine learning paradigm, such as reinforcement
learning, semi-supervised learning, or extra self-supervised learning (Meng et al., 2023). The
In the case of ICL, the model can produce the desired output from only a small number of
natural language examples supplied in the prompt, without any additional model training or
modification (Wang et al., 2023). Similarly, step-by-step reasoning adds intermediate
reasoning steps to the ICL examples using the chain-of-thought prompting method,
allowing the model to solve problems without fine-tuning (Paranjape et al., 2023). These skills also fall
within the category of prompting techniques, which are covered in more detail in section 2.5.5.
Figure 15 illustrates key highlights of LLMs in terms of training sizes in typical chatbot
OpenAI’s GPT backend powers ChatGPT, an automated chatbot service. GPT relies on
an LLM that is comprised of four key components. These elements include a transformer
architecture, tokens, a context window, and a neural network. Therefore, ChatGPT is a suitable
case of a successful LLM-based chatbot that can be investigated to understand the concept of
prompts and prompt engineering, as well as the capabilities, risks, and limitations of ChatGPT.
2.5.1 OpenAI
According to the OpenAI charter and company history, the company was founded in December
2015 (OpenAI, 2024a). An American AI research organization, OpenAI seeks to develop
safe and beneficial artificial general intelligence. OpenAI (2024a) defines artificial general intelligence
as “highly autonomous systems that outperform humans at most economically valuable work.”
OpenAI was founded by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung,
Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and
Wojciech Zaremba (OpenAI, 2024b). The original founding members of
the OpenAI Board of Directors were Sam Altman and Elon Musk (Haque, 2023). OpenAI
consists of a non-profit, OpenAI Inc. registered in Delaware, and a for-profit subsidiary, OpenAI
Global LLC.
substantial investment from Microsoft that steadily grew in size. This change in funding has
raised concerns about the democratization of AI technology, which was previously emphasized
by the founders as a goal, as well as the company’s transparency strategy (Andhov et al., 2024).
In 2019, Microsoft made a $1 billion investment in OpenAI Global LLC, followed by $10 billion
in 2023 (Widder et al., 2023). Most of the investment was allocated to computational resources
on Microsoft’s Azure cloud service (Pedersen et al., 2024). In addition to its commercial
solutions, OpenAI is actively engaged in research aimed at developing AI systems that are both
LLMs, as shown by GPT-3, which demonstrates a remarkable capacity to generate coherent text. Other
notable models in this category include Whisper, which is used for Automatic Speech
Recognition (Vivek et al., 2023); Codex, which is used for coding tasks (Jackson & Sáenz,
2022); and DALL-E, which is used for generating images from natural language text (Ye et al.,
technique that utilizes rewards for desired activities and punishments for undesired actions
(Zhong et al., 2023). The goal is to train an intelligent agent to make optimal decisions in a given
environment (Berner et al., 2019). As part of this research, the company has developed algorithms such as
Trust Region Policy Optimization and Proximal Policy Optimization (Sun et al., 2023).
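
To make the idea of rewarding desired actions while limiting abrupt policy changes more concrete, the sketch below evaluates the clipped surrogate objective that is central to Proximal Policy Optimization for a single action. It is a minimal sketch under stated assumptions rather than any production training code: the function name `ppo_clipped_objective` and the clipping range `epsilon=0.2` are illustrative choices, `ratio` stands for the probability of the action under the new policy divided by its probability under the old policy, and `advantage` is positive for actions that turned out better than expected and negative otherwise.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    The probability ratio is clipped to [1 - epsilon, 1 + epsilon], and the
    smaller of the clipped and unclipped terms is kept, so the policy gains
    nothing from moving far away from the old policy in a single update.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A rewarded action (positive advantage): the gain is capped once ratio > 1.2.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# A penalized action (negative advantage): the penalty is not softened by clipping.
print(ppo_clipped_objective(ratio=1.5, advantage=-1.0))
```

In training, this quantity is averaged over many sampled actions and maximized with gradient ascent.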
comprehend and produce fluent and precise language (Gilson et al., 2023). The current section
discusses the four GPT models, starting from the initial version of GPT-1 and progressing to the
latest GPT-4. OpenAI introduced GPT-1 in 2018 as their initial implementation of a language
model based on the Transformer architecture (Sallam, 2023). The GPT-1 model consisted of 117
million parameters and represented a notable advance over the prior state-of-the-art
language models. A strength of GPT-1 was its proficiency in producing articulate
and coherent English in response to a given prompt or context (Frieder et al., 2023). The
model underwent training using two datasets: the Common Crawl, an extensive collection of web
pages containing billions of words (Sallam, 2023), and the BookCorpus dataset, which
comprised more than 11,000 books spanning various genres (Gilson et al., 2023). GPT-1 was
able to build robust language modeling skills by utilizing a wide range of information from the
two datasets.
was subject to some constraints. For instance, GPT-1 often produced repetitive text, mainly
when given prompts outside the scope of its training data (Frieder et al., 2023).
Additionally, GPT-1 could not engage in reasoning across numerous conversation exchanges or
keep track of long-term connections in written material (Sallam, 2023). Critics of GPT-1 also
observed that the coherence and fluidity of the language in this model were only evident in
shorter sequences, while longer passages lacked coherence. Despite these
shortcomings, GPT-1 served as the basis for developing more extensive and potent models
1. The GPT-2 model used 1.5 billion parameters, far surpassing the size of GPT-1, using 7,000
unique unpublished books and Byte-Pair Encoding (BPE) vocabulary of 40,000 tokens (Sallam,
2023). GPT-2 underwent training using a far more extensive and varied dataset, which involved
merging Common Crawl and WebText. An advantageous attribute of GPT-2 was its aptitude for
producing logically connected and lifelike sequences of textual content (Frieder et al., 2023).
Furthermore, GPT-2 could produce responses that resemble human outputs, rendering it a
valuable instrument for various natural language processing tasks, including content generation
and translation (Frieder et al., 2023). Nevertheless, GPT-2 had some limitations. It
performed poorly on tasks that demanded advanced cognitive abilities such as contextual
understanding and complex reasoning. Although GPT-2 handled short text fragments and
paragraphs well, it could not sustain context and
coherence when dealing with lengthier pieces (Frieder et al., 2023). These constraints facilitated
The field of NLP models made substantial progress with the release of GPT-3 in 2020.
The GPT-3 model used 175 billion parameters, making it roughly 1,500 times larger than GPT-1
(117 million) and more than 100 times larger than GPT-2 (1.5 billion). Unlike the previous GPTs, GPT-3 underwent
training using various data sources, such as BookCorpus, Common Crawl, and Wikipedia,
among other datasets (Gilson et al., 2023). The GPT-3 dataset consisted of around one trillion words,
enabling the model to produce advanced replies for many natural language processing tasks,
even without any prior task-specific examples (Frieder et al., 2023). GPT-3 exhibited
logically connected language, compose computer code, and construct artistic creations. GPT-3
could comprehend the context of a given text and produce suitable responses, which set it apart
GPT-3’s capacity to generate authentic text had significant ramifications for applications
such as chatbots, content generation, and language translation (Zain et al., 2023). An illustrative
recognized. Despite the remarkable capabilities of GPT-3, the model had some flaws, such as
providing biased, erroneous, or unsuitable responses. These errors resulted from the extensive
training of GPT-3 on vast quantities of text, which may include biased and erroneous
information (Vivek et al., 2023). Furthermore, critics noted that there were occasions where
GPT-3 produced entirely unrelated text in response to a prompt (Laato et al., 2023; Rudolph et
al., 2023; Sallam, 2023). This suggests that GPT-3 still struggled with comprehending
context and background information. The capabilities of GPT-3 have also sparked worries
regarding the ethical impacts and the abuse of these highly potent language models. Concerns
arise over the potential misuse of the model for reprehensible activities, such as fabricating false
information, crafting deceptive emails, and developing malicious software (Khadija et al., 2023).
instructions and provide appropriate replies accurately. The GPT-3 model’s ability to align with
preferences. RLHF was employed to assess the caliber of the text produced by language models
and assist them in enhancing their performance in subsequent prompts (Jackson & Sáenz, 2022).
The training improved GPT-3’s capacity to adhere to instructions while mitigating safety issues
by generating less toxic and harmful responses. These approaches gave GPT-3
refined capabilities (i.e., code processing and increased usability due to RLHF), from
which ChatGPT was further fine-tuned for dialogue. Figure 16 presents the training and
GPT-4, the most recent iteration of the GPT series, was released on March 14, 2023. The
current model, GPT-4, represents a notable improvement over its predecessor, GPT-3, which was
already outstanding (Nori et al., 2023). However, the exact details about the training data and
structure of GPT-4 have not been officially disclosed. Still, the assumption is that it leverages the
advantages of GPT-3 to address its shortcomings (Peng et al., 2023). GPT-4 demonstrates
Furthermore, the latest GPT-4 has a larger size and an expanded context window, the latter denoting the amount of
information the model can hold in memory when engaging in a conversation (Laato et al.,
2023). The model surpasses previous LLMs in identifying user intent and other modern fine-
A review by Nori et al. (2023) observed that GPT-4 has become a general-purpose large
language model that exceeds the passing score on USMLE by over 20 points. GPT-4 also
outperforms earlier models, including the ones fine-tuned explicitly for medical knowledge.
Peng et al. (2023) added that GPT-4 generates superior instruction-following data for large
language models, resulting in superior zero-shot performance on new tasks compared to previous
state-of-the-art models. Additionally, GPT-4 presents the concept of predictable scaling, which
aids in generating reliable predictions of the model’s behavior during training without extensive
computational resources. Lee (2023) reported that GPT-4 significantly enhances translation
accuracy and eliminates flaws in neural machine translation. The model also determines the ideal
equilibrium between hallucinations and fostering originality in GPT models, optimizing model
OpenAI introduced before the chatbot’s debut. GPT-3.5 is an enhanced iteration of GPT-3,
initially released in 2020. ChatGPT was launched in November 2022 with a specific focus on
conversing with users (Sallam, 2023). The model is refined through supervised learning by
having human AI trainers assume the roles of both users and AI agents. The trainers composed
their responses with the help of model-written suggestions, and the resulting dialogues were combined with the
InstructGPT dataset, converted into a dialogue format (Rudolph et al., 2023). Besides these approaches, the
rest of ChatGPT is trained using the same methods as its “sibling,” InstructGPT. While ChatGPT
abilities, the capability to identify context in multi-turn dialogues, and human alignment for
enhanced safety (Laato et al., 2023), the system demonstrated remarkable communication
capabilities with humans. Common instances include tasks such as categorizing text, rephrasing,
questions, creating code with explanatory comments, generating complex text like poems,
essays, or witty puns, and even imitating well-known individuals (Gilson et al., 2023).
ChatGPT was released in a research preview format, allowing users to examine and
explore its capabilities freely. After its inception, it gained significant public attention and
surpassed one million users within a week. Such rapid adoption can be attributed to its
remarkable conversational capabilities and reasoning skills across various topics (Wu et al.,
2023). Furthermore, users are encouraged to provide feedback via the interface to rate the chatbot’s
responses and report any improper replies or potential threats. This feedback is collected
and used to further train and refine the system, improving its conversational capabilities (Nori
et al., 2023). After a few months, OpenAI introduced applications for Android and iOS devices,
enabling voice input and syncing of conversational history (Casheekar et al., 2024).
OpenAI also introduced premium offerings such as ChatGPT Plus, based on user
subscriptions, and ChatGPT Enterprise for companies. ChatGPT Plus offers faster response
speeds and priority access to new and improved features for subscribers. Some examples of the benefits
are access to GPT-4, as well as the ability to use external and internal plugins (such
as the ChatGPT Browsing Plugin and ChatGPT Code Interpreter Plugin) to augment the
Enterprise is described as the most potent iteration of ChatGPT, providing users unrestricted
access to the latest capabilities. These features encompass unrestricted and faster usage of GPT-
4, extended input processing, business-oriented data privacy and security procedures, and
While the latest GPT-4 improves translation quality and removes significant errors in
neural machine translation, one of the concerns is that it also produces hallucinated edits
(Raunak et al., 2023). Prompt engineering seeks to get more from the latest models by developing and
refining prompts so that GPTs produce useful and relevant material. Through
repeated reformulation, the prompt cycle helps conversational models respond suitably to diverse
human input, enhancing user experience (Pehlivanoglu et al., 2023). Since its introduction
with GPT-3, prompting has helped improve pre-trained and fine-tuned LLMs. Through prompting,
A prompt plays a crucial role in setting the environment of a conversation, sifting through
the given information, and specifying the expected format and substance of an LLM’s output. An
LLM can produce more organized and refined solutions for different tasks when provided with
specific prompts encompassing norms and guidelines. Interactive prompting techniques have
methods include Chain-of-thought (CoT) and In-Context Learning (ICL) (Figure 17).
Figure 17: In-Context Learning (ICL) and Chain of Thought (CoT) (Polverini & Gregorcic, 2023).
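
The difference between the two prompting styles named above can be sketched as plain prompt strings. The helper names `build_icl_prompt` and `build_cot_prompt`, the `Q:`/`A:` layout, and the arithmetic example are assumptions made for this illustration; no particular chatbot API is called, and the resulting text could be passed to any LLM.

```python
def build_icl_prompt(task, demonstrations, query):
    """In-context learning: task description plus (question, answer) examples."""
    lines = [task, ""]
    for question, answer in demonstrations:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    lines += [f"Q: {query}", "A:"]  # the model is expected to continue from here
    return "\n".join(lines)


def build_cot_prompt(task, demonstrations, query):
    """Chain of thought: each example also spells out its intermediate reasoning."""
    lines = [task, ""]
    for question, reasoning, answer in demonstrations:
        lines += [f"Q: {question}", f"A: {reasoning} The answer is {answer}.", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)


task = "Answer the arithmetic question."
query = "A crate holds 5 bags with 6 apples each. How many apples are there?"

print(build_icl_prompt(
    task,
    [("A shelf holds 3 boxes with 4 books each. How many books are there?", "12")],
    query,
))
print()
print(build_cot_prompt(
    task,
    [("A shelf holds 3 boxes with 4 books each. How many books are there?",
      "There are 3 boxes and each holds 4 books, so 3 * 4 = 12.", "12")],
    query,
))
```

In both cases the model's parameters stay unchanged; only the text of the prompt differs, with the CoT variant demonstrating the reasoning steps the model is expected to imitate.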
ICL presents the model with a task description, optionally accompanied by a few demonstration examples,
in the form of a natural language prompt. Various strategies have been suggested to enhance the
effectiveness of the ICL capability. These strategies emphasize the significance of the
information in specific examples and the proper sequence and format of demonstrations.
Furthermore, presenting sufficient information for the task and ensuring its relevance to the
the reasoning ability of LLMs, hence enriching the input-output pairings of ICL (Polverini &
Gregorcic, 2023). The system offers more contextual information and simplifies the process of
commonsense reasoning, and symbolic reasoning, which involve reasoning step by step. While
CoT and ICL appear to enhance the performance of LLMs, there is still a requirement for more
efficient prompting mechanisms (White et al., 2023). Furthermore, the effectiveness of the
prompts supplied to the LLM is closely correlated with the quality of the generated outputs
(Pehlivanoglu et al., 2023). Therefore, the methods utilized to train LLMs through prompts,
Prompt engineering offers more possibilities for LLMs than just collecting plain text or
code samples. An appropriate prompt can initiate an entirely new set of interactions. For
example, prompts can direct an LLM to create a dataset with a specific format and a desired
number of entries, or to behave as a Linux terminal window. In addition, prompts have the potential to adjust
themselves, enabling them to suggest other prompts that can provide further information or
create relevant material (Raunak et al., 2023). Recent research has suggested utilizing prompt
patterns to develop reusable solutions for user task-related issues while interacting with
capabilities of ChatGPT by integrating diverse patterns that apply to various domains, such as
Since its inception in 2022, ChatGPT has revolutionized communication and information
retrieval by introducing a novel capability. Integrating text and code data-rich LLMs in natural
advanced tool with powerful NLP capabilities (Al-Khiami & Jaeger, 2023). Therefore, ChatGPT
is a productive tool with scalable and cutting-edge technology, offering various possibilities for
text creation, review, and analysis (Wu et al., 2023). ChatGPT can be employed either to execute
personal tasks or for work-related purposes. Considering that ChatGPT is trained on a diverse
and extensive corpus of textual data, including books, news stories, webpages, articles, forums,
and social media posts, the model understands and generates text on a wide range of topics,
Moreover, the training dataset used in ChatGPT enables it to understand and respond to
various linguistic inputs, deconstruct crucial information or complex concepts, and explain them
in a way that suits each user’s unique speaking style (Sallam, 2023). These qualities and
ChatGPT’s capacity to generate and modify code have made the model stand out as a practical
and accessible search engine that relies on dialogue rather than having users scroll through a list
of results (Rudolph et al., 2023). In addition to its machine learning optimization methods,
ChatGPT helps to simulate real-world dialogue features like remembering past conversations,
information, and assisting users in generating ideas through brainstorming (Laato et al., 2023;
ChatGPT also presents the opportunity to swiftly capture new insights, retain innovative
information, and learn from user interaction. ChatGPT improves its interpretation abilities
(Rudolph et al., 2023). Based on constant user interaction and engagement, ChatGPT adjusts to
various reactions based on feedback and new information, thereby implying that it becomes
Furthermore, by using external plugin mechanisms made possible by OpenAI, the chatbot
can use more tools or software to improve its functionality. External resources are crucial for
solving complicated issues and improving LLM performance (Raunak et al., 2023). For example,
ChatGPT can access fresh data using the web browser plugin. Furthermore, the open-source
retrieval plugin may learn from data sources by responding appropriately to prompts and queries.
New features like custom instructions have also been introduced to tailor ChatGPT usage.
Instead of repeating the conversational context and background information, users can give pre-
established guidelines that the chatbot will consider for the subsequent output. For example,
Figure 18 shows an example of customized instructions provided at the OpenAI website, where a
However, OpenAI has disclosed drawbacks for this sophisticated AI chatbot. ChatGPT
responses that may seem correct and sensible but are erroneous, incoherent, and unreliable (Lee,
2023). OpenAI has encountered challenges in addressing this problem despite improvements in
the ChatGPT Plus version. These difficulties arise from the supervised training process of the
model, and the reinforcement learning from human feedback (RLHF) used to align the model
with the provided demonstrations (Kulkarni et al., 2019). Moreover, ChatGPT has restricted
cognitive abilities, sometimes producing irrational responses (Casheekar et al., 2024). The
system has difficulty solving complex, and sometimes even simple, mathematical problems,
grasping the significance of words, responding to thorough inquiries, and acquiring real-world
The chatbot’s output can be significantly altered by its sensitivity to the wording of the
input, leading to potential inconsistencies (Kumar et al., 2020). Despite receiving identical
prompts, the model has the potential to produce varying replies. Furthermore, because the training
phase of the chatbot concluded in early 2022, it lacks knowledge of later events,
namely those occurring after January 2022. These concerns have prompted recommendations for