0% found this document useful (0 votes)
15 views44 pages

Chapter 2 Literature Review

Empowering E-Learning Platforms with LLM - based Conversational Chatbots: A New Era of AI-Driven Education.

Uploaded by

Robert Mortimer
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views44 pages

Chapter 2 Literature Review

Empowering E-Learning Platforms with LLM - based Conversational Chatbots: A New Era of AI-Driven Education.

Uploaded by

Robert Mortimer
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 44

2.

Artificial Intelligence Techniques in Chatbot Development

The focus of this chapter is to identify and discuss the AI methods employed in the

creation and implementation of chatbots. This chapter focuses on the field of natural language

processing (NLP), intending to discuss the features and progress of language models. The

chapter commences with a summary of AI-based chatbots, followed by design and

implementation techniques. Then, the concepts of pre-trained and fine-tuned language models

are detailed before elaborating on scaling the pre-trained language models. Further, the chapter

presents a detailed case of the launch of the widely-used LLM-powered chatbot, ChatGPT,

including its promising functionalities and current shortcomings.

2.1 Chatbots Powered by Artificial Intelligence

Contrary to rule-based systems, AI-based chatbots utilize machine learning models and

natural language processing technologies to obtain and analyze user input and generate responses

(Singh & Thakur, 2020). AI-based chatbots do not rely on pre-established responses for every

user statement, precise pattern matching, or creating new rules through human coding. Rather,

these chatbots undergo training using large databases of conversational data, employing machine

learning algorithms that analyze and learn from past interactions, eventually enhancing their

intelligence (Rudolph et al., 2023). Training large datasets using machine learning and NLP

methodologies improves the flexibility of chatbots, enabling them to encompass a wide range of

knowledge (Mageira et al., 2022). Notably, a holistic conversational context is considered

instead of focusing on the present statement each time. There are two types of AI chatbots:

Information-retrieval chatbots and Generative chatbots. However, hybrid chatbots also combine

elements from both methodologies.


Retrieval-based chatbots retrieve the most appropriate data from text content in response

to users’ inputs (Makhalova et al., 2019). Upon assessing the user input, information retrieval

chatbots pick the most suitable response from the existing answers. Typically, these chatbots rely

on a dataset of question-answer pairs to build their knowledge base. Subsequently, they create a

chat index containing a list of possible responses based on the user’s input (Rukhiran & Netinant,

2022). The user’s inputs or questions match the most relevant entries in the chat index, and the

chatbot provides the corresponding answer. The increased popularity of retrieval-based chatbots

can be attributed to the widespread use of social web or participatory web technology and the

significant access to online textual data (Khadija et al., 2023).

Chatbots that are based on information retrieval technology rely on pre-existing

responses to guarantee the production of high-quality or reliable outputs. Nevertheless, scholars

caution that building a rich knowledge base and training these models with significant

conversational data can be costly, laborious, and time-consuming (Rukhiran & Netinant, 2022).

According to Makhalova et al. (2019), the technique of retrieval-based chatbots may restrict the

capacity to develop chatbots with unique characters due to the limitation of response generation

only from a predetermined corpus. Such inflexibility may hinder the advancement of dynamic

and engaging conversational agents, especially chatbots designed for social encounters with

unique persona (Khadija et al., 2023). Overcoming these shortcomings requires employing novel

methods for constructing knowledge bases and generating responses that balance quality and

adaptability.

In contrast, generative chatbots are open-domain programs that generate original outputs

rather than selecting answers from pre-defined responses (Esfandiari et al., 2023). These chatbots

primarily utilize deep neural networks, such as Encoder/Decoder models, to produce original
responses based on user input (Zielinski et al., 2023). There is growing research interest in

generative chatbots, considering their promising reliability in open-domain discourse (Aydın &

Karaarslan, 2023). Training of generative chatbots occurs through large sentence datasets

obtained from real conversations. Subsequently, the training allows the algorithms to teach the

model how to generate responses that are linguistically coherent, accurate, and relevant (Mulla &

Gharpure, 2023). The next section discusses the chatbot design and implementation techniques.

2.2. Techniques for Designing and Implementing Chatbots

The focus of this section is to detail the techniques for designing and implementing

chatbots. Specific focus is limited to language modeling, natural language processing, and neural

network models such as transformers, deep Seq2Seq, and recurrent neural networks. AI-powered

chatbots exploit these techniques to understand user inputs, generate contextually relevant

responses, and engage in more natural conversations. As a result, chatbots go beyond

predetermined responses, adjusting their outputs in real time while including desirable

characteristics such as personality and emotional intelligence.

2.2.1. Neural Network Architectures

A neural network is an AI technique that enables computers to interpret data by

emulating the human brain’s cognitive processes (Gao et al., 2019). Neural networks are

machine learning (ML) and deep learning (DL) models used in generative-based and information

retrieval (IR)-based chatbot implementation techniques. Since they are trained on extensive

datasets with labeled or unlabeled information, these models can accurately represent the

connections and patterns between input and output data in natural language (Alaloul & Qureshi,

2020). As a result, artificial neural networks are capable of generating appropriate responses. In

information retrieval-based chatbots, the response is determined using probabilistic calculations


on the best appropriate reply in the neural network (Mangini et al., 2021). Generative models

provide replies by sequentially generating words based on probability calculations using the

given vocabulary (Mulla & Gharpure, 2023). Hybrid techniques include comparing candidate

responses and selecting the one that scored higher (Gao & Jiang, 2021). As discussed in

subsequent subsections, some examples of artificial neural networks include deep sequence-to-

sequence (seq2seq) models, transformers, seq2seq models, long short-term memory networks,

and recurrent neural networks.

2.2.1.1 Recurrent Neural Networks

A recurrent neural network (RNN) is an artificial neural network (ANN) specifically

designed to process time series or sequential time series data. An RNN, derived from ANNs and

modified from recursive neural networks, is designed to store and recall prior input sequences of

varying lengths (Lalapura et al., 2021). Unlike a typical feed-forward neural network, an RNN

passes these sequences to neurons for processing (Hidasi et al., 2020). Figure 1 shows RNN

architecture containing hidden layers (green blocks) that incorporate a looping mechanism,

allowing them to retain and transfer knowledge from previous inputs throughout the network.

Within each block, the blue circles are defined by vector a, hidden units or nodes, where the

number of nodes is decided by the hyper-parameter d (Khuong, 2019).

At a given time t, the RNN architecture considers events that occurred at t-1 by including

the h from the previous hidden state and the input x at time t (Khuong, 2019). The internal

memory can then be utilized for subsequent processing. RNNs can retain and utilize all past data

by storing the output of one layer and feeding it as input to another layer (Hidasi et al., 2020).

Furthermore, RNNs can collect data about the order of the input sequence, which might be

advantageous for its overall processing. However, during the training phase, the gradients
utilized for updating the weights of the RNN have the potential to diminish (vanishing) or

amplify significantly (exploding), making the learning process unstable and inefficient (Lalapura

et al., 2021).

Figure 1: RNN Architecture (Khuong, 2019)


2.2.1.2 Long Short-Term Memory Networks

Due to the problem of vanishing gradients in RNN, it could be challenging to search for

information history. As a result, this obscures the learning of long-term dependencies, making it

challenging for a model to use information from the distant past effectively (Hidasi et al., 2020).

Long Short-Term Memory (LSTM) seeks to overcome this RNN problem. As a type of RNN,

LSTM can capture and retain long-term dependencies in sequential input (Qin et al., 2023).

Besides, LSTMs can handle and examine sequential data, such as time series, text, and speech

(Zhang et al., 2023). LSTM achieves this by providing a network with mechanisms that regulate

information flow. Specifically, this involves integrating an additional layer of contextual

information, called the cell state, into the network.

Figure 2 presents a general architecture of the LSTM architecture. The illustration

indicates that LSTMs utilize “gates” to control the flow of information to the cell state and

subsequent network units. Navarro et al. (2020) clarified that these gates assign a value between

0 and 1. A numerical value of 0 signifies total failure, whereas a 1 signifies complete success.

The input gate selects the information that must be kept, while the forget gate decides what
should be removed, thus ensuring the network’s state stays current. The output gate governs the

magnitude of the output produced by the hidden layer.

An LSTM model is capable of remembering the nominal variables of previous objects

such as gender (female, male), color (blue, purple violet), or type of pet (cat, dog) (Qin et al.,

2023). Using previous information, LSTM can identify and pick the correct attributes in

categorical or continuous variables. LSTM ensures that its powered networks attain enhanced

performance by resolving these concerns. Considering their power, chatbot design frequently

incorporates LSTM modifications, such as the Gated Recurrent Unit (GRU), which combines the

input and forget gate to produce a single “update gate.” While LSTMs potentially process

variable-length sequences, they are applied in tasks where the input and output sequences have

the same length (e.g., sequence classification, sequence labeling) (Hidasi et al., 2020).

Figure 2: The General Architecture of LSTM Network (Navarro et al., 2020)


2.2.1.3 Sequence to Sequence (Seq2Seq) Models

Unlike LSTMs limited to handling variable-length sequences of the same length,

Seq2Seq models are more powerful and can handle input and output sequences of different

lengths (Raman et al., 2022). As a result, Palasundram et al. (2020) remarked that Seq2Seq is

considered more suited for tasks like dialogue systems, text summarization, and machine

translation. Moreover, the Seq2Seq neural model is a versatile, comprehensive, and creative
model designed primarily for machine translation jobs. Its functionality is based on taking an

input sentence in the source language and generating a translated output in the target language.

Seq2Seq application has been widely adopted in conversation modeling in chatbots such as

Meena by Google, DialoGPT by Microsoft, and Blender Bot by scholars at the University of

California. As such, this makes Seq2Seq an emerging standard approach for modern NLP tasks

(Soltan et al., 2023).

The Seq2Seq model consists of two types of recurrent neural networks: LSTMs and

GRUs. These RNNs are used to model longer sentences, although transformer networks are used

as alternatives (Soltan et al., 2023). Figure 3 illustrates that the Seq2Seq model includes an

encoder, which generates a vector to process the input sequence. Also, the Seq2Seq model has a

decoder that takes the vector and generates the desired output or results. This model aims to

generate the most likely response, considering the context provided by the previous turn or input

sentence (Palasundram et al., 2020). At first, the input text is passed to the previous RNN

encoder and is processed word by word in a concealed or hidden state of the network. The

encoder’s final state exposes the context vector representing the input sentence. This vector is

then fed into the decoder to produce an output, one element at a time, using an appropriate

probability function.
Figure 3: An attention-based seq2seq model (Shi et al., 2021)
2.2.1.4 Deep Seq2Seq Models

A deep Seq2Seq model is a variant of the traditional Seq2Seq models, which incorporates

more layers or depth in its architecture’s encoder and decoder components (Yin & Wan, 2022).

Using multiple stacked layers of LSTM or GRU in the decoder and encoder components enables

the deep Seq2Seq model to learn more complex representations and capture long-range

dependencies in the input and output sequences (Soltan et al., 2023). Deep Sed2Seq facilitates

the generation of efficient chatbots that acquire human-like intelligence characteristics. Lower

layers can capture basic characteristics, whereas higher layers can acquire more conceptual and

advanced representations, potentially enhancing the model’s performance on generative chatbot

tasks (Palasundram et al., 2020).

Figure 4 presents a multi-stage encoder-based Seq2seq deep learning model. During the

deep Seq2Seq model design process, the output is consistently propagated from the previous to

the next layers. The initial sequence is fed into the first encoder layer (Zhang & Deng, 2019).

Next, the LSTM encoder aims to convert each word into a vector. The output is sequentially

propagated via each LSTM encoder layer until it reaches the final LSTM encoder layer. From

there, it is directed to the first LSTM decoder layer (Zhang & Deng, 2019). The process

concludes with the target sequence computed by the corresponding probability function.
Figure 4: multi-stage encoder-based Seq2seq model (Zhang & Deng, 2019).
2.2.1.5. Transformer Models

Seq2Seq models have some limitations since RNNs process sequences sequentially,

which makes parallelization difficult, resulting in slower inference and training. Long-range

dependencies are also a problem with Seq2Seq2 models due to the inherent limitations of their

recurrent nature in sequences, such as exploding or vanishing gradients. Since the encoded input

sequence is fixed-length, the performance of Seq2Seq2 models decreases when exposed to

longer sequences. As a result, Seq2Seq2-based chatbot may give wrong answers. Transformer

models address this problem by employing self-attention mechanisms across the whole input

sequence. This enables transformers to preserve more detailed representations of the input

without reducing it to a fixed-length vector (Chitty-Venkata et al., 2022).

In 2017, a team of researchers from Google Brain introduced the concept of Transformer

architecture in their seminal paper titled “Attention Is All You Need” (Vaswani et al., 2017). The

proposed Transformer architecture by Google made five main innovations and contributions (i.e.,

self-attention, positional encoding, multi-head attention, parallelization, and encoder-decoder

structure). First, the transformer model introduced a self-attention mechanism that empowers the

model to weigh and determine various important sections of the input sequence during encoding
and decoding. Therefore, this self-attention mechanism captures longer-range dependencies more

effectively than other NNA models (Berg et al., 2021; Vaswani et al., 2017).

Second, the transformer architecture also introduced positional encoding. The

Transformer model lacks recurrent or convolutional layers. Thus, the transformer relies on

positional encoding to incorporate information regarding the position of each token in the

sequence. Such positional encoding enables the model to comprehend the token order (Vaswani

et al., 2017). Third, multi-head attention was another important addition by the transformer,

enabling the model to attend to information from different representation subspaces jointly. As

such, this improves the model’s ability to capture diverse contextual information (Wang et al.,

2020). Fourth, parallelization is possible in the transformer architecture since it does not rely on

sequential operations such as RNNs. Such parallelization contributes to faster training and

inference on modern hardware (such as GPUs). Fifth, the Transformer model is based on the

encoder-decoder structure. In this process, the encoder maps the input sequence to a continuous

representation while the decoder creates the output sequence from this representation (Vaswani

et al., 2017).

Figure 5 presents the architecture of the standard transformer proposed by Vaswani et al.

(2017). Each component in a sequence has its representation. To determine the relationship

between these components, the matrices that should be computed include the Q (query), K (key),

and V (value). These matrices are generated using linear representations of the input sequence.

The query matrix pertains to the current component, the key matrix pertains to other components,

and the value matrix encompasses information that must be collected (Vaswani et al., 2017). As

detailed by Google researchers, the similarity between the query and key matrices is determined

by calculating the dot product, which defines the correlation weight of the current item with the
others. Subsequently, the SoftMax function is applied to standardize the similarity so that the

total of each correlation amounts to 1 (Vaswani et al., 2017). The new weights are allocated to

the corresponding values, accumulating, resulting in a comprehensive representation

incorporating information about the current word’s relationship with the others.

Figure 5: Architecture of the Standard Transformer (Vaswani et al., 2017)


Pre-trained language models such as Transformer and BERT have become more

advanced. These models gain a greater understanding of meaning by being trained on large

amounts of data. Greater comprehension allows these models to deliver more accurate and

complete representations of phrases or sentences in both the support and query sets. Multi-head

self-attention in a model is achieved by applying the attention mechanism in many parallel

“heads.” Each head has its learnable parameters and can focus on different types of data

information, such as semantic or syntactic correlations (see Figure 6 for a visual representation).

The knowledge acquired by each individual can be combined into a unified representation to get

the intended result.


In Figure 6, Google’s Bidirectional Encoder Representations from Transformers (BERT)

is used as a sentence encoder related to “The hacker gained access to sensitive information by

executing a media-less attack on Android” (Ma et al., 2023, p. 103). The sentence is processed as

“[CLS] [e1] The hacker [/e1] gained access to [e2] sensitive information [/e2]. [SEP]”. After

concatenating these vectors to “[e1]” and “[e2]” places, BERT creates sentence representations

with entity information. Transformer and BERT present neural machine translation capabilities

that are useful for many NLP applications, creating a “foundation” architecture with multiple

implementations. Examples of this include Google’s BERT and OpenAI’s Generative Pre-

Trained Transformers.

Figure 6: Example of sentence encoder output showing how an input sentence is converted to a
numerical representation for processing (Ma et al., 2023).
2.2.2 Natural Language Processing

In the past seven decades, AI researchers have attempted to study how to improve human

communication. Natural human languages are irregular and context-dependent (Chowdhary,

2020). NLP techniques, such as those based on linguistic rules and knowledge bases, are better

equipped to handle these exceptions and irregularities than ANNs (which rely on statistical

patterns in availed data) (Geetha et al., 2023). NLP systems are also improved because it is
possible to integrate domain-specific knowledge, such as rule bases, ontologies, and lexicons

(Agarwal, 2019). In contrast, ANNs mainly rely on training data and cannot leverage external

knowledge sources (Gao et al., 2019).

The complex characteristics of human language have prompted the creation of numerous

advanced AI algorithms and strategies aimed at helping machines comprehend and manage

natural language communications. As an AI field, NLP helps researchers explore how computer

systems process and understand human language, spoken or written (Zhang & Zong, 2019). The

study of this discipline has provided useful knowledge in language technology, allowing for the

development of methods that enhance the efficient manipulation of human language and the

accomplishment of desired tasks. Generally, these strategies rely on machine learning techniques

that include natural language understanding (NLU) to comprehend the meaning of natural

language content and natural language generation (NLG) to produce meaningful, non-repetitive,

full, and precise natural language output (Jackson, 2020; Seo et al., 2021).

Given a text, NLU enables computers to understand the meaning of the text. Then,

computers may interact naturally with humans once they comprehend the meaning and context.

When a human asks a chatbot a question, NLU aims to comprehend the query. Such

comprehension results in a semantic depiction of the supplied text (see Figure 7 on some aspects

of text NLU understands). The representation is subsequently inputted into other interconnected

systems to provide an appropriate reaction. NLU components are essential for implementing

natural user interfaces like chatbots (Rebala et al., 2019). The system attempts to extract context

and relevant information from the user’s unstructured statements and provide an appropriate

response based on the user’s purpose.


Figure 7: Some aspects of text that NLU understands (Waldron, 2015)
The activities NLU executes include text tokenization (breaking down a text into smaller

units called tokens, which can be words, phrases, or punctuation marks), parsing, intent detection

and categorization, and entity extraction (Khurana et al., 2023). These operations take into

account the surrounding context. Three components important for NLU include intent, entity,

and context. First, intent refers to manifesting the user’s purpose or objective. The user’s input is

linked to the necessary actions that the chatbot must perform, which may involve parameters to

provide a more specific response. For example, Figure 8 displays structural analysis with NLP

(Dhaduk, 2023). Texts like “find the location of lecture” could be labeled as an intent to identify

where lectures occur in a given semester (Schmitt et al., 2019). NLP chatbot framework avails

pre-defined intent for common functions during conversations. A list of inputs can also be used

to create intents. NLP machine learning classification makes it possible to identify intents,

although the inputs have different expressions from pre-defined ones, making this model better

than rule-based grammar.


Figure 8: A customized NLU and corresponding labeled dataset to train and test the system
(Schmitt et al., 2019)
Second, an entity, often called a slot, is another important piece of information in NLP

that should be mentioned in a user’s utterance and is tied to intent. In the sentence “find lectures,

“the user intends to get clarification about where lectures are being hosted or taking place in a

specific semester. Entities can be categorized as either system-defined or developer-defined.

System-defined entities, such as the system entity “lecture,” are used to express conventional

date references like “12 May 2024” or “the 12th of May”. In contrast, developer-defined entities

are created by developers.

Third, contexts are strings that include contextual information about the object the user

mentions. For instance, in Figure 9, an input like “get it resolved” denotes there could be an

allusion to a previously mentioned object, such as “Tom, the customer care is unable to give a

definite answer.” Specifically, NLP prioritizes the importance of keeping in mind the context of

a “definite answer” so that when the user says, “I hope this gets resolved,” the intention to “gets

resolved” can be associated with the context of “denied insurance claim.” In this context, NLU

technology displays important capabilities such as in-depth analysis, multiple phrase support,
rapid response and interpretation, ease of integration and usability options, and automated

feedback actions.

Figure 9: Structural analysis of NLP (Dhaduk, 2023)


2.2.3 Language Modeling

In the last two decades, research efforts have been growing to develop advanced language

intelligence for computer programs. A language model is a statistical model that represents the

probability distribution of a natural language or how phrases and words are presented in a text

(Laato et al., 2023). The modeling process entails utilizing the generative likelihood of a

sequence of words to estimate the chance of future or absent tokens (Doan et al., 2023) based on

a previous collection of unannotated texts (Cerf, 2023). Consequently, the language models can

forecast a word by considering its surrounding context (Brodnik et al., 2023). A common use of

LMs is in speech recognition and NLU to distinguish between words and phrases that have

similar sounds but distinct meanings, such as “sent,” “scent,” and “cent,” or as the case with

“pair” and “pear,” and “advice” and “advise.”


In 1906, Andrey Markov released a paper analyzing the characteristics of sequences of

interconnected experiments when he applied his statistical model to letter sequences in Eugene

Onegin, Alexander Pushkin’s tale in verse (Ahuja et al. (2023). These experiments established

the fundamental basis for Markov chains and Markov processes. Markov Models are statistical

models that adhere to the Markov property, which posits that the future state of a system is only

determined by its current state and is independent of the sequence of preceding events. Claude

Shannon employed a similar methodology rooted in probability theory, utilizing simulations of

letter sequencing in English text to illustrate his theory of information concerning actual

language (Abedi et al., 2023). Shannon introduced the concept of n-gram models.

N-gram models relate to a collection of n successive items in a text document, including

symbols, numbers, words, and punctuation. The probabilities for N-gram models are typically

computed using the proportion of counts of N-grams in the training text. To illustrate, in a

bigram model, the likelihood of the word “soccer” occurring after the word “play” can be

determined by dividing the frequency of the bigram “play soccer” by the occurrence of the

unigram “play” in the instructional dataset (Zheng et al., 2023). Figure 10 presents an Illustration

of the N-gram modeling, which includes unigrams (one word), bigrams (two words), and

trigrams (three words) (Agarwal, 2024).


Figure 10: Illustration of the N-gram modeling (Agarwal, 2024)
N-gram modeling predicts the most often occurring words that can follow a sequence of

N-1 words. In the 1980s, language modeling was mostly used in automatic speech recognition

systems to improve understanding of the relationship between words and the acoustic signal (Wu

et al., 2021). During the 1990s, researchers created statistical language models that relied on the

Markov assumption to predict a word depending on the word that came before it (Takemoto,

2023). These models were extensively embraced for various NLP applications, such as part-of-

speech tagging, machine translation, and optical character recognition. The Markov assumption

models were quickly introduced and exploited for research in information retrieval.

Noam Chomsky proposed a hierarchy of grammars based on formal language theory to

describe the syntax of a language. Chomsky (2011) argued that finite-state grammars, which

analyze text sequentially from left to right and are based on finite Markov chains or n-gram

models, have limited capabilities and are restricted in their ability to represent languages. The

“curse of dimensionality” (i.e., the need for an exponentially large number of transition

probabilities to capture the vast potential word sequences) potentially contributes to data sparsity

issues (Antoulas et al., 2024). As the language model’s complexity grows with the vocabulary

size and sequence length, accurately learning the parameters becomes increasingly challenging
due to the limited availability of training data covering all possible contexts. Figure 11 illustrates

the curse of dimensionality, showing that classifier performance decreases as the number of

features or dimensions increases.

Figure 11: Curse of Dimensionality (Follow, 2022)

Considering the curse of dimensionality, neural language models emerged in the early

2000s and showed promising outcomes in mitigating the challenges posed by dimensionality

issues and data scarcity in statistical language models (Antoulas et al., 2024). Neural networks

are utilized to predict the probability of word sequences, reducing the number of model

parameters (Gao et al., 2019). Specifically, neural networks are utilized in more intricate natural

language processing jobs, such as machine translation. During the initial phases of training

language models using deep learning techniques, researchers utilized recurrent neural networks

(RNNs) and incorporated long short-term memory (LSTM) neural networks due to their

advantageous gating mechanism. In addition, the use of distributed representation of words, also

known as word embeddings, in neural language models has been shown to improve the
efficiency of word representation (Hussain S. et al., 2023). This strategy involves incorporating

word vectors into the models, a powerful technique (Camacho et al., 2018; Guo et al., 2016).

To process a user’s speech, a neural network must convert it into numerical

representations. Word embedding is a technique that involves representing words and phrases as

real number vectors, as described by Chitty-Venkata et al. (2022). The vectors represent the

meaning and relationships between words in a way that places words with similar meanings

closer together in the vector space (Liu et al., 2021; Seo et al., 2021). Figure 12 displays the

representation of the word embedding space. The vector’s extracted values are then inputted into

the model to obtain syntactic and semantic information from textual data. The obtained

knowledge will then be utilized by learning algorithms for text processing. Numerous chatbot

systems have utilized word embedding techniques to address various natural language processing

tasks.

Figure 12: Visualization of the word embedding space (Liu et al., 2021)
Mansurov and Mansurov (2020) employed word embeddings to analyze the Cyrillic

variety of the Uzbek language. The researchers trained these embeddings using word2vec,

GloVe, and FastText methods, utilizing a web crawl corpus of excellent quality. Tulu (2022)

conducted an experimental evaluation of pre-trained word embedding vectors from Word2Vec,

Glove, and FastText to measure word-level semantic text similarity in Turkish. The results

revealed that Glove and FastText word vectors exhibit superior correlation in word-level

semantic similarity in Turkish. FastText, in particular, demonstrates improved word coverage

and Spearman correlation in the SimTurk and AnlamVer datasets. Borah et al. (2021) found that

FastText is the most stable word embedding method, followed by GloVe and Word2Vec, and

their stability impacts clustering and fairness evaluation in various datasets. These language

modeling methods have resulted in pre-trained word embeddings that are seen as initial

advancements toward implementing pre-training complete models for language representation

(Rodriguez & Spirling, 2021).

2.3 Pre-Training and Fine-Tuning Language Models

In the initial stages of training, there were efforts to pre-train on machine learning tasks

before deep neural networks became widely adopted. Nevertheless, the significance of training

language models on a broader scale became prominent when pre-trained contextual embeddings

were introduced. Nagda et al. (2020) investigated the rise of pre-trained state-of-the-art language

models. The researchers suggest that pre-trained models such as ULMFiT, ELMo, BERT,

Transformer-XL, XLNet, and OpenAI’s GPT-2 have greatly enhanced natural language

processing tasks such as speech synthesis, phrase paraphrasing, and question answering (Nagda

et al.., 2020).
Developing computers with high computational power has facilitated increased uptake of

these models. In addition, the advent of the novel Transformer architecture has promoted

research and pre-training efforts focused on enhancing the language models (Kadavath et al.,

2021). Pre-trained language transformer-based models have garnered significant attention and

have been widely adopted as a standard solution for various NLP tasks. According to Ragusa et

al. (2020), they excel in encoding extensive linguistic knowledge from diverse data and

generating powerful universal contextualized language representations.

A pre-trained language model is created by implementing a language model with a

certain architecture, such as transformer-based (Chen et al., 2020). The model’s learning process

commences with pre-training, wherein a vast corpus of unlabeled text data is employed to train

the model parameters using unsupervised (or self-supervised) learning (Nagda et al., 2020).

Afterward, it proceeds to refine its performance for a particular task by utilizing labeled data

through supervised learning. This enables the adjustment of its parameters to optimize its

performance (Xu et al., 2021).

2.3.1 Pre-Trained Transformer Language Models and Training Objectives

Pre-trained language models (PLMs) are language models that undergo self-supervised

training on extensive datasets. These pre-trained models are language prediction algorithms

based on neural networks created using the transformer architecture (Tufano et al., 2023). The

models utilize natural language queries, called prompts, to analyze and forecast the most optimal

response using their language comprehension. PLMs based on the transformer architecture are

grouped into three categories with precise training objectives. First, decoder-only models are

trained using autoregressive methods. Second, encoder-only models are learned using Masked

Language Modeling. Third, encoder-decoder models are trained using Masked Language
Modeling or other denoising objectives (Li et al., 2024). Figure 13 illustrates the three classes of

pre-trained models and their training objectives.

Decoder-only models are trained in advance on a large collection of language data, often

including a significant amount of text from the internet. The main objective during this initial

training phase is to forecast the subsequent word in every text sequence (Wang et al., 2023). The

model can predict the next word in a sequence by utilizing multiple layers of multi-head self-

attention with masking. Using this process prevents the decoder from accessing future input

words. The prediction is made unidirectionally, from left to right, and considers the preceding

words (Zubiaga, 2024). An exemplary instance is the GPT series of models, which demonstrated

the capability to execute various NLP tasks with minimal need for fine-tuning, using few-shot or

zero-shot settings (Min et al., 2024).

Encoder-only models are composed of numerous layers stacked on each other, including

self-attention mechanisms and feed-forward neural networks. The layers collaborate to

progressively enhance the representation of the input text (Kaneko, 2020). The greater the depth

of the stack, the more complex the comprehension of the language. These encoder-only models

employ Masked Language Modeling (MLM) to predict a masked word by considering each word

in the sequence (Dalmia, 2023). During the training objective of MLM, a mask token is used

randomly to mask word tokens. The masked words are then restored and predicted later on. The

method is implemented by collecting contextual information in both directions (from left to right

and right to left) to make predictions.


Figure 13: Classes of pre-trained models and their training objectives (Li et al., 2024)

The masking process for encoder-only models is similar to filling in the missing pieces of

a puzzle (Zubiaga, 2024). Google BERT is a suitable case for the model that utilizes the next-

sentence prediction objective to evaluate the relationship between sentences. The training

process entails using pairs of sentences as inputs to learn how to predict whether the second

sentence follows the first in the original text. This particular method of pre-training is highly

applicable for activities involving question-answering. Additional language models mentioned in

the study by Nguyen et al. (2020) are DistilBERT, XLM-RoBERTa, and XLNet. Encoder-only

language models are commonly employed for natural language understanding (NLU) tasks and

are seldom deployed without undergoing fine-tuning (Wang et al., 2023).

Encoder-decoder models, also known as sequence-to-sequence (seq2seq) models, are

employed for a distinct type of sequence modeling, where the output sequence is an intricate

function of the complete input sequence. In this case, the training involves converting a sequence

of input words or tokens into tags that are not simply direct mappings from individual words

(Dalmia et al., 2019). The encoder-decoder class is pre-trained using masking or other techniques

to corrupt words in the input sequence and restore them in the output sequence, a process known

as denoising (Li et al., 2024). The approach consists of two subcategories. The first subcategory

involves a bidirectional encoder and a unidirectional decoder with distinct parameters. As


mentioned earlier, the second subcategory has a unified version of the encoder-decoder structure.

In this version, the bidirectional encoder and the unidirectional decoder are pre-trained

simultaneously with joint model parameters (Fu et al., 2023). Some examples of the encoder-

decoder models include Convolutional Seq2Seq Models, Google’s Transformer, Pointer

Generator Networks, Recurrent Seq2Seq Models, and Multimodal Seq2Seq Models (Soltan et

al., 2023; Tufano et al., 2023).

2.3.2 Fine-Tuned Language Models for Conversation

Fine-tuning language models for chatbot applications improve their conversational

abilities, making them important resources in several industries. Multiple use cases of

PLMs have been utilized in various Natural Language Processing (NLP) tasks, such as

conversation (Nguyen et al., 2020). Most chatbot applications rely on the GPT-2 transformer

language model developed by OpenAI, including DialoGPT, which demonstrates impressive

proficiency in executing various conversational tasks (Wang et al., 2023). Meena chatbot by

Google has been reported to score 79% on Sensibleness and Specificity Average (SSA), which is

23% higher in absolute SSA than other chatbots (e.g., XiaoIce, Mitsuku) (Freitas et al., 2020).

Yet, if perplexity optimization is improved, Meena shows potential for human-level SSA scores

of up to 86% (Freitas et al., 2020). However, Zhou et al. (2018) reported that XiaoIce is a more

empathetic social chatbot with an average Conversation-turns Per Session (CPS) of 23,

significantly higher than other chatbots and even human conversations.

Google indicated that up to 2.6 billion parameters were used for its Meena model,

comprising 341 GB of text. The texts were filtered from social media conversations in the public

domain. Meena surpasses the state-of-the-art generative model, OpenAI GPT-2, with a model

capacity of 1.7 times greater and training data of 8.5 times more (Augustine, 2020). The process
of fine-tuning enhances pre-trained transformer models’ effectiveness in generating content that

is both safe and factually accurate (Mosin et al., 2023). Based on access to the substantial

dataset, fine-tuning improved the Meena model’s capacity to generate logical, precise, and

engaging replies.

Refined pre-trained transformer models have proven effective in addressing limitations in

the fluency of responses and contextual problems that arise from previous neural methods (Tay

et al., 2021). These models have achieved outstanding results in engagingness and human-like

qualities in multi-turn dialogues, which involve extended conversations with back-and-forth

interactions between the system and users (Yu & Ettinger, 2021). Their remarkable proficiency

in comprehending language, accurately recognizing voices, and generating responses enables

them to engage in discussions with individuals across a wide range of subjects and exhibit

answers that resemble those of humans (Tay et al., 2021). These capabilities are crucial for

fostering a robust and more reliable user connection.

Using pre-trained neural language models on extensive corpora and refining data for

dialogue through fine-tuning have undoubtedly resulted in significant advancements in

conversational chatbot development. These advancements are especially noticeable in terms of

enhanced response quality, comprehension of context, and the development of more human-like

conversational characteristics for both task-oriented and non-task-oriented chatbots (Yu &

Ettinger, 2021). Nevertheless, there are still unresolved challenges that need to be addressed.

These challenges include concerns about user privacy, the need for more user engagement

through empathetic responses, and improving reasoning abilities to reduce inconsistencies (Liu et

al., 2023). Additionally, there are issues with repetitive responses in different conversations,
instances of knowledge hallucinations, and the need for safer responses that address toxic

language use and gender bias (Semnani et al., 2023).

2.4 Scaling Pre-Trained Lage Language Models

Large language models (LLMs) are commonly subjected to pre-training on billions of

tokens and extensive text corpora. Yet, these pre-trained LLMs undergo the same repetitive

training process once new data becomes accessible (Li et al., 2023). An alternative approach that

is considerably more effective involves scaling and consistently pre-training these LLM models,

resulting in substantial computational savings compared to re-training (Liu et al., 2023).

Growing research interest in scaling inspired successful experimentation on larger-sized

language models (PLMs) such as LLaMA with 13B parameters, Chinchilla with 70B parameters,

GPT-3 with 175B parameters, and PaLM with 540B parameters (Touvron et al., 2023). These

experiments used publicly available datasets to investigate different factors, including dataset

size, model parameters, and training computation. The larger PLMs have demonstrated

remarkable performance in complex tasks compared to previous PLMs like BERT with 330M

parameters and GPT-2 with 1.5B parameters.

Researchers have used the term “large language models” (LLMs) to refer to models that

have a size of more than 100B (Burgess, 2020). Some examples of LLMs include PanGu-Alpha

by Huawei with 200B, Megatron-Turing NLG by Microsoft and Nvidia with 530B, and Wu Dao

2.0 by Beijing Academy of AI with 1.75 trillion parameters (Liu et al., 2022; Zeng et al., 2021).

Additional examples with more than 100B include GPT-3 and GPT-4 by OpenAI, PaLM, and

LaMDA by Google, and Galactica and LlaMA by Meta AI (Touvron et al., 2023). Figure 14

shows that LLMs are becoming increasingly large. A common characteristic of LLMs is that
their structure is founded on the transformer architecture. Also, these LLMs have the same pre-

training language as smaller models less than 100B.

Figure 14: Dynamic increase and growth in the size of LLMs

LLMs employ multi-head attention layers within a deep neural network structure,

typically consisting of hundreds of billions of parameters. These models are trained on huge

volumes of textual data. It is crucial because pre-training incorporates general information from

large-scale corpus into the model parameters (Touvron et al., 2023). The pre-training objective

has a significant impact on the generated text’s fluency in language models with the causal

decoder-only design, which has been the most widely used backbone by LLMs recently (Wang

et al., 2023). These models, which are typically trained with an autoregressive LM objective,

continue the text sequences by responding to prompts and predict the following words by

keeping track of all the words that have come before (Wang et al., 2022). Results indicate that

pre-training with the LM task leads to more advanced capabilities of LLMs and that it improves

performance in zero-shot and few-shot learning tasks, especially when combined with particular

fine-tuning strategies (Meng et al., 2023; Rethmeier & Augenstein, 2020).


2.4.1. Resultant Capabilities of LLMs After Scaling

Large pre-training tokens and LLM parameter scaling significantly improve arithmetic

reasoning, code generation, and instruction following. These capabilities are further improved by

supervised fine-tuning, where it has been shown that expanding the sizes of the models, datasets,

and total computation significantly improves the performance of LLMs (Zhou et al., 2022). In

particular, when the models’ parameter scale level reaches a specific threshold, there is a notable

and unexpected increase in performance and the appearance of some mysterious skills (Hahn &

Goyal, 2023). These skills are specific to LLMs and can execute different tasks and duties.

According to Berglund et al. (2023), standard situations include step-by-step reasoning, in-

context learning (ICL), and instruction following.

Scaled LLMs can provide output that more closely matches human responses to natural

language inquiries through the commonly used fine-tuning technique known as instruction-

tuning, which frequently results in human-level performance on a variety of testbeds (Zhou et al.,

2022). Instruction tuning involves supervised learning on labeled (input, output) pairs. Still, fine-

tuning can be accomplished using almost any machine learning paradigm, such as reinforcement

learning, semi-supervised learning, or extra self-supervised learning (Meng et al., 2023). The

method is based on many multi-task datasets created under human supervision.

In the instance of ICL, the model can only produce output using a small number of

natural language samples through prompts, even though there is no additional model training or

modification (Wang et al., 2023). In the same situation, step-by-step reasoning adds intermediate

reasoning steps to ICL-provided examples using the chain of thought prompting method,

allowing it to solve problems without fine-tuning (Paranjape et al.,2023). These skills also fall

within the category of prompting techniques, which are covered in more detail in section 2.5.5.
Figure 15 illustrates key highlights of LLMs in terms of training sizes in typical chatbot

applications such as GPT-4, Ernie 4.0, Claude 3, and Olympus.

Figure 15: Highlights of LLMs as of April 2024 (Thompson, 2024)

2.5 ChatGPT: The Case of a Prominent LLM-based Chatbot

OpenAI’s GPT backend powers ChatGPT, an automated chatbot service. GPT relies on

an LLM that is comprised of four key components. These elements include a transformer

architecture, tokens, a context window, and a neural network. Therefore, ChatGPT is a suitable

case of a successful LLM-based chatbot that can be investigated to understand the concept of

scaled pre-trained LLMs and their applications in day-to-day conversational chatbots.

Subsequent subsections elaborate on OpenAI, GPT architecture, the emergence of ChatGPT,

prompt and prompting engineering, and the capabilities, risks, and limitations of ChatGPT.

2.5.1 OpenAI

In line with the Open AI charter and history, the company was founded in December

2015 (OpenAI, 2024a). Being an American AI research organization, OpenAI seeks to develop

safe and beneficial artificial general intelligence. OpenAI (2024a) defines artificial intelligence
as “highly autonomous systems that outperform humans at most economically valuable work.”

Open AI was founded by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung,

Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and

Wojciech Zaremba formed the company (OpenAI, 2024b). The original founding members of

the OpenAI Board of Directors were Sam Altman and Elon Musk (Haque, 2023). OpenAI

consists of a non-profit, OpenAI Inc. registered in Delaware, and a for-profit subsidiary, OpenAI

Global LLC.

OpenAI transitioned into a profit-oriented organization after 2019 when it received a

substantial investment from Microsoft that steadily grew in size. The change to finance has

raised concerns about the democratization of AI technology, which was previously emphasized

by the founders as a goal, as well as the company’s transparency strategy (Andhov et al., 2024).

In 2019, Microsoft made a $1 billion investment in OpenAI Global LLC, followed by $10 billion

in 2023 (Widder et al., 2023). Most of the investment was allocated to computational resources

on Microsoft’s Azure cloud service (Pedersen et al., 2024). In addition to its commercial

solutions, OpenAI is actively engaged in research aimed at developing AI systems that are both

secure and responsible.

OpenAI has achieved substantial accomplishments by contributing to the advancement of

LLMs, as shown by GPT-3, demonstrating a fantastic capacity to generate coherent text. Other

notable models in this category include Whisper, which is used for Automatic Speech

Recognition (Vivek et al., 2023); Codex, which is used for coding tasks (Jackson & Sáenz,

2022); and DALL-E, which is used for generating images from natural language text (Ye et al.,

2023). OpenAI is now involved in performing research on reinforcement learning (RL), an ML

technique that utilizes rewards for desired activities and punishments for undesired actions
(Zhong et al., 2023). The goal is to train an intelligent agent to make optimal decisions in a given

environment (Berner et al., 2019). As part of this study, the business creates algorithms such as

Trust Area Policy Optimization and Proximal Policy Optimization (Sun et al., 2023).

2.5.2 GPT Architecture and Initial Models

GPTs significantly advance natural language processing, enabling machines to

comprehend and produce fluent and precise language (Gilson et al., 2023). The current section

discusses the four GPT models, starting from the initial version of GPT-1 and progressing to the

latest GPT-4. OpenAI introduced GPT-1 in 2018 as their initial implementation of a language

model based on the Transformer architecture (Sallam, 2023). GPT-1 model consisted of 117

million parameters, which led to a notable enhancement compared to the prior state-of-the-art

language models. An advantageous feature of GPT-1 was its proficiency in producing articulate

and logical English in response to a particular instruction or context (Frieder et al., 2023). The

model underwent training using two datasets: the Common Crawl, an extensive collection of web

pages containing billions of words (Sallam, 2023), and the BookCorpus dataset, which

comprised more than 11,000 books spanning various genres (Gilson et al., 2023). GPT-1 was

able to build robust language modeling skills by utilizing a wide range of information from the

two datasets.

While GPT-1 represented a notable milestone in natural language processing (NLP), it

was subject to some constraints. For instance, GPT-1 often produced repetitious text, mainly

when provided with suggestions beyond the range of its training data (Frieder et al., 2023).

Additionally, GPT-1 could not engage in reasoning across numerous conversation exchanges or

keep track of long-term connections in written material (Sallam, 2023). Critics of GPT-1 also

observed that the coherence and fluidity of the language in this model were only evident in
shorter sequences, while more significant portions lacked coherence. Despite these

shortcomings, GPT-1 served as the basis for developing more extensive and potent models

utilizing the Transformer design, which was improved in GPT-2.

OpenAI introduced GPT-2 in 2019 as a follow-up to improve on the limitations of GPT-

1. The GPT-2 model used 1.5 billion parameters, far surpassing the size of GPT-1, using 7,000

unique unpublished books and Byte-Pair Encoding (BPE) vocabulary of 40,000 tokens (Sallam,

2023). GPT-2 underwent training using a far more extensive and varied dataset, which involved

merging Common Crawl and WebText. An advantageous attribute of GPT-2 was its aptitude for

producing logically connected and lifelike sequences of textual content (Frieder et al., 2023).

Furthermore, GPT-2 could produce responses that resemble human outputs, rendering it a

valuable instrument for various natural language processing tasks, including content generation

and translation (Frieder et al., 2023). Nevertheless, GPT-2 presented some constraints. GPT-2

poorly executed tasks that demanded advanced cognitive abilities such as contextual

understanding and complicated thinking. The limitations of GPT-2 were particularly evident

when handling small text fragments and paragraphs, while it could not sustain context and

coherence when dealing with lengthier pieces (Frieder et al., 2023). These constraints facilitated

the progress of subsequent GPT models, as further discussed below.

2.5.3 GPT-3, 3.5, and 4 Model Background

The field of NLP models made substantial progress with the release of GPT-3 in 2020.

The GPT-3 model used 175 billion parameters, which is over 100 times larger than the GPT-1

and ten times greater than the GPT-2 model. Unlike the previous GPTs, GPT-3 underwent

training using various data sources, such as BookCorpus, Common Crawl, and Wikipedia,

among other datasets (Gilson et al., 2023). GPT-3 dataset consisted of around one trillion words,
enabling the model to produce advanced replies for many natural language processing tasks,

even without the need for any previous example data (Frieder et al. 2023). GPT-3 exhibited

notable advancements compared to its predecessors, particularly in its capacity to produce

logically connected language, compose computer code, and construct artistic creations. GPT-3

could comprehend the context of a given text and produce suitable responses, which set it apart

from GPT-1 and GPT-2 (Ye et al., 2023).

GPT-3’s capacity to generate authentic text had significant ramifications for applications

such as chatbots, content generation, and language translation (Zain et al., 2023). An illustrative

instance is ChatGPT, an AI chatbot that rapidly transitioned from unknown to widely

recognized. Despite the remarkable capabilities of GPT-3, the model had some flaws, such as

providing biased, erroneous, or unsuitable responses. These errors resulted from the extensive

training of GPT-3 on vast quantities of text, which may include biased and erroneous

information (Vivek et al., 2023). Furthermore, critics noted that there were occasions where

GPT-3 produced entirely unrelated text in response to a prompt (Laato et al., 2023; Rudolph et

al., 2023; Sallam, 2023). As such, this suggests that GPT-3 still struggles with comprehending

context and background information. The capabilities of GPT-3 have also sparked worries

regarding the ethical impacts and the abuse of these highly potent language models. Concerns

arise over the potential misuse of the model for reprehensible activities, such as fabricating false

information, crafting deceptive emails, and developing malicious software (Khadija et al., 2023).

To address these challenges, OpenAI introduced GPT-3.5, an improved version of GPT-

3. InstructGPT is a collection of GPT-3.5 models that were fine-tuned to respond to user

instructions and provide appropriate replies accurately. The GPT-3 model’s ability to align with

humans was improved by implementing a three-step reinforcement learning from human


feedback (RLHF) algorithm. The RLHF technique trained a reward model based on human

preferences. RLHF was employed to assess the caliber of the text produced by language models

and assist them in enhancing their performance in subsequent prompts (Jackson & Saenz, 2022).

The training improved GPT-3’s capacity to adhere to instructions while mitigating safety issues

by generating less toxic and harmful responses. These approaches improved GPT-3 by

displaying refined capabilities (i.e., processing code and increased usability due to RLH), from

which ChatGPT was further fine-tuned for dialogue. Figure 16 presents the training and

development of the four GPTs (Mehra, 2024).

Figure 16: Training and development of GPTs (Mehra, 2024)

GPT-4, the most recent iteration of the GPT series, was released on March 14, 2023. The

current model, GPT-4, represents a notable improvement over its predecessor, GPT-3, which was

already outstanding (Nori et al., 2023). However, the exact details about the training data and

structure of GPT-4 have not been officially disclosed. Still, the assumption is that it leverages the

advantages of GPT-3 to address its shortcomings (Peng et al., 2023). GPT-4 demonstrates

enhanced comprehension of intricate instructions and achieves performance comparable to that


of humans on various professional and conventional standards (Rudolph et al., 2023).

Furthermore, the latest GPT-4 has an expanded context window and size, denoting the amount of

information the model can store in its memory when engaging in a conversation (Laato et al.,

2023). The model surpasses previous LLMs in identifying user intent and other modern fine-

tuned solutions based on a series of NLP tasks not exclusive to English.

A review by Nori et al. (2023) observed that GPT-4 has become a general-purpose large

language model that exceeds the passing score on USMLE by over 20 points. GPT-4 also

outperforms earlier models, including the ones fine-tuned explicitly for medical knowledge.

Peng et al. (2023) added that GPT-4 generates superior instruction-following data for large

language models, resulting in superior zero-shot performance on new tasks compared to previous

state-of-the-art models. Additionally, GPT-4 presents the concept of predictable scaling, which

aids in generating reliable predictions of the model’s behavior during training without extensive

computational resources. Lee (2023) reported that GPT-4 significantly enhances translation

accuracy and eliminates flaws in neural machine translation. The model also determines the ideal

equilibrium between hallucinations and fostering originality in GPT models, optimizing model

effectiveness in diverse tasks.

2.5.4 Emergence of ChatGPT

ChatGPT is a fine-tuned model of GPT-3.5, which belongs to a series of LLMs that

OpenAI introduced before the chatbot’s debut. GPT-3.5 is an enhanced iteration of GPT-3,

initially released in 2020. ChatGPT was launched in November 2022 with a specific focus on

conversing with users (Sallam, 2023). The model is refined through supervised learning by

having human AI trainers assume the roles of both users and AI agents. A dialogue synthesis was

carried out with the help of sample written recommendations, which were combined with the
InstructGPT dataset in a dialogue format (Rudolph et al., 2023). Besides these approaches, the

rest of ChatGPT is trained using the same methods as its “sibling,” InstructGPT. While ChatGPT

is based on OpenAI’s GPT-3 architecture, InstructGPT is based on Google’s Transformer

architecture (Verma, 2024).

Equipped with an extensive corpus of literature for pre-training, advanced reasoning

abilities, the capability to identify context in multi-turn dialogues, and human alignment for

enhanced safety (Laato et al., 2023), the system demonstrated remarkable communication

capabilities with humans. Common instances include tasks such as categorizing text, rephrasing,

translating, condensing information, analyzing sentiment (Haque, 2023), answering trivia

questions [68], creating code with explanatory comments, generating complex text like poems,

essays, or witty puns, and even imitating well-known individuals (Gilson et al., 2023).

ChatGPT was released in a research preview format, allowing users to examine and

explore its capabilities freely. After its inception, it gained significant public attention and

surpassed one million users within a week. Such rapid adoption can be attributed to its

remarkable conversational capabilities and reasoning skills across various topics (Wu et al.,

2023). Furthermore, users are urged to provide feedback via the interface to assess the chatbot’s

responses and report any improper replies or potential threats. Subsequent feedback is collected

and utilized to train further and refine the system, improving its conversational capabilities (Nori

et al., 2023). After a few months, OpenAI introduced applications for Android and iOS devices,

enabling voice input and syncing of conversational history (Casheekar et al., 2024).

OpenAI also introduced premium features such as ChatGPT Plus based on user

subscriptions and ChatGPT Enterprises for companies. ChatGPT Plus offers enhanced response

speeds and prioritized access to members with improved features. Some examples of the benefits
are instant access to GPT-4 usage, as well as the ability to use external and internal plugins (such

as the ChatGPT Browsing Plugin and ChatGPT Code Interpreter Plugin) to augment the

capabilities of the conversational chatbot (Al-Khiami &Jaeger, 2023). OpenAI’s ChatGPT

Enterprise is described as the most potent iteration of ChatGPT, providing users unrestricted

access to the latest capabilities. These features encompass unrestricted and faster usage of GPT-

4, extended input processing, business-oriented data privacy and security procedures, and

support for widespread implementation.

2.5.5 Prompting and Prompt Engineering

While the latest GPT-4 improves translation quality and removes significant errors in

neural machine translation, one of the concerns is that it also produces hallucinated edits

(Raunak et al., 2023). Prompt engineering seeks to improve the latest models by developing and

enhancing prompts to guarantee that GPTs produce practical and relevant material. Through

reinterpretation, the prompt cycle guarantees that conversational models react suitably to diverse

human input, enhancing user experience (Pehlivanoglu et al., 2023). Following its introduction

during GPT-3, prompting helps improve pre-trained and fine-tuned LLMs. Through prompting,

users apply prompt instructions to program an LLM through customization, refining, or

improvement of its capabilities (White et al., 2023)

A prompt plays a crucial role in setting the environment of a conversation, sifting through

the given information, and specifying the expected format and substance of an LLM’s output. An

LLM can produce more organized and refined solutions for different tasks when provided with

specific prompts encompassing norms and guidelines. Interactive prompting techniques have

demonstrated significant advantages, particularly when managing challenging activities,


particularly in the case of ChatGPT (Raunak et al., 2023). Some of the standard prompting

methods include Chain-of-thought (CoT) and In-Context Learning (ICL) (Figure 17).

Figure 17: In-Context Learning (ICL) and Chain of Thought (CoT) (Polverini & Gregorcic, 2023).

ICL works on the task description and may include particular demonstration examples

using a natural language prompt. Various strategies have been suggested to enhance the

effectiveness of the ICL capability. These strategies emphasize the significance of the

information in specific examples and the proper sequence and format of demonstrations.

Furthermore, presenting sufficient information for the task and ensuring its relevance to the

given exam query is imperative.

CoT prompting integrates intermediate reasoning processes into prompts to strengthen

the reasoning ability of LLMs, hence enriching the input-output pairings ICL (Polverini &

Gregorcic, 2023). The system offers more contextual information and simplifies the process of

generating responses, intending to tackle intricate problems like arithmetic reasoning,

commonsense reasoning, and symbolic reasoning, which involve reasoning step by step. While
CoT and ICL appear to enhance the performance of LLMs, there is still a requirement for more

efficient prompting mechanisms (White et al., 2023). Furthermore, the effectiveness of the

prompts supplied to the LLM is closely correlated with the quality of the generated outputs

(Pehlivanoglu et al., 2023). Therefore, the methods utilized to train LLMs through prompts,

which we call prompt engineering, are also highly significant.

Prompt engineering offers more possibilities for LLMs than just collecting plain text or

code samples. An appropriate prompt can initiate an entirely new set of interactions. For

example, prompts can assist an LLM in creating a dataset with a specific format and desired

number or function as a Linux terminal window. In addition, prompts have the potential to adjust

themselves, enabling them to suggest other prompts that can provide further information or

create relevant material (Raunak et al., 2023). Recent research has suggested utilizing prompt

patterns to develop reusable solutions for user task-related issues while interacting with

conversational LLMs. The findings demonstrated a significant enhancement in the advanced

capabilities of ChatGPT by integrating diverse patterns that apply to various domains, such as

education or entertainment (Lee et al., 2023).

2.5.1 ChatGPT Opportunities, Threats, and Limitations

Since its inception in 2022, ChatGPT has revolutionized communication and information

retrieval by introducing a novel capability. Integrating text and code data-rich LLMs in natural

language understanding and generation, along with a conversational experience, results in an

advanced tool with powerful NLP capabilities (Al-Khiami & Jaeger, 2023). Therefore, ChatGPT

is a productive tool with scalable and cutting-edge technology, offering various possibilities for

text creation, review, and analysis (Wu et al., 2023). ChatGPT can be employed either to execute

personal tasks or for work-related purposes. Considering that ChatGPT is trained on a diverse
and extensive corpus of textual data, including books, news stories, webpages, articles, forums,

and social media posts, the model understands and generates text on a wide range of topics,

employing various writing styles (Verma, 2024).

Moreover, the training dataset used in ChatGPT enables it to understand and respond to

various linguistic inputs, deconstruct crucial information or complex concepts, and explain them

in a way that suits each user’s unique speaking style (Sallam, 2023). These qualities and

ChatGPT’s capacity to generate and modify code have made the model stand out as a practical

and accessible search engine that relies on dialogue rather than having users scroll through a list

of results (Rudolph et al., 2023). In addition to its machine learning optimization methods,

ChatGPT helps to simulate real-world dialogue features like remembering past conversations,

customizing to their unique characteristics, offering an apology for displaying inaccurate

information, and assisting users in generating ideas through brainstorming (Laato et al., 2023;

Labadze et al., 2023).

ChatGPT also presents the opportunity to swiftly capture new insights, retain innovative

information, and learn from user interaction. ChatGPT improves its interpretation abilities

(Rudolph et al., 2023). Based on constant user interaction and engagement, ChatGPT adjusts to

various reactions based on feedback and new information, thereby implying that it becomes

increasingly progressive to changing scenarios. Based on these characteristics, there is a growing

focus on implementing conversation-based AI systems that can be applied to various industries

such as retail, education, healthcare, research, journalism, and information technology

(Casheekar et al., 2024; Kulkarni et al., 2019; Kumar et al., 2020).

Furthermore, by using external plugin mechanisms made possible by OpenAI, the chatbot

can use more tools or software to improve its functionality. External resources are crucial for
solving complicated issues and improving LLM performance (Raunak et al., 2023). For example,

ChatGPT can access fresh data using the web browser plugin. Furthermore, the open-source

retrieval plugin may learn from data sources by responding appropriately to prompts and queries.

New features like custom instructions have also been introduced to tailor ChatGPT usage.

Instead of repeating the conversational context and background information, users can give pre-

established guidelines that the chatbot will consider for the subsequent output. For example,

Figure 18 shows an example of customized instructions provided at the OpenAI website, where a

user provides tailored instructions before getting a response (OpenAI, 2023).

Figure 18: Customized instruction availed at ChatGPT (OpenAI, 2023).

However, OpenAI has disclosed drawbacks for this sophisticated AI chatbot. ChatGPT

tends to generate hallucinations, as described by Kumar et al. (2020). Hallucinations refer to

responses that may seem correct and sensible but are erroneous, incoherent, and unreliable (Lee,

2023). OpenAI has encountered challenges in addressing this problem despite improvements in

the ChatGPT Plus version. These difficulties arise from the supervised training process of the
model, and the reinforcement learning from human feedback (RLHF) used to align the model

with the provided demonstrations (Kulkarni et al., 2019). Moreover, ChatGPT has restricted

cognitive abilities, sometimes producing irrational responses (Casheekar et al., 2024). The

system faces difficulties in resolving intricate or even more straightforward mathematical issues,

grasping the significance of words, responding to thorough inquiries, and acquiring real-world

information (Semnani et al., 2023).

The chatbot’s output can be significantly altered by its sensitivity to the wording of the

input, leading to potential inconsistencies (Kumar et al., 2020). Despite receiving identical

prompts, the model has the potential to produce varying replies. Furthermore, while the training

phase of the chatbot concluded in early 2022, it lacks knowledge about any forthcoming events,

namely those occurring after January 2022. These concerns have prompted recommendations for

enhancing the resilience of ChatGPT.


References

You might also like