
Volume 9, Issue 5, May – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24MAY1600

Assessing Fine-Tuning Efficacy in LLMs:
A Case Study with Learning Guidance Chatbots
Rabia Bayraktar; Batuhan Sarıtürk; Merve Elmas Erdem
Huawei Turkey R&D Center, Istanbul, Turkey

Abstract:- Training and accurately evaluating task-specific chatbots is an important research area for Large Language Models (LLMs). These models can be developed for general purposes with the ability to handle multiple tasks, or fine-tuned for specific applications such as education or customer support. In this study, the Mistral 7B, Llama-2 and Phi-2 models, which have proven success on various benchmarks including question answering, are utilized. The models were fine-tuned using QLoRa with limited information gathered from course catalogs. The fine-tuned models were evaluated using various metrics, with the responses from GPT-4 taken as the ground truth. The experiments revealed that Phi-2 slightly outperformed Mistral 7B, achieving scores of 0.012 BLEU, 0.184 METEOR, and 0.873 BERTScore. Considering the evaluation metrics obtained, the strengths and weaknesses of known LLM models, the amount of data required for fine-tuning, and the effect of the fine-tuning method on model performance are discussed.

Keywords:- LLM, Mistral, Llama, Phi, Fine-Tune, QLoRa.

I. INTRODUCTION

With the rise of open-source Large Language Models (LLMs) in recent years, customized chatbots have become prevalent in many applications. The fact that most LLMs have been trained on huge datasets and developed for multiple tasks makes it easier to use these models for small-scope and custom use cases. Today, the most well-known LLMs with proven effectiveness are GPT [1] and its variations, which allow for the development of applications by directly using an API. However, the number of open-source high-performance large language models [2, 3, 4] continually increases. As a result, access to both pre-trained base models and fine-tuned models for specific use cases is becoming easier [5].

There are numerous applications for customized chatbots: providing product recommendations on online shopping sites [6], enhancing customer relations in telecommunications [7], and creating question-answering systems in specific domains [8]. Adaptation of a language model for a specific use case can be achieved through various methods, including zero-shot learning, few-shot learning, parameter-efficient fine-tuning, or prompt tuning. During pre-training, an LLM will have already accessed most publicly available information sources, enabling its use in various applications without requiring fine-tuning. But for a custom use case, for example a domain-specific question-answering system, data usage and fine-tuning are crucial. This process aims to ensure that the pre-trained model can learn as much information as possible from limited data. In addition, techniques such as Retrieval-Augmented Generation (RAG) can enable the model to obtain information from additional external sources during inference [9].

This study involved the development of a chatbot that provides information about training contents and helps in planning career roadmaps in various software fields. To create a question-answering pair dataset, public data from Huawei's ICT Learning course catalog [10] was used, and state-of-the-art language models were fine-tuned using this dataset. The study focuses on analyzing the development of an effective fine-tuned model with a small amount of data within a limited scope. The remainder of the paper includes a literature review, method, and experimental results.

II. LITERATURE REVIEW

LLM studies have become increasingly popular, especially since the release of GPT-3 in 2020. Many different model architectures have been released, both open-source and commercial. The development of these large models has made it easy to adapt them to different use cases, leading to their widespread application in various business domains such as education, medicine, software, mathematics, law, finance, and coding [11, 12]. The most widely used open-source LLMs [2, 3, 4] have performed well on various benchmarks.

The generalization ability of these language models is very good, but fine-tuning is also necessary to handle cases such as enabling response safety in terms of ethics and toxicity, and understanding user intents for specific domains. An example of fine-tuning language models for textbook question answering is shown in [13], which investigates the effect of a Llama model fine-tuned with QLoRa [14] on the RAG output. The study used a total of 1,076 courses with different contents. The model was trained with a total of 26,260 questions, with approximately 15,000 of the data used for training and 10,000 for testing and validation. The validation accuracy of the model trained with an A100 GPU was 82.40, and the test accuracy was 84.24 [6].


Integration of chatbots into educational frameworks has shown promising results. Some studies propose educational chatbots in specific domains [15] or for general purposes [16]. In a study conducted in 2024 [8], synthetic data was used for subsections determined using the dialogue dataset. Additionally, [7] proposes an educational chatbot as a mobile learning tool and provides implementation details. Furthermore, [18] offers a comprehensive overview of the stages involved in developing and implementing LLM-based systems. Another domain that can be supported by chatbots is health: [8] discusses how chatbots can provide assistance during health emergencies. These systems could evolve to triage patients, provide preliminary diagnoses, and offer medical advice, potentially saving lives in critical situations.

Evaluation of chatbot performance is challenging because LLMs are trained for multiple tasks and can be adapted to various subtasks. When ground truth data is available, similarity metrics such as BLEU, ROUGE, METEOR, and BERTScore can be used. For instance, in [17], the BLEU score is used to calculate the accuracy of generated responses in an educational chatbot. Another way to measure response accuracy is by using AI. In [1], the evaluation of a fine-tuned language model is done using GPT-4.

III. MATERIAL & METHODS

Our research aimed to compare the performance of LLMs such as Mistral, Llama, and Phi by incorporating public data from the Huawei ICT Academy course catalog, and to demonstrate that a platform-specific chatbot model can achieve effective results when trained with a limited dataset. To achieve this, we used a fine-tuning methodology that exposed these models to a question and answer (Q&A) pair dataset generated using educational materials from the course catalog. By integrating information from these sources, we aimed to enrich the models' understanding of and adaptation to industry-specific terminology, concepts, and contexts. This comprehensive approach allowed for a more robust enhancement of the models' proficiency in comprehending and responding to queries within the talent platform domain, aligning with our goal to strengthen their utility and effectiveness in real-world applications, particularly in their engagement with talent platforms.

A. Dataset
Data for the customized chatbot was collected from the talent platform. Only public data on the platform was used, limited to the ICT Academy 23 Courses Catalog. A dataset comprising 500 Q&A pairs was created from this content using GPT-4 to fine-tune the Mistral model. A sample question-answer pair can be seen in Fig. 1.

The content was extracted from 14 courses to create the question and answer pairs, with 7 courses focused on professional topics and 7 courses covering general subjects. Questions and answers of varying lengths were generated using GPT-4. When creating the questions, we followed a progression from general to specific to enable the chatbot to better generalize the information it learned.

Fig 1: Example of a Q&A Pair
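Fig. 1 itself is not reproduced in this extraction. As an illustration of the record format, a single pair can be stored as one JSON object per line; the field names and wording below are our own assumptions for illustration, not the authors' actual schema.

```python
import json

# Hypothetical record in the style of the 500-pair dataset; the field
# names ("question"/"answer") and the text are illustrative assumptions.
qa_pair = {
    "question": "Which certification levels does the Huawei ICT course catalog cover?",
    "answer": ("The catalog covers HCIA (Associate), HCIP (Professional), "
               "and HCIE (Expert) level courses."),
}

# Append the pair to a JSON Lines file, a common fine-tuning corpus format.
with open("qa_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(qa_pair, ensure_ascii=False) + "\n")
```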


B. Language Models
LLMs are artificial intelligence models with the ability to process and generate text, falling under the category of Natural Language Processing (NLP). Depending on the characteristics of the problem to be solved, various models can be created. Most of these models are pretrained for multiple tasks and built on transformer architectures, which allow for easy post-training adaptation to specific tasks. To effectively generalize the trained model, the dataset should contain as many different samples as possible. LLMs require a large dataset and a long training time due to their size, which also brings computational challenges.

Many LLMs have been released since 2020 and have been proven to be effective. The relative strengths of the models offered by developers such as OpenAI, Google, Meta, and Stability AI have been examined in various comparisons. Regarding model sizes, GPT uses 175 billion parameters, Gemini uses 3.25 billion, and Gemma uses 7 billion parameters. When fine-tuning LLMs with limited resources, managing the number of parameters is crucial. Fewer parameters mean lower computational and memory requirements, making the process more feasible at the cost of reduced performance.

LLMs can be easily adapted to new domains and new tasks such as sentiment analysis, customer service, search, and question answering. One of these models, Llama, is offered by Meta and uses approximately 70B parameters. The model called Llama-2-chat was trained with a dialogue dataset provided for customized tasks. In cases where the resource needs of the 70B parameter model cannot be met, Llama-2's 7B parameter chat model can be used.

The Mistral model, one of Llama's closest competitors, comes in two variants: instruct and base. A significant increase in performance is observed with the decoder-only transformer used in the model, trained with 7B parameters.

The Phi series has been a significant breakthrough in the field of language models. The original Phi model was trained with 1.3 billion parameters, and its successor, Phi-2, was trained with an even more extensive 2.7 billion parameters. Phi-2 has set new standards among base language models with fewer than 13 billion parameters, showcasing remarkable reasoning and language understanding capabilities.
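For reference, public checkpoints of all three model families are hosted on the Hugging Face Hub [5], and their parameter counts can be verified directly. The repository IDs below are our assumption of the usual public releases, not identifiers given in the paper, and the snippet downloads the full weights.

```python
from transformers import AutoModelForCausalLM

# Assumed public repository IDs for the three model families discussed;
# the meta-llama repository requires accepting the license on the Hub.
CHECKPOINTS = [
    "microsoft/phi-2",               # ~2.7B parameters
    "mistralai/Mistral-7B-v0.1",     # ~7B parameters
    "meta-llama/Llama-2-7b-chat-hf", # ~7B parameters, dialogue-tuned
]

for name in CHECKPOINTS:
    model = AutoModelForCausalLM.from_pretrained(name)  # downloads weights
    total = sum(p.numel() for p in model.parameters())
    print(f"{name}: {total / 1e9:.2f}B parameters")
```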
C. QLoRa
QLoRa represents a significant advancement in the field of NLP, particularly in optimizing LLMs for efficiency and accessibility. Developed as an extension of the LoRa framework, QLoRa introduces a novel approach to parameter quantization, profoundly impacting memory utilization and hardware requirements (Fig. 2).

Fig 2: QLoRa

Fig 3: Model Fine-Tune Diagram


At its core, QLoRa leverages quantization techniques to reduce the precision of weight parameters within pre-trained LLMs to a 4-bit format. Unlike traditional models that store parameters in a 32-bit format, QLoRa's compression significantly diminishes the memory footprint of LLMs, unlocking the potential for fine-tuning operations on a single GPU. This breakthrough not only streamlines memory consumption but also democratizes access to cutting-edge language models by enabling their execution on consumer-grade hardware [14].

D. Fine-Tuning
The main goal of fine-tuning is to achieve the highest accuracy with minimal customization requirements for a model. In many cases, good results can be obtained by retraining a well-trained model with your own data. This approach requires far fewer resources compared to training a model from scratch.

There are many PEFT (Parameter-Efficient Fine-Tuning) methods for fine-tuning LLMs. LoRa is the most popular of these methods. It can be integrated into the fully connected layers of the existing model, and it aims at optimization through the parameters it adds, without increasing the complexity of the model. QLoRa works on the original network to further increase the effectiveness of LoRa.

It scales the weight values of the original network from Float32, a high-resolution data type, to Int4. By rescaling, the memory needs and the training time are reduced. It is very suitable in situations where there is a lot of data but limited hardware resources. In this study, the Mistral-7B model was fine-tuned with QLoRa.

In this study, we proposed a method for fine-tuning the base models using 500 Q&As. The diagram in Fig. 3 illustrates the data exchange between the main and helper units that make up the system. The fine-tuning process requires the base model pipeline to be loaded and the transform matrices to be created. The Mistral-7B base model was used in this study, and we drew on the Hugging Face pipeline of the model to define transform and tokenizer rules.

After building the pipeline, we conducted new training using the QLoRa method selected for fine-tuning. QLoRa was chosen for fine-tuning in this study because it can more effectively prevent overfitting and underfitting when retraining large models with small datasets.

In the last phase, with the use of the fine-tuned model, users are expected to find answers to their questions regarding training content, course format, duration, and suitable courses based on their expertise level in a specific domain.
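A minimal sketch of the pipeline described above, built on the standard Hugging Face QLoRa tooling (transformers, peft, trl), is given below. The hyperparameters mirror the ones reported in this section where possible; the checkpoint ID, adapter settings, and trl API details (which vary across library versions) are our assumptions rather than the authors' exact code. Note that at 4 bits per weight, a 7B-parameter base model occupies roughly 3.5 GB instead of about 28 GB in Float32, which is what makes single-GPU training feasible.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

# 4-bit NF4 quantization of the frozen base weights, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto")

# Low-rank adapters on the attention projections; only these are trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                  task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

# The 500 Q&A pairs serialized one JSON object per line (see Section A),
# rendered into a single training string each; 10% held out for evaluation.
data = load_dataset("json", data_files="qa_dataset.jsonl", split="train")
data = data.map(lambda ex: {
    "text": f"Question: {ex['question']}\nAnswer: {ex['answer']}"})
splits = data.train_test_split(test_size=0.1, seed=42)

args = TrainingArguments(
    output_dir="mistral-qlora",
    per_device_train_batch_size=8,   # batch size used in the paper
    max_steps=300,                   # step budget reported for Mistral 7B
    learning_rate=2e-4,
    evaluation_strategy="steps", eval_steps=50, logging_steps=10,
)

trainer = SFTTrainer(model=model, args=args,
                     train_dataset=splits["train"], eval_dataset=splits["test"],
                     peft_config=lora, dataset_text_field="text",
                     tokenizer=tokenizer, max_seq_length=512)
trainer.train()
trainer.save_model("mistral-qlora")  # writes the adapter weights
```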
IV. EXPERIMENTS & RESULTS

A. Fine-Tuning with QLoRa
Various experiments were carried out to fine-tune models frequently mentioned in the literature with QLoRa. In all experiments, the 500 Q&As generated from the catalog data were used for model training. The Q&A pairs contain general and course-specific questions and answers about the platform. Care was also taken to ensure that the questions and answers were of different lengths. Different step counts and batch sizes were used for the Phi-2, Llama-2 and Mistral 7B models depending on the model requirements.

During the fine-tuning process of the Phi-2 model, a 5,000-step training was performed with an A100 GPU and a batch size of 8. The training took approximately 2 hours. Training and evaluation losses are shown separately in Fig. 4; 10% of the training data was used for evaluation.

The loss graphs illustrate how well the model is learning the provided material. As the training progresses, the loss consistently decreases, reaching levels as low as 0.3 to 0.2, indicating that the model is effectively absorbing the content. Due to the limited amount of available training data, we decided to limit the number of training steps to prevent overfitting, which can occur when the model becomes too specialized in the training data and does not perform well on new, unseen data. Despite this cautious approach, the model still managed to learn the content quite well. The decreasing loss values demonstrate that the model is improving at minimizing errors and enhancing its accuracy in predicting answers. By limiting the number of training steps, we aimed to strike a balance between allowing the model to learn as much as possible and ensuring it does not become too narrowly focused on the training data. Despite these limitations, the model exhibits strong proficiency in understanding and retaining the course material, which validates the effectiveness of our training approach.

The recommended step count to prevent overfitting when fine-tuning the Llama model with small datasets like ours is 300. When Llama examples in the literature are examined, long training runs are carried out with sufficient hardware resources, but 300 steps of fine-tuning with QLoRa are enough for comparison with the other models. For this study, training was conducted with a batch size of 8 and 300 steps, but the expected curves in the loss graphs were not achieved, as shown in Fig. 5. For this reason, a new training run was conducted and evaluated with a step size of 600.
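Loss curves like those in Fig. 4 and Fig. 5 can be reproduced from the trainer's log history; a minimal sketch, assuming the trainer object from the fine-tuning sketch above, follows.

```python
import matplotlib.pyplot as plt

# Trainer keeps a list of logged dicts: training entries carry "loss",
# evaluation entries carry "eval_loss" (trainer is assumed to be the
# object returned by the fine-tuning sketch above).
history = trainer.state.log_history
train = [(h["step"], h["loss"]) for h in history if "loss" in h]
evals = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), label="train loss")
if evals:
    plt.plot(*zip(*evals), label="eval loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
```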


Fig 4: Phi-2 Loss Graphs

Fig 5: Llama-2 Loss Graph


In the experiments where 600 steps and a batch size of 8 were used, it was observed that the Llama model requires more memory compared to the other models. Upon reviewing Table 2, it was noted that the model's generated answers often do not correlate with the questions. This indicates a weaker content relationship in the Q&A compared to the Phi-2 model. Analysis of the training and evaluation loss graphs shows a consistent decrease. It is anticipated that with longer training sessions using more powerful GPUs, the training and test losses may drop below ~0.4.

It has been reported that Mistral 7B performed better than the Phi-2 and Llama-2 models in training conducted with benchmark datasets. The loss graphs obtained from our study using Mistral 7B are displayed in Fig. 6. The consistent decrease in the loss graphs of the model trained with 500 Q&As is evident. It is observed that the training conducted with a batch size of 8 and 300 steps outperforms the Llama-2 model trained with 600 steps. The results we obtained align with the benchmark results.

Fig 6: Mistral 7B Loss Graph

B. QA Comparison
In this section, a Q&A comparison of the three fine-tuned models is presented. Based on the data on which the chatbot was trained, the same questions were asked to the three models and their answers were evaluated. It was monitored how well the content obtained from the Talent Platform matches the answers given by the bot.
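A sketch of how this side-by-side comparison can be run against the fine-tuned adapters is shown below; the adapter directories, prompt template, and generation settings are illustrative assumptions, not the authors' exact setup.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Placeholder adapter directories produced by the fine-tuning runs.
ADAPTERS = ["phi2-qlora", "llama2-qlora", "mistral-qlora"]
question = "Describe the primary focus of the '5G Network and Applications' course."

for path in ADAPTERS:
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoPeftModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.bfloat16, device_map="auto")
    # Same template that was used to serialize the training pairs.
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(path)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```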
Evaluating Q&A models presents unique challenges, as it requires not only the accuracy of information but also the relevance and naturalness of the responses [19]. Traditional metrics used in other NLP tasks, such as translation or summarization, also apply here, but with additional considerations due to the dialogic and informational nature of Q&A systems [19]. Q&A models are best evaluated through a combination of metrics that can capture both the precision of the answers in terms of content and their linguistic quality, and the metrics selected for this study are particularly suitable for this task.

To assess the performance of the models, the chosen metrics for this study are BLEU, ROUGE, METEOR, and BERTScore, each offering unique insights into different aspects of text generation quality. BLEU (Bilingual Evaluation Understudy) is used to measure the similarity between machine-generated text and human reference translations [20]. It is particularly useful in Q&A settings where accurate terminology is crucial. BLEU evaluates the precision of n-grams in the generated text that match the reference text, applying a brevity penalty for overly short responses. It helps quantify the match between the model's answers and the expected responses. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the recall of essential content in model responses,


ensuring no critical information is omitted [21]. It is used in text generation tasks and evaluates the recall of n-grams and word sequences between generated and reference texts. It includes measures such as ROUGE-N, ROUGE-L, and ROUGE-Lsum, among others; each of these measures captures different aspects of the text, such as n-gram overlap or longest common subsequence. METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric that balances precision and recall, providing a nuanced evaluation of linguistic and semantic quality [22]. It uses synonymy and paraphrase matching to align generated text with reference text, allowing for accurate, paraphrased responses. This makes it ideal for dynamic Q&A interactions and tasks requiring semantic accuracy and paraphrasing abilities. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features not found in other metrics, such as stemming and synonymy matching, along with standard exact word matching. BERTScore uses BERT's contextual embeddings to evaluate the semantic similarity between generated and reference texts [23]. This is crucial in Q&A tasks, as the correctness of an answer often depends on its semantic rather than surface-form similarity to the reference answers. It goes beyond word overlap, computing cosine similarity between token embeddings for a nuanced quality assessment. This is key for tasks requiring understanding of context, nuance, and paraphrasing, and it has been shown to correlate with human judgment on sentence-level and system-level evaluation.
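Concretely, all four metrics can be computed with the Hugging Face evaluate package; a minimal sketch follows, with toy strings standing in for the model answers and the GPT-4 reference answers.

```python
import evaluate

# Toy predictions/references in place of model answers and GPT-4 ground truth.
predictions = ["HCIA covers fundamental concepts and HCIE shows expert-level skill."]
references = ["HCIA covers fundamentals, while HCIE demonstrates expert-level proficiency."]

bleu = evaluate.load("bleu")            # n-gram precision with brevity penalty
rouge = evaluate.load("rouge")          # ROUGE-1/2/L/Lsum overlap scores
meteor = evaluate.load("meteor")        # unigram matching with stemming/synonymy
bertscore = evaluate.load("bertscore")  # contextual-embedding similarity

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```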
In the context of our Q&A models, these metrics collectively offer a robust framework for evaluation. Their combined use allows us to comprehensively evaluate the models not only for accuracy but also for their ability to interact naturally and informatively, mimicking human-like performance in answering queries. Evaluation results of the models are shown in Fig. 7.

It has been observed that all chatbots can provide accurate answers to general questions about the platform. However, the answers given by the Mistral model best reflect the platform's contents. All chatbots were able to produce understandable answers to questions requiring course-specific details, but the Phi-2 and Mistral 7B models were able to come closer to the most accurate answer than Llama-2. None of the three chatbots reached the expected success on questions about the number of courses and course durations, but the closest answers were obtained with the Mistral model. It appears necessary to improve the training data for the questions on which all chatbots fail.

In Table 2, we asked the same questions to the three models we trained with the same dataset and to ChatGPT 3.5. Among the answers of the models we trained, the correct answers were marked green, those close to the correct answer were marked yellow, and the wrong answers were marked orange. ChatGPT's answers were used as the evaluation criterion. As can be seen from the table, while the most accurate answers were obtained with Mistral 7B, even close answers could not be obtained with Llama-2.

When the BLEU scores are evaluated, the Phi-2 model has the highest score among all three, at 0.012, but all scores are very low across all models. This suggests that the generated answers have limited overlap with the reference answers at the word level. Low BLEU scores in a Q&A context might indicate that while the answers are semantically correct, they use different wording or structure compared to the reference answers. Additionally, when the BLEU precision values are examined, unigram precision is substantially higher than the other values; the highest is 0.555, from the Phi-2 model. These scores suggest that at the unigram (single-word) level there is a decent match between the generated and reference answers. High unigram precision indicates that key terms are being correctly used in the answers. At the same time, there is a significant drop in precision from unigrams to 4-grams, indicating a significant divergence from the reference text as the n-gram size increases. It suggests that while the model can recall relevant terms, it struggles to maintain coherent longer phrases that match the reference. For the ROUGE scores, the Phi-2 model provided the highest values across the board: 0.326 for ROUGE-1, 0.118 for ROUGE-2, 0.214 for ROUGE-L, and 0.271 for ROUGE-Lsum. These scores indicate better recall of important words and phrases.

Table 1: Evaluation Metric Results


Table 2: Sample Results

Question: What are the distinguishing characteristics of HCIP, HCIA, and HCIE certifications within Huawei's certification system, and how do they cater to different levels of expertise in Huawei technologies?
ChatGPT 3.5: Huawei's certification system offers various levels of certification to cater to different levels of expertise in Huawei technologies. The three main certifications are HCIP (Huawei Certified ICT Professional), HCIA (Huawei Certified ICT Associate), and HCIE (Huawei Certified ICT Expert).
Mistral: HCIP focuses on advanced skills, HCIA covers fundamental concepts, and HCIE demonstrates expert-level proficiency. They cater to varying levels of expertise, providing individuals with tailored learning opportunities based on their career goals and current knowledge base.
Llama: The HCIP level is the most advanced level of Huawei certification and is designed for advanced-level specialists. The HCIP certification is for advanced-level specialists with significant experience in Huawei technologies and is considered the pinnacle of Huawei certification.
Phi-2: HCIP offers advanced specialization, HCIA focuses on foundational knowledge, and HCIE demonstrates expert-level mastery in Huawei technologies. These certifications cater to diverse skill levels, ensuring accessibility for beginners while providing rigorous standards for experienced professionals.

Question: Can you provide an overview of the role of the course satisfaction survey in the "Introduction to 5G Knowledge," "5G Network and Applications," and "Introduction to Computer Networks and the Internet" courses?
ChatGPT 3.5: Certainly! The course satisfaction survey plays a crucial role in evaluating the effectiveness of the courses "Introduction to 5G Knowledge," "5G Network and Applications," and "Introduction to Computer Networks and the Internet."
Mistral: Yes, I can. In the 'Introduction to 5G Knowledge', '5G Network and Applications', and 'Introduction to Computer Networks and the Internet' courses, the course satisfaction survey serves as a feedback mechanism for students to evaluate their learning experience. It allows instructors to gather insights into students' perceptions of the course content, delivery methodology, and overall satisfaction with the course structure.
Llama: An overview of the exam structures for HCIP, HCIA, and HCIE certifications is provided, including differences in exam formats and content coverage. This helps students understand the different levels of certification and the requirements for each.
Phi-2: In the 'Introduction to 5G Knowledge' course, the satisfaction survey helps instructors understand students' understanding of key concepts, their level of engagement with the material, and suggestions for improving the course content. In the '5G Network and Applications' course, the survey assists instructors in evaluating the effectiveness of the curriculum and identifying areas for improvement. In the 'Introduction to Computer Networks and the Internet' course, the survey provides insights into students' comprehension of network principles and practical skills acquired throughout the course.

Question: Describe the primary focus of the "5G Network and Applications" course.
ChatGPT 3.5: The "5G Network and Applications" course focuses on understanding 5G technology, its architecture, protocols, standards, and applications.
Mistral: The primary focus of the '5G Network and Applications' course is to explore the architecture, features, and applications of fifth-generation (5G) wireless networks, along with their impact on various industries. It provides insights into the advancements in telecommunications technology and how 5G networks are revolutionizing different sectors.
Llama: The primary focus of the '5G Network and Applications' course is to provide participants with a comprehensive understanding of the key technologies, architectures, and use cases driving the evolution of 5G networks, including 5G core, edge computing, network slicing, and IoT integration. By the end of the course, participants will be equipped with the knowledge and skills necessary to design, deploy, and manage 5G networks effectively.
Phi-2: The primary focus of the '5G Network and Applications' course is to provide an in-depth understanding of the technical aspects, standards, and applications related to 5G networks. It covers topics such as network architecture, deployment models, spectrum bands, core network technologies, edge computing, and use cases for 5G technology. Through comprehensive theoretical sessions and hands-on exercises, students gain insights into designing, deploying, and managing 5G systems.

Question: How many experimental modules are included in the "Search and Artificial Intelligence" course structure, and what do they involve?
ChatGPT 3.5: The "Search and Artificial Intelligence" course structure typically includes several experimental modules. These modules are designed to provide hands-on experience and practical application of concepts related to search algorithms, artificial intelligence (AI), and related technologies.
Mistral: There are a total of 4 experimental modules included in the 'Search and Artificial Intelligence' course structure. These modules provide hands-on experiments or simulations aimed at reinforcing theoretical concepts and developing practical skills related to search algorithms and artificial intelligence applications. Participants engage in these experiential activities to deepen their understanding and apply knowledge gained throughout the course.
Llama: The 'Search and Artificial Intelligence' course structure includes one experimental module on search algorithms and another on AI basics. These modules provide hands-on experience with key concepts and techniques in search and AI.
Phi-2: The 'Search and Artificial Intelligence' course includes 2 experimental modules designed for students interested in delving deeper into specific aspects of search algorithms or AI technologies. These modules provide hands-on activities or projects to reinforce learning and explore advanced topics related to search and artificial intelligence.

Question: Can you explain the balance between theory and hands-on practice in the "Internet of Things and Applications" course?
ChatGPT 3.5: In an "Internet of Things and Applications" course, theory provides foundational knowledge of IoT concepts, while hands-on practice allows students to apply this knowledge practically. The balance ensures students understand both the theoretical underpinnings and practical implementation of IoT systems.
Mistral: Yes, absolutely! In the 'Internet of Things and Applications' course, we strike a balance between theoretical sessions and hands-on practice to ensure students gain comprehensive knowledge and practical skills. The theoretical sessions cover IoT fundamentals, networking protocols, embedded systems, cloud computing, and IoT applications. Hands-on practice includes lab sessions where students engage in activities such as device-cloud integration, sensor networks, edge computing deployments, and IoT solution development.
Llama: The 'Internet of Things and Applications' course balances theory and hands-on practice by providing a theoretical foundation in IoT concepts and technologies, along with practical exercises and projects that allow students to apply their knowledge in real-world scenarios. This approach ensures students gain both theoretical understanding and practical skills necessary for developing and deploying IoT solutions.
Phi-2: The balance between theory and hands-on practice in the 'Internet of Things and Applications' course ensures that students gain a solid theoretical understanding of IoT concepts and technologies while also developing practical skills through lab assignments, projects, and simulations. This approach prepares students for real-world challenges by equipping them with both knowledge and practical abilities.

Question: What is the duration and structure of the "Internet of Things and Applications" course, and are there any recommended prerequisites?
ChatGPT 3.5: The "Internet of Things and Applications" course typically covers IoT concepts, architecture, connectivity technologies, data analytics, security, applications, development, and a final project. Prerequisites may include basic knowledge of computer networks, programming languages, electronics, and data analytics.
Mistral: The 'Internet of Things and Applications' course spans a total of 64 lessons, with 32 theory sessions and 32 hands-on practice sessions. It is designed for students majoring in computer applications and related majors, providing them with essential skills and knowledge in IoT development.
Llama: The duration of the 'Internet of Things and Applications' course may vary depending on the institution or platform offering it. It may span several weeks to a semester, covering fundamental concepts, practical skills, and advanced topics related to IoT development and deployment.
Phi-2: The 'Internet of Things and Applications' course spans a specified duration with a combination of theory sessions and hands-on practice modules. It may be offered as a traditional classroom-based course or delivered online with interactive components. Recommended prerequisites may include a basic understanding of networking concepts and programming skills.

METEOR scores indicate moderate performance. This metric accounts for synonymy and paraphrasing, suggesting the models do fairly well at capturing meaning even if the exact wording does not always match. The Phi-2 model comes out on top with 0.184 among all three.

F1 scores from BERTScore are relatively high compared to the other metrics, with all models performing closely; the highest is 0.873, from Phi-2. BERTScore evaluates the contextual embedding similarity between the generated and reference texts, indicating that, semantically, the models' answers are quite close to the desired answers. This suggests that the semantic quality of the model outputs is relatively strong, with good contextual alignment with the reference answers.

V. CONCLUSION

In this paper, we discuss the development and implementation of a small-scale learning and career guidance chatbot using public data from Huawei's ICT Learning course catalog. For the fine-tuning process of the selected models, a Q&A pair dataset was generated using GPT-4 and external information from catalog documents regarding course details and software-related categories. We used small-scale models, with 2.7B parameters for Phi-2 and 7B for Mistral and Llama-2, to tackle computational challenges, and employed QLoRa to reduce the number of trainable parameters for the same reason. Our study demonstrates the use of state-of-the-art language models with limited data in a specific domain. For the evaluation part of our study, human evaluation was conducted on the generated results, which were also compared with the GPT-4 responses using metrics such as BLEU, ROUGE, METEOR, and BERTScore. The sample results for the responses show that acceptable accuracies could be achieved with limited processing power and a small amount of data for customized chatbots. This highlights the potential of small-sized models like Phi-2 compared to larger models. However, a detailed analysis of the safety, toxicity, or validity of the responses of the fine-tuned models is not within the scope of this study. These analyses and post-training alignment of the models could be considered as future work.

REFERENCES

[1]. T.F. Tan, K. Elangovan, L. Jin, Y. Jie, L. Yong, J. Lim, S. Poh, W.Y. Ng, D. Lim, Y. Ke, N. Liu, and D.S.W. Ting, "Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4," arXiv preprint arXiv:2402.10083, 2024.
[2]. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, ... and T. Scialom, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[3]. Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y.T. Lee, "Textbooks are all you need II: phi-1.5 technical report," arXiv preprint arXiv:2309.05463, 2023.
[4]. A.Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D.S. Chaplot, D.D.L. Casas, ... and W.E. Sayed, "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.


[5]. Hugging Face - The AI community building the future. (n.d.). https://huggingface.co/
[6]. F. Khennouche, Y. Elmir, Y. Himeur, N. Djebari, and A. Amira, "Revolutionizing generative pre-traineds: Insights and challenges in deploying ChatGPT and generative chatbots for FAQs," Expert Systems with Applications, 246, 123224, 2024.
[7]. M. Jovanović, K. Kuk, V. Stojanović, and E. Mehić, "Chatbot Application as Support Tool for the Learning Process of Basic Concepts of Telecommunications and Wireless Networks," Facta Universitatis, Series: Automatic Control and Robotics, 22(2), 2024, pp. 057-073.
[8]. S. Balakrishnan, P. Jayanth, S. Parvathynathan, and R. Sivashankar, "Artificial intelligence-based vociferation chatbot for emergency health assistant," in AIP Conference Proceedings (Vol. 2742, No. 1), AIP Publishing, 2024.
[9]. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, ... and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, 33, 2020, pp. 9459-9474.
[10]. Huawei Talent. (n.d.). https://e.huawei.com/en/talent/portal/#/
[11]. Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, L. Yu, ... and D. Xiong, "Evaluating large language models: A comprehensive survey," arXiv preprint arXiv:2310.19736, 2023.
[12]. H. Naveed, A.U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, ... and A. Mian, "A comprehensive overview of large language models," arXiv preprint arXiv:2307.06435, 2023.
[13]. H.A. Alawwad, A. Alhothali, U. Naseem, A. Alkhathlan, and A. Jamal, "Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation," arXiv preprint arXiv:2402.05128, 2024.
[14]. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," Advances in Neural Information Processing Systems, 36, 2024.
[15]. J.C. Chow, L. Sanders, and K. Li, "Design of an educational chatbot using artificial intelligence in radiotherapy," AI, 4(1), 2023, pp. 319-332.
[16]. N. Ghorashi, A. Ismail, P. Ghosh, A. Sidawy, R. Javan, and N.S. Ghorashi, "AI-powered chatbots in medical education: potential applications and implications," Cureus, 15(8), 2023.
[17]. J. Wang, J. Macina, N. Daheim, S.P. Chowdhury, and M. Sachan, "Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots," arXiv preprint arXiv:2403.03307, 2024.
[18]. A. Bandi and H. Kagitha, "A Case Study on the Generative AI Project Life Cycle Using Large Language Models," Proceedings of the 39th International Conference, 98, 2024, pp. 189-199.
[19]. A. Chen, G. Stanovsky, S. Singh, and M. Gardner, "Evaluating question answering evaluation," in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, 2019, pp. 119-124.
[20]. K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318.
[21]. C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81.
[22]. S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[23]. T. Zhang, V. Kishore, F. Wu, K.Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in International Conference on Learning Representations, 2020.
