Full Fine-Tuning, PEFT, Prompt Engineering, or RAG
Introduction
Since the introduction of ChatGPT, businesses everywhere have eagerly sought to leverage
Large Language Models (LLMs) to elevate their operations and offerings. The potential of
LLMs is vast, but there’s a catch: even the most powerful pretrained LLM might not always
meet your specific needs right out of the box. Here’s why:
1. Tailored Outputs: Your application might demand a unique structure or style. Imagine a writing assessment tool that grades an essay and offers concise bullet-point feedback.
2. Missing Context: The pretrained LLM might not know about specific documents
crucial to your application. Consider a chatbot fielding technical queries about a
particular set of products. If the instruction manuals for those products weren’t in the
LLM’s training data, its accuracy can falter.
3. Specialized Vocabulary: Certain domains, industries, and even particular enterprises
often have unique terminologies, concepts, and structures not prominently
represented in the general pretraining data. Thus, a pretrained LLM might find it
challenging to summarize, or answer questions about, financial data, medical research
papers, or even transcripts of company meetings.
So, how do you make an LLM align with your unique requirements? You’ll likely need to
tweak or “tune” it. Currently, there are four prominent tuning methods:
Full Fine-tuning: Adjusts all parameters of the LLM using task-specific data.
Parameter-efficient Fine-tuning (PEFT): Modifies select parameters for more efficient
adaptation.
Prompt Engineering: Refines model input to guide its output.
RAG (Retrieval Augmented Generation): Merges prompt engineering with database
querying for context-rich answers.
These methods vary in the expertise they require, their cost, and their suitability for different scenarios. This article explores each, shedding light on their nuances and optimal use cases. Dive in to discover which method is the best fit for your project.
Full fine-tuning
Fine-tuning is the process we use to further train an already pre-trained LLM on a smaller,
task-specific, labeled dataset. In this way, we adjust some of the model parameters to
optimize its performance for a particular task or set of tasks. In full fine-tuning, all the model
parameters are updated, making it similar to pretraining, except that it’s done on a much smaller, labeled dataset.
Full fine-tuning in 6 steps
To illustrate, let’s say we want to build a tool that generates abstracts for biotechnology
research papers. For full fine-tuning, you’ll need to go through the following steps:
1. Collect a set of research papers from the target domain of biotechnology. Ensure each paper comes with its original abstract.
2. Split this collection into training, validation, and test sets.
3. Feed the processed content to the LLM as input and train it to generate the corresponding abstract as the output.
4. Monitor the model’s performance on the validation set to prevent overfitting and to decide when to stop training or make adjustments.
5. Evaluate performance: Once fine-tuning is complete, assess the model’s performance on the test set, which it hasn’t seen before. Metrics might include the BLEU score, ROUGE score, or human evaluations to measure the quality and relevance of the generated abstracts as compared to the original ones.
6. Based on evaluation results, iterate on the above steps, possibly collecting more data, adjusting hyperparameters, or trying different model configurations to improve performance.
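To make steps 3 and 4 concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The records list is a hypothetical stand-in for your corpus, and t5-small is used only to keep the example small; real model choice, preprocessing, and hyperparameters would differ.

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Hypothetical corpus: each record pairs a paper's body with its abstract.
records = [{"text": "...full paper text...", "abstract": "...its abstract..."}]

model_name = "t5-small"  # small stand-in; production would start from a larger LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Truncate long papers to fit the encoder's context window.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["abstract"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

splits = Dataset.from_list(records).train_test_split(test_size=0.2)
tokenized = splits.map(preprocess, batched=True, remove_columns=["text", "abstract"])

args = Seq2SeqTrainingArguments(
    output_dir="abstract-generator",
    evaluation_strategy="epoch",  # monitor validation loss after every epoch
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],  # serving here as the validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()  # full fine-tuning: every model parameter receives gradient updates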
What makes full fine-tuning worthwhile?
Full fine-tuning can be effective even with a relatively small task-specific dataset. The
pretrained LLM already understands general language constructs. The fine-tuning process
focuses primarily on adjusting the model’s knowledge to the specificities of the new data. A
pretrained LLM, initially trained on roughly 1 trillion tokens and demonstrating solid general
performance, can be efficiently fine-tuned with just a few hundred examples, equivalent to
several hundred thousand tokens.
Enhanced accuracy
By fine-tuning on a task-specific dataset, the LLM can grasp the nuances of that particular
domain. This is especially vital in areas with specialized jargon, concepts, or structures, such
as legal documents, medical texts, or financial reports. As a result, when faced with unseen
examples from the specific domain or task, the model is likely to make predictions or
generate outputs with higher accuracy and relevance.
Increased robustness
Fine-tuning allows us to expose the model to more examples, especially edge cases or less
common scenarios in the domain-specific dataset. This makes the model better equipped
to handle a wider variety of inputs without producing erroneous outputs.
Compute and Hardware Intensive
Full fine-tuning involves updating all the parameters of a large model. With large-scale LLMs boasting tens or hundreds of billions of parameters, training requires enormous amounts of computing power. Even when the fine-tuning dataset is relatively small, the number of tokens can be huge and expensive to compute.
Working with large models can necessitate specialized hardware, such as high-end GPUs or TPUs, with significant memory capacities. This is often impractical for many businesses.
Time & Expertise Intensive
When the model is very large, you often need to distribute the computation over multiple
GPUs and nodes. This requires appropriate expertise. Depending on the size of the model
and the dataset, fine-tuning can take hours, days, or even weeks.
Parameter-efficient fine-tuning
Parameter-efficient fine-tuning (https://huggingface.co/blog/peft) (PEFT) uses techniques
to further tune a pretrained model by updating only a small number of its total parameters.
A large language model pretrained on vast amounts of data has already learned a broad
spectrum of language constructs and knowledge. In short, it already possesses much of the
information needed for many tasks. Given the narrower scope of the downstream task, it’s often unnecessary and inefficient to adjust the entire model, so the fine-tuning process is conducted on a small subset of parameters instead.
PEFT methods vary in their approach to determining which components of the model are
trainable. Some techniques prioritize training select portions of the original model’s
parameters. Others integrate and train smaller additional components, like adapter layers,
without modifying the original structure.
LoRA
LoRA (https://arxiv.org/abs/2106.09685), short for Low-Rank Adaptation of Large Language Models, was introduced in 2021. It has since become the most commonly used PEFT method, helping companies and researchers reduce their fine-tuning costs. Using reparameterization, this technique shrinks the set of trainable parameters by performing a low-rank approximation of the weight update.
For example, if we have a 100,000 x 100,000 weight matrix, full fine-tuning requires updating 10,000,000,000 parameters. Using LoRA, we can capture all or most of the crucial information with a low-rank update that is trained during fine-tuning.
To get this low-rank update, we reparametrize the change to the original weight matrix as the product of two rank-r matrices, A and B: A has shape 100,000 x r and B has shape r x 100,000. If r = 2, we end up updating (100,000 x 2) + (2 x 100,000) = 400,000 parameters instead of 10,000,000,000. By updating a much smaller number of parameters, we reduce the computational and memory requirements of fine-tuning. See the diagram in the original LoRA paper for an illustration of the technique.
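As a toy illustration (not the paper’s reference implementation), here is a LoRA-style linear layer in PyTorch. The pretrained weight stays frozen; only the low-rank factors A and B receive gradients.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update scale * (B @ A)."""
    def __init__(self, in_features, out_features, r=2, alpha=2.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # shape r x d_in
        self.B = nn.Parameter(torch.zeros(out_features, r))        # shape d_out x r
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to applying the adapted weight W + scale * (B @ A).
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(1024, 1024, r=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4096 = (2 x 1024) + (1024 x 2), vs. 1,048,576 for the full matrix

Initializing B to zeros means the adapted model starts out identical to the pretrained one, which is the convention used in the paper.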
Advantages of LoRA
Task switching efficiency – Creating different versions of the model for specific tasks becomes easier. You can store a single copy of the pretrained weights and build many small LoRA modules. When you switch from task to task, you replace only the matrices A and B and keep the base LLM. This significantly reduces storage requirements.
Fewer GPUs required – LoRA reduces GPU memory requirements by up to 3x, since we don’t compute or store gradients and optimizer states for most parameters.
High accuracy – On a variety of evaluation benchmarks, LoRA’s performance proves to be almost equivalent to that of full fine-tuning, at a fraction of the cost. For example, Deci’s DeciLM 6B (https://deci.ai/blog/decilm-15-times-faster-than-llama2-nas-generated-llm-with-variable-gqa/) was fine-tuned using LoRA for instruction following. The fine-tuned model, DeciLM 6B-Instruct, matches the performance of top-tier models in its class.
LoRA’s efficacy in situations where you need to fine-tune a model on multiple tasks is less well established. In such situations, the pretrained LLM needs to be fine-tuned on each task sequentially, and it remains to be seen whether LoRA can maintain the accuracy of full fine-tuning.
For a step-by-step guide on using LoRA to fine-tune Llama 2 7B, read our technical blog
post (https://deci.ai/blog/fine-tune-llama-2-with-lora-for-question-answering/).
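In practice, you rarely implement LoRA by hand. The Hugging Face peft library wraps an existing model so that only the injected low-rank matrices are trainable. A minimal sketch, using gpt2 as a small stand-in (the target module names vary by architecture):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in for a larger LLM

config = LoraConfig(
    r=8,                        # rank of the update matrices A and B
    lora_alpha=16,              # scaling factor for the low-rank update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Prints roughly: trainable params ~0.3M of ~124M total -- only A and B are updated.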
Whether or not PEFT is an effective alternative to full fine-tuning depends on the use case and the particular PEFT technique selected. In PEFT you train a much smaller number of parameters than in full fine-tuning, and if the task is “hard enough,” that gap in trained parameters can show up as a gap in accuracy.
Prompt engineering
The methods discussed so far involve training model parameters on a new dataset and task,
using either all of the pretrained weights, as in full fine-tuning, or a separate set of weights, as
in LoRA. In contrast, prompt engineering doesn’t involve training network weights at all. It is
the process of designing and refining the input given to a model to guide and influence the
kind of output you want.
Zero-shot prompting
In zero-shot prompting, we prepend a certain instruction to the user’s query without
providing the model with any direct examples.
Imagine you’re developing a tech support chatbot using a large language model. To make
sure the model focuses on providing tech solutions without having prior examples, you can
prepend a specific instruction to all user inputs:
Prompt:
Provide a tech support solution based on the following user concern. User concern: My computer won't turn on.
Solution:
By prepending an instruction to the user query (“My computer won’t turn on”), we give the model context for the kind of answer desired. This is a way of adapting its output for tech support even without explicit examples of tech solutions.
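In code, this amounts to simple string templating. A sketch, where the resulting string would be passed to whichever LLM client you use:

def build_zero_shot_prompt(user_concern):
    # Prepend a fixed instruction so the model answers as a tech support agent.
    return (
        "Provide a tech support solution based on the following user concern.\n"
        f"User concern: {user_concern}\n"
        "Solution:"
    )

print(build_zero_shot_prompt("My computer won't turn on."))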
Few-shot prompting
In few-shot prompting, we prepend a few examples to the user’s query. These examples are
essentially pairs of sample input and expected model output.
Imagine creating a health app that categorizes dishes into ‘Low Fat’ or ‘High Fat’ using a
language model. To orient the model, a couple of examples are prepended to the user
query:
Prompt:
Classify the following dish based on its fat content: Grilled chicken, lemon, herbs. Response: Low Fat
Classify the following dish based on its fat content: Mac and cheese with heavy cream and butter. Response: High Fat
Classify the following dish based on its fat content: Avocado toast with olive oil.
Response:
Informed by the examples in the prompt, a large enough and well-trained LLM will reliably respond: “High Fat.”
Few-shot prompting is a good way of getting the model to adopt a certain response format.
Going back to our tech support app example, if we wanted the model’s response to conform
to a certain structure or length restrictions, we could do so through few-shot prompting.
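The same templating idea extends to few-shot prompting: a small list of labeled examples is formatted ahead of the real query. A sketch, with the example pairs taken from above:

EXAMPLES = [
    ("Grilled chicken, lemon, herbs", "Low Fat"),
    ("Mac and cheese with heavy cream and butter", "High Fat"),
]

def build_few_shot_prompt(dish):
    lines = [
        f"Classify the following dish based on its fat content: {inp} Response: {out}"
        for inp, out in EXAMPLES
    ]
    # The real query goes last; the response is left for the model to complete.
    lines.append(f"Classify the following dish based on its fat content: {dish}")
    lines.append("Response:")
    return "\n".join(lines)

print(build_few_shot_prompt("Avocado toast with olive oil"))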
Chain-of-thought prompting
Chain-of-thought prompting allows for detailed problem-solving by guiding the model
through intermediate steps. Pairing it with few-shot prompting can enhance performance
on tasks that need thoughtful analysis before generating an answer.
Prompt:
Subtracting the smallest number from the largest in this group results in an even number: 5, 8, 9.
A: Subtracting 5 from 9 gives 4. The answer is True.
Subtracting the smallest number from the largest in this group results in an even number: 10, 15, 20.
A: Subtracting 10 from 20 gives 10. The answer is True.
Subtracting the smallest number from the largest in this group results in an even number: 7, 12, 15.
A:
In fact, chain-of-thought prompting can also be paired with zero-shot prompting to enhance performance on tasks that require step-by-step analysis. Going back to our tech support app example, if we wanted to improve the model’s performance, we could ask it to break down the solution step by step.
Prompt:
Break down the tech support solution step by step based on the following user concern. User concern: My computer won't turn on.
Solution:
For a variety of applications, basic prompt engineering of a very large LLM can deliver ‘good enough’ accuracy. It provides an economical adaptation method because it is fast and doesn’t involve large amounts of computing power. The downside is that it’s simply not accurate or robust enough for use cases where additional background knowledge is required.
RAG (Retrieval Augmented Generation)
RAG addresses this limitation by combining prompt engineering with retrieval: documents relevant to the user’s query are fetched from an external data store and inserted into the prompt, grounding the LLM’s answer in context that wasn’t part of its training.
RAG is also useful when the application needs to use up-to-date information and documents that weren’t part of the LLM’s training set. Some examples might be news databases or applications that search for medical research associated with new treatments. Simple prompt engineering can’t handle these cases because of the LLM’s limited context window. Currently, for most use cases, you can’t feed the entire set of documents into the prompt of the LLM.
Advantages of RAG
Easily adapts to new data – RAG can adapt in situations where facts could evolve over time,
making it useful for generating responses that require up-to-date information.
Interpretable – Using RAG, the source of the LLM’s answer can be pinpointed. Having
traceability regarding the source of an answer can be beneficial for internal monitoring,
quality assurance, or addressing customer disputes.
Cost effective – Instead of fine-tuning the entire model on the task-specific dataset, you can
get comparable results with RAG, which involves far less labeled data and computing
resources.
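To make the moving parts concrete, here is a minimal sketch of the retrieve-then-generate loop. The embed function is a crude placeholder so the example runs; a real system would use an embedding model and a proper vector store:

import numpy as np

def embed(text):
    # Placeholder embedding: hash characters into a fixed-size vector.
    # A real system would call an embedding model here.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

DOCUMENTS = [  # stand-in corpus; a vector store would index these offline
    "Manual X200: if the laptop won't power on, reseat the battery.",
    "Manual X200: the cooling fan is user-replaceable; see page 12.",
]
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]

def retrieve(query, k=1):
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: float(q @ item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_rag_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

# The prompt below would be sent to a pretrained LLM for generation.
print(build_rag_prompt("My X200 won't turn on."))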
Cost
Prompt engineering – Prompt engineering has the lowest cost of the four approaches. It boils down to writing and testing prompts to find ones that deliver good results when fed to the pretrained LLM. It may also involve updating prompts if the pretrained model is itself updated or replaced. This can happen periodically when using a commercial model like OpenAI’s GPT-4.
RAG – The cost of implementing RAG may be higher than that of prompt engineering. This is
because of the need for multiple components: embedding model, vector store, vector store
retriever, and pretrained LLM.
PEFT – The cost of PEFT tends to be higher than that of RAG. This is because fine-tuning, even efficient fine-tuning, requires a considerable amount of computing power, time, and ML expertise. Moreover, to maintain this approach, you’ll need to fine-tune periodically to incorporate new relevant data into the model.
Full fine-tuning – This method is significantly more costly than PEFT, given that it requires
even more compute power and time.
Complexity of implementation
From the relatively straightforward approach of prompt engineering to the more intricate
configurations of RAG and advanced tuning methods, the complexity can vary significantly.
Here’s a quick rundown on what each method entails:
Prompt engineering – This method has relatively low implementation complexity. It requires
little to no programming. To draft a good prompt and conduct experiments, a prompt
engineer needs good language skills, domain expertise, and familiarity with few-shot
learning approaches.
RAG – This approach has higher implementation complexity than prompt engineering. To
implement this solution, you need coding and architecture skills. Depending on the RAG
components chosen, the complexity could run very high.
PEFT and Full fine-tuning – These approaches are the most complex to implement. They
demand a deep understanding of deep learning and NLP and expertise in data science to
change the model’s weights via tuning scripts. You’ll also need to consider factors such as
the training data, learning rate, loss function, etc.
Accuracy
Evaluating the accuracy of different approaches for LLM adaptation can be intricate,
especially because the accuracy often hinges on a blend of distinct metrics. The significance
of these metrics can fluctuate based on the specific use case. Certain applications might
prioritize domain-specific jargon. Others might prioritize the ability to trace the model’s
response back to a particular source. To find the most accurate approach for your needs, it’s
imperative to identify the pertinent accuracy metrics for your application and compare the
methodologies against those specific criteria.
Domain-specific terminology
Fine-tuning can effectively teach an LLM domain-specific terminology. While RAG is proficient at data retrieval, it may not capture domain-specific patterns, vocabulary, and nuances as well as a fine-tuned model. For tasks demanding strong domain affinity, fine-tuning is the way to go.
Up-to-date responses
A fine-tuned LLM becomes a fixed snapshot of its training dataset and will need regular
retraining for data that is evolving. This makes exclusive fine-tuning (both full and PEFT) a
less viable approach for applications that require responses to be synced with a dynamic
pool of information. In contrast, RAG’s external queries can ensure updated responses,
making it ideal for environments with dynamic data.
Transparency and Interpretability
For some applications, understanding the model’s decision-making is crucial. While fine-
tuning functions more like a ‘black box’, obscuring its reasoning, RAG provides clearer insight.
Its two-step process identifies the documents it retrieves, enhancing user trust and
comprehension.
Hallucinations
Pretrained LLMs sometimes make up answers that are missing from their training data or the provided input. Fine-tuning can reduce these hallucinations by focusing an LLM on domain-specific data. However, unfamiliar queries may still cause the LLM to concoct a made-up answer. RAG reduces hallucinations by anchoring the LLM’s response in the retrieved documents. The initial retrieval step essentially fact-checks, while the subsequent generation is restricted to the context of the retrieved data. For tasks where avoiding hallucinations is paramount, RAG is recommended.
We see that RAG is excellent for cases where interpretability, up-to-date responses, and
avoiding hallucinations are paramount. Full fine-tuning and PEFT are clear winners for use
cases that put most of the weight on domain-specific style and vocabulary. But what if your
use case requires both? In that case, you might want to consider a hybrid approach, using
both fine-tuning and RAG.
Inference Efficiency
Beyond adaptation, LLMs demand intricate computational processes, which can lead to significant latency issues that degrade the user experience, especially in applications requiring real-time responses. A major hurdle is low throughput, which causes delayed responses and hinders the handling of concurrent user requests. This typically necessitates costlier, high-performance hardware to improve throughput, thereby elevating operational expenses. Consequently, the need for such advanced hardware compounds the already substantial computational costs of implementing these models.
Enabling inference with just three lines of code, Infery-LLM makes it easy to deploy LLMs into production and into any environment. Its optimizations unlock the true potential of LLMs, as exemplified when it runs DeciLM-7B (https://deci.ai/blog/introducing-decilm-7b-the-fastest-and-most-accurate-7b-large-language-model-to-date/), achieving 4.4x the speed of Mistral 7B on vLLM, with a 64% reduction in inference costs.