Fine Tuning
Fine-tuning: adapting a pre-trained language model to a specific task or domain by continuing its training on a smaller, task-specific dataset.
Benefits:
2. Increased consistency.
Where (tools and libraries):
PyTorch
Hugging Face
Lamini (Llama library)
Unsloth
3. Pass examples from the dataset to the model and collect its outputs.
4. Calculate the loss between the model’s outputs and the expected outputs.
5. Update the model parameters to reduce the loss using gradient descent
and backpropagation.
6. Repeat steps 3–5 for multiple epochs until the model converges.
7. The fine-tuned model can now be deployed for inference on new data.
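For illustration, a minimal PyTorch sketch of steps 3–6 (not part of the original notes), assuming a classification-style model and a dataloader that yields (inputs, labels) batches:

```python
# Minimal fine-tuning loop sketch: forward pass, loss, backprop, update, repeat.
import torch
from torch import nn, optim

def fine_tune(model: nn.Module, dataloader, epochs: int = 3, lr: float = 5e-5):
    criterion = nn.CrossEntropyLoss()                  # loss between outputs and expected labels
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                            # step 6: repeat for multiple epochs
        for inputs, labels in dataloader:
            outputs = model(inputs)                    # step 3: collect model outputs
            loss = criterion(outputs, labels)          # step 4: loss vs. expected outputs
            optimizer.zero_grad()
            loss.backward()                            # step 5: backpropagation
            optimizer.step()                           #         gradient-descent update
    return model                                       # step 7: ready for inference
```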
Step 3: Tokenizer
Step 4: Initialize our base model
Step 5: Evaluate method
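A hedged Hugging Face sketch of steps 3–5; the model name ("distilbert-base-uncased") and the accuracy metric are placeholder choices, not prescribed by these notes:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import numpy as np

model_name = "distilbert-base-uncased"                        # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)         # step 3: tokenizer
model = AutoModelForSequenceClassification.from_pretrained(   # step 4: initialize base model
    model_name, num_labels=2
)

accuracy = evaluate.load("accuracy")                           # step 5: evaluate method

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)
```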
Adapting to new data: If your data distribution changes over time, fine-tune the model to keep it up to date.
Customization
Every domain or task has its own unique language patterns, terminologies,
and contextual nuances. By fine-tuning a pre-trained LLM, you can
customize it to better understand these unique aspects and generate
content specific to your domain.
Disadvantages:
1. Overfitting
3. Underfitting
5. Loss of generalisation
If your task is extremely dissimilar to the original model’s training data, the model may struggle to connect its existing knowledge to the new domain.
Primary fine-tuning approaches:
Feature extraction
In feature extraction, the pre-trained LLM is used as a fixed feature extractor: the final layers of the model are trained on the task-specific data while the rest of the model remains frozen. This approach leverages the rich representations learned by the LLM and adapts them to the specific task, offering a cost-effective and efficient way to fine-tune LLMs.
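As a rough sketch of the feature-extraction approach (the model name is a placeholder), the pre-trained backbone is frozen so only the final layers are trained:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2            # placeholder base model
)

# Freeze every parameter of the pre-trained backbone ...
for param in model.base_model.parameters():
    param.requires_grad = False

# ... so only the newly added final layers are updated during training.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)                                        # e.g. the classifier head weights/biases
```

Removing the freezing loop in this sketch gives full fine-tuning, where every layer is updated, as described next.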
Full fine-tuning
Full fine-tuning is another primary approach to fine-tuning LLMs for specific
purposes. Unlike feature extraction, where only the final layers are adjusted,
full fine-tuning involves training the entire model on the task-specific data.
This means all the model layers are adjusted during the training process.
This approach is particularly beneficial when the task-specific dataset is
large and significantly different from the pre-training data. By allowing the
whole model to learn from the task-specific data, full fine-tuning can lead to a
more profound adaptation of the model to the specific task, potentially resulting
in superior performance. It is worth noting that full fine-tuning requires more
computational resources and time compared to feature extraction.
Full fine-tuning demands significant computational resources. Memory is required not only for storing the model but also for the additional parameters needed during training (gradients, optimizer states), which is a challenge on modest hardware. PEFT addresses this by updating only a subset of parameters, effectively “freezing” the rest.
Why PEFT?
PEFT empowers parameter-efficient models with impressive performance,
revolutionizing the landscape of NLP. Here are a few reasons why we use
PEFT.
Reduced Computational Costs: PEFT requires fewer GPUs and GPU time,
making it more accessible and cost-effective for training large language
models.
Faster Training Times: With PEFT, models finish training faster, enabling
rapid iterations and quicker deployment in real-world applications.
What is LoRA?
LoRA is an improved finetuning method where instead of finetuning all the
weights that constitute the weight matrix of the pre-trained large language
model, two smaller matrices that approximate this larger matrix are fine-tuned.
These matrices constitute the LoRA adapter. This fine-tuned adapter is then
loaded into the pre-trained model and used for inference.
After LoRA fine-tuning for a specific task or use case, the outcome is an
unchanged original LLM and the emergence of a considerably smaller “LoRA
adapter,” often representing a single-digit percentage of the original LLM size
(in MBs rather than GBs).
During inference, the LoRA adapter must be combined with its original LLM.
The advantage lies in the ability of many LoRA adapters to reuse the original
LLM, thereby reducing overall memory requirements when handling multiple
tasks and use cases
LoRA: Low Rank Adaptation
Reduces memory footprint: LoRA achieves this by applying a low-rank
approximation to the weight update matrix (ΔW). This means it represents
ΔW as the product of two smaller matrices, significantly reducing the
number of parameters needed to store ΔW.
Process (example): if ΔW is a 5 × 5 matrix, storing it directly takes 5 × 5 = 25 parameters; a rank-1 decomposition replaces it with a 5 × 1 matrix and a 1 × 5 matrix, i.e. only 5 + 5 = 10 trainable parameters.
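A hedged sketch of LoRA fine-tuning with the Hugging Face peft library; the base model ("gpt2"), rank, and target modules are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base LLM

lora_config = LoraConfig(
    r=8,                        # rank of the two low-rank factor matrices
    lora_alpha=16,              # scaling applied to the LoRA update
    target_modules=["c_attn"],  # which weight matrices get adapters (GPT-2 attention)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)   # frozen base + small trainable adapter
model.print_trainable_parameters()                # adapter is a small fraction of the base model
```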
QLoRA:
More memory efficient: QLoRA is even more memory efficient than LoRA,
making it ideal for resource-constrained environments.
If both memory and speed are important: QLoRA offers a good balance
between both.
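A hedged QLoRA sketch: the base model is loaded in 4-bit with bitsandbytes and a LoRA adapter is attached on top. The model id and hyperparameters are assumptions, not values from these notes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # weights stored in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute still runs in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # make the quantized model trainable
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32,
               target_modules=["q_proj", "v_proj"],  # assumed attention targets
               task_type="CAUSAL_LM"),
)
```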
1. Loading dataset
4. Tokenization
6. Pre-processing dataset
11. Evaluate the Model Quantitatively (with ROUGE Metric)
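A minimal sketch of step 11 with the evaluate library; the prediction and reference strings are made-up examples:

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)   # rouge1, rouge2, rougeL, rougeLsum values between 0 and 1
```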
1. Supervised fine-tuning
In supervised fine-tuning, the model is trained on a task-specific labeled dataset, where each input data
point is associated with a correct answer or label. The model learns to adjust its
parameters to predict these labels as accurately as possible. This process
guides the model to apply its pre-existing knowledge, gained from pre-training
on a large dataset, to the specific task at hand. Supervised fine-tuning can
significantly improve the model's performance on the task, making it an
effective and efficient method for customizing LLMs.
2. Transfer learning
Transfer learning is a powerful technique that’s particularly beneficial when
dealing with limited task-specific data. In this approach, a model pre-trained on
a large, general dataset is used as a starting point.
The model is then fine-tuned on the task-specific data, allowing it to adapt its
pre-existing knowledge to the new task. This process significantly reduces
the amount of data and training time required and often leads to superior
performance compared to training a model from scratch.
3. Multi-task learning
In multi-task learning, the model is fine-tuned on multiple related tasks
simultaneously. The idea is to leverage the commonalities and differences
across these tasks to improve the model's performance. The model can
develop a more robust and generalized understanding of the data by learning to
perform multiple tasks simultaneously.
This approach leads to improved performance, especially when the tasks it will
perform are closely related or when there is limited data for individual tasks.
Multi-task learning requires a labeled dataset for each task, making it an
inherent component of supervised fine-tuning.
4. Few-shot learning
Few-shot learning enables a model to adapt to a new task with little task-
specific data. The idea is to leverage the vast knowledge the model has already
gained from pre-training to learn effectively from just a few examples of the
new task. This approach is beneficial when the task-specific labeled data is
scarce or expensive.
In this technique, the model is given a few examples or “shots” during inference
time to learn a new task. The idea behind few-shot learning is to guide the
model's predictions by providing context and examples directly in the prompt.
Few-shot learning can also be integrated into the reinforcement learning from
human feedback (RLHF) approach if the small amount of task-specific data
includes human feedback that guides the model's learning process.
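For illustration only, a hypothetical few-shot prompt in which two in-context examples guide the model on a sentiment task it was not explicitly fine-tuned for:

```python
few_shot_prompt = """Classify the sentiment of each review.

Review: "Absolutely loved it, would buy again."
Sentiment: positive

Review: "Broke after two days, complete waste of money."
Sentiment: negative

Review: "The battery lasts forever and setup was painless."
Sentiment:"""
# The examples ("shots") supply the context; the model is expected to
# complete the last line with "positive".
```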
5. Task-specific fine-tuning
This method allows the model to adapt its parameters to the nuances and
requirements of the targeted task, thereby enhancing its performance and
relevance to that particular domain. Task-specific fine-tuning is particularly
valuable when you want to optimize the model's performance for a single, well-
defined task, ensuring that the model excels in generating task-specific content
with precision and accuracy.
Reinforcement learning from human feedback (RLHF)
RLHF facilitates the continuous enhancement of language models so they produce more accurate and contextually appropriate responses.
1. Reward modeling
In this technique, the model generates several possible outputs or actions, and
human evaluators rank or rate these outputs based on their quality. The model
then learns to predict these human-provided rewards and adjusts its behavior
to maximize the predicted rewards.
Reward modeling provides a practical way to incorporate human judgment into
the learning process, allowing the model to learn complex tasks that are
difficult to define with a simple function. This method enables the model to
learn and adapt based on human-provided incentives, ultimately enhancing its
capabilities.
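A hedged sketch of what a reward model can look like: a pre-trained backbone (placeholder name) with a scalar head that scores an output, trained to match human-provided ratings:

```python
import torch
from torch import nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "distilbert-base-uncased"):       # placeholder backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)  # scalar reward

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        return self.score_head(hidden[:, 0, :])   # reward predicted from the first token
```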
3. Comparative ranking
Comparative ranking is similar to reward modeling, but in comparative ranking,
the model learns from relative rankings of multiple outputs provided by human
evaluators, focusing more on the comparison between different outputs.
In this approach, the model generates multiple outputs or actions, and human
evaluators rank these outputs based on their quality or appropriateness. The
model then learns to adjust its behavior to produce outputs that are ranked
higher by the evaluators.
By comparing and ranking multiple outputs rather than evaluating each output
in isolation, comparative ranking provides more nuanced and relative feedback
to the model. This method helps the model understand the task subtleties
better, leading to improved results.
4. Preference learning (reinforcement learning with preference feedback)
Preference learning, also known as reinforcement learning with preference
feedback, focuses on training models to learn from human feedback in the form
of preferences between states, actions, or trajectories. In this approach, the
model generates multiple outputs, and human evaluators indicate their
preference between pairs of outputs.
The model then learns to adjust its behavior to produce outputs that align with
the human evaluators' preferences. This method is useful when it is difficult to
quantify the output quality with a numerical reward but easier to express a
preference between two outputs. Preference learning allows the model to learn
complex tasks based on nuanced human judgment, making it an effective
technique for fine-tuning the model on real-life applications.
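A small sketch of a pairwise preference loss (a Bradley-Terry-style form, chosen here as an assumption) that pushes the model to score the human-preferred output above the rejected one:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): near zero when the preferred output scores much higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

loss = preference_loss(torch.tensor([1.8, 0.4]), torch.tensor([0.2, 0.9]))
```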
5. Parameter efficient fine-tuning
Parameter-efficient fine-tuning (PEFT) is a technique used to improve the
performance of pre-trained LLMs on specific downstream tasks while
minimizing the number of trainable parameters. It offers a more efficient
approach by updating only a minor fraction of the model parameters during
fine-tuning.
DAY : 2
What is Instruction Fine tuning?
Instruction fine-tuning is a specialized technique to tailor large language
models to perform specific tasks based on explicit instructions. While
traditional fine-tuning involves training a model on task-specific data,
instruction fine-tuning goes further by incorporating high-level instructions or
demonstrations to guide the model’s behavior.
Giving one or more examples can be enough for a model to identify and carry out a task, but smaller models don’t always respond as expected to instructions alone.
Steps :
This is where instruction fine-tuning comes into play. By training the model to change its behavior and respond more effectively to instructions, we can enhance its performance and usefulness.
Unlike pre-training, where models learn to predict the next word based on
general text, fine-tuning allows us to train the model on a smaller dataset
specifically tailored to following instructions. This fine-tuning process enables
the model to adapt and excel at specific tasks.
2. Text generation
3. Text summarization
Fine-tuning process:
instruction dataset —> pre-trained model —> labelling —> loss measurement
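A sketch of how one instruction-dataset example might be formatted into a prompt before the loss is measured; the template wording (Alpaca-style) is an assumption, not a fixed standard:

```python
def format_example(instruction: str, model_input: str, expected_output: str):
    # Hypothetical instruction-formatting template.
    prompt = (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Input:\n"
        f"{model_input}\n\n"
        "### Response:\n"
    )
    # The loss is measured between the model's completion and the expected output.
    return prompt, expected_output

prompt, label = format_example(
    "Summarize the following text.",
    "The cat sat on the mat while the dog watched.",
    "A cat sat on a mat as a dog watched.",
)
```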
Fine-tuning on a single task:
Catastrophic forgetting:
1. The model becomes good at the single fine-tuned task (e.g., sentiment analysis) but forgets how to perform other tasks.
How to handle:
Multitask fine-tuning:
The training data mixes instruction examples for several tasks (e.g., summarization), so a single model learns to carry out a variety of tasks and output formats from instruction-style inputs.
Dataset tasks used during tuning — datasets are chosen from a variety of sources, such as research papers and other existing datasets.
FLAN (Fine-tuned Language Net) — the last step of the training process — yields a general-purpose instruction-tuned model.
Padding / Truncation
Padding and truncation are preprocessing techniques used in transformers to
ensure that all input sequences have the same length.
Padding refers to the process of adding extra tokens (usually a special token
such as [PAD] ) to the end of short sequences so that they all have the same
length. This is done so that the model can process all the sequences in a batch
simultaneously. The padded tokens do not carry any semantic meaning and are
just used to fill up the extra space in the shorter sequences.
Truncation, on the other hand, refers to the process of cutting off the end of
longer sequences so that they are all the same length. This is done to ensure
that the model is not overwhelmed by very long sequences and to reduce the
computational overhead of processing large sequences.
Sequence 1: "The cat sat on the mat [PAD] [PAD] [PAD] [PAD]"
Sequence 2: "The dog chased the cat [PAD] [PAD] [PAD] [PAD] [PAD]"
Sequence 3: "The mouse ran away from the cat and the dog"
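A short sketch of padding and truncation with a Hugging Face tokenizer, reusing the sequences above; the tokenizer name and max_length are placeholder choices:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder tokenizer

sentences = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The mouse ran away from the cat and the dog",
]

batch = tokenizer(
    sentences,
    padding=True,       # pad shorter sequences with [PAD] up to the longest in the batch
    truncation=True,    # cut off sequences longer than max_length
    max_length=12,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # every row now has the same length
```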
For LLMs, outputs are nondeterministic and evaluation is much more challenging; hence, we use automated metrics such as the ROUGE and BLEU scores to evaluate these models.
ROUGE scores alone may not capture the complete context and ordering of
words, leading to potentially deceptive results. To overcome this limitation, the
ROUGE-L score calculates the longest common subsequence between the
reference and generated outputs, giving a more comprehensive evaluation. It
takes into account the ordering of words and provides a more accurate
assessment.
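In formula form (the standard LCS-based definition; X is the reference of length m, Y the generated output of length n):

$$R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X,Y)}{m},\qquad P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X,Y)}{n},\qquad \mathrm{ROUGE\text{-}L} = \frac{(1+\beta^{2})\,R_{\mathrm{lcs}}\,P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^{2}\,P_{\mathrm{lcs}}}$$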
BLEU Score
Finally, to calculate the BLEU score, we multiply the brevity penalty by the geometric average of the n-gram precision scores.
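In formula form (standard BLEU definition, with p_n the modified n-gram precisions, uniform weights w_n = 1/N, c the candidate length and r the reference length):

$$\mathrm{BLEU} = BP \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),\qquad BP = \begin{cases} 1 & c > r\\ e^{\,1 - r/c} & c \le r \end{cases}$$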
Bleu Score can be computed for different values of N. Typically, we use N = 4.
It corresponds with the way a human would evaluate the same text.
It can be used when you have more than one ground truth sentence.
It is used very widely, which makes it easier to compare your results with
other work
ROUGE and BLEU metrics serve as valuable tools for evaluating the
performance of large language models in tasks like summarization and
translation. These metrics provide an automated and structured way to
measure the quality and similarity of generated outputs compared to human
references. However, for a more comprehensive evaluation, it is essential to
consider task-specific evaluation benchmarks developed by researchers in the
field.
1. GLUE and SuperGLUE: GLUE (General Language Understanding Evaluation), introduced in 2018, comprises diverse natural language tasks like sentiment analysis and question-answering. SuperGLUE, introduced in 2019, addresses GLUE’s limitations and features more challenging tasks, including multi-sentence reasoning and reading comprehension. Both benchmarks provide leaderboards to compare model performance and track progress.
GLUE — used to evaluate tasks such as:
Sentiment analysis
Question answering
SuperGLUE —
Reading comprehension
HELM (Holistic Evaluation of Language Models) takes a multimetric approach, measuring seven metrics across 16 core scenarios, going beyond basic accuracy measures. HELM also includes metrics for fairness, bias, and toxicity, which are crucial as LLMs become more capable of human-like generation and potential harm.