
The Complete Guide for Public Models & Exploring Methods for Fine Tuning

Muhammad Umair Imran


[email protected]

National University of Computer and Emerging Sciences

January 8, 2025
Contents

Abstract
Open Source Models
Fine Tuning Methods
    Low-Rank Adaptation (LoRA)
    QLoRA: Quantized Low-Rank Adaptation
    Prefix Fine Tuning
    Adapters
    Prompt Fine Tuning
    P-Tuning
Abstract
This report aims to provide clear guidelines for the selection of publicly available open-source models. It also explores various methods for fine-tuning these models depending on specific problem cases, resource availability, and performance requirements.

Open Source Models


Publicly available open-source models can be utilized for a variety of purposes. Below
are some notable models along with their specific use cases and details:

• Llama 3.1 8B Instruct: A model specifically trained to follow instructions, such as in conversations.
  https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

• Mistral 7B Instruct: Trained for instruction-following tasks.
  https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

• bigscience/bloomz-560m: Best for language translation, with support for over 46 languages.
  https://huggingface.co/bigscience/bloomz-560m

• google/gemma-2-2b-it: A model well-suited for small-resource environments, such as running on a laptop or edge devices.
  https://huggingface.co/google/gemma-2-2b-it

• tiiuae/falcon-180B-chat: Optimized for large-scale conversational applications.
  https://huggingface.co/tiiuae/falcon-180B-chat

• Salesforce/xgen-7b-8k-inst: Ideal for large context windows, especially for business-specific needs.
  https://huggingface.co/Salesforce/xgen-7b-8k-inst

• 01-ai/Yi-1.5-34B-Chat: Enhances chat performance in both English and Chinese.
  https://huggingface.co/01-ai/Yi-1.5-34B-Chat

• openai-community/gpt2: A general-purpose model that requires fine-tuning with task-specific instructions for optimal results.
  https://huggingface.co/openai-community/gpt2
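
All of the checkpoints above can be loaded through the Hugging Face transformers library. A minimal sketch (assuming transformers is installed and any gated model's license terms have been accepted; gpt2 is used here only because it is the smallest):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Any Hub checkpoint ID from the list above can be substituted here.
    checkpoint = "openai-community/gpt2"

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Generate a short continuation as a smoke test.
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))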

Fine Tuning Methods

PEFT: Parameter Efficient Fine-Tuning

Parameter Efficient Fine-Tuning (PEFT) refers to methods that focus on efficiently adapting large pre-trained models for specific downstream tasks without requiring full model retraining. PEFT techniques aim to save computational resources by reducing the number of trainable parameters.

Low-Rank Adaptation (LoRA) of Language Models


Low-Rank Adaptation, or LoRA, involves freezing the pretrained model weights while introducing trainable rank decomposition matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters for downstream tasks, making it a parameter-efficient method for fine-tuning large language models.

Figure 1: Low-Rank Adaptation Architecture.

LoRA works by approximating the weight update with the product of two low-rank matrices, which is then added back onto the frozen original weights. This enables the model to retain its original capabilities while being fine-tuned for specific tasks. As a result, LoRA speeds up training while maintaining overall model performance [1].
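
A minimal sketch of applying LoRA with the Hugging Face peft library. The rank r, the scaling factor lora_alpha, and the target module name below are illustrative choices for GPT-2, not values prescribed by this report:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

    # The update to each frozen weight W is the low-rank product B @ A,
    # where A is (r x k) and B is (d x r) with r much smaller than d and k.
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # rank of the decomposition matrices
        lora_alpha=16,              # scaling applied to the low-rank update
        lora_dropout=0.05,
        target_modules=["c_attn"],  # GPT-2's fused attention projection
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the A and B matrices are trainable

Because only the A and B matrices receive gradients, the optimizer state is far smaller than in full fine-tuning, which is where most of the resource savings come from.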

QLoRA: Quantized Low-Rank Adaptation


In large language models (LLMs), parameter efficiency is crucial for making deployment feasible in resource-limited settings. A model such as the 16-bit Llama 65B, which requires about 780 GB of GPU memory to fine-tune, can be prohibitively expensive to use in real-world applications. One approach to resolve this is quantization of the model, which reduces the model's precision, typically from 16-bit to 4-bit or 8-bit.

Figure 2: QLoRA

Once the model is quantized, it is further optimized through Low-Rank Adaptation (LoRA). By reducing the precision and applying LoRA, the model can be fine-tuned for specific tasks, leading to significant savings in both time and cost while retaining much of the original model's performance. For more details, refer to the work on QLoRA by Dettmers et al. [2].
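
A minimal sketch combining 4-bit quantization (via bitsandbytes) with LoRA, roughly following the recipe in [2]. The model ID and hyperparameters are illustrative assumptions, not values from this report:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

    # Load the base model in 4-bit NormalFloat (NF4) with double quantization.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Stabilize training on the quantized weights, then attach LoRA adapters.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention projections in Mistral
    )
    model = get_peft_model(model, lora_config)

The frozen base weights stay in 4-bit precision; only the small LoRA matrices are trained in higher precision, which is what makes fine-tuning large models feasible on a single GPU.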

Prefix Fine Tuning


In this method, we do not change the weights of the model; instead, we add a trainable prefix during fine-tuning. For example, if we are handling queries such as "Who is the president of Pakistan?", we can add a prefix, such as [history-related], to save time and computational resources. After fine-tuning, the model has already been adapted, so the prefix does not need to be added manually during inference.
For more details on prefix tuning, see Li and Liang [3].

Figure 3: Prefix Fine Tuning
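
A minimal sketch of prefix tuning with the peft library, where a number of trainable prefix vectors are prepended to the attention keys and values at every layer. The checkpoint and prefix length are illustrative choices:

    from transformers import AutoModelForCausalLM
    from peft import PrefixTuningConfig, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

    prefix_config = PrefixTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=20,  # length of the trainable prefix
    )

    model = get_peft_model(model, prefix_config)
    model.print_trainable_parameters()  # only the prefix parameters are trainable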

Adapters
Instead of performing full fine-tuning, adapter layers are added between the model's forward layers. Full fine-tuning adjusts all parameters, which requires significant resources. In the adapter approach, task-specific layers are added to the model, making it more flexible for many use cases. While this method saves resources, the additional layers can make inference slightly slower due to sequential processing.
For further information on adapters, refer to Hu et al. [4].

Figure 4: Adapters
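
A minimal PyTorch sketch of a bottleneck adapter layer of the kind described above. The class name and bottleneck size are illustrative assumptions, not taken from the cited work:

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Down-project, apply a non-linearity, up-project, and add a
        residual connection so the frozen pretrained path is preserved."""

        def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.act = nn.GELU()
            # Zero-initialize the up-projection so the adapter starts as an
            # identity function and training begins from the frozen model.
            nn.init.zeros_(self.up.weight)
            nn.init.zeros_(self.up.bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.up(self.act(self.down(x)))

During fine-tuning, only these adapter parameters receive gradients; the extra sequential computation in each forward pass is what makes inference slightly slower, as noted above.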

Prompt Fine Tuning


In prompt fine-tuning, a model is trained with manually written (hard) prompts or learnable soft prompts during the fine-tuning phase. This approach is computationally low-cost and can also be tailored to specific tasks.
For example, in a sentiment analysis task, a soft prompt setup might look like the following:

- Setup: the input becomes [soft prompt embeddings] + "This movie review is:" + [review text].
- During training, the model adjusts the [soft prompt embeddings] to better contextualize the task.
- Inference input: [soft prompt embeddings] + "This movie review is:" + "An exciting roller-coaster ride full of suspense"
- Expected output: "Positive"

Alternatively, a hard prompt is manually appended to the input:

- Input: "Sentiment classification: Positive or Negative. Review: An exciting roller-coaster ride full of suspense."
- Output: "Positive"
Further details on prompt tuning can be found in Shah [5].
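
A minimal sketch of soft prompt tuning with the peft library, initializing the soft prompt embeddings from the hard prompt text used in the example above. The checkpoint and token count are illustrative choices:

    from transformers import AutoModelForCausalLM
    from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

    prompt_config = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=8,  # number of trainable soft prompt embeddings
        prompt_tuning_init=PromptTuningInit.TEXT,
        prompt_tuning_init_text="This movie review is:",
        tokenizer_name_or_path="openai-community/gpt2",
    )

    model = get_peft_model(model, prompt_config)
    model.print_trainable_parameters()  # only the soft prompt embeddings train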

P-Tuning
P-Tuning adds machine-generated continuous prompt embeddings alongside the task-specific input for prediction. For example, the input may be structured as: [P-tuning embeddings] + [original task input text].
However, it is worth noting that this approach works particularly well on large models but does not produce optimal results on small models.
For more information on P-Tuning, see Liu et al. [6].

Figure 5: P-Tuning
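
A minimal sketch of P-Tuning with the peft library, where a small prompt encoder generates the continuous prompt embeddings. The checkpoint and hyperparameters are illustrative choices:

    from transformers import AutoModelForCausalLM
    from peft import PromptEncoderConfig, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

    ptuning_config = PromptEncoderConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=20,    # number of generated prompt embeddings
        encoder_hidden_size=128,  # hidden size of the prompt encoder
    )

    model = get_peft_model(model, ptuning_config)
    model.print_trainable_parameters()  # only the prompt encoder is trainable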

References
[1] E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," arXiv, 2021. [Online]. Available: https://arxiv.org/abs/2106.09685v2. [Accessed: 07-Jan-2025].

[2] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv, May 2023. [Online]. Available: https://arxiv.org/pdf/2305.14314.

[3] X. L. Li and P. Liang, "Prefix-Tuning: Optimizing Continuous Prompts for Generation," arXiv, 2021. [Online]. Available: https://arxiv.org/pdf/2101.00190. [Accessed: 07-Jan-2025].

[4] Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu, S. Poria, and R. K.-W. Lee, "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models," in Proc. of the 2023 Conf. on Empirical Methods in Natural Language Processing (EMNLP), Dec. 2023. [Online]. Available: https://aclanthology.org/2023.emnlp-main.319.pdf.

[5] S. Shah, "Prompt-Tuning: A Powerful Technique for Adapting LLMs to New Tasks," Medium. [Online]. Available: https://medium.com/@shahshreyansh20/prompt-tuning-a-powerful-technique-for-adapting-llms-to-new-tasks-6d6fd9b83557.

[6] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning Universally Across Scales and Tasks," arXiv preprint arXiv:2110.07602. [Online]. Available: https://arxiv.org/pdf/2110.07602.

[7] Hugging Face, "PEFT Package Reference," [Online]. Available: https://huggingface.co/docs/peft/en/package_reference/p_tuning.
