This document outlines the key steps to build and fine-tune large language models for software engineering tasks. It discusses: 1. The 4 main steps to build a traditional machine learning model: data curation, model architecture selection, model training, and model evaluation. 2. The additional step needed for large language models: fine-tuning the pre-trained model for specific tasks. 3. Details of each step, including types of model architectures, training techniques for billion-parameter models, and evaluation benchmarks. 4. Examples of fine-tuning a large language model for software engineering assistance, such as sentence completion.


Building and Fine-tuning GenAI Models
SOEN 691: Generative Artificial Intelligence for Software Engineering
Diego Elias Costa


Language Models

We will focus on LLMs.
Source: A Survey of Large Language Models
Building and Fine-tuning GenAI Models

This is a roadmap class: explore the reference materials if you want to know the concepts in detail!
Building a Large Language Model
How much does it cost?
• Training the smallest Llama model (7B parameters) took 180,000 GPU hours
• Renting: an Nvidia A100 costs $1–2 per GPU per hour, so training would cost roughly $250,000
Source: Llama 2: Open Foundation and Fine-Tuned Chat Models
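A quick back-of-the-envelope check of that figure (the $1.40/hour midpoint is an assumed rate, not a number from the paper):

gpu_hours = 180_000   # reported GPU hours for Llama 7B
rate = 1.40           # assumed midpoint of the $1-2 A100 rental range
print(f"${gpu_hours * rate:,.0f}")   # -> $252,000, in line with the ~$250,000 estimate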
Building a Large Language Model

Theoretically…
Source: A Survey of Large Language Models
Building a machine learning model
Here are the 4 key steps to build a traditional ML model (see the sketch after this list):

1. Curating the data
2. Choosing the appropriate model/algorithm
3. Training the model
4. Evaluating the model
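A minimal sketch of these four steps in scikit-learn; the dataset and model choice are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Curate the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Choose the model/algorithm
model = LogisticRegression(max_iter=1000)

# 3. Train the model
model.fit(X_train, y_train)

# 4. Evaluate the model
print(accuracy_score(y_test, model.predict(X_test)))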
Want to know more?
Machine Learning
Book freely available
• Covers the basics of classifiers and regressions
• Emphasis on interpretability
  • How do ML models work?
  • Techniques to explain and interpret ML models
Source: Interpretable Machine Learning
Want to know more?
Engineering AI-Based Systems
• Course focused on engineering AI-based systems
  • Requirements
  • Architecture and Design
  • Implementation
  • Testing
  • Operations
Source: Engineering AI-Based Software Systems
Building large language models
Here are the 5 key steps to build a large language model:

1. Curating/processing large amounts of data
2. Choosing the appropriate transformer
3. Training the model at scale
4. Evaluating the base model
5. Fine-tuning the model to specific cases

Source: How to Build an LLM from Scratch | An Overview
Step 1. Data Curation/Processing
Data quality is paramount for the quality of your model
• Garbage in, garbage out

However, data quantity seems to be playing an even bigger role

Source: Llama 2: Open Foundation and Fine-Tuned Chat Models
Step 1. Data Curation/Processing
Where do we get all this data? The Internet:
• Public datasets
  • Common Crawl
  • The Pile
  • Hugging Face datasets
• Private data sources
• Use an LLM to generate your dataset
  • Alpaca
Source: A Survey of Large Language Models
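For instance, a public corpus can be pulled from the Hugging Face Hub in a couple of lines (the dataset name here is just an illustration):

from datasets import load_dataset

# stream a small public text corpus from the Hugging Face Hub
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(ds[0]["text"])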
Step 1. Data Curation/Processing
Data diversity is key to generalizability
• Larger diversity of sources
Source: A Survey of Large Language Models
Step 1. Data Curation/Processing
How to prepare the data?
• Filter data based on quality
  • Use heuristics or models to put a higher weight on better-quality text
• De-duplicate the data (see the sketch below)
  • Reduces the bias toward common text
• Privacy redaction
  • Removal of sensitive and confidential information
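A minimal exact de-duplication sketch; real pipelines typically use fuzzy methods such as MinHash, so treat this as illustrative only:

import hashlib

def deduplicate(documents):
    # drop exact duplicates by hashing normalized text
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello world", "hello world ", "Goodbye"]))  # 2 documents survive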
Step 1. Data Curation/Processing
Large Language Models do not understand "pure text".

Common processing techniques, illustrated on the sentence "Hi, my name is Pierre!":
1. Punctuation padding: "Hi , my name is Pierre !"
2. Tokenization: splitting the padded text into discrete tokens
3. Text vectorization: mapping tokens to integer IDs, e.g., [12 56 15 56 33 49 2]
4. Causal masking: each training position sees only the tokens that precede it
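In practice, steps 2 and 3 are handled by a tokenizer; a small example with the transformers library (the GPT-2 tokenizer is illustrative, and the printed IDs depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hi, my name is Pierre!")["input_ids"]
print(ids)                             # integer IDs; exact values depend on the vocabulary
print(tok.convert_ids_to_tokens(ids))  # the tokens those IDs represent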
Step 2. Model Architecture
Transformers
• Neural networks that use the attention mechanism
• Dependency of words based on position and context
Source: Attention is All You Need
The Attention Mechanism
Learn to pay attention to some key words, depending on the context:

The pink elephant tried to get into a car but it was too ________

Positional encoding: the order of the words matters.

The dog looked at the boy and ________
The boy looked at the dog and ________
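Concretely, this is the scaled dot-product attention of Attention is All You Need; a minimal single-head NumPy sketch without masking:

import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V: each query position attends to every key
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

x = np.random.randn(4, 8)        # 4 tokens, 8-dimensional embeddings
print(attention(x, x, x).shape)  # (4, 8)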
Why don't we have the same diversity?
[Figure: diversity of Machine Learning algorithms vs. LLM algorithms]
Source: Machine learning in landscape ecological analysis: a review of recent approaches
Step 2. Model Architecture
Choose between 3 types of Transformers:
• Encoder-only
• Encoder-decoder
• Decoder-only
Encoder-only Architecture
Good for tasks requiring a nuanced understanding of the entire sentence or code snippet.
• Code review, bug report understanding, and named entity recognition
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Encoder-decoder Architecture
Good for translation or summarization tasks.
• Code summarization, (programming?) language translation

Source: Large Language Models for Software Engineering: A Systematic Literature Review
Decoder-only Architecture
Good for generative tasks
• Basically any SE task that requires generation
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Architectures of LLMs for SE
Decoder-only architectures (e.g., GPT, Llama) are the most popular in SE research
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Step 2. Model Architecture
Other design choices and hyper-parameters to tune:
• Activation functions, layer normalization, position embeddings

Model size
• Rule of thumb: ~20 training tokens per parameter, so a 7B-parameter model calls for roughly 140B tokens

Source: Training Compute-Optimal Large Language Models


Step 3. Training at Scale
• Self-supervision
  • We can train large models using huge amounts of data only because the label (in causal LMs, the next token) is already part of the input
• Efficiency is key when training billion-parameter models
  • 3D parallelism (data parallelism, pipeline parallelism, tensor parallelism)
  • Mixed-precision training: use 16-bit floating-point numbers instead of 32-bit to reduce memory usage and communication overhead (see the sketch below)

Source: A Survey of Large Language Models
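A minimal PyTorch training-loop sketch combining both ideas: the shifted labels implement next-token self-supervision, and autocast plus a gradient scaler implement mixed precision. The model, loader, and optimizer are assumed to exist; this is an illustration, not code from the survey.

import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()      # rescales gradients for float16 stability

for batch in loader:                       # batch: (B, T) tensor of token IDs (assumed)
    inputs, labels = batch[:, :-1], batch[:, 1:]   # labels are the inputs shifted by one
    with torch.autocast("cuda", dtype=torch.float16):  # 16-bit compute where safe
        logits = model(inputs)                          # (B, T-1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()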
Step 4. Evaluation
How to assess the quality of the trained LLM?
• Benchmarks (Open LLM Leaderboard)
  • Multiple-choice tasks: ARC, HellaSwag, MMLU
  • Open-ended tasks: TruthfulQA
Example: HellaSwag
• Can a machine really finish your sentence?
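Multiple-choice benchmarks like HellaSwag are typically scored by comparing the model's log-likelihood of each candidate ending and picking the highest. A minimal sketch, assuming a Hugging Face causal LM and tokenizer (model, tok) plus hypothetical context/endings variables; real harnesses also apply length normalization:

import torch

def ending_score(model, tok, context, ending):
    ids = tok(context + " " + ending, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..T-1
    return logp.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

best = max(endings, key=lambda e: ending_score(model, tok, context, e))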
Building large language models
Here are the 5 key steps to build a large language model:

1. Curating large amounts of data
2. Choosing the appropriate transformer
3. Training the model at scale
4. Evaluating the model
5. Fine-tuning the model to specific cases

Steps 1–4 produce the base LLM (the foundation model); step 5 adapts it.
DEMO
https://www.kaggle.com/code/diegoeliascosta/soen691-building-your-own-gpt
From Base Models to SE-related Solutions
Source: Large Language Models for SE
Fine-tuning in a nutshell
1. Define an SE-related task/problem where an LLM has the potential to assist
2. Find a reasonably sized dataset to fine-tune the model on
3. Choose a pre-trained LLM
   • Open source, or closed source + API
4. Choose the method for fine-tuning (see the sketch after this list)
5. Evaluate the fine-tuned model on the SE task
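A minimal supervised fine-tuning sketch with the Hugging Face Trainer; the base model (gpt2), the dataset (a slice of CodeSearchNet), and all hyper-parameters are illustrative assumptions, not recommendations:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token               # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

ds = load_dataset("code_search_net", "python", split="train[:1%]")  # example SE dataset

def encode(example):
    out = tok(example["whole_func_string"], truncation=True,
              max_length=256, padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: labels are the inputs
    return out

ds = ds.map(encode, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=ds,
)
trainer.train()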
What SE tasks are commonly selected?
Most common SE tasks:
• Code generation (62)
• Program repair (23)
• Code completion (16)
• Code summarization (10)
SE-related Datasets
Challenge: which ones are **really** open source?

Dominance of code-based and text-based datasets
Source: Large Language Models for SE
There are plenty of base LLM models out there…
Challenge: which ones can be used on conventional hardware?
Fine-tuning strategies
• Instruction tuning
• Alignment tuning
  • Reinforcement Learning from Human Feedback (RLHF)
Source: A Survey of Large Language Models


Fine-tuning for SE Tasks
• Fine-tuning is commonly used to optimize LLMs
• However, due to data scarcity and computing restrictions, prompt engineering is still frequently used as a method for adapting LLMs to solve SE tasks
Source: Large Language Models for SE
Evaluating the fine-tuned model
The evaluation metric depends on the SE task:
• Classification tasks
  • F1-score, precision, recall, and accuracy
• Recommendation tasks
  • Mean Reciprocal Rank (MRR), Precision@k, F1-score@k
• Generation tasks
  • BLEU, Pass@k, Accuracy@k
  • @k: the correct answer was given in the first k ranked responses
Source: Large Language Models for SE
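Pass@k is usually computed with the unbiased estimator from the Codex paper (Evaluating Large Language Models Trained on Code): generate n samples per problem, count the c samples that pass the tests, and estimate the probability that at least one of k drawn samples passes.

from math import comb

def pass_at_k(n, c, k):
    # unbiased pass@k: n generated samples, c of which pass the tests
    if n - c < k:
        return 1.0          # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25: one in four single samples passes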
Technology and Libraries
• Transformers
  • Open-source Python library for building models using the Transformer architecture (example below)
• Want to use an LLM on your laptop?
  • llama.cpp, with bindings for most programming languages
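For instance, text generation with the Transformers pipeline API takes a few lines (the gpt2 model name is illustrative; any causal LM from the Hub works):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("def fibonacci(n):", max_new_tokens=40)
print(out[0]["generated_text"])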
Course Project
• Careful about the scope of the project
  • You should be discussing a Software Engineering problem/application
  • Using LLMs for other topics is not part of this course's scope
• Start diving deep into the type of problem you want to address
• Do you need to train or fine-tune an LLM in your project?
  • Not necessarily
  • Survey papers, replication papers, and empirical studies are part of the course's scope
Paper Critiques
• Next week's class will be a research discussion class!
• 2 papers to read:
  • 1 paper to summarize (half a page)
  • 1 paper to critique (1 to 2 pages max)
  • Summary + positive points (3+) + negative points (3+)
• Never read or critiqued a paper before?
  • Check the Complementary Materials in Moodle
