This document outlines the key steps to build and fine-tune large language models for software engineering tasks. It discusses: 1. The 4 main steps to build a traditional machine learning model: data curation, model architecture selection, model training, and model evaluation. 2. The additional step needed for large language models: fine-tuning the pre-trained model for specific tasks. 3. Details of each step, including types of model architectures, training techniques for billion-parameter models, and evaluation benchmarks. 4. Examples of fine-tuning a large language model for software engineering assistance, such as sentence completion.


Building and Fine-tuning GenAI Models
SOEN 691: Generative Artificial Intelligence for Software Engineering
Diego Elias Costa


Language Models

We will focus on LLMs.
Source: A Survey of Large Language Models
Building and Fine-tuning GenAI Models

This is a roadmap class: explore the reference materials if you want to know the concepts in detail!
Building a Large Language Model
How much does it cost?
• Training the smallest Llama model (7B parameters) took 180,000 GPU hours
• Renting: an Nvidia A100 costs $1–2 per GPU per hour, so training would cost roughly $250,000
Source: Llama 2: Open Foundation and Fine-Tuned Chat Models
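A quick back-of-the-envelope check of that figure (the $1.40/hour midpoint is an assumed rate, not a number from the paper):

gpu_hours = 180_000   # reported GPU hours for Llama 7B
rate = 1.40           # assumed midpoint of the $1-2 A100 rental range
print(f"${gpu_hours * rate:,.0f}")   # -> $252,000, in line with the ~$250,000 estimate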
Building a Large Language Model

Theoretically…
Source: A Survey of Large Language Models
Building a machine learning model
Here are the 4 key steps to build a traditional ML model (see the sketch after this list):

1. Curating the data
2. Choosing the appropriate model/algorithm
3. Training the model
4. Evaluating the model
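A minimal sketch of these four steps in scikit-learn; the dataset and model choice are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Curate the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Choose the model/algorithm
model = LogisticRegression(max_iter=1000)

# 3. Train the model
model.fit(X_train, y_train)

# 4. Evaluate the model
print(accuracy_score(y_test, model.predict(X_test)))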
Want to know more?
Machine Learning
Book freely available
• Covers the basics of classifiers and regressions
• Emphasis on interpretability
  • How do ML models work?
  • Techniques to explain and interpret ML models
Source: Interpretable Machine Learning
Want to know more?
Engineering AI-Based Systems
• Course focused on engineering AI-based systems
  • Requirements
  • Architecture and Design
  • Implementation
  • Testing
  • Operations
Source: Engineering AI-Based Software Systems
Building large language models
Here are the 5 key steps to build a large language model:

1. Curating/processing large amounts of data
2. Choosing the appropriate transformer
3. Training the model at scale
4. Evaluating the base model
5. Fine-tuning the model to specific cases

Source: How to Build an LLM from Scratch | An Overview
Step 1. Data Curation/Processing
Data quality is paramount for the quality of your model
• Garbage in, garbage out

However, data quantity seems to be playing an even bigger role

Source: Llama 2: Open Foundation and Fine-Tuned Chat Models
Step 1. Data Curation/Processing
Where do we get all this data? The Internet:
• Public datasets
  • Common Crawl
  • The Pile
  • Hugging Face datasets
• Private data sources
• Use an LLM to generate your dataset
  • Alpaca
Source: A Survey of Large Language Models
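For instance, a public corpus can be pulled from the Hugging Face Hub in a couple of lines (the dataset name here is just an illustration):

from datasets import load_dataset

# stream a small public text corpus from the Hugging Face Hub
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(ds[0]["text"])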
Step 1. Data Curation/Processing
Data diversity is key to generalizability
• Larger diversity of sources
Source: A Survey of Large Language Models
Step 1. Data Curation/Processing
How to prepare the data?
• Filter data based on quality
  • Use heuristics or models to put a higher weight on better-quality text
• De-duplicate the data (see the sketch below)
  • Reduces the bias toward common text
• Privacy redaction
  • Removal of sensitive and confidential information
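A minimal exact de-duplication sketch; real pipelines typically use fuzzy methods such as MinHash, so treat this as illustrative only:

import hashlib

def deduplicate(documents):
    # drop exact duplicates by hashing normalized text
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello world", "hello world ", "Goodbye"]))  # 2 documents survive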
Step 1. Data Curation/Processing
Large Language Models do not understand "pure text".

Common processing techniques, illustrated on the sentence "Hi, my name is Pierre!":
1. Punctuation padding: "Hi , my name is Pierre !"
2. Tokenization: splitting the padded text into discrete tokens
3. Text vectorization: mapping tokens to integer IDs, e.g., [12 56 15 56 33 49 2]
4. Causal masking: each training position sees only the tokens that precede it
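In practice, steps 2 and 3 are handled by a tokenizer; a small example with the transformers library (the GPT-2 tokenizer is illustrative, and the printed IDs depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hi, my name is Pierre!")["input_ids"]
print(ids)                             # integer IDs; exact values depend on the vocabulary
print(tok.convert_ids_to_tokens(ids))  # the tokens those IDs represent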
Step 2. Model Architecture
Transformers
• Neural networks that use the attention mechanism
• Dependency of words based on position and context
Source: Attention is All You Need
The Attention Mechanism
Learn to pay attention to some key words, depending on the context:

The pink elephant tried to get into a car but it was too ________

Positional encoding: the order of the words matters.

The dog looked at the boy and ________
The boy looked at the dog and ________
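Concretely, this is the scaled dot-product attention of Attention is All You Need; a minimal single-head NumPy sketch without masking:

import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V: each query position attends to every key
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

x = np.random.randn(4, 8)        # 4 tokens, 8-dimensional embeddings
print(attention(x, x, x).shape)  # (4, 8)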
Why don't we have the same diversity?
[Figure: diversity of Machine Learning algorithms vs. LLM algorithms]
Source: Machine learning in landscape ecological analysis: a review of recent approaches
Step 2. Model Architecture
Choose between 3 types of Transformers:
• Encoder-only
• Encoder-decoder
• Decoder-only
Encoder-only Architecture
Good for tasks requiring a nuanced understanding of the entire sentence or code snippet.
• Code review, bug report understanding, and named entity recognition
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Encoder-decoder Architecture
Good for translation or summarization tasks.
• Code summarization, (programming?) language translation

Source: Large Language Models for Software Engineering: A Systematic Literature Review
Decoder-only Architecture
Good for generative tasks
• Basically any SE task that requires generation
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Architectures of LLMs for SE
Decoder-only architectures (e.g., GPT, Llama) are the most popular in SE research
Source: Large Language Models for Software Engineering: A Systematic Literature Review
Step 2. Model Architecture
Other design choices and hyper-parameters to tune:
• Activation functions, layer normalization, position embeddings

Model size
• Rule of thumb: ~20 training tokens per parameter, so a 7B-parameter model calls for roughly 140B tokens

Source: Training Compute-Optimal Large Language Models


Step 3. Training at Scale
• Self-supervision
  • We can train large models using huge amounts of data only because the label (in causal LMs, the next token) is already part of the input
• Efficiency is key when training billion-parameter models
  • 3D parallelism (data parallelism, pipeline parallelism, tensor parallelism)
  • Mixed-precision training: use 16-bit floating-point numbers instead of 32-bit to reduce memory usage and communication overhead (see the sketch below)

Source: A Survey of Large Language Models
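A minimal PyTorch training-loop sketch combining both ideas: the shifted labels implement next-token self-supervision, and autocast plus a gradient scaler implement mixed precision. The model, loader, and optimizer are assumed to exist; this is an illustration, not code from the survey.

import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()      # rescales gradients for float16 stability

for batch in loader:                       # batch: (B, T) tensor of token IDs (assumed)
    inputs, labels = batch[:, :-1], batch[:, 1:]   # labels are the inputs shifted by one
    with torch.autocast("cuda", dtype=torch.float16):  # 16-bit compute where safe
        logits = model(inputs)                          # (B, T-1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()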
Step 4. Evaluation
How to assess the quality of the trained LLM?
• Benchmarks (Open LLM Leaderboard)
  • Multiple-choice tasks: ARC, HellaSwag, MMLU
  • Open-ended tasks: TruthfulQA
Example: HellaSwag
• Can a machine really finish your sentence?
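Multiple-choice benchmarks like HellaSwag are typically scored by comparing the model's log-likelihood of each candidate ending and picking the highest. A minimal sketch, assuming a Hugging Face causal LM and tokenizer (model, tok) plus hypothetical context/endings variables; real harnesses also apply length normalization:

import torch

def ending_score(model, tok, context, ending):
    ids = tok(context + " " + ending, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..T-1
    return logp.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

best = max(endings, key=lambda e: ending_score(model, tok, context, e))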
Building large language models
Here are the 5 key steps to build a large language model:

1. Curating large amounts of data
2. Choosing the appropriate transformer
3. Training the model at scale
4. Evaluating the model
5. Fine-tuning the model to specific cases

Steps 1–4 produce the base LLM (the foundation model); step 5 adapts it.
DEMO
https://www.kaggle.com/code/diegoeliascosta/soen691-building-your-own-gpt
From Base Models to SE-related Solutions
Source: Large Language Models for SE
Fine-tuning in a nutshell
1. Define an SE-related task/problem where an LLM has the potential to assist
2. Find a reasonably sized dataset to fine-tune the model on
3. Choose a pre-trained LLM
   • Open source, or closed source + API
4. Choose the method for fine-tuning (see the sketch after this list)
5. Evaluate the fine-tuned model on the SE task
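A minimal supervised fine-tuning sketch with the Hugging Face Trainer; the base model (gpt2), the dataset (a slice of CodeSearchNet), and all hyper-parameters are illustrative assumptions, not recommendations:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token               # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

ds = load_dataset("code_search_net", "python", split="train[:1%]")  # example SE dataset

def encode(example):
    out = tok(example["whole_func_string"], truncation=True,
              max_length=256, padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: labels are the inputs
    return out

ds = ds.map(encode, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=ds,
)
trainer.train()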
What SE tasks are commonly selected?
Most common SE tasks:
• Code generation (62)
• Program repair (23)
• Code completion (16)
• Code summarization (10)
SE-related Datasets
Challenge: which ones are **really** open source?

Dominance of code-based and text-based datasets
Source: Large Language Models for SE
There are plenty of base LLM models out there…
Challenge: which ones can be used on conventional hardware?
Fine-tuning strategies
• Instruction tuning
• Alignment tuning
  • Reinforcement Learning from Human Feedback (RLHF)
Source: A Survey of Large Language Models


Fine-tuning for SE Tasks
• Fine-tuning is commonly used to optimize LLMs
• However, due to data scarcity and computing restrictions, prompt engineering is still frequently used as a method for adapting LLMs to solve SE tasks
Source: Large Language Models for SE
Evaluating the fine-tuned model
The evaluation metric depends on the SE task:
• Classification tasks
  • F1-score, precision, recall, and accuracy
• Recommendation tasks
  • Mean Reciprocal Rank (MRR), Precision@k, F1-score@k
• Generation tasks
  • BLEU, Pass@k, Accuracy@k
  • @k: the correct answer was given in the first k ranked responses
Source: Large Language Models for SE
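Pass@k is usually computed with the unbiased estimator from the Codex paper (Evaluating Large Language Models Trained on Code): generate n samples per problem, count the c samples that pass the tests, and estimate the probability that at least one of k drawn samples passes.

from math import comb

def pass_at_k(n, c, k):
    # unbiased pass@k: n generated samples, c of which pass the tests
    if n - c < k:
        return 1.0          # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25: one in four single samples passes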
Technology and Libraries
• Transformers
  • Open-source Python library for building models using the Transformer architecture (example below)
• Want to use an LLM on your laptop?
  • llama.cpp, with bindings for most programming languages
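For instance, text generation with the Transformers pipeline API takes a few lines (the gpt2 model name is illustrative; any causal LM from the Hub works):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("def fibonacci(n):", max_new_tokens=40)
print(out[0]["generated_text"])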
Course Project
• Careful about the scope of the project
  • You should be discussing a Software Engineering problem/application
  • Using LLMs for other topics is not part of this course's scope
• Start diving deep into the type of problem you want to address
• Do you need to train or fine-tune an LLM in your project?
  • Not necessarily
  • Survey papers, replication papers, and empirical studies are part of the course's scope
Paper Critiques
• Next week's class will be a research discussion class!
• 2 papers to read:
  • 1 paper to summarize (half a page)
  • 1 paper to critique (1 to 2 pages max)
  • Summary + positive points (3+) + negative points (3+)
• Never read or critiqued a paper before?
  • Check the Complementary Materials in Moodle
