Lesson 1 Intro
By Vuk Rosić
GOOD TO HAVE
Intermediate Python
High school math
Even if you don’t know these, I will explain them.
No need for a high-performance machine: we will not train an AI in this course, we will just learn its
architecture and math and code it from the ground up.
REALISTIC EXPECTATIONS
The course contains everything from high-school math to advanced PhD-level math,
which will feel very dense and may take months to fully understand.
But math for this LLM is the same as for every other LLM, and very similar to any other
transformer — learn once, create any LLM.
You can watch this video, other YouTube videos, and other courses that I make,
gradually grasping these concepts.
KEY CONCEPTS INTRODUCED:
POINT → VECTOR → 3D VECTOR → Large Language Models
LLMs: AIs that you can chat with using text messages.
More precisely: LLMs generate text and are typically used in chat interfaces.
Famous LLMs: GPT (ChatGPT), Claude, Gemini, DeepSeek, Qwen, Grok, Llama.
HOW LLMS WORK
An LLM:
Takes some input text,
Analyzes it
Predicts the next word that makes sense.
Example:
Input: "The sun is shining, and the sky is"
Output: "blue."
LLMs don’t just predict a single word—they:
Have a vocabulary of words, characters, etc.
Assign a probability to each based on the input context.
Example probabilities (the remaining probability is spread across all other tokens in the vocabulary):
"blue." → 60%
"clear." → 25%
"cloudy." → 14%
"apple." → 0.002%
"banana." → 0.001%
METHODS FOR WORD SELECTION
1. Greedy Sampling:
Always pick the most probable word.
Example: "blue" (60%).
2. Random Sampling:
Randomly choose based on the probability distribution.
Could choose "cloudy" even if it's not the top choice.
3. Top-k Sampling:
Choose randomly from the top k most probable words.
Example: Top 3 = "blue," "clear," "cloudy".
4. Temperature Sampling:
Adjust the “sharpness” of the distribution.
Higher temperature = more randomness.
Lower temperature = more deterministic (closer to greedy).
Higher temperature gives rare words like "banana" more chance.
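To make these four strategies concrete, here is a minimal Python sketch over the toy probability table above. The numbers are illustrative, not from a real model, and real LLMs apply temperature to the raw scores (logits) before converting them into probabilities; raising each probability to the power 1/temperature, as below, is the equivalent operation.

import random

# Toy next-token probabilities from the example above (illustrative values;
# the rest of the probability mass belongs to all other tokens in the vocabulary).
probs = {"blue.": 0.60, "clear.": 0.25, "cloudy.": 0.14, "apple.": 0.00002, "banana.": 0.00001}

def greedy(probs):
    # 1. Greedy: always pick the single most probable token.
    return max(probs, key=probs.get)

def random_sample(probs):
    # 2. Random sampling: pick a token with chance proportional to its probability.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

def top_k(probs, k=3):
    # 3. Top-k: keep only the k most probable tokens, then sample among them.
    top = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    return random_sample(top)

def temperature_sample(probs, temperature=1.0):
    # 4. Temperature: >1 flattens the distribution (more randomness),
    #    <1 sharpens it (closer to greedy).
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    return random_sample(scaled)

print(greedy(probs))                    # always "blue."
print(random_sample(probs))             # usually "blue.", sometimes "clear." or "cloudy."
print(top_k(probs, k=3))                # one of "blue.", "clear.", "cloudy."
print(temperature_sample(probs, 2.0))   # rare tokens like "banana." get a bigger chance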
Tokens
Anything can be added as a token that the AI can generate (predict based on the preceding text):
"car" (a word token)
"b" (a letter token)
"123" (a number token)
" " (an emoji token)
" " (a Chinese character token)
"The sun rises in the east." (a sentence token)
" " (a Chinese sentence token)
"a1 b2 c3!@#" (an arbitrary token)
When a token is a word like "car", the model learns:
It's a vehicle
It has wheels
People drive it
It's used for transportation
...
When a token is a letter like "b", the model learns:
It's the second letter in the English alphabet
It can start words like "book" or "bird"
Where and how it's used to construct words
...
When a token is a number like "123", the model learns:
It represents a specific quantity
It follows mathematical rules
It appears in contexts like counts, measurements, or dates
It also looks like the beginning of a counting sequence, so it can be used when, for example,
the user asks the model to count from 1 to 10
...
When a token is an emoji like "🌳", the model learns:
It's an emoji (and what is an emoji)
It represents a tree or forest
It's also used to symbolize nature or the outdoors
...
When a token is a Chinese character like "车", the model learns:
It's a Chinese character
It means "car" or "vehicle"
It can be part of compound words like "汽车" (automobile) or "火车" (train)
It's used in contexts related to transportation and vehicles
It has a specific stroke order and writing pattern
...
When a token is a sentence like "The sun rises in the east.", the model learns:
It describes a natural phenomenon that happens daily
It has a subject (sun), verb (rises), and prepositional phrase (in the east)
It's a statement of fact that's universally understood
It can be used literally or metaphorically in different contexts
...
When a token is a Chinese sentence like "我们正在学习人工智能。", the model learns:
It's in Mandarin Chinese
It means "We are learning artificial intelligence"
The action is currently happening (present continuous), indicated by the marker 正在
For example, if the story is about somebody learning AI and the question is what they are
doing, the LLM can output this token.
...
When a token is arbitrary like "a1b2$3]'.fDc3!@#", the model learns:
It contains a mix of letters, numbers, and symbols
It doesn't match common language patterns
It might be a code, password, or random string
It lacks semantic meaning
...
The question is, which way of tokenizing is best:
"The cat is sleeping",
"The", " cat", " is", " sleeping",
"T", "h", "e", " ", "c", "a", "t", " ", "i", "s", " ", "s", "l", "e", "e", "p", "i", "n", "g",
Each token requires the same amount of compute to process.
Computational Cost
"The quick brown fox jumps over the lazy dog."
Sentence tokenizer: 1 token = 1x compute
Word tokenizer: 9 tokens = 9x compute
Letter tokenizer: 44 tokens = 44x compute
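A quick sketch of that comparison in Python, using naive splitting (a real word-level tokenizer would handle punctuation more carefully). Since each token costs roughly the same amount of compute, counting tokens is a rough proxy for cost.

sentence = "The quick brown fox jumps over the lazy dog."

sentence_tokens = [sentence]       # sentence-level: the whole sentence is 1 token
word_tokens = sentence.split()     # word-level: split on spaces -> 9 tokens
letter_tokens = list(sentence)     # character-level: every character, spaces and '.' included -> 44 tokens

print(len(sentence_tokens), len(word_tokens), len(letter_tokens))  # 1 9 44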
But there are some issues with word and sentence tokens as well.
Massive vocabulary: not only every word, but every version of every word, e.g. ["run",
"runs", "ran", "running", ...] - this requires a lot of compute to calculate the probability
distribution.
If you forget to add any word, e.g., "running", to the vocabulary, the AI will not learn
what it means.
Can't handle misspellings: if text contains incorrect spellings like "runned" or
"avocdo", humans can infer the meaning, but the AI would see it as an OOV (out-of-vocabulary)
token and replace it with a special token that marks it as completely unknown, e.g.
<UNK> (see the small sketch after this list).
Most words are rare and will not appear enough times in the training data for the model to
learn their meaning well.
Cannot construct new words the way letter tokens can.
For sentence-level tokens these issues are even more exaggerated.
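Here is a small sketch of the out-of-vocabulary problem with a made-up word-level vocabulary; the vocabulary and the <UNK> convention are chosen just for illustration.

word_vocab = {"the", "dog", "is", "run", "runs", "ran", "running"}  # toy word-level vocabulary

def word_tokenize(text):
    # Any word that is not in the vocabulary becomes the special <UNK> token,
    # so whatever it meant is invisible to the model.
    return [w if w in word_vocab else "<UNK>" for w in text.lower().split()]

print(word_tokenize("the dog is running"))  # ['the', 'dog', 'is', 'running']
print(word_tokenize("the dog runned"))      # ['the', 'dog', '<UNK>']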
Quick note: there might be breakthroughs in the future that make word- or sentence-level
tokens the best option; research in this field is very fast and accelerating.
Can we get the best of both worlds - a low vocabulary size and the ability to construct new
words (like a letter-based tokenizer), but with a lower computation cost - by creating tokens
out of multiple letters combined?
Introducing: Subword tokens!
Instead of:
[run, running, play, playing, stay, staying]
it's tokenized as:
[run, play, stay, ing]
Can be used to construct words like [running, playing, staying] - the AI will learn that
adding "ing" to a base verb turns it into the present continuous
Requires less computation than processing each letter separately
Vocabulary size drops from tens of millions of word forms to just hundreds of thousands of
subwords
Better at understanding text with spelling mistakes
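Here is a rough sketch of that idea, using a greedy longest-match lookup over the toy vocabulary above plus single letters as a fallback (the fallback is my addition). Real subword tokenizers such as BPE are trained on data and work differently in detail, but the reuse of shared pieces is the same. Note that "running" doubles the "n", so the extra "n" falls back to a letter token here.

# Toy subword vocabulary from the example above, plus single lowercase letters
# as a fallback so that any lowercase word can still be tokenized.
vocab = {"run", "play", "stay", "ing"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word):
    # Greedy longest match: at each position take the longest piece that is
    # in the vocabulary, then continue from where that piece ended.
    tokens = []
    i = 0
    while i < len(word):
        for end in range(len(word), i, -1):
            if word[i:end] in vocab:
                tokens.append(word[i:end])
                i = end
                break
    return tokens

print(tokenize("playing"))   # ['play', 'ing']
print(tokenize("staying"))   # ['stay', 'ing']
print(tokenize("running"))   # ['run', 'n', 'ing'] - the doubled 'n' becomes a letter token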
Qwen2.5 Tokenizer (at the time of making this, the tokenizer for Qwen3 has not been released
yet, but it will probably be exactly the same or only slightly different, which doesn't matter for
our understanding of how a tokenizer works)
https://huggingface.co/Qwen/Qwen2.5-7B/raw/main/tokenizer.json
Example of what you will see:
"sh": 927,
"ual": 928,
"Type": 929,
"son": 930,
"new": 931,
"ern": 932,
"Ġag": 933,
"AR": 934,
"];Ċ": 935,
Tokens on the left, each paired with a number from 0 to 151664 on the right (in this byte-level
encoding, "Ġ" stands for a leading space and "Ċ" for a newline).
The vocabulary contains 151665 tokens (indices 0 through 151664)
Each token is indexed with a number
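If you want to poke at this vocabulary yourself, here is a short sketch using the Hugging Face transformers library (assuming it is installed; the exact pieces and numbers you see come from the downloaded tokenizer files, not from this text):

# pip install transformers  -- the tokenizer files are downloaded on first use
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

text = "The cat is sleeping"
print(tokenizer.tokenize(text))   # the subword pieces ('Ġ' marks a leading space)
print(tokenizer.encode(text))     # the same pieces as their numbers in the vocabulary
print(len(tokenizer))             # how many tokens the vocabulary contains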
To understand what we need the numbers for, we need to go back and understand how exactly
LLMs "understand and store the semantic meaning" of a token.
Introducing: Vector Embeddings
(fancy words, but don't get scared, you will understand it)
A vector embedding is just an array of numbers
[0.43, 0.76, 0.13, 0.05]
Each token in the vocabulary will have a corresponding vector embedding (array of
numbers).
"car" = [0.84, 0.01, 0.31, 0.03]
"grass" = [0.01, 0.97, 0.89, 0.82]
In reality, these vectors are thousands of numbers long, and each number captures some
characteristic of the token, e.g. quickness, aliveness, greenness, fluffiness, playfulness,
vehicleness, parallel-universeness, regular-exerciseness, and who knows what else [we
don't] - each feature is a measure of some characteristic and contributes to the meaning.
The AI learns by itself what each feature should be about, and its numerical value for each
token; we don't know exactly what those numbers represent.
Let's take these 2 tokens as an example.
"dog" = [0.42, 0.97, 0.95]
"cat" = [0.73, 0.04, 0.01]
Feature (number) 1: Fluffiness Amount
How soft, fuzzy, and likely to leave hair on your favorite hoodie?
Dog → 0.42
Dogs come in all fluff levels, from bald weirdos to walking cotton balls.
Cat → 0.73
Living fur clouds. Built for maximum softness.
Feature 2: Willingness to Save Your Life
If you're in a lake and can't swim… will it save you?
Dog → 0.97
Leaps in with heroic urgency. Risks life. Brings floaty.
Cat → 0.04
The one who pushed you into the water.
Feature 3: Gratefulness
How much it appreciates everything you do for it?
Dog → 0.95
Thinks you hung the moon. Worships your every move. Applauds your shoe-tying
skills.
Cat → 0.01
You are but a mere butler, devoid of any purpose beyond servitude.
In reality, features at the same positions across tokens don't necessarily encode the same
characteristic; different tokens can have unique characteristics, and multiple features can
work together to encode a single characteristic.
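Just to underline that an embedding is literally a list of numbers you can index into by position, here is the dog/cat example as plain Python data; both the values and the feature labels are made up, as explained above.

# Made-up 3-feature embeddings from the example above.
embeddings = {
    "dog": [0.42, 0.97, 0.95],
    "cat": [0.73, 0.04, 0.01],
}
# Invented labels; a real model never tells us what each position means.
feature_names = ["fluffiness", "willingness to save your life", "gratefulness"]

for i, name in enumerate(feature_names):
    print(f"{name}: dog = {embeddings['dog'][i]}, cat = {embeddings['cat'][i]}")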
Summary
These vector embeddings aren't manually created by human scientists. The AI learns
them by reading the entire internet (trillions of words), trying to predict the next token as
it reads, and adjusting the vector embeddings for each token, along with other numbers
(parameters, which we will talk about later), so that its predicted token matches the actual
token in the training data.
When it fails to predict the next token during training, it updates the vector embeddings and
other parameters, making the correct token from the training data more likely next time and
incorrect tokens less likely.
How LLMs use vector embeddings
"The car is fast, and the car is red."
"The" → Vector: [0.12, 0.34, 0.56, 0.78]
" car" → Vector: [0.84, 0.01, 0.31, 0.03]
" is" → Vector: [0.23, 0.45, 0.67, 0.89]
" fast" → Vector: [0.91, 0.12, 0.34, 0.56]
" and" → Vector: [0.78, 0.90, 0.12, 0.34]
" red." → Vector: [0.56, 0.78, 0.90, 0.12]
"," (comma) → Vector: [0.55, 0.66, 0.77, 0.88]
Usually tokens have the space as their first character (" car") - it's more efficient than encoding
the space separately.
LLMs replace each token with its vector embedding.
"The car is fast, and the car is red."
[0.12, 0.34, 0.56, 0.78] [0.84, 0.01, 0.31, 0.03] [0.23, 0.45, 0.67, 0.89] [0.91, 0.12, 0.34,
0.56] [0.55, 0.66, 0.77, 0.88] [0.78, 0.90, 0.12, 0.34] [0.12, 0.34, 0.56, 0.78] [0.84, 0.01,
0.31, 0.03] [0.23, 0.45, 0.67, 0.89] [0.56, 0.78, 0.90, 0.12]
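As a sketch, this replacement step is nothing more than a dictionary lookup (using the toy 4-number vectors from above; real models use vectors with thousands of numbers):

# Toy token -> vector table matching the numbers above.
embedding_for = {
    "The":   [0.12, 0.34, 0.56, 0.78],
    " car":  [0.84, 0.01, 0.31, 0.03],
    " is":   [0.23, 0.45, 0.67, 0.89],
    " fast": [0.91, 0.12, 0.34, 0.56],
    ",":     [0.55, 0.66, 0.77, 0.88],
    " and":  [0.78, 0.90, 0.12, 0.34],
    " red.": [0.56, 0.78, 0.90, 0.12],
}
embedding_for[" the"] = embedding_for["The"]   # simplification: lowercase "the" reuses "The"

tokens = ["The", " car", " is", " fast", ",", " and", " the", " car", " is", " red."]
vectors = [embedding_for[t] for t in tokens]   # one vector per token, in order
print(vectors)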
But we don't want to tie tokens directly to specific vectors, because we want to use the same
token vocabulary for different versions of the model (or different models), which will have
different vector embeddings for the same token.
So we will also have an array of vector embeddings (an array of arrays, i.e. a matrix), where
each vector embedding is placed at the position corresponding to the index of its token.
[
[0.12, 0.34, 0.56, 0.78]
[0.84, 0.01, 0.31, 0.03]
[0.23, 0.45, 0.67, 0.89]
[0.91, 0.12, 0.34, 0.56]
[0.78, 0.90, 0.12, 0.34]
[0.56, 0.78, 0.90, 0.12]
[0.55, 0.66, 0.77, 0.88]
]
We will place the vector embedding corresponding to the token with index 0 at position 0, the
one for index 1 at position 1, and so on.
Later we will pluck the embedding out of the embedding matrix at the position matching the
token's index.
Mapping to Indices
"The" : 0
"car" : 1
"is" : 2
"fast," : 3
"and" : 4
"red." : 5
"The car is fast, and the car is red."
012340125
For each index (0 1 2 3 6 4 0 1 2 5) we will pluck out the corresponding vector embedding (the
embeddings are arranged so that their position matches the index of their token):
[0.12, 0.34, 0.56, 0.78] [0.84, 0.01, 0.31, 0.03] [0.23, 0.45, 0.67, 0.89] [0.91, 0.12, 0.34,
0.56] [0.55, 0.66, 0.77, 0.88] [0.78, 0.90, 0.12, 0.34] [0.12, 0.34, 0.56, 0.78] [0.84, 0.01,
0.31, 0.03] [0.23, 0.45, 0.67, 0.89] [0.56, 0.78, 0.90, 0.12]
This allows us to easily swap vector embeddings from different models / versions that use
the same tokenizer.
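Here is the same lookup done the index-based way, as a sketch: the tokenizer maps tokens to indices, the model owns the embedding matrix, and swapping models only means swapping the matrix.

# The tokenizer's side: token -> index.
token_to_index = {"The": 0, " car": 1, " is": 2, " fast": 3, " and": 4, " red.": 5, ",": 6}
token_to_index[" the"] = token_to_index["The"]   # same simplification as before

# The model's side: row i is the embedding of the token with index i.
embedding_matrix = [
    [0.12, 0.34, 0.56, 0.78],  # 0: "The"
    [0.84, 0.01, 0.31, 0.03],  # 1: " car"
    [0.23, 0.45, 0.67, 0.89],  # 2: " is"
    [0.91, 0.12, 0.34, 0.56],  # 3: " fast"
    [0.78, 0.90, 0.12, 0.34],  # 4: " and"
    [0.56, 0.78, 0.90, 0.12],  # 5: " red."
    [0.55, 0.66, 0.77, 0.88],  # 6: ","
]

tokens = ["The", " car", " is", " fast", ",", " and", " the", " car", " is", " red."]
indices = [token_to_index[t] for t in tokens]
print(indices)                                   # [0, 1, 2, 3, 6, 4, 0, 1, 2, 5]

# "Pluck out" the row for each index - this is all an embedding lookup does.
vectors = [embedding_matrix[i] for i in indices]

# A different model version that uses the same tokenizer just brings its own
# embedding_matrix; token_to_index stays exactly the same.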
Next lesson: Coding our own tokenizer in lesson_2_coding_tokenizer.py file