Lecture Compiled
EE2211 Introduction to Machine Learning
Lecture 1
Wang Xinchao
[email protected]
3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
World’s Largest Selfie
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Recent Advances
Sora
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing
neon and animated city signage. She wears a black leather jacket, a long
red dress, and black boots, and carries a black purse. She wears sunglasses
and red lipstick. She walks confidently and casually. The street is damp and
reflective, creating a mirror effect of the colorful lights. Many pedestrians walk
about.
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• What is machine learning?
– Three Definition(s)
• When do we need machine learning?
– Sometimes we need, sometimes we don’t
• Applications of machine learning
• Types of machine learning
– Supervised, Unsupervised, Reinforcement Learning
• Walking through a toy example on classification
• Inductive vs. Deductive Reasoning
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
What is machine learning?
Data → Output: find a function 𝑓(.) such that
𝑓(image of a cat) = ‘cat’
…
𝑓(image of a dog) = ‘dog’
8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Machine Learning (Supervised Learning)
Data + Output → Computer → Learned Function 𝑓(.)
When applied to a new image: 𝑓(new cat image) = Cat!
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
AI, Machine Learning, and
Deep Learning
10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
When do we need machine learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans cannot explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)
Learning is not always useful:
• There is no need to “learn” to calculate payroll!
My Salary = Days_of_work * Daily Salary + Bonus
Based on slide by E. Alpaydin
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning
A classic example of a task that requires machine learning:
Task T, Performance P, Experience E
It is very hard to say what makes a 2
T: Digit Recognition
P: Classification Accuracy
E: Labelled Images
“four”
“three”
12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning
T: Email Categorization
P: Classification Accuracy
E: Email Data, Some Labelled
13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning
T: Playing Go Game
P: Chances of Winning
E: Records of Past Games
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Application of Machine Learning
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Web Search Engine Product Recommendation Language Translation
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning
18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Supervised Learning: Regression
Data 𝑥 (year) → Output 𝑦 (continuous value)
[Figure: Arctic Sea Ice Extent in January (in million sq km) plotted against year, 1970–2030]
𝑓(𝑥): line that best aligns with the samples
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Supervised Learning: Classification
Data 𝑥 → Output 𝑦 (Categorical)
[Figure: feature space with axes lightness (horizontal) and width (vertical); training samples 𝐱 labelled 𝑦 = Sea Bass or 𝑦 = Salmon, and a new sample marked “?”]
𝑓(𝑥): line that separates the two classes
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Unsupervised Learning
[Figure: the same lightness–width feature space, now with unlabelled samples 𝐱 grouped into two clusters]
No Label/Supervision is given!
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Machine Learning
23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Reinforcement Learning
Breakout Game
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Reinforcement Learning
• Given sequence of states 𝑺 and actions 𝑨 with (delayed)
rewards 𝑹
• Output a policy 𝜋(𝑎, 𝑠) that tells us what action 𝑎 to take in state 𝑠
𝑺: Ball Location,
Paddle Location, Bricks
𝑨: left, right
𝑹: positive reward for knocking out a brick or clearing all bricks;
negative reward for missing the ball;
zero reward for cases in between
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Quiz Time!
For each example task shown, decide whether it is supervised, unsupervised, or reinforcement learning.
Slide credit: Geoffrey Hinton
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification
[Figure: tokens labelled Yes or No, with two unlabelled tokens marked “?”]
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification
[Figure: tokens described by (Color, Size, Shape) codes — B,L,Ring; R,L,Triangle; B,L,Rectangle; Y,S,Arrow; G,S,Circle; G,S,Diamond; R,L,Circle; Y,L,Triangle; O,L,Diamond — grouped into Yes and No classes, with query tokens marked “?”]
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification
Feature Extraction
Color Size Shape Label
Blue Large Ring Yes
Red Large Triangle Yes
Orange Large Diamond Yes
Green Small Circle Yes
Yellow Small Arrow No
Blue Large Rectangle No
Red Large Circle No
Green Small Diamond No
Yellow Large Triangle ?
29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Walking Through A Toy Example:
Token Classification
Similarity of each training token to the query token (Yellow, Large, Triangle):
Color Size Shape Total
0 1 0 1
0 1 1 2
0 1 0 1
0 0 0 0
1 0 0 1
0 1 0 1
0 1 0 1
0 0 0 0

Nearest Neighbor Classifier:
1) Find the “nearest neighbor” of a sample in the feature space
2) Assign the label of the nearest neighbor to the sample
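The nearest-neighbour rule above can be written in a few lines of code. The sketch below is a minimal illustration on the toy token data from the table; the attribute-match similarity (1 per matching attribute) and the variable names are my own assumptions, not part of the original slides.

# Toy nearest-neighbour classifier on the (Color, Size, Shape) tokens.
train = [
    (("Blue", "Large", "Ring"), "Yes"),
    (("Red", "Large", "Triangle"), "Yes"),
    (("Orange", "Large", "Diamond"), "Yes"),
    (("Green", "Small", "Circle"), "Yes"),
    (("Yellow", "Small", "Arrow"), "No"),
    (("Blue", "Large", "Rectangle"), "No"),
    (("Red", "Large", "Circle"), "No"),
    (("Green", "Small", "Diamond"), "No"),
]

def similarity(a, b):
    # Count how many attributes match exactly between two tokens.
    return sum(1 for u, v in zip(a, b) if u == v)

def nearest_neighbor_label(query):
    # 1) find the most similar training token, 2) return its label
    best_features, best_label = max(train, key=lambda item: similarity(item[0], query))
    return best_label

print(nearest_neighbor_label(("Yellow", "Large", "Triangle")))  # -> "Yes" (closest: Red, Large, Triangle)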
32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Inductive vs. Deductive Reasoning
• Main Task of Machine Learning: to make inference
Two Types of Inference
Inductive Deductive
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Inductive Reasoning
Note: humans use inductive reasoning all the time, though usually informally rather than through probability/statistics.
35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)
Which of the following statements is true?
A. Nearest Neighbor Classifier is an example of
unsupervised learning
B. Nearest Neighbor Classifier is an example of deductive
learning
C. Nearest Neighbor Classifier is an example of feature
selection
D. None of the above is correct.
36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Thank You.
EE2211 Introduction to
Machine Learning
Lecture 2
Wang Xinchao
[email protected]
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary of Lec 1
3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Types of Data
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ways of Viewing Data
• Based on Levels/Scales of Measurement
– Nominal Data
– Ordinal Data
– Interval Data
– Ratio Data
• Based on Numerical/Categorical
– Numerical, also known as Quantitative
– Categorical, also known as Qualitative
• Other aspects
– Available or Missing Data
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Levels/Scales of Measurement
NOIR, from lowest to highest level of measurement:
• Nominal: Named
• Ordinal: Named + Ordered
• Interval: Named + Ordered + Equal Interval
• Ratio: Named + Ordered + Equal Interval + Has “True” Zero
7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
A Quick Recap: Mean, Median, Mode
• If we are given a sequence of numbers:
1, 3, 4, 6, 6, 7, 8
– Mean: the average, (1+3+4+6+6+7+8)/7 = 5
– Median: the middle value of the sorted sequence (for an even count of values, the two middle values are averaged, e.g. (4+6)/2 = 5)
– Mode: the most frequent value, here 6
[Figure: histogram of the values 1 to 8]
8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Nominal Data
• Lowest Level of Measurement
• Discrete Categories
• NO natural order
• Estimating a mean, median, or standard deviation, would be
meaningless.
• Possible Measure: mode, frequency distribution
• Example:
Gender Occupation
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ordinal Data
• Ordered Categories
• Relative Ranking
• Unknown “distance” between categories: orders matter but not the
difference between values
• Possible Measure: mode, frequency distribution + median
• Example:
– Evaluate the difficulty level of an exam
• 1: Very Easy, 2: Easy, 3: About Right, 4: Difficult, 5: Very Difficult
10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Interval Data
• Ordered Categories
• Well-defined “unit” measurement:
– Distances between points on the scale are measurable and well-defined
– Can measure differences!
• Equal Interval (between two consecutive unit points)
• Zero is arbitrary (not absolute), in many cases human-defined
– If the variable equals zero, it does not mean there is none of that variable
• Ratio is meaningless
• Possible Measure: mode, frequency distribution + median + mean,
standard deviation, addition/subtraction
• Example:
– Temperature measured in Celsius
• For instance: 10 degrees C, 28 degrees C
– Year of someone’s birth
• For instance: 1990, 2005, 2010, 2022
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Ratio Data
• Most precise and highest level of measurement
• Ordered
• Equal Intervals
• Natural Zeros:
– If the variable equals zero, it means there is none of that variable
– Not arbitrary
• Possible Measure: mode, frequency distribution + median + mean, standard
deviation, addition/subtraction + multiplication and division (ratio)
• Example:
– Weights
• 10 KG, 20 KG, 30 KG
– Time
• 10 Seconds, 1 Hour, 1 Day
12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR (Nominal / Ordinal / Interval / Ratio):
Frequency distribution: Yes / Yes / Yes / Yes
Mean, standard deviation: No / No / Yes / Yes
Ratios: No / No / No / Yes
13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
NOIR decision flow:
• Ordered? No → Nominal
• Ordered? Yes → Equally split into intervals? No → Ordinal
• Equally split? Yes → Zero means none? No → Interval
• Zero means none? Yes → Ratio
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
• Which level of measurement?
Nominal, Ordinal, Interval, Ratio
1. Favorite Restaurant
• McDonald’s, Burger King, Subway, KFC, …
(Unordered categories → Nominal)
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Numerical or Categorical
A variable is either:
• Categorical (qualitative)
– Nominal (unordered categories)
– Ordinal (ordered categories)
• Numerical (quantitative)
– Discrete (whole numerical values), e.g. outcome of tossing a die
– Continuous (can take any value within a range), e.g. temperature in a day
Alternatively, Numerical (quantitative) data can be split into:
• Interval (may compute difference but no absolute zero)
• Ratio (may compute difference, real zero exists)
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Missing Data
• Missing data: data that is missing and you do not know
the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
• Data wrangling
– The process of transforming and mapping data from one "raw" data
form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes
such as analytics.
– In short, transforms data to gain insight
– It is a general process!
Credit:https://en.wikipedia.org/wiki/Data_wrangling
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Wrangling
Example pipeline: Determine your goal → Make the most of the dataset (e.g., social-network data stored as graphs) → Remove invalid data → Ensure data correctness → Use data (feature extraction / training / test)
• Normalization
– Linear scaling: scale each variable to [0, 1]
– Z-score standardization: standardize each independent dimension of the data to zero mean and unit variance
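A minimal sketch of the two normalization schemes above, assuming the data is a NumPy array with one sample per row; the toy array and the small epsilon guard are my own additions.

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])  # toy data, one sample per row

# Linear scaling: map each column (variable) to [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - x_min) / (x_max - x_min + 1e-12)

# Z-score standardization: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

print(X_scaled)
print(X_std)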
23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example: https://developers.google.com/machine-learning/data-prep/transform/normalization
– Handling missing features
[Example table with missing entries marked “NA”]
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Cleaning: Handling missing features: Imputation
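The imputation slide itself is graphical; as a hedged illustration, one common approach is to replace each missing value with the mean (or median) of the observed values in that column. A minimal pandas sketch, with an invented toy table:

import pandas as pd

df = pd.DataFrame({
    "height": [1.70, None, 1.65, 1.80],   # missing entry marked as None (NA)
    "weight": [65.0, 72.0, None, 80.0],
})

# Mean imputation: fill each missing entry with the mean of its column
df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)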
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity
• Data integrity is the maintenance and the assurance of data accuracy
and consistency;
– A critical aspect to the design, implementation, and usage of any system that stores,
processes, or retrieves data.
– Very broad concept!
• Example:
• In a dataset, numeric columns/cells should not accept alphabetic data.
• A binary entry should only allow binary inputs
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Integrity
29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Data Visualization
Graphical
Representation
of data!
30
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example: Showing Distribution
[Figures: distribution plot, bar chart, and probability mass function]
• The first quartile (Q1) is defined as the middle number between the smallest number (i.e.,
Minimum) and the median of the data set.
• The third quartile (Q3) is the middle number between the median and the highest value (i.e.,
Maximum) of the data set.
32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Why Visualization is Necessary
Datasets can share very similar summary statistics yet look quite different; hence, we need visualization to show their difference!
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• Types of data
– NOIR
• Data wrangling and cleaning
34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)
35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Thank You.
EE2211 Introduction to
Machine Learning
Lecture 3
Wang Xinchao
[email protected]
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary of Lec 2
• Types of data
– NOIR
• Data wrangling and cleaning
3
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Yueming’s part will follow up
• Causality and Simpson’s paradox
– Understanding at intuitive level is sufficient
• Random Variable, Bayes’ Rule
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
(Very Gentle) Introduction to Linear Algebra
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• A vector is an ordered list of scalar values
– Denoted by a bold character, e.g. x or a
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• We denote an entry or attribute of a vector as an italic value with an index, e.g. 𝑎^(j) or 𝑥^(j).
– The index j denotes a specific dimension of the vector, the position of an attribute in the list
• Note:
– 𝑥^(j) is not to be confused with the power operation, e.g., 𝑥² (squared)
– The square of an indexed attribute of a vector is denoted as (𝑥^(j))².
7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• Vectors can be visualized as, in a multi-dimensional space,
– arrows that point to some directions, or
– points
8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• A matrix is a rectangular array of numbers arranged in rows and
columns
– Denoted with bold capital letters, such as X or W
– An example of a matrix with two rows
and three columns:
• Note:
– For elements in matrix X, we shall use the indexing 𝑥i,j, where the first and second indices indicate the row and the column position.
– Usually, for input data, rows represent samples and columns represent
features
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
10
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems of Linear Equations
Linear dependence and independence
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems of Linear Equations
[Figure: left — two vectors 𝒂 and 𝒃 in the (𝑥1, 𝑥2) plane that are linearly dependent, 𝛽1𝒂 + 𝛽2𝒃 = 0; right — three vectors 𝒂, 𝒃, 𝒄 in (𝑥1, 𝑥2, 𝑥3) space that are linearly independent, 𝛽1𝒂 + 𝛽2𝒃 ≠ 𝛽3𝒄]
Systems of Linear Equations
13
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Exercises
• The principled way for computing rank is to do Echelon Form
– https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
• For small-size matrices, however, the rank is in many cases easy to
estimate
1 1
100 100
1 −2 3
0 −3 3
1 1 0
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• Causality, or causation is:
– The influence by which one event or process (i.e., cause)
contributes to another (i.e. effect),
– The cause is partly responsible for the effect, and the effect is
partly dependent on the cause
16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• (Probable) causal relations or non-causal?
– New web design implemented ? Web page traffic increased
– Your height and weight ? Gets A in EE2211
– Uploaded new app store images ? Downloads increased by 2X
– One works hard and attends lectures/tutorials ? Gets A in EE2211
– Your favorite color ? Your GPA in NUS
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• One popular way to causal data analysis is Randomized
Controlled Trial (RCT)
– A study design that randomly assigns participants into an experimental
group or a control group.
– As the study is conducted, the only expected difference between two
groups is the outcome variable being studied.
• Example:
– To decide whether smoking and lung cancer has a causal relation, we put
participants into experimental group (people who smoke) and control group
(people who don’t smoke), and check whether they develop lung cancer
eventually.
18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation (not causality) is a statistical relationship
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation (vs Causality)
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.
20
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation (vs Causality)
• In statistics, correlation is any statistical relationship, whether causal or not, between two random variables.
• Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.
• The linear correlation coefficient, r, is also known as the Pearson Coefficient.
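The Pearson coefficient can be computed directly from its definition, r = cov(x, y) / (σx σy). A minimal NumPy sketch on invented toy data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly linear in x

# Pearson correlation: covariance divided by the product of standard deviations
r = np.cov(x, y, bias=True)[0, 1] / (x.std() * y.std())
print(r)                        # close to 1 for a strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in equivalent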
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation does not imply causation!
• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are clearly not causally related appear at http://tylervigen.com/ (see figure below).
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Simpson’s paradox
• Simpson's paradox is a phenomenon in probability and
statistics, in which a trend appears in several different
groups of data but disappears or reverses when these
groups are combined.
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Probability
• We describe a random experiment by describing its
procedure and observations of its outcomes.
• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment.
– This also means an outcome is not decomposable.
– All unique outcomes form a sample space.
• A subset of the sample space 𝑆, denoted as 𝐴 ⊂ 𝑆, is an event in a random experiment that is meaningful to an application.
– Example of an event: faces with numbers no greater than 3
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Axioms of Probability
https://en.wikipedia.org/wiki/Union_(set_theory)
https://en.wikipedia.org/wiki/Intersection_(set_theory)
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random event.
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations
• Some books use P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.
29
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete random variable
• A discrete random variable (DRV) takes on only a countable number of distinct values such as red, yellow, blue or 1, 2, 3, ….
• The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values.
– This list of probabilities is called a probability mass function (pmf).
– Like a histogram, except that here the probabilities sum to 1.
31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random variable
• A continuous random variable (CRV) takes an infinite number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of a list of probabilities, the probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf).
– The pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1.
34
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Mean and Standard Deviation of a Gaussian Distribution
[Figure: contours of a 2-D Gaussian in the (𝑥1, 𝑥2) plane, centred at the mean 𝜇, with ellipses enclosing 90% and 95% of the probability mass]
Example 1
• Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)? Assuming a coin has two sides, H=head
and T=Tail
– Pr(x=H, y=H) = Pr(x=H)Pr(y=H) = (1/2)(1/2) = 1/4
36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 2
• Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first drawing B and then R? Assuming
we are drawing the balls without replacement.
• Mathematically:
– Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1×(1/2) = 1/2
Conditional Probability
37
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 3
• Dependent random variables
• Given 3 balls with different colors (R,G,B), and we draw 2
balls. What is the probability of first having B and then G,
if we draw without replacement?
• The space of outcomes of taking two balls sequentially without replacement:
R–G, R–B, G–R, G–B, B–R, B–G → so Pr(y=G, x=B) = 1/6
• Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B)
= (1/2) × (1/3)
= 1/6
38
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Two Basic Rules
• Sum Rule
Pr(𝑋 = 𝑥) = ∑_j Pr(𝑋 = 𝑥, 𝑌 = 𝑦j)
• Product Rule
Pr(𝑋 = 𝑥, 𝑌 = 𝑦) = Pr(𝑌 = 𝑦 | 𝑋 = 𝑥) Pr(𝑋 = 𝑥)
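A quick numerical check of the two rules, using a small invented joint distribution over X ∈ {0, 1} and Y ∈ {0, 1} stored as a 2×2 array (the numbers are arbitrary but sum to 1):

import numpy as np

# joint[x, y] = Pr(X = x, Y = y); rows index X, columns index Y
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])

# Sum rule: the marginal of X is obtained by summing over Y
p_x = joint.sum(axis=1)             # [0.4, 0.6]

# Product rule: Pr(X = x, Y = y) = Pr(Y = y | X = x) * Pr(X = x)
p_y_given_x = joint / p_x[:, None]  # conditional table
reconstructed = p_y_given_x * p_x[:, None]
print(np.allclose(reconstructed, joint))  # True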
39
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Bayes’ Rule
• The conditional probability Pr(𝑌 = 𝑦 | 𝑋 = 𝑥) is the probability of the random variable 𝑌 taking a specific value 𝑦, given that another random variable 𝑋 has a specific value 𝑥.
• Bayes’ rule relates the two conditionals:
Pr(𝑌 = 𝑦 | 𝑋 = 𝑥) = Pr(𝑋 = 𝑥 | 𝑌 = 𝑦) Pr(𝑌 = 𝑦) / Pr(𝑋 = 𝑥)
40
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example
• Drawing a sample of fruit from a box
– First pick a box, and then draw a sample of fruit from it
– B: random variable for the box picked, B = {blue(b), red(r)}
– F: random variable for the identity of the fruit, F = {apple(a), orange(o)}
• P(B=r) = 0.4 and P(B=b) = 0.6
– Events are mutually exclusive and include all possible outcomes
– Their probabilities must sum to 1
• Pr(B=r) = 0.4 (prior)
• Pr(F=o | B=r) = 0.75 (likelihood)
• Pr(F=o) = 0.45 (evidence)
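With the prior, likelihood and evidence listed above, Bayes’ rule gives the posterior probability that the red box was picked given that an orange was drawn; a one-line check in Python (the computation follows from the slide’s numbers, the variable names are mine):

prior = 0.4          # Pr(B = r)
likelihood = 0.75    # Pr(F = o | B = r)
evidence = 0.45      # Pr(F = o)

posterior = likelihood * prior / evidence   # Pr(B = r | F = o)
print(posterior)     # 0.666..., i.e. 2/3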
41
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
42
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Practice Question
(Type of Question to Expect in Exams)
X 1 2 3 4 5
Pr[X] 0.1 0.05 0.05 0.6 k
43
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Thank You.
EE2211 Introduction to Machine
Learning
Lecture 4
Semester 2
2024/2025
Yueming Jin
[email protected]
2
© Copyright EE, NUS. All Rights Reserved.
Welcome to EE2211
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• 3 Assignments
– Assignment 1: released on Week 4 Friday, due on Week 6 Friday
– Assignment 2: released on Week 6 Friday, due on Week 9 Wednesday
– Assignment 3: released on Week 9 Friday, due on Week 13 Friday
• Office hour via zoom: Monday 9:30-10:30am (Week 5-10)
3
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
4
© Copyright EE, NUS. All Rights Reserved.
Fundamental ML Algorithms:
Linear Regression
References for Lectures 4-6:
Main
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019.
(read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017
Supplementary
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for people
who want to analyze data”, Lean Publishing, 2015.
• [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied
Linear Algebra”, Cambridge University Press, 2018 (available online)
http://vmls-book.stanford.edu/
• [Ref 5] Professor Vincent Tan’s notes (chapters 4-6): (useful)
https://vyftan.github.io/papers/ee2211book.pdf
5
© Copyright EE, NUS. All Rights Reserved.
Recap on Notations, Vectors, Matrices
Scalar Numerical value 15, -3.5
Variable Take scalar values x or a
Capital Pi ∏𝑚
𝑖=1 𝑥𝑖 = 𝑥1 · 𝑥2 ·…· 𝑥𝑚−1 · 𝑥𝑚
6
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
𝐱 + 𝐲 = [𝑥1; 𝑥2] + [𝑦1; 𝑦2] = [𝑥1 + 𝑦1; 𝑥2 + 𝑦2]
𝐱 − 𝐲 = [𝑥1; 𝑥2] − [𝑦1; 𝑦2] = [𝑥1 − 𝑦1; 𝑥2 − 𝑦2]
7
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
𝑎𝐱 = 𝑎[𝑥1; 𝑥2] = [𝑎𝑥1; 𝑎𝑥2]
(1/𝑎)𝐱 = (1/𝑎)[𝑥1; 𝑥2] = [𝑥1/𝑎; 𝑥2/𝑎]
8
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix or Vector Transpose:
𝐱 = [𝑥1; 𝑥2],  𝐱ᵀ = [𝑥1 𝑥2]
𝐗 = [𝑥1,1 𝑥1,2 𝑥1,3; 𝑥2,1 𝑥2,2 𝑥2,3; 𝑥3,1 𝑥3,2 𝑥3,3],  𝐗ᵀ = [𝑥1,1 𝑥2,1 𝑥3,1; 𝑥1,2 𝑥2,2 𝑥3,2; 𝑥1,3 𝑥2,3 𝑥3,3]
Python demo 1
9
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Dot Product or Inner Product of Vectors:
𝐱 · 𝐲 = 𝐱ᵀ𝐲 = [𝑥1 𝑥2] [𝑦1; 𝑦2] = 𝑥1𝑦1 + 𝑥2𝑦2
Geometric definition: 𝐱 · 𝐲 = ‖𝐱‖ ‖𝐲‖ cos𝜃
[Figure: vectors 𝐱 and 𝐲 separated by angle 𝜃; ‖𝐱‖cos𝜃 is the projection of 𝐱 onto 𝐲]
Matrix-Vector Product
11
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Vector-Matrix Product
12
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix-Matrix Product
Matrix inverse
Definition:
A d-by-d square matrix A is invertible (also nonsingular)
if there exists a d-by-d square matrix B such that
𝐀𝐁 = 𝐁𝐀 = 𝐈 (identity matrix)
𝐈 = the d-by-d identity matrix (1s on the diagonal, 0s elsewhere)
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
14
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix inverse computation
𝐀⁻¹ = (1/det 𝐀) · adj(𝐀)
• det 𝐀 is the determinant of 𝐀
• adj(𝐀) is the adjugate or adjoint of 𝐀
Determinant computation
Example: 2×2 matrix
𝐀 = [𝑎 𝑏; 𝑐 𝑑]
det 𝐀 = |𝐀| = 𝑎𝑑 − 𝑏𝑐
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
15
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
• adj(𝐀) is the adjugate or adjoint of 𝐀
• adj(𝐀) is the transpose of the cofactor matrix 𝐂 of 𝐀 → adj(A)= CT
• Minor of an element in a matrix 𝐀 is defined as the determinant
obtained by deleting the row and column in which that element lies
A = [𝑎11 𝑎12 𝑎13; 𝑎21 𝑎22 𝑎23; 𝑎31 𝑎32 𝑎33],  the minor of 𝑎12 is 𝑀12 = det[𝑎21 𝑎23; 𝑎31 𝑎33]
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
16
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Minor of 𝑎12: 𝑀12 = det[𝑎21 𝑎23; 𝑎31 𝑎33],  adj(A) = Cᵀ
Cofactor: Cij = (−1)^(i+j) 𝑀ij,  det(A) = ∑_{j=1}^{k} (−1)^(i+j) 𝑎ij 𝑀ij
• E.g. 𝐀 = [𝑎 𝑏; 𝑐 𝑑]
𝐂 = [𝑑 −𝑐; −𝑏 𝑎]
• adj(𝐀) = 𝐂ᵀ = [𝑑 −𝑏; −𝑐 𝑎],  det 𝐀 = |𝐀| = 𝑎𝑑 − 𝑏𝑐
𝐀⁻¹ = (1/det 𝐀) adj(𝐀) = (1/(𝑎𝑑 − 𝑏𝑐)) [𝑑 −𝑏; −𝑐 𝑎]
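The adjugate formula above can be checked numerically against NumPy’s inverse; a small sketch with an invented 2×2 matrix:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Adjugate formula for a 2x2 matrix: A^{-1} = adj(A) / det(A)
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
adj = np.array([[ A[1, 1], -A[0, 1]],
                [-A[1, 0],  A[0, 0]]])
A_inv = adj / det

print(np.allclose(A_inv, np.linalg.inv(A)))  # True
print(np.allclose(A @ A_inv, np.eye(2)))     # A A^{-1} = I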
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
17
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Determinant computation: det(A) = ∑_{j=1}^{k} (−1)^(i+j) 𝑎ij 𝑀ij (cofactor expansion along row i)
18
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
The minor of 𝑎11 = det[𝑎22 𝑎23; 𝑎32 𝑎33]
Ref: https://en.wikipedia.org/wiki/Determinant
19
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
The minor of 𝑎12 = det[𝑎21 𝑎23; 𝑎31 𝑎33]
Ref: https://en.wikipedia.org/wiki/Determinant
20
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
The minor of 𝑎22 = det[𝑎11 𝑎13; 𝑎31 𝑎33]
adj(A) = Cᵀ,  det(A) = ∑_{j=1}^{k} 𝑎ij Cij = ∑_{j=1}^{k} (−1)^(i+j) 𝑎ij 𝑀ij
𝐀⁻¹ = (1/det 𝐀) adj(𝐀)
Ref: https://en.wikipedia.org/wiki/Determinant
21
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Example
Find the cofactor matrix of 𝐀 given that 𝐀 = [1 2 3; 0 4 5; 1 0 6].
Solution:
𝑎11 ⇒ |4 5; 0 6| = 24,  𝑎12 ⇒ −|0 5; 1 6| = 5,  𝑎13 ⇒ |0 4; 1 0| = −4,
𝑎21 ⇒ −|2 3; 0 6| = −12,  𝑎22 ⇒ |1 3; 1 6| = 3,  𝑎23 ⇒ −|1 2; 1 0| = 2,
𝑎31 ⇒ |2 3; 4 5| = −2,  𝑎32 ⇒ −|1 3; 0 5| = −5,  𝑎33 ⇒ |1 2; 0 4| = 4.
The cofactor matrix C is thus [24 5 −4; −12 3 2; −2 −5 4].
Ref: https://www.mathwords.com/c/cofactor_matrix.htm
22
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
23
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
24
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚
Note:
• The data matrix 𝐗 ∈ 𝓡𝑚×𝑑 and the target vector 𝐲 ∈ 𝓡𝑚 are given
• The unknown vector of parameters 𝐰 ∈ 𝓡𝑑 is to be learnt
25
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
A set of linear equations can have no solution, one
solution, or multiple solutions:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚
26
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
1. Square or even-determined system: 𝒎 = 𝒅
- Equal number of equations and unknowns, i.e., 𝐗 ∈ 𝓡𝑑×𝑑
- One unique solution if 𝐗 is invertible or all rows/columns of 𝐗 are
linearly independent
- If all rows or columns of 𝐗 are linearly independent, then 𝐗 is
invertible.
Solution:
If 𝐗 is invertible (or 𝐗 −1 𝐗 = 𝐈 ), then pre-multiply both sides by 𝐗 −1
𝐗 −1 𝐗 𝐰 = 𝐗 −1 𝐲
⇒ ෝ = 𝐗 −1 𝐲
𝐰
(Note: we use a hat on top of 𝐰 to indicate that it is a specific point in the space of 𝐰)
27
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 1: two equations, two unknowns
𝑤1 + 𝑤2 = 4  (1)
𝑤1 − 2𝑤2 = 1  (2)
In matrix form 𝐗𝐰 = 𝐲, with 𝐗 = [1 1; 1 −2] and 𝐲 = [4; 1].
ŵ = 𝐗⁻¹𝐲 = [1 1; 1 −2]⁻¹ [4; 1] = (−1/3) [−2 −1; −1 1] [4; 1] = [3; 1]
(Python demo 3)
Recall: 𝐀⁻¹ = (1/det 𝐀) adj(𝐀), with adj(𝐀) = 𝐂ᵀ = [𝑑 −𝑏; −𝑐 𝑎] and det 𝐀 = 𝑎𝑑 − 𝑏𝑐.
28
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
2. Over-determined system: 𝒎 > 𝒅
– More equations than unknowns
– 𝐗 is non-square (tall) and hence not invertible
– Has no exact solution in general *
– An approximated solution is available using the left inverse
If the left-inverse of 𝐗 exists such that 𝐗†𝐗 = 𝐈, then pre-multiplying both sides by 𝐗† gives
𝐗†𝐗𝐰 = 𝐗†𝐲  ⇒  ŵ = 𝐗†𝐲
Definition:
A matrix 𝐁 (d×m) that satisfies 𝐁𝐀 = 𝐈 for 𝐀 (m×d) is called a left-inverse of 𝐀.
The left-inverse of 𝐗: 𝐗† = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ, given that 𝐗ᵀ𝐗 is invertible.
Note: * exception: when rank(𝐗) = rank([𝐗,𝐲]), there is a solution.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)
29
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 2: three equations, two unknowns
𝑤1 + 𝑤2 = 1  (1)
𝑤1 − 𝑤2 = 0  (2)
𝑤1 = 2     (3)
In matrix form 𝐗𝐰 = 𝐲, with 𝐗 = [1 1; 1 −1; 1 0] and 𝐲 = [1; 0; 2].
There is no exact solution. Since 𝐗ᵀ𝐗 is invertible, the approximated (least-squares) solution is
ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲 = [3 0; 0 2]⁻¹ [1 1 1; 1 −1 0] [1; 0; 2] = [1; 0.5]
(Python demo 4)
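The same over-determined example can be solved with NumPy, either by forming the left-inverse explicitly or by calling the built-in least-squares routine; a minimal sketch (the arrays are copied from Example 2):

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 0.0, 2.0])

# Left-inverse: w = (X^T X)^{-1} X^T y
w_left = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_left)                                   # [1.  0.5]

# Equivalent, and numerically safer, built-in least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)                                  # [1.  0.5]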
30
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
3. Under-determined system: 𝒎 < 𝒅
– More unknowns than equations
– Infinite number of solutions in general *
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)
31
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
3. Under-determined system: 𝒎 < 𝒅
Derivation:
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
𝐗†
right-inverse
32
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 3: two equations, three unknowns
𝑤1 + 2𝑤2 + 3𝑤3 = 2  (1)
𝑤1 − 2𝑤2 + 3𝑤3 = 1  (2)
In matrix form 𝐗𝐰 = 𝐲, with 𝐗 = [1 2 3; 1 −2 3] and 𝐲 = [2; 1].
There are infinitely many solutions along the intersection line. Here 𝐗𝐗ᵀ is invertible, so the constrained solution is
ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲 = 𝐗ᵀ [14 6; 6 14]⁻¹ [2; 1] = [0.15; 0.25; 0.45]
33
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, with 𝐗 = [1 2 3; 3 6 9] and 𝐲 = [2; 1]
(here the second row of 𝐗 is three times the first, but 1 ≠ 3 × 2, so the system is inconsistent and 𝐗𝐗ᵀ is not invertible)
35
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
36
© Copyright EE, NUS. All Rights Reserved.
Notations: Set
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character e.g., 𝓢, 𝓡, 𝓝 etc
– When an element 𝑥 belongs to a set 𝑺, we write 𝑥 ∈ 𝓢
• A set of numbers can be finite - include a fixed amount of values
– Denoted using curly braces, e.g. {1, 3, 18, 23, 235} or {𝑥1, 𝑥2, 𝑥3, 𝑥4, . . . , 𝑥𝑑}
• A set can be infinite and include all values in some interval
– If a set of real numbers includes all values between a and b, including a and
b, it is denoted using square brackets as [a, b]
– If the set does not include the values a and b, it is denoted using
parentheses as (a, b)
• Examples:
– The special set denoted by 𝓡 includes all real numbers from minus infinity
to plus infinity
– The set [0, 1] includes values like 0, 0.0001, 0.25, 0.9995, and 1.0
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).
37
© Copyright EE, NUS. All Rights Reserved.
Notations: Set operations
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).
38
© Copyright EE, NUS. All Rights Reserved.
Functions
• A function is a relation that associates each element 𝑥 of a set 𝓧,
the domain of the function, to a single element 𝑦 of another set 𝓨,
the codomain of the function
• If the function is called f, this relation is denoted 𝑦 = 𝑓(𝑥)
– The element 𝑥 is the argument or input of the function
– 𝑦 is the value of the function or the output
• The symbol used for representing the input is the variable of the
function
– 𝑓(𝑥) 𝑓 is a function of the variable 𝑥; 𝑓(𝑥, 𝑤) 𝑓 is a function of the variable 𝑥 and w
[Figure: a function mapping the domain 𝓧 = {1, 2, 3, 4} into the codomain 𝓨 = {1, 2, 3, 4, 5, 6}; the range (or image) is {3, 4, 5, 6}]
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6 of chp2). 39
© Copyright EE, NUS. All Rights Reserved.
Functions
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p7 of chp2).
40
© Copyright EE, NUS. All Rights Reserved.
Functions
• The notation 𝑓: 𝓡𝑑 → 𝓡 means that 𝑓 is a function that
maps real d-vectors to real numbers
– i.e., 𝑓 is a scalar-valued function of d-vectors
• If 𝐱 is a d-vector argument, then 𝑓 𝐱 denotes the value
of the function 𝑓 at 𝐱
– i.e., 𝑓 𝐱 = 𝑓 𝑥1 , 𝑥2 , … , 𝑥𝑑 , 𝐱 ∈ 𝓡𝑑 , 𝑓 𝐱 ∈ 𝓡
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, 2018 (Ch 2, p29)
41
© Copyright EE, NUS. All Rights Reserved.
Functions
𝜃
𝐱
𝒂 cos𝜃
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
42
© Copyright EE, NUS. All Rights Reserved.
Functions
Linear Functions
• Homogeneity
• For any d-vector 𝐱 and any scalar 𝛼, 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱
• Scaling the (vector) argument is the same as scaling the
function value
• Additivity
• For any d-vectors 𝐱 and 𝐲, 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲
• Adding (vector) arguments is the same as adding the function
values
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
43
© Copyright EE, NUS. All Rights Reserved.
Functions
Linear Functions
Superposition and linearity
• The inner product function 𝑓(𝐱) = 𝒂ᵀ𝐱 defined in equation (1) (slide 42) satisfies the property
𝑓(𝛼𝐱 + 𝛽𝐲) = 𝒂ᵀ(𝛼𝐱 + 𝛽𝐲) = 𝒂ᵀ(𝛼𝐱) + 𝒂ᵀ(𝛽𝐲) = 𝛼(𝒂ᵀ𝐱) + 𝛽(𝒂ᵀ𝐲) = 𝛼𝑓(𝐱) + 𝛽𝑓(𝐲)
for all d-vectors 𝐱, 𝐲, and all scalars 𝛼, 𝛽.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
44
© Copyright EE, NUS. All Rights Reserved.
Functions
Linear Functions
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
45
© Copyright EE, NUS. All Rights Reserved.
Functions
Example:
𝑓 𝐱 = 2.3 − 2𝑥1 + 1.3𝑥2 − 𝑥3
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p32)
46
© Copyright EE, NUS. All Rights Reserved.
Functions
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p33)
47
© Copyright EE, NUS. All Rights Reserved.
Summary
• Operations on Vectors and Matrices Assignment 1 (week 6 Fri)
• Dot-product, matrix inverse Tutorial 4
• Systems of Linear Equations 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Set and Functions
𝐗 is Square | Even-determined (m = d) | One unique solution in general | ŵ = 𝐗⁻¹𝐲
𝐗 is Tall | Over-determined (m > d) | No exact solution in general; an approximated solution | ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲 (left-inverse)
𝐗 is Wide | Under-determined (m < d) | Infinite number of solutions in general; unique constrained solution | ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲 (right-inverse)
48
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 5
Semester 2
2024/2025
Yueming Jin
[email protected]
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Yueming)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Yueming)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
3
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Recap: Linear and Affine Functions
Linear Functions
A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:
• Homogeneity 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱 Scaling
• Additivity 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲 Adding
Affine function
𝑓 𝐱 = 𝒂𝑇 𝐱 + 𝑏 scalar 𝑏 is called the offset (or bias)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
4
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Functions: Maximum and Minimum
A local and a global minima of a function
5
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values 𝓐 = {𝑎1, 𝑎2, …, 𝑎𝑚},
• The operator max_{𝑎∈𝓐} 𝑓(𝑎) returns the highest value 𝑓(𝑎) over all elements in the set 𝓐
• The operator arg max_{𝑎∈𝓐} 𝑓(𝑎) returns the element of the set 𝓐 that maximizes 𝑓(𝑎)
• When the set is implicit or infinite, we can write max_𝑎 𝑓(𝑎) or arg max_𝑎 𝑓(𝑎)
E.g. 𝑓(𝑎) = 3𝑎, 𝑎 ∈ [0, 1] → max_𝑎 𝑓(𝑎) = 3 and arg max_𝑎 𝑓(𝑎) = 1
Note: arg max returns a value from the domain of the function and max returns a value from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).
6
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
• The derivative 𝒇′ of a function 𝒇 is a function that
describes how fast 𝒇 grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function 𝑓 grows (or decreases) constantly at any point x of its domain
– When the derivative 𝑓′ is a function
• If 𝑓′ is positive at some x, then the function 𝑓 grows at this point
• If 𝑓′ is negative at some x, then the function 𝑓 decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal
(e.g. maximum or minimum points)
7
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If 𝑓(𝐱) is a scalar function of d variables, 𝐱 is a d x1 vector.
Then differentiation of 𝑓(𝐱) w.r.t. 𝐱 results in a d x1 vector
d𝑓(𝐱)/d𝐱 = [∂𝑓/∂𝑥1; ⋮ ; ∂𝑓/∂𝑥𝑑]
This is referred to as the gradient of 𝑓(𝐱) and often written as ∇𝐱𝑓.
E.g. 𝑓(𝐱) = 𝑎𝑥1 + 𝑏𝑥2  →  ∇𝐱𝑓 = [𝑎; 𝑏]
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
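A quick way to sanity-check a gradient like ∇𝐱𝑓 = [𝑎; 𝑏] is a finite-difference approximation; a small NumPy sketch with invented values of a and b:

import numpy as np

a, b = 2.0, -3.0
f = lambda x: a * x[0] + b * x[1]          # f(x) = a*x1 + b*x2

def numerical_gradient(f, x, eps=1e-6):
    # Central differences, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))   # approx. [ 2. -3.], matching the analytic gradient [a; b]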
8
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If 𝐟(𝐱) is a vector function of size h x1 and 𝐱 is a d x1 vector.
Then differentiation of 𝐟(𝐱) results in a h x d matrix
d𝐟(𝐱)/d𝐱 = [∂𝑓1/∂𝑥1 … ∂𝑓1/∂𝑥𝑑; ⋮ ⋱ ⋮; ∂𝑓ℎ/∂𝑥1 … ∂𝑓ℎ/∂𝑥𝑑]
The matrix is referred to as the Jacobian of 𝐟(𝐱)
9
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Derivative and Gradient
d(𝐀𝐱)/d𝐱 = 𝐀
10
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Poll on PollEv.com/ymjin
Just “skip” if you are required to do registration
11
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
12
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).
13
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Problem Statement: To predict the unknown 𝑦 for a given 𝐱 (testing)
𝑚
• We have a collection of labeled examples (training) {(𝐱𝑖 , y𝑖 )}𝑖=1
– 𝑚 is the size of the collection
– 𝐱𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚 (input)
– y𝑖 is a real-valued target (1-D)
– Note:
• when y𝑖 is continuous valued, it is a regression problem
• when y𝑖 is discrete valued, it is a classification problem
14
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
15
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
16
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values for w* which minimizes the
following expression:
∑_{i=1}^{m} (𝑓𝐰(𝐱𝑖) − y𝑖)²
with 𝑓𝐰(𝐱𝑖) = 𝐱𝑖ᵀ𝐰,
where we define 𝐰 = [𝑏, 𝑤1, …, 𝑤𝑑]ᵀ = [𝑤0, 𝑤1, …, 𝑤𝑑]ᵀ,
and 𝐱𝑖 = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]ᵀ = [𝑥𝑖,0, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]ᵀ, 𝑖 = 1, …, 𝑚
17
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression: minimize ∑_{i=1}^{m} (𝑓𝐰,𝑏(𝐱𝑖) − y𝑖)²
• All model-based learning algorithms have a loss function
• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)
18
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning (Training)
• Consider the set of feature vector 𝐱 𝑖 and target output 𝑦𝑖
indexed by 𝑖 = 1, … , 𝑚, a linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 can
be stacked as
𝒇𝐰(𝐗) = 𝐗𝐰 = [𝐱1ᵀ𝐰; ⁞; 𝐱𝑚ᵀ𝐰] (learning model), with the learning target vector 𝐲 = [𝑦1; ⁞; 𝑦𝑚],
where 𝐱𝑖ᵀ𝐰 = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑] [𝑏; 𝑤1; ⁞; 𝑤𝑑]
Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.
19
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
20
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
21
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Example 1: Training set {(𝑥𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥 = −9} → {𝑦 = −6}, {𝑥 = −7} → {𝑦 = −6}, {𝑥 = −5} → {𝑦 = −4},
{𝑥 = 1} → {𝑦 = −1}, {𝑥 = 5} → {𝑦 = 1}, {𝑥 = 9} → {𝑦 = 4}
In matrix form 𝐗𝐰 = 𝐲 (with a bias column of ones):
𝐗 = [1 −9; 1 −7; 1 −5; 1 1; 1 5; 1 9], 𝐰 = [𝑤0; 𝑤1], 𝐲 = [−6; −6; −4; −1; 1; 4]
This set of linear equations has no exact solution. However, 𝐗ᵀ𝐗 is invertible, so the least-squares approximation is
ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲 = [6 −6; −6 262]⁻¹ 𝐗ᵀ𝐲 = [−1.4375; 0.5625]
22
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
ŷ = 𝐗ŵ, i.e. the fitted line is y = −1.4375 + 0.5625x
Prediction on the test set: {𝑥 = −1} → {𝑦 = ?}
ŷ = [1 −1] [−1.4375; 0.5625] = −2
[Figure: linear regression on one-dimensional samples]
(Python demo 1)
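Example 1 can be reproduced in a few lines of NumPy; a minimal sketch of training (least squares) and prediction, with the data copied from the slide:

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-6.0, -6.0, -4.0, -1.0, 1.0, 4.0])

# Build the design matrix with a bias column of ones
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)                       # approx. [-1.4375  0.5625]

# Prediction for a new sample x = -1
x_new = np.array([1.0, -1.0])
print(x_new @ w)               # approx. -2.0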
23
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Example 2: Training set {(𝐱𝑖, y𝑖)}:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦 = 1}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦 = 0}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦 = 2}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦 = −1}
In matrix form 𝐗𝐰 = 𝐲:
𝐗 = [1 1 1; 1 −1 1; 1 1 3; 1 1 0], 𝐰 = [𝑤1; 𝑤2; 𝑤3], 𝐲 = [1; 0; 2; −1]
This set of linear equations has no exact solution. However, 𝐗ᵀ𝐗 is invertible, so the least-squares approximation is
ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲 = [4 2 5; 2 4 3; 5 3 11]⁻¹ 𝐗ᵀ𝐲 = [−0.7500; 0.1786; 0.9286]
24
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Prediction on the test set:
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦 = ?}
ŷ = 𝒇𝐰(𝐗_new) = 𝐗_new ŵ = [1 6 8; 1 0 −1] [−0.7500; 0.1786; 0.9286] = [7.7500; −1.6786]
25
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
Learning of Vectored Function (Multiple Outputs)
For one sample: a linear model 𝐟𝐰(𝐱) = 𝐱ᵀ𝐖 (vector function)
For m samples: 𝐅𝐰(𝐗) = 𝐗𝐖 = 𝐘, where
𝐗 = [𝐱1ᵀ; ⁞; 𝐱𝑚ᵀ] = [1 𝑥1,1 … 𝑥1,𝑑; ⁞; 1 𝑥𝑚,1 … 𝑥𝑚,𝑑] (one row per sample),
𝐖 = [𝑤0,1 … 𝑤0,ℎ; 𝑤1,1 … 𝑤1,ℎ; ⁞; 𝑤𝑑,1 … 𝑤𝑑,ℎ] (h output columns),
𝐘 = [𝑦1,1 … 𝑦1,ℎ; ⁞; 𝑦𝑚,1 … 𝑦𝑚,ℎ] (row i holds sample i’s h outputs)
J(𝐖) = trace(𝐄ᵀ𝐄) = trace[(𝐗𝐖 − 𝐘)ᵀ(𝐗𝐖 − 𝐘)]
If 𝐗ᵀ𝐗 is invertible, then
Learning/training: 𝐖 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
Prediction/testing: 𝐅𝐰(𝐗_new) = 𝐗_new 𝐖
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3.2.4)
27
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression
J(𝐖) = trace(𝐄ᵀ𝐄) = trace([𝐞1ᵀ; ⁞; 𝐞ℎᵀ] [𝐞1 𝐞2 … 𝐞ℎ])
28
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Training set {(𝐱𝑖, 𝐲𝑖)}:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦1 = 1, 𝑦2 = 0}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦1 = 0, 𝑦2 = 1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦1 = 2, 𝑦2 = −1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦1 = −1, 𝑦2 = 3}
In matrix form 𝐗𝐖 = 𝐘 (first column of 𝐗 is the bias):
𝐗 = [1 1 1; 1 −1 1; 1 1 3; 1 1 0], 𝐖 = [𝑤1,1 𝑤1,2; 𝑤2,1 𝑤2,2; 𝑤3,1 𝑤3,2], 𝐘 = [1 0; 0 1; 2 −1; −1 3]
This set of linear equations has no exact solution. 𝐗ᵀ𝐗 is invertible, so the least-squares approximation is
𝐖 = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘 = [4 2 5; 2 4 3; 5 3 11]⁻¹ 𝐗ᵀ𝐘 = [−0.75 2.25; 0.1786 0.0357; 0.9286 −1.2143]
29
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Prediction on the test set (two new samples):
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦1 = ?, 𝑦2 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦1 = ?, 𝑦2 = ?}
𝐘 = 𝐗_new 𝐖 = [1 6 8; 1 0 −1] [−0.75 2.25; 0.1786 0.0357; 0.9286 −1.2143] = [7.75 −7.25; −1.6786 3.4643]
Python demo 2
30
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Linear Regression of multiple outputs
Example 4
The values of the feature x and the corresponding values of the multiple-output target y are given in a table of training pairs.
𝐖 = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘 = [1.9 3.6; −0.4667 0.5]
Prediction: 𝐘_new = 𝐗_new 𝐖 = [1 2] 𝐖 = [0.9667 4.6]
(Python demo 3)
31
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Dot-product, matrix inverse
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
• Inner product, linear/affine functions
• Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
• Objective function, loss function
• Least square solution, training/learning and testing/prediction
• Linear regression with multiple outputs
Learning/training ෝ = (𝐗 𝑇𝒕𝒓𝒂𝒊𝒏 𝐗 𝒕𝒓𝒂𝒊𝒏 )−𝟏 𝐗 𝑇𝒕𝒓𝒂𝒊𝒏 𝐲𝒕𝒓𝒂𝒊𝒏
𝐰
Prediction/testing 𝐲𝒕𝒆𝒔𝒕 = 𝐗 𝒕𝒆𝒔𝒕 𝐰 ෝ
• Classification Python packages: numpy, pandas, matplotlib.pyplot,
• Ridge Regression numpy.linalg, and sklearn.metrics (for
• Polynomial Regression mean_squared_error), numpy.linalg.pinv
33
© NUS
## Copyright EE, NUS.
Confidential ## All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 6
Semester 2
2024/2025
Yueming Jin
[email protected]
2
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression & Polynomial
Regression
Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
3
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Scalar Function (Single Output)
For one sample: a linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 scalar function
For m samples: 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
𝐲 = [𝐱1ᵀ𝐰; ⁞; 𝐱𝑚ᵀ𝐰], where 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]
𝐗 = [1 𝑥1,1 … 𝑥1,𝑑; ⁞; 1 𝑥𝑚,1 … 𝑥𝑚,𝑑], 𝐰 = [𝑏; 𝑤1; ⁞; 𝑤𝑑], 𝐲 = [𝑦1; ⁞; 𝑦𝑚]
Objective: ∑_{i=1}^{m} (𝑓𝐰(𝐱𝑖) − y𝑖)² = 𝐞ᵀ𝐞 = (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲)
Learning/training when 𝐗ᵀ𝐗 is invertible, least-squares solution: ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
Prediction/testing: 𝒚_new = 𝒇𝐰(𝐗_new) = 𝐗_new ŵ
4
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)
𝐅𝐰(𝐗) = 𝐗𝐖 = 𝐘
𝐗 = [𝐱1ᵀ; ⁞; 𝐱𝑚ᵀ] = [1 𝑥1,1 … 𝑥1,𝑑; ⁞; 1 𝑥𝑚,1 … 𝑥𝑚,𝑑], 𝐖 = [𝑤0,1 … 𝑤0,ℎ; ⁞; 𝑤𝑑,1 … 𝑤𝑑,ℎ], 𝐘 = [𝑦1,1 … 𝑦1,ℎ; ⁞; 𝑦𝑚,1 … 𝑦𝑚,ℎ]
Least Squares Regression: if 𝐗ᵀ𝐗 is invertible, then
Learning/training: 𝐖 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
Prediction/testing: 𝐅𝐰(𝐗_new) = 𝐗_new 𝐖
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)
6
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
[Figure: the sign function, sign(𝑎) = +1 for 𝑎 > 0 and −1 for 𝑎 < 0]
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)
7
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1: Training set {(𝑥𝑖, y𝑖)}:
{𝑥 = −9} → {𝑦 = −1}, {𝑥 = −7} → {𝑦 = −1}, {𝑥 = −5} → {𝑦 = −1},
{𝑥 = 1} → {𝑦 = +1}, {𝑥 = 5} → {𝑦 = +1}, {𝑥 = 9} → {𝑦 = +1}
In matrix form 𝐗𝐰 = 𝐲 (with a bias column):
𝐗 = [1 −9; 1 −7; 1 −5; 1 1; 1 5; 1 9], 𝐰 = [𝑤0; 𝑤1], 𝐲 = [−1; −1; −1; 1; 1; 1]
This set of linear equations has no exact solution. 𝐗ᵀ𝐗 is invertible, so the least-squares approximation is
ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲 = [6 −6; −6 262]⁻¹ 𝐗ᵀ𝐲 = [0.1406; 0.1406]
8
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1 (cont’d)
ŷ = sign(𝐗ŵ), with y′ = 𝐗ŵ, i.e. y′ = 0.1406 + 0.1406x
Prediction on the test set: {𝑥 = −2} → {𝑦 = ?}
𝑦_new = 𝑓̂ᶜ𝐰(𝐱_new) = sign(𝐱_newᵀŵ) = sign([1 −2] [0.1406; 0.1406]) = sign(−0.1406) = −1
[Figure: linear regression for one-dimensional classification]
(Python demo 1)
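The classification variant simply wraps the regression prediction in a sign(); a minimal NumPy sketch reproducing Example 1 (data taken from the slide):

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])   # class labels in {-1, +1}

X = np.column_stack([np.ones_like(x), x])          # bias column + feature
w = np.linalg.inv(X.T @ X) @ X.T @ y               # least-squares fit
print(w)                                           # approx. [0.1406 0.1406]

x_new = np.array([1.0, -2.0])
print(np.sign(x_new @ w))                          # -1.0, i.e. class -1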
9
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Linear Methods for Classification
Multi-Category Classification:
If 𝐗ᵀ𝐗 is invertible, then
Learning: 𝐖 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘, with 𝐘 ∈ 𝐑^(m×C) (one-hot class targets)
Prediction: 𝑓̂ᶜ𝐰(𝐱_new) = arg max_{k=1,…,C} [𝐱_newᵀ𝐖]_k, for each 𝐱_new of 𝐗_new
10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2: Three-class classification. Training set {(𝐱𝑖, 𝐲𝑖)}:
{𝑥1 = 1, 𝑥2 = 1} → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0} (Class 1)
{𝑥1 = −1, 𝑥2 = 1} → {𝑦1 = 0, 𝑦2 = 1, 𝑦3 = 0} (Class 2)
{𝑥1 = 1, 𝑥2 = 3} → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0} (Class 1)
{𝑥1 = 1, 𝑥2 = 0} → {𝑦1 = 0, 𝑦2 = 0, 𝑦3 = 1} (Class 3)
In matrix form 𝐗𝐖 = 𝐘 (with a bias column):
𝐗 = [1 1 1; 1 −1 1; 1 1 3; 1 1 0], 𝐘 = [1 0 0; 0 1 0; 1 0 0; 0 0 1]
This set of linear equations has no exact solution. 𝐗ᵀ𝐗 is invertible, so the least-squares approximation is
𝐖 = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘 = [4 2 5; 2 4 3; 5 3 11]⁻¹ 𝐗ᵀ𝐘 = [0 0.5 0.5; 0.2857 −0.5 0.2143; 0.2857 0 −0.2857]
11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2: Prediction
Test set 𝐗_new: {𝑥1 = 6, 𝑥2 = 8} → {class 1, 2, or 3?}, {𝑥1 = 0, 𝑥2 = −1} → {class 1, 2, or 3?}
𝐘 = 𝐗_new 𝐖 = [1 6 8; 1 0 −1] [0 0.5 0.5; 0.2857 −0.5 0.2143; 0.2857 0 −0.2857]
Category prediction: take the arg max over the three output columns for each test sample.
14
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3)
15
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
16
© Copyright EE, NUS. All Rights Reserved.
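The ridge-regression slides here are graphical; as a hedged sketch of the standard closed form behind the “penalty term” mentioned in the summary (least squares plus an L2 penalty λ‖w‖², solved by ŵ = (XᵀX + λI)⁻¹Xᵀy), with data and λ invented for illustration:

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 0.0, 2.0])
lam = 0.1                          # regularization strength (assumed value)

d = X.shape[1]
# Ridge (primal) closed form: w = (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y
print(w_ridge)                     # shrunk towards zero compared with plain least squares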
Polynomial Regression
Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g. when the input dimension is d=2,
a polynomial function of degree = 2 is:
𝑓𝐰 𝐱 = 𝑤0 + 𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑤12 𝑥1 𝑥2 + 𝑤11 𝑥12 + 𝑤22 𝑥22 .
XOR problem
𝑓𝐰 𝐱 = 𝑥1 𝑥2
17
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Polynomial Expansion
• The linear model 𝑓𝐰(𝐱) = 𝐱ᵀ𝐰 can be written as
𝑓𝐰(𝐱) = 𝐱ᵀ𝐰 = ∑_{i=0}^{d} 𝑥𝑖𝑤𝑖 (with 𝑥0 = 1) = 𝑤0 + ∑_{i=1}^{d} 𝑥𝑖𝑤𝑖.
18
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function
• In general:
𝑓𝐰(𝐱) = 𝑤0 + ∑_{i=1}^{d} 𝑤𝑖𝑥𝑖 + ∑_{i=1}^{d} ∑_{j=1}^{d} 𝑤𝑖𝑗𝑥𝑖𝑥𝑗 + ∑_{i=1}^{d} ∑_{j=1}^{d} ∑_{k=1}^{d} 𝑤𝑖𝑗𝑘𝑥𝑖𝑥𝑗𝑥𝑘 + ⋯
Notes:
• For high dimensional input features (large d value) and high polynomial order, the
number of polynomial terms becomes explosive! (i.e., grows exponentially)
• For high dimensional problems, polynomials of order larger than 3 is seldom used.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5) online
19
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function
In stacked form: 𝐏𝐰 = [𝒑1ᵀ𝐰; ⁞; 𝒑𝑚ᵀ𝐰],
where 𝒑𝑙ᵀ = [1, 𝑥𝑙,1, …, 𝑥𝑙,𝑑, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗𝑥𝑙,𝑘, …] and 𝐰 = [𝑤0, 𝑤1, …, 𝑤𝑑, …, 𝑤𝑖𝑗, …, 𝑤𝑖𝑗𝑘, …]ᵀ,
𝑙 = 1, …, 𝑚; d denotes the dimension of input features; m denotes the number of samples
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)
20
© Copyright EE, NUS. All Rights Reserved.
{𝑥1= 0, 𝑥2 = 0} → {𝑦 = −1}
Example 3 {𝑥1= 1, 𝑥2 = 1} → {𝑦 = −1}
Training set {𝐱 𝑖 , 𝐲𝑖 }𝑚
𝑖=1 {𝑥1= 1, 𝑥2 = 0} → {𝑦 = +1}
{𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}
2nd order polynomial model
Note: Change X to P with reference to slides 15/16; m & d refers to the size of P (not X)
22
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary
23
© Copyright EE, NUS. All Rights Reserved.
{𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
Example 3 (cont’d) {𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
Training set {𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
{𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}
2nd order polynomial model
Each row of 𝐏 is [1, 𝑥𝑖,1, 𝑥𝑖,2, 𝑥𝑖,1𝑥𝑖,2, 𝑥𝑖,1², 𝑥𝑖,2²]:
𝐏 = [1 0 0 0 0 0; 1 1 1 1 1 1; 1 1 0 0 1 0; 1 0 1 0 0 1]
This is an under-determined system (4 equations, 6 unknowns), so the constrained solution uses the right-inverse:
ŵ = 𝐏ᵀ(𝐏𝐏ᵀ)⁻¹𝐲 = 𝐏ᵀ [1 1 1 1; 1 6 3 3; 1 3 3 1; 1 3 1 3]⁻¹ [−1; −1; +1; +1] = [−1; 1; 1; −4; 1; 1]
(Python demo 3)
24
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont’d): Prediction
Test set:
Test point 1: {𝑥1 = 0.1, 𝑥2 = 0.1} → {𝑦 = class −1 or +1?}
Test point 2: {𝑥1 = 0.9, 𝑥2 = 0.9} → {𝑦 = class −1 or +1?}
Test point 3: {𝑥1 = 0.1, 𝑥2 = 0.9} → {𝑦 = class −1 or +1?}
Test point 4: {𝑥1 = 0.9, 𝑥2 = 0.1} → {𝑦 = class −1 or +1?}
Each test point is expanded with the same 2nd-order features [1, 𝑥1, 𝑥2, 𝑥1𝑥2, 𝑥1², 𝑥2²]:
ŷ = 𝐏_new ŵ = [1 0.1 0.1 0.01 0.01 0.01; 1 0.9 0.9 0.81 0.81 0.81; 1 0.1 0.9 0.09 0.01 0.81; 1 0.9 0.1 0.09 0.81 0.01] [−1; 1; 1; −4; 1; 1] = [−0.82; −0.82; 0.46; 0.46]
𝒇ᶜ𝐰(𝐏(𝐗_new)) = sign(ŷ) = [−1; −1; +1; +1], i.e. Class −1, Class −1, Class +1, Class +1
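Example 3 can be reproduced with an explicit 2nd-order feature expansion followed by the right-inverse (under-determined) solution; a minimal NumPy sketch with the XOR-style data from the slide:

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def poly2(X):
    # 2nd-order expansion: [1, x1, x2, x1*x2, x1^2, x2^2]
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

P = poly2(X)
# Under-determined system (4 equations, 6 unknowns): constrained solution via right-inverse
w = P.T @ np.linalg.inv(P @ P.T) @ y
print(w)                                    # approx. [-1. 1. 1. -4. 1. 1.]

X_test = np.array([[0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]])
print(np.sign(poly2(X_test) @ w))           # [-1. -1. 1. 1.]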
25
© Copyright EE, NUS. All Rights Reserved.
Poll on PollEv.com/ymjin
Just “skip” if you are required to do registration
26
© Copyright EE, NUS. All Rights Reserved.
Mid-term: Lecture 1 to 6
Summary Trial quiz
• Notations, Vectors, Matrices Assignment 1 & 2
• Operations on Vectors and Matrices
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored function, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary