Midterm Combined Slides
EE2211 Introduction to Machine Learning
Lecture 1
Semester 1
2021/2022
Li Haizhou ([email protected])
Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)
• Important dates
– A1 will be released in Week 4 on Monday and submitted in 3 weeks
– A2 will be released in Week 6 on Monday and submitted in 4 weeks
– A3 will be released in Week 10 on Monday and submitted in 4 weeks
– Quiz (mid-term) will be held during the Week 8 lecture time (tentative).
References
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”,
2019. (read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc.,
2017.
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for
people who want to analyze data”, Lean Publishing, 2015.
2001: A Space Odyssey
HAL listens, talks, sings, reads lips, plays chess, and solves problems !
https://www.bbc.com/news/technology-35785875
https://www.businessinsider.com/googles-alphago-made-artifical-intelligence-history-2016-3
What is Machine Learning?
Machine learning
is a subfield of computer science that is concerned with
building algorithms which, to be useful, rely on a
collection of examples of some phenomenon. - Andriy
Burkov
These examples can come from nature, be handcrafted by
humans or generated by another algorithm.
Experience (E): data – process/learn from experience
Task (T): write a programme to perform the task
Performance (P): results, accuracy – measure the result of the task
[Figure: types of machine learning tasks]
• Reinforcement Learning
• Classification: x → y with discrete y
• Regression: x → y with continuous y
• Clustering: discrete groups
• Dimensionality Reduction: continuous
[Figure: Training – examples of apples and oranges are used to learn a Model; Testing – the model predicts “This is an orange” for a new image.]
x_i = [x_1^(i), …, x_j^(i), …, x_D^(i)]^T,  i = 1, . . . , N
– Each element x_i among the N examples is called a feature vector.
• A feature vector is a vector in which each dimension j = 1, . . . , D contains a value that describes the example somehow.
1-Dimensional Case
[Figure: spam (“1”) vs. not-spam (“0”) classification on a single feature x (e.g., repeated word count); a decision line (threshold) separates the two classes.]
2-Dimensional Case
[Figure: malignant (harmful) vs. benign (not harmful) tumours plotted against x1 (tumor size) and x2 (age).]
[Figure: regression example – y (price) versus x (size in square metres).]
Training vs. Test
• Training: data {x_i}_{i=1}^M with known labels {y_i}_{i=1}^M; the model has parameters to learn.
  Goal: to learn the model’s parameters from the given data and labels {(x_i, y_i)}_{i=1}^M.
• Test: novel data {x_k}_{k=1}^N with predicted labels {y_k}_{k=1}^N; the model uses the learned parameters.
  Goal: to predict the label of novel data {x_k}_{k=1}^N using the learned parameters.
source: SUTD
[Figure: unsupervised learning – after training on unlabelled fruit images, the model reports “I found two types of fruits!”]
[Figure: 2-dimensional case – malignant (harmful) vs. benign (not harmful) tumours on x1 (tumor size) and x2 (age), shown with and without labels.]
https://en.wikipedia.org/wiki/Social_network_analysis#/media/File:Kencf0618FacebookNetwork.jpg
[Figure: labelled data + unlabelled data; typically there is plenty of unlabelled data.]
[Figure: reinforcement learning – an Agent takes an action in an Environment, moves between states (S1, S2), and receives a reward that updates the Learning Model.]
Type of inferences
Inductive:
• To reach probable conclusions.
• All needed information is unavailable or unknown, causing uncertainty in the conclusions.
Deductive:
• To reach logical conclusions deterministically: all information that can lead to the correct conclusion is available.
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters
Ask the right questions if you are to find the right answers
[Figure: two pyramids – Domain Knowledge, Information (examples, test cases), and Data (what data to be used to solve the problem?) – traversed Top-down and Bottom-up.]
• The data may not contain the answer.
• The combination of some data and an aching desire for an answer
does not ensure that a reasonable answer can be extracted from a
given body of data. — John Tukey
[Figure: data – figures, records, and facts collected over time.]
Ordinal data
• Ordinal data are data that have a fixed, small number
(< 100) of possible values, called levels, that are
ordered (or ranked).
• Categorical/Qualitative
  – Nominal (e.g., gender, religion)
  – Ordinal (can be ordered, e.g., small/medium/large)
• Numerical/Quantitative
  – Interval (no natural zero, e.g., temperature in Celsius)
  – Ratio (includes natural zero, e.g., temperature in Kelvin)
https://i.stack.imgur.com/J8Ged.jpg
• Scaling to a range
– When the bounds or range of each independent dimension of
data is known, a common normalization technique is min-max.
Advantage: ensures a standard scale (all values lie between 0 and 1).
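A minimal NumPy sketch of min–max scaling to the [0, 1] range (the data values are made-up):

import numpy as np

# Hypothetical data: 4 samples, 2 features with very different ranges
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [15.0, 100.0],
              [25.0, 300.0]])

# Min-max scaling per feature: x' = (x - min) / (max - min)
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)
print(X_scaled)  # each column now spans [0, 1]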
• Feature clipping
https://developers.google.com/machine-learning/data-prep/transform/normalization
https://archive.ics.uci.edu/ml/datasets/iris
r = 5 4 3 2 1
– For example, one can assign a ‘1’ value for male and a ‘2’ value for
female for the case of gender attribute; ‘1’ value for spam and ‘0’
value for non-spam as in Lecture 1.
– Replace the missing value with a value outside the normal range of
values.
• For example, if the normal range is [0, 1], then you can set the missing
value to 2 or −1. The idea is that the learning algorithm will learn what is
best to do when the feature has a value significantly different from
regular values.
• Alternatively, you can replace the missing value by a value in the middle
of the range. For example, if the range for a feature is [−1, 1], you can
set the missing value to be equal to 0. Here, the idea is that the value in
the middle of the range will not significantly affect the prediction.
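A minimal NumPy sketch of the two replacement strategies described above (the feature values and fill values are illustrative assumptions):

import numpy as np

# Hypothetical feature with values normally in [0, 1]; np.nan marks missing entries
x = np.array([0.2, 0.7, np.nan, 0.4, np.nan, 0.9])

# Strategy 1: replace missing values with a value outside the normal range (e.g., -1)
x_outside = np.where(np.isnan(x), -1.0, x)

# Strategy 2: replace missing values with the middle of the range (0.5 for [0, 1])
x_middle = np.where(np.isnan(x), 0.5, x)

print(x_outside)
print(x_middle)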
http://www.financetwitter.com/
Data Visualization
• https://towardsdatascience.com/data-visualization-with-mathplotlib-using-python-a7bfb4628ee3
By Anscombe.svg: Schutz Derivative works of this file:(label using subscripts): Avenue - Anscombe.svg, CC BY-SA
3.0, https://commons.wikimedia.org/w/index.php?curid=9838454
• These boxplots look very similar, but if you overlay the actual data points
you can see that they have very different distributions.
• If we re-plot this pie chart as a bar chart, it is much easier to see that A is bigger than D.
• Without a logarithmic scale, 90% of the data fall in the lower left-hand corner of this figure.
Introduction to Linear Algebra
• Use of vector and matrix notation, especially with
multivariate statistics.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Introduction to Linear Algebra
• A scalar is a simple numerical value, like 15 or −3.25
– Focus on real numbers
• Variables or constants that take scalar values are
denoted by an italic letter, like x or a
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Notations, Vectors, Matrices
• A vector is an ordered list of scalar values, called
attributes
– Denoted by a bold character, e.g. x or a
• In many books, vectors are written column-wise:
a = [2; 3],  b = [−2; 5],  c = [1; 0]
• The three vectors above are two-dimensional (or have two
elements)
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Notations, Vectors, Matrices
• We denote an attribute of a vector as an italic value with an index, e.g. a^(j) or x^(j)
– The index j denotes a specific dimension of the vector, the position of an attribute in the list
a = [a^(1); a^(2)] = [2; 3], or more commonly a = [a_1; a_2] = [2; 3]
Note:
• The notation x^(j) should not be confused with the power operator, such as the 2 in x^2 (squared) or the 3 in x^3 (cubed)
• The square of an indexed attribute of a vector is denoted by (x^(j))^2
• A variable can have two or more indices, like x_i^(j) or x_(i,j)^(k)
• For example, in neural networks, we denote by x_(l,u)^(j) the input feature j of unit u in layer l
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Notations, Vectors, Matrices
Vectors can be visualized as arrows that point to some directions as well
as points in a multi-dimensional space
[Figure: illustrations of three two-dimensional vectors, a = [2; 3], b = [−2; 5], and c = [1; 0].]
Notations, Vectors, Matrices
• A matrix is a rectangular array of numbers arranged in rows and
columns
– Denoted with bold capital letters, such as X or W
– An example of a matrix with two rows and three columns:
X = [2 4 −3; 21 −6 −1]
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character, e.g., S
– When an element x belongs to a set S, we write x ∈ S
– A special set denoted R includes all real numbers from minus infinity to plus infinity
Note:
• For the elements in matrix X, we use the indexing x_(1,1), where the first and the second indices indicate the row and the column positions respectively, or x_1^(1).
• Usually, for input data, rows represent samples and columns represent features
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Notations, Vectors, Matrices: Example
Iris data set
• Measurement features
can be packed as
𝐗𝐗 ∈ 𝓡𝓡150×4
https://archive.ics.uci.edu/ml/datasets/iris
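A minimal sketch (assuming scikit-learn is installed) showing that the Iris measurement features form a 150×4 matrix:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data           # measurement features packed as a matrix
print(X.shape)          # (150, 4): 150 samples (rows), 4 features (columns)
print(iris.feature_names)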
∏_{i=1}^{m} x_i = x_1 · x_2 · … · x_{m−1} · x_m
Note:
• Capital Sigma and Pi can be applied to the attributes of a vector x
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
Systems of Linear Equations
Linear dependence and independence
Systems of Linear Equations
[Figure: linearly dependent vectors a and b in the (x1, x2) plane satisfy β1·a + β2·b = 0 for some nonzero β1, β2; for linearly independent vectors a, b, c in (x1, x2, x3) space, β1·a + β2·b ≠ β3·c.]
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
Xw = y
where
X = [x_(1,1) x_(1,2) … x_(1,d); ⁞ ⁞ ⋱ ⁞; x_(m,1) x_(m,2) … x_(m,d)],  w = [w_1; ⁞; w_d],  y = [y_1; ⁞; y_m].
Note:
• The data matrix X ∈ R^(m×d) and the target vector y ∈ R^m are given
• The unknown vector of parameters w ∈ R^d is to be learnt
• rank(X) is the maximal number of linearly independent columns/rows of X.
1. What is the rank of [1 2; 2 1]?
2. What is the rank of [1 −2 3; 0 −3 3; 1 1 0]?
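A minimal NumPy check of the two rank questions:

import numpy as np

A = np.array([[1, 2],
              [2, 1]])
B = np.array([[1, -2, 3],
              [0, -3, 3],
              [1,  1, 0]])

print(np.linalg.matrix_rank(A))  # 2: the rows are linearly independent
print(np.linalg.matrix_rank(B))  # 2: the third row equals the first row minus the second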
https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
https://en.wikipedia.org/wiki/Simpson%27s_paradox
Ref: Gardener, Martin (March 1979). "MATHEMATICAL GAMES: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American. 234
1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
*otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
[Figure: a random variable X maps each outcome s in the sample space to a real number X(s) ∈ R.]
Notations
• Some books use P(·) and p(·) to distinguish between the probability of a discrete random variable and the probability (density) of a continuous random variable, respectively.
[Figure: a distribution centred at μ, with 90% and 95% probability intervals marked around x1.]
posterior = (likelihood × prior) / evidence
Bayes’ Rule
Pr(y | x) = Pr(y) Pr(x | y) / Pr(x) = Pr(y) Pr(x | y) / Σ_y Pr(y) Pr(x | y)
• Prior Pr(y) – what we know about y BEFORE seeing x
• Likelihood Pr(x | y) – propensity for observing a certain value of x given a certain value of y
• Posterior Pr(y | x) – what we know about y AFTER seeing x
• Evidence Pr(x) – a constant to ensure that the left-hand side is a valid distribution
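A minimal numeric sketch of Bayes’ rule for a binary label y (the prior and likelihood numbers are made-up):

# Hypothetical values: Pr(y) and Pr(x | y) for a single observed feature value x
prior = {"spam": 0.3, "not_spam": 0.7}
likelihood = {"spam": 0.8, "not_spam": 0.1}   # Pr(x | y) for the observed x

# Evidence Pr(x) = sum over y of Pr(y) * Pr(x | y)
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Posterior Pr(y | x) = Pr(y) * Pr(x | y) / Pr(x)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}
print(posterior)  # {'spam': ~0.774, 'not_spam': ~0.226}; the values sum to 1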
Acknowledgement:
EE2211 development team
(Kar-Ann Toh, Thomas Yeo, Chen Khong, Helen Zhou, Vincent Tan, Robby Tan and
Haizhou Li)
Systems of Linear Equations
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
Fundamental ML Algorithms:
Linear Regression
References for Lectures 4-6:
Main
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019.
(read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017
Supplementary
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for people
who want to analyze data”, Lean Publishing, 2015.
• [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied
Linear Algebra”, Cambridge University Press, 2018 (available online)
http://vmls-book.stanford.edu/
• [Ref 5] Professor Vincent Tan’s notes (chapters 4-6): (useful)
https://vyftan.github.io/papers/ee2211book.pdf
Recap on Notations, Vectors, Matrices
Scalar: a numerical value, e.g., 15, −3.5
Variable: takes scalar values, e.g., x or a
Capital Pi: ∏_{i=1}^{m} x_i = x_1 · x_2 · … · x_{m−1} · x_m
Operations on Vectors and Matrices
x + y = [x_1; x_2] + [y_1; y_2] = [x_1 + y_1; x_2 + y_2]
x − y = [x_1; x_2] − [y_1; y_2] = [x_1 − y_1; x_2 − y_2]
Operations on Vectors and Matrices
a x = a [x_1; x_2] = [a x_1; a x_2]
(1/a) x = (1/a) [x_1; x_2] = [x_1 / a; x_2 / a]
Operations on Vectors and Matrices
Matrix or Vector Transpose:
x = [x_1; x_2],  x^T = [x_1 x_2]
X = [x_(1,1) x_(1,2) x_(1,3); x_(2,1) x_(2,2) x_(2,3); x_(3,1) x_(3,2) x_(3,3)],  X^T = [x_(1,1) x_(2,1) x_(3,1); x_(1,2) x_(2,2) x_(3,2); x_(1,3) x_(2,3) x_(3,3)]
Python demo 1
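A minimal NumPy sketch of these operations (addition, scalar multiplication, transpose, and the dot product introduced on the next slide), with made-up values:

import numpy as np

x = np.array([2.0, 3.0])
y = np.array([-2.0, 5.0])
a = 2.0

print(x + y)         # element-wise addition: [0. 8.]
print(x - y)         # element-wise subtraction: [ 4. -2.]
print(a * x)         # scalar multiplication: [4. 6.]
print(x / a)         # scalar division: [1.  1.5]

X = np.array([[2, 4, -3],
              [21, -6, -1]])
print(X.T)           # transpose: a 3x2 matrix

print(np.dot(x, y))  # dot product x . y = x1*y1 + x2*y2 = 11.0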
Operations on Vectors and Matrices
Dot Product or Inner Product of Vectors:
x · y = x^T y = [x_1 x_2] [y_1; y_2] = x_1 y_1 + x_2 y_2
Geometric definition: x · y = ‖x‖ ‖y‖ cos θ
[Figure: vectors x and y with angle θ between them; ‖x‖ cos θ is the projection of x onto y.]
Matrix-Vector Product
Operations on Vectors and Matrices
Vector-Matrix Product
Operations on Vectors and Matrices
Matrix-Matrix Product
Matrix inverse
Definition:
A d-by-d square matrix A is invertible (also nonsingular)
if there exists a d-by-d square matrix B such that
AB = BA = I (identity matrix)
I = [1 0 … 0; 0 1 … 0; ⁞ ⋱ ⁞; 0 0 … 1]   (d-by-d)
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
Operations on Vectors and Matrices
Matrix inverse computation
A^(−1) = (1 / det A) · adj(A)
• det A is the determinant of A
• adj(A) is the adjugate or adjoint of A
Determinant computation
Example: 2x2 matrix
A = [a b; c d]
det A = |A| = ad − bc
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
Operations on Vectors and Matrices
• adj(A) is the adjugate or adjoint of A
• adj(A) is the transpose of the cofactor matrix C of A: adj(A) = C^T
• The minor of an element in a matrix A is the determinant obtained by deleting the row and column in which that element lies
A = [a11 a12 a13; a21 a22 a23; a31 a32 a33];  the minor of a12 is M12 = det[a21 a23; a31 a33]
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
Operations on Vectors and Matrices
The minor of a12 is M12 = det[a21 a23; a31 a33];  adj(A) = C^T
Cofactor: C_ij = (−1)^(i+j) M_ij;  det(A) = Σ_{j=1}^{k} (−1)^(i+j) a_ij M_ij   (expansion along row i)
• E.g. A = [a b; c d]
  C = [d −c; −b a]
• adj(A) = C^T = [d −b; −c a],  det A = |A| = ad − bc
  A^(−1) = (1 / det A) · adj(A) = 1/(ad − bc) · [d −b; −c a]
Ref: https://en.wikipedia.org/wiki/Invertible_matrix
Operations on Vectors and Matrices
Determinant computation: det(A) = Σ_{j=1}^{k} (−1)^(i+j) a_ij M_ij
Example: 3x3 matrix, expand along the first row (i = 1)
Operations on Vectors and Matrices
The minor of a11 is M11 = det[a22 a23; a32 a33]
The minor of a12 is M12 = det[a21 a23; a31 a33]
The minor of a13 is M13 = det[a21 a22; a31 a32]
The minor of a21 is M21 = det[a12 a13; a32 a33]
The minor of a22 is M22 = det[a11 a13; a31 a33]
Ref: https://en.wikipedia.org/wiki/Determinant
Operations on Vectors and Matrices
Example
Find the cofactor matrix of A given that A = [1 2 3; 0 4 5; 1 0 6].
Solution:
a11 ⇒ det[4 5; 0 6] = 24,    a12 ⇒ −det[0 5; 1 6] = 5,    a13 ⇒ det[0 4; 1 0] = −4,
a21 ⇒ −det[2 3; 0 6] = −12,  a22 ⇒ det[1 3; 1 6] = 3,     a23 ⇒ −det[1 2; 1 0] = 2,
a31 ⇒ det[2 3; 4 5] = −2,    a32 ⇒ −det[1 3; 0 5] = −5,   a33 ⇒ det[1 2; 0 4] = 4.
The cofactor matrix C is thus [24 5 −4; −12 3 2; −2 −5 4].
Ref: https://www.mathwords.com/c/cofactor_matrix.htm
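A minimal NumPy check of this example: the inverse built from the adjugate, adj(A)/det(A) with adj(A) = C^T, should match numpy.linalg.inv:

import numpy as np

A = np.array([[1, 2, 3],
              [0, 4, 5],
              [1, 0, 6]], dtype=float)

# Cofactor matrix computed above
C = np.array([[ 24,  5, -4],
              [-12,  3,  2],
              [ -2, -5,  4]], dtype=float)

adj_A = C.T                       # adjugate = transpose of the cofactor matrix
det_A = np.linalg.det(A)          # 22 for this A
A_inv = adj_A / det_A             # A^(-1) = adj(A) / det(A)

print(np.allclose(A_inv, np.linalg.inv(A)))  # True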
Systems of Linear Equations
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
𝐗𝐰 = 𝐲
Where
X = [x_(1,1) x_(1,2) … x_(1,d); ⁞ ⁞ ⋱ ⁞; x_(m,1) x_(m,2) … x_(m,d)],  w = [w_1; ⁞; w_d],  y = [y_1; ⁞; y_m].
Note:
• The data matrix 𝐗 ∈ 𝓡𝑚×𝑑 and the target vector 𝐲 ∈ 𝓡𝑚 are given
• The unknown vector of parameters 𝐰 ∈ 𝓡𝑑 is to be learnt
Systems of Linear Equations
A set of linear equations can have no solution, one
solution, or multiple solutions:
Xw = y,  with X, w, and y defined as above.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
1. Square or even-determined system: 𝒎 = 𝒅
- Equal number of equations and unknowns, i.e., 𝐗 ∈ 𝓡𝑑×𝑑
- One unique solution if 𝐗 is invertible or all rows/columns of 𝐗 are
linearly independent
- If all rows or columns of 𝐗 are linearly independent, then 𝐗 is
invertible.
Solution:
If X is invertible (i.e., X^(−1) X = I), then pre-multiply both sides by X^(−1):
(X^(−1) X) w = X^(−1) y  ⇒  ŵ = X^(−1) y
(Note: we use a hat on top of w to indicate that it is a specific point in the space of w)
Systems of Linear Equations
Example 1:  w1 + w2 = 4   (1)
            w1 − 2w2 = 1  (2)        Two unknowns, two equations
X w = y:  [1 1; 1 −2] [w1; w2] = [4; 1]
ŵ = X^(−1) y = [1 1; 1 −2]^(−1) [4; 1] = (−1/3) [−2 −1; −1 1] [4; 1] = [3; 1]
(Python demo 3)
Recall: A^(−1) = (1/det A) · adj(A),  adj(A) = C^T = [d −b; −c a],  det A = |A| = ad − bc
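A minimal NumPy sketch of Example 1; np.linalg.solve avoids forming the inverse explicitly:

import numpy as np

X = np.array([[1, 1],
              [1, -2]], dtype=float)
y = np.array([4, 1], dtype=float)

w_hat = np.linalg.solve(X, y)    # solves X w = y for a square, invertible X
print(w_hat)                     # [3. 1.]

print(np.linalg.inv(X) @ y)      # same result via the explicit inverse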
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
2. Over-determined system: 𝒎 > 𝒅
– More equations than unknowns
– 𝐗 is non-square (tall) and hence not invertible
– Has no exact solution in general *
– An approximated solution is available using the left inverse
If the left-inverse of X exists such that X† X = I, then pre-multiplying both sides by X† gives
(X† X) w = X† y  ⇒  ŵ = X† y
Definition:
A matrix B that satisfies B_(d×m) A_(m×d) = I is called a left-inverse of A.
The left-inverse of X is X† = (X^T X)^(−1) X^T, given that X^T X is invertible.
Note: * exception: when rank(X) = rank([X, y]), there is an exact solution.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)
Systems of Linear Equations
Example 2:  w1 + w2 = 1  (1)
            w1 − w2 = 0  (2)        Two unknowns, three equations
            w1 = 2       (3)
X w = y:  [1 1; 1 −1; 1 0] [w1; w2] = [1; 0; 2]
No exact solution.
Approximated solution (X^T X is invertible):
ŵ = X† y = (X^T X)^(−1) X^T y = [3 0; 0 2]^(−1) [1 1 1; 1 −1 0] [1; 0; 2] = [1; 0.5]
(Python demo 4)
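A minimal NumPy sketch of Example 2, computing the same least-squares solution three equivalent ways:

import numpy as np

X = np.array([[1, 1],
              [1, -1],
              [1, 0]], dtype=float)
y = np.array([1, 0, 2], dtype=float)

# Left-inverse (normal equations): w = (X^T X)^(-1) X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Pseudo-inverse and the dedicated least-squares routine give the same answer
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_pinv, w_lstsq)   # all [1.  0.5]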
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
3. Under-determined system: 𝒎 < 𝒅
– More unknowns than equations
– Infinite number of solutions in general *
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)
Systems of Linear Equations
3. Under-determined system: 𝒎 < 𝒅
Derivation:
Xw = y,  X ∈ R^(m×d), w ∈ R^(d×1), y ∈ R^(m×1)
If the right-inverse X† = X^T (X X^T)^(−1) exists (i.e., X X^T is invertible), then ŵ = X† y = X^T (X X^T)^(−1) y is a constrained solution.
Systems of Linear Equations
Example 3:  w1 + 2w2 + 3w3 = 2  (1)
            w1 − 2w2 + 3w3 = 1  (2)        Three unknowns, two equations
X w = y:  [1 2 3; 1 −2 3] [w1; w2; w3] = [2; 1]
Infinitely many solutions along the intersection line of the two planes.
Here X X^T is invertible:
ŵ = X^T (X X^T)^(−1) y = [1 1; 2 −2; 3 3] [14 6; 6 14]^(−1) [2; 1] = [0.15; 0.25; 0.45]   (constrained solution)
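A minimal NumPy sketch of Example 3; np.linalg.pinv returns the same constrained (minimum-norm) solution as the right-inverse formula:

import numpy as np

X = np.array([[1,  2, 3],
              [1, -2, 3]], dtype=float)
y = np.array([2, 1], dtype=float)

# Right-inverse: w = X^T (X X^T)^(-1) y
w_right = X.T @ np.linalg.inv(X @ X.T) @ y
w_pinv = np.linalg.pinv(X) @ y

print(w_right)                        # [0.15 0.25 0.45]
print(np.allclose(w_right, w_pinv))   # True
print(X @ w_right)                    # [2. 1.]  (the equations are satisfied exactly)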
Systems of Linear Equations
X w = y:  [1 2 3; 3 6 9] [w1; w2; w3] = [2; 1]
Here the rows of X are linearly dependent (the second row is 3 times the first), so X X^T is not invertible and the right-inverse formula cannot be applied.
Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
Notations: Set
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character e.g., 𝓢, 𝓡, 𝓝 etc
– When an element x belongs to a set S, we write x ∈ S
• A set of numbers can be finite (include a fixed number of values)
– Denoted using curly braces, e.g. {1, 3, 18, 23, 235} or {x_1, x_2, x_3, x_4, . . . , x_d}
• A set can be infinite and include all values in some interval
– If a set of real numbers includes all values between a and b, including a and
b, it is denoted using square brackets as [a, b]
– If the set does not include the values a and b, it is denoted using
parentheses as (a, b)
• Examples:
– The special set denoted by 𝓡 includes all real numbers from minus infinity
to plus infinity
– The set [0, 1] includes values like 0, 0.0001, 0.25, 0.9995, and 1.0
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).
Notations: Set operations
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).
Functions
• A function is a relation that associates each element 𝑥 of a set 𝓧,
the domain of the function, to a single element 𝑦 of another set 𝓨,
the codomain of the function
• If the function is called f, this relation is denoted 𝑦 = 𝑓(𝑥)
– The element 𝑥 is the argument or input of the function
– 𝑦 is the value of the function or the output
• The symbol used for representing the input is the variable of the
function
– f(x): f is a function of the variable x;  f(x, w): f is a function of the variables x and w
[Figure: a function f maps the domain {1, 2, 3, 4} into the codomain {1, 2, 3, 4, 5, 6}; the range (or image) is {3, 4, 5, 6}.]
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6 of chp2).
Functions
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p7 of chp2).
Functions
• The notation 𝑓: 𝓡𝑑 → 𝓡 means that 𝑓 is a function that
maps real d-vectors to real numbers
– i.e., 𝑓 is a scalar-valued function of d-vectors
• If 𝐱 is a d-vector argument, then 𝑓 𝐱 denotes the value
of the function 𝑓 at 𝐱
– i.e., 𝑓 𝐱 = 𝑓 𝑥1 , 𝑥2 , … , 𝑥𝑑 , 𝐱 ∈ 𝓡𝑑 , 𝑓 𝐱 ∈ 𝓡
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, 2018 (Ch 2, p29)
Functions
[Figure: vectors a and x with angle θ between them; ‖a‖ cos θ is the projection of a onto x.]
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
Functions
Linear Functions
• Homogeneity
• For any d-vector 𝐱 and any scalar 𝛼, 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱
• Scaling the (vector) argument is the same as scaling the
function value
• Additivity
• For any d-vectors 𝐱 and 𝐲, 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲
• Adding (vector) arguments is the same as adding the function
values
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
Functions
Linear Functions
Superposition and linearity
• The inner product function f(x) = a^T x (slide 9) satisfies the superposition property
f(αx + βy) = a^T(αx + βy)
           = a^T(αx) + a^T(βy)
           = α(a^T x) + β(a^T y)
           = α f(x) + β f(y)
for all d-vectors x, y, and all scalars α, β.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
Functions
Linear Functions
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)
Functions
Example:
𝑓 𝐱 = 2.3 − 2𝑥1 + 1.3𝑥2 − 𝑥3
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p32)
Functions
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p33)
Summary
• Operations on Vectors and Matrices: dot-product, matrix inverse   (Assignment 1, week 6; Tutorial 4)
• Systems of Linear Equations Xw = y
  • Matrix-vector notation, linear dependency, invertibility
  • Even-, over-, under-determined linear systems
• Set and Functions

X is square: even-determined (m = d): one unique solution in general;  ŵ = X^(−1) y
X is tall:   over-determined (m > d):  no exact solution in general; an approximated solution  ŵ = (X^T X)^(−1) X^T y  (left-inverse)
X is wide:   under-determined (m < d): infinite number of solutions in general; unique constrained solution  ŵ = X^T (X X^T)^(−1) y  (right-inverse)
EE2211 Introduction to Machine Learning
Lecture 5
Semester 1
2021/2022
Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Helen, Vincent, Chen Khong, Robby, and Haizhou)
Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
Recap: Linear and Affine Functions
Linear Functions
A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:
• Homogeneity: f(αx) = α f(x)   (scaling; no offset)
• Additivity: f(x + y) = f(x) + f(y)   (adding)
Affine function:
f(x) = a^T x + b, where the scalar b is called the offset (or bias)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)
Functions: Maximum and Minimum
[Figure: a local and a global minimum of a function.]
Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values 𝒜 = {a_1, a_2, …, a_m},
• The operator max_{a∈𝒜} f(a) returns the highest value f(a) over all elements of the set 𝒜
• The operator arg max_{a∈𝒜} f(a) returns the element of the set 𝒜 that maximizes f(a) (returns the input)
Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).
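A minimal NumPy illustration of the difference (the set and the function are made-up):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])     # the set A = {a_1, ..., a_m}
f = -(a - 2.5) ** 2                    # f(a) = -(a - 2.5)^2, largest near a = 2.5

print(f.max())          # max: the highest value of f(a), from the range of f
print(a[np.argmax(f)])  # arg max: the element of A achieving it (the first maximiser, 2.0)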
Derivative and Gradient
• The derivative 𝒇′ of a function 𝒇 is a function that describes
how fast 𝒇 grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function 𝑓 grows (or decreases) constantly at any point x of its domain
– When the derivative 𝑓′ is a function
• If 𝑓′ is positive at some x, then the function 𝑓 grows at this point
• If 𝑓′ is negative at some x, then the function 𝑓 decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal (e.g.
maximum or minimum points)
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).
Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If 𝑓(𝐱) is a scalar function of d variables, 𝐱 is a d x1 vector.
Then differentiation of 𝑓(𝐱) w.r.t. 𝐱 results in a d x1 vector
df(x)/dx = [∂f/∂x_1; ⋮; ∂f/∂x_d]
This is referred to as the gradient of f(x) and is often written as ∇_x f.
E.g. f(x) = a·x_1 + b·x_2  ⇒  ∇_x f = [a; b]
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If 𝐟(𝐱) is a vector function of size h x1 and 𝐱 is a d x1 vector.
Then differentiation of 𝐟(𝐱) results in a h x d matrix
df(x)/dx = [∂f_1/∂x_1 ⋯ ∂f_1/∂x_d; ⋮ ⋱ ⋮; ∂f_h/∂x_1 ⋯ ∂f_h/∂x_d]
The matrix is referred to as the Jacobian of 𝐟(𝐱)
Derivative and Gradient
d(Ax)/dx = A
Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
Linear Regression
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).
Linear Regression
Problem Statement: To predict the unknown 𝑦 for a given 𝐱 (testing)
• We have a collection of labeled examples (training) {(x_i, y_i)}_{i=1}^m
– 𝑚 is the size of the collection
– 𝐱 𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚 (input)
– y𝑖 is a real-valued target (1-D)
– Note:
• when y𝑖 is continuous valued, it is a regression problem
• when y𝑖 is discrete valued, it is a classification problem
Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values for w* which minimizes the
following expression:
Σ_{i=1}^{m} (f_w(x_i) − y_i)^2
with f_w(x_i) = x_i^T w,
where we define w = [b, w_1, …, w_d]^T = [w_0, w_1, …, w_d]^T,
and x_i = [1, x_(i,1), …, x_(i,d)]^T = [x_(i,0), x_(i,1), …, x_(i,d)]^T,  i = 1, …, m
Linear Regression:  Σ_{i=1}^{m} (f_(w,b)(x_i) − y_i)^2,  where f_(w,b)(x_i) is the predicted value and y_i is the true label.
Linear Regression
Learning (Training)
• Consider the set of feature vectors x_i and target outputs y_i indexed by i = 1, …, m. A linear model f_w(x) = x^T w can be stacked as
f_w(X) = Xw = [x_1^T w; ⁞; x_m^T w]   (learning model),   y = [y_1; ⁞; y_m]   (learning target vector)
where x_i^T w = [1, x_(i,1), …, x_(i,d)] [b; w_1; ⁞; w_d]
Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.
Linear Regression
Example 1.  Training set {(x_i, y_i)}_{i=1}^m:
{x = −9} → {y = −6},  {x = −7} → {y = −6},  {x = −5} → {y = −4},
{x = 1} → {y = −1},   {x = 5} → {y = 1},    {x = 9} → {y = 4}
X w = y:  [1 −9; 1 −7; 1 −5; 1 1; 1 5; 1 9] [w_0; w_1] = [−6; −6; −4; −1; 1; 4]   (w_0: offset term)
This set of linear equations has no exact solution.
However, X^T X is invertible, so the least-squares approximation is
ŵ = X† y = (X^T X)^(−1) X^T y = [6 −6; −6 262]^(−1) [1 1 1 1 1 1; −9 −7 −5 1 5 9] [−6; −6; −4; −1; 1; 4] = [−1.4375; 0.5625]
Equation of the fitted line: y = 0.5625x − 1.4375
Linear Regression
ŷ = X ŵ = X [−1.4375; 0.5625],  i.e., y = −1.4375 + 0.5625x
Prediction:
Test set: {x = −1} → {y = ?}
ŷ = [1 −1] [−1.4375; 0.5625] = −2
[Figure: linear regression on one-dimensional samples.]
(Python demo 1)
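A minimal NumPy sketch of Example 1: fit with the left-inverse and predict at x = −1:

import numpy as np

x = np.array([-9, -7, -5, 1, 5, 9], dtype=float)
y = np.array([-6, -6, -4, -1, 1, 4], dtype=float)

# Add a column of ones for the offset term w0
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: w = (X^T X)^(-1) X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)                 # [-1.4375  0.5625]

# Prediction for a new sample x = -1
x_new = np.array([1.0, -1.0])
print(x_new @ w_hat)         # -2.0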
Linear Regression (add a column of 1s for the offset in X):
X w = y:  [1 1 1; 1 −1 1; 1 1 3; 1 1 0] [w_1; w_2; w_3] = [1; 0; 2; −1]
This set of linear equations has no exact solution.
However, X^T X is invertible, so the least-squares approximation is
ŵ = X† y = (X^T X)^(−1) X^T y = [4 2 5; 2 4 3; 5 3 11]^(−1) [1 1 1 1; 1 −1 1 1; 1 1 3 0] [1; 0; 2; −1] = [−0.7500; 0.1786; 0.9286]
Linear Regression
Prediction for the four linear equations above:
Test set:
{x1 = 1, x2 = 6, x3 = 8} → {y = ?}
{x1 = 1, x2 = 0, x3 = −1} → {y = ?}
ŷ = f_w(X_new) = X_new ŵ = [1 6 8; 1 0 −1] [−0.7500; 0.1786; 0.9286] = [7.7500; −1.6786]
Linear Regression
Learning of Vectored Function (Multiple Outputs)
For one sample: a linear model 𝐟𝐰 𝐱 = 𝐱 𝑇 𝐖 Vector function
For m samples: 𝐅𝐰 𝐗 = 𝐗𝐖 = 𝐘
X = [x_1^T; ⁞; x_m^T] = [1 x_(1,1) … x_(1,d); ⁞ ⁞ ⋱ ⁞; 1 x_(m,1) … x_(m,d)]   (samples 1, …, m)
W = [w_(0,1) … w_(0,h); w_(1,1) … w_(1,h); ⁞ ⋱ ⁞; w_(d,1) … w_(d,h)]   ((d+1) × h)
Y = [y_(1,1) … y_(1,h); ⁞ ⋱ ⁞; y_(m,1) … y_(m,h)]   (each row is one sample’s h outputs)
J(W) = trace(E^T E) = trace[(XW − Y)^T (XW − Y)]
If X^T X is invertible, then
Learning/training: Ŵ = (X^T X)^(−1) X^T Y
Prediction/testing: F_W(X_new) = X_new Ŵ
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3.2.4)
Linear Regression
J(W) = trace(E^T E) = trace([e_1^T; ⁞; e_h^T] [e_1 e_2 … e_h])
Linear Regression of multiple outputs
Example 3.  Training set {(x_i, y_i)}_{i=1}^m:
{x1 = 1, x2 = 1, x3 = 1} → {y1 = 1, y2 = 0}
{x1 = 1, x2 = −1, x3 = 1} → {y1 = 0, y2 = 1}
{x1 = 1, x2 = 1, x3 = 3} → {y1 = 2, y2 = −1}
{x1 = 1, x2 = 1, x3 = 0} → {y1 = −1, y2 = 3}
X W = Y (first column of X is the bias):
[1 1 1; 1 −1 1; 1 1 3; 1 1 0] [w_(1,1) w_(1,2); w_(2,1) w_(2,2); w_(3,1) w_(3,2)] = [1 0; 0 1; 2 −1; −1 3]
This set of linear equations has NO exact solution.
X^T X is invertible, so the least-squares approximation is
Ŵ = X† Y = (X^T X)^(−1) X^T Y = [4 2 5; 2 4 3; 5 3 11]^(−1) [1 1 1 1; 1 −1 1 1; 1 1 3 0] [1 0; 0 1; 2 −1; −1 3] = [−0.75 2.25; 0.1786 0.0357; 0.9286 −1.2143]
Linear Regression of multiple outputs
Example 3 – Prediction:
Test set: two new samples
{x1 = 1, x2 = 6, x3 = 8} → {y1 = ?, y2 = ?}
{x1 = 1, x2 = 0, x3 = −1} → {y1 = ?, y2 = ?}
Ŷ = X_new Ŵ = [1 6 8; 1 0 −1] [−0.75 2.25; 0.1786 0.0357; 0.9286 −1.2143] = [7.75 −7.25; −1.6786 3.4643]
(Python demo 2)
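A minimal NumPy sketch of Example 3: the same least-squares formula applies, with Y holding one column per output:

import numpy as np

X = np.array([[1,  1, 1],
              [1, -1, 1],
              [1,  1, 3],
              [1,  1, 0]], dtype=float)   # first column is the bias
Y = np.array([[ 1,  0],
              [ 0,  1],
              [ 2, -1],
              [-1,  3]], dtype=float)      # two outputs per sample

W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # (d+1) x h matrix of parameters
print(W_hat)

X_new = np.array([[1, 6,  8],
                  [1, 0, -1]], dtype=float)
print(X_new @ W_hat)   # approximately [[7.75 -7.25] [-1.6786 3.4643]]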
Linear Regression of multiple outputs
Example 4
The values of feature x and their corresponding values of the multiple-output target y are shown in the table below.
Ŵ = X† Y = (X^T X)^(−1) X^T Y = [1.9 3.6; −0.4667 0.5]
Ŷ_new = X_new Ŵ = [1 2] Ŵ = [0.9667 4.6]   (prediction)
(Python demo 3)
*Predictions are reliable only when the test input is close to, or within, the observed range.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Dot-product, matrix inverse
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
• Inner product, linear/affine functions
• Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
• Objective function, loss function
• Least square solution, training/learning and testing/prediction
• Linear regression with multiple outputs
Learning/training: ŵ = (X_train^T X_train)^(−1) X_train^T y_train
Prediction/testing: y_test = X_test ŵ
• Classification
• Ridge Regression
• Polynomial Regression
Python packages: numpy, pandas, matplotlib.pyplot, numpy.linalg, sklearn.metrics (for mean_squared_error), numpy.linalg.pinv
EE2211 Introduction to Machine Learning
Lecture 6
Semester 1
2021/2022
Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)
Ridge Regression & Polynomial
Regression
Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression
Review: Linear Regression
Learning of Scalar Function (Single Output)
For one sample: a linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 scalar function
For m samples: 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
y = [x_1^T w; ⁞; x_m^T w],  where x_i^T = [1, x_(i,1), …, x_(i,d)]
X = [1 x_(1,1) … x_(1,d); ⁞ ⁞ ⋱ ⁞; 1 x_(m,1) … x_(m,d)],  w = [b; w_1; ⁞; w_d],  y = [y_1; ⁞; y_m]
Objective: Σ_{i=1}^{m} (f_w(x_i) − y_i)^2 = e^T e = (Xw − y)^T (Xw − y)
Learning/training (when X^T X is invertible), least-squares solution: ŵ = (X^T X)^(−1) X^T y
Prediction/testing: y_new = f_w(X_new) = X_new ŵ
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)
𝐅𝐰 𝐗 = 𝐗𝐖 = 𝐘
X = [x_1^T; ⁞; x_m^T] = [1 x_(1,1) … x_(1,d); ⁞ ⁞ ⋱ ⁞; 1 x_(m,1) … x_(m,d)]
W = [w_(0,1) … w_(0,h); w_(1,1) … w_(1,h); ⁞ ⋱ ⁞; w_(d,1) … w_(d,h)]
Y = [y_(1,1) … y_(1,h); ⁞ ⋱ ⁞; y_(m,1) … y_(m,h)]   (each row is one sample’s outputs)
Least Squares Regression: if X^T X is invertible, then
Learning/training: Ŵ = (X^T X)^(−1) X^T Y
Prediction/testing: F_W(X_new) = X_new Ŵ
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)
Linear Regression (for classification)
[Figure: the sign function – sign(a) is +1 for positive a and −1 for negative a.]
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)
Linear Regression (for classification)
Example 1.  Training set {(x_i, y_i)}_{i=1}^m:
{x = −9} → {y = −1},  {x = −7} → {y = −1},  {x = −5} → {y = −1},
{x = 1} → {y = +1},   {x = 5} → {y = +1},   {x = 9} → {y = +1}
X w = y (first column of X is the bias):
[1 −9; 1 −7; 1 −5; 1 1; 1 5; 1 9] [w_0; w_1] = [−1; −1; −1; 1; 1; 1]
This set of linear equations has NO exact solution.
X^T X is invertible, so the least-squares approximation is
ŵ = X† y = (X^T X)^(−1) X^T y = [6 −6; −6 262]^(−1) [1 1 1 1 1 1; −9 −7 −5 1 5 9] [−1; −1; −1; 1; 1; 1] = [0.1406; 0.1406]
Linear Regression (for classification)
Example 1 (cont’d):
ŷ = sign(X ŵ) = sign(X [0.1406; 0.1406]),  i.e., y' = 0.1406 + 0.1406x
Prediction:
Test set: {x = −2} → {y = ?}
ŷ_new = f̂_w^c(x_new) = sign(x_new^T ŵ) = sign([1 −2] [0.1406; 0.1406]) = sign(−0.1406) = −1
[Figure: linear regression for one-dimensional classification.]
(Python demo 1)
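A minimal NumPy sketch of Example 1: fit by least squares on the ±1 labels and classify a new point by the sign of the prediction:

import numpy as np

x = np.array([-9, -7, -5, 1, 5, 9], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)   # class labels encoded as -1 / +1

X = np.column_stack([np.ones_like(x), x])          # bias column + feature
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)                                       # approximately [0.1406 0.1406]

# Classify a new point x = -2 by the sign of the linear prediction
x_new = np.array([1.0, -2.0])
print(np.sign(x_new @ w_hat))                      # -1.0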
Linear Regression (for classification)
Linear Methods for Classification
Multi-Category Classification:
If X^T X is invertible, then
Learning: Ŵ = (X^T X)^(−1) X^T Y,  Y ∈ R^(m×C)
Prediction: f̂_w^c(x_new) = arg max_{k=1,…,C} x_new^T Ŵ[:, k],  for each x_new of X_new
Linear Regression (for classification)
Example 2.  Three-class classification. Training set {(x_i, y_i)}_{i=1}^m:
{x1 = 1, x2 = 1} → {y1 = 1, y2 = 0, y3 = 0}    Class 1
{x1 = −1, x2 = 1} → {y1 = 0, y2 = 1, y3 = 0}   Class 2
{x1 = 1, x2 = 3} → {y1 = 1, y2 = 0, y3 = 0}    Class 1
{x1 = 1, x2 = 0} → {y1 = 0, y2 = 0, y3 = 1}    Class 3
X W = Y (first column of X is the bias):
[1 1 1; 1 −1 1; 1 1 3; 1 1 0] W = [1 0 0; 0 1 0; 1 0 0; 0 0 1]
This set of linear equations has NO exact solution. X^T X is invertible, so the least-squares approximation is
Ŵ = X† Y = (X^T X)^(−1) X^T Y = [4 2 5; 2 4 3; 5 3 11]^(−1) [1 1 1 1; 1 −1 1 1; 1 1 3 0] [1 0 0; 0 1 0; 1 0 0; 0 0 1] = [0 0.5 0.5; 0.2857 −0.5 0.2143; 0.2857 0 −0.2857]
Linear Regression (for classification)
Example 2 – Prediction:
Test set X_new:
{x1 = 6, x2 = 8} → {class 1, 2, or 3?}
{x1 = 0, x2 = −1} → {class 1, 2, or 3?}
Ŷ = X_new Ŵ = [1 6 8; 1 0 −1] [0 0.5 0.5; 0.2857 −0.5 0.2143; 0.2857 0 −0.2857]
Category prediction: for each row of Ŷ, the column position of the largest number (across all columns of that row) determines the class label.
E.g. in the first row, the maximum number is 4, which is in column 1; therefore the predicted class is 1. For the second row the maximum lies in column 3, i.e., Class 3.
(Python demo 2)
Ridge Regression: the penalty term ensures that the matrix to be inverted (X^T X plus the regularizer) is invertible
Ridge Regression
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3)
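A minimal sketch of ridge regression, assuming the standard formulation (squared error plus a penalty λ‖w‖²): the primal solution ŵ = (X^T X + λI)^(−1) X^T y and the equivalent dual form ŵ = X^T (X X^T + λI)^(−1) y; the data and λ below are illustrative.

import numpy as np

# Same data as the earlier over-determined example; lam is the regularization strength
X = np.array([[1, 1], [1, -1], [1, 0]], dtype=float)
y = np.array([1, 0, 2], dtype=float)
lam = 0.1

m, d = X.shape

# Primal form: invert a d x d matrix
w_primal = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y

# Dual form: invert an m x m matrix; gives the same solution
w_dual = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m)) @ y

print(w_primal)
print(np.allclose(w_primal, w_dual))   # True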
Polynomial Regression
Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g. when the input dimension is d=2,
a polynomial function of degree = 2 is:
f_w(x) = w_0 + w_1 x_1 + w_2 x_2 + w_12 x_1 x_2 + w_11 x_1^2 + w_22 x_2^2.
XOR problem: f_w(x) = x_1 x_2
Polynomial Regression
Polynomial Expansion
• The linear model f_w(x) = x^T w can be written as
f_w(x) = x^T w = Σ_{i=0}^{d} x_i w_i,  with x_0 = 1
       = w_0 + Σ_{i=1}^{d} x_i w_i.
Polynomial Regression
Generalized Linear Discriminant Function
• In general:
f_w(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} w_ijk x_i x_j x_k + ⋯
Notes:
• For high dimensional input features (large d value) and high polynomial order, the
number of polynomial terms becomes explosive! (i.e., grows exponentially)
• For high dimensional problems, polynomials of order larger than 3 are seldom used.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5) online
Polynomial Regression
Generalized Linear Discriminant Function
P w = [p_1^T w; ⁞; p_m^T w]
where p_l^T w = [1, x_(l,1), …, x_(l,d), …, x_(l,i) x_(l,j), …, x_(l,i) x_(l,j) x_(l,k), …] [w_0; w_1; ⁞; w_d; ⁞; w_ij; ⁞; w_ijk; ⁞]
l = 1, …, m; d denotes the dimension of the input features; m denotes the number of samples
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)
Example 3.  Training set {(x_i, y_i)}_{i=1}^m (the XOR problem):
{x1 = 0, x2 = 0} → {y = −1}
{x1 = 1, x2 = 1} → {y = −1}
{x1 = 1, x2 = 0} → {y = +1}
{x1 = 0, x2 = 1} → {y = +1}
2nd-order polynomial model
Note: replace X with P (with reference to slides 15/16); m and d now refer to the size of P (not X).
Polynomial Regression
Summary
Example 3 (cont’d).  Training set (as above):
{x1 = 0, x2 = 0} → {y = −1},  {x1 = 1, x2 = 1} → {y = −1},  {x1 = 1, x2 = 0} → {y = +1},  {x1 = 0, x2 = 1} → {y = +1}
2nd-order polynomial model, with each row p_i^T = [1, x_(i,1), x_(i,2), x_(i,1) x_(i,2), x_(i,1)^2, x_(i,2)^2]:
P = [1 0 0 0 0 0; 1 1 1 1 1 1; 1 1 0 0 1 0; 1 0 1 0 0 1]
Since P is wide (m = 4 < 6 = d), use the right-inverse:
ŵ = P^T (P P^T)^(−1) y = P^T [1 1 1 1; 1 6 3 3; 1 3 3 1; 1 3 1 3]^(−1) [−1; −1; +1; +1] = [−1; 1; 1; −4; 1; 1]
(Python demo 3)
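A minimal NumPy sketch of Example 3: build the 2nd-order polynomial features, solve the under-determined system with the right-inverse, and classify the four test points used on the next slide:

import numpy as np

def poly2_features(X):
    # Map [x1, x2] to [1, x1, x2, x1*x2, x1^2, x2^2] for each row
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

X_train = np.array([[0, 0], [1, 1], [1, 0], [0, 1]], dtype=float)
y_train = np.array([-1, -1, 1, 1], dtype=float)

P = poly2_features(X_train)                      # 4 x 6: an under-determined system
w_hat = P.T @ np.linalg.inv(P @ P.T) @ y_train   # right-inverse solution
print(w_hat)                                     # [-1.  1.  1. -4.  1.  1.]

X_test = np.array([[0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]])
print(np.sign(poly2_features(X_test) @ w_hat))   # [-1. -1.  1.  1.]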
Example 3 (cont’d) – Prediction.
Test set:
Test point 1: {x1 = 0.1, x2 = 0.1} → {y = class −1 or +1?}
Test point 2: {x1 = 0.9, x2 = 0.9} → {y = class −1 or +1?}
Test point 3: {x1 = 0.1, x2 = 0.9} → {y = class −1 or +1?}
Test point 4: {x1 = 0.9, x2 = 0.1} → {y = class −1 or +1?}
With rows [1, x1, x2, x1 x2, x1^2, x2^2]:
ŷ = P_new ŵ = [1 0.1 0.1 0.01 0.01 0.01; 1 0.9 0.9 0.81 0.81 0.81; 1 0.1 0.9 0.09 0.01 0.81; 1 0.9 0.1 0.09 0.81 0.01] [−1; 1; 1; −4; 1; 1] = [−0.82; −0.82; 0.46; 0.46]
f̂_w^c(P(X_new)) = sign(ŷ) = [−1; −1; +1; +1],  i.e., Class −1, Class −1, Class +1, Class +1
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations: f_w(X) = Xw = y
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored function, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary