
EE2211 Introduction to Machine

Learning
Lecture 1
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Overview
• EE2211 is organized into 12 lectures, 12 tutorials, 3 assignments (40%
marks), 1 Quiz (30% marks) at mid-term, and 1 final Exam (30% marks).

• Tutorials 1+1 hours

• Lectures, tutorials (programming and coursework), quiz (mid-term) and final


exam are all conducted online. Videos of lectures are made available after
lectures.

• Important dates
– A1 will be released in Week 4 on Monday and submitted in 3 weeks
– A2 will be released in Week 6 on Monday and submitted in 4 weeks
– A3 will be released in Week 10 on Monday and submitted in 4 weeks
– Quiz (mid-term) will be on Week 8 lecture time (tentative).

EE2211 Introduction to Machine Learning L1.2


© Copyright EE, NUS. All Rights Reserved.
Reading List
Main textbooks: Book1 (text) and Book2 (python)
Supplementary reading: Book3

References
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”,
2019. (read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc.,
2017.
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for
people who want to analyze data”, Lean Publishing, 2015.

EE2211 Introduction to Machine Learning 3


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data engineering
– Introduction to probability and statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, linear regression
– Ridge regression, polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance Trade-off
– Optimization, gradient descent
– Decision trees, random forest
• Performance and More Algorithms (Haizhou, Xinchao)
– Performance issues
– K-means clustering
– Neural networks

EE2211 Introduction to Machine Learning 4


© Copyright EE, NUS. All Rights Reserved.
Introduction
Module I Content
• What is machine learning and types of learning
• How supervised learning works
• Regression and classification tasks
• Induction versus deduction reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

5
© Copyright EE, NUS. All Rights Reserved.
2001: A Space Odyssey

HAL listens, talks, sings, reads lips, plays chess, and solves problems !

© Copyright EE, NUS. All Rights Reserved.


AlphaGo

https://www.bbc.com/news/technology-35785875

https://www.businessinsider.com/googles-alphago-made-
artifical-intelligence-history-2016-3

EE2211 Introduction to Machine Learning 7


© Copyright EE, NUS. All Rights Reserved.
Shannon Game

the frog => the frog jumped


frog jumped => frog jumped into
jumped into => jumped into the
into the => into the pond

The Bit Player (2018) tells the story of an overlooked genius, Claude Shannon (the "Father of Information Theory").
C. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948.

8
© Copyright EE, NUS. All Rights Reserved.
What is Machine Learning?
• Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. – Andriy Burkov
• These examples can come from nature, be handcrafted by humans or generated by another algorithm.

Ref: Book1, chp1 , p3

EE2211 Introduction to Machine Learning 9


© Copyright EE, NUS. All Rights Reserved.
Early Definitions
• Arthur Samuel (1959): A field of study that gives
computers the ability to learn without being explicitly
programmed.

• Tom Mitchell (1998): A computer program is said to learn


from experience E with respect to some task T and some
performance measure P, if its performance on T, as
measured by P, improves with experience E.

How do we create computer programs that improve with experience?"


Tom Mitchell
http://videolectures.net/mlas06_mitchell_itm/

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Example: 2 or 7?

Experience (data) → process/learn from experience → write a programme to perform the Task → measure the result of the task → Performance (results, accuracy)

Image credit: Geoffrey Hinton and CIS 419/519, Eric Eaton

Task: Digit recognition


Performance: Classification accuracy
Experience: Labelled images
EE2211 Introduction to Machine Learning 11
© Copyright EE, NUS. All Rights Reserved.
Application Examples
• Speech recognition
• Face recognition
• Handwriting recognition
• Object recognition
• Housing price prediction
• Etc, etc.

EE2211 Introduction to Machine Learning 12


© Copyright EE, NUS. All Rights Reserved.
Types of Learning

• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning

Supervised Learning (x → y): discrete output y → Classification; continuous output y → Regression.
Unsupervised Learning: discrete structure → Clustering; continuous structure → Dimensionality Reduction.
Ref: Book1 and https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

EE2211 Introduction to Machine Learning 13


© Copyright EE, NUS. All Rights Reserved.
Supervised Learning

Training: labelled examples (Apple, Orange) are used to learn a Model.
Testing: the model predicts the label of a new example ("This is an orange").

EE2211 Introduction to Machine Learning 14


© Copyright EE, NUS. All Rights Reserved.
Supervised Learning
• In supervised learning, the dataset is the collection of labeled examples {(xᵢ, yᵢ)}, i = 1, …, N, where

   xᵢ = [x₁⁽ⁱ⁾; ⁞; x_D⁽ⁱ⁾]   or   xᵢ = [x₁, …, xⱼ, …, x_D]ᵀ,   i = 1, . . . , N

– Each element xᵢ among the N is called a feature vector.
• A feature vector is a vector in which each dimension j = 1, . . . , D contains a value that describes the example somehow.
– The label yᵢ can be either an element belonging to a finite set of classes {1, 2, . . . , C}, or a real number.
– Expensive in terms of time and resources
EE2211 Introduction to Machine Learning 15
© Copyright EE, NUS. All Rights Reserved.
• For instance, if your examples are email messages and your
problem is spam detection, then you have two classes {spam,
not-spam}.
• Classification: predict discrete valued output (e.g., 1 or 0)

1-Dimensional Case: x is a single feature (e.g., repeated word count) and y is the class label ("0" = not-spam, "1" = spam); a decision line (threshold) on x separates the two classes.

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Classification: Breast Cancer (malignant, benign)

2-Dimensional Case: x₁ = tumor size, x₂ = age; each example is labelled malignant (harmful) or benign (not harmful).

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Regression: predict continuous valued output
(e.g., house price prediction)

(price)
y

x
(size in square meters)

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
How Supervised Learning Works

Training: given data and known labels {xᵢ, yᵢ}ᵢ₌₁ᴹ, the goal is to learn the model's parameters.
Test: given novel data {x_k}ₖ₌₁ᴺ and the learned parameters, the goal is to predict the labels {y_k}ₖ₌₁ᴺ.

Slides courtesy: Professor Robby Tan
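A minimal sketch of this train/then-predict workflow (the toy fruit data and the choice of classifier are illustrative assumptions, not part of the lecture):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Training set: M labelled examples {(x_i, y_i)}; features = [weight (g), redness 0-1]
X_train = np.array([[150, 0.9], [170, 0.8], [140, 0.3], [130, 0.2]])
y_train = np.array([1, 1, 0, 0])              # known labels: 1 = apple, 0 = orange

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # learn the model's parameters

# Test set: N novel examples {x_k}; predict labels with the learned parameters
X_test = np.array([[160, 0.85], [135, 0.25]])
print(model.predict(X_test))                  # e.g. [1 0]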

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Example

source: SUTD

Task: Number of new cases prediction


Performance: Prediction accuracy
Experience: Historical data

EE2211 Introduction to Machine Learning L1.20


© Copyright EE, NUS. All Rights Reserved.
Example

* Ref: Book1, sec1.3, pp.5-7. EE2211 doesn't discuss constraint optimization in detail.
EE2211 Introduction to Machine Learning 21
© Copyright EE, NUS. All Rights Reserved.
Unsupervised Learning
• In unsupervised learning, the dataset is a collection of
unlabeled examples

• Again, x is a feature vector, and the goal of an


unsupervised learning algorithm is to create a model
that takes a feature vector x as input and either
transforms it into another vector or into a value that can
be used to solve a practical problem.

Ref: Book1, sec1.2.2

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Unsupervised Learning

Training

I found two
types of fruits!

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Supervised Learning

(age)
x2 2-Dimensional Case
Malignant (harmful)
Benign (not harmful)

x1 (tumor size)

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning

(age)
x2 2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 25


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning

(age) Find the distribution of data


x2 2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning: Clustering

(age) Discover the underlying structures


x2 of data
2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 27


© Copyright EE, NUS. All Rights Reserved.
Example: Social Network

https://en.wikipedia.org/wiki/Social_network_analysis#/media/File:Kencf0618FacebookNetwork.jpg

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
Semi-Supervised Learning

Supervised Learning uses labelled data; Semi-supervised Learning uses labelled + unlabelled data (typically plenty of unlabelled data); Unsupervised Learning uses unlabelled data. In each case the data feed a learning model.

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Reinforcement Learning

action

Environment
Agent
S1 S2

reward

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
Reinforcement Learning

• A policy is a function (similar to the model in supervised


learning) that takes the feature vector of a state as input
and outputs an optimal action to execute in that state.

• The action is optimal if it maximizes the expected


average reward.
Multiple trials

* EE2211 doesn’t discuss reinforcement learning in detail.

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Inductive vs. Deductive Reasoning
Main task of Machine Learning: to make inferences

Types of inference

Inductive:
• Reaches probable conclusions.
• All needed information is unavailable or unknown, causing uncertainty in the conclusions.
• This is the setting of statistical machine learning (as opposed to logic machine learning); probability and statistics are required.

Deductive:
• Reaches logical conclusions deterministically: all information that can lead to the correct conclusion is available.

Basic properties:
• 0 ≤ p(x) ≤ 1
• ∫ p(x) dx = 1 (continuous), ∑ₓ p(x) = 1 (discrete)

Basic rules:
• Product rule. Independent variables: p(a,b) = p(a)p(b). Dependent variables: p(a,b) = p(a|b) p(b) = p(b|a) p(a)
• Sum rule (marginalization). Dependent variables: p(a) = ∑_b p(a,b)
Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Inductive Reasoning
Note: humans use inductive reasoning all the time and not
in a formal way like using probability/statistics.

Ref: Gardener, Martin (March 1979). "MATHEMATICAL GAMES: On the


fabric of inductive logic, and some probability paradoxes" (PDF). Scientific
American. 234

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 2
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

EE2211 Introduction to Machine Learning 1


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou/Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

EE2211 Introduction to Machine Learning 2


© Copyright EE, NUS. All Rights Reserved.
Data Engineering

Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

3
© Copyright EE, NUS. All Rights Reserved.
Ask the right questions
if you are to find the right answers

EE2211 Introduction to Machine Learning 4


© Copyright EE, NUS. All Rights Reserved.
Is artificial intelligence racist?
Racial and gender bias in AI

Joy Buolamwini: — Mit Lab Press Kit

EE2211 Introduction to Machine Learning 5


© Copyright EE, NUS. All Rights Reserved.
Business Problem
Top-down: start from the problem or questions (what problem can we solve with the data?), apply domain knowledge, gather information (examples, test cases), and then identify the data (what data should be used to solve the problem?).
Bottom-up: start from the data, derive information (examples, test cases), apply domain knowledge, and arrive at the problem or questions.
EE2211 Introduction to Machine Learning 6
© Copyright EE, NUS. All Rights Reserved.
• The data may not contain the answer.
• The combination of some data and an aching desire for an answer
does not ensure that a reasonable answer can be extracted from a
given body of data. John Tukey

EE2211 Introduction to Machine Learning 7


© Copyright EE, NUS. All Rights Reserved.
Types of Data
• Continuous
• Ordinal
• Categorical
• Missing
• Censored

What is data? Numbers, statistics, text, figures, records, facts.

EE2211 Introduction to Machine Learning 8


© Copyright EE, NUS. All Rights Reserved.
Continuous, discrete variables
• Continuous variables are anything measured on a
quantitative scale that could be any fractional number.
– An example would be something like weight measured in kg.
• Discrete variables are numeric variables that have a
countable number of values between any two values.
Temperature

Time
EE2211 Introduction to Machine Learning 9
© Copyright EE, NUS. All Rights Reserved.
Ordinal data
• Ordinal data are data that have a fixed, small number
(< 100) of possible values, called levels, that are
ordered (or ranked).

– Example: survey responses

Excellent Good Average Fair Poor

Ref: Book3, chapter 4.1.

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Categorical data
• Categorical data are data where there are multiple
categories, but they are not ordered.
– Examples: gender, blood type, name of fruits and their production
regions etc.

EE2211 Introduction to Machine Learning 11


© Copyright EE, NUS. All Rights Reserved.
Missing data
• Missing data are data that are missing and you do not
know the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

NUS student Age Country of birth


Olivia Tan 20 Singapore
Hendra Setiawan 19 Indonesia
John Smith 19 NA

EE2211 Introduction to Machine Learning 12


© Copyright EE, NUS. All Rights Reserved.
Censored data
• Censored data are data where you know the missing
mechanism on some level.
– Common examples are a measurement being below a detection
limit or a patient being lost to follow-up.
– They should also be coded as NA when you don’t have the data.

NUS student Age Gender


Olivia Tan 20 F
Hendra Setiawan 19 M
Ah Beng NA M

EE2211 Introduction to Machine Learning 13


© Copyright EE, NUS. All Rights Reserved.
Data is divided into Categorical/Qualitative (Nominal, Ordinal) and Numerical/Quantitative (Discrete, Continuous) types.
• Nominal: no order (e.g., gender, religion)
• Ordinal: can be ordered (e.g., small/medium/large)
• Interval: no natural zero (e.g., temperature in Celsius)
• Ratio: includes a natural zero (e.g., temperature in Kelvin)
https://i.stack.imgur.com/J8Ged.jpg

EE2211 Introduction to Machine Learning 14


© Copyright EE, NUS. All Rights Reserved.
From computational viewpoint
Nominal Ordinal Interval Ratio

Frequency distribution Yes Yes Yes Yes

Median and percentiles No Yes Yes Yes

Add or subtract No No Yes Yes

Mean, standard deviation No No Yes Yes

Ratio No No No Yes

EE2211 Introduction to Machine Learning 15


© Copyright EE, NUS. All Rights Reserved.
Data Wrangling
• Data wrangling (cleaning + transform) is the process of
transforming and mapping data from one "raw" data
form into another format to make it more appropriate for
downstream analytics. (https://en.wikipedia.org/wiki/Data_wrangling)
• e.g. scaling, clipping, z-score (see next page)
• Data wrangling should not be performed blindly: we should know the reason for wrangling and why it is needed.

Pipeline: Import → Clean → Transform → Model / Visualize → Output

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Example

• Scaling to a range
– When the bounds or range of each independent dimension of
data is known, a common normalization technique is min-max.
advantage: Ensures standard scale (make sure all between 0 - 1)
• Feature clipping

https://developers.google.com/machine-learning/data-prep/transform/normalization

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Example
• Z-score standardization
– When the population of measurements of each independent
dimension of data is normally distributed where the parameters
are known, the standard score or z-score is a popular choice.
Z-score makes the mean = 0 and the standard deviation = 1.
Advantage: handles outliers well. Disadvantage: does not produce features with the same exact scale.

https://developers.google.com/machine-learning/data-prep/transform/normalization
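A small sketch (assumed, not from the slides) of the two normalizations above, applied column-wise to a feature matrix with plain NumPy:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 900.0]])

# Scaling to a range (min-max): every feature ends up in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: every feature gets mean 0 and standard deviation 1
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))   # approximately [0 0] and [1 1]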

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
Example (continuous vs discrete?)

Iris data set


• Measurement features
can be packed as

• Labels can be written as

https://archive.ics.uci.edu/ml/datasets/iris

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Example (ordinal)
• For data of ordinal scale, the exact numerical value
has no significance over the ranking order that it
carries.
• Suppose the ranks are given by r = 1, ..., R. Then the ranks can be normalized into standardized distance values d which fall within [0, 1], for example using d = (r − 1)/(R − 1).
Order of data matters (ranking)

Excellent Good Average Fair Poor

r = 5 4 3 2 1

EE2211 Introduction to Machine Learning 20


© Copyright EE, NUS. All Rights Reserved.
Example (categorical)
• Assign arbitrary numbers to represent the attributes.

– For example, one can assign a ‘1’ value for male and a ‘2’ value for
female for the case of gender attribute; ‘1’ value for spam and ‘0’
value for non-spam as in Lecture 1.

– However, when extremely large and extremely small values are involved in the computation, the label assigned a higher value may have a greater influence than the one assigned a lower value.

Order of data does not matter

EE2211 Introduction to Machine Learning 21


© Copyright EE, NUS. All Rights Reserved.
Example (categorical)
• Binary coding
– Common examples of binary coding schemes include the binary-
coded decimal (e.g., one-hot encoding), n-ary Gray codes.
– Sophisticated coding schemes take into account the probability
distribution of each attributes during conversion.
– One-hot encoding:
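A minimal sketch (assumed for illustration) of one-hot encoding a categorical attribute with NumPy; libraries such as pandas (get_dummies) or scikit-learn (OneHotEncoder) provide the same idea ready-made:

import numpy as np

colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))                  # ['blue', 'green', 'red']

one_hot = np.zeros((len(colours), len(categories)))
for row, value in enumerate(colours):
    one_hot[row, categories.index(value)] = 1      # exactly one 1 per row, rest 0

print(categories)
print(one_hot)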

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Data Cleaning
Data cleansing or data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate records from a
record set, table, or database.

Example (missing features)


• Dealing with missing features
– Removing the examples with missing features from the dataset
(that can be done if your dataset is big enough so you can sacrifice
some training examples);
– Using a learning algorithm that can deal with missing feature
values (depends on the library and a specific implementation of the
algorithm);
– Using a data imputation technique.

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Example (missing features)

Students Year of Gender Height GPA


Birth
Tan Ah Kow 1995 M 1.72 4.2
Ahmad Abdul X M 1.65 4.1
John Smith 1995 M 1.75 X
Chen Lulu 1995 F X 4.0
Raj Kumar 1995 M 1.73 4.5
Li Xiuxiu 1994 F 1.70 3.8

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Example (missing features)
• Imputation
– Replace the missing value of a feature by an average value of this
feature in the dataset:

– Replace the missing value with a value outside the normal range of
values.
• For example, if the normal range is [0, 1], then you can set the missing
value to 2 or −1. The idea is that the learning algorithm will learn what is
best to do when the feature has a value significantly different from
regular values.
• Alternatively, you can replace the missing value by a value in the middle
of the range. For example, if the range for a feature is [−1, 1], you can
set the missing value to be equal to 0. Here, the idea is that the value in
the middle of the range will not significantly affect the prediction.
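A small sketch (assumed, not from the slides) of the first imputation option, where missing entries coded as NaN are replaced by the column average:

import numpy as np

height = np.array([1.72, 1.65, 1.75, np.nan, 1.73, 1.70])   # one missing height

mean_height = np.nanmean(height)          # average over the observed values only
height_imputed = np.where(np.isnan(height), mean_height, height)

print(round(mean_height, 3), height_imputed)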

EE2211 Introduction to Machine Learning 25


© Copyright EE, NUS. All Rights Reserved.
Data integrity

• Data integrity is the maintenance of, and the assurance


of the accuracy and consistency of, data over its
entire life-cycle.
(Ref: https://en.wikipedia.org/wiki/Data_integrity)

• It is a critical aspect to the design, implementation and


usage of any system which stores, processes, or
retrieves data.
– Physical integrity (error-correction codes, check-sum, redundancy)
– Logical integrity (product price is a positive, use drop-down list)

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Data Integrity

http://www.financetwitter.com/
EE2211 Introduction to Machine Learning 27
© Copyright EE, NUS. All Rights Reserved.
Data Visualization

• https://towardsdatascience.com/data-visualization-with-mathplotlib-using-python-a7bfb4628ee3

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
Example: Showing Distribution

(a) a probability mass function, (b) a probability density function

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Caution: Same Statistics But Different Distribution
Anscombe's quartet (wiki)
You can explore data by
calculating summary
statistics, for example the
correlation between
variables.

However, all of these data sets have the exact same correlation and regression line.

By Anscombe.svg: Schutz Derivative works of this file:(label using subscripts): Avenue - Anscombe.svg, CC BY-SA
3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

• Data sets with identical means, variances and regression lines

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
Example: Boxplot
Boxplots alone are not good for showing the distribution of the data.

Ref: Michael Galarnyk, Understanding Boxplots, 2018

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Caution: Same Boxplots But Different Distribution

• These boxplots look very similar, but if you overlay the actual data points
you can see that they have very different distributions.

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Example: Showing Composition/ Comparison

• If we make this pie chart as a bar chart it is much easier to see that A is bigger than D

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
Example: Log Scale

• Without a logarithmic scale, 90% of the data are squeezed into the lower left-hand corner of this figure.

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 35


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 3
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)
20 August 2021

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou, Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

EE2211 Introduction to Machine Learning 2


© Copyright EE, NUS. All Rights Reserved.
Introduction to Probability and
Statistics
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule

3
© Copyright EE, NUS. All Rights Reserved.
Introduction to Linear Algebra
• Use of vector and matrix notation, especially with
multivariate statistics.

• Solutions to least squares and weighted least squares,


such as for linear regression.

• Estimates of mean and variance of data matrices.

• Principal component analysis for data reduction that


draws many of these elements together

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

4
© Copyright EE, NUS. All Rights Reserved.
Introduction to Linear Algebra
• A scalar is a simple numerical value, like 15 or −3.25
– Focus on real numbers
• Variables or constants that take scalar values are
denoted by an italic letter, like x or a

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

5
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• A vector is an ordered list of scalar values, called
attributes
– Denoted by a bold character, e.g. x or a
• In many books, vectors are written column-wise:
   a = [2, 3]ᵀ,  b = [−2, 5]ᵀ,  c = [1, 0]ᵀ
• The three vectors above are two-dimensional (or have two
elements)

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

6
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• We denote an attribute of a vector as an italic value with an index, e.g. a⁽ʲ⁾ or x⁽ʲ⁾
– The index j denotes a specific dimension of the vector, the position of an attribute in the list
   a = [a⁽¹⁾, a⁽²⁾]ᵀ = [2, 3]ᵀ, or more commonly a = [a₁, a₂]ᵀ = [2, 3]ᵀ

Note:
• The notation x⁽ʲ⁾ should not be confused with the power operator, such as the 2 in x² (squared) or 3 in x³ (cubed)
• The square of an indexed attribute of a vector is denoted by (x⁽ʲ⁾)²
• A variable can have two or more indices, like x_i⁽ʲ⁾ or x_(i,j)⁽ᵏ⁾
• For example, in neural networks, we denote by x_(l,u)⁽ʲ⁾ the input feature j of unit u in layer l
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

7
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
Vectors can be visualized as arrows that point to some directions as well
as points in a multi-dimensional space

Illustrations of three two-dimensional vectors, a = [2, 3]ᵀ, b = [−2, 5]ᵀ, and c = [1, 0]ᵀ

Figure 1: Three vectors visualized as directions and as points.


Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

8
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• A matrix is a rectangular array of numbers arranged in rows and columns
– Denoted with bold capital letters, such as X or W
– An example of a matrix with two rows and three columns:
   X = [2 4 −3; 21 −6 −1]
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character, e.g., 𝒮
– When an element x belongs to a set 𝒮, we write x ∈ 𝒮
– A special set denoted ℛ includes all real numbers from minus infinity to plus infinity
Note:
• For the elements in matrix X, we shall use the indexing x₁,₁ where the first and the second indices respectively indicate the row and the column positions, or x₁⁽¹⁾.
• Usually, for input data, rows represent samples and columns represent features
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

9
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices: Example
Iris data set
• Measurement features
can be packed as
X ∈ ℛ^(150×4)

• Labels can be written as

https://archive.ics.uci.edu/ml/datasets/iris

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices

• Capital Sigma: the summation over a collection X = {x₁, x₂, x₃, x₄, . . . , x_m} is denoted by:
   ∑ᵢ₌₁ᵐ xᵢ = x₁ + x₂ + … + x_(m−1) + x_m

• Capital Pi: the product over a collection X = {x₁, x₂, x₃, x₄, . . . , x_m} is denoted by:
   ∏ᵢ₌₁ᵐ xᵢ = x₁ · x₂ · … · x_(m−1) · x_m

Note:
• Capital Sigma and Pi can be applied to the attributes of a vector x

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

11
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Linear dependence and independence

• A collection of d-vectors x₁, …, x_m (with m ≥ 1) is called linearly dependent if
   β₁x₁ + ⋯ + β_m x_m = 0   (a scalar multiple)
holds for some β₁, …, β_m that are not all zero.

• A collection of d-vectors x₁, …, x_m (with m ≥ 1) is called linearly independent if it is not linearly dependent, which means that
   β₁x₁ + ⋯ + β_m x_m = 0
only holds for β₁ = ⋯ = β_m = 0.

Note: If all rows or columns of a square matrix X are linearly independent, then X is invertible.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to
Applied Linear Algebra”, Cambridge University Press, 2018 (Chp5 & 11.1)

12
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Geometry of dependency and independency: in the 2-D panel, a and b point along the same line, so β₁a + β₂b = 0 for some nonzero β₁, β₂ (linearly dependent); in the 3-D panel, β₁a + β₂b ≠ β₃c, so a, b and c are linearly independent.

13
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector notation:
   Xw = y
where
   X = [x₁,₁ x₁,₂ … x₁,d; ⁞ ⁞ ⋱ ⁞; x_m,₁ x_m,₂ … x_m,d],   w = [w₁; ⁞; w_d],   y = [y₁; ⁞; y_m].

Note:
• The data matrix X ∈ ℛ^(m×d) and the target vector y ∈ ℛ^m are given
• The unknown vector of parameters w ∈ ℛ^d is to be learnt
• rank(X) corresponds to the maximal number of linearly independent columns/rows of X.

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, "Introduction to Applied Linear Algebra", Cambridge University Press, 2018 (Chp8.3)

14
© Copyright EE, NUS. All Rights Reserved.
Exercises

1. What is the rank of [1 2; 2 1]?

2. What is the rank of [1 −2 3; 0 −3 3; 1 1 0]?
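One way (assumed, not part of the slides) to check the two exercises with NumPy:

import numpy as np

A = np.array([[1, 2],
              [2, 1]])
B = np.array([[1, -2, 3],
              [0, -3, 3],
              [1,  1, 0]])

print(np.linalg.matrix_rank(A))   # 2: det(A) = -3, so both rows are independent
print(np.linalg.matrix_rank(B))   # 2: row 3 = row 1 - row 2, so only two independent rows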

EE2211 Introduction to Machine Learning 15


© Copyright EE, NUS. All Rights Reserved.
Causality
What is statistical causality or causation?
• In statistics, causation means that one thing will cause the
other, which is why it is also referred to as cause and effect.
• The gold standard for causal data analysis is to combine
specific experimental designs such as randomized studies
with standard statistical analysis techniques.
• Example:
– A Randomized Controlled Trial in medicine typically compares a proposed new
treatment against an existing standard of care (or a placebo) ; these are then termed
the 'experimental' and 'control' treatments respectively. This blinding principle is ideally
also extended as much as possible to other parties including researchers, technicians,
data analysts, and evaluators. Effective blinding experimentally isolates the
physiological effects of treatments from various psychological sources of bias.

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Correlation
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.
• Linear correlation coefficient, r, which is also known as
the Pearson Coefficient.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
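A quick sketch (assumed for illustration) of computing the Pearson coefficient r with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly linear in x

r = np.corrcoef(x, y)[0, 1]                      # off-diagonal entry of the 2x2 correlation matrix
print(r)                                         # close to +1: strong positive linear relationship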

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Correlation does not imply causation
Causation between two events means they are correlated

• Most data analyses involve inference or prediction.


• Unless a randomized study is performed, it is difficult to infer why
there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are
clearly not causally related appear at http://tylervigen.com/
(See figure below).

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
Correlation is a statistical relationship
– Decades of data show a clear causal relationship between smoking
and cancer.
– If you smoke, it is a sure thing that your risk of cancer will increase.
– But it is not a sure thing that you will get cancer.
– The causal effect is real, but it is an effect on your average risk.

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Caution

• Particular caution should be used when applying words


such as “cause” and “effect” when performing inferential
analysis.
• Causal language applied to even clearly labelled
inferential analyses may lead to misinterpretation - a
phenomenon called causation creep.

EE2211 Introduction to Machine Learning 20


© Copyright EE, NUS. All Rights Reserved.
Simpson’s paradox
Simpson's paradox is a phenomenon in probability and statistics, in
which a trend appears in several different groups of data but
disappears or reverses when these groups are combined.

https://en.wikipedia.org/wiki/Simpson%27s_paradox

EE2211 Introduction to Machine Learning 21


© Copyright EE, NUS. All Rights Reserved.
Example

Ref: Gardner, Martin (March 1979). "MATHEMATICAL GAMES: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American. 234

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Probability

• We describe a random experiment by describing its


procedure and observations of its outcomes.
• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment. This also means an outcome is not decomposable. All unique outcomes form a sample space.
• A subset of sample space 𝑆𝑆, denoted as 𝐴𝐴, is an event in
a random experiment 𝐴𝐴 ⊆ 𝑆𝑆, that is meaningful to the
application.

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Axioms of Probability

Assuming events A ⊆ S and B ⊆ S, the probabilities of events related with A and B must satisfy:

1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
   *otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random phenomenon.
– Examples of random phenomena with a numerical outcome include
a toss of a coin (0 for heads and 1 for tails), a roll of a dice, or the
height of the first stranger you meet outside.
• There are two types of random variables:
– discrete and continuous.

s
R
X(s)
EE2211 Introduction to Machine Learning 25
© Copyright EE, NUS. All Rights Reserved.
Notations
• Some books used P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.

• We shall use Pr(·) for both the above cases.

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Discrete random variable
• A discrete random variable (DRV) takes on only a
countable number of distinct values such as red, yellow,
blue or 1, 2, 3, . . ..
• The probability distribution of a discrete random variable is
described by a list of probabilities associated with each of its possible
values.
- This list of probabilities is called a
probability mass function (pmf).
(Like a histogram, except that here
the probabilities sum to 1)

Ref: Book 1, Chapter 2.2.


A probability mass function
EE2211 Introduction to Machine Learning 27
© Copyright EE, NUS. All Rights Reserved.
• Let a discrete random variable X have k possible values {xᵢ}ᵢ₌₁ᵏ.
• The expectation of X, denoted E[X], is given by
   E[X] ≝ ∑ᵢ₌₁ᵏ xᵢ · Pr(X = xᵢ)
        = x₁ · Pr(X = x₁) + x₂ · Pr(X = x₂) + ··· + x_k · Pr(X = x_k)
  where Pr(X = xᵢ) is the probability that X has the value xᵢ according to the pmf.
• The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter μ.
• The expectation is one of the most important statistics of a random variable.

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
• Another important statistic is the standard deviation, defined as
   σ ≝ √E[(X − μ)²].
• Variance, denoted as σ² or var(X), is defined as
   σ² = E[(X − μ)²]
• For a discrete random variable, the standard deviation is given by
   σ = √( Pr(X = x₁)(x₁ − μ)² + ⋯ + Pr(X = x_k)(x_k − μ)² )
where μ = E[X].
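A small sketch (assumed, the pmf values are made up) that evaluates E[X], var(X) and σ directly from the formulas above:

import numpy as np

x = np.array([1, 2, 3, 4])                 # possible values x_i
p = np.array([0.1, 0.2, 0.3, 0.4])         # pmf Pr(X = x_i), sums to 1

mu = np.sum(x * p)                         # E[X] = sum_i x_i * Pr(X = x_i)
var = np.sum(p * (x - mu) ** 2)            # E[(X - mu)^2]
sigma = np.sqrt(var)

print(mu, var, sigma)                      # 3.0, 1.0, 1.0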

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Continuous random variable
• A continuous random variable (CRV) takes an infinite
number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X
is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of the list of probabilities, the probability
distribution of a CRV (a continuous probability distribution) is
described by a probability density function (pdf).
– The pdf is a function whose codomain
is nonnegative and the area under the
curve is equal to 1.
*Point on graph does not indicate probability,
Probability = area under graph

CDF: point on curve represents probability

A probability density function

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
• The expectation of a continuous random variable X is given by
   E[X] ≝ ∫_ℛ x f_X(x) dx
  where f_X is the pdf of the variable X and ∫_ℛ is the integral of the function x f_X over ℛ.
• The variance of a continuous random variable X is given by
   σ² ≝ ∫_ℛ (x − μ)² f_X(x) dx

• The integral is the equivalent of the summation over all values of the function when the function has a continuous domain.
• It equals the area under the curve of the function.
• The property of the pdf that the area under its curve is 1 mathematically means that ∫_ℛ f_X(x) dx = 1

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Mean and Standard Deviation of a Gaussian
distribution
(Figure: a Gaussian distribution with mean μ; regions covering 90% and 95% of the probability are marked; axes x₁ and x₂.)

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Example 1
Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)?
Pr(x=H, y=H) = Pr(x=H)Pr(y=H)
= (1/2)(1/2) = 1/4

Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
Example 2
Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first having B and then R?
• The space of outcomes of taking two balls
sequentially without replacement:
B–R
R–B Thus having B-R is ½ .
Mathematically:
Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B)
= 1×(1/2) = 1/2
Since we are given the first pick was B, and thus we
know the probability of the remaining ball to be R is 1. Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
Example 3
Dependent random variables
• Given 3 balls with different colors (R,G,B), what is the
probability of first having B and then G?
• The space of outcomes of taking two balls
sequentially without replacement:
R–G|G–B|B–R
R – B | G – R | B – G Thus Pr(y=G, x=B) = 1/6 .
Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B)
= (1/2) × (1/3)
= 1/6
Given that the first pick is B, then the remaining balls are
G and R, and thus the chance of picking up G is ½. Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 35


© Copyright EE, NUS. All Rights Reserved.
Bayes’ Rule
Thomas Bayes (1701 – 1761)

• The conditional probability Pr(Y = y | X = x) is the probability of the random variable Y having a specific value y given that another random variable X has a specific value x.
• Bayes' Rule (also known as Bayes' Theorem) stipulates that:

   Pr(Y = y | X = x) = Pr(X = x | Y = y) Pr(Y = y) / Pr(X = x)     (1)

   posterior = (likelihood × prior) / evidence
EE2211 Introduction to Machine Learning 36
© Copyright EE, NUS. All Rights Reserved.
Bayes' Rule
• Likelihood Pr(x | y): the propensity for observing a certain value of x given a certain value of y
• Prior Pr(y): what we know about y BEFORE seeing x
• Posterior Pr(y | x): what we know about y AFTER seeing x
• Evidence Pr(x): a constant that ensures the left-hand side is a valid distribution

   Pr(y | x) = Pr(y) Pr(x | y) / Pr(x) = Pr(y) Pr(x | y) / ∑_y Pr(y) Pr(x | y)

Adapted from S. Prince
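A numerical sketch of Bayes' rule for a two-class label y (the prior and likelihood values are assumed for illustration):

import numpy as np

prior = np.array([0.7, 0.3])        # Pr(y = 0), Pr(y = 1) before seeing x
likelihood = np.array([0.2, 0.9])   # Pr(x | y = 0), Pr(x | y = 1) for the observed x

evidence = np.sum(prior * likelihood)          # Pr(x) = sum_y Pr(y) Pr(x | y)
posterior = prior * likelihood / evidence      # Pr(y | x)

print(evidence)          # 0.41
print(posterior)         # [0.3415..., 0.6585...]; sums to 1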


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 38


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 4
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Kar-Ann Toh, Thomas Yeo, Chen Khong, Helen Zhou, Vincent Tan, Robby Tan and
Haizhou Li)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability, Statistics, and Matrix
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Fundamental ML Algorithms:
Linear Regression
References for Lectures 4-6:
Main
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019.
(read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017

Supplementary
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for people
who want to analyze data”, Lean Publishing, 2015.
• [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied
Linear Algebra”, Cambridge University Press, 2018 (available online)
http://vmls-book.stanford.edu/
• [Ref 5] Professor Vincent Tan’s notes (chapters 4-6): (useful)
https://vyftan.github.io/papers/ee2211book.pdf

4
© Copyright EE, NUS. All Rights Reserved.
Recap on Notations, Vectors, Matrices
Scalar Numerical value 15, -3.5
Variable Take scalar values x or a

Vector An ordered list of scalar values x or 𝐚


𝑎1 2
Attributes of a vector 𝐚= 𝑎 =
2 3

Matrix A rectangular array of numbers 2 4


𝐗=
arranged in rows and columns 21 −6

Capital Sigma 𝑖=1 𝑥𝑖 = 𝑥1 + 𝑥2 + … + 𝑥𝑚−1 + 𝑥𝑚


∑𝑚

Capital Pi ∏𝑚
𝑖=1 𝑥𝑖 = 𝑥1 · 𝑥2 ·…· 𝑥𝑚−1 · 𝑥𝑚
5
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: summation and subtraction

   x + y = [x₁; x₂] + [y₁; y₂] = [x₁ + y₁; x₂ + y₂]

   x − y = [x₁; x₂] − [y₁; y₂] = [x₁ − y₁; x₂ − y₂]

6
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: scalar

   a x = a [x₁; x₂] = [a x₁; a x₂]

   (1/a) x = (1/a) [x₁; x₂] = [x₁/a; x₂/a]

7
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix or Vector Transpose:
   x = [x₁; x₂],   xᵀ = [x₁  x₂]

   X = [x₁,₁ x₁,₂ x₁,₃; x₂,₁ x₂,₂ x₂,₃; x₃,₁ x₃,₂ x₃,₃],   Xᵀ = [x₁,₁ x₂,₁ x₃,₁; x₁,₂ x₂,₂ x₃,₂; x₁,₃ x₂,₃ x₃,₃]

Python demo 1
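A possible version of "Python demo 1" (a reconstruction, not the original demo): transposing a vector and a matrix with NumPy.

import numpy as np

x = np.array([[1],
              [2]])                 # a 2x1 column vector
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(x.T)        # [[1 2]] -> a 1x2 row vector
print(X.T)        # rows and columns swapped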

8
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Dot Product or Inner Product of Vectors:
   x · y = xᵀy = [x₁  x₂][y₁; y₂] = x₁y₁ + x₂y₂

Geometric definition:
   x · y = ‖x‖ ‖y‖ cos θ
where θ is the angle between x and y, and ‖x‖ = √(x · x) is the Euclidean length of vector x
(‖x‖ cos θ is the length of the projection of x onto y).

E.g. a = [2; 3], c = [1; 0]  ⇒  a · c = 2·1 + 3·0 = 2
9
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Vector Product

   Wx = [w₁,₁ w₁,₂ w₁,₃; w₂,₁ w₂,₂ w₂,₃] [x₁; x₂; x₃]
      = [w₁,₁x₁ + w₁,₂x₂ + w₁,₃x₃; w₂,₁x₁ + w₂,₂x₂ + w₂,₃x₃]

10
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Vector-Matrix Product

   xᵀW = [x₁  x₂] [w₁,₁ w₁,₂ w₁,₃; w₂,₁ w₂,₂ w₂,₃]
       = [(x₁w₁,₁ + x₂w₂,₁)  (x₁w₁,₂ + x₂w₂,₂)  (x₁w₁,₃ + x₂w₂,₃)]

11
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Matrix Product

   XW = [x₁,₁ … x₁,d; ⁞ ⋱ ⁞; x_m,₁ … x_m,d] [w₁,₁ … w₁,h; ⁞ ⋱ ⁞; w_d,₁ … w_d,h]
      = [(x₁,₁w₁,₁ + ⋯ + x₁,d w_d,₁) … (x₁,₁w₁,h + ⋯ + x₁,d w_d,h); ⁞ ⋱ ⁞; (x_m,₁w₁,₁ + ⋯ + x_m,d w_d,₁) … (x_m,₁w₁,h + ⋯ + x_m,d w_d,h)]
      = [∑ᵢ₌₁ᵈ x₁,ᵢ wᵢ,₁ … ∑ᵢ₌₁ᵈ x₁,ᵢ wᵢ,h; ⁞ ⋱ ⁞; ∑ᵢ₌₁ᵈ x_m,ᵢ wᵢ,₁ … ∑ᵢ₌₁ᵈ x_m,ᵢ wᵢ,h]

If X is m × d and W is d × h, then the outcome is an m × h matrix.


12
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix inverse

Definition:
A d-by-d square matrix A is invertible (also nonsingular)
if there exists a d-by-d square matrix B such that
𝐀𝐁 = 𝐁𝐀 = 𝐈 (identity matrix)

   I = [1 0 … 0; 0 1 … 0; ⁞ ⋱ ⁞; 0 0 … 1]   (d-by-d dimension)

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

13
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix inverse computation
   A⁻¹ = (1 / det(A)) adj(A)
• det(A) is the determinant of A
• adj(A) is the adjugate or adjoint of A

Determinant computation
Example: 2×2 matrix
   A = [a b; c d]
   det(A) = |A| = ad − bc
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

14
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
• adj(A) is the adjugate or adjoint of A
• adj(A) is the transpose of the cofactor matrix C of A ⇒ adj(A) = Cᵀ
• The minor of an element in a matrix A is defined as the determinant obtained by deleting the row and column in which that element lies
   A = [a₁₁ a₁₂ a₁₃; a₂₁ a₂₂ a₂₃; a₃₁ a₃₂ a₃₃],   minor of a₁₂ is M₁₂ = |a₂₁ a₂₃; a₃₁ a₃₃|

• The (i, j) entry of the cofactor matrix C is the minor of the (i, j) element times a sign factor
   Cofactor Cᵢⱼ = (−1)^(i+j) Mᵢⱼ

• The determinant of A can also be defined by minors as
   det(A) = ∑ⱼ₌₁ᵏ aᵢⱼ Cᵢⱼ = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

15
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Recall:  minor of a₁₂ is M₁₂ = |a₂₁ a₂₃; a₃₁ a₃₃|,   adj(A) = Cᵀ,
Cofactor Cᵢⱼ = (−1)^(i+j) Mᵢⱼ,   det(A) = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ

• E.g. A = [a b; c d]
   C = [d −c; −b a]
• adj(A) = Cᵀ = [d −b; −c a],   det(A) = |A| = ad − bc
   A⁻¹ = (1 / det(A)) adj(A) = (1 / (ad − bc)) [d −b; −c a]
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

16
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

det(A) = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ
Determinant computation
Example: 3×3 matrix A = [a b c; d e f; g h i], expanding along the first row (i = 1):

   det(A) = a(ei − fh) − b(di − fg) + c(dh − eg)
Python demo 2 Ref: https://en.wikipedia.org/wiki/Determinant
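A possible version of "Python demo 2" (a reconstruction, not the original demo, with an assumed example matrix): the 3×3 determinant via the first-row cofactor expansion compared against NumPy.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
a, b, c = A[0]
d, e, f = A[1]
g, h, i = A[2]

det_cofactor = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
print(det_cofactor)               # -3.0
print(np.linalg.det(A))           # same value, up to floating-point error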

17
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎22 𝑎23
The minor of 𝑎11 = 𝑎 𝑎33
32

Ref: https://en.wikipedia.org/wiki/Determinant

18
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎21 𝑎23
The minor of 𝑎12 = 𝑎 𝑎33
31

Ref: https://en.wikipedia.org/wiki/Determinant

19
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎21 𝑎22
The minor of 𝑎13 = 𝑎 𝑎32
31

Ref: https://en.wikipedia.org/wiki/Determinant

20
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎12 𝑎13
The minor of 𝑎21 = 𝑎 𝑎33
32

Ref: https://en.wikipedia.org/wiki/Determinant

21
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎11 𝑎13
The minor of 𝑎22 = 𝑎 𝑎33
31

Ref: https://en.wikipedia.org/wiki/Determinant

22
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Example
Find the cofactor matrix of A given that A = [1 2 3; 0 4 5; 1 0 6].
Solution:
   a₁₁ ⇒ |4 5; 0 6| = 24,    a₁₂ ⇒ −|0 5; 1 6| = 5,    a₁₃ ⇒ |0 4; 1 0| = −4,
   a₂₁ ⇒ −|2 3; 0 6| = −12,   a₂₂ ⇒ |1 3; 1 6| = 3,    a₂₃ ⇒ −|1 2; 1 0| = 2,
   a₃₁ ⇒ |2 3; 4 5| = −2,    a₃₂ ⇒ −|1 3; 0 5| = −5,   a₃₃ ⇒ |1 2; 0 4| = 4.

The cofactor matrix C is thus [24 5 −4; −12 3 2; −2 −5 4].

Ref: https://www.mathwords.com/c/cofactor_matrix.htm

23
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

24
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

• Consider a system of m linear equations with d


variables or unknowns 𝑤1, … , 𝑤𝑑:

𝑥1,1 𝑤1 + 𝑥1,2 𝑤2 + ⋯ + 𝑥1,𝑑 𝑤𝑑 = 𝑦1


𝑥2,1 𝑤1 + 𝑥2,2 𝑤2 + ⋯ + 𝑥2,𝑑 𝑤𝑑 = 𝑦2

𝑥𝑚,1 𝑤1 + 𝑥𝑚,2 𝑤2 + ⋯ + 𝑥𝑚,𝑑 𝑤𝑑 = 𝑦𝑚 .

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

25
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

Note:
• The data matrix 𝐗 ∈ 𝓡𝑚×𝑑 and the target vector 𝐲 ∈ 𝓡𝑚 are given
• The unknown vector of parameters 𝐰 ∈ 𝓡𝑑 is to be learnt

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

26
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
A set of linear equations can have no solution, one
solution, or multiple solutions:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

• X is square: even-determined, m = d (equal number of equations and unknowns)
• X is tall: over-determined, m > d (more equations than unknowns)
• X is wide: under-determined, m < d (fewer equations than unknowns)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to
Applied Linear Algebra”, (Chp8.3 & 11) & [Ref 5] Tan’s notes, (Chp 4)

27
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
1. Square or even-determined system: 𝒎 = 𝒅
- Equal number of equations and unknowns, i.e., 𝐗 ∈ 𝓡𝑑×𝑑
- One unique solution if 𝐗 is invertible or all rows/columns of 𝐗 are
linearly independent
- If all rows or columns of 𝐗 are linearly independent, then 𝐗 is
invertible.

Solution:
If X is invertible (or X⁻¹X = I), then pre-multiply both sides by X⁻¹:
   X⁻¹Xw = X⁻¹y   ⇒   ŵ = X⁻¹y
(Note: we use a hat on top of w to indicate that it is a specific point in the space of w)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11)

28
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 1:   w₁ + w₂ = 4    (1)      Two unknowns
             w₁ − 2w₂ = 1   (2)      Two equations

In matrix form Xw = y:
   [1 1; 1 −2] [w₁; w₂] = [4; 1]

   ŵ = X⁻¹y = [1 1; 1 −2]⁻¹ [4; 1] = (−1/3) [−2 −1; −1 1] [4; 1] = [3; 1]      Python demo 3

(Recall: A⁻¹ = (1 / det(A)) adj(A),   adj(A) = Cᵀ = [d −b; −c a],   det(A) = ad − bc)
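A possible version of "Python demo 3" (a reconstruction, not the original demo): solving the square system of Example 1 with NumPy.

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -2.0]])
y = np.array([4.0, 1.0])

w_hat = np.linalg.inv(X) @ y       # w_hat = X^{-1} y
print(w_hat)                       # [3. 1.]
print(np.linalg.solve(X, y))       # same answer without explicitly forming the inverse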
29
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
2. Over-determined system: 𝒎 > 𝒅
– More equations than unknowns
– 𝐗 is non-square (tall) and hence not invertible
– Has no exact solution in general *
– An approximated solution is available using the left inverse
If the left-inverse of X exists such that X†X = I, then pre-multiplying both sides by X† results in
   X†Xw = X†y   ⇒   ŵ = X†y
Definition:
A matrix B that satisfies 𝑩𝒅 𝒙 𝒎 𝑨𝒎 𝒙 𝒅 = 𝐈 is called a left-inverse of 𝐀.
The left-inverse of 𝐗: 𝐗 † = (𝐗 𝑇 𝐗)−𝟏 𝐗 𝑇 given 𝐗 𝑇 𝐗 is invertible.
Note: * exception: when rank(𝐗) = rank([𝐗,𝐲]), there is a solution.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

30
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 2:   w₁ + w₂ = 1   (1)      Two unknowns
             w₁ − w₂ = 0   (2)      Three equations
             w₁      = 2   (3)

In matrix form Xw = y:
   [1 1; 1 −1; 1 0] [w₁; w₂] = [1; 0; 2]

There is no exact solution. The approximated solution (XᵀX is invertible) is
   ŵ = X†y = (XᵀX)⁻¹Xᵀy = [3 0; 0 2]⁻¹ [1 1 1; 1 −1 0] [1; 0; 2] = [1; 0.5]      Python demo 4
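A possible version of "Python demo 4" (a reconstruction, not the original demo): the left-inverse (least-squares) solution of the over-determined Example 2.

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 0.0, 2.0])

w_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # w_hat = (X^T X)^{-1} X^T y
print(w_hat)                                  # [1.  0.5]
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same answer via least squares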
31
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
3. Under-determined system: 𝒎 < 𝒅
– More unknowns than equations
– Infinite number of solutions in general *

If the right-inverse of 𝐗 exists such that 𝐗𝐗 † = 𝐈, then the 𝑑-vector


𝐰 = 𝐗 † 𝐲 (one of the infinite cases) satisfies the equation 𝐗𝐰 = 𝐲, i.e.,
𝐗𝐰 = 𝐲 ⇒ 𝐗𝐗 † 𝐲 = 𝐲
⇒ 𝐈𝐲 = 𝐲
Definition:
A matrix B that satisfies 𝐀𝒎 𝒙 𝒅 𝐁𝒅 𝒙 𝒎 = 𝐈 is called a right-inverse of 𝐀.
The right-inverse of 𝐗: 𝐗 † = 𝐗 𝑇 (𝐗𝐗 𝑇 )−1 given 𝐗𝐗 𝑇 is invertible.
If 𝐗 is right−invertible, we can find a unique constrained solution.

Note: * exception: no solution if the system is inconsistent rank(𝐗) < rank([𝐗,𝐲]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

32
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
3. Under-determined system: 𝒎 < 𝒅
Derivation:
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

A unique solution is still possible by constraining the search to

𝐰 = 𝐗ᵀ𝐚

If 𝐗𝐗ᵀ is invertible, let 𝐰 = 𝐗ᵀ𝐚, then

𝐗𝐗ᵀ𝐚 = 𝐲
⇒ â = (𝐗𝐗ᵀ)⁻¹𝐲
⇒ ŵ = 𝐗ᵀâ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲
where 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹ = 𝐗† is the right-inverse.

33
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 3 𝑤1 + 2𝑤2 + 3𝑤3 = 2 (1) Three unknowns
𝑤1 − 2𝑤2 + 3𝑤3 = 1 (2) Two equations
      𝐗            𝐰        𝐲
[ 1   2   3 ]  [ 𝑤1 ]     [ 2 ]
[ 1  −2   3 ]  [ 𝑤2 ]  =  [ 1 ]
               [ 𝑤3 ]
Infinitely many solutions along the intersection line
Here 𝐗𝐗ᵀ is invertible

ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲
  = [ 1   1 ] [ 14   6 ]⁻¹ [ 2 ]     [ 0.15 ]
    [ 2  −2 ] [  6  14 ]   [ 1 ]  =  [ 0.25 ]         Constrained solution
    [ 3   3 ]                        [ 0.45 ]
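A minimal numpy sketch of Example 3 using the right-inverse (illustrative, not an official demo):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [1.0, -2.0, 3.0]])
y = np.array([2.0, 1.0])

# right-inverse: X^T (X X^T)^{-1}
w_hat = X.T @ np.linalg.inv(X @ X.T) @ y
print(w_hat)        # [0.15 0.25 0.45]
print(X @ w_hat)    # reproduces y exactly: [2. 1.]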

34
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Example 4 𝑤1 + 2𝑤2 + 3𝑤3 = 2 (1) Three unknowns


3𝑤1 + 6𝑤2 + 9𝑤3 = 1 (2) Two equations

      𝐗            𝐰        𝐲
[ 1   2   3 ]  [ 𝑤1 ]     [ 2 ]
[ 3   6   9 ]  [ 𝑤2 ]  =  [ 1 ]
               [ 𝑤3 ]

Both 𝐗𝐗 𝑇 and 𝐗 𝑇 𝐗 are not invertible!


There is no solution for the system
35
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

36
© Copyright EE, NUS. All Rights Reserved.
Notations: Set
• A set is an unordered collection of unique elements
– Denoted by a calligraphic capital character, e.g., 𝓢, 𝓡, 𝓝, etc.
– When an element 𝑥 belongs to a set 𝓢, we write 𝑥 ∈ 𝓢
• A set of numbers can be finite - include a fixed amount of values
– Denoted using accolades (curly braces), e.g. {1, 3, 18, 23, 235} or {𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , . . . , 𝑥𝑑 }
• A set can be infinite and include all values in some interval
– If a set of real numbers includes all values between a and b, including a and
b, it is denoted using square brackets as [a, b]
– If the set does not include the values a and b, it is denoted using
parentheses as (a, b)
• Examples:
– The special set denoted by 𝓡 includes all real numbers from minus infinity
to plus infinity
– The set [0, 1] includes values like 0, 0.0001, 0.25, 0.9995, and 1.0

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).

37
© Copyright EE, NUS. All Rights Reserved.
Notations: Set operations

• Intersection of two sets:


𝓢3 ← 𝓢1 ∩ 𝓢2
Example: {1,3,5,8} ∩ {1,8,4} = {1,8}

• Union of two sets:


𝓢3 ← 𝓢1 ∪ 𝓢2
Example: {1,3,5,8} ∪ {1,8,4} = {1,3,4,5,8}
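A quick illustration of the two operations using Python's built-in set type:

s1 = {1, 3, 5, 8}
s2 = {1, 8, 4}
print(s1 & s2)   # intersection: {8, 1}
print(s1 | s2)   # union: {1, 3, 4, 5, 8}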

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).

38
© Copyright EE, NUS. All Rights Reserved.
Functions
• A function is a relation that associates each element 𝑥 of a set 𝓧,
the domain of the function, to a single element 𝑦 of another set 𝓨,
the codomain of the function
• If the function is called f, this relation is denoted 𝑦 = 𝑓(𝑥)
– The element 𝑥 is the argument or input of the function
– 𝑦 is the value of the function or the output
• The symbol used for representing the input is the variable of the
function
– 𝑓(𝑥): 𝑓 is a function of the variable 𝑥; 𝑓(𝑥, 𝑤): 𝑓 is a function of the variables 𝑥 and 𝑤

[Figure: a function mapping the domain 𝓧 = {1,2,3,4} to the codomain 𝓨 = {1,2,3,4,5,6}; the range (or image) is {3,4,5,6}]
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6 of chp2). 39
© Copyright EE, NUS. All Rights Reserved.
Functions

• A scalar function can have vector argument


– E.g. 𝑦 = 𝑓(𝐱) = 𝑥1 + 𝑥2 +2𝑥3
• A vector function, denoted as 𝐲 = 𝐟(𝐱) is a function
that returns a vector 𝐲
– Input argument can be a vector 𝐲 = 𝐟(𝐱) or a scalar 𝐲 = 𝐟(𝑥)
– E.g. (vector argument)  𝑦1 = −𝑥1,  𝑦2 = 𝑥2
– E.g. (scalar argument)  𝑦1 = −2𝑥1,  𝑦2 = 3𝑥1

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p7 of chp2).

40
© Copyright EE, NUS. All Rights Reserved.
Functions
• The notation 𝑓: 𝓡𝑑 → 𝓡 means that 𝑓 is a function that
maps real d-vectors to real numbers
– i.e., 𝑓 is a scalar-valued function of d-vectors
• If 𝐱 is a d-vector argument, then 𝑓 𝐱 denotes the value
of the function 𝑓 at 𝐱
– i.e., 𝑓 𝐱 = 𝑓 𝑥1 , 𝑥2 , … , 𝑥𝑑 , 𝐱 ∈ 𝓡𝑑 , 𝑓 𝐱 ∈ 𝓡

• Example: we can define a function 𝑓: 𝓡4 → 𝓡 by


𝑓(𝐱) = 𝑥1 + 𝑥2 − 𝑥4²

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, 2018 (Ch 2, p29)

41
© Copyright EE, NUS. All Rights Reserved.
Functions

The inner product function


• Suppose 𝒂 is a d-vector. We can define a scalar valued function 𝑓 of d-
vectors, given by
𝑓 𝐱 = 𝒂𝑇 𝐱 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ 𝑎𝑑 𝑥𝑑 (1)
for any d-vector 𝐱
• The inner product of its d-vector argument 𝐱 with some (fixed) d-vector 𝒂
• We can also think of 𝑓 as forming a weighted sum of the elements of 𝐱;
the elements of 𝒂 give the weights 𝒂

𝜃
𝐱
𝒂 cos𝜃
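A minimal numpy sketch of the inner product function (the vectors below are illustrative):

import numpy as np

a = np.array([1.0, -2.0, 0.5])   # fixed weight vector
x = np.array([3.0, 1.0, 4.0])
print(a @ x)                     # a^T x = 1*3 - 2*1 + 0.5*4 = 3.0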
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

42
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity
• For any d-vector 𝐱 and any scalar 𝛼, 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱
• Scaling the (vector) argument is the same as scaling the
function value

• Additivity
• For any d-vectors 𝐱 and 𝐲, 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲
• Adding (vector) arguments is the same as adding the function
values

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

43
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions
Superposition and linearity
• The inner product function 𝑓 𝐱 = 𝒂𝑇 𝐱 defined in equation (1)
(slide 9) satisfies the property
𝑓(𝛼𝐱 + 𝛽𝐲) = 𝒂ᵀ(𝛼𝐱 + 𝛽𝐲)
            = 𝒂ᵀ(𝛼𝐱) + 𝒂ᵀ(𝛽𝐲)
            = 𝛼(𝒂ᵀ𝐱) + 𝛽(𝒂ᵀ𝐲)
            = 𝛼𝑓(𝐱) + 𝛽𝑓(𝐲)
for all d-vectors 𝐱, 𝐲, and all scalars 𝛼, 𝛽.

• This property is called superposition, which consists of


homogeneity and additivity
• A function that satisfies the superposition property is called linear

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

44
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

• If a function 𝑓 is linear, superposition extends to linear


combinations of any number of vectors:
𝑓(𝛼1𝐱1 + ⋯ + 𝛼𝑘𝐱𝑘) = 𝛼1𝑓(𝐱1) + ⋯ + 𝛼𝑘𝑓(𝐱𝑘)
for any d-vectors 𝐱1, …, 𝐱𝑘 and any scalars 𝛼1, …, 𝛼𝑘.

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

45
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear and Affine Functions

A linear function plus a constant is called an affine function

A function 𝑓: 𝓡𝑑 → 𝓡 is affine if and only if it can be
expressed as 𝑓(𝐱) = 𝒂ᵀ𝐱 + 𝑏 for some d-vector 𝒂 and scalar 𝑏,
which is called the offset (or bias)

Example:
𝑓 𝐱 = 2.3 − 2𝑥1 + 1.3𝑥2 − 𝑥3

This function is affine, with 𝑏 = 2.3, 𝒂𝑇 = [−2, 1.3, −1].


Affine if there is a bias; linear if the bias is 0

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p32)

46
© Copyright EE, NUS. All Rights Reserved.
Functions

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p33)

47
© Copyright EE, NUS. All Rights Reserved.
Summary
• Operations on Vectors and Matrices Assignment 1 (week 6)
• Dot-product, matrix inverse Tutorial 4
• Systems of Linear Equations 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Set and Functions
𝐗 is Square   Even-determined    m = d   One unique solution in general                ŵ = 𝐗⁻¹𝐲
𝐗 is Tall     Over-determined    m > d   No exact solution in general;                 ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲   (left-inverse)
                                         an approximated solution
𝐗 is Wide     Under-determined   m < d   Infinite number of solutions in general;      ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲   (right-inverse)
                                         unique constrained solution

• Scalar and vector functions
• Inner product function
• Linear and affine functions

Python package numpy: inverse: numpy.linalg.inv(X); transpose: X.T

48
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 5
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Helen, Vincent, Chen Khong, Robby, and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Recap: Linear and Affine Functions

Linear Functions
A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity: 𝑓(𝛼𝐱) = 𝛼𝑓(𝐱)   (scaling; no offset)
• Additivity: 𝑓(𝐱 + 𝐲) = 𝑓(𝐱) + 𝑓(𝐲)   (adding)

Inner product function


𝑓 𝐱 = 𝒂𝑇 𝐱 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ 𝑎𝑑 𝑥𝑑

Affine function
𝑓 𝐱 = 𝒂𝑇 𝐱 + 𝑏 scalar 𝑏 is called the offset (or bias)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

4
© Copyright EE, NUS. All Rights Reserved.
Functions: Maximum and Minimum
A local and a global minima of a function

• 𝑓(𝑥) has a local minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for every 𝑥 in some open
  interval around 𝑥 = 𝑐
• 𝑓(𝑥) has a global minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for all 𝑥 in the domain of 𝑓
[Figure: a curve over an interval with endpoints a and b (a < x ≤ b), showing a local minimum and a global minimum]
Note: An interval is a set of real numbers with the property that any number that lies
between two numbers in the set is also included in the set.
An open interval does not include its endpoints and is denoted using parentheses. E.g.
(0, 1) means “all numbers greater than 0 and less than 1”.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

5
© Copyright EE, NUS. All Rights Reserved.
Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values 𝓐 = {𝑎1 , 𝑎2 , … , 𝑎𝑚 },
• The operator max 𝑓(𝑎) over 𝑎 ∈ 𝓐 returns the highest value 𝑓(𝑎) over all elements in
  the set 𝓐
• The operator arg max 𝑓(𝑎) over 𝑎 ∈ 𝓐 returns the element of the set 𝓐 that
  maximizes 𝑓(𝑎) (returns the input)

• When the set is implicit or infinite, we can simply write max𝑎 𝑓(𝑎) or arg max𝑎 𝑓(𝑎)
  E.g. 𝑓(𝑎) = 3𝑎, 𝑎 ∈ [0,1]  ⟹  max𝑎 𝑓(𝑎) = 3 and arg max𝑎 𝑓(𝑎) = 1

Min and Arg Min operate in a similar manner

Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

6
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
• The derivative 𝒇′ of a function 𝒇 is a function that describes
how fast 𝒇 grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function 𝑓 grows (or decreases) constantly at any point x of its domain
– When the derivative 𝑓′ is a function
• If 𝑓′ is positive at some x, then the function 𝑓 grows at this point
• If 𝑓′ is negative at some x, then the function 𝑓 decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal (e.g.
maximum or minimum points)

• The process of finding a derivative is called differentiation


• Gradient is the generalization of derivative for functions that
take several inputs (or one input in the form of a vector or
some other complex structure).

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).

7
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If 𝑓(𝐱) is a scalar function of d variables, 𝐱 is a d x1 vector.
Then differentiation of 𝑓(𝐱) w.r.t. 𝐱 results in a d x1 vector
d𝑓(𝐱)/d𝐱 = [ ∂𝑓/∂𝑥1, …, ∂𝑓/∂𝑥𝑑 ]ᵀ
This is referred to as the gradient of 𝑓(𝐱) and often written as 𝛻𝐱𝑓.
E.g. 𝑓(𝐱) = 𝑎𝑥1 + 𝑏𝑥2  ⟹  𝛻𝐱𝑓 = [𝑎, 𝑏]ᵀ
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
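A small numerical check of the gradient example (with illustrative coefficients): the analytic gradient [a, b]ᵀ is compared against a central finite-difference approximation.

import numpy as np

a, b = 2.0, -3.0
f = lambda x: a * x[0] + b * x[1]

x0 = np.array([1.0, 1.0])
eps = 1e-6
# central differences along each coordinate direction
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
print(num_grad)    # approximately [ 2. -3.] = [a, b]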
8
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If 𝐟(𝐱) is a vector function of size h x1 and 𝐱 is a d x1 vector.
Then differentiation of 𝐟(𝐱) results in a h x d matrix
d𝐟(𝐱)/d𝐱 = [ ∂𝑓1/∂𝑥1  …  ∂𝑓1/∂𝑥𝑑 ]
           [    ⋮      ⋱     ⋮   ]
           [ ∂𝑓ℎ/∂𝑥1  …  ∂𝑓ℎ/∂𝑥𝑑 ]
The matrix is referred to as the Jacobian of 𝐟(𝐱)

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

9
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient

Some Vector-Matrix Differentiation Formulae

d(𝐀𝐱)/d𝐱 = 𝐀

d(𝒃ᵀ𝐱)/d𝐱 = 𝒃          d(𝐲ᵀ𝐀𝐱)/d𝐱 = 𝐀ᵀ𝐲

d(𝐱ᵀ𝐀𝐱)/d𝐱 = (𝐀 + 𝐀ᵀ)𝐱

Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

• Linear regression is a popular regression learning


algorithm that learns a model which is a linear
combination of features of the input example.

𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

𝐗 = [ 𝑥1,1  𝑥1,2  …  𝑥1,𝑑 ]      𝐰 = [ 𝑤1 ]      𝐲 = [ 𝑦1 ]
    [  ⁞     ⁞    ⋱    ⁞  ]          [  ⁞ ]          [  ⁞ ]
    [ 𝑥𝑚,1  𝑥𝑚,2  …  𝑥𝑚,𝑑 ]          [ 𝑤𝑑 ]          [ 𝑦𝑚 ]

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).

11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Problem Statement: To predict the unknown 𝑦 for a given 𝐱 (testing)
• We have a collection of labeled examples (training) {(𝐱𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚
– 𝑚 is the size of the collection
– 𝐱 𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚 (input)
– y𝑖 is a real-valued target (1-D)
– Note:
• when y𝑖 is continuous valued, it is a regression problem
• when y𝑖 is discrete valued, it is a classification problem

• We want to build a model 𝑓𝐰,𝑏 (𝐱) as a linear combination of features of


example 𝐱: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏
where 𝐰 is a d-dimensional vector of parameters and 𝑏 is a real number.
• The notation 𝑓𝐰,𝑏 means that the model 𝑓 is parametrized by two values: w
and b
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

12
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Learning objective function


• To find the optimal values for w*
and b* which minimizes the
following expression:
(1/𝑚) Σᵢ₌₁ᵐ (𝑓𝐰,𝑏(𝐱𝑖) − y𝑖)²
• In mathematics, the expression we
minimize or maximize is called an
objective function, or, simply, an
objective

(𝑓𝐰 𝐱𝑖 − y𝑖 )2 is called the loss function: a measure of the difference


between 𝑓𝐰 𝐱 𝑖 and y𝑖 or a penalty for misclassification of example i.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

13
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values for w* which minimizes the
following expression:
Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)²
with 𝑓𝐰(𝐱𝑖) = 𝐱𝑖ᵀ𝐰,
where we define 𝐰 = [𝑏, 𝑤1 , … 𝑤𝑑 ]𝑇 = [𝑤0 , 𝑤1 , … 𝑤𝑑 ]𝑇 ,
and 𝐱 𝑖 = [1, 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 = [𝑥𝑖,0 , 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 , 𝑖 = 1, … , 𝑚

• This particular choice of the loss function is called squared


error loss
Note: The normalization factor 1/𝑚 can be omitted as it does not affect the optimization.

14
© Copyright EE, NUS. All Rights Reserved.
Linear Regression      Σᵢ₌₁ᵐ (𝑓𝐰,𝑏(𝐱𝑖) − y𝑖)²     (𝑓𝐰,𝑏(𝐱𝑖): predicted value; y𝑖: true label)

• All model-based learning algorithms have a loss function


• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)

• In linear regression, the cost function is given by the average


loss, also called the empirical risk because we do not have
all the data (e.g. testing data)
– The average of all penalties is obtained by applying the
model to the training data
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

15
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning (Training)
• Consider the set of feature vectors 𝐱𝑖 and target outputs 𝑦𝑖 indexed by
  𝑖 = 1, …, 𝑚. A linear model 𝑓𝐰(𝐱) = 𝐱ᵀ𝐰 can be stacked as
      𝒇𝐰(𝐗) = 𝐗𝐰 = [𝐱1ᵀ𝐰, …, 𝐱𝑚ᵀ𝐰]ᵀ     (learning model)
      𝐲 = [𝑦1, …, 𝑦𝑚]ᵀ                   (learning target vector)
  where 𝐱𝑖ᵀ𝐰 = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑][𝑏, 𝑤1, …, 𝑤𝑑]ᵀ
Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.

16
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Least Squares Regression


In vector-matrix notation, the minimization of the objective
function can be written compactly using 𝐞 = 𝐗𝐰 − 𝐲 :
J(𝐰) = 𝐞𝑇 𝐞
= (𝐗𝐰 − 𝐲)𝑇 (𝐗𝐰 − 𝐲)
= (𝐰 𝑇 𝐗 𝑇 − 𝐲 𝑇 )(𝐗𝐰 − 𝐲)
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 𝐰 𝑇 𝐗 𝑇 𝐲 − 𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 2𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲.
Note: when 𝒇𝐰(𝐗) = 𝐗𝐰, then
Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲).

17
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Differentiating J(𝐰) with respect to 𝐰 and setting the result to 𝟎:
∂J(𝐰)/∂𝐰 = 𝟎
∂(𝐰ᵀ𝐗ᵀ𝐗𝐰 − 2𝐲ᵀ𝐗𝐰 + 𝐲ᵀ𝐲)/∂𝐰 = 𝟎
⇒ 2𝐗ᵀ𝐗𝐰 − 2𝐗ᵀ𝐲 = 𝟎
⇒ 𝐗ᵀ𝐗𝐰 = 𝐗ᵀ𝐲
⇒ Any minimizer ŵ of J(𝐰) must satisfy 𝐗ᵀ(𝐗ŵ − 𝐲) = 𝟎.
If 𝐗ᵀ𝐗 is invertible, then
Learning/training:   ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
Prediction/testing:  𝒇̂𝐰(𝐗𝑛𝑒𝑤) = 𝐗𝑛𝑒𝑤ŵ
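A minimal sketch of these two steps, assuming 𝐗 already includes the bias column of 1s and 𝐗ᵀ𝐗 is invertible (function names are illustrative):

import numpy as np

def fit_linear(X, y):
    # w_hat = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

def predict_linear(X_new, w_hat):
    # f_hat(X_new) = X_new w_hat
    return X_new @ w_hat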

18
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Example 1    Training set {(𝑥𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥 = −9} → {𝑦 = −6}
             {𝑥 = −7} → {𝑦 = −6}
             {𝑥 = −5} → {𝑦 = −4}
             {𝑥 =  1} → {𝑦 = −1}
             {𝑥 =  5} → {𝑦 =  1}
             {𝑥 =  9} → {𝑦 =  4}

      𝐗         𝐰        𝐲
[ 1  −9 ]             [ −6 ]
[ 1  −7 ]             [ −6 ]
[ 1  −5 ]  [ 𝑤0 ]     [ −4 ]          (𝑤0: offset term)
[ 1   1 ]  [ 𝑤1 ]  =  [ −1 ]
[ 1   5 ]             [  1 ]
[ 1   9 ]             [  4 ]

This set of linear equations has no exact solution
However, 𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [  6   −6 ]⁻¹ [  1   1   1  1  1  1 ] [ −6, −6, −4, −1, 1, 4 ]ᵀ  =  [ −1.4375 ]
    [ −6  262 ]   [ −9  −7  −5  1  5  9 ]                               [  0.5625 ]

Equation of the fitted line: y = 0.5625x − 1.4375
19
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
ŷ = 𝐗ŵ = 𝐗 [ −1.4375, 0.5625 ]ᵀ
y = −1.4375 + 0.5625x

Prediction:
Test set {𝑥 = −1} → {𝑦 = ?}

ŷ = [ 1  −1 ] [ −1.4375, 0.5625 ]ᵀ = −2
Linear Regression on one-dimensional samples

Python demo 1
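A minimal numpy sketch of Example 1 (not the official Python demo 1):

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-6.0, -6.0, -4.0, -1.0, 1.0, 4.0])

X = np.column_stack([np.ones_like(x), x])        # prepend the bias column
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)                                     # [-1.4375  0.5625]

X_test = np.array([[1.0, -1.0]])
print(X_test @ w_hat)                            # [-2.]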
20
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (add a column of 1s for the offset in 𝐗)

Example 2    Training set {(𝐱𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦 = 1}
             {𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦 = 0}
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦 = 2}
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦 = −1}

      𝐗            𝐰        𝐲
[ 1   1   1 ]            [  1 ]
[ 1  −1   1 ]  [ 𝑤1 ]    [  0 ]
[ 1   1   3 ]  [ 𝑤2 ]  = [  2 ]
[ 1   1   0 ]  [ 𝑤3 ]    [ −1 ]

This set of linear equations has no exact solution
However, 𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [ 1, 0, 2, −1 ]ᵀ     [ −0.7500 ]
    [ 2  4   3 ]   [ 1  −1  1   1 ]                   =  [  0.1786 ]
    [ 5  3  11 ]   [ 1   1  3   0 ]                      [  0.9286 ]

21
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
The four linear equations
Prediction:
Test set:
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦 = ?}

ŷ = 𝒇̂𝐰(𝐗𝑛𝑒𝑤) = 𝐗𝑛𝑒𝑤ŵ
  = [ 1  6   8 ] [ −0.7500, 0.1786, 0.9286 ]ᵀ  =  [  7.7500 ]
    [ 1  0  −1 ]                                  [ −1.6786 ]

22
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning of Vectored Function (Multiple Outputs)
For one sample: a linear model 𝐟𝐰(𝐱) = 𝐱ᵀ𝐖     (vector function)
For m samples:  𝐅𝐰(𝐗) = 𝐗𝐖 = 𝐘

𝐗 = [𝐱1ᵀ; … ; 𝐱𝑚ᵀ], where row 𝑖 is 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]   (samples 1 to m)

𝐖 = [ 𝑤0,1 … 𝑤0,ℎ ]        𝐘 = [ 𝑦1,1 … 𝑦1,ℎ ]   (sample 1’s output)
    [ 𝑤1,1 … 𝑤1,ℎ ]            [  ⁞   ⋱   ⁞  ]
    [  ⁞   ⋱   ⁞  ]            [ 𝑦𝑚,1 … 𝑦𝑚,ℎ ]   (sample m’s output)
    [ 𝑤𝑑,1 … 𝑤𝑑,ℎ ]

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


23
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Objective: Σᵢ₌₁ᵐ (𝐟𝐰(𝐱𝑖) − 𝐲𝑖)² = 𝐄ᵀ𝐄

Least Squares Regression of Multiple Outputs


In matrix notation, the sum of squared errors cost
function can be written compactly using 𝐄 = 𝐗𝐖 − 𝐘:

J(𝐖) = trace(𝐄𝑇 𝐄)
= trace[(𝐗𝐖 − 𝐘)𝑇 (𝐗𝐖 − 𝐘)]

If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐖 ෡ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐘
Prediction/testing: 𝐅෠𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐖 ෡

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3.2.4)

24
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Least Squares Regression of Multiple Outputs

J(𝐖) = trace(𝐄ᵀ𝐄)
     = trace( [𝐞1 𝐞2 … 𝐞ℎ]ᵀ [𝐞1 𝐞2 … 𝐞ℎ] )
     = trace of the ℎ×ℎ matrix whose (𝑖, 𝑗) entry is 𝐞𝑖ᵀ𝐞𝑗
     = Σₖ₌₁ʰ 𝐞𝑘ᵀ𝐞𝑘

25
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦1 = 1, 𝑦2 = 0}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦1 = 0, 𝑦2 = 1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦1 = 2, 𝑦2 = −1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦1 = −1, 𝑦2 = 3}

      𝐗               𝐖                𝐘
[ 1   1   1 ]  [ 𝑤1,1  𝑤1,2 ]      [  1   0 ]
[ 1  −1   1 ]  [ 𝑤2,1  𝑤2,2 ]  =   [  0   1 ]        (first column of 𝐗 = bias)
[ 1   1   3 ]  [ 𝑤3,1  𝑤3,2 ]      [  2  −1 ]
[ 1   1   0 ]                      [ −1   3 ]

This set of linear equations has NO exact solution; least-squares approximation
(𝐗ᵀ𝐗 is invertible):

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [  1   0 ]     [ −0.75     2.25   ]
    [ 2  4   3 ]   [ 1  −1  1   1 ] [  0   1 ]  =  [  0.1786   0.0357 ]
    [ 5  3  11 ]   [ 1   1  3   0 ] [  2  −1 ]     [  0.9286  −1.2143 ]
                                    [ −1   3 ]

26
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Prediction:
Test set: two new samples
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → { 𝑦1 =? , 𝑦2 =? }
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → { 𝑦1 =? , 𝑦2 =? }

Ŷ = 𝐗𝑛𝑒𝑤Ŵ
  = [ 1  6   8 ] [ −0.75     2.25   ]     [  7.75    −7.25   ]
    [ 1  0  −1 ] [  0.1786   0.0357 ]  =  [ −1.6786   3.4643 ]
                 [  0.9286  −1.2143 ]
(first column of 𝐗𝑛𝑒𝑤 = bias)

Python demo 2
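A minimal numpy sketch of Example 3 with two outputs (not the official Python demo 2):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])     # first column acts as the bias
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0],
              [-1.0, 3.0]])

W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y          # (X^T X)^{-1} X^T Y
X_new = np.array([[1.0, 6.0, 8.0],
                  [1.0, 0.0, -1.0]])
print(X_new @ W_hat)    # approx [[ 7.75   -7.25  ], [-1.6786  3.4643]]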

27
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 4
The values of feature x and their corresponding values of multiple
outputs target y are shown in the table below.

Based on the least square regression, what are the values of w?


Based on the current mapping, when x = 2, what is the value of y?

x [3] [4] [10] [6] [7]

y [0, 5] [1.5, 4] [-3, 8] [-4, 10] [1, 6]

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘 = [  1.9      3.6 ]
                        [ −0.4667   0.5 ]              Python demo 3

Ŷ𝑛𝑒𝑤 = 𝐗𝑛𝑒𝑤Ŵ = [ 1  2 ] Ŵ = [ 0.9667  4.6 ]     (prediction for x = 2)
*Prediction must be close to / within the observed range.
28
© Copyright EE, NUS. All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Dot-product, matrix inverse
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
• Inner product, linear/affine functions
• Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
• Objective function, loss function
• Least square solution, training/learning and testing/prediction
• Linear regression with multiple outputs
Learning/training:   ŵ = (𝐗𝑡𝑟𝑎𝑖𝑛ᵀ𝐗𝑡𝑟𝑎𝑖𝑛)⁻¹𝐗𝑡𝑟𝑎𝑖𝑛ᵀ𝐲𝑡𝑟𝑎𝑖𝑛
Prediction/testing:  𝐲𝑡𝑒𝑠𝑡 = 𝐗𝑡𝑒𝑠𝑡ŵ
• Classification
• Ridge Regression
• Polynomial Regression
Python packages: numpy, pandas, matplotlib.pyplot, numpy.linalg, and sklearn.metrics (for mean_squared_error), numpy.linalg.pinv
29
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 6
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
Mid-term: Oct 5th
– Data Engineering Week 8 lecture slot
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression & Polynomial
Regression

Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Scalar Function (Single Output)
For one sample: a linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 scalar function
For m samples: 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
𝐲 = [ 𝐱1ᵀ𝐰, …, 𝐱𝑚ᵀ𝐰 ]ᵀ,   where 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]

𝐗 = [ 1  𝑥1,1  …  𝑥1,𝑑 ]      𝐰 = [ 𝑏, 𝑤1, …, 𝑤𝑑 ]ᵀ      𝐲 = [ 𝑦1, …, 𝑦𝑚 ]ᵀ
    [ ⁞    ⁞    ⋱   ⁞  ]
    [ 1  𝑥𝑚,1  …  𝑥𝑚,𝑑 ]

Objective: Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = 𝐞ᵀ𝐞 = (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲)
Learning/training when 𝐗 𝑇 𝐗 is invertible
ෝ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐲
Least square solution: 𝐰
Prediction/testing: 𝒚𝑛𝑒𝑤 = 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰

4
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)
𝐅𝐰 𝐗 = 𝐗𝐖 = 𝐘
𝐗 = [𝐱1ᵀ; … ; 𝐱𝑚ᵀ], where row 𝑖 is 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]   (samples 1 to m)

𝐖 = [ 𝑤0,1 … 𝑤0,ℎ ]        𝐘 = [ 𝑦1,1 … 𝑦1,ℎ ]   (sample 1’s output)
    [ 𝑤1,1 … 𝑤1,ℎ ]            [  ⁞   ⋱   ⁞  ]
    [  ⁞   ⋱   ⁞  ]            [ 𝑦𝑚,1 … 𝑦𝑚,ℎ ]   (sample m’s output)
    [ 𝑤𝑑,1 … 𝑤𝑑,ℎ ]

Least Squares Regression

If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐖 ෡ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐘
Prediction/testing: 𝐅෠𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐖 ෡

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


5
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


• We have a collection of labeled examples
• 𝑚 is the size of the collection
• 𝐱 𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚
• 𝑦𝑖 is discrete target label (e.g., 𝑦𝑖 ∈ {−1, +1} or {0, 1} for
binary classification problems)
• Note:
• when 𝑦𝑖 is continuous valued  a regression problem
• when 𝑦𝑖 is discrete valued a classification problem
• Linear model: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏 or in compact form 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰
(having the offset term absorbed into the inner product)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

6
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


Binary Classification:
If 𝐗 𝑇 𝐗 is invertible, then
Learning:   ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲,   𝑦𝑖 ∈ {−1, +1}, 𝑖 = 1, …, 𝑚
Prediction: 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = sign(𝐱𝑛𝑒𝑤ᵀŵ)  for each row 𝐱𝑛𝑒𝑤ᵀ of 𝐗𝑛𝑒𝑤

[Figure: the sign function, sign(𝑎) = +1 for 𝑎 > 0 and −1 for 𝑎 < 0]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

7
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1    Training set {(𝑥𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥 = −9} → {𝑦 = −1}
             {𝑥 = −7} → {𝑦 = −1}
             {𝑥 = −5} → {𝑦 = −1}
             {𝑥 =  1} → {𝑦 = +1}
             {𝑥 =  5} → {𝑦 = +1}
             {𝑥 =  9} → {𝑦 = +1}

      𝐗         𝐰        𝐲
[ 1  −9 ]             [ −1 ]
[ 1  −7 ]             [ −1 ]
[ 1  −5 ]  [ 𝑤0 ]     [ −1 ]          (bias column)
[ 1   1 ]  [ 𝑤1 ]  =  [  1 ]
[ 1   5 ]             [  1 ]
[ 1   9 ]             [  1 ]

This set of linear equations has NO exact solution
𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [  6   −6 ]⁻¹ [  1   1   1  1  1  1 ] [ −1, −1, −1, 1, 1, 1 ]ᵀ  =  [ 0.1406 ]
    [ −6  262 ]   [ −9  −7  −5  1  5  9 ]                              [ 0.1406 ]
8
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1 (cont’d)

ŷ = sign(𝐗ŵ) = sign(𝐗 [ 0.1406, 0.1406 ]ᵀ)
y′ = 𝐗ŵ = 0.1406 + 0.1406x

Prediction:
Test set {𝑥 = −2} → {𝑦 = ?}
𝑦𝑛𝑒𝑤 = 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = sign(𝐱𝑛𝑒𝑤ᵀŵ)
     = sign([ 1  −2 ] [ 0.1406, 0.1406 ]ᵀ)        (bias, 𝑥𝑛𝑒𝑤)
     = sign(−0.1406) = −1

[Figure: linear regression for one-dimensional classification, showing the six training
points and the fitted line y′ = 0.1406 + 0.1406x]
Python demo 1
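A minimal numpy sketch of this binary-classification example (not the official Python demo 1):

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

X = np.column_stack([np.ones_like(x), x])         # bias column + feature
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # approx [0.1406, 0.1406]

x_new = np.array([[1.0, -2.0]])
print(np.sign(x_new @ w_hat))                     # [-1.]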
9
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Linear Methods for Classification

Multi-Category Classification:

If 𝐗 𝑇 𝐗 is invertible, then

Learning:   Ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘,   𝐘 ∈ 𝓡𝑚×𝐶
Prediction: 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = arg max𝑘=1,…,𝐶 𝐱𝑛𝑒𝑤ᵀŴ(:, 𝑘)  for each row 𝐱𝑛𝑒𝑤ᵀ of 𝐗𝑛𝑒𝑤

Each row (𝑖 = 1, …, 𝑚) of 𝐘 has a one-hot encoding/assignment of length 𝐶:
e.g., the target for class-1 is labelled as 𝐲𝑖ᵀ = [1, 0, 0, …, 0] for the ith sample,
the target for class-2 is labelled as 𝐲𝑗ᵀ = [0, 1, 0, …, 0] for the jth sample,
the target for class-C is labelled as 𝐲𝑚ᵀ = [0, 0, …, 0, 1] for the mth sample.
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.4)

10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2 Three class classification
Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1}  → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = −1, 𝑥2 = 1} → {𝑦1 = 0, 𝑦2 = 1, 𝑦3 = 0}   Class 2
{𝑥1 = 1, 𝑥2 = 3}  → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = 1, 𝑥2 = 0}  → {𝑦1 = 0, 𝑦2 = 0, 𝑦3 = 1}   Class 3

      𝐗                𝐖                      𝐘
[ 1   1   1 ]  [ 𝑤1,1  𝑤1,2  𝑤1,3 ]      [ 1  0  0 ]
[ 1  −1   1 ]  [ 𝑤2,1  𝑤2,2  𝑤2,3 ]  =   [ 0  1  0 ]        (first column of 𝐗 = bias)
[ 1   1   3 ]  [ 𝑤3,1  𝑤3,2  𝑤3,3 ]      [ 1  0  0 ]
[ 1   1   0 ]                            [ 0  0  1 ]

This set of linear equations has NO exact solution; least-squares approximation
(𝐗ᵀ𝐗 is invertible):

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [ 1  0  0 ]     [ 0        0.5     0.5    ]
    [ 2  4   3 ]   [ 1  −1  1   1 ] [ 0  1  0 ]  =  [ 0.2857  −0.5     0.2143 ]
    [ 5  3  11 ]   [ 1   1  3   0 ] [ 1  0  0 ]     [ 0.2857   0      −0.2857 ]
                                    [ 0  0  1 ]
11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2    Prediction
Test set 𝐗𝑛𝑒𝑤:  {𝑥1 = 6, 𝑥2 = 8} → {class 1, 2, or 3?}
                {𝑥1 = 0, 𝑥2 = −1} → {class 1, 2, or 3?}

Ŷ = 𝐗𝑛𝑒𝑤Ŵ = [ 1  6   8 ] [ 0        0.5     0.5    ]     [  4        −2.50    −0.50   ]
            [ 1  0  −1 ] [ 0.2857  −0.5     0.2143 ]  =  [ −0.2857    0.50     0.7857 ]
                         [ 0.2857   0      −0.2857 ]

Category prediction:
𝒇̂ᶜ𝐰(𝐗𝑛𝑒𝑤) = arg max𝑘=1,…,𝐶 (Ŷ(:, 𝑘)) = [ 1 ]   Class 1
                                       [ 3 ]   Class 3

For each row of Ŷ, the column position of the largest number (across all columns for
that row) determines the class label. E.g. in the first row, the maximum number is 4,
which is in column 1; therefore the predicted class is 1.
Python demo 2
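A minimal numpy sketch of this three-class example with one-hot targets and arg max (not the official Python demo 2):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])    # columns: bias, x1, x2
Y = np.array([[1.0, 0.0, 0.0],     # one-hot class labels
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
X_new = np.array([[1.0, 6.0, 8.0],
                  [1.0, 0.0, -1.0]])
scores = X_new @ W_hat
print(np.argmax(scores, axis=1) + 1)   # predicted classes: [1 3]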
12
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression (ensures that the matrix to be inverted, 𝐗ᵀ𝐗 + λ𝐈, is invertible)

Recall Linear regression


Objective: ŵ = argmin𝐰 Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = argmin𝐰 (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲)
The learning computation: ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
We cannot guarantee that the matrix 𝐗 𝑇 𝐗 is invertible

Ridge regression: shrinks the regression coefficients w by imposing a


penalty on their size
Objective: ŵ = argmin𝐰 [ Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² + λ Σⱼ₌₁ᵈ 𝑤𝑗² ]
             = argmin𝐰 [ (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰 ]

Here λ ≥ 0 is a complexity parameter that controls the amount of


shrinkage: the larger the value of λ, the greater the amount of shrinkage.

Note: m samples & d parameters


13
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
Using a linear model:
min𝐰 (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰
Solution:
∂[(𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰]/∂𝐰 = 𝟎
⇒ 2𝐗ᵀ𝐗𝐰 − 2𝐗ᵀ𝐲 + 2λ𝐰 = 𝟎
⇒ 𝐗ᵀ𝐗𝐰 + λ𝐰 = 𝐗ᵀ𝐲
⇒ (𝐗ᵀ𝐗 + λ𝐈)𝐰 = 𝐗ᵀ𝐲
where I is the dxd identity matrix
From here on, we shall focus on a single column of output 𝐲 in the derivations
that follow
Learning: ෝ = (𝐗 𝑇 𝐗 +λ𝐈)−1 𝐗 𝑇 𝐲
𝐰
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3)

14
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Primal Form (when m > d)

(𝐗 𝑇 𝐗 + λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = (𝐗 𝑇 𝐗 +λ𝐈)−1 𝐗 𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ
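A minimal sketch of the ridge solution in primal form, shown alongside the dual form introduced on the next slide (function names are illustrative; for λ > 0 both give the same ŵ):

import numpy as np

def fit_ridge_primal(X, y, lam):
    d = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y

def fit_ridge_dual(X, y, lam):
    m = X.shape[0]
    return X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m)) @ y

# The dual form is cheaper when m < d (fewer samples than parameters).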

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3)

15
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Dual Form (when m < d)


underdetermined system

(𝐗𝐗 𝑇 +λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = 𝐗 𝑇 (𝐗𝐗 𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ

Derivation as homework (see tutorial 6).


Hint: start off with (𝐗ᵀ𝐗 + 𝜆𝐈)𝐰 = 𝐗ᵀ𝐲 and make use of 𝐰 = 𝐗ᵀ𝐚 and
𝐚 = 𝜆⁻¹(𝐲 − 𝐗𝐰), 𝜆 > 0

16
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g. when the input dimension is d=2,
a polynomial function of degree = 2 is:
𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2².

XOR problem

𝑓𝐰 𝐱 = 𝑥1 𝑥2

17
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Polynomial Expansion
• The linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 can be written as
𝑓𝐰(𝐱) = 𝐱ᵀ𝐰
      = Σᵢ₌₀ᵈ 𝑥𝑖𝑤𝑖,   𝑥0 = 1
      = 𝑤0 + Σᵢ₌₁ᵈ 𝑥𝑖𝑤𝑖.

• By including additional terms involving the products of


pairs of components of 𝐱, we obtain a quadratic model:
𝑓𝐰(𝐱) = 𝑤0 + Σᵢ₌₁ᵈ 𝑤𝑖𝑥𝑖 + Σᵢ₌₁ᵈ Σⱼ₌₁ᵈ 𝑤𝑖𝑗𝑥𝑖𝑥𝑗.

2nd order: 𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2²

3rd order: 𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2² +
           Σᵢ₌₁ᵈ Σⱼ₌₁ᵈ Σₖ₌₁ᵈ 𝑤𝑖𝑗𝑘𝑥𝑖𝑥𝑗𝑥𝑘,   𝑑 = 2          Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

18
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function
• In general:
𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1 𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

Weierstrass Approximation Theorem: Every continuous function defined on a


closed interval [a, b] can be uniformly approximated as closely as desired by
a polynomial function.
- Suppose f is a continuous real-valued function defined on the real interval [a, b].
- For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have| f (x)
− p(x)| < ε.
(Ref: https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

Notes:
• For high dimensional input features (large d value) and high polynomial order, the
number of polynomial terms becomes explosive! (i.e., grows exponentially)
• For high dimensional problems, polynomials of order larger than 3 are seldom used.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5) online

19
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function

𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1 𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

𝒇𝐰(𝐗) = 𝐏𝐰   ( Note: 𝐏 ≜ 𝐏(𝐗) for symbol simplicity )
       = [ 𝒑1ᵀ𝐰, …, 𝒑𝑚ᵀ𝐰 ]ᵀ
where 𝒑𝑙ᵀ = [1, 𝑥𝑙,1, …, 𝑥𝑙,𝑑, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗𝑥𝑙,𝑘, …]
and   𝐰 = [𝑤0, 𝑤1, …, 𝑤𝑑, …, 𝑤𝑖𝑗, …, 𝑤𝑖𝑗𝑘, …]ᵀ,
𝑙 = 1, …, 𝑚; d denotes the dimension of input features; m denotes the number of samples
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

20
© Copyright EE, NUS. All Rights Reserved.
Example 3    Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
             {𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
             {𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
             {𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model

𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2²
      = [1  𝑥1  𝑥2  𝑥1𝑥2  𝑥1²  𝑥2²] [𝑤0, 𝑤1, 𝑤2, 𝑤12, 𝑤11, 𝑤22]ᵀ

Stack the 4 training samples as a matrix:

𝐏 = [ 1  𝑥1,1  𝑥1,2  𝑥1,1𝑥1,2  𝑥1,1²  𝑥1,2² ]     [ 1  0  0  0  0  0 ]
    [ 1  𝑥2,1  𝑥2,2  𝑥2,1𝑥2,2  𝑥2,1²  𝑥2,2² ]     [ 1  1  1  1  1  1 ]
    [ 1  𝑥3,1  𝑥3,2  𝑥3,1𝑥3,2  𝑥3,1²  𝑥3,2² ]  =  [ 1  1  0  0  1  0 ]
    [ 1  𝑥4,1  𝑥4,2  𝑥4,1𝑥4,2  𝑥4,1²  𝑥4,2² ]     [ 1  0  1  0  0  1 ]
21
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

Ridge Regression in Primal Form (m > d)


For λ > 0,
Learning: 𝐰ෝ = (𝐏 𝑇 𝐏 +λ𝐈)−1 𝐏 𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰 ෝ

Ridge Regression in Dual Form (m < d)


For λ > 0,
Learning: 𝐰ෝ = 𝐏𝑇 (𝐏𝐏𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

Note: Change X to P with reference to slides 15/16; m & d refers to the size of P (not X)
22
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

For Regression Applications


• Learn continuous valued 𝑦 using either primal form or dual form
• Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

For Classification Applications


• Learn discrete valued 𝑦 (𝑦 ∈ {−1, +1}) or 𝐘 (one-hot) using either primal
form or dual form
• Binary Prediction: 𝒇෠ 𝑐𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = sign(𝐏𝑛𝑒𝑤 𝐰)

• Multi-Category Prediction: 𝒇̂ᶜ𝐰(𝐏(𝐗𝑛𝑒𝑤)) = arg max𝑘=1,…,𝐶 (𝐏𝑛𝑒𝑤Ŵ(:, 𝑘))

23
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont’d)    Training set:
                      {𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
                      {𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
                      {𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
                      {𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model

𝐏 = [ 1  𝑥1,1  𝑥1,2  𝑥1,1𝑥1,2  𝑥1,1²  𝑥1,2² ]     [ 1  0  0  0  0  0 ]
    [ 1  𝑥2,1  𝑥2,2  𝑥2,1𝑥2,2  𝑥2,1²  𝑥2,2² ]     [ 1  1  1  1  1  1 ]
    [ 1  𝑥3,1  𝑥3,2  𝑥3,1𝑥3,2  𝑥3,1²  𝑥3,2² ]  =  [ 1  1  0  0  1  0 ]
    [ 1  𝑥4,1  𝑥4,2  𝑥4,1𝑥4,2  𝑥4,1²  𝑥4,2² ]     [ 1  0  1  0  0  1 ]

ŵ = 𝐏ᵀ(𝐏𝐏ᵀ)⁻¹𝐲
  = [ 1  1  1  1 ] [ 1  1  1  1 ]⁻¹ [ −1 ]     [ −1 ]
    [ 0  1  1  0 ] [ 1  6  3  3 ]   [ −1 ]     [  1 ]
    [ 0  1  0  1 ] [ 1  3  3  1 ]   [ +1 ]  =  [  1 ]
    [ 0  1  0  0 ] [ 1  3  1  3 ]   [ +1 ]     [ −4 ]
    [ 0  1  1  0 ]                             [  1 ]
    [ 0  1  0  1 ]                             [  1 ]
                                                         Python demo 3
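A minimal numpy sketch of this example (not the official Python demo 3). The monomial columns are built by hand here; sklearn.preprocessing.PolynomialFeatures(degree=2) produces the same monomials, though possibly in a different column order.

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# columns: 1, x1, x2, x1*x2, x1^2, x2^2
P = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                     X[:, 0] * X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2])

w_hat = P.T @ np.linalg.inv(P @ P.T) @ y    # dual-form solution (m < d)
print(w_hat)                                # approx [-1.  1.  1. -4.  1.  1.]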
24
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont’d) Prediction
Test set Test point 1: {𝑥1 = 0.1, 𝑥2 = 0.1} → {𝑦 = class − 1 or + 1? }
Test point 2: {𝑥1 = 0.9, 𝑥2 = 0.9} → {𝑦 = class − 1 or + 1? }
Test point 3: {𝑥1 = 0.1, 𝑥2 = 0.9} → {𝑦 = class − 1 or + 1? }
Test point 4: {𝑥1 = 0.9, 𝑥2 = 0.1} → {𝑦 = class − 1 or + 1? }
ŷ = 𝐏𝑛𝑒𝑤ŵ,   where the columns of 𝐏𝑛𝑒𝑤 are [1  𝑥1  𝑥2  𝑥1𝑥2  𝑥1²  𝑥2²]:

𝐏𝑛𝑒𝑤 = [ 1  0.1  0.1  0.01  0.01  0.01 ]      ŷ = 𝐏𝑛𝑒𝑤 [ −1, 1, 1, −4, 1, 1 ]ᵀ
       [ 1  0.9  0.9  0.81  0.81  0.81 ]        = [ −0.82, −0.82, 0.46, 0.46 ]ᵀ
       [ 1  0.1  0.9  0.09  0.01  0.81 ]
       [ 1  0.9  0.1  0.09  0.81  0.01 ]

𝒇̂ᶜ𝐰(𝐏(𝐗𝑛𝑒𝑤)) = sign(ŷ) = sign([ −0.82, −0.82, 0.46, 0.46 ]ᵀ) = [ −1, −1, +1, +1 ]ᵀ
→ Class −1, Class −1, Class +1, Class +1

25
© Copyright EE, NUS. All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored function, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary

Primal form   Learning:   ŵ = (𝐏ᵀ𝐏 + λ𝐈)⁻¹𝐏ᵀ𝐲
              Prediction: 𝒇̂𝐰(𝐏(𝐗𝑛𝑒𝑤)) = 𝐏𝑛𝑒𝑤ŵ

Dual form     Learning:   ŵ = 𝐏ᵀ(𝐏𝐏ᵀ + λ𝐈)⁻¹𝐲
              Prediction: 𝒇̂𝐰(𝐏(𝐗𝑛𝑒𝑤)) = 𝐏𝑛𝑒𝑤ŵ
Hint: python packages: sklearn.preprocessing (PolynomialFeatures), np.sign,
sklearn.model_selection (train_test_split), sklearn.preprocessing (OneHotEncoder)
26
© Copyright EE, NUS. All Rights Reserved.
