
EE2211 Introduction to Machine

Learning
Lecture 1
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Overview
• EE2211 is organized into 12 lectures, 12 tutorials, 3 assignments (40%
marks), 1 Quiz (30% marks) at mid-term, and 1 final Exam (30% marks).

• Tutorials 1+1 hours

• Lectures, tutorials (programming and coursework), quiz (mid-term) and final


exam are all conducted online. Videos of lectures are made available after
lectures.

• Important dates
– A1 will be released in Week 4 on Monday and submitted in 3 weeks
– A2 will be released in Week 6 on Monday and submitted in 4 weeks
– A3 will be released in Week 10 on Monday and submitted in 4 weeks
– Quiz (mid-term) will be on Week 8 lecture time (tentative).

EE2211 Introduction to Machine Learning L1.2


© Copyright EE, NUS. All Rights Reserved.
Reading List
Main textbooks: Book1 (text) and Book2 (python)
Supplementary reading: Book3

References
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”,
2019. (read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc.,
2017.
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for
people who want to analyze data”, Lean Publishing, 2015.

EE2211 Introduction to Machine Learning 3


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data engineering
– Introduction to probability and statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, linear regression
– Ridge regression, polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance Trade-off
– Optimization, gradient descent
– Decision trees, random forest
• Performance and More Algorithms (Haizhou, Xinchao)
– Performance issues
– K-means clustering
– Neural networks

EE2211 Introduction to Machine Learning 4


© Copyright EE, NUS. All Rights Reserved.
Introduction
Module I Content
• What is machine learning and types of learning
• How supervised learning works
• Regression and classification tasks
• Induction versus deduction reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

5
© Copyright EE, NUS. All Rights Reserved.
2001: A Space Odyssey

HAL listens, talks, sings, reads lips, plays chess, and solves problems !

© Copyright EE, NUS. All Rights Reserved.


AlphaGo

https://www.bbc.com/news/technology-35785875

https://www.businessinsider.com/googles-alphago-made-
artifical-intelligence-history-2016-3

EE2211 Introduction to Machine Learning 7


© Copyright EE, NUS. All Rights Reserved.
Shannon Game

the frog => the frog jumped


frog jumped => frog jumped into
jumped into => jumped into the
into the => into the pond

The Bit Player (2018) tells the story of an overlooked genius, Claude Shannon (the "Father of Information Theory").
C. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948.

8
© Copyright EE, NUS. All Rights Reserved.
What is Machine Learning?
• Machine learning is a subfield of computer science that is concerned with building algorithms which, to be useful, rely on a collection of examples of some phenomenon. – Andriy Burkov
• These examples can come from nature, be handcrafted by humans or generated by another algorithm.

Ref: Book1, chp1 , p3

EE2211 Introduction to Machine Learning 9


© Copyright EE, NUS. All Rights Reserved.
Early Definitions
• Arthur Samuel (1959): A field of study that gives
computers the ability to learn without being explicitly
programmed.

• Tom Mitchell (1998): A computer program is said to learn


from experience E with respect to some task T and some
performance measure P, if its performance on T, as
measured by P, improves with experience E.

How do we create computer programs that improve with experience?"


Tom Mitchell
http://videolectures.net/mlas06_mitchell_itm/

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Example: 2 or 7?

Experience (data) → process/learn from experience → write a programme to perform the Task → measure the result of the task → Performance (results, accuracy)

Image credit: Geoffrey Hinton and CIS 419/519, Eric Eaton

Task: Digit recognition


Performance: Classification accuracy
Experience: Labelled images
EE2211 Introduction to Machine Learning 11
© Copyright EE, NUS. All Rights Reserved.
Application Examples
• Speech recognition
• Face recognition
• Handwriting recognition
• Object recognition
• Housing price prediction
• Etc, etc.

EE2211 Introduction to Machine Learning 12


© Copyright EE, NUS. All Rights Reserved.
Types of Learning

• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning

Supervised Learning (x → y): discrete output y → Classification; continuous output y → Regression.
Unsupervised Learning: discrete structure → Clustering; continuous structure → Dimensionality Reduction.
Ref: Book1 and https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

EE2211 Introduction to Machine Learning 13


© Copyright EE, NUS. All Rights Reserved.
Supervised Learning

Training: labelled examples (Apple, Orange) are used to learn a Model.
Testing: the model predicts the label of a new example ("This is an orange").

EE2211 Introduction to Machine Learning 14


© Copyright EE, NUS. All Rights Reserved.
Supervised Learning
• In supervised learning, the dataset is the collection of labeled examples {(xᵢ, yᵢ)}, i = 1, …, N, where

   xᵢ = [x₁⁽ⁱ⁾; ⁞; x_D⁽ⁱ⁾]   or   xᵢ = [x₁, …, xⱼ, …, x_D]ᵀ,   i = 1, . . . , N

– Each element xᵢ among the N is called a feature vector.
• A feature vector is a vector in which each dimension j = 1, . . . , D contains a value that describes the example somehow.
– The label yᵢ can be either an element belonging to a finite set of classes {1, 2, . . . , C}, or a real number.
– Expensive in terms of time and resources
EE2211 Introduction to Machine Learning 15
© Copyright EE, NUS. All Rights Reserved.
• For instance, if your examples are email messages and your
problem is spam detection, then you have two classes {spam,
not-spam}.
• Classification: predict discrete valued output (e.g., 1 or 0)

1-Dimensional Case: x is a single feature (e.g., repeated word count) and y is the class label ("0" = not-spam, "1" = spam); a decision line (threshold) on x separates the two classes.

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Classification: Breast Cancer (malignant, benign)

2-Dimensional Case: x₁ = tumor size, x₂ = age; each example is labelled malignant (harmful) or benign (not harmful).

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Regression: predict continuous valued output
(e.g., house price prediction)

(price)
y

x
(size in square meters)

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
How Supervised Learning Works

Training: given data and known labels {xᵢ, yᵢ}ᵢ₌₁ᴹ, the goal is to learn the model's parameters.
Test: given novel data {x_k}ₖ₌₁ᴺ and the learned parameters, the goal is to predict the labels {y_k}ₖ₌₁ᴺ.

Slides courtesy: Professor Robby Tan
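A minimal sketch of this train/then-predict workflow (the toy fruit data and the choice of classifier are illustrative assumptions, not part of the lecture):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Training set: M labelled examples {(x_i, y_i)}; features = [weight (g), redness 0-1]
X_train = np.array([[150, 0.9], [170, 0.8], [140, 0.3], [130, 0.2]])
y_train = np.array([1, 1, 0, 0])              # known labels: 1 = apple, 0 = orange

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # learn the model's parameters

# Test set: N novel examples {x_k}; predict labels with the learned parameters
X_test = np.array([[160, 0.85], [135, 0.25]])
print(model.predict(X_test))                  # e.g. [1 0]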

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Example

source: SUTD

Task: Number of new cases prediction


Performance: Prediction accuracy
Experience: Historical data

EE2211 Introduction to Machine Learning L1.20


© Copyright EE, NUS. All Rights Reserved.
Example

* Ref: Book1, sec1.3, pp.5-7. EE2211 doesn't discuss constraint optimization in detail.
EE2211 Introduction to Machine Learning 21
© Copyright EE, NUS. All Rights Reserved.
Unsupervised Learning
• In unsupervised learning, the dataset is a collection of
unlabeled examples

• Again, x is a feature vector, and the goal of an


unsupervised learning algorithm is to create a model
that takes a feature vector x as input and either
transforms it into another vector or into a value that can
be used to solve a practical problem.

Ref: Book1, sec1.2.2

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Unsupervised Learning

Training

I found two
types of fruits!

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Supervised Learning

(age)
x2 2-Dimensional Case
Malignant (harmful)
Benign (not harmful)

x1 (tumor size)

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning

(age)
x2 2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 25


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning

(age) Find the distribution of data


x2 2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Pictorial summary
• Unsupervised Learning: Clustering

(age) Discover the underlying structures


x2 of data
2-Dimensional Case

x1 (tumor size)

EE2211 Introduction to Machine Learning 27


© Copyright EE, NUS. All Rights Reserved.
Example: Social Network

https://en.wikipedia.org/wiki/Social_network_analysis#/media/File:Kencf0618FacebookNetwork.jpg

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
Semi-Supervised Learning

Supervised Learning uses labelled data; Semi-supervised Learning uses labelled + unlabelled data (typically plenty of unlabelled data); Unsupervised Learning uses unlabelled data. In each case the data feed a learning model.

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Reinforcement Learning

action

Environment
Agent
S1 S2

reward

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
Reinforcement Learning

• A policy is a function (similar to the model in supervised


learning) that takes the feature vector of a state as input
and outputs an optimal action to execute in that state.

• The action is optimal if it maximizes the expected


average reward.
Multiple trials

* EE2211 doesn’t discuss reinforcement learning in detail.

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Inductive vs. Deductive Reasoning
Main task of Machine Learning: to make inferences

Types of inference

Inductive:
• Reaches probable conclusions.
• All needed information is unavailable or unknown, causing uncertainty in the conclusions.
• This is the setting of statistical machine learning (as opposed to logic machine learning); probability and statistics are required.

Deductive:
• Reaches logical conclusions deterministically: all information that can lead to the correct conclusion is available.

Basic properties:
• 0 ≤ p(x) ≤ 1
• ∫ p(x) dx = 1 (continuous), ∑ₓ p(x) = 1 (discrete)

Basic rules:
• Product rule. Independent variables: p(a,b) = p(a)p(b). Dependent variables: p(a,b) = p(a|b) p(b) = p(b|a) p(a)
• Sum rule (marginalization). Dependent variables: p(a) = ∑_b p(a,b)
Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Inductive Reasoning
Note: humans use inductive reasoning all the time and not
in a formal way like using probability/statistics.

Ref: Gardener, Martin (March 1979). "MATHEMATICAL GAMES: On the


fabric of inductive logic, and some probability paradoxes" (PDF). Scientific
American. 234

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 2
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

EE2211 Introduction to Machine Learning 1


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou/Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

EE2211 Introduction to Machine Learning 2


© Copyright EE, NUS. All Rights Reserved.
Data Engineering

Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

3
© Copyright EE, NUS. All Rights Reserved.
Ask the right questions
if you are to find the right answers

EE2211 Introduction to Machine Learning 4


© Copyright EE, NUS. All Rights Reserved.
Is artificial intelligence racist?
Racial and gender bias in AI

Joy Buolamwini: — Mit Lab Press Kit

EE2211 Introduction to Machine Learning 5


© Copyright EE, NUS. All Rights Reserved.
Business Problem
Top-down: start from the problem or questions (what problem can we solve with the data?), apply domain knowledge, gather information (examples, test cases), and then identify the data (what data should be used to solve the problem?).
Bottom-up: start from the data, derive information (examples, test cases), apply domain knowledge, and arrive at the problem or questions.
EE2211 Introduction to Machine Learning 6
© Copyright EE, NUS. All Rights Reserved.
• The data may not contain the answer.
• The combination of some data and an aching desire for an answer
does not ensure that a reasonable answer can be extracted from a
given body of data. John Tukey

EE2211 Introduction to Machine Learning 7


© Copyright EE, NUS. All Rights Reserved.
Types of Data
• Continuous
• Ordinal
• Categorical
• Missing
• Censored

What is data? Numbers, statistics, text, figures, records, facts.

EE2211 Introduction to Machine Learning 8


© Copyright EE, NUS. All Rights Reserved.
Continuous, discrete variables
• Continuous variables are anything measured on a
quantitative scale that could be any fractional number.
– An example would be something like weight measured in kg.
• Discrete variables are numeric variables that have a
countable number of values between any two values.
Temperature

Time
EE2211 Introduction to Machine Learning 9
© Copyright EE, NUS. All Rights Reserved.
Ordinal data
• Ordinal data are data that have a fixed, small number
(< 100) of possible values, called levels, that are
ordered (or ranked).

– Example: survey responses

Excellent Good Average Fair Poor

Ref: Book3, chapter 4.1.

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Categorical data
• Categorical data are data where there are multiple
categories, but they are not ordered.
– Examples: gender, blood type, name of fruits and their production
regions etc.

EE2211 Introduction to Machine Learning 11


© Copyright EE, NUS. All Rights Reserved.
Missing data
• Missing data are data that are missing and you do not
know the mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

NUS student Age Country of birth


Olivia Tan 20 Singapore
Hendra Setiawan 19 Indonesia
John Smith 19 NA

EE2211 Introduction to Machine Learning 12


© Copyright EE, NUS. All Rights Reserved.
Censored data
• Censored data are data where you know the missing
mechanism on some level.
– Common examples are a measurement being below a detection
limit or a patient being lost to follow-up.
– They should also be coded as NA when you don’t have the data.

NUS student Age Gender


Olivia Tan 20 F
Hendra Setiawan 19 M
Ah Beng NA M

EE2211 Introduction to Machine Learning 13


© Copyright EE, NUS. All Rights Reserved.
Data is divided into Categorical/Qualitative (Nominal, Ordinal) and Numerical/Quantitative (Discrete, Continuous) types.
• Nominal: no order (e.g., gender, religion)
• Ordinal: can be ordered (e.g., small/medium/large)
• Interval: no natural zero (e.g., temperature in Celsius)
• Ratio: includes a natural zero (e.g., temperature in Kelvin)
https://i.stack.imgur.com/J8Ged.jpg

EE2211 Introduction to Machine Learning 14


© Copyright EE, NUS. All Rights Reserved.
From computational viewpoint
Nominal Ordinal Interval Ratio

Frequency distribution Yes Yes Yes Yes

Median and percentiles No Yes Yes Yes

Add or subtract No No Yes Yes

Mean, standard deviation No No Yes Yes

Ratio No No No Yes

EE2211 Introduction to Machine Learning 15


© Copyright EE, NUS. All Rights Reserved.
Data Wrangling
• Data wrangling (cleaning + transform) is the process of
transforming and mapping data from one "raw" data
form into another format to make it more appropriate for
downstream analytics. (https://en.wikipedia.org/wiki/Data_wrangling)
• e.g. scaling, clipping, z-score (see next page)
• Data wrangling should not be performed blindly: we should know the reason for wrangling and why it is needed.

Pipeline: Import → Clean → Transform → Model / Visualize → Output

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Example

• Scaling to a range
– When the bounds or range of each independent dimension of
data is known, a common normalization technique is min-max.
advantage: Ensures standard scale (make sure all between 0 - 1)
• Feature clipping

https://developers.google.com/machine-learning/data-prep/transform/normalization

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Example
• Z-score standardization
– When the population of measurements of each independent
dimension of data is normally distributed where the parameters
are known, the standard score or z-score is a popular choice.
Z-score makes the mean = 0 and the standard deviation = 1.
Advantage: handles outliers well. Disadvantage: does not produce features with the same exact scale.

https://developers.google.com/machine-learning/data-prep/transform/normalization
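A small sketch (assumed, not from the slides) of the two normalizations above, applied column-wise to a feature matrix with plain NumPy:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 900.0]])

# Scaling to a range (min-max): every feature ends up in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: every feature gets mean 0 and standard deviation 1
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))   # approximately [0 0] and [1 1]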

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
Example (continuous vs discrete?)

Iris data set


• Measurement features
can be packed as

• Labels can be written as

https://archive.ics.uci.edu/ml/datasets/iris

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Example (ordinal)
• For data of ordinal scale, the exact numerical value
has no significance over the ranking order that it
carries.
• Suppose the ranks are given by r = 1, ..., R. Then the ranks can be normalized into standardized distance values d which fall within [0, 1], for example using d = (r − 1)/(R − 1).
Order of data matters (ranking)

Excellent Good Average Fair Poor

r = 5 4 3 2 1

EE2211 Introduction to Machine Learning 20


© Copyright EE, NUS. All Rights Reserved.
Example (categorical)
• Assign arbitrary numbers to represent the attributes.

– For example, one can assign a ‘1’ value for male and a ‘2’ value for
female for the case of gender attribute; ‘1’ value for spam and ‘0’
value for non-spam as in Lecture 1.

– However, when extremely large and extremely small values are involved in the computation, the label assigned a higher value may have a greater influence than the one assigned a lower value.

Order of data does not matter

EE2211 Introduction to Machine Learning 21


© Copyright EE, NUS. All Rights Reserved.
Example (categorical)
• Binary coding
– Common examples of binary coding schemes include the binary-
coded decimal (e.g., one-hot encoding), n-ary Gray codes.
– Sophisticated coding schemes take into account the probability
distribution of each attributes during conversion.
– One-hot encoding:
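A minimal sketch (assumed for illustration) of one-hot encoding a categorical attribute with NumPy; libraries such as pandas (get_dummies) or scikit-learn (OneHotEncoder) provide the same idea ready-made:

import numpy as np

colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))                  # ['blue', 'green', 'red']

one_hot = np.zeros((len(colours), len(categories)))
for row, value in enumerate(colours):
    one_hot[row, categories.index(value)] = 1      # exactly one 1 per row, rest 0

print(categories)
print(one_hot)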

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Data Cleaning
Data cleansing or data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate records from a
record set, table, or database.

Example (missing features)


• Dealing with missing features
– Removing the examples with missing features from the dataset
(that can be done if your dataset is big enough so you can sacrifice
some training examples);
– Using a learning algorithm that can deal with missing feature
values (depends on the library and a specific implementation of the
algorithm);
– Using a data imputation technique.

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Example (missing features)

Students Year of Gender Height GPA


Birth
Tan Ah Kow 1995 M 1.72 4.2
Ahmad Abdul X M 1.65 4.1
John Smith 1995 M 1.75 X
Chen Lulu 1995 F X 4.0
Raj Kumar 1995 M 1.73 4.5
Li Xiuxiu 1994 F 1.70 3.8

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Example (missing features)
• Imputation
– Replace the missing value of a feature by an average value of this
feature in the dataset:

– Replace the missing value with a value outside the normal range of
values.
• For example, if the normal range is [0, 1], then you can set the missing
value to 2 or −1. The idea is that the learning algorithm will learn what is
best to do when the feature has a value significantly different from
regular values.
• Alternatively, you can replace the missing value by a value in the middle
of the range. For example, if the range for a feature is [−1, 1], you can
set the missing value to be equal to 0. Here, the idea is that the value in
the middle of the range will not significantly affect the prediction.
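A small sketch (assumed, not from the slides) of the first imputation option, where missing entries coded as NaN are replaced by the column average:

import numpy as np

height = np.array([1.72, 1.65, 1.75, np.nan, 1.73, 1.70])   # one missing height

mean_height = np.nanmean(height)          # average over the observed values only
height_imputed = np.where(np.isnan(height), mean_height, height)

print(round(mean_height, 3), height_imputed)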

EE2211 Introduction to Machine Learning 25


© Copyright EE, NUS. All Rights Reserved.
Data integrity

• Data integrity is the maintenance of, and the assurance


of the accuracy and consistency of, data over its
entire life-cycle.
(Ref: https://en.wikipedia.org/wiki/Data_integrity)

• It is a critical aspect to the design, implementation and


usage of any system which stores, processes, or
retrieves data.
– Physical integrity (error-correction codes, check-sum, redundancy)
– Logical integrity (product price is a positive, use drop-down list)

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Data Integrity

http://www.financetwitter.com/
EE2211 Introduction to Machine Learning 27
© Copyright EE, NUS. All Rights Reserved.
Data Visualization

• https://towardsdatascience.com/data-visualization-with-mathplotlib-using-python-a7bfb4628ee3

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
Example: Showing Distribution

(a) a probability mass function, (b) a probability density function

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Caution: Same Statistics But Different Distribution
Anscombe's quartet (wiki)
You can explore data by
calculating summary
statistics, for example the
correlation between
variables.

However, all of these data sets have the exact same correlation and regression line.

By Anscombe.svg: Schutz Derivative works of this file:(label using subscripts): Avenue - Anscombe.svg, CC BY-SA
3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

• Data sets with identical means, variances and regression lines

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
Example: Boxplot
Boxplots alone are not good for showing the distribution of the data.

Ref: Michael Galarnyk, Understanding Boxplots, 2018

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Caution: Same Boxplots But Different Distribution

• These boxplots look very similar, but if you overlay the actual data points
you can see that they have very different distributions.

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Example: Showing Composition/ Comparison

• If we make this pie chart as a bar chart it is much easier to see that A is bigger than D

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
Example: Log Scale

• Without a logarithmic scale, 90% of the data are squeezed into the lower left-hand corner of this figure.

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 35


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 3
Semester 1
2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)
20 August 2021

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou, Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks

EE2211 Introduction to Machine Learning 2


© Copyright EE, NUS. All Rights Reserved.
Introduction to Probability and
Statistics
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule

3
© Copyright EE, NUS. All Rights Reserved.
Introduction to Linear Algebra
• Use of vector and matrix notation, especially with
multivariate statistics.

• Solutions to least squares and weighted least squares,


such as for linear regression.

• Estimates of mean and variance of data matrices.

• Principal component analysis for data reduction that


draws many of these elements together

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

4
© Copyright EE, NUS. All Rights Reserved.
Introduction to Linear Algebra
• A scalar is a simple numerical value, like 15 or −3.25
– Focus on real numbers
• Variables or constants that take scalar values are
denoted by an italic letter, like x or a

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

5
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• A vector is an ordered list of scalar values, called
attributes
– Denoted by a bold character, e.g. x or a
• In many books, vectors are written column-wise:
   a = [2, 3]ᵀ,  b = [−2, 5]ᵀ,  c = [1, 0]ᵀ
• The three vectors above are two-dimensional (or have two
elements)

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

6
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• We denote an attribute of a vector as an italic value with an index, e.g. a⁽ʲ⁾ or x⁽ʲ⁾
– The index j denotes a specific dimension of the vector, the position of an attribute in the list
   a = [a⁽¹⁾, a⁽²⁾]ᵀ = [2, 3]ᵀ, or more commonly a = [a₁, a₂]ᵀ = [2, 3]ᵀ

Note:
• The notation x⁽ʲ⁾ should not be confused with the power operator, such as the 2 in x² (squared) or 3 in x³ (cubed)
• The square of an indexed attribute of a vector is denoted by (x⁽ʲ⁾)²
• A variable can have two or more indices, like x_i⁽ʲ⁾ or x_(i,j)⁽ᵏ⁾
• For example, in neural networks, we denote by x_(l,u)⁽ʲ⁾ the input feature j of unit u in layer l
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

7
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
Vectors can be visualized as arrows that point to some directions as well
as points in a multi-dimensional space

Illustrations of three two-dimensional vectors, a = [2, 3]ᵀ, b = [−2, 5]ᵀ, and c = [1, 0]ᵀ

Figure 1: Three vectors visualized as directions and as points.


Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

8
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices
• A matrix is a rectangular array of numbers arranged in rows and columns
– Denoted with bold capital letters, such as X or W
– An example of a matrix with two rows and three columns:
   X = [2 4 −3; 21 −6 −1]
• A set is an unordered collection of unique elements
– Denoted as a calligraphic capital character, e.g., 𝒮
– When an element x belongs to a set 𝒮, we write x ∈ 𝒮
– A special set denoted ℛ includes all real numbers from minus infinity to plus infinity
Note:
• For the elements in matrix X, we shall use the indexing x₁,₁ where the first and the second indices respectively indicate the row and the column positions, or x₁⁽¹⁾.
• Usually, for input data, rows represent samples and columns represent features
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

9
© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices: Example
Iris data set
• Measurement features
can be packed as
X ∈ ℛ^(150×4)

• Labels can be written as

https://archive.ics.uci.edu/ml/datasets/iris

EE2211 Introduction to Machine Learning 10


© Copyright EE, NUS. All Rights Reserved.
Notations, Vectors, Matrices

• Capital Sigma: the summation over a collection X = {x₁, x₂, x₃, x₄, . . . , x_m} is denoted by:
   ∑ᵢ₌₁ᵐ xᵢ = x₁ + x₂ + … + x_(m−1) + x_m

• Capital Pi: the product over a collection X = {x₁, x₂, x₃, x₄, . . . , x_m} is denoted by:
   ∏ᵢ₌₁ᵐ xᵢ = x₁ · x₂ · … · x_(m−1) · x_m

Note:
• Capital Sigma and Pi can be applied to the attributes of a vector x

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.

11
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Linear dependence and independence

• A collection of d-vectors x₁, …, x_m (with m ≥ 1) is called linearly dependent if
   β₁x₁ + ⋯ + β_m x_m = 0   (a scalar multiple)
holds for some β₁, …, β_m that are not all zero.

• A collection of d-vectors x₁, …, x_m (with m ≥ 1) is called linearly independent if it is not linearly dependent, which means that
   β₁x₁ + ⋯ + β_m x_m = 0
only holds for β₁ = ⋯ = β_m = 0.

Note: If all rows or columns of a square matrix X are linearly independent, then X is invertible.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to
Applied Linear Algebra”, Cambridge University Press, 2018 (Chp5 & 11.1)

12
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Geometry of dependency and independency: in the 2-D panel, a and b point along the same line, so β₁a + β₂b = 0 for some nonzero β₁, β₂ (linearly dependent); in the 3-D panel, β₁a + β₂b ≠ β₃c, so a, b and c are linearly independent.

13
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector notation:
   Xw = y
where
   X = [x₁,₁ x₁,₂ … x₁,d; ⁞ ⁞ ⋱ ⁞; x_m,₁ x_m,₂ … x_m,d],   w = [w₁; ⁞; w_d],   y = [y₁; ⁞; y_m].

Note:
• The data matrix X ∈ ℛ^(m×d) and the target vector y ∈ ℛ^m are given
• The unknown vector of parameters w ∈ ℛ^d is to be learnt
• rank(X) corresponds to the maximal number of linearly independent columns/rows of X.

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, "Introduction to Applied Linear Algebra", Cambridge University Press, 2018 (Chp8.3)

14
© Copyright EE, NUS. All Rights Reserved.
Exercises

1. What is the rank of [1 2; 2 1]?

2. What is the rank of [1 −2 3; 0 −3 3; 1 1 0]?
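One way (assumed, not part of the slides) to check the two exercises with NumPy:

import numpy as np

A = np.array([[1, 2],
              [2, 1]])
B = np.array([[1, -2, 3],
              [0, -3, 3],
              [1,  1, 0]])

print(np.linalg.matrix_rank(A))   # 2: det(A) = -3, so both rows are independent
print(np.linalg.matrix_rank(B))   # 2: row 3 = row 1 - row 2, so only two independent rows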

EE2211 Introduction to Machine Learning 15


© Copyright EE, NUS. All Rights Reserved.
Causality
What is statistical causality or causation?
• In statistics, causation means that one thing will cause the
other, which is why it is also referred to as cause and effect.
• The gold standard for causal data analysis is to combine
specific experimental designs such as randomized studies
with standard statistical analysis techniques.
• Example:
– A Randomized Controlled Trial in medicine typically compares a proposed new
treatment against an existing standard of care (or a placebo) ; these are then termed
the 'experimental' and 'control' treatments respectively. This blinding principle is ideally
also extended as much as possible to other parties including researchers, technicians,
data analysts, and evaluators. Effective blinding experimentally isolates the
physiological effects of treatments from various psychological sources of bias.

EE2211 Introduction to Machine Learning 16


© Copyright EE, NUS. All Rights Reserved.
Correlation
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.
• Linear correlation coefficient, r, which is also known as
the Pearson Coefficient.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
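A quick sketch (assumed for illustration) of computing the Pearson coefficient r with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # roughly linear in x

r = np.corrcoef(x, y)[0, 1]                      # off-diagonal entry of the 2x2 correlation matrix
print(r)                                         # close to +1: strong positive linear relationship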

EE2211 Introduction to Machine Learning 17


© Copyright EE, NUS. All Rights Reserved.
Correlation does not imply causation
Causation between two events means they are correlated

• Most data analyses involve inference or prediction.


• Unless a randomized study is performed, it is difficult to infer why
there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are
clearly not causally related appear at http://tylervigen.com/
(See figure below).

EE2211 Introduction to Machine Learning 18


© Copyright EE, NUS. All Rights Reserved.
Correlation is a statistical relationship
– Decades of data show a clear causal relationship between smoking
and cancer.
– If you smoke, it is a sure thing that your risk of cancer will increase.
– But it is not a sure thing that you will get cancer.
– The causal effect is real, but it is an effect on your average risk.

EE2211 Introduction to Machine Learning 19


© Copyright EE, NUS. All Rights Reserved.
Caution

• Particular caution should be used when applying words


such as “cause” and “effect” when performing inferential
analysis.
• Causal language applied to even clearly labelled
inferential analyses may lead to misinterpretation - a
phenomenon called causation creep.

EE2211 Introduction to Machine Learning 20


© Copyright EE, NUS. All Rights Reserved.
Simpson’s paradox
Simpson's paradox is a phenomenon in probability and statistics, in
which a trend appears in several different groups of data but
disappears or reverses when these groups are combined.

https://en.wikipedia.org/wiki/Simpson%27s_paradox

EE2211 Introduction to Machine Learning 21


© Copyright EE, NUS. All Rights Reserved.
Example

Ref: Gardner, Martin (March 1979). "MATHEMATICAL GAMES: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American. 234

EE2211 Introduction to Machine Learning 22


© Copyright EE, NUS. All Rights Reserved.
Probability

• We describe a random experiment by describing its


procedure and observations of its outcomes.
• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment. This also means an outcome is not decomposable. All unique outcomes form a sample space.
• A subset of sample space 𝑆𝑆, denoted as 𝐴𝐴, is an event in
a random experiment 𝐴𝐴 ⊆ 𝑆𝑆, that is meaningful to the
application.

EE2211 Introduction to Machine Learning 23


© Copyright EE, NUS. All Rights Reserved.
Axioms of Probability

Assuming events A ⊆ S and B ⊆ S, the probabilities of events related with A and B must satisfy:

1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
   *otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

EE2211 Introduction to Machine Learning 24


© Copyright EE, NUS. All Rights Reserved.
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random phenomenon.
– Examples of random phenomena with a numerical outcome include
a toss of a coin (0 for heads and 1 for tails), a roll of a dice, or the
height of the first stranger you meet outside.
• There are two types of random variables:
– discrete and continuous.

s
R
X(s)
EE2211 Introduction to Machine Learning 25
© Copyright EE, NUS. All Rights Reserved.
Notations
• Some books used P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.

• We shall use Pr(·) for both the above cases.

EE2211 Introduction to Machine Learning 26


© Copyright EE, NUS. All Rights Reserved.
Discrete random variable
• A discrete random variable (DRV) takes on only a
countable number of distinct values such as red, yellow,
blue or 1, 2, 3, . . ..
• The probability distribution of a discrete random variable is
described by a list of probabilities associated with each of its possible
values.
- This list of probabilities is called a
probability mass function (pmf).
(Like a histogram, except that here
the probabilities sum to 1)

Ref: Book 1, Chapter 2.2.


A probability mass function
EE2211 Introduction to Machine Learning 27
© Copyright EE, NUS. All Rights Reserved.
• Let a discrete random variable X have k possible values {xᵢ}ᵢ₌₁ᵏ.
• The expectation of X, denoted E[X], is given by
   E[X] ≝ ∑ᵢ₌₁ᵏ xᵢ · Pr(X = xᵢ)
        = x₁ · Pr(X = x₁) + x₂ · Pr(X = x₂) + ··· + x_k · Pr(X = x_k)
  where Pr(X = xᵢ) is the probability that X has the value xᵢ according to the pmf.
• The expectation of a random variable is also called the mean, average or expected value and is frequently denoted with the letter μ.
• The expectation is one of the most important statistics of a random variable.

EE2211 Introduction to Machine Learning 28


© Copyright EE, NUS. All Rights Reserved.
• Another important statistic is the standard deviation, defined as
   σ ≝ √E[(X − μ)²].
• Variance, denoted as σ² or var(X), is defined as
   σ² = E[(X − μ)²]
• For a discrete random variable, the standard deviation is given by
   σ = √( Pr(X = x₁)(x₁ − μ)² + ⋯ + Pr(X = x_k)(x_k − μ)² )
where μ = E[X].
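A small sketch (assumed, the pmf values are made up) that evaluates E[X], var(X) and σ directly from the formulas above:

import numpy as np

x = np.array([1, 2, 3, 4])                 # possible values x_i
p = np.array([0.1, 0.2, 0.3, 0.4])         # pmf Pr(X = x_i), sums to 1

mu = np.sum(x * p)                         # E[X] = sum_i x_i * Pr(X = x_i)
var = np.sum(p * (x - mu) ** 2)            # E[(X - mu)^2]
sigma = np.sqrt(var)

print(mu, var, sigma)                      # 3.0, 1.0, 1.0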

EE2211 Introduction to Machine Learning 29


© Copyright EE, NUS. All Rights Reserved.
Continuous random variable
• A continuous random variable (CRV) takes an infinite
number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X
is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of the list of probabilities, the probability
distribution of a CRV (a continuous probability distribution) is
described by a probability density function (pdf).
– The pdf is a function whose codomain
is nonnegative and the area under the
curve is equal to 1.
*Point on graph does not indicate probability,
Probability = area under graph

CDF: point on curve represents probability

A probability density function

EE2211 Introduction to Machine Learning 30


© Copyright EE, NUS. All Rights Reserved.
• The expectation of a continuous random variable X is given by
   E[X] ≝ ∫_ℛ x f_X(x) dx
  where f_X is the pdf of the variable X and ∫_ℛ is the integral of the function x f_X over ℛ.
• The variance of a continuous random variable X is given by
   σ² ≝ ∫_ℛ (x − μ)² f_X(x) dx

• The integral is the equivalent of the summation over all values of the function when the function has a continuous domain.
• It equals the area under the curve of the function.
• The property of the pdf that the area under its curve is 1 mathematically means that ∫_ℛ f_X(x) dx = 1

EE2211 Introduction to Machine Learning 31


© Copyright EE, NUS. All Rights Reserved.
Mean and Standard Deviation of a Gaussian
distribution
(Figure: a Gaussian distribution with mean μ; regions covering 90% and 95% of the probability are marked; axes x₁ and x₂.)

EE2211 Introduction to Machine Learning 32


© Copyright EE, NUS. All Rights Reserved.
Example 1
Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)?
Pr(x=H, y=H) = Pr(x=H)Pr(y=H)
= (1/2)(1/2) = 1/4

Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 33


© Copyright EE, NUS. All Rights Reserved.
Example 2
Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first having B and then R?
• The space of outcomes of taking two balls
sequentially without replacement:
B–R
R–B Thus having B-R is ½ .
Mathematically:
Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B)
= 1×(1/2) = 1/2
Since we are given the first pick was B, and thus we
know the probability of the remaining ball to be R is 1. Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 34


© Copyright EE, NUS. All Rights Reserved.
Example 3
Dependent random variables
• Given 3 balls with different colors (R,G,B), what is the
probability of first having B and then G?
• The space of outcomes of taking two balls
sequentially without replacement:
R–G|G–B|B–R
R – B | G – R | B – G Thus Pr(y=G, x=B) = 1/6 .
Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B)
= (1/2) × (1/3)
= 1/6
Given that the first pick is B, then the remaining balls are
G and R, and thus the chance of picking up G is ½. Slides courtesy: Professor Robby Tan

EE2211 Introduction to Machine Learning 35


© Copyright EE, NUS. All Rights Reserved.
Bayes’ Rule
Thomas Bayes (1701 – 1761)

• The conditional probability Pr(Y = y | X = x) is the probability of the random variable Y having a specific value y given that another random variable X has a specific value x.
• Bayes' Rule (also known as Bayes' Theorem) stipulates that:

   Pr(Y = y | X = x) = Pr(X = x | Y = y) Pr(Y = y) / Pr(X = x)     (1)

   posterior = (likelihood × prior) / evidence
EE2211 Introduction to Machine Learning 36
© Copyright EE, NUS. All Rights Reserved.
Bayes' Rule
• Likelihood Pr(x | y): the propensity for observing a certain value of x given a certain value of y
• Prior Pr(y): what we know about y BEFORE seeing x
• Posterior Pr(y | x): what we know about y AFTER seeing x
• Evidence Pr(x): a constant that ensures the left-hand side is a valid distribution

   Pr(y | x) = Pr(y) Pr(x | y) / Pr(x) = Pr(y) Pr(x | y) / ∑_y Pr(y) Pr(x | y)

Adapted from S. Prince
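A numerical sketch of Bayes' rule for a two-class label y (the prior and likelihood values are assumed for illustration):

import numpy as np

prior = np.array([0.7, 0.3])        # Pr(y = 0), Pr(y = 1) before seeing x
likelihood = np.array([0.2, 0.9])   # Pr(x | y = 0), Pr(x | y = 1) for the observed x

evidence = np.sum(prior * likelihood)          # Pr(x) = sum_y Pr(y) Pr(x | y)
posterior = prior * likelihood / evidence      # Pr(y | x)

print(evidence)          # 0.41
print(posterior)         # [0.3415..., 0.6585...]; sums to 1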


© Copyright EE, NUS. All Rights Reserved.
The End

EE2211 Introduction to Machine Learning 38


© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 4
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Kar-Ann Toh, Thomas Yeo, Chen Khong, Helen Zhou, Vincent Tan, Robby Tan and
Haizhou Li)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability, Statistics, and Matrix
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Fundamental ML Algorithms:
Linear Regression
References for Lectures 4-6:
Main
• [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019.
(read first, buy later: http://themlbook.com/wiki/doku.php)
• [Book2] Andreas C. Muller and Sarah Guido, “Introduction to Machine
Learning with Python: A Guide for Data Scientists”, O’Reilly Media, Inc., 2017

Supplementary
• [Book3] Jeff Leek, “The Elements of Data Analytic Style: A guide for people
who want to analyze data”, Lean Publishing, 2015.
• [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied
Linear Algebra”, Cambridge University Press, 2018 (available online)
http://vmls-book.stanford.edu/
• [Ref 5] Professor Vincent Tan’s notes (chapters 4-6): (useful)
https://vyftan.github.io/papers/ee2211book.pdf

4
© Copyright EE, NUS. All Rights Reserved.
Recap on Notations, Vectors, Matrices
Scalar Numerical value 15, -3.5
Variable Take scalar values x or a

Vector An ordered list of scalar values x or 𝐚


𝑎1 2
Attributes of a vector 𝐚= 𝑎 =
2 3

Matrix A rectangular array of numbers 2 4


𝐗=
arranged in rows and columns 21 −6

Capital Sigma 𝑖=1 𝑥𝑖 = 𝑥1 + 𝑥2 + … + 𝑥𝑚−1 + 𝑥𝑚


∑𝑚

Capital Pi ∏𝑚
𝑖=1 𝑥𝑖 = 𝑥1 · 𝑥2 ·…· 𝑥𝑚−1 · 𝑥𝑚
5
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: summation and subtraction

   x + y = [x₁; x₂] + [y₁; y₂] = [x₁ + y₁; x₂ + y₂]

   x − y = [x₁; x₂] − [y₁; y₂] = [x₁ − y₁; x₂ − y₂]

6
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Operations on Vectors: scalar

   a x = a [x₁; x₂] = [a x₁; a x₂]

   (1/a) x = (1/a) [x₁; x₂] = [x₁/a; x₂/a]

7
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix or Vector Transpose:
   x = [x₁; x₂],   xᵀ = [x₁  x₂]

   X = [x₁,₁ x₁,₂ x₁,₃; x₂,₁ x₂,₂ x₂,₃; x₃,₁ x₃,₂ x₃,₃],   Xᵀ = [x₁,₁ x₂,₁ x₃,₁; x₁,₂ x₂,₂ x₃,₂; x₁,₃ x₂,₃ x₃,₃]

Python demo 1
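A possible version of "Python demo 1" (a reconstruction, not the original demo): transposing a vector and a matrix with NumPy.

import numpy as np

x = np.array([[1],
              [2]])                 # a 2x1 column vector
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(x.T)        # [[1 2]] -> a 1x2 row vector
print(X.T)        # rows and columns swapped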

8
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Dot Product or Inner Product of Vectors:
   x · y = xᵀy = [x₁  x₂][y₁; y₂] = x₁y₁ + x₂y₂

Geometric definition:
   x · y = ‖x‖ ‖y‖ cos θ
where θ is the angle between x and y, and ‖x‖ = √(x · x) is the Euclidean length of vector x
(‖x‖ cos θ is the length of the projection of x onto y).

E.g. a = [2; 3], c = [1; 0]  ⇒  a · c = 2·1 + 3·0 = 2
9
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Vector Product

   Wx = [w₁,₁ w₁,₂ w₁,₃; w₂,₁ w₂,₂ w₂,₃] [x₁; x₂; x₃]
      = [w₁,₁x₁ + w₁,₂x₂ + w₁,₃x₃; w₂,₁x₁ + w₂,₂x₂ + w₂,₃x₃]

10
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Vector-Matrix Product

   xᵀW = [x₁  x₂] [w₁,₁ w₁,₂ w₁,₃; w₂,₁ w₂,₂ w₂,₃]
       = [(x₁w₁,₁ + x₂w₂,₁)  (x₁w₁,₂ + x₂w₂,₂)  (x₁w₁,₃ + x₂w₂,₃)]

11
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix-Matrix Product

   XW = [x₁,₁ … x₁,d; ⁞ ⋱ ⁞; x_m,₁ … x_m,d] [w₁,₁ … w₁,h; ⁞ ⋱ ⁞; w_d,₁ … w_d,h]
      = [(x₁,₁w₁,₁ + ⋯ + x₁,d w_d,₁) … (x₁,₁w₁,h + ⋯ + x₁,d w_d,h); ⁞ ⋱ ⁞; (x_m,₁w₁,₁ + ⋯ + x_m,d w_d,₁) … (x_m,₁w₁,h + ⋯ + x_m,d w_d,h)]
      = [∑ᵢ₌₁ᵈ x₁,ᵢ wᵢ,₁ … ∑ᵢ₌₁ᵈ x₁,ᵢ wᵢ,h; ⁞ ⋱ ⁞; ∑ᵢ₌₁ᵈ x_m,ᵢ wᵢ,₁ … ∑ᵢ₌₁ᵈ x_m,ᵢ wᵢ,h]

If X is m × d and W is d × h, then the outcome is an m × h matrix.


12
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

Matrix inverse

Definition:
A d-by-d square matrix A is invertible (also nonsingular)
if there exists a d-by-d square matrix B such that
𝐀𝐁 = 𝐁𝐀 = 𝐈 (identity matrix)

   I = [1 0 … 0; 0 1 … 0; ⁞ ⋱ ⁞; 0 0 … 1]   (d-by-d dimension)

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

13
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Matrix inverse computation
   A⁻¹ = (1 / det(A)) adj(A)
• det(A) is the determinant of A
• adj(A) is the adjugate or adjoint of A

Determinant computation
Example: 2×2 matrix
   A = [a b; c d]
   det(A) = |A| = ad − bc
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

14
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
• adj(A) is the adjugate or adjoint of A
• adj(A) is the transpose of the cofactor matrix C of A ⇒ adj(A) = Cᵀ
• The minor of an element in a matrix A is defined as the determinant obtained by deleting the row and column in which that element lies
   A = [a₁₁ a₁₂ a₁₃; a₂₁ a₂₂ a₂₃; a₃₁ a₃₂ a₃₃],   minor of a₁₂ is M₁₂ = |a₂₁ a₂₃; a₃₁ a₃₃|

• The (i, j) entry of the cofactor matrix C is the minor of the (i, j) element times a sign factor
   Cofactor Cᵢⱼ = (−1)^(i+j) Mᵢⱼ

• The determinant of A can also be defined by minors as
   det(A) = ∑ⱼ₌₁ᵏ aᵢⱼ Cᵢⱼ = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ

Ref: https://en.wikipedia.org/wiki/Invertible_matrix

15
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Recall:  minor of a₁₂ is M₁₂ = |a₂₁ a₂₃; a₃₁ a₃₃|,   adj(A) = Cᵀ,
Cofactor Cᵢⱼ = (−1)^(i+j) Mᵢⱼ,   det(A) = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ

• E.g. A = [a b; c d]
   C = [d −c; −b a]
• adj(A) = Cᵀ = [d −b; −c a],   det(A) = |A| = ad − bc
   A⁻¹ = (1 / det(A)) adj(A) = (1 / (ad − bc)) [d −b; −c a]
Ref: https://en.wikipedia.org/wiki/Invertible_matrix

16
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

det(A) = ∑ⱼ₌₁ᵏ (−1)^(i+j) aᵢⱼ Mᵢⱼ
Determinant computation
Example: 3×3 matrix A = [a b c; d e f; g h i], expanding along the first row (i = 1):

   det(A) = a(ei − fh) − b(di − fg) + c(dh − eg)
Python demo 2 Ref: https://en.wikipedia.org/wiki/Determinant
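A possible version of "Python demo 2" (a reconstruction, not the original demo, with an assumed example matrix): the 3×3 determinant via the first-row cofactor expansion compared against NumPy.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
a, b, c = A[0]
d, e, f = A[1]
g, h, i = A[2]

det_cofactor = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
print(det_cofactor)               # -3.0
print(np.linalg.det(A))           # same value, up to floating-point error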

17
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎22 𝑎23
The minor of 𝑎11 = 𝑎 𝑎33
32

Ref: https://en.wikipedia.org/wiki/Determinant

18
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎21 𝑎23
The minor of 𝑎12 = 𝑎 𝑎33
31

Ref: https://en.wikipedia.org/wiki/Determinant

19
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎21 𝑎22
The minor of 𝑎13 = 𝑎 𝑎32
31

Ref: https://en.wikipedia.org/wiki/Determinant

20
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎12 𝑎13
The minor of 𝑎21 = 𝑎 𝑎33
32

Ref: https://en.wikipedia.org/wiki/Determinant

21
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices

𝑎11 𝑎13
The minor of 𝑎22 = 𝑎 𝑎33
31

Ref: https://en.wikipedia.org/wiki/Determinant

22
© Copyright EE, NUS. All Rights Reserved.
Operations on Vectors and Matrices
Example
Find the cofactor matrix of A given that A = [1 2 3; 0 4 5; 1 0 6].
Solution:
   a₁₁ ⇒ |4 5; 0 6| = 24,    a₁₂ ⇒ −|0 5; 1 6| = 5,    a₁₃ ⇒ |0 4; 1 0| = −4,
   a₂₁ ⇒ −|2 3; 0 6| = −12,   a₂₂ ⇒ |1 3; 1 6| = 3,    a₂₃ ⇒ −|1 2; 1 0| = 2,
   a₃₁ ⇒ |2 3; 4 5| = −2,    a₃₂ ⇒ −|1 3; 0 5| = −5,   a₃₃ ⇒ |1 2; 0 4| = 4.

The cofactor matrix C is thus [24 5 −4; −12 3 2; −2 −5 4].

Ref: https://www.mathwords.com/c/cofactor_matrix.htm

23
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

24
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

• Consider a system of m linear equations with d


variables or unknowns 𝑤1, … , 𝑤𝑑:

𝑥1,1 𝑤1 + 𝑥1,2 𝑤2 + ⋯ + 𝑥1,𝑑 𝑤𝑑 = 𝑦1


𝑥2,1 𝑤1 + 𝑥2,2 𝑤2 + ⋯ + 𝑥2,𝑑 𝑤𝑑 = 𝑦2

𝑥𝑚,1 𝑤1 + 𝑥𝑚,2 𝑤2 + ⋯ + 𝑥𝑚,𝑑 𝑤𝑑 = 𝑦𝑚 .

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

25
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
These equations can be written compactly in matrix-vector
notation:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

Note:
• The data matrix 𝐗 ∈ 𝓡𝑚×𝑑 and the target vector 𝐲 ∈ 𝓡𝑚 are given
• The unknown vector of parameters 𝐰 ∈ 𝓡𝑑 is to be learnt

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp8.3)

26
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
A set of linear equations can have no solution, one
solution, or multiple solutions:
𝐗𝐰 = 𝐲
Where
𝑥1,1 𝑥1,2 … 𝑥1,𝑑 𝑤1 𝑦1
𝐗= ⁞ ⁞ ⋱ ⁞ , 𝐰= ⁞ , 𝐲= ⁞ .
𝑥𝑚,1 𝑥𝑚,2 … 𝑥𝑚,𝑑 𝑤𝑑 𝑦𝑚

• X is square: even-determined, m = d (equal number of equations and unknowns)
• X is tall: over-determined, m > d (more equations than unknowns)
• X is wide: under-determined, m < d (fewer equations than unknowns)
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to
Applied Linear Algebra”, (Chp8.3 & 11) & [Ref 5] Tan’s notes, (Chp 4)

27
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
1. Square or even-determined system: 𝒎 = 𝒅
- Equal number of equations and unknowns, i.e., 𝐗 ∈ 𝓡𝑑×𝑑
- One unique solution if 𝐗 is invertible or all rows/columns of 𝐗 are
linearly independent
- If all rows or columns of 𝐗 are linearly independent, then 𝐗 is
invertible.

Solution:
If X is invertible (or X⁻¹X = I), then pre-multiply both sides by X⁻¹:
   X⁻¹Xw = X⁻¹y   ⇒   ŵ = X⁻¹y
(Note: we use a hat on top of w to indicate that it is a specific point in the space of w)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to


Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11)

28
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 1:   w₁ + w₂ = 4    (1)      Two unknowns
             w₁ − 2w₂ = 1   (2)      Two equations

In matrix form Xw = y:
   [1 1; 1 −2] [w₁; w₂] = [4; 1]

   ŵ = X⁻¹y = [1 1; 1 −2]⁻¹ [4; 1] = (−1/3) [−2 −1; −1 1] [4; 1] = [3; 1]      Python demo 3

(Recall: A⁻¹ = (1 / det(A)) adj(A),   adj(A) = Cᵀ = [d −b; −c a],   det(A) = ad − bc)
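A possible version of "Python demo 3" (a reconstruction, not the original demo): solving the square system of Example 1 with NumPy.

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -2.0]])
y = np.array([4.0, 1.0])

w_hat = np.linalg.inv(X) @ y       # w_hat = X^{-1} y
print(w_hat)                       # [3. 1.]
print(np.linalg.solve(X, y))       # same answer without explicitly forming the inverse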
29
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
2. Over-determined system: 𝒎 > 𝒅
– More equations than unknowns
– 𝐗 is non-square (tall) and hence not invertible
– Has no exact solution in general *
– An approximated solution is available using the left inverse
If the left-inverse of X exists such that X†X = I, then pre-multiplying both sides by X† results in
   X†Xw = X†y   ⇒   ŵ = X†y
Definition:
A matrix B that satisfies 𝑩𝒅 𝒙 𝒎 𝑨𝒎 𝒙 𝒅 = 𝐈 is called a left-inverse of 𝐀.
The left-inverse of 𝐗: 𝐗 † = (𝐗 𝑇 𝐗)−𝟏 𝐗 𝑇 given 𝐗 𝑇 𝐗 is invertible.
Note: * exception: when rank(𝐗) = rank([𝐗,𝐲]), there is a solution.
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

30
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 2:   w₁ + w₂ = 1   (1)      Two unknowns
             w₁ − w₂ = 0   (2)      Three equations
             w₁      = 2   (3)

In matrix form Xw = y:
   [1 1; 1 −1; 1 0] [w₁; w₂] = [1; 0; 2]

There is no exact solution. The approximated solution (XᵀX is invertible) is
   ŵ = X†y = (XᵀX)⁻¹Xᵀy = [3 0; 0 2]⁻¹ [1 1 1; 1 −1 0] [1; 0; 2] = [1; 0.5]      Python demo 4
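A possible version of "Python demo 4" (a reconstruction, not the original demo): the left-inverse (least-squares) solution of the over-determined Example 2.

import numpy as np

X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  0.0]])
y = np.array([1.0, 0.0, 2.0])

w_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # w_hat = (X^T X)^{-1} X^T y
print(w_hat)                                  # [1.  0.5]
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same answer via least squares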
31
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1
3. Under-determined system: 𝒎 < 𝒅
– More unknowns than equations
– Infinite number of solutions in general *

If the right-inverse of 𝐗 exists such that 𝐗𝐗 † = 𝐈, then the 𝑑-vector


𝐰 = 𝐗 † 𝐲 (one of the infinite cases) satisfies the equation 𝐗𝐰 = 𝐲, i.e.,
𝐗𝐰 = 𝐲 ⇒ 𝐗𝐗 † 𝐲 = 𝐲
⇒ 𝐈𝐲 = 𝐲
Definition:
A matrix B that satisfies 𝐀𝒎 𝒙 𝒅 𝐁𝒅 𝒙 𝒎 = 𝐈 is called a right-inverse of 𝐀.
The right-inverse of 𝐗: 𝐗 † = 𝐗 𝑇 (𝐗𝐗 𝑇 )−1 given 𝐗𝐗 𝑇 is invertible.
If 𝐗 is right−invertible, we can find a unique constrained solution.

Note: * exception: no solution if the system is inconsistent rank(𝐗) < rank([𝐗,𝐲]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (Chp11.1-11.2, 11.5)

32
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
3. Under-determined system: 𝒎 < 𝒅
Derivation:
𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

A unique solution is still possible by constraining the search to

𝐰 = 𝐗ᵀ𝐚

If 𝐗𝐗ᵀ is invertible, let 𝐰 = 𝐗ᵀ𝐚, then

𝐗𝐗ᵀ𝐚 = 𝐲
⇒ â = (𝐗𝐗ᵀ)⁻¹𝐲
⇒ ŵ = 𝐗ᵀâ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲
where 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹ = 𝐗† is the right-inverse.

33
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations
Example 3 𝑤1 + 2𝑤2 + 3𝑤3 = 2 (1) Three unknowns
𝑤1 − 2𝑤2 + 3𝑤3 = 1 (2) Two equations
      𝐗            𝐰        𝐲
[ 1   2   3 ]  [ 𝑤1 ]     [ 2 ]
[ 1  −2   3 ]  [ 𝑤2 ]  =  [ 1 ]
               [ 𝑤3 ]
Infinitely many solutions along the intersection line
Here 𝐗𝐗ᵀ is invertible

ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲
  = [ 1   1 ] [ 14   6 ]⁻¹ [ 2 ]     [ 0.15 ]
    [ 2  −2 ] [  6  14 ]   [ 1 ]  =  [ 0.25 ]         Constrained solution
    [ 3   3 ]                        [ 0.45 ]
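A minimal numpy sketch of Example 3 using the right-inverse (illustrative, not an official demo):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [1.0, -2.0, 3.0]])
y = np.array([2.0, 1.0])

# right-inverse: X^T (X X^T)^{-1}
w_hat = X.T @ np.linalg.inv(X @ X.T) @ y
print(w_hat)        # [0.15 0.25 0.45]
print(X @ w_hat)    # reproduces y exactly: [2. 1.]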

34
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Example 4 𝑤1 + 2𝑤2 + 3𝑤3 = 2 (1) Three unknowns


3𝑤1 + 6𝑤2 + 9𝑤3 = 1 (2) Two equations

      𝐗            𝐰        𝐲
[ 1   2   3 ]  [ 𝑤1 ]     [ 2 ]
[ 3   6   9 ]  [ 𝑤2 ]  =  [ 1 ]
               [ 𝑤3 ]

Both 𝐗𝐗 𝑇 and 𝐗 𝑇 𝐗 are not invertible!


There is no solution for the system
35
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear Equations

Module II Contents
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

36
© Copyright EE, NUS. All Rights Reserved.
Notations: Set
• A set is an unordered collection of unique elements
– Denoted by a calligraphic capital character, e.g., 𝓢, 𝓡, 𝓝, etc.
– When an element 𝑥 belongs to a set 𝓢, we write 𝑥 ∈ 𝓢
• A set of numbers can be finite - include a fixed amount of values
– Denoted using accolades (curly braces), e.g. {1, 3, 18, 23, 235} or {𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , . . . , 𝑥𝑑 }
• A set can be infinite and include all values in some interval
– If a set of real numbers includes all values between a and b, including a and
b, it is denoted using square brackets as [a, b]
– If the set does not include the values a and b, it is denoted using
parentheses as (a, b)
• Examples:
– The special set denoted by 𝓡 includes all real numbers from minus infinity
to plus infinity
– The set [0, 1] includes values like 0, 0.0001, 0.25, 0.9995, and 1.0

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).

37
© Copyright EE, NUS. All Rights Reserved.
Notations: Set operations

• Intersection of two sets:


𝓢3 ← 𝓢1 ∩ 𝓢2
Example: {1,3,5,8} ∩ {1,8,4} = {1,8}

• Union of two sets:


𝓢3 ← 𝓢1 ∪ 𝓢2
Example: {1,3,5,8} ∪ {1,8,4} = {1,3,4,5,8}
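A quick illustration of the two operations using Python's built-in set type:

s1 = {1, 3, 5, 8}
s2 = {1, 8, 4}
print(s1 & s2)   # intersection: {8, 1}
print(s1 | s2)   # union: {1, 3, 4, 5, 8}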

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p4 of chp2).

38
© Copyright EE, NUS. All Rights Reserved.
Functions
• A function is a relation that associates each element 𝑥 of a set 𝓧,
the domain of the function, to a single element 𝑦 of another set 𝓨,
the codomain of the function
• If the function is called f, this relation is denoted 𝑦 = 𝑓(𝑥)
– The element 𝑥 is the argument or input of the function
– 𝑦 is the value of the function or the output
• The symbol used for representing the input is the variable of the
function
– 𝑓(𝑥): 𝑓 is a function of the variable 𝑥; 𝑓(𝑥, 𝑤): 𝑓 is a function of the variables 𝑥 and 𝑤

[Figure: a function mapping the domain 𝓧 = {1,2,3,4} to the codomain 𝓨 = {1,2,3,4,5,6}; the range (or image) is {3,4,5,6}]
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6 of chp2). 39
© Copyright EE, NUS. All Rights Reserved.
Functions

• A scalar function can have vector argument


– E.g. 𝑦 = 𝑓(𝐱) = 𝑥1 + 𝑥2 +2𝑥3
• A vector function, denoted as 𝐲 = 𝐟(𝐱) is a function
that returns a vector 𝐲
– Input argument can be a vector 𝐲 = 𝐟(𝐱) or a scalar 𝐲 = 𝐟(𝑥)
– E.g. (vector argument)  𝑦1 = −𝑥1,  𝑦2 = 𝑥2
– E.g. (scalar argument)  𝑦1 = −2𝑥1,  𝑦2 = 3𝑥1

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p7 of chp2).

40
© Copyright EE, NUS. All Rights Reserved.
Functions
• The notation 𝑓: 𝓡𝑑 → 𝓡 means that 𝑓 is a function that
maps real d-vectors to real numbers
– i.e., 𝑓 is a scalar-valued function of d-vectors
• If 𝐱 is a d-vector argument, then 𝑓 𝐱 denotes the value
of the function 𝑓 at 𝐱
– i.e., 𝑓 𝐱 = 𝑓 𝑥1 , 𝑥2 , … , 𝑥𝑑 , 𝐱 ∈ 𝓡𝑑 , 𝑓 𝐱 ∈ 𝓡

• Example: we can define a function 𝑓: 𝓡4 → 𝓡 by


𝑓(𝐱) = 𝑥1 + 𝑥2 − 𝑥4²

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, 2018 (Ch 2, p29)

41
© Copyright EE, NUS. All Rights Reserved.
Functions

The inner product function


• Suppose 𝒂 is a d-vector. We can define a scalar valued function 𝑓 of d-
vectors, given by
𝑓 𝐱 = 𝒂𝑇 𝐱 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ 𝑎𝑑 𝑥𝑑 (1)
for any d-vector 𝐱
• The inner product of its d-vector argument 𝐱 with some (fixed) d-vector 𝒂
• We can also think of 𝑓 as forming a weighted sum of the elements of 𝐱;
the elements of 𝒂 give the weights 𝒂

𝜃
𝐱
𝒂 cos𝜃
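A minimal numpy sketch of the inner product function (the vectors below are illustrative):

import numpy as np

a = np.array([1.0, -2.0, 0.5])   # fixed weight vector
x = np.array([3.0, 1.0, 4.0])
print(a @ x)                     # a^T x = 1*3 - 2*1 + 0.5*4 = 3.0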
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

42
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity
• For any d-vector 𝐱 and any scalar 𝛼, 𝑓 𝛼𝐱 = 𝛼𝑓 𝐱
• Scaling the (vector) argument is the same as scaling the
function value

• Additivity
• For any d-vectors 𝐱 and 𝐲, 𝑓 𝐱 + 𝐲 = 𝑓 𝐱 + 𝑓 𝐲
• Adding (vector) arguments is the same as adding the function
values

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

43
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions
Superposition and linearity
• The inner product function 𝑓 𝐱 = 𝒂𝑇 𝐱 defined in equation (1)
(slide 9) satisfies the property
𝑓(𝛼𝐱 + 𝛽𝐲) = 𝒂ᵀ(𝛼𝐱 + 𝛽𝐲)
            = 𝒂ᵀ(𝛼𝐱) + 𝒂ᵀ(𝛽𝐲)
            = 𝛼(𝒂ᵀ𝐱) + 𝛽(𝒂ᵀ𝐲)
            = 𝛼𝑓(𝐱) + 𝛽𝑓(𝐲)
for all d-vectors 𝐱, 𝐲, and all scalars 𝛼, 𝛽.

• This property is called superposition, which consists of


homogeneity and additivity
• A function that satisfies the superposition property is called linear

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

44
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear Functions

• If a function 𝑓 is linear, superposition extends to linear


combinations of any number of vectors:
𝑓(𝛼1𝐱1 + ⋯ + 𝛼𝑘𝐱𝑘) = 𝛼1𝑓(𝐱1) + ⋯ + 𝛼𝑘𝑓(𝐱𝑘)
for any d-vectors 𝐱1, …, 𝐱𝑘 and any scalars 𝛼1, …, 𝛼𝑘.

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p30)

45
© Copyright EE, NUS. All Rights Reserved.
Functions

Linear and Affine Functions

A linear function plus a constant is called an affine function

A function 𝑓: 𝓡𝑑 → 𝓡 is affine if and only if it can be
expressed as 𝑓(𝐱) = 𝒂ᵀ𝐱 + 𝑏 for some d-vector 𝒂 and scalar 𝑏,
which is called the offset (or bias)

Example:
𝑓 𝐱 = 2.3 − 2𝑥1 + 1.3𝑥2 − 𝑥3

This function is affine, with 𝑏 = 2.3, 𝒂𝑇 = [−2, 1.3, −1].


Affine if there is a bias; linear if the bias is 0

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p32)

46
© Copyright EE, NUS. All Rights Reserved.
Functions

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p33)

47
© Copyright EE, NUS. All Rights Reserved.
Summary
• Operations on Vectors and Matrices Assignment 1 (week 6)
• Dot-product, matrix inverse Tutorial 4
• Systems of Linear Equations 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Set and Functions
𝐗 is Square   Even-determined    m = d   One unique solution in general                ŵ = 𝐗⁻¹𝐲
𝐗 is Tall     Over-determined    m > d   No exact solution in general;                 ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲   (left-inverse)
                                         an approximated solution
𝐗 is Wide     Under-determined   m < d   Infinite number of solutions in general;      ŵ = 𝐗ᵀ(𝐗𝐗ᵀ)⁻¹𝐲   (right-inverse)
                                         unique constrained solution

• Scalar and vector functions
• Inner product function
• Linear and affine functions

Python package numpy: inverse: numpy.linalg.inv(X); transpose: X.T

48
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 5
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Helen, Vincent, Chen Khong, Robby, and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Least Squares and Linear Regression
Module II Contents
• Notations, Vectors, Matrices (introduced in L3)
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Set and Functions
• Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Recap: Linear and Affine Functions

Linear Functions
A function 𝑓: 𝓡𝑑 → 𝓡 is linear if it satisfies the following two properties:

• Homogeneity: 𝑓(𝛼𝐱) = 𝛼𝑓(𝐱)   (scaling; no offset)
• Additivity: 𝑓(𝐱 + 𝐲) = 𝑓(𝐱) + 𝑓(𝐲)   (adding)

Inner product function


𝑓 𝐱 = 𝒂𝑇 𝐱 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ 𝑎𝑑 𝑥𝑑

Affine function
𝑓 𝐱 = 𝒂𝑇 𝐱 + 𝑏 scalar 𝑏 is called the offset (or bias)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (p31)

4
© Copyright EE, NUS. All Rights Reserved.
Functions: Maximum and Minimum
A local and a global minima of a function

• 𝑓(𝑥) has a local minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for every 𝑥 in some open
  interval around 𝑥 = 𝑐
• 𝑓(𝑥) has a global minimum at 𝑥 = 𝑐 if 𝑓(𝑥) ≥ 𝑓(𝑐) for all 𝑥 in the domain of 𝑓
[Figure: a curve over an interval with endpoints a and b (a < x ≤ b), showing a local minimum and a global minimum]
Note: An interval is a set of real numbers with the property that any number that lies
between two numbers in the set is also included in the set.
An open interval does not include its endpoints and is denoted using parentheses. E.g.
(0, 1) means “all numbers greater than 0 and less than 1”.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

5
© Copyright EE, NUS. All Rights Reserved.
Functions: Maximum and Minimum
Max and Arg Max
• Given a set of values 𝓐 = {𝑎1 , 𝑎2 , … , 𝑎𝑚 },
• The operator max 𝑓(𝑎) over 𝑎 ∈ 𝓐 returns the highest value 𝑓(𝑎) over all elements in
  the set 𝓐
• The operator arg max 𝑓(𝑎) over 𝑎 ∈ 𝓐 returns the element of the set 𝓐 that
  maximizes 𝑓(𝑎) (returns the input)

• When the set is implicit or infinite, we can simply write max𝑎 𝑓(𝑎) or arg max𝑎 𝑓(𝑎)
  E.g. 𝑓(𝑎) = 3𝑎, 𝑎 ∈ [0,1]  ⟹  max𝑎 𝑓(𝑎) = 3 and arg max𝑎 𝑓(𝑎) = 1

Min and Arg Min operate in a similar manner

Note: arg max returns a value from the domain of the function and max returns
from the range (codomain) of the function.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p6-7 of chp2).

6
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
• The derivative 𝒇′ of a function 𝒇 is a function that describes
how fast 𝒇 grows (or decreases)
– If the derivative is a constant value, e.g. 5 or −3
• The function 𝑓 grows (or decreases) constantly at any point x of its domain
– When the derivative 𝑓′ is a function
• If 𝑓′ is positive at some x, then the function 𝑓 grows at this point
• If 𝑓′ is negative at some x, then the function 𝑓 decreases at this point
• The derivative of zero at x means that the function’s slope at x is horizontal (e.g.
maximum or minimum points)

• The process of finding a derivative is called differentiation


• Gradient is the generalization of derivative for functions that
take several inputs (or one input in the form of a vector or
some other complex structure).

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p8 of chp2).

7
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
The gradient of a function is a vector of partial derivatives
Differentiation of a scalar function w.r.t. a vector
If 𝑓(𝐱) is a scalar function of d variables, 𝐱 is a d x1 vector.
Then differentiation of 𝑓(𝐱) w.r.t. 𝐱 results in a d x1 vector
d𝑓(𝐱)/d𝐱 = [ ∂𝑓/∂𝑥1, …, ∂𝑓/∂𝑥𝑑 ]ᵀ
This is referred to as the gradient of 𝑓(𝐱) and often written as 𝛻𝐱𝑓.
E.g. 𝑓(𝐱) = 𝑎𝑥1 + 𝑏𝑥2  ⟹  𝛻𝐱𝑓 = [𝑎, 𝑏]ᵀ
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)
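A small numerical check of the gradient example (with illustrative coefficients): the analytic gradient [a, b]ᵀ is compared against a central finite-difference approximation.

import numpy as np

a, b = 2.0, -3.0
f = lambda x: a * x[0] + b * x[1]

x0 = np.array([1.0, 1.0])
eps = 1e-6
# central differences along each coordinate direction
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
print(num_grad)    # approximately [ 2. -3.] = [a, b]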
8
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient
Partial Derivatives
Differentiation of a vector function w.r.t. a vector
If 𝐟(𝐱) is a vector function of size h x1 and 𝐱 is a d x1 vector.
Then differentiation of 𝐟(𝐱) results in a h x d matrix
d𝐟(𝐱)/d𝐱 = [ ∂𝑓1/∂𝑥1  …  ∂𝑓1/∂𝑥𝑑 ]
           [    ⋮      ⋱     ⋮   ]
           [ ∂𝑓ℎ/∂𝑥1  …  ∂𝑓ℎ/∂𝑥𝑑 ]
The matrix is referred to as the Jacobian of 𝐟(𝐱)

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

9
© Copyright EE, NUS. All Rights Reserved.
Derivative and Gradient

Some Vector-Matrix Differentiation Formulae

d(𝐀𝐱)/d𝐱 = 𝐀

d(𝒃ᵀ𝐱)/d𝐱 = 𝒃          d(𝐲ᵀ𝐀𝐱)/d𝐱 = 𝐀ᵀ𝐲

d(𝐱ᵀ𝐀𝐱)/d𝐱 = (𝐀 + 𝐀ᵀ)𝐱

Derivations: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Appendix)

10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

• Linear regression is a popular regression learning


algorithm that learns a model which is a linear
combination of features of the input example.

𝐗𝐰 = 𝐲, 𝐗 ∈ 𝓡𝑚×𝑑 , 𝐰 ∈ 𝓡𝑑×1 , 𝐲 ∈ 𝓡𝑚×1

𝐗 = [ 𝑥1,1  𝑥1,2  …  𝑥1,𝑑 ]      𝐰 = [ 𝑤1 ]      𝐲 = [ 𝑦1 ]
    [  ⁞     ⁞    ⋱    ⁞  ]          [  ⁞ ]          [  ⁞ ]
    [ 𝑥𝑚,1  𝑥𝑚,2  …  𝑥𝑚,𝑑 ]          [ 𝑤𝑑 ]          [ 𝑦𝑚 ]

Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (p3 of chp3).

11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Problem Statement: To predict the unknown 𝑦 for a given 𝐱 (testing)
• We have a collection of labeled examples (training) {(𝐱𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚
– 𝑚 is the size of the collection
– 𝐱 𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚 (input)
– y𝑖 is a real-valued target (1-D)
– Note:
• when y𝑖 is continuous valued, it is a regression problem
• when y𝑖 is discrete valued, it is a classification problem

• We want to build a model 𝑓𝐰,𝑏 (𝐱) as a linear combination of features of


example 𝐱: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏
where 𝐰 is a d-dimensional vector of parameters and 𝑏 is a real number.
• The notation 𝑓𝐰,𝑏 means that the model 𝑓 is parametrized by two values: w
and b
Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

12
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Learning objective function


• To find the optimal values for w*
and b* which minimizes the
following expression:
(1/𝑚) Σᵢ₌₁ᵐ (𝑓𝐰,𝑏(𝐱𝑖) − y𝑖)²
• In mathematics, the expression we
minimize or maximize is called an
objective function, or, simply, an
objective

(𝑓𝐰 𝐱𝑖 − y𝑖 )2 is called the loss function: a measure of the difference


between 𝑓𝐰 𝐱 𝑖 and y𝑖 or a penalty for misclassification of example i.
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

13
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning objective function (using simplified notation hereon)
• To find the optimal values for w* which minimizes the
following expression:
Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)²
with 𝑓𝐰(𝐱𝑖) = 𝐱𝑖ᵀ𝐰,
where we define 𝐰 = [𝑏, 𝑤1 , … 𝑤𝑑 ]𝑇 = [𝑤0 , 𝑤1 , … 𝑤𝑑 ]𝑇 ,
and 𝐱 𝑖 = [1, 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 = [𝑥𝑖,0 , 𝑥𝑖,1 , … 𝑥𝑖,𝑑 ]𝑇 , 𝑖 = 1, … , 𝑚

• This particular choice of the loss function is called squared


error loss
Note: The normalization factor 1/𝑚 can be omitted as it does not affect the optimization.

14
© Copyright EE, NUS. All Rights Reserved.
Linear Regression      Σᵢ₌₁ᵐ (𝑓𝐰,𝑏(𝐱𝑖) − y𝑖)²     (𝑓𝐰,𝑏(𝐱𝑖): predicted value; y𝑖: true label)

• All model-based learning algorithms have a loss function


• What we do to find the best model is to minimize the
objective known as the cost function
• Cost function is a sum of loss functions over training set
plus possibly some model complexity penalty (regularization)

• In linear regression, the cost function is given by the average


loss, also called the empirical risk because we do not have
all the data (e.g. testing data)
– The average of all penalties is obtained by applying the
model to the training data
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019 (chp3.1.2)

15
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning (Training)
• Consider the set of feature vectors 𝐱𝑖 and target outputs 𝑦𝑖 indexed by
  𝑖 = 1, …, 𝑚. A linear model 𝑓𝐰(𝐱) = 𝐱ᵀ𝐰 can be stacked as
      𝒇𝐰(𝐗) = 𝐗𝐰 = [𝐱1ᵀ𝐰, …, 𝐱𝑚ᵀ𝐰]ᵀ     (learning model)
      𝐲 = [𝑦1, …, 𝑦𝑚]ᵀ                   (learning target vector)
  where 𝐱𝑖ᵀ𝐰 = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑][𝑏, 𝑤1, …, 𝑤𝑑]ᵀ
Note: The bias/offset term is responsible for translating the line/plane/hyperplane
away from the origin.

16
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Least Squares Regression


In vector-matrix notation, the minimization of the objective
function can be written compactly using 𝐞 = 𝐗𝐰 − 𝐲 :
J(𝐰) = 𝐞𝑇 𝐞
= (𝐗𝐰 − 𝐲)𝑇 (𝐗𝐰 − 𝐲)
= (𝐰 𝑇 𝐗 𝑇 − 𝐲 𝑇 )(𝐗𝐰 − 𝐲)
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 𝐰 𝑇 𝐗 𝑇 𝐲 − 𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲
= 𝐰 𝑇 𝐗 𝑇 𝐗𝐰 − 2𝐲 𝑇 𝐗𝐰 + 𝐲 𝑇 𝐲.
Note: when 𝒇𝐰(𝐗) = 𝐗𝐰, then
Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲).

17
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Differentiating J(𝐰) with respect to 𝐰 and setting the result to 𝟎:
∂J(𝐰)/∂𝐰 = 𝟎
∂(𝐰ᵀ𝐗ᵀ𝐗𝐰 − 2𝐲ᵀ𝐗𝐰 + 𝐲ᵀ𝐲)/∂𝐰 = 𝟎
⇒ 2𝐗ᵀ𝐗𝐰 − 2𝐗ᵀ𝐲 = 𝟎
⇒ 𝐗ᵀ𝐗𝐰 = 𝐗ᵀ𝐲
⇒ Any minimizer ŵ of J(𝐰) must satisfy 𝐗ᵀ(𝐗ŵ − 𝐲) = 𝟎.
If 𝐗ᵀ𝐗 is invertible, then
Learning/training:   ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
Prediction/testing:  𝒇̂𝐰(𝐗𝑛𝑒𝑤) = 𝐗𝑛𝑒𝑤ŵ
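A minimal sketch of these two steps, assuming 𝐗 already includes the bias column of 1s and 𝐗ᵀ𝐗 is invertible (function names are illustrative):

import numpy as np

def fit_linear(X, y):
    # w_hat = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

def predict_linear(X_new, w_hat):
    # f_hat(X_new) = X_new w_hat
    return X_new @ w_hat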

18
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Example 1    Training set {(𝑥𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥 = −9} → {𝑦 = −6}
             {𝑥 = −7} → {𝑦 = −6}
             {𝑥 = −5} → {𝑦 = −4}
             {𝑥 =  1} → {𝑦 = −1}
             {𝑥 =  5} → {𝑦 =  1}
             {𝑥 =  9} → {𝑦 =  4}

      𝐗         𝐰        𝐲
[ 1  −9 ]             [ −6 ]
[ 1  −7 ]             [ −6 ]
[ 1  −5 ]  [ 𝑤0 ]     [ −4 ]          (𝑤0: offset term)
[ 1   1 ]  [ 𝑤1 ]  =  [ −1 ]
[ 1   5 ]             [  1 ]
[ 1   9 ]             [  4 ]

This set of linear equations has no exact solution
However, 𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [  6   −6 ]⁻¹ [  1   1   1  1  1  1 ] [ −6, −6, −4, −1, 1, 4 ]ᵀ  =  [ −1.4375 ]
    [ −6  262 ]   [ −9  −7  −5  1  5  9 ]                               [  0.5625 ]

Equation of the fitted line: y = 0.5625x − 1.4375
19
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
ŷ = 𝐗ŵ = 𝐗 [ −1.4375, 0.5625 ]ᵀ
y = −1.4375 + 0.5625x

Prediction:
Test set {𝑥 = −1} → {𝑦 = ?}

ŷ = [ 1  −1 ] [ −1.4375, 0.5625 ]ᵀ = −2
Linear Regression on one-dimensional samples

Python demo 1
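A minimal numpy sketch of Example 1 (not the official Python demo 1):

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-6.0, -6.0, -4.0, -1.0, 1.0, 4.0])

X = np.column_stack([np.ones_like(x), x])        # prepend the bias column
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)                                     # [-1.4375  0.5625]

X_test = np.array([[1.0, -1.0]])
print(X_test @ w_hat)                            # [-2.]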
20
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (add a column of 1s for the offset in 𝐗)

Example 2    Training set {(𝐱𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦 = 1}
             {𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦 = 0}
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦 = 2}
             {𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦 = −1}

      𝐗            𝐰        𝐲
[ 1   1   1 ]            [  1 ]
[ 1  −1   1 ]  [ 𝑤1 ]    [  0 ]
[ 1   1   3 ]  [ 𝑤2 ]  = [  2 ]
[ 1   1   0 ]  [ 𝑤3 ]    [ −1 ]

This set of linear equations has no exact solution
However, 𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [ 1, 0, 2, −1 ]ᵀ     [ −0.7500 ]
    [ 2  4   3 ]   [ 1  −1  1   1 ]                   =  [  0.1786 ]
    [ 5  3  11 ]   [ 1   1  3   0 ]                      [  0.9286 ]

21
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
The four linear equations
Prediction:
Test set:
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → {𝑦 = ?}
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → {𝑦 = ?}

ŷ = 𝒇̂𝐰(𝐗𝑛𝑒𝑤) = 𝐗𝑛𝑒𝑤ŵ
  = [ 1  6   8 ] [ −0.7500, 0.1786, 0.9286 ]ᵀ  =  [  7.7500 ]
    [ 1  0  −1 ]                                  [ −1.6786 ]

22
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Learning of Vectored Function (Multiple Outputs)
For one sample: a linear model 𝐟𝐰(𝐱) = 𝐱ᵀ𝐖     (vector function)
For m samples:  𝐅𝐰(𝐗) = 𝐗𝐖 = 𝐘

𝐗 = [𝐱1ᵀ; … ; 𝐱𝑚ᵀ], where row 𝑖 is 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]   (samples 1 to m)

𝐖 = [ 𝑤0,1 … 𝑤0,ℎ ]        𝐘 = [ 𝑦1,1 … 𝑦1,ℎ ]   (sample 1’s output)
    [ 𝑤1,1 … 𝑤1,ℎ ]            [  ⁞   ⋱   ⁞  ]
    [  ⁞   ⋱   ⁞  ]            [ 𝑦𝑚,1 … 𝑦𝑚,ℎ ]   (sample m’s output)
    [ 𝑤𝑑,1 … 𝑤𝑑,ℎ ]

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


23
© Copyright EE, NUS. All Rights Reserved.
Linear Regression
Objective: Σᵢ₌₁ᵐ (𝐟𝐰(𝐱𝑖) − 𝐲𝑖)² = 𝐄ᵀ𝐄

Least Squares Regression of Multiple Outputs


In matrix notation, the sum of squared errors cost
function can be written compactly using 𝐄 = 𝐗𝐖 − 𝐘:

J(𝐖) = trace(𝐄𝑇 𝐄)
= trace[(𝐗𝐖 − 𝐘)𝑇 (𝐗𝐖 − 𝐘)]

If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐖 ෡ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐘
Prediction/testing: 𝐅෠𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐖 ෡

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2 nd ed., 12th printing) 2017 (chp.3.2.4)

24
© Copyright EE, NUS. All Rights Reserved.
Linear Regression

Least Squares Regression of Multiple Outputs

J(𝐖) = trace(𝐄ᵀ𝐄)
     = trace( [𝐞1 𝐞2 … 𝐞ℎ]ᵀ [𝐞1 𝐞2 … 𝐞ℎ] )
     = trace of the ℎ×ℎ matrix whose (𝑖, 𝑗) entry is 𝐞𝑖ᵀ𝐞𝑗
     = Σₖ₌₁ʰ 𝐞𝑘ᵀ𝐞𝑘

25
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 1} → {𝑦1 = 1, 𝑦2 = 0}
{𝑥1 = 1, 𝑥2 = −1, 𝑥3 = 1} → {𝑦1 = 0, 𝑦2 = 1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 3} → {𝑦1 = 2, 𝑦2 = −1}
{𝑥1 = 1, 𝑥2 = 1, 𝑥3 = 0} → {𝑦1 = −1, 𝑦2 = 3}

      𝐗               𝐖                𝐘
[ 1   1   1 ]  [ 𝑤1,1  𝑤1,2 ]      [  1   0 ]
[ 1  −1   1 ]  [ 𝑤2,1  𝑤2,2 ]  =   [  0   1 ]        (first column of 𝐗 = bias)
[ 1   1   3 ]  [ 𝑤3,1  𝑤3,2 ]      [  2  −1 ]
[ 1   1   0 ]                      [ −1   3 ]

This set of linear equations has NO exact solution; least-squares approximation
(𝐗ᵀ𝐗 is invertible):

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [  1   0 ]     [ −0.75     2.25   ]
    [ 2  4   3 ]   [ 1  −1  1   1 ] [  0   1 ]  =  [  0.1786   0.0357 ]
    [ 5  3  11 ]   [ 1   1  3   0 ] [  2  −1 ]     [  0.9286  −1.2143 ]
                                    [ −1   3 ]

26
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 3
Prediction:
Test set: two new samples
{𝑥1 = 1, 𝑥2 = 6, 𝑥3 = 8} → { 𝑦1 =? , 𝑦2 =? }
{𝑥1 = 1, 𝑥2 = 0, 𝑥3 = −1} → { 𝑦1 =? , 𝑦2 =? }

Ŷ = 𝐗𝑛𝑒𝑤Ŵ
  = [ 1  6   8 ] [ −0.75     2.25   ]     [  7.75    −7.25   ]
    [ 1  0  −1 ] [  0.1786   0.0357 ]  =  [ −1.6786   3.4643 ]
                 [  0.9286  −1.2143 ]
(first column of 𝐗𝑛𝑒𝑤 = bias)

Python demo 2
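A minimal numpy sketch of Example 3 with two outputs (not the official Python demo 2):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])     # first column acts as the bias
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0],
              [-1.0, 3.0]])

W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y          # (X^T X)^{-1} X^T Y
X_new = np.array([[1.0, 6.0, 8.0],
                  [1.0, 0.0, -1.0]])
print(X_new @ W_hat)    # approx [[ 7.75   -7.25  ], [-1.6786  3.4643]]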

27
© Copyright EE, NUS. All Rights Reserved.
Linear Regression of multiple outputs
Example 4
The values of feature x and their corresponding values of multiple
outputs target y are shown in the table below.

Based on the least square regression, what are the values of w?


Based on the current mapping, when x = 2, what is the value of y?

x [3] [4] [10] [6] [7]

y [0, 5] [1.5, 4] [-3, 8] [-4, 10] [1, 6]

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘 = [  1.9      3.6 ]
                        [ −0.4667   0.5 ]              Python demo 3

Ŷ𝑛𝑒𝑤 = 𝐗𝑛𝑒𝑤Ŵ = [ 1  2 ] Ŵ = [ 0.9667  4.6 ]     (prediction for x = 2)
*Prediction must be close to / within the observed range.
28
© Copyright EE, NUS. All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Dot-product, matrix inverse
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Matrix-vector notation, linear dependency, invertible
• Even-, over-, under-determined linear systems
• Functions, Derivative and Gradient
• Inner product, linear/affine functions
• Maximum and minimum, partial derivatives, gradient
• Least Squares, Linear Regression
• Objective function, loss function
• Least square solution, training/learning and testing/prediction
• Linear regression with multiple outputs
Learning/training:   ŵ = (𝐗𝑡𝑟𝑎𝑖𝑛ᵀ𝐗𝑡𝑟𝑎𝑖𝑛)⁻¹𝐗𝑡𝑟𝑎𝑖𝑛ᵀ𝐲𝑡𝑟𝑎𝑖𝑛
Prediction/testing:  𝐲𝑡𝑒𝑠𝑡 = 𝐗𝑡𝑒𝑠𝑡ŵ
• Classification
• Ridge Regression
• Polynomial Regression
Python packages: numpy, pandas, matplotlib.pyplot, numpy.linalg, and sklearn.metrics (for mean_squared_error), numpy.linalg.pinv
29
© Copyright EE, NUS. All Rights Reserved.
EE2211 Introduction to Machine
Learning
Lecture 6
Semester 1
2021/2022

Helen Juan Zhou


[email protected]

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
Mid-term: Oct 5th
– Data Engineering Week 8 lecture slot
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

2
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression & Polynomial
Regression

Module II Contents
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations
• Functions, Derivative and Gradient
• Least Squares, Linear Regression
• Linear Regression with Multiple Outputs
• Linear Regression for Classification
• Ridge Regression
• Polynomial Regression

3
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Scalar Function (Single Output)
For one sample: a linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 scalar function
For m samples: 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
𝐲 = [ 𝐱1ᵀ𝐰, …, 𝐱𝑚ᵀ𝐰 ]ᵀ,   where 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]

𝐗 = [ 1  𝑥1,1  …  𝑥1,𝑑 ]      𝐰 = [ 𝑏, 𝑤1, …, 𝑤𝑑 ]ᵀ      𝐲 = [ 𝑦1, …, 𝑦𝑚 ]ᵀ
    [ ⁞    ⁞    ⋱   ⁞  ]
    [ 1  𝑥𝑚,1  …  𝑥𝑚,𝑑 ]

Objective: Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = 𝐞ᵀ𝐞 = (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲)
Learning/training when 𝐗 𝑇 𝐗 is invertible
ෝ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐲
Least square solution: 𝐰
Prediction/testing: 𝒚𝑛𝑒𝑤 = 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰

4
© Copyright EE, NUS. All Rights Reserved.
Review: Linear Regression
Learning of Vectored Function (Multiple Outputs)
𝐅𝐰 𝐗 = 𝐗𝐖 = 𝐘
𝐗 = [𝐱1ᵀ; … ; 𝐱𝑚ᵀ], where row 𝑖 is 𝐱𝑖ᵀ = [1, 𝑥𝑖,1, …, 𝑥𝑖,𝑑]   (samples 1 to m)

𝐖 = [ 𝑤0,1 … 𝑤0,ℎ ]        𝐘 = [ 𝑦1,1 … 𝑦1,ℎ ]   (sample 1’s output)
    [ 𝑤1,1 … 𝑤1,ℎ ]            [  ⁞   ⋱   ⁞  ]
    [  ⁞   ⋱   ⁞  ]            [ 𝑦𝑚,1 … 𝑦𝑚,ℎ ]   (sample m’s output)
    [ 𝑤𝑑,1 … 𝑤𝑑,ℎ ]

Least Squares Regression

If 𝐗 𝑇 𝐗 is invertible, then
Learning/training: 𝐖 ෡ = (𝐗 𝑇 𝐗)−1 𝐗 𝑇 𝐘
Prediction/testing: 𝐅෠𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐖 ෡

𝐗 ∈ 𝓡𝑚×(𝑑+1) , 𝐖 ∈ 𝓡(𝑑+1)×ℎ , 𝐘 ∈ 𝓡𝑚×ℎ


5
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


• We have a collection of labeled examples
• 𝑚 is the size of the collection
• 𝐱 𝑖 is the d-dimensional feature vector of example 𝑖 = 1, … , 𝑚
• 𝑦𝑖 is discrete target label (e.g., 𝑦𝑖 ∈ {−1, +1} or {0, 1} for
binary classification problems)
• Note:
• when 𝑦𝑖 is continuous valued  a regression problem
• when 𝑦𝑖 is discrete valued a classification problem
• Linear model: 𝑓𝐰,𝑏 𝐱 = 𝐱 𝑇 𝐰 + 𝑏 or in compact form 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰
(having the offset term absorbed into the inner product)

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

6
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)

Linear Methods for Classification


Binary Classification:
If 𝐗 𝑇 𝐗 is invertible, then
Learning:   ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲,   𝑦𝑖 ∈ {−1, +1}, 𝑖 = 1, …, 𝑚
Prediction: 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = sign(𝐱𝑛𝑒𝑤ᵀŵ)  for each row 𝐱𝑛𝑒𝑤ᵀ of 𝐗𝑛𝑒𝑤

[Figure: the sign function, sign(𝑎) = +1 for 𝑎 > 0 and −1 for 𝑎 < 0]

Ref: [Book4] Stephen Boyd and Lieven Vandenberghe, “Introduction to Applied Linear Algebra”, Cambridge University Press, 2018 (chp.14)

7
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1    Training set {(𝑥𝑖, y𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥 = −9} → {𝑦 = −1}
             {𝑥 = −7} → {𝑦 = −1}
             {𝑥 = −5} → {𝑦 = −1}
             {𝑥 =  1} → {𝑦 = +1}
             {𝑥 =  5} → {𝑦 = +1}
             {𝑥 =  9} → {𝑦 = +1}

      𝐗         𝐰        𝐲
[ 1  −9 ]             [ −1 ]
[ 1  −7 ]             [ −1 ]
[ 1  −5 ]  [ 𝑤0 ]     [ −1 ]          (bias column)
[ 1   1 ]  [ 𝑤1 ]  =  [  1 ]
[ 1   5 ]             [  1 ]
[ 1   9 ]             [  1 ]

This set of linear equations has NO exact solution
𝐗ᵀ𝐗 is invertible; least-squares approximation:

ŵ = 𝐗†𝐲 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  = [  6   −6 ]⁻¹ [  1   1   1  1  1  1 ] [ −1, −1, −1, 1, 1, 1 ]ᵀ  =  [ 0.1406 ]
    [ −6  262 ]   [ −9  −7  −5  1  5  9 ]                              [ 0.1406 ]
8
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 1 (cont’d)

ŷ = sign(𝐗ŵ) = sign(𝐗 [ 0.1406, 0.1406 ]ᵀ)
y′ = 𝐗ŵ = 0.1406 + 0.1406x

Prediction:
Test set {𝑥 = −2} → {𝑦 = ?}
𝑦𝑛𝑒𝑤 = 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = sign(𝐱𝑛𝑒𝑤ᵀŵ)
     = sign([ 1  −2 ] [ 0.1406, 0.1406 ]ᵀ)        (bias, 𝑥𝑛𝑒𝑤)
     = sign(−0.1406) = −1

[Figure: linear regression for one-dimensional classification, showing the six training
points and the fitted line y′ = 0.1406 + 0.1406x]
Python demo 1
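A minimal numpy sketch of this binary-classification example (not the official Python demo 1):

import numpy as np

x = np.array([-9.0, -7.0, -5.0, 1.0, 5.0, 9.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

X = np.column_stack([np.ones_like(x), x])         # bias column + feature
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # approx [0.1406, 0.1406]

x_new = np.array([[1.0, -2.0]])
print(np.sign(x_new @ w_hat))                     # [-1.]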
9
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Linear Methods for Classification

Multi-Category Classification:

If 𝐗 𝑇 𝐗 is invertible, then

Learning:   Ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘,   𝐘 ∈ 𝓡𝑚×𝐶
Prediction: 𝑓̂ᶜ𝐰(𝐱𝑛𝑒𝑤) = arg max𝑘=1,…,𝐶 𝐱𝑛𝑒𝑤ᵀŴ(:, 𝑘)  for each row 𝐱𝑛𝑒𝑤ᵀ of 𝐗𝑛𝑒𝑤

Each row (𝑖 = 1, …, 𝑚) of 𝐘 has a one-hot encoding/assignment of length 𝐶:
e.g., the target for class-1 is labelled as 𝐲𝑖ᵀ = [1, 0, 0, …, 0] for the ith sample,
the target for class-2 is labelled as 𝐲𝑗ᵀ = [0, 1, 0, …, 0] for the jth sample,
the target for class-C is labelled as 𝐲𝑚ᵀ = [0, 0, …, 0, 1] for the mth sample.
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.4)

10
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2 Three class classification
Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
{𝑥1 = 1, 𝑥2 = 1}  → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = −1, 𝑥2 = 1} → {𝑦1 = 0, 𝑦2 = 1, 𝑦3 = 0}   Class 2
{𝑥1 = 1, 𝑥2 = 3}  → {𝑦1 = 1, 𝑦2 = 0, 𝑦3 = 0}   Class 1
{𝑥1 = 1, 𝑥2 = 0}  → {𝑦1 = 0, 𝑦2 = 0, 𝑦3 = 1}   Class 3

      𝐗                𝐖                      𝐘
[ 1   1   1 ]  [ 𝑤1,1  𝑤1,2  𝑤1,3 ]      [ 1  0  0 ]
[ 1  −1   1 ]  [ 𝑤2,1  𝑤2,2  𝑤2,3 ]  =   [ 0  1  0 ]        (first column of 𝐗 = bias)
[ 1   1   3 ]  [ 𝑤3,1  𝑤3,2  𝑤3,3 ]      [ 1  0  0 ]
[ 1   1   0 ]                            [ 0  0  1 ]

This set of linear equations has NO exact solution; least-squares approximation
(𝐗ᵀ𝐗 is invertible):

Ŵ = 𝐗†𝐘 = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐘
  = [ 4  2   5 ]⁻¹ [ 1   1  1   1 ] [ 1  0  0 ]     [ 0        0.5     0.5    ]
    [ 2  4   3 ]   [ 1  −1  1   1 ] [ 0  1  0 ]  =  [ 0.2857  −0.5     0.2143 ]
    [ 5  3  11 ]   [ 1   1  3   0 ] [ 1  0  0 ]     [ 0.2857   0      −0.2857 ]
                                    [ 0  0  1 ]
11
© Copyright EE, NUS. All Rights Reserved.
Linear Regression (for classification)
Example 2    Prediction
Test set 𝐗𝑛𝑒𝑤:  {𝑥1 = 6, 𝑥2 = 8} → {class 1, 2, or 3?}
                {𝑥1 = 0, 𝑥2 = −1} → {class 1, 2, or 3?}

Ŷ = 𝐗𝑛𝑒𝑤Ŵ = [ 1  6   8 ] [ 0        0.5     0.5    ]     [  4        −2.50    −0.50   ]
            [ 1  0  −1 ] [ 0.2857  −0.5     0.2143 ]  =  [ −0.2857    0.50     0.7857 ]
                         [ 0.2857   0      −0.2857 ]

Category prediction:
𝒇̂ᶜ𝐰(𝐗𝑛𝑒𝑤) = arg max𝑘=1,…,𝐶 (Ŷ(:, 𝑘)) = [ 1 ]   Class 1
                                       [ 3 ]   Class 3

For each row of Ŷ, the column position of the largest number (across all columns for
that row) determines the class label. E.g. in the first row, the maximum number is 4,
which is in column 1; therefore the predicted class is 1.
Python demo 2
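A minimal numpy sketch of this three-class example with one-hot targets and arg max (not the official Python demo 2):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])    # columns: bias, x1, x2
Y = np.array([[1.0, 0.0, 0.0],     # one-hot class labels
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
X_new = np.array([[1.0, 6.0, 8.0],
                  [1.0, 0.0, -1.0]])
scores = X_new @ W_hat
print(np.argmax(scores, axis=1) + 1)   # predicted classes: [1 3]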
12
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression (ensures that the matrix to be inverted, 𝐗ᵀ𝐗 + λ𝐈, is invertible)

Recall Linear regression


Objective: ŵ = argmin𝐰 Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² = argmin𝐰 (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲)
The learning computation: ŵ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
We cannot guarantee that the matrix 𝐗 𝑇 𝐗 is invertible

Ridge regression: shrinks the regression coefficients w by imposing a


penalty on their size
Objective: ŵ = argmin𝐰 [ Σᵢ₌₁ᵐ (𝑓𝐰(𝐱𝑖) − y𝑖)² + λ Σⱼ₌₁ᵈ 𝑤𝑗² ]
             = argmin𝐰 [ (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰 ]

Here λ ≥ 0 is a complexity parameter that controls the amount of


shrinkage: the larger the value of λ, the greater the amount of shrinkage.

Note: m samples & d parameters


13
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression
Using a linear model:
min𝐰 (𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰
Solution:
∂[(𝐗𝐰 − 𝐲)ᵀ(𝐗𝐰 − 𝐲) + λ𝐰ᵀ𝐰]/∂𝐰 = 𝟎
⇒ 2𝐗ᵀ𝐗𝐰 − 2𝐗ᵀ𝐲 + 2λ𝐰 = 𝟎
⇒ 𝐗ᵀ𝐗𝐰 + λ𝐰 = 𝐗ᵀ𝐲
⇒ (𝐗ᵀ𝐗 + λ𝐈)𝐰 = 𝐗ᵀ𝐲
where I is the dxd identity matrix
From here on, we shall focus on a single column of output 𝐲 in the derivations
that follow
Learning: ෝ = (𝐗 𝑇 𝐗 +λ𝐈)−1 𝐗 𝑇 𝐲
𝐰
Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3)

14
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Primal Form (when m > d)

(𝐗 𝑇 𝐗 + λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = (𝐗 𝑇 𝐗 +λ𝐈)−1 𝐗 𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ
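A minimal sketch of the ridge solution in primal form, shown alongside the dual form introduced on the next slide (function names are illustrative; for λ > 0 both give the same ŵ):

import numpy as np

def fit_ridge_primal(X, y, lam):
    d = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ y

def fit_ridge_dual(X, y, lam):
    m = X.shape[0]
    return X.T @ np.linalg.inv(X @ X.T + lam * np.eye(m)) @ y

# The dual form is cheaper when m < d (fewer samples than parameters).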

Ref: Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, (2nd ed., 12th printing) 2017 (chp.3)

15
© Copyright EE, NUS. All Rights Reserved.
Ridge Regression

Ridge Regression in Dual Form (when m < d)


underdetermined system

(𝐗𝐗 𝑇 +λ𝐈) is invertible for λ > 0,


Learning: 𝐰ෝ = 𝐗 𝑇 (𝐗𝐗 𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐗 𝑛𝑒𝑤 = 𝐗 𝑛𝑒𝑤 𝐰ෝ

Derivation as homework (see tutorial 6).


Hint: start off with (𝐗ᵀ𝐗 + 𝜆𝐈)𝐰 = 𝐗ᵀ𝐲 and make use of 𝐰 = 𝐗ᵀ𝐚 and
𝐚 = 𝜆⁻¹(𝐲 − 𝐗𝐰), 𝜆 > 0

16
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Motivation: nonlinear decision surface
• Based on the sum of products of the variables
• E.g. when the input dimension is d=2,
a polynomial function of degree = 2 is:
𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2².

XOR problem

𝑓𝐰 𝐱 = 𝑥1 𝑥2

17
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Polynomial Expansion
• The linear model 𝑓𝐰 𝐱 = 𝐱 𝑇 𝐰 can be written as
𝑓𝐰(𝐱) = 𝐱ᵀ𝐰
      = Σᵢ₌₀ᵈ 𝑥𝑖𝑤𝑖,   𝑥0 = 1
      = 𝑤0 + Σᵢ₌₁ᵈ 𝑥𝑖𝑤𝑖.

• By including additional terms involving the products of


pairs of components of 𝐱, we obtain a quadratic model:
𝑓𝐰(𝐱) = 𝑤0 + Σᵢ₌₁ᵈ 𝑤𝑖𝑥𝑖 + Σᵢ₌₁ᵈ Σⱼ₌₁ᵈ 𝑤𝑖𝑗𝑥𝑖𝑥𝑗.

2nd order: 𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2²

3rd order: 𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2² +
           Σᵢ₌₁ᵈ Σⱼ₌₁ᵈ Σₖ₌₁ᵈ 𝑤𝑖𝑗𝑘𝑥𝑖𝑥𝑗𝑥𝑘,   𝑑 = 2          Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

18
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function
• In general:
𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1 𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

Weierstrass Approximation Theorem: Every continuous function defined on a


closed interval [a, b] can be uniformly approximated as closely as desired by
a polynomial function.
- Suppose f is a continuous real-valued function defined on the real interval [a, b].
- For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have| f (x)
− p(x)| < ε.
(Ref: https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)

Notes:
• For high dimensional input features (large d value) and high polynomial order, the
number of polynomial terms becomes explosive! (i.e., grows exponentially)
• For high dimensional problems, polynomials of order larger than 3 are seldom used.
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5) online

19
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Generalized Linear Discriminant Function

𝑓𝐰 𝐱 = 𝑤0 + σ𝑑𝑖=1 𝑤𝑖 𝑥𝑖 + σ𝑑𝑖=1 σ𝑑𝑗=1 𝑤𝑖𝑗 𝑥𝑖 𝑥𝑗 + σ𝑑𝑖=1 σ𝑑𝑗=1 σ𝑑𝑘=1 𝑤𝑖𝑗𝑘 𝑥𝑖 𝑥𝑗 𝑥𝑘 + ⋯

𝒇𝐰(𝐗) = 𝐏𝐰   ( Note: 𝐏 ≜ 𝐏(𝐗) for symbol simplicity )
       = [ 𝒑1ᵀ𝐰, …, 𝒑𝑚ᵀ𝐰 ]ᵀ
where 𝒑𝑙ᵀ = [1, 𝑥𝑙,1, …, 𝑥𝑙,𝑑, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗, …, 𝑥𝑙,𝑖𝑥𝑙,𝑗𝑥𝑙,𝑘, …]
and   𝐰 = [𝑤0, 𝑤1, …, 𝑤𝑑, …, 𝑤𝑖𝑗, …, 𝑤𝑖𝑗𝑘, …]ᵀ,
𝑙 = 1, …, 𝑚; d denotes the dimension of input features; m denotes the number of samples
Ref: Duda, Hart, and Stork, “Pattern Classification”, 2001 (Chp.5)

20
© Copyright EE, NUS. All Rights Reserved.
Example 3    Training set {(𝐱𝑖, 𝐲𝑖)}, 𝑖 = 1, …, 𝑚:
             {𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
             {𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
             {𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
             {𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model

𝑓𝐰(𝐱) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑤12𝑥1𝑥2 + 𝑤11𝑥1² + 𝑤22𝑥2²
      = [1  𝑥1  𝑥2  𝑥1𝑥2  𝑥1²  𝑥2²] [𝑤0, 𝑤1, 𝑤2, 𝑤12, 𝑤11, 𝑤22]ᵀ

Stack the 4 training samples as a matrix:

𝐏 = [ 1  𝑥1,1  𝑥1,2  𝑥1,1𝑥1,2  𝑥1,1²  𝑥1,2² ]     [ 1  0  0  0  0  0 ]
    [ 1  𝑥2,1  𝑥2,2  𝑥2,1𝑥2,2  𝑥2,1²  𝑥2,2² ]     [ 1  1  1  1  1  1 ]
    [ 1  𝑥3,1  𝑥3,2  𝑥3,1𝑥3,2  𝑥3,1²  𝑥3,2² ]  =  [ 1  1  0  0  1  0 ]
    [ 1  𝑥4,1  𝑥4,2  𝑥4,1𝑥4,2  𝑥4,1²  𝑥4,2² ]     [ 1  0  1  0  0  1 ]
21
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

Ridge Regression in Primal Form (m > d)


For λ > 0,
Learning: 𝐰ෝ = (𝐏 𝑇 𝐏 +λ𝐈)−1 𝐏 𝑇 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰 ෝ

Ridge Regression in Dual Form (m < d)


For λ > 0,
Learning: 𝐰ෝ = 𝐏𝑇 (𝐏𝐏𝑇 +λ𝐈)−1 𝐲
Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

Note: Change X to P with reference to slides 15/16; m & d refers to the size of P (not X)
22
© Copyright EE, NUS. All Rights Reserved.
Polynomial Regression
Summary

For Regression Applications


• Learn continuous valued 𝑦 using either primal form or dual form
• Prediction: 𝒇෠ 𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = 𝐏𝑛𝑒𝑤 𝐰

For Classification Applications


• Learn discrete valued 𝑦 (𝑦 ∈ {−1, +1}) or 𝐘 (one-hot) using either primal
form or dual form
• Binary Prediction: 𝒇෠ 𝑐𝐰 𝐏(𝐗 𝑛𝑒𝑤 ) = sign(𝐏𝑛𝑒𝑤 𝐰)

• Multi-Category Prediction: 𝒇̂ᶜ𝐰(𝐏(𝐗𝑛𝑒𝑤)) = arg max𝑘=1,…,𝐶 (𝐏𝑛𝑒𝑤Ŵ(:, 𝑘))

23
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont’d)    Training set:
                      {𝑥1 = 0, 𝑥2 = 0} → {𝑦 = −1}
                      {𝑥1 = 1, 𝑥2 = 1} → {𝑦 = −1}
                      {𝑥1 = 1, 𝑥2 = 0} → {𝑦 = +1}
                      {𝑥1 = 0, 𝑥2 = 1} → {𝑦 = +1}

2nd order polynomial model

𝐏 = [ 1  𝑥1,1  𝑥1,2  𝑥1,1𝑥1,2  𝑥1,1²  𝑥1,2² ]     [ 1  0  0  0  0  0 ]
    [ 1  𝑥2,1  𝑥2,2  𝑥2,1𝑥2,2  𝑥2,1²  𝑥2,2² ]     [ 1  1  1  1  1  1 ]
    [ 1  𝑥3,1  𝑥3,2  𝑥3,1𝑥3,2  𝑥3,1²  𝑥3,2² ]  =  [ 1  1  0  0  1  0 ]
    [ 1  𝑥4,1  𝑥4,2  𝑥4,1𝑥4,2  𝑥4,1²  𝑥4,2² ]     [ 1  0  1  0  0  1 ]

ŵ = 𝐏ᵀ(𝐏𝐏ᵀ)⁻¹𝐲
  = [ 1  1  1  1 ] [ 1  1  1  1 ]⁻¹ [ −1 ]     [ −1 ]
    [ 0  1  1  0 ] [ 1  6  3  3 ]   [ −1 ]     [  1 ]
    [ 0  1  0  1 ] [ 1  3  3  1 ]   [ +1 ]  =  [  1 ]
    [ 0  1  0  0 ] [ 1  3  1  3 ]   [ +1 ]     [ −4 ]
    [ 0  1  1  0 ]                             [  1 ]
    [ 0  1  0  1 ]                             [  1 ]
                                                         Python demo 3
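A minimal numpy sketch of this example (not the official Python demo 3). The monomial columns are built by hand here; sklearn.preprocessing.PolynomialFeatures(degree=2) produces the same monomials, though possibly in a different column order.

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# columns: 1, x1, x2, x1*x2, x1^2, x2^2
P = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                     X[:, 0] * X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2])

w_hat = P.T @ np.linalg.inv(P @ P.T) @ y    # dual-form solution (m < d)
print(w_hat)                                # approx [-1.  1.  1. -4.  1.  1.]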
24
© Copyright EE, NUS. All Rights Reserved.
Example 3 (cont’d) Prediction
Test set Test point 1: {𝑥1 = 0.1, 𝑥2 = 0.1} → {𝑦 = class − 1 or + 1? }
Test point 2: {𝑥1 = 0.9, 𝑥2 = 0.9} → {𝑦 = class − 1 or + 1? }
Test point 3: {𝑥1 = 0.1, 𝑥2 = 0.9} → {𝑦 = class − 1 or + 1? }
Test point 4: {𝑥1 = 0.9, 𝑥2 = 0.1} → {𝑦 = class − 1 or + 1? }
ŷ = 𝐏𝑛𝑒𝑤ŵ,   where the columns of 𝐏𝑛𝑒𝑤 are [1  𝑥1  𝑥2  𝑥1𝑥2  𝑥1²  𝑥2²]:

𝐏𝑛𝑒𝑤 = [ 1  0.1  0.1  0.01  0.01  0.01 ]      ŷ = 𝐏𝑛𝑒𝑤 [ −1, 1, 1, −4, 1, 1 ]ᵀ
       [ 1  0.9  0.9  0.81  0.81  0.81 ]        = [ −0.82, −0.82, 0.46, 0.46 ]ᵀ
       [ 1  0.1  0.9  0.09  0.01  0.81 ]
       [ 1  0.9  0.1  0.09  0.81  0.01 ]

𝒇̂ᶜ𝐰(𝐏(𝐗𝑛𝑒𝑤)) = sign(ŷ) = sign([ −0.82, −0.82, 0.46, 0.46 ]ᵀ) = [ −1, −1, +1, +1 ]ᵀ
→ Class −1, Class −1, Class +1, Class +1

25
© Copyright EE, NUS. All Rights Reserved.
Summary
• Notations, Vectors, Matrices
• Operations on Vectors and Matrices
• Systems of Linear Equations 𝒇𝐰 𝐗 = 𝐗𝐰 = 𝐲
• Functions, Derivative and Gradient
• Least Squares, Linear Regression with Single and Multiple Outputs
• Learning of vectored function, binary and multi-category classification
• Ridge Regression: penalty term, primal and dual forms
• Polynomial Regression: nonlinear decision boundary

Primal form   Learning:   ŵ = (𝐏ᵀ𝐏 + λ𝐈)⁻¹𝐏ᵀ𝐲
              Prediction: 𝒇̂𝐰(𝐏(𝐗𝑛𝑒𝑤)) = 𝐏𝑛𝑒𝑤ŵ

Dual form     Learning:   ŵ = 𝐏ᵀ(𝐏𝐏ᵀ + λ𝐈)⁻¹𝐲
              Prediction: 𝒇̂𝐰(𝐏(𝐗𝑛𝑒𝑤)) = 𝐏𝑛𝑒𝑤ŵ
Hint: python packages: sklearn.preprocessing (PolynomialFeatures), np.sign,
sklearn.model_selection (train_test_split), sklearn.preprocessing (OneHotEncoder)
26
© Copyright EE, NUS. All Rights Reserved.
