
Foundations of Machine Learning

DSA 5102X • Lecture 2

Soufiane Hayou
Department of Mathematics
Consultation, Homework, Project
Consultation
We have three TAs for this class
• Wang Shida
• Jiang Haotian
• Wang Weixi

Each will set up three 15-minute consultation slots per week.


Please use LumiNUS (consultation tab) to sign up.

I will also address common questions during the lectures.


Homework
Regression problem on the UCI Concrete Compressive
Strength dataset

• Instructions are in DSA5102X-homework1.ipynb


• Due: 4th Sept 2021 (2 weeks)
• Submission: Luminus submission folder under Files
• Late submission policy
• To ensure fairness, 20% of the homework total grade is deducted for every day late, down to 0
• Example: actual grade 8/10, 1 day late, obtained grade 6/10
Project
Instructions are found in the Project Instructions folder on
Luminus

Use this homework as a starting point for the project

Due date: End of reading week before exam week


Last time
From linear models to linear basis models via feature maps

From this, we can derive the least squares formula etc.

Today, we will focus on the role of feature maps and their relationship with kernels
Interpreting Feature Maps
What do feature maps really do?

[Figure: data mapped into feature-space coordinates \phi_1, \phi_2, \phi_3]
Another view of feature maps
One can also view feature maps as implicitly defining some sort
of similarity measure

Consider two vectors u and v. Then, the dot product u^T v measures how similar they are.

[Figure: pairs of vectors u, v at decreasing angles; increasing u^T v means increasing similarity]
A feature map defines a similarity between two samples by computing the dot product in feature space:

\phi(x)^T \phi(x')

[Figure: pairs of feature vectors \phi(x), \phi(x') at decreasing angles; increasing \phi(x)^T \phi(x') means increasing similarity in feature space]


Least Squares Revisited
Let us revisit the linear basis hypothesis space

f(x) = w^T \phi(x)

The (regularized) least squares problem is

\min_{w} \; \sum_{i=1}^{N} \big( y_i - w^T \phi(x_i) \big)^2 + \lambda \|w\|^2

Recall: \Phi is the N \times m design matrix with rows \phi(x_i)^T, and y = (y_1, \dots, y_N)^T.

This is known as ridge regression.

Solution of ridge regression:

\hat{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y

Making new predictions:

f(x) = \hat{w}^T \phi(x)

Two observations:
• The dataset is memorized by \hat{w} and is not needed for new predictions
• For each new prediction, we need O(m) operations, where m is the feature dimension
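
As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution, assuming a hypothetical cubic-polynomial feature map phi and made-up 1-D data (none of the names or values are from the lecture or the homework):

    import numpy as np

    # Hypothetical feature map: monomials up to degree 3 (illustrative choice)
    def phi(x):
        return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

    # Made-up 1-D regression data
    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

    lam = 1e-2                      # regularization strength lambda
    Phi = phi(x_train)              # N x m design matrix
    m = Phi.shape[1]

    # Ridge regression solution: w_hat = (Phi^T Phi + lambda I)^(-1) Phi^T y
    w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y_train)

    # New predictions only need w_hat, not the training data
    x_test = np.array([0.0, 0.5])
    print(phi(x_test) @ w_hat)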
Reformulation of Ridge Regression
Let us now write the ridge regression solution another way. Using the identity (\Phi^T \Phi + \lambda I)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I)^{-1}, we get

f(x) = \phi(x)^T \hat{w} = \sum_{i=1}^{N} \alpha_i \, \phi(x)^T \phi(x_i), \qquad \alpha = (K + \lambda I)^{-1} y

where K = \Phi \Phi^T is the N \times N Gram matrix with entries K_{ij} = \phi(x_i)^T \phi(x_j).

Two observations:
• The input data participates in the Gram matrix and is needed for new predictions
• For each new prediction, we need O(N) operations

Original Solution vs. Reformulated Solution

What did we gain with this reformulation?

• They are exactly the same function, but…
• The left side requires inverting an m \times m matrix, while the right side requires inverting an N \times N matrix
• Most importantly, the right side only depends on the feature maps through the dot product \phi(x)^T \phi(x')
Reformulated Solution

The reformulated solution only needs the “similarity” measure k(x, x') = \phi(x)^T \phi(x').
So why not just specify k directly and forget about \phi?

Kernel Ridge Regression

f(x) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x), \qquad \alpha = (K + \lambda I)^{-1} y, \qquad K_{ij} = k(x_i, x_j)
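
Below is a minimal NumPy sketch of kernel ridge regression built directly from these formulas; the RBF kernel, sigma, lam, and the toy data are illustrative assumptions rather than choices made in the lecture:

    import numpy as np

    def rbf_kernel(X1, X2, sigma=0.5):
        # k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs
        d2 = (X1[:, None] - X2[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

    lam = 1e-2
    K = rbf_kernel(x_train, x_train)                        # N x N Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

    # Prediction: f(x) = sum_i alpha_i k(x_i, x); the training inputs are needed
    x_test = np.array([0.0, 0.5])
    print(rbf_kernel(x_test, x_train) @ alpha)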


Kernel
Can we choose any k we want?
Observe that k cannot be arbitrary, since we need
• k(x, x') = k(x', x)  (Symmetry)

• k(x, x) = \|\phi(x)\|^2 \ge 0  (Non-negativity)

• \sum_{i,j} c_i c_j \, k(x_i, x_j) = \Big\| \sum_i c_i \phi(x_i) \Big\|^2 \ge 0  (Positive Semi-definiteness)

for all x, x', x_1, \dots, x_N and c \in \mathbb{R}^N

If k satisfies these conditions, it is called Symmetric Positive Definite (SPD). Are these conditions all we need?
Symmetric Positive Definite Kernels
For a kernel to represent a valid feature map, we define the notion of Symmetric Positive Definite (SPD) kernels. These satisfy

1. Symmetry: k(x, x') = k(x', x)

2. Positive Semi-definiteness: For any N and any x_1, \dots, x_N, the Gram matrix K with K_{ij} = k(x_i, x_j) is positive semi-definite

(Recall: a matrix A is positive semi-definite if c^T A c \ge 0 for any vector c)
Mercer’s Theorem (1909)
Suppose k is an SPD kernel. Then, there exists a feature space and a feature map \phi such that

k(x, x') = \phi(x)^T \phi(x')

In fact,

k(x, x') = \sum_{j} \lambda_j \, e_j(x) \, e_j(x')

where (\lambda_j, e_j) are the eigenvalues/eigenfunctions of the linear integral operator

(T_k f)(x) = \int k(x, x') \, f(x') \, dx'

So we have both directions:

Feature map ⟹ SPD kernel
and
SPD kernel ⟹ Feature map
Examples of SPD kernels
• Linear kernel: k(x, x') = x^T x'

• Polynomial kernel: k(x, x') = (x^T x' + c)^p

• Gaussian (Radial Basis Function, RBF) kernel: k(x, x') = \exp\big( -\|x - x'\|^2 / 2\sigma^2 \big)

• Many more…
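
A short sketch (not from the lecture) that evaluates these kernels on made-up data and checks that the resulting Gram matrices are positive semi-definite; the parameters c, p, and sigma are illustrative:

    import numpy as np

    def linear_kernel(X1, X2):
        return X1 @ X2.T

    def polynomial_kernel(X1, X2, c=1.0, p=3):
        return (X1 @ X2.T + c) ** p

    def rbf_kernel(X1, X2, sigma=1.0):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 4))   # made-up data: 20 samples, 4 features

    for name, kern in [("linear", linear_kernel),
                       ("polynomial", polynomial_kernel),
                       ("rbf", rbf_kernel)]:
        K = kern(X, X)
        eigvals = np.linalg.eigvalsh(K)                 # Gram matrix is symmetric
        print(name, "min eigenvalue:", eigvals.min())   # >= 0 up to round-off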
Flexibility of using kernels
For example, consider the RBF kernel in 1 input dimension

k(x, x') = \exp\big( -(x - x')^2 / 2\sigma^2 \big)

Taylor expanding gives

k(x, x') = \sum_{n=0}^{\infty} \phi_n(x) \, \phi_n(x')

where

\phi_n(x) = \frac{x^n}{\sigma^n \sqrt{n!}} \, e^{-x^2 / 2\sigma^2}

The feature space is infinite-dimensional!
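
A quick numerical check of this expansion (illustrative only; sigma, the truncation order, and the test points are made up): the dot product of the truncated feature map with itself should be close to the exact RBF kernel value.

    import numpy as np
    from math import factorial

    sigma = 1.0
    n_terms = 20

    def rbf(x, xp):
        return np.exp(-(x - xp) ** 2 / (2 * sigma**2))

    def phi(x):
        # First n_terms components of the infinite-dimensional feature map
        n = np.arange(n_terms)
        fact = np.array([factorial(k) for k in range(n_terms)], dtype=float)
        return (x ** n) / (sigma ** n * np.sqrt(fact)) * np.exp(-x**2 / (2 * sigma**2))

    x, xp = 0.3, -0.8
    print(rbf(x, xp))         # exact kernel value
    print(phi(x) @ phi(xp))   # dot product of truncated feature maps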


Kernel ridge regression with different types of kernels

[Figure: kernel ridge regression fits using different types of kernels]

Key Idea of Kernel Methods
1. Express the solution in terms of similarity (dot products)
2. Replace them with kernels
Support Vector Machines

Linear (Affine) Functions and Hyperplanes
Linear functions:

f(x) = w^T x + b

What are hyperplanes?

In two dimensions: a line

In general: a (d-1)-dimensional affine subspace of \mathbb{R}^d

Hyperplanes are solutions of a linear equation:

\{ x : w^T x + b = 0 \}
Classification using linear functions
Binary classification: x_i \in \mathbb{R}^d, \; y_i \in \{-1, +1\}
Linear decision function: f(x) = \operatorname{sign}(w^T x + b)
Linear separability assumption:
There exists a linear decision function such that y_i (w^T x_i + b) > 0 for all i
Margin

[Figure: separating hyperplanes and their margins; there can be many possible decision functions!]
Maximum Margin Solution
Mathematically, the margin of a decision function f(x) = \operatorname{sign}(w^T x + b) is

\frac{1}{\|w\|} \min_{i=1,\dots,N} |w^T x_i + b|

The goal of support vector machines (SVM) is to find the maximum margin solution. Why?

\max_{w, b} \; \frac{1}{\|w\|} \min_{i=1,\dots,N} |w^T x_i + b| \quad \text{subject to} \quad y_i (w^T x_i + b) > 0 \;\; \forall i

Reformulated as a constrained convex optimization problem:

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \;\; \forall i
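
For concreteness, here is a minimal sketch of this hard-margin problem solved with the cvxpy modelling library on a tiny made-up dataset; the library choice and the data are assumptions, not part of the lecture:

    import numpy as np
    import cvxpy as cp

    # Made-up, linearly separable 2-D data
    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(objective, constraints)
    problem.solve()

    print(w.value, b.value)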
The Method of Lagrange Multipliers
Minimizing an unconstrained function F(z) amounts to solving \nabla F(z) = 0. What if there are constraints?

First example: a linear constraint

\min_{z} F(z) \quad \text{subject to} \quad z^T a = 0

[Figure: level sets of F(z) with the constraint line z^T a = 0; at the optimum \hat{z}, the gradient \nabla F(\hat{z}) is parallel to a]

What about general equality constraints?

\min_{z} F(z) \quad \text{subject to} \quad G(z) = 0

\nabla F and \nabla G are parallel at a local optimum point, i.e.

\nabla F(\hat{z}) + \lambda \nabla G(\hat{z}) = 0

The gradient \nabla G plays the role of a, so we must have G(\hat{z}) = 0 and \nabla F(\hat{z}) + \lambda \nabla G(\hat{z}) = 0

[Figure: level sets of F(z) with the constraint curve G(z) = 0; at \hat{z}, \nabla F(\hat{z}) and \nabla G(\hat{z}) are parallel]
What about general inequality constraints?

\min_{z} F(z) \quad \text{subject to} \quad G(z) \le 0

[Figure: two cases. Constraint inactive: the unconstrained minimum \hat{z} lies inside the feasible region, so \nabla F(\hat{z}) = 0 and G(\hat{z}) < 0. Constraint active: \hat{z} lies on the boundary G(\hat{z}) = 0 and \nabla F(\hat{z}) points opposite to \nabla G(\hat{z})]

Define the Lagrangian

L(z, \mu) = F(z) + \mu \, G(z)

Then these two cases can be combined in the following conditions:

\nabla_z L(\hat{z}, \mu) = 0, \qquad G(\hat{z}) \le 0, \qquad \mu \ge 0, \qquad \mu \, G(\hat{z}) = 0

The variable \mu is called a Lagrange multiplier

The most general case has multiple inequality constraints (why no equality constraints?)
Karush-Kuhn-Tucker (KKT) Conditions
Define the Lagrangian

L(z, \mu) = F(z) + \sum_{j=1}^{M} \mu_j \, G_j(z)

Then, under technical conditions, for each locally optimal \hat{z}, there exist Lagrange multipliers \mu_1, \dots, \mu_M such that
1. \nabla_z L(\hat{z}, \mu) = 0  (Stationarity)
2. G_j(\hat{z}) \le 0 for all j  (Primal Feasibility)
3. \mu_j \ge 0 for all j  (Dual Feasibility)
4. \mu_j \, G_j(\hat{z}) = 0 for all j  (Complementary Slackness)
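
As a small worked illustration (not taken from the slides), consider minimizing F(z) = z^2 subject to G(z) = 1 - z \le 0; the KKT conditions read:

    % Illustrative KKT example: minimize z^2 subject to 1 - z <= 0
    \begin{align*}
    L(z, \mu) &= z^2 + \mu (1 - z) \\
    \text{Stationarity:}\quad & 2\hat{z} - \mu = 0 \\
    \text{Primal feasibility:}\quad & 1 - \hat{z} \le 0 \\
    \text{Dual feasibility:}\quad & \mu \ge 0 \\
    \text{Complementary slackness:}\quad & \mu (1 - \hat{z}) = 0
    \end{align*}
    % If \mu = 0, stationarity gives \hat{z} = 0, violating primal feasibility.
    % Hence the constraint is active: \hat{z} = 1 and \mu = 2.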
Dual Problem
Moreover, under some additional conditions, we can find the multipliers via the dual problem

\max_{\mu \ge 0} \; \min_{z} \; L(z, \mu)

Choice (Primal) vs. Price (Dual)
Back to the SVM…

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \;\; \forall i

We apply the KKT conditions with

z = (w, b), \qquad F(z) = \frac{1}{2} \|w\|^2, \qquad G_i(z) = 1 - y_i (w^T x_i + b)

We obtain the following
1. From Stationarity: w = \sum_{i=1}^{N} \mu_i y_i x_i and \sum_{i=1}^{N} \mu_i y_i = 0

2. From Dual Feasibility: \mu_i \ge 0

3. From Complementary Slackness: \mu_i \big( 1 - y_i (w^T x_i + b) \big) = 0

4. The multipliers can be found by the dual problem
Dual Formulation of SVM

\max_{\mu \ge 0} \; \sum_{i=1}^{N} \mu_i - \frac{1}{2} \sum_{i,j=1}^{N} \mu_i \mu_j \, y_i y_j \, x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{N} \mu_i y_i = 0

Decision function: f(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} \mu_i y_i \, x_i^T x + b \Big)

Complementary slackness: \mu_i > 0 only if y_i (w^T x_i + b) = 1

Crucial Observations
From the dual formulation, we observe the following
1. Only the vectors closest to the decision boundary matter in predictions. These are called support vectors.
2. The dual formulation of the problem depends on the inputs only through the dot product x_i^T x_j

[Figure: maximum margin hyperplane with the support vectors lying on the margin]
Kernel Support Vector Machines

Replace the dot products with a kernel k.

Decision function: f(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} \mu_i y_i \, k(x_i, x) + b \Big)

As before, only the support vectors, i.e. those with \mu_i > 0 (the points lying on the margin), matter for predictions. This is a sparse kernel method.
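
A minimal sketch of a kernel SVM in scikit-learn on made-up data, to illustrate the sparsity of the solution; note that SVC solves the soft-margin variant, so the large C value here only approximates the hard-margin problem from the lecture, and all parameters are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # RBF-kernel SVM; large C approximates the hard-margin problem
    clf = SVC(kernel="rbf", C=100.0, gamma=0.5)
    clf.fit(X, y)

    print("number of support vectors:", len(clf.support_vectors_))
    # Predictions only involve k(x_i, x) for the support vectors x_i
    print(clf.predict([[0.5, 1.5], [-1.0, -2.0]]))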
Summary
The essence of kernel methods
• Write solution only in terms of dot products (this usually
involves going to a “dual” formulation)
• Go nonlinear by using kernels to replace dot products

Support vector machines


• Maximum margin solution
• Example of sparse kernel method: only some points
(support vectors) are used for prediction
