
Foundations of Machine Learning

DSA 5102X • Lecture 2

Soufiane Hayou
Department of Mathematics
Consultation, Homework, Project
Consultation
We have three TAs for this class
• Wang Shida
• Jiang Haotian
• Wang Weixi

Each will set up three 15-minute consultation slots per week.


Please use LumiNUS (consultation tab) to sign up.

I will also address common questions during the lectures.


Homework
Regression problem on the UCI Concrete Compressive
Strength dataset

• Instructions are in DSA5102X-homework1.ipynb


• Due: 4th Sept 2021 (2 weeks)
• Submission: Luminus submission folder under Files
• Late submission policy
• To ensure fairness, 20% of the homework total grade is deducted for every day late, down to 0
• Example: actual grade 8/10, 1 day late, obtained grade 6/10
Project
Instructions are found in the Project Instructions folder on
Luminus

Use this homework as a starting point for the project

Due date: End of reading week before exam week


Last time
From linear models to linear basis models via feature maps

From this, we can derive the least squares formula etc.

Today, we will focus on the role of feature maps and their relationship with kernels
Interpreting Feature Maps
What do feature maps really do?

[Figure: data mapped into feature-space coordinates \phi_1, \phi_2, \phi_3]
Another view of feature maps
One can also view feature maps as implicitly defining some sort
of similarity measure

Consider two vectors u and v. Then, the dot product u^T v measures how similar they are.

[Figure: pairs of vectors u, v at decreasing angles; increasing u^T v means increasing similarity]
A feature map defines a similarity between two samples by computing the dot product in feature space:

\phi(x)^T \phi(x')

[Figure: pairs of feature vectors \phi(x), \phi(x') at decreasing angles; increasing \phi(x)^T \phi(x') means increasing similarity in feature space]


Least Squares Revisited
Let us revisit the linear basis hypothesis space

f(x) = w^T \phi(x)

The (regularized) least squares problem is

\min_{w} \; \sum_{i=1}^{N} \big( y_i - w^T \phi(x_i) \big)^2 + \lambda \|w\|^2

Recall: \Phi is the N \times m design matrix with rows \phi(x_i)^T, and y = (y_1, \dots, y_N)^T.

This is known as ridge regression.

Solution of ridge regression:

\hat{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y

Making new predictions:

f(x) = \hat{w}^T \phi(x)

Two observations:
• The dataset is memorized by \hat{w} and is not needed for new predictions
• For each new prediction, we need O(m) operations, where m is the feature dimension
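
As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution, assuming a hypothetical cubic-polynomial feature map phi and made-up 1-D data (none of the names or values are from the lecture or the homework):

    import numpy as np

    # Hypothetical feature map: monomials up to degree 3 (illustrative choice)
    def phi(x):
        return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

    # Made-up 1-D regression data
    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

    lam = 1e-2                      # regularization strength lambda
    Phi = phi(x_train)              # N x m design matrix
    m = Phi.shape[1]

    # Ridge regression solution: w_hat = (Phi^T Phi + lambda I)^(-1) Phi^T y
    w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y_train)

    # New predictions only need w_hat, not the training data
    x_test = np.array([0.0, 0.5])
    print(phi(x_test) @ w_hat)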
Reformulation of Ridge Regression
Let us now write the ridge regression solution another way. Using the identity (\Phi^T \Phi + \lambda I)^{-1} \Phi^T = \Phi^T (\Phi \Phi^T + \lambda I)^{-1}, we get

f(x) = \phi(x)^T \hat{w} = \sum_{i=1}^{N} \alpha_i \, \phi(x)^T \phi(x_i), \qquad \alpha = (K + \lambda I)^{-1} y

where K = \Phi \Phi^T is the N \times N Gram matrix with entries K_{ij} = \phi(x_i)^T \phi(x_j).

Two observations:
• The input data participates in the Gram matrix and is needed for new predictions
• For each new prediction, we need O(N) operations

Original Solution vs. Reformulated Solution

What did we gain with this reformulation?

• They are exactly the same function, but…
• The left side requires inverting an m \times m matrix, while the right side requires inverting an N \times N matrix
• Most importantly, the right side only depends on the feature maps through the dot product \phi(x)^T \phi(x')
Reformulated Solution

The reformulated solution only needs the “similarity” measure k(x, x') = \phi(x)^T \phi(x').
So why not just specify k directly and forget about \phi?

Kernel Ridge Regression

f(x) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x), \qquad \alpha = (K + \lambda I)^{-1} y, \qquad K_{ij} = k(x_i, x_j)
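
Below is a minimal NumPy sketch of kernel ridge regression built directly from these formulas; the RBF kernel, sigma, lam, and the toy data are illustrative assumptions rather than choices made in the lecture:

    import numpy as np

    def rbf_kernel(X1, X2, sigma=0.5):
        # k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs
        d2 = (X1[:, None] - X2[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

    lam = 1e-2
    K = rbf_kernel(x_train, x_train)                        # N x N Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

    # Prediction: f(x) = sum_i alpha_i k(x_i, x); the training inputs are needed
    x_test = np.array([0.0, 0.5])
    print(rbf_kernel(x_test, x_train) @ alpha)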


Kernel
Can we choose any k we want?
Observe that k cannot be arbitrary, since we need
• k(x, x') = k(x', x)  (Symmetry)

• k(x, x) = \|\phi(x)\|^2 \ge 0  (Non-negativity)

• \sum_{i,j} c_i c_j \, k(x_i, x_j) = \Big\| \sum_i c_i \phi(x_i) \Big\|^2 \ge 0  (Positive Semi-definiteness)

for all x, x', x_1, \dots, x_N and c \in \mathbb{R}^N

If k satisfies these conditions, it is called Symmetric Positive Definite (SPD). Are these conditions all we need?
Symmetric Positive Definite Kernels
For a kernel to represent a valid feature map, we define the notion of Symmetric Positive Definite (SPD) kernels. These satisfy

1. Symmetry: k(x, x') = k(x', x)

2. Positive Semi-definiteness: For any N and any x_1, \dots, x_N, the Gram matrix K with K_{ij} = k(x_i, x_j) is positive semi-definite

(Recall: a matrix A is positive semi-definite if c^T A c \ge 0 for any vector c)
Mercer’s Theorem (1909)
Suppose k is an SPD kernel. Then, there exists a feature space and a feature map \phi such that

k(x, x') = \phi(x)^T \phi(x')

In fact,

k(x, x') = \sum_{j} \lambda_j \, e_j(x) \, e_j(x')

where (\lambda_j, e_j) are the eigenvalues/eigenfunctions of the linear integral operator

(T_k f)(x) = \int k(x, x') \, f(x') \, dx'

So we have both directions:

Feature map ⟹ SPD kernel
and
SPD kernel ⟹ Feature map
Examples of SPD kernels
• Linear kernel: k(x, x') = x^T x'

• Polynomial kernel: k(x, x') = (x^T x' + c)^p

• Gaussian (Radial Basis Function, RBF) kernel: k(x, x') = \exp\big( -\|x - x'\|^2 / 2\sigma^2 \big)

• Many more…
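
A short sketch (not from the lecture) that evaluates these kernels on made-up data and checks that the resulting Gram matrices are positive semi-definite; the parameters c, p, and sigma are illustrative:

    import numpy as np

    def linear_kernel(X1, X2):
        return X1 @ X2.T

    def polynomial_kernel(X1, X2, c=1.0, p=3):
        return (X1 @ X2.T + c) ** p

    def rbf_kernel(X1, X2, sigma=1.0):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 4))   # made-up data: 20 samples, 4 features

    for name, kern in [("linear", linear_kernel),
                       ("polynomial", polynomial_kernel),
                       ("rbf", rbf_kernel)]:
        K = kern(X, X)
        eigvals = np.linalg.eigvalsh(K)                 # Gram matrix is symmetric
        print(name, "min eigenvalue:", eigvals.min())   # >= 0 up to round-off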
Flexibility of using kernels
For example, consider the RBF kernel in 1 input dimension

k(x, x') = \exp\big( -(x - x')^2 / 2\sigma^2 \big)

Taylor expanding gives

k(x, x') = \sum_{n=0}^{\infty} \phi_n(x) \, \phi_n(x')

where

\phi_n(x) = \frac{x^n}{\sigma^n \sqrt{n!}} \, e^{-x^2 / 2\sigma^2}

The feature space is infinite-dimensional!
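
A quick numerical check of this expansion (illustrative only; sigma, the truncation order, and the test points are made up): the dot product of the truncated feature map with itself should be close to the exact RBF kernel value.

    import numpy as np
    from math import factorial

    sigma = 1.0
    n_terms = 20

    def rbf(x, xp):
        return np.exp(-(x - xp) ** 2 / (2 * sigma**2))

    def phi(x):
        # First n_terms components of the infinite-dimensional feature map
        n = np.arange(n_terms)
        fact = np.array([factorial(k) for k in range(n_terms)], dtype=float)
        return (x ** n) / (sigma ** n * np.sqrt(fact)) * np.exp(-x**2 / (2 * sigma**2))

    x, xp = 0.3, -0.8
    print(rbf(x, xp))         # exact kernel value
    print(phi(x) @ phi(xp))   # dot product of truncated feature maps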


Kernel ridge regression with different types of kernels

[Figure: kernel ridge regression fits using different types of kernels]

Key Idea of Kernel Methods
1. Express the solution in terms of similarity (dot products)
2. Replace them with kernels
Support Vector Machines

Linear (Affine) Functions and Hyperplanes
Linear functions:

f(x) = w^T x + b

What are hyperplanes?

In two dimensions: a line

In general: a (d-1)-dimensional affine subspace of \mathbb{R}^d

Hyperplanes are solutions of a linear equation:

\{ x : w^T x + b = 0 \}
Classification using linear functions
Binary classification: x_i \in \mathbb{R}^d, \; y_i \in \{-1, +1\}
Linear decision function: f(x) = \operatorname{sign}(w^T x + b)
Linear separability assumption:
There exists a linear decision function such that y_i (w^T x_i + b) > 0 for all i
Margin

[Figure: separating hyperplanes and their margins; there can be many possible decision functions!]
Maximum Margin Solution
Mathematically, the margin of a decision function f(x) = \operatorname{sign}(w^T x + b) is

\frac{1}{\|w\|} \min_{i=1,\dots,N} |w^T x_i + b|

The goal of support vector machines (SVM) is to find the maximum margin solution. Why?

\max_{w, b} \; \frac{1}{\|w\|} \min_{i=1,\dots,N} |w^T x_i + b| \quad \text{subject to} \quad y_i (w^T x_i + b) > 0 \;\; \forall i

Reformulated as a constrained convex optimization problem:

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \;\; \forall i
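
For concreteness, here is a minimal sketch of this hard-margin problem solved with the cvxpy modelling library on a tiny made-up dataset; the library choice and the data are assumptions, not part of the lecture:

    import numpy as np
    import cvxpy as cp

    # Made-up, linearly separable 2-D data
    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(objective, constraints)
    problem.solve()

    print(w.value, b.value)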
The Method of Lagrange Multipliers
Minimizing an unconstrained function F(z) amounts to solving \nabla F(z) = 0. What if there are constraints?

First example: a linear constraint

\min_{z} F(z) \quad \text{subject to} \quad z^T a = 0

[Figure: level sets of F(z) with the constraint line z^T a = 0; at the optimum \hat{z}, the gradient \nabla F(\hat{z}) is parallel to a]

What about general equality constraints?

\min_{z} F(z) \quad \text{subject to} \quad G(z) = 0

\nabla F and \nabla G are parallel at a local optimum point, i.e.

\nabla F(\hat{z}) + \lambda \nabla G(\hat{z}) = 0

The gradient \nabla G plays the role of a, so we must have G(\hat{z}) = 0 and \nabla F(\hat{z}) + \lambda \nabla G(\hat{z}) = 0

[Figure: level sets of F(z) with the constraint curve G(z) = 0; at \hat{z}, \nabla F(\hat{z}) and \nabla G(\hat{z}) are parallel]
What about general inequality constraints?

\min_{z} F(z) \quad \text{subject to} \quad G(z) \le 0

[Figure: two cases. Constraint inactive: the unconstrained minimum \hat{z} lies inside the feasible region, so \nabla F(\hat{z}) = 0 and G(\hat{z}) < 0. Constraint active: \hat{z} lies on the boundary G(\hat{z}) = 0 and \nabla F(\hat{z}) points opposite to \nabla G(\hat{z})]

Define the Lagrangian

L(z, \mu) = F(z) + \mu \, G(z)

Then these two cases can be combined in the following conditions:

\nabla_z L(\hat{z}, \mu) = 0, \qquad G(\hat{z}) \le 0, \qquad \mu \ge 0, \qquad \mu \, G(\hat{z}) = 0

The variable \mu is called a Lagrange multiplier

The most general case has multiple inequality constraints (why no equality constraints?)
Karush-Kuhn-Tucker (KKT) Conditions
Define the Lagrangian

L(z, \mu) = F(z) + \sum_{j=1}^{M} \mu_j \, G_j(z)

Then, under technical conditions, for each locally optimal \hat{z}, there exist Lagrange multipliers \mu_1, \dots, \mu_M such that
1. \nabla_z L(\hat{z}, \mu) = 0  (Stationarity)
2. G_j(\hat{z}) \le 0 for all j  (Primal Feasibility)
3. \mu_j \ge 0 for all j  (Dual Feasibility)
4. \mu_j \, G_j(\hat{z}) = 0 for all j  (Complementary Slackness)
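
As a small worked illustration (not taken from the slides), consider minimizing F(z) = z^2 subject to G(z) = 1 - z \le 0; the KKT conditions read:

    % Illustrative KKT example: minimize z^2 subject to 1 - z <= 0
    \begin{align*}
    L(z, \mu) &= z^2 + \mu (1 - z) \\
    \text{Stationarity:}\quad & 2\hat{z} - \mu = 0 \\
    \text{Primal feasibility:}\quad & 1 - \hat{z} \le 0 \\
    \text{Dual feasibility:}\quad & \mu \ge 0 \\
    \text{Complementary slackness:}\quad & \mu (1 - \hat{z}) = 0
    \end{align*}
    % If \mu = 0, stationarity gives \hat{z} = 0, violating primal feasibility.
    % Hence the constraint is active: \hat{z} = 1 and \mu = 2.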
Dual Problem
Moreover, under some additional conditions, we can find the multipliers via the dual problem

\max_{\mu \ge 0} \; \min_{z} \; L(z, \mu)

Choice (Primal) vs. Price (Dual)
Back to the SVM…

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \;\; \forall i

We apply the KKT conditions with

z = (w, b), \qquad F(z) = \frac{1}{2} \|w\|^2, \qquad G_i(z) = 1 - y_i (w^T x_i + b)

We obtain the following
1. From Stationarity: w = \sum_{i=1}^{N} \mu_i y_i x_i and \sum_{i=1}^{N} \mu_i y_i = 0

2. From Dual Feasibility: \mu_i \ge 0

3. From Complementary Slackness: \mu_i \big( 1 - y_i (w^T x_i + b) \big) = 0

4. The multipliers can be found by the dual problem
Dual Formulation of SVM

\max_{\mu \ge 0} \; \sum_{i=1}^{N} \mu_i - \frac{1}{2} \sum_{i,j=1}^{N} \mu_i \mu_j \, y_i y_j \, x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{N} \mu_i y_i = 0

Decision function: f(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} \mu_i y_i \, x_i^T x + b \Big)

Complementary slackness: \mu_i > 0 only if y_i (w^T x_i + b) = 1

Crucial Observations
From the dual formulation, we observe the following
1. Only the vectors closest to the decision boundary matter in predictions. These are called support vectors.
2. The dual formulation of the problem depends on the inputs only through the dot product x_i^T x_j

[Figure: maximum margin hyperplane with the support vectors lying on the margin]
Kernel Support Vector Machines

Replace the dot products with a kernel k.

Decision function: f(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} \mu_i y_i \, k(x_i, x) + b \Big)

As before, only the support vectors, i.e. those with \mu_i > 0 (the points lying on the margin), matter for predictions. This is a sparse kernel method.
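
A minimal sketch of a kernel SVM in scikit-learn on made-up data, to illustrate the sparsity of the solution; note that SVC solves the soft-margin variant, so the large C value here only approximates the hard-margin problem from the lecture, and all parameters are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # RBF-kernel SVM; large C approximates the hard-margin problem
    clf = SVC(kernel="rbf", C=100.0, gamma=0.5)
    clf.fit(X, y)

    print("number of support vectors:", len(clf.support_vectors_))
    # Predictions only involve k(x_i, x) for the support vectors x_i
    print(clf.predict([[0.5, 1.5], [-1.0, -2.0]]))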
Summary
The essence of kernel methods
• Write solution only in terms of dot products (this usually
involves going to a “dual” formulation)
• Go nonlinear by using kernels to replace dot products

Support vector machines


• Maximum margin solution
• Example of sparse kernel method: only some points
(support vectors) are used for prediction
