
Kernel Methods

David Rosenberg

New York University

DS-GA 1003, February 18, 2015
Introduction

Feature Extraction
Focus on effectively representing x ∈ X as a vector φ(x) ∈ R^d.
e.g., bag of words: each coordinate of φ(x) counts how many times a particular vocabulary word occurs in the document x.

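To make the bag-of-words map concrete, here is a minimal Python sketch (my own, not from the lecture); the vocabulary and document are invented for illustration.

    from collections import Counter

    def bag_of_words(doc, vocab):
        # Map a document (a string) to a vector of word counts over a fixed vocabulary.
        counts = Counter(doc.lower().split())
        return [counts[word] for word in vocab]

    vocab = ["kernel", "machine", "learning", "twitter"]
    phi = bag_of_words("Kernel methods in machine learning", vocab)
    # phi == [1, 1, 1, 0]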

Kernel Methods

Primary focus is on comparing two inputs w, x ∈ X.

Definition
A kernel is a function that takes a pair of inputs w, x ∈ X and returns a
real value. That is, k : X × X → R.

Can interpret k(w, x) as a similarity score, but this is not precise.

We will deal with symmetric kernels: k(w, x) = k(x, w).

Kernel Examples

Comparing Documents


Comparing Documents: Bag of Words

Comparing Documents: Cosine Similarity

1. Normalize each feature vector to have ‖x‖₂ = 1.
2. Take the inner product.
3. Define k as this inner product; for the two example documents,

    k(VentureBeat, Twitter Tweet) = 0.85

Cosine Similarity Kernel

Why the name? Recall

⟨w, x⟩ = ‖w‖ ‖x‖ cos θ,

where θ is the angle between w, x ∈ R^d.


So

    k(w, x) = cos θ = ⟨ w/‖w‖, x/‖x‖ ⟩.

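As a quick illustration of this formula (my own sketch, not from the slides), here is the cosine-similarity kernel in numpy; the two count vectors are made up.

    import numpy as np

    def cosine_kernel(w, x):
        # k(w, x) = <w/||w||, x/||x||> = cosine of the angle between w and x.
        w = np.asarray(w, dtype=float)
        x = np.asarray(x, dtype=float)
        return float(np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x)))

    print(cosine_kernel([3, 1, 0, 2], [1, 0, 1, 1]))  # a value in [-1, 1]; 1 means identical direction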

Linear Kernel

Input space X = R^d
k(w, x) = wᵀx
When we “kernelize” an algorithm, we write it in terms of the linear
kernel.
Then we can swap it out and replace it with a more sophisticated kernel.


Quadratic Kernel in R²

Input space X = R²
Feature map:

    φ : (x₁, x₂) ↦ (x₁, x₂, x₁², x₂², √2 x₁x₂)

Gives us ability to represent conic section boundaries.


Define kernel as inner product in feature space:

    k(w, x) = ⟨φ(w), φ(x)⟩
            = w₁x₁ + w₂x₂ + w₁²x₁² + w₂²x₂² + 2w₁w₂x₁x₂
            = w₁x₁ + w₂x₂ + (w₁x₁)² + (w₂x₂)² + 2(w₁x₁)(w₂x₂)
            = ⟨w, x⟩ + ⟨w, x⟩²

Based on Guillaume Obozinski’s Statistical Machine Learning course at Louvain, Feb 2014.

Quadratic Kernel in R^d

Input space X = R^d
Feature map:

    φ(x) = (x₁, …, x_d, x₁², …, x_d², √2 x₁x₂, …, √2 x_i x_j, …, √2 x_{d−1} x_d)ᵀ

Number of terms = d + d(d + 1)/2 ≈ d²/2.


Still have

    k(w, x) = ⟨φ(w), φ(x)⟩ = ⟨w, x⟩ + ⟨w, x⟩²

Computation for inner product with explicit mapping: O(d²).


Computation for implicit kernel calculation: O(d).

Based on Guillaume Obozinski’s Statistical Machine Learning course at Louvain, Feb 2014.
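A small numpy check (my own sketch) that the implicit O(d) formula ⟨w, x⟩ + ⟨w, x⟩² matches the inner product of the explicit O(d²)-dimensional feature vectors.

    import numpy as np

    def phi_quad(x):
        # Explicit quadratic feature map: linear terms, squared terms, sqrt(2)-weighted cross terms.
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.concatenate([x, x**2, cross])

    def k_quad(w, x):
        # Implicit computation in O(d): <w, x> + <w, x>^2.
        s = np.dot(w, x)
        return s + s**2

    rng = np.random.default_rng(0)
    w, x = rng.normal(size=5), rng.normal(size=5)
    print(np.allclose(np.dot(phi_quad(w), phi_quad(x)), k_quad(w, x)))  # True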

Polynomial Kernel in R^d

Input space X = R^d
Kernel function:

    k(w, x) = (1 + ⟨w, x⟩)^M

Corresponds to a feature map with all terms up to degree M.
For any M, computing the kernel has the same computational cost.
Cost of explicit inner product computation grows rapidly in M.

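A one-line sketch of this kernel; the degree and test vectors below are arbitrary.

    import numpy as np

    def poly_kernel(w, x, M=3):
        # k(w, x) = (1 + <w, x>)^M; the cost is O(d) no matter how large M is.
        return (1.0 + np.dot(w, x)) ** M

    print(poly_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0]), M=3))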

Radial Basis Function (RBF) Kernel

Input space X = R^d

    k(w, x) = exp( −‖w − x‖² / (2σ²) ),

where σ² is known as the bandwidth parameter.


Does it act like a similarity score?
Why “radial”?
Have we departed from our “inner product of feature vector” recipe?
Yes and no: it corresponds to an infinite-dimensional feature vector.
Probably the most common nonlinear kernel.

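A minimal sketch of the RBF kernel; the bandwidth and inputs are arbitrary.

    import numpy as np

    def rbf_kernel(w, x, sigma=1.0):
        # k(w, x) = exp(-||w - x||^2 / (2 sigma^2)): equals 1 when w == x, decays toward 0 with distance.
        diff = np.asarray(w, dtype=float) - np.asarray(x, dtype=float)
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # small: the points are far relative to sigma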
Kernel Machines

Feature Vectors from a Kernel

So what can we do with a kernel?


We can generate feature vectors:
Idea: Characterize input x by its similarity to r fixed prototypes in X.

Definition
A kernelized feature vector for an input x ∈ X with respect to a kernel k
and prototype points µ1, …, µr ∈ X is given by

    Φ(x) = [k(x, µ1), …, k(x, µr)] ∈ R^r.

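A sketch of building Φ(x) from a handful of prototype points, together with the kernel machine prediction αᵀΦ(x) defined on the next slide; the prototypes, weights, and RBF bandwidth are made up.

    import numpy as np

    def kernelized_features(x, prototypes, kernel):
        # Phi(x) = [k(x, mu_1), ..., k(x, mu_r)]
        return np.array([kernel(x, mu) for mu in prototypes])

    rbf = lambda w, x: float(np.exp(-np.sum((w - x)**2)))   # RBF kernel with 2 sigma^2 = 1

    prototypes = [np.array([-1.0]), np.array([0.0]), np.array([2.0])]
    Phi = kernelized_features(np.array([0.5]), prototypes, rbf)

    alpha = np.array([0.5, -1.0, 2.0])   # made-up weights
    f_x = float(alpha @ Phi)             # kernel machine prediction f(x) = alpha^T Phi(x)
    print(Phi, f_x)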

Kernel Machines
Definition
A kernel machine is a linear model with kernelized feature vectors.
This corresponds to prediction functions of the form

    f(x) = αᵀΦ(x) = Σ_{i=1}^r α_i k(x, µ_i),

for α ∈ R^r.
An Interpretation
For each µ_i, we get a function on X:

    x ↦ k(x, µ_i)

f(x) is a linear combination of these functions.



Kernel Machine Basis Functions

Input space X = R
 
RBF kernel: k(w, x) = exp(−(w − x)²).
Prototypes at {−6, −4, −3, 0, 2, 4}.
Corresponding basis functions:


Kernel Machine Prediction Functions

Basis functions

Predictions of the form


    f(x) = Σ_{i=1}^r α_i k(x, µ_i)


RBF Network

An RBF network is a linear model with an RBF kernel.


First described in 1988 by Broomhead and Lowe (neural network
literature)

Characteristics:
Nonlinear
Smoothness depends on RBF kernel bandwidth


How to Choose Prototypes

Uniform grid on space?


only feasible in low dimensions
where to focus the grid?
Cluster centers of training data?
Possible, but clustering is difficult in high dimensions
Use all (or a subset of) the training points
Most common approach for kernel methods


All Training Points as Prototypes

Consider training inputs x1, …, xn ∈ X.

Then

    f(x) = Σ_{i=1}^n α_i k(x, x_i).

Requires all training examples for prediction?


Not quite: only need x_i for α_i ≠ 0.
Want the α_i's to be sparse.
Train with ℓ1 regularization: the ℓ1-regularized vector machine.
[Will show SVM also gives sparse functions of this form.]

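One possible way to fit such an ℓ1-regularized vector machine (a sketch under my own assumptions, not the course's reference code): build the kernelized features against the training points and hand them to scikit-learn's L1-penalized logistic regression. The synthetic data and regularization strength are made up, and scikit-learn also fits an intercept, which the slides' model omits.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)     # a nonlinear (circular) decision boundary

    sigma = 0.3
    def rbf(A, B):
        # Pairwise RBF kernel matrix between the rows of A and the rows of B.
        sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    K = rbf(X, X)                                      # kernelized features, prototypes = training points
    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(K, y)
    alpha = clf.coef_.ravel()
    print((alpha != 0).sum(), "of", len(alpha), "training points kept as prototypes")

    # Predict at new points by kernelizing them against the training inputs:
    X_new = rng.normal(size=(5, 2))
    print(clf.predict(rbf(X_new, X)))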

ℓ1-Regularized Vector Machine

RBF kernel with bandwidth σ = 0.3.
Linear hypothesis space: F = { f(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.
Logistic loss function: ℓ(y, ŷ) = log(1 + e^(−yŷ)).
ℓ1 regularization, n = 200 training points.


[Figure: logregL1, nerr = 169 (KPM Figure 14.4b)]



ℓ2-Regularized Vector Machine

RBF kernel with bandwidth σ = 0.3.
Linear hypothesis space: F = { f(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.
Logistic loss function: ℓ(y, ŷ) = log(1 + e^(−yŷ)).
ℓ2 regularization, n = 200 training points.


[Figure: logregL2, nerr = 174 (KPM Figure 14.4a)]


Example: Vector Machine for Ridge Regression

ℓ2-Regularized Vector Machine for Regression

Kernel function k : X × X → R is symmetric (but nothing else is assumed).

Hypothesis space (linear functions on the kernelized feature vector):

    F = { f_α(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.

Objective function (square loss with ℓ2 regularization):

    J(α) = (1/n) Σ_{i=1}^n (y_i − f_α(x_i))² + λαᵀα,

where

    f_α(x_i) = Σ_{j=1}^n α_j k(x_i, x_j).

Note: All dependence on x’s is via the kernel function.


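As a sketch (my own derivation, not given in the slides): J(α) is ordinary ridge regression with design matrix K, so its minimizer solves the normal equations (KᵀK + nλI)α = Kᵀy. The synthetic 1-D data and hyperparameters below are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = np.sort(rng.uniform(-3, 3, size=n))
    y = np.sin(x) + 0.1 * rng.normal(size=n)

    sigma, lam = 0.5, 0.1
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))   # RBF kernel matrix on training inputs

    # Minimize (1/n) ||y - K alpha||^2 + lam * alpha^T alpha via its normal equations.
    alpha = np.linalg.solve(K.T @ K + n * lam * np.eye(n), K.T @ y)

    def f(x_new):
        # Prediction f(x) = sum_i alpha_i k(x, x_i).
        k_vec = np.exp(-(x_new - x)**2 / (2 * sigma**2))
        return float(k_vec @ alpha)

    print(f(0.0))   # compare with the noiseless value sin(0.0) = 0.0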

The Kernel Matrix

Note that

    f(x_i) = Σ_{j=1}^n α_j k(x_i, x_j)

depends only on the kernel function evaluated at all pairs of the n training points.

Definition
The kernel matrix for a kernel k on a set {x1, …, xn} is

    K = [k(x_i, x_j)]_{i,j} =
        [ k(x1, x1)  ···  k(x1, xn) ]
        [     ⋮       ⋱       ⋮     ]     ∈ R^{n×n}.
        [ k(xn, x1)  ···  k(xn, xn) ]

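A small sketch of building the kernel matrix for a list of training inputs and any symmetric kernel; the inputs and the choice of the linear kernel are arbitrary.

    import numpy as np

    def kernel_matrix(xs, kernel):
        # K[i, j] = k(x_i, x_j) for every pair of training inputs.
        n = len(xs)
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(xs[i], xs[j])
        return K

    xs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
    K = kernel_matrix(xs, lambda w, x: float(np.dot(w, x)))   # linear kernel, so K equals X X^T
    print(K)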

Vectorizing the Vector Machine

Claim: Kα gives the prediction vector (f_α(x1), …, f_α(xn))ᵀ:

    Kα = [ k(x1, x1)  ···  k(x1, xn) ] [ α1 ]
         [     ⋮       ⋱       ⋮     ] [  ⋮ ]
         [ k(xn, x1)  ···  k(xn, xn) ] [ αn ]

       = [ α1 k(x1, x1) + ··· + αn k(x1, xn) ]
         [                 ⋮                 ]
         [ α1 k(xn, x1) + ··· + αn k(xn, xn) ]

       = [ f_α(x1) ]
         [    ⋮    ]
         [ f_α(xn) ].


Vectorizing the Vector Machine

The i-th residual is y_i − f_α(x_i). We can vectorize this as

    y − Kα = [ y1 − f_α(x1) ]
             [       ⋮      ]
             [ yn − f_α(xn) ].

The sum of squared residuals is

    (y − Kα)ᵀ(y − Kα).

Objective function:

    J(α) = (1/n) ‖y − Kα‖² + λαᵀα.


Vectorizing the Vector Machine


Consider X = R^d and k(w, x) = wᵀx (the linear kernel).
Let X ∈ R^{n×d} be the design matrix, which has each input vector as a row:

    X = [ — x1 — ]
        [    ⋮    ]
        [ — xn — ].

Then the kernel matrix is

    K = XXᵀ,   with entries K_{ij} = x_iᵀ x_j.

And the objective function is

    J(α) = (1/n) ‖y − XXᵀα‖² + λαᵀα.
Features vs Kernels


Theorem
Suppose a kernel can be written as an inner product:

    k(w, x) = ⟨φ(w), φ(x)⟩.

Then the kernel machine is a linear classifier with feature map φ(x).

Mercer’s Theorem characterizes kernels with these properties.


Features vs Kernels

Proof.
For prototype points x1, …, xr,

    f(x) = Σ_{i=1}^r α_i k(x, x_i)
         = Σ_{i=1}^r α_i ⟨φ(x), φ(x_i)⟩
         = ⟨ Σ_{i=1}^r α_i φ(x_i), φ(x) ⟩
         = wᵀφ(x),

where w = Σ_{i=1}^r α_i φ(x_i).

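A numerical sanity check of this identity (my own sketch, using the quadratic kernel and its explicit feature map from earlier): the kernel machine Σ_i α_i k(x, x_i) agrees with wᵀφ(x) for w = Σ_i α_i φ(x_i). The prototypes and weights are random.

    import numpy as np

    def phi(x):
        # Explicit feature map for k(w, x) = <w, x> + <w, x>^2 on R^2.
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

    def k(w, x):
        s = float(np.dot(w, x))
        return s + s**2

    rng = np.random.default_rng(0)
    protos = rng.normal(size=(4, 2))                        # prototype points x_1, ..., x_r
    alpha = rng.normal(size=4)
    x = rng.normal(size=2)

    f_kernel = sum(a * k(x, xi) for a, xi in zip(alpha, protos))
    w = sum(a * phi(xi) for a, xi in zip(alpha, protos))    # w = sum_i alpha_i phi(x_i)
    print(np.isclose(f_kernel, float(np.dot(w, phi(x)))))   # True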