4c Kernels
David Rosenberg (New York University)
DS-GA 1003, February 18, 2015
Introduction
Feature Extraction
Focus on effectively representing x ∈ X as a vector φ(x) ∈ R^d.
e.g. bag of words: one coordinate per vocabulary word, counting how often that word occurs in the document x. A small sketch of this map follows.
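A minimal bag-of-words featurizer (a sketch in plain Python; the vocabulary and example document are illustrative, not from the slides):

    from collections import Counter

    def bag_of_words(doc, vocabulary):
        # phi(doc) in R^d: one count per vocabulary word.
        counts = Counter(doc.lower().split())
        return [counts[word] for word in vocabulary]

    vocab = ["kernel", "machine", "learning"]
    print(bag_of_words("Kernel methods in machine learning use kernel functions", vocab))
    # [2, 1, 1]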
Kernel Methods
Definition
A kernel is a function that takes a pair of inputs w, x ∈ X and returns a
real value. That is, k : X × X → R.
Kernel Examples
Comparing Documents
[Figures omitted: worked document-comparison examples.]
Recall that
⟨w, x⟩ = ‖w‖ ‖x‖ cos θ,
where θ is the angle between w and x.
Linear Kernel
Input space X = R^d
k(w, x) = w^T x
When we “kernelize” an algorithm, we write it in terms of the linear kernel.
Then we can swap out the linear kernel and replace it with a more sophisticated one, as in the sketch below.
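A minimal sketch of the idea (assuming NumPy): if an algorithm touches the data only through kernel evaluations, swapping kernels is a one-line change. The predict function here is illustrative, not from the slides.

    import numpy as np

    def linear_kernel(w, x):
        return np.dot(w, x)

    def quadratic_kernel(w, x):
        return (1.0 + np.dot(w, x)) ** 2

    def predict(x, train_xs, alpha, kernel=linear_kernel):
        # The data enter only through kernel evaluations, so any kernel works.
        return sum(a * kernel(x, xi) for a, xi in zip(alpha, train_xs))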
Quadratic Kernel in R^2
Input space X = R^2
Feature map:
$$\varphi : (x_1, x_2) \mapsto \big(x_1,\; x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\big)$$
With this feature map, ⟨φ(w), φ(x)⟩ = ⟨w, x⟩ + ⟨w, x⟩², so the kernel can be evaluated without ever computing φ explicitly.
Based on Guillaume Obozinski’s Statistical Machine Learning course at Louvain, Feb 2014.
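A quick numerical check of this identity (a sketch assuming NumPy; the test vectors are made up):

    import numpy as np

    def phi(x):
        # Explicit quadratic feature map on R^2.
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

    w, x = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    lhs = phi(w) @ phi(x)
    rhs = w @ x + (w @ x) ** 2
    print(np.isclose(lhs, rhs))  # True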
Quadratic Kernel in R^d
Input space X = R^d
Feature map:
$$\varphi(x) = \big(x_1, \ldots, x_d,\; x_1^2, \ldots, x_d^2,\; \sqrt{2}\,x_1 x_2, \ldots, \sqrt{2}\,x_i x_j, \ldots, \sqrt{2}\,x_{d-1} x_d\big)^T$$
Polynomial Kernel in R^d
Input space X = R^d
Kernel function:
k(w, x) = (1 + ⟨w, x⟩)^M
Corresponds to a feature map with all terms up to degree M.
For any M, computing the kernel has the same computational cost: one inner product in R^d plus a scalar power.
Cost of the explicit inner product computation ⟨φ(w), φ(x)⟩ grows rapidly in M.
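A sketch (assuming NumPy) of why the kernel-trick cost does not depend on M: the work is one d-dimensional inner product followed by a scalar power.

    import numpy as np

    def polynomial_kernel(w, x, M):
        # O(d) inner product, then O(1) scalar power, for any degree M.
        return (1.0 + w @ x) ** M

    rng = np.random.default_rng(0)
    w, x = rng.standard_normal(1000), rng.standard_normal(1000)
    print(polynomial_kernel(w, x, M=2), polynomial_kernel(w, x, M=10))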
RBF Kernel
Input space X = R^d
$$k(w, x) = \exp\left(-\frac{\|w - x\|^2}{2\sigma^2}\right),$$
where σ is the bandwidth.
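A minimal RBF kernel implementation (a sketch assuming NumPy; sigma is the bandwidth):

    import numpy as np

    def rbf_kernel(w, x, sigma=1.0):
        # Nearby points give values near 1; distant points give values near 0.
        return np.exp(-np.sum((w - x) ** 2) / (2.0 * sigma ** 2))

    print(rbf_kernel(np.array([0.0, 0.0]), np.array([0.1, 0.0])))  # near 1
    print(rbf_kernel(np.array([0.0, 0.0]), np.array([5.0, 0.0])))  # near 0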
Kernel Machines
Definition
A kernelized feature vector for an input x ∈ X with respect to a kernel k
and prototype points µ1, . . . , µr ∈ X is given by
$$\Phi(x) = \big(k(x, \mu_1), \ldots, k(x, \mu_r)\big)^T \in \mathbb{R}^r.$$
Definition
A kernel machine is a linear model with kernelized feature vectors.
This corresponds to a prediction function of the form
$$f(x) = \alpha^T \Phi(x) = \sum_{i=1}^{r} \alpha_i\, k(x, \mu_i),$$
for α ∈ R^r.
An Interpretation
For each µi, we get a function on X:
x ↦ k(x, µi)
Input space X = R
RBF kernel k(w, x) = exp(−(w − x)²).
Prototypes at {−6, −4, −3, 0, 2, 4}.
Corresponding basis functions:
[Figure omitted: the six basis functions x ↦ k(x, µi), one bump centered at each prototype.]
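A sketch of this kernel machine in code (assuming NumPy; the weights alpha are made up for illustration):

    import numpy as np

    rbf = lambda w, x: np.exp(-(w - x) ** 2)
    prototypes = np.array([-6.0, -4.0, -3.0, 0.0, 2.0, 4.0])
    alpha = np.array([1.0, -1.0, 0.5, 2.0, -0.5, 1.0])  # illustrative weights

    def Phi(x):
        # Kernelized feature vector: one RBF basis function per prototype.
        return np.array([rbf(x, mu) for mu in prototypes])

    f = lambda x: alpha @ Phi(x)  # f(x) = sum_i alpha_i k(x, mu_i)
    print(f(0.5))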
RBF Network
Characteristics:
Nonlinear
Smoothness depends on the RBF kernel bandwidth σ.
Example: Kernel Machine for Ridge Regression
Take the prototypes to be the training inputs x1, . . . , xn. The objective is
$$J(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f_\alpha(x_i)\big)^2 + \lambda\,\alpha^T\alpha,$$
where
$$f_\alpha(x_i) = \sum_{j=1}^{n} \alpha_j\, k(x_i, x_j).$$
Note that
$$f(x_i) = \sum_{j=1}^{n} \alpha_j\, k(x_i, x_j).$$
Definition
The kernel matrix for a kernel k on a set {x1, . . . , xn} is
$$K = \big(k(x_i, x_j)\big)_{i,j} = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} \in \mathbb{R}^{n \times n}.$$
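A sketch of building the kernel matrix (assuming NumPy; the data and kernel are illustrative):

    import numpy as np

    def kernel_matrix(xs, kernel):
        # K[i, j] = k(x_i, x_j)
        n = len(xs)
        return np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])

    xs = np.array([0.0, 1.0, 2.5])
    rbf = lambda w, x: np.exp(-(w - x) ** 2)
    K = kernel_matrix(xs, rbf)
    print(K.shape)  # (3, 3); K is symmetric with ones on the diagonal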
In matrix form,
$$K\alpha = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} = \begin{pmatrix} \alpha_1 k(x_1, x_1) + \cdots + \alpha_n k(x_1, x_n) \\ \vdots \\ \alpha_1 k(x_n, x_1) + \cdots + \alpha_n k(x_n, x_n) \end{pmatrix} = \begin{pmatrix} f_\alpha(x_1) \\ \vdots \\ f_\alpha(x_n) \end{pmatrix}.$$
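A quick numerical check of this identity (a sketch assuming NumPy; data and weights are made up):

    import numpy as np

    xs = np.array([0.0, 1.0, 2.5])
    rbf = lambda w, x: np.exp(-(w - x) ** 2)
    K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])
    alpha = np.array([0.5, -1.0, 2.0])

    f_at_train = np.array([sum(a * rbf(xi, xj) for a, xj in zip(alpha, xs)) for xi in xs])
    print(np.allclose(K @ alpha, f_at_train))  # True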
The sum of squared residuals is therefore
$$\sum_{i=1}^{n} \big(y_i - f_\alpha(x_i)\big)^2 = (y - K\alpha)^T (y - K\alpha).$$
Objective function:
$$J(\alpha) = \frac{1}{n}\,\|y - K\alpha\|^2 + \lambda\,\alpha^T \alpha$$
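Setting ∇J(α) = 0 and using K = K^T gives the normal equations (K² + nλI)α = Ky. A sketch of solving them (assuming NumPy; the data and bandwidth are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    xs = np.sort(rng.uniform(-3, 3, size=30))
    y = np.sin(xs) + 0.1 * rng.standard_normal(30)

    rbf = lambda w, x: np.exp(-(w - x) ** 2)  # illustrative kernel
    K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])

    n, lam = len(xs), 0.1
    # Normal equations for J(alpha) = (1/n)||y - K alpha||^2 + lam * alpha^T alpha:
    alpha = np.linalg.solve(K @ K + n * lam * np.eye(n), K @ y)

    f = lambda x: sum(a * rbf(x, xj) for a, xj in zip(alpha, xs))
    print(f(0.0))  # fitted value at x = 0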
Special case: the linear kernel k(w, x) = w^T x. Let X be the design matrix whose rows are x_1^T, . . . , x_n^T. Then the kernel matrix is
$$K = XX^T = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}.$$
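A numerical check of K = XX^T for the linear kernel (a sketch assuming NumPy; the data are random):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))  # rows are x_1^T, ..., x_5^T

    K_explicit = np.array([[xi @ xj for xj in X] for xi in X])
    print(np.allclose(K_explicit, X @ X.T))  # True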
Features vs Kernels
Theorem
Suppose a kernel can be written as an inner product of feature vectors:
$$k(w, x) = \langle \varphi(w), \varphi(x) \rangle.$$
Then the kernel machine is a linear classifier with feature map φ(x).
Proof.
For prototype points x1, . . . , xr,
$$f(x) = \sum_{i=1}^{r} \alpha_i\, k(x, x_i) = \sum_{i=1}^{r} \alpha_i \langle \varphi(x), \varphi(x_i) \rangle = \Big\langle \sum_{i=1}^{r} \alpha_i\, \varphi(x_i),\; \varphi(x) \Big\rangle = w^T \varphi(x),$$
where $w = \sum_{i=1}^{r} \alpha_i\, \varphi(x_i)$.
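A numerical illustration of the proof (a sketch assuming NumPy; the quadratic feature map from earlier plays the role of φ, and the prototypes and weights are random):

    import numpy as np

    def phi(x):
        # phi for the quadratic kernel on R^2: k(w, x) = <w, x> + <w, x>^2
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

    k = lambda u, v: u @ v + (u @ v) ** 2

    rng = np.random.default_rng(0)
    prototypes = rng.standard_normal((4, 2))
    alpha = rng.standard_normal(4)
    x = rng.standard_normal(2)

    f_kernel = sum(a * k(x, xi) for a, xi in zip(alpha, prototypes))
    w = sum(a * phi(xi) for a, xi in zip(alpha, prototypes))  # w = sum_i alpha_i phi(x_i)
    print(np.isclose(f_kernel, w @ phi(x)))  # True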