
(chapters 1,2,3,4)

Introduction to Kernels

Max Welling
October 1 2004

Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics

Automatic detection of non-coincidental structure in data.

• Desiderata:
- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.
Let’s Learn Something
What is the common characteristic (structure) among the following
statistical methods?

1. Principal Components Analysis


2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis

Answer:
We consider linear combinations of the input vector: $f(x) = w^T x$

Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces

 : x   ( x), R  F d

non-linear mapping to F 
1. high-D space L2
2. infinite-D countable space :
3. function space (Hilbert space)

example: ( x, y )  ( x , y , 2 xy )
2 2

4
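A minimal NumPy sketch (my own illustration) checking that this explicit feature map realizes the quadratic kernel, i.e. $\langle \phi(u), \phi(v) \rangle = \langle u, v \rangle^2$:

```python
import numpy as np

def phi(v):
    """Explicit feature map from the slide: (x, y) -> (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Inner product in feature space equals the squared inner product in input space,
# i.e. this phi realizes the quadratic kernel k(u, v) = <u, v>^2.
print(np.dot(phi(u), phi(v)))   # 1.0
print(np.dot(u, v) ** 2)        # 1.0
```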
Ridge Regression (duality)

problem: $\min_w \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$
($y_i$: target, $x_i$: input, $\lambda \|w\|^2$: regularization)

solution: $w = (X^T X + \lambda I_d)^{-1} X^T y$   ($d \times d$ inverse)
          $\; = X^T (X X^T + \lambda I_\ell)^{-1} y$   ($\ell \times \ell$ inverse)
          $\; = X^T (G + \lambda I_\ell)^{-1} y$,   $G_{ij} = \langle x_i, x_j \rangle$   (Gram matrix)
          $\; = \sum_{i=1}^{\ell} \alpha_i x_i$   (linear combination of the data: the dual representation)
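A minimal NumPy sketch (synthetic data, arbitrary $\lambda$; variable names are my own) checking that the primal and dual ridge solutions coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d = 50, 5                      # sample size and input dimension
X = rng.standard_normal((ell, d))   # rows are the inputs x_i
y = rng.standard_normal(ell)        # targets y_i
lam = 0.1                           # regularization strength lambda

# Primal solution: (X^T X + lam I_d)^{-1} X^T y  -- a d x d inverse.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: alpha = (G + lam I_ell)^{-1} y with Gram matrix G = X X^T,
# then w = X^T alpha = sum_i alpha_i x_i  -- an ell x ell inverse.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(ell), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: the two representations agree
```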


Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.

Kernel Trick:

Replace $x \to \phi(x)$; then
$G_{ij} = \langle x_i, x_j \rangle \;\longrightarrow\; G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$, the kernel.

If we use algorithms that only depend on the Gram matrix, $G$,
then we never have to know (compute) the actual features $\phi(x)$.

This is the crucial point of kernel methods.


Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.


some kernels:
- $k(x, y) = e^{-\|x - y\|^2 / c}$   (RBF)
- $k(x, y) = (\langle x, y \rangle + \theta)^d$   (polynomial)
- $k(x, y) = \tanh(\alpha \langle x, y \rangle + \theta)$   (sigmoid)
- $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$

some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
What is a proper kernel?
Definition: A finitely positive semi-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a symmetric function of its arguments for which every matrix formed
by restriction to a finite subset of points is positive semi-definite:
$\alpha^T K \alpha \geq 0 \quad \forall \alpha$.

Theorem: A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written
as $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi(x)$ is a feature map
$x \mapsto \phi(x) \in F$, iff $k(x, y)$ satisfies the semi-definiteness property.

Relevance: We can now check whether $k(x, y)$ is a proper kernel using
only properties of $k(x, y)$ itself,
i.e. without the need to know the feature map!
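As a sanity check (my own illustration; this can refute but never prove the property, since only finitely many subsets are tested), one can sample finite point sets and verify that the resulting kernel matrices are positive semi-definite:

```python
import numpy as np

def rbf_kernel(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def looks_psd(kernel, n_trials=100, n_points=20, tol=-1e-10):
    """Empirically test positive semi-definiteness on random finite subsets of points."""
    rng = np.random.default_rng(0)
    for _ in range(n_trials):
        X = rng.standard_normal((n_points, 3))
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        if np.min(np.linalg.eigvalsh(K)) < tol:   # a negative eigenvalue means alpha^T K alpha < 0
            return False
    return True

print(looks_psd(rbf_kernel))                     # True: the RBF kernel passes every test
print(looks_psd(lambda x, y: -np.dot(x, y)))     # False: not positive semi-definite
```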
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very
special feature map (note that more than one feature map may give rise to the same kernel):

 : x   ( x)  k ( x,.) i.e. we map to a function space.

definition function space: reproducing property:


m
f (.)    i k ( xi ,.) any m,{xi }  f ,  ( x)  f , k ( x,.) 
i 1
k
   i k ( xi ,.), k ( x,.) 
m 
 f , g    i  j k ( xi , x j )
i 1 j 1 i 1
k

  k ( x , x)  f ( x)
m 
 f , f    i j k ( xi , x j )  0 i i
i 1 j 1 i 1

( finite positive semi  definite)    ( x),  ( y )  k ( x, y ) 9
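A minimal NumPy sketch (my own toy check) representing functions $f = \sum_i \alpha_i k(x_i, \cdot)$ by their coefficient vectors, with inner product $\langle f, g \rangle = \alpha^T K \beta$, and verifying the reproducing property and $\langle f, f \rangle \geq 0$ numerically:

```python
import numpy as np

def rbf_kernel(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))                  # base points x_1, ..., x_m
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
alpha = rng.standard_normal(10)                   # f = sum_i alpha_i k(x_i, .)

# phi(x_j) = k(x_j, .) has coefficient vector e_j, so <f, phi(x_j)> = (K alpha)_j.
inner_products = K @ alpha

# Direct evaluation: f(x_j) = sum_i alpha_i k(x_i, x_j).
evaluations = np.array([sum(a * rbf_kernel(xi, xj) for a, xi in zip(alpha, X)) for xj in X])

print(np.allclose(inner_products, evaluations))   # True: the reproducing property
print(alpha @ K @ alpha >= 0)                     # True: <f, f> >= 0 since K is PSD
```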


Mercer’s Theorem
Theorem: If $X$ is compact and $k(x, y)$ is a symmetric continuous function such that
$T_k f = \int k(\cdot, x)\, f(x)\, dx$ is a positive semi-definite operator, $T_k \succeq 0$, i.e.

$\int\!\!\int k(x, y)\, f(x)\, f(y)\, dx\, dy \;\geq\; 0 \quad \forall f \in L_2(X),$

then there exists an orthonormal basis of eigen-functions $\{\phi_i\}$ with eigenvalues $\lambda_i \geq 0$
such that:

$k(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y)$

Hence: k(x,y) is a proper kernel.


Note: Here we construct feature vectors in $L_2$, whereas the RKHS
construction was in a function space.
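A minimal NumPy sketch of the discrete (finite-sample) analogue, my own illustration: the eigendecomposition of a Gram matrix yields feature vectors $\phi(x_i) = (\sqrt{\lambda_1}\, v_{1i}, \sqrt{\lambda_2}\, v_{2i}, \ldots)$ whose inner products reconstruct the kernel values:

```python
import numpy as np

def rbf_kernel(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 2))
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Eigendecomposition K = V diag(lam) V^T, with lam >= 0 since K is PSD.
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)          # clip tiny negative round-off

# Feature vectors: phi(x_i) = sqrt(lam) * V[i, :] (row i of Phi).
Phi = V * np.sqrt(lam)

# Inner products in this feature space reproduce the kernel values: Phi Phi^T = K.
print(np.allclose(Phi @ Phi.T, K))     # True
```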
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel: $k(x, y) = e^{-\|x - y\|^2 / c}$

if $c$ is very small: $G \approx I$ (all data are dissimilar): over-fitting

if $c$ is very large: $G \approx \mathbf{1}$, the all-ones matrix (all data are very similar): under-fitting

We need to learn the kernel. Here are some ways to combine
kernels to improve them:

$\alpha\, k_1(x, y) + \beta\, k_2(x, y) = k(x, y), \quad \alpha, \beta \geq 0$   (kernels form a cone)
$k_1(x, y)\, k_2(x, y) = k(x, y)$
$k_1(\phi(x), \phi(y)) = k(x, y)$
any polynomial of kernels with positive coefficients is again a kernel
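A small NumPy check (illustration only) that a conic combination and the elementwise product of two proper-kernel Gram matrices are again positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

K1 = np.exp(-sq_dists / 2.0)          # RBF kernel with c = 2
K2 = (X @ X.T + 1.0) ** 2             # polynomial kernel with theta = 1, d = 2

def min_eig(K):
    return np.min(np.linalg.eigvalsh(K))

print(min_eig(2.0 * K1 + 0.5 * K2) >= -1e-8)   # conic combination: still PSD
print(min_eig(K1 * K2) >= -1e-8)               # elementwise product: still PSD (Schur product theorem)
```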
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds,...

Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample $S$.


Is this pattern also likely to be present in new data: $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:

Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$ and define
the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that:

$P\Big( \|\bar{f} - E_P[f]\| \;\leq\; \frac{R}{\sqrt{\ell}} \big( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \big) \Big) \;\geq\; 1 - \delta, \qquad R = \sup_x \|f(x)\|$

(the probability that the sample mean and the population mean differ by less than this amount
is at least $1 - \delta$, independent of $P$!)
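A Monte Carlo sketch (my own setup: $f$ the identity on a bounded 2-D distribution) checking that the deviation $\|\bar{f} - E_P[f]\|$ stays below the stated bound in at least a $1-\delta$ fraction of repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
ell, delta = 100, 0.05
R = np.sqrt(2.0)                       # sup ||f(x)|| for f(x) = x on [-1, 1]^2
bound = (R / np.sqrt(ell)) * (2.0 + np.sqrt(2.0 * np.log(1.0 / delta)))

true_mean = np.zeros(2)                # E_P[f] = 0 for the uniform distribution on [-1, 1]^2
n_trials, hits = 2000, 0
for _ in range(n_trials):
    S = rng.uniform(-1.0, 1.0, size=(ell, 2))        # IID sample of size ell
    hits += np.linalg.norm(S.mean(axis=0) - true_mean) <= bound

print(hits / n_trials >= 1.0 - delta)  # True (the bound is quite loose here)
```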
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?

Intuition: we need to incorporate the complexity of this function class.

Rademacher complexity captures the ability of the function class to


fit random noise ($\sigma_i = \pm 1$, uniformly distributed).
(empirical RC)
$\hat{R}_\ell(F) = E_\sigma \Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \;\Big|\; x_1, \ldots, x_\ell \Big]$

$R_\ell(F) = E_S\, \hat{R}_\ell(F) = E_S E_\sigma \Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \Big]$
Generalization Bound
Theorem: Let f be a function in F which maps to [0,1]. (e.g. loss functions)
Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every $f$ satisfies:

$E_P[f(x)] \leq \hat{E}_{\mathrm{data}}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}$
$\qquad\;\;\, \leq \hat{E}_{\mathrm{data}}[f(x)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$
Relevance: The expected pattern E[f]=0 will also be present in a new
data set if the last two terms are small:
- the complexity of the function class F is small
- the number of training data is large
Linear Functions (in feature space)
Consider the function class: $F_B = \{\, f : x \mapsto \langle w, \phi(x) \rangle,\; \|w\| \leq B \,\}$
with $k(x, y) = \langle \phi(x), \phi(y) \rangle$,

and a sample: $S = \{x_1, \ldots, x_\ell\}$

Then the empirical RC of $F_B$ is bounded by:

$\hat{R}_\ell(F_B) \;\leq\; \frac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$

Relevance: Since $\{\, x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x),\; \alpha^T K \alpha \leq B^2 \,\} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
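For this class the supremum in the empirical RC has a closed form, $\sup_{\|w\| \leq B} \big| \sum_i \sigma_i \langle w, \phi(x_i) \rangle \big| = B \sqrt{\sigma^T K \sigma}$, so it can be estimated by Monte Carlo over $\sigma$ and compared with the trace bound; a sketch (my own setup, RBF kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, B = 40, 1.0
X = rng.standard_normal((ell, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)                                  # RBF Gram matrix with c = 1

# Monte Carlo estimate of hat{R}_ell(F_B): the sup over ||w|| <= B is attained at
# w proportional to sum_i sigma_i phi(x_i), giving B * sqrt(sigma^T K sigma).
sigma = rng.choice([-1.0, 1.0], size=(5000, ell))
rc_estimate = (2.0 * B / ell) * np.mean(np.sqrt(np.einsum('si,ij,sj->s', sigma, K, sigma)))

rc_bound = (2.0 * B / ell) * np.sqrt(np.trace(K))      # the bound from the slide

print(rc_estimate <= rc_bound, rc_estimate, rc_bound)  # True: the estimate sits below the bound
```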
Margin Bound (classification)
Theorem: Choose $c > 0$ (the margin). Let
$F : f(x, y) = -y\, g(x)$, with $y = \pm 1$,
$S = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ an IID sample, and
$\delta \in (0, 1)$ the probability of violating the bound. Then

$P_P[\, y \neq \mathrm{sign}(g(x))\,] \;\leq\; \frac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i \;+\; \frac{4}{\ell c} \sqrt{\mathrm{tr}(K)} \;+\; 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}}$

(probability of misclassification)

$\xi_i = (c - y_i g(x_i))_+$   (slack variable)
$(f)_+ = f$ if $f \geq 0$, and $0$ otherwise
Relevance: We bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified with $\xi_i = 0$ (e.g. support vector machines).
