
Kernel Methods

David Rosenberg

New York University

DS-GA 1003, February 18, 2015
Introduction

Feature Extraction
Focus on effectively representing x ∈ X as a vector φ(x) ∈ R^d.
e.g., bag of words: each coordinate of φ(x) counts how many times a particular vocabulary word occurs in the document x.

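To make the bag-of-words map concrete, here is a minimal Python sketch (my own, not from the lecture); the vocabulary and document are invented for illustration.

    from collections import Counter

    def bag_of_words(doc, vocab):
        # Map a document (a string) to a vector of word counts over a fixed vocabulary.
        counts = Counter(doc.lower().split())
        return [counts[word] for word in vocab]

    vocab = ["kernel", "machine", "learning", "twitter"]
    phi = bag_of_words("Kernel methods in machine learning", vocab)
    # phi == [1, 1, 1, 0]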

Kernel Methods

Primary focus is on comparing two inputs w, x ∈ X.

Definition
A kernel is a function that takes a pair of inputs w, x ∈ X and returns a
real value. That is, k : X × X → R.

Can interpret k(w, x) as a similarity score, but this is not precise.

We will deal with symmetric kernels: k(w, x) = k(x, w).

Kernel Examples

Comparing Documents


Comparing Documents: Bag of Words

Comparing Documents: Cosine Similarity

1. Normalize each feature vector to have ‖x‖₂ = 1.
2. Take the inner product.
3. Define k as this inner product; for the two example documents,

    k(VentureBeat, Twitter Tweet) = 0.85

Cosine Similarity Kernel

Why the name? Recall

⟨w, x⟩ = ‖w‖ ‖x‖ cos θ,

where θ is the angle between w, x ∈ R^d.


So

    k(w, x) = cos θ = ⟨ w/‖w‖, x/‖x‖ ⟩.

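As a quick illustration of this formula (my own sketch, not from the slides), here is the cosine-similarity kernel in numpy; the two count vectors are made up.

    import numpy as np

    def cosine_kernel(w, x):
        # k(w, x) = <w/||w||, x/||x||> = cosine of the angle between w and x.
        w = np.asarray(w, dtype=float)
        x = np.asarray(x, dtype=float)
        return float(np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x)))

    print(cosine_kernel([3, 1, 0, 2], [1, 0, 1, 1]))  # a value in [-1, 1]; 1 means identical direction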

Linear Kernel

Input space X = R^d
k(w, x) = wᵀx
When we “kernelize” an algorithm, we write it in terms of the linear
kernel.
Then we can swap it out and replace it with a more sophisticated kernel.


Quadratic Kernel in R²

Input space X = R²
Feature map:

    φ : (x₁, x₂) ↦ (x₁, x₂, x₁², x₂², √2 x₁x₂)

Gives us ability to represent conic section boundaries.


Define kernel as inner product in feature space:

    k(w, x) = ⟨φ(w), φ(x)⟩
            = w₁x₁ + w₂x₂ + w₁²x₁² + w₂²x₂² + 2w₁w₂x₁x₂
            = w₁x₁ + w₂x₂ + (w₁x₁)² + (w₂x₂)² + 2(w₁x₁)(w₂x₂)
            = ⟨w, x⟩ + ⟨w, x⟩²

Based on Guillaume Obozinski’s Statistical Machine Learning course at Louvain, Feb 2014.

Quadratic Kernel in R^d

Input space X = R^d
Feature map:

    φ(x) = (x₁, …, x_d, x₁², …, x_d², √2 x₁x₂, …, √2 x_i x_j, …, √2 x_{d−1} x_d)ᵀ

Number of terms = d + d(d + 1)/2 ≈ d²/2.


Still have

    k(w, x) = ⟨φ(w), φ(x)⟩ = ⟨w, x⟩ + ⟨w, x⟩²

Computation for inner product with explicit mapping: O(d²).


Computation for implicit kernel calculation: O(d).

Based on Guillaume Obozinski’s Statistical Machine Learning course at Louvain, Feb 2014.
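A small numpy check (my own sketch) that the implicit O(d) formula ⟨w, x⟩ + ⟨w, x⟩² matches the inner product of the explicit O(d²)-dimensional feature vectors.

    import numpy as np

    def phi_quad(x):
        # Explicit quadratic feature map: linear terms, squared terms, sqrt(2)-weighted cross terms.
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.concatenate([x, x**2, cross])

    def k_quad(w, x):
        # Implicit computation in O(d): <w, x> + <w, x>^2.
        s = np.dot(w, x)
        return s + s**2

    rng = np.random.default_rng(0)
    w, x = rng.normal(size=5), rng.normal(size=5)
    print(np.allclose(np.dot(phi_quad(w), phi_quad(x)), k_quad(w, x)))  # True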

Polynomial Kernel in R^d

Input space X = R^d
Kernel function:

    k(w, x) = (1 + ⟨w, x⟩)^M

Corresponds to a feature map with all terms up to degree M.
For any M, computing the kernel has the same computational cost.
Cost of explicit inner product computation grows rapidly in M.

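A one-line sketch of this kernel; the degree and test vectors below are arbitrary.

    import numpy as np

    def poly_kernel(w, x, M=3):
        # k(w, x) = (1 + <w, x>)^M; the cost is O(d) no matter how large M is.
        return (1.0 + np.dot(w, x)) ** M

    print(poly_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0]), M=3))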

Radial Basis Function (RBF) Kernel

Input space X = R^d

    k(w, x) = exp( −‖w − x‖² / (2σ²) ),

where σ² is known as the bandwidth parameter.


Does it act like a similarity score?
Why “radial”?
Have we departed from our “inner product of feature vector” recipe?
Yes and no: it corresponds to an infinite-dimensional feature vector.
Probably the most common nonlinear kernel.

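A minimal sketch of the RBF kernel; the bandwidth and inputs are arbitrary.

    import numpy as np

    def rbf_kernel(w, x, sigma=1.0):
        # k(w, x) = exp(-||w - x||^2 / (2 sigma^2)): equals 1 when w == x, decays toward 0 with distance.
        diff = np.asarray(w, dtype=float) - np.asarray(x, dtype=float)
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # small: the points are far relative to sigma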
Kernel Machines

Feature Vectors from a Kernel

So what can we do with a kernel?


We can generate feature vectors:
Idea: Characterize input x by its similarity to r fixed prototypes in X.

Definition
A kernelized feature vector for an input x ∈ X with respect to a kernel k
and prototype points µ1, …, µr ∈ X is given by

    Φ(x) = [k(x, µ1), …, k(x, µr)] ∈ R^r.

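A sketch of building Φ(x) from a handful of prototype points, together with the kernel machine prediction αᵀΦ(x) defined on the next slide; the prototypes, weights, and RBF bandwidth are made up.

    import numpy as np

    def kernelized_features(x, prototypes, kernel):
        # Phi(x) = [k(x, mu_1), ..., k(x, mu_r)]
        return np.array([kernel(x, mu) for mu in prototypes])

    rbf = lambda w, x: float(np.exp(-np.sum((w - x)**2)))   # RBF kernel with 2 sigma^2 = 1

    prototypes = [np.array([-1.0]), np.array([0.0]), np.array([2.0])]
    Phi = kernelized_features(np.array([0.5]), prototypes, rbf)

    alpha = np.array([0.5, -1.0, 2.0])   # made-up weights
    f_x = float(alpha @ Phi)             # kernel machine prediction f(x) = alpha^T Phi(x)
    print(Phi, f_x)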

Kernel Machines
Definition
A kernel machine is a linear model with kernelized feature vectors.
This corresponds to prediction functions of the form

    f(x) = αᵀΦ(x) = Σ_{i=1}^r α_i k(x, µ_i),

for α ∈ R^r.
An Interpretation
For each µ_i, we get a function on X:

    x ↦ k(x, µ_i)

f(x) is a linear combination of these functions.



Kernel Machine Basis Functions

Input space X = R
 
RBF kernel: k(w, x) = exp(−(w − x)²).
Prototypes at {−6, −4, −3, 0, 2, 4}.
Corresponding basis functions:


Kernel Machine Prediction Functions

Basis functions

Predictions of the form


    f(x) = Σ_{i=1}^r α_i k(x, µ_i)


RBF Network

An RBF network is a linear model with an RBF kernel.


First described in 1988 by Broomhead and Lowe (neural network
literature)

Characteristics:
Nonlinear
Smoothness depends on RBF kernel bandwidth


How to Choose Prototypes

Uniform grid on space?


only feasible in low dimensions
where to focus the grid?
Cluster centers of training data?
Possible, but clustering is difficult in high dimensions
Use all (or a subset of) the training points
Most common approach for kernel methods


All Training Points as Prototypes

Consider training inputs x1, …, xn ∈ X.

Then

    f(x) = Σ_{i=1}^n α_i k(x, x_i).

Requires all training examples for prediction?


Not quite: only need x_i for α_i ≠ 0.
Want the α_i's to be sparse.
Train with ℓ1 regularization: the ℓ1-regularized vector machine.
[Will show SVM also gives sparse functions of this form.]

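One possible way to fit such an ℓ1-regularized vector machine (a sketch under my own assumptions, not the course's reference code): build the kernelized features against the training points and hand them to scikit-learn's L1-penalized logistic regression. The synthetic data and regularization strength are made up, and scikit-learn also fits an intercept, which the slides' model omits.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)     # a nonlinear (circular) decision boundary

    sigma = 0.3
    def rbf(A, B):
        # Pairwise RBF kernel matrix between the rows of A and the rows of B.
        sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    K = rbf(X, X)                                      # kernelized features, prototypes = training points
    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(K, y)
    alpha = clf.coef_.ravel()
    print((alpha != 0).sum(), "of", len(alpha), "training points kept as prototypes")

    # Predict at new points by kernelizing them against the training inputs:
    X_new = rng.normal(size=(5, 2))
    print(clf.predict(rbf(X_new, X)))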

ℓ1-Regularized Vector Machine

RBF kernel with bandwidth σ = 0.3.
Linear hypothesis space: F = { f(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.
Logistic loss function: ℓ(y, ŷ) = log(1 + e^(−yŷ)).
ℓ1 regularization, n = 200 training points.


[Figure: logregL1, nerr = 169 (KPM Figure 14.4b)]



ℓ2-Regularized Vector Machine

RBF kernel with bandwidth σ = 0.3.
Linear hypothesis space: F = { f(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.
Logistic loss function: ℓ(y, ŷ) = log(1 + e^(−yŷ)).
ℓ2 regularization, n = 200 training points.


[Figure: logregL2, nerr = 174 (KPM Figure 14.4a)]


Example: Vector Machine for Ridge Regression

ℓ2-Regularized Vector Machine for Regression

Kernel function k : X × X → R is symmetric (but nothing else is assumed).

Hypothesis space (linear functions on the kernelized feature vector):

    F = { f_α(x) = Σ_{i=1}^n α_i k(x, x_i) | α ∈ R^n }.

Objective function (square loss with ℓ2 regularization):

    J(α) = (1/n) Σ_{i=1}^n (y_i − f_α(x_i))² + λαᵀα,

where

    f_α(x_i) = Σ_{j=1}^n α_j k(x_i, x_j).

Note: All dependence on x’s is via the kernel function.


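As a sketch (my own derivation, not given in the slides): J(α) is ordinary ridge regression with design matrix K, so its minimizer solves the normal equations (KᵀK + nλI)α = Kᵀy. The synthetic 1-D data and hyperparameters below are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = np.sort(rng.uniform(-3, 3, size=n))
    y = np.sin(x) + 0.1 * rng.normal(size=n)

    sigma, lam = 0.5, 0.1
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))   # RBF kernel matrix on training inputs

    # Minimize (1/n) ||y - K alpha||^2 + lam * alpha^T alpha via its normal equations.
    alpha = np.linalg.solve(K.T @ K + n * lam * np.eye(n), K.T @ y)

    def f(x_new):
        # Prediction f(x) = sum_i alpha_i k(x, x_i).
        k_vec = np.exp(-(x_new - x)**2 / (2 * sigma**2))
        return float(k_vec @ alpha)

    print(f(0.0))   # compare with the noiseless value sin(0.0) = 0.0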

The Kernel Matrix

Note that

    f(x_i) = Σ_{j=1}^n α_j k(x_i, x_j)

depends only on the kernel function evaluated at all pairs of the n training points.

Definition
The kernel matrix for a kernel k on a set {x1, …, xn} is

    K = [k(x_i, x_j)]_{i,j} =
        [ k(x1, x1)  ···  k(x1, xn) ]
        [     ⋮       ⋱       ⋮     ]     ∈ R^{n×n}.
        [ k(xn, x1)  ···  k(xn, xn) ]

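A small sketch of building the kernel matrix for a list of training inputs and any symmetric kernel; the inputs and the choice of the linear kernel are arbitrary.

    import numpy as np

    def kernel_matrix(xs, kernel):
        # K[i, j] = k(x_i, x_j) for every pair of training inputs.
        n = len(xs)
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(xs[i], xs[j])
        return K

    xs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
    K = kernel_matrix(xs, lambda w, x: float(np.dot(w, x)))   # linear kernel, so K equals X X^T
    print(K)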

Vectorizing the Vector Machine

Claim: Kα gives the prediction vector (f_α(x1), …, f_α(xn))ᵀ:

    Kα = [ k(x1, x1)  ···  k(x1, xn) ] [ α1 ]
         [     ⋮       ⋱       ⋮     ] [  ⋮ ]
         [ k(xn, x1)  ···  k(xn, xn) ] [ αn ]

       = [ α1 k(x1, x1) + ··· + αn k(x1, xn) ]
         [                 ⋮                 ]
         [ α1 k(xn, x1) + ··· + αn k(xn, xn) ]

       = [ f_α(x1) ]
         [    ⋮    ]
         [ f_α(xn) ].


Vectorizing the Vector Machine

The i-th residual is y_i − f_α(x_i). We can vectorize this as

    y − Kα = [ y1 − f_α(x1) ]
             [       ⋮      ]
             [ yn − f_α(xn) ].

The sum of squared residuals is

    (y − Kα)ᵀ(y − Kα).

Objective function:

    J(α) = (1/n) ‖y − Kα‖² + λαᵀα.


Vectorizing the Vector Machine


Consider X = R^d and k(w, x) = wᵀx (the linear kernel).
Let X ∈ R^{n×d} be the design matrix, which has each input vector as a row:

    X = [ — x1 — ]
        [    ⋮    ]
        [ — xn — ].

Then the kernel matrix is

    K = XXᵀ,   with entries K_{ij} = x_iᵀ x_j.

And the objective function is

    J(α) = (1/n) ‖y − XXᵀα‖² + λαᵀα.
Features vs Kernels


Theorem
Suppose a kernel can be written as an inner product:

    k(w, x) = ⟨φ(w), φ(x)⟩.

Then the kernel machine is a linear classifier with feature map φ(x).

Mercer’s Theorem characterizes kernels with these properties.


Features vs Kernels

Proof.
For prototype points x1, …, xr,

    f(x) = Σ_{i=1}^r α_i k(x, x_i)
         = Σ_{i=1}^r α_i ⟨φ(x), φ(x_i)⟩
         = ⟨ Σ_{i=1}^r α_i φ(x_i), φ(x) ⟩
         = wᵀφ(x),

where w = Σ_{i=1}^r α_i φ(x_i).

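A numerical sanity check of this identity (my own sketch, using the quadratic kernel and its explicit feature map from earlier): the kernel machine Σ_i α_i k(x, x_i) agrees with wᵀφ(x) for w = Σ_i α_i φ(x_i). The prototypes and weights are random.

    import numpy as np

    def phi(x):
        # Explicit feature map for k(w, x) = <w, x> + <w, x>^2 on R^2.
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, np.sqrt(2) * x1 * x2])

    def k(w, x):
        s = float(np.dot(w, x))
        return s + s**2

    rng = np.random.default_rng(0)
    protos = rng.normal(size=(4, 2))                        # prototype points x_1, ..., x_r
    alpha = rng.normal(size=4)
    x = rng.normal(size=2)

    f_kernel = sum(a * k(x, xi) for a, xi in zip(alpha, protos))
    w = sum(a * phi(xi) for a, xi in zip(alpha, protos))    # w = sum_i alpha_i phi(x_i)
    print(np.isclose(f_kernel, float(np.dot(w, phi(x)))))   # True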