
Hands-On Machine Learning
Chapter 5: Support Vector Machines

San Diego Machine Learning – Book Club

https://www.linkedin.com/in/bhanuyerra/
Overview
• Linear SVM Classification
• Non-Linear SVM Classification
• SVM Regression
• Under the Hood

Linear SVM Classification

Linear SVM Classification

[Figure: some linear classifier vs. SVM classifier on the same data]
Linear SVM Classification
• Large margin classification
• Sensitive to scale: use Scikit-Learn’s StandardScaler
• Linear separability
• Sensitive to outliers: use a soft margin SVM classifier
Soft Margin Classification

• C is the inverse of the regularization hyperparameter alpha (from Chapter 4): a smaller C means stronger regularization, i.e. a wider street that tolerates more margin violations
Soft Margin Classification
• LinearSVC:
o LinearSVC(loss=“hinge”, C=1)
o Accepts two loss functions: “hinge” and “squared_hinge”
o Default loss is “squared_hinge”
o Doesn’t expose support vectors (use .coef_ and .intercept_ to compute the decision function and identify the support vectors in the training data)
o Regularizes the bias term too, so center the data first, e.g. with StandardScaler
o Set dual=False when the number of training instances is greater than the number of features
• SVC:
o SVC(kernel=“linear”, C=1)
o For a linear classifier use kernel=“linear”
o For a hard margin classifier use a very large C (the slide suggests C=float(“inf”))
• SGDClassifier:
o SGDClassifier(loss=“hinge”, alpha=1/(m*C))
o Slower to converge, but good for online learning or huge datasets
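A minimal usage sketch of the three options above, assuming scikit-learn; the iris petal features are used only as an illustrative dataset:

# Minimal sketch of the three linear SVM options above (illustrative dataset).
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # is it Iris virginica?
m, C = len(X), 1

# LinearSVC: scale first, because the bias term is regularized too
linear_svc = make_pipeline(StandardScaler(), LinearSVC(loss="hinge", C=C))

# SVC with a linear kernel (exposes .support_vectors_ after fitting)
svc = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))

# SGDClassifier with hinge loss; alpha plays the role of 1/(m*C)
sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", alpha=1/(m*C)))

for clf in (linear_svc, svc, sgd):
    clf.fit(X, y)
    print(clf[-1].__class__.__name__, clf.predict([[5.5, 1.7]]))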
Non-Linear SVM Classification
Nonlinear SVM Classification

[Figure: a dataset with a single feature X₁ is not linearly separable; adding the feature X₁² makes it linearly separable in 2D]
Nonlinear SVM: Polynomial Kernels

Polynomial features of degree 3 map the 2D input (X₁, X₂) to 10D:
1, X₁, X₂, X₁X₂, X₁², X₂², X₁²X₂, X₁X₂², X₁³, X₂³
Nonlinear SVM: Polynomial Kernels
Two ways to implement:
• Polynomial features: use PolynomialFeatures, StandardScaler, and LinearSVC; this explicitly adds the polynomial features
• Polynomial kernel: use SVC with kernel=“poly”; same results without adding the features

Kernel trick: adding the effect of high-dimensional features without explicitly adding them.
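A sketch of both approaches, assuming scikit-learn; make_moons is used here only as an illustrative toy dataset:

# Two ways to get a degree-3 polynomial decision boundary (illustrative data).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# 1) Explicitly add the polynomial features, then fit a linear SVM
polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, loss="hinge", max_iter=10_000),
)

# 2) Kernel trick: same effect without materializing the extra features
poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),
)

for clf in (polynomial_svm_clf, poly_kernel_svm_clf):
    clf.fit(X, y)
    print(clf[-1].__class__.__name__, clf.score(X, y))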
Nonlinear SVM: Polynomial Kernels

• d is the degree of the polynomial kernel
• r is the polynomial kernel hyperparameter coef0
• C is the soft margin hyperparameter

Polynomial kernel: K(a, b) = (aᵀb + r)^d

Polynomial features of degree 3 with coef0 map the 2D input (X₁, X₂) to 10D:
r^(3/2), √3·r·X₁, √3·r·X₂, √(6r)·X₁X₂, √(3r)·X₁², √(3r)·X₂², √3·X₁²X₂, √3·X₁X₂², X₁³, X₂³
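A small numeric check of this expansion (my own sketch, using the feature ordering reconstructed above): for 2-D inputs, evaluating (aᵀb + r)³ in the input space matches the dot product of the explicit 10-D feature maps.

# Verify the degree-3 polynomial kernel against the explicit 10-D feature map.
import numpy as np

def phi(x, r):
    x1, x2 = x
    return np.array([
        r**1.5,
        np.sqrt(3) * r * x1, np.sqrt(3) * r * x2,
        np.sqrt(6 * r) * x1 * x2,
        np.sqrt(3 * r) * x1**2, np.sqrt(3 * r) * x2**2,
        np.sqrt(3) * x1**2 * x2, np.sqrt(3) * x1 * x2**2,
        x1**3, x2**3,
    ])

a, b, r = np.array([0.5, -1.2]), np.array([2.0, 0.3]), 1.0
kernel_value = (a @ b + r) ** 3           # computed in the 2-D input space
explicit_value = phi(a, r) @ phi(b, r)    # computed in the 10-D feature space
print(np.isclose(kernel_value, explicit_value))  # True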
Nonlinear SVM: Similarity Features

Gaussian Radial Basis Function (RBF) as a similarity feature: φγ(x, ℓ) = exp(−γ‖x − ℓ‖²), which measures how much an instance x resembles a landmark ℓ.

Nonlinear SVM: Gaussian RBF Kernel

Gaussian RBF kernel: K(a, b) = exp(−γ‖a − b‖²); the kernel trick gives the effect of adding one similarity feature per training instance without actually adding them.
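A usage sketch assuming scikit-learn; gamma and C here are illustrative values (a larger gamma makes the bell-shaped similarity curves narrower, giving a more irregular boundary):

# Gaussian RBF kernel SVM on a toy dataset (illustrative hyperparameters).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

rbf_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma=5, C=0.001),
)
rbf_kernel_svm_clf.fit(X, y)
print(rbf_kernel_svm_clf.score(X, y))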


Nonlinear SVM: Computational Complexity
SVM Regression
SVM Regression

SVM Classification: fitting the widest possible “road” between the classes, with few on-street violations.
SVM Regression: fitting the narrowest possible “road” that captures the instances, with few off-street violations.
SVM Regression
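A sketch of the corresponding estimators, assuming scikit-learn and synthetic data (epsilon sets the width of the street; instances inside it do not affect the fit):

# Linear and kernelized SVM regression on synthetic data (illustrative).
import numpy as np
from sklearn.svm import SVR, LinearSVR

rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = (4 + 3 * X + rng.randn(100, 1)).ravel()

lin_svr = LinearSVR(epsilon=1.5).fit(X, y)                             # linear SVM regression
poly_svr = SVR(kernel="poly", degree=2, C=100, epsilon=0.1).fit(X, y)  # kernelized

print(lin_svr.predict([[1.0]]), poly_svr.predict([[1.0]]))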
Under the Hood
Under the Hood: Decision Function and Prediction

Hyperplane (decision function): h(x) = wᵀx + b

Prediction: predict the positive class if h(x) ≥ 0, the negative class otherwise

The decision boundary is scale invariant, so the margins can be set at h(x) = ±1
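A sketch of how this decision function can be recovered from a fitted linear SVM in scikit-learn (the iris features here are illustrative):

# h(x) = w.T @ x + b recovered from .coef_ and .intercept_; predict by its sign.
import numpy as np
from sklearn import datasets
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)

clf = LinearSVC(C=1, loss="hinge", max_iter=10_000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x_new = np.array([5.5, 1.7])
h = w @ x_new + b                               # decision function value
print(h, int(h >= 0))                           # positive class iff h(x) >= 0
print(clf.decision_function([x_new]), clf.predict([x_new]))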


Under the Hood: Training Objective
Geometry of the hyperplane: the smaller the slope ‖w‖, the wider the margin between the h(x) = +1 and h(x) = −1 lines.

Hard margin objective: minimize the squared slope for the widest “street”, subject to positive samples lying on the positive side of the hyperplane and negative samples on the negative side, with all training samples kept off the street.
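For reference, a standard way to write this hard margin objective (with t⁽ⁱ⁾ = +1 for positive and −1 for negative instances):

\[
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
\quad \text{subject to} \quad
t^{(i)}\left(\mathbf{w}^{\mathsf T}\mathbf{x}^{(i)} + b\right) \ge 1,
\qquad i = 1, \dots, m
\]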
Under the Hood: Training Objective
Soft margin training objective: minimize the squared slope for the widest “street” while reducing margin violations.

minimize over w, b, ζ:  ½ wᵀw + C Σᵢ ζ⁽ⁱ⁾
subject to:  t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) ≥ 1 − ζ⁽ⁱ⁾  and  ζ⁽ⁱ⁾ ≥ 0  for i = 1, …, m

Positive samples should be on the positive side of the hyperplane; if not, the slack variable ζ⁽ⁱ⁾ measures the margin violation:
ζ⁽ⁱ⁾ = 1 − t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) if there is a violation, i.e. if t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) < 1, and ζ⁽ⁱ⁾ = 0 otherwise.
Under the Hood: Quadratic Programming
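The constrained objectives above can be handed to an off-the-shelf QP solver; a standard general form (a reference sketch, not taken from the slide) is:

\[
\min_{\mathbf{p}}\ \tfrac{1}{2}\,\mathbf{p}^{\mathsf T}\mathbf{H}\,\mathbf{p} + \mathbf{f}^{\mathsf T}\mathbf{p}
\quad \text{subject to} \quad
\mathbf{A}\,\mathbf{p} \le \mathbf{b}
\]

For the hard margin linear SVM, p collects the n + 1 parameters (b and w), and one linear constraint per training instance enforces t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) ≥ 1.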
Under the Hood: The Dual Problem

The dual problem has two components:

A. Kernelization
• Deparametrize the problem
• Express the hypothesis function, loss function, gradients, and the training and testing steps without ever computing parameters in the feature space
• Places weights on the training samples

B. Kernel Trick
• Project the input space into a very high-dimensional feature space, which may even be infinite-dimensional
• Problem:
o Projecting the training data into a high-dimensional space is expensive
o Large number of parameters
• Trick:
o Compute the dot product between training samples in the projected high-dimensional space without ever projecting
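For reference, a common way to write the dual of the hard margin linear SVM (one multiplier α⁽ⁱ⁾ per training instance; only dot products x⁽ⁱ⁾ᵀx⁽ʲ⁾ appear, which is what makes kernelization possible):

\[
\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
\alpha^{(i)}\alpha^{(j)}\, t^{(i)} t^{(j)}\,
\mathbf{x}^{(i)\mathsf T}\mathbf{x}^{(j)}
\;-\; \sum_{i=1}^{m}\alpha^{(i)}
\quad \text{subject to} \quad \alpha^{(i)} \ge 0,\ \ i = 1, \dots, m
\]

(when the bias b is a free variable, the constraint Σᵢ α⁽ⁱ⁾t⁽ⁱ⁾ = 0 is added); w can then be recovered as Σᵢ α⁽ⁱ⁾ t⁽ⁱ⁾ x⁽ⁱ⁾.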
Under the Hood: Kernelization

Parametric Linear Regression (m samples, n features, weights w₁ … wₙ):
ŷ = wᵀx
L(w) = (1/m) Σᵢ (wᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
dL/dw = (2/m) Σᵢ (wᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) x⁽ⁱ⁾

Kernelized Linear Regression (one weight α⁽ⁱ⁾ per training sample):
Express the linear relationship, loss function, and gradients, and train the model in α space instead of the w parametric space.
Under the Hood: Kernelization

Kernelized Linear Regression, expressed using α: prediction-time calculations.

The parameters are a linear combination of the training samples:
w = Σᵢ α⁽ⁱ⁾ x⁽ⁱ⁾

Prediction for a new instance x̄ (m dot products):
ŷ = wᵀx̄ = Σᵢ α⁽ⁱ⁾ x⁽ⁱ⁾ᵀx̄ = Σᵢ α⁽ⁱ⁾ K(x⁽ⁱ⁾, x̄)

The dot product x⁽ⁱ⁾ᵀx̄ is the kernel K(x⁽ⁱ⁾, x̄); this is the hypothesis function written in terms of kernels.
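A small sketch of this prediction-time view (alpha here is just an illustrative weight vector, not the result of an actual training run):

# Prediction as a weighted sum of kernel evaluations against training samples.
import numpy as np

def linear_kernel(a, b):
    return a @ b

def predict(X_train, alpha, x_new, kernel=linear_kernel):
    # y_hat = sum_i alpha_i * K(x_i, x_new): m kernel evaluations, no explicit w
    return sum(a_i * kernel(x_i, x_new) for a_i, x_i in zip(alpha, X_train))

rng = np.random.RandomState(0)
X_train = rng.randn(5, 3)      # m = 5 training samples, n = 3 features
alpha = rng.randn(5)           # illustrative weights on the training samples
x_new = rng.randn(3)

# Equivalent parametric view: w = sum_i alpha_i * x_i, then y_hat = w @ x_new
w = (alpha[:, None] * X_train).sum(axis=0)
print(np.isclose(predict(X_train, alpha, x_new), w @ x_new))   # True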
Under the Hood: Kernelization

Kernelized Linear Regression, expressed using α: training-time calculations.

ŷ = Σⱼ α⁽ʲ⁾ x⁽ʲ⁾ᵀx

L(α) = (1/m) Σᵢ ( Σⱼ α⁽ʲ⁾ x⁽ʲ⁾ᵀx⁽ⁱ⁾ − y⁽ⁱ⁾ )²

The dot products between training samples form an m × m kernel matrix with entries Kᵢⱼ = x⁽ⁱ⁾ᵀx⁽ʲ⁾: (m × n) times (n × m) gives (m × m).
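A sketch of the kernel (Gram) matrix computation described above, using plain NumPy:

# The m x m kernel matrix of pairwise dot products between training samples.
import numpy as np

rng = np.random.RandomState(0)
m, n = 6, 3
X = rng.randn(m, n)            # m training samples, n features

K = X @ X.T                    # (m x n) @ (n x m) -> (m x m), K[i, j] = x_i . x_j
print(K.shape)                 # (6, 6)
print(np.isclose(K[2, 4], X[2] @ X[4]))   # True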
Under the Hood: Kernel SVM

Kernelized SVMs:
• α⁽ⁱ⁾ ≠ 0 only for the support vectors
• At prediction time, dot products are computed only between the new x and the support vectors
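A sketch assuming scikit-learn: after fitting a kernel SVC, only the support vectors carry nonzero dual weights, and prediction evaluates kernels against them alone.

# Inspect the support vectors and their dual weights of a fitted kernel SVC.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
svc = SVC(kernel="rbf", gamma=5, C=1).fit(X, y)

print(svc.support_vectors_.shape)   # (n_support_vectors, n_features)
print(svc.dual_coef_.shape)         # signed alphas, one per support vector
print(svc.predict(X[:3]))           # uses kernels against support vectors only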


Under the Hood: Kernel Trick

Input space: dimension n. High-dimensional feature space: dimension N ≫ n.
x = (x₁, x₂, …, xₙ) is mapped to φ(x) = (φ₁(x), φ₂(x), …, φ_N(x)).
Projecting explicitly is an expensive operation and requires large memory.

Kernel in the input space: K(a, b) = aᵀb = a₁b₁ + a₂b₂ + … + aₙbₙ
Kernel in the feature space: K(φ(a), φ(b)) = φ(a)ᵀφ(b)

Kernel trick: φ(a)ᵀφ(b) = function(aᵀb), i.e. the feature-space dot product can be computed directly from the input-space dot product, without ever projecting.

The Gaussian RBF kernel is a universal approximator; its corresponding feature space φ(x) is an infinite-dimensional space.

Mercer’s Theorem: if a kernel function K(a, b) satisfies a few mathematical conditions, then there exists a φ that maps a and b into another feature space such that K(a, b) = φ(a)ᵀφ(b).
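A sketch of the trick in action for the Gaussian RBF kernel, assuming scikit-learn: the kernel value is computed entirely from input-space vectors, even though the implicit feature space φ(x) is infinite-dimensional.

# RBF kernel evaluated in the input space, with no explicit projection.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.5, -1.2, 3.0]])
b = np.array([[2.0, 0.3, -0.7]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((a - b) ** 2))   # K(a, b) = exp(-gamma * ||a - b||^2)
print(np.isclose(manual, rbf_kernel(a, b, gamma=gamma)[0, 0]))   # True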
Under the Hood: Online SVMs & Hinge Loss

• Use SGDClassifier for online SVMs
• Specify loss=“hinge”

Unconstrained soft margin SVM objective (what gradient descent minimizes):
J(w, b) = ½ wᵀw + C Σᵢ max(0, 1 − t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b))

Hinge loss: max(0, 1 − t·h(x)) is zero when an instance is on the correct side of the margin, i.e. t·(wᵀx + b) ≥ 1, and grows linearly with the violation otherwise. Interpretation of t: t = +1 for the positive class and t = −1 for the negative class.
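A sketch of online training with SGDClassifier, which optimizes this hinge-loss objective one mini-batch at a time via partial_fit (the streamed data here is synthetic and purely illustrative):

# Online (incremental) linear SVM training with hinge loss.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
sgd = SGDClassifier(loss="hinge", alpha=0.01)
classes = np.array([0, 1])

for _ in range(20):                                  # stream of mini-batches
    X_batch = rng.randn(32, 2)
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    sgd.partial_fit(X_batch, y_batch, classes=classes)

print(sgd.predict([[1.0, 1.0], [-1.0, -1.0]]))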
If you are starting to get a headache, it’s perfectly
normal: it’s an unfortunate side effect of the kernel
trick.
- Aurelien Geron
