Hands-On Machine Learning: Chapter 5: Support Vector Machines
https://www.linkedin.com/in/bhanuyerra/
Overview
Linear SVM Classification
Linear SVM Classification
SVM Classifier
Linear SVM Classification
Large Margin Classification
Sensitive to scale
Use Scikit-Learn's StandardScaler
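A minimal sketch of the scaling point above (the toy data and parameter values are assumptions, not from the slides): wrapping the classifier in a Pipeline with StandardScaler keeps the decision boundary from being dominated by the feature with the largest scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy data (assumed): the second feature has a much larger scale than the first.
X = np.array([[1.0, 5000.0], [2.0, 300.0], [3.0, 7000.0], [4.0, 200.0]])
y = np.array([0, 0, 1, 1])

# Standardize the features before fitting the linear SVM classifier.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1, dual=True, random_state=42))
clf.fit(X, y)
print(clf.predict([[2.5, 4000.0]]))
```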
Linear Separability
Sensitive to outliers
Use a Soft Margin SVM Classifier
Soft Margin Classification
• C is the inverse of the regularization hyperparameter alpha (from Chapter 4): a smaller C means stronger regularization (wider margin, more margin violations allowed)
Soft Margin Classification
• LinearSVC:
o LinearSVC(loss="hinge", C=1)
o Accepts two loss functions: "hinge" and "squared_hinge"
o Default loss is "squared_hinge"
o Doesn't expose the support vectors (use .intercept_ and .coef_ to find the support vectors in the training data)
o Regularizes the bias term too, so center the data first, e.g. with StandardScaler
o Set dual=False when the number of training instances exceeds the number of features
• SVC:
o SVC(kernel="linear", C=1)
o For a linear classifier, use kernel="linear"
o For a hard margin classifier, use C=float("inf")
• SGDClassifier:
o SGDClassifier(loss="hinge", alpha=1/(m*C))
o Slower to converge, but useful for online learning or huge (out-of-core) datasets
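The three estimators above train roughly equivalent linear SVM classifiers. A sketch, using the iris petal features as example data and C=1 (both are assumptions chosen for illustration):

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X = iris.data[:, (2, 3)]                   # petal length, petal width
y = (iris.target == 2).astype(int)         # Iris virginica vs the rest
m, C = len(X), 1.0

linear_svc = make_pipeline(StandardScaler(),
                           LinearSVC(loss="hinge", C=C, dual=True, random_state=42))
svc = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", C=C))
sgd = make_pipeline(StandardScaler(),
                    SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42))

for clf in (linear_svc, svc, sgd):
    clf.fit(X, y)
    print(clf.predict([[5.5, 1.7]]))       # the three classifiers should agree here
```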
Nonlinear SVM Classification
Nonlinear SVM Classification
Figure: a 1-D dataset with the single feature $X_1$ is not linearly separable; adding the feature $X_1^2$ makes it separable in the $(X_1, X_1^2)$ plane.
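A minimal numeric sketch of the same idea (toy 1-D data assumed): the single feature X1 cannot separate the classes, but adding X1² as a second feature makes them linearly separable.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1-D dataset (assumed): the positive class sits in the middle of the line.
X1 = np.arange(-4.0, 5.0).reshape(-1, 1)
y = (np.abs(X1.ravel()) <= 2).astype(int)

# Add X1^2 as a second feature; the classes become linearly separable.
X2 = np.c_[X1, X1 ** 2]
clf = LinearSVC(C=10, dual=True, random_state=42).fit(X2, y)
print(clf.score(X2, y))  # expect 1.0 on this toy set
```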
Nonlinear SVM: Polynomial Kernels
A linear model on the raw features $(X_1, X_2)$ underfits (high bias). Adding degree-3 polynomial features gives the expanded feature set
$1,\; X_1,\; X_2,\; X_1X_2,\; X_1^2,\; X_2^2,\; X_1^2X_2,\; X_1X_2^2,\; X_1^3,\; X_2^3$
Use PolynomialFeatures, StandardScaler, and LinearSVC
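A sketch of that pipeline on the moons dataset (the dataset and the hyperparameter values are assumptions for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),   # add polynomial features up to degree 3
    StandardScaler(),               # SVMs are scale-sensitive
    LinearSVC(C=10, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
polynomial_svm_clf.fit(X, y)
print(polynomial_svm_clf.score(X, y))
```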
The degree-3 polynomial kernel produces these monomials implicitly, without ever computing them, as in the expansion
$(X_1 + X_2 + r)^3 = r^3 + 3r^2X_1 + 3r^2X_2 + 6rX_1X_2 + 3rX_1^2 + 3rX_2^2 + 3X_1^2X_2 + 3X_1X_2^2 + X_1^3 + X_2^3$
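In practice you rarely build these features explicitly: SVC with kernel="poly" applies the degree-3 polynomial kernel directly (the kernel trick), where coef0 plays the role of r. A sketch, again on assumed moons data:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),  # degree-3 kernel, coef0 ~ r
)
poly_kernel_svm_clf.fit(X, y)
print(poly_kernel_svm_clf.score(X, y))
```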
Hyperplane: $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$
Prediction
Predict the positive (+ve) class if $\mathbf{w}^T\mathbf{x} + b \ge 0$, otherwise the negative (-ve) class.
For a positive instance $\mathbf{x}^{(i)}$, the loss is $0$ if there is no margin violation, and $1 - (\mathbf{w}^T\mathbf{x}^{(i)} + b)$ if there is a violation, i.e., if $\mathbf{w}^T\mathbf{x}^{(i)} + b < 1$.
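A sketch of the prediction rule (iris petal data assumed): after fitting, w and b are available through .coef_ and .intercept_, and the predicted class is simply the sign of h(x) = wᵀx + b.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data[:, (2, 3)]                       # petal length, petal width (assumed features)
y = (iris.target == 2).astype(int)             # Iris virginica vs the rest

svm_clf = LinearSVC(C=1, dual=True, max_iter=10_000, random_state=42).fit(X, y)
w, b = svm_clf.coef_[0], svm_clf.intercept_[0]

x_new = np.array([5.5, 1.7])
h = w @ x_new + b                              # h(x) = w^T x + b
print(h >= 0)                                  # positive class if h(x) >= 0
print(svm_clf.decision_function([x_new]) >= 0) # same decision from scikit-learn
```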
Under the Hood: Quadratic Programming
Under the Hood: The Dual Problem
Dual problem has two components: (1) express the linear relationship, the loss function, and the gradients in terms of $\alpha$; (2) train the model in $\alpha$ space instead of the $\mathbf{w}$ parameter space.

$y = \mathbf{w}^T\mathbf{x}$

$L(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}\left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2$

$\frac{dL(\mathbf{w})}{d\mathbf{w}} = \frac{2}{m}\sum_{i=1}^{m}\left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)\mathbf{x}^{(i)}$

Each training instance $(y^{(i)}, x^{(i)}_1, \dots, x^{(i)}_n)$ is paired with its own dual coefficient $\alpha_i$.
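A small numpy sketch of the primal quantities on this slide (toy regression data assumed), before re-expressing them in α space: the loss L(w) and its gradient drive plain batch gradient descent on w.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))                # m training samples, n features (assumed toy data)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(n)
for _ in range(500):                       # plain batch gradient descent
    residual = X @ w - y                   # w^T x^(i) - y^(i) for every i
    grad = (2 / m) * X.T @ residual        # (2/m) * sum_i (w^T x^(i) - y^(i)) x^(i)
    w -= 0.1 * grad

print(np.round(w, 3))                      # should be close to w_true
```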
Under the Hood: Kernelization
Kernelized Linear Regression: express the model using $\alpha$ (prediction-time calculations).

The parameters are a linear combination of the training samples:
$\mathbf{w} = \sum_{i=1}^{m} \alpha_i \,\mathbf{x}^{(i)}$

Prediction for a new instance $\bar{\mathbf{x}}$ therefore takes $m$ dot products:
$\hat{y} = \mathbf{w}^T\bar{\mathbf{x}} = \sum_{i=1}^{m} \alpha_i \,\mathbf{x}^{(i)T}\bar{\mathbf{x}} = \sum_{i=1}^{m} \alpha_i \,K(\mathbf{x}^{(i)}, \bar{\mathbf{x}})$

where $K(\mathbf{x}^{(i)}, \bar{\mathbf{x}}) = \mathbf{x}^{(i)T}\bar{\mathbf{x}}$ is the kernel; this is the hypothesis function expressed in kernels.
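A numpy sketch of the prediction-time view (toy data and already-known α assumed): predicting with w and predicting with m kernel evaluations give the same answer.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 4
X = rng.normal(size=(m, n))                # training samples (assumed toy data)
alpha = rng.normal(size=m)                 # dual coefficients (assumed already trained)

w = X.T @ alpha                            # w = sum_i alpha_i * x^(i)
x_bar = rng.normal(size=n)                 # new instance

y_primal = w @ x_bar                       # prediction using w
y_dual = alpha @ (X @ x_bar)               # prediction using m dot products K(x^(i), x_bar)
print(np.isclose(y_primal, y_dual))        # True: both views agree
```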
Under the Hood: Kernelization
Kernelized Linear Regression: express the model using $\alpha$ (training-time calculations).

$\hat{y} = \sum_{i=1}^{m} \alpha_i \,\mathbf{x}^{(i)T}\mathbf{x}$

$L(\boldsymbol{\alpha}) = \frac{1}{m}\sum_{j=1}^{m}\left(\sum_{i=1}^{m} \alpha_i \,\mathbf{x}^{(i)T}\mathbf{x}^{(j)} - y^{(j)}\right)^2$

Training requires the dot product between every pair of training samples: an $m \times m$ kernel matrix with entries $K_{ij} = \mathbf{x}^{(i)T}\mathbf{x}^{(j)}$.
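A numpy sketch of the training-time view (toy data assumed): the dual formulation only needs the m × m matrix of pairwise dot products, which for the linear kernel is simply X Xᵀ.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

rng = np.random.default_rng(2)
m, n = 6, 4
X = rng.normal(size=(m, n))                # training samples (assumed toy data)

K = X @ X.T                                # linear-kernel matrix, shape (m, m)
print(K.shape)                             # (6, 6)
print(np.allclose(K, linear_kernel(X)))    # True: same matrix via scikit-learn
```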
• Use SGDClassifier for Online SVMs
• Specify loss="hinge" so that SGD minimizes the soft margin SVM objective function
Hinge loss for a positive (+ve) class instance: $0$ if there is no margin violation, and $1 - (\mathbf{w}^T\mathbf{x}^{(i)} + b)$ if there is a violation, i.e., if $\mathbf{w}^T\mathbf{x}^{(i)} + b < 1$.
Interpretation of $t$: $t^{(i)} = +1$ for positive instances and $t^{(i)} = -1$ for negative instances, so the hinge loss can be written as $\max\left(0,\ 1 - t^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + b)\right)$.
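A sketch of the online setup (iris data and C=1 assumed): SGDClassifier with loss="hinge" and alpha = 1/(m*C) can be fed mini-batches with partial_fit, which is what makes it suitable for streaming or out-of-core training.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, (2, 3)]
y = (iris.target == 2).astype(int)
X = StandardScaler().fit_transform(X)          # SVMs remain scale-sensitive

m, C = len(X), 1.0
sgd_clf = SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)

# Shuffle once, then feed the data in mini-batches as if it were a stream.
rng = np.random.default_rng(42)
idx = rng.permutation(m)
X, y = X[idx], y[idx]
for batch in np.array_split(np.arange(m), 10):
    sgd_clf.partial_fit(X[batch], y[batch], classes=[0, 1])

print(sgd_clf.score(X, y))
```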
"If you are starting to get a headache, it's perfectly normal: it's an unfortunate side effect of the kernel trick."
- Aurélien Géron