
Introduction to Machine Learning & Deep Learning (Fall 2023)

Lecture 6: Supervised Learning – Classification - SVM


Prof. Damian Borth



Last Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbor
– Logistic Regression
– Support Vector Machine
• Classifier Fusion



This Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbor
– Logistic Regression
– Support Vector Machine
• Classifier Fusion



Supervised Learning



Types of Machine Learning

Machine Learning

Supervised Learning
• Objective: learn the relationship between data and a desired output
• Data x contains labels c whose relationship is to be learned → labels are known
• "Learning known patterns"
• e.g. decision trees, neural nets, support vector machines
• Tasks: Classification, Regression

Semi-supervised Learning
• Objective: learn structure, using only a few labels to label unknown data
• Data: few labels are known → many labels are unknown

Self-supervised Learning
• Objective: learn a representation of the data via controlled pseudo-labels for a downstream task
• Data: labels are unknown → representation & linear classifier

Unsupervised Learning
• Objective: identification of unknown distributions, patterns and dependencies
• Data x contains dependencies or patterns to be observed → labels are unknown
• "Learning unknown patterns"
• e.g. clustering algorithms, principal component analysis, self-organizing maps
• Tasks: Clustering, Dimensionality Reduction




Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Parametric vs Non-parametric

• So far, we assumed P(x|c) to be Gaussian.


• What about these distributions?

• Often, we don't know the parametric form of P(x|c)


• Possible approaches:
• mixtures of Gaussians
• non-parametric methods (no parametric form assumed, no explicit training phase)
k-Nearest Neighbor - Idea

Intuitive Understanding

Idea

"Assign each unknown example x to the majority class y of its k closest neighbors, where k is a parameter."

(Figure: an unknown example x is classified against classes y=0 and y=1 using k=1, k=3 and k=5 neighborhoods.)


k-Nearest Neighbor - Approach

Given

• A set of labeled training samples {xi, yi}


• xi - feature representation of examples
• yi - class labels (e.g. document type, rating on YouTube etc.)
• An unknown sample x whose target we aim to predict

Classification Algorithm

• Compute the distance D(x, xi) of x to every training sample xi


• Select the k closest instances xi1 … xik and their class labels yi1 … yik
• Classify x according to the majority class of its k neighbors
• Calculating the majority class (see the code sketch below):

$$P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$
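A minimal Python sketch of this classification rule (illustrative only, not the lecture's reference code; the toy data, the Euclidean distance and k = 3 are arbitrary choices):

```python
# Minimal k-NN sketch: majority vote over the k nearest training samples
# under Euclidean distance.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of a single sample x by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # D(x, x_i) for all training samples
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest])             # count class labels among the neighbors
    return votes.most_common(1)[0][0]             # majority class

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```

In practice, tie-breaking for even k and efficient neighbor search (e.g. k-d trees) also matter; the sketch ignores both.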


k-Nearest Neighbor – Distance Measures

"Euclidean" Distance ("L2-norm")
• Used in the context of continuous variables
• Not very robust; single solution

"Manhattan" Distance ("L1-norm")
• Used in the context of binary or encoded variables
• Robust; possibly multiple solutions

"Hamming" Distance
• Used in the context of categorical variables
• E.g. distance between names, document types

(A sketch of all three distance measures follows below.)
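Sketches of the three distance measures using their standard definitions (the example inputs are made up):

```python
# The three distance measures from the slide.
import numpy as np

def euclidean(a, b):            # "L2-norm", continuous variables
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):            # "L1-norm", binary / encoded variables
    return np.sum(np.abs(a - b))

def hamming(a, b):              # categorical variables: count of differing positions
    return sum(x != y for x, y in zip(a, b))

print(euclidean(np.array([0, 0]), np.array([3, 4])))   # 5.0
print(manhattan(np.array([0, 0]), np.array([3, 4])))   # 7
print(hamming("salmon", "seabas"))                      # number of positions that differ
```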


k-Nearest Neighbor – Different “k” Example

(Figure: k-NN decision boundaries on the same dataset for k = 1, 3, 10, 50 and 200; larger k gives smoother boundaries.)


k-Nearest Neighbor

Summary and Discussion


Pros
• "Non-parametric" approach: "no" assumptions about the data distribution
• Simple to implement
• Flexible with respect to feature / distance choices

Cons
• Computationally expensive (time: computes all distances; space: stores all examples)
• Sensitive to outliers / irrelevant features

Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity

Majority-class rule (as before): $P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k} \delta(y_{i_j}, y)$, with $\delta = 1$ if $y_{i_j} = y$ and $0$ otherwise.


k-Nearest Neighbor

• When is nearest neighbor (NN) successful?


• we need many samples in small regions!

• Is nearest neighbor better than Gaussians?


• not necessarily – if the underlying class-conditional densities
are truly Gaussian and we can determine parameters reliably,
Gaussians are the optimal model!

• Are there really no parameters?


• there is k as a hyper-parameter to choose
• low k = high variance
• high k = oversmoothing
• a good compromise in practice: k ≈ √n


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Discriminative Models

• We saw a generative model: "Gaussians"
  – we know P(x|c) and P(c), hence (via Bayes' rule) also P(c|x)
  – we can "generate" samples from the joint distribution:
    • draw c' from P(c)
    • draw x' from P(x|c')

• Alternative:
  – omit P(x|c) and P(c), and directly estimate P(c|x)!

→ discriminative models: P(c|x) = f_Θ(x)


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression hypothesis: $P(c \mid x) = f_\Theta(x_1) = m x_1 + b$, with $\Theta = (m, b)$

(Figure: "Sea Bass?" (1 = yes, 0 = no) plotted against lightness x1, with the fitted line and the 0.5 threshold.)

Classification Hypothesis
Threshold the classifier output f_Θ(x1) at 0.5:
• If f_Θ(x1) ≥ 0.5, predict c = 1 "Sea Bass"
• If f_Θ(x1) < 0.5, predict c = 0 "Salmon"

Challenge
"How to handle anomalies or different modalities in the data?"


Logistic Regression - Introduction

Intuitive Understanding
Challenge: Outliers — is $P(c \mid x) = f_\Theta(x_1) = m x_1 + b$ still a good hypothesis?

(Figure: a single outlier at large lightness drags the fitted regression line, shifting the 0.5 threshold and degrading the classification.)

Classification Hypothesis
Threshold the classifier output f_Θ(x1) at 0.5:
• If f_Θ(x1) ≥ 0.5, predict c = 1 "Sea Bass"
• If f_Θ(x1) < 0.5, predict c = 0 "Salmon"

Idea
Improve "Linear Regression" by:
(1) a non-linear hypothesis f_Θ
(2) learnable parameters Θ
→ "Logistic Regression"


Logistic Regression - Idea (one dimension)

• Remember: in the Gaussian case, P(c|x) was a sigmoid function

$$f_\Theta(x) = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$

• where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid, applied to the linear function $wx + b$




Logistic Regression - Idea (more dimensions)

• In more dimensions, we have a weight vector w: $f_\Theta(x) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$

• The decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ becomes a (linear) hyperplane

• We can omit b using augmented vectors: $\mathbf{x} \to [\mathbf{x}, 1]$, $\mathbf{w} \to [\mathbf{w}, b]$, so that $f_\Theta(x) = \sigma(\mathbf{w}^\top \mathbf{x})$


Logistic Regression - Approach

Given
• A set of labeled training samples {xi, ci}
• xi - feature representation of examples
• ci - class labels (e.g. document type, rating on YouTube etc.)
• For each weight configuration w we can compute the classification loss 𝓛 ("error")

Training Algorithm (see Bishop p. 205f.) — "Gradient Descent Learning" (see the code sketch below)

• Initialize the weight configuration w_0, set k = 0
• Until the loss 𝓛 converges:
  – update the weight configuration with a gradient-descent step: $\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \, \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}_k)$
  – increase k ← k + 1
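A compact sketch of this training loop (a generic batch-gradient-descent illustration, not the Bishop reference implementation; the learning rate eta, the tolerance and the toy lightness data are arbitrary choices):

```python
# Logistic regression trained with batch gradient descent on the cross-entropy loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, c, eta=0.1, tol=1e-6, max_iter=10_000):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors: absorb b into w
    w = np.zeros(Xa.shape[1])                       # initialize w_0
    prev_loss = np.inf
    for _ in range(max_iter):
        p = sigmoid(Xa @ w)                         # P(c = 1 | x) = f_w(x)
        loss = -np.mean(c * np.log(p + 1e-12) + (1 - c) * np.log(1 - p + 1e-12))
        grad = Xa.T @ (p - c) / len(c)              # gradient of the cross-entropy loss
        w -= eta * grad                             # gradient-descent update
        if abs(prev_loss - loss) < tol:             # convergence of the loss
            break
        prev_loss = loss
    return w

# toy usage: 1-D "lightness" feature, two classes
X = np.array([[0.1], [0.3], [0.4], [0.7], [0.8], [0.9]])
c = np.array([0, 0, 0, 1, 1, 1])
w = train_logreg(X, c)
print(sigmoid(np.hstack([[0.75], [1.0]]) @ w) >= 0.5)   # predict "Sea Bass"?
```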


Logistic Regression

Summary and Discussion


Pros
• "Discriminative" approach: learns only what is needed, P(c|x)
• Results are easy to interpret
• Can be trained fast

Cons
• Non-deterministic results
• May end up in a local minimum
• Learns only linear decision boundaries
• Vulnerable to overfitting

Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Support Vector Machines (SVMs)

• Support Vector Machines were leading the state of the art in many machine learning tasks (including image recognition)
• A classifier benchmarking experiment:
  – more than 100 datasets from the public UCI machine learning repository
  – 7 classifiers, with parameters (for example, k in k-NN) optimized by a cross-validated grid search
  – (Figure: a count of the datasets on which each classifier works best.)


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition
• visual words + SVMs = "standard pipeline"

(Figure, built up over several slides: an image is converted by Visual Word Feature Extraction into a histogram of visual-word counts, e.g. [2, 0, 2, 0], which is then fed into SVM Classification.)




Approach

• maximum-margin classification
• non-linearity by kernel functions

(Figure: a dataset that is not linearly separable in (x, y) coordinates becomes separable after transformation to polar coordinates: angle vs. distance from origin.)


SVM: Notation

• Given:
  – training samples x_1, …, x_n ∈ R^d with labels y_1, …, y_n ∈ {−1, +1}

• Geometric approach:
  – find a hyperplane w that separates the classes
  – f(x) = ⟨w, x⟩ + b
  – use "augmented vectors": f(x) = ⟨w, x⟩ with x → [x, 1]
  – classification: class present ↔ f(x) > 0

(A small sketch of the augmented-vector trick follows below.)
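A tiny numeric illustration of the augmented-vector trick, with made-up values for w, b and x:

```python
# f(x) = <w, x> + b becomes f(x') = <w', x'> with x' = [x, 1] and w' = [w, b].
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])

f_with_bias = w @ x + b                          # <w, x> + b
w_aug, x_aug = np.append(w, b), np.append(x, 1)  # augmented vectors
f_augmented = w_aug @ x_aug                      # <w', x'>

print(f_with_bias, f_augmented)                  # identical values
print("class present" if f_augmented > 0 else "class absent")
```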


SVM: The maximum-margin Principle

• Which hyperplane is the best?


– multiple separating hyperplanes are possible

• Guiding Principles / Approaches


– generative models
(e.g., Gaussians with identical covariances)
– logistic regression
(likelihood maximization)
– perceptron
(error minimization)
– maximum-margin principle
SVM: Margin Maximization

• To find the hyperplane w that maximizes the margin, let us first require that for all samples x_i the following holds:

$$y_i \,\langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1$$

(using augmented vectors, so the bias b is absorbed into w)


SVM: Margin Maximization

We have two kinds of samples:

• "safe" samples x_i, which are "far away" from the decision boundary: $y_i \langle \mathbf{w}, \mathbf{x}_i \rangle > 1$

• "support vectors", samples x_i lying exactly on the margin: $y_i \langle \mathbf{w}, \mathbf{x}_i \rangle = 1$

Relationship between the margin γ and w:
• the size of the margin is $\gamma = 1 / \lVert \mathbf{w} \rVert_2$
• maximizing the margin is therefore equivalent to minimizing $\lVert \mathbf{w} \rVert_2$ (or, equivalently, $\lVert \mathbf{w} \rVert_2^2$)


SVM: Margin Maximization

• Altogether, a decision boundary w* that maximizes the margin can be computed by solving the following optimization problem:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 \quad \text{subject to} \quad y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1 \;\; \forall i$$

• This is a "simple" optimization problem
  – the objective function is quadratic, i.e., differentiable and convex → quadratic programming
  – the constraints are all linear
  – a globally optimal solution can be computed in O(n³)
  – in practice, the computational effort of an SVM is ≈ O(c·n^1.8)
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!


Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:

Slack Variables: allow for errors during training in favor of a maximum-margin hyperplane w.

Kernel Functions: map the samples to a (suitable) higher-dimensional vector space and solve the problem there in a linear way.




SVM: Slack Variables

• What is the better hyperplane for this dataset?
→ allow some training error: introduce slack variables

(Figure: left — a hyperplane with no training errors but a small margin, i.e. likely test errors; right — a hyperplane with one training error but a larger margin, i.e. likely fewer test errors.)


SVM: Max Margin & Slack Variables

• Solution: introduce slack variables ξ_1, …, ξ_n and solve the soft-margin problem

$$\mathbf{w}^* = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

• We can satisfy all constraints by making the ξ_i large enough
• The hyper-parameter C realizes the balancing (see the sketch below):
  → C = ∞, i.e. a "hard" margin: all ξ_i are 0, no training error allowed
  → the smaller C, the larger the margin (at the cost of incorrectly classified training samples)
• The target function is still convex ("simple" optimization)
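An illustrative soft-margin experiment using scikit-learn's SVC (assuming scikit-learn is available; the synthetic blobs and the three C values are arbitrary), showing how smaller C trades training errors for a larger margin and more support vectors:

```python
# Effect of the soft-margin hyper-parameter C on a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=0.7, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.7, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):                    # small C -> larger margin, more slack
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors={len(clf.support_vectors_)}, "
          f"train accuracy={clf.score(X, y):.2f}")
```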




SVM: Non-(Linear)-Separability

• Slack variables are not enough!
  – What is the best decision boundary on this dataset?
    (Figure: a dataset where no straight line separates the two classes.)

• We need non-linear decision boundaries

• Solutions:
  – higher-order decision functions
  – classifier stacking
  – neural networks (will be covered later)
  – data transformation (kernel functions)


SVM: Data Transformation φ

• In the example, we can find a transformation φ for the samples x_i such that they become linearly separable

→ transform each x_i to polar coordinates

(Figure: the data in (x, y) coordinates, and the same data after the transformation φ in (angle, distance from origin) coordinates, where it is linearly separable.)


SVM: Data Transformation φ

• Linear classification with data transformation
→ define a feature transformation φ: R^d → R^m
→ perform the classification on φ(x_i) instead of x_i

$$\mathbf{w}^* = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \langle \mathbf{w}, \varphi(\mathbf{x}_i) \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$


SVM: Kernel Trick & Representer Theorem

• Finding "good" data transformations for the classification problem can be difficult

• Instead, we omit the transformation φ(x) and use a similarity function k(x_i, x_j) that compares two samples x_i, x_j → this approach is called the kernel trick

• The similarity functions k(x_i, x_j) are called kernel functions

• The Representer Theorem is the basis of the kernel trick

• It tells us that the maximum-margin solution lies in the subspace spanned by the training samples, i.e. we can rewrite the maximum-margin solution w as:

$$\mathbf{w} = \sum_{i=1}^{n} a_i \, y_i \, \varphi(\mathbf{x}_i)$$
SVM: Kernel Trick & Representer Theorem

Using the Representer Theorem, we can rewrite the maximum-margin problem: substituting $\mathbf{w} = \sum_j a_j y_j \varphi(\mathbf{x}_j)$ into the soft-margin objective and constraints, the data enter only through inner products of transformed samples. This yields the "SVM equation", in which the transformation φ appears only via the kernel function

$$\langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j).$$


Kernel Trick & Representer Theorem - Consequence

• Kernel Trick
– We can omit the computation of f, and simply
compute the kernel function k(.,.)

• Kernel Function k(.,.)


– The kernel function k(xi,xj) defines a
similarity measure between xi and xj
– there are several kernel functions to choose from

• We do not even have to know φ
  – this is actually pretty awesome!


SVM: Training

• Given
  – a training set with samples x_1, …, x_n and their labels y_1, …, y_n

• Algorithm
  1. choose a kernel function k(.,.)
  2. estimate a_1, …, a_n by optimizing the SVM equation (a_i ≠ 0 → x_i is a "support vector")
  3. these a_1, …, a_n values define a maximum-margin decision boundary in the high-dimensional space defined by the kernel function




SVM: Classification

• Given
  – a test sample x
• Unknown
  – its label y

• Classification (see the sketch below)
  1. compute k(x, x_i) for all training samples x_i
  2. compute the classification score $f(x) = \sum_{i=1}^{n} a_i \, y_i \, k(x, x_i)$
  3. the class decision is: sign(f(x))
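A from-scratch sketch of this scoring rule with a Gaussian (RBF) kernel; the support vectors, labels and dual coefficients a_i below are made up for illustration — in practice they come from the training step:

```python
# Evaluate f(x) = sum_i a_i * y_i * k(x, x_i) with an RBF kernel and take sign(f(x)).
import numpy as np

def rbf_kernel(x, xi, beta=1.0):
    return np.exp(-beta * np.sum((x - xi) ** 2))

def svm_score(x, support_vectors, y_sv, a_sv, beta=1.0):
    return sum(a * yi * rbf_kernel(x, xi, beta)
               for a, yi, xi in zip(a_sv, y_sv, support_vectors))

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_sv = np.array([-1, +1, +1])
a_sv = np.array([0.8, 0.5, 0.3])        # illustrative dual coefficients

x_test = np.array([1.2, 0.9])
score = svm_score(x_test, support_vectors, y_sv, a_sv)
print(score, "->", "+1" if np.sign(score) > 0 else "-1")
```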


SVM: Kernel Best Practice

• How do we choose kernels k(.,.) in practice?

• They can be constructed from distance functions
  – if d(.,.) is a distance function, then e^{−d(.,.)}, i.e. exp{−d(.,.)}, can be used as a kernel function

• Some practical kernel functions (sketched in code below):
  – Linear: $k(x, x') = \langle x, x' \rangle$
  – Polynomial: $k(x, x') = (\langle x, x' \rangle + c)^p$
  – Gaussian (RBF): $k(x, x') = \exp\{-\beta \lVert x - x' \rVert^2\}$
  – Histogram intersection: $k(x, x') = \sum_i \min(x_i, x'_i)$
  – Chi-square: $k(x, x') = \exp\{-\beta \sum_i (x_i - x'_i)^2 / (x_i + x'_i)\}$
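NumPy sketches of these kernels, using the standard textbook forms listed above (the parameters beta, c, p and the example histograms are illustrative choices, not values from the lecture):

```python
# Simple kernel functions on feature vectors / histograms.
import numpy as np

def linear(x, z):                 return x @ z
def polynomial(x, z, c=1.0, p=3): return (x @ z + c) ** p
def gaussian_rbf(x, z, beta=0.5): return np.exp(-beta * np.sum((x - z) ** 2))
def hist_intersection(x, z):      return np.sum(np.minimum(x, z))
def chi_square(x, z, beta=1.0, eps=1e-12):
    return np.exp(-beta * np.sum((x - z) ** 2 / (x + z + eps)))

x, z = np.array([2.0, 0.0, 2.0, 0.0]), np.array([1.0, 1.0, 2.0, 0.0])
for k in (linear, polynomial, gaussian_rbf, hist_intersection, chi_square):
    print(k.__name__, k(x, z))
```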
SVM: Kernel Best Practice

• Kernels should show a class-wise block structure

• Example: the effect of β in the Gaussian kernel (picture: Christoph Lampert):
  – β very large: k(x_i, x_j) ≈ 0 for all i ≠ j — the kernel matrix is nearly diagonal, too "local"
  – β very small: k(x_i, x_j) ≈ 1 for all pairs — the kernel matrix is nearly uniform, too "global"
SVM: Hyper-Parameter Optimization

• Parameters to optimize in SVMs:
  – the misclassification cost C of training samples
  – the kernel parameter β

• Frequently used approach: Grid Search (see the sketch below)
  – test different values of C and β on a regular grid (alternatively a log grid)
  – for each pair, measure the classification accuracy on a held-out validation set

(Figure: validation accuracy over the (C, β) grid, with a region of good parameter choices.)
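A grid-search sketch with scikit-learn (assumed available); GridSearchCV uses cross-validation rather than a single held-out set, a common variant of the same idea. The parameter grids and the synthetic dataset are arbitrary, and scikit-learn calls the RBF parameter gamma (playing the role of β here):

```python
# Cross-validated grid search over C and the RBF kernel parameter.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],            # log-spaced grid
              "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```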
SVMs – Summary

• Support Vector Machines were state-of-the-art classifiers, particularly successful in image recognition

• Advantages
  – the maximum-margin problem can be solved globally optimally!
  – the number of parameters is "independent" of the feature dimensionality; this makes SVMs very suitable classifiers for small, high-dimensional training sets!
  – flexibility: we can incorporate application-specific kernels
  – very good empirical results

• Disadvantages
  – often: ad-hoc choice of kernel functions
  – scalability problems with large training sets
  – limited learning capacity with a large number of positive samples


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Early vs. Late Fusion

• Multiple classifiers can combine different pieces of evidence:
  • multiple features
  • multiple modalities
  • multiple classifiers
  • multiple training sets


Early vs. Late Fusion

• Different combination strategies


• early fusion = concatenate features
• late fusion = combine classification results

Early fusion: the features x_1, x_2, …, x_M are concatenated into [x_1, x_2, …, x_M] and fed into a single classifier, which produces the decision.

Late fusion: each feature x_m is fed into its own classifier, producing P(c|x_1), …, P(c|x_M); these outputs are then fused into the final decision. (A sketch of both strategies follows below.)
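A minimal fusion sketch (an assumed setup: two synthetic feature views, a logistic-regression and a k-NN classifier from scikit-learn); late fusion averages the class posteriors, and an early-fusion baseline concatenates the views for comparison:

```python
# Late fusion (average class posteriors) vs. early fusion (concatenate features).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X1 = rng.normal(loc=y[:, None], scale=1.0, size=(100, 3))   # feature view 1
X2 = rng.normal(loc=y[:, None], scale=1.5, size=(100, 5))   # feature view 2

clf1 = LogisticRegression(max_iter=1000).fit(X1, y)
clf2 = KNeighborsClassifier(n_neighbors=5).fit(X2, y)

# late fusion: combine P(c | x1) and P(c | x2), here by simple averaging
fused = 0.5 * clf1.predict_proba(X1) + 0.5 * clf2.predict_proba(X2)
decision = fused.argmax(axis=1)
print("late-fusion training accuracy:", (decision == y).mean())

# early fusion: concatenate the feature views and train one classifier
early = LogisticRegression(max_iter=1000).fit(np.hstack([X1, X2]), y)
print("early-fusion training accuracy:", early.score(np.hstack([X1, X2]), y))
```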


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Discussion

• This lecture – four sample classifiers


• Naive Bayes (with Gaussian class-conditional densities)
• K-nearest neighbor
• Logistic regression
• Support Vector Machine (SVM)

• The Big Answer to “Which one is the best?”


• the right classifier depends on the distribution of the target data...
• … on the preprocessing ...
• … on the features...
• … on the amount of training data

→ no-free-lunch theorem
Questions?
