
Introduction to Machine Learning & Deep Learning (Fall 2023)

Lecture 6: Supervised Learning – Classification - SVM


Prof. Damian Borth



Last Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbor
– Logistic Regression
– Support Vector Machine
• Classifier Fusion



This Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbor
– Logistic Regression
– Support Vector Machine
• Classifier Fusion



Supervised Learning



Types of Machine Learning

Machine Learning

Supervised Learning
• Objective: learn the relationship between data and a desired output
• Data x contains labels c whose relationship is to be learned → labels are known
• "Learning known patterns"
• e.g. decision trees, neural nets, support vector machines
• Tasks: Classification, Regression

Semi-supervised Learning
• Objective: learn structure, using only a few labels to label unknown data
• Data: few labels are known → many labels are unknown

Self-supervised Learning
• Objective: learn a representation of the data via controlled pseudo-labels for a downstream task
• Data: labels are unknown → representation & linear classifier

Unsupervised Learning
• Objective: identification of unknown distributions, patterns and dependencies
• Data x contains dependencies or patterns to be observed → labels are unknown
• "Learning unknown patterns"
• e.g. clustering algorithms, principal component analysis, self-organizing maps
• Tasks: Clustering, Dimensionality Reduction




Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Parametric vs Non-parametric

• So far, we assumed P(x|c) to be Gaussian.


• What about these distributions?

• Often, we don't know the parametric form of P(x|c)


• Possible approaches:
• mixtures of Gaussians
• non-parametric methods (no parametric form assumed, no explicit training phase)
k-Nearest Neighbor - Idea

Intuitive Understanding

Idea

"Assign each unknown example x to the majority class y of its k closest neighbors, where k is a parameter."

(Figure: an unknown example x is classified against classes y=0 and y=1 using k=1, k=3 and k=5 neighborhoods.)


k-Nearest Neighbor - Approach

Given

• A set of labeled training samples {xi, yi}


• xi - feature representation of examples
• yi - class labels (e.g. document type, rating on YouTube etc.)
• An unknown sample x whose target we aim to predict

Classification Algorithm

• Compute the distance D(x, xi) of x to every training sample xi


• Select the k closest instances xi1 … xik and their class labels yi1 … yik
• Classify x according to the majority class of its k neighbors
• Calculating the majority class (see the code sketch below):

$$P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$
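A minimal Python sketch of this classification rule (illustrative only, not the lecture's reference code; the toy data, the Euclidean distance and k = 3 are arbitrary choices):

```python
# Minimal k-NN sketch: majority vote over the k nearest training samples
# under Euclidean distance.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of a single sample x by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # D(x, x_i) for all training samples
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest])             # count class labels among the neighbors
    return votes.most_common(1)[0][0]             # majority class

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```

In practice, tie-breaking for even k and efficient neighbor search (e.g. k-d trees) also matter; the sketch ignores both.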


k-Nearest Neighbor – Distance Measures

"Euclidean" Distance ("L2-norm")
• Used in the context of continuous variables
• Not very robust; single solution

"Manhattan" Distance ("L1-norm")
• Used in the context of binary or encoded variables
• Robust; possibly multiple solutions

"Hamming" Distance
• Used in the context of categorical variables
• E.g. distance between names, document types

(A sketch of all three distance measures follows below.)
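Sketches of the three distance measures using their standard definitions (the example inputs are made up):

```python
# The three distance measures from the slide.
import numpy as np

def euclidean(a, b):            # "L2-norm", continuous variables
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):            # "L1-norm", binary / encoded variables
    return np.sum(np.abs(a - b))

def hamming(a, b):              # categorical variables: count of differing positions
    return sum(x != y for x, y in zip(a, b))

print(euclidean(np.array([0, 0]), np.array([3, 4])))   # 5.0
print(manhattan(np.array([0, 0]), np.array([3, 4])))   # 7
print(hamming("salmon", "seabas"))                      # number of positions that differ
```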


k-Nearest Neighbor – Different “k” Example

(Figure: k-NN decision boundaries on the same dataset for k = 1, 3, 10, 50 and 200; larger k gives smoother boundaries.)


k-Nearest Neighbor

Summary and Discussion


Pros
• "Non-parametric" approach: "no" assumptions about the data distribution
• Simple to implement
• Flexible with respect to feature / distance choices

Cons
• Computationally expensive (time: computes all distances; space: stores all examples)
• Sensitive to outliers / irrelevant features

Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity

Majority-class rule (as before): $P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k} \delta(y_{i_j}, y)$, with $\delta = 1$ if $y_{i_j} = y$ and $0$ otherwise.


k-Nearest Neighbor

• When is nearest neighbor (NN) successful?


• we need many samples in small regions!

• Is nearest neighbor better than Gaussians?


• not necessarily – if the underlying class-conditional densities
are truly Gaussian and we can determine parameters reliably,
Gaussians are the optimal model!

• Are there really no parameters?


• there is k as a hyper-parameter to choose
• low k = high variance
• high k = oversmoothing
• a good compromise in practice: k ≈ √n


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Discriminative Models

• We saw a generative model: "Gaussians"
  – we know P(x|c) and P(c), hence (via Bayes' rule) also P(c|x)
  – we can "generate" samples from the joint distribution:
    • draw c' from P(c)
    • draw x' from P(x|c')

• Alternative:
  – omit P(x|c) and P(c), and directly estimate P(c|x)!

→ discriminative models: P(c|x) = f_Θ(x)


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression hypothesis: $P(c \mid x) = f_\Theta(x_1) = m x_1 + b$, with $\Theta = (m, b)$

(Figure: "Sea Bass?" (1 = yes, 0 = no) plotted against lightness x1, with the fitted line and the 0.5 threshold.)

Classification Hypothesis
Threshold the classifier output f_Θ(x1) at 0.5:
• If f_Θ(x1) ≥ 0.5, predict c = 1 "Sea Bass"
• If f_Θ(x1) < 0.5, predict c = 0 "Salmon"

Challenge
"How to handle anomalies or different modalities in the data?"


Logistic Regression - Introduction

Intuitive Understanding
Challenge: Outliers — is $P(c \mid x) = f_\Theta(x_1) = m x_1 + b$ still a good hypothesis?

(Figure: a single outlier at large lightness drags the fitted regression line, shifting the 0.5 threshold and degrading the classification.)

Classification Hypothesis
Threshold the classifier output f_Θ(x1) at 0.5:
• If f_Θ(x1) ≥ 0.5, predict c = 1 "Sea Bass"
• If f_Θ(x1) < 0.5, predict c = 0 "Salmon"

Idea
Improve "Linear Regression" by:
(1) a non-linear hypothesis f_Θ
(2) learnable parameters Θ
→ "Logistic Regression"


Logistic Regression - Idea (one dimension)

• Remember: in the Gaussian case, P(c|x) was a sigmoid function

$$f_\Theta(x) = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$

• where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid, applied to the linear function $wx + b$




Logistic Regression - Idea (more dimensions)

• In more dimensions, we have a weight vector w: $f_\Theta(x) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$

• The decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ becomes a (linear) hyperplane

• We can omit b using augmented vectors: $\mathbf{x} \to [\mathbf{x}, 1]$, $\mathbf{w} \to [\mathbf{w}, b]$, so that $f_\Theta(x) = \sigma(\mathbf{w}^\top \mathbf{x})$


Logistic Regression - Approach

Given
• A set of labeled training samples {xi, ci}
• xi - feature representation of examples
• ci - class labels (e.g. document type, rating on YouTube etc.)
• For each weight configuration w we can compute the classification loss 𝓛 ("error")

Training Algorithm (see Bishop p. 205f.) — "Gradient Descent Learning" (see the code sketch below)

• Initialize the weight configuration w_0, set k = 0
• Until the loss 𝓛 converges:
  – update the weight configuration with a gradient-descent step: $\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \, \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}_k)$
  – increase k ← k + 1
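A compact sketch of this training loop (a generic batch-gradient-descent illustration, not the Bishop reference implementation; the learning rate eta, the tolerance and the toy lightness data are arbitrary choices):

```python
# Logistic regression trained with batch gradient descent on the cross-entropy loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, c, eta=0.1, tol=1e-6, max_iter=10_000):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors: absorb b into w
    w = np.zeros(Xa.shape[1])                       # initialize w_0
    prev_loss = np.inf
    for _ in range(max_iter):
        p = sigmoid(Xa @ w)                         # P(c = 1 | x) = f_w(x)
        loss = -np.mean(c * np.log(p + 1e-12) + (1 - c) * np.log(1 - p + 1e-12))
        grad = Xa.T @ (p - c) / len(c)              # gradient of the cross-entropy loss
        w -= eta * grad                             # gradient-descent update
        if abs(prev_loss - loss) < tol:             # convergence of the loss
            break
        prev_loss = loss
    return w

# toy usage: 1-D "lightness" feature, two classes
X = np.array([[0.1], [0.3], [0.4], [0.7], [0.8], [0.9]])
c = np.array([0, 0, 0, 1, 1, 1])
w = train_logreg(X, c)
print(sigmoid(np.hstack([[0.75], [1.0]]) @ w) >= 0.5)   # predict "Sea Bass"?
```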


Logistic Regression

Summary and Discussion


Pros
• "Discriminative" approach: learns only what is needed, P(c|x)
• Results are easy to interpret
• Can be trained fast

Cons
• Non-deterministic results
• May end up in a local minimum
• Learns only linear decision boundaries
• Vulnerable to overfitting

Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Support Vector Machines (SVMs)

• Support Vector Machines were leading the state of the art in many machine learning tasks (including image recognition)
• A classifier benchmarking experiment:
  – more than 100 datasets from the public UCI machine learning repository
  – 7 classifiers, with parameters (for example, k in k-NN) optimized by a cross-validated grid search
  – (Figure: a count of the datasets on which each classifier works best.)


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition
• visual words + SVMs = "standard pipeline"

(Figure, built up over several slides: an image is converted by Visual Word Feature Extraction into a histogram of visual-word counts, e.g. [2, 0, 2, 0], which is then fed into SVM Classification.)




Approach

• maximum-margin classification
• non-linearity by kernel functions

(Figure: a dataset that is not linearly separable in (x, y) coordinates becomes separable after transformation to polar coordinates: angle vs. distance from origin.)


SVM: Notation

• Given:
  – training samples x_1, …, x_n ∈ R^d with labels y_1, …, y_n ∈ {−1, +1}

• Geometric approach:
  – find a hyperplane w that separates the classes
  – f(x) = ⟨w, x⟩ + b
  – use "augmented vectors": f(x) = ⟨w, x⟩ with x → [x, 1]
  – classification: class present ↔ f(x) > 0

(A small sketch of the augmented-vector trick follows below.)
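A tiny numeric illustration of the augmented-vector trick, with made-up values for w, b and x:

```python
# f(x) = <w, x> + b becomes f(x') = <w', x'> with x' = [x, 1] and w' = [w, b].
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])

f_with_bias = w @ x + b                          # <w, x> + b
w_aug, x_aug = np.append(w, b), np.append(x, 1)  # augmented vectors
f_augmented = w_aug @ x_aug                      # <w', x'>

print(f_with_bias, f_augmented)                  # identical values
print("class present" if f_augmented > 0 else "class absent")
```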


SVM: The maximum-margin Principle

• Which hyperplane is the best?


– multiple separating hyperplanes are possible

• Guiding Principles / Approaches


– generative models
(e.g., Gaussians with identical covariances)
– logistic regression
(likelihood maximization)
– perceptron
(error minimization)
– maximum-margin principle
SVM: Margin Maximization

• To find the hyperplane w that maximizes the margin, let us first require that for all samples x_i the following holds:

$$y_i \,\langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1$$

(using augmented vectors, so the bias b is absorbed into w)


SVM: Margin Maximization

We have two kinds of samples:

• "safe" samples x_i, which are "far away" from the decision boundary: $y_i \langle \mathbf{w}, \mathbf{x}_i \rangle > 1$

• "support vectors", samples x_i lying exactly on the margin: $y_i \langle \mathbf{w}, \mathbf{x}_i \rangle = 1$

Relationship between the margin γ and w:
• the size of the margin is $\gamma = 1 / \lVert \mathbf{w} \rVert_2$
• maximizing the margin is therefore equivalent to minimizing $\lVert \mathbf{w} \rVert_2$ (or, equivalently, $\lVert \mathbf{w} \rVert_2^2$)


SVM: Margin Maximization

• Altogether, a decision boundary w* that maximizes the margin can be computed by solving the following optimization problem:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 \quad \text{subject to} \quad y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1 \;\; \forall i$$

• This is a "simple" optimization problem
  – the objective function is quadratic, i.e., differentiable and convex → quadratic programming
  – the constraints are all linear
  – a globally optimal solution can be computed in O(n³)
  – in practice, the computational effort of an SVM is ≈ O(c·n^1.8)
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!


Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:

Slack Variables: allow for errors during training in favor of a maximum-margin hyperplane w.

Kernel Functions: map the samples to a (suitable) higher-dimensional vector space and solve the problem there in a linear way.




SVM: Slack Variables

• What is the better hyperplane for this dataset?
→ allow some training error: introduce slack variables

(Figure: left — a hyperplane with no training errors but a small margin, i.e. likely test errors; right — a hyperplane with one training error but a larger margin, i.e. likely fewer test errors.)


SVM: Max Margin & Slack Variables

• Solution: introduce slack variables ξ_1, …, ξ_n and solve the soft-margin problem

$$\mathbf{w}^* = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \langle \mathbf{w}, \mathbf{x}_i \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

• We can satisfy all constraints by making the ξ_i large enough
• The hyper-parameter C realizes the balancing (see the sketch below):
  → C = ∞, i.e. a "hard" margin: all ξ_i are 0, no training error allowed
  → the smaller C, the larger the margin (at the cost of incorrectly classified training samples)
• The target function is still convex ("simple" optimization)
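An illustrative soft-margin experiment using scikit-learn's SVC (assuming scikit-learn is available; the synthetic blobs and the three C values are arbitrary), showing how smaller C trades training errors for a larger margin and more support vectors:

```python
# Effect of the soft-margin hyper-parameter C on a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=0.7, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.7, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):                    # small C -> larger margin, more slack
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors={len(clf.support_vectors_)}, "
          f"train accuracy={clf.score(X, y):.2f}")
```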




SVM: Non-(Linear)-Separability

• Slack variables are not enough!
  – What is the best decision boundary on this dataset?
    (Figure: a dataset where no straight line separates the two classes.)

• We need non-linear decision boundaries

• Solutions:
  – higher-order decision functions
  – classifier stacking
  – neural networks (will be covered later)
  – data transformation (kernel functions)


SVM: Data Transformation φ

• In the example, we can find a transformation φ for the samples x_i such that they become linearly separable

→ transform each x_i to polar coordinates

(Figure: the data in (x, y) coordinates, and the same data after the transformation φ in (angle, distance from origin) coordinates, where it is linearly separable.)


SVM: Data Transformation φ

• Linear classification with data transformation
→ define a feature transformation φ: R^d → R^m
→ perform the classification on φ(x_i) instead of x_i

$$\mathbf{w}^* = \arg\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \langle \mathbf{w}, \varphi(\mathbf{x}_i) \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$


SVM: Kernel Trick & Representer Theorem

• Finding "good" data transformations for the classification problem can be difficult

• Instead, we omit the transformation φ(x) and use a similarity function k(x_i, x_j) that compares two samples x_i, x_j → this approach is called the kernel trick

• The similarity functions k(x_i, x_j) are called kernel functions

• The Representer Theorem is the basis of the kernel trick

• It tells us that the maximum-margin solution lies in the subspace spanned by the training samples, i.e. we can rewrite the maximum-margin solution w as:

$$\mathbf{w} = \sum_{i=1}^{n} a_i \, y_i \, \varphi(\mathbf{x}_i)$$
SVM: Kernel Trick & Representer Theorem

Using the Representer Theorem, we can rewrite the maximum-margin problem: substituting $\mathbf{w} = \sum_j a_j y_j \varphi(\mathbf{x}_j)$ into the soft-margin objective and constraints, the data enter only through inner products of transformed samples. This yields the "SVM equation", in which the transformation φ appears only via the kernel function

$$\langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j).$$


Kernel Trick & Representer Theorem - Consequence

• Kernel Trick
– We can omit the computation of f, and simply
compute the kernel function k(.,.)

• Kernel Function k(.,.)


– The kernel function k(xi,xj) defines a
similarity measure between xi and xj
– there are several kernel functions to choose from

• We do not even have to know φ
  – this is actually pretty awesome!


SVM: Training

• Given
  – a training set with samples x_1, …, x_n and their labels y_1, …, y_n

• Algorithm
  1. choose a kernel function k(.,.)
  2. estimate a_1, …, a_n by optimizing the SVM equation (a_i ≠ 0 → x_i is a "support vector")
  3. these a_1, …, a_n values define a maximum-margin decision boundary in the high-dimensional space defined by the kernel function




SVM: Classification

• Given
  – a test sample x
• Unknown
  – its label y

• Classification (see the sketch below)
  1. compute k(x, x_i) for all training samples x_i
  2. compute the classification score $f(x) = \sum_{i=1}^{n} a_i \, y_i \, k(x, x_i)$
  3. the class decision is: sign(f(x))
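A from-scratch sketch of this scoring rule with a Gaussian (RBF) kernel; the support vectors, labels and dual coefficients a_i below are made up for illustration — in practice they come from the training step:

```python
# Evaluate f(x) = sum_i a_i * y_i * k(x, x_i) with an RBF kernel and take sign(f(x)).
import numpy as np

def rbf_kernel(x, xi, beta=1.0):
    return np.exp(-beta * np.sum((x - xi) ** 2))

def svm_score(x, support_vectors, y_sv, a_sv, beta=1.0):
    return sum(a * yi * rbf_kernel(x, xi, beta)
               for a, yi, xi in zip(a_sv, y_sv, support_vectors))

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
y_sv = np.array([-1, +1, +1])
a_sv = np.array([0.8, 0.5, 0.3])        # illustrative dual coefficients

x_test = np.array([1.2, 0.9])
score = svm_score(x_test, support_vectors, y_sv, a_sv)
print(score, "->", "+1" if np.sign(score) > 0 else "-1")
```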


SVM: Kernel Best Practice

• How do we choose kernels k(.,.) in practice?

• They can be constructed from distance functions
  – if d(.,.) is a distance function, then e^{−d(.,.)}, i.e. exp{−d(.,.)}, can be used as a kernel function

• Some practical kernel functions (sketched in code below):
  – Linear: $k(x, x') = \langle x, x' \rangle$
  – Polynomial: $k(x, x') = (\langle x, x' \rangle + c)^p$
  – Gaussian (RBF): $k(x, x') = \exp\{-\beta \lVert x - x' \rVert^2\}$
  – Histogram intersection: $k(x, x') = \sum_i \min(x_i, x'_i)$
  – Chi-square: $k(x, x') = \exp\{-\beta \sum_i (x_i - x'_i)^2 / (x_i + x'_i)\}$
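NumPy sketches of these kernels, using the standard textbook forms listed above (the parameters beta, c, p and the example histograms are illustrative choices, not values from the lecture):

```python
# Simple kernel functions on feature vectors / histograms.
import numpy as np

def linear(x, z):                 return x @ z
def polynomial(x, z, c=1.0, p=3): return (x @ z + c) ** p
def gaussian_rbf(x, z, beta=0.5): return np.exp(-beta * np.sum((x - z) ** 2))
def hist_intersection(x, z):      return np.sum(np.minimum(x, z))
def chi_square(x, z, beta=1.0, eps=1e-12):
    return np.exp(-beta * np.sum((x - z) ** 2 / (x + z + eps)))

x, z = np.array([2.0, 0.0, 2.0, 0.0]), np.array([1.0, 1.0, 2.0, 0.0])
for k in (linear, polynomial, gaussian_rbf, hist_intersection, chi_square):
    print(k.__name__, k(x, z))
```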
SVM: Kernel Best Practice

• Kernels should show a class-wise block structure

• Example: the effect of β in the Gaussian kernel (picture: Christoph Lampert):
  – β very large: k(x_i, x_j) ≈ 0 for all i ≠ j — the kernel matrix is nearly diagonal, too "local"
  – β very small: k(x_i, x_j) ≈ 1 for all pairs — the kernel matrix is nearly uniform, too "global"
SVM: Hyper-Parameter Optimization

• Parameters to optimize in SVMs:
  – the misclassification cost C of training samples
  – the kernel parameter β

• Frequently used approach: Grid Search (see the sketch below)
  – test different values of C and β on a regular grid (alternatively a log grid)
  – for each pair, measure the classification accuracy on a held-out validation set

(Figure: validation accuracy over the (C, β) grid, with a region of good parameter choices.)
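A grid-search sketch with scikit-learn (assumed available); GridSearchCV uses cross-validation rather than a single held-out set, a common variant of the same idea. The parameter grids and the synthetic dataset are arbitrary, and scikit-learn calls the RBF parameter gamma (playing the role of β here):

```python
# Cross-validated grid search over C and the RBF kernel parameter.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],            # log-spaced grid
              "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```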
SVMs – Summary

• Support Vector Machines were state-of-the-art classifiers, particularly successful in image recognition

• Advantages
  – the maximum-margin problem can be solved globally optimally!
  – the number of parameters is "independent" of the feature dimensionality; this makes SVMs very suitable classifiers for small, high-dimensional training sets!
  – flexibility: we can incorporate application-specific kernels
  – very good empirical results

• Disadvantages
  – often: ad-hoc choice of kernel functions
  – scalability problems with large training sets
  – limited learning capacity with a large number of positive samples


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Early vs. Late Fusion

• Multiple classifiers can combine different pieces of evidence:
  • multiple features
  • multiple modalities
  • multiple classifiers
  • multiple training sets


Early vs. Late Fusion

• Different combination strategies


• early fusion = concatenate features
• late fusion = combine classification results

Early fusion: the features x_1, x_2, …, x_M are concatenated into [x_1, x_2, …, x_M] and fed into a single classifier, which produces the decision.

Late fusion: each feature x_m is fed into its own classifier, producing P(c|x_1), …, P(c|x_M); these outputs are then fused into the final decision. (A sketch of both strategies follows below.)
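A minimal fusion sketch (an assumed setup: two synthetic feature views, a logistic-regression and a k-NN classifier from scikit-learn); late fusion averages the class posteriors, and an early-fusion baseline concatenates the views for comparison:

```python
# Late fusion (average class posteriors) vs. early fusion (concatenate features).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X1 = rng.normal(loc=y[:, None], scale=1.0, size=(100, 3))   # feature view 1
X2 = rng.normal(loc=y[:, None], scale=1.5, size=(100, 5))   # feature view 2

clf1 = LogisticRegression(max_iter=1000).fit(X1, y)
clf2 = KNeighborsClassifier(n_neighbors=5).fit(X2, y)

# late fusion: combine P(c | x1) and P(c | x2), here by simple averaging
fused = 0.5 * clf1.predict_proba(X1) + 0.5 * clf2.predict_proba(X2)
decision = fused.argmax(axis=1)
print("late-fusion training accuracy:", (decision == y).mean())

# early fusion: concatenate the feature views and train one classifier
early = LogisticRegression(max_iter=1000).fit(np.hstack([X1, X2]), y)
print("early-fusion training accuracy:", early.score(np.hstack([X1, X2]), y))
```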


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine ("fuse") distinct classifiers?

5. Summary and conclusion
Discussion

• This lecture – four sample classifiers


• Naive Bayes (with Gaussian class-conditional densities)
• K-nearest neighbor
• Logistic regression
• Support Vector Machine (SVM)

• The Big Answer to “Which one is the best?”


• the right classifier depends on the distribution of the target data...
• … on the preprocessing ...
• … on the features...
• … on the amount of training data

→ no-free-lunch theorem
Questions?
