L6 Lecture: Image Classification - Fundamentals (v4)
Fundamentals
Easy Computer Vision
Xiaoyong Wei (魏驍勇)
[email protected]
New Toy
Outline
• Classification
• Supervised learning
• K nearest neighbors (k-NN)
• Bayesian classifiers
• Support vector machines (SVM)
• Rock-Paper-Scissors
How do you group them?
Feature Space
[Figure: the examples plotted in a 2-D feature space with Hair Color on the x-axis and Skin Color on the y-axis]
Clustering is unsupervised learning, which means we (humans) don't have to tell the computer what each group looks like. It's data-driven, without using human knowledge (supervision).
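For instance, here is a minimal clustering sketch in Python (an assumed example, not from the lecture; it uses scikit-learn's KMeans and the feature values from the training-example table later in this lecture):

# Clustering (unsupervised): KMeans groups the examples without any human-given labels.
import numpy as np
from sklearn.cluster import KMeans

# Feature vectors [hair color, skin color] taken from the training-example table below
X = np.array([[2.2, 0.8], [3.2, 1.9], [3.1, 2.2], [2.4, 1.3], [3.1, 2.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # a cluster index for each example, discovered from the data alone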
Sounds good?
But …
Feature Space
[Figure: the same feature space with new, unlabeled examples shown as question marks]
What if we encounter new, unseen examples?
Feature Space
[Figure: the same Hair Color / Skin Color feature space]
What if the features selected are not representative enough, or not consistent with our understanding?
We can tell the computer about our understanding of the subjects by giving labels.
Training Examples (Seen)

Hair Color (H) | Skin Color (S) | Class Label (L)
2.2 | 0.8 | ?
3.2 | 1.9 | ?
3.1 | 2.2 | ?
2.4 | 1.3 | ?
3.1 | 2.9 | ?
Classification: to predict the labels of
the testing (unseen) examples based
on the knowledge learned from the
training (seen) examples
[Diagram: labeled training examples (feature vectors with labels 1 / -1) are used to learn a model; the model then predicts the labels of the examples in the validation set]
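A minimal sketch of this seen/unseen workflow in Python (an assumed example, not code from the lecture; it uses scikit-learn, made-up data, and the k-NN classifier introduced just below): fit a model on the labeled training examples, then predict the labels of the held-out validation examples.

# Sketch of the train -> model -> validation workflow (assumed example, made-up data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature vectors with labels 1 / -1
X = np.array([[0.1, 0.2], [0.2, 0.3], [0.15, 0.25], [0.3, 0.2],
              [0.8, 0.9], [0.7, 0.8], [0.9, 0.7], [0.85, 0.95]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Split into training (seen) and validation (unseen) examples
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # learn from the seen examples
print(model.predict(X_val))        # predicted labels for the unseen examples
print(model.score(X_val, y_val))   # validation accuracy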
[Figure: an unseen example plotted in the Hair color / Skin color feature space; with k=1, it is assigned the label of its single nearest neighbor]
kNN Classifier
>NN: nearest neighbors
>k: number of nearest neighbors
>Idea
• k=1: assign the unseen example the label of its nearest neighbor
• k>1: assign the dominant label among those of the k nearest neighbors (see the sketch after the k=3 figure below)
[Figure: k=3 in the Hair color / Skin color feature space; among the 3 nearest neighbors, #A=2 and #W=1, so #A > #W and the unseen example is labeled A]
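A minimal NumPy sketch of this voting rule (an assumed implementation, not the lecture's code; the feature values come from the training table and the A/W label assignment is hypothetical): find the k closest training examples and return the dominant label among them.

# k-NN by hand (assumed sketch): majority vote among the k nearest training examples.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training example
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])             # count the labels among those neighbors
    return votes.most_common(1)[0][0]             # the dominant label wins

# Hair/skin color features with hypothetical labels 'A' and 'W'
X_train = np.array([[2.2, 0.8], [3.2, 1.9], [3.1, 2.2], [2.4, 1.3], [3.1, 2.9]])
y_train = np.array(['A', 'A', 'W', 'A', 'W'])
print(knn_predict(X_train, y_train, np.array([2.8, 1.8]), k=3))   # 'A' (two 'A' votes vs. one 'W')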
It's straightforward. But so far, we have picked the simplest case (classes are well separated) for illustration purposes.
In a more general sense, this is what we're going to have.
[Figure: two overlapping classes, A and B, with their intersection labeled "A and B"]
Bayesian Classifiers
• Classes A and B as two sets
• P(A|x): the probability that class A is observed when seeing an x
• P(B|x): the probability that class B is observed when seeing an x
Bayesian Classifiers
• Classes A and B as two sets
• P(A|x) ∝ P(x|A)P(A), since P(A|x) = P(x|A)P(A) / P(x)
• P(B|x) ∝ P(x|B)P(B), since P(B|x) = P(x|B)P(B) / P(x)
[Figure: an example x falling in the overlap of classes A and B]
https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
Bayesian Classifiers
• Classes A and B as two sets
• P(A|x) ∝ P(x|A)P(A)
• P(B|x) ∝ P(x|B)P(B)
• Priors by counting: P(A) = #A / (#A + #B), P(B) = #B / (#A + #B)
• Likelihoods by counting: P(x|A) = #x / #A, P(x|B) = #x / #B
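A minimal Python sketch of these counting estimates (an assumed example with made-up discrete feature values): estimate the priors and likelihoods by counting, then pick the class with the larger P(class|x) ∝ P(x|class)P(class).

# Bayes rule by counting (assumed sketch for a discrete feature).
from collections import Counter

# Hypothetical training data: (feature value x, class label)
data = [('red', 'A'), ('red', 'A'), ('blue', 'A'), ('blue', 'B'), ('blue', 'B'), ('red', 'B')]

labels = [c for _, c in data]
prior = {c: n / len(data) for c, n in Counter(labels).items()}   # P(A), P(B) = #class / #all

def likelihood(x, c):
    in_class = [xv for xv, cv in data if cv == c]
    return sum(1 for xv in in_class if xv == x) / len(in_class)  # P(x|c) = #x in c / #c

def classify(x):
    # P(c|x) is proportional to P(x|c) * P(c); P(x) is the same for both classes
    scores = {c: likelihood(x, c) * prior[c] for c in prior}
    return max(scores, key=scores.get)

print(classify('red'))   # the class with the larger posterior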
[Figure: a separating hyperplane wTx + b = 0 between the two classes]
Linear Separators
>Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0
f(x) = sign(wTx + b)
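A tiny Python sketch of this decision rule (the weights w and bias b below are assumed values for illustration):

# f(x) = sign(w^T x + b) with assumed weights (illustrative only).
import numpy as np

w = np.array([1.0, -2.0])   # hypothetical normal vector of the hyperplane
b = 0.5                     # hypothetical bias

def f(x):
    return np.sign(w @ x + b)   # +1 on one side of the hyperplane, -1 on the other

print(f(np.array([3.0, 1.0])))   #  1.0: w^T x + b = 3 - 2 + 0.5 =  1.5 > 0
print(f(np.array([0.0, 2.0])))   # -1.0: w^T x + b = -4 + 0.5   = -3.5 < 0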
Linear Separators
[Figure: a linear separator with the distance r from an example to the separating hyperplane marked]
Maximum Margin Classification
> Maximizing the margin is good according to intuition and PAC (Probably Approximately Correct) theory.
> It implies that only the support vectors matter; the other training examples can be ignored.
Soft Margin Classification
>What if the training set is not linearly separable?
>Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: two examples on the wrong side of the margin, each marked with its slack value ξi]
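A small Python sketch of the slack values for a given separator (an assumed illustration using the standard hinge form ξi = max(0, 1 − yi(wTxi + b)); the w, b, and data are made up):

# Slack variables for a soft margin (assumed sketch): xi_i = max(0, 1 - y_i (w^T x_i + b)).
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0          # hypothetical separator
X = np.array([[1.5, 1.0], [0.2, 0.1], [0.9, 0.6]])
y = np.array([1, -1, -1])                  # the third point is on the wrong (noisy) side

margins = y * (X @ w + b)                  # y_i (w^T x_i + b), should be >= 1 when satisfied
slack = np.maximum(0, 1 - margins)         # 0 for easy points, > 0 for difficult/noisy ones
print(slack)                               # [0.  0.3 1.5]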
Linear SVMs: Overview
> The classifier is a separating hyperplane
> Most “important” training points are support vectors; they
define the hyperplane.
> Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
> Both in the dual formulation of the problem and in the
solution training points appear only inside inner products:
Self-study (Math)
Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
The solution gives the classifier f(x) = ΣαiyixiTx + b
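Once the αi are known, the decision function above needs only inner products with the training points; here is a minimal Python sketch (the αi, b, and data below are assumed for illustration, not produced by an actual QP solver):

# Evaluating the dual-form decision function f(x) = sum_i alpha_i y_i x_i^T x + b.
# The alpha values are assumed for illustration; a QP solver would normally produce them.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.5]])   # training points (support vectors)
y = np.array([1, 1, -1])
alpha = np.array([0.4, 0.1, 0.5])                    # hypothetical multipliers, sum(alpha*y) = 0
b = -1.0                                             # hypothetical bias

def f(x):
    return np.sign(np.sum(alpha * y * (X @ x)) + b)  # only inner products x_i^T x appear

print(f(np.array([2.0, 1.0])))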
Non-linear SVMs
> Datasets that are linearly separable with some noise work out great:
[Figure: 1-D examples plotted along the x-axis around 0]
Non-linear SVMs: Feature spaces
>General idea: the original feature space can
always be mapped to some higher-dimensional
feature space where the training set is separable:
Φ: x → φ(x)
The “Kernel Trick”
> The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
> If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
> A kernel function is a function that is equivalent to an inner product in
some feature space.
> Example:
2-dimensional vectors x=[x1 x2]; let K(xi,xj)=(1 + xiTxj)2,
Need to show that K(xi,xj)= φ(xi) Tφ(xj):
K(xi,xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1, xi12, √2 xi1xi2, xi22, √2 xi1, √2 xi2]T [1, xj12, √2 xj1xj2, xj22, √2 xj1, √2 xj2]
= φ(xi)Tφ(xj), where φ(x) = [1, x12, √2 x1x2, x22, √2 x1, √2 x2]
> Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
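A quick numerical check of this identity in Python (an assumed sketch, not from the slides): evaluate (1 + xiTxj)2 directly in the 2-D space and compare it with φ(xi)Tφ(xj) using the φ given above.

# Verify the kernel identity K(xi, xj) = (1 + xi^T xj)^2 = phi(xi)^T phi(xj) numerically.
import numpy as np

def phi(x):
    x1, x2 = x
    # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2, np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K_direct = (1 + xi @ xj) ** 2          # kernel evaluated in the original 2-D space
K_mapped = phi(xi) @ phi(xj)           # inner product in the 6-D feature space
print(K_direct, K_mapped)              # both give the same value (4.0)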
Examples of Kernel Functions
> Linear: K(xi,xj)= xiTxj
• Mapping Φ: x → φ(x), where φ(x) is x itself
> Polynomial of power p: K(xi,xj) = (1 + xiTxj)p
• Mapping Φ: x → φ(x), where φ(x) has (d+p choose p) dimensions
> Gaussian (radial-basis function): K(xi,xj) = exp(−‖xi − xj‖2 / (2σ2))
• Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional:
every point is mapped to a function (a Gaussian);
combination of functions for support vectors is the
separator.
> The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
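In practice these kernels are usually chosen through a library; here is a minimal usage sketch with scikit-learn's SVC (an assumed example with made-up data, not code from the lecture):

# Non-linear SVM via the kernel trick (assumed example with scikit-learn).
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D data that is not linearly separable (inner vs. outer points)
X = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.2, -0.1],
              [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)   # Gaussian (RBF) kernel
print(clf.predict([[0.05, 0.05], [1.8, 0.3]]))            # an inner point vs. an outer point
print(clf.support_vectors_.shape)                         # only the support vectors define the boundary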
Classification – Supervised Learning
[Diagram: labeled training examples (feature vectors with labels 1 / -1) are used to learn a model; the model then predicts the labels of the examples in the validation set]
Thank You!