ECE595 / STAT598: Machine Learning I

Lecture 06 Linear Separability

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

© Stanley Chan 2020. All Rights Reserved.
Overview

Outline

Goal: Understand the geometry of linear separability.

Notations
Input Space, Output Space, Hypothesis
Discriminant Function
Geometry of Discriminant Function
Separating Hyperplane
Normal Vector
Distance from Point to Plane
Linear Separability
Which set is linearly separable?
Separating Hyperplane Theorem
What if theorem fails?

Supervised Classification
The goal of supervised classification is to construct a decision boundary
such that the two classes can be (maximally) separated.

Terminology
Input vectors: x1, x2, . . . , xN.
E.g., images, speech, EEG signals, ratings, etc.
Input space: X. Every xn ∈ X.
Labels: y1, y2, . . . , yN.
Label space: Y. Every yn ∈ Y.
If labels are binary, e.g., yn = ±1, then Y = {+1, −1}.
The labels themselves are arbitrary: {+1, −1} and {0, 1} make no difference.
Target function f : X → Y. Unknown.
Relationship: yn = f(xn).
Hypothesis h : X → Y. Ideally, we want h(x) ≈ f(x), ∀x ∈ X.
Binary Case
If we restrict ourselves to a binary classifier, then

    h(x) = 1,       if g(x) > 0
           0,       if g(x) < 0
           either,  if g(x) = 0

g : X → R is called a discriminant function.

g(x) > 0: x lives on the positive side of g.
g(x) < 0: x lives on the negative side of g.
g(x) = 0: the decision boundary.

You can also define

    h(x) = +1,      if g(x) > 0
           −1,      if g(x) < 0
           either,  if g(x) = 0

No difference as far as the decision is concerned.
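The decision rule above is easy to state in code. Below is a minimal Python sketch (not from the slides; the weights are made up), assuming a linear discriminant g(x) = w^T x + w0:

    import numpy as np

    def g(x, w, w0):
        # Linear discriminant g(x) = w^T x + w0
        return np.dot(w, x) + w0

    def h(x, w, w0):
        # Decision rule: +1 on the positive side, -1 on the negative side.
        # On the boundary g(x) = 0 we break the tie arbitrarily (here: +1).
        return +1 if g(x, w, w0) >= 0 else -1

    # Hypothetical 2-D example
    w, w0 = np.array([1.0, 2.0]), -1.0
    print(h(np.array([2.0, 3.0]), w, w0))    # g = 7 > 0  -> +1
    print(h(np.array([-1.0, -1.0]), w, w0))  # g = -4 < 0 -> -1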
Linear Discriminant Function
A linear discriminant function takes the form

    g(x) = w^T x + w0.

w ∈ R^d: linear coefficients
w0 ∈ R: bias / offset
Define the overall parameter θ = {w, w0} ∈ R^{d+1}.

Example: If d = 2, then

    g(x) = w2 x2 + w1 x1 + w0.

g(x) = 0 means

    x2 = −(w1/w2) x1 − (w0/w2),

where −w1/w2 is the slope and −w0/w2 is the y-intercept.
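As a quick numerical sketch of the d = 2 case (hypothetical weights, not from the slides), the slope and y-intercept of the decision boundary can be read off directly from w and w0:

    import numpy as np

    # Hypothetical 2-D linear discriminant g(x) = w2*x2 + w1*x1 + w0
    w1, w2, w0 = 3.0, 1.5, -2.0

    slope = -w1 / w2          # -2.0
    intercept = -w0 / w2      # 4/3

    # Any point on the line x2 = slope*x1 + intercept satisfies g(x) = 0.
    x1 = 5.0
    x2 = slope * x1 + intercept
    print(np.isclose(w2 * x2 + w1 * x1 + w0, 0.0))  # True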
Outline

Goal: Understand the geometry of linear separability.

Notations
Input Space, Output Space, Hypothesis
Discriminant Function
Geometry of Discriminant Function
Separating Hyperplane
Normal Vector
Distance from Point to Plane
Linear Separability
Which set is linearly separable?
Separating Hyperplane Theorem
What if theorem fails?

Linear Discriminant Function

In high dimensions, the set of points where

    g(x) = w^T x + w0 = 0

is a hyperplane.

Separating Hyperplane:

    H = {x | g(x) = 0}
      = {x | w^T x + w0 = 0}

x ∈ H means x is on the decision boundary.
w/‖w‖₂ is the unit normal vector of H.
Why is w the Normal Vector?

Pick x1 and x2 from H.

So g(x1) = 0 and g(x2) = 0. This means:

    w^T x1 + w0 = 0,  and  w^T x2 + w0 = 0.

Consider the difference vector x1 − x2.

x1 − x2 is a tangent vector on the surface of H. Check:

    w^T (x1 − x2) = (w^T x1 + w0) − (w^T x2 + w0) = 0.

So w is perpendicular to x1 − x2, hence it is the normal.

Normalize to w/‖w‖₂ so that it has unit norm.
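A small numerical check of this fact, with a hypothetical w and w0: pick two points on H and verify that their difference is orthogonal to w.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([1.0, -2.0, 0.5])
    w0 = 3.0

    def point_on_H(w, w0, rng):
        # Project a random point onto the hyperplane w^T x + w0 = 0.
        x = rng.normal(size=w.shape)
        return x - (w @ x + w0) / (w @ w) * w

    x1, x2 = point_on_H(w, w0, rng), point_on_H(w, w0, rng)
    print(np.isclose(w @ x1 + w0, 0.0), np.isclose(w @ x2 + w0, 0.0))  # True True
    print(np.isclose(w @ (x1 - x2), 0.0))                              # True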
Distance from x 0 to g (x) = 0
Pick a point xp on H:
xp is the closest point to x0.
x0 − xp is in the normal direction, so, for some scalar η > 0,

    x0 − xp = η · w/‖w‖₂.

xp is on H, so

    g(xp) = w^T xp + w0 = 0.

Therefore, we can show that

    g(x0) = w^T x0 + w0
          = w^T (xp + η · w/‖w‖₂) + w0
          = g(xp) + η‖w‖₂
          = η‖w‖₂.
Distance from x 0 to g (x) = 0
So the distance is

    η = g(x0)/‖w‖₂.

The closest point xp is

    xp = x0 − η · w/‖w‖₂
       = x0 − (g(x0)/‖w‖₂) · (w/‖w‖₂).

Conclusion:

    xp = x0 − (g(x0)/‖w‖₂) · (w/‖w‖₂),

where g(x0)/‖w‖₂ is the distance and w/‖w‖₂ is the unit normal vector.
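A short sketch with made-up numbers that evaluates the distance η = g(x0)/‖w‖₂ and the projection xp, and confirms that xp lands on the hyperplane:

    import numpy as np

    w = np.array([2.0, -1.0])
    w0 = 4.0
    x0 = np.array([3.0, 1.0])

    g_x0 = w @ x0 + w0                     # g(x0) = 2*3 - 1*1 + 4 = 9
    eta = g_x0 / np.linalg.norm(w)         # distance from x0 to the plane
    xp = x0 - eta * w / np.linalg.norm(w)  # closest point on the plane

    print(eta)                                            # 9 / sqrt(5) ≈ 4.0249
    print(np.isclose(w @ xp + w0, 0.0))                   # True: xp lies on g(x) = 0
    print(np.isclose(np.linalg.norm(x0 - xp), abs(eta)))  # True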
Distance from x 0 to g (x) = 0
Alternative Solution:

We can also obtain the same result by solving the optimization

    xp = argmin_x (1/2)‖x − x0‖²  subject to  w^T x + w0 = 0.

The Lagrangian is

    L(x, λ) = (1/2)‖x − x0‖² − λ(w^T x + w0).

The stationarity conditions imply

    ∇x L(x, λ) = (x − x0) − λw = 0,
    ∇λ L(x, λ) = w^T x + w0 = 0.
Distance from x 0 to g (x) = 0
Let us do some derivation:

    ∇x L(x, λ) = (x − x0) − λw = 0,
    ∇λ L(x, λ) = w^T x + w0 = 0.

The first condition gives x = x0 + λw. Substituting into the second:

    w^T x + w0 = w^T (x0 + λw) + w0 = 0
    ⇒ w^T x0 + λ‖w‖₂² + w0 = 0
    ⇒ g(x0) + λ‖w‖₂² = 0
    ⇒ λ = −g(x0)/‖w‖₂²
    ⇒ x = x0 − (g(x0)/‖w‖₂²) w.

Therefore, we arrive at the same result:

    xp = x0 − (g(x0)/‖w‖₂) · (w/‖w‖₂),

where g(x0)/‖w‖₂ is the distance and w/‖w‖₂ is the unit normal vector.
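As a sanity check, the same constrained problem can be handed to a generic solver and compared against the closed form. This is a hedged sketch using scipy's SLSQP (the numbers are hypothetical and this is not part of the lecture):

    import numpy as np
    from scipy.optimize import minimize

    w = np.array([2.0, -1.0])
    w0 = 4.0
    x0 = np.array([3.0, 1.0])

    # Closed-form projection: xp = x0 - (g(x0)/||w||^2) * w
    xp_closed = x0 - (w @ x0 + w0) / (w @ w) * w

    # Numerical solution of: min (1/2)||x - x0||^2  s.t.  w^T x + w0 = 0
    res = minimize(
        fun=lambda x: 0.5 * np.sum((x - x0) ** 2),
        x0=np.zeros(2),
        constraints=[{"type": "eq", "fun": lambda x: w @ x + w0}],
        method="SLSQP",
    )

    print(xp_closed)                                  # [-0.6  2.8]
    print(res.x)                                      # ≈ [-0.6  2.8]
    print(np.allclose(res.x, xp_closed, atol=1e-5))   # True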
Outline

Goal: Understand the geometry of linear separability.

Notations
Input Space, Output Space, Hypothesis
Discriminant Function
Geometry of Discriminant Function
Separating Hyperplane
Normal Vector
Distance from Point to Plane
Linear Separability
Which set is linearly separable?
Separating Hyperplane Theorem
What if theorem fails?

Which one is Linearly Separable? Which one is Not?

Separating Hyperplane Theorem
Can we always find a separating hyperplane?
No.
Unless the classes are linearly separable.
If convex and not overlapping, then yes.

Theorem (Separating Hyperplane Theorem)


Let C1 and C2 be two closed convex sets such that C1 ∩ C2 = ∅. Then,
there exists a linear function

    g(x) = w^T x + w0,

such that g(x) > 0 for all x ∈ C1 and g(x) < 0 for all x ∈ C2.

Remark: The theorem above provides sufficiency but not necessity for
linear separability.
Separating Hyperplane Theorem
Pictorial “proof”:
Pick two points x* and y* such that the distance between the two sets is
minimized.
Define the mid-point x0 = (x* + y*)/2.
Draw the separating hyperplane through x0 with normal w = x* − y*.
Convexity implies w^T (x − x0) > 0 for every x ∈ C1 (and, by symmetry,
w^T (x − x0) < 0 for every x ∈ C2).
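A hedged numerical illustration of this construction, using two disjoint Euclidean balls as the convex sets (all numbers are made up): the closest points lie on the segment joining the centers, and the resulting hyperplane separates samples drawn from the two balls.

    import numpy as np

    rng = np.random.default_rng(1)

    # Two disjoint closed balls: C1 = B(c1, r1), C2 = B(c2, r2)
    c1, r1 = np.array([4.0, 4.0]), 1.0
    c2, r2 = np.array([0.0, 0.0]), 1.5
    u = (c1 - c2) / np.linalg.norm(c1 - c2)

    # Closest pair of points between the two balls (along the center line)
    x_star = c1 - r1 * u
    y_star = c2 + r2 * u

    # Hyperplane through the midpoint with normal w = x* - y*
    w = x_star - y_star
    x0 = (x_star + y_star) / 2.0
    g = lambda x: w @ (x - x0)

    def sample_ball(c, r, n):
        # Uniform-ish samples inside a ball (random direction * random radius)
        d = rng.normal(size=(n, c.size))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        return c + d * (r * rng.uniform(size=(n, 1)) ** 0.5)

    print(all(g(x) > 0 for x in sample_ball(c1, r1, 500)))  # True
    print(all(g(x) < 0 for x in sample_ball(c2, r2, 500)))  # True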
Linearly Separable?
I have data {x1, . . . , xN}.
Closed. Convex. Non-overlapping.
Separating hyperplane theorem: I can find a line.
Victory?
Not quite.

When Theory Fails

Theorem (Separating Hyperplane Theorem)


Let C1 and C2 be two closed convex sets such that C1 ∩ C2 = ∅. Then,
there exists a linear function

    g(x) = w^T x + w0,

such that g(x) > 0 for all x ∈ C1 and g(x) < 0 for all x ∈ C2.

Finding a separating hyperplane for the training set does not imply it
will work for the testing set.
The separating hyperplane theorem is more often used in theoretical
analysis, by assuming properties of the testing set.
If a dataset is linearly separable, then you are guaranteed to find a
perfect classifier. Then you can say how good the classifier you
designed is compared to the perfect one.
Linear Classifiers Do Not Work

Example 1 Example 2

The intrinsic geometry of the two classes could be bad.

The training set could lack training samples.

Solution 1: Use non-linear classifiers, e.g., g(x) = x^T W x + w^T x + w0.
Solution 2: Kernel methods, e.g., radial basis functions.
Solution 3: Extract features, e.g., g(x) = w^T φ(x).
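As a sketch of Solution 3 (a toy example of my own, not from the lecture): concentric rings are not linearly separable in the raw coordinates, but they become separable after a simple quadratic feature map φ(x) = (x1, x2, x1² + x2²).

    import numpy as np

    rng = np.random.default_rng(0)

    def ring(radius, n):
        # Points scattered around a circle of the given radius
        angles = rng.uniform(0, 2 * np.pi, n)
        r = radius + 0.1 * rng.normal(size=n)
        return np.c_[r * np.cos(angles), r * np.sin(angles)]

    X = np.vstack([ring(1.0, 100), ring(3.0, 100)])   # class -1 inside, +1 outside
    y = np.r_[-np.ones(100), np.ones(100)]

    def phi(X):
        # Quadratic feature map: (x1, x2, x1^2 + x2^2)
        return np.c_[X, (X ** 2).sum(axis=1)]

    # In feature space, g(z) = w^T z + w0 with w = (0, 0, 1), w0 = -4
    # separates the rings, since x1^2 + x2^2 is about 1 inside and 9 outside.
    w, w0 = np.array([0.0, 0.0, 1.0]), -4.0
    pred = np.sign(phi(X) @ w + w0)
    print((pred == y).mean())   # 1.0 (up to the small radial noise)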
Reading List

Separating Hyperplane:
Duda, Hart, and Stork's Pattern Classification, Sections 5.1 and 5.2.
Princeton ORFE-523, Lecture 5 on separating hyperplanes:
http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec5_gh.pdf
Cornell ORIE-6300, Lecture 6 on separating hyperplanes:
https://people.orie.cornell.edu/dpw/orie6300/fall2008/Lectures/lec06.pdf
Caltech lecture note:
http://www.its.caltech.edu/~kcborder/Notes/SeparatingHyperplane.pdf

Appendix

Proof of Separating Hyperplane Theorem

Conjecture: Let us see if this is the correct hyperplane:

    g(x) = w^T (x − x0)
         = (x* − y*)^T (x − (x* + y*)/2)
         = (x* − y*)^T x − (‖x*‖² − ‖y*‖²)/2.

According to the picture, we want g(x) > 0 for all x ∈ C1.
Suppose not. Assume there is some x ∈ C1 with

    g(x) = (x* − y*)^T x − (‖x*‖² − ‖y*‖²)/2 < 0.

Let us see if we can find a contradiction.
Proof of Separating Hyperplane Theorem
C1 is convex.
Pick x ∈ C1.
Pick x* ∈ C1.
Let 0 ≤ λ ≤ 1.
Construct the point

    xλ = (1 − λ)x* + λx.

Convexity means xλ ∈ C1.

So we must have

    ‖xλ − y*‖ ≥ ‖x* − y*‖.
Proof of Separating Hyperplane Theorem

Pick an arbitrary point x ∈ C1.

x* is fixed already.
Pick xλ along the line segment connecting x and x*.
Convexity implies xλ ∈ C1.
So ‖xλ − y*‖ ≥ ‖x* − y*‖. If not, something is wrong.

Let us do some algebra:

    ‖xλ − y*‖² = ‖(1 − λ)x* + λx − y*‖²
               = ‖x* − y* + λ(x − x*)‖²
               = ‖x* − y*‖² + 2λ(x* − y*)^T (x − x*) + λ²‖x − x*‖²
               = ‖x* − y*‖² + 2λ w^T (x − x*) + λ²‖x − x*‖².

Remember: w^T (x − x0) < 0 by assumption.
Proof of Separating Hyperplane Theorem

kx λ − y ∗ k2 = kx ∗ − y ∗ k2 + 2λw T (x − x ∗ ) + λ2 kx − x ∗ k2
< kx ∗ − y ∗ k2 + 2λ(w T x 0 − w T x ∗ ) + λ2 kx − x ∗ k2
 ∗ 2
kx k − ky ∗ k2
 
∗ ∗ 2 T ∗
= kx − y k + 2λ −w x
2
+ λ2 kx − x ∗ k2
= kx ∗ − y ∗ k2 − λkx ∗ − y ∗ k2 + λ2 kx − x ∗ k2
| {z } | {z }
=A =B
∗ ∗ 2 2
= kx − y k − λA + λ B
= kx ∗ − y ∗ k2 − λ(A − λB).
Now, pick an x such that A − λB > 0. Then −λ(A − λB) < 0.
A kx ∗ − y ∗ k2
λ< = .
B kx − x ∗ k2 c Stanley Chan 2020. All Rights Reserved.
30 / 34
Proof of Separating Hyperplane Theorem
Therefore, if we choose λ ∈ (0, 1] such that A − λB > 0, i.e.,

    0 < λ < A/B = ‖x* − y*‖² / ‖x − x*‖²,

then −λ(A − λB) < 0, and so

    ‖xλ − y*‖² < ‖x* − y*‖² − λ(A − λB)
               < ‖x* − y*‖².

Contradiction, because ‖x* − y*‖² should be the smallest!

Conclusion:
If x ∈ C1, then g(x) > 0.
By symmetry, if x ∈ C2, then g(x) < 0.
So we have found the separating hyperplane (w, w0).
Q&A 1: What is a convex set?

A set C is convex if the following condition is met:

Pick any x ∈ C and y ∈ C, and let 0 < λ < 1. If λx + (1 − λ)y is also in
C for every such x, y, and λ, then C is convex.
Basically, it says that you can pick any two points in the set and draw the
line segment between them. If the segment always stays inside the set, then
the set is convex.
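One way to turn this definition into a (randomized) numerical test, sketched below with toy membership functions of my own: sample pairs of points in the set and check that points on the connecting segment stay inside. A disk passes; an annulus fails.

    import numpy as np

    rng = np.random.default_rng(0)

    def looks_convex(in_set, sampler, n_pairs=2000):
        # Randomized check of the definition: for sampled x, y in C and
        # lambda in (0, 1), is lambda*x + (1 - lambda)*y also in C?
        for _ in range(n_pairs):
            x, y = sampler(), sampler()
            lam = rng.uniform(0.01, 0.99)
            if not in_set(lam * x + (1 - lam) * y):
                return False
        return True

    def sample_disk():        # points in the unit disk (convex)
        p = rng.uniform(-1, 1, 2)
        return p if np.linalg.norm(p) <= 1 else sample_disk()

    def sample_annulus():     # points in the ring 0.5 <= |x| <= 1 (not convex)
        p = sample_disk()
        return p if np.linalg.norm(p) >= 0.5 else sample_annulus()

    print(looks_convex(lambda p: np.linalg.norm(p) <= 1, sample_disk))            # True
    print(looks_convex(lambda p: 0.5 <= np.linalg.norm(p) <= 1, sample_annulus))  # False (w.h.p.)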
Q&A 2: Is there a way to check whether two sets are
linearly separable?
No, at least I do not know of one.
The best you can do is to check whether a training set is linearly
separable.
To do so, solve the hard-margin SVM. If you can solve it with zero training
error, then you have found a separating hyperplane. If the hard-margin SVM
has no solution, then the training set is not separable.
Checking the testing set is impossible unless you know the
distributions of the samples. But if you know the distributions, you
can derive a formula to check linear separability.
For example, two Gaussians are never linearly separable, because no matter
how unlikely, you can always find a sample that lands on the wrong side.
Uniform distributions, in contrast, can be linearly separable (e.g., when
their supports are disjoint convex sets).
Bottom line: Linear separability, in my opinion, is more of a
theoretical tool to describe the intrinsic property of the problem. It
is not for computational purposes.
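A hedged sketch of that check (my own construction, not from the lecture): for a finite training set, separability is a linear feasibility problem, namely finding w and w0 with yn(w^T xn + w0) ≥ 1 for all n, which a generic LP solver can answer with a zero objective.

    import numpy as np
    from scipy.optimize import linprog

    def is_linearly_separable(X, y):
        # Feasibility LP over z = (w, w0): y_n (w^T x_n + w0) >= 1 for all n,
        # written as A_ub z <= b_ub with A_ub = -y_n * [x_n, 1], b_ub = -1.
        n, d = X.shape
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        b_ub = -np.ones(n)
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        return res.success

    rng = np.random.default_rng(0)
    X1 = rng.normal(loc=[+3, +3], size=(50, 2))   # well-separated clouds
    X2 = rng.normal(loc=[-3, -3], size=(50, 2))
    X = np.vstack([X1, X2])
    y = np.r_[np.ones(50), -np.ones(50)]
    print(is_linearly_separable(X, y))            # True (with high probability)

    y_shuffled = rng.permutation(y)               # random labels: almost surely not separable
    print(is_linearly_separable(X, y_shuffled))   # False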
Q&A 3: If two sets are not convex, how do I know whether they are
linearly separable?

You can look at the convex hulls.

A convex hull is the smallest convex set that contains the original set.
If the two convex hulls do not overlap, then the sets are linearly separable.
For additional information about convex sets and convex hulls, you can
check Chapter 2 of
https://web.stanford.edu/class/ee364a/lectures.html
