Introduction to Machine Learning (for AI)

Supervised Learning I - Linear Models

Dr. Matias Valdenegro, Dr. Andreea Sburlea

November 18, 2024

1 / 78
Today’s Lecture

• Today and Thursday we are covering a bunch of ML


methods for classification and regression.
• Today we will cover linear and logistic regression, Linear
Discriminant Analysis, and Support Vector Machines.
• Thursday, we will talk about Kernel SVMs,
Tree-based models, Random Forest, Ensembles,
K-Nearest Neighbors.

2 / 78
Outline

1 Linear Regression

2 Logistic Regression

3 Support Vector Machines (SVMs)

4 Linear Discriminant Analysis (LDA)

3 / 78
Introduction

• Linear regression is the simplest regression model.


• As a reminder, regression means that the output
variable y is continuous.

4 / 78
Representing the dataset
• The dataset is composed of N pairs of observations
(input) and response (output).
• The response is a vector of N scalars:

  y = [y1 y2 . . . yN]⊤

• We have N observations over d variables


• A generic observation x ∈ R^d can then be
represented as a vector of d features:

x = [x1 x2 . . . xd]

For convenience, we will be representing this as a


row vector.
5 / 78
Representing the dataset

We can group together the observations throughout the


dataset in a matrix X ∈ R^{n×d}, where each row is a data
point:

X = [x1,1  x1,2  . . .  x1,d]   [x1]
    [x2,1  x2,2  . . .  x2,d] = [x2]
    [ ...   ...  . . .   ...]   [...]
    [xn,1  xn,2  . . .  xn,d]   [xn]        (1)

6 / 78
Linear Regression Model

The basic linear regression (LR) model is:


f(x) = ŷ = Σ_{j=1}^{d} xj wj + b = x · w + b.        (2)

Here the parameter/weight vector w has the same length


as the input features x, and the bias b is a scalar.
The total number of parameters/weights is P = d + 1,
where d = dim(x).

7 / 78
Linear Regression Model

From https://en.wikipedia.org/wiki/File:Linear_regression.svg

8 / 78
Matrix Formulation
The previous equations can be re-written in a matrix
form. First we define the input features x̃ with an
additional 1 concatenated/appended at the end:
 
x̃ = [x1 . . . xd  1] = [x  1]        (3)

And the weight vector w̃ is now defined as w with the bias b appended:

w̃ = [w⊤  b]⊤        (4)

Once these small transformations are done, we can
re-write the linear regression model as:

f(x) = x̃ w̃ = [x  1] [w⊤  b]⊤ = x · w + b        (5)
9 / 78
Closed Form Solution

Linear regression models are trained using the mean


squared error (MSE) loss (with n data points):
L(y, ŷ) = Σ_{i=1}^{n} (ŷi − yi)² / n = n⁻¹ Σ_{i=1}^{n} (xi w − yi)²        (6)

Here ŷi = xi w: for simplicity we dropped the tildes from


the notation.

10 / 78
Closed Form Solution
By condensing X and y for the whole dataset (including
the column of 1’s), then we can rewrite the loss as:
L = n⁻¹ Σ_{i=1}^{n} (xi w − yi)²        (7)
  = n⁻¹ (y − Xw)⊤(y − Xw)        (8)

To find the value of w that minimizes this loss, we can
use the derivative of L and solve for this to be zero:

∂L/∂w = n⁻¹ ∂/∂w (y − Xw)⊤(y − Xw) = 0        (9)

11 / 78
Closed Form Solution
After some algebraic manipulations (the n⁻¹ factor cancels
when the gradient is set to zero), this has a closed
form solution:

w = (X⊤X)⁻¹ X⊤y        (10)

Here X is a n × (d + 1) matrix, and y is a n × 1 vector,


where n is the number of training samples, and d is their
dimensionality.

Then we can note that X⊤X is a (d + 1) × (d + 1)
matrix, while X⊤y is a (d + 1) × 1 vector, so the whole
operation (X⊤X)⁻¹ X⊤y produces a (d + 1) × 1 result.
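
Below is a minimal NumPy sketch of this closed-form fit; the synthetic data, dimensions, and true weights are assumptions made only for illustration, and np.linalg.lstsq is used because it solves the same normal equations in a numerically stable way.

import numpy as np

# Sketch: closed-form linear regression on synthetic data (illustrative assumptions).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                        # n observations, d features
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=n)

X_tilde = np.hstack([X, np.ones((n, 1))])          # append the column of 1's
# Solves the normal equations, i.e. w = (X^T X)^{-1} X^T y.
w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
w, b = w_tilde[:-1], w_tilde[-1]
print(w, b)                                        # close to [2, -1, 0.5] and 4.0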

12 / 78
Closed Form Solution

This only works if the columns of X are linearly independent,
so that X⊤X is invertible.

It only makes sense to do this if the matrix inverse
(X⊤X)⁻¹ is tractable, and if the matrix multiplication
X⊤X is computationally feasible.

Linear independence can be broken easily, for example if
one variable is a duplicate of another or a linear
combination of other variables (perfectly correlated
features).

13 / 78
Modeling Assumptions
Residuals
The actual model considers errors or residuals ϵ:

f (xi ) = xi w + ϵi = yi (11)
This is because the yi ’s can have measurement or other kinds of
errors. These residuals are zero only if the model fits the data
perfectly (zero loss).

Data Must Not Have Dependencies


As mentioned, the analytical solution assumes that the columns of X are
linearly independent. This means that no variable xi in the training
set is linearly related to another variable xj ; in simple words,
variables in the training set are not correlated.

Now you see the importance of decorrelating your inputs?


14 / 78
Modeling Assumptions

Independence of Residuals
The residuals ϵi are assumed to be independent and not related to
each other or to the input variables. This means that the errors do
not depend on the input.

Linearity
In this model, the input features xi are treated as fixed values, and
the model is only linear with respect to its parameters. Input features
can be transformed to produce other features and the model stays
linear, since the training data is treated as a constant, and
optimization only happens in the parameters w.

15 / 78
Modeling Assumptions

Constant Variance
The variance of the residuals does not depend on the inputs or the
outputs of the model. This is called Homoscedasticity. The opposite
(Heteroscedasticity) is when the variance of the output or the errors
can vary with the input or output of the model, for example, if larger
outputs have larger variance than smaller outputs.

16 / 78
Modeling Assumptions

Figure: These are samples from Anscombe’s Quartet: training
sets with different data points but the same fitted linear
regression line, showing that a linear regression model should be
used with care. Source: https://en.wikipedia.org/wiki/Anscombe's_quartet
17 / 78
Solution with Gradient Descent

If the analytic solution is intractable, gradient descent
can be used as a substitute. For this we need to
compute the gradient of the loss with respect to the
parameters:

∂L/∂w = n⁻¹ ∂/∂w (y − Xw)⊤(y − Xw)        (12)

This has a well known closed form:

∂L/∂w = n⁻¹ X⊤(Xw − y)        (13)

18 / 78
Solution with Gradient Descent
This produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent:

w_{m+1} = w_m − α n⁻¹ X⊤(X w_m − y)        (14)

Where w is initialized with a random vector (in a small
range), α is the learning rate that has to be tuned
manually, and m and m + 1 identify the iteration indices.

This is iterated until M iterations have happened, and the


loss value is monitored for convergence (loss is not
decreasing and approximately constant).
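
A minimal NumPy sketch of this update loop on synthetic data; the learning rate, iteration count, and data are illustrative assumptions.

import numpy as np

# Sketch: gradient descent for linear regression (illustrative assumptions).
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # features plus bias column
y = X @ np.array([2.0, -1.0, 0.5, 4.0]) + 0.1 * rng.normal(size=n)

alpha, M = 0.1, 500                       # learning rate and number of iterations
w = rng.normal(scale=0.01, size=d + 1)    # small random initialization
for m in range(M):
    grad = X.T @ (X @ w - y) / n          # gradient as in eq. (13)
    w = w - alpha * grad                  # update rule as in eq. (14)
print(w)                                  # approaches the closed-form solution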

19 / 78
Solution with Gradient Descent
Figure: Contour plot of a loss surface with gradient descent
iterates x0, x1, x2, x3, x4 moving toward the minimum (marked *).

From https://en.wikipedia.org/wiki/Gradient_descent
20 / 78
Multi-variable Linear Regression
What if the labels y are not scalars, but vectors of
dimension K? We can still perform linear regression, but
now the model outputs a vector instead of a scalar. This
can be seen as performing K individual linear regression
problems:

f(x) = [x · wk + bk]_{k=1}^{K} = xW + b        (15)

Where now b is a K × 1 vector instead of a scalar, and W
is a d × K matrix instead of a vector.

In general all previous equations hold, but now there is an


added dimension K , and in many cases multi-dimensional
matrix multiplications are required. We will see more
details later in multi-class logistic regression.
21 / 78
Interpretability of Weights
For linear models like LR, the weights/parameters can
sometimes be interpreted if:
• Input features are normalized/scaled to be in the
exact same range.
• For the case of multi-variable LR, then the output
features should also be normalized/scaled to be at
the same range.
The interpretation in this case is that each weight
(loosely) indicates the importance of the feature it is
associated to. The bias does not have a particular
interpretation (other than the y intercept).

For numerical features, an increase of that feature by one
unit increases the output y by that feature's weight.
22 / 78
Polynomial Regression
A Polynomial of degree p is given by:

f(x) = Σ_{i=0}^{p} wi x^i = w0 + w1 x + w2 x² + w3 x³ + ... + wp x^p        (16)

This is a linear model on the features [1, x, x², x³, ..., x^p].

This is called polynomial regression, and it is a way to


make regression non-linear, by modifying the input
features. You choose a value of p (using cross validation),
transform each feature xi into p polynomial values, and
train a linear regression model on the new features.

This method increases the feature space dimensionality by


a factor of p.
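
A minimal scikit-learn sketch of polynomial regression with p = 3; the 1D synthetic data and the pipeline choice are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Sketch: cubic regression on synthetic 1D data (illustrative assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - 2.0 * x[:, 0] + rng.normal(scale=2.0, size=200)

# PolynomialFeatures builds [1, x, x^2, x^3]; the model stays linear in the weights.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))              # prediction for x = 2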
23 / 78
Polynomial Regression
Figure: Example of cubic regression (p = 3), showing the
ground truth, the estimate, and confidence bands.
24 / 78


Outliers in Linear Regression

25 / 78
Robust Linear Regression
Linear Regression is overall not robust to outliers. There
are many alternatives.
• There are many robust linear regression methods,
generally making assumptions about the output y ,
for example that it follows a student’s t-distribution
or other heavy tailed distribution.
• The simple LR algorithm assumes that the output is
Gaussian distributed, making it not robust to outliers.
• RANSAC (Random Sample Consensus) can be used to
fit LR models on random data subsets and detect
which data points are outliers (see the sketch below).
• Exploratory data analysis can be used to identify and
remove outliers.
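
A minimal scikit-learn sketch of RANSAC-based robust fitting; the synthetic data and the injected outliers are assumptions made for illustration.

import numpy as np
from sklearn.linear_model import RANSACRegressor

# Sketch: RANSAC linear regression with a few corrupted labels (illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=100)
y[:5] += 40.0                                   # corrupt a few labels to act as outliers

ransac = RANSACRegressor(random_state=0)        # wraps a linear regressor by default
ransac.fit(X, y)
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)   # roughly 3.0 and 1.0
print(ransac.inlier_mask_[:5])                  # corrupted points are typically flagged False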
26 / 78
Outline

1 Linear Regression

2 Logistic Regression

3 Support Vector Machines (SVMs)

4 Linear Discriminant Analysis (LDA)

27 / 78
Classification Reminder

• Classification is when the output variable and labels


are discrete.
• In Classification, the model should separate or
segregate the data points, while in Regression the
model usually tightly fits the data points.
• Different losses are used for these tasks, but also
they need changes in the model equations, mostly
related on how a discrete output is drawn from
continuous outputs produced by a model.

28 / 78
Classification Reminder

Figure: Regression vs. classification

29 / 78
Probabilistic Classifiers
Most classifiers output a probability vector p of length C.
The integer class index c can be recovered by:

c = arg max_{j ∈ {1,...,C}} pj        (17)

Note that for C classes, the indices here run from 1 to C
(in implementations they usually run from 0 to C − 1).


For binary classification, only a single probability is
required:

f (x) = P(y = 1) = 1 − P(y = 0) (18)


In this case, the classifier outputs the probability of class 1
(usually the positive class), while the probability for class 0
(the negative class) can be recovered by subtracting it from
one.
30 / 78
Decision Boundary
In classification, we can identify
areas in the feature space where
the model outputs a specific class.

The shape identifying the boundary


between these areas is called the
decision boundary.

If data are separable, there are


possibly infinite valid decision
boundaries.

Points lying closer to the decision


boundary are most difficult to
classify. 31 / 78
Logistic Regression Model
Logistic Regression is a classification model (this is not a
mistake, the name is misleading). The basic model is:
f(x) = σ( Σ_{j=1}^{d} wj xj + b ) = σ(x · w + b)        (19)

Where the function σ(x) is called the logistic or sigmoid
function, given by:

σ(x) = 1 / (1 + e^{−x})        (20)

This is basically linear regression with the logistic function


applied to its output.
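
A minimal sketch of this forward pass; the weight vector, bias, and input below are made-up values used only for illustration.

import numpy as np

# Sketch: logistic regression forward pass with illustrative, untrained parameters.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])          # assumed weights (illustrative)
b = 0.3                            # assumed bias (illustrative)
x = np.array([0.8, 0.1])

p = sigmoid(x @ w + b)             # P(y = 1 | x)
y_hat = int(p >= 0.5)              # predicted class with a 0.5 threshold
print(p, y_hat)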
32 / 78
Probabilistic Interpretation
Logistic regression outputs a continuous value in the
range [0, 1], which is usually interpreted as:
P(y = 1|x) = f (x) = σ(x · w + b) (21)

This means that the output of logistic regression is the
probability that y = 1, i.e. the probability of the positive
class, given the input x.

This is advantageous since now the model outputs a


probability that can represent uncertainty. This also
requires some changes for training. Note that for class 0:
P(y = 0|x) = 1 − P(y = 1|x) = 1 − σ(x · w + b) (22)
33 / 78
Training Logistic Regression

Logistic regression uses the binary cross-entropy loss,


which can be used to estimate the maximum likelihood
solution given the data.
Binary Cross-Entropy
Used for binary classification problems with labels yi ∈ {0, 1}
L(y, ŷ) = − Σ_i [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]

Where now ŷ = σ(x · w + b).

34 / 78
Logistic Regression Concepts

Logit
Logits are the input to the logistic/sigmoid function:

l = x·w+b (23)
ŷ = σ(l) (24)

Here l is a logit. Logits range in the real numbers (R), while the
output of the logistic/sigmoid function is [0, 1]. The expanded range
is useful in some applications, for example if you want to do
regression of the logits.

35 / 78
Training with Gradient Descent
Using the matrix representation, the gradient of the
cross-entropy loss has a well known closed form:
∂L/∂w = n⁻¹ X⊤(σ(Xw) − y)        (25)

This produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent:

w_{m+1} = w_m − α n⁻¹ X⊤(σ(X w_m) − y)        (26)

Here w is initialized with a random vector (in a small
range), and α is the learning rate that has to be tuned
manually. Note that the gradient is very similar to the
one in linear regression, except for the application of the
logistic/sigmoid function σ(x).
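
A minimal NumPy sketch of this update rule on synthetic binary data; the data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

# Sketch: gradient descent for binary logistic regression (illustrative assumptions).
rng = np.random.default_rng(0)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])     # bias column appended
true_w = np.array([2.0, -1.0, 0.5])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = (rng.uniform(size=n) < sigmoid(X @ true_w)).astype(float)  # Bernoulli labels

alpha, M = 0.5, 1000
w = rng.normal(scale=0.01, size=d + 1)
for m in range(M):
    grad = X.T @ (sigmoid(X @ w) - y) / n     # gradient as in eq. (25)
    w = w - alpha * grad                      # update rule as in eq. (26)
print(w)                                      # roughly recovers true_w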
36 / 78
Multi-class Logistic Regression
The current logistic regression formulation only works for
binary classification, and there are some strategies to
extend to a multi-class setting.
One vs One
For each pair of classes, train one classifier. To decide the output
class at inference, select the class with the most votes. This way
each classifier pair works as a vote for one class. Requires
0.5C (C − 1) classifiers.

One vs All
For each class, train a classifier for that class vs all other data points,
and at inference, make a prediction with all classifiers and select the
class with highest probability. Requires only C classifiers.
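
A minimal scikit-learn sketch of both strategies wrapped around a binary logistic regression; the toy 4-class dataset is an illustrative assumption.

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Sketch: One vs One and One vs Rest around a binary classifier (illustrative data).
X, y = make_blobs(n_samples=300, centers=4, random_state=0)   # C = 4 classes

ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovo.estimators_), len(ovr.estimators_))   # 0.5*C*(C-1) = 6 and C = 4 classifiers
print(ovo.predict(X[:3]), ovr.predict(X[:3]))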

37 / 78
One vs One

Figure: One vs One classification on an example 2D dataset.

38 / 78
One vs Rest
Figure: One vs Rest classification on an example 2D dataset.

39 / 78
Multinomial Logistic Regression

A simpler way to formulate a multi-class logistic


regression model is with:

f (x) = softmax(Wx + b) (27)

Where now W is a C × d matrix and b is a C × 1 vector,


where C is the number of classes. This model now
outputs a probability vector instead of a single probability.

This formulation is equivalent to a one-layer neural


network.

40 / 78
Softmax Function

The softmax is a function s : R^C → [0, 1]^C defined as:

softmax(x) = [ e^{xc} / Σ_j e^{xj} ]_{c=0}^{C−1}
           = ( e^{x0} / Σ_j e^{xj}, e^{x1} / Σ_j e^{xj}, ..., e^{x_{C−1}} / Σ_j e^{xj} )        (28)
This function transforms a vector of logits into a discrete
probability distribution, where the elements of the output
vector sum to 1.

It is usually used to transform the output of a linear


classifier (producing logits) into probabilities.

41 / 78
Softmax Function (Example)
4-way classification problem.
The output of the model is the following:

[1.25  0.32  2.39  −3.01]⊤

After the application of softmax:

[0.2205  0.0870  0.6894  0.0031]⊤
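
A minimal NumPy sketch of the softmax (with the usual max-subtraction for numerical stability), applied to the logits of the example above.

import numpy as np

# Sketch: softmax applied to the example logits.
def softmax(logits):
    z = logits - logits.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.25, 0.32, 2.39, -3.01])
print(softmax(logits))             # approx. [0.2205, 0.0870, 0.6894, 0.0031]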

42 / 78
Training Multinomial LR
This model is trained using the categorical cross-entropy
loss.
Categorical Cross-Entropy
For this loss, the labels y_i^c should be one-hot encoded. Used for
multi-class classification problems, where the model predictions
ŷ_i^c are class probabilities that sum to 1.

L(y, ŷ) = − Σ_i Σ_c y_i^c log(ŷ_i^c)

The gradient has the same form as in binary LR:

∂L/∂W = n⁻¹ X⊤(softmax(XW) − y)        (29)

43 / 78
Multi-Label Logistic Classification
The multi-label setting is where you have multiple classes
but more than one class is possible at the same time.
This can also be modeled using logistic regression, where
the model is:

f(x) = σ(xW + b)        (30)

The difference is the use of the logistic/sigmoid
activation for each class, as W is a d × C matrix and b is
a C × 1 vector.
This model is trained using the binary cross-entropy loss,
but applied for each class separately:

L(y, ŷ) = − Σ_c Σ_i [ y_i^c log(ŷ_i^c) + (1 − y_i^c) log(1 − ŷ_i^c) ]

This means the target vector is not one-hot encoded, but
set to 1 if the class is present, and to 0 otherwise.
44 / 78
Multi-Label Logistic Classification

Predictions can be made normally, but classes need to be


decided in a slightly different way:

ŷ = σ(xW + b)        (31)

class(ŷ^c) = { class c is present if ŷ^c ≥ T^c
             { class c is absent  if ŷ^c < T^c        (32)

Where T^c is a threshold that can be tuned for each class,
for example by using an ROC curve. A standard value is
T^c = 0.5.
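
A minimal sketch of this per-class thresholding; the weights W, bias b, thresholds T, and input x are made-up values used only for illustration.

import numpy as np

# Sketch: multi-label prediction with per-class thresholds (illustrative parameters).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -0.5, 0.2],
              [0.3,  0.8, -1.0]])          # d = 2 features, C = 3 classes (assumed values)
b = np.array([0.0, -0.2, 0.1])
T = np.array([0.5, 0.5, 0.5])              # per-class thresholds

x = np.array([1.2, -0.7])
y_hat = sigmoid(x @ W + b)                 # one independent probability per class
present = y_hat >= T                       # boolean mask of predicted classes
print(y_hat, present)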

45 / 78
Outline

1 Linear Regression

2 Logistic Regression

3 Support Vector Machines (SVMs)

4 Linear Discriminant Analysis (LDA)

46 / 78
Motivation

In addition to the stability of the decision boundary


observed with LDA, SVMs have the following advantages:

• SVMs are robust to outliers (they are usually not


affected by the presence of outliers in the training
set).
• They model directly the decision boundary.
• They are easily extendable to non-linear models using
the kernel trick.

47 / 78
Motivation - Hyperplanes Might not be Unique
Figure: Three candidate separating hyperplanes H1, H2 and H3 in the (X1, X2) plane.

Figure from https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg
48 / 78
Maximum Margin Formulation
People came up with a simple idea to solve this issue.

What if instead of a hyper-plane separating the data, we


learn a separating hyper-plane including a margin parallel
to the hyper-plane, and then try to find the plane that
has the biggest separation between the two classes
(maximum margin).

The limits of the margin would be given by the data


points, meaning that the hyper-plane not only has to
separate both classes, but also touch the data points
closest to the hyper-plane.

This hyper-plane then would be unique, which solves the


theoretical issue.
49 / 78
Maximum Margin Formulation

Figure from https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:SVM_margin.png


50 / 78
Hard Maximum Margin Formulation
The hard margin formulation applies when the data is
linearly separable, using two parallel hyperplanes, defined
by:

xw + b = 1 Positive Class (33)


xw + b = −1 Negative Class (34)

Anything above the hyperplane xw + b = 1 is classified as


the positive class, and anything below the hyperplane
xw + b = −1 is classified as the negative class.
The distance between these hyper-planes is 2/||w||, so in
order to maximize the margin, we would like to minimize
||w||.
51 / 78
Hard Maximum Margin Formulation
To consider the labels yi and constrain points to not fall
inside the margin, we can use the following constraints.

xi w + b ≥ 1 if yi = 1 (35)
xi w + b ≤ −1 if yi = −1 (36)

These constraints can be compacted into:

yi (xi w + b) ≥ 1 (37)

From where the following optimization problem can be


derived:

Minimize ||w|| subject to yi (xi w + b) ≥ 1 ∀i ∈ [1, n] (38)

52 / 78
Hard Maximum Margin Formulation

Once the SVM is learned, predictions can be made with:

f(x) = sign(xw + b)        (39)

Note that in this formulation, the labels are 1 for the


positive class, and -1 for the negative class.

53 / 78
Soft Maximum Margin Formulation
The issue with the hard margin is that we do not have
solutions if the data is not linearly separable.
The margin can be made to be soft, by relaxing the
constraints using positive slack variables ξi :
yi (xi w + b) ≥ 1 − ξi (40)
ξi ≥ 0 (41)
The idea is to minimize the total of these slack variables,
which leads to the following optimization problem:

minimize  ||w|| + κ Σ_i ξi        (42)
subject to  yi (xi w + b) ≥ 1 − ξi        (43)
            ξi ≥ 0        (44)
54 / 78
Soft Maximum Margin Formulation
The constraints can be integrated into a single loss
function:
L = ||w|| + κ Σ_i max(0, 1 − yi (xi w + b))        (45)

The part max(0, 1 − yi (xi w + b)) is called the hinge loss,


and controls the constraints implicitly.

The coefficient κ works as a regularization coefficient,


where it controls the weight associated to the hinge loss,
and varying it controls the softness of the margin (How
many misclassifications are allowed).
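
A minimal scikit-learn sketch of a linear soft-margin SVM on synthetic data; sklearn's C parameter plays the role of κ above, and the data is made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Sketch: linear soft-margin SVM on two synthetic blobs (illustrative assumptions).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, -2.0], size=(50, 2)),
               rng.normal(loc=[2.0, 2.0], size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)          # smaller C gives a softer margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)           # hyperplane parameters w and b
print(clf.support_vectors_.shape)          # the support vectors found by the solver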

55 / 78
Soft Maximum Margin Formulation

56 / 78
Soft Maximum Margin Formulation
Effect of Varying κ (C in the pictures)

Figure: SVM fits for C = 1, 0.5, 0.1, 0.05, 0.01 and 0.005.

57 / 78
Soft Maximum Margin Formulation
Effect of Varying κ on Digits Dataset

58 / 78
SVM Concepts

Margin
It is the area between the two separating hyperplanes, and ideally it
should not contain any data points (hard margin). For a soft margin
it can contain data points, depending on the value of κ.

Support Vector
The points that lie in the border of the margin are called support
vectors, since they are the ones that define the geometry of the
margin and the values of the weights w. Data points beyond the
margin do not really contribute to training.

59 / 78
Training SVMs
Training SVMs is a bit different than other algorithms
that we have covered.
Hard Margin
This is a quadratic programming problem, and a quadratic solver
needs to be used. The loss is convex so there is always a unique
solution.

Soft Margin
The slack variable (ξ) formulation is also a quadratic problem,
trainable with a quadratic solver. The hinge loss formulation is
trainable using gradient descent.

60 / 78
Multi-Class SVMs
Current SVM formulations are only for binary
classification. They can be transformed into multi-class
classifiers by applying the two strategies we covered
before:

One vs One
Train 0.5C (C − 1) classifiers, one for each
pair of classes.
One vs All / Rest
Train C classifiers, one class versus the rest.
This is a good default option to use.

In practice, these two strategies are the standard way to
obtain multi-class SVM classifiers.
61 / 78
Support Vector Regression
The idea of an SVM can also be extended to regression
problems; this is called Support Vector Regression (SVR).

The formulation is to have a tube around the linear


regression line, where all points that are at distance ϵ
from the line receive no penalty (zero loss), and points
outside of this tube do receive a standard mean absolute
error loss.

The value of ϵ is a tunable hyper-parameter that trades


off acceptable errors. The formulation is:

Minimize ||w|| subject to |xi w + b − yi| ≤ ϵ ∀i ∈ [1, n]        (46)
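
A minimal scikit-learn sketch of linear SVR; the synthetic data and hyper-parameter values are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

# Sketch: linear support vector regression (illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 1.5 * X[:, 0] + 0.5 + rng.normal(scale=0.2, size=100)

svr = SVR(kernel="linear", epsilon=0.1, C=1.0)   # epsilon is the width of the tube
svr.fit(X, y)
print(svr.coef_, svr.intercept_)                 # roughly 1.5 and 0.5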
62 / 78
Support Vector Regression

63 / 78
Outline

1 Linear Regression

2 Logistic Regression

3 Support Vector Machines (SVMs)

4 Linear Discriminant Analysis (LDA)

64 / 78
Motivation

Linear Discriminant Analysis (LDA) takes a Bayesian
approach to classification.

LDA presents a different approach to classification which
may provide improvements over logistic regression.

The main advantage is the stability of the decision
boundary, which logistic regression does not offer.

65 / 78
Logistic regression models directly the predictive
distribution P(Y = 1|x).

LDA instead starts by modeling separately the probability


distribution for each category:

P(x|Y = c), c ∈ {0, . . . , C − 1}

66 / 78
Modeling assumptions

Specifically, LDA assumes that this probability is a


Gaussian:
P(x|Y = c) = N(µc , Σc ) (47)
What does this mean?

The observations inside each class all distribute according


to a Gaussian. Each Gaussian has a class-specific mean
µc and a class-specific variance-covariance matrix Σc .

67 / 78
Modeling assumptions - picture

Figure: Data from the red category are generated from
P(x|Y = red) = N( [4.0, 2.5]⊤, [[2.5, 0.0], [0.0, 1.5]] ).
Data from the blue category are generated from
P(x|Y = blue) = N( [−0.9, −0.2]⊤, [[1.2, 0.0], [0.0, 0.25]] ).

68 / 78
Estimating the Gaussian Mean

The parameters µc and Σc can be estimated from the


data itself:
µ̂c = ( Σ_{i=1}^{n} 1[yi = c] xi ) / ( Σ_{i=1}^{n} 1[yi = c] )        (48)

1[yi = c] is the indicator function:

1[yi = c] = { 1 if yi = c
            { 0 otherwise        (49)

69 / 78
Estimating the Gaussian Covariance

The estimation of the covariance matrices depends upon


specific assumptions:

The most common assumption is homoscedasticity: we


suppose that the variance across all categories is the
same:

Σ̂ = (n − C)⁻¹ Σ_c Σ_i 1[yi = c] (xi − µ̂c)⊤(xi − µ̂c)        (50)

70 / 78
LDA Concept

The concept used in LDA is a separating hyperplane w,
as in the previous classifiers we have covered, but here the
data is projected onto a one-dimensional line, and we look
for the line that provides the best separation.

This can be seen as a combination of dimensionality


reduction and classification.

Here we will cover Fisher’s criteria, which is a simple


version of LDA, but there are more versions.

71 / 78
LDA Class Separation

Classifier in terms of dimensionality reduction: projection
along a line!
72 / 78
LDA Creates New Axis for Projection
The new axis is created according to two simultaneous
criteria:
Means Maximize the distance between the means of
the classes.
Variances Minimize the variation within each class
(which LDA calls scatter and is represented
by s²)

73 / 78
Fisher’s Criteria or Discriminant
S = σ²_between / σ²_within = (µ1 − µ2)² / (s1² + s2²)        (51)

S is the Fisher discriminant, and it is a measure of how
discriminative the features/labels are, i.e. how well the
classes are separated.
74 / 78
LDA Model/Equations
Fisher’s Criteria can be used to derive the separating
hyperplane equation, with parameters w and b:

w T = C −1 (µ1 − µ2 ) (52)
b = −0.5(µ1 + µ2 )w T (53)
Where µ1 is the mean of the first class, and µ2 is the
second class mean, and C is the pooled covariance matrix
of both classes.

class 1 if w T x − b ≥ 0
f (x) = (54)
class 2 if w T x − b < 0
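
A minimal NumPy sketch of eqs. (52)-(54) on synthetic data; for illustration the two classes are sampled from the Gaussians of the earlier figure, even though LDA's homoscedasticity assumption would strictly call for equal covariances.

import numpy as np

# Sketch: two-class LDA following eqs. (52)-(54) (illustrative data).
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([4.0, 2.5], [[2.5, 0.0], [0.0, 1.5]], size=100)
X2 = rng.multivariate_normal([-0.9, -0.2], [[1.2, 0.0], [0.0, 0.25]], size=100)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled (within-class) covariance of both classes.
C = (np.cov(X1, rowvar=False) * (len(X1) - 1) +
     np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)

w = np.linalg.solve(C, mu1 - mu2)          # w = C^{-1} (mu1 - mu2)
b = -0.5 * (mu1 + mu2) @ w                 # b = -0.5 (mu1 + mu2)^T w

x = np.array([3.0, 2.0])                   # a new point to classify
print("class 1" if x @ w + b >= 0 else "class 2")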

75 / 78
Questions to Think About
1. What is the basic concept underpinning SVMs?
2. How do ML Classification methods relate to Linear
Separability?
3. Explain the concept of the Kernel Trick and its
relationship with Kernel Functions.
4. How to transform a binary classifier into a multi-class
one?
5. What are the main assumptions behind LDA?
6. How is it possible to interpret the coefficients of
linear regression?
7. What is the difference in the equations of linear and
logistic regression and what are its effect on the
characteristics of the two models?
76 / 78
Take Home Messages
• We covered a lot of methods; there are important
underpinning concepts like linear models, linear
separability, kernel tricks, etc.
• Linear methods are easy to understand and
implement, and generally have good performance.
• Kernel methods allow transforming linear methods
into non-linear ones with simple tricks.
• Non-linear methods can have better performance but
they are more difficult to understand and implement.
• In the end, performance depends on the features and
whether they are linearly separable (classification) or
fit a line (regression).
77 / 78
Questions?

78 / 78
