lecture3_supervised_learning_I
1 / 78
Today’s Lecture
2 / 78
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
3 / 78
Introduction
4 / 78
Representing the dataset
• The dataset is composed of N pairs: an observation (input)
and its response (output).
• The response is a vector of N scalars:
y = [y1 y2 . . . yN]⊤
6 / 78
Linear Regression Model
7 / 78
Linear Regression Model
From https://en.wikipedia.org/wiki/File:Linear_regression.svg
8 / 78
Matrix Formulation
The previous equations can be re-written in matrix
form. First we define the augmented input x̃ with an
additional 1 concatenated/appended at the end:
x̃ = [x1 . . . xP 1] = [x 1] (3)
And the augmented weight vector w̃ is defined by stacking
the bias b under the weights w:
w̃ = [w; b] (4)
Once these small transformations are done, we can
re-write the linear regression model as:
f(x) = x̃w̃ = [x 1][w; b] = x · w + b (5)
9 / 78
Closed Form Solution
10 / 78
Closed Form Solution
By stacking X and y for the whole dataset (with X including
the column of 1's), we can rewrite the loss as:
L = n⁻¹ Σi (xi w − yi)² (7)
  = n⁻¹ (y − Xw)⊤(y − Xw) (8)
11 / 78
Closed Form Solution
After some algebraic manipulations, this has a well-known
closed form solution (the normal equations):
w = (X⊤X)⁻¹ X⊤y
12 / 78
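A minimal NumPy sketch of this normal-equation solution on a synthetic dataset; the variable names and the data are illustrative, not from the slides:

import numpy as np

# Synthetic data: N = 100 samples, P = 3 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=100)   # bias b = 3.0 plus noise

# Append the column of 1's so the bias is absorbed into the weights (eqs. 3-5).
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: w_tilde = (X^T X)^{-1} X^T y.
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_tilde)   # approximately [1.5, -2.0, 0.5, 3.0]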
Closed Form Solution
13 / 78
Modeling Assumptions
Residuals
The actual model considers errors or residuals ϵ:
yi = xi w + ϵi = f(xi) + ϵi (11)
This is because the yi ’s can have measurement or other kinds of
errors. These residuals are zero only if the model fits the data
perfectly (zero loss).
Independence of Residuals
The residuals ϵi are assumed to be independent and not related to
each other or to the input variables. This means that the errors do
not depend on the input.
Linearity
In this model, the input features xi are treated as fixed values, and
the model is only linear with respect to its parameters. Input features
can be transformed to produce other features and the model stays
linear, since the training data is treated as a constant, and
optimization only happens in the parameters w.
15 / 78
Modeling Assumptions
Constant Variance
The variance of the residuals does not depend on the inputs or the
outputs of the model. This is called Homoscedasticity. The opposite
(Heteroscedasticity) is when the variance of the output or the errors
varies with the input or output of the model, for example, if larger
outputs have larger variance than smaller outputs.
16 / 78
Modeling Assumptions
18 / 78
Solution with Gradient Descent
The gradient of the loss with respect to the weights is:
∂L/∂w = 2n⁻¹ X⊤(Xw − y)
which produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent
(with learning rate η):
w ← w − η ∂L/∂w
19 / 78
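A short sketch of this update loop for the mean squared error loss, assuming X already carries the column of 1's; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

def fit_linear_gd(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on L = n^-1 ||y - Xw||^2 (X already has the 1's column)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = 2.0 / n * X.T @ (X @ w - y)   # (d+1) x 1 gradient from the slide
        w -= lr * grad                       # update step: w <- w - lr * grad
    return w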
Solution with Gradient Descent
[Figure: gradient descent iterates x0, x1, x2, x3, x4 converging towards the minimum (*)]
From https://en.wikipedia.org/wiki/Gradient_descent
20 / 78
Multi-variable Linear Regression
What if the labels y are not scalars, but vectors of
dimension K? We can still perform linear regression, but
now the model outputs a vector instead of a scalar. This
can be seen as performing K individual linear regression
problems:
f(x) = [x · w1 + b1, . . . , x · wK + bK] = xW + b (15)
Where now b is a K × 1 vector instead of a scalar, and W
is a d × K matrix instead of a vector.
[Figure: regression example with multiple outputs, Y versus X]
25 / 78
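A sketch of the multi-output case of eq. (15): the same normal-equation solve, with Y now an N × K matrix so that the solution is (d + 1) × K; shapes and data are made up for illustration:

import numpy as np

rng = np.random.default_rng(1)
N, d, K = 200, 4, 3
X = rng.normal(size=(N, d))
W_true = rng.normal(size=(d, K))
b_true = rng.normal(size=K)
Y = X @ W_true + b_true + 0.05 * rng.normal(size=(N, K))   # N x K targets

X_tilde = np.hstack([X, np.ones((N, 1))])
# One solve gives all K regression problems at once: W_tilde is (d+1) x K.
W_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
W_hat, b_hat = W_tilde[:-1], W_tilde[-1]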
Robust Linear Regression
Linear Regression is overall not robust to outliers. There
are many alternatives (a code sketch follows this slide):
• There are many robust linear regression methods,
generally making assumptions about the output y,
for example that it follows a Student's t-distribution
or another heavy-tailed distribution.
• The simple LR algorithm assumes that the output is
Gaussian distributed, making it not robust to outliers.
• RANSAC (Random Sample Consensus) can be used
to fit multiple LR models on random subsets of the
data and detect which data points are outliers.
• Exploratory data analysis can be used to identify and
remove outliers.
26 / 78
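As a sketch of these alternatives, assuming scikit-learn is available; the estimator choices, parameters and attribute names reflect recent versions and are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, HuberRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.3 * rng.normal(size=100)
y[:10] += 30.0                         # inject a few large outliers

ols = LinearRegression().fit(X, y)     # ordinary LR, pulled towards the outliers
ransac = RANSACRegressor().fit(X, y)   # fits on random consensus subsets
huber = HuberRegressor().fit(X, y)     # heavy-tailed (Huber) loss
print(ols.coef_, ransac.estimator_.coef_, huber.coef_)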
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
27 / 78
Classification Reminder
28 / 78
Classification Reminder
29 / 78
Probabilistic Classifiers
Most classifiers output a probability vector p of length C.
The integer class index c can be recovered by taking the
most probable class:
c = arg maxk pk
30 / 78
Logistic Regression Concepts
Logit
Logits are the input to the logistic/sigmoid function:
l = x·w+b (23)
ŷ = σ(l) (24)
Here l is a logit. Logits range over the real numbers (R), while the
output of the logistic/sigmoid function is in [0, 1]. The expanded range
is useful in some applications, for example if you want to do
regression on the logits.
35 / 78
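A tiny sketch of the logit/sigmoid relationship; the weights, bias and input here are made-up numbers:

import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))   # maps logits in R to probabilities in [0, 1]

x = np.array([0.5, -1.2, 2.0])
w = np.array([1.0, 0.3, -0.7])
b = 0.1
l = x @ w + b          # logit, any real number (eq. 23)
y_hat = sigmoid(l)     # probability in [0, 1]   (eq. 24)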
Training with Gradient Descent
Using the matrix representation, the gradient of the
cross-entropy loss has a well known closed form:
∂L/∂w = n⁻¹ X⊤(σ(Xw) − y) (25)
which produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent:
w ← w − η ∂L/∂w
One vs All
For each class, train a classifier for that class vs all other data points,
and at inference, make a prediction with all classifiers and select the
class with highest probability. Requires only C classifiers.
37 / 78
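A minimal sketch of a training loop based on eq. (25), assuming X already includes the column of 1's and y holds 0/1 labels; the learning rate and iteration count are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_iters=2000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n   # gradient of eq. (25)
        w -= lr * grad                          # gradient descent update
    return w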
One vs One
[Figure: One vs One classification example]
38 / 78
One vs Rest
[Figure: One vs Rest classification example]
39 / 78
Multinomial Logistic Regression
40 / 78
Softmax Function
The softmax function turns a vector of logits z into a
probability vector that sums to 1:
softmax(z)k = exp(zk) / Σj exp(zj)
41 / 78
Softmax Function (Example)
4-way classification problem.
The output of the model is the following logit vector:
z = [1.25, 0.32, 2.39, −3.01]⊤
After the application of softmax:
softmax(z) = [0.2205, 0.0870, 0.6894, 0.0031]⊤
42 / 78
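A quick NumPy check of this worked example (subtracting the maximum logit is a standard numerical-stability trick, not something required by the definition):

import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.25, 0.32, 2.39, -3.01])
print(np.round(softmax(logits), 4))   # [0.2205 0.087  0.6894 0.0031]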
Training Multinomial LR
This model is trained using the categorical cross-entropy
loss.
Categorical Cross-Entropy
For this loss, the labels yic should be one-hot encoded. Used for
multi-class classification problems, where the model predictions
ŷic are class probabilities that sum to 1.
L(y, ŷ) = − Σi Σc yic log(ŷic)
43 / 78
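A small sketch of the categorical cross-entropy with one-hot labels; the epsilon clipping is a common numerical safeguard rather than part of the definition:

import numpy as np

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1.0)              # avoid log(0)
    return -np.sum(y_onehot * np.log(y_prob))       # sum over samples i and classes c

y_onehot = np.array([[0, 0, 1, 0]])                  # true class is index 2
y_prob = np.array([[0.2205, 0.0870, 0.6894, 0.0031]])
print(categorical_cross_entropy(y_onehot, y_prob))   # -log(0.6894) ~ 0.372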
Multi-Label Logistic Classification
The multi-label setting is where you have multiple classes
but more than one class is possible at the same time.
This can also be modeled using logistic regression, where
the model is:
f (x) = σ(xW + b) (30)
The difference is the use of the logistic/sigmoid
activation for each class, where W is a d × C matrix and b is
a C × 1 vector.
This model is trained using the binary cross-entropy loss,
but applied for each class separately:
L(y, ŷ) = − Σc Σi [yic log(ŷic) + (1 − yic) log(1 − ŷic)]
This means the target vector is not one-hot encoded, but
set to 1 if the class is present, and to 0 otherwise.
44 / 78
Multi-Label Logistic Classification
ŷ = σ(xW + b) (31)
class(ŷc) = { Class c is present if ŷc ≥ Tc; Class c is absent if ŷc < Tc } (32)
where Tc is a per-class decision threshold.
45 / 78
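A sketch of multi-label prediction with a per-class sigmoid and threshold (eqs. 31-32); W, b, x and the threshold value are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, C = 5, 4
W = rng.normal(size=(d, C))          # d x C weight matrix
b = rng.normal(size=C)               # one bias per class
x = rng.normal(size=d)

y_hat = sigmoid(x @ W + b)           # independent probability per class (eq. 31)
T = 0.5                              # threshold, here the same for every class
present = y_hat >= T                 # boolean vector: which classes are present (eq. 32)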
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
46 / 78
Motivation
47 / 78
Motivation - Hyperplanes Might not be Unique
[Figure: three different hyperplanes H1, H2, H3 all separating the same data in the (X1, X2) plane]
49 / 78
Maximum Margin Formulation
A simple idea solves this issue: choose the separating
hyperplane that maximizes the margin, i.e. the distance to
the closest points of each class.
51 / 78
Hard Maximum Margin Formulation
The hard margin formulation applies when the data is
linearly separable, using two parallel hyperplanes, defined
by:
xi w + b ≥ +1 if yi = +1 (35)
xi w + b ≤ −1 if yi = −1 (36)
Both constraints can be written compactly as:
yi (xi w + b) ≥ 1 (37)
52 / 78
Hard Maximum Margin Formulation
53 / 78
Soft Maximum Margin Formulation
The issue with the hard margin is that we do not have
solutions if the data is not linearly separable.
The margin can be made to be soft, by relaxing the
constraints using positive slack variables ξi :
yi (xi w + b) ≥ 1 − ξi (40)
ξi ≥ 0 (41)
The idea is to minimize the total of these slack variables,
which leads to the following optimization problem.
X
minimize ||w|| + κ ξi (42)
i
subject to yi (wxi + b) ≥ 1 − ξi (43)
ξi ≥ 0 (44)
54 / 78
Soft Maximum Margin Formulation
The constraints can be integrated into a single loss
function:
L = ||w|| + κ Σi max(0, 1 − yi (xi w + b)) (45)
55 / 78
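A sketch of subgradient descent on a hinge-loss objective close to eq. (45); it uses the common 0.5||w||² regularizer instead of ||w||, and assumes labels in {−1, +1} and arbitrary step sizes:

import numpy as np

def fit_linear_svm_sgd(X, y, kappa=1.0, lr=0.01, n_iters=1000):
    """Subgradient descent on 0.5*||w||^2 + kappa * sum_i max(0, 1 - y_i (x_i w + b)).
    Labels y must be in {-1, +1}; the squared norm is the usual practical
    variant of the regularizer in eq. (45)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        viol = margins < 1                       # inside the margin or misclassified
        grad_w = w - kappa * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -kappa * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b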
Soft Maximum Margin Formulation
56 / 78
Soft Maximum Margin Formulation
Effect of Varying κ (C in the pictures)
57 / 78
Soft Maximum Margin Formulation
Effect of Varying κ on Digits Dataset
58 / 78
SVM Concepts
Margin
It is the area between the two separating hyperplanes, and ideally it
should not contain any data points (hard margin). For a soft margin
it can contain data points, depending on the value of κ.
Support Vector
The points that lie on the border of the margin are called support
vectors, since they are the ones that define the geometry of the
margin and the values of the weights w. Data points beyond the
margin do not really contribute to training.
59 / 78
Training SVMs
Training SVMs is a bit different from the other algorithms
that we have covered.
Hard Margin
This is a quadratic programming problem, and a quadratic
programming solver needs to be used. The problem is convex, so
when the data is linearly separable the solution is unique.
Soft Margin
The slack variable (ξ) formulation is also a quadratic problem,
trainable with a quadratic solver. The hinge loss formulation is
trainable using gradient descent.
60 / 78
Multi-Class SVMs
The SVM formulations we have seen are only for binary
classification. They can be transformed into multi-class
classifiers by applying the two strategies we covered
before (a code sketch follows this slide):
One vs One
Train C(C − 1)/2 classifiers, one for each
pair of classes.
One vs All / Rest
Train C classifiers, one class versus the rest.
This is a good default option to use.
63 / 78
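As a sketch, assuming scikit-learn is available, both strategies can be obtained with existing wrappers (SVC handles multi-class internally with one-vs-one); the dataset and parameters are illustrative:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # C(C-1)/2 binary classifiers
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # C binary classifiers
svc = SVC(kernel="linear", C=1.0).fit(X, y)                     # multi-class handled internally (one-vs-one)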
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
64 / 78
Motivation
65 / 78
Logistic regression directly models the predictive
distribution P(Y = 1 | x).
66 / 78
Modeling assumptions
67 / 78
Modeling assumptions - picture
68 / 78
Estimating the Gaussian Mean
μ̂c = nc⁻¹ Σi 1[yi = c] xi
where nc is the number of samples of class c.
69 / 78
Estimating the Gaussian Covariance
Σ̂ = (n − C)⁻¹ Σc Σi 1[yi = c] (xi − μ̂c)⊤(xi − μ̂c) (50)
70 / 78
LDA Concept
71 / 78
LDA Class Separation
73 / 78
Fisher’s Criteria or Discriminant
S = σ²between / σ²within = (μ1 − μ2)² / (s1² + s2²) (51)
S is the Fisher discriminant, and it is a measure of how
discriminative the features/labels are, i.e. how well the
classes are separated.
74 / 78
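A tiny numerical sketch of eq. (51) for a single feature and two classes; the data are made up:

import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(loc=0.0, scale=1.0, size=500)   # feature values for class 1
x2 = rng.normal(loc=3.0, scale=1.0, size=500)   # feature values for class 2

S = (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())   # eq. (51)
print(S)   # roughly (0 - 3)^2 / (1 + 1) ~ 4.5 -> well separated classes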
LDA Model/Equations
Fisher’s Criteria can be used to derive the separating
hyperplane equation, with parameters w and b:
w⊤ = C⁻¹(μ1 − μ2) (52)
b = −0.5 (μ1 + μ2) w⊤ (53)
Where μ1 is the mean of the first class, μ2 is the mean of the
second class, and C is the pooled covariance matrix
of both classes.
f(x) = { class 1 if w⊤x + b ≥ 0; class 2 if w⊤x + b < 0 } (54)
75 / 78
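A sketch of eqs. (52)-(54) in NumPy, using the pooled covariance of eq. (50); the binary labels 1/2 and the shapes are assumed for illustration:

import numpy as np

def fit_lda_binary(X, y):
    """X is n x d, y contains labels 1 and 2. Returns (w, b) as in eqs. (52)-(53)."""
    X1, X2 = X[y == 1], X[y == 2]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance (eq. 50): class-centered scatter divided by n - C.
    n, C = X.shape[0], 2
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - C)
    w = np.linalg.solve(S, mu1 - mu2)        # eq. (52)
    b = -0.5 * (mu1 + mu2) @ w               # eq. (53)
    return w, b

def predict_lda(X, w, b):
    # Decision rule of eq. (54), with the sign convention matching b above.
    return np.where(X @ w + b >= 0, 1, 2)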
Questions to Think About
1. What is the basic concept underpinning SVMs?
2. How do ML Classification methods relate to Linear
Separability?
3. Explain the concept of the Kernel Trick and its
relationship with Kernel Functions.
4. How to transform a binary classifier into a multi-class
one?
5. What are the main assumptions behind LDA?
6. How is it possible to interpret the coefficients of
linear regression?
7. What is the difference in the equations of linear and
logistic regression, and what is its effect on the
characteristics of the two models?
76 / 78
Take Home Messages
• We covered a lot of methods; the important
underpinning concepts are linear models, linear
separability, the kernel trick, etc.
• Linear methods are easy to understand and
implement, and generally have good performance.
• Kernel methods allow transforming linear methods
into non-linear ones with simple tricks.
• Non-linear methods can have better performance, but
they are more difficult to understand and implement.
• In the end, performance depends on the features and
whether they are linearly separable (classification) or
follow a linear relationship (regression).
77 / 78
Questions?
78 / 78