lecture3_supervised_learning_I
1 / 78
Today’s Lecture
2 / 78
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
3 / 78
Introduction
4 / 78
Representing the dataset
• The dataset is composed of N pairs: an observation (input)
and its response (output).
• The response is a vector of N scalars:
y = [y1 y2 . . . yN]⊤
6 / 78
Linear Regression Model
7 / 78
Linear Regression Model
From https://en.wikipedia.org/wiki/File:Linear_regression.svg
8 / 78
Matrix Formulation
The previous equations can be re-written in matrix
form. First we define the augmented input x̃ with an
additional 1 concatenated/appended at the end:
x̃ = [x1 . . . xP 1] = [x 1] (3)
And the augmented weight vector w̃ is defined by stacking
the bias b under the weights w:
w̃ = [w; b] (4)
Once these small transformations are done, we can
re-write the linear regression model as:
f(x) = x̃w̃ = [x 1][w; b] = x · w + b (5)
9 / 78
Closed Form Solution
10 / 78
Closed Form Solution
By stacking X and y for the whole dataset (with X including
the column of 1's), we can rewrite the loss as:
L = n⁻¹ Σi (xi w − yi)² (7)
  = n⁻¹ (y − Xw)⊤(y − Xw) (8)
11 / 78
Closed Form Solution
After some algebraic manipulations, this has a well-known
closed form solution (the normal equations):
w = (X⊤X)⁻¹ X⊤y
12 / 78
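A minimal NumPy sketch of this normal-equation solution on a synthetic dataset; the variable names and the data are illustrative, not from the slides:

import numpy as np

# Synthetic data: N = 100 samples, P = 3 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=100)   # bias b = 3.0 plus noise

# Append the column of 1's so the bias is absorbed into the weights (eqs. 3-5).
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: w_tilde = (X^T X)^{-1} X^T y.
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_tilde)   # approximately [1.5, -2.0, 0.5, 3.0]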
Closed Form Solution
13 / 78
Modeling Assumptions
Residuals
The actual model considers errors or residuals ϵ:
yi = xi w + ϵi = f(xi) + ϵi (11)
This is because the yi ’s can have measurement or other kinds of
errors. These residuals are zero only if the model fits the data
perfectly (zero loss).
Independence of Residuals
The residuals ϵi are assumed to be independent and not related to
each other or to the input variables. This means that the errors do
not depend on the input.
Linearity
In this model, the input features xi are treated as fixed values, and
the model is only linear with respect to its parameters. Input features
can be transformed to produce other features and the model stays
linear, since the training data is treated as a constant, and
optimization only happens in the parameters w.
15 / 78
Modeling Assumptions
Constant Variance
The variance of the residuals does not depend on the inputs or the
outputs of the model. This is called Homoscedasticity. The opposite
(Heteroscedasticity) is when the variance of the output or the errors
varies with the input or output of the model, for example, if larger
outputs have larger variance than smaller outputs.
16 / 78
Modeling Assumptions
18 / 78
Solution with Gradient Descent
The gradient of the loss with respect to the weights is:
∂L/∂w = 2n⁻¹ X⊤(Xw − y)
which produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent
(with learning rate η):
w ← w − η ∂L/∂w
19 / 78
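A short sketch of this update loop for the mean squared error loss, assuming X already carries the column of 1's; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

def fit_linear_gd(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on L = n^-1 ||y - Xw||^2 (X already has the 1's column)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = 2.0 / n * X.T @ (X @ w - y)   # (d+1) x 1 gradient from the slide
        w -= lr * grad                       # update step: w <- w - lr * grad
    return w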
Solution with Gradient Descent
[Figure: gradient descent iterates x0, x1, x2, x3, x4 converging towards the minimum (*)]
From https://en.wikipedia.org/wiki/Gradient_descent
20 / 78
Multi-variable Linear Regression
What if the labels y are not scalars, but vectors of
dimension K? We can still perform linear regression, but
now the model outputs a vector instead of a scalar. This
can be seen as performing K individual linear regression
problems:
f(x) = [x · w1 + b1, . . . , x · wK + bK] = xW + b (15)
Where now b is a K × 1 vector instead of a scalar, and W
is a d × K matrix instead of a vector.
[Figure: regression example with multiple outputs, Y versus X]
25 / 78
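A sketch of the multi-output case of eq. (15): the same normal-equation solve, with Y now an N × K matrix so that the solution is (d + 1) × K; shapes and data are made up for illustration:

import numpy as np

rng = np.random.default_rng(1)
N, d, K = 200, 4, 3
X = rng.normal(size=(N, d))
W_true = rng.normal(size=(d, K))
b_true = rng.normal(size=K)
Y = X @ W_true + b_true + 0.05 * rng.normal(size=(N, K))   # N x K targets

X_tilde = np.hstack([X, np.ones((N, 1))])
# One solve gives all K regression problems at once: W_tilde is (d+1) x K.
W_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
W_hat, b_hat = W_tilde[:-1], W_tilde[-1]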
Robust Linear Regression
Linear Regression is overall not robust to outliers. There
are many alternatives (a code sketch follows this slide):
• There are many robust linear regression methods,
generally making assumptions about the output y,
for example that it follows a Student's t-distribution
or another heavy-tailed distribution.
• The simple LR algorithm assumes that the output is
Gaussian distributed, making it not robust to outliers.
• RANSAC (Random Sample Consensus) can be used
to fit multiple LR models on random subsets of the
data and detect which data points are outliers.
• Exploratory data analysis can be used to identify and
remove outliers.
26 / 78
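As a sketch of these alternatives, assuming scikit-learn is available; the estimator choices, parameters and attribute names reflect recent versions and are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, HuberRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.3 * rng.normal(size=100)
y[:10] += 30.0                         # inject a few large outliers

ols = LinearRegression().fit(X, y)     # ordinary LR, pulled towards the outliers
ransac = RANSACRegressor().fit(X, y)   # fits on random consensus subsets
huber = HuberRegressor().fit(X, y)     # heavy-tailed (Huber) loss
print(ols.coef_, ransac.estimator_.coef_, huber.coef_)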
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
27 / 78
Classification Reminder
28 / 78
Classification Reminder
29 / 78
Probabilistic Classifiers
Most classifiers output a probability vector p of length C.
The integer class index c can be recovered by taking the
most probable class:
c = arg maxk pk
30 / 78
Logistic Regression Concepts
Logit
Logits are the input to the logistic/sigmoid function:
l = x·w+b (23)
ŷ = σ(l) (24)
Here l is a logit. Logits range over the real numbers (R), while the
output of the logistic/sigmoid function is in [0, 1]. The expanded range
is useful in some applications, for example if you want to do
regression on the logits.
35 / 78
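A tiny sketch of the logit/sigmoid relationship; the weights, bias and input here are made-up numbers:

import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))   # maps logits in R to probabilities in [0, 1]

x = np.array([0.5, -1.2, 2.0])
w = np.array([1.0, 0.3, -0.7])
b = 0.1
l = x @ w + b          # logit, any real number (eq. 23)
y_hat = sigmoid(l)     # probability in [0, 1]   (eq. 24)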
Training with Gradient Descent
Using the matrix representation, the gradient of the
cross-entropy loss has a well known closed form:
∂L/∂w = n⁻¹ X⊤(σ(Xw) − y) (25)
which produces a (d + 1) × 1 vector, and then the
parameters can be updated using gradient descent:
w ← w − η ∂L/∂w
One vs All
For each class, train a classifier for that class vs all other data points,
and at inference, make a prediction with all classifiers and select the
class with highest probability. Requires only C classifiers.
37 / 78
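A minimal sketch of a training loop based on eq. (25), assuming X already includes the column of 1's and y holds 0/1 labels; the learning rate and iteration count are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_iters=2000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n   # gradient of eq. (25)
        w -= lr * grad                          # gradient descent update
    return w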
One vs One
[Figure: One vs One classification example]
38 / 78
One vs Rest
[Figure: One vs Rest classification example]
39 / 78
Multinomial Logistic Regression
40 / 78
Softmax Function
The softmax function turns a vector of logits z into a
probability vector that sums to 1:
softmax(z)k = exp(zk) / Σj exp(zj)
41 / 78
Softmax Function (Example)
4-way classification problem.
The output of the model is the following logit vector:
z = [1.25, 0.32, 2.39, −3.01]⊤
After the application of softmax:
softmax(z) = [0.2205, 0.0870, 0.6894, 0.0031]⊤
42 / 78
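A quick NumPy check of this worked example (subtracting the maximum logit is a standard numerical-stability trick, not something required by the definition):

import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.25, 0.32, 2.39, -3.01])
print(np.round(softmax(logits), 4))   # [0.2205 0.087  0.6894 0.0031]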
Training Multinomial LR
This model is trained using the categorical cross-entropy
loss.
Categorical Cross-Entropy
For this loss, the labels yic should be one-hot encoded. Used for
multi-class classification problems, where the model predictions
ŷic are class probabilities that sum to 1.
L(y, ŷ) = − Σi Σc yic log(ŷic)
43 / 78
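A small sketch of the categorical cross-entropy with one-hot labels; the epsilon clipping is a common numerical safeguard rather than part of the definition:

import numpy as np

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1.0)              # avoid log(0)
    return -np.sum(y_onehot * np.log(y_prob))       # sum over samples i and classes c

y_onehot = np.array([[0, 0, 1, 0]])                  # true class is index 2
y_prob = np.array([[0.2205, 0.0870, 0.6894, 0.0031]])
print(categorical_cross_entropy(y_onehot, y_prob))   # -log(0.6894) ~ 0.372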
Multi-Label Logistic Classification
The multi-label setting is where you have multiple classes
but more than one class is possible at the same time.
This can also be modeled using logistic regression, where
the model is:
f (x) = σ(xW + b) (30)
The difference is the use of the logistic/sigmoid
activation for each class, where W is a d × C matrix and b is
a C × 1 vector.
This model is trained using the binary cross-entropy loss,
but applied for each class separately:
L(y, ŷ) = − Σc Σi [yic log(ŷic) + (1 − yic) log(1 − ŷic)]
This means the target vector is not one-hot encoded, but
set to 1 if the class is present, and to 0 otherwise.
44 / 78
Multi-Label Logistic Classification
ŷ = σ(xW + b) (31)
class(ŷc) = { Class c is present if ŷc ≥ Tc; Class c is absent if ŷc < Tc } (32)
where Tc is a per-class decision threshold.
45 / 78
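A sketch of multi-label prediction with a per-class sigmoid and threshold (eqs. 31-32); W, b, x and the threshold value are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, C = 5, 4
W = rng.normal(size=(d, C))          # d x C weight matrix
b = rng.normal(size=C)               # one bias per class
x = rng.normal(size=d)

y_hat = sigmoid(x @ W + b)           # independent probability per class (eq. 31)
T = 0.5                              # threshold, here the same for every class
present = y_hat >= T                 # boolean vector: which classes are present (eq. 32)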
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
46 / 78
Motivation
47 / 78
Motivation - Hyperplanes Might not be Unique
[Figure: three different hyperplanes H1, H2, H3 all separating the same data in the (X1, X2) plane]
49 / 78
Maximum Margin Formulation
A simple idea solves this issue: choose the separating
hyperplane that maximizes the margin, i.e. the distance to
the closest points of each class.
51 / 78
Hard Maximum Margin Formulation
The hard margin formulation applies when the data is
linearly separable, using two parallel hyperplanes, defined
by:
xi w + b ≥ +1 if yi = +1 (35)
xi w + b ≤ −1 if yi = −1 (36)
Both constraints can be written compactly as:
yi (xi w + b) ≥ 1 (37)
52 / 78
Hard Maximum Margin Formulation
53 / 78
Soft Maximum Margin Formulation
The issue with the hard margin is that we do not have
solutions if the data is not linearly separable.
The margin can be made to be soft, by relaxing the
constraints using positive slack variables ξi :
yi (xi w + b) ≥ 1 − ξi (40)
ξi ≥ 0 (41)
The idea is to minimize the total of these slack variables,
which leads to the following optimization problem.
X
minimize ||w|| + κ ξi (42)
i
subject to yi (wxi + b) ≥ 1 − ξi (43)
ξi ≥ 0 (44)
54 / 78
Soft Maximum Margin Formulation
The constraints can be integrated into a single loss
function:
L = ||w|| + κ Σi max(0, 1 − yi (xi w + b)) (45)
55 / 78
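A sketch of subgradient descent on a hinge-loss objective close to eq. (45); it uses the common 0.5||w||² regularizer instead of ||w||, and assumes labels in {−1, +1} and arbitrary step sizes:

import numpy as np

def fit_linear_svm_sgd(X, y, kappa=1.0, lr=0.01, n_iters=1000):
    """Subgradient descent on 0.5*||w||^2 + kappa * sum_i max(0, 1 - y_i (x_i w + b)).
    Labels y must be in {-1, +1}; the squared norm is the usual practical
    variant of the regularizer in eq. (45)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        viol = margins < 1                       # inside the margin or misclassified
        grad_w = w - kappa * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -kappa * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b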
Soft Maximum Margin Formulation
56 / 78
Soft Maximum Margin Formulation
Effect of Varying κ (C in the pictures)
57 / 78
Soft Maximum Margin Formulation
Effect of Varying κ on Digits Dataset
58 / 78
SVM Concepts
Margin
It is the area between the two separating hyperplanes, and ideally it
should not contain any data points (hard margin). For a soft margin
it can contain data points, depending on the value of κ.
Support Vector
The points that lie on the border of the margin are called support
vectors, since they are the ones that define the geometry of the
margin and the values of the weights w. Data points beyond the
margin do not really contribute to training.
59 / 78
Training SVMs
Training SVMs is a bit different from the other algorithms
that we have covered.
Hard Margin
This is a quadratic programming problem, and a quadratic
programming solver needs to be used. The problem is convex, so
when the data is linearly separable the solution is unique.
Soft Margin
The slack variable (ξ) formulation is also a quadratic problem,
trainable with a quadratic solver. The hinge loss formulation is
trainable using gradient descent.
60 / 78
Multi-Class SVMs
The SVM formulations we have seen are only for binary
classification. They can be transformed into multi-class
classifiers by applying the two strategies we covered
before (a code sketch follows this slide):
One vs One
Train C(C − 1)/2 classifiers, one for each
pair of classes.
One vs All / Rest
Train C classifiers, one class versus the rest.
This is a good default option to use.
63 / 78
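As a sketch, assuming scikit-learn is available, both strategies can be obtained with existing wrappers (SVC handles multi-class internally with one-vs-one); the dataset and parameters are illustrative:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # C(C-1)/2 binary classifiers
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # C binary classifiers
svc = SVC(kernel="linear", C=1.0).fit(X, y)                     # multi-class handled internally (one-vs-one)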
Outline
1 Linear Regression
2 Logistic Regression
3 Support Vector Machines
4 Linear Discriminant Analysis
64 / 78
Motivation
65 / 78
Logistic regression directly models the predictive
distribution P(Y = 1 | x).
66 / 78
Modeling assumptions
67 / 78
Modeling assumptions - picture
68 / 78
Estimating the Gaussian Mean
μ̂c = nc⁻¹ Σi 1[yi = c] xi
where nc is the number of samples of class c.
69 / 78
Estimating the Gaussian Covariance
Σ̂ = (n − C)⁻¹ Σc Σi 1[yi = c] (xi − μ̂c)⊤(xi − μ̂c) (50)
70 / 78
LDA Concept
71 / 78
LDA Class Separation
73 / 78
Fisher’s Criteria or Discriminant
S = σ²between / σ²within = (μ1 − μ2)² / (s1² + s2²) (51)
S is the Fisher discriminant, and it is a measure of how
discriminative the features/labels are, i.e. how well the
classes are separated.
74 / 78
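A tiny numerical sketch of eq. (51) for a single feature and two classes; the data are made up:

import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(loc=0.0, scale=1.0, size=500)   # feature values for class 1
x2 = rng.normal(loc=3.0, scale=1.0, size=500)   # feature values for class 2

S = (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())   # eq. (51)
print(S)   # roughly (0 - 3)^2 / (1 + 1) ~ 4.5 -> well separated classes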
LDA Model/Equations
Fisher’s Criteria can be used to derive the separating
hyperplane equation, with parameters w and b:
w⊤ = C⁻¹(μ1 − μ2) (52)
b = −0.5 (μ1 + μ2) w⊤ (53)
Where μ1 is the mean of the first class, μ2 is the mean of the
second class, and C is the pooled covariance matrix
of both classes.
f(x) = { class 1 if w⊤x + b ≥ 0; class 2 if w⊤x + b < 0 } (54)
75 / 78
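A sketch of eqs. (52)-(54) in NumPy, using the pooled covariance of eq. (50); the binary labels 1/2 and the shapes are assumed for illustration:

import numpy as np

def fit_lda_binary(X, y):
    """X is n x d, y contains labels 1 and 2. Returns (w, b) as in eqs. (52)-(53)."""
    X1, X2 = X[y == 1], X[y == 2]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance (eq. 50): class-centered scatter divided by n - C.
    n, C = X.shape[0], 2
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - C)
    w = np.linalg.solve(S, mu1 - mu2)        # eq. (52)
    b = -0.5 * (mu1 + mu2) @ w               # eq. (53)
    return w, b

def predict_lda(X, w, b):
    # Decision rule of eq. (54), with the sign convention matching b above.
    return np.where(X @ w + b >= 0, 1, 2)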
Questions to Think About
1. What is the basic concept underpinning SVMs?
2. How do ML Classification methods relate to Linear
Separability?
3. Explain the concept of the Kernel Trick and its
relationship with Kernel Functions.
4. How to transform a binary classifier into a multi-class
one?
5. What are the main assumptions behind LDA?
6. How is it possible to interpret the coefficients of
linear regression?
7. What is the difference in the equations of linear and
logistic regression, and what is its effect on the
characteristics of the two models?
76 / 78
Take Home Messages
• We covered a lot of methods; the important
underpinning concepts are linear models, linear
separability, the kernel trick, etc.
• Linear methods are easy to understand and
implement, and generally have good performance.
• Kernel methods allow transforming linear methods
into non-linear ones with simple tricks.
• Non-linear methods can have better performance, but
they are more difficult to understand and implement.
• In the end, performance depends on the features and
whether they are linearly separable (classification) or
follow a linear relationship (regression).
77 / 78
Questions?
78 / 78