AI701: Foundations of Artificial Intelligence
Week 2
Agenda
• Introduction to linear regression
• Logistic regression fundamentals
• Gradient descent (batch/stochastic)
• Underfitting and overfitting
• Introduction to regularization
Linear Regression
The discovery of Ceres
1801: Astronomer Piazzi discovered Ceres
Made 19 observations of location before it was obscured by the sun
Time                 Right ascension   Declination
Jan 01, 20:43:17.8   50.91             15.24
Jan 02, 20:39:04.6   50.84             15.30
...                  ...               ...
Feb 11, 18:11:58.2   53.51             18.43
Where and when will it be observed again?
Gauss's triumph
September 1801: Gauss took Piazzi's data and created a model of Ceres's orbit
Made a prediction of where Ceres would reappear
December 7, 1801: Ceres located within 1/2 degree of Gauss's prediction, much more accurate than other astronomers' predictions
Method: Least squares linear regression
Linear regression framework
Design decisions:
Which predictors are possible? hypothesis class
How good is a predictor? Loss function
How do we compute the best predictor? Optimization algorithm
Hypothesis class: which predictors?
[Plot: training points with two candidate predictors, f(x) = 1 + 0.57x and f(x) = 2 + 0.2x]
General form: f(x) = w1 + w2·x
Vector notation:
weight vector w = [w1, w2]
feature extractor φ(x) = [1, x]  (feature vector)
score f_w(x) = w · φ(x)
f_w(3) = [1, 0.57] · [1, 3] = 2.71
Hypothesis class:
F = {f_w : w ∈ R²}
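To make the vector notation concrete, here is a minimal Python sketch (the helper names phi and predict are illustrative, not from the lecture) that evaluates the score f_w(x) = w · φ(x):

def phi(x):
    # Feature extractor: phi(x) = [1, x]
    return [1, x]

def predict(w, x):
    # Score f_w(x) = w · phi(x)
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

w = [1, 0.57]
print(round(predict(w, 3), 2))   # 1 + 0.57 * 3 = 2.71, matching the slide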
Loss function: how good is a predictor?
training data Dtrain:
x   y
1   1
2   3
4   3

f_w(x) = w · φ(x)
w = [1, 0.57]
φ(x) = [1, x]
[Plot: the three training points and the line f_w(x); the vertical gap between a point and the line is its residual f_w(x) − y]
Loss(x, y, w) = (f_w(x) − y)²   (squared loss)
Loss(1, 1, [1, 0.57]) = ([1, 0.57] · [1, 1] − 1)² = 0.32
Loss(2, 3, [1, 0.57]) = ([1, 0.57] · [1, 2] − 3)² = 0.74
Loss(4, 3, [1, 0.57]) = ([1, 0.57] · [1, 4] − 3)² = 0.08
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)
TrainLoss([1, 0.57]) = 0.38
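A small sketch of the squared loss and the average training loss for this example, reusing the phi and predict helpers from the previous sketch (names are illustrative):

# Squared loss and average training loss for the slide's example.
D_train = [(1, 1), (2, 3), (4, 3)]
w = [1, 0.57]

def squared_loss(x, y, w):
    return (predict(w, x) - y) ** 2

train_loss = sum(squared_loss(x, y, w) for x, y in D_train) / len(D_train)
print(round(train_loss, 2))  # 0.38, matching TrainLoss([1, 0.57]) above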
Visualizing Loss Function
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} (f_w(x) − y)²
Objective: min_w TrainLoss(w)
[Plot: TrainLoss(w) as a surface over the weight space (w1, w2)]
Gradient Descent
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)
Definition: gradient
The gradient ∇w TrainLoss(w) is the direction that increases the
training loss the most.
Computing the gradient
Objective function:
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} (w · φ(x) − y)²
Gradient (use the chain rule):
∇_w TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} 2(w · φ(x) − y) φ(x)
Gradient descent example
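As a rough illustration, here is a sketch that runs batch gradient descent on the three-point dataset above, using the chain-rule gradient 2(w · φ(x) − y)φ(x); the step size, iteration count, and helper names are illustrative choices, not values from the lecture.

# Batch gradient descent on the squared loss:
# grad Loss(x, y, w) = 2 * (w · phi(x) - y) * phi(x)
def phi(x):
    return [1.0, x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

D_train = [(1, 1), (2, 3), (4, 3)]
w = [0.0, 0.0]
eta, T = 0.05, 500          # illustrative step size and iteration count

for t in range(T):
    # Average gradient over the whole training set.
    grad = [0.0, 0.0]
    for x, y in D_train:
        err = dot(w, phi(x)) - y
        for j, fj in enumerate(phi(x)):
            grad[j] += 2 * err * fj / len(D_train)
    w = [wj - eta * gj for wj, gj in zip(w, grad)]

print([round(wj, 2) for wj in w])  # approximately [1.0, 0.57], the least-squares fit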
Linear Classification
Linear classification framework
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

Each example pairs an input x = [x1, x2] with a label y ∈ {+1, −1}.
Dtrain → learning algorithm → classifier f
[Plot: the three training points in the (x1, x2) plane with a decision boundary]
Design decisions:
Which classifiers are possible? hypothesis class
How good is a classifier? loss function
How do we compute the best classifier? optimization algorithm
An example linear classifier
[Plot: decision boundary of the classifier f(x) = sign([−0.6, 0.6] · φ(x)) in the (x1, x2) plane]
f ([0, 2]) = sign([−0.6, 0.6] ·[0, 2]) = sign(1.2) = 1
f ([−2, 0]) = sign([−0.6, 0.6] ·[−2, 0]) = sign(1.2) = 1
f ([1, −1]) = sign([−0.6, 0.6] ·[1, −1]) = sign(−1.2) = −1
Decision boundary: x such that w ·φ(x) = 0
Hypothesis class: which classifiers?
φ(x) = [x1, x2]
[Plot: two example decision boundaries, f(x) = sign([−0.6, 0.6] · φ(x)) and f(x) = sign([0.5, 1] · φ(x))]
General binary classifier:
f_w(x) = sign(w · φ(x))
Hypothesis class:
F = {f_w : w ∈ R²}
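A minimal sketch of the general binary classifier f_w(x) = sign(w · φ(x)), reproducing the three predictions above (helper names are illustrative):

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def classify(w, phi_x):
    # f_w(x) = sign(w · phi(x))
    return sign(dot(w, phi_x))

w = [-0.6, 0.6]
print(classify(w, [0, 2]))    # 1
print(classify(w, [-2, 0]))   # 1
print(classify(w, [1, -1]))   # -1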
Loss function: how good is a classifier?
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

f_w(x) = sign(w · φ(x))
w = [0.5, 1]
φ(x) = [x1, x2]
[Plot: the training points and the decision boundary of w = [0.5, 1]]
Loss0-1(x, y, w) = 1[f_w(x) ≠ y]   (zero-one loss)
Loss([0, 2], 1, [0.5, 1]) = 1[sign([0.5, 1] · [0, 2]) ≠ 1] = 0
Loss([−2, 0], 1, [0.5, 1]) = 1[sign([0.5, 1] · [−2, 0]) ≠ 1] = 1
Loss([1, −1], −1, [0.5, 1]) = 1[sign([0.5, 1] · [1, −1]) ≠ −1] = 0
TrainLoss([0.5, 1]) = 0.33
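A sketch of the zero-one loss averaged over this training set, reusing the classify helper from the previous sketch (names are illustrative):

# Zero-one loss and its average over the training data for w = [0.5, 1].
D_train = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]
w = [0.5, 1]

def zero_one_loss(phi_x, y, w):
    return 1 if classify(w, phi_x) != y else 0

train_loss = sum(zero_one_loss(phi_x, y, w) for phi_x, y in D_train) / len(D_train)
print(round(train_loss, 2))  # 0.33: only [-2, 0] is misclassified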
Score and margin
Predicted label: f_w(x) = sign(w · φ(x))
Target label: y
[Plot: an example point in the (x1, x2) plane together with the decision boundary]
Definition: score
The score on an example (x, y) is w·φ(x), how confident we are in
predicting +1.
Definition: margin
The margin on an example (x, y) is (w ·φ(x))y, how correct we are.
Zero-one loss rewritten
Loss0-1(x, y, w) = 1[(w · φ(x))y ≤ 0]
[Plot: the zero-one loss as a function of the margin (w · φ(x))y; it is 1 for margin ≤ 0 and 0 otherwise]
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)
To run gradient descent, compute the gradient:
∇_w TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} ∇_w Loss0-1(x, y, w)
∇_w Loss0-1(x, y, w) = ∇_w 1[(w · φ(x))y ≤ 0]
Gradient is zero almost everywhere!
[Plot: the zero-one loss vs. the margin (w · φ(x))y; the loss is flat except for the jump at margin 0]
Hinge loss
[Plot: Loss0-1 and Losshinge as functions of the margin (w · φ(x))y]
Losshinge(x, y, w) = max{1 − (w ·φ(x))y, 0}
Another: logistic regression
Losslogistic(x, y, w) = log(1 + e^(−(w·φ(x))y))
[Plot: Loss0-1, Losshinge, and Losslogistic as functions of the margin (w · φ(x))y]
Intuition: the logistic loss tries to increase the margin even when it already exceeds 1.
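A small sketch comparing the hinge and logistic losses as functions of the margin; the function names are illustrative and the printed values are easy to check by hand:

import math

def hinge_loss(margin):
    return max(1 - margin, 0)

def logistic_loss(margin):
    return math.log(1 + math.exp(-margin))

for m in [-2, -1, 0, 1, 2]:
    print(m, hinge_loss(m), round(logistic_loss(m), 3))
# Hinge is exactly zero once the margin reaches 1; the logistic loss keeps
# decreasing (and so keeps pushing the margin up) but never hits zero.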
Gradient of the hinge loss
∇_w Losshinge(x, y, w) = −φ(x)y if (w · φ(x))y < 1, and 0 otherwise
[Plot: the hinge loss vs. the margin (w · φ(x))y; its slope is −1 to the left of margin 1 and 0 to the right]
Hinge loss on training data
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

f_w(x) = sign(w · φ(x))
w = [0.5, 1]
φ(x) = [x1, x2]
[Plot: the training points and the decision boundary of w = [0.5, 1]]
Losshinge(x, y, w) = max{1 − (w ·φ(x))y, 0}
Loss([0, 2], 1, [0.5, 1]) = max{1 − [0.5, 1] ·[0, 2](1), 0} = 0 ∇Loss([0, 2], 1, [0.5, 1]) = [0, 0]
Loss([−2, 0], 1, [0.5, 1]) = max{1 − [0.5, 1] ·[−2, 0](1), 0} = 2 ∇Loss([−2, 0], 1, [0.5, 1]) = [2, 0]
Loss([1, −1], −1, [0.5, 1]) = max{1 − [0.5, 1] ·[1, −1](−1), 0} = 0.5 ∇Loss([1, −1], −1, [0.5, 1]) = [1, −1]
TrainLoss([0.5, 1]) = 0.83 ∇TrainLoss([0.5, 1]) = [1, −0.33]
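A sketch that reproduces the hinge-loss computation above, including the (sub)gradient; the names dot and hinge_loss_and_grad are illustrative:

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

D_train = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]
w = [0.5, 1]

def hinge_loss_and_grad(phi_x, y, w):
    # Loss = max{1 - margin, 0}; gradient is -phi(x)*y when the margin is below 1.
    margin = dot(w, phi_x) * y
    if margin < 1:
        return 1 - margin, [-fj * y for fj in phi_x]
    return 0.0, [0.0 for _ in phi_x]

total_loss, total_grad = 0.0, [0.0, 0.0]
for phi_x, y in D_train:
    loss, grad = hinge_loss_and_grad(phi_x, y, w)
    total_loss += loss / len(D_train)
    total_grad = [g + gi / len(D_train) for g, gi in zip(total_grad, grad)]

print(round(total_loss, 2), [round(g, 2) for g in total_grad])  # 0.83 [1.0, -0.33]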
Regression vs Classification
                      Regression               Classification
Prediction f_w(x)     score                    sign(score)
Relate to target y    residual (score − y)     margin (score · y)
Loss functions        squared,                 zero-one, hinge,
                      absolute deviation       logistic
Algorithm             gradient descent         gradient descent
Stochastic Gradient Descent
Gradient descent is slow
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Problem: each iteration requires going over all the training examples, which is expensive when we have lots of data!
Stochastic gradient descent
Algorithm: stochastic gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
For (x, y) ∈ Dtrain:
w ← w − η∇wLoss(x, y, w)
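A sketch contrasting the two update rules; grad_train_loss (the gradient averaged over Dtrain) and grad_loss (the gradient on a single example) are assumed to be supplied by the caller, and eta and T are illustrative defaults:

def gradient_descent(grad_train_loss, d, eta=0.1, T=100):
    w = [0.0] * d
    for t in range(T):
        g = grad_train_loss(w)              # one pass over ALL examples per update
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

def stochastic_gradient_descent(grad_loss, D_train, d, eta=0.1, T=100):
    w = [0.0] * d
    for t in range(T):
        for x, y in D_train:                # one cheap update PER example
            g = grad_loss(x, y, w)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w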
Step size
Question: what should η be?
[Diagram: the step-size spectrum for η, from small (conservative, more stable) to large (aggressive, faster)]
Strategies:
• Constant: η = 0.1
• Decreasing: e.g. η = 1/√(number of updates made so far)
GD vs SGD
[Side-by-side plots: the optimization paths of gradient descent and stochastic gradient descent]
Key idea: stochastic updates
It’s not about quality, it’s about quantity.
Overfitting and Regularization
Minimizing the training loss
Hypothesis class:
f_w(x) = w · φ(x)
Training objective (loss function):
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)
Optimization algorithm:
stochastic gradient descent
Is the training loss a good objective to optimize?
Rote Learning
Algorithm: rote learning
Training: just store Dtrain .
Predictor f (x):
If (x, y) ∈ Dtrain : return y.
Else: segfault.
Minimizes the objective perfectly (zero), but clearly bad...
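A minimal Python sketch of the rote learner (function names are illustrative); it achieves zero training loss but cannot handle anything outside Dtrain:

def rote_train(D_train):
    # "Training" just stores the examples, keyed by input x.
    return dict(D_train)

def rote_predict(memory, x):
    if x in memory:
        return memory[x]              # training loss is exactly zero...
    raise RuntimeError("segfault")    # ...but any unseen input fails

memory = rote_train([(1, 1), (2, 3), (4, 3)])
print(rote_predict(memory, 2))        # 3
# rote_predict(memory, 3)             # would raise: no generalization at all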
Overfitting scenarios
[Figures: examples of overfitting in classification and in regression]
Overfitting – Possible reasons
• Too few training examples
• Noise in the data
• The hypothesis space is too large
• The input space is high-dimensional
Overfitting vs. model complexity
• We talk of overfitting when decreasing E_in leads to increasing E_out
• Overfitting is a major source of failure for machine learning systems
• Overfitting leads to bad generalization
• A model can exhibit bad generalization even if it does not overfit
[Plot: error vs. model complexity; the in-sample error E_in keeps decreasing as complexity grows (from high bias/low variance to low bias/high variance), while the out-of-sample error E_out eventually rises; that region is overfitting]
Evaluation
Dtrain → learning algorithm → f
How good is the predictor f ?
Key idea: the real learning objective
Our goal is to minimize error on unseen future examples.
Don’t have unseen examples; next best thing:
Definition: test set
Test set Dtest contains examples not used for training.
Generalization
When will a learning algorithm generalize well?
[Diagram: the training set Dtrain alongside the test set Dtest]
Approximation & estimation error
[Diagram: the set of all predictors, with the hypothesis class F inside it; f* is the best predictor overall, g the best predictor in F, and f̂ the predictor found by learning. The gap from f* to g is the approximation error; the gap from g to f̂ is the estimation error.]
• Approximation error: how good is the hypothesis class?
• Estimation error: how good is the learned predictor relative to the potential of the
hypothesis class?
Effect of hypothesis class
[Diagram: same picture as above (all predictors, hypothesis class F, f*, g, f̂, approximation and estimation error)]
As the hypothesis class size increases...
Approximation error decreases because:
taking min over larger set
Estimation error increases because:
harder to estimate something more complex
How do we control the hypothesis class size?
Cure 1: Dimensionality
w ∈ Rd
Reduce the dimensionality d (number of features):
Controlling the dimensionality
Manual feature (template) selection:
• Add feature templates if they help
• Remove feature templates if they don't help
Automatic feature selection (beyond the scope of this class):
• Forward selection
• Boosting
• L1 regularization
It’s the number of features that matters
Cure 2: Norm
Controlling the Norm
Regularized objective:
min_w TrainLoss(w) + (λ/2)‖w‖²
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η(∇w TrainLoss(w)+λw)
Same as gradient descent, except shrink the weights towards zero by λ.
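A sketch of the regularized update above, assuming a grad_train_loss function as in the earlier sketch; eta, lam, and T are illustrative defaults:

def regularized_gradient_descent(grad_train_loss, d, eta=0.1, lam=0.01, T=100):
    # Update: w <- w - eta * (grad TrainLoss(w) + lam * w),
    # i.e. ordinary gradient descent plus a shrink-towards-zero term.
    w = [0.0] * d
    for t in range(T):
        g = grad_train_loss(w)
        w = [wj - eta * (gj + lam * wj) for wj, gj in zip(w, g)]
    return w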
Controlling the Norm: Early Stopping
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Idea: simply make T smaller
Intuition: if we have fewer updates, then ‖w‖ can't get too big.
Lesson: try to minimize the training error, but don’t try too hard.
Regularization and bias-variance
The effects of the regularization procedure can be observed in the bias and variance
terms
• Regularization trades some bias for a considerable decrease in the variance of the model
• Regularization strives for smoother hypotheses, reducing the opportunities to overfit
• The amount of regularization λ has to be chosen specifically for each type of regularizer
• Usually λ is chosen by cross-validation
How overfitting affects predictions
[Plot: predictive error vs. model complexity; the error on the training data keeps decreasing as complexity grows, while the error on the test data is U-shaped, with underfitting at low complexity, overfitting at high complexity, and an ideal range in between]
Regularization
• A method for automatically controlling the complexity of the learned
hypothesis
• Idea: penalize large values of the parameters θj
o Incorporate the penalty into the cost function
o Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model
selection)
L2 Regularization
• Regularized linear regression objective function:
min_θ TrainLoss(θ) + (λ/2) Σ_{j=1}^{d} θj²   (first term: model fit to data; second term: regularization)
o λ is the regularization parameter
o No regularization on θ0!
Slide Credits
Percy Liang
Dorsa Sadigh
Mirko Mazzoleni
Ryan P. Adams
Thank You
Mohamed bin Zayed
University of Artificial Intelligence
Masdar City
Abu Dhabi
United Arab Emirates
mbzuai.ac.ae