CS480 6: Linear Models
Edith Law
Perceptron
Linear Separability
But… the algorithm can only find a linear separator if the dataset is
linearly separable. In reality, not all datasets are linearly separable.
[Figure: two scatter plots of + and − points. Left: a linearly separable dataset, where the perceptron will converge. Right: a dataset that is not linearly separable, where the perceptron will not converge.]
Relaxing the Requirement
We can relax the requirement and frame the learning problem as finding
the hyperplane that makes the fewest errors on the training data.
Optimization Problem:
$$\min_{w,b} \; \sum_n \mathbf{1}[y_n(w^\top x_n + b) \le 0]$$
The objective function is the error rate (0/1 loss) of the linear classifier
parameterized by w,b.
We know that the optimum is 0 when the data is linearly separable: the
perceptron algorithm is guaranteed to find parameters for this model that
classify all the training data correctly.
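As a small sketch (my own illustration, not from the slides; the data and parameters below are made up), this 0/1 training error can be computed directly:

```python
import numpy as np

# Toy data: rows of X are examples, labels y are in {-1, +1} (values made up for illustration).
X = np.array([[0.5, 1.0],
              [2.0, -0.5],
              [-1.0, -1.5],
              [-0.3, 0.8]])
y = np.array([+1, +1, -1, -1])

w = np.array([1.0, 0.5])   # hypothetical weight vector
b = 0.1                    # hypothetical bias

# 0/1 loss: count the examples with y_n * (w^T x_n + b) <= 0 (misclassified or on the boundary).
margins = y * (X @ w + b)
print(np.sum(margins <= 0))   # number of training errors
```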
Solving the Optimization Problem
$$\min_{w,b} \; \sum_n \mathbf{1}[y_n(w^\top x_n + b) \le 0]$$
Optimization Framework for Linear Models
In order to make sure that we do not overfit the data, we need to add a
regularizer over the parameters of the model.
$$\min_{w,b} \; \sum_n \ell(y_n,\, w^\top x_n + b) + \lambda R(w, b)$$
Optimization Framework for Linear Models
Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w, b) for hyperplanes?
Why is Optimizing 0/1 Loss Hard?
$$\min_{w,b} \; \sum_n \mathbf{1}[y_n(w^\top x_n + b) \le 0]$$
The 0/1 loss is piecewise constant and non-convex: small changes to w and b usually do not change the objective at all, so gradients give no useful signal, and exactly minimizing it is computationally hard in general.
Sigmoid Loss
Convex functions:
• the second derivative is always non-negative
• any chord of the function lies above the function
• easy to minimize!
Convex Surrogate Loss Function
0/1: $\ell^{(0/1)}(y, \hat{y}) = \mathbf{1}[y\hat{y} \le 0]$
Hinge: $\ell^{(\mathrm{hin})}(y, \hat{y}) = \max\{0,\ 1 - y\hat{y}\}$
Logistic: $\ell^{(\mathrm{log})}(y, \hat{y}) = \frac{1}{\log 2}\log(1 + \exp[-y\hat{y}])$
Exponential: $\ell^{(\mathrm{exp})}(y, \hat{y}) = \exp[-y\hat{y}]$
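As a quick sketch (my own code, not from the slides), these losses can be written directly in terms of the margin y·ŷ:

```python
import numpy as np

def zero_one_loss(margin):
    """0/1 loss: 1 if y*yhat <= 0, else 0."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss: max{0, 1 - y*yhat}."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic loss: (1/log 2) * log(1 + exp(-y*yhat))."""
    return np.log1p(np.exp(-margin)) / np.log(2.0)

def exponential_loss(margin):
    """Exponential loss: exp(-y*yhat)."""
    return np.exp(-margin)

# Each surrogate upper-bounds the 0/1 loss (the logistic one thanks to the 1/log 2 scaling).
margins = np.linspace(-2, 2, 5)
for f in (zero_one_loss, hinge_loss, logistic_loss, exponential_loss):
    print(f.__name__, f(margins))
```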
Optimization Framework for Linear Models
Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w, b) for hyperplanes?
Efficient Solutions to the Optimization Problem
$$\min_{w,b} \; \sum_n \ell(y_n,\, w^\top x_n + b) + \lambda R(w, b)$$
Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)
Gradient Descent Solution for MSE
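The slide's derivation is not reproduced in this text; as a hedged sketch, gradient descent on the mean squared error of an (unregularized) linear model could look like the following, with synthetic data and an assumed step size:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.1 * rng.normal(size=N)   # synthetic regression targets

w, b = np.zeros(D), 0.0
eta = 0.1                                            # step size (assumed, not from the slides)

for _ in range(500):
    residual = X @ w + b - y                         # prediction error for each example
    grad_w = X.T @ residual / N                      # gradient of (1/2N) * sum of squared errors w.r.t. w
    grad_b = residual.mean()                         # gradient of the same objective w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # should approach true_w, true_b
```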
Optimization with Gradient Descent
As an example, suppose that you choose exponential loss as the loss
function and the 2-norm as the regularizer.
$$L(w, b) = \sum_n \exp[-y_n(w^\top x_n + b)] + \frac{\lambda}{2}\|w\|^2$$

$$\frac{\partial L}{\partial b} = \frac{\partial}{\partial b}\sum_n \exp[-y_n(w^\top x_n + b)] + \frac{\partial}{\partial b}\,\frac{\lambda}{2}\|w\|^2$$
$$= \sum_n \frac{\partial}{\partial b}\exp[-y_n(w^\top x_n + b)] + 0$$
$$= \sum_n \left(\frac{\partial}{\partial b}\bigl[-y_n(w^\top x_n + b)\bigr]\right)\exp[-y_n(w^\top x_n + b)]$$
$$= -\sum_n y_n \exp[-y_n(w^\top x_n + b)]$$
Thought Exercise:
• Consider positive examples, where $y_n = +1$. For these examples, we want $w^\top x_n + b$ to be as large as possible.
  - As the predicted value → +∞, the exp term goes to zero.
  - If the predicted value is small, the exp term is positive and non-zero, and the gradient step increases the bias term b.
  - Once all points are well classified, the derivative goes to zero.
Optimization with Gradient Descent
$$L(w, b) = \sum_n \exp[-y_n(w^\top x_n + b)] + \frac{\lambda}{2}\|w\|^2$$

$$\nabla_w L = \nabla_w \sum_n \exp[-y_n(w^\top x_n + b)] + \nabla_w\,\frac{\lambda}{2}\|w\|^2$$
$$= \sum_n \left(\nabla_w\bigl[-y_n(w^\top x_n + b)\bigr]\right)\exp[-y_n(w^\top x_n + b)] + \lambda w$$
$$= -\sum_n y_n\, x_n \exp[-y_n(w^\top x_n + b)] + \lambda w$$
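As a sanity check (my own illustration, not from the slides), the analytic gradients above can be compared against finite-difference approximations:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 20, 4, 0.1
X = rng.normal(size=(N, D))
y = rng.choice([-1.0, 1.0], size=N)
w, b = rng.normal(size=D), 0.2

def loss(w, b):
    # L(w, b) = sum_n exp(-y_n (w^T x_n + b)) + (lambda/2) ||w||^2
    return np.sum(np.exp(-y * (X @ w + b))) + 0.5 * lam * w @ w

# Analytic gradients from the derivation above.
e = np.exp(-y * (X @ w + b))
grad_w = -X.T @ (y * e) + lam * w
grad_b = -np.sum(y * e)

# Finite-difference check on the first coordinate of w and on b.
eps = 1e-6
num_w0 = (loss(w + eps * np.eye(D)[0], b) - loss(w - eps * np.eye(D)[0], b)) / (2 * eps)
num_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
print(np.isclose(grad_w[0], num_w0), np.isclose(grad_b, num_b))
```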
Optimization with Gradient Descent
w ← w − η ∇w L
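Putting the pieces together, here is a hedged sketch of the full update loop for the exponential loss with the 2-norm regularizer; the data, step size η, regularization strength λ, and iteration count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 2
X = rng.normal(size=(N, D))
y = np.where(X @ np.array([2.0, -1.0]) + 0.5 > 0, 1.0, -1.0)   # synthetic labels in {-1, +1}

w, b = np.zeros(D), 0.0
eta, lam = 0.01, 0.1                                 # step size and regularization strength (assumed)

for _ in range(1000):
    e = np.exp(-y * (X @ w + b))                     # exp(-y_n (w^T x_n + b)) for each n
    grad_w = -X.T @ (y * e) + lam * w                # gradient of the regularized objective w.r.t. w
    grad_b = -np.sum(y * e)                          # gradient w.r.t. b (the regularizer does not involve b)
    w -= eta * grad_w                                # w <- w - eta * grad_w
    b -= eta * grad_b

train_err = np.mean(y * (X @ w + b) <= 0)
print(w, b, train_err)
```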
Local Minima
$$\min_{w,b} \; \sum_n \ell(y_n,\, w^\top x_n + b) + \lambda R(w, b)$$
Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)
Closed Form Solution
There are cases when you can obtain a closed form solution: squared
error loss with 2-norm regularizer.
For simplicity, we will consider the unbiased version (no bias term b), which is the
linear regression setting.
Supervised Learning
Regression Problem
Observe (x)   Predict (y)
0.86 2.49
0.09 0.83
-0.85 -0.25
0.87 3.10
-0.44 0.87
-0.43 0.02
-1.1 -0.12
0.40 1.81
-0.96 -0.83
0.17 0.43
fw(x) = w0 + w1x
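As an illustrative sketch (not from the slides), w0 and w1 can be fit to the ten (x, y) pairs above by least squares:

```python
import numpy as np

# The (x, y) pairs from the table above.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.1, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Design matrix with a column of ones for w0 and a column for w1.
A = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # [w0, w1], the least-squares line f_w(x) = w0 + w1 * x
```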
Linear Hypothesis
fw(x) = w0 + w1x1 + w2 x2 + … + wD xD
Learning involves finding the best set of weights of the linear model
such that the test error is minimized.
Predicting Recurrence Time from Tumor Size
Input Variables for Linear Regression
In all cases, we can add $x_{D+1}, \dots, x_{D+k}$ to the list of original variables
and perform the linear regression.
Order-3 Fit: Is this better?
Order-4 Fit: Is this better?
Order-5 Fit: Is this better?
Order-6 Fit: Is this better?
Order-7 Fit: Is this better?
Order-8 Fit: Is this better?
Order-9 Fit: Is this better?
$$f_w(x) = w_0 + \sum_d w_d x_d$$
Least-Squares Solution for X in R^2
Closed Form Solution for Optimal Weights
Training data X:
• a large matrix of size N×D, where $X_{n,d}$ is the value of the d-th feature on the n-th example.
Weights w:
• a column vector of size D.
Predicted values Ŷ = Xw:
• the product of the matrix X and the vector w, which has dimension N.
Closed Form Solution for Optimal Weights
$$\frac{1}{2}\sum_n (\hat{Y}_n - Y_n)^2 = \frac{1}{2}\,\|\hat{Y} - Y\|^2$$
Closed Form Solution for Optimal Weights
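The content of this slide is not preserved in the extracted text; the standard derivation, consistent with the definitions above (and ignoring the regularizer for the moment), gives the closed form:

$$\nabla_w\,\tfrac{1}{2}\|Xw - Y\|^2 = X^\top(Xw - Y) = 0 \;\;\Longrightarrow\;\; X^\top X\, w = X^\top Y \;\;\Longrightarrow\;\; \hat{w} = (X^\top X)^{-1} X^\top Y$$

With the 2-norm regularizer added, the same steps give $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top Y$, which reappears later as the ridge regression solution.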
Example 1
Example 2: With Polynomial Terms
$$f_w(x) = w_0 + w_1 x + w_2 x^2$$
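As a sketch (my own example, not from the slides), the polynomial terms are simply added as extra columns of the design matrix, and the same least-squares solve applies:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=x.size)   # synthetic quadratic data

# Columns: [1, x, x^2], so the model is f_w(x) = w0 + w1*x + w2*x^2.
A = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # approximately [1.0, 2.0, -3.0]
```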
Computational Cost
Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w, b) for hyperplanes?
Overfitting
Estimating True Error
Understanding the Error
Given a set of examples ⟨X, Y⟩, we assume that
y = f(x) + ϵ
where ϵ is Gaussian noise with zero mean and standard deviation σ.
An Example (from Tom Dietterich)
Example: 20 points
y = x + 2 sin(1.5x) + N(0,0.2)
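As a small sketch of this example (the plot itself is not reproduced), the 20 points can be generated as follows; I am reading N(0, 0.2) as Gaussian noise with standard deviation 0.2, and the input range is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=20)                               # 20 input points (range assumed)
y = x + 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, size=20)    # y = x + 2 sin(1.5x) + N(0, 0.2)
print(np.column_stack([x, y]))
```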
Bias-Variance Tradeoff
Suppose we are trying to learn a regression function to approximate the true function f(x).
https://en.wikipedia.org/wiki/Bias–variance_tradeoff
Bias vs Variance
Weight Regularization
Recall this:
$$\min_{w,b} \; \sum_n \mathbf{1}[y_n(w^\top x_n + b) \le 0] + \lambda R(w, b)$$
Regularizer examples:
$$R^{(\mathrm{norm})}(w, b) = \|w\| = \sqrt{\sum_d w_d^2} \qquad\qquad R^{(\mathrm{abs})}(w, b) = \sum_d |w_d|$$
P-Norm of w
$$\|w\|_p = \Bigl(\sum_d |w_d|^p\Bigr)^{1/p}$$
• The 2-norm corresponds to the Euclidean norm.
• The 1-norm corresponds to the absolute-value regularizer (in the previous slide).
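A tiny illustration of these norms in NumPy (my own example):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])
p = 3                                        # arbitrary p for the general formula

print(np.sum(np.abs(w) ** p) ** (1 / p))     # general p-norm: (sum_d |w_d|^p)^(1/p)
print(np.linalg.norm(w, 2))                  # 2-norm (Euclidean): 5.0
print(np.linalg.norm(w, 1))                  # 1-norm (sum of absolute values): 7.0
```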
Ridge Regression (a.k.a. L2-Regularization)
$$\hat{w}^{\mathrm{ridge}} = \arg\min_w \Bigl\{ \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \sum_{j=1}^{D} w_j^2 \Bigr\}$$
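Ridge regression still has a closed-form solution; a minimal sketch (my own code, ignoring the bias term as in the earlier slides):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Example with synthetic data (made up for illustration).
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.5))
```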
Lasso Regression (a.k.a. L1-Regularization)
$$\hat{w}^{\mathrm{lasso}} = \arg\min_w \Bigl\{ \sum_{i=1}^{N} (y_i - w^\top x_i)^2 + \lambda \sum_{j=1}^{D} |w_j| \Bigr\}$$
The lasso objective leads to a solution that is:
• non-linear in the outputs y_i
• without a closed form (it requires solving a quadratic program)
• more computationally expensive to compute than the ridge solution.
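The slides do not give an algorithm for solving the lasso; one common approach, sketched here under the assumption of cyclic coordinate descent with soft-thresholding (not necessarily the method intended by the slide), is:

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator: sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum_i (y_i - w^T x_i)^2 + lam * sum_j |w_j| by cyclic coordinate descent."""
    N, D = X.shape
    w = np.zeros(D)
    col_sq = np.sum(X**2, axis=0)                  # precompute x_j^T x_j for each column
    for _ in range(n_iters):
        for j in range(D):
            r_j = y - X @ w + X[:, j] * w[j]       # residual ignoring feature j
            rho_j = X[:, j] @ r_j
            w[j] = soft_threshold(rho_j, lam / 2.0) / col_sq[j]
    return w

# Synthetic sparse problem (made up): only two of six features matter.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=5.0))     # noise coefficients are typically driven exactly to zero
```

The soft-thresholding step is what sets coefficients exactly to zero, which is the sparsity behaviour contrasted with ridge on the next slide.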
Comparing Ridge and Lasso
[Figure: side-by-side comparison of Ridge (left) and Lasso (right), showing the contours of the regression error.]