
CS480 Introduction to Machine Learning

Linear Models

Edith Law

Perceptron

Idea: run the perceptron algorithm, updating the weights on each misclassified example, until a linear separator is found.

2
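To make this concrete, here is a minimal sketch of the perceptron update loop (variable names are mine; NumPy assumed). It updates (w, b) on every mistake and stops once a full pass makes no mistakes:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        # X: (N, D) feature matrix; y: (N,) labels in {-1, +1}
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for n in range(N):
                if y[n] * (X[n] @ w + b) <= 0:   # misclassified (or on the boundary)
                    w += y[n] * X[n]             # nudge the hyperplane toward x_n
                    b += y[n]
                    mistakes += 1
            if mistakes == 0:                    # a linear separator has been found
                break
        return w, b                              # may not separate if the data is not separable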
Linear Separability

But… the algorithm can only find a linear separator if the dataset is
linearly separable. In reality, not all datasets are linearly separable.

[Figure: two 2D scatter plots of + and − examples. Left: a linearly separable dataset (the perceptron will converge). Right: a dataset that is not linearly separable (the perceptron will not converge).]
3
Relaxing the Requirement

We can relax the requirement and frame the learning problem as finding
the hyperplane that makes the fewest errors on the training data.

Optimization Problem:
min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

The objective function is the error rate (0/1 loss) of the linear classifier
parameterized by w,b.
We know that the optimum is 0: the perceptron algorithm is guaranteed to
find parameters for this model that classify all the training data
correctly if the data is linearly separable.
4
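As a sanity check, this objective is directly computable; a small sketch (names are mine, NumPy assumed):

    import numpy as np

    def zero_one_error(w, b, X, y):
        # Counts the n with y_n (w.x_n + b) <= 0, i.e., the number of training mistakes.
        return int(np.sum(y * (X @ w + b) <= 0))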
Solving the Optimization Problem

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

If the training data is linearly separable:


•The optimum is 0!
•The perceptron algorithm is guaranteed to find the parameters for this
model and classify all the training data correctly if the data is linearly
separable.

If the training data is not linearly separable:


•Is there an efficient algorithm for finding an optimal setting of the
parameters? No. The problem is NP-hard.
•0/1 loss is NP-hard to even approximately minimize!

5
Solving the Optimization Problem

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

But … is this optimization what we want to solve anyway?


What we really want is not merely to minimize training error, but to
avoid overfitting so that the classifier generalizes well to the test data!

6
Optimization Framework for Linear Models

In order to make sure that we do not overfit the data, we need to add a
regularizer over the parameters of the model.

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Here, we are trying to optimize the tradeoff between a solution that
gives low training error (the first term) and a solution that is “simple”
(the second term). The hyper-parameter is λ, and R is chosen to
impose a penalty on the complexity of the function.

The optimization problem:


Find me a linear separator that is not too complicated.

7
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

8
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

9
Why is Optimizing 0/1 Loss Hard?

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

One reason is that small changes to w, b can have a large impact on
the value of the objective function.
• e.g., if wᵀx + b = −0.0000001, adding 0.00000011 will decrease the
error by 1, but adding 0.00000009 will have no effect.

10
Sigmoid Loss

One solution: replace the non-smooth 0/1 loss with a smooth approximation (e.g., the sigmoid).

Problem: the sigmoid function is not convex.
11


Convexity

Convex functions
• second derivative is always non-negative
• any chord of the function lies above it.
• easy to minimize!

12
Convex Surrogate Loss Function

Another solution: approximate the 0/1 loss with an approximating function
that is convex (called a surrogate loss function).
The surrogate loss function upper-bounds the true loss, i.e., if you
minimize the surrogate loss, you also push down the real loss.

0/1:          ℓ^(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge:        ℓ^hin(y, ŷ) = max{0, 1 − yŷ}
Logistic:     ℓ^log(y, ŷ) = (1/log 2) · log(1 + exp[−yŷ])
Exponential:  ℓ^exp(y, ŷ) = exp[−yŷ]
Squared:      ℓ^sqr(y, ŷ) = (y − ŷ)²

13
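For intuition, each surrogate can be written as a function of the margin m = y·ŷ, and the upper-bound property can be checked numerically; a small sketch (NumPy assumed):

    import numpy as np

    # Each surrogate, written as a function of the margin m = y * yhat.
    def loss_01(m):       return (m <= 0).astype(float)
    def loss_hinge(m):    return np.maximum(0.0, 1.0 - m)
    def loss_logistic(m): return np.log1p(np.exp(-m)) / np.log(2.0)
    def loss_exp(m):      return np.exp(-m)
    def loss_sqr(m):      return (1.0 - m) ** 2   # since y in {-1,+1}, (y - yhat)^2 = (1 - m)^2

    m = np.linspace(-2.0, 2.0, 9)
    for f in (loss_hinge, loss_logistic, loss_exp, loss_sqr):
        assert np.all(f(m) >= loss_01(m))   # every surrogate upper-bounds the 0/1 loss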
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

14
Efficient Solutions to the Optimization Problem

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)

We will assume that we are minimizing squared error loss.

15
Gradient Descent Solution for MSE

Consider the error function.

• The gradient of the error is a vector pointing in the direction of
steepest increase; its negative points towards a minimum.
• Instead of directly finding that minimum (using the closed-form
equation), we can take small steps in the negative gradient direction
towards it.
16
Optimization with Gradient Descent

Gradient-based methods of optimization find the maximum of a
function f(x) by “climbing a mountain”.

The optimizer maintains a current estimate z of the parameter of interest.
At each step, it measures the gradient g of the function at the
current location z, and takes a step in the direction of the gradient,
where the size of the step is controlled by a parameter η:

z ← z + ηg

In machine learning, we are trying to minimize the loss function, so we
instead step in the negative gradient direction, seeking the global
minimum of the objective function.

17
Optimization with Gradient Descent

18
Optimization with Gradient Descent
As an example, suppose that you choose exponential loss as the loss
function and the 2-norm as the regularizer.

L(w, b) = ∑_n exp[−y_n(wᵀx_n + b)] + (λ/2) ||w||²

∂L/∂b = ∂/∂b ∑_n exp[−y_n(wᵀx_n + b)] + ∂/∂b (λ/2) ||w||²
      = ∑_n ∂/∂b exp[−y_n(wᵀx_n + b)] + 0
      = ∑_n (∂/∂b [−y_n(wᵀx_n + b)]) exp[−y_n(wᵀx_n + b)]
      = −∑_n y_n exp[−y_n(wᵀx_n + b)]
19
Optimization with Gradient Descent

The optimization will operate by updating


b ← b − η ∂L/∂b

where ∂L/∂b = −∑_n y_n exp[−y_n(wᵀx_n + b)]

Thought Exercise:
• Consider positive examples, where y_n = +1. For these examples, we
hope wᵀx + b is as large as possible.
  - If the predicted value → ∞, the exp term goes to zero.
  - If the predicted value is small, the exp term is positive and
    non-zero. This means that the bias term b will be increased.
  - Once all points are well classified, the derivative goes to zero.
20
Optimization with Gradient Descent

We can calculate the gradient with respect to w in a similar way:

L(w, b) = ∑_n exp[−y_n(wᵀx_n + b)] + (λ/2) ||w||²

∇_w L = ∇_w ∑_n exp[−y_n(wᵀx_n + b)] + ∇_w (λ/2) ||w||²
      = ∑_n (∇_w [−y_n(wᵀx_n + b)]) exp[−y_n(wᵀx_n + b)] + λw
      = −∑_n y_n x_n exp[−y_n(wᵀx_n + b)] + λw

21
Optimization with Gradient Descent

The optimization will operate by updating w

w ← w − η ∇_w L

where ∇_w L = −∑_n y_n x_n exp[−y_n(wᵀx_n + b)] + λw

•For well-classified points, the gradient is near zero.
•For poorly classified points, the gradient points in the direction of
−y_n x_n, and the update has the form w ← w + c y_n x_n (like the
perceptron update!).
•Looking at just the part of the gradient related to the regularizer, the
update says w ← w − λw. This has the effect of shrinking the weights
towards zero, which is exactly what we expect a regularizer to do.
(A runnable sketch of the full update loop follows below.)
22
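Putting the two gradients together gives the full update loop; here is a minimal sketch (the learning rate, iteration count, and names are illustrative choices; NumPy assumed):

    import numpy as np

    def gd_exp_loss(X, y, lam=0.1, eta=0.01, iters=1000):
        # Minimizes L(w,b) = sum_n exp(-y_n (w.x_n + b)) + (lam/2) ||w||^2
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(iters):
            e = np.exp(-y * (X @ w + b))        # one exp term per example
            grad_w = -(y * e) @ X + lam * w     # -sum_n y_n x_n exp[...] + lam w
            grad_b = -np.sum(y * e)             # -sum_n y_n exp[...]
            w -= eta * grad_w                   # step against the gradient
            b -= eta * grad_b
        return w, b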
Optimization with Gradient Descent

The success of gradient descent hinges on an appropriate choice of
step size (sketched below).
•If the step size is too big, you can step over the optimum and start
oscillating. This suggests letting η → 0 in later iterations.
•If the step size is too small, the parameters may not move far enough
per step, and it takes too long to reach the optimum.
Theoretical results suggest that when the parameters start to not
change by much, we can do early stopping.

23
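A sketch of both ideas on a toy objective (the schedule η_t = η₀/(1+t), the tolerance, and the objective itself are illustrative choices):

    import numpy as np

    grad = lambda z: 2.0 * (z - 3.0)            # toy objective f(z) = ||z - 3||^2
    z, eta0, tol = np.zeros(2), 0.4, 1e-8
    for t in range(100_000):
        step = (eta0 / (1 + t)) * grad(z)       # step size decays over iterations
        z -= step
        if np.linalg.norm(step) < tol:          # parameters barely moved: stop early
            break
    print(z)                                    # close to [3, 3]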
Local Minima

Convergence is NOT to a global minimum, only to a local minimum.

• For linear function approximations using squared error loss, this is

not an issue: there is only ONE global minimum!
– Local minima affect many other situations where the loss
function is not convex.
• Repeated random restarts can help (in all cases of gradient search).
24
Efficient Solutions to the Optimization Problem

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)

25
Closed Form Solution

There are cases where you can obtain a closed-form solution: squared
error loss with a 2-norm regularizer.
For simplicity, we will consider the unbiased version (no intercept
term b), which is the linear regression setting.

26
Supervised Learning

Recall that supervised learning is about finding a function f : X → Y


such that f(x) is a good predictor for the value of y.

Output Y can have many types:


– If Y is a finite discrete set, the problem is called classification.
– If Y has 2 elements, the problem is called binary classification.
– If Y = ℝ, this problem is called regression.

radius   texture   perimeter   …   outcome   time
18.02    27.6      117.5       …   N         31
17.99    10.38     112.8       …   N         51
23.51    24.27     155.1       …   R         27

27
Regression Problem

What hypothesis class should we pick?

Observe (x)   Predict (y)
 0.86          2.49
 0.09          0.83
-0.85         -0.25
 0.87          3.10
-0.44          0.87
-0.43          0.02
-1.10         -0.12
 0.40          1.81
-0.96         -0.83
 0.17          0.43

f_w(x) = w_0 + w_1 x
28
Linear Hypothesis

Suppose Y is a linear function of X:

f_w(x) = w_0 + w_1 x_1 + w_2 x_2 + … + w_D x_D

Here, the w_i are the parameters or weights.

w_0 is the bias term or intercept.

Learning involves finding the best set of weights of the linear model
such that the test error is minimized.

29
Predicting Recurrence Time from Tumor Size

The function looks complicated and a linear hypothesis does not


seem very good.

What should we do?


• Pick a better function?
• Use more features?
• Get more data?

30
Input Variables for Linear Regression

• Original quantitative features X_1, …, X_D

• Transformations of variables, e.g. X_{D+1} = log(X_i)

• Basis expansions, e.g. X_{D+1} = X_i², X_{D+2} = X_i³, …

• Interaction terms, e.g. X_{D+1} = X_i X_j

• Numeric coding of qualitative variables, e.g. X_{D+1} = 1 if X_i is true
and 0 otherwise.

In all cases, we can add X_{D+1}, …, X_{D+k} to the list of original
variables and perform the linear regression.

31
Order-3 Fit: Is this better?

32
Order-4 Fit: Is this better?

33
Order-5 Fit: Is this better?

34
Order-6 Fit: Is this better?

35
Order-7 Fit: Is this better?

36
Order-8 Fit: Is this better?

37
Order-9 Fit: Is this better?

Problem: we have a lot of parameters (weights), so the hypothesis


matches the data points exactly, but is wild everywhere else. This
hypothesis will not generalize to new data.
38
Least-Squares Solution Method

The linear regression problem:


f_w(x) = w_0 + ∑_{d=1}^{D} w_d x_d

where D is the dimension of the input space, i.e., the number of features.

Goal: find the best linear model given the data.


There are many possible performance metrics, but the most common
choice is to find w that minimizes
Err(w) = ∑_{i=1}^{n} (y_i − wᵀx_i)²

39
Least-Squares Solution for X in R^2

40
Closed Form Solution for Optimal Weights

To start, we need to rewrite in matrix format.

Training data X:
•a large matrix of size N×D, where X_{n,d} is the value of the dth feature
on the nth example.

Weights w:
• a column vector of size D.

Predicted values Ŷ = Xw:
•the product of the matrix X and the vector w, which has dimension N.
41
Closed Form Solution for Optimal Weights

The squared error says we should minimize

(1/2) ∑_n (Ŷ_n − Y_n)² = (1/2) ||Ŷ − Y||²

42
Closed Form Solution for Optimal Weights

43
Example 1

What is the least squares solution for w without the penalty term?


44
Example 1

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

45
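A minimal NumPy sketch of this closed form, using a few (x, y) pairs from slide 28 (np.linalg.solve is preferred over an explicit inverse for numerical stability):

    import numpy as np

    x = np.array([0.86, 0.09, -0.85, 0.87, -0.44])
    Y = np.array([2.49, 0.83, -0.25, 3.10, 0.87])
    X = np.column_stack([np.ones_like(x), x])    # column of 1s absorbs the intercept w_0
    w_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # w_hat = (X^T X)^{-1} X^T Y
    Y_fit = X @ w_hat                            # fitted data: X (X^T X)^{-1} X^T Y
    print(w_hat, Y_fit)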
Example 1

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

46
Example 1

47
Example 2: With Polynomial Terms

f_w(x) = w_0 + w_1 x + w_2 x²

What is the least squares solution for w without the penalty term?


48
Example 2: With Polynomial Terms

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

49
Example 2: With Polynomial Terms

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

50
Example 2: With Polynomial Terms

51
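The same machinery handles the polynomial case: just add an x² column to the design matrix (data reused from slide 28):

    import numpy as np

    x = np.array([0.86, 0.09, -0.85, 0.87, -0.44])
    Y = np.array([2.49, 0.83, -0.25, 3.10, 0.87])
    X = np.column_stack([np.ones_like(x), x, x**2])   # columns [1, x, x^2]
    w_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # least squares, unchanged
    print(X @ w_hat)                                  # fitted values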
Computational Cost

What operations are necessary?


•Overall: 1 matrix inversion + 3 matrix multiplications
•XᵀX: Xᵀ is D×N and X is N×D, so we need D²N operations.
•(XᵀX)⁻¹: XᵀX is D×D, so we need D³ operations.
•Final multiplication: ND operations
•Total: O(D³ + D²N + ND)
We can do least squares in polynomial time, but handling large
datasets (many examples, many features) can be problematic.
Gradient descent takes O(ND) for each step.
For low- and medium-dimensional problems (D < 100), least squares
may be better. For high-dimensional problems (D > 10,000),
choose gradient descent.
52
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

53
Overfitting

• Every hypothesis has a true error measured on all possible


data items we could ever encounter.
• Since we don’t have all possible data, in order to decide what
is a good hypothesis, we measure error over the training set.
• Suppose we compare hypotheses f1 and f2.
– Assume f1 has lower error on the training set.
– If f2 has lower true error, then our algorithm is overfitting.

54
Estimating True Error

Which hypothesis has the lowest true error?

55
Understanding the Error
Given a set of examples ⟨X, Y⟩, we assume that
y = f(x) + ϵ
where ϵ is Gaussian noise with zero mean and standard deviation σ.

56
An Example (from Tom Dietterich)

Example: 20 points
y = x + 2 sin(1.5x) + N(0,0.2)

The circles are data points. x is drawn uniformly at random.


y is generated by the function above plus Gaussian noise with
zero mean and standard deviation 0.2.
57
An Example (from Tom Dietterich)

50 Fits, 20 examples Each

With different sets of 20 points, we get different lines!


58
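This experiment is easy to reproduce; a sketch (the x-range and seed are my choices, NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    fits = []
    for _ in range(50):                                  # 50 fits, 20 examples each
        x = rng.uniform(0.0, 5.0, size=20)
        y = x + 2 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, size=20)
        X = np.column_stack([np.ones_like(x), x])        # linear hypothesis
        fits.append(np.linalg.solve(X.T @ X, X.T @ y))
    fits = np.array(fits)
    print(fits.mean(axis=0), fits.std(axis=0))           # spread across datasets = variance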
Understanding the error

Consider standard linear regression solution with squared loss error.


If we consider only the class of linear hypotheses, we have
systematic prediction error, called bias, whenever the data is
generated by a non-linear function.
Depending on what dataset we observed, we may get different
solutions. Thus we can also have error due to this variance.
– This occurs even if the data is generated from the class of linear functions.

59
Bias-Variance Tradeoff
Suppose we are trying to learn a regression function f(x) to approximate
the true outputs y.

We can actually prove that the expected generalization error (based


on squared loss) is a combination of noise, bias and variance:

E[(y − f(x))²] = Noise² + Bias² + Variance

https://en.wikipedia.org/wiki/Bias–variance_tradeoff
60
Bias vs Variance

The Gauss-Markov Theorem says that the least-squares estimates of
the parameters w have the smallest variance among all linear
unbiased estimates.
But there may exist an estimator that has lower error, but some bias.
Insight: find a lower-variance solution, at the expense of some bias.
E.g., include a penalty for model complexity in the error to reduce overfitting:

ℓ(w) = ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ · |model_size|

λ is a hyper-parameter that controls penalty size, which can be chosen


manually or through cross validation.

61
Weight Regularization

Recall this:

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0] + λR(w, b)

What should the weight regularizer look like?


• Convex. Why?
• Components of the weight vector should be small (close to zero).
This is a form of inductive bias (preferring simple functions). Why?
• In addition to small weights being good, zero weights might be better.
Why?

62
Weight Regularization

Recall this:
min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0] + λR(w, b)

Regularizer examples:

R^abs(w, b) = ∑_d |w_d|
R^norm(w, b) = ||w|| = √(∑_d w_d²)
R^sqr(w, b) = ||w||² = ∑_d w_d²
R^cnt(w, b) = ∑_d 1[w_d ≠ 0]

63
P-Norm of w

We can use the norm of the weight vector:


||w||_p = (∑_d |w_d|^p)^(1/p)

• 2-norm corresponds to the Euclidean norm
• 1-norm corresponds to the absolute regularizer (previous slide)

In general, regularizing with a smaller p encourages sparser weight vectors.

64
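A quick numerical check of the definition (NumPy assumed):

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.0])
    for p in (1, 2, 3):
        print(p, np.sum(np.abs(w) ** p) ** (1.0 / p))   # (sum_d |w_d|^p)^(1/p)
    print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))   # built-in norms agree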
Ridge Regression (a.k.a. L2-Regularization)

Constrains the weights by imposing a penalty on their size:

ŵ^ridge = argmin_w { ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ ∑_{j=1}^{D} w_j² }

Ridge gives a smooth solution, effectively shrinking the weights, but it
rarely drives weights exactly to 0.
The ridge solution is not equivariant under scaling of the data, so you
typically need to normalize the inputs first.

65
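Ridge keeps a closed form: the standard solution is ŵ = (XᵀX + λI)⁻¹XᵀY. A minimal sketch (NumPy assumed):

    import numpy as np

    def ridge_fit(X, Y, lam):
        # Closed form: w_hat = (X^T X + lam I)^{-1} X^T Y
        # (in practice, normalize the inputs first, as noted above)
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)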
Lasso Regression (a.k.a. L1-Regularization)

Constrains the weights by penalizing the absolute value of their size:

ŵ^lasso = argmin_w { ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ ∑_{j=1}^{D} |w_j| }

Lasso regularization effectively sets the weights of less relevant input
features to zero.

The objective is:
• non-linear in the output y
• without a closed-form solution (solved via quadratic programming)
• more computationally expensive than ridge regression.

66
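Since there is no closed form, the lasso is solved iteratively. One standard approach (not necessarily the one used in lecture) is proximal gradient descent (ISTA), where each step soft-thresholds the weights, producing exact zeros; a sketch:

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, Y, lam, iters=5000):
        # Minimizes ||Y - Xw||^2 + lam ||w||_1 by proximal gradient (ISTA).
        Lc = 2.0 * np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = 2.0 * X.T @ (X @ w - Y)              # gradient of the squared-error term
            w = soft_threshold(w - grad / Lc, lam / Lc) # shrink, then clip to exact zeros
        return w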
Comparing Ridge and Lasso

[Figure: side-by-side plots for Ridge (left) and Lasso (right). Ellipses show the contours of the least-squares error; solid areas show the constraint regions (model complexity penalty): w_1² + w_2² ≤ 1 for ridge and |w_1| + |w_2| ≤ 1 for lasso.]
67
What you should know

• How linear models are formulated mathematically


• Closed form solution (i.e., least squares) vs gradient descent methods
• The idea of regularization (ridge and lasso regression)

68
