
CS480 Introduction to Machine Learning

Linear Models

Edith Law

Perceptron

Idea: run the perceptron algorithm, updating the weights on each misclassified example, until a linear separator is found.

2
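To make this concrete, here is a minimal sketch of the perceptron update loop (variable names are mine; NumPy assumed). It updates (w, b) on every mistake and stops once a full pass makes no mistakes:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        # X: (N, D) feature matrix; y: (N,) labels in {-1, +1}
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for n in range(N):
                if y[n] * (X[n] @ w + b) <= 0:   # misclassified (or on the boundary)
                    w += y[n] * X[n]             # nudge the hyperplane toward x_n
                    b += y[n]
                    mistakes += 1
            if mistakes == 0:                    # a linear separator has been found
                break
        return w, b                              # may not separate if the data is not separable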
Linear Separability

But… the algorithm can only find a linear separator if the dataset is
linearly separable. In reality, not all datasets are linearly separable.

[Figure: two 2D scatter plots of + and − examples. Left: a linearly separable dataset (the perceptron will converge). Right: a dataset that is not linearly separable (the perceptron will not converge).]
3
Relaxing the Requirement

We can relax the requirement and frame the learning problem as finding
the hyperplane that makes the fewest errors on the training data.

Optimization Problem:
min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

The objective function is the error rate (0/1 loss) of the linear classifier
parameterized by w,b.
We know that the optimum is 0: the perceptron algorithm is guaranteed to
find parameters for this model that classify all the training data
correctly if the data is linearly separable.
4
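As a sanity check, this objective is directly computable; a small sketch (names are mine, NumPy assumed):

    import numpy as np

    def zero_one_error(w, b, X, y):
        # Counts the n with y_n (w.x_n + b) <= 0, i.e., the number of training mistakes.
        return int(np.sum(y * (X @ w + b) <= 0))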
Solving the Optimization Problem

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

If the training data is linearly separable:


•The optimum is 0!
•The perceptron algorithm is guaranteed to find the parameters for this
model and classify all the training data correctly if the data is linearly
separable.

If the training data is not linearly separable:


•Is there an efficient algorithm for finding an optimal setting of the
parameters? No. The problem is NP-hard.
•0/1 loss is NP-hard to even approximately minimize!

5
Solving the Optimization Problem

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

But … is this optimization what we want to solve anyway?


What we really want is not merely to minimize training error, but to
avoid overfitting so that the classifier generalizes well to the test data!

6
Optimization Framework for Linear Models

In order to make sure that we do not overfit the data, we need to add a
regularizer over the parameters of the model.

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Here, we are trying to optimize the tradeoff between a solution that
gives low training error (the first term) and a solution that is “simple”
(the second term). The hyper-parameter is λ, and R is chosen to
impose a penalty on the complexity of the function.

The optimization problem:


Find me a linear separator that is not too complicated.

7
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

8
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

9
Why is Optimizing 0/1 Loss Hard?

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0]

One reason is that small changes to w, b can have a large impact on
the value of the objective function.
• e.g., if wᵀx + b = −0.0000001, adding 0.00000011 will decrease the
error by 1, but adding 0.00000009 will have no effect.

10
Sigmoid Loss

One solution: replace the non-smooth 0/1 loss with a smooth approximation (e.g., the sigmoid).

Problem: the sigmoid function is not convex.
11


Convexity

Convex functions
• second derivative is always non-negative
• any chord of the function lies above it.
• easy to minimize!

12
Convex Surrogate Loss Function

Another solution: approximate the 0/1 loss with an approximating function
that is convex (called a surrogate loss function).
The surrogate loss function upper-bounds the true loss, i.e., if you
minimize the surrogate loss, you also push down the real loss.

0/1:          ℓ^(0/1)(y, ŷ) = 1[yŷ ≤ 0]
Hinge:        ℓ^hin(y, ŷ) = max{0, 1 − yŷ}
Logistic:     ℓ^log(y, ŷ) = (1/log 2) · log(1 + exp[−yŷ])
Exponential:  ℓ^exp(y, ŷ) = exp[−yŷ]
Squared:      ℓ^sqr(y, ŷ) = (y − ŷ)²

13
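For intuition, each surrogate can be written as a function of the margin m = y·ŷ, and the upper-bound property can be checked numerically; a small sketch (NumPy assumed):

    import numpy as np

    # Each surrogate, written as a function of the margin m = y * yhat.
    def loss_01(m):       return (m <= 0).astype(float)
    def loss_hinge(m):    return np.maximum(0.0, 1.0 - m)
    def loss_logistic(m): return np.log1p(np.exp(-m)) / np.log(2.0)
    def loss_exp(m):      return np.exp(-m)
    def loss_sqr(m):      return (1.0 - m) ** 2   # since y in {-1,+1}, (y - yhat)^2 = (1 - m)^2

    m = np.linspace(-2.0, 2.0, 9)
    for f in (loss_hinge, loss_logistic, loss_exp, loss_sqr):
        assert np.all(f(m) >= loss_01(m))   # every surrogate upper-bounds the 0/1 loss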
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

14
Efficient Solutions to the Optimization Problem

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)

We will assume that we are minimizing squared error loss.

15
Gradient Descent Solution for MSE

Consider the error function.

• The gradient of the error is a vector pointing in the direction of
steepest increase; its negative points towards a minimum.
• Instead of directly finding that minimum (using the closed-form
equation), we can take small steps in the negative gradient direction
towards it.
16
Optimization with Gradient Descent

Gradient-based methods of optimization find the maximum of a
function f(x) by “climbing a mountain”.

The optimizer maintains a current estimate z of the parameter of interest.
At each step, it measures the gradient g of the function at the
current location z, and takes a step in the direction of the gradient,
where the size of the step is controlled by a parameter η:

z ← z + ηg

In machine learning, we are trying to minimize the loss function, so we
instead step in the negative gradient direction, seeking the global
minimum of the objective function.

17
Optimization with Gradient Descent

18
Optimization with Gradient Descent
As an example, suppose that you choose exponential loss as the loss
function and the 2-norm as the regularizer.

L(w, b) = ∑_n exp[−y_n(wᵀx_n + b)] + (λ/2) ||w||²

∂L/∂b = ∂/∂b ∑_n exp[−y_n(wᵀx_n + b)] + ∂/∂b (λ/2) ||w||²
      = ∑_n ∂/∂b exp[−y_n(wᵀx_n + b)] + 0
      = ∑_n (∂/∂b [−y_n(wᵀx_n + b)]) exp[−y_n(wᵀx_n + b)]
      = −∑_n y_n exp[−y_n(wᵀx_n + b)]
19
Optimization with Gradient Descent

The optimization will operate by updating


b ← b − η ∂L/∂b

where ∂L/∂b = −∑_n y_n exp[−y_n(wᵀx_n + b)]

Thought Exercise:
• Consider positive examples, where y_n = +1. For these examples, we
hope wᵀx + b is as large as possible.
  - If the predicted value → ∞, the exp term goes to zero.
  - If the predicted value is small, the exp term is positive and
    non-zero. This means that the bias term b will be increased.
  - Once all points are well classified, the derivative goes to zero.
20
Optimization with Gradient Descent

We can calculate the gradient with respect to w in a similar way:

L(w, b) = ∑_n exp[−y_n(wᵀx_n + b)] + (λ/2) ||w||²

∇_w L = ∇_w ∑_n exp[−y_n(wᵀx_n + b)] + ∇_w (λ/2) ||w||²
      = ∑_n (∇_w [−y_n(wᵀx_n + b)]) exp[−y_n(wᵀx_n + b)] + λw
      = −∑_n y_n x_n exp[−y_n(wᵀx_n + b)] + λw

21
Optimization with Gradient Descent

The optimization will operate by updating w

w ← w − η ∇_w L

where ∇_w L = −∑_n y_n x_n exp[−y_n(wᵀx_n + b)] + λw

•For well-classified points, the gradient is near zero.
•For poorly classified points, the gradient points in the direction of
−y_n x_n, and the update has the form w ← w + c y_n x_n (like the
perceptron update!).
•Looking at just the part of the gradient related to the regularizer, the
update says w ← w − λw. This has the effect of shrinking the weights
towards zero, which is exactly what we expect a regularizer to do.
(A runnable sketch of the full update loop follows below.)
22
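Putting the two gradients together gives the full update loop; here is a minimal sketch (the learning rate, iteration count, and names are illustrative choices; NumPy assumed):

    import numpy as np

    def gd_exp_loss(X, y, lam=0.1, eta=0.01, iters=1000):
        # Minimizes L(w,b) = sum_n exp(-y_n (w.x_n + b)) + (lam/2) ||w||^2
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(iters):
            e = np.exp(-y * (X @ w + b))        # one exp term per example
            grad_w = -(y * e) @ X + lam * w     # -sum_n y_n x_n exp[...] + lam w
            grad_b = -np.sum(y * e)             # -sum_n y_n exp[...]
            w -= eta * grad_w                   # step against the gradient
            b -= eta * grad_b
        return w, b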
Optimization with Gradient Descent

The success of gradient descent hinges on an appropriate choice of
step size (sketched below).
•If the step size is too big, you can step over the optimum and start
oscillating. This suggests letting η → 0 in later iterations.
•If the step size is too small, the parameters may not move far enough
per step, and it takes too long to reach the optimum.
Theoretical results suggest that when the parameters start to not
change by much, we can do early stopping.

23
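A sketch of both ideas on a toy objective (the schedule η_t = η₀/(1+t), the tolerance, and the objective itself are illustrative choices):

    import numpy as np

    grad = lambda z: 2.0 * (z - 3.0)            # toy objective f(z) = ||z - 3||^2
    z, eta0, tol = np.zeros(2), 0.4, 1e-8
    for t in range(100_000):
        step = (eta0 / (1 + t)) * grad(z)       # step size decays over iterations
        z -= step
        if np.linalg.norm(step) < tol:          # parameters barely moved: stop early
            break
    print(z)                                    # close to [3, 3]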
Local Minima

Convergence is NOT to a global minimum, only to a local minimum.

• For linear function approximations using squared error loss, this is

not an issue: there is only ONE global minimum!
– Local minima affect many other situations where the loss
function is not convex.
• Repeated random restarts can help (in all cases of gradient search).
24
Efficient Solutions to the Optimization Problem

min_{w,b} ∑_n ℓ(y_n, wᵀx_n + b) + λR(w, b)

Two Solutions:
1. Gradient Descent
2. Closed Form Solution (Least Squares)

25
Closed Form Solution

There are cases where you can obtain a closed-form solution: squared
error loss with a 2-norm regularizer.
For simplicity, we will consider the unbiased version (no intercept
term b), which is the linear regression setting.

26
Supervised Learning

Recall that supervised learning is about finding a function f : X → Y


such that f(x) is a good predictor for the value of y.

Output Y can have many types:


– If Y is a finite discrete set, the problem is called classification.
– If Y has 2 elements, the problem is called binary classification.
– If Y = ℝ, this problem is called regression.

radius   texture   perimeter   …   outcome   time
18.02    27.6      117.5       …   N         31
17.99    10.38     112.8       …   N         51
23.51    24.27     155.1       …   R         27

27
Regression Problem

What hypothesis class should we pick?

Observe (x)   Predict (y)
 0.86          2.49
 0.09          0.83
-0.85         -0.25
 0.87          3.10
-0.44          0.87
-0.43          0.02
-1.10         -0.12
 0.40          1.81
-0.96         -0.83
 0.17          0.43

f_w(x) = w_0 + w_1 x
28
Linear Hypothesis

Suppose Y is a linear function of X:

f_w(x) = w_0 + w_1 x_1 + w_2 x_2 + … + w_D x_D

Here, the w_i are the parameters or weights.

w_0 is the bias term or intercept.

Learning involves finding the best set of weights of the linear model
such that the test error is minimized.

29
Predicting Recurrence Time from Tumor Size

The function looks complicated and a linear hypothesis does not


seem very good.

What should we do?


• Pick a better function?
• Use more features?
• Get more data?

30
Input Variables for Linear Regression

• Original quantitative features X_1, …, X_D

• Transformations of variables, e.g. X_{D+1} = log(X_i)

• Basis expansions, e.g. X_{D+1} = X_i², X_{D+2} = X_i³, …

• Interaction terms, e.g. X_{D+1} = X_i X_j

• Numeric coding of qualitative variables, e.g. X_{D+1} = 1 if X_i is true
and 0 otherwise.

In all cases, we can add X_{D+1}, …, X_{D+k} to the list of original
variables and perform the linear regression.

31
Order-3 Fit: Is this better?

32
Order-4 Fit: Is this better?

33
Order-5 Fit: Is this better?

34
Order-6 Fit: Is this better?

35
Order-7 Fit: Is this better?

36
Order-8 Fit: Is this better?

37
Order-9 Fit: Is this better?

Problem: we have a lot of parameters (weights), so the hypothesis


matches the data points exactly, but is wild everywhere else. This
hypothesis will not generalize to new data.
38
Least-Squares Solution Method

The linear regression problem:


f_w(x) = w_0 + ∑_{d=1}^{D} w_d x_d

where D is the dimension of the input space, i.e., the number of features.

Goal: find the best linear model given the data.


There are many possible performance metrics, but the most common
choice is to find w that minimizes
Err(w) = ∑_{i=1}^{n} (y_i − wᵀx_i)²

39
Least-Squares Solution for X in R^2

40
Closed Form Solution for Optimal Weights

To start, we need to rewrite in matrix format.

Training data X:
•a large matrix of size N×D, where X_{n,d} is the value of the dth feature
on the nth example.

Weights w:
• a column vector of size D.

Predicted values Ŷ = Xw:
•the product of the matrix X and the vector w, which has dimension N.
41
Closed Form Solution for Optimal Weights

The squared error says we should minimize

(1/2) ∑_n (Ŷ_n − Y_n)² = (1/2) ||Ŷ − Y||²

42
Closed Form Solution for Optimal Weights

43
Example 1

What is the least squares solution for w without the penalty term?


44
Example 1

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

45
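A minimal NumPy sketch of this closed form, using a few (x, y) pairs from slide 28 (np.linalg.solve is preferred over an explicit inverse for numerical stability):

    import numpy as np

    x = np.array([0.86, 0.09, -0.85, 0.87, -0.44])
    Y = np.array([2.49, 0.83, -0.25, 3.10, 0.87])
    X = np.column_stack([np.ones_like(x), x])    # column of 1s absorbs the intercept w_0
    w_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # w_hat = (X^T X)^{-1} X^T Y
    Y_fit = X @ w_hat                            # fitted data: X (X^T X)^{-1} X^T Y
    print(w_hat, Y_fit)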
Example 1

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

46
Example 1

47
Example 2: With Polynomial Terms

f_w(x) = w_0 + w_1 x + w_2 x²

What is the least squares solution for w without the penalty term?


48
Example 2: With Polynomial Terms

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

49
Example 2: With Polynomial Terms

The fitted data: Ŷ = Xŵ = X(XᵀX)⁻¹XᵀY

50
Example 2: With Polynomial Terms

51
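The same machinery handles the polynomial case: just add an x² column to the design matrix (data reused from slide 28):

    import numpy as np

    x = np.array([0.86, 0.09, -0.85, 0.87, -0.44])
    Y = np.array([2.49, 0.83, -0.25, 3.10, 0.87])
    X = np.column_stack([np.ones_like(x), x, x**2])   # columns [1, x, x^2]
    w_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # least squares, unchanged
    print(X @ w_hat)                                  # fitted values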
Computational Cost

What operations are necessary?


•Overall: 1 matrix inversion + 3 matrix multiplications
•XᵀX: Xᵀ is D×N and X is N×D, so we need D²N operations.
•(XᵀX)⁻¹: XᵀX is D×D, so we need D³ operations.
•Final multiplication: ND operations
•Total: O(D³ + D²N + ND)
We can do least squares in polynomial time, but handling large
datasets (many examples, many features) can be problematic.
Gradient descent takes O(ND) for each step.
For low- and medium-dimensional problems (D < 100), least squares
may be better. For high-dimensional problems (D > 10,000),
choose gradient descent.
52
Optimization Framework for Linear Models

Questions:
1. How can we adjust the optimization problem so that there are
efficient algorithms for solving it?
2. Assuming we can adjust the optimization problem appropriately,
what algorithms exist for efficiently solving the regularized optimization
problem?
3. What are good regularization functions R(w,b) for hyperplanes?

53
Overfitting

• Every hypothesis has a true error measured on all possible


data items we could ever encounter.
• Since we don’t have all possible data, in order to decide what
is a good hypothesis, we measure error over the training set.
• Suppose we compare hypotheses f1 and f2.
– Assume f1 has lower error on the training set.
– If f2 has lower true error, then our algorithm is overfitting.

54
Estimating True Error

Which hypothesis has the lowest true error?

55
Understanding the Error
Given a set of examples ⟨X, Y⟩, we assume that
y = f(x) + ϵ
where ϵ is Gaussian noise with zero mean and standard deviation σ.

56
An Example (from Tom Dietterich)

Example: 20 points
y = x + 2 sin(1.5x) + N(0,0.2)

The circles are data points. x is drawn uniformly at random.


y is generated by the function above plus Gaussian noise with
zero mean and standard deviation 0.2.
57
An Example (from Tom Dietterich)

50 Fits, 20 examples Each

With different sets of 20 points, we get different lines!


58
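This experiment is easy to reproduce; a sketch (the x-range and seed are my choices, NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    fits = []
    for _ in range(50):                                  # 50 fits, 20 examples each
        x = rng.uniform(0.0, 5.0, size=20)
        y = x + 2 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, size=20)
        X = np.column_stack([np.ones_like(x), x])        # linear hypothesis
        fits.append(np.linalg.solve(X.T @ X, X.T @ y))
    fits = np.array(fits)
    print(fits.mean(axis=0), fits.std(axis=0))           # spread across datasets = variance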
Understanding the error

Consider standard linear regression solution with squared loss error.


If we consider only the class of linear hypotheses, we have
systematic prediction error, called bias, whenever the data is
generated by a non-linear function.
Depending on what dataset we observed, we may get different
solutions. Thus we can also have error due to this variance.
– This occurs even if the data is generated from the class of linear functions.

59
Bias-Variance Tradeoff
Suppose we are trying to learn a regression function f(x) to approximate
the true outputs y.

We can actually prove that the expected generalization error (based


on squared loss) is a combination of noise, bias and variance:

E[(y − f(x))²] = Noise² + Bias² + Variance

https://en.wikipedia.org/wiki/Bias–variance_tradeoff
60
Bias vs Variance

The Gauss-Markov Theorem says that the least-squares estimates of
the parameters w have the smallest variance among all linear
unbiased estimates.
But there may exist an estimator that has lower error, but some bias.
Insight: find a lower-variance solution, at the expense of some bias.
E.g., include a penalty for model complexity in the error to reduce overfitting:

ℓ(w) = ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ · |model_size|

λ is a hyper-parameter that controls penalty size, which can be chosen


manually or through cross validation.

61
Weight Regularization

Recall this:

min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0] + λR(w, b)

What should the weight regularizer look like?


• Convex. Why?
• Components of the weight vector should be small (close to zero).
This is a form of inductive bias (preferring simple functions). Why?
• In addition to small weights being good, zero weights might be better.
Why?

62
Weight Regularization

Recall this:
min_{w,b} ∑_n 1[y_n(wᵀx_n + b) ≤ 0] + λR(w, b)

Regularizer examples:

R^abs(w, b) = ∑_d |w_d|
R^norm(w, b) = ||w|| = √(∑_d w_d²)
R^sqr(w, b) = ||w||² = ∑_d w_d²
R^cnt(w, b) = ∑_d 1[w_d ≠ 0]

63
P-Norm of w

We can use the norm of the weight vector:


||w||_p = (∑_d |w_d|^p)^(1/p)

• 2-norm corresponds to the Euclidean norm
• 1-norm corresponds to the absolute regularizer (previous slide)

In general, regularizing with a smaller p encourages sparser weight vectors.

64
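A quick numerical check of the definition (NumPy assumed):

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.0])
    for p in (1, 2, 3):
        print(p, np.sum(np.abs(w) ** p) ** (1.0 / p))   # (sum_d |w_d|^p)^(1/p)
    print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))   # built-in norms agree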
Ridge Regression (a.k.a. L2-Regularization)

Constrains the weights by imposing a penalty on their size:

ŵ^ridge = argmin_w { ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ ∑_{j=1}^{D} w_j² }

Ridge gives a smooth solution, effectively shrinking the weights, but it
rarely drives weights exactly to 0.
The ridge solution is not equivariant under scaling of the data, so you
typically need to normalize the inputs first.

65
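Ridge keeps a closed form: the standard solution is ŵ = (XᵀX + λI)⁻¹XᵀY. A minimal sketch (NumPy assumed):

    import numpy as np

    def ridge_fit(X, Y, lam):
        # Closed form: w_hat = (X^T X + lam I)^{-1} X^T Y
        # (in practice, normalize the inputs first, as noted above)
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)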
Lasso Regression (a.k.a. L1-Regularization)

Constrains the weights by penalizing the absolute value of their size:

ŵ^lasso = argmin_w { ∑_{i=1}^{n} (y_i − wᵀx_i)² + λ ∑_{j=1}^{D} |w_j| }

Lasso regularization effectively sets the weights of less relevant input
features to zero.

The objective is:
• non-linear in the output y
• without a closed-form solution (solved via quadratic programming)
• more computationally expensive than ridge regression.

66
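Since there is no closed form, the lasso is solved iteratively. One standard approach (not necessarily the one used in lecture) is proximal gradient descent (ISTA), where each step soft-thresholds the weights, producing exact zeros; a sketch:

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, Y, lam, iters=5000):
        # Minimizes ||Y - Xw||^2 + lam ||w||_1 by proximal gradient (ISTA).
        Lc = 2.0 * np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = 2.0 * X.T @ (X @ w - Y)              # gradient of the squared-error term
            w = soft_threshold(w - grad / Lc, lam / Lc) # shrink, then clip to exact zeros
        return w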
Comparing Ridge and Lasso

[Figure: side-by-side plots for Ridge (left) and Lasso (right). Ellipses show the contours of the least-squares error; solid areas show the constraint regions (model complexity penalty): w_1² + w_2² ≤ 1 for ridge and |w_1| + |w_2| ≤ 1 for lasso.]
67
What you should know

• How linear models are formulated mathematically


• Closed form solution (i.e., least squares) vs gradient descent methods
• The idea of regularization (ridge and lasso regression)

68
