
Lecture 2: Overfitting.

Regularization

• Generalizing regression
• Overfitting
• Cross-validation
• L2 and L1 regularization for linear estimators
• A Bayesian interpretation of regularization
• Bias-variance trade-off

COMP-652 and ECSE-608, Lecture 2 - January 10, 2017


Recall: Overfitting

• A general, HUGELY IMPORTANT problem for all machine learning algorithms
• We can find a hypothesis that predicts the training data perfectly but does not generalize well to new data
• E.g., a lookup table!



Another overfitting example
[Figure: polynomial fits of degree M = 0, 1, 3, 9 to the same data set; each panel plots t against x on [0, 1]]

• The higher the degree of the polynomial M , the more degrees of freedom,
and the more capacity to “overfit” the training data
• Typical overfitting means that error on the training data is very low, but
error on new instances is high
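
To make this concrete, here is a minimal numpy sketch (not from the slides) that reproduces the qualitative behavior; the synthetic data t = sin(2πx) + Gaussian noise is an assumption in the style of the figure above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data in the style of the figure: t = sin(2*pi*x) + noise.
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, t_train = make_data(10)   # small training set
x_test, t_test = make_data(100)    # independent test set

for M in [0, 1, 3, 9]:
    w = np.polyfit(x_train, t_train, M)   # degree-M least-squares fit
    for name, (xs, ts) in [("train", (x_train, t_train)), ("test", (x_test, t_test))]:
        rms = np.sqrt(np.mean((np.polyval(w, xs) - ts) ** 2))
        print(f"M={M} {name} RMS: {rms:.3f}")
```

With M = 9 and only 10 training points, the training error collapses to (near) zero while the test error blows up.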



Overfitting more formally

• Assume that the data is drawn from some fixed, unknown probability
distribution
• Every hypothesis has a "true" error J*(h), which is the expected error when data is drawn from the distribution
• Because we do not have all the data, we measure the error on the training set, J_D(h)
• Suppose we compare hypotheses h1 and h2 on the training set, and J_D(h1) < J_D(h2)
• If h2 is "truly" better, i.e. J*(h2) < J*(h1), our algorithm is overfitting
• We need theoretical and empirical methods to guard against it!



Typical overfitting plot
[Figure: E_RMS on the training and test sets as a function of the polynomial degree M (0 to 9)]
• The training error decreases with the degree of the polynomial M , i.e.
the complexity of the hypothesis
• The testing error, measured on independent data, decreases at first, then
starts increasing
• Cross-validation helps us:
– Find a good hypothesis class (M in our case), using a validation set
of data
– Report unbiased results, using a test set, untouched during either
parameter training or validation



Cross-validation

• A general procedure for estimating the true error of a predictor


• The data is split into two subsets:
– A training and validation set used only to find the right predictor
– A test set used to report the prediction error of the algorithm
• These sets must be disjoint!
• The process is repeated several times, and the results are averaged to
provide error estimates.



Example: Polynomial regression

[Figure: four panels of polynomial regression fits, y plotted against x]



Leave-one-out cross-validation

1. For each order of polynomial, d:
   (a) Repeat the following procedure for i = 1, ..., m:
       i. Leave out the i-th instance from the training set, to estimate the true prediction error; we will put it in a validation set
       ii. Use all the other instances to find the best parameter vector, w_{d,i}
       iii. Measure the error in predicting the label on the instance left out, for the w_{d,i} parameter vector; call this J_{d,i}
       iv. This is a (mostly) unbiased estimate of the true prediction error
   (b) Compute the average of the estimated errors: J_d = (1/m) Σ_{i=1}^m J_{d,i}
2. Choose the d with the lowest average estimated error: d* = argmin_d J_d
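
A minimal numpy sketch of this procedure (using the 10-point data set D shown on the next slide; degrees are capped at 6 here to keep the least-squares fits well conditioned):

```python
import numpy as np

def loocv_error(x, y, d):
    """Average leave-one-out validation error J_d for polynomial degree d."""
    m = len(x)
    errors = []
    for i in range(m):
        mask = np.arange(m) != i              # leave out the i-th instance
        w = np.polyfit(x[mask], y[mask], d)   # best parameter vector w_{d,i}
        errors.append((np.polyval(w, x[i]) - y[i]) ** 2)   # J_{d,i}
    return np.mean(errors)                    # J_d = (1/m) * sum_i J_{d,i}

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

J = {d: loocv_error(x, y, d) for d in range(1, 7)}
d_star = min(J, key=J.get)                    # d* = argmin_d J_d
print(J, d_star)
```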



Estimating true error for d = 1
D = {(0.86, 2.49), (0.09, 0.83), (−0.85, −0.25), (0.87, 3.10), (−0.44, 0.87),
(−0.43, 0.02), (−1.10, −0.12), (0.40, 1.81), (−0.96, −0.83), (0.17, 0.43)}.
Iter  D_train                  D_valid          Error_train  Error_valid (J_{1,i})
  1   D − {(0.86, 2.49)}       (0.86, 2.49)     0.4928       0.0044
  2   D − {(0.09, 0.83)}       (0.09, 0.83)     0.1995       0.1869
  3   D − {(−0.85, −0.25)}     (−0.85, −0.25)   0.3461       0.0053
  4   D − {(0.87, 3.10)}       (0.87, 3.10)     0.3887       0.8681
  5   D − {(−0.44, 0.87)}      (−0.44, 0.87)    0.2128       0.3439
  6   D − {(−0.43, 0.02)}      (−0.43, 0.02)    0.1996       0.1567
  7   D − {(−1.10, −0.12)}     (−1.10, −0.12)   0.5707       0.7205
  8   D − {(0.40, 1.81)}       (0.40, 1.81)     0.2661       0.0203
  9   D − {(−0.96, −0.83)}     (−0.96, −0.83)   0.3604       0.2033
 10   D − {(0.17, 0.43)}       (0.17, 0.43)     0.2138       1.0490
                               mean:            0.2188       0.3558



Leave-one-out cross-validation results
d   Error_train   Error_valid (J_d)
1   0.2188        0.3558
2   0.1504        0.3095
3   0.1384        0.4764
4   0.1259        1.1770
5   0.0742        1.2828
6   0.0598        1.3896
7   0.0458        38.819
8   0.0000        6097.5
9   0.0000        6097.5

• Typical overfitting behavior: as d increases, the training error decreases, but the validation error decreases at first, then starts increasing again
• Optimal choice: d = 2. Overfitting for d > 2



Estimating both hypothesis class and true error

• Suppose we want to compare polynomial regression with some other algorithm
• We chose the hypothesis class (i.e. the degree of the polynomial, d*) based on the estimates J_d
• Hence J_{d*} is not unbiased: our procedure was aimed at optimizing it
• If we want both a hypothesis class and an unbiased error estimate, we need to tweak the leave-one-out procedure a bit



Cross-validation with validation and testing sets
1. For each example j:
   (a) Create a test set consisting of just the j-th example, D_j = {(x_j, y_j)}, and a training and validation set D̄_j = D − {(x_j, y_j)}
   (b) Use the leave-one-out procedure from above on D̄_j (once!) to find a hypothesis, h*_j
       • Note that this will split the data internally, in order to both train and validate!
       • Typically, only one such split is used, rather than all possible splits
   (c) Evaluate the error of h*_j on D_j (call it J(h*_j))
2. Report the average of the J(h*_j) as a measure of performance of the whole algorithm

• Note that at this point we do not have one predictor, but several!
• Several methods can then be used to come up with just one predictor
(more on this later)



Summary of leave-one-out cross-validation

• A very easy to implement algorithm


• Provides a great estimate of the true error of a predictor
• It can indicate problematic examples in a data set (when using multiple
algorithms)
• Computational cost scales with the number of instances (examples), so
it can be prohibitive, especially if finding the best predictor is expensive
• We do not obtain one predictor, but many!
• Alternative: k-fold cross-validation: split the data set into k parts, then
proceed as above.
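
A sketch of the k-fold variant, reusing the polynomial setup above (k = 5 and the random seed are arbitrary choices):

```python
import numpy as np

def k_fold_error(x, y, d, k=5, seed=0):
    """k-fold cross-validation: hold out each of k parts in turn,
    train on the rest, and average the validation errors."""
    idx = np.random.default_rng(seed).permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all instances not in this fold
        w = np.polyfit(x[train], y[train], d)
        errs.append(np.mean((np.polyval(w, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)
```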



Regularization

• Remember the intuition: complicated hypotheses lead to overfitting


• Idea: change the error function to penalize hypothesis complexity:

J(w) = J_D(w) + λ J_pen(w)

This is called regularization in machine learning and shrinkage in statistics

• λ is called the regularization coefficient and controls how much we value fitting the data well vs. having a simple hypothesis



Regularization for linear models

• A squared penalty on the weights would make the math work nicely in our case:

(1/2) (Φw − y)^T (Φw − y) + (λ/2) w^T w

• This is also known as L2 regularization, or weight decay in neural networks
• By re-grouping terms, we get:

J_D(w) = (1/2) (w^T (Φ^T Φ + λI) w − w^T Φ^T y − y^T Φ w + y^T y)

• Optimal solution (obtained by solving ∇_w J_D(w) = 0):

w = (Φ^T Φ + λI)^{−1} Φ^T y
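
A direct numpy sketch of this closed form (the feature map and data here are hypothetical; np.linalg.solve is used rather than an explicit inverse, for numerical stability):

```python
import numpy as np

def ridge_weights(Phi, y, lam):
    """w = (Phi^T Phi + lam * I)^{-1} Phi^T y, the L2-regularized solution."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# Hypothetical example: cubic polynomial features phi(x) = (1, x, x^2, x^3).
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x)
Phi = np.vander(x, 4, increasing=True)
for lam in [0.0, 0.1, 10.0]:
    print(lam, ridge_weights(Phi, y, lam))   # weights shrink as lam grows
```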



What L2 regularization does

argmin_w (1/2) (Φw − y)^T (Φw − y) + (λ/2) w^T w = (Φ^T Φ + λI)^{−1} Φ^T y

• If λ = 0, the solution is the same as in regular least-squares linear regression
• If λ → ∞, the solution w → 0
• Positive λ will cause the magnitude of the weights to be smaller than in the usual linear solution
• This is also called ridge regression, and it is a special case of Tikhonov regularization (more on that later)
• A different view of regularization: we want to optimize the error while keeping the L2 norm of the weights, w^T w, bounded



Detour: Constrained optimization
Suppose we want to find

min_w f(w)   such that g(w) = 0

[Figure: the point x_A, gradients ∇f(x) and ∇g(x), and the constraint surface g(x) = 0]



Detour: Lagrange multipliers
[Figure repeated: x_A, ∇f(x), ∇g(x), and the constraint g(x) = 0]

• ∇g has to be orthogonal to the constraint surface (red curve)
• At the optimum, ∇f and ∇g have to be parallel (in the same or opposite direction)
• Hence, there must exist some λ ∈ R such that ∇f + λ∇g = 0
• Lagrangian function: L(x, λ) = f(x) + λg(x); λ is called a Lagrange multiplier
• We obtain the solution to our optimization problem by setting both ∇_x L = 0 and ∂L/∂λ = 0
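
A small worked example (not from the slides): minimize f(w) = w_1^2 + w_2^2 subject to g(w) = w_1 + w_2 − 1 = 0.

```latex
L(w, \lambda) = w_1^2 + w_2^2 + \lambda (w_1 + w_2 - 1)

\nabla_w L = 0 \;\Rightarrow\; 2 w_1 + \lambda = 0,\;\; 2 w_2 + \lambda = 0
  \;\Rightarrow\; w_1 = w_2 = -\lambda / 2

\partial L / \partial \lambda = 0 \;\Rightarrow\; w_1 + w_2 = 1
  \;\Rightarrow\; \lambda = -1,\;\; w_1 = w_2 = 1/2
```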



Detour: Inequality constraints
• Suppose we want to find

min_w f(w)   such that g(w) ≥ 0

[Figure: points x_A and x_B, gradients ∇f(x) and ∇g(x), the boundary g(x) = 0, and the region g(x) > 0]

• In the interior (g(x) > 0): simply find ∇f(x) = 0
• On the boundary (g(x) = 0): same situation as before, but the sign matters this time; for minimization, we want ∇f pointing in the same direction as ∇g



Detour: KKT conditions

• Based on the previous observations, let the Lagrangian be L(x, λ) = f(x) − λg(x)
• We minimize L with respect to x, subject to the following constraints:

λ ≥ 0
g(x) ≥ 0
λ g(x) = 0

• These are called Karush-Kuhn-Tucker (KKT) conditions



L2 Regularization for linear models revisited

• Optimization problem: minimize the error while keeping the norm of the weights bounded:

min_w J_D(w) = min_w (Φw − y)^T (Φw − y)   such that w^T w ≤ η

• The Lagrangian is:

L(w, λ) = J_D(w) − λ(η − w^T w) = (Φw − y)^T (Φw − y) + λ w^T w − λη

• For a fixed λ, and η = λ^{−1}, the best w is the same as that obtained by weight decay



Visualizing regularization (2 parameters)
[Figure: the L2-regularized solution w* in (w_1, w_2) space]

w* = (Φ^T Φ + λI)^{−1} Φ^T y



Pros and cons of L2 regularization

• If λ is at a “good” value, regularization helps to avoid overfitting


• Choosing λ may be hard: cross-validation is often used
• If there are irrelevant features in the input (i.e. features that do not
affect the output), L2 will give them small, but non-zero weights.
• Ideally, irrelevant input should have weights exactly equal to 0.



L1 Regularization for linear models

• Instead of requiring the L2 norm of the weight vector to be bounded, make the requirement on the L1 norm:

min_w J_D(w) = min_w (Φw − y)^T (Φw − y)   such that Σ_{i=1}^n |w_i| ≤ η

• This yields an algorithm called Lasso (Tibshirani, 1996)



Solving L1 regularization

• The optimization problem is a quadratic program
• There is one constraint for each possible sign pattern of the weights (2^n constraints for n weights)
• For example, with two weights:

min_{w_1, w_2} Σ_{j=1}^m (y_j − w_1 x_{j1} − w_2 x_{j2})^2

such that   w_1 + w_2 ≤ η
            w_1 − w_2 ≤ η
           −w_1 + w_2 ≤ η
           −w_1 − w_2 ≤ η

• Solving this program directly can be done for problems with a small number of inputs
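
For larger problems, the penalized form min_w ||Φw − y||^2 + λ||w||_1 is often solved by proximal-gradient methods instead of the quadratic program above. Below is a minimal sketch of one such method, ISTA (iterative soft-thresholding); this is a standard technique, though not one named on the slide:

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for min_w ||Phi w - y||^2 + lam * ||w||_1."""
    step = 0.5 / np.linalg.norm(Phi, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)     # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

For large λ, many entries of the returned w are exactly zero, which is the sparsity effect discussed next.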



Visualizing L1 regularization
[Figure: the L1 constraint region (a diamond) in (w_1, w_2) space, with the solution w* at a corner]

• If λ is big enough, the circle is very likely to intersect the diamond at one of the corners
• This makes L1 regularization much more likely to make some weights exactly 0



Pros and cons of L1 regularization

• If there are irrelevant input features, Lasso is likely to make their weights
0, while L2 is likely to just make all weights small
• Lasso is biased towards providing sparse solutions in general
• Lasso optimization is computationally more expensive than L2
• More efficient solution methods have to be used for large numbers of
inputs (e.g. least-angle regression, 2003).
• L1 methods of various types are very popular



Example of L1 vs L2 effect

From HTF: prostate data. Red lines: the choice of the regularization parameter by 10-fold CV.

[Figure: coefficient profiles on the prostate data for L2 regularization (left, plotted against degrees of freedom) and L1 regularization (right, plotted against shrinkage factor s), for the predictors lcavol, lweight, svi, pgg45, lbph, gleason, age, and lcp]

• Note the sparsity in the coefficients induced by L1
• Lasso is an efficient way of performing the L1 optimization



Bayesian view of regularization

• Start with a prior distribution over hypotheses


• As data comes in, compute a posterior distribution
• We often work with conjugate priors, which means that when combining
the prior with the likelihood of the data, one obtains the posterior in the
same form as the prior
• Regularization can be obtained from particular types of prior (usually,
priors that put more probability on simple hypotheses)
• E.g. L2 regularization can be obtained using a circular Gaussian prior for
the weights, and the posterior will also be Gaussian
• E.g. L1 regularization uses a double-exponential (Laplace) prior (see (Tibshirani, 1996))
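
A sketch of the L2 case, which the slide states without derivation: with a Gaussian prior w ∼ N(0, α^{−1} I) and Gaussian likelihood y_i ∼ N(w^T φ(x_i), σ^2), maximizing the log posterior gives exactly ridge regression.

```latex
\hat{w}_{\mathrm{MAP}} = \arg\max_w \; \log p(y \mid w) + \log p(w)
 = \arg\min_w \; \frac{1}{2\sigma^2} \sum_{i=1}^m \big(y_i - w^T \phi(x_i)\big)^2 + \frac{\alpha}{2} w^T w
 = \arg\min_w \; \frac{1}{2} (\Phi w - y)^T (\Phi w - y) + \frac{\lambda}{2} w^T w,
 \quad \text{with } \lambda = \alpha \sigma^2
```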



Bayesian view of regularization

• Prior is round Gaussian


• Posterior will be skewed by the data



What does the Bayesian view give us?
[Figure: four panels plotting t against x]

• Circles are data points


• Green is the true function
• Red lines on right are drawn from the posterior distribution



What does the Bayesian view give us?
[Figure: four panels plotting t against x, showing functions drawn from the posterior]

• Functions drawn from the posterior can be very different


• Uncertainty decreases where there are data points



What does the Bayesian view give us?

• Uncertainty estimates, i.e. how sure we are of the value of the function
• These can be used to guide active learning: ask about inputs for which
the uncertainty in the value of the function is very high
• In the limit, Bayesian and maximum likelihood learning converge to the
same answer
• In the short term, one needs a good prior to get good estimates of the
parameters
• Sometimes the prior is overwhelmed by the data likelihood too early.
• Using the Bayesian approach does NOT eliminate the need to do cross-
validation in general
• More on this later...



The anatomy of the error of an estimator
• Suppose we have examples ⟨x, y⟩ where y = f(x) + ε and ε is Gaussian noise with zero mean and standard deviation σ
• We fit a linear hypothesis h(x) = w^T x, so as to minimize the sum-squared error over the training data:

Σ_{i=1}^m (y_i − h(x_i))^2

• Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error
• Even if f were truly from the hypothesis class we picked, depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error



Bias-variance analysis
• Given a new data point x, what is the expected prediction error?
• Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution P(⟨x, y⟩) = P(x)P(y|x)
• The goal of the analysis is to compute, for an arbitrary given point x,

E_P[(y − h(x))^2 | x]

where y is the target value for x in a data set, and the expectation is over all training sets of a given size, drawn according to P
• For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:

Σ_x E_P[(y − h(x))^2 | x] P(x)

(if x is continuous, the sum becomes an integral, with appropriate conditions)
• We will decompose this expectation into three components



Recall: Statistics 101

• Let X be a random variable with possible values x_i, i = 1 ... n, and with probability distribution P(X)
• The expected value or mean of X is:

E[X] = Σ_{i=1}^n x_i P(x_i)

• If X is continuous, roughly speaking, the sum is replaced by an integral, and the distribution by a density function
• The variance of X is:

Var[X] = E[(X − E[X])^2] = E[X^2] − (E[X])^2



The variance lemma
Var[X] = E[(X − E[X])^2]
       = Σ_{i=1}^n (x_i − E[X])^2 P(x_i)
       = Σ_{i=1}^n (x_i^2 − 2 x_i E[X] + (E[X])^2) P(x_i)
       = Σ_{i=1}^n x_i^2 P(x_i) − 2 E[X] Σ_{i=1}^n x_i P(x_i) + (E[X])^2 Σ_{i=1}^n P(x_i)
       = E[X^2] − 2 E[X] E[X] + (E[X])^2 · 1
       = E[X^2] − (E[X])^2

We will use the form:

E[X^2] = (E[X])^2 + Var[X]



Bias-variance decomposition

• Simple algebra:

E_P[(y − h(x))^2 | x] = E_P[(h(x))^2 − 2 y h(x) + y^2 | x]
                      = E_P[(h(x))^2 | x] + E_P[y^2 | x] − 2 E_P[y | x] E_P[h(x) | x]

• Let h̄(x) = E_P[h(x) | x] denote the mean prediction of the hypothesis at x, when h is trained with data drawn from P
• For the first term, using the variance lemma, we have:

E_P[(h(x))^2 | x] = E_P[(h(x) − h̄(x))^2 | x] + (h̄(x))^2

• Note that E_P[y | x] = E_P[f(x) + ε | x] = f(x) (because of the linearity of expectation and the assumption ε ∼ N(0, σ))
• For the second term, using the variance lemma, we have:

E[y^2 | x] = E[(y − f(x))^2 | x] + (f(x))^2



Bias-variance decomposition (2)

• Putting everything together, we have:

E_P[(y − h(x))^2 | x] = E_P[(h(x) − h̄(x))^2 | x] + (h̄(x))^2 − 2 f(x) h̄(x)
                        + E_P[(y − f(x))^2 | x] + (f(x))^2
                      = E_P[(h(x) − h̄(x))^2 | x] + (f(x) − h̄(x))^2 + E[(y − f(x))^2 | x]

• The first term, E_P[(h(x) − h̄(x))^2 | x], is the variance of the hypothesis h at x, when trained with finite data sets sampled randomly from P
• The second term, (f(x) − h̄(x))^2, is the squared bias (or systematic error), which is associated with the class of hypotheses we are considering
• The last term, E[(y − f(x))^2 | x], is the noise, which is due to the problem at hand, and cannot be avoided
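
The decomposition can be checked empirically. A minimal simulation sketch (the true function, noise level, and query point are assumptions): draw many training sets from P, fit a polynomial of degree d to each, and measure the spread and offset of the predictions at a fixed point x.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)     # assumed true function
sigma, m, n_sets = 0.3, 20, 500         # noise std, training-set size, number of sets
x0 = 0.35                               # the query point x

for d in [1, 3, 9]:
    preds = np.array([
        np.polyval(np.polyfit(x, f(x) + rng.normal(0, sigma, m), d), x0)
        for x in (rng.uniform(0, 1, m) for _ in range(n_sets))
    ])
    variance = preds.var()               # E_P[(h(x) - hbar(x))^2 | x]
    bias2 = (f(x0) - preds.mean()) ** 2  # (f(x) - hbar(x))^2
    print(f"d={d}: variance={variance:.4f}, bias^2={bias2:.4f}, noise={sigma**2:.4f}")
```

Low-degree fits show high bias and low variance; high-degree fits show the reverse, matching the plot on the next slide.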



Error decomposition
[Figure: (bias)^2, variance, (bias)^2 + variance, and test error as functions of ln λ]

• The bias-variance sum approximates well the test error over a set of 1000
points
• x-axis measures the hypothesis complexity (decreasing left-to-right)
• Simple hypotheses usually have high bias (bias will be high at many
points, so it will likely be high for many possible input distributions)
• Complex hypotheses have high variance: the hypothesis is very dependent
on the data set on which it was trained.



Bias-variance trade-off

• Typically, bias comes from not having good hypotheses in the considered
class
• Variance results from the hypothesis class containing “too many”
hypotheses
• Maximum likelihood estimation is typically unbiased, but has high variance
• Bayesian estimation is biased, but typically has lower variance
• Hence, we are faced with a trade-off: choose a more expressive class
of hypotheses, which will generate higher variance, or a less expressive
class, which will generate higher bias
• Making the trade-off has to depend on the amount of data available to
fit the parameters (data usually mitigates the variance problem)



More on overfitting

• Overfitting depends on the amount of data, relative to the complexity of the hypothesis
• With more data, we can explore more complex hypothesis spaces, and still find a good solution

[Figure: two panels of t vs. x, with fits to N = 15 and N = 100 data points]

