
CS/ECE/ISyE 524 Introduction to Optimization Spring 2017–18

10. Regularization
• More on tradeoffs
• Regularization
• Effect of using different norms
• Example: hovercraft revisited

Laurent Lessard (www.laurentlessard.com)


Review of tradeoffs
Recap of tradeoffs:
• We want to make both J1(x) and J2(x) small subject to constraints.
• Choose a parameter λ > 0 and solve

      minimize_x    J1(x) + λ J2(x)
      subject to:   constraints

• Each λ > 0 yields a solution x̂λ.
• Can visualize the tradeoff by plotting J2(x̂λ) vs J1(x̂λ). This is
  called the Pareto curve. (A minimal code sketch follows below.)
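As a rough illustration, here is a minimal JuMP sketch of the scalarized problem for a single value of λ, assuming both costs are quadratic (‖A1·x − b1‖² and ‖A2·x − b2‖²), with hypothetical random data and the Ipopt solver; sweeping λ over a grid and recording (J1(x̂λ), J2(x̂λ)) traces out the Pareto curve.

```julia
using JuMP, Ipopt

# Hypothetical quadratic costs J1(x) = ‖A1*x − b1‖² and J2(x) = ‖A2*x − b2‖²
A1 = randn(4, 3); b1 = randn(4)
A2 = randn(2, 3); b2 = randn(2)
λ  = 1.0                           # tradeoff weight; sweep this to trace the Pareto curve

model = Model(Ipopt.Optimizer)
@variable(model, x[1:3])
# any problem constraints would be added here, e.g. @constraint(model, sum(x) == 1)
@objective(model, Min, sum((A1*x .- b1).^2) + λ * sum((A2*x .- b2).^2))
optimize!(model)
x̂ = value.(x)                      # the solution x̂λ for this particular λ
```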

Multi-objective tradeoff

• Similar procedure if we have more than two costs we'd like to make
  small, e.g. J1, J2, J3.
• Choose parameters λ > 0 and µ > 0. Then solve:

      minimize_x    J1(x) + λ J2(x) + µ J3(x)
      subject to:   constraints

• Each λ > 0 and µ > 0 yields a solution x̂λ,µ.
• Can visualize the tradeoff by plotting J3(x̂λ,µ) vs J2(x̂λ,µ) vs J1(x̂λ,µ)
  on a 3D plot. You then obtain a Pareto surface.

Minimum-norm as a regularization
• When Ax = b is underdetermined (A is wide), we can resolve the
  ambiguity by adding a cost function, e.g. min-norm LS:

      minimize_x    ‖x‖²
      subject to:   Ax = b

• Alternative approach: express it as a tradeoff!

      minimize_x    ‖Ax − b‖² + λ‖x‖²

• Tradeoffs of this type are called regularization, and λ is called the
  regularization parameter or regularization weight.
• If we let λ → ∞, we just obtain x̂ = 0.
• If we let λ → 0, we obtain the minimum-norm solution! (A small
  numerical check of both limits follows below.)
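Here is a quick numerical sanity check of these two limits, using a small hypothetical wide matrix; the closed-form solution used below is derived on the next slides.

```julia
using LinearAlgebra

A = randn(3, 6); b = randn(3)        # hypothetical wide (underdetermined) system
ridge(λ) = (A'A + λ*I) \ (A'b)       # solution of  minimize ‖Ax − b‖² + λ‖x‖²

norm(ridge(1e8))                     # ≈ 0: very large λ drives the solution to zero
norm(ridge(1e-8) - pinv(A)*b)        # ≈ 0: very small λ recovers the minimum-norm solution
```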
Proof of minimum-norm equivalence

      minimize_x    ‖Ax − b‖² + λ‖x‖²

Equivalent to the least squares problem (stacking A on top of √λ·I,
and b on top of 0):

      minimize_x    ‖ [A; √λ·I] x − [b; 0] ‖²

Solution is found via the pseudoinverse (for a tall matrix):

      x̂ = ( [A; √λ·I]ᵀ [A; √λ·I] )⁻¹ [A; √λ·I]ᵀ [b; 0]
        = (AᵀA + λI)⁻¹ Aᵀ b
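A short numerical check of this equivalence, with hypothetical data and the stacked matrix built explicitly:

```julia
using LinearAlgebra

A = randn(3, 6); b = randn(3); λ = 0.5     # hypothetical data
n = size(A, 2)
Astack = [A; sqrt(λ) * Matrix(I, n, n)]    # the stacked (tall) matrix [A; √λ·I]
bstack = [b; zeros(n)]                     # the stacked vector [b; 0]

x_ls    = Astack \ bstack                  # least-squares solution of the stacked problem
x_ridge = (A'A + λ*I) \ (A'b)              # closed-form regularized solution
norm(x_ls - x_ridge)                       # ≈ 0: the two agree
```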

Proof of minimum-norm equivalence
Solution of 2-norm regularization is:

      x̂ = (AᵀA + λI)⁻¹ Aᵀ b

• Can't simply set λ → 0 because A is wide, and therefore AᵀA will not
  be invertible.
• Use the fact that AᵀAAᵀ + λAᵀ can be factored two ways:

      (AᵀA + λI) Aᵀ = AᵀAAᵀ + λAᵀ = Aᵀ (AAᵀ + λI)

  so that

      Aᵀ (AAᵀ + λI)⁻¹ = (AᵀA + λI)⁻¹ Aᵀ

Proof of minimum-norm equivalence
Solution of 2-norm regularization is:

      x̂ = (AᵀA + λI)⁻¹ Aᵀ b

Also equal to:

      x̂ = Aᵀ (AAᵀ + λI)⁻¹ b

• Since AAᵀ is invertible, we can take the limit λ → 0 by simply
  setting λ = 0.
• In the limit: x̂ = Aᵀ (AAᵀ)⁻¹ b. This is the exact solution to the
  minimum-norm least squares problem we found before! (A numerical
  check follows below.)
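A numerical check of the identity and of the λ → 0 limit, again with hypothetical data:

```julia
using LinearAlgebra

A = randn(3, 6); b = randn(3); λ = 0.5    # hypothetical wide matrix with full row rank
x1 = (A'A + λ*I) \ (A'b)                  # (AᵀA + λI)⁻¹ Aᵀ b
x2 = A' * ((A*A' + λ*I) \ b)              # Aᵀ (AAᵀ + λI)⁻¹ b
norm(x1 - x2)                             # ≈ 0: the two expressions agree

x0 = A' * ((A*A') \ b)                    # λ = 0: the minimum-norm least-squares solution
```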

Tradeoff visualization

      minimize_x    ‖Ax − b‖² + λ‖x‖²

[Figure: Pareto curve plotting ‖x‖² (vertical axis) against ‖Ax − b‖²
(horizontal axis). As λ → 0 the curve approaches the point (0, ‖A†b‖²);
as λ → ∞ it approaches the point (‖b‖², 0).]
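A sketch of how such a curve could be traced numerically, assuming hypothetical data and a logarithmic grid of λ values:

```julia
using LinearAlgebra

A = randn(3, 6); b = randn(3)                 # hypothetical data
λs = 10.0 .^ range(-4, 4, length = 40)        # logarithmic grid of regularization weights
curve = map(λs) do λ
    x̂ = (A'A + λ*I) \ (A'b)
    (norm(A*x̂ - b)^2, norm(x̂)^2)              # one (‖Ax̂ − b‖², ‖x̂‖²) point on the Pareto curve
end
```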
Regularization

Regularization: an additional penalty term added to the cost function
to encourage a solution with desirable properties.

Regularized least squares:

      minimize_x    ‖Ax − b‖² + λ R(x)

• R(x) is the regularizer (penalty function)
• λ is the regularization parameter
• The model has different names depending on R(x).

Regularization

      minimize_x    ‖Ax − b‖² + λ R(x)

1. If R(x) = ‖x‖² = x1² + x2² + · · · + xn²
   It is called: L2 regularization, Tikhonov regularization, or Ridge
   regression, depending on the application. It has the effect of
   smoothing the solution.
2. If R(x) = ‖x‖₁ = |x1| + |x2| + · · · + |xn|
   It is called: L1 regularization or LASSO. It has the effect of
   sparsifying the solution (x̂ will have few nonzero entries).
3. If R(x) = ‖x‖∞ = max{|x1|, |x2|, . . . , |xn|}
   It is called: L∞ regularization. It has the effect of equalizing the
   solution (makes most components equal in magnitude).

(A JuMP sketch of the L1 and L∞ variants appears below.)
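Since the L1 and L∞ penalties are not differentiable, a common trick is to introduce auxiliary variables that bound the absolute values. Below is a minimal JuMP sketch of the LASSO case with hypothetical data, assuming a QP-capable solver such as Ipopt is installed; the L∞ case is analogous with a single scalar bound.

```julia
using JuMP, Ipopt

A = randn(10, 5); b = randn(10); λ = 0.3   # hypothetical data and regularization weight

# L1 regularization (LASSO): model |x_i| with auxiliary variables t_i ≥ |x_i|
model = Model(Ipopt.Optimizer)
@variable(model, x[1:5])
@variable(model, t[1:5] >= 0)
@constraint(model,  x .<= t)
@constraint(model, -t .<= x)
@objective(model, Min, sum((A*x .- b).^2) + λ * sum(t))
optimize!(model)
x̂ = value.(x)                              # tends to have several entries at (or near) zero

# For L∞ regularization, replace t[1:5] by a single scalar s ≥ 0 with
# -s ≤ x_i ≤ s for all i, and use the penalty λ*s instead of λ*sum(t).
```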
Norm balls

For a norm ‖·‖p, the norm ball of radius r is the set:

      Br = { x ∈ Rⁿ | ‖x‖p ≤ r }

[Figure: the three unit norm balls in R². Left: ‖x‖₂ ≤ 1, i.e.
x² + y² ≤ 1 (a disk). Middle: ‖x‖₁ ≤ 1, i.e. |x| + |y| ≤ 1 (a diamond).
Right: ‖x‖∞ ≤ 1, i.e. max{|x|, |y|} ≤ 1 (a square).]
Simple example
Consider the minimum-norm problem for different norms:

      minimize_x    ‖x‖p
      subject to:   Ax = b

• The set of solutions to Ax = b is an affine subspace.
• The solution is the point of that subspace belonging to the smallest
  norm ball.
• For p = 2, this occurs at the perpendicular (closest) point.

[Figure: a line of solutions to Ax = b in R², with the smallest 2-norm
ball touching it at the perpendicular point x.]
Simple example
• For p = 1, this occurs at one of the axes: sparsifying behavior.
• For p = ∞, this occurs at equal values of the coordinates:
  equalizing behavior.

[Figure: the same solution line, touched by the smallest 1-norm ball at
a point on an axis, and by the smallest ∞-norm ball at a point whose
coordinates are equal.]

(A JuMP sketch comparing the three norms follows below.)

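Here is a small sketch comparing the three norms on a toy underdetermined system (hypothetical data), assuming the HiGHS LP solver is installed; p = 2 has a closed form, while p = 1 and p = ∞ are written as linear programs.

```julia
using JuMP, HiGHS, LinearAlgebra

A = [1.0 2.0]; b = [2.0]                   # hypothetical constraint: x1 + 2*x2 = 2

x_p2 = A' * ((A*A') \ b)                   # p = 2: perpendicular (minimum 2-norm) point

# p = 1: minimize ‖x‖₁ subject to Ax = b, as an LP with |x_i| ≤ t_i
m1 = Model(HiGHS.Optimizer)
@variable(m1, x1v[1:2]); @variable(m1, t[1:2] >= 0)
@constraint(m1, A*x1v .== b)
@constraint(m1, x1v .<= t); @constraint(m1, -t .<= x1v)
@objective(m1, Min, sum(t))
optimize!(m1)
value.(x1v)                                # lands on an axis (sparse solution)

# p = ∞: minimize ‖x‖∞ subject to Ax = b, as an LP with a single bound s
mi = Model(HiGHS.Optimizer)
@variable(mi, xiv[1:2]); @variable(mi, s >= 0)
@constraint(mi, A*xiv .== b)
@constraint(mi, xiv .<= s); @constraint(mi, -s .<= xiv)
@objective(mi, Min, s)
optimize!(mi)
value.(xiv)                                # coordinates with equal magnitude (equalized solution)
```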
Another simple example
Suppose we have data points {y1, . . . , ym} ⊂ R, and we would like to
find the best single-number estimate x of the data, according to
different norms. Suppose the data is sorted: y1 ≤ · · · ≤ ym.

      minimize_x    ‖ (y1, . . . , ym) − (x, . . . , x) ‖p

• p = 2:  x̂ = (y1 + · · · + ym)/m.  This is the mean of the data.
• p = 1:  x̂ = y⌈m/2⌉.  This is the median of the data.
• p = ∞:  x̂ = (y1 + ym)/2.  This is the mid-range of the data.

Julia demo: Data Norm.ipynb
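A tiny plain-Julia sketch of these three estimators on hypothetical data (the referenced notebook presumably does something along these lines):

```julia
using Statistics: mean, median

y = sort([1.0, 2.0, 2.5, 4.0, 10.0])       # hypothetical data, sorted
x_p2   = mean(y)                           # p = 2: the mean
x_p1   = median(y)                         # p = 1: the median (y⌈m/2⌉ for odd m)
x_pinf = (y[1] + y[end]) / 2               # p = ∞: the mid-range (y1 + ym)/2
```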

Example: hovercraft revisited

One-dimensional version of the hovercraft problem:


• Start at x1 = 0 with v1 = 0 (at rest at position zero)
• Finish at x50 = 100 with v50 = 0 (at rest at position 100)
• Same simple dynamics as before:

      xt+1 = xt + vt
      vt+1 = vt + ut        for t = 1, 2, . . . , 49

• Decide thruster inputs u1, u2, . . . , u49.
• This time: minimize ‖u‖p

Example: hovercraft revisited

      minimize_{xt, vt, ut}    ‖u‖p
      subject to:   xt+1 = xt + vt    for t = 1, . . . , 49
                    vt+1 = vt + ut    for t = 1, . . . , 49
                    x1 = 0,  x50 = 100
                    v1 = 0,  v50 = 0

• This model has 149 variables (50 positions, 50 velocities, and 49
  thrusts), but it is very easy to understand. (A JuMP sketch of the
  full model follows below.)
• We can simplify the model considerably...
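A minimal JuMP sketch of this full model for the p = 2 case, assuming the Ipopt solver is available; the p = 1 and p = ∞ cases would use auxiliary variables as in the earlier sketch.

```julia
using JuMP, Ipopt

T = 50
model = Model(Ipopt.Optimizer)
@variable(model, x[1:T])                   # positions x1, ..., x50
@variable(model, v[1:T])                   # velocities v1, ..., v50
@variable(model, u[1:T-1])                 # thrusts u1, ..., u49
@constraint(model, [t = 1:T-1], x[t+1] == x[t] + v[t])
@constraint(model, [t = 1:T-1], v[t+1] == v[t] + u[t])
@constraint(model, x[1] == 0);  @constraint(model, x[T] == 100)
@constraint(model, v[1] == 0);  @constraint(model, v[T] == 0)
@objective(model, Min, sum(u.^2))          # minimize ‖u‖₂² (the p = 2 case)
optimize!(model)
value.(u)                                  # the optimal thrust profile
```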

Model simplification

      xt+1 = xt + vt
      vt+1 = vt + ut        for t = 1, 2, . . . , 49

      v50 = v49 + u49
          = v48 + u48 + u49
          = ...
          = v1 + (u1 + u2 + · · · + u49)

Model simplification

      xt+1 = xt + vt
      vt+1 = vt + ut        for t = 1, 2, . . . , 49

      x50 = x49 + v49
          = x48 + 2v48 + u48
          = x47 + 3v47 + 2u47 + u48
          = ...
          = x1 + 49v1 + (48u1 + 47u2 + · · · + 2u47 + u48)

Model simplification

      xt+1 = xt + vt
      vt+1 = vt + ut        for t = 1, 2, . . . , 49

The constraints can be rewritten as:

      [ 48  47  · · ·  2  1  0 ]   [ u1  ]     [ x50 − x1 − 49v1 ]
      [  1   1  · · ·  1  1  1 ] · [  ⋮  ]  =  [ v50 − v1        ]
                                   [ u49 ]

so we don’t need the intermediate variables xt and vt !

Julia demo: Hover 1D.ipynb
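A plain-Julia sketch of this simplified model: with the given boundary conditions the right-hand side is (100, 0), and the minimum 2-norm thrust has a closed form (the referenced notebook presumably handles the other norms with JuMP):

```julia
using LinearAlgebra

M = [collect(48.0:-1:0)'; ones(1, 49)]    # the 2×49 constraint matrix above
c = [100.0, 0.0]                          # right-hand side: x50 − x1 − 49v1 = 100, v50 − v1 = 0

u2 = M' * ((M*M') \ c)                    # minimum 2-norm thrust (the smooth profile)
# For the minimum 1-norm or ∞-norm thrust, solve  minimize ‖u‖p  s.t.  M*u = c
# as an LP, e.g. with JuMP auxiliary variables as in the earlier sketches.
```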

Results
1. Minimizing ‖u‖₂² (smooth)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −0.3 and 0.3.]

2. Minimizing ‖u‖₁ (sparse)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −3 and 3.]

3. Minimizing ‖u‖∞ (equalized)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −0.2 and 0.2.]
Tradeoff studies
1. Minimizing ‖u‖₂² + λ‖u‖₁ (smooth and sparse)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −0.4 and 0.4.]

2. Minimizing ‖u‖∞ + λ‖u‖₁ (equalized and sparse)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −0.6 and 0.6.]

3. Minimizing ‖u‖₂² + λ‖u‖∞ (equalized and smooth)

   [Figure: thrust vs time for t = 0 to 50; thrust values range roughly
   between −0.3 and 0.3.]
