Intro
Ryan Tibshirani
Convex Optimization 10-725
Course setup
Prerequisites: no formal ones, but the class will be fairly fast paced.
If you fall short on any of the assumed background, it's certainly possible to
catch up; but don't hesitate to talk to us
Evaluation:
• 6 homeworks
• 6 quizzes
• 1 little test
Scribing: sign up to scribe one lecture per semester, on the course
website (multiple scribes per lecture). Can bump up your grade in
boundary cases
Heads up: class will not be easy, but should be worth it ... !
Optimization in Machine Learning and Statistics

P: \quad \min_{x \in D} f(x)

This course: how to solve P, and why this is a good skill to have

Motivation: why do we bother?
Example: algorithms for linear trend filtering
Given observations $y_i \in \mathbb{R}$, $i = 1, \ldots, n$, corresponding to
underlying locations $x_i = i$, $i = 1, \ldots, n$

[Figure: noisy data (activation level vs. timepoint) with a fitted curve.
Linear trend filtering fits a piecewise linear function, with adaptively
chosen knots (Steidl et al. 2006, Kim et al. 2009).]

How? By solving

\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2
+ \lambda \sum_{i=1}^{n-2} |\theta_i - 2\theta_{i+1} + \theta_{i+2}|
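To make the problem concrete, here is a minimal Python sketch that solves the
trend filtering problem with a generic modeling tool (CVXPY). This is purely
illustrative: the simulated data, the choice of lambda, and the use of CVXPY
are assumptions, not part of the slides or of the algorithms compared below.

    # Minimal sketch: linear trend filtering via CVXPY (illustrative only).
    import numpy as np
    import cvxpy as cp

    n = 100
    rng = np.random.default_rng(0)
    x = np.arange(n)
    # made-up piecewise linear signal with one knot, plus noise
    truth = np.where(x < 40, 0.1 * x, 4 - 0.05 * (x - 40))
    y = truth + rng.normal(scale=0.5, size=n)

    # Second-difference matrix D, so (D @ theta)_i = theta_i - 2 theta_{i+1} + theta_{i+2}
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0

    lam = 10.0
    theta = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm(D @ theta, 1)
    cp.Problem(cp.Minimize(objective)).solve()

    theta_hat = theta.value  # piecewise linear fit with adaptively chosen knots

A generic solver is fine at this scale; the point of the next slide is that
different algorithms reach essentially the same solution at very different costs.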
Problem:

\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2
+ \lambda \sum_{i=1}^{n-2} |\theta_i - 2\theta_{i+1} + \theta_{i+2}|

[Figure: the same data (activation level vs. timepoint, 0 to 1000), with fitted
curves from three algorithms: interior point method, 20 iterations; proximal
gradient descent, 10K iterations; coordinate descent, 1000 cycles (all from
the dual).]
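As one concrete instance of the methods listed above, here is a rough sketch of
proximal (here, projected) gradient descent applied to a dual of the trend
filtering problem. The dual formulation and step size are standard, but they
are stated here as assumptions layered on the slide, which only names the
algorithms; this is not the implementation behind the figure.

    # Rough sketch: projected gradient descent on the dual of trend filtering.
    import numpy as np

    def trend_filter_dual_pg(y, lam, n_iter=10000):
        n = len(y)
        # Second-difference operator D
        D = np.zeros((n - 2, n))
        for i in range(n - 2):
            D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
        # Dual: minimize (1/2) ||y - D^T u||^2 subject to |u_i| <= lam
        u = np.zeros(n - 2)
        step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / largest eigenvalue of D D^T
        for _ in range(n_iter):
            grad = D @ (D.T @ u - y)
            u = np.clip(u - step * grad, -lam, lam)   # projection onto the box is the "prox"
        return y - D.T @ u   # primal solution recovered from the dual

    theta_hat = trend_filter_dual_pg(np.random.default_rng(0).normal(size=200), lam=10.0)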
What’s the message here?
Example: changepoints in the fused lasso
The fused lasso or total variation denoising problem:
\min_{\theta} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2
+ \lambda \sum_{i=1}^{n-1} |\theta_i - \theta_{i+1}|

This fits a piecewise constant function, given data $y_i$, $i = 1, \ldots, n$.
As tuning parameter λ decreases, we see more changepoints in the
solution θ̂
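A minimal sketch of this λ-dependence, again using CVXPY as a generic solver.
The simulated data and the crude changepoint-counting tolerance are assumptions
made up for illustration; only the three λ values are taken from the figure below.

    # Sketch: fused lasso fits across a few lambda values; smaller lambda -> more changepoints.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    truth = np.concatenate([np.ones(30), 0.2 * np.ones(40), np.ones(30)])
    y = truth + rng.normal(scale=0.2, size=100)
    n = len(y)

    # First-difference matrix: (D @ theta)_i = theta_i - theta_{i+1}
    D = np.eye(n)[:-1] - np.eye(n, k=1)[:-1]

    for lam in [25.0, 0.62, 0.41]:         # the lambda values shown below, on made-up data
        theta = cp.Variable(n)
        obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm(D @ theta, 1)
        cp.Problem(cp.Minimize(obj)).solve()
        jumps = np.sum(np.abs(np.diff(theta.value)) > 1e-4)   # crude changepoint count
        print(f"lambda = {lam}: {jumps} changepoints")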
[Figure: three panels showing the fused lasso fit to the same data at
λ = 25, λ = 0.62, and λ = 0.41; the smaller values of λ show more changepoints.]
Let's look at the solution at λ = 0.41 a little more closely.

[Figure: the fused lasso fit at λ = 0.41 (data over timepoints 0 to 100), with
the segments around the detected changepoint labeled A, B, C.]

How can we test the significance of detected changepoints? Say, at location 11?

Idea: compute the average of the data in region A minus the average in region B,
and compare this to what we expect if the signal was flat.

[Figure: two simulated flat-signal data sets, each with the regions A, B, C
marked, used as reference data.]

But it took 1222 simulated data sets to get one reference data set!
The role of optimization: if we understand the fused lasso, i.e., the
way it selects changepoints (stems from KKT conditions), then we
can come up with a reference distribution without simulation, and
conduct significance tests¹.

[Figure: the fused lasso fit at λ = 0.41 with significance tests at two detected
changepoints: p-value = 0.000 and p-value = 0.359.]

¹Hyun et al. 2018, "Exact post-selection inference for the generalized lasso path"
Wisdom from Friedman (1985)
Central concept: convexity
Wisdom from Rockafellar (1993)
Convex sets and functions

Convex set: $C \subseteq \mathbb{R}^n$ such that

x, y \in C \;\Longrightarrow\; tx + (1 - t)y \in C \quad \text{for all } 0 \le t \le 1

[Figure (Boyd & Vandenberghe, Figure 2.2): simple convex and nonconvex sets.
Left: a hexagon, including its boundary, is convex. Middle: a kidney-shaped set
is not convex, since the line segment between two points in the set is not
contained in the set. Right: a square containing some boundary points but not
others is not convex.]

Convex function: $f : \mathbb{R}^n \to \mathbb{R}$ such that $\mathrm{dom}(f) \subseteq \mathbb{R}^n$ is convex, and

f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y) \quad \text{for all } 0 \le t \le 1

and all $x, y \in \mathrm{dom}(f)$

[Figure (Boyd & Vandenberghe, Figure 3.1): graph of a convex function; the
chord (line segment) between any two points $(x, f(x))$ and $(y, f(y))$ on the
graph lies above the graph.]
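As a quick sanity check on the definition, the Python sketch below numerically
tests the convex-function inequality at random points for two example functions
(my own illustration, not from the slides; a finite sample of points can refute
convexity but never prove it).

    # Numerically spot-check the inequality f(t x + (1-t) y) <= t f(x) + (1-t) f(y).
    import numpy as np

    def looks_convex(f, dim=3, trials=10000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x, y = rng.normal(size=dim), rng.normal(size=dim)
            t = rng.uniform()
            if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
                return False   # found a violating chord: definitely not convex
        return True            # no violation found (not a proof of convexity)

    print(looks_convex(lambda z: np.sum(z ** 2)))      # True: squared norm is convex
    print(looks_convex(lambda z: np.sin(np.sum(z))))   # typically False: not convex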
Convex optimization problems

Optimization problem:

\min_{x \in D} \; f(x)
\quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \ldots, m,
\quad h_j(x) = 0, \; j = 1, \ldots, r

Here $D = \mathrm{dom}(f) \cap \bigcap_{i=1}^{m} \mathrm{dom}(g_i) \cap \bigcap_{j=1}^{r} \mathrm{dom}(h_j)$, the common
domain of all the functions.

This is a convex optimization problem provided that $f$ and $g_1, \ldots, g_m$ are
convex and $h_1, \ldots, h_r$ are affine:

h_j(x) = a_j^T x + b_j, \quad j = 1, \ldots, r
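For instance, a small problem written in exactly this form with CVXPY; the
particular objective and constraints here are made up for illustration.

    # A toy convex problem in standard form:
    #   minimize f(x) = ||A x - b||^2
    #   subject to g_i(x) = -x_i <= 0 and h(x) = 1^T x - 1 = 0.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

    x = cp.Variable(5)
    problem = cp.Problem(
        cp.Minimize(cp.sum_squares(A @ x - b)),   # convex objective f
        [x >= 0,                                  # inequality constraints g_i(x) <= 0
         cp.sum(x) == 1],                         # affine equality constraint h(x) = 0
    )
    problem.solve()
    print(x.value, problem.value)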
Local minima are global minima

For convex optimization problems, local minima are global minima.

This is a very useful fact and will save us a lot of trouble!

[Figure: example points on a convex function and on a nonconvex function.]
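A sketch of the standard argument (not spelled out on this slide): suppose $x$
is a local minimizer of a convex problem, and suppose some feasible $y$ had
$f(y) < f(x)$. By convexity of the feasible set, $ty + (1-t)x$ is feasible for
all $t \in [0, 1]$, and by convexity of $f$,

f\bigl(ty + (1 - t)x\bigr) \;\le\; t f(y) + (1 - t) f(x) \;<\; f(x)
\quad \text{for all } t \in (0, 1].

Taking $t$ small makes $ty + (1-t)x$ arbitrarily close to $x$ while strictly
improving the objective, contradicting local optimality. Hence no such $y$
exists, and $x$ is a global minimizer.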