
Introduction: Why Optimization?

Ryan Tibshirani
Convex Optimization 10-725
Course setup

Welcome to our course on Convex Optimization, with a focus on its ties to Machine Learning and Statistics!
Basic administrative details:


• Instructor: Ryan Tibshirani
• Education associate: Daniel Bird
• Teaching assistants: Chen Dan, William Guss, Po-Wei Wang,
Lingxiao Zhao
• Course website:
http://www.stat.cmu.edu/~ryantibs/convexopt/
• We will use Piazza for announcements and discussions
• We will use Canvas just as a gradebook

2
Prerequisites: no formal ones, but class will be fairly fast paced

Assume working knowledge of/proficiency with:


• Real analysis, calculus, linear algebra
• Core problems in Machine Learning and Statistics
• Programming (R, Python, Julia, your choice ...)
• Data structures, computational complexity
• Formal mathematical thinking

If you fall short on any one of these things, it’s certainly possible to
catch up; but don’t hesitate to talk to us

3
Evaluation:
• 6 homeworks
• 6 quizzes
• 1 little test

Quizzes: due at the same time as each homework. These should be pretty easy if you’ve attended lectures ...

Little test: same format as quizzes. Will be comprehensive (you are allowed one sheet of notes)

4
Scribing: sign up to scribe one lecture per semester, on the course
website (multiple scribes per lecture). Can bump up your grade in
boundary cases

Auditors: welcome, please audit rather than just sitting in

Heads up: class will not be easy, but should be worth it ... !

5
Optimization in Machine Learning and Statistics

Optimization problems underlie nearly everything we do in Machine Learning and Statistics. In other courses, you learn how to:

Conceptual idea  ⟶  translate into  ⟶  optimization problem

P :  min_{x ∈ D} f(x)

Examples of this? Examples of the contrary?

This course: how to solve P , and why this is a good skill to have

6
Motivation: why do we bother?

Presumably, other people have already figured out how to solve

P :  min_{x ∈ D} f(x)

So why bother? Many reasons. Here are three:


1. Different algorithms can perform better or worse for different
problems P (sometimes drastically so)
2. Studying P through an optimization lens can actually give you
a deeper understanding of the task/procedure at hand
3. Knowledge of optimization can actually help you create a new
problem P that is even more interesting/useful

Optimization moves quickly as a field. But there is still much room for progress, especially at its intersection with ML and Stats

7
Example: algorithms for linear trend filtering
Given observations y_i ∈ R, i = 1, . . . , n corresponding to underlying locations x_i = i, i = 1, . . . , n

[Figure: activation level vs. timepoint (n = 1000), with the linear trend filtering fit overlaid. Linear trend filtering fits a piecewise linear function, with adaptively chosen knots (Steidl et al. 2006, Kim et al. 2009)]

How? By solving

min_θ  (1/2) ∑_{i=1}^{n} (y_i − θ_i)² + λ ∑_{i=1}^{n−2} |θ_i − 2θ_{i+1} + θ_{i+2}|
8
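To make the problem concrete, here is a minimal sketch of how one might pose and solve this trend filtering problem with an off-the-shelf modeling tool (CVXPY in Python); the synthetic data and the value of λ are placeholders, not the data or settings from the slides.

```python
import numpy as np
import cvxpy as cp

# Synthetic series standing in for the activation-level data (placeholder)
n = 1000
x = np.arange(n)
y = np.sin(x / 80.0) + 0.3 * np.random.randn(n)

lam = 10.0                      # tuning parameter lambda (illustrative value)
theta = cp.Variable(n)

# (1/2) sum (y_i - theta_i)^2 + lambda * sum |theta_i - 2 theta_{i+1} + theta_{i+2}|
second_diff = theta[:-2] - 2 * theta[1:-1] + theta[2:]
objective = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(second_diff)

cp.Problem(cp.Minimize(objective)).solve()   # CVXPY hands this to a generic solver
theta_hat = theta.value                      # piecewise linear fit, knots chosen adaptively
```

A generic modeling tool like this is convenient for moderate n, but the point of the next slide is precisely that the choice of algorithm (interior point, proximal gradient, coordinate descent) matters a great deal for how quickly such problems are solved.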
Problem:  min_θ  (1/2) ∑_{i=1}^{n} (y_i − θ_i)² + λ ∑_{i=1}^{n−2} |θ_i − 2θ_{i+1} + θ_{i+2}|


[Figure: the data (activation level vs. timepoint) with the fit produced by each algorithm: interior point method, 20 iterations; proximal gradient descent, 10K iterations; coordinate descent, 1000 cycles (all applied to the dual)]

9
What’s the message here?

So what’s the right conclusion here?

Is the primal-dual interior point method simply a better method than proximal gradient descent or coordinate descent? ... No

In fact, different algorithms will work better in different situations. We’ll learn details throughout the course

In the linear trend filtering problem:


• Primal-dual: fast (structured linear systems)
• Proximal gradient: slow (conditioning)
• Coordinate descent: slow (large active set)

10
Example: changepoints in the fused lasso
The fused lasso or total variation denoising problem:
min_θ  (1/2) ∑_{i=1}^{n} (y_i − θ_i)² + λ ∑_{i=1}^{n−1} |θ_i − θ_{i+1}|

This fits a piecewise constant function, given data y_i, i = 1, . . . , n.
As the tuning parameter λ decreases, we see more changepoints in the solution θ̂
[Figure: fused lasso solutions on the same data at three values of the tuning parameter, λ = 25, λ = 0.62, and λ = 0.41, showing progressively more changepoints]
11
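The fused lasso fits the same mold as trend filtering; a minimal CVXPY sketch (again with placeholder data and λ values) differs only in penalizing first differences rather than second differences.

```python
import numpy as np
import cvxpy as cp

# Piecewise constant signal plus noise (placeholder data, not the slide's)
n = 100
signal = np.concatenate([np.ones(10), 0.2 * np.ones(40), np.zeros(50)])
y = signal + 0.15 * np.random.randn(n)

def fused_lasso(y, lam):
    """Solve min_theta (1/2) sum (y_i - theta_i)^2 + lam * sum |theta_i - theta_{i+1}|."""
    theta = cp.Variable(len(y))
    first_diff = theta[:-1] - theta[1:]
    obj = 0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(first_diff)
    cp.Problem(cp.Minimize(obj)).solve()
    return theta.value

# Smaller lambda => more changepoints in the piecewise constant solution
for lam in [25.0, 0.62, 0.41]:
    theta_hat = fused_lasso(y, lam)
    n_changepoints = np.sum(np.abs(np.diff(theta_hat)) > 1e-6)
    print(f"lambda = {lam}: {n_changepoints} changepoints")
```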
Let’s look at the solution at λ = 0.41 a little more closely
[Figure: the data and fused lasso fit at λ = 0.41, with regions A, B, C marked around the detected changepoints]

How can we test the significance of detected changepoints? Say, at location 11?

Classically: take the average of data points in region A minus the average in B, compare this to what we expect if the signal was flat

But this is incorrect, because location 11 was selected based on the data, so of course the difference in averages looks high!
12
What we want to do: compare our observed difference to that in
reference (null) data, in which the signal was flat and we happen
to select the same location 11 (and 50)
[Figure: the observed data (left) and one accepted reference data set (right), each with regions A, B, C marked]

Observed data: abs. difference ≈ 0.088. Reference data: abs. difference ≈ 0.072

But it took 1222 simulated data sets to get one reference data set!
13
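The brute-force version of this comparison can be sketched as follows: simulate flat-signal (null) data, run the same fused lasso, and keep only those data sets where the same changepoints happen to be selected. This sketch reuses the hypothetical fused_lasso helper from the earlier code block, and the noise level, tolerance, and region bookkeeping are illustrative assumptions; it is exactly the expensive approach the slide describes, since most simulated data sets are discarded.

```python
import numpy as np

def detected_changepoints(theta_hat, tol=1e-6):
    """Locations i where the fitted theta jumps between positions i and i+1."""
    return set(np.where(np.abs(np.diff(theta_hat)) > tol)[0] + 1)

def reference_differences(target_cp, lam, sigma, n, n_keep=20, max_tries=50000):
    """Collect |mean(A) - mean(B)| from flat-signal data sets that select the same changepoints."""
    rng = np.random.default_rng(0)
    diffs, tries = [], 0
    while len(diffs) < n_keep and tries < max_tries:
        tries += 1
        y_null = sigma * rng.standard_normal(n)       # flat (null) signal
        theta_hat = fused_lasso(y_null, lam)          # same procedure as on the observed data
        if detected_changepoints(theta_hat) != target_cp:
            continue                                  # discard: different changepoints selected
        a, b = sorted(target_cp)                      # e.g. locations 11 and 50
        diffs.append(abs(y_null[:a].mean() - y_null[a:b].mean()))
    return diffs, tries

# diffs, tries = reference_differences({11, 50}, lam=0.41, sigma=0.15, n=100)
# The observed |mean(A) - mean(B)| is then compared against diffs; 'tries' can be very large.
```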
The role of optimization: if we understand the fused lasso, i.e., the
way it selects changepoints (stems from KKT conditions), then we
can come up with a reference distribution without simulation
[Figure: the data and fused lasso fit, with p-values for the two detected changepoints: p-value = 0.000 at one changepoint, p-value = 0.359 at the other]

We can use this to efficiently conduct significance tests¹

¹ Hyun et al. 2018, “Exact post-selection inference for the generalized lasso path”
14
Wisdom from Friedman (1985)

From Jerry Friedman’s discussion of Peter Huber’s 1985 projection pursuit paper, in Annals of Statistics:

Arguably, less true today due to the advent of disciplined convex programming? But it still rings true in large part ...

15
Central concept: convexity

Historically, linear programs were the focus in optimization

Initially, it was thought that the important distinction was between linear and nonlinear optimization problems. But some nonlinear problems turned out to be much harder than others ...

Now it is widely recognized that the right distinction is between convex and nonconvex problems

Your supplementary textbooks for the course: Boyd and Vandenberghe (2004), and Rockafellar (1970)

16
Wisdom from Rockafellar (1993)

From Terry Rockafellar’s 1993 SIAM Review survey paper:

Credit to Nemirovski, Yudin, Nesterov, others for formalizing this

This view was dominant both within the optimization community and in many application domains for many decades (... currently being challenged by successes of neural networks?)

17
Convex sets and functions

Convex set: C ⊆ R^n such that

x, y ∈ C  =⇒  tx + (1 − t)y ∈ C  for all 0 ≤ t ≤ 1

Convex function: f : R^n → R such that dom(f) ⊆ R^n convex, and

f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y)  for all 0 ≤ t ≤ 1 and all x, y ∈ dom(f)

[Slide shows excerpts from Boyd and Vandenberghe: Figure 2.2, simple convex and nonconvex sets, and Figure 3.1, the graph of a convex function, in which the chord (line segment) between any two points on the graph lies above the graph]
18
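As a quick numerical illustration of the definition (not part of the slides), one can spot-check the inequality f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) at random points for a candidate function; this can refute convexity but, of course, never proves it.

```python
import numpy as np

def violates_convexity(f, dim, n_trials=10000, seed=0):
    """Search for x, y, t with f(t x + (1-t) y) > t f(x) + (1-t) f(y)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.standard_normal(dim), rng.standard_normal(dim)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return True      # found a violation: f is not convex
    return False             # no violation found (consistent with convexity)

print(violates_convexity(lambda z: np.sum(z**2), dim=3))        # False: ||z||^2 is convex
print(violates_convexity(lambda z: -np.sum(np.abs(z)), dim=3))  # True:  -||z||_1 is not convex
```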
Convex optimization problems

Optimization problem:

min_{x ∈ D}  f(x)
subject to   g_i(x) ≤ 0,  i = 1, . . . , m
             h_j(x) = 0,  j = 1, . . . , p

Here D = dom(f) ∩ ⋂_{i=1}^{m} dom(g_i) ∩ ⋂_{j=1}^{p} dom(h_j), the common domain of all the functions

This is a convex optimization problem provided the functions f and g_i, i = 1, . . . , m are convex, and h_j, j = 1, . . . , p are affine:

h_j(x) = a_j^T x + b_j,  j = 1, . . . , p

19
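In modeling-language form, this general problem maps directly onto code. Here is a small, entirely made-up instance (convex quadratic f, one convex inequality g, affine equalities h_j) sketched in CVXPY; all data values are placeholders.

```python
import numpy as np
import cvxpy as cp

# A small made-up instance: convex quadratic objective, one convex inequality
# constraint, and affine equality constraints (all values are placeholders).
n = 5
rng = np.random.default_rng(1)
Q = rng.standard_normal((n, n)); Q = Q.T @ Q + np.eye(n)     # positive definite
c = rng.standard_normal(n)
A = rng.standard_normal((2, n))
x0 = rng.standard_normal(n); x0 *= 0.5 / np.linalg.norm(x0)  # a known feasible point
b = A @ x0

x = cp.Variable(n)
f = cp.quad_form(x, Q) + c @ x            # convex objective f(x)
ineq = [cp.norm(x, 2) - 1 <= 0]           # convex inequality g(x) <= 0
eq = [A @ x == b]                         # affine equalities h_j(x) = 0

prob = cp.Problem(cp.Minimize(f), ineq + eq)
prob.solve()
print(prob.status, round(prob.value, 4))  # for a convex problem, this solution is globally optimal
```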
Local minima are global minima
For convex optimization problems, local minima are global minima

Formally, if x is feasible—x ∈ D, and satisfies all constraints—and minimizes f in a local neighborhood,

f(x) ≤ f(y) for all feasible y, ‖x − y‖₂ ≤ ρ,

then

f(x) ≤ f(y) for all feasible y


This is a very useful ●
● ●
fact and will save us ●

a lot of trouble!
● ● ●●

Convex Nonconvex
20
