
Foundations of Data Science, Fall 2024

Introduction to Data Science for Doctoral Students, Fall 2024

10. Convex Optimisation

Dr. Haozhe Zhang

October 21, 2024

MSc: https://lms.uzh.ch/url/RepositoryEntry/17589469505
PhD: https://lms.uzh.ch/url/RepositoryEntry/17589469506
Solving Machine Learning Problems

Most machine learning methods can be cast as optimisation problems.

• So far in the course: closed-form solutions,
  e.g., minimisation of the least squares and ridge regression objectives

• Most interesting learning problems do not admit closed-form solutions :(

Two approaches to solving problems beyond closed-form solutions:

1. Frame the objective of the ML problem as a standard mathematical problem and
   use an existing black-box solver. This works whenever the objective can be
   formulated as a convex optimisation problem.

2. Gradient-based optimisation methods. These are not black-box: the choice of
   optimisation hyper-parameters affects performance.
A Crash Course in Optimisation

Today:

• Convex optimisation

Next time:

• Recap: Gradients, Hessians

• Gradient Descent

• Stochastic Gradient Descent

• Constrained optimisation

Most machine learning packages, e.g., scikit-learn, tensorflow, octave, torch,
have optimisation methods readily implemented.

You need to understand the basics of optimisation to use them effectively.
Convex Sets
A set C ⊆ R^D is convex if for any x, y ∈ C and λ ∈ [0, 1], it holds that
λ x + (1 − λ) y ∈ C.

[Figure: examples of sets, one convex and two nonconvex]
Examples of Convex Sets

• The set R^D itself
  λ x + (1 − λ) y ∈ R^D for all x, y ∈ R^D and λ ∈ [0, 1]

• Intersections of convex sets
  Given convex sets C1, . . . , Cn, the set ∩_{i=1}^{n} Ci is convex

• Norm balls
  For any norm || · ||, the set B = {x ∈ R^D : ||x|| ≤ 1} is convex

• Polyhedra
  Given A ∈ R^{m×n} and b ∈ R^m, the polyhedron {x ∈ R^n : A x ≤ b} is convex

• Positive semidefinite cone
  The set S^D_+ of positive semidefinite matrices is convex
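A minimal numerical sanity check of the last point (a spot-check, not a proof), assuming NumPy is available: a convex combination of two randomly generated PSD matrices should again be PSD.

```python
import numpy as np

rng = np.random.default_rng(0)
G1, G2 = rng.standard_normal((2, 4, 4))
A, B = G1 @ G1.T, G2 @ G2.T            # Gram matrices are PSD by construction
lam = 0.3
C = lam * A + (1 - lam) * B            # convex combination of PSD matrices
print(np.linalg.eigvalsh(C).min())     # smallest eigenvalue; >= 0 up to rounding
```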
Showing the Set of PSD Matrices is Convex

Proof sketch: Let A, B ∈ S^D_+ and λ ∈ [0, 1]. For any z ∈ R^D,
z^T (λ A + (1 − λ) B) z = λ z^T A z + (1 − λ) z^T B z ≥ 0,
since both terms are nonnegative. Hence λ A + (1 − λ) B ∈ S^D_+.
Showing the Norm Balls Form Convex Sets

Proof sketch: Let x, y ∈ B = {x ∈ R^D : ||x|| ≤ 1} and λ ∈ [0, 1]. By the
triangle inequality and absolute homogeneity of norms,
||λ x + (1 − λ) y|| ≤ λ ||x|| + (1 − λ) ||y|| ≤ λ + (1 − λ) = 1,
so λ x + (1 − λ) y ∈ B.
Showing the Polyhedron is Convex + Example

Given A ∈ R^{m×n} and b ∈ R^m, the polyhedron P = {x ∈ R^n : A x ≤ b} is convex.

Proof sketch: Let x, y ∈ P and λ ∈ [0, 1]. By linearity,
A (λ x + (1 − λ) y) = λ A x + (1 − λ) A y ≤ λ b + (1 − λ) b = b,
so λ x + (1 − λ) y ∈ P. Example: the feasible region of a linear program,
e.g., {x ∈ R^2 : x1 + x2 ≤ 1, x1 ≥ 0, x2 ≥ 0}, is a polyhedron.
Convex Functions

A function f : R^D → R defined on a convex domain is convex if,

for all x, y where f is defined and 0 ≤ λ ≤ 1,

f (λ · x + (1 − λ) · y) ≤ λ · f (x) + (1 − λ) · f (y)
Examples of Convex Functions

• Affine functions: f (x) = b^T x + c

• Quadratic functions: f (x) = 1/2 · x^T A x + b^T x + c,
  where A is symmetric positive semidefinite

• Nonnegative weighted sums of convex functions: given convex functions
  f1, . . . , fn and weights w1, . . . , wn ∈ R≥0, the function
  f (x) = Σ_{i=1}^{n} wi · fi (x) is convex

• Norms: ∥ · ∥p for p ≥ 1 (the so-called ℓ0 "norm" is not a true norm and is
  not convex)
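To make the definition concrete, here is an illustrative NumPy sketch that spot-checks the convexity inequality for a quadratic f built from a randomly generated symmetric PSD matrix (the data here are hypothetical, chosen only for the check).

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5))
A = G @ G.T                                  # symmetric PSD matrix
b, c = rng.standard_normal(5), 0.7

def f(x):
    # quadratic function f(x) = 1/2 x^T A x + b^T x + c
    return 0.5 * x @ A @ x + b @ x + c

x, y = rng.standard_normal((2, 5))
for lam in np.linspace(0.0, 1.0, 11):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-9                 # convexity inequality holds
```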
Convex Optimisation

Given convex functions f, g1, . . . , gm and affine functions h1, . . . , hn,

a convex optimisation problem has the form:

minimise    f (x)
subject to  gi (x) ≤ 0,   i ∈ [m]
            hj (x) = 0,   j ∈ [n]

The goal is to find the optimal value v* of a convex optimisation problem:

S = {f (x) : gi (x) ≤ 0, i ∈ [m], hj (x) = 0, j ∈ [n]}   objective values over the feasible set
v* = inf S                                               optimal value of the objective
x* is a feasible point with f (x*) = v*                  optimal point (not necessarily unique)

Infeasible and unbounded instances

• v* := +∞ for infeasible instances (feasible = fulfils all constraints gi and hj)
• v* := −∞ for unbounded instances (unbounded = the objective values over the
  feasible set have no finite infimum)
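As an illustration of these notions, a small convex problem can be handed to a modelling tool. The sketch below assumes the cvxpy package is available (it is not mentioned in the lecture itself); the solver returns the optimal value v* and an optimal point x*.

```python
import cvxpy as cp
import numpy as np

x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x - np.array([3.0, 1.0])))  # convex f
constraints = [x >= 0, cp.sum(x) <= 2]                             # g_i(x) <= 0 form
problem = cp.Problem(objective, constraints)

v_star = problem.solve()       # optimal value v*
print(v_star, x.value)         # x.value is an optimal point x* (here unique)
```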
Local Optima are Global Optima for Convex Optimisation Problems

x is locally optimal if:

• x is feasible and
• There is B > 0 s.t. f (x) ≤ f (y) for all feasible y with ||x − y||_2 ≤ B.

x is globally optimal if:

• x is feasible and
• f (x) ≤ f (y) for all feasible y.

Theorem: For any convex optimisation problem, all locally optimal points are
globally optimal.

Local Optima are Global Optima for Convex Optimisation Problems: Proof

Proof sketch: Suppose x is locally optimal (with radius B) but not globally
optimal, i.e., there is a feasible y with f (y) < f (x). For λ ∈ (0, 1), the
point z = λ y + (1 − λ) x is feasible, since the feasible set is convex, and by
convexity of f,
f (z) ≤ λ f (y) + (1 − λ) f (x) < f (x).
Choosing λ small enough that ||x − z||_2 ≤ B contradicts local optimality of x.
Local Optima are Global Optima for Convex Optimisation Problems: Figure

[Figure: a convex function f with points x, z, y and ball radius B, showing
f (z) < f (x) for z between x and a better feasible point y]
Classes of Convex Optimisation Problems

Linear Programming:

minimise    c^T x + d
subject to  A x ≤ e
            B x = f

Quadratically Constrained Quadratic Programming:

minimise    1/2 · x^T B x + c^T x + d
subject to  1/2 · x^T Qi x + ri^T x + si ≤ 0,   i ∈ [m]
            A x = b

Semidefinite Programming:

minimise    tr(C X)
subject to  tr(Ai X) = bi,   i ∈ [m]
            X positive semidefinite

For a matrix B, tr(B) is the trace of B.
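For instance, the semidefinite program above can be written almost verbatim with cvxpy (an assumption here, together with an SDP-capable solver such as SCS); the data C, A1, b1 are illustrative.

```python
import cvxpy as cp
import numpy as np

n = 3
C = np.eye(n)                                 # objective matrix
A1, b1 = np.ones((n, n)), 1.0                 # single equality constraint

X = cp.Variable((n, n), PSD=True)             # X restricted to the PSD cone
problem = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                     [cp.trace(A1 @ X) == b1])
problem.solve()
print(problem.value, X.value)
```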
Linear Programming

Looking for solutions x ∈ R^n to the following optimisation problem:

minimise    c^T x + d
subject to  A x ≤ e
            B x = f

• No closed-form solution
• Efficient algorithms exist, both in theory and practice
  (for tens of thousands of variables)
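A minimal sketch of solving such an LP with SciPy's linprog, assuming SciPy is installed and the data below are illustrative. Note that linprog imposes x ≥ 0 by default, which this example keeps; pass explicit bounds otherwise.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])                 # minimise c^T x (d only shifts the value)
A = np.array([[1.0, 1.0], [-1.0, 1.0]])    # A x <= e
e = np.array([3.0, 1.0])

res = linprog(c, A_ub=A, b_ub=e)           # default bounds keep x >= 0
print(res.fun, res.x)                      # optimal value -5 at x = (1, 2)
```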
Linear Model with Absolute Loss

Suppose we have data (X, y) and that we want to minimise the objective:

L(w) = Σ_{i=1}^{N} |w^T xi − yi|

We would like to transform this optimisation problem into a linear program.

We introduce one variable ζi for each datapoint.

The linear program in the D + N variables w1, . . . , wD, ζ1, . . . , ζN:

minimise    Σ_{i=1}^{N} ζi
subject to  w^T xi − yi ≤ ζi,   i ∈ [N]
            yi − w^T xi ≤ ζi,   i ∈ [N]

The solution to this linear program gives w that minimises the objective L.
Linear Model with Absolute Loss via Linear Programming (1/2)
L(w) = Σ_{i=1}^{N} |w^T xi − yi|

minimise    Σ_{i=1}^{N} ζi
subject to  w^T xi − yi ≤ ζi,   i ∈ [N]
            yi − w^T xi ≤ ζi,   i ∈ [N]

Claim: The solution to this linear program gives w that minimises the objective L.
Linear Model with Absolute Loss via Linear Programming (2/2)

Proof sketch: For any fixed w, the two constraints say ζi ≥ w^T xi − yi and
ζi ≥ yi − w^T xi, so the smallest feasible ζi is |w^T xi − yi|. At an optimum
each ζi takes this smallest value (otherwise the objective could be decreased),
so the LP objective equals Σ_{i=1}^{N} |w^T xi − yi| = L(w). Minimising over
(w, ζ) therefore yields a w that minimises L.
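Putting the pieces together, here is an illustrative implementation of this LP with scipy.optimize.linprog (SciPy assumed; the data X, y are synthetic). The variable vector stacks w and the ζ's.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

c = np.concatenate([np.zeros(D), np.ones(N)])    # objective: sum of the zetas
I = np.eye(N)
A_ub = np.block([[X, -I],                        # w^T x_i - y_i <= zeta_i
                 [-X, -I]])                      # y_i - w^T x_i <= zeta_i
b_ub = np.concatenate([y, -y])

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (D + N))   # all variables free in sign
w_hat = res.x[:D]                                # weights minimising the absolute loss
print(w_hat)
```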
Recall: Likelihood of Linear Regression (Gaussian Noise Model)

Likelihood:

p(y | X, w, σ) = (1 / (2πσ^2))^{N/2} · exp( −(1 / (2σ^2)) · (Xw − y)^T (Xw − y) )

Maximise Likelihood = Maximise Log-Likelihood (log : R+ → R is increasing)

LL(y | X, w, σ) = −(N/2) · log(2πσ^2) − (1 / (2σ^2)) · (Xw − y)^T (Xw − y)

Maximise Log-Likelihood = Minimise Negative Log-Likelihood

NLL(y | X, w, σ) = (N/2) · log(2πσ^2) + (1 / (2σ^2)) · (Xw − y)^T (Xw − y)
                 = (N/2) · log(2πσ^2) + (1 / (2σ^2)) · ( w^T X^T X w − 2 y^T X w + y^T y )

Here w^T X^T X w has the form w^T B w, −2 y^T X w has the form c^T w, and
y^T y and the log term are constants in w.

This is a convex quadratic optimisation problem with no constraints!
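Since the NLL is an unconstrained convex quadratic in w, its minimiser coincides with the least-squares solution; a short NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
w_true = np.array([2.0, 0.0, -1.0, 3.0])
y = X @ w_true + 0.05 * rng.standard_normal(100)

# lstsq minimises ||Xw - y||^2, i.e., the w-dependent part of the NLL
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                   # close to w_true
```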
Minimising the Lasso Objective

For the Lasso objective, i.e., a linear model with ℓ1-regularisation, we have

Llasso(w) = Σ_{i=1}^{N} (w^T xi − yi)^2 + λ · Σ_{i=1}^{D} |wi|
          = w^T X^T X w − 2 y^T X w + y^T y + λ · Σ_{i=1}^{D} |wi|

• The quadratic part of the loss function cannot be framed as a linear program

• Lasso regularisation does not allow for closed-form solutions

• The problem can be rephrased as a quadratic programming problem

• Alternatively, resort to general optimisation methods
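One way to minimise the Lasso objective exactly as written, again assuming the cvxpy package is available (the data and λ below are illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.5, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(100)
lam = 1.0

w = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(X @ w - y) + lam * cp.norm(w, 1))
cp.Problem(objective).solve()                  # unconstrained convex problem
print(w.value)                                 # some coordinates driven toward zero
```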
