
UW-Madison CS/ISyE/Math/Stat 726 Spring 2024

Lecture 1–2: Optimization Background

Yudong Chen

1 Introduction
Our standard optimization problem:

    min_{x ∈ X} f(x)                    (P)

• x: a vector, the optimization/decision variable

• X: the feasible set

• f(x): the objective function, real-valued

• max_x f(x) ⟺ min_x −f(x)

The (optimal) value of (P):

    val(P) = inf_{x ∈ X} f(x).

To fully specify (P), we need to specify

• vector space, feasible set, objective function;

• what it means to solve (P).

1.1 Can we even hope to solve an arbitrary optimization problem?


Example 1. Suppose we want to find positive integers x, y, z satisfying

    x^3 + y^3 = z^3.

This can be formulated as a (continuous) optimization problem (PF):

    min_{x,y,z,n}  (x^n + y^n − z^n)^2
    s.t.  x ≥ 1, y ≥ 1, z ≥ 1, n ≥ 3,                    (PF)
          sin^2(πn) + sin^2(πx) + sin^2(πy) + sin^2(πz) = 0.

If we could certify that val(PF) ≠ 0, we would have a proof of Fermat's Last Theorem
(1637):

For any n ≥ 3, x^n + y^n = z^n has no solutions over the positive integers.

Proved by Andrew Wiles in 1994.
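The formulation (PF) can be evaluated numerically. A minimal sketch (the test point (3, 4, 5, 3) is an arbitrary choice): the sin² constraint is zero exactly at integer points, and the objective there is positive, consistent with the theorem.

```python
import math

def objective(x, y, z, n):
    # the (PF) objective: (x^n + y^n - z^n)^2
    return (x ** n + y ** n - z ** n) ** 2

def constraint(x, y, z, n):
    # sin^2(pi*n) + sin^2(pi*x) + sin^2(pi*y) + sin^2(pi*z),
    # which equals zero iff all four variables are integers
    return sum(math.sin(math.pi * t) ** 2 for t in (n, x, y, z))

# Integer points satisfy the constraint (up to floating point)...
c = constraint(3, 4, 5, 3)
# ...and the objective there is (27 + 64 - 125)^2 = 1156 > 0
v = objective(3, 4, 5, 3)
```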


Example 2. Unconstrained optimization, many local minima.¹

We cannot hope for solving an arbitrary optimization problem.


We need some structure.

2 Specifying the optimization problem


2.1 Vector space
This is where the optimization variable and the feasible set live.
(R^d, ∥·∥): normed vector space, the "primal space".

• The variable x is a (column) vector in R^d:

    x = (x_1, x_2, …, x_d)^⊤.

• The norm tells us how to measure distances in Rd .


Most often, we will take ∥x∥ = ∥x∥_2 = (∑_{i=1}^d x_i^2)^{1/2} (Euclidean norm).
We sometimes also consider the ℓ_p norm ∥x∥_p = (∑_{i=1}^d |x_i|^p)^{1/p}, p ≥ 1:

• ∥x∥_1 = ∑_i |x_i|,

• ∥x∥_∞ = max_{1≤i≤d} |x_i|.

(Plots of unit balls of ℓ2 , ℓ1 , ℓ∞ norms.)
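As a concrete illustration (a short NumPy sketch; the vector is an arbitrary choice), the three norms above can be computed directly, and `np.linalg.norm` implements the same ℓ_p family:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# l1 norm: sum of absolute values
l1 = np.abs(x).sum()            # 7.0
# l2 (Euclidean) norm: square root of the sum of squares
l2 = np.sqrt((x ** 2).sum())    # 5.0
# l-infinity norm: largest absolute entry
linf = np.abs(x).max()          # 4.0

# np.linalg.norm(x, p) computes the same quantities
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```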


¹ Left: plot by Jelena Diakonikolas. Right: loss surfaces of ResNet-56 without skip connections (https://arxiv.org/pdf/1712.09913.pdf).


We will use ⟨·, ·⟩ to denote inner products. The standard inner product is

    ⟨x, y⟩ = x^⊤ y = ∑_{i=1}^d x_i y_i.

When we work with (R^d, ∥·∥_p), we view ⟨y, x⟩ as the value of a linear function y at x. So, if we are measuring the length of x using ∥·∥_p, we should measure the length of y using ∥·∥_q, where 1/p + 1/q = 1.
Definition 1 (Dual norm). The dual norm of ∥·∥ is given by

    ∥z∥_* := sup_{∥x∥ ≤ 1} ⟨z, x⟩.

From the definition we immediately have the following.

Proposition 1 (Hölder's inequality). For all z, x ∈ R^d:

    |⟨z, x⟩| ≤ ∥z∥_* · ∥x∥.
Proof. Fix any two vectors x, z. Assume x ≠ 0 and z ≠ 0; otherwise the inequality is trivial. Define x̂ = x/∥x∥. Then

    ∥z∥_* ≥ ⟨z, x̂⟩ = ⟨z, x⟩/∥x∥,

and hence ⟨z, x⟩ ≤ ∥z∥_* · ∥x∥. Applying the same argument with x replaced by −x proves −⟨z, x⟩ ≤ ∥z∥_* · ∥x∥.
Example 3. ∥·∥_p and ∥·∥_q are duals when 1/p + 1/q = 1. In particular, ∥·∥_2 is its own dual; ∥·∥_1 and ∥·∥_∞ are dual to each other.
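A quick numerical sanity check of the Hölder inequality on random vectors (a sketch, not a proof; the dual pairs tested are (p, q) = (2, 2) and (1, ∞)):

```python
import numpy as np

rng = np.random.default_rng(0)

def holder_holds(z, x, p, q):
    # |<z, x>| <= ||z||_q * ||x||_p for dual exponents 1/p + 1/q = 1
    lhs = abs(z @ x)
    rhs = np.linalg.norm(z, q) * np.linalg.norm(x, p)
    return lhs <= rhs + 1e-12  # small slack for floating point

ok = all(
    holder_holds(rng.standard_normal(5), rng.standard_normal(5), p, q)
    for _ in range(100)
    for p, q in [(2, 2), (1, np.inf)]
)
```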
In R^d, all ℓ_p norms are equivalent. In particular,

    ∀ x ∈ R^d, p ≥ 1, r > p:  ∥x∥_r ≤ ∥x∥_p ≤ d^{1/p − 1/r} ∥x∥_r.
However, choice of norm affects how algorithm performance depends on dimension d.
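The equivalence bounds can likewise be checked numerically (a sketch under arbitrary choices p = 1, r = 2, d = 10, so the constant is d^{1/1 − 1/2} = √10):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, r = 10, 1, 2
factor = d ** (1 / p - 1 / r)  # d^{1/p - 1/r} = sqrt(10)

bounds_hold = True
for _ in range(100):
    x = rng.standard_normal(d)
    norm_p = np.linalg.norm(x, p)
    norm_r = np.linalg.norm(x, r)
    # ||x||_r <= ||x||_p <= d^{1/p - 1/r} ||x||_r
    if not (norm_r <= norm_p + 1e-12 and norm_p <= factor * norm_r + 1e-12):
        bounds_hold = False
```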

2.2 Feasible set


The feasible set
X ⊆ Rd
specifies what solution points we are allowed to output.
If X = Rd , we say that (P) is unconstrained. Otherwise we say that (P) is constrained.
X can be specified:
• as an abstract geometric body (a ball, a box, a polyhedron, a convex set)
• via functional constraints:

    g_i(x) ≤ 0,  i = 1, 2, …, m,
    h_i(x) = 0,  i = 1, …, p.

Note that a constraint f_i(x) ≥ C is equivalent to g_i(x) ≤ 0 with g_i(x) = C − f_i(x).


Example 4.

    X = B_2(0, 1) = {x ∈ R^d : ∥x∥_2 ≤ 1}   (the unit Euclidean ball).

In this class, we will always assume that X is closed.

Heine–Borel Theorem: X ⊆ R^d is closed and bounded if and only if it is compact (if X ⊆ ∪_{α∈A} U_α for some family of open sets {U_α}, then there exists a finite subfamily {U_{α_i}}_{i=1}^n such that X ⊆ ∪_{1≤i≤n} U_{α_i}).

Weierstrass Extreme Value Theorem: If X is compact and f is a function that is defined and
continuous on X , then f attains its extreme values on X .
What if X is not bounded? Consider f(x) = e^x. Then inf_{x∈R} f(x) = 0, but the infimum is not attained.
When we work with unconstrained problems, we will normally assume that f is bounded
below.

Convex sets: Except for some special cases, we often assume that the feasible set is convex, so
that we will be able to guarantee tractability.

Definition 2 (Convex set). A set X ⊆ R^d is convex if

    ∀ x, y ∈ X, ∀ α ∈ (0, 1):  (1 − α)x + αy ∈ X.

A picture.

We cannot hope to deal with arbitrary nonconvex constraints. E.g., x_i(1 − x_i) = 0 ⟺ x_i ∈ {0, 1}, which encodes integer programs.

2.3 Objective function


“cost”, “loss”
Extended real valued functions:

f : D → R ∪ {−∞, ∞} ≡ R̄.

Here f is defined on D ⊆ Rd . Can extend the definition of f to all of Rd by assigning the value
+∞ at each point x ∈ Rd \ D .
Effective domain:

    dom(f) = {x ∈ R^d : f(x) < +∞}.

In the sequel, domain means effective domain.

“Linear and nonlinear optimization” ≈ “continuous optimization” (in contrast to discrete/combinatorial optimization).


2.3.1 Lower semicontinuous functions


We mostly assume f to be continuous, which can be relaxed slightly.
Definition 3. A function f : R^d → R̄ is said to be lower semicontinuous (l.s.c.) at x ∈ R^d if

    f(x) ≤ lim inf_{y→x} f(y).

We say f is l.s.c. on R^d if it is l.s.c. at every point x ∈ R^d.

This definition is mainly useful for allowing indicator functions.


Example 5. Verify yourself: the indicator of a closed set is l.s.c.

    I_X(x) = 0 if x ∈ X;  ∞ if x ∉ X.

Using I_X we can write

    min_{x ∈ X} f(x) ≡ min_{x ∈ R^d} { f(x) + I_X(x) },
thereby unifying constrained and unconstrained optimization.
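A brute-force sketch of this unification (the function f(x) = (x − 3)² and set X = [0, 1] are arbitrary choices for illustration): minimizing f + I_X over a grid covering the whole line gives the same value as minimizing f over the grid points inside X.

```python
import math

def f(x):
    return (x - 3.0) ** 2

def indicator(x, lo=0.0, hi=1.0):
    # I_X for the interval X = [lo, hi]: 0 inside, +infinity outside
    return 0.0 if lo <= x <= hi else math.inf

grid = [i / 1000.0 for i in range(-2000, 5001)]  # covers [-2, 5]

# Constrained: minimize f over grid points inside X
constrained_min = min(f(x) for x in grid if 0.0 <= x <= 1.0)
# "Unconstrained": minimize f + I_X over the whole grid
penalized_min = min(f(x) + indicator(x) for x in grid)
```

Both searches return f(1) = 4.0, the constrained minimum at the boundary point x = 1.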

2.3.2 Continuous and smooth functions


Unless we are abstracting away constraints, the least we will assume about f is that it is continu-
ous.
Sometimes we consider stronger assumptions.
Definition 4. f : Rd → R̄ is said to be
1. Lipschitz-continuous on X ⊆ R^d (w.r.t. the norm ∥·∥) if there exists M < ∞ such that

    ∀ x, y ∈ X:  |f(x) − f(y)| ≤ M ∥x − y∥.

2. Smooth on X ⊆ R^d (w.r.t. the norm ∥·∥) if f's gradient is Lipschitz-continuous, i.e., there exists L < ∞ such that²

    ∀ x, y ∈ X:  ∥∇f(x) − ∇f(y)∥_* ≤ L ∥x − y∥.

(Gradient: ∇f(x) = (∂f/∂x_1, …, ∂f/∂x_d)^⊤.)

² This definition can be viewed as a quantitative version of C¹-smoothness.


• Picture:

In Rd , Lipschitz-continuity in some norm implies the same for every other norm, but M may differ.

Example 6. f(x) = ½∥x∥_2^2 is 1-smooth on R^d w.r.t. ∥·∥_2. The log-sum-exp (or softmax) function f(x) = log ∑_{i=1}^d exp(x_i) is 1-smooth on R^d w.r.t. ∥·∥_∞.
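A numerical sanity check of the second claim (random pairs only, not a proof): the gradient of log-sum-exp is the softmax vector, and since ∥·∥_1 is the dual of ∥·∥_∞, 1-smoothness w.r.t. ∥·∥_∞ reads ∥∇f(x) − ∇f(y)∥_1 ≤ ∥x − y∥_∞.

```python
import numpy as np

def softmax(x):
    # gradient of f(x) = log(sum_i exp(x_i)), computed stably
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
smooth_ok = True
for _ in range(200):
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    lhs = np.abs(softmax(x) - softmax(y)).sum()  # ||grad f(x) - grad f(y)||_1
    rhs = np.abs(x - y).max()                    # ||x - y||_inf
    if lhs > rhs + 1e-10:
        smooth_ok = False
```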

Example 7. A function that is continuously differentiable on its domain but not smooth:

    f(x) = 1/x,  dom(f) = R_{++}.

2.3.3 Convex functions


Definition 5. f : Rd → R̄ is convex if ∀ x, y ∈ Rd , ∀α ∈ (0, 1) :

f ((1 − α) x + αy) ≤ (1 − α) f ( x ) + α f (y).

A picture.
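The definition can be tested pointwise (a sketch; the random points and the choice f(x) = ∥x∥_2^2 are for illustration only, and a finite sample is of course not a certificate of convexity):

```python
import numpy as np

def f(x):
    return float(x @ x)  # ||x||_2^2, a convex function

rng = np.random.default_rng(0)
convex_ok = True
for _ in range(200):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    a = rng.uniform(0.0, 1.0)
    # check f((1-a)x + ay) <= (1-a) f(x) + a f(y)
    if f((1 - a) * x + a * y) > (1 - a) * f(x) + a * f(y) + 1e-10:
        convex_ok = False
```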

Lemma 1. f : R^d → R is convex if and only if its epigraph

    epi(f) := {(x, a) : x ∈ R^d, a ∈ R, f(x) ≤ a}

is convex.

Proof. Follows from definitions. Left as exercise.

Definition 6. We say that a function f : Rd → R̄ is proper if ∃ x ∈ Rd s.t. f ( x ) ∈ R.

Lemma 2. If f : Rd → R̄ is proper and convex, then dom( f ) is convex.
