Lecture 1–2: Background
Yudong Chen
1 Introduction
Our standard optimization problem
\[ \min_{x \in \mathcal{X}} f(x) \tag{P} \]
• X : feasible set
• $\max_x f(x) \iff \min_x -f(x)$
\[ x^3 + y^3 = z^3. \]
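\[
\begin{aligned}
\min_{x, y, z, n} \quad & (x^n + y^n - z^n)^2 \\
\text{s.t.} \quad & x \ge 1,\ y \ge 1,\ z \ge 1,\ n \ge 3, \\
& \sin^2(\pi n) + \sin^2(\pi x) + \sin^2(\pi y) + \sin^2(\pi z) = 0.
\end{aligned}
\tag{PF}
\]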
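If we could certify whether $\mathrm{val}(\mathrm{PF}) \neq 0$, we would have found a proof of Fermat's Last Theorem
(1637): for every integer $n \ge 3$, the equation $x^n + y^n = z^n$ has no solution in positive integers $x, y, z$.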
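To make the structure of (PF) concrete, here is a minimal sketch that evaluates the objective on a small integer grid. The constraint with the $\sin^2$ terms forces all four variables to be integers, so restricting to integers is faithful to the problem. The function names `pf_objective` and `min_over_box` and the search bounds are illustrative assumptions, and a finite search of this kind can never certify $\mathrm{val}(\mathrm{PF}) \neq 0$; it only shows what the objective looks like.

```python
def pf_objective(x, y, z, n):
    """Objective of (PF) at an integer point (exact integer arithmetic)."""
    return (x ** n + y ** n - z ** n) ** 2


def min_over_box(bound, max_n):
    """Smallest objective value over 1 <= x, y, z <= bound and 3 <= n <= max_n.

    The bounds are hypothetical; the true problem is unbounded.
    """
    return min(
        pf_objective(x, y, z, n)
        for n in range(3, max_n + 1)
        for x in range(1, bound + 1)
        for y in range(1, bound + 1)
        for z in range(1, bound + 1)
    )


# Near miss: 6^3 + 8^3 = 728 while 9^3 = 729, so the objective equals 1 there.
near_miss = pf_objective(6, 8, 9, 3)
box_min = min_over_box(12, 5)
```

The objective gets as small as 1 on this grid but never reaches 0, which is exactly why a proof, rather than a search, is needed.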
UW-Madison CS/ISyE/Math/Stat 726 Spring 2024
• $\|x\|_1 = \sum_i |x_i|$,
• $\|x\|_\infty = \max_{1 \le i \le d} |x_i|$.
\[ |\langle z, x \rangle| \le \|z\|_* \cdot \|x\|. \]
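Proof. Fix any two vectors $x, z$. Assume $x \neq 0$ and $z \neq 0$; otherwise the inequality holds trivially. Define $\hat{x} = x / \|x\|$, so that $\|\hat{x}\| = 1$. By the definition of the dual norm,
\[ \|z\|_* \ge \langle z, \hat{x} \rangle = \frac{\langle z, x \rangle}{\|x\|}, \]
and hence $\langle z, x \rangle \le \|z\|_* \cdot \|x\|$. Applying the same argument with $x$ replaced by $-x$ proves $-\langle z, x \rangle \le \|z\|_* \cdot \|x\|$.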
Example 3. $\|\cdot\|_p$ and $\|\cdot\|_q$ are duals when $\frac{1}{p} + \frac{1}{q} = 1$. In particular, $\|\cdot\|_2$ is its own dual; $\|\cdot\|_1$ and
$\|\cdot\|_\infty$ are dual to each other.
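The inequality $|\langle z, x \rangle| \le \|z\|_* \cdot \|x\|$ can be checked numerically for the dual pairs in Example 3. The following is a minimal sketch on random vectors; the helper names `lp_norm`, `dual_exponent`, and `holder_gap`, the dimension, and the sample sizes are assumptions made for illustration. A random check is a sanity test, not a proof.

```python
import random


def lp_norm(x, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def dual_exponent(p):
    """Conjugate exponent q with 1/p + 1/q = 1 (q = inf for p = 1, and vice versa)."""
    if p == 1:
        return float("inf")
    if p == float("inf"):
        return 1.0
    return p / (p - 1.0)


def holder_gap(z, x, p):
    """Return ||z||_q * ||x||_p - |<z, x>|, which should be nonnegative."""
    q = dual_exponent(p)
    inner = sum(zi * xi for zi, xi in zip(z, x))
    return lp_norm(z, q) * lp_norm(x, p) - abs(inner)


rng = random.Random(0)
worst_gap = float("inf")
for p in (1, 1.5, 2, 3, float("inf")):
    for _ in range(200):
        x = [rng.uniform(-1, 1) for _ in range(5)]
        z = [rng.uniform(-1, 1) for _ in range(5)]
        worst_gap = min(worst_gap, holder_gap(z, x, p))
```

Up to floating-point rounding, `worst_gap` stays nonnegative. The gap closes when $z$ and $x$ are aligned; for example, `holder_gap([1, 1], [1, 1], 2)` is zero up to rounding.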
In $\mathbb{R}^d$, all $\ell_p$ norms are equivalent. In particular,
\[ \forall x \in \mathbb{R}^d,\ p \ge 1,\ r > p: \quad \|x\|_r \le \|x\|_p \le d^{\frac{1}{p} - \frac{1}{r}} \|x\|_r. \]
However, the choice of norm affects how an algorithm's performance depends on the dimension $d$.
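The dimension dependence in the equivalence bound can be seen numerically. Below is a minimal sketch that checks $\|x\|_r \le \|x\|_p \le d^{1/p - 1/r} \|x\|_r$ on random vectors for finite $p < r$, and shows that the upper bound is attained by the all-ones vector. The helper names, the chosen exponents, and the tolerance are illustrative assumptions.

```python
import random


def lp_norm(x, p):
    """l_p norm of a list of floats for finite p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def check_equivalence(x, p, r):
    """Check ||x||_r <= ||x||_p <= d^(1/p - 1/r) * ||x||_r for 1 <= p < r < inf."""
    d = len(x)
    n_p, n_r = lp_norm(x, p), lp_norm(x, r)
    tol = 1e-12 * max(1.0, n_p)  # slack for floating-point rounding
    return n_r <= n_p + tol and n_p <= d ** (1.0 / p - 1.0 / r) * n_r + tol


rng = random.Random(1)
all_ok = all(
    check_equivalence([rng.gauss(0, 1) for _ in range(d)], p, r)
    for d in (1, 3, 10, 50)
    for (p, r) in ((1, 2), (1, 3), (2, 4), (1.5, 2.5))
    for _ in range(100)
)

# The upper bound is tight for the all-ones vector: ||1||_1 = d and ||1||_2 = sqrt(d).
ones = [1.0] * 16
ratio = lp_norm(ones, 1) / lp_norm(ones, 2)  # d^(1 - 1/2) with d = 16
```

For $d = 16$ the ratio between the $\ell_1$ and $\ell_2$ norms of the all-ones vector is $\sqrt{16} = 4$, so a bound stated in one norm can be worse by a factor that grows with $d$ when translated into another.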
\[ g_i(x) \le 0, \quad i = 1, 2, \dots, m, \]
\[ h_i(x) = 0, \quad i = 1, \dots, p \]
Example 4.
for some family of open sets $\{U_\alpha\}$, then there exists a finite subfamily $\{U_{\alpha_i}\}_{i=1}^{n}$ such that
$\mathcal{X} \subseteq \bigcup_{1 \le i \le n} U_{\alpha_i}$.)
Weierstrass Extreme Value Theorem: If X is compact and f is a function that is defined and
continuous on X , then f attains its extreme values on X .
What if $\mathcal{X}$ is not bounded? Consider $f(x) = e^x$. Then $\inf_{x \in \mathbb{R}} f(x) = 0$, but the infimum is not attained.
When we work with unconstrained problems, we will normally assume that f is bounded
below.
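The $e^x$ example can be illustrated with a few evaluations. This is a small sketch, with the evaluation points chosen arbitrarily: the values decrease toward the infimum 0, yet every one of them is strictly positive, so no point attains it.

```python
import math


def f(x):
    """The example function f(x) = e^x, bounded below by 0 on all of R."""
    return math.exp(x)


# Evaluate f at increasingly negative points (the points are an arbitrary choice).
values = [f(-k) for k in (1, 10, 100, 700)]

strictly_positive = all(v > 0 for v in values)
strictly_decreasing = all(b < a for a, b in zip(values, values[1:]))
```

Being bounded below therefore guarantees a finite infimum, but not the existence of a minimizer.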
Convex sets: Except for some special cases, we often assume that the feasible set is convex, so
that we will be able to guarantee tractability.
∀ x, y ∈ X , ∀α ∈ (0, 1) : (1 − α) x + αy ∈ X
A picture.
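The defining condition $(1 - \alpha) x + \alpha y \in \mathcal{X}$ can be probed by random sampling. The sketch below searches for a violating triple $(x, y, \alpha)$; the two test sets (a Euclidean unit ball and a ring with a hole), the function names, and the sampling scheme are assumptions chosen for illustration. Finding no violation does not prove convexity, whereas a single violation disproves it.

```python
import random


def in_unit_ball(x, tol=0.0):
    """Membership in the closed Euclidean unit ball."""
    return sum(xi * xi for xi in x) <= 1.0 + tol


def in_annulus(x, tol=0.0):
    """Membership in the ring 1/2 <= ||x||_2 <= 1, which has a hole in the middle."""
    s = sum(xi * xi for xi in x)
    return 0.25 - tol <= s <= 1.0 + tol


def sample_member(member, rng, dim=2):
    """Rejection-sample a point of the set from the box [-1, 1]^dim."""
    while True:
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if member(x):
            return x


def find_violation(member, trials=2000, seed=0):
    """Look for x, y in the set and alpha with (1 - alpha) x + alpha y outside it.

    Returns (x, y, alpha) for the first violation found, or None if none is found.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = sample_member(member, rng), sample_member(member, rng)
        alpha = rng.uniform(0.0, 1.0)
        w = [(1.0 - alpha) * xi + alpha * yi for xi, yi in zip(x, y)]
        if not member(w, tol=1e-9):  # small slack for rounding error
            return x, y, alpha
    return None


ball_violation = find_violation(in_unit_ball)
annulus_violation = find_violation(in_annulus)
```

For the ball no violating triple turns up, while for the ring the search quickly finds two points whose connecting segment passes through the hole.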
f : D → R ∪ {−∞, ∞} ≡ R̄.
Here $f$ is defined on $\mathcal{D} \subseteq \mathbb{R}^d$. We can extend the definition of $f$ to all of $\mathbb{R}^d$ by assigning the value
$+\infty$ at each point $x \in \mathbb{R}^d \setminus \mathcal{D}$.
Effective domain:
\[ \mathrm{dom}(f) = \left\{ x \in \mathbb{R}^d : f(x) < \infty \right\} \]
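The extended-value convention translates directly into code. Here is a minimal sketch using $f(x) = -\log x$ on $\mathcal{D} = (0, \infty)$ as the example; this particular function and the names `f_ext` and `in_effective_domain` are assumptions for illustration and do not come from the notes.

```python
import math


def f_ext(x):
    """Extended-value version of f(x) = -log(x): equals +inf outside D = (0, inf)."""
    return -math.log(x) if x > 0 else math.inf


def in_effective_domain(f, x):
    """x belongs to dom(f) exactly when f(x) < +inf."""
    return f(x) < math.inf


inside = in_effective_domain(f_ext, 2.0)     # 2.0 lies in D
boundary = in_effective_domain(f_ext, 0.0)   # 0.0 is excluded from D
outside = in_effective_domain(f_ext, -1.0)   # negative points get the value +inf
```

With this convention, minimizing `f_ext` over all of $\mathbb{R}$ is the same problem as minimizing $f$ over $\mathcal{D}$, since points outside the effective domain can never be minimizers.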
∀ x, y ∈ X : | f ( x ) − f (y)| ≤ M ∥ x − y∥ .
2. Smooth on $\mathcal{X} \subseteq \mathbb{R}^d$ (w.r.t. the norm $\|\cdot\|$) if $f$'s gradient is Lipschitz-continuous, i.e., there
exists $L < \infty$ such that²
\[ \forall x, y \in \mathcal{X}: \quad \|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|. \]
(Gradient: $\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_d} \right)^\top$.)
² This definition can be viewed as a quantitative version of $C^1$-smoothness.
• Picture:
In $\mathbb{R}^d$, Lipschitz continuity with respect to one norm implies Lipschitz continuity with respect to every other norm, but the constant $M$ may differ.
Example 6. $f(x) = \frac{1}{2} \|x\|_2^2$ is 1-smooth on $\mathbb{R}^2$ w.r.t. $\|\cdot\|_2$. The log-sum-exp (or softmax) function
$f(x) = \log \sum_{i=1}^{d} \exp(x_i)$ is 1-smooth on $\mathbb{R}^d$ w.r.t. $\|\cdot\|_\infty$.
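The smoothness claim for log-sum-exp can be sanity-checked numerically. Its gradient is the softmax vector, and since the dual of $\|\cdot\|_\infty$ is $\|\cdot\|_1$, the claim reads $\|\nabla f(x) - \nabla f(y)\|_1 \le \|x - y\|_\infty$. The sketch below estimates the largest ratio over random pairs; the function names, dimension, and sampling distribution are illustrative assumptions, and a random check does not replace a proof.

```python
import math
import random


def softmax(x):
    """Gradient of log-sum-exp at x, computed stably by subtracting max(x)."""
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]


def smoothness_ratio(x, y):
    """||grad f(x) - grad f(y)||_1 / ||x - y||_inf for f = log-sum-exp."""
    gx, gy = softmax(x), softmax(y)
    num = sum(abs(a - b) for a, b in zip(gx, gy))  # l_1 norm (dual of l_inf)
    den = max(abs(a - b) for a, b in zip(x, y))    # l_inf norm
    return num / den


rng = random.Random(2)
max_ratio = 0.0
for _ in range(2000):
    x = [rng.gauss(0, 2) for _ in range(6)]
    y = [rng.gauss(0, 2) for _ in range(6)]
    max_ratio = max(max_ratio, smoothness_ratio(x, y))
```

Every observed ratio stays at or below 1, which is consistent with $L = 1$ in the definition of smoothness with respect to $\|\cdot\|_\infty$.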
Example 7. A function that is continuously differentiable on its domain but not smooth:
\[ f(x) = \frac{1}{x}, \qquad \mathrm{dom}(f) = \mathbb{R}_{++}. \]
A picture.
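To see why $f(x) = 1/x$ fails to be smooth on $\mathbb{R}_{++}$, compare its derivative $f'(x) = -1/x^2$ at the points $t$ and $2t$ as $t \to 0$. The sketch below computes the ratio $|f'(t) - f'(2t)| / |t - 2t| = \frac{3}{4} t^{-3}$, which no finite $L$ can bound. The choice of the pair $(t, 2t)$ and the function names are assumptions made for illustration.

```python
def grad(x):
    """Derivative of f(x) = 1/x on x > 0."""
    return -1.0 / (x * x)


def lipschitz_ratio(t):
    """|f'(t) - f'(2t)| / |t - 2t|, a lower bound on any valid smoothness constant L."""
    return abs(grad(t) - grad(2.0 * t)) / abs(t - 2.0 * t)


# Ratios for t = 1, 0.1, 0.01, 0.001, 0.0001; each step multiplies the ratio by about 1000.
ratios = [lipschitz_ratio(10.0 ** (-k)) for k in range(5)]
```

The ratio is 0.75 at $t = 1$ and grows without bound as $t$ shrinks, so the gradient is not Lipschitz-continuous on the whole domain even though it is continuous at every point of it.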
is convex.