
Indian Institute of Technology Madras

Optimization

Nirav Bhatt
Email: [email protected]
Office: BT 307 Block II
Biotechnology Department

September 27, 2024



Content

I Unconstrained vs constrained optimization
I Types of optimization problems:
I Linear programming (LP)
I Quadratic programming (QP)
I Nonlinear programming (NLP)
I Dynamic optimization as an NLP
I Overview of numerical solution approaches
I Book: Nocedal J, Wright SJ. Numerical Optimization. New York, NY: Springer; 1999. Reference Chapters 1, 2, 12
I Reference Book: An Introduction to Optimization, Edwin K. P. Chong and Stanislaw H. Zak



Learning outcomes
Introduction to optimization and different forms of optimization

I The students are expected to learn
I Different types of optimization problems and their applications
I KKT conditions for finding an optimal solution
I An overview of line search and trust region methods for numerical optimization



Optimization
Elements

I Mathematical Optimization (or Mathematical Programming):
  Select the best option, or a set of options, from an available set
I Consider an optimization problem with decision variables x = [x_1, x_2, ..., x_n]^T

    max_x  f(x)
    s.t.   Ax ≤ b
           1^T x ≤ 1

  or, equivalently,

    max_{x_1, x_2, ..., x_n}  f(x_1, x_2, ..., x_n)

I f(x) is called the cost function, objective function, or loss function
I Ax ≤ b and ∑_{i=1}^n x_i ≤ 1 are constraints



Optimization
Some Problems in ML/DL
I Regression Analysis: Given a dataset (y, X), fit a functional form f(X, θ): R^n × R^p → R

    min_θ  ‖y − f(X, θ)‖_2^2

I Classification Problems: Given (y, X), fit a functional form f(X, θ): R^n × R^p → {0, 1}

    min_θ  −(1/m) ∑_{i=1}^m [ y_i log(p_i(x, θ)) + (1 − y_i) log(1 − p_i(x, θ)) ]

  where p_i(·) is the predicted probability of class 1.


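To make the two objectives concrete, here is a minimal NumPy sketch that evaluates both losses for a linear model f(X, θ) = Xθ and a logistic model p(X, θ) = σ(Xθ); the toy data and the sigmoid link are illustrative assumptions, not part of the course material.

import numpy as np

def squared_loss(theta, X, y):
    # Regression objective: ||y - X @ theta||_2^2 for the linear model f(X, theta) = X @ theta
    r = y - X @ theta
    return r @ r

def cross_entropy_loss(theta, X, y, eps=1e-12):
    # Classification objective: -(1/m) * sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    # with p_i = sigmoid(x_i^T theta) taken as the probability of class 1
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([1.0, -2.0, 0.5])
y_reg = X @ theta + 0.1 * rng.normal(size=100)
y_cls = (X @ theta > 0).astype(float)

print(squared_loss(theta, X, y_reg))
print(cross_entropy_loss(theta, X, y_cls))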
Optimization
Constraints

I Types of constraints
I Linear constraints

      Ax ≤ b            (inequality)
      A_eq x = b_eq     (equality)

I General (nonlinear) constraints

      g(x) ≤ 0          (inequality)
      h(x) = 0          (equality)

I Bounds: x_lb ≤ x ≤ x_ub
I Set membership: x ∈ S, for example S = {−1, 0, 1}



Optimization
Types

I Types based on constraints: (i) unconstrained optimization and (ii) constrained optimization; also (i) static optimization and (ii) dynamic optimization
I Types based on smoothness of the functions or of the variable set: (i) continuous optimization and (ii) integer optimization
I Types based on the form of the objective function and constraints:
I Linear Programming (LP)
I Quadratic Programming (QP)
I Nonlinear Programming (NLP) or Nonlinear Optimization
I Mixed Integer LP or NLP



Optimization

I Unconstrained optimization over decision variables x:

    max_x (or min_x)  f(x)                                      (1)

I Constrained optimization:

    max_x  f(x)
    s.t.   g_i(x) ≤ b_i,  i = 1, ..., p                          (2)
           a_j^T x = c_j,  j = 1, ..., q
           x_LB ≤ x ≤ x_UB

I x* denotes an optimal solution



Motivation
Dynamic Optimization

I The current profit is a function of the current production x(t) and the rate of change of production x′(t)
I The continuous-time problem can be defined as:

    max_x  J[x] = ∫_0^T f(t, x(t), ẋ(t)) dt                      (3)
    s.t.   x(t) ≥ 0,  x(0) = x_0

I Objective: Find the function x(t) that maximizes the functional J[x]
I x*(t): Optimal trajectory



Optimization
I Linear Programming

    min_x  c^T x
    s.t.   Ax ≤ b,                                               (4)
           Cx = d,
           x_LB ≤ x ≤ x_UB

I Quadratic Programming

    min_x  x^T H x + c^T x
    s.t.   Ax ≤ b,                                               (5)
           Cx = d,
           x_LB ≤ x ≤ x_UB

  H: symmetric matrix
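To illustrate how an LP of the form (4) is posed and solved numerically, here is a small SciPy sketch; the particular c, A, and b values are made up for demonstration and are not from the slides.

import numpy as np
from scipy.optimize import linprog

# max x1 + 2*x2  s.t.  x1 + x2 <= 4,  x1 + 3*x2 <= 6,  x >= 0
# linprog minimizes, so the objective is negated
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 6.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)], method="highs")
print(res.x, res.fun)    # optimal point and (negated) optimal objective value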
Optimization
I Nonlinear Programming

    min_x  f(x)
    s.t.   g_i(x) ≤ 0,  i = 1, ..., M,                           (6)
           h_j(x) = 0,  j = 1, ..., N,
           x_LB ≤ x ≤ x_UB

I Integer Quadratic Programming

    min_x  x^T H x + c^T x
    s.t.   Ax ≤ b,                                               (7)
           Cx = d,
           some x_i are integers



Motivation
Static Optimization

I Static optimization: the solution is an optimal number or a finite set of numbers
I Static optimization problem:

    max_x  f(x)                                                  (8)

I Assumption: f(x) is continuously differentiable
I First-order necessary condition: ∂f/∂x (x*) = 0
I Second-order necessary condition (for a maximum): ∂²f/∂x² (x*) ≤ 0
I Example: the operating point x* that maximizes the profit f(x), where x is the number of units



Motivation
Static Optimization

I Example

    max_x  1000000 + 4000x − x²                                  (9)

I ∂f/∂x (x*) = 0  ⇒  4000 − 2x = 0  ⇒  x* = 2000
I ∂²f/∂x² = −2 < 0, so x* = 2000 is a maximizer

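A quick numerical cross-check of this example, as a sketch using SciPy's scalar minimizer on the negated profit of equation (9):

from scipy.optimize import minimize_scalar

profit = lambda x: 1000000 + 4000 * x - x**2
# maximize the profit by minimizing its negative
res = minimize_scalar(lambda x: -profit(x))
print(res.x, profit(res.x))    # approximately x* = 2000, f(x*) = 5000000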


Optimization
Static Optimization

I Static optimization with several variables:

    max_{x_1, ..., x_n}  f(x_1, ..., x_n)                        (10)

I Example: A plant can produce n items. Find the operating point x_1, ..., x_n that maximizes the profit f(x_1, ..., x_n)



Motivation
Static Optimization: Detour to Some Problems in Linear Algebra

I Recall: Ax = b has no exact solution when b is not in the column space of A
I Idea: project b onto the subspace spanned by the columns of A
I Error e = Ax − b; the projection Ax is the closest point to b in the column space of A
I Optimization problem:

    min_x  ‖Ax − b‖_p,   p = 1, 2, 3, ..., ∞

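For p = 2 this is the ordinary least-squares problem, which NumPy solves directly; a short sketch with made-up data:

import numpy as np

# Overdetermined system: b is generally not in the column space of A
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# x minimizes ||A x - b||_2
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
e = A @ x - b                 # error vector, orthogonal to the columns of A
print(x, A.T @ e)             # A^T e is (numerically) zero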


Motivation
Static Optimization

I Example

    max_x  1000000 + 300x_1 + 500x_2 − x_1² − x_2²               (11)

I ∂f/∂x_1 (x_1*) = 0  ⇒  300 − 2x_1 = 0  ⇒  x_1* = 150
I ∂f/∂x_2 (x_2*) = 0  ⇒  500 − 2x_2 = 0  ⇒  x_2* = 250
I ∂²f/∂x_1² = ∂²f/∂x_2² = −2 < 0
I Are these conditions sufficient? (In general the full Hessian must be examined.)



Static Optimization
Local vs Global Solutions

I Local Minimizer
A point x∗ is a local minimizer if there is a neighborhood N of
x∗ such that f (x∗ ) ≤ f (x) for all x ∈ N .

A point x* is a strict local minimizer (also called a strong local minimizer) if there exists a neighborhood N of x* such that f(x*) < f(x) for all x ∈ N with x ≠ x*

I Global Minimizer
A point x∗ is a global minimizer if f (x∗ ) ≤ f (x) for all x
I x∗ is a stationary point if ∇f (x∗ ) = 0.



Static Optimization
Unconstrained Optimization

I First-order Necessary Conditions


If x∗ is a local minimizer and f is continuously differentiable in
an open neighborhood of x∗ , then ∇f (x∗ ) = 0.

I Second-order Necessary Conditions


If x* is a local minimizer of f and ∇²f exists and is continuous in an open neighborhood of x*, then ∇f(x*) = 0 and ∇²f(x*) is positive semidefinite.

I Second-order Sufficient Conditions


Suppose that ∇²f is continuous in an open neighborhood
of x∗ and that ∇f (x∗ ) = 0 and ∇2 f (x∗ ) is positive definite.
Then, x∗ is a strict local minimizer of f .

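These conditions can be checked numerically once the gradient and Hessian are available; a minimal sketch, where the convex quadratic test function is an assumption chosen for illustration:

import numpy as np

# f(x) = x1^2 + 3*x2^2, a convex quadratic with minimizer at the origin
def grad(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

def hess(x):
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])

x_star = np.array([0.0, 0.0])
print(np.allclose(grad(x_star), 0.0))                   # first-order necessary condition
print(np.all(np.linalg.eigvalsh(hess(x_star)) > 0))     # positive definite: strict local minimizer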


Constrained Optimization
Equality Constraints

I Consider the following optimization problem

min f (x)
(12)
s.t. hi (x) = 0, i = 1, 2, . . . , m

I From constrained optimization to unconstrained optimization
I Lagrange function with Lagrange multipliers λ_1, ..., λ_m:

    L(x, λ) = f(x) + ∑_{i=1}^m λ_i h_i(x)



Constrained Optimization
I First-order necessary conditions

    ∇f(x*) + λ_1* ∇h_1(x*) + ... + λ_m* ∇h_m(x*) = 0
    h_i(x*) = 0,  i = 1, 2, ..., m

I Second-order necessary condition

    w^T L_xx(x*, λ*) w ≥ 0,  ∀w such that ∇h_i(x*)^T w = 0, i = 1, 2, ..., m

  where

    L_xx(x*, λ*) = ∇²f(x*) + ∑_{i=1}^m λ_i* ∇²h_i(x*)

I Second-order sufficient condition: in addition to the first-order conditions, L_xx(x*, λ*) is positive definite on the tangent space defined by ∇h_i(x*)^T w = 0



Example: Constrained Optimization
I Problem:

    min  x_1 + x_2
    s.t. x_1² + x_2² − 2 = 0

I ∇f = [1, 1]^T,   ∇h = [2x_1, 2x_2]^T
I The first-order conditions give x_1 = x_2 with x_1² + x_2² = 2, i.e. the candidate points (1, 1) and (−1, −1); (−1, −1) is the minimizer and (1, 1) the maximizer

  (Figure: level sets of x_1 + x_2 and the constraint circle x_1² + x_2² = 2, with the points (1, 1) and (−1, −1) marked.)
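A sketch of solving the Lagrange system for this example symbolically; SymPy is used here purely for illustration:

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = x1 + x2
h = x1**2 + x2**2 - 2
L = f + lam * h                                   # Lagrangian L(x, lambda) = f(x) + lambda*h(x)

# Stationarity of L in x plus feasibility of the constraint
eqs = [sp.diff(L, x1), sp.diff(L, x2), h]
for s in sp.solve(eqs, [x1, x2, lam], dict=True):
    print(s, 'f =', f.subs(s))                    # (1, 1) with f = 2 and (-1, -1) with f = -2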




Constrained Optimization: Example
KKT Conditions
I Constrained optimization problem:

    min  f(x)
    s.t. h_i(x) = 0,  i = 1, ..., m
         g_j(x) ≤ 0,  j = 1, ..., n

  where f, h_i, and g_j are smooth, real-valued functions on a subset of R^n
I x is a feasible point if it satisfies the equality and inequality constraints
I An inequality constraint j is said to be active at a point x if g_j(x) = 0
I A(x): the set of all active constraints at the point x
I A feasible point x satisfies the Linear Independence Constraint Qualification (LICQ) if the gradients ∇h_i and ∇g_j, j ∈ A(x), are linearly independent
Constrained Optimization
KKT Conditions
I Lagrangian for the problem, with Lagrange multipliers λ and µ:

    L(x, λ, µ) = f(x) + ∑_{i=1}^m λ_i h_i(x) + ∑_{j=1}^n µ_j g_j(x)

I First-order necessary conditions (Karush-Kuhn-Tucker conditions): for a local solution x*,

    ∇_x L(x*, λ*, µ*) = 0
    h_i(x*) = 0,  i = 1, ..., m
    g_j(x*) ≤ 0,  j = 1, ..., n
    λ_i* unrestricted in sign,  µ_j* ≥ 0
    µ_j* g_j(x*) = 0   (complementarity condition)

I Complementarity condition: ensures µ_j = 0 for all j ∉ A(x*)


Constrained Optimization
KKT Conditions
I Complementarity condition: ensures µ_j = 0 for all j ∉ A(x*)
I Second-order necessary condition

    w^T ∇²L(x*, λ*, µ*) w ≥ 0,  ∀w ∈ T(x*)

    T(x*) = {w | ∇h_i(x*)^T w = 0 ∀i, and ∇g_j(x*)^T w = 0 ∀j ∈ A(x*)}

I Sign of the multipliers: at a minimizer, µ_j > 0 for all j ∈ A(x*); ∇f points inwards (otherwise the point would be a maximizer) and ∇g points outwards (otherwise f could be decreased by moving inwards), so that

    −∇f(x*) = ∑_j µ_j ∇g_j(x*)

  (Figure, panels (a)-(d), showing sign combinations of the multipliers: (a) µ_j > 0 for all j ∈ A(x*); (b) µ_1, µ_2 < 0; (c) µ_1 > 0, µ_2 < 0; (d) µ_1, µ_2 > 0.)





Constrained Optimization
KKT Conditions: Problem

I Optimization problem

    min  x_1² + 2x_2²
    s.t. x_1 + x_2 ≥ 3
         x_2 − x_1² ≥ 1

I The Lagrangian function (with µ_1, µ_2 ≥ 0):

    l(x, µ_1, µ_2) = x_1² + 2x_2² − µ_1(x_1 + x_2 − 3) − µ_2(x_2 − x_1² − 1)

I KKT Conditions
  1. 2x_1 − µ_1 + 2µ_2 x_1 = 0
  2. 4x_2 − µ_1 − µ_2 = 0
  3. x_1 + x_2 ≥ 3
  4. x_2 − x_1² ≥ 1
  5. µ_1(x_1 + x_2 − 3) = 0
  6. µ_2(x_2 − x_1² − 1) = 0
  7. µ_1, µ_2 ≥ 0

Constrained Optimization
KKT Conditions: Problem

I Checking the KKT conditions above case by case:
I Case 1 (no active constraint, µ_1 = µ_2 = 0): conditions 1-2 give (x_1*, x_2*) = (0, 0), which violates conditions 3 and 4
I Case 2 (µ_1 > 0, µ_2 = 0, first constraint active): (x_1*, x_2*) = (2, 1) with µ_1 = 4 > 0, but x_2 − x_1² = −3, which violates condition 4
I Case 3 (µ_1 = 0, µ_2 > 0, second constraint active): (x_1*, x_2*) = (0, 1) with µ_2 = 4 > 0, but x_1 + x_2 = 1, which violates condition 3
I Case 4 (µ_1 > 0, µ_2 > 0, both constraints active): (x_1*, x_2*) = (−2, 5) or (x_1*, x_2*) = (1, 2)
I (x_1*, x_2*) = (−2, 5) ⇒ µ_1 + 4µ_2 = −4, which is impossible for µ_1, µ_2 ≥ 0 and violates condition 7
I (x_1*, x_2*) = (1, 2) ⇒ µ_1 = 6 and µ_2 = 2: all KKT conditions are satisfied

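The candidate (1, 2) can be cross-checked numerically; below is a sketch using SciPy's SLSQP solver on the same problem, where the starting point is an arbitrary assumption:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + 2 * x[1]**2
cons = [
    {"type": "ineq", "fun": lambda x: x[0] + x[1] - 3},       # x1 + x2 >= 3
    {"type": "ineq", "fun": lambda x: x[1] - x[0]**2 - 1},    # x2 - x1^2 >= 1
]
res = minimize(f, x0=np.array([0.0, 3.0]), constraints=cons, method="SLSQP")
print(res.x)    # approximately [1, 2]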


Example: Constrained Optimization

I Find the semi-major and semi-minor axes of the ellipse defined by

    (x_1 + x_2)² + 2(x_1 − x_2)² − 8 = 0

I Hint: calculate the farthest (and nearest) point on the ellipse from the origin
I Problem formulation:

    min  x_1² + x_2²                                             (13)
    s.t. (x_1 + x_2)² + 2(x_1 − x_2)² − 8 = 0

  (maximizing the same objective gives the farthest point, i.e. the semi-major axis)



Numerical Optimization
Different Types of Algorithms

I Optimization algorithms generate a sequence of iterates {x_k}, k = 0, 1, 2, ...
I Termination criteria:
I No more progress can be made
I A solution has been approximated with sufficient accuracy
I How to generate a new point x_{k+1} from x_k?
  Use information about f at x_k and/or at previous iterates
I x_{k+1} must be such that f(x_{k+1}) < f(x_k)
I Strategies to find a new x_{k+1}:
I Line search
I Trust region



Numerical Optimization
Different Types of Algorithms

I Strategies
I Line search
I Trust region
I Line search strategy: choose a direction p_k from x_k so that f(x_{k+1}) < f(x_k)
I x_{k+1} = x_k + α p_k,  α > 0
I Choose a direction p_k and find a step length α
I Objective function:

    min_α  f(x_k + α p_k)                                        (14)



Numerical Optimization
Different Types of Algorithms
I Line search objective function:

    min_α  f(x_k + α p_k)                                        (15)

I The (locally) best direction is the steepest descent direction: p_k = −∇f(x_k)

  (Figure: the steepest descent direction and a general descent direction for a line search.)



Numerical Optimization
Different Types of Algorithms

I Line search objective function:

    min_α  f(x_k + α p_k)                                        (16)

I Steepest descent direction: p_k = −∇f(x_k)
I Newton's direction:

    p_k = −(∇²f(x_k))^{−1} ∇f(x_k)

  When a quadratic approximation of f(x) is sufficient to represent f(x), Newton's direction is reliable
I Conjugate gradient directions:

    p_k = −∇f(x_k) + β_k p_{k−1},   β_k: scalar value

  Conjugate vectors: p_i^T A p_j = 0 for i ≠ j, where A is a symmetric positive definite matrix



Numerical Optimization
Line Search Strategy

I Why conjugate gradient directions?*

*: Honkela, A. et al., Journal of Machine Learning Research, 11, 2010.



Numerical Optimization
Line Search Strategy

I Step size α
I p_k provides the direction
I α_k determines how far to move along p_k and helps reduce the value of f
I Challenge: choosing α_k may require many evaluations of f (and sometimes of ∇f)
I Simplest sufficient condition on α_k:  f(x_k + α_k p_k) < f(x_k)
I Several refined conditions exist: Wolfe conditions, Goldstein conditions

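As a concrete instance, the sketch below implements a simple backtracking line search that enforces the Armijo (sufficient-decrease) condition, a slightly stronger requirement than f(x_k + α_k p_k) < f(x_k); the test function, the constants rho and c, and the starting point are assumptions for illustration.

import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until f(x + alpha*p) <= f(x) + c*alpha*grad_f(x)^T p (Armijo condition)
    alpha = alpha0
    fx, gx = f(x), grad_f(x)
    while f(x + alpha * p) > fx + c * alpha * gx @ p:
        alpha *= rho
    return alpha

# Example: one steepest-descent step on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([1.0, 1.0])
p = -grad_f(x)                                # steepest descent direction
alpha = backtracking_line_search(f, grad_f, x, p)
print(alpha, x + alpha * p, f(x + alpha * p))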


Numerical Optimization
Trust Region

I Trust region method
I Idea: construct a model function m_k at x_k using information about f
I m_k may not be a good approximation of f far from x_k
I Therefore, limit the search for x_{k+1} to a region around x_k: the trust region
I Objective function:

    min_{p_k}  m_k(x_k + p_k),   subject to ‖p_k‖ ≤ ∆_k          (17)



Numerical Optimization
Trust Region
I If the decrease in f at the candidate solution is not sufficient, change (shrink) the trust region and re-solve
I Trust region subproblem:

    min_{p_k}  m_k(x_k + p_k),   ‖p_k‖ ≤ ∆_k                     (18)

I m_k: a quadratic approximation of f around x_k



Optimization for Machine and Deep Learning: Terminology

I Central idea in DS/ML/AI: minimize or maximize an objective function
I Deep learning objective: minimize the loss function (objective function)
I Direct solution vs. iterative solution
I Arthur Samuel's paradigm: a "mechanism" to improve performance by tweaking the weights (parameters)
I The loss function can be tweaked by varying the model parameters, often in a million-dimensional space!



Gradient Descent

The sign of the gradient of the loss with respect to a weight w tells us which way to move:

    ∇_w Loss < 0:  increment w  (w ← w + ∆w)
    ∇_w Loss > 0:  decrement w  (w ← w − ∆w)
    ∇_w Loss = 0:  optimum w    (∆w = 0, stationary point)

(Figure: three loss-vs-w panels illustrating the three cases.)

What should ∆w be?

    ∆w = −η ∇_w Loss   ⇒   w ← w − η ∇_w Loss,   where η is the learning rate
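A minimal gradient-descent loop implementing this update rule; the quadratic loss and the learning rate are illustrative assumptions:

import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    # Repeated update w <- w - eta * grad(w)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w

# Example: Loss(w) = (w - 3)^2, with gradient 2*(w - 3); minimum at w = 3
grad = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad, w0=0.0))    # approaches 3.0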
Gradient Descent
Learning Rate

Learning rate too big ⇒ oscillation, possibly divergence
Learning rate too small ⇒ long time to convergence

(Figure: two loss-vs-w panels illustrating oscillation and slow convergence.)

I The learning rate should therefore be adapted:
I Near a local solution: take careful steps with a small learning rate
I Far from a local solution: proceed quickly with a large learning rate



Gradient Descent
Three types

I Batch GD
  I Use the entire data set to compute the gradient
  I Update rule:  w = w − η ∇_w L(w)
  I Slow; a large data set may not fit in memory
I Stochastic GD (online)
  I Use one data point (x_i, y_i) to compute the gradient
  I Update rule:  w = w − η ∇_w L(w, x_i, y_i)
  I Much faster, but the updates have high variance and L fluctuates heavily
I Mini-batch GD
  I Use a subset of data points to compute the gradient
  I Update rule:  w = w − η ∇_w L(w, x_{i:i+n}, y_{i:i+n})
  I Best of both methods; reduces the variance of the parameter updates

Algorithm of choice: mini-batch gradient descent for NN and DL
Mini-batch GD is often referred to as SGD
Typical batch sizes of 50 or 256 are used

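A schematic mini-batch SGD loop corresponding to the update rule above; the data, batch size, and the linear-regression loss used as a stand-in model are assumptions for illustration:

import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=32, n_epochs=20, seed=0):
    # Loss L(w) = (1/2m) * ||X w - y||^2, minimized with mini-batch gradient steps
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                          # reshuffle each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)    # gradient on the mini-batch
            w -= eta * grad                               # w = w - eta * grad_w L(w, x_batch, y_batch)
    return w

# Toy data (assumed)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)
print(minibatch_sgd(X, y))    # close to w_true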


Gradient Descent
Momentum SGD

I SGD is sometimes slow
I How do we accelerate learning?
I Momentum approaches:
I Use the concept of momentum from physics
I v: velocity variable; it accumulates the past gradients
I Update rule

    w_{k+1} = w_k − η ∇_w L(w_k, x_i, y_i) + β v_k

I β ∈ [0, 1): momentum parameter
I The larger β is, the more the current direction is affected by past gradients

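A sketch of the momentum update, written in the common two-step form v ← βv − η∇L, w ← w + v, which is equivalent to the rule above; the test loss and the hyperparameter values are assumptions:

import numpy as np

def momentum_sgd(grad, w0, eta=0.1, beta=0.9, n_steps=200):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v - eta * grad(w)     # velocity accumulates the past gradients
        w = w + v                        # equivalent to w_{k+1} = w_k - eta*grad + beta*v_k
    return w

# Example: L(w) = w1^2 + 10*w2^2
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
print(momentum_sgd(grad, w0=[1.0, 1.0]))    # approaches [0, 0]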


Gradient Descent
Stochastic GD
I Learning rate: an important hyperparameter
I Adaptive learning rates: account for the direction and magnitude of the gradients of the different parameters
I AdaGrad learning rate: scale the learning rate of each parameter by the square root of the sum of all past squared gradients

    v_{k+1} = v_k + ∇_w L(w_k) ⊙ ∇_w L(w_k)
    w_{k+1} = w_k − (η / √(δ + v_{k+1})) ⊙ ∇_w L(w_k)

I RMSProp learning rate: use an exponential moving average instead of the sum

    v_{k+1} = ρ v_k + (1 − ρ) ∇_w L(w_k) ⊙ ∇_w L(w_k)
    w_{k+1} = w_k − (η / √(δ + v_{k+1})) ⊙ ∇_w L(w_k)

  (⊙ denotes element-wise multiplication; δ is a small constant for numerical stability)

I Other algorithms: adaptive moments ("Adam"), a combination of RMSProp and momentum SGD
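A sketch of the RMSProp update as written above, with element-wise operations; the test gradient and the values of ρ, η, and δ are assumptions for illustration:

import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=500):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        v = rho * v + (1 - rho) * g * g           # exponential moving average of squared gradients
        w = w - eta / np.sqrt(delta + v) * g      # per-parameter scaled step
    return w

# Example: L(w) = w1^2 + 100*w2^2 (badly scaled; per-parameter scaling helps)
grad = lambda w: np.array([2 * w[0], 200 * w[1]])
print(rmsprop(grad, w0=[1.0, 1.0]))    # approaches [0, 0]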
Gradient Descent
Stochastic GD: Conclusions

I Which algorithm should I choose for a particular application? Which is the best algorithm to apply?
I Answer: none, or all
I Robust performance: the RMSProp and AdaDelta family²
I Hyper-parameter tuning does not matter much for adaptive-learning-rate methods
I Bottom line: the choice of algorithm depends largely on the user's familiarity with it

² Schaul et al., "No more pesky learning rates," International Conference on Machine Learning, 2013.



Gradient Descent
Stochastic GD: Summary

I The elements needed for applying stochastic GD:
I The form of the loss function
I A way to compute gradients with respect to the parameters
I Initialization of the parameters, the learning rate, and the other hyperparameters


