Opt Sem10. Seminar
Lecture recap

Strongly convex quadratics

Consider the following quadratic optimization problem:

$$
\min_{x \in \mathbb{R}^d} f(x) = \min_{x \in \mathbb{R}^d} \frac{1}{2} x^\top A x - b^\top x + c, \quad \text{where } A \in \mathbb{S}^d_{++}.
$$

Optimality conditions:

$$
\nabla f(x^*) = A x^* - b = 0 \iff A x^* = b
$$
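A minimal NumPy check of this optimality condition, with made-up problem data (the A, b, c below are illustrative, not from the seminar): the minimizer is obtained by solving the linear system $A x^* = b$, and the gradient vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M.T @ M + np.eye(d)              # symmetric positive definite matrix
b = rng.standard_normal(d)
c = 1.0

grad = lambda x: A @ x - b           # gradient of f(x) = 1/2 x^T A x - b^T x + c

x_star = np.linalg.solve(A, b)       # optimality condition: A x* = b
print(np.linalg.norm(grad(x_star)))  # ~1e-15: the gradient vanishes at x*
```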
Overview of the CG method for the quadratic problem

1) Initialization. Set $k = 0$ and $x_k = x_0$, $d_k = d_0 = -\nabla f(x_0)$.

2) Optimal Step Length. Find the optimal step length by line search, i.e. compute $\alpha_k$ minimizing $f(x_k + \alpha_k d_k)$:

$$
\alpha_k = -\frac{d_k^\top (A x_k - b)}{d_k^\top A d_k}
$$

3) Algorithm Iteration. Update the position $x_k$ by moving in the direction $d_k$ with step size $\alpha_k$:

$$
x_{k+1} = x_k + \alpha_k d_k
$$

4) Direction Update. Update the direction $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k$ is calculated by the formula

$$
\beta_k = \frac{\nabla f(x_{k+1})^\top A d_k}{d_k^\top A d_k}.
$$

5) Convergence Loop. Repeat steps 2-4 until $n$ directions are built, where $n$ is the dimension of the space (the dimension of $x$).
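As a sanity check, here is one iteration of steps 1)-4) written out in NumPy on a small made-up problem; the new direction comes out A-orthogonal to the previous one, as intended.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M.T @ M + np.eye(d)                         # symmetric positive definite
b = rng.standard_normal(d)

x0 = np.zeros(d)
d0 = -(A @ x0 - b)                              # 1) d_0 = -grad f(x_0)
alpha0 = -d0 @ (A @ x0 - b) / (d0 @ A @ d0)     # 2) exact line-search step
x1 = x0 + alpha0 * d0                           # 3) move along d_0
g1 = A @ x1 - b                                 # grad f(x_1)
beta0 = g1 @ A @ d0 / (d0 @ A @ d0)             # 4) direction update coefficient
d1 = -g1 + beta0 * d0
print(d1 @ A @ d0)                              # ~0: d_1 is A-orthogonal to d_0
```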
Optimal Step Length

$$
f(x_k + \alpha d_k) = \frac{1}{2} (x_k + \alpha d_k)^\top A (x_k + \alpha d_k) - b^\top (x_k + \alpha d_k) + c
= \frac{1}{2} \alpha^2 d_k^\top A d_k + \alpha\, d_k^\top (A x_k - b) + \frac{1}{2} x_k^\top A x_k - b^\top x_k + c
$$

We consider $A \in \mathbb{S}^d_{++}$, so the point with zero derivative of this parabola is a minimum:

$$
d_k^\top A d_k\, \alpha_k + d_k^\top (A x_k - b) = 0 \iff \alpha_k = -\frac{d_k^\top (A x_k - b)}{d_k^\top A d_k}
$$
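A quick numerical check (made-up $A$, $b$, $x_k$, $d_k$) that the closed-form step length really is the minimizer of the one-dimensional parabola $\alpha \mapsto f(x_k + \alpha d_k)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((d, d))
A = M.T @ M + np.eye(d)
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x - b @ x

xk = rng.standard_normal(d)
dk = -(A @ xk - b)

alpha_star = -dk @ (A @ xk - b) / (dk @ A @ dk)      # closed-form step length
alphas = np.linspace(alpha_star - 1, alpha_star + 1, 2001)
vals = [f(xk + a * dk) for a in alphas]
print(alphas[np.argmin(vals)], alpha_star)           # grid minimizer ~ closed form
```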
Direction Update

We update the direction in such a way that the next direction is A-orthogonal to the previous one:

$$
d_{k+1} \perp_A d_k \iff d_{k+1}^\top A d_k = 0
$$

$$
d_{k+1}^\top A d_k = -\nabla f(x_{k+1})^\top A d_k + \beta_k d_k^\top A d_k = 0 \iff \beta_k = \frac{\nabla f(x_{k+1})^\top A d_k}{d_k^\top A d_k}
$$

Lemma 1

All directions constructed by the procedure described above are mutually A-orthogonal:

$$
d_i^\top A d_j = 0, \ \text{if } i \neq j; \qquad d_i^\top A d_j > 0, \ \text{if } i = j
$$
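A numerical illustration of Lemma 1 on a made-up problem: run the scheme for $n$ iterations, collect the directions into a matrix $D$ (rows $d_0, \dots, d_{n-1}$), and print the Gram matrix $D A D^\top$, which should come out diagonal with a positive diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

x = np.zeros(n)
g = A @ x - b                                   # gradient at the current point
direction = -g
D = []
for _ in range(n):
    D.append(direction)
    alpha = -direction @ g / (direction @ A @ direction)
    x = x + alpha * direction
    g = A @ x - b
    beta = g @ A @ direction / (direction @ A @ direction)
    direction = -g + beta * direction
D = np.array(D)                                 # rows are d_0, ..., d_{n-1}
print(np.round(D @ A @ D.T, 6))                 # diagonal matrix, positive diagonal
```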
A-orthogonality

[Figure: two panels illustrating A-orthogonality of the directions; both axes labeled x, ranging from -4 to 4.]
Convergence of the CG method

Lemma 2

Suppose we solve an n-dimensional quadratic convex optimization problem. The conjugate directions method

$$
x_{k+1} = x_0 + \sum_{i=0}^{k} \alpha_i d_i,
$$

where $\alpha_i = -\dfrac{d_i^\top (A x_i - b)}{d_i^\top A d_i}$ is taken from the line search, converges in at most $n$ steps of the algorithm.
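Lemma 2 in action on a made-up problem: the residual norm $\|A x_k - b\|_2$ is printed per iteration and drops to round-off level at step $n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

x = np.zeros(n)
g = A @ x - b
direction = -g
for k in range(n):                                      # exactly n steps
    alpha = -direction @ g / (direction @ A @ direction)
    x = x + alpha * direction
    g = A @ x - b
    beta = g @ A @ direction / (direction @ A @ direction)
    direction = -g + beta * direction
    print(k + 1, np.linalg.norm(A @ x - b))             # ~0 at the n-th step
```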
CG method in practice

In practice, the following formulas are usually used for the step $\alpha_k$ and the coefficient $\beta_k$:

$$
\alpha_k = \frac{r_k^\top r_k}{d_k^\top A d_k}, \qquad \beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k},
$$

where $r_k = b - A x_k$; since $x_{k+1} = x_k + \alpha_k d_k$, we have $r_{k+1} = r_k - \alpha_k A d_k$. Also, $r_i^\top r_k = 0$ for all $i \neq k$ (Lemma 5 from the lecture).

Let's derive the expression for $\beta_k$:

$$
\beta_k = \frac{\nabla f(x_{k+1})^\top A d_k}{d_k^\top A d_k} = -\frac{r_{k+1}^\top A d_k}{d_k^\top A d_k}
$$

Using $A d_k = \frac{1}{\alpha_k} (r_k - r_{k+1})$ from the residual recursion:

Numerator: $r_{k+1}^\top A d_k = \frac{1}{\alpha_k} r_{k+1}^\top (r_k - r_{k+1}) = [r_{k+1}^\top r_k = 0] = -\frac{1}{\alpha_k} r_{k+1}^\top r_{k+1}$

Denominator: $d_k^\top A d_k = (r_k + \beta_{k-1} d_{k-1})^\top A d_k = [d_{k-1}^\top A d_k = 0] = \frac{1}{\alpha_k} r_k^\top (r_k - r_{k+1}) = \frac{1}{\alpha_k} r_k^\top r_k$

Substituting these back gives $\beta_k = \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$.
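A one-iteration numerical check (made-up data, starting from $d_0 = r_0$) that the practical residual-based formulas agree with the original gradient-based ones:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

x = rng.standard_normal(n)
r = b - A @ x                                            # r_0 = -grad f(x_0)
d0 = r.copy()                                            # first direction d_0 = r_0
alpha_ls = -d0 @ (A @ x - b) / (d0 @ A @ d0)             # line-search formula
alpha_prac = r @ r / (d0 @ A @ d0)                       # practical formula
x_new = x + alpha_prac * d0
r_new = r - alpha_prac * (A @ d0)

beta_orig = (A @ x_new - b) @ A @ d0 / (d0 @ A @ d0)     # gradient-based formula
beta_prac = r_new @ r_new / (r @ r)                      # practical formula
print(alpha_ls - alpha_prac, beta_orig - beta_prac)      # both differences ~0
```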
CG method in practice. Pseudocode

r_0 := b - A x_0
if r_0 is sufficiently small, then return x_0 as the result
d_0 := r_0
k := 0
repeat
    alpha_k := (r_k^T r_k) / (d_k^T A d_k)
    x_{k+1} := x_k + alpha_k d_k
    r_{k+1} := r_k - alpha_k A d_k
    if r_{k+1} is sufficiently small, then exit loop
    beta_k := (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
    d_{k+1} := r_{k+1} + beta_k d_k
    k := k + 1
end repeat
return x_{k+1} as the result
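A direct NumPy transcription of this pseudocode (the SPD test matrix below is made up for illustration), compared against np.linalg.solve:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, following the pseudocode above."""
    x = x0.astype(float).copy()
    r = b - A @ x
    if np.linalg.norm(r) < tol:
        return x
    d = r.copy()
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

rng = np.random.default_rng(6)
n = 50
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)
b = rng.standard_normal(n)

x_cg = conjugate_gradient(A, b, np.zeros(n))
print(np.linalg.norm(x_cg - np.linalg.solve(A, b)))   # small: CG matches the direct solve
```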
Exercise 1
Non-linear conjugate gradient method

In case we do not have an analytic expression for the function or its gradient, we will most likely not be able to solve the one-dimensional minimization problem analytically. Therefore, step 2 of the algorithm is replaced by the usual line search procedure. But there is the following mathematical trick for the fourth step.

For two consecutive iterations it holds that

$$
x_{k+1} - x_k = c d_k,
$$

where $c$ is some constant. Then, for the quadratic case, we have

$$
\nabla f(x_{k+1}) - \nabla f(x_k) = (A x_{k+1} - b) - (A x_k - b) = A (x_{k+1} - x_k) = c A d_k.
$$

Expressing from this equation the product $A d_k = \frac{1}{c} \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right)$, we get rid of the "knowledge" of the function in the definition of $\beta_k$; step 4 is then rewritten as

$$
\beta_k = \frac{\nabla f(x_{k+1})^\top \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right)}{d_k^\top \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right)}.
$$

Write the iterations of the Polak-Ribière method and run experiments for several $\mu$ in binary logistic regression:

$$
f(x) = \frac{\mu}{2} \|x\|_2^2 + \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i \langle a_i, x \rangle)\right) \longrightarrow \min_{x \in \mathbb{R}^n}
$$
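A possible sketch of this exercise under made-up data and parameters (the synthetic dataset, the $\mu$ values, and the backtracking constants are all illustrative assumptions, not from the seminar): nonlinear CG with the Polak-Ribière coefficient (using the common PR+ safeguard $\beta_k = \max(\beta_k, 0)$) and an Armijo backtracking line search, applied to the regularized logistic loss above.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 200, 20
A_data = rng.standard_normal((m, n))        # rows are the feature vectors a_i
y = np.sign(rng.standard_normal(m))         # labels y_i in {-1, +1}

def make_problem(mu):
    def f(x):
        margins = -y * (A_data @ x)
        return 0.5 * mu * x @ x + np.mean(np.logaddexp(0.0, margins))
    def grad(x):
        margins = -y * (A_data @ x)
        s = 1.0 / (1.0 + np.exp(-margins))  # sigmoid of the margins
        return mu * x - A_data.T @ (y * s) / m
    return f, grad

def backtracking(f, x, d, g, t=1.0, rho=0.5, c1=1e-4):
    # simple Armijo backtracking line search (at most 50 halvings)
    fx, slope = f(x), g @ d
    for _ in range(50):
        if f(x + t * d) <= fx + c1 * t * slope:
            break
        t *= rho
    return t

def nonlinear_cg_pr(f, grad, x0, iters=500, tol=1e-8):
    x = x0.copy()
    g = grad(x)
    d = -g
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        t = backtracking(f, x, d, g)
        x = x + t * d
        g_new = grad(x)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # Polak-Ribiere (PR+ variant)
        d = -g_new + beta * d
        g = g_new
    return x

for mu in (1e-3, 1e-1, 1.0):                # a few illustrative values of mu
    f, grad = make_problem(mu)
    x = nonlinear_cg_pr(f, grad, np.zeros(n))
    print(mu, f(x), np.linalg.norm(grad(x)))
```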
Computational experiments

A pathological example

Since W is invertible, there exists a unique solution to $W x = b$. Solving it with the conjugate gradient method gives rather bad convergence: during the CG process the error grows exponentially (!), until it suddenly becomes zero once the unique solution is found.

The residual $\|W x_k - b\|_2$ grows exponentially as $(1/t)^k$ until the $n$-th iteration, after which it drops sharply to zero. See the experiment here³.
Other computational experiments