
Conjugate gradient method

Seminar

Optimization for ML. Faculty of Computer Science. HSE University

Strongly convex quadratics

Consider the following quadratic optimization problem:

    min_{x ∈ R^d} f(x) = min_{x ∈ R^d} (1/2) x^⊤ A x − b^⊤ x + c,  where A ∈ S^d_{++}.

Optimality conditions:

    ∇f(x^*) = A x^* − b = 0 ⇐⇒ A x^* = b

[Figure: side-by-side 2D plots comparing the iterates of Steepest Descent (left) and Conjugate Gradient (right) on the same quadratic problem.]
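
As a quick numerical illustration of the optimality condition above (the dimension, the random A ∈ S^d_{++}, and b are illustrative choices, not from the slides): solve A x^* = b and check that ∇f(x^*) vanishes.

import numpy as np

rng = np.random.default_rng(0)
d = 5

# Random symmetric positive definite matrix A and right-hand side b.
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # A ∈ S^d_++ by construction
b = rng.standard_normal(d)

# The quadratic f(x) = 1/2 x^T A x - b^T x + c has gradient ∇f(x) = A x - b,
# so its unique minimizer solves the linear system A x* = b.
x_star = np.linalg.solve(A, b)

grad = A @ x_star - b
print("||∇f(x*)|| =", np.linalg.norm(grad))   # ~1e-15, i.e. zero up to round-off
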
Overview of the CG method for the quadratic problem

1) Initialization. k = 0 and x_k = x_0, d_k = d_0 = −∇f(x_0).

2) Optimal Step Length. By a line search procedure we find the optimal step length, i.e. the α_k minimizing f(x_k + α_k d_k):

    α_k = − d_k^⊤(A x_k − b) / (d_k^⊤ A d_k)

3) Algorithm Iteration. Update the position x_k by moving in the direction d_k with step size α_k:

    x_{k+1} = x_k + α_k d_k

4) Direction Update. Update d_{k+1} = −∇f(x_{k+1}) + β_k d_k, where β_k is calculated by the formula:

    β_k = ∇f(x_{k+1})^⊤ A d_k / (d_k^⊤ A d_k).

5) Convergence Loop. Repeat steps 2–4 until n directions are built, where n is the dimension of the space (the dimension of x).

Optimal Step Length

Exact line search:

    α_k = argmin_{α ∈ R_+} f(x_{k+1}) = argmin_{α ∈ R_+} f(x_k + α d_k)

Let's find an analytical expression for the step α_k:

    f(x_k + α d_k) = (1/2)(x_k + α d_k)^⊤ A (x_k + α d_k) − b^⊤(x_k + α d_k) + c
                   = (1/2) α^2 d_k^⊤ A d_k + d_k^⊤(A x_k − b) α + ((1/2) x_k^⊤ A x_k − b^⊤ x_k + c)

We consider A ∈ S^d_{++}, so the point where the derivative of this parabola in α vanishes is its minimum:

    d_k^⊤ A d_k α_k + d_k^⊤(A x_k − b) = 0 ⇐⇒ α_k = − d_k^⊤(A x_k − b) / (d_k^⊤ A d_k)
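
A small numerical check of this closed form (the random SPD A, the random point x_k, and the choice d_k = −∇f(x_k) are illustrative assumptions): compare α_k from the formula with a brute-force minimization over a grid of step sizes.

import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)           # A ∈ S^d_++
b = rng.standard_normal(d)

f = lambda x: 0.5 * x @ A @ x - b @ x
x_k = rng.standard_normal(d)
d_k = -(A @ x_k - b)                  # steepest-descent direction, taken as an example

# Closed-form exact line search step.
alpha_exact = -d_k @ (A @ x_k - b) / (d_k @ A @ d_k)

# Brute-force check over a dense grid of candidate steps.
grid = np.linspace(0.0, 2.0 * alpha_exact, 10001)
alpha_grid = grid[np.argmin([f(x_k + a * d_k) for a in grid])]

print(alpha_exact, alpha_grid)        # the two values agree up to the grid resolution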

Direction Update

We update the direction in such a way that the next direction is A-orthogonal to the previous one:

    d_{k+1} ⊥_A d_k ⇐⇒ d_{k+1}^⊤ A d_k = 0

Since d_{k+1} = −∇f(x_{k+1}) + β_k d_k, we choose β_k so that A-orthogonality holds:

    d_{k+1}^⊤ A d_k = −∇f(x_{k+1})^⊤ A d_k + β_k d_k^⊤ A d_k = 0 ⇐⇒ β_k = ∇f(x_{k+1})^⊤ A d_k / (d_k^⊤ A d_k)

Lemma 1

All directions constructed by the procedure described above are A-orthogonal to each other:

    d_i^⊤ A d_j = 0, if i ≠ j
    d_i^⊤ A d_j > 0, if i = j
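
A quick numerical check of this choice of β_k (the random SPD A, b, and starting point are illustrative): perform one update and verify that d_1 is A-orthogonal to d_0.

import numpy as np

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # A ∈ S^d_++
b = rng.standard_normal(d)
grad = lambda x: A @ x - b             # ∇f(x) for the quadratic

x0 = rng.standard_normal(d)
d0 = -grad(x0)                         # initial direction

# One CG step: exact line search, position update, direction update.
alpha0 = -d0 @ grad(x0) / (d0 @ A @ d0)
x1 = x0 + alpha0 * d0
beta0 = grad(x1) @ A @ d0 / (d0 @ A @ d0)
d1 = -grad(x1) + beta0 * d0

print("d1^T A d0 =", d1 @ A @ d0)      # ~0 up to round-off: A-orthogonal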

A-orthogonality

[Figure: left panel shows vectors v_1, v_2 that are orthogonal (v_1^⊤ v_2 = 0.00) but not A-orthogonal (v_1^⊤ A v_2 = 1.19); right panel shows a pair of vectors that are A-orthogonal (ṽ_1^⊤ A ṽ_2 = 0.00) but not orthogonal (ṽ_1^⊤ ṽ_2 = 0.80).]

Convergence of the CG method

Lemma 2

Suppose we solve an n-dimensional convex quadratic optimization problem. The conjugate directions method

    x_{k+1} = x_0 + Σ_{i=0}^{k} α_i d_i,

where α_i = − d_i^⊤(A x_i − b) / (d_i^⊤ A d_i) is taken from the line search, converges in at most n steps of the algorithm.
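
A minimal illustration of Lemma 2 (the dimension, the random data, and the Gram–Schmidt construction of the A-orthogonal directions are illustrative assumptions): build n mutually A-orthogonal directions, apply the update above with the α_i from the formula, and check that x_n coincides with A^{-1} b.

import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # A ∈ S^n_++
b = rng.standard_normal(n)

# Build n mutually A-orthogonal directions by Gram-Schmidt in the A-inner product.
dirs = []
for v in rng.standard_normal((n, n)):
    for d_j in dirs:
        v = v - (v @ A @ d_j) / (d_j @ A @ d_j) * d_j
    dirs.append(v)

# Conjugate directions method: x_{k+1} = x_k + alpha_k d_k with the line-search step.
x = rng.standard_normal(n)             # x_0
for d_i in dirs:
    alpha_i = -d_i @ (A @ x - b) / (d_i @ A @ d_i)
    x = x + alpha_i * d_i

print(np.linalg.norm(x - np.linalg.solve(A, b)))   # ~0: solved in n steps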

CG method in practice

In practice, the following formulas are usually used for the step α_k and the coefficient β_k:

    α_k = r_k^⊤ r_k / (d_k^⊤ A d_k),    β_k = r_{k+1}^⊤ r_{k+1} / (r_k^⊤ r_k),

where r_k = b − A x_k. Since x_{k+1} = x_k + α_k d_k, we have r_{k+1} = r_k − α_k A d_k. Also, r_i^⊤ r_k = 0 for all i ≠ k (Lemma 5 from the lecture).

Let's derive the expression for β_k:

    β_k = ∇f(x_{k+1})^⊤ A d_k / (d_k^⊤ A d_k) = − r_{k+1}^⊤ A d_k / (d_k^⊤ A d_k)

Numerator: r_{k+1}^⊤ A d_k = (1/α_k) r_{k+1}^⊤ (r_k − r_{k+1}) = [r_{k+1}^⊤ r_k = 0] = −(1/α_k) r_{k+1}^⊤ r_{k+1}

Denominator: d_k^⊤ A d_k = (r_k + β_{k−1} d_{k−1})^⊤ A d_k = [d_{k−1}^⊤ A d_k = 0] = (1/α_k) r_k^⊤ (r_k − r_{k+1}) = (1/α_k) r_k^⊤ r_k

Combining the two, β_k = ((1/α_k) r_{k+1}^⊤ r_{k+1}) / ((1/α_k) r_k^⊤ r_k) = r_{k+1}^⊤ r_{k+1} / (r_k^⊤ r_k), which is the practical formula above.

Question

Why is this modification better than the standard version?
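
A quick check that the residual-based formula for β_k agrees with the original one (random illustrative data, one CG iteration starting from d_0 = r_0):

import numpy as np

rng = np.random.default_rng(4)
n = 7
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # A ∈ S^n_++
b = rng.standard_normal(n)

x = rng.standard_normal(n)
r = b - A @ x                          # r_k = b - A x_k = -∇f(x_k)
d = r.copy()                           # d_0 = r_0

# One CG iteration with the residual-based formulas.
Ad = A @ d
alpha = (r @ r) / (d @ Ad)
x_next = x + alpha * d
r_next = r - alpha * Ad

beta_residual = (r_next @ r_next) / (r @ r)
beta_standard = ((A @ x_next - b) @ Ad) / (d @ Ad)   # ∇f(x_{k+1})^T A d_k / d_k^T A d_k

print(beta_residual, beta_standard)    # the two values coincide up to round-off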

CG method in practice. Pseudocode

r_0 := b − A x_0
if r_0 is sufficiently small, then return x_0 as the result
d_0 := r_0
k := 0
repeat
    α_k := (r_k^⊤ r_k) / (d_k^⊤ A d_k)
    x_{k+1} := x_k + α_k d_k
    r_{k+1} := r_k − α_k A d_k
    if r_{k+1} is sufficiently small, then exit loop
    β_k := (r_{k+1}^⊤ r_{k+1}) / (r_k^⊤ r_k)
    d_{k+1} := r_{k+1} + β_k d_k
    k := k + 1
end repeat
return x_{k+1} as the result
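
A runnable NumPy version of the pseudocode above (a minimal sketch; the tolerance, the iteration cap, and the random test system are illustrative choices):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, following the pseudocode above."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                          # r_0 := b - A x_0
    if np.linalg.norm(r) < tol:            # r_0 sufficiently small
        return x
    d = r.copy()                           # d_0 := r_0
    for _ in range(max_iter or len(b)):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)         # α_k := r_k^T r_k / d_k^T A d_k
        x = x + alpha * d                  # x_{k+1} := x_k + α_k d_k
        r_new = r - alpha * Ad             # r_{k+1} := r_k - α_k A d_k
        if np.linalg.norm(r_new) < tol:    # r_{k+1} sufficiently small
            break
        beta = (r_new @ r_new) / (r @ r)   # β_k := r_{k+1}^T r_{k+1} / r_k^T r_k
        d = r_new + beta * d               # d_{k+1} := r_{k+1} + β_k d_k
        r = r_new
    return x

# Quick test on a random SPD system (illustrative data).
rng = np.random.default_rng(5)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = conjugate_gradient(A, b, np.zeros(n))
print(np.linalg.norm(A @ x - b))           # ~0
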
Exercise 1

Write the iterations of the conjugate gradient method for the quadratic problem

    f(x) = (1/2) x^⊤ A x − b^⊤ x → min_{x ∈ R^n}

and run experiments for several matrices A. See code here.

Non-linear conjugate gradient method

If we do not have an analytic expression for the function or its gradient, we will most likely not be able to solve the one-dimensional minimization problem analytically. Therefore, step 2 of the algorithm is replaced by the usual line search procedure. For step 4, however, there is the following mathematical trick.

For two consecutive iterations it holds that

    x_{k+1} − x_k = c d_k,

where c is some constant. Then, for the quadratic case, we have:

    ∇f(x_{k+1}) − ∇f(x_k) = (A x_{k+1} − b) − (A x_k − b) = A(x_{k+1} − x_k) = c A d_k

Expressing the product A d_k = (1/c)(∇f(x_{k+1}) − ∇f(x_k)) from this equation, we get rid of the "knowledge" of the function in the definition of β_k, and step 4 is rewritten as:

    β_k = ∇f(x_{k+1})^⊤(∇f(x_{k+1}) − ∇f(x_k)) / (d_k^⊤(∇f(x_{k+1}) − ∇f(x_k))).

This method is called the Polak–Ribière method.
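
A minimal sketch of the resulting non-linear CG loop (the quartic test function, the Armijo backtracking line search, and the descent-restart safeguard are illustrative assumptions added here, not prescribed by the slides):

import numpy as np

# Illustrative smooth convex test problem (not from the slides):
# f(x) = 1/4 * sum(x_i^4) + 1/2 x^T A x - b^T x
rng = np.random.default_rng(6)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
b = rng.standard_normal(n)
f = lambda x: 0.25 * np.sum(x**4) + 0.5 * x @ A @ x - b @ x
grad = lambda x: x**3 + A @ x - b

def backtracking(x, d, g, alpha=1.0, rho=0.5, c1=1e-4):
    """Simple Armijo backtracking line search (replaces the exact line search of step 2)."""
    while f(x + alpha * d) > f(x) + c1 * alpha * (g @ d):
        alpha *= rho
    return alpha

x = np.zeros(n)
g = grad(x)
d = -g
for k in range(200):
    alpha = backtracking(x, d, g)
    x_new = x + alpha * d
    g_new = grad(x_new)
    if np.linalg.norm(g_new) < 1e-8:
        x = x_new
        break
    y = g_new - g                      # ∇f(x_{k+1}) - ∇f(x_k)
    beta = (g_new @ y) / (d @ y)       # the β_k formula from the slide
    d = -g_new + beta * d
    if g_new @ d >= 0:                 # safeguard (added here): restart if not a descent direction
        d = -g_new
    x, g = x_new, g_new

print("||∇f(x)|| =", np.linalg.norm(grad(x)))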


Exercise 2

Write the iterations of the Polak–Ribière method and run experiments for several µ in binary logistic regression:

    f(x) = (µ/2) ‖x‖_2^2 + (1/m) Σ_{i=1}^{m} log(1 + exp(−y_i ⟨a_i, x⟩)) → min_{x ∈ R^n}

See code here.

A pathological example

Let t ∈ (0, 1) and let W be the n × n tridiagonal matrix with W_{11} = t, W_{ii} = 1 + t for i ≥ 2, and W_{i,i+1} = W_{i+1,i} = √t, and let b = (1, 0, …, 0)^⊤.

Since W is invertible, there exists a unique solution to W x = b. Solving it by the conjugate gradient method gives rather bad convergence: during the CG process, the error grows exponentially (!), until it suddenly becomes zero as the unique solution is found.

The residual ‖W x_k − b‖_2 grows exponentially, roughly as (1/t)^k, until the n-th iteration, after which it drops sharply to zero.

See the experiment here.
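
A sketch reproducing this behaviour (the values n = 20 and t = 0.5 and the plain NumPy CG loop are illustrative; W is built as described above):

import numpy as np

def pathological_system(n, t):
    """Tridiagonal W with W[0,0] = t, W[i,i] = 1 + t for i >= 1, off-diagonals sqrt(t); b = e_1."""
    W = (np.diag([t] + [1 + t] * (n - 1))
         + np.diag([np.sqrt(t)] * (n - 1), 1)
         + np.diag([np.sqrt(t)] * (n - 1), -1))
    b = np.zeros(n)
    b[0] = 1.0
    return W, b

n, t = 20, 0.5
W, b = pathological_system(n, t)

# Plain CG on W x = b, tracking the residual norm at every iteration.
x = np.zeros(n)
r = b - W @ x
d = r.copy()
residuals = [np.linalg.norm(r)]
for _ in range(n):
    Wd = W @ d
    alpha = (r @ r) / (d @ Wd)
    x = x + alpha * d
    r_new = r - alpha * Wd
    beta = (r_new @ r_new) / (r @ r)
    d = r_new + beta * d
    r = r_new
    residuals.append(np.linalg.norm(r))

# The residual grows roughly like (1/t)^k and collapses only around iteration n.
print(np.round(residuals, 3))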

More computational experiments

Let's look at some more examples here. The code is taken from the linked source.
