BSC Part 3
A project report
submitted by
Contents

1 Introduction
2 General Idea
 2.1 Method of Steepest Descent
  2.1.1 Step 1
  2.1.2 Step 2
 2.2 Examples
3 Convergence Theory
 3.1 Quadratic Case
  3.1.1 Convergence rate for Yj
  3.1.2 Relative decrease in F
  3.1.3 Kantorovich inequality
4 Scaling
5 Extensions
6 Applications
 6.1 Application 1
 6.2 Application 2
Chapter 1
Introduction
Analytic methods may not always work, because of the complexity of the problem or because the problem is not convex, so we use numerical techniques to solve such problems. An optimization problem is the task of selecting the best answer from a set of alternatives. Optimization problems divide broadly into two types: constrained and unconstrained. In both types we minimize or maximize a function; the only difference is that in a constrained optimization problem the minimization or maximization is subject to constraints, or restrictions, on the variables.
There are various unconstrained methods for solving minimization or maximization problems. They divide broadly into two families: direct search methods and descent methods (gradient methods). Direct search methods use no derivatives, so they are also called zeroth-order methods.
For nonlinear optimization there are many gradient methods, and the steepest descent method is the simplest of them. The steepest descent method is not as old as Newton's method: Cauchy (1789–1857) developed it in the nineteenth century (1847), about two centuries after Newton's method. It is much less complicated than Newton's method. The steepest descent algorithm uses only the first derivative of the function; it does not necessitate the computation of second derivatives, there is no system of linear equations to solve to find the search direction, and there is no need for matrix storage. As a result, it lowers the costs of Newton's method in every respect in terms of iteration cost.
The negative side of the steepest descent method is that it has a slower rate of convergence than Newton's method: it converges only linearly to the minimum. The steepest descent method is used in solving a system of nonlinear equations of the form

$$g(y_1, y_2, \dots, y_n) = 0,$$

where $g$ is a real-valued differentiable function with continuous first partial derivatives; the method moves along the negative of the gradient.

Descent property: for a function $g$, $g(y_{k+1}) < g(y_k)$ for all $k$; that is, as we proceed, the value of the objective function should decrease.
Chapter 2
General Idea
2.1 Method of Steepest Descent

2.1.1 Step 1
Choose the search direction as the negative of the gradient:

$$s_j = -\nabla g_j.$$

Starting from any point in $n$-dimensional space, the function value decreases at the quickest rate if we move along this negative gradient direction.
2.1.2 Step 2
Select a step length in the search direction so as to reduce $g(y)$: choose the ideal step length $\lambda_j$ in the direction $s_j$ and set

$$y_{j+1} = y_j + \lambda_j s_j.$$
Begin from a point $Y$ and repeatedly move along the steepest downhill direction until you reach the optimal point. Writing

$$\delta y_j = \lambda_j s_j,$$

we have

$$g(y_{j+1}) = g(y_j + \lambda_j s_j),$$

and by Taylor's expansion,

$$g(y_j + \lambda_j s_j) = g(y_j) + \nabla^T g(y_j)\,\delta y_j + \frac{1}{2}\,\delta y_j^T H(y_j)\,\delta y_j,$$

where $H$ denotes the Hessian of $g$. Thus, setting

$$\frac{d\,g(y_j + \lambda_j s_j)}{d\lambda_j} = 0$$

gives

$$\lambda_j = -\frac{\nabla g_j^T s_j}{s_j^T H_j s_j},$$

and since $s_j = -\nabla g_j$,

$$\lambda_j = \frac{s_j^T s_j}{s_j^T H_j s_j}.$$
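These two steps can be collected into a short routine. Below is a minimal Python sketch for the quadratic case, where the Hessian $H$ is constant and the exact step length derived above applies; the function name, tolerance, and iteration cap are our own choices, not part of the report.

```python
import numpy as np

def steepest_descent_quadratic(H, grad, y0, tol=1e-8, max_iter=10000):
    """Steepest descent with the exact step length
    lambda_j = (s_j^T s_j) / (s_j^T H s_j) derived above.

    H    : constant Hessian of g, shape (n, n)
    grad : callable returning the gradient of g at a point
    y0   : starting point, shape (n,)
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        s = -grad(y)                    # Step 1: s_j = -grad g_j
        if np.linalg.norm(s) < tol:     # gradient ~ 0 => optimal point
            break
        lam = (s @ s) / (s @ H @ s)     # Step 2: exact line search
        y = y + lam * s                 # y_{j+1} = y_j + lambda_j s_j
    return y
```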
2.2 Examples
Example 1: Find the minimum of

$$g(y) = (y_1 - 7)^2 + (y_2 - 2)^2.$$

The search direction is the negative of the gradient,

$$s_j = -\begin{pmatrix} 2(y_1 - 7) \\ 2(y_2 - 2) \end{pmatrix}.$$

Starting from the point $Y_1 = \begin{pmatrix} 8.5 \\ 1 \end{pmatrix}$,

$$s_1 = -\begin{pmatrix} 3 \\ -2 \end{pmatrix}.$$

With

$$\lambda_j = \frac{s_j^T s_j}{s_j^T H_j s_j},$$

we get

$$\lambda_1 = \frac{\begin{pmatrix} 3 & -2 \end{pmatrix}\begin{pmatrix} 3 \\ -2 \end{pmatrix}}{\begin{pmatrix} 3 & -2 \end{pmatrix}\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 3 \\ -2 \end{pmatrix}} = \frac{13}{26} = \frac{1}{2}.$$

Now,

$$Y_{j+1} = Y_j + \lambda_j s_j, \qquad Y_2 = Y_1 + \lambda_1 s_1 = \begin{pmatrix} 7 \\ 2 \end{pmatrix}.$$

Since $(\nabla g)_{Y_2} = 0$, the optimal point is $Y_2$.
Example 2: Minimize

$$g(y_1, y_2) = y_1 - y_2 + 2y_1^2 + 2y_1 y_2 + y_2^2.$$

Starting from $Y_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, the first search direction is

$$s_1 = -\nabla g_1 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}.$$

With

$$\lambda_j = \frac{s_j^T s_j}{s_j^T H s_j},$$

we get $\lambda_1 = 1$. Now

$$Y_{j+1} = Y_j + \lambda_j s_j, \qquad Y_2 = Y_1 + \lambda_1 s_1 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}.$$
Iteration 2: At $Y_2$.

Step 1: Search for the best direction $s_2$:

$$s_2 = -\nabla g_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

Step 2:

$$\lambda_2 = \frac{s_2^T s_2}{s_2^T H s_2}, \qquad \text{where } H = \begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix},$$

which gives $\lambda_2 = \frac{1}{5}$. Now

$$Y_3 = Y_2 + \lambda_2 s_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix} + \frac{1}{5}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} -0.8 \\ 1.2 \end{pmatrix}.$$
Check whether the point $Y_3$ is optimal:

$$(\nabla g)_{Y_3} = \begin{pmatrix} 0.2 \\ -0.2 \end{pmatrix} \neq \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

so it is not.

Iteration 3: At $Y_3$.

Step 1: Search for the best direction $s_3$. For the steepest descent method,

$$s_3 = -\nabla g_3 = \begin{pmatrix} -0.2 \\ 0.2 \end{pmatrix}.$$

Step 2:

$$\lambda_3 = \frac{s_3^T s_3}{s_3^T H s_3} = 1.$$

Now

$$Y_4 = Y_3 + \lambda_3 s_3 = \begin{pmatrix} -0.8 \\ 1.2 \end{pmatrix} + \begin{pmatrix} -0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} -1.0 \\ 1.4 \end{pmatrix}.$$
Check whether the point $Y_4$ is optimal:

$$(\nabla g)_{Y_4} = \begin{pmatrix} -0.2 \\ -0.2 \end{pmatrix} \neq \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

so it is not.

Iteration 4: At $Y_4$.

Step 1: The best direction is

$$s_4 = -\nabla g_4 = \begin{pmatrix} 0.2 \\ 0.2 \end{pmatrix}.$$

Step 2:

$$\lambda_4 = \frac{s_4^T s_4}{s_4^T H s_4} = \frac{1}{5}.$$

Now,

$$Y_5 = Y_4 + \lambda_4 s_4 = \begin{pmatrix} -1.0 \\ 1.4 \end{pmatrix} + \frac{1}{5}\begin{pmatrix} 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} -0.96 \\ 1.44 \end{pmatrix}.$$

$Y_5$ is close to the optimum; continuing in this manner, the iterates converge slowly to the minimizer $Y^* = (-1,\ 1.5)^T$.
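As a check on these iterates, the routine sketched in Section 2.1 can be run on Example 2 (the starting point and printout are our own choices):

```python
import numpy as np

H = np.array([[4.0, 2.0], [2.0, 2.0]])               # Hessian of g
grad = lambda y: np.array([1 + 4*y[0] + 2*y[1],      # dg/dy1
                           -1 + 2*y[0] + 2*y[1]])    # dg/dy2

print(steepest_descent_quadratic(H, grad, [0.0, 0.0]))
# -> approximately [-1.0, 1.5], the minimizer approached above
```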
Chapter 3
Convergence Theory
One of the key advantages of the SDM is that it has a nice convergence theory. It is not difficult to demonstrate that the rate of convergence of the SDM is linear, which is unsurprising given the method's simplicity; regrettably, even for modestly nonlinear problems, this convergence can be too slow for practical use. The theory of the SD approach's convergence is nevertheless important for comprehending the method's behaviour.
3.1 Quadratic Case

Let us see how the SDM converges to its minimum in the quadratic situation. This particular situation is critical because even if a function is not quadratic, it behaves quadratically around the optimal point; hence it is important to investigate the behaviour of quadratic functions. Consider

$$g(Y) = \frac{1}{2} Y^T H Y - d^T Y,$$

where $d \in \mathbb{R}^n$ and $H$ is an $n \times n$ symmetric positive definite matrix. All of the eigenvalues of $H$ are real and positive because $H$ is a symmetric positive definite matrix. Let the eigenvalues of the matrix $H$ be $e_1, e_2, e_3, \dots, e_n$, where $e_1$ is the smallest eigenvalue and $e_n$ is the largest eigenvalue of $H$. We know that the gradient of the given quadratic function $g$ is

$$s(Y) = HY - d,$$

and if we set the gradient to zero, it gives the optimal point $Y^*$:

$$Y^* = H^{-1} d.$$

Since all the eigenvalues of $H$ are positive and real, the determinant of $H$ is nonzero, so $H^{-1}$ exists.

3.1.1 Convergence rate for $Y_j$

The method of steepest descent is thus represented as
$$Y_{j+1} = Y_j - \lambda_j s_j,$$

where $s_j = HY_j - d$ and $\lambda_j$ is the step length in the direction of $s_j$ such that $\lambda_j$ minimizes $g(Y_j - \lambda s_j)$. We can determine the value of $\lambda_j$ from

$$g(Y_j - \lambda s_j) = \frac{1}{2}(Y_j - \lambda s_j)^T H (Y_j - \lambda s_j) - (Y_j - \lambda s_j)^T d,$$

which is minimized at $\lambda_j$; differentiating with respect to $\lambda$ and setting the derivative to zero gives

$$\lambda_j = \frac{s_j^T s_j}{s_j^T H s_j},$$

so that

$$Y_{j+1} = Y_j - \left(\frac{s_j^T s_j}{s_j^T H s_j}\right) s_j.$$
3.1.2 Relative decrease in $F$

To analyse convergence, introduce the function

$$F(Y) = \frac{1}{2}(Y - Y^*)^T H (Y - Y^*).$$

The only difference between $F(Y)$ and $g(Y)$ is a constant term $\frac{1}{2}(Y^*)^T H Y^*$:

$$F(Y) = g(Y) + \frac{1}{2}(Y^*)^T H Y^*,$$

using $HY^* = d$; hence minimizing $F$ is equivalent to minimizing $g$. Applying the steepest descent update

$$Y_{j+1} = Y_j - \left(\frac{s_j^T s_j}{s_j^T H s_j}\right) s_j$$

with $s_j = HY_j - d$, a direct computation shows the relative decrease per step:

$$F(Y_{j+1}) = \left[1 - \frac{(s_j^T s_j)^2}{(s_j^T H s_j)(s_j^T H^{-1} s_j)}\right] F(Y_j).$$

Now, we require a bound on the right-hand side of this equation to get a bound on the rate of convergence. Kantorovich's lemma, described below, gives the best such bound and is a valuable generic tool in convergence analysis.
3.1.3 Kantorovich inequality

Let $H$ be an $n \times n$ symmetric positive definite matrix, and let $e_1$ and $e_n$ be the smallest and largest eigenvalues of $H$, respectively. Then for any $X \neq 0$,

$$\frac{(X^T X)^2}{(X^T H X)(X^T H^{-1} X)} \geq \frac{4\, e_1 e_n}{(e_1 + e_n)^2}.$$

Applying this bound to the relative-decrease formula of Section 3.1.2 gives

$$F(Y_{j+1}) \leq \left[\frac{e_n - e_1}{e_n + e_1}\right]^2 F(Y_j),$$

where the ratio

$$r = \frac{e_n}{e_1}$$

is the condition number of $H$.

• Clearly, the rate of convergence of the SDM depends on the condition number of the Hessian.

• If $r = 1$, the contours are circular and we obtain the optimum, i.e. convergence, in one iteration.

Example: As we saw in Example 1, the SDM applied to $g(y)$ with exact line search reached the optimum in a single iteration from any initial point (there the Hessian is $2I$, so $r = 1$). In Example 2, we applied the SDM with exact line search to $g(y)$ and it took many iterations to converge.
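The bound is easy to evaluate for the Hessians of Examples 1 and 2; the helper below (our own naming) computes the worst-case per-step decrease factor $[(e_n - e_1)/(e_n + e_1)]^2$:

```python
import numpy as np

def sd_rate_bound(H):
    """Worst-case decrease factor ((e_n - e_1)/(e_n + e_1))^2 for SDM."""
    e = np.linalg.eigvalsh(H)       # eigenvalues of symmetric H, ascending
    return ((e[-1] - e[0]) / (e[-1] + e[0])) ** 2

print(sd_rate_bound(np.array([[2.0, 0.0], [0.0, 2.0]])))  # Example 1: 0.0
print(sd_rate_bound(np.array([[4.0, 2.0], [2.0, 2.0]])))  # Example 2: ~0.56
```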
Chapter 4
Scaling
Even for a quadratic function, the SD approach's rate of convergence is at best linear. The SD method's rate of convergence can be improved by scaling the design variables. Scaling may reduce the condition number of the Hessian of the function. For a quadratic function, it is possible to scale the design variables so that the Hessian matrix's condition number is unity with respect to the new design variables. A matrix's condition number is defined as the ratio of the matrix's largest to smallest eigenvalues.
An example will be used to highlight the benefits of scaling the design variables. If $g = \frac{1}{2} Y^T [B] Y$ denotes the quadratic case, consider a transformation of the form

$$Y = [S]Z \quad \text{or} \quad \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix},$$

so that

$$g = \frac{1}{2} Z^T [\tilde{B}] Z = \frac{1}{2} Z^T [S]^T [B] [S] Z.$$

The matrix $[S]$ can be selected to make $[\tilde{B}] = [S]^T [B] [S]$ diagonal (i.e., the mixed quadratic terms will be eliminated). For this, the eigenvectors of the matrix $[B]$ are taken as the columns of the matrix $[S]$. After that, the diagonal elements of the matrix $[\tilde{B}]$ can be reduced to one (so that the condition number of the resulting matrix is 1) using the transformation

$$Z = [C]D \quad \text{or} \quad \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} c_{11} & 0 \\ 0 & c_{22} \end{pmatrix} \begin{pmatrix} d_1 \\ d_2 \end{pmatrix},$$

where the matrix $[C]$ is

$$[C] = \begin{pmatrix} 1/\sqrt{\tilde{b}_{11}} & 0 \\ 0 & 1/\sqrt{\tilde{b}_{22}} \end{pmatrix}.$$
EXAMPLE 4.1: Consider

$$g(Y) = A^T Y + \frac{1}{2} Y^T [B] Y,$$

where

$$Y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad A = \begin{pmatrix} -1 \\ -2 \end{pmatrix}, \quad \text{and} \quad [B] = \begin{pmatrix} 12 & -6 \\ -6 & 4 \end{pmatrix}.$$

As previously stated, the necessary variable scaling may be performed in two stages.
Stage 1 (Reducing $[B]$ to a diagonal form): The eigenvectors of the matrix $[B]$ are calculated by solving the eigenvalue problem

$$\left[[B] - \beta_i [I]\right] w_i = 0,$$

which yields $\beta_1 = 8 + \sqrt{52} = 15.2111$ and $\beta_2 = 8 - \sqrt{52} = 0.7889$. The eigenvector $w_i$ corresponding to $\beta_i$ can be found by solving

$$\begin{pmatrix} 12 - \beta_1 & -6 \\ -6 & 4 - \beta_1 \end{pmatrix} \begin{pmatrix} w_{11} \\ w_{21} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{or} \quad (12 - \beta_1)\, w_{11} - 6 w_{21} = 0,$$

and

$$\begin{pmatrix} 12 - \beta_2 & -6 \\ -6 & 4 - \beta_2 \end{pmatrix} \begin{pmatrix} w_{12} \\ w_{22} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{or} \quad (12 - \beta_2)\, w_{12} - 6 w_{22} = 0.$$

Taking $w_{11} = w_{12} = 1$ gives $w_{21} = -0.5352$ and $w_{22} = 1.8685$, so

$$[S] = \begin{pmatrix} 1 & 1 \\ -0.5352 & 1.8685 \end{pmatrix};$$

that is,

$$y_1 = z_1 + z_2, \qquad y_2 = -0.5352\, z_1 + 1.8685\, z_2,$$

and the objective becomes

$$f(z_1, z_2) = A^T [S] Z + \frac{1}{2} Z^T [\tilde{B}] Z = 0.0704\, z_1 - 4.7370\, z_2 + \frac{1}{2}(19.5682)\, z_1^2 + \frac{1}{2}(3.5432)\, z_2^2.$$
Stage 2 (Reducing the diagonal elements to unity): Set

$$Y = [S]Z = [S][C]D = [T]D,$$

where

$$[T] = [S][C] = \begin{pmatrix} 1 & 1 \\ -0.5352 & 1.8685 \end{pmatrix} \begin{pmatrix} 0.2262 & 0 \\ 0 & 0.5313 \end{pmatrix} = \begin{pmatrix} 0.2262 & 0.5313 \\ -0.1211 & 0.9927 \end{pmatrix},$$

or

$$y_1 = 0.2262\, d_1 + 0.5313\, d_2, \qquad y_2 = -0.1211\, d_1 + 0.9927\, d_2.$$
Figure 4.1: Contours of the original function.
In terms of the new design variables,

$$f(d_1, d_2) = A^T [T] D + \frac{1}{2} D^T [T]^T [B] [T] D = 0.0160\, d_1 - 2.5167\, d_2 + \frac{1}{2} d_1^2 + \frac{1}{2} d_2^2,$$

whose Hessian is the identity matrix, so the condition number with respect to the new variables is unity.
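The two stages can be reproduced numerically. The sketch below uses numpy's eigendecomposition, whose eigenvectors are normalized to unit length (a different scaling of $[S]$ than above, but with the same effect on the condition number):

```python
import numpy as np

B = np.array([[12.0, -6.0], [-6.0, 4.0]])

# Stage 1: columns of S are eigenvectors of B, so S^T B S is diagonal.
beta, S = np.linalg.eigh(B)          # beta ~ [0.7889, 15.2111]

# Stage 2: rescale so the diagonal entries of the new Hessian become one.
C = np.diag(1.0 / np.sqrt(beta))
T = S @ C

print(T.T @ B @ T)                   # ~ identity: condition number is unity
```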
Chapter 5
Extensions
The SDM has been modified in a number of ways. Barzilai and Borwein presented
two new step sizes for the negative gradient direction in 1988. Despite the fact that
their method did not ensure descent in the objective function values, the numerical
results showed that it was a significant improvement over the traditional SDM. The
goal of their strategy was to hasten the convergence of the SDM. The Barzilai–Borwein approach requires only a small number of storage locations and inexpensive computations.
Although the Newton method and quasi-Newton methods are useful for addressing
unconstrained minimization problems, they cannot be used to solve large-scale un-
constrained minimization problems directly. As a result, numerical approaches based
on the SD direction are favoured since they do not require the storing of matrices.
For unconstrained minimization problems, the SDM is the simplest gradient method. It uses the exact step size in the iteration

$$Y_{j+1} = Y_j + \lambda_j s_j, \quad \text{where} \quad \lambda_j = \frac{s_j^T s_j}{s_j^T H_j s_j},$$

and, as shown in Chapter 3, this can converge very slowly on ill-conditioned problems.
As a result, several authors experimented with different step sizes in order to address this flaw. Barzilai and Borwein take the new iterate as

$$y_{k+1} = y_k - \frac{1}{\beta_k} g_k.$$

Instead of performing a line search or employing the quadratic-case formula, the step length $\beta_k$ is calculated as

$$\beta_k = \frac{s_{k-1}^T z_{k-1}}{s_{k-1}^T s_{k-1}},$$

where $s_{k-1} = y_k - y_{k-1}$ and $z_{k-1} = g_k - g_{k-1}$. The Barzilai and Borwein approach requires just $O(n)$ floating-point operations and one gradient evaluation per iteration.
During the process, there are no matrices to compute and no line searches to perform. Barzilai and Borwein presented a convergence analysis for the two-dimensional quadratic case, and they established R-superlinear convergence for that particular scenario.
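A minimal Python sketch of the Barzilai–Borwein iteration described above (the initial step length and stopping rule are our own choices):

```python
import numpy as np

def barzilai_borwein(grad, y0, beta0=1.0, tol=1e-8, max_iter=10000):
    """BB method: y_{k+1} = y_k - (1/beta_k) g_k, with
    beta_k = (s^T z)/(s^T s), s = y_k - y_{k-1}, z = g_k - g_{k-1}."""
    y = np.asarray(y0, dtype=float)
    g = grad(y)
    beta = beta0                       # first step has no history to use
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        y_new = y - g / beta
        g_new = grad(y_new)
        s = y_new - y                  # iterate difference
        z = g_new - g                  # gradient difference
        beta = (s @ z) / (s @ s)       # BB step length
        y, g = y_new, g_new
    return y
```

On the quadratic of Example 2, for instance, `barzilai_borwein(grad, [0.0, 0.0])` with the gradient defined earlier typically reaches the minimizer in far fewer iterations than exact-line-search steepest descent.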
Chapter 6
Applications
Now that the basic convergence theory, as expressed by the rate-of-convergence formula, has been devised and shown to truly describe SDM behaviour, it is necessary to demonstrate how the theory may be used. We do not recommend computing the numerical value of the formula, since it involves eigenvalues, or eigenvalue ratios, that are difficult to calculate. Nevertheless, the formula itself is quite useful in practice, since it allows one to compare various circumstances theoretically. Without a theory like this, the only option would be to depend solely on experimental comparisons.
6.1 Application 1
Penalty Methods: Penalty techniques are methods for approximating constrained optimization problems by unconstrained problems. In penalty approaches, the approximation is achieved by adding to the objective function a term that assigns a high cost to violating the constraints. There are two major issues with the method. The first is how closely the unconstrained problem approximates the constrained one. The second, most relevant from a practical standpoint, is how to solve an unconstrained problem whose objective function contains a penalty term.

Consider the problem

$$\text{minimize } g(y) \quad \text{subject to } Y \in T,$$
where $T$ is a constraint set in $E^n$ and $g$ is a continuous function on $E^n$. In most instances $T$ is implicitly defined by a set of functional constraints, although the more general description can be addressed in this section. A penalty function technique replaces the constrained problem by an unconstrained problem of the form

$$\text{minimize } q(c, y) = g(y) + c\,P(y),$$

where $c$ is a positive constant and $P$ is a penalty function on $E^n$.
Example 1 (Exterior penalty approach): Start outside the feasible region and slowly converge on a minimum from the outside. For equality constraints, the problem

$$\text{minimize } g(y) \quad \text{s.t. } h_i(y) = 0,\ i \in E,$$

is replaced by

$$\text{minimize } F(y, \sigma) = g(y) + \sigma \sum_{i \in E} h_i^2(y).$$

For a constraint set of the form

$$T = \{y : h_j(y) \leqslant 0,\ j = 1, 2, \dots, r\},$$

a very useful penalty function is

$$R(y) = \frac{1}{2} \sum_{j=1}^{r} \left(\max\,[0, h_j(y)]\right)^2.$$
Figure 6.1: Plot of $cR(y)$.
In the one-dimensional case, the function $cR(y)$ is shown in Figure 6.1 for

$$h_1(y) = y - b, \qquad h_2(y) = a - y.$$

It is obvious that for large $c$, the minimum point of the penalized problem will lie in a region where $R$ is small. As a result, it is expected that as $c$ rises, the corresponding solution points will approach the feasible region $T$ and, if close enough, will minimize $g$. Ideally, the penalized problem's solution point should converge to the constrained problem's solution point as $c \to \infty$.
The Method: The procedure for solving the problem by the penalty function method is this. Let $\{c_n\}$, $n = 1, 2, \dots$, be a sequence tending to infinity such that for each $n$, $c_n \geqslant 0$ and $c_{n+1} > c_n$. Define the function $q(c, y) = g(y) + c\,R(y)$ and, for each $n$, solve the problem

$$\text{minimize } q(c_n, y),$$

obtaining a solution point $y_n$. We assume that each problem $n$ has a solution; this will be true, for example, if $q(c, y)$ grows unboundedly as $|y| \to \infty$.
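A sketch of this sequential procedure, using scipy's general-purpose minimizer as a stand-in for the unconstrained solver and a made-up objective and constraint:

```python
import numpy as np
from scipy.optimize import minimize

g = lambda y: (y[0] - 2)**2 + (y[1] - 1)**2      # illustrative objective
h = lambda y: y[0] + y[1] - 1                    # constraint h(y) <= 0
R = lambda y: 0.5 * max(0.0, h(y))**2            # penalty R(y)

y = np.zeros(2)
for c in [1.0, 10.0, 100.0, 1000.0]:             # increasing sequence c_n
    q = lambda y, c=c: g(y) + c * R(y)           # q(c_n, y) = g + c R
    y = minimize(q, y).x                         # warm-start from y_{n-1}
print(y)  # approaches the constrained minimizer as c_n grows
```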
Let us look at a problem with a single constraint once more:

$$\text{minimize } g(y) \quad \text{subject to } h(y) = 0,$$

which is replaced by the unconstrained problem

$$\text{minimize } g(y) + \frac{1}{2}\mu\, h(y)^2,$$
where $\mu$ is a (large) penalty coefficient. Because of the penalty, the solution will tend to have a small $h(y)$. The method of SD may be used to solve the given problem as an unconstrained problem. What will be the outcome? Consider, for the sake of simplicity, the scenario where $g$ is quadratic and $h$ is linear. In particular, we focus on the problem

$$\text{minimize } \frac{1}{2} y^T Q y - d^T y \quad \text{subject to } c^T y = 0.$$
The matrix $Q + \mu c c^T$ defines the quadratic form associated with the penalized objective, and accordingly, the condition number of this matrix will determine the SDM convergence rate. The initial matrix $Q$ has been augmented by a large rank-one matrix. This addition will inevitably result in one of the matrix's eigenvalues being large (on the order of $\mu$). As a result, the condition number is proportional to $\mu$, and the rate of convergence becomes exceedingly poor as $\mu$ is increased in order to obtain an accurate solution of the original constrained problem.
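This growth of the condition number is easy to observe numerically; the matrices below are illustrative data only:

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # illustrative positive definite Q
c = np.array([[1.0], [1.0]])              # illustrative constraint normal

for mu in [1.0, 1e2, 1e4]:
    M = Q + mu * (c @ c.T)                # Hessian of penalized objective
    print(mu, np.linalg.cond(M))          # condition number grows like mu
```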
6.2 Application 2
Solution of Gradient Equations: One technique for minimizing a function $g$ is to solve the equations $\nabla g(y) = 0$ that express the necessary conditions. These equations might be solved by applying the SDM to the function $k(y) = |\nabla g(y)|^2$. A benefit of this strategy is that the minimum value of $k$ is known (it is zero). We want to know whether this strategy is likely to be quicker or slower than applying the SDM to the original function $g$. For simplicity we address only the scenario where $g$ is quadratic. Thus let
$$g(y) = \frac{1}{2} y^T Q y - d^T y$$

and

$$k(y) = |\nabla g(y)|^2 = y^T Q^2 y - 2 y^T Q d + d^T d.$$

Let $\bar{c}$ denote the condition number of the matrix $Q^2$. The eigenvalues of $Q^2$ are the squares of those of $Q$, so $\bar{c} = c^2$, where $c$ is the condition number of $Q$.
Thus it is obvious that the suggested method's convergence rate would be lower than that of the SDM applied to the original function. We may even go a step further and predict how much slower the suggested approach would be. If $c$ is big, the SDM rate on $g$ is

$$\left(\frac{c - 1}{c + 1}\right)^2 \simeq (1 - 1/c)^4,$$

while on $k$ the corresponding rate is $\left(\frac{c^2 - 1}{c^2 + 1}\right)^2 \simeq (1 - 1/c^2)^4$; roughly speaking, about $c$ times as many iterations are needed to achieve the same accuracy.
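The predicted slowdown can be checked by counting SDM iterations on $g$ and on $k$ for a small quadratic; the matrix below is illustrative only:

```python
import numpy as np

def sd_iters(H, b, y0, tol=1e-6, max_iter=10**6):
    """Iterations of exact-step SDM on f(y) = 1/2 y^T H y - b^T y."""
    y = np.asarray(y0, dtype=float)
    for j in range(max_iter):
        s = b - H @ y                        # negative gradient of f
        if np.linalg.norm(s) < tol:
            return j
        y = y + ((s @ s) / (s @ H @ s)) * s
    return max_iter

Q = np.array([[10.0, 0.0], [0.0, 1.0]])      # condition number c = 10
d = np.array([1.0, 1.0])

# g has Hessian Q; k(y) = |Qy - d|^2 is the quadratic with
# Hessian 2 Q^2 and linear term 2 Q d.
print(sd_iters(Q, d, np.zeros(2)))                  # few iterations
print(sd_iters(2 * Q @ Q, 2 * Q @ d, np.zeros(2)))  # roughly c times more
```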