1 Basic concepts
1.1 Continuity
f(x) is continuous at x0 if f(x0) and lim_{x→x0} f(x) both exist, and lim_{x→x0} f(x) = f(x0). f(x) = 1/x and f(x) = 1/x² are not continuous at x = 0. f(x) = ln(x) is continuous for x > 0; for x ≤ 0, f(x) is undefined. Derivative-based optimization may fail if discontinuities exist. If the derivative changes sign about x0, the algorithm may also oscillate. Reactors, heat exchangers and pipes are available only in certain sizes; hence discontinuities arise in process design (similarly for vehicle, bus and aircraft sizes). Splines can be used to ensure differentiability.

The general problem

In general,

    min f(x)
    subject to  a_i ≤ g_i(x) ≤ b_i,   i = 1, ..., m
                l_j ≤ x_j ≤ u_j,      j = 1, ..., n
If a_i = b_i, the ith constraint is an equality constraint. If a_i = −∞ and b_i = +∞, the ith constraint places no restriction; similarly, x_j is unbounded if l_j = −∞ and u_j = +∞.

Local vs. global minima: the solution may lie on the boundary, at an extreme point of the boundary, or in the interior (in which case it is as if you have an unconstrained problem).
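As an aside (not part of the original notes), the general problem above maps directly onto standard NLP solvers. The following Python sketch, with an illustrative objective, constraint and bounds, shows one way to pose min f(x) subject to a_i ≤ g_i(x) ≤ b_i and l_j ≤ x_j ≤ u_j, assuming SciPy is available:

    import numpy as np
    from scipy.optimize import minimize, NonlinearConstraint, Bounds

    def f(x):                         # illustrative objective f(x)
        return (x[0] - 1.0)**2 + (x[1] - 2.0)**2

    def g(x):                         # illustrative constraint function g_1(x)
        return x[0] + x[1]

    con = NonlinearConstraint(g, 1.0, 3.0)     # a_1 = 1, b_1 = 3
    bnds = Bounds([0.0, 0.0], [5.0, 5.0])      # l_j = 0, u_j = 5
    res = minimize(f, x0=np.array([0.5, 0.5]), method="trust-constr",
                   constraints=[con], bounds=bnds)
    print(res.x)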
1.2 Convexity
Convex set: for every pair of points x1 and x2 in the set Ω, the straight-line segment joining them lies entirely in the set. A point on this segment is λx1 + (1 − λ)x2, 0 ≤ λ ≤ 1.

f(x) is a convex function if

    f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2),   0 ≤ λ ≤ 1.

If ≤ holds the function is convex; if < holds it is strictly convex. Linear functions are both convex and concave, but neither strictly convex nor strictly concave. If f(x) is convex, then the set R = {x | f(x) ≤ k} is convex for all k.
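A quick numerical sanity check of this defining inequality (an illustrative sketch, not from the notes): evaluate both sides at random points and random λ for a convex quadratic.

    import numpy as np

    def f(x):                              # a convex quadratic chosen for illustration
        return x[0]**2 + 2.0 * x[1]**2

    rng = np.random.default_rng(0)
    for _ in range(5):
        x1, x2 = rng.normal(size=2), rng.normal(size=2)
        lam = rng.uniform()
        lhs = f(lam * x1 + (1.0 - lam) * x2)
        rhs = lam * f(x1) + (1.0 - lam) * f(x2)
        print(lhs <= rhs + 1e-12)          # True for every sample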
If f1 and f2 are two convex functions, then f1 + f2 is convex:
    f1(λx1 + (1 − λ)x2) + f2(λx1 + (1 − λ)x2) ≤ λf1(x1) + (1 − λ)f1(x2) + λf2(x1) + (1 − λ)f2(x2)
                                               = λ[f1(x1) + f2(x1)] + (1 − λ)[f1(x2) + f2(x2)].

If f(x1) ≤ c and f(x2) ≤ c, then f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) ≤ c, which shows that the level set {x | f(x) ≤ c} is convex, as claimed above. Convex sets can be combined (e.g. intersected) to give a convex set.

For differentiable convex functions: if f is convex, then for all λ with 0 ≤ λ ≤ 1 and any two points x and y in Ω,

    f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x),  i.e.  f(x + λ(y − x)) − f(x) ≤ λ[f(y) − f(x)].

As λ → 0, ∇f(x)(y − x) ≤ f(y) − f(x), i.e. a linear approximation based on the local derivative underestimates the function. (Previously, while defining a convex function, we observed that a linear interpolation between two points overestimates the function.)

Conversely, assume f(y) ≥ f(x) + ∇f(x)(y − x) for all x, y. Set x = λx1 + (1 − λ)x2 and take y = x1 and y = x2:

    f(x1) ≥ f(x) + ∇f(x)(x1 − x)
    f(x2) ≥ f(x) + ∇f(x)(x2 − x)
Multiplying the first by λ, the second by (1 − λ), and adding,

    λf(x1) + (1 − λ)f(x2) ≥ f(x) + ∇f(x)(λx1 + (1 − λ)x2 − x).

But x = λx1 + (1 − λ)x2, so the gradient term vanishes and λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2), i.e. f is convex.

Convex functions and global minima

If x* is a relative minimum of f and there is a y with f(y) < f(x*), then on the line λy + (1 − λ)x* we have

    f(λy + (1 − λ)x*) ≤ λf(y) + (1 − λ)f(x*) < f(x*)   for 0 < λ ≤ 1,

which contradicts the claim that x* is a relative minimum. Hence x* is a global minimum.

Convexity and the Hessian

If f is convex and twice differentiable,

    f(y) = f(x) + ∇f(x)(y − x) + ½(y − x)ᵀ H(x + λ(y − x))(y − x),   0 ≤ λ ≤ 1.

Clearly, if H is positive semidefinite everywhere, then f(y) ≥ f(x) + ∇f(x)(y − x) and therefore f is convex.

    f(x) is            H(x) is                  All eigenvalues of H(x) are
    Strictly convex    Positive definite        > 0
    Convex             Positive semidefinite    ≥ 0
    Concave            Negative semidefinite    ≤ 0
    Strictly concave   Negative definite        < 0
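In practice the table above is applied by computing the eigenvalues of the Hessian. A minimal sketch (the quadratic below is an assumed example):

    import numpy as np

    # f(x) = x1^2 + 3*x2^2 + x1*x2 has constant Hessian H = [[2, 1], [1, 6]]
    H = np.array([[2.0, 1.0],
                  [1.0, 6.0]])
    eig = np.linalg.eigvalsh(H)            # eigenvalues of the symmetric matrix
    print(eig)                             # both > 0, so f is strictly convex
    print("strictly convex" if np.all(eig > 0)
          else "convex" if np.all(eig >= 0) else "not convex")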
Notation: f ∈ C² (f is twice continuously differentiable).
Quadratic functions
    f(x) = a0 + a1 x1 + a2 x2 + a11 x1² + a22 x2² + a12 x1 x2
Determine the Hessian and find its eigenvalues, λ1 and λ2. λ1, λ2 > 0: minimum (a valley); λ1 = λ2 gives circular contours, while λ1 > λ2 gives elliptical contours. λ1, λ2 < 0: maximum (a hill). λ1 > 0, λ2 < 0: hyperbolic contours, i.e. saddle points.
Converging across a valley is slow if the required movement is along the direction associated with the smaller eigenvalue (a long, narrow valley).

    ∇²f(x*) = H(x*)         dᵀ∇²f(x*)d            x* is
    Positive definite        > 0                   minimum
    Positive semidefinite    ≥ 0                   possibly a minimum
    Negative definite        < 0                   maximum
    Negative semidefinite    ≤ 0                   possibly a maximum
    Indefinite               both ≥ 0 and ≤ 0      unknown
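The same eigenvalue computation classifies a stationary point according to the table above; a short sketch with an assumed saddle-point example:

    import numpy as np

    # f(x) = x1^2 - x2^2 has a stationary point at the origin, H = diag(2, -2)
    H = np.diag([2.0, -2.0])
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        print("minimum")
    elif np.all(eig < 0):
        print("maximum")
    elif np.any(eig > 0) and np.any(eig < 0):
        print("saddle point")              # this example: eigenvalues 2 and -2
    else:
        print("semidefinite: possibly a minimum or maximum")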
Weak minima vs. strong minima: for f(x) = x2²(x1 + 3), a surface of minima exists at x2 = 0, and hence the minimum is weak (not a strong, isolated minimum).
1.3 Unconstrained minimization
Consider a set of points Ω. x* ∈ Ω is a relative (local) minimum if f(x*) ≤ f(x* + Δx) for all sufficiently small feasible Δx. x* is a global minimum if f(x*) ≤ f(x) for all x ∈ Ω.

First order necessary condition

If x* is a relative minimum of f over Ω, then for any feasible direction d at x*,

    ∇f(x*)d ≥ 0.

Proof: for any α, 0 ≤ α ≤ ᾱ, let x(α) = x* + αd and g(α) = f(x(α)), which implies that g(α) has a relative minimum at α = 0. Then

    g(α) − g(0) = g′(0)α + o(α) ≥ 0,

so g′(0) = ∇f(x*)d ≥ 0. Here we use the Taylor expansion

    f(x) = f(x*) + ∇f(x*)(x − x*) + ½(x − x*)ᵀ H(x*)(x − x*) + higher-order terms.

Second order necessary conditions

1. ∇f(x*)d ≥ 0.
2. If ∇f(x*)d = 0, then dᵀ∇²f(x*)d ≥ 0.

Proof: let x(α) = x* + αd and g(α) = f(x(α)). Then g′(0) = 0 and

    g(α) − g(0) = ½ g″(0)α² + o(α²).

(Big oh: O(αⁿ) implies the term scales as αⁿ. Small oh: o(α) implies that these terms go to 0 faster than α itself does.)
If g″(0) < 0, the right-hand side is negative for small α, so α = 0 would not be a minimum. Hence g″(0) = dᵀ∇²f(x*)d ≥ 0. Note that the second condition states that the Hessian must be positive semidefinite at a minimum.

Sufficient conditions for a minimum

1. ∇f(x*) = 0.
2. ∇²f(x*) is positive definite.
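As a worked illustration (an example chosen here, not from the notes): for f(x) = (x1 − 1)² + 2x2², the gradient ∇f(x) = [2(x1 − 1), 4x2]ᵀ vanishes only at x* = [1, 0]ᵀ, and ∇²f = diag(2, 4) is positive definite everywhere. Both sufficient conditions hold, so x* is a strict local minimum (and, since f is convex, the global minimum).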
1.4 Rates of convergence
Linear convergence

For large k (i.e. close to x*), if c is the convergence ratio,

    ||x_{k+1} − x*|| ≤ c ||x_k − x*||,   0 ≤ c < 1.
The error in the result drops slowly: if c = 0.1, the error drops 10-fold with each iteration, i.e. one additional digit of accuracy per iteration.
Order p convergence
    ||x_{k+1} − x*|| ≤ c ||x_k − x*||ᵖ,   c ≥ 0, p ≥ 1, k large.
p = 2 implies quadratic convergence. If ||x_k − x*|| = 10⁻¹ for some k, then ||x_{k+1} − x*|| ≤ c·10⁻², ||x_{k+2} − x*|| ≤ c³·10⁻⁴, ||x_{k+3} − x*|| ≤ c⁷·10⁻⁸, etc. Even if c = 1, only a few iterations are needed to obtain 16 significant decimal digits.

Superlinear convergence

    ||x_{k+1} − x*|| ≤ c_k ||x_k − x*||,   with c_k → 0 as k → ∞.
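A tiny illustration (assumed values: c = 0.1 and an initial error of 10⁻¹) of how much faster the quadratic bound shrinks than the linear one:

    # error recursions ||x_{k+1} - x*|| <= c ||x_k - x*||^p with p = 1 and p = 2
    e_lin, e_quad = 1e-1, 1e-1
    for k in range(4):
        e_lin = 0.1 * e_lin                # linear: one extra digit per iteration
        e_quad = 0.1 * e_quad**2           # quadratic: digits roughly double
        print(k + 1, e_lin, e_quad)
    # linear: 1e-2, 1e-3, 1e-4, 1e-5;  quadratic: 1e-3, 1e-7, 1e-15, 1e-31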
1.5 Bracketing
Let the initial point be xA and the trial length be t, so x1 = xA + t. If f(x1) < f(xA), then xB = x1 and the interval remains t; else try x1 = xA + t/2. The procedure (a code sketch follows the list):

1. Let xA and x1 be at the interval bounds, with x1 = xA + t.
2. If f(x1) > f(xA), set xC = x1 and xB = xA + t/2.
3. If f(x1) ≤ f(xA), set xB = x1 and x2 = xA + 2t.
4. If f(x2) > f(x1), set xC = x2 (a bracket has been found). If f(x2) ≤ f(x1), set xA = xB, t = 2t, and go to step 2.
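A minimal Python sketch of this step-doubling idea (simplified relative to the steps above, with illustrative names and test function):

    def bracket_minimum(f, xA, t):
        """Return (xA, xB, xC) with f(xB) <= f(xA) and f(xB) <= f(xC)."""
        xB = xA + t
        if f(xB) > f(xA):
            xC, xB = xB, xA + t / 2.0      # step 2: try the half step
            if f(xB) > f(xA):
                raise ValueError("initial step does not decrease f; reduce t")
            return xA, xB, xC
        while True:                        # steps 3-4: keep doubling the trial length
            xC = xB + 2.0 * t
            if f(xC) > f(xB):
                return xA, xB, xC          # bracket found
            xA, xB, t = xB, xC, 2.0 * t    # shift forward and double

    print(bracket_minimum(lambda x: (x - 3.0)**2, 0.0, 0.5))   # e.g. (1.5, 3.5, 7.5)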
Width of the interval of uncertainty: with Fibonacci numbers F0 = F1 = 1 and F_{k+1} = F_k + F_{k−1}, the interval can be reduced systematically (Fibonacci search). In the limit, the reduction ratio per iteration approaches the golden-section value,

    d_{k+1}/d_k = 1/τ = (√5 − 1)/2 ≈ 0.618,

where τ = (1 + √5)/2 satisfies τ² = τ + 1.
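A golden-section search sketch (the standard algorithm, with an illustrative test function and tolerance), shrinking the interval by the factor 0.618 at each iteration:

    def golden_section(f, a, b, tol=1e-6):
        r = (5.0 ** 0.5 - 1.0) / 2.0        # 0.618...
        c, d = b - r * (b - a), a + r * (b - a)
        fc, fd = f(c), f(d)
        while b - a > tol:
            if fc < fd:                     # minimum lies in [a, d]
                b, d, fd = d, c, fc
                c = b - r * (b - a)
                fc = f(c)
            else:                           # minimum lies in [c, b]
                a, c, fc = c, d, fd
                d = a + r * (b - a)
                fd = f(d)
        return 0.5 * (a + b)

    print(golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0))    # ~2.0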
Newton's method: the minimization can be performed iteratively by solving f′(x) = 0. Using the notation g(x) = f′(x),

    x_{k+1} = x_k − g(x_k)/g′(x_k).
Convergence of Newton's method: if g has continuous second derivatives (i.e. f‴(x) exists and is continuous), let x* satisfy g(x*) = 0 with g′(x*) ≠ 0. If x0 is reasonably close to x*, the sequence {x_k}, k = 0, 1, 2, ..., generated by Newton's method converges to x* with an order of convergence of at least two.

Proof: g(x*) = 0 and x_{k+1} = x_k − g(x_k)/g′(x_k), so

    x_{k+1} − x* = x_k − x* − g(x_k)/g′(x_k) = [g(x*) − g(x_k) − g′(x_k)(x* − x_k)] / g′(x_k).
By Taylor's theorem, the numerator is zero to first order. For some ξ between x* and x_k,

    x_{k+1} − x* = ½ [g″(ξ)/g′(x_k)] (x_k − x*)²,

so near x*, |x_{k+1} − x*| ≤ c|x_k − x*|². Newton's method would require one iteration for a quadratic function (good!). We need to compute f′(x) and f″(x) (bad!), however, and if f″(x) ≈ 0, convergence will be very slow (very bad!). We also need to start close enough to the minimum to ensure convergence.

Finite difference approximation to Newton: if f″(x) is not available, we can use a numerical version of the derivative(s):

    x_{k+1} = x_k − { [f(x_k + h) − f(x_k − h)] / 2h } / { [f(x_k + h) − 2f(x_k) + f(x_k − h)] / h² }
where we have used a central difference method. We could use a forward difference approach instead; h should be appropriately small.
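A sketch of this finite-difference Newton iteration (the test function, step h and iteration count are illustrative; central differences as above):

    def fd_newton(f, x0, h=1e-4, iters=20):
        x = x0
        for _ in range(iters):
            g = (f(x + h) - f(x - h)) / (2.0 * h)            # ~ f'(x)
            gp = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2   # ~ f''(x)
            x = x - g / gp
        return x

    # the quartic below is strictly convex, so the iteration converges
    print(fd_newton(lambda x: (x - 1.5)**4 + x**2, 0.0))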
Method of false position: the iteration is started at two points x_{k−1} and x_k where g(x) has different signs, and the secant slope replaces g′(x_k),

    x_{k+1} = x_k − g(x_k) (x_k − x_{k−1}) / (g(x_k) − g(x_{k−1})).

Next find x_{k+1} and g(x_{k+1}). Keep x_{k+1} and one of x_k and x_{k−1}, chosen so that the two retained points have g(x) of opposite sign.
Order of convergence
Using g(x*) = 0,

    x_{k+1} − x* = x_k − x* − g(x_k)(x_k − x_{k−1}) / (g(x_k) − g(x_{k−1}))
                 = (x_k − x*) [ 1 − { [g(x_k) − g(x*)] / (x_k − x*) } / { [g(x_k) − g(x_{k−1})] / (x_k − x_{k−1}) } ].

Expanding the divided differences about x* shows this is of the form (x_{k+1} − x*) ≈ M(x_k − x*)(x_{k−1} − x*), where M involves g″(x*)/2g′(x*). Let ε_k = x_k − x*. Taking logs (y_k = ln|Mε_k|), you get an equation of the form y_{k+1} = y_k + y_{k−1}, which is a Fibonacci-type recurrence. Hence the order of convergence is (1 + √5)/2 ≈ 1.618.
For a minimization problem, having a bracket requires x1 < x2 with f′(x1) < 0 and f′(x2) > 0. Calculate f′(x_{k+1}) and then choose which of the k and k − 1 points to replace, so that the sign condition is maintained.
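A sketch of this bracket-preserving iteration applied to g(x) = f′(x) (the derivative, starting bracket and iteration count are illustrative assumptions):

    def false_position_min(fprime, x_lo, x_hi, iters=30):
        g_lo, g_hi = fprime(x_lo), fprime(x_hi)
        assert g_lo < 0.0 < g_hi, "need f'(x_lo) < 0 and f'(x_hi) > 0"
        for _ in range(iters):
            # secant step through the two bracketing points
            x_new = x_hi - g_hi * (x_hi - x_lo) / (g_hi - g_lo)
            g_new = fprime(x_new)
            if g_new > 0.0:                # keep a pair with opposite signs
                x_hi, g_hi = x_new, g_new
            else:
                x_lo, g_lo = x_new, g_new
        return x_new

    # f(x) = x^4/4 - x, so f'(x) = x^3 - 1 and the minimum is at x = 1
    print(false_position_min(lambda x: x**3 - 1.0, 0.0, 2.0))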
1.6 Line search
Start at [1, 2]ᵀ and use the negative gradient as the initial search direction, with

    ∇f(x) = [4x1³ − 4x1x2 + 2x1 − 2,  −2x1² + 2x2]ᵀ

(the gradient of f(x) = (x1² − x2)² + (1 − x1)² + 4, which reproduces the function values quoted below).
At x⁽⁰⁾ = [1, 2]ᵀ, f(x⁽⁰⁾) = 5 and d = −∇f(x⁽⁰⁾) = [4, −2]ᵀ. The new point is x_new = x_old + αd, i.e.

    x1,new = x1,old + αd1
    x2,new = x2,old + αd2
We choose α⁽⁰⁾ = 0.05 and look to bracket the minimum:

    x1⁽¹⁾ = x1⁽⁰⁾ + 0.05(4) = 1.2
    x2⁽¹⁾ = x2⁽⁰⁾ + 0.05(−2) = 1.9

Then f(x⁽¹⁾) = 4.25. We next try α⁽¹⁾ = 2α⁽⁰⁾ = 0.1:

    x1⁽²⁾ = x1⁽¹⁾ + 0.1(4) = 1.6
    x2⁽²⁾ = x2⁽¹⁾ + 0.1(−2) = 1.7

f(x⁽²⁾) = 5.10. The minimum is now bracketed. The optimal value (α* = 0.0797) is obtained by quadratic interpolation.
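A numerical cross-check of this example (a sketch assuming the reconstructed f above; scipy.optimize.minimize_scalar locates the line minimum on the bracket):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def f(x):
        return (x[0]**2 - x[1])**2 + (1.0 - x[0])**2 + 4.0

    x0 = np.array([1.0, 2.0])
    d = np.array([4.0, -2.0])
    phi = lambda a: f(x0 + a * d)
    print(phi(0.0), phi(0.05), phi(0.15))          # 5.0, ~4.25, ~5.10
    res = minimize_scalar(phi, bounds=(0.0, 0.15), method="bounded")
    print(res.x)                                   # ~0.0797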
1.7 Termination of a line search

A line search involves starting at a point, finding a direction to move along, moving to a minimum of the objective function along that line, and then restarting from the new point.

Percentage test: the step size x (or α) comes to within a fixed percentage of its true value. Let x be the proposed step size and x* the step size needed to reach the true minimum along the line; the test is

    |x − x*| ≤ c x*,   0 < c < 1,

and c = 0.1 is acceptable.

Armijo's rule: we first guarantee that α is not too large, and next that it is not too small. Let φ(α) = f(x_k + αd_k). α is not too large if, for a fixed ε (0 < ε < 1),

    φ(α) ≤ φ(0) + εαφ′(0).
By focusing on α and φ(α), we can use one-dimensional search techniques for a multidimensional problem. Typically ε ≈ 10⁻⁴.
Note that φ′(0) = ∇f(x_k)ᵀd_k is the slope, and it is negative provided d_k is a descent direction. α should not be too small: define η > 1. α is not too small if

    φ(ηα) > φ(0) + εηαφ′(0),

i.e. if α is increased by the factor η, the first test (on largeness) fails. A common problem is that, for a given ε and η, very small values of α, such as 10⁻¹⁵, may satisfy both conditions.

Goldstein's test: as in Armijo's rule, α should not be too large,

    φ(α) ≤ φ(0) + εαφ′(0),

but 0 < ε < 0.5 here. α is not too small if

    φ(α) > φ(0) + (1 − ε)αφ′(0),

or, in the original notation, f(x_{k+1}) ≥ f(x_k) + (1 − ε)α∇f(x_k)ᵀd_k. Notice the similarity to the second condition of Armijo's rule. Therefore 0.5 < 1 − ε < 1.
Wolfe's test: this test is used when derivatives are available. In addition to the sufficient-decrease condition φ(α) ≤ φ(0) + εαφ′(0), the slope must have increased sufficiently,

    φ′(α) ≥ (1 − ε)φ′(0),   0 < ε < 0.5.

Backtracking

This is a line search method where we start with an initial guess of α and use two parameters η > 1 and ε < 1 (usually ε < 0.5). The stopping criterion is the same as the first condition of the Armijo and Goldstein tests. If this condition is not satisfied, then α is reduced by a factor 1/η, giving α_new = α_old/η. If the initial α satisfies the test, it is used as the step size; else, it is reduced by the factor 1/η repeatedly until it finally satisfies the test. At that point α_old = ηα_new does not pass the first test, so the accepted α also passes the second condition of Armijo's rule. This backtracking method is important because we apply it repeatedly during line searches for a multidimensional optimization problem.
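A minimal backtracking sketch following the description above (ε, η and the test function/direction are illustrative choices):

    import numpy as np

    def backtrack(f, grad, x, d, alpha0=1.0, eps=1e-4, eta=2.0):
        phi0 = f(x)
        slope = grad(x) @ d                # phi'(0) = grad f(x)^T d, negative for a descent direction
        alpha = alpha0
        while f(x + alpha * d) > phi0 + eps * alpha * slope:
            alpha = alpha / eta            # reduce by the factor 1/eta and retest
        return alpha

    # usage sketch on f(x) = x1^2 + 10*x2^2 with d = -grad f(x)
    f = lambda x: x[0]**2 + 10.0 * x[1]**2
    grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
    x = np.array([1.0, 1.0])
    a = backtrack(f, grad, x, -grad(x))
    print(a, f(x + a * (-grad(x))))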