Lecture 24: Simulated Annealing
The instructor of this course owns the copyright of all the course materials. This lecture
material was distributed only to the students attending the course MTH511a: “Statistical
Simulation and Data Analysis” of IIT Kanpur, and should not be distributed in print or
through electronic media without the consent of the instructor. Students can make their own
copies of the course materials for their use.
Recall that when the objective function is non-concave, none of the methods we have discussed so far can escape a local maximum. This makes it difficult to find the global maximum. This is where the method of simulated annealing has an advantage over the other methods.
Consider an objective function f(θ) to maximize. Note that maximizing f(θ) is equivalent to maximizing exp(f(θ)). The idea in simulated annealing is that, instead of trying to find a maximum directly, we will obtain samples from the density

π(θ) ∝ exp(f(θ)).

Wherever there is a maximum, samples drawn from π(θ) are likely to come from areas near that maximum. However, obtaining samples from π(θ) means there will be samples from low-probability areas as well. So how do we force the samples to come from areas near the maxima?
Consider, for T > 0,

∂ exp(f(θ)/T) / ∂θ = exp(f(θ)/T) f'(θ)/T,

which has the same roots and the same sign as f'(θ). Thus, for any T > 0, exp(f(θ)/T) increases and decreases exactly where f(θ) does, and has the same maxima as f(θ). For 0 < T < 1, the modes of the objective function are exaggerated, thereby amplifying the maxima.
Example 1. Consider the following objective function
[Figure: exp(f(θ)/T) plotted for T = 1, 0.83, 0.75 and 0.71; the maxima become more pronounced as T decreases.]
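The behaviour in the figure can be reproduced with a few lines of R. The bimodal objective f below is an assumed toy function chosen only for illustration (it is not the objective of Example 1); the T values match those in the figure.

# How dividing by T < 1 amplifies the maxima of exp(f/T). The bimodal
# objective f is an assumed toy function, not the one in Example 1.
f <- function(theta) 3 - 2*(theta^2 - 1)^2 + 0.5*theta
theta <- seq(-2, 2, length.out = 500)
Ts <- c(1, .83, .75, .71)
vals <- sapply(Ts, function(T) exp(f(theta)/T))
matplot(theta, vals, type = 'l', lty = 1, ylab = "exp(f/T)", xlab = expression(theta))
legend("topleft", legend = paste("T =", Ts), col = 1:4, lty = 1)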
In simulated annealing, this feature is utilized so that every subsequent sample is drawn from an increasingly concentrated distribution. That is, at time point k, a sample will be drawn from

π_{k,T}(θ) ∝ exp(f(θ)/T).

Certainly, we can try and use accept-reject or another Monte Carlo sampling method, but such methods cannot be implemented in general.
Note that for any θ', θ,

π_{k,T}(θ') / π_{k,T}(θ) = exp{ (f(θ') − f(θ)) / T },

so this ratio can be computed without knowing the normalizing constant of π_{k,T}.
Let G be a proposal distribution with density g(θ' | θ) such that g(θ' | θ) = g(θ | θ'). Such a proposal distribution is called a symmetric proposal distribution.
The simulated annealing algorithm is the following.

1: Choose a starting value θ_1 and an initial temperature T_1.
2: At step k, draw a proposal θ' ~ g(· | θ_k).
3: Compute α = min{ 1, exp( (f(θ') − f(θ_k)) / T_k ) }.
4: Draw U ~ U[0, 1]. If U < α, set θ_{k+1} = θ'.
5: Else θ_{k+1} = θ_k.
6: Update T_{k+1}.
7: Store θ_{k+1} and exp(f(θ_{k+1})/T_{k+1}).
8: Return θ* = θ_{k*}, where k* = argmax_k exp(f(θ_k)/T_k).
Thus, if the proposed value is such that f(θ') > f(θ), then α = 1 and the move is always accepted. The reason simulated annealing works is that, even when θ' is such that f(θ') < f(θ), the move is still accepted with probability α. Thus, there is always a chance of moving out of a local maximum.
Essentially, each θ_k is approximately distributed as π_{k,T}, and as T → 0, π_{k,T} puts more and more mass on the maxima; thus θ_k will typically get increasingly close to θ*.
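To get a feel for the acceptance probability, consider a downhill proposal with f(θ') − f(θ) = −2 (a value picked purely for illustration); it is accepted with probability exp(−2/T):

exp(-2/1)      # T = 1
[1] 0.1353353
exp(-2/0.2)    # T = 0.2
[1] 4.539993e-05

Early on (large T) such moves are accepted fairly often, which lets the chain escape local maxima; as T decreases, downhill moves become increasingly rare and the chain settles near a maximum.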
• Typically, G(· | θ) is U(θ − r, θ + r) or N(θ, r), both of which are valid symmetric proposals. The parameter r dictates how far or close the proposed values are from the current value.
• T_k is often called the temperature parameter. A common choice is T_k = d/log(k) for some constant d; a quick look at this schedule is given below.
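For instance, with d = 1 (the value used in the implementation below), the temperature decreases very slowly:

round(1/log(c(2, 10, 100, 1000, 10000)), 3)
[1] 1.443 0.434 0.217 0.145 0.109

so even after thousands of steps the chain is still allowed occasional downhill moves.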
}
for(k in 2:N)
{
  # propose from U(x[k-1] - r, x[k-1] + r), a symmetric proposal
  a <- runif(1, x[k-1] - r, x[k-1] + r)

  # temperature schedule T_k = 1/log(k)
  T <- 1/log(k)

  # acceptance ratio exp(f(a)/T) / exp(f(x[k-1])/T)
  ratio <- fn(a, T)/fn(x[k-1], T)
  if(runif(1) < ratio)
  {
    x[k] <- a           # accept
  } else {
    x[k] <- x[k-1]      # reject, so stay
  }
}
return(x)
}
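For reference, a self-contained version of this sampler might look like the sketch below. Here the objective fn (reusing the toy f from earlier), the starting value x[1], and the default value of r are assumptions made for illustration, not necessarily the choices used in the example above.

# Self-contained sketch. The objective fn, the starting value, and the
# default r are assumed for illustration only.
fn <- function(x, T = 1) exp((3 - 2*(x^2 - 1)^2 + 0.5*x)/T)   # exp(f(x)/T) for a toy f

simAn <- function(N, r = .5)
{
  x <- numeric(N)
  x[1] <- runif(1, -2, 2)                    # assumed starting value
  for(k in 2:N)
  {
    a <- runif(1, x[k-1] - r, x[k-1] + r)    # symmetric uniform proposal
    T <- 1/log(k)                            # temperature T_k = 1/log(k)
    ratio <- fn(a, T)/fn(x[k-1], T)
    if(runif(1) < ratio) x[k] <- a else x[k] <- x[k-1]
  }
  return(x)
}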
Below I implement the algorithm for 500 steps and return the estimate of θ*. I also plot the values of θ obtained.
N <- 500
sim <- simAn(N = N)
sim[which.max(fn(sim))] # theta^*
[1] 0.3792136
[Figure: the values of θ sampled by the algorithm, plotted against exp(f/T).]
Example 3 (Location Cauchy). Recall the location Cauchy example discussed in Week 6, Lecture 15. The objective function is the log-likelihood of the location Cauchy distribution with mode at µ ∈ R. The goal is to find the MLE of µ. The density is

f(x | µ) = 1 / ( π (1 + (x − µ)²) ).
The log-likelihood is

l(µ) := log L(µ | X) = −n log π − Σ_{i=1}^{n} log(1 + (X_i − µ)²).
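As a minimal sketch (the function name loglik and the argument X, the observed data vector from Lecture 15, are illustrative choices), this log-likelihood can be coded directly:

# log-likelihood of the location Cauchy model at mu, for data vector X
loglik <- function(mu, X) -length(X)*log(pi) - sum(log(1 + (X - mu)^2))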
[Figure: the log-likelihood l(µ) for the generated dataset, plotted over µ ∈ (−10, 40).]
Recall that for the dataset generated, the log-likelihood (above) was not concave and presented many local maxima. This caused Newton-Raphson to possibly diverge or converge to a minimum or local maximum, and caused the gradient ascent algorithm to converge to a local maximum. We will implement the simulated annealing algorithm here with G = U(θ − r, θ + r).
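A sketch of how simAn might be adapted for this example is below. It assumes the data vector X from Lecture 15 is available in the workspace, uses an assumed starting value, compares proposals on the log scale, and returns a list with components x and fn.value so that it matches the plotting code that follows; all of these are assumptions made for illustration.

# Sketch of a simulated annealing sampler for the Cauchy log-likelihood.
# Assumes the data vector X is available; the starting value and the
# returned list structure are assumptions.
simAn <- function(N, r)
{
  loglik <- function(mu) -length(X)*log(pi) - sum(log(1 + (X - mu)^2))
  x <- numeric(N)
  x[1] <- runif(1, -10, 40)                  # assumed starting value
  for(k in 2:N)
  {
    a <- runif(1, x[k-1] - r, x[k-1] + r)    # U(x - r, x + r) proposal
    T <- 1/log(k)                            # temperature schedule
    # compare on the log scale to avoid under/overflow
    if(log(runif(1)) < (loglik(a) - loglik(x[k-1]))/T)
    {
      x[k] <- a
    } else {
      x[k] <- x[k-1]
    }
  }
  list(x = x, fn.value = exp(sapply(x, loglik)))
}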
I run the algorithm for 100 steps from four randomly chosen starting points.
par(mfrow = c(2,2))
sim <- simAn(N = 1e2, r = 5)
plot(mu.x, ll.est, type = 'l', ylab = "log-likelihood", xlab = expression(mu))
points(sim$x, log(sim$fn.value), pch = 16, col = adjustcolor("darkred", alpha = .4))
[Figure: four runs of simulated annealing (2 × 2 panels), each overlaid on the log-likelihood over µ ∈ (−10, 40).]
Note that the simulated annealing algorithm is able to escape local modes and head towards the global maximum. However, the above algorithm is implemented only after tuning r, and tuning r can be challenging.
• Large r: the proposed values are too far away, in regions where the objective function is very low. These values get rejected and the algorithm does not move.
• Small r: the proposed values are too close, so the change in the objective function is small. These values are often accepted, but the algorithm makes very tiny jumps.
Below are runs of the simulated annealing algorithm with r chosen to be too high (500)
and too low (.1).
par(mfrow = c(1,2))

## Different values of r
# very large r
sim <- simAn(N = 1e3, r = 500)
plot(mu.x, ll.est, type = 'l', main = "r = 500. Many rejections", ylab = "log-likelihood", xlab = expression(mu))
points(sim$x, log(sim$fn.value), pch = 16, col = adjustcolor("blue", alpha = .2))

# very small r
plot(mu.x, ll.est, type = 'l', main = "r = .1. Many small acceptances", ylab = "log-likelihood", xlab = expression(mu))
sim <- simAn(N = 1e3, r = .1)
points(sim$x, log(sim$fn.value), pch = 16, col = adjustcolor("blue", alpha = .2))
[Figure: left panel, r = 500 (many rejections); right panel, r = .1 (many small acceptances); sampled values overlaid on the log-likelihood.]
2 Questions to think about
• How do you think this algorithm will scale in higher dimensions? Try implement-
ing simulated annealing for a Lasso optimization problem.
• Is there any benefit to having T > 1?