Lecture 7

The document discusses various methods for solving inverse problems, including linear, weakly nonlinear, and strongly nonlinear problems. Iterative solvers and optimization techniques like steepest descent, (quasi) Newton methods, and conjugate gradients are described for solving linear systems of equations. Monte Carlo methods, including uniform random search and simulated annealing, are summarized as derivative-free techniques for parameter search in nonlinear inverse problems. Examples provided include seismic receiver function inversion and history matching in an oil reservoir.

What we will be skipping over

Optimization and iterative solvers, Backus and Gilbert...

Iterative solvers and optimization

Iterative solution of large linear systems of equations, Ax = b, and of
nonlinear optimization problems:

Steepest descent
(Quasi) Newton methods
Conjugate gradients
Preconditioning

These are gradient-based optimization methods; parallelised libraries are
now available.

Review material can be found in Ch. 6 of Aster et al. (2005) and Section
2.3 of Rawlinson and Sambridge (2003).
Fully nonlinear inversion and parameter search

Recap: linear and nonlinear inverse problems

Linear problems: d = Gm
Single minimum
Gradient methods work
Quadratic convergence
Many unknowns

φ(d, m) = (d − Gm)ᵀ C_D⁻¹ (d − Gm) + μ (m − m_o)ᵀ C_M⁻¹ (m − m_o)
Recap: linear and nonlinear inverse problems

Weakly nonlinear problems: δd = G δm, with G_ij = ∂d_i/∂m_j
Single or multiple minima
Gradient methods might work if the starting point is good enough
Many unknowns

φ(d, m) = (δd − G δm)ᵀ C_D⁻¹ (δd − G δm) + μ (m − m_o)ᵀ C_M⁻¹ (m − m_o)
Recap: linear and nonlinear inverse problems

Strongly nonlinear problems: d = g(m)
Multiple minima
Linearization and gradient methods fail
Tractable with relatively few unknowns (1-100) using direct search techniques
Derivatives ∂d_i/∂m_j of little use

[Figure: data misfit surface for an infrasound array, Kennett et al. (2003)]

φ(d, m) = (d − g(m))ᵀ C_D⁻¹ (d − g(m)) + μ (m − m_o)ᵀ C_M⁻¹ (m − m_o)
Example: nonlinear inverse problems

Seismic receiver function inversion

Example: nonlinear inverse problems

History matching in an oil reservoir

A one parameter data fitting problem

Courtesy P. King
Monte Carlo methods

“A branch of experimental mathematics that is concerned with experiments
on random numbers” — Hammersley and Handscomb (1964)
Monte Carlo methods

The phrase `Monte Carlo' was coined by Metropolis and Ulam (1949).

The method was originally developed by Ulam, von Neumann, and Fermi to
simulate neutron diffusion in fissile materials during development of the
atomic bomb.
Monte Carlo methods

…but did someone think of it earlier?

As noted by Hammersley and Handscomb (1964), Lord Kelvin (1901) described
the use of “astonishingly modern Monte Carlo techniques” in a discussion
of the Boltzmann equation.

S = k ln W
Monte Carlo methods

…but Monte Carlo solutions had been around for even longer.

Hall (1874) recounts numerical experiments to determine the value of π
carried out by injured officers during the American Civil War.

A needle is thrown onto a board containing parallel straight lines. The
statistics of the number of times the needle intersects a line can be used
to estimate π.

This is Buffon’s needle problem (1733).
Buffon’s needle problem

Posed by Buffon in 1733 (solved by Buffon in 1777).

Given a needle of length l dropped on a plane ruled with parallel lines a
distance t apart, what is the probability that the needle will cross a line?

For l ≤ t, if h needles out of n intersect a line, the solution is

h/n = 2l/(πt)

An experiment with n = 500 tosses of a needle of length l = t/3 gave
h = 107 crossings, so

π ≈ 2ln/(th) = 3.116 ± 0.073

For l > t,

h/n = (2/(πt)) { l − √(l² − t²) − t sin⁻¹(t/l) } + 1

Using a Monte Carlo trial of randomly throwing a needle on a board, the
statistics allow an estimate of π.

See http://mathworld.wolfram.com/BuffonsNeedleProblem.html
http://en.wikipedia.org/wiki/Buffon's_needle
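The experiment above is easy to reproduce numerically. A minimal sketch of the short-needle (l ≤ t) case, inverting h/n = 2l/(πt) for π; the function name and defaults are illustrative, not part of the original experiment:

```python
import math
import random

def buffon_pi_estimate(n_tosses, l=1.0, t=3.0, seed=0):
    """Estimate pi by tossing a needle of length l (l <= t) onto a plane
    ruled with parallel lines a distance t apart: P(cross) = 2l/(pi*t)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tosses):
        # distance from the needle centre to the nearest line, and its angle
        x = rng.uniform(0.0, t / 2.0)
        theta = rng.uniform(0.0, math.pi / 2.0)
        if x <= (l / 2.0) * math.sin(theta):
            hits += 1
    # invert h/n = 2l/(pi t)  =>  pi ~ 2 l n / (t h)
    return 2.0 * l * n_tosses / (t * hits)

print(buffon_pi_estimate(100000))
```

With l = t/3, as in the 500-toss experiment, roughly one toss in five crosses a line, so many tosses are needed for a decent estimate.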
Monte Carlo methods

Peacetime successes through operations research

Thomson (1957) describes a Monte Carlo simulation of fluctuations of
traffic in the British telephone system.
Direct search

Derivative-free or `direct search' techniques can be useful for weakly and
strongly nonlinear problems.

Problem: search a multi-dimensional parameter space to find models with
satisfactory fit to data and other criteria.

As computational power has grown, global optimization algorithms have
become very popular.
Direct search methods

Uniform random/nested search

Simulated Annealing
(Thermodynamic analogy)

Genetic/evolutionary algorithms
(Biological analogy)

Neighbourhood algorithm

Uniform random search

Uniform random search means uniform in volume.

For M unknowns we have an M-dimensional parameter space. The volume of a
cube with side L is V = Lᴹ.

Data fit does not guide the search, although we may wish to nest…

The curse of dimensionality always gets you in the end!

What does random mean?
The curse of dimensionality

Where is all the volume in a hyper-cube?

Consider a hyper-cube of side L and dimension M, with volume V = Lᴹ, and a
centred interior box of side a. How many samples are required to get one in
the interior box? The interior box contains a fraction (a/L)ᴹ of the volume:

M     a = L/2      a = 0.9L
1     50%          90%
2     25%          81%
5     3%           59%
10    0.1%         35%
20    ~10⁻⁴ %      12%
50    ~10⁻¹³ %     0.5%

The proportion of volume in the outside shell always dominates over the
interior; all volume tends to be in the exterior shell as M increases.
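The table follows directly from V = Lᴹ; a one-line check (the function name is an illustrative choice):

```python
def interior_fraction(a_over_L, M):
    """Fraction of an M-dimensional hyper-cube's volume lying inside a
    centred interior box of side a = (a/L) * L, i.e. (a/L)**M."""
    return a_over_L ** M

# Reproduce the table: interior boxes of side L/2 and 0.9 L
for M in (1, 2, 5, 10, 20, 50):
    print(f"M = {M:2d}: {interior_fraction(0.5, M):.2e}  "
          f"{interior_fraction(0.9, M):.2e}")
```

Even an interior box covering 90% of each axis holds almost none of the volume once M is large, which is why uniform sampling cannot cover high-dimensional spaces.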
Example: Uniform random search

Uniform random search in seismology

Press (1968)

Nested uniform search

2-D global optimization: maximizing beam power as a function of slowness
(S North and S East, in s/km).

A simple but effective approach.

Issues are: discretization level, number of samples, and the curse of
dimensionality.
Global optimization: Simulated Annealing

A natural thermodynamic optimization process

Annealing is the process of heating a solid until thermal stresses are
released, then cooling it very slowly. The final state of the crystal
depends on how fast the temperature is reduced. At each temperature it has
a crystalline potential energy, which is lowest only for a perfect crystal.

Fast cooling quenches the crystal into a local minimum in potential energy.

Slow cooling produces a global minimum in potential energy.
Multiple minima and multiple maxima

φ(m) = ½ (d − g(m))ᵀ C_D⁻¹ (d − g(m)) + (μ/2) (m − m_o)ᵀ C_M⁻¹ (m − m_o)

A multi-peaked likelihood vs. a single-peaked likelihood.

Sampling according to a PDF means generating samples with density equal to
its distribution.
Global optimization: Simulated Annealing

The Gibbs-Boltzmann distribution σ(m, T) describes the probability of a
statistical system with state m having energy φ(m) at temperature T:

σ(m, T) = e^(−φ(m)/T) = (e^(−φ(m)))^(1/T)

σ(m, T) is a probability density function for m at temperature T. In
Simulated Annealing we associate the energy φ(m) with the negative log
likelihood, or the objective function of the inverse problem to be
minimized, e.g.

φ(m) = ½ (d − g(m))ᵀ C_D⁻¹ (d − g(m)) + (μ/2) (m − m_o)ᵀ C_M⁻¹ (m − m_o)

The minimum of φ(m) corresponds to the maximum of the PDF σ(m, T).

Recall the least squares case: a quadratic misfit function and a Gaussian
likelihood function.
Global optimization: Simulated Annealing

The general structure of the algorithm is:

1. Set T large (≈ 1000).
2. Generate samples m probabilistically, with a density that follows the
   Gibbs-Boltzmann distribution σ(m, T) = e^(−φ(m)/T) for this T and
   energy φ(m).
3. Cool the system by reducing T slowly, using an annealing schedule,
   e.g. T ← αT, then return to step 2.

If the system is cooled too quickly we get a local minimum. If it is cooled
too slowly we waste many energy (forward) evaluations. The optimum
annealing schedule will depend on the complexity of the energy function φ(m).

But what role does T play?
Global optimization: Simulated Annealing

Global optimization using a heat bath: what role does T play?

σ(m, T) = e^(−φ(m)/T)

[Figure: the likelihood σ(m, 1), the Gibbs-Boltzmann PDF σ(m, T) at
temperature T, and the data misfit φ(m), shown for T = 1000, 100, 10, 1,
0.1 and 0.01. As T decreases, the PDF concentrates around the minima of
φ(m).]

Example: Gibbs-Boltzmann distribution

[Figure: φ(m), φ(m)/T, and the resulting σ(m, T) = e^(−φ(m)/T).]
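The effect of temperature is easy to see on a discrete set of models. A small sketch (the function name and the two misfit values are illustrative) normalising e^(−φ/T) over two models:

```python
import math

def gibbs_boltzmann(phi_values, T):
    """Normalised Gibbs-Boltzmann probabilities sigma(m, T) ~ exp(-phi/T)
    over a discrete set of models with misfits phi_values."""
    weights = [math.exp(-phi / T) for phi in phi_values]
    total = sum(weights)
    return [w / total for w in weights]

# Two models: a global minimum (phi = 1) and a local one (phi = 2)
for T in (1000.0, 10.0, 1.0, 0.1):
    p = gibbs_boltzmann([1.0, 2.0], T)
    print(f"T = {T}: p(global minimum) = {p[0]:.3f}")
# At high T the distribution is nearly uniform; as T falls, the
# probability mass concentrates on the global minimum.
```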
Global optimization: Simulated Annealing

The general structure of the algorithm is:

1. Set T large (≈ 1000).
2. Generate samples m probabilistically, with a density that follows the
   Gibbs-Boltzmann distribution σ(m, T) = e^(−φ(m)/T) for this T and
   energy φ(m).
3. Cool the system by reducing T slowly, using an annealing schedule,
   e.g. T ← αT, then return to step 2.

If the system is cooled too quickly we get a local minimum. If it is cooled
too slowly we waste many energy (forward) evaluations. The optimum
annealing schedule will depend on the complexity of the energy function φ(m).

But how to implement step 2?
Global optimization: Simulated Annealing

Implementing simulated annealing with the Metropolis algorithm:

Set a high temperature and define an annealing schedule, e.g. T_{i+1} = α_i T_i.

Define a starting model, m_cur ← m_o.

Step 1: Propose a new model m_new = m_cur + δm, where δm is some local
perturbation of the old model, e.g. a Gaussian of fixed width (the
"moveclass").

Compare the energy of the new model to the old:

Δφ = φ(m_new) − φ(m_cur)

Accept the new model if the fit improves: if Δφ < 0, then m_cur = m_new.
If the fit is worse, accept the new model with probability p = e^(−Δφ/T).

If not enough samples have been drawn for T_i, go to step 1.

Stop if converged; otherwise update T_{i+1} = α_i T_i and go to step 1.

(Everything in green is a choice.)
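The recipe above can be sketched in a few lines. This is a minimal 1-D illustration, not the definitive method: the schedule, moveclass width, starting temperature and stopping rule are all illustrative choices (everything that is "a choice" on the slide):

```python
import math
import random

def simulated_annealing(phi, m0, T0=1000.0, alpha=0.95, steps_per_T=50,
                        T_min=1e-3, step_size=0.5, seed=0):
    """Metropolis-based simulated annealing for a 1-D model. All control
    parameters here are illustrative choices."""
    rng = random.Random(seed)
    m_cur, phi_cur = m0, phi(m0)
    m_best, phi_best = m_cur, phi_cur
    T = T0
    while T > T_min:
        for _ in range(steps_per_T):
            # Step 1: propose m_new = m_cur + delta_m (Gaussian moveclass)
            m_new = m_cur + rng.gauss(0.0, step_size)
            d_phi = phi(m_new) - phi_cur
            # accept if the fit improves, else with probability exp(-d_phi/T)
            if d_phi < 0 or rng.random() < math.exp(-d_phi / T):
                m_cur, phi_cur = m_new, phi_cur + d_phi
                if phi_cur < phi_best:
                    m_best, phi_best = m_cur, phi_cur
        T *= alpha  # annealing schedule: T_{i+1} = alpha * T_i
    return m_best, phi_best

# A multi-modal 1-D misfit whose global minimum lies near m = 2.2
phi = lambda m: (m - 2.0) ** 2 + 2.0 * math.sin(5.0 * m) + 2.0
m_best, f_best = simulated_annealing(phi, m0=-5.0)
print(m_best, f_best)
```

Note how slowly the schedule decays: roughly 270 temperature levels between T = 1000 and T = 10⁻³, mirroring the "cool slowly" requirement.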


Example: Simulated Annealing

The Metropolis algorithm produces a random walk in model space. Initially T
is high and the walk is random because all moves are accepted. As T
decreases, the walk is slowly attracted to regions of high probability and
toward the maximum of σ(m, T).

[Figure: Gibbs-Boltzmann PDF; probability of an accepted model against the
number of accepted models.]
Simulated Annealing example: TSP

In a famous paper, Kirkpatrick et al. (1983) showed how simulated annealing
could be used to solve a difficult combinatorial optimization problem known
as the travelling salesman problem (TSP).

Find the shortest closed path (tour) around the 37 cities.

This is known to be an NP-complete combinatorial optimization problem.
Recap: Simulated Annealing

Simulated Annealing is a computational method for global optimization that
uses an analogy to a physical (thermodynamic) optimization process.

There are many variants of the method, each based on different choices.
(Optimal annealing schedules are known in some cases.)

It is relatively simple to implement, for both continuous and discrete
optimization problems, and has been used in geophysics since the 1990s.

It can be inefficient if component choices are not made well.

Although sequential in nature (a random walk), it can be run in parallel
using independent random walks: ensemble annealing.
Global optimization: Genetic algorithms

Genetic algorithms follow an analogy with another real-world optimization
technique: evolution, or natural selection.

Genetic algorithms attempt to mimic the process by which systems adapt to
survive in a competitive environment. Their origin is attributed to the
work of Holland (1975), originally devised to study mathematical models of
adaptation in artificial systems.

Genetic algorithms are closely related to evolutionary computation, an area
independently devised by Fogel & Fogel (1961).

Applications are now widespread, especially as an optimization method.
First applications in geophysics were around 1991-1992.
Global optimization: Genetic algorithms

An analogy to natural selection

Human genotypes; phenotypes of Albert; phenotypes of Sharon.

Their offspring would depend on three processes:

Did Albert and Sharon select each other? → Selection
If so, what properties (phenotypes) were propagated? → Crossover
What random mutations occurred? → Mutation
Genetic algorithms: bit encoding

In a GA an entire population of individuals is treated as an ensemble of
chromosomes. The idea is to `evolve' the population toward optimal models.

Continuous variables (x1, x2, ...) can be represented with a bit string.

In the two-parameter function opposite, the model consists of a pair of
variables (x, y). A fundamental (and absolutely necessary) step is the
coding of an individual into a bit string (bit coding).

Generate an initial population randomly.
Genetic algorithms: bit encoding

A natural example of a string representation for a geophysical inverse
problem is the treasure hunt:

01010000.....010 10010000....0100 00000....0000

Here the binary coding would make literal sense: 0 would mean sand and 1
would represent gold. What if we want to describe a function taking on
arbitrary values?

Example: we want to invert for a depth-dependent velocity model, described
by layer thicknesses d and velocities v. Then a model vector m would look
like:

m = (d1, v1, d2, v2, d3, v3, ..., dn, vn)

…and could simply be described by a long bit string. It is your choice how
many bits you use for the possible range of values of each parameter.
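The bit coding step above can be sketched directly. A minimal encode/decode pair mapping a bounded real parameter onto a fixed-width bit string; the 8-bit width, the layer model values and the bounds are illustrative assumptions:

```python
def encode(value, lo, hi, nbits):
    """Map a real value in [lo, hi] onto an nbits-wide bit string."""
    levels = (1 << nbits) - 1
    k = round((value - lo) / (hi - lo) * levels)
    return format(k, f"0{nbits}b")

def decode(bits, lo, hi):
    """Inverse of encode, up to quantisation error."""
    levels = (1 << len(bits)) - 1
    return lo + int(bits, 2) / levels * (hi - lo)

# A two-layer model m = (d1, v1, d2, v2) as one long chromosome
model = [10.0, 3.5, 20.0, 4.5]
bounds = [(0, 50), (2, 8), (0, 50), (2, 8)]
chromosome = "".join(encode(v, lo, hi, 8) for v, (lo, hi) in zip(model, bounds))
print(chromosome)   # a 32-bit string
decoded = [decode(chromosome[i*8:(i+1)*8], *bounds[i]) for i in range(4)]
print(decoded)      # close to the original model, up to quantisation
```

More bits per parameter give finer resolution at the cost of a longer chromosome, which is exactly the "your choice" trade-off the slide mentions.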
Genetic algorithms: Selection

Fitness: calculate the fit of each model to the data → the survival
fitness of the model.

Remove the half of the models with the worst fits.

Model rank determines survival.
Genetic algorithms: Crossover

Randomly pair the remaining models, with the probability of pairing
depending on fitness. This requires a mapping between fitness and
probability.

Randomly choose a common point on the strings to cut, and swap the parts
over.

The total population then consists of the N/2 better models plus N/2
offspring.
Genetic algorithms: Mutation

Each of the offspring undergoes a random mutation with a certain
probability, P_m.

This adds a random component back into the population and encourages
diversity.
Genetic algorithms: Iterations

The three steps (Selection, Crossover, Mutation) constitute one iteration.

At each iteration a complete ensemble of models is updated at once, with no
special treatment given to any one. Compare this to Simulated Annealing,
which repeatedly perturbs a single model.

As iterations continue, the average fitness of the population should
increase.

Iterations stop when the average fitness of the population, or the fitness
of the best individual, has reached an acceptable level.

Overall the aim is for the three processes to combine to produce a
self-adaptive search process.
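The selection-crossover-mutation loop can be sketched end to end. This is one possible minimal GA under illustrative choices (truncation selection of the better half, one-point crossover, bit-flip mutation; every rate and size here is an assumption, not a recommendation):

```python
import random

def ga_minimise(misfit, bounds, pop_size=40, nbits=16, n_gen=60,
                p_mutate=0.01, seed=0):
    """Minimal GA sketch: rank-based survival of the better half, one-point
    crossover, bit-flip mutation. All rates are illustrative choices."""
    rng = random.Random(seed)
    nvars = len(bounds)
    L = nvars * nbits

    def decode(bits):
        out = []
        for i, (lo, hi) in enumerate(bounds):
            k = int(bits[i*nbits:(i+1)*nbits], 2)
            out.append(lo + k / ((1 << nbits) - 1) * (hi - lo))
        return out

    pop = ["".join(rng.choice("01") for _ in range(L)) for _ in range(pop_size)]
    for _ in range(n_gen):
        # Selection: keep the better-fitting half (rank determines survival)
        pop.sort(key=lambda b: misfit(decode(b)))
        parents = pop[:pop_size // 2]
        # Crossover: pair parents, cut at a random common point and swap
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, L)
            children.append(a[:cut] + b[cut:])
        # Mutation: flip each child bit with probability p_mutate
        children = ["".join(c if rng.random() > p_mutate else "10"[int(c)]
                            for c in child) for child in children]
        pop = parents + children
    best = min(pop, key=lambda b: misfit(decode(b)))
    return decode(best)

# Minimise a simple two-parameter misfit with its minimum at (1, -2)
best = ga_minimise(lambda m: (m[0]-1)**2 + (m[1]+2)**2, [(-5, 5), (-5, 5)])
print(best)
```

Because the surviving parents are carried over unchanged, the best misfit in the population can never get worse from one generation to the next, so the average fitness should rise as the slides describe.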
Example: Genetic algorithms

In a GA the average fitness should increase over time.

[Figure: a GA operating on the peaks function above.]
Genetic algorithms: Ranking and self-tuning

Ad hoc choices:

Often in a GA the selective pressure for fitter individuals takes the form
of a functional mapping between data misfit and probability. Any chosen
mapping can be arbitrary and may not work for all ranges of data fit.

Choices made for the level of mutation and the crossover probability can
also be problem dependent.
Genetic algorithms: features

An ensemble-based direct search method for optimization.

Specific choices that need to be made:
How to encode the model parameters into a suitable (bit) string.
How to map data misfit into probability of survival (selection step).
How to choose the frequency/probability of the crossover and mutation
operators.

There are many variants of the method, each making different choices.

An active area of research with many applications across a number of
fields: circuit board design, drug design, ...

GAs are related to the wider field of evolutionary computation →
complex systems → emergent behaviour.
Neighbourhood Algorithm

The Neighbourhood Algorithm (NA) was designed as an ensemble-based
procedure to search multi-dimensional parameter spaces (Sambridge, 1999).
The intent was to have as few tuning parameters as needed to achieve a
self-adaptive search.

NA is motivated by simple geometrical concepts. It is based on a partition
of the model space into neighbourhoods about any set of samples.

How do we define neighbourhoods? By growing Voronoi cells using Euclidean
distance.
Real-world Voronoi cells

Voronoi cells appear in many parts of the natural world, since they arise
out of isotropic processes.

From Okabe et al. (1995) after Cox and Agnew (1976)
Parameter search: Neighbourhood algorithm

The neighbourhood algorithm uses uniform random walks that adapt to the
information contained in previous sampling.

NA explained using an illustration with three blind mice: see how they
run... (or walk randomly).
Parameter search: Neighbourhood algorithm

A conceptually simple method for adaptive sampling of multi-dimensional
parameter spaces (here n = 3, m = 1):

1. Take n random walks to arrive at n new positions.
2. Define Voronoi (nearest neighbour) cells between the n points.
3. Resample the best m ranked neighbourhoods, as determined by the data
   misfit φ(m).
Parameter search: Neighbourhood algorithm

Repeat the process iteratively each time updating the Voronoi cells and
generating n samples from a random walk inside m neighbourhoods.

Parameter search: Neighbourhood algorithm

How is it implemented? By a uniform random walk inside the i-th cell,
starting at model m_i, with target density

σ(m) = 1/V_i if m is inside cell i, and 0 otherwise.

First we sample from the conditional PDF along the x1 axis and update the
model along x1. We then repeat for the x2 axis and update the model along
x2. This process is repeated for each axis in rotation, and the random walk
maps out uniform samples inside the irregular polygon.

We must solve for the intersection points between each 1-D axis and the
edges of the Voronoi cell.
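The key idea, new samples drawn uniformly from the Voronoi cells of the current best models, can be sketched without the axis-intersection machinery. In this minimal version, rejection sampling stands in for the axis-by-axis random walk described above (all names, sizes and the 2-D test misfit are illustrative assumptions):

```python
import random

def na_iteration(samples, misfit, n_new=9, m_best=3,
                 bounds=((0.0, 1.0), (0.0, 1.0)), seed=0):
    """One NA step (sketch): generate n_new samples uniformly inside the
    Voronoi cells of the m_best best-fitting models. Rejection sampling
    replaces the axis-wise walk of the real algorithm."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best = sorted(samples, key=misfit)[:m_best]   # rank by data misfit
    new = []
    while len(new) < n_new:
        cand = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        # cand lies in the Voronoi cell of its nearest existing sample
        nearest = min(samples, key=lambda s: dist2(cand, s))
        if nearest in best:
            new.append(cand)
    return samples + new

# Drive a few iterations on a simple 2-D misfit with minimum at (0.5, 0.5)
misfit = lambda p: (p[0] - 0.5) ** 2 + (p[1] - 0.5) ** 2
rng = random.Random(1)
pts = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(9)]
for i in range(5):
    pts = na_iteration(pts, misfit, seed=i)
print(len(pts), min(misfit(p) for p in pts))
```

As samples accumulate, the best cells shrink around the low-misfit region, so later samples concentrate there: the Voronoi partition itself adapts the search.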
Parameter search: Neighbourhood algorithm

This results in a self-adapting search process. How does the convergence
behaviour depend on n and m (with n ≥ m)?

As n and m increase, the search becomes more explorative and samples less
quickly.

As n and m decrease, the search becomes more concentrated and converges
more quickly, but is more likely to end in a local minimum.

Repeat the process iteratively, each time generating n samples (uniformly)
inside m previously generated neighbourhoods.
Example: Neighbourhood algorithm

Infrasound misfit function: samples from the Neighbourhood algorithm.

From Kennett et al. (2003)
Examples: Neighbourhood algorithm

Receiver function waveform fitting for 1-D seismic earth models

Examples: direct search

A comparison of techniques

Neighbourhood algorithm

Self-adaptive search controlled by two variables.

An ensemble-based approach: at each stage, random walks are guided by all
previous samples through the Voronoi partition.

The search is driven by the ranking of models rather than by a particular
transformation of the objective function.

Its behaviour as a function of the two variables is predictable.
What to do with all the samples generated by a global search algorithm?

The appraisal problem

Mapping out the region of acceptable fit

Global search: exploration vs exploitation

Global search: Parallelisation

Ensemble-based approaches are ideally suited to parallel computation.
