Data Analysis Earth Environmental Sci 2017
Julien Emile-Geay
Department of Earth Sciences
[email protected]
ERTH425L
Copyright © 2017 Julien Emile-Geay
Typeset in LaTeX with help from MacTeX and a tweaked version of the Tufte LaTeX book class. Many figures
generated via TikZ. Python highlighting by O. Verdier. All software open source.
Please cite as: Emile-Geay, J., 2017: Data Analysis in the Earth & Environmental Sciences, 265pp, Third edition,
http://dx.doi.org/10.6084/m9.figshare.1014336.
Chapter 1 Preamble
I What is data analysis?
II What should you expect from this book?
III Structure
IV Acknowledgements
Appendices
Bibliography
Index
Chapter 1
PREAMBLE
Now that we have defined the task that will occupy us for the rest of this book, let us take a step back. What do we mean by data? Data is a word about which most scientists think they agree, but turn out to disagree a fair bit. Data are numerical representations of quantities (variables)¹ which, we would like to believe, tell us about a system under investigation. In today's world, data refer both to measurements of …

¹ a single data point, for those who like Latin, is called a datum
…understand and use these techniques, perhaps even teach them, one day.
As such, a modicum of mathematical education is required, amounting
roughly to Calculus III, introductory linear algebra and trigonometry
(Appendices A, B and C). Probability theory is our main foundation,
and none of it can be explained without the language of calculus and
linear algebra. If you never studied either of those, run for your life. If
you did study them, but never excelled at them, we hope that the desire
to understand the Earth will drive you to learn some new math! Ulti-
mately, however, if you want to use those techniques correctly, you will
have to understand where they come from, and to do that you will have
to delve into some mathematics – a black box approach will not suffice.
How much you want to do this is your choice, but it is hard to know too
much about this topic.
Each Earth Science department worth its salt has a class like this
one (at USC, ERTH425L); they each make different choices about what
topics to emphasize. In the process, they all achieve different compro-
mises between exposing students to the most common techniques used
in their field and giving them enough background to understand these
techniques at a deep level. It is our belief that knowledge is most ef-
ficiently acquired through practice; as such, a key component of this
class is a set of weekly laboratory practicums that put into use the no-
tions presented in the lectures. Equally important, and perhaps more
relevant for those of you who are becoming researchers (whether grad-
uate or undergraduate), a final project will apply the concepts taught here
on your own dataset. If you play your cards right, this will form the foun-
dation of a thesis chapter or paper, which will make you – and your
advisor – very happy indeed. Do we have your attention now?
III Structure
The class is articulated around three main themes:
Because uncertainties will turn out to matter at every step of the way, we need to define a common language to deal with them. Such is the object of statistics, rooted in probability theory; both are tackled in the first third of these notes. Those fields of knowledge rely heavily on results from calculus and linear algebra, which are briefly reviewed in Appendices A & B. Next we focus on datasets that follow a sequential order; we call those timeseries, though the independent variable need not be time.
IV Acknowledgements
Thorsten W. Becker, who put this class on the books at USC, was in-
strumental for much of the labs and least squares/inverse theory chap-
ters. Appendix A is a shameless (but authorized) rip-off of a similar
appendix in his and B. Kaus’ excellent e-book Numerical Modeling of
Earth Systems.
Dominique Guillot is to be commended for the excellent review of
linear algebra (Appendix B), and much of chapters 3 and 5.
I thank Elisa Ferreira, Laura Gelsomino and Antonio Mariano for their help with LaTeX typesetting, and all the ERTH425L students who have pointed out typographical errors over the years³. Despite my best efforts, the probability that some such errors remain in the following pages is close to unity, so I am grateful for any comment that can help us fix that.

³ non-exhaustive list: Jianghao Wang, Kirstin Washington, Alexander Lusk, Kevin Milner, Billy Eymold, Joseph Ko
Part I
PROBABILITY THEORY
Dennis Lindley
One seldom has all the information that one wishes for: often the
measurements are too few, too imprecise (or both) to allow us to con-
clude much with certainty. Yet this does not mean that we know nothing
– we may have a lot of information about the Earth system, and what we
need is a way of reasoning quantitatively about it, within these uncer-
tainties. This is the domain of probability theory, which encodes those
rules of reasoning in mathematical form.
he walked past the store, a passing truck threw a stone through the
window, and he was only protecting his own property.
Now, while the policeman’s reasoning process was not a logical de-
duction from perfect knowledge of the situation, we will grant that it
had a certain degree of validity. The evidence above does not make the
man’s dishonesty certain, but it does make it extremely plausible. This
is an example of a kind of reasoning in which all of us are proficient,
without necessarily having learned the mathematical theory behind it.
Every day we make decisions based on imperfect information (e.g. it
might rain, should I take my umbrella?), and the fact that there is some
uncertainty in no way paralyzes our actions.
We therefore contrast this plausible reasoning with the deductive reasoning whose formalism is generally attributed to Aristotle (Fig. 2.1). The latter is usually expressed as the repeated application of the strong syllogism:

If A is true, then B is true;
A is true; (S1)
therefore, B is true. ∴

Figure 2.1: A bust of Aristotle, whose real face is quite uncertain.
In this case, the evidence does not prove that B is false, only that one
of the possible reasons why it might be true has been eliminated, so we
feel less confident about it. Scientific reasoning almost always looks like
S3 and S4.
Now, our policeman’s reasoning was not even of the above types. It
is a still weaker form of syllogism:
Common sense If we had old information G which gets updated to G′ in such a way that E becomes more plausible (P(E|G′) > P(E|G)) but the plausibility of F is unchanged, then this can only increase (not decrease) the plausibility that both E and F are true: P(E ∩ F|G′) > P(E ∩ F|G). It also reduces the plausibility that E is false: P(Ē|G′) < P(Ē|G). That is, the rules of probability calculus must conform to intuitive human reasoning (common sense).
Consistency If a conclusion can be reasoned out in more than one way, then ev-
ery possible way must lead to the same probability. Also, if in two
problems our state of knowledge is identical, then we must assign
identical probabilities in both. Finally, we must always consider ALL
the information relevant to a question, without arbitrarily deciding to
ignore some of it. Our robot is therefore completely non-ideological.
Figure 2.4: American physicist Richard T. Cox, ca. 1939.
Cox’s theorem states essentially that these postulates are jointly suf-
ficient to design a robot that can reason scientifically about the world.
This is not, however, how probability theory was initially built, so a his-
torical digression is warranted.
• 1933: End of a long struggle to find a definition, which culminated in Kolmogorov's axioms. The difficulty had to do with finding a mathematical definition that would conform to other (e.g. logical, intuitive) considerations.

Figure 2.6: Pierre de Fermat (Source: Wikimedia Commons).
II Notion of probability
It turns out that the notion of probability is difficult enough a concept
that statisticians, probability theorists, and philosophers still argue about
it. Here we present three views of the topic: the Frequentist view, the Bayesian view, and the axiomatic view.
Frequentist Interpretation
Let E be an event in a random experiment. An experiment might be
“Rolling a die”; an event (outcome) of this experiment might be “Ob-
taining the number 6”. Let us repeat this experiment N times, each time
updating a counter xi :
x_i = 1 if E occurred, 0 otherwise. (2.4)

The frequentist interpretation holds that, as N gets large, the frequency of occurrence of E, f_N(E), converges to a constant, which we call its probability P(E):

lim_{N→∞} (x_1 + x_2 + ... + x_N)/N = P(E) (Relative Frequency) (2.5)

This is the so-called Law of Large Numbers.

[Figure: relative frequency f_N(E) of rolling a 6, converging to 1/6 as N grows.]
We represent the two dice as (i, j), where i = result of throwing the 1st die and j = result of throwing the 2nd die. The set of possible outcomes S = {(i, j), 1 ≤ i, j ≤ 6} has

|S| = 36, where |·| means "number of elements".

What is the probability of the event E = {sum > 8}? The possibilities are:

E = {(3, 6), (4, 5), (4, 6), (5, 4), (5, 5), (5, 6), (6, 3), (6, 4), (6, 5), (6, 6)} ⟹ 10 possibilities.

P(E) = |E|/|S| = 10/36 = 5/18 ≈ 0.278.
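The frequentist recipe is easy to try on a computer. Below is a minimal simulation sketch (using numpy; the seed and sample size are arbitrary choices, not from the text) showing the relative frequency f_N(E) settling near 5/18, as Eq. (2.5) promises.

import numpy as np

# Estimate P(sum of two fair dice > 8) by its relative frequency over N trials.
rng = np.random.default_rng(42)
N = 100_000
dice = rng.integers(1, 7, size=(N, 2))   # N throws of two dice, faces 1..6
f_N = np.mean(dice.sum(axis=1) > 8)      # relative frequency of the event E
print(f"f_N(E) = {f_N:.4f}  (exact: {5/18:.4f})")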
D.G. Martinson, “one cannot re-run the experiment called the Earth”. Does
this mean that we can form no judgement about the plausibility of an
event, like A = “a meteorite killed the dinosaurs at the end of the Creta-
ceous” (which, by definition, only happened once)? If so, what would
be the use of probabilities in Earth Science?
In the Bayesian interpretation, probabilities are rationally coherent de-
grees of belief, or a degree of belief in a logical proposition given a body of well-
specified information. Put another way, it is a measure of a state of knowl-
edge about a problem.
General Idea:
→ Start from prior knowledge.
→ Collect observations about the system (perform experiments, go out in the field).

Figure 2.9: Rev. Thomas Bayes, 1701-1761, Presbyterian minister.

Example: rolling a die, what is p_i = P(X = i)? Let's assume we have no reason to think that the die is loaded. A priori considerations of symmetry would lead us to set p_i = 1/6, i ∈ [1 : 6], but we should perhaps not be so definite (after all, if we decide from the outset that the die is fair, how could we possibly revise our judgement if the evidence suggests otherwise?). So instead we set a prior probability distribution Π(p_i) for each p_i, concentrated around 1/6, illustrated in Fig. 3 (prior distribution of p_i).

Now, after rolling the die 15 times, we observe the sequence

X = {1, 3, 2, 1, 1, 5, 4, 4, 3, 5, 6, 6, 3, 1, 5} (2.7)

after which we update our estimate of p_i.
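The text does not specify the prior Π(p_i); one standard choice (an assumption here, not necessarily the book's) is a symmetric Dirichlet prior, under which the update has a simple closed form:

import numpy as np

# Dirichlet-multinomial sketch: posterior mean of p_i after the 15 rolls,
# assuming (hypothetically) a symmetric Dirichlet(alpha0) prior on (p_1,...,p_6).
alpha0 = 10.0                                   # prior concentration per face
rolls = [1, 3, 2, 1, 1, 5, 4, 4, 3, 5, 6, 6, 3, 1, 5]
counts = np.bincount(rolls, minlength=7)[1:]    # occurrences of faces 1..6
post_mean = (alpha0 + counts) / (6 * alpha0 + len(rolls))
print(post_mean)        # all values near 1/6, nudged toward the observed counts

The larger alpha0, the more evidence it takes to move the estimate away from 1/6.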
Advantage One can now compute the probability of any event, regardless of
whether it can be repeated ad nauseam.
Disadvantage We now rely on a subjective prior, and different priors may lead to
different results. On the other hand, the prior can be justified by the-
ory (e.g. considerations of symmetry, physical laws) or other exper-
iments. The role of prior information becomes less important as the
number of observations increases, so in the limit n → ∞, the frequen-
tist and Bayesian interpretations often lead to the same results.
Disadvantage In many cases Bayesian analysis is more cumbersome and more com-
putationally demanding.
Sample space aka "Universe" or "Compound event". Denoted by Ω, this defines all
the possible outcomes of the experiment. For example, if the experi-
ment is tossing a coin, Ω = {head, tail}. If tossing a single six-sided
die, the sample space is Ω = {1, 2, 3, 4, 5, 6}. For some kinds of exper-
iments, there may be two or more plausible sample spaces available.
For example, Temperature = any number from −60°C to 60°C, Precip from 0 to 10 m/day.
E ∪ Ω = Ω, (2.8a)
E ∩ Ω = E. (2.8b)
Operations We have seen the basic operations of complement, union and inter-
section in Sect. I. To illustrate, let us cast a die and denote the events
A = {“Outcome is an even number ”} and B = {“ Outcome > 3 ”}.
• ∀i, ω_i ≠ ∅ (non-emptiness)
• ∀(i, j), i ≠ j: ω_i ∩ ω_j = ∅ (mutual exclusivity)
• ⋃_{i=1}^{n} ω_i = Ω (complete coverage)
Probability Axioms

Positivity ∀E, P(E) ∈ ℝ and P(E) ≥ 0

Unitarity P(Ω) = 1

Additivity if E_1, E_2, ... are mutually exclusive propositions, then P(E_1 ∪ E_2 ∪ ...) = ∑_{i=1}^{∞} P(E_i).³

³ As a special case, we have the much simpler finite additivity P(E_1 ∪ E_2) = P(E_1) + P(E_2). We could have started from that point, but it turns out that some strange things may happen when trying to generalize this finite additivity to countable additivity (Borel-Kolmogorov paradox), so this complicated definition is warranted. Jaynes (2004) contends, however, that if we abide by Cox's principles and apply them carefully, we never run into those kinds of embarrassing paradoxes.

It turns out that these axioms produce a probability that has exactly the same properties as the logical desiderata of Pólya and Cox. So why the fuss? First, you should feel an immense relief: in designing a thinking robot out of logical principles, we wind up with exactly the same …
A ⊆ B ⟹ P(A) ≤ P(B) (2.9a)
∀A, P(A) ∈ [0, 1] (2.9b)
P(Ā) = 1 − P(A) (2.9c)
P(∅) = 0 (2.9d)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (sum rule) (2.9e)
Conditional Probabilities
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A) (2.13)

Two events A and B are said to be independent if and only if:

P(A|B) = P(A) and P(B|A) = P(B) (2.14)

Equivalently,

P(A ∩ B) = P(A) · P(B) (2.15)

This means that knowing B makes no difference to our state of knowledge about A, and vice versa. A more advanced concept is that of conditional independence. Two events A and B are said to be conditionally independent given C if and only if:

P(A ∩ B | C) = P(A|C) · P(B|C) (2.16)
Example
We roll two dice:
A : sum > 8 , B : X1 = 6 .
We want to compute:
⟹ P(X_1 = 2 ∩ X_2 = 3) = P(X_1 = 2) · P(X_2 = 3) = (1/6) × (1/6) = 1/36.
P(E ∩ ω_i) = P(E|ω_i) · P(ω_i)

But the events E ∩ ω_i are mutually incompatible:

(E ∩ ω_i) ∩ (E ∩ ω_j) = ∅, (i ≠ j) (2.17)

⟹ P(E) = P(E ∩ (⋃_i ω_i))
        = P((E ∩ ω_1) ∪ (E ∩ ω_2) ∪ ... ∪ (E ∩ ω_n))
        = P(E ∩ ω_1) + P(E ∩ ω_2) + ... + P(E ∩ ω_n)
        = P(E|ω_1) · P(ω_1) + ... + P(E|ω_n) · P(ω_n)
        = ∑_i P(E|ω_i) · P(ω_i)

Hence the law of total probability:

P(E) = ∑_i P(E|ω_i) · P(ω_i)
The beauty of this formula is that we may choose any partition we'd like, so we can pick one that
makes the conditional probabilities easy to evaluate. Then it is just a
matter of adding and multiplying.
Note that the above formula bears more than a passing resemblance
to Eq. (B.15). This is not a complete coincidence: you may think of a
partition as a sort of “basis” in a probability space. Thus, if you know
the probabilities of each partition and can express the probabilities of
each event in terms of the partition, you are done.
Example

5 jars containing white (w) and black (b) balls:

Jar I: 2w, 1b; Jar II: 2w, 1b; Jar III: 10b; Jar IV: 3w, 1b; Jar V: 3w, 1b.

Figure 2.14: Illustrating the experiment of picking a random ball from 5 jars containing white and black balls.

1. Pick a jar at random (P = 1/5).
2. Pick a ball at random from that jar.

What is the probability of getting a black ball? To answer this, we use the law of total probability. Let E_i = {Choosing the i-th jar} (i = 1, 2, ..., 5), where {E_1, ..., E_5} is a partition. Now, P(E_i) = 1/5 by hypothesis, and P(B|E_i) = 1/3, 1/3, 1, 1/4, 1/4 for i = 1, ..., 5, so:

⟹ P(B) = ∑_{i=1}^{5} P(B|E_i) · P(E_i) = (1/5)(1/3 + 1/3 + 1 + 1/4 + 1/4) = 13/30.
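As a quick numerical check, here is a minimal sketch of the same computation:

import numpy as np

# Law of total probability for the 5-jar example.
P_jar = np.full(5, 1/5)                                # P(E_i)
P_black_given_jar = np.array([1/3, 1/3, 1, 1/4, 1/4])  # P(B|E_i) from the jar contents
print(np.sum(P_black_given_jar * P_jar))               # 0.4333... = 13/30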
Bayes’ Rule
Elementary Case
This is a somewhat obvious application of the definition of conditional
probabilities:
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A),

⟹ P(A|B) = P(B|A) · P(A) / P(B) (2.19)

(expresses P(A|B) as a function of P(B|A)). In this formula, P(B|A) is called the likelihood, P(A) the prior, and P(B) the normalizing constant.
For a partition {E_i}, the general case follows from:

P(E_i|B) = P(E_i ∩ B) / P(B),
P(E_i ∩ B) = P(B|E_i) · P(E_i),
P(B) = ∑_j P(B|E_j) · P(E_j) (law of total probability),

⟹ P(E_i|B) = P(E_i) · P(B|E_i) / ∑_j P(B|E_j) · P(E_j) (Bayes' Rule) (2.20)
Now, the test is not perfect: it could come back negative when the
person is in fact sick (“false negative”), and come back positive while
That is, as the rate of false positives drops, we get more and more confident about the diagnosis. Worryingly enough, study after study has found that only about 15% of medical doctors (the people whom we pay to interpret such test results) get this right⁵. Most have no clue how to conduct such a simple calculation, and end up "transposing the conditional" (giving P(T|S) instead of P(S|T)). As the example shows, those will in general be very different, except in special cases (homework: what values of f_n and f_p are needed for the equality P(S|T) = P(T|S) to hold? The answer should depend on their ratio).

⁵ For this account and for a simple introduction to Bayes' theorem, see http://yudkowsky.net/rational/bayes
A remarkable feature is how counterintuitive the results may be for
what appear to be relatively small differences between f p and f n . This is
a reminder that the rules of probability calculus must be applied care-
fully and thoroughly, lest one commit some grave mistakes.
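To make the transposed-conditional trap concrete, here is a hedged sketch of the calculation; the prevalence and error rates below are illustrative assumptions, not the values of the worked example above.

prior = 0.01    # P(S): assumed prevalence of the disease
f_n = 0.05      # P(T-|S): false negative rate, so P(T+|S) = 1 - f_n
f_p = 0.02      # P(T+|not S): false positive rate

# Normalizing constant P(T+) via the law of total probability:
p_pos = (1 - f_n) * prior + f_p * (1 - prior)
p_sick_given_pos = (1 - f_n) * prior / p_pos    # Bayes' rule, Eq. (2.20)
print(p_sick_given_pos)                         # ~0.32, nowhere near P(T+|S) = 0.95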
Chapter 3
PROBABILITY DISTRIBUTIONS
“If we look at the way the universe behaves, quantum mechanics gives
us fundamental, unavoidable indeterminacy, so that alternative
histories of the universe can be assigned probability.”
Murray Gell-Mann
Since we are dealing with real-world data (which only come in discrete form, e.g. because of digitization), most observations are discrete, though continuous RVs provide an incredibly useful lens through which to view some Earth processes. Note that an RV can be real or complex, and scalar or vector-valued.
Distribution Functions
Any discrete random variable admits a probability mass function (PMF)
f defined by, ∀ xi ∈ X (Ω):
f ( xi ) = P ( X = xi ) (3.1)
which describes the “weight” of each xi in the total outcome. The PMF is
often presented as a histogram, binning outcomes together in relatively
coarse chunks.
P( X ≥ x ) = 1 − F ( x ) (3.3)
(thus named by biomedical statisticians looking at survival times of pa-
tients under a pharmaceutical treatment). An example of survivor func-
tion that is foundational to seismology is the Gutenberg-Richter law
(Fig. 3.2).
Properties:

1. f is continuous
2. ∀x ∈ ℝ, f(x) ≥ 0
3. f has unit mass: ∫_{−∞}^{+∞} f(x) dx = 1 (Appendix A, Sect. III).

Indeed, one can write F(x) = ∫_{−∞}^{x} f(u) du, from which it follows that:
• F goes from 0 to 1: lim_{x→−∞} F(x) = 0; lim_{x→+∞} F(x) = 1.

P(x_0 ≤ X ≤ x_0 + δx) = f(x_0) δx + O(δx²) (3.5)
For δx small enough, this probability is close to f(x_0)δx: this quantity is the probability of X being in a neighborhood of x_0 of length δx. Notice, however, that as δx approaches 0, this probability approaches zero as well, since f is well-behaved.

Thus we have the mind-altering result that for a continuous random variable, P(X = x_0) = 0 (that is, its PMF is zero at every point)! Does this mean that the variable can never reach this value? No: what it means is that we can never know this value with absolute certainty, so the probability of exactly observing the value x_0 is zero; the probability of observing something close to x_0 is finite, however. In other words, there will always be some uncertainty about our knowledge of X, but thankfully we can get close enough to it for practical purposes⁴.

⁴ JOKE: A mathematician and a physicist agree to a psychological experiment. The mathematician is put in a chair in a large empty room and a very tasty cake is placed on a table at the other end of the room. The psychologist explains, "You are to remain in your chair. Every five minutes, I will move your chair to a position halfway between its current location and the cake." The mathematician looks at the psychologist in disgust. "What? I'm not going to go through this. You know I'll never reach the cake!" And he gets up and storms out. The psychologist makes a note on his clipboard and ushers the physicist in. He explains the situation, and the physicist's eyes light up and he starts drooling. The psychologist is a bit confused. "Don't you realize that you'll never reach it exactly?" The physicist smiles and replies, "Of course! But I'll get close enough for all practical purposes!"

Empirical Determination of a Distribution

Imagine X = {x_1, x_2, ..., x_n} is a collection of measurements. How do we find its distribution?
Histograms
[Figure: probability mass function of daily precipitation (mm/day), shown as a histogram.]
KDE(x) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h) (3.6)
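A minimal sketch with scipy (the data are synthetic; gaussian_kde uses a Gaussian kernel K and picks the bandwidth h automatically unless told otherwise):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=500)   # synthetic, precipitation-like data
kde = gaussian_kde(sample)                      # Eq. (3.6) with a Gaussian kernel
x = np.linspace(0, 15, 300)
density = kde(x)                                # estimate of f on a grid
print(np.trapz(density, x))                     # ~1: the estimate has unit mass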
Often we are interested in the values below which lie a certain frac-
tion of the mass of a certain distribution. If you’ve ever spoken of me-
dian income or the percentile of your GRE score, then you are already
familiar with the notion. The formal way of obtaining this is through the
inverse CDF, F⁻¹, aka the quantile function⁷.

⁷ Many choices to do this in Python, from the simple np.percentile() to the comprehensive scipy.stats.mstats.mquantiles(), modeled after the R quantile function, to internal implementations in pandas and seaborn, to name just a few.

More precisely, we define the α-quantile q_α as the number such that:

P(X ≤ q_α) = α (3.8)

That is, q_α = F⁻¹(α). Unless the inverse CDF can be obtained analytically (which can prove impossible even for usual distributions), it is approximated using numerical root finding methods (cf. Appendix A, Sect. III). Going back to Fig. 3.1, how many people earn less than $225,000 a year? About 98%. Below what income do 98% of Americans fall? About $225,000. The first answer used the CDF, the second used the quantile function. The values q_α are called, not coincidentally, quantiles. Remarkable quantiles are:
the median the 50% quantile (q0.50 ), such that 50% of the mass of the
distribution is to the left of it, 50% of it on the right of it.
terciles split the distribution into 3 regions of equal mass: (0, 1/3, 2/3, 1)
quartiles split the distribution into 4 regions of equal mass (0, 25%, 50%,
75%,100%)
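In practice, a small sketch using one of the options from the margin note (np.percentile works on a 0-100 scale, so q_α is np.percentile(x, 100α)):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
q25, median, q75 = np.percentile(x, [25, 50, 75])   # quartiles and median
print(q25, median, q75)                             # ~ -0.67, 0, 0.67 for N(0,1)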
Distribution Fitting
Moments of a distribution
Expectance: the gambler returns

E(X) = ∑_{i=1}^{N} x_i P(X = x_i)

This is also called the moment of order 1, more commonly known as the average, or mean. Indeed, if all outcomes are equally likely (P(X = x_i) = 1/N), then this is just the arithmetic average of all observations. If the outcomes are not equally likely, they must be weighted by their probability before summation; it is therefore a weighted mean.
Key Properties:
Transfer theorem

E(g(X)) = ∑_{i=1}^{N} g(x_i) P(X = x_i) if X is discrete; ∫_ℝ g(x) f(x) dx if X is continuous. (3.10)
By the previous theorem, one may define the moment of order n as:

m(X, n) = E(Xⁿ) = ∑_{i=1}^{N} x_iⁿ P(X = x_i) (3.11)
E((X − µ)²) = ∑_{i=1}^{N} (x_i − µ)² P(X = x_i) = E(X²) − µ² (3.12)

(the last two expressions are exactly analogous in the continuous case). The variance measures the spread of a distribution about its central tendency. One often speaks of the standard deviation, σ = √V(X), which is the average deviation around the mean, and bears the same units as X⁸.

⁸ "Standardized data" refer to data having been centered on the mean and divided by their standard deviation.

Properties:

Finally, the kurtosis is derived from the fourth moment, and quantifies the "peakedness" of a distribution. The list goes on, but no one ever seems to worry about moments beyond n = 4.
Numerical Summaries
Range
The range is simply the difference between the largest and smallest
value, and bears the same units as X. The dynamic range is usually de-
fined as:
DR = 20 × log₁₀(max(x)/min(x)) (3.14)
expressed in decibels (dB) regardless of the units of x. For instance, the
dynamic range of human hearing is roughly 140 dB.
The range immediately gives us a crude idea of the size of variations
to expect, which can be very precious already. For instance, are they of
the expected order of magnitude? Does the range encompass 0?
Location
the trimmed mean: a version of the sample mean ignoring the most ex-
treme values.
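A small sketch contrasting these location measures on data with one wild outlier (values invented for illustration):

import numpy as np
from scipy.stats import trim_mean

x = np.array([2.1, 1.9, 2.0, 2.2, 1.8, 50.0])
print(np.mean(x))            # 10.0: dragged upward by the outlier
print(np.median(x))          # 2.05: robust
print(trim_mean(x, 0.2))     # 2.05: mean after trimming 20% at each end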
Spread
Symmetry
Graphical Summaries
“Numerical quantities focus on expected values, graphical summaries on unexpected values.” (John Tukey)
Boxplots

Violinplots

Figure 3.9: A simple violin plot, using sns.violinplot(). Credit: seaborn example gallery.

Let's face it: boxplots are pretty boxy. A more refined way of plotting distributions is the violin plot, which elegantly summarizes their shape as quantified via kernel density estimates, along with the data as points (Fig. 3.10) or bars ("bean plots"). Good implementations come with a number of customizable options, including the ability to perform side-by-side comparisons as split violins, which must sound awful, but look rather snazzy.
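A minimal violin-plot sketch, patterned on the seaborn example gallery (the "tips" dataset ships with seaborn; any long-form DataFrame works):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
# split=True draws the side-by-side comparison as split violins
sns.violinplot(data=tips, x="day", y="total_bill", hue="sex", split=True)
plt.show()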
Scatterplots
• When |ρ| < 1 and ρ > 0, we say that the two variables are correlated. When ρ < 0, we say they are anti-correlated.
A single outlier, for instance, would give ρ > 0 even if the majority of points shows ρ < 0. As an alternative we can use Spearman's correlation¹⁰ or Kendall's τ¹¹, which are based on rank order statistics, and are therefore more robust.

¹⁰ scipy.stats.spearmanr()
¹¹ scipy.stats.kendalltau()
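A short sketch of this robustness (synthetic data, arbitrary seed): one outlier flips the sign of Pearson's ρ, while the rank-based measures barely move.

import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(2)
x = np.arange(20.0)
y = -x + rng.normal(scale=1.0, size=20)   # clearly anti-correlated cloud
y[-1] = 200.0                             # one wild outlier
print(pearsonr(x, y)[0])                  # positive: misled by the outlier
print(spearmanr(x, y)[0])                 # still strongly negative
print(kendalltau(x, y)[0])                # still strongly negative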
Bivariate densities
Bernoulli Distribution
A Bernoulli random variable is the easiest way to encode a binary out-
come: heads or tails? Rain or shine? Big earthquake or small earth-
quake? We say that a r.v. X ∼ B( p) if and only if X takes the value 1
with probability p (“success”) and 0 with probability q = 1 − p (“fail-
ure”). Verify that E( X ) = p, V ( X ) = pq.
Figure 3.15: The bivariate analogue of a histogram is known as a "hexbin" plot, because it shows the counts of observations that fall within hexagonal bins. Credit: Seaborn manual.

The Bernoulli distribution is not very interesting in its own right, but serves as a building block for a very important distribution:
Binomial Distribution

A random variable X ∼ B(n, p) is said to be binomial if, ∀k ∈ [0; n]:

P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ (3.23)

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient. The latter have the useful property that:

C(n, k) = C(n − 1, k − 1) + C(n − 1, k) (Pascal's rule) (3.24)

Figure 3.16: Pascal's Triangle, an illustration of Pascal's rule. Credit: Math Forum.

The binomial distribution measures the probability of k successes amongst n trials. It is easy to show that X ∼ B(n, p) can be obtained as the sum of n independent Bernoulli(p) variables.
P(X = 0) = C(90, 0) · 0.1⁰ · (1 − 0.1)⁹⁰ ≈ 7.6e−5 (3.25a)
P(X = 1) = C(90, 1) · 0.1¹ · (1 − 0.1)⁸⁹ ≈ 7.6e−4 (3.25b)
... (3.25c)
P(X = 9) = C(90, 9) · 0.1⁹ · (1 − 0.1)⁸¹ ≈ 0.139 (3.25d)

P(X ≥ 10) = 1 − P(X < 10) = 1 − ∑_{k=0}^{9} P(X = k) ≈ 0.412, i.e. a ≈ 41% chance.
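These numbers are easy to check with scipy's binomial distribution; a quick sketch:

from scipy.stats import binom

n, p = 90, 0.1
print(binom.pmf(0, n, p))        # ~7.6e-05, matching Eq. (3.25a)
print(binom.pmf(9, n, p))        # ~0.139,   matching Eq. (3.25d)
print(1 - binom.cdf(9, n, p))    # P(X >= 10) ~ 0.412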
[Figure: PMF and CDF of the binomial distribution B(n = 20, p) for p = 0.25, 0.5, 0.75 (left), and PDF and CDF of the normal distribution for (µ, σ) = (0, 1), (0, 0.5), (5, 2) (right).]
Poisson Distribution

A random variable is said to be Poisson with rate λ ∈ ℝ⁺ (usually written X ∼ P(λ)) if and only if, ∀k ∈ ℕ:

P(X = k) = e⁻ᵘ λᵏ/k!, with u = λ (3.26)

Proof of the fact that all probabilities sum to unity is provided via the exponential series: ∑_{k=0}^{∞} λᵏ/k! = e^λ (Appendix A, Sect. III). From this and a simple rearrangement of dummy indices, one can easily show that E(X) = λ and V(X) = λ. For λ ≤ 1, the PMF decreases strongly with k, hence the nickname: 'law of rare phenomena'.

Figure 3.18: PMF of the Poisson distribution for 3 values of λ. Credit: Wikipedia.

If events occur independently at a constant rate λ, the number of events N_t in an interval of length t follows:

P(N_t = k) = f(k; λt) = e^(−λt) (λt)ᵏ/k! (3.27)

In particular, the numbers of events over two disjoint intervals ∆t_1 and ∆t_2 are independent, and the probability that the time between events exceeds ∆t is e^(−λ∆t), which decays exponentially fast to zero.

Figure 3.19: CDF of the Poisson distribution for 3 values of λ. Credit: Wikipedia.
The Poisson distribution has all sorts of wonderful properties. In par-
ticular, it is easy to show that if X1 ∼ P (λ1 ) and X2 ∼ P (λ2 ), then the
sum X = X1 + X2 ∼ P (λ1 + λ2 ) (additivity of the two variables). The
Poisson process is intimately tied to the exponential distribution, which
describes memoryless processes.
Interesting property: the Binomial distribution can be approximated
by a Poisson distribution when n → ∞ if λ = np is kept constant. In
turn, it converges to a normal distribution for λ larger than ∼ 5 (see
Fig. 3.18, cyan curve, and Chapter 4, section II).
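A small numerical sketch of the first limit (λ = np held fixed as n grows):

import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
k = np.arange(10)
for n in (10, 100, 1000):
    gap = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(n, gap)    # the maximum PMF discrepancy shrinks as n grows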
Geometric Distribution

The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set {1, 2, 3, ...}. Also called the "waiting time until the first success". X ∼ G(p) iff ∀k ≥ 1:

P(X = k) = (1 − p)ᵏ⁻¹ p (3.28)

Again, this has no upper bound for large k, so we will need to sum until infinity. This is made possible by the geometric series:

∑_{k=0}^{∞} aᵏ = 1/(1 − a), ∀|a| < 1 (3.29)

E(X) = 1/p; var(X) = (1 − p)/p² (3.30)
The fact that the expected time until the first success is inversely pro-
portional to the probability of the event (a frequency of occurrence),
should not be surprising. It is the basis for estimating “return periods”
from the statistics of past events, a practice as fraught as it is mislead-
ing. Most of the public thinks of a “100-year flood” as a flood that recurs every 100 years, whereas it really is a flood that has a 1% chance of happening in any given year, assuming stationarity (Lab 5).
Application Waiting time until you get a 6 by rolling a die, waiting time
until any event with some estimated frequency (e.g. droughts, floods, earth-
quakes).
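A quick simulation sketch of the die-rolling application (X ∼ G(1/6), so E(X) = 6 rolls and var(X) = 30):

import numpy as np
from scipy.stats import geom

rng = np.random.default_rng(3)
waits = geom.rvs(1/6, size=100_000, random_state=rng)  # rolls until first 6
print(waits.mean())    # ~6.0  = 1/p
print(waits.var())     # ~30.0 = (1-p)/p**2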
where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. Its complementary cumulative distribution function is a stretched exponential function. The Weibull distribution is related to a number of other probability distributions; it interpolates between the exponential distribution (k = 1) and the Rayleigh distribution (k = 2).

Figure 3.22: PDF of the Weibull distribution.

[Table 3.3: Properties of the Weibull distribution.]

In lab #5, we shall see how these distributions can be fit to geophysical and geochemical data and what it enables us to do.
Chapter 4
NORMALITY. ERROR THEORY

George Barnard
The normal distribution with mean µ and variance σ² has density:

φ_{µ,σ²}(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) (“density”) (4.1)

Special case (the standard normal distribution): µ = 0, σ = 1:

φ(x) = (1/√(2π)) e^(−x²/2) (4.2)

Figure 4.1: PDF of the Normal distribution. Credit: Wikipedia.

The CDF (Cumulative Distribution Function) of X ∼ N(0, 1) is, generally, described by Φ:

Φ(x) = P(X ≤ x) = (1/√(2π)) ∫_{−∞}^{x} e^(−t²/2) dt (4.3)

Note One can show that the function e^(−t²) does not have any elementary antiderivative, i.e., if g′(t) = e^(−t²) then g(t) cannot be expressed as a finite combination of elementary functions (though it can be approximated that way). ⟹ Do not try to compute Φ(x) analytically (use a table or a computer).

Figure 4.2: CDF of the Normal distribution. Credit: Wikipedia.
In Python, the PDF can be generated at the values stored in array x
via scipy.stats.norm.pdf( x ), the CDF via scipy.stats.norm.cdf( x ).
As in all languages, the default location and scale parameters are 0 and
1, respectively.
Standard Normal

The standard normal is the only one we really need to worry about, because of the wonderful affine invariance of the normal distribution: if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1), i.e. X = σZ + µ. Hence:

F_X(x) = P(X ≤ x) = P(σZ + µ ≤ x) = P(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ).

This means that we can express the PDF and CDF of any normal distribution with the standard normal PDF and CDF, simply by centering and rescaling.
Example

Assume that the temperature at some place follows a normal distribution with mean µ = 20°C and standard deviation σ = 5°C. What is the probability that the temperature drops below 12°C?

T ∼ N(µ = 20, σ² = 25); P(T ≤ 12) = ?

P(T ≤ 12) = P(Zσ + µ ≤ 12) = P(Z ≤ (12 − µ)/σ) = P(Z ≤ −8/5) = Φ(−1.6) ≈ 5.5%. (Unlikely)
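The scipy one-liner mentioned above confirms this; a minimal check:

from scipy.stats import norm

print(norm.cdf(12, loc=20, scale=5))   # ~0.0548
print(norm.cdf(-1.6))                  # same answer via the standard normal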
Moments

Consider X ∼ N(µ, σ²); then E(X) = µ and V(X) = σ². So in the notation N(µ, σ²), the first parameter is the mean and the second the variance, where σ is the standard deviation.

Figure 4.3: Two normal distributions with the same mean, µ, and different scale parameters σ.

In general, if X ∼ N(µ, σ²), it looks like Fig. 4.4.
E[(X − µ)⁴] / (E[(X − µ)²])² = µ₄/σ⁴ (4.4)

This quantity is 3 for the normal distribution, and it is common to define µ₄/σ⁴ − 3 as excess kurtosis. By definition, the normal distribution has zero excess kurtosis.

Proof:

Using the unit measure and the previous result:

1 = P(X ≤ −x) + P(−x ≤ X ≤ x) + P(X ≥ x)
  = Φ(−x) + P(−x ≤ X ≤ x) + Φ(−x).

⟹ P(−x ≤ X ≤ x) = 1 − 2Φ(−x).
Proof: P(X ≤ −x) = P(X ≥ x) = 1 − Φ(x). But:

1 = P(X ≤ −x) + P(−x ≤ X ≤ x) + P(X ≥ x)
  = 2(1 − Φ(x)) + P(−x ≤ X ≤ x).

⟹ P(|X| ≤ x) = 2Φ(x) − 1. QED
Stability

The normal distribution is stable under scaling, addition and subtraction. If X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²) are independent normal random variables, and a and b are constants, then:

Scaling If X ∼ N(µ, σ²), then aX + b ∼ N(aµ + b, (aσ)²). In other words, the mean gets an affine transform, and the standard deviation is scaled by a factor a.

Addition The sum of two normals is another normal whose mean and variance are the sums of the individual means and variances:

U = X + Y ∼ N(µ_x + µ_y, σ_x² + σ_y²). (4.5a)
Although we’ve seen additivity before, you should realize how re-
markable it is: most distributions do not display this property. And the
remarkable result is that if you subtract two normal RVs, the means are
subtracted but the variances (measures of uncertainty) are added.
II Limit Theorems
Central Limit Theorem
Whenever a large sample of chaotic elements are taken in hand and marshalled in
the order of their magnitude, an unsuspected and most beautiful form of regular-
ity proves to have been latent all along.
Let X_1, ..., X_n be i.i.d. with mean µ and standard deviation σ, and define the standardized mean

Z_n = (X̄_n − µ)/(σ/√n),

where X̄_n = (X_1 + ... + X_n)/n. Then Z_n → N(0, 1) in distribution; hence, for large n, X̄_n approximately follows N(µ, σ²/n).
Remarkable facts
→ Works for "any" distribution as long as µ and σ exist.
→ Notice the factor σ/√n: uncertainties decrease as the inverse square root of the number of observations. This is the theoretical basis for the well-known laboratory practice of averaging independent measurements together to decrease uncertainty.

→ Normal asymptotics apply for any n larger than about 30. Infinity is within reach of anyone who can count to 30!
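A short simulation sketch of these facts, averaging a decidedly non-normal (exponential) variable; the seed and sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(4)
n, reps = 30, 10_000
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)  # exponential(1): mu = sigma = 1
print(means.mean())                       # ~1 = mu
print(means.std(), 1.0 / np.sqrt(n))      # both ~0.183 = sigma/sqrt(n)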
Application

A certain strain of bacteria occurs in all raw milk. Suppose that in uncontaminated milk its concentration has mean µ = 2500 and standard deviation σ = 300, and that we average n = 42 independent measurements, obtaining x̄ = 2700. By the CLT:

µ_x̄ = µ = 2500, σ_x̄ = σ/√n = 300/√42 ≈ 46.3,

P(x̄ ≤ 2650) = 0.9994.

⟹ Observing x̄ = 2700 is very unlikely if the milk is not contaminated.
⟹ The milk must not be sold!
P(X = k) → e^(−λ) λᵏ/k! as n → ∞ with np = λ (4.7)

which is more manageable if you can easily compute exponentials. But it does get better: we may dispense with all these pesky factorials altogether, as shown below.

One can show using Stirling's approximation that, in turn, the distribution of a Poisson RV X with parameter λ converges to a normal distribution as λ → ∞. That is, the CDF of (X − λ)/√λ converges to Φ.

So if we had X ∼ B(n, p) such that n → ∞ and np stays constant, we could use the normal approximation to evaluate the Poisson PMF. That is just how DeMoivre and Laplace discovered it in the first place (see Sect. III). Put another way, the standardized sum:

S_n* = (S_n − np)/√(np(1 − p)), with S_n = ∑_{i=1}^{n} X_i (4.8)
Origins

Though usually named after Gauss, the “Gaussian” (normal) distribution was actually discovered in 1733 by DeMoivre, who found it as a limit case of the binomial distribution with p = 1/2, but did not recognize the significance of the result. In the general case 0 < p < 1, Laplace (1781) had derived its main properties and suggested that it should be tabulated due to its importance. Gauss (1809) considered another derivation of this distribution (not as a limiting form of the binomial distribution); it became popularized by his work and thus his name became attached to it. It seems likely that the term “normal” is associated with a linear regression model for which Gauss suggested the least-squares (LS) solution method and called the corresponding system of equations the normal equations (Chapter 13). One more name, central distribution, originates from the central limit theorem. It was suggested by Pólya and actively backed by Jaynes (2004). A well-known historian of statistics, Stigler, formulates a universal law of eponymy that “no discovery is named for its original discoverer.”² Jaynes (2004) remarked that “the history of this terminology excellently confirms this law, since the fundamental nature of this distribution and its main properties were derived by Laplace when Gauss was six years old; and the distribution itself had been found by de Moivre before Laplace was born.” (Kim and Shevlyakov, 2008)

² As already remarked for Bayes' theorem.

Figure 4.7: The venerable Carl Friedrich Gauss, who contrary to popular belief did not discover the Gaussian law.
It turns out that the conjunction of these two seemingly simple properties is enough to enforce the bivariate gaussian form:

f(x, y) ∝ exp[α(x² + y²)] (4.10)
Information Theory
Shannon and Nyquist pioneered the concept of information entropy as-
sociated with a probability distribution. As in thermodynamics, infor-
mation entropy can only increase under spontaneous transformations.
It turns out that, by just specifying the mean and the variance of a mea-
surement error, the maximum entropy principle leads to a Gaussian
form for their distribution. Jaynes (2004) interprets the result this way:
The normal distribution provides the most honest description of our knowledge
of the errors given just the sample mean and variance of the measurements.
IV Error Analysis
Error Terminology
Lack of mathematical culture is revealed nowhere so conspicuously as in mean-
ingless precision in numerical computations.
Common Model:
Covariance

Special case: Cov(X, X) = Var(X) = E[(X − E(X))²].

(The equality holds if and only if the variables are not correlated.) Put differently, the most conservative error estimate for sums or differences is the sum of their uncertainties, but for independent errors this reduces considerably to a quadratic sum (the norm of the uncertainty vector)³.

³ The inequality stems from Pythagoras' theorem.
Product/Division

∆log u ≈ ∆u/u ≤ ∆log x + ∆log y ≈ ∆x/x + ∆y/y.

∆(xy)/(xy) = ∆(x/y)/(x/y) ≈ ∆x/x + ∆y/y. (4.16b)

which is just the same as Eq. (4.16a), but with relative instead of absolute uncertainties. In words, the relative uncertainty of a product (or ratio) is the sum of the relative uncertainties of its components (unless there is some partial cancellation).
Nonlinear Propagation

For any well-behaved function ψ(X_1, X_2, ..., X_n), with X_i ∼ N(µ_i, σ_i²), where the X_i are independent of each other⁴:

σ_ψ² ≈ (∂ψ/∂X_1)²|_{µ_1} σ_1² + (∂ψ/∂X_2)²|_{µ_2} σ_2² + ... + (∂ψ/∂X_n)²|_{µ_n} σ_n² (4.16c)

⁴ In the more general case, the equation writes:
σ_ψ² ≈ ∑_{i=1}^{n} ∑_{j=1}^{n} (∂ψ/∂X_i)|_{µ_i} (∂ψ/∂X_j)|_{µ_j} Cov(X_i, X_j) (4.16d)
where µ_i is the mean of X_i.

where the subscript means that we evaluate the function around the value ψ(µ_1, µ_2, ..., µ_n), and higher order terms are assumed negligible. This result stems from performing a first-order Taylor expansion of ψ about the mean, using the transfer theorem (Eq. (3.12)) to propagate the errors through ψ, and using the linearity of the expectance operator. This allows us to compute the contributions to total uncertainty due to several input sources (cf. Lab 4). This may be particularly helpful when considering which aspects of experiment design should be improved to yield the lowest overall uncertainties.
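As a sanity check on Eq. (4.16c), here is a small sketch comparing the first-order formula with brute-force Monte Carlo for the (hypothetical) function ψ(X, Y) = XY:

import numpy as np

mu_x, sig_x = 10.0, 0.5
mu_y, sig_y = 2.0, 0.1

# First-order propagation: partials dpsi/dX = Y and dpsi/dY = X at the means.
sig_psi = np.sqrt((mu_y * sig_x)**2 + (mu_x * sig_y)**2)

rng = np.random.default_rng(5)
psi = rng.normal(mu_x, sig_x, 200_000) * rng.normal(mu_y, sig_y, 200_000)
print(sig_psi, psi.std())    # both ~1.41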
Chapter 5
PRINCIPLES OF STATISTICAL ESTIMATION
“There are three types of lies: lies, damn lies and statistics”
Benjamin Disraeli
I Preamble
Methods
We shall see three estimation methods:
II Method of Moments
We’ve seen in Chapter 3 that the moments of most distributions usually
have a simple relationship to their parameters; the normal distribution
is an extreme example of this, because its two parameters are essentially
its first two moments. Hence the idea, due to Karl Pearson ca. 1890, to
use those moments to estimate the parameters. It is quite simple to use, and almost always yields some sort of estimate. Unfortunately, in many cases, it yields estimates that leave a lot to be desired (e.g. a probability greater than one, a sample number smaller than zero). However, it is a
good place to start when other methods prove intractable, and it is used
in Lab 5.
Theorem 1 (Strong Law of Large Numbers) The n-sample mean

X̄_n = ∑_{i=1}^{n} x_i / n (5.1)

converges almost surely to the true mean as the number of trials tends to infinity:

P(lim_{n→∞} X̄_n = µ) = 1. (5.2)
This law justifies the intuitive interpretation of the expected value of a random
variable when sampled repeatedly as the long-term average. It is a justification
for the ergodic assumption frequently used in physics.
Method of Moments:

1. Express the unknown parameter(s) θ as a function of the moments µ_k (k ≥ 1):

θ = g(µ_1, ..., µ_k). (5.4a)

2. Replace the theoretical moments µ_k by their sample counterparts, and solve for θ.

For the normal distribution, the moments are:

µ_1 = E(X) = µ.
Var(X) = σ² = E(X²) − E²(X) = µ_2 − µ². ⟹ µ_2 = µ² + σ².
Faster to solve:

µ_1 = µ = (1/n) ∑_{i=1}^{n} x_i, (E1)
µ_2 = µ² + σ² = (1/n) ∑_{i=1}^{n} x_i². (E2)

We must solve the system simultaneously for µ and σ². We know that µ = (1/n) ∑_{i=1}^{n} x_i = X̄. Substituting (E1) into (E2):

µ² + σ² = X̄² + σ² = (1/n) ∑_{i=1}^{n} x_i²

⟹ σ² = (1/n) ∑_{i=1}^{n} x_i² − X̄²
      = (1/n) ∑_{i=1}^{n} x_i² − ((1/n) ∑_{i=1}^{n} x_i)²
      = (1/n) ∑_{i=1}^{n} (x_i − X̄)².

So:

µ̂ = X̄,
σ̂² = (1/n) ∑_{i=1}^{n} (x_i − X̄)² ≡ S². ← Sample variance.
Example

Given θ balls in a jar numbered 1, 2, ..., θ. We draw balls at random (P(x_i) = 1/θ, uniform distribution). We want to estimate θ from an i.i.d. sample X_1, ..., X_n.

Method of moments:

µ_1 = E(X) = ∑_{i=1}^{θ} i · P(X = i) = ∑_{i=1}^{θ} i · (1/θ) = (1/θ) · θ(θ + 1)/2 = (θ + 1)/2.

System of equations to solve: µ_1 = (θ + 1)/2 = (1/n) ∑_{i=1}^{n} x_i = X̄, whence:

θ̂_MoM = 2X̄ − 1 (5.5)

Example

X_1 = 3, X_2 = 2, X_3 = 10, X_4 = 4, X_5 = 2, so X̄ = 4.2 and θ̂_MoM = 2X̄ − 1 = 7.4.

(A rather poor estimate, since we already know that 10 has been drawn, so we know for sure that θ should be at least 10!)
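A quick simulation sketch of this estimator (drawing with replacement from a jar with a known θ):

import numpy as np

rng = np.random.default_rng(6)
theta, n = 100, 50
draws = rng.integers(1, theta + 1, size=n)   # uniform draws from {1,...,theta}
print(2 * draws.mean() - 1)                  # theta_hat_MoM, scattered around 100
print(draws.max())                           # never exceeds theta: a hint of a better estimator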
The silly way would be to try different values of p, simulate from that model, and see if the statistics are compatible with this sequence. A better way is to write down the probability of the observed sample. For each observation,

P(X_i = x_i | θ) = f_θ(x_i)

and, by independence,

P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) × ... × P(X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i) = ∏_{i=1}^{n} f_θ(x_i) ≡ L(θ|x) (5.7)

which we call the likelihood function. For each value of θ, the likelihood L(θ|x) is the probability of observing {X_1 = x_1, ..., X_n = x_n} for that distribution and that parameter θ. Using Eq. (5.6):

L(p) = ∏_{i=1}^{n} p^(x_i) (1 − p)^(1−x_i) (5.8)

Gathering terms, and writing y = ∑ x_i for the number of successes:

L(p) = p^(∑ x_i) (1 − p)^(n − ∑ x_i) = p^y (1 − p)^(n−y) (5.9)
ℓ(θ) ≡ log L(θ) = log ∏_{i=1}^{n} f(x_i|θ) = ∑_{i=1}^{n} log f(x_i|θ). (5.10)

Since the logarithm is monotonically increasing, maximizing ℓ is equivalent to maximizing L:

ℓ(θ̂_MLE) = max_θ ℓ(θ) (5.11)
L(θ̂_MLE) = max_θ L(θ) (5.12)
That is, our best guess for the trial probability p is just the average of the successes. Intuitive though this may seem, it is immensely satisfying to arrive at this result via a rigorous optimality principle.

Figure 5.2: A bimodal likelihood.

Remark: θ̂_MLE does not have to be unique (cf. Fig. 5.2), although it often is in practice.
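When no closed form presents itself, the maximum can be found numerically. A minimal sketch for the Bernoulli likelihood of Eq. (5.9), minimizing the negative log-likelihood with scipy (data invented for illustration):

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y, n = x.sum(), x.size

def neg_loglik(p):
    # -l(p) = -[y log p + (n - y) log(1 - p)], from Eq. (5.9)
    return -(y * np.log(p) + (n - y) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y / n)    # the numerical optimum matches p_hat = y/n = 0.7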
In general, for a joint density f_J:

L(θ) = f_J(x_1, ..., x_n | θ) (5.16a)
L(θ) = ∏_{i=1}^{n} f(x_i | θ) (independence) (5.16b)
L(θ̂_MLE) = max_θ L(θ). (5.16c)

For the normal distribution:

L(µ, σ²) = ∏_{i=1}^{n} f(x_i | µ, σ²)
         = ∏_{i=1}^{n} (1/(σ√(2π))) e^(−(x_i − µ)²/(2σ²))
         = (1/(2π)^(n/2)) · (1/(σ²)^(n/2)) · e^(−(1/(2σ²)) ∑_{i=1}^{n} (x_i − µ)²) → likelihood function. (5.17)

Log-likelihood:

ℓ(µ, σ²) = log L(µ, σ²) = −(n/2) log(2πσ²) − ∑_{i=1}^{n} (x_i − µ)²/(2σ²).
So the MLE of the mean is just the sample mean? Really, all this for that? Well, sure, it's intuitive, but now it's also rigorously established⁴. What about σ? Can it be estimated via the sample standard deviation? It turns out that this is indeed the MLE. Now maximizing ℓ(θ) with respect to θ_2 = σ²:

∂ℓ/∂σ² = −n/(2σ²) + ∑_{i=1}^{n} (x_i − µ)²/(2σ⁴) = 0
⟺ nσ̂² = ∑_{i=1}^{n} (x_i − µ)²
⟺ σ̂² = (1/n) ∑_{i=1}^{n} (x_i − µ)², but we just showed that µ̂ = x̄,
⟺ σ̂² = (1/n) ∑_{i=1}^{n} (x_i − x̄)² = s² (sample variance)

So:

µ̂ = x̄,
σ̂² = s².

This happens to coincide with the result of the method of moments, but that will turn out to be an exception⁵.

⁴ Note that Gauss famously obtained the normal distribution as one that would have this property: that is, he considered the inverse problem of designing a distribution for which the sample mean would be the best estimate of the true mean (Kim and Shevlyakov, 2008).

⁵ In general, MLE is very different from MOM, and MLE has a sound basis in statistical theory, unlike MOM.

Now, is (x̄, s²) really a maximum of the likelihood, rather than a minimum or a saddle point? A sufficient condition is for L or ℓ to have negative curvature around that point:

1) ∂²ℓ/∂µ² < 0, ∂²ℓ/∂(σ²)² < 0, (5.18a)

2) (∂²ℓ/∂µ²) · (∂²ℓ/∂(σ²)²) − (∂²ℓ/∂µ∂σ²)² > 0. (5.18b)

It is easy to verify that:

θ̂_MLE = (µ̂, σ̂²) = (x̄, s²). (5.19)
These estimates are unique for the normal distribution (it is unimodal),
and show that the sample mean and sample variance are legitimate es-
timators of the true parameters (they are the most consistent with the
observations in this model).
Example (Uniform distribution on [0, θ])

X_1, ..., X_n i.i.d.; θ̂_MLE = ?

Likelihood function:

L(θ) = ∏_{i=1}^{n} f(x_i|θ) = { 1/θⁿ if 0 ≤ x_i ≤ θ for all i; 0 otherwise. }
Example (Normal estimation N(µ, σ²))

MLE: µ̂ = X̄, σ̂² = (1/n) ∑_i (x_i − X̄)². One can show that:

E(X̄) = µ, V(X̄) = σ²/n → unbiased;
E(σ̂²) = ((n − 1)/n) σ², V(σ̂²) = (2(n − 1)/n²) σ⁴ → biased.

Hence, define the bias-corrected estimator

σ̂_u² = (n/(n − 1)) σ̂² = (1/(n − 1)) ∑_{i=1}^{n} (x_i − X̄)².

Then

E(σ̂_u²) = σ² (no bias)

and

MSE(σ̂_u², σ²) = V(σ̂_u²) = (2/(n − 1)) σ⁴.
But

MSE(σ̂²_MLE, σ²) = ((2n − 1)/n²) σ⁴.

Considering that (2n − 1)/n² < 2/(n − 1), we have that:

MSE(σ̂²_MLE, σ²) < MSE(σ̂_u², σ²). (5.21)

(↑ bias, ↓ variance) vs. (↓ bias, ↑ variance) (5.22)

So even though it would be tempting to get rid of the bias, this would increase the overall error. By trading off variance for bias, a lower MSE can be reached. The MLE is optimal in the sense that it has the lowest MSE of all estimators; it sometimes achieves this by introducing some bias, so bias is not inherently evil. As usual in Life, there is no free lunch: one cannot possibly lower both bias and variance at once, so tradeoffs must be made.
Consistency
Both estimators above have an MSE that will tend to zero as n → ∞: this
is a property called consistency: as we collect more and more observa-
tions, we eventually beat down our uncertainty about the true parame-
ters to zero. Put differently, a consistent estimator is one that converges
to the true parameter value as the number of observations gets large
(“asymptotically”). We’ll see in Chap 9 that some spectral estimators
don’t do that, and that will be grounds for dismissal.
Efficiency
Suppose θ is an unknown parameter which is to be estimated from mea-
surements x, distributed according to some probability density function
f ( x; θ ). The variance of any unbiased estimator θ̂ of θ is then bounded
from below by the inverse of the Fisher information I (θ ). That is:
Var(θ̂) ≥ 1/I(θ) (5.23)
Note The method of moments yields estimators that are generally less efficient than the MLE, so you should prefer the latter whenever practical; most numerical estimation routines (including Python's fitter) use MLE. MLEs have good theoretical properties: they are invariant under rescaling, consistent, efficient, and converge in distribution to a normal (asymptotically, that is, for a large number of observations). In practice, finding the maximum may not be as straightforward as in the normal cases shown above, but even in a case where no closed-form expression exists, it usually involves fairly modest computations. If it does not, the Expectation-Maximization algorithm will quickly become your friend.
V Bayesian Estimation

The Big Idea

The following section is very strongly inspired by Wikle and Berliner (2007).

Gelman et al. (2013) define Bayesian inference as the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters. At its heart lies Bayes' theorem:

p(θ|X) = p(X|θ) p(θ) / p(X) (5.26)
There are four essential ingredients in this recipe, which all have
great conceptual importance:
X_i | {T = m} ∼ N(m, σ²) (5.30)
Let's assume for simplicity that we know σ (it comes from a calibrated thermometer of known precision) and we want the distribution of m conditional on observations, p(m|X). What we do know is:

p(X|m) = ∏_{i=1}^{n} (1/(σ√(2π))) exp(−(x_i − m)²/(2σ²)) (5.31a)

p(X|m) ∝ exp(−(1/2) ∑_{i=1}^{n} ((x_i − m)/σ)²) (5.31b)

Combining this data model with the normal prior m ∼ N(µ, τ²) via Bayes' theorem:

p(m|X) ∝ exp{−(1/2) [∑_{i=1}^{n} ((x_i − m)/σ)² + ((m − µ)/τ)²]} (5.33)
After some algebra, the posterior mean works out to:

E(T|X) = w_x x̄ + w_µ µ (5.35)

with x̄ = ∑_{i=1}^{n} x_i/n (the sample mean of the observations), w_x = nτ²/(nτ² + σ²) and w_µ = σ²/(nτ² + σ²)¹¹. That is, the posterior mean is a weighted average of the prior mean and the sample mean of the observations. Note that as τ² → ∞ (vague prior), the data model overwhelms the prior and p(m|X) → N(x̄, σ²/n) (as per the central limit theorem). Alternatively, for fixed τ², but very large amounts of data (i.e., n → ∞), the data model again dominates

¹¹ Verify that w_x + w_µ = 1.
the posterior density. On the other hand, if τ² is very small, the prior is critical for comparatively small n (it is highly informative). Though these properties are shown for the normal data, normal prior case, it is generally true that for very large datasets, the data model is the major controller of the posterior.
To illustrate this, assume the prior distribution is m ∼ N(20, 3), and the data model is X_i | m ∼ N(m, 1). We have two observations x = (19, 23). The posterior mean is 20 + (6/7)(21 − 20) ≈ 20.86 and the posterior variance is (1 − 6/7) × 3 ≈ 0.43. Fig. 5.5 shows these distributions graphically. Since the data are relatively precise compared to the prior, we see that the posterior distribution is closer to the likelihood than to the prior. Another way to look at this is that the data weight w_x = 6/7 is close to one, so that the data model is weighted more than the prior.
Figure 5.5: Posterior distribution with normal prior and normal likelihood; relatively precise data. From Wikle and Berliner (2007)
Next, assume the same observations and prior distribution, but change the data model to X_i | m ∼ N(m, 10). The data weight drops to w_x = 6/16, and the posterior distribution is m | X ∼ N(20.375, 1.875). This is illustrated in Fig. 5.6: since the measurement error variance is now relatively large compared to the prior variance, the prior is given more weight.
Figure 5.6: Posterior distribution with normal prior and normal likelihood; relatively imprecise data. From Wikle and Berliner (2007)
Normal model with unknown mean and variance
In the case where σ is unknown, things get a little more difficult, but still tractable. Gelman et al. (2013, chap. 2) show that a convenient prior for σ is the scaled inverse χ² distribution, parameterized by its degrees of freedom ν₀ and a scale parameter σ₀¹². They show a very neat result: in such a case the role of the prior can be thought of as providing information equivalent to ν₀ independent observations with average precision σ₀. This illustrates once more that the role of the prior, mathematically, is to provide a starting point for the inference. In cases where the observations are too few, or too poor, the choice of the prior will be important to the final outcome (the posterior distribution); in cases where observations are numerous and/or good, its role is diluted by the observations.
¹² the inverse χ² distribution is a special case of the inverse gamma distribution, plotted in Fig. 5.7
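The conjugate update above is simple enough to verify numerically. Below is a minimal Python sketch (the function name is ours, not from the text) reproducing both worked examples:

import numpy as np

# Conjugate normal-normal update: prior m ~ N(mu, tau2), data X_i | m ~ N(m, sigma2)
def posterior_normal(x, mu, tau2, sigma2):
    """Posterior mean and variance of m given IID normal observations x."""
    n = len(x)
    w_x = n * tau2 / (n * tau2 + sigma2)    # weight on the sample mean (Eq. 5.35)
    w_mu = sigma2 / (n * tau2 + sigma2)     # weight on the prior mean
    post_mean = w_x * np.mean(x) + w_mu * mu
    post_var = tau2 * sigma2 / (n * tau2 + sigma2)
    return post_mean, post_var

# Worked example: prior N(20, 3), precise data model (sigma2 = 1), x = (19, 23)
print(posterior_normal([19, 23], mu=20, tau2=3, sigma2=1))   # ~(20.857, 0.429)
# Imprecise data model (sigma2 = 10): the posterior is pulled back toward the prior
print(posterior_normal([19, 23], mu=20, tau2=3, sigma2=10))  # ~(20.375, 1.875)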
A note on priors
Three notable properties of prior distributions deserve mention here:
• Conjugacy: As we’ve seen, conjugate priors are priors that, combined with the likelihood, produce a posterior distribution of the same family as the prior. For instance, the normal model above with a normal prior on the observations’ mean yielded a normal posterior; for a binomial likelihood you’d choose a beta prior, which yields a beta posterior.
Figure 5.7: PDF of the inverse gamma distribution for various values of its parameters α and β. If X ∼ Scale-inv-χ²(ν, σ²) then X ∼ Inv-Gamma(ν/2, νσ²/2).
The main justification for using such priors is that they enable a closed-form expression for the posterior; otherwise, the posterior has to be evaluated numerically.
• Han Solo: For an entertaining view on how to choose a prior for the odds of Han Solo making it through an asteroid field, read this blog
post by Count Bayesie. The bottom line is that sometimes it is sensible
to be subjective, especially if you know Hollywood.
Non-analytical cases
As said above, in most real-world cases the posterior distribution does
not have a closed-form solution. In such cases, one must evaluate the
various integrals numerically, often over multiple dimensions, and this
is a topic of considerable complexity. Some of this can be coded in your
favorite language¹³, though more specialized packages like Stan are probably preferable for a first-timer. Either way, most applied Bayesians spend the majority of their time hunched over a computer, as illustrated in Fig. 5.8.
¹³ e.g. PyMC3 in Python
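For a one-dimensional parameter, even brute-force grid evaluation works. A minimal sketch, assuming a normal data model with known σ = 1 and a deliberately non-conjugate Laplace prior (both choices are ours, for illustration):

import numpy as np
from scipy import stats

x = np.array([19.0, 23.0])                 # observations
m = np.linspace(10, 30, 2001)              # grid over the unknown mean

log_post = (stats.laplace.logpdf(m, loc=20, scale=2)                      # log prior
            + stats.norm.logpdf(x[:, None], loc=m, scale=1).sum(axis=0))  # log likelihood
post = np.exp(log_post - log_post.max())   # subtract the max to avoid underflow
post /= np.trapz(post, m)                  # normalize by numerical integration

post_mean = np.trapz(m * post, m)          # posterior summaries by quadrature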
but the posterior also provides uncertainty quantification for the same
price.
One particularly neat use of the posterior is the ability to forecast new
observations. The posterior predictive distribution is the distribution of a
new data point x̃, marginalized over the parameters:
p(x̃ | X) = ∫_θ p(x̃ | θ) p(θ | X) dθ   (5.37)
Notice how the left hand side depends solely on the observations, but no
longer on the parameters. That is, past observations have been digested
by the statistical model so it can produce a probabilistic forecast of a
new estimate. In the normal case with known variance:
p(x̃ | X) ∝ ∫_R exp[ −(1/2) ((x̃ − m)/σ)² − (1/2) ((m − µ_p)/σ_p)² ] dm   (5.38)
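Rather than evaluating the integral (5.38) analytically, one may marginalize by Monte Carlo: draw the parameter from its posterior, then a new observation from the data model. A sketch, reusing the posterior of the worked example above:

import numpy as np

rng = np.random.default_rng(0)

# Eq. (5.37) by Monte Carlo: m ~ p(m|X), then x_new ~ p(x|m). Posterior from
# the earlier example: m | X ~ N(20.857, 0.429), with sigma = 1 known.
m_draws = rng.normal(20.857, np.sqrt(0.429), size=100_000)
x_new = rng.normal(m_draws, 1.0)       # posterior predictive draws

# Parameter uncertainty and measurement noise add in the normal case:
print(x_new.var())                     # ~1.43 = 0.429 + 1.0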
Chapter 6
CONFIRMATORY DATA ANALYSIS
One must credit an hypothesis with all that has had to be discovered in
order to demolish it.
Jean Rostand
Now we’re getting to what most laypeople think of when they hear
the word ’statistics’: putting measurements to the test, and finding whether
the results are significant. This is usually taken as a synonym of rigor,
but we shall see that care is needed: sometimes, applying the wrong test
or testing the wrong hypothesis is worse than not doing any statistics at
all.
I Preamble
Two tribes war over the ownership of a piece of land, which each claims
to have first occupied before the other. They ask a geochronologist to
arbitrate their dispute. The geochronologist finds wood and bone arte-
facts that allow the site to be radiocarbon-dated and the claims evalu-
ated: specifically, tribe A claims to have occupied the region since 622
AD, while tribe B claims they got there in 615 AD.
The following dates come back (1σ):
t̂_A = 650 ± 50 y ,
t̂_B = 750 ± 50 y .
Figure 6.1: Age distributions for artifacts from the two tribes. In green we have the data from tribe A with mean of 650 years and standard deviation of 50 years. In red we have the data from tribe B with mean of 750 years and standard deviation of 50 years.
The geochronologist asks you three questions:
1. How confident can I be in each date? → Confidence or credible Intervals
II Confidence Intervals
How confident are we that the true age lies within a certain range? As-
sume t A ∼ N (µ̂ A , σA ), where µ̂ A = 650 y and σA = 50 y. We want to
find t_min and t_max such that P(t_min ≤ t_A ≤ t_max) = 95%.
Defining Z_A ≡ (t_A − µ̂_A)/σ_A ∼ N(0, 1):
A normal RV spends 95% of its time within 1.96σ of the mean. That is, it lies within [µ − 1.96σ, µ + 1.96σ] with 95% confidence.
Figure 6.2: Confidence level of each confidence interval chosen (68.2% within ±1σ, 95% within ±1.96σ).
Tribe A: Z_A = (t_A − µ̂_A)/σ_A ⇔ t_A = µ̂_A + σ_A Z_A ,
⇒ P(µ̂_A − 1.96σ_A ≤ t_A ≤ µ̂_A + 1.96σ_A) = 95% .
A 95% C.I. for the arrival of tribe A is [552, 748]. A 95% C.I. for the arrival
of tribe B is [652, 848].
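In Python, such intervals are one-liners:

from scipy import stats

# 95% confidence intervals for the two arrival dates (Gaussian, sigma = 50 y)
print(stats.norm.interval(0.95, loc=650, scale=50))   # -> (552.0, 748.0)
print(stats.norm.interval(0.95, loc=750, scale=50))   # -> (652.0, 848.0)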
General Case
82
Data Analysis in the Earth & Environmental Sciences Chapter 6. Confirmatory Data Analysis
σ_A = σ_B = σ = 50 y ,
µ_A = 622 (H₀) .
• t_B ∉ interval means that we could resoundingly (at the 5% level) reject the hypothesis that tribe B got there even as late as 622 AD; the tribe’s claim that µ_B = 615 is even less convincing (the associated p-value would be smaller still).
Assuming again that t_A ∼ N(µ_A, σ_A²), then for the sample mean t̄_A:
E[t̄_A] = µ_A ,
V[t̄_A] = σ_A²/n_A . (Central limit theorem)
to estimate). We can of course scale them so that they have unit variance, for simplicity. (n − 1)S²/σ² is then the sum of n iid standard normal random variables, squared. It turns out that solving this problem was sufficiently important that (n − 1)S²/σ² got its own distribution, known as a chi-squared distribution, dependent only on the number of elements in the sum (ν), and denoted χ²_ν. It is a special case of the Gamma distribution:

χ²_ν = Σ_{i=1}^{ν} Z_i² , Z_i ∼ N(0, 1) IID   (6.5)

χ²_ν ∼ Γ(ν/2, 2) ,   (6.6)

i.e. f_{χ²_ν}(x) = x^{ν/2 − 1} e^{−x/2} / (Γ(ν/2) 2^{ν/2})   (6.7)

with E[χ²_ν] = ν and V[χ²_ν] = 2ν.
Figure 6.3: χ² distribution for different values of ν (ν = 1, 2, 3, 5 shown). We can see that for large ν, it converges to a Gaussian, N(ν, 2ν), as per the central limit theorem. Note that ν need not be an integer.

S² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² , with x_i ∼ N(µ, σ²), ⇒ (n − 1) S²/σ² = Σ_{i=1}^{n} Z_i² ∼ χ²_{n−1} ,

where ν = n − 1 = "degrees of freedom".
Now you may ask: why do we only have n − 1 degrees of freedom when
we added up n independent numbers? The answer is that 1 degree of
freedom went into estimating the mean, so we lost that.
Now recall the properties of the sample variance:
1. E[S²] = σ² (unbiased estimator);
2. V[S²] = 2σ⁴/(n − 1) (uncertainty about S also decreases as n grows).
So it would make sense to consider the test statistic:

T̂_A ≡ (t̄_A − µ_A)/(S_A/√n_A)   (6.8)

Figure 6.4: PDF of Student’s t distribution, which is close to Gaussian with fatter tails: that is, having to estimate the variance introduces uncertainty about the estimate of the mean. However, for ν → ∞, t_ν → N(0, 1), in another splendiferous manifestation of the central limit theorem. Credit: Wikipedia
Dividing numerator and denominator by σ_A, we see that it is basically the ratio of a standard normal to the square root of a chi-squared variable. Informally, one may write:

T_A ∼ Z / √( χ²_{n−1} / (n − 1) )

So what, you say? Again, solving this problem was, at some point, sufficiently important that someone went through the trouble of working out this distribution analytically, naming it Student’s T (or t) distribution³: t_ν, depicted in Fig. 6.4. In Python, it may be accessed via scipy.stats.t(). In the case of the mean of n IID normal RVs, ν = n − 1.
³ In the English-language literature it takes its name from William Sealy Gosset’s 1908 paper in Biometrika under the pseudonym "Student". Gosset worked at the Guinness Brewery in Dublin, Ireland. One version of the origin of the pseudonym is that Gosset’s employer forbade members of its staff from publishing scientific papers, so he had to hide his identity. Another version is that Guinness did not want their competitors to know that they were using the t-test to test the quality of raw material. Either way, beer and statistics do mix. Credit: Wikipedia
Let us define the null hypothesis H₀: "µ_A = 622" and the alternate hypothesis Hₐ: "µ_A > 622"⁴. Can the data distinguish between the observed and hypothesized means? Let us compute the T statistic⁵:

T̂ = (t̄_A − µ_A)/(S_A/√n_A) = 1.94

Is it significantly different from 0? Then compute P(T > T̂) = 1 − F_{t_ν}(1.94), where F_{t_ν} is the CDF of the t-distribution with ν degrees of freedom. This is the p-value⁶.
⁴ this is a one-sided test; we could also test for µ_A < 622, but it wouldn’t make sense here since we can safely assume that either tribe is giving us its earliest possible date of arrival at the site
⁵ a deterministic function of the data
⁶ Here, one would compute 1 − tcdf(1.94, n_A − 1)

• n = 4: P(T > T̂) = 13%. We fail to reject H₀⁷.
• n = 12: P(T > T̂) = 4% → We reject H₀.
⁷ this double-negative is essential to hypothesis tests: we start from a presumption of innocence (H₀ is true) and see if the data can convince us otherwise. The choice of H₀ is therefore crucial, and we should be careful that it does not stack the cards against a particular result, as is sometimes the case

Put differently, with n = 4, there is insufficient evidence to distinguish between 650 and 622 at the 5% level; with n = 12 there is sufficient
evidence to do so. So the data suggest that tribe A is exaggerating a
little, but is fundamentally honest. On the other hand, tribe B is either
lying or delusional. Let us ask a more relevant question for our dispute:
"Did tribe B arrive significantly after tribe A?"
86
Data Analysis in the Earth & Environmental Sciences Chapter 6. Confirmatory Data Analysis
n = 12
A
E.g.: → T ' −4.536.
n B = 9
As expected, the number is negative (since t B > t A ), but is it signifi-
cantly different from zero? For this we compute the p-value:
That is, if the two distributions had equal means, it would be exceed-
ingly unlikely to observe a deviation this big. We can thus reject the
null hypothesis with very high confidence: we conclude that tribe A
colonized the land first. Note that we would not be able to make this
claim with the same level of confidence with n A = n B = 2 (HW: what
p-value would we get?), so it was critical to collect more measurements.
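These computations are easily reproduced with scipy.stats. A sketch: it assumes the same sample mean and standard deviation at both sample sizes, and pooled degrees of freedom (n_A + n_B − 2) for the two-sample case, so the p-values may differ slightly from the rounded values quoted above:

import numpy as np
from scipy import stats

# One-sample, one-sided test of H0: mu_A = 622 against mu_A > 622
for n in (4, 12):
    t_hat = (650 - 622) / (50 / np.sqrt(n))
    print(n, stats.t.sf(t_hat, df=n - 1))   # p-value shrinks as n grows

# Two-sample test of mu_A = mu_B, using the T value quoted in the text
p2 = stats.t.cdf(-4.536, df=12 + 9 - 2)
print(p2)                                   # ~1e-4: exceedingly unlikely under H0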
Application: pval = f (n A , n B ) can tell you how many measurements
you need to achieve a certain level of confidence in a result. When
designing experiments, this tells us how many are needed to claim a
"significant difference" – a useful approach to convince a program man-
ager that you need monies to go collect data in faraway locales, or buy
a new instrument. This touches on the rich topic of experiment design.
Example
A Palm Springs resort claims that f_c = 6/7 of its days are cloud-free. Yet, observations over 25 days indicate that only 15/25 days are cloud-free. Is the claim supported by the evidence?
1. Test statistic: f = k/N = 15/25.
2. Null hypothesis: f = f_c = 6/7 ≈ 0.857.
P(X = k | f = f_c) = \binom{25}{k} f_c^k (1 − f_c)^{25−k} . (Binomial distribution)   (6.9)

5. p-value:

P(X ≤ 15) = Σ_{k=0}^{15} \binom{25}{k} f_c^k (1 − f_c)^{25−k} ≈ 0.0015 . (Very unlikely)
We conclude, with high confidence, that the data are inconsistent with the claim
that f = 6/7.
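The p-value above is a one-liner in scipy:

from scipy import stats

# One-sided binomial test of the resort's claim f_c = 6/7, given k = 15 of N = 25
p_value = stats.binom.cdf(15, n=25, p=6/7)   # P(X <= 15 | f = f_c)
print(p_value)                               # ~0.0015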
V Test Errors
Z-Test
Comparing 2 means where the variance is known: see Sect. III
T-Test
Comparing 2 means when both means and variance are unknown: see Sect. III
F-Test
This test compares the variances of two samples from two populations.
Sample 1: variance σ₁², n observations. Sample 2: variance σ₂², m observations.

F_{m,n} = (S₁²/σ₁²) / (S₂²/σ₂²) ∼ χ²_{n−1} / χ²_{m−1} ⁹

⁹ Its CDF may be accessed via scipy.stats.f.cdf(F, n, m)
χ² Test
Empirical fit to a theoretical distribution
The idea is to compare the histograms of the theoretical and empirical PMFs: the observed number of counts in bin k (O_k) should be close to the expected number (E_k). To quantify this, form

Ξ² = Σ_{k=1}^{N_b} (E_k − O_k)² / E_k ,   (6.11)

Figure 6.6: An example of a theoretical PDF (solid line) fit to an empirical PMF (columns).

It turns out that

Ξ² ∼ χ²_{ν−1} ,   (6.12)
Kolmogorov-Smirnov Test
The KS test¹¹ relies on the existence of a universal distribution for the statistic

D = max_x | F_n(x) − F(x) |   (6.13)

where F_n(x) is the empirical CDF and F(x) the reference CDF.
¹¹ scipy.stats.kstest() for a fit to a reference distribution
The critical value for rejection at level α is defined as

C_α = k_α / ( √n + 0.12 + 0.11/√n )   (6.14)

where n is the sample size and k_α is a function of α only.
One advantage is that it is universal: for any reference distribution F(x), this can tell us whether the sample that generated F_n(x) was drawn from the same population. One problem is that the test may not be stringent enough if some data have been used to estimate parameters (an example of double-dipping). Also, for some specific distributions, one
ample of double-dipping). Also, for some specific distributions, one
may obtain more powerful tests (i.e. tests with greater statistical power
1 − β) by exploiting the functional form of the distribution.
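In Python, the test is readily available (a sketch; the heavy-tailed sample is an arbitrary illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=200)             # heavy-tailed sample

D, p = stats.kstest(x, 'norm')                 # compare to a standard normal CDF
print(D, p)

# Caution: estimating the reference parameters from the same data
# ("double-dipping") makes the nominal p-value too optimistic.
D2, p2 = stats.kstest(x, 'norm', args=(x.mean(), x.std()))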
Permutation Tests
Imagine that we have a sample of size n = n₁ + n₂.
Reordering Tests
Skip (not common in Earth science)
Resampling plans
There are two main ones:
Bootstrap : thus called because it is a way to magically generate a large
ensemble from a limited sample, hence allowing you to pull yourself
up by your bootstraps. The idea is to generate surrogate data by sam-
pling with replacement from the original data (Fig. 6.8); there are nⁿ possible combinations. The idea is to generate B bootstrap samples, compute the test statistic for each of them, then sort and find the B × α/2 smallest and B × α/2 largest values, which bound a (1 − α) confidence interval. However, the draws are obviously not independent, as any given datum has probability ≈ e⁻¹ of not appearing in a given sample. In some cases, consecutive observations might not be exchangeable, but blocks of them may be, in which case whole blocks are resampled together (the block bootstrap).

Figure 6.8: Schematic of the bootstrap process (after Fig. 7.12 of Hastie et al.): B training sets Z*ᵇ, b = 1, . . . , B, each of size N, are drawn with replacement from the original dataset Z = (z₁, z₂, . . . , z_N); the quantity of interest S(Z) is computed from each bootstrap set, and the values S(Z*¹), . . . , S(Z*ᴮ) are used to assess its statistical accuracy.

Figure: block-bootstrap sampling distribution of a statistic λ.

Jackknife : the idea is to generate surrogate samples by leaving out one observation at a time, and repeating this n times (one for each observation). For a sample of size n, we have only n possibilities, each sample having n − 2 common points with another. The jackknife is in fact a special case of the bootstrap; it is ∼ 20 years older, and much cheaper. In this day and age it is not particularly recommendable, but it does prove useful to gauge the sensitivity of a result to one particular observation. It typically yields much less accurate confidence intervals or p-values than the bootstrap, however.
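A minimal bootstrap sketch in Python (the median is an arbitrary choice of statistic, for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)                   # a skewed sample

# Bootstrap: resample with replacement, recompute the statistic B times
B = 10_000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

# Percentile confidence interval: sort and take the B*alpha/2 tails
ci = np.percentile(boot, [2.5, 97.5])

# Jackknife (special case): the n leave-one-out estimates
jack = np.array([np.median(np.delete(x, i)) for i in range(x.size)])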
Direct simulation
For our purposes, if we’ve determined a particular statistical model
(e.g. AR(1), see Chapter 8) to be an appropriate null hypothesis for the
process that generated the data, one can simulate a large number of sam-
ples (say N_mc > 200) from that model, compute the ECDF of a statistic of interest, and estimate the probability of the observed value under this null distribution. We will use this in the isopersistent and
isospectral tests (Lab 7).
P(f | X) = [ Γ(α + β + N) / ( Γ(α + k) Γ(N − k + β) ) ] f^{k+α−1} (1 − f)^{N−k+β−1}   (6.18)
The posterior mode is at 15/25 = 0.6 (dashed line), but the distribution is fairly broad; the associated 95% credible interval¹⁶ is [0.406, 0.766].
¹⁶ the Bayesian counterpart to confidence intervals, credible intervals quantify the probability of the true parameter being in some range, conditional on the modeling assumptions.
Figure 6.11: Bayesian inference on the probability of cloudless skies with a uniform prior on f. The shaded area depicts a 95% credible interval.
Now, someone who is a bit more skeptical about the veracity of advertising claims may assume that there is only a 5% chance that the true f is very high, and may assume that values of f above and below 0.5 are equally plausible (symmetry). These two constraints together suggest α = 4, β = 4 for a more informative prior. The result of using this prior is illustrated in Fig. 6.12, where it can be seen that the mode is pulled slightly to the left compared to the flat prior case (0.581 instead of 0.6), which simply reflects the fact that this prior is more concentrated around 0.5. The associated 95% credible interval, [0.406, 0.736], is slightly narrower (notice how only the high values got trimmed), also because extreme values near 1 are weighted down by this prior.
Does it make a difference which prior we choose? In both cases, we see that the claimed probability f_c = 0.857 is far outside the 95% credible interval. We can easily compute P(f ≥ f_c) (by numerical integration if necessary, or using the incomplete beta function) and one would see that in either case it is smaller than 5 × 10⁻⁴, so we could reject the claim at that level of confidence. Hence the answer is: no, in this case the prior did not matter to the actual question, because the evidence from the data is so strong that it overwhelms the prior. The choice would be more impactful in less constrained situations.
Figure 6.12: Bayesian inference on the probability of cloudless skies with an informative β(4, 4) prior; the posterior is β(19, 14), with mode 0.581 and 95% credible interval [0.406, 0.736]. The shaded area depicts a 95% credible interval.
2.5
comes. The posterior predictive distribution does just that, and in this 2
case it takes the form of a beta-binomial (aka Pólya) distribution. As- 1.5
0.5
are N + + 1 = 6 possible outcomes (k+ = {0, · · · 6}), the probability of Binomial probability θ
which is shown in Fig. 6.13 (red curve). It is instructive to compare this Figure 6.12: The shaded area depicts a
distribution to one where we assume that f = 0.6 is known exactly. In 95% credible interval
both cases the most likely outcome is k+ = 3 (since 3/5 = 0.6), but the
beta binomial (red) allocates less probability to the central value, and
more to the extremes, in light of not knowing f exactly.
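All the numbers in this example can be reproduced with scipy; a sketch (the Beta posterior follows Eq. (6.18), and scipy.stats.betabinom requires scipy ≥ 1.4):

from scipy import stats

k, N = 15, 25
for a, b in [(1, 1), (4, 4)]:               # flat and informative Beta priors
    post = stats.beta(a + k, b + N - k)     # conjugate Beta posterior
    print((a, b), post.interval(0.95),      # 95% credible intervals
          post.sf(6/7))                     # P(f >= f_c | data) < 5e-4 either way

# Posterior predictive for N+ = 5 new days: a beta-binomial distribution
pred = stats.betabinom(5, 4 + k, 4 + N - k)
print(pred.pmf(range(6)))                   # mode at k+ = 3, fatter tails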
Figure 6.13: Posterior predictive (beta-binomial) distribution of the number of cloud-free days k₊, compared to a binomial with f = 0.6 assumed known exactly.
Now, having said all that, the reader is urged to read a few things on how hypothesis tests have been misused to support wildly erroneous scientific claims. A few points to keep in mind:
• A classical test is primarily a measure of how much data one has (that is, how finely one can discriminate between a result originating from chance alone or from a "real" effect)¹⁸. The choice of significance level is itself very subjective (if something is significant at the 5.1% level but not the 5.0% level, do you throw the baby out with the bathwater?). What matters in the end for your research is that you estimate the parameters that interest you and report their uncertainties, estimated as transparently as possible so people can decide whether they should believe you or not.
¹⁸ see e.g. Normality tests
Part II
FOURIER ANALYSIS
2. Harmonic signals (of the form e^{iωt}) are smooth, orthonormal, objec-
tive, invariant by integration or differentiation; they are a very con-
venient basis over which to represent almost any function.
I Timeseries
Up to this point we have considered data as an amorphous scramble
whose order was unimportant. Obviously, in many instances the or-
der of the data will matter just as much as their values themselves, and
timeseries analysis is all about extracting patterns out of this order.
Notion of timeseries
Discrete timeseries: in which case {t, h(t)} = {(t₁, h₁), · · · , (t_n, h_n)}. Here h_n = h(n∆) is the nth sample. All digitized signals are discrete, and in our digital world this will be of most interest.
Timeseries analysis
Methods for time series analysis may be divided into two classes: frequency-
domain methods and time-domain methods. The former include spec-
tral analysis and, more recently, wavelet analysis; the latter include auto-correlation
and cross-correlation analysis. The bridge between the time and fre-
quency domains is the Fourier transform, which can be either discrete or
continuous. Methods that use the Fourier transform usually fall under
the purview of Fourier analysis, which seeks to represent all signals as a
superposition of pure oscillations.
If the record is of finite length T, one may easily periodize it, so that
h(t) becomes periodic of period T (cf Fig. 7.2).
How can we represent (decompose) the signal h in terms of simpler functions?
Geometric analogy
In a plane P (orthonormal basis) any vector ~u can be represented as ~u = u_i ~i + u_j ~j, where u_i and u_j are called the components of the vector ~u, with ~i ⊥ ~j and ‖~i‖ = ‖~j‖ = 1. How do we find these components? The answer comes from the inner product for vectors:
Figure 7.3: Projection

⟨~u, ~v⟩ : P × P → R

Recall that:

⟨~u, ~v⟩ = u_i v_i + u_j v_j = Σ_{k=1}^{n} u_k v_k   (7.1)

Now,

⟨~u, ~i⟩ = u_i ⟨~i, ~i⟩ + u_j ⟨~j, ~i⟩ = u_i · 1 + u_j · 0 ,

so u_i = ⟨~u, ~i⟩.
Back to timeseries
In general, one can show that:

L²_{[0,T]} = { timeseries h(t), period T, ∫₀ᵀ |h(t)|² dt < ∞ }   (7.3)

(the integral condition defines the L² norm) is a vector space. Sines and cosines form a basis of this space. Moreover, sines and cosines are orthonormal:

(1/π) ∫₀^{2π} cos(nθ) sin(mθ) dθ = 0  ∀ n, m > 0   (7.5a)
(1/π) ∫₀^{2π} cos(nθ) cos(mθ) dθ = δ_{n,m} (same with sines)   (7.5b)

where δ_{n,m} = 1 if n = m, and 0 otherwise.
II Fourier Series
Joseph Fourier is perhaps best known for his work on trigonometric representations of periodic signals, but this was really a side project of
his. He wanted to understand how heat flows through continuous me-
dia, and found that heat flow was proportional to the gradient of tem-
perature (i.e. temperature is the potential of heat). He was the first to
formulate heat flow as a diffusion problem
∂T/∂t = κ ∇²T   (7.6)
In trying to solve this equation, he was looking for solutions that were
proportional to sines and cosines, because he could then easily turn this
partial differential equation into an algebraic one. This in turn led to
the question: is this legit? Can any periodic function be represented
this way? It took a long time for mathematicians to demonstrate this Figure 7.4: Joseph Fourier, whose forays
into the diffusion of heat led him to invent
rigorously, but physicists went ahead with it anyway. As usual, their how to represent any periodic function as
intuition was right. The key was to use an infinity of waves. a trigonometric series. In the process, he
gave birth to mathematical physics.
Fourier’s theorem
Fourier’s theorem states that any integrable², T-periodic function can be represented as an infinite superposition of sines and cosines called a Fourier series:

h(t) = a₀/2 + Σ_{k=1}^{∞} [ a_k cos(kωt) + b_k sin(kωt) ] , ω = 2π/T   (7.7)

² ∫₀ᵀ |h(t)| dt < ∞
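Eq. (7.7) is easy to explore numerically. A minimal sketch for a unit square wave, whose nonzero Fourier coefficients are the standard b_k = 4/(πk) for odd k; it also prefigures the Gibbs phenomenon discussed below:

import numpy as np

# Partial sums of the Fourier series (7.7) for a square wave of period T = 1
t = np.linspace(0.0, 0.05, 50001)          # zoom in near the jump at t = 0

for K in (29, 299, 2999):
    S = np.zeros_like(t)
    for k in range(1, K + 1, 2):           # odd harmonics only
        S += 4 / (np.pi * k) * np.sin(2 * np.pi * k * t)
    print(K, S.max())                      # stays near ~1.09: the ~9% overshoot
                                           # near the jump never vanishes (Gibbs)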
- Oscillations near break points may reach high amplitudes (Gibbs phenomenon, lab 6);
Figure 7.5: Example of non-smooth series
- These series have great theoretical use to find solutions of (linear) ordinary
and partial differential equations via the principle of superposition.
Examples
Dirac δ-function
- Definition:

δ(t) = 0 ∀ t ≠ 0 , with ∫_{−∞}^{+∞} δ(t) dt = 1   (7.15)

Figure 7.7: The Dirac δ-function describes perfect localization and represents an idealized point mass.
- Sampling property: for any smooth function h,

∫_R δ(t) h(t) dt = h(0)
∫_R δ(t − t₀) h(t) dt = h(t₀)

- Fourier transform:

∆(ω) = ∫_{−∞}^{+∞} δ(t) e^{−iωt} dt = 1
In general, good localization in time is associated with poor localization in frequency. On the opposite end, harmonic functions (sines and cosines) have perfect localization in frequency space, but no localization in time.
Figure 7.8: The Dirac δ-function has no localization in the frequency domain.
Negative Exponential
h(t) = H(t) e^{−at}, where H(t) is the Heaviside step function, has

|h̃(ω)|² = 1/(a² + ω²)   (7.17)

Boxcar
The gate (boxcar) function b(t) = H(t + τ) − H(t − τ) satisfies

∫_R b(t) dt = 2τ ,  b̃(ω) = 2 sin(ωτ)/ω = 2τ sinc(ωτ) .

Definition 5 The cardinal sine function sinc(x) ≡ sin(x)/x is characterized by a central peak at the origin and many oscillatory side lobes that decay hyperbolically away from it (Fig. 7.10). The function integrates to π. It is single-handedly responsible for leakage – one of the worst problems to befall the time series analyst (Section V).
Figure 7.10: The boxcar function and its Fourier transform, with its main lobe and side lobes.
Monochromatic waves
Gaussian function

g(t) = (1/(σ√(2π))) e^{−t²/(2σ²)} , ∫_R g(t) dt = 1

Important properties of the Fourier transform:
• Parity:
• Shifting:
• Differentiation:
∂h/∂t c iω h̃(ω)
• Integration:
∫ h dt c −i h̃(ω)/ω
• Zero frequency:

h̃(0) = ∫_{−∞}^{+∞} h(t) dt

corresponds to the average value of h⁷.
⁷ the “DC” in AC/DC
Important theorems
Parseval’s theorem
The Fourier transform conserves energy:

∫_{−∞}^{+∞} |h(t)|² dt = (1/2π) ∫_{−∞}^{+∞} |h̃(ω)|² dω ,

so |h̃(ω)|² may be interpreted as a spectral density, describing how the signal’s energy is distributed across frequencies (e.g. what is the nature of the recorded motion? What processes generated the measurement? What is the energy of the low-frequency seismic waves (T > 4 min)? Are there scale invariances in the system? Is the system particularly sensitive to forcing at a given frequency?). Obtaining reliable estimates of the spectral density is a holy grail of spectral analysis (Chapter 9).
Convolution theorem
Convolution in the time domain corresponds to multiplication in the frequency domain: (h ∗ g)(t) c h̃(ω) g̃(ω).
Example
A recorded seismological signal r (t) = s(t) ∗ b(t) ∗ g(t) = (s ∗ b ∗ g)(t)
where s is the seismometer, b is the building motion, and g is ground motion.
Convolution formalizes the idea of the composition of various (linear) systems
influencing each other.
Correlation theorem
C(h, g) = ∫_R h(t + τ) g(t) dt , τ = “lag”   (7.22)
C(g, h) = ∫_R h(t) g(t + τ) dt   (7.23)
C(h, g) c h̃(ω) g̃(−ω) = h̃(ω) g̃(ω)* for h, g real   (7.24)
Wiener-Khinchin theorem
The autocorrelation function a(τ) and the power spectral density form a Fourier transform pair: a(τ) c |h̃(ω)|².
Discrete Fourier Transform:

H_n = Σ_{k=0}^{N−1} h_k W_N^{kn} , W_N ≡ e^{−2πi/N}   (7.27)

f_{±N/2} = ±1/(2∆)   (7.29)

is called the Nyquist frequency and is the highest frequency resolvable by the dataset. If the dataset has energy above this frequency, it will be aliased to lower frequencies (Sect. V).
Figure 7.15: Sampling at the Nyquist frequency, where a digitized sinusoid would take values of (−1, 0, +1). It should be clear that sampling at half that rate would result in mistaking the sinusoid for a constant (red dots).
Periodicity
Ordering of frequencies:

0 < f < f_Ny ⇔ 1 ≤ n ≤ N/2 − 1
−f_Ny < f < 0 ⇔ N/2 + 1 ≤ n ≤ N − 1
f = ±f_Ny ⇔ n = N/2
Invertibility
Can we retrieve hk from Hn , and vice versa? Yes, via the inverse DFT:
Definition 11 Inverse Discrete Fourier Transform:

h_k = (1/N) Σ_{n=0}^{N−1} H_n W_N^{−kn}   (7.31)

We write H_n ⇌_N h_k.
iDFT(DFT(h))_k = (1/N) Σ_{n=0}^{N−1} [ Σ_{l=0}^{N−1} h_l e^{−2πiln/N} ] e^{2πikn/N} = Σ_{l=0}^{N−1} (h_l/N) Σ_{n=0}^{N−1} e^{−2πin(l−k)/N}   (7.32)

In the last term, we recognize the sum of the N-th roots of unity:

Σ_{n=0}^{N−1} W_N^{n(l−k)} = ( 1 − (W_N^{l−k})^N ) / ( 1 − W_N^{l−k} ) = 0 if l ≠ k ; = N if l = k ,

so that iDFT(DFT(h))_k = h_k: the transform pair is exactly invertible.
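This invertibility is easy to verify numerically; numpy’s FFT follows the same sign convention as Eq. (7.27), with the 1/N factor in the inverse:

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(64)

H = np.fft.fft(h)                # forward DFT, Eq. (7.27)
h_back = np.fft.ifft(H)          # inverse DFT, Eq. (7.31)

print(np.allclose(h, h_back))    # True: the transform pair is exact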
Properties
(r ∗ s)_j ≡ Σ_{k=−N/2+1}^{N/2} s_{j−k} r_k ⇌_N S_n R_n   (7.36)
Matrix formulation
Numerical cost
The FFT makes many costly operations cheap in the Fourier domain: a direct convolution costs O(N²) operations, versus O(N log N) for an FFT-based one. This doesn’t look like much, but as N becomes large, the FFT cost becomes negligible compared to the direct cost. Hence, in practice convolutions are always done in the Fourier domain: (s ∗ r)_k ⇌_N (S_n × R_n).
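A numerical check of this statement: circular convolution computed directly agrees with the product of DFTs (the O(N²) loop is for illustration only):

import numpy as np

rng = np.random.default_rng(0)
N = 256
s, r = rng.standard_normal(N), rng.standard_normal(N)

# Convolution theorem: (s * r) <-> S_n R_n
conv_fft = np.fft.ifft(np.fft.fft(s) * np.fft.fft(r)).real

# Direct O(N^2) circular convolution, for comparison
conv_direct = np.array([sum(s[j] * r[(k - j) % N] for j in range(N))
                        for k in range(N)])

print(np.allclose(conv_fft, conv_direct))   # True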
h(t) = Σ_{k=0}^{N−1} h_k sinc( π (t − k∆)/∆ )   (7.37)

where sinc is the cardinal sine function, and ∆ is chosen so that 1/∆ = 2 f_c = 2/T_c, i.e. two samples per period.
You heard it right: not just approximately recovered, but completely
recovered. If a signal is band-limited, it only contains a countable amount
of information, and therefore can be entirely summarized with a (dis-
crete) sequence of numbers. The applications of this principle are all
around us, in audio engineering, video & sound processing, telecom-
munications, cryptography, etc.
Alternatively, this theorem states that the highest recoverable fre-
1
quency is the Nyquist frequency f N = 2∆ . If h(t) is not band-limited,
power outside [− f N , f N ] will be aliased (falsely translated) into this in-
terval (see below).
Both leakage and aliasing are linked to the {boxcar ↔ sinc} transform
pair. Let us analyze these plagues in detail.
Aliasing
Leakage
Leakage arises because sampling over a finite time window amounts to
multiplication by a gate function (e.g. Fig. 7.19, between the red lines):
B_T(t) = 1 ∀ |t| ≤ T/2 ; 0 elsewhere
Spectral range
In practice, a spectrum can only be estimated on the interval [f_R, f_N], where

f_R = Rayleigh frequency = 1/(N∆)

(aka gravest tone, lowest note, fundamental harmonic). Since f_n = n/(N∆) = n f_R, f_R is also the spacing between frequency points, i.e. the spectral resolution: ∆f = f_R. All other frequencies are integer multiples of f_R.

f_N = Nyquist frequency = 1/(2∆) = (N/2) f_R
(highest note that can be recorded without aliasing the signal). If sam-
pling is chosen so that the signal has no energy beyond the Nyquist
frequency, then all is good. Otherwise, aliasing is always present, and
impossible to remove. This is one reason some scientific results may
change when more high-resolution data have been collected.
Bottom line: if N is fixed, ∆ can only be decreased (raising f_N) at the expense of T = N∆, and hence of the spectral resolution f_R; there is no free lunch. Hunting for periodic signals in discrete observations thus requires dealing with these fundamental constraints. This is the object of spectral analysis (Chapter 9).
Chapter 8
TIMESERIES MODELING
Classic statistical tests assume IID data. When such conditions are met,
life is beautiful and those tests are useful. In the geosciences, this is
rarely the case. We thus wish to provide geophysically-relevant null hy-
potheses so that we can correctly judge the significance of spectral peaks,
or of correlations between timeseries. Additionally, we will need such
models to estimate features of a timeseries believed to follow such mod-
els (cf the Maximum Entropy spectral method, Chap. 9).
This may be because low-frequency dynamics are at play (cf. Fig. 8.1), or because the way we are measuring the signal introduces this persistence: hydrological records, for instance, famously display more persistent behavior than the input climate (Hurst, 1951), and paleoclimate records often behave similarly.
Theoretical Spectrum
The theoretical spectrum of an AR(1) process is of the form

S(f) = σ² / ( 1 + φ² − 2φ cos(2πf∆) ) ,

a monotonically decreasing ("red") function of frequency.
single realization of a red noise process can yield arbitrarily high peaks
at arbitrarily low frequencies; such peaks could be attributed, quite er-
roneously, to periodic components.” This is illustrated in Fig. 8.6, where
it can be seen that many spectra exhibit peaks above the spectrum’s the-
oretical value. We will see in lab 8 how to determine the significance of
such peaks against AR(1) nulls.
N_eff ≃ N (1 − φ)/(1 + φ)  (effective sample size)   (8.4)

For φ = 0.8, which is far from unusual, N_eff is smaller than N by nearly an order of magnitude! One can then plug this d.o.f. into a t-test with

T = r √( (N_eff − 2)/(1 − r²) ) ∼ t_{N_eff − 2} ,

and one often finds very, very different (much less significant) results than if one went along willy-nilly with the naïve assumption that N_eff = N (Fig. 8.5). Alternatively, one may use non-parametric tests based on simulated timeseries (Lab 8).
Figure 8.5: The p-value and numbers of degrees of freedom (DOF) for a t test of a relatively low correlation (0.13) between two AR(1) time series (500 samples each) with autocorrelation φ. The green dashed line is the 5% threshold for significance. From Hu et al. (2017)
Persistence means that every spectrum will look “red”, that is, more energetic at low than at high frequencies (e.g. Fig. 8.4).
II Linear Parametric Models
AR(1) models are part of a general class of timeseries models called lin-
ear parametric models, comprising autoregressive models, moving av-
erage models, and their union, ARMA models.
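Before studying these models formally, it is instructive to see the effect of Eq. (8.4) in action. A sketch in the spirit of Fig. 8.5 (the simulation parameters are ours): correlate two independent AR(1) series, then compare naive and adjusted t-tests:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ar1(n, phi):
    """Simulate an AR(1) series x_t = phi * x_{t-1} + eps_t."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

n, phi = 500, 0.8
r = np.corrcoef(ar1(n, phi), ar1(n, phi))[0, 1]   # two independent series

n_eff = n * (1 - phi) / (1 + phi)                 # Eq. (8.4): ~56 instead of 500
for dof_n in (n, n_eff):
    T = r * np.sqrt((dof_n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(T), df=dof_n - 2)      # two-sided p-value
    print(round(dof_n), p)                        # naive vs. adjusted significance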
Figure 8.6: Realizations of an AR(1) process, X versus Time (top of each panel), and their estimated spectra (bottom; normalized spectral density versus period, in kyr), compared with the theoretical AR(1) spectrum. Individual realizations commonly exhibit peaks above the theoretical spectrum.
Example
AR(1) model:

X_t − µ = φ(X_{t−1} − µ) + ε_t
        = φ( φ(X_{t−2} − µ) + ε_{t−1} ) + ε_t
        = · · ·
⇒ X_t = µ + Σ_{k=0}^{∞} φ^k ε_{t−k}
ARMA models
In general, a broad class of timeseries can be approximated by an ARMA(p, q)
model, which fuses AR and MA models:
X_t − µ = Σ_{i=1}^{p} φ_i (X_{t−i} − µ) + Σ_{i=0}^{q} θ_i ε_{t−i}   (8.7)
Consider first an AR(K) process written as

X_t = ε_t + Σ_{k=1}^{K} α_k X_{t−k} .

Multiplying by X_t or its lags and taking expectations (using wide-sense stationarity, so that the autocovariance γ is a function of lag only):

E(X_t X_t) = E(ε_t X_t) + Σ_{k=1}^{K} α_k E(X_t X_{t−k}) = σ_ε² + Σ_{k=1}^{K} α_k γ(k)
E(X_t X_{t−1}) = E(ε_t X_{t−1}) + Σ_{k=1}^{K} α_k E(X_{t−1} X_{t−k}) ⇒ γ(1) = Σ_{k=1}^{K} α_k γ(k − 1)
· · ·
γ(n) = Σ_{k=1}^{K} α_k γ(k − n) → a discrete convolution

(For the AR(1) case, X_t = αX_{t−1} + ε_t = αⁿ X_{t−n} + Σ_{k=0}^{n−1} α^k ε_{t−k}, where X_{t−1} = αX_{t−2} + ε_{t−1}, X_{t−2} = αX_{t−3} + ε_{t−2}, etc.)

More generally, for an AR(p) process X_t = Σ_{i=1}^{p} φ_i X_{t−i} + ε_t, defining ρ_i = γ(i)/γ(0), the same procedure yields a linear system of equations (the Yule-Walker equations):

ρ₁ = φ₁ + φ₂ρ₁ + φ₃ρ₂ + · · · + φ_p ρ_{p−1}
ρ₂ = φ₁ρ₁ + φ₂ + φ₃ρ₁ + · · · + φ_p ρ_{p−2}
· · ·
ρ_p = φ₁ρ_{p−1} + φ₂ρ_{p−2} + · · · + φ_p

which expresses a strong relationship between the autocorrelation sequence and the AR coefficients.
This system may be written compactly in matrix form; in general,

ρ_m = Σ_{k=1}^{p} φ_k ρ_{m−k} .
Example

S(f) = φ₀ / | 1 + Σ_{j=1}^{p} φ_j e^{2πijf} |²   (8.8)

They can fit a wide array of spectral shapes; this formula is the basis for Maximum Entropy spectral estimation (Chapter 9).
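Eq. (8.8) is straightforward to evaluate. A sketch (the function name is ours; it uses the sign convention X_t = Σ_j φ_j X_{t−j} + ε_t, so signs may differ from other texts):

import numpy as np

def ar_spectrum(phi, sigma2, freqs):
    """Theoretical AR(p) spectrum, S(f) = sigma2 / |1 - sum_j phi_j e^{-2 pi i j f}|^2."""
    phi = np.asarray(phi)
    js = np.arange(1, len(phi) + 1)
    H = 1 - np.exp(-2j * np.pi * np.outer(freqs, js)) @ phi
    return sigma2 / np.abs(H) ** 2

freqs = np.linspace(0.01, 0.5, 500)
S_ar1 = ar_spectrum([0.8], 1.0, freqs)           # red noise: monotone decay
S_ar2 = ar_spectrum([0.75, -0.5], 1.0, freqs)    # AR(2): a broad spectral peak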
Figure: theoretical AR(1) and AR(2) spectra as a function of frequency f.
Blue noise
Blue noise is used loosely to refer to S(f) ∝ f or S(f) ∝ f², that is, power increasing with frequency (Fig. 8.10). Blue noise is not common in the geosciences, and that’s all the space we’ll devote to it.
Chapter 9
SPECTRAL ANALYSIS
Before we venture too far into technical details, however, let us pause
and reflect on the scientific motivations underlying such a complicated
analysis.
I Why spectra?
What do we hope to learn from the spectrum? Why is it worth calcu-
S U M BAWA
lating in the first place? To see, this, let us consider three Earth science CMO
8/19/77
examples.
will see distinct spectral lines that can be associated with the eigenfre-
~
SAMOA
quencies of Earth’s spheroidal normal modes. That is, the Earth is like CAN i
a bell that rings at certain notes (harmonics) when hit by the hammer of
12/26/75
i
large Earthquakes. These harmonics reveal fundamental aspects of its
inner structure.
Each of these discrete modes can be matched with spherical harmon-
3 25
ics (cf Appendix C) of fixed angular order, as illustrated in Fig. 9.1.The FREQUENCY (mHz)
quake and station to station. This variation arises because the mode
degeneracy is split by 3D Earth structure, and it can be precisely inter-
preted and inverted. These “apparent eigenfrequency” data are used in
most global 3D earth models.
A second example concerns the spectrum of climate variability, whose slope appears to change at the centennial scale (black cross). The au-
thors hypothesized that this change in spectral slope (aka scaling exponent) is indicative of different processes accomplishing energy transfers between space and time scales.
Testing theories of geophysical turbulence
Theories of fluid flow in the atmosphere and oceans make predictions about rates of energy transfer between spatial and temporal scales that are most easily summarized in the spectral domain. In particular, such theories predict the existence of scaling regimes: a power-law behavior (i.e. linear behavior in a log-log plot) of the spectrum of various state variables (temperature, velocity, passive tracers), which is indicative of scale invariance. As an example, consider radiance measurements from satellites as presented by Lovejoy et al. (2008).
Figure 9.3: Scaling behavior. Spectra from ∼1000 orbits of the Visible Infrared Sounder (VIRS) instrument on the TRMM satellite, channels 1–5 (at wavelengths of 0.630, 1.60, 3.75, 10.8 and 12.0 µm from top to bottom, displaced in the vertical for clarity). The straight regression lines have spectral exponents β = 1.35, 1.29, 1.41, 1.47, 1.49 respectively, close to the value β = 1.53 corresponding to the spectrum of passive scalars (5/3 minus intermittency corrections). Adapted from Lovejoy et al. (2008).
Fig. 9.3 shows the “along track” 1D spectra from the Visible Infrared Sounder (VIRS) instrument of the Tropical Rainfall Measurement Mission (TRMM) at wavelengths of 0.630, 1.60, 3.75, 10.8 and 12.0 µm, i.e. for visible, near infrared and (the last two) thermal infrared (IR) radiation. The first two bands are essentially reflected sunlight, so that for thin clouds the signal comes from variations in the surface albedo (influenced by the topography and other factors), while for thicker clouds it comes from nearer the cloud top via (multiple) geometric and Mie scattering. The units are such that k = 1 is the wavenumber corresponding to the size of the planet (20 000 km). The high-wavenumber fall-off is due to the finite resolution of the instruments.
A scaling behavior is evident from the largest scales (20 000 km) to the smallest available – there are no peaks in this spectrum, and all the interesting information is contained in the slope of the “background”. This slope turns out to be incompatible with classical turbulence cascade models, which assume well-defined energy sources and sinks with a source- and sink-free “inertial” range in between. Thus, these estimates of spectral scaling enable one to directly test theories of atmospheric turbulence. Such a test would be impossible (and meaningless) with unprocessed timeseries data; nothing would be learned by staring at wiggles, but staring at their spectra does teach us a lot.
Sampling
Analog signals may be pre-filtered to make them band-limited prior to digitization. This operation brings the power to zero beyond a cutoff frequency f_c. Sampling usually refers to digitization at even increments ∆.
If even increments are not possible (e.g. in a sediment core, time may not
be linearly related to depth), there are two possibilities:
1. interpolate on regular time grid, as seen in the next Chapter (this might
introduce spurious features) and use the methods of this chapter.
Detrending
Example
In the global warming example of Fig. 9.4, it would make sense to remove the
slowly-evolving component (red line) if interested in periodic or quasi-periodic
oscillations
Tapering
Tapering refers to bringing the edges of a timeseries to zero. There are
two reasons to do so:
Edge effects We have seen that the DFT periodizes the signal h(t) (so the sequence H_n is N-periodic). If h(N − 1) ≠ h(0), the jump h(N − 1) − h(0) will generate a discontinuity at the junction (cf. Fig. 9.5), so higher-order harmonics will pollute the spectrum via Gibbs’ phenomenon: this is an example of an edge effect. To some extent, detrending will help with that. Multiplying by a taper, however, will ensure that the series goes smoothly to zero at both ends, removing the spurious discontinuity altogether.
Figure 9.5: Edge effects may appear in the simplest settings.
Figure 9.7: Welch, Hanning, and Bartlett tapers on [−1/2, 1/2].
There are several choices available (Fig. 9.7), depending on what you’re trying to achieve:
• Hanning window: w(n) = sin²( πn/(N − 1) )
• Gaussian window
• Parzen window.
Zero-padding
Because the FFT algorithm works optimally on datasets with a sample
size equal to a power of two, it is common to pad with zeroes to reach the next power of two⁶: N → N′ = 2^m. This will make the Fast Fourier Transform algorithm faster (though it is not mandatory); it may reduce frequency spacing, yielding a smoother spectrum; and it theoretically does not add or destroy information. However, this is only true if the series is stationary and there is no jump at the end points; so it must always be used in conjunction with a taper.
⁶ Matlab: nextpow2
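Putting the pre-processing steps together (demeaning, tapering, zero-padding), a raw tapered periodogram might be computed as follows; this is a sketch using numpy only, and the normalization is one of several conventions in use:

import numpy as np

rng = np.random.default_rng(0)
n, dt = 1000, 1.0
t = np.arange(n) * dt
x = np.sin(2 * np.pi * 0.1 * t) + rng.standard_normal(n)

x = x - x.mean()                            # detrend (here: just remove the mean)
w = np.hanning(n)                           # Hanning taper to curb leakage
n_pad = 2 ** int(np.ceil(np.log2(n)))       # next power of two (cf. nextpow2)

X = np.fft.rfft(x * w, n=n_pad)             # zero-padded FFT
freqs = np.fft.rfftfreq(n_pad, d=dt)
periodogram = np.abs(X) ** 2 / (w ** 2).sum()   # normalize by the taper's power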
Ŝ(f_n) = P_n = |H_n|²   (9.3)

â(k) = [ (1/(N − |k| − 1)) Σ_{j=0}^{N−|k|} h_j h_{j+k} ] / [ (1/(N − 1)) Σ_{j=0}^{N−1} h_j² ] , k ∈ [−(N − 1), +(N − 1)] ,   (9.5)

V(Ŝ) = S² : the raw periodogram is an inconsistent estimator, as its variance does not shrink with sample size.
Confidence intervals
Recall that:
H_n = Σ_{k=0}^{N−1} h_k e^{−i(2π/N)kn} is a complex number   (9.6)

|H_n|² = ℜ(H_n)² + ℑ(H_n)²   (9.7)
If h_k ∼ N(0, σ²) i.i.d., then ℜ(H_n) and ℑ(H_n) are themselves zero-mean Gaussian variables, so

|H_n|²/σ² ∼ χ²₂ (chi-squared with 2 d.o.f.) ∼ Exp(1/2) .

So the probability that at least one of the N periodogram ordinates exceeds a threshold g_α is

P[ |H_n|²/σ² ≥ g_α for some n ] = 1 − ( 1 − e^{−g_α/2} )^N   (9.8)
Ŝ_BT(f) = Σ_{k=0}^{N−1} w_k a_k e^{−2πif∆k}   (9.9)

where:
• a_k = auto-correlation at lag k;
• w_k = lag window (taper).
Ghil et al. (2002) write: “it turns out that the [...] windowed correl-
ogram method is quite efficient for estimating the continuous part of
the spectrum, but is less useful for the detection of components of the
signal that are purely sinusoidal or nearly so (the lines). The reason for
this is twofold: the low resolution of this method and the fact that its
estimated error bars are still essentially proportional to the estimated
mean Ŝ( f ) at each frequency f ”.
V( Ŝ_Welch(f) ) = S²/K   (9.11)

For i.i.d. Gaussian h_k,

ξ = ν Ŝ(f)/S(f) ∼ χ²_ν   (9.12)

where ν = “equivalent d.o.f.”, which depends on K and the shape of the tapers. Typically, ν = 2K − 2. Then:

P( χ²_{ν, α/2} ≤ ν Ŝ(f_n)/S(f_n) ≤ χ²_{ν, 1−α/2} ) = 100 × (1 − α)%   (9.13)

This yields a parametric 95% confidence interval for the spectral estimate. If the data are not i.i.d. Gaussian, you would be better inspired to use a non-parametric test (see examples below).
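Eq. (9.13) translates directly into Python (a sketch; ν = 2K − 2 is the typical value quoted above):

from scipy import stats

def spectrum_ci(S_hat, nu, alpha=0.05):
    """Chi-squared confidence interval for a spectral estimate with nu d.o.f. (Eq. 9.13)."""
    lo = nu * S_hat / stats.chi2.ppf(1 - alpha / 2, nu)
    hi = nu * S_hat / stats.chi2.ppf(alpha / 2, nu)
    return lo, hi

# e.g. Welch with K = 8 segments -> nu = 2K - 2 = 14 equivalent d.o.f.
print(spectrum_ci(S_hat=1.0, nu=14))        # multiplicative interval ~(0.54, 2.49)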
Notes on MEM
MEM can detect amazingly fine features (e.g. split peaks), but the choice of p tends to be arbitrary¹¹; further, the method only makes sense if an AR model is an adequate approximation of the series. The Blackman-Tukey method is even worse, since the choice of taper is rather ad hoc. In 1982, D. Thomson found an optimal solution, known as the Multi-Taper method (MTM, Thomson, 1982).
¹¹ aka Minimum Description Length, or Final Prediction Error
Figure 9.8: Smoothed peak (obtained with Blackman-Tukey) versus split peaks obtained with MEM.
Multi-Taper Method
MTM uses K orthogonal tapers (w_k(t), k = 1, · · · , K) belonging to a family of special functions (Discrete Prolate Spheroidal Sequences, or Slepian functions, Fig. 9.9). These w_k(t) solve the variational problem of minimizing leakage outside of a frequency band f₀ ± p f_R, where f_R = 1/(N∆) is the Rayleigh frequency and p is some nonzero integer. Averaging over tapered spectra yields a more stable estimate; the band 2p f_R sets the width of a harmonic line with this method. There is a trade-off between spectral resolution and the variance of the estimate.
Figure 9.9: Slepian sequences, N = 512, p = 2.5.

S_MTM(f) = Σ_{k=1}^{K} µ_k |Y_k(f)|² / Σ_{k=1}^{K} µ_k   (9.14)
Uncertainty Quantification
So, you’ve estimated a spectral density, ideally using several methods,
or with one method but different choices of its parameters (e.g. time-bandwidth product, AR model order, etc.). What now? The most com-
mon use of spectral analysis is to identify periodic components, and hav-
ing done so, decide whether they are significant. Aaaargh, the dreaded
S word again! We are now back in the realm of Chapter 6.
White nulls
Autoregressive nulls
Figure 9.10: Big Taurius δ¹⁸O MTM spectrum (nw = 3): an example of a 95% confidence interval for Ŝ. The thin blue lines depict χ²-based intervals, while the thin gray lines depict a 95% confidence region from an ensemble of 1,000 AR(1) timeseries (Chapter 8), which serve as a null hypothesis for the significance of spectral peaks. Both axes are logarithmic (spectral density versus period, in years).
There is much to notice in this graph. You will also notice that both axes are logarithmic, without which the spectrum would be squished to the left-hand corner, and none of the interesting details would appear. Sometimes a semi-log scale may be more sensible; the choice should ultimately serve to highlight the features of interest.
One drawback of the log-log representation is that the area occupied under the curve is no longer proportional to variance: integrating the area under the curve does not represent the energy of the signal in that frequency band. A solution to this problem uses a variance-preserving log-scale (Fig. 9.11). This representation plots f × S(f) as a function of log(f), which conserves variance in each band and thus allows one to “integrate by eye” (A. Wittenberg, pers. comm., 2012).
Figure 9.11: Comparisons of the MTM spectra of 3 reconstructions of the NINO3.4 index (a common metric of ENSO) with a simple stochastic null hypothesis (LIM). From Ault et al. (2013).
Age-uncertain spectral analysis

[Figure 9.12: Age model for the Big Taurius record of Partin et al. (2013), based on 31 U-Th ages and a top date of 2005. The grey envelope represents 10,000 possible spline fits of the age model, whose median is depicted by the dark curve.]

As an example of a scientifically-motivated null hypothesis, let us consider the study of Partin et al. (2013), who used oxygen isotope measurements in a South Pacific speleothem to learn about low-frequency hydroclimate variability. The measured timeseries may be seen in Fig. 9.13.
As in most climate proxy records, the time axis is uncertain to some de-
gree, in the sense that many possible functions could fit the age-depth
constraints (Fig. 9.12). One approach to this is to use ensemble methods
(Monte Carlo simulations), generating many possible realizations (say,
N = 10,000) of the timeseries that are permissible within age uncertain-
ties.
One may then compute the resulting spectra for each realization of
the age model, and see how the spectrum of the original timeseries (blue
curve in Fig. 9.13) fares in relation to this ensemble (Fig. 9.14). The anal-
ysis shows that age uncertainties would tend to blur high-frequency
peaks rather than creating them, lending credence to the notion that
the peaks observed in Fig. 9.10 are real.
[Figure 9.14: Big Taurius δ18O MTM spectrum (nw = 3). MTM spectral analysis of the Fig. 9.13 series using the median age model (blue). Gray shading delineates the 95% confidence interval obtained from 10,000 realizations of the age model; the AR(1) 95% quantile is also shown. From Partin et al. (2013)]
VI Cross-spectral analysis
Consider the Fourier transform pairs $x(t) \leftrightarrow X(f)$, $y(t) \leftrightarrow Y(f)$. Remember the Wiener-Khinchin theorem.
In practice

$$\tilde{\gamma}_{xy}(k) = \frac{1}{\sigma_x \sigma_y}\, \underbrace{\frac{1}{N - |k| - 1} \sum_{j=0}^{N-|k|} x_j\, y_{j+k}}_{\text{cross-covariance}} \qquad (9.17)$$

$$\gamma_{xy}(t) \leftrightarrow X(f)\, Y^*(f)$$

so if Y = X we will have

$$\mathcal{F}(\gamma_{xx}(t)) = |X(f)|^2 = S(f).$$

The cross-spectrum may be written in polar form, $\Gamma_{xy}(f) = A_{xy}(f)\, e^{i\psi_{xy}(f)}$, where $A_{xy}$ and $\psi_{xy}$ are called the amplitude and phase spectrum, respectively. The squared coherence is

$$\kappa_{xy}(f) = \frac{|A_{xy}|^2}{|X|^2 |Y|^2} = \frac{|\Gamma_{xy}|^2}{S_{xx} S_{yy}}$$

the spectral analogue of the squared correlation coefficient

$$\rho_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\mathrm{Cov}(x, y)}{\sqrt{v_x}\sqrt{v_y}}$$
Properties

If $Y = [HX](f)$ and $\Gamma_{xy} = X(f)\,Y^*(f)$, we have

$$\kappa_{xy} = \frac{|\Gamma_{xy}|^2}{|X|^2 |Y|^2} = \frac{|X(f)|^2 |H(f)|^2 |X(f)|^2}{|X(f)|^2 |H(f)|^2 |X(f)|^2} = 1 \qquad (9.18)$$

at all f's. So, if the two series are related via a linear filter, they'll be perfectly coherent. Deviations from unity imply a non-linear response, or no response at all.
Chapter 10
SIGNAL PROCESSING
I Filters
A filter is an operation that removes only a part of a measured signal. Ev-
eryday examples include sunglasses (which filter UV and/or IR wave-
lengths) or the EQ on a stereo receiver or your iTunes player, where one
can change the relative importance of bass, mids or trebles. Earth Sci-
ence examples include:
Seismology: filter out surface waves (long periods) to leave only P and S waves.
Question If the Fourier curse is so damning, why not stay in the time domain? Why not just average contiguous data points together, for instance?

This is called a running mean. One might think that by averaging together M contiguous points, you'd filter all frequencies $f > \frac{1}{M\Delta}$. Unfortunately not: this time it amounts to convolution by a boxcar in the time domain, so smearing by a sinc function in the frequency domain: that is yucky. But there is worse: not only does this mess up the amplitude spectrum, but it destroys the phase spectrum even more. Defining the boxcar as:

$$b_k = \begin{cases} 1 & k \in [0, M-1] \\ 0 & k \ge M \end{cases} \qquad (10.1)$$

its frequency response has amplitude

$$R_n = \frac{\sin\left(\frac{n\pi M}{N}\right)}{M \sin\left(\frac{n\pi}{N}\right)}, \quad \text{which has zeroes at } n = l\,\frac{N}{M},\; l \in \mathbb{N}^*, \qquad (10.4)$$

and phase

$$\Phi_n = -\frac{\pi n (M-1)}{N} - \varepsilon\pi, \qquad \varepsilon = \operatorname{sign}\left(\sin\frac{n\pi M}{N}\right) = 0 \text{ or } 1 \qquad (10.5)$$

When $n = l\,\frac{N}{M}$, $\Phi_n = \mp\pi$: this is a 180 degree phase shift. So the running mean does cut out the frequency $n = \frac{N}{M}$ (as we hoped it would), but also all its integer multiples, and completely shifts the phase around at these points. Yikes! Such throwing of the baby out with the bathwater is a common failing of naive smoothers.
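One can verify this numerically; the following sketch computes the frequency response of an M-point running mean with SciPy (the value of M is arbitrary):

```python
import numpy as np
from scipy.signal import freqz

M = 12                                    # width of the running mean (arbitrary)
w, h = freqz(np.ones(M) / M, worN=4096)   # response of the boxcar FIR filter
gain = np.abs(h)                          # sinc-like, with nulls at f = k/M
phase = np.unwrap(np.angle(h))            # linear, but jumps by pi at each null
```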
INPUT: underlying signal u(t) (e.g. ground motion) → Filter r(t) → OUTPUT: measured signal s(t) (e.g. seismogram)
where the gain G is the ratio of amplitudes $|\tilde{s}(\omega)|/|\tilde{u}(\omega)|$, and the phase shift ϕ is the difference of phases, phase(s̃) − phase(ũ), at each frequency. r(t) is called the impulse response of the filter, that is, the response to a unit pulse at t = 0. There are therefore three variables to consider for any filter. In general, one wants to specify G (e.g. lowpass) but ϕ and r(t) will be affected as well; this trade-off is fundamentally due to the uncertainty principle.
There are several classes of filters, each realizing a compromise between
amplitude response (gain), phase response and impulse response:
3. Infinite impulse response (recursive) vs. finite impulse response (non re-
cursive)
then:

• Iterating this process n times yields a binomial filter, with weights proportional to the binomial coefficients $\binom{n}{k}$.

Advantage: $R(f) = \cos^{2n}(\pi f)$ at order n, and the phase shift is linear (easily corrected).
Frequency domain
[Figure: amplitude response as a function of frequency [∆t⁻¹] (left) and the corresponding response as a function of time [seconds] (right), for filter orders n = 1, 2, 4, 10.]
[Figure: idealized gain curves for bandpass and notch filters, as a function of frequency f.]
1. filter(n, b, a) (once), or
2. filtfilt(b, a) (twice, forward and backward).
145
Data Analysis in the Earth & Environmental Sciences Chapter 10. Signal Processing
The latter allows one to construct zero-phase digital filters from filters that have a phase shift; the trick is to run the filter twice in opposite directions so that the phase shifts cancel each other.
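In Python, the analogous tools live in scipy.signal; a minimal zero-phase lowpass sketch, in which the filter order, cutoff, and test series are arbitrary choices:

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 0.02 * np.arange(500)) + 0.5 * rng.normal(size=500)

b, a = butter(4, 0.1)     # 4th-order Butterworth lowpass; cutoff = 0.1 x Nyquist
y = filtfilt(b, a, x)     # forward-backward pass: zero net phase shift
```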
Other methods
• Wiener filter (optimal if a known signal has been corrupted with
noise of known characteristics)
Note Bottom line: never use a running mean. Everything else is OK, but the choice depends on what you want to achieve.

II Interpolation

General idea

The general idea can be simply seen as "connecting the dots". Given a series of datapoints taken at different tᵢ, we ask ourselves what happened between the tᵢ's². The least squares method³ could fit a curve that minimizes the squared distance to the points, but this is not what we want. We want to go through all the points and figure out what could plausibly have happened in between. How do we do this?

² Without loss of generality, this discussion also applies to spatial contexts, though in some cases more sophisticated methods (e.g. Kriging) must be used in multiple dimensions

³ Interpolation can be described as "exact curve fitting". This is different from the case of "smoothed curve fitting", which would encompass least squares, spline fitting and others.
Linear interpolation
There are many ways to connect the dots depending on the constraints
we impose. The simplest way is linear interpolation. This form of inter-
polation falls into the category of “piecewise interpolations”.
On I1 we have
L1 (t) = a1 t + b1
[Figure: piecewise linear interpolation over intervals I₁, I₂, I₃ between nodes t₁, t₂, t₃, t₄.]
• is computationally trivial.

The con is that the interpolating function is continuous only at the 0th order, i.e. it's spiky. Generally we want smoothness up to the second order (forces symmetry) or up to the third order.
Cubic splines

The cubic splines approach is a piecewise approach, but this time each piece is a 3rd degree (i.e. cubic) polynomial:

$$S(x) = \begin{cases} s_1(t) & t \in [t_1, t_2] \\ s_2(t) & t \in [t_2, t_3] \\ \dots & \dots \\ s_{n-1}(t) & t \in [t_{n-1}, t_n] \end{cases} \qquad (10.12)$$

where

$$s_i(t) = a_i(t - t_i)^3 + b_i(t - t_i)^2 + c_i(t - t_i) + d_i$$

and we require that its first and second derivatives are continuous at the nodes.
So we have

1. $f_i = d_i$

4. We require

$$a_i = \frac{M_{i+1} - M_i}{6h}$$

while from (3.), after some tedious algebra, we get

$$c_i = \frac{f_{i+1} - f_i}{h} - \frac{(M_{i+1} + 2M_i)}{6}\, h.$$

Now all four coefficients are determined:

$$a_i = \frac{M_{i+1} - M_i}{6h}, \qquad b_i = \frac{M_i}{2}, \qquad c_i = \frac{f_{i+1} - f_i}{h} - \frac{(M_{i+1} + 2M_i)\, h}{6}, \qquad d_i = f_i.$$
In order to solve this system of equations we can put the condition (3.) in matrix form, which consists of (n − 2) rows and n columns. This means that the system is underdetermined, so we add some boundary conditions. Depending on the choice of the boundary conditions, we have three types of splines:

• natural: $M_1 = M_n = 0$,
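In SciPy, such boundary conditions map onto the bc_type argument of CubicSpline; a sketch with made-up data (only the "natural" case corresponds to the condition above):

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 1.0, 2.5, 3.0, 4.5])     # irregular sample times (made up)
f = np.sin(t)

natural = CubicSpline(t, f, bc_type="natural")   # M_1 = M_n = 0
clamped = CubicSpline(t, f, bc_type="clamped")   # end slopes forced to zero

tt = np.linspace(t[0], t[-1], 200)
y_spline = natural(tt)
y_linear = np.interp(tt, t, f)              # piecewise-linear, for comparison
```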
Lagrange interpolant
Given (n + 1) points, there’s only one polynomial that goes through all
of them. It can be written in many forms but Lagrange’s is the most
common and consists of a linear combination of basis polynomials
$$L(x) = \sum_{i=0}^{n} y_i\, l_i(x)$$
where $l_i(x) = \prod_{j=0,\, j\neq i}^{n} \frac{x - x_j}{x_i - x_j}$. Obviously $l_i(x_j) = \delta_{ij}$, so it goes through all the points $y_i$. A problem with this approach is that when the number of points gets large, so does the degree of the polynomial, and in this case it grows out of control. This is a global interpolant: one single function fits all the points, but at the price of ginormous oscillations. For this reason this interpolant should never be used in practice. It is better to use piecewise, local interpolants: these require more coefficients (hence more computations) but they are much less sensitive to outliers, since one segment is fairly insulated from remote ones.
Fourier interpolant

So far we have only considered polynomials; what if we use cos(ωᵢt), sin(ωᵢt)? These can be expressed as polynomials too (remember Euler's formula and $e^{i\omega_j k t} = \left(e^{i\omega_j t}\right)^k$). In fact, Fourier analysis can be seen as a curve fitting problem with trigonometric functions. It can also be seen as an inverse problem:

$$F_N = \frac{1}{\sqrt{N}} \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & \omega & \omega^2 & \dots & \omega^{N-1} \\ 1 & \omega^2 & \omega^4 & \dots & \omega^{2(N-1)} \\ \dots & \dots & \dots & \dots & \dots \\ 1 & \omega^{N-1} & \omega^{2(N-1)} & \dots & \omega^{(N-1)(N-1)} \end{pmatrix}$$

where $\omega = e^{-2\pi i/N}$, and $F_N$ is a unitary matrix: $F_N \bar{F}_N^\top = I$.
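A quick numerical check of this unitarity (N is arbitrary here):

```python
import numpy as np

N = 8
n = np.arange(N)
omega = np.exp(-2j * np.pi / N)
F = omega ** np.outer(n, n) / np.sqrt(N)     # unitary DFT matrix

assert np.allclose(F @ F.conj().T, np.eye(N))          # F Fbar^T = I
x = np.random.default_rng(0).normal(size=N)
assert np.allclose(F @ x, np.fft.fft(x) / np.sqrt(N))  # matches the FFT up to scaling
```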
Part III
MULTIVARIATE RELATIONSHIPS
X = ( X1 , X2 , · · · , X p )
The joint distribution is depicted by the color field and is seen to occupy the lower right quadrant of the square of possible values. Integrating such a distribution with respect to either variable leads to the two marginal distributions. For instance:

$$f_1(x_1) = \int_{\mathbb{R}} f(x_1, x_2)\, dx_2 \qquad (11.1a)$$
$$f_2(x_2) = \int_{\mathbb{R}} f(x_1, x_2)\, dx_1 \qquad (11.1b)$$

Such marginals are shown in panels a and b of Fig. 11.1. One may think of a marginal distribution as averaging out the effect of all other variables to focus on a particular one.
Conditional Distributions

If the components of a random vector are independent, then it is easy to show that their joint distribution factorizes into marginal distributions¹:

$$f(x_1, x_2, \cdots, x_p) = f_1(x_1)\, f_2(x_2) \cdots f_p(x_p) \qquad (11.2)$$

This is not the case in general, as shown in Fig. 11.1. In that case, it may be interesting to slice through the joint distribution at a particular value of either variable, obtaining what is known as a conditional distribution. In two dimensions, for instance:

$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f_2(x_2)}$$

We will return to this concept later with estimation and prediction from linear models (Chapter 15).

¹ this is in fact the definition of independence
$$R_{ij} = \rho(X_i, X_j) \qquad (11.6)$$

Since a variable is always perfectly correlated with itself, its diagonal is made of ones.

For a random matrix to qualify as a covariance (or correlation) matrix, it must be positive definite² and symmetric. These are rather strong constraints, so this is a fairly restricted class. Positive definiteness means in particular that a covariance matrix can always be inverted³.

³ This is emphatically not true of all covariance matrices estimated from observations, especially when the number of samples n is smaller than the number of variables p. This is a common difficulty in inverse problems, which we will learn to overcome in different ways in Chapter 14
The MVN owes its prominence not only to a multivariate version of the Central Limit Theorem, but also to the fact that, just like in the univariate case, it has so many convenient properties that it is often attractive to transform multivariate data to approximate normality just so that one can use the MVN. We start with a bivariate example before generalizing to p > 2.

[Figure 11.2: The Bivariate Normal Distribution – independent isotropic case. Black curves represent the marginal densities, while red dots represent random samples from this distribution, tending to cluster around the central bump.]

Easing in: The Bivariate Normal

Let (X₁, X₂) follow a bivariate normal distribution. Assume for now that the two variables are independent, so f(x₁, x₂) = f₁(x₁) f₂(x₂). That is,

$$f(x_1, x_2) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2} \times \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_2-\mu_2}{\sigma_2}\right)^2} = \frac{1}{2\pi\,\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]} \qquad (11.7)$$

This situation is illustrated in Fig. 11.2 & Fig. 11.3. It is, however, a rather restrictive situation: in general, X₁ and X₂ could be dependent.

[Figure 11.3: The Bivariate Normal Distribution – independent anisotropic case. As in Fig. 11.2, except that σ₂ > σ₁.]
General Case

Definition

$$f(x_1, x_2, \dots, x_p) = \frac{1}{(2\pi)^{p/2}\sqrt{|\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)} \qquad (11.8)$$
Mahalanobis Distance

The quantity within the exponential is −1/2 times the Mahalanobis distance between x and µ. This distance is a quadratic form in $\mathbb{R}^p$. It defines an ellipsoid with lengths $\sqrt{\Sigma_{11}}, \sqrt{\Sigma_{22}}, \cdots, \sqrt{\Sigma_{pp}}$ for the semi-major axes. You can think of it as a multivariate measure of distance (a norm) scaled by the uncertainty in each variable (Fig. 11.4). To gain intuition about the MVN, let us return to our bivariate case:

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}; \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \sigma_2^2 \end{pmatrix}$$

[Figure 11.4: Illustration of the Mahalanobis distance. In this anisotropic case, the red dot is further than the green dot by this distance, though it is closer by the usual Euclidean distance ("as the crow flies"). Put another way, the uncertainty along the x axis is larger than the uncertainty along the y axis, so the red dot appears more distant than the green dot.]

Now, using the identity Cov(X₁, X₂) = ρσ₁σ₂, we get:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

Then:

$$|\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2)$$

$$\Sigma^{-1} = \frac{1}{|\Sigma|}\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}$$

So

$$(x-\mu)^\top \Sigma^{-1} (x-\mu) = \frac{1}{1-\rho^2}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]$$

Reasoning in terms of standardized variables $z_i = \frac{x_i - \mu_i}{\sigma_i}$, we get:

$$(x-\mu)^\top \Sigma^{-1} (x-\mu) = \frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{1-\rho^2}$$
First note that, after all these matrix operations are said and done, we are left with a scalar – a single number. Second, notice that the numerator looks an awful lot like a² − 2ab + b² = (a − b)², hence the name quadratic form (this would hold true in higher dimensions as well). It is a normalized distance between x and µ (or between z and 0). Now, if the variables were independent⁴, then ρ = 0 and the covariance matrix would be diagonal:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \qquad (11.9)$$

In this case, it is easy to see how one would fall back on Eq. (11.7). The resulting distance measure is called a normalized Euclidean distance: in p dimensions,

$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{p} \left(\frac{x_i - y_i}{\sigma_i}\right)^2} \qquad (11.10)$$

⁴ which we denote X₁ ⊥⊥ X₂
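Numerically, the Mahalanobis distance is a one-liner once Σ⁻¹ is in hand; a sketch with a made-up covariance matrix:

```python
import numpy as np
from scipy.stats import chi2

mu = np.zeros(2)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])                    # made-up covariance
X = np.random.default_rng(42).multivariate_normal(mu, Sigma, size=500)

Sinv = np.linalg.inv(Sigma)
d2 = np.einsum("ij,jk,ik->i", X - mu, Sinv, X - mu)   # squared Mahalanobis distance
# For MVN data, d2 follows a chi-squared law with p = 2 degrees of freedom,
# so about 95% of points should fall below this threshold:
print(np.mean(d2 < chi2.ppf(0.95, df=2)))
```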
Dependencies

[Figure: bivariate normal density surface f(x₁, x₂), dependent case.]
Properties

The MVN has four wonderful properties:

Subsets

All subsets of variables from an MVN also follow an MVN. For instance, if we split (X₁, X₂, ..., X_p) into X₁ = (X₁, X₂, ..., X_q) and X₂ = (X_{q+1}, X_{q+2}, ..., X_p), then X₁ follows a q-variate normal distribution, X₂ a (p − q)-variate normal distribution, with parameters:

$$\mu_1 = (\mu_1, \mu_2, \dots, \mu_q) \qquad (11.11a)$$
$$\mu_2 = (\mu_{q+1}, \mu_{q+2}, \dots, \mu_p) \qquad (11.11b)$$

[Figure 11.7: Trivariate normal density, anisotropic dependent case]
Independence
You already know that independence implies zero covariance:
Xi ⊥⊥ X j ⇒ Σij = 0 (11.13)
In the MVN world, a magical thing happens: the reverse is also true!
This is because when off-diagonal elements of Σ are zero, the expo-
nential term factorizes cleanly into distinct factors, as in Eq. (11.7).
This may seem trivial, but it is a very special property that makes life
incommensurately easier.
Linear combinations
Any linear combination of a multivariate normal distribution is a
multivariate normal distribution as well, and the means and vari-
ances are linearly related. Specifically, if Y = B> X + A ∼ N (µY , ΣY )
then µY = B> µ X + A and ΣY = B> Σ X B.
Conditional Distributions

If X = (X₁, X₂) is MVN, the distribution of X₁ conditional on X₂ = x₂ is also multivariate normal, with

$$\mu_{1|x_2} = \mu_1 + \Sigma_{1,2}\,\Sigma_{2,2}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1,1|x_2} = \Sigma_{1,1} - \Sigma_{1,2}\,\Sigma_{2,2}^{-1}\,\Sigma_{2,1}$$

The latter expression does not depend on the value x₂, just on the covariance submatrices. If X₁ ⊥⊥ X₂, Σ₁,₂ = Σ₂,₁ = 0 and the equations reduce to µ₁|x₂ = µ₁ and Σ₁,₁|x₂ = Σ₁,₁, which is another way of saying that no information is provided by knowledge of X₂.
$$\mu_{1|x_2} = \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2) \qquad (11.16a)$$
$$\sigma_{1|x_2} = \sigma_1\sqrt{1 - \rho^2} \qquad (11.16b)$$
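Eq. (11.16) translates directly into code; a sketch with made-up parameters:

```python
import numpy as np

mu1, mu2 = 1.0, -0.5          # means (made up)
s1, s2, rho = 2.0, 1.0, 0.7   # standard deviations and correlation (made up)

def conditional_x1(x2):
    """Mean and std of X1 | X2 = x2 for a bivariate normal (Eq. 11.16)."""
    m = mu1 + rho * (s1 / s2) * (x2 - mu2)
    s = s1 * np.sqrt(1.0 - rho ** 2)
    return m, s

print(conditional_x1(0.0))    # knowing x2 shifts the mean and shrinks the spread
```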
Covariance Estimation
In real life, we get a random matrix X ∈ Mn× p (R) built out of n obser-
vations (arranged in rows) of p variables (arranged in different columns).
Each column represents a Gaussian random variable. We can define the
sample mean of each column
$$\hat{\mu}_j = \frac{1}{n}\sum_{k=1}^{n} X_{kj} = \overline{x}_j \qquad (11.17)$$

and the sample covariance matrix

$$\hat{\Sigma}_{ij} = \frac{1}{n-1}\sum_{k=1}^{n} \left(x_{ki} - \overline{x}_i\right)\left(x_{kj} - \overline{x}_j\right). \qquad (11.18)$$

We can define centered variables as

$$x'_{:j} = x_{:j} - \overline{x}_j \qquad (11.19)$$

and rewrite the sample covariance matrix as

$$\hat{\Sigma} = S_X = \frac{1}{n-1}\, X'^\top X' \qquad (11.20)$$

so the sample covariance is a scaled inner product of the centered data matrix X′. It is exactly analogous to the univariate case of the sample variance:

$$s_x^2 = \frac{1}{n-1}(x - \overline{x})^\top(x - \overline{x}) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^2. \qquad (11.21)$$
These estimators are MLE, hence optimal (consistent, efficient). In fact, since they uniquely characterize the distribution, they are called sufficient statistics of the MVN. However, as we shall see, things can go sour when p > n – the sample covariance matrix is no longer positive definite, and will have to be regularized in order to work properly. In fact, this may even be the case for n ≳ p (that is, for sample sizes not much larger than the number of parameters to be estimated, i.e. the number of random vectors).
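Eqs. (11.17)–(11.20) in code, cross-checked against NumPy's built-in estimator (the data matrix is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))          # n observations (rows) of p variables

Xc = X - X.mean(axis=0)              # centered columns, Eq. (11.19)
S = Xc.T @ Xc / (n - 1)              # sample covariance, Eq. (11.20)
assert np.allclose(S, np.cov(X, rowvar=False))
```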
Chapter 12

PRINCIPAL COMPONENT ANALYSIS

$$S_X = E \Lambda E^\top \qquad (12.2)$$

which is possible because $S_X$ is real and symmetric. Moreover, the eigenvalues will all be non-negative. The matrix $\Lambda$ has the form

$$\Lambda = \begin{pmatrix} \Lambda_1 & 0 & \dots & 0 \\ 0 & \Lambda_2 & \dots & 0 \\ \vdots & & \ddots & 0 \\ 0 & 0 & 0 & \Lambda_M \end{pmatrix} \qquad (12.3)$$

where $M = \min(p, n)$. Assuming that $p \le n$, $E$ is given by

$$E = \begin{pmatrix} e_1 & e_2 & \dots & e_p \end{pmatrix} \qquad (12.4)$$

[Figure 12.1: Example of data matrix]

[Figure 12.2: Eigenvalue spectrum, along with estimates of uncertainty (12.14); vertical bars show approximate 95% confidence limits.]

The temporal signature associated with mode i is given by the projection of the data onto the ith EOF, i.e. the principal component. The two formulations are equivalent, but SVD tends to be a more efficient algorithm than eigendecomposition (Golub and van Loan, 1993).
Orthogonality

The matrix U of left singular vectors therefore gathers the principal components of the field. From the definition of the singular value decomposition, $E^\top E = I_p$ and

$$S_U = \mathrm{Var}(E^\top \tilde{X}) = E^\top \mathrm{Var}(\tilde{X})\, E = E^\top S_X E = \Lambda. \qquad (12.7)$$

We can see that space and time have been separated; each mode is weighted by the square root of its contribution to the total variance. Thanks to orthogonality, we readily obtain the fraction of variance associated with each pattern:

$$F_i = \frac{\lambda_i}{\sum_{i=1}^{M} \lambda_i} = \frac{\sigma_i^2}{\sum_{i=1}^{M} \sigma_i^2} \qquad (12.9)$$
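The whole pipeline fits in a few lines of NumPy; this sketch (on a synthetic data matrix) extracts PCs, EOFs and the variance fractions of Eq. (12.9), and cross-checks the SVD route against the eigendecomposition of Eq. (12.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic, correlated field

Xc = X - X.mean(axis=0)                      # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
PCs, EOFs = U * s, Vt.T                      # temporal signatures, spatial patterns
lam = s ** 2 / (n - 1)                       # eigenvalues of the sample covariance
frac = lam / lam.sum()                       # Eq. (12.9)

evals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (n - 1)))[::-1]
assert np.allclose(lam, evals)               # both routes agree
```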
Interpretations

Directions Principal component analysis finds the main directions of variability, which can be identified with the main axes of an ellipsoid. In the case p = 2 we have an ellipse (Fig. 12.3),

$$H^2 = \left(\frac{x_1}{\sigma_1}\right)^2 + \left(\frac{x_2}{\sigma_2}\right)^2 \qquad (12.10)$$

When p = 3 the surface defines an ellipsoid (Fig. 12.4), while in the case p > 3 we have a hyper-ellipsoid (which our perceptual system cannot fathom without projection).

[Figure 12.3: PCA aims to identify the main axes of an ellipse from noisy data]

Decomposition on orthogonal functions These functions are like the sines and cosines found in the Fourier transform, but are dictated by the data, so they are empirical orthogonal functions¹. The following terminology is used in the atmospheric/oceanic sciences:

$$U_i(t) = PC_i(t)$$
$$E_i(s) = EOF_i(s) \qquad (12.11)$$

¹ The jargon of PCA is unnecessarily murky, due in no small part to climatologists taking the work of Lorenz (1956) too literally. Lorenz pioneered the use of PCA in climate research, under the name empirical orthogonal functions, or EOF, analysis. The name has stuck, but it makes conversations with a statistician needlessly complicated, because to them the PCs are the right singular vectors, not the left singular vectors. Know your audience!
[Figure 12.4: an ellipsoid in three dimensions (axes x, y, z), seen from several angles.]
PC compression

Because PCA is fundamentally a singular value decomposition, it allows one to re-express the field as a sum of rank-one matrices²:

$$\tilde{X}_K = \sum_{i=1}^{K} \sigma_i\, u_i\, e_i^\top \qquad (12.12)$$

For fields on a sphere, each grid point is commonly weighted by the square root of the cosine of its latitude to account for the area of each grid box, thus preserving the energy (variance) of the field on the sphere.

² This is a special case of the Eckart-Young-Mirsky theorem (Appendix D)
Dimensionality
If n < p then SX is singular, so its inverse is undefined. This situation
may be overcome via sparse PCA (e.g. Zou et al., 2006; Johnstone and Lu,
2009).
Truncation
The choice of truncation determines the amount of compression achieved.
Heuristically, we may choose the first K patterns that account for a siz-
able portion of the variance, e.g.
$$\tilde{r}_K = \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{M} \lambda_i} \simeq 0.7 \text{ or } 0.9 \qquad (12.13)$$
Other criteria exist, in particular Kaiser’s (retain all modes that account
for more variance than the average eigenvalue) or more elaborate rules
that hunt for breaks in the eigenvalue spectrum using decision theoretic
criteria (Wax and Kailath, 1985; Nadakuditi and Edelman, 2008).
Separability
The North et al. (1982) rule of thumb assesses the separation between
two successive eigenvalues as
r
2
δλi = λi (12.14)
n−1
which is related to the uncertainty about λi . Indeed, a 95% confidence
interval for λi under the null hypothesis of an correlated random field
(i.e. white noise) is !
r
2
λi 1 ± (12.15)
n−1
Hence, the smaller λi , the more difficult it is to separate it from its
neighbors, and from noise.
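A small helper implementing this rule of thumb, flagging eigenvalues whose error bars do not overlap those of their neighbors (the overlap criterion is one common reading of the rule):

```python
import numpy as np

def north_separated(lam, n):
    """Flag eigenvalues deemed separable by the North et al. (1982) rule."""
    lam = np.asarray(lam, dtype=float)       # sorted in decreasing order
    err = lam * np.sqrt(2.0 / (n - 1))       # Eq. (12.14)
    lo, hi = lam - err, lam + err
    overlap = lo[:-1] <= hi[1:]              # does mode i overlap mode i+1?
    sep = np.ones(lam.size, dtype=bool)
    sep[:-1] &= ~overlap                     # distinct from the next mode...
    sep[1:] &= ~overlap                      # ...and from the previous one
    return sep
```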
Significance
Not all modes are statistically significant. Fewer still are physically mean-
ingful. To find the meaningful ones we can use “Preisendorfer’s Rule
N” (Overland and Preisendorfer, 1982), which tests against the hypothe-
sis that the eigenvalues λi arise from a random Gaussian process; any
λi ≥ λthreshold is retained. A more sophisticated test is against multi-
variate red noise, as implemented by Dommenget (2007).
If the sample size is small, one true physical “mode” may be spread
over several of the statistical modes that we call principal components.
To some extent rotation can correct for this, but it brings challenges of
its own (Lab 9).
III Geoscientific uses of PCA

The use of principal component analysis in many fields of applied research, including the geosciences, runs deep. Here we give but three examples of its use in climate research, but examples abound in paleobiology, mineralogy, and others. Following the usual jargon of the atmospheric sciences, EOF and PCA will be used interchangeably.

ENSO dynamics

In a recent study, Takahashi et al. (2011) used PCA to identify a state space in which to describe the evolution of the El Niño-Southern Oscillation phenomenon. They performed PCA on monthly sea-surface temperature (SST) data and found that the first two PCs combined account for most of the variance (68% and 14%, respectively) in the domain. The associated spatial patterns (EOFs) are shown in Fig. 12.5.

[Figure 12.5: EOF patterns (°C, shading) of the 1870–2010 HadISST sea surface temperature anomalies associated with (a) PC1, (b) PC2. The percentage of explained variance is contoured (the interval is 20% (10%) below (above) 60%). From Takahashi et al. (2011).]

They used this as a new coordinate system, enabling the identification of two ENSO regimes: the regime of extraordinary warm events (the 1982–83, 1997–98 and 1877–78 events) and the regime that includes the cold, neutral, and moderately warm years (Fig. 12.6). This PCA-based decomposition has durably shifted the characterization of ENSO events, and exemplifies how PCA can be used to generate dynamical insight.

[Figure 12.6: Evolution of PC1 and PC2 from May (indicated with circles) to the following January (crosses, corresponding year shown) for El Niño events: (a) events considered by Rasmusson and Carpenter (1982) (their composite is shown thicker); (b) extraordinary events; (c) central Pacific events (Kug et al., 2009) including the recent 2009–10 event (Lee and McPhaden, 2010); (d) other moderate events since 1950 according to NOAA, corresponding to DJF Oceanic Niño Index ≥ 1°C (see http://www.cpc.noaa.gov/products/analysis_monitoring/ensostuff/ensoyears.shtml for details). From Takahashi et al. (2011).]
[Figure 12.7: First EOF mode of the coral δ18O network (panel title: "Coral δ18O network, 22 sites, EOF 1 – 18% variance"). (top) EOF coefficients (blue < 0, red > 0) overlain on a map of HadSST2 DJF temperature regressed onto the first principal component; the center of each dot is collocated with the corresponding record. (bottom) The mode's temporal evolution (1900–1980, left) and spectrum (frequency in cpy, right).]
Comboul et al. (2014) considered a 5% chance of miscounting each year (i.e. every 100 years, one expects ±5y offsets due to age errors). They performed a "Monte Carlo EOF analysis" (Anchukaitis and Tierney, 2013) on this time-uncertain ensemble and used it to compute the relevant statistics.
[Figure 12.8: Spatiotemporal uncertainty quantification on a pseudocoral network. (a) EOF loadings (circles) corresponding to the ENSO mode of an ensemble of age-perturbed pseudocoral records with miscounting rate θ = 0.05. EOF loadings for error-free data are shown in light colors circled in white, while the median and 95% quantile are shown by dark disks and black-circled disks, respectively. Contours depict the SST field associated with the mode's principal component PC (panel b), whose power spectrum is shown in (c). Results for the time-uncertain ensemble are shown in blue: median (solid line), 95% confidence interval (light-filled area) and interquartile range [25%-75%] (dark-filled area). Results for the original (error-free) dataset are depicted by solid red lines. Dashed red lines denote χ² error estimates for the MTM spectrum of the error-free dataset. From Comboul et al. (2014)]
The results, shown in Fig. 12.8, indicate that time uncertainty may greatly alter the spatial expression of interannual variability. Age errors also result in a transfer of power from high frequencies to low frequencies, which suggests that age uncertainties are a plausible cause of the observed enhanced decadal variability in coral networks (Ault et al., 2009).
[Figure: Mode 1. a) SST pattern (K) (left singular vector); b) Geopotential height field Z250 (m) (right singular vector); c) Standardized expansion coefficients (normalized). From Emile-Geay (2006, Chap 5)]
Further reading
For a more thorough introduction to PCA and EOF analysis, the reader
is referred to Hannachi et al. (2007) and Wilks (2011, chap 12). For an
introduction to maximum covariance analysis and canonical correlation
analysis, read Bretherton et al. (1992) and Wilks (2011, chap 13).
Chapter 13
LEAST SQUARES
$$z(t) = \frac{1}{2} g t^2 + v_0 t + z_0. \qquad (13.1)$$

Measuring (zᵢ, tᵢ) many times should give us the values of (g, v₀, z₀). To simplify the problem, assume that initial speed and position are 0: (v₀, z₀) = (0, 0). So we would have

$$\begin{cases} z_1 = \frac{1}{2} g t_1^2 \\ z_2 = \frac{1}{2} g t_2^2 \\ \dots \\ z_n = \frac{1}{2} g t_n^2 \end{cases} \qquad (13.2)$$

and we expect that with enough measurements we can accurately estimate the gravitational acceleration g.

Very often one casts this type of problem (parameter estimation) as fitting a line through a cloud of points. It's often easier to reparameterize the problem so that lines are straight (in this case, working with the variable t² as opposed to t). The goal of the famed least squares method is to find the straight line that best fits the data.

[Figure 13.1: Example of data for a free falling body: zᵢ versus tᵢ is a parabola, but zᵢ versus tᵢ² is a straight line with slope g.]

I Ordinary Least Squares

$$y_i = \beta_0 + \beta_1 x_i + e_i = \hat{y}_i + e_i \qquad (13.3)$$
$$y = X\beta + e \qquad (13.4)$$

Thus,

$$X^\top y = (X^\top X)\beta \qquad (13.11)$$

This is called a normal equation. The matrix $(X^\top X)$ is square, real and symmetric, therefore it is positive semi-definite. That is, provided n > p, $(X^\top X)$ always has an inverse (it has no zero eigenvalues)². Therefore the solution is:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y \qquad (13.12)$$

² What if its eigenvalues are close to, but not exactly equal to, zero, you ask? We will get to that in a bit.
Estimation of coefficients

In the previous case, where the design matrix only involves the xᵢ's, we have

$$X^\top X = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}$$

For the slope, this yields

$$\beta_1 = \frac{\widehat{\mathrm{Cov}}(x, y)}{s_x^2} = \hat{\rho}_{xy}\, \frac{s_y}{s_x}. \qquad (13.13)$$

[Figure 13.2: Example of linear fit]

So in this simple case, the least squares slope is proportional to the sample linear correlation coefficient, scaled by the ratio of sample standard deviations. Since the correlation coefficient is unitless, this factor ensures that the units of y get correctly mapped to the units of x³. For the intercept we have

$$\beta_0 = \overline{y} - \beta_1 \overline{x}, \qquad (13.14)$$

which depends on the slope. If any outliers or noise bias the estimate of the slope, this will affect the estimate of the intercept too.

³ One useful cross-check that you did the math right, thus, is that β₁ should be in units of y divided by units of x
II Geometric interpretation
We just obtained the OLS solution by straightforward algebra after for-
mulating a minimization principle. There is more to it. What we were
really doing is finding an approximate solution to the problem
$$Ax \approx b$$

$$Ax = b_p + e$$

$$A^\top (b - Ax) = 0 \qquad (13.15)$$

Thus:

$$A^\top A x = A^\top b$$
If you substitute A, x and b, for their definitions, you get back the
OLS normal equation (Eq. (13.11)). This intuition is purely geometric:
it’s about finding the best approximation to the data vector y in the space
spanned by the design matrix X.
$$-\log L = \frac{1}{2}\sum_{i=1}^{n} e_i^2 = \chi^2$$

which is called the misfit function. Minimizing the errors (Eq. (13.9)) is equivalent, in the Gaussian⁴ context, to maximizing the likelihood. Given the
data, these are the most likely parameters to fit a straight line through the cloud of points. Furthermore, the OLS estimate is now imbued with all the privileges that come with ML status: it's consistent, and it's got the lowest MSE.

$$\mu_0 = \beta_0, \qquad \sigma_0 = s_e \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}} \sqrt{\frac{1}{\sum_{i=1}^{n}(x_i - \overline{x})^2}}$$

with $s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2$ (the MSE).

$$\mu_1 = \beta_1, \qquad \sigma_1 = \frac{s_e}{\sqrt{\sum_{i=1}^{n}(x_i - \overline{x})^2}}.$$

Estimates are unbiased and precision depends on the MSE, as well as the variance of x. The more variable x, the better. So, the experiment should be designed in order to cover a broad dynamic range (Fig. 13.3).

[Figure 13.3: Example of linear fit for different spreads in x: a broad range of x values resolves β₁ well.]
Relationship

$$r_{\beta_0, \beta_1} = \frac{-\overline{x}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}} \qquad (13.17)$$
Trivial case

Imagine that we get no measurements xᵢ, so X = 1ₙ:

$$X\beta = y \;\Leftrightarrow\; \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}. \qquad (13.18)$$

The least squares estimate should yield the same parameter n times. Indeed, $X^\top X = n$ and $(X^\top X)^{-1} = 1/n$, so

$$(X^\top X)^{-1} X^\top y = \frac{1}{n}\sum_i y_i = \overline{y} \qquad (13.19)$$
$$x' = \Lambda^{-1/2} V^\top x$$

then

$$C_{x'} = I_p$$

If

$$C_e^{-1} = \begin{pmatrix} 1/\sigma_1^2 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1/\sigma_n^2 \end{pmatrix}$$
yi = β 0 + β 1 xi
e
y = Xβ
• Fourier series,
Chapter 14

DISCRETE INVERSE THEORY

[Figure: equidetermined versus underdetermined problems]

$$G m_0 = 0$$
$$G m_s = d$$
$$G = U \Sigma V^\top$$
$$m = G^{-1} d.$$

$$G^\dagger = V \Sigma^\dagger U^\top$$
where $\Sigma^\dagger$ is given by

$$\Sigma^\dagger = \begin{pmatrix} 1/\sigma_1 & 0 & \dots & 0 & \dots & 0 \\ 0 & 1/\sigma_2 & \dots & 0 & \dots & 0 \\ \dots & \dots & \dots & \dots & \dots & 0 \\ 0 & 0 & 0 & 1/\sigma_n & \dots & 0 \\ \dots & \dots & \dots & \dots & \dots & 0 \\ 0 & 0 & 0 & 0 & \dots & 0 \end{pmatrix}.$$
$$m_{TSVD}^{(k)} = \sum_{i=1}^{k} \frac{u^{(i)} \cdot d}{\sigma_i}\; v^{(i)}.$$
$$\mathrm{Cov}(m_i, m_k) = \sum_{j=1}^{p} \frac{V_{ij}\, V_{kj}}{\sigma_j^2} \qquad (14.2)$$
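A compact sketch of the truncated-SVD solution (the truncation level k is left to the user):

```python
import numpy as np

def tsvd_solve(G, d, k):
    """Truncated-SVD solution keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    coef = (U[:, :k].T @ d) / s[:k]      # (u_i . d) / sigma_i
    return Vt[:k].T @ coef               # sum of the first k terms
```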
Model stability
Model Resolution

Consider Gm = d. In order to evaluate how well G resolves m, one could input a synthetic (i.e. made-up) model m₀, yielding the synthetic observations d₀ = Gm₀. Solve the inverse problem

$$\tilde{m} = G^\dagger d_0 = \left(G^\top G\right)^{-1} G^\top G\, m_0$$

where $\left(G^\top G\right)^{-1} G^\top G$ is called the resolution matrix R. For perfect resolution R = I; in reality, $R = G^\dagger G$ is a "blurry diagonal" matrix. This tells which models are well resolved by the observations, and which are not. One can also test the inversion with a spike function², which is given by

$$m_{spike} = \begin{pmatrix} 0 & \dots & 0 & \underset{i}{1} & 0 & \dots & 0 \end{pmatrix}$$

² The numerical equivalent of a Dirac delta function

Data resolution

How well does a model fit the data?

$$\hat{m} = G^\dagger d \;\Rightarrow\; \hat{d} = G\hat{m} = \underbrace{G G^\dagger}_{N}\, d.$$
Note Both R and N are independent of the data d and only depend on G. This
can be taken into account for the experimental design and modeling assump-
tions, so as to meet scientific goals.
The Tikhonov solution may be expressed similarly using the SVD formulation

$$m_{tikh}^{(\alpha)} = \sum_{i=1}^{p} f_i^{(\alpha)}\, \frac{u^{(i)} \cdot d}{\sigma_i}\; v^{(i)}$$
This solution keeps the large singular values almost intact while it damps the small ones to zero. Tikhonov solutions push all eigenvalues away from zero by α², which removes the matrix singularity. This comes at the cost of damping some of the modes with σᵢ ≈ 0, so it represents a trade-off between misfit and complexity. The solution is always smoother than the true model (a common feature of "ℓ₂ methods" – those that seek to minimize the ℓ₂ norm). ℓ₁ methods are generally free of such problems, but those are mathematically more complex: the optimization problem can no longer be solved analytically, even in simple cases. Nowadays, there exist well-established convex optimization algorithms (e.g. simplex) that can efficiently solve them, but they are still less prevalent than ℓ₂ methods. This could be because people actually like smooth solutions, though Nature is rarely that smooth.

[Figure 14.4: TSVD and Tikhonov filter factors]
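The filter-factor form of the Tikhonov solution is equally compact; a sketch (α is the regularization parameter, to be chosen e.g. by cross-validation):

```python
import numpy as np

def tikhonov_solve(G, d, alpha):
    """Tikhonov solution via SVD filter factors f_i = s_i^2/(s_i^2 + alpha^2)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    f = s ** 2 / (s ** 2 + alpha ** 2)   # ~1 for large s_i, ~0 for small ones
    return Vt.T @ (f * (U.T @ d) / s)
```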
Chapter 15
LINEAR REGRESSION
In the previous chapter we saw how to fit curves through data using
various techniques, all involving an exploration of the null space of a
matrix. Linear regression is closely related, but the emphasis is slightly
different. For instance, the goal may not solely be to find a best fit, but
to make predictions and quantify uncertainties about these predictions
(statistical forecasting). Thus, while the mathematics are very similar
(least squares will pop up again), the spirit is much more akin to Part I,
rooted in probability theory.
I Regression basics
Least Squares Solution
Regression seeks to estimate a variable Y from a predictor X, given a
deterministic link function f . In general, one writes
$$Y = f(X) + e \qquad (15.1)$$
$$Y = E(Y \mid X) + e \qquad (15.2)$$
$$E(Y \mid X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j \qquad (15.3)$$
$$e \sim N(0, \sigma^2) \qquad (15.4)$$
The 1D case (p = 1) is just the ordinary least squares we saw in Eq. (13.4) and Eq. (13.12). The idea is to minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 = \sum_i e_i^2 = e^\top e, \qquad (15.5)$$

yielding the familiar solution

$$\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}. \qquad (15.7)$$

[Figure: sketch of a linear fit ŷ = β₀ + β₁x, with β₁ = ∆y/∆x.]
Analysis of Variance

The error associated with the linear estimate ŷᵢ = β₀ + β₁xᵢ is eᵢ = yᵢ − ŷ(xᵢ)¹. So the total error is given by the mean squared error (MSE):

$$S_e^2 = \frac{1}{n-2}\sum_i e_i^2 = \frac{1}{n-2}\sum_i \left[y_i - \hat{y}(x_i)\right]^2 = \frac{1}{n-2}\sum_i \left[(y_i - \bar{y}) + (\bar{y} - \hat{y}(x_i))\right]^2$$

$$S_e^2 = \frac{1}{n-2}\left\{SST - SSR\right\} = \frac{1}{n-2}\left\{SSE\right\}$$

¹ also known as "residual"
Prediction intervals

Very often we fit models through data so we can make predictions from them. It is common to call x the predictor and y the predictand², and the idea is to use the relation at a point x_new that has not been used in fitting the regression model y = β₀1ₙ + β₁x, and use it to predict a new value of the predictand, y_new. What error would we make?

To see this, note again that linear regression expresses the conditional distribution of y given x, as:

$$y \mid x_i \sim N\left(\beta_0 + \beta_1 x_i,\; S_e^2\right) \qquad (15.9)$$

² "that which we want to predict"
• the third term comes from the uncertainty in the estimate of the slope, and it grows as we move away from (x̄, ȳ), the centroid of the dataset. Designing an experiment with a large range in x will mitigate that.

Serially correlated residuals reduce the effective sample size and inflate the variance, hence the expected errors, by a factor (1 + φ)/(1 − φ), with φ the lag-1 autocorrelation.

[Figure 15.3: Prediction intervals. Note the parabolic shape of the prediction intervals, which widen away from the centroid of the dataset (x̄, ȳ)]
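With statsmodels, prediction intervals come almost for free; a sketch on synthetic data (names and settings are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 40)       # synthetic data

res = sm.OLS(y, sm.add_constant(x)).fit()
xnew = sm.add_constant(np.linspace(0, 10, 50))
frame = res.get_prediction(xnew).summary_frame(alpha=0.05)
# the 'obs_ci_lower'/'obs_ci_upper' columns hold the prediction interval,
# which widens away from the centroid (xbar, ybar) as described above
```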
III Model checking

Thanks to the wonders of modern programming, fitting a regression model is now so unbelievably trivial³ that a monkey could do it. However, the result is of little value unless one can confirm that a few basic assumptions are met. The most important one is that the residuals eᵢ need to be IID normal, since this was the basis for applying least squares (maximum likelihood estimation under a normal model) in the first place. The onus is on you to convince your readers that your statistical model actually fits the data. If not, there are plenty of more complex modeling options to go by.

³ Matlab: fitlm(); Python: scikit-learn or statsmodels; R: lm()
Regression residuals
Inspecting residuals is absolutely critical. Important things to check for:
ANOVA table
Another important aspect of model checking is to parse the various
components of variance, via an ANOVA table. There are three main
measures of the quality of the fit:
Mean Squared Error: $MSE = \frac{1}{n-2}\sum_i e_i^2$; it should be as small as possible.

Coefficient of determination R²
Source df SS MS F
Total n−1 SST
Regression 1 SSR MSR = SSR/1 MSR/MSE
Residual n−2 SSE MSE = Se2
or Z-score

$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \qquad (15.14)$$
where v j is the jth diagonal element of ( X > X )−1 . Under the null hy-
pothesis that β j = 0, z j is distributed as t N − p−1 (a t distribution with
N − p − 1 degrees of freedom), and hence a large (absolute) value of
z j will lead to rejection of this null hypothesis. If σ̂ were replaced by a
known value σ, then z j would have a standard normal distribution. The
difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases, and so normal quantiles are often quite a good approximation (Hastie et al., 2008).
Another way to see this is to report an approximate 95% CI for β j ,
[ β̂ j ± 2 · s.e.( β̂ j )]. If this interval excludes zero, then the effect is signif-
icant, provided the predictors X1 , · · · X p are independent. When they
are not, one must be far more careful, and control for how one variable
may speak through another (see "Regularized regression”).
The Multivariate ANOVA is
Source df SS MS F
Total n−1 SST
Regression k SSR MSR = SSR/k MSR/MSE
Residual n−k−1 SSE MSE = SSE/(n − k − 1)
An example: fitting the Keeling curve

Let us try to fit the famous Keeling Curve, sketched out in Fig. 15.7, using several predictors. First note that the curve displays some regular oscillations superimposed on a roughly exponential trend. At first, let's see how well we would do with a simple linear fit, and then add complexity. Our primary variable is time t, expressed in months since Jan 1, 1958.

Linear fit [CO₂] = β₀ + β₁t; the least squares solution yields β₀ = 308.6, β₁ = 0.12, and R² = 0.977. Despite this glorious statistic, this is obviously a poor fit; in particular, we are missing the curvature, which is a first order feature of the dataset. Prediction intervals are of order 6.6 ppm.

[Figure 15.7: Keeling Curve: CO₂ measurements at Mauna Loa observatory, Hawaii, from March 1958 to May 2010.]
One lesson here is that we had to include not only t but also non-
linear functions of t; these we call “derived predictor variables”, since
they are transformed versions of the original predictor. The result is
non-linear in t, but linear in β so we still have a linear regression. Also,
note that predictors themselves need not be Gaussian as long as the
residuals are Gaussian.
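In code, such derived predictors are just extra columns of the design matrix; a sketch with a synthetic stand-in for the CO₂ series (the real data are not reproduced here, and the stand-in coefficients are arbitrary):

```python
import numpy as np

t = np.arange(624.0)          # months since Jan 1958 (through ~2010)
co2 = 315 + 0.07 * t + 1e-4 * t**2 + 3 * np.sin(2 * np.pi * t / 12)  # stand-in series

X = np.column_stack([np.ones_like(t), t, t**2,
                     np.cos(2 * np.pi * t / 12), np.sin(2 * np.pi * t / 12)])
beta, *_ = np.linalg.lstsq(X, co2, rcond=None)   # linear in beta, nonlinear in t
resid = co2 - X @ beta
R2 = 1 - (resid ** 2).sum() / ((co2 - co2.mean()) ** 2).sum()
```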
Overfitting
One peculiar feature of multiple linear regression models is that one can
always add variables to a regression model, e.g.
$$[\mathrm{CO}_2] = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \cos\left(\frac{2\pi t}{12}\right) + \beta_4 \sin\left(\frac{2\pi t}{12}\right) + \beta_5 \times \text{IBM stock} + \beta_6 \times \text{(Monkey population in Bhutan)} + \beta_7 \times \text{(SAT scores at USC)}.$$
• use some "IC", like AIC or BIC, to select the appropriate predictors.
Regularized regression
Often predictor variables are collinear (e.g. t and t²); one variable says
something about another. In that case X > X will be singular: some of its
eigenvalues will be zero, or numerically very close to zero. As a result,
the central quantity ( X > X )−1 may be unbounded. To overcome this
problem, the covariance matrix needs to be regularized, which amounts
to filtering out small eigenvalues. There are several ways to do it:
Data Analysis in the Earth & Environmental Sciences Chapter 15. Linear Regression
the solution favors small coefficients. It turns out that this lowers the
effective number of parameters to estimate. Because it keeps coeffi-
cients in check, ridge regression is variously called "biased regres-
sion", or a “shrinkage method" 7 . “Bias” is usually something to 7
for large values of the ridge parameter,
avoid, but in this case we trade a little bias for a lot of variance, so β shrinks towards zero
even though coefficients are biased low, the total MSE may be sub-
stantially lowered. The optimal ridge parameter may be identified
via generalized cross-validation (Golub et al., 1979), with some com-
plications (Wahba and Wang, 1995).
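With scikit-learn, the ridge parameter can be chosen by cross-validation in a few lines; a sketch on deliberately collinear synthetic predictors:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
t = rng.uniform(0, 10, 100)
X = np.column_stack([t, t**2, t**3])     # collinear derived predictors
y = 1.0 + 0.5 * t + rng.normal(0, 1, 100)

model = RidgeCV(alphas=np.logspace(-4, 4, 50)).fit(X, y)
print(model.alpha_, model.coef_)         # selected ridge parameter, shrunken coefficients
```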
y | { β, σ2 , X } ∼ N ( Xβ, σ2 In ) (15.15)
p( β, σ2 | X ) ∝ σ−2 (15.16)
which is just the ordinary least squares solution. With different priors one gets different results, of course. The ridge regression solution may be obtained via a Gaussian prior on the regression coefficients; the LASSO solution via a Laplace prior on the coefficients.
The advantages of the Bayesian viewpoint are as usual: the infer-
ence is model-based, everything has a well-defined distribution, the
assumptions are transparent, and the model can be made as complex
as one wants and still follow the basic desiderata of probability theory
as logic (Chap 2). Thus, while in elementary cases (and given suitable
choices) the Bayesian solution may yield the good old OLS solution, the
framework offers much more flexibility when complex (e.g. hierarchi-
cal) models are warranted (see, Gelman et al., 2013, Chap. 14, for an in-
depth treatment), or when less friendly distributions are encountered.
Chapter 16
Like many of the students who take this class, you may have started
this book with little mathematical background and a thirst to better un-
derstand the Earth. Or you may have wandered here through the inter-
tubes, trying to get a little understanding of how to analyze data beyond
the shallow teachings of the "Data Science" fad.
We hope that the journey so far has lifted the curtain on what lies be-
hind data analysis, and that you are now a more sophisticated consumer
of data analysis methods.
In future editions of this book, we would like to bring a few improve-
ments:
• Fix all the pesky typos lacing this book (the more we fix, the more we
find)
• Fix figures so that they appear correctly in all PDF readers. [in progress]
Mathematical Foundations
Appendix A
CALCULUS REVIEW
I Differential Calculus
Derivative
In calculus, we are interested in the change or dependence of some quan-
tity, e.g. u, on small changes in some variable t ∈ R. If u has value u0
at t0 and changes to u0 + δu when t changes to t0 + δt, the incremental
change can be written as
$$\delta u = \left.\frac{\delta u}{\delta t}\right|_{t_0}\, \delta t. \qquad (A.1)$$
The δ here means that this is a small, but finite quantity. If we let δt get
asymptotically smaller around t0 , we arrive at the derivative, which we
denote with u0 (t):
$$\lim_{\delta t \to 0} \left.\frac{\delta u}{\delta t}\right|_{t_0} = \frac{du}{dt}. \qquad (A.2)$$
The limit in Eq. (A.2) will work as long as u doesn't do any funny stuff as a function of t, like jump around abruptly. When you think of u(t) as a function (some line on a plot) that depends on t, u′(t) is the slope of this line that can be obtained by measuring the change δu over some interval δt, and then making the interval progressively smaller.

[Figure A.1: The derivative as a tangent slope (Source: Wikimedia Commons). Note that the first order approximation f(x) + f′(x)∆x captures the qualitative behavior of f(x), but is still very far off. To do better, one should include more terms – this is the point of a Taylor expansion (Sect. III).]
Interpretation expansion (Sect. III)
That being said, it is more cumbersome than the prime, so most lazy
people use u0 (t) to denote the derivative.
Properties
If you need to take derivatives of combinations of two or more functions,
here called f , g, and h, there are four important rules (with a and b being
real constants):
Linearity:

$$(a f + b g)' = a f' + b g' \qquad (A.3)$$

Product rule:

$$(f g)' = f' g + f g' \qquad (A.4)$$

Quotient rule:

$$\text{If } f(x) = \frac{g(x)}{h(x)} \qquad (A.5)$$
$$\text{Then } f'(x) = \frac{g'(x)\, h(x) - g(x)\, h'(x)}{h(x)^2} \qquad (A.6)$$

Chain rule:

$$\text{If } f(x) = h(g(x)) \qquad (A.7)$$
$$\text{Then } f'(x) = \frac{df}{dx} = \frac{dh}{dg}\frac{dg}{dx} = h'(g(x))\, g'(x) \qquad (A.8)$$

i.e. derivatives of nested functions are given by the outer times the inner derivative.
Common Derivatives:

Say, f(x) = x³; then

$$\frac{d^3 x^3}{dx^3} = \frac{d}{dx}\frac{d}{dx}\frac{dx^3}{dx} = \frac{d}{dx}\frac{d}{dx}\, 3x^2 = \frac{d}{dx}\, 6x = 6.$$

In general, the nth derivative is denoted by $f^{(n)}$. This will become useful in Sect. III.
Differentiability
We should not take a function's niceness for granted. Let us introduce the Cⁿ notation, which formalizes this notion. A function is called Cⁿ if it can be differentiated n times, and the nth derivative is still continuous. Continuity may be loosely described as the property that a function's graph can be drawn with a pencil without ever lifting it from the page¹. We call Cⁿ(I) the set of functions that are Cⁿ over some interval I. Obviously, a function that is n times differentiable is n − 1 times differentiable, so we have the following Russian doll relationship², valid for all I:

$$C^0(I) \supset C^1(I) \supset \cdots \supset C^\infty(I) \qquad (A.9)$$

¹ A more formal definition is that, if for every x in a neighborhood of x₀, f(x) is in a neighborhood of f(x₀), then f is said to be continuous at x₀. This definition is valid over the whole real line, including points at infinity.

² which mathematicians call an inclusion
Partial Differentiation
In physics, it is very common for a function to depend on more than
one variable. For instance u could depend on space and time, which
we write u = u( x, y, z, t). In such cases one must account for variations
along each of the coordinates, which are themselves described by partial
derivatives:
lim_{δx→0} δu/δx = ∂u/∂x,    (A.10)
and similarly along other coordinates. The full differential, accounting
for all variations, then writes:
du = (∂u/∂x) dx + (∂u/∂y) dy + (∂u/∂z) dz + (∂u/∂t) dt    (A.11)
which states that the total variation in u is the sum of all its (partial) variations along each coordinate, multiplied by the variation in that coordinate. Partial derivatives and the equations derived from them
underlie much of modern physics, and we can’t do them justice here.
In this class they shall only be used for optimization, such as likelihood
maximization (Sect. III).
II Integral Calculus
Inverse Derivatives
Taking an integral,
F(x) = ∫ f(x) dx,
amounts to undoing a derivative: F is any function such that F′(x) = f(x), which is why F is called an antiderivative of f.
Interpretation
Graphically, the definite (with bounds) integral of f(x),
∫ₐᵇ f(x) dx = F(b) − F(a),
adds up the values of f(x) over little chunks dx along x, from the left x = a to the right x = b, and corresponds to the area under the curve f(x). This area can be computed by subtracting the analytical form of
the integral at b from that at a, F (b) − F ( a). If no bounds a and b are
given, the function F ( x ) is only determined up to an integration con-
stant c, because the derivative of a constant is zero. In physics, initial
or boundary conditions are often used to determine the value of such a
constant, which can be very important in practice.
If f(x) = c (c a constant), then:
F(x) = cx + d    (A.12)
F(b) = cb + d    (A.13)
F(a) = ca + d    (A.14)
F(b) − F(a) = c(b − a),    (A.15)
which is indeed the area of a rectangle of height c and width b − a.
Properties
A few conventions and rules for integration:
Notation:
Everything after the ∫ sign is usually meant to be integrated over up to the dx, or up to the next major mathematical operator if the dx is placed next to the ∫, context allowing. Also:
∫ dx f(x) = ∫ f(x) dx    (A.16)
Linearity:
∫ₐᵇ (c f(x) + d g(x)) dx = c ∫ₐᵇ f(x) dx + d ∫ₐᵇ g(x) dx    (A.17)
Reversal:
∫ₐᵇ f(x) dx = − ∫ᵇₐ f(x) dx    (A.18)
Zero length:
∫ₐᵃ f(x) dx = 0    (A.19)
Additivity:
∫ₐᶜ f(x) dx = ∫ₐᵇ f(x) dx + ∫ᵇᶜ f(x) dx    (A.20)
Product rules:
∫ f′(x) f(x) dx = ½ (f(x))² + C    (A.21)
∫ f′(x) g(x) dx = f(x) g(x) − ∫ f(x) g′(x) dx    (A.22)
Quotient rule:
∫ f′(x)/f(x) dx = ln|f(x)| + C    (A.23)
Symmetry:
∫_{−a}^{a} f(x) dx = 2 ∫₀^{a} f(x) dx if f is even; 0 if f is odd.    (A.24)
Common Integrals
Here are the integrals of a few common functions, all only determined
up to an integration constant C
function f(x) | integral F(x) | comment
x^p | x^{p+1}/(p+1) + C | p ≠ −1; special case: f(x) = c = cx⁰ → F(x) = cx + C
e^x | e^x + C |
1/x | ln(|x|) + C |
sin(x) | −cos(x) + C |
cos(x) | sin(x) + C |
Numerical Quadrature
The term quadrature invokes squares, and it is not used here coincidentally. Indeed, one of the early applications of integrals was to compute areas, usually by dividing domains into rectangles whose area was easy to compute. The basic idea behind the following methods to numerically evaluate the integral of a function f over [a; b] is to chop up the interval into smaller, contiguous segments, collectively known as a subdivision of [a; b]. There are many ways to do this, but the simplest is to divide it into n equal slices of width ∆ = (b − a)/n:
[a; b[ = [a; a + ∆[ ∪ [a + ∆; a + 2∆[ ∪ ⋯ ∪ [a + (n − 1)∆; b[ = ⋃_{k=1}^{n} [a + (k − 1)∆; a + k∆[
Rectangular Rule
The rectangular rule (aka Riemann sums) assumes that the function is constant over each subdivision. If we pick the end of each interval, we get:
∫ₐᵇ f(x) dx ≈ ∆ Σ_{k=1}^{n} f(a + k∆) ≡ Aₙ    (A.25)
As always in the numerical world, the approximation improves with n. In fact, one can show that
lim_{n→∞} Aₙ = ∫ₐᵇ f(x) dx    (A.26)
(Figure A.2: Riemann summation in action. Right and left methods make the approximation using the right and left endpoints of each subinterval, respectively. Maximum (green) and minimum (yellow) methods make the approximation using the largest and smallest endpoint values of each subinterval, respectively. The values of the sums converge as the subintervals halve from top-left to bottom-right. The central panel describes the error as a function of the number of bins n. Source: Wikimedia Commons.)
That is, with infinitely fine subdivisions, one recovers the area under the curve exactly. This is something that a teacher of mine calls the “Stupid Limit”, because one never has that much luxury. The goal of numerical analysis is to obtain an estimate that is as accurate as possible given computational constraints (which means that n ≪ ∞, for starters). The rectangle rule is illustrated in Fig. A.2, exploring the impact of different quadrature choices.
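As a concrete illustration of Eq. (A.25), here is a right-endpoint Riemann sum in Python (a sketch; the integrand and interval are arbitrary choices):

    import numpy as np

    def riemann_right(f, a, b, n):
        """Right-endpoint Riemann sum of f over [a, b] with n equal slices."""
        delta = (b - a) / n
        x = a + delta * np.arange(1, n + 1)  # right endpoints a + k*delta
        return delta * np.sum(f(x))

    # Example: the integral of x^2 over [0, 1] is exactly 1/3
    for n in [10, 100, 1000]:
        print(n, riemann_right(lambda x: x**2, 0.0, 1.0, n))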
Trapezoidal Rule
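In a nutshell, the trapezoidal rule joins the values of f at the two endpoints of each slice by a straight line, which amounts to averaging them; it converges faster than the rectangular rule. A minimal sketch (the test integrand is again an arbitrary choice):

    import numpy as np

    def trapezoid(f, a, b, n):
        """Trapezoidal rule: weight interior points by 1 and endpoints by 1/2."""
        x = np.linspace(a, b, n + 1)
        y = f(x)
        delta = (b - a) / n
        return delta * (0.5 * y[0] + np.sum(y[1:-1]) + 0.5 * y[-1])

    print(trapezoid(lambda x: x**2, 0.0, 1.0, 100))  # close to 1/3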
Improper Integrals
Consider f(x) = λe^{−λx} on [0; +∞[ (λ > 0), with antiderivative F(x) = −e^{−λx}, so
∫₀^ζ f(x) dx = F(ζ) − F(0) = 1 − e^{−λζ}.
Now, as ζ approaches +∞, the second term decreases exponentially fast, and the limit is well defined:
lim_{ζ→+∞} ∫₀^ζ f(x) dx ≡ ∫₀^{+∞} f(x) dx = 1    (A.28)
Such integrals are called improper, but they are legitimate as long as this limit exists. This particular function describes the exponential distribution, which is very useful in the study of waiting times and memoryless processes (Chapter 3).
Consider now the integral ∫_{−∞}^{+∞} sin x dx. Does such a thing exist? Given the symmetry property Eq. (A.24), one can show that for every a,
∫_{−a}^{+a} sin x dx = 0
since sin is an odd function (symmetric about (0, 0)). It is tempting to
take the limit a → ∞ and declare that the improper integral exists and
equals zero. This reasoning, however, is 100% incorrect. For an im-
proper integral with two infinite bounds to be defined, one has to look
separately at the limits at ±∞. In the case above, neither limit is defined,
so the integral doesn’t exist.
Now, it is a fact of life that many of the most useful functions have no closed-form antiderivatives, a case in point being ∫ e^{−x²/2} dx. Not only does its integral exist, but most remarkably
∫_{−∞}^{+∞} e^{−x²/2} dx = √(2π)    (A.29)
However, you would not find that out by application of the two stan-
dard tools to find integrals (integration by parts or variable substitu-
tion); you have to either invoke Fubini’s theorem or use a contour inte-
gral in the complex plane. In general, improper integrals are best tack-
led using measure theory, which is beyond our scope. For our purposes,
a standard math textbook (e.g. Abramowitz and Stegun, 1965), table of in-
tegrals, the Mathematica software, or Matlab’s Symbolic Math Toolbox
will be of help with such integrals, as well as more complicated forms.
(For instance, the density of the atmosphere decreases roughly exponentially with height, ρ(z) = ρ₀ e^{−z/H}, with H a scale height.)
Consider now the function
ϕ(x) = (1/√(2π)) e^{−x²/2}    (A.31)
Then, by virtue of Eq. (A.29) and the linearity of the integration operator, we’d find that
∫_{−∞}^{+∞} ϕ(x) dx = 1
Measuring energy
One more use of integrals is to measure the “energy” of a function over an interval I, via ∫_I f(x)² dx, by analogy with physical signals.
Projection
Another application of integrals is their definition of an inner product (cf. Appendix B) between two functions f and g over some interval I:
⟨f, g⟩ = ∫_I f(x) g(x) dx    (A.33)
Convolution
The convolution product, or simply convolution, of two functions f and g is defined as:
(g ∗ f)(x) = ∫_{−∞}^{+∞} f(u) g(x − u) du = ∫_{−∞}^{+∞} f(x − u) g(u) du    (A.34)
In particular, the output y(t) of a linear filter L applied to an input signal x(t) takes this form:
y(t) = ∫_{−∞}^{+∞} x(u) L(t − u) du = ∫_{−∞}^{+∞} L(u) x(t − u) du    (A.35)
(Figure A.6: Filtering timeseries, an example of 1D convolution; panels show a Gaussian window, a boxcar window, the input signal X(t), and the output Y(t). The output is visibly smoother than the input, because convolution with either window amounts to averaging nearby points together, which cancels out high-frequency fluctuations (a low-pass filter, Chapter 10). Note that the boxcar filter is blurrier and noisier than the Gaussian filter, which is one reason running means are a bad idea.)
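A smoothing filter of this kind is one line in NumPy; the sketch below (window length and input are arbitrary choices) convolves a noisy signal with a boxcar window, i.e. a running mean:

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 500)
    x = np.sin(t) + 0.5 * rng.standard_normal(t.size)  # noisy input signal

    w = np.ones(21) / 21                 # boxcar window, weights summing to one
    y = np.convolve(x, w, mode="same")   # smoothed output, same length as x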
Special points
Derivatives allow one to characterize several special points around which the graph of f(x) is organized: extrema, where f′(x) = 0, and inflection points, where f″(x) = 0.
Optimization
A generic problem in applied mathematics is to find some sort of optimal solution to a problem: for example, one wants to find a curve that minimizes tension between points, a surface that minimizes the misfit to a set of measurements, fitting a line through a cloud of points, or finding the most likely value of a parameter given some observational constraints and prior knowledge. In many of these cases, one can define an objective function S(x) whose maximum or minimum yields the desired solution.
(Figure A.8: Optimizing a multivalued function amounts to finding the peaks in a graph such as this one. Source: Imran Nazar.)
Root finding
A related problem is that of finding the roots (solutions) of an equation, say f(x) = 0. We begin with a first guess x₀ for a root. Provided the function is C¹, a better approximation x₁ is
x₁ = x₀ − f(x₀)/f′(x₀)    (A.36)
(Figure A.9: Finding the roots of an equation via the method of Newton-Raphson.)
The process is then iterated,
xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ),    (A.37)
until a sufficiently accurate value is reached (Fig. A.9). This is the method of Newton-Raphson. One can show that, near a simple root, the iteration converges quadratically (i.e., fast), but it can fail where the derivative vanishes. It also depends critically on the initial guess, so one must use another method (e.g. eyeballing the graph) to find a good one.
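In code, the iteration of Eq. (A.37) takes only a few lines; a minimal sketch (the tolerance and iteration cap are arbitrary choices):

    def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=50):
        """Iterate x <- x - f(x)/f'(x) until |f(x)| < tol."""
        x = x0
        for _ in range(max_iter):
            x = x - f(x) / fprime(x)
            if abs(f(x)) < tol:
                return x
        raise RuntimeError("no convergence; try a better initial guess")

    # Example: the positive root of x^2 - 2 is sqrt(2)
    print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))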
Functional Representations
It is often useful to express an arbitrary function f(x) in terms of simpler components. This, indeed, is one meaning of the word analysis. A generic representation uses a linear combination of basis functions:
f(x) = Σ_{k=0}^{∞} aₖ φₖ(x)    (A.38)
As a general rule, the complexity of φₖ(x) tends to increase with k. Two examples are mentioned here.
Taylor series
Near a point x₀, a sufficiently smooth function may be expanded as
f(x) = f(x₀) + f′(x₀)(x − x₀) + f″(x₀)(x − x₀)²/2! + …    (A.39)
or, truncating at order n with a remainder Rₙ:
f(x) = Σ_{k=0}^{n} f⁽ᵏ⁾(x₀) (x − x₀)ᵏ/k! + Rₙ(x)    (A.40)
where the factorial is
n! = 1 × 2 × 3 × ⋯ × n.    (A.41)
The approximation is quite good around 0 even for n = 2, but gets worse as one gets away from the origin. If one wanted a useful approximation at x = 3, say, one would need to retain more terms. A celebrated example is the exponential function, whose Taylor series converges for every x ∈ R:
eˣ = Σ_{k=0}^{∞} xᵏ/k!
This forms the basis for the Poisson distribution (Chapter 3, Sect. III).
Functions that are equal to their Taylor series on their domain of defi-
nition are called analytic; this is a remarkable property, which we won’t
use too much in this course, but you should know how very extraordi-
nary it is.
Another very important series is the geometric series:
∀x ∈ ]−1, 1[,  1/(1 − x) = Σ_{k=0}^{∞} xᵏ    (A.44)
(that is, aₖ = 1 for all k). With this, one can derive expansions of a host of other functions, like 1/(1+x), 1/(1+x²), arctan, arcsin, arccos, etc. It comes in handy to approximate almost any ratio.
Example
What is a back-of-envelope estimate of q = 1/0.93?
One recognizes the form above, with x = 0.07. So to first order, q = 1 + 0.07 + O(0.07²) ≃ 1.07. To second order, q = 1 + 0.07 + 0.07² + O(0.07³) ≃ 1.07 + 0.0049 = 1.0749. Here we used the “Big O” notation, which means “terms of this order, or higher”.
This is in fact how calculators work (behind the scenes)! You can see how the higher-order terms add up to increasingly small contributions, because |x|ᵐ > |x|ⁿ for any 0 < m < n as long as |x| < 1. For larger |x|, these geometric approximations quickly become gruesome.
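One can watch the partial sums converge numerically; a quick sketch for the example above:

    import numpy as np

    x = 0.07
    exact = 1.0 / (1.0 - x)                   # = 1/0.93
    partial = np.cumsum(x ** np.arange(10))   # partial sums 1, 1.07, 1.0749, ...
    print(exact, partial[:4])                 # successive terms matter less and less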
The geometric series is also useful in probability theory, giving its
name to the geometric distribution (Chapter 4).
The second example is the Fourier series, a representation in terms of sinusoids of a common period T:
f(x) = a₀/2 + Σ_{k=1}^{∞} [aₖ cos(kωx) + bₖ sin(kωx)],  ω = 2π/T    (A.45)
Transformation
Another major application of calculus that is useful for this course is the
concept of integral transforms. In particular Laplace & Fourier trans-
forms are omnipresent in signal processing and probability theory; we
will mostly encounter them in Chapter 7.
Appendix B
LINEAR ALGEBRA
I Vectors
Definition 15 A vector is a quantity having three properties: a magnitude, a
direction and a sense.
Example
• Velocity vector indicating the movement of an object (e.g. planet)
• in R³, v = (1, 0, −1)¹    ¹ NumPy: v = np.array([1, 0, -1])
A vector of Rⁿ can be pictured as an arrow from the origin to the point P = (x₁, …, xₙ), e.g. (x₁, x₂) in the plane. Its length is measured by a p-norm:
• p = 1: ‖v‖₁ = |v₁| + … + |vₙ| (the taxicab norm);
• p = 2: ‖v‖₂ = √(v₁² + ⋯ + vₙ²), the Euclidean norm³.    ³ Matlab: norm(v)
Scalar Multiplication
αv = α (v1 , . . . , vn ) = (αv1 , . . . , αvn ) , α ∈ R.
II Matrices
Definition
A matrix is a quantity with two indices (a 2D array of numbers), which
can be seen as a linear operator (a function) on vectors.
Linear means that A(αv) = α(Av) and A(v + w) = Av + Aw, as the following example illustrates.
Example
With A = (2 3; 1 4) (rows separated by semicolons):
A(v₁; v₂) = (2v₁ + 3v₂; v₁ + 4v₂)
A(αv) = (2αv₁ + 3αv₂; αv₁ + 4αv₂) = α(2v₁ + 3v₂; v₁ + 4v₂) = α(Av)
Similarly A (v + w) = Av + Aw.
Remark (canonical basis): applying A to the canonical basis vectors reads off its columns:
(a b; c d)(1; 0) = (a; c),  (a b; c d)(0; 1) = (b; d)
⟹ Aeᵢ = the iᵗʰ column of A.
(Figure B.1: The canonical basis of R², e₁ = (1, 0) and e₂ = (0, 1).)
An Example: Rotation matrices
Consider the linear operation of rotating a vector v by an angle α (counterclockwise): a vector at angle θ from e₁ is sent to one at angle θ + α. Applying the rotation to e₁ and e₂ and reading off the columns:
A = (cos α  −sin α; sin α  cos α)
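As a sanity check, one can build this matrix in NumPy and rotate e₁ by 90 degrees (an illustrative sketch):

    import numpy as np

    def rotation(alpha):
        """Counterclockwise rotation by angle alpha (radians) in the plane."""
        return np.array([[np.cos(alpha), -np.sin(alpha)],
                         [np.sin(alpha),  np.cos(alpha)]])

    e1 = np.array([1.0, 0.0])
    print(rotation(np.pi / 2) @ e1)  # ~ (0, 1): e1 is rotated onto e2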
In 3D
For instance, a rotation by α about the z axis leaves the z coordinate untouched:
A = (cos α  −sin α  0; sin α  cos α  0; 0  0  1)
Matrix Multiplication
(Figure: the action of A = (2 3; 1 4) on the canonical basis, sending e₁, e₂ to Ae₁, Ae₂.)
Matrix multiplication mirrors the composition of linear maps: if g and f each map R² to R², so does f ∘ g. For A, B matrices, the product AB is accordingly defined so that (AB)v = A(Bv) for every vector v.
Transposition
The transpose of A, denoted Aᵀ, is obtained by permuting its rows and columns. If A is square, this is equivalent to flipping A along its main diagonal:
(Aᵀ)ᵢⱼ = Aⱼᵢ    (B.4)
Special Matrices
Three shapes recur: the identity matrix I, with ones on the diagonal and zeros elsewhere; diagonal matrices diag(d₁, …, dₙ); and triangular matrices, whose non-zero entries are confined to the upper (or lower) triangle.
Matrix Form
A linear system may be written compactly as
(a b; d e)(x; y) = (c; f),    (B.6)
i.e. matrix × unknown vector = known vector. The matrix (a b; d e) maps vectors to vectors (a linear operation).
Geometric interpretation
(Figure: the matrix A sends vectors v₁, v₂ to Av₁, Av₂; solving (B.6) asks which vector (x; y) is mapped onto (c; f).)
3 possibilities:
- The system has a unique solution.
- The system has no solution.
- The system has infinitely many solutions.
Determinants
Motivation
A · x = b, with A the matrix, x the unknown vector, and b the known vector.
• det αA = αn det A .
• Cramer’s rule expresses the solution components as
xᵢ = det Aᵢ / det A,
where Aᵢ is A with its iᵗʰ column replaced by b.
Matrix Inverse
Definition
If A is invertible, the solution of Ax = b is
x = A⁻¹ b    (B.10)
where the inverse satisfies:
A · A⁻¹ = A⁻¹ · A = I,    (B.11a)
(A⁻¹)⁻¹ = A,    (B.11b)
(Aᵀ)⁻¹ = (A⁻¹)ᵀ,    (B.11c)
(A · B)⁻¹ = B⁻¹ · A⁻¹.    (B.11d)
The 2 × 2 case
For a matrix
A = (a b; c d),
the inverse is
A⁻¹ = 1/(ad − bc) · (d  −b; −c  a),
provided the determinant ad − bc does not vanish.
Numerical Solutions
In practice one rarely inverts A explicitly. One instead factors A = LU, with L lower and U upper triangular, and solves
LUx = b, i.e. L(Ux) = b,
by two successive triangular solves.
There are many other kinds of matrix factorizations (Cholesky, QR, etc.) which can become advantageous when A exhibits certain properties (e.g. symmetry). SciPy incorporates the whole LAPACK machinery to do so.
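A sketch of LU-based solving with SciPy (the matrix and right-hand side are arbitrary choices):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[2.0, 3.0], [1.0, 4.0]])
    b = np.array([5.0, 6.0])

    lu, piv = lu_factor(A)       # factor once...
    x = lu_solve((lu, piv), b)   # ...then solve cheaply, even for many b
    print(np.allclose(A @ x, b))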
Vectors v₁, …, vₖ are linearly independent if none of them can be written as a combination of the others:
v₁ ≠ α₂v₂ + α₃v₃ + … + αₖvₖ,    (B.12a)
v₂ ≠ α₁v₁ + α₃v₃ + … + αₖvₖ,    (B.12b)
etc.
Equivalently,
α₁v₁ + … + αₖvₖ = 0 ⟺ α₁ = … = αₖ = 0.
(the right arrow is the more difficult one to prove; the left one is triv-
ial).
That is, linearly dependent vectors are redundant: at least one can
be expressed as a combination of the others, so there are fewer than k
degrees of freedom in this system.
A basis of Rⁿ is a set of vectors {v₁, …, vₖ} such that: (1) they are linearly independent, and (2) they span Rⁿ, i.e. every vector of Rⁿ is a linear combination of them.
Example
Consider the canonical basis vectors e₁, e₂, e₃ of R³. It is trivial to verify that with this choice of vectors, the only possible way for λ₁e₁ + λ₂e₂ + λ₃e₃ to be zero is if λ₁ = λ₂ = λ₃ = 0 (linear independence). Since the space is 3-dimensional, these 3 linearly-independent vectors form a basis of it. QED
The inner product (or dot product) of two vectors is
⟨v, w⟩ = v · w = v₁w₁ + v₂w₂ + … + vₙwₙ = Σᵢ vᵢwᵢ.
(Figure: the dot product relates to the angle θ between v and w.)
A set of vectors {vᵢ} is called:
• orthogonal if ⟨vᵢ, vⱼ⟩ = 0, ∀i ≠ j;
• orthonormal if, in addition, ⟨vᵢ, vᵢ⟩ = ‖vᵢ‖² = 1.
v = α1 v1 + . . . + α n v n
Now that’s true in any basis, so what we really want to know is how
to find the coordinates (αi ’s), and quickly. Let us project v onto vi , i ∈
{1, · · · , n}:
⟨v, vᵢ⟩ = α₁⟨v₁, vᵢ⟩ + … + αₙ⟨vₙ, vᵢ⟩
Of course, because the vⱼ are orthonormal, the cross terms in the sum drop out, so we are left with αᵢ = ⟨v, vᵢ⟩.
That is, one can find the coordinates αi by simple projection onto the
basis vectors. This property is so intuitive that you probably take it for
granted – but it is remarkable, and entirely due to orthonormality. The
result is that we may write:
v = Σ_{i=1}^{n} ⟨v, vᵢ⟩ vᵢ    (B.15)
Example
Consider the orthonormal basis {v₁, v₂} of R², with v₁ = (1, 1)/√2 and v₂ = (−1, 1)/√2 (the canonical basis rotated by 45°).
Question Find the coordinates of v = (1, 0) with respect to the basis {v1 , v2 }
i.e. find α1 and α2 such that v = α1 v1 + α2 v2 .
where
v · v₁ = 1 · (1/√2) + 0 · (1/√2) = 1/√2,
v · v₂ = 1 · (−1/√2) + 0 · (1/√2) = −1/√2.
Thus:
v = (1/√2) v₁ − (1/√2) v₂.
Let’s check:
(1/√2) v₁ − (1/√2) v₂ = ½(1, 1) − ½(−1, 1) = (1, 0).
VI Projections
E is a subspace of Rⁿ if it is closed under addition and scalar multiplication (in particular, 0 ∈ E). An important example is the image of a matrix A:
ImA = {Ax : x ∈ Rⁿ}.    (B.16)
Example
Image of A = {Av : v ∈ Rⁿ} ≡ Im(A). The image of A is a subspace of Rⁿ! For a matrix, the image is also known as the column space. A solution exists if and only if b ∈ Im(A).
Least Squares:
When b ∉ Im(A), no exact solution exists; one instead minimizes the misfit, which is equivalent to:
min_{y = Ax} ‖y − b‖    (B.19)
⇔ min_{y ∈ Im(A)} ‖y − b‖    (B.20)
(Figure: b is projected orthogonally onto ImA; P_{ImA} b is the closest point of ImA to b.)
Take v₁, …, vₖ an orthonormal basis of E, completed by vₖ₊₁, …, vₙ into an orthonormal basis of Rⁿ.
Any vector may be expanded as v = α₁v₁ + … + αₙvₙ. If w ∈ E, i.e. w = β₁v₁ + … + βₖvₖ, what is min_{w∈E} ‖v − w‖? Given that:
v − w = (α₁ − β₁)v₁ + … + (αₖ − βₖ)vₖ + αₖ₊₁vₖ₊₁ + … + αₙvₙ,
we have ‖v − w‖² = Σ_{i=1}^{k} (αᵢ − βᵢ)² + Σ_{i=k+1}^{n} αᵢ², which is smallest when βᵢ = αᵢ for i = 1, …, k; the minimizer is the orthogonal projection of v onto E. Applied to E = Im(A), this yields the least-squares solution
x_ls = (AᵀA)⁻¹ Aᵀ b    (B.23)
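In practice one rarely forms (AᵀA)⁻¹ explicitly, as it can be ill-conditioned; NumPy’s lstsq solves the problem stably. A sketch fitting a line through noisy points (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 20)
    b = 2.0 * t + 1.0 + 0.1 * rng.standard_normal(t.size)  # noisy straight line

    A = np.column_stack([t, np.ones_like(t)])  # design matrix: slope, intercept
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(x_ls)  # ~ (2, 1)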
Solving Ax = b
A solution exists if and only if b ∈ ImA, with A an m × n matrix. Typically: if m > n the system is overdetermined and has no exact solution (least squares applies); if m < n it is underdetermined, with infinitely many solutions; if m = n and det A ≠ 0, the solution is unique.
Generalities
A vector space is a set V on which two operations are defined:
1. Addition,
2. Scalar multiplication,
satisfying the following properties.
Addition
P1a) v1 + ( v2 + v3 ) = ( v1 + v2 ) + v3 . (Associativity)
P1b) v1 + v2 = v2 + v1 . (Commutativity)
P1c) There exists 0 ∈ V, such that 0 + v = v + 0 = v. (Identity element)
P1d) ∀v ∈ V, there exists − v ∈ V, such that v + (−v) = 0 . (Inverse element)
Scalar multiplication
Example
1. Rⁿ.
2. Cⁿ.
3. The set of real-valued functions on R, with addition and scalar multiplication defined pointwise:
(f + g)(x) = f(x) + g(x),
(λf)(x) = λ f(x).
4. L²(R), the set of square integrable functions on R. That is, all functions f such that:
∫_{−∞}^{∞} |f(x)|² dx < ∞.
On such spaces one can define an inner product, e.g. ⟨f, g⟩ = ∫₀^{2π} f g dθ, and verify its defining properties:
1. Linearity (B.14a)
⟨αf + βg, h⟩ = ∫₀^{2π} (αf + βg) h dθ = α ∫₀^{2π} f h dθ + β ∫₀^{2π} g h dθ = α⟨f, h⟩ + β⟨g, h⟩.    (B.26a)
2. Symmetry (B.14b)
⟨f, g⟩ = ∫₀^{2π} f g dθ = ∫₀^{2π} g f dθ = ⟨g, f⟩.
i.e. {1, cos θ, sin θ, cos 2θ, sin 2θ, . . .}. These functions are vectors in E.
Do they form a basis of it?
One can check that:
∫₀^{2π} cos nθ cos mθ dθ = 0, for n ≠ m.    (B.29a)
∫₀^{2π} sin nθ sin mθ dθ = 0, for n ≠ m.    (B.29b)
∫₀^{2π} sin nθ cos mθ dθ = 0, for all n, m.    (B.29c)
Also,
∫₀^{2π} (cos nθ)² dθ = π, for n ≥ 1.    (B.31a)
∫₀^{2π} (sin nθ)² dθ = π, for n ≥ 1.    (B.31b)
∫₀^{2π} 1² dθ = 2π.    (B.31c)
Therefore,
{1/√(2π)} ∪ {cos(nθ)/√π}_{n=1}^{∞} ∪ {sin(nθ)/√π}_{n=1}^{∞}    (B.33)
is an orthonormal family.
Any f ∈ E may then be expanded as
f(θ) = a₀/√(2π) + (1/√π) Σ_{n=1}^{∞} (aₙ cos nθ + bₙ sin nθ)    (Fourier Series Representation)
and the truncation at order N defines the partial sum
P_N(θ) = a₀/√(2π) + (1/√π) Σ_{n=1}^{N} (aₙ cos nθ + bₙ sin nθ).    (B.35)
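The coefficients are obtained, as in Eq. (B.15), by projecting f onto each basis function, and the projection integrals can be approximated by Riemann sums. A sketch using the standard (unnormalized) convention, with the target function and truncation N as arbitrary choices:

    import numpy as np

    theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
    dtheta = theta[1] - theta[0]
    f = np.sign(np.sin(theta))     # a square wave on [0, 2*pi)

    N = 9
    partial = np.full_like(theta, f.mean())  # constant (a0) term
    for n in range(1, N + 1):
        a_n = np.sum(f * np.cos(n * theta)) * dtheta / np.pi  # projection on cos
        b_n = np.sum(f * np.sin(n * theta)) * dtheta / np.pi  # projection on sin
        partial += a_n * np.cos(n * theta) + b_n * np.sin(n * theta)
    # 'partial' now approximates the square wave in the least-squares sense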
Appendix C
CIRCLES AND SPHERES
I Trigonometry
Trigonometric functions
(Figure: a right triangle with legs a and c, hypotenuse b, angle α at vertex A, right angle β = π/2 at B, and angle γ at C; Pythagoras’ theorem gives a² + c² = b².)
Sine
sin(α) = a/b,
α = arcsin(a/b) = sin⁻¹(a/b).
(Figure: the graph of sin, periodic with period 2π, and of arcsin, defined on [−1, 1] with values in [−π/2, π/2].)
Cosine
cos(α) = c/b,
α = arccos(c/b).
(Figure: the graph of cos and of arccos, defined on [−1, 1].)
Tangent
tan(α) = a/c,    (C.1)
α = arctan(a/c).    (C.2)
(Figure: the graph of tan and of arctan, with values in ]−π/2, π/2[.)
Trigonometric relationships
tan(α) = sin(α)/cos(α)    (C.3)
sin²(α) + cos²(α) = 1    (C.4)
sin(α) = ±√(1 − cos²(α));  cos(α) = ±√(1 − sin²(α))    (C.5)
Symmetry
The sine is odd and the cosine even: sin(−α) = −sin(α), cos(−α) = cos(α).
Periodicity
sin(α + π/2) = cos(α)    (C.9)
sin(α + π) = −sin(α)    (C.10)
sin(α + 2π) = sin(α)    (C.11)
etc.
Angle sum
sin(α ± β) = sin(α) cos(β) ± cos(α) sin(β)    (C.12)
cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)    (C.13)
Arbitrary triangles
(Figure: a triangle with vertices A, B, C, opposite sides a, b, c, and angles α, β, γ.)
Law of cosines
c² = a² + b² − 2ab cos(γ)    (C.14)
Law of sines
sin(α)/a = sin(β)/b = sin(γ)/c    (C.15)
Example (Vector components)
(Figure: a vector v (~v in vector notation) with azimuth α measured from north N toward east E.)
Spherical coordinates
Spherical coordinates (r, θ, ϕ) relate to Cartesian coordinates (x, y, z) through
r = √(x² + y² + z²),    x = r cos(ϕ) sin(θ),
θ = arccos(z/r),        y = r sin(ϕ) sin(θ),
ϕ = arctan2(y, x),      z = r cos(θ).
Note that:
• in a Cartesian system, the local unit vectors are
e_r = (sin θ cos ϕ, sin θ sin ϕ, cos θ),
e_θ = (cos θ cos ϕ, cos θ sin ϕ, −sin θ),
e_ϕ = (−sin ϕ, cos ϕ, 0);
• the surface element is dA = r² sin(θ) dϕ dθ (Figure C.8: surface integration on a sphere);
• the gradient reads
∇f(r, θ, ϕ) = (∂f/∂r) e_r + (1/r)(∂f/∂θ) e_θ + (1/(r sin θ))(∂f/∂ϕ) e_ϕ.
Horizontal surface gradient (r = 1):
∇_h = (∂_θ, (1/sin θ) ∂_ϕ).
The spherical law of cosines gives the great-circle distance s between two points, but it is a poor formula numerically for small s (why?); it is much better to use the haversine form,
s = 2 arcsin { sin²[(λ₁ − λ₂)/2] + cos(λ₁) cos(λ₂) sin²[(ϕ₁ − ϕ₂)/2] }^{1/2},
with λ the latitudes and ϕ the longitudes of the two points.
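A sketch of this formula in Python, assuming λ denotes latitude and ϕ longitude (in radians); the two test points are arbitrary:

    import numpy as np

    def great_circle(lat1, lon1, lat2, lon2):
        """Haversine formula: central angle between two points on the sphere."""
        h = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 2 * np.arcsin(np.sqrt(h))

    # Los Angeles to Paris; multiply by Earth's radius for a distance in km
    s = great_circle(np.radians(34.05), np.radians(-118.24),
                     np.radians(48.85), np.radians(2.35))
    print(s * 6371)  # ~ 9100 km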
Example (Average directions and orientations)
1. Given a set of n unit vectors with azimuths αᵢ, i.e. components vᵢᴱ = sin(αᵢ) and vᵢᴺ = cos(αᵢ), the mean azimuth is
⟨α⟩ = arctan2( Σ vᵢᴱ / n, Σ vᵢᴺ / n ).
2. For orientations (undirected lines, defined modulo π), double the angles first:
õᵢᴱ = sin(2αᵢ), õᵢᴺ = cos(2αᵢ),
⟨α⟩_π = ½ arctan2( Σ õᵢᴱ / n, Σ õᵢᴺ / n ).
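A sketch of the first recipe in NumPy (the sample azimuths are arbitrary):

    import numpy as np

    alpha = np.radians([350.0, 10.0, 20.0])  # azimuths clustered around north
    # A naive average gives ~127 degrees; the vector mean behaves sensibly
    mean_az = np.arctan2(np.mean(np.sin(alpha)), np.mean(np.cos(alpha)))
    print(np.degrees(mean_az) % 360)  # ~ 6.7 degrees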
II Complex Numbers
The Cardano-Tartaglia Equation
In sixteenth century Italy, there was a lot of interest in solving certain polynomial equations, in particular cubics¹. Girolamo Cardano noticed something strange when he applied his formula to certain cubics. When solving x³ = 15x + 4 he obtained an expression involving √(−121). Cardano knew that you could not take the square root of a negative number, yet he also knew that x = 4 was a solution to the equation. He wrote to Niccolò Tartaglia, another algebraist of the time, in an attempt to clear up the difficulty. Tartaglia certainly did not understand. In Ars Magna, Cardano’s claim to fame, he gives a calculation with “complex numbers” to solve a similar problem, but he really did not understand his own calculation, which he says is “as subtle as it is useless.”
¹ This was generally in order to compute financial gains, so the research in this area of mathematics was rather handsomely subsidized by the nascent banking industry.
Later it was recognized that if one just allowed oneself to write i = √(−1), then √(−121) would just be 11i. If one could just bear the notion that i was a permissible thing to write, it turned out, polynomials of any degree could be handled.
(Figure C.12: Gerolamo Cardano)
A complex number z is written
z = x + iy    (C.19)
with (x, y) two real numbers and i = √(−1) as before (that is, i² = −1). We call x the real part (ℜ(z)) and y the imaginary part (ℑ(z)), so it really is “halfway between being and nothingness”. That does not make it useless. In fact, it is applicable to so many areas of physics, mathematics & engineering that many people consider complex numbers to be just as “real” as real numbers. Except for integers, all of them are mental constructs anyway, so we might as well use mental constructs that can solve an astounding variety of problems. Here we will mostly use them to understand cyclicities.
(Figure C.13: Niccolò Tartaglia)
A complex number may be charted in the complex plane (Fig. C.14), which is oriented by the real axis along (1, 0) and the imaginary axis along (0, i). Let us also define the complex conjugate of z, denoted z̄ or z*:
z* = x − iy    (C.20)
That is, z* is the image of z by reflection about the real axis.
If a complex number is just a point on a plane, why not just call it a vector of R²? In R² we know how to add two vectors, or multiply them by a scalar. We even have an inner product (that turns two vectors into one scalar) and an outer product (that turns two vectors into a vector perpendicular to that plane, so outside the original space): in R² there is no rule to multiply two elements and yet stay in R². The space of complex numbers C does have such a rule: zz′ = (xx′ − yy′) + i(xy′ + x′y), obtained by expanding the product and using i² = −1. This comes in addition to the usual properties:
z + z′ = (x + x′) + i(y + y′)  (addition)    (C.21a)
λz = λx + iλy  (scalar multiplication)    (C.21b)
(Figure C.14: Polar representation and complex conjugate. Two conjugates have the same modulus but opposite arguments. Credit: Wikipedia)
C is thus R² endowed with the same two rules of addition and scalar multiplication, plus this multiplication. And while a polynomial of degree n may not always have roots in R, it always has n roots in C – that is totally wild.
Polar representation
The modulus of z is r = |z| = √(x² + y²), and the argument may be written (for any non-zero complex z):
ϕ = arctan(y/x)    (C.24)
(more robustly, ϕ = arctan2(y, x)). Any complex number (except zero) may be written in polar form,
z = r e^{iϕ},
and since Euler’s formula gives e^{iϕ} = cos ϕ + i sin ϕ,
z = r (cos ϕ + i sin ϕ)    (C.27)
Setting r = 1 defines the unit circle (Fig. C.17), numbers whose real
part is given by cos ϕ and imaginary part given by sin ϕ. Conversely,
one may define sines and cosines this way:
cos ϕ = (e^{iϕ} + e^{−iϕ}) / 2    (C.28a)
sin ϕ = (e^{iϕ} − e^{−iϕ}) / (2i)    (C.28b)
For the operations of multiplication, division, and exponentiation of
complex numbers, it is generally much simpler to work with complex
numbers expressed in polar form rather than rectangular form. From
the laws of exponentiation, for two numbers z1 = r1 eiϕ1 and z2 = r2 eiϕ2 :
Multiplication: r₁e^{iϕ₁} · r₂e^{iϕ₂} = r₁r₂ e^{i(ϕ₁ + ϕ₂)}  (Fig. C.16: complex multiplication)
Division: r₁e^{iϕ₁} / (r₂e^{iϕ₂}) = (r₁/r₂) e^{i(ϕ₁ − ϕ₂)}
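Python supports complex numbers natively, so these rules are easy to verify; a quick sketch:

    import cmath

    z1 = 1 + 1j
    print(abs(z1), cmath.phase(z1))      # modulus sqrt(2), argument pi/4
    z2 = cmath.rect(2.0, cmath.pi / 3)   # build a number from its polar form
    z = z1 * z2
    print(abs(z), cmath.phase(z))        # moduli multiply, arguments add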
Some special values:
1 = e^{i0},  −1 = e^{iπ},  i = e^{iπ/2},  −i = e^{i3π/2},
because who wouldn’t want to use three symbols (two irrational numbers and an imaginary one) to write the number one?
Further, one quickly notices that this notation is everything but unique.
Indeed adding any multiple of 2π means going around the merry-go-
round that many times, so we say that a complex argument is defined
“modulo 2π”. Hence we should have written 1 = eik2π , where k is any
positive or negative integer (i.e. , k ∈ Z), and so on for the others. This
“modulo" business simply expresses periodicity.
Roots of Unity
An nth root of unity, where n ∈ N*, is a number z satisfying the deceptively simple equation
zⁿ = 1    (C.29)
Writing z = re^{iϕ}, Eq. (C.29) becomes
rⁿ e^{inϕ} = 1    (C.30)
Two complex numbers are equal if and only if they have the same modulus and argument, so for this to work, rⁿ = 1 with r real and positive, yielding r = 1: the solution must be on the unit circle (Fig. C.17). What about its argument? It must verify:
e^{inϕ} = e^{i2πk}, k ∈ Z,    (C.31)
so ϕ = 2πk/n, and the n distinct roots are zₖ = e^{i2πk/n}, k = 0, 1, …, n − 1.
These are some of the coolest numbers you’ll ever meet. The third roots of unity are illustrated in Fig. C.18, and you can see that they divide the unit circle into three equal slices. In general, nth roots of unity split the unit circle in n equal slices³: they are cyclotomic. The roots are always located at the vertices of a regular n-sided polygon inscribed in the unit circle, with one vertex at 1.
(Figure C.18: Third roots of unity)
³ This proves invaluable in cutting pies without a protractor, i.e. in most social circumstances.
a cornerstone of Fourier analysis. Indeed, the sequence of powers
⋯, z⁻¹, z⁰, z¹, ⋯    (C.35)
of an nth root of unity repeats itself with period n.
Combinations of such sequences build arbitrary periodic signals; for instance, the real part of such a combination may be written
ℜ(zⱼ) = Σ_{k=0}^{n−1} [Aₖ cos(2πjk/n) + Bₖ sin(2πjk/n)],    (C.38)
which is precisely the form of a discrete Fourier series.
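A sketch generating the nth roots of unity in NumPy:

    import numpy as np

    n = 3
    k = np.arange(n)
    z = np.exp(2j * np.pi * k / n)   # the n-th roots, spread on the unit circle
    print(np.allclose(z**n, 1.0))    # each satisfies z^n = 1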
Complex exponentials also describe oscillations. The harmonic oscillator obeys
ẍ + ω₀² x = 0    (C.39)
where ω₀ is a constant (in the case of the spring, ω₀² is the ratio of its stiffness to the mass of the mobile). You may have seen that the solutions to such equations take the form x(t) = A cos(ω₀t) + B sin(ω₀t), which by Eq. (C.28) is the real part of a complex exponential C e^{iω₀t}.
Spherical Harmonics: Normalization
Several orthonormality conventions coexist for the spherical harmonics Yₗₘ(θ, ϕ). Writing the overlap integral as
I = ∫₀^π dθ sin(θ) ∫₀^{2π} dϕ Yₗₘ Yₗ′ₘ′,
the convention used in physics and seismology takes
Nₗₘ = √[ (2l + 1)(l − m)! / (4π (l + m)!) ]  →  I = δₗₗ′ δₘₘ′,
where δ is the Kronecker δ: δᵢⱼ = 1 for i = j, 0 for i ≠ j.
In geodesy,
Nₗₘ = √[ (2l + 1)(l − m)! / (l + m)! ]  →  I = 4π δₗₗ′ δₘₘ′.
In magnetics (Schmidt semi-normalization),
Nₗₘ = √[ (l − m)! / (l + m)! ]  →  I = (4π / (2l + 1)) δₗₗ′ δₘₘ′.
Any well-behaved function on the sphere may be expanded as
f(θ, ϕ) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} fₗₘ Yₗₘ(θ, ϕ)    (C.41)
with convergence in the mean-square sense:
lim_{L→∞} ∫₀^{2π} dϕ ∫₀^π dθ [ f(θ, ϕ) − Σ_{l=0}^{L} Σ_{m=−l}^{l} fₗₘ Yₗₘ(θ, ϕ) ]² sin(θ) = 0.
Note: Compare ‖v‖ = √(Σᵢ vᵢ²) with the expression above.
Appendix D
DIAGONALIZATION
The number N of atoms of a radioactive species decays as
dN(t)/dt = −λN    (D.1)
where λ is a constant.
We can also have a more involved case, e.g. a decay chain Bk → Bz → J → Rn, with decay rates λ₁, λ₂, λ₃. In this case the numbers of atoms as a function of time are given by
dN₁(t)/dt = −λ₁N₁
dN₂(t)/dt = −λ₂N₂ + λ₁N₁
dN₃(t)/dt = −λ₃N₃ + λ₂N₂    (D.2)
and we can rewrite this system of differential equations as
d/dt (N₁; N₂; N₃) = A (N₁; N₂; N₃), with A = (−λ₁ 0 0; λ₁ −λ₂ 0; 0 λ₂ −λ₃).    (D.3)
If A can be diagonalized,
A = VΛV⁻¹    (D.4)
with Λ diagonal, then powers become cheap:
Aⁿ = VΛⁿV⁻¹    (D.5)
II Eigendecomposition
Eigenvalues and eigenvectors
Given the matrix A, λ is an eigenvalue of A if and only if ∃v ≠ 0 such that Av = λv. The case v = 0 is trivial since it’s always true. We can rewrite Av = λv as follows:
Av = λv ⇔ Av − λv = 0 ⇔ (A − λI)v = 0 ⇔ v ∈ N(A − λI) (the null space of A − λI).    (D.6)
A simple example
Consider the case
A = (2 1; 1 2),  A − λI = (2 − λ  1; 1  2 − λ).    (D.8)
We have to solve det(A − λI) = (2 − λ)² − 1 = λ² − 4λ + 3 = 0.    (D.9)
We have
∆ = b² − 4ac = 16 − 12 = 4, √∆ = 2 ∈ R    (D.10)
so λ₁,₂ = (4 ± 2)/2, i.e. λ₁ = 3 and λ₂ = 1.    (D.11)
What are the eigenvectors? We have to look for solutions v such that Av = λ₁,₂ v.
• For λ₁ = 3 we have
(2 1; 1 2)(x₁; y₁) = 3 (x₁; y₁) ⇔ {2x₁ + y₁ = 3x₁ and x₁ + 2y₁ = 3y₁} ⇔ x₁ − y₁ = 0 ⇔
v₁ = c₁ (1; 1), c₁ ∈ R*    (D.12)
• For λ₂ = 1 we have
(2 1; 1 2)(x₂; y₂) = 1 · (x₂; y₂) ⇔ {2x₂ + y₂ = x₂ and x₂ + 2y₂ = y₂} ⇔ x₂ + y₂ = 0 ⇔
v₂ = c₂ (−1; 1), c₂ ∈ R*    (D.13)
We can define two unit vectors v₁′, v₂′ associated with the eigenvalues we have found (taking c₁ = c₂ = 1):
v₁ᵀv₁ = ‖v₁‖² = 2 ⇒ v₁′ = v₁/√2,
v₂ᵀv₂ = ‖v₂‖² = 2 ⇒ v₂′ = v₂/√2,    (D.15)
and v₁′ · v₂′ = 0, so the two vectors v₁′, v₂′ are orthonormal. Out of these two vectors we can build a matrix V = (v₁′ v₂′) which is orthogonal, i.e. VᵀV = VVᵀ = I. This means that V is invertible and the inverse matrix is given by Vᵀ. Defining
Λ = (λ₁ 0; 0 λ₂)    (D.19)
we obtain the eigendecomposition A = VΛVᵀ.    (D.20)
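NumPy reproduces this example in a few lines; a sketch (np.linalg.eigh is the routine specialized for symmetric matrices):

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])
    lam, V = np.linalg.eigh(A)    # eigenvalues (1, 3) and orthonormal columns
    print(lam)
    print(np.allclose(V @ np.diag(lam) @ V.T, A))  # A = V Lambda V^T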
General case
The previous example represented a very special case. In fact:
• A was a 2 × 2 matrix, so P(λ) was a polynomial of order 2 in λ and the eigenvalues could be calculated easily. Things are not quite so simple for dimension n > 2 – finding the roots of the polynomial needs to happen numerically.
• A was real and symmetric (i.e. Aᵀ = A). It can be shown that in such a case the eigenvalues are always real and the eigenvectors are orthogonal.
In general, one always starts with the characteristic polynomial, defined as P(λ) = det(A − λIₙ) where A ∈ M_{n×n}(R), with M_{n×n}(R) the space of real n × n matrices. Then, in order to find the eigenvalues and eigenvectors, the following steps have to be performed.
1. The characteristic polynomial can always be factorized as
P(λ) = (λ − λ₁)^{n₁} (λ − λ₂)^{n₂} ⋯ (λ − λₖ)^{nₖ}    (D.21)
where nᵢ is the (algebraic) multiplicity of the eigenvalue λᵢ. The corresponding Λ is then block-diagonal,
Λ = blockdiag(Λ₁, Λ₂, …, Λₖ), with Λᵢ = λᵢ I_{nᵢ} (the eigenvalue λᵢ repeated nᵢ times on the diagonal).
What are the advantages of going through this procedure? There are several of them:
• once the diagonal matrix has been found it is easy to raise the matrix to any power m: Aᵐ = VΛᵐV⁻¹, where Λᵐ can be readily obtained from Λ as
Λᵐ = diag(λ₁ᵐ, …, λₙᵐ);    (D.23)
• likewise, the inverse is A⁻¹ = VΛ⁻¹V⁻¹ with Λ⁻¹ = diag(1/λ₁, …, 1/λₙ), which is possible only in the case in which all of the λᵢ are non-zero;
Definition
Eigendecomposition is the privilege of square matrices. For non-square
matrices, similar benefits may be obtained via the singular value decom-
position (SVD for short). Any matrix A ∈ Mm×n (R) (or A ∈ Mm×n (C))
admits a singular value decomposition as
A = U Σ V> (D.25)
For m ≥ n we have
Σ = diag(σ₁, …, σₙ), with σᵢ the singular values,    (D.26)
VVᵀ = VᵀV = Iₙ (right singular vectors)    (D.27)
UᵀU = Iₙ (left singular vectors)    (D.28)
so the m × n matrix A factors into an m × n matrix U with orthonormal columns, an n × n diagonal matrix Σ, and an n × n orthogonal matrix Vᵀ.
Properties
What does the singular value decomposition do for us? The number of non-zero singular values defines the rank of matrix A. In practice, how do we find U, Σ and V? The matrix AᵀA is real and symmetric, so it is diagonalizable and admits orthonormal eigenvectors, i.e.
AᵀA = V′ΛV′ᵀ, V′ᵀV′ = I.    (D.29)
On the other hand,
AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀ (UᵀU) ΣVᵀ = VΣ²Vᵀ ≡ V′ΛV′ᵀ    (D.30)
so V = V′ and σᵢ = √(λᵢ): the right singular vectors are eigenvectors of AᵀA, and the singular values are the square roots of its eigenvalues.
What about U? Similarly, AAᵀ = UΣ²Uᵀ, so the columns of U (the left singular vectors) are orthonormal eigenvectors of AAᵀ; equivalently, uᵢ = Avᵢ/σᵢ. From the SVD one can thus read off:
• its rank
• a basis for N(A), R(A), N(Aᵀ), R(Aᵀ), which are the four fundamental subspaces.
All this with an incredibly efficient algorithm³.    ³ np.linalg.svd()
Partitioning the factors at some rank r,
U =: [U₁ U₂], Σ =: (Σ₁ 0; 0 Σ₂), and V =: [V₁ V₂],    (D.34)
the truncated reconstruction
D̂* = U₁Σ₁V₁ᵀ    (D.35)
is such that:
‖D − D̂*‖_F = min_{rank(D̂) ≤ r} ‖D − D̂‖_F = √(σ²ᵣ₊₁ + ⋯ + σ²ₘ).    (D.36)
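A sketch of the truncated SVD in NumPy, checking the error formula of Eq. (D.36) on an arbitrary test matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    D = rng.standard_normal((8, 5))
    U, s, Vt = np.linalg.svd(D, full_matrices=False)

    r = 2
    D_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation
    err = np.linalg.norm(D - D_hat, "fro")
    print(np.isclose(err, np.sqrt(np.sum(s[r:] ** 2))))  # matches Eq. (D.36)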
BIBLIOGRAPHY
Cook, E. R., and K. Peters (1981), The smoothing spline: A new approach
to standardizing forest interior tree-ring width series for dendrocli-
matic studies, Tree-Ring Bulletin, 41, 45–53.
Kug, J.-S., F.-F. Jin, and S.-I. An (2009), Two types of El Niño events:
Cold tongue El Niño and warm pool El Niño, Journal of Climate, 22(6),
1499–1515, doi: 10.1175/2008JCLI2624.1.
Tarantola, A. (2004), Inverse Problem Theory and Methods for Model Pa-
rameter Estimation, Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA.
Wahba, G., and Y. Wang (1995), Behavior near zero of the distribution of
GCV smoothing parameter estimates, Stat. Probabil. Lett., 25, 105–111.
Wunsch, C. (2000), On sharp spectral lines in the climate record and the
millennial peak, Paleoceanography, 15(4), 417–424.
INDEX