
Data Analysis in the Earth

& Environmental Sciences

Julien Emile-Geay
Department of Earth Sciences
[email protected]

ERTH425L
Copyright © 2017 Julien Emile-Geay

published by me, myself and the intertubes

Typeset in LaTeX with help from MacTeX and a tweaked version of the Tufte LaTeX book class. Many figures
generated via TikZ. Python highlighting by O. Verdier. All software open source.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License. Legal Code

Please cite as: Emile-Geay, J., 2017: Data Analysis in the Earth & Environmental Sciences, 265pp, Third edition,
http://dx.doi.org/10.6084/m9.figshare.1014336.

Recompiled December 2017


CONTENTS

Chapter 1 Preamble 9
I What is data analysis? 9
II What should you expect from this book? 10
III Structure 11
IV Acknowledgements 12

Part I Living in an uncertain world 13

Chapter 2 Probability Theory 15


I Probability Theory as Extended Logic 15
II Notion of probability 19
III The Calculus of Probabilities 24

Chapter 3 Probability distributions 31


I Random Variables. Probability Laws 31
II Exploratory Data Analysis 38
III Common Discrete Distributions 44
IV Common Continuous Distributions 48

Chapter 4 Normality. Error Theory 51


I The Normal Distribution 51
II Limit Theorems 55
III A bit of history 58
IV Error Analysis 60

Chapter 5 Principles of Statistical Estimation 63


I Preamble 63
II Method of Moments 64
III Maximum Likelihood Estimation 66
IV Quality of Estimators 71
V Bayesian Estimation 74

Chapter 6 Confirmatory Data Analysis 81


I Preamble 81
II Confidence Intervals 82
III Testing Archeological Hypotheses 83
IV The Logic of Statistical Tests 87
V Test Errors 88
VI Common Parametric Tests 89
VII Non-parametric tests 91
VIII Bayes: return of the reverend 94
IX Further considerations 96

Part II Living in the temporal world 97

Chapter 7 Fourier Analysis 99


I Timeseries 99
II Fourier Series 102
III Fourier transform 104
IV Discrete Fourier transform 109
V The Curse of Discretization 112

Chapter 8 Timeseries modeling 117


I The AR(1) model and persistence 117
II Linear Parametric Models 119
III Noise color 124


Chapter 9 Spectral Analysis 125


I Why spectra? 125
II Signals, trends and noise 128
III Data pre-processing 128
IV Classical Spectral Estimation 131
V Advanced Spectral Estimation 133
VI Cross-spectral analysis 138

Chapter 10 Signal Processing 141


I Filters 141
II Interpolation 146

Part III Living in multiple dimensions 151

Chapter 11 Multivariate Relationships 153


I Relationships between variables 153
II The Multivariate Normal Distribution 156

Chapter 12 Principal component analysis 163


I Principal component analysis: theory 163
II Principal component analysis in practice 166
III Geoscientific uses of PCA 168

Chapter 13 Least squares 173


I Ordinary Least Squares 173
II Geometric interpretation 176
III Statistical interpretation 176
IV Exotic Least Squares 178

Chapter 14 Discrete inverse theory 181


I Classes of inverse problems 181
II Reduced rank (TSVD) solution 182
III Tikhonov regularization 185
IV Recipe for underdetermined problems 186


Chapter 15 Linear Regression 187


I Regression basics 187
II Simple linear regression 188
III Model checking 190
IV Multiple Linear Regression 192
V What would a Bayesian do? 197

Chapter 16 Outlook 199

Appendices 199

Appendices 199

Part IV Mathematical Foundations 201

Appendix A Calculus Review 203


I Differential Calculus 203
II Integral Calculus 206
III Earth Science Applications 211

Appendix B Linear Algebra 219


I Vectors 219
II Matrices 220
III Matrices and Linear System of Equations 224
IV Linear Independence. Bases 227
V Inner Products and Orthonormality 229
VI Projections 231
VII Vector Spaces 234

Appendix C Circles and Spheres 239


I Trigonometry 239
II Complex Numbers 243
III Spherical harmonics 248


Appendix D Diagonalization 251


I Earth Science Motivation 251
II Eigendecomposition 252
III Singular value decomposition 255

Bibliography 259

Index 265

Chapter 1

PREAMBLE

I What is data analysis?


We all have a basic concept of what data means, so let us first define
the word analysis. Etymologically, it is the union of the Greek ana (up,
throughout) + lysis (a loosening, from lyein “to unfasten”). The online
resource Dictionary.com gives four definitions of it:

1. the separating of any material or abstract entity into its constituent elements (opposed to synthesis).

2. this process as a method of studying the nature of something or of determining its essential features and their relations: the grammatical analysis of a sentence.

3. a presentation, usually in writing, of the results of this process: The paper published an analysis of the political situation.

4. a philosophical method of exhibiting complex concepts or propositions as compounds or functions of more basic ones.

Science is fundamentally a reductionist enterprise, and Earth Science


is no exception. Faced with complex environmental systems, we wish to
break them into simpler parts that we can better understand (definition
1); we must learn this process (definition 2), which is why you are read-
ing these notes. We must also communicate the result of our analysis, in
verbal or written form (as definition 3 alludes to), but also graphically.
Indeed, a scientific analysis is only as good as it can be understood by a
third party, so it behooves the analyst to be educated (know the meth-
ods), rigorous (apply the methods correctly, in conformity with their
intended uses and underlying assumptions), honest (present candidly
without withholding inconvenient information), lucid about the results
(interpret them but not over interpret them), and transparent (commu-
nicate them clearly).

Now that we have defined the task that will occupy us for the rest of this book, let us take a step back. What do we mean by data? Data is a word about which most scientists think they agree, but turn out to disagree a fair bit. Data are numerical representations of quantities (variables)¹ which, we would like to believe, tell us about a system under investigation. In today's world, data refer both to measurements of physical, chemical or biological quantities, and to their representation in a numerical model, though most experimentalists will scoff at the idea that models produce "data", by which they mean "observations". Still, volumetrically the amount of model-generated data far outweighs any observational data, so this is something with which today's data analyst must contend. On the other hand, most experimentalists believe their data to be hard evidence about a phenomenon of interest. Is that so? Evidence – always. Hard, not always, depending on its associated uncertainties. Perhaps the most important task incumbent on the analyst is that of assessing and communicating these uncertainties, and deriving the scientific conclusions that are warranted by them.

¹ A single data point, for those who like Latin, is called a datum.

Hence, for the purposes of this book, data analysis is the task of decomposing data into simpler components, describing the key features of these components and their associated uncertainties, and communicating these key features. We have our work cut out for us. Let us begin.

II What should you expect from this book?


This book was written as notes to a class aptly named "Data Analysis in the Earth and Environmental Sciences" and, as such, introduces students to key methods used in those fields. Most of the examples are drawn from the Earth sciences, attempting to span as broad a spectrum thereof as possible. Many of the concepts, however, are quite general and apply to virtually any dataset. Similarly, we chose to illustrate the numerical methods in Python², but most programming languages have equivalent solutions available.

² Using the so-called "Pylab stack": NumPy/SciPy/Matplotlib.

Data analysis draws on many mathematical concepts, some of which


can be rather advanced. Our purpose is not to write an exercise in math-
ematical pedantry, but to borrow from mathematics what we need to
better understand the Earth. This is a balancing act: in these notes we
attempt to introduce the fundamental notions that are needed to un-
derstand the essence of each analysis technique, but there is only so
much that one can teach in a semester. If you did not study mathe-
matics very extensively, there will come a point where you will find
the mathematics abstruse, overwhelming, even obnoxious; conversely,
if you were trained in the mathematical sciences, you may at times be
saddened, even appalled, by the absence of proofs or the lack of rigor
in these pages. Our intent is to give each student the rudiments to un-


derstand and use these techniques –perhaps even teach them, one day.
As such, a modicum of mathematical education is required, amounting
roughly to Calculus III, introductory linear algebra and trigonometry
(Appendices A, B and C). Probability theory is our main foundation,
and none of it can be explained without the language of calculus and
linear algebra. If you never studied either of those, run for your life. If
you did study them, but never excelled at them, we hope that the desire
to understand the Earth will drive you to learn some new math! Ulti-
mately, however, if you want to use those techniques correctly, you will
have to understand where they come from, and to do that you will have
to delve into some mathematics – a black box approach will not suffice.
How much you want to do this is your choice, but it is hard to know too
much about this topic.
Each Earth Science department worth its salt has a class like this
one (at USC, ERTH425L); they each make different choices about what
topics to emphasize. In the process, they all achieve different compro-
mises between exposing students to the most common techniques used
in their field and giving them enough background to understand these
techniques at a deep level. It is our belief that knowledge is most ef-
ficiently acquired through practice; as such, a key component of this
class is a set of weekly laboratory practicums that put into use the no-
tions presented in the lectures. Equally important, and perhaps more
relevant for those of you who are becoming researchers (whether grad-
uate or undergraduate), a final project will apply the concepts taught here
on your own dataset. If you play your cards right, this will form the foun-
dation of a thesis chapter or paper, which will make you – and your
advisor – very happy indeed. Do we have your attention now?

III Structure
The class is articulated around three main themes:

1. Living in an uncertain world

2. Living in the temporal world

3. Living in multiple dimensions

Because uncertainties will turn out to matter at every step of the way,
we need to define a common language to deal with them. Such is the
object of statistics, rooted in probability theory, which are tackled in the
first third of these notes. Much of those fields of knowledge rely heavily
on results from calculus and linear algebra, which are briefly reviewed
in Appendices A & B. Next we focus on datasets that follow a sequential
order; we call those timeseries, though the independent variable need


not be time (it could as well be space). Here a fundamental mathemat-


ical tool is Fourier analysis, in which we delve at some length, with the
ultimate goal of performing spectral analysis and signal processing. For
this we need some basic notions of trigonometry and complex algebra,
reviewed in Appendix C. Of course, we shall not forget that all our spec-
tral estimates are just that – estimates – so the probabilistic language
developed in Part 1 will still come to good use.
We finish with a variety of topics involving the interaction of sev-
eral variables: this may include predicting one physical variable from
measurements of another (linear models, least squares), analysis of spa-
tiotemporal variability (principal component analysis), or the estimation
of hidden parameters from sparse measurements (geophysical inverse
theory). Much of this requires simplifying matrices to a diagonal form,
which we review in Appendix D.
Without further ado, we now begin our odyssey into data analysis.

IV Acknowledgements
Thorsten W. Becker, who put this class on the books at USC, was in-
strumental for much of the labs and least squares/inverse theory chap-
ters. Appendix A is a shameless (but authorized) rip-off of a similar
appendix in his and B. Kaus’ excellent e-book Numerical Modeling of
Earth Systems.
Dominique Guillot is to be commended for the excellent review of
linear algebra (Appendix B), and much of chapters 3 and 5.
I thank Elisa Ferreira, Laura Gelsomino and Antonio Mariano for their help with LaTeX typesetting, and all the ERTH425L students who have pointed out typographical errors over the years³. Despite my best efforts, the probability that some such errors remain in the following pages is close to unity, so I am grateful for any comment that can help us fix that.

³ Non-exhaustive list: Jianghao Wang, Kirstin Washington, Alexander Lusk, Kevin Milner, Billy Eymold, Joseph Ko.

Part I

Living in an uncertain world


Chapter 2

PROBABILITY THEORY

“Probability is the only satisfactory way to reason in an uncertain


world.”

Dennis Lindley

One seldom has all the information that one wishes for: often the
measurements are too few, too imprecise (or both) to allow us to con-
clude much with certainty. Yet this does not mean that we know nothing
– we may have a lot of information about the Earth system, and what we
need is a way to reason quantitatively about it, within these uncer-
tainties. This is the domain of probability theory, which encodes those
rules of reasoning in mathematical form.

I Probability Theory as Extended Logic


Probability theory provides an automatic means to conduct plausible
reasoning given scientific evidence (observations of the system under
consideration).

Deductive vs Plausible reasoning


Suppose some dark night a policeman walks down a street, apparently deserted¹. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a man wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman has little hesitation in concluding that the man is a burglar, and promptly arrests him. But by what reasoning does he arrive at that conclusion?

¹ The following two sections borrow heavily from the excellent, if somewhat peevish, work by Jaynes (2004).
It should be clear that our policeman’s conclusion was not a logical
deduction from the evidence; for there may have been a perfectly in-
nocent explanation for all this. It might be, for instance, that the man
was the owner of the jewelry store, that he was coming home from a
masquerade party, and didn’t have the key with him. However, just as

he walked past the store, a passing truck threw a stone through the
window, and he was only protecting his own property.
Now, while the policeman’s reasoning process was not a logical de-
duction from perfect knowledge of the situation, we will grant that it
had a certain degree of validity. The evidence above does not make the
man’s dishonesty certain, but it does make it extremely plausible. This
is an example of a kind of reasoning in which all of us are proficient,
without necessarily having learned the mathematical theory behind it.
Every day we make decisions based on imperfect information (e.g. it
might rain, should I take my umbrella?), and the fact that there is some
uncertainty in no way paralyzes our actions.
We therefore contrast this plausible reasoning with the deductive reason-
ing whose formalism is generally attributed to Aristotle (Fig. 2.1). The
latter is usually expressed as the repeated applications of the strong syl-
logism:
Figure 2.1: A bust of Aristotle, whose real face is quite uncertain.

(S1)  If A is true, then B is true;
      but A is true.
      Therefore B is true. ∴

and its inverse:

(S2)  If A is true, then B is true;
      but B is false.
      Therefore A is false. ∴

This is the reasoning we would conduct in an ideal world. In the


real world, there is almost never sufficient information to determine the
absolute veracity of any proposition. However, George Pólya argued
that we can still reason via weak syllogisms:

(S3)  If A is true, then B is true;
      but B is true.
      Therefore A becomes more plausible. ∴

Figure 2.2: The Hungarian mathematician George Pólya (December 13, 1887 – September 7, 1985).

Example
A ≡ "it will start to rain by 10 am at the latest";
B ≡ "clouds will appear before 10 am".

Observing clouds at 9:45am does not give us logical certainty that


it will rain by 10, only a strong inkling that it might be the case (if the
clouds are sufficiently dark). Note that the connection is purely logical,
here – not physical. We know that clouds are a necessary connection
for rain (Rain ⇒ Clouds), while the causal relationship goes the other


way (Clouds ⇒ Rain). Probability theory is not charged with inferring


physical causation, only logical connections. Another weak syllogism
is:

(S4)  If A is true, then B is true;
      but A is false.
      Therefore B becomes less plausible. ∴

In this case, the evidence does not prove that B is false, only that one
of the possible reasons why it might be true has been eliminated, so we
feel less confident about it. Scientific reasoning almost always looks like
S3 and S4.
Now, our policeman’s reasoning was not even of the above types. It
is a still weaker form of syllogism:

If A is true, then B becomes more plausible;
but B is true.
Therefore A becomes more plausible. ∴

Despite this apparent weakness, the burglary case should convince


you that such form of reasoning may approach the power of deductive
reasoning if the evidence is sufficiently strong. One subtlety is that we
are not only referring to present evidence, but also to past experience.
In this case, our policeman’s prior information is that people smuggling
jewelry out of a broken window tend to be thieves, not jewelers. His
judgement would be very different if he lived in a city where things were
usually otherwise, and he would soon learn to dismiss the evidence as
something perfectly ordinary.
Thus, our reasoning about the world involves several things: linking
statements together by way of degrees of plausibility, and prior infor-
mation about some of these statements. How our brain achieves this is
in fact immensely complex, and we conceal this complexity by calling it
common sense.

Designing a thinking robot


Of course, even with the most common sense in the world, anyone can
make mistakes. Objective and scientific though we’d like to be, we in-
evitably let other considerations enter our judgement about the plausi-
bility of a hypothesis: how elegantly it is stated, who stated it, whether
we had coffee that morning, etc. To eliminate the risk of inconsistent
judgement, and to be able to process a lot of information at once, we
would like to obtain a set of automatic rules for this kind of scientific
reasoning, so we can teach a computer to do it faster than we can, and


so our results can be reproducible by anyone given the same informa-


tion. What do we need to tell our robot?
We first need to operate on logical propositions. Given two proposi-
tions E and F, and their negation (or complement) Ē (also written NOT
E, or Ec ) and F̄ (NOT F, F c ), we define:

E ∩ F = Both E and F are true = AND    (2.1)
E ∪ F = Either E or F are true = OR    (2.2)

This is sometimes represented by a Venn diagram (Fig. 2.3):


These operations are duals of each other:

    (E ∩ F)^c = Ē ∪ F̄    (2.3a)
    (E ∪ F)^c = Ē ∩ F̄    (2.3b)

(e.g. the opposite of "rich and beautiful" is "poor OR ugly"). Eqs. (2.3) are known as DeMorgan's laws. One can show that this basic set of rules allows one to decompose any complex proposition into simpler ones, so all we need is a set of rules to assign degrees of plausibility to these propositions.

Figure 2.3: Venn diagram of two events, E and F, that are not mutually exclusive and overlap (the intersection E ∩ F is non-empty). The total area in gray is the union of E and F (E ∪ F).
Finally, we will denote by ( E| F ) the event “E given F”, meaning “E
given that F is true”.

Basic desiderata for plausible reasoning


Measurability Degrees of plausibility are represented by real numbers, which we
call a probability, denoted P.

Common sense If we had old information G which gets updated to G′ in such a way that E becomes more plausible (P(E|G′) > P(E|G)) but the plausibility of F is unchanged, then this can only increase (not decrease) the plausibility that both E and F are true: P(E ∩ F|G′) > P(E ∩ F|G). It also reduces the plausibility that E is false: P(Ē|G′) < P(Ē|G). That is, the rules of probability calculus must conform to intuitive human reasoning (common sense).

Consistency If a conclusion can be reasoned out in more than one way, then ev-
ery possible way must lead to the same probability. Also, if in two
problems our state of knowledge is identical, then we must assign
identical probabilities in both. Finally, we must always consider ALL
the information relevant to a question, without arbitrarily deciding to
ignore some of it. Our robot is therefore completely non-ideological.
Figure 2.4: American physicist Richard T.
Cox, ca 1939.
Cox’s theorem states essentially that these postulates are jointly suf-
ficient to design a robot that can reason scientifically about the world.
This is not, however, how probability theory was initially built, so a his-
torical digression is warranted.


A brief history of probability


• 1654: A gambler’s dispute led to the creation of a mathematical the-
ory of probability by two famous French mathematicians, Blaise Pas-
cal and Pierre de Fermat.
Antoine Gombaud, Chevalier de Méré, a French nobleman with an
interest in gaming and gambling questions, called Pascal’s attention
to an apparent contradiction concerning a popular dice game. This
problem and others posed by de Méré led to an exchange of letters
between Pascal and Fermat in which the fundamental principles of
probability theory were formulated for the first time. This was the first occurrence of "chance" and "randomness" in scientific parlance.

Figure 2.5: Blaise Pascal (Source: Wikimedia Commons).

• 1657: Christiaan Huygens published the first book on probability, entitled De Ratiociniis in Ludo Aleae. The major contributors during this
period were Jakob Bernoulli (1654-1705) and Abraham de Moivre
(1667-1754).

• 1812: Pierre Simon Laplace (1749-1827) introduced a host of new


ideas and mathematical techniques in his book, Théorie Analytique des
Probabilités. Before Laplace, probability theory was solely concerned
with developing a mathematical analysis of games of chance. Laplace
applied probabilistic ideas to many scientific and practical problems.
The theory of errors, actuarial mathematics, and statistical mechan-
ics are examples of some of the important applications of probability
theory developed in the 19th century.

• 1933: End of a long struggle to find a definition, settled by Kolmogorov's axioms (1933). The difficulty had to do with finding a mathematical definition that would conform to other (e.g. logical, intuitive) considerations.

Figure 2.6: Pierre de Fermat (Source: Wikimedia Commons).

II Notion of probability
It turns out that the notion of probability is difficult enough a concept
that statisticians, probability theorists, and philosophers still argue about
it. Here we present three views of the topic: the Frequentist view, the
Bayesian view, and the axiomatic view.

Figure 2.7: Pierre Simon Laplace (23 March 1749 – 5 March 1827).


Frequentist Interpretation
Let E be an event in a random experiment. An experiment might be
“Rolling a die”; an event (outcome) of this experiment might be “Ob-
taining the number 6”. Let us repeat this experiment N times, each time
updating a counter x_i:

    x_i = 1 if E occurred, 0 otherwise.    (2.4)

The frequentist interpretation holds that, as N gets large, the frequency of occurrence of E, f_N(E), converges to a constant, which we call its probability P(E):

    lim_{N→∞} (x_1 + x_2 + . . . + x_N)/N = P(E)    (Relative Frequency)    (2.5)

This is the so-called Law of Large Numbers. The probability can also be thought of as:

    P(E) = (Number of ways E can occur)/(Total number of possible outcomes).    (2.6)

Figure 2.8: Plot of f_N(E) for rolling one fair die as a function of the number of experiments, N. We can see that for large N, the relative frequency tends to 1/6, which we define as the probability of occurrence of the event.

We represent the two dice as (i, j), where i = result of throwing the 1st die
and j = result of throwing the 2nd die. Possible outcomes S are:

S = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)} ,

where |S| = 36 (| · | means "number of elements").
What is the probability of the event E = sum > 8 ?
Possibilities:

E = {(3, 6) , (4, 6) , (5, 6) , (6, 6) , (4, 5) , (5, 4) , (5, 5) , (6, 3) , (6, 4) , (6, 5)} ,
⇒ 10 Possibilities .

P(E) = |E|/|S| = 10/36 = 5/18 ≈ 0.278 .
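A quick numerical check of this result is straightforward in Python. The sketch below is ours (the sample size and seed are arbitrary choices): it estimates P(E) as a relative frequency from simulated rolls and compares it to the exact 5/18.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded for reproducibility
N = 100_000                              # number of simulated experiments
die1 = rng.integers(1, 7, size=N)        # first die, values 1..6
die2 = rng.integers(1, 7, size=N)        # second die

f_N = np.mean(die1 + die2 > 8)           # relative frequency of the event E
print(f"Simulated P(sum > 8) = {f_N:.3f}, exact = {10/36:.3f}")
```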

The Bayesian Viewpoint


In the frequentist interpretation, probabilities are the frequencies of oc-
currence of random events as proportions of a whole. But what if we
cannot repeat these events to measure a frequency? In the words of


D.G. Martinson, “one cannot re-run the experiment called the Earth”. Does
this mean that we can form no judgement about the plausibility of an
event, like A = “a meteorite killed the dinosaurs at the end of the Creta-
ceous” (which, by definition, only happened once)? If so, what would
be the use of probabilities in Earth Science?
In the Bayesian interpretation, probabilities are rationally coherent de-
grees of belief, or a degree of belief in a logical proposition given a body of well-
specified information. Put another way: it is a measure of a state of knowl-
edge about a problem.

General Idea:

→ Start from prior knowledge.

→ Collect observations about the system (perform experiments, go out in the field).

→ Update prior knowledge in light of new observations. This is encoded in a posterior probability.

Figure 2.9: Rev. Thomas Bayes, 1701–1761, Presbyterian minister.
Example
We roll a die a small number of times, with the result:

    1, 3, 2, 1, 1, 5, 4, 4, 3, 5, 6, 6, 3, 1, 5

What is p_i = P(X = i)?
Let's assume we have no reason to think that the die is loaded. A priori considerations of symmetry would lead us to set p_i = 1/6, i ∈ [1 : 6], but we should perhaps not be so definite (after all, if we decide from the outset that the die is fair, how could we possibly revise our judgement if the evidence suggests otherwise?). So instead we set a prior probability distribution Π(p_i) for p_i, illustrated in Fig. 2.10.
Now, after rolling the die 15 times, we observe the sequence

    X = {1, 3, 2, 1, 1, 5, 4, 4, 3, 5, 6, 6, 3, 1, 5}    (2.7)

after which we update our estimate of p_i:

    P(p_i|X) = P(X|p_i) · P(p_i) / ∑_i P(X|p_i) · P(p_i)

where P(X|p_i) is the likelihood of obtaining X given p_i, P(p_i) is the prior, and P(p_i|X) is the updated posterior probability.

Figure 2.10: Prior distribution of p_i for rolling a die, with mean 1/6.

And we conclude that this limited dataset contains no damning evidence that the die is loaded.
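For the curious, here is one way to carry out this update numerically. It is only a sketch: the prior Π(p_i) above is left generic, so we assume a flat Dirichlet prior on (p_1, . . . , p_6), for which the posterior update amounts to adding the observed counts to the prior pseudo-counts.

```python
import numpy as np

rolls = np.array([1, 3, 2, 1, 1, 5, 4, 4, 3, 5, 6, 6, 3, 1, 5])
counts = np.bincount(rolls, minlength=7)[1:]   # occurrences of faces 1..6

alpha_prior = np.ones(6)            # flat Dirichlet prior: all faces equally plausible
alpha_post = alpha_prior + counts   # conjugate update: prior pseudo-counts + data

post_mean = alpha_post / alpha_post.sum()      # posterior mean of each p_i
print(np.round(post_mean, 3))   # compare to 1/6 ≈ 0.167: no strong evidence of loading
```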


Advantages / Disadvantages of Bayesian framework:

Advantage One can now compute the probability of any event, regardless of
whether it can be repeated ad nauseam.

Advantage The framework incorporates all information relevant to the problem, including information known a priori (e.g. before collecting the data). This may lead to better estimation, especially when the sample size is too small for frequentist theorems to apply.

Disadvantage We now rely on a subjective prior, and different priors may lead to
different results. On the other hand, the prior can be justified by the-
ory (e.g. considerations of symmetry, physical laws) or other exper-
iments. The role of prior information becomes less important as the
number of observations increases, so in the limit n → ∞, the frequen-
tist and Bayesian interpretations often lead to the same results.

Disadvantage In many cases Bayesian analysis is more cumbersome and more com-
putationally demanding.

Laplace’s ideas, in hindsight, would now be considered Bayesian. Iron-


ically, it is not clear whether Bayes himself was a Bayesian.2 2
Reading: "Who Discovered Bayes’ Theo-
rem", Stephen M. Stigler, The American
Statistician, Nov. 1983, vol. 37, n◦ 4.

Axiomatic Definition of Probability


In 1933, Andrei Kolmogorov devised an axiomatic definition of prob-
abilities, which borrows heavily from set theory. This is the definition
one finds in most textbooks. It turns out that the very same rules govern
operations on logical propositions (Boolean algebra), given a judicious
choice of notation, so what we say for sets will apply equally well to
propositions. We start by defining some basic terms:

Elementary set theory

Sample space aka "Universe" or "Compound event". Denoted by Ω, this defines all
the possible outcomes of the experiment. For example, if the experi-
ment is tossing a coin, Ω = {head, tail}. If tossing a single six-sided
die, the sample space is Ω = {1, 2, 3, 4, 5, 6}. For some kinds of exper-
iments, there may be two or more plausible sample spaces available.
For example, Temperature = any number from -60◦ C to 60◦ C, Precip
from 0 to 10 m/day.

Events Any subset of the Universe. e.g. die = 4 or Temperature= 0◦ C.


Mutually exclusive events are events that cannot occur simultaneously,
e.g. temperature < -20°C and liquid rain. In a Venn diagram, these would correspond to non-overlapping disks.

Figure 2.11: Andrei Kolmogorov, Russian mathematician (1903 – 1987).


For any event E,

E ∪ Ω = Ω, (2.8a)
E ∩ Ω = E. (2.8b)

Operations We have seen the basic operations of complement, union and inter-
section in Sect. I. To illustrate, let us cast a die and denote the events
A = {“Outcome is an even number ”} and B = {“ Outcome > 3 ”}.

Notation      English translation      Numeric Outcome
A ∩ B         both A and B             {4, 6}
A ∪ B         either A or B            {2, 4, 5, 6}
(A ∩ B)^c     not both A and B         Ā ∪ B̄ = {1, 2, 3, 5}
(A ∪ B)^c     neither A nor B          Ā ∩ B̄ = {1, 3}
Note: both operators are commutative (e.g. A ∩ B = B ∩ A) and dis-
tributive: A ∩ ( B ∪ C ) = ( A ∩ B) ∪ ( A ∩ C )

Partition A partition is a subdivision of Ω made of events {ωi , i = 1 · · · n} that


verify:

• ∀i, ω_i ≠ ∅ (non-emptiness)
• ∀(i, j), i ≠ j: ω_i ∩ ω_j = ∅ (mutual exclusivity)
• ∪_{i=1}^{n} ω_i = Ω (complete coverage)

One such partition is represented in Fig. 2.12.

Figure 2.12: Partition of a Set. Credit: Wikipedia.

Probability Axioms

Given this lingo borrowed from set theory, Kolmogorov defined a


probability P using the following axioms:

Positivity  ∀E, P(E) ∈ R and P(E) ≥ 0

Unitarity  P(Ω) = 1

Additivity  if E_1, E_2, · · · are mutually exclusive propositions, then P(E_1 ∪ E_2 ∪ · · ·) = ∑_{i=1}^{∞} P(E_i).³

³ As a special case, we have the much simpler finite additivity P(E_1 ∪ E_2) = P(E_1) + P(E_2). We could have started from that point, but it turns out that some strange things may happen when trying to generalize this finite additivity to countable additivity (Borel-Kolmogorov paradox), so this complicated definition is warranted. Jaynes (2004) contends, however, that if we abide by Cox's principles and apply them carefully, we never run into those kinds of embarrassing paradoxes.

It turns out that these axioms produce a probability that has exactly the same properties as the logical desiderata of Pólya and Cox. So why the fuss? First, you should feel an immense relief: in designing a thinking robot out of logical principles, we wind up with exactly the same


operating principles that mathematicians have imposed on probabil-


ity for a long time. It is not completely a coincidence, of course, since
Kolmogorov intended his axioms to provide a rigorous foundation for
the use of probabilities in theory and applications. What is remark-
able, however, is that his understanding of probability was rooted in
set and measure theory, while Pólya and Cox’s viewpoint is rooted in
logic and physics. That the two give the same answer is quite remark-
able, and means that we can apply the usual rules of probability cal-
culus with the reified understanding that they describe a way to reason
about logical propositions, and are not limited to esoteric sets or games
of chance or repeated events. Further, in the limit of certainty (probabil-
ities approaching unity), one can show that the principles of deductive
logic dear to Aristotle are recovered as special cases of the more general
probability theory as extended logic. Satisfied as we are about this
mind-blowing realization, let us quit philosophizing and start actually
computing probabilities.

III The Calculus of Probabilities


Basic Properties
A direct consequence of these axioms are the following properties:

A ⊆ B =⇒ P( A) ≤ P( B) (2.9a)
∀ A P( A) ∈ [0, 1] (2.9b)
P( Ā) = 1 − P( A) (2.9c)
P( ∅ ) = 0 (2.9d)
P( A ∪ B) = P( A) + P( B) − P( A ∩ B) (sum rule) (2.9e)

With three sets (Fig. 2.13): P( A ∪ B ∪ C ) = P( A) + P( B) + P(C ) −


P( A ∩ B ) − P( A ∩ C ) − P( B ∩ C ) + P( A ∩ B ∩ C ).
More generally,

    P( ∪_{i=1}^{n} A_i ) = ∑_{k=1}^{n} (−1)^{k−1} ∑_{1≤i_1<i_2<...<i_k≤n} P( A_{i_1} ∩ A_{i_2} ∩ . . . ∩ A_{i_k} )    (2.10)

Figure 2.13: Principle of inclusion-exclusion illustrated on three overlapping sets. Credit: Wikipedia.
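As a sanity check, the sum rule and the three-set inclusion-exclusion formula are easy to verify by brute force on small finite sets. A minimal sketch (the three example events are our own arbitrary choices):

```python
# Verify three-set inclusion-exclusion for a single fair die, with P(E) = |E| / |Omega|
omega = set(range(1, 7))                    # sample space of one die
A, B, C = {2, 4, 6}, {4, 5, 6}, {1, 2, 3}   # three arbitrary events

P = lambda E: len(E) / len(omega)           # equiprobable outcomes

lhs = P(A | B | C)                          # P(A ∪ B ∪ C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
print(lhs, rhs, abs(lhs - rhs) < 1e-12)     # both equal 1.0 for these sets
```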

Conditional Probabilities

P(A|B) = Probability of A given that B occurred
       ≡ P(A ∩ B) / P(B) .    (2.11)


This is equivalent to:

P ( A ∩ B) = P ( A| B) · P ( B) (Product rule) (2.12)


Which is more intuitive; it says that the probability of A and B both be-
ing true is the probability of A assuming B is true, times the probability
that B is true. In fact, because ∩ is a commutative operation, the reverse
applies, so we also have:

P ( A ∩ B) = P ( A| B) · P ( B) = P ( B| A) · P ( A) (2.13)

This symmetry is the basis of Bayes’ rule4 . 4


Which, once again, may or may not have
been enunciated by Bayes, but we call it
that way regardless
Independence
Two events A and B are said to be independent if and only if:

P ( A| B) = P ( A) & P ( B| A) = P ( B) (2.14)

Equivalently,
P ( A ∩ B ) = P( A ) · P( B ) (2.15)
This means that knowing B makes no difference about our state of
knowledge about A, and vice versa. A more advanced concept is that of
conditional independence. Two events A and B are said to be conditionally
independent given C if and only if:

P (( A ∩ B)|C ) = P( A|C ) · P( B|C ) (2.16)

Example
We roll two dice:

X1 = result of 1st die . X2 = result of 2nd die .

1. Given the events:

A : sum > 8 , B : X1 = 6 .

We want to compute:

P (sum > 8 | X1 = 6) = P ( X1 + X2 > 8 | X1 = 6)

We can compute the intersection:

A ∩ B = {X1 + X2 > 8 and X1 = 6} = {(6, 3), (6, 4), (6, 5), (6, 6)} ,

and by definition Eq. (2.11):


P(A ∩ B) = 4/36 ,    P(B) = 6/36 ,

⇒ P(A|B) = (4/36)/(6/36) = 4/6 = 2/3 .


Equivalently, we could recognize that once X1 = 6, there are only 4 ways


to make the sum greater than 8 ( X2 = (3, 4, 5, 6)), so the probability is 4/6,
hence 2/3. We find the second way faster and more intuitive, but the two are equivalent.

2. P(X1 = 2 ∩ X2 = 3): the events are independent, so

    P(X1 = 2 ∩ X2 = 3) = P(X1 = 2) · P(X2 = 3) = 1/6 × 1/6 = 1/36 .
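Both computations can be checked by simulation: estimate P(A|B) as a relative frequency among only those trials where B occurred. A minimal sketch (the variable names and sample size are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x1 = rng.integers(1, 7, size=N)        # first die
x2 = rng.integers(1, 7, size=N)        # second die

B = (x1 == 6)                          # conditioning event: X1 = 6
A = (x1 + x2 > 8)                      # event of interest: sum > 8

print("P(A|B) ≈", A[B].mean())                                # should approach 2/3
print("P(X1=2 and X2=3) ≈", ((x1 == 2) & (x2 == 3)).mean())   # should approach 1/36
```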

Law of Total Probabilities


Assume {ω1 , . . . , ωn } is a partition of Ω. If E is some event, then via
Eq. (2.8b):
    P(E) = P(E ∩ Ω)

But Ω = ∪_{i=1}^{n} ω_i, so:

    P(E) = P( E ∩ ( ∪_{i=1}^{n} ω_i ) )

And by definition of conditional probabilities:

P ( E ∩ ωi ) = P ( E | ωi ) · P ( ωi )

But,

(E ∩ ω_i) ∩ (E ∩ ω_j) = ∅ ,  (i ≠ j)    (Incompatible events)    (2.17)

    ⇒ P(E) = P( E ∩ ( ∪_i ω_i ) )
           = P( (E ∩ ω_1) ∪ (E ∩ ω_2) ∪ . . . ∪ (E ∩ ω_n) )
           = P(E ∩ ω_1) + P(E ∩ ω_2) + . . . + P(E ∩ ω_n)
           = P(E|ω_1) · P(ω_1) + . . . + P(E|ω_n) · P(ω_n)
           = ∑_i P(E|ω_i) · P(ω_i)

Hence:

    P(E) = ∑_i P(E|ω_i) · P(ω_i)    (Law of Total Probabilities)    (2.18)

This is very useful for computing probabilities by considering dif-


ferent events covering all possibilities. What’s almost magical about it


is that we may choose any partition we’d like, so we can pick one that
makes the conditional probabilities easy to evaluate. Then it is just a
matter of adding and multiplying.
Note that the above formula bears more than a passing resemblance
to Eq. (B.15). This is not a complete coincidence: you may think of a
partition as a sort of “basis” in a probability space. Thus, if you know
the probabilities of each partition and can express the probabilities of
each event in terms of the partition, you are done.

Example
5 jars containing white (w) and black (b) balls.

Jar I: 2w, 1b;  Jar II: 2w, 1b;  Jar III: 10b;  Jar IV: 3w, 1b;  Jar V: 3w, 1b.

Figure 2.14: Illustrating the experiment of picking a random ball from 5 jars containing white and black balls.

1. Pick a jar at random (P = 1/5).

2. We draw a ball at random in the chosen jar.

3. Drawing a black ball.

What is the probability of getting a black ball? To answer this, we use the
law of total probabilities.
Let Ei = { Choosing the ith jar } (i = 1, 2, . . . , 5), where { E1 , . . . , E5 } is a
partition.
Now, P(E_i) = 1/5 by hypothesis, and:

P(B|E_i) = Probability of drawing a black ball from jar E_i
         = 1/3 for jars I, II ;   1 for jar III ;   1/4 for jars IV, V .

⇒ P(B) = ∑_{i=1}^{5} P(B|E_i) · P(E_i) = 13/30    (law of total probabilities).
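The same arithmetic is easy to script, which makes it painless to change the jar contents and re-evaluate. A minimal sketch using the jar compositions above:

```python
# Law of total probabilities for the 5-jar example: jars map to (white, black) counts
jars = {"I": (2, 1), "II": (2, 1), "III": (0, 10), "IV": (3, 1), "V": (3, 1)}

p_jar = 1 / len(jars)                        # each jar equally likely to be chosen
p_black = sum(p_jar * b / (w + b) for (w, b) in jars.values())
print(p_black)                               # 0.4333... = 13/30
```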


Bayes’ Rule

Elementary Case
This is a somewhat obvious application of the definition of conditional
probabilities:

    P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A) ,
    ⇒ P(A|B) = P(B|A) · P(A) / P(B)    (2.19)

(This expresses P(A|B) as a function of P(B|A).)

We obtain the simplest form of Bayes’ rule:

    P(A|B) = P(B|A) · P(A) / P(B)

where P(B|A) is the likelihood, P(A) is the prior, and P(B) is a normalizing constant.

As we will see later, sometimes one of the two conditional probabil-


ities is easier to compute than the other, so this allows one to "switch the
conditionals”.
More generally, suppose {E_i} is a partition with P(E_i) > 0. For any B:

    P(E_i|B) = P(E_i ∩ B) / P(B) ,
    P(E_i ∩ B) = P(B|E_i) · P(E_i) ,
    P(B) = ∑_j P(B|E_j) · P(E_j)    (law of total probabilities),

    ⇒ P(E_i|B) = P(E_i) · P(B|E_i) / ∑_j P(B|E_j) · P(E_j)    (Bayes' Rule)    (2.20)

Application: Medical Diagnostic

We want to know if a person chosen at random is affected by a disease


using the result of a medical test. Define the events:

S : The person is sick


S̄ : The person is healthy
T : The test is positive.
T̄ : The test is negative.

Now, the test is not perfect: it could come back negative when the
person is in fact sick (“false negative”), and come back positive while


the person is in fact healthy (“false positive”). We define:


    P(T̄|S) := f_n   (probability of a false negative)
    P(T|S̄) := f_p   (probability of a false positive)

We would like to know the probabilities:

P (S| T ) Being sick when the test is positive.


P (S̄| T̄ ) Not being sick when the test is negative.

To solve this, we can apply Bayes’ theorem using the partition Ω =


{S, S̄} and we set P(S) = f .
    P(S̄|T̄) = P(T̄|S̄) · P(S̄) / [ P(T̄|S̄) · P(S̄) + P(T̄|S) · P(S) ]
           = (1 − f_p)(1 − f) / [ (1 − f_p)(1 − f) + f_n f ] ,

and

    P(S|T) = (1 − f_n) f / [ (1 − f_n) f + f_p (1 − f) ] .
We know from epidemiological studies that the frequency of occurrence of the disease f is one in 200. As we vary f_p for constant f_n:

• f_n = 1%, f_p = 1%   →  P(S|T) = 0.33
• f_n = 1%, f_p = 0.5%  →  P(S|T) = 0.498
• f_n = 1%, f_p = 0.1%  →  P(S|T) = 0.83

That is, as the rate of false positives drops, we get more and more confident about the diagnosis. Worryingly enough, study after study has found that only about 15% of medical doctors (the people whom we pay to interpret such test results) get this right⁵. Most have no clue how to conduct such a simple calculation, and end up "transposing the conditional" (giving P(T|S) instead of P(S|T)). As the example shows, those will in general be very different, except in special cases (homework: what values of f_n and f_p are needed for the equality P(S|T) = P(T|S) to hold? The answer should depend on their ratio).

⁵ For this account and for a simple introduction to Bayes' theorem, see http://yudkowsky.net/rational/bayes
A remarkable feature is how counterintuitive the results may be for
what appear to be relatively small differences between f p and f n . This is
a reminder that the rules of probability calculus must be applied care-
fully and thoroughly, lest one commit some grave mistakes.
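Readers who want to avoid transposing the conditional themselves can encode Bayes' rule once and reuse it. The sketch below reproduces the three cases above; the function and variable names are ours, not a standard library's:

```python
def p_sick_given_positive(f, fn, fp):
    """Posterior P(S|T) from prevalence f, false-negative rate fn, false-positive rate fp."""
    return (1 - fn) * f / ((1 - fn) * f + fp * (1 - f))

f = 1 / 200                                    # disease prevalence
for fp in (0.01, 0.005, 0.001):                # vary the false-positive rate
    print(f"fp = {fp:.3f}: P(S|T) = {p_sick_given_positive(f, 0.01, fp):.3f}")
# prints roughly 0.33, 0.50, 0.83, matching the values quoted above
```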


Bayes’ rule is central to Bayesian inference, which we will touch on


briefly in Chapter 5. In this day and age, Bayesians are slowly taking
over the world, so having even a vague idea about this theorem can be
life-saving. For more than a vague idea, read Gelman et al. (2013).
Finally, note that even though we applied Bayes’ rule, we were quite
happy to equate a proportion of samples in a population (that is, a fre-
quency of occurrence) with a probability. The use of Bayes’ rule does
not have to make you a strict Bayesian, and in such a case the frequen-
tist interpretation of probability is most convenient and defensible.
Exercise (Cloud Seeding)
The effect of cloud seeding on the production of damaging hail is investigated by seeding and not seeding an equal number of candidate storms. Suppose the probability of damaging hail from a seeded storm is 0.1, and the probability of damaging hail from an unseeded storm is 0.4. If a candidate storm has just produced damaging hail, what is the probability that it was seeded?⁶

⁶ Taken from Wilks (2011)

Exercise (Hearing the sea in a sea shell)


Listening to the sea in a seashell, what is the probability that you may be hearing
the sea itself? (Fig. 2.15) .

Figure 2.15: Bayes and sea-shells. http://xkcd.com/1236/

Chapter 3

PROBABILITY DISTRIBUTIONS

“If we look at the way the universe behaves, quantum mechanics gives
us fundamental, unavoidable indeterminacy, so that alternative
histories of the universe can be assigned probability."

Murray Gell-Mann

One message from quantum mechanics is that some quantities can-


not be known with certainty (at least simultaneously), so the Universe
must be described in the language of probabilities. In this chapter we
introduce several commonly used probability distributions with broad
applicability in the Earth Sciences. In each case we (briefly) introduce
the necessary mathematical apparatus to derive their essential proper-
ties, and cite some examples of their use in Earth Sciences.

I Random Variables. Probability Laws


Random Variables
Definition 1 A random variable (RV) is a function X : Ω −→ R which
associates a real number to every event.

X (Ω) is the image of Ω, the space of all possible values taken by X. In


many cases it is a restricted subset of R.
Example
Result of rolling a die, amount of rainfall in a day, magnitude of an earthquake,
age of a rock, etc.

Random variables provide the link between the oh-so-esoteric "sample space" and the everyday world of matter-of-fact measurements. They are the basis for applying probability theory to laboratory or field data, whether or not they were generated by a truly random process.¹ However, there is some order to this randomness: RVs can only take values within a certain interval, and do so according to a probability law, which we call a distribution.

¹ This does not necessarily mean that "God is playing with dice", as Einstein famously said, but that we have uncertain information concerning the process. Even if you are a staunch determinist, it is hard to argue with the fact that measurements always carry some degree of uncertainty.

Since we are dealing with real-world data (which only come in dis-
crete form e.g. because of digitization), most observations are discrete,
though continuous RVs provide an incredibly useful lens through which
to view some Earth processes. Note that an RV could be complex; it can
also be a vector or scalar.

Distribution Functions
Any discrete random variable admits a probability mass function (PMF)
f defined by, ∀ xi ∈ X (Ω):

f ( xi ) = P ( X = xi ) (3.1)

which describes the “weight” of each xi in the total outcome. The PMF is
often presented as a histogram, binning outcomes together in relatively
coarse chunks.

Figure 3.1: US income distribution (2010 data), as represented on http://visualizingeconomics.com/. Several points are apparent: the distribution peaks around $20K, but the median income is $49,400, and the mean is $67,500: this is typical of a skewed distribution. Skewness is the mathematical way of describing what economists call inequality: the top 1% of earners famously control about 40% of the wealth (see http://youtu.be/QPKKQnijnsM). This is also a heavy-tailed distribution.

An example is the US income distribution of Fig. 3.1. How such a


distribution may be estimated from observations will be discussed next;
how it can be described by numbers or summarized by graphs will be
discussed in Sect. II. An associated definition is the Cumulative Distri-
bution Function (CDF):
    ∀x ∈ R,  F(x) = P(X ≤ x) = ∑_{i / x_i ≤ x} P(X = x_i) = ∑_{i / x_i ≤ x} f(x_i) .    (3.2)

so it is the sum of the PMF up until the largest xi such that xi ≤ x.


A related definition is the so-called survivor function

P( X ≥ x ) = 1 − F ( x ) (3.3)
(thus named by biomedical statisticians looking at survival times of pa-
tients under a pharmaceutical treatment). An example of survivor func-
tion that is foundational to seismology is the Gutenberg-Richter law
(Fig. 3.2).
Properties:

• F is always increasing, and piecewise continuous (not differentiable at each x_i).

• F(x) = 0 for x < x_min, F(x) = 1 for x ≥ x_max.

• The CDF and PDF can be obtained from each other easily (by sum or differentiation).

Figure 3.2: Earthquake magnitude distribution showing a power-law behavior over 6 decades. The Y axis represents the probability of observing an earthquake magnitude greater than m (the X axis), in log scale. The graph follows log10 N(M > m) ∝ −bm, where b is the Gutenberg – Richter exponent, b = 1 (dashed red line). The roll-off for m < 2 is due to difficulties with detecting very small earthquakes. From http://www.pnas.org/content/99/suppl_1/2509.short

Because of the first property, F is a bijection², so it can be inverted. This means that for every F one can find an inverse F^{−1} such that, for any (x, y) such that y = F(x):

    F^{−1}(F(x)) = x  and  F(F^{−1}(y)) = y    (3.4)

F^{−1} is called the quantile function, often denoted by the letter Q. It is how things like confidence intervals and percentiles are obtained.

² A one-to-one mapping: i.e. for each y there is only one x such that y = F(x). In other words, F allows one to travel from x to y and vice versa, uniquely.
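In practice the CDF, survivor function and quantile function are easily approximated from a sample. A minimal sketch with a synthetic dataset (the exponential draws below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1000)   # synthetic, skewed sample

x0 = 3.0
F_x0 = np.mean(x <= x0)                     # empirical CDF evaluated at x0
print("F(3)     ≈", F_x0)
print("1 - F(3) ≈", 1 - F_x0)               # empirical survivor function
print("q_0.5    ≈", np.quantile(x, 0.5))    # median, via the quantile function
print("F(q_0.5) ≈", np.mean(x <= np.quantile(x, 0.5)))  # ≈ 0.5, as Eq. (3.4) suggests
```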

Probability Density Function


So far we have focused on RVs taking discrete values (i.e. values that
can be counted). An important class of variables are however contin-
uous. In a continuous medium (like air, water, the Earth’s mantle, etc)
all variables take by definition a continuum of values over some inter-
val [ a; b] ⊂ R, so their CDF F ( x ) is generally a smooth function of x:
it can be differentiated, yielding the probability density function (PDF)
f(x) = F′(x), which is analogous to the PMF. A PDF verifies 3 prop-
erties:

1. f is continuous

2. ∀ x ∈ R, f ( x ) ≥ 0
3. f has unit mass: ∫_{−∞}^{+∞} f(x) dx = 1 (Appendix A, Sect. III).

Indeed, one can write F(x) = ∫_{−∞}^{x} f(u) du, from which it follows that:

• F is continuously differentiable (being the integral of a continuous


function)

• F is monotonically increasing (since its derivative f ( x ) is positive)


• F goes from 0 to 1:

    lim_{x→−∞} F(x) = 0 ;    lim_{x→+∞} F(x) = 1 ;

F is also called the cumulative distribution function, and is exactly


equivalent to the CDF of a discrete random variable. In fact, one defini-
tion of a discrete RV is having a staircase-like CDF (with jumps at each
xi ), while continuous RVs have smooth, continuous CDFs.
While the CDFs are equivalent for discrete and continuous RVs, there is, however, a crucial difference between the PMF and the PDF: by definition, the probability that X takes values between x_0 and x_0 + δx is F(x_0 + δx) − F(x_0). But we have just seen that F is continuously differentiable, and its derivative is f, so, applying an order one Taylor expansion³:

    P(x_0 ≤ X ≤ x_0 + δx) = f(x_0) δx + O((δx)^2)    (3.5)

³ Appendix A, Sect. III
For δx small enough, this probability is close to f(x_0)δx: this quantity is the probability of X being in a neighborhood of x_0 of length δx. Notice, however, that as δx approaches 0, this probability approaches zero as well, since f is well-behaved.
Thus we have the mind-altering result that for a continuous random variable, P(X = x_0) = 0 (that is, its PMF is zero at every point)! Does this mean that the variable can never reach this value? No: what it means is that we can never know this value with absolute certainty, so the probability of exactly observing the value x_0 is zero; the probability of observing something close to x_0 is finite, however. In other words, there will always be some uncertainty about our knowledge of X, but thankfully we can get close enough to it for practical purposes⁴.

⁴ JOKE: A mathematician and a physicist agree to a psychological experiment. The mathematician is put in a chair in a large empty room and a very tasty cake is placed on a table at the other end of the room. The psychologist explains, "You are to remain in your chair. Every five minutes, I will move your chair to a position halfway between its current location and the cake." The mathematician looks at the psychologist in disgust. "What? I'm not going to go through this. You know I'll never reach the cake!" And he gets up and storms out. The psychologist makes a note on his clipboard and ushers the physicist in. He explains the situation, and the physicist's eyes light up and he starts drooling. The psychologist is a bit confused. "Don't you realize that you'll never reach it exactly?" The physicist smiles and replies, "Of course! But I'll get close enough for all practical purposes!"

Empirical Determination of a Distribution

Imagine X = {x_1, x_2, · · · , x_n} is a collection of measurements. How do we find its distribution?
poses!"
Histograms

The simplest strategy is to bin in different (usually regularly spaced)


classes of x: how many between -5 and -4? how many between -3 and
-2? And so on. We plot this frequency of occurrence as a function of
the midpoint of each interval, say, and kaboom: this is the famed his-
togram⁵.

⁵ np.histogram()

For instance, we plot in Fig. 3.3 the empirical PMF of precipitation


in Boulder, CO. 50 bins were chosen, which seems reasonable given the
large number of observations (1192). However, perhaps the little ups
and downs are artifacts of choosing such a large number of bins, and


would disappear if we chose a coarser bin size. This is tremendously im-


portant, because one might be tempted to interpret those peaks, whereas
in fact they could arise from chance alone, simply because of low sam-
ple numbers in a few bins. Remember that the choice of the number of
bins is subjective and one always has to try several⁶. Lab #3 will have you investigate this granularity issue in detail.

⁶ By default, np.histogram() uses 10 bins, regardless of sample size.

Figure 3.3: Empirical determination of the probability mass function of precipitation in Boulder, CO (N = 1192, 50 bins). The panel compares the histogram with KDE estimates (Normal and Epanechikov kernels; default, Scott and Silverman bandwidths). X axis: Precip (mm/day); Y axis: Probability Mass Function. Data: GHCN.

Kernel Density Estimation

In the case of precipitation, X (Ω) is the space of positive real num-


bers (R+ ), so we might want to estimate a continuous distribution (that
is, a probability density function), rather than a PMF. How do we do this
from discrete observations? Kernel Density Estimation achieves this by
using smoothing functions called kernels:

    KDE(X) = (1/(nh)) ∑_{i=1}^{n} K( (x − x_i) / h )    (3.6)

where K is a smooth function of unit mass with good localization,


and h is the bandwidth, which controls how many neighboring points
are being used to estimate f around xi . The two main choices of kernel
are:

    K(t) = (3/4)(1 − t^2)            (Epanechikov),
    K(t) = (1/√(2π)) e^{−t^2/2}      (Gaussian).        (3.7)


(though there are plenty of other reasonable choices).


In the end, this choice doesn’t matter nearly as much as the choice
of the bandwidth h. A large h includes many points, so will lead to a
very smooth estimate, which might gloss over important details. On
the other hand, too small an h could lead to a jagged PDF full of spurious
spikes. As with everything in life, there is a happy Middle Path (Fig. 3.4).
Choice of initial smoothing parameter: Two main methods on the
market, used by most codes out there:
Silverman (1986)  h = min(0.9 ∗ σ, 0.666 ∗ IQR)/n^{1/5}

Scott (1992)  h = c ∗ IQR/n^{1/3}, where c ∈ [2; 2.6] (2.6 for Gaussian kernels, smaller otherwise)

Figure 3.4: One representation of the Middle Path.
Let us emphasize that these are initial estimates and that a proper
determination will always involve trial and error to see whether this
choice mattered. Luckily for you, all of this is beautifully coded up in
the amazing seaborn package, which we’ll abbreviate as sns.
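As a concrete illustration (a sketch only, not the lab code: the synthetic "rainfall" data and the bandwidth values below are our own choices), one can compare a histogram against kernel density estimates at several bandwidths:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
precip = rng.gamma(shape=0.8, scale=2.0, size=1192)   # synthetic, skewed like rainfall

fig, ax = plt.subplots()
ax.hist(precip, bins=50, density=True, alpha=0.4, label="histogram (50 bins)")
for bw in (0.2, 0.5, 1.0):                            # try several bandwidths
    sns.kdeplot(precip, bw_adjust=bw, ax=ax, label=f"KDE, bw_adjust={bw}")
ax.set_xlabel("Precip (mm/day)")
ax.legend()
plt.show()
```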

Quantiles and Percentiles

Often we are interested in the values below which lie a certain frac-
tion of the mass of a certain distribution. If you’ve ever spoken of me-
dian income or the percentile of your GRE score, then you are already
familiar with the notion. The formal way of obtaining this is through the inverse CDF, F^{-1}, aka the quantile function (many choices exist in Python, from the simple np.percentile() to the comprehensive scipy.stats.mstats.mquantiles(), modeled after the R quantile function, to internal implementations in pandas and seaborn, to name just a few). More precisely, we define the α-quantile q_α as the number such that:

$$P(X \leq q_\alpha) = \alpha \qquad (3.8)$$

That is, q_α = F^{-1}(α). Unless the inverse CDF can be obtained analytically (which can prove impossible even for usual distributions), it is approximated using numerical root-finding methods (cf. Appendix A, Sect. III). Going back to Fig. 3.1, what fraction of people earn less than $225,000 a year? About 98%. Below what income do 98% of Americans fall? About $225,000. The first answer used the CDF, the second used the quantile function. The values q_α are called, not coincidentally, quantiles. Remarkable quantiles are:
the median the 50% quantile (q0.50 ), such that 50% of the mass of the
distribution is to the left of it, 50% of it on the right of it.

terciles split the distribution into 3 regions of equal mass: (0, 1/3, 2/3, 1)

quartiles split the distribution into 4 regions of equal mass (0, 25%, 50%,
75%,100%)

deciles idem with 10 regions: 10%, 20%, · · · , 100%

percentiles idem with 100 regions.
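
A minimal illustration with NumPy and SciPy (the numbers are arbitrary):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=20.0, scale=5.0, size=1000)

    quartiles = np.percentile(sample, [25, 50, 75])     # sample quartiles
    deciles = np.percentile(sample, np.arange(10, 100, 10))

    # quantile function (inverse CDF) of a theoretical distribution
    q95 = norm.ppf(0.95, loc=20.0, scale=5.0)           # 95% quantile of N(20, 5^2)
    p25 = norm.cdf(25.0, loc=20.0, scale=5.0)           # CDF: P(X <= 25)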


Distribution Fitting

Often it is of interest to fit a known (“parametric”) distribution to an


empirical PMF. This can only be done after one has surveyed the data
using the above techniques, and having noticed that the data follow a
particular pattern. Alternatively, one may have a priori ideas of why the
data should follow a certain distribution (e.g. a Gutenberg-Richter law),
so it is interesting to plot that theoretical distribution on the same graph.
In general, however, this is a topic of statistical estimation, which we will
tackle in Chapter 5. The fitter Python package allows one to scan up to 80 distributional candidates to see which one fits best.
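
As a teaser, here is a sketch of a single parametric fit with scipy.stats (the choice of a Gamma candidate is just an assumption for illustration; fitter automates the comparison across many candidates):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    precip = rng.gamma(shape=0.8, scale=2.0, size=1192)   # hypothetical stand-in data

    # maximum likelihood fit of a Gamma distribution (see Chapter 5), location fixed at zero
    k, loc, theta = stats.gamma.fit(precip, floc=0)
    print(f"shape k = {k:.2f}, scale theta = {theta:.2f}")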

Moments of a distribution
Expectance : the gambler returns

The expectance operator gives the expected value of an RV X:

$$E(X) = \mu_1 = \begin{cases} \sum_{i=1}^{N} x_i\, P(X = x_i) & \text{if } X \text{ is discrete}\\[4pt] \int_{\mathbb{R}} x\, f(x)\, dx & \text{if } X \text{ is continuous} \end{cases} \qquad (3.9)$$

This is also called the moment of order 1, more commonly known as the average, or mean. Indeed, if all outcomes are equally likely, P(X = x_i) = 1/N, then this is just the arithmetic average of all observations. If the outcomes are not equally likely, they must be weighted by their probability before summation – it is therefore a weighted mean.
Key Properties:

Linearity E( aX + bY ) = aE( X ) + bE(Y )

Transfer theorem

$$E(g(X)) = \begin{cases} \sum_{i=1}^{N} g(x_i)\, P(X = x_i) & \text{if } X \text{ is discrete}\\[4pt] \int_{\mathbb{R}} g(x)\, f(x)\, dx & \text{if } X \text{ is continuous} \end{cases} \qquad (3.10)$$

This last formula is incredibly useful to study error propagation: if


the uncertainties in X are known, so are the uncertainties in any function
of X. See Chapter 4 and Lab 4.

Higher order moments

By the previous theorem, one may define the moment of order n as:

$$m(X, n) = E(X^n) = \sum_{i=1}^{N} x_i^n\, P(X = x_i) \qquad (3.11)$$


The moment of order 2 is closely related to the variance, obtained by


setting g( x ) = ( x − µ)2 . Hence:

$$E\left((X-\mu)^2\right) = \sum_{i=1}^{N} (x_i - \mu)^2\, P(X = x_i) = E(X^2) - \mu^2 \qquad (3.12)$$

(the last two expressions are exactly analogous in the continuous case). The variance measures the spread of a distribution about its central tendency. One often speaks of the standard deviation, σ = √V(X), which is the average deviation around the mean, and bears the same units as X ("standardized data" refer to data having been centered on the mean and divided by their standard deviation).

Properties:

V(X + b) = V(X)   "Shifting the mean does not shift the variance"

V(aX) = a² V(X), hence σ(aX) = |a| σ(X)   "Rescaling X by a simply magnifies its excursions by the same amount"

The moment of order 3 measures departures from symmetry about


the y axis (x = 0). It is closely associated with the skewness, defined as
the third standardized moment:
$$E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \frac{\mu_3}{\sigma^3} = \frac{E\left[(X-\mu)^3\right]}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} \qquad (3.13)$$

Finally, the kurtosis is derived from the fourth moment, and quantifies
the “peakedness” of a distribution. The list goes on, but no one ever
seems to worry about moments beyond n = 4.

Note Knowledge of all moments of X is equivalent to knowledge of the distri-


bution itself (via the characteristic function).

II Exploratory Data Analysis


Exploratory data analysis (EDA) is an approach to analyzing data for the purpose of formulating hypotheses, complementing the tools of conventional statistics for testing hypotheses (for additional references, check out Andrew Zieffer's notes and Chapter 3 of Wilks (2011)). It was so named by Tukey (1977) to contrast with Confirmatory Data Analysis, the term used for the set of ideas about hypothesis testing, p-values, confidence intervals, etc. (which formed the key tools in the arsenal of practicing statisticians at the time, and which we will study in Chapter 6).
Often in EDA, one looks for methods that are both robust and resistant.
Robustness is defined as the insensitivity to assumptions about the data
(e.g. ’the data are normally distributed’). Resistance is the insensitivity
to large excursions, i.e. outliers. In many cases, EDA is a “first pass” to
take a gross look at the data, formulate hypotheses and then go on to
apply more specific techniques.


Assume that you get a data vector x = { x1 , · · · , xn }, which we can


view as a realization of a random variable X (i.e. we have uncertain
knowledge about the process that generated the data). Where do we
start and where do we go from there? The first step would be to plot the
data themselves in a sensible way (i.e. as a function of a related variable,
like the depth of collection, time, distance along a transect and so on).
The second would be to obtain numerical summaries of its distribution,
then estimate the distribution itself, and finally plot it in fancier ways.

Numerical Summaries
Range

The range is simply the difference between the largest and smallest
value, and bears the same units as X. The dynamic range is usually de-
fined as:
 
$$\mathrm{DR} = 20 \times \log_{10}\left(\frac{\max(x)}{\min(x)}\right) \qquad (3.14)$$
expressed in decibels (dB) regardless of the units of x. For instance, the
dynamic range of human hearing is roughly 140 dB.
The range immediately gives us a crude idea of the size of variations
to expect, which can be very precious already. For instance, are they of
the expected order of magnitude? Does the range encompass 0?

Location

We wish to locate the centroid of the distribution, its central tendency, i.e. its center of mass on the x axis. The sample mean allows us to estimate this simply:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (3.15)$$
which is simply (3.9) with P( X = xi ) = 1/N. This is called a uniform
distribution and means that each outcome is equally likely to occur. The
formula turns out to be valid even for more general distributions. We
shall see later why the sample mean is the optimal estimator of the true
mean (“Maximum Likelihood”) for Gaussian random variables.
However the sample mean is neither robust nor resistant. Robust
measures of location include:
the median: MED(X) = q_{50%}. A difference between the mean and the median is a sure sign of skewness.

the trimean: TM = (q_{25%} + 2 q_{50%} + q_{75%}) / 4, a weighted mean of the quartiles.

the trimmed mean: a version of the sample mean ignoring the most extreme values.


For more information, read Wilks (2011), Chapter 3.

Spread

The most common measure of spread, as discussed above, is the sample standard deviation s:

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (3.16)$$
However, it is neither robust nor resistant (due to the presence of
squares, or put another way, the `2 norm), and may be easily fooled
by outliers. Better measures include the:

interquartile range: IQR = q0.75 − q0.25 , which complements the trimean.



mean absolute deviation: MAD = E | X − q0.50 | , which uses the `1 norm.

Symmetry

The sample skewness is defined as:

$$\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}} \qquad (3.17)$$

But a robust and resistant measure is the Yule-Kendall skewness:

$$\mathrm{YKS} = \frac{q_{0.25} - 2\, q_{0.50} + q_{0.75}}{\mathrm{IQR}} \qquad (3.18)$$

To illustrate how these three properties may evolve in a warming climate, aspects of changing mean, variance and skewness of temperature distributions are presented in Fig. 3.5.

Figure 3.5: Location, scale and symmetry of temperature distributions in a changing climate, from http://ipcc-wg2.gov/SREX/report/
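
A short sketch computing these robust and resistant summaries (the sample is hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.gamma(shape=2.0, scale=1.5, size=500)        # a skewed, hypothetical sample

    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    trimean = (q25 + 2*q50 + q75) / 4
    trimmed = stats.trim_mean(x, proportiontocut=0.1)    # ignore 10% in each tail
    iqr     = q75 - q25
    mad     = np.mean(np.abs(x - q50))                   # mean absolute deviation about the median
    yks     = (q25 - 2*q50 + q75) / iqr                  # Yule-Kendall skewness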

Graphical Summaries
“Numerical quantities focus on expected values, graphical summaries on unex-
pected values”. John Tukey

A picture is worth a thousand words. We've already seen in Fig. 3.3 how one might represent a PMF, and if appropriate, a PDF. Here are a few strategies to display such distributions. One way to do this is to plot distribution quantiles as shaded bands, as in Fig. 3.7. This is a great way to chart evolution over time and get a general impression, though it can become difficult to see exactly what happens to the distribution. Boxplots (Fig. 3.8) and violin plots (Fig. 3.10) are more useful in this regard.

Figure 3.6: Boxplot representation of a normal distribution, illustrating how economically the device can convey the location of various quantiles of interest.


Figure 3.7: Historical probability of precipitation over Los Angeles at some point in the day, expressed in %. Orange represents thunderstorm-related precipitation, which is extremely rare in L.A. Credit: weatherspark

Figure 3.8: Boxplots for grouped observations (here, illustrating convergence as a function of sample size). Note the notch to denote the median's location.

Boxplots

Boxplots are compact ways to represent a distribution, focusing on its quartiles (body), 95% mass (whiskers) and occasionally outliers. An example is shown in Fig. 3.6, showing how the normal distribution may be summarized very succinctly in this fashion. Boxplots are sometimes the only practical way of showing how several distributions compare (Fig. 3.8). Whenever possible, thou shalt notch your boxplot.

Violinplots

Let's face it: boxplots are pretty boxy. A more refined way of plotting distributions are violin plots, which elegantly summarize their shape as quantified via kernel density estimates, along with the data as points (Fig. 3.10) or bars ("bean plots"). Good implementations come with a number of customizable options, including the ability to perform side-by-side comparisons as split violins, which must sound awful, but look rather snazzy.

Figure 3.9: A simple violin plot, using sns.violinplot(). Credit: seaborn example gallery

Scatterplots

Jumping ahead to part III, we sometimes want to explore the relationship between two variables. One expeditious way to do so is to draw a scatterplot of one variable versus the other; even better is to plot their histograms alongside, using seaborn's sns.jointplot(), as done in Fig. 3.11. This is not a bad time to introduce the idea of correlation.

Figure 3.10: A split violin plot. Notice the median and quartiles, which allow one to quickly compare these quantiles among paired datasets. Credit: seaborn example gallery


Correlation and covariance

How shall we quantify the relationship between two variables X and


Y, such as those of Fig. 3.11? Covariance and correlation are popular
measures of association for this purpose. Covariance is defined as:

Cov( X, Y ) = E [( X − E( X )) (Y − E(Y ))] (3.19a)


= E( XY ) − E( X ) E(Y ) (3.19b)

In particular, Cov( X, X ) ≡ V ( X ) (the variance of X is its covariance


with itself). If the two variables are independent, then Cov( X, Y ) = 0
(as a consequence of E( XY ) = E( X ) E(Y )).
In order to remove the dependence on the units of X and Y we can normalize by the standard deviations σ_X and σ_Y. We define the correlation coefficient as:

$$\rho_{XY} = \mathrm{Cov}\left(\frac{X - E(X)}{\sigma_X},\ \frac{Y - E(Y)}{\sigma_Y}\right). \qquad (3.20)$$

Figure 3.11: A scatterplot between two variables, along with histograms of marginal distributions. Generated using sns.jointplot(). Credit: seaborn gallery
It has the following properties:

• When |ρ| < 1 and ρ > 0, we say that the two variables are correlated. When ρ < 0 we say they are anti-correlated.

• If |ρ| = 1, there is a linear relationship between X and Y, i.e. X = aY + b or Y = cX + d, where a, b, c, d are constants.

• ρ only measures linear association; a non-linear relation may not get picked up. A famous example is U = cos X, V = sin X (with X uniformly distributed over [0, 2π)). Even though U² = 1 − V², you can verify that ρ(U, V) = 0!

• ρ² represents the fraction of variance shared between the variables X and Y. It is common to speak of "variance explained" by one variable, but we discourage use of the term: if X and Y are correlated because they are both caused by a third variable Z, then what does X possibly explain about Y (or vice versa)? You should prefer the terminology "X accounts for ρ² × 100% of variance in Y" (or vice versa). Causality aside, if |ρ| is high, one may use Y as a proxy for X and vice versa.

Figure 3.12: Correlation in a scatter plot. Eq. (3.19a) is equivalent to assigning a sign to each quadrant, and multiplying the distance from the origin (E(X), E(Y)) by this sign.

The correlation coefficient for a sample can be estimated as

$$\hat{\rho} \equiv r_{XY} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i y_i - \bar{x}\bar{y}\right)}{S_X S_Y} \qquad (3.21)$$

$$S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\,, \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\,. \qquad (3.22)$$

This estimate is also known as Pearson’s product-moment correlation


or Pearson’s correlation. It is the most common measure of correlation
but is neither robust nor resistant. For instance, this situation:


would give ρ > 0 even if the majority of points shows ρ < 0. As an alternative we can use Spearman's correlation (scipy.stats.spearmanr()) or Kendall's τ (scipy.stats.kendalltau()), which are based on rank order statistics, and are therefore more robust. Kendall's τ lends itself to better uncertainty quantification. The wider problem with correlations is how to interpret them. We will come back to this in Chapter 6, Sect. VII.
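
A quick comparison of the three estimators on synthetic data (a sketch; the numbers are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(size=200)
    y = 0.6 * x + rng.normal(scale=0.8, size=200)    # a noisy linear relationship

    r, p_r = stats.pearsonr(x, y)          # Pearson (neither robust nor resistant)
    rho, p_rho = stats.spearmanr(x, y)     # Spearman (rank-based)
    tau, p_tau = stats.kendalltau(x, y)    # Kendall's tau (rank-based)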
Fig. 3.13 displays three types of cases: linear relation with various
amounts of scatter (top), perfect correlation (middle) and various ways
of obtaining a zero linear correlation (bottom), even when nonlinear as-
sociations are extremely strong.

Figure 3.13: A taxonomy of correlations.


(top) typical situations of a linear rela-
tionship of varying strength; (middle)
perfect correlation: one is measuring
the two variables twice, under different
names; (bottom) examples of nonlinear
associations yielding zero linear correla-
tion. See Lab 6 for some practical flavors.
Credit: Wikipedia

Bivariate densities

More generally, we are often interested in the joint probability dis-


tribution of several variables. As in 1 dimension, kernel density esti-
mation can help estimate a continuous density from discrete observa-
tions, turning a cloud of points into a field of color (Fig. 3.14). Alterna-
tively, the 2-dimensional analog of the histogram visualizes a bivariate
PMF as hexagonal tiles (Fig. 3.15). Both figures were generated via sns
.jointplot().

III Common Discrete Distributions


We now introduce some classic theoretical distributions, which are very useful in modeling natural phenomena. The underlying idea is that, in cases where little to no data are available, we might anticipate measurements to follow a certain theoretical distribution; if so, this knowledge will greatly help the assessment of uncertainties and the estimation of parameters of interest. These distributions form the basis of so-called "parametric statistics", because they are characterized by one or more parameters. Here we focus on the distributions themselves – we will reserve the difficult topic of statistical estimation (of said parameters) for Chapter 5.

Figure 3.14: Joint plot of two variables, illustrating joint and marginal kernel density estimates, as well as individual observations as white crosses. Credit: Seaborn manual

Bernoulli Distribution
A Bernoulli random variable is the easiest way to encode a binary out-
come: heads or tails? Rain or shine? Big earthquake or small earth-
quake? We say that a r.v. X ∼ B( p) if and only if X takes the value 1
with probability p (“success”) and 0 with probability q = 1 − p (“fail-
ure”). Verify that E( X ) = p, V ( X ) = pq.
The Bernoulli distribution is not very interesting in its own right, but serves as a building block for a very important distribution:

Figure 3.15: The bivariate analogue of a histogram is known as a "hexbin" plot, because it shows the counts of observations that fall within hexagonal bins. Credit: Seaborn manual
Binomial Distribution
A random variable X ∼ B(n, p) is said to be binomial if ∀k ∈ [0; n]:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \qquad (3.23)$$

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient. The latter have the useful property that:

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k} \qquad \text{(Pascal's rule)} \qquad (3.24)$$

Figure 3.16: Pascal's Triangle, an illustration of Pascal's rule. Credit: Math Forum

The binomial distribution measures the probability of k successes amongst n trials. It is easy to show that X ∼ B(n, p) can be obtained


as the sum of n independent Bernoulli trials with identical probability p. It follows that E(X) = np, V(X) = npq (for independent variables, the variance of the sum is simply the sum of the variances). Its PMF and CDF are shown in Fig. 3.17 (left column).

Application Many processes can be modeled as a sum of independent, binary events. For instance, if the probability of a winter frost day is 0.1, and there are 90 days, what is the probability of getting at least 10 frost days per winter? (assume a winter is 90 days long)

$$P(X = 0) = \binom{90}{0} 0.1^0\, (1-0.1)^{90} \simeq 7.6\times 10^{-5} \qquad (3.25a)$$
$$P(X = 1) = \binom{90}{1} 0.1^1\, (1-0.1)^{89} \simeq 7.6\times 10^{-4} \qquad (3.25b)$$
$$\vdots \qquad (3.25c)$$
$$P(X = 9) = \binom{90}{9} 0.1^9\, (1-0.1)^{81} \simeq 0.139 \qquad (3.25d)$$

$$P(X \geq 10) = 1 - P(X < 10) = 1 - \sum_{k=0}^{9} P(X = k) \simeq 0.412 \approx 41\% \text{ chance}$$
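
A one-line numerical check of the frost-day example with scipy.stats (a sketch):

    from scipy import stats

    frost = stats.binom(90, 0.1)                 # Binomial(n=90, p=0.1)
    p_at_least_10 = frost.sf(9)                  # P(X >= 10) = 1 - P(X <= 9)
    print(f"P(X >= 10) = {p_at_least_10:.3f}")   # about 0.41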

Figure 3.17: Binomial vs Normal distribution. Top: PMF or PDF. Bottom: CDF. Left: Binomial distribution with n = 20 and three values of p. Right: Normal distribution for 3 values of µ and σ.


Poisson Distribution
A random variable is said to be Poisson with rate λ ∈ R+ (usually writ-
ten X ∼ P (λ)) if and only if, ∀k ∈ N:

$$P(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!} \qquad (3.26)$$

Proof of the fact that all probabilities sum to unity is provided via the exponential series: $\sum_{k=0}^{\infty} \lambda^k/k! = e^{\lambda}$ (Appendix A, Sect. III). From this and a simple rearrangement of dummy indices, one can easily show that E(X) = λ and V(X) = λ. For λ ≤ 1, the PMF decreases strongly with k, hence the nickname: 'law of rare phenomena'.

Figure 3.18: PMF of the Poisson distribution for 3 values of λ. Credit: Wikipedia

A Poisson process is a collection of Poisson random variables. Exam-


ples that are well-modeled as Poisson processes include the radioactive
decay of atoms, telephone calls arriving at a switchboard, page view re-
quests to a website, or the number of US hurricane landfalls. If Nt is the
number of such occurrences over a time interval t, then:

e−λt (λt)k
P( Nt = k) = f (k; λt) = (3.27)
k!
In particular, the number of events over two intervals ∆t1 and ∆t2 are Figure 3.19: CDF of the Poisson distribu-
independent, and the probability that the time between events exceeds tion for 3 values of λ. Credit:Wikipedia
∆t is e−λ∆t , which decays exponentially fast to zero.
The Poisson distribution has all sorts of wonderful properties. In par-
ticular, it is easy to show that if X1 ∼ P (λ1 ) and X2 ∼ P (λ2 ), then the
sum X = X1 + X2 ∼ P (λ1 + λ2 ) (additivity of the two variables). The
Poisson process is intimately tied to the exponential distribution, which
describes memoryless processes.
Interesting property: the Binomial distribution can be approximated
by a Poisson distribution when n → ∞ if λ = np is kept constant. In
turn, it converges to a normal distribution for λ larger than ∼ 5 (see
Fig. 3.18, cyan curve, and Chapter 4, section II).
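
As an illustration, a sketch with scipy.stats (the landfall rate of 1.7 per year is a made-up number):

    from scipy import stats

    landfalls = stats.poisson(1.7)       # hypothetical mean number of landfalls per year
    p_none = landfalls.pmf(0)            # probability of a year with no landfall
    p_three_plus = landfalls.sf(2)       # P(N >= 3) = 1 - P(N <= 2)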


Geometric Distribution
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set {1, 2, 3, ...}. Also called the "waiting time until the first success". X ∼ G(p) iff ∀k ≥ 1:

$$P(X = k) = (1-p)^{k-1}\, p \qquad (3.28)$$

Again, this has no upper bound for large k, so we will need to sum until infinity. This is made possible by the geometric series:

$$\sum_{k=0}^{\infty} a^k = \frac{1}{1-a} \qquad \forall\, |a| < 1 \qquad (3.29)$$

From this it follows that:

$$E(X) = \frac{1}{p}\,; \qquad \mathrm{var}(X) = \frac{1-p}{p^2} \qquad (3.30)$$
The fact that the expected time until the first success is inversely pro-
portional to the probability of the event (a frequency of occurrence),
should not be surprising. It is the basis for estimating “return periods”
from the statistics of past events, a practice as fraught as it is mislead-
ing. Most of the public thinks of a “100-year flood” as a flood that recurs
every 100 years, whereas it really is a flood that has a 1% chance of hap-
pening every year, assuming stationarity (Lab 5).

Application Waiting time until you get a 6 by rolling a die, waiting time
until any event with some estimated frequency (e.g. droughts, floods, earth-
quakes).


IV Common Continuous Distributions

Normal Distribution

The most emblematic curve in all of statistics, the Normal (or Gaussian) distribution (Fig. 3.17) will in fact prove to be the central distribution towards which all the others gravitate. Its properties are so numerous, and its range of applications so vast, that we will here content ourselves with a very cursory description, but devote an entire Chapter (4) to normality.

Table 3.1: Essential Properties of the Gaussian Distribution

Property    Value
support     x ∈ R
pdf         (1/(σ√(2π))) exp(−(x−µ)²/(2σ²))
cdf         1/2 + (1/2) erf((x−µ)/(σ√2))
mean        µ
median      µ
mode        µ
variance    σ²
skewness    0
Gamma Distribution, Γ (k, θ )
Very versatile distribution that allows one to fit many different behaviors. It is characterized by 2 real numbers, k (shape parameter) and θ (scale parameter):

$$f(x; k, \theta) = \frac{x^{k-1}\, \exp(-x/\theta)}{\Gamma(k)\, \theta^k}$$

where $\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt$ is the famous Gamma function (a generalization of the factorial function). As is apparent from Fig. 3.20 and Fig. 3.21, the qualitative behavior is radically different for different values of k. Only for k ≥ 1 does the distribution start at zero, with a tail that can usefully approximate things like rainfall statistics. It is in fact one of the simplest distributions to represent skewed variables, since the skewness is solely constrained by k. For large k, it starts looking like a Gaussian (but then again, most things do). Two notable special cases are k = 1 (Exponential distribution) and θ = 2 (Chi-squared Distribution), which we will see again.

Figure 3.20: Gamma distribution (PDF)

Figure 3.21: Gamma distribution (CDF)

Table 3.2: Essential Properties of the Exponential and Gamma Distributions

Exponential Distribution
Property    Value
support     x ∈ R+
pdf         f(x; λ) = λ e^{−λx} for x ≥ 0; 0 for x < 0
cdf         F(x; λ) = 1 − e^{−λx} for x ≥ 0; 0 for x < 0
mean        1/λ
median      ln(2)/λ
mode        0
variance    1/λ²
skewness    2

Gamma Distribution
Property    Value
support     x ∈ R+
pdf         x^{k−1} exp(−x/θ) / (Γ(k) θ^k)
cdf         γ(k, x/θ) / Γ(k)
mean        kθ
median      no simple closed form
mode        (k − 1)θ for k ≥ 1
variance    kθ²
skewness    2/√k


Extreme Values: the Weibull Distribution


The Weibull is a special case in a broader class called generalized extreme
value distributions. It is given here because of its simplicity and applica-
bility in Earth Sciences, particularly for extreme events like upper ocean
velocities, floods, hurricane wind speeds, and all manner of imaginable
natural hazards. Non-surprisingly, it is a darling of the insurance and
reinsurance industry. The Weibull distribution is also suitable for mod-
eling particle size distributions, or the time before an instrument fails.
The density writes:

$$f(x; \lambda, k) = \begin{cases} \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k} & x \geq 0\\[4pt] 0 & x < 0 \end{cases} \qquad (3.31)$$

where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. Its complementary cumulative distribution function is a stretched exponential function. The Weibull distribution is related to a number of other probability distributions; it interpolates between the exponential distribution (k = 1) and the Rayleigh distribution (k = 2).

Figure 3.22: PDF of the Weibull distribution

Table 3.3: Essential Properties of the Weibull Distribution

Property    Value
support     x ∈ [0; +∞)
pdf         f(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} for x ≥ 0; 0 for x < 0
cdf         1 − e^{−(x/λ)^k}
mean        λ Γ(1 + 1/k)
median      λ (ln 2)^{1/k}
mode        λ ((k−1)/k)^{1/k} if k > 1
variance    λ² Γ(1 + 2/k) − µ²
skewness    (Γ(1 + 3/k) λ³ − 3µσ² − µ³) / σ³

Figure 3.23: CDF of the Weibull distribution

In lab #5, we shall see how these distributions can be fit to geophysical
and geochemical data and what it enables us to do.

Chapter 4

NORMALITY. ERROR THEORY

Why have we for so long managed with the normality assumption?

George Barnard

I The Normal Distribution


Distribution Functions
A random variable X is said to be normal, or Gaussian, with parameters
µ and σ (X ∼ N (µ, σ2 )) if and only if its Probability Density Function
(PDF) verifies:

$$\varphi_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad \text{("density")}. \qquad (4.1)$$

Special case (the standard normal distribution): µ = 0, σ = 1

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}. \qquad (4.2)$$

Figure 4.1: PDF of the Normal distribution. Credit: Wikipedia

The CDF (Cumulative Distribution Function) of X ∼ N(0, 1) is, generally, described by Φ:

$$\Phi(x) = P(X \leq x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt\,. \qquad (4.3)$$

Note One can show that the function e^{−t²} does not have an elementary antiderivative, i.e., if g′(t) = e^{−t²} then g(t) cannot be expressed as a finite combination of elementary functions (though it can be approximated that way). ⟹ Do not try to compute Φ(x) analytically (use a table or a computer).

Figure 4.2: CDF of the Normal distribution. Credit: Wikipedia
In Python, the PDF can be generated at the values stored in array x
via scipy.stats.norm.pdf( x ), the CDF via scipy.stats.norm.cdf( x ).
As in all languages, the default location and scale parameters are 0 and
1, respectively.

Standard Normal
The standard normal is the only one we really need to worry about,
because of the wonderful affine invariance of the normal distribution: if
 X −µ
X ∼ N µ, σ2 , then Z = σ ∼ N (0, 1) ⇒ X = σZ + µ. Hence:

FX ( x ) = P ( X ≤ x )
= P (σZ + µ ≤ x )
 
x−µ
=P Z≤
σ
 
x−µ
=Φ .
σ
This means that we can express the PDF and CDF of any normal dis-
tribution with the standard normal PDF and CDF, simply by centering
and rescaling.
Example
Assume that the temperature at some place follows a normal distribution with
mean µ = 20◦ C, and standard deviation σ = 5◦ C. What is the probability
that the temperature drops below 12◦ C?
 
$$T \sim N(\mu = 20, \sigma^2 = 25)\,; \qquad P(T \leq 12) = ?$$

$$P(T \leq 12) = P(Z\sigma + \mu \leq 12) = P\left(Z \leq \frac{12 - \mu}{\sigma}\right) = P\left(Z \leq -\frac{8}{5}\right) = \Phi\left(-\frac{8}{5}\right) = \Phi(-1.6) \approx 5.5\%\,. \quad \text{(Unlikely)}$$
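
A one-line check of this example with SciPy (a sketch):

    from scipy.stats import norm

    p = norm.cdf(12, loc=20, scale=5)    # P(T <= 12) for T ~ N(20, 5^2)
    print(f"P(T <= 12) = {p:.3f}")       # about 0.055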

Moments

Consider X ∼ N(µ, σ²); then:

E(X) = µ,   Expected value ("mean").

E((X − µ)²) = σ²,   Variance.

So we write N(µ, σ²), where the first parameter is the mean and the second the variance, and σ is the standard deviation.

Figure 4.3: Two normal distributions with the same mean, µ, and different scale parameters σ.

In general, if X ∼ N(µ, σ²), it looks like Fig. 4.4.


Figure 4.4: PDF of the standard normal,


showing the respective percentage sub-
division corresponding to each standard
deviation. Credit:Math Is Fun

The normal distribution is unimodal and symmetric, so the mean, the mode and the median coincide (this is rare). Hence, it is not skewed, so E((X − µ)³) = 0. Its fourth moment is related to the kurtosis, which measures the "peakedness" of the distribution:

$$\frac{E[(X-\mu)^4]}{\left(E[(X-\mu)^2]\right)^2} = \frac{\mu_4}{\sigma^4} \qquad (4.4)$$

This quantity is 3 for the normal distribution, and it is common to define µ₄/σ⁴ − 3 as the excess kurtosis. By definition, the normal distribution has zero excess kurtosis. More peaked distributions are called leptokurtic, less peaked distributions are called platykurtic.

Figure 4.5: Representation of the symmetry of the cumulative distribution function: Φ(−x), the left gray region, is equal to its complement 1 − Φ(x), the right gray region.

Notable properties

1. Symmetry

$$\Phi(-x) = 1 - \Phi(x)$$

Proof: Using the unit measure and the previous result:

$$1 = P(X \leq -x) + P(-x \leq X \leq x) + P(X \geq x) = \Phi(-x) + \underbrace{P(-x \leq X \leq x) + \Phi(-x)}_{\Phi(x)}\,.$$

So Φ(−x) = 1 − Φ(x). (cf. Fig. 4.5)

Figure 4.6: Illustration of property 2 of the normal CDF

2. $$P(|X| \leq x) = 2\Phi(x) - 1$$


Proof: P(X ≤ −x) = P(X ≥ x). But:

$$1 = P(X \leq -x) + P(-x \leq X \leq x) + P(X \geq x) = 2\left(1 - \Phi(x)\right) + P(-x \leq X \leq x)\,.$$

$$\Longrightarrow P(|X| \leq x) = 2\Phi(x) - 1\,. \quad \text{QED}$$

Stability
The normal distribution is stable under
 scaling,
 addition and subtrac-
2
 2
tion. If X ∼ N µ x , σx and Y ∼ N µy , σy are independent normal
random variables, and a and b are constants, then:
  
Scaling If X ∼ N µ, σ2 , then aX + b ∼ N aµ + b, ( aσ )2 . In other
words, the mean gets an affine transform, and the standard deviation
is scaled by a factor a.

Addition The sum of two normals is another normal whose mean and
variances are the sums of individual means and variances.
 
U = X + Y ∼ N µ x + µy , σx2 + σy2 . (4.5a)

The converse is also true (Cramer’s theorem)1 1


If X, Y independent and X + Y is nor-
mal, then X and Y must be normal.
Subtraction The difference of two normals is another normal whose mean
is the difference of the individual means, but whose variance is the
sum of the individual variances:

X − Y ∼ N (µ x − µy , σx2 + σy2 ) . (4.5b)



Be careful! Variances are added.

Although we’ve seen additivity before, you should realize how re-
markable it is: most distributions do not display this property. And the
remarkable result is that if you subtract two normal RVs, the means are
subtracted but the variances (measures of uncertainty) are added.


II Limit Theorems
Central Limit Theorem
Whenever a large sample of chaotic elements are taken in hand and marshalled in
the order of their magnitude, an unsuspected and most beautiful form of regular-
ity proves to have been latent all along.

Sir Francis Galton (1822 – 1911)

Let X₁, ..., Xₙ be a sequence of independent and identically distributed (i.i.d.) random variables, each having mean µ and variance σ².

Central limit theorem (CLT): As n increases, the distribution of the sample average

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$

converges to N(µ, σ²/n).

Put differently, let X₁, ..., Xₙ be independent with common finite mean µ and variance σ², and

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\,, \qquad \bar{X}_n = \frac{X_1 + \cdots + X_n}{n}\,.$$

Then Zₙ → N(0, 1), hence:

$$\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) \qquad (4.6)$$

That is, N(0, 1) is the asymptotic distribution of Zₙ, regardless of the original distribution of the X₁, ..., Xₙ. This is absolutely mind-boggling.

Remarkable facts

→ Works for "any" distribution as long as µ and σ exist.

→ Notice the factor σ/√n: uncertainties decrease as the square root of the number of observations. This is the theoretical basis for the well-known laboratory practice of averaging independent measurements together to decrease uncertainty.

→ Normal asymptotics apply for any n larger than about 30. Infinity is within reach of anyone who can count to 30!

→ An analog demonstration of the CLT is given by Galton boards, illustrated by the Quincunx game; a numerical demonstration is sketched below.
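
A minimal numerical sketch of the CLT (the exponential parent distribution is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(4)
    n, n_rep = 30, 10_000

    # 10,000 sample means of n = 30 draws from a strongly skewed parent distribution
    means = rng.exponential(scale=1.0, size=(n_rep, n)).mean(axis=1)

    print(means.mean(), means.std())   # close to mu = 1 and sigma/sqrt(n) = 1/sqrt(30) ~ 0.18
    # a histogram of `means` looks strikingly Gaussian despite the skewed parent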

In summary, we have three extremely powerful properties that con-


spire to make the normal (Gaussian) distribution the central distribu-
tion of statistics:


1. The average of n independent and identically-distributed variables


tends to be Gaussian for n sufficiently large, regardless of their original
distribution.

2. For certain values of their parameters, many distributions (Binomial,


Poisson, Gamma, Chi-squared, student’s T, etc) converge to a Gaus-
sian form, for similar reasons (they arise as the sum of many indepen-
dent increments). The normal distribution is therefore an attractor of
the distribution space.

3. For analytical reasons, once a Gaussian shape is attained, it is pre-


served under a variety of operations (scaling, convolution, product,
Fourier transform). The consequence is that the sum or difference
of normally-distributed variables is Gaussian too, and so it is for all
other linear transformations... so one never escapes the Gaussian
“black hole” once in its vicinity.

This will make the normal distribution a tool of choice to represent


and analyze errors in all manner of measurements.

Application
A certain strain of bacteria occurs in all raw milk.

X = bacteria count per milliliter of milk .

According to the health department, if the milk is not contaminated, the


mean µ of x is ≈ 2500 with a standard deviation σ of about 300.
An inspector measures x on 42 samples of milk that has been held in
a cold storage awaiting shipment. The mean bacteria count x is 2700.
What would you do if you were the inspector?

Solution By the CLT, if the milk is not contaminated, x̄ ≈ N(µ, σ²/n), with

$$\mu_{\bar{x}} = \mu = 2500\,, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{300}{\sqrt{42}} \approx 46.3\,,$$

$$P(\bar{x} \leq 2650) = 0.9994\,.$$

⟹ Observing x̄ = 2700 is very unlikely if the milk is not contaminated.
⟹ The milk must not be sold!


Other limit theorems


Convergence of Binomial law to Poisson law

The binomial distribution is extremely useful, but the calculation of


the (nk) factors, made of factorials, is extremely tedious, even for modern
computers (try computing 100! in Matlab if you don’t believe it). It turns
out that in the case where n gets large but np = λ (a constant), the bino-
mial distribution converges in distribution to the Poisson distribution.
That is, if X ∼ B(n, p):

$$P(X = k) \xrightarrow[n \to \infty]{np = \lambda}\ e^{-\lambda}\,\frac{\lambda^k}{k!} \qquad (4.7)$$
which is more manageable if you can easily compute exponentials. But
it does get better: we may dispense with all these pesky factorials alto-
gether, as shown below.

Convergence of Poisson law to Normal law

One can show using Stirling's approximation that, in turn, the distribution of a Poisson RV X with parameter λ converges to a normal distribution as λ → ∞. That is, the CDF of (X − λ)/√λ converges to Φ.

So if we had X ∼ B(n, p) such that n → ∞ and np stays constant, we could use the normal approximation to evaluate the Poisson PMF. That is just how DeMoivre and Laplace discovered it in the first place (see Sect. III). Put another way, the standardized sum:

$$S_n^* = \frac{S_n - np}{\sqrt{np(1-p)}}\,; \qquad S_n = \sum_{i=1}^{n} X_i \qquad (4.8)$$

converges in distribution to a standard normal.


Example
Suppose that a coin is tossed 100 times and lands heads up 60 times. Should
we be surprised and doubt that the coin is fair?
 
$$P(S \geq 60) = P\left(\frac{S - 50}{5} \geq \frac{60 - 50}{5}\right) \approx 1 - \Phi(2) = 0.0228 \qquad (4.9)$$

With n this large, there is only a 2% probability of observing this result, so


the fairness of the coin is called into question. This logic will be used again in
Chapter 6 with statistical hypothesis tests.


III A bit of history


Physicists believe that the Gaussian law has been proved in mathematics while
mathematicians think that it was experimentally established in physics.

Henri Poincaré, 1854 –1912

Origins
Though usually named after Gauss, the “Gaussian” (normal) distribu-
tion was actually discovered in 1733 by DeMoivre, who found it as a
limit case of the binomial distribution with p = 1/2, but did not rec-
ognize the significance of the result. In the general case 0 < p < 1,
Laplace (1781) had derived its main properties and suggested that it
should be tabulated due to its importance. Gauss (1809) considered an-
other derivation of this distribution (not as a limiting form of the bi-
nomial distribution); it became popularized by his work and thus his
name became attached to it. It seems likely that the term “normal” is as-
sociated with a linear regression model for which Gauss suggested the least-squares (LS) solution method and called the corresponding system of equations the normal equations (Chapter 13). One more name – central distribution – originates from the central limit theorem. It was suggested by Pólya and actively backed by Jaynes (2004). A well-known historian of statistics, Stigler, formulates a universal law of eponymy that "no discovery is named for its original discoverer" (as already remarked for Bayes' theorem). Jaynes (2004) remarked that "the history of this terminology excellently confirms this law, since the fundamental nature of this distribution and its main properties were derived by Laplace when Gauss was six years old; and the distribution itself had been found by de Moivre before Laplace was born." (Kim and Shevlyakov, 2008)

Figure 4.7: The venerable Carl Friedrich Gauss, who contrary to popular belief did not discover the Gaussian law.

Herschel, Maxwell, and the Error Distribution


Herschel (1850) arrived at a normal distribution when trying to describe
errors in the location of stars obtain from optical measurements in the
( x, y) plane. His reasoning was the following:

• Errors in x should be independent of errors in y, so the joint density f(x, y) = f(x) f(y)

• The distribution should depend only on the distance to the origin, so f(x, y) = f(√(x² + y²)).

It turns out that the conjunction of these two seemingly simple properties is enough to enforce the bivariate Gaussian form:

$$f(x, y) \propto \exp\left(\alpha\,(x^2 + y^2)\right) \qquad (4.10)$$


Clearly, α must be negative for f to be integrable, and after normalization to unit mass, it can be rewritten in the usual form:

$$f(x, y) = \frac{1}{2\pi\sigma^2}\, \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (4.11)$$
which reduces to Eq. (4.1) in the univariate case (with µ = 0).
This form is eerily reminiscent of the Maxwell-Bolztmann distribu-
tion of velocities in a gas, which forms the basis for the kinetic theory of
gases. There is in fact a deep link between Gaussians and diffusion pro-
cesses, known as the Fokker-Planck equation in statistical mechanics.
Essentially, when one has big ensemble of particles merrily bumping
into each other, their velocities wind up with a Gaussian distribution.
For a gas at rest, the mean velocity is zero, and the variance is propor-
tional to the temperature (which is therefore a measure of molecular
agitation).

Information Theory
Shannon and Nyquist pioneered the concept of information entropy as-
sociated with a probability distribution. As in thermodynamics, infor-
mation entropy can only increase under spontaneous transformations.
It turns out that, by just specifying the mean and the variance of a mea-
surement error, the maximum entropy principle leads to a Gaussian
form for their distribution. Jaynes (2004) interprets the result this way:

The normal distribution provides the most honest description of our knowledge
of the errors given just the sample mean and variance of the measurements.

This brings a theoretical justification for what countless astronomers


and experimentalists have observed empirically since the 19th century:
the normal distribution is therefore a legitimate error distribution for
most applications, and it should always be used as an error distribution
unless one has cogent evidence that errors are not normally distributed.

Maximum Error Cancellation


It can be shown that the sample mean x = n1 ∑ xi brings about the max-
imum chance of error cancellation for symmetrically distributed errors
(equal chance of positive and negative excursions means that the aver-
age will be closer to the ’true’ value).


IV Error Analysis
Error Terminology
Lack of mathematical culture is revealed nowhere so conspicuously as in mean-
ingless precision in numerical computations.

Carl Friedrich Gauss, 1777 – 1855.

Absolute and relative error

Experimental errors are often reported in two ways:

$$\text{Absolute error:}\quad x = x_{\mathrm{best}} \pm \Delta x \qquad (4.12a)$$

$$\text{Relative error:}\quad x = x_{\mathrm{best}}\left(1 \pm \frac{\Delta x}{|x_{\mathrm{best}}|}\right) \qquad (4.12b)$$

Accuracy vs. precision

Precision measures the spread of repeated measurements, i.e. their tendency to cluster close to each other. Accuracy, in contrast, measures the distance to the "true" value of the quantity being measured (Fig. 4.8). By the central limit theorem, precision can always be improved by repeating measurements, but even devilish levels of precision can lead to inaccurate measurements if systematic biases exist (e.g. a thermometer is always 0.2 K too cold, a mass spectrometer has a 0.5‰ offset, etc.). One needs external information to determine those (e.g. data run in another lab or another machine, operator log, cross-validation, etc.).

Figure 4.8: Diagram of precision and accuracy for various dating methods. Credit: Tectonic Geomorphology of Mountains (Figure 6.1, p. 211)

Modeling Errors Using the Normal Distribution

Common Model:

Measurement = "true quantity" + error,

y = µ + ε,

where ε ∼ N(0, σ²). Measurement → random variable.

Recall that P(−σ ≤ ε ≤ σ) ≈ 68.3%. Hence y lies in the interval [µ − σ, µ + σ] with a probability of 68.3%. We shall see later that if we estimate µ via a sample average x̄ of n such measurements, then [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] is a 95% confidence interval for µ.


Covariance, Correlated Errors


X, Y random variables.

Covariance

$$\mathrm{Cov}(X, Y) = E\left[(X - E(X))\,(Y - E(Y))\right]\,. \qquad (4.13)$$

Special case: Cov(X, X) = Var(X) = E[(X − E(X))²].

X, Y independent ⟹ Cov(X, Y) = 0. The converse is false in general; except, of course, when X, Y are jointly normal, in which case Cov(X, Y) = 0 ⟹ X, Y independent. Thus, in the Gaussian world, independence and zero covariance are the same thing. One can show that:

$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{Cov}\left(X_i, X_j\right)\,. \qquad (4.14)$$

In particular, for X ∼ N(0, σ_X²) and Y ∼ N(0, σ_Y²):

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\,\mathrm{Cov}(X, Y)\,, \qquad (4.15a)$$

$$\text{i.e.}\quad \mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)\,, \qquad (4.15b)$$

which reduces to Var(X) + Var(Y) if and only if the variables are not correlated.

Simple Rules of Error Propagation


Using these properties, one can derive simple rules for various functions
of a normal RV.

Addition/Subtraction In general, with X ± ∆X; Y ± ∆Y: the uncertainty on X + Y is at most ∆X + ∆Y. But assuming X, Y are independent and normal,

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \implies \sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}\,,$$

$$\implies \Delta(X+Y) = \Delta(X-Y) = \sqrt{(\Delta X)^2 + (\Delta Y)^2} \leq \Delta X + \Delta Y\,. \qquad (4.16a)$$

Put differently, the most conservative error estimate for sums or differences is the sum of their uncertainties, but for independent errors this reduces considerably to a quadratic sum (the norm of the uncertainty vector; the inequality stems from Pythagoras' theorem).

Product/Division

$$\log(xy) = \log(x) + \log(y)\,, \qquad \log\left(\frac{x}{y}\right) = \log(x) - \log(y)\,,$$

$$\Delta \log x \approx \frac{\Delta x}{x}\,. \quad \text{(First-order approximation)}$$

If we call u = xy and v = x/y:

$$\Delta \log u \approx \frac{\Delta u}{u} \leq \Delta \log x + \Delta \log y \approx \frac{\Delta x}{x} + \frac{\Delta y}{y}\,.$$

$$\frac{\Delta(xy)}{xy} = \frac{\Delta(x/y)}{x/y} \approx \frac{\Delta x}{x} + \frac{\Delta y}{y}\,. \qquad (4.16b)$$

which is just the same as Eq. (4.16a), but with relative instead of abso-
lute uncertainties. In words, the relative uncertainty of a product (or
ratio) is the sum of the relative uncertainty of its components (unless
there is some partial cancellation).

Nonlinear Propagation
For any well-behaved function ψ(X₁, X₂, ..., Xₙ), with Xᵢ ∼ N(µᵢ, σᵢ²), where the Xᵢ are independent of each other:

$$\sigma_\psi^2 \approx \left(\left.\frac{\partial \psi}{\partial X_1}\right|_{\mu_1}\right)^2 \sigma_1^2 + \left(\left.\frac{\partial \psi}{\partial X_2}\right|_{\mu_2}\right)^2 \sigma_2^2 + \cdots + \left(\left.\frac{\partial \psi}{\partial X_n}\right|_{\mu_n}\right)^2 \sigma_n^2 \qquad (4.16c)$$

In the more general (correlated) case, the equation writes:

$$\sigma_\psi^2 \approx \sum_{i=1}^{n}\sum_{j=1}^{n} \left.\frac{\partial \psi}{\partial X_i}\right|_{\mu_i} \left.\frac{\partial \psi}{\partial X_j}\right|_{\mu_j} \mathrm{Cov}\left(X_i, X_j\right) \qquad (4.16d)$$

where µᵢ is the mean of Xᵢ, and the subscript means that we evaluate the derivatives around the value ψ(µ₁, µ₂, ..., µₙ); higher order terms are assumed negligible. This result stems from performing a first-order Taylor expansion of ψ about the mean, using the transfer theorem (Eq. (3.12)) to propagate the errors through ψ, and using the linearity of the expectance operator. This allows us to compute the contributions to total uncertainty due to several input sources (cf. Lab 4). This may be particularly helpful when considering which aspects of experiment design should be improved to yield the lowest overall uncertainties.
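
A small sketch of Eq. (4.16c) for a product ψ = XY (the measured values are hypothetical):

    import numpy as np

    # hypothetical independent measurements: X = 10.0 +/- 0.3, Y = 4.0 +/- 0.2
    mu_x, sig_x = 10.0, 0.3
    mu_y, sig_y = 4.0, 0.2

    # psi = X * Y; partial derivatives evaluated at the means
    dpsi_dx, dpsi_dy = mu_y, mu_x
    sig_psi = np.sqrt((dpsi_dx * sig_x)**2 + (dpsi_dy * sig_y)**2)
    print(f"psi = {mu_x * mu_y:.1f} +/- {sig_psi:.1f}")   # ~40.0 +/- 2.3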

Chapter 5

PRINCIPLES OF STATISTICAL ESTIMATION

“There are three types of lies: lies, damn lies and statistics”

Benjamin Disraeli

Having built an understanding of measurements as realizations of random processes governed by probability laws, and having seen how many natural processes can be modeled by relatively simple, parametric distributions (the normal distribution being the most ubiquitous), we now turn to the more technical topic of fitting those distributions to observations. Because those theoretical distributions are entirely characterized by a handful of parameters θ = (θ₁, ..., θ_p), this amounts to estimating θ based on n observations x = (x₁, ..., xₙ) (n ≥ 1, hopefully), ideally with uncertainty bounds for those parameters.

I Preamble
Methods
We shall see three estimation methods:

1. The method of moments, which is simple, but risky.

2. Maximum likelihood estimation, which is not much more compli-


cated, but optimal by design.

3. Bayesian estimation, which is much richer, but usually more diffi-


cult.

In the following, we will denote estimates by a hat (θ̂) to make plain


that they are only a proxy for the real thing. The meaning of this no-
tation is worth explaining: we imagine a world where true parameters
θ roam free, and we try to ascertain what values they might assume
in our material world (possibly, your lab bench). This frequentist view
applies to the first two. The Bayesian view is that the parameters are

estimated from uncertain data, so they are themselves uncertain, and


therefore must be characterized by a probability distribution. If the es-
timation is good, that distribution is narrowly focused around a central
value, otherwise it will be spread out.

Point vs Interval Estimation


In the frequentist world, it is common to separate point estimation from
interval estimation. Point estimation applied to the age of the Earth would
be to say: given 10 Ar-Ar dates from mineral inclusions, what’s your
best guess for the age of the Earth? Interval estimation would be to say:
how well do you know that? In the Bayesian view, these distinctions are
somewhat superficial for the reason noted above (give me a distribution,
I’ll give you both its central tendency and its spread). However, since
the frequentist viewpoint has long pervaded the experimental sciences,
it is worth knowing this terminology.

II Method of Moments
We've seen in Chapter 3 that the moments of most distributions usually have a simple relationship to their parameters; the normal distribution is an extreme example of this, because its two parameters are essentially its first two moments. Hence the idea, due to Karl Pearson ca. 1890, to use those moments to estimate the parameters. It is quite simple to use, and almost always yields some sort of estimate. Unfortunately, in many cases, it yields estimates that leave a lot to be desired (e.g. a probability greater than one, a sample number smaller than zero). However, it is a good place to start when other methods prove intractable, and it is used in Lab 5.
Theorem 1 (Strong Law of Large Numbers) The n-sample mean

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (5.1)$$

converges almost surely (and hence in probability) to the true mean as the number of trials tends to infinity:

$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1\,. \qquad (5.2)$$

This law justifies the intuitive interpretation of the expected value of a random variable when sampled repeatedly as the long-term average. It is a justification for the ergodic assumption frequently used in physics.

According to the law of large numbers, the kth moment of a random variable X may be estimated as:

$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k \approx E\left(X^k\right) = \mu_k\,. \qquad (5.3)$$


Method of Moments:

1. Express the unknown parameter(s) θ as a function of the moments µ_k (k ≥ 1):

$$\theta = g(\mu_1, \ldots, \mu_k)\,. \qquad (5.4a)$$

2. Approximate the moments µ_k by the sample moments µ̂_k and solve for θ:

$$\begin{cases} \mu_1 = \mu_1(\theta) \approx \frac{1}{n}\sum_{i=1}^{n} x_i\,,\\[4pt] \mu_2 = \mu_2(\theta) \approx \frac{1}{n}\sum_{i=1}^{n} x_i^2\,,\\[4pt] \quad\vdots\\[4pt] \mu_k = \mu_k(\theta) \approx \frac{1}{n}\sum_{i=1}^{n} x_i^k\,. \end{cases} \qquad (5.4b)$$

Example (Normal method of moments)

X₁, ..., Xₙ i.i.d. N(µ, σ²). The parameters to estimate are θ₁ = µ and θ₂ = σ².

$$\mu_1 = E(X) = \mu\,.$$

$$\mathrm{Var}(X) = \sigma^2 = E(X^2) - E^2(X) = \mu_2 - \mu^2 \implies \mu_2 = \mu^2 + \sigma^2\,.$$

System to solve:

$$\mu_1 = \mu = \frac{1}{n}\sum_{i=1}^{n} x_i\,, \qquad \text{(E1)}$$
$$\mu_2 = \mu^2 + \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2\,. \qquad \text{(E2)}$$

We must solve the system simultaneously for µ and σ². From (E1), µ = (1/n) Σᵢ xᵢ = X̄. Substituting (E1) into (E2):

$$\mu^2 + \sigma^2 = \bar{X}^2 + \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$

$$\implies \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{X}\right)^2\,.$$

So:

$$\hat{\mu} = \bar{X}\,, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{X}\right)^2 \equiv S^2\,. \quad \leftarrow \text{Sample variance.}$$


Example

Given θ balls in a jar numbered 1, 2, ..., θ, we draw balls at random (P(xᵢ) = 1/θ, uniform distribution). We want to estimate θ from an i.i.d. sample X₁, ..., Xₙ.

Method of moments:

$$\mu_1 = E(X) = \sum_{i=1}^{\theta} i \cdot P(X = i) = \frac{1}{\theta}\sum_{i=1}^{\theta} i = \frac{1}{\theta}\,\frac{\theta(\theta+1)}{2} = \frac{\theta+1}{2}\,.$$

System of equations to solve: µ₁ = (θ + 1)/2 = (1/n) Σᵢ₌₁ⁿ xᵢ = X̄, hence

$$\hat{\theta}_{\mathrm{MoM}} = 2\bar{X} - 1 \qquad (5.5)$$

Example

X₁ = 3, X₂ = 2, X₃ = 10, X₄ = 4, X₅ = 2, so X̄ = 4.2 and

$$\hat{\theta}_{\mathrm{MoM}} = 2\bar{X} - 1 = 7.4\,.$$

(A rather poor estimate, since we already know that 10 has been drawn, so we know for sure that θ should be at least 10!)

III Maximum Likelihood Estimation


The Big Idea
The main idea is to treat the conditional probability of the observations
f ( X |θ ) as a function of the unknown parameters θ (with X fixed), and
find the value of the parameters that maximizes this function. That is,
we seek the value of the parameters that maximizes the probability of
observing the sample. This is much more defensible than the method
of moments, and leads to all sorts of nice properties.
The starting point is that the n observations are considered a random
sample of size n, i.e. realizations of n i.i.d. random variables X1 , . . ., Xn .

Discrete Case: estimating a Bernoulli parameter


Example
Sequence of Bernoulli trials (Tossing a coin, for example)
$$X = \begin{cases} 1, & P(X = 1) = p\,,\\ 0, & P(X = 0) = 1 - p\,. \end{cases}$$

In other words:

$$P(X = x \,|\, p) = p^x (1-p)^{1-x}\,, \quad \text{with } x = 0, 1 \qquad (5.6)$$


Given the sequence

x = {1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0}   (7 ones and 8 zeros)

we want to estimate p (= θ). How do we proceed? The silly way would be to try different values of p, simulate from that model, and see if the statistics are compatible with this sequence:

P(observing x if p = 0.99) → Small.
P(observing x if p ≈ 0.5) → Large.
P(observing x if p = 0.001) → Small.

Figure 5.1: Likelihood of the Bernoulli trial with mean x̄ = 0.5.

We can, of course, be more efficient about it. We can recognize the

probability of observing each datum as conditional on the parameters:

$$P(X_i = x_i \,|\, \theta) = f_\theta(x_i)$$

and we seek the probability of observing (x₁, ..., xₙ) simultaneously; that is, we seek the joint probability distribution of (X₁, ..., Xₙ). From the independence assumption,

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \times \cdots \times P(X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) = \prod_{i=1}^{n} f_\theta(x_i) \equiv L(\theta \,|\, \mathbf{x}) \qquad (5.7)$$

which we call the likelihood function. For each value of θ, the likelihood L(θ | x) is the probability of observing {X₁ = x₁, ..., Xₙ = xₙ} for that distribution and that parameter θ. Using Eq. (5.6):

$$L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \qquad (5.8)$$

Further, recognizing that $p^a \times p^b = p^{a+b}$, we can transform the product into a sum over the exponents:

$$L(p) = p^{\sum x_i}\, (1-p)^{n - \sum x_i} = p^y\, (1-p)^{n-y} \qquad (5.9)$$

wherein y = Σᵢ₌₁ⁿ xᵢ, the number of successful trials.


Now, it is generally more practical to work with the so-called log-
likelihood ` (θ ) = log L (θ ) instead of L (θ ): the product becomes a sum,
and that greatly simplifies calculations (which is exactly why the log
was invented in the first place).


Definition 2 The log-likelihood is given by:

$$\ell(\theta) \equiv \log L(\theta) = \log \prod_{i=1}^{n} f(x_i \,|\, \theta) = \sum_{i=1}^{n} \log f(x_i \,|\, \theta)\,. \qquad (5.10)$$

Since log is an increasing function, the maximum of L is the same as the maximum of ℓ, hence:

$$\ell\left(\hat{\theta}_{\mathrm{MLE}}\right) = \max_\theta \ell(\theta) \qquad (5.11)$$

Definition 3 A parameter θ̂ maximizing L(θ) is said to be a maximum likelihood estimator (MLE) of θ:

$$L\left(\hat{\theta}_{\mathrm{MLE}}\right) = \max_\theta L(\theta) \qquad (5.12)$$

All we have to do now to find the MLE of p is to compute ℓ(p) for 0 < p < 1 (strange things may happen at the boundary, so one needs to verify that other maxima are not present there), and find its maximum:

$$\ell(p) = \log\left[p^y (1-p)^{n-y}\right] = y \log(p) + (n-y)\log(1-p) \qquad (5.13)$$

Now, the extrema of ℓ(θ) occur whenever ∂ℓ/∂θ = 0 (a necessary, but not sufficient, condition for finding a maximum: the likelihood may have more than one maximum, or a minimum as in Fig. 5.2; we'll see later how to ensure that a unique maximum has been reached). If we do this, we get:

$$\frac{y}{p} - \frac{n-y}{1-p} = 0 \qquad (5.14)$$

After a few rearrangements, we get the solution:

$$\hat{\theta}_{\mathrm{MLE}} = \hat{p} = y/n = \bar{x}\,. \qquad (5.15)$$

That is, our best guess for the trial probability p is just the average of the successes. Intuitive though this may seem, it is immensely satisfying to arrive at this result via a rigorous optimality principle.
Figure 5.2: A bimodal likelihood.

Remark: θ̂ MLE does not have to be unique (cf Fig. 5.2), although it is
often the case in practice.
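
A numerical sanity check on the sequence above, minimizing the negative log-likelihood with SciPy (a sketch; the analytical answer is of course x̄ = 7/15 ≈ 0.47):

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
    y, n = x.sum(), x.size

    def neg_loglik(p):
        # negative of Eq. (5.13)
        return -(y * np.log(p) + (n - y) * np.log(1 - p))

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, y / n)   # both ~0.467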


Continuous Case: normal observations


Similarly in a continuous case with IID observations, the likelihood is
the joint density of the observations f J which factorizes as the product
of the density functions:

L(θ) = f_J(x1, · · · , xn | θ)                         (5.16a)
L(θ) = ∏_{i=1}^n f(xi | θ)      (independence)         (5.16b)
L(θ̂_MLE) = max_θ L(θ).                                 (5.16c)

If ℓ (or L) is differentiable with respect to θ, possible candidates for
the MLE of θ are solutions of:

∂L/∂θ = 0,   or   ∂ℓ/∂θ = 0.

For a vector θ = (θ1, . . . , θk), this implies:

∂L/∂θi = 0,   or   ∂ℓ/∂θi = 0,   ∀i ∈ {1, · · · , k}.

Estimating the parameters of a normal model

Consider a random normal sample of size n, that is: X1, . . ., Xn i.i.d.
∼ N(µ, σ²). We seek:

µ̂ = ?,   with µ ∈ R,
σ̂² = ?,  with σ² ∈ (0, ∞).

Calling θ1 = µ and θ2 = σ², and given that f(x|θ) = (1/(σ√(2π))) exp[−(x − µ)²/(2σ²)], then:

L(µ, σ²) = ∏_{i=1}^n f(xi | µ, σ²)
         = ∏_{i=1}^n (1/(σ√(2π))) exp[−(xi − µ)²/(2σ²)]
         = (2π)^{−n/2} (σ²)^{−n/2} exp[−(1/(2σ²)) ∑_{i=1}^n (xi − µ)²]   → likelihood function.
                                                                          (5.17)

Log-likelihood:

ℓ(µ, σ²) = log L(µ, σ²) = −(n/2) log(2πσ²) − ∑_{i=1}^n (xi − µ)²/(2σ²).


We first want to maximize ℓ(θ) with respect to θ1 = µ:

∂ℓ/∂µ = ∑_{i=1}^n (xi − µ)/σ² = 0  ⟺  ∑_{i=1}^n xi = nµ
                                    ⟺  µ̂ = (1/n) ∑_{i=1}^n xi = x̄.

So the MLE of the mean is just the sample mean? Really, all this for that?
Well, sure, it's intuitive, but now it's also rigorously established⁴. What
about σ? Can it be estimated via the sample standard deviation? It turns
out that this is indeed the MLE. Now maximizing ℓ(θ) with respect to
θ2 = σ²:
⁴ Note that Gauss famously obtained the normal distribution as one that would have this property: that is, he considered the inverse problem of designing a distribution for which the sample mean would be the best estimate of the true mean (Kim and Shevlyakov, 2008).

∂ℓ/∂σ² = −n/(2σ²) + ∑_{i=1}^n (xi − µ)²/(2σ⁴) = 0
       ⟺  nσ̂² = ∑_{i=1}^n (xi − µ)²
       ⟺  σ̂² = (1/n) ∑_{i=1}^n (xi − µ)²,   but we just showed that µ̂ = x̄,
       ⟺  σ̂² = (1/n) ∑_{i=1}^n (xi − x̄)² = s²   (sample variance)


So, µ̂ = x̄ and σ̂² = s².

This happens to coincide with the result of the method of moments, but it will
turn out to be an exception⁵.
⁵ In general, MLE is very different from MOM, and MLE has a sound basis in statistical theory, unlike MOM.

Now, is (x̄, s²) really a maximum of the likelihood? A minimum? A saddle point?
A sufficient condition is for L (or ℓ) to have negative curvature around that point:

1)  ∂²ℓ/∂µ² < 0,   ∂²ℓ/∂(σ²)² < 0,                           (5.18a)
2)  (∂²ℓ/∂µ²) (∂²ℓ/∂(σ²)²) − (∂²ℓ/∂µ∂σ²)² > 0.               (5.18b)

It is easy to verify that this is the case here, so that:

θ̂_MLE = (µ̂, σ̂²) = (x̄, s²).                                   (5.19)

These estimates are unique for the normal distribution (it is unimodal),
and show that the sample mean and sample variance are legitimate es-
timators of the true parameters (they are the most consistent with the
observations in this model).
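As a numerical sanity check (a sketch of mine, not from the text), one can verify that maximizing the normal log-likelihood with a generic optimizer recovers x̄ and s² on a synthetic sample:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # synthetic N(3, 4) sample

def neg_loglik(theta):
    mu, log_sigma = theta                      # optimize log(sigma) so that sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1]) ** 2

print(mu_hat, x.mean())      # numerical MLE vs sample mean
print(sigma2_hat, x.var())   # numerical MLE vs (1/n) sum (x_i - x̄)^2
```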


The method of moments revisited, and defeated

Let us compute the MLE for the previous example where

f(x|θ) = 1/θ for 0 < x < θ, and 0 otherwise.

X1, . . . , Xn i.i.d.

θ̂_MoM = 2X̄ − 1    (unacceptable if max xi > 2X̄ − 1)
θ̂_MLE = ?

Likelihood function:

L(θ) = ∏_{i=1}^n f(xi | θ) = 1/θⁿ if 0 ≤ xi ≤ θ for all i (i.e. if θ ≥ max xi), and 0 otherwise.

Since θ^{−n} decreases with θ, the likelihood is largest at the smallest admissible value of θ, so θ̂_MLE = max_i xi.

IV Quality of Estimators

Figure 5.3: Likelihood function for the example above. Here we can see that the likelihood is nonzero only for θ ≥ max xi.
By now you should be familiar with the idea that inference is the task
of estimating unknown, sometimes latent parameters from observable
data, given some model. We’ve seen two ways to do this, and we may
ask, how good are they?
Recall here that estimators are statistics, that is, deterministic func-
tions of observations, which we hypothesize as originating from a ran-
dom process (one about which we have imperfect knowledge). There-
fore estimators like µ̂ and σ̂ are themselves random variables: they are
characterized by a distribution. There are two desirable properties of
estimators:

Accuracy (no or low bias): we want E(θ̂) = θ. We define bias(θ̂, θ) ≡ E(θ̂) − θ.

Precision: small variance of θ̂.

Example (with N(µ, σ²)): for µ̂ = x̄,
E(x̄) = E[(x1 + . . . + xn)/n] = (1/n) ∑_i E(xi) = µ.   No bias!

Mean Square Error (MSE)


Definition 4

MSE(θ̂, θ) = E[(θ̂ − θ)²]          (5.20)


The MSE incorporates the two types of errors: variability (precision)
and bias (accuracy). An estimator that has good MSE properties has
small combined bias and variance. Indeed,

E[(θ̂ − θ)²] = E[((θ̂ − E(θ̂)) + (E(θ̂) − θ))²]
            = E[(θ̂ − E(θ̂))²] + (E(θ̂) − θ)² + 2 (E(θ̂) − θ) E[θ̂ − E(θ̂)]   (last term = 0)
            = V(θ̂) + bias²

This is called the bias-variance decomposition of the MSE.


Example (normal estimation, N(µ, σ²))

MLE: µ̂ = X̄,  σ̂² = (1/n) ∑_i (xi − X̄)²:

E(X̄) = µ,   V(X̄) = σ²/n,                               → unbiased.
E(σ̂²) = ((n−1)/n) σ²,   V(σ̂²) = (2(n−1)/n²) σ⁴,         → biased.

Hence:

MSE(µ̂, µ) = bias²(µ̂, µ) + Var(µ̂) = Var(µ̂) = σ²/n.
MSE(σ̂², σ²) = [((n−1)/n) σ² − σ²]² + (2(n−1)/n²) σ⁴ = ((2n−1)/n²) σ⁴.
You can see that the estimator of the variance is slightly biased. Should
we correct for that? Prima facie that seems like a good idea, but let’s
consider all error sources.

Bias-Variance Trade Off


Consider the following estimator of σ²:

σ̂² = (1/(n−1)) ∑_i (xi − x̄)² = (n/(n−1)) σ̂²_MLE.

Then
E(σ̂²) = σ²   (no bias)
and
MSE(σ̂², σ²) = V(σ̂²) = (2/(n−1)) σ⁴.


But
MSE(σ̂²_MLE, σ²) = ((2n−1)/n²) σ⁴.

Considering that (2n−1)/n² < 2/(n−1), we have that:

MSE(σ̂²_MLE, σ²)  <  MSE(σ̂², σ²).          (5.21)
(↑ bias, ↓ var)      (↓ bias, ↑ var)        (5.22)

So even though it would be tempting to get rid of the bias, this would
increase the overall error. By trading off variance for bias, a lower MSE
can be reached. The MLE is optimal in the sense that it has the lowest
MSE of all estimators; it sometimes achieves this by introducing some
bias, so bias is not inherently evil. As usual in life, there is no free lunch:
one cannot simultaneously lower both bias and variance, so tradeoffs must be made.
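This trade-off is easy to see by simulation. The following sketch (mine, not the author's) draws many normal samples and compares the empirical MSE of σ̂²_MLE (divide by n) with that of the unbiased estimator (divide by n − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_rep = 10, 4.0, 100_000

# n_rep independent samples of size n from N(0, sigma2)
x = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

var_mle = ss / n          # biased, lower variance
var_unb = ss / (n - 1)    # unbiased, higher variance

mse_mle = np.mean((var_mle - sigma2) ** 2)   # should be near (2n-1)/n^2 * sigma^4
mse_unb = np.mean((var_unb - sigma2) ** 2)   # should be near 2/(n-1) * sigma^4
print(mse_mle, (2 * n - 1) / n**2 * sigma2**2)
print(mse_unb, 2 / (n - 1) * sigma2**2)
```

With these numbers the biased MLE indeed attains the smaller MSE, as claimed in Eq. (5.21).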

Consistency
Both estimators above have an MSE that will tend to zero as n → ∞: this
is a property called consistency: as we collect more and more observa-
tions, we eventually beat down our uncertainty about the true parame-
ters to zero. Put differently, a consistent estimator is one that converges
to the true parameter value as the number of observations gets large
(“asymptotically”). We’ll see in Chap 9 that some spectral estimators
don’t do that, and that will be grounds for dismissal.

Efficiency
Suppose θ is an unknown parameter which is to be estimated from mea-
surements x, distributed according to some probability density function
f ( x; θ ). The variance of any unbiased estimator θ̂ of θ is then bounded
from below by the inverse of the Fisher information I (θ ). That is:

Var(θ̂) ≥ 1/I(θ)          (5.23)

where the Fisher information I(θ) is defined by:

I(θ) = E[(∂ℓ(x; θ)/∂θ)²] = −E[∂²ℓ(x; θ)/∂θ²]          (5.24)

and ℓ(x; θ) = log f(x; θ) is the natural logarithm of the likelihood function
and E denotes the expected value (over x).
The efficiency of an unbiased estimator θ̂ measures how close this
estimator's variance comes to this lower bound; estimator efficiency is
defined as

e(θ̂) = I(θ)⁻¹ / Var(θ̂)          (5.25)


or the minimum possible variance for an unbiased estimator divided by


its actual variance. The Cramér–Rao bound thus gives e(θ̂) ≤ 1.
One can verify that the MLE is (asymptotically) efficient; that is, it is the best point
estimator data can buy.

Note The method of moments yields estimators that are generally not efficient, and can
even produce inadmissible values (as the example above shows), so you should prefer MLE.
Most numerical estimation routines (including Python's fitter) are based on MLE. MLEs have
good theoretical properties: they are invariant under rescaling, consistent, efficient, and
converge in distribution to a normal (asymptotically, that is, for a large number of
observations). In practice, finding the maximum may not be as straightforward as in the normal
cases shown above, but even in a case where no closed-form expression exists, it usually
involves fairly modest computations. If it does not, the Expectation-Maximization algorithm
will quickly become your friend.

V Bayesian Estimation
The Big Idea
(The following section is very strongly inspired from Wikle and Berliner (2007).)

Gelman et al. (2013) define Bayesian inference as

“. . . the process of fitting a probability model to a set of data and sum-


marizing the result by a probability distribution on the parameters of the
model and on unobserved quantities such as predictions for new obser-
vations”.

The basic idea is that we want the distribution of the parameters
given the observations, p(θ|X)⁶, but often we can only compute p(X|θ),
and have some knowledge about θ. To invert this situation we use Bayes'
theorem:
⁶ Here p is a generic notation for a probability distribution.

p(θ|X) = p(X|θ) p(θ) / p(X)          (5.26)
There are four essential ingredients in this recipe, which all have
great conceptual importance:

Data distribution p( X |θ ). Statisticians often refer to this as a “sampling


distribution” or “measurement model”. It is simply the distribution
of the data, given the unobservables. When viewed as a function
of θ for fixed X, it is known as a likelihood function, L( X |θ), as in
classical maximum likelihood estimation (Sect. III). A key is that one
thinks of the data conditioned on θ. For example, if X represents
imperfect observations of temperature, and θ the true (unobservable)
temperature, then p( X |θ ) quantifies the distribution of measurement
errors in observing temperature, reflecting possible biases as well as
instrument error.


Prior distribution p(θ ): This distribution quantifies our a priori under-


standing of the unobservable quantities of interest. For example, if
X corresponds to temperature, then one might base this prior dis-
tribution on historical information (climatology) or perhaps from a
forecast model. In general, prior distributions can be informative or
non-informative, “subjective” or “objective”. The choice of such dis-
tributions is an integral part of Bayesian inference⁷.
⁷ Because it involves a human decision, some frequentists sneer at this as a despicable parody of statistics – forgetting that they often have to make all kinds of unrealistic assumptions, most of them highly subjective, to be able to compute an MLE.

Marginal distribution p(X). This can be rewritten by conditioning on
the parameters and integrating over them⁸:
⁸ effectively using the law of total probability, Eq. (2.18)

p(X) = ∫ p(X|θ) p(θ) dθ          (5.27)

We assume continuous θ but note that there are analogous forms
(sums) for discrete θ. This distribution is also known as the prior
predictive distribution. Alternatively, for the observations X, p(X)
can be thought of as the “normalizing constant” in Bayes’ rule.
Unfortunately, it is only for very specific choices of the data and prior
distribution that we can solve this integral analytically.

Posterior distribution p(θ | X ): This distribution of the unobservables given


the data is our primary interest for inference. It is proportional to the
product of the likelihood and the prior. The posterior is the update of
our prior knowledge about θ as summarized in the prior p(θ ) given
the observations X. In this sense, the Bayesian approach is analogous
to the scientific method: one has prior belief (information), collects
data, and then updates that belief given the new data (information).
Much about the rational human mind works this way (remember
how you learned how to walk?).

In many cases, the normalizing constant p( X ) is of secondary impor-


tance, so it is common to rewrite Eq. (5.26) as:

p(θ | X ) ∝ p( X |θ ) p(θ ) (5.28)

which makes plain that the posterior distribution (what we want) is
the product of the likelihood (how probable the observations are, given
the parameters) and the prior (what we knew before seeing the data).


Bayesian coin flips


We return to the examples of Sect. III, now with a Bayesian viewpoint.
Given n coin flips (or some other kind of Bernoulli trial), how do we
estimate the underlying parameter p?
Writing the likelihood is trivial once you know p (see Eq. (5.6)). Of
course, we don’t know p (that’s the whole point), but we can set priors
on it, and then apply Bayes’ theorem to obtain the posterior Pr ( p| x).
That’s exactly what’s done in this excellent blog post. To simplify the
estimation, it employs an oft-used Bayesian stratagem: conjugate pri-
ors. While the Bayesian recipe is universal, it can be computationally te-
dious. Conjugate priors help simplify calculations a great deal; a conju-
gate prior is one that keeps the same functional form once multiplied by
the likelihood (see “A note on priors”). For a binomial likelihood, those
priors (and posteriors) look like the beta distribution, whose density f
writes:
f(p; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] p^{α−1} (1 − p)^{β−1}          (5.29)

where Γ is the Gamma function and α, β are two real, positive parameters.
α may be interpreted as the number of times success is observed and β
as the number of times failure is observed. The beta distribution is
extremely flexible and limited to the interval [0, 1], so it is especially useful
to model a probability like p (Fig. 5.4).

Figure 5.4: Beta distribution for various values of its parameters α and β. Note that it is always zero outside the interval [0, 1].

The companion widget allows one to experiment with various choices of
α and β⁹. By playing with the widget, you will soon see when the choice
of the prior parameters matters, and when it doesn't.
⁹ These parameterize the prior, and are called hyperparameters. In principle these can also be associated with distributions (called hyperpriors), though nothing so complicated is required here.
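For instance, here is a minimal sketch (my own illustration, with assumed hyperparameters) of the conjugate update: with a Beta(α, β) prior and y successes in n Bernoulli trials, the posterior is Beta(α + y, β + n − y), which scipy.stats.beta can summarize directly:

```python
from scipy.stats import beta

# data from the coin-flip example: 7 successes in 15 trials
n, y = 15, 7

# prior hyperparameters (assumed here for illustration)
a0, b0 = 2, 2

posterior = beta(a0 + y, b0 + n - y)    # conjugate Beta posterior
print(posterior.mean())                  # posterior mean of p
print(posterior.interval(0.95))          # 95% credible interval for p
```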

Normal model with known variance, unknown mean


Let us apply the Bayesian framework to a somewhat artificial exam-
ple, simplified to the extreme for the purpose of analytical tractabil-
ity. Say we are interested in the temperature T at some location. As-
sume we have the prior distribution (e.g., from historical observations):
T ∼ N (µ, τ 2 ). Conditioned on the true value of the state process,
T = m, assume we have n independent (given t) but noisy observa-
tions X = ( X1 , · · · , Xn ), with standard deviation σ, and thus the data
model :

Xi |{ T = m} ∼ N (m, σ2 ) (5.30)


Let’s assume for simplicity that we know σ (it comes from a calibrated
thermometer of known precision) and we want the distribution of m
conditional on observations, p(m| X ). What we do know is:
p(X|m) = ∏_{i=1}^n (1/(σ√(2π))) exp[−(xi − m)²/(2σ²)]          (5.31a)
p(X|m) ∝ exp[−(1/2) ∑_{i=1}^n ((xi − m)/σ)²]                    (5.31b)

And the prior is, by assumption:

p(m) = (1/(τ√(2π))) exp[−(m − µ)²/(2τ²)]                         (5.32)
So Bayes’ rule [Eq. (5.28)] gives:

p(m|X) ∝ exp{−(1/2) [∑_{i=1}^n ((xi − m)/σ)² + ((m − µ)/τ)²]}          (5.33)

Notice that this is just the product of two Gaussian distributions. It
can be shown (by completing the square) that the normalized product
is also Gaussian¹⁰, T|X ∼ N(µ_p, σ_p²), with the following mean and
variance:
¹⁰ in other words, the normal prior and normal likelihood conspire to make a normal posterior. We have just found another conjugate prior.

µ_p = (∑_{i=1}^n xi/σ² + µ/τ²) / (n/σ² + 1/τ²)          (5.34a)
σ_p² = 1 / (n/σ² + 1/τ²)                                 (5.34b)

Put differently: the posterior variance is the harmonic mean of the


observation and prior variance, weighted by the number of observa-
tions. In that case, the prior amounts to one extra observation, and it
will quickly get overwhelmed by the actual observations as long as σ is
comparable to or smaller than τ.
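As a quick numerical check (a sketch of mine, with numbers chosen to match the Wikle and Berliner (2007) example discussed below), Eq. (5.34) can be compared to a brute-force evaluation of the un-normalized posterior on a grid:

```python
import numpy as np

mu, tau = 20.0, np.sqrt(3.0)    # prior N(mu, tau^2)
sigma = 1.0                      # known observational error
x = np.array([19.0, 23.0])       # two observations
n = x.size

# closed-form posterior, Eq. (5.34)
sp2 = 1.0 / (n / sigma**2 + 1.0 / tau**2)
mp = (x.sum() / sigma**2 + mu / tau**2) * sp2

# brute force: prior x likelihood on a grid, then normalize
m = np.linspace(10, 30, 20001)
log_post = -0.5 * ((m - mu) / tau) ** 2 - 0.5 * (((x[:, None] - m) / sigma) ** 2).sum(axis=0)
w = np.exp(log_post - log_post.max())
w /= np.trapz(w, m)

print(mp, np.trapz(m * w, m))                 # posterior means agree (~20.86)
print(sp2, np.trapz((m - mp) ** 2 * w, m))    # posterior variances agree (~0.43)
```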
The posterior mean (the conditional expectation of the temperature,
given the observations) may be re-expressed as:

E(T|X) = w_x x̄ + w_µ µ          (5.35)

with x̄ = ∑_{i=1}^n xi/n (the sample mean of the observations), w_x = nτ²/(nτ² + σ²)
and w_µ = σ²/(nτ² + σ²)¹¹.
¹¹ Verify that w_x + w_µ = 1.
That is, the posterior mean is a weighted average
of the prior mean and the sample mean of the observations. Note that
of the prior mean and the sample mean of the observations. Note that
as τ 2 → ∞ (vague prior), the data model overwhelms the prior and

p(m| X ) → N ( x̄, σ2 /n) (5.36)

(as per the central limit theorem). Alternatively, for fixed τ 2 , but very
large amounts of data (i.e., n → ∞) the data model again dominates


the posterior density. On the other hand, if τ is very small, the prior
is critical for comparatively small n (it is highly informative). Though
these properties are shown for the normal data, normal prior case, it is
generally true that for very large datasets, the data model is the major
controller of the posterior.

To illustrate this, assume the prior distribution is m ∼ N(20, 3), and
the data model is Xi|m ∼ N(m, 1). We have two observations x = (19, 23).
The posterior mean is 20 + (6/7)(21 − 20) = 20.86 and the posterior
variance is (1 − 6/7) × 3 = 0.43. Fig. 5.5 shows these distributions
graphically. Since the data are relatively precise compared to the prior,
we see that the posterior distribution is closer to the likelihood than the
prior. Another way to look at this is that the data weight w_x = 6/7 is
close to one, so that the data model is weighted more than the prior.

Figure 5.5: Posterior distribution with normal prior and normal likelihood; relatively precise data. From Wikle and Berliner (2007).

Next, assume the same observations and prior distribution, but change
the data model to Xi|m ∼ N(m, 10). The data weight is now w_x = 6/16
and the posterior distribution is N(20.375, 1.875). This is illustrated in
Fig. 5.6. In this case, the weight is closer to zero (since the measurement
error variance is relatively large compared to the prior variance) and
thus the prior is given more weight.

Figure 5.6: Posterior distribution with normal prior and normal likelihood; relatively imprecise data. From Wikle and Berliner (2007).

Normal model with unknown mean and variance

In the case where σ is unknown, things get a little more difficult, but still
tractable. Gelman et al. (2013, chap2) show that a convenient prior for σ²
is the scaled inverse χ² distribution, parameterized by its degrees of
freedom ν0 and a scale parameter σ0¹². They show a very neat result: in
such a case the role of the prior can be thought of as providing information
equivalent to ν0 independent observations with average precision σ0. This
illustrates once more that the role of the prior is to provide a starting point
for the inference. In the cases where the observations are too few, or too
poor, the choice of the prior will be important to the final outcome (the
posterior distribution); in cases where observations are numerous and/or
good, its role is diluted by the observations.
¹² the inverse χ² distribution is a special case of the inverse gamma distribution, plotted in Fig. 5.7
A note on priors
Three notable properties of prior distributions deserve mention here:

• Conjugacy: As we’ve seen, conjugate priors are priors that, combined
with the likelihood, produce a posterior distribution that has the same
form as the prior. For instance, the normal model above with a normal
prior on the observations’ mean yielded a normal posterior. For a binomial
likelihood you’d choose a beta prior. The main justification for using such
priors is that it enables a closed-form expression for the posterior;
otherwise, the posterior has to be found numerically, sometimes at
considerable expense. Clearly the choice of a conjugate prior is a matter
of convenience, and may not always reflect our actual scientific knowledge
about the parameters we wish to discover from the data. They also have
great pedagogical value, but should be avoided in scientific applications
unless they have a compelling scientific rationale.

Figure 5.7: PDF of the inverse gamma distribution for various values of its parameters α and β. If X ∼ Scale-inv-χ²(ν, σ²), then X ∼ Inv-Gamma(ν/2, νσ²/2).

• Objectivity: Uninformative priors, sometimes called flat priors, dif-


fuse priors, vague priors or (more boldly) “objective priors”, attempt
to provide as little information as possible to let the data speak. In
ideal circumstances (with good enough data), one may choose an es-
sentially flat prior (e.g. a uniform distribution) for a parameter, and
the likelihood will overwhelm it, leading to an estimate that is very
close to objective. Note that conjugacy and objectivity are not mutu-
ally exclusive. Any prior may be made uninformative by expanding
its scale parameter to large values.

• Han Solo: For an entertaining view on how to choose a prior for the
odds of Han Solo to make it through an asteroid field, read this blog
post by Count Bayesie. The bottom line is that sometimes it is sensible
to be subjective, especially if you know Hollywood.

Non-analytical cases
As said above, in most real-world cases the posterior distribution does
not have a closed-form solution. In such cases, one must evaluate the
various integrals numerically, often over multiple dimensions, and this
is a topic of considerable complexity. Some of this can be coded in your
favorite language¹³, though more specialized packages like STAN are
probably preferable for a first-timer. Either way, most applied Bayesians
spend the majority of their time hunched over a computer, as illustrated
in Fig. 5.8.
¹³ e.g. PyMC3 in Python

Bayesian pros and cons


Spinal health notwithstanding, there are many advantages to the Bayesian
method of inference. First, it allows one to use scientific knowledge as encoded
by a prior, hence bringing all relevant knowledge to bear on the analysis of a
dataset. As such, it mimics the scientific mind, updating intuition and
prior experience in the light of observations. Second, the posterior dis-
tribution is a much richer output than an MLE point estimate, and im-
mediately characterizes everything we want to know about the model
parameters: their most likely value (posterior mode), their central ten-
dency (posterior mean), their uncertainty (posterior variance or IQR).
In the case of a flat prior, the posterior mode corresponds to the MLE,


Figure 5.8: The evolution of statisticians:


Homo apriorus only considers his prior
knowledge p(θ ), without regard for ob-
servations; Homo Pragmaticus only con-
siders the observations (X), which is not
much better; Homo frequentistus consid-
ers the likelihood (the conditional distri-
bution of observations given the param-
eters), f ( X |θ); Homo sapiens, being wiser,
considers the joint distribution of param-
eters and observations; finally, Homo
Bayesianis, the wisest of all, considers
the posterior distribution of parameters
given the observations p(θ | X), but needs
an awful lot of coding to do that.

but the posterior also provides uncertainty quantification for the same
price.
One particularly neat use of the posterior is the ability to forecast new
observations. The posterior predictive distribution is the distribution of a
new data point x̃, marginalized over the parameters:
p(x̃|X) = ∫_θ p(x̃|θ) p(θ|X) dθ          (5.37)

Notice how the left hand side depends solely on the observations, but no
longer on the parameters. That is, past observations have been digested
by the statistical model so it can produce a probabilistic forecast of a
new estimate. In the normal case with known variance:
p(x̃|X) ∝ ∫_R exp[−(1/2)((x̃ − m)/σ)² − (1/2)((m − µ_p)/σ_p)²] dm          (5.38)

so it can be seen that integrating over the previously unknown parame-


ter m has made it disappear, and we are now predicting future (testable)
observations given only past observations, which is quite appealing to
a scientist.
So what can possibly go wrong with the Bayesian approach? As said
earlier, accommodating scientifically-defensible priors may require very
complex integrals, and therefore very heavy computations (e.g. Monte
Carlo Markov Chains). These are far beyond the scope of this book, but
you may want to dig in this direction if Bayesians have taken over your
field of study. Tarantola (2004) is an excellent introduction for geophysi-
cists.

Chapter 6

CONFIRMATORY DATA ANALYSIS

One must credit an hypothesis with all that has had to be discovered in
order to demolish it.

Jean Rostand

Now we’re getting to what most laypeople think of when they hear
the word ’statistics’: putting measurements to the test, and finding whether
the results are significant. This is usually taken as a synonym of rigor,
but we shall see that care is needed: sometimes, applying the wrong test
or testing the wrong hypothesis is worse than not doing any statistics at
all.

I Preamble
Two tribes war over the ownership of a piece of land, which each claims
to have first occupied before the other. They ask a geochronologist to
arbitrate their dispute. The geochronologist finds wood and bone arte-
facts that allow the site to be radiocarbon-dated and the claims evalu-
ated: specifically, tribe A claims to have occupied the region since 622
AD, while tribe B claims they got there
 in 615 AD.
 t̂ = 650 ± 50y , 615
A 622 650 750
The following dates come back: (1σ)
 t̂ B = 750 ± 50y .
Figure 6.1: Age distributions for artifacts
from the two tribes. In green we have the
The geochronologist asks you three questions: data from tribe A with mean of 650 years
and standard deviation of 50 years. In red
we have the data from tribe B with mean
1. How confident can I be in each date? → Confidence or credible In-
of 750 years and standard deviation of 50
tervals years.

2. Are the observations compatible with their claims? → Hypothesis


Test

3. How confident should I be that tribe A got there first? → p-value



II Confidence Intervals
How confident are we that the true age lies within a certain range? As-
sume t A ∼ N (µ̂ A , σA ), where µ̂ A = 650 y and σA = 50 y. We want to
find tmin and tmax such that :

P (tmin ≤ t A ≤ tmax ) = 95%

Transformation to a Z Score (standard normal)

Defining Z_A = (t_A − µ̂_A)/σ_A ∼ N(0, 1),

P(z_min ≤ Z_A ≤ z_max) = P(|Z_A| ≤ z_α)   (symmetry)          (6.1)

Find z_α such that P(|Z_A| ≤ z_α) = 1 − α, where 1 − α is the confidence
level (e.g. 95%). For reference:

P(|z| ≤ 1) = 68.2%,   P(|z| ≤ 2) = 95.45%,   P(|z| ≤ 3) = 99.73%.

α = 0.05 ⇒ z_{0.025} = Φ⁻¹(0.025) = −1.96,   z_{0.975} = Φ⁻¹(0.975) = +1.96.
 Z0.975 = Φ (0.975) = +1.96 .

A normal RV spends 95% of its time within 1.96σ of the mean. That is,
it lies within [µ − 1.96σ, µ + 1.96σ] with 95% confidence.

Figure 6.2: Confidence level of each confidence interval chosen.

Tribe A: Z_A = (t_A − µ̂_A)/σ_A ⇔ t_A = µ̂_A + σ_A Z_A,

⇒ P(µ̂_A − 1.96σ_A ≤ t_A ≤ µ̂_A + 1.96σ_A) = 95%.

A 95% C.I. for the arrival of tribe A is [552, 748]. A 95% C.I. for the arrival
of tribe B is [652, 848].
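In Python, such intervals can be read directly off the normal quantiles (a minimal sketch, assuming scipy is available):

```python
from scipy.stats import norm

# 95% confidence intervals for the two age estimates
print(norm.interval(0.95, loc=650, scale=50))   # tribe A: approximately (552, 748)
print(norm.interval(0.95, loc=750, scale=50))   # tribe B: approximately (652, 848)
```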

General Case

Find zα/2 , z1−α/2 such that,

P (zα/2 ≤ Z ≤ z1−α/2 ) = 1 − α . (6.2)

• Non linear problem (involves inverse CDF).

• Asymmetric if skewed distribution (e.g. Weibull, Gamma, Poisson).


1
An irony of orthodox statistics is that, if
Interpretation you quiz most practitioners on what a CI
is, the vast majority of them (though per-
A confidence interval is such that, if we were to repeat the experiment haps not 95%) will give you the defini-
a great many times, the true value t A would be in this interval 95% of tion of a credible interval. It’s not too sur-
prising, because it’s what people actually
the time. Notice that it is subtly different from saying we are “95% sure want to know, and if they truly under-
that t A in this interval”. Such is the difference between (frequentist) stood CIs the way frequentists do, they
would probably abandon the concept.
confidence intervals and (Bayesian) credible intervals1 .


III Testing Archeological Hypotheses


Working hypothesis: Tribe A arrived before Tribe B. Do the data
support this idea, and how confident are we about the statement?

Known Variance Case (Z-test)


Let us assume that we know the measurements’ precision:

σA = σB = σ = 50y .

Now, Tribe A claims that they arrived at t A = 622. So we define:


Null Hypothesis

µ A = 622 ( H0 )

And we proceed to evaluate the probability of observing the data


under this hypothesis. If H0 is true,
 
t − 622 650 − 622
P(t A ≥ 650) = P A ≥ (6.3a)
50 50
 
650 − 622
P(t A ≥ 650) = P z A ≥ , z A ∼ N (0, 1) (6.3b)
50
P(t A ≥ 650) = P (z A ≥ 0.56) = 1 − Φ(0.56) (6.3c)
P(t A ≥ 650) ' 29% (6.3d)

where, once again, we have transformed our normal variable to a Z


score so we can compute everything in terms of Φ, the standard normal
CDF. If H0 were true, we thus would observe ages as large as 650 about
29% of the time from chance alone; this is not a rare occurrence, and
means that the data do not allow us to reject the null hypothesis at the 5%
level. Equivalently, we find insufficient evidence to exclude this possibility
(“not guilty” is not the same thing as innocent). Put differently, if H0
were true, then t A should lie within [622 − 1.96 × 50, 622 + 1.96 × 50] =
[524, 720]. t̂ A = 650 is in this interval, but t̂ B is not. Note:

• This is called a Z test on account of the use of the standard normal.

• A 5% test level means “we have a 5% chance of observing a result as


extreme as this by random chance”. (see Sect. V)

• t̂_B ∉ this interval means that we could resoundingly (at the 5% level) reject
the hypothesis that tribe B got there even as late as 622 AD; the tribe's
claim that µ_B = 615 is even less convincing (the associated p-value
would be smaller still).
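The p-value computed above is one line of Python (a sketch, assuming scipy):

```python
from scipy.stats import norm

z = (650 - 622) / 50          # observed Z score under H0: mu_A = 622
p_value = 1 - norm.cdf(z)     # one-sided p-value, as in Eq. (6.3)
print(z, p_value)             # 0.56, ~0.29
```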


Indeed, the probability computed above is a p-value, a fundamen-


tal notion of confirmatory data analysis, and one so misused that the
American Statistical Association issued a cautionary statement on it in
2016. Some journals have even taken the drastic measure of banning
them altogether. So, with that context: a p-value is the probability of ob-
taining a test statistic at least as extreme as the one observed, assuming the
null hypothesis is true. A p-value is emphatically not the probability that
H0 is true (or false, for that matter). Rather, it’s used to quantify how
much evidence the data provide against H0 . The smaller the p-value,
the more suspicious we are that the data could have been generated un-
der the null hypothesis. We say that the test is rejected if the p-value
falls under the (prescribed2 ) test level: p ≤ α. If such is not the case, we 2
It bears repeating that α is chosen by
say... nothing. We shut up and we collect more data. Or we go talk to the analyst, and there is nothing magi-
cal about 5%. If a test passes at the 5.1%
a Bayesian statistician and try to squeeze more information out of our but not the 5% level, it does not mean
data. that the result is “insignificant”. It only
means that under the null hypothesis, re-
sults as extreme as the one observed hap-
Unknown Variance Case (T-test) pen about 5% of the time from chance
alone.
Assume now that we don’t know the measurement’s precision; we have
to estimate it. To that end, our geochronologist takes repeated samples

t A1 , t A2 , . . . , t An and estimates:
nA
1
tA =
nA ∑ t Ai (Sample mean)
i =1

with n A = 12 and idem for t B with n B = 9.

Question Are the data still compatible with t A = 622?

Question What is the probability that t A > t B ?


Assuming again that t A ∼ N µ A , σA 2 , then:

E tA = µA ,
 σ2
V tA = A , (Central limit theorem)
nA

if σ2 is known. If it is not known, estimate it via the sample variance:


nA
1 2
S2A = ∑ tA − tA (6.4)
n A − 1 i =1 | i {z }
2
iid RVs∼N (0,σA )

2 , so S2 is the sum of n squared, normal
Note that t Ai − t A ∼ N 0, σA A
RVs with zero mean and identical variance σA 2 (precisely what we seek


to estimate). We can of course scale them so that they have unit vari-
2
ance, for simplicity. (n − 1) σS2 is then the sum of n iid standard normal ν=1

E Xν2 = ν,

random variables, squared. It turns out that solving this problem was 2
V Xν = 2ν.
2
sufficiently important that (n − 1) σS2 got its own distribution, known as
a chi-squared distribution, dependent only on the number of elements in ν=2
ν=3
the sum (ν), and denoted χ2ν . It is a special case of Gamma distribution ν=5

(cf Fig. 6.3), with the following properties:


n  Figure 6.3: χ2 distribution for different
Zi ∼ N (0, 1) I ID
χ2ν = ∑ Zi2 , (6.5) values of ν. We can see that for large ν,
i =1
 ν = n − 1 = "degrees of freedom" it converges to a Gaussian, N (ν, 2ν), as
ν  per the central limit theorem. Note that ν
need not be an integer.
χ2ν ∼ Γ ,2 . (6.6)
2
x
ν −1
2 e − 2x
i.e. f χ2ν ( x ) =  ν (6.7)
Γ ν
2 22

n n
1 S2
S2 = ∑ ( x i − x )2 ⇒ ( n − 1) = ∑ Zi2
n − 1 i=1 |{z} σ2 i =1 |{z}
N (µ,σ2 ) N (0,1)

∼ χ2n−1 .

Now you may ask: why do we only have n − 1 degrees of freedom when
we added up n independent numbers? The answer is that 1 degree of
freedom went into estimating the mean, so we lost that.
Now recall the properties of the sample variance: Figure 6.4: PDF of Student’s T distribu-
 tion, which is close to Gaussian with fat-
1. E S2 = σ2 , Unbiased Estimator. ter tails: that is, having to estimate the
 √ variance introduces uncertainty about
2σ4
2. V S2 = n −1 , (uncertainty about S also decreases as n) the estimate of the mean. However, for
ν → ∞, tν → N (0, 1), in another splen-
So it would make sense to consider the test statistic: diferous manifestation of the central limit
theorem. Credit:Wikipedia
tA − µA
T̂A ≡ √ (6.8)
SA / nA
Dividing numerator and denominator by σA , we see that it is basically
the ratio of a standard normal to the square root of a chi-squared vari- 3
In the English-language literature it
able. Informally, one may write: takes its name from William Sealy Gos-
set’s 1908 paper in Biometrika under the
Z pseudonym "Student". Gosset worked
TA ∼ q at the Guinness Brewery in Dublin, Ire-
χ2n−1 land. One version of the origin of the
pseudonym is that Gosset’s employer for-
So what, you say? Again, solving this problem was, at some point, bade members of its staff from publish-
ing scientific papers, so he had to hide his
sufficiently important that someone went through the trouble of work- identity. Another version is that Guin-
ing out this distribution analytically, naming it Student’s T (or t) distri- ness did not want their competitors to
bution3 : tν , depicted in Fig. 6.4. In Python, it may be accessed via scipy know that they were using the t-test to
test the quality of raw material. Either
.stats.t(). In the case of the mean of n IID normal RVs, ν = n − 1. way, beer and statistics do mix. Credit:
Wikipedia


T-Test: We distinguish two kinds:

One Sample T-Test

Let us define the null hypothesis H0 : "µ A = 622" and the alternate
Hypothesis: Ha : "µ A > 622"4 . Can the data distinguish between the 4
this is a one-sided test; we could also test
observed and hypothesized means? Let us compute the T statistic5 : for µ A < 622, but it wouldn’t make sense
here since we can safely assume that ei-
ther tribe is giving us its earliest possible
tA − µA date of arrival at the site
T̂ = √ = 1.94
SA / nA

5
a deterministic function of the data
Is it significantly different from 0?

Then compute P t > T̂ = 1 − Ftν (1.94) where Ftν is the CDF of the
t-distribution with ν degrees of freedom. This is the p-value6 . 6
Here, one would compute
1 − tcdf(1.94, n A − 1)

• n = 4, P T > T̂ = 13%. We fail to reject H0 7 .
7
 this double-negative is essential to hy-
• n = 12, P T > T̂ = 4% → We reject H0 . pothesis tests: we start from a presump-
tion of innocence (H0 is true) and see if
the data can convince us otherwise. The
choice of H0 is therefore crucial, and we
Put differently, with n = 4, there is insufficient evidence to distinguish should be careful that it does not stack
between 650 and 622 at the 5% level; with n = 12 there is sufficient the cards against a particular result, as is
sometimes the case
evidence to do so. So the data suggest that tribe A is exaggerating a
little, but is fundamentally honest. On the other hand, tribe B is either
lying or delusional. Let us ask a more relevant question for our dispute:
"Did tribe B arrive significantly after tribe A?"

Two Sample T-Test

When the means of two samples must be compared, with unknown


variances, we use the two sample T-test. That is, we compare t A and t B ,
with the null hypothesis H0 : "µ A = µ B " and the alternate hypothesis
Ha : "µ B > µ A "8 . We compute the two-sample T statistic: 8
By this point we have dealt with tribe B
enough that we suspect that it did not get
there first, so we consider a one-sided al-
t − tA
T = rB ∼ tν0 . ternate hypothesis
S2B S2A
nB + nA

The value of ν0 depends on the case:


0 n A nB
Equal Variance ν = , (Harmonic Mean)
n A + nB
S2A S2B
0 nA + nB
Unequal Variance ν =  2  2 .
S2A 1 S2B 1
nA n A −1 + nB n B −1



n = 12
A
E.g.: → T ' −4.536.
n B = 9
As expected, the number is negative (since t B > t A ), but is it signifi-
cantly different from zero? For this we compute the p-value:

p-value = tcdf( T̂ = −4.536, ν = 108/21) ≈ 3 × 10−3  α = 5 × 10−2

That is, if the two distributions had equal means, it would be exceed-
ingly unlikely to observe a deviation this big. We can thus reject the
null hypothesis with very high confidence: we conclude that tribe A
colonized the land first. Note that we would not be able to make this
claim with the same level of confidence with n A = n B = 2 (HW: what
p-value would we get?), so it was critical to collect more measurements.
Application: pval = f (n A , n B ) can tell you how many measurements
you need to achieve a certain level of confidence in a result. When
designing experiments, this tell us how many are needed to claim a
"significant difference" – a useful approach to convince a program man-
ager that you need monies to go collect data in faraway locales, or buy
a new instrument. This touches on the rich topic of experiment design.

IV The Logic of Statistical Tests


To summarize, there are five ingredients to a statistical test:

1. Identify the test and the test statistic (e.g. T-test).

2. Define the null hypothesis (e.g. "H0 : µ = µ0 ").

3. Define an alternate hypothesis (e.g. "Ha : µ > µ0 ").

4. Obtain the null distribution (distribution of the test statistic assuming


H0 is true).

5. Compute the p-value (probability of occurrence of values as extreme


as the observed test statistic) and compare to the test level α.

p < α ⇒ reject null hypothesis (guilty).


p ≥ α ⇒ insufficient evidence (not guilty 6= innocent).

Example
A Palm Spring resort claims that f c = 67 of its days are cloud-free. Yet, ob-
servations over 25 days indicate that only 15
25 days are cloud-free. Is the claim
supported by the evidence?
k 15
1. Test statistic: f = N = 25 .
6
2. Null hypothesis: f = f c = 7 ' 0.857.


3. Alternate hypothesis: f < f c (claim is overblown).

4. Null distribution: Sequence of 25 Bernoulli trials with probability f X ∼


B (25, f c ), that is:

 
25 k
P (X = k| f = f c ) = f c (1 − f c )25−k . Binomial Distribution
k
(6.9)

5. p-value:

15  
25 k
P ( X ≤ 15) = ∑ f c (1 − f c )25−k
k =0
k
' 0.0015 Very unlikely

We conclude, with high confidence, that the data are inconsistent with the claim
that f = 67 .

V Test Errors

Type I vs Type 2 errors


How might a test lead us astray? There are two main errors that we
0.40
0.35 Ho : µ = µ0 Ha : µ = µa

should watch for. A type-one error is a false positive: finding an effect

Probability Density, f(z)


0.30
0.25
when there is no effect. Medical examples of this include determining 0.20

a practice to be unsafe when it is safe, or an intervention non-beneficial 0.15


0.10
when it is in fact beneficial. A type-two error is the reverse, a false neg- 0.05 β α

ative: a practice is determined to be safe when it is unsafe, or an inter-


0.00
2 0 2 4 6 8 10
Test statistic z
vention beneficial when it is not. More precisely:
 Figure 6.5: This plot represents the false
Type I error = α = P rejecting H0 /H0 true = False Positive. positive, α (red shading), and false nega-
 tive, β (green shading). We can see that if
Type I I error = β = P accepting H0 /Ha true = False Negative. β increases, α will decrease, though there
is in general no simple equation to relate
No simple relationship between α and β, but they are impossible to the two.
jointly minimize → there is a trade off. We summarize the decisions
(and their correctness) in a table:

Decision H0 is true H0 is false


Reject H0 False positive (Type I error) True negative
Fail to reject H0 True positive False negative (Type II error) Table 6.1: Possible outcomes of a statisti-
cal test, and associated errors
A type-three error, in contrast, is one where the analysis itself is framed
incorrectly and thus the problem is mischaracterized. This one is a more of
a logical fallacy than a statistical error, and cannot therefore be spotted
by statistics, but it is very common: keep your eyes peeled for it, and try
your best never to commit that sin!


The power of the test is defined as:



1 − β = P rejecting H0 /Ha is true (6.10)
• Measures how discriminatory it is.

• Always depends on the number of observations.


Typically, the more observations, the more power to discriminate, but
this obviously depends on their quality; sometimes one good observa-
tion is worth 10 bad ones. But if they are truly independent, the central
limit theorem tells us that a million bad ones can beat a few good ones!

VI Common Parametric Tests


We now visit common parametric tests, so named because they rely on
parametric distributions. In fact, all these rely exclusively on the nor-
mal distribution, more specifically normal IID observations (some ad-
justments can be made for dependent data).

Z-Test
Comparing 2 means where the variance is known: see Sect. III

T-Test
Comparing 2 means when both means and variance are unknown. see
Sect. III

F-Test
This test compares the variances of two samples from two populations.

Sample 1, σ , n observations S2 /σ2 χ2
1
Fm,n = 12 12 ∼ 2n−1 . 9
Its CDF maybe accessed via scipy.
Sample 2, σ2 , m observations S2 /σ2 χ m −1 stats.f.cdf(F, n, m)

(Ratio of two sample variances normalized by their variances)


H0 : "σ1 = σ2 " → the 2 samples have identical variance. so
the ratio should be close to unity. How close is “close”? In other words,
how large a deviation should we observe before we pronounce the two
variances as distinct? We quantify this via the F distribution, which
is known but complicated (non-analytical), and depends only on two
degrees of freedom n and m9 .

Testing for Goodness of Fit


Frequently, one is led to the question: does a theoretical distribution ac-
curately represent a given empirical CDF? Here we give two approaches
to answer this (there are many more).


χ2 Test 0.18
Empirical fit to a theoretical distribution
0.16

The idea is to plot the histogram of the theoretical and empirical PMFs 0.14

Probability Density, f(x)


0.12
(e.g. Fig. 6.6). If the fit is good, then the observed number of events in 0.10

bin k (Ok ) should be close to the expected number (Ek ). To quantify this,
0.08
0.06

we compute the statistic: 0.04


0.02

Nb 2
( Ek − Ok )
0.00
0 5 10 15 20 25
Ξ2 = ∑ , (6.11) x

k =1
Ek
Figure 6.6: An example of a theoretical
It turns out that PDF (solid line) fit to an empirical PMF
Ξ2 ∼ χ2ν−1 , (6.12) (columns)

where ν = Nb − n p with n p the number of parameters estimated. One


can therefore use the χ2 inverse CDF to test whether the fit is acceptable
for a given number of bins10 . As the number of observations increase, 10
scipy.stats.chisquare.ppf()

so does the number of allowable bins, and Ξ2 ∼ N (0, 1).

Kolmogorov-Smyrnov Test
The KS test11 relies on the existence of a universal distribution for the 11
scipy.stats.kstest() for a fit to a

following statistic: theoretical distribution, scipy.stats.


ks_2samp() to determine if 2 samples
come from the same distribution

Fn ( x ) = empirical CDF
D = max | Fn ( x ) − F ( x )| (6.13)
x F ( x ) = reference CDF
The critical value for rejection at level α is defined as

Cα = √ √ (6.14)
n + 0.12 + 0.11/ n
where n is the sample size and k α is a function of α only.
One advantage is that it is universal: for any reference distribution
F ( x ), this can tell us where the sample that generated Fn ( x ) was drawn
from the same population. One problem is that the test may not be strin-
gent enough if some data has been used to estimate parameters (an ex-
ample of double-dipping). Also, for some specific distributions, one
may obtain more powerful tests (i.e. tests with greater statistical power
1 − β) by exploiting the functional form of the distribution.

The Anderson-Darling test12 is an avatar of the KS test against a few 12


scipy.stats.anderson()

common continuous distributions (e.g. normal, exponential, gumbel,


logistic). It is more powerful (statistically speaking) because less gen-
eral. There are other specialized tests to determine if a sample is consis-
tent with having come from a normal distribution (e.g. the Jarque-Bera
test). Use them with caution, as they are often a referendum on sam-
ple size: with low sample size, every normal-ish dataset will pass the
test with flying colors; with high enough sample size even data that are
actually normal have a decent chance of being flagged as non-normal.


VII Non-parametric tests


Parametric tests are incredibly useful, but rely heavily on the IID Gaus-
sian assumption, so they may break down outside of this rather restric-
tive context. In this day and age those assumptions seem rather quaint,
but remember that they were born out of necessity (not ignorance) at
a time when analytical approximations were the only salvation. In the
computer age, we can devise non-parametric tests that do not rely on such
assumptions. The key to such tests is to generate a large ensemble of
surrogate data, from which we obtain the null (sampling) distribution.
Once this distribution is obtained, we can use it to do all manner of un-
certainty quantification, like forming confidence intervals and testing
hypotheses. Here we discuss four main ideas, which all fall under the
more general umbrella of Monte Carlo methods13 : 13
One abuse of language is that – in the
Earth Sciences at least – Monte Carlo is
1. permutation often used interchangeably with resam-
pling plans, whereas Monte Carlo meth-
ods are a broader class of computational
2. reordering algorithms that rely on repeated ran-
dom sampling to solve complex integrals
3. resampling not amenable to closed form solutions.
Monte Carlo Markov Chains (MCMC) are
a prime example of this in the Bayesian
4. direct simulation world (Chap 5, Sect. V)

Note that first three assume exchangeability, which is closely related to


the IID assumption, but they need not assume Gaussianity.

Permutation Tests
Imagine that we have a sample of size n = n1 + n2 n1 n2

We can generate N surrogate replicates by sampling without replace-


ment
Figure 6.7: Samples from two popula-
   
tions of different sizes
n1 + n2 n1 + n2
 =  = ( n1 + n2 ) ! (6.15)
n1 n2 n1 ! n2 !

For example, imagine that we have two climatic scenarios, 1 × CO2 ,


2 × CO2 , and the corresponding 2 × 16 simulations. We would like to
know if California rainfall is significantly different between the two en-
sembles. One way to judge this significance is to choose a test statis-
tic (e.g. the difference in maximum winter rainfall between one group
of simulations and the next, a non-normal statistic for which paramet-
ric assumptions would fail), create a large ensemble of permutations,
and use it to obtain the sampling distribution. If the observed value of
the statistic seems unlikely under these permutations, this would be a
strong indication that the two populations are distinct.


Reordering Tests
Skip (not common in Earth science)

Resampling plans
There are two main ones:
Bootstrap : thus called because it is a way to magically generate a large
ensemble from a limited sample, hence allowing you to pull yourself
up by your bootstraps. The idea is to generate surrogate data by sam-
pling with replacement from the original data (Fig. 6.8); there are n^n
possible combinations.
Figure 6.8: Schematic representation of bootstrap sampling for a sample Z and a test statistic S(Z). The ensemble thus created allows to estimate a sampling distribution, or various numerical summaries, like a sample mean, variance, etc. It can be shown that this distribution asymptotically converges to the true one.

The idea is to generate B bootstrap samples, compute the test statistic
for each of them, then sort and find the B × α/2 smallest and
B × (1 − α/2) largest values, from which we get the α-level confidence
interval. However, the draws are obviously not independent, as each
datum has probability ≈ 1 − e⁻¹ of appearing in a given sample. In
some cases, consecutive observations might not be exchangeable, but
large enough blocks of them might reasonably be assumed to be. This
is a job for the block bootstrap, an example of which is shown in Fig. 6.9.

Figure 6.9: Uncertainty quantification for a parameter λ following a non-normal distribution. The sampling distribution (staircase) was obtained by 2,000 bootstrap realizations using 20-year long chunks of a 10,000 year long timeseries at biannual resolution. The shaded area depicts a 95% confidence interval for λ̂ under this distribution; the mean is indicated by a vertical line.

What value of B should we pick? As a rule of thumb, distributions
start looking reasonable around B = 200, but if possible, why not go
for 1,000 or 10,000? More would probably be a waste of CPU time.
out each datum in turn, that is, defining each cated by a vertical line.
However, it is easy to see "boot does not provide a good estimate in
that Err
sample j = {1, 2, . . . , 2, . . . , n} (6.16)
general. The reason is that the bootstrap datasets
j −1 j j + are1 acting as the training
samples, while the original training set is acting as the test sample, and
these two samples have observations in common. This overlap can make 92
overfit predictions look unrealistically good, and is the reason that cross-
validation explicitly uses non-overlapping data for the training and test
samples. Consider for example a 1-nearest neighbor classifier applied to a
two-class classification problem with the same number of observations in
Data Analysis in the Earth & Environmental Sciences Chapter 6. Confirmatory Data Analysis

and repeating this n times (one for each observation). For a sample
of size n, we have only n possibilities, each sample having n − 2 com-
mon points with another. The jackknife is in fact a special case of the
bootstrap; it is ∼ 20 years older, and much cheaper. In this day and
age it is not particularly recommendable, but it does prove useful to
gauge the sensitivity of a result to one particular observation. It typi-
cally yields much less accurate confidence intervals or p-values than
the bootstrap, however.

There are canned codes to perform these resampling plans automat-


ically, but they are a black box, so only use them if you’re confident that
you know what each is doing. Even better, write your own14 , and use 14
using np.random.choice(), or some-
the tools you already know to compute the test statistic of your choice, thing similar, to generate the samples

its empirical CDF, quantiles, and relevant numerical summaries.
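For instance, a percentile bootstrap might look like the following minimal sketch, using np.random.choice as suggested in the margin note (function and variable names are ours; only NumPy is assumed):

```python
import numpy as np

def bootstrap_ci(x, stat=np.mean, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(B)
    for b in range(B):
        resample = rng.choice(x, size=n, replace=True)   # sample WITH replacement
        boot[b] = stat(resample)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi, boot

# hypothetical skewed sample, for which a normal-theory CI would be dubious
x = np.random.default_rng(42).lognormal(mean=0.0, sigma=1.0, size=50)
lo, hi, boot = bootstrap_ci(x, stat=np.median)
print(f"95% bootstrap CI for the median: [{lo:.2f}, {hi:.2f}]")
```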

Direct simulation
For our purposes, if we’ve determined a particular statistical model
(e.g. AR(1), see Chapter 8) to be an appropriate null hypothesis for the
process that generated the data, one can simulate a large number of sam-
ples (say Nmc > 200) from that model, compute the ECDF of a statistic
of interest and estimate the probability of observing the observed data
under this null distribution. We will use this in the isopersistent and
isospectral tests (Lab 7).

Example: significance of correlations


A common problem is to estimate the significance of correlation coeffi-
cients. A famous example is the high correlation (r = 0.52) between the
number of GOP senators and ... sunspots! (Fig. 6.10). Is the correlation
significant? Does it imply some causation? If so, in which direction? I’ll
let you think about that.
Figure 6.10: Republican Senators and Sunspots, 1960-1986. Figure attributed to Richard Lindzen, MIT. For more details, see Fun with correlations. For a hilarious account of extremely high correlations that cannot possibly harbor any causal relationships, see Tyler Vigen's spurious correlations.

Standard theory says that for 2 normal IID series, the test statistic $t = r\,\sqrt{\frac{n-2}{1-r^2}} \sim t_\nu$ with ν = n − 2 degrees of freedom. So it is common to test against the null hypothesis r = 0 using a Student's distribution as the null distribution. This is all fine and good except when the IID assumptions are not met, which happens all the time in Earth sciences. In particular, many geophysical datasets exhibit high serial correlation, meaning that consecutive observations are not independent (e.g. the temperature on Tuesday depends on the temperature


on Monday). This means that the effective degrees of freedom are usu-
ally much lower than ν = n − 2, so even a relatively large value of r
may not be significant. Therefore, assuming ν = n − 2 would result in
overly confident assessments of the significance of a correlation. Unfor-
tunately, this test comes “out of the box” with most numerical packages,


so many claims of significance made in the literature by unsuspecting


users are questionable.
We will see this in excruciating detail in Lab 7, in which we will use
non-parametric tests to solve this problem (and obtain more stringent
bounds on significance in serially-correlated data). GEOL425L alumnus (and Fall 2017 TA) Jun Hu wrote a nice paper on this topic (Hu et al., 2017).
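To give the flavor of such a test, here is a bare-bones sketch of an isopersistent (AR(1)-surrogate) significance test for a correlation. It is only an illustration of the idea, not the implementation used in Lab 7 or in Hu et al. (2017):

```python
import numpy as np

def ar1_surrogate(x, rng):
    """AR(1) surrogate with (approximately) the same lag-1 autocorrelation and variance as x."""
    x = x - x.mean()
    phi = np.corrcoef(x[:-1], x[1:])[0, 1]         # lag-1 autocorrelation estimate
    sig_eps = np.sqrt(np.var(x) * (1 - phi**2))    # innovation standard deviation
    n = len(x)
    s = np.zeros(n)
    s[0] = rng.normal(0, np.sqrt(np.var(x)))       # start near stationarity (avoids a burn-in)
    eps = rng.normal(0, sig_eps, size=n)
    for t in range(1, n):
        s[t] = phi * s[t - 1] + eps[t]
    return s

def isopersistent_pval(x, y, n_sim=1000, seed=0):
    """p-value of corr(x, y) against a null of two independent AR(1) processes."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    r_null = np.array([np.corrcoef(ar1_surrogate(x, rng), ar1_surrogate(y, rng))[0, 1]
                       for _ in range(n_sim)])
    return r_obs, np.mean(np.abs(r_null) >= np.abs(r_obs))
```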

VIII Bayes: return of the reverend


Let us revisit the example of Sect. IV about a resort’s advertising claims
for cloud-free days15 . We’ve just seen the orthodox (frequentist) prac- 15
This excellent example is shamelessly
tice of defining a null hypothesis about some parameters, using it to reproduced from Wilks (2011, pp 198-199)

obtain a null distribution, and then computing the likelihood of observ-


ing values of the test statistic as extreme as those we actually obtained.
Phew, is it just me or is it an awfully convoluted and perversely counter-
intuitive way of going about things? We have to teach this because such
tests show up all over the literature, but we wish there was something
better. Thankfully there is.
What would a Bayesian do? A Bayesian would reason in terms of the
posterior distribution [Eq. (5.26)]. In this case, our likelihood is similar
to Eq. (6.9). For the parameter θ = f , we write
 
$$P(X = k \mid f) = \binom{N}{k} f^k (1-f)^{N-k} \qquad (6.17)$$

(the probability of observing k cloud-free days out of N, given some probability of occurrence f for such days). We want P( f | X ) (the distribution of f
given the observations), so the only missing piece is the prior distribu-
tion p( f ).
We have potentially an infinity of choices. Let’s start by being mag-
nanimous and assuming that there is no reason to prefer any outcome,
so we may use a flat prior (p( f ) = 1 ∀ f ∈ [0, 1]). It turns out that the
uniform distribution so described is a special case of the beta distribution,
introduced in Sect. V and depicted in Fig. 5.4. The flat prior corresponds
to the choice α = 1, β = 1 (Fig. 5.4).
Of course, the beta distribution is a conjugate prior to the binomial
likelihood. Thus the posterior writes:

$$P(f \mid X) = \frac{\Gamma(\alpha + \beta + N)}{\Gamma(\alpha + k)\,\Gamma(N - k + \beta)}\, f^{k+\alpha-1}\, (1-f)^{N-k+\beta-1} \qquad (6.18)$$

so it is also a beta distribution, but with parameters α0 = α + k and


β0 = N − k + β. In our case, with k = 15 cloud-free days observed over
the span of N = 25 days, we get α0 = 16 and β0 = 11.
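Since the posterior is a standard beta distribution, these numbers are easy to reproduce with scipy.stats (a quick sketch; the printed values should be close to those quoted in the text and figures):

```python
import numpy as np
from scipy.stats import beta

k, N = 15, 25                      # 15 cloud-free days out of 25
a0, b0 = 1, 1                      # flat prior, beta(1, 1)
post = beta(a0 + k, b0 + N - k)    # conjugacy: posterior is beta(16, 11)

print("posterior mean:", post.mean())
print("95% credible interval:", post.ppf([0.025, 0.975]))
print("P(f >= 0.857):", post.sf(0.857))   # survival function = 1 - CDF
```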


This distribution is illustrated in Fig. 6.11, where it can be seen that the data allow to lift the prior from a state of complete uncertainty (f could be anywhere in the [0, 1] interval) to one centered around f = 15/25 = 0.6 (dashed line), but fairly broad; the associated 95% credible interval16 is [0.406, 0.766].

Figure 6.11: Bayesian inference on the probability of cloudless skies with a uniform prior on f (uniform prior β(1,1); posterior β(16,11)). The shaded area depicts a 95% credible interval.

16 The Bayesian counterpart to confidence intervals, credible intervals quantify the probability of the true parameter being in some range, conditional on the modeling assumptions.

Now, someone who is a bit more skeptical about the veracity of advertising claims may assume that there is only a 5% chance that the true probability is above the claimed value of f_c = 0.857. In addition, they may assume that values of f above and below 0.5 are equally plausible (symmetry). These two constraints together suggest α = 4, β = 4 for a more informative prior. The result of using this prior is illustrated in Fig. 6.12, where it can be seen that the mode is pulled slightly to the left compared to the flat prior case (0.581 instead of 0.6), which simply reflects the fact that this prior is more concentrated around 0.5. The associated 95% credible interval, [0.406, 0.736], is slightly narrower (notice how only the high values got trimmed), also because extreme values near 1 are weighted down by this prior.

Figure 6.12: Same as Fig. 6.11, but with the informative β(4,4) prior; the posterior is β(19,14). The shaded area depicts a 95% credible interval.

Does it make a difference which prior we choose? In both cases, we see that the claimed probability f_c = 0.857 is far outside the 95% credible interval. We can easily compute P(f ≥ f_c) (by numerical integration if necessary, or using the incomplete beta function) and one would see that in either case it is smaller than 5 × 10^{−4}, so we could reject the claim at that level of confidence. Hence the answer is: no, in this case the prior did not matter to the actual question, because the evidence from the data is so strong that it overwhelms the prior. The choice would be more impactful in less constrained situations.

Finally, let's leverage one advantage of posterior distributions, mentioned in Chap 5, Sect. V: quantifying uncertainties in predicted outcomes. The posterior predictive distribution does just that, and in this case it takes the form of a beta-binomial (aka Pólya) distribution. Assume we want to predict (probabilistically) what will happen over the next N⁺ = 5 days, using the posterior distribution in Fig. 6.11. There are N⁺ + 1 = 6 possible outcomes (k⁺ ∈ {0, · · · , 5}), the probability of which is shown in Fig. 6.13 (red curve). It is instructive to compare this distribution to one where we assume that f = 0.6 is known exactly. In both cases the most likely outcome is k⁺ = 3 (since 3/5 = 0.6), but the beta-binomial (red) allocates less probability to the central value, and more to the extremes, in light of not knowing f exactly.


Figure 6.13: Using Bayesian inference to predict future observations. The red curve is the posterior predictive distribution using the posterior in Fig. 6.11. The black curve is the posterior predictive distribution if we knew for sure that f = 0.6. One can see that both distributions are very similar, but the red curve is slightly more spread out, in recognition of the fact that f is not known exactly (though its most likely value is also f = 0.6).

IX Further considerations

Now, having said all that, the reader is urged to read a few things:

• Science's Significant Stats Problem, by Tom Siegfried, does a great job of explaining why the malpractice or over-insistence on frequentist hypothesis tests to determine "statistical significance" may lead to wildly erroneous scientific claims. We will never say it enough: statistical significance is not practical significance, and it depends crucially on defining a sensible null hypothesis.

• Scientific method: Statistical errors by Regina Nuzzo very much reaffirms these points, but provides more historical background on how P-values came to pervade the experimental sciences... mostly for the wrong reasons. It is now at the point that many statisticians argue that P-values simply do not provide a useful measure of evidence against a null hypothesis17.

17 http://journals.sagepub.com/doi/abs/10.1177/0959354307086923

• Gelman and Stern (2006) have a neat article showing that "The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant". We highly recommend it.

• What would a Bayesian do? To compare the merits of two competing


hypotheses, Bayes Factors are very useful (O’Hagan, 2006). A related
idea is the likelihood ratio, which as the name implies only involves
the likelihoods, not the priors.

• A classical test is primarily a measure of how much data one has (that is, how finely one can discriminate between a result originating from chance alone or from a “real” effect)18. The choice of significance level is itself very subjective (if something is significant at the 5.1% level but not the 5.0% level, do you throw the baby out with the bathwater?). What matters in the end for your research is that you estimate the parameters that interest you and report their uncertainties, estimated as transparently as possible so people can decide whether they should believe you or not.

18 see e.g. Normality tests

• No matter what you hear, there is no such thing as objectivity in hu-


man endeavors, so settle for a defensible kind of subjectivity!

Part II

Living in the temporal world


Chapter 7

FOURIER ANALYSIS

Fourier analysis aims to represent a time-ordered signal1 into periodic 1


That is, a timeseries
(sinusoidal) components – in effect, to decompose sounds into their con-
stituent notes. There are several good reasons for doing so:

1. Periodicity is a source of predictability. For instance, the annual cy-


cle is a type of climate change that far exceeds anthropogenic global
warming in many locations, yet human and natural systems have
learned to adapt to it because it is so regular.

2. Harmonic signals (of the form eiωt ) are smooth, orthonormal, objec-
tive, invariant by integration or differentiation; they are a very con-
venient basis over which to represent almost any function.

3. Thanks to Fast Fourier Transform, this decomposition is cheap.

Let us start with clarifying what we mean by timeseries.

I Timeseries
Up to this point we have considered data as an amorphous scramble
whose order was unimportant. Obviously, in many instances the or-
der of the data will matter just as much as their values themselves, and
timeseries analysis is all about extracting patterns out of this order.

Notion of timeseries

A timeseries {t, h(t)} is a collection of ordered observations. h is the dependent variable (the quantity of interest), t the independent variable. While timeseries analysis was ostensibly developed with time as the independent variable, it could just as well be a spatial variable (longitude, latitude, depth). We distinguish:
Continuous timeseries: h(t) continuous, analog signal (e.g. old seismometer on paper)

Figure 7.1: An example timeseries h(t), sampled at regular increments.

Discrete timeseries in which case {t, h(t)} = {(t1 , h1 ), · · · , (tn , hn )}. Here
hn = h(n∆) is the nth sample. All digitized signals are discrete, and
in our digital world this will be of most interest.

Timeseries analysis
Methods for time series analysis may be divided into two classes: frequency-
domain methods and time-domain methods. The former include spec-
tral analysis and recently wavelet analysis; the latter include auto-correlation
and cross-correlation analysis. The bridge between the time and fre-
quency domains is the Fourier transform, which can be either discrete or
continuous. Methods that use the Fourier transform usually fall under
the purview of Fourier analysis, which seeks to represent all signals as a
superposition of pure oscillations.
If the record is of finite length T, one may easily periodize it, so that
h(t) becomes periodic of period T (cf Fig. 7.2).
How can we represent (decompose) the signal h in terms of simpler components? This amounts to a projection of h onto those elemental components.

Figure 7.2: Periodization of a finite-length signal

Geometric analogy
In a plane P (orthonormal basis) any vector $\vec{u}$ can be represented as $\vec{u} = u_i\,\vec{i} + u_j\,\vec{j}$ or, equivalently, as
$$\vec{u} = \begin{pmatrix} u_i \\ u_j \end{pmatrix}, \qquad \vec{i} \perp \vec{j}, \quad \|\vec{i}\| = \|\vec{j}\| = 1$$
where $u_i$ and $u_j$ are called components of the vector $\vec{u}$.

Question How to find $\vec{u}$'s components (i.e. coordinates)?

The answer comes from the inner product for vectors:
$$\langle \vec{u}, \vec{v} \rangle : P \times P \to \mathbb{R}$$
Recall that:
$$\langle \vec{u}, \vec{v} \rangle = u_i v_i + u_j v_j = \sum_{k=1}^{n} u_k v_k \qquad (7.1)$$

Figure 7.3: Projection

As seen in Appendix B, continuous functions on [0, T ] form a vector


space, and we defined an inner product for functions:
$$\langle h, g \rangle = \int_0^T h(t)\, g(t)\, dt \qquad (7.2)$$
Now,
$$\langle \vec{u}, \vec{i}\, \rangle = u_i \underbrace{\langle \vec{i}, \vec{i}\, \rangle}_{1} + u_j \underbrace{\langle \vec{j}, \vec{i}\, \rangle}_{0}$$


so

ui = h~u,~i i.

In a space provided with an orthonormal basis, a vector’s components


are simply its projections onto the basis vectors (which are given by the
inner product). One may use this property to decompose a signal onto
basis functions, as we will do in what follows.

Back to timeseries
In general, one can show that:
$$L^2_{[0,T]} = \left\{\, \text{timeseries } h(t),\ \text{period } T,\ \underbrace{\int_0^T |h(t)|^2\, dt}_{L^2\ \text{norm}} < \infty \,\right\} \qquad (7.3)$$

is a vector space.

Note Verify this at home.

Question Do we know of any basis?

Say T = 2π without loss of generality. In Appendix B we see that
$$\left\{ \tfrac{1}{\sqrt{2\pi}} \right\} \;\cup\; \left\{ \tfrac{1}{\sqrt{\pi}} \cos(n\theta) \right\}_{n=1}^{\infty} \;\cup\; \left\{ \tfrac{1}{\sqrt{\pi}} \sin(n\theta) \right\}_{n=1}^{\infty} \qquad (7.4)$$

form a basis of this space. Moreover, sines and cosines are orthonormal:

$$\frac{1}{\pi}\int_0^{2\pi} \cos(n\theta)\,\sin(m\theta)\, d\theta = 0 \quad \forall\, n, m > 0 \qquad (7.5a)$$
$$\frac{1}{\pi}\int_0^{2\pi} \cos(n\theta)\,\cos(m\theta)\, d\theta = \delta_{n,m} = \begin{cases} 1 & \text{if } n = m \\ 0 & \text{otherwise} \end{cases} \quad \text{(same with sines)} \qquad (7.5b)$$

Thus, sines and cosines form an orthonormal basis of L2[0,2π ] . We


may thus use them to decompose any timeseries into pure oscillations.
This is equivalent to decomposing a sound into its component notes
(harmonics), hence the name harmonic analysis often used interchange-
ably with Fourier analysis.


II Fourier Series
Joseph Fourier is perhaps best known for his work on trigonometric rep-
resentations of periodic signals, but this was a really a side project of
his. He wanted to understand how heat flows through continuous me-
dia, and found that heat flow was proportional to the gradient of tem-
perature (i.e. temperature is the potential of heat). He was the first to
formulate heat flow as a diffusion problem
$$\frac{\partial T}{\partial t} = \kappa\, \nabla^2 T \qquad (7.6)$$
In trying to solve this equation, he was looking for solutions that were
proportional to sines and cosines, because he could then easily turn this
partial differential equation into an algebraic one. This in turn led to
the question: is this legit? Can any periodic function be represented
this way? It took a long time for mathematicians to demonstrate this Figure 7.4: Joseph Fourier, whose forays
into the diffusion of heat led him to invent
rigorously, but physicists went ahead with it anyway. As usual, their how to represent any periodic function as
intuition was right. The key was to use an infinity of waves. a trigonometric series. In the process, he
gave birth to mathematical physics.

Fourier’s theorem
Fourier's theorem states that any integrable2, T-periodic function can be represented as an infinite superposition of sines and cosines called a Fourier series:
$$h(t) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\left[ a_k \cos(k\omega t) + b_k \sin(k\omega t)\right], \qquad \omega = \frac{2\pi}{T} \qquad (7.7)$$

2 $\int_0^T |h(t)|\, dt < \infty$

Since cos(kωt) and sin(kωt) are orthogonal, coefficients $a_k$ and $b_k$ can be obtained by projecting on the basis functions as before:
$$a_k = \frac{2}{T}\int_0^T h(t)\cos(k\omega t)\, dt \qquad k \in \mathbb{N} \qquad (7.8)$$
$$b_k = \frac{2}{T}\int_0^T h(t)\sin(k\omega t)\, dt \qquad k \in \mathbb{N}^* \qquad (7.9)$$
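As a quick numerical illustration of Eqs. (7.7)–(7.9), the sketch below builds partial Fourier sums of a square wave by approximating the coefficient integrals with Riemann sums (all choices here are arbitrary illustrations); the slow convergence near the discontinuities is the Gibbs phenomenon discussed later in this chapter:

```python
import numpy as np

T = 2 * np.pi
w = 2 * np.pi / T                     # fundamental angular frequency
N = 4000
t = np.arange(N) * (T / N)            # one period, uniform sampling
dt = T / N
h = np.sign(np.sin(t))                # square wave of period T

def fourier_partial_sum(n_max):
    """Partial Fourier sum S_n(t), coefficients from Eqs. (7.8)-(7.9) via Riemann sums."""
    a0 = (2 / T) * np.sum(h) * dt
    S = np.full_like(t, a0 / 2)
    for k in range(1, n_max + 1):
        ak = (2 / T) * np.sum(h * np.cos(k * w * t)) * dt
        bk = (2 / T) * np.sum(h * np.sin(k * w * t)) * dt
        S += ak * np.cos(k * w * t) + bk * np.sin(k * w * t)
    return S

for n in (1, 5, 25, 125):
    err = np.sqrt(np.mean((fourier_partial_sum(n) - h) ** 2))
    print(f"RMS misfit of S_{n}: {err:.3f}")   # shrinks with n, but slowly near the jumps
```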
Recall that sines and cosines may be viewed as the real and imaginary
parts of a complex exponential (Appendix C). That is: eiθ = cos(θ ) +
i sin(θ ), so one may also use a complex representation of h(t):
$$h(t) = \sum_{n=-\infty}^{+\infty} H_n\, e^{i n \omega t} \qquad (7.10)$$

Again, these circular functions are orthogonal3:
$$\frac{1}{T}\int_0^T e^{i n \omega t}\, e^{-i m \omega t}\, dt = \delta_{n,m}$$
so the nth complex Fourier coefficient $H_n$ writes:
$$H_n = \frac{1}{T}\int_0^T h(t)\, e^{-i n \omega t}\, dt \qquad (7.11)$$

3 The complex dot-product of two functions h(t) and g(t) is $\int h(t)\, g^*(t)\, dt$, where the asterisk indicates the complex conjugate of g.


It is easy to show that:
$$H_n = \begin{cases} \dfrac{a_n - i\, b_n}{2} & n \geq 0 \\[1ex] \dfrac{a_n + i\, b_n}{2} & n < 0 \end{cases} \qquad (7.12)$$

Note Do this at home, also expressing an and bn as a function of Hn .

Properties of the Fourier series


• Hermitian property: if h(t) is real, then H−n = Hn∗ ;

• even functions4 are only comprised of cosine terms; 4


h(−t) = h(t)

• odd functions5 are only comprised of sine terms; 5


h(−t) = −h(t)

• Fourier series converge in a least square sense: Defining
$$S_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n}\left[ a_k \cos(\omega k t) + b_k \sin(\omega k t)\right] \qquad \text{(Fourier expansion of order } n\text{)}$$

Then kSn (t) − h(t)k2 → 0 as n → ∞. So while any periodic func-


tion can be represented in this way, it may need many terms if the
original signal looks nothing likes sines or cosines. The universality
of Fourier decomposition may thus come at the expense of efficiency
– this is a recurrent tradeoff in science.
square wave triangle wave
Note

- h(t) must be continuous but not necessarily smooth (Fig. 7.5);

- Oscillations near break points may reach high amplitudes (Gibbs phenomenon, Figure 7.5: Example of non-smooth series
lab. 6);

- These series have great theoretical use to find solutions of (linear) ordinary
and partial differential equations via the principle of superposition.


III Fourier transform


The Fourier transform generalizes Fourier series to any continuous function. We go from quantized frequencies to a continuum of frequencies.

Figure 7.6: Fourier transform: F maps the time domain (h(t) ∈ ℝ) to the frequency domain (h̃(ω) ∈ ℂ); F⁻¹ maps back.

Definition

A Fourier transform can be viewed as the continuous limit of a Fourier series. (h, h̃) are called a Fourier transform pair and they are given by:
$$h(t) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \tilde{h}(\omega)\, e^{+i\omega t}\, d\omega, \qquad \omega = 2\pi f \qquad (7.13)$$
$$\tilde{h}(\omega) = \int_{-\infty}^{+\infty} h(t)\, e^{-i\omega t}\, dt \qquad (7.14)$$

Notation: $h \leftrightharpoons \tilde{h}$ means “h̃ is the Fourier transform of h”. Conversely, by $\tilde{h} \leftrightharpoons h$ we mean “h is the inverse Fourier transform of h̃”. Those two statements are equivalent6.

6 There are many different definitions of the Fourier transform in the literature: pick one and be consistent. In particular, it is often more convenient to make calculations with an angular frequency (ω) formulation, though the frequency f is more intuitive (since it is the inverse of a period). In this chapter we will mostly deal with ω, but in applications we will mostly think in terms of f. The two are related by ω = 2π f.

Examples
Dirac δ-function

- Definition:  δ(t)
 0 ∀t 6= 0
δ(t) = R (7.15)
 +∞ δ(t)dt = 1
−∞
Figure 7.7: The Dirac δ-function de-
scribes perfect localization and repre-
- Sampling property: for any smooth function h sents an idealized point mass.
Z
δ(t)h(t)dt = h(0)
ZR
δ(t − t0 )h(t)dt = h(t0 )
R

- Fourier transform:
Z +∞
1
∆(ω ) = δ(t)e−iωt dt = 1
−∞

In general, good localization in time is associated with poor localiza- Figure 7.8: The Dirac δ-function has no
tion in frequency. On the opposite end, harmonic functions (sines and localization in the frequency domain.
cosines) have perfect localization in frequency space, but no localization
in time.


Negative Exponential

$$h(t) = H(t)\, e^{-a t}$$

- H(t) = Heaviside jump (Fig. 7.9):
$$H(t) = \begin{cases} 1 & t \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
$$\tilde{h}(\omega) = \int_{-\infty}^{+\infty} H(t)\, e^{-a t}\, e^{-i\omega t}\, dt = \frac{1}{a + i\omega} \qquad (7.16)$$
Whence
$$|\tilde{h}(\omega)|^2 = \frac{1}{a^2 + \omega^2} \qquad (7.17)$$

Figure 7.9: Heaviside function, akin to flipping a switch on at t = 0.
Boxcar Function (aka Gate Function)

$$b(t) = \frac{1}{2\tau}\left[ H(t + \tau) - H(t - \tau)\right], \qquad \int_{\mathbb{R}} b(t)\, dt = 1$$
$$\tilde{b}(\omega) = \frac{\sin(\omega\tau)}{\omega\tau} = \mathrm{sinc}(\omega\tau)$$

Definition 5 The cardinal sine function $\mathrm{sinc}(x) \equiv \frac{\sin(x)}{x}$ is characterized by a central (main) lobe at the origin and many oscillatory side lobes that decay hyperbolically away from it (Fig. 7.10). The function integrates to π. It is single-handedly responsible for leakage – one of the worst problems to befall the time series analyst (Section V).

Figure 7.10: The boxcar function and its Fourier transform

Monochromatic waves

Pure vibrations, aka harmonic functions, or monochromatic waves, are fancy names for sines and cosines.
$$c(t) = \cos(\omega_0 t) \;\rightarrow\; \tilde{c}(\omega) = \pi\left[\delta(\omega + \omega_0) + \delta(\omega - \omega_0)\right] \qquad \text{(even, symmetric)}$$
$$s(t) = \sin(\omega_0 t) \;\rightarrow\; \tilde{s}(\omega) = \frac{\pi}{i}\left[\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\right] \qquad \text{(odd, antisymmetric)}$$
Perfect localization in frequency. No localization in time.

Figure 7.11: Fourier transform of monochromatic waves


Gaussian function

$$g(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{t^2}{2\sigma^2}}, \qquad \int_{\mathbb{R}} g(t)\, dt = 1$$

- F(Gaussian) = another Gaussian!
$$\tilde{g}(\omega) = e^{-\frac{\omega^2\sigma^2}{2}}, \qquad \int_{\mathbb{R}} \tilde{g}(\omega)\, d\omega \propto \sigma^{-1}$$
with σ′ = 1/σ.

- scale in time = (scale)⁻¹ in frequency: a fatty curve in one domain ↔ a slim curve in the other, and vice versa.

Figure 7.12: The Gaussian transform pair (e.g. σ = 1/5 vs. σ = 3/2 in time; σ′ = 5 vs. σ′ = 2/3 in frequency). Note again that localization in the time domain is inversely related to localization in the frequency domain.

Properties of the continuous Fourier transform

• Linearity: F(ah + bg) = aF(h) + bF(g). A linear system is one where Fourier's principle of superposition works. Linearity ⇔ Fourier transform.

• Unicity: Every h(t) admits a unique h̃(ω ). Conversely, every h̃(ω )


admits a unique h(t). A function can thus be identified by its Fourier
transform.

• Hermitian property: h(t) ∈ R ⇔ h̃(−ω ) = h̃(ω )∗

• Parity:

h(t) even ⇔ h̃(ω ) even


h(t) odd ⇔ h̃(ω ) odd

• Scaling: for (a, b) real constants
$$h(at) \;\leftrightharpoons\; \frac{1}{|a|}\, \tilde{h}\!\left(\frac{\omega}{a}\right) \qquad \text{→ time scaling}$$
$$\frac{1}{|b|}\, h\!\left(\frac{t}{b}\right) \;\leftrightharpoons\; \tilde{h}(b\omega) \qquad \text{→ frequency scaling}$$

• Shifting:

$h(t - t_0) \;\leftrightharpoons\; e^{-i\omega t_0}\, \tilde{h}(\omega)$ → time shifting


$h(t)\, e^{+i\omega_0 t} \;\leftrightharpoons\; \tilde{h}(\omega - \omega_0)$ → frequency shifting

• Differentiation:
$$\frac{\partial h}{\partial t} \;\leftrightharpoons\; i\omega\, \tilde{h}(\omega)$$

• Integration:

$$\int h\, dt \;\leftrightharpoons\; \frac{-i}{\omega}\, \tilde{h}(\omega)$$

This means that differential equations can be solved as polynomials


in frequency space (very useful since there is such a large apparatus
to solve polynomial equations).

• zero frequency
$$\int_{-\infty}^{+\infty} h(t)\, dt = \tilde{h}(0)$$
corresponds to the average value of h7 . 7
the “DC” in AC/DC

Important theorems
Parseval’s theorem

The energy of a signal h(t):
$$E \equiv \int_{-\infty}^{+\infty} |h(t)|^2\, dt = \frac{1}{2\pi}\int_{-\infty}^{+\infty} |\tilde{h}(\omega)|^2\, d\omega \qquad (7.18)$$
may be measured in the time or frequency domain. This leads to the definition of a fundamental quantity:

Definition 6 The Power Spectrum or Spectral Density is defined as:
$$\frac{1}{2T}\int_{-T}^{+T} |h(t)|^2\, dt \qquad \text{PSD/unit time} \qquad (7.19a)$$
$$S(\omega) \equiv |\tilde{h}(\omega)|^2 \qquad (7.19b)$$

Figure 7.13: Spectral analysis and harmonic analysis: in a power spectrum, lines ↔ periodic components, rising above a continuous background.

It is an energy density per frequency band, and quantifies the partition of variance (energy) among frequency components (periodicities). It tells us fundamental things about the behavior of a physical system from which the timeseries originated. For instance, how much temperature variance is accounted

motion? What processes generated the measurement? What is the energy of the
low-frequency seismic waves (T > 4 min)? Are there scale invariances in the
system? Is the system particularly sensitive to forcing at a given frequency?).
Obtaining reliable estimates of the spectral density is a holy grail of spectral
analysis (Chapter 9).

Convolution theorem

Definition 7 The convolution of two functions h, g is:


$$h * g = g * h = \int_{\mathbb{R}} h(\tau)\, g(t - \tau)\, d\tau \qquad (7.20)$$
This product verifies many of the properties of the usual product: com-
mutativity, associativity. Its neutral element is the Dirac δ-function. The
convolution represents the action of one linear system over another (cf
Filters, Chapter 10)


Example
A recorded seismological signal r (t) = s(t) ∗ b(t) ∗ g(t) = (s ∗ b ∗ g)(t)
where s is the seismometer, b is the building motion, and g is ground motion.
Convolution formalizes the idea of the composition of various (linear) systems
influencing each other.

An essential property of linear systems is that

$$h * g \;\leftrightharpoons\; \tilde{h}(\omega) \times \tilde{g}(\omega) \qquad (7.21)$$

• convolution in time ↔ multiplication in the frequency domain.

• multiplication in time ↔ convolution in the frequency domain.

This theorem is absolutely fundamental and is a key reason why the


Fourier transform is so used. In practice, convolution is almost always
done in the frequency domain.
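A quick numerical check of this theorem in its discrete (circular) form, anticipating the Discrete Fourier Transform of Sect. IV (a sketch; the brute-force double loop is deliberately naïve):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
h = rng.normal(size=N)
g = rng.normal(size=N)

# circular convolution computed directly from the definition ...
direct = np.zeros(N)
for k in range(N):
    for m in range(N):
        direct[k] += h[m] * g[(k - m) % N]

# ... and via the convolution theorem: multiply DFTs, transform back
via_fft = np.fft.ifft(np.fft.fft(h) * np.fft.fft(g)).real

print(np.allclose(direct, via_fft))   # True: convolution in time = product in frequency
```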

Correlation theorem

$$C(h, g) = \int_{\mathbb{R}} h(t + \tau)\, g(t)\, dt \qquad \tau = \text{“lag”} \qquad (7.22)$$
$$C(g, h) = \int_{\mathbb{R}} h(t)\, g(t + \tau)\, dt \qquad (7.23)$$
$$C(h, g) \;\leftrightharpoons\; \tilde{h}(\omega)\, \tilde{g}(-\omega) = \tilde{h}(\omega)\, \tilde{g}(\omega)^* \quad \text{for } h, g \text{ real} \qquad (7.24)$$

Definition 8 The autocorrelation of a function h is the function a:
$$a(\tau) = C(h, h) = \int_{\mathbb{R}} h(t)\, h(t + \tau)\, dt \qquad (7.25)$$
It is a measure of its memory (persistence), indicating how much a function resembles itself at lag τ.

Figure 7.14: One way to define decorrelation time is the time it takes for the autocorrelation function to first cross zero. Alternatively, one may use the e-folding time associated with the envelope of this curve. In both cases it is a measure of the memory, or persistence, of a timeseries.

Note - a is even (a(τ) = a(−τ)) when h is real.
- the autocorrelation is the cross-correlation of a function with itself.

Wiener-Khinchin theorem

The Fourier transform of the autocorrelation is the squared ampli-


tude (power spectral density) of F (h).

$$a(\tau) \;\leftrightharpoons\; \tilde{h}(\omega)\, \tilde{h}^*(\omega) = |\tilde{h}(\omega)|^2 = S(\omega) \qquad (7.26)$$

This is the foundation for classical (Blackman-Tukey) spectral analysis,


but we will shortly see that much better methods exist.


IV Discrete Fourier transform


Definitions
In practice, all measured signals are discrete if processed on a computer:

h(t) → hk = h(k∆) k ∈ 0, 1, . . . N − 1, N even

where ∆ is the sampling interval. Ideally, N spans the entire domain of h.

Definition 9 The Discrete Fourier Transform of h is:

$$H_n = \sum_{k=0}^{N-1} h_k\, W_N^{kn} \qquad (7.27)$$
where n ∈ {0, . . . , N − 1} and $W_N$ is an Nth root of unity (Appendix C, section II). We write $h_k \leftrightharpoons H_n$.

Note the relationship to the CFT:
$$\tilde{h}(f_n) = \int_{-\infty}^{+\infty} h(t)\, e^{-2\pi i f_n t}\, dt \;\simeq\; \sum_{k=0}^{N-1} h_k\, e^{-2\pi i f_n t_k}\, \Delta \;\simeq\; \Delta \sum_{k=0}^{N-1} h_k\, e^{-\frac{i 2\pi k n}{N}} = \Delta\, H_n \qquad (7.28)$$
There is however a fundamental difference: the DFT is limited to a
finite range of frequencies; in particular:
Definition 10 (Nyquist frequency) In the frequency domain $f_n = \frac{n}{N\Delta}$, $n = -\frac{N}{2}, \ldots, \frac{N}{2}$, the frequency
$$f_{\pm\frac{N}{2}} = \pm\frac{1}{2\Delta} \qquad (7.29)$$
is called the Nyquist frequency and is the highest frequency resolvable by the dataset. If the dataset has energy above this frequency, it will be aliased to lower frequencies (Sect. V).

Figure 7.15: Sampling at the Nyquist frequency, where a digitized sinusoid would take values of (−1, 0, +1). It should be clear that sampling at half that rate would result in mistaking the sinusoid for a constant (red dots).

Periodicity

H−n = HN −n n ∈ [1, N − 1] → N periodic (7.30)

Ordering of frequencies

0 < f < f_Ny  ⇔  1 ≤ n ≤ N/2 − 1
−f_Ny < f < 0  ⇔  N/2 + 1 ≤ n ≤ N − 1
f = ±f_Ny  ⇔  n = N/2
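In NumPy, this ordering is exactly what np.fft.fftfreq returns; a tiny check (the sampling choices are arbitrary, and the values in the comment are what one should obtain):

```python
import numpy as np

N, dt = 8, 0.5                    # 8 samples spaced 0.5 time units apart
f = np.fft.fftfreq(N, d=dt)       # frequency associated with each DFT coefficient
print(f)
# [ 0.    0.25  0.5   0.75 -1.   -0.75 -0.5  -0.25]
# index 0 is the zero frequency (the mean), indices 1..N/2-1 the positive frequencies,
# index N/2 the Nyquist frequency 1/(2*dt) = 1 (reported with a negative sign by convention),
# and indices N/2+1..N-1 the negative frequencies.
```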


Invertibility
Can we retrieve hk from Hn , and vice versa? Yes, via the inverse DFT:
Definition 11 Inverse Discrete Fourier Transform:
$$h_k = \frac{1}{N}\sum_{n=0}^{N-1} H_n\, W_N^{-kn} \qquad (7.31)$$
We write $H_n \leftrightharpoons h_k$.
$$\text{iDFT}\big(\text{DFT}(h)\big)_k = \frac{1}{N}\sum_{n=0}^{N-1}\left(\sum_{l=0}^{N-1} h_l\, e^{-\frac{2\pi i l n}{N}}\right) e^{+\frac{2\pi i k n}{N}} = \sum_{l=0}^{N-1}\frac{h_l}{N}\sum_{n=0}^{N-1} e^{-\frac{i 2\pi n (l-k)}{N}} \qquad (7.32)$$
In the last term, we recognize the sum of the Nth roots of unity:
$$\sum_{n=0}^{N-1} W_N^{n(l-k)} = \frac{1 - \left(W_N^{l-k}\right)^N}{1 - W_N^{l-k}} = \begin{cases} 0 & l \neq k \\ N & l = k \end{cases}$$

(See Appendix C).


Put differently,
$$\frac{1}{N}\sum_{n=0}^{N-1} W_N^{n(k-l)} = \delta_{kl} \qquad \delta = \text{Kronecker delta} \qquad (7.33)$$
So
$$h_k = \frac{1}{N}\sum_{n=0}^{N-1} H_n\, e^{+\frac{2\pi i k n}{N}} = \sum_{l=0}^{N-1} h_l\, \delta_{kl} = h_k \qquad (7.34)$$
The two operations really are reciprocal (we start with hk and we get
it back without any loss). The converse is also true, applying the inverse
DFT to Hn , then the DFT, gives you Hn back (try it at home). This is fun-
damental, since it means that we can go from time domain to frequency
domain without any loss of information. So even though the range of fre-
quencies is limited, this transformation defines a one-to-one mapping
between (discrete) time and frequency domains.

Properties

• Parseval’s theorem (Discrete form)


$$\sum_{k=0}^{N-1} |h_k|^2 = \frac{1}{N}\sum_{n=0}^{N-1} |H_n|^2 \qquad (7.35)$$

• Discrete Convolution For two sequences $(r_k, s_k)$ such that $r_k \leftrightharpoons R_n$ and $s_k \leftrightharpoons S_n$, their discrete convolution transforms to the product of their DFTs:
$$(r * s)_j \equiv \sum_{k=-\frac{N}{2}+1}^{\frac{N}{2}} s_{j-k}\, r_k \;\leftrightharpoons\; S_n R_n \qquad (7.36)$$


Note Care is required with the edges → see lab 6.

To which we could add the discrete form of the cross-correlation and


autocorrelation functions.

Fast Fourier Transform


Cooley and Tukey developed in 1968 an algorithm to compute the Dis-
crete Fourier Transform very efficiently.

Matrix formulation

The key is to recognize the DFT as a matrix multiplication, and the


inverse DFT as a matrix inversion problem (Appendix B).

Definition 12 Fourier matrix


$$H_n = F h_k \quad \text{with} \quad F_{ij} = W_N^{(i-1)(j-1)}, \qquad h_k = \frac{1}{N}\, F^* H_n$$
Because of the wonderful properties of the roots of unity, F is an or-
thonormal (circular) matrix up to a factor of N, so that its inverse is simply its conjugate: F⁻¹ = F*/N.
This makes the inverse DFT trivially simple to compute. Further, when
N is a power of two, huge computational gains can be achieved by ex-
ploiting the properties of symmetry between the roots and the trans-
formed values. Padding to the next power of two (e.g. 1000 → 1024) is
therefore extremely common.

Numerical cost

The FFT makes many costly operations cheap in the Fourier domain :

Operation    | Formulation                              | Direct cost | FFT cost
DFT          | $\sum_{k=0}^{N-1} h_k\, W_N^{kn}$        | $N^2$       | $2N\log_2 N$
Convolution  | $\sum_{k=0}^{N-1} h_k\, g_{n-k}$         | $N^2$       | $3N\log_2 N$
Correlation  | $\sum_{k=0}^{N-1} h_k\, g_{n+k}$         | $N^2$       | $3N\log_2 N$

This doesn’t look like much, but as N becomes large, the FFT cost be-
comes negligible in front of the direct cost. Hence, in practice convolu-
tions are always done in the Fourier domain: (s ∗ r )k N (Sn × Rn ).


V The Curse of Discretization


Shannon's sampling theorem

If a signal h(t) is band-limited8, and is sampled at intervals ∆ fine enough that $\Delta \leq \frac{1}{2 f_c}$, then h(t) can be completely recovered by the convolution:
$$h(t) = \sum_{k=0}^{N-1} h_k\, \mathrm{sinc}\!\left(\frac{t - k\Delta}{\Delta}\right) \qquad (7.37)$$
where sinc is the cardinal sine function and ∆ is chosen so that $\Delta = \frac{1}{2 f_c} = \frac{T_c}{2}$, i.e. 2 samples per period.

8 $\tilde{h}(f) = 0$ outside of some interval $[-f_c, f_c]$, where $f_c$ is called the cutoff frequency
You heard it right: not just approximately recovered, but completely
recovered. If a signal is band-limited, it only contains a countable amount
of information, and therefore can be entirely summarized with a (dis-
crete) sequence of numbers. The applications of this principle are all
around us, in audio engineering, video& sound processing, telecom-
munications, cryptography, etc.
Alternatively, this theorem states that the highest recoverable fre-
quency is the Nyquist frequency $f_N = \frac{1}{2\Delta}$. If h(t) is not band-limited,
power outside [− f N , f N ] will be aliased (falsely translated) into this in-
terval (see below).
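Here is a minimal sketch of Eq. (7.37) in action; the test signal and sampling choices are arbitrary illustrations, and np.sinc is NumPy's normalized sinc, sin(πx)/(πx):

```python
import numpy as np

dt = 0.1                                   # sampling interval; Nyquist frequency = 5
fc = 3.3                                   # band-limited test signal (tones below 5)
t_samp = np.arange(0, 10, dt)
h_samp = np.sin(2 * np.pi * fc * t_samp) + 0.5 * np.cos(2 * np.pi * 1.2 * t_samp)

def sinc_reconstruct(t, t_samp, h_samp, dt):
    """Whittaker-Shannon reconstruction of h(t) from its samples, Eq. (7.37)."""
    return np.sum(h_samp * np.sinc((t[:, None] - t_samp[None, :]) / dt), axis=1)

t_fine = np.linspace(1, 9, 500)            # stay away from the record edges
h_true = np.sin(2 * np.pi * fc * t_fine) + 0.5 * np.cos(2 * np.pi * 1.2 * t_fine)
h_rec = sinc_reconstruct(t_fine, t_samp, h_samp, dt)
print(np.max(np.abs(h_rec - h_true)))      # small, and shrinks as the record lengthens
```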

Fourier sampling theory


The Discrete Fourier Transform allows to move freely between time and
frequency domains, but there are three important differences with the
Continuous Fourier Transform:
Aliasing: ∆ > 0 ⇒ limited spectral resolution ($f_N = \frac{1}{2\Delta}$)
→ only frequencies up to f N are faithfully captured.
→ | f | > f N ⇒ frequencies higher than f N parade as low fre-
quencies (bad, Fig. 7.18)

Cyclicity: the cyclicity of the roots of unity leads to $H_{n+N} = H_n$ – the spectrum repeats itself indefinitely; more precisely, it gets folded over the Nyquist frequency.

Figure 7.16: When sampling at coarse intervals, higher-frequency harmonics are erroneously projected on low-frequency harmonics. This is the essence of aliasing. Credit: xyo balancer blog
Leakage: Finite time window ⇒ Leakage of energy from lines to broader
bands (bad)

Both leakage and aliasing are linked to the {boxcar ↔ sinc} transform
pair. Let us analyze these plagues in detail.


Aliasing

The discretization is akin to folding the true spectrum at integer multiples of $f_N$.

Figure 7.17: An example of a true spectrum, aliased by sampling at too low a frequency.

Figure 7.18: Consequences of the spec-


tral folding inherent to the DFT. The
blue curve is the spectrum of a band-
limited signal in [− B, B], the green curves
its replica about the sampling frequency
( f s = 1/∆). If f s is lower than B, the fold-
ing results in high frequency signals be-
ing mislabeled as lower-frequency com-
ponents, thus distorting the spectrum
(lower panel). Credit: Wikipedia
Unresolved frequencies (| f | > f N ) come back to haunt the time spec-
trum like a, well, specter (grey area in Fig. 7.18). These high-frequency
components thus parade as low-frequency components. This leads to
a distortion of the spectrum, and may lead to misinterpretation of the
variability. For a stark example of this, see Wunsch (2000).
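The following few lines illustrate aliasing directly: a sinusoid at 0.9 cycles per time unit, sampled once per time unit (Nyquist frequency 0.5), shows up at the alias frequency 0.1 (a sketch using the raw periodogram; all numbers are arbitrary choices):

```python
import numpy as np

f_true = 0.9                 # cycles per time unit
dt = 1.0                     # sampling interval -> Nyquist frequency = 0.5
t = np.arange(0, 200, dt)
x = np.cos(2 * np.pi * f_true * t)          # under-sampled sinusoid

f = np.fft.rfftfreq(len(x), d=dt)
power = np.abs(np.fft.rfft(x)) ** 2
print("peak found at f =", f[np.argmax(power)])   # 0.1, not 0.9: the alias |f_true - 1/dt|
```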

Solution (instrument) = band-limit the signal by analog filtering prior


to digitization (e.g. seismometer). Unfortunately, this is not always pos-
sible, e.g. if your “instrument” is a paleo record.

Leakage
Leakage arises because sampling over a finite time window amounts to
multiplication by a gate function (e.g. Fig. 7.19, between the red lines):

 1 ∀|t| ≤ T
2
BT ( t ) =
 0 elsewhere


Figure 7.19: An example seismogram h(t), sampled over a finite time (between "start" and "end"): finite sampling is akin to multiplication by a boxcar. Credit: http://thegeosphere.pbworks.com/w/page/24663884/Sumatra

Owing to the convolution theorem (Eq. (7.21)), multiplication by a


gate function in the time domain is tantamount to convolution by its
Fourier transform in the frequency domain. This function is B̃T ( f ) =
Tsinc(πT f ) (Fig. 7.20, bottom right), far from being the perfect delta
function we wish for (Fig. 7.20, top right). Thus energy leaks from the
narrow central peak to a blurry, broad peak. Adding insult to injury, the
side lobes spread this energy even further to the sides in an oscillatory 9
the cardinal sine function converges to
a delta function in that limit, meaning
fashion, where it may pollute other harmonics. As T → ∞, this effect
that sampling in the frequency domain
goes to zero9 , so one should always strive for long records. becomes perfectly localized in the fre-
quency domain

Figure 7.20: Spectral leakage for a perfect


sinusoid. (top) ideal sampling situation,
when T is large; (bottom), T spans only
two periods of the oscillation, resulting
in convolution by a rather broad cardinal
sine in the frequency domain. The result-
ing spectrum (lower right) is very much
distorted.


In summary, both aliasing and leakage are consequences of the finite-


ness of N.

finite length T = N∆ < ∞ ⇒ leakage

finite resolution ∆ > 0 (+ spectrum cyclicity) ⇒ aliasing


Because they are bound by the relation N = T/∆, one cannot both increase T and decrease ∆ for a constant N: this tradeoff is the curse of
discretization.

Spectral range
In practice, a spectrum can only be estimated on the interval [ f R , f N ]

$$f_R = \text{Rayleigh frequency} = \frac{1}{N\Delta}$$
(aka gravest tone, lowest note, fundamental harmonic). Since $f_n = \frac{n}{N\Delta} = n f_R$, $f_R$ is also the spacing between frequency points, i.e. the spectral resolution: ∆f = f_R. All other frequencies are integer multiples of f_R.
$$f_N = \text{Nyquist frequency} = \frac{1}{2\Delta} = \frac{N}{2}\, f_R$$
(highest note that can be recorded without aliasing the signal). If sam-
pling is chosen so that the signal has no energy beyond the Nyquist
frequency, then all is good. Otherwise, aliasing is always present, and
impossible to remove. This is one reason some scientific results may
change when more high-resolution data have been collected.
Bottom line: if N is fixed, ∆ can be increased at the expense of T, but
there is no free lunch. Hunting for periodic signals in discrete observa-
tions thus requires to deal with these fundamental constraints. This is
the object of spectral analysis (Chapter 9)

Chapter 8

TIMESERIES MODELING

Classic statistical tests assume IID data. When such conditions are met,
life is beautiful and those tests are useful. In the geosciences, this is
rarely the case. We thus wish to provide geophysically-relevant null hy-
potheses so that we can correctly judge the significance of spectral peaks,
or of correlations between timeseries. Additionally, we will need such
models to estimate features of a timeseries believed to follow such mod-
els (cf the Maximum Entropy spectral method, Chap. 9).

I The AR(1) model and persistence


Perhaps the overarching quality of geophysical timeseries is their persistence: very often, neighboring values tend to be highly correlated. This may be because low-frequency dynamics are at play (cf. Fig. 8.1), or because the way we are measuring the signal introduces this persistence: for instance, it is well known that watersheds tend to integrate climate fluctuations, so that watershed measurements (e.g. streamflow) exhibit more persistent behavior than the input climate (Hurst, 1951). Paleoclimate examples abound. Persistence has many consequences, which we shall explore now.

Figure 8.1: The Southern Oscillation Index (SOI) timeseries (black), whose low-frequency behavior is highlighted by singular spectrum analysis (red)

Stationarity

In all the following, we consider stationary stochastic processes. A stochas-
tic process is a sequence of random variables indexed by time; that is,
we consider each temporal observation as the realization of a random
variable. A strictly stationary process is one whose joint probability dis-
tribution (P( Xt0 , Xt1 , · · · , Xtn )) does not change when shifted in time
(t ← t + L). Accordingly, all its moments (like the mean or variance)
are constant in time. Here, we only need a weaker form, called wide-
sense stationarity, which states that the autocorrelation depends only on
lag, not on time. That is:

∀ L ∈ R, γ(τ ) = Cov( Xt , Xt+τ ) = Cov( Xt+ L , Xt+ L+τ ) (8.1)



The AR(1) model


The simplest and most popular way to represent persistence is via the
autoregressive model of order one, aka AR(1):

$$X_t = \phi X_{t-1} + \varepsilon_t \qquad \phi = \rho_1 = \text{“Lag-1 autocorrelation”} \qquad (8.2)$$
where $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$. It is common to consider a normally distributed series X ∼ N(0, σ²). The coefficient φ measures the memory of the system, i.e. the degree of persistence from one value to the next:

• φ = 0 → independent, Gaussian i.i.d. process.

• The higher the φ, the smoother the timeseries (Fig. 8.6).

• for φ > 0, $\sigma_\varepsilon^2 = (1 - \phi^2)\,\sigma^2 < \sigma^2$ → the variance of the noise is reduced by knowledge of past observations.

• φ may be estimated by regressing $X_t$ onto $X_{t-1}$ (Fig. 8.2), hence the name autoregression. → $X_t$ tends to resemble $X_{t-1}$, with a scatter governed by $\sigma_\varepsilon^2$.

• the autocorrelation function (ACF) of an AR(1) model is simply ρ(k) = φ^k for each lag k. We distinguish two cases, φ > 0 and φ < 0.

Figure 8.2: Regression of timeseries values over their predecessors. This is one way to estimate φ.

Figure 8.3: Autocorrelation function of an AR(1) series, for positive and negative values of φ. k is the lag, ρ(k) the autocorrelation at this lag.

Theoretical Spectrum

The theoretical spectrum of an AR(1) process is as follows:

$$S(f) = \frac{\sigma^2}{1 - 2\phi\cos\!\left(\pi \frac{f}{f_N}\right) + \phi^2} \qquad (8.3)$$

Figure 8.4: Spectrum of an AR(1) model with σ = 1, f_N = 10 and φ = 0.9. Notice the slope (−2) at the inflection point. Generally, any process obeying S(f) ∝ f⁻² is called "red noise" (Sect. III).

which is represented in Fig. 8.4. However, as Ghil et al. (2002) point
out, “a single realization of a noise process can [...] have a spectrum
that differs greatly from the theoretical one, even when the number of
data points is large. It is only the (suitably weighted) average of such
sample spectra over many realizations that will tend to the theoretical
spectrum of the ideal noise process. Indeed, the Fourier transform of a


single realization of a red noise process can yield arbitrarily high peaks
at arbitrarily low frequencies; such peaks could be attributed, quite er-
roneously, to periodic components.” This is illustrated in Fig. 8.6, where
it can be seen that many spectra exhibit peaks above the spectrum’s the-
oretical value. We will see in lab 8 how to determine the significance of
such peaks against AR(1) nulls.
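The sketch below simulates one AR(1) realization and compares its raw periodogram to the theoretical spectrum of Eq. (8.3); the wild scatter of individual periodogram values around the smooth curve is exactly the point made by Ghil et al. (2002). This is only an illustration with arbitrary parameter choices; Lab 8 uses more refined estimators:

```python
import numpy as np

phi, sig_eps, N, dt = 0.9, 1.0, 2048, 1.0
f_N = 1 / (2 * dt)                                  # Nyquist frequency
rng = np.random.default_rng(0)

# one realization of an AR(1) process, Eq. (8.2)
x = np.zeros(N)
eps = rng.normal(0.0, sig_eps, size=N)
for t in range(1, N):
    x[t] = phi * x[t - 1] + eps[t]

# raw periodogram of this realization (crude estimator; see Chapter 9)
f = np.fft.rfftfreq(N, d=dt)[1:]
pgram = (np.abs(np.fft.rfft(x)) ** 2)[1:len(f) + 1] * dt / N

# theoretical AR(1) spectrum, Eq. (8.3): note pi*f/f_N = 2*pi*f*dt
S = sig_eps**2 * dt / (1 - 2 * phi * np.cos(np.pi * f / f_N) + phi**2)

# the ratio is ~1 on average, but individual frequencies scatter wildly (chi-squared, 2 d.o.f.)
print("mean periodogram / theory:", np.mean(pgram / S))
```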

The effects of persistence


Since Xt depends on φXt−1 , then Xt depends on φ2 Xt−2 , . . ., and so
on until φk Xt−k . The influence of old values decays exponentially, but
there’s an infinite memory of the system (the first value is never com-
pletely forgotten). This means the Xt ’s are no longer independent – we
say that they are serially dependent. Hence, all tests requiring i.i.d. data
(e.g. Correlation between two timeseries) will fail on such data. For a
t-test, the effective number of degrees of freedom may be estimated as:

 
$$N_{\text{eff}} \simeq N\left(\frac{1 - \phi}{1 + \phi}\right) \qquad \text{effective sample size} \qquad (8.4)$$
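As a concrete illustration of Eq. (8.4) and of the adjusted t-test discussed in the next paragraph, here is a minimal sketch. The two-series form of N_eff used here is a common variant of the single-series Eq. (8.4), stated as an assumption rather than taken from this chapter:

```python
import numpy as np
from scipy.stats import t as t_dist

def ar1_coef(x):
    """Lag-1 autocorrelation (the phi of Eq. 8.2), estimated from the sample ACF at lag 1."""
    x = x - np.mean(x)
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

def corr_test_adjusted(x, y):
    """t-test of corr(x, y) using an effective sample size in the spirit of Eq. (8.4)."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    phi_x, phi_y = ar1_coef(x), ar1_coef(y)
    # assumed two-series variant of Eq. (8.4), commonly used for correlation significance
    n_eff = n * (1 - phi_x * phi_y) / (1 + phi_x * phi_y)
    T = r * np.sqrt((n_eff - 2) / (1 - r**2))
    p = 2 * t_dist.sf(abs(T), df=n_eff - 2)
    return r, n_eff, p
```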
For φ = 0.8, which is far from unusual, $N_{\text{eff}}$ is smaller than N by nearly an order of magnitude! One can then plug this d.o.f. into a t-test with $T = r\sqrt{\frac{N_{\text{eff}}-2}{1-r^2}} \sim t_{N_{\text{eff}}-2}$, and one often finds very, very different (much less significant) results than if one went along willy-nilly with the naïve assumption that $N_{\text{eff}} = N$ (Fig. 8.5). Alternatively, one may use non-parametric tests based on simulated timeseries (Lab 8).

Figure 8.5: The p-value and numbers of degrees of freedom (DOF) for a t test of a relatively low correlation (0.13) between two AR(1) time series (500 samples each) with autocorrelation φ. The green dashed line is the 5% threshold for significance. From Hu et al. (2017).

Persistence means that every spectrum will look “red”, that is, more energetic at low than high frequencies (e.g. Fig. 8.4).

II Linear Parametric Models

AR(1) models are part of a general class of timeseries models called lin-
ear parametric models, comprising autoregressive models, moving av-
erage models, and their union, ARMA models.

Autoregressive Models AR(p)


A random process X is said to follow an autoregressive model of order
p if:
$$X_t - \mu = \sum_{i=1}^{p} \phi_i\, (X_{t-i} - \mu) + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_\varepsilon^2), \quad \phi_i \in \mathbb{R} \qquad (8.5)$$

where µ = E( X ). Xt depends only on the last p observations, plus an


innovation term ε t . This means that Xt is conditionally independent of
all past observations prior to Xt− p , given { Xt−1 , Xt−2 , . . . Xt− p }.


Figure 8.6: The spectral effects of persistence, illustrated for 8 values of the lag-1 autocorrelation parameter, here denoted by γ (panels shown here: γ = 0, 0.4, 0.8, 0.9, 0.95, 0.99). Each panel shows a random AR(1) timeseries together with its MTM spectrum and 95% confidence interval, compared against the theoretical AR(1) spectrum.


Example
AR(1) model:

$$X_t - \mu = \phi\,(X_{t-1} - \mu) + \varepsilon_t = \phi\big(\phi\,(X_{t-2} - \mu) + \varepsilon_{t-1}\big) + \varepsilon_t = \cdots$$
$$X_t = \mu + \sum_{k=0}^{\infty} \phi^k\, \varepsilon_{t-k}$$

Although Xt depends on all past values of the noise, it is conditionally


independent of them, given Xt−1 ; values prior to t − 1 do not add any
information. This is also known as a Markov process.

Moving Average Models MA(q)


$$X_t = \mu + \sum_{i=0}^{q} \theta_i\, \varepsilon_{t-i} \qquad \theta_0 = 1, \quad \varepsilon_t \sim N(0, \sigma_\varepsilon^2) \qquad (8.6)$$

Amounts to a discrete convolution with Gaussian noise, with filter co-


efficients θi . This is also known as a finite impulse response filter (FIR).

ARMA models
In general, a broad class of timeseries can be approximated by an ARMA(p, q)
model, which fuses AR and MA models:
$$X_t - \mu = \sum_{i=1}^{p} \phi_i\, (X_{t-i} - \mu) + \sum_{i=0}^{q} \theta_i\, \varepsilon_{t-i} \qquad (8.7)$$


Fitting a timeseries model

Autoregressive model of order p. Assume without loss of generality that µ = 0. Then
$$X_t = \sum_{i=1}^{p} \phi_i\, X_{t-i} + \varepsilon_t$$
Multiply both sides by $X_t, X_{t-1}, \ldots, X_{t-k}$ in turn and take expectations, using wide-sense stationarity (the autocovariance γ is a function of lag only) and the fact that $\varepsilon_t$ is uncorrelated with past values of X:
$$E(X_t X_t) = \underbrace{\gamma(0)}_{\sigma^2} = \sum_{i=1}^{p} \phi_i\, \underbrace{E(X_t X_{t-i})}_{\gamma(i)} + E(\varepsilon_t X_t)$$
$$E(X_{t-1} X_t) = \gamma(1) = \sum_{i=1}^{p} \phi_i\, E(X_{t-1} X_{t-i}) + \underbrace{E(\varepsilon_t X_{t-1})}_{0}$$
$$\vdots$$
$$E(X_{t-k} X_t) = \gamma(k) = \sum_{i=1}^{p} \phi_i\, \gamma(i - k) \qquad \text{→ a discrete convolution}$$
(For a moving average of order q, $X_t = \sum_{k=0}^{q} \beta_k\, \varepsilon_{t-k}$, the same device relates the autocovariances to the $\beta_k$'s.)
Dividing by γ(0) and writing $\rho_i = \gamma(i)/\gamma(0)$, we obtain a linear system of equations, the Yule-Walker equations:
$$\begin{cases} \rho_1 = \phi_1 + \phi_2\,\rho_1 + \phi_3\,\rho_2 + \ldots + \phi_p\,\rho_{p-1} \\ \rho_2 = \phi_1\,\rho_1 + \phi_2 + \phi_3\,\rho_1 + \ldots + \phi_p\,\rho_{p-2} \\ \;\;\vdots \\ \rho_p = \phi_1\,\rho_{p-1} + \phi_2\,\rho_{p-2} + \ldots + \phi_p \end{cases}$$
which expresses a strong relationship between the autocorrelation function and the AR coefficients.

 
The system can be assembled in matrix form in terms of the autocorrelation coefficients; the upshot is the recursion
$$\rho_m = \sum_{k=1}^{p} \phi_k\, \rho_{m-k}$$

This is solved recursively starting from ρ0 ≡ 1. In practice, one es-


timates the Autocorrelation function (ACF) and uses this recursive for-
mula to estimate the φ’s.
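A bare-bones version of this procedure (estimate the ACF, solve the Yule-Walker system) can be written in a few lines of NumPy. This is a sketch for illustration, not a substitute for a proper package:

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) coefficients from the sample autocorrelation (Yule-Walker equations)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # sample autocovariance at lags 0..p, then autocorrelation rho
    gamma = np.array([np.sum(x[:n - k] * x[k:]) / n for k in range(p + 1)])
    rho = gamma / gamma[0]
    # Toeplitz system: R phi = rho[1:], with R[i, j] = rho[|i - j|]
    R = np.array([[rho[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, rho[1:])

# sanity check on a simulated AR(2) series with phi = (0.6, -0.3)
rng = np.random.default_rng(1)
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(yule_walker(x, 2))   # close to [0.6, -0.3]
```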

Conditions necessary to ensure stationarity

Example

AR(1) → |φ| < 1
AR(2) → φ₁ + φ₂ < 1,  φ₂ − φ₁ < 1,  |φ₂| < 1

Question How to choose p?

There is a common tradeoff in statistics: a high p means higher fidelity,


but fewer and fewer observations to estimate each coefficient. In other
words, there’s a tradeoff between model accuracy and complexity. The fol-
lowing numerical criteria can be minimized to identify an optimal order
balancing fidelity and complexity:

• AIC (Akaike Information Criterion, aka Final Prediction Error)

• BIC (Bayesian Information Criterion, aka Minimum Description Length) 1


The timeseries analysis (tsa) module of
the excellent statsmodels Python package
Both penalize the misfit and model complexity (the number of parameters, p) in different ways, and will generally yield different estimates for the optimal p. Such criteria can both either overfit or underfit, depending on the situation. There is no foolproof method1.

1 The timeseries analysis (tsa) module of the excellent statsmodels Python package allows to fit any number of such timeseries models, and the fitted models will include an estimate of AIC and BIC. See this blog post for an example on Sunspot number data

Theoretical spectrum of an AR(p) process


A big advantage of modeling with AR(p) processes is that their theoret-
ical spectrum is known, and depends solely on the φj ’s:

$$S(f) = \frac{\phi_0}{\left|1 + \sum_{j=1}^{p} \phi_j\, e^{2\pi i j f}\right|^2} \qquad (8.8)$$

They can fit a wide array of spectral shapes; this formula is the basis for
Maximum Entropy spectral estimation (Chapter 9)


Figure 8.7: Examples of spectra that can be fit by AR(1) and AR(2) spectra. In general, AR(p) models can fit at most p − 1 peaks.

III Noise color


White noise
By analogy with the color white, which blends all colors of the rainbow
in equal amounts, white noise exhibits power at all frequencies. The
spectrum is flat (S( f ) ∝ f 0 = 1), which corresponds to pure Gaussian
noise (an AR(0) process).

Red noise Figure 8.8: White noise

Red noise refers to processes which exhibit a spectral slope S( f ) ∝ f −2 .


This is certainly the case for an AR(1) process near the inflection point
of its spectrum (Fig. 8.4). More generally, red noise can be generated by
integrating white noise over time. An example of this is the celebrated
model of Hasselmann (1976), which shows how simple ocean mixed-layer physics may “redden” random weather fluctuations to produce power at low frequencies. This physical plausibility is one reason why red noise is such a common null hypothesis. Note that the term red noise may be loosely applied to many spectra that exhibit decreasing power with increasing frequency. It is sometimes called "Brown" noise, not for the color brown but rather as a corruption of Brownian motion, and is often used interchangeably with the term “AR(1) process”.
Figure 8.9: Red noise
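A quick numerical illustration of this reddening by integration (a sketch; the raw-periodogram slope estimate is deliberately crude):

import numpy as np

rng = np.random.default_rng(42)
white = rng.standard_normal(2**14)   # white noise: flat spectrum
red = np.cumsum(white)               # integrated (Brownian / red) noise

def loglog_spectrum(x):
    f = np.fft.rfftfreq(len(x), d=1.0)[1:]
    S = np.abs(np.fft.rfft(x))[1:] ** 2
    return np.log(f), np.log(S)

for name, series in [('white', white), ('red', red)]:
    lf, lS = loglog_spectrum(series)
    slope = np.polyfit(lf, lS, 1)[0]   # rough power-law fit in log-log space
    print(f"{name} noise: spectral slope ~ {slope:.2f}")
# expected: ~0 for white noise, ~-2 for red noise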

Blue noise
Blue noise is used loosely to refer to S(f) ∝ f or S(f) ∝ f², that is,
power increasing with frequency (Fig. 8.10). Blue noise is not common
in the geosciences, and that’s all the space we’ll devote to it.

Figure 8.10: Blue noise


124
Chapter 9

SPECTRAL ANALYSIS

The object of spectral analysis is to estimate the partitioning of energy between frequency bands. Owing to Parseval’s theorem (Eq. (7.18)), this means evaluating the spectral density |h̃(ω)|² of a signal h(t). Doing so from a sequence of finite and possibly noisy measurements, we must estimate this density and characterize its associated uncertainties. The notions we saw in Part I will therefore come to good use.

Before we venture too far into technical details, however, let us pause and reflect on the scientific motivations underlying such a complicated analysis.

I Why spectra?

What do we hope to learn from the spectrum? Why is it worth calculating in the first place? To see this, let us consider three Earth science examples.

Earth’s normal modes

First, a simple example from low-frequency seismology¹: if you examine the Fourier amplitude spectrum of a long vertical-component seismogram from a big earthquake (M > 7) at frequencies of 1-5 mHz, you will see distinct spectral lines that can be associated with the eigenfrequencies of Earth’s spheroidal normal modes. That is, the Earth is like a bell that rings at certain notes (harmonics) when hit by the hammer of large earthquakes. These harmonics reveal fundamental aspects of its inner structure.

Each of these discrete modes can be matched with spherical harmonics (cf Appendix C) of fixed angular order, as illustrated in Fig. 9.1. The amplitude of the spectral peaks constrains the earthquake source excitation. These data are especially important for very large earthquakes, which have linear dimensions comparable to the mode wavelength. Also, the peaks move around a bit (in frequency) from earthquake to earthquake and station to station. This variation arises because the mode degeneracy is split by 3D Earth structure, and it can be precisely interpreted and inverted. These “apparent eigenfrequency” data are used in most global 3D earth models.

¹ courtesy of Prof. T.H. Jordan

Figure 9.1: Expression of Earth’s normal modes in seismograms (“Spheroidal mode observations”). Fig. 6 of Silver and Jordan (1981).

The continuum of climate variability

In most real-world spectra the peaks are superimposed on a “background”. Peaks correspond to components that are periodic or nearly so. They often reveal eigenmodes (as in the previous case) or the linear response of a system to periodic forcing (e.g. seasons). The background connecting the peaks is informative of the processes that transfer energy between scales. As pointed out by Lovejoy (2015), it is often at least as interesting as the peaks.

Fig. 9.2 (reprinted from Huybers and Curry, 2006) shows a composite spectrum attempting to portray surface temperature variability from timescales of 1 month to 100,000 years. One can see that most of the energy is associated with the annual cycle and its higher-order harmonics (vertical lines on the right). The second-most important source of variability are broadband peaks near the so-called Milankovitch periodicities near 23, 41, and 100 kyr, clearly visible in the insolation spectrum on the bottom. While those frequencies show up strongly in the temperature response as well (top), the main point of the article was to draw attention to the slope of the background joining those peaks, and how this slope appears to change at the centennial scale (black cross). The authors hypothesized that this change in spectral slope (aka scaling exponent) is indicative of different processes accomplishing energy transfers between space and time scales.

Figure 9.2: Patchwork spectral estimate using instrumental and proxy records of surface temperature variability, and insolation at 65N. Power-law estimates between 1.1–100 yr and 100–15,000 yr periods are listed along with standard errors, and indicated by the dashed lines. The sum of the power-laws fitted to the long- and short-period continuum is indicated by the black curve. The vertical line-segment indicates the approximate 95% confidence interval, where the circle indicates the background level. The mark at 1/100 yr indicates the region mid-way between the annual and Milankovitch periods. At bottom is the spectrum of insolation at 65N sampled monthly over the past million years plus a small amount of white noise. The vertical black line indicates the 41-kyr obliquity period.
Testing theories of geophysical turbulence

Theories of fluid flow in the atmosphere and oceans make predictions about rates of energy transfer between spatial and temporal scales that are most easily summarized in the spectral domain. In particular, such theories predict the existence of scaling regimes: a power-law behavior (i.e. linear behavior in a log-log plot) of the spectrum of various state variables (temperature, velocity, passive tracers), which is indicative of scale invariance. As an example, consider radiance measurements from satellites as presented by Lovejoy et al. (2008).

Fig. 9.3 shows the “along track” 1D spectra from the Visible Infrared Sounder (VIRS) instrument of the Tropical Rainfall Measurement Mission (TRMM) at wavelengths of 0.630, 1.60, 3.75, 10.8, 12.0 µm, i.e. for visible, near infrared and (the last two) thermal infrared (IR) radiation. The first two bands are essentially reflected sunlight, so that for thin clouds the signal comes from variations in the surface albedo (influenced by the topography and other factors), while for thicker clouds it comes from nearer the cloud top via (multiple) geometric and Mie scattering. The units are such that k = 1 is the wavenumber corresponding to the size of the planet (20 000 km). The high-wavenumber fall-off is due to the finite resolution of the instruments.

A scaling behavior is evident from the largest scales (20 000 km) to the smallest available – there are no peaks in this spectrum, and all the interesting information is contained in the slope of the “background”. This slope turns out to be incompatible with classical turbulence cascade models which assume well-defined energy sources and sinks, with a source- and sink-free “inertial” range in between. Thus, these estimates of spectral scaling enable one to directly test theories of atmospheric turbulence. Such a test would be impossible (and meaningless) with unprocessed timeseries data; nothing would be learned by staring at wiggles, but staring at their spectra does teach us a lot.

Figure 9.3: Scaling behavior. Spectra from channels 1-5 (at wavelengths of 0.630, 1.60, 3.75, 10.8, 12.0 µm from top to bottom, displaced in the vertical for clarity). The straight regression lines have spectral exponents β = 1.35, 1.29, 1.41, 1.47, 1.49 respectively, close to the value β = 1.53 corresponding to the spectrum of passive scalars (5/3 minus intermittency corrections). Note that the analysis was performed over space, not time, so the relevant spectral index is with wavenumber k, which is to wavelength what angular frequency is to the period. Adapted from Lovejoy et al. (2008).


Summary
From these examples we conclude that spectra are characterized by two
main features – peaks and background – and that the properties of both
(location and height for peaks, scaling law for the background) are deeply
informative of the physics and dynamics underlying the measurements.
Put another way, spectra can reveal things that nothing else can reveal,
and we hope that you are now convinced of their virtues. Let us thus
estimate spectra until we grow numb.


II Signals, trends and noise


At this time it is useful to define a few terms of timeseries lore. We
shall consider each timeseries X(t) as the sum of one or more periodic², stationary components (or “signals”); a trend q(t); and noise ε_t. The periodic components are the reason we’re doing spectral analysis in the first place – it will tell us how big they are and what is their frequency.

² not necessarily harmonic

Trend is a very fashionable³ word these days, and it is used and abused into meaninglessness. Here, it will refer to a slowly-evolving, non-stationary component⁴. For instance: a linear increase or decrease; a sinusoidal component so slow that its period is not resolved by the dataset; a nonlinear trend like the exponential increase of the Keeling curve (Fig. 15.7), or the red line in Fig. 9.4.

³ shall we say trendy?

⁴ Please only use it for this purpose, and banish the heinous “cyclical trend” from your vocabulary.
As for noise, it clearly involves a subjective definition. Under this
name we usually subsume any variable component in which we are not
interested. Some of this noise may be composed of actual measurement
errors (what you would think of as noise), but some of it could be what
another analyst would call “signal”. If you study climate, daily fluctua-
tions are “noise”; if you study weather, they are your bread and butter.
One commonly says that “one analyst’s signal is another analyst’s noise”.
This noise is often modeled as a Normal random process (aka white
noise) with zero mean, ε ∼ N (0, σn2 ).
To summarize, our model is:
\[ X(t) = q(t) + \sum_{k=0}^{N-1} \left[ a_k \cos\!\left(\frac{2\pi k}{N}\, t\right) + b_k \sin\!\left(\frac{2\pi k}{N}\, t\right) \right] + \varepsilon_t \qquad (9.1) \]

Figure 9.4: Global mean temperature for all November months since 1880 in the NASA GISS dataset. The non-linear trend is highlighted in red.

III Data pre-processing


In Chapter 7 we saw that Fourier analysis implicitly assumes a periodic
timeseries, and that all digital estimates of Fourier components come
with the curses of leakage and aliasing. Accordingly, some preparatory
steps must be taken to minimize the most common problems.

Sampling
Analog signals may be pre-filtered to make them band-limited prior to digitization. This operation brings the power to zero beyond a cutoff frequency f_c. Sampling usually refers to digitization at even increments ∆.
If even increments are not possible (e.g. in a sediment core, time may not
be linearly related to depth), there are two possibilities:

1. interpolate on regular time grid, as seen in the next Chapter (this might
introduce spurious features) and use the methods of this chapter.


2. use methods designed for irregularly-spaced data like the Lomb-Scargle periodogram⁵ (Mudelsee et al., 2009) or the weighted wavelet Z-transform (Foster, 1996).

⁵ also known as Least-Squares Spectral Analysis
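A sketch of the first of these in Python (note that scipy's lombscargle expects angular frequencies; the irregular sinusoid below is synthetic):

import numpy as np
from scipy.signal import lombscargle

# an unevenly sampled sinusoid of period 8 with noise
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, size=300))          # irregular sampling times
y = np.sin(2 * np.pi * t / 8.0) + 0.5 * rng.standard_normal(t.size)

periods = np.linspace(2, 50, 500)                    # candidate periods
ang_freq = 2 * np.pi / periods                       # angular frequencies
pgram = lombscargle(t, y - y.mean(), ang_freq)

print('best period ~', periods[np.argmax(pgram)])    # should be close to 8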

Detrending
Example
In the global warming example of Fig. 9.4, it would make sense to remove the
slowly-evolving component (red line) if interested in periodic or quasi-periodic
oscillations

Detrending always needs a very good justification, because one might


be inadvertently throwing out the signal with the “trend”. For instance,
if the instrumental record were 5 times as long, you might find that the
red line is actually a quasi-periodic low-frequency oscillation as well. If
you choose to detrend, you must therefore avoid making any statement
about the lowest frequency variability.

Tapering
Tapering refers to bringing the edges of a timeseries to zero. There are
two reasons to do so:

Edge effects We have seen that the DFT periodizes the signal h(t) (so the sequence H_n is N-periodic). If h(N − 1) ≠ h(0), the jump h(N − 1) − h(0) will generate a discontinuity at the junction (cf. Fig. 9.5), so higher-order harmonics will pollute the spectrum via Gibbs’ phenomenon: this is an example of edge effect. To some extent, detrending will help with that. Multiplying by a taper, however, will ensure that both edges are zero exactly, eliminating this edge effect.

Figure 9.5: Edge effects may appear in the simplest settings.

Figure 9.6: Periodization of a non-


stationary signal. In this case, the
large trend generates discontinuities that
would mar the spectrum, if no care is
taken.

Minimizing leakage recall that spectral leakage is a consequence of ob-


serving a system over a finite sample, which is akin to multiplication
by a boxcar (whose Fourier transform is proportional to sinc(π f t)).
Tapers are often designed to minimize leakage outside the main lobe.
But there is always a trade-off. Tapering will broaden central peaks
and give a smoother appearance to the spectrum; this is good or bad,
depending on what you want.


Figure 9.7: Window carpentry (Welch, Hanning, and Bartlett tapers on the interval [−1/2, 1/2]).

There are several choices available (Fig. 9.7), depending on what you’re
trying to achieve:

• Hanning window: w(n) = sin²(πn/(N − 1))

• Bartlett (triangular) window:
\[ w(t) = \begin{cases} \dfrac{2}{T}\left(t + \dfrac{T}{2}\right) & t \in \left[-\dfrac{T}{2}, 0\right] \\[8pt] -\dfrac{2}{T}\left(t - \dfrac{T}{2}\right) & t \in \left[0, \dfrac{T}{2}\right] \end{cases} \qquad (9.2) \]
(convolution of two boxcars)

• Gaussian window

• Parzen window.

There is a whole menagerie of tapers, illustrated in Fig. 9.7. Design-


ing windows with certain spectral properties was once referred to as
“window carpentry”. Fortunately, nowadays, the Multi-Taper Method
(MTM, Sect. V) solves this conundrum for us in an optimal way.

Zero-padding
Because the FFT algorithm works optimally on datasets with a sample
size equal to a power of two, it is common to pad with zeroes to reach
the next power of two (Matlab: nextpow2): N → N′ = 2^m. This will make the Fast Fourier Transform algorithm faster (though it is not mandatory); it may reduce
frequency spacing, yielding a smoother spectrum; and it theoretically
does not add or destroy information. However, this is only true if the
series is stationary and there is no jump at the end points; so it must
always be used in conjunction with a taper.
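Combining the last two steps in numpy might look like the following sketch (the Hanning taper and toy series are arbitrary choices):

import numpy as np

def tapered_padded_fft(x, dt=1.0):
    """Center, taper with a Hanning window, zero-pad to the next power of two,
    and return one-sided frequencies and Fourier amplitudes."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                              # centering
    w = np.hanning(len(x))                        # taper brings both edges to zero
    n_fft = int(2 ** np.ceil(np.log2(len(x))))    # next power of two (cf. nextpow2)
    X = np.fft.rfft(x * w, n=n_fft)               # rfft zero-pads up to n_fft
    f = np.fft.rfftfreq(n_fft, d=dt)
    return f, X

# usage on a toy series
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50) + np.random.default_rng(3).standard_normal(1000)
f, X = tapered_padded_fft(x)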


IV Classical Spectral Estimation


Correlogram vs. Periodogram
The simplest way to compute a spectrum is the periodogram:

\[ \hat{S}(f_n) = P_n = |H_n|^2 \qquad (9.3) \]

for the nth frequency component, f n . Unfortunately, this turns out to


be a very bad idea. In the old days, the correlogram (Blackman and
Tukey, 1958) was more reliable. Recall the Wiener-Khinchin theorem,
with a(τ) = ∫_ℝ h(t) h(t + τ) dt, then:

ã(ω ) = h̃(ω )h̃(ω )∗ = |h̃(ω )|2 = S(ω ) (9.4)

In practice, a(τ ) is estimated for each discrete lag k :

\[ \hat{a}(k) = \frac{\dfrac{1}{N+|k|-1}\displaystyle\sum_{j=0}^{N-|k|} h_j\, h_{j+k}}{\dfrac{1}{N-1}\displaystyle\sum_{j=0}^{N-1} h_j^2}, \qquad k \in [-(N-1), +(N-1)], \qquad (9.5) \]

Then apply the Discrete Fourier Transform⁷.

⁷ This estimate requires tapering as well, to minimize edge effects, as well as centering, so that the mean is zero.
Statistical considerations
Periodogram and correlogram are estimates of the true spectrum S( f ).
Ideally, this estimator Ŝ( f ) is unbiased: E(Ŝ( f )) = S( f ). Now, the
problem is with its variance; it turns out that

V (Ŝ) = S

which is independent of N. Hence the estimate won’t improve as N →


∞: it is inconsistent. Put another way, increasing N only adds more fre-
quency points f n at which to estimate S, but at each of these points the
number of observations going into the estimate Ŝ( f n ) does not change,
so it does nothing to reduce uncertainties.
Another way of saying this is that the precision decreases with the
height of a peak, so the more energetic components (i.e. the ones that
dominate the spectrum) are the ones we are most uncertain about, which
is vexing to the point of annoyance.

Confidence intervals

Recall that:
\[ H_n = \sum_{k=0}^{N-1} h_k\, e^{-i\frac{2\pi}{N}kn} \quad \text{is a complex number} \qquad (9.6) \]
\[ |H_n|^2 = \Re(H_n)^2 + \Im(H_n)^2 \qquad (9.7) \]


If h_k ∼ N(0, σ²) i.i.d., then

\[ |H_n|^2 \approx X_1^2 + X_2^2, \quad \text{where } X_1, X_2 \sim N(0, \sigma^2), \]

so

\[ \frac{|H_n|^2}{\sigma^2} \sim \chi^2_2 \;\;\text{(chi-squared with 2 d.o.f.)} \;\sim\; \mathrm{Exp}\!\left(\tfrac{1}{2}\right). \]

So

\[ P\!\left( \frac{|H_n|^2}{\sigma^2} \geq g_\alpha \right) = 1 - \left[ 1 - e^{-\frac{g_\alpha}{2}} \right]^N \qquad (9.8) \]

where gα is a number estimated from the quantile function of an expo-


nential distribution with parameter 1/2, for a given confidence level α.
Because the exponential distribution is highly skewed to the right, the
corresponding confidence intervals (C.I.) are skewed towards the high
end.

The Windowed Correlogram


Although more computationally intensive, the correlogram can be made
consistent through a cunning use of 0 < M ≪ N windows. This is the
essence of Blackman-Tukey spectral analysis, which dominated the field
for a long time.

\[ \hat{S}_{BT}(f) = \sum_{k=0}^{N-1} w_k\, a_k\, e^{-2\pi i f \Delta k} \qquad (9.9) \]

where:

• wk = window function (taper);

• ak = auto-correlation at lag k.

With a judicious choice of window:

\[ V\!\left(\hat{S}_{BT}\right) \approx \frac{3M}{2N}\, \hat{S}_{BT} \;\xrightarrow{\;N \to \infty\;}\; 0 \qquad (9.10) \]
so it is now consistent (hurray!). But that, of course, comes at a price: the
bias-variance trade-off rears its ugly head again: increasing M will in-
crease variance but decrease leakage; decreasing M will decrease vari-
ance but increase leakage. So once again there is a compromise to be
made. No free lunch.


Ghil et al. (2002) write: “it turns out that the [...] windowed correl-
ogram method is quite efficient for estimating the continuous part of
the spectrum, but is less useful for the detection of components of the
signal that are purely sinusoidal or nearly so (the lines). The reason for
this is twofold: the low resolution of this method and the fact that its
estimated error bars are still essentially proportional to the estimated
mean Ŝ( f ) at each frequency f ”.

Welch’s averaged periodogram


Idea⁸: chop the series into K segments of length N/K. Then:

⁸ Matlab: pwelch

\[ V\!\left(\hat{S}_{Welch}(f)\right) = \frac{S}{K} \qquad (9.11) \]

For i.i.d. Gaussian h_k,

\[ \xi = \nu\, \frac{\hat{S}(f)}{S(f)} \sim \chi^2_\nu \qquad (9.12) \]

where ν = “equivalent d.o.f.”, which depends on K and the shape of the tapers. Typically, ν = 2K − 2. Then:

\[ P\!\left( \chi^2_{\nu}\!\left(\tfrac{\alpha}{2}\right) \;\leq\; \nu\, \frac{\hat{S}(f_n)}{S(f_n)} \;\leq\; \chi^2_{\nu}\!\left(1 - \tfrac{\alpha}{2}\right) \right) = 100 \times (1-\alpha)\% \qquad (9.13) \]

This yields a parametric 95% confidence interval for the spectral estimate. If the data are not i.i.d. Gaussian, you would be better inspired to use a non-parametric test (see examples below).
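In Python, scipy's welch plays the role of pwelch; a sketch with an approximate χ² confidence interval following Eqs. (9.12)–(9.13) (the white-noise series is a placeholder):

import numpy as np
from scipy.signal import welch
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                  # placeholder series

nperseg = 512
f, S = welch(x, fs=1.0, window='hann', nperseg=nperseg, noverlap=0)

K = len(x) // nperseg                           # number of segments
nu = 2 * K - 2                                  # equivalent degrees of freedom, as in the text
alpha = 0.05
lower = nu * S / chi2.ppf(1 - alpha / 2, nu)    # 95% CI, skewed toward the high end
upper = nu * S / chi2.ppf(alpha / 2, nu)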

V Advanced Spectral Estimation


The previous methods were all fine and good in the 1960’s, but in this day and age we can do much better⁹.

⁹ It is a wonder, then, that so many scientists still use the old ones.

Maximum entropy method


The first one uses the notions of Chapter 8 to approximate a time series
h(t) by an autoregressive model of order p, AR( p). It thus performs
best when estimating line frequencies for a time series that is actually
generated by such a process.
The maximum entropy method (MEM) determines the spectral density Ŝ =
SMEM that is associated with the most random, or least predictable AR pro-
cess that has the same auto-correlation function (ACF) â(k ) (Eq. (9.5)) as the
dataset. In terms of information theory, this corresponds to the concept
of maximum entropy, hence the name of the method.
In practice, one estimates p coefficients (φ₁, φ₂, …, φ_p) from the ACF. Knowledge of these coefficients enables one to compute the spectrum according to the theoretical spectrum of an AR process (Eq. (8.8)). Any good routine¹⁰ will do all this for you given a model order p.

¹⁰ Matlab: pburg

Notes on MEM

¹¹ aka Minimum Description Length, or Final Prediction Error

• Because everything comes down to an AR( p) process, one must be


very careful in choosing p. One objective way to do this is via sev-
eral information criteria (Neumaier and Schneider, 2001; Stoica and Selen,
2004) like Akaike’s Information Criterion (AIC, Akaike, 1974) and the
Bayesian Information Criterion (BIC, Schwarz, 1978)11 . In many cases,
however, these tend to over- or under-estimate p depending on the
timeseries’ properties (Ghil et al., 2002).
Possible solutions are

– try a few different p’s and look for robust features;


– compare with other methods (Welch, BT, MTM).

• The number of spectral peaks grows with p so it’s easy to “find” as


many peaks as you want. Nonetheless, MEM is extremely good in
detecting split peaks, if you know where they should be.

Multi-Taper Method
MEM can detect amazingly fine features (e.g. split peaks), but the choice of p tends to be arbitrary; further, the method only makes sense if an AR model is an adequate approximation of the series. The Blackman-Tukey method is even worse, since the choice of taper is rather ad hoc. In 1982, D. Thomson found an optimal solution, known as the Multi-Taper method (MTM, Thomson, 1982).

MTM uses K orthogonal tapers (w_k(t), k = 1, …, K) belonging to a family of special functions (Discrete Prolate Spheroidal Sequences, or Slepian functions, Fig. 9.9). These w_k(t) solve the variational problem of minimizing leakage outside of a frequency band f₀ ± p f_R, where f_R = 1/(N∆) is the Rayleigh frequency and p is some nonzero integer. Averaging over tapered spectra yields a more stable estimate, as Var(S(f₀)) ∝ 1/K. In practice only the first 2p − 1 tapers have usefully small leakage, so we can take K ≤ 2p − 1. The bandwidth 2p f_R is the width of a harmonic line with this method. There is a trade-off between minimizing the variance (∝ 1/K) and minimizing the leakage (∝ 2p), but this tradeoff is optimal by design (e.g. the choice p = 2 and K = 3 is quite reasonable). MTM estimates both:

• the background (as the Blackman-Tukey method does),

• the lines (as the Maximum Entropy method does).

Figure 9.8: Smoothed peak (obtained with Blackman-Tukey) versus split peaks obtained with MEM.

Figure 9.9: Slepian functions used to produce the tapers w_k(t). As a rule, the kth taper has k bumps.

It is adaptive and explicit, unlike many of its competitors. The spectrum

is obtained by averaging all the tapered spectra:

\[ S_{MTM}(f) = \frac{\sum_{k=1}^{K} \mu_k\, |Y_k(f)|^2}{\sum_{k=1}^{K} \mu_k} \qquad (9.14) \]

where Yk ( f ) is the DFT of h(t) × wk (t) and µk its corresponding weight,


chosen adaptively by the algorithm. In Matlab (pmtm), you must provide nw (the time-bandwidth product), which can take the values [2, 5/2, 3,
7/2, 4]. Small values of the parameter mean high spectral resolution,
low bias, but high variance. Large values of the parameter mean lower
resolution, higher bias, but reduced variance. No free lunch, as usual.
The choice is subjective, but since it formulates an explicit tradeoff, it
may be argued rationally.
In our opinion, MTM is usually the best option, but it must be tried
with several different bandwidths to make sure the results are robust.
If a feature appears with only one method or one choice of parameter,
you should always regard it as suspect.
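For readers working in Python, here is a bare-bones MTM sketch using scipy's Slepian tapers; it uses a simple (non-adaptive) average rather than the adaptive weights µ_k of Eq. (9.14), so it is only an approximation:

import numpy as np
from scipy.signal.windows import dpss

def mtm_spectrum(x, dt=1.0, NW=3, K=5):
    """Crude multitaper estimate: average of K eigenspectra computed with
    discrete prolate spheroidal (Slepian) tapers of time-bandwidth product NW."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    N = len(x)
    tapers = dpss(N, NW, Kmax=K)                  # shape (K, N)
    f = np.fft.rfftfreq(N, d=dt)
    eigenspectra = [np.abs(np.fft.rfft(x * w))**2 for w in tapers]
    S = dt * np.mean(eigenspectra, axis=0) / N    # simple average of tapered spectra
    return f, S

# usage: try a couple of bandwidths (NW) to check that features are robust
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * np.arange(1024) / 32) + rng.standard_normal(1024)
f, S = mtm_spectrum(x, NW=3, K=5)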

Uncertainty Quantification
So, you’ve estimated a spectral density, ideally using several methods,
or with one method but different choice of its parameters (e.g. time-
bandwidth product, AR model order, etc). What now? The most com-
mon use of spectral analysis is to identify periodic components, and hav-
ing done so, decide whether they are significant. Aaaargh, the dreaded
S word again! We are now back in the realm of Chapter 6.

White nulls

As discussed earlier, the frequentist view is that if X (t) is IID normal


(white noise) then the Fourier coefficients are distributed as χ2 variables
with 2 degrees of freedom, so one can test against the null hypothesis
that the spectrum was generated by such a process; most spectral analy-
sis routines will provide confidence intervals for S( f ) based on this null
distribution. For instance, MTM also enables one to test for the significance
of peaks using an F ratio against a null hypothesis of white noise. There
are some extensions (Mann and Lees, 1996) to test against a red noise
null hypothesis (AR(1) process), which is often much more appropriate
in the geosciences.

Autoregressive nulls

As you saw in Chapter 8 and in lab 7, individual AR processes may


exhibit spectra that differ considerably from their theoretical shape, fea-
turing prominent bumps that one might erroneously interpret as ’real’
(Fig. 8.6). It is thus crucial to guard against these pitfalls.


One solution is to simulate a large number of AR nulls (say, with an


autocorrelation parameter comparable to that of X (t)), compute their
spectrum as you did for X (t), and empirically estimate the quantiles
of this distribution. Such an approach is illustrated in Fig. 9.10, which
shows the result of estimating uncertainties in the estimate of δ18 O in
a speleothem from Vanuatu (Partin et al., 2013). The thin gray curves
encompass 95% of the AR(1) spectra, showing that, although several
broadband peaks are present, few lie outside the AR(1) envelope. This
implies that the record is generally consistent with red noise, with a few
notable exceptions in the interannual (ENSO) and multidecadal bands.
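A sketch of such a Monte Carlo test (the raw periodogram is used for brevity; in practice the same estimator, e.g. MTM, should be applied to the data and to the surrogates, and the white-noise data below is a stand-in for your own series):

import numpy as np

def ar1_surrogate(n, phi, sigma, rng):
    """One realization of an AR(1) process with lag-1 coefficient phi."""
    x = np.zeros(n)
    eps = rng.standard_normal(n) * sigma
    for t in range(1, n):
        x[t] = phi * x[t-1] + eps[t]
    return x

def periodogram(x):
    x = x - x.mean()
    return np.abs(np.fft.rfft(x))**2 / len(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                 # replace with your own timeseries
phi = np.corrcoef(x[:-1], x[1:])[0, 1]       # lag-1 autocorrelation of the data
sigma = np.std(x) * np.sqrt(1 - phi**2)      # match the data variance

S_data = periodogram(x)
S_null = np.array([periodogram(ar1_surrogate(len(x), phi, sigma, rng))
                   for _ in range(1000)])
q95 = np.quantile(S_null, 0.95, axis=0)      # pointwise 95% quantile of the null
significant = S_data > q95                   # frequencies exceeding the AR(1) envelope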

Figure 9.10: Example of 95% confidence interval for Ŝ (Big Taurius δ18O MTM spectrum, nw = 3). The thin blue lines depict χ²-based intervals, while the thin gray lines depict a 95% confidence region from an ensemble of 1,000 AR(1) timeseries (Chapter 8), which serve as a null hypothesis for the significance of spectral peaks.

This is a good occasion to point out some good practices on graphically displaying spectra. The output of a routine like pmtm.m¹³ will be phrased in terms of angular frequency¹⁴. It is often much more meaningful to phrase things in terms of the period of oscillation, as done on this graph. You will also notice that both axes are logarithmic, without which the spectrum would be squished to the left-hand corner, and none of the interesting details would appear. Sometimes a semi-log scale may be more sensible; the choice should ultimately serve to highlight interesting parts of the spectrum, and what’s “interesting” is a subjective matter. Apply common sense.

One drawback of the log-log representation is that the area occupied by the low frequencies appears disproportionately large, so that integrating the area under the curve does not represent the energy of the signal in that frequency band. A solution to this problem uses a variance-preserving log-scale (Fig. 9.11). This representation plots log(f × S(f)) as a function of log(f), which conserves variance in each band and thus allows one to “integrate by eye” (A. Wittenberg, pers. comm., 2012).

¹³ for Python, see the nitime package

¹⁴ from 0 to π, the latter corresponding to the Nyquist frequency

Figure 9.11: Comparisons of the MTM spectra of 3 reconstructions of the NINO3.4 index (a common metric of ENSO) with a simple stochastic null hypothesis (Ault et al., 2013).


For reasons elaborated in Chapter 8, an AR(1) null hypothesis is often


sensible in the Earth Sciences, but is by no means an absolute rule; the
key is to realize that every estimated spectrum must be gauged with
respect to a suitable null hypothesis before claiming that some of its
features are “significant”. Also pursuant to the discussion at the end of
Chapter 6, recall that statistical significance is not the same as practical
significance. In some cases, an overly stringent null hypothesis would
have us reject things that we know to be true! For instance, we’ve seen
some econometrists argue that temperature trends should be gauged
against some nulls called “fractional Brownian motion”, though it’s been
shown with physically-based models (Mann, 2011) that this statistical
vitriol can dissolve even the most real trends into “noise”. In summary,
you should always choose a null hypothesis based on scientific, not just
statistical, considerations.

Age-uncertain spectral analysis

As an example of a scientifically-motivated null hypothesis, let us consider the study of Partin et al. (2013), who used oxygen isotope measurements in a South Pacific speleothem to learn about low-frequency hydroclimate variability. The measured timeseries may be seen in Fig. 9.13. As in most climate proxy records, the time axis is uncertain to some degree, in the sense that many possible functions could fit the age-depth constraints (Fig. 9.12). One approach to this is to use ensemble methods (Monte Carlo simulations), generating many possible realizations (say, N = 10,000) of the timeseries that are permissible within age uncertainties.

Figure 9.12: Age model for the Big Taurius record of Partin et al. (2013), based on 31 U-Th ages and a top date of 2005. The grey envelope represents 10,000 possible spline fits of the age model, whose median is depicted by the dark curve.

Figure 9.13: Time series of stalagmite


δ18 O from Taurius Cave in Espiritu Santo,
Vanuatu from 1557 to 2003 C.E. U-Th ages
are represented by black points on top,
with 2σ analytical error bars. Blue er-
ror bars along the record are 2σ errors
for individual δ18 O measurements calcu-
lated using 10,000 realizations of the age
model, and are included to depict range
of uncertainty associated with peaks or
valleys. From Partin et al. (2013)


One may then compute the resulting spectra for each realization of
the age model, and see how the spectrum of the original timeseries (blue
curve in Fig. 9.13) fares in relation to this ensemble (Fig. 9.14). The anal-
ysis shows that age uncertainties would tend to blur high-frequency
peaks rather than creating them, lending credence to the notion that
the peaks observed in Fig. 9.10 are real.

Figure 9.14: MTM spectral analysis of the Fig. 9.13 series (Big Taurius δ18O, nw = 3) using the median age model (blue). Gray shading delineates the 95% confidence interval obtained from 10,000 realizations of the age model. From Partin et al. (2013).

Robust spectral analysis


Building on the MTM, Chave et al. (1987) extended the theory to non-
Gaussian signals and devised non-parametric confidence intervals via
the jackknife method. This is particularly handy if your data are highly
non-normal and if no parametric null hypothesis applies to your prob-
lem. Some excellent Matlab code may be found on the page of Frederik
J. Simons, along with many useful goodies.

VI Cross-spectral analysis
Consider the Fourier transform pairs x(t) ↔ X(f), y(t) ↔ Y(f). Remember the Wiener-Khinchin theorem:

\[ S(f) = |X(f)|^2 \;\leftrightarrow\; \gamma_{xx}(t) \qquad (9.15) \]

where γ_xx(t) is the autocorrelation function (ACF).


1. What if we wanted to examine the spectral features of the cross-correlation? Remember the definition of γ_xy:

\[ \gamma_{xy}(\tau) = \int_{-\infty}^{+\infty} x(t)\, y(t+\tau)\, dt. \qquad (9.16) \]

In practice,

\[ \tilde{\gamma}_{xy}(k) = \frac{1}{\sigma_x^2 \sigma_y^2}\; \underbrace{\frac{1}{N - |k| - 1} \sum_{j=0}^{N-|k|} x_j\, y_{j+k}}_{\text{cross-covariance}} \qquad (9.17) \]

in which we have assumed that x and y are centered.

We have seen that

\[ \gamma_{xy}(t) \;\leftrightarrow\; X(f)\, Y^*(f) \]

so if Y = X we will have

\[ \mathcal{F}\left(\gamma_{xx}(t)\right) = |X(f)|^2 = S(f). \]

S is therefore the auto-spectrum of X.

The cross-spectrum is given by

\[ \Gamma_{xy}(f) = \sum_{\tau=-\infty}^{+\infty} \gamma_{xy}(\tau)\, e^{-2\pi i \tau f} \]

or

\[ \hat{\Gamma}_{xy}(f_n) \;\leftrightarrow\; \hat{\gamma}_{xy}(k) \quad (\in \mathbb{C}). \]

We can express the cross-spectrum as

\[ \Gamma_{xy}(f) = A_{xy}(f)\, e^{i\psi_{xy}(f)} \]

where A_xy and ψ_xy are called the amplitude and phase spectrum, respectively.

2. The amplitude spectrum reveals areas of common power between x


and y. Most often, however, one looks at the coherence

\[ \kappa_{xy}(f) = \frac{|A_{xy}|^2}{|X|^2\,|Y|^2} = \frac{|\Gamma_{xy}|^2}{S_{xx}\,S_{yy}} \]

which is eerily reminiscent of the correlation coefficient

\[ \rho_{xy} = \frac{\mathrm{Cov}(x,y)}{\sigma_x \sigma_y} = \frac{\mathrm{Cov}(x,y)}{\sqrt{v_x}\,\sqrt{v_y}} \]

so the coherence spectrum is the frequency-domain equivalent of the corre-


lation coefficient.


Properties

• We have that 0 ≤ |κ_xy| ≤ 1, and it is easy to interpret the value of |κ_xy|: if |κ_xy(f₀)| = 1 for a certain frequency f₀, then x(t) and y(t) are said to be coherent at that frequency.

• If y(t) is the output of a linear filter (transfer function H(f)) whose input is x(t), then, considering that

\[ Y = [HX](f) \quad \text{and} \quad \Gamma_{xy} = X(f)\, Y(f)^* \]

we have

\[ \kappa_{xy} = \frac{|\Gamma_{xy}|^2}{|X|^2\,|Y|^2} = \frac{|X(f)|^2\,|H(f)|^2\,|X(f)|^2}{|X(f)|^2\,|H(f)|^2\,|X(f)|^2} = 1 \qquad (9.18) \]

at all f ’s. So, if the two series are related via a linear filter, they’ll
be perfectly coherent. Deviations from unity imply a non-linear
response or no response at all.

3. For the phase spectrum we have

\[ \psi_{xy}(f) = \begin{cases} \arctan\!\left(\dfrac{\mathrm{Im}(\Gamma_{xy})}{\mathrm{Re}(\Gamma_{xy})}\right) & \text{if both are} \neq 0 \\[6pt] 0 & \text{if } \mathrm{Im}(\Gamma_{xy}) = 0 \text{ and } \mathrm{Re}(\Gamma_{xy}) > 0 \\ \pm\pi & \text{if } \mathrm{Im}(\Gamma_{xy}) = 0 \text{ and } \mathrm{Re}(\Gamma_{xy}) < 0 \\ \pi/2 & \text{if } \mathrm{Im}(\Gamma_{xy}) > 0 \text{ and } \mathrm{Re}(\Gamma_{xy}) = 0 \\ -\pi/2 & \text{if } \mathrm{Im}(\Gamma_{xy}) < 0 \text{ and } \mathrm{Re}(\Gamma_{xy}) = 0 \end{cases} \]

The phase spectrum is usually expressed in degrees (×180/π). It


expresses the phase lag between two signals at various frequencies.
For example

 ψ ( f ) ∈ [0, π ] y lags x at the frequency f
xy
 ψxy ( f ) ∈ [−π, 0] y leads x at the frequency f

where “lags” and “leads” can never be interpreted causally 15 . While 15


e.g. does y lead by T/4 or lag by 3T/4?
the coherence measures spectral bands of common power, it says It is impossible to tell

nothing about the relationship between x and y. As before, corre-


lation is not causation.
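A sketch of how coherence and phase spectra may be estimated in practice, using Welch-averaged cross-spectra in scipy (the lagged-copy relationship between x and y is purely illustrative):

import numpy as np
from scipy.signal import csd, coherence

rng = np.random.default_rng(0)
n = 4096
x = rng.standard_normal(n)
y = np.roll(x, 5) + 0.5 * rng.standard_normal(n)   # y is a lagged, noisy copy of x

f, Cxy = coherence(x, y, fs=1.0, nperseg=256)      # squared coherence, 0 <= Cxy <= 1
f, Pxy = csd(x, y, fs=1.0, nperseg=256)            # complex cross-spectrum
phase = np.angle(Pxy, deg=True)                     # phase spectrum in degrees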

Chapter 10

SIGNAL PROCESSING

In this chapter we tackle two essential elements of signal processing.


The first, filtering, has to do with bringing out parts of a signal that we
care about and silencing others. The second, interpolation, has to do
with changing time grids without altering the frequency content of a
signal too much.

I Filters
A filter is an operation that removes only a part of a measured signal. Ev-
eryday examples include sunglasses (which filter UV and/or IR wave-
lengths) or the EQ on a stereo receiver or your iTunes player, where one
can change the relative importance of bass, mids or trebles. Earth Sci-
ence examples include:

Seismology filter out surface waves (long periods) to leave only P and S
waves.

Climate filter short-term variability (weather) to focus on climate. Filter


variability outside the 2-7 year periodicity band to highlight ENSO
variability.

(and many more)

The uncertainty principle


Recall from Chapter 7 that localization in time is inversely related to
localization in frequency; this is analogous to Heisenberg’s celebrated
Uncertainty Principle (Fig. 10.1).
Because of this property of all Fourier transform pairs, the fetch of
the frequency response of a filter (i.e. how sharply it cuts the frequency
domain) will turn out to be inversely related to its impulse response (i.e.
its fetch in the time domain), so designing a perfect filter is impossible.
Figure 10.1: An illustration of Heisenberg’s uncertainty principle, which states that Planck’s constant is a fundamental limit on how accurately one might simultaneously know the location and velocity of a quantum object.

There are hundreds of different filters designed to meet specific design

challenges; it is therefore important that you state your own desiderata


before picking a filter.
Example
Lowpass filter: cut all frequencies above a certain frequency ωc . This amounts
to multiplying by a boxcar in the frequency domain, hence convolution in the
time domain (Fig. 10.2). This will smear the features of the signal.

Figure 10.2: A sharp lowpass filter: boxcar function in the frequency domain, cardinal sine in the time domain.

Question If the Fourier curse is so damning, why not stay in the time do-
main? Why not just average contiguous data points together, for instance?

This is called a running mean. One might think that by averaging together M contiguous points, you’d filter all frequencies f > 1/(M∆). Unfortunately not: this time it amounts to convolution by a boxcar in the time domain, so smearing by a sinc function in the frequency domain: that is yucky. But there is worse: not only does this mess up the amplitude spectrum, but it destroys the phase spectrum even more. Defining the boxcar as:

\[ b_k = \begin{cases} 1 & k \in [0, M-1] \\ 0 & k \geq M \end{cases} \qquad (10.1) \]

its DFT is

\[ B_n = \underbrace{e^{-i\pi n \frac{M-1}{N}}}_{\text{phase}} \; \underbrace{\frac{\sin\!\left(\frac{n\pi M}{N}\right)}{N \sin\!\left(\frac{n\pi}{N}\right)}}_{\text{amplitude}} \;\equiv\; R_n\, e^{i\Phi_n} \qquad (10.2, 10.3) \]

\[ R_n = \frac{\sin\!\left(\frac{n\pi M}{N}\right)}{N \sin\!\left(\frac{n\pi}{N}\right)} \quad \text{has zeroes at } n = l\,\frac{N}{M},\; l \in \mathbb{N}^*, \text{ and} \qquad (10.4) \]

\[ \Phi_n = -\frac{\pi n (M-1)}{N} - \varepsilon\pi, \qquad \varepsilon = \mathrm{sign}\!\left(\sin\frac{n\pi M}{N}\right) = 0 \text{ or } 1 \qquad (10.5) \]

When n = l N/M, Φₙ = ∓π: this is a 180 degree phase shift. So the running mean does cut out the frequency n = N/M (as we hoped it would), but also all its integer multiples, and completely shifts the phase around


terrible idea, though millions of scientists seem perfectly content with


it on the grounds that it is “simple”. Yes, but simple can be stupid. Let
us now see how one would proceed to build better filters.

Linear Filter Theory

INPUT: underlying signal u(t) (e.g. ground motion) → Filter r(t) → OUTPUT: measured signal s(t) (e.g. seismogram)

A linear filter verifies the convolution equation:


\[ s(t) = r * u = \int_{-\infty}^{+\infty} r(t-\tau)\, u(\tau)\, d\tau \qquad (10.6) \]

or, in the discrete world, a discrete convolution:


\[ s_n = \sum_{k=0}^{p} r_k\, u_{n-k} \qquad \text{(series of length } N\text{)} \qquad (10.7) \]

where r_k usually has length ≪ N.


By the convolution theorem, s̃(ω ) = r̃ (ω )ũ(ω ), so the output s̃(ω )
contains the same frequencies as ũ(ω ), but scaled by the amount r̃ (ω ).

Definition 13 (Transfer Function)

\[ \tilde{r}(\omega) = \frac{\tilde{s}(\omega)}{\tilde{u}(\omega)} = \frac{\text{output}}{\text{input}} = \underbrace{G(\omega)}_{\text{gain}}\; \underbrace{e^{i\varphi(\omega)}}_{\text{phase shift}} \qquad (10.8) \]

where the gain G is the ratio of amplitudes |s̃(ω)|/|ũ(ω)|, and the phase shift ϕ is the difference of phases (phase(s̃) − phase(ũ)) at each frequency. r(t) is called the impulse response of the filter, that is, the response to a unit pulse at t = 0.

There are therefore three variables to consider for any filter. In general,
one wants to specify G (e.g. lowpass) but ϕ and r (t) will be affected as
well; this trade-off is fundamentally due to the uncertainty principle.
There are several classes of filters, each realizing a compromise between
amplitude response (gain), phase response and impulse response:

1. Analog (instrument) vs. digital (computer)

2. Time domain (real time) vs. frequency domain (a posteriori)

3. Infinite impulse response (recursive) vs. finite impulse response (non re-
cursive)


4. Causal (physically realizable but involving phase distortion) vs. acausal (digital only: a zero phase shift is possible)
Say we define a filter by a set of coefficients c_k and d_k, such that:

\[ s_k = \sum_{m=0}^{n} c_m\, u_{k-m} + \sum_{j=1}^{N} d_j\, s_{k-j} \qquad (10.9) \]

then:

\[ \tilde{r}(f) = \begin{cases} \displaystyle\sum_{k=0}^{n} c_k\, e^{-2\pi i f k \Delta} & \text{Finite Impulse Response (polynomial)} \\[12pt] \dfrac{\displaystyle\sum_{k=0}^{n} c_k\, e^{-2\pi i f k \Delta}}{1 - \displaystyle\sum_{j=1}^{N} d_j\, e^{-2\pi i f j \Delta}} & \text{Infinite Impulse Response (rational)} \end{cases} \qquad (10.10) \]
If past values of s are allowed to enter the definition of sk , more general
filters can be designed, but their Impulse Response may not be com-
pactly supported, and may have poles (singularities).

Some useful filters


Time domain

• Shapiro filter: [1 2 1]/4 (weights sum to 1)

• Shapiro ∗ Shapiro: [1 2 1]/4 ∗ [1 2 1]/4 = [1 4 6 4 1]/16

• Iterating this process n times yields a binomial filter, with weights proportional to the binomial coefficients $\binom{n}{k}$.
Advantage: R(f) = cos^{2n}(πf) at order n, and the phase shift is linear (easily corrected).

Frequency domain

• Gaussian filters: Convolution by a Gaussian leads to a certain amount of blurring, controlled by the half-width of the function. Because the Fourier transform of a Gaussian is a Gaussian with inverse scale, blurring in the time domain is inversely related to blurring in the frequency domain. See Appendix A, Sect. III, Fig. A.6.

• Butterworth filters:
\[ |Bu_n| = \frac{1}{1 + \left(\frac{\omega}{\omega_c}\right)^{2n}} \]
Then choosing ε and a, ω_p and ω_s, allows the determination of ω_c and n:
\[ n = \underbrace{\frac{\ln\frac{a}{\varepsilon}}{\ln\frac{\omega_s}{\omega_p}}}_{\text{governs sharpness}} \qquad \omega_c = \underbrace{\frac{\omega_p}{\varepsilon^{1/n}}}_{\text{governs cutoff location}} \qquad (10.11) \]
This filter’s phase shift is linear as well. As n → ∞, it tends to a boxcar. The choice of n is therefore a tradeoff between sharpness of the frequency response and the behavior in the time domain, which is quantified by the impulse response (Fig. 10.4).

Figure 10.3: Butterworth filter, showing the passband edge ω_p, cutoff ω_c and stopband edge ω_s, with gains 1/(1+ε²) and 1/(1+a²).


Figure 10.4: Examples of frequency (left) and impulse (right) response for Butterworth filters of order n = 1, 2, 4, 10. The cutoff frequency is marked by a dashed red line. Notice that the larger n, the sharper the frequency response, but the longer the ringing in the time domain: there is no free lunch.

Filter nomenclature

Lowpass L(ω_c): lets through only low frequencies (ω < ω_c)
Highpass the opposite; may be constructed as 1 − L(ωc )

Bandpass lets through energy in a specific band (ω1 , ω2 ). May be con-


structed by the difference of two lowpass filters L(ω2 ) − L(ω1 )

Notch blocks the energy in a specific band (ω1 , ω2 ). May be constructed


as 1− a bandpass filter

Figure 10.5: Filter nomenclature (lowpass, highpass, bandpass, notch) using a Butterworth filter.

In practice: find the coefficients (b, a) using the Filter Design Toolbox; freqz(b, a) gives the frequency response of the filter. Then apply:

1. filter(b, a, x) (once), or

2. filtfilt(b, a, x) (twice + reversal) – zero-phase.

This allows one to construct zero-phase digital filters from filters that have a phase shift; the trick is to run the filter twice in opposite directions so that the phase shifts cancel each other.
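The Python equivalents live in scipy.signal; a sketch of a zero-phase lowpass (the sampling rate, cutoff, and toy signal are arbitrary choices for illustration):

import numpy as np
from scipy.signal import butter, filtfilt, freqz

fs = 12.0        # sampling frequency, e.g. 12 samples per year
cutoff = 0.5     # cutoff frequency, in the same units (cycles per year)
order = 4

b, a = butter(order, cutoff, btype='low', fs=fs)   # design the Butterworth filter
w, h = freqz(b, a, fs=fs)                          # frequency response (gain and phase)

rng = np.random.default_rng(0)
t = np.arange(0, 50, 1/fs)
x = np.sin(2*np.pi*0.2*t) + np.sin(2*np.pi*2.0*t) + 0.3*rng.standard_normal(t.size)

y = filtfilt(b, a, x)   # forward-backward filtering: zero phase shift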

Other methods
• Wiener filter (optimal if a known signal has been corrupted with
noise of known characteristics)

• Spline smoothing is our personal favorite. The theory is explained in Cook and Peters (1981) and the code (hepta_smooth.m) uses the algorithm of Weinert (2009).

Note Bottom line: never use a running mean. Everything else is ok, but the choice depends on what you want to achieve.

II Interpolation

² Without loss of generality, this discussion also applies to spatial contexts, though in some cases more sophisticated methods (e.g. Kriging) must be used in multiple dimensions.

General idea
The general idea can be simply seen as “connecting the dots”. Given
a series of datapoints taken at different ti , we ask ourselves what hap-
pened between the ti’s². The least squares method³ could fit a curve that minimizes the squared distance to the points, but this is not what we want. We want to go through all the points and figure out what could plausibly have happened in between. How do we do this?

³ Interpolation can be described as “exact curve fitting”. This is different from the case of “smoothed curve fitting”, which would encompass least squares, spline fitting and others.

Figure 10.6: Example of interpolation of points.

Linear interpolation
There are many ways to connect the dots depending on the constraints
we impose. The simplest way is linear interpolation. This form of inter-
polation falls into the category of “piecewise interpolations”.
On I1 we have

L1 (t) = a1 t + b1


f (t) Figure 10.7: Linear interpolation

I3
I1 I2 nodes

t1 t2 t3 t4 t

and we ask that L1 (t1 ) = f 1 and L1 (t2 ) = f 2 so


f2 − f1
L1 ( t ) = f 1 + ( t − t1 ).
t2 − t1
By induction
f i +1 − f i
Li ( t ) = f i + ( t − t i ).
t i +1 − t i
The pros of this approach are
• is simple,

• is computationally trivial.
The cons is that the interpolating function is continuous only at the 0th
order, i.e. it’s spikey.
Generally we want smoothness up to the second order (forces symme-
try) or up to the third order.

Cubic splines
The cubic splines approach is a piecewise approach but this time each
piece is a 3rd degree (i.e. cubic) polynomial



 s1 ( t ) t ∈ [ t1 , t2 ]


 s (t) t ∈ [ t2 , t3 ]
2
S( x ) (10.12)

 ... ...




s n −1 ( t ) t ∈ [ t n −1 , t n ]

where

s i ( t ) = a i ( t − t i ) 3 + bi ( t − t i ) 2 + c i ( t − t i ) + d i

and we require that it’s first and second derivatives are continuous in
the nodes

si0 = 3ai (t − ti )2 + 2bi (t − ti ) + ci ,


si00 = 6ai (t − ti ) + 2bi .

The cubic spline has to have four properties

147
Data Analysis in the Earth & Environmental Sciences Chapter 10. Signal Processing

1. the piecewise s(t) will interpolate all points f i ,

2. s(t) ∈ C0 ([t1 , tn ]),

3. s0 (t) ∈ C0 ([t1 , tn ]),

4. s00 (t) ∈ C0 ([t1 , tn ]).

So we have

1. f i = di

2. si (ti ) = si−1 (ti ) ∀i ∈ [2, n]


⇒ di = ai−1 (ti − ti−1 )3 + bi−1 (ti − ti−1 )2 + ci−1 (ti − ti−1 ) + di−1 with
i ≤ n − 1.

3. Let h = ti − ti−1 . We require

si0 (ti ) = si0−1 (ti )

but since si0 (ti ) = ci we have

ci = 3ai−1 h2 + 2bi−1 h + ci−1 i ∈ [2, n − 1]

4. We require

si00 (ti ) = si00−1 (ti )

but since si00 (ti ) = 2bi we have

2bi = 6ai−1 h + 2bi−1 .

Now let Mi ≡ si00 (ti ), we have bi = Mi /2 and we already know that


di = f i . From (4.) we get

Mi + 1 − Mi
ai =
6h
while from (3.), after some tedious algebra, we get

f i +1 − f i ( Mi+1 − 2Mi )
ci = − h.
h 6
Now all four coefficients are determined
 M −M


 ai = i+16 i


 b = Mi
i 2
f −f Mi+1 +2Mi
.

 ci = i+1h i − h

 6


di = f i

148
Data Analysis in the Earth & Environmental Sciences Chapter 10. Signal Processing

In order to solve this system of equations we can put the condition (3.)
in matrix form

ci+1 = 3ai h2 + 2bi h + ci


   
Mi + 1 − Mi f − fi Mi+1 + 2Mi
3 h 2 + Mi h + i + 1 − h=
6 h 6
 
f i +2 − f i +1 Mi+2 + 2Mi+1
− h
h 6
After some regrouping we have
 
f i − 2 f i +1 − f i +2
∀i ∈ [1, n − 1] Mi + 4Mi+1 − Mi+2 = 6
h2
leading to the matrix equation
    
1 4 1 0 0 ... 0 M1 y1 − 2y2 + y3
    
0 1 4 1 0 ... 0   6  
   M2   y2 − 2y3 + y4 
  = 2 
. . . . . . . . . . . . . . . . . . . . .   h  
 ...  ... 
0 ... 0 0 1 4 1 Mn yn−2 − 2yn−1 + yn

which consists of (n − 2) rows and n columns. This means that the sys-
tem is underdetermined so we add some boundary conditions. Depend-
ing on the choice of the boundary conditions we have three types of
splines
• natural: M1 = Mn = 0,

• parabolic runout: M1 = M2 , Mn = Mn−1 ,

• cubic runout: M1 = 2M2 − M3 , Mn = 2Mn−1 − Mn−2 .


We can also have periodic splines and clamped splines. The most used ones
are the natural cubic splines since they can extend outside the endpoints
(others have crazy behaviour).
Tridiagonal system: second derivatives (curvature) is easily determined
from the data ( f i , t: i ), straightforward back-substitution.
Note For spatial date bilinear or bicubic interpolation can be extremely useful
in changing resolutions or going from an unevenly-spaced dataset to an evenly-
spaced one. The Matlab function griddata.m proves to be useful in such
applications, and allows for different methods to be tried as.

Lagrange interpolant
Given (n + 1) points, there’s only one polynomial that goes through all
of them. It can be written in many forms but Lagrange’s is the most
common and consists of a linear combination of basis polynomials
n
L( x ) = ∑ y i li ( x )
i =0

149
Data Analysis in the Earth & Environmental Sciences Chapter 10. Signal Processing

 
where li ( x ) = ∏nj=0,j6=i xx−−xxi . Obviously li ( x j ) = δij so it goes through
j i
all the points yi . A problem with this approach is that when the num-
ber of points gets large, so does the degree of the polynomial and in this
case it is out of control. This is a global interpolant: one single function
fits all the points but at the price of ginormous oscillations. For this
reason this interpolant should never be used in practice. It is better to
use piecewise, local interpolants: these require more coefficients (hence
more computations) but they are much less sensitive to outliers since
one segment is fairly insulated from remote ones.

Fourier interpolant
So far we have only considered polynomials, what if we use cos(ωi t),
sin(ωi t)? These can be expressed as polynomials too (remember the
 k
Euler’s formula and eiω j kt = eiω j t ). In fact Fourier analysis can be
seen as a curve fitting problem with trigonometric functions. It can also
be seen as an inverse problem
 
1 1 1 ... 1
 
1 ω2 ... ω N −1 
 ω 
1  
FN = √  1 ω2 ω4 ... ω 2( N −1) 
N



. . . ... ... ... ... 
 
1 ω 2( N −1) ω 4( N −1) ... ω ( N −1)( N −1)

−2πi
where ω = e N > = I.
and FNn is a unitary matrix FN F̄N

150
Part III

Living in multiple dimensions


Chapter 11

MULTIVARIATE RELATIONSHIPS

We now seek to describe dependencies among different variables. This


could be variables of different nature (e.g. temperature vs pressure) or
different measurements of the same quantify at different locations (i.e. a
field) or in different experiments. Since the normal distribution is so
omnipresent in nature, the multivariate normal will emerge as a central
theme of this chapter.

I Relationships between variables


Random Vectors
Definition 14 A random vector is a collection of random variables

X = ( X1 , X2 , · · · , X p )

A special case of a random vector is a timeseries where all observations


are IID. Autoregressive models would be another example, but in this
case independence is lost (neighboring values tend to resemble each
other).

Joint & Marginal Distributions


A random vector is characterized by its joint distribution f ( x1 , · · · , x p ).
As in 1D, a multivariate PDF must verify three conditions: (1) f is con-
tinuous; (2) f is positive; (3) f has unit mass:
Z Z
··· f ( x1 , · · · , x p )dx1 · · · dx p = 1

It gets tricky to represent in more than two dimensions, so let us focus


on p = 2, and f ( x1 , x2 ). An example of joint distribution between two
climate variables (transient climate response and equilibrium climate
sensitivity) is shown in Fig. 11.1.
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

Figure 11.1: a) Marginal probability den-


sity functions (PDFs) of climate sensitiv-
ity; b), marginal PDFs of transient cli-
mate response (TCR); c), posterior joint
distribution constraining model parame-
ters to historical temperatures, ocean heat
uptake and radiative forcing under rep-
resentative illustrative priors. For com-
parison, TCR and climate sensitivities are
shown in (c) for model versions that yield
a close emulation of 19 CMIP3 climate
models (white circles). From Meinshausen
et al. (2009)

The joint distribution is depicted by the color field and is seen to oc-
cupy the lower right quadrant of the square of possible values. Integrat-
ing such a distribution with respect to either variable leads two marginal
distributions. For instance:
Z
f 1 ( x1 ) = f ( x1 , x2 ) dx2 (11.1a)
ZR
f 2 ( x2 ) = f ( x1 , x2 ) dx1 (11.1b)
R

Such marginals are shown in panels a and b of Fig. 11.1. One may think
of a marginal distribution as averaging out the effect of all others vari-
able to focus on a particular one.

154
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

Conditional Distributions
If the components of a random vector are independent, then it is easy to
show that their joint distribution factorizes into marginal distributions1 : 1
this is in fact the definition of indepen-
dence

f ( x1 , x2 , · · · , x p ) = f 1 ( x1 ) f 2 ( x2 ) · · · f p ( x p ) (11.2)
This is not the case in general, as shown in Fig. 11.1. In that case,
it may be interesting to slice through the joint distribution at a partic-
ular value of either variable, obtaining what is known as a conditional
distribution. In two dimensions, for instance:

g2 ( x 1 | X2 = x 2 ) = f ( x1 , x2 ), x1 variable, x2 fixed (11.3a)


g1 ( x 2 | X1 = x 1 ) = f ( x1 , x2 ), x1 fixed, x2 variable (11.3b)

We will return to this concept later with estimation and prediction from
linear models (Chapter 15).

The covariance matrix


We saw in Chapter 3, Sect. II that covariance and correlation are use-
ful measures of the linear association between variables. Their gener-
alization to multiple dimensions are the covariance matrix and its scaled
doppleganger, the correlation matrix. By definition, the covariance ma-
trix associated with X p is a p × p matrix Σ such that

Σij = Cov( Xi , X j ) (11.4)


Since Cov( X, Y ) = Cov(Y, X ), the covariance matrix is symmetric,
Σ> = Σ. Further, the diagonal elements are of the form Cov( Xi , Xi ) =
Var( Xi ), so the matrix carries the variance of each variable along its
main diagonal. For two variables, the covariance matrix would write:
 
Var( X1 ) Cov( X2 , X1 )
Σ=  (11.5)
Cov( X1 , X2 ) Var( X2 ) 2
For any nonzero real vector a, the ma-
trix M is positive definite if and only if
The correlation matrix is obtained similarly as a> Ma > 0

Rij = ρ( Xi , X j ) (11.6)
Since a variable is always perfectly correlated with itself, its diagonal is
made of ones. 3
This is emphatically not true of all co-
variance matrices estimated from obser-
For a random matrix to qualify as a covariance (or correlation) ma-
vations, especially when the number of
trix, it must be positive definite2 and symmetric. These are rather strong samples n is smaller than the number of
constraints, so this is a fairly restricted class. Positive definiteness means variables p. This is a common difficulty
in inverse problems, which we will learn
in particular that a covariance matrix can always be inverted3 . to overcome in different ways in Chap-
ter 14

155
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

II The Multivariate Normal Distribution


Just as the normal distribution is the crown jewel of univariate distri-
·10−2
butions, the multivariate normal (MVN) is by far the most ubiquitous
of multivariate distributions. This is partially because of a multivari- 5

f
ate version of the Central Limit Theorem, but also because, just like in 0
the univariate case, it has so many convenient properties that it is often −2 2
0 0
attractive to transform multivariate data to approximate normality just
2 −2
so that one can use the MVN. We start with a bivariate example before x1 x2
generalizing to p > 2.
Figure 11.2: The Bivariate Normal Dis-
tribution – independent isotropic case.
Easing in: The Bivariate Normal Black curves represent the marginal den-
sities, while red dots represent random
Let ( X1 , X2 ) follow a bivariate normal distribution. Assume for now that samples from this distribution, tending to
the two variables are independent, so f ( x1 , x2 ) = f 1 ( x1 ) f 2 ( x2 ). That is, cluster around the central bump.

if X1 ∼ N (µ1 , σ12 ), X2 ∼ N (µ2 , σ22 ),

 2  2
1 − 12
x1 − µ1
1 − 12
x2 − µ2
f ( x1 , x2 ) = √ e σ1
× √ e σ2 0.1

f
σ1 2π σ2 2π 0
 2  2 
x1 − µ1 x −µ
1 1 − 12 + 2σ 2 −2 2
e 2
(11.7)
σ1
= 0 0
2π σ1 σ2
2 −2
This is situation is illustrated in Fig. 11.2 & Fig. 11.3. It is, however, a x1 x2
rather restrictive situation: in general, X1 and X2 could be dependent.
Figure 11.3: The Bivariate Normal Dis-
tribution – independent anisotropic case.
General Case As in Fig. 11.2, except that σ2 > σ1 .

Definition

To express this situation we need the tools of linear algebra. Define:


 
    σ1 2 Cov( X1 , X2 ) ... Cov( X1 , X p )
x1 µ1  
 

 .. 
 
 .. 
  Cov( X2 , X1 ) σ2 2 ... Cov( X2 , X p ) 
x =  . ; µ =  . ; Σ =   
    .. .. .. 
 . . . 
xp µp  
Cov( X p , X1 ) Cov( X p , X2 ) ... σp 2

We say that X p follows a p-variate normal distribution (X p ∼


N p (µ, Σ)) if and only if:

1 1 1 > −1
f ( x1 , x2 , . . . , x p ) = p/2
p e − 2 (x− µ ) Σ (x− µ ) (11.8)
(2π ) |Σ|

where, as per standard notation, |Σ| is the determinant of Σ (Ap-


pendix B).

156
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

Mahalanobis Distance

The quantity within the integral is −1/2 times the Mahalanobis dis-
tance between x and µ. This distance a quadratic form in X p . It de-
√ √ p
fines an ellipsoid with lengths Σ11 , Σ22 , · · · , Σ pp for the semi-major
axes. You can think of it as a multivariate measure of distance (a norm)
scaled by the uncertainty in each variable (Fig. 11.4). To gain intuition
about the MVN, let us return to our bivariate case:

     
x1 µ1 σ1 2 Cov( X1 , X2 ) Figure 11.4: Illustration of the Maha-
x= ; µ= ; Σ=  lanobis distance. In this anisotropic case,
x2 Cov( X2 , X1 ) 2
µ2 σ2 the red dot is further than the green dot
by this distance, though it is closer by the
Now, using the identity Cov( X1 , X2 ) = ρσ1 σ2 , we get: usual Euclidean distance (“as the crow
  flies”). Put another way, the uncertainty
σ1 2 ρσ1 σ2 along the x axis is larger than the uncer-
Σ=  tainty along the y axis, so the red dot ap-
ρσ1 σ2 σ2 2 pears more distant than the green dot.

Then:
|Σ | = σ1 2 σ2 2 (1 − ρ2 )
 
σ2 2 −ρσ1 σ2
Σ −1 = 1
|Σ |
 
−ρσ1 σ2 σ1 2
So
" 2     2 #
> −1 1 x1 − µ1 x1 − µ1 x2 − µ2 x2 − µ2
( x − µ) Σ ( x − µ) = − 2ρ +
1 − ρ2 σ1 σ1 σ2 σ2
xi − µi
Reasoning in terms of standardized variables zi = σi , we get:

z1 2 − 2ρz1 z2 + z2 2
( x − µ ) > Σ −1 ( x − µ ) =
1 − ρ2
First note that, after all these matrix operations are said and done, we
are left with a scalar – a single number. Second, notice that the numer-
ator looks an awful lot like a2 − 2ab + b2 = ( a − b)2 , hence the name
quadratic form (this would hold true in higher dimensions as well). It is
a normalized distance between x and µ (or between z and 0). Now, if
the variables were independent4 , then ρ = 0 and the covariance matrix 4
which we denote X1 ⊥⊥ X2
would be diagonal
 
σ1 2 0
Σ=  (11.9)
0 σ2 2
In this case, it is easy to see how one would fall back on Eq. (11.7).
The resulting distance measure is called a normalized Euclidean distance:
in p dimensions, v
u p  
u xi − µi 2
d(~x, ~y) = ∑
t (11.10)
i =1
σi

157
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

Dependencies

The general case (dependent, anisotropic) of the bivariate normal dis-


tribution is shown in Fig. 11.5. This time, the axis is rotated, meaning
that knowing something about X1 tells us something about X2 (and vice
versa).

Figure 11.5: The Bivariate Normal Distri-


bution – dependent anisotropic case. As
in Fig. 11.3, now with ρ 6= 0.

·10−2
5
f

0
−2 2
0 0
2 −2
x1 x2
x3
x2

In three dimensions, the multivariate normal distribution is a “cu-


cumber”. E.g., the case σ1  σ2 = σ3 is called anisotropic independent
(Fig. 11.6), while the case in which σ12 6= 0, σ13 6= 0, σ23 6= 0 is called
anisotropic dependent (Fig. 11.7). In higher dimensions, the probability x1
equisurfaces define a hyperellipsoid, which generalizes these cucum-
bers. The main task of principal component analysis (PCA, Chapter 12) Figure 11.6: Trivariate normal density,
is to identify the main axes of such ellipsoids, sometimes in very high anisotropic independent case

dimensions.

Properties
The MVN has four wonderful properties:

Subsets x3
x2
All subsets of variables from an MVN also follow an MVN. For in-
stance, if we split ( X1 , X2 , . . . , X p ) into X 1 = ( X1 , X2 , . . . , Xq ) and
X 2 = ( Xq+1 , Xq+2 , . . . , X p ), then X 1 follows a q-variate normal dis-
tribution, X 2 a (p − q)-variate normal distribution, with parameters:
x1
µ1 = ( µ1 , µ2 , . . . , µ q ) (11.11a)
µ2 = ( µ q +1 , µ q +2 , . . . , µ p ) (11.11b) Figure 11.7: Trivariate normal density,
anisotropic dependent case

158
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

and Σ1,1 , Σ2,2 such that


 
Σ1,1 Σ1,2
Σ=  (11.12)
Σ2,1 Σ2,2

Independence
You already know that independence implies zero covariance:

Xi ⊥⊥ X j ⇒ Σij = 0 (11.13)

In the MVN world, a magical thing happens: the reverse is also true!
This is because when off-diagonal elements of Σ are zero, the expo-
nential term factorizes cleanly into distinct factors, as in Eq. (11.7).
This may seem trivial, but it is a very special property that makes life
incommensurately easier.

Linear combinations
Any linear combination of a multivariate normal distribution is a
multivariate normal distribution as well, and the means and vari-
ances are linearly related. Specifically, if Y = B> X + A ∼ N (µY , ΣY )
then µY = B> µ X + A and ΣY = B> Σ X B.

Conditional Distributions

Conditional distributions of a subset of the MVN, given values for


the other variables, are also MVN. This means that whichever way
we slice it, we always get an MVN (e.g. Fig. 11.8). This can be used
to estimate, or predict, values of one variable given the other. For
instance, using the notation of Eq. (11.11b) and Eq. (11.12), the con-
ditional mean of X 1 given that variable X 2 takes the value x2 is
Figure 11.8: Slicing through a dependent
−1 bivariate normal, one gets a univariate
µ1 | x2 = µ1 + Σ1,2 Σ2,2 ( x2 − µ2 ) (11.14) normal with mean and variance depen-
dent on value of the slicing variable.
That is, the mean is pulled in the direction of x2 − µ2 , by an intensity
proportional to the projection of x1 on x2 , scaled by the uncertainties
in x2 . This is the basis for linear regression. The conditional covariance
matrix is
−1
Σ1,1 | x2 = Σ1,1 − Σ1,2 Σ2,2 Σ2,1 (11.15)

This expression does not depend on the value x2 , just on the covari-
ance submatrices. If X 1 ⊥⊥ X 2 , Σ1,2 = Σ2,1 = 0 and the equations
reduce to µ1 | x2 = µ1 and Σ1,1 | x2 = Σ1,1 , which is another way of
saying that no information is provided by knowledge of X 2 .

159
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

In the bivariate case, one may rewrite these as:

µ1 | x2 = µ1 + ρ 1 ( x2 − µ2 )
σ
(11.16a)
q 2
σ
σ1 | x2 = σ1 1 − ρ2 (11.16b)

It indicates that the conditional mean is larger than the unconditional


mean µ1 if x2 is above its mean µ2 and ρ > 0, or if x2 < µ2 and
ρ < 0. Importantly, as long as ρ 6= 0, then the uncertainty about x1 | x2
(measured by σ1 | x2 ) is smaller than the unconditional uncertainty σ1 .
In this sense, ρ2 can be viewed as the fraction of variance in x1 that
is accounted for by x2 . These remarks also apply in the general case
(Eq. (11.14) and Eq. (11.15))

Covariance Estimation
In real life, we get a random matrix X ∈ Mn× p (R) built out of n obser-
vations (arranged in rows) of p variables (arranged in different columns).
Each column represents a Gaussian random variable. We can define the
sample mean of each column
n
1
µ̂ j =
n ∑ Xkj = x j (11.17)
k =1
and the sample covariance matrix
n   
1
n − 1 k∑
bij =
Σ x ki − x i x kj − x j . (11.18)
=1
We can define centered variables as
x:j0 = x:j − x j (11.19)
and rewrite the sample covariance matrix as
 1
Σ̂ ij
= SX =X 0> X 0 (11.20)
n−1
so the sample covariance is a scaled, inner product of the centered data
matrices X 0 . It is exactly analogous to the univariate case of the sample
variance:
n
1 1
( x i − x )2 .
n − 1 i∑
sx = ( x − x )> ( x − x ) = (11.21)
n−1 =1
These estimators are MLE, hence optimal (consistent, efficient). In
fact, since they uniquely characterize the distribution, they are called
sufficient statistics of the MVN. However, as we shall see, things can go
sour when p > n – the sample covariance matrix no longer is positive
definite, and will have to be regularized in order to work properly. In
fact, this may even be the case for n & p (that is, for sample sizes not
much larger than the number of parameters to be estimated (i.e. the
number of random vectors))

160
Data Analysis in the Earth & Environmental Sciences Chapter 11. Multivariate Relationships

Multivariate Central Limit Theorem


As in the univariate case, the MVN enjoys the status of gravitational at-
tractor of the space of distributions. If X1 , · · · , X p are IID with mean µ
and covariance Σ, then their sample mean over n observations Xn con-
verges in distribution to a MVN with mean µ and covariance matrix n1 Σ.
√  D
n Xn − µ → N p (0, Σ) (11.22)
Multinormality for the sample mean implies that the sampling distri-
bution of the Mahalanobis distance between the true and sample means
asymptotically follows a χ2 distribution with p degrees of freedom:
 > 1 −1 
Xn − µ Σ Xn − µ ∼ χ2p (11.23)
n
This allows to draw confidence ellipsoids around the estimated mean,
within the uncertainties of the sampled observations. When Σ is esti-
mated from observations, these confidence regions get a little harder to
compute, requiring the use of the F distribution.

161
Chapter 12

PRINCIPAL COMPONENT ANALYSIS

It is not easy to arrive at a conception of a whole which is constructed


from parts belonging to different dimensions

Paul Klee, On Modern Art

Motivation: given a field ψ( x, y, t) we ask ourselves if there are cer-


tain spatio-temporal patterns (or modes) that account for a large portion
of the observed variability. If so, we can more efficiently describe varia-
tions in the dataset, and use these patterns as a new reference frame in
which to view the data.

I Principal component analysis: theory


The Big Idea
Imagine that one collects observations of a multivariate normal random
variable, e.g. T ( x, y, t), SLP( x, y, t). We want to find the patterns that
describe the most of the covariance found in the dataset.
One may label all locations from 1 to p so

T ( x, y, t) −→ T (s, t). (12.1)


p
If we put this observations in an n × p matrix X (n samples, p locations)
1
and we center the columns X 0 = X − X̄, then SX = n− 0> 0
1 X X is a real,
symmetric matrix. Principal component analysis aims to find the major
axes of the ellipsoid spanned by (x − µ)> Σ−1 (x − µ), i.e. the directions ... ...
of maximum variation.
To do this consider an eigendecomposition (Appendix D) of SX n

SX = EΛE> (12.2)

which is possible because SX is real and symmetric. Figure 12.1: Example of data matrix
! over the years, then subtracted from the data to yield
2 SLP anomalies. We are only interested in analysing the
!λ2k ∼ λ2k
n∗ winter season defined by December to February (DJF).
!λk2 The data are therefore obtained by concatenating the win-
!uk ∼ 2 uj (18) ter monthly means for all years. Finally a weighting by
λj − λ2k
Data Analysis in the Earth & Environmental Sciences Chapter the 12.
square Principal component
root of the cosine analysis
of the corresponding lat-
itude is applied to each grid point to account for the
where λj2 is the closest eigenvalue to λ2k , and n∗ is
converging longitudes poleward. The data over the NH
the number of independent observations in the sample,
north of 20 ° N are used to compute EOFs. Note that the
also known as the effective sample size, or the number
examples presented here have also been used in Hannachi
of degrees of freedom (Trenberth, 1984; Thiébaux and
et al. (2006).
Moreover, the eigenvalues will all be non-negative. The matrix Λ has
Zwiers, 1984). For example,
" !the #95% confidence interval
Figure 1 shows the spectrum of the covariance matrix
of λ2k is given by λ2k 1 ± 2∗ . The effective sample along with their standard errors as given by the first
the form n
size of a time series of length n involves in general equation of (18) with sample size n = 3 × 52 = 156.
  the autocorrelation structure of the series. For$example,
The leading two eigenvalues seem nondegenerate and
the sum of the autocorrelation function, 1 + 2 k≥1 ρ(k), separated from the rest, but overall the spectrum looks
Λ1 0 . . . 0 provides a measure of the decorrelation time, and an in general smooth, which makes truncation difficult.
  Figure 2 shows the first two EOFs. These EOFs explain
  estimate% of n∗ is given by (Thiébaux&−1and Zwiers, 1984);
 0 Λ2 . . . 0  n∗ = n 1 + 2 n−1
$
(1 − k/n)ρ(k) .
Λ=   (12.3) k=1
Eigenvalue spectrum
.  Another alternative is to use Monte Carlo simulations,
. . . . . . . . 0  (see for example Björnsson and Venegas 1997). This can 30
  be achieved by forming surrogate data by resampling a
part of the data using randomisation. An example would 25
0 0 0 be toMrandomly select a subsample and apply EOFs,
Λ

Eigenvalue (%)
20
then select another subsample etc. This operation, which
can be repeated many times, yields various realisations
where M = min( p, n). Assuming that p ≤ n, E is given byof the eigenelements from which one can estimate 15

the uncertainties. Another example would be to fix a


  subset of variables then scramble them by breaking the
10

E = e1 e2 . . . e (12.4)
chronological order then apply EOFs, and so on. Further
p Carlo alternatives exist to assess uncertainty on
Monte
5

the spectrum of the covariance matrix. One could for 0


example scramble blocks of the data, for example two- 0 10 20 30 40
where the ei ’s are the eigenvectors of SX . The eigenvalue λ can be con- i data keeping thus some
or three-year blocks of monthly Rank
parts of the autocorrelation structure, (e.g. Peng and Fife
sidered as the variance accounted for by the pattern defined by the cor- Figure 1. Spectrum, in percentage, of the covariance matrix of winter
1996). To keep the whole autocorrelation structure of monthly (DJF) SLP. Vertical bars show approximate 95% confidence
the data the phase randomisation method (Kaplan and Figure 12.2: Athumb
scree plot, showing the
responding eigenvector ei . It may be charted on a scree plot (Fig. 12.2), limits given by
Glass, 1995) can be used. The method generates data fraction of variance
the rule of (18). Only the leading 40 eigenvalues
accounted for by each
are shown.

along with estimates of uncertainty (12.14). eigenvalue. Vertical bars show approxi-
Copyright © 2007 Royal Meteorological Society Int. J. Climatol. 27: 1119–1152 (2007)
The temporal signature associated with mode i is given by the in- mate 95% confidence limits given by the
DOI: 10.1002/joc

North et al. (1982) rule of thumb. From


ner product ui (t) = ei> X and defines the ith principal component, PCi . Hannachi et al. (2007)
Note that it is only a function of the time variable (or whichever vari-
able serves as the row index in your application).

Singular value decomposition formulation


Recall that for any matrix A, we can write A = UΣV > ⇒ A> A =
VΣ2 V > (Appendix D). Now consider the case A = X e = √ 1 (X − X)
n −1
so that SX = X e>Xe is the sample covariance matrix of the dataset. In
this case

e = U diag(σ) V > = U diag(σ ) E> ⇔ X


X e>X
e = EΛE> . (12.5)

where diag(σ) is given by


 
σ1 0 ... 0
 
 
0 σ2 ... 0
diag(σ ) = 
 ..

 (12.6)
. . . ... . 0
 
0 0 0 σp

So for each mode i

• ei represents the spatial pattern



• σi = λi ↔ standard deviation associated with ei .

• ui (t) is the temporal pattern.

The two formulations are equivalent, but SVD tends to be a more effi-
cient algorithm than eigendecomposition (Golub and van Loan, 1993).

164
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

Orthogonality
The matrix U of left singular vectors therefore gathers the principal
components of the field. From the definition of the singular value de-
composition E> E = I p and

SU = Var( E> X
e ) = E> Var( X
e ) E = E> SX E = Λ. (12.7)

The principal components are also orthogonal and, given an appropri-



ate scaling 1/ λi , can be made orthonormal. Hence, principal compo-
nent analysis is a bi-orthogonal operation. From a multivariate normal

dataset we can define U, λi and E from which we derive a decompo-
sition on a set of biorthogonal spatio-temporal patterns:
M p
X (s, t) = ∑ λ i u i ( t ) × ei ( s ). (12.8)
i =1

We can see that space and time have been separated, each mode is
weighted by the square root of its contribution to the total variance.
Thanks to orthogonality, we readily obtain the fraction of variance asso-
ciated with each pattern:
x1
λi σi2
Fi = M
= (12.9)
∑ i =1 λi ∑iM 2
=1 σi

We usually sort the eigenvalues in descending order λ1 ≥ λ2 ≥ · · · ≥


~e2 ~e1
λ p ≥ 0, and all of them are positive, so Fi increases with i.
x1

Interpretations
Directions Principal component analysis finds the main directions of
variability which can be identified with the main axes of an ellipsoid.
In the case p = 2 we have an ellipse (Fig. 12.3),
Figure 12.3: PCA aims to identify the
 2  2 main axes of an ellipse from noisy data
x1 x2
H2 = + (12.10)
σ1 σ2
When p = 3 the surface defines an ellipsoid (Fig. 12.4), while in the 1
The jargon of PCA is unnecessarily
case p > 3 we have a hyper-ellipsoid (which our perceptual system murky, due in no small part to climatol-
cannot fathom without projection). ogists taking the work of Lorenz (1956)
too literally. Lorenz pioneered the use
Decomposition on orthogonal functions of PCA in climate research, under the
name empirical orthogonal functions, or
these functions are like the sines and cosines found in the Fourier EOF, analysis. The name has stuck, but
transform but are dictated by the data, so they are empirical orthogo- it makes conversations with a statistician
needlessly complicated, because to them
nal functions1 . The following terminology is used in the atmospher- the PCs are the right singular vectors, not
ic/oceanic sciences: the left singular vectors. Know your au-
dience!
Ui (t) = PCi (t)
Ei (s) = EOFi (s) (12.11)

165
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

2
2 0

z
5
2 0
−2

z
0 0 −2 5
y −5 0 −4
z

−2−5 0 0 −4 −2 0 2 4
5 −5 5 −5 y
x x y

Figure 12.4: A three dimensional ellip-


soid representing the probability space
Reformulation occupied by the data points in red. The
major axes, aka principal components,
Principal components are linear combinations of the data that opti-
(black) delineate the main directions of
mally describe its variations; by themselves, they do not explain any- variability. Three viewing angles are of-
thing, but they may highlight important modes, some of which may fered here to illustrate the difficulty of vi-
sualizing a three-dimensional object on a
have a scientific interpretation (e.g. Sect. III). two dimensional page.

PC compression
Because PCA is fundamentally a singular value decomposition, it al-
lows to re-express the field as a sum of rank-one matrices2 . 2
This is a special case of the Eckhardt-
Young-Mirsky theorem (Appendix D)
K
eK =
X ∑ σi ui ei> (12.12)
i =1

If K = M then we have a complete recovery (analysis ⇔ synthe-


sis); all we have done is reformulating the original observations in
a different reference frame, but no information was created or lost. If
K  M but, for example, ∑iK=1 Fi ≈ 90%, then we can describe 90% of
the variance with just a few modes. It is so because most eigenvalue
spectra decay relatively fast (e.g. Fig. 12.2).

II Principal component analysis in practice


Pre-processing
• Centering: X 0 = X − X. Failing to do so will result in nonsensical
results.
1/2 0
• Scaling (normalization): X 00 = S−
X X amounts to working with the 3
On a sphere, the integral of some
correlation rather than the covariance matrix. field ψ(λ, φ) over a domain D is
s 2
√ D ψ ( λ, φ ) R cos φ dφ, where λ is the
• Area weighting: it is common to multiply the field by cos φ because longitude, φ the latitude, and R the
the area of a surface element on the sphere is dA = R2 cos φ dφ dλ radius of the sphere.
3 . On a uniform grid, this transformation accounts for the area of

each grid box, thus preserving the energy (variance) of the field on
the sphere.

166
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

Dimensionality
If n < p then SX is singular, so its inverse is undefined. This situation
may be overcome via sparse PCA (e.g. Zou et al., 2006; Johnstone and Lu,
2009).

Truncation
The choice of truncation determines the amount of compression achieved.
Heuristically, we may choose the first K patterns that account for a siz-
able portion of the variance, e.g.

∑iK=1 λi
r̃K = ' 0.7 or 0.9 (12.13)
∑iM=1 λ i
Other criteria exist, in particular Kaiser’s (retain all modes that account
for more variance than the average eigenvalue) or more elaborate rules
that hunt for breaks in the eigenvalue spectrum using decision theoretic
criteria (Wax and Kailath, 1985; Nadakuditi and Edelman, 2008).

Separability
The North et al. (1982) rule of thumb assesses the separation between
two successive eigenvalues as
r
2
δλi = λi (12.14)
n−1
which is related to the uncertainty about λi . Indeed, a 95% confidence
interval for λi under the null hypothesis of an correlated random field
(i.e. white noise) is !
r
2
λi 1 ± (12.15)
n−1
Hence, the smaller λi , the more difficult it is to separate it from its
neighbors, and from noise.

Significance
Not all modes are statistically significant. Fewer still are physically mean-
ingful. To find the meaningful ones we can use “Preisendorfer’s Rule
N” (Overland and Preisendorfer, 1982), which tests against the hypothe-
sis that the eigenvalues λi arise from a random Gaussian process; any
λi ≥ λthreshold is retained. A more sophisticated test is against multi-
variate red noise, as implemented by Dommenget (2007).
If the sample size is small, one true physical “mode” may be spread
over several of the statistical modes that we call principal components.
To some extent rotation can correct for this, but it brings challenges of
its own (Lab 9).

167
qualitatively similar.
[14] The SST patterns associated with the E and C indices,
on the other hand, are both ENSO‐like (Figures 3c and 3d).
DataThe E pattern
Analysis in the has
Earthits&strongest amplitude
Environmental Sciencesand explained var- Chapter 12. Principal component analysis
iance along the eastern equatorial Pacific (east of 120°W) and
along the coast of Peru (Figure 3c), whereas the C pattern has
IIIits amplitude
Geoscientific and explained
usesvariance
of PCA in the central equatorial
Pacific (170°E–100°W, maximum in the Niño 4 region; L10704 TAKAHASHI ET AL.: REINTERPRETING
TheFigure
use of 3d).
principal component analysis in many fields of applied re-
] Since extraordinarily
[15including yields a se
search, the geosciences,warm and strongly
runs deep. Here wecold giveevents
but three the wester
belong to different regimes, their SST patterns
examples of its use in climate research, but examples abound in pa-
are different. The first m
Niño 3 reg
In particular, the C pattern (Figure 3d), is similar to the pattern
leobiology, mineralogy, and others. Following the usual jargon of the up to 40%
obtained by compositing warm and cold events and averaging opposite s
atmospheric
the patterns sciences, EOF and
[Hoerling PCA
et al., will be
1997, used 7c],
Figure interchangeably.
while the E sponding
qualitative
pattern (Figure 3c), describes the difference between them [14] The
on the oth
[Hoerling
ENSO dynamics et al., 1997, Figure 7d]. This is reflected by the The E patt
positive (negative) skewness of 1.78 (−0.62) for the E (C) iance alon
along the c
In aindex.
recent study, Takahashi et al. (2011) used PCA to identify a state space its amplitu
[16] toThe
in which E and the
describe the evolution
C patternsofare
thealso similar to Rasmusson
El Niño-Southern Oscillation Pacific (1
Figure 3d)
and Carpenter’s [1982] composite SST anomalies
phenomenon. They performed PCA on monthly sea-surface tempera- for the [15] Sin
ture“peak” and and
(SST) data the subsequent
found that the “mature”
first twophases, respectively,
PCs combined of for
account
belong to
Figure 12.5: EOF patterns (◦ C, shading) In particul
mosttheofevolution
the variance of the
(68% composite
and 14%, warm El Niñoin event.
respectively) In thisThe
the domain. between the 1870–2010 HadISST sea sur- obtained b
“canonical” El Niño, the event starts near
associated spatial patterns (EOFs) are shown in Fig. 12.5.
the positive E axis, face temperature anomalies associated
the pattern
pattern (F
with (a) PC1, (b) PC2. The percent- [Hoerling
age of explained variance is contoured positive (n
(the interval is 20% (10%) below (above) index.
60%).From Takahashi et al. (2011). [16] The
and Carp
“peak” and
the evolut
“canonical
s (°C, shading)
ace temperature Figure 12.6: Evolution of PC1 and PC2
dex, and (d) the from May (indicated with circles) to
the following January (crosses, corre-
nce is contoured sponding year shown)coefficients
El Niño events:
Figure 3. Linear regression (°C, shading)
0%). between(a) the
events considered
1870–2010 HadISST by Rasmussen
sea surface temperature
anomalies and (a) PC1,(1982)
and Carpenter (b) PC2, (c) the
(their E‐index, and
composite is (d) the
C‐index. The thicker);
shown percentage(b)of extraordinary
explained variance is contoured
events;
(the interval is 20% (10%) below (above) 60%).
(c) central Pacific events (Kug et al., 2009)
result from the including the recent 2009–10 event (Lee
plained by these and McPhaden,
in contrast to the axes2010); (d) other
PC1–PC2, whichmoderate
result from the
events since
maximization 1950
procedure according
of SST varianceto NOAA,by these
explained
modes. corresponding to DJF Oceanic Nino
d in two other [12] Index
This same
≥ 1◦ Cbehavior is encountered in two other
(see http://www.cpc.noaa.
mith et al., 2008] observational SST estimates (ERSST
gov/products/analysis _
v3b [Smith et al., 2008]
and Kaplan SST v2 [Kaplan et al.,monitoring/
1998]; not shown), but
not shown), but is muchensostuff/ensoyears.shtml
clearer in the model data (Figures for details).
2b–2d), which
2b–2d), which containFrom Takahashi larger
a substantially et al. (2011).
number of extraordinary warm
events. The fact that even Zebiak and Cane’s [1987] ICM is
aordinary warm able to depict this behavior indicates that it is not essential to
s [1987] ICM is invoke mechanisms external to the tropical Pacific to explain
the different regimes, although we do not rule out their role as Figure 4.
s not essential to drivers [e.g., Vimont et al., 2001]. with circle
[13] The spatial patterns of SST associated with the year show
acific to explain EOFs (Figures 3a and 3b) are similar to the ones reported Rasmusson
out their role as Figure 4. Evolution of PC1 and PC2 from May (indicated by Ashok et al. [2007], although our second mode does not thicker); (b
They
with used thistoasthe
circles) a new coordinate
following Januarysystem, enabling
(crosses, the identifica-
corresponding show the signal in the far‐western tropical Pacific seen in [Kug et al
their “Modoki” pattern due to a more equatorially‐confined and McPh
tion of two ENSO regimes: the regime
ciated with the year shown) El Niño events: (a) events considered of extraordinary warm byevents domain. Using the domain of Ashok et al. [2007] and the 1950 acco
slightly longer time period used in this study yields patterns Niño Inde
e ones reported(theRasmusson
1982–83, 1997–98 and 1877–78
and Carpenter events)
[1982] andcomposite
(their the regime isthat includes
shown essentially the same as theirs, but a long period (1950–2009) analysis_m
the cold, neutral, and moderately warm years (Fig.
mode does not thicker); (b) extraordinary events; (c) central Pacific events 12.6). This PCA-
3 of 5
Pacific seen inbased
[Kug et al., 2009]has
decomposition including
durably the recent
shifted 2009–10 event of
the characterization [Lee
ENSO
orially‐confinedevents,
and and
McPhaden,
exemplifies 2010];
how PCA(d) other
can be moderate events dynamical
used to generate since
[2007] and theinsight.
1950 according to NOAA, corresponding to DJF Oceanic
y yields patterns Niño Index ≥ 1°C (see http://www.cpc.noaa.gov/products/
od (1950–2009) analysis_monitoring/ensostuff/ensoyears.shtml for details). 168

3 of 5
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

Patterns in age-uncertain coral data


PCA may also be applied to data matrices that are not associated with
uniformly spaced data. For instance, Emile-Geay and Eshleman (2013)
used it to identify the main mode of variability in a network of 25 coral
δ18 O records from the tropical Indo-Pacific ocean (Fig. 12.7).

Coral b18 O network, 22 sites, EOF 1 - 18% variance Figure 12.7: First EOF mode of the net-
24 N o
work 25 coral δ18 O records. (top) EOF
16oN
coefficients (blue < 0, red > 0) overlain
8oN

0o
on a map of HadSST2 DJF temperature
8oS
regressed onto the first principal compo-
nent. The center each dot is collocated
16oS

24oS

with a coral reef, and their area is pro-


EOF > 0, size | magnitude
EOF < 0, size | magnitude
portional to the EOF loading. δ18 O val-
ues were multiplied by -1 so that positive
excursions correspond to warming tem-
perature; (bottom left) timeseries of the
-1 -0.5 0
Contours: SST (° C) regressed onto PC 1
0.5 1
first principal component; (bottom right)
Multi-taper spectrum (Thomson, 1982) of
the first principal component; note that
the y-axis is the product of the power by
0.15
Principal component timeseries PC 1 MTM spectrum the frequency so that the relative area un-
0.1
0
10 50 20 10 6 4 2 1 der the spectrum is preserved in this log-
Power × Frequency

arithmic scale. Numbers refers to the pe-


PC 1 (unitless)

0.05

0 riod of oscillation in years. From Emile-


Geay and Eshleman (2013)
-2
-0.05 10

-0.1

-0.15
-4
-0.2 10
1900 1910 1920 1930 1940 1950 1960 1970 1980 -2 -1 0
10 10 10
Time Frequency (cpy)

As expected, the spatial pattern of sea-surface temperature (SST) as-


sociated with this mode bears a strong resemblance to El Niño-Southern
Oscillation (ENSO), as confirmed by its temporal expression (PC1), which
displays maxima and minima coincident with known ENSO events. The
MTM spectrum of the mode reveals a relatively strong annual compo-
nent and dominant interannual variability, consistent with what is in-
dependently known about ENSO (e.g. Sarachik and Cane, 2010). The rel-
ative area covered by each circle illustrates the magnitude of the EOF
coefficients (“loadings”), while the color refers to their sign (after multi-
plying δ18 O values by −1 so that negative excursions in oxygen isotopes
correspond to positive temperature anomalies). Their signs are consis-
tent with the ENSO thermal signal expected in each isotopic record,
though the disparate amplitudes reflect the varying influence of other
factors (climatic or not). A question arising from this work is the ex-
tent to which the EOF modes are affected by dating uncertainties in
these layer-counted records. Comboul et al. (2014) investigated this ques-
tion with a perfectly known “pseudocoral” network. Age perturbations
were introduced according to a Poisson probability model, with a 5%

169
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

chance of miscounting each year (i.e. every 100 years, one expects ±5y
offsets due to age errors). They performed a “Monte Carlo EOF analy-
sis” (Anchukaitis and Tierney, 2013) on this time-uncertain ensemble and
used it to compute the relevant statistics.

a) ENSO mode pseudocoral EOF - 25% variance Figure 12.8: Spatiotemporal uncertainty
24oN
quantification on a pseudocoral network.
16oN (a) EOF loadings (circles) corresponding
8oN

0o
to the ENSO mode of an ensemble of
8oS age-perturbed pseudocoral records with
16oS
miscounting rate θ = 0.05. EOF load-
24oS
ings for error-free data are shown in light
EOF > 0 EOF < 0 colors circled in white, while the me-
median median dian and 95% quantile are shown by dark
95% quantile 95% quantile disks and black-circled disks, respec-
tively. Contours depict the SST field as-
-1 -0.5 0 0.5 1 sociated with the mode’s principal com-
Contours: SST (s C) regressed onto ENSO mode ponent PC (panel b), whose power spec-
b) ENSO mode PC timeseries c) ENSO mode PC spectrum trum is shown in (c). Results for the time-
uncertain ensemble are shown in blue:
Power w Frequency

0.4
median (solid line), 95% confidence in-
PC (unitless)

0.2 0.001
terval (light-filled area) and interquartile
0 0.0001 range [25%-75%] (dark-filled area). Re-
-0.2 1e-05 sults for the original (error-free) dataset
are depicted by solid red lines. Dashed
-0.4
1880 1900 1920 1940 1960 1980 2000 50 20 10 5 2 red lines denote χ2 error estimates for the
Time Period (y) MTM spectrum of the error-free dataset.
From Comboul et al. (2014)

The results are shown in Fig. 12.8, and show that time uncertainty
may greatly alter the spatial expression of interannual variability. They
also result in a transfer of power from high frequencies to low frequen-
cies, which suggests that age uncertainties are a plausible cause of the
observed enhanced decadal variability in coral networks (Ault et al., 2009).

Patterns of ocean-atmosphere covariability


Maximum Covariance Analysis is a close cousin of PCA’s. As the name
implies, it seeks to maximize the covariance between different fields or
variables. Its mathematics are reviewed in Bretherton et al. (1992).
To get a feel for the method, let us apply it to the diagnosis of present
day teleconnections, the study of climate relationships over wide dis-
tances. Let us use the ERSST dataset for sea-surface temperature (Reynolds
and Smith, 1994) and NCEP/NCAR Reanalysis dataset (Kalnay, 1996) for
the upper atmosphere geopotential height (a measure of atmospheric
circulation). The SST data was limited to the tropical domain [30S,30N],
while the geopotential height data includes only the Northern Hemi-
sphere. The season considered is Northern Hemisphere winter (DJF).
The variables were suitably transformed into anomalies and multiplied
by the square-root of the cosine of latitude, in order for variance inte-
grals to be representative of geometric areas (North et al., 1982).

170
Data Analysis in the Earth & Environmental Sciences Chapter 12. Principal component analysis

!"#$%&'(&)**+#!#*,-#!.!/0121#3.#45#62.7891:#;3<3=8.83>1#?3998/!723.#<!@#3A#**+#B3C8#D#E*%FGHIJK#AL3LMGN5J" I Figure 12.9: Maximum Covariance Anal-


##PN3$#
P
ysis of Present Day Teleconnections from
the tropical Oceans. Mode 1 a) SST pat-

B3C8#D#:#**+#ER"
##DP3$# D
tern (left singular vector) ; b) Geopo-
###O3##
tential height field (Z250) (right singular
O

##DP *#3 !D
vector) c) Expansion coefficients (normal-
##PN *#3 !P ized). From Emile-Geay (2006, Chap 5)
!I
###O3## ##5O3&# #DPO3&# #DHO3Q# #DPO3Q# ##5O3Q#
#DHO3Q# U"#B3C8#D#:#VP4O#E<"
AL3LMG#DSJ #D4
3 Q#
O 3&
#D4
O
# ?"#*7!.C!9C2X8C#8Y@!.123.#?38AA2?28.71
DOO I
3
##IO $#

##D4
#

#DP3 $#
O 3Q

3 P !G#OLHS
##N4 $#
O
3 &#
#DP

4O
3
##5O $#

%38AA2?28.7#E>.27/811"
D
3
##T4 $#
##SO3&#
SO3Q

O O

!D

!4O
##5

#
O&
##5 3
O

!P
3 Q#

F2917#/8A7#8Y@!.123.#?38AA2?28.7#E**+"
F2917#92=Z7#8Y@!.123.#?38AA2?28.7#EVP4O"
!DOO !I
##I 3 3 &# DS4O DS5O DSTO DSHO DSSO POOO
OQ O
# ##I W8!9#3A#62.789
###O3##

The first mode is displayed in Fig. 12.9. It accounts for an overwhelm-


ing fraction of the covariance (83%). Panel a) shows its very distinctive
El Niño sea-surface temperature (SST) pattern in accounting for about
half of all SST variability in the domain. The associated heterogeneous
correlation map for Z250 (Panel B) displays the familiar Pacific North
American (PNA) pattern (Wallace and Gutzler, 1981) that has long been
recognized as the main ENSO teleconnection pattern. The pattern ex-
plains only 19% of the total variability in Z250 , meaning that more than
81% is due to other factors (namely, atmospheric dynamics not simply
related to SST). In other words, although the pattern accounts for 83% of
the covariance between the two variables, it only accounts for a limited
amount of the total atmospheric variability.
This type of analysis can cleanly identify patterns of covariability be-
tween datasets. A close cousin of it is Canonical Correlation Analysis
(CCA), which works on maximizing correlation rather than covariance.

Further reading
For a more thorough introduction to PCA and EOF analysis, the reader
is referred to Hannachi et al. (2007) and Wilks (2011, chap 12). For an
introduction to maximum covariance analysis and canonical correlation
analysis, read Bretherton et al. (1992) and Wilks (2011, chap 13).

171
Chapter 13

LEAST SQUARES

Motivation: find relationships among noisy data.


For example, consider a free falling body whose position is given by

1 2
z(t) = gt + v0 t + z0 . (13.1)
2
Measuring (zi , ti ) many times we should give us the values of ( g, v0 , z0 ).
To simplify the problem, assume that initial speed and position are 0:
(v0 , z0 ) = (0, 0). So we would have zi




 z1 = 12 gt12


 z = 1 gt 2
2 2 2
(13.2)
 ...





zn = 12 gtn2
ti
and we expect that with a enough measurements we can accurately es-
zi
timate the gravitational acceleration g.
Very often one casts this type of problem (parameter estimation) as slope = g
fitting a line through a cloud of points. It’s often easier to reparameta-
rize the problem so that lines are straight (in this case, working with the
variable t2 as opposed to t). The goal of the famed least squares method
is to find the straight line that best fits the data.
ti2
I Ordinary Least Squares Figure 13.1: Example of data for a free
falling body

Straight line fit


Assume thedatay are related to some independent variable x via the
β0
model β =   such that yi = β 0 + β 1 xi .
β1
In practice, the measurements have errors, so

yi = β 0 + β 1 xi + ei = ŷi + ei (13.3)
Data Analysis in the Earth & Environmental Sciences Chapter 13. Least squares

and this expression can be rewritten as

y = Xβ + e (13.4)

where β = ( β 0 , β 1 )> and


 
1 x1
 
 
1 x2 
X=
 ..

..  (13.5)
. . 
 
1 xn

is the design matrix1 . It is a n × 2 matrix, so it’s not invertible. How do 1


The design matrix is often called G in in-
we find β? We seek the solution that minimizes the mean squared error: verse theory. The following notations are
equivalent:
n
1 1 y = Xβ + e (stats)
MSE = ∑ e2 = E2 (13.6)
n − 2 i =1 i n−2 d = Gm + e (geophys).

ei =yi − ( β 0 + β 1 xi ) = yi − ŷi (13.7)


2
ei = y i − ∑ Xij β j−1 . (13.8)
j =1

The total error E2 is the sum of the individual errors:


!2
n n 2
E2 = ∑ e i2 = ∑ yi − ∑ Xij β j−1 (13.9)
i =1 i =1 j =1

that must be minimized. For the case of p parameters β k , with k =


(0, 1, . . . , p − 1), we would set
!
n p
∂E2
= 0 ⇔ 2 ∑ yi − ∑ Xij β j−1 (− Xik ) = 0 ⇔
∂β k i =1 j =1
!
n n p n p n
∑ |{z}
Xik yi = ∑ ∑ Xij Xik β j−1 ⇔ ∑ Xik yi = ∑ ∑ Xij Xik β j −1
i =1 i =1 j =1 i =1 j =1 i =1
>
Xki | {z }
X> X
(13.10)

Thus,
X> y = (X> X)β (13.11)
This is called a normal equation. The matrix ( X > X ) is square, real
and symmetric, therefore it is positive semi-definite. That is, provided
n > p, ( X > X ) always has an inverse (it has no zero eigenvalues)2 . There- 2
What if its eigenvalues are close to, but
fore the solution is: not exactly equal to, zero, you ask? We
will get to that in a bit.

βOLS = ( X > X )−1 X > y (13.12)

The case n = p is a special case: X is square and β = X −1 y.

174
Data Analysis in the Earth & Environmental Sciences Chapter 13. Least squares

Estimation of coefficients
In the previous case where the design matrix only involves xi ’s
We have
 
1 x1
    
 
1 1 . . . 1  1 x2  n x
∑i i 
X> X =  
 ..
 
..  = .
x1 x2 . . . x n  . .  ∑ i x i ∑ i x i2
 
1 xn

Using the usual rules (Appendix B), its inverse is:


 
x2 − ∑i xi
( X > X ) −1 =
1  ∑i i 
n ∑ i x i2 − ( ∑ i x i ) 2 − ∑ i x i n
 
1  ∑ i x i2 − ∑ i x i 
=
n2 S X 2 − ∑ i x i n

Where Sx 2 is the sample variance of dataset x. Further,


 
y
   1  
 
1 1 . . . 1  y2  ∑ y i
X> y =    =  i .
.
x1 x2 . . . xn  ..  ∑i xi yi
  yi
yn

The slope β 1 is given by


! y
1
1 ∑i ( xi − x )(yi − y) y = β0 + β1 x
β1 = − ∑ xi ∑ y j − n ∑ xi yi = n
n Sx 2
2
i j i
1
n ∑i ( xi − x )
2

Cov
\ ( x, y) sy x xi
= 2
= ρ̂ xy . (13.13)
sx s x
Figure 13.2: Example of linear fit
So in this simple case, the least squares slope is proportional to the
sample linear correlation coefficient, scaled by the ratio of sample stan-
dard deviations. Since the correlation coefficient is unitless, this factor
ensures that the units of y get correctly mapped to the units of x3 . 3
One useful cross-check that you did the
For the intercept we have math right, thus, is that β 1 should be in
units of y divided by units of x

β 0 = y − β 1 x. (13.14)

Which depends on the slope. If any outliers or noise bias the estimate
of the slope, this will affect the estimate of the intercept too.

175
Data Analysis in the Earth & Environmental Sciences Chapter 13. Least squares

II Geometric interpretation
We just obtained the OLS solution by straightforward algebra after for-
mulating a minimization principle. There is more to it. What we were
really doing is finding an approximate solution to the problem

Ax ≈ b

where in this case A = X, x = β, b = y. More precisely, we seek a


vector b p that is the projection of b into the range of matrix A – the vector
space spanned by the columns of A, vectors a1 , a2 , · · · , a p (Appendix B,
section VI). In other words, we seek the solution to

Ax = b p + e

, where e is a residual that must be (1) as small as possible; (2) orthogonal


to b p . Why the latter condition? If it weren’t orthogonal, then it would
project onto the columns of A, thus contribute to b p . If we put these
words into equations, it means A> e = 0, which implies:

A> (b − Ax ) = 0 (13.15)

Thus:
A> Ax = A> b
If you substitute A, x and b, for their definitions, you get back the
OLS normal equation (Eq. (13.11)). This intuition is purely geometric:
it’s about finding the best approximation to the data vector y in the space
spanned by the design matrix X.

III Statistical interpretation


As in most things, there is also a statistical interpretation.

Maximum Likelihood Estimator


Assume ei ∼ N (0, σi 2 ) are IID errors and further assume that ∀i, σi 2 =
σ2 . Then the most likely model β is the one that maximizes the likeli-
hood
n  2  2
yi − β 0 − β 1 xi yi − β 0 − β 1 xi
− 21 − 21 ∑in=1
L ( x1 , x2 , . . . , x n ) = ∏e σ
=e σ
(13.16)
4
i =1 Note that only the errors ei = ŷi − yi
need be normal; xi and yi can have what-
so we define ever distribution they please.

1 n 2
ei = χ2
2 i∑
− log L =
=1

which is called misfit function. Minimizing the errors (Eq. (13.9)) is equiv-
alent, in the Gaussian4 context, to maximizing the likelihood. Given the

176
Data Analysis in the Earth & Environmental Sciences Chapter 13. Least squares

data, these are the most likely parameters to fit a straight line through
the cloud of points. Further more, the OLS estimate is now imbued with
y
all the privileges that come with ML status: it’s consistent, and it’s got
the lowest MSE.

Uncertainties in the parameters


β̂ 0 = y − βˆ1 x follows the distribution N (µ0 , σ02 ) with poorly resolved β 1

µ0 = β 0 x
s s
∑in=1 xi2 1 n
σ0 = se
n ∑in=1 ( xi − x )2
with se = ∑ e 2 (MSE).
n − 2 i =1 i
y

β̂ 1 follows the distribution N (µ1 , σ12 ) with

µ1 = β 1
se
σ1 = q . well resolved β 1
∑in=1 ( xi − x )2
x
Estimates are unbiased and precision depends on the MSE, as well as
the variance of x. The more variable x, the better. So, the experiment Figure 13.3: Example of linear fit for dif-
should be designed in order to cover a broad dynamic range (Fig. 13.3). ferent spread in x

Relationship
−x
r β0 ,β1 = q (13.17)
1
n ∑in=1 xi2

The two estimates are anti-correlated for x > 0, uncorrelated for x = 0,


and correlated for x < 0.

Trivial case
Imagine that we get no measurements xi , so X = 1n :
   
1 y
   1
   
1  y2 
Xβ = y ⇔    
 ..  β =  ..  . (13.18)
. .
   
1 yn

The least squares estimate should yield the same parameter n times.
Indeed, X > X = n and ( X > X )−1 = 1/n so
1
( X > X ) −1 X > y = ∑ yi = y (13.19)
n i

The model’s covariance matrix writes:


1 σy
Cβ = σy2 ( X > X )−1 = σy2 ⇒ σβ = √ (13.20)
n n

177
Data Analysis in the Earth & Environmental Sciences Chapter 13. Least squares

which is essentially a paraphrase of the Central Limit Theorem. The


least squares estimate is both intuitive and consistent with probability
theory, as it should be.

IV Exotic Least Squares


Weighted Least Squares
What if the errors are not IID? Specifically, what if some measurements
are more precise than others, so σ1 6= σ2 · · · 6= σn . In general, for any
matrix H, one may write:

x′ = Hx ⇒ C_{x′} = H C_x H^⊤,

with C_x = VΛV^⊤. A special case is when

x′ = Λ^{-1/2} V^⊤ x

then

C_{x′} = I_p

that is, the data have unit variance.


This is a good idea if we have different units or instrument precisions.
So

E2 = e> VΛ−1/2 Λ−1/2 V > e = e> Ce−1 e

where e = y − Xβ. Thus, the error term is now a true Mahalanobis


distance, a quadratic form using measurement precision as a weight.
Expanding this term:

E2 =(y − Xβ)> Ce−1 (y − Xβ) = y> Ce−1 y − 2y> Ce−1 Xβ+


β> X > Ce−1 Xβ

and imposing ∂E2 /∂β = 0 we get

( X > Ce−1 X ) β = X > Ce−1 y. (13.21)

If

C_e^{-1} = diag( 1/σ_1², 1/σ_2², · · · , 1/σ_n² )

each datum is independent of the others and gets weighted by wi =


1/σi 2 . This will make high-precision measurements count more than
poor precision ones, confirming the experimental knowledge that one
good measurement can be worth 10 or 100 bad ones.
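As a rough illustration of Eq. (13.21), here is a minimal Python (numpy) sketch of weighted least squares with two groups of instruments of very different precision; all numbers are synthetic and chosen purely for illustration:

import numpy as np

# hypothetical data with heterogeneous measurement errors sigma_i
x = np.linspace(0, 10, 20)
sigma = np.where(x < 5, 0.1, 1.0)        # first half is 10x more precise
rng = np.random.default_rng(0)
y = 2.0 + 0.5 * x + sigma * rng.standard_normal(x.size)

X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)              # C_e^{-1} for independent errors

# weighted normal equations: (X^T C_e^{-1} X) beta = X^T C_e^{-1} y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_wls, beta_ols)    # the WLS estimate leans on the precise points

In this toy example the weighted estimate typically stays closer to the true coefficients (2.0, 0.5) than the unweighted one, because it trusts the low-σ measurements more.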


General Least Squares


So far we have considered the simple model:

yi = β 0 + β 1 xi

which is linear in β and x. This is not always the case. In general we have

y = X β̃

where X can be given by, for example, a cubic polynomial

X_{i·} = ( 1  x_i  x_i²  x_i³ ).

We can also have β̃_i = f(β_i), e.g. log(β_i) or β_i^α (non-linear functions). It is possible to fit very general classes of models this way. Examples are:

• Spherical harmonic coefficients,

• Fourier series,

• more complex functions (non-linear least squares).
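As a concrete illustration of the cubic design matrix above, here is a minimal Python (numpy) sketch; the data are synthetic and the coefficients arbitrary:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 0.3 - 1.2 * x + 0.8 * x**3 + 0.05 * rng.standard_normal(x.size)

# each row of X is (1, x_i, x_i^2, x_i^3)
X = np.vander(x, N=4, increasing=True)

# least squares fit of the four polynomial coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)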

In summary, least squares are a powerful tool for curve-fitting, and


we’ve only just scratched the surface. In the next Chapter we shall see
how to deal with ill-posed problems where even least squares need to
be amended (regularized) to yield reasonable estimates of geophysical
model parameters.

Chapter 14

DISCRETE INVERSE THEORY

Estimating parameters from observations is almost always an inverse


problem.

I Classes of inverse problems


Consider the seismic tomography problem
Z
δti = δs( x)dl (14.1)
path

where s represents the “slowness” s = 1/v = dt/dl.

Figure 14.1: Schematic representation of the seismic tomography problem: from seismic measurements of travel times between different disturbances, we try to invert for fundamental properties of Earth materials (the elastic moduli), which are related to the propagation speed of elastic waves. Some boxes are illuminated by several ray paths, others by one, and still others are in the dark. (The three panels depict the overdetermined, equidetermined, and underdetermined cases.)

Continuous problems must always be solved numerically if no ana-


lytical solution exists (as is the case here):

δt_i ≈ ∑_{j=1}^{p} P_ij δs_j ⇔ Gm = d, a matrix inversion problem (with d_i = δt_i, G_ij = P_ij, and m_j = δs_j).

With n observations and p model parameters, there are 3 possibilities:

1. n = p: equidetermined problem, unique solution ⇔ det( G ) 6= 0,

2. n > p: overdetermined problem, G > G is ( p × p);

3. n < p: underdetermined problem, GG > is (n × n).



Overdetermined problems typically have no solution, and underdeter-


mined problems potentially an infinite number of solutions. Something
(regularization) has to be done in order to obtain a unique solution. The
seismic tomography problem is a mixed-determined problem, some boxes
are “illuminated” by many rays, some by none and the corresponding
entries in G are zero. This means that we can find some non-null vectors
m0 such that

Gm0 = 0

so the null space of G can be large. If ms is a solution, i.e.

Gms = d

then m(s) + cm(0) (c a real constant) is also a solution

G (m(s) + cm(0) ) = d (non-uniqueness).

So which solution should we pick?

II Reduced rank (TSVD) solution


SVD solution
Remember the singular value decomposition (SVD) for a “fat” matrix. Given an n × p matrix G, we can express it as

G = UΣV^⊤

where U is an n × n matrix with U^⊤U = I_n, V is a p × p matrix with V^⊤V = I_p, and Σ is an n × p matrix with the following structure:

Σ = ( diag(σ_1, σ_2, . . . , σ_n) | 0 ),

i.e. the singular values on the diagonal, followed by p − n columns of zeros.

We want to solve the following equation

m = G −1 d.

It can be shown that the Moore-Penrose pseudoinverse of a matrix al-


ways exists:

G † = VΣ† U >


where Σ† is the p × n matrix with 1/σ_1, 1/σ_2, . . . , 1/σ_n on its diagonal and zeros everywhere else.

The solution of rank k assigns (Σ† ) jj = 0 for j > k → truncation of


singular values. For an overdetermined problem, the truncated SVD
(TSVD) solution is the OLS solution. For an underdetermined problem,
it is the minimum norm solution. One may write

m_TSVD^(k) = ∑_{i=1}^{k} ( (u^(i) · d) / σ_i ) v^(i).

The solution is a linear combination of the right singular vectors, weighted


by a data dot-product and 1/σi . The v(i) are orthonormal and span the
axes of the error ellipsoid for m:

Cov(m_i, m_k) = ∑_{j=1}^{p} V_ij V_kj / σ_j²    (14.2)

(cf principal components and EOFs).

Now, if G is singular then σ_i ≈ 0 for i bigger than some index L, from which we get an infinite weight. In this case 1/σ_i should be set to zero in Σ†, implying that one adds zero weight to the corresponding eigenvectors.

If σ_i ≈ 0, the solutions will be dominated by noise. The singular value spectrum must be inspected for a break. Truncating at k will increase χ² = ‖Gm − d‖² only slightly, while greatly decreasing the model complexity.

Figure 14.2: Searching for a break in the singular value spectrum (σ_i plotted against i; degenerate singular values lie beyond the break). In this case, k is an obvious truncation.
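A minimal Python (numpy) sketch of a truncated SVD inversion follows; the matrix G and data d are random stand-ins, not a real tomography problem, and the truncation level is chosen by eye:

import numpy as np

def tsvd_solve(G, d, k):
    """Rank-k truncated SVD solution of G m = d."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # keep only the k largest singular values; the rest get zero weight
    coeffs = (U[:, :k].T @ d) / s[:k]
    return Vt[:k, :].T @ coeffs

# example: a nearly singular 5 x 4 system
rng = np.random.default_rng(2)
G = rng.standard_normal((5, 4))
G[:, 3] = G[:, 2] + 1e-8 * rng.standard_normal(5)   # two nearly collinear columns
d = rng.standard_normal(5)

print(np.linalg.svd(G, compute_uv=False))   # inspect the spectrum for a break
print(tsvd_solve(G, d, k=3))                # truncate the tiny singular value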

Model stability

‖m_true − m^(k)‖ / ‖m_true‖ ≤ κ(G) ‖d − d^(k)‖ / ‖d‖    (14.3)

where d^(k) = Gm^(k), κ(G) = σ_max/σ_min is the condition number of matrix G, and k = min(n, p). The larger κ(G), the larger the distance from the true model.

(Footnote 1: The condition number measures the feasibility of inverting G: a value close to 1 implies a well-conditioned inversion problem, while a large value implies a numerically unstable inverse. Singular matrices are characterized by σ_min = 0, therefore κ = ∞.)


Model Resolution
Consider Gm = d. In order to evaluate how well G resolves m, one could
input a synthetic (i.e. made-up) model m0 , yielding the synthetic obser-
vations d_0 = Gm_0. Solve the inverse problem

m̃ = G†d_0 = (G^⊤G)^{-1} G^⊤ G m_0

where R = (G^⊤G)^{-1} G^⊤ G is called the resolution matrix. For a perfect resolution R = I; in reality, R = G†G has significant entries only in a band around its diagonal.
That is, we have a “blurry diagonal” matrix. This tells which models
are well resolved by the observations, and which are not. One can also
test the inversion with a spike function (the numerical equivalent of a Dirac delta function), which is given by

m_spike = ( 0 . . . 0 1 0 . . . 0 )

for the model i. This method is known as the Backus-Gilbert method


(Backus and Gilbert, 1968).

Data resolution
How well does a model fit the data?
 
m̂ = G†d ⇒ d̂ = Gm̂ = G(G†d) = (GG†) d ≡ Nd.

N is called the data resolution matrix and we have N = U_k U_k^⊤ from TSVD. For a perfect model N = I_n, but it is typically “blurry”, like R.

Note Both R and N are independent of the data d and only depend on G. This
can be taken into account for the experimental design and modeling assump-
tions, so as to meet scientific goals.


III Tikhonov regularization


Motivation
A disadvantage of TSVD is that singular vectors are either on or off: either a singular vector v_i participates in the solution or it doesn't. It is a form of regularization. One may desire a smoother transition between “valid” and “invalid” eigenvectors v_i. The overarching goal is still the same: to find the least complex model that fits the noisy data without overfitting.

To do so, we now seek m such that ‖d − Gm‖ ≤ δ, and ‖m‖ isn't too large. Equivalently, we seek m such that the cost function

J_α(m) = ‖d − Gm‖² + α²‖m‖²

with α ≥ 0, is minimized. α is a Lagrange multiplier or regularization parameter. It represents the price to pay for having a more complex model: α discourages a large ‖m‖, because in the limit of α → ∞, ‖m‖ → 0.

Figure 14.3: Typical regularization trade-off (misfit plotted against ‖m‖, with large α at one end, small α at the other, and a knee marking the compromise): a small regularization parameter means a good fit to the data but potentially a very wobbly model (large norm), while a large regularization parameter will encourage a smooth model at the expense of some model misfit. Once again, there is no free lunch in a world of finite information.

Formulation

This approach may be recast as

G̃m = [ G ; αI ] m = [ d ; 0 ] = d̃

with I the p × p identity matrix (so that G̃ stacks G on top of αI, and d̃ is the data vector augmented with zeros). This can be rewritten

G̃^⊤G̃ m = G̃^⊤d̃ ⇒ (G^⊤G + α²I) m = G^⊤d

The Tikhonov solution is given by

m_tikh^(α) = (G^⊤G + α²I)^{-1} G^⊤ d.

The TSVD solution is given by

m_TSVD^(k) = ∑_{i=1}^{p} f_i^(k) ( (u^(i) · d) / σ_i ) v^(i)

where the f_i's are called binary filter factors:

f_i^(k) = 1 if i ≤ k, and 0 if i > k.

The Tikhonov solution may be expressed similarly using the SVD formulation

m_tikh^(α) = ∑_{i=1}^{p} f_i^(α) ( (u^(i) · d) / σ_i ) v^(i)


but now the filter factors are given by

f_i^(α) = σ_i² / (σ_i² + α²),   so that f_i → 1 for α ≪ σ_i, and f_i → 0 for α ≫ σ_i.

Figure 14.4: TSVD and Tikhonov filter factors f_i as a function of i.

This solution keeps the large singular values almost intact while it damps the small ones to zero. Tikhonov solutions push all eigenvalues of G^⊤G away from zero by α², which removes the matrix singularity. This is at the cost of damping some of the modes with σ_i ≫ 0, so it represents a trade-off between misfit and complexity. The solution is always
smoother than the true model (a common feature of “`2 methods” –
those that seek to minimize the `2 norm). `1 methods are generally free
of such problems, but those are mathematically more complex: the op-
timization problem can no longer be solved analytically, even in simple
cases. Nowadays, there exist well-established convex optimization al-
gorithms (e.g. simplex) that can efficiently solve them, but they are still
less prevalent than `2 methods. This could be because people actually
like smooth solutions, though Nature is rarely that smooth.
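The filter-factor form above lends itself to a compact implementation. The following Python (numpy) sketch is illustrative only: the system is synthetic and the choice of α values is arbitrary:

import numpy as np

def tikhonov_solve(G, d, alpha):
    """Tikhonov-regularized solution expressed via the SVD filter factors."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    f = s**2 / (s**2 + alpha**2)          # filter factors
    coeffs = f * (U.T @ d) / s
    return Vt.T @ coeffs

# made-up ill-conditioned system
rng = np.random.default_rng(3)
G = rng.standard_normal((20, 10))
G[:, -1] = G[:, 0] + 1e-6 * rng.standard_normal(20)
m_true = rng.standard_normal(10)
d = G @ m_true + 0.01 * rng.standard_normal(20)

# trace out the misfit / model-norm trade-off for a few alphas
for alpha in (1e-4, 1e-2, 1e0):
    m = tikhonov_solve(G, d, alpha)
    print(alpha, np.linalg.norm(d - G @ m), np.linalg.norm(m))

Printing the misfit and model norm for several α values is a crude way of tracing out the L-curve of Fig. 14.3.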

Choice of regularization parameter


Given this tradeoff, how should one go about picking the regularization
parameter (α or k)? There are many methods:
• Picking the knee of the L-curve,

• picking the largest slope of the L-curve,

• visual inspection of smoothness,

• minimize the generalized cross validation function (GCV).

(Footnote 3: The GCV consists in finding the value of α that makes the best prediction of each datapoint d_i, over inversions where that datapoint is withheld (Wahba, 1990).)

IV Recipe for underdetermined problems

1. Establish the theoretical null space (often hard or impossible in prac-


tice)

2. SVD(G) → inspect singular values

3. Regularize by using TSVD, Tikhonov or something else (guided by


physics of the problem, not by observations)

4. Compute resolution matrices

Model resolution: R = G†G (hat matrix)

Data resolution: N = GG†

Tikhonov: G_α† = (G^⊤G + α²I)^{-1} G^⊤.

5. figure out if this is what you want, else go back to step 3.

Chapter 15

LINEAR REGRESSION

In the previous chapter we saw how to fit curves through data using
various techniques, all involving an exploration of the null space of a
matrix. Linear regression is closely related, but the emphasis is slightly
different. For instance, the goal may not solely be to find a best fit, but
to make predictions and quantify uncertainties about these predictions
(statistical forecasting). Thus, while the mathematics are very similar
(least squares will pop up again), the spirit is much more akin to Part I,
rooted in probability theory.

I Regression basics
Least Squares Solution
Regression seeks to estimate a variable Y from a predictor X, given a
deterministic link function f . In general, one writes

Y = f (X) + e (15.1)

where e is some random quantity (errors). f ( X ) may be understood as


the conditional expectation of Y given X, so we have

Y = E (Y | X ) + e (15.2)

Linear models make the further assumption that

p
E (Y | X ) = β 0 + ∑ Xj β j (15.3)
j =1

hence E(Y | X ) is expressed as a linear combination of the predictor vari-


ables. It is most common to treat e as a random sample from a normal
random variable with zero mean:

e ∼ N (0, σ2 ) (15.4)

II Simple linear regression


The 1D case (p = 1) is just the ordinary least squares we saw in Eq. (13.4) and Eq. (13.12). The idea is to minimize the residual sum of squares

RSS(β) = ∑_{i=1}^{n} (y_i − f(x_i))² = ∑_i e_i² = e^⊤e.    (15.5)

Yielding the familiar solution

β̂ = (X^⊤X)^{-1} X^⊤ y    (15.6)

which gives the slope and the intercept of the line fitted through a cloud of points whose scatter is described by σ (Fig. 15.1):

β_1 = ∑_i (x_i − x̄)(y_i − ȳ) / ∑_i (x_i − x̄)²,   β_0 = ȳ − β_1 x̄.    (15.7)

Figure 15.1: Example of simple linear regression: ŷ = β_0 + β_1 x + e, with slope β_1 = Δy/Δx.

Analysis of Variance
The error associated with the linear estimate ŷ_i = β_0 + β_1 x_i is e_i = y_i − ŷ(x_i), also known as the “residual”. So the total error is given by the mean squared error (MSE):

S_e² = (1/(n − 2)) ∑_i e_i²

in which there are only n − 2 degrees of freedom because two parameters (β_0 and β_1) had to be estimated. We can rewrite this term as follows:

S_e² = (1/(n − 2)) ∑_i [ y_i − ŷ(x_i) ]².

and we can insert a (ȳ − ȳ) in every term of the sum

S_e² = (1/(n − 2)) ∑_i [ (y_i − ȳ) + (ȳ − ŷ(x_i)) ]²

and define the sum of squares total

SST = ∑_{i=1}^{n} (y_i − ȳ)² = (n − 1) S_y²

with S_y the sample standard deviation in y, and the regression sum of squares

SSR = ∑_{i=1}^{n} (ŷ(x_i) − ȳ)² = β_1² ∑_i (x_i − x̄)² = (n − 1) β_1² S_x².

with S_x the sample standard deviation in x. One can show that:

S_e² = (1/(n − 2)) {SST − SSR} = (1/(n − 2)) {SSE}


where we have introduced SSE = ∑i ei2 . We then have

SST = SSR + SSE (15.8)


SST = (n − 1) × total variance in y
SSR = (n − 1) β_1² × total variance in x
SSE = sum of squared regression residuals.

This is the most primitive form of an analysis of variance (ANOVA)


and can help ascertain to what extent the linear fit accounts for the ob-
served variations in Y.

Prediction intervals
Very often we fit models through data so we can make predictions from
them. It is common to call x the predictor and y the predictand (“that which we want to predict”), and the
idea is to use the relation at a point xnew that has not been used in fitting
the regression model y = β 0 1n + β 1 x, and use it to predict a new value
of the predictand, ynew . What error would we make?
To see this, note again that linear regression expresses the condi-
tional distribution of y given x, as:

y | xi ∼ N ( β 0 + β 1 xi , Se2 ) (15.9)

So what we want is P(y | xnew ). This is illustrated in Fig. 15.2.

Figure 15.2: Conditional and marginal distributions for normal random variables Y and X, showing the conditional distributions of y | x_i at points x_1, x_2, x_3, the marginal distribution of y, and the forecast of y given x. Note that for linear regression to work, only e need be a normal RV, but this example nicely illustrates what is going on. At every point x, the regression model allows to predict y as a Gaussian with mean β_0 + β_1 x. Note how the marginal distribution of y is much wider than any of the conditional distributions, illustrating the fact that x reduces the uncertainty about y by virtue of their approximately linear relation.

Because of the normality of the errors about the regression line, it


turns out that ŷ ± 2Se would be a good 95% prediction interval for y.
Now, if x has not yet been observed, it makes sense that we would have
more uncertainty about y | xnew . So a 95% prediction interval for ynew
would be β_0 + β_1 x_new ± 2 S_{y|x_new}, with

S²_{y|x_new} = S_e² [ 1 + 1/n + (x_new − x̄)² / ∑_i (x_i − x̄)² ]    (15.10)


The expression Eq. (15.10) is proportional to the MSE but is inflated


by two factors:

• the second term in the bracket comes from estimating µ as x̄ over a


finite sample; it goes to zero for n → ∞,

• the third term comes from the uncertainty in the estimate of the slope
and it grows as we move away from ( x̄, ȳ), the centroid of the dataset.
Designing an experiment with a large range in x will mitigate that.

Conditional prediction interval of the mean


Now if we only try to estimate the mean of the dataset given new observations, its variance:

S²_{ȳ|x_new} = S_e² [ 1/n + (x_new − x̄)² / ∑_i (x_i − x̄)² ]

is smaller than (15.10) by an amount S_e².

Note that in the case of autocorrelated data, once again the IID assumption would be violated, and we would need to inflate the variance, hence the expected errors, by a factor (1 + φ)/(1 − φ), with φ the lag-1 autocorrelation.

Figure 15.3: Prediction intervals (2S_{y|x_new} for a new observation, 2S_ȳ for the conditional mean). Note the parabolic shape of the prediction intervals, which widen away from the centroid of the dataset (x̄, ȳ).
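The following Python (numpy) sketch evaluates the prediction interval of Eq. (15.10) for a synthetic calibration dataset; the numbers are made up, and the factor of 2 is the rough 95% rule used above:

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.8 * x + 0.5 * rng.standard_normal(x.size)

n = x.size
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
Se2 = np.sum(e**2) / (n - 2)                       # MSE

def pred_interval(xnew):
    """Approximate 95% prediction interval for a new observation, Eq. (15.10)."""
    Sy2 = Se2 * (1 + 1/n + (xnew - x.mean())**2 / np.sum((x - x.mean())**2))
    yhat = b0 + b1 * xnew
    return yhat - 2*np.sqrt(Sy2), yhat + 2*np.sqrt(Sy2)

print(pred_interval(5.0))    # near the centroid: narrow
print(pred_interval(15.0))   # extrapolation: wider, as in Fig. 15.3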
III Model checking
Thanks to the wonders of modern programming, fitting a regression
model is now so unbelievably trivial (Matlab: fitlm(); Python: scikit-learn or statsmodels; R: lm()) that a monkey could do it. However, the result is of little value unless one can confirm that a few basic

assumptions are met. The most important one is that residuals ei need to
be IID normal, since this was the basis for applying least squares (max-
imum likelihood estimation under a normal model) in the first place.
The onus is on you to convince your readers that your statistical model
actually fits the data. If not, there are plenty of more complex modeling
options to go by.

Regression residuals
Inspecting residuals is absolutely critical. Important things to check for:

Normality Residuals should be normally distributed, which can be graph-


ically checked with a simple histogram, a Q-Q plot, or more formally
via a Lilliefors or Jarque-Bera test. They may only be approximately
normal, which is in general no cause for alarm.

Homogeneity Residuals should not exhibit structure (e.g. Fig. 15.4).


Figure 15.4: Heteroskedastic residuals.


The left panel shows an attempt at fitting
data whose variance increase over time, a
property called heteroskedasticity. When
such is the case, the errors inherit the
same property (right panel), which vio-
lates the regression assumptions.

Independence Residuals should be non-persistent.


If x is serially correlated (i.e. autocorrelated), then regressing on x_i and not x_{i−1} might carry some memory, so the e_i's will exhibit autocorrelation: consecutive values will no longer be independent. A useful diagnostic is the Durbin-Watson statistic d (e.g. statsmodels.stats.stattools.durbin_watson()):

d = ∑_{i=2}^{n} (e_i − e_{i−1})² / ∑_{i=1}^{n} e_i².

d can vary in the range 0 ≤ d ≤ 4, and is 2 for uncorrelated resid-


uals. A value substantially less than 2 indicates that neighboring
residuals tend to be very similar, which is usually cause for alarm,
though it must be established quantitatively via a (tabulated) distri-
bution for the statistic. A value substantially larger than 2 indicates
anti-correlated residuals, also a cause for alarm.
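A minimal Python (numpy) sketch of the Durbin-Watson diagnostic follows, applied to synthetic white-noise and AR(1) residuals (the statsmodels function cited above computes the same quantity):

import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic of a residual series (d near 2 means no lag-1 correlation)."""
    return np.sum(np.diff(e)**2) / np.sum(e**2)

rng = np.random.default_rng(5)
e_white = rng.standard_normal(200)        # independent residuals
e_ar1 = np.zeros(200)                     # persistent (AR(1)) residuals
for i in range(1, 200):
    e_ar1[i] = 0.8 * e_ar1[i-1] + rng.standard_normal()

print(durbin_watson(e_white))   # close to 2
print(durbin_watson(e_ar1))     # well below 2: cause for alarm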

ANOVA table
Another important aspect of model checking is to parse the various
components of variance, via an ANOVA table. There are three main
measures of the quality of the fit:

1
Mean Squared Error MSE = n −2 ∑i ei2 , it should be as small as possible.

Coefficient of determination R²

R² = SSR/SST = 1 − SSE/SST

A perfect regression has no scatter about the regression line ŷ so, in this case, we would have MSE = 0, R² = 1. R² = 0 indicates a useless regression. More generally, R² represents the fraction of variance in y accounted for by the variance in x.

Figure 15.5: Regression fit and variance “explained”. On the left, the slope of the regression is very strong (a confidence interval for β_1 would exclude zero with high confidence), so SSR dominates SST. On the right, the slope is near zero and SSE dominates the total variation SST.

F ratio F = MSR/MSE follows an F1,n−2 distribution for Gaussian resid-


uals (rarely useful in practice).

They can be summarized in the form of an ANOVA table (Table III):


Source df SS MS F
Total n−1 SST
Regression 1 SSR MSR = SSR/1 MSR/MSE
Residual n−2 SSE MSE = Se2

IV Multiple Linear Regression


Now Y depends on multiple inputs (X_1, X_2, . . . , X_p)

Y = β_0 + ∑_{j=1}^{p} X_j β_j + e    (15.11)

so there are (p + 1) regression coefficients to estimate.

Least squares solution

It is customary to gather all variables into the design matrix X, of dimensions n × (p + 1), whose i-th row is (1, x_{1i}, x_{2i}, . . . , x_{pi}).

And the solution is the familiar ordinary least squares (OLS) solution:

β̂_OLS = (X^⊤X)^{-1} X^⊤ y.

which consists of fitting a hyperplane through the data (Fig. 15.6). It can be shown that

β̂ ∼ N_{p+1}( β, (X^⊤X)^{-1} σ² )    (15.12)

which tells us everything we want to know about the solution. The noise variance σ² must usually be estimated from the data, often via the residual sum of squares: σ̂² = RSS/(n − p − 1) (no bias), which has distribution

(n − p − 1) σ̂² ∼ σ² χ²_{n−p−1}    (15.13)

Figure 15.6: Least squares solution for p = 2, minimizing the sum of squared residuals from Y [from Hastie et al. (2008, Chap. 3)].

How significant is the role of each predictor X_j? |β_j| ≫ 0 indicates a large influence of x_j on y (this comparison is only fair when all the input variables have similar magnitudes). More formally, one tests the hypothesis that a particular coefficient β_j = 0, by forming the standardized coefficient or Z-score

z_j = β̂_j / ( σ̂ √v_j )    (15.14)


where v j is the jth diagonal element of ( X > X )−1 . Under the null hy-
pothesis that β j = 0, z j is distributed as t N − p−1 (a t distribution with
N − p − 1 degrees of freedom), and hence a large (absolute) value of
z j will lead to rejection of this null hypothesis. If σ̂ were replaced by a
known value σ, then z j would have a standard normal distribution. The
difference between the tail quantiles of a t-distribution and a standard
normal become negligible as the sample size increases, and so normal
quantiles are often quite a good approximation (Hastie et al., 2008).
Another way to see this is to report an approximate 95% CI for β j ,
[ β̂ j ± 2 · s.e.( β̂ j )]. If this interval excludes zero, then the effect is signif-
icant, provided the predictors X1 , · · · X p are independent. When they
are not, one must be far more careful, and control for how one variable
may speak through another (see "Regularized regression”).
The Multivariate ANOVA is

Source df SS MS F
Total n−1 SST
Regression k SSR MSR = SSR/k MSR/MSE
Residual n−k−1 SSE MSE = SSE/(n − k − 1)

An example: fitting the Keeling curve

Let us try to fit the famous Keeling Curve, sketched out in Fig. 15.7, using several predictors. First note that the curve displays some regular oscillations superimposed on a roughly exponential trend. At first, let's see how well we would do with a simple linear fit, and then add complexity. Our primary variable is time t, expressed in months since Jan 1, 1958.

Figure 15.7: Keeling Curve: CO2 measurements (in ppm) at Mauna Loa observatory, Hawaii, from March 1958 to May 2010.

Linear fit [CO2] = β_0 + β_1 t; the least squares solution yields β_0 = 308.6, β_1 = 0.12, and R² = 0.977. Despite this glorious statistic, this is obviously a poor fit; in particular, we are missing the curvature, which is a first order feature of the dataset. Prediction intervals are of order 6.6 ppm.

Quadratic fit [CO2 ] = β 0 + β 1 t + β 2 t2 . Adding a quadratic term cap-


tures the exponential curvature to a very large extent, so additional terms (t³, t⁴, etc.) are not warranted. The prediction interval width is
on the order of 4.4ppm, which is better than the linear fit. However,
an inspection of the residuals (not shown) reveals a non-random scat-
ter. The Durbin-Watson statistic (' 0.135) is also very low, suggesting
highly autocorrelated residuals. Clearly, this model ignores the well-
known seasonal cycle in carbon capture and release by the biosphere,
so we need to add sinusoidal terms.


Quadratic + Harmonic fit

[CO2] = β_0 + β_1 t + β_2 t² + β_3 cos(2πt/12) + β_4 sin(2πt/12)

This last fit has an R² of 99.83%, and prediction intervals (±2S_e) are only 1.8 ppm wide, so it is now a much better fit, nearly 73% more precise than the linear fit.

One lesson here is that we had to include not only t but also non-
linear functions of t; these we call “derived predictor variables”, since
they are transformed versions of the original predictor. The result is
non-linear in t, but linear in β so we still have a linear regression. Also,
note that predictors themselves need not be Gaussian as long as the
residuals are Gaussian.
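To show how such derived predictors are assembled in practice, here is a Python (numpy) sketch of the quadratic-plus-harmonic design matrix. The CO2 series below is synthetic (a rough imitation of the Keeling curve), so the fitted coefficients will not reproduce the numbers quoted above:

import numpy as np

# t in months since Jan 1958; co2 stands in for the real Mauna Loa record
t = np.arange(0, 628, dtype=float)
rng = np.random.default_rng(6)
co2 = 315 + 0.07*t + 5e-5*t**2 + 3*np.sin(2*np.pi*t/12) + 0.3*rng.standard_normal(t.size)

# design matrix: intercept, t, t^2, and annual harmonics (derived predictors)
X = np.column_stack([np.ones_like(t), t, t**2,
                     np.cos(2*np.pi*t/12), np.sin(2*np.pi*t/12)])

beta, *_ = np.linalg.lstsq(X, co2, rcond=None)
resid = co2 - X @ beta
R2 = 1 - np.sum(resid**2) / np.sum((co2 - co2.mean())**2)
print(beta, R2)

Note that the model is non-linear in t but linear in β, so it is still solved by ordinary least squares.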

Overfitting
One peculiar feature of multiple linear regression models is that one can
always add variables to a regression model, e.g.
   
[CO2] = β_0 + β_1 t + β_2 t² + β_3 cos(2πt/12) + β_4 sin(2πt/12) + β_5 × (IBM stock) + β_6 × (Monkey population in Bhutan) + β_7 × (SAT scores at USC).
Figure 15.8: In-sample (blue) vs out-of-sample (red) prediction error as a function of model complexity (degrees of freedom); the low-complexity end corresponds to high bias/low variance, the high-complexity end to low bias/high variance. Each curve is the result of a given random sample from the data, whose average is shown by the thick line. This figure makes plain that a more complex model will always decrease in-sample prediction error, while out-of-sample prediction error will bottom out for some intermediate value, and then increase away from this minimum. This is the essence of overfitting: a more complex model will do better at predicting the training data, but may be completely off outside of it. Since we most often use regression models for prediction, it is absolutely essential to be aware of this fact. [figure from Hastie et al. (2008, Chap. 7)]

Why would we do that? It turns out that such “nonsense predictors” might lower the MSE on the training sample (i.e., the n observations used to fit the regression), so if we fish for potential predictors in this myopic way, we would be tempted to use them


for prediction. However, these predictors may completely ruin out-of-
sample predictions; this is called overfitting (Fig. 15.8). To guard against
this phenomenon, one can do several things:

• include only variables relevant to your problem,

• screen them for significant relations with y

• use some "IC", like AIC or BIC, to select the appropriate predictors.

• cross-validation: reserve part of the training set as validation set.

The best model is the one minimizing out-of-sample prediction error, to guard against overfitting (note that all these comments apply a fortiori when Y is multivariate, so several models are being fitted together). AIC or BIC attempt to do this, but depending on the situation may either overfit or underfit. A more reli-
able method to estimate out-of-sample prediction error is k-fold cross
validation (KCV). It consists of chopping the training sample in k parts
(“folds”), training on k − 1 folds and using the remaining fold to com-
pute prediction error. The expected prediction error is then estimated
as the average prediction error on each fold. The choice of k is a bal-
ance between information amount and computational feasibility. KCV
can be expensive but is usually worth it, since the EPE would take a U-
shape very much like the red curve on Fig. 15.8 (though usually not as
smooth).
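A minimal Python (numpy) sketch of k-fold cross-validation for an OLS model is given below; the data, the fold count, and the two candidate polynomial degrees are all illustrative choices:

import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average out-of-sample MSE of an OLS model, estimated by k-fold CV."""
    n = y.size
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    mse = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        mse.append(np.mean((y[f] - X[f] @ beta)**2))
    return np.mean(mse)

# compare two candidate models on made-up data: degree 2 vs degree 9 polynomials
rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 60)
y = 1 + x - 2*x**2 + 0.2*rng.standard_normal(x.size)
for deg in (2, 9):
    X = np.vander(x, N=deg+1, increasing=True)
    print(deg, kfold_mse(X, y))

In this toy case the over-parameterized degree-9 model tends to show a larger cross-validated MSE than the degree-2 model, even though its in-sample fit is better.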

Regularized regression
Often predictor variables are colinear (e.g. t and t2 ); one variable says
something about another. In that case X > X will be singular: some of its
eigenvalues will be zero, or numerically very close to zero. As a result,
the central quantity ( X > X )−1 may be unbounded. To overcome this
problem, the covariance matrix needs to be regularized, which amounts
to filtering out small eigenvalues. There are several ways to do it:

PC regression The idea here is to “orthogonalize” predictors by using


principal component analysis. By definition, PCA yields orthogonal
coordinates that maximize the variance of a dataset, so we can use
them as a new basis for the regression. This is exactly like the TSVD
solution of Chap 14, Sect. II. One problem is to decide objectively
how many PCs (SVD modes) to retain; there are some rules to do
this a priori (Wax and Kailath, 1985; Nadakuditi and Edelman, 2008), but
cross-validation usually is a safer bet when prediction is involved.

Ridge regression (aka Tikhonov regularization) bumps diagonal entries


of the covariance matrix, resulting in a smooth filtering of its eigen-
value spectrum; see Chap 14, Sect. III. Ridge regression may be seen
as adding a penalty to the `2 penalty to the MLE of β to ensure that


the solution favors small coefficients. It turns out that this lowers the
effective number of parameters to estimate. Because it keeps coeffi-
cients in check, ridge regression is variously called "biased regres-
sion", or a “shrinkage method” (for large values of the ridge parameter, β shrinks towards zero). “Bias” is usually something to avoid, but in this case we trade a little bias for a lot of variance, so

even though coefficients are biased low, the total MSE may be sub-
stantially lowered. The optimal ridge parameter may be identified
via generalized cross-validation (Golub et al., 1979), with some com-
plications (Wahba and Wang, 1995).

LASSO (Least Absolute Shrinkage and Selection Operator) The lasso is a shrinkage method like


ridge regression, with subtle but important differences. The main
one is that the `2 penalty is replaced by an `1 penalty on β. Big deal,
you say? Well the `1 norm encourages coefficients to go to zero ex-
actly, so the LASSO can actually eliminate variables altogether. In
contrast, `2 methods will tend to shrink all coefficients, but will never
tell you: get rid of this guy. The LASSO can do that, so it is much more of a model selection tool than a regularization tool. In the end, starting from a large pool of potential predictors, it reduces the number of parameters that must be estimated. It has been used very successfully in genetics, where it is used to identify genotypes that have predictive power over phenotypes.

(Footnote 8: “If a traveler comes to a fork in the road, the `1 norm tells him to turn either left or right, while the `2 norm tells him to head straight for the bushes.” – A. Tarantola)
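The following Python (numpy) sketch contrasts OLS and ridge estimates on two nearly collinear, synthetic predictors; the ridge parameter is arbitrary, and a LASSO fit (e.g. scikit-learn's Lasso) could be swapped in to see coefficients driven exactly to zero:

import numpy as np

def ridge_solve(X, y, alpha):
    """Ridge (Tikhonov) estimate: (X^T X + alpha^2 I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha**2 * np.eye(p), X.T @ y)

# two nearly collinear predictors (made-up data)
rng = np.random.default_rng(8)
x1 = rng.standard_normal(100)
x2 = x1 + 0.01 * rng.standard_normal(100)        # nearly a copy of x1
y = 1.0 * x1 + 1.0 * x2 + 0.1 * rng.standard_normal(100)
X = np.column_stack([x1, x2])

print(ridge_solve(X, y, alpha=0.0))    # OLS: individual coefficients poorly determined
print(ridge_solve(X, y, alpha=1.0))    # ridge: both shrunk toward a stable compromise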


V What would a Bayesian do?


As in every field of data analysis, there is a Bayesian view. And as in every field, for a suitable choice of priors, the Bayesian approach yields a solution identical to that of the traditional method, but often with stronger insights. A Bayesian would first write the model

y | { β, σ2 , X } ∼ N ( Xβ, σ2 In ) (15.15)

Starting from a non-informative prior on the model variables:

p( β, σ2 | X ) ∝ σ−2 (15.16)

application of Bayes' theorem yields the posterior distribution of model parameters:
β | σ2 , y ∼ N ( β̂, Vβ σ2 ) (15.17)
where, as before:

β̂ = ( X > X ) −1 X > y (15.18a)


> −1
Vβ = (X X) (15.18b)

which is just the ordinary least squares solution. With different priors
one gets different results, of course. The ridge regression solution may
be obtained via an inverse Wishart prior on the covariance matrix; the
LASSO solution via a Laplace prior on the covariance matrix.
The advantages of the Bayesian viewpoint are as usual: the infer-
ence is model-based, everything has a well-defined distribution, the
assumptions are transparent, and the model can be made as complex
as one wants and still follow the basic desiderata of probability theory
as logic (Chap 2). Thus, while in elementary cases (and given suitable
choices) the Bayesian solution may yield the good old OLS solution, the
framework offers much more flexibility when complex (e.g. hierarchi-
cal) models are warranted (see, Gelman et al., 2013, Chap. 14, for an in-
depth treatment), or when less friendly distributions are encountered.
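For the simple non-informative-prior case above, the posterior can be sampled directly. The following Python (numpy) sketch is one way to do it (made-up data; the sampling scheme follows Eqs. (15.17)–(15.18), with σ² drawn from its scaled inverse-χ² posterior):

import numpy as np

def posterior_samples(X, y, n_draws=1000, seed=0):
    """Draw (beta, sigma^2) from the posterior under the non-informative prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.inv(X.T @ X)
    beta_hat = V @ X.T @ y
    s2 = np.sum((y - X @ beta_hat)**2) / (n - p)
    # sigma^2 | y  ~  scaled inverse chi-square with n - p degrees of freedom
    sigma2 = (n - p) * s2 / rng.chisquare(n - p, size=n_draws)
    # beta | sigma^2, y  ~  N(beta_hat, V sigma^2)
    L = np.linalg.cholesky(V)
    z = rng.standard_normal((n_draws, p))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# made-up example: intercept + slope
rng = np.random.default_rng(9)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x + 0.2 * rng.standard_normal(x.size)
beta, sigma2 = posterior_samples(X, y)
print(beta.mean(axis=0), np.percentile(beta, [2.5, 97.5], axis=0))

The posterior mean reproduces the OLS estimate, while the percentile bands give credible intervals for the coefficients.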

Chapter 16

CONCLUSION & OUTLOOK

Like many of the students who take this class, you may have started
this book with little mathematical background and a thirst to better un-
derstand the Earth. Or you may have wandered here through the inter-
tubes, trying to get a little understanding of how to analyze data beyond
the shallow teachings of the "Data Science" fad.
We hope that the journey so far has lifted the curtain on what lies be-
hind data analysis, and that you are now a more sophisticated consumer
of data analysis methods.
In future editions of this book, we would like to bring a few improve-
ments:

• Fix all the pesky typos lacing this book (the more we fix, the more we
find)

• Fix figures so that they appear correctly in all PDF readers. [in progress]

• include an additional chapter on age modeling and related techniques


(leveraging the advances of the GeoChronR project)

• include a new chapter on climate field reconstructions of the Com-


mon Era, deconstructing a peer-reviewed article and isolating all the
data-analytic components for pedagogical purposes.
Part IV

Mathematical Foundations
Appendix A

CALCULUS REVIEW

I Differential Calculus
Derivative
In calculus, we are interested in the change or dependence of some quan-
tity, e.g. u, on small changes in some variable t ∈ R. If u has value u0
at t0 and changes to u0 + δu when t changes to t0 + δt, the incremental
change can be written as

δu = (δu/δt)|_{t_0} δt.    (A.1)

The δ here means that this is a small, but finite quantity. If we let δt get asymptotically smaller around t_0, we arrive at the derivative, which we denote with u′(t):

lim_{δt→0} (δu/δt)|_{t_0} = du/dt.    (A.2)
The limit in Eq. (A.2) will work as long as u doesn't do any funny stuff as a function of t, like jump around abruptly. When you think of u(t) as a function (some line on a plot) that depends on t, u′(t) is the slope of this line that can be obtained by measuring the change δu over some interval δt, and then making the interval progressively smaller.

Figure A.1: The derivative as a tangent slope (Source: Wikimedia Commons). Note that the first order approximation f(x) + f′(x)Δx captures the qualitative behavior of f(x), but is still very far off. To do better, one should include more terms – this is the point of a Taylor expansion (Sect. III).

Interpretation

u′(t_0) has a clear geometric interpretation as the slope of the tangent


to the curve {t; u(t)} at point (t0 , u(t0 )) (Fig. A.1). Its physical interpre-
tation is that it reflects the instantaneous rate of change of the function
u (for instance, if u(t) is the velocity, then u0 (t) is the acceleration).
Note that in mechanics, u′(t) is often denoted u̇(t), though a more rigorous notation is that of Leibniz, du/dt. Leibniz's “differential” notation
is clear in three respects:

• it makes clear which variation we are considering, because u could


depend on other variables (e.g. x, y, z).
Data Analysis in the Earth & Environmental Sciences Appendix A. Calculus Review

• it makes clear that this is ratio of differentials, so the notation is faith-


ful to the definition.

• it expresses differentiation as an operator acting on a function, which


allows differential operators to be defined easily, and makes multiple
derivatives and the chain rule very easily understood.

That being said, it is more cumbersome than the prime, so most lazy people use u′(t) to denote the derivative.

Properties
If you need to take derivatives of combinations of two or more functions,
here called f , g, and h, there are four important rules (with a and b being
real constants):

Linearity

(a f + b g)′ = a f′ + b g′    (A.3)

Product rule:

(f g)′ = f′ g + f g′    (A.4)

Quotient rule:

If f(x) = g(x)/h(x)    (A.5)
Then f′(x) = [ g′(x) h(x) − g(x) h′(x) ] / h(x)²    (A.6)

Chain rule (inner and outer derivative):

If f(x) = h(g(x))    (A.7)
Then f′(x) = df/dx = (dh/dg)(dg/dx) = h′(g(x)) g′(x)    (A.8)

i.e. derivatives of nested functions are given by the outer times the inner derivative.


Common Derivatives:

Here are some of the most common derivatives of a few functions:


function f(x)        derivative f′(x)           comment
x^p                  p x^(p−1)                  special case: f(x) = c = c x⁰ → f′(x) = 0 (c, p constants)
exp(x) = e^x         e^x                        that's what makes e so special
ln(x)                1/x
sin(x)               cos(x)
cos(x)               −sin(x)
tan(x)               sec²(x) = 1/cos²(x)

Higher Order Derivatives

If you need higher order derivatives, those are obtained by successively


computing derivatives, e.g. the third derivative of f(x) is

d³f/dx³ = d/dx ( d/dx ( df(x)/dx ) ).

Say, f(x) = x³, then

d³x³/dx³ = d/dx ( d/dx ( dx³/dx ) ) = d/dx ( d/dx (3x²) ) = d/dx (6x) = 6.

In general, the nth derivative is denoted by f (n) . This will become useful
in Sect. III.

Differentiability
We should not take a function’s niceness for granted. Let us introduce
the C n notation, which formalizes this notion. A function is called C n
if it can be differentiated n times, and the nth derivative is still continu-
ous. Continuity may be loosely described as the property that a func-
tion’s graph can be drawn with a pencil without ever lifting it from the
page1 . We call C n ( I ) the set of functions that are C n over some interval 1
A more formal definition is that, if for
I. Obviously, a function that is n times differentiable is n − 1 times dif- every x in a neighborhood of x0 , f ( x ) is in
a neighborhood of f ( x0 ), then f is said
ferentiable, so we have the following Russian doll relationship2 , valid to be continuous at x0 . This definition is
for all I: valid over the whole real line, including
points at infinity.
C0(I) ∈ C1(I) ∈ · · · ∈ C ∞(I) (A.9) 2
which mathematicians call an inclusion


Yes, that was an ∞ symbol: some functions can be continuously dif-


ferentiated an infinite number of times. A trivial example is the function
x → f ( x ) = C (a constant), but more interesting examples are the usual
functions exponential, sine, cosine, log, etc.

Partial Derivation
In physics, it is very common for a function to depend on more than
one variable. For instance u could depend on space and time, which
we write u = u( x, y, z, t). In such cases one must account for variations
along each of the coordinates, which are themselves described by partial
derivatives:
lim_{δx→0} δu/δx = ∂u/∂x.    (A.10)

and similarly along other coordinates. The full differential, accounting for all variations, then writes:

du = (∂u/∂x) dx + (∂u/∂y) dy + (∂u/∂z) dz + (∂u/∂t) dt    (A.11)

Which states that the total variation in u is the sum of all its (partial)
variations along each coordinates, multiplied by the variation in each
coordinate. Partial derivatives and the equations derived from them
underlie much of modern physics, and we can’t do them justice here.
In this class they shall only be used for optimization, such as likelihood
maximization (Sect. III).

II Integral Calculus
Inverse Derivatives
Taking an integral

F(x) = ∫ f(x) dx,

in a general (indefinite) sense, is the inverse of taking the derivative of a C⁰ function f, i.e. F′(x) = f(x). Another way to put this is the fundamental theorem of calculus, valid for any C¹ function:

Theorem 2 Every continuously differentiable function f verifies


∫_a^b f′(x) dx = f(b) − f(a)

Which states that integration and differentiation are inverse opera-


tions ; that is, if we integrate the derivative we get back where we started
(with this subtlety that a definite integral from a to b will equal the dif-
ference f (b) − f ( a), not just f ( x ), say).


Interpretation
Graphically, the definite (with bounds) integral over f ( x )
∫_a^b f(x) dx = F(b) − F(a)

along x, adding up the value of f ( x ) over little chunks of dx, from the
left x = a to the right x = b, corresponds to the area under the curve
f ( x ). This area can be computed by subtracting the analytical form of
the integral at b from that at a, F (b) − F ( a). If no bounds a and b are
given, the function F ( x ) is only determined up to an integration con-
stant c, because the derivative of a constant is zero. In physics, initial
or boundary conditions are often used to determine the value of such a
constant, which can be very important in practice.
If f ( x ) = c (c a constant), then:

F(x) = cx + d (A.12)
F (b) = cb + d (A.13)
F ( a) = ca + d (A.14)
F (b) − F ( a) = c ( b − a ), (A.15)

the area of the box (b − a) × c.

Properties
A few conventions and rules for integration:

Notation:

Everything after the ∫ sign is usually meant to be integrated over up to the dx, or the next major mathematical operator if the dx is placed next to the ∫, if the context allows. Also:

∫ dx f(x) = ∫ f(x) dx    (A.16)

Linearity:

∫_a^b (c f(x) + d g(x)) dx = c ∫_a^b f(x) dx + d ∫_a^b g(x) dx    (A.17)

Reversal:

∫_a^b f(x) dx = − ∫_b^a f(x) dx    (A.18)

Zero length:

∫_a^a f(x) dx = 0    (A.19)

Additivity:

∫_a^c f(x) dx = ∫_a^b f(x) dx + ∫_b^c f(x) dx    (A.20)

Product rules:

∫ f′(x) f(x) dx = (1/2) (f(x))² + C    (A.21)
∫ f′(x) g(x) dx = f(x) g(x) − ∫ f(x) g′(x) dx    (A.22)

Quotient rule:

∫ f′(x)/f(x) dx = ln |f(x)| + C    (A.23)

Symmetry:

∫_{−a}^{a} f(x) dx = 2 ∫_0^a f(x) dx if f is even; 0 if f is odd    (A.24)

Common Integrals
Here are the integrals of a few common functions, all only determined up to an integration constant C:

function f(x)        integral F(x)              comment
x^p                  x^(p+1)/(p+1) + C          special case: f(x) = c = c x⁰ → F(x) = cx + C
e^x                  e^x + C
1/x                  ln(|x|) + C
sin(x)               −cos(x) + C
cos(x)               sin(x) + C

In general, the integral of an arbitrary function may be difficult to


find, unless one recognizes one of the forms shown above. This is in


contrast to the derivative of a differentiable function, which can always


be found. Two main methods exist to find more complicated integrals:
integration by parts, and variable substitution. When no analytical so-
lution exists, one must resort to numerical quadrature.

Numerical Quadrature
The term quadrature invokes squares, and it is not used here coinciden-
tally. Indeed, one of the early applications of integrals was to compute
areas, usually by diving domains into rectangles whose area was easy
to compute. The basic idea behind the following methods to numeri-
cally evaluate the integral of a function f over [ a; b] is to chop up its the
interval into smaller, contiguous segments, collectively known as a sub-
division of [a; b]. There are many ways to do this, but the simplest is to divide it into n equal slices of width Δ = (b − a)/n:

[a; b[ = [a; a + Δ[ ∪ [a + Δ; a + 2Δ[ ∪ · · · ∪ [a + (n − 1)Δ; b[ = ⋃_{k=1}^{n} [a + (k − 1)Δ; a + kΔ[

(whether b gets included in the calculation or not makes no differ-


ence, as a point has zero width. This of course, is only true if f isn’t
doing anything fishy at x = b, so we require that f be C 0 ).

Rectangular Rule

The rectangular rule (aka Riemann sums) assumes that the function is constant over each subdivision. If we pick the end of each interval, we get:

∫_a^b f(x) dx ≈ Δ ∑_{k=1}^{n} f(a + kΔ) ≡ A_n    (A.25)

Figure A.2: Riemann summation in action. Right and left methods make the approximation using the right and left endpoints of each subinterval, respectively. Maximum (green) and minimum (yellow) methods make the approximation using the largest and smallest endpoint values of each subinterval, respectively. The values of the sums converge as the subintervals halve from top-left to bottom-right. The central panel describes the error as a function of number of bins n. (Source: Wikimedia Commons)

As always in the numerical world, the approximation improves with n. In fact, one can show that

lim_{n→∞} A_n = ∫_a^b f(x) dx    (A.26)

That is, with infinitely fine subdivisions, one recovers the area under the
curve exactly. This is something that a teacher of mine calls the “Stupid
Limit”, because one never has that much luxury. The goal of numerical
analysis is to obtain an estimate that is as accurate as possible given
computational constraints (which means that n  ∞, for starters). The
rectangle rule is illustrated in Fig. A.2, exploring the impact of different
quadrature choices.


Trapezoidal Rule

The rectangular rule is a pretty dumb one: in general f is far from


constant, and the Stupid Limit is the only way to get away with assum-
ing that it is. We surely can do better. The trapezoidal rule takes this
one step further, and assumes that f is piecewise linear over each inter-
val. This is akin to a first-order Taylor expansion (Sect. III). As you can
see in Fig. A.3, now we can get away with a very crude subdivision and
still espouse the true area rather closely. In Matlab, this method can be
called via trapz.m.
Figure A.3: Illustration of the trapezoidal rule. This case is particularly favorable to it as the function is extremely smooth, and varies monotonically, so even a rather broad discretization interval Δ = 0.5 is enough to produce an excellent approximation of the full integral. (Source: Wikimedia Commons)

Simpson's rule

What in the world is better than a first-order Taylor expansion? A second-order Taylor expansion! That is the essence of Simpson's rule (first used by the famous astronomer Johannes Kepler, who used it about 100 years before Thomas Simpson (1710-1761), whose name has stuck to it). The basic principle is illustrated in Fig. A.4, though as in the previous methods, the approximations would improve with finer subdivisions. Over each interval [x_i; x_{i+1}], of length Δ,

∫_{x_i}^{x_{i+1}} f(x) dx ≈ (Δ/6) [ f(x_i) + 4 f((x_i + x_{i+1})/2) + f(x_{i+1}) ]    (A.27)

How good are these approximations? Amazingly, the error is at least fourth order accurate (of order (1/90)(Δ/2)⁵ f⁽⁴⁾(ξ), for some ξ ∈ [a; b]), which means that it integrates cubics exactly, and its numerical efficiency makes it a darling of data analysts and engineers. In Matlab it may be called via quad.m, which is especially nifty because it adaptively chooses the subdivisions to focus on the parts of the function that are most variable (hence, most difficult to approximate by a parabola).
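The three quadrature rules discussed above can be compared in a few lines of Python (numpy); the test integral (∫ sin x dx over [0, π] = 2) and the number of slices are arbitrary choices:

import numpy as np

def riemann(f, a, b, n):
    """Right-endpoint Riemann sum with n equal slices."""
    dx = (b - a) / n
    x = a + dx * np.arange(1, n + 1)
    return dx * np.sum(f(x))

def trapezoid(f, a, b, n):
    """Trapezoidal rule with n equal slices."""
    x = np.linspace(a, b, n + 1)
    fx = f(x)
    dx = (b - a) / n
    return dx * (0.5 * fx[0] + fx[1:-1].sum() + 0.5 * fx[-1])

def simpson(f, a, b, n):
    """Composite Simpson rule, Eq. (A.27), applied to each of n subintervals."""
    x = np.linspace(a, b, n + 1)
    mid = 0.5 * (x[:-1] + x[1:])
    dx = (b - a) / n
    return dx / 6 * np.sum(f(x[:-1]) + 4 * f(mid) + f(x[1:]))

exact = 2.0                       # integral of sin(x) over [0, pi]
for rule in (riemann, trapezoid, simpson):
    print(rule.__name__, abs(rule(np.sin, 0, np.pi, 10) - exact))

With only ten slices, the Simpson error is already orders of magnitude smaller than that of the cruder rules, illustrating the payoff of a higher-order approximation.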
Figure A.4: Illustration of Simpson's rule. This figure is somewhat lame since it considers only a subdivision with n = 1. (Source: Wikimedia Commons)

Improper Integrals

So far we have only considered intervals of the form [a; b], where a and b are both finite. What would happen if we let one or both of those bounds
to reach infinity? It turns out that functions that decay fast enough to-
ward either end of the real line may be integrated over R. As an example
consider the function f ( x ) = λe−λx . It is trivial to show that:
∫ f(x) dx = −e^{−λx} + C = F(x)

so

∫_0^ζ f(x) dx = F(ζ) − F(0) = 1 − e^{−λζ}

Now, as ζ approaches +∞, the second term decreases exponentially fast, and the limit is well defined:

lim_{ζ→+∞} ∫_0^ζ f(x) dx ≡ ∫_0^{+∞} f(x) dx = 1    (A.28)


Such integrals are called improper, but they are legitimate as long as this
limit exists. This particular function describes the exponential distribu-
tion, which is very useful in the study of waiting times and memoriless
processes (Chapter 3).
Consider now the integral ∫_{−∞}^{+∞} sin x dx. Does such a thing exist? Given the symmetry property Eq. (A.24), one can show that for every a,

∫_{−a}^{+a} sin x dx = 0
since sin is an odd function (symmetric about (0, 0)). It is tempting to
take the limit a → ∞ and declare that the improper integral exists and
equals zero. This reasoning, however, is 100% incorrect. For an im-
proper integral with two infinite bounds to be defined, one has to look
separately at the limits at ±∞. In the case above, neither limit is defined,
so the integral doesn’t exist.
Now, it is a fact of life that many of the most useful functions have no closed-form antiderivatives, a case in point being ∫ e^{−x²/2} dx. Not only does its integral exist, but most remarkably

∫_{−∞}^{+∞} e^{−x²/2} dx = √(2π)    (A.29)
However, you would not find that out by application of the two stan-
dard tools to find integrals (integration by parts or variable substitu-
tion); you have to either invoke Fubini’s theorem or use a contour inte-
gral in the complex plane. In general, improper integrals are best tack-
led using measure theory, which is beyond our scope. For our purposes,
a standard math textbook (e.g. Abramowitz and Stegun, 1965), table of in-
tegrals, the Mathematica software, or Matlab’s Symbolic Math Toolbox
will be of help with such integrals, as well as more complicated forms.

III Earth Science Applications


Measuring the Earth
Measuring mass and probability

One direct application of integrals is that if ρ is the density of some sub-


stance, its integral over the medium gives the total mass of that sub-
stance. Consider for instance atmospheric density (the mass per unit
volume) as a function of height z:

ρ(z) = ρ0 e−z/H

where H is the so-called “scale height” (about 7km in Earth’s tropo-


sphere).


The total mass of the atmosphere per unit area is:


∫_0^∞ ρ(z) dz = ρ_0 H

Similarly, we define in Chapter 3 the notion of probability density, which


is a measure of the likelihood for a given variable to lie in a certain range.
Any probability density f ( x ) must have, by definition, unit mass over
the real line, which writes:

∫_{−∞}^{+∞} f(x) dx ≡ ∫_R f(x) dx = 1    (A.30)

(Footnote 5: we are 100% sure that the variable cannot be smaller than −∞ or greater than +∞.)

Indeed this is what we found for the exponential distribution above.


If we defined ϕ( x ) such that

ϕ(x) = (1/√(2π)) e^{−x²/2}    (A.31)

Then, by virtue of Eq. (A.29) and the linearity of the integration operator,
we’d find that Z +∞
ϕ( x )dx = 1
−∞

which is a restatement of Eq. (A.30). ϕ( x ), the infamous “bell curve” is


a centerpiece of the Laplace-Gauss distribution (Chapter 4), which is to
probability theory what the Sun is to the solar system.

Measuring energy

Another application of integrals is to measure the energy of a signal, say


X (t), which could be either deep-sea temperature, atmospheric ground
velocity at a point, the ground motion measured by a seismometer, or
anything you’d like:
E = ∫_{−∞}^{+∞} |X(t)|² dt    (A.32)

This, coupled with Parseval’s theorem, is the foundation of spectral anal-


ysis (Chapters 7, 8 & 9).

Projection
Another application of integrals is their definition of an inner product
(cf. Appendix B) between two functions f and g over some interval I:
⟨f, g⟩ = ∫_I f(x) g(x) dx    (A.33)

which is how we measure the similarity between functions, and can


project one onto the other.


Convolution
The convolution product, or simply convolution of two functions f and
g is defined as:
(g ∗ f)(x) = ∫_{−∞}^{+∞} f(u) g(x − u) du = ∫_{−∞}^{+∞} f(x − u) g(u) du    (A.34)

This operation is essential to describing the action of a linear system.


For instance, let x (t) be some input signal into a linear system (say, a
filter) characterized by an impulse function L(t), then the response of
the system is

y(t) = ∫_{−∞}^{+∞} x(u) L(t − u) du = ∫_{−∞}^{+∞} L(u) x(t − u) du    (A.35)

If L describes the behavior of a filter acting on x(t), then the convolution expresses the output of the filter once x has run through it. A 1D example is provided in Fig. A.6, where two filters (Gaussian and boxcar) are compared. A 2D example is shown in Fig. A.5, using bivariate Gaussian windows with half-widths set to 3 and 10 pixels, respectively.

Figure A.5: Gaussian blurring, an example of discrete 2D convolution (Source: Wikimedia Commons).

The inverse operation, that of retrieving X given Y and some knowledge of L, is called deconvolution. Image enhancing (going from the bottom to the top of Fig. A.5) is one way to think of it. It is an inverse problem that can become rather thorny in the presence of noise, and usually needs some regularization (Chapter 15).
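A minimal Python (numpy) sketch of the 1D filtering example of Fig. A.6 is given below; the signal, the window width and the Gaussian half-width are made-up values:

import numpy as np

rng = np.random.default_rng(10)
t = np.arange(400)
x = np.sin(2 * np.pi * t / 100) + 0.5 * rng.standard_normal(t.size)   # noisy input

# two low-pass windows of equal width, normalized to unit sum
width = 13
boxcar = np.ones(width) / width
u = np.arange(width) - width // 2
gauss = np.exp(-0.5 * (u / 2.5)**2)
gauss /= gauss.sum()

y_box = np.convolve(x, boxcar, mode='same')     # discrete convolution, Eq. (A.34)
y_gau = np.convolve(x, gauss, mode='same')
print(x.std(), y_box.std(), y_gau.std())        # outputs are smoother than the input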

Figure A.6: Filtering timeseries, an example of 1D convolution. The top panels show the Gaussian and boxcar windows (amplitude vs. lag); the lower panels show the input signal X(t) and the output signals Y(t) after convolution with each window. The output is visibly smoother than the input, because convolution with either window amounts to averaging nearby points together, which cancels out high-frequency fluctuations and emphasizes the long-term behavior of the series. This is an example of low-pass filter (Chapter 10). Note that the boxcar filter is blurrier and noisier than the Gaussian filter, which is one reason running means are a bad idea.
We now show some applications of differential calculus.


Special points
Derivatives allow us to characterize several special points around which the graph of f(x) is organized:

Roots correspond to the points where f(x) = 0. In the case of the function shown in Fig. A.7 (a cubic polynomial), the roots can be found by various methods. For arbitrary functions, one must resort to iterative techniques like Newton-Raphson.

Extrema are either maxima or minima of a function, verifying f′(x) = 0 (a flat rate of change, locally). The derivative itself does not tell us whether the point is a maximum or a minimum, however. Higher-order derivatives are required in order to find this out. If f″(x) > 0 then the function is convex, and the extremum is a minimum; if f″(x) < 0 the extremum is a maximum – see Fig. A.7. Care must be taken if f″(x) = 0.

Inflexion points are points where the second derivative f″(x) passes through zero while changing sign. This is a turning point for f, as its curve changes from being concave upwards (positive curvature) to concave downwards (negative curvature), or vice versa.

Figure A.7: Special Points of a Function f(x) = x³ − 3x² − 144x + 432 = (x − 3)(x − 12)(x + 12) (Source: Wikimedia Commons).
to concave downwards (negative curvature), or vice versa.

Optimization
A generic problem in applied mathematics is to find some sort of optimal solution to a problem: for example, one wants to find a curve that minimizes tension between points, a surface that minimizes the misfit to a set of measurements, fitting a line through a cloud of points, or finding the most likely value of a parameter given some observational constraints and prior knowledge. In many of these cases, one can define an objective function S(x) whose maximum or minimum yields the desired solution. The equation S′(x) = 0 is used to find extrema, and the sign of S″(x) (the curvature of S) is used to determine whether they are maxima or minima. In this class, we use optimization in Chapter 5 to estimate the parameters of a probability distribution (maximum likelihood method) and in Chapter 13 to fit curves to observations (it turns out that the two are deeply related).
Figure A.8: Optimizing a multivalued function amounts to finding the peaks in a graph such as this one (Source: Imran Nazar).

Root finding
A related problem is that of finding the roots (solutions) of an equation,
say f ( x ) = 0. We begin with a first guess x0 for a root. Provided the
function is C 1 , a better approximation x1 is

x₁ = x₀ − f(x₀)/f′(x₀)    (A.36)
Figure A.9: Finding the roots of an equation via the method of Newton-Raphson.


Geometrically, ( x1 , 0) is the intersection with the x-axis of the tangent


to the graph of f at the point ( x0 , f ( x0 )).
The process is repeated as

xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ)    (A.37)
until a sufficiently accurate value is reached (Fig. A.9). This is the method of Newton-Raphson. Near a simple root the iteration converges quadratically (the number of correct digits roughly doubles at each step), but convergence is not guaranteed: it can fail or slow to a crawl wherever f′(x) vanishes or when the root is a multiple one. Also, it depends critically on the initial guess, so one must use another method (e.g. eyeballing the graph) to find a good one.
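A minimal Newton-Raphson sketch (not from the text; the function and starting guess below are arbitrary examples):

def newton(f, fprime, x0, tol=1e-10, maxiter=50):
    """Find a root of f by Newton-Raphson, starting from the guess x0."""
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)   # fails if f'(x) = 0, as noted above
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")

# example: the cubic of Fig. A.7, starting near x = 10
root = newton(lambda x: x**3 - 3*x**2 - 144*x + 432,
              lambda x: 3*x**2 - 6*x - 144,
              x0=10.0)
print(root)   # converges to 12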

Functional Representations
For many purposes it is sometimes useful to express an arbitrary function f(x) in terms of simpler components. This, indeed, is one meaning of the word analysis. A generic representation uses a linear combination of basis functions
f(x) = ∑ₖ₌₀ aₖ φₖ(x)    (A.38)
As a general rule, the complexity of φk ( x ) tends to increase with k.
Two examples are mentioned here.

Polynomial representation: Taylor Expansions

What if we pick φₖ(x) = xᵏ? That is, what if we seek a representation of


f in terms of monomials of increasing complexity, up to degree n > 1?
It turns out that we can do this for every well-behaved function, and
that the value of the ak coefficients only involves derivatives of f . This
is called a Taylor approximation or Taylor series, and is valid for any C n
function. It is generally done in a neighborhood of a point x0 , where
f ( x ) can be expressed as follows:

f(x) = f(x₀) + f′(x₀)(x − x₀) + f″(x₀)(x − x₀)²/2! + . . .    (A.39)
f(x) = ∑ₖ₌₀ⁿ f⁽ᵏ⁾(x₀)(x − x₀)ᵏ/k! + Rₙ(x)    (A.40)

Here, f (k) ( x0 ) is the kth derivative. n! denotes the factorial, i.e.

n! = 1 × 2 × 3 × . . . n. (A.41)

Rn ( x ) is the remainder of order n , whose absolute value measures the


goodness of this approximation.


There are a few formulations of the remainder, which determine the name of the formula. For instance, the Taylor-Lagrange formula states that there exists a number θ between x and x₀ such that:
Rₙ(x) = f⁽ⁿ⁺¹⁾(θ) (x − x₀)ⁿ⁺¹/(n + 1)!    (A.42)
So, as long as f is well-behaved (none of the derivatives get too large; more precisely, Cⁿ⁺¹ in an interval that encompasses x₀) and x − x₀ is small, each successive term is smaller than the last, and the approximation becomes better and better: lim_{n→∞} Rₙ(x) = 0. However, this ceases to be true as we get away from x₀, so this approximation is only useful locally.
Fig. A.10 illustrates this in the case of f(x) = eˣ and x₀ = 0, in which case ∀k, f⁽ᵏ⁾(x₀) = 1, thus aₖ = 1/k!. Note that the approximation is quite good around 0 even for n = 2, but gets worse as one gets away from the origin. If one wanted a useful approximation at x = 3, say, one could either set x₀ = 3, or keep x₀ = 0 but push the expansion to a higher order. Indeed, by n = 6 or 7, the true curve and its polynomial approximations are virtually indistinguishable on Fig. A.10. Which solution is the smarter choice?
Figure A.10: Approximations of the exponential function y = eˣ by polynomials: the truncated Taylor series, orders 1 through 7.
Now, because exp(x) ∈ C^∞(ℝ), one could push n to infinity, leading to the exponential series:

∀x ∈ ℝ,  eˣ = ∑ₖ₌₀^∞ xᵏ/k!    (A.43)

This forms the basis for the Poisson distribution (Chapter 3, Sect. III).
Functions that are equal to their Taylor series on their domain of defi-
nition are called analytic; this is a remarkable property, which we won’t
use too much in this course, but you should know how very extraordi-
nary it is.
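A quick numerical check of Fig. A.10 and Eq. (A.43) — a sketch, not part of the original text:

import numpy as np
from math import factorial

def taylor_exp(x, n):
    """Partial sum of the exponential series, sum_{k=0}^{n} x^k / k!."""
    return sum(x**k / factorial(k) for k in range(n + 1))

x = 3.0
for n in (2, 4, 6, 7):
    print(n, taylor_exp(x, n), np.exp(x))
# at x = 3 the approximation only becomes decent around n = 6 or 7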
Another very important series is the geometric series:

∀x ∈ ]−1, 1[,  1/(1 − x) = ∑ₖ₌₀^∞ xᵏ    (A.44)
(that is, aₖ = 1 for all k). With this, one can derive expansions of a host of other functions, like 1/(1 + x), 1/(1 + x²), arctan, arcsin, arccos, etc. It comes in
handy to approximate almost any ratio.
Example
What is a back-of-envelope estimate of q = 1/0.93?
One recognizes the form above, with x = 0.07. So to first order, q = 1 + 0.07 + O(0.07²) ≃ 1.07. To second order, q = 1 + 0.07 + 0.07² + O(0.07³) ≃ 1.07 + 0.0049 = 1.0749. Here we used the “Big O” notation, which means “terms of this order, or higher”.

This is in fact how calculators work (behind the scenes)! You can see
how the higher order terms add up to increasingly small contributions,


because |x|ⁿ < |x|ᵐ for any 0 < m < n as long as |x| < 1. For larger |x|,
these geometric approximations quickly become gruesome.
The geometric series is also useful in probability theory, giving its
name to the geometric distribution (Chapter 4).

Trigonometric Representation: Fourier series

Another common form of functional representation takes the form


of trigonometric series, and applies to T-periodic functions. This time, φₖ(x) = Zᵏ, where Z = e^{i2πx/T} and i = √−1 (cf. Appendix C), so we are dealing with complex polynomials, which is not the most intuitive. After some rearrangements, it can be shown that this representation is equal to a sum of sines and cosines:

f(x) = a₀/2 + ∑ₖ₌₁^∞ [aₖ cos(kωx) + bₖ sin(kωx)],  ω = 2π/T    (A.45)

This is called a Fourier series and is omnipresent throughout Chap-


ter 7. The coefficients { ak , bk , k ∈ N} can be found by projecting f onto
sines and cosines of different frequencies, using Eq. (A.33).
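Here is a minimal sketch (not from the text) of that projection, evaluating the inner product of Eq. (A.33) numerically; the square-wave test function is an arbitrary example.

import numpy as np

T = 2 * np.pi
omega = 2 * np.pi / T
x = np.linspace(0, T, 2048, endpoint=False)
dx = x[1] - x[0]
f = np.sign(np.sin(x))              # a square wave, as an example

def a_k(k):   # cosine coefficient: (2/T) <f, cos(k w x)>
    return 2 / T * np.sum(f * np.cos(k * omega * x)) * dx

def b_k(k):   # sine coefficient: (2/T) <f, sin(k w x)>
    return 2 / T * np.sum(f * np.sin(k * omega * x)) * dx

print([round(b_k(k), 3) for k in range(1, 6)])  # ~4/(pi*k) for odd k, ~0 for even k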

Transformation
Another major application of calculus that is useful for this course is the
concept of integral transforms. In particular Laplace & Fourier trans-
forms are omnipresent in signal processing and probability theory; we
will mostly encounter them in Chapter 7.

Appendix B

LINEAR ALGEBRA

I Vectors
Definition 15 A vector is a quantity having three properties: a magnitude, a
direction and a sense.
Example
• Velocity vector indicating the movement of an object (e.g. planet)

• An n-tuple in the n-dimensional real space Rn : ( x1 , x2 , . . . , xn ) where xi ∈


R, i ∈ {1, . . . n}

• in ℝ³, v = (1, 0, −1)¹
¹ NumPy: v = np.array([1, 0, -1])


Different Ways to Measure the Length (Norms):


• Euclidean norm (or 2-norm): ‖v‖₂ = √(v₁² + v₂² + · · · + vₙ²).
• p-norm: ‖v‖ₚ = (|v₁|ᵖ + · · · + |vₙ|ᵖ)^{1/p}, p ≥ 1 (Matlab: norm(v,p)).
  p = 1: ‖v‖₁ = |v₁| + · · · + |vₙ|.
  p = 2: Euclidean norm (Matlab: norm(v)).

Limiting case: p = ∞: ‖v‖_∞ = max_{1≤i≤n} |vᵢ|.
One can show that lim_{p→∞} ‖v‖ₚ = ‖v‖_∞.

If kvk = 1, we say that v is a unit vector or equivalently, that v is


normalized.

Scalar Multiplication
αv = α (v1 , . . . , vn ) = (αv1 , . . . , αvn ) , α ∈ R.
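A quick numerical illustration of these norms (a sketch, not part of the original text):

import numpy as np

v = np.array([1.0, 0.0, -1.0])

print(np.linalg.norm(v))           # Euclidean (2-)norm: sqrt(2)
print(np.linalg.norm(v, 1))        # 1-norm: 2
print(np.linalg.norm(v, np.inf))   # infinity norm: 1

v_hat = v / np.linalg.norm(v)      # normalized (unit) vector
print(np.linalg.norm(v_hat))       # 1.0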

II Matrices
Definition
A matrix is a quantity with two indices (a 2D array of numbers), which
can be seen as a linear operator (a function) on vectors.
Linear means:

f (αv) = α f (v) , α ∈ R, (B.1a)


f (v + w) = f (v) + f (w) . (B.1b)

Example

    
Av = [2 3; 1 4] [v₁; v₂] = [2v₁ + 3v₂; v₁ + 4v₂],  with A = [2 3; 1 4].

Matrix × vector = vector

A(αv) = [2 3; 1 4] [αv₁; αv₂] = [2αv₁ + 3αv₂; αv₁ + 4αv₂] = α [2v₁ + 3v₂; v₁ + 4v₂] = α(Av)

Similarly A (v + w) = Av + Aw.

Remark:

Canonical basis:
    
• [a b; c d] [1; 0] = [a; c]
• [a b; c d] [0; 1] = [b; d]


=⇒ Aei = i column of A.

Hence, every linear operator on Rr may be represented as an r × r


matrix. Conversely, every r × r matrix may be seen as representing the
action of this operator.

An Example: Rotation matrices
Consider the linear operation of rotating a vector v by an angle α.
Figure B.1: The canonical basis of ℝ²: e₁ = (1, 0), e₂ = (0, 1).
[Diagram: rotation by an angle α (counterclockwise): the vector v, at angle θ, is mapped by A to Av, at angle θ + α.]

What formulation of A would represent this transformation? To fig-


ure this out, consider the action of the operator on the canonical basis
vectors e1 and e2 :
 
A = [Ae₁  Ae₂]  (its columns are the images of the basis vectors).
It turns out:
Ae₁ = (cos α, sin α),  Ae₂ = (−sin α, cos α)
Therefore: A = [cos α  −sin α; sin α  cos α]


In 3D

• Rotation around z of an angle α:
R_z(α) = [cos α  −sin α  0; sin α  cos α  0; 0  0  1]

• Rotation around y of an angle β:
R_y(β) = [cos β  0  −sin β; 0  1  0; sin β  0  cos β]

• Rotation around x of an angle γ:
R_x(γ) = [1  0  0; 0  cos γ  −sin γ; 0  sin γ  cos γ]

General rotation: R(α, β, γ) = R_x(γ) R_y(β) R_z(α). That is, one can express any rotation as successive applications of such rotation operators.
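A minimal sketch (not from the text) of building and composing such rotation matrices in NumPy:

import numpy as np

def Rz(alpha):
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0.],
                     [s,  c, 0.],
                     [0., 0., 1.]])

# rotating e1 by 90 degrees about z gives e2
print(Rz(np.pi / 2) @ np.array([1., 0., 0.]))    # ~ [0, 1, 0]

# composition: two successive rotations add their angles
print(np.allclose(Rz(0.3) @ Rz(0.4), Rz(0.7)))   # True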


Matrix Multiplication

 
[Diagram: the action of A = [2 3; 1 4] on the canonical basis vectors e₁, e₂ of ℝ².]

Matrix multiplication = “composition of functions”: if g : ℝ² → ℝ² and f : ℝ² → ℝ², their composition f ∘ g also maps ℝ² to ℝ².

A, B matrices:

AB = ”A ◦ B” , ( AB) v = A ( Bv) . (B.2)

Matrix product ≠ Elementwise product

[a b; c d] [e f; g h] = [ae + bg  af + bh; ce + dg  cf + dh]    (B.3)

Formally, (AB)ᵢⱼ = ∑ₖ Aᵢₖ Bₖⱼ.⁴
⁴ In Numpy, A * B denotes the elementwise product; for matrix multiplication, use np.matmul(A,B)

Transposition
The transpose of A, denoted Aᵀ, is obtained by permuting its rows and columns. If A is square, this is equivalent to flipping A along its main diagonal:
(Aᵀ)ᵢⱼ = Aⱼᵢ    (B.4)


Special Matrices

 
I = [1 … 0; … ; 0 … 1]  (identity: ones on the diagonal, zeroes elsewhere)
D = [d₁ … 0; … ; 0 … dₙ]  (diagonal)
Other special forms are upper-triangular and lower-triangular matrices (all zeroes below, respectively above, the main diagonal).

III Matrices and Linear System of Equations

ax + by = c ,
dx + ey = f ,    (B.5)
with a, b, c, d, e and f given. Solve for x and y.

Matrix Form
[a b; d e] [x; y] = [c; f] .    (B.6)
(matrix × unknown vector = known vector)
The matrix [a b; d e] maps vectors to vectors (linear operation).

Geometric interpretation

[Diagram: the vectors v₁, v₂ and their images Av₁, Av₂ under A, alongside the unknown vector (x, y) and the known vector (c, f).]


3 possibilities:
- The system has a unique solution.

- The system has multiple solutions.

- The system has no solution.


We may know this via inspection of the determinant of A.

Determinants
Motivation

For square matrices n × m, where n = m with n the number of rows


and m the number of columns, solve for x.
    
    
    
A · x = b ,    (B.7)
(matrix · vector = vector).

When does this system of linear equations admit a unique solution?


The following theorem links this notion to the determinant det A or | A|,
which is a scalar that packs as much information about the matrix as
possible.
Theorem 3 The system Ax = b, with given A, has a unique solution x for
any given vector b if and only if det A 6= 0.

Computing the Determinant of a Matrix

For a 2 × 2 matrix, this is easy.


 
det [a b; c d] = ad − bc .    (B.8)

It gets a little uglier for a 3 × 3 matrix, but it is still manageable:


 
det [a b c; d e f; g h i] = a (ei − fh) − d (bi − ch) + g (bf − ce)
                          = a·det[e f; h i] − d·det[b c; h i] + g·det[b c; e f]    (B.9)
(note the minus sign in front of d).
Important properties:


• det(AB) = det(A) det(B) .

• det(Aᵀ) = det(A) (transpose) .

• det(αA) = αⁿ det(A) .

Theorem 4 (Cramer’s rule) Suppose det A 6= 0. Then the solution x =


( x1 , . . . xn ) of Ax = b is given by:

det Ai
xi = ,
det A

where Ai is the matrix formed by replacing the ith column of A by b.

→ Very interesting in theory;

→ Not very useful in practice (computationally intensive).

Matrix Inverse
Definition

To solve for x when Ax = b (if det A 6= 0), we introduce the inverse:

x = A −1 b (B.10)

A−1 is a matrix such that:

A · A⁻¹ = A⁻¹ · A = I ,    (B.11a)
(A⁻¹)⁻¹ = A ,    (B.11b)
(Aᵀ)⁻¹ = (A⁻¹)ᵀ ,    (B.11c)
(A · B)⁻¹ = B⁻¹ · A⁻¹ .    (B.11d)

We already know that this matrix exists if and only if det A 6= 0.

The 2 × 2 case

For a matrix A = [a b; c d] whose determinant |A| = ad − bc ≠ 0, we have the simple formula:
A⁻¹ = (1/|A|) [d  −b; −c  a]


As a simple application, consider the case of the bivariate normal


density. We saw in Chapter 11 that the multivariate normal density
Eq. (11.8) involves an exponential argument of the form (x − µ)> Σ−1 (x −
µ), which we called the Mahalanobis distance. The reader is referred to
Chapter 11, Sect. II for a full account. The bivariate independent case
simplifies to something surprisingly lean, Eq. (11.7), all thanks to the formula above.
For larger matrices this trick becomes impractical, so a common method
is Gauss-Jordan elimination, which is simple, but takes quite a bit of
practice. This is not where you will learn it: either take a proper linear
algebra course, or try one of many excellent MOOCs on the topic.
Although there exist general, analytical solutions to compute matrix
inverses, it turns out that for n larger than 3 they get quite ugly, and
computationally just as intensive as computing the inverse by hand, so
let’s just tell a computer to do that and be done with it:

Numerical Solutions

1. Python (NumPy) has a ready-made function to invert matrices⁵, but it is usually a terrible idea. One reason is that if the purpose of finding A⁻¹ is only to multiply it by b to get the solution x = A⁻¹b, then one is better served by solving for x directly⁶.
⁵ np.linalg.inv()
⁶ np.linalg.solve(A,b)

2. One can also play with LU factorization (write A = LU⁷, where L is lower triangular and U is upper triangular):
⁷ scipy.linalg.lu()
LUx = b ,
L(Ux) = b .
Then solve Ly = b, with y = Ux (and finally Ux = y for x).

There are many kinds of other matrix factorizations (Cholesky, QR, etc)
which can become advantageous when A exhibits certain properties
(e.g. symmetry). SciPy incorporates the whole LAPACK machinery to
do so.
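A minimal sketch (not from the text) contrasting these options; it uses scipy.linalg.lu_factor/lu_solve, a close relative of the lu() routine mentioned above, and an arbitrary 2 × 2 system.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[3., 2.], [1., 4.]])
b = np.array([5., 6.])

x1 = np.linalg.solve(A, b)          # preferred: solve directly
x2 = np.linalg.inv(A) @ b           # works, but wasteful and less accurate

lu, piv = lu_factor(A)              # factor once ...
x3 = lu_solve((lu, piv), b)         # ... then reuse for many right-hand sides

print(np.allclose(x1, x2), np.allclose(x1, x3))   # True True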

IV Linear Independence. Bases


Two vectors v and w are linearly independent if one cannot be expressed
as a scaled version of the other. i.e. v 6= αw, for any α ∈ R. More gen-
erally, {v1 , . . . , vk } are linearly independent if each vi cannot be written
as a linear combination of the other vectors i.e.

v1 6 = α2 v2 + α3 v3 + . . . + α k v k , (B.12a)
v2 6 = α1 v1 + α3 v3 + . . . + α k v k , (B.12b)
etc . . .


Equivalently,

α1 v1 + . . . + αk vk = 0 ⇐⇒ α1 , . . . , αk = 0 .

(the right arrow is the more difficult one to prove; the left one is triv-
ial).
That is, linearly dependent vectors are redundant: at least one can
be expressed as a combination of the others, so there are fewer than k
degrees of freedom in this system.
A basis of Rn is a set of vectors {v1 , . . . , vk } such that:

• {v1 , . . . , vk } are linearly independent

• Any vector v can be written as a linear combination of {v1 , . . . , vk }

v = λ1 v1 + λ2 v2 + . . . + λ k v k (these vectors span the whole space)


(B.13)

If {v1 , . . . , vk } is a basis, then the decomposition Eq. (B.13) is unique.

Example

e1 = (1, 0, 0), e2 = (0, 1, 0) and e3 = (0, 0, 1) form a basis of R3 . Because


these vectors are so intuitive, it is further called a canonical8 basis. 8
None of them got canonized, however
 
v = (1, 2, 3)  =⇒  v = 1·e₁ + 2·e₂ + 3·e₃

A basis of Rn has always exactly n vectors.

It is trivial to verify that with this choice of vectors, the only possible
way for λ1 e1 + λ2 e2 + λ3 e3 to be zero is if λ1 = λ2 = λ3 = 0 (Lin-
ear independence). Since the space is 3-dimensional, these 3 linearly-
independent vectors form a basis of it. QED


V Inner Products and Orthonormality


An inner product on ℝⁿ is a function ⟨·, ·⟩ : ℝⁿ × ℝⁿ → ℝ, such that:

Bilinear: ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ ;  ⟨x, γz + w⟩ = γ⟨x, z⟩ + ⟨x, w⟩    (B.14a)
Symmetric: ⟨x, y⟩ = ⟨y, x⟩    (B.14b)
Positive definite: ⟨x, x⟩ ≥ 0 , and ⟨x, x⟩ = 0 ⇔ x = 0    (B.14c)

Example (Canonical Dot product in Rn )

hv, wi = v · w = v1 w1 + v2 w2 + . . . + vn wn
= ∑ v i wi .
i

Definition 16 (Orthogonality) Two vectors v and w are said to be orthogonal (with respect to a given inner product ⟨·, ·⟩) if ⟨v, w⟩ = 0.

A Classic Case: the Dot Product
v · w = ‖v‖₂ · ‖w‖₂ · cos θ, where θ is the smallest angle between v and w (for non-zero v and w).
So v · w = 0 if and only if cos θ = 0, i.e. v and w are perpendicular.

A basis {v1 , . . . , vn } is:

• orthogonal if ⟨vᵢ, vⱼ⟩ = 0 , ∀i ≠ j .

• orthonormal if ⟨vᵢ, vⱼ⟩ = 0 , ∀i ≠ j , and ⟨vᵢ, vᵢ⟩ = ‖vᵢ‖² = 1 ,


where this norm is induced by the inner product. Clearly, orthonor-


mality is a stronger constraint (orthonormality implies orthogonal-
ity).

Advantages of an Orthonormal Basis


{v1 , . . . , vn } is an orthonormal basis. If v is a given vector, it may be
written as a linear combination of basis vectors:

v = α1 v1 + . . . + α n v n

Now that’s true in any basis, so what we really want to know is how
to find the coordinates (αi ’s), and quickly. Let us project v onto vi , i ∈
{1, · · · , n}:
⟨v, vᵢ⟩ = α₁⟨v₁, vᵢ⟩ + . . . + αₙ⟨vₙ, vᵢ⟩
Of course, because the vⱼ are orthogonal, the cross terms in the sum drop out, so we are left with αᵢ = ⟨v, vᵢ⟩.
That is, one can find the coordinates αi by simple projection onto the
basis vectors. This property is so intuitive that you probably take it for
granted – but it is remarkable, and entirely due to orthonormality. The
result is that we may write:

v = ∑ᵢⁿ ⟨v, vᵢ⟩ vᵢ    (B.15)

As extra gravy, this formula enables us to compute the norm of v: ‖v‖² = ∑ᵢ ⟨v, vᵢ⟩², which is nothing more than Pythagoras’ theorem in n dimensions.

Example (R2 ; dot product)

{v₁, v₂} orthonormal, with v₁ = (1/√2)(1, 1) and v₂ = (1/√2)(−1, 1).


Question Find the coordinates of v = (1, 0) with respect to the basis {v1 , v2 }
i.e. find α1 and α2 such that v = α1 v1 + α2 v2 .

Solution Since {v1 , v2 } is an orthonormal basis,

v = hv, v1 iv1 + hv, v2 iv2 ,

where
v · v₁ = 1·(1/√2) + 0·(1/√2) = 1/√2 ,
v · v₂ = 1·(−1/√2) + 0·(1/√2) = −1/√2 .
Thus:
v = (1/√2) v₁ − (1/√2) v₂ .
Let’s check:
(1/√2) v₁ − (1/√2) v₂ = ½(1, 1) − ½(−1, 1) = (1, 0) .
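The same computation in NumPy (a sketch, not part of the original text):

import numpy as np

v1 = np.array([1., 1.]) / np.sqrt(2)
v2 = np.array([-1., 1.]) / np.sqrt(2)
v = np.array([1., 0.])

alpha1, alpha2 = v @ v1, v @ v2                     # coordinates by projection
print(alpha1, alpha2)                               # 1/sqrt(2), -1/sqrt(2)
print(np.allclose(alpha1 * v1 + alpha2 * v2, v))    # True: v is recovered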

VI Projections
E is a subspace of Rn if:

• ∀v ∈ E, αv ∈ E. (Invariant under scaling)

• ∀v, w ∈ E, v + w ∈ E. (Invariant under addition)


Example (in R3 )

Line Plane (dim= 2)

Projecting onto a Subspace


Idea: E a subspace and v ∈/ E. Find a vector in E closest to v i.e. best
approximation of v by a vector of E.

ImA = { Ax : x ∈ Rn } . (B.16)


Example

Definition 17 E is subspace of Rn . The projection of v on E, denoted by


PE (v), is the unique vector PE (v) ∈ E such that:

‖P_E(v) − v‖ = min_{w∈E} ‖v − w‖ .    (B.17)

A Very Important Application: Least Squares


Problem: Solve Ax = b.

• det A 6= 0 ⇒ Unique solution.

• det A = 0 ⇒ No solution or multiple solutions. What to do?

Image of A = { Av : v ∈ Rn } ≡ Im( A)
The image of A is a subspace of Rn ! For a matrix the image is also known
as the column space. A solution exists if and only if b ∈ Im( A).

Least Squares:

min_x ‖Ax − b‖ =⇒ Best solution .    (B.18)

Equivalent to:

min_{y=Ax} ‖y − b‖    (B.19)
⇔ min_{y∈Im(A)} ‖y − b‖    (B.20)

So, y = PIm( A) b is the solution of (B.18).


Ax = PImA b has a solution.

[Diagram: b and its projection P_{ImA} b onto the subspace ImA.]

How can we compute a projection?
Let E be a subspace of ℝⁿ, with {v₁, . . . , vₖ} an orthonormal basis of E, completed by {vₖ₊₁, . . . , vₙ} into an orthonormal basis of ℝⁿ.

So, v can be decomposed in a basis of Rn as:

v = α1 v1 + . . . + α n v n .

If w ∈ E i.e. w = β 1 v1 + . . . + β k vk :

min kv − wk =?
w∈ E

Given that:

v − w = ( α 1 − β 1 ) v 1 + . . . + ( α k − β k ) v k + α k +1 v k +1 + . . . + α n v n ,

we have:

kv − wk2 = (α1 − β 1 )2 + . . . + (αk − β k )2 + α2k+1 + . . . + α2n .


The minimum is obtained when β 1 = α1 , β 2 = α2 , . . . and β k = αk
i.e. when
w = α1 v1 + . . . + α k v k .
Therefore:
PE (v) = α1 v1 + . . . + αk vk . (B.21)
and
v = α₁v₁ + . . . + αₖvₖ + αₖ₊₁vₖ₊₁ + . . . + αₙvₙ ,  where the first k terms equal P_E(v).    (B.22)

The solution to the least squares problem A x_ls = P_{ImA} b satisfies⁹:
⁹ In Python, np.linalg.lstsq(A, b) solves the least squares problem
Aᵀ A x_ls = Aᵀ b .
So,
x_ls = (Aᵀ A)⁻¹ Aᵀ b    (B.23)
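A minimal sketch (not from the text) of the overdetermined case, with arbitrary synthetic data, comparing np.linalg.lstsq to the normal equations (B.23):

import numpy as np

# fit a line y = m t + c through noisy points: more equations than unknowns
t = np.linspace(0, 10, 50)
y = 2.0 * t + 1.0 + 0.3 * np.random.randn(t.size)

A = np.column_stack([t, np.ones_like(t)])       # design matrix
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)    # recommended route

x_ne = np.linalg.solve(A.T @ A, A.T @ y)        # normal equations, Eq. (B.23)
print(x_ls, np.allclose(x_ls, x_ne))            # slope ~2, intercept ~1; True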


Solving Ax = b
A solution exists if and only if b ∈ ImA, with A a m × n matrix.

Typically:

1. Unique solution ⇒ m = n, det A 6= 0.

2. No solution ⇒ m>n More equations than unknowns.


→ Least squares

3. Multiple solutions ⇒ m<n Fewer equations than unknowns.

What to do? We need to make an assumption to obtain a unique solu-


tion. For example, we can choose to seek the solution with the smallest
Euclidean (L2 ) norm, which is the least squares solution.

VII Vector Spaces

Generalities
A vector space is a set V, where two operations:

1. Addition,

2. Multiplication by a scalar (R or C),

are defined and satisfy the following properties:

Addition

P1a) v1 + ( v2 + v3 ) = ( v1 + v2 ) + v3 . (Associativity)
P1b) v1 + v2 = v2 + v1 . (Commutativity)
P1c) There exists 0 ∈ V , such that 0 + v = v + 0 . (Identity element)
P1d) ∀v ∈ V, there exists − v ∈ V, such that v + (−v) = 0 . (Inverse element)

Scalar multiplication

P2a) λ (v1 + v2 ) = λv1 + λv2 , ∀ λ ∈ R or C and ∀ v1 , v2 ∈ V . (Distributivity w.r.t. vector addition)


P2b) (λ1 + λ2 ) v = λ1 v + λ2 v, λ1 , λ2 ∈ R and v ∈ V . (Distributivity w.r.t. scalar addition)
P2c) λ1 ( λ2 ) v = ( λ1 λ2 ) v . (Compatibility of vector and scalar mult.)
P2d) 1 · v = v, ∀ v ∈ V . (Identity element)


Example

1. Rn .

2. Cn .

3. C ([ a, b]) = the set of continuous functions on the interval [ a, b]

( f + g ) ( x ) = f ( x ) + g ( x ).
( λ f ) ( x ) = λ f ( x ).

4. L2 (R), the set of square integrable functions on R. That is, all functions f
such that: ∫_{−∞}^{∞} |f(x)|² dx < ∞ .

5. Mn (R) = the space of n × n matrices with real coefficients.

Theorem 5 Every vector space has a basis.

If # of basis < ∞ ⇒ finite dimensional Examples: 1, 2 and 5.

If # of basis → ∞ ⇒ infinite dimensional Examples: 3 and 4.

Application to Fourier Series


Consider the space (i.e. E) of periodic functions on [0, 2π ] which satisfy:
∫₀^{2π} |f(θ)|² dθ < ∞ ,    (B.24)

An inner product on that space is given by:


⟨f, g⟩ = ∫₀^{2π} f(θ) g(θ) dθ .    (B.25)

Indeed it satisfies the properties of the inner product (B.14)

1. Linearity (B.14a)
⟨αf + βg, h⟩ = ∫₀^{2π} (αf + βg) h dθ = α ∫₀^{2π} f h dθ + β ∫₀^{2π} g h dθ = α⟨f, h⟩ + β⟨g, h⟩ .    (B.26a)

2. Symmetry (B.14b)
⟨f, g⟩ = ∫₀^{2π} f g dθ = ∫₀^{2π} g f dθ = ⟨g, f⟩ .


3. Positive definiteness (B.14c)


⟨f, f⟩ = ∫₀^{2π} |f(θ)|² dθ ≥ 0 .    (B.26b)
⟨f, f⟩ = 0 ⇔ ∫₀^{2π} |f(θ)|² dθ = 0 ⇒ f = 0 .

The norm induced by the inner product is


‖f‖ = (⟨f, f⟩)^{1/2} = ( ∫₀^{2π} |f(θ)|² dθ )^{1/2} .    (B.27)

Now consider the following set of functions:

{1} ∪ {cos(nθ)}ₙ₌₁^∞ ∪ {sin(nθ)}ₙ₌₁^∞ ,    (B.28)

i.e. {1, cos θ, sin θ, cos 2θ, sin 2θ, . . .}. These functions are vectors in E.
Do they form a basis of it?
One can check that:
∫₀^{2π} cos nθ cos mθ dθ = 0 , for n ≠ m .    (B.29a)
∫₀^{2π} sin nθ sin mθ dθ = 0 , for n ≠ m .    (B.29b)
∫₀^{2π} sin nθ cos mθ dθ = 0 , for all n, m .    (B.29c)

In vector space, this translates to:

⟨cos nθ, cos mθ⟩ = 0 , for n ≠ m .    (B.30a)
⟨sin nθ, sin mθ⟩ = 0 , for n ≠ m .    (B.30b)
⟨sin nθ, cos mθ⟩ = 0 , for all n, m .    (B.30c)

Also,
∫₀^{2π} (cos nθ)² dθ = π , for n ≥ 1 .    (B.31a)
∫₀^{2π} (sin nθ)² dθ = π , for n ≥ 1 .    (B.31b)
∫₀^{2π} 1² dθ = 2π .    (B.31c)

=⇒ ‖cos nθ‖² = ⟨cos nθ, cos nθ⟩ = π , for n ≥ 1 .    (B.32a)
‖sin nθ‖² = ⟨sin nθ, sin nθ⟩ = π , for n ≥ 1 .    (B.32b)
‖1‖² = 2π .    (B.32c)

Therefore,
{1/√(2π)} ∪ {(1/√π) cos(nθ)}ₙ₌₁^∞ ∪ {(1/√π) sin(nθ)}ₙ₌₁^∞ ,    (B.33)


is an orthonormal set of functions in E.


Let f ∈ E:
⟨f, 1/√(2π)⟩ = ∫₀^{2π} f(θ) (1/√(2π)) dθ =: a₀ .    (B.34a)
⟨f, (1/√π) cos nθ⟩ = ∫₀^{2π} f(θ) (1/√π) cos nθ dθ =: aₙ , n > 0 .    (B.34b)
⟨f, (1/√π) sin nθ⟩ = ∫₀^{2π} f(θ) (1/√π) sin nθ dθ =: bₙ , n > 0 .    (B.34c)

If {v1 , . . . , vn } is an orthonormal basis of Rn , then v = ∑hv, vn ivn ,


so we can express f as a linear combination of these functions with co-
efficients given by the inner product (projection) on each basis function:


f(θ) = a₀/√(2π) + (1/√π) ∑ₙ₌₁^∞ (aₙ cos nθ + bₙ sin nθ)    (Fourier Series Representation) .

That is, can we represent any function f ∈ E as a linear combination of


the functions Eq. (B.33)?
By working harder, one can prove that if,

P_N(θ) = a₀/√(2π) + (1/√π) ∑ₙ₌₁^N (aₙ cos nθ + bₙ sin nθ) ,    (B.35)

then ‖f − P_N‖ → 0 as N → ∞ (the approximation is optimal in the sense of the L² norm).
This answers the question with a resounding yes (yes in the L² sense): provided we add an infinity of sines and cosines, we can represent any periodic function this way. For some functions, just a few terms will suffice, but that will not be true in general, leading to phenomena like Gibbs oscillations (Lab 7). In other words, the space of periodic functions is a vector space, but it is of infinite dimension, which makes low-dimensional approximations challenging.

Appendix C

CIRCLES AND SPHERES

I Trigonometry
Trigonometric functions

Figure C.1: Angle relations in the right triangle ABC, with the right angle β = π/2 at B, legs a and c, and hypotenuse b, so that a² + c² = b².

Sine
sin(α) = a/b
α = arcsin(a/b) = sin⁻¹(a/b)

Figure C.2: Sine and arcsine functions.

Cosine

cos(α) = c/b
α = arccos(c/b)

Figure C.3: Cosine and arccosine functions.

Note Angles are to be given in radians:


α[degrees] = α[rad] · 180/π

Tangent

tan(α) = a/c    (C.1)
α = arctan(a/c)    (C.2)

Figure C.4: Tangent and arctangent functions.

Trigonometric relationships

tan(α) = sin(α)/cos(α)    (C.3)
sin²(α) + cos²(α) = 1    (C.4)
sin(α) = ±√(1 − cos²(α));  cos(α) = ±√(1 − sin²(α))    (C.5)


Symmetry

sin(−α) = − sin(α) (C.6)


cos(−α) = cos(α) (C.7)
tan(−α) = − tan(α) (C.8)

Periodicity
sin(α + π/2) = cos(α)    (C.9)
sin(α + π) = −sin(α)    (C.10)
sin(α + 2π) = sin(α)    (C.11)
etc.

Angle sum
sin(α ± β) = sin(α) cos(β) ± cos(α) sin(β)    (C.12)
cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)    (C.13)

Arbitrary triangles
Figure C.5: Arbitrary triangle with vertices A, B, C, angles α, β, γ and opposite sides a, b, c.

Law of cosines
c² = a² + b² − 2ab cos(γ)    (C.14)

Law of sines
sin(α)/a = sin(β)/b = sin(γ)/c    (C.15)

Example (Vector components; vector notation: v ≡ ~v)
Figure C.6: Vector components of v, with the azimuth α measured from North (N); E denotes East.


v_E = |v| · sin(α)
v_N = |v| · cos(α)
|v| = √(v_E² + v_N²)
α = arctan(v_E / v_N)  (result in [0, π])  = arctan2(v_E, v_N)  (numerically; result in [−π, π])
θ
r (r,θ,φ)
Example (Spherical coordinates)
y
Spherical coordinates: (τ, θ, ϕ) v.s. ( x, y, z) φ

θ = co-latitude ∈ [0; π ] (C.16) x


h π πi
λ = latitude = π − θ ∈ − ; (C.17)
2 2
ϕ = longitude ∈ [0; 2π [ (C.18)
Figure C.7: Spherical coordinates


r = √(x² + y² + z²);  θ = arccos(z/r);  ϕ = arctan2(y, x)
x = r cos(ϕ) sin(θ);  y = r sin(ϕ) sin(θ);  z = r cos(θ)

Note that:

• basis vectors er , eθ , e ϕ are f (θ, ϕ); e ϕ is undefined for θ = 0 or θ = π;

• in a Cartesian system:
     
e_r = (sin θ cos ϕ, sin θ sin ϕ, cos θ),  e_θ = (cos θ cos ϕ, cos θ sin ϕ, −sin θ),  e_ϕ = (−sin ϕ, cos ϕ, 0);

• for integration, a surface element is:

dA = r2 sin(θ )dϕdθ

Likewise for volume:


dV = r2 sin(θ )drdθdϕ

• the gradient in spherical coordinates is:

∇f(r, θ, ϕ) = (∂f/∂r) e_r + (1/r)(∂f/∂θ) e_θ + (1/(r sin θ))(∂f/∂ϕ) e_ϕ
Figure C.8: Surface integration on a sphere.
Horizontal surface gradient (r = 1): ∇_h = ( ∂/∂θ , (1/sin θ) ∂/∂ϕ )

Example (Distance on sphere)


Given two points with longitude and latitude (ϕ₁, λ₁) and (ϕ₂, λ₂), find the great-circle distance s in radians:

s = arccos( sin(λ₁) sin(λ₂) + cos(λ₁) cos(λ₂) cos(ϕ₁ − ϕ₂) )

This is a poor formula numerically for small s (why?); much better to use:

s = 2 arcsin( [ sin²((λ₁ − λ₂)/2) + cos(λ₁) cos(λ₂) sin²((ϕ₁ − ϕ₂)/2) ]^{1/2} )

Distance in m, assuming the Earth is a sphere:

d = R · s, with R = 6371 · 10³ m for Earth

For more, see the Aviation Formulary, by Ed Williams.
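A sketch of the numerically robust formula in Python (not from the text; the function name and test points are illustrative):

import numpy as np

def great_circle_km(lon1, lat1, lon2, lat2, R=6371.0):
    """Great-circle distance between two (lon, lat) points in degrees, returned in km."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    s = 2 * np.arcsin(np.sqrt(np.sin((lat1 - lat2) / 2) ** 2
                              + np.cos(lat1) * np.cos(lat2)
                              * np.sin((lon1 - lon2) / 2) ** 2))
    return R * s

print(great_circle_km(-118.3, 34.0, 2.35, 48.85))   # Los Angeles to Paris, ~9100 km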


Example (Average directions and orientation)
1. Given a set of N unit vectors, find the mean azimuth α:
⟨α⟩ = arctan2( ∑ᵢ v_{iE} / N , ∑ᵢ v_{iN} / N )
Figure C.9: Mean azimuth of unit vectors.

2. Given a set of N π-periodic orientations, find the mean azimuth:
αᵢ = arctan2( o_{iE} , o_{iN} )
õ_{iE} = sin(2αᵢ) ,  õ_{iN} = cos(2αᵢ)
⟨α⟩_π = ½ arctan2( ∑ᵢ õ_{iE} / N , ∑ᵢ õ_{iN} / N )
Figure C.10: Mean azimuth of π-periodic orientations.

3. Rose diagrams: sector area A = (Δα/2) r²
Figure C.11: Rose diagrams.
α

1
this was generally in order to compute
financial gains, so the research in this
II Complex Numbers area of mathematics was rather hand-
somely subsidized by the nascent bank-
ing industry
The Cardano-Tartaglia Equation
In sixteenth century Italy, there was a lot of interest in solving certain
polynomial equations, in particular cubics1 . Girolamo Cardano noticed
something strange when he applied his formula to certain cubics. When

solving x³ = 15x + 4 he obtained an expression involving √−121. Car-
dano knew that you could not take the square root of a negative number
yet he also knew that x = 4 was a solution to the equation. He wrote
to Niccolo Tartaglia, another algebrist of the time, in an attempt to clear
up the difficulty. Tartaglia certainly did not understand. In Ars Magna,
Cardano’s claims to fame, he gives a calculation with “complex num-
bers” to solve a similar problem but he really did not understand his
own calculation which he says is “as subtle as it is useless.”
Later it was recognized that if one just allowed oneself to write i = √−1, then √−121 would just be 11i. If one could just bear the notion
that i was a permissible thing to write, it turned out, a polynomial of
Figure C.12: Gerolamo Cardano


degree n would always admit n solutions² – but the solutions would be “halfway between being and nothingness” (Leibniz), because they would involve this imaginary number i, which had no basis in reality.
² this is known as the Fundamental Theorem of Algebra

Life in the complex plane


Complex Algebra

Leibniz’s characterization was rather prescient, it turns out. Today


we define a complex number z as

z = x + iy (C.19)

with (x, y) two real numbers and i = √−1 as before (that is, i² = −1).
We call x the real part (ℜ(z)) and y the imaginary part (ℑ(z)), so it really
is “halfway between being and nothingness”. That not does make it
useless. In fact, it is applicable to so many areas of physics, mathematics Figure C.13: Niccolò Tartaglia
& engineering that many people consider complex numbers to be just
as “real” as real numbers. Excepts for integers, all of them are mental
constructs anyway, so we might as well use mental constructs that can
solve an astounding variety of problems. Here we will mostly use them
to understand cyclicities.
A complex number may be charted in the complex plane (Fig. C.14),
which is oriented by the real axis along (1, 0) and the imaginary axis
along (0, i ). Let us also define the define the complex conjugate of z, de-
noted z̄ or z∗ :
z∗ = x − iy (C.20)
That is, z∗ is in the image of z by reflection along the real axis.
If a complex number is just a point on a plane, why not just call it a
vector of R2 ? In R2 we know how to add two vectors, or multiply them
by a scalar. We even have an inner product (that turns two vectors into
one scalar) and an outer product (that turns two vectors into a vector
perpendicular to that plane, so outside the original space): in R2 there Figure C.14: Polar representation and
complex conjugate. Two conjugates have
is no rule to multiply two elements and yet stay in R2 . The space of the same modulus but opposite argu-
complex numbers C does have such a rule. In addition to the usual ments. Credit: Wikipedia
properties:

z + z0 = ( x + x 0 ) + i (y + y0 ) (addition) (C.21a)
λz = λx + iλy (scalar multiplication) (C.21b)

there is an inner multiplication:

zz0 = ( xx 0 − yy0 ) + i ( x 0 y + y0 x ) (C.22)

which, needless to say, is also in C. Because of this, C is more like a


richer version of R, one that involves two coordinates but is endowed


with the same two rules of addition and multiplication. And while a
polynomial of degree n may not always have roots in R, it always has n
roots in C – that is totally wild.

Polar representation

It is often extremely useful to represent z in terms of its distance to


the origin (which we term the modulus, and denote |z|), and an angle φ,
called its argument. The modulus verifies:
|z| = √(x² + y²) = √(z z*)    (C.23)

And the argument may be written (for any non-zero complex z):
ϕ = arctan(y/x)    (C.24)
Any complex number (except zero) may be written in polar form:

z = reiϕ (C.25) Figure C.15: Euler’s formula. Credit:


Wikipedia

where e is Euler’s number as usual. What’s magical about this is Euler’s


formula:
eiϕ = cos ϕ + i sin ϕ (C.26)

so
z = r (cos ϕ + i sin ϕ) (C.27)

Setting r = 1 defines the unit circle (Fig. C.17), numbers whose real
part is given by cos ϕ and imaginary part given by sin ϕ. Conversely,
one may define sines and cosines this way:

cos ϕ = (e^{iϕ} + e^{−iϕ})/2    (C.28a)
sin ϕ = (e^{iϕ} − e^{−iϕ})/(2i)    (C.28b)
For the operations of multiplication, division, and exponentiation of
complex numbers, it is generally much simpler to work with complex
numbers expressed in polar form rather than rectangular form. From
the laws of exponentiation, for two numbers z1 = r1 eiϕ1 and z2 = r2 eiϕ2 :

Multiplication: r₁e^{iϕ₁} · r₂e^{iϕ₂} = r₁r₂ e^{i(ϕ₁+ϕ₂)} (Fig. C.16)
Division: r₁e^{iϕ₁} / (r₂e^{iϕ₂}) = (r₁/r₂) e^{i(ϕ₁−ϕ₂)}
Exponentiation: (re^{iϕ})ⁿ = rⁿ e^{inϕ}
Figure C.16: Complex multiplication.


the latter is known as De Moivre’s formula, and explains a whole lot of


trigonometric relations without lifting a finger. For instance, it explains
cos 2θ = cos2 θ − sin2 θ and sin 2θ = 2 sin θ cos θ.
The notation also allows to re-express many famous numbers in terms
of their argument. For instance,

1 = ei0
−1 = eiπ
i = eiπ/2
−i = ei3π/2

because who wouldn’t want to use three symbols (two irrational num-
bers and an imaginary one) to write the number one?

Figure C.17: The unit circle, defined as all


complex numbers with unit modulus.

Further, one quickly notices that this notation is everything but unique.
Indeed adding any multiple of 2π means going around the merry-go-
round that many times, so we say that a complex argument is defined
“modulo 2π”. Hence we should have written 1 = eik2π , where k is any
positive or negative integer (i.e. , k ∈ Z), and so on for the others. This
“modulo" business simply expresses periodicity.

Roots of Unity
An nth root of unity, where n ∈ N∗ , is a number z satisfying the deceit-
fully simple equation
zn = 1 (C.29)


Now, using the polar notation z = reiϕ , we have by De Moivre’s formula:

r n einϕ = 1 (C.30)
Two complex numbers are equal if and only if they have the same mod-
ulo and argument, so for this to work, rⁿ = 1, yielding r = 1: the so-
lution must be on the unit circle (Fig. C.17). What about its argument?
It must verify:
e^{iϕ} = (e^{i2πk})^{1/n}    (C.31)

Hence ϕ = 2πk/n, k ∈ ℤ. Now, ℤ is infinite, but clearly we can’t have an


infinity of solutions: every time we go around the the merry-go-round
(i.e. when k is any multiple of n), we get back where we started. It is easy
to show that they can be only n distinct solutions, called the primitive
nth roots of unity, verifying

Wnk = ei2πk/n , k ∈ {0, · · · , n − 1} (C.32)

These are some of the coolest numbers you’ll ever meet. The third
roots of unity are illustrated in Fig. C.18, and you can see that they cut the unit circle in three equal slices. In general, nth roots of unity split the unit circle in n equal slices³: they are cyclotomic. The roots are always located at the vertices of a regular n-sided polygon inscribed in the unit circle, with one vertex at 1.
Figure C.18: Third roots of unity.
³ This proves invaluable in cutting pies without a protractor, i.e. in most social circumstances.
a cornerstone of Fourier analysis. Indeed, the sequence of powers

· · · , Wₙ⁻¹ , Wₙ⁰ , Wₙ¹ , · · ·    (C.33)
is n-periodic (because Wₙ^{j+n} = Wₙ^j · Wₙⁿ = Wₙ^j · 1 = Wₙ^j for all values of j), and the n sequences of powers
sₖ : · · · , Wₙ^{k·(−1)} , Wₙ^{k·0} , Wₙ^{k·1} , · · ·    (C.34)
for k ∈ {1, · · · , n} are all n-periodic (because Wₙ^{k(j+n)} = Wₙ^{kj}). Even
more powerful, the set {s1 , · · · , sn } of these sequences is a basis of the
linear space of all n-periodic sequences. This means that any n-periodic
sequence of complex numbers

· · · , z −1 , z 0 , z 1 , · · · (C.35)

can be expressed as a linear combination of powers of a primitive nth


root of unity:
zⱼ = ∑ₖ₌₀^{n−1} Zₖ · Wₙ^{k·j} = Z₁ · Wₙ^{1·j} + · · · + Zₙ · Wₙ^{n·j}    (C.36)

for some complex numbers { Z1 , · · · , Zn } and every integer j. This is


a form of Fourier analysis. If j is a (discrete) time variable, then k is a


frequency and Zk is a complex amplitude. Choosing for the primitive


nth root of unity:
   
W = e^{i2π/n} = cos(2π/n) + i sin(2π/n)    (C.37)

allows z j to be expressed as a linear combination of cos and sin func-


tions:

ℜ(zⱼ) = ∑ₖ₌₀^{n−1} [ Aₖ cos(k·2πj/n) + Bₖ sin(k·2πj/n) ]    (C.38)

where ( Ak , Bk ) are sequences of real numbers. This is a discrete Fourier


transform, without which no modern telecommunication system would
exist. The goal of Fourier analysis is to find those sequences (rapidly),
and this is done at length in Chapter 7.
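A minimal sketch (not from the text) showing that powers of a primitive root of unity reproduce NumPy's FFT; the size n and the random sequence are arbitrary.

import numpy as np

n = 8
W = np.exp(2j * np.pi / n)                        # a primitive nth root of unity
j, k = np.meshgrid(np.arange(n), np.arange(n))
F = W ** (-j * k)                                 # DFT matrix (sign convention of np.fft)

z = np.random.randn(n) + 1j * np.random.randn(n)  # an arbitrary n-periodic sequence
print(np.allclose(F @ z, np.fft.fft(z)))          # True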

III Spherical harmonics

The harmonic equation


If you ever studied mechanics, you’ve doubtless encountered the har-
monic equation, describing the small oscillations of a mass on a spring,
or a pendulum, about their equilibrium position:

ẍ + ω0 2 x = 0 (C.39)

where ω0 is a constant (in the case of the spring, ω0 2 is the ratio of its
stiffness to the mass of the mobile). You may have seen that the solutions
to such equations take the form

x (t) = A cos(ω0 t + φ) or; (C.40a)


x (t) = A cos(ω0 t) + B sin(ω0 t) (C.40b)

where A, B and φ are constants determined by the initial conditions.


(the two formulations above are equivalent). A and B are called ampli-
tudes, and φ is a phase (in radians).
Hence, sines and cosines are solutions to the harmonic equation, and
are therefore called harmonic functions. In fact, many usual functions
may be found as the solutions to well-known differential equations such
as Eq. (C.39), which arise very often in physics, especially in the study
of oscillations and vibrations. Spherical harmonics are a generalization
of sines and cosines on a sphere.


Harmonics on the sphere


Solutions of Laplace’s equation (aka the harmonic equation) describing
free oscillations in a spherical medium:
∇²f = (1/r²) ∂/∂r( r² ∂f/∂r ) + (1/(r² sin θ)) ∂/∂θ( sin θ ∂f/∂θ ) + (1/(r² sin²θ)) ∂²f/∂ϕ² = 0

Eigenfunctions of orbital angular momentum operator in quantum


mechanics (orbitals), normal modes of seismology, and form a natural
set of orthonormal basis functions on the sphere.
 √
 2N P (cos(θ )) cos(mϕ) for m ≥ 0
lm lm
Ylm = √
 2N P (cos(θ )) sin(mϕ) for m < 0
lm lm

where l =degree, m =order and m ∈ [−l; l ].

Associated Legendre functions

P_{lm}(x) = (−1)ᵐ (1 − x²)^{m/2} dᵐ/dxᵐ [P_l(x)]
P_l(x) = 1/(2^l l!) d^l/dx^l [(x² − 1)^l]
where P₀₀ = 1, P₁₀ = x, P₂₀ = ½(3x² − 1), P₂₂ = 3(1 − x²).
Figure C.19: The Y_{lm} spherical harmonic for degree l = 2 and order m = 2, plotted via sphere3d.m.

Normalization
Orthonormality:
I = ∫₀^π dθ ∫₀^{2π} dϕ sin(θ) Y_{lm} Y_{l′m′} = δ_{ll′} δ_{mm′} ,  with N_{lm} = √[ (2l + 1)(l − m)! / (4π (l + m)!) ]
where δ is the Kronecker δ: δ_{ij} = 1 for i = j, 0 for i ≠ j. This convention is used in physics and seismology.
In geodesy, I = 4π δ_{ll′} δ_{mm′} ;  N_{lm} = √[ (2l + 1)(l − m)!/(l + m)! ] .
In magnetics, I = (4π/(2l + 1)) δ_{ll′} δ_{mm′} ;  N_{lm} = √[ (l − m)!/(l + m)! ] .

An illustration is offered in Fig. C.19. For more extensive visual-


izations, see http://geodynamics.usc.edu/~becker/teaching-sh.
html.


Spherical harmonics expansions (SHE)


In analogy to Fourier Transform in 2D, one can use Spherical Harmonics
to convert a field given on the sphere to a harmonic representation:

f(θ, ϕ) = ∑_{l=0}^{∞} ∑_{m=−l}^{l} f_{lm} Y_{lm}(θ, ϕ)    (C.41)

This holds for L2 norm convergence:

lim_{L→∞} ∫₀^{2π} dϕ ∫₀^{π} dθ sin(θ) | f(θ, ϕ) − ∑_{l=0}^{L} f_{lm} Y_{lm}(θ, ϕ) |² = 0

Note Compare |v| = √(∑ᵢ vᵢ²) with the expression above.

Lab 1 illustrates how to obtain spherical harmonics expansion coeffi-


cients via numerical integration.
f_{lm} = ∫_Ω f(θ, ϕ) Y_{lm}(θ, ϕ) dΩ = ∫₀^{2π} dϕ ∫₀^{π} dθ sin(θ) f(θ, ϕ) Y_{lm}(θ, ϕ)

Once fields are expressed as spherical harmonics, one can easily compute derived quantities like their power spectrum, derivatives, integrals, etc.

Appendix D

DIAGONALIZATION

Diagonal matrices are exceedingly easy to manipulate (multiply, invert,


compute powers, etc). Diagonalization is all about transforming a non-
diagonal matrix to diagonal form, to take advantage of these superpow-
ers. This is used abundantly in Chapters 12 through 15.

I Earth Science Motivation


Example (Radioactive decay chain)

Consider the simple decay of a radioactive element to a stable daugh-


ter element (e.g. ⁸⁷Rb → ⁸⁷Sr). In this case the number of atoms as a
function of time N (t) is given by

dN (t)
= −λN (D.1)
dt
where λ is a constant.
λ1 λ2 λ3
We can also have a more involved case, e.g. Bk −→ Bz −→ J −→ Rn.
In this case the numbers of atoms as a function of time are given by

dN1 (t)
= −λ1 N1
dt
dN2 (t)
= −λ2 N2 + λ1 N1
dt
dN3 (t)
= −λ3 N3 + λ2 N2 (D.2)
dt
and we can rewrite this system of differential equations as
    
d/dt [N₁(t); N₂(t); N₃(t)] = [−λ₁ 0 0; λ₁ −λ₂ 0; 0 λ₂ −λ₃] [N₁(t); N₂(t); N₃(t)] ,  with A the 3 × 3 matrix above.    (D.3)

After n time-steps the solution to the system of equations is given by the


nth power of the matrix A, An . If we could diagonalise A, that is find a
matrix Λ such that

A = VΛV −1 (D.4)

then we would have

An = VΛn V −1 (D.5)

and the problem would be solved.

II Eigendecomposition
Eigenvalues and eigenvectors
Given the matrix A, λ is an eigenvalue of A if and only if ∃ v 6= 0 such
that Av = λv. The case v = 0 is trivial since it’s always true. We can
rewrite Av = λv as follows

Av = λv ⇔ Av − λv = 0 ⇔ (A − λI)v = 0 ⇔ v ∈ N(A − λI) (null space of A − λI).    (D.6)

N(A − λI) is non-trivial only if the matrix (A − λI) is singular, i.e.

det(A − λI) = P(λ) = 0    (D.7)

where P(λ) is called characteristic polynomial.

A simple example
Consider the case
   
2 1 2−λ 1
A=  A − λI =   (D.8)
1 2 1 2 − λ.

We have to solve

det( A − λI ) = (2 − λ)2 − 1 = λ2 − 4λ + 3 = 0 (D.9)

We have
√Δ = √(b² − 4ac) = 2 ∈ ℝ    (D.10)

so we’ll have two real solutions. These are given by


λ₁,₂ = (−b ± √Δ)/(2a) = 2 ± 1 , i.e. λ₁ = 3 and λ₂ = 1 .    (D.11)

What are the eigenvectors? We have to look for solutions v such that
Av = λ1,2 v.


• For λ1 we have
     
[2 1; 1 2] [x₁; y₁] = 3 [x₁; y₁] ⇔ { 2x₁ + y₁ = 3x₁ , x₁ + 2y₁ = 3y₁ } ⇔ x₁ − y₁ = 0 ⇔ v₁ = c₁ [1; 1] ,  c₁ ∈ ℝ*    (D.12)

• For λ2 we have
     
[2 1; 1 2] [x₂; y₂] = 1 [x₂; y₂] ⇔ { 2x₂ + y₂ = x₂ , x₂ + 2y₂ = y₂ } ⇔ x₂ + y₂ = 0 ⇔ v₂ = c₂ [−1; 1] ,  c₂ ∈ ℝ*    (D.13)

As we can see, eigenvectors are arbitrary up to a non-zero constant, i.e.


what matters is only their direction.
The two eigenvectors are orthogonal to each other:
v₁ · v₂ = v₁ᵀ v₂ = [1 1] [−1; 1] = 0 .    (D.14)

We can define two unit vectors v10 , v20 associated to the eigenvalues we
have found
v₁ · v₁ = v₁ᵀ v₁ = ‖v₁‖² = 2 ⇒ v₁′ = (1/√2) v₁    (D.15)
v₂ · v₂ = v₂ᵀ v₂ = ‖v₂‖² = 2 ⇒ v₂′ = (1/√2) v₂    (D.16)

Then we can write
vᵢ′ · vⱼ′ = δᵢⱼ = 1 if i = j , 0 if i ≠ j    (D.17)
 0 if i 6= j

so the two vectors v₁′, v₂′ are orthonormal. Out of these two vectors we can build a matrix V = (v₁′ v₂′) which is orthogonal, i.e.
Vᵀ V = V Vᵀ = I₂ (2-dimensional identity matrix).    (D.18)
This means that V is invertible and the inverse matrix is given by Vᵀ.
Defining
Λ = [λ₁ 0; 0 λ₂]    (D.19)


we have, by definition, AV = VΛ and, consequently, we have the canonical expansion

A = VΛV⁻¹ or Λ = V⁻¹AV.    (D.20)

The original matrix A has been diagonalized.
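The same result with NumPy (a sketch, not part of the original text):

import numpy as np

A = np.array([[2., 1.], [1., 2.]])
lam, V = np.linalg.eig(A)          # eigenvalues and (column) eigenvectors
print(lam)                         # 3 and 1 (order may vary)
print(np.allclose(V @ np.diag(lam) @ np.linalg.inv(V), A))   # True: A = V Lambda V^-1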

General case
The previous example represented a very special case. In fact
• A was a 2 × 2 matrix so P(λ) was a polynomial of order 2 in λ and
the eigenvalues have been calculated easily. Things are not quite so
simple for dimension n > 2 – finding the roots of the polynomial
needs to happen numerically.

• A was real and symmetric (i.e. Aᵀ = A). It can be shown that in such a case the eigenvalues are always real and the corresponding eigenvectors are orthogonal.
In general, one always starts with a characteristic polynomial defined as
P(λ) = det( A − λIn ) where A ∈ Mn×n (R) with Mn×n (R) the space
of real n × n matrices. Then, in order to find the eigenvalues and eigen-
vectors, the following steps have to be performed.
1. The characteristic polynomial can always be factorized as follows

P ( λ ) = ( λ − λ 1 ) n1 ( λ − λ 2 ) n2 . . . ( λ − λ k ) n k (D.21)

where λ1 , λ2 . . . λk are the roots of the of characteristic polynomial


and n1 , n2 . . . nk are the corresponding multiplicities. The multiplici-
ties are such that
k
∑ ni = n. (D.22)
i =1

This decomposition is always possible in C, not always in R.


Once the eigenvalues have been identified by solving the equation
P(λ) = 0 (numerically, if necessary) one must find the corresponding
eigenvectors V = (v1 , v2 , . . . , vk ). The diagonalization of matrix A
yields a matrix with the following structure:
 
Λ = [Λ₁ 0 … 0; 0 Λ₂ … 0; … ; 0 0 … Λₖ] ,  where each Λᵢ is a diagonal block with the eigenvalue λᵢ repeated nᵢ times on its diagonal.

2. Solve for Avi = λi vi for each i ≤ i ≤ k (Gaussian elimination will


yield vi ). The vi ’s are not repeated, as vi can be a matrix ni × ni if
ni > 1.


3. If desired, normalize all vi ’s.

What are the advantages of going through this procedure? There are
several of them:

• once the diagonal matrix has been found it is easy to raise the matrix
to any power m: Am = VΛm V −1 where Λm can be readily obtained
from Λ
 
Λᵐ = diag(λ₁ᵐ, . . . , λₙᵐ)    (D.23)

and for m = −1 we get the inverse matrix


 
A⁻¹ = V diag(1/λ₁, . . . , 1/λₙ) V⁻¹    (D.24)

which is possible only in the case in which all of the λi are non-zero;

• a zero eigenvalue means that ∃ v 6= 0 | Av = 0. v is in the null space


of A and it is denoted as N ( A);

• the number r of non-zero eigenvalues (r ≤ n) yields the rank of A, it


corresponds to the number of linearly independent matrix columns;

• we can identify (n − r ) vectors (vr+1 , . . . , vn ) which define a basis for


N ( A). This basis is orthogonal if A is real and symmetric.

All of this can be done numerically, of course, though it may take a long time for very large matrices¹.
¹ np.linalg.eig()

III Singular value decomposition

Definition
Eigendecomposition is the privilege of square matrices. For non-square
matrices, similar benefits may be obtained via the singular value decom-
position (SVD for short). Any matrix A ∈ Mm×n (R) (or A ∈ Mm×n (C))
admits a singular value decomposition as

A = U Σ V> (D.25)


For m ≥ n we have
 
Σ = diag(σ₁, . . . , σₙ), with σᵢ the singular values    (D.26)
V Vᵀ = Vᵀ V = Iₙ (right singular vectors)    (D.27)
Uᵀ U = Iₙ (left singular vectors)    (D.28)

so
so A (m × n) factors as: a matrix with orthogonal columns (m × n), times a diagonal matrix (n × n), times an orthogonal matrix (n × n).

If A is “fat” (m < n), a similar form exists, with Σ consisting of an m × m diagonal block padded with columns of zeroes.

Properties
What does the singular value decomposition do for us? The number of
non-zero singular values defines the rank of matrix A. In practice, how
do we find U, Σ and V? The matrix ( A> A) is real and symmetric, so it
is diagonalizable and admits orthonormal eigenvectors, i.e.

A> A = V 0 ΛV 0T , V 0T V = I (D.29)

so, if A = UΣV > then

Aᵀ A = (VΣᵀUᵀ)(UΣVᵀ) = VΣᵀ(UᵀU)ΣVᵀ = VΣ²Vᵀ ≡ V′ΛV′ᵀ    (D.30)

so

• V 0 ≡ V: the eigenvectors of A> A are the right singular vectors of A,

• Λ ≡ Σ2 : the eigenvalues of A> A are the squared singular values of


A.


What about U?

A = UΣVᵀ ⇒ AV = UΣ(VᵀV) = UΣ    (D.31)
so Avᵢ = σᵢuᵢ for i ∈ [1, n]. For each σᵢ we have two possibilities:
(i) σᵢ ≠ 0 ⇒ uᵢ = (1/σᵢ) Avᵢ ;
(ii) σᵢ = 0 ⇒ uᵢ is arbitrary (provided that UᵀU = Iₙ).    (D.32)
² In English, the range of A is the set of vectors in the arrival space that can be written as A times a vector from the source space. In other words, it is the image of the source space by the transformation A.

Because of (i) and the orthogonality of the columns of U, the set of


vectors {u1 , . . . , ur } represents a basis for R( A) which is the range of
A = {y ∈ Rn | ∃ x ∈ Rm | y = Ax }2 . Similarly, {vr+1 , . . . , vn } are a ba-
sis for N ( A). One can show that

• {v1 , . . . , vr } is a basis for R( A> )

• {ur+1 , . . . , un } is a basis for N ( A> ).

So the singular value decomposition gives us everything we want to know


about A

• its diagonal form

• its rank

• a basis for N ( A), R( A), N ( A> ), R( A> ) which are the four funda-
mental subspaces. All this with an incredibly efficient algorithm3 . 3
np.linalg.svd()

Low Rank Approximation


The idea is to approximate a matrix by one with low rank, with fit measured by the Frobenius norm⁴, i.e. find the matrix D̂ such that
‖D − D̂‖_F is minimized, subject to rank(D̂) ≤ r    (D.33)
⁴ The Frobenius norm of a matrix A is such that ‖A‖_F² = ∑ᵢ₌₁ᵐ ∑ⱼ₌₁ⁿ |aᵢⱼ|² = trace(A*A) = ∑ᵢ₌₁^{min{m,n}} σᵢ². It is very useful in numerical linear algebra, being invariant under unitary transformations like rotations.
Young-Mirsky theorem, for those who want to show off at cocktail par-
ties. This minimization problem may be solved via (you guessed it)
SVD. Let D = UΣV > ∈ Rm×n , m ≤ n be the singular value decom-
position of D and partition U, Σ =: diag(σ1 , . . . , σm ), and V as follows:

 
U =: [U₁ U₂] ,  Σ =: [Σ₁ 0; 0 Σ₂] ,  and V =: [V₁ V₂] ,    (D.34)

where Σ1 is r × r, U1 is m × r, and V1 is n × r. Then the rank-r matrix,


obtained from the truncated singular value decomposition

D̂* = U₁ Σ₁ V₁ᵀ ,    (D.35)


is such that:
‖D − D̂*‖_F = min_{rank(D̂) ≤ r} ‖D − D̂‖_F = √(σ²ᵣ₊₁ + · · · + σₘ²) .    (D.36)

The minimizer D̂* is unique if and only if σᵣ₊₁ ≠ σᵣ. In other words,


by retaining only r singular values, one gets a matrix of rank r, with a
misfit to the original matrix that is readily quantified by the quadratic
sum of the remaining singular values. This is better than whiskey with-
out hangovers! It means that out of every ill-conditioned system, one
can find a low-rank solution that is better conditioned, therefore invert-
ible. Of course, this comes at the price of a misfit of the original matrix,
but at least we can directly optimize that misfit and use the dual in-
formation to choose the truncation level r that best accomplishes our
goals.
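A minimal sketch (not from the text) of truncated-SVD approximation and of the error formula (D.36), on an arbitrary random matrix:

import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 12))

U, s, Vt = np.linalg.svd(D, full_matrices=False)
r = 5
D_r = U[:, :r] * s[:r] @ Vt[:r, :]          # rank-r truncation, U1 Sigma1 V1^T

err = np.linalg.norm(D - D_r, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[r:] ** 2))))   # True: Eq. (D.36)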
