Statistical Modeling of Extreme Values PDF
1. INTRODUCTION
High wind speeds pose a threat to the integrity of structures such as wind turbines. An
accurate estimation of the occurrence of extreme wind speeds is an important factor in
achieving a correct balance between safety and cost of “over-design”. This design
problem also arises in many other engineering areas such as ocean engineering (with the
wave height), hydraulics engineering (floods), structural engineering (earthquakes) and
also in meteorology (temperatures, rainfall, etc.), fatigue strength (workloads), etc. All
these applications have in common that the interest lies not in the average behaviour of
the analysed phenomena but in their extreme behaviour. The distinguishing feature of an
extreme value statistical analysis is therefore that its objective is to describe not the
usual behaviour of the stochastic phenomena but the unusual and rarely observed events.
For example, suppose that a sea-wall is to be built to protect the coast against all
sea-levels that are likely to occur within its projected life span (for example, 100 years).
Accurate estimation of the highest sea-level in 100 years is necessary in order to balance
economic and safety goals. The problem that the statistical methods face is that the
records of sea-levels may span a much shorter period, say 15 years. The challenge is to
estimate what sea-levels might occur over the next 100 years given the 15 years of data.
An introduction to statistical modeling of extreme values
The extreme wind speed estimates are used to determine the critical design loads which the
turbine must withstand during its lifetime. According to the International Standard
IEC 61400-1 for Wind Turbine Generator Systems, the extreme wind speed Vref is a basic
parameter for wind turbine classes and is therefore strongly related to the design of wind
turbines. Vref is defined as the extreme 10-min average wind speed with a recurrence
period of 50 years. In general, Vref has to be determined statistically on the basis of
on-site measurements.
The statistical theory developed to deal with these problems and this type of data is
known as Extreme Value Theory. The two main purposes of this monograph are to present
its main results and to apply them to the analysis of extreme wind speeds.
http://www.youtube.com/watch?v=oAWMpxX60KM&feature=player_embedded
http://www.youtube.com/watch_popup?v=CqEccgR0q-o
http://www.youtube.com/watch?v=b43lAoovqd8&feature=fvw
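The core of extreme value theory is the study of the statistical behaviour of
$$M_n = \max\{X_1, \ldots, X_n\},$$
where $X_1, \ldots, X_n$ is a sequence of independent random variables with common
distribution function $F$. Under these assumptions the distribution of $M_n$ is
$$\Pr\{M_n \le z\} = \{F(z)\}^n = F^n(z).$$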
Thus, one way to study $M_n$ is to estimate $F$ from the available data (for example, the
10-minute wind speed records measured during a certain interval of time) and then to
substitute this estimate into the formula $\Pr\{M_n \le z\} = F^n(z)$ to estimate the
distribution of $M_n$.
The problem with this approach is that small errors in the estimate of $F$ can lead to
large discrepancies in $F^n$.
An alternative approach is to estimate $F^n$ directly from the extreme data. The idea
is similar to the one used to approximate the distribution of the sample mean.
Following this route, it is necessary to study the behaviour of $F^n$ as $n$ tends to
infinity. This alone is not enough, however, because for any $z < z_{\sup}$,
$F^n(z) \to 0$ as $n \to \infty$, where $z_{\sup}$ is the smallest value of $z$ such that
$F(z) = 1$, so the distribution of $M_n$ degenerates to a point mass at $z_{\sup}$.
To overcome this difficulty and reach a limit different from 0, the following linear
normalization of $M_n$ is allowed:
$$M_n^* = \frac{M_n - b_n}{a_n},$$
where $\{a_n > 0\}$ and $\{b_n\}$ are sequences of constants.
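The effect of the normalization can be illustrated with a case where everything is available in closed form. The sketch below assumes unit exponential observations, $F(z) = 1 - e^{-z}$, and the normalizing constants $a_n = 1$, $b_n = \log n$; both choices are illustrative assumptions and are not taken from the text. It shows $F^n(z)$ collapsing to 0 at a fixed $z$ while the distribution of the normalized maximum approaches a Gumbel limit.

```python
import math

def cdf_max_exponential(z, n):
    """Exact Pr{M_n <= z} = F(z)^n for n i.i.d. unit exponential variables."""
    return (1.0 - math.exp(-z)) ** n if z > 0 else 0.0

def cdf_normalized_max(z, n):
    """Pr{(M_n - b_n)/a_n <= z} with the assumed constants a_n = 1, b_n = log n."""
    return cdf_max_exponential(z + math.log(n), n)

def gumbel_cdf(z):
    """Standard Gumbel distribution function exp(-exp(-z))."""
    return math.exp(-math.exp(-z))

if __name__ == "__main__":
    z = 2.0
    for n in (10, 1000, 100000):
        # Columns: n, unnormalized F^n(z), normalized version, Gumbel limit
        print(n, cdf_max_exponential(z, n), cdf_normalized_max(z, n), gumbel_cdf(z))
```

The first column of probabilities shrinks towards 0 as n grows, whereas the second stabilizes near the Gumbel value.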
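Now, the objective is to find limit distributions for $M_n^*$ with appropriate choices for
$\{a_n\}$ and $\{b_n\}$. The extremal types theorem states that any non-degenerate limit
distribution $G$ belongs to one of the following families:
I. (Gumbel)
$$G(z) = \exp\left\{-\exp\left[-\left(\frac{z-b}{a}\right)\right]\right\}, \quad -\infty < z < \infty$$
II. (Fréchet)
$$G(z) = \begin{cases} 0 & z \le b \\ \exp\left\{-\left(\dfrac{z-b}{a}\right)^{-\alpha}\right\} & z > b \end{cases}$$
III. (Weibull)
$$G(z) = \begin{cases} \exp\left\{-\left[-\left(\dfrac{z-b}{a}\right)\right]^{\alpha}\right\} & z < b \\ 1 & z \ge b \end{cases}$$
for parameters $a > 0$, $b$ and, in the case of families II and III, $\alpha > 0$.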
These three classes of distributions are named the extreme value distributions, with
types I, II and III, respectively, and also known as Gumbel, Fréchet and Weibull
families, respectively.
Observe that these three types of distributions are the only possible limits for the
distributions of the normalized maxima regardless of the distribution F for the
population.
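The three limit types have different forms of tail behaviour. The end point $z_{\sup}$ is
finite for the Weibull family and infinite for the Gumbel and Fréchet families. The
Fréchet family has a heavy tail, verifying that $E(X^r) = \infty$ for $r \ge 1/\xi$, where
$\xi = 1/\alpha > 0$ (which means that it has infinite variance if $1/\xi \le 2$).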
In the past it was usual to adopt one of the three families and then to estimate the
parameters of the model. This approach has a weakness: one of the three models must be
chosen and assumed to be correct, and the uncertainty implied by this choice is not
taken into account in the subsequent inferences. A better analysis can be carried out by
combining the three models into a single family named the generalized extreme value
(GEV) distribution:
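$$G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},$$
defined on $\{z : 1 + \xi(z-\mu)/\sigma > 0\}$, where $\mu$ is the location parameter,
$\sigma > 0$ is the scale parameter and $\xi$ is the shape parameter. The case $\xi = 0$
is interpreted as the limit $\xi \to 0$ and gives the Gumbel family; $\xi > 0$ and
$\xi < 0$ correspond to the Fréchet and Weibull families, respectively.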
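As a minimal sketch, the GEV distribution function can be evaluated as follows. The function name `gev_cdf`, the numerical tolerance used to detect $\xi = 0$, and the handling of points outside the support are choices made here for illustration. Note that some statistical libraries use the opposite sign convention for the shape parameter, so the convention should be checked before comparing results.

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma > 0 and shape xi."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    s = (z - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel case, taken as the limit xi -> 0
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        # Outside the support: below the lower end point when xi > 0,
        # above the upper end point when xi < 0
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only
    for xi in (-0.2, 0.0, 0.2):
        print(xi, gev_cdf(30.0, mu=25.0, sigma=3.0, xi=xi))
```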
This unification simplifies the statistical analysis. The uncertainty in the estimate
of the shape parameter $\xi$ measures the lack of certainty in the choice among the three
models.
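Now, the extremal types theorem can be restated in the following way.
Theorem 2. If there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$\Pr\{(M_n - b_n)/a_n \le z\} \to G(z) \quad \text{as } n \to \infty$$
for a non-degenerate distribution function $G$, then $G$ is a member of the GEV family
$$G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},$$
with location $\mu$, scale $\sigma > 0$ and shape $\xi$.
The difficulty that the normalizing constants are unknown is easily solved in practice,
because if $\Pr\{(M_n - b_n)/a_n \le z\} \approx G(z)$ for large $n$, then
$$\Pr\{M_n \le z\} \approx G\{(z - b_n)/a_n\} = G^*(z),$$
where $G^*$ is another member of the GEV family. The GEV family can therefore be fitted
directly to the unnormalized maxima.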
The above results lead to the following approach for modelling the extremes of a series
of independent and identically distributed observations $X_1, X_2, \ldots$. The first step
consists of blocking the data into sequences of $n$ observations, with $n$ large enough.
Then the maximum $M_i$ of each block $i$ is calculated and, finally, the GEV distribution
is fitted to these block maxima.
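Once the GEV distribution has been fitted, say to the annual maxima, we can calculate the
quantile function $z_p$ of the annual maximum distribution, that is, the solution of
$G(z_p) = 1 - p$:
$$z_p = \begin{cases} \mu - \dfrac{\sigma}{\xi}\left[1 - \{-\log(1-p)\}^{-\xi}\right] & \text{for } \xi \ne 0 \\ \mu - \sigma \log\{-\log(1-p)\} & \text{for } \xi = 0 \end{cases}$$
The quantile $z_p$ is the return level associated with the return period $1/p$. That is,
$z_p$ is the level that is expected to be exceeded on average once every $1/p$ years.
Defining $y_p = -\log(1-p)$, the return level can be written as
$$z_p = \begin{cases} \mu - \dfrac{\sigma}{\xi}\left(1 - y_p^{-\xi}\right) & \text{for } \xi \ne 0 \\ \mu - \sigma \log y_p & \text{for } \xi = 0 \end{cases}$$
Then, if $z_p$ is plotted against $\log y_p$, the plot is linear in the case $\xi = 0$; if
$\xi < 0$ the curve levels off towards the finite limit $\mu - \sigma/\xi$ as $p \to 0$, and
if $\xi > 0$ it grows without bound.
This graph is named a return level plot and it is useful as a validation tool as well as a
way of presenting the fitted model.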
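The return level formula can be sketched in a few lines. The function name `gev_return_level` and the parameter values in the example are hypothetical; they are not estimates from the wind data discussed in this text.

```python
import math

def gev_return_level(p, mu, sigma, xi):
    """Level z_p exceeded by the block maximum with probability p (return period 1/p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    y_p = -math.log(1.0 - p)
    if abs(xi) < 1e-12:
        # Gumbel case
        return mu - sigma * math.log(y_p)
    return mu - (sigma / xi) * (1.0 - y_p ** (-xi))

if __name__ == "__main__":
    # Hypothetical GEV parameters for annual maxima (m/s), for illustration only
    mu, sigma, xi = 25.0, 3.0, -0.1
    for period in (10, 50, 100):
        print(period, gev_return_level(1.0 / period, mu, sigma, xi))
```

With annual maxima, `p = 1/50` gives the 50-year return level, which is the recurrence period used in the definition of Vref.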
The choice of the block length implies a trade-off between bias and variance.
When the blocks are short, the approximation of the distribution of the maxima by the
limit distribution is poor, leading to bias in estimation and extrapolation, while long
blocks generate few maxima, leading to large estimation variance.
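The blocking step of the block maxima approach can be sketched as follows. The helper name `block_maxima` is hypothetical, and dropping an incomplete final block is a design choice assumed here, not something prescribed by the text.

```python
def block_maxima(series, block_length):
    """Split the series into consecutive blocks and return the maximum of each block.

    An incomplete trailing block is dropped so that all maxima come from
    blocks of the same length.
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    n_blocks = len(series) // block_length
    return [
        max(series[i * block_length:(i + 1) * block_length])
        for i in range(n_blocks)
    ]

if __name__ == "__main__":
    # Toy data: 7 observations, blocks of 3, so the last observation is dropped
    print(block_maxima([1, 5, 2, 7, 3, 4, 9], 3))
```

For hourly records, a block length of 24 would give daily maxima; longer blocks reduce bias at the cost of fewer maxima to fit.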
The method most commonly used to estimate the parameters is maximum likelihood. One
difficulty of this approach is that the regularity conditions required for its
application are not satisfied by the GEV distribution, because the end point of the
distribution depends on the parameter values. This violation means that the standard
asymptotic likelihood results are not automatically applicable. This problem has
been studied in detail (Smith, 1985) with the following results:
(i) when $\xi > -0.5$, the maximum likelihood estimators are regular, in the sense of
having the usual asymptotic properties;
(ii) when $-1 < \xi < -0.5$, the maximum likelihood estimators are generally obtainable,
but do not have the standard asymptotic properties;
(iii) when $\xi < -1$, the maximum likelihood estimators are unlikely to be obtainable.
Observe that the case $\xi \le -0.5$ corresponds to distributions with a very short
bounded upper tail, which is rarely encountered in real applications of extreme value
modelling.
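Denoting the block maxima by $Z_1, \ldots, Z_m$, and under the assumption that they are
independent variables having a GEV distribution, the log-likelihood for the GEV
when $\xi \ne 0$ is
$$\ell(\mu, \sigma, \xi) = -m \log\sigma - \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{m} \log\left[1 + \xi\left(\frac{z_i - \mu}{\sigma}\right)\right] - \sum_{i=1}^{m} \left[1 + \xi\left(\frac{z_i - \mu}{\sigma}\right)\right]^{-1/\xi},$$
provided that $1 + \xi(z_i - \mu)/\sigma > 0$ for $i = 1, \ldots, m$. When this condition is
not satisfied, the likelihood is zero and the log-likelihood is minus infinity.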
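The log-likelihood above can be coded directly, as in the following sketch. The function name `gev_loglik` is hypothetical. The branch for $\xi = 0$ uses the standard Gumbel log-likelihood, which is an addition not written out in the text. In practice this function would be passed to a numerical optimizer, which is not shown here.

```python
import math

def gev_loglik(maxima, mu, sigma, xi):
    """GEV log-likelihood of the block maxima; returns -inf outside the parameter space."""
    if sigma <= 0:
        return -math.inf
    m = len(maxima)
    if abs(xi) < 1e-12:
        # Gumbel limit (xi -> 0); this case is an addition to the formula in the text
        s = [(z - mu) / sigma for z in maxima]
        return -m * math.log(sigma) - sum(s) - sum(math.exp(-v) for v in s)
    total = -m * math.log(sigma)
    for z in maxima:
        t = 1.0 + xi * (z - mu) / sigma
        if t <= 0.0:
            # An observation outside the support: likelihood zero
            return -math.inf
        total -= (1.0 + 1.0 / xi) * math.log(t) + t ** (-1.0 / xi)
    return total

if __name__ == "__main__":
    # Toy maxima and hypothetical parameter values, for illustration only
    data = [21.3, 24.8, 23.1, 27.5, 22.0]
    print(gev_loglik(data, mu=22.5, sigma=2.0, xi=0.1))
```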
Figures 1, 2, 3 and 4 show the original data, the daily, monthly and yearly maxima,
respectively.
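The maximum likelihood estimate of the $1/p$ return level $z_p$, for $0 < p < 1$, is
$$\hat z_p = \begin{cases} \hat\mu - \dfrac{\hat\sigma}{\hat\xi}\left(1 - y_p^{-\hat\xi}\right) & \text{for } \hat\xi \ne 0 \\ \hat\mu - \hat\sigma \log y_p & \text{for } \hat\xi = 0 \end{cases}$$
where $y_p = -\log(1-p)$.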
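Probability plot. Let $z_{(1)} \le \cdots \le z_{(m)}$ be the ordered block maxima. The
empirical distribution function evaluated at $z_{(i)}$ is $\tilde G(z_{(i)}) = i/(m+1)$,
and the estimated model at the same point is
$$\hat G(z_{(i)}) = \exp\left\{-\left[1 + \hat\xi\left(\frac{z_{(i)} - \hat\mu}{\hat\sigma}\right)\right]^{-1/\hat\xi}\right\}.$$
For the model to be adequate it is necessary that $\tilde G(z_{(i)}) \approx \hat G(z_{(i)})$,
so the points
$$\left\{\left(\tilde G(z_{(i)}), \hat G(z_{(i)})\right),\ i = 1, \ldots, m\right\}$$
should lie close to the unit diagonal. However, because both functions are bound to
approach 1 as $z$ increases, the plot is least informative in this region. The following
graph avoids this deficiency.
Quantile plot. It represents the points
$$\left\{\left(\hat G^{-1}(i/(m+1)), z_{(i)}\right),\ i = 1, \ldots, m\right\},$$
where
$$\hat G^{-1}\left(\frac{i}{m+1}\right) = \hat\mu - \frac{\hat\sigma}{\hat\xi}\left[1 - \left\{-\log\left(\frac{i}{m+1}\right)\right\}^{-\hat\xi}\right], \quad i = 1, \ldots, m.$$
Again, departures from linearity in the quantile plot indicate model failure.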
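The coordinates of both diagnostic plots can be computed as in the sketch below; plotting itself is left out. The function names are hypothetical, and the fitted parameters are assumed to be supplied by a separate estimation step.

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """Fitted GEV distribution function (Gumbel form when xi is numerically zero)."""
    s = (z - mu) / sigma
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gev_quantile(q, mu, sigma, xi):
    """Inverse of the fitted GEV distribution function at probability q in (0, 1)."""
    y = -math.log(q)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

def gev_diagnostic_points(maxima, mu, sigma, xi):
    """Return (probability plot points, quantile plot points) for the ordered maxima."""
    z = sorted(maxima)
    m = len(z)
    prob = [(i / (m + 1), gev_cdf(z[i - 1], mu, sigma, xi)) for i in range(1, m + 1)]
    quant = [(gev_quantile(i / (m + 1), mu, sigma, xi), z[i - 1]) for i in range(1, m + 1)]
    return prob, quant

if __name__ == "__main__":
    # Toy maxima and hypothetical fitted parameters, for illustration only
    prob, quant = gev_diagnostic_points([21.3, 24.8, 23.1, 27.5, 22.0], 22.5, 2.0, 0.1)
    print(prob)
    print(quant)
```

If the model fits well, both lists of points should fall close to the line y = x.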
Return level plot. The return level plot represents the points
$\{(\log y_p, \hat z_p),\ 0 < p < 1\}$. Confidence intervals are usually added to this plot
to make it more informative. Return levels matter in engineering because the return
period is used as a design criterion. Furthermore, to use this plot as a model
diagnostic, the empirical estimates of the return level function are also added.
For a suitable model, the model-based curve and the empirical estimates should be in
agreement.
3. THRESHOLD MODELS
Modelling only block maxima wastes a lot of data when a detailed record of the studied
phenomenon is available. An alternative analysis that makes more efficient use of the
data is now proposed. The approach consists of analysing those data that are regarded as
extreme observations, say, those that exceed a threshold level $u$. The stochastic
behaviour of these excesses over $u$ is then studied through the conditional probability
$$\Pr\{X > u + y \mid X > u\} = \frac{1 - F(u + y)}{1 - F(u)}, \quad y > 0.$$
The following result gives an approximation to this probability for high values of the
threshold $u$.
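Theorem. Let $X_1, X_2, \ldots$ be a sequence of independent random variables with common
distribution function $F$, let $M_n = \max\{X_1, \ldots, X_n\}$, and suppose that, for
large $n$,
$$\Pr\{M_n \le z\} \approx G(z), \quad \text{where } G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}$$
for some $\mu$, $\sigma > 0$ and $\xi$.
Then, for large enough $u$, the distribution function of $(X - u)$, conditional on $X > u$,
is approximately
$$H(y) = 1 - \left(1 + \frac{\xi y}{\tilde\sigma}\right)^{-1/\xi},$$
defined on $\{y : y > 0 \text{ and } 1 + \xi y/\tilde\sigma > 0\}$, where
$\tilde\sigma = \sigma + \xi(u - \mu)$. This family is the GENERALIZED PARETO DISTRIBUTION
(GPD).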
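This result relates the two approaches to studying the distribution of extremes. The
parameters of the Generalized Pareto Distribution (GPD) are uniquely determined by the
parameters of the associated GEV distribution of block maxima. Observe that this implies
that if we change the size of the blocks in the GEV analysis then the GEV parameters
change but the GPD parameters do not: the shape $\xi$ is invariant to the block size, and
the changes in $\mu$ and $\sigma$ compensate each other in $\tilde\sigma$.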
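As for the GEV distribution, the parameter $\xi$ is dominant in determining the
qualitative behaviour of the GPD: if $\xi < 0$ the excesses are bounded above by
$-\tilde\sigma/\xi$; if $\xi > 0$ the distribution has no upper limit; and if $\xi = 0$,
interpreted as the limit $\xi \to 0$, the distribution is
$H(y) = 1 - \exp(-y/\tilde\sigma)$, $y > 0$, that is, an exponential distribution
with parameter $1/\tilde\sigma$.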
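A minimal sketch of the GPD distribution function, covering the three qualitative cases, is given below. The function names are hypothetical. The helper `gpd_scale_from_gev` assumes the standard relation $\tilde\sigma = \sigma + \xi(u - \mu)$ between the GPD scale and the GEV parameters; this relation is stated here as an assumption.

```python
import math

def gpd_scale_from_gev(mu, sigma, xi, u):
    """GPD scale at threshold u, assuming sigma_tilde = sigma + xi * (u - mu)."""
    return sigma + xi * (u - mu)

def gpd_cdf(y, sigma_u, xi):
    """Distribution function H(y) of the excess y = x - u over the threshold."""
    if sigma_u <= 0:
        raise ValueError("scale must be positive")
    if y <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        # Exponential case, taken as the limit xi -> 0
        return 1.0 - math.exp(-y / sigma_u)
    t = 1.0 + xi * y / sigma_u
    if t <= 0.0:
        # Only possible when xi < 0: y lies above the upper bound -sigma_u / xi
        return 1.0
    return 1.0 - t ** (-1.0 / xi)

if __name__ == "__main__":
    # Hypothetical GEV parameters and threshold, for illustration only
    s = gpd_scale_from_gev(mu=25.0, sigma=3.0, xi=0.1, u=30.0)
    print(s, gpd_cdf(5.0, s, 0.1))
```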
Threshold selection
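Let $\{x_1, \ldots, x_n\}$ be the original data and let us consider as extreme events those
that exceed a threshold $u$, say $x_{(1)}, \ldots, x_{(k)}$. We denote the excesses over the
threshold by $y_j = x_{(j)} - u$, $j = 1, \ldots, k$. These excesses are regarded as
independent realizations of a random variable distributed according to a GPD, whose
parameters have to be estimated and the model then validated.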
The issue of how to choose the threshold is similar to that of selecting the block size,
in the sense that both imply a balance between bias and variance. Too low a threshold
is likely to violate the asymptotic basis of the model, leading to bias, while too high a
threshold provides few excesses and hence high variance.
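A method to help in the choice of the threshold is based on the mean of the GPD: if $Y$ is
a random variable following a GPD with parameters $\sigma$ and $\xi$, then
$E(Y) = \sigma/(1-\xi)$ when $\xi < 1$; otherwise the mean is infinite.
If the model is valid for a threshold $u_0$ then it is also valid for all thresholds $u$
greater than $u_0$. Writing $\sigma_u$ for the GPD scale at threshold $u$,
$$E(X - u_0 \mid X > u_0) = \frac{\sigma_{u_0}}{1-\xi},$$
$$E(X - u \mid X > u) = \frac{\sigma_u}{1-\xi} = \frac{\sigma_{u_0} + \xi(u - u_0)}{1-\xi},$$
so the mean excess is a linear function of $u$ for $u > u_0$.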
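This suggests plotting the points
$$\left\{\left(u, \frac{1}{n_u}\sum_{i=1}^{n_u}\left(x_{(i)} - u\right)\right) : u < x_{\max}\right\},$$
where $n_u$ is the number of observations that exceed $u$ and $x_{(1)}, \ldots, x_{(n_u)}$
are those observations. This graph is commonly called the mean residual life plot.
Choose as threshold the value above which the plot is approximately linear in $u$.
Adding confidence intervals to the plot can help to determine this point.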
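The points of this plot can be computed as in the following sketch. The function name `mean_residual_life` and the grid of candidate thresholds are illustrative assumptions; the text does not prescribe how the grid should be chosen.

```python
def mean_residual_life(data, thresholds):
    """Return (u, mean excess over u, number of exceedances) for each threshold.

    Thresholds with no exceedances are skipped, since the mean excess is undefined.
    """
    points = []
    for u in thresholds:
        excesses = [x - u for x in data if x > u]
        if excesses:
            points.append((u, sum(excesses) / len(excesses), len(excesses)))
    return points

if __name__ == "__main__":
    # Toy data, for illustration only
    for u, mean_excess, n_u in mean_residual_life([1, 2, 3, 4, 5], [2, 4, 6]):
        print(u, mean_excess, n_u)
```

The number of exceedances is returned alongside each point because the mean excess becomes very noisy when few observations remain above the threshold.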
Parameter estimation
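Once the threshold has been chosen, the next step is to estimate the parameters of the
GPD, for example by maximum likelihood. If we denote by $y_1, \ldots, y_k$ the $k$ excesses
over the threshold, the log-likelihood function, in the case that $\xi$ is not zero, is
$$\ell(\sigma, \xi) = -k \log\sigma - \left(1 + \frac{1}{\xi}\right)\sum_{i=1}^{k} \log\left(1 + \frac{\xi y_i}{\sigma}\right)$$
when $1 + \xi y_i/\sigma > 0$ for $i = 1, \ldots, k$; otherwise $\ell(\sigma, \xi) = -\infty$.
In the case $\xi = 0$ the log-likelihood is
$$\ell(\sigma) = -k \log\sigma - \frac{1}{\sigma}\sum_{i=1}^{k} y_i.$$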
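Both cases of the GPD log-likelihood can be written as a single function, as sketched here. The name `gpd_loglik` is hypothetical, and the numerical maximization over $(\sigma, \xi)$ is not shown.

```python
import math

def gpd_loglik(excesses, sigma, xi):
    """GPD log-likelihood of the threshold excesses; -inf outside the parameter space."""
    if sigma <= 0:
        return -math.inf
    k = len(excesses)
    if abs(xi) < 1e-12:
        # Exponential case (xi = 0)
        return -k * math.log(sigma) - sum(excesses) / sigma
    total = -k * math.log(sigma)
    for y in excesses:
        t = 1.0 + xi * y / sigma
        if t <= 0.0:
            # An excess beyond the upper bound implied by xi < 0
            return -math.inf
        total -= (1.0 + 1.0 / xi) * math.log(t)
    return total

if __name__ == "__main__":
    # Toy excesses and hypothetical parameter values, for illustration only
    print(gpd_loglik([0.4, 1.2, 2.5, 0.1], sigma=1.0, xi=0.1))
```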
Return levels
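To calculate the return levels, we first need an expression for the unconditional
distribution of the variable $X$. Denoting $\zeta_u = \Pr\{X > u\}$, and from the
conditional distribution
$$\Pr\{X > x \mid X > u\} = \left[1 + \xi\left(\frac{x - u}{\sigma}\right)\right]^{-1/\xi},$$
we obtain that
$$\Pr\{X > x\} = \zeta_u \left[1 + \xi\left(\frac{x - u}{\sigma}\right)\right]^{-1/\xi}.$$
Hence, the level $x_m$ that is exceeded on average once every $m$ observations is the
solution of
$$\zeta_u \left[1 + \xi\left(\frac{x_m - u}{\sigma}\right)\right]^{-1/\xi} = \frac{1}{m},$$
which is
$$x_m = u + \frac{\sigma}{\xi}\left[(m \zeta_u)^{\xi} - 1\right],$$
for $m$ large enough to ensure that $x_m > u$.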
In the case $\xi = 0$ the return level is $x_m = u + \sigma \log(m \zeta_u)$, again for $m$ large enough.
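The m-observation return level can be sketched as follows. The function name `gpd_return_level` and the example values are hypothetical. The check `m * zeta_u > 1` is how the sketch enforces the requirement that $m$ be large enough for $x_m$ to exceed $u$.

```python
import math

def gpd_return_level(m, u, sigma, xi, zeta_u):
    """Level exceeded on average once every m observations under the GPD model."""
    if m * zeta_u <= 1.0:
        raise ValueError("m must be large enough that the return level exceeds u")
    if abs(xi) < 1e-12:
        return u + sigma * math.log(m * zeta_u)
    return u + (sigma / xi) * ((m * zeta_u) ** xi - 1.0)

if __name__ == "__main__":
    # Hypothetical values: hourly data, threshold 15 m/s exceeded 2% of the time
    hours_per_year = 365 * 24
    print(gpd_return_level(50 * hours_per_year, u=15.0, sigma=2.0, xi=-0.05, zeta_u=0.02))
```

To express a return level on an annual scale, m is set to the number of years times the number of observations per year, as in the example.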
Estimating these return levels requires replacing the parameters by their estimates.
In the case of the probability $\zeta_u = \Pr\{X > u\}$, the maximum likelihood estimator
is the sample proportion of observations that exceed $u$, $\hat\zeta_u = k/n$.
As we said before, when the GPD is a valid model for a threshold $u_0$ it is also a
valid model for any $u > u_0$. At both levels the shape parameter $\xi$ is the same and the
scale parameters are related by $\sigma_u = \sigma_{u_0} + \xi(u - u_0)$, so that the
modified scale $\sigma^* = \sigma_u - \xi u$ is constant above $u_0$, when it is a valid
threshold. This argument leads to plotting $\hat\sigma^*$ and $\hat\xi$ against $u$,
together with confidence intervals for them, and selecting $u_0$ as the lowest value of $u$
for which the estimates remain near-constant.
Model checking
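Probability plots, quantile plots and return level plots are used for assessing the quality
of a fitted generalized Pareto model. Assuming a threshold $u$, ordered excesses
$y_{(1)} \le \cdots \le y_{(k)}$ and an estimated model $\hat H$ for the GPD, then:
Probability plot. It represents the points $\{(i/(k+1), \hat H(y_{(i)})),\ i = 1, \ldots, k\}$.
Quantile plot. It represents the points $\{(\hat H^{-1}(i/(k+1)), y_{(i)}),\ i = 1, \ldots, k\}$.
When the model is valid, the points in both plots lie approximately on a straight line.
When $\hat\xi \ne 0$ the estimates are
$$\hat H(y) = 1 - \left(1 + \frac{\hat\xi y}{\hat\sigma}\right)^{-1/\hat\xi} \quad \text{and} \quad \hat H^{-1}(p) = \frac{\hat\sigma}{\hat\xi}\left[(1-p)^{-\hat\xi} - 1\right].$$
When $\hat\xi = 0$ the expressions are
$$\hat H(y) = 1 - \exp\left(-\frac{y}{\hat\sigma}\right) \quad \text{and} \quad \hat H^{-1}(p) = -\hat\sigma \ln(1-p).$$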
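Return level plot. It represents the points $\{(m, \hat x_m)\}$, where, as we have seen before,
for $\hat\xi \ne 0$
$$\hat x_m = u + \frac{\hat\sigma}{\hat\xi}\left[(m \hat\zeta_u)^{\hat\xi} - 1\right],$$
and for the case $\hat\xi = 0$ the return level is $\hat x_m = u + \hat\sigma \log(m \hat\zeta_u)$.
Recall that $\hat x_m$ is the estimated value that is exceeded on average once every $m$
observations.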
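The probability and quantile plots for the fitted GPD can be prepared as in the sketch below, which only computes the point coordinates. The function names are hypothetical, and the fitted values of the scale and shape are assumed to come from a separate estimation step.

```python
import math

def gpd_cdf(y, sigma, xi):
    """Fitted GPD distribution function for an excess y > 0."""
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    t = 1.0 + xi * y / sigma
    if t <= 0.0:
        return 1.0  # above the upper bound, possible only when xi < 0
    return 1.0 - t ** (-1.0 / xi)

def gpd_quantile(p, sigma, xi):
    """Inverse of the fitted GPD distribution function at probability p in (0, 1)."""
    if abs(xi) < 1e-12:
        return -sigma * math.log(1.0 - p)
    return (sigma / xi) * ((1.0 - p) ** (-xi) - 1.0)

def gpd_diagnostic_points(excesses, sigma, xi):
    """Return (probability plot points, quantile plot points) for the ordered excesses."""
    y = sorted(excesses)
    k = len(y)
    prob = [(i / (k + 1), gpd_cdf(y[i - 1], sigma, xi)) for i in range(1, k + 1)]
    quant = [(gpd_quantile(i / (k + 1), sigma, xi), y[i - 1]) for i in range(1, k + 1)]
    return prob, quant

if __name__ == "__main__":
    # Toy excesses and hypothetical fitted parameters, for illustration only
    prob, quant = gpd_diagnostic_points([0.4, 1.2, 2.5, 0.1], sigma=1.0, xi=0.1)
    print(prob)
    print(quant)
```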
In the models studied so far it is supposed that the observations come from a sequence
of independent random variables. In real applications this is often an unrealistic
assumption, because some dependence over time is observed. For example, in the case of
wind speed records it is natural to find high positive correlation among consecutive
hourly observations. The next figure shows the correlation for the wind series of
Schiphol (hourly average wind) that we used in the previous sections.
To obtain the theoretical results on which the analysis of extremes of stationary
sequences is based, it is usual to assume a condition that limits the extent of long-range
dependence at extreme levels, in the sense that the events $X_i > u$ and $X_j > u$ are
approximately independent when the threshold level $u$ is high enough and the time
points $i$ and $j$ are far apart. Many physical phenomena satisfy this property. In our
example of wind speed it means that a high wind today might influence the probability of
an extreme wind tomorrow, perhaps because both are due to the passage of the same storm,
but it is unlikely to influence the probability of an extreme wind in one month's time.
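The following condition formalizes the notion of extreme events being near-independent
if they are sufficiently distant in time.
$D(u_n)$ condition. A stationary series $X_1, X_2, \ldots$ is said to satisfy the $D(u_n)$
condition if, for all $i_1 < \cdots < i_p < j_1 < \cdots < j_q$ with $j_1 - i_p > k$,
$$\left|\Pr\{X_{i_1} \le u_n, \ldots, X_{i_p} \le u_n, X_{j_1} \le u_n, \ldots, X_{j_q} \le u_n\} - \Pr\{X_{i_1} \le u_n, \ldots, X_{i_p} \le u_n\}\Pr\{X_{j_1} \le u_n, \ldots, X_{j_q} \le u_n\}\right| \le \alpha(n, k),$$
where $\alpha(n, k_n) \to 0$ for some sequence $k_n$ such that $k_n/n \to 0$ as $n \to \infty$.
Observe that for independent sequences the difference is always 0. To obtain the following
result, the condition needs to be satisfied only for thresholds $u_n$ that increase with
$n$. In this way, extreme observations that are far enough apart are guaranteed to be
almost independent.
Theorem. Let $X_1, X_2, \ldots$ be a stationary series and $M_n = \max\{X_1, \ldots, X_n\}$.
If there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$\Pr\{(M_n - b_n)/a_n \le z\} \to G(z) \quad \text{as } n \to \infty,$$
where $G$ is a non-degenerate distribution function, and the $D(u_n)$ condition is
satisfied with $u_n = a_n z + b_n$ for every real $z$, then $G$ is a member of the GEV
family.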
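Observe that this result implies that when the stationary series has limited long-range
dependence at extreme levels, the maxima follow the same limit laws as in the case of
independent series. Furthermore, there exists a relationship between both distributions.
Theorem. Let $X_1, X_2, \ldots$ be a stationary series with maximum $M_n$, and let
$X_1^*, X_2^*, \ldots$ be a series of independent variables with the same marginal
distribution and maximum $M_n^*$. Under suitable regularity conditions, there exist
sequences of constants $a_n > 0$ and $b_n$ such that
$$\Pr\{(M_n^* - b_n)/a_n \le z\} \to G_1(z) \quad \text{as } n \to \infty$$
if and only if
$$\Pr\{(M_n - b_n)/a_n \le z\} \to G_2(z) \quad \text{as } n \to \infty,$$
where $G_2(z) = G_1^{\theta}(z)$ for a constant $\theta$ such that $0 < \theta \le 1$.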
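From the relationship between both distributions it is easy to obtain that both have the
same shape parameter $\xi$. If $G_1$ is a GEV distribution with parameters
$(\mu, \sigma, \xi)$, then $G_2 = G_1^{\theta}$ is a GEV distribution with parameters
$(\mu^*, \sigma^*, \xi)$, where, when $\xi \ne 0$,
$$\mu^* = \mu - \frac{\sigma}{\xi}\left(1 - \theta^{\xi}\right) \quad \text{and} \quad \sigma^* = \sigma \theta^{\xi},$$
and, when $\xi = 0$, $\mu^* = \mu + \sigma \log\theta$ and $\sigma^* = \sigma$.
The quantity $\theta$ is named the extremal index. This index can be interpreted in terms
of the propensity of the process to cluster at extreme levels. Loosely,
$\theta = (\text{limiting mean cluster size})^{-1}$.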
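The parameter relationship can be checked numerically. The sketch below assumes the standard relation $G_2(z) = G_1(z)^{\theta}$ between the limit for the stationary series and the limit for the independent series with the same marginal distribution, which implies $\mu^* = \mu - (\sigma/\xi)(1 - \theta^{\xi})$ and $\sigma^* = \sigma\theta^{\xi}$ for $\xi \ne 0$. The function names and numerical values are hypothetical.

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function (Gumbel form when xi is numerically zero)."""
    s = (z - mu) / sigma
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def dependent_gev_params(mu, sigma, xi, theta):
    """Parameters (mu*, sigma*, xi) of G1**theta when G1 is GEV(mu, sigma, xi)."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("the extremal index must lie in (0, 1]")
    if abs(xi) < 1e-12:
        return mu + sigma * math.log(theta), sigma, xi
    return mu - (sigma / xi) * (1.0 - theta ** xi), sigma * theta ** xi, xi

if __name__ == "__main__":
    # Hypothetical values, for illustration only
    mu, sigma, xi, theta = 25.0, 3.0, 0.1, 0.5
    star = dependent_gev_params(mu, sigma, xi, theta)
    z = 30.0
    # The two printed numbers should agree up to rounding error
    print(gev_cdf(z, mu, sigma, xi) ** theta, gev_cdf(z, *star))
```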
Models for block maxima. The distribution of the block maxima, when the $D(u_n)$
condition is satisfied, falls in the same family of distributions as it would if the series
were independent. This means that dependence in the data can be ignored and the data
can be modelled as was done under the assumption of independence. The only caveat is
that, because $M_n$ has similar statistical properties to $M_{n\theta}^*$ (corresponding to
the independent series), the accuracy of the GEV approximation for a given block length
is likely to diminish as the dependence in the series increases.
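The return level is then estimated by
$$x_m = u + \frac{\sigma}{\xi}\left[(m \zeta_u \theta)^{\xi} - 1\right],$$
where $\sigma$ and $\xi$ are the parameters of the GPD fitted to the threshold excesses,
$\zeta_u$ is the probability of exceeding $u$ and $\theta$ is the extremal index.
Denoting the number of exceedances above the threshold $u$ by $n_u$ and the number of
clusters obtained above $u$ by $n_c$, these two quantities are estimated by
$$\hat\zeta_u = \frac{n_u}{n} \quad \text{and} \quad \hat\theta = \frac{n_c}{n_u}.$$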
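The two estimators can be illustrated with a simple declustering scheme. The sketch below assumes runs declustering, in which a cluster ends once `run_length` consecutive observations fall at or below the threshold. This scheme and the parameter `run_length` are assumptions introduced here, since the text does not specify how clusters are identified.

```python
def runs_decluster(series, u, run_length):
    """Group exceedances of u into clusters separated by at least run_length non-exceedances."""
    clusters, current, gap = [], [], 0
    for x in series:
        if x > u:
            current.append(x)
            gap = 0
        elif current:
            gap += 1
            if gap >= run_length:
                # Enough consecutive values at or below u: the cluster is closed
                clusters.append(current)
                current, gap = [], 0
    if current:
        clusters.append(current)
    return clusters

def exceedance_estimates(series, u, run_length):
    """Return (zeta_hat, theta_hat, cluster maxima) for the threshold u."""
    clusters = runs_decluster(series, u, run_length)
    n = len(series)
    n_u = sum(len(c) for c in clusters)  # number of exceedances
    n_c = len(clusters)                  # number of clusters
    if n_u == 0:
        raise ValueError("no observations exceed the threshold")
    return n_u / n, n_c / n_u, [max(c) for c in clusters]

if __name__ == "__main__":
    # Toy series, for illustration only
    print(exceedance_estimates([1, 5, 6, 1, 1, 7, 1, 8, 1, 1, 1, 9], u=4, run_length=2))
```

In the toy series there are 5 exceedances grouped into 3 clusters, so the estimated extremal index is 3/5.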
Conclusions
Extreme value theory is a statistical discipline that focuses on describing the unusual
rather than the usual. Its objective is to quantify the stochastic behaviour of a process at
unusually large levels. By definition, extreme values are observed very infrequently.
Furthermore, the objective of an extreme value analysis is often to estimate
probabilities of events that are more extreme than any that have already been observed.
The extreme value paradigm. Model extrapolation is based on using mathematical limits
as approximations at finite levels. One main objection is that it is implicitly assumed
that the underlying stochastic mechanism of the process being modelled is sufficiently
smooth to enable extrapolation to unobserved levels.
Though the GEV model is supported by mathematical argument, its use in extrapolation
is based on unverifiable assumptions, and measures of uncertainty on return levels
should properly be regarded as lower bounds that could be much greater if uncertainty
due to model correctness were taken into account.
Bibliography
http://www.youtube.com/watch_popup?v=CqEccgR0q-o
http://www.youtube.com/watch?v=b43lAoovqd8&feature=fvw
Cite the R library
Extreme value theory is a statistical discipline that focuses on describing the unusual
rather than the usual. Its objective is to quantify the stochastic behaviour of a process at
unusually large levels. By definition, extreme values are observed very infrequently.
Furthermore, the objective of an extreme value analysis is often to estimate
probabilities of events that are more extreme than any that have already been observed.
In this talk the classical block maxima models for extremes, as well as threshold
excess models, are introduced and illustrated using real wind data.