Outlier Detection Algorithms
8 September 2014
Summary: We review recent asymptotic results on some robust methods for multiple regression. The regressors include stationary and non-stationary time series as well as polynomial terms. The methods include the Huber-skip M-estimator, 1-step Huber-skip M-estimators, in particular the Impulse Indicator Saturation, iterated 1-step Huber-skip M-estimators and the Forward Search. These methods classify observations as outliers or not. From the asymptotic results we establish a new asymptotic theory for the gauge of these methods, which is the expected frequency of falsely detected outliers. The asymptotic theory involves normal distribution results and Poisson distribution results. The theory is applied to a time series data set.
Keywords: Huber-skip M-estimators, 1-step Huber-skip M-estimators, iteration, Forward Search, Impulse Indicator Saturation, Robustified Least Squares, weighted and marked empirical processes, iterated martingale inequality, gauge.
1 Introduction
The purpose of this paper is to review recent asymptotic results on some robust methods for multiple regression and to apply these to calibrate the methods. The regressors include stationary and non-stationary time series as well as quite general deterministic terms. All the reviewed methods classify observations as outliers according to hard, binary decision rules. The methods include the Huber-skip M-estimator, 1-step versions such as the robustified least squares estimator and the Impulse Indicator Saturation, iterated 1-step versions thereof, and the Forward Search. The paper falls in two parts. In the first part we give a motivating empirical example. This is followed by an overview of the methods and a review of recent asymptotic tools and properties of the estimators. For all the presented methods the outlier classification depends on a cut-off value c, which is taken as given in the first part. In the second part we provide an asymptotic theory for setting the cut-off value c indirectly from the gauge, where the gauge is defined as the frequency of observations classified as outliers when in fact there are no outliers in the data generating process.
Robust methods can be used in many ways. Some methods reject observations that are classified as outliers, while other methods give a smooth weight to all observations.
Acknowledgements: We would like to thank the organizers of the NordStat meeting in Turku, Finland, June 2014, for giving us the opportunity to present these lectures on outlier detection. The first author is grateful to CREATES - Center for Research in Econometric Analysis of Time Series (DNRF78), funded by the Danish National Research Foundation.
Affiliations: Department of Economics, University of Copenhagen and CREATES, Department of Economics and Business, Aarhus University, DK-8000 Aarhus C. E-mail: [email protected]. Nuffield College & Department of Economics, University of Oxford & Programme on Economic Modelling, INET, Oxford. Address for correspondence: Nuffield College, Oxford OX1 1NF, UK. E-mail: bent.nielsen@nuffield.ox.ac.uk.
It is open to discussion which method to use, see for instance Hampel, Ronchetti, Rousseeuw and Stahel (1986, §1.4). Here, we focus on rejection methods. We consider an empirical example where rejection methods are useful as diagnostic tools. The idea is that most observations are 'good' in the sense that they conform with a regression model with symmetric, if not normal, errors. Some observations may not conform with the model; they are the outliers. When building a statistical model the user can apply the outlier detection methods in combination with considerations about the substantive context to decide which observations are 'good' and how to treat the 'outliers' in the analysis.
In order to use the algorithms with confidence we need to understand their properties when all observations are 'good'. Just as in hypothesis testing, where tests are constructed by controlling their properties when the hypothesis is true, we consider the outlier detection methods when, in fact, there are no outliers. The proposal is to control the cut-off values of the robust methods in terms of their gauge. The gauge is the frequency of wrongly detected outliers when there are none. It is distinct from, but related to, the size of a hypothesis test and the false discovery rate in multiple testing (Benjamini and Hochberg, 1995).
The origins of the notion of a gauge are as follows. Hoover and Perez (1999) studied the properties of a general-to-specific algorithm for variable selection through a simulation study. They considered various measures of the performance of the algorithm that are related to what is now called the gauge. One of these they referred to as the size: the number of falsely significant variables divided by the difference between the total number of variables and the number of variables with non-zero coefficients. The Hoover-Perez idea for regressor selection was the basis of the PcGets and Autometrics algorithms, see for instance Hendry and Krolzig (2005), Doornik (2009) and Hendry and Doornik (2014). The Autometrics algorithm also includes an impulse indicator saturation algorithm. Through extensive simulation studies the critical values of these algorithms have been calibrated in terms of the false detection rates for irrelevant regressors and irrelevant outliers. The term gauge was introduced in Hendry and Santos (2010) and Castle, Doornik and Hendry (2011).
Part I
Review of recent asymptotic results
2 A motivating example
What is an outlier? How do we detect outliers? How should we deal with them? There is no simple, universally valid answer to these questions; it all depends on the context. We will therefore motivate our analysis with an example from time series econometrics.
Demand and supply are key to discussing markets in economics. To study this, Graddy (1995, 2006) collected data on prices and quantities from the Fulton Fish Market in New York. For our purpose the following will suffice. The data consist of the daily quantity of whiting sold by one wholesaler over the period 2 Dec 1991 to 8 May 1992. Figure 1(a) shows the daily aggregated quantity Qt measured in pounds. The logarithm of the quantity, qt = log Qt, is shown in panel (b). The supply of fish depends on the weather at sea where the fish is caught. Panel (c) shows a binary variable St taking value 1 if the weather is stormy.
The present analysis is taken from Hendry and Nielsen (2007, §13.5).

Figure 1: Data and properties of the fitted model for the Fulton Fish Market data.
A simple autoregressive model for the log quantities qt gives regression (2.1). Here σ̂² is the residual variance, ℓ̂ is the log likelihood and T is the sample size. The residual specification tests include cumulant based tests for skewness, χ²_skew, kurtosis, χ²_kurtosis, and both, χ²_norm = χ²_skew + χ²_kurtosis; a test F_ar for autoregressive temporal dependence, see Godfrey (1978); a test F_arch for autoregressive conditional heteroscedasticity, see Engle (1982); a test F_het for heteroscedasticity, see White (1980); and a test F_reset for functional form, see Ramsey (1969). We note that the above references only consider stationary processes, but the specification tests also apply for non-stationary autoregressions, see Kilian and Demiroglu (2000) and Engler and Nielsen (2009) for χ²_skew, χ²_kurtosis and Nielsen (2006) for F_ar. The computations were done using OxMetrics, see Doornik and Hendry (2013).
Figure 1(b, d) shows the fitted values and the standardized residuals.
The specification tests indicate that the residuals are skew. Indeed, the time series plot of the residuals in Figure 1(d) shows a number of large negative residuals. The three largest residuals have an interesting institutional interpretation. Observations 18 and 34 are Boxing Day and Martin Luther King Day, which are public holidays, while observation 95 is the Wednesday before Easter. Thus, from a substantive viewpoint it seems preferable to include
dummy variables for each of these days, which gives

q̂_t = 7.9 + 0.09 q_{t-1} − 0.36 S_t − 1.94 D_t^{18} − 1.82 D_t^{34} − 2.38 D_t^{95},   (2.2)

with standard errors (0.7), (0.08), (0.14), (0.66), (0.66), (0.66) and t-statistics [10.8], [1.04], [−2.68], [−3.00], [−2.75], [−3.64], respectively.
Specification tests, which are not reported, indicate a marked improvement in the specification. Comparing the regressions (2.1) and (2.2) it is seen that the lagged quantities were marginally significant in the first, misspecified regression, but not significant in the second, better specified, regression. It is of course no surprise that outliers matter for statistical inference, and that institutions matter for markets.
The above modelling strategy blends usage of specification tests, graphical tools and substantive arguments. It points at robustifying a regression by removing outliers and then refitting the regression. We note that outliers are defined as those observations that do not conform with the statistical model. In the following we will consider some algorithms for outlier detection that are inspired by this example. These algorithms are solely based on statistical information and we can then discuss their properties by mathematical means. In practice, outcomes should of course be assessed within the substantive context. We return to this example in §11.
3 Model
Throughout, we consider data (y_i, x_i), i = 1, ..., n, where y_i is univariate and x_i has dimension dim x. The regressors are possibly trending in a deterministic or stochastic fashion. We assume that (y_i, x_i), i = 1, ..., n, satisfy the multiple regression equation

y_i = x_i'β + ε_i,   i = 1, ..., n.   (3.1)

The innovations ε_i are independent of the filtration F_{i-1}, which is the sigma-field generated by x_1, ..., x_i and ε_1, ..., ε_{i-1}. Moreover, the ε_i are identically distributed with mean zero and variance σ², so that ε_i/σ has known symmetric density f and distribution function F(c) = P(ε_i/σ ≤ c). In practice, the distribution F will often be standard normal.
We will think of the outliers as pairs of observations (y_i, x_i) that do not conform with the model (3.1). In other words, a pair of observations (y_i, x_i) gives us an outlier if the scaled innovation ε_i/σ does not conform with the reference density f. This has slightly different consequences for cross-sectional data and for time series data. For cross-sectional data the pairs of observations (y_1, x_1), ..., (y_n, x_n) are unrelated. Thus, if the innovation ε_i is classified as an outlier, then the pair of observations (y_i, x_i) is dropped. We can interpret this as an innovation not conforming with the model, or that y_i or x_i or both are not correct. This is different for time-series data, where the regressors will include lagged dependent variables. For instance, for a first order autoregression x_i = y_{i-1}. We distinguish between innovative outliers and additive outliers. Classifying the innovation ε_i as an outlier has the consequence that we discard the evaluation of the dynamics from y_{i-1} to y_i without discarding the observations y_{i-1} and y_i. Indeed, y_{i-1} appears as the dependent variable at time i-1 and y_i as the regressor at time i+1. Thus, finding a single outlier in a time series context implies that the observations are considered correct, but possibly not generated by the model. An additive outlier arises if an observation y_i is wrongly measured. For a first order autoregression this is captured by two innovative outliers ε_i and ε_{i+1}. Discarding these, the observation y_i will not appear.
We consider algorithms using absolute residuals and calculation of least squares estimators from selected observations. Both these choices implicitly assume a symmetric density: if non-outlying innovations were asymmetric then the symmetrically truncated innovations would in general be asymmetric and the least squares estimator for location would be biased. With symmetry the absolute errors |ε_i|/σ have density g(c) = 2f(c) and distribution function G(c) = P(|ε_1|/σ ≤ c) = 2F(c) − 1. We define ψ = G(c), so that c is the ψ-quantile, and the gauge is γ = 1 − ψ = 1 − G(c). We further define the bias correction factor ς², which will serve as a bias correction for the variance estimators based on the truncated sample, and also the truncated fourth-moment quantity κ.
In this paper we focus on the normal reference distribution. The truncated moments then simplify in terms of the standard normal density and distribution function.
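As a numerical illustration, not taken from the paper, the truncated moments that appear in these definitions have standard closed forms under a normal reference distribution. The sketch below checks those closed forms against numerical integration; the symbols psi, tau and kappa follow the usual truncated-moment conventions and are our assumption about the (unreproduced) notation in (3.2) and (3.4).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

c = 2.0
phi, Phi = stats.norm.pdf, stats.norm.cdf
psi = 2 * Phi(c) - 1                             # psi = G(c) = P(|Z| <= c)
gamma = 1 - psi                                  # gauge
tau = psi - 2 * c * phi(c)                       # E[Z^2 1(|Z|<=c)]
kappa = 3 * psi - 2 * c * (c**2 + 3) * phi(c)    # E[Z^4 1(|Z|<=c)]

print(np.isclose(tau, quad(lambda z: z**2 * phi(z), -c, c)[0]))    # True
print(np.isclose(kappa, quad(lambda z: z**4 * phi(z), -c, c)[0]))  # True
# Under the conditional-moment reading assumed here, tau/psi plays the
# role of the bias correction for variance estimators from truncated samples.
print(psi, gamma, tau / psi)
```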
4.1 M-estimators
Huber (1964) introduced M-estimators as a class of maximum likelihood type estimators for location. The M-estimator for the regression model (3.1) is defined as the minimizer of

R_n(β) = n^{-1} Σ_{i=1}^n ρ(y_i − x_i'β),   (4.1)
for some absolutely continuous and non-negative criterion function ρ. In particular, the least squares estimator arises when ρ(u) = u², while the median or least absolute deviation estimator arises for ρ(u) = |u|. We will pursue the idea of hard rejection of outliers through the non-convex Huber-skip criterion function ρ(u) = u² 1_(|u| ≤ σc) + σ²c² 1_(|u| > σc) for some cut-off c > 0 and known scale σ.
The objective function of the Huber-skip M-estimator is non-convex. Figure 2 illustrates the objective function for the Fish data. The specification is as in equation (2.1). All parameters apart from that on q_{t-1} are held fixed at the values in (2.1). Panel (a) shows that when the cut-off c is large the Huber-skip is quadratic in the central part. Panel (b) shows that when the cut-off c is smaller the objective function is non-differentiable in a finite number of points. Subsequently, we consider estimators that are easier to compute and apply for unknown scale, while hopefully preserving some useful robustness properties.
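To make the criterion concrete, the following minimal sketch evaluates the Huber-skip objective (4.1) on a grid of candidate values for one coefficient, in the spirit of the Figure 2 profiles. The data, cut-off and scale below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def huber_skip_objective(beta, y, X, c, sigma):
    """Average Huber-skip loss R_n(beta) from (4.1): squared error below
    the cut-off, constant penalty (sigma*c)^2 above it."""
    u = y - X @ beta
    loss = np.where(np.abs(u) <= sigma * c, u**2, (sigma * c)**2)
    return loss.mean()

# Illustrative use: profile the objective over one coefficient, holding
# the other fixed, as in the discussion of Figure 2.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([7.9, 0.1]) + rng.normal(scale=0.8, size=n)

grid = np.linspace(-0.5, 0.7, 121)
profile = [huber_skip_objective(np.array([7.9, b]), y, X, c=2.0, sigma=0.8)
           for b in grid]
print(grid[int(np.argmin(profile))])   # rough minimizer over the grid
```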
The asymptotic theory of M-estimators has been studied in some detail for the situation without outliers. Huber (1964) proposed a theory for location models and convex criterion functions ρ. Jurečková and Sen (1996, p. 215f) analyzed the regression problem with convex criterion functions. Non-convex criterion functions were considered for location models in Jurečková and Sen (1996, p. 197f), see also Jurečková, Sen, and Picek (2012). Chen and Wu (1988) showed strong consistency of M-estimators for general criterion functions with i.i.d. or deterministic regressors, while time series regression is analyzed in Johansen and Nielsen (2014b). We review the latter theory in §7.1.
The weights v_i may depend on β. The first example is the Huber-skip M-estimator, which depends on a cut-off point c, where

v_i = 1_(|y_i − x_i'β| ≤ σc).   (4.3)

Another example is the Least Trimmed Squares estimator of Rousseeuw (1984), which depends on an integer k ≤ n, where

v_i = 1_(|y_i − x_i'β| ≤ ξ_(k)),   (4.4)

for ξ_(k) chosen as the k-th smallest order statistic of the absolute residuals ξ_i = |y_i − x_i'β|, i = 1, ..., n. Given an integer k ≤ n we can find ψ and c so that k/n = ψ = G(c), and ψ, c, k are different ways of calibrating the methods. In either case, once the regression estimator β̂ has been determined, the scale can be estimated by

σ̂² = ς^{-2} (Σ_{i=1}^n v_i)^{-1} {Σ_{i=1}^n v_i (y_i − x_i'β̂)²}.   (4.5)

The Least Trimmed Squares weight (4.4) is scale invariant, in contrast to the Huber-skip M-estimator weight. It is known to have breakdown point γ = 1 − ψ = 1 − k/n for γ < 1/2, see Rousseeuw and Leroy (1987, §3.4). An asymptotic theory is provided by Víšek (2006a,b,c). The estimator is computed through a combinatorial search over the binomially many subsets of size k, which is infeasible in most practical situations, see Maronna, Martin and Yohai (2006, §5.7) for a discussion. A number of iterative approximations have been suggested, such as the Fast LTS algorithm by Rousseeuw and van Driessen (1998). This leaves additional questions with respect to the properties of the approximating algorithms.
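The two weighting schemes (4.3) and (4.4) differ only in how the cut-off is set. The short sketch below, an illustration rather than the paper's definitions, contrasts them; with k/n = ψ = G(c) the two rules target the same retention rate, which is the calibration equivalence noted above.

```python
import numpy as np

def huber_skip_weights(y, X, beta, c, sigma):
    """Weights (4.3): keep observations whose absolute residual is below
    the fixed cut-off sigma*c."""
    return (np.abs(y - X @ beta) <= sigma * c).astype(float)

def lts_weights(y, X, beta, k):
    """Weights (4.4): keep the k observations with the smallest absolute
    residuals, i.e. residuals no larger than the k-th order statistic."""
    xi = np.abs(y - X @ beta)
    return (xi <= np.sort(xi)[k - 1]).astype(float)
```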
If the weights v_i do not depend on β, the objective function has the least squares solution

β̂ = (Σ_{i=1}^n v_i x_i x_i')^{-1} (Σ_{i=1}^n v_i x_i y_i).   (4.7)

From this the variance estimator (4.5) can be computed. Examples include 1-step Huber-skip M-estimators based on initial estimators β̃, σ̃², where

v_i = 1_(|y_i − x_i'β̃| ≤ σ̃c),   (4.8)

and 1-step Huber-skip L-estimators based on an initial estimator β̃ and a cut-off k < n, which defines the k-th smallest order statistic ξ̃_(k) of the absolute residuals ξ̃_i = |y_i − x_i'β̃|, where

v_i = 1_(|y_i − x_i'β̃| ≤ ξ̃_(k)).   (4.9)
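A minimal sketch of the 1-step Huber-skip M-estimator follows: select observations with small residuals under the initial fit (4.8), re-estimate by least squares (4.7), and re-estimate the scale with a bias correction as in (4.5). The correction used here is the truncated-normal second moment, which is our assumption for ς² under a normal reference distribution; this is not the authors' code.

```python
import numpy as np
from scipy import stats

def one_step_huber_skip(y, X, beta0, sigma0, c):
    """One step of the Huber-skip M-estimator given initial (beta0, sigma0)."""
    v = np.abs(y - X @ beta0) <= sigma0 * c            # weights (4.8)
    Xv, yv = X[v], y[v]
    beta1 = np.linalg.solve(Xv.T @ Xv, Xv.T @ yv)      # weighted LS (4.7)
    psi = 2 * stats.norm.cdf(c) - 1
    varsigma2 = 1 - 2 * c * stats.norm.pdf(c) / psi    # assumed bias correction
    sigma1 = np.sqrt(((yv - Xv @ beta1) ** 2).mean() / varsigma2)   # (4.5)
    return beta1, sigma1, v
```

Iterating this map with m = 0, 1, 2, ... gives the iterated 1-step Huber-skip M-estimator of Algorithm 4.1.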
The Iteration Algorithm 4.1 does not have a stopping rule. This leaves the questions whether the algorithm converges with increasing m and n and in which sense it approximates the Huber-skip estimator.
The Impulse Indicator Saturation algorithm has its roots in the empirical work of Hendry (1999) and Hendry, Johansen and Santos (2008). It is a 1-step M-estimator, where the initial estimator is formed by exploiting in a simple way the assumption that a subset of observations is free of outliers. The idea is to divide the sample into two sub-samples, then run a regression on each sub-sample and use this to find outliers in the other sub-sample.
The split-sample least squares estimators are

β̂_j = (Σ_{i∈I_j} x_i x_i')^{-1} (Σ_{i∈I_j} x_i y_i),   σ̂_j² = n_j^{-1} Σ_{i∈I_j} (y_i − x_i'β̂_j)².

The remaining steps of the algorithm are:
1.4. Compute least squares estimators β̂^(0), (σ̂^(0))² using (4.7), (4.5), replacing v_i by v̂_i^(-1), and let m = 0.
2. Define indicator variables v_i^(m) = 1_(|y_i − x_i'β̂^(m)| ≤ σ̂^(m)c) as in (4.8).
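The sketch below gives a simplified split-half version of this procedure, assuming equal halves and omitting the bias corrections for the split-sample scale estimates; the `one_step` argument is a hypothetical updating function such as the one_step_huber_skip sketch above.

```python
import numpy as np

def split_half_iis(y, X, c, one_step):
    """Split-half Impulse Indicator Saturation (Algorithm 4.2), sketched:
    fit each half, flag outliers in the opposite half, pool the retained
    observations, then apply a further 1-step update via `one_step`."""
    n = len(y)
    I1, I2 = np.arange(n) < n // 2, np.arange(n) >= n // 2
    keep = np.ones(n, dtype=bool)
    for fit_idx, test_idx in [(I1, I2), (I2, I1)]:
        Xf, yf = X[fit_idx], y[fit_idx]
        beta = np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)
        sigma = np.sqrt(((yf - Xf @ beta) ** 2).mean())   # uncorrected scale
        keep[test_idx] = np.abs(y[test_idx] - X[test_idx] @ beta) <= sigma * c
    # Step 1.4: full-sample re-estimation on the retained observations.
    Xk, yk = X[keep], y[keep]
    beta0 = np.linalg.solve(Xk.T @ Xk, Xk.T @ yk)
    sigma0 = np.sqrt(((yk - Xk @ beta0) ** 2).mean())
    # Step 2 onwards: a further 1-step Huber-skip update.
    return one_step(y, X, beta0, sigma0, c)
```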
Due to its split-half approach to the initial estimation, the Impulse Indicator Saturation may be more robust than robustified least squares. The Impulse Indicator Saturation estimator will work best when the outliers are known to be in a particular subset of the observations. For instance, consider the split-half case where the index sets I_1, I_2 are chosen as the first half and the second half of the observations, respectively. Then the algorithm has a good ability to detect, for instance, a level shift half way through the second sample, while it is poor at detecting outliers scattered throughout both samples, because both sample halves are contaminated. If the location of the contamination is unknown, one will have to iterate over the choice of the initial sets I_1, I_2. This is what the more widely used Autometrics algorithm does, see Doornik (2009) and Doornik and Hendry (2014).
The Forward Search algorithm is an iterated 1-step Huber-skip L-estimator suggested for the multivariate location model by Hadi (1992) and for multiple regression by Hadi and Simonoff (1993), and developed further by Atkinson and Riani (2000), see also Atkinson, Riani and Cerioli (2010). The algorithm starts with a robust estimate of the regression parameters. This is used to construct the set of observations with the smallest m_0 absolute residuals. We then run a regression on those m_0 observations and compute absolute residuals of all n observations. The observations with the m_0 + 1 smallest residuals are then selected, and a new regression is performed on these m_0 + 1 observations. This is then iterated. Since the estimator based on the m_0 + 1 observations is computed in terms of the order statistic based on the estimator for the m_0 observations, it is a 1-step Huber-skip L-estimator. When iterating, the order of the order statistics is gradually expanding. A sketch of the recursion is given below.
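The following minimal sketch, not the authors' implementation, illustrates the recursion: refit by least squares on the observations with the smallest absolute residuals, enlarge the subset by one each step, and record the scaled forward residuals that are monitored later. The scale is left uncorrected for simplicity, whereas the paper uses a consistency correction.

```python
import numpy as np

def forward_search(y, X, beta_init, m0):
    """Forward Search recursion (Algorithm 4.3), sketched."""
    n = len(y)
    beta = beta_init
    forward_residuals = []
    for m in range(m0, n):
        xi = np.abs(y - X @ beta)              # absolute residuals, current fit
        subset = np.argsort(xi)[:m]            # m smallest absolute residuals
        Xs, ys = X[subset], y[subset]
        beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
        sigma = np.sqrt(((ys - Xs @ beta) ** 2).mean())   # uncorrected scale
        z = np.sort(np.abs(y - X @ beta))[m]   # (m+1)-th smallest: forward residual
        forward_residuals.append(z / sigma)
    return np.array(forward_residuals)
```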
The recursion in the algorithm is, for m = m_0, m_0 + 1, ...:
2.3. Define indicator variables v_i^(m) = 1_(|y_i − x_i'β̂^(m)| ≤ ξ̂^(m)_(m+1)) as in (4.9).
3. Compute least squares estimators β̂^(m+1), (σ̂^(m+1))² as in (4.7), (4.5), replacing v_i by v_i^(m).
The idea of the Forward Search is to monitor the plot of the scaled forward residuals ẑ^(m)/σ̂^(m). For each m we can find the asymptotic distribution of ẑ^(m)/σ̂^(m) and add a curve of pointwise p-quantiles as a function of m for some p. The first m for which ẑ^(m)/σ̂^(m) exceeds the quantile curve is the estimate m̂ of the number of non-outliers. Asymptotic theory for the forward residuals ẑ^(m)/σ̂^(m) is reviewed in §8.3. A theory for the estimator m̂ is given in §10.
A variant of the Forward Search advocated by Atkinson and Riani (2000) is to use the minimum deletion residuals d̂^(m) = min_{i∉S^(m)} ξ̂_i^(m) instead of the forward residuals ẑ^(m).
We can use the Central Limit Theorem to show asymptotic normality of the estimator. The asymptotic variance follows in Theorem 7.3. The efficiency relative to least squares estimation is shown as the top curve in Figure 3.

Figure 3: The efficiency of robustified least squares, the Impulse Indicator Saturation, and the Huber-skip M-estimator relative to full sample least squares when the reference distribution is normal.
Starting with other estimators gives different asymptotic variances. An example is the Impulse Indicator Saturation Algorithm 4.2. Theorem 7.4 shows that the initial split-half estimator β̂^(0) has the same asymptotic distribution as the robustified least squares estimator. The updated 1-step estimator β̂^(1) is slightly less efficient, as shown by the middle curve in Figure 3, but hopefully more robust.
The 1-step M-estimator can be iterated along the lines of Algorithm 4.1. This iteration has a fixed point β̂* solving the equation

n^{1/2}(β̂* − β) = ψ^{-1} n^{-1/2} Σ_{i=1}^n ε_i 1_(|ε_i| ≤ σc) + {2cf(c)/ψ} n^{1/2}(β̂* − β) + o_P(1),   (5.4)

see Theorem 7.6. Thus, any influence of the initial estimator is lost through iteration. Solving this equation gives

n^{1/2}(β̂* − β) = {ψ − 2cf(c)}^{-1} n^{-1/2} Σ_{i=1}^n ε_i 1_(|ε_i| ≤ σc) + o_P(1).   (5.5)
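Assuming the linear fixed-point structure sketched in (5.4), the step from (5.4) to (5.5) is just the solution of x = s/ψ + {2cf(c)/ψ}x, which requires the contraction factor 2cf(c)/ψ to be below one. The short check below evaluates this factor and the fixed-point scaling under a standard normal reference; it is a numerical illustration, not part of the paper.

```python
from scipy import stats

for c in (1.96, 2.576, 3.0):
    psi = 2 * stats.norm.cdf(c) - 1              # P(|eps/sigma| <= c)
    contraction = 2 * c * stats.norm.pdf(c) / psi
    fixed_point_scale = 1 / (psi - 2 * c * stats.norm.pdf(c))
    print(f"c={c:5.3f}  2cf(c)/psi={contraction:.3f}  "
          f"1/(psi-2cf(c))={fixed_point_scale:.3f}")
```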
Assumption 6.1 Let F_i be the filtration generated by x_1, ..., x_{i+1} and ε_1, ..., ε_i. Assume that
(i) the innovations ε_i/σ are independent of F_{i-1} and standard normal;
(ii) the regressors x_i satisfy, for some non-stochastic normalisation matrix N → 0 and random matrices V, Σ, μ, the following joint convergence results:
(a) V_n = N' Σ_{i=1}^n x_i ε_i →^D V;
(b) Σ_n = N' Σ_{i=1}^n x_i x_i' N →^D Σ > 0 a.s.;
(c) n^{-1/2} N' Σ_{i=1}^n x_i →^D μ;
(d) max_{i ≤ n} |n^{1/2} N' x_i| = o_P(n^δ) for all δ > 0;
(e) n^{-1} E Σ_{i=1}^n |n^{1/2} N' x_i|^q = O(1) for some q > 9.
The conditions in Assumption 6.1(ii) on the regressors are satisfied in a range of situations, see Johansen and Nielsen (2009). For instance, x_i could be vector autoregressive with stationary roots or roots at one. They also hold for quite general regressors including polynomial regressors. The normalisation is N = n^{-1/2} I_{dim x} for stationary regressors and N = n^{-1} I_{dim x} for random walk regressors.
We note that Assumption 6.1 implies Assumption 3.1(i, ii) of Johansen and Nielsen (2014a) by choosing η = 1/4, q_0 = q > 9 and a suitably small δ > 0, bounded by the minimum of 1/(1 + dim x) and (q − 9)/(q − 1).
In the product moments (6.1) the v_i are indicator functions for small residuals. Such sums of indicator functions are the basis for empirical processes. The F_{i-1}-predictable factors x_i and x_i x_i' are called weights, in line with Koul (2002). The unbounded, F_i-adapted factors ε_i and ε_i² are said to be marks. For M-type estimators, the indicator functions have the form given in (6.4), where b and d represent normalized perturbations of the regression and scale parameters.
Theorem 6.1 (Johansen and Nielsen, 2014a, Lemma D.5) Suppose Assumption 6.1 holds. Consider the product moments (6.1) with weights v_i^{b,c,d} given by (6.4) and expansions

n^{-1/2} Σ_{i=1}^n v_i^{b,c,d} = n^{-1/2} Σ_{i=1}^n 1_(|ε_i| ≤ σc) + 2f(c)d + R_v(b, c, d),
n^{-1/2} Σ_{i=1}^n v_i^{b,c,d} ε_i² = n^{-1/2} Σ_{i=1}^n ε_i² 1_(|ε_i| ≤ σc) + 2σ²c²f(c)d + R_vεε(b, c, d),
N' Σ_{i=1}^n v_i^{b,c,d} x_i ε_i = N' Σ_{i=1}^n x_i ε_i 1_(|ε_i| ≤ σc) + 2cf(c) N' Σ_{i=1}^n x_i x_i' N b + R_vxε(b, c, d),
N' Σ_{i=1}^n v_i^{b,c,d} x_i x_i' N = ψ N' Σ_{i=1}^n x_i x_i' N + R_vxx(b, c, d).

Let

R(b, c, d) = |R_v(b, c, d)| + |R_vεε(b, c, d)| + |R_vxε(b, c, d)| + |R_vxx(b, c, d)|.

Then it holds for all (large) B > 0, all (small) η > 0 and n → ∞ that

sup_{|b|, |d| ≤ n^{1/4−η}B} sup_{0 < c < ∞} R(b, c, d) = o_P(1).   (6.5)
Theorem 6.1 is proved by a chaining argument. The idea is to cover the domain of b, d with a finite number of balls. The supremum over the large compact set can then be replaced by considering the maximum value over the centers of the balls and the maximum of the variation within balls. By subtracting the compensators of the product moments we turn them into martingales. The argument is therefore a consideration of the tail behaviour of the maximum of a family of martingales, using the iterated martingale inequality presented in §6.3 and Taylor expansions of the compensators.
Results related to Theorem 6.1 are considered in the literature. Koul and Ossiander (1994) considered weighted empirical processes without marks. Johansen and Nielsen (2009) considered the situation (6.6) for fixed c.
Theorem 6.2 (Bercu and Touati, 2008, Theorem 2.1) For i = 1, ..., n let (m_i, F_i) be a locally square integrable martingale difference. Then, for all x, y > 0,

P[ |Σ_{i=1}^n m_i| ≥ x, Σ_{i=1}^n {m_i² + E(m_i²|F_{i-1})} ≤ y ] ≤ 2 exp{−x²/(2y)}.
Theorem 6.3 (Johansen and Nielsen, 2014a, Theorem 5.2) For ℓ = 1, ..., L let z_{ℓ,i} be F_i-adapted with E z_{ℓ,i}^{2^r̄} < ∞ for some r̄ ∈ N. Let D_r = max_{1≤ℓ≤L} Σ_{i=1}^n E(z_{ℓ,i}^{2^r} | F_{i-1}) for 1 ≤ r ≤ r̄. Then, for all λ_0, λ_1, ..., λ_{r̄} > 0, it holds that

P[ max_{1≤ℓ≤L} |Σ_{i=1}^n {z_{ℓ,i} − E(z_{ℓ,i}|F_{i-1})}| > λ_0 ] ≤ L (E D_{r̄})/λ_{r̄} + Σ_{r=1}^{r̄} (E D_r)/λ_r + 2L Σ_{r=0}^{r̄−1} exp{−λ_r²/(4λ_{r+1})}.
Theorem 7.1 (Johansen and Nielsen, 2014b, Theorems 1, 2, 3) Consider the Huber-skip M-estimator defined from (4.2), (4.3). Suppose Assumption 6.1 holds and that the frequency of small regressors is bounded as outlined above. Then any minimizer of the objective function (4.2) has a measurable version and satisfies

N^{-1}(β̂ − β) = {ψ − 2cf(c)}^{-1} Σ_n^{-1} N' Σ_{i=1}^n x_i ε_i 1_(|ε_i| ≤ σc) + o_P(1).
Theorem 7.1 proves the conjecture (5.1) of Huber (1964) for time series regression. The regularity conditions on the regressors are much weaker than those normally considered in, for instance, Chen and Wu (1988), Liese and Vajda (1994), Maronna, Martin, and Yohai (2006), Huber and Ronchetti (2009), and Jurečková, Sen, and Picek (2012). Theorem 7.1 extends to non-normal, but symmetric, densities and even to non-symmetric densities and objective functions, by introducing a bias correction.
Theorem 7.1 is proved in three steps. First, it is shown that β̂ is tight, that is N^{-1}(β̂ − β) = O_P(n^{1/2}), through a geometric argument that requires the assumption on the frequency of small regressors. Secondly, it is shown that β̂ is consistent, in the sense that N^{-1}(β̂ − β) = O_P(n^{1/2−η}) for any η < 1/4, using the iterated martingale inequality of Theorem 6.3. Finally, the presented expansion of Theorem 7.1 is proved, again using Theorem 6.3.
Theorem 7.2 (Johansen and Nielsen, 2009, Corollary 1.2) Consider the 1-step Huber-skip M-estimators β̂^(1), σ̂^(1) defined by (4.7), (4.5) with weights (4.8). Suppose Assumption 6.1 holds and that N^{-1}(β̂^(0) − β) and n^{1/2}(σ̂^(0) − σ) are O_P(1). Then

N^{-1}(β̂^(1) − β) = ψ^{-1} Σ_n^{-1} N' Σ_{i=1}^n x_i ε_i 1_(|ε_i| ≤ σc) + {2cf(c)/ψ} N^{-1}(β̂^(0) − β) + o_P(1),   (7.1)
n^{1/2}(σ̂^(1) − σ) = (2στ)^{-1} n^{-1/2} Σ_{i=1}^n (ε_i² − σ²ς²) 1_(|ε_i| ≤ σc) + cf(c)(c²/τ − 1/ψ) n^{1/2}(σ̂^(0) − σ) + o_P(1).   (7.2)
Theorem 7.2 generalises the statement (5.2) for the location problem. Theorem 7.2 shows that the updated regression estimator β̂^(1) only depends on the initial regression estimator β̂^(0) and not on the initial scale estimator σ̂^(0). This is a consequence of the symmetry imposed on the problem. Johansen and Nielsen (2009) also analyze situations where the reference distribution f is non-symmetric and the cut-off is made in a matching non-symmetric way. In that situation both expansions involve the initial estimation uncertainty for β and σ².
We can immediately use Theorem 7.2 for an m-fold iteration of (7.1), (7.2). Results for infinite iterations follow in §7.5.
Thus, the conditions of Theorem 7.2 are satisfied, so that the robustified least squares estimators can be expanded as in (7.1), (7.2). The asymptotic distribution of the estimator for β will depend on the properties of the regressors. For simplicity the regressors are assumed stationary in the following result.
Theorem 7.3 (Johansen and Nielsen, 2009, Corollary 1.4) Consider the 1-step Huber-skip M-estimator defined with the weights (4.8) and where the initial estimators β̃, σ̃² are the full-sample least squares estimators. Suppose Assumption 6.1 holds and that the regressors are stationary. Then

n^{1/2} { (β̂ − β)', σ̂² − σ² }' →^D N[ 0, diag{σ² η_β Σ^{-1}, σ⁴ η_σ} ],

where, using the coefficients (τ, κ, ς) from (3.2) and (3.4), the efficiency factor for the regression estimator is

η_β = [ τ{1 + 4cf(c)} + {2cf(c)}² ] / ψ²,   (7.4)

and η_σ is the corresponding efficiency factor for the variance estimator, expressed in terms of κ, τ and ψ.
This result generalises the statement (5.3) for the location problem. The efficiency factor η_β is plotted as the top curve in Figure 3. A plot of the efficiency for the variance can be found in Johansen and Nielsen (2009, Figure 1.1). Situations with non-stationary regressors are also discussed in that paper.
Theorem 7.4 (Johansen and Nielsen, 2009, Theorems 1.5, 1.7) Consider the split-half Impulse Indicator Saturation estimator of Algorithm 4.2. Suppose Assumption 6.1 holds with stationary regressors. Recall the efficiency factors η_β, η_σ from (7.4). Then the initial estimators satisfy

n^{1/2} { (β̂^(0) − β)', (σ̂^(0))² − σ² }' →^D N[ 0, diag{σ² η_β Σ^{-1}, σ⁴ η_σ} ],

while the updated estimator β̂^(1) satisfies a corresponding result with η_β replaced by an efficiency factor η^iis, which is a function of ψ, τ and 2cf(c).
The efficiency factors η_β and η^iis for the split-half case are plotted as the top and the middle curve, respectively, in Figure 3. Johansen and Nielsen (2009) also discuss situations with general index sets I_1, I_2 and where the regressors are non-stationary.
Theorem 7.5 (Johansen and Nielsen, 2013, Theorem 3.3) Consider the iterated 1-step Huber-skip M-estimator in Algorithm 4.1. Suppose Assumption 6.1 holds and that N^{-1}(β̂^(0) − β) and n^{1/2}(σ̂^(0) − σ) are O_P(1). Then

sup_{0 ≤ m < ∞} { |N^{-1}(β̂^(m) − β)| + |n^{1/2}(σ̂^(m) − σ)| } = O_P(1).

Theorem 7.5 is proved by showing that the expansions (7.1), (7.2) are contractions. Necessary conditions are that 2cf(c)/ψ < 1 and that the corresponding coefficient in the scale recursion (7.2) is less than one in absolute value. This holds for normal or t-distributed innovations, see Johansen and Nielsen (2013, Theorem 3.6).
In turn, Theorem 7.5 leads to a fixed point result for infinitely iterated estimators.
Theorem 7.6 (Johansen and Nielsen, 2013, Theorem 3.3) Consider the iterated 1-step Huber-skip M-estimator in Algorithm 4.1. Suppose Assumption 6.1 holds and that N^{-1}(β̂^(0) − β) and n^{1/2}(σ̂^(0) − σ) are O_P(1). Then, for all ε, δ > 0 a pair m_0, n_0 > 0 exists so that for all m > m_0 and n > n_0 it holds that

P{ |N^{-1}(β̂^(m) − β̂*)| + n^{1/2}|σ̂^(m) − σ̂*| > δ } < ε,

where

N^{-1}(β̂* − β) = {ψ − 2cf(c)}^{-1} Σ_n^{-1} N' Σ_{i=1}^n x_i ε_i 1_(|ε_i| ≤ σc),   (7.6)
n^{1/2}{(σ̂*)² − σ²} = (σ²/τ) n^{-1/2} Σ_{i=1}^n (ε_i²/σ² − τ/ψ) 1_(|ε_i| ≤ σc).   (7.7)
Recently, Cavaliere and Georgiev (2013) made a similar analysis of a sequence of Huber-skip M-estimators for the parameter of a first order autoregression with infinite variance errors and an autoregressive coefficient of unity.
Iterated 1-step Huber-skip M-estimators can be viewed as iteratively reweighted least squares with binary weights. Dollinger and Staudte (1991) gave conditions for convergence of iteratively reweighted least squares for smooth weights. Their argument was cast in terms of influence functions. While Theorem 7.6 is similar in spirit, the employed tightness argument is different because of the binary weights.
An issue of interest in the literature is whether a slow initial convergence rate can be improved upon through iteration. This would open up for using robust estimators converging at, for instance, an n^{1/3} rate as initial estimators. An example would be the Least Median of Squares estimator of Rousseeuw (1984). Such a result would complement the result of He and Portnoy (1992), who find that the convergence rate cannot be improved in a single step of the iteration, as well as Theorem 8.3 below, which shows that the Forward Search can improve the rate of a slowly converging initial estimator.
For the 1-step Huber-skip L-estimator defined by (4.7), (4.5) with weights (4.9) the scale expansion is

n^{1/2}(σ̂^(1) − σ) = (2στ)^{-1} n^{-1/2} Σ_{i=1}^n (ε_i² − σ²ς²) 1_(|ε_i| ≤ σc) + σf(c)(c²/τ − 1/ψ) n^{1/2}(ξ̃_(k)/σ − c) + o_P(1)   (8.2)
 = (2στ)^{-1} n^{-1/2} Σ_{i=1}^n (ε_i² − σ²ς²) 1_(|ε_i| ≤ σc) − (σ/2)(c²/τ − 1/ψ) n^{-1/2} Σ_{i=1}^n {1_(|ε_i| ≤ σc) − ψ} + o_P(1).   (8.3)
Proof. Equations (8.1), (8.2) follow from Theorem 6.1. Equation (8.3), with its expansion of the quantile ξ̃_(k), follows from Johansen and Nielsen (2014a, Lemma D.11).
Ruppert and Carroll (1980) state a similar result for a related 1-step L-estimator, but omit the details of the proof. It is interesting to note that the expansion of the 1-step regression estimator of L-type in (8.1) is the same as for the M-type in (7.1). In contrast, the variance estimators have different expansions. In particular, the L-estimator does not use the initial variance estimator and, consequently, the expansion does not involve uncertainty from the initial estimation.
The proof uses the theory of weighted and marked empirical processes outlined in §6.2 combined with the theory of quantile processes discussed in Csörgő (1983). A single step of the algorithm was previously analyzed in Johansen and Nielsen (2010).
Comparing Theorem 8.3 with Theorems 7.6 and 8.1, we recognise the asymptotic result for the estimator for β. The efficiency relative to the least squares estimator is shown as the bottom curve in Figure 3. The asymptotic expansion for the variance estimator σ̂², however, differs from the expression for the iterated 1-step Huber-skip M-estimator in Theorem 7.6, reflecting the different handling of the scale. The Bahadur (1966) representation, linking the empirical distribution of the scaled innovations ε_i/σ with their order statistics, ĉ_ψ say, implies that 2f(c_ψ)n^{1/2}(ẑ_ψ/σ − ĉ_ψ) vanishes. Moreover, the minimum deletion residual d̂^(m) = min_{i∉S^(m)} ξ̂_i^(m) has the same asymptotic expansion as ẑ^(m) = ξ̂^(m)_(m+1) after a burn-in period. See Johansen and Nielsen (2014a, Theorem 3.4) for details.
The idea of the Forward Search is to monitor the plot of the sequence of scaled forward residuals. Combining the expansions for σ̂ and ẑ in Theorem 8.3 gives the next result.
Theorem 8.4 (Johansen and Nielsen 2014a, Theorem 3.3) Consider the Forward Search estimator in Algorithm 4.3. Suppose Assumption 6.1 holds and that N^{-1}(β̂^(m_0) − β) is O_P(n^{1/4−η}) for some η > 0. Let ψ_0 > 0. Then

sup_{ψ_0 ≤ ψ ≤ n/(n+1)} | 2f(c_ψ) n^{1/2}(ẑ_ψ/σ̂ − c_ψ) − Z_n(c_ψ) | = o_P(1),   (8.4)

where Z_n converges to a Gaussian process Z. The covariance of Z is given in Theorem A.1.
Part II
Gauge as a measure of false detection
We now present some new results for the outlier detection algorithms. Outlier detection algorithms will detect outliers with positive probability even when in fact there are no outliers. We analyze this in terms of the gauge, which is the expected frequency of falsely detected outliers when, in fact, the data generating process has no outliers. The idea of a gauge originates in the work of Hoover and Perez (1999) and is formally introduced in Hendry and Santos (2010), see also Castle, Doornik and Hendry (2011).
The gauge concept is related to, but also distinct from, the concept of the size of a statistical test, which is the probability of falsely rejecting a true hypothesis. For a statistical test we choose the critical value indirectly from the size we are willing to tolerate. In the same way, for an outlier detection algorithm, we can choose the cut-off for outliers indirectly from the gauge we are willing to tolerate.
The detection algorithms assign binary weights v̂_i to each observation, so that v̂_i = 0 for outliers and v̂_i = 1 otherwise. We define the empirical or sample gauge as the frequency of falsely detected outliers,

γ̂ = n^{-1} Σ_{i=1}^n (1 − v̂_i).   (8.5)

In turn, the population gauge is the expected frequency of falsely detected outliers when, in fact, the model has no contamination, that is,

E γ̂ = E n^{-1} Σ_{i=1}^n (1 − v̂_i).

To see how the gauge of an outlier detection algorithm relates to the size of a statistical test, consider an outlier detection algorithm which classifies observations as outliers if the absolute standardized residual |y_i − x_i'β̂|/σ̂ is large for some estimator (β̂, σ̂). That algorithm has sample gauge

γ̂ = n^{-1} Σ_{i=1}^n (1 − v̂_i) = n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β̂| > σ̂c).   (8.6)

Suppose the parameters β, σ were known, so that we could choose β̂, σ̂ as β, σ. Then the population gauge reduces to the size of a test that a single observation is an outlier, that is,

E n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β| > σc) = E n^{-1} Σ_{i=1}^n 1_(|ε_i| > σc) = P(|ε_1| > σc) = γ.

In general, the population gauge will, however, be different from the size of such a test because of the estimation error. In §9 and §10 we analyze the gauge implicit in the definition of a variety of estimators of M and L type, respectively. Proofs follow in the appendix.
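The sample gauge (8.6) is straightforward to compute; the sketch below, an illustration with simulated data rather than an excerpt from the paper, also checks by Monte Carlo that plugging in the true parameters under no contamination gives the size γ = P(|ε_1| > σc).

```python
import numpy as np

def sample_gauge(y, X, beta_hat, sigma_hat, c):
    """Empirical gauge (8.6): fraction of observations whose absolute
    residual exceeds sigma_hat * c."""
    return np.mean(np.abs(y - X @ beta_hat) > sigma_hat * c)

rng = np.random.default_rng(1)
n, c, sigma = 100, 2.576, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 0.5])
gauges = [sample_gauge(X @ beta + sigma * rng.normal(size=n), X, beta, sigma, c)
          for _ in range(2000)]
print(np.mean(gauges))   # close to 2*(1 - Phi(2.576)) = 0.01
```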
Theorem 9.1 Consider a sample gauge γ̂ of the form (8.6). Suppose Assumption 6.1 holds and that N^{-1}(β̂ − β), n^{1/2}(σ̂² − σ²) are O_P(1). Then, for fixed c,

n^{1/2}(γ̂ − γ) = n^{-1/2} Σ_{i=1}^n {1_(|ε_i| > σc) − γ} − 2cf(c) n^{1/2}(σ̂/σ − 1) + o_P(1).   (9.1)

It follows that E γ̂ → γ.
Note that convergence in mean is equivalent to convergence in probability, since the gauge takes values in the interval [0, 1], see Billingsley (1968, Theorem 5.4).
Theorem 9.1 applies to various Huber-skip M-estimators. For the Huber-skip M-estimator the estimators β̂, σ̂² are the Huber-skip estimator and the corresponding variance estimator. For the 1-step Huber-skip M-estimator the estimators β̂, σ̂² are the initial estimators. For the Impulse Indicator Saturation or the 1-step Huber-skip M-estimator iterated m times the estimators β̂, σ̂² are the estimators from step m − 1. In particular, for the Huber-skip M-estimator the sample gauge satisfies (Theorem 9.2)

n^{1/2}(γ̂ − γ) →^D N{0, γ(1 − γ)}.

The robustified least squares estimator: This is the 1-step Huber-skip M-estimator β̂ defined in (4.7), (4.8), where the initial estimators β̃, σ̃² are the full-sample least squares estimators. The binomial term in Theorem 9.1 is now combined with a term from the initial variance estimator σ̃².
Theorem 9.3 Consider the robustified least squares estimator β̂ defined from (4.7), (4.8), where the initial estimators β̃ and σ̃² are the full sample least squares estimators, the cut-off c is fixed, and the sample gauge is γ̃ = n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β̃| > σ̃c). Suppose Assumption 6.1 holds. Then

n^{1/2}(γ̃ − γ) →^D N[ 0, γ(1 − γ) + 2cf(c)(τ − ψ) + {cf(c)}²(κ̄ − 1) ],

where κ̄ = E(ε_1/σ)⁴. The variance in Theorem 9.3 is smaller than the binomial variance for a normal reference distribution and any choice of γ. This is seen through differentiation with respect to c.
The split-half Impulse Indicator Saturation estimator: The estimator is defined in Algorithm 4.2. Initially, the outliers are defined using the indicator v̂_i^(-1) based on the split-sample estimators β̂_1, σ̂_1² and β̂_2, σ̂_2², see (4.10). The outliers are reassessed using the updated estimators β̂^(0), σ̂^(0). Thus, the algorithm gives rise to two sample gauges,

γ̂^(-1) = n^{-1} Σ_{i∈I_1} 1_(|y_i − x_i'β̂_2| > σ̂_2 c) + n^{-1} Σ_{i∈I_2} 1_(|y_i − x_i'β̂_1| > σ̂_1 c),   (9.2)
γ̂^(0) = n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β̂^(0)| > σ̂^(0) c).   (9.3)

For simplicity we only report the result for the initial gauge γ̂^(-1). The updated gauge γ̂^(0) has a different asymptotic variance.
Theorem 9.4 Consider the Impulse Indicator Saturation. Suppose Assumption 6.1 holds for each set I_1, I_2. Then, for fixed c, the initial sample gauge γ̂^(-1) has the same asymptotic distribution as the sample gauge for robustified least squares reported in Theorem 9.3.
The iterated 1-step Huber-skip M-estimator: The estimator is defined in Algorithm 4.1. Special cases are the iterated robustified least squares estimator and the Impulse Indicator Saturation. If the algorithm is stopped after m + 1 steps the sample gauge is

γ̂^(m) = n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β̂^(m)| > σ̂^(m) c)   for m = 0, 1, 2, ...

Because the estimation errors N^{-1}(β̂^(m) − β), n^{1/2}(σ̂^(m) − σ) are tight by Theorem 7.5, the sequence of sample gauges will also be tight. Theorem 9.1 then generalises as follows.
Theorem 9.5 Consider the iterated 1-step Huber-skip estimator. Suppose Assumption 6.1 holds and that the initial estimators satisfy that N^{-1}(β̂^(0) − β) and n^{1/2}(σ̂^(0) − σ) are O_P(1). Then, for fixed c, the sequence of sample gauges γ̂^(m) satisfies sup_{0 ≤ m < ∞} |E γ̂^(m) − γ| → 0.
Theorem 9.6 Consider the fully iterated 1-step Huber-skip estimators β̂*, σ̂*, see (7.6) and (7.7), and the weights v_i defined from these. Suppose Assumption 6.1 holds and that the initial estimators satisfy that N^{-1}(β̂^(0) − β) and n^{1/2}(σ̂^(0) − σ) are O_P(1). Then, for all ε, δ > 0 a pair n_0, m_0 > 0 exists so that for all n, m with n ≥ n_0 and m ≥ m_0 it holds, for fixed c, that

n^{1/2}(γ̂^(m) − γ) = n^{-1/2} Σ_{i=1}^n {1_(|ε_i| > σc) − γ} − {cf(c)/τ} n^{-1/2} Σ_{i=1}^n (ε_i²/σ² − τ/ψ) 1_(|ε_i| ≤ σc) + o_P(1).

Moreover, the two sums are asymptotically independent and it holds that

n^{1/2}(γ̂^(m) − γ) →^D N[ 0, γ(1 − γ) + {cf(c)/τ}²(κ − τ²/ψ) ].
Table 1 shows the asymptotic variances for the Huber-skip M-estimator, the Robustified Least Squares and the fully iterated 1-step Huber-skip estimators. The latter include iterated Robustified Least Squares and iterated Impulse Indicator Saturation. The results are taken from Theorems 9.2, 9.3 and 9.6, respectively. For gauges of 1% or lower the standard deviations are very similar. If the gauge is chosen as γ = 0.05 and n = 100, then the sample gauges γ̂ will be asymptotically normal with mean γ = 0.05 and a standard deviation of about 0.2/n^{1/2} = 0.02. This suggests that it is not unusual to find up to 8-9 outliers when in fact there are none. Lowering the gauge to γ = 0.01 or γ = 0.0025, the standard deviation is about 0.1/n^{1/2} = 0.01 and 0.05/n^{1/2} = 0.005, respectively, when n = 100. Thus, it is not unusual to find up to 2-3 and up to 1 outliers, respectively, when in fact there are none. This suggests that the gauge should be chosen rather small, in line with the discussion in Hendry and Doornik (2014, §7.6).
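The back-of-the-envelope numbers above can be reproduced with the binomial benchmark of Theorem 9.2; the sketch below is our illustration, and the two-standard-deviation rule for an "unsurprising" count of false outliers is an assumption made for the illustration.

```python
import numpy as np

n = 100
for gamma in (0.05, 0.01, 0.0025):
    sd = np.sqrt(gamma * (1 - gamma) / n)      # binomial benchmark sd of the gauge
    print(f"gamma={gamma:<7} sd(gauge)={sd:.3f} "
          f"typical false outliers <= {n * gamma + 2 * n * sd:.1f}")
```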
In the first result we assume that the estimation errors N^{-1}(β̂_n − β) and n^{1/2}(σ̂_n − σ) are tight. Thus, the result immediately applies to robustified least squares, where the initial estimators β̃_n and σ̃_n are the full sample least squares estimators, which do not depend on the cut-off c_n. But, in general, we need to check this tightness condition.
Theorem 9.7 Consider the 1-step Huber-skip M-estimator, where the cut-off c_n is chosen so that nP(|ε_1| > σc_n) = λ. Suppose Assumption 6.1 holds, and that N^{-1}(β̂_n − β) and n^{1/2}(σ̂_n − σ) are O_P(1). Then the sample gauge γ̂_n in (9.5) satisfies

n γ̂_n →^D Poisson(λ).

We next discuss this result for particular initial estimators.
Robustified least squares estimator: The initial estimators β̃ and σ̃² are the full sample least squares estimators. These do not depend on c_n, so Theorem 9.7 trivially applies.
Theorem 9.8 Consider the robustified least squares estimator β̂ defined from (4.7), (4.8), where the initial estimators β̃ and σ̃² are the full sample least squares estimators, while c_n is defined from (9.4). Suppose Assumption 6.1 holds. Then the sample gauge γ̃_n = n^{-1} Σ_{i=1}^n 1_(|y_i − x_i'β̃| > σ̃c_n) satisfies

n γ̃_n →^D Poisson(λ).
λ       c_n (n=100)   c_n (n=200)   x=0    x=1    x=2    x=3    x=4    x=5
5       1.960         2.241         0.01   0.04   0.12   0.27   0.44   0.62
1       2.576         2.807         0.37   0.74   0.92   0.98   1.00
0.5     2.807         3.023         0.61   0.91   0.98   1.00
0.25    3.023         3.227         0.78   0.97   1.00
0.1     3.291         3.481         0.90   1.00

Table 2: Poisson approximations to the probability of finding at most x outliers for a given λ. The implied cut-off c_n = Φ^{-1}{1 − λ/(2n)} is shown for n = 100 and n = 200.
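The entries in Table 2 follow from the cut-off rule nP(|ε_1/σ| > c_n) = λ under a normal reference and from the Poisson distribution function; the short sketch below reproduces them and is an illustration rather than the authors' code.

```python
from scipy import stats

for lam in (5, 1, 0.5, 0.25, 0.1):
    c100 = stats.norm.ppf(1 - lam / (2 * 100))   # cut-off for n = 100
    c200 = stats.norm.ppf(1 - lam / (2 * 200))   # cut-off for n = 200
    probs = [stats.poisson.cdf(x, lam) for x in range(6)]
    print(f"lambda={lam:<5} c100={c100:.3f} c200={c200:.3f} "
          + " ".join(f"{p:.2f}" for p in probs))
```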
Impulse Indicator Saturation: Let β̂_j and σ̂_j² be the split sample least squares estimators. These do not depend on c_n, so Theorem 9.7 trivially applies for the split sample gauge based on

v̂_{i,n}^(-1) = 1_(i∈I_1) 1_(|y_i − x_i'β̂_2| > σ̂_2 c_n) + 1_(i∈I_2) 1_(|y_i − x_i'β̂_1| > σ̂_1 c_n).

Theorem 9.9 Consider the Impulse Indicator Saturation Algorithm 4.2. Let c_n be defined from (9.4). Suppose Assumption 6.1 holds for each set I_1, I_2. Let the estimators β̂_n^(0) and (σ̂_n^(0))² be the updated estimators of Algorithm 4.2. Then the updated sample gauge also satisfies n γ̂_n^(0) →^D Poisson(λ).
Table 2 shows the Poisson approximation to the probability of finding at most x outliers for different values of λ. For small λ and n this approximation is possibly more accurate than the normal approximation, although that would have to be investigated in a detailed simulation study. The Poisson distribution has its mass concentrated at small counts, so the probability of finding at most x = λ outliers increases from 62% to 90% as λ decreases from 5 to 0.1. In particular, for λ = 1 and n = 100, so that the cut-off is c_n = 2.58, the probability of finding at most one outlier is 74% and the probability of finding at most two outliers is 92%. In other words, the chance of finding 3 or more outliers is small when in fact there are none.
The sample gauge of the Forward Search is

γ̂ = (n − m̂)/n = n^{-1} Σ_{m=m_0}^{n-1} (n − m) 1_(m̂ = m).

Rewrite this by substituting n − m = Σ_{j=m}^{n-1} 1 and changing the order of summation to get

γ̂ = n^{-1} Σ_{j=m_0}^{n-1} 1_(m̂ ≤ j).   (10.1)

If the stopping time is an exit time, then the event (m̂ ≤ j) is true if ẑ^(m)/σ̂^(m) has exited at the latest by m = j.
An example of a stopping time is the following. Theorem 8.4 shows that

Z̃_n(c_ψ) = 2f(c_ψ) n^{1/2}(ẑ_ψ/σ̂ − c_ψ) = Z_n(c_ψ) + o_P(1).   (10.2)

To analyze the stopping time (10.3) we consider the event (m̂ ≤ j). This event satisfies

(m̂ ≤ j) = [ max_{m_1 ≤ m ≤ j} Z̃_n(c_{m/n}) / sdv{Z̃_n(c_{m/n})} > q ].

Inserting this expression into (10.1) and then using the expansion (10.2), we arrive at the following result, with details given in the appendix.
Theorem 10.1 Consider the Forward Search. Suppose Assumption 6.1 holds. Let m_0 = int(ψ_0 n) and m_1 = int(ψ_1 n) for some ψ_1 ≥ ψ_0 > 0. Consider the stopping time m̂ in (10.3) for some q ≥ 0. Then

E γ̂ = E (n − m̂)/n → γ = ∫_{ψ_1}^{1} P[ sup_{ψ_1 ≤ ψ ≤ u} Z(c_ψ)/sdv{Z(c_ψ)} > q ] du.

If ψ_1 > ψ_0, the same limit holds for the Forward Search when replacing ẑ^(m) by the deletion residual d̂^(m) in the definition of m̂ in (10.3).
γ \ ψ_1   0.05   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90
0.10      2.50   2.43   2.28   2.14   1.99   1.81   1.60   1.31   0.82    -
0.05      2.77   2.71   2.58   2.46   2.33   2.19   2.02   1.79   1.45   0.69
0.01      3.30   3.24   3.14   3.04   2.94   2.83   2.71   2.55   2.33   1.91
0.005     3.49   3.44   3.35   3.26   3.15   3.04   2.95   2.81   2.62   2.26
0.001     3.90   3.85   3.77   3.69   3.62   3.53   3.43   3.32   3.18   2.92

Table 3: Cut-off values q for the Forward Search as a function of the gauge γ and the lower point ψ_1 of the range for the stopping time.
The integral in Theorem 10.1 cannot be computed analytically in an obvious way. Instead we simulated it using Ox 7, see Doornik (2007). For a given n, draws of normal ε_i can be made. From this, the process Z_n in (8.4) can be computed. The maximum of Z_n(c_{m/n})/sdv{Z(c_{m/n})} over m_1 ≤ m ≤ j can then be computed for any m_1 ≤ j ≤ n. Repeating this n_rep times, the probability appearing as the integrand can be estimated for given values of q and u. From this the integral can be computed. This expresses γ = γ(ψ_1, q) as a function of q and ψ_1. Inverting this for fixed ψ_1 expresses q = q(ψ_1, γ) as a function of γ and ψ_1. Results are reported in Table 3 for n_rep = 10^5 and n = 1600.
The preliminary second half outliers are in observations 95, 108, 68, 75, 94 with residuals −4.66, −3.11, −2.85, −2.74, −2.66. The estimated model for the second sample half is

q̂_t^(2nd half) = 7.5 + 0.13 q_{t-1} − 0.21 S_t,   σ̂ = 0.77,   t = 57, ..., 111,

with standard errors (1.2), (0.14), (0.30). The preliminary first half outliers are in observations 18, 34 with residuals −3.78, −2.95.
In step m = 0 we estimate a model with dummies for the preliminary outliers. The dummy terms of the full sample model are

−1.98 D_t^{18} − 1.80 D_t^{34} − 1.26 D_t^{68} − 1.34 D_t^{75} − 1.35 D_t^{94} − 2.40 D_t^{95},

with standard errors (0.60), (0.61), (0.60), (0.60), (0.60), (0.61) and t-statistics [−3.16], [−2.93], [−2.10], [−2.23], [−2.25], [−3.96]. The observations 18, 34, 95 remain outliers, while all residuals are small.
In step m = 2 the estimated model is identical to the model (2.2). In that model the observations 18, 34, 95 remain outliers, while all residuals are smaller. Thus, the algorithm has reached a fixed point.
If the gauge is chosen as 0.5% or 0.25%, so that the cut-off is 2.81 or 3.02, respectively, the algorithm will converge to a solution taking 18 and 95, or 95 only, as outliers, respectively.
The scaling in Figure 4 is chosen in line with Atkinson, Riani and Cerioli (2010). Consider panel (a) where ψ_0 = ψ_1 = 0.95. Choose the gauge as, for instance, γ = 0.01, in which case we need to consider the third exit band from the top. This is exceeded for m̂ = 107, pointing at n − m̂ = 3 outliers. These are the three holiday observations 18, 34, 95 discussed in §2. If the gauge is set to γ = 0.001 we find no outliers. If the gauge is set to γ = 0.05 we find m̂ = 104, pointing at n − m̂ = 6, which is 5% of the observations.
Consider now panel (b) where ψ_0 = ψ_1 = 0.80. With a gauge of γ = 0.01 we find m̂ = 96, pointing at n − m̂ = 14 outliers. These include the three holiday observations along with 11 other observations. This leaves some uncertainty about the best choice of the number of outliers. The present analysis is based on asymptotics and could be distorted in finite samples.
Figure 4: Forward plots of forward residuals for the fish data, with (a) ψ_0 = ψ_1 = 0.95 and (b) ψ_0 = ψ_1 = 0.80. The bottom curve shows the pointwise median. The top curves show the exit bands for gauges chosen as, from the top, 0.001, 0.005, 0.01, 0.05, respectively. Panel (b) also includes an exit band for a gauge of 0.10.
not normal. Combined with the concept of the gauge, these results are used for calibrating the cut-off values of the estimators.
In further research we will look at situations where there actually are outliers. Various configurations of outliers will be of interest: single outliers, clusters of outliers, level shifts, symmetric and non-symmetric outliers. The probability of finding particular outliers is called potency in Hendry and Santos (2010). It will then be possible to compare the potency of two different outlier detection algorithms that are calibrated to have the same gauge.
The approach presented is different from the traditional approaches of robust statistics. It would be of interest to compare the approach with the traditional idea of analyzing robust estimators in terms of their breakdown point, see Hampel (1971), or the influence function, see Hampel, Ronchetti, Rousseeuw and Stahel (1986) or Maronna, Martin and Yohai (2006).
First order asymptotic theory is known to be fragile in some situations. A comprehensive simulation study of the results presented would therefore be useful, possibly building on Atkinson and Riani (2006) and Hendry and Doornik (2014).
It would be of interest to extend this research to variable selection algorithms such as Autometrics, see Hendry and Doornik (2014). The Impulse Indicator Saturation is a stylized version of Autometrics. It should work well if the researcher can identify a part of the data that is free from outliers. If this is not the case, one will have to iterate over the choice of sub-samples. In Autometrics, potential outliers are coded as dummy variables and the algorithm then searches over these dummy variables along with the other regressors.
A Proofs
For the asymptotic normality results for the gauge some covariance matrices have to be computed. The results are collected in the following theorem.
Theorem A.1 Suppose Assumption 6.1 holds and that c_ψ = G^{-1}(ψ). Then the processes

A_n(c) = n^{-1/2} Σ_{i=1}^n {1_(|ε_i| > σc) − γ},
B_n(c) = n^{-1/2} Σ_{i=1}^n (ε_i²/σ² − τ/ψ) 1_(|ε_i| ≤ σc),
C_n(c) = n^{-1/2} Σ_{i=1}^n (ε_i²/σ² − 1),
K_n(c) = N' Σ_{i=1}^n x_i ε_i 1_(|ε_i/σ| ≤ c)

converge to continuous limits A, B, C, K on D[0, 1] endowed with the uniform metric. The processes A_n, B_n, C_n have Gaussian limits whose covariance matrix has entries

Var{A_n(c)} = γ(1 − γ), Var{B_n(c)} = Cov{B_n(c), C_n(c)} = κ − τ²/ψ, Var{C_n(c)} = κ̄ − 1, Cov{A_n(c), B_n(c)} = 0, Cov{A_n(c), C_n(c)} = ψ − τ.   (A.1)
Moreover,

asVar{ 2f(c_ψ) n^{1/2}(ẑ_ψ/σ̂ − c_ψ), n^{1/2}(σ̂²/σ² − 1) }' = Var{ A_n(c_ψ), B_n(c_ψ)/τ + (c_ψ²/τ − 1/ψ) A_n(c_ψ) }' = ω_1' Ω ω_1,

for

ω_1' = ( 1, 0, 0 ; c_ψ²/τ − 1/ψ, 1/τ, 0 ),

where Ω is the covariance matrix in (A.1). The asymptotic variance in Theorem 8.4 is given by ω_2' Ω ω_2, where

ω_2' = ( {1 − c_ψ f(c_ψ)(c_ψ²/τ − 1/ψ)}, −c_ψ f(c_ψ)/τ, 0 ).
Proof of Theorem 9.3. The initial estimators satisfy N^{-1}(β̃ − β) = O_P(1) and n^{1/2}(σ̃² − σ²) = n^{-1/2} Σ_{i=1}^n (ε_i² − σ²) + o_P(1), see (7.3). Use Theorem 9.1 to get

n^{1/2}(γ̃ − γ) = n^{-1/2} Σ_{i=1}^n {1_(|ε_i| > σc) − γ} − (cf(c)/σ²) n^{-1/2} Σ_{i=1}^n (ε_i² − σ²) + o_P(1).   (A.2)

Proof of Theorem 9.4. The initial sample gauge satisfies

n^{1/2}(γ̂^(-1) − γ) = n^{-1/2} Σ_{j=1}^{2} (n_j/n)^{1/2} n_j^{-1/2} Σ_{i∈I_j} {1_(|y_i − x_i'β̂_{3-j}| > σ̂_{3-j}c) − γ}
 = n^{-1/2} Σ_{j=1}^{2} Σ_{i∈I_j} {1_(|ε_i| > σc) − γ} − (cf(c)/σ²) n^{-1/2} Σ_{j=1}^{2} Σ_{i∈I_j} (ε_i² − σ²) + o_P(1).

This reduces to the expansion (A.2) for the robustified least squares estimator.
Proof of Theorem 9.5. Theorem 7.5 shows that the normalised estimators are tight. Thus, for all ε > 0 there exists an A > 0 so that the set

A_n = ∩_{m=0}^{∞} { |N^{-1}(β̂^(m) − β)| + |n^{1/2}{(σ̂^(m))² − σ²}| ≤ A }   (A.3)

has probability of at least 1 − ε. Theorem 6.1 then shows that on that set

γ̂^(m) − γ = n^{-1} Σ_{i=1}^n {1_(|ε_i| > σc) − γ} − n^{-1/2} 2cf(c) n^{1/2}(σ̂^(m)/σ − 1) + o_P(n^{-1/2}),   (A.4)

where, uniformly in m, the first term and the second term are O_P(n^{-1/2}), while the remainder term is o_P(n^{-1/2}). Therefore sup_{0 ≤ m < ∞} |γ̂^(m) − γ| = O_P(n^{-1/2}). Since γ̂^(m) and γ are bounded by one, E sup_{0 ≤ m < ∞} |γ̂^(m) − γ| vanishes as n → ∞. Thus, by the triangle inequality,

sup_{0 ≤ m < ∞} |E γ̂^(m) − γ| ≤ sup_{0 ≤ m < ∞} E|γ̂^(m) − γ| ≤ E sup_{0 ≤ m < ∞} |γ̂^(m) − γ| = o(1).
Proof of Theorem 9.6. On the set A_n defined in the proof of Theorem 9.5, see (A.3), we consider the expansion (A.4), that is,

n^{1/2}(γ̂^(m) − γ) = n^{-1/2} Σ_{i=1}^n {1_(|ε_i| > σc) − γ} − 2cf(c) n^{1/2}(σ̂^(m)/σ − 1) + o_P(1),

where the remainder is uniform in m. Theorem 7.6 shows that for large m, n we have

n^{1/2}(σ̂^(m)/σ − 1) = (2τ)^{-1} n^{-1/2} Σ_{i=1}^n (ε_i²/σ² − τ/ψ) 1_(|ε_i| ≤ σc) + o_P(1),

where the remainder is uniform in m. Combining these gives the desired expansion. The asymptotic normality follows from Theorem A.1.
Theorem 9.7 is a special case of the following lemma, in view of Remarks A.1 and A.2 below, because Assumption 6.1 assumes Gaussian errors.
Lemma A.2 Suppose Assumption 6.1(ii, d) holds. Let the cut-off c_n be given by (9.4) and assume that
(i) the density f is symmetric with decreasing tails and support on R, so that c_n → ∞, with
(a) E|ε_i|^r < ∞ for some r > 4;
(b) f(c_n)/[c_n{1 − F(c_n)}] = O(1);
(c) f(c_n − n^{-1/4}A)/f(c_n) = O(1) for all A > 0;
(ii) N^{-1}(β̂ − β), n^{1/2}(σ̂² − σ²) are O_P(1).
Then the sample gauge γ̂ in (8.6) satisfies

n γ̂ →^D Poisson(λ).

Remark A.1 Assumption (i.a) implies that c_n = O(n^{1/r}) where 1/r < 1/4. Combine the definition P(|ε_i| > σc_n) = λ/n with the Markov inequality P(|ε_i| > σc_n) ≤ (σc_n)^{-r} E|ε_i|^r, so that c_n ≤ (E|ε_i|^r)^{1/r} σ^{-1} λ^{-1/r} n^{1/r} = O(n^{1/r}).
Remark A.2 Assumption (i) of Lemma A.2 holds if f = φ is standard normal. For (b) use the Mill's ratio result {(4 + c²)^{1/2} − c}/2 < {1 − Φ(c)}/φ(c), see Sampford (1953). For (c) note that −2 log{φ(c_n − n^{-1/4}A)/φ(c_n)} = c_n² − (c_n − n^{-1/4}A)² = 2c_n n^{-1/4}A − n^{-1/2}A² and use Remark A.1.
Proof of Lemma A.2. 1. Tightness of the estimators. For all ε > 0 there exists a constant A_0 > 1 such that the set

B_n = { |N^{-1}(β̂ − β)| + n^{1/2}|σ̂ − σ| + n^{1/4} max_{1≤i≤n} |N'x_i| ≤ A_0 }

has probability larger than 1 − ε. It suffices to prove the theorem on this set.
2. A bound on indicators. Introduce the quantity s_i, the effective standardized cut-off for observation i once the estimation errors are taken into account. On the set B_n, using c_n = o(n^{1/4}) by Remark A.1, the quantity s_i satisfies, for some A_1 > 0,

s_i ≤ c_n + n^{-1/2}A_0 c_n + n^{-1/4}A_0² ≤ c_n + n^{-1/4}A_1,
s_i ≥ c_n − n^{-1/2}A_0 c_n − n^{-1/4}A_0² ≥ c_n − n^{-1/4}A_1.

It therefore holds that the outlier indicators are bounded above and below by indicators with cut-offs c_n ∓ n^{-1/4}A_1. A first order Taylor expansion and the identity 2{1 − F(c_n)} = λ/n give

E_n = n ∫_{c_n − n^{-1/4}A_1}^{c_n + n^{-1/4}A_1} 2f(x) dx = 4 n^{3/4} A_1 f(c*) = λ · 4 n^{-1/4} A_1 f(c*) / [2{1 − F(c_n)}],

for some |c* − c_n| ≤ n^{-1/4}A_1. By assumptions (i.b) and (i.c) this is of order λ n^{-1/4} c_n = o(1). Using (A.6), the Poisson limit theorem shows that the upper and lower bounds have Poisson limits with mean λ.
Proof of Theorem 9.9. 1. Comparison with least squares. The estimator N^{-1}(β̂_n^(0) − β) is based on

N' Σ_{i=1}^n v̂_{i,n}^(-1) x_i x_i' N = N' Σ_{i=1}^n x_i x_i' N − N' Σ_{i=1}^n (1 − v̂_{i,n}^(-1)) x_i x_i' N,   (A.7)
N' Σ_{i=1}^n v̂_{i,n}^(-1) x_i ε_i = N' Σ_{i=1}^n x_i ε_i − N' Σ_{i=1}^n (1 − v̂_{i,n}^(-1)) x_i ε_i.   (A.8)

In each equation the first term is the full sample product moment, which converges due to Assumption 6.1, and the estimation error of the full sample least squares estimator is bounded in probability. It suffices to show that the second terms vanish in probability. The argument for n^{1/2}{(σ̂_n^(0))² − σ²} is similar.
2. Tightness of the initial estimators. Because N^{-1}(β̂_j − β) and n^{1/2}(σ̂_j² − σ²) are O_P(1), then for all ε > 0 there exists a constant A_0 > 1 such that the set

B_n = { Σ_{j=1}^{2} |N^{-1}(β̂_j − β)| + Σ_{j=1}^{2} n^{1/2}|σ̂_j − σ| + n^{1/2} max_{1≤i≤n} |N'x_i| ≤ A_0 }

has probability larger than 1 − ε. It suffices to prove the theorem on this set.
3. Bounding the second terms. The second terms of (A.7) and (A.8) are bounded by

S_p = Σ_{i=1}^n (1 − v̂_{i,n}^(-1)) |N'x_i|^{2-p} |ε_i|^p   for p = 0, 1.

On the set B_n we get the further bound, see (A.5) in the proof of Lemma A.2,

S_p 1_{B_n} ≤ Σ_{i=1}^n |N'x_i|^{2-p} |ε_i|^p 1_(|ε_i/σ| > c_n − n^{-1/4}A_1).

The expectation is bounded as

E(S_p 1_{B_n}) ≤ E Σ_{i=1}^n |N'x_i|^{2-p} |ε_i|^p 1_(|ε_i/σ| > c_n − n^{-1/4}A_1).

Now, by the Cauchy-Schwarz inequality,

E{ |ε_i|^p 1_(|ε_i/σ| > c_n − n^{-1/4}A_1) } ≤ E^{1/2}{ 1_(|ε_i/σ| > c_n − n^{-1/4}A_1) } E^{1/2}{ |ε_i|^{2p} 1_(|ε_i/σ| > c_n − n^{-1/4}A_1) }.

The first factor is of order n^{-1/2}, because n{1 − F(c_n − n^{-1/4}A_1)} → λ, and the second factor tends to zero because E ε_i² < ∞. We also have

E Σ_{i=1}^n |N'x_i|^{2-p} = n^{(p-2)/2} E Σ_{i=1}^n |n^{1/2}N'x_i|^{2-p} ≤ C n^{p/2} ≤ C n^{1/2}

by Assumption 6.1(ii, d). Collecting these evaluations we find S_p →^P 0.
Proof of Theorem 10.1. Theorem 8.3 implies that Z_n converges to a Gaussian process Z on D[ψ_0, 1] endowed with the uniform metric. The variance of Z(c_ψ) vanishes for ψ → 1, so a truncation argument is needed to deal with the ratio X_n(c_ψ) = Z_n(c_ψ)/sdv{Z(c_ψ)}. Approximate the sample gauge by

γ̂_v = {(n − m̂)/n} 1_(m̂ ≤ vn) = n^{-1} Σ_{j=m_1}^{int(nv)-1} 1_(m̂ ≤ j),

for some v < 1, using (10.1). Then the sample gauge is γ̂ = γ̂_1, and

0 ≤ γ̂ − γ̂_v = {(n − m̂)/n} 1_(m̂ > vn) < (n − nv)/n = 1 − v.   (A.9)

The process X_n(c_ψ) converges on D[ψ_1, v]. The Continuous Mapping Theorem 5.1 of Billingsley (1968) then shows that sup_{ψ_1 ≤ ψ ≤ u} X_n(c_ψ) converges as a process in u on D[ψ_1, v]. In turn, for a given q, the deterministic function P(m̂ ≤ nu) = P{sup_{ψ_1 ≤ ψ ≤ u} X_n(c_ψ) > q} for ψ_1 ≤ u ≤ v converges to a continuous increasing function p(u) on [ψ_1, v], which is bounded by unity. In particular it holds that

E γ̂_v = E n^{-1} Σ_{j=m_1}^{int(nv)-1} 1_(m̂ ≤ j) = n^{-1} Σ_{j=m_1}^{int(nv)-1} P(m̂ ≤ j) → γ_v = ∫_{ψ_1}^{v} p(u) du ≤ v − ψ_1 ≤ 1,

and

γ_v = ∫_{ψ_1}^{v} p(u) du ↗ γ = ∫_{ψ_1}^{1} p(u) du = ∫_{ψ_1}^{1} P[ sup_{ψ_1 ≤ ψ ≤ u} Z(c_ψ)/sdv{Z(c_ψ)} > q ] du

as v → 1. Combining this with (A.9) completes the proof.
References
Atkinson, A.C. and Riani, M. (2000) Robust Diagnostic Regression Analysis. New York:
Springer.
Atkinson, A.C. and Riani, M. (2006) Distribution theory and simulations for tests of outliers
in regression. Journal of Computational and Graphical Statistics 15, 460–476.
Atkinson, A.C., Riani, M. and Cerioli, A. (2010) The forward search: Theory and data
analysis (with discussion). Journal of the Korean Statistical Society 39, 117–134.
Bahadur, R.R. (1966) A note on quantiles in large samples. Annals of Mathematical Sta-
tistics 37, 577–580.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society B 57,
289–300.
Bercu, B. and Touati, A. (2008) Exponential inequalities for self-normalized martingales
with applications. Annals of Applied Probability 18, 1848–1869.
Bickel, P.J. (1975) One-step Huber estimates in the linear model. Journal of the American
Statistical Association 70, 428–434.
Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley.
Castle, J.L., Doornik, J.A. and Hendry, D.F. (2011) Evaluating automatic model selection.
Journal of Time Series Econometrics 3, Issue 1, Article 8.
Cavaliere, G. and Georgiev, I. (2013) Exploiting infinite variance through dummy variables
in nonstationary autoregressions. Econometric Theory 29, 1162–1195.
Chen, X.R. and Wu, Y.H. (1988) Strong consistency of M-estimates in linear models. Jour-
nal of Multivariate Analysis 27, 116–130.
Csörgő, M. (1983) Quantile Processes with Statistical Applications. CBMS-NSF Regional
Conference Series in Applied Mathematics 42, Society for Industrial and Applied Math-
ematics.
Davies, L. (1990) The asymptotics of S-estimators in the linear regression model. The
Annals of Statistics 18, 1651–1675.
Dollinger, M.B. and Staudte, R.G. (1991) Influence functions of iteratively reweighted least
squares estimators. Journal of the American Statistical Association 86, 709–716.
Doornik, J.A. (2007) Object-Oriented Matrix Programming Using Ox, 3rd ed. London:
Timberlake Consultants Press and Oxford: www.doornik.com.
Doornik, J.A. (2009) Autometrics. In Castle, J.L. and Shephard, N. (eds.) The Methodology
and Practice of Econometrics: A Festschrift in Honour of David F. Hendry, pp. 88–
121. Oxford: Oxford University Press.
Doornik, J.A. and Hendry, D.F. (2013) Empirical Econometric Modelling - PcGive 14,
volume 1. London: Timberlake Consultants.
Engle, R.F. (1982) Autoregressive conditional heteroscedasticity with estimates of the vari-
ance of United Kingdom inflation. Econometrica 50, 987–1007.
Engler, E. and Nielsen, B. (2009) The empirical process of autoregressive residuals. Econo-
metrics Journal 12, 367–381.
Godfrey, L.G. (1978) Testing Against General Autoregressive and Moving Average Error
Models when the Regressors Include Lagged Dependent Variables. Econometrica 46,
1293–1302.
Graddy, K. (1995) Testing for imperfect competition at the Fulton Fish Market. RAND
Journal of Economics 26, 75–92.
Graddy, K. (2006) The Fulton Fish Market. Journal of Economic Perspectives 20, 207–220.
Hadi, A.S. (1992) Identifying multiple outliers in multivariate data. Journal of the Royal
Statistical Society B 54, 761–771.
Hadi, A.S. and Simonoff, J.S. (1993) Procedures for the Identification of Multiple Outliers
in Linear Models. Journal of the American Statistical Association 88, 1264–1272.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986) Robust Statistics:
The Approach Based on Influence Functions. New York: John Wiley & Sons.
He, X. and Portnoy, S. (1992) Reweighted LS estimators converge at the same rate as the
initial estimator. Annals of Statistics 20, 2161–2167.
Hendry, D.F. and Doornik, J.A. (2014) Empirical Model Discovery and Theory Evaluation.
Cambridge MA: MIT Press.
Hendry, D.F. and Krolzig, H.-M. (2005) The properties of automatic GETS modelling.
Economic Journal 115, C32–61.
Hendry, D.F. and Mizon, G.E. (2011) Econometric modelling of time series with outlying
observations. Journal of Time Series Econometrics 3, Issue 1, Article 6.
Hendry, D.F. and Nielsen, B. (2007) Econometric Modelling. Princeton NJ: Princeton
University Press.
Hendry, D.F. and Santos, C. (2010) An automatic test of super exogeneity. In Bollerslev,
T., Russell, J.R. and Watson, M.W. (eds.) Volatility and Time Series Econometrics:
Essays in Honor of Robert F. Engle, pp. 164–193. Oxford: Oxford University Press.
Hoover, K.D. and Perez, S.J. (1999) Data mining reconsidered: encompassing and the
general-to-specific approach to specification search (with discussion). Econometrics
Journal 2, 167–191.
Huber, P.J. and Ronchetti, E.M. (2009) Robust Statistics. New York: Wiley.
Jaeckel, L.A. (1971) Robust estimates of location: Symmetry and asymmetric contamina-
tion. Annals of Mathematical Statistics 42, 1020–1034.
Johansen, S. and Nielsen, B. (2010) Discussion: The forward search: Theory and data
analysis. Journal of the Korean Statistical Society 39, 137–145.
Johansen, S. and Nielsen, B. (2013) Asymptotic theory for iterated one-step Huber-skip
estimators. Econometrics 1, 53–70.
Johansen, S. and Nielsen, B. (2014a) Analysis of the Forward Search using some new
results for martingales and empirical processes. Updated version of 2013 Discussion
Paper with title Asymptotic theory of the Forward Search.
Johansen, S. and Nielsen, B. (2014b) Asymptotic theory of M-estimators for multiple re-
gression. Work in Progress.
Jurečková, J. and Sen, P.K. (1996) Robust Statistical Procedures: Asymptotics and Interre-
lations. New York: John Wiley & Sons.
Jurečková, J., Sen, P.K. and Picek, J. (2012) Methodological Tools in Robust and Nonpara-
metric Statistics. London: Chapman & Hall/CRC Press.
Kilian, L. and Demiroglu, U. (2000) Residual based tests for normality in autoregressions:
asymptotic theory and simulations. Journal of Economic Business and Control 18,
40–50.
Koul, H.L. (2002) Weighted Empirical Processes in Dynamic Nonlinear Models. 2nd edition.
New York: Springer.
Koul, H.L. and Ossiander, M. (1994) Weak convergence of randomly weighted dependent
residual empiricals with applications to autoregression. Annals of Statistics, 22, 540–
582.
Liese, F. and Vajda, I. (1994) Consistency of M-estimates in general regression models.
Journal of Multivariate Analysis 50, 93–110.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006) Robust Statistics: Theory and Meth-
ods. Chichester: John Wiley & Sons.
Nielsen, B. (2006) Order determination in general vector autoregressions. In Ho, H.-C., Ing,
C.-K., and Lai, T.L. (eds): Time Series and Related Topics: In Memory of Ching-Zong
Wei. IMS Lecture Notes and Monograph Series 52, 93–112.
R Development Core Team (2014). R: A language and environment for statistical comput-
ing. R Foundation for Statistical Computing, Vienna, Austria.
Ramsey, J.B. (1969) Tests for Specification Errors in Classical Linear Least Squares Re-
gression Analysis. Journal of the Royal Statistical Society B 31, 350–371.
Riani, M., Atkinson, A.C. and Cerioli, A. (2009) Finding an unknown number of multivari-
ate outliers. Journal of the Royal Statistical Society B, 71, 447–466.
Rousseeuw, P.J. (1984) Least median of squares regression. Journal of the American Sta-
tistical Association, 79, 871–880.
Rousseeuw, P.J. and van Driessen, K. (1998) A fast algorithm for the minimum covariance
determinant estimator. Technometrics 41, 212–223.
Rousseeuw, P.J. and Leroy, A.M. (1987) Robust Regression and Outlier Detection. New
York: Wiley.
Ruppert, D. and Carroll, R.J. (1980) Trimmed least squares estimation in the linear model.
Journal of the American Statistical Association, 75, 828–838.
Sampford, M.R. (1953) Some inequalities on Mill’s ratio and related functions. Annals of
Mathematical Statistics, 24, 130–132.
Víšek, J.Á. (2006a) The least trimmed squares. Part I: Consistency. Kybernetika, 42, 1–36.
Víšek, J.Á. (2006b) The least trimmed squares. Part II: $\sqrt{n}$-consistency. Kybernetika, 42,
181–202.
Víšek, J.Á. (2006c) The least trimmed squares. Part III: Asymptotic normality. Kyber-
netika, 42, 203–224.
Welsh, A.H. and Ronchetti, E. (2002) A journey in single steps: robust one-step M-
estimation in linear regression. Journal of Statistical Planning and Inference, 103, 287–
310.