
Technometrics
ISSN: 0040-1706 (Print) 1537-2723 (Online)
Journal homepage: https://www.tandfonline.com/loi/utch20

A Fast Algorithm for the Minimum Covariance Determinant Estimator

Peter J. Rousseeuw & Katrien Van Driessen

To cite this article: Peter J. Rousseeuw & Katrien Van Driessen (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, 41:3, 212-223.

To link to this article: https://doi.org/10.1080/00401706.1999.10485670

A Fast Algorithm for the Minimum Covariance Determinant Estimator

Peter J. ROUSSEEUW
Department of Mathematics and Computer Science, Universitaire Instelling Antwerpen, Universiteitsplein 1, B-2610 Wilrijk, Belgium ([email protected])

Katrien VAN DRIESSEN
Faculty of Applied Economics, Universitaire Faculteiten Sint Ignatius, Prinsstraat 13, B-2000 Antwerp, Belgium (katrien.[email protected].be)

The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call "selective iteration" and "nested extensions." For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude. Moreover, FAST-MCD is able to detect an exact fit, that is, a hyperplane containing h or more observations. The new algorithm makes the MCD method available as a routine tool for analyzing multivariate data. We also propose the distance-distance plot (D-D plot), which displays MCD-based robust distances versus Mahalanobis distances, and illustrate it with some examples.

KEY WORDS: Breakdown value; Multivariate location and scatter; Outlier detection; Regression; Robust estimation.

It is difficult to detect outliers in p-variate data when p > 2 because one can no longer rely on visual inspection. Although it is still quite easy to detect a single outlier by means of the Mahalanobis distances, this approach no longer suffices for multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distances. It is better to use distances based on robust estimators of multivariate location and scatter (Rousseeuw and Leroy 1987, pp. 265-269). In regression analysis, robust distances computed from the explanatory variables allow us to detect leverage points. Moreover, robust estimation of multivariate location and scatter is the key tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.

Many methods for estimating multivariate location and scatter break down in the presence of n/(p + 1) outliers, where n is the number of observations and p is the number of variables, as was pointed out by Donoho (1982). For the breakdown value of the multivariate M-estimators of Maronna (1976), see Hampel, Ronchetti, Rousseeuw, and Stahel (1986, p. 296). In the meantime, several positive-breakdown estimators of multivariate location and scatter have been proposed. One of these is the minimum volume ellipsoid (MVE) method of Rousseeuw (1984, p. 877; 1985). This approach looks for the ellipsoid with smallest volume that covers h data points, where n/2 < h < n. Its breakdown value is essentially (n - h)/n.

Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991). For a survey of positive-breakdown methods and some substantive applications, see Rousseeuw (1997).

The basic resampling algorithm for approximating the MVE, called MINVOL, was proposed by Rousseeuw and Leroy (1987). This algorithm considers a trial subset of p + 1 observations and calculates its mean and covariance matrix. The corresponding ellipsoid is then inflated or deflated to contain exactly h observations. This procedure is repeated many times, and the ellipsoid with the lowest volume is retained. For small datasets it is possible to consider all subsets of size p + 1, whereas for larger datasets the trial subsets are drawn at random.

Several other algorithms have been proposed to approximate the MVE. Woodruff and Rocke (1993) constructed algorithms combining the resampling principle with three heuristic search techniques: simulated annealing, genetic algorithms, and tabu search. Other people developed algorithms to compute the MVE exactly. This work started with the algorithm of Cook, Hawkins, and Weisberg (1992), which carries out an ingenious but still exhaustive search of all possible subsets of size h.


In practice, this can be done for n up to about 30. Recently, Agulló (1996) developed an exact algorithm for the MVE that is based on a branch-and-bound procedure that selects the optimal subset without requiring the inspection of all subsets of size h. This is substantially faster and can be applied up to (roughly) n <= 100 and p <= 5. Because for most datasets the exact algorithms would take too long, the MVE is typically computed by versions of MINVOL, for example in S-PLUS (see the function cov.mve).

Presently there are several reasons for replacing the MVE by the minimum covariance determinant (MCD) estimator, which was also proposed by Rousseeuw (1984, p. 877; 1985). The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. The resulting breakdown value equals that of the MVE, but the MCD has several advantages over the MVE. Its statistical efficiency is better because the MCD is asymptotically normal (Butler, Davies, and Jhun 1993), whereas the MVE has a lower convergence rate (Davies 1992). As an example, the asymptotic efficiency of the MCD scatter matrix with the typical coverage h = .75n is 44% in 10 dimensions, and the reweighted covariance matrix with weights obtained from the MCD attains 83% efficiency (Croux and Haesbroeck in press), whereas the MVE attains 0%. The MCD's better accuracy makes it very useful as an initial estimate for one-step regression estimators (Simpson, Ruppert, and Carroll 1992; Coakley and Hettmansperger 1993). Robust distances based on the MCD are more precise than those based on the MVE and hence better suited to expose multivariate outliers, for example in the diagnostic plot of Rousseeuw and van Zomeren (1990), which displays robust residuals versus robust distances. Moreover, the MCD is a key component of the hybrid estimators of Woodruff and Rocke (1994) and Rocke and Woodruff (1996) and of high-breakdown linear discriminant analysis (Hawkins and McLachlan 1997).

In spite of all these advantages, until now the MCD has rarely been applied because it was harder to compute. In this article, however, we construct a new MCD algorithm that is actually much faster than any existing MVE algorithm. The new MCD algorithm can deal with a sample size n in the tens of thousands. As far as we know, none of the existing MVE algorithms can cope with such large sample sizes. Because the MCD now greatly outperforms the MVE in terms of both statistical efficiency and computation speed, we recommend the MCD method.

1. MOTIVATING PROBLEMS

Two recent problems will be shown to illustrate the need for a fast, robust method that can deal with many objects (n) and/or many variables (p) while maintaining a reasonable statistical efficiency.

Problem 1 (Engineering). We are grateful to Gertjan Otten for providing the following problem. Philips Mecoma (The Netherlands) is producing diaphragm parts for TV sets. These are thin metal plates, molded by a press. Recently a new production line was started, and for each of n = 677 parts, nine characteristics were measured. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements, and to find out whether deformations or abnormalities have occurred and why. Afterward, the estimated location and scatter matrix can be used for multivariate statistical process control.

Due to the support of Herman Veraa and Frans Van Dommelen (at Philips PMF/Mecoma, Product Engineering, P.O. Box 218, 5600 MD Eindhoven, The Netherlands), we obtained permission to analyze these data and to publish the results.

Figure 1 shows the classical Mahalanobis distance

    MD(x_i) = sqrt( (x_i - T_0)' S_0^{-1} (x_i - T_0) )                              (1.1)

versus the index i, which corresponds to the production sequence. Here x_i is nine-dimensional, T_0 is the arithmetic mean, and S_0 is the classical covariance matrix. The horizontal line is at the usual cutoff value sqrt(chi^2_{9,.975}) = 4.36.

In Figure 1 it seems that most observations are consistent with the classical multivariate normal model, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from masking. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect T_0 and S_0 in such a way as to become invisible in Figure 1. To further investigate these data, we need robust estimators T and S, preferably with a substantial statistical efficiency so that we can be confident that any effects that may become visible are real and not due to the estimator's inefficiency. After developing the FAST-MCD algorithm, we will return to these data in Section 7.

Figure 1. Plot of Mahalanobis Distances for the Philips Data.
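For readers who want to reproduce this kind of diagnostic, a minimal Python sketch of the classical Mahalanobis distances and the chi-squared cutoff in (1.1) is given below. This is illustrative code of ours, not part of the original analysis, and the random data stand in for the Philips measurements.

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_distances(X):
        """Classical distances MD(x_i) based on the sample mean and covariance."""
        T0 = X.mean(axis=0)                        # arithmetic mean
        S0 = np.cov(X, rowvar=False)               # classical covariance matrix
        diff = X - T0
        md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff)
        return np.sqrt(md2)

    # usage: flag points beyond the usual cutoff sqrt(chi^2_{p,.975})
    X = np.random.randn(677, 9)                    # placeholder for an n = 677 by p = 9 dataset
    md = mahalanobis_distances(X)
    cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))   # equals 4.36 for p = 9
    print(np.where(md > cutoff)[0])                # indices of flagged observations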

Problem 2 (Physical Sciences). A group of astronomers at the California Institute of Technology are working on the Digitized Palomar Sky Survey (DPOSS); for a full description, see their report (Odewahn, Djorgovski, Brunner, and Gal 1998). In essence, they make a survey of celestial objects (light sources) for which they record nine characteristics (such as magnitude, area, image moments) in each of three bands: blue, red, and near-infrared. They seek collaboration with statisticians to analyze their data, and gave us access to a part of their database, containing 137,256 celestial objects with all 27 variables.

We started by using quantile-quantile plots, Box-Cox transforms, selecting one variable out of three variables with near-perfect linear correlation, and other tools of data analysis. One of these avenues led us to study six variables (two from each band). Figure 2 plots the Mahalanobis distances (1.1) for these data (to avoid overplotting, Fig. 2 shows only 10,000 randomly drawn points from the entire plot). The cutoff is sqrt(chi^2_{6,.975}) = 3.82. In Figure 2 we see two groups of outliers with MD(x_i) around 9 and MD(x_i) around 12, plus some outliers still further away.

Returning to the data and their astronomical meaning, it turned out that these were all objects for which one or more variables fell outside the range of what is physically possible. So, the MD(x_i) did help us to find outliers at this stage. We then cleaned the data by removing all objects with a physically impossible measurement, which reduced our sample size to 132,402. To these data we then again applied the classical mean T_0 and covariance S_0, yielding the plot of Mahalanobis distances in Figure 3.

Figure 3 looks innocent, like observations from a chi-squared distribution, as if the data would form a homogeneous population (which is doubtful because we know that the database contains stars as well as galaxies). To proceed further, we need high-breakdown estimates T and S and an algorithm that can compute them for n = 132,402. Such an algorithm will be constructed in the next sections.

Figure 2. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects, Based on Six Variables Concerning Magnitude and Image Moments.

Figure 3. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects as in Figure 2 After Removal of Physically Impossible Measurements.

2. BASIC THEOREM AND THE C-STEP

A key step of the new algorithm is the fact that, starting from any approximation to the MCD, it is possible to compute another approximation with an even lower determinant.

Theorem 1. Consider a dataset X_n = {x_1, ..., x_n} of p-variate observations. Let H_1 be a subset of {1, ..., n} with |H_1| = h, and put T_1 := (1/h) Sum_{i in H_1} x_i and S_1 := (1/h) Sum_{i in H_1} (x_i - T_1)(x_i - T_1)'. If det(S_1) is not 0, define the relative distances

    d_1(i) := sqrt( (x_i - T_1)' S_1^{-1} (x_i - T_1) )   for i = 1, ..., n.

Now take H_2 such that {d_1(i); i in H_2} := {(d_1)_{1:n}, ..., (d_1)_{h:n}}, where (d_1)_{1:n} <= (d_1)_{2:n} <= ... <= (d_1)_{n:n} are the ordered distances, and compute T_2 and S_2 based on H_2. Then

    det(S_2) <= det(S_1)

with equality if and only if T_2 = T_1 and S_2 = S_1.

The proof is given in the Appendix. Although this theorem appears to be quite basic, we have been unable to find it in the literature.

The theorem requires that det(S_1) is not 0, which is no real restriction because if det(S_1) = 0 we already have the minimal objective value. Section 5 will explain how to interpret the MCD in such a singular situation.

If det(S_1) > 0, applying the theorem yields S_2 with det(S_2) <= det(S_1). In our algorithm we will refer to the construction in Theorem 1 as a C-step, where C stands for "concentration" because we concentrate on the h observations with smallest distances, and S_2 is more concentrated (has a lower determinant) than S_1. In algorithmic terms, the C-step can be described as follows. Given the h-subset H_old or the pair (T_old, S_old), perform the following:

1. Compute the distances d_old(i) for i = 1, ..., n.
2. Sort these distances, which yields a permutation pi for which d_old(pi(1)) <= d_old(pi(2)) <= ... <= d_old(pi(n)).
3. Put H_new := {pi(1), pi(2), ..., pi(h)}.
4. Compute T_new := ave(H_new) and S_new := cov(H_new).

For a fixed number of dimensions p, the C-step takes only O(n) time [because H_new can be determined in O(n) operations without sorting all the d_old(i) distances].


Repeating C-steps yields an iteration process. If det(S_2) = 0 or det(S_2) = det(S_1), we stop; otherwise, we run another C-step yielding det(S_3), and so on. The sequence det(S_1) >= det(S_2) >= det(S_3) >= ... is nonnegative and hence must converge. In fact, because there are only finitely many h-subsets, there must be an index m such that det(S_m) = 0 or det(S_m) = det(S_{m-1}), hence convergence is reached. (In practice, m is often below 10.) Afterward, running the C-step on (T_m, S_m) no longer reduces the determinant. This is not sufficient for det(S_m) to be the global minimum of the MCD objective function, but it is a necessary condition.

Theorem 1 thus provides a partial idea for an algorithm:

    Take many initial choices of H_1 and apply C-steps to each until convergence, and keep the solution with lowest determinant.   (2.1)

Of course, several questions must be answered to make (2.1) operational: How do we generate sets H_1 to begin with? How many H_1 are needed? How do we avoid duplication of work because several H_1 may yield the same solution? Can we do with fewer C-steps? What about large sample sizes? These matters will be discussed in the next sections.

Corollary 1. The MCD subset H of X_n is separated from X_n \ H by an ellipsoid.

Proof. For the MCD subset H, and in fact any limit of a C-step sequence, applying the C-step to H yields H itself. This means that all x_i in H satisfy (x_i - T)'S^{-1}(x_i - T) <= d^2_{h:n}, where d^2_{h:n} = {(x - T)'S^{-1}(x - T)}_{h:n} is the hth ordered squared distance, whereas all x_j not in H satisfy (x_j - T)'S^{-1}(x_j - T) >= d^2_{h:n}. Take the ellipsoid E = {x; (x - T)'S^{-1}(x - T) <= d^2_{h:n}}. Then H is contained in E and X_n \ H is contained in closure(E^c). Note that there is at least one point x_i in H on the boundary of E, whereas there may or may not be a point x_j outside H on the boundary of E.

The same result was proved by Butler et al. (1993) under the extra condition that a density exists. Note that the ellipsoid in Corollary 1 contains h observations but is not necessarily the smallest ellipsoid to do so, which would yield the MVE. We know of no technique like the C-step for the MVE estimator; hence, the latter estimator cannot be computed faster in this way.

Independently of our work, Hawkins and Olive (1999) discovered a version of Corollary 1 in the following form: "A necessary condition for the MCD optimum is that, if we calculate the distance of each case from the location vector using the scatter matrix, each covered case must have smaller distance than any uncovered case." This necessary condition could perhaps be called the "C-condition," as opposed to the C-step of Theorem 1, where we proved that a C-step always decreases det(S). In the absence of Theorem 1, Hawkins and Olive (1999) used the C-condition as a preliminary screen, followed by case swapping as a technique for decreasing det(S), as in the feasible solution approach (Hawkins 1994), which will be described in Section 6. The C-condition did not reduce the time complexity of this approach, but it did reduce the actual computation time in experiments with fixed n.

3. CONSTRUCTION OF THE NEW ALGORITHM

3.1 Creating Initial Subsets H_1

To apply the algorithmic concept (2.1), we first have to decide how to construct the initial subsets H_1. Let us consider the following two possibilities:

1. Draw a random h-subset H_1.
2. Draw a random (p + 1)-subset J, and then compute T_0 := ave(J) and S_0 := cov(J). [If det(S_0) = 0, then extend J by adding another random observation, and continue adding observations until det(S_0) > 0.] Then compute the distances d_0^2(i) := (x_i - T_0)'S_0^{-1}(x_i - T_0) for i = 1, ..., n. Sort them into d_0(pi(1)) <= ... <= d_0(pi(n)) and put H_1 := {pi(1), ..., pi(h)}.

Option 1 is the simplest, whereas 2 starts like the MINVOL algorithm (Rousseeuw and Leroy 1987, pp. 259-260). It would be useless to draw fewer than p + 1 points, for then S_0 is always singular.

When the dataset does not contain outliers or deviating groups of points, it makes little difference whether (2.1) is applied with 1 or 2. But because the MCD is a very robust estimator, we have to consider contaminated datasets in particular. For instance, we generated a dataset with n = 400 observations and p = 2 variables, in which 205 observations were drawn from one bivariate normal distribution and the other 195 observations were drawn from a second bivariate normal distribution centered at a shifted location.

The MCD has its highest possible breakdown value when h = [(n + p + 1)/2] (see Lopuhaä and Rousseeuw 1991), which becomes h = 201 here. We now apply (2.1) with 500 starting sets H_1. Using option 1 yields a resulting (T, S) whose 97.5% tolerance ellipse is shown in Figure 4(a). Clearly, this result has broken down due to the contaminated data. On the other hand, option 2 yields the result in Figure 4(b), which concentrates on the majority (51.25%) of the data.

The situation in Figure 4 is extreme, but it is useful for illustrative purposes. (The same effect also occurs for smaller amounts of contamination, especially in higher dimensions.) Approach 1 has failed because each random subset H_1 contains a sizable number of points from the majority group as well as from the minority group, which follows from the law of large numbers. When starting from a bad subset H_1, the iterations will not converge to the major solution. On the other hand, the probability of a (p + 1)-subset without outliers is much higher, which explains why 2 yields many subsets H_1 consisting of points from the majority and hence a robust result. From now on, we will always use 2.
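A sketch of option 2 in Python, reusing the illustrative c_step helper from Section 2, could read as follows; it is our own rendering under the notation of this subsection, not the program's source code.

    import numpy as np

    def initial_h_subset(X, h, rng):
        """Option 2: start from a random (p+1)-subset, enlarge it until its covariance
        is nonsingular, then keep the h observations with smallest relative distances."""
        n, p = X.shape
        J = list(rng.choice(n, size=p + 1, replace=False))
        # in practice a small numerical tolerance could replace the exact zero test
        while np.linalg.det(np.cov(X[J], rowvar=False, bias=True)) == 0 and len(J) < n:
            J.append(rng.choice(np.setdiff1d(np.arange(n), J)))   # add another random observation
        T0 = X[J].mean(axis=0)
        S0 = np.cov(X[J], rowvar=False, bias=True)
        diff = X - T0
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S0), diff)
        return np.argpartition(d2, h - 1)[:h]                      # H_1 = h smallest distances

Feeding such an H_1 into repeated C-steps implements the concept (2.1).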

Figure 4. Results of Iterating C-Steps Starting From 500 Random Subsets H_1 of (a) Size h = 201 and (b) Size p + 1 = 3.

Remark. For increasing n, the probability of having at least one "clean" (p + 1)-subset among m random (p + 1)-subsets tends to

    1 - (1 - (1 - eps)^{p+1})^m > 0,                                                  (3.1)

where eps is the percentage of outliers. In contrast, the probability of having at least one clean h-subset among m random h-subsets tends to 0 because h increases with n.

3.2 Selective Iteration

Each C-step calculates a covariance matrix, its determinant, and all relative distances. Therefore, reducing the number of C-steps would improve the speed. But is this possible without losing the effectiveness of the algorithm? It turns out that often the distinction between robust solutions and nonrobust solutions already becomes visible after two or three C-steps. For instance, consider the data of Figure 4 again. The inner workings of the algorithm (2.1) are traced in Figure 5. For each starting subsample H_1, the determinant of the covariance matrix S_j based on h = 201 observations is plotted versus the step number j. The runs yielding a robust solution are shown as solid lines, whereas the dashed lines correspond to nonrobust results. To get a clear picture, Figure 5 only shows the first 100 starts. After two C-steps (i.e., for j = 3), many subsamples H_3 that will lead to the global optimum already have a rather small determinant. The global optimum is a solution that contains none of the 195 "bad" points. By contrast, the determinants of the subsets H_3 leading to a false classification are considerably larger. For that reason, we can save much computation time and still obtain the same result by taking just two C-steps and retaining only the (say, 10) best H_3 subsets to iterate further. Other datasets, also in more dimensions, confirm these conclusions. Therefore, from now on we will take only two C-steps from each initial subsample H_1, select the 10 different subsets H_3 with the lowest determinants, and only for these 10 we continue taking C-steps until convergence.

Figure 5. Covariance Determinant of Subsequent C-Steps in the Dataset of Figure 4. Each sequence stops when no further reduction is obtained.

3.3 Nested Extensions

For a small sample size n, the preceding algorithm does not take much time. But when n grows, the computation time increases, mainly due to the n distances that need to be calculated each time. To avoid doing all the computations in the entire dataset, we will consider a special structure. When n > 1,500, the algorithm generates a nested system of subsets that looks like Figure 6, where the arrows mean "is a subset of." The five subsets of size 300 do not overlap, and together they form the merged set of size 1,500, which in turn is a proper subset of the dataset of size n.

Figure 6. Nested System of Subsets Generated by the FAST-MCD Algorithm.


[Already the algorithm of Woodruff and Rocke (1994) made use of partitioning for this purpose. The only difference with the nested extensions in Figure 6 is that we work with two stages, hence our use of the word "nested," whereas Woodruff and Rocke partitioned the entire dataset, which yields more and/or larger subsets.] To construct Figure 6, the algorithm draws 1,500 observations, one by one, without replacement. The first 300 observations it encounters are put in the first subset, and so on. Because of this mechanism, each subset of size 300 is roughly representative for the dataset, and the merged set with 1,500 cases is even more representative.

When n <= 600, we will keep the algorithm as in the previous section, while for n >= 1,500 we will use Figure 6. When 600 < n < 1,500, we will partition the data into at most four subsets of 300 or more observations so that each observation belongs to a subset and such that the subsets have roughly the same size. For instance, 601 will be split as 300 + 301 and 900 as 450 + 450. For n = 901, we use 300 + 300 + 301, and we continue until 1,499 = 375 + 375 + 375 + 374. By splitting 601 as 300 + 301 we do not mean that the first subset contains the observations with case numbers 1, ..., 300 but that its 300 case numbers were drawn randomly from 1, ..., 601.

Whenever n > 600 (and whether n < 1,500 or not), our new algorithm for the MCD will take two C-steps from several starting subsamples H_1 within each subset, with a total of 500 starts for all subsets together. For every subset the best 10 solutions are stored. Then the subsets are pooled, yielding a merged set with at most 1,500 observations. Each of these (at most 50) available solutions (T_sub, S_sub) is then extended to the merged set. That is, starting from each (T_sub, S_sub), we continue taking C-steps, which now use all 1,500 observations in the merged set. Only the best 10 solutions (T_merged, S_merged) will be considered further. Finally, each of these 10 solutions is extended to the full dataset in the same way, and the best solution (T_full, S_full) is reported.

Because the final computations are carried out in the entire dataset, they take more time when n increases. In the interest of speed we can limit the number of initial solutions (T_merged, S_merged) and/or the number of C-steps in the full dataset as n becomes large.

The main idea of this subsection was to carry out C-steps in several nested random subsets, starting with small subsets of around 300 observations and ending with the entire dataset of n observations. Throughout this subsection, we have chosen several numbers such as five subsets of 300 observations, 500 starts, 10 best solutions, and so on. These choices were based on various empirical trials (not reported here). We implemented our choices as defaults so the user does not have to choose anything, but of course the user may change the defaults.

4. THE RESULTING ALGORITHM FAST-MCD

Combining all the components of the preceding sections yields the new algorithm, which we will call FAST-MCD. Its pseudocode looks as follows:

1. The default h is [(n + p + 1)/2], but the user may choose any integer h with [(n + p + 1)/2] <= h <= n. The program then reports the MCD's breakdown value (n - h + 1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and statistical efficiency is obtained by putting h = [.75n].

2. If h = n, then the MCD location estimate T is the average of the whole dataset, and the MCD scatter estimate S is its covariance matrix. Report these and stop.

3. If p = 1 (univariate data), compute the MCD estimate (T, S) by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) in O(n log n) time; then stop.

4. From here on, h < n and p >= 2. If n is small (say, n <= 600), then
   - repeat (say) 500 times:
     * construct an initial h-subset H_1 using method 2 in Subsection 3.1, that is, starting from a random (p + 1)-subset;
     * carry out two C-steps (described in Sec. 2);
   - for the 10 results with lowest det(S_3):
     * carry out C-steps until convergence;
   - report the solution (T, S) with lowest det(S).

5. If n is larger (say, n > 600), then
   - construct up to five disjoint random subsets of size n_sub according to Section 3.3 (say, five subsets of size n_sub = 300);
   - inside each subset, repeat 500/5 = 100 times:
     * construct an initial subset H_1 of size h_sub = [n_sub(h/n)];
     * carry out two C-steps, using n_sub and h_sub;
     * keep the 10 best results (T_sub, S_sub);
   - pool the subsets, yielding the merged set (say, of size n_merged = 1,500);
   - in the merged set, repeat for each of the 50 solutions (T_sub, S_sub):
     * carry out two C-steps, using n_merged and h_merged = [n_merged(h/n)];
     * keep the 10 best results (T_merged, S_merged);
   - in the full dataset, repeat for the m_full best results:
     * take several C-steps, using n and h;
     * keep the best final result (T_full, S_full).

Here, m_full and the number of C-steps (preferably, until convergence) depend on how large the dataset is. We will refer to the preceding as the FAST-MCD algorithm. Note that it is affine equivariant: when the data are translated or subjected to a linear transformation, the resulting (T_full, S_full) will transform accordingly.
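To make step 4 concrete, a compact Python sketch of the small-n branch is given below. It reuses the illustrative c_step and initial_h_subset helpers from Sections 2 and 3.1 and is meant only as a reading aid, not as the authors' reference implementation.

    import numpy as np

    def fast_mcd_small(X, h, n_starts=500, n_keep=10, seed=None):
        """Step 4 of the pseudocode: many starts, two C-steps each, then full
        iteration until convergence for the most concentrated candidates."""
        rng = np.random.default_rng(seed)
        candidates = []
        for _ in range(n_starts):
            H = initial_h_subset(X, h, rng)           # method 2 of Subsection 3.1
            for _ in range(2):                        # selective iteration: two C-steps only
                H, det = c_step(X, H, h)
            candidates.append((det, H))
        candidates.sort(key=lambda c: c[0])
        best_det, best_H = np.inf, None
        for det, H in candidates[:n_keep]:            # iterate the 10 best until convergence
            while True:
                H_new, det_new = c_step(X, H, h)
                if det_new >= det:                    # no further reduction
                    break
                H, det = H_new, det_new
                if det == 0:                          # exact fit: determinant cannot decrease further
                    break
            if det < best_det:
                best_det, best_H = det, H
        T = X[best_H].mean(axis=0)
        S = np.cov(X[best_H], rowvar=False, bias=True)
        return T, S, best_H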

The computer program contains two more steps:

6. To obtain consistency when the data come from a multivariate normal distribution, we put

    T_MCD = T_full   and   S_MCD = ( med_i d^2_{(T_full, S_full)}(i) / chi^2_{p,.50} ) S_full.

7. A one-step reweighted estimate is obtained by

    T_1 = ( Sum_i w_i x_i ) / ( Sum_i w_i )
    S_1 = ( Sum_i w_i (x_i - T_1)(x_i - T_1)' ) / ( Sum_i w_i - 1 ),

where w_i = 1 if d_{(T_MCD, S_MCD)}(i) <= sqrt(chi^2_{p,.975}) and w_i = 0 otherwise.

The program FAST-MCD has been thoroughly tested and can be obtained from our Web site http://win-www.uia.ac.be/u/statis/index.html. It has been incorporated into S-PLUS 4.5 (as the function "cov.mcd") and it is also in SAS/IML 7 (as the function "MCD").

5. EXACT FIT SITUATIONS

An important advantage of the FAST-MCD algorithm is that it allows for exact fit situations, that is, when h or more observations lie on a hyperplane. Then the algorithm still yields the MCD location T and scatter matrix S, the latter being singular as it should be. From (T, S) the program then computes the equation of the hyperplane.

When n is larger than (say) 600, the algorithm performs many calculations on subsets of the data. To deal with the combination of large n and exact fits, we added a few steps to the algorithm. Suppose that, during the calculations in a subset, we encounter some (T_sub, S_sub) with det(S_sub) = 0. Then we know that there are h_sub or more observations on the corresponding hyperplane. First we check whether h or more points of the full dataset lie on this hyperplane. If so, we compute (T_full, S_full) as the mean and covariance matrix of all points on the hyperplane, report this final result, and stop; if not, we continue. Because det(S_sub) = 0 is the best solution for that subset, we know that (T_sub, S_sub) will be among the 10 best solutions that are passed on. In the merged set we take the set H_1 of the h_merged observations with smallest orthogonal distances to the hyperplane, and start the next C-step from H_1. Again, it is possible that during the calculations in the merged set we encounter some (T_merged, S_merged) with det(S_merged) = 0, in which case we repeat the preceding procedure.

As an illustration, the dataset in Figure 7 consists of 45 observations generated from a bivariate normal distribution, plus 55 observations that were generated on a straight line (using a univariate normal distribution). The FAST-MCD program (with default value h = 51) finds this line within .3 seconds. A part of the output follows:

    There are 55 observations in the entire dataset of 100 observations
    that lie on the line with the equation
        .000000 (x_i1 - m_1) + 1.000000 (x_i2 - m_2) = 0,
    where the mean (m_1, m_2) of these observations is the MCD location
        .10817
        5.00000
    and their covariance matrix is the MCD scatter matrix
        1.40297   .00000
         .00000   .00000
    Therefore, the data are in an "exact fit" position. In such a situation
    the MCD scatter matrix has determinant 0, and its tolerance ellipse
    becomes the line of exact fit.

If the original data were in p dimensions and it turns out that most of the data lie on a hyperplane, it is possible to apply FAST-MCD again to the data in this (p - 1)-dimensional space.

Figure 7. Exact Fit Situation (n = 100, p = 2).

6. PERFORMANCE OF FAST-MCD

To get an idea of the performance of the overall algorithm, we start by applying FAST-MCD to some small datasets taken from Rousseeuw and Leroy (1987). To be precise, these were all regression datasets, but we ran FAST-MCD only on the explanatory variables, that is, not using the response variable. The first column of Table 1 lists the name of each dataset, followed by n and p. We used the default value of h = [(n + p + 1)/2]. The next column shows the number of starting (p + 1)-subsets used in FAST-MCD, which is usually 500 except for two datasets in which the number of possible (p + 1)-subsets out of n was fairly small, namely (12 choose 3) = 220 and (18 choose 3) = 816, so we used all of them.

The next entry in Table 1 is the result of FAST-MCD, given here as the final h-subset. Comparing these with the exact MCD algorithm of Agulló (personal communication, 1997), it turns out that these h-subsets do yield the exact global minimum of the objective function. The next column shows the running time of FAST-MCD in seconds on a Sun Ultra 2170. These times are much shorter than those of our MINVOL program for computing the MVE estimator.
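Before turning to the performance results, here is an illustrative Python sketch of steps 6 and 7 of the program described above (consistency factor and one-step reweighting). The function name and the exact normalization are our own rendering of those formulas, not the program's source code.

    import numpy as np
    from scipy.stats import chi2

    def reweighted_mcd(X, T_full, S_full):
        """Consistency correction (step 6) followed by one-step reweighting (step 7)."""
        n, p = X.shape
        diff = X - T_full
        d2_raw = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S_full), diff)
        S_mcd = (np.median(d2_raw) / chi2.ppf(0.5, p)) * S_full      # step 6
        T_mcd = T_full
        d2 = np.einsum('ij,jk,ik->i', X - T_mcd, np.linalg.inv(S_mcd), X - T_mcd)
        w = (d2 <= chi2.ppf(0.975, p)).astype(float)                 # weights from the .975 cutoff
        T1 = (w[:, None] * X).sum(axis=0) / w.sum()                  # step 7
        centered = X - T1
        S1 = (w[:, None] * centered).T @ centered / (w.sum() - 1)
        return T1, S1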

Table 1. Performance of the FAST-MCD and FSA Algorithms on Some Small Datasets

                                                                                    Time (seconds)
Dataset     n   p   Starts   Best h-subset found                                   FAST-MCD   FSA
Heart       12  2   220      1 3 4 5 7 9 11                                         .6         .6
Phosphor    18  2   816      3 5 8 9 11 12 13 14 15 17                              1.8        3.7
Stackloss   21  3   500      4 5 6 7 8 9 10 11 12 13 14 20                          2.1        4.6
Coleman     20  5   500      2 3 4 5 7 8 12 13 14 16 17 19 20                       4.2        8.9
Wood        20  5   500      1 2 3 5 9 10 12 13 14 15 17 18 20                      4.3        8.2
Salinity    28  3   500      1 2 6 7 8 12 13 14 18 20 21 22 25 26 27 28             2.4        8.6
HBK         75  3   500      15 16 17 18 19 20 21 22 23 24 26 27 31 32 33 35 36     5.0        71.5
                             37 38 40 43 49 50 51 54 55 56 58 59 61 63 64 66 67
                             70 71 72 73 74

We may conclude that for these small datasets FAST-MCD gives very accurate results in little time.

Let us now try the algorithm on larger datasets, with n >= 100. In each dataset, we generated over 50% of the points from the standard multivariate normal distribution N_p(0, I_p), and the remaining points from N_p(mu, I_p), where mu = (b, b, ..., b)' with b = 10. This is the model of "shift outliers." For each dataset, Table 2 lists n, p, the percentage of majority points, and the percentage of contamination. The algorithm always used 500 starts and the default value of h = [(n + p + 1)/2].

The results of FAST-MCD are given in the next column, under "robust." Here "yes" means that the correct result is obtained, that is, corresponding to the first distribution [as in Fig. 4(b)], whereas "no" stands for the nonrobust result, in which the estimates describe the entire dataset [as in Fig. 4(a)]. Table 2 lists data situations with the highest percentage of outlying observations still yielding the clean result with FAST-MCD, as was suggested by a referee. That is, the table says which percentage of outliers the algorithm can handle for given n and p. Increasing the number of starts only slightly improves this percentage. The computation times were quite low for the given values of n and p. Even for a sample size as high as 50,000, a few minutes suffice, whereas no previous algorithm we know of could handle such large datasets.

The currently most well-known algorithm for approximating the MCD estimator is the feasible subset algorithm (FSA) of Hawkins (1994). Instead of C-steps, it uses a different kind of steps, which for convenience we will baptize "I-steps," where the I stands for "interchanging points." An I-step proceeds as follows. Given the h-subset H_old with its average T_old and its covariance matrix S_old,

- repeat for each i in H_old and each j not in H_old:
  * put H_{i,j} = (H_old \ {i}) union {j} (i.e., remove point i and add point j);
  * compute Delta_{i,j} = det(S_old) - det(S(H_{i,j}));
- keep the i' and j' with largest Delta_{i',j'};
- if Delta_{i',j'} <= 0, put H_new = H_old and stop;
- if Delta_{i',j'} > 0, put H_new = H_{i',j'}.

An I-step takes O(h(n - h)) = O(n^2) time because all pairs (i, j) are considered. If we would compute each S(H_{i,j}) from scratch, the complexity would even become O(n^3), but Hawkins (1994, p. 203) used an update formula for det(S(H_{i,j})).
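For comparison with the C-step code above, a naive Python sketch of one I-step follows. It recomputes each determinant from scratch rather than using Hawkins's update formula, and is only meant to make the O(h(n - h)) pair enumeration explicit; it is our illustration, not the FSA source code.

    import numpy as np

    def i_step(X, H_old):
        """One interchange step of the FSA: try all swaps of one inside point
        against one outside point and keep the best improvement, if any."""
        H_old = list(H_old)
        H_set = set(H_old)
        outside = [j for j in range(len(X)) if j not in H_set]
        det_old = np.linalg.det(np.cov(X[H_old], rowvar=False, bias=True))
        best_delta, best_pair = 0.0, None
        for i in H_old:
            for j in outside:
                H_ij = [k for k in H_old if k != i] + [j]
                delta = det_old - np.linalg.det(np.cov(X[H_ij], rowvar=False, bias=True))
                if delta > best_delta:
                    best_delta, best_pair = delta, (i, j)
        if best_pair is None:                      # no swap lowers the determinant: stop
            return H_old, False
        i, j = best_pair
        return [k for k in H_old if k != i] + [j], True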
Table 2. Performance of the FAST-MCD and FSA Algorithms on Larger Datasets, With Time in Seconds

                                                FAST-MCD           FSA
n        p    % N_p(0, I_p)   % N_p(mu, I_p)   Robust   Time      Robust   Time
100      2    51              49               yes      2         yes      50
100      5    53              47               yes      5         no       80
100      10   63              37               yes      40        no       110
100      20   77              23               yes      70        no       350
500      2    51              49               yes      7         no       2,800
500      5    51              49               yes      25        no       3,800
500      10   64              36               yes      84        no       4,100
500      30   77              23               yes      695       no       8,300
1,000    2    51              49               yes      8         no       20,000
1,000    5    51              49               yes      20        -        -
1,000    10   60              40               yes      75        -        -
1,000    30   76              24               yes      600       -        -
10,000   2    51              49               yes      9         -        -
10,000   5    51              49               yes      25        -        -
10,000   10   63              37               yes      85        -        -
10,000   30   76              24               yes      700       -        -
50,000   2    51              49               yes      15        -        -
50,000   5    51              49               yes      45        -        -
50,000   10   58              42               yes      140       -        -
50,000   30   75              25               yes      890       -        -
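The "shift outliers" configurations summarized in Table 2 can be generated along the following lines; this is a sketch under our own parameter names, with b = 10 as in the text.

    import numpy as np

    def shift_outlier_data(n, p, pct_outliers, b=10.0, seed=None):
        """Majority from N_p(0, I_p), contamination from N_p((b,...,b)', I_p)."""
        rng = np.random.default_rng(seed)
        n_out = int(round(n * pct_outliers / 100))
        clean = rng.standard_normal((n - n_out, p))
        shifted = rng.standard_normal((n_out, p)) + b
        X = np.vstack([clean, shifted])
        rng.shuffle(X)                             # mix majority and contamination
        return X

    # e.g., the n = 1,000, p = 10 case with 40% contamination from Table 2
    X = shift_outlier_data(1000, 10, 40, seed=0)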


The I-step can be iterated: if det(S_new) < det(S_old), we can take another I-step with H_new; otherwise, we stop. The resulting sequence det(S_1) >= det(S_2) >= ... must converge after a finite number of steps; that is, det(S_m) = 0 or det(S_m) = det(S_{m-1}), so det(S_m) can no longer be reduced by an I-step. This is again a necessary (but not sufficient) condition for (T_m, S_m) to be the global minimum of the MCD objective function. In our terminology, Hawkins's FSA algorithm can be written as follows:

- repeat many times:
  * draw an initial h-subset H_1 at random;
  * carry out I-steps until convergence, yielding H_m;
- keep the H_m with lowest det(S_m);
- report this set H_m as well as (T_m, S_m).

In Tables 1 and 2 we have applied the FSA algorithm to the same datasets as FAST-MCD, using the same number of starts. For the small datasets in Table 1, the FSA and FAST-MCD yielded identical results. This is no longer true in Table 2, where the FSA begins to find nonrobust solutions. This is because of the following:

1. The FSA starts from randomly drawn h-subsets H_1. Hence, for sufficiently large n all of the FSA starts are nonrobust, and subsequent iterations do not get away from the corresponding local minimum.

We saw the same effect in Section 3.1, which also explained why it is better to start from random (p + 1)-subsets as in MINVOL and in FAST-MCD.

The tables also indicate that the FSA needs more time than FAST-MCD. In fact, time(FSA)/time(FAST-MCD) increases from 1 to 14 for n going from 12 to 75. In Table 2, the timing ratio goes from 25 (for n = 100) to 2,500 (for n = 1,000), after which we could no longer time the FSA algorithm. The FSA algorithm is more time-consuming than FAST-MCD because of the following:

2. An I-step takes O(n^2) time, compared to O(n) for the C-step of FAST-MCD.

3. Each I-step swaps only one point of H_old with one point outside H_old. In contrast, each C-step swaps h - |H_old intersect H_new| points inside H_old with the same number outside of H_old. Therefore, more I-steps are needed, especially for increasing n.

4. The FSA iterates I-steps until convergence, starting from each H_1. On the other hand, FAST-MCD reduces the number of C-steps by the selective iteration technique of Section 3.2. The latter would not work for I-steps because of 3.

5. The FSA carries out all its I-steps in the full dataset of size n, even for large n. In the same situation, FAST-MCD applies the nested extensions method of Section 3.3, so most C-steps are carried out for n_sub = 300, some for n_merged = 1,500, and only a few for the actual n.

While this article was under review, Hawkins and Olive (1999) proposed an improved version of the FSA algorithm, as described at the end of our Section 2. To avoid confusion, we would like to clarify that the timings in Tables 1 and 2 were made with the original FSA algorithm described by Hawkins (1994), whereas the new version of FSA is substantially faster (although it retains the same computational complexity as the original FSA due to 2 in the preceding list).

In conclusion, we personally prefer the FAST-MCD algorithm because it is both robust and fast, even for large n.

7. APPLICATIONS

Let us now look at some applications to compare the FAST-MCD results with the classical mean and covariance matrix. At the same time we will illustrate a new tool, the distance-distance plot.

Example 1. We start with the coho salmon dataset (see Nickelson 1986) with n = 22 and p = 2, as shown in Figure 8(a). Each data point corresponds to one year. For 22 years the production of coho salmon in the wild was measured, in the Oregon Production Area. The x-coordinate is the logarithm of millions of smolts, and the y-coordinate is the logarithm of millions of adult coho salmon. We see that in most years the production of smolts lies between 2.2 and 2.4 on a logarithmic scale, whereas the production of adults lies between -1.0 and .0. The MCD tolerance ellipse excludes the years with a lower smolts production, thereby marking them as outliers.

Figure 8. Coho Salmon Data: (a) Scatterplot With 97.5% Tolerance Ellipses Describing the MCD and the Classical Method; (b) Distance-Distance Plot.
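Readers who want to reproduce such analyses today do not need to reimplement the algorithm. For example, scikit-learn's MinCovDet estimator is based on a variant of this FAST-MCD algorithm, and a typical call (our own illustrative snippet, not part of the original article) looks like this:

    import numpy as np
    from sklearn.covariance import MinCovDet
    from scipy.stats import chi2

    X = np.random.randn(22, 2)                     # placeholder for a small bivariate dataset
    mcd = MinCovDet(random_state=0).fit(X)
    T, S = mcd.location_, mcd.covariance_          # robust location and scatter
    rd = np.sqrt(mcd.mahalanobis(X))               # robust distances RD(x_i); mahalanobis() returns squared distances
    cutoff = np.sqrt(chi2.ppf(0.975, X.shape[1]))
    print(np.where(rd > cutoff)[0])                # observations flagged as outliers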


In contrast, the classical tolerance ellipse contains nearly the whole dataset and thus does not detect the existence of far outliers.

Let us now introduce the distance-distance plot (D-D plot), which plots the robust distances (based on the MCD) versus the classical Mahalanobis distances. On both axes in Figure 8(b) we have indicated the cutoff value sqrt(chi^2_{p,.975}) (here p = 2, yielding sqrt(chi^2_{2,.975}) = 2.72). If the data were not contaminated (say, if all the data would come from a single bivariate normal distribution), then all points in the D-D plot would lie near the dotted line. In this example many points lie in the rectangle where both distances are regular, whereas the outlying points lie higher. This happened because the MCD ellipse and the classical ellipse have a different orientation.

Naturally, the D-D plot becomes more useful in higher dimensions, where it is not so easy to visualize the dataset and the ellipsoids.

Problem 1 (continued). Next we consider Problem 1 in Section 1. The Philips data represent 677 measurements of metal sheets with nine components each, and the Mahalanobis distances in Figure 1 indicated no groups of outliers. The MCD-based robust distances RD(x_i) in Figure 9(a) tell a different story. We now see a strongly deviating group of outliers, ranging from index 491 to index 565. Something happened in the production process that was not visible from the classical distances shown in Figure 1. Figure 9(a) also shows a remarkable change after the first 100 measurements. These phenomena were investigated and interpreted by the engineers at Philips. Note that the D-D plot in Figure 9(b) again contrasts the classical and robust analysis. In Figure 9, (a) and (b), one can in fact see three groups: the first 100 points, those with index 491 to 565, and the majority.

Figure 9. Philips Data: (a) Plot of Robust Distances; (b) Distance-Distance Plot.

Problem 2 (continued). We now apply FAST-MCD to the same n = 132,402 celestial objects with p = 6 variables as in Figure 3, which took only 2.5 minutes. (In fact, running the program on the same objects in all 27 dimensions took only 18 minutes!) Figure 10(a) plots the resulting MCD-based robust distances. In contrast to the homogeneous-looking Mahalanobis distances in Figure 3, the robust distances in Figure 10(a) clearly show that there is a majority with RD(x_i) <= sqrt(chi^2_{6,.975}) as well as a second group with RD(x_i) between 8 and 16. By exchanging our findings with the astronomers at the California Institute of Technology, we learned that the lower group consists mainly of stars and the upper group mainly of galaxies.

Our main point is that the robust distances separate the data in two parts and thus provide more information than the Mahalanobis distances.

Figure 10. Digitized Palomar Data: (a) Plot of Robust Distances of Celestial Objects; (b) Their Distance-Distance Plot.
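A D-D plot as used in this section can be drawn with a few lines of matplotlib. The sketch below is ours; it assumes classical distances md and robust distances rd computed as in the earlier snippets, and the dimension p of the data.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi2

    def dd_plot(md, rd, p):
        """Distance-distance plot: robust (MCD-based) versus classical Mahalanobis distances."""
        cutoff = np.sqrt(chi2.ppf(0.975, p))
        plt.scatter(md, rd, s=10)
        plt.axhline(cutoff, linestyle='--')
        plt.axvline(cutoff, linestyle='--')
        lim = max(md.max(), rd.max())
        plt.plot([0, lim], [0, lim], linestyle=':')   # uncontaminated data lie near this line
        plt.xlabel('Mahalanobis distance')
        plt.ylabel('Robust distance')
        plt.show()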


That robust distances and Mahalanobis distances behave differently is illustrated in Figure 10(b), where we see the stars near the diagonal line and the galaxies above it.

Of course our analysis of these data was much more extensive and also used other data-analytic techniques not described here, but the ability to compute robust estimates of location and scatter for such large datasets was a key tool. Based on our work, these astronomers are thinking about modifying their classification of objects into stars and galaxies, especially for the faint light sources that are difficult to classify.

Example 2. We end this section by combining robust location/scatter with robust regression. The fire data (Andrews and Herzberg 1985) reported the incidences of fires in 47 residential areas of Chicago. One wants to explain the incidence of fire by the age of the houses, the income of the families living in them, and the incidence of theft. For this we apply the least trimmed squares (LTS) method of robust regression, with the usual value of h = [(3/4)n] = 35. In S-PLUS 4.5, the function "ltsreg" now automatically calls the function "cov.mcd," which runs the FAST-MCD algorithm, to obtain robust distances in x-space based on the MCD with the same h. Moreover, S-PLUS automatically provides the diagnostic plot of Rousseeuw and van Zomeren (1990), which plots the robust residuals versus the robust distances. For the fire data, this yields Figure 11, which shows the presence of one vertical outlier, that is, an observation with a small robust distance and a large LTS residual. We also see two bad leverage points, that is, observations (x, y) with outlying x and such that (x, y) does not follow the linear trend of the majority of the data. The other observations with robust distances to the right of the vertical cutoff line are good leverage points because they have small LTS residuals and hence follow the same linear pattern as the main group. In Figure 11 we see that most of these points are merely boundary cases, except for the two leverage points that are really far out in x-space.

Figure 11. Diagnostic Plot of the Fire Dataset.

8. CONCLUSIONS

The algorithm FAST-MCD proposed in this article is specifically tailored to the properties of the MCD estimator. The basic ideas are the C-step (Theorem 1 in Sec. 2), the procedure for generating initial estimates (Sec. 3.1), selective iteration (Sec. 3.2), and nested extensions (Sec. 3.3). By exploiting the special structure of the problem, the new algorithm is faster and more effective than general-purpose techniques such as reducing the objective function by successively interchanging points. Simulations have shown that FAST-MCD is able to deal with large datasets while outrunning existing algorithms for MVE and MCD by orders of magnitude. Another advantage of FAST-MCD is its ability to detect exact fit situations.

Due to the FAST-MCD algorithm, the MCD becomes accessible as a routine tool for analyzing multivariate data. Without extra cost we also obtain the D-D plot, a new data display that plots the MCD-based robust distances versus the classical Mahalanobis distances. This is a useful tool to explore structure(s) in the data. Other possibilities include an MCD-based PCA and robustified versions of other multivariate analysis methods.

ACKNOWLEDGMENTS

We thank Doug Hawkins and José Agulló for making their programs available to us. We also dedicate special thanks to Gertjan Otten, Frans Van Dommelen, and Herman Veraa for giving us access to the Philips data and to S. C. Odewahn and his research group at the California Institute of Technology for allowing us to analyze their digitized Palomar data. We are grateful to the referees and Technometrics editors Max Morris and Karen Kafadar for helpful comments improving the presentation.

APPENDIX: PROOF OF THEOREM 1

Proof. Assume that det(S_2) > 0; otherwise the result is already satisfied. We can thus compute d_2(i) = d_{(T_2,S_2)}(i) for all i = 1, ..., n. Using |H_2| = h and the definition of (T_2, S_2), we find

    (1/(hp)) Sum_{i in H_2} d_2^2(i) = (1/(hp)) tr( S_2^{-1} Sum_{i in H_2} (x_i - T_2)(x_i - T_2)' )
                                     = (1/p) tr( S_2^{-1} S_2 ) = (1/p) tr(I) = 1.                    (A.1)

Moreover, put

    lambda := (1/(hp)) Sum_{i in H_2} d_{(T_1,S_1)}^2(i) <= (1/(hp)) Sum_{i in H_1} d_{(T_1,S_1)}^2(i) = 1,   (A.2)

where the inequality holds because H_2 collects the h smallest distances d_1(i) and the last equality follows from the computation in (A.1) applied to (T_1, S_1) and H_1; moreover lambda > 0 because otherwise det(S_2) = 0.

Combining (A.1) and (A.2) yields

    (1/(hp)) Sum_{i in H_2} d_{(T_1, lambda S_1)}^2(i) = (1/(lambda hp)) Sum_{i in H_2} (x_i - T_1)' S_1^{-1} (x_i - T_1)
                                                       = (1/(lambda hp)) Sum_{i in H_2} d_1^2(i) = lambda/lambda = 1.

Grübel (1988) proved that (T_2, S_2) is the unique minimizer of det(S) among all (T, S) for which (1/(hp)) Sum_{i in H_2} d_{(T,S)}^2(i) = 1. This implies that det(S_2) <= det(lambda S_1). On the other hand, it follows from the inequality lambda <= 1 in (A.2) that det(lambda S_1) <= det(S_1), hence

    det(S_2) <= det(lambda S_1) <= det(S_1).                                                            (A.3)

Moreover, note that det(S_2) = det(S_1) if and only if both inequalities in (A.3) are equalities. For the first, we know from Grübel's result that det(S_2) = det(lambda S_1) if and only if (T_2, S_2) = (T_1, lambda S_1). For the second, det(lambda S_1) = det(S_1) if and only if lambda = 1, that is, S_1 = lambda S_1. Combining both yields (T_2, S_2) = (T_1, S_1).

[Received December 1997. Revised March 1999.]

REFERENCES

Agulló, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175-180.
Andrews, D. F., and Herzberg, A. M. (1985), Data, New York: Springer-Verlag.
Butler, R. W., Davies, P. L., and Jhun, M. (1993), "Asymptotics for the Minimum Covariance Determinant Estimator," The Annals of Statistics, 21, 1385-1400.
Coakley, C. W., and Hettmansperger, T. P. (1993), "A Bounded Influence, High Breakdown, Efficient Regression Estimator," Journal of the American Statistical Association, 88, 872-880.
Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213-218.
Croux, C., and Haesbroeck, G. (in press), "Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator," Journal of Multivariate Analysis.
Davies, L. (1992), "The Asymptotics of Rousseeuw's Minimum Volume Ellipsoid Estimator," The Annals of Statistics, 20, 1828-1843.
Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.
Grübel, R. (1988), "A Minimal Characterization of the Covariance Matrix," Metrika, 35, 49-52.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.
Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 191-210.
Hawkins, D. M., and McLachlan, G. J. (1997), "High-Breakdown Linear Discriminant Analysis," Journal of the American Statistical Association, 92, 136-143.
Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1-11.
Lopuhaä, H. P., and Rousseeuw, P. J. (1991), "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," The Annals of Statistics, 19, 229-248.
Maronna, R. A. (1976), "Robust M-estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51-56.
Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. (1991), "Robust Regression Methods in Computer Vision: A Review," International Journal of Computer Vision, 6, 59-70.
Nickelson, T. E. (1986), "Influence of Upwelling, Ocean Temperature, and Smolt Abundance on Marine Survival of Coho Salmon (Oncorhynchus Kisutch) in the Oregon Production Area," Canadian Journal of Fisheries and Aquatic Sciences, 43, 527-535.
Odewahn, S. C., Djorgovski, S. G., Brunner, R. J., and Gal, R. (1998), "Data From the Digitized Palomar Sky Survey," technical report, California Institute of Technology.
Rocke, D. M., and Woodruff, D. L. (1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047-1061.
Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.
--- (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283-297.
--- (1997), "Introduction to Positive-Breakdown Methods," in Handbook of Statistics, Vol. 15: Robust Inference, eds. G. S. Maddala and C. R. Rao, Amsterdam: Elsevier, pp. 101-121.
Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.
Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633-639.
Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), "On One-Step GM-estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, 87, 439-450.
Woodruff, D. L., and Rocke, D. M. (1993), "Heuristic Search Algorithms for the Minimum Volume Ellipsoid," Journal of Computational and Graphical Statistics, 2, 69-95.
--- (1994), "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators," Journal of the American Statistical Association, 89, 888-896.

