A Fast Algorithm for the Minimum Covariance Determinant Estimator
To cite this article: Peter J. Rousseeuw & Katrien Van Driessen (1999) A Fast Algorithm for the
Minimum Covariance Determinant Estimator, Technometrics, 41:3, 212-223
The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of
multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance
matrix has the lowest determinant. Until now, applications of the MCD were hampered by the
computation time of existing algorithms, which were limited to a few hundred objects in a few
dimensions. We discuss two important applications of larger size, one about a production process at
Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256
objects and p = 27 variables. To deal with such problems we have developed a new algorithm
for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and
determinants, and techniques which we call “selective iteration” and “nested extensions.” For small
datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more
accurate results than existing algorithms and is faster by orders of magnitude. Moreover, FAST-MCD
is able to detect an exact fit, that is, a hyperplane containing h or more observations. The
new algorithm makes the MCD method available as a routine tool for analyzing multivariate data.
We also propose the distance-distance plot (D-D plot), which displays MCD-based robust distances
versus Mahalanobis distances, and illustrate it with some examples.
KEY WORDS: Breakdown value; Multivariate location and scatter; Outlier detection; Regression; Robust estimation.
It is difficult to detect outliers in p-variate data when p > 2 because one can no longer rely on visual inspection. Although it is still quite easy to detect a single outlier by means of the Mahalanobis distances, this approach no longer suffices for multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distances. It is better to use distances based on robust estimators of multivariate location and scatter (Rousseeuw and Leroy 1987, pp. 265-269). In regression analysis, robust distances computed from the explanatory variables allow us to detect leverage points. Moreover, robust estimation of multivariate location and scatter is the key tool to robustify other multivariate techniques such as principal-component analysis and discriminant analysis.

Many methods for estimating multivariate location and scatter break down in the presence of n/(p + 1) outliers, where n is the number of observations and p is the number of variables, as was pointed out by Donoho (1982). For the breakdown value of the multivariate M-estimators of Maronna (1976), see Hampel, Ronchetti, Rousseeuw, and Stahel (1986, p. 296). In the meantime, several positive-breakdown estimators of multivariate location and scatter have been proposed. One of these is the minimum volume ellipsoid (MVE) method of Rousseeuw (1984, p. 877; 1985). This approach looks for the ellipsoid with smallest volume that covers h data points, where n/2 < h < n. Its breakdown value is essentially (n - h)/n.

Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example, in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991). For a survey of positive-breakdown methods and some substantive applications, see Rousseeuw (1997).

The basic resampling algorithm for approximating the MVE, called MINVOL, was proposed by Rousseeuw and Leroy (1987). This algorithm considers a trial subset of p + 1 observations and calculates its mean and covariance matrix. The corresponding ellipsoid is then inflated or deflated to contain exactly h observations. This procedure is repeated many times, and the ellipsoid with the lowest volume is retained. For small datasets it is possible to consider all subsets of size p + 1, whereas for larger datasets the trial subsets are drawn at random.

Several other algorithms have been proposed to approximate the MVE. Woodruff and Rocke (1993) constructed algorithms combining the resampling principle with three heuristic search techniques: simulated annealing, genetic algorithms, and tabu search. Other people developed algorithms to compute the MVE exactly. This work started with the algorithm of Cook, Hawkins, and Weisberg (1992),
which carries out an ingenious but still exhaustive search of all possible subsets of size h. In practice, this can be done for n up to about 30. Recently, Agulló (1996) developed an exact algorithm for the MVE that is based on a branch and bound procedure that selects the optimal subset without requiring the inspection of all subsets of size h. This is substantially faster and can be applied up to (roughly) n <= 100 and p <= 5. Because for most datasets the exact algorithms would take too long, the MVE is typically computed by versions of MINVOL, for example, in S-PLUS (see the function cov.mve).

Presently there are several reasons for replacing the MVE by the minimum covariance determinant (MCD) estimator, which was also proposed by Rousseeuw (1984, p. 877; 1985). The MCD objective is to find h observations (out of n) whose classical covariance matrix has the lowest determinant. The MCD estimate of location is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. The resulting breakdown value equals that of the MVE, but the MCD has several advantages over the MVE. Its statistical efficiency is better because the MCD is asymptotically normal (Butler, Davies, and Jhun 1993), whereas the MVE has a lower convergence rate (Davies 1992). As an example, the asymptotic efficiency of the MCD scatter matrix with the typical coverage h ≈ .75n is 44% in 10 dimensions, and the reweighted covariance matrix with weights obtained from the MCD attains 83% efficiency (Croux and Haesbroeck in press), whereas the MVE attains 0%. The MCD's better accuracy makes it very useful as an initial estimate for one-step regression estimators (Simpson, Ruppert, and Carroll 1992; Coakley and Hettmansperger 1993). Robust distances based on the MCD are more precise than those based on the MVE and hence better suited to expose multivariate outliers, for example, in the diagnostic plot of Rousseeuw and van Zomeren (1990), which displays robust residuals versus robust distances. Moreover, the MCD is a key component of the hybrid estimators of Woodruff and Rocke (1994) and Rocke and Woodruff (1996) and of high-breakdown linear discriminant analysis (Hawkins and McLachlan 1997).

In spite of all these advantages, until now the MCD has rarely been applied because it was harder to compute. In this article, however, we construct a new MCD algorithm that is actually much faster than any existing MVE algorithm. The new MCD algorithm can deal with a sample size n in the tens of thousands. As far as we know, none of the existing MVE algorithms can cope with such large sample sizes. Because the MCD now greatly outperforms the MVE in terms of both statistical efficiency and computation speed, we recommend the MCD method.

1. MOTIVATING PROBLEMS

Two recent problems will be shown to illustrate the need for a fast, robust method that can deal with many objects (n) and/or many variables (p) while maintaining a reasonable statistical efficiency.

Problem 1 (Engineering). We are grateful to Gertjan Otten for providing the following problem. Philips Mecoma (The Netherlands) is producing diaphragm parts for TV sets. These are thin metal plates, molded by a press. Recently a new production line was started, and for each of n = 677 parts, nine characteristics were measured. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements and to find out whether deformations or abnormalities have occurred and why. Afterward, the estimated location and scatter matrix can be used for multivariate statistical process control.

Due to the support of Herman Veraa and Frans Van Dommelen (at Philips PMF/Mecoma, Product Engineering, P.O. Box 218, 5600 MD Eindhoven, The Netherlands), we obtained permission to analyze these data and to publish the results.

Figure 1 shows the classical Mahalanobis distance

    MD(x_i) = sqrt( (x_i - T_0)' S_0^{-1} (x_i - T_0) )    (1.1)

versus the index i, which corresponds to the production sequence. Here x_i is nine-dimensional, T_0 is the arithmetic mean, and S_0 is the classical covariance matrix. The horizontal line is at the usual cutoff value sqrt(chi^2_{9,.975}) = 4.36.

In Figure 1 it seems that most observations are consistent with the classical multivariate normal model, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from masking. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect T_0 and S_0 in such a way as to become invisible in Figure 1. To further investigate these data, we need robust estimators T and S, preferably with a substantial statistical efficiency so that we can be confident that any effects that may become visible are real and not due to the estimator's inefficiency. After developing the FAST-MCD algorithm, we will return to these data in Section 7.

Figure 1. Plot of Mahalanobis Distances for the Philips Data.
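As an informal illustration of (1.1), classical Mahalanobis distances and the chi-square cutoff can be computed as follows. This is our own minimal sketch in Python (NumPy/SciPy); the function name mahalanobis_distances and the placeholder data are not part of the paper.

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_distances(X, T=None, S=None):
        """Classical Mahalanobis distances MD(x_i) as in (1.1).

        X is an (n, p) data matrix; T defaults to the arithmetic mean
        and S to the classical covariance matrix.
        """
        X = np.asarray(X, dtype=float)
        if T is None:
            T = X.mean(axis=0)
        if S is None:
            S = np.cov(X, rowvar=False)
        centered = X - T
        # Solve S^{-1}(x_i - T) for all i at once instead of inverting S.
        md2 = np.einsum("ij,ij->i", centered, np.linalg.solve(S, centered.T).T)
        return np.sqrt(md2)

    # Example: n = 677 parts with p = 9 characteristics, as in Problem 1.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(677, 9))            # placeholder data, not the Philips data
    md = mahalanobis_distances(X)
    cutoff = np.sqrt(chi2.ppf(0.975, df=9))  # approximately 4.36 for p = 9
    print(np.sum(md > cutoff), "points beyond the cutoff")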
Figure 2. Digitized Palomar Data: Plot of Mahalanobis Distances of Celestial Objects, Based on Six Variables Concerning Magnitude and Image Moments.

1. Compute the distances d_old(i) for i = 1, ..., n.
2. Sort these distances, which yields a permutation π for which d_old(π(1)) <= d_old(π(2)) <= ... <= d_old(π(n)).
3. Put H_new := {π(1), π(2), ..., π(h)}.
4. Compute T_new := ave(H_new) and S_new := cov(H_new).

For a fixed number of dimensions p, the C-step takes only O(n) time [because H_new can be determined in O(n) operations without sorting all the d_old(i) distances].

Repeating C-steps yields an iteration process. If det(S_2) = 0 or det(S_2) = det(S_1), we stop; otherwise, we run another C-step yielding det(S_3), and so on. The sequence det(S_1) >= det(S_2) >= det(S_3) >= ... is nonnegative and hence must converge. In fact, because there are only finitely many h-subsets, there must be an index m such that det(S_m) = 0 or det(S_m) = det(S_{m-1}), hence convergence is reached. (In practice, m is often below 10.) Afterward, running the C-step on (T_m, S_m) no longer reduces the determinant. This is not sufficient for det(S_m) to be the global minimum of the MCD objective function, but it is a necessary condition.

Theorem 1 thus provides a partial idea for an algorithm:

    Take many initial choices of H_1 and apply C-steps to each until convergence, and keep the solution with lowest determinant.    (2.1)

Of course, several questions must be answered to make (2.1) operational: How do we generate sets H_1 to begin with? How many H_1 are needed? How do we avoid duplication of work because several H_1 may yield the same solution? Can we do with fewer C-steps? What about large sample sizes? These matters will be discussed in the next sections.

Corollary 1. The MCD subset H of X_n is separated from X_n \ H by an ellipsoid.

Proof. For the MCD subset H, and in fact any limit of a C-step sequence, applying the C-step to H yields H itself. This means that all x_i in H satisfy (x_i - T)' S^{-1} (x_i - T) <= q := {(x - T)' S^{-1} (x - T)}_{h:n}, whereas all x_j not in H satisfy (x_j - T)' S^{-1} (x_j - T) >= q. Take the ellipsoid E = {x; (x - T)' S^{-1} (x - T) <= q}. Then H is contained in E and X_n \ H is contained in closure(E^c). Note that there is at least one point x_i in H on the boundary of E, whereas there may or may not be a point x_j not in H on the boundary of E.

The same result was proved by Butler et al. (1993) under the extra condition that a density exists. Note that the ellipsoid in Corollary 1 contains h observations but is not necessarily the smallest ellipsoid to do so, which would yield the MVE. We know of no technique like the C-step for the MVE estimator; hence, the latter estimator cannot be computed faster in this way.

Independently of our work, Hawkins and Olive (1999) discovered a version of Corollary 1 in the following form: "A necessary condition for the MCD optimum is that, if we calculate the distance of each case from the location vector using the scatter matrix, each covered case must have smaller distance than any uncovered case." This necessary condition could perhaps be called the "C-condition," as opposed to the C-step of Theorem 1, where we proved that a C-step always decreases det(S). In the absence of Theorem 1, Hawkins and Olive (1999) used the C-condition as a preliminary screen, followed by case swapping as a technique for decreasing det(S), as in the feasible solution approach (Hawkins 1994), which will be described in Section 6. The C-condition did not reduce the time complexity of this approach, but it did reduce the actual computation time in experiments with fixed n.

3. CONSTRUCTION OF THE NEW ALGORITHM

3.1 Creating Initial Subsets H_1

To apply the algorithmic concept (2.1), we first have to decide how to construct the initial subsets H_1. Let us consider the following two possibilities:

1. Draw a random h-subset H_1.
2. Draw a random (p + 1)-subset J, and then compute T_0 := ave(J) and S_0 := cov(J). [If det(S_0) = 0, then extend J by adding another random observation, and continue adding observations until det(S_0) > 0.] Then compute the distances d_0^2(i) := (x_i - T_0)' S_0^{-1} (x_i - T_0) for i = 1, ..., n. Sort them into d_0^2(π(1)) <= ... <= d_0^2(π(n)) and put H_1 := {π(1), ..., π(h)}.

Option 1 is the simplest, whereas 2 starts like the MINVOL algorithm (Rousseeuw and Leroy 1987, pp. 259-260). It would be useless to draw fewer than p + 1 points, for then S_0 is always singular.

When the dataset does not contain outliers or deviating groups of points, it makes little difference whether (2.1) is applied with 1 or 2. But because the MCD is a very robust estimator, we have to consider contaminated datasets in particular. For instance, we generated a dataset with n = 400 observations and p = 2 variables, in which 205 observations were drawn from one bivariate normal distribution and the other 195 observations were drawn from a second bivariate normal distribution. The MCD has its highest possible breakdown value when h = [(n + p + 1)/2] (see Lopuhaä and Rousseeuw 1991), which becomes h = 201 here. We now apply (2.1) with 500 starting sets H_1. Using option 1 yields a resulting (T, S) whose 97.5% tolerance ellipse is shown in Figure 4(a). Clearly, this result has broken down due to the contaminated data. On the other hand, option 2 yields the result in Figure 4(b), which concentrates on the majority (51.25%) of the data.

The situation in Figure 4 is extreme, but it is useful for illustrative purposes. (The same effect also occurs for smaller amounts of contamination, especially in higher dimensions.) Approach 1 has failed because each random subset H_1 contains a sizable number of points from the majority group as well as from the minority group, which follows from the law of large numbers. When starting from a bad subset H_1, the iterations will not converge to the major solution. On the other hand, the probability of a (p + 1)-subset without outliers is much higher, which explains why 2 yields many
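To make the C-step of Section 2 and option 2 of Subsection 3.1 concrete, here is a minimal sketch in Python/NumPy. It is our own illustration, not the authors' code; the helper names c_step, initial_subset, and iterate_c_steps are ours, and the convergence test simply compares successive determinants.

    import numpy as np

    def c_step(X, H_old, h):
        """One C-step: from an h-subset H_old, build H_new with the h smallest
        distances relative to (ave(H_old), cov(H_old))."""
        T = X[H_old].mean(axis=0)
        S = np.cov(X[H_old], rowvar=False)
        centered = X - T
        d2 = np.einsum("ij,ij->i", centered, np.linalg.solve(S, centered.T).T)
        H_new = np.argsort(d2)[:h]        # the h observations closest to (T, S)
        return H_new, T, S

    def initial_subset(X, h, rng):
        """Option 2 of Subsection 3.1: start from a random (p + 1)-subset,
        enlarge it until its covariance is nonsingular, then take the h
        observations closest to the resulting (T_0, S_0)."""
        n, p = X.shape
        J = list(rng.choice(n, size=p + 1, replace=False))
        while np.linalg.det(np.cov(X[J], rowvar=False)) <= 0 and len(J) < n:
            J.append(rng.choice(np.setdiff1d(np.arange(n), J)))
        H1, _, _ = c_step(X, np.array(J), h)   # same "closest h" construction
        return H1

    def iterate_c_steps(X, H, h, max_iter=100):
        """Repeat C-steps until the determinant no longer decreases."""
        old_det = np.inf
        for _ in range(max_iter):
            H, T, S = c_step(X, H, h)
            new_det = np.linalg.det(np.cov(X[H], rowvar=False))
            if new_det == 0 or new_det >= old_det:
                break
            old_det = new_det
        return H

In the spirit of (2.1), one would generate many initial subsets with initial_subset, run iterate_c_steps on each, and keep the result with the lowest determinant.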
which in turn is a proper subset of the dataset of size n. [Already the algorithm of Woodruff and Rocke (1994) made use of partitioning for this purpose. The only difference with the nested extensions in Figure 6 is that we work with two stages, hence our use of the word "nested," whereas Woodruff and Rocke partitioned the entire dataset, which yields more and/or larger subsets.] To construct Figure 6, the algorithm draws 1,500 observations, one by one, without replacement. The first 300 observations it encounters are put in the first subset, and so on. Because of this mechanism, each subset of size 300 is roughly representative of the dataset, and the merged set with 1,500 cases is even more representative.

When n <= 600, we will keep the algorithm as in the previous section, while for n >= 1,500 we will use Figure 6. When 600 < n < 1,500, we will partition the data into at most four subsets of 300 or more observations so that each observation belongs to a subset and such that the subsets have roughly the same size. For instance, 601 will be split as 300 + 301 and 900 as 450 + 450. For n = 901, we use 300 + 300 + 301, and we continue until 1,499 = 375 + 375 + 375 + 374. By splitting 601 as 300 + 301 we do not mean that the first subset contains the observations with case numbers 1, ..., 300 but that its 300 case numbers were drawn randomly from 1, ..., 601.

Whenever n > 600 (and whether n < 1,500 or not), our new algorithm for the MCD will take two C-steps from several starting subsamples H_1 within each subset, with a total of 500 starts for all subsets together. For every subset the best 10 solutions are stored. Then the subsets are pooled, yielding a merged set with at most 1,500 observations. Each of these (at most 50) available solutions (T_sub, S_sub) is then extended to the merged set. That is, starting from each (T_sub, S_sub), we continue taking C-steps, which now use all 1,500 observations in the merged set. Only the best 10 solutions (T_merged, S_merged) will be considered further. Finally, each of these 10 solutions is extended to the full dataset in the same way, and the best solution (T_full, S_full) is reported.

Because the final computations are carried out in the entire dataset, they take more time when n increases. In the interest of speed we can limit the number of initial solutions (T_merged, S_merged) and/or the number of C-steps in the full dataset as n becomes large.

The main idea of this subsection was to carry out C-steps in several nested random subsets, starting with small subsets of around 300 observations and ending with the entire dataset of n observations. Throughout this subsection, we have chosen several numbers such as five subsets of 300 observations, 500 starts, 10 best solutions, and so on. These choices were based on various empirical trials (not reported here). We implemented our choices as defaults so the user does not have to choose anything, but of course the user may change the defaults.

4. THE RESULTING ALGORITHM FAST-MCD

Combining all the components of the preceding sections yields the new algorithm, which we will call FAST-MCD. Its pseudocode looks as follows:

1. The default h is [(n + p + 1)/2], but the user may choose any integer h with [(n + p + 1)/2] <= h <= n. The program then reports the MCD's breakdown value (n - h + 1)/n. If you are sure that the dataset contains less than 25% contamination, which is usually the case, a good compromise between breakdown value and statistical efficiency is obtained by putting h = [.75n].

2. If h = n, then the MCD location estimate T is the average of the whole dataset, and the MCD scatter estimate S is its covariance matrix. Report these and stop.

3. If p = 1 (univariate data), compute the MCD estimate (T, S) by the exact algorithm of Rousseeuw and Leroy (1987, pp. 171-172) in O(n log n) time; then stop.

4. From here on, h < n and p >= 2. If n is small (say, n <= 600), then
   - repeat (say) 500 times:
     * construct an initial h-subset H_1 using method 2 in Subsection 3.1, that is, starting from a random (p + 1)-subset;
     * carry out two C-steps (described in Sec. 2);
   - for the 10 results with lowest det(S_3):
     * carry out C-steps until convergence;
   - report the solution (T, S) with lowest det(S).

5. If n is larger (say, n > 600), then
   - construct up to five disjoint random subsets of size n_sub according to Section 3.3 (say, five subsets of size n_sub = 300);
   - inside each subset, repeat 500/5 = 100 times:
     * construct an initial subset H_1 of size h_sub = [n_sub(h/n)];
     * carry out two C-steps, using n_sub and h_sub;
     * keep the 10 best results (T_sub, S_sub);
   - pool the subsets, yielding the merged set (say, of size n_merged = 1,500);
   - in the merged set, repeat for each of the 50 solutions (T_sub, S_sub):
     * carry out two C-steps, using n_merged and h_merged = [n_merged(h/n)];
     * keep the 10 best results (T_merged, S_merged);
   - in the full dataset, repeat for the m_full best results:
     * take several C-steps, using n and h;
     * keep the best final result (T_full, S_full).

Here, m_full and the number of C-steps (preferably, until convergence) depend on how large the dataset is; a small illustrative sketch of the n <= 600 branch is given below.

We will refer to the preceding as the FAST-MCD algorithm. Note that it is affine equivariant: When the data are translated or subjected to a linear transformation, the resulting (T_full, S_full) will transform accordingly. The computer program contains two more steps:
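The n <= 600 branch of the pseudocode above (step 4) can be sketched as follows, reusing the hypothetical helpers initial_subset, c_step, and iterate_c_steps from the earlier sketch. This is only an outline of the idea, not the authors' program; it omits the nested extensions of step 5 as well as the reweighting, consistency factors, and exact-fit handling.

    import numpy as np

    def fast_mcd_small_n(X, h, n_starts=500, n_keep=10, rng=None):
        """Step 4 of the FAST-MCD pseudocode (small n): many starts,
        two C-steps each, then full convergence for the best few."""
        rng = np.random.default_rng() if rng is None else rng
        candidates = []
        for _ in range(n_starts):
            H = initial_subset(X, h, rng)          # random (p+1)-subset start
            for _ in range(2):                     # two C-steps only
                H, _, _ = c_step(X, H, h)
            det = np.linalg.det(np.cov(X[H], rowvar=False))
            candidates.append((det, H))
        candidates.sort(key=lambda pair: pair[0])  # lowest determinants first
        best_det, best = np.inf, None
        for det, H in candidates[:n_keep]:         # iterate the best to convergence
            H = iterate_c_steps(X, H, h)
            det = np.linalg.det(np.cov(X[H], rowvar=False))
            if det < best_det:
                best_det, best = det, H
        T = X[best].mean(axis=0)
        S = np.cov(X[best], rowvar=False)
        return T, S, best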
Table 1. Performance of the FAST-MCD and FSA Algorithms on Some Small Datasets

Dataset | n | p | Starts | Best h-subset found | Time (seconds), FAST-MCD | Time (seconds), FSA
We may conclude that for these small datasets FAST-MCD gives very accurate results in little time.

Let us now try the algorithm on larger datasets, with n > 100. In each dataset, we generated over 50% of the points from the standard multivariate normal distribution N_p(0, I_p), and the remaining points from N_p(μ, I_p), where μ = (b, b, ..., b)' with b = 10. This is the model of "shift outliers." For each dataset, Table 2 lists n, p, the percentage of majority points, and the percentage of contamination. The algorithm always used 500 starts and the default value of h = [(n + p + 1)/2].

The results of FAST-MCD are given in the next column, under "robust." Here "yes" means that the correct result is obtained, that is, corresponding to the first distribution [as in Fig. 4(b)], whereas "no" stands for the nonrobust result, in which the estimates describe the entire dataset [as in Fig. 4(a)]. Table 2 lists data situations with the highest percentage of outlying observations still yielding the clean result with FAST-MCD, as was suggested by a referee. That is, the table says which percentage of outliers the algorithm can handle for given n and p. Increasing the number of starts only slightly improves this percentage. The computation times were quite low for the given values of n and p. Even for a sample size as high as 50,000, a few minutes suffice, whereas no previous algorithm we know of could handle such large datasets.

The currently most well-known algorithm for approximating the MCD estimator is the feasible subset algorithm (FSA) of Hawkins (1994). Instead of C-steps, it uses a different kind of steps, which for convenience we will baptize "I-steps," where the I stands for "interchanging points." An I-step proceeds as follows:

Given the h-subset H_old with its average T_old and its covariance matrix S_old,
- repeat for each i in H_old and each j not in H_old:
  * put H_{i,j} = (H_old \ {i}) ∪ {j} (i.e., remove point i and add point j);
  * compute Δ_{i,j} = det(S_old) - det(S(H_{i,j}));
- keep the i' and j' with largest Δ_{i',j'};
- if Δ_{i',j'} < 0, put H_new = H_old and stop;
- if Δ_{i',j'} > 0, put H_new = H_{i',j'}.

An I-step takes O(h(n - h)) = O(n^2) time because all pairs (i, j) are considered. If we would compute each S(H_{i,j}) from scratch, the complexity would even become O(n^3), but Hawkins (1994, p. 203) used an update formula for det(S(H_{i,j})).
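For comparison with the earlier C-step sketch, a naive I-step can be written as follows. This is our own illustration, not Hawkins's implementation; it recomputes every determinant from scratch, so it costs O(n^3) per step instead of using the update formula.

    import numpy as np

    def i_step(X, H_old):
        """One I-step: try every swap of a point inside H_old with a point
        outside it, and keep the single swap that decreases det(S) the most."""
        H_old = np.asarray(H_old)
        outside = np.setdiff1d(np.arange(len(X)), H_old)
        det_old = np.linalg.det(np.cov(X[H_old], rowvar=False))
        best_gain, best_pair = 0.0, None
        for i in H_old:
            for j in outside:
                H_ij = np.append(H_old[H_old != i], j)
                gain = det_old - np.linalg.det(np.cov(X[H_ij], rowvar=False))
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:                  # no swap improves: stop
            return H_old, False
        i, j = best_pair
        return np.append(H_old[H_old != i], j), True

Iterating i_step until its second return value is False reproduces the refinement loop of one FSA start.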
Table 2. Performance of the FAST-MCD and FSA Algorithms on Larger Datasets, With Time in Seconds

n | p | % N_p(0, I_p) | % N_p(μ, I_p) | FAST-MCD robust | FAST-MCD time | FSA robust | FSA time
100 | 2 | 51 | 49 | yes | 2 | yes | 50
100 | 5 | 53 | 47 | yes | 5 | no | 80
100 | 10 | 63 | 37 | yes | 40 | no | 110
100 | 20 | 77 | 23 | yes | 70 | no | 350
500 | 2 | 51 | 49 | yes | 7 | no | 2,800
500 | 5 | 51 | 49 | yes | 25 | no | 3,800
500 | 10 | 64 | 36 | yes | 84 | no | 4,100
500 | 30 | 77 | 23 | yes | 695 | no | 8,300
1,000 | 2 | 51 | 49 | yes | 8 | no | 20,000
1,000 | 5 | 51 | 49 | yes | 20 | - | -
1,000 | 10 | 60 | 40 | yes | 75 | - | -
1,000 | 30 | 76 | 24 | yes | 600 | - | -
10,000 | 2 | 51 | 49 | yes | 9 | - | -
10,000 | 5 | 51 | 49 | yes | 25 | - | -
10,000 | 10 | 63 | 37 | yes | 85 | - | -
10,000 | 30 | 76 | 24 | yes | 700 | - | -
50,000 | 2 | 51 | 49 | yes | 15 | - | -
50,000 | 5 | 51 | 49 | yes | 45 | - | -
50,000 | 10 | 58 | 42 | yes | 140 | - | -
50,000 | 30 | 75 | 25 | yes | 890 | - | -
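The shift-outlier configurations behind Table 2 are straightforward to generate; the following is a minimal sketch (ours), with the function name shift_outlier_data and the seed chosen only for illustration.

    import numpy as np

    def shift_outlier_data(n, p, contamination, b=10.0, rng=None):
        """Generate n points in p dimensions: a majority from N_p(0, I_p) and
        a fraction 'contamination' from N_p(mu, I_p) with mu = (b, ..., b)'."""
        rng = np.random.default_rng() if rng is None else rng
        n_out = int(round(contamination * n))
        clean = rng.standard_normal((n - n_out, p))
        outliers = rng.standard_normal((n_out, p)) + b
        X = np.vstack([clean, outliers])
        rng.shuffle(X)                    # mix the two groups row-wise
        return X

    # Example: one of the Table 2 settings (n = 1,000, p = 10, 40% contamination).
    X = shift_outlier_data(1000, 10, 0.40, rng=np.random.default_rng(1))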
The I-step can be iterated: If det(S_new) < det(S_old), we can take another I-step with H_new; otherwise, we stop. The resulting sequence det(S_1) >= det(S_2) >= ... must converge after a finite number of steps; that is, det(S_m) = 0 or det(S_m) = det(S_{m-1}), so det(S_m) can no longer be reduced by an I-step. This is again a necessary (but not sufficient) condition for (T_m, S_m) to be the global minimum of the MCD objective function. In our terminology, Hawkins's FSA algorithm can be written as follows:

- repeat many times:
  * draw an initial h-subset H_1 at random;
  * carry out I-steps until convergence, yielding H_m;
- keep the H_m with lowest det(S_m);
- report this set H_m as well as (T_m, S_m).

In Tables 1 and 2 we have applied the FSA algorithm to the same datasets as FAST-MCD, using the same number of starts. For the small datasets in Table 1, the FSA and FAST-MCD yielded identical results. This is no longer true in Table 2, where the FSA begins to find nonrobust solutions. This is because of the following:

1. The FSA starts from randomly drawn h-subsets H_1. Hence, for sufficiently large n all of the FSA starts are nonrobust, and subsequent iterations do not get away from the corresponding local minimum. We saw the same effect in Section 3.1, which also explained why it is better to start from random (p + 1)-subsets as in MINVOL and in FAST-MCD.

The tables also indicate that the FSA needs more time than FAST-MCD. In fact, time(FSA)/time(FAST-MCD) increases from 1 to 14 for n going from 12 to 75. In Table 2, the timing ratio goes from 25 (for n = 100) to 2,500 (for n = 1,000), after which we could no longer time the FSA algorithm. The FSA algorithm is more time-consuming than FAST-MCD because of the following:

2. An I-step takes O(n^2) time, compared to O(n) for the C-step of FAST-MCD.

3. Each I-step swaps only one point of H_old with one point outside H_old. In contrast, each C-step swaps h - |H_old ∩ H_new| points inside H_old with the same number outside of H_old. Therefore, more I-steps are needed, especially for increasing n.

4. The FSA iterates I-steps until convergence, starting from each H_1. On the other hand, FAST-MCD reduces the number of C-steps by the selective iteration technique of Section 3.2. The latter would not work for I-steps because of 3.

5. The FSA carries out all its I-steps in the full dataset of size n, even for large n. In the same situation, FAST-MCD applies the nested extensions method of Section 3.3, so most C-steps are carried out for n_sub = 300, some for n_merged = 1,500, and only a few for the actual n.

While this article was under review, Hawkins and Olive (1999) proposed an improved version of the FSA algorithm, as described at the end of our Section 2. To avoid confusion, we would like to clarify that the timings in Tables 1 and 2 were made with the original FSA algorithm described by Hawkins (1994), whereas the new version of FSA is substantially faster (although it retains the same computational complexity as the original FSA due to 2 in the preceding list).

In conclusion, we personally prefer the FAST-MCD algorithm because it is both robust and fast, even for large n.

7. APPLICATIONS

Let us now look at some applications to compare the FAST-MCD results with the classical mean and covariance matrix. At the same time we will illustrate a new tool, the distance-distance plot.

Figure 8. Coho Salmon Data: (a) Scatterplot With 97.5% Tolerance Ellipses Describing the MCD and the Classical Method; (b) Distance-Distance Plot.

Example 1. We start with the coho salmon dataset (see Nickelson 1986) with n = 22 and p = 2, as shown in Figure 8(a). Each data point corresponds to one year. For 22 years the production of coho salmon in the wild was measured, in the Oregon Production Area. The x-coordinate is the logarithm of millions of smolts, and the y-coordinate is the logarithm of millions of adult coho salmon. We see that in most years the production of smolts lies between 2.2 and 2.4 on a logarithmic scale, whereas the production of adults lies between -1.0 and 0.0. The MCD tolerance ellipse excludes the years with a lower smolts production, thereby
marking them as outliers. In contrast, the classical tolerance ellipse contains nearly the whole dataset and thus does not detect the existence of far outliers.

Let us now introduce the distance-distance plot (D-D plot), which plots the robust distances (based on the MCD) versus the classical Mahalanobis distances. On both axes in Figure 8(b) we have indicated the cutoff value sqrt(chi^2_{p,.975}) (here p = 2, yielding sqrt(chi^2_{2,.975}) = 2.72). If the data were not contaminated (say, if all the data would come from a single bivariate normal distribution), then all points in the D-D plot would lie near the dotted line. In this example many points lie in the rectangle where both distances are regular, whereas the outlying points lie higher. This happened because the MCD ellipse and the classical ellipse have a different orientation.

Naturally, the D-D plot becomes more useful in higher dimensions, where it is not so easy to visualize the dataset and the ellipsoids.

Problem 1 (continued). Next we consider Problem 1 in Section 1. The Philips data represent 677 measurements of metal sheets with nine components each, and the Mahalanobis distances in Figure 1 indicated no groups of outliers. The MCD-based robust distances RD(x_i) in Figure 9(a) tell a different story. We now see a strongly deviating group of outliers, ranging from index 491 to index 565. Something happened in the production process that was not visible from the classical distances shown in Figure 1. Figure 9(a) also shows a remarkable change after the first 100 measurements. These phenomena were investigated and interpreted by the engineers at Philips. Note that the D-D plot in Figure 9(b) again contrasts the classical and robust analysis. In Figure 9, (a) and (b), one can in fact see three groups: the first 100 points, those with index 491 to 565, and the majority.

Problem 2 (continued). We now apply FAST-MCD to the same n = 132,402 celestial objects with p = 6 variables as in Figure 3, which took only 2.5 minutes. (In fact, running the program on the same objects in all 27 dimensions took only 18 minutes!) Figure 10(a) plots the resulting MCD-based robust distances. In contrast to the homogeneous-looking Mahalanobis distances in Figure 3, the robust distances in Figure 10(a) clearly show that there is a majority with RD(x_i) below the cutoff sqrt(chi^2_{6,.975}) as well as a second group with RD(x_i) between 8 and 16. By exchanging our findings with the astronomers at the California Institute of Technology, we learned that the lower group consists mainly of stars and the upper group mainly of galaxies.

Our main point is that the robust distances separate the data in two parts and thus provide more information than

Figure 9. Philips Data: (a) Plot of Robust Distances; (b) Distance-Distance Plot.

Figure 10. Digitized Palomar Data: (a) Plot of Robust Distances of Celestial Objects; (b) Their Distance-Distance Plot.
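A distance-distance plot can be drawn from any robust pair (T, S). The sketch below is ours and reuses the hypothetical fast_mcd_small_n and mahalanobis_distances helpers from the earlier sketches, together with matplotlib; it is not the authors' plotting code.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi2

    def dd_plot(X, h):
        """Distance-distance plot: MCD-based robust distances versus
        classical Mahalanobis distances, with the chi-square cutoffs."""
        n, p = X.shape
        T_mcd, S_mcd, _ = fast_mcd_small_n(X, h)       # robust fit (sketch)
        rd = mahalanobis_distances(X, T_mcd, S_mcd)    # robust distances
        md = mahalanobis_distances(X)                  # classical distances
        cutoff = np.sqrt(chi2.ppf(0.975, df=p))
        plt.scatter(md, rd, s=10)
        plt.axhline(cutoff, linestyle=":")
        plt.axvline(cutoff, linestyle=":")
        lim = max(md.max(), rd.max())
        plt.plot([0, lim], [0, lim], linestyle="--")   # 45-degree reference line
        plt.xlabel("Mahalanobis Distance")
        plt.ylabel("Robust Distance based on the MCD")
        plt.show()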
Figure 11. Diagnostic Plot of the Fire Dataset.

Moreover, put

    λ := ...    (A.2)

where λ > 0 because otherwise det(S_2) = 0. Combining
(A.1) and (A.2) yields

    (1/(hp)) Σ_{i in H_2} d^2_{(T_1, λS_1)}(i) = (1/(λhp)) Σ_{i in H_2} (x_i - T_1)' S_1^{-1} (x_i - T_1) = (1/(λhp)) Σ_{i in H_2} d_1^2(i) = 1.

Grübel (1988) proved that (T_2, S_2) is the unique minimizer of det(S) among all (T, S) for which (1/(hp)) Σ_{i in H_2} d^2_{(T,S)}(i) = 1. This implies that det(S_2) <= det(λS_1). On the other hand, it follows from the inequality (A.2) that det(λS_1) <= det(S_1); hence

    det(S_2) <= det(λS_1) <= det(S_1).    (A.3)

Moreover, note that det(S_2) = det(S_1) if and only if both inequalities in (A.3) are equalities. For the first, we know from Grübel's result that det(S_2) = det(λS_1) if and only if (T_2, S_2) = (T_1, λS_1). For the second, det(λS_1) = det(S_1) if and only if λ = 1; that is, S_1 = λS_1. Combining both yields (T_2, S_2) = (T_1, S_1).

[Received December 1997. Revised March 1999.]

REFERENCES

Agulló, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175-180.
Andrews, D. F., and Herzberg, A. M. (1985), Data, New York: Springer-Verlag.
Butler, R. W., Davies, P. L., and Jhun, M. (1993), "Asymptotics for the Minimum Covariance Determinant Estimator," The Annals of Statistics, 21, 1385-1400.
Coakley, C. W., and Hettmansperger, T. P. (1993), "A Bounded Influence, High Breakdown, Efficient Regression Estimator," Journal of the American Statistical Association, 88, 872-880.
Cook, R. D., Hawkins, D. M., and Weisberg, S. (1992), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213-218.
Croux, C., and Haesbroeck, G. (in press), "Influence Function and Efficiency of the Minimum Covariance Determinant Scatter Matrix Estimator," Journal of Multivariate Analysis.
Davies, L. (1992), "The Asymptotics of Rousseeuw's Minimum Volume Ellipsoid Estimator," The Annals of Statistics, 20, 1828-1843.
Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.
Grübel, R. (1988), "A Minimal Characterization of the Covariance Matrix," Metrika, 35, 49-52.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.
Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 191-210.
Hawkins, D. M., and McLachlan, G. J. (1997), "High-Breakdown Linear Discriminant Analysis," Journal of the American Statistical Association, 92, 136-143.
Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1-11.
Lopuhaä, H. P., and Rousseeuw, P. J. (1991), "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," The Annals of Statistics, 19, 229-248.
Maronna, R. A. (1976), "Robust M-estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51-56.
Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. (1991), "Robust Regression Methods in Computer Vision: A Review," International Journal of Computer Vision, 6, 59-70.
Nickelson, T. E. (1986), "Influence of Upwelling, Ocean Temperature, and Smolt Abundance on Marine Survival of Coho Salmon (Oncorhynchus kisutch) in the Oregon Production Area," Canadian Journal of Fisheries and Aquatic Sciences, 43, 527-535.
Odewahn, S. C., Djorgovski, S. G., Brunner, R. J., and Gal, R. (1998), "Data From the Digitized Palomar Sky Survey," technical report, California Institute of Technology.
Rocke, D. M., and Woodruff, D. L. (1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047-1061.
Rousseeuw, P. J. (1984), "Least Median of Squares Regression," Journal of the American Statistical Association, 79, 871-880.
Rousseeuw, P. J. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283-297.
Rousseeuw, P. J. (1997), "Introduction to Positive-Breakdown Methods," in Handbook of Statistics, Vol. 15: Robust Inference, eds. G. S. Maddala and C. R. Rao, Amsterdam: Elsevier, pp. 101-121.
Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.
Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633-639.
Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), "On One-Step GM-estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, 87, 439-450.
Woodruff, D. L., and Rocke, D. M. (1993), "Heuristic Search Algorithms for the Minimum Volume Ellipsoid," Journal of Computational and Graphical Statistics, 2, 69-95.
Woodruff, D. L., and Rocke, D. M. (1994), "Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators," Journal of the American Statistical Association, 89, 888-896.