Identification of Multivariate Outliers - Problems and Challenges of Visualization Methods
Zeszyty Naukowe
Uniwersytetu Ekonomicznego w Katowicach
ISSN 2083-8611 Nr 247 · 2015
Informatyka i Ekonometria 4
Justyna Majewska
University of Economics in Katowice
Faculty of Informatics and Communication
Department of Demography and Economic Statistics
[email protected]
Introduction
Rousseeuw and van Zomeren [1990] note that outliers are “an empirical reality but their exact definition is as elusive as the exact definition of a cluster”. They argue that outliers “are observations that deviate from the model suggested by the majority of the point cloud, where the central model is a multivariate normal”. Booth et al. [1989] pointed out the difficulty of defining a multivariate outlier when they referred to a statistical outlier as a nonrepresentative observation whose “position may not be extreme enough on the basis of a single variable to demonstrate its outlying characteristics. However, the combined effects of several variables could be substantial enough to justify categorizing” it as an outlier. Note that such phrases as “appear to deviate” or “deviates so much” imply some kind of subjectivity.
In univariate data, the identification of outliers seems relatively simple to carry out. A simple plot of the data, such as a scatter plot, stem-and-leaf plot or QQ-plot, can often reveal which points are outliers. Identification of multivariate outliers is definitely more complex than in the univariate case. In practice, outliers are hard to detect when the dimension p exceeds two [Rousseeuw and van Zomeren, 1990]. Some of the procedures for identifying multivariate outliers have been adapted from univariate methods. Unfortunately, “many of the standard multivariate methods are derived under the assumption of normality and the presence of outliers will strongly affect inferences made from normal-based procedures” [Schwager and Margolin, 1982]. Various concepts for multivariate outlier detection methods exist in the literature [e.g. Barnett and Lewis, 1994; Rocke and Woodruff, 1996; Peña and Prieto, 2001].
Masking effect: one outlier is said to mask a second outlier if the second can be considered an outlier only by itself, but not in the presence of the first. Thus, after the deletion of the first outlier the second observation emerges as an outlier. Masking occurs when a cluster of outlying observations skews the mean and the covariance estimates toward it, so that the resulting distance of the outlying point from the mean is small.
Swamping effect: one outlier is said to swamp a second observation if the latter can be considered an outlier only in the presence of the first. In other words, after the deletion of the first outlier the second observation becomes non-outlying. Swamping occurs when a group of outlying observations skews the mean and the covariance estimates toward it and away from other non-outlying observations, so that the resulting distance from these observations to the mean is large, making them look like outliers.
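A minimal numerical sketch of masking (the data, cluster size and cutoff here are illustrative assumptions, not from the paper): a tight cluster of outliers pulls the classical mean and covariance toward itself, so each cluster member's Mahalanobis distance stays modest until the rest of the cluster is deleted.

```r
# Masking sketch: 100 clean points plus a tight 20-point outlying cluster.
set.seed(1)
clean   <- matrix(rnorm(200), ncol = 2)                     # 100 clean points near 0
cluster <- matrix(rnorm(40, mean = 5, sd = 0.2), ncol = 2)  # 20-point cluster at (5, 5)
x <- rbind(clean, cluster)                                  # rows 101-120 are the cluster

md2_all <- mahalanobis(x, colMeans(x), cov(x))  # squared MD, contaminated estimates
md2_all[101]          # typically below qchisq(0.975, 2) ~ 7.38: point 101 is masked

x_del   <- x[-(102:120), ]                      # delete the other 19 cluster points
md2_del <- mahalanobis(x_del, colMeans(x_del), cov(x_del))
md2_del[101]          # now very large: deleting the cluster unmasks point 101
```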
Fig. 1. An attempt to identify outliers in a set of 100 simulated observations (from the N(100,5) distribution) in 2D, using the boxplot method and scatterplots (one of them with four ellipsoids on which the Mahalanobis distance is constant; these constant values correspond to the 0.25, 0.50, 0.75 and adjusted (see section 2.1) quantiles of the chi-square distribution)
Source: Own calculations in R.
A single-step procedure with low masking and swamping is given in Iglewics and Martinez [1982].
The phenomenon of outlier masking and swamping also argues for the use of outlier-resistant identification methods for detecting multivariate outliers. The degree of masking is measured in terms of an increase in Type II error (false negatives), since observations that are truly outlying are classified as part of the uncontaminated population of data. Swamping, analogously, refers to the increase in Type I error caused by outliers.
Becker and Gather [1999] developed the masking breakdown point¹ of an outlier identification method, which specifies the smallest fraction of outliers in a sample that can induce the masking effect. Becker and Gather proved that the masking breakdown point of an outlier detection method that uses a mean and covariance estimator is bounded by the breakdown points of these two estimators. Further, if the two estimators have the same breakdown point, then the masking breakdown point of the detector equals the estimators' breakdown point.
¹ The breakdown point is an important measure used to describe the resistance of robust estimators in the presence of outliers. Following Hodges [1967] and Hampel [1968, 1971], the breakdown point of an estimator is the fraction of arbitrary contaminating observations that can be present in a sample before the value of the estimator can become arbitrarily large. Lopuhaä and Rousseeuw [1991] have presented more formal definitions of the breakdown point for location and covariance estimators.
For each observation $x_i$, the Mahalanobis distance (MD) from the location vector $\mu$ with respect to the covariance matrix $V$ is

$$M_i = \left[ (x_i - \mu)^T V^{-1} (x_i - \mu) \right]^{1/2}, \qquad i = 1, \dots, n.$$
Accordingly, those observations with a large MD can be indicated as outliers [Aguinis et al., 2013]. For normally distributed data the squared Mahalanobis distance is approximately chi-square distributed with p degrees of freedom. Potential multivariate outliers $x_i$ will typically have large values of $M_i$, and in this situation a comparison with the $\chi^2_p$ distribution can be made.
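A minimal sketch of this comparison (assuming a numeric data matrix `x`, e.g. from the masking sketch above, and the conventional 0.975 quantile):

```r
# Flag observations whose squared MD exceeds the chi-square(p) quantile.
p   <- ncol(x)
md2 <- mahalanobis(x, colMeans(x), cov(x))  # squared M_i from the formula above
which(md2 > qchisq(0.975, df = p))          # candidate multivariate outliers
```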
Masking and swamping effects play an important role in the adequacy of the MD as a criterion for outlier detection. Namely, masking effects might decrease the MD of an outlier; this might happen, for example, when a small cluster of outliers attracts μ and inflates V in its direction. On the other hand, swamping effects might increase the Mahalanobis distance of non-outlying observations, for example when a small cluster of outliers attracts μ and inflates V away from the pattern of the majority of the observations [see Penny and Jolliffe, 2001].
Due to these problems, robust estimators have been substituted into the distance formula, yielding robust distances. The use of robust estimates of the multidimensional distribution parameters can often improve the performance of the detection procedures in the presence of outliers. Hadi [1992] addresses this problem and proposes to replace the mean vector by a vector of variable medians and to compute the covariance matrix for the subset of those observations with the smallest MD. A modified version of Hadi's procedure was presented in Penny and Jolliffe [2001]. Caussinus and Roiz [1990] proposed a robust estimate for the covariance matrix, based on observations weighted according to their distance from the center. The authors also propose a method for low-dimensional projections of the dataset: they use Generalized Principal Component Analysis to reveal those dimensions which display outliers. Other robust estimators such as the M-estimator, S-estimator, MM-estimator, MVE, MCD and Fast-MCD (FMCD) estimators have been proven to identify outliers better than the classical estimators. Among the robust estimators, FMCD has been shown to be the best compared with the others [Rousseeuw, 1985; Rousseeuw and Leroy, 1987; Acuna and Rodriguez, 2004].
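A sketch of robust distances with FMCD location and scatter substituted into the distance formula; robustbase::covMcd implements the FastMCD algorithm of Rousseeuw and Van Driessen [1999] (`x` as assumed above):

```r
# Robust distances from the FastMCD estimator.
library(robustbase)
mcd <- covMcd(x)                             # robust center and covariance
rd2 <- mahalanobis(x, mcd$center, mcd$cov)   # squared robust distances
which(rd2 > qchisq(0.975, df = ncol(x)))     # robustly flagged outliers
```

Compared with the classical cutoff above, the robust distances are far less prone to masking and swamping, because the MCD center and scatter are computed from the least-outlying half of the data.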
Fig. 2. The ordered squared robust Mahalanobis distances of the observations plotted against the empirical distribution function of the squared Mahalanobis distances. In addition, the distribution function of $\chi^2_p$ is plotted, as well as two vertical lines corresponding to the quantile specified in the argument list (default 0.975) and the so-called adjusted quantile. Three additional graphics are created (the first showing the data, the second showing the outliers detected by the specified quantile of the $\chi^2_p$ distribution, and the third showing the outliers detected by the adjusted quantile)
Source: Figures made with mvoutlier package in R.
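The display in Fig. 2 corresponds to the output of mvoutlier's aq.plot; a minimal sketch on simulated data (the dataset here is an assumption, not the paper's):

```r
# Adjusted-quantile plot: data, chi-square-quantile outliers, adjusted-quantile
# outliers, and ordered squared robust distances vs. cumulative probability.
library(mvoutlier)
set.seed(123)
dat <- rbind(matrix(rnorm(180), ncol = 2),           # 90 regular points
             matrix(rnorm(20, mean = 4), ncol = 2))  # 10 shifted points
res <- aq.plot(dat, delta = qchisq(0.975, df = ncol(dat)))
which(res$outliers)   # observations flagged beyond the adjusted quantile
```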
² If an estimator is affine equivariant, stretching or rotating the data will not affect the estimator. Dropping this requirement greatly increases the number of available estimators, and in many cases non-affine equivariant estimators have superior performance to affine equivariant ones.
[Figure: scatterplot of the simulated data with classical and robust tolerance ellipses (legend: robust, classical); the labelled points are those flagged as outliers]
The following output values are presented: a vector with final 0/1 weights for each observation (weight 0 indicates potential multivariate outliers), a vector with final weights for each observation (small values indicate potential multivariate outliers), a vector with weights for each observation (small values indicate potential location outliers), and a vector with weights for each observation (small values indicate potential scatter outliers).
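These four vectors correspond to the output of mvoutlier's pcout, which implements the Filzmoser, Maronna and Werner [2008] procedure; a minimal sketch (`dat` as in the previous sketch):

```r
# pcout: PCA-based outlier detection for (possibly high-dimensional) data.
library(mvoutlier)
res <- pcout(dat, makeplot = TRUE)   # makeplot = TRUE draws Fig. 4-style panels
which(res$wfinal01 == 0)             # final 0/1 weights: 0 = potential outlier
head(res$wfinal)                     # final combined weights (small = suspicious)
head(res$wloc)                       # location-outlyingness weights
head(res$wscat)                      # scatter-outlyingness weights
```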
[Panels: Distance (location), Weight (location), Distance (scatter), Weight (scatter), Weight (combined), each plotted against the observation index]
Fig. 4. Results of the outlier identification method of Filzmoser, Maronna, and Werner [2008]
Source: Figures made with mvoutlier package in R.
This method detected more outliers than the MCD method (see Figure 5).
[Fig. 5: scatterplots of the data, "Outlier detection based on SRPC" (left) and "Outlier detection based on MCD" (right)]
Some data, for example mortality data, can be treated as a set of curves, i.e. realizations in a functional space. By visualizing these curves we can identify outliers among the observed curves using functional equivalents of boxplots and bagplots. Hyndman and Shang [2010] proposed the functional bagplot and the functional boxplot in order to visualize functional data and to detect any outliers present.
Suppose we have a set of curves {yi(x)}, i = 1,...,n, which are realizations in the functional space I. After visualizing these curves for large n using functional equivalents of boxplots and bagplots, we want to identify outliers among the observed curves. In this context the notion of ordering a set of curves is crucial. These methods obtain an ordering from a principal component decomposition of the set of observed curves. If we let:

$$y_i(x) = \mu(x) + \sum_{k=1}^{n-1} z_{i,k}\,\varphi_k(x)$$

where {φk(x)} are the eigenfunctions, then we can use an ordering method from multivariate analysis based on the principal component scores {zi,k}. The simplest procedure is to consider only the first two scores, zi = (zi,1, zi,2); an ordering of the curves is then defined by an ordering of the zi. For example, bivariate depth can be used [Rousseeuw et al., 1999]. Alternatively, the value of a bivariate kernel density estimate at zi can be used to define an ordering.
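A sketch of the kernel-density ordering (not the authors' code): it assumes `Y` is a p × n matrix of discretized curves, one curve per column, and uses the ks package for the bivariate density estimate.

```r
# Order curves by the density of their first two PC scores.
Y  <- replicate(100, sin(2 * pi * seq(0, 1, length = 50)) + rnorm(50, sd = 0.1))
pc <- prcomp(t(Y))                        # classical PC decomposition of the curves
z  <- pc$x[, 1:2]                         # first two scores z_i = (z_i1, z_i2)
library(ks)
fhat <- kde(z, eval.points = z)$estimate  # bivariate kernel density at each z_i
ord  <- order(fhat, decreasing = TRUE)    # most "typical" curves come first
tail(ord)                                 # lowest-density curves: outlier candidates
```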
Age-specific mortality rates are a very good example to illustrate this method. There are two major advantages to ordering via the principal component scores. First, it leads to a natural method for defining visualization tools such as functional bagplots and functional boxplots. Second, it seems better able to identify outliers in real data: outliers will usually be more visible in the principal component space than in the original (functional) space [Filzmoser et al., 2008], so finding outliers in the principal component scores does no worse than searching for them in the original space. It is often the case that the first two principal component scores³ suffice to convey the main modes of variation. Because the principal component decomposition is itself non-resistant to outliers, Hyndman and Shang [2010] applied a functional version of Croux and Ruiz-Gazen's [2005] robust principal component analysis, which uses a projection pursuit technique. This method was described and used in Hyndman and Ullah [2007].
The functional bagplot is based on the bivariate bagplot of Rousseeuw et al. [1999] applied to the first two (robust) principal component scores. The bagplot is constructed on the basis of the halfspace location depth, denoted d(θ,z), of some point θ∈R² relative to the bivariate data cloud {zi; i = 1,...,n}. The depth region Dk is the set of all θ with d(θ,z) ≥ k. Since the depth regions are convex polygons, we have Dk+1 ⊂ Dk. For a fixed center, the regions grow as the radius increases; thus, the data points are ranked according to their depth. The bivariate bagplot displays the median point (the deepest location), along with selected percentages of convex hulls. Any point beyond the highest-percentage convex hull is considered an outlier. Each point in the scores bagplot corresponds to a curve in the functional bagplot. The functional bagplot also displays the median curve (the deepest location), the 95% confidence intervals for the median, and the 50% and 95% regions of surrounding curves ranked by depth. Any curve beyond the 95% convex hull is flagged as a functional outlier (see Figure 6).
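A hedged sketch of the functional bagplot, assuming the rainbow package's fboxplot interface and the `Y` matrix from the sketches above:

```r
# Functional bagplot: the bivariate bagplot of Rousseeuw et al. [1999]
# applied to the first two (robust) PC scores of the curves.
library(rainbow)
curves <- fds(x = seq(0, 1, length = 50), y = Y)          # functional data object
fboxplot(curves, plot.type = "functional", type = "bag")  # functional bagplot
fboxplot(curves, plot.type = "bivariate",  type = "bag")  # underlying scores bagplot
```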
The functional highest density region (HDR) boxplot is based on the bivariate HDR boxplot of Hyndman [1996] applied to the first two (robust) principal component scores. The HDR boxplot is constructed using the Parzen-Rosenblatt bivariate kernel density estimate $\hat{f}(w; a, b)$. For a bivariate random sample {zi; i = 1,...,n} drawn from a density f, the product kernel density estimate is defined by Scott [1992] as:
³ Hyndman and Shang [2010] found empirically that the first two principal component scores are adequate for outlier identification.
$$\hat{f}(w; a, b) = \frac{1}{nab} \sum_{i=1}^{n} K\!\left(\frac{w_1 - z_{i,1}}{a}\right) K\!\left(\frac{w_2 - z_{i,2}}{b}\right)$$
where w = (w₁, w₂)ᵀ, K is a symmetric univariate kernel function such that ∫K(u)du = 1, and (a, b) is a bivariate bandwidth parameter such that a > 0, b > 0, and a → 0, b → 0 as n → ∞. The contribution of data point zi to the estimate at some point w depends on how distant zi and w are.
A highest density region is defined as $R_\alpha = \{ z : \hat{f}(z; a, b) \geq f_\alpha \}$, where $f_\alpha$ is such that $\int_{R_\alpha} \hat{f}(z; a, b)\,dz = 1 - \alpha$. That is, it is the region with probability coverage 1 − α in which every point has a higher density estimate than every point outside the region.
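A minimal sketch of this computation, assuming the plug-in estimate of f_α used by Hyndman [1996] (the α quantile of the density values at the sample points) and MASS::kde2d for the product-kernel estimate, with `z` as in the ordering sketch above:

```r
# Estimate f_alpha and flag points outside the 95% HDR.
library(MASS)
fit <- kde2d(z[, 1], z[, 2], n = 100)   # product-kernel density on a 100 x 100 grid
ix  <- findInterval(z[, 1], fit$x)      # nearest grid cell for each observation
iy  <- findInterval(z[, 2], fit$y)
fz  <- fit$z[cbind(ix, iy)]             # density estimate at each z_i
f_alpha <- quantile(fz, probs = 0.05)   # plug-in threshold: ~95% coverage
which(fz < f_alpha)                     # points outside the 95% HDR
```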
The advantage of ranking by the HDR is its ability to show multimodality in the bivariate data. The HDR boxplot displays the mode, defined as the point maximizing $\hat{f}(z; a, b)$, along with the 50% HDR and the 95% HDR. All points not included in the 95% HDR are flagged as outliers.
Both methods identified the same outliers. Of the two new methods, Hyndman and Shang [2010] prefer the functional HDR boxplot, as it provides an additional advantage: it can also identify unusual “inliers” that fall in sparse regions of the sample space.
Conclusions
The procedure of outlier identification would not be comprehensive without displaying the results graphically. In this paper we reviewed the most interesting approaches to outlier detection.
It is known that using robust (high-breakdown) estimators for location and covariance is also very effective in finding multivariate outliers. In particular, examining the structure of outliers found by high-breakdown estimators is a diagnostic effort that is often somewhat neglected. The distance-projection plot has the advantage of being quite easy to interpret, but there is always a chance that the “outlier-free” sample contains some outliers.
No single bivariate plot can reveal all multivariate structure, so different bivariate plots should be made, providing complementary information. Therefore, we recommend using different plots to better understand the structure, shape and dependencies of the data. Every method described in this paper is presented with an artificial or a real-data example.
References
Acuna E., Rodriguez C.A. (2004), Meta Analysis Study of Outlier Detection Methods in
Classification, Technical paper, University of Puerto Rico at Mayaguez, Proceed-
ings IPSI 2004, Venice.
Aguinis H., Gottfredson R.K., Joo H. (2013), Best-Practice Recommendations for Defining, Identifying, and Handling Outliers, “Organizational Research Methods”, 16(2), p. 270-301.
Barnett V., Lewis T. (1994), Outliers in Statistical Data (2nd Edition), John Wiley and Sons.
Becker C., Gather U. (1999), The Masking Breakdown Point of Multivariate Outlier Identi-
fication Rules, “Journal of the American Statistical Association” 94, p. 947-955.
Ben-Gal I. (2005), Outlier Detection [in:] O. Maimon, L. Rockach (eds.), Data Mining
and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Re-
searchers, Kluwer Academic Publishers.
Breunig M.M., Kriegel H.P., Ng R.T., Sander J. (2000), Identifying Density-based Local
Outliers, Proceedings ACMSIGMOD 2000, p. 93-104.
Booth D.E., Alam P., Ahkam S.N., Osyk B. (1989), A Robust Multivariate Procedure for
the Identification of Problem Savings and Loan Institutions, “Decision Sciences”, 20,
p. 320-333.
Butler R.W., Davies P.L. and Jhun M. (1993), Asymptotics for the Minimum Covariance
Determinant Estimator, “The Annals of Statistics”, 21, p. 1385-1400.
Caussinus H., Roiz A. (1990), Interesting Projections of Multidimensional Data by
Means of Generalized Component Analysis, COMPSTAT90, Physica-Verlag, Hei-
delberg, p. 121-126.
Croux C., Ruiz-Gazen A. (2005), High Breakdown Estimators for Principal Compo-
nents: The Projection-pursuit Approach Revisited, “Journal of Multivariate Analysis”,
95(1), p. 206-226.
Fawcett T., Provost F. (1997), Adaptive Fraud Detection, “Data-mining and Knowledge
Discovery”, 1(3), p. 291-316.
Filzmoser P., Maronna R., Werner M. (2008), Outlier Identification in High Dimensions,
“Computational Statistics and Data Analysis”, 52, p. 1694-1711.
Hadi A.S. (1992), Identifying Multiple Outliers in Multivariate Data, “Journal of the
Royal Statistical Society”, Series B, 54, p. 761-771.
Hawkins D.M. (1980), Identification of Outliers, Chapman and Hall, London.
Human Mortality Database (2015), University of California, Berkeley (USA), and Max
Planck Institute for Demographical Research (Germany), viewed 15/09/07, avail-
able online at: www.mortality.org.
Hyndman R.J. (1996), Computing and Graphing Highest Density Regions, “The American Statistician”, 50(2), p. 120-126.
Hyndman R.J., Shang H.L. (2010), Rainbow Plots, Bagplots, and Boxplots for Functional Data, “Journal of Computational and Graphical Statistics”, 19(1), p. 29-45.
Hyndman R.J., Ullah S. (2007), Robust Forecasting of Mortality and Fertility Rates: A Func-
tional Data Approach, “Computational Statistics and Data Analysis”, 51, p. 4942-4956.
Identification of Multivariate Outliers… 83
Iglewics B., Martinez J. (1982), Outlier Detection Using Robust Measures of Scale,
“Journal of Statistical Computation and Simulation”, 15, p. 285-293.
Peña D., Prieto F.J. (2001), Multivariate Outlier Detection and Robust Covariance
Matrix Estimation, “Technometrics”, 43, p. 286-300.
Penny K.I., Jolliffe I.T. (2001), A Comparison of Multivariate Outlier Detection Methods
for Clinical Laboratory Safety Data, “The Statistician”, 50(3), p. 295-308.
Rocke D.M., Woodruff D.L. (1996), Identification of Outliers in Multivariate Data,
“Journal of the American Statistical Association” 91, p. 1047-1061.
Rousseeuw P. (1985), Multivariate Estimation with High Breakdown Point [in:]
W. Grossmann et al. (eds.), “Mathematical Statistics and Applications”, Vol. B,
p. 283-297.
Rousseeuw P.J., Van Driessen K. (1999), A Fast Algorithm for the Minimum Covariance Determinant Estimator, “Technometrics”, 41(3), p. 212-223.
Rousseeuw P., Leroy A. (1987), Robust Regression and Outlier Detection, Wiley Series
in Probability and Statistics.
Rousseeuw P., Ruts I., Tukey J. (1999), The Bagplot: A Bivariate Boxplot, “The American
Statistician”, 53(4), p. 382-387.
Rousseeuw P.J., Zomeren B.C. van (1990), Unmasking Multivariate Outliers and Lever-
age Points, “Journal of the American Statistical Association”, 85(411), p. 633-651.
Schwager S.J., Margolin B.H. (1982), Detection of Multivariate Normal Outliers, “Annals of Statistics”, 10, p. 943-954.
Scott D.W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley and Sons, New York.