
Computational Statistics & Data Analysis 5 (1987) 239-253
North-Holland

An introduction to L1-norm based statistical data analysis

Yadolah DODGE
University of Neuchâtel, Neuchâtel, Switzerland

Received May 1987

Abstract: A brief introduction to statistical data analysis based on minimization of the L1-norm is given for those who are not familiar with the subject. A selected bibliography on L1-norm based statistical data analysis is provided.

Keywords: L1-norm, Least absolute deviations estimation, Minimum sum of absolute errors,
Minimum absolute deviations, Linear programming.

1. Introduction

In almost every sphere of human endeavor in the modern world, statistical technology plays an indispensable role because of the lack of certainty. The inferential aspects of statistical techniques have made them essential to the toolkit of anyone engaged in scientific enquiry, including economics, sociology, medicine, astronomy, business, psychology, education, industry, engineering and all other branches of the applied sciences. Model fitting, estimation, grouping and data classification, design and analysis of experiments, testing of statistical hypotheses, sample surveys, and time series are among the major statistical methods that are widely used in many areas of the applied sciences.
The underlying theories of these statistical methods contain elements of
optimization.
For example, to determine an estimator, we need a set of criteria by which its performance can be judged. By an estimator of a parameter, we mean a function of the observations which is closest to the true value in some sense. In choosing a criterion of estimation, one attempts to provide a measure of closeness of an estimator to the parameter, subject to some suitable constraints on the class of estimators. An optimum estimator in the restricted class is then determined by minimizing this measure of closeness.
As a second example, which also serves as a basis for some definitions, consider one of the most extensively discussed methods among the statistical tools available for the analysis of data: regression analysis.


A function f(X_1, ..., X_k) of the independent variables X = (X_1, ..., X_k) is called a predictor of the dependent variable Y under consideration. When the model used to explain the dependent variable in terms of the independent variables assumes a linear relationship between them, we have a linear regression model; otherwise we have a nonlinear regression model.
Suppose we assume that the relationship between two variables is linear (the simple case). That is, the functional relationship of Y and X is of the form

Y = β_0 + β_1 X + ε,    (1.1)

which is known as the simple linear regression of Y on X. β_0 and β_1 are called parameters, and they are to be found.
Equation (1.1) means that, for a given X_i, the corresponding Y_i consists of β_0 + β_1 X_i plus an ε_i by which the observation may fall off the true regression line. On the basis of the information available from the observations we would like to find β_0 and β_1. The term ε is a random variable called the 'error term'. From (1.1) we can write

ε_i = Y_i - β_0 - β_1 X_i.
Finding β_0 and β_1 from (X_i, Y_i), i = 1, 2, ..., n, is called estimation of the parameters. There are different methods of obtaining such estimates. Three such methods are the minimization of (1) the sum of squared deviations, (2) the sum of absolute deviations, and (3) the maximum of the absolute deviations. These three methods are members of the class of Lp-estimators, which are obtained by minimizing what is known as the Minkowski metric or Lp-norm (criterion), defined as

[ Σ_i |ε_i|^p ]^{1/p}    (1.2)

with p ≥ 1. If we set p = 1, we obtain what is known as the absolute or city-block metric or L1-norm. The minimization of this criterion (distance) is called, among other names, the least absolute values method. If p = 2, we have what is known as the Euclidean metric or L2-norm; the minimization of this distance is known as the least squares method, and the classical approach to the regression problem uses it. If we minimize the Lp-norm for p = ∞, we have the minimax or Chebyshev method.
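To make the three criteria concrete, here is a small sketch (the data and the candidate line are invented for illustration; they are not from the paper) that evaluates each criterion for one fixed candidate fit:

```python
import numpy as np

# Residuals of a candidate line beta0 = 0, beta1 = 1 on made-up data;
# the last observation is an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 8.0])
e = y - (0.0 + 1.0 * x)

print(np.sum(np.abs(e)))   # L1 criterion: least absolute values
print(np.sum(e ** 2))      # L2 criterion (squared): least squares
print(np.max(np.abs(e)))   # L-infinity criterion: Chebyshev / minimax
```

Note how the single outlier dominates the squared and maximum criteria far more than the absolute-value criterion.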
The root of the most popular method of estimating β = (β_0, β_1)', that is, least squares, goes back to Gauss or Legendre (see Stigler (1981) for a recent historical discussion). Laplace used the name 'most advantageous method'. The usual assumption required to use this method is that the error ε_i is normally distributed with mean zero and variance σ².
While the method of least squares (and its generalizations) has served statisticians well for a good many years (mainly because of mathematical convenience and ease of computation), and enjoys certain well-known properties within strictly Gaussian parametric models, it is recognized that outliers, which arise from heavy-tailed distributions, have an unusually large influence on the estimates obtained by this method. Indeed, a single outlier can have an arbitrarily large effect on the estimate. Consequently, robust methods
have been created to modify the least squares method so that outliers have much less influence on the final estimates. Since Box (1953) coined the term robustness, a number of excellent papers have been published on the subject. In 1964, Huber published what is now considered to be a classic paper on robust estimation of a location parameter. Huber's work was subsequently extended to the linear model by Andrews (1974), Bickel (1975) and Huber (1973), among others. Huber (1964) introduced the concept of the M-estimate (or maximum likelihood type estimate), based on the idea of replacing ε_i² in the minimization of Σ_i ε_i² by ρ(ε_i), where ρ is a symmetric function with a unique minimum at zero. Harter's (1974a, b, 1975a, b, c, 1976) monumental study provides a fascinating historical account of linear model estimation based on least squares and alternative methods.
One of the most satisfying robust alternatives to least squares is the least absolute values method. This method, which is the subject of this book, is a widely recognized superior robust method, especially well suited to longer-tailed error distributions such as the Laplace or the Cauchy. Increasingly, this method is recommended as a preliminary (consistent) estimator for one-step and iteratively reweighted least squares methods. Other advantages of least absolute values methods in robust estimation are explained by Huber (1987).
There are, however, many other robust procedures. For an excellent mathematical treatment of robustness, the reader is referred to Huber (1964, 1972, 1973, 1981) and Hampel et al. (1986).

2. Historical remarks

Depending on the field of application, the minimization of the L1-norm criterion has been studied in several contexts under a variety of names, such as: minimum or least sum of absolute errors (MSAE or LAE); minimum or least absolute deviations or errors or values (MAD¹, MAE, LAD, LAV¹); and the L1-norm method (for minimizing the L1-norm of the vector of deviations). See Bloomfield and Steiger (1983) for other terms.
LAV estimation, as mentioned above, is a technique that estimates the unknown parameters in a stochastic model so as to minimize the sum of the absolute deviations of a given set of observations from the values predicted by the model. That is,

Minimize_β  Σ_i |Y_i - f(X_i, β)|

for observations Y_i and a model function f(X_i, β) depending on an observable (non-random) vector variable X_i and a vector parameter β. For the model f(X_i, β) = β_0, the LAV estimator is the sample median.

¹ Whether one uses MAD or LAV (sounds like 'love') as the abbreviation, they mean the same thing, since love is a kind of MADness anyhow.
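A quick numerical check of this median fact (a sketch; the grid search is purely illustrative, since the minimizer can be read off directly as the sample median):

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])

# Grid search for the constant c minimizing sum_i |y_i - c|.
grid = np.linspace(y.min(), y.max(), 10001)
sad = np.abs(y[:, None] - grid[None, :]).sum(axis=0)

print(grid[np.argmin(sad)])  # approximately 3.0
print(np.median(y))          # 3.0
```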
Historically, LAV estimation is probably the oldest of all robust methods. The origin of the LAV estimator can be traced back to Galilei (1632), who, in determining the position of a newly discovered star in his "Dialogo dei massimi sistemi", proposed the least possible correction in order to obtain a reliable result for the problem. However, the LAV estimator in a simple model was suggested and studied by Boscovich (1757) and Laplace (1793). Laplace called it the 'method of situations' (Stigler, 1973). In that paper Stigler discusses Laplace's work as well as that of R.A. Fisher, who studied the loss of information of the LAV estimator in the double exponential model, for which it is, however, the maximum likelihood estimator.
After nearly seventy years of silence following the publication of Laplace's second supplement to the Théorie Analytique des Probabilités (1818), Edgeworth (1887) presented a method for linear regression using the LAV method. Farebrother (1987) discusses several contributions of Laplace, Gauss, Edgeworth and others to the geometrical, graphical and analytical solution of the L1 and L∞ estimation problems. Via many worked-out examples, he illuminates many vacant periods in the chronological development of the L1 and L∞ methods that were otherwise unclear. Gentle (1977) and Bloomfield and Steiger (1983) provide excellent bibliographical notes on LAV estimation.

3. Computational algorithms

Since the publication of Edgeworth's work, few attempts were made to convince statisticians, and particularly applied users, to employ this method (see Turner (1887), Rhodes (1930), Singleton (1940) and Karst (1958)). The reasons for such a long silence may be summarized as follows:
(1) Computational difficulties in producing the numerical values of LAV estimates in regression (the lack of closed-form formulae similar to those of least squares).
(2) The lack of an asymptotic theory for LAV estimation in the regression model, and in general the nonexistence of accompanying statistical inference procedures.
(3) Insufficient evidence of the superiority of the small-sample properties of LAV estimators over least squares estimators when sampling from long-tailed distributions.
Following the work of Charnes, Cooper and Ferguson (1955), renewed interest in using LAV estimation for regression problems was created. They showed the equivalence between the LAV problem and a linear programming problem. Wagner (1959) suggested that the LAV problem can be solved via its dual. He also observed that the dual can be reduced to a problem with a smaller basis, but that the dual variables then have upper-bound restrictions. Wagner's formulation of the problem is as follows.
Consider the problem of minimizing Σ_i |ε_i| with respect to β, where ε_i is the deviation between the observed and the predicted value of Y for the ith observation:

Minimize  Σ_i |ε_i|
subject to  Xβ + ε = Y,    (3.1)
ε, β unrestricted in sign.

Noting that |ε_i| = ε_{1i} + ε_{2i}, where ε_{1i} and ε_{2i} are nonnegative and ε_i = ε_{1i} - ε_{2i}, we can reformulate (3.1) as a linear programming problem:

Minimize  Σ_i ε_{1i} + Σ_i ε_{2i}
subject to  Xβ + ε_1 - ε_2 = Y,    (3.2)
β unrestricted in sign, ε_1, ε_2 ≥ 0.
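Formulation (3.2) can be handed directly to any LP solver. The following sketch (an assumption on tooling: SciPy's linprog, which is not from the paper) sets up exactly the variables of (3.2), with β free and ε_1, ε_2 ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def lav_fit(X, y):
    """Solve (3.2): min sum(e1 + e2) s.t. X b + e1 - e2 = y, e1, e2 >= 0."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])    # cost on e1, e2 only
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])         # X b + e1 - e2 = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)  # b free; e1, e2 >= 0
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:k]

# Simple linear regression with an intercept column and Laplace errors.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.laplace(0.0, 1.0, 50)
print(lav_fit(np.column_stack([np.ones_like(x), x]), y))  # near [2.0, 0.5]
```

A general-purpose LP solver ignores the special structure of (3.2); the special-purpose algorithms discussed below exploit that structure to run much faster.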

From the computational point of view, Barrodale and Roberts (1973) presented a modified simplex algorithm that bypasses some iterations. The LAV estimation problem with additional linear restrictions (the restricted LAV problem) is treated along the same lines in Barrodale and Roberts (1978). Special purpose algorithms for the LAV problem have also been given by Armstrong and Hultz (1977) and Bartels and Conn (1977). Computer comparisons have established the Barrodale and Roberts algorithm as an efficient method for solving the LAV problem. Armstrong, Elam and Hultz (1977) considered the problem of two-way classification models using L1-norm minimization. They demonstrated the equivalence between the problem of obtaining LAV estimates for the two-way classification model and a capacitated transportation problem, and developed a special purpose algorithm to provide LAV estimates. Their method, along with a real-life example, can be found in Arthanari and Dodge (1981). Bloomfield and Steiger (1983) devote a complete chapter to LAV in multi-way tables (classification models), in which they also discuss the relationship between Tukey's (1977) median polish and LAV fitting. A revised simplex version of this algorithm by Armstrong, Frome and Kung (1979) is claimed to be even more efficient than the Barrodale and Roberts algorithm.
In 1981, Arthanari and Dodge devoted nearly a complete chapter to computational aspects of LAV estimation. Bloomfield and Steiger (1983) described and compared three algorithms for LAV estimation, due to Barrodale and Roberts (1973, 1974), Bartels, Conn and Sinclair (1978) and Bloomfield and Steiger (1980).
For recent developments in computational algorithms, the interested reader may refer to Dielman (1984) and Narula (1987). Gentle, Narula and Sposito (1987) compared the codes that are openly available for solving unconstrained L1 linear regression problems, taking CPU time on scalar and virtual memory machines as the basis of comparison. For the simple linear regression model (two parameters), they tested the programs of Armstrong and Kung (AK) (1978), Josvanger and Sposito (JS) (1983), Abdelmalek (A) (1980), Armstrong, Frome and Kung (AFK) (1979) and Bloomfield and Steiger (BS) (1980); for more than two parameters they considered only A, AFK and BS. They recommended the AFK algorithm (all things considered) as the best.
Ekblom (1987) presents algorithms to compute Lp and Huber estimates. He also examines and points out some difficulties in applying these algorithms as p approaches 1 for the Lp-estimates. The reduced gradient algorithm for minimizing polyhedral convex functions is presented by Osborne (1987). Gonin and Money (1987) present a complete and systematic review of computational methods for solving the nonlinear L1-norm estimation problem.
Probably with new improved algorithms for linear programming, such as those of Khachian (1979) and Karmarkar (see Meketon (1986)), new possibilities will also open up for LAV estimation.

4. Sampling distribution of LAV estimates

The behaviour of LAV estimators has been considered under different conditions. Ashar and Wallace (1963), Rice and White (1964), Glahe and Hunt (1970), Kiountouzis (1973), Bourdon (1974), Rosenberg and Carlson (1977), and Pfaffenberger and Dinkel (1978), among others, have tried to find the distribution of LAV estimators using the Monte Carlo method. For a theoretical basis for comparing statistical methods based on the L1-norm with methods based on other possible distances, see Vajda (1987).
A method of obtaining unbiased LAV estimators, when the error distribution is symmetric and the estimates may not be unique, was given by Sielken and Hartley (1973).
Taylor (1973) suggested a combination of LAV and least squares in which LAV is applied first, as a means of identifying outliers to be trimmed, and least squares is then applied after the trimming has been done. He also gives excellent arguments for the use of LAV in econometric analysis. Ronner (1984) developed the consistency and asymptotic normality of p-norm estimators (1 < p < 2). Arthanari and Dodge (1981) introduced a method of estimation in the linear model based on a convex combination of the LAV and least squares estimators. This idea was further extended in Dodge (1984) to the convex combination of Huber's M-estimator and the LAV estimator. Dodge and Jureckova (1987) showed that this convex combination can be adapted in such a way that it minimizes a consistent estimator of the asymptotic variance of the estimator under consideration. Wilson (1978), via Monte Carlo sampling, investigated cases in which the disturbances are normally distributed with constant variance except for one or more outliers whose disturbances are generated from a normal distribution with larger variance. Among other results, he found that LAV estimation retains its advantage over least squares under varying conditions, such as variations in outlier variance, the number of independent variables, the number of observations, and the number of outliers.
Rosenberg and Carlson (1977) attempted to find the distribution of β̂ on the basis of an extensive Monte Carlo study. Using symmetrical disturbance distributions for the error, they found that the distribution of β̂ is approximately multivariate normal, with mean β and covariance matrix λ²(X'X)^{-1}, where λ²/T is the variance of the median of a sample of size T drawn from the disturbance distribution. They conclude that LAV appears to be a feasible alternative to least squares in regression with high-kurtosis disturbances; the LAV estimator does have a significantly smaller standard error than the least squares estimator in that setting.
Bassett and Koenker (1978) developed the asymptotic theory for LAV estimators in the regression model. Their finding is considered a breakthrough for the problem. Their main result is that the sampling distribution of LAV estimators is asymptotically normal with a specified mean and variance. Moreover, they showed that the LAV estimators are consistent. Under very general assumptions they confirmed that the LAV estimator β̂ asymptotically has a normal distribution with mean β and covariance matrix λ²(X'X)^{-1}, where λ²/T is the asymptotic variance of the sample median of random samples of size T taken from the error distribution. Dupacova (1987) obtains excellent results on the consistency and asymptotic normality of restricted LAV estimates.
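As a rough illustration of this result (a sketch, not a computation from the paper; it reuses the hypothetical lav_fit from the Section 3 sketch), a small Monte Carlo run with standard Laplace errors can be compared against λ²(X'X)^{-1}:

```python
import numpy as np

# Reuses lav_fit from the linear programming sketch in Section 3.
rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.uniform(0.0, 10.0, T)])
beta = np.array([1.0, 2.0])

slopes = [lav_fit(X, X @ beta + rng.laplace(0.0, 1.0, T))[1]
          for _ in range(500)]

# For standard Laplace errors, the sample median of T observations has
# asymptotic variance lambda^2 / T with lambda^2 = 1 / (4 f(0)^2) = 1.
print(np.var(slopes))                # Monte Carlo variance of the slope
print(np.linalg.inv(X.T @ X)[1, 1])  # slope entry of lambda^2 (X'X)^{-1}
```

The two printed numbers should be close, in line with the Bassett and Koenker asymptotics.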

5. Statistical inference procedures

Probably the major difficulty for applied researchers in using LAV estimation was, for many years, the lack of accompanying statistical inference procedures. Such procedures would include methods for testing general linear hypotheses, obtaining confidence intervals, constructing analysis of variance tables and performing multiple comparison procedures.
Koenker and Bassett (1982) investigate the asymptotic distribution of three alternative L1 test statistics for a linear hypothesis in the standard linear model. They show that the three test statistics, which correspond to the Wald, likelihood ratio and Lagrange multiplier tests, have the same limiting chi-square behavior under mild regularity conditions on the design and the error distribution. A very nice summary of LAV estimation, including computational algorithms, small and large (asymptotic) sample properties, confidence intervals and hypothesis testing in linear regression, is given by Dielman and Pfaffenberger (1982), who also suggest some future lines of research on LAV estimation in regression. Koenker (1987) compares the small-sample performance of these three L1 test statistics in a two-way model with interaction for testing the hypothesis of no interaction.
In a series of articles, McKean and Hettmansperger (1976), Schrader (1976), McKean and Schrader (1977) and Schrader and Hettmansperger (1980) adapted many robust inference procedures which are very similar to the classical analysis of variance. McKean and Schrader (1987) present an LAV analysis of the general linear model that is as complete and unified as the least squares analysis of variance. For testing a general linear hypothesis, the procedure is to replace the classical reduction in the sum of squared residuals by the reduction in the sum of absolute residuals. Consequently, they produce an LAV analysis of variance table which summarizes the LAV test of hypothesis quite like its least squares counterpart. Apart from the likelihood ratio type test statistic, they also present a Wald-type and a score-type test statistic for hypothesis testing purposes.
Since, in small samples, various asymptotically equivalent test statistics differ dramatically with respect to the stability of their significance level and power, Schrader and McKean (1987) and Stangenhaus (1987) present methods of selecting critical values of test statistics based on Efron's bootstrap procedure.
Fedorov (1987), among other discrepancy measures, considers the L1-norm in experimental design for model testing problems. Since these measures lead to identical optimal designs, one may choose to work with the most convenient criterion of optimality. However, as Fedorov concludes, the use of the L1-norm criterion allows one to draw on a number of classical results from function approximation theory.

6. Density estimation

There are problems in which one must estimate a density f(x) given a sequence of independent identically distributed random variables x_1, x_2, ..., x_n with unknown common probability density function f(x). Parzen (1962) gave a method in which one selects a kernel function k(x) ≥ 0 such that

∫ k(x) dx = 1.

After selecting a k(x), one can estimate the density function by

f̂(x) = (1/(nh)) Σ_{j=1}^{n} k[(x - x_j)/h].
However, choosing the scale factor h (the bandwidth) for a given kernel is not all that easy. There are many asymptotic results on how h should be selected in order to obtain the best estimate of the density. As suggested by Parzen, h must be a function of n such that, as n tends to infinity, h tends to 0 and nh → ∞. But it is obvious that the optimal h depends on f(x). Among other results, Dodge (1986) showed how the optimal asymptotic value of h varies as a function of x for different densities, in comparison with practical situations. When the number of observations reaches 2000 in the case of the normal distribution with mean 0 and variance 1, the optimal value of h reaches 0.54, which shows the slow convergence of h as n goes towards infinity.
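A minimal sketch of this estimator with a Gaussian kernel (the kernel choice and the data are assumptions for illustration; Parzen's framework allows any kernel k ≥ 0 integrating to one):

```python
import numpy as np

def kde(x, sample, h):
    """Parzen estimate (1/(n h)) * sum_j k((x - x_j) / h), Gaussian k."""
    u = (x[:, None] - sample[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 2000)   # n = 2000 draws from N(0, 1)
grid = np.linspace(-3.0, 3.0, 7)
print(kde(grid, sample, h=0.54))      # h = 0.54, the value quoted above
```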
Recently, Devroye and Györfi (1985) published a complete manuscript entitled "Nonparametric Density Estimation: The L1 View". In it, they develop a smooth L1 theory, since the better-studied L2 theory has led to various anomalies and misconceptions. Their choice of the L1-norm is also motivated by its invariance under monotone transformations of the coordinate axes and by the fact that it is always well defined.
For uniformly mixing samples and for strongly mixing samples, the distribution-free asymptotic L1 consistency of kernel and histogram density estimates is proved by Györfi (1987). For independent samples, Devroye (1983) gives a complete characterization of the L1 consistency of kernel estimates.

7. Cluster analysis

One of the problems in multivariate data analysis is that of cluster analysis. Problems which involve grouping a certain number of entities into a certain number of groups, in some sense optimally, arise in various fields of scientific inquiry.
Given a set N of n entities, we wish to partition this set into a number of subsets in a way that is optimal with respect to a chosen criterion function defined on the set of all partitions.
In general, to speak of a clustering problem we must suppose that for each element of the set N there is a vector giving the measurements on the characteristics, or the numerical codes for the attributes, that we wish to use as input information for grouping. Thus we can assume that X_i = (X_{1i}, ..., X_{ki}) ∈ R^k is available for all i ∈ N. Next, we also require that a distance be defined between any two elements of N.
A real-valued function d_ij of i, j ∈ N is said to be a metric or distance on N if it satisfies the following conditions:
(1) d_ij ≥ 0, and d_ij = 0 if i = j;
(2) d_ij = d_ji;
(3) d_ik + d_kj ≥ d_ij for all i, j, k ∈ N.
Let d_ij denote the distance between i and j for i, j ∈ N. The matrix of the d_ij is also called the dissimilarity matrix.
Given a subset A of N, we define a real-valued function of the elements belonging to A, denoted by τ(A). In many cases the function τ(A) is a function of the d_ij's. One such criterion is given below:

τ(A) = Σ_{i<j; i,j∈A} d_ij  if λ ≥ 2,  and τ(A) = 0 otherwise,

where λ is the number of elements in A.


Given τ(·) and a real number h called the 'threshold level', a cluster A ⊆ N can be defined by τ(A) ≤ h, with A maximal in the sense that τ(A ∪ {i}) > h for any i ∉ A, i ∈ N. We may then be interested in finding all clusters for a given τ(·) and h. Usually we wish to find disjoint clusters. In some cases this disjointness restriction on clusters is relaxed; then we are said to be seeking overlapping clusters. In these problems the number of clusters is not prespecified.
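For a tiny dissimilarity matrix, such maximal clusters can be found by brute-force enumeration (an illustration only; the subset enumeration is exponential in n and is not a practical algorithm):

```python
from itertools import combinations
import numpy as np

def tau(A, d):
    """Sum of pairwise dissimilarities within A (0 if |A| < 2)."""
    return sum(d[i, j] for i, j in combinations(sorted(A), 2))

def threshold_clusters(d, h):
    """All maximal subsets A of {0, ..., n-1} with tau(A) <= h."""
    n = d.shape[0]
    small = [set(A) for r in range(1, n + 1)
             for A in combinations(range(n), r) if tau(A, d) <= h]
    return [A for A in small
            if all(tau(A | {i}, d) > h for i in range(n) if i not in A)]

d = np.array([[0, 1, 1, 9],
              [1, 0, 1, 9],
              [1, 1, 0, 9],
              [9, 9, 9, 0]], dtype=float)
print(threshold_clusters(d, h=3.0))   # [{3}, {0, 1, 2}]
```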
However, if we specify the number of clusters m for the set N, we have the problem of choosing the best partition of N into m clusters according to a certain objective criterion. Such problems have the following general form:
Let C be a real-valued function defined over R^m. Let T_J = {τ(J_1), ..., τ(J_m)}, where J_1, ..., J_m are the m clusters such that N = ∪_{l=1}^{m} J_l and J = {J_1, ..., J_m}, and let G = {J | J is a partition of N}. Then we want to

Optimize_{J∈G}  C(T_J).

Depending on the form of C(T_J), we have different clustering problems. For example, when C(T_J) = Σ_{l=1}^{m} τ(J_l) for any partition J of N, with τ(A) = Σ_{i<j; i,j∈A} d_ij², we have the problem of minimizing the total within-group sum of squares of the distances.
Methods available for clustering (apart from those which use probability distribution assumptions on the characterization vectors) can be broadly divided into two categories: classical methods (such as hierarchical ones) and mathematical programming methods. The primary charm of the first category is their computational ease. However, these methods suffer from a basic deficiency: they generally search for a locally optimal clustering. These advantages and disadvantages are shared by other similar procedures available for solving clustering problems. Thus one should consider methods that can produce optimal solutions to clustering problems.
Clustering techniques which form clusters by optimizing a clustering criterion face the curse of mathematical programming: the local optimum may not be the global optimum. Obviously, all possible partitions of the original set of objects or elements must be considered before we can conclude that the local optimum at hand is really the overall best. However, the enormously large number of possible partitions prohibits the use of complete enumeration for this purpose. The need for algorithms that can find the globally optimal clustering has been filled to some extent by the application of integer programming (Vinod (1969)), dynamic programming (Jensen (1969)) and partial-enumeration techniques known as branch-and-bound methods (Arthanari and Dodge (1981)).
Kaufman and Rousseeuw (1987) introduced an L1-type criterion, the m-medoid (m-median) method, for clustering. It searches for m representative objects, called medoids, which minimize the average dissimilarity of all objects of the data set to the nearest medoid. A cluster is then defined as the set of objects which have been assigned to the same medoid.
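A rough sketch of the medoid idea (an assumption: a simple alternating heuristic in the spirit of the method, not Kaufman and Rousseeuw's actual algorithm):

```python
import numpy as np

def m_medoids(d, m, iters=20):
    """Alternate nearest-medoid assignment with medoid updates."""
    medoids = np.arange(m)                 # naive initialization
    for _ in range(iters):
        labels = np.argmin(d[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(m):
            members = np.where(labels == c)[0]
            if members.size:
                # New medoid: the member with the least total
                # dissimilarity to the rest of its cluster.
                within = d[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
d = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)  # L1 dissimilarity
medoids, labels = m_medoids(d, m=2)
print(medoids, np.bincount(labels))    # typically two groups of 20
```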
Since the L1-norm criterion is used for clustering, the medoid method is less susceptible to outlying values. However, as the authors themselves point out, their approach does not necessarily provide the optimal solution to the problem.
Späth (1987) proposes the L1-norm as a criterion for clustering problems. He also provides an overview of software implementing the programming approaches. Trauwaert (1987) suggests an L1-norm version of the fuzzy ISODATA procedure developed by Dunn (1974). In the presence of outliers or data errors, he shows a great superiority of the L1-norm over the L2-norm.

8. Concluding remarks

With the availability of many computationally efficient algorithms and recently developed statistical inference procedures (such as tests of general linear hypotheses, confidence intervals, selection of variables, ANOVA tables and multiple comparisons) for statistical data analysis based on the L1-norm, it is hoped that LAV estimation methods will be employed more often than before by researchers in the applied sciences.
Certainly, there are many other areas of statistical data analysis based on the L1-norm (such as time series) that could have been discussed here. Unfortunately, limitations of space and time have caused many interesting and important lines of research to be treated too lightly or not at all. While it is now evident that no single robust procedure is best by every criterion, it may be reasonable to use multiple criteria, or an adaptive convex combination of LAV with other Lp-norm methods, rather than a single criterion to estimate the unknown parameters. However, for error distributions for which the median is superior to the mean as an estimator of location, LAV estimation is certainly preferable to least squares, and its use in these cases is strongly recommended.
From 1757 to date, considerable progress has been made in this field and a substantial amount of knowledge has been accumulated. But the know-how is too spread out. A collective effort is therefore needed to bring all of the past learning together and to help give birth to a coherent field. A collective effort is also probably needed to simplify the theory, in order to open further possibilities for applications, which have a long history in economics but are probably less well known in other fields.
However, for the L1-norm method to equal and surpass the prestige and ubiquity which the L2-norm method has achieved today, troubled waters lie ahead, which may be cleared by the following guidelines:
(i) The development of statistical packages for the different branches of statistical data analysis based on the L1-norm.
(ii) The writing of elementary books, and the introduction of undergraduate statistics courses, based on the L1-norm. This should be accompanied by the addition of new chapters or sections on L1-norm based statistical data analysis in revised versions of existing statistics books.
Bloomfield and Steiger (1983) and Devroye and Györfi (1985) are the only books available in this field. The authors of both manuscripts had the courage to pull together a rich and diverse literature. I hope that this issue of Computational Statistics & Data Analysis will encourage someone to write a third book.
(iii) Further research should be continued in the areas of inference techniques, such as multiple comparison procedures for classification models, variance components, cluster analysis (using mathematical programming approaches with the L1-norm as the objective function), the tabulation of significance values of test statistics for hypothesis testing in classification models, and analysis of variance in small samples.

In summary, the main goal is to be able to apply the L1-norm wherever the L2-norm was used before.

References

N.N. Abdelmalek (1980), A Fortran subroutine for the L1 solution of overdetermined systems of linear equations, ACM Trans. Math. Software 6, 228-230.
D.F. Andrews (1974), A robust method for multiple linear regression, Technometrics 16, 523-531.
R.D. Armstrong and E.L. Frome (1977), A special purpose linear programming algorithm for obtaining least absolute value estimates in a linear model with dummy variables, Commun. Statist. B6, 383-398.
R.D. Armstrong and J.W. Hultz (1977), An algorithm for a restricted discrete approximation problem in the L1 norm, SIAM J. Numer. Anal. 14, 555.
R.D. Armstrong and D.S. Kung (1978), AS132: Least absolute value estimates for a simple linear regression problem, Appl. Statist. 27, 363-366.
R.D. Armstrong, J.J. Elam and J.W. Hultz (1977), Obtaining least absolute value estimates for a two-way classification model, Commun. Statist. B6, 365-381.
R.D. Armstrong, E.L. Frome and D.S. Kung (1979), A revised simplex algorithm for the absolute deviation curve fitting problem, Commun. Statist. B8, 175.
T.S. Arthanari and Y. Dodge (1981), Mathematical Programming in Statistics (John Wiley, Interscience Division, New York).
V.G. Ashar and T.D. Wallace (1963), A sampling study of minimum absolute deviations estimators, Oper. Res. 11, 747.
I. Barrodale and F.D.K. Roberts (1973), An improved algorithm for discrete L1 linear approximation, SIAM J. Numer. Anal. 10, 839.
I. Barrodale and F.D.K. Roberts (1974), Algorithm 478: Solution of an overdetermined system of equations in the L1 norm, Commun. Assoc. Comput. Mach. 17, 319.
I. Barrodale and F.D.K. Roberts (1978), An efficient algorithm for discrete L1 linear approximation with linear constraints, SIAM J. Numer. Anal. 15, 603.
R.H. Bartels and A.R. Conn (1977), LAV regression: A special case of piecewise linear minimization, Commun. Statist. B6, 329.
R.H. Bartels, A.R. Conn and J.W. Sinclair (1978), Minimization techniques for piecewise differentiable functions: The L1 solution to an overdetermined linear system, SIAM J. Numer. Anal. 15, 224.
G.W. Bassett and R. Koenker (1978), Asymptotic theory of least absolute error, J. Amer. Statist.
Assoc. 73, 618-622.
P.J. Bickel (1975), One-step Huber estimates in the linear model, J. Amer. Statist. Assoc. 70, 428-434.
P. Bloomfield and W. Steiger (1980), Least absolute deviations curve-fitting, SIAM J. Sci. Statist.
Comput. 1, 290-300.
P. Bloomfield and W.L. Steiger (1983), Least Absolute Deviations: Theory, Applications, and
Algorithms (Birkhauser, Boston).
R.J. Boscovich (1757), De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura eius ex exemplaria etiam sensorum impressa, Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii 4, 353-396.
G.A. Bourdon (1974), A Monte Carlo sampling study for further testing of the robust regression
procedure based upon the kurtosis of the least squares residuals, Unpublished M.S. thesis, Air
Force Institute of Technology, Wright-Patterson AFB, Ohio.
G.E.P. Box (1953), Non-normality and tests on variances, Biometrika 40, 318-334.
A. Charnes, W.W. Cooper and R.O. Ferguson (1955), Optimal estimation of executive compensation by linear programming, Manage. Sci. 1, 138.

L. Devroye (1983), The equivalence of weak, strong and complete convergence in L1 for kernel density estimates, Annals of Statistics 11, 896-904.
L. Devroye and L. Györfi (1985), Nonparametric Density Estimation: The L1 View (John Wiley, New York).
T.E. Dielman (1984), Least absolute value estimation in regression models: An annotated bibliography, Commun. Statist. 13, 513-541.
T. Dielman and R. Pfaffenberger (1982), LAV (least absolute value) estimation in linear regression:
A review, in: S.H. Zanakis and J.S. Rustagi (Eds.), Optimization in Statistics (North-Holland,
Amsterdam).
Y. Dodge (1986), Some difficulties involving nonparametric estimation of a density function, J. of Official Statistics 2(2), 193-202.
Y. Dodge and J. Jureckova (1987), Adaptive combination of least squares and least absolute deviations estimators, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
J.C. Dunn (1974), A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. of Cybernetics 3(3), 32-57.
J. Dupacova (1987), Asymptotic properties of restricted L1-estimates of regression, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
F.Y. Edgeworth (1887), On observations relating to several quantities, Phil. Mag. (5th Series) 24,
222.
H. Ekblom (1987), The L1-estimate as limiting case of an Lp- or Huber-estimate, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R.W. Farebrother (1987), Mechanical representations of the L1 and L2 estimation problems, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R.W. Farebrother (1987), The historical development of the L1 and L∞ estimation procedures, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
V.V. Fedorov (1987), Various discrepancy measures in model testing (two competing regression models), in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
A. Gaivoronski (1987), Numerical techniques for finding estimates which minimize the upper bound of the absolute deviation, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
G. Galilei (1632), Dialogo dei massimi sistemi.
J.E. Gentle (1977), Least absolute values estimation: An introduction, Commun. Statist. B6, 313-328.
J.E. Gentle, S.C. Narula and V.A. Sposito (1987), Algorithms for unconstrained L1 linear regression, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
F.R. Glahe and J.G. Hunt (1970), The small sample properties of simultaneous equation least absolute estimators vis-à-vis least squares estimators, Econometrica 38, 742.
R. Gonin and A.H. Money (1987), A review of computational methods for solving the nonlinear L1-norm estimation problem, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R. Gonin and A.H. Money (1987), Outliers in physical processes: L1- or adaptive Lp-norm estimation?, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
L. Györfi (1987), Density estimation from dependent samples, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
F.R. Hampel (1968), Contributions to the theory of robust estimation, Ph.D. thesis, University of California, Berkeley.

F.R. Hampel (1971), A general qualitative definition of robustness, Annals of Math. Statistics 42, 1887-1896.
F. Hampel, E. Ronchetti, P. Rousseeuw and W. Stahel (1986), Robust Statistics: The Approach Based on Influence Functions (John Wiley, New York).
H.L. Harter (1974a), The method of least squares and some alternatives I, Int. Stat. Rev. 42, 147.
H.L. Harter (1974b), The method of least squares and some alternatives II, Int. Stat. Rev. 42, 235.
H.L. Harter (1975a), The method of least squares and some alternatives III, Int. Stat. Rev. 43, 1.
H.L. Harter (1975b), The method of least squares and some alternatives IV, Int. Stat. Rev. 43, 125-190 and 273-278.
H.L. Harter (1975c), The method of least squares and some alternatives V, Int. Stat. Rev. 43, 269.
H.L. Harter (1976), The method of least squares and some alternatives VI, Int. Stat. Rev. 44, 113.
P.J. Huber (1964), Robust estimation of a location parameter, Ann. Math. Statist. 35, 73-101.
P.J. Huber (1972), Robust statistics: A review, Ann. Math. Statist. 43, 1041-1067.
P.J. Huber (1973), Robust regression: Asymptotics, conjectures, and Monte Carlo, Ann. Statist. 1, 799-821.
P.J. Huber (1981), Robust Statistics (John Wiley, New York).
P.J. Huber (1987), The place of the L1-norm in robust estimation, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R.E. Jensen (1969) A dynamic programming algorithm for cluster analysis, Oper. Res. 17, 1034.
L.A. Josvanger and V.A. Sposito (1983), L1-norm estimates for the simple regression problem, Commun. Statist. B12, 215-221.
O.J. Karst (1958), Linear curve fitting using least deviations, J. Amer. Statist. Assoc. 53, 118-132.
L. Kaufman and P.J. Rousseeuw (1987), Clustering by means of medoids, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
L.G. Khachian (1979), A polynomial algorithm for linear programming, Doklady Akad. Nauk USSR 244(5), 1093-1096.
E.A. Kiountouzis (1973), Linear programming techniques in regression analysis, Appl. Stat. 22, 69.
R. Koenker (1987), A comparison of asymptotic testing methods for L1-regression, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R. Koenker and G.W. Bassett (1982), Tests of linear hypotheses and L1 estimation, Econometrica 50, 1577-1583.
P.S. Laplace (1793), Sur quelques points du système du monde, Mémoires de l'Académie Royale des Sciences de Paris, 1-87; reprinted in Oeuvres Complètes de Laplace, Vol. 11 (Gauthier-Villars, Paris, 1895) 477-558.
P.S. Laplace (1812), Théorie analytique des probabilités (Mme Courcier, Paris, 1820); reprinted in Oeuvres Complètes de Laplace, Vol. 7 (Gauthier-Villars, Paris, 1886).
P.S. Laplace (1818), Deuxième supplément to Laplace (1812).
J.W. McKean and R.M. Schrader (1987), Least absolute errors analysis of variance, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
M.S. Meketon (1986), Least absolute value regression, Working Paper, AT&T Bell Laboratories, Holmdel, NJ.
S.C. Narula (1987), The minimum sum of absolute errors regression, J. Quality Tech. 19, 37-45.
M.R. Osborne (1987), The reduced gradient algorithm, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
E. Parzen (1962), On the estimation of a probability density function and the mode, Annals of Mathematical Statistics 33, 1065-1076.
R.C. Pfaffenberger and J.J. Dinkel (1978), Absolute deviations curve fitting: An alternative to least squares, in: H.A. David (Ed.), Contributions to Survey Sampling and Applied Statistics (Academic Press, New York).
E.C. Rhodes (1930), Reducing observations by the method of minimum deviations, Phil. Mag. (7th Series) 9, 974.

J.R. Rice and J.S. White (1964), Norms for smoothing and estimation, SIAM Rev. 6, 243.
A.E. Ronner (1977), p-norm estimators in a linear regression model, Ph.D. Thesis, Groningen, The Netherlands.
A.E. Ronner (1984), Asymptotic normality of p-norm estimators in multiple regression, Z.
Wahrscheinlichkeitstheorie Verw. Gebiete 66, 613-620.
B. Rosenberg and D. Carlson (1977), A simple approximation of the sampling distribution of least
absolute residuals regression estimates, Commun. Stat. B6, 421.
P.J. Rousseeuw (1987), An application of L1 to astronomy, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R.M. Schrader and J.W. McKean (1987), Small sample properties of least absolute errors analysis of variance, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
R.L. Sielken and H.O. Hartley (1973), Two linear programming algorithms for unbiased estimation
of linear models, J. Amer. Statist. Assoc. 68, 639.
R.R. Singleton (1940), A method of minimizing the sum of absolute values of deviations, Ann.
Math. Statist. 11, 301-310.
H. Späth (1987), Using the L1 norm within cluster analysis, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
G. Stangenhaus (1987), Bootstrap and inference procedures for L1 regression, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
S.M. Stigler (1973), Studies in the history of probability and statistics, XXXII: Laplace, Fisher, and the discovery of sufficiency, Biometrika 60, 439-445.
S.M. Stigler (1981), Gauss and the invention of least squares, Annals of Statistics 9, 465-474.
L.D. Taylor (1973), Estimation by minimizing the sum of absolute errors, in: P. Zarembka (Ed.),
Frontiers in Econometrics (Academic Press, New York) 169-190.
E. Trauwaert (1987), L1 in fuzzy clustering, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
J.W. Tukey (1977), Exploratory Data Analysis (Addison-Wesley, Reading, MA).
H.H. Turner (1887), On Mr. Edgeworth’s method of reducing observations relating to several
quantities, Phil. Mag. (5th Series) 24, 466-470.
I. Vajda (1987), L1-distances in statistical inference: Comparison of topological, functional and statistical properties, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods (North-Holland, Amsterdam).
H.D. Vinod (1969), Integer programming and theory of grouping, J. Amer. Statist. Assoc. 64, 506.
H.M. Wagner (1959), Linear programming techniques for regression analysis, J. Amer. Statist.
Assoc. 54, 206.
H.G. Wilson (1978), Least squares versus minimum absolute deviations estimation in linear models,
Decis. Sci. 9, 322.
