Data Science,
Classification, and
Related Methods
Studies in Classification, Data Analysis,
and Knowledge Organization
Springer Japan KK
Titles in the Series
M. Schader (Ed.)
Analyzing and Modeling Data
and Knowledge
Data Science,
Classification,
and Related Methods
Proceedings of the Fifth Conference of the International
Federation of Classification Societies (IFCS-96),
Kobe, Japan, March 27-30, 1996
Springer
Prof. Emeritus Chikio Hayashi
The Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan
This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other ways, and storage in data
banks.
© Springer Japan 1998
Originally published by Springer-Verlag Tokyo Berlin Heidelberg New York in 1998
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
Product liability: The publishers cannot guarantee the accuracy of any information about the
application of operative techniques and medications contained in this book. In every individual
case the user must check such information by consulting the relevant literature.
The volume covers a wide range of topics and perspectives in the growing field
of data science, including theoretical and methodological advances in domains
relating to data gathering, classification and clustering, exploratory and
multivariate data analysis, and knowledge discovery and seeking.
It gives a broad view of the state of the art and is intended for those in the
scientific community who either develop new data analysis methods or gather
data and use search tools for analyzing and interpreting large and complex data
sets. Presenting a wide field of applications, this book is of interest not only to
data analysts, mathematicians, and statisticians but also to scientists from many
areas and disciplines concerned with complex data: medicine, biology, space
science, geoscience, environmental science, information science, image and
pattern analysis, economics, statistics, social sciences, psychology, cognitive
science, behavioral science, marketing and survey research, data mining, and
knowledge organization.
The organizers of the conference are indebted to many industrial companies and
institutions that financially supported the conference:
Finally, the editors thank the staff of Springer-Verlag Tokyo for their support
and dedication and for the opportunity to publish this volume in the series
Studies in Classification, Data Analysis, and Knowledge Organization.
Conference President
Chikio Hayashi
Professor Emeritus
The Institute of Statistical Mathematics
LOCAL EDITORIAL BOARD
Chikio Hayashi The Institute of Statistical Mathematics
Keiji Yajima Science University of Tokyo
Noboru Ohsumi The Institute of Statistical Mathematics
Yutaka Tanaka Okayama University
Yasumasa Baba The Institute of Statistical Mathematics
Tadashi Imaizumi The Institute of Management and Information Science
Atsuhiro Hayashi The National Center for University Entrance Examination
Shoichi Ueda Ryukoku University
LIST OF REVIEWERS
TABLE OF CONTENTS
Preface
Conference Committee
Local Editorial Board
List of Reviewers
Summary: This paper surveys various ways in which probabilistic approaches can be
useful in partitional ('non-hierarchical') cluster analysis. Four basic distribution models
for 'clustering structures' are described in order to derive suitable clustering strategies.
They are exemplified for various special distribution cases, including dissimilarity data
and random similarity relations. A special section describes statistical tests for checking
the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and
comparing it to standard clustering situations (comparative assessment of classifications,
CAC).
1. Introduction
Consider a finite set O = {1, ..., n} of objects whose properties are characterized by
some observed or recorded data (a table, a data matrix, verbal descriptions) such
that 'similarities' or 'dissimilarities' which may exist among these objects can be
determined from these data. Cluster analysis provides formal algorithms for subdividing
the set O into a suitable number of homogeneous subsets, called clusters (classes,
groups etc.), such that all objects of the same cluster show approximately the same
class-specific properties while objects belonging to different classes behave differently
in terms of the underlying data (separation of clusters).
A range of clustering algorithms is based on a probabilistic point of view: Data
are considered as realizations of random variables, thus influenced by random errors
and natural fluctuations (variations), and even the finite set of objects O may be
considered as a random sample from an infinite universe (super-population). Then
any class or classification of objects must necessarily be defined in terms of probability
distributions for the data. This paper presents a survey of this probability-based part
of cluster analysis. While we can point only briefly to various topics, a more detailed
presentation with numerous references may be found in Bock (1974, 1977, 1985, 1987,
1989a, 1994, 1996a,b,c,d), Jain and Dubes (1988) and Milligan (1996); also see various
other papers in this volume, e.g., by Gordon, Hardy, Lapointe and Rasson.
Let us first specify the notation: The set of objects O = {1, ..., n} shall be partitioned
into a suitable number m of disjoint classes C_1, ..., C_m ⊂ O resulting in an m-partition
C = (C_1, ..., C_m) of O. In fact, we focus on partitional clusterings here, thus neglecting
hierarchical or overlapping classifications. We will consider three types of data:
1. A data matrix X = (x_kj)_{n×p} = (x_1, ..., x_n)' where for each object k ∈ O, p
(quantitative or qualitative) variables have been sampled and compiled into the
observation vector x_k = (x_k1, ..., x_kp)' (' denotes the transposition of a matrix).
2. A dissimilarity matrix D = (d_kl)_{n×n} where d_kl quantifies the dissimilarity
existing between two objects k and l (e.g., d_kl = ||x_k − x_l||²); typically, we have
0 = d_kk ≤ d_kl = d_lk < ∞ for all k, l ∈ O.
3. A similarity relation S = (s_kl)_{n×n} where s_kl = 1 or 0 if the objects k and l are
considered to be 'similar' or 'dissimilar', respectively.
In our probabilistic context, the observed x_k, d_kl and s_kl will be realizations of suitable
random variables X_k, D_kl and S_kl, respectively, whose probability distributions
describe the type and extent of the underlying clustering (or non-clustering) structure.
In contrast to deterministic (e.g., algorithmic or exploratory) approaches to cluster
analysis, this probabilistic framework can be helpful for the following purposes:
(1) Modeling clustering structures allowing for various shapes of clusters.
(2) Providing clustering criteria under more or less specified clustering assumptions
(to be distinguished from clustering algorithms that optimize these criteria).
(3) Describing and quantifying the performance or optimality of clustering methods
(e.g., in a decision-theoretic framework; see Remark 2.2).
(4) Investigating the asymptotic behaviour of clustering methods if, e.g., n approaches
∞ under a mixture model or under a 'randomness' hypothesis.
(5) Testing for the 'homogeneity' or 'randomness' of the data, either versus a general
hypothesis of 'non-randomness' or versus special clustering alternatives,
thus checking for the existence of a hidden clustering of the objects.
(6) Testing for the relevance of a calculated classification C* that has been obtained
by a special clustering algorithm.
(7) Determining the 'true' or an appropriate number m of classes (see Remark 2.1).
(8) Assessing the relevance of a special cluster C_j of objects that has been obtained
from a clustering algorithm (Gordon 1994).
In the following we will survey some of these topics in detail and consider several
special clustering models.
On the other hand, minimizing g with respect to C for a given parameter vector ϑ
yields the maximum-probability-assignment partition C(ϑ) := (C_1, ..., C_m) with
classes

C_i := { k ∈ O | f(x_k; ϑ_i) ≥ f(x_k; ϑ_j) for all j = 1, ..., m },   i = 1, ..., m   (2.4)

(with suitable rules for avoiding ties and empty classes). C(ϑ) can often be interpreted
as a minimum-distance partition and will be termed in this way here. Using (2.4),
the optimization problem (2.2) reduces to

γ(ϑ) := Σ_{k=1}^{n} min_{ν=1,...,m} { −log f(x_k; ϑ_ν) } → min over ϑ.   (2.5)
g_m(C) := Σ_{i=1}^{m} Σ_{k∈C_i} ||x_k − x̄_{C_i}||² → min over C, with minimum value g*_mn,   (2.7)

where the m.l. estimate of D_i is just the convex hull D̂_i = H(C_i) := conv{x_k | k ∈ C_i}
of the data points belonging to the class C_i, and minimization is over all m-partitions
with non-overlapping H(C_1), ..., H(C_m) with positive volumes (Bock 1997; see also
section 3.2.3). Note that such a model is applicable only if the presumptive clusters
are clearly separated by some empty space.
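For the spherical normal model, this alternating scheme — the assignment step (2.4) followed by class-wise m.l. estimation — is just the classical k-means algorithm for the variance criterion (2.7). A minimal sketch (plain NumPy; the function names and the toy data are our own, not from the paper):

```python
import numpy as np

def kmeans_ssq(X, m, n_iter=100, seed=0):
    """Minimize the variance criterion (2.7) by alternating
    (i) the minimum-distance assignment (2.4) and
    (ii) class-wise mean estimation (the m.l. step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(n_iter):
        # assignment step (2.4): nearest centroid <=> max f(x_k; theta_i)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(m)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    g_mn = ((X - centers[labels]) ** 2).sum()  # value of criterion (2.7)
    return labels, centers, g_mn

# toy example: two well-separated spherical normal clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers, g = kmeans_ssq(X, m=2)
print(centers, g)
```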
2.1.3 Qualitative data, contingency tables and entropy criteria
In the case of qualitative data where the j-th component X_kj of X_k takes its values
in a finite set X_j of alternatives (e.g., X_j = {0, 1} in the binary case) the observed
vectors x_1, ..., x_n belong to the Cartesian product X := ∏_{j=1}^{p} X_j which
corresponds to the cells of the p-dimensional contingency table N = (n_ξ)_{ξ∈X} = (n_{ξ1,...,ξp})
that contains as its entries the number of objects k ∈ O with the same data vector
x_k = ξ = (ξ_1, ..., ξ_p)' ∈ X. Thus any clustering C of O corresponds to a decomposition
of N = N_1 + ... + N_m into m class-specific sub-tables N_1, ..., N_m.
Loglinear models provide a convenient tool for describing the distribution of multivariate
qualitative data. A loglinear model for a vector X involves various parameters
(stacked into a vector ϑ) which are distinguished into main effects (roughly describing
the size of marginal frequencies) and interaction parameters (of various orders)
that describe the association or dependencies that might exist among the p components
of X. Then the distribution density (probability function) of X takes the
form f(ξ; ϑ) = P(X = ξ) = c(ϑ) · exp{z(ξ)'ϑ} where z(ξ) is a binary dummy vector
that picks from ϑ the interaction parameters which correspond to the cell ξ of the
contingency table (c(ϑ) is a norming factor). Assuming a fixed-partition model
with class-specific interaction vectors ϑ_1, ..., ϑ_m for the m classes, the m.l. method
yields the entropy clustering criterion
g(C, ϑ) := Σ_{1≤i≤j≤m} [ Σ_{k∈C_i, l∈C_j} ( −log f(d_kl / ϑ_ij) ) + n_ij · log ϑ_ij ] → min over C, ϑ   (2.11)

where for i = j the inner sum is over k < l only, and n_ij = |C_i| · |C_j| and
n_ii = |C_i|(|C_i| − 1)/2 is the number of terms in the inner sum for i ≠ j and i = j,
respectively. For exponentially distributed dissimilarities with f(d) = e^{−d} for d > 0,
this reduces to:

g(C, ϑ) = Σ_{1≤i≤j≤m} n_ij · [ D̄_{C_i,C_j} / ϑ_ij + log ϑ_ij ] → min over C, ϑ   (2.12)

where D̄_{C_i,C_j} = n_ij^{−1} Σ_{k∈C_i, l∈C_j} d_kl is the average dissimilarity between two
classes C_i and C_j, and D̄_{C_i,C_i} = n_ii^{−1} Σ_{k,l∈C_i, k<l} d_kl measures the heterogeneity
of C_i. Note that the unconstrained m.l. estimate for ϑ_ij is given by ϑ̂_ij = D̄_{C_i,C_j},
such that (2.12) reduces to the log-distance clustering criterion
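For illustration, plugging ϑ̂_ij = D̄_{C_i,C_j} into (2.12) gives Σ n_ij (1 + log D̄_{C_i,C_j}); the following sketch evaluates this value for a given partition of a dissimilarity matrix (all function and variable names are our own, and every class is assumed to contain at least two objects):

```python
import numpy as np

def avg_dissim(D, idx_i, idx_j):
    """Average dissimilarity D̄_{Ci,Cj}; for i = j use pairs k < l only."""
    if idx_i is idx_j:
        block = D[np.ix_(idx_i, idx_i)]
        n_ii = len(idx_i) * (len(idx_i) - 1) / 2
        return np.triu(block, k=1).sum() / n_ii, n_ii
    block = D[np.ix_(idx_i, idx_j)]
    return block.mean(), block.size

def criterion_2_12(D, labels):
    """Value of (2.12) with the unconstrained m.l. estimates
    theta_ij = D̄_{Ci,Cj} plugged in, giving sum of n_ij*(1 + log D̄)."""
    classes = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    total = 0.0
    for i, ci in enumerate(classes):
        for j in range(i, len(classes)):
            cj = classes[j]
            dbar, n_ij = avg_dissim(D, ci, ci if i == j else cj)
            total += n_ij * (1.0 + np.log(dbar))
    return total

# toy data: two clusters on a line, squared-distance dissimilarities
x = np.concatenate([np.random.default_rng(0).normal(0, 1, 20),
                    np.random.default_rng(1).normal(6, 1, 20)])
D = (x[:, None] - x[None, :]) ** 2
labels = np.array([0] * 20 + [1] * 20)
print(criterion_2_12(D, labels))
```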
(McLachlan and Basford 1988, Titterington et al. 1985). Whilst this criterion involves
no classification of objects, such a classification Ĉ = (Ĉ_1, ..., Ĉ_m) can be constructed
from the estimated parameters ϑ̂_i, π̂_i by using, in an additional stage, a plug-in
Bayesian rule that yields the classes
(Fahrmeir, Kaufmann and Pape 1980, Symons 1981, Anderson 1985). Obviously,
this adds an entropy term to the previous fixed-partition criterion (2.2). It can be
shown that the classes of an optimum m-partition C* are generated by the Bayesian
rule (2.18) (after replacing ϑ_i by the optimum values ϑ̂_i). Computationally, the
likelihood L(ϑ, π, C_1, ..., C_m; x_1, ..., x_n) can be successively increased by using a modified
k-means algorithm where, in the t-th iteration step, a new partition C(t) is obtained
by applying the Bayesian rule (2.18) with the previous parameter estimates ϑ_i^(t−1)
and π_i^(t−1) = |C_i^(t−1)|/n (obtained from the previous partition C(t−1)), in analogy to
the maximum-probability-assignment partition (2.4) (Fahrmeir et al. 1980, Celeux
and Diebolt 1985, Bock 1996a, Fahrmeir 1996).
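A compact sketch of this modified k-means (often called classification EM) for spherical normal components with common unit variance, assuming the Bayesian rule (2.18) assigns each object to the class maximizing π_i f(x_k; ϑ_i); implementation details are our own:

```python
import numpy as np

def cem(X, m, n_iter=50, seed=0):
    """Classification-EM style iteration: Bayesian-rule assignment (2.18)
    alternated with class-wise estimates of means and class proportions,
    for spherical normal components with common unit variance."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), m, replace=False)]
    pi = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # plug-in Bayesian rule: maximize log pi_i + log f(x_k; mu_i)
        score = np.log(pi)[None, :] - 0.5 * (
            (X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = score.argmax(axis=1)
        for i in range(m):
            if np.any(labels == i):
                mu[i] = X[labels == i].mean(axis=0)
        pi = np.bincount(labels, minlength=m) / len(X)
        pi = np.clip(pi, 1e-12, None)  # guard against empty classes
    return labels, mu, pi
```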
2.3 Modal clusters and density-contour clusters
Another group of clustering models, designed primarily for data points x_1, ..., x_n in
R^p, looks for those regions of R^p where these data points are locally concentrated,
or, alternatively, for the regions in which the density of points exceeds some given
threshold; these regions (or the corresponding clouds of points) can be used and
interpreted as 'classes' or 'clusters', especially in the context of pattern recognition
and image processing.
More specifically, let f(x) be the common (smooth) distribution density of X_1, ..., X_n
and define, for a threshold c > 0, by B(c) := {x ∈ R^p | f(x) ≥ c} the level-c region
of f. Then the connected components B_1(c), B_2(c), ... of B(c) are termed high-density
clusters (Bock 1974, 1996a) or density-contour clusters of f at the level c.
For increasing values of c, these clusters split, but also disappear, and insofar show
a pseudo-hierarchical structure. The unknown density f can be approximated,
e.g., by a kernel estimate f̂_n(x) obtained from the data x_1, ..., x_n, and corresponding
estimates B̂_1(c), B̂_2(c), ... ⊂ R^p are found from f̂_n. From these estimated regions,
a (non-exhaustive) clustering of objects or data points is obtained by defining the
clusters C_i(c) := {k ∈ O | f̂_n(x_k) ≥ c, x_k ∈ B̂_i(c)} = B̂_i(c) ∩ {x_1, ..., x_n}, i = 1, 2, .... Note that
a cluster C_i(c) can show a very general (even ramified) shape in R^p, and will be
particularly useful if, for a fixed sufficiently large c, it is separated by broad 'density
valleys' from the rest of the data, and, for a varying c, if it is constant over a wide
range of values of c.
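As a small illustration of this construction, the following sketch computes density-contour clusters for one-dimensional data with a kernel estimate, using maximal runs of sorted points above the level c as a stand-in for connected components (our own simplification; the paper's setting is R^p):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_contour_clusters_1d(x, c):
    """High-density clusters at level c for 1-d data: keep points with
    kde(x_k) >= c, and split the kept points into maximal runs of
    consecutive sorted points, separated by points below the level."""
    kde = gaussian_kde(x)
    order = np.argsort(x)
    keep = kde(x[order]) >= c
    clusters, current = [], []
    for k, flag in zip(order, keep):
        if flag:
            current.append(k)
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters  # indices of objects, one list per cluster

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
print([len(c) for c in density_contour_clusters_1d(x, c=0.05)])
```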
Except for the two-dimensional case, the geometrical description of high-density
clusters is difficult. Therefore many 'discretized' or modified versions of this clustering
strategy have been proposed (often using a weaker or discretized version of connectivity
in R^p; see Bock 1996a). From a theoretical point of view, Hartigan (1981)
showed that single linkage clustering fails in detecting high-density clusters for all
dimensions p ≥ 2.
A related clustering approach focusses on local density peaks in R^p, i.e., on the points
ξ_1, ξ_2, ... ∈ R^p where the underlying (smooth) density f (or its estimate f̂_n) has its
local maxima (modes): Clusters are formed by successively relocating each data point
x_k into a region with a larger value of f̂_n (by hill-climbing algorithms, steepest ascent
etc.) and then collecting into the same cluster C_i, termed mode cluster, all data
points which finally reach the same mode of f̂_n. Even if this approach can be criticized
for the instability of the cluster concept (small local variations of f or f̂_n can generate
an arbitrarily large number of modes or clusters), it is often used in image analysis and
pattern recognition, and there exist many algorithmic variations of this approach (cf.
Bock 1996a).
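The mean-shift iteration is one standard realization of this hill-climbing idea; a sketch with a Gaussian kernel follows (the bandwidth and the mode-merging tolerance are our own choices, not prescribed by the text):

```python
import numpy as np

def mean_shift_modes(X, bandwidth=1.0, n_iter=50, tol=1e-3):
    """Relocate every data point uphill on a Gaussian kernel density
    estimate; points ending at the same mode form one mode cluster."""
    Y = X.copy()
    for _ in range(n_iter):
        # kernel weights of all data points around each current position
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-0.5 * d2 / bandwidth**2)
        Y_new = (w @ X) / w.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y).max() < tol:
            Y = Y_new
            break
        Y = Y_new
    # merge end points closer than the bandwidth into one mode/cluster
    labels, modes = -np.ones(len(X), dtype=int), []
    for k, y in enumerate(Y):
        for i, mode in enumerate(modes):
            if np.linalg.norm(y - mode) < bandwidth:
                labels[k] = i
                break
        else:
            modes.append(y)
            labels[k] = len(modes) - 1
    return labels, np.array(modes)
```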
2.4 Spatial clustering and clumping models
Motivated by biological and physical applications, spatial statistics provides various
other models for describing a clustering tendency of points in the space R^p. A first
non-parametric approach considers the data x_1, ..., x_n as a realization of a Poisson
process (restricted to a finite window G ⊂ R^p), either a homogeneous one with a constant
intensity λ (= the average number of data points per unit volume) for describing a
'homogeneous' or 'non-clustered' sample, or with a location-dependent intensity λ(x)
in the case of a clustering structure: Here the modes and contour regions of λ(x) can
characterize clusters similarly as in section 2.3 (when using a distribution density f),
and will be determined by suitable non-parametric estimates of λ(x) (Ripley 1981,
Cressie 1991).
Another model is motivated by the spread of plants in a plane or the growth of
crystals around kernels: The Neyman-Scott process builds clusters in three separate
steps: (1) by placing random 'seed points' ξ_1, ξ_2, ... into R^p according to a homogeneous
Poisson process, (2) by choosing, for each ξ_i, a random integer N_i with a Poisson
distribution P(λ), and (3) by surrounding each 'parent' point ξ_i by N_i 'daughter'
points X_i1, ..., X_iN_i that are independently distributed according to σ^{−p} h((x − ξ_i)/σ)
(conditionally on the result of (1) and (2)), where h(x) is a spherically symmetric
density (typically, h ∼ N(0, E_p) or h ∼ U(K(0,1)), the uniform distribution on the
unit ball K(0,1)). The data are then identified with the set of all daughter points
X_ik inside a suitable window G. There exist statistical methods for estimating the
unknown parameters λ, σ etc. from these data, but the problem of reconstructing
the 'clusters' (families) from the data is largely unsolved. In this respect the model is
representative of a range of models (including Cox processes, Poisson cluster processes
etc.) that focus more on the clustering tendency of the data than on the underlying
clustering of objects.
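A quick simulation of such a Neyman-Scott process in the plane, with Gaussian daughter displacements (all parameter values below are illustrative):

```python
import numpy as np

def neyman_scott(window=(0.0, 10.0), rho=0.5, lam=20, sigma=0.15, seed=0):
    """Simulate a planar Neyman-Scott process: Poisson(rho*area) parents
    uniform in the window, Poisson(lam) daughters per parent, Gaussian
    N(xi_i, sigma^2 I) displacements. Returns daughters inside the window."""
    rng = np.random.default_rng(seed)
    lo, hi = window
    area = (hi - lo) ** 2
    n_parents = rng.poisson(rho * area)
    parents = rng.uniform(lo, hi, size=(n_parents, 2))
    pts = []
    for xi in parents:
        n_i = rng.poisson(lam)
        pts.append(xi + sigma * rng.standard_normal((n_i, 2)))
    pts = np.vstack(pts) if pts else np.empty((0, 2))
    inside = ((pts >= lo) & (pts <= hi)).all(axis=1)
    return pts[inside], parents

daughters, parents = neyman_scott()
print(len(parents), "parents,", len(daughters), "daughter points")
```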
problem. A range of strategies can be proposed in order to solve this problem, including:
(a) Descriptive and exploratory methods for determining the properties of the
clusters, either in terms of the observed data or by using secondary (background)
information that has not yet been used in the classification process (e.g.,
Bock 1981);
(b) A substance-related analysis of the classes that looks for intuitive, 'natural'
explanations or interpretations of the differences existing among the obtained
classes;
(c) A quantitative evaluation of the benefits that can be gained by using the constructed
classification in practice (e.g., in marketing, administration, official statistics,
libraries etc.);
(d) A qualitative or quantitative validation of the clusters by comparing them to
classifications obtained from other clustering methods, from alternative data
(for the same objects) or from traditional systematics (see Lapointe 1997);
(e) Inferential statistics which is based on probabilistic models and proceeds essentially
by classical hypothesis testing.
It is this latter issue (e) that will be discussed in this section. In fact, there is a long
list of clustering-related questions that can be investigated by hypothesis testing. A
full account is given in Bock (1996a). Here we will address only two of the major
problems: Testing for homogeneity and checking the adequacy of a calculated
classification (model).
H_G: X_1, ..., X_n are uniformly distributed in a finite domain G ⊂ R^p (to be estimated
from the data),
H_uni: X_1, ..., X_n have the same (often unknown) unimodal distribution density f_0(x)
in R^p, with the special case:
H_1: X_1, ..., X_n all have the same p-dimensional normal distribution N_p(μ, σ²E_p).
H_D: All n(n − 1)/2 dissimilarities D_kl, k < l, are i.i.d., each with an arbitrary (or a specified)
continuous distribution density f(d); this implies the two following models:
H_rank: All (n(n − 1)/2)! rankings of the dissimilarities D_kl, k < l, are equally probable.
H_{n,M}: For each fixed number M of 'similar' pairs of objects {k, l} (i.e. with D_kl smaller
than a given threshold d > 0), these M links are purely randomly assigned to
the set of all n(n − 1)/2 pairs of objects.
T_m := 2 log( L_{H_m}(C*_m) / L_{H_1} )   and   T_m^mix := 2 log( L_{H_m}^mix / L_{H_1} )   (3.1)

where L_{H_m}, L_{H_m}^mix and L_{H_1} denote the likelihood of the data maximized under the
models H_m, H_m^mix and H_1, respectively, for a fixed number m > 1 of clusters, and C*_m
is the optimum m-partition resulting from (2.1) or (2.3). Unfortunately, the classical
asymptotic LRT theory (yielding χ² distributions for T_m) fails for these clustering
models, due either to the fact that H_1 is on the 'boundary' of H_m^mix under the
parametrization (2.16) or to the discrete character of the parameter C in the fixed-partition
model (2.1) (see also Hartigan 1985, Bock 1996a). However, there exist some special
investigations relating to these two test criteria.
3.2.1 Testing versus the mixture model
The case of a one-dimensional normal mixture f(x) = Σ_{i=1}^{m} π_i N(μ_i, σ²) has been
investigated by Everitt (1981), Thode et al. (1988) and Böhning (1994) who present
simulated percentiles of T_m^mix under N(0, 1) for various sample sizes n and m = 2.
The power of this LRT is investigated by Mendell et al. (1991, 1993) where it results,
e.g., that n ≥ 50 is needed to have 50% power to detect a difference |μ_1 − μ_2| ≥ 3σ
with 0.1 ≤ π_1 ≤ 0.9 (also see Milligan (1981, 1996), Bock (1996a)). The paper of
Böhning (1994) extends these results to the case of one-dimensional exponential
families and shows that inside these families, the asymptotic distribution of T_m^mix remains
(approximately) stable. For more general cases we recommend determining suitable
percentiles of T_m^mix by simulations instead of resorting, e.g., to heuristic formulas.
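Such simulations are easy to set up today; the sketch below approximates percentiles of T_2^mix under H_1 with scikit-learn's GaussianMixture (note that this fits unrestricted component variances, unlike the common-σ² mixture discussed above, and the sample size, replication count and percentile are arbitrary choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt_mix(x, m=2, seed=0):
    """T_m^mix = 2 log(L_{Hm}^mix / L_{H1}) for one-dimensional data."""
    X = x.reshape(-1, 1)
    ll1 = GaussianMixture(1, random_state=seed).fit(X).score(X) * len(x)
    llm = GaussianMixture(m, n_init=5, random_state=seed).fit(X).score(X) * len(x)
    return 2.0 * (llm - ll1)

# simulated 95% percentile of T_2^mix under H1: N(0,1), n = 100
rng = np.random.default_rng(0)
sims = [lrt_mix(rng.standard_normal(100)) for _ in range(200)]
print(np.percentile(sims, 95))
```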
More theoretical investigations are presented by Titterington et al. (1985), Titterington
(1990), Goffinet et al. (1992), and Böhning et al. (1994). Those authors
show, for two-component mixtures (with partially fixed parameters), that under H_1
the asymptotic distribution of T_2^mix (for n → ∞) is a mixture of the unit mass at 0
and a χ²_1 distribution. Ghosh and Sen (1985) show that the asymptotic distribution
of T_2^mix is closely related to a suitable Gaussian process, and Berdai and Garel (1994)
present the corresponding tabulations. An alternative method for testing H_1 versus
H_m^mix has been proposed by Bock (1977, 1985, 1996a, chap. 6.6) and uses, as test
statistic, the average similarity among the sample points X_1, ..., X_n which should be
larger under H_1 than under the mixture alternative.
3.2.2 Testing for the fixed-classification model; the max-F test
In contrast to the mixture model, the fixed-classification model (2.1) is defined in
terms of an unknown m-partition C = (C_1, ..., C_m) of the n objects, for a fixed number
m of classes and a given family of densities f(x; ϑ). Therefore the LRT using T_m,
(3.1), can be interpreted either as
• a test for homogeneity H_1 versus the clustering structure H_m;
• a test for the significance or suitability of the calculated (optimum) classification
C*_m of the n objects;
• a test for the existence of m > 1 'natural' classes in the data set (versus the
hypothesis of one class only).
Thus, depending on the interpretation, the analysis of the LR test has many facets.
Thus, depending on the interpretation, the analysis of the LR test has many facets.
Under the assumption that XI, .'" Xn are i.i.d, all with the same density f(x) (descri-
bing either homogeneity or a mixture of distributions) the almost sure convergence
and the asymptotic normality of the para.meter estimates J; has been intensively
studied, e.g., in Bryant and Williamson (1978), Pollard (1982), Parna (1986), and
Bryant (1991). It appears that the asymptotic behaviour of these estimates is closely
related to the solution of the 'continuous' clustering problem
where minimization is over all m-partitions B = (BlI ... , Bm) of RP and all parameter
vectors B = (11 1 , ".,l1 m ). Instead of going into details here (see Bock 1996a) we will
focus on the special case of the normal distribution model described in section 2.1.1
where X" '" N(J1.i, (12 Ep) for k E Ci under Hm, and X" ,..., N(J1., (12 Ep) for all k in case
of HI. Here the LR test reduces to the intuitive max-F test defined by
where C~ minimizes the variance criterion (2.7). In this case the continuous optimi-
zation problem (3.2) reduces to
as an analogue to (2.6), and its solution B" is necessarily a stationary partition, i.e.
a minimum-distance partition of RP generated by its own class centroids, the condi-
tional expectations p.i := E,[XIX E Bi], i = 1, ... , m.
For the one-dimensional case, the optimum partition B* of R¹ is given by Cox (1957)
and Bock (1974, p. 179) for the cases f ∼ N(0,1) and f ∼ U([−√3, √3]) with variance 1.
For two- and three-dimensional normals f ∼ N_p(μ, E_p) a range of stationary
partitions B has been calculated by Baubkus (1985) (m = 2, ..., 6) and Flury (1993);
the ellipsoidal case has been considered by Baubkus (1985), Flury (1993, m = 4),
Kipper and Pärna (1992), Tarpey et al. (1995) and Jank (1996, 2 ≤ m ≤ 4). For the
two-dimensional normal N_2(0, E_2) some stationary partitions as well as their numerical
characteristics are reproduced in Tab. 1; for example, for m = 3 the stationary
partition B_3 has centroids μ_i = 1.03648 · (cos(2πi/3), sin(2πi/3))', class probabilities
p_i = 1/3, and criterion values G_3 = 0.92570 and κ_3 = 1.16052. The three quite distinct
5-partitions B_{5,1}, B_{5,2}, B_{5,3} differ in their G_5-values by no more than 0.013 (for other
cases see Bock 1996a). It is conjectured that for m = 2 to 5 this list includes the
optimum partitions of R² (see the asterisks * in Tab. 1), but a formal proof of
optimality exists only for m = 2 and 3 classes (Baubkus 1985).
In order to apply the max-F test, the critical threshold (percentile) c must be calculated
from the null distribution of k*_mn under some f from H_1. While this distribution
is intractable for a finite n, the asymptotic normality of g*_mn and k*_mn has been proved
under some regularity conditions on f, with asymptotic expectations G*_m and
κ*_m := (E_f[||X − E_f[X]||²] − G*_m) / G*_m, the continuous analogues of the minimum
variance and max-F criteria (2.7) and (3.3), respectively (see Bryant and Williamson
1978, Hartigan 1978 (for p = 1), Bock 1985 (for p ≥ 1)). Since the regularity
conditions include the uniqueness of the optimum partition B* of (3.5), these results cannot
be applied for the rotation-invariant density f ∼ N_p(0, E_p) if p > 1, but suitable
simulations have been conducted (see Hartigan 1978, Bock 1996a, Jank 1996). For
example, Jank (1996) and Jank and Bock (1996) found that for n ≥ 100 (in particular:
n = 1000) the null distribution of the standardized values (g*_mn − â)/b̂ and
(k*_mn − â)/b̂ is satisfactorily approximated by a N(0,1) distribution (in the range
[−2, 2], say) if â and b̂ are chosen to be the empirical mean and standard deviation of the
optimum values g*_mn and k*_mn, respectively (results of the k-means algorithm), from
N = 1600 simulations of {x_1, ..., x_n} under N_2(0, E_2) and m = 2, ..., 5.
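A sketch of this type of null-distribution simulation, using k-means as a stand-in for the exact optimum partition (so the resulting percentiles are only approximate):

```python
import numpy as np
from sklearn.cluster import KMeans

def max_f_stat(X, m):
    """k*_mn = (g*_1n - g*_mn)/g*_mn, with g*_mn approximated by the
    k-means (SSQ) optimum and g*_1n the total sum of squares."""
    g1 = ((X - X.mean(axis=0)) ** 2).sum()
    gm = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X).inertia_
    return (g1 - gm) / gm

# null distribution of k*_2n under H1: N_2(0, E_2), n = 100
rng = np.random.default_rng(0)
sims = np.array([max_f_stat(rng.standard_normal((100, 2)), m=2)
                 for _ in range(200)])
print("mean", sims.mean(), "std", sims.std(),
      "95% percentile", np.percentile(sims, 95))
```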
3.2.3 The LRT for the convex cluster case; convex cluster tests
A less investigated case is provided by clustering models where each class is
characterized by a uniform distribution on a convex domain of R^p. To be specific, we consider
the following three convex clustering models which all involve a system of m unknown
non-overlapping convex sets D_1, ..., D_m ⊂ R^p (to be estimated from the data):
H_m Fixed-classification model:
X_k ∼ U(D_i) for all k ∈ C_i, with an unknown m-partition C = (C_1, ..., C_m) of
O.
where the partition C* = (C*_1, ..., C*_m) minimizes the clustering criterion (2.8).
where C* = (C*_1, ..., C*_m) is the partition that minimizes the clustering criterion

g_m(C) := Σ_{i=1}^{m} |C_i| · log( vol_p(H(C_i)) / |C_i| ) → min over C   (3.7)

over all partitions C = (C_1, ..., C_m) with disjoint convex hulls H(C_1), ..., H(C_m),
all with a positive volume.
• H_G versus H_m^uni:
Denote by C* = (C*_1, ..., C*_m) the m-partition which minimizes the volume
clustering criterion

A(C) := Σ_{i=1}^{m} vol_p(H(C_i)) → min over C   (3.8)

i.e. the sum of the volumes of the m class-specific convex hulls D̂_i := H(C*_i) ⊂
G_n (supposed to be non-overlapping). Then V_n := G_n \ (H(C*_1) ∪ ... ∪ H(C*_m))
is the maximum 'empty space' that is left in G_n outside of the cluster domains
D̂_i, and the LRT reduces to the following empty space test:

vol_p(V_n) / vol_p(G_n)   > c : decide for clustering H_m^uni;   ≤ c : accept uniformity H_G.   (3.9)
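A rough illustration of the empty-space statistic for a given partition in the plane, using convex hull volumes; taking the window G_n as the convex hull of all data points is our own simplification, and the critical value c would have to come from simulation:

```python
import numpy as np
from scipy.spatial import ConvexHull

def empty_space_statistic(X, labels):
    """vol(V_n)/vol(G_n): share of the window G_n (here: the convex hull
    of all data) not covered by the class-wise convex hulls H(C_i).
    Assumes the class hulls do not overlap."""
    vol_G = ConvexHull(X).volume  # in 2-d, .volume is the area
    vol_classes = sum(ConvexHull(X[labels == c]).volume
                      for c in np.unique(labels))
    return (vol_G - vol_classes) / vol_G

rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0, 1, (50, 2)), rng.uniform(3, 4, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(empty_space_statistic(X, labels))  # close to 1 for separated classes
```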
Quite generally, it may be doubted whether the classical hypothesis testing paradigm, with
only two alternative decisions (acceptance and rejection of a hypothesis H_0), is
appropriate for the clustering framework, where it would be much more realistic
and useful to distinguish various grades of classifiability which interpolate between
the extreme cases of 'homogeneity or randomness' and an 'obvious clustering
structure'. Instead of defining corresponding quantitative measures of classifiability here,
we propose a more qualitative approach, the comparative assessment of classifications
(CAC): It proceeds by defining, prior to any clustering algorithm, some 'benchmark
clustering situations' H_m^ε, i.e., data configurations or distribution models which show
a more or less marked classification structure, indexed by ε (not necessarily a real
number, but possibly a measure of class separation; see below). These configurations
can be selected in cooperation between practitioners and statisticians. Then, after having
calculated a special (e.g., optimum) classification C*_m for the given data {x_1, ..., x_n},
we compare this classification with those to be expected under H_m^ε, for various
degrees ε. Thus we place the observed data into a network of various different clustering
situations H_m^ε in order to get an idea of their underlying structure.
In the case of the fixed-classification normal model H_m, (2.1), with class-specific
distributions N_p(μ_i, σ²E_p), this idea can be realized as follows:
1. Determine suitably parametrized benchmark clusterings H_m^ε, e.g.:
• A partition C of O with m classes C_i of equal sizes n_i = |C_i| = n/m,
whose class centroids μ_i are sufficiently different, e.g., ||μ_i − μ_j||/σ ≥ ε for
all i ≠ j, or m^{−1} Σ_{i=1}^{m} ||μ_i − μ̄||²/σ² ≥ ε.
• A normal mixture Σ_{i=1}^{m} π_i · N_p(μ_i, σ²E_p) with ε := Σ_{i=1}^{m} π_i · ||μ_i − μ̄||².
2. Consider the LRT statistic T_m or, equivalently, the max-F statistic k*_mn for
the hypothesis H_0: μ_1 = ... = μ_m.
3. Determine or estimate some characteristics Q^ε of the (intractable) probability
distribution of T_m or k*_mn under the benchmark situations H_m^ε selected in 1.,
e.g., by simulating a large number of data sets {x_1, ..., x_n} under H_m^ε and
calculating the empirical mean, median or some other empirical percentile of
the resulting values of T_m or k*_mn.
4. Compare the values T_m or k*_mn calculated from the original data {x_1, ..., x_n}
(if necessary after a suitable standardization) to the characteristics Q^ε of the
benchmark situations.
5. The clustering tendency or classifiability of the data {x_1, ..., x_n} and the relevance
of the calculated classification C*_m are then described, illustrated and
quantified (by ε) by confronting them with those benchmark situations H_m^ε which
show a weaker clustering behaviour (in the sense that, e.g., Q^ε ≤ T_m or ≤ k*_mn)
and, on the other hand, with those which describe a stronger clustering structure
(i.e., where the converse inequality holds).
It is obvious that this strategy CAC is related to a formal test of a hypothesis ε ≤ ε_0
versus ε > ε_0, but it is more flexible due to the arbitrary selection of suitable
benchmark situations. Its generalization to other clustering models is obvious.
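A compact sketch of steps 1-5 for the normal benchmark model, reusing the max_f_stat helper from the max-F simulation sketch above (the grid of separations ε, the centroid layout and all sample parameters are illustrative choices of ours):

```python
import numpy as np

def benchmark_data(n, m, eps, p=2, seed=0):
    """Step 1: equal-sized spherical normal classes (sigma = 1) whose
    centroids lie eps apart on the coordinate axes (pairwise distance
    eps*sqrt(2))."""
    rng = np.random.default_rng(seed)
    mus = eps * np.eye(m, p)
    return np.vstack([mu + rng.standard_normal((n // m, p)) for mu in mus])

# steps 3-5: benchmark characteristics Q^eps (median of k*_mn) versus the
# observed statistic; max_f_stat as defined in the earlier sketch
m, n = 2, 100
X_obs = benchmark_data(n, m, eps=3.0, seed=42)   # stand-in 'observed' data
k_obs = max_f_stat(X_obs, m)
for eps in [0.0, 1.0, 2.0, 3.0, 4.0]:
    sims = [max_f_stat(benchmark_data(n, m, eps, seed=s), m)
            for s in range(50)]
    rel = "weaker" if np.median(sims) <= k_obs else "stronger"
    print(f"eps={eps}: Q^eps={np.median(sims):.2f} ({rel} than observed)")
```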
4. Final remarks
In this paper we have described various probabilistic and inferential tools for cluster
analysis. These methods provide a firm basis for deriving suitable clustering stra-
tegies and allow for a quantitative evaluation of classification results and clustering
methods, including error probabilities and risk functions. In particular, various test
statistics can safeguard against a too rash acceptance of a clustering structure and
help to validate calculated classifications.
On the other hand, the application of these methods is not at all easy and self-evident:
Problems arise, e.g., when selecting a suitable clustering model and an appropriate
family of densities f(x; ϑ), or when different types of cluster shapes occur
simultaneously in the same data set. We have seen that the probability distribution of many
test statistics is hard to obtain in many cases. Moreover, our analysis is always based
on only one sample from the n objects such that we cannot evaluate the stability or
variation of the resulting classification (as would be the case if repeated samples
were available for the same objects).
When comparing the risks and benefits of probability-based versus deterministic
clustering approaches (which proceed, e.g., by intuitive clustering criteria or heuristic
algorithms), we see that these same deficiencies exist, in some other and disguised
form, for the latter methods as well. It is recommended here to combine both
approaches in an exploratory way and thereby profit from both points of view. The
CAC strategy presented above is an example of such an analysis.
References:
Anderson, J.J. (1985): Normal mixtures and the number of clusters problem. Computational Statis-
tics Quarterly 2, 3-14.
P. Arabie, L. Hubert and G. De Soete (eds.) (1996): Clustering and Classification. World Scientific,
River Edge/NJ.
Baubkus, W. (1985): Minimizing the variance criterion in cluster analysis: Optimal configurations
in the multidimensional normal case. Diploma thesis, Institute of Statistics, Technical University of
Aachen, Germany.
Berdai, A., and B. Garel (1994): Performances d'un test d'homogénéité contre une hypothèse de
mélange gaussien. Revue de Statistique Appliquée 42 (1), 63-79.
Bernardo, J.M. (1994): Optimizing prediction with hierarchical models: Bayesian clustering. In:
P.R. Freeman, A.F .M. Smith (Eds.): Aspects of uncertainty. Wiley, New York, 1994, 67-76.
Binder, D.A. (1978): Bayesian cluster analysis. Biometrika 65, 31-38.
Bock, H.H. (1968): Statistische Modelle für die einfache und doppelte Klassifikation von
normalverteilten Beobachtungen. Dissertation, Univ. Freiburg i. Brsg., Germany.
Bock, H.H. (1969): The equivalence of two extremal problems and its application to the iterative
classification of multivariate data. Report of the Conference 'Medizinische Statistik',
Forschungsinstitut Oberwolfach, February 1969, 10 pp.
Bock, H.H. (1972): Statistische Modelle und Bayes'sche Verfahren zur Bestimmung einer
unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18, 120-132.
Bock, H.H. (1974): Automatische Klassifikation (Clusteranalyse). Vandenhoeck & Ruprecht,
Göttingen, 480 pp.
Bock, H.H. (1977): On tests concerning the existence of a classification. In: Proc. First Symposium
on Data Analysis and Informatics, Versailles, 1977, Vol. II. Institut de Recherche d'Informatique et
d'Automatique (IRIA) , Le Chesnay, 1977,449-464.
Bock, H.H. (1984): Statistical testing and evaluation methods in cluster analysis. In: J.K. Ghosh and
J. Roy (Eds.): Golden Jubilee Conference in Statistics: Applications and new directions. Calcutta,
December 1981. Indian Statistical Institute, Calcutta, 1984, 116-146.
Bock, H.H. (1985): On some significance tests in cluster analysis. J. of Classification 2, 77-108.
Bock, H.H. (1986): Loglinear models and entropy clustering methods for qualitative data. In: W.
Gaul, M. Schader (Eds.): Classification as a tool of research. North Holland, Amsterdam, 1986,
19-26.
Bock, H.H. (1987): On the interface between cluster analysis, principal component analysis, and
multidimensional scaling. In: H. Bozdogan and A.K. Gupta (eds.): Multivariate statistical modeling
and data analysis. Reidel, Dordrecht, 1987, 17-34.
Bock, H.H. (Ed.) (1988): Classification and related methods of data analysis. Proc. First IFCS
Conference, Aachen, 1987. North Holland, Amsterdam.
Bock, H.H. (1989a): Probabilistic aspects in cluster analysis. In: O. Opitz (Ed.): Conceptual and
numerical analysis of data. Springer-Verlag, Heidelberg, 1989, 12-44.
Bock, H.H. (1989b): A probabilistic clustering model for graphs and similarity relations. Paper pre-
sented at the Fall Meeting 1989 of the Working Group 'Numerical Classification and Data Analysis'
of the Gesellschaft für Klassifikation, Essen, November 1989.
Bock, H.H. (1994): Information and entropy in cluster analysis. In: H. Bozdogan et al. (Eds.): Mul-
tivariate statistical modeling, Vol. II. Proc. 1st US/Japan Conference on the Frontiers of Statistical
Modeling: An Informational Approach. Univ. of Tennessee, Knoxville, 1992. Kluwer, Dordrecht,
1994, 115-147.
Bock, H.H. (1996a): Probability models and hypotheses testing in partitioning cluster analysis. In:
P. Arabie et al. (Eds.), 1996, 377-453.
Bock, H.H. (1996b): Probabilistic models in cluster analysis. Computational Statistics and Data
Analysis 22 (in press).
Bock, H.H. (1996c): Probabilistic models in partitional cluster analysis. In: A. Ferligoj and A.
Kramberger (Eds.): Developments in data analysis. Metodoloski zvezki, 12, Faculty of Social Sci-
ences Press (Fakulteta za druzbene vede, FDV), Ljubljana, 1996, 3-25.
Bock, H.H. (1996d): Probabilistic models and statistical methods in partitional classification prob-
lems. Written version of a Tutorial Session organized by the Japanese Classification Society and the
Japan Market Association, Tokyo, April 2-3, 1996,50-68.
Bock, H.H. (1997): Probability models for convex clusters. In: R. Klar and O. Opitz (Eds.): Clas-
sification and knowledge organization. Springer-Verlag, Heidelberg, 1997 (to appear).
Bock, H.H., and W. Polasek (Eds.) (1996): Data analysis and information systems: Statistical and
conceptual approaches. Springer-Verlag, Heidelberg, 1996.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., and Lindsay, B.G. (1994): The distribution of
the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of
the Institute of Statistical Mathematics 46, 373-388.
Bryant, P. (1988): On characterizing optimization-based clustering methods. J. of Classification 5,
81-84.
Bryant, P.G. (1991): Large-sample results for optimization-based clustering methods. J. of Classi-
fication 8, 31-44.
Bryant, P.G., and J.A. Williamson (1978): Asymptotic behaviour of classification maximum likeli-
hood estimates. Biometrika 65, 273-281.
Celeux, G., and Diebolt, J. (1985): The SEM algorithm: A probabilistic teacher algorithm derived
from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, 73-82.
Cox, D.R. (1957): A note on grouping. J. Amer. Statist. Assoc. 52, 543-547.
Cressie, N. (1991): Statistics for spatial data. Wiley, New York.
Diday, E. (1973): Introduction à l'analyse factorielle typologique. Rapport de Recherche no. 27,
IRIA, Le Chesnay, France, 13 pp.
Diday, E., Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.) (1994): New approaches
in classification and data analysis. Studies in Classification, Data Analysis, and Knowledge Orga-
nization, vol. 6. Springer-Verlag, Heidelberg, 186-193.
Dubes, R., and Jain, A.K. (1979): Validity studies in clustering methodologies. Pattern Recognition
11, 235-254.
Dubes, R.C., and Zeng, G. (1987): A test for spatial homogeneity in cluster analysis. J. of Classifi-
cation 4, 33-56.
Everitt, B. S. (1981): A Monte Carlo investigation of the likelihood ratio test for the number of
components in a mixture of normal distributions. Multivariate Behavioural Research 16, 171-180.
Fahrmeir, L., Hamerle, A. and G. Tutz (Eds.) (1996): Multivariate statistische Verfahren. Walter
de Gruyter, Berlin - New York.
Fahrmeir, L., Kaufmann, H.L., and H. Pape (1980): Eine konstruktive Eigenschaft optimaler Par-
titionen bei stochastischen Klassifikationsproblemen. Methods of Operations Researcb 37, 337-347.
Flury, B.D. (1993): Estimation of principal points. Applied Statistics 42, 139-151.
W. Gaul & D. Pfeifer (Eds.) (1996): From data to knowledge. Theoretical and practical aspects of
classification, data analysis and knowledge organization. Springer-Verlag, Heidelberg.
Ghosh, J.K., and Sen, P.K. (1985): On the asymptotic performance of the log likelihood ratio statistic
for the mixture model and related results. In: L.M. LeCam, R.A. Ohlsen (Eds.): Proc. Berkeley
Conference in honor of Jerzy Neyman and Jack Kiefer. Vol. II, Wadsworth, Monterey, 1985, 789-806.
Godehardt, E. (1990): Graphs as structural models. The application of graphs and multigraphs in
cluster analysis. Friedrich Vieweg & Sohn, Braunschweig, 240pp.
Godehardt, E., and Horsch, A. (1996): Graph-theoretic models for testing the homogeneity of data.
In: W. Gaul & D. Pfeifer (Eds.), 1996, 167-176.
Goffinet, B., Loisel, P., and B. Laurent (1992): Testing in normal mixture models when the propor-
tions are known. Biometrika 79, 842-846.
Gordon, A.D. (1994): Identifying genuine clusters in a classification. Computational Statistics and
Data Analysis 18, 561-581.
Gordon, A.D. (1996): Null models in cluster validation. In: W. Gaul and D. Pfeifer (Eds.), 1996,
32-44.
Gordon, A.D. (1997a): Cluster validation. This volume.
Gordon, A.D. (1997b): How many clusters? An investigation of five procedures for detecting nested
cluster structure. This volume.
Hardy, A. (1997): A split and merge algorithm for cluster analysis. This volume.
Hartigan, J.A. (1978): Asymptotic distributions for clustering criteria. Ann. Statist. 6, 117-131.
Hartigan, J.A. (1985): Statistical theory in clustering. J. of Classification 2, 63-76.
Hayashi, Ch. (19??): ..... .
Jain, A.K., and Dubes, R.C. (1988): Algorithms for clustering data. Prentice Hall, Englewood
Cliffs, NJ.
Jank, W. (1996): A study on the variance criterion in cluster analysis: Optimum and stationary
partitions of RP and the distribution of related clustering criteria. (In German). Diploma thesis,
Institute of Statistics, Technical University of Aachen, Aachen, 204pp.
Jank, W., and Bock, H.H. (1996): Optimal partitions of R2 and the distribution of the variance and
max-F criterion. Paper presented at the 20th Annual Conference of the Gesellschaft für
Klassifikation, Freiburg, Germany, March 1996.
Lapointe, F.-J. (1997): To validate and how to validate? That is the real question. This volume.
Ling, R.F. (1973): A probability theory of cluster analysis. J. Amer. Statist. Assoc. 68,159-164.
McLachlan, G.J., and K.E. Basford (1988): Mixture models. Inference and applications to cluster-
ing. Marcel Dekker, New York - Basel.
Mendell, N.P., Thode, H.C., & Finch, S.J. (1991): The likelihood ratio test for the two-component
normal mixture problem: power and sample-size analysis. Biometrics 47, 1143-1148. Correction:
48 (1992) 661.
Mendell, N.P., Finch, S.J., and Thode, H.C. (1993): Where is the likelihood ratio test powerful for
detecting two-component normal mixtures? Biometrics 49, 907-915.
Milligan, G. W. (1981): A review of Monte Carlo tests of cluster analysis. Multivariate Behavioural
Research 16, 379-401.
Milligan, G.W. (1996): Clustering validation: Results and implications for applied analyses. In: P.
Arabie et al. (Eds.), 1996, 341-375.
Milligan, G. W., and M.C. Cooper (1985): An examination of procedures for determining the num-
ber of clusters in a data set. Psychometrika 50, 159-179.
Pärna, K. (1986): Strong consistency of k-means clustering criterion in separable metric spaces.
Tartu Riikliku Ülikooli Toimetised 733, 86-96.
Kipper, S., and Pärna, K. (1992): Optimal k-centres for a two-dimensional normal distribution.
Acta et Commentationes Universitatis Tartuensis 942, 21-27.
Pollard, D. (1982): A central limit theorem for k-means clustering. Ann. Probab. 10, 919-926.
Rasson, J.-P. (1997): Convexity methods in classification. This volume.
Rasson, J.-P., Hardy, A., and Weverbergh, D. (1988): Point process, classification and data analysis.
Summary: Clustering algorithms can provide misleading summaries of data, and attention
has been devoted to investigating ways of guarding against reaching incorrect conclusions,
by validating the results of a cluster analysis. The paper provides an overview of recent work
in this area of cluster validation. Material covered includes: the distinction between exter-
nal, internal, and relative clustering indices; types of null model, including 'data-influenced'
null models; tests of the complete absence of any class structure in a data set; and ways
of assessing the validity of individual clusters, partitions of data into disjoint clusters, and
hierarchical classifications. A discussion indicates areas in which further research seems
desirable.
1. Introduction
The topic of classification addresses the problem of summarizing the relationships
within a large set of objects by representing them as a smaller number of classes (or
clusters) of objects with the property that objects in the same class are similar to
one another and different to objects in other classes. On occasion, it can be relevant
to obtain a nested set of such partitions of the objects, providing a hierarchical clas-
sification which summarizes the class structure present at several different levels in
the data.
Many different clustering algorithms have been proposed for obtaining such classi-
fications; see, for example, Bock (1974), Hartigan (1975), Gordon (1981) and Jain
and Dubes (1988). However, clustering algorithms impose a classification on a data
set even if there is no real class structure present in it. Further, classifications of the
same data set obtained using different clustering criteria can differ markedly from one
another. In effect, each clustering criterion implicitly specifies a model for the data;
for example, Scott and Symons (1971) demonstrated links between Ward's (1963)
incremental sum-of-squares clustering criterion and a spherical normal components
model (see also Binder (1978), Marriott (1982) and Bock (1989, 1996)). If this model
is inappropriate for the data set under investigation, a misleading summary will be
provided.
Realization of this fact has led to the consideration by investigators of how one might
specify appropriate clustering strategies for data sets. Some adaptive clustering pro-
cedures have been proposed (e.g., Rohlf, 1970; Diday and Govaert, 1977; Lefkovitch,
1978, 1980; Art et al., 1982), the rationale being that the clustering procedure adapts
itself in response to the structure found in the data. In effect, such procedures just
involve a more general model for the data, and it seems unrealistic to expect to be
able to construct a model that is sufficiently general to be appropriate for the analysis
of any data set that may be encountered.
Jardine and Sibson (1971) presented a list of properties that one might require of a
clustering method, and proved that the single link method uniquely satisfies these
conditions. This provides a valuable characterization of the single link method, but
many investigators have regarded some of the conditions as undesirable. A less
prescriptive approach was provided by Fisher and Van Ness (1971) and Van Ness (1973).
These authors presented a list of properties which one might expect clustering proce-
dures or the classes obtained from them to possess, and stated whether or not these
properties were possessed by each of several standard clustering criteria. From back-
ground information about the data, an investigator might be able to specify some
relevant conditions; the work of Fisher and Van Ness then indicates which clustering
criteria could be relevant for the analysis of that particular data set. More than
one clustering criterion might be indicated as relevant, and it can be informative to
analyse a data set using several different clustering criteria, and synthesize the results
in a consensus classification (e.g., Diday and Simon, 1976; Gordon, 1981, Chapter
6, 1996a), the rationale being that the results are less likely to be an artifact of a
clustering criterion and more likely to give an accurate representation of the structure
in the data.
Investigators have commonly combined a classification of a set of objects with a low-
dimensional configuration of points, in which each object is represented by a different
point, points which are close together representing objects that are similar to one an-
other. Such configurations can then be examined by eye, to establish whether or not
the points fall into distinct, well-separated classes; closed curves can also be drawn
around each class provided by the clustering criterion (e.g., Rohlf, 1970; Shepard,
1974) to assist in the assessment.
Further indications of the properties of classes provided by a clustering procedure are
given by various plots summarizing their homogeneity and isolation from one another
(e.g., Gnanadesikan et al., 1977; Rousseeuw, 1987).
It is rarely the case that investigators know with certainty which clustering procedure
should be used to analyse a set of objects, and increasing attention has been paid
in recent years to providing ways of testing the validity of cluster structure that are
more formal than those described in the previous paragraphs. This paper presents
an overview of this topic of cluster validation; earlier reviews were given by Perruchet
(1983), Bock (1985), Jain and Dubes (1988, Chapter 4) and Gordon (1995).
In statistics, it is common for an exploratory analysis to be carried out on a data set,
for example to assist in model formulation. Such exploratory analyses are generally
followed by confirmatory analyses, which are carried out on new data sets. This has
rarely occurred in classification, in which interest has usually resided in the set of
objects which is being classified, and not in some larger population of objects, from
which the classified set is regarded as comprising a representative sample. In much
of the work which is described in this paper, therefore, cluster generation and cluster
validation are carried out on the same data set. This has major implications for
cluster validation: for example, the classes provided by a clustering algorithm are
likely to be more homogeneous than those contained in a random partition of a set
of objects, and care must be taken to specify appropriate reference distributions for
test statistics used to assess the validity of the obtained classes.
In this context, it is relevant to distinguish three different types of validation test,
involving use of external, internal, or relative cluster indices. External tests compare
a classification, or part of a classification, with external information that was not
used in the construction of the classification. Internal tests compare a (part of a)
classification with the original data. Relative tests compare several different classifi-
cations of the same set of objects; such tests are relevant in both external and internal
investigations (F.-J. Lapointe, pers. comm.).
Cluster validation tests can be categorized into four classes, depending on the type
of cluster structure under investigation (Jain and Dubes, 1988, Chapter 4): these are
tests of (1) the complete absence of class structure, (2) the validity of an individual
cluster, (3) the validity of a partition into disjoint classes, and (4) the validity of a
hierarchical classification.
The remainder of the paper is organized as follows: the next section describes null
models for the absence of cluster structure: the following four sections provide an
overview of tests. categorized as in the previous paragraph; and the final section
presents a discussion, indicating areas in which it would be useful to see further
research.
2. Null models
The information about objects on which a classification study is to be undertaken is
usually provided in one of two formats. A pattern matrix is an n × p matrix X = (x_ik),
where x_ik denotes the value of the kth variable describing the ith object. If all the
variables are continuous, the objects can be represented by a set of n points in
p-dimensional Euclidean space. A dissimilarity matrix is an n × n matrix D = (d_ij),
where d_ij denotes the dissimilarity between the ith and jth objects. Dissimilarities
are symmetric (d_ij = d_ji) and self-dissimilarities (d_ii) are zero, hence it suffices to
present only the n(n − 1)/2 entries in the lower triangle of D. The relationships
within a set of objects are sometimes described in terms of a symmetric similarity
matrix, but such data can also be analysed by straightforward modifications of the
theory described in this paper.
Four main classes of null model, based on either a pattern matrix or a dissimilarity
matrix, have been proposed, as described in the following four subsections.
to the convex hull. Use of the convex hull boundary for the Poisson model is thus
limited at present to data sets having small values of the dimensionality, p, though
one might hope that the development of more efficient algorithms or increase in com-
puting power might ease this restriction in the future.
Generalizations of a test proposed for use in two dimensions in Hopkins (1954) were
found to have superior performance in comparative studies of such tests carried out
by Zeng and Dubes (1985b) and Dubes and Zeng (1987).
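For orientation, here is a sketch of a Hopkins-type statistic in one common modern form — nearest-neighbour distances from uniform 'marker' positions versus from sampled data points; the exact variant studied in the cited comparisons may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=None, seed=0):
    """Hopkins-type clustering-tendency statistic: compares nearest-
    neighbour distances of m uniform 'marker' positions with those of m
    sampled data points; values near 1 suggest clustering, values near
    0.5 suggest spatial randomness (one common variant, not necessarily
    the one used in the comparative studies cited above)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = m or max(1, n // 10)
    tree = cKDTree(X)
    # uniform marker points in the bounding box of the data
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u_dist, _ = tree.query(U, k=1)
    sample = X[rng.choice(n, m, replace=False)]
    w_dist, _ = tree.query(sample, k=2)  # k=2: first hit is the point itself
    w_dist = w_dist[:, 1]
    return u_dist.sum() / (u_dist.sum() + w_dist.sum())
```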
Tests based on the distribution of the lengths of edges in the minimum spanning
tree have been proposed under the Poisson model (Hoffman and Jain, 1983) and a
multivariate normal model (Rohlf, 1975). Smith and Jain (1984) considered adding
randomly-positioned points to the data set and using a test due to Friedman and
Rafsky (1979) based on the number of edges in the minimum spanning tree of the
combined data set that link original and added points. Brailovsky (1991) generated
modified data sets, in which each point was either retained with a specified prob-
ability or replaced by a randomly-positioned point, and compared the strength of
clustering of the original data set with that of the modified data sets.
Bock (1985) described theoretical results pertaining to the average pairwise similarity
of objects and the total within-class sum of squared distances when the objects were
optimally partitioned into a specified number of classes, deriving asymptotic distri-
butions of relevant test statistics; see also Hartigan (1977, 1978). However, more
work is required to establish how relevant such results are for assessing finite-sized
data sets. It has been more common for tests to involve simulation studies.
Some tests are based on searching for 'gaps' or multimodality. Hartigan and Mohanty
(1992) studied the properties under both Poisson and Unimodal models of a test based
on a single link dendrogram: for each single link class C, the number of objects n(C)
in its smallest offspring class is noted, and the test statistic is the maximum value of
n(C) over all classes C, large values suggesting the presence of bimodality. Müller and
Sawitzki (1991) described a test based on comparing the amounts of probability mass
exceeding various threshold values when there are c modes in the distribution, for
different values of c, but this test is at present computationally feasible only for small
values of the dimensionality, p. Hartigan (1988) considered evaluating the minimum
amount of probability mass required to render the distribution unimodal, but settled
for estimating departures from unimodality along the edges of an optimally-rooted
minimum spanning tree. Rozal and Hartigan (1994) described a test based on mini-
mum spanning trees, constrained so that edge lengths are non-increasing on all paths
to the root node(s) corresponding to the class centre(s). Sneath (1986) presented a
test based on the empirical distribution function of the number of internal nodes in
a dendrogram that are located at less than a specified height.
The following test statistics have been proposed for use in conjunction with the ran-
dom dissimilarity matrix model to test for the absence of class structure: the number
of edges required before the graph consists of a single component (Rapoport and Fil-
lenbaum, 1972; Schultz and Hubert, 1973; Ling, 1975; Ling and Killough, 1976); the
number of vertices not belonging to the largest component when a specified number
of edges is present (Ogilvie, 1969); the number of components when a specified num-
ber of edges is present (Ling, 1973b; Ling and Killough, 1976); the sizes of clusters
in a partition into two clusters (Van Cutsem and Ycart, 1996). Godehardt (1990)
described a generalization of the random dissimilarity matrix model to multigraphs,
in which each variable describing the objects defines a different dissimilarity matrix.
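A sketch of one such statistic — the number of connected components when the M smallest dissimilarities are turned into edges — together with its null distribution under the random dissimilarity matrix model, obtained by shuffling the observed dissimilarities (implementation details are ours):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def n_components(D, M):
    """Number of connected components of the graph on n objects whose
    edges are the M smallest pairwise dissimilarities."""
    n = D.shape[0]
    iu = np.triu_indices(n, k=1)
    order = np.argsort(D[iu])[:M]
    rows, cols = iu[0][order], iu[1][order]
    adj = csr_matrix((np.ones(M), (rows, cols)), shape=(n, n))
    return connected_components(adj, directed=False)[0]

def null_components(D, M, n_sim=500, seed=0):
    """Null distribution under the random dissimilarity matrix model:
    randomly permute the off-diagonal dissimilarities."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    iu = np.triu_indices(n, k=1)
    vals = D[iu].copy()
    out = []
    for _ in range(n_sim):
        rng.shuffle(vals)
        Dp = np.zeros_like(D)
        Dp[iu] = vals
        out.append(n_components(Dp + Dp.T, M))
    return np.array(out)
```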
ilarities between an object in the cluster and an object outside it (van Rijsbergen,
1970); other definitions of what might constitute an ideal cluster were provided by
McQuitty (1963, 1967), Jardine (1969), Ling (1972) and Hubert (1974a). However,
these are fairly restrictive requirements, and few data sets possess large ideal clusters.
Various measures of the internal cohesion and external isolation of clusters have been
defined (e.g., Estabrook, 1966; Bailey and Dubes, 1982). Ling (1973a) defined the
'lifetime' of a cluster to be the difference between the rank of the dissimilarity at
which it is formed and the rank at which it is incorporated into a larger cluster,
evaluating lifetime distributions of single link clusters under the random dissimilarity
matrix model. Matula (1977) proved that the distribution of the size of the maximal
complete subgraph of a random graph, in which each edge is independently present
with the same probability, is highly peaked, allowing an assessment of complete link
clusters. When k edges are present in a random graph, Bailey and Dubes (1982) ob-
tained inequalities for the probabilities of obtaining indices of cohesion and isolation
as extreme as the observed ones for single link and complete link clusters, plotting
these bounds for different values of k. Lerman (1970, Chapter 2, 1980, 1983) de-
fined U statistics comparing within-cluster and between-cluster dissimilarities, and
assessed both partitions and individual clusters under the hypothesis of random par-
titions having the same cluster sizes as observed in the results. Gordon (1994, 1996b)
obtained by simulation critical values of U statistics under all four types of null model
described in Section 2, by reanalysing random data sets using the same clustering
procedure used in the classification of the original set of objects. He noted unsatisfac-
tory properties of the random dissimilarity matrix and random permutation models,
and the dependence of the results for the random pattern matrix models on the pre-
cise specification of the null model.
Many tests of the validity of an individual cluster have involved examining the dis-
tinctness of its offspring classes (e.g., Engelman and Hartigan, 1969; Gnanadesikan et
al., 1977; Sneath, 1977, 1979, 1980; Barnett et al., 1979; Lee, 1979). These tests have
usually been restricted to univariate data, sometimes obtained by projection onto the
line joining the centroids of the two classes. Analytical results are difficult to obtain
(Hartigan, 1977), and recourse has usually been made to simulation studies.
Some tests of whether or not a cluster should be sub-divided have been used as 'local
stopping rules' for deciding on the number of clusters in a data set; such tests are
described in the next section.
The tests for assessing individual clusters described in this section are internal val-
idation tests, as defined in Section 1: cluster generation and cluster validation are
carried out on the same data set. They encounter the 'multiple comparison' prob-
lem that tests of different clusters in the same data set are not independent of one
another, which has implications for the significance levels of the tests. By contrast,
Gabriel and Sokal (1969) described a simultaneous test procedure with assured over-
all significance level, in which (possibly overlapping) largest homogeneous clusters
are identified; their approach is also applicable to determining coarsest acceptable
partitions.
5. Assessing partitions
A partition of a set of objects can be compared with an externally-specified parti-
tion of the objects by evaluating a relevant index comparing the partitions. The
significance of the value of this index can be determined by comparing it with its
distribution under random permutations of the labels of the objects that leave un-
changed the numbers of objects in each class of the partitions. A comparative study
of five indices conducted by Milligan and Cooper (1986) concluded that Hubert and
Arabie's (1985) modification of Rand's (1971) index was best suited to comparing a
specified partition with cluster output comprising several different numbers of clus-
ters.
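A minimal sketch of this comparison follows, assuming the contingency-table form of Hubert and Arabie's (1985) index; the permutation step randomly relabels one partition, which leaves the numbers of objects in each class unchanged. The function names are ours.

```python
import numpy as np
from math import comb

def adjusted_rand(u, v):
    """Hubert and Arabie's (1985) adjustment of Rand's (1971) index,
    computed from the contingency table of the two label vectors."""
    u, v = np.asarray(u), np.asarray(v)
    nij = np.array([[np.sum((u == a) & (v == b)) for b in np.unique(v)]
                    for a in np.unique(u)])
    sum_ij = sum(comb(int(x), 2) for x in nij.ravel())
    sum_i = sum(comb(int(x), 2) for x in nij.sum(axis=1))
    sum_j = sum(comb(int(x), 2) for x in nij.sum(axis=0))
    expected = sum_i * sum_j / comb(len(u), 2)
    return (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)

def permutation_p_value(u, v, n_perm=999, rng=np.random.default_rng(0)):
    """Null distribution under random permutation of the object labels,
    which leaves the class sizes of both partitions unchanged."""
    observed = adjusted_rand(u, v)
    sims = [adjusted_rand(u, rng.permutation(np.asarray(v)))
            for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in sims)) / (n_perm + 1)
```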
On occasion, the external information might not provide a partition of the objects to
be classified, but rather describe the classes to which they are believed to belong. At
the most formal level, this description could specify a statistical model comprising a
mixture of a known number of classes, each with known parametric form, but with
unknown mixing proportions; at the least formal level, the distribution function for
each class could be provided by the empirical distribution of a class provided by a
clustering algorithm (Gordon, 1996d).
However, in many classification studies, external information is not available, and it
has been more common for investigators to seek answers to the following question
about a given set of objects: 'How many clusters are there in the data (and what is
their composition)?' There are several ways in which this question may be posed:
1. Does a partition into c (say) clusters that has been provided by a clustering
algorithm comprise cohesive and isolated clusters?
2. What value(s) of c is (are) most strongly indicated as (an) informative repre-
sentation(s) of the data?
These questions are addressed using, respectively, internal and relative validation
tests. The multiple comparison problem is again relevant: the value of c in (1) will
have been chosen by an investigator after studying the results of a classification, and
most tests for determining the appropriate value(s) of c in (2) have unknown signifi-
cance levels.
The first question stated above can be addressed by defining a measure of the ade-
quacy of a partition into c classes (e.g., the total within-class sum of squared distances
about the c centroids), and obtaining its distribution under a null model of the ab-
sence of class structure. Some asymptotic theoretical results have been obtained (e.g.,
Hartigan, 1977; Pollard, 1982; Bock, 1985), but further work is required to establish
their appropriateness for finite data sets. Baker and Hubert (1976) assessed parti-
tions into complete link clusters using the number of extraneous edges present under
the random graph model. Monte Carlo tests have been carried out by evaluating the
measure of partition adequacy when many randomly-generated data sets are parti-
tioned into c classes using the same clustering procedure as employed on the original
data (Arnold, 1979; Milligan and Mahajan, 1980; Milligan and Sokol, 1980).
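The following sketch illustrates the Monte Carlo scheme just described, under assumptions of ours: k-means as the clustering procedure, the total within-class sum of squares as the adequacy measure, and a uniform null model on the bounding box of the data. Any other procedure or null model could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_ss(X, labels):
    """Total within-class sum of squared distances about the centroids."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def partition_adequacy_p_value(X, c, n_sim=99, rng=np.random.default_rng(0)):
    """Compare the observed adequacy of a c-class partition with its null
    distribution, re-clustering uniform data with the same procedure."""
    observed = within_ss(X, KMeans(n_clusters=c, n_init=10).fit_predict(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    sims = []
    for _ in range(n_sim):
        Z = rng.uniform(lo, hi, size=X.shape)
        sims.append(within_ss(Z, KMeans(n_clusters=c, n_init=10).fit_predict(Z)))
    return (1 + sum(s <= observed for s in sims)) / (n_sim + 1)
```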
Partitions have also been assessed by investigating their stability when slightly modi-
fied versions of the data are reanalysed (e.g., Rand, 1971; Gnanadesikan et al., 1977),
and by their replicability across subsamples (e.g., McIntyre and Blashfield, 1980;
Breckenridge, 1989).
A problem with internal validation tests of a partition of the set of objects into c
classes is the inter-relatedness of class structure at different values of c: thus, if there
were really c_0 classes in the data, the hypothesis that there were c classes in the data
would probably be acceptable for a range of values of c close to c_0. The second ques-
tion stated above is concerned with the problem of identifying appropriate values of
c.
Procedures that seek to find the single most appropriate value of c are generally
referred to as 'stopping rules'; this is because, in the absence of information about
relevant values of c, investigators often obtain a complete hierarchical classification
using an agglomerative algorithm, and wish to have guidance on when to stop amal-
gamating and regard the current partition as the optimal one. Most research on
cluster validation has involved the construction of stopping rules, and this overview
describes only a small selection of them.
Stopping rules can be categorized as either global or local. Global stopping rules make
use of all the information contained in a partition into c clusters for each value of
c. The value of c for which a specified criterion is satisfied is then identified. Many
of the criteria are based on seeking the optimal (maximum or minimum) value of
measures comparing within-cluster and between-cluster variability, and do not have
a natural definition when c = 1: it is clearly a disadvantage of stopping rules if they
do not allow one to reach the conclusion that the data comprise only a single cluster.
Selected global stopping rules are described in this paragraph. Calinski and Harabasz
(1974) identified the value of c that maximized a scaled ratio of between-cluster
to within-cluster sum of squared distances. The C-index identifies c which min-
imizes a standardized version of the sum of all within-cluster dissimilarities (D):
if the partition has a total of r within-cluster dissimilarities, D_min (resp., D_max)
is defined as the sum of the r smallest (resp., largest) dissimilarities, and C =
(D - D_min)/(D_max - D_min). Goodman and Kruskal's (1954) gamma compares all within-
cluster dissimilarities with all between-cluster dissimilarities, defining a comparison
to be concordant (resp., discordant) if a within-cluster dissimilarity is less (resp.,
greater) than a between-cluster dissimilarity; the optimal value of c is defined to be
the one which maximises (S_+ - S_-)/(S_+ + S_-), where S_+ (resp., S_-) is the number
of concordant (resp., discordant) comparisons. A test based on the point biserial cor-
relation identifies the value of c which maximizes the correlation between the original
dissimilarities and an n x n matrix of 1's and 0's indicating whether or not the objects
belonged to the same cluster. The cubic clustering criterion (Sarle, 1983) is based on
R^2, the proportion of the variance accounted for by the clusters, and identifies the
value of c which maximizes a function of R^2 which includes terms derived from sim-
ulation studies under the Poisson model. Other global stopping rules were proposed
by Jackson (1969), Gower (1973), Davies and Bouldin (1979), Hill (1980), Ratkowsky
(1984), Krzanowski and Lai (1988), Xu et al. (1993), and many others; reviews and
comparative studies were presented by Milligan (1981), Milligan and Cooper (1985),
and Dubes (1987).
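As an illustration, the C-index described above can be computed directly from a dissimilarity matrix. The sketch below assumes a symmetric matrix D and a label vector, and leaves the degenerate case of a partition with no within-cluster pairs unhandled; the value of c minimizing the index over candidate partitions is then selected.

```python
import numpy as np

def c_index(D, labels):
    """C = (S - S_min) / (S_max - S_min): S is the sum of within-cluster
    dissimilarities; with r within-cluster pairs, S_min (S_max) is the sum
    of the r smallest (largest) dissimilarities overall."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    d = D[iu]
    within = labels[iu[0]] == labels[iu[1]]
    r = int(within.sum())
    s = d[within].sum()
    d_sorted = np.sort(d)
    s_min, s_max = d_sorted[:r].sum(), d_sorted[-r:].sum()
    return (s - s_min) / (s_max - s_min)
```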
Local stopping rules are based on tests of whether or not a pair of clusters should be
amalgamated, or a single cluster should be subdivided. Unlike global stopping rules,
they are thus restricted to the analysis of hierarchically-nested sets of partitions, and
use only a part of the data. They often also require the specification of a threshold,
the value of which can markedly influence the properties of the stopping rule.
Selected local stopping rules are described in this paragraph. Duda and Hart (1973,
Section 6.12) compared W_1, the within-cluster sum of squared distances, with W_2,
the total within-cluster sum of squared distances when the cluster is optimally par-
titioned into two, rejecting the hypothesis of a single cluster if W_2/W_1 is sufficiently
small; amalgamations cease when the hypothesis is first rejected. A test with a sim-
ilar rationale proposed by Beale (1969) is based on (W_1 - W_2)/W_2. Legendre et
al. (1985) categorized the dissimilarities in a cluster as either 'high' or 'low', and
assessed whether or not it should be subdivided into two sub-clusters by carrying
out a randomization test based on the proportion of high dissimilarities between the
sub-clusters compared to within the cluster. Other local stopping rules have been
proposed by Gnanadesikan et al. (1977) and Howe (1979), and some of the tests de-
scribed in Section 4 for assessing individual clusters can also be used as local stopping
rules.
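As an illustration, the Duda and Hart ratio might be sketched as follows. Since finding the exactly optimal bipartition is combinatorial, the split into two is approximated here by 2-means, which is our assumption rather than the original prescription, and the choice of threshold is left to the user.

```python
import numpy as np
from sklearn.cluster import KMeans

def duda_hart_ratio(X_cluster):
    """W2/W1: the within-cluster sum of squares after an (approximately)
    optimal split into two, relative to the SS of the undivided cluster.
    Small values argue against the single-cluster hypothesis."""
    w1 = ((X_cluster - X_cluster.mean(axis=0)) ** 2).sum()
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_cluster)
    w2 = sum(((X_cluster[labels == k] - X_cluster[labels == k].mean(axis=0)) ** 2).sum()
             for k in (0, 1))
    return w2 / w1
```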
For many data sets, a complete hierarchical classification has been obtained and a
stopping rule has then been used to provide guidance on where to section the hierar-
chy to provide a partition. Such work often disregards which clustering criterion has
been used to obtain the hierarchy, whereas one can expect different stopping rules -
just as different clustering criteria - to be more effective in analysing different types of
data. Stronger links between the processes of generating and validating partitions are
provided by work that assesses the stability of a partition by reanalysis of a perturbed
version of the data set (e.g., Begovich and Kane, 1982) or of bootstrap samples (Jain
and Moreau, 1987); or by separately reanalysing sub-samples of the data (Overall
and Magee, 1992); or by noting the value of c for which 'fuzzy partitions' are 'closest'
to 'hard' partitions (e.g., Roubens, 1978; Windham, 1981, 1982; Rivera et al., 1990).
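A bootstrap variant of such stability assessment, in the spirit of Jain and Moreau (1987), might be sketched as follows; we assume k-means as the clustering procedure and measure agreement by the adjusted Rand index between the reference partition and each bootstrap partition, evaluated on the objects drawn into the sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def bootstrap_stability(X, c, n_boot=50, rng=np.random.default_rng(0)):
    """Average agreement between a reference c-class partition and the
    partitions of bootstrap samples, compared on the resampled objects."""
    reference = KMeans(n_clusters=c, n_init=10).fit_predict(X)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        boot = KMeans(n_clusters=c, n_init=10).fit_predict(X[idx])
        scores.append(adjusted_rand_score(reference[idx], boot))
    return float(np.mean(scores))
```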
Proposals for determining the number of clusters present in a set of objects continue
to be published in the research literature, often with only a cursory examination of
their properties and little attempt to establish how they perform in comparison with
previously-published procedures. A detailed comparative study of thirty procedures
by Milligan and Cooper (1985) showed that many tests performed very poorly in
detecting reasonably clear-cut clusters. The five procedures that performed best in
Milligan and Cooper's (1985) simulation study were the first three global stopping
rules and the first two local stopping rules described earlier in this section. It is pos-
sible that these results have been influenced by the fact that the clusters generated in
Milligan and Cooper's (1985) study were sampled from mildly-truncated multivariate
normal distributions.
The work described above seeks to identify the single most appropriate value for the
number of clusters present in the data. However, it may be relevant to describe a set
of objects in terms of partitions into two or more (possibly nested) widely-separated
numbers of clusters, depicting the class structure present at several different scales in
the data. Gordon (1996c) presented a study investigating the ability of modifications
of Milligan and Cooper's (1985) five superior stopping rules to detect nested cluster
structure.
6. Assessing hierarchical classifications
It is stressed that all trees referred to in this context are rooted.
Many different indices have been proposed for comparing two hierarchical classifi-
cations (Rohlf, 1982). The significance of the value taken by such an index can be
assessed by comparing it with its distribution under a suitable null hypothesis. Hy-
potheses that have been considered are the random label hypothesis, in which labels
are independently permuted, and several random tree hypotheses, in which distribu-
tions of various types of tree are considered (e.g., binary or multifurcating, ranked
or non-ranked or dendrograms). Because the numbers of different trees in the dis-
tributions increase rapidly with n, it is common for these investigations to involve
simulation studies, in which trees are randomly selected from an appropriate dis-
tribution. Furnas (1984) reviewed algorithms for the uniform generation of various
types of random tree, and Lapointe and Legendre (1991) described the generation of
random dendrograms. Using this theory, tests that could be used for the external
validation of a complete hierarchical classification have been proposed under the ran-
dom label hypothesis (Hubert and Baker, 1977) and various random tree hypotheses
(e.g., Simberloff, 1987). Tests can also be carried out on hypotheses that a specified
part of a hierarchical classification possesses a certain class structure (De Soete et
al., 1987; Hubert, 1987, Chapter 5; Lapointe and Legendre, 1990).
It is worth stressing, however, that the distribution of trees resulting from the appli-
cation of a clustering algorithm to data generated under a different null model can
differ markedly from being uniform over the set of trees (Frank and Svensson, 1981).
In the cognate topic of decision trees (e.g., Breiman et al., 1984), trees have been
'pruned' so as to remove branches below uninformative internal nodes, sometimes
using new data sets (e.g., Quinlan, 1987), and such methodology has been used to
simplify hierarchical classifications (Fisher, 1996).
Apart from such external validation tests, little work has been carried out on as-
sessing the validity of a hierarchical classification provided by a clustering algorithm.
Information has been obtained about the distributions, under Poisson, Unimodal,
and random dissimilarity matrix models, of the distortion imposed on data when they
are represented in a hierarchical classification (e.g., Rohlf and Fisher, 1968; Hubert,
1974b; Gower and Banfield, 1975), but if such null hypotheses are rejected, it does
not follow that the complete hierarchy is validated. Lerman (1970, Chapter 4, 1981,
Chapter 3) defined a measure of the extent to which dissimilarities failed to satisfy
the ultrametric property defining a hierarchical classification, and investigated prop-
erties of this measure when sampling from binary pattern matrices with specified row
sums. Smith and Dubes (1980) divided the set of objects into two and compared
the classification of a subset of the data with the relevant part of the original clas-
sification, assessing the resemblance by reference to the random dissimilarity matrix
model. Other approaches have aimed at assessing the stability of a hierarchical classi-
fication by measuring the influence of each object, i.e. the extent to which the results
are altered if an object is deleted or differentially weighted (e.g., Gnanadesikan et
al., 1977; Jambu and Lebeaux, 1983, Chapter 6; Gordon and De Cata, 1988; Jolliffe
et al., 1988); separate trees, each based on (n - 1) objects, can be combined into a
consensus tree or trees (Lanyon, 1985; Lapointe et al., 1994).
A validation test which would appear to be of considerable interest is a relative test
of which of two or more hierarchical classifications provides the best summary of
a given set of objects. One might consider addressing this problem by defining a
suitable measure of the distortion imposed in representing the data in a hierarchical
classification, and identifying the classification that has minimum distortion. How-
ever, different measures of distortion tend to favour classifications obtained using
different clustering procedures (e.g., Sokal and Rohlf, 1962; Sneath, 1969; Faust and
Romney, 1985): in effect, this approach simply reformulates the problem of specify-
ing appropriate clustering procedures in terms of defining appropriate measures of
distortion.
Investigators are often interested only in parts of a hierarchical classification, and
assess its constituent clusters and partitions using methodology described in the pre-
vious two sections. It can then be of interest to represent the data in a parsimonious
tree which retains only those parts of the classification that are deemed to be signifi-
cant (Lerman, 1980; Lerman and Ghazzali, 1991; Gordon, 1994).
7. Discussion
The topic of cluster analysis was initially perceived as being concerned primarily with
the exploratory analysis of sets of objects, with little attention being paid to assessing
the validity of the results. Some results have been assessed solely in terms of their
interpretability or usefulness, but there are clearly dangers in such an approach: the
human mind is quite capable of providing post hoc justification for results of dubious
validity. More recently, there has been an increased awareness of the importance of
cluster validation. However, few studies have included a validation phase: of those
that have, most have involved stopping rules, as described in Section 5.
Much research remains to be done in the field of cluster validation. Ideally, one would
like to be able to specify: appropriate null hypotheses of the absence of cluster struc-
ture; alternative hypotheses describing departures from such null hypotheses which
it is important to detect; test statistics with known properties, which are effective in
identifying the type of class structure that is present in the data.
The precise null model that is specified can markedly influence the results of a test
(Gordon, 1996b), and it would be useful to have further investigations, particularly
of data-influenced null models.
Many different test procedures have been proposed, particularly for addressing the
'how many clusters?' question, and the time would seem to have come when attention
should be devoted to carrying out further comparisons of these in order to determine
their properties. One problem is that a 'cluster' is a vaguely-defined concept, that
many different types of cluster could be present in data, and that one can expect
different test procedures to be effective at detecting different types of structure. It is
thus unrealistic to expect to be able to identify a single 'best' test procedure for each
type of investigation. Nevertheless, Milligan and Cooper's (1985) comparative study
indicated that some tests performed very poorly in detecting reasonably clear-cut
structure. It thus seems useful to advocate further studies, with the aims of elimi-
nating from further consideration tests that have poor performance, and identifying
a small number of 'superior' tests; such tests could then profitably be incorporated
into standard statistical and classification computer packages. It seems inevitable
that such comparative studies will be largely based on assessing the performance of
test procedures in the analysis of data sets that have known class structure.
Some cluster validation tests make heavy demands on computing resources, and the
kind of investigation which is feasible depends on the size of the data set and the
nature of the data. Nevertheless, one might hope that the future will see the develop-
ment of more efficient procedures and algorithms, and an increase in the power and
availability of computing facilities, thus facilitating a greater use of cluster validation
methodology.
References:
Arnold, S. J. (1979): A test for clusters. Journal of Marketing Research, 16, 545-551.
Art, D., Gnanadesikan, R. and Kettenring, J. R. (1982): Data-based metrics for cluster analysis.
Utilitas Mathematica, 21A, 75-99.
Bailey, T. A., Jr. and Dubes, R. (1982): Cluster validity profiles. Pattern Recognition, 15, 61-83.
Baker, F. B. (1974): Stability of two hierarchical grouping techniques case I: Sensitivity to data
errors. Journal of the American Statistical Association, 69, 440-445.
Baker, F. B. and Hubert, L. J. (1976): A graph-theoretic approach to goodness-of-fit in complete
link hierarchical clustering. Journal of the American Statistical Association, 71, 870-878.
Barnett, V., Kay, R. and Sneath, P. H. A. (1979): A familiar statistic in an unfamiliar guise -- A
problem in clustering. The Statistician, 28, 185-191.
Beale, E. M. L. (1969): Euclidean cluster analysis. Bulletin of the International Statistical Institute,
43(2), 92-94.
Begovich, C. 1. and Kane, V. E. (1982): Estimating the number of groups and group membership
using simulation cluster analysis. Pattern Recognition, 15, 335-342.
Binder, D. A. (1978): Bayesian cluster analysis. Biometrika, 65, 31-38.
Bobisud, H. M. and Bobisud, L. E. (1972): A metric for classification. Taxon, 21, 607-613.
Bock, H. H. (1974): Automatische Klassifikation: Theoretische und Praktische Methoden zur Grup-
pierung und Strukturierung von Daten (Cluster-Analyse). Vandenhoeck & Ruprecht, Göttingen.
Bock, H. H. (1985): On some significance tests in cluster analysis. Journal of Classification, 2,
77-108.
Bock, H. H. (1989): Probabilistic aspects in cluster analysis. In Conceptual and Numerical Analysis
of Data, Opitz, O. (ed.), 12-44, Springer-Verlag, Berlin.
Bock, H. H. (1996): Probability models and hypothesis testing in partitioning cluster analysis. In
Clustering and Classification, Arabie, P. et al. (eds.), 377-453, World Scientific Publishing, River
Edge, NJ.
Boorman, S. A. and Olivier, D. C. (1973): Metrics on spaces of finite trees. Journal of Mathematical
Psychology, 10, 26-59.
Brailovsky, V. L. (1991): A probabilistic approach to clustering. Pattern Recognition Letters, 12,
193-198.
Breckenridge, J. N. (1989): Replicating cluster analysis: Method, consistency and validity. Multi-
variate Behavioral Research, 24, 147-161.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984): Classification and Regression
Trees. Wadsworth, Belmont, CA.
Calinski, T. and Harabasz, J. (1974): A dendrite method for cluster analysis. Communications in
Statistics, 3, 1-27.
Chand, D. R. and Kapur, S. S. (1970): An algorithm for convex polytopes. Journal of the Associa-
tion for Computing Machinery, 17, 78-86.
Chazelle, B. (1985): Fast searching in a real algebraic manifold with applications to geometric com-
plexity. Lecture Notes in Computer Science, 185, 145-156.
Cross, G. C. and Jain, A. K. (1982): Measurement of clustering tendency. In Proceedings of IFAC
Symposium on Theory and Application of Digital Control (Volume 2), 24-29, New Delhi.
Cunningham, K. M. and Ogilvie, J. C. (1972): Evaluation of hierarchical grouping techniques: A
preliminary study. Computer Journal, 15, 209-213.
Davies, D. L. and Bouldin, D. W. (1979): A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-1, 224-227.
De Soete, G., Carroll, J. D. and DeSarbo, W. S. (1987): Least squares algorithms for constructing
constrained ultrametric and additive tree representations of symmetric proximity data. Journal of
Classification, 4, 155-173.
Diday, E. and Govaert, G. (1977): Classification automatique avec distances adaptatives. R.A.I.R.O.
Informatique/Computer Science, 11, 329-349.
Saunders, R. and Funk, G. M. (1977): Poisson limits for a clustering model of Strauss. Journal of
Applied Probability, 14, 776-784.
Schultz, J. V. and Hubert, L. J. (1973): Data analysis and the connectivity of random graphs.
Journal of Mathematical Psychology, 10, 421-428.
Scott, A. J. and Symons, M. J. (1971): Clustering methods based on likelihood ratio criteria. Bio-
metrics, 27, 387-397.
Shepard, R. N. (1974): Representation of structure in similarity data: Problems and prospects.
Psychometrika, 39, 373-421.
Simberloff, D. (1987): Calculating probabilities that cladograms match: A method of biogeograph-
ical inference. Systematic Zoology, 36, 115-195.
Smith, S. P. and Dubes, R. (1980): Stability of a hierarchical clustering. Pattern Recognition, 12,
177-187.
Smith, S. P. and Jain, A. K. (1984): Testing for uniformity in multidimensional data. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, PAMI-6, 73-81.
Sneath, P. H. A. (1969): Evaluation of clustering methods (with Discussion). In Numerical Taxon-
omy, Cole, A. J. (ed.), 257-271, Academic Press, London.
Sneath, P. H. A. (1977): A method for testing the distinctness of clusters: A test of the disjunction
of two clusters in Euclidean space as measured by their overlap. Mathematical Geology, 9, 123-143.
Sneath, P. H. A. (1979): The sampling distribution of the W statistic of disjunction for the arbitrary
division of a random rectangular distribution. Mathematical Geology, 11, 423-429.
Sneath, P. H. A. (1980): Some empirical tests for significance of clusters. In Data Analysis and
Informatics, Diday, E. et al. (eds.), 491-508, North-Holland, Amsterdam.
Sneath, P. H. A. (1986): Significance tests for multivariate normality of clusters from branching
patterns in dendrograms. Mathematical Geology, 18, 3-32.
Sokal, R. R. and Rohlf, F. J. (1962): The comparison of dendrograms by objective methods. Taxon,
11, 33-40.
Strauss, D. J. (1975): A model for clustering. Biometrika, 62, 467-475.
Strauss, R. E. (1982): Statistical significance of species clusters in association analysis. Ecology, 63,
634-639.
Van Cutsem, B. and Ycart, B. (1996): Indexed Dendrograms on Random Dissimilarities. Rapport
MAI 23, CNRS, Universite Joseph Fourier Grenoble I.
Van Ness, J. W. (1973): Admissible clustering procedures. Biometrika, 60, 422-424.
van Rijsbergen, C. J. (1970): A clustering algorithm. Computer Journal, 13, 113-115.
Vassiliou, A., Ignatiades, L. and Karydis, M. (1989): Clustering of transect phytoplankton collec-
tions with a quick randomization algorithm. Journal of Experimental Marine Biology and Ecology,
130, 135-145.
Ward, J. H., Jr. (1963): Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236-244.
Windham, M. P. (1981): Cluster validity for fuzzy clustering algorithms. Fuzzy Sets and Systems,
5, 177-185.
Windham, M. P. (1982): Cluster validity for the fuzzy c-means clustering algorithm. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, PAMI-4, 357-363.
Xu, S., Karnath, M. V. and Capson, D. W. (1993): Selection of partitions from a hierarchy. Pattern
Recognition Letters, 14, 7-15.
Zeng, G. and Dubes, R. C. (1985a): A test for spatial randomness based on k-NN distances. Pattern
Recognition Letters, 3, 85-91.
Zeng, G. and Dubes, R. C. (1985b): A comparison of tests for randomness. Pattern Recognition,
18, 191-198.
What is Data Science?*
Fundamental Concepts and a Heuristic Example
Chikio Hayashi
The Institute of Statistical Mathematics
Sakuragaoka, Birijian 304
15-8 Sakuragaoka, Shibuya-ku
Tokyo 150, Japan
Summary: Data Science is not only a synthetic concept to unify statistics, data analysis
and their related methods but also comprises its results. It includes three phases, design for
data, collection of data, and analysis on data. Fundamental concepts and various methods
based on it are discussed with a heuristic example.
1. Introduction:
Statistics and data analysis have developed in their realms separately and contributed to the
development of science, showing their unique properties. The ideas and various methods
of statistics were very useful, well known and solved many problems. Mathematical
statistics succeeded it and developed new frontiers with the idea of statistical inference.
Thus the application of these view points brought us many useful results.
However, the development of mathematical statistics has devoted itself only to the
problems of statistical inference, an apparent rise of precision of statistical models, and to
the pursuit of exactness and mathematical refinement, so mathematical statistics have been
prone to be removed from reality.
On the other hand, the method of data analysis has developed in the fields disregarded by
mathematical statistics and has given useful results to solve complicated problems based on
mathematico-statistical methods (which are not always based on statistical inference but
rather are descriptive). Some results are found in the references.
In the development of data analysis, the following tendency is often found, that is to say,
data analysts have come to manipulate or handle only existing data without taking into
consideration both the quality of data and the meaning of data, to cope with the
methodological problem based on unrealistic artificial data with simple structure, to make
efforts only for the refinement of convenient and serviceable computer software and to
imitate popular ideas of mathematical statistics without considering the essential meaning.
As this differentiation proceeds with specialization, the innovation of useful methods of
statistics and data analysis seems to disappear and signs of stagnation appear. The reason is
that the essential aim of analysis of phenomena with data has been forgotten. For
extensive and profound development of intrinsically useful methods of statistics and data
analysis beyond the present state, the unification of statistics and data analysis is
necessary. For this purpose, the construction of a new point of view or a new paradigm is
a crucial problem. So, I will present "Data Science" as a new concept.
* The roundtable discussion "Perspectives in classification and the Future of IFCS" was
held at the last Conference under the chairmanship of Professor H. -H. Bock. In this
panel discussion, I used the phrase 'Data Science'. There was a question, "What is 'Data
Science'?" I briefly answered it. This is the starting point of the present paper.
Data Science is not only a synthetic concept to unify statistics, data analysis and their
related methods, but also comprises its results. Data Science intends to analyze and
understand actual phenomena with "data". In other words, the aim of data science is to
reveal the features or the hidden structure of complicated natural, human and social
phenomena with data from a different point of view from the established or traditional
theory and method. This point of view implies multidimensional, dynamic and flexible
ways of thinking.
Data Science consists of three phases: design for data, collection of data and analysis on
data. It is important that the three phases are treated with the concept of unification based
on the fundamental philosophy of science explained below. In these phases the methods
which are fitted for the object and are valid, must be studied with a good perspective. The
strategy for research in Data Science through three phases is summarized in Fig. 1.
[Fig. 1: Strategy for research in Data Science. The three phases (design for data, collection of data, analysis on data) are linked in a circular movement: simplification by methods of classification, multidimensional data analysis and other statistical methods, and diversification by finding and reconsidering deviations of "individuals" from the means, class belonging and structure.]
Generally speaking, phenomena are multifarious. First, these phenomena are formulated
and the planning of a survey or experiment is completed, based on the ideas of Data
Science (phase of design for data). Thus phenomena are expressed as multidimensional
and, frequently, time-series data. The characteristics or properties of the data are
necessarily made clear (phase of collection of data). The obtained data are too
complicated to draw a clear conclusion. So, by methods of classification and
multidimensional data analysis, and other mathematico-statistical methods, the data
structure is revealed. In other words, simplification and conceptualization are carried
out. However, this information generally turns out to be incomplete and unsatisfactory
even though the structure finding was realized. At this stage, by finding and
reconsidering the deviation of "individuals", which gives a vivid account of the
roughness of conceptualization or simplification, from the mean values or class-
belonging (classification) and structure, diversification of data is made. Based on this
multifariousness, structure finding or conceptualization is attained, in an advanced sense,
in the progressive stages. Such a circular movement of research then continues.
Dynamic movement of both simplification or conceptualization and diversification begins
in turn. Further, having been able to solve a problem, it is expected to discover another
new problem to be solved in an advanced sense. The developmental process, in phase,
design ---> collection -> analysis --->design ---> collection ---> analysis --->design ->
collection ---> analysis ---> design ---> ... and the dynamic process mentioned above,
that is to say, progress and regress, are indispensable in Data Science. This shows that
the methodology of Data Science develops, as it were, in the ascending-spiral-process
and research proceeds as seen in spiral stairs. The main point is schematically depicted
in Fig. 1.
Thus we can say that data science comprises not only the results themselves of theory and
method but also all methodological results related to various processes which are necessary
to work out the results mentioned above. The former is called "hard results" and the latter
is called "soft results". Data Science includes simultaneously hard and soft results. It goes
without saying that a useful solution emerges in coping with the complicated problem in
question by the use of Data Science. It is repeatedly emphasized that the coherent idea
through all items shown in Fig. 1 flows in Data Science for the purpose of analysis of
phenomena with data.
Some concrete examples in social and medical surveys for the three phases are shown
below. Before everything, it is stressed that the relevant methods are always treated with
validity.
The theory and method concerning this phase are next considered. Particularly, theoretical
and systematic construction of a questionnaire is a very important problem. The problems
in this phase are frequently solved using various kinds of methods of data analysis. For
example,
Collection of data is not only a problem of practice, but must be theoretically and
concretely studied. The problems in this phase can not be solved without any information
of design for data and any use of data analysis.
Evaluation of survey bias and evaluation of experimental bias including
question bias, interview bias, interviewer bias, observation bias, etc.
Evaluation of non-response error,
Evaluation of measurement error,
Evaluation of response error, inevitably variable response data, for example,
live data,
Method of diminution of the relevant bias and error,
etc.
The problems in this phase are, of course, closely related to the previous two phases. The
main point is to obtain useful and instrumental information without any distortion or with
validity. For this purpose, clear and lucid methods of analysis, without unnecessary
mathematical conditions imposed only for model building or a too sophisticated style, are desirable.
For example,
Various methods of scaling, quantification methods, correspondence analysis
(analyse des donnees), multidimensional scaling, exploratory data
analysis, categorical data analysis and various methods of classification and
clustering,
Useful data analysis suitable for the purpose,
Useful coding of questions and their synthesis,
Valid analysis of data including various errors,
Evaluation of data quality and data analysis depending on data quality,
Analysis on probabilistic response,
Exploratory approach by data analysis,
Method of simultaneous realization of classification and structure finding,
Treatment of open answers in an open-ended question, for example, exploratory
approach for coding or automatic processing of textual data,
Probabilistic approach,
Computer experiments,
etc.
These three phases must be synthetically treated or taken into consideration with the
consistent idea in order to understand phenomena. This is the fundamental concept of
Data Science. Of course, each subject will be studied separately. However, each subject
must be studied in the context of Data Science. This idea will lead to the development of
statistics and data analysis in a new direction. Thus their standpoint is heightened and a
new horizon will appear as innovative methods and theories are created in the three phases.
4. A Heuristic Example
Thus, we know that individuals have various response patterns. These are integrated in a
collective through mutual and social communications in so far as individuals live in a
society. This is collective character or national character (in some cases ethnic character)
which is formed beyond individuals. In this situation, some principles emerge in the social
environment. Receiving impacts from the exterior, social norm, customs system,
paradigm, education, contemporary thought and arts, religious feelings, future course of
philosophy and science, etc. are formed, as a "cultural climate" is created. Individuals are
influenced by this cultural climate: the strongest influence is upon the response pattern in
general social items, the second upon that in national character items and the weakest upon
that in basic human feelings items. Such a perpetual circular movement continues. It is
our aim to represent the collective character in terms of Data Science.
Our point of view of research is not hypothesis testing (theory-driven) but to put the
emphasis on an exploratory approach (data-driven).
First of all, we define the universe and population of the Japanese. A nation-wide sample
survey is done for a sample by stratified three stage random sampling and by face to face
interviewing using the same questionnaire, the contents of which cover the items shown
below.
1) Fundamental Attributes, 2) Religion, 3) Family,
4) Social Life, 5) Interpersonal Relations,
6) Politics, 7) Individual Attitude toward Other Unclassified Social Issues
The outline of our survey is shown in Fig. 2.
The analysis from such time series data makes clear both enduring and changing aspects.
The next step is a comparative study of national character.
Here, in a comparative study, we present a new idea for questionnaire construction and
selection of nations to be compared. This is Cultural Link Analysis (CLA in abbreviation)
--Hayashi et al. 1986, 1992-- which belongs to a similar genre to Guttman's Facet Theory
(Guttman, 1994), and reveals a relational structure of collective characters of peoples in
different cultural spheres (nations or ethnic groups).
a. A spatial link inherent in the selection of the subject culture or society.
The connections seen in such selection may be considered along the dimensions
of social environment, culture and ethnic or national characteristics.
b. An item structure link inherent in the commonness and differences in item
response patterns within and across different cultures.
c . A temporal link inherent in longitudinal analysis.
An example of a. is shown in Fig.3.
[Fig. 3: (a) spatial linkage among the surveyed groups (e.g., Hawaii residents), with the number of times each group was surveyed; (b) multidimensional linkage, through questions common to modern societies.]
As for c., time series surveys in various nations or ethnic groups and their comparison are
informative.
Our international comparative surveys, which consist of Americans in North America,
English in the UK, French in France, Germans in the former West Germany, Dutch in the
Netherlands, Italians in Italy, Japanese in Japan, Japanese-Americans in Hawaii, Japanese-
Brazilians in Brazil, are described in Fig. 5, and the conjectured link scheme is depicted
in Fig. 6.
[Fig. 5: the nation-wide sample surveys, with survey year and sample size in parentheses, e.g., Japanese-Americans in Hawaii, 1971 (434); Dutch, 1993 (1083).]
[Fig. 6: conjectured link scheme of similarity and dissimilarity among the groups: Japanese, Japanese-Americans in Hawaii and Japanese-Brazilians (Japanese-origin culture); Americans; Dutch; and the Latin cultural climate (Catholic culture).]
It is our aim to make clear and depict the following points by well-designed comparative
surveys and their data analysis, i.e. "quantitative and data scientific" methods,
Since such a way of research is based on universal logic, people even in different cultural
spheres can understand the results of the analysis.
Mainly considering the view of Japanese national character itself, we can summarize our
study as in Fig.7. In contrast, mainly considering the comparison of national character in
different cultural spheres, we can summarize our study as in Fig. 8.
[Fig. 7: the particularity of Japanese national character is identified through the comparative aspect.]
Then, from these two kinds of surveys, surveys both in time and space, that is to say,
continuing surveys and comparative surveys, we can define national character in statistical
terms corresponding to various levels. See Fig. 8.
Here, majority opinion is defined as not only that supported by more than 2/3 of the
individuals in the total but also that supported by more than 2/3 of the individuals in each
breakdown in sex, age and education. In Fig. 8, O marks mean existence of the item, O*
marks mean existence of temporally stable data and "no datum" means non-existence of
temporally stable evidence but existence of cross-section data. For example, as for 2.
Opinion Distributions, the first line means a definition on the highest level, i.e. the opinion
distribution is not only temporally stable but also particular or characteristic compared with
those in different nations or ethnic groups and the second line means a definition on a
lower level, i.e. temporally stable data do not exist but it is particular or characteristic
compared with those in different nations or ethnic groups, in which temporally stable
evidence occasionally exists. X marks mean there is no logical meaning.
$$ d_{ij} \;=\; \frac{1}{R} \sum_{r=1}^{R} \left| P_{ir} - P_{jr} \right| $$

where P_ir is the percentage of the i-th nation on the single key answer category of the
r-th question; d_ij is a fuzzy measure of difference between nations i and j.
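Computed over all pairs of nations at once, the measure is straightforward; the sketch below (in Python, with names of our choosing) assumes a table P with one row per nation and one column per question, holding the percentages on the key answer categories.

```python
import numpy as np

def fuzzy_difference_matrix(P):
    """d_ij = (1/R) * sum_r |P_ir - P_jr| for all pairs of nations; P holds
    one row per nation and one column per question (percentages on the
    single key answer category)."""
    R = P.shape[1]
    return np.abs(P[:, None, :] - P[None, :, :]).sum(axis=-1) / R
```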
Thus we have a similarity matrix between i and j. Based on this fuzzy similarity matrix, a
method of multidimensional data analysis, MDA-OR (Minimum Dimension Analysis
Ordered class belonging) [Hayashi 1974, 1976], which is one kind of so-called
multidimensional scaling MDS, is applied for graphical representation of groups. A
quite similar configuration of groups is obtained by quantification method III or
correspondence analysis using the matrix of d's directly. The result is shown in Fig. 9.
This is a simple graphical summarization of the similarity relations. The degree of
similarity is revealed as the distance in Euclidean space. Roughly speaking, consider that
the distance corresponds to the similarity and the configuration gives a reasonable
summarization of linked similarities. Here, the triangular relation mentioned above has
been revealed.
The arrow means the direction of the value in the third axis in Fig.9. A line means plus
direction while a dotted line means minus direction in the third dimension.
If French and Italians are deleted and the same analysis is done, Fig. 10 is obtained.
JB is found as a pole instead of French and Italians.
A : Americans
E : English
F : French
G : Germans
H : Dutch
I : Italians
J : Japanese
JA : Japanese-Americans in Hawaii
JB : Japanese-Brazilians in Brazil
JB1 : First generation of JB
JB2 : Second and third generations of JB
The similarity between Japanese-Brazilians and French and Italians is very interesting.
The positions of Japanese-Americans and Japanese-Brazilians are to be noted with J-
attitude mentioned previously, the former being as a linkage between Japanese and
Americans, the latter being as a linkage between Japanese and French or Italians. The link
relation in Fig. 6 has been revealed in Fig. 9 by the data analysis. Thus we could clarify
the entire picture of the configuration of groups.
Then, we can proceed to a detailed analysis of data without loss of sight of the whole
situation. For example, we can examine in what groups of questions the nations differ and
in what groups they are common, i.e. the simultaneous classification of questions and
nations, or the universality and particularity of data structure across the nations.
References
The following references are relevant to the various parts of this paper.
Arabie, P., Hubert, L. J. and De Soete, G. (eds.) (1996): Clustering and Classification,
World Scientific.
Benzecri, J.P. (1973): L'Analyse des Donnees, Dunod.
Benzecri, J.P. (1992) : Correspondence Analysis Hand-Book, Marcel Dekker.
Bock, H.-H. and Polasek, W. (eds.) (1996): Data Analysis and Information Systems,
Springer.
Borg, I. and Shye, S. (1995): Facet Theory: Form and Content, Advanced Quantitative
Techniques in the Social Sciences Series 5, Sage Publications.
Diday, E., J. Lemaire, J. Pouget and F. Testu (1983): Elements d'Analyse des
Donnees, Dunod.
Diday, E., G. Celeux, Y. Lechevallier, G. Govaert and H. Ralambondrainy (1989):
Classification automatique et Analyse des Donnees: Methodes et environnement
informatique, Dunod.
Diday, E. and Y. Lechevallier (1991): Symbolic-Numeric Data Analysis and Learning
-Versailles Sept 91-, Nova Science Publishers.
Gaul, W. and Pfeifer, D. (eds.) (1996): From Data to Knowledge, Springer.
Guttman, L. (1994): Louis Guttman on Theory and Methodology: Selected Writings,
Shlomit Levy (ed.), Dartmouth.
Hayashi, C. (1956): Theory and example of quantification (II). Proc. Inst. Statist. Math.,
3, 69-98.
Hayashi, C. (1974): Minimum dimensional analysis MDA. Behaviormetrika, 1, 1-24.
Hayashi, C. (1976): Minimum dimensional analysis MDA-OR and MDA-UO, Essays in
Probability and Statistics, Ikeda, S., et al. (eds.), 395-412, Shinko Tsusho Co. Ltd.
Hayashi, C. (1993): Treatise on Behaviormetrics, Asakura Shoten.
Hayashi, C. (1993): Quantification of Qualitative Data --Theory and Method--, Asakura
Shoten.
Hayashi, C. (1995): Changing and Enduring Aspects of Japanese National Character,
The Institute of Social Research, Osaka, Japan.
Hayashi, C. and Suzuki, T. (1986): Data Analysis in Social Surveys, Iwanami Shoten.
The English version by Hayashi, C., Suzuki, T. and Sasaki, M., "Data Analysis for
Comparative Social Research: International Perspectives" was published by
Elsevier, North-Holland in 1992.
Hayashi, C. and Hayashi, F. (1995): Comparative Study of National Character.
Proceedings of the Institute of Statistical Mathematics Vol. 43, No.1, 27-80.
Jambu, M. (1989): Exploration Informatique et Statistique des Donnees, Dunod.
Jambu, M. (1991): Exploratory and Multivariate Data Analysis, Academic Press.
Lebart, L., Morineau, A. and Warwick, K.M. (1984): Multivariate Descriptive
Statistical Analysis. John Wiley.
Lebart, L. and Salem, A. (1988): Analyse Statistique des Donnees Textuelles,
Dunod.
Lebart, L. and Salem, A. (1994): Statistique Textuelle, Dunod.
Lebart, L., Morineau, A. and Piron, M. (1995): Statistique Exploratoire
Multidimensionnelle, Dunod.
Van Cutsem, B. (1994): Classification and Dissimilarity Analysis, Springer.
Fitting Graphs and Trees with Multidimensional
Scaling Methods
Willem J. Heiser
Summary: The symmetric difference between sets of qualitative elements (called features)
forms the basis of a distance model that can be used as a general framework for fitting a
particular class of graphs, which includes additive trees, hierarchical trees and circumplex
structures. It is shown how to parametrize this fitting problem in terms of a lattice of
subsets, and how inclusion relations between feature sets lead to additivity of distance
along paths in a graph. An algorithm based on alternating least squares and on the recent
method of cluster differences scaling is described, and illustrated for the general case.
Following Shepard and Arabie (1979), we will use the concept of a feature space, in which each
object of analysis is represented by some subset of features, while the features in turn are
represented by subsets of objects. By restricting attention to models that can be formulated
in terms of features, we are considering a particular subclass of graphical structures, to be
called feature graphs.
The natural metric used in feature space is the city-block distance, which acquires several
remarkable properties when the coordinates are restricted to be binary. Before discussing
these in more detail, we need to introduce some notation.
$$ d_{ij}(A, \Lambda) \;=\; \sum_{(v_k, v_l) \in P(v_i, v_j)} \lambda_{kl} \qquad (1) $$

in which P(v_i, v_j) is the set of edges on the geodesic (shortest path) between v_i and v_j. We
write d_ij(A, Λ) because the distance depends not only on Λ, but also on A via the lists
P(v_i, v_j). Thus, the path length is the sum of the edge lengths along the geodesic. If all λ_kl
are equal, d_ij(A, Λ) is the usual graphical distance: a count of the number of edges in the
shortest path from v_i to v_j.
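Equation (1) is an ordinary shortest-path computation on the valued graph. A minimal sketch, under the assumption that the graph is encoded as a symmetric weight matrix with λ_kl > 0 on the edges of A and zeros elsewhere (scipy reads zero entries of a dense matrix as absent edges):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def path_length_distances(weights):
    """Path-length distance d_ij(A, Lambda) of equation (1): 'weights' is a
    symmetric matrix holding lambda_kl > 0 on the edges of A and 0 elsewhere
    (zero entries are read as absent edges).  With all edge lengths equal to
    one, this reduces to the usual graphical distance."""
    return shortest_path(weights, directed=False)
```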
Let us first consider the question of embedding, or realizability: under what conditions on
Δ can the objects be mapped into a valued graph with some path-length distance? The
answer is that we may identify λ_ij = δ_ij, provided that δ_ij is positive-definite and satisfies
the triangle inequality δ_ij ≤ δ_il + δ_lj (Hakimi and Yau, 1964). Note that
symmetry is not required (if we allow two edges between any two nodes); but if δ_ij is in
addition symmetric, it is a metric, and the result says that any metric can be embedded into
a valued graph. In the presence of error, it is much to be preferred to optimize some loss
function measuring the lack of fit of feasible model distances, rather than to rely on
idealized conditions evaluated directly in terms of the data. Therefore, we study the fitting
problem of finding G (in particular, some A and Λ) so that the least squares loss function
$$ L(A, \Lambda) \;=\; \sum_{i<j} \bigl( \delta_{ij} - d_{ij}(A, \Lambda) \bigr)^{2} \qquad (2) $$
is minimal. Note that the major difficulty in (2) is finding A, since (1) is additive in the
elements of Λ, so that, once we know A, finding Λ is just a non-negative regression
problem, which can be solved by standard methods (Lawson and Hanson, 1974). How
can we find out which edges to include and which to delete?
Our approach in the present paper will be to use a reparametrization of d_ij(A, Λ), which
restricts attention to a certain subclass of graphs. To define the vertices of such a graph, we
introduce a set of p discrete features F = {F_1, ..., F_t, ..., F_p}. On the feature set F we
define a family S of n distinct nonempty subsets S = {S_1, ..., S_i, ..., S_n}, whose union
is F. Furthermore, each feature F_t ∈ F is associated with some nonnegative feature dis-
criminability parameter η_t. Every object will now be represented by some subset of
features; that is, our goal will be to find a mapping σ: o_i ∈ O → S_i ∈ S.
To rephrase the fitting problem in terms of the mapping σ, we must have a metric on
(sub)sets that parallels the path-length distance. Following Goodman (1951, 1977) and
Restle (1959, 1961), we may define a metric on sets, here to be called the feature distance

$$ d(S_i, S_j) \;=\; \mu[\, S_i \,\triangle\, S_j \,] \qquad (3) $$

where μ[·] is a measure function defined over the set of features (usually, just a count),
and A △ B is the symmetric set difference between sets A and B. Thus the feature distance
measures the extent to which S_i possesses features that S_j does not have and vice versa. By
elementary means it can be shown that (3) satisfies the metric axioms, and there are a
number of alternative expressions of it that enable us to naturally include the feature
discriminabilities η_t, which we will consider more closely in section 3. The first and
foremost property of d(S_i, S_j), however, is stated in the following result.
Theorem (Flament, 1963, p. 17).
Let L(S) be the lattice obtained from ordering the elements of S by inclusion, and con-
sider the graph representing L(S) having nodes v_i = S_i and an edge between v_i and v_j
whenever S_i covers S_j or vice versa. Define d_ij(A) as the path-length distance (geodesic)
between nodes v_i and v_j with all edge lengths λ_ij equal to unity. Then the feature
distance d(S_i, S_j) is equal to d_ij(A) in the graph representation of the lattice L(S).
If S is an arbitrary selection of subsets, it is understood that L(S) includes the extension of
S with all subsets that can be formed by union and intersection of its elements. In the
graphical representation of this lattice of subsets there are generally several paths from S_i to
S_j, but the crucial thing is that they all have equal length; hence, they are equivalent in
terms of distance. Equivalence of distinct paths follows from the fact that each edge in the
graphical representation of the lattice corresponds to one single element of F, which is the
feature that distinguishes the covering subset from the covered one. While the graphical
distance d_ij(A) is a count taken along a path of the distinguishing features in some
particular order, the feature distance d(S_i, S_j) is the same count of distinguishing features,
taken in any order.
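When μ[·] is a count, the feature distance is simply the size of the symmetric set difference, as the following small sketch on Python sets shows; the assertion also previews the additivity property discussed next.

```python
def feature_distance(S_i, S_j):
    """d(S_i, S_j) = |S_i symmetric-difference S_j| when mu[.] is a count."""
    return len(S_i ^ S_j)

# Betweenness implies additivity: with S_i > S_j > S_k ordered by inclusion,
# d(S_i, S_k) == d(S_i, S_j) + d(S_j, S_k).
assert feature_distance({'a', 'b', 'c'}, {'a'}) == \
       feature_distance({'a', 'b', 'c'}, {'a', 'b'}) + \
       feature_distance({'a', 'b'}, {'a'})
```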
Another property that turns out to be crucial for the present approach is that betweenness
implies additivity: that is, if S_j is in between S_i and S_k in the sense that either S_i ⊃ S_j ⊃ S_k,
or S_i ⊃ S_j and S_k ⊃ S_j, we have d(S_i, S_k) = d(S_i, S_j) + d(S_j, S_k) along the path from S_i to
S_k. In this case, there need not be a direct edge from S_j to S_k. This characteristic allows us
to first formulate the fitting problem (2) in terms of feature distances, next sort out the
additivities in the fitted distances, and finally construct the graph by excluding edges that
are sums of other edges. Thus the graphs to be constructed with the present approach will
always be subgraphs of the graph representation of a lattice, which forms the embedding
space of the given set of objects in much the same way as Euclidean space is used to embed
a finite number of points in ordinary multidimensional scaling.
$$ d_{ij}(B) \;=\; \sum_{t=1}^{p} | b_{it} - b_{jt} | \;=\; \sum_{t=1}^{p} \eta_t \, | e_{it} - e_{jt} | \qquad (5) $$

where the b_it are still binary, albeit not necessarily (0,1) variables, collected in the n x p
matrix B = (b_it = η_t e_it), and where the discriminabilities η_t are nonnegative parameters to
be estimated. Thus, the weighted feature distance defined in (5) allows for a differential
contribution of the features to the overall length of the path from S_i to S_j. In geometrical
terms, the introduction of feature discriminabilities turns the hypercube corresponding to E
into a rectangular parallelepiped corresponding to B.
It can be shown that the Theorem stated in the previous section still holds for the weighted
feature distance, if it is adjusted to allow for unequal λ_ij. Since each edge in the graphical
representation of the lattice L(S) corresponds to one feature in F, we can associate exactly
one feature discriminability η_t in (5) with each edge length λ_kl in (1). For example, if the
set of edges on the shortest path between v_i and v_j would be P(v_i, v_j) = {(v_i, v_k), (v_k, v_l),
(v_l, v_j)}, there will be three features F_1, F_2, and F_3 on which S_i and S_j are different, with
the edge lengths being related by the one-to-one mapping λ_ik = η_1, λ_kl = η_2, and λ_lj = η_3.
Hence we have d_ij(A, Λ) = λ_ik + λ_kl + λ_lj = η_1 + η_2 + η_3 = d_ij(B) in this example.
Fitting the weighted feature distance to the dissimilarities δ_ij by least squares leads to the loss function

Σ_{i<j} (δ_ij - Σ_t |b_it - b_jt|)^2,    (6)

which must be minimized over all binary valued matrices B. Because the feature distance is
additive over features, it is possible to employ an alternating least squares (ALS) scheme,
fitting the model one feature at a time, given some starting values {b_it}. Explicitly, given
the current values {b_is} for s ≠ t, the corrected dissimilarity δ'_ij is defined as
δ'_ij = δ_ij - Σ_{s≠t} |b_is - b_js|, the original dissimilarity corrected for the contribution of
the fixed variables. Substituting (5) into (6), and inserting δ'_ij, we find that the ALS
subproblem for feature t is to minimize, given δ'_ij,
Σ_{i<j} (δ'_ij - |b_it - b_jt|)^2    (7)
over the binary n-vector b_t. This minimization subtask is a one-dimensional MDS problem
with the coordinates restricted to form a bipartition, and therefore the cluster differences
scaling (CDS) algorithm of Heiser and Groenen (1996) applies, with the number of clusters
equal to two. The ALS algorithm cycles over CDS subtasks until convergence.
Let us have a closer look at this particular CDS subtask, by resolving B again into its
discrete and continuous factors. Writing |b_it - b_jt| = η_t {(1 - e_it)e_jt + (1 - e_jt)e_it}, setting
the partial derivative of (7) with respect to η_t equal to zero and simplifying shows that, for
any given bipartition {e_it | i = 1, ..., n} the optimal value of the discriminability parameter
for feature t is equal to max(0, η*_t), with η*_t denoting the unconstrained minimizer
η*_t = Σ_{i<j} {e_it(1 - e_jt) + (1 - e_it)e_jt} δ'_ij / (n_t(n - n_t)),    (8)
where n_t is the number of objects in one group, and n - n_t the number of objects in the
other. Thus, the length of edge t in the fitted feature graph will be equal to the average
corrected dissimilarity between the two groups of objects that constitute that particular
feature. If the features are exclusive, e_it(1 - e_jt)δ'_ij = e_it(1 - e_jt)δ_ij, and (8) becomes just the
average between-group dissimilarity. If the features are not exclusive and η*_t is relatively
large, then the corresponding bipartition must be a good discriminator by itself, on top of
the contribution of the other features, since we always have δ'_ij ≤ δ_ij; this justifies the name
discriminability parameter.
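As an illustration of update rule (8), the following sketch (our own naming; e_t is the binary feature vector and delta_corr the matrix of corrected dissimilarities δ'_ij) computes the optimal discriminability as the average corrected between-group dissimilarity:

    import numpy as np

    def optimal_discriminability(e_t, delta_corr):
        # Unconstrained minimizer (8): average corrected dissimilarity
        # between the two groups defined by binary feature vector e_t;
        # delta_corr is a symmetric matrix of corrected dissimilarities.
        n_t = int(e_t.sum())
        n = len(e_t)
        # Each between-group pair (i in group, j outside) is counted once.
        between = e_t[:, None] * (1 - e_t)[None, :] * delta_corr
        eta = between.sum() / (n_t * (n - n_t))
        return max(0.0, eta)

    e = np.array([1, 1, 0, 0])
    D = np.array([[0, 1, 4, 5], [1, 0, 3, 4],
                  [4, 3, 0, 1], [5, 4, 1, 0]], float)
    print(optimal_discriminability(e, D))   # (4+5+3+4)/4 = 4.0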
We still have to indicate how to find E. Loss function (7) is quadratic in one column (size
n) of the binary matrix E, so this subtask is still a hard combinatorial problem, even though
its size is reduced by a factor p with respect to loss function (6). The present implementation
uses a nesting of several random starts (within features and across features), together with
K-means type reallocations. Heiser and Groenen (1996) have described a strategy called
Fuzzy Steps to alleviate the local minimum problem for CDS, but it appears that the problem
here is especially difficult across features (in the ALS phase), not so much within features.
A more extended discussion of the algorithmic aspects of finding E is in preparation.
seem to represent the best split in terms of the dissimilarity data. The percentage of variance
accounted for (VAF) for all ten solutions is given in Table 1. This table also gives for each
solution the DAF (percentage of Dispersion Accounted For), defined as the sum of squared
fitted distances divided by the sum of squared dissimilarities. DAF is the scale-free
goodness-of-fit measure that is maximized when the badness-of-fit measure (6) is
minimized.
Table 1. Goodness of fit for feature graph representations of the Henley (1969) data

# features:   1     2     3     4     5     6     7     8     9    10*
DAF          63.3  83.4  89.4  93.9  95.8  96.9  97.6  98.1  98.5  97.3
VAF          21.9  41.9  54.0  71.9  79.2  82.3  84.7  87.7  90.3  83.9
We see from Table 1 that a VAF just above the percentage of the tree solution is reached
with the five-feature solution (79.2%), which has a DAF of 95.8%. This solution, which
does not yet discriminate all objects from each other (leaving seven objects in three small
clusters), is shown in Figure 2. While the terms in the clusters {cat, dog} and {goat, cow,
horse} in the feature graph are also close together in the additive tree in Figure 1, this is not
the case for the cluster {bear, pig}. Another major difference is that the feature graph is not
tree-like at all. As to the interpretation of the five-feature solution in Figure 2, it is clear that
{deer} is an isolate (there is one feature that contrasts it with all other terms, with a
discriminability of approximately 20), and that there is a "domestic" versus "wildlife"
feature contrasting, from top to bottom, {sheep, cow, horse, goat, cat, dog} with {pig,
bear, lion, mouse, rabbit, deer}, with a discriminability of about 14. A third important split
is {cat, dog, lion, mouse} versus the rest, with discriminability 16.
A more differentiated representation arises if we look at one of the solutions with a higher
number of features. How many features to take is not only a matter of the amount of fit that is
deemed acceptable, but also depends on the issue of how many edges need to be kept, or
conversely, how many additivities there are in the fitted distances. Judged by the number
of edges needed, while still accounting for a reasonable amount of variance, a special ten-
feature solution was selected as the best one (see the last column of Table 1; its graph with
24 edges is displayed in Figure 3). It consists of six common features (i.e., features shared
by more than one object) and four unique features (not shared by any other object).
Figure 3 contains two types of nodes: the closed circles, which represent the objects of
analysis, and the open circles, called latent nodes, which represent subsets of features that
can be obtained by taking either the union or the intersection of the feature subsets
characterizing two other nodes. Remember that the fitted feature graph is a subgraph of the
graph representation of the lattice of feature subsets, and latent nodes are other elements of
this lattice that can be included afterwards, to make the graph simpler in terms of its
pathways and number of edges. As a good example of the effect of the introduction of a
latent node, consider four objects characterized by the subsets (BCD), (ACD), (ABD), and
(ABC), and assume equal discriminability of the features. Then all distances are equal, and
the objects are mapped as four points on a regular tetrahedron, with six edges. Introducing
the latent node (ABCD), which is the union of each of the pairs of subsets, allows us to
construct a star graph, in which there are only four edges, one between each of the
manifest nodes and the latent node, and none among the manifest nodes themselves.
The fitted edge lengths are also given in Figures 2 and 3 (rounded to integer numbers). An
edge is not included in the graph if its length is the sum of two other edge lengths (a rather
simple algorithm looping over all triads is sufficient to sort this out). To reconstruct the
distance between two terms (and hence their dissimilarity), we just have to add the edge
lengths along the shortest path between them. It will be noted that there are several
instances of distinct paths with equal length. Comparing the two feature-graph solutions, it
appears that there are primarily local changes: one is slightly more (less) differentiated than
the other, a result that makes sense.
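The "rather simple algorithm looping over all triads" mentioned above might be sketched as follows (a hypothetical implementation, not the authors' code; dist is assumed to hold the fitted distances for all node pairs):

    def prune_additive_edges(dist, nodes, tol=1e-8):
        # dist: dict mapping frozenset({i, j}) -> fitted distance.
        # An edge (i, j) is dropped when its length equals the sum of
        # two other fitted distances d(i, k) + d(k, j) for some node k.
        edges = []
        for i in nodes:
            for j in nodes:
                if i >= j:
                    continue
                d_ij = dist[frozenset((i, j))]
                if not any(abs(dist[frozenset((i, k))] +
                               dist[frozenset((k, j))] - d_ij) < tol
                           for k in nodes if k not in (i, j)):
                    edges.append((i, j))
        return edges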
Two weighted incidence matrices with different columns can nevertheless generate the same feature distances. For example, the matrices

    [2 0 0 0]       [1 0 0 2  0  1/2  0  3/2]
    [0 4 0 0]  and  [0 1 2 0  0  1/2  0  3/2]
    [0 0 1 0]       [0 1 0 2 1/2  0   0  3/2]
    [0 0 0 3]       [0 1 0 2  0  1/2 3/2  0 ]
generate the same feature distances among their rows. Thus we can freely add the comple-
ment of any column of the incidence matrix, provided that we halve the corresponding
discriminabilities. Any n x 2 matrix formed by concatenating some column of E with its
complement has the property that its row sums are equal to one, and such a matrix is called
the indicator matrix of a feature.
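The invariance illustrated by the two matrices above can be verified numerically; a minimal sketch (our own code, using city-block distances between the rows):

    import numpy as np

    def feature_distances(B):
        # City-block (Hamming-type) distances between rows of B.
        return np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)

    B1 = np.array([[2, 0, 0, 0],
                   [0, 4, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 3]], float)
    B2 = np.array([[1, 0, 0, 2, 0, .5, 0, 1.5],
                   [0, 1, 2, 0, 0, .5, 0, 1.5],
                   [0, 1, 0, 2, .5, 0, 0, 1.5],
                   [0, 1, 0, 2, 0, .5, 1.5, 0]], float)
    assert np.allclose(feature_distances(B1), feature_distances(B2))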
Now suppose that the features are nested: that is, if G_t is the indicator matrix of feature F_t
and G_s is the indicator matrix of feature F_s, then the matrix G_t'G_s has at least one element
equal to zero. Nestedness implies that one feature separates a subgroup from one of the
two groups formed by the other feature. For instance, the bipartitions {(ABCD), (EFG)}
and {(EF), (ABCDG)} are nested, since (EF) is a subset of (EFG) and (ABCD) is a subset
of (ABCDG). Then, by a famous result of Buneman (1971), the feature distance satisfies
the four-point property that characterizes additive trees if and only if all its features are
nested. Additive trees thus form an important special case of feature graphs, in which each
edge corresponds to exactly one feature (or split).
The case of a linear array (Goodman, 1951, 1977; Restle, 1959, 1961), called a Guttman
scale in psychometrics, is obtained when the features are not only nested, but have an
additional property. In terms of the feature incidence matrix E, this property implies that
each column of E consists either of a single run of zeros followed by a single run of ones,
or of a single run of ones followed by a single run of zeros. When n - 1 distinct features
have this structure, the feature graph has n - 1 edges, connecting the objects in a certain
order, and no latent nodes. Except for the two endpoints, which have degree one, all nodes
have degree two. For an exact characterization of the Guttman scale, see Holman (1995).
A hierarchical tree is a rooted additive tree with the extra requirement that the distance from
any endpoint to the root is equal. In a feature graph, the root corresponds to the latent node
that has all features, that is, F. Then the first feature defines the first split into two groups of
objects, the second feature splits one of these groups further down into subgroups, and so
on. So the features are again nested. The hierarchical tree is a more parsimonious model
than the additive tree, because the requirement of equal distance to the root puts restrictions
on the discriminabilities. The characterization of trees in terms of a feature model is due to
Tversky (1977).
The last example of a family of subsets that satisfies a specific structural property is the
circumplex or radex (Guttman, 1954). It is characterized by the circular ones property,
which implies that each column of E consists either of a single run of zeros bordered by a
run of ones on one or both sides, or of a single run of ones bordered by one or two runs
of zeros. The graph of a regular circumplex is a closed simple chain, with exactly n
edges, if each feature divides the objects into equally sized groups (when n is even). When
divisions into unequally sized groups are included in the feature set, the graph of a circumplex
becomes more complicated. In the complete case, it looks like a network spanned over a
(half)sphere (Heiser, 1981, chapter 4).
7. Discussion
We have seen that a metric defined on the symmetric set difference between sets of features
can be used as a general framework for fitting a particular class of graphs, which includes
additive trees, hierarchical trees, and circumplex structures. It was shown that we can find
out which edges to include in the graph by formulating the problem in terms of a lattice of
subsets, using a weighted count of feature differences (the feature discriminabilities). The
algorithm presented, based on alternating least squares and on cluster differences scaling,
is still in an early stage of development. It always converges to a local minimum but, as is
usual in this type of problem, there are a great many local minima. On the positive side, it
is the first systematic method to fit the Hamming distance.
A crucial ingredient of this approach to finding graph representations is the fact that
inclusion relations between feature sets lead to additivity of distance along paths in a graph.
In fact, Hutchinson (1989) and Klauer and Carroll (1989) used the criterion of dropping
direct edges by looking at additivity of link length as their main graph construction
strategy. But they applied this criterion to the dissimilarities, rather than to the fitted
distances, as is proposed here. Feature graphs are similar to Corter and Tversky's (1986)
extended similarity trees, but exactly how these two models are related needs further study.
In any case, it seems clear that additive trees and other restricted representations do not
show up spontaneously in real examples, although the method reproduces a circumplex,
for example, when the data are error-free.
Choosing the number of features p is a matter that requires experience and cannot yet be
settled with clear-cut rules. In most examples analyzed so far, the number of features needed
to get a good fit is in the neighborhood of n/2. Also, it appears that, as soon as p is in the
range of values where the fit stabilizes, solutions with one feature more or one feature less
differ only in their fine structure, as was the case in the example of the Henley
(1969) data. There is a trade-off to be made with the number of links in the graph, a
quantity that increases nonlinearly with p, and which we want to keep as small as possible.
Making a good trade-off is complicated by the fact that we can often reduce the number of
links by including latent nodes, without it being clear how to do this optimally.
Unlike methods based on distance constraints, feature graph fitting can be extended
without too much trouble to well-known variants of MDS, such as individual differences
scaling (INDSCAL) and two-mode scaling (unfolding), possibly combined with nonlinear
transformations of the data. An easy way to recognize this flexibility is to view the basic
distance model (5) as a squared Euclidean distance (since the deviations e_it - e_jt are zero or
(minus) one, we just have to reparametrize η_t as the square of some other nonnegative
parameter). Then the feature graph loss function (6) is identical to Takane et al.'s (1977)
SSTRESS loss function with restrictions on the configuration.
References:
Abdi, H. (1990): Additive tree representations, In: Trees and Hierarchical Structures,
Dress, A. et al. (Eds.), 43-59, Springer Verlag, Berlin.
Arabie, P., and Carroll, J.D. (1980): MAPCLUS: A mathematical programming approach
to fitting the ADCLUS model, Psychometrika, 45, 211-235.
Arabie, P., and Hubert, L. (1992): Combinatorial data analysis, Annual Review of
Psychology, 43, 169-203.
Barthelemy, J.-P. and Guenoche, A. (1991): Trees and Proximity Representations, Wiley,
New York.
Boorman, S.A. and Arabie, P. (1972): Structural measures and the method of sorting, In:
Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Shepard,
R.N. et al. (Eds.), 225-249, Seminar Press, New York.
Buneman, P. (1971): The recovery of trees from measures of dissimilarity, In:
Mathematics in the Archaeological and Historical Sciences, Hodson, F.R. et al. (Eds.),
387-395, Edinburgh University Press, Edinburgh.
Carroll, J.D. (1976): Spatial, non-spatial and hybrid models for scaling, Psychometrika,
41, 439-463.
Chandon, J.L., Lemaire, J., and Pouget, J. (1980): Construction de l'ultrametrique la plus
proche d'une dissimilarite au sens des moindres carres, R.A.I.R.O. Recherche Opera-
tionelle, 14, 157-170.
Corter, J.E., and Tversky, A. (1986): Extended similarity trees, Psychometrika, 51, 429-
451.
Cunningham, J.P. (1978): Free trees and bidirectional trees as representations of
psychological distance, Journal of Mathematical Psychology, 17, 165-188.
De Soete, G. (1983): A least squares algorithm for fitting additive trees to proximity data,
Psychometrika, 48, 621-626.
Felsenstein, J. (Ed.)(1983): Numerical Taxonomy, Springer Verlag, Heidelberg.
Flament, C. (1963): Applications of Graph Theory to Group Structure, Prentice-Hall,
Englewood Cliffs, New Jersey.
Guttman, L. (1954): A new approach to factor analysis: The radex, In: Mathematical
thinking in the social sciences, Lazarsfeld, P.F. (Ed.), 258-348, The Free Press, Glencoe,
Illinois.
Hakimi, S.L., and Yau, S.S. (1965): Distance matrix of a graph and its realizability,
Quarterly of Applied Mathematics, 22, 305-317.
Hartigan, J.A. (1967): Representation of similarity matrices by trees, Journal of the
American Statistical Association, 62, 1140-1158.
Heiser, W.J. (1981): Unfolding analysis of proximity data, Unpublished doctoral
dissertation, University of Leiden, The Netherlands.
Heiser, W.J., and Groenen, P.J.F. (1996): Cluster differences scaling with a within-
clusters loss component and a fuzzy successive approximation strategy to avoid local
minima, Psychometrika, 61, in press.
Henley, N.M. (1969): A psychological study of the semantics of animal terms, Journal of
Verbal Learning and Verbal Behavior, 8, 176-184.
Holman, E.W. (1995): Axioms for Guttman scales with unknown polarity, Journal of
Mathematical Psychology, 39, 400-402.
Hutchinson, J.W. (1989): NETSCAL: A network scaling algorithm for nonsymmetric
proximity data, Psychometrika, 54, 25-52.
Klauer, K.C. (1994): Representing proximities by network models, In: New Approaches
in Classification and Data Analysis, Diday, E. et al. (Eds.), 493-501, Springer Verlag,
Heidelberg.
Klauer, K.C., and Carroll, J.D. (1989): A mathematical programming approach to fitting
general graphs, Journal of Classification, 6, 247-270.
Lawson, C.L., and Hanson, R.J. (1974): Solving Least Squares Problems, Prentice Hall,
Englewood Cliffs, NJ.
Mirkin, B.G. (1987): Additive clustering and qualitative factor analysis methods for
similarity matrices, Journal of Classification, 4, 7-31.
Restle, F. (1959): A metric and an ordering on sets, Psychometrika, 24, 207-220.
Restle, F. (1961): Psychology of Judgment and Choice, Wiley, New York.
Roberts, F.S. (1976): Discrete Mathematical Models, with Applications to Social,
Biological, and Environmental Problems, Prentice Hall, Englewood Cliffs, New Jersey.
Sattath, S., and Tversky, A. (1977): Additive similarity trees, Psychometrika, 42, 319-
345.
Shepard, R.N., and Arabie, P. (1979): Additive clustering: Representation of similarities
as combinations of discrete overlapping properties, Psychological Review, 86, 87-123.
Takane, Y., Young, F.W., and De Leeuw, J. (1977): Nonmetric individual differences
multidimensional scaling: An alternating least squares method with optimal scaling features,
Psychometrika, 42, 7-67.
Tversky, A. (1977): Features of similarity, Psychological Review, 84, 327-352.
Classification and data analysis in finance
Krzysztof Jajuga
Wroclaw University of Economics
ul. Komandorska 118/120
53-345 Wroclaw, Poland
Summary: The paper gives a brief review of the main areas of financial applications where
classification and data analysis methods can be used. First, historical context is given.
It is shown that the emergence of modern finance was made possible by the use of
quantitative methods. The applications presented are divided into two main groups: 1)
analysis of financial investments and markets, 2) corporate finance. The review is put in a
framework where the relationship between a dependent variable and explanatory variables is
determined.
1. Historical remarks
In this paper some links between classification and data analysis on the one hand and
financial applications on the other are shown. History has proved that classification and
data analysis methods are useful in financial applications. It is also clear that the usefulness
of these methods will grow in the future.
There are two important streams in the development of modern finance which, on the one
hand, were the driving forces of this discipline and to whose emergence and development,
on the other hand, the contribution of statistics was crucial. These are:
- forecasting of financial prices;
- portfolio theory.
Forecasting of prices in financial markets (as well as in commodity markets) has probably
been the most exciting issue in financial research. First of all, it is a very difficult task,
which has not been solved despite a lot of effort. Secondly, people believe that by finding
good forecasts of financial prices they can make a lot of money. This makes the job even
more exciting.
The first work on the forecasting of financial prices was done by Louis Bachelier in 1900. His
doctoral thesis "Theory of Speculation" is considered today a seminal work, although it
went unnoticed for more than fifty years. He completed the thesis for the degree of Doctor of
Mathematical Sciences at the Sorbonne.
In the dissertation he proved two statements. The first one: prices in financial markets
cannot be successfully predicted. He argued that "contradictory opinions concerning market
changes diverge so much that at the same instant buyers believe in a price increase and
sellers believe in a price decrease. [...] It seems that the market, the aggregate of
speculators, at a given instant can believe in neither a market rise nor a market fall, since for
each quoted price, there are as many buyers as sellers" (Bachelier (1900)). Bachelier's main
conclusion is: the mathematical expectation of price changes is zero, therefore
the best forecast of the next price is the current price. This means that the prices of financial
instruments follow a random walk process.
The second statement of Bachelier was: the range of the interval of prices is proportional to
the square root of time. This is reflected today in many stochastic price models of the ARIMA
type, and it has also been confirmed empirically for many time series of prices.
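Both statements are easy to reproduce in a small simulation (a sketch with artificial Gaussian price changes, not actual market data):

    import numpy as np

    rng = np.random.default_rng(0)
    # 10,000 random walks of 400 steps: cumulative sums of i.i.d. changes
    paths = rng.standard_normal((10_000, 400)).cumsum(axis=1)
    for t in (100, 400):
        print(t, paths[:, t - 1].std())   # spread grows roughly as sqrt(t)
    # Quadrupling the horizon roughly doubles the spread of prices.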
Despite this and some other theoretical results, financial practitioners searched for
effective forecasting tools. One such tool is so-called technical analysis, the foundations
of which were laid by Charles Dow (also known as co-founder and editor of the Wall Street
Journal and co-author of the most famous stock market index, the Dow Jones Industrial
Average). Dow claimed that financial prices (especially stock prices) change
according to certain trends. The particularly attractive (from the practical point of view)
claim of the proponents of technical analysis is the following: the directions of stock
price movements can be predicted, and the forecasts can be used to develop trading
strategies resulting in above-average returns. The users of technical analysis try to detect
regular patterns in past prices (by means of charts) which they believe will repeat in the future.
This is a simplified version of the pattern recognition problem. Basically, the idea of the
existence of regular patterns is used today in neural network methodology.
The discussion between the advocates of two concepts (the first, that effective forecasts
of financial prices can be made; the second, that financial prices follow a random
walk process) can be put in the framework of so-called market efficiency. This concept was
proposed by Fama (1965). According to him, a market is called efficient if current
financial prices instantaneously and fully reflect all available information. For the
predictability of prices, so-called weak-form market efficiency is of particular importance. A
market is said to be weak-form efficient if current financial prices instantaneously and
fully reflect all information contained in the past history of financial prices. This means that
historical prices provide no information (about future prices) that will lead to higher-than-
average returns from trading rules based on price forecasts. Thus the best forecast of the
next stock price is the current stock price. The search for methods of financial price
forecasting (particularly stock price forecasting) is based on the conviction that markets
are not weak-form efficient.
The second stream of the development of modern finance is connected to portfolio theory.
Portfolio theory was founded by Harry Markowitz. In his seminal paper (Markowitz (1952)),
published in the Journal of Finance and probably the most significant paper in the theory of
finance, Markowitz introduced the concept of risk in financial investments. He was the first
to propose the use in finance of the concept of the distribution of a random variable. As a
measure of the return on an investment, the expected value of the return was used. As a
measure of investment risk, the standard deviation of the return was used. Then he developed
the concept of risk diversification. This means that investment risk can be reduced by
forming a portfolio of stocks, and that this reduction depends on the correlations between
the returns of the stocks belonging to the portfolio.
At the time, the approach proposed by Markowitz was entirely different from the traditional
approach used in finance. For several years the paper of Markowitz went unnoticed. Then it
caused a lot of discussion and criticism. The weak point of Markowitz's approach was
that solving the portfolio problem required a very substantial amount of time on the
computers available at the time. Today portfolio theory is widely used in practice. For his
contribution to the economic sciences, Harry Markowitz was awarded the Nobel Prize in 1990.
At the beginning of the seventies a substantial increase in the volatility (variability) of financial
prices (particularly exchange rates and interest rates) was observed. This prompted a search
for ways to cope with the resulting risk. One solution was the introduction of new
financial instruments, such as options and financial futures. Another solution was to look for
more sophisticated mathematical methods.
In the last fifteen years enormous developments in the area of computer technology have
occurred. This has been extremely beneficial as far as the use of sophisticated mathematical and
statistical methods is concerned. At present the use of these methods is not time-consuming,
which means that the costs of implementing these methods are relatively low. On
the other hand, computer software designed to solve complicated financial problems is
widely available. Statistical and data analysis methods are at the disposal of firms, banks,
and investors.
It is not easy to give a general framework for the presentation of applications of statistical
methods in finance. A relatively simple way is to consider the financial applications
through the analysis of the following function:

Y = f(X_1, X_2, ..., X_m),    (1)

where:
Y - dependent variable,
X_1, X_2, ..., X_m - explanatory variables.
It is not easy to systematize all the quantitative methods that can be used to solve financial
problems. One possible way is to classify them into two groups:
- classical multivariate data analysis methods;
- financial cybernetics methods.
This taxonomy is based on a historical criterion, since the second group of methods emerged
in the last ten years, when the use of methods requiring large amounts of computer time was
made possible by the development of computer technology.
The term "financial cybernetics" was used by Thomas E. Berghage (president of a company
developing artificial intelligence software for finance) to describe the process of enhancing
financial decision making by introducing artificial intelligence technologies. The term
"cybernetics" was used for the first time by Norbert Wiener in 1948. At that time it was a new
science dealing with modifying or enhancing human decision systems with artificial
electronic systems. In the area of finance this means enhancing financial decision making
with computer systems which to some extent resemble human systems.
One of the very first applications of financial cybernetics was the application of neural
networks by Lapedes and Farber (Lapedes and Farber (1987)). They attempted to
forecast the closing value of the Standard and Poor 500 market index, with the closing
values from the ten previous weeks used as explanatory variables. The back-propagation
method was used as the algorithm.
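In the same spirit (though not a reproduction of the Lapedes-Farber network), a back-propagation-trained network with ten lagged closes as inputs can be sketched as follows, assuming scikit-learn is available; the price series here is artificial:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    closes = 100 + np.cumsum(np.random.randn(300))   # artificial weekly closes
    lags = 10                                        # ten previous weeks as inputs
    X = np.array([closes[i:i + lags] for i in range(len(closes) - lags)])
    y = closes[lags:]
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    net.fit(X[:-1], y[:-1])                          # trained by back-propagation
    print(net.predict(X[-1:]))                       # forecast of the next close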
Another useful way to classify the methods used to solve the main financial problems is on
the basis of two criteria:
1. The type of dependent variable - Y is quantitative or categorical.
2. The knowledge of the values (or the categories) of the dependent variable - the values (or
categories) are known or unknown.
Therefore four different classes of methods can be distinguished, leading to four research
situations (cases):
1. Y is a categorical variable and its categories are known.
2. Y is a quantitative variable and its values are known.
3. Y is a categorical variable and its categories are unknown.
4. Y is a quantitative variable and its values are unknown.
Now we give a brief review of several of the most important areas of financial application
where the use of classification and data analysis methods has proved useful. All financial
applications can be divided into two main groups:
- analysis of financial investments and markets;
- corporate finance.
A. Bond rating
Investors in bonds need to evaluate default risk in order to avoid its negative consequences.
This can be achieved by so-called bond rating.
Bond rating consists in the determination of classes of bonds of approximately equal levels of
default risk. There are many institutions specializing in bond ratings (e.g., Standard and
Poor's Corporation and Moody's Investors Service). The usual way to determine a bond
rating is to ask experts to evaluate different factors influencing the default risk. As a rule,
the past performance of the bond issuer is also taken into account when determining bond
ratings. It is worth mentioning that rating institutions claim that they link together the financial
statement data of bond issuers and experts' opinions.
The bond rating problem can be regarded as the determination of a function (1), where Y is the
categorical variable standing for the class of the bond and the explanatory variables are the
factors influencing the default risk. From the point of view of the statistical methods used in
bond rating, we can distinguish two of the four situations mentioned, namely:
- situation 1: the categories of the dependent variable are known;
- situation 3: the categories of the dependent variable are unknown.
In the first case two types of data can be used:
- historical data, that is, past bond ratings plus information on the previous values of the
factors influencing the default risk;
- expert opinions, obtained by assigning bond ratings to hypothetical values of the factors.
Here either of these two types of data sets can be treated as a learning set and can be used to
determine a function which divides the bonds into classes. As a rule, discriminant analysis or
neural network methodology can be used. The classes corresponding to particular bond
ratings can be interpreted, which is very important for end-users.
In the second case past data on bond ratings are not available, and one uses classification
methods (for example, cluster analysis) to determine the classes of bonds. Here the problem
may occur that the classes of bonds are difficult to interpret.
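A minimal sketch of the first situation, assuming scikit-learn and using entirely hypothetical financial ratios as explanatory variables and known rating classes as the learning set:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Hypothetical learning set: rows are issuers, columns are financial
    # ratios influencing default risk; labels are known rating classes.
    X = np.array([[0.8, 2.1], [0.9, 2.4], [0.3, 1.1], [0.2, 0.9]])
    y = np.array(["A", "A", "B", "B"])
    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.predict([[0.7, 2.0]]))   # assign a new issuer to a rating class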
B. Financial prices forecasting
This is without doubt the most difficult financial problem. The task is important for different
types of investors: short-term and long-term, individual and institutional. The following
prices are usually predicted: commodity prices, exchange rates, interest rates, and stock
prices.
From the point of view of classification and data analysis this problem is regarded via a
function (1). Here Y is usually a quantitative variable, the price. Sometimes it can be
categorical, assuming one of three categories: "the price will go up", "the price will go
down", "the price will stay within a defined interval". To determine a function (1), historical
data are used. This fits either situation 1 or situation 2.
There are many approaches to financial price forecasting. Basically they can be divided into
three broad categories:
- technical analysis;
- econometric regression and time series models;
- neural networks.
Technical analysis is a simple and widely used approach, which has already been mentioned.
"Technicians" use different types of charts of past prices to discover regular patterns, which
they believe will occur in the future. Their reasoning is supported by simple indicators
describing financial markets.
A large group of researchers uses econometric models emerging from the well-known
ARIMA approach. The development of statistical methodology and of computer technology
has allowed the implementation of these models in the real world. A detailed description of
these models is presented by Taylor (1986) and Mills (1993).
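A sketch of such a model, assuming the statsmodels library is available; the price series is artificial and the order (1, 1, 1) is an arbitrary choice for illustration:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    prices = 100 + np.cumsum(np.random.randn(250))   # artificial price series
    model = ARIMA(prices, order=(1, 1, 1)).fit()     # fit an ARIMA(1,1,1) model
    print(model.forecast(steps=5))                   # five-step-ahead forecasts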
Neural networks, as well as some other approaches (genetic algorithms, chaos theory), are
relatively new approaches in financial forecasting. These models could only be developed with
the use of fast computers, since they require lengthy computations. The most popular are
probably neural networks. Here an algorithm is used to estimate a rather complicated nonlinear
function. This function approximates the past financial prices and is then used for
forecasting.
As already mentioned, the question of market efficiency is crucial to the forecasting of
financial prices. Those who apply the methods mentioned believe that the market is not
efficient and that the changes of prices do not follow a random walk process.
C. Risk-return analysis
This is a classical financial problem, which traces back to the origin of portfolio theory. The
rationale behind this problem lies in the fact that most individual and institutional investors
try to maximize their return while keeping risk as low as possible. This behaviour of
investors is reflected in the portfolio theory proposed by Harry Markowitz. He considered a
portfolio of stocks. The portfolio problem can be regarded as the task of finding a
combination of individual stocks, called a portfolio, such that the expected return is as high as
possible and the risk is as low as possible. The main results of classical portfolio theory are
as follows (see the sketch after this list):
- the expected return on a portfolio is the weighted average of the expected returns on the
individual stocks;
- the risk of a portfolio depends on the risk of the individual stocks and on the correlations of
their returns; this means that low, possibly negative, correlations lead to low risk, while
holding the expected return constant or even decreasing it.
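In formula form (our notation), the expected portfolio return is w'mu and the portfolio variance is w'Cw, with w the vector of weights, mu the vector of expected returns, and C the covariance matrix of returns; a sketch with hypothetical numbers:

    import numpy as np

    mu = np.array([0.08, 0.12])        # expected returns of two stocks
    cov = np.array([[0.04, -0.01],     # covariance matrix of returns;
                    [-0.01, 0.09]])    # a negative covariance lowers risk
    w = np.array([0.6, 0.4])           # portfolio weights
    exp_return = w @ mu                # weighted average of expected returns
    risk = np.sqrt(w @ cov @ w)        # portfolio standard deviation
    print(exp_return, risk)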
Risk-return analysis can be treated as the analysis of the location and spread parameters of a
distribution. One possible solution is to use robust estimates of location and spread. Since
building a portfolio involves multivariate distributions, estimates of the multivariate location
vector and the multivariate scatter matrix are to be used.
Risk-return analysis can also be put in the framework of a function (1). There are two
explanatory variables, return and risk, and the dependent variable is unknown and
characterizes the attractiveness of the investment (for example: very attractive - high return
and low risk; medium attractive - high return and high risk, or low return and low risk; not
attractive - low return and high risk). This problem fits situation 3.
D. Beta analysis
The beta coefficient is one of the most important coefficients used by the financial industry. This
coefficient was proposed by William Sharpe (Sharpe (1963)) in a so-called single-index
model. This is a simple regression model which gives a linear relationship between the return
on a stock (or other investment) and the return on the market (usually measured through the
return on a market index). The slope in this regression is the beta coefficient. It measures
the sensitivity of the return on a stock to changes in the return on the market. It can also be
regarded as a measure of so-called systematic risk or market risk. Stocks with beta
higher than 1 are called aggressive stocks and stocks with beta lower than 1 are called
defensive stocks.
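Since beta is the OLS slope, it equals Cov(R_stock, R_market)/Var(R_market); a sketch with simulated returns (the true beta here is set to 1.3):

    import numpy as np

    r_market = np.random.randn(250) * 0.01            # simulated market returns
    r_stock = 1.3 * r_market + np.random.randn(250) * 0.005
    beta = np.cov(r_stock, r_market)[0, 1] / np.var(r_market, ddof=1)
    print(beta)   # > 1: aggressive stock; < 1: defensive stock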
From the point of view of function (1), beta analysis fits situation 2. However, it is
usually the case that people from the finance industry do not pay particular attention to the
justifiability of the linearity of the relationship and use ordinary least squares to estimate the beta
coefficient. This raises the issue of applying more advanced methods, for example
nonlinear regression, robust regression, or segmented regression models.
E. Factor market models
Researchers who analyze financial markets try to find a theoretical model which
explains what determines the returns on financial instruments. Among several proposed
models, the widely accepted one is the so-called Arbitrage Pricing Theory (APT) model,
proposed by Ross (Ross (1976)). This is a linear model, given by the formula:
R = a + b_1 F_1 + b_2 F_2 + ... + b_k F_k,    (2)

where:
R - the return on the investment;
F_i - the i-th factor influencing the investment;
b_i - the sensitivity coefficient of the return with respect to the i-th factor.
From the point of view of the determination of function (1), this fits situation 2. As a
rule, historical data are used. If the factors influencing the returns on the investments are known
(for example: interest rates, GDP growth rate, etc.), then regression analysis can be used.
There is also the possibility that one has no idea about the factors. In this case another solution
can be applied, in which factor analysis is used. Here the returns on different stocks are used to
extract the values of the unknown factors. However, it is often the case that it is very difficult (if
at all possible) to give a useful interpretation of the extracted factors.
F. Bankruptcy prediction
From the point of view of the use of statistical methods, bankruptcy prediction can be
regarded as the determination of a function (1), and it fits situation 1. Here historical data on
bankrupt and non-bankrupt companies can be used to determine this function.
One of the first attempts to use discriminant analysis in bankruptcy prediction was made
by Altman (1968). This classical model is called the Altman model. Altman compared the
financial data of 33 manufacturers who went bankrupt with the data of 33 non-bankrupt firms of
similar industry and asset size. From a number of available financial variables he finally used
5 ratios.
References:
Altman, E.L. (1968): Financial ratios, discriminant analysis and the prediction of corporate
bankruptcy, Journal of Finance, 23, 589-609.
Bachelier, L. (1900): Theory of speculation, Gauthier-Villars, Paris.
Fama, E. (1965): The behavior of stock-market prices, Journal of Business, 38, 34-105.
Lapedes, A. and Farber, R. (1987): Non-linear signal processing using neural networks:
prediction and system modeling, Los Alamos National Laboratory Report.
Markowitz, H.M. (1952): Portfolio selection, Journal of Finance, 7, 77-91.
Mills, T.C. (1993): The econometric modelling of financial time series, Cambridge
University Press, Cambridge.
Ross, S.A. (1976): The arbitrage theory of capital asset pricing, Journal of Economic
Theory, 13, 341-360.
Sharpe, W.F. (1963): A simplified model for portfolio analysis, Management Science, 9,
277-293.
Taylor, S. (1986): Modelling financial time series, Wiley, New York.
How to validate phylogenetic trees?
A stepwise procedure
François-Joseph Lapointe
Summary: In this paper, I review some of the methods and tests currently available to validate
trees, focusing on phylogenetic trees (dendrograms and cladograms). I first present some of the
more commonly used techniques to compare a tree with the data it is derived from (internal
validation), or to compare a tree to another tree or to more than one (external validation). I also
discuss some of the advantages of performing combined (total evidence) versus separate analyses
(consensus) of independent data sets for validation purposes. A stepwise validation procedure
defined across all levels of comparison is introduced, along with a corresponding statistical test: a
phylogeny will be said to be globally validated only if it satisfies all the tests. An application to the
phylogeny of kangaroos is presented to illustrate the stepwise procedure.
1. Introduction
The construction of a classification is a simple-minded task. First, you need data; second,
you need an algorithm; and then, like magic, you get a classification of your data. Indeed,
the sole purpose of a classification algorithm is to do just that, i.e., return a classification
(e.g., a dendrogram, cladogram, pyramid, weak hierarchy, or any other type of
classification). The problem with such an approach is that no safeguards are usually
provided to ensure that the output is meaningful. Indeed, most algorithms (i.e., clustering
algorithms or phylogeny reconstruction algorithms) will return a solution, no matter what
data are fed into them. This implies that a classification can even be derived from pure noise
(i.e., randomly generated data). This is why validation becomes necessary.
In this paper, I will review some of the safeguards currently available to validate
classifications represented in the form of trees; those trees can either be obtained by
different algorithms or derived from independent data sets using the same algorithm. It is
not my goal to present an exhaustive review of all validation techniques for all types of
trees. I will focus my review on weighted trees such as those used in some phylogenetic
studies. Furthermore, given the number of statistical papers published on the subject, I will
emphasize validation methods based on permutation and/or resampling procedures. I will
first show (1) how one can assess whether any phylogenetic structure was present in the
data to begin with (internal validation). Then, (2) I will introduce some of the methods
designed to compare trees obtained from independent data sets (external validation). I will
also present (3) the rationale for combining those independent data sets before proceeding
with phylogenetic reconstruction (total evidence). The combined approach will then be
contrasted with (4) a consensus approach in which the trees are analyzed separately and
then combined. A stepwise procedure will finally be introduced to validate phylogenetic
trees using both internal and external validation methods as well as separate and combined
approaches. This stepwise procedure will be used to validate the phylogeny of kangaroos.
2. What is a phylogeny?
Biologically speaking, a phylogeny is a tree-like representation of evolutionary
relationships among n different taxa (e.g., species, genera, or other taxonomic units).
Phylogenetic trees are usually derived from a character-state matrix representing
morphological or molecular data (n species by p characters), or from a square n x n
distance matrix; several algorithms and computer packages are currently available to do so
(see Penny et al., 1992; Swofford et al., 1996a). In mathematical terms, a phylogeny can be
defined as a connected graph without cycles. Such phylogenies are usually represented as
rooted trees with labeled leaves. They can be depicted in the form of weighted trees if the
branches of the phylogeny have lengths that represent the amount of evolutionary
divergence between the nodes of the tree. Therefore, the sum of the lengths along the path
of branches between any pair of taxa can be recorded in a path-length matrix (similarly, the
number of branches on a path can be recorded in a branch-distance matrix). When the rates
of change in the various branches of the phylogeny are identical, every terminal node will
be equidistant from the root, and the tree can be associated with a path-length matrix that
satisfies the ultrametric inequality (Hartigan, 1967):

d(i, j) ≤ max[d(i, k); d(k, j)], for every triplet of taxa i, j, k.
Such trees are usually defined as dendrograms. On the other hand, when rates vary among
lineages, the path-length matrix is not ultrametric but remains additive (i.e., ultrametric
trees represent a special case of additive trees with constant evolutionary rates); additive
distances satisfy the four-point condition (Buneman, 1971) and apply to cladograms:

d(i, j) + d(k, l) ≤ max[d(i, k) + d(j, l); d(i, l) + d(j, k)], for any quartet of taxa i, j, k, l.

The path-length matrices (ultrametric or not) are in one-to-one correspondence with a set
of weighted phylogenetic trees (Jardine et al., 1967; Buneman, 1974); this is also true for
branch-distance matrices (Zaretskii, 1965). Therefore, it is equivalent for validation
purposes to compare phylogenetic trees or their associated path-length (or branch-
distance) matrices.
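Both conditions can be checked directly on a path-length matrix D; the sketch below (our own code, with D assumed to be a square numpy array) uses the equivalent formulations in which the two largest of the relevant quantities must be equal:

    import numpy as np
    from itertools import combinations

    def is_ultrametric(D, tol=1e-9):
        # In every triangle of an ultrametric, the two largest sides are
        # equal (an equivalent form of the three-point condition).
        for i, j, k in combinations(range(len(D)), 3):
            s = sorted([D[i, j], D[i, k], D[j, k]])
            if abs(s[2] - s[1]) > tol:
                return False
        return True

    def is_additive(D, tol=1e-9):
        # Four-point condition: for every quartet, the two largest of the
        # three pairwise sums must be equal.
        for i, j, k, l in combinations(range(len(D)), 4):
            s = sorted([D[i, j] + D[k, l], D[i, k] + D[j, l], D[i, l] + D[j, k]])
            if abs(s[2] - s[1]) > tol:
                return False
        return True

    D = np.array([[0, 2, 8], [2, 0, 8], [8, 8, 0]], float)
    print(is_ultrametric(D))   # True: the two largest sides are equal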
A permutation test typically proceeds as follows (a generic skeleton is sketched after the steps):
1. Compute a reference test statistic (i.e., REFSTAT) relevant to the question asked.
2. Permute (or resample) the data (i.e., distance, character-state, or path-length matrices).
3. Recompute the test statistic for the randomized data (i.e., RANDSTAT).
4. Repeat steps 2 and 3 a large number of times to obtain a distribution of RANDSTAT values.
5. Compare REFSTAT to this distribution to assess its significance.
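A sketch of such a skeleton (our own code; statistic and randomize are user-supplied functions, and the names follow the steps above):

    import numpy as np

    def permutation_test(data, statistic, randomize, n_perm=999, rng=None):
        # REFSTAT is compared with RANDSTATs computed on randomized
        # copies of the data; returns the statistic and a one-tailed p-value.
        rng = rng or np.random.default_rng()
        refstat = statistic(data)
        randstats = np.array([statistic(randomize(data, rng))
                              for _ in range(n_perm)])
        p = (1 + np.sum(randstats >= refstat)) / (n_perm + 1)
        return refstat, p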
It is worth mentioning at this point that the statistical outcome of a permutation test is
likely to be affected by different aspects of the procedure, including (i) the maximum
possible number of random realizations of the null hypothesis, (ii) the actual number of
permutations performed, (iii) the permutation model, and (iv) the test statistic selected to
compute the test.
In the case of phylogenies, three distribution models are usually defined (Simberloff et al.,
1981; Savage, 1983; Lapointe and Legendre, 1995). The first and simplest model is to
generate every topology equiprobably. In the second model, each tree is equally likely; this
is the "proportional-to-distinguishable-type" model of Simberloff et al. (1981). The third
model implies that every branching point is equally likely when growing a tree (Harding,
1971); this is the Markovian branching model of Simberloff et al. (1981). It is interesting to
note that dendrograms can be generated equiprobably under this Markovian model
(Lapointe and Legendre, 1991; Page, 1991).
When data matrices are used for validation purposes instead of trees, the available models
differ with respect to the type of data considered. For example, character-state matrices
are not randomized in the same way as distance matrices would be. In the first case, the
general model is based on a permutation of the observed states within each character
(Archie, 1989a; 1989c; Faith, 1991; Faith and Cranston, 1991; Källersjö et al., 1992). With
such models, the phylogenetic structure of the data is destroyed by permutations (or random
data generation, Klassen et al., 1991), and each state is equally likely to be assigned to any
taxon. This approach has been much debated (Bryant, 1992; Carpenter,
1992; Faith, 1992; Källersjö et al., 1992; Alroy, 1994; Faith and Ballard, 1994). It
nevertheless remains the model of choice in phylogenetic studies (but see Goloboff, 1991a;
1991b).
In the case of distance matrices, two types of models have been proposed by Sneath
(1967). One option is to compute distance matrices from random points distributed in a
multidimensional space (see Gordon, 1987); the uniform model based on a Poisson
distribution and the unimodal model have been applied to generate such random
distributions of points (Bock, 1985). Another option is to randomize the distances directly,
as if all the observations in the matrix were independent from one another; this is the
random graph model of Ling (1973). In some testing procedures, the values in the matrix
are held constant but the rows and columns (the labels) are permuted (Mantel, 1967). It is
also possible to generate random distance matrices from permuted character-state matrices
using the random-data model described above.
Another important aspect of validation methods is the test statistic selected to compare the
actual tree to random realizations of the null hypothesis. Here again, data comparisons will
differ from tree comparisons, and distances will require different statistics than character-
state data. Common indices computed from a tree are its length, the consistency index
(Kluge and Farris, 1969), the retention index (Farris, 1989a), and the homoplasy excess ratio
(Archie, 1989b), among others (see also Farris, 1989b; Archie, 1990; Farris, 1991;
Goloboff, 1991a; Hillis, 1991; Meier et al., 1991; Bremer, 1995). All of these statistics are
used to measure how well the phylogeny fits the data; some are even used as optimality
criteria for phylogenetic reconstruction (see Swofford et al., 1996a). For distance data,
one usually calls for metric indices, like the matrix correlation (Rohlf, 1982), or any other
measure of the fit between original and path-length distances (Rohlf, 1974; Gower, 1982;
Gordon, 1987).
When trees are considered, it is always necessary to distinguish topological indices from
tree metric indices; the former are designed for unweighted trees (i.e., ignoring branch
lengths) whereas the latter are for weighted-tree comparisons (i.e., dendrograms or
cladograms). Test statistics (i.e., consensus indices sensu Day and McMorris, 1985)
available for topological comparisons include the partition metric (Bourque, 1978;
Robinson and Foulds, 1981), the neighborhood interchange metric (Robinson, 1971;
Waterman and Smith, 1978), the quartet metric (Estabrook et al., 1985; Day, 1986;
Estabrook, 1992), and the triples distance metric (Critchlow et al., 1996), among many
others (Bobisud and Bobisud, 1972; Margush, 1982; Hendy et al., 1984; Penny and Hendy,
1985b; Steel, 1988; Steel and Penny, 1993). When path-length matrices need to be
compared, modified versions of the topological indices can be used (Robinson and Foulds,
1979), in addition to specific indices designed for dendrograms (Sokal and Rohlf, 1962;
Day, 1983; Fowlkes and Mallows, 1983; Faith and Belbin, 1986; Lapointe and Legendre,
1990), or for any weighted trees (Williams and Clifford, 1971; Lapointe and Legendre,
1992a; Steel and Penny, 1993).
Resampling methods (Efron, 1979; Efron and Gong, 1983; Efron and Tibshirani, 1993) are
in some ways related to permutation tests. Indeed, resampling is used to assess the stability
of some parts of the tree, or of the tree as a whole, by comparing actual phylogenies to trees
derived from resampled data. However, such methods are usually not designed as
statistical tests; i.e., p-values are rarely provided and cannot always be interpreted as such
(Felsenstein and Kishino, 1993). Here the values in a data matrix are not permuted but
sampled with or without replacement. The bootstrap (Felsenstein, 1985; Sanderson, 1989;
1995) and the jackknife (Davis, 1993; Farris et al., 1995a; 1995b) are among the most
popular resampling techniques in phylogenetic studies; both methods have been used to
validate phylogenies (see Hillis, 1995; Swofford et al., 1996a).
To perform internal validation, one needs to compare a tree with the data it is derived from.
When it comes to a specific phylogeny, validation proceeds by comparing the actual tree
with others derived from randomly generated or permuted data (Archie, 1989a; Faith and
Cranston, 1991; Källersjö et al., 1992; Alroy, 1994). Using one of the random data models
described above, a distribution of the test statistic can then be computed, or tables of
critical values can be generated (e.g., Klassen et al., 1991). For example, the validity of a
tree as a whole can be assessed by comparing its length to a distribution of the lengths of
trees derived from random data (Le Quesne, 1989; Carter et al., 1990; Faith and Cranston,
1991; Steel et al., 1992; Archie and Felsenstein, 1993). When no phylogenetic structure
was present in the data to begin with, most random data sets (e.g., 95%) will lead to trees
shorter than the real phylogeny. In that situation, the original tree will not be validated.
Other methods based on different statistics (e.g., Alroy, 1994) test the same null
hypothesis, stating that a phylogeny derived from actual data is no better than what would
be expected from random data. The same approach can be used to test the stability of parts
of the tree (e.g., monophyletic groups) under various permutation and resampling models
(Faith, 1991; Faith and Trueman, 1996; Swofford et al., 1996b).
In resampling methods (e.g., the bootstrap and the jackknife), the effect of character
and/or taxonomic sampling on phylogenetic reconstruction is assessed. Mueller and Ayala
(1982) were among the first to test the validity of their trees with a resampling procedure.
Since Felsenstein (1985), the bootstrap has been the most popular validation technique in
phylogenetic studies (Hedges, 1992; Hillis and Bull, 1993; Rodrigo, 1993a; Dopazo, 1994;
Harshman, 1994; Zharkikh and Li, 1992a; 1992b; Li and Zharkikh, 1994; Berry and
Gascuel, 1996; Efron et al., 1996), with extensions for distance data (Krajewski and
Dickerman, 1990; Marshall, 1991). The method remains controversial (Sanderson, 1989;
1995), and has been greatly modified by some (Zharkikh and Li, 1995). The original
nonparametric bootstrap (Felsenstein, 1985) consists in resampling the characters of a data
(or distance) matrix with replacement to assess the stability of a tree (the parametric
bootstrap was introduced by Huelsenbeck et al., 1995). Jackknifing proceeds in a
similar fashion, except that characters are resampled without replacement (Davis, 1993).
This rationale also applies to taxonomic sampling (Lecointre et al., 1993). Lanyon (1985)
and Lapointe et al. (1994) have shown that deleting taxa from the analysis can be used to
evaluate the stability of phylogenetic trees. The consensus of the jackknife trees is then
used to evaluate the support of the tree as a whole or of parts of it (for an application, see
Bleiweiss et al., 1994). When the trees tested with resampling models are not validated
(e.g., a partially resolved tree is obtained), the original phylogenies should be treated with
caution; additional data must be gathered to improve the results.
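The resampling step of the nonparametric bootstrap (characters drawn with replacement) can be sketched as follows; X is assumed to be an n x p character-state matrix as a numpy array, and the tree-building routine is left abstract, since any reconstruction method could be plugged in:

    import numpy as np

    def bootstrap_matrices(X, n_boot=100, rng=None):
        # Yield pseudo-matrices with the p characters (columns) of X
        # resampled with replacement (nonparametric bootstrap).
        rng = rng or np.random.default_rng()
        n, p = X.shape
        for _ in range(n_boot):
            cols = rng.integers(0, p, size=p)   # p columns, with replacement
            yield X[:, cols]

    # Each pseudo-matrix would then be fed to a tree-building routine
    # (a hypothetical build_tree), and clade frequencies tallied over
    # the replicates to assess the stability of the original tree.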
Given two (or more) internally validated trees, the next task is to verify whether these trees
tell the same story or not; that is, whether they are congruent (Prager and Wilson, 1976;
Mickevich, 1978; Colless, 1980; Sokal and Rohlf, 1981; Penny et al., 1982; Swofford,
1991; Bledsoe and Raikow, 1992; Patterson et al., 1993). External validation proceeds by
comparing phylogenies to one another (or to a reference phylogeny) to assess whether the
observed measure of congruence could be expected by chance alone. As for internal
validation, permutation and resampling methods can be used for this test. In the case of the
random-data model, the data from which the original trees were derived are randomized
(Rodrigo et al., 1993; Farris et al., 1995b) to obtain new phylogenies which can be
compared in turn to build a distribution of the test statistic. For the random-tree model, the
actual pair of trees is compared to pairs of random trees (Hubert and Baker, 1977; Podani
and Dickinson, 1984; Simberloff, 1987; Nemec and Brinkhurst, 1988; Page, 1988;
Lapointe and Legendre, 1990; 1992a; Brown, 1994). Depending on the trees compared,
the topology of the phylogeny, its branch lengths, and the taxon positions can be
randomized (Lapointe and Legendre, 1995). No matter what method is used, the null
hypothesis states that the phylogenies compared are not more similar than randomly
generated trees would be (Lapointe and Legendre, 1990); a pair of trees is declared
congruent when they are more similar than the majority (e.g., 95%) of the pairs of random trees.
To avoid having to generate the null distribution for every test, tables of critical values for
various consensus indices and different models have been produced (Day, 1983b; Shao and
Rohlf, 1983; Shao and Sokal, 1986; Steel, 1988; Lapointe and Legendre, 1992b; Steel
and Penny, 1993).
Even though phylogenies should always be validated, it is worth mentioning that data sets
can be externally validated as well. The approach is similar to the random-data models
used to compare trees; the character-state or distance matrices are randomized (or
resampled) to assess their congruence. The Mantel test (1967) has been widely used to
compare distance matrices. For comparing character-state data, canonical correlations can
be applied with a testing procedure based on permutations (see Lapointe and Legendre,
1994). In any case, trees or data matrices that are not validated should be treated with
caution. One should never rely on ad hoc criteria to decide which of the phylogenies is the
best. It might be better to combine the data or trees to analyze them jointly.
How should one choose among different phylogenies based on independent data sets?
Which one is closer to the true phylogeny? Given that different parts of the genome evolve
at different rates, it is very unlikely that one would obtain identical phylogenies for slow-
evolving versus fast-evolving genes (Russo et al., 1996), or even for morphological versus
molecular data (Hillis, 1987). The solution, according to Kluge (1989), is to include all
available data (i.e., character-state matrices) in one analysis (for a combination of distance
matrices, see Lapointe and Kirsch, 1995). The rationale is that a tree based on total
evidence rather than partial information will usually become more accurate as more data are
added (see also Barrett et al., 1991; Eernisse and Kluge, 1993). This approach has been
criticized by several authors (Huelsenbeck et al., 1996), and alternative views have been
proposed (Williams, 1994; Bandelt, 1995; Miyamoto and Fitch, 1995; Nixon and
Carpenter, 1996), one of which is to combine the data conditionally (Bull et al., 1993). The
question then becomes one of when to combine data or not.
constructed before some new data set was generated). Nevertheless, a total-evidence tree
must always be assessed with internal validation methods.
5.2 Consensus
Whether one decides to combine or not to combine data sets for statistical, practical, or
philosophical reasons (Barrett et al., 1991; 1993; de Queiroz, 1993; Nelson, 1993), the
problem remains the same: how to synthesize a profile of incongruent phylogenies?
Whereas data are combined in a total-evidence approach, trees will be combined with a
consensus approach (Miyamoto, 1985; Anderberg and Tehler, 1990). A consensus tree
method (as opposed to consensus indices, Day and McMorris, 1985) takes as input a
profile of trees and returns a single solution that is in some sense representative of the entire
set (Leclerc and Cucumel, 1987). Several approaches, including the strict (Sokal and Rohlf,
1981), semi-strict (Bremer, 1990), median (Barthelemy and McMorris, 1986), and
majority-rule consensus (Margush and McMorris, 1981) methods have been developed to
combine unweighted (see also Adams, 1972; Nelson, 1979; Stinebrickner, 1982;
McMorris and Neumann, 1983; McMorris et al., 1983; Neumann, 1983; McMorris, 1985;
Phillips and Warnow, 1996) or weighted trees (Stinebrickner, 1984; Lefkovitch, 1985;
Lapointe and Cucumel, 1997). Other methods are designed for the construction of
consensus supertrees from phylogenies bearing overlapping sets of taxa (Gordon, 1986;
Baum, 1992; Ragan, 1992; Steel, 1992; Baum and Ragan, 1993; Lanyon, 1993; Rodrigo,
1993b; Purvis, 1995a; Ronquist, 1996; Lapointe and Cucumel, 1997), or the computation
of common pruned trees (Finden and Gordon, 1985) and reduced consensus trees
(Wilkinson, 1994; 1996).
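For instance, a majority-rule consensus can be sketched in a few lines if each input tree is summarized by its set of clades; this representation and the names are illustrative only:

```python
from collections import Counter

def majority_rule_consensus(profile):
    """Return the clades present in a strict majority of the input
    trees; each tree is given as a set of clades (frozensets of taxa).
    For the strict consensus, replace the threshold by len(profile)."""
    counts = Counter(clade for tree in profile for clade in tree)
    threshold = len(profile) / 2.0
    return {clade for clade, k in counts.items() if k > threshold}
```

Clades retained by a strict majority are pairwise compatible, so the returned set always defines a tree.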
The problem with consensus trees is that they are seldom validated. Assessing the
significance of a consensus phylogeny remains problematic. As for phylogeny-
reconstruction algorithms, a given consensus method will always return a solution. One
then has to evaluate whether the consensus representation is pertinent or not; i.e., is it
more structured than what would be expected from chance alone (Cucumel and Lapointe,
1997)? Consensus validation is somewhat related to the congruence tests used for external
validation. The problem with consensus trees is that more than two phylogenies are usually
considered at once; the tables of significance of consensus indices do not account for more
than two trees at a time (e.g., Shao and Rohlf, 1983; Shao and Sokal, 1986; Lapointe and
Legendre, 1992b). Furthermore, consensus trees are sometimes the synthesis of trees
bearing nonidentical sets of taxa (Purvis, 1995b; Kirsch et al., 1997), which makes them
even more difficult to test. Cucumel and Lapointe (1997) test the consensus by comparing
it to the trivial classification (i.e., a bush, or a star tree); a distribution of consensus trees
computed from randomly generated phylogenies (Lapointe and Legendre, 1990; 1992a) is
used to assess significance under the null hypothesis. Another approach would be to check
whether the consensus falls within a confidence set (Sanderson, 1989) of the trees in the
input profile.
when the different approaches converge to the same solution. This is related to what Kim
(1993) has shown by combining different algorithms to improve the accuracy of
phylogenetic estimations. In the present case, combined and separate analyses are
performed and the resulting phylogenies are assessed using a stepwise procedure (Fig. 1).
1- Initially, each and every tree produced has to be checked for internal validity (4.1).
2- Trees that satisfy the first test need to be compared to assess their congruence (4.2).
3- The congruent data sets must be combined to derive a total-evidence tree (5.1).
That tree has to be validated.
4- The independent phylogenies must also be combined to obtain a consensus tree.
That consensus has to be validated.
5- Finally, the trees obtained at steps 3 and 4 of the validation procedure must be compared.
(A skeletal rendering of these five steps is sketched after Fig. 1 below.)
[Fig. 1: Data 1, Data 2 and Data 3 yield Tree 1, Tree 2 and Tree 3, which are compared and combined.]
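The five steps can be read as a small driver. The sketch below is only a skeleton under stated assumptions: every callable (build_tree, internally_valid, congruent, combine_data, consensus, consensus_pertinent, compare) is a hypothetical placeholder for one of the validation methods discussed in the preceding sections.

```python
def stepwise_validation(datasets, build_tree, internally_valid,
                        congruent, combine_data, consensus,
                        consensus_pertinent, compare):
    """Skeleton of the five-step validation procedure of the text."""
    trees = [build_tree(d) for d in datasets]                 # step 1
    trees = [t for t in trees if internally_valid(t)]
    if not congruent(trees):                                  # step 2
        raise ValueError("incongruent trees: do not combine")
    total = build_tree(combine_data(datasets))                # step 3
    if not internally_valid(total):
        raise ValueError("total-evidence tree not validated")
    cons = consensus(trees)                                   # step 4
    if not consensus_pertinent(cons):
        raise ValueError("consensus not validated")
    return compare(total, cons)                               # step 5
```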
7. Application
To illustrate how the stepwise validation procedure works with real data, I have applied
the methods to validate the kangaroo phylogeny of Kirsch et al. (1995), based on DNA-
hybridization data and depicting phylogenetic relationships among 12 species. It is
compared to the Baverstock et al. (1989) phylogeny of 14 kangaroo species, based on
immunological data. For the purpose of the demonstration, I have reduced the original
data sets to consider only the nine species common to both studies (Fig. 2).
[Fig. 2 trees: the Data 1, Data 2, Data 1+2 and consensus phylogenies for Notamacropus, Osphranter, Macropus, Petrogale, Thylogale, Dendrolagus, Setonix, Wallabia and Dorcopsulus.]
Fig. 2. Illustration of the stepwise validation procedure. Data 1 is Kirsch et al. (1995) DNA-hybridization
data. Data 2 is Baverstock et al. (1989) immunological data. Data 1+2 is the average of the standardized
data sets. The corresponding phylogenies were reconstructed with the FITCH algorithm (Felsenstein,
1993). The consensus was derived using the average procedure (Lapointe et al., 1994; Lapointe and
Cucumel, 1997). All trees are rooted by Dorcopsulus.
The first step of any validation study is always to check for internal validity. In the present
case, each tree had already been validated by bootstrapping and/or jackknifing in the
original studies. I thus proceeded directly with external validation. Metric and topological
indices were selected to compare the phylogenies depicted in the form of additive trees;
the matrix correlation computed from path-length distances is 0.354, compared to 0.645
for branch distances. The latter is more extreme than would be expected from pairs of
random additive trees of the same size (Lapointe and Legendre, 1992b). That is, the two
kangaroo phylogenies are topologically congruent (i.e., the data are not heterogeneous).
The next step was, therefore, to combine the standardized data matrices from the different
studies. I did so by a simple average of the immunological and DNA-hybridization
distances among the nine species (Lapointe and Kirsch, 1995). A phylogeny was derived
from that total-evidence matrix (Fig. 2), and internal validation was performed with
taxonomic jackknifing (Lapointe et al., 1994). Finally, the average consensus (Lapointe
and Cucumel, 1997) of the original phylogenies was computed to account for branch
lengths (Fig. 2). The consensus was compared to a distribution of consensus trees derived
from pairs of random trees to assess its pertinence (Cucumel and Lapointe, 1997). As both
the total-evidence and consensus trees were validated, the last and crucial step consisted in
the comparison of those phylogenies.
The correlation between the path-length matrices is 0.996 in this case, whereas the
topological correlation value is 0.927. Using the significance test described above, it was
shown that this particular pair of trees is more similar than what would be expected from
most consensus and total-evidence trees based on random data. The kangaroo phylogeny
is thus said to be globally validated.
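To make the comparison concrete, a path-length distance matrix can be computed from a rooted tree with branch lengths, and the matrix correlation taken over its upper triangle. The tree encoding below (a child-to-parent map with branch lengths) is an illustrative choice, not the format of any cited program:

```python
import numpy as np

def path_length_matrix(parent, length, taxa):
    """Pairwise path-length distances on a rooted tree; `parent` maps
    each non-root node to its parent, `length` gives branch lengths."""
    def path_to_root(v):
        d, acc = {}, 0.0
        while v in parent:
            acc += length[v]
            v = parent[v]
            d[v] = acc          # distance from the leaf up to ancestor v
        return d
    paths = {t: path_to_root(t) for t in taxa}
    n = len(taxa)
    m = np.zeros((n, n))
    for i, a in enumerate(taxa):
        for j, b in enumerate(taxa):
            if i < j:
                common = set(paths[a]) & set(paths[b])
                # the most recent common ancestor minimizes the sum
                m[i, j] = m[j, i] = min(paths[a][u] + paths[b][u]
                                        for u in common)
    return m

# With m1 and m2 built from the two trees, the matrix correlation is
# np.corrcoef(m1[iu], m2[iu])[0, 1], where iu = np.triu_indices(n, 1).
```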
8. References
Adams, E. N., III (1972): Consensus techniques and the comparison of taxonomic trees, Systematic
Zoology, 21, 390-397.
Alroy, J. (1994): Four permutation tests for the presence of phylogenetic structure, Systematic Biology, 43,
430-437.
Anderberg, A. and Tehler, A. (1990): Consensus trees, a necessity in taxonomic practice, Cladistics, 6,
399-402.
Archie, J. W. (1989a): A randomization test for phylogenetic information in systematic data, Systematic
Zoology, 38, 219-252.
Archie, J. W. (1989b): Homoplasy excess ratios: New indices for measuring levels of homoplasy in
phylogenetic systematics and a critique of the consistency index, Systematic Zoology, 38, 253-269.
Archie, J. W. (1989c): Phylogenies of plant families: A demonstration of phylogenetic randomness in
DNA sequence data derived from proteins, Evolution, 43, 1796-1800.
Archie, J. W. (1990): Homoplasy excess statistics and retention indices: A reply to Farris, Systematic
Zoology, 39, 169-174.
Archie, J. W. and Felsenstein, J. (1993): The number of evolutionary steps on random and minimum
length trees for random evolutionary data, Theoretical Population Biology, 43, 52-79.
Bandelt, H. J. (1995): Combination of data in phylogenetic analysis, Plant Systematics and Evolution,
Supplementum 9, 355-361.
Barrett, M. et al. (1991): Against consensus, Systematic Zoology, 40, 486-493.
Barrett, M. et al. (1993): Crusade? A response to Nelson, Systematic Biology, 42, 216-217.
Barthelemy, J.-P. and McMorris, F. R. (1986): The median procedure for n-trees, Journal of Classification,
3, 329-334.
Baum, B. R. (1992): Combining trees as a way of combining data for phylogenetic inference, and the
desirability of combining gene trees, Taxon, 41, 3-10.
Baum, B. R. and Ragan, M. A. (1993): Reply to A. G. Rodrigo's "A comment on Baum's method for
combining phylogenetic trees", Taxon, 42, 637-640.
Baverstock, P. R. et al. (1989): Albumin immunologic relationships of the Macropodidae (Marsupialia),
Systematic Zoology, 38, 38-50.
Berry, V. and Gascuel, O. (1996): On the interpretation of bootstrap trees: Appropriate threshold of clade
selection and induced gain, Molecular Biology and Evolution, 13, 999-1011.
Bledsoe, A. H. and Raikow, R. J. (1990): A quantitative assessment of congruence between molecular and
nonmolecular estimates of phylogeny, Journal of Molecular Evolution, 30, 247-259.
Bleiweiss, R. et al. (1994): DNA-DNA hybridization-based phylogeny of "higher" nonpasserines:
Reevaluating a key portion of the avian family tree, Molecular Phylogenetics and Evolution, 3, 248-255.
Bock, H. H. (1985): On some significance tests in cluster analysis, Journal of Classification, 2, 77-108.
Bobisud, H. M. and Bobisud, L. E. (1972): A metric for classifications, Taxon, 21, 607-613.
Bourque, M. (1978): Arbres de Steiner et réseaux dont varie l'emplacement de certains sommets, Ph.D.
Thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal.
Bremer, K. (1990): Combinable component consensus, Cladistics, 6, 369-372.
Bremer, K. (1995): Branch support and tree stability, Cladistics, 10, 295-304.
Brown, J. K. M. (1994): Probabilities of evolutionary trees, Systematic Biology, 43, 78-91.
Bryant, H. N. (1992): The role of permutation tail probability tests in phylogenetic systematics, Systematic
Biology, 41, 258-263.
Bull, J. J. et al. (1993): Partitioning and combining data in phylogenetic analysis, Systematic Biology, 42,
384-397.
Buneman, P. (1971): The recovery of trees from measures of dissimilarity, In: Mathematics in the
Archaeological and Historical Sciences, Hodson, F. R. et al. (eds.), 387-395, Edinburgh University Press,
Edinburgh.
Buneman, P. (1974): A note on the metric properties of trees, Journal of Combinatorial Theory (B), 17,
48-50.
Carpenter, J. M. (1992): Random cladistics, Cladistics, 8, 147-153.
Carter, M. et al. (1990): On the distribution of lengths of evolutionary trees, SIAM Journal of Discrete
Mathematics, 3, 38-47.
Chippindale, P. T. and Wiens, J. J. (1994): Weighting, partitioning, and combining characters in
phylogenetic analysis, Systematic Biology, 43, 278-287.
Colless, D. H. (1980): Congruence between morphometric and allozyme data for Menidia species: A
reappraisal, Systematic Zoology, 29, 288-299.
Critchlow, D. E. et al. (1996): The triples distance for rooted bifurcating phylogenetic trees, Systematic
Biology, 45, 323-334.
Cucumel, G. and Lapointe, F.-J. (1997): Un test de la pertinence du consensus par une méthode de
permutations, In: Actes des XXIXes Journées de Statistique, 299-300, Carcassonne.
Davis, J. I. (1993): Character removal as a means for assessing stability of clades, Cladistics, 9, 201-210.
Day, W. H. E. (1983a): The role of complexity in comparing classifications, Mathematical Biosciences, 66,
97-114.
Day, W. H. E. (1983b): Distributions of distances between pairs of classifications, In: Numerical
Taxonomy, Felsenstein, J. (ed.), 127-131, Springer-Verlag, Berlin.
Day, W. H. E. (1983c): Computationally difficult parsimony problems in phylogenetic systematics,
Journal of Theoretical Biology, 103, 429-438.
Day, W. H. E. (1986): Analysis of quartet dissimilarity measures between undirected phylogenetic trees,
Systematic Zoology, 35, 325-333.
Day, W. H. E. (1987): Computational complexity of inferring phylogenies from dissimilarity matrices,
Bulletin of Mathematical Biology, 49, 461-467.
Day, W. H. E. and McMorris, F. R. (1985): A formalization of consensus index methods, Bulletin of
Mathematical Biology, 47, 215-229.
de Queiroz, A. (1993): For consensus (sometimes), Systematic Biology, 42, 368-372.
de Queiroz, A. et al. (1995): Separate versus combined analysis of phylogenetic evidence, Annual Review
of Ecology and Systematics, 26, 657-681.
Dopazo, J. (1994): Estimating errors and confidence intervals for branch lengths in phylogenetic trees by a
bootstrap approach, Journal of Molecular Evolution, 38, 300-304.
Dubes, R. and Jain, A. K. (1979): Validity studies in clustering methodologies, Pattern Recognition, 11,
235-254.
Dwass, M. (1957): Modified randomization tests for nonparametric hypotheses, Annals of Mathematical
Statistics, 28, 181-187.
Edgington, E. S. (1995): Randomization Tests, 3rd Edition, Revised and Expanded, Marcel Dekker, New
York.
Eernisse, D. J. and Kluge, A. G. (1993): Taxonomic congruence versus total evidence, and the phylogeny
of amniotes inferred from fossils, molecules and morphology, Molecular Biology and Evolution, 10,
1170-1195.
Efron, B. (1979): Bootstrap methods: Another look at the jackknife, Annals of Statistics, 7, 1-26.
Efron, B. and Gong, G. (1983): A leisurely look at the bootstrap, the jackknife, and cross-validation,
American Statistician, 37, 36-48.
Efron, B. and Tibshirani, R. J. (1993): An Introduction to the Bootstrap, Chapman and Hall, New York.
Efron, B. et al. (1996): Bootstrap confidence levels for phylogenetic trees, Proceedings of the National
Academy of Sciences, USA, 93, 13429-13434.
Estabrook, G. F. (1992): Evaluating undirected positional congruence of individual taxa between two
estimates of the phylogenetic tree for a group of taxa, Systematic Biology, 41, 172-177.
Estabrook, G. F. et al. (1985): Comparison of undirected phylogenetic trees based on subtrees of four
evolutionary units, Systematic Zoology, 34, 193-200.
Faith, D. P. (1991): Cladistic permutation tests for monophyly and nonmonophyly, Systematic Zoology, 40,
366-375.
Faith, D. P. (1992): On corroboration: A reply to Carpenter, Cladistics, 8, 265-273.
Faith, D. P. and Ballard, J. W. O. (1994): Length differences and topology-dependent tests: A response to
Källersjö et al., Cladistics, 10, 57-64.
Faith, D. P. and Belbin, L. (1986): Comparison of classifications using measures intermediate between
metric dissimilarity and consensus similarity, Journal of Classification, 3, 257-280.
Faith, D. P. and Cranston, P. S. (1991): Could a cladogram this short have arisen by chance alone? On
permutation tests for cladistic structure, Cladistics, 7, 1-28.
Faith, D. P. and Trueman, J. W. H. (1996): When the topology-dependent permutation test (T-PTP) for
monophyly returns significant support for monophyly, should that be equated with (a) rejecting a null
hypothesis of nonmonophyly, (b) rejecting a null hypothesis of "no structure," (c) failing to falsify a
hypothesis of monophyly, or (d) none of the above?, Systematic Biology, 45, 580-586.
Farris, J. S. (1989a): The retention index and the rescaled consistency index, Cladistics, 5, 417-419.
Farris, J. S. (1989b): The retention index and homoplasy excess, Systematic Zoology, 38, 406-407.
Farris, J. S. (1991): Excess homoplasy ratios, Cladistics, 7, 81-91.
Farris, J. S. et al. (1995a): Constructing a significance test for incongruence, Systematic Biology, 44, 570-
572.
Farris, J. S. et al. (1995b): Testing significance of incongruencies, Cladistics, 10, 315-370.
Felsenstein, J. (1978): The number of evolutionary trees, Systematic Zoology, 27, 27-33.
Felsenstein, J. (1985): Confidence limits on phylogenies: An approach using the bootstrap, Evolution, 39,
783-791.
Felsenstein, J. (1993): PHYLIP: Phylogeny inference package, version 3.5c, distributed by the author,
University of Washington, Seattle.
Felsenstein, J. and Kishino, H. (1993): Is there something wrong with the bootstrap on phylogenies? A
reply to Hillis and Bull, Systematic Biology, 42, 193-200.
Finden, C. R. and Gordon, A. D. (1985): Obtaining common pruned trees, Journal of Classification, 2,
225-276.
Fowlkes, E. B. and Mallows, C. L. (1983): A method for comparing two hierarchical clusterings, Journal
of the American Statistical Association, 78, 553-569.
Kirsch, J. A. W. et al. (1997): DNA-hybridisation studies of marsupials and their implications for
metatherian classification, Australian Journal of Zoology, in press.
Klassen, G. J. et al. (1991): Consistency indices and random data, Systematic Zoology, 40, 446-457.
Kluge, A. G. (1989): A concern for evidence and a phylogenetic hypothesis of relationships among
Epicrates (Boidae, Serpentes), Systematic Biology, 38, 7-25.
Kluge, A. G. and Farris, J. S. (1969): Quantitative phyletics and the evolution of anurans, Systematic
Zoology, 18, 1-32.
Krajewski, C. and Dickerman, A. W. (1990): Bootstrap analysis of phylogenetic trees derived from DNA
hybridization matrices, Systematic Zoology, 39, 383-390.
Lanyon, S. (1985): Detecting internal inconsistencies in distance data, Systematic Zoology, 34, 397-403.
Lanyon, S. (1993): Phylogenetic frameworks: Towards a firmer foundation for the comparative approach,
Biological Journal of the Linnean Society, 49, 45-61.
Lapointe, F.-J. and Cucumel, G. (1997): The average consensus procedure: Combination of weighted trees
containing identical or overlapping sets of objects, Systematic Biology, 46, 306-312.
Lapointe, F.-J. and Legendre, P. (1990): A statistical framework to test the consensus of two nested
classifications, Systematic Zoology, 39, 1-13.
Lapointe, F.-J. and Legendre, P. (1991): The generation of random ultrametric matrices representing
dendrograms, Journal of Classification, 8, 177-200.
Lapointe, F.-J. and Legendre, P. (1992a): A statistical framework to test the consensus among additive
trees (cladograms), Systematic Biology, 41, 158-171.
Lapointe, F.-J. and Legendre, P. (1992b): Statistical significance of the matrix correlation coefficient for
comparing independent phylogenetic trees, Systematic Biology, 41, 378-384.
Lapointe, F.-J. and Legendre, P. (1994): A classification of pure malt Scotch whiskies, Applied Statistics,
43, 237-257.
Lapointe, F.-J. and Kirsch, J. A. W. (1995): Estimating phylogenies from lacunose distance matrices, with
special reference to DNA hybridization data, Molecular Biology and Evolution, 12, 266-284.
Lapointe, F.-J. and Legendre, P. (1995): Comparison tests for dendrograms: A comparative evaluation,
Journal of Classification, 12, 265-282.
Lapointe, F.-J. et al. (1994): Jackknifing of weighted trees: Validation of phylogenies reconstructed from
distance matrices, Molecular Phylogenetics and Evolution, 3, 256-267.
Leclerc, B. and Cucumel, G. (1987): Consensus en classification: Une revue bibliographique,
Mathématiques et Sciences Humaines, 100, 109-128.
Lecointre, G. H. et al. (1993): Species sampling has a major impact on phylogenetic inference, Molecular
Phylogenetics and Evolution, 2, 205-224.
Lefkovitch, L. P. (1985): Euclidean consensus dendrograms and other classification structures,
Mathematical Biosciences, 74, 1-15.
Le Quesne, W. (1989): Frequency distributions of lengths of possible networks from a data matrix,
Cladistics, 5, 395-407.
Li, W.-H. and Gouy, M. (1991): Statistical methods for testing phylogenies, In: Phylogenetic Analysis of
DNA Sequences, Miyamoto, M. M. and Cracraft, J. (eds.), 249-277, Oxford University Press, New York.
Li, W.-H. and Zharkikh, A. (1994): What is the bootstrap technique?, Systematic Biology, 43, 424-430.
Li, W.-H. and Zharkikh, A. (1995): Statistical tests of DNA phylogenies, Systematic Biology, 44, 49-63.
Ling, R. F. (1973): A probability theory of cluster analysis, Journal of the American Statistical Association,
68, 159-164.
Mantel, N. (1967): The detection of disease clustering and a generalized regression approach, Cancer
Research, 27, 209-220.
Margush, T. (1982): Distances between trees, Discrete Applied Mathematics, 4, 281-290.
Margush, T. and McMorris, F. R. (1981): Consensus n-trees, Bulletin of Mathematical Biology, 43, 239-
244.
Marshall, C. R. (1991): Statistical tests and bootstrapping: Assessing the reliability of phylogenies based
on distance data, Molecular Biology and Evolution, 8, 386-391.
Mason-Gamer, R. J. and Kellogg, E. K. (1996): Testing for phylogenetic conflict among molecular data
1. Introduction
The most useful hierarchical clustering methods are of the agglomerative type, iteratively
transforming the partition into unit clusters {{x} | x ∈ E} into the trivial one {E}, merging at each
step the most similar clusters, E being the set submitted to the analysis. These methods
require a prior double choice: the comparison function (dissimilarity/similarity,
represented as cf) γ_xy between pairs of elements x, y ∈ E, and the cf Γ(A, B) between
pairs of clusters of E. The generalisation of γ to Γ may be performed in several ways, as
there is a certain lack of consensus on the precise definitions of "cluster" and "resembling
clusters". Thus the question of measuring the resemblance remains a central one.
This reference hypothesis is quite suitable and natural for evaluating the global resemblance
between clusters of E. Based on it, some good hierarchical and non-hierarchical (Nicolau and
Brito, 1989) clustering methods and fruitful ideas arise. All those methods are included in
the package CLASSIF, and the results obtained on either simulated or real data very often
show a clear improvement over other traditional methods, especially in the cases of AVL,
AVM and particular mixed methods (like AVB) which are generated from some
parametric families of agglomerative methods.
In the sequel we briefly examine the general procedure to get the cf's γ and Γ, in Section
2. For details see Bacelar-Nicolau (1972, 1979, 1980, 1981), Costa Nicolau (1980, 1983,
1985), Bacelar-Nicolau and Costa Nicolau (1981, 1985), and Lerman (1972, 1981).
Section 3 concerns some properties of the AVL and AVM methods, and Section 4 finally
presents some parametric families defining mixed aggregation criteria from the Single
Linkage and the AVL methods (Bacelar-Nicolau and Costa Nicolau, 1994).
Often γ_xy will be calculated approximately, assuming our data set to have a large
dimension. Then S' = (S − E(S))/σ_S follows an asymptotic standard normal N(0, 1)
distribution, and in each case we can take γ_xy = P(S' ≤ s'_xy), the value of the cdf at the
observed standardized similarity.
In the case of binary data, for instance, we can take S_xy = Σ_{i∈D} 1_x(i)·1_y(i), the
number of common presences of x and y over the descriptive set D, where 1_x(i) = 1 if i
verifies x, and 0 otherwise. Thus S will have a hypergeometric or binomial (or
Poisson) distribution as its exact law, depending on the underlying reference hypothesis
(concerning the marginal frequencies of the 2×2 contingency table associated with the
pair (x, y)). Then γ_xy can be estimated by the corresponding cdf of the normal
approximation of each selected discrete distribution.
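For example, under the fixed-margins (hypergeometric) reference hypothesis, one of the options named above, the VL coefficient of two binary variables can be computed with the normal approximation just described. This is a minimal sketch with illustrative names; the degenerate case of a constant variable is not handled:

```python
import math

def vl_similarity_binary(x, y):
    """Probabilistic (VL) similarity of two 0/1 sequences over the same
    descriptive set D, via the normal approximation of the
    hypergeometric law of the number of common presences."""
    N = len(x)
    s = sum(xi * yi for xi, yi in zip(x, y))   # common presences
    nx, ny = sum(x), sum(y)
    mean = nx * ny / N
    var = nx * ny * (N - nx) * (N - ny) / (N ** 2 * (N - 1))
    z = (s - mean) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal cdf
```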
Moreover, in the binary case H. Bacelar-Nicolau (1981, 1987, 1989) found that the usual
association coefficients are grouped in several distributionally equivalent classes in the
following sense: either they have the same exact distribution or else they share the same
asymptotic normal distribution. This means that, using the probabilistic coefficient, one can
choose for hierarchical clustering purposes only one coefficient in each distributionally
equivalent class, the other ones giving (exactly or asymptotically) the same hierarchical
results (the same dendrogram, for instance).
In the case of frequency or contingency tables we usually take the probabilistic coefficient
based on the affinity coefficient (Matusita, 1951; Bacelar-Nicolau, 1988): the above
reference hypothesis will then refer either to a sampling scheme or to a permutational
scheme, depending on the way the data have been observed. The sampling scheme has
been extended to integer data (Bacelar-Nicolau and Nicolau, 1993), while the permutation
scheme usually applies to real data.
Often we refer to the probabilistic coefficient γ_xy as the VL similarity (V for Validity, L
for Linkage) because, as Tiago de Oliveira pointed out, the probabilistic coefficient
validates on a probabilistic scale the basic linkage s between each pair (x, y). On the other
hand, VL comes primarily from "Vraisemblance du Lien", that is, the likelihood of the link.
In fact, much of our research on this matter finds its roots in the works of Lerman (1970)
and Bacelar-Nicolau (1972), where the probabilistic coefficient was used for the first time
in the binary case: it was called the VL ("Vraisemblance du Lien") similarity coefficient, and its
first associated aggregation criterion AVL, "Algorithme de la Vraisemblance du Lien".
Subsequent extensions have often kept the same label.
Assume that we are dealing with some similarity coefficient γ_xy following a unit uniform
distribution and that the measure of resemblance between each pair of clusters A and B is
based on the set {γ_ab | (a, b) ∈ A × B}, crossing A and B, with size αβ (where α = card A,
β = card B). Also assume the αβ γ-similarities are i.i.d. uniform on [0, 1].
This reference hypothesis is quite general and fits many applied situations well. On the
other hand, it is well known that the integral transformation of a continuous random variable
leads to the uniform distribution on [0, 1]. The S random variable in step 1 being in fact of
discrete type, the size of the descriptive set D is in general large enough in cluster analysis
to apply this property to the sample of γ-similarities.
Let us now introduce some probabilistic cf's that will generalise γ_xy to the cf Γ(A, B)
between subsets of E. We have studied, in particular, the cf coefficients associated with the
following statistics:
p_{A,B} = max{γ_ab}, q_{A,B} = min{γ_ab}, p̄_{A,B} = (1/αβ) Σ_{(a,b)∈A×B} γ_ab, and t_{A,B},
where p_{A,B}, q_{A,B} and p̄_{A,B} are the basic statistics for the single linkage, complete linkage and
average linkage methods, respectively, and t_{A,B} is a statistic already used (with
empirical dissimilarity coefficients) by some researchers to obtain a compromise between the
single and complete linkage methods. Here we are not interested in directly using those
statistics for clustering purposes; instead we want to take their cumulative distribution
functions, in order to use the probabilistic approach to hierarchical classification. As we
have pointed out, we assume the αβ γ-similarities to be i.i.d. in this work.
The first probabilistic aggregation criterion, MaxProb, derived from p_{A,B}, is the so-called
AVL method (Validity-Linkage Algorithm, "Algorithme de la Vraisemblance du
Lien"), which was first proposed by Lerman (1970), its main properties being established
by Bacelar-Nicolau (1972) in the context of the classification of variables. It was generalised
in Lerman (1981), Bacelar-Nicolau (1981) and Nicolau and Bacelar-Nicolau (1981), for
instance. The second statistic, q_{A,B}, was used as support for a probabilistic clustering
method for frequency data introduced by Goodall, where the probabilistic similarity is
defined from the χ² distribution (in Legendre and Legendre (1983)). In our
approach the probabilistic aggregation criterion MinProb associated with q_{A,B} did not
perform very well. The third probabilistic criterion, AvmProb, derived from p̄_{A,B}, is also
the so-called AVM method (Mean-Validity Algorithm). Finally, the probabilistic criterion
generated by the "hybrid" statistic t_{A,B} will not be considered now, since we are most
interested in studying parametric mixed criteria. Thus, among those four statistics, we
shall take in the present paper only p_{A,B} and p̄_{A,B}. Under the i.i.d. assumption referred to
above, we obtain for the corresponding AVL and AVM clustering criteria the following
expressions:
Γ_AVL(A, B) = (p_{A,B})^{αβ}   and   Γ_AVM(A, B) = Φ(√(12αβ)·(p̄_{A,B} − 1/2)),
Φ denoting the standard normal cdf.
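A minimal numerical rendering of the two criteria, assuming the expressions as reconstructed above (the input is the list of the αβ cross-similarities between the two clusters; names are illustrative):

```python
import math

def gamma_avl(cross_sims):
    """AVL (MaxProb): P(max of ab i.i.d. U(0,1) <= p) = p**(ab),
    evaluated at the observed maximum cross-similarity."""
    ab = len(cross_sims)
    return max(cross_sims) ** ab

def gamma_avm(cross_sims):
    """AVM (AvmProb): normal approximation of the cdf of the mean of
    ab i.i.d. U(0,1) variables (mean 1/2, variance 1/(12*ab))."""
    ab = len(cross_sims)
    z = math.sqrt(12 * ab) * (sum(cross_sims) / ab - 0.5)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```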
Both the AVL and AVM methods generally perform quite well on either simulated or
real data. Nevertheless, they differ in some specific features, which can allow us to choose
each one for different particular situations. In the next section we compare the two methods.
Let Γ^(k−1) denote the table of similarity values between clusters at the (k−1)-th step of an
agglomerative process, and let f(k) = max Γ^(k−1). We call d(k) = 1 − f(k) a level index: it
measures the lack of cohesion of the successive clusters formed at the sequential
hierarchical tree levels; in this sense one usually hopes d(k) to be an increasing function of
the level k. Nevertheless, we want to point out that we believe the existence of inversions
in d(k) can be quite compatible with the construction of good hierarchical classifications.
Property. The level index of the AVM method can present inversions, whereas the level
index of the AVL method is strictly increasing.
Proof: The absence of monotonicity of the AVM level index is simply a consequence of the
statistical convergence of the mean of an i.i.d. sample of the unit uniform
distribution to the value 0.5: when the entries of Γ are updated at the end of each step of
the clustering algorithm, the Γ values will increase if the mean of the VL similarities is
greater than 1/2, or decrease if that mean is less than 1/2. So we can naturally have
d(k+1) < d(k) at some levels of the AVM dendrogram.
Now, in order to see that for the AVL method one always has d(k+1) > d(k), let us first
suppose that Γ^(k−1) has only one maximum value (binary tree), so that only one pair of clusters
verifies the criterion at each step. Let (A, B) be the pair of clusters satisfying the AVL
criterion (maximisation of Γ^(k−1)) and let P_k be the partition defined at the k-th step:
P_k = P_{k−1} − {A, B} ∪ {A ∪ B}.
Write Γ^(k−1)_A = {Γ^(k−1)(A, x) | x ≠ A, x ∈ P_{k−1}} ⊂ Γ^(k−1) for the entries
involving A (and similarly for B), so that Γ^(k) = Δ^(k−1) ∪ Δ^(k), where
Δ^(k−1) = Γ^(k−1) − (Γ^(k−1)_A ∪ Γ^(k−1)_B) and Δ^(k) = Γ^(k)_{A∪B} contains the
updated entries. Now, either max Γ^(k) = max Δ^(k−1), and then max Γ^(k) < max Γ^(k−1),
since the unique maximum of Γ^(k−1) was attained at (A, B) and has been removed; or
max Γ^(k) = max Δ^(k), and again max Γ^(k) < max Γ^(k−1), since the exponent αβ of p
can only grow when clusters are merged. In either case f(k+1) < f(k) and hence d(k+1) > d(k).
In the general case, if h pairs of clusters {A_i, B_i}, i = 1, ..., h, verify the AVL criterion
at the k-th step, then Γ^(k) becomes Γ^(k) = Δ^(k−1) ∪ Δ^(k),
where now
Δ^(k−1) = Γ^(k−1) − [(∪_{i=1}^{h} Γ^(k−1)_{A_i}) ∪ (∪_{i=1}^{h} Γ^(k−1)_{B_i})] ⊂ Γ^(k−1)
and Δ^(k) = ∪_{i=1}^{h} Γ^(k)_{A_i ∪ B_i},
so that the monotonicity property of the level index of AVL always holds.
As successive updatings of the cf Γ in the AVM method tend to reinforce the strong links and,
conversely, to weaken the weak links, one usually observes in the AVM dendrograms a
sort of "bipolarisation effect". Most examples treated so far by the AVM method clearly show
the manner in which this method works: the kernel of each cluster grows by joining
element after element in a kind of local chain effect (similar to the characteristic global
effect of the single linkage method); once a kernel is finished, another one begins to build in the
same way, and the process continues until all the main clusters are formed; those clusters
are then merged, without chain effect, producing good-looking and coherent trees.
This double effect globally conserves the initial structure as expressed by the cf γ and is
responsible, after all, for the trustworthy fitting of AVM trees to the data.
A different tendency can usually be observed in the AVL hierarchies, which are seldom
associated with a chain effect. Instead they produce quite regular trees with clusters of equal
size at each level of the tree. We call this the "symmetry effect" of the AVL method, the
responsible factor being the exponent of p_{A,B} in the formula of Γ. This exponent
performs in fact a sort of brake action on the chain effect at all levels of the dendrogram.
One can easily understand the symmetry effect of AVL in the following example:
Suppose there are h clusters of equal size a at step k (k = 0, 1, 2, ...); let A, B ∈ P_k be the
clusters which are going to be merged. Then updating the cf between any other cluster C
and the new merged cluster A ∪ B will give
Γ^(k+1)(A ∪ B, C) = p_{A∪B,C}^{2a·a} = p_{A∪B,C}^{2a²},
while the link between C and another cluster X of P_{k+1} is expressed by
Γ^(k+1)(C, X) = p_{C,X}^{a²}.
So A ∪ B will attract C, shaping a new cluster A ∪ B ∪ C of size 3a, only if
p_{A∪B,C}^{2a²} > p_{C,X}^{a²},
or equivalently:
p_{A∪B,C} > √(p_{C,X}), for all X ≠ C, A ∪ B.
Therefore it is natural that clusters of size 2a arise before any other group of size 3a can
emerge.
Concerning the "symmetry" effect of AVL versus the "bipolarisation" effect of AVM, recent work
by Nicolau and Bacelar-Nicolau associates them with the space-dilating
and space-contracting properties of agglomerative methods (Lance and Williams,
1967).
Therefore a bridge between SL and AVL methods can be established by using, for instance,
the following chain of exponents:
1 ≤ min(α, β) ≤ 2αβ/(α + β) ≤ √(αβ) ≤ (α + β)/2 ≤ max(α, β) ≤ αβ.
a+fJ 2
The corresponding agglomerative cf's based on the VL similarity define an iterative
clustering procedure. This methodology assures economical computation, invariance with respect to the initial
order of the elements or of the clusters to be merged, and some way of evaluating the brake
action on both the chain and symmetry effects. In what concerns these aspects, the two
methods associated with the geometric and the arithmetic means of the cardinals of the clusters
being compared perform quite well, producing fine, interpretable hierarchical trees.
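A sketch of the exponent chain above, with the probabilistic criterion p^g evaluated for a chosen exponent g (the function names are illustrative, and p is taken as the maximal VL cross-similarity, as in the AVL construction):

```python
import math

def bridge_exponents(a, b):
    """Exponents along the bridge from SL (exponent 1) to AVL
    (exponent a*b), for clusters of cardinals a and b."""
    return {
        "SL": 1.0,
        "min": min(a, b),
        "harmonic": 2 * a * b / (a + b),
        "geometric": math.sqrt(a * b),
        "arithmetic": (a + b) / 2.0,
        "max": max(a, b),
        "AVL": a * b,
    }

def gamma_parametric(cross_sims, exponent):
    """Probabilistic aggregation criterion p**g for a chosen exponent."""
    return max(cross_sims) ** exponent
```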
The above iterative clustering procedure has been generalised by defining some suitable
parametric families of aggregation criteria to link the SL and AVL methods.
The first idea for finding such a family was to take a scalar transformation of αβ:
αβ → αβ/ξ, with fixed ξ, 1 < ξ < αβ. This turned out not to be a good solution, since it is
easy to prove that:
Given Γ_g(A, B) = p_{A,B}^{g(α,β)}, any other method Γ_{g'}, such that g' = δ·g(α, β) with fixed
δ > 0, will give exactly the same hierarchical tree as that built by Γ_g, differences existing only in
the values of the level index.
The invariance does not hold for affine transformations of the exponent g(α, β):
Γ_g(A, B) and Γ_{δg+ε}(A, B) = p_{A,B}^{δ·g(α,β)+ε} generally produce different hierarchical trees.
Thus a natural solution to the question of defining an appropriate parametric family of
agglomerative criteria linking the SL and AVL methods appears to be the following:
Γ(A, B) = p_{A,B}^{δαβ+ε}.
Experimental work was conducted on both simulated and real data in order to study this
parametric family and its role in the search for hierarchical methods which better fit the
initial similarities, as well as in assessing the stability and validity of those methods.
On the other hand, the above parametric family has later been extended in order to include,
in the former iterative clustering procedure, the two probabilistic methods associated with the
exponents given by the geometric and the arithmetic means of the cardinals of the clusters being
compared. We get in this case g(α, β; ε, ξ) = 1/(1 + ε((α × β)^ξ − 1)), where ε
and ξ both take values in the interval [0, 1]. More recently, the whole family has been
included in a probabilistic-similarity version of the well-known Lance and Williams
formula (Bacelar-Nicolau and Nicolau, 1994). The recursive Lance and Williams formula,
designed for dissimilarity coefficients, can in fact easily be adapted to clustering methods
based on similarities, and particularly on the VL similarity. One has:
Γ(A ∪ B, C) = δ₁ Γ(A, C) + δ₂ Γ(B, C) + β Γ(A, B) + γ |Γ(A, C) − Γ(B, C)|,
where the constants δ₁, δ₂, β, γ vary according to the method we want to
reproduce.
The extended formula, derived in order to include the probabilistic hierarchical family,
needs only two more constants, namely the parameters ε and ξ above; these enter through
the exponent g(α, η; ε, ξ) = 1/(1 + ε((α × η)^ξ − 1)), η being the cardinal of the cluster C.
We can easily see that setting ε = 0 simply recovers the first formula above. On the other
hand we find as particular cases in the family: the single linkage SL (δ₁ = δ₂ = γ = 1/2,
β = ε = 0), the validity-linkage AVL (δ₁ = δ₂ = γ = 1/2, β = 0, ε = ξ = 1) and the brake-validity
linkage AVB (δ₁ = δ₂ = γ = 1/2, β = 0, ε = 1, ξ = 1/2) algorithms.
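A sketch of one recursive Lance and Williams update, written for similarities: with δ₁ = δ₂ = γ = 1/2 and β = 0 it returns max(Γ(A,C), Γ(B,C)), i.e. single linkage on similarities. The text does not reproduce exactly how the extra exponent g(α, η; ε, ξ) enters the extended formula, so the second function below, which raises the updated value to that exponent, is an assumption for illustration only (note that ε = 0 gives g = 1 and recovers the plain update, as the text requires):

```python
def lw_similarity_update(g_ac, g_bc, g_ab, d1, d2, beta, gamma):
    """One Lance-Williams update of the similarity between the merged
    cluster A u B and another cluster C."""
    return d1 * g_ac + d2 * g_bc + beta * g_ab + gamma * abs(g_ac - g_bc)

def lw_similarity_update_extended(g_ac, g_bc, g_ab, d1, d2, beta, gamma,
                                  a, eta, eps, xi):
    """Hedged sketch of the extended family: eps and xi enter through
    the exponent g; the placement of g is an assumption (eps = 0 gives
    g = 1 and the plain Lance-Williams value)."""
    base = lw_similarity_update(g_ac, g_bc, g_ab, d1, d2, beta, gamma)
    g = 1.0 / (1.0 + eps * ((a * eta) ** xi - 1.0))
    return base ** g
```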
Using such a parametric family enables us to analyse the robustness of our clustering
methods: by varying ε and ξ from 0 to 1, the other coefficients remaining unchanged, we
can study the stability of the AVL family of models, as before. This procedure can of
course be generalised to a comparison between the probabilistic family and the
agglomerative methods generated by the former Lance and Williams formula.
5. Conclusions
The concept of the probabilistic (VL) similarity, obtained by using the distribution
function of a random variable (the uniform transformation), and its extension to
agglomerative criteria allow us to establish a general, consistent probabilistic approach to
hierarchical clustering methods. The validity-affinity coefficient combined with the AVL, AVB and AVM
methods are good examples of this approach.
Finally, note that this concept of probabilistic similarity is not confined to the field
of hierarchical agglomerative methods: we have also developed a non-hierarchical
probabilistic approach based on the distribution function of the mean of a unit uniform
sample, which gives an excellent non-hierarchical method of k-means type.
Summary: We investigate the solutions to the clustering and the discriminant analysis
problems when the points are supposed to be distributed according to Poisson processes
on convex supports. This leads to very intuitive criteria for homogeneous Poisson processes,
based on the Lebesgue measures of convex hulls. For non-homogeneous Poisson processes,
the Lebesgue measures have to be replaced by intensities integrated over convex hulls.
Similar geometrical tools, based on the Lebesgue measure, are used in the context of pattern
recognition. First, a discriminant analysis algorithm is developed for estimating a convex
domain when inside and outside points are available. Generalisation to non-convex domains
is explored.
Introduction
This research was initiated by D. G. Kendall's question of how to estimate a
bounded convex set observing only the realization of a homogeneous Poisson process
inside this convex set. The solution (Ripley and Rasson, 1977; Rasson, 1979) was a
homothetic expansion of the convex hull of the sample from its centroid. The follow-
ing question, raised by E. Diday, was naturally the problem of using these arguments
in clustering. This led us to the maximum likelihood estimation of the hypothesized
support for clustering, i.e. the union of K bounded convex sets. The solution was:
"find the partition of the points into K subgroups such that the sum of the Lebesgue
measures of their convex hulls is minimal" (Hardy and Rasson, 1982).
Then came the question of the corresponding discriminant analysis, raised by A. D.
Gordon. The Bayesian solution for the same hypothesis (still for a homogeneous
Poisson process) was to assign the new point to the sample for which the Lebesgue
measure added by convexity to its convex hull is minimal (Baufays and Rasson, 1984a,
1984b, 1985). But this could not classify the points belonging to more than one con-
vex hull, as is often the case for pixels in images.
To deal with them, we then moved to non-homogeneous independent Poisson pro-
cesses with convex supports. The main change in the solution was to replace the
Lebesgue measure by the integrated intensity. But the same equation also gave the
solution for points lying in the intersections of the supports.
disjoint domains (D_k), 1 ≤ k ≤ g. Our point of view will be that if we are able to
find back these domains, making some inference about them, we will, in some sense,
solve the clustering problem.
1.2. The maximum likelihood solution of the clustering problem.
Let x denote the sample vector (x₁, ..., xₙ) with xᵢ ∈ ℝᵗ, i = 1, ..., n. The indicator
function of a set A at the point y is defined by 1_A(y) = 1 if y ∈ A and 0 otherwise.
Since, under our hypothesis, the points are independently and uniformly dis-
tributed on D, the likelihood function takes the form
f_D(x) = (1/m(D))ⁿ ∏_{i=1}^{n} 1_D(xᵢ),
where m(D), the Lebesgue measure of D, is the sum of the measures of the g subsets
D_k (1 ≤ k ≤ g). The domain D, a parameter of infinite dimension, for which the
likelihood is maximal is, among all those which contain all the points, the one whose
Lebesgue measure is minimal.
If we do not impose more conditions on the subsets, we can easily find g sets D_k
which contain all the points and are such that the sum of their measures is zero.
Thus there are many trivial solutions to the problem. Nevertheless, we can easily
see that the problem of estimating a domain is not well-posed and that the weakest
assumption that makes the domain D estimable is the convexity of the D_k (Baufays
and Rasson, 1984).
With a partition of the set of points into g sub-domains having disjoint convex hulls,
we can associate a whole class of estimators; indeed we only have to find g disjoint
convex sets, each of them containing one of the subgroups. For each partition, the likelihood
has a local maximum: the convex hulls of the g subsets. The global maximum will
be attained with the partition for which the sum of the Lebesgue measures of the
convex hulls of the g subgroups is minimal. This is the solution we seek.
Practically, if the basic space is ℝ, we look for the g disjoint intervals containing all
the points such that the sum of their lengths is minimal. In ℝ² (or ℝ³), we try to
find the g groups of points such that the sum of the areas (volumes) of their disjoint
convex hulls is minimal.
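In ℝ², the criterion can be prototyped directly for small data sets: a convex-hull area routine plus a brute-force search over partitions. This is a toy sketch only (exponential in n, and the disjoint-hull constraint is not checked); names are illustrative:

```python
from itertools import product

def hull_area(points):
    """Area of the convex hull of 2-D points (monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and (
                (h[-1][0] - h[-2][0]) * (p[1] - h[-2][1])
              - (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0])) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    hull = half(pts) + half(pts[::-1])
    n = len(hull)
    return 0.5 * abs(sum(hull[i][0] * hull[(i + 1) % n][1]
                       - hull[(i + 1) % n][0] * hull[i][1]
                         for i in range(n)))

def best_partition(points, g):
    """Brute-force search (tiny n only) for the partition into g groups
    minimizing the sum of convex-hull areas."""
    best, best_groups = float("inf"), None
    for labels in product(range(g), repeat=len(points)):
        if len(set(labels)) < g:
            continue
        groups = [[p for p, l in zip(points, labels) if l == k]
                  for k in range(g)]
        total = sum(hull_area(gr) for gr in groups)
        if total < best:
            best, best_groups = total, groups
    return best_groups, best
```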
1.3. The statistical model and the rule for the associated discriminant
analysis.
The conditional distribution for the k-th population is assumed to be uniform on a
convex compact domain D_k, and the a priori probability π_k that an individual belongs
to population k is proportional to the Lebesgue measure of D_k. The convex domains
D_k are assumed disjoint. The density of population k, f_k(x), and the unconditional
density f(x) are respectively equal to:
f_k(x) = (1/m(D_k)) · 1_{D_k}(x)   and   f(x) = (1/m(D)) · Σ_{k=1}^{g} 1_{D_k}(x).
The decision rule is the Bayesian one, with the unknown parameters (the convex sets
D_k) replaced by their maximum likelihood estimates. Let X_k be the labeled sample
of population k, H(X_k) be its convex hull, and x be the individual to be assigned
to one of the g populations. If x is allocated to the k-th group, the estimates of the
domains D_j are
D̂_j = H(X_j) if j ≠ k,   D̂_j = H(X_j ∪ {x}) if j = k.
(Figure 1.)
The allocation rule is then:
assign x to the k-th population if and only if S_k(x) < S_j(x) for all j ≠ k,
where S_k(x) = m(H(X_k ∪ {x})) − m(H(X_k)) is the Lebesgue measure added by
convexity when x joins the k-th sample.
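The allocation rule then needs only the measure added by convexity; a short sketch reusing hull_area from the previous fragment (names illustrative):

```python
def added_measure(points, x):
    """S_k(x) = m(H(X_k u {x})) - m(H(X_k)), via hull_area above."""
    return hull_area(points + [x]) - hull_area(points)

def allocate(x, samples):
    """Assign x to the population whose convex hull grows the least;
    `samples` maps a class label to its list of 2-D points."""
    return min(samples, key=lambda k: added_measure(samples[k], x))
```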
Our point of view will again be to try to find the maximum likelihood estimate
of the D_k. If x denotes the sample vector (x₁, ..., xₙ) with xᵢ ∈ ℝᵗ, i = 1, ..., n, its
likelihood will be
f_D(x) = (1/m(D))ⁿ ∏_{i=1}^{n} 1_D(xᵢ) · q(xᵢ).
Thus, if the intensity is known (or, perhaps, has been estimated), since the maximum
likelihood estimate of a convex domain based on a sample of points inside it is still the
convex hull of this sample, the maximum likelihood clustering solution is then:
Find the g groups of points for which the sum of the intensities integrated over their
convex hulls is minimal.
Thus if q_k(·) is the intensity of the process, satisfying q_k(x) > 0 ⇔ x ∈ D_k (e.g.
q_k(·) = q(·)·1_{D_k}(·) for the case of a unique process on disjoint sets), we may suppose
that any point is distributed on D = ∪_{k=1}^{g} D_k with respect to the density function
f(x) = Σ_{k=1}^{g} π_k f_k(x),
where f_k(x) = q_k(x) / ∫_{D_k} q_k(y) dy
and π_k = ∫_{D_k} q_k(y) dy / Σ_{j=1}^{g} ∫_{D_j} q_j(y) dy.
As usual, the possibly convex supports D_k are estimated by the convex hulls H(X_k)
of the training-set points. If we denote S = Σ_{k=1}^{g} ∫_{H(X_k)} q_k(y) dy and
S_k(x) = ∫_{H(X_k ∪ {x}) \ H(X_k)} q_k(y) dy, then the Bayesian classification rule becomes:
assign the new point x to the class k such that π_k f_k(x) = q_k(x)/(S + S_k(x)) is
maximal.
When the local intensities q_k(x) do not depend on k, this rule simply consists in
assigning x to the class k for which the added intensity S_k(x) is minimal. Thus, in
this case and when the convex hulls are disjoint, we still keep the convex admissibility
property, i.e. "if the point x belongs to the convex hull of only one class, x is assigned
to this class".
The solution we propose to the problem is the use of the discriminant rule we have
just described in cluster analysis. The situation here is quite similar, as we have two
samples, the n points observed inside the domain and the m points observed outside it,
and the likelihood of the two samples is
L_D(x, y) = (1/m(D))ⁿ ∏_{i=1}^{n} 1[xᵢ ∈ D] × (1/m(D̄))ᵐ ∏_{i=n+1}^{n+m} 1[xᵢ ∈ D̄],
D̄ denoting the domain of the outside observations.
The "shadow" statistic J(x_{n+1}, ..., x_{n+m} | x₁, ..., xₙ) of Hatchel, Meilijson and
Nadas (1981) plays for the outside sample the role that the convex hull plays for the
inside one; the pair (H(·), J(·|·)) is a minimal sufficient statistic for the estimation of D. See Figure 2.
It is known that J(x_{n+1}, ..., x_{n+m} | x₁, ..., xₙ) has similar properties to the convex hull
statistic H(x₁, ..., xₙ). It is a consistent estimate of D. It is robust with respect
to small changes in the location of the data points. It satisfies the equivariance
requirement and underestimates the Lebesgue measure of D in the same way as
m(H(x₁, ..., xₙ)) does for m(D), the bias involving E[N_{m+1}], the expected number of
extreme points of J(·|·) for m + 1 observations in D. See Ripley and Rasson (1977) and Remon (1993).
The use of H(x₁, ..., xₙ) and J(x_{n+1}, ..., x_{n+m} | x₁, ..., xₙ) as estimates of the two un-
known domains (here D and D̄) in the criterion proposed by Baufays and Rasson is
the key idea of our discriminant algorithm.
One then gets the following boundary between the regions allocated to D and D̄: the
set of points x₀ such that
p_D f_D(x₀) = p_{D̄} f_{D̄}(x₀),
where
S₁(x₀) = m(H(x₁, ..., xₙ, x₀)) − m(H(x₁, ..., xₙ))
and S₂(x₀) is the corresponding measure added to the shadow statistic when x₀ is
adjoined to the outside sample.
This boundary gives us a practical and easily computable estimate D̂ of the unknown
domain D. See Figure 3 for results on large data sets. The symmetric difference between
D and D̂ is denoted DΔD̂ = (D ∪ D̂) \ (D ∩ D̂).
Figure 3b: Ellipsoidal D with m(D) = 0.20: t = 300 observations with n = 68 from
D yield D̂ with m(D̂) = 0.20 and m(DΔD̂) = 0.011.
3.3. Properties of D̂
The estimator D̂ yields a consistent estimate of D, as it is bounded by two consistent
estimators of D, H(·) and J(·|·). It has a piecewise continuous boundary. Unfor-
tunately, it happens not to be a convex set. On the other hand, this last feature
turns out to be an advantage when a similar reasoning is applied to the estimation
of non-convex domains.
The estimator D̂ is robust with respect to small changes in the location of data
points, as it is based only on the shapes of H(·) and J(·|·), which are robust in this
sense. Such a property is rare in spatial statistics. For instance, the estimator
proposed by Moore et al. (1988) can be very sensitive to small changes in the location
of data points.
Let us note that the time required for computing D̂ does not depend on the
number of points, except for the computation of the convex hull and shadow statis-
tics. The amount of cpu-time required is only a function of the precision asked of the
estimator D̂.
3.4. Conclusions and future research.
Our estimate of D, based on a well-known discriminant analysis criterion, seems to
be a powerful tool for pattern recognition. Moreover, it is quite straightforward to
generalize it to a non-homogeneous Poisson process.
Current research addresses the recognition of non-convex domains. The first
results seem very encouraging, as shown by the estimation of the letter A; see Figure
4, where our algorithm is compared to a discriminant rule based on the distance to
the nearest neighbour.
References:
Baufays, P. and Rasson, J.-P. (1984): Une nouvelle règle de classement, utilisant l'enveloppe
convexe et la mesure de Lebesgue, Statistique et Analyse des Données, 2, 31-47.
Baufays, P. and Rasson, J.-P. (1984): Propriétés théoriques et pratiques et applications d'une
nouvelle règle de classement, Statistique et Analyse des Données, 9/3, 1-10.
Baufays, P. and Rasson, J.-P. (1985): A new geometric discriminant rule, Computational Statis-
tics Quarterly, 2(1), 15-30.
Degytar, Y. U. and Finkelsh'Tein, M. Y. (1974): Classification algorithms based on construc-
tion of convex hulls of sets, Engineering Cybernetics, 12, 150-154.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis, Wiley, Chichester.
Efron, B. (1965): The convex hull of a random set of points, Biometrika, 52, 331-343.
Fisher, L. and Van Ness, J. W. (1971): Admissible clustering procedures, Biometrika, 58,
91-104.
Fukunaga, K. (1972): Introduction to Statistical Pattern Recognition, Academic Press, New
York.
Grenander, U. (1973): Statistical geometry: a tool for pattern analysis, Bulletin of the
American Mathematical Society, 79, 829-856.
Hand, D. J. (1981): Discrimination and Classification, Wiley, Chichester.
Hardy, A. and Rasson, J.-P. (1982): Une nouvelle approche des problèmes de classification
automatique, Statistique et Analyse des Données, 7, 41-56.
Hartigan, J. A. (1975): Clustering Algorithms, Wiley, Chichester.
McLachlan, G. J. (1992): Discriminant Analysis and Statistical Pattern Recognition, Wiley,
New York.
Moore, M., Lemay, Y. and Archambault, S. (1988): Algorithms to reconstruct a convex set
from sample points, In: Computing Science and Statistics: Proceedings of the 20th Sym-
posium on the Interface, Wegman, E. J., Gantz, D. T. and Miller, J. J. (eds.), 553-558, ASA,
Virginia.
Rasson, J.-P. (1979): Estimation de domaines convexes du plan, Statistique et Analyse des
Données, 1, 31-46.
Remon, M. (1994): The estimation of a convex domain when inside and outside observations
are available, Supplemento ai Rendiconti del Circolo Matematico di Palermo, Serie II, 35,
227-235.
Remon, M. (1996): A discriminant analysis algorithm for the inside/outside problem,
Computational Statistics and Data Analysis.
Ripley, B. D. and Rasson, J.-P. (1977): Finding the edge of a Poisson forest, Journal of Applied
Probability, 14, 483-491.
Toussaint, G. T. (1980): Pattern recognition and geometrical complexity, In: Proceedings of the
Fifth International Conference on Pattern Recognition, 1324-1347, IEEE.
Part II
Methodologies in Classification
Summary: The paper addresses the problem of identifying relevant values for the number
of clusters present in a data set. The problem has usually been tackled by searching for a
best partition using so-called stopping rules. It is argued that it can be of interest to de-
tect cluster structure at several different levels, and five stopping rules that performed well
in a previous investigation are modified for this purpose. The rules are assessed by their
performance in the analysis of simulated data sets which contain nested cluster structure.
1. Introduction
The aim of cluster analysis is to provide informative summaries of multivariate data
sets, and in particular to investigate whether or not a set of n (say) objects can
validly be described in terms of a smaller number of clusters of objects that have the
property that objects in the same cluster are similar to one another and different from
objects in other clusters. Clustering procedures provide little guidance on ways of
addressing the problem of determining relevant values for the number of clusters, c
(say), present in a data set. This has long been recognized as a challenging problem;
overviews of the topic have been presented by Jain and Dubes (1988, Chapter 4),
Bock (1996) and Gordon (1996).
This paper addresses the problem of determining which values of c are most strongly
indicated as providing informative representations of the data. Published work to
date has concentrated on identifying the single most appropriate value for c, and
the test procedures and rules that have been proposed for addressing this problem
are usually collectively referred to as 'stopping rules', since investigators have often
obtained a complete hierarchical classification using an agglomerative algorithm and
wish to have guidance on when amalgamation should cease. Many different stopping
rules have been proposed in the research literature, often with only cursory examina-
tion of their performance. The most detailed comparative study of which the author
is aware was carried out by Milligan and Cooper (1985). These authors assessed the
ability of thirty stopping rules to predict the correct number of clusters in a collection
of different randomly-generated data sets after these had been analysed using four
standard clustering criteria implemented in an agglomerative algorithm (single link,
group average link, Ward's sum-of-squares criterion, and complete link). Some of
the proposed stopping rules performed very poorly, and cannot be recommended for
further use.
Specifying a single 'best' value for c will on occasion provide a misleading represen-
tation of the cluster structure present in data. The aim of the current study is to
assess the ability of modifications of five stopping rules (those whose performance
was best in Milligan and Cooper's (1985) study) to detect when several different,
widely-separated values of c would be appropriate, that is, when structure is present
in the data at several different levels. For example, it might be valid to summarize a
data set in terms of two different, nested partitions: one into three clusters, and the
2. Stopping Rules
Stopping rules can be categorized as global or local. Global rules are based on the
complete data set, typically seeking the optimal value of some index that compares
within-cluster and between-cluster variability. The partitions into clusters for two
different values of c thus need not be hierarchically nested, although in practice they
usually will be. There is often no natural definition of the within/between vari-
ability corresponding to the case c = 1, and such indices possess the unsatisfactory
feature of being unable to indicate that the data comprise just a single cluster.
Local rules involve an assessment of whether or not a single cluster should be sub-
divided into two sub-clusters (or a pair of clusters should be amalgamated). They are
thus restricted to the assessment of hierarchically-nested sets of partitions, and are
based on a subset of the data; this latter property means that the effective sample
size for the test is usually much smaller than the size of the data set.
The five rules investigated in this study are defined below, in the order in which they
were ranked in Milligan and Cooper's (1985) investigation.
1. CH. An index proposed by Caliński and Harabasz (1974), for assessing a partition
into c clusters of a set of n objects described by numeric variables, is defined by
CH = [B/(c − 1)] / [W/(n − c)],
where B and W denote the between-cluster and within-cluster sums of squares of the partition.
2. DH. A local rule based on a test proposed by Duda and Hart (1973), comparing the
within-cluster sum of squared errors W1 of a cluster with the sum W2 obtained when
the cluster is divided into two; here p denotes the dimensionality of the data, m denotes
the number of objects in the cluster being investigated, and z is a standard normal
deviate specifying the significance level of the test. Amalgamation has generally
proceeded until the hypothesis can first be rejected.
3. C. This index is based on the sum of all within-cluster pairwise dissimilarities
(D). If the partition has r such dissimilarities, Dmin (resp., Dmax) is defined as the
sum of the r smallest (resp., largest) pairwise dissimilarities, and
C = (D − Dmin)/(Dmax − Dmin).
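A direct transcription of the C index (dissim is a symmetric dissimilarity matrix, labels the cluster memberships; names are illustrative):

```python
def c_index(dissim, labels):
    """C = (D - Dmin) / (Dmax - Dmin), with D the sum of the r
    within-cluster pairwise dissimilarities."""
    n = len(labels)
    within, every = [], []
    for i in range(n):
        for j in range(i + 1, n):
            every.append(dissim[i][j])
            if labels[i] == labels[j]:
                within.append(dissim[i][j])
    r = len(within)
    every.sort()
    d, dmin, dmax = sum(within), sum(every[:r]), sum(every[-r:])
    return (d - dmin) / (dmax - dmin)
```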
4. γ. This index, proposed by Goodman and Kruskal (1954), has been widely used
for assessing cluster output (e.g., Hubert, 1974). In the present instance, comparisons
are made between all within-cluster pairwise dissimilarities (d_ij, say) and all between-
cluster pairwise dissimilarities (d_kl, say): a comparison is deemed concordant (resp.,
discordant) if d_ij is strictly less (resp., greater) than d_kl. The index is defined by
γ = (S₊ − S₋)/(S₊ + S₋),
where S₊ (resp., S₋) denotes the number of concordant (resp., discordant) compar-
isons.
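And the Goodman-Kruskal index, computed by brute force over all within/between comparisons (quadratic in the number of pairs, fine for small n):

```python
def goodman_kruskal_gamma(dissim, labels):
    """gamma = (S+ - S-) / (S+ + S-), comparing every within-cluster
    dissimilarity with every between-cluster one."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    within = [dissim[i][j] for i, j in pairs if labels[i] == labels[j]]
    between = [dissim[i][j] for i, j in pairs if labels[i] != labels[j]]
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)
```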
5. Beale. A test proposed by Beale (1969) has been used as a local stopping rule
for assessing whether or not a single cluster should be sub-divided. The test involves
comparing
[(W1 − W2)/W2] / [((m − 1)/(m − 2))·2^{2/p} − 1]
(where W1, W2, m and p are defined in the DH test above) with an F_{p,(m−2)p} distri-
bution. As for the DH test, amalgamation proceeds until the hypothesis can first be
rejected.
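The Beale statistic itself is a one-liner; under the null hypothesis it is referred to an F distribution with p and (m − 2)p degrees of freedom:

```python
def beale_f(w1, w2, m, p):
    """Beale's statistic for deciding whether a cluster of m points in
    p dimensions should be split; compare with F(p, (m - 2) * p)."""
    return ((w1 - w2) / w2) / (((m - 1) / (m - 2)) * 2 ** (2 / p) - 1)
```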
Three of these five rules are global, with the single most appropriate value for c being
indicated by the maximum value of CH or γ, or the minimum value of C (with the
restriction that not all values of c are investigated, as some indices can display dis-
tracting patterns for values of c close to n (Milligan and Cooper (1985)). Such global
rules can readily be extended for use in the detection of nested cluster structure, by
recording all local optima of the index. If the correct solution comprises partitions
into c1, c2, ..., ck clusters, an ideal index would have local optima at all of, and only,
these values of c. One might hope that the values taken by the index at these local
optima were also the k most extreme values, but it is possible that this may not occur
because of the inter-relatedness of structure at neighbouring values of c: thus, the
value of the index may be more extreme at (ci + 1) clusters than at cj clusters.
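Recording the local optima of a global index over c = 2, ..., c_max can be done with a simple scan; the sketch below (our own naming; an endpoint counts as an optimum when it exceeds its single neighbour) returns the candidate values of c.

def local_optima(index_values, maximize=True):
    # index_values[t] is the index computed for c = t + 2 clusters
    v = [x if maximize else -x for x in index_values]
    cs = []
    for t in range(len(v)):
        left = v[t - 1] if t > 0 else float('-inf')
        right = v[t + 1] if t + 1 < len(v) else float('-inf')
        if v[t] > left and v[t] > right:
            cs.append(t + 2)
    return cs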
The local stopping rules proposed by Beale (1969) and Duda and Hart (1973) re-
quire the specification of a significance level α or threshold value z, and Milligan and
Cooper (1985) selected values that ensured the best possible performance of these
rules. Relevant threshold values depend on characteristics of the data sets under
investigation, such as the values of n and p: for example, in the use of the DH
stopping rule on their data, Milligan and Cooper (1985) chose z to be 3.2, whereas a
similar examination for the data in the current study would specify z to be 4.0. The
need to specify a threshold, whose most appropriate value varies in this way, is an
unsatisfactory feature of a stopping rule.
The two local rules have been modified for use in the detection of nested cluster struc-
ture by abandoning their more formal hypothesis-testing aspects and just identifying
large values of the corresponding z or F statistic. The critical values of the F_{p,(m−2)p}
distribution used in Beale's (1969) test depend on p and m, but for the values of p
used in the current study, the variation is not large for even moderately small values
of m. For neighbouring values of c, the z and F statistics will usually be evaluated
on disjoint subsets of the data, and it could be argued that the relevant values of c
are indicated by the k largest values of the statistics. However, it was found in this
study that this approach generally provided inferior results to those indicated by the
local maxima of the z and F statistics (in which neighbouring values of c cannot be
indicated), and the results which are presented here are based on this latter strategy.
The four outer entries record the numbers of times that these numbers of clusters were
indicated by the rule when applied to the dendrograms provided by, in clockwise
order from top left: the single link (SL), group average link (AL), complete link (CL)
and sum-of-squares (SSQ) criteria; the central number is the sum (Σ) of the four
outer ones. When the two optimal values were equal, each of the relevant entries in
a table was augmented by 0.5.
Table 1. Performance of the CH rule: the figures show the frequencies with which
various values of c were indicated as the numbers of clusters, for 108 data sets
analysed by four clustering criteria.
Tables 1-3 summarize the results for the three global rules; correct detections appear
in one of the two cells (12, c2) or (c2, 12). It should be noted that only 107 results are
reported for the sum-of-squares criterion in Table 1: in one of the data sets for which
c2 = 2, the only local optimum occurred at c = 2 (i.e., no low-level clusters were
indicated). Further, these accumulated results hide the fact that when c2 = 2, the
CH index never achieved its maximum value (but usually achieved its second-largest
local maximum value) at 12 clusters; as c2 increased, so did the frequency with which
c = 12 was specified as the optimal solution.
The results for the CH rule are highly encouraging: the two largest local maxima
have indicated the correct values of c1 and c2 in more than 90% of the cases for
three of the clustering criteria; further, the numbers of times for which these are the
only local maxima are 79 (group average link), 99 (sum-of-squares), and 98 (complete
link), and no more than three local maxima are ever indicated for clusters provided
by the latter two clustering criteria.
The results for the C and γ rules are summarized in Tables 2 and 3 respectively. For
these indices, there is less variation in the results for different values of c2. When
one data set comprising 3 and 12 clusters was analysed using single link and group
average link, both of the indices obtained their optimal value when c = 3, 12 and 13:
these results were interpreted as a successful outcome, and the relevant numbers in
the (c2, 12) and (12, c2) cells were each augmented by 0.5.
sum-of-squares and complete link clustering criteria; this proportion rises to nearly
three-quarters when near misses (specifying c1 = 11 or 13, instead of 12) are included.
However, there is an increase in the mean number of local optima of the rules, and
values of both c1 and c2 are incorrectly identified about 10% of the time.
The results for the two local rules are summarized in Tables 4 and 5. It can be seen
that the modification of Duda and Hart's (1973) rule has proved very effective in
identifying the smaller number of clusters, but poor at detecting the 12 clusters that
are present. The modification of Beale's (1969) rule has performed very poorly. Both
of these rules have a tendency to provide a large number of local maxima. They
would appear to have little to offer in this kind of investigation.
By contrast, the three global rules, and in particular the CH rule, would seem to have
considerable potential for the detection of nested cluster structure. However, enthu-
siasm should be tempered by two observations. First, as in the Milligan and Cooper
(1985) study, the cluster structure present in the simulated data sets was reasonably
clear-cut. Secondly, in both the current investigation and that conducted by Milligan
and Cooper (1985), the simulated clusters were generated using multivariate nor-
mal distributions (mildly truncated, in Milligan and Cooper's (1985) study). Several
authors (e.g., Scott and Symons, 1971) have noted reasons why the sum-of-squares
criterion is particularly relevant for analysing such data, and one can speculate that
a rule based on total within- and between-cluster sums of squares like the CH index
might be better able to detect such clusters than clusters of other shapes. Never-
theless, further support has been provided for the rules that performed well in this
study, and - until presented with evidence to the contrary - one can recommend
their collective use to applied scientists seeking to understand the underlying cluster
structure in their data.
References
Beale, E. M. L. (1969): Euclidean cluster analysis. Bulletin of the International Statistical
Institute, 43(2), 92-94.
Bock, H. H. (1996): Probability models and hypotheses testing in partitioning cluster anal-
ysis. In Clustering and Classification, Arabie, P., Hubert, L. J. and De Soete, G. (eds.),
377-453, World Scientific, River Edge, NJ.
Calinski, T. and Harabasz, J. (1974): A dendrite method for cluster analysis. Communi-
cations in Statistics, 3, 1-27.
Cooper, M. C. and Milligan, G. W. (1988): The effect of measurement error on determining
the number of clusters in cluster analysis. In Data, Expert Knowledge and Decisions, Gaul,
W. and Schader, M. (eds.), 319-328, Springer-Verlag, Berlin.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis. Wiley,
New York.
Goodman, L. A. and Kruskal, W. H. (1954): Measures of association for cross-classifications.
Journal of the American Statistical Association, 49, 732-764.
Gordon, A. D. (1996): Cluster validation. Paper presented at IFCS-96 Conference, Kobe,
27-30 March, 1996.
Hubert, L. (1974): Approximate evaluation techniques for the single-link and complete-link
hierarchical clustering procedures. Journal of the American Statistical Association, 69,
698-704.
Jain, A. K. and Dubes, R. C. (1988): Algorithms for Clustering Data. Prentice-Hall, En-
glewood Cliffs, NJ.
Milligan, G. W. and Cooper, M. C. (1985): An examination of procedures for determining
the number of clusters in a data set. Psychometrika, 50, 159-179.
Scott, A. J. and Symons, M. J. (1971): Clustering methods based on likelihood ratio crite-
ria. Biometrics, 27, 387-397.
Partitional Cluster Analysis with Genetic
Algorithms:
Searching for the Number of Clusters¹
J. A. Lozano, P. Larrañaga and M. Graña
Dept. of Computer Science and Artificial Intelligence
University of the Basque Country
P.O. Box 649, 20080 San Sebastian, Spain
e-mail: [email protected]
tel.: (+3443) 218000, fax.: (+3443) 219306
Summary: In this article we deal with the problem of searching for the number of clusters
in partitional clustering in ℝ². We set up the problem as an optimization problem by giving
a real function on the different partitions that is optimized when the number of clusters
and the classes are the most natural. We use a Genetic Algorithm to optimize this
function. The algorithm has been applied to the well-known Ruspini data and to syntheti-
cally generated datasets, with different cluster numbers and underlying distributions. The
results are encouraging.
1. Introduction
Cluster Analysis (Hartigan (1975); Everitt (1974); Jain and Dubes (1988)) is an im-
portant technique in the field of exploratory data analysis. It is a tool for grouping a
set of objects into classes such that 'similar' ones are in the same class and 'different'
ones in different classes. Cluster analysis explores the data known about the objects
to be classified and tries to uncover the underlying structure without requiring the
assumptions common to most classical statistical methods. Two main types
of clustering methods exist: hierarchical methods, which result in a nested sequence
of partitions, and partitional methods, which give one single partition.
Genetic Algorithms (G.A.'s) (Goldberg (1989)) are probabilistic search algorithms
which simulate natural evolution. They are based on the mechanics of natural selec-
tion and genetics. They combine 'survival of the fittest' among string structures with
a structured yet randomized information exchange. In G.A.'s the search space of a
problem is represented as a collection of individuals. The individuals are represented
by character strings, which are referred to as chromosomes. The purpose is to find
the individual from the search space with the best 'genetic material'. The quality of
an individual is measured with an objective function. The part of the search space to
be examined in each iteration is called the population. A G.A. works approximately
as follows. First, the initial population is chosen at random, and the quality of each
of its individuals is determined. Next, in every iteration parents are selected from
the population. These parents produce children, which are added to the population.
Each newly created individual of the resulting population 'mutates' with a probability
near zero, i.e. it changes its hereditary distinctions. The pop-
ulation is reduced to its initial size by removing some individuals from it according to
some selection criterion. One iteration of the algorithm is referred to as a generation.
¹This work is supported by the Diputacion Foral de Gipuzkoa, under grant 95/1127 and by the
Basque Government, under grant PI 94/78.
Some attempts to solve the clustering problem with G.A.'s have already been made.
Krovi (1991) described the different aspects of designing a G.A. for cluster analysis
and explained how a set of objects can be grouped into two clusters using binary
strings. Bhuyan et al. (1991) developed a G.A. for the partitioning of n objects into
k clusters, where 1 ≤ k ≤ n and k is given. They started by considering three differ-
ent representations for their individuals but finally decided on the so-called ordered
representation. Their preliminary experimental results reflected the superiority of
the genetic algorithms over the known heuristic methods. Cucchiara (1993) showed
the effectiveness of G.A.'s in clustering problems in image analysis. Jones and
Beltramo (1993) used integer encoding with the application of an operator used in the
travelling salesman problem, while Bezdek et al. (1994) used three different distances.
Babu and Murty (1994) did not tackle the clustering problem with G.A.'s but with
evolution strategies, another type of algorithm based on the principles of natural
selection. In all of the research mentioned above the number of clusters into
which to group the objects is supposed to be given.
Yet, some research has been carried out on the problem of the optimal number of
clusters, using methods based on entropy-based statistical complexity criteria (Celeux
and Soromenho (1993); Bozdogan (1994)) and statistical tests (Hardy (1994); Rasson
and Kubushishi (1994); Gordon (1995)).
Our research on the use of G.A.'s for cluster analysis focuses upon the search for the
optimal number of clusters. We want to develop an algorithm that automatically
classifies the objects into an adequate number of clusters without this number being
specified. The main problem in the development of such an algorithm is the definition
of an evaluation function that makes it possible to compare the fitness of clusterings
consisting of distinct numbers of clusters. Other difficult steps are the selection of
a suitable clustering representation and the development of the operators that define
the mutation and offspring production processes. An ongoing study on this topic using
G.A.'s can be seen in Luchian et al. (1994). We have carried out experiments with
five artificial data sets and with the well-known Ruspini data sets, in order to test
our clustering method.
We set up the problem as the optimization of a real function on the different partitions,

F : ⋃_{k=1}^{n} P_k(X) → ℝ,    (1)

such that the global optimum of the function F will be found in the number of clusters
k and in the groups that are the most natural. It is important to note that the size of
the search space ⋃_{k=1}^{n} P_k(X), where P_k(X) denotes the set of partitions of X into
k clusters, can be expressed by the following expression (Bhuyan et al. (1991)):

Σ_{k=1}^{n} (1/k!) Σ_{j=1}^{k} (−1)^{k−j} C(k, j) j^n.    (2)
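Expression (2) is the sum over k of the Stirling numbers of the second kind S(n, k); a minimal Python sketch computing it exactly (the function name is ours):

from math import comb, factorial

def search_space_size(n):
    # Sum over k of S(n, k), with S(n, k) computed by the
    # alternating-sum formula in expression (2).
    total = 0
    for k in range(1, n + 1):
        s = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(1, k + 1))
        total += s // factorial(k)           # the inner sum is divisible by k!
    return total

# search_space_size(4) == 15, the Bell number B_4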
In order to define this function we need to think about the characteristics that define
a natural cluster. An important characteristic is that there are no big empty
spaces inside a cluster (assuming that the cluster is not a ring), so we divide the
space that is occupied by the objects into small squares, all of these squares having
the same area. We check whether each square is empty or not. Each empty square
is assigned a value of one, and a value of zero is assigned to each non-empty square.
With this grid it is possible to assign a real value to every partition of the set
X into each number of clusters. Given a partition {X1, X2, ..., Xk}, the value given to
it is the sum of the values given to each cluster (the algorithm thus has the possibility
of being parallelized). For a cluster we calculate its convex hull and then we
sum the values of the squares whose centres are inside the convex hull. If we denote by
H(Xi) the convex hull of the cluster Xi, by V(x, y) the value assigned to a square
with centre (x, y), and by C the set of centres of squares, an initial approximation to the
function can be written as follows:

F*({X1, X2, ..., Xk}) = Σ_{i=1}^{k} Σ_{(x,y) ∈ C ∩ H(Xi)} V(x, y).    (3)
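A sketch of equation (3) in Python, assuming scipy is available; Delaunay.find_simplex is used for the point-in-convex-hull test, and clusters with fewer than three points are given the value zero (a simplification of ours for degenerate hulls):

import numpy as np
from scipy.spatial import Delaunay

def cluster_value(points, centres, values):
    # Sum of the values of the squares whose centres lie in the
    # convex hull of one cluster (one term of the sum in equation (3)).
    if len(points) < 3:
        return 0
    inside = Delaunay(points).find_simplex(centres) >= 0
    return int(values[inside].sum())

Here centres is the array of square centres and values holds one for an empty square and zero otherwise.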
However this function is not capable of distinguishing between the optimal partition,
which gives zero to the former function, and the partition that can be constructed
when splitting one of the clusters of the previous partition into two. Because of this,
we need to add to the preceding function a value α × k, where α denotes a positive
real number and k specifies the number of clusters. The final function is:

F({X1, X2, ..., Xk}) = F*({X1, X2, ..., Xk}) + α × k.    (4)

At first sight the value of α does not play an important role, and the only constraint
is α < 1. The reason is that for α ≥ 1 our function could assign a lower value to
a partition that has an empty square inside than to a partition with one more cluster
and without empty squares inside, which would be the natural partition. Later
we will see that the value of α can be important in special cases.
Finally, there is another question left to answer: what is the size of the
squares? This is the key question in our approach. To calculate it we have used a
simple approximation that works, as we will see later, well enough. As we do not have
any information about the points, we are going to assume that the points have been
generated at random, following a uniform distribution. Then if the natural structure
is just one cluster and we want to discover it, we must not find an empty square in
the convex hull formed by all the points. This is because it would otherwise be possible to split
the data into two or more clusters in such a way that the empty square would not be in any cluster
and the optimum value of the objective function could be found in the partition into
two or three clusters. Figure 1 shows that the square marked with an arrow has a
value of 1, so the value of the objective function with one cluster is 1 + α × 1 while the
value of the function with two clusters is 0 + α × 2. Hence we are going to choose
the size of the square such that the probability of finding such a square will be quite
small, in our case 0.001, i.e.

(1 − r²/S)^n ≤ 0.001,    (5)

where r is the size of the square, S is the area of the convex hull and n is the number
of points. We have taken in each experiment the smallest r that complies with the
constraint.
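Under our reading of constraint (5) given above, the smallest admissible r is obtained at equality; a one-line Python sketch (our naming):

from math import sqrt

def square_size(S, n, p=0.001):
    # Smallest r with (1 - r**2 / S)**n <= p: the probability that a
    # given square of area r**2 contains none of the n uniform points.
    return sqrt(S * (1.0 - p ** (1.0 / n)))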
Fig. 1: The grid of squares over the data points; the square marked with an arrow is empty.
begin SSGA
Create initial population at random
WHILE NOT stop DO
BEGIN
Select two parents from the population
Let the selected parents produce a child
Mutate the child with a certain probability
Extend the population by assigning the child to it
Reduce the extended population to the original size
END
Output the optimum of the population
end SSGA
The first k integers of the permutation are taken as members and centres of a cluster,
and the remaining numbers are added in order to the cluster whose centre is nearest to
the represented point. Once a point is added to a cluster, the centre of this cluster
changes to the centre of gravity of the points in the cluster. With this decoding,
every permutation of the n numbers represents a partition of the n points for
every value of the number of clusters k.
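The decoding just described can be sketched in Python as follows (names are ours; points is an (n, 2) array and perm a permutation of 0, ..., n−1):

import numpy as np

def decode(perm, points, k):
    # The first k entries of the permutation seed the clusters; every
    # further point joins the cluster with the nearest centre, and that
    # centre is moved to the centroid of the enlarged cluster.
    clusters = [[perm[i]] for i in range(k)]
    centres = [points[perm[i]].astype(float) for i in range(k)]
    for idx in perm[k:]:
        j = int(np.argmin([np.linalg.norm(points[idx] - c) for c in centres]))
        clusters[j].append(idx)
        centres[j] = points[clusters[j]].mean(axis=0)
    return clusters

The fitness of a permutation is then the smallest objective value over all k, as explained next.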
Of course we now have another problem, namely which value to assign to a permuta-
tion of the numbers of objects. The former function assigns a value to each partition
of n objects into k clusters; however, every permutation represents a partition for every
value of k. We solve this problem by giving to each individual the smallest
value that the function takes over the different partitions for k = 1, 2, ..., n. Taking
this evaluation of each permutation into account, our algorithm can be seen
as a hybrid G.A. where a local optimizer is applied in each evaluation.
The second point to note is the kind of operators that can be applied to the individ-
uals (strings of integers) to reproduce them and obtain new individuals. Hence we
have studied the kind of operators that have been used in G.A.'s with permutation
representations. Most of the work on this sort of representation has been
directed to the design of genetic operators to solve the travelling salesman problem
with a path representation. We have some experience in applying these operators
to other research fields (Larrañaga et al. (1996)). The crossover operators that we
have used are: CX (cycle crossover), ER (edge-recombination crossover), OX1 (order
crossover) and PMX (partially mapped crossover). As mutation operators we have used:
SM (scramble mutation), SIM (simple inversion mutation), ISM (insertion mutation),
IVM (inversion mutation), EM (exchange mutation), DM (displacement mutation).
Finally, there remain the parameters of the G.A., i.e., the size of the population, the prob-
ability of mutation, the probability of crossover, and the stopping criteria. These will be
discussed in the next section.
Fig. 2: The Datasets
Dataset 2. It has proved a difficult problem for our approach, but there is a way of
solving it: choose a parameter r (size of the square) in such a way that only an
empty square would be inside the ring, and a parameter α bigger than 0.5.
Of course, it is not a natural approach.
Dataset 3. The result is not very good, but this is a problem of the small parameter
values used in the algorithm. In this problem the objective function attains the
optimum in three clusters, but the way in which we decode the permutation
makes it difficult for the algorithm to find the optimum. However, we have
carried out other experiments with this dataset where the optimum was reached.
If we take into account the operators, the PMX operator continues to be the
fastest (96.15 function evaluations) and the slowest is ER (154.7).
Dataset 4. This is the first dataset for which we find some difference between the
operators with respect to the objective function. The operator CX is the best,
i.e., it finds the correct number of clusters more times than the other operators.
In relation to speed, PMX again is the fastest.
Dataset 5. The results are not very good because of our way of choosing r (equation (5)). Some
more experiments with a small change in the parameter r allow our algorithm
to reach the correct number of clusters in nearly every execution. Again the
best operator in relation to the objective function is CX, and in relation to
the number of function evaluations PMX (119.7) is the best and the worst
is OX1 (227.1).
5. Conclusion and Future Work
We have given an algorithm that searches for the number of clusters in partitional
cluster analysis and at the same time finds the most natural classes. Moreover, our
algorithm is very flexible in the sense that it could be used to find only the classes
given the number of clusters, or to find the optimum number of clusters between
given possible values. The results are encouraging, but it is important to note the
dependence of our algorithm on the parameter r. Some more experiments changing
this parameter seem to be a good way to continue with this work. Another
obvious step in our future work is to generalize our approach to objects in ℝ^d.
This raises a problem: while the size of the search space does not change,
the evaluation function is more expensive, since the fundamental step, computing the convex
hull, is much more complicated in ℝ^d.
In addition we plan to apply G.A.'s to hierarchical clustering and to the more modern
pyramidal clustering.
Acknowledgements
Thanks to Prof. A. Hardy for providing software and references.
References:
Babu, G.P. and Murty, M.N. (1994): Clustering with evolution strategies, Pattern Recognition, 27, 2, 321-329.
Bezdek, J.C. et al. (1994): Genetic Algorithm Guided Clustering, In: Proc. of The First IEEE Conference on Evolutionary Computation, 34-40.
Bozdogan, H. (1994): Choosing the number of clusters, subset selection of variables, and outlier detection in the standard mixture-model cluster analysis, In: Diday E., Lechevallier Y., Schader M., Bertrand P. and Burtschy B. (eds.), New Approaches in Classification and Data Analysis, Springer-Verlag, 169-177.
Bhuyan, J.N. et al. (1991): Genetic Algorithms for clustering with an ordered representation, In: Belew and Booker (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms, 408-415, Morgan Kaufmann.
Celeux, G. and Soromenho, G. (1993): An entropy criterion for assessing the number of clusters in a mixture model, Technical Report 1874, INRIA, France.
Cucchiara, R. (1993): Analysis and comparison of different genetic models for the clustering problem in image analysis, In: Albrecht R.F., Reeves C.R. and Steele N.C. (eds.), Artificial Neural Networks and Genetic Algorithms, Springer-Verlag, 423-427.
Everitt, B.S. (1974): Cluster Analysis, John Wiley & Sons, Inc.
Goldberg, D.E. (1989): Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley.
Gordon, A.D. (1995): Tests for assessing clusters, Statistics in Transition, 2, 207-217.
Hardy, A. (1994): An examination of procedures for determining the number of clusters in a data set, In: Diday E., Lechevallier Y., Schader M., Bertrand P. and Burtschy B. (eds.), New Approaches in Classification and Data Analysis, Springer-Verlag, 178-185.
Hartigan, J.A. (1975): Clustering Algorithms, John Wiley & Sons, New York.
Jain, A.K. and Dubes, R.C. (1988): Algorithms for Clustering Data, Prentice Hall.
Jones, D.R. and Beltramo, M.A. (1993): Solving partitioning problems with Genetic Algorithms, In: Albrecht R.F., Reeves C.R. and Steele N.C. (eds.), Artificial Neural Networks and Genetic Algorithms, Springer-Verlag, 423-427.
Krovi, R. (1991): Genetic Algorithms for clustering: A preliminary investigation, In: Proceedings of the Twenty-Fifth International Conference on System Sciences, 4, 540-544.
Larrañaga, P. et al. (1996): Learning Bayesian Network Structures by Searching for the Best Ordering with Genetic Algorithms, IEEE Transactions on Systems, Man and Cybernetics, 26, 4. In press.
Lozano, J.A. et al. (1995): Genetic Algorithms: Bridging the Convergence Gap, submitted to Evolutionary Computation.
Luchian, S. et al. (1994): Evolutionary automated classification, In: Proc. of The First IEEE Conference on Evolutionary Computation, 585-589.
Rasson, J.P. and Kubushishi, T. (1994): The gap test: an optimal method for determining the number of natural classes in cluster analysis, In: Diday E., Lechevallier Y., Schader M., Bertrand P. and Burtschy B. (eds.), New Approaches in Classification and Data Analysis, Springer-Verlag, 186-193.
Whitley, D. and Kauth, J. (1988): Genitor: A different Genetic Algorithm, In: Proceedings of the Rocky Mountain Conference on Artificial Intelligence, 2, 189-214.
Explanatory Variables in Classifications and
the Detection of the Optimum
Number of Clusters
Janos Podani
1. Introduction
An integral part of the interpretation of clustering results is to evaluate how the individual
variables explain the classes. Finding an order of importance of variables for an existing
classification is often called a posteriori feature selection (cf. Dale et al. 1986), as
opposed to a priori feature selection, when the variables are ranked before the analysis
starts (e.g., Orlóci 1973, Stephenson and Cook 1980), and to forward selection, in which
evaluation of variables is part of the algorithm (e.g., Jancey and Wells 1987, Fowlkes et
al. 1988).
Attention in a posteriori feature selection may be focused on two fundamental aspects of
classification: cluster cohesion and separation (sensu Gordon 1981). The analysis can be
restricted to either of these aspects (e.g., to contributions to within-cluster sum of squares
only). Alternatively, the effect of variables on the distinction between clusters as well as
on the internal "homogeneity" of clusters is simultaneously incorporated in the study,
even if the clustering method did not actually consider both. A simple possibility which
comes to mind first is to compute for each variable the ratio of within-group and
between-group sums of squares as an index of explanatory power.
It is emphasized, however, that there is no point in examining cluster cohesion and
separation in terms of sums of squares when, say, the starting matrix contained chord
distances or percentage dissimilarity values and the algorithm was single or complete
linkage sorting. In other words, the evaluation procedure has to be compatible with the
distance coefficient used in creating the classification. Since in many fields of science,
e.g., in biological taxonomy, relatively few classifications are based on sums of squares or
variance, and often the clustering models are not even Euclidean, a more generally
applicable, yet flexible, criterion is required. Godehardt's (1990) multigraph approach, in
which each variable is treated independently, seems to satisfy this requirement.
The third point emphasized here is that the importance of variables may be judged in two
ways. The more obvious one is the measurement of the absolute effect of each variable
upon the creation of clusters. For example, in the case of Euclidean distance and centroid
clustering, we can examine how far apart the cluster centroids are for each variable, and
then order the variables on this basis. This ordering will emphasize variables that
dominated the classification process, and may neglect others that are equally if not more
interesting for the a posteriori interpretation of clusters. In fact, any variable supporting
the given partition may prove useful in subsequent descriptions, no matter how small this
support is in absolute terms. Thus, an alternative procedure free from the implicit variable
weighting, i.e., measurement of the relative importance of the variable, may prove
useful. Lance and Williams (1977) are early proponents of this approach, suggesting
taking the ratio of between-cluster sum of squares to the total for each continuous
variable, or computing Cramer's index (see also Anderberg 1973) for each binary or
multistate variable. These criteria are thus only data-type-dependent and do not consider
the manner in which the dissimilarities were calculated. The procedure described in this
paper avoids implicit weighting by introducing an ordinal measure of the explanatory
power of variables in non-hierarchical classifications. This measure also satisfies the
requirement of being compatible with the dissimilarities used and relies equally on both
cluster separation and cohesion.
I will also provide an alternative approach to the familiar problem of detecting the
optimum number of clusters. The sum of the measures of explanatory power for all variables
will be defined as an overall measure of the agreement (in a sense: consensus) among
the variables regarding the partition of m objects into t clusters. Plotting the sum over a
reasonable range of t values provides a graphical means to find the optimum, if any. This
approach, as will be seen, is radically different from most of the methods reviewed and
compared by Milligan and Cooper (1985).
2. Variable contributions
The procedure starts with evaluating the contribution of each variable to the distances or
dissimilarities between objects. To ensure compatibility, the determination of this
contribution must be specific to the distance or dissimilarity function used. As an
example, the total contribution of variable i to all the z = m(m−1)/2 values in the lower
semimatrix of D² containing the squared Euclidean distances for m objects is computed
as

Φ_i = Σ_{j=1}^{m−1} Σ_{k=j+1}^{m} g_ijk,

where g_ijk = (x_ij − x_ik)² is the contribution of variable i to d²_jk and is written as an
element of matrix G_i. The contributions are strictly additive; the matrix of squared
distances is therefore reproduced as

D² = Σ_{i=1}^{n} G_i,

with n as the number of variables.
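For the squared Euclidean distance, the contribution matrices G_i and the totals Φ_i are straightforward to compute; a Python sketch (our naming, with X an (m, n) data matrix):

import numpy as np

def contributions(X):
    # G[i] holds g_ijk = (x_ij - x_ik)**2, so that D**2 = sum_i G[i].
    m, n = X.shape
    G = np.empty((n, m, m))
    for i in range(n):
        diff = X[:, i][:, None] - X[:, i][None, :]
        G[i] = diff ** 2
    return G

# Phi_i, the total over the lower semimatrix, is G.sum(axis=(1, 2)) / 2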
Formulae for computing contributions have been derived and are presented without proofs
for 15 other distance and dissimilarity measures (Table 1). The measures themselves are
not shown here, because most of them are well-known from the clustering literature. A
full list is found in Podani (1994), although the reader may also consult Anderberg (1973),
Sneath and Sokal (1973) and Orlóci (1978). Note that some indices known generally as
similarity functions are expressed as complements.

Table 1: Formulae for the contributions g_ijk under 16 distance and dissimilarity
coefficients, among them the Euclidean distance, the Canberra metric, the percentage
difference (1−Sorensen), 1−Ruzicka (1−Jaccard), 1−similarity ratio, 1−Rogers−Tanimoto,
1−Sokal−Sneath and 1−Kulczynski.
R(t)min = (q² + q)/2, where q = Σ_{s=1}^{t} C(m_s, 2)

and m_s denotes the number of objects in cluster s.
Let, further, R(i,t)obs be the observed sum of ranks of within-cluster contributions for
variable i. Clearly, R(t)min ≤ R(i,t)obs. Ties in the rank order can be resolved randomly,
which has negligible effects for large values of z. The sum of ranks for between-cluster
contributions will not be used, because it conveys no extra information.
Let R(t)exp denote the random expectation for the null situation, i.e., when the variable
makes no distinction as to whether a contribution is within- or between-clusters
(indifferent variable). In other words, contributions are arranged at random with the
expected sum of within-cluster ranks given by

R(t)exp = pz(z + 1)/2 = q(z + 1)/2,

where p = q/z is the probability that a randomly chosen value in the rank order is a within-
cluster contribution. The expectation will be used below as a reference basis for
constructing the formula.
Then, the explanatory power of the variable is defined as the complement of the
deviation of the actual sum of ranks from the minimum, divided by the deviation of the
expectation from the minimum:

r(i,t) = 1.0 − (R(i,t)obs − R(t)min) / (R(t)exp − R(t)min).

r(i,t) values close to 1.0 indicate high explanatory power, values around zero reflect
indifference, whereas negative values correspond to a situation when variable i is
contradictory with P_t. (I deliberately avoid using the term discriminatory power, because
it is usually associated with variance-related concepts as in discriminant analysis.) The
variables may be ordered based on their r scores, to facilitate interpretation of clusters
and to detect variables which happen to be indifferent or even contradictory with the
given partition.
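The computation of r(i,t) from the ranked contributions is direct; the Python sketch below (our naming) uses average ranks for ties instead of the random resolution mentioned above, which is immaterial for large z.

import numpy as np
from scipy.stats import rankdata

def explanatory_power(G_i, labels):
    # r(i,t) = 1 - (R_obs - R_min) / (R_exp - R_min) for one variable,
    # where G_i is its (m, m) contribution matrix.
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    contrib = G_i[iu]                         # the z = m(m-1)/2 contributions
    within = labels[iu[0]] == labels[iu[1]]
    z, q = len(contrib), int(within.sum())
    r_obs = rankdata(contrib)[within].sum()   # observed within-cluster rank sum
    r_min = q * (q + 1) / 2.0                 # the q smallest ranks
    r_exp = q * (z + 1) / 2.0                 # expectation under indifference
    return 1.0 - (r_obs - r_min) / (r_exp - r_min)

The coefficient of cluster separation introduced next is then simply the sum of these values over all variables.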
σ_t = Σ_{i=1}^{n} r(i,t),

which will be called the coefficient of cluster separation. The upper bound of this
coefficient is n, reached in the unanimous situation with all variables fully explaining the
partition. The σ_t coefficient is computed for each level of interest in the hierarchy and the
results are plotted against t. For data sets with group structure, the curve shows a peak,
allowing one to detect the number of clusters at which the majority of variables
support the same clustering in terms of their ranked within-cluster contribution scores.
For very many clusters, each with 1-2 objects only, the increase of σ_t is a necessity, but
such trivial clusters attract no interest anyway (these clusters are usually excluded from
such studies, cf. Milligan and Cooper 1985). Absence of clear-cut peaks is indicative of
either strong disagreements between variables as to the "optimum" value of t, or a
complete lack of group structure, so the r(i,t) values must be inspected.
Computer program SYN-TAX 5.02 (Podani 1994), designed for classification purposes,
includes an option for computing the explanatory power of variables and the coefficient
of cluster separation at several levels in a dendrogram and for plotting the graph
automatically (available for PCs and Macintosh computers).
5. Example
The method will be demonstrated by an actual example coming from community
ecology. A total of 80 vegetational plots (objects) represent a sample of dolomite
grasslands of Sas-hill, Budapest, Hungary (for more details, see Podani 1985). The plots
have been described in terms of percentage cover scores of 123 vascular plant species.
For the purpose of illustration, the matrix of Euclidean distances of objects was subjected
to complete linkage clustering (Fig. 1). The explanatory power of variables and the
coefficient of cluster separation were computed for the top ten cut levels in the
dendrogram, i.e., for t = 2 to 11. The plot of cluster separation against the number of
clusters (Fig. 2) indicates high agreement of variables for two and three clusters, with the
maximum at t = 3. When t is raised from 3 to 4, the coefficient drops by more than 50%,
and for more groups it remains about the same. The analysis thus suggests that the given
classification is best supported by the species at the 3-cluster level. It is therefore
worthwhile to examine the rank order of variables based on their explanatory power
values for t = 3 (Tab. 2). To save space, the table lists only the first ten and the last ten
species from the rank order. Those at the beginning of the list are the best indicators of
the difference between closed (the two smaller groups) and open grasslands (the large
group), whereas species with negative scores counter-support this classification because
they tend to differentiate the large group even further. These species have been widely
used as discriminatory species to subdivide the relatively open communities. The analysis
revealed, however, that the majority of species are contradictory with this, showing that
the classical syntaxonomic classification was subjectively based on a narrow subset of
species.
Fig. 1: Dendrogram of the 80 plots obtained by complete linkage clustering of the Euclidean distance matrix.
Fig. 2: Plot showing the relationship between the number of clusters and the coefficient
of cluster separation for the top ten partitions obtained from the dendrogram of Fig. 1.
Tab. 2: The first ten and the last ten species in the rank order of variables for the three-
cluster partition obtained from the dendrogram in Fig. 1.
6. Discussion
The measure of explanatory power proposed in this paper is a non-metric criterion
because actual differences are irrelevant: it is the rank order of contributions that matters.
Thus, even if a variable had negligible effects on the distances (because of lack of
commensurability, for example), it may turn out to be a good explanatory variable
afterwards. Also, ranking variables based on the r values reveals an aspect rarely
emphasized: the identification of variables that do not agree with the partition. Finding
these variables may lead to revisions of former classifications. The possibility is also
raised here that after the removal of these variables a repeated classification based on the
reduced set of variables may provide a more noise-free classification. This is certainly an
aspect which merits future investigation.
The coefficient of cluster separation, being based on ranked contributions, is considerably
different from the currently known indices of the optimum number of clusters as reviewed by
Milligan and Cooper (1985). In addition to the ranking technique, the most substantial
difference is that whereas the other methods are less dependent on the number of
variables (so that they can be best demonstrated with a two-dimensional example), the
present technique is more meaningful when there are quite a few variables. Therefore, it
may perform very poorly in a low-dimensional situation if compared to the other
methods, and evaluation of the method proposed here along the lines of Milligan and
Cooper's study would be irrelevant. The only exception seems to be the Ratkowsky and
Lance (1978) criterion, which involves computation of the ratio used by Lance and
Williams (1977) for each variable, and takes the average over variables. This is perhaps an
explanation for the fairly poor performance of this measure in the two-dimensional case
of Milligan and Cooper's study, though Ratkowsky and Lance reported high success,
usually with many dimensions. It is also noted here that whereas almost all methods
provide the same result after rigid rotation of the axes (rotation invariance), the
Ratkowsky and Lance criterion and the one suggested in this paper are exceptions.
As with other formulae for detecting the optimum number of clusters in hierarchical
classifications, the possibility of incorporating the measure directly as a clustering criterion
may be examined. The coefficient of cluster separation is computationally very
demanding, however. (The actual example presented in this paper took ten hours on a PC
486.) Building clusters based on a global nonmetric criterion similar to σ_t would provide a
clustering procedure completely compatible with the optimality measure.
Acknowledgements:
The author expresses his sincerest thanks for receiving an OMFB Travel Grant (No.
MEC 96-0176) and an OTKA Travel Grant (No. U21456) to participate at IFCS'96,
Kobe, where this contribution was presented. This study was funded by the OTKA
Hungarian National Research Grant No. T19364. I am grateful to A. D. Gordon
(University of St. Andrews, U.K.) for his comments on the manuscript, and to M. B. Dale
(CSIRO, Australia) and Sz. Bokros (ELTE, Budapest) for discussions.
References:
Anderberg, M. R. (1973): Cluster Analysis for Applications. Academic, New York.
Dale, M. B., Beatrice, M., Venanzoni, R. and Ferrari, C. (1986): A comparison of some methods
of selecting species in vegetation analysis. Coenoses, 1, 35-52.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988): Variable selection in clustering.
Journal of Classification, 5, 205-228.
Godehardt, E. (1990): Graphs as Structural Models: The Application of Graphs and Multigraphs
in Cluster Analysis (2nd ed.). Vieweg & Sohn, Braunschweig.
Gordon, A. D. (1981): Classification: Methods for the Exploratory Analysis of Multivariate
Data. Chapman and Hall, London.
Jancey, R. C. and Wells, T. C. (1987): Locality theory: the phenomenon and its significance.
Coenoses, 2, 31-37.
Lance, G. N. and Williams, W. T. (1977): Attribute contributions to a classification. Australian
Computer Journal, 9, 128-129.
Milligan, G. W. and Cooper, M. C. (1985): An examination of procedures for determining the
number of clusters in a data set. Psychometrika, 50, 159-179.
Orlóci, L. (1973): Ranking characters by a dispersion criterion. Nature, 244, 371-373.
Orlóci, L. (1978): Multivariate Analysis in Vegetation Research. Junk, The Hague.
Podani, J. (1985): Syntaxonomic congruence in a small-scale vegetation survey. Abstracta
Botanica, 9, 99-128.
Podani, J. (1994): Multivariate Data Analysis in Ecology and Systematics. SPB Publishing, The
Hague.
Ratkowsky, D. A. and Lance, G. N. (1978): A criterion for determining the number of groups in
a classification. Australian Computer Journal, 10, 115-117.
Sneath, P. H. A. and Sokal, R. R. (1973): Numerical Taxonomy. Freeman, San Francisco.
Stephenson, W. and Cook, S. D. (1980): Elimination of species before cluster analysis.
Australian Journal of Ecology, 5, 263-273.
Random dendrograms for classifiability testing
Bernard Van Cutsem, Bernard Ycart
Laboratoire de Modelisation et Calcul, Universite Joseph Fourier, Grenoble
B.P.53, F-38041 GRENOBLE Cedex 9,
Bernard.Van-Cutsem@imag.fr, Bernard.Ycart@imag.fr
1. Introduction
Mathematical structures such as partitions, trees or dendrograms are commonly used
to analyze and represent classification structures on n objects. Classical algorithms
construct these structures either from dissimilarities between pairs of objects or from
the values of a set of variables on the n objects.
We consider here only hierarchical classifications also called indexed dendrograms
(ID), as introduced by Hartigan (1967), Johnson (1967), Jardine, Jardine and Sibson
(1967). In practical situations, ID's are deduced from dissimilarities using ascend-
ing hierarchical algorithms. Descriptions of such algorithms can be found in classical
books on classification such as Sneath and Sokal (1973) or Jain and Dubes (1988). We
focus here on the Single Link Algorithm (SLA) which is among the most commonly
used. The ID's obtained using the SLA will be called single link indexed dendrograms
(SLID).
Even if the data do not present any cluster (they are homogeneous in some sense),
the SLA will produce an ID. In some instances, external information on the data
can be used to confirm the exhibited structure or not. If this is not possible, it is
nevertheless important to be able to decide whether this ID is significant or not. One method
is to consider a probability distribution on the set of data which corresponds to the
absence of classification structure (no clusters), and then derive the distribution of
the corresponding random SLID. Statistical tests for this null hypothesis of non-
classifiability of the data against a hypothesis of existence of a classification structure
can then be constructed. Bock (1996) is a thorough review of the problem of testing par-
titions as structures of classification.
In Van Cutsem and Ycart (1994), we made a first attempt in this direction by de-
scribing the finite set of stratified dendrograms on n objects. This set was endowed
with the equiprobability and the distributions of some characteristics were derived.
We considered mainly the number of levels of such dendrograms, and the sizes of par-
titions. Those results can be easily extended to binary and strictly binary stratified
dendrograms (see exact definitions below).
More realistic hypotheses bear on non-classifiable data. In Van Cutsem and Ycart
(1996a), we considered two other types of null hypotheses corresponding to non-clas-
sifiable data.
• Model 1. The first hypothesis concerns dissimilarities. It supposes that the
dissimilarities between objects are exchangeable random variables, and more
particularly i.i.d. random variables uniformly distributed on [0,1].
• Model 2. The second hypothesis concerns objects. It supposes that the n ob-
jects are i.i.d. points in a metric space distributed according either to a uniform
distribution on a convenient domain or to a unimodal distribution. Dissimilar-
ities between objects are then defined as their distances in the representation
space.
Model 1 has been considered long ago, for instance by Ling (1973). In Van Cut-
sem and Ycart (1996b), a probabilistic analysis of the main ascending hierarchical
algorithms was proposed. Different variables attached to indexed dendrograms were
introduced, including the sequence of levels, the survival time of an object (i.e. the
number of partitions in which this object is isolated) and the ultrametric distance
between two objects. The distributions of these random variables attached to ID's
produced by the SLA, (and also the average link and complete link algorithms) ap-
plied to random data corresponding to model 1 were studied, and exact as well as
asymptotic results were proposed.
Our goal in the present paper is twofold. Firstly we want to extend the results
obtained in Van Cutsem and Ycart (1996b) to a particular case of model 2, more
precisely to i.i.d. objects uniformly distributed in the interval [0,1]. The main re-
sults for model 1 are recalled in theorem 3.2, where explicit asymptotic distributions
for large sets of objects are given. The corresponding results for model 2 are given in
theorem 4.3. Interestingly enough, the asymptotic distribution turns out to be the
same (scaled Gumbel distribution) for the last index of the dendrogram. However, it
is quite different for other variables. For instance the size of the smallest subset in
the penultimate partition tends to 1 in probability for i.i.d. dissimilarities (model 1),
whereas for i.i.d. points on [0,1] it has a uniform distribution.
Our second aim is to demonstrate the applicability of our results to some problems
of classifiability testing. To do this, we introduce simple classifiability hypotheses to
be tested against models 1 and 2. They consist in assuming that the set of objects is
partitioned into two subsets, such that the distribution of dissimilarities inside each
set is stochastically smaller than that of dissimilarities between the two subsets. For
both models, we derive explicit tests based on the last index of the SLID and compute
the probability of detecting the given partition under the classifiability hypothesis.
The article is organized as follows. Section 2 recalls basic definitions. We summarize
in section 3 the results of Van Cutsem and Ycart (1996b) concerning model 1. Section
4 contains new results for i.i.d. points uniformly distributed on [0,1]. Applications
to classifiability testing are presented in section 5.
2. Basic definitions
A dissimilarity on a set S of n objects is as usual a function d : S² → ℝ₊ such
that d(a,a) = 0 and d(a,b) = d(b,a) for any pair of objects a and b in S. Moreover we
suppose here that d is definite, that is: (d(a,b) = 0) ⇒ (a = b).
An indexed dendrogram is a sequence {(P_ℓ, λ_ℓ)}_{0≤ℓ≤ℓmax} such that
1) {P_ℓ}_{0≤ℓ≤ℓmax} is a sequence of nested partitions of S such that P_0 is the partition
into singletons and P_ℓmax is the partition with only one element, {S}. The dendrogram
is defined by {P_ℓ}_{0≤ℓ≤ℓmax}.
2) {λ_ℓ}_{0≤ℓ≤ℓmax}, the sequence of indices of the dendrogram, is strictly increasing and
starts from λ_0 = 0.
A sequence {P_ℓ}_{0≤ℓ} of partitions is nested if any subset of a partition P_ℓ is a union
of subsets of P_{ℓ−1}. An ID is called strictly binary if any partition P_ℓ is obtained by
joining exactly two subsets of P_{ℓ−1}. For a strictly binary dendrogram, ℓmax = n − 1.
An ID is a stratified dendrogram (SD) if, for any ℓ ∈ {0, ..., ℓmax}, λ_ℓ = ℓ.
If A and B are two disjoint subsets of S, we define

d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}.

2) Determine the strictly increasing sequence of integers ℓ_n, by ℓ_{n_0} = 0 and, for any
ℓ ∈ [0, n−1],
2. The survival time of a singleton, which is the minimum level ℓ at which a given
singleton is not an isolated point of the partition P_ℓ. This survival time can
also be evaluated either by the discrete level ℓ_n
or by the continuous one λ_ℓ.
An object which is isolated at too high a level may be significantly different
from the others.
3. The ultrametric distance between two given objects, defined as the minimum
level ℓ at which both objects are in a same subset of the partition P_ℓ. This
ultrametric distance can be evaluated either by the discrete level ℓ_n or by the
continuous level λ_ℓ of the partition P_ℓ. Two objects which are separated at too high
a level may suggest the existence of two different clusters.
4. The size of subsets of partitions P_ℓ. Balanced or unbalanced subsets may pro-
vide indications on the presence or not of clusters.
In Van Cutsem and Ycart (1996b), we derived the exact and asymptotic distributions
of all these variables. Exact distributions involve some combinatorics on graphs and
use mainly the following numbers:

γ(n, m) = number of graphs with n vertices and m edges;
γ(n, m, k) = number of graphs with n vertices, m edges
and k connected components.

These numbers can be computed using recurrence relations, but the actual imple-
mentation is quickly limited by numerical explosion.
We summarize below, without proofs, the distributions of Λ_ℓ, Λ(a), Λ(a,b), M(a,b)
and T_{n−2}. Details of the proofs and other related results can be found in Van Cutsem
and Ycart (1996b).
Theorem 3.1 With the above hypotheses and notations,
• ∀ℓ ∈ {0, 1, ..., n−1}, ∀λ ∈ [0, 1],

Prob(Λ_ℓ ≤ λ) = Σ_{m=ℓ}^{a(n)} Σ_{k=1}^{n−ℓ} γ(n, m, k) λ^m (1 − λ)^{a(n)−m},

where a(n) = n(n−1)/2 denotes the number of pairs of objects.
• ∀λ ∈ [0, 1], ...
The size of the numbers involved makes the exact computation of the probability
distributions of theorem 3.1 impossible for n larger than a few tens. Fortunately, ex-
plicit asymptotic distributions are also available. These asymptotics are also derived
in Van Cutsem and Ycart (1996b). The proofs are based on the theory of random
graphs. The key observation is that {G_m} and {G(λ)} are random graph processes in
the sense of Bollobas (1985). In particular the discrete and the continuous families of
graphs are equivalent in some sense for n tending to infinity. More precisely, G_m and
G(λ) will have similar properties if λ ∼ m/a(n). As an illustration, we summarize
below the asymptotic distributions of some of our variables.
Theorem 3.2 With the above hypotheses and notations,
• ∀ℓ ≥ 1, ∀x ∈ ℝ₊,

lim_{n→∞} Prob((n²/2) Λ_ℓ ≤ x) = 1 − e^{−x}(1 + x + ... + x^{ℓ−1}/(ℓ−1)!).

• ∀x ∈ ℝ,

lim_{n→∞} Prob(nΛ_{n−1} − log(n) ≤ x) = e^{−e^{−x}}.

• ∀x ∈ ℝ₊,

lim_{n→∞} Prob(nΛ(a) ≤ x) = 1 − e^{−x}.

• ∀x ∈ ]1, +∞[,

lim_{n→∞} Prob(nΛ(a,b) ≤ x) = (1 − φ(x)/x)²,

where φ(x) is the only solution in ]0,1[ of the equation φe^{−φ} = xe^{−x}.
• lim_{n→∞} Prob(min(T_{n−2}(a), n − T_{n−2}(a)) = 1) = 1.
n-oo
where
fin = {(UI,U1, ...• Un ) : 0:::; UI:::; U2:::; ... :::; lIn:::; 1}
and where U.-t denotes the indicator function of t he set A.
2) the ranks R(·) = (R_1, R_2, ..., R_n) define a permutation of the first n integers which
is uniformly distributed on the n! permutations of these integers.
Let us define Δ_i = U_{(i+1)} − U_{(i)} to be the distance between two consecutive points in
the representation. The density of Δ = (Δ_1, Δ_2, ..., Δ_{n−1}) is easily computed.
Then, for all i ∈ {1, 2, ..., n−1}, Δ_i is distributed according to a Beta distribution
β(1, n). Moreover the variables Δ_1, Δ_2, ..., Δ_{n−1} are exchangeable.
We now associate a SLID to the dissimilarities D(a,b). Since almost surely there are
no ties in the dissimilarities, this ID is strictly binary. We shall consider separately the
sequence of levels Λ_0 = 0, Λ_1, ..., Λ_{n−1} and the stratified strictly binary dendrogram
denoted by BD (see Figure 1 for an example).
1. The sequence of levels of the RSLID is the ordered sequence Δ_(1), Δ_(2), ...,
Δ_(n−1) associated to the variables Δ_1, Δ_2, ..., Δ_{n−1}.
The proof is elementary. Result 1 is obvious because of the properties of the SLA.
To prove 2 we remark that, as the variables Δ_i are exchangeable, their ranks define a
permutation which is uniformly distributed on the set of all the (n−1)! permutations
of {1, 2, ..., n−1}. Fix first a permutation of the n objects, and independently a
permutation of the Δ_i's. Then the stratified dendrogram BD is determined. But 2^{n−1}
such choices will lead to the same binary dendrogram, since the order of the pair of
sets which are joined at each of the n−1 levels can be switched. Thus each binary
dendrogram is obtained with probability

2^{n−1} / (n!(n−1)!).

Result 3 is easy to check and uses once more the exchangeability of the variables Δ_i.
Remark. This decomposition of the RSLID into the product of two independent
random structures, given by the levels and the stratified binary dendrogram, can be
extended to a product of three independent random structures if we decompose a
stratified dendrogram into the product of labels and of an unlabelled strictly binary
dendrogram.
The consequences of this theorem are important. Firstly the distribution of levels
can be deduced directly from that of Δ_(·). Many results on survival times of objects
and on ultrametric distances between objects, expressed as ranks of partitions, and conse-
quently expressed as levels, can be directly obtained from combinatorics on the set
B_n of strictly binary dendrograms on n objects. Here are for instance the distribu-
tions of survival times and ultrametric distances expressed as ranks of partitions.
Theorem 4.2 Let BD denote a stratified strictly binary dendrogram uniformly dis-
tributed on the set B_n.
Then ∀ℓ ∈ {1, ..., n−1},

Prob(L(a) = ℓ) = (n − ℓ) / C(n, 2),

Prob(L(a,b) = ℓ) = (n + 1) / ((n − 1) C(n − ℓ + 2, 2)).

Also, ∀n_1 ∈ {1, 2, ..., n−1},

Prob(T_{n−2}(a) = n_1) = 1/(n − 1).
= Σ_{j=k}^{n−1} C(n−1, j) Prob(Δ_1 ≤ λ, ..., Δ_j ≤ λ, Δ_{j+1} > λ, ..., Δ_{n−1} > λ).
• ∀x ∈ ℝ,

lim_{n→∞} Prob(nΛ_{n−1} − log(n) ≤ x) = e^{−e^{−x}}.

• ∀x ∈ ℝ₊,

lim_{n→∞} Prob(nΛ(a) ≤ x) = 1 − e^{−2x}.
Thus the asymptotic distribution for the last index Λ_{n−1} is the same for i.i.d. dissim-
ilarities (theorem 3.2) and for i.i.d. points on [0,1]. The asymptotic distributions of
Λ(a) and Λ(a,b) are different, but with scalings of similar orders of magnitude. The
main difference comes from the penultimate partition, which is very unbalanced in the
case of i.i.d. dissimilarities and corresponds to a random cut in the case of i.i.d. points
on [0,1]. The proof of theorem 4.3 uses the well-known Poisson approximation, under
the following equivalent form. Consider n i.i.d. points, uniformly distributed on [0, n].
Let Δ_1, ..., Δ_k be a fixed number of distances between consecutive neighbors. As n
tends to infinity, they are asymptotically independent, exponentially distributed with
parameter 1. Recall that the infimum of n i.i.d. exponential r.v.'s is exponential with
parameter n. Their supremum is asymptotically Gumbel, with location parameter
log(n).
Let us now test H_0 against H_1, using the statistic Λ_{n−1}. We determine the critical
region [c_α, 1], at significance level α, by

Prob_0(Λ_{n−1} ≥ c_α) = α.
Since the asymptotic distribution of Λ_{n−1} is Gumbel, for n large enough, c_α is deter-
mined by

c_α = (log(n) − log(−log(1 − α))) / n.
Notice that, somewhat paradoxically, the hypothesis H₀ of non-classifiability may
be rejected by the test on the last level, whereas the penultimate partition does not
detect the correct structure.
The power of the test on the last level is π(μ), and it is clear that π(μ) = 1 if μ ≥ c_α.
As limₙ→∞ c_α = 0, we see that, for any fixed μ > 0,
$$\lim_{n\to\infty} \pi(\mu) = 1\,.$$
Consider now model 2. The null hypothesis H₀′ is: the n objects are i.i.d. points in
[0,1]. For the alternative H₁′ we choose: the objects in A are i.i.d. points uniformly
distributed in [0, μ₁], and the objects in B are i.i.d. points uniformly distributed in
[μ₂, 1], with
0 < μ₁ < μ₂ < 1,  μ₂ − μ₁ = μ > 0.
If the test statistic is the last level Λₙ₋₁, since its asymptotic distribution is the
same for model 1 as for model 2, we obtain exactly the same critical region as before.
However, the probability of detection P_μ has changed. Define again Λᴬₙ₁₋₁ and Λᴮₙ₂₋₁ as
the last levels of the sub-SLIDs on the sets A and B respectively. For n₁ and n₂
large enough, the asymptotic distributions of Λᴬₙ₁₋₁ and Λᴮₙ₂₋₁ are still Gumbel, but with
different scalings.
As an example, Table 3 gives some values of P_μ in the case n₁ = n₂ = 100, μ₁ =
(1 − μ)/2 and μ₂ = (1 + μ)/2 (the gap is centered).
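Detection probabilities of this kind are easy to approximate by Monte Carlo. A minimal sketch (our own; it reuses the Gumbel-based critical value derived above and the largest-spacing form of the last level):

```python
import numpy as np

rng = np.random.default_rng(1)

def detection_prob(n1, n2, mu, alpha=0.05, reps=2000):
    """Monte Carlo estimate of P_mu for the centred-gap model of Table 3:
    n1 points on [0, mu1], n2 points on [mu2, 1], mu2 - mu1 = mu."""
    n = n1 + n2
    mu1, mu2 = (1 - mu) / 2, (1 + mu) / 2
    # critical value from the Gumbel approximation of n*Lambda_{n-1} - log n
    c_alpha = (np.log(n) - np.log(np.log(1 / (1 - alpha)))) / n
    hits = 0
    for _ in range(reps):
        pts = np.sort(np.concatenate([rng.uniform(0, mu1, n1),
                                      rng.uniform(mu2, 1, n2)]))
        hits += np.diff(pts).max() >= c_alpha   # reject on the last level
    return hits / reps

for mu in (0.02, 0.05, 0.10):
    print(mu, detection_prob(100, 100, mu))
```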
The Lp-product of ultrametric spaces and the
corresponding product of hierarchies
Bernard Fichet
Laboratoire de Biomathematiques
Universite d'Aix-Marseille II
27 Boulevard Jean Moulin
13385 Marseille, France.
1 Introduction
This paper is devoted to the Lp-product of indexed hierarchies and its relationship
with the Lp-product of ultrametric spaces. In this approach, quasi-hierarchies will
appear as a fundamental structure. Recall that the main axiom of a quasi-hierarchy,
axiom iii) below, has been investigated by Batbedat (1989) and Bandelt and Dress
(1989) in defining weak hierarchies. This axiom stipulates that the intersection of
three clusters always is the intersection of two clusters among them. Then an indexed
quasi-hierarchy is defined by adding some usual axioms. Quasi-hierarchical classification
may be regarded as a unifying way for two extensions of hierarchical classification.
Indeed, indexed quasi-hierarchies extend indexed (but not weakly-indexed) pseudo-hierarchies,
also called "pyramids", and additive trees. For references concerning the
three previous concepts, see Durand and Fichet (1988), Bertrand and Diday (1991)
and Buneman (1974).
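The triple-intersection axiom is easy to verify mechanically on a finite family of clusters. A minimal sketch (our own function and example data):

```python
from itertools import combinations

def satisfies_weak_axiom(clusters):
    """Axiom iii): for any three clusters, the triple intersection
    equals one of the three pairwise intersections."""
    cs = [frozenset(c) for c in clusters]
    for a, b, c in combinations(cs, 3):
        if a & b & c not in (a & b, a & c, b & c):
            return False
    return True

# A hierarchy trivially satisfies the axiom ...
print(satisfies_weak_axiom([{1}, {2}, {3}, {1, 2}, {1, 2, 3}]))  # True
# ... and so does a genuine weak hierarchy with overlapping clusters.
print(satisfies_weak_axiom([{1, 2}, {2, 3}, {1, 2, 3}]))         # True
```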
A bijection has been established between indexed quasi-hierarchies and particular
dissimilarities, called quasi-ultrametrics. See Diatta and Fichet (1994) or Bandelt
(1992) via a four-point characterization.
In this paper we show that the Cartesian product of two hierarchies is a quasi-hierarchy.
Moreover, from two indexed hierarchies a level index of Lp-type (p < ∞)
is produced in connection with the Lp-product of the corresponding two ultrametric
spaces. Finally, the supremum product of r ultrametric spaces is ultrametric and the
associated indexed hierarchy is characterized as a subclass of the Cartesian product
of the corresponding r hierarchies.
The reader will find an analogy with the primary approach of Benzecri and Escofier
defining correspondence analysis, see Escofier (1969). Introducing the χ²-metric on
the row-set and the column-set of categories, they produce a global scattering which
is nothing but the L₂-product of the two metric sets.
Let us note that the results given here have been presented by the author at the 19th
Annual Conference of the Gesellschaft für Klassifikation e.V., held in Basel, March
1995, and at the I.F.C.S. 5th Conference held in Kobe, March 1996.
Proof. It is clear that axioms i), ii) (or iii)) and iv) of a quasi-hierarchy are
fulfilled. In particular, H = H₁ × H₂ is minimal in ℋ iff both H₁ and H₂ are minimal
in ℋ₁ and ℋ₂.
Now, let H = H₁ × H₂, H′ = H′₁ × H′₂, H″ = H″₁ × H″₂ be three elements of ℋ.
Then H ∩ H′ ∩ H″ = (H₁ ∩ H′₁ ∩ H″₁) × (H₂ ∩ H′₂ ∩ H″₂).
First suppose that H₁ ∩ H′₁ ∩ H″₁ = ∅. Since ℋ₁ is a hierarchy, two clusters, say H₁
and H′₁, are necessarily disjoint. Then H ∩ H′ = ∅ = H ∩ H′ ∩ H″.
Let us observe that the quasi-hierarchy ℋ obtained in the previous proposition is
very particular. For example, the following property is noteworthy. Every cluster
H = H₁ × H₂ of ℋ, with H₁ ≠ I₁ and H₂ ≠ I₂, has exactly two predecessors,
specifically H₁ × H′₂ and H′₁ × H₂, where H′₁ and H′₂ stand for the (unique) predecessors
of H₁ in ℋ₁ and H₂ in ℋ₂.
Corollary 1 Let (ℋ₁, f₁) and (ℋ₂, f₂) be two indexed hierarchies on I₁ and I₂, respectively.
Let ℋ = ℋ₁ × ℋ₂ and f : ℋ → ℝ₊ such that, for every H = H₁ × H₂ ∈ ℋ,
$$f(H) = \big(f_1(H_1)^p + f_2(H_2)^p\big)^{1/p}, \qquad 1 \le p < \infty\,.$$
Proposition 2 Let (I₁, d₁) and (I₂, d₂) be two ultrametric spaces and let (I, d) be
their Lp-product (1 ≤ p < ∞). Then d is a quasi-ultrametric.
Similarly, since every point of a ball in an ultrametric space is a centre of the ball,
we have d(j,k) ≤ d(i,j). Thus k ∈ B(i,j).
Conversely, let k ∈ B(i,j). Then:
Proposition 2 shows that the Lp-product of two ultrametric spaces is connected with
an indexed quasi-hierarchy. Similarly, we deduce from Corollary 1 that the Lp-product
of two indexed hierarchies is connected with a quasi-ultrametric. In fact, each mapping
is the inverse of the other.
Since the clusters of a hierarchy are the balls and the clusters of a quasi-hierarchy are
the 2-balls, the proof is immediate from Proposition 2 and Corollary 1. We may also
use the opposite way by observing that the smallest cluster of ℋ₁ × ℋ₂ containing i
and j is the Cartesian product of the smallest clusters of ℋ₁ and ℋ₂ containing the
corresponding components of i and j.
The following coherent diagram summarizes the previous results.
Given the Lp-product (ℋ₁ × ℋ₂, f₁ ⊕ₚ f₂), it is easy to recover the hierarchical
components (ℋ₁, f₁) and (ℋ₂, f₂). Indeed, a subset H₁ of I₁ is in ℋ₁ iff H₁ × I₂ ∈
ℋ₁ × ℋ₂. Thus we have ℋ₁ and ℋ₂ and in particular their minimal elements. Then,
for every H₁ ∈ ℋ₁, f₁(H₁) = f₁ ⊕ₚ f₂ (H₁ × H₂), where H₂ stands for any minimal
element of ℋ₂.
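This recovery can be phrased as a short program. The following sketch (our own dict-based representation, with toy hierarchies; singleton clusters carry level 0) builds an Lp-product and reads the first component back off it:

```python
from itertools import product

def lp_product(h1, h2, p=1.0):
    """Lp-product of two indexed hierarchies, each a dict mapping a
    frozenset cluster to its level (singletons at level 0)."""
    return {frozenset(product(a, b)): (fa ** p + fb ** p) ** (1.0 / p)
            for a, fa in h1.items() for b, fb in h2.items()}

def recover_first_component(h, I2):
    """H1 is in the first component iff H1 x I2 is a product cluster;
    its level is read off against a singleton of I2 (level 0)."""
    I2 = frozenset(I2)
    j = next(iter(I2))          # any singleton of the second factor
    h1 = {}
    for cluster in h:
        a = frozenset(x for x, _ in cluster)
        if frozenset(product(a, I2)) in h:
            h1[a] = h[frozenset(product(a, (j,)))]
    return h1

H1 = {frozenset('a'): 0.0, frozenset('b'): 0.0, frozenset('ab'): 2.0}
H2 = {frozenset('x'): 0.0, frozenset('y'): 0.0, frozenset('xy'): 3.0}
H = lp_product(H1, H2, p=1.0)
print(recover_first_component(H, 'xy') == H1)   # True
```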
A similar approach remains valid for a more general problem: is a given indexed quasi-hierarchy
on I₁ × I₂ the Lp-product of two indexed hierarchies (ℋ₁, f₁) and (ℋ₂, f₂)
on I₁ and I₂ for some p? Indeed, the previous procedure gives potential candidates
for the clusters of ℋ₁ and ℋ₂. Then it suffices to check whether ℋ₁ and ℋ₂ are hierarchies
and whether ℋ = ℋ₁ × ℋ₂. Similarly, we have potential level indices f₁ and f₂ and a
unique potential real number p.
In terms of metric spaces, the analogous property clearly holds as well.
Proposition 5 Let (I₁, d₁), …, (I_r, d_r) be r ultrametric spaces and let (I, d) be their
supremum product. Then d is ultrametric.
Proof. Let i, j, k ∈ I₁ × ⋯ × I_r. There exists s such that d(i,j) = d_s(i_s, j_s).
Then: d(i,j) = d_s(i_s, j_s) ≤ max[d_s(i_s, k_s), d_s(k_s, j_s)] ≤ max[d(i,k), d(k,j)]. ∎
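The inequality in this proof is easy to verify numerically. A short sketch with two toy ultrametric matrices of our own:

```python
import numpy as np
from itertools import product

# Two small ultrametric matrices (our own example data).
d1 = np.array([[0, 1, 3],
               [1, 0, 3],
               [3, 3, 0]], float)
d2 = np.array([[0, 2],
               [2, 0]], float)

pts = list(product(range(3), range(2)))          # I1 x I2

def d(u, v):
    """Supremum product distance on I1 x I2."""
    return max(d1[u[0], v[0]], d2[u[1], v[1]])

print(all(d(i, j) <= max(d(i, k), d(k, j))
          for i in pts for j in pts for k in pts))   # True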
Proof. Denote by (ℋ₁, f₁), …, (ℋ_r, f_r) the indexed hierarchies defined on I₁, …, I_r,
respectively. Let I = I₁ × ⋯ × I_r and let (ℋ, f) be the supremum product of
(ℋ₁, f₁), …, (ℋ_r, f_r). For every i = (i₁, …, i_r), j = (j₁, …, j_r) ∈ I and for every
l = 1, …, r, let H_{iₗjₗ} be the smallest cluster of ℋₗ containing iₗ and jₗ. Define
H′_{iₗjₗ} as the greatest cluster of ℋₗ containing iₗ and jₗ, and such that fₗ(H′_{iₗjₗ}) ≤
maxₗ [fₗ(H_{iₗjₗ})]. Let s be an integer such that f_s(H_{iₛjₛ}) = maxₗ [fₗ(H_{iₗjₗ})]. Then
H′_{iₛjₛ} = H_{iₛjₛ}. We show that H′ = H′_{i₁j₁} × ⋯ × H′_{iᵣjᵣ} is in ℋ and is the smallest cluster
of ℋ containing i and j. Indeed the existence of H′ is obvious and H′ ∈ ℋ derives
from the definition of ℋ. Now, let H″ = H″₁ × ⋯ × H″_r ∈ ℋ be a cluster containing
i and j. For every l, H_{iₗjₗ} ⊆ H″ₗ. Then we have: f(H″) ≥ f_s(H″ₛ) ≥ f_s(H_{iₛjₛ}) =
f(H′). Thus H′ ⊆ H″ and H′ has the announced property. The result follows since
f(H′) = maxₗ [fₗ(H_{iₗjₗ})]. ∎
We may also use the opposite way. Observing that a ball B_d(i, d(i,j)) for the supremum
product may be expressed as B_{d₁}(i₁, d(i,j)) × ⋯ × B_{d_r}(i_r, d(i,j)), it suffices to
show that the family of such balls coincides with the supremum product of hierarchies
defined in Proposition 4.
The following diagram summarizes the previous properties.
Although the supremum product does not contain all elements of the Cartesian product,
it is still possible to extract the hierarchical components from such a structure
and to solve a more general problem as in paragraph 3. Indeed, for every H₁ ∈ ℋ₁,
there always exist some Hₗ ∈ ℋₗ, l = 2, …, r, such that H = H₁ × H₂ × ⋯ × H_r ∈ ℋ.
Thus ℋ₁, …, ℋ_r follow. Furthermore, having the hierarchies, there is a minimal
H₂ × ⋯ × H_r, Hₗ ∈ ℋₗ, l = 2, …, r, such that H = H₁ × H₂ × ⋯ × H_r ∈ ℋ for a
fixed H₁ ∈ ℋ₁. Then f₁(H₁) = f(H). One might also use the tedious but clearer
way of ultrametric spaces.
Replacing the maximum by the minimum, we may establish, with a similar proof, a
property analogous to the one of Proposition 4. From ℋ′ = ℋ₁ × ⋯ × ℋ_r, define
f* : ℋ′ → ℝ₊ by: ∀H = H₁ × ⋯ × H_r ∈ ℋ′, f*(H) = minₗ fₗ(Hₗ). Let ℋ* be the
subclass of ℋ′ defined by:
5 Example
We give here a simple and illustrative example.
Two indexed hierarchies are given via their dendrograms in Figure 1.
Figure 2 exhibits the three clusters at the level 5 for the L₁-product of the indexed
hierarchies given in Figure 1: {i₁, i₂, i₃} × J, I × {j₁, j₂}, I × {j₄, j₅}. We may
imagine a practical procedure, with a computer, displaying in this way the clusters
at a given level.
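Such a procedure might look as follows (a minimal sketch with toy hierarchies of our own; the clusters at level λ are taken to be the maximal product clusters whose Lp level does not exceed λ):

```python
from itertools import product

def clusters_at_level(h1, h2, lam, p=1.0):
    """Maximal clusters H1 x H2 of the Lp-product whose level
    (f1(H1)^p + f2(H2)^p)^(1/p) does not exceed lam."""
    low = [frozenset(product(a, b))
           for a, fa in h1.items() for b, fb in h2.items()
           if (fa ** p + fb ** p) ** (1.0 / p) <= lam]
    return [c for c in low if not any(c < d for d in low)]

H1 = {frozenset('a'): 0.0, frozenset('b'): 0.0, frozenset('ab'): 2.0}
H2 = {frozenset('x'): 0.0, frozenset('y'): 0.0, frozenset('xy'): 3.0}
for c in clusters_at_level(H1, H2, lam=3.0, p=1.0):
    print(sorted(c))
```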
The dendrogram for the supremum product of the same indexed hierarchies is
visualized in Figure 3.
[Figure 1: Two indexed hierarchies (dendrograms with leaves i₁, …, i₆ and j₁, …, j₅; levels 0 to 4).]
[Figure 3: The supremum product of the indexed hierarchies.]
6 References
Bandelt, H.J. (1992): Four-point characterization of the dissimilarity functions obtained
from indexed closed weak hierarchies, Math. Seminar der Universität Hamburg.
Bandelt, H.J. and Dress, A.W. (1989): Weak hierarchies associated with similarity measures:
an additive clustering technique, Bull. Math. Biology, 51, 113-166.
Batbedat, A. (1989): Les dissimilarites Medas et arbas, Statistique et Analyse des donnees,
14, 3, 1-18.
Benzecri, J.P. (1973): L'analyse des donnees. Tome 1: La Taxinomie, Dunod, Paris.
Bertrand, P. and Diday, E. (1991): Les pyramides classifiantes: une extension de la structure
hierarchique, C.R. Acad. Sci. Paris, Serie I, 693-696.
Buneman, P. (1974): A note on metric properties of trees, J. Combin. Theory, Ser. B, 17,
48-50.
Diatta, J. and Fichet, B. (1994): From Apresjan hierarchies and Bandelt-Dress weak hierarchies
to quasi-hierarchies. In: New Approaches in Classification and Data Analysis, Diday,
E. et al. (eds.), 111-118, Springer-Verlag, Berlin.
Durand, C. and Fichet, B. (1988): One-to-one correspondences in pyramidal representation:
a unified approach. In: Classification and Related Methods of Data Analysis, Bock,
H. (ed.), 80-85, North-Holland, Amsterdam.
Escofier, B. (1969): L'analyse factorielle des correspondances, Cahiers du B.U.R.O., Universite
de Paris VI, 13, 25-29.
Jardine, C.J., Jardine, N. and Sibson, R. (1967): The structure and construction of taxonomic
hierarchies, Mathematical Biosciences, 1, 465-482.
Johnson, S.C. (1967): Hierarchical clustering schemes, Psychometrika, 32, 241-254.
Towards Comparison of Decomposable Systems
Mark Sh. Levin
The University of Aizu, Fukushima 965-80, Japan
Summary: The paper focuses on the comparison of decomposable systems on the basis
of combinatorial descriptions of systems and their parts. Our system description involves
the following interconnected hierarchies: a tree-like system model; criteria and restrictions
for system components (nodes of the model); design alternatives (DAs) for nodes; interconnections
(Is) or compatibility between DAs of different system components; estimates of
DAs and Is. A vector-like proximity for rankings is described.
1. Introduction
Usually the following basic system analytical problems have been examined to analyze
complex systems: (1) to compare two system versions; (2) to classify system
versions; (3) to evaluate a system version set (e.g., aggregation, construction of consensus,
etc.); (4) to analyze tendencies of system changes (evolution, etc.); (5) to
reveal the most significant system parameters; (6) to plan an improvement process
for a system. Here we examine system representations on the basis of structural
or combinatorial objects. We assume that a system is a decomposable one, and may
have several versions. Thus the following kinds of problems are basic ones: (a) description
of the system and its parts; (b) operations of system analysis and
transformation. Our description of decomposable systems consists of the following
interconnected hierarchies (Levin, 1996b): (1) a system tree-like model; (2) requirements
(criteria, restrictions) on system components (nodes of the model); (3) design
alternatives (DAs) for nodes; (4) interconnections (Is) or compatibility between DAs
of different components; (5) factors of compatibility.
The system proximity may be examined as the following structure corresponding to
the system model: proximity of the hierarchical tree-like model; proximity of the requirement
hierarchy (criterion hierarchy; restriction hierarchy; compatibility factors hierarchy);
proximity of DAs (sets of DAs, estimates on criteria, and priorities); proximity of
Is (sets of Is with priorities). We consider three levels of combinatorial descriptions:
(1) basic combinatorial objects (points in a space; vectors; sets; partitions; rankings;
strings; trees; posets; etc.); (2) elements of the system description: leaf nodes; sets of
DAs and/or Is; tree-like system model; criteria for DAs; etc.; (3) basic system descriptions,
e.g., complete description; external requirements. So a vector-like proximity for
rankings is examined.
2. Measurement of proximity
First let us consider approaches to modeling a proximity (distance, similarity, closeness,
dissimilarity, etc.) for combinatorial objects. Note that these investigations
have been carried out in various disciplines (e.g., mathematical psychology; decision
making; chemistry; linguistics; morphological schemes of systems in technological
forecasting; biology; genetics; data and knowledge engineering; network engineering;
architecture; combinatorics). A survey of coefficients for measures of similarity, dissimilarity,
and distance from the viewpoint of the statistical sciences is presented in (Gower,
1995). From a system viewpoint it is reasonable to examine some functions, operations
and corresponding requirements on mathematical models of the proximity
(Table 1).
Formal requirements on proximity models are based on Fréchet's three axioms specifying
metrics, and sometimes on additional axioms (Kemeny and Snell, 1972; etc.). In
some cases the triangle axiom is rejected, e.g., for architectural objects (Zeitoun,
1977) or for rankings (Belkin and Levin, 1990). Measuring the proximity between combinatorial
objects is based on the following approaches: (1) a metric in a parameter
space; (2) attributes of the largest common part of objects (intersection) or of a unification
(the minimal covering construction); (3) the minimum of changes (change path)
which transforms an initial object into a target one.
Secondly, let us consider scales of measurement. Traditionally R¹, [0,1] or an ordinal
scale are applied. Hubert and Arabie use measures of agreement or consensus indices,
e.g., from [−1,1] (Hubert and Arabie, 1985). Recently some extensions of metric
spaces have been proposed (Pouzet and Rosenberg, 1994; Barthelemy and Guenoche,
1991; etc.), for example: (a) graphs and ordered sets (Jawhari et al., 1986); (b) conceptual
lattices for complex scaling (Ganter and Wille, 1989); (c) ordered sets and
semilattices for partitions (Barthelemy et al., 1986); (d) simplices for rankings (Belkin
and Levin, 1990). Generally, Arabie and Hubert have examined three approaches to
compare combinatorial objects (sequences, partitions, trees, graphs) through given
matrices (Arabie and Hubert, 1992): (a) an axiomatic approach to construct a "good"
measure; (b) usage of structural representations; (c) usage of an optimization task.
Finally, in complex cases, the following approaches may be applied: (a) multidimensional
scaling (Torgerson, 1958; Kruskal, 1977; etc.); (b) usage of graphs and ordered
sets as a kind of metric space (Jawhari et al., 1986; Barthelemy et al., 1986;
Barthelemy and Guenoche, 1991; Ganter and Wille, 1989); (c) integrating or composing
a global proximity from distances or proximities of system components.
1 ≤ d(1,i) ≤ d(2,i) ≤ m and π(i) = d(1,i) if d(1,i) = d(2,i). Thus the system of
intervals {δ(i)} is specified. By analogy with the definitions above it is possible to specify
clusters, and fuzzy clusters. Sometimes the comparison of structures representing the
union of similar graphs (e.g., 'chains', layered structures) is of particular
interest in practice. Let Ω(S) be the set of all layered structures on A.
Definition 1. We say that
δ_π(i, S, Q) = π(i, S) − π(i, Q),
δ_π(i, j, S, Q) = π(i, S) − π(j, S) − (π(i, Q) − π(j, Q)),
where π(i, S) = l ∀i ∈ A_l in S, are the first order error ∀i ∈ A, and the second
order error ∀(i,j) ∈ A × A, i ≠ j, ∀S, Q ∈ Ω(S), respectively. Thus for an estimate
of a discordance between the structures S, Q ∈ Ω(S) with respect to i and (i,j) we
obtain an integer-valued scale with the following ranges: −(m−1) ≤ r ≤ m−1 for
δ_π(i, S, Q), and −2(m−1) ≤ r ≤ 2(m−1) for δ_π(i, j, S, Q).
Definition 2. Let
x(S, Q) = (x[−(m−1)], …, x[−1], x[1], …, x[m−1]),   (1)
y(S, Q) = (y[−2(m−1)], …, y[−1], y[1], …, y[2(m−1)]),   (2)
be vectors of an error (proximity) ∀S, Q ∈ Ω(S) with respect to the components i (1st
order) and the pairs (i,j) (2nd order). The vector components are:
It is possible to force condition (4) by using a right side which is equal to a parameter
ν > 0.
Definition 7. Let M = {x ∈ X | Σ_r x[r] = 1} be a marginal set (similarly for y).
Note that ∀x (y) there exists a dominating subset D(x) = {η ∈ M | η ⪰ x}.
Definition 8. Let a pair of vectors x₁, x₂ (y₁, y₂) be:
(a) comparable ones, if x₁ ⪰ x₂ (therefore D(x₂) ⊇ D(x₁), and vice versa);
(b) strongly uncomparable ones, if |D(x₁) ∩ D(x₂)| = 0;
(c) weakly uncomparable ones, if |D(x₁) ∩ D(x₂)| ≠ 0 and D(x₁) ∩ D(x₂) includes
neither D(x₁) nor D(x₂).
Finally, let us consider properties of our vector-like proximity as follows:
1. Condition (4) defines a poset.
2. 0 ≤ |x(S,Q)| ≤ 1, 0 ≤ |y(S,Q)| ≤ 1, ∀S, Q ∈ Ω(S).
3. x(S,Q) ⪰ (0, …, 0), y(S,Q) ⪰ (0, …, 0), ∀S, Q ∈ Ω(S).
4. The following condition is true for one-sided vectors: x(S,Q) ≺ (0, 0, …, 0, 1),
y(S,Q) ≺ (0, 0, …, 0, 1).
5. The following condition is true for any two-sided vector x(S,Q), ∀S, Q ∈ Ω(S):
there exists a vector e = (e[−k₁], 0, …, 0, e[k₂]) ∈ M (k₁, k₂ > 0) such that x(S,Q) ⪰ e
(similarly for y).
6. For any modular vector the following is true: x(S,Q) = x(Q,S) (similarly for y).
7. For any two-sided symmetrical vector the following is true: x(S,Q) = x*(Q,S),
where x*[r] = x[−r] (similarly for y).
8. ∀x(S,Q), ∀S, Q ∈ Ω(S), the following is true: if x(S,Q) = (0, …, 0) then S = Q.
An assessment of proximity between fuzzy layered structures is a more complicated
problem. Let us consider an example of a qualitative vector-like proximity for fuzzy
structures S_f, Q_f ∈ Ω(S_f), where Ω(S_f) is the set of all fuzzy layered structures on A.
Definition 9. Let z(S_f, Q_f) = (z[−(m−1)], …, z[−1], z[1], …, z[m−1]), where
In the same way, we may describe properties for the vectors z, which are similar to those
of the vectors x (y) (besides the 8th one).
5. Example
Let us consider an example: A = {1, 2, 3, 4, 5, 6, 7, 8, 9}; S₁: A₁ = {2,4}, A₂ =
{9}, A₃ = {1,3,7}, A₄ = {5,6,8}; and S₂: A₁ = {7,9}, A₂ = {1,3}, A₃ = {2,5,8}, A₄ =
{4,6}. Let ‖g_ij‖ (i, j ∈ A) be an adjacency matrix for graph G:
g_ij = 1 if i precedes j; g_ij = 0 if i and j belong to the same layer; g_ij = −1 if j
precedes i.
Then the Kendall proximity measure (metric) for graphs G¹ and G² is the following
(Kendall, 1962): P_K(G¹, G²) = Σ_{i<j} |g¹_ij − g²_ij|, where g¹_ij, g²_ij are elements of the adjacency
matrices of G¹ and G² respectively. The adjacency matrices corresponding
to our example are the following:
[The two 9 × 9 adjacency matrices ‖g¹_ij‖ and ‖g²_ij‖, with entries in {−1, 0, 1}, are omitted here.]
So, Kendall's distance is: P_K(S₁, S₂) = 31. The proposed vector-like proximity describes
the dissimilarity between the two structures more prominently:
π(S₁) = (π₁(S₁), …, π₉(S₁)) = (3, 1, 3, 1, 4, 4, 3, 4, 2),
π(S₂) = (π₁(S₂), …, π₉(S₂)) = (2, 3, 2, 4, 3, 4, 1, 3, 1),
δ_π(S₁, S₂) = (1, −2, 1, −3, 1, 0, 2, 1, 1),
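These quantities are easy to recompute. A minimal sketch (our own; the sign convention for g_ij is an assumption, so the Kendall value may differ from the paper's if its convention differs):

```python
from itertools import combinations

# Layer indices pi(i) for items 1..9 in the example structures S1 and S2.
S1 = {1: 3, 2: 1, 3: 3, 4: 1, 5: 4, 6: 4, 7: 3, 8: 4, 9: 2}
S2 = {1: 2, 2: 3, 3: 2, 4: 4, 5: 3, 6: 4, 7: 1, 8: 3, 9: 1}

# First order errors delta_pi(i, S1, S2) = pi(i, S1) - pi(i, S2).
print([S1[i] - S2[i] for i in sorted(S1)])   # [1, -2, 1, -3, 1, 0, 2, 1, 1]

def sign(v):
    return (v > 0) - (v < 0)

def kendall_distance(sa, sb):
    """Sum over pairs of |g1_ij - g2_ij|, with signed adjacency
    g_ij = sign(pi(j) - pi(i)) (assumed convention)."""
    return sum(abs(sign(sa[j] - sa[i]) - sign(sb[j] - sb[i]))
               for i, j in combinations(sorted(sa), 2))

print(kendall_distance(S1, S2))
```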
[The 9 × 9 matrix of second order errors δ_π(i, j, S₁, S₂) is omitted here.]
6. Conclusion
A combinatorial approach (e.g., description, design, transformation) to decomposable
systems has been described in (Levin, 1996a; Levin, 1996b). Comparison of decomposable
systems can be applied in decision making, information engineering (e.g.,
hypertext systems), network design, etc. Clearly, composing a global system
proximity from proximities of system parts is a central problem. Finally, let us point
out the basic kinds of changes for our system description: (1) internal changes: microlevel
(DAs), parts (subsystems, requirements), macrolevel (model); (2) external changes:
requirements. Note that the proposed vector-like proximity can be used in the aggregation
of rankings.
References:
Arabie, Ph. and Boorman, S.A. (1973): Multidimensional Scaling of Measures of Distance
between Partitions, J. of Math. Psychology, 10, 2, 148-203.
Arabie, Ph. and Hubert, L.J. (1992): Combinatorial Data Analysis, Annual Review of
Psychology, 43, 169-203.
Barthelemy, J.-P. and Guenoche, A. (1991): Trees and Proximity Representation, Wiley,
New York.
Barthelemy, J.-P., Leclerc, B. and Monjardet, B. (1986): On the Use of Ordered Sets in
Problems of Comparison and Consensus of Classifications, J. of Classification, 3, 2, 187-224.
Belkin, A.R. and Levin, M.Sh. (1990): Decision Making: Combinatorial Models of Information
Approximation, Nauka, Moscow (in Russian).
Bogart, K.P. (1973): Preference Structures I: Distance between Transitive Preference Relations,
J. Math. Sociology, 3, 49-67.
Boorman, S.A. and Oliver, D.C. (1973): Metrics on Spaces of Finite Trees, J. of Math.
Psychology, 10, 1, 26-59.
Botafogo, R.A., Rivlin, E. and Shneiderman, B. (1992): Structural Analysis of Hypertexts:
Identifying Hierarchies and Useful Metrics, ACM Trans. on Information Systems, 10, 2,
142-180.
Cook, W.D. and Kress, M. (1984): Relationships between l1 Metrics on Linear Ranking
Spaces, SIAM J. on Appl. Math., 44, 1, 209-220.
Day, W.H.E. (1985): Optimal Algorithms for Comparing Trees with Labeled Leaves, J. of
Classification, 2, 1, 7-28.
Ganter, B. and Wille, R. (1989): Conceptual Scaling. In: Applications of Combinatorics and
Graph Theory to the Biological and Social Sciences, Roberts, F.S. (ed.), 140-167, Springer-Verlag,
New York.
Gordon, A.D. (1986): Consensus Supertrees: The Synthesis of Rooted Trees Containing
Overlapping Sets of Labeled Leaves, J. of Classification, 3, 2, 335-348.
Summary: Concerned with the problem of clustering a compositional data set consisting
of vectors of positive components subject to a unit-sum constraint, as a first step we looked
for an appropriate dissimilarity coefficient or distance between two compositions. In this
paper we selected eight different dissimilarity measures, and their performance was evaluated
by means of graphics and cluster validity coefficients of six clustering methods applied
to three compositional data sets. Recent criteria for measures of compositional difference
are also tested for those measures emerging as the best to cluster compositions.
1. Introduction
Any compositional data set consists of N D-part compositions with x_ri the i-th
component of the r-th composition (r = 1, …, N; i = 1, …, D), satisfying the requirements
that each component is non-negative and that the sum of all the components
in each composition is 1.
It is our objective to study the problem of Cluster Analysis for compositional data
sets. As a first step, we will look for an appropriate dissimilarity coefficient or distance
between two compositions. Aitchison (1992) discussed some criteria to define
a measure of difference between two compositions, although no clustering problems
were tried by the author. Briefly, Aitchison's proposal can be stated as follows:
The appropriate sample space for a compositional data vector x = (x₁, …, x_D) is
the d-dimensional positive simplex (Aitchison (1986, Chapter 2))
$$S^d = \{(x_1, \ldots, x_D) : x_1 > 0, \ldots, x_D > 0 \text{ and } x_1 + \cdots + x_D = 1\}. \tag{1}$$
As a way to eliminate the constant-sum constraint difficulty that each compositional
data vector x ∈ S^d must satisfy, and therefore to work in a space without restrictions,
Aitchison (1986, Sec. 4.6) proposed the following two transformations of the data:
• the logratio vector y = log(x₋D / x_D) ∈ ℝ^d, where x₋D is the vector x with the last
component omitted, and
• the centred logratio vector z = log(x / g(x)) ∈ ℝ^D, where g(x) = (x₁ ⋯ x_D)^{1/D} is the
geometric mean of the components.
Aitchison's ideas are based on three postulates:
1. A composition contains information about relative, not absolute, magnitudes of
its components. In other words, writing rᵢ = xᵢ/x_D (i = 1, …, d), a composition is
completely determined by xᵢ = rᵢ/(r₁ + ⋯ + r_d + 1) (i = 1, …, d) and x_D = 1/(r₁ + ⋯ + r_d + 1);
2. Any discussion of the variability of a composition can be expressed in terms of
ratios, such as xᵢ/xⱼ, of components;
3. No compositional information is lost if log(xᵢ/xⱼ) is studied instead of the ratios
themselves.
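The two transformations above, and the Euclidean distance between centred logratio vectors, are straightforward to compute. A short sketch (our own function names):

```python
import numpy as np

def alr(x):
    """Logratio vector y = log(x_{-D} / x_D)."""
    x = np.asarray(x, float)
    return np.log(x[:-1] / x[-1])

def clr(x):
    """Centred logratio vector z = log(x / g(x)), g the geometric mean."""
    x = np.asarray(x, float)
    return np.log(x) - np.log(x).mean()

def aitchison_distance(xr, xs):
    """Euclidean distance between the centred logratio vectors."""
    return float(np.linalg.norm(clr(xr) - clr(xs)))

xr, xs = [0.2, 0.3, 0.5], [0.1, 0.4, 0.5]
print(alr(xr), clr(xr), aitchison_distance(xr, xs))
```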
"as the simplest and most tractable measure of difference or distance between two
compositions Zr = (Xrl, ... , XrD) and z, = (X.l, ... , X,D)" . The above expression,
apart from a constant factor, coincides with the Euclidean distance between two·
centred logratio vectors, which is the first proposal of Aitchison (1986, p.193) as the
inquired measure.
The fundamental point of Aitchison's criteria in order to obtain (2) is:
"Any scalar measure of difference between two compositions x_r and x_s must be
expressible in terms of ratios of components in the same composition (that is, in terms
of x_{ri}/x_{rj} and x_{si}/x_{sj}), and also as ratios of components in different compositions
(that is, as x_{ri}/x_{si} or x_{rj}/x_{sj}). In other words, the measure will be expressible as a
function of the ratio r/R (or, in a symmetric way, R/r), where r = x_{ri}/x_{rj} and
R = x_{si}/x_{sj}."
In the present paper, we first review some dissimilarity coefficients or distances which
are currently used in Cluster Analysis and select those we consider capable of
measuring the difference between two compositions. Then, we conduct a study of six
clustering methods applied to three compositional data sets in order to explore, and
compare with Aitchison's proposal, the performance of the different dissimilarities.
This performance is evaluated by means of graphics and coefficients of cluster
validity of the various clustering techniques used. Also, Aitchison's criteria for those
coefficients emerging as the best to cluster compositions are tested.
Based on the above cited measures, four Hierarchical and two Partitioning Clustering
Methods were considered. Single, Complete and Average Linkage were the
Agglomerative Hierarchical Clustering Methods used, and the MacNaughton-Smith algorithm
was the Divisive Hierarchical Clustering Method applied. Among Partitioning
Clustering Methods we considered Partitioning around Medoids and Fuzzy Analysis.
Single and Complete Linkage outputs were obtained by means of the SPSS or S-plus
packages, whereas the programs AGNES (Average Linkage), DIANA (Divisive Analysis),
PAM (Partitioning around Medoids) and FANNY (Fuzzy Analysis) of Kaufman
and Rousseeuw (1990) were used for the other algorithms.
The ternary diagrams of Figure 2 summarize the observations from the dendrograms
by Single Linkage of the three-part compositions of the 39 samples in the Arctic Lake
Sediments.
The extensive variation in the ratio of clay to sand is again represented in the shape of
a banana. Even though the ternary diagrams show a clear cluster of compositions with
low proportions of sand and approximately equal proportions of silt and clay, this
cluster was not detected by Aitchison's approach when Single Linkage was applied.
Aitchison's distance prefers to isolate compositions with a very low proportion of
clay (Figure 2(a)), whereas for the other seven coefficients the same four- and six-cluster
solutions could be observed (Figure 2(b)). Clearly, the latter solutions are
reasonable, but not the solutions using Aitchison's coefficient.
[Figure 2: Ternary diagrams (a) and (b) of the sand-silt-clay compositions.]
| C.Data | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| Sedim. | 0.93 | 0.93 | 0.93 | 0.93 | 0.99 | 0.99 | 0.93 | 0.93 |
| HK(3)  | 0.94 | 0.94 | 0.94 | 0.92 | 0.99 | 0.99 | 0.93 | 0.94 |
| HK(5)  | 0.89 | 0.90 | 0.91 | 0.89 | 0.99 | 0.98 | 0.89 | 0.90 |

Table 2: Divisive Coefficients of three different Compositional Data Sets using eight
different Dissimilarity Coefficients

| C.Data | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| Sedim. | 0.95 | 0.94 | 0.95 | 0.94 | 0.99 | 0.99 | 0.94 | 0.96 |
| HK(3)  | 0.96 | 0.96 | 0.96 | 0.95 | 1.00 | 1.00 | 0.96 | 0.97 |
| HK(5)  | 0.93 | 0.94 | 0.95 | 0.94 | 1.00 | 1.00 | 0.94 | 0.95 |
5. Conclusions
The rank order performance for the dissimilarity coefficients or distances examined
was replicated for the six clustering methods applied to the three compositional data
sets, namely:
5.1: Divergence of Jeffreys and Bhattacharyya using logarithm produce the best
(in general excellent) recovery of cluster structures;
5.2: The relative merits of the other coefficients are not straightforward. They indicate
the lowest recovery values, especially when partitioning clustering methods are
used.
Partitioning Clustering Methods appear more sensitive than Hierarchical Clustering
Methods for evaluating the performance of the different dissimilarity coefficients in
order to cluster a compositional data set. For example, Agglomerative and Divisive
Coefficients, independent of the measure used, always ranged between 0.90 and 1
(strong clustering structures). However, Silhouette and Dunn's partition coefficients
clearly allow the separation of Divergence and Bhattacharyya(log), giving reasonable
or good structures, from the other measures under analysis, which show weak or poor
structures.
Finally, since Aitchison's proposal does not reveal advantages over the other coefficients
investigated, and it can also lead to unreasonable cluster structures, we
wish to know if the Divergence of Jeffreys and the Bhattacharyya(log) dissimilarities satisfy
the cited Aitchison's criteria for measures of compositional difference. That is, can
we express the Divergence and Bhattacharyya(log) dissimilarities as a function of r/R,
with r = x_{ri}/x_{rj} and R = x_{si}/x_{sj} (r, s = 1, …, N; i, j = 1, …, D)?
Considering two-part compositions, the Divergence Dissimilarity of Jeffreys becomes:
$$\log\!\left(\frac{r}{R}\right)\frac{(r+1)-(R+1)}{(r+1)(R+1)}\,. \tag{3}$$
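A quick numerical check of this reduction (our own sketch; the Jeffreys form J(x, y) = Σᵢ (xᵢ − yᵢ) log(xᵢ/yᵢ) is assumed):

```python
import numpy as np

def jeffreys(x, y):
    """Jeffreys divergence J(x, y) = sum_i (x_i - y_i) log(x_i / y_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((x - y) * np.log(x / y)))

r, R = 3.0, 0.5                      # ratios x1/x2 of the two compositions
x = np.array([r, 1.0]) / (r + 1.0)   # two-part composition with x1/x2 = r
y = np.array([R, 1.0]) / (R + 1.0)
rhs = np.log(r / R) * ((r + 1) - (R + 1)) / ((r + 1) * (R + 1))
print(np.isclose(jeffreys(x, y), rhs))   # True
```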
6. References
Aitchison, J. (1986): The Statistical Analysis of Compositional Data, Chapman and
Hall, London.
Aitchison, J. (1992): On Criteria for Measures of Compositional Difference, Mathematical
Geology, 24, 4, 365-379.
Aitchison, J. (1994): Principles of Compositional Data Analysis. In: Multivariate Analysis
and its Applications, IMS Lecture Notes-Monograph Series, 24, 73-81.
Kaufman, L. and Rousseeuw, P.J. (1990): Finding Groups in Data, John Wiley and
Sons.
Consensus of Hierarchical Classifications
Bruno Simeone, Maurizio Vichi
Summary: In this paper, after briefly reviewing the techniques to achieve one or more consensus
dendrograms, we propose new algorithms. These perform a sequence of elementary tree operations
to obtain at each step, in a greedy fashion, a dendrogram that is as close as possible to the given
ones. The time and the space complexities of such procedures make them suitable for large scale
applications. A numerical example is used to show the proposed technique.
1. Introduction.
Often in analyzing multivariate data, a set of hierarchical classifications can be determined as the
result of: (i) different hierarchical clustering methods applied on the same set of n objects; (ii) the
same clustering algorithm (aggregative or divisive) embodying different proximity measures
between pairs of objects; (iii) a hierarchical technique applied on r frontal slices of a three-way
data set, i.e., on different multivariate data sets, relative to the same n multivariate objects
examined on different occasions, such as times, spaces, etc. In this paper we will consider
especially the last case.
One general problem in numerical taxonomy is to compare and synthesize the given set of
hierarchical classifications, determining a consensus structure (often an n-tree, less frequently a
dendrogram). The intuitive reasons to search for a consensus may be different: in case (i) we are
looking for a single and more natural classification not depending on the clustering technique
considered; or, in case (ii), for a classification not depending on the chosen dissimilarity measure;
or, in case (iii), we may want to synthesize several different classifications into a single one, by
detecting the relevant and stable information in the individual classifications.
In consensus theory, much attention has been devoted to taxonomic models such as n-trees, and
three approaches have been identified (Barthelemy, Leclerc and Monjardet, 1986) to determine a
consensus n-tree: (1) constructive ones (Adams, 1972; Margush and McMorris, 1981; McMorris
et al., 1983), where purely combinatorial methods are proposed; these generally involve heuristic
algorithms with interesting properties; (2) axiomatic ones (McMorris and Neumann, 1983;
Neumann, 1983; Stinebrickner, 1984; Barthelemy and McMorris, 1986; Day et al., 1986;
Neumann and Norton, 1986), whereby after the formulation of some general "particularly
desirable" properties of consensus procedures (or functions), methods satisfying these properties
are defined if possible; (3) optimization ones (Barthelemy and Monjardet, 1981), where after the
choice of a distance index between each observed n-tree and the consensus n-tree, the consensus
minimizing such distance is determined. The three approaches are not independent since, for
example, consensus methods defined via the optimization approach generally give rise to heuristic
procedures satisfying some given axioms or properties. These heuristic procedures are used to
define good initial solutions for the optimization algorithm.
An alternative approach searches for a common pruned tree (Rosen, 1978) by pruning the least
number of leaves of the given trees in order to render them equivalent. An interesting overview of
consensus methods is given in Gordon (1996).
¹⁾ In the following we represent the n-tree by its non-trivial classes: T_h = {I_{1h}, I_{2h}, …, I_{n_h h}}, where
1 < |I_{jh}| < n for all j.
diag(U) = 0.
Notice that if U is ultrametric (a sufficient condition for this to happen is that the binary n-trees
T₁, T₂, …, T_r associated with the ultrametric matrices are equal (Vichi, 1995)) then it is obviously
the optimal solution to [P2].
Algorithms for finding good (although not necessarily optimal) solutions to the above problems
have been given by Vichi (1993) ([P1], [P2] versions), and by Carroll and Pruzansky (1980) and De
Soete (1984) ([P3] version).
4. Algorithms.
In view of the NP-hardness of [P1], [P2], [P3], and of the exponentially growing running times of
the available exact solution algorithms, it makes sense to look for fast heuristic procedures
yielding "good", although not provably optimal, solutions. Any such solution is also useful as a
starting point for iterative global optimization procedures.
An initial feasible solution to [P1], [P2] and [P3] may be obtained through a hierarchical
classification algorithm having as input the Euclidean matrix U. The best choice is the average link
method, which is known to give, among all aggregative methods, the ultrametric matrix at
minimum distance from the given dissimilarity matrix (in our case U); see for example
Cunningham and Ogilvie (1972). This variant will be called Algorithm 1. The resulting
dendrogram can be considered as a member of the family of consensus dendrogram methods
proposed by Ghashghai et al. (1989), as a generalization of Stinebrickner's top-down method
for dendrograms (Stinebrickner, 1984a) and of Neumann's generalized intersection methods for n-trees
(Neumann, 1983).
A second heuristic algorithm (Algorithm 2) for [P1], [P2] and [P3] has been proposed by Vichi
(1994). Essentially, starting from U the algorithm: (i) computes the least upper bound ultrametric
(via complete linkage) and the greatest lower bound ultrametric (via single linkage); (ii) finds the
mean matrix of these two ultrametrics; and (iii) repeats steps (i) and (ii) on the current mean
matrix until convergence is achieved. If the limit matrix (which always exists) happens not to be
ultrametric, then the average link method is applied to it.
A third algorithm is actually a procedure to improve the objective function
$$\sum_{h=1}^{r}\sum_{i<j}\,(u_{ijh}-\delta_{ij})^2 \tag{1}$$
when a dendrogram approximating U is given. In our case the dendrogram is defined by one of the
previous two procedures.
Before introducing the algorithm we need to define a swap operation and some of its properties.
Let T be an n-tree containing, among others, the following sets:
(i) three pairwise disjoint sets J, H, K; (ii) J∪H; (iii) J∪H∪K; then a swap operation transforms T
into the n-tree T′ or T″, where J∪H is replaced by J∪K or H∪K, respectively.
The swap operation can be described by the tree representation in Fig. 1.
[Fig. 1: Tree representations (1), (2), (3) of the three fusions (J∪H)∪K, (J∪K)∪H and (H∪K)∪J.]
Thus, given the n-tree T with classes J, H, K, whose fusion is represented in Fig. 1.1, a swap
operation is represented in Fig. 1.2 or in Fig. 1.3. These fusions give rise to the n-trees T″ and T′,
respectively.
Example 1: Given the n-tree (including also the trivial classes) T = {1, 2, 3, 4, 5, 12, 34, 125, I}, a swap on
classes {1}, {2}, {5} gives the n-trees T′ = {1, 2, 3, 4, 5, 15, 34, 125, I}, T″ = {1, 2, 3, 4, 5, 25, 34, 125, I}.
From an algebraic viewpoint, the trees T, T′ and T″ correspond to the three possible ways to obtain
J∪H∪K by successive unions starting from J, H, K, namely (J∪H)∪K, (J∪K)∪H, (H∪K)∪J.
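The swap operation is a one-liner on sets of sets. A minimal sketch (our own representation of n-trees as sets of frozensets) reproducing Example 1:

```python
def swap(tree, J, H, K):
    """Swap operation: J, H, K pairwise disjoint classes of the n-tree,
    with J|H and J|H|K present; J|H is replaced by J|K (T') or H|K (T'')."""
    J, H, K = frozenset(J), frozenset(H), frozenset(K)
    assert J | H in tree and J | H | K in tree
    return (tree - {J | H}) | {J | K}, (tree - {J | H}) | {H | K}

# Example 1, with classes written as frozensets:
T = {frozenset(c) for c in ({1}, {2}, {3}, {4}, {5},
                            {1, 2}, {3, 4}, {1, 2, 5}, {1, 2, 3, 4, 5})}
T1, T2 = swap(T, {1}, {2}, {5})
print(frozenset({1, 5}) in T1, frozenset({2, 5}) in T2)   # True True
```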
Theorem: Given two binary n-trees S and T on the same set I, it is always possible to obtain T
from S, or vice versa, by a finite sequence of swap operations.
Remark: If S and T are two n-trees on the same set I, if the sub-trees S_J and T_J are identical, and
if T_J can be obtained from S_J by a finite sequence of swaps, then T can be obtained from S by a
finite sequence of swaps as well.
We are now in a position to prove the theorem.
Proof: By induction on n = |I|. For n = 2 or 3 the theorem is trivial. Suppose that the theorem holds
for all p-trees with p ≤ n−1, and let S and T be any two n-trees on I (n ≥ 4).
Case 1) There exist two identical sub-trees S_J and T_J with |J| ≥ 2.
Then by the inductive hypothesis T_J may be obtained from S_J through a finite sequence of swaps,
and thus by the above remark the theorem holds also for n-trees.
Case 2) There are no two identical sub-trees S_J and T_J with |J| ≥ 2.
Then consider any two brother terminals L and M in the tree T (two brother terminals must always
exist). Both L and M are terminals also in S, but they are not brothers in S, otherwise Case 1
would occur. We are going to describe a procedure that, starting from S, produces after a finite
sequence of swaps an S′ where L and M are brother terminals. Then S′ and T have two identical
sub-trees, namely those rooted at L ∪ M. Thus we fall back into Case 1) and the theorem
follows.
[Fig. 2: Swaps (a), (b), (c) in the procedure DOUBLE LIFTING.]
In the procedure to be described below, and with reference to the current tree S′, we denote by
pred(J) and brother(J) the predecessor and the brother of the node J (J ≠ I), respectively.
The above theorem motivates the following Algorithm 3 (although it is not enough to guarantee the
optimality of the dendrogram generated by such an algorithm). For a better understanding of the
algorithm, consider again the three sub-trees in Fig. 1.
The sub-trees represent three sub-dendrograms of a dendrogram Δ only if a ≤ b, or c ≤ d, or e ≤ f,
where a, b, c, d, e, f are the levels of fusion, as reported in Figure 1, respectively.
Furthermore, the contribution of each sub-dendrogram to the objective function (1) is a corresponding
amount; its minimum over feasible levels of fusion is denoted s₁, s₂, s₃, respectively.
The minimum increase s₁ is given, for a sub-tree of type (1) in Fig. 1, when the levels of fusion are:
$$a^* = \frac{1}{r|J||H|}\sum_{h=1}^{r}\sum_{f\in J,\;g\in H} u_{fgh}\,;\qquad
b^* = \frac{1}{r|J\cup H||K|}\sum_{h=1}^{r}\sum_{f\in J\cup H,\;g\in K} u_{fgh}\,,\quad \text{if } a^* < b^*; \tag{5}$$
$$a^{**} = b^{**} = \frac{1}{r\binom{|J\cup H\cup K|}{2}}\sum_{h=1}^{r}\sum_{\substack{f,g\in J\cup H\cup K\\ f<g}} u_{fgh}\,, \tag{6}$$
but in the last case the n-tree (1) in Fig. 1 becomes a bush, i.e., the level of fusion a is equal to the
level of fusion b. In fact, the value a* is the arithmetic mean of the dissimilarities among classes J
and H, while b* is the mean of the dissimilarities among classes J∪H and K. Thus, a* > b* means
that the best feasible solution is a tree with one level of fusion (a bush) equal to the mean of the
dissimilarities between the elements of cluster J∪H∪K.
Similarly, the minimum increase s₂ is given, for a sub-tree of type (2) in Fig. 1, when the levels of
fusion are defined analogously, if e* < f*. (8)
ALGORITHM 3
Given a dendrogram Δ = {α(I₁), α(I₂), …, α(I_{n−1})}, with at most n−1 internal nodes, where I₁, I₂, …, I_{n−1}
are the non-singleton classes associated with the internal nodes;
Step 0: Set the iteration parameter k := 1;
Step 1: Visit the k-th internal node according to the order given by the non-decreasing levels of fusion
α(I_k):
    If cluster I_k is the root of a sub-tree with at least one internal node then
        If the sub-tree has more than one internal node then
            aggregate those clusters with the smallest level of fusion. Let J, H, K be
            the three leaves of this sub-tree. With a swap operation we have one of the
            three sub-trees in Fig. 1;
        End If;
Step 2: Compute (a*, b*), (c*, d*), (e*, f*):
    If a* ≤ b* or c* ≤ d* or e* ≤ f* (i.e., at least one pair is feasible) then
        Compute for the feasible levels of fusion the increases s₁, s₂, s₃;
        Consider the sub-dendrogram with the smallest increase of the
        objective function;
    End If;
    Else
        Compute for the class with two elements {i, j} the mean of u_{ijh}, h = 1, …, r.
    End If;
Step 3: k := k + 1;
repeat Step 1 to Step 3, n−1 times.
The worst-case time complexity of the algorithm is O(rn³), since processing the k-th node, which has at
most k proper descendants, takes O(k²r), k = 1, …, n−1.
A fourth algorithm is illustrated in the following table, and can be applied directly to the original
dissimilarity matrices D₁, D₂, …, D_r.
ALGORITHM 4
Step 0 (initialization): let r matrices D₁, D₂, …, D_r be given, where D_h = {d_{ijh}, i, j ∈ I}; these may be
dissimilarity or ultrametric matrices. Set the iteration parameter k := 1;
Step 1: For each matrix D_h, h = 1, …, r, find the minimum value:
d_{ij1} = min{D₁}; …; d_{lmh} = min{D_h}; …; d_{pqr} = min{D_r}.
These are the values of fusion between the groups (G_i, G_j), …, (G_l, G_m), …, (G_p, G_q), respectively, and
represent the k-th smallest values of fusion of the r dendrograms.
Step 2: Compute the means of the dissimilarities:
E{d_{ijv}, v = 1, …, r}; …; E{d_{lmv}, v = 1, …, r}; …; E{d_{pqv}, v = 1, …, r}.
Step 3: Among the above means, choose the smallest one, let it be E{d_{lmv}, v = 1, …, r}, which is at least as
large as the minimum mean detected at iteration k−1.
Step 4: The increment of the objective function in [P1] after the fusion of groups (G_l, G_m) is:
DEV{d_{lmv}, v = 1, …, r} = Σ_{v=1}^{r} (d_{lmv} − E{d_{lmv}, v = 1, …, r})².
Thus, the fusion of the groups (G_l, G_m), with cardinalities n_l and n_m, defines the k-th group I_k, with associated
value of fusion E{d_{lmv}, v = 1, …, r}. Cluster I_k represents the k-th node of the consensus n-tree associated
with the r dendrograms.
Step 5: Update the matrices D₁, D₂, …, D_r, i.e., the distances between the fused cluster I_k and a generic
cluster G_s: d_h(I_k, G_s) = (n_l / (n_l + n_m)) d_h(G_l, G_s) + (n_m / (n_l + n_m)) d_h(G_m, G_s),
where d_h(I_k, G_s) is the distance between G_l ∪ G_m and cluster G_s in the h-th matrix D_h.
Step 6: k := k + 1;
repeat Step 1 to Step 6, n−1 times.
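A compact sketch of Algorithm 4 follows (our own implementation choices and variable names; the monotonicity safeguard of Step 3 is omitted for brevity, so this is an illustration rather than a faithful reimplementation):

```python
import numpy as np

def consensus_algorithm4(mats):
    """Greedy consensus dendrogram from r dissimilarity matrices D_1..D_r."""
    r, n = len(mats), mats[0].shape[0]
    D = [m.astype(float).copy() for m in mats]
    active = list(range(n))               # current cluster representatives
    members = {i: [i] for i in range(n)}  # representative -> object list
    merges = []
    for _ in range(n - 1):
        # Steps 1-2: the argmin pair of each matrix is a candidate;
        # score each candidate by its mean dissimilarity over all matrices.
        candidates = set()
        for h in range(r):
            pairs = [(a, b) for ai, a in enumerate(active)
                     for b in active[ai + 1:]]
            candidates.add(min(pairs, key=lambda p: D[h][p]))
        mean_d = lambda p: np.mean([D[h][p] for h in range(r)])
        # Steps 3-4: fuse the candidate with the smallest mean.
        g, m = min(candidates, key=mean_d)
        merges.append((members[g] + members[m], mean_d((g, m))))
        # Step 5: weighted-average update of every matrix.
        ng, nm = len(members[g]), len(members[m])
        for h in range(r):
            for c in active:
                if c != g and c != m:
                    v = (ng * D[h][g, c] + nm * D[h][m, c]) / (ng + nm)
                    D[h][g, c] = D[h][c, g] = v
        members[g] += members[m]
        active.remove(m)
    return merges

rng = np.random.default_rng(7)
mats = []
for _ in range(3):
    a = rng.uniform(1, 10, (5, 5)); a = (a + a.T) / 2
    np.fill_diagonal(a, 0); mats.append(a)
for cluster, level in consensus_algorithm4(mats):
    print(sorted(cluster), round(float(level), 3))
```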
5. Application
In order to show the behavior of the proposed algorithms a well-known benchmark in cluster
analysis, given by Michener (1970), has been used. This data set arises from a taxonomic problem
on 11 types of Hoplitis bees, described by 23 variables related to the form and characteristics of
the bees (further details are reported in the original paper). Using the distance matrix between pairs
of bees, Everitt (1993) compares the dendrograms obtained by single linkage, complete linkage and
average linkage between groups (UPGMA). The ultrametric matrices U₁, U₂ and U₃ are reported
in Vichi (1993). The mean matrix Ū = (1/3)(U₁ + U₂ + U₃) and the dendrogram obtained by
applying average linkage to it are shown in Fig. 4.
Fig. 4: Mean matrix Ū of the ultrametric matrices associated with single linkage, complete linkage
and average linkage applied to the Hoplitis data (Everitt, 1993). The dendrogram is obtained by applying
average linkage to the mean matrix.
[The 11 × 11 mean matrix Ū is displayed here.]
• k = 5, 6, 7, 8. In Steps 1 and 2 the sub-dendrograms (iv), (vii), (x), (xiii) are found to be optimal,
respectively.
• k = 9. This node {7,8}, with level of fusion 1.53, has no internal node among its successors and it
cannot be the root of a sub-tree. The value 1.53 is the mean of the elements between (7,8) in the three
ultrametric matrices. The increase of the o.f. is 0.10492;
• k = 10. Step 1: The internal node {1,2,3,4,5,6,7,8,9,10,11} is the last node that can be considered as
the root of a sub-tree. The associated sub-tree with one internal node is shown in Fig. 5 (xvi). With
one swap the sub-trees (xvii) and (xviii) in Fig. 5 are obtained. Step 2: compute a* = 1.53033.
[Fig. 5: Steps of Algorithm 3 for which a swap operation can be executed; for k = 3, 5, 6, 7, 8 (and the final node) the candidate sub-trees (i)-(xviii) are shown, each marked feasible or not feasible, with the corresponding values a*, c*, e*.]
The second data set, analyzed by Carroll et al. (1984), describes over-the-counter pain reliever
usage in remedying three common maladies. On the three arrays of dissimilarities the average
linkage algorithm has yielded the dendrograms reported in Vichi (1994).
On these data Algorithm 4 has been applied.
k = 1; Step 1: u(4,7;1) = 20.29; u(3,4;2) = 20.24; u(5,10;3) = 20.03; Steps 2, 3: E{d(4,7;i), i = 1,2,3} = 26.76;
E{d(3,4;i), i = 1,2,3} = 22.68; E{d(5,10;i), i = 1,2,3} = 20.96; Step 4: DEV{d(5,10;i), i = 1,2,3} = 1.8066801. Thus, 5 and
10 are aggregated at level 20.96, and the o.f. = 1.8066801.
For k = 2, 3, 4, 5, 6, 7, 8, Steps 1 to 4 can be executed in a similar way.
k = 9: Step 1: u(3,4,1,7,6,9; 5,10,2,8; 1) = 37.41; u(3,4,1,7,6,9; 5,10,2,8; 2) = 34.65; u(3,4,1,7,6,9; 5,10,2,8; 3) =
33.46; Steps 2, 3: E{d(3,4,1,7,6,9; 5,10,2,8; i), i = 1,2,3} = 35.17; Step 4: DEV{d(3,4,1,7,6,9; 5,10,2,8; i), i = 1,2,
3} = 197.09015. Thus, fuse 3,4,1,7,6,9 with 5,10,2,8 at level 35.17; the o.f. value is 232.73522 +
197.09015 = 429.82538.
Thus, the dendrogram obtained can be synthesized as follows: [consensus dendrogram omitted].
The same result has been obtained by Vichi (1994) through the solution of [P3] by the truncated-Newton
method, and also through Algorithm 2 briefly outlined in this paper.
References
Adams, E.N. (1972): Consensus techniques and comparison of taxonomic trees, Systematic
Zoology, 21, 390-397.
Barthelemy, J.P., Leclerc, B. and Monjardet, B. (1986): On the use of Ordered Sets in Problems of
Comparison and Consensus of Classifications, Journal of Classification, 3, 187-224.
Barthelemy, J.P. and McMorris, F.R. (1986): The median procedure for n-trees, Journal of
Classification, 3, 329-334.
Barthelemy, J.P. and Monjardet, B. (1981): The Median Procedure in Cluster Analysis and Social
Choice Theory, Mathematical Social Sciences, 1, 235-267.
Carroll, J.D., Clark, L.A. and De Sarbo, W.S. (1984): The representation of three-way proximity
data by single and multiple tree structure models, Journal of Classification, 1, 24-74.
Carroll, J.D. and Pruzansky, S. (1980): Discrete and Hybrid Scaling Models. In: Similarity and
Choice, Lantermann, E.D. and Feger, H. (eds.), Huber, Bern, 108-139.
Cunningham, K.M. and Ogilvie, J.C. (1972): Evaluation of hierarchical grouping techniques: a
preliminary study, Computer Journal, 15, 209-213.
Day, W.H.E., McMorris, F.R. and Meronk, D.B. (1986): Axioms for Consensus Functions Based on
Lower Bounds in Posets, Mathematical Social Sciences, 12, 185-190.
De Soete, G. (1984): A Least Squares Algorithm for fitting an ultrametric tree to a dissimilarity
matrix, Pattern Recognition Letters, 2, 133-137.
Everitt, B.S. (1993): Cluster Analysis, 3rd edition, Edward Arnold.
Ghashghai, E., Stinebrickner, R. and Suters, W.H. (1989): A Family of Consensus Dendrogram
Methods, abstract of the paper presented at the Second Conference of the IFCS, Charlottesville.
Gordon, A.D. (1987): A Review of Hierarchical Classification, Journal of the Royal
Statistical Society, A, 150, 2, 119-137.
Gordon, A.D. (1996): Hierarchical Classification. In: Clustering and Classification, Arabie, P.
et al. (eds.), World Scientific, 65-121.
Krivanek, M. and Moravek, J. (1986): NP-Hard Problems in Hierarchical-Tree Clustering, Acta
Informatica, 23, 311-323.
Lapointe, F.J. and Cucumel, G. (1991): The Average Consensus, abstract of the paper presented at
the Third Conference of the IFCS, Edinburgh, Scotland.
Lefkovitch, L.P. (1985): Euclidean Consensus Dendrograms and Other Classification Structures,
Mathematical Biosciences, 74, 1-15.
Margush, T. and McMorris, F.R. (1981): Consensus n-trees, Bulletin of Mathematical Biology,
43, 239-244.
McMorris, F.R., Meronk, D.B. and Neumann, D.A. (1983): A View of Some Consensus Methods
for Trees. In: Numerical Taxonomy, Felsenstein, J. (ed.), Springer-Verlag, Berlin.
McMorris, F.R. and Neumann, D.A. (1983): Consensus Functions on Trees, Mathematical
Social Sciences, 4, 131-136.
Neumann, D.A. (1983): Faithful consensus methods for n-trees, Mathematical Biosciences, 63,
271-287.
Neumann, D.A. and Norton, V.T. (1986): On Lattice Consensus Methods, Journal of
Classification, 3, 225-255.
Powell, M.J.D. (1983): Variable Metric Methods for Constrained Optimization. In: Mathematical
Programming: The State of the Art, Bachem, A. et al. (eds.), Springer-Verlag, 288-311.
Rosen, D.E. (1978): Vicariant Patterns and Historical Explanation in Biogeography, Systematic
Zoology, 27, 159-188.
Stinebrickner, R. (1984a): s-consensus trees and indices, Bulletin of Mathematical Biology, 46,
923-935.
Vichi, M. (1993): Un algoritmo dei minimi quadrati per interpolare un insieme di classificazioni
gerarchiche con una classificazione consenso, Metron, 51, 3-4, 139-163.
Vichi, M. (1994): An algorithm for the consensus of hierarchical classifications, Proceedings of the
Italian Statistical Society, 37, 261-268.
Vichi, M. (1995): Principal Classification analysis of a three-way data set, presented at the
meeting Analisi dei dati multidimensionali, Napoli, 30-31 October.
On the Minimum Description Length (MDL) Principle
for Hierarchical Classifications
Peter G. Bryant
Graduate School of Business Administration
University of Colorado at Denver
Campus Box 165
Denver, Colorado 80217-3364 U.S.A.
1. Background
1.1 Hierarchical clustering
Commonly used hierarchical clustering procedures group observations into a nested
sequence of classifications. Often that sequence is represented by a tree or dendrogram.
A Simple Example: To fix ideas, let us consider a simple example consisting of the
seven univariate observations
y = (1, 2, 5, 7, 12, 16, 20)ᵗ,   (1)
which are to be grouped in some appropriate manner.
To use agglomerative hierarchical methods, we must specify an appropriate distance
measure such as Euclidean distance, city-block distance, etc., and an aggregation
criterion such as single linkage or complete linkage, which specifies how the distances
between groups of observations are determined from the distances between individual
observations. Such measures and aggregation criteria are discussed in standard
textbooks such as Everitt (1993). The tree produced by single linkage clustering using
Euclidean distance for the data in (1) is given in Fig. 1.
1.2 The problem of cutting the tree
Hierarchical methods do not require that we specify a priori how many groups are
to be found, and this is often an advantage, but neither do they give us specific
guidance on how many groups we have actually found. For those problems in which
the tree is not the fundamental object of interest, but is simply a means to obtain a
final grouping, the user must determine at what point it is useful to "cut" the tree.
For example, in Fig. 1, if we cut the tree at (vertical) level 4.5, say, we obtain two
groups, (1,2,5,7) and (12,16,20), while if we cut the tree at level 3.5, we obtain 4
groups, (1,2,5,7), (12),(16), and (20). The finer the subdivision, the more accurate
(in some sense) the description of the data is, but the additional accuracy comes at
the expense of a more complicated model. At what point, then, should we cut the
tree?
[Fig. 1: Single linkage dendrogram of the data in (1); horizontal axis x, vertical axis Euclidean distance (levels 1 to 5).]
1.3 Possible approaches
At any given level of aggregation, that is, for any point at which we cut the tree,
we usually obtain a corresponding figure of merit of some kind, such as the "pooled
within group distance." The smaller this measure, the better the grouping describes
our data. One way to determine an appropriate grouping is to plot this figure of
merit versus the level of aggregation. The resulting curve often displays distinct dips
at one or more points, and such points indicate clusterings which appear "significant"
in some sense. How much of a dip is enough to be considered significant is harder to
specify, though. For Euclidean distance, Duda and Hart (1973), for example, suggest
referring the ratio of two successive figures of merit to some critical value, although
in many cases of clustering, the sampling theory assumptions underlying classical
statistical approaches to determining such critical values will be violated. In the
next section, I explore an approach based on the MDL principle, an approach which
doesn't depend on sampling theory.
MDL = (n/2) ln s² + ((n − p)/2) ln(n/2) − ln Γ((n − p)/2) + (p/2) ln(nR²/(2π))   (2)
where R is derived from the vector of least squares coefficients, suitably standardized.
The details of this derivation are given in Bryant (1996).
2.3 Application to Cutting a Hierarchical Tree
To each level of a tree representing p groups, there corresponds a Gaussian model
with p independent variables. For example the design matrix X corresponding to the
top level of Fig. 1 (a single group) is
X = (1 1 1 1 1 1 1)^t,
while that corresponding to the division into the groups (1,2,5,7) and (12,16,20) is
X = (1 1 1 1 0 0 0
     0 0 0 0 1 1 1)^t,
and so forth. To each grouping represented by the tree, there will correspond an MDL
figure of merit (2), combining its ability to represent the data with the complexity
of the description. The best level at which to cut the tree is the one yielding the
smallest value of MDL.
The criterion given above is for univariate data. In clustering, it is more usual to have
m-variate data (m > 1), and for such cases we replace s² in (2) by the total within-group
squared Euclidean distance divided by n' = nm, and replace n and p by n' = nm and
p' = pm, respectively. For non-Euclidean distance measures d, the corresponding
models would use a probability density measure proportional to e^(−d), though the
detailed calculations to produce an analogue of (2) will often be messy.
They suggest that for these data, each finer subdivision of the data is preferred at
least slightly to the one which precedes it, until we reach the last: splitting the (5,7)
group into two components seems to cost more than it is worth.
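Since s and R enter (2) only through a few logarithms, the criterion is easy to evaluate directly. The following minimal sketch (Python; the function name is ours, and s² and R² are assumed to be computed as in Bryant (1996)) reproduces the MDL column of Table 3 below from the tabulated s and R values, up to their rounding:

from math import lgamma, log, pi

def mdl(n, p, s2, R2):
    # Formula (2): fit term plus description-length terms.
    return (0.5 * n * log(s2)
            + 0.5 * (n - p) * log(n / 2)
            - lgamma((n - p) / 2)
            + 0.5 * p * log(n * R2 / (2 * pi)))

for p, s, R in [(10, .760, .547), (15, .614, 1.278),
                (20, .479, 1.642), (25, .434, 1.814)]:
    print(p, round(mdl(60, p, s ** 2, R ** 2), 2))
    # prints approximately 19.03, 20.94, 17.00, 20.46; Table 3 lists
    # 19.02, 20.92, 17.00, 20.39 (the gaps reflect the rounding of s, R)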
Five properties are given for each of twelve breakfast cereals. A clustering of these
observations using standardized variables, the complete linkage criterion, and Euclidean
distance is summarized in the tree in Fig. 2, and the corresponding figures of
merit from the MDL criterion are listed in Tab. 3 for several groupings derived by
cutting the tree.

Figure 2: Complete linkage tree for the cereal data (vertical axis: Euclidean distance; horizontal axis: Cereal ID Number, in the order 10, 1, 4, 7, 9, 6, 8, 2, 11, 3, 5, 12).
We see, for example, that the three-group clustering is not preferred to that with
two groups, but the proposed division into four groups is preferred to the division
into three groups. Such "reversals" may happen, since for Euclidean distance, the
optimal division into, say, k + 1 groups is not necessarily a subdivision of the division
into k groups. For these data, it seems likely that the results may be sensitive to the
particular distance and aggregation criteria used, and to the choice of a hierarchical
method rather than some other kind. The analyst would be well-advised to explore
these issues further, rather than simply accepting the complete linkage, Euclidean
distance results.
Table 3: MDL Criteria for Clustering Johnson and Wichern's Cereal Data (n' = 60)
Number of Groups MDL p' s R
2 19.02 10 .760 .547
3 20.92 15 .614 1.278
4 17.00 20 .479 1.642
5 20.39 25 .434 1.814
4. Remarks
The MDL approach as explored here is clearly easiest to apply in the case of mathematically
tractable measures of error, such as least squares, at least in the sense that
the formulae correspond naturally to the distances being used.
On the other hand, (2) could be used to assess any series of classifications whatever,
hierarchical or not, without reference to the distances or other criteria used to gen-
erate them. It can thus be used as a kind of external check on the results of other
methods. It seems likely that the MDL measures will be most useful when combined
with other measures and results. They are intended to augment careful thought and
reflection, not to replace them.
The last subdivision, in which all observations are distinct and there is no clustering,
has no corresponding MDL figure of merit, as the sum of squared errors becomes 0.
This will often be of little practical consequence, though it is theoretically unappealing.
Finally, note that other MDL criteria are possible, too, and they will not necessarily
lead to identical conclusions. The differences among them arise from different
specifications of allowable ranges for the parameters, scaling of the observations, etc.
These different specifications are roughly analogous to different prior distributions in
Bayesian analysis, though the exact formalisms are different. Some remarks on this
are given in Bryant (1996).
References
Bryant, P. (1996): The Minimum Description Length Principle for Gaussian Regression, Working Paper 1996-08, University of Colorado at Denver, Graduate School of Business Administration, Denver, Colorado 80217-3364.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis, John Wiley & Sons, New York.
Everitt, B. S. (1993): Cluster Analysis, Edward Arnold, London.
Johnson, R. A. and Wichern, D. W. (1988): Applied Multivariate Statistical Analysis, second edition, Prentice-Hall, Englewood Cliffs, N. J.
Rissanen, J. (1987): Stochastic complexity, Journal of the Royal Statistical Society, Series B, 49, 3, 223-265.
Rissanen, J. (1989): Stochastic Complexity in Statistical Inquiry, World Scientific Publishing Co., Singapore.
Rissanen, J. (1996): Shannon-Wiener information and stochastic complexity, In: Proceedings, N. Wiener Centenary Congress, East Lansing, Michigan.
Consensus Methods for Pyramids and Other
Hypergraphs
J. Lehel, F. R. McMorris, R. C. Powers
Department of Mathematics
University of Louisville
Louisville, KY 40292
U.S.A.
an interval of P. (See Duchet (1995) for standard terminology. Note that a graph is
simply a hypergraph with each edge having two vertices.) Thus every pyramid is an
interval hypergraph. A hypergraph H is a tree hypergraph if there is a tree T so that
every A ∈ H is a subtree of T. Let I denote the set of interval hypergraphs on S, TB
the set of totally balanced hypergraphs on S, and TH the set of tree hypergraphs
on S. From results in Lehel (1983) and Lehel (1985) we have that I ⊆ TB, TB
= W ∩ TH, and H ∈ TB if and only if every subhypergraph of H is in TH. Thus
the complete list of inclusions is T ⊆ P ⊆ I ⊆ TB ⊆ W (TB ⊆ TH), with examples
existing that show proper inclusions.
Because of the above characterization of TB as those weak hierarchies that are also
tree hypergraphs, it is our opinion that totally balanced hypergraphs merit further
study for possible uses in classification theory. However, in this paper our concern is
with consensus methods for various hypergraphs, and we now turn our attention to
this topic.
2. Consensus
Let ℋ denote a class of hypergraphs on S. A consensus function on ℋ is a mapping
C : ℋ^k → ℋ, where k is a fixed positive integer. Elements of ℋ^k are called profiles and
are denoted by π = (H_1, ..., H_k), π′ = (H′_1, ..., H′_k), etc. Among the general types of
consensus functions are the counting rules and the intersection rules. A counting rule
puts a cluster in C(π) if it appears sufficiently often in the hypergraphs making up the
input profile π. For example, the majority rule on T (Margush and McMorris (1981))
puts a cluster in the output if it appears in more than half of the input hierarchies.
Counting rules from T^k into T were characterized in McMorris and Neumann (1983),
and counting rules from (T ∪ W)^k into W were characterized in McMorris and Powers
(1991). We will shortly investigate the possibilities for counting rules on P and TB.
An intersection rule puts a cluster in the output C(π) when it is the intersection
of certain clusters from the input hierarchies in π. For H ∈ ℋ, let h : H → Z_0
(Z_0 denotes the set of nonnegative integers) be defined by h(A) = t if and only if
S = A_0 ⊃ A_1 ⊃ ... ⊃ A_t = A (proper inclusion) with each A_i ∈ H, and of maximum
length. When ℋ = T, h is an easily visualized height function. For π = (H_1, ..., H_k) ∈
ℋ^k and j ∈ Z_0 let L_j(π) = {X_1 ∩ ... ∩ X_k : X_i ∈ H_i and h(X_i) = j for i = 1, ..., k}*.
(For a set of subsets R, R* = {X ∈ R : X ≠ ∅}.) Now set C_D(π) = ⋃_{j≥0} L_j(π).
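To fix these definitions computationally, the two families of rules can be sketched as follows (Python; representing clusters as frozensets, computing h recursively from the chain definition, and assuming S itself belongs to every input hypergraph are our choices, not the paper's):

from itertools import product

def counting_rule(profile, l):
    # M_l rule: keep a cluster iff it appears in more than l*k of the
    # k input hypergraphs (l = 1/2 gives the majority rule on T).
    k = len(profile)
    clusters = set().union(*profile)
    return {A for A in clusters if sum(A in H for H in profile) > l * k}

def height(A, H, S):
    # h(A) = t iff S = A_0 > A_1 > ... > A_t = A is a longest proper
    # chain in H; assumes S is a cluster of H.
    if A == S:
        return 0
    return 1 + max(height(B, H, S) for B in H if A < B)

def intersection_rule(profile, S):
    # C_D: intersect one cluster of height j from each hypergraph of the
    # profile, for every level j, keeping the nonempty intersections.
    heights = [{A: height(A, H, S) for A in H} for H in profile]
    out = set()
    for j in range(max(max(h.values()) for h in heights) + 1):
        levels = [[A for A in H if h[A] == j]
                  for H, h in zip(profile, heights)]
        for choice in product(*levels):
            X = frozenset.intersection(*choice)
            if X:
                out.add(X)
    return out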
The driving motivation for considering C_D(π) is that by intersecting clusters at the
same level across the profile π one might be able to produce a hypergraph whose
clusters represent areas of partial overlap (agreement). This is precisely the type
of information lost when using counting rules. A problem, of course, is that C_D(π)
might not be the same type of hypergraph as those making up the profile π. However,
when ℋ = T then C_D(π) ∈ T, and in this case C_D was studied in Neumann (1983).
Intersection rules on T are further investigated in Adams (1986), Powers (1995) and
Vach (1994). Problems start to arise even as we pass from T to P, and in McMorris
and Powers (1996) it is noted that C_D(π) need not be a pyramid when π ∈ P^k.
However, when each pyramid in π is based on the same linear ordering of S then
C_D(π) is an interval hypergraph, from which a pyramid can be formed. The resulting
consensus function on P is characterized in McMorris and Powers (1996).
We now seek counting rules for P and TB. Recall that a counting rule C : ℋ^k → ℋ
can be described by a threshold l. It is then referred to as an M_l-rule, with A ∈ M_l(π)
if and only if |{i : A ∈ H_i}| > lk for π ∈ ℋ^k. The codomain of M_l is of concern. For
example, when ℋ = T then the majority rule M_{1/2}(π) ∈ T for all π ∈ T^k (Margush
and McMorris (1981)); while if ℋ = W, then M_{2/3}(π) ∈ W for all π ∈ W^k (McMorris
and Powers (1991)). Clearly M_1(π) ∈ ℋ for all π ∈ ℋ^k where ℋ ∈ {T, P, I, TB, W};
M_1 is usually called the unanimity rule.
Surprisingly, counting rules other than the unanimity rule fail for P and TB, as our
example shows.
Example 1: Let S = {x_1, ..., x_n} with n ≥ 3. Define the hypergraphs H_1, ..., H_n as
follows:
H_1 = {S, {x_1}, ..., {x_n}, {x_1, x_2}, {x_2, x_3}, ..., {x_{n−1}, x_n}},
H_2 = {S, {x_1}, ..., {x_n}, {x_2, x_3}, {x_3, x_4}, ..., {x_n, x_1}},
...
H_n = {S, {x_1}, ..., {x_n}, {x_n, x_1}, {x_1, x_2}, ..., {x_{n−2}, x_{n−1}}}.
It is easy to see that each H_i ∈ P and thus H_i ∈ TB. Letting k = n and
π = (H_1, ..., H_k) we now see that M_l(π) has a special k-cycle for all l ∈ (0, 1) and is
thus not a totally balanced hypergraph (and hence not a pyramid). Therefore the
only l that works is l = 1.
If we try to dodge the problem pointed out in the example by requiring each pyramid
in π ∈ P^k to be defined from the same linear order of S, then any selection of clusters
from those that appear in the H_i's will give an interval hypergraph from which a
pyramid is easily formed by taking intersections of intervals and adding the singletons
and S. In particular, a cluster that appears in only one out of the k hypergraphs
could be part of the consensus output, and this is contrary to the notion of consensus.
Generalizing the M_{2/3}-rule for weak hierarchies gives the following result.
Theorem: Let π = (H_1, ..., H_k) where each hypergraph H_i has no special cycle of
length m. Set l = (m−1)/m and assume that ⌈lk⌉ < k. Then M_l(π) has no special cycle
of length m.
Proof: Let A_1, ..., A_m ∈ M_l(π). Then, for j = 1, ..., m, |{i : A_j ∈ H_i}| > lk. Since
m⌈lk⌉ > (m − 1)k it follows that there exists i ∈ {1, ..., k} such that A_1, ..., A_m ∈ H_i.
Since H_i has no special cycle of length m we have that A_1, ..., A_m is not a special
cycle of length m. Hence M_l(π) has no special cycle of length m. □
We point out that if l < (m−1)/m, then M_l(π) might have a special cycle of length m. This
leads to another interpretation as to why M_1 is the only counting rule that works
on TB. If π ∈ (TB)^k and we are trying to eliminate special cycles of all lengths in
M_l(π), we must have lim_{m→∞} l = lim_{m→∞} (m−1)/m = 1.
We are now ready to make a proposal for an approach to consensus for hypergraphs
that utilizes clusters from both counting and intersection rules. This procedure is
first described in general terms as follows: Let ℋ be a fixed class of hypergraphs for
which there is a smallest l ∈ (0, 1] such that M_l(π) ∈ ℋ for all π ∈ ℋ^k. For π ∈ ℋ^k,
consider C_D(π) and add clusters from C_D(π) subject to preserving membership in ℋ
and other appropriate constraints.
To illustrate this approach consider ℋ = P. We have seen that l = 1 for pyramids, so
for π ∈ P^k we first form the unanimity consensus M_1(π). Next construct C_D(π) and sort
C_D(π) = {A_1, ..., A_m} according to a criterion such as size, |A_1| ≥ |A_2| ≥ ... ≥ |A_m|.
Now consider the hypergraph M_1(π) ∪ {A_1}. The idea is to have A_1 as a cluster in the
consensus output if and only if M_1(π) ∪ {A_1} is a pyramid. This procedure continues
until a decision is made about the last cluster A_m. Thus the final consensus pyramid
gives clusters that are obtained by either counting or intersection. One should, however,
consider exact algorithms and their associated complexities, and we leave this
for future work.
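A minimal sketch of this greedy procedure (Python, reusing intersection_rule from the sketch above; the pyramid test is left abstract as an assumed callable, since testing the pyramid property requires machinery not developed here):

def consensus_pyramid(profile, S, is_pyramid):
    # Unanimity clusters M_1: those appearing in every input pyramid.
    result = set(profile[0]).intersection(*profile[1:])
    # Sort C_D(pi) by size, |A_1| >= |A_2| >= ... >= |A_m|.
    candidates = sorted(intersection_rule(profile, S), key=len, reverse=True)
    # Keep A_i iff the current hypergraph together with A_i is a pyramid.
    for A in candidates:
        if is_pyramid(result | {A}):
            result = result | {A}
    return result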
Acknowledgement
The research of F.R. McMorris was supported by the United States Office of Naval
Research Grant N00014-95-1-0109.
3. References
Adams, E.N. III (1986): N-trees as nestings: Complexity, Similarity, and Consensus, Journal of Classification, 3, 2, 299-317.
Bandelt, H.-J. and Dress, A. (1989): Weak hierarchies associated with similarity measures: an additive clustering technique, Bulletin of Mathematical Biology, 51, 1, 133-166.
Bertrand, P. (1995): Structural Properties of Pyramidal Clustering, In: Partitioning Data Sets, Cox, I. et al. (eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 19, 35-53, AMS, Providence, RI.
Bertrand, P. and Diday, E. (1991): Les pyramides classifiantes: une extension de la structure hiérarchique, C. R. Acad. Sci. Paris, Série I, 693-696.
Duchet, P. (1995): Hypergraphs, In: Handbook of Combinatorics, Graham, R. et al. (eds.), Vol. 1, 381-432, MIT Press, Cambridge, MA.
Gaul, W. and Schader, M. (1994): Pyramidal classification based on incomplete dissimilarity data, Journal of Classification, 11, 2, 171-193.
Lehel, J. (1983): Helly-hypergraphs and abstract interval structures, ARS Combinatoria, 16-A, 239-253.
Lehel, J. (1985): A characterization of totally balanced hypergraphs, Discrete Mathematics, 57, 59-65.
Margush, T. and McMorris, F.R. (1981): Consensus n-trees, Bulletin of Mathematical Biology, 43, 239-244.
McMorris, F.R. and Neumann, D.A. (1983): Consensus functions defined on trees, Mathematical Social Sciences, 4, 131-136.
McMorris, F.R. and Powers, R.C. (1991): Consensus weak hierarchies, Bulletin of Mathematical Biology, 53, 679-684.
McMorris, F.R. and Powers, R.C. (1996): Intersection rules for consensus hierarchies, In: Proceedings of the Third International Conference on Ordinal and Symbolic Data Analysis, Diday, E. et al. (eds.), 301-308, Springer Verlag, Berlin.
Neumann, D.A. (1983): Faithful consensus methods for n-trees, Mathematical Biosciences, 63, 271-287.
Powers, R.C. (1995): Intersection rules for consensus n-trees, Applied Mathematics Letters, 8, 4, 51-55.
Vach, W. (1994): Preserving consensus hierarchies, Journal of Classification, 11, 1, 59-77.
Van Cutsem, B. (1994): Classification and dissimilarity analysis, Lecture Notes in Statistics, New York.
On the Behavior of Splitting Criteria
for Classification Trees
Roberta Siciliano, Francesco Mola
Dipartimento di Matematica e Statistica
Università di Napoli Federico II
Via Cintia - Monte S. Angelo
80126 Naples, Italy
e-mail: r.sic@dmsna.dms.unina.it
[email protected]
proportion of cases that belong to class j given that they have category i of X at node
t, and p(i|t) is the proportion of cases that have category i of X at node t (Mola and
Siciliano, 1992). A further index can be proposed, namely the conditional entropy
index of Shannon, H_{Y|X}(t) = −Σ_i Σ_j p(j|i, t) log p(j|i, t).
Both of the above-mentioned indexes can be proved to be special cases of the following
general measure of the proportional reduction in the heterogeneity of the response
variable Y due to the information provided by the predictor X (globally considered):

γ_{Y|X}(t) = (i_Y(t) − Σ_i p(i|t) i_{Y|i}(t)) / i_Y(t)   (2)

where i_Y(t) is the measure of heterogeneity for the variable Y at node t and i_{Y|i}(t)
is the same measure for the conditional distribution of Y given the modality i of the
predictor X; using the Gini index yields the predictability τ index, whereas using
the entropy index yields the conditional entropy index.¹
As a result, we can describe the two stages of the splitting criterion using such a general
index γ_{Y|·}(t), defined both for the predictors and for the splits generated by a given
predictor.
¹ Notice that the impurity measure in CART is nothing else than a heterogeneity index, and for
this reason we have adopted the same notation for the heterogeneity index here as for the impurity
measure in section 1.1.
and, at the second stage, we maximize for each split s ∈ S of the predictor X*.
At the first stage we can select more than one predictor; namely, we can order the
predictors with respect to the values of γ_{Y|X}(t), so that we can rank the predictors
with respect to their predictability power. The selected predictors are used to generate
the set of splits used at the second stage.
Notice also that the numerator of (4) is equivalent to the decrease in impurity (1).
As a result, the splitting rule in CART can be defined in terms of the dependency
index γ instead of the decrease in impurity. Indeed, if we consider in place of S the
set Q of all possible splitting variables generated by all predictors, then we could use
(4) directly as a splitting rule in CART. In this way, we provide at least two new
interpretations of CART splitting rules: one in terms of the predictability τ index,
the other in terms of the conditional entropy index. Using this result, Mola and
Siciliano (1997a) have recently introduced a fast splitting algorithm that is related to
the two-stage criterion but finds the same solution as the CART splitting
criterion.
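To make the two special cases concrete, the index (2) can be evaluated from the predictor-by-class contingency table of a node. A minimal sketch (Python with NumPy; the function names are ours): with the Gini heterogeneity it returns the predictability τ index, with the entropy heterogeneity the conditional entropy index:

import numpy as np

def gini(p):
    # heterogeneity i_Y = 1 - sum_j p_j^2
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # heterogeneity i_Y = -sum_j p_j log p_j
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gamma_index(table, het=gini):
    # table[i, j]: cases at the node with category i of X and class j of Y.
    n = np.asarray(table, dtype=float)
    p_i = n.sum(axis=1) / n.sum()              # p(i|t)
    i_y = het(n.sum(axis=0) / n.sum())         # i_Y(t), assumed nonzero
    i_y_cond = sum(p_i[i] * het(n[i] / n[i].sum())
                   for i in range(n.shape[0]) if n[i].sum() > 0)
    return (i_y - i_y_cond) / i_y              # formula (2)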
any dependency. Tables 1 and 2 describe the results using the Gini index of heterogeneity
and the entropy index, respectively, in order to find the best split at the root
node. In particular, for each combination of the factorial design, each cell of the table
gives the percentage of times that the best split of the CART splitting criterion is found
by the two-stage splitting criterion through the first best predictor X(1), the second
predictor X(2) and the third predictor X(3), respectively (see section 1.2).
For instance, for I = 3 and J = 2 in table 1 the best split according to CART has
been found in predictor X(1) 88% of the time and in predictor X(2) the remaining
12% of the time.
         J = 2                         J = 3
I    X(1)  X(2)  X(3)  overall    X(1)  X(2)  X(3)  overall
3    88%   12%    0%    100%      79%   16%    4%     99%
4    88%   10%    2%    100%      72%   22%    4%     98%
5    81%   16%    2%     99%      75%   14%    5%     94%
6    72%   21%    7%    100%      76%   13%    3%     92%
Table 1: Percentage of times that the two-stage splitting criterion finds the best split
of CART (using the Gini index of heterogeneity)
         J = 2                         J = 3
I    X(1)  X(2)  X(3)  overall    X(1)  X(2)  X(3)  overall
3    88%   12%    0%    100%      79%   18%    2%     99%
4    89%    9%    2%    100%      77%   15%    5%     97%
5    85%   12%    2%     99%      74%   20%    1%     95%
6    76%   17%    7%    100%      70%   11%    8%     89%
Table 2: Percentage of times that the two-stage splitting criterion finds the best split
of CART (using the entropy index)
Both tables 1 and 2 show that, whatever splitting rule is adopted, considering the three
best predictors in the two-stage splitting criterion finds the same best split as CART
a high percentage of the time (see column "overall"), especially when J = 2,
although we have not imposed any dependency structure.
As soon as one of the predictors is related to the response variable, the
CART splitting criterion and the two-stage splitting criterion using only the one best
predictor give the same result. This has been the result of a further simulation study
in which one of the predictors was generated according to a dependency structure
(passing from a low level of dependency to a high level of dependency as measured
by the dependency γ index).
on some of the predictors, then different splitting rules agree on the same best split.
For this purpose we consider the factorial design described in section 2.1, and we calculate
the percentage of times that the Gini index of heterogeneity and the entropy
index in the CART splitting procedure attain the same result; analogously, we calculate
the percentage of times that the predictability τ index and the conditional
entropy index in the two-stage splitting procedure yield the same result. We describe
these results in tables 3 and 4, where we notice a certain coherence in the results,
especially for J = 2.
I    J = 2 (Gini-Entropy)    J = 3 (Gini-Entropy)
3          99%                     90%
4          98%                     84%
5          97%                     82%
6          93%                     70%
Table 3: Percentage of times that the best split is the same using the Gini index and
the entropy index in the CART splitting criterion
I    J = 2 (τ-Entropy)       J = 3 (τ-Entropy)
3          99%                     91%
4          99%                     88%
5          97%                     84%
6          97%                     72%
Table 4: Percentage of times that the best split is the same using the τ index and
the conditional entropy index in the two-stage splitting criterion
of uterine irritability (X_7, 0=no, 1=yes), physician visits during the last trimester (X_8,
0=none, 1=one or more).
We have analyzed this data set using the CART splitting procedure. The split sequence
to grow the final binary tree is shown in table 5. Most of the splits have
been generated by numerical predictors. We have repeated the analysis considering
a categorization of the numerical predictors. The result is shown in table 6. It is
interesting to notice how the categorization of numerical predictors modifies the tree
structure and thus the misclassification rate in both nonterminal and terminal nodes
(see both columns "cases" and "error rate").
Table 5: Split sequence to grow the final binary tree on the original data (columns: node t, cases %, error rate %, best predictor, split, terminal class).
Table 6: Split sequence to grow the final binary tree after categorization of the numerical predictors (same columns as Table 5).
A skeptical researcher might consider the results of tree procedures "unstable"; this
is in fact a crucial point that we discuss in this section.
Through some resampling methods we analyze the stability of classification tree procedures
with respect to the structure of the final binary tree and the related
misclassification rates R(t). In particular, we have analyzed the behavior of the
CART splitting procedure when the test sample is used to validate the classification
rule (see Breiman et al., 1984). Considering the same data set described in section
2.3, we have repeated the analysis 1000 times taking 30% of cases randomly chosen
for the test sample; again we have repeated the analysis 1000 times taking 20% of
cases for the test sample; finally we have repeated the analysis 1000 times taking 10%
of cases for the test sample. For each analysis we have considered final classification
trees with 3 and 4 terminal nodes respectively, and we have calculated the related
misclassification rates. In figure 1 we show two series of boxplots in order to describe
the distribution of the misclassification rate considering 3 terminal nodes (boxplots
above) and 4 terminal nodes (boxplots below). As a result, the misclassification rate
appears to be quite "unstable" when the test sample takes 30% of cases. In conclusion,
when the learning sample is not too large, cross-validation is recommended
since it can provide more stable validations of the classification rule.
Figure 1: Boxplots of the distribution of the misclassification rate for trees with 3 terminal nodes (above) and 4 terminal nodes (below), for test samples of 30%, 20% and 10% of cases.
3 Concluding remarks
In this paper we have discussed important aspects concerning the behavior of splitting
criteria in classification trees. We have shown how the two-stage splitting criterion
can be fruitfully used to select a number of predictors that generate, with a high
confidence level, the best split according to the CART criterion.
We have also verified that the structure of the binary tree is not influenced by the
choice among alternative splitting rules, but rather by the type of predictors and
their treatment in the splitting procedure.
There are several classification tree procedures proposed in the literature, and in recent
years a lot of attention has been given to specialized software for applying such
procedures (e.g., CART, RECPAM). Furthermore, it is possible to find binary segmentation
procedures also in statistical packages such as SPSS, S+, SPAD.S.
As a result, the number of users of such procedures is increasing, and "nonexpert
researchers" might be willing to apply classification tree procedures for statistical
analysis. It becomes then evident that a correct use of such methods requires a certain
experience, or at least attention to some crucial aspects such as the simultaneous
treatment of numerical and categorical predictors, the choice of the splitting
rule, the method for validating the tree, and so on. We believe that it is worthwhile to
discuss some of the problems and peculiar aspects of classification tree procedures as
we have described in this paper; we hope that the present contribution provides a
good step in this direction.
Acknowledgements: This research was supported for the first author by CNR research
funds number 95.02041.CT10 and for the second author by CNR research funds number
92.1872.P.
References
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and Regression Trees, Wadsworth, Belmont, CA.
Celeux, G. and Lechevallier, Y. (1982): Méthodes de segmentation non paramétriques, Revue de Statistique Appliquée, 4, 39-53.
Ciampi, A. and Thiffault, J. (1987): Recursive Partition and Amalgamation (RECPAM) for Censored Survival Data: Criteria for Tree Selection, Statistical Software Newsletter, 14, 2, 78-81.
Hosmer, D.W. and Lemeshow, S. (1990): Applied Logistic Regression, J. Wiley, New York.
Mingers, J. (1988): An empirical comparison of selection measures for decision tree induction, Machine Learning, 3, 319-342.
Mola, F. (1993): Aspetti metodologici e computazionali delle tecniche di segmentazione binaria. Un contributo basato su funzioni di predizione, PhD dissertation, University of Naples.
Mola, F. and Siciliano, R. (1992): A Two-Stage Predictive Splitting Algorithm in Binary Segmentation, in Computational Statistics (Compstat '92 Proceedings), Dodge, Y. and Whittaker, J. (eds.), 1, 179-184, Physica Verlag.
Mola, F. and Siciliano, R. (1994): Alternative Strategies and CATANOVA Testing in Two-Stage Binary Segmentation, in New Approaches in Classification and Data Analysis, Diday, E. et al. (eds.), 316-323, Springer Verlag.
Mola, F. and Siciliano, R. (1997a): A Fast Splitting Procedure for Classification Trees, Statistics and Computing (to appear).
Mola, F. and Siciliano, R. (1997b): Visualizing Data in Tree-Structured Classification, Proceedings of IFCS-96: Data Science, Classification and Related Methods (Hayashi, C. et al., eds.), Springer Verlag, Tokyo.
Mola, F., Klaschka, J. and Siciliano, R. (1996): Logistic Classification Trees, COMPSTAT 96 Proceedings (A. Prat, ed.), Physica Verlag.
Taylor, P.C. and Silverman, B.W. (1993): Block Diagrams and Splitting Criteria for Classification Trees, Statistics and Computing, 3, 147-161.
Fitting Pre-specified Blockmodels
Vladimir Batagelj 1, Anuška Ferligoj 2, and Patrick Doreian 3
1 University of Ljubljana, FMF, Dept. of Mathematics
Jadranska 19, 1000 Ljubljana, Slovenia
2 University of Ljubljana, Faculty of Social Sciences
P.O. Box 47, 1109 Ljubljana, Slovenia
3 University of Pittsburgh, Dept. of Sociology
Pittsburgh, PA 15260, USA
1. Introduction.
The goal of conventional blockmodeling is to reduce a large, potentially incoherent
network to a smaller comprehensible structure that can be interpreted more read-
ily. Blockmodeling, as an empirical procedure, is based on the idea that units in a
network can be grouped according to the extent to which they are equivalent, under
some meaningful definition of equivalence.
There are many inductive approaches for establishing blockmodels for a set of social
relations defined over a set of social actors. Some form of equivalence is specified and
clusterings are sought that are consistent with the specified equivalence. In all cases,
the analyses respond to empirical information in order to establish the blockmodel.
Another view of blockmodeling is deductive in the sense of starting with a blockmodel
that is specified in terms of substance prior to an analysis. In this paper we present
methods where a set of observed relations are fitted to a pre-specified blockmodel.
2. Basic Terms
Network: Let E = {x_1, x_2, ..., x_n} be a finite set of units. The units are related
by binary relations R_t ⊆ E × E, t = 1, ..., r, which determine a network
N = (E, R_1, R_2, ..., R_r). In the following we restrict our discussion to a single relation
R described by a corresponding binary matrix R = [r_ij]_{n×n} where

r_ij = 1 if x_i R x_j, and r_ij = 0 otherwise.

In some applications r_ij can be a nonnegative real number expressing the strength of
the relation R between units x_i and x_j.
Cluster, clustering: One of the main procedural goals of blockmodeling is to identify,
in a given network, clusters (classes) of units that share structural characteristics
defined in terms of R. The units within a cluster have the same or similar connection
patterns to other units. They form a clustering C = {C_1, C_2, ..., C_k} which is a partition
of the set E: ⋃_i C_i = E and i ≠ j ⇒ C_i ∩ C_j = ∅. Each partition determines
an equivalence relation (and vice versa). Let us denote by ∼ the relation determined
by the partition C.

Figure 1: Types of connection between two sets; the left set is the ego-set.
Block: A clustering C also partitions the relation R into blocks R(C_i, C_j) = R ∩ C_i ×
C_j. Each such block is defined by units belonging to clusters C_i and C_j, in terms of
the arcs leading from cluster C_i to cluster C_j. If i = j, the block R(C_i, C_i) is called a
diagonal block.
Blockmodel: A blockmodel consists of structures obtained by identifying all units
from the same cluster of the clustering C. For an exact definition of a blockmodel we
have to be precise also about which blocks produce an arc in the reduced graph and
which do not. The reduced graph can be presented by a matrix, also called an image
matrix.
Block Types: Several possible block types can be defined. In Figure 1 nine block
types are presented (Batagelj, 1993). In the relational matrix below three types of
blocks can be found:

1 1 1 1   1 1 0 0
1 1 1 1   0 1 0 1
1 1 1 1   0 0 1 0
1 1 1 1   1 0 0 0
0 0 0 0   0 1 1 1
0 0 0 0   1 0 1 1
0 0 0 0   1 1 0 1
0 0 0 0   1 1 1 0
3. Blockmodeling - Formalization
A blockmodel is an ordered sextuple M = (U, K, T, Q, π, α) where:

• K ⊆ U × U is a set of connections;
• T is a set of predicates used to describe the types of connections between
different classes (clusters, groups, types of units) in a network. We assume that
nul ∈ T. A mapping π : K → T \ {nul} assigns predicates to connections;
and

∀(t, w) ∈ U × U \ K : nul(C(t), C(w)).

Note: T = {nul, com} implies a structural blockmodel (Lorrain and White, 1971),
and T = {nul, reg} implies a regular blockmodel (White and Reitz, 1983).
Let ∼ be an equivalence relation over E and [x] = {y ∈ E : x ∼ y}. We say that ∼
is compatible with T over a network N iff
4. Optimization
4.1 A Criterion Function
One of the possible ways of constructing a criterion function that directly reflects
the considered equivalence is to measure the fit of a clustering to an ideal one with
perfect relations within each cluster and between clusters according to the considered
equivalence (Batagelj, Doreian, and Ferligoj, 1992; Batagelj, 1993; Doreian, Batagelj,
and Ferligoj, 1994).
Given a set of types of connection T and a block R(X, Y), X, Y ⊆ E, we can
determine the strongest (according to the ordering of the set T) type T which is
satisfied by R(X, Y). In this case we set
We need to consider also the (many) cases where no type from T is satisfied. One
approach is to introduce the set of ideal blocks for a given type T ∈ T
and define the deviation δ(X, Y; T) of a block R(X, Y) from the nearest ideal block.
We can efficiently test whether the block R(X, Y) is of the type T (see Table 1). On
the basis of these characterizations we can also construct the corresponding measures
of deviation from the ideal realization. For the proposed types, all deviations are
sensitive:

δ(X, Y; T) = 0 ⇔ T(R(X, Y)).

Therefore a block R(X, Y) is of a type T exactly when the corresponding deviation
δ(X, Y; T) is 0. In the deviation δ we can also incorporate the values of lines, v, if
the network has valued arcs.
Based on the deviation δ(X, Y; T) we introduce the block-error ε(X, Y; T) of R(X, Y)
for type T. Two examples of block-errors are
where w(T) > 0 is a weight for type T. We extend the block-error to the set of
feasible types T by defining

ε(X, Y; T) = min_{T∈T} ε(X, Y; T)   and   π(μ(X), μ(Y)) = argmin_{T∈T} ε(X, Y; T).

To make π well-defined, we order (by priorities) the set T and select the first type from
T which minimizes ε. We combine block-errors into a total error, a blockmodeling
criterion function

P(μ; T) = Σ_{(t,w)∈U×U} ε(C(t), C(w); T).

The criterion functions based on block-errors ε_1 and ε_2 are denoted P_1 and P_2, respectively.
For the criterion function P_1(μ) we have P_1(μ) = 0 ⇔ μ is an exact blockmodeling.
Also for P_2, we obtain an exact blockmodeling μ iff the deviations of all blocks are 0.
The obtained optimization problem can be solved by local optimization. Once a partitioning
μ and types of connection π are determined, we can also compute the values
of connections by using the averaging rules.
4.2 Local Optimization
For solving the blockmodeling problem we use a local optimization procedure (a
relocation algorithm):
Determine the initial clustering C;
repeat:
    if in the neighborhood of the current clustering C
    there exists a clustering C' such that P(C') < P(C)
    then move to clustering C'.
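A sketch of the relocation step (Python; representing a clustering as a list of sets of units and taking the criterion function P of Section 4.1 as an assumed callable):

def relocate(clustering, P):
    # Move single units between clusters as long as each move strictly
    # decreases the criterion function P (local optimization).
    best = P(clustering)
    improved = True
    while improved:
        improved = False
        for src in clustering:
            for unit in list(src):
                for dst in clustering:
                    if dst is src:
                        continue
                    src.discard(unit); dst.add(unit)        # tentative move
                    value = P(clustering)
                    if value < best:
                        best, improved = value, True        # keep the move
                        break                               # unit stays in dst
                    dst.discard(unit); src.add(unit)        # undo the move
    return clustering, best

A richer neighborhood would also include transpositions, i.e. exchanging two units between clusters; the sketch uses moves of single units only.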
• we can fit the network to a partial model and analyze the residual afterward;
• we can also introduce different constraints on the model, for example: units x
and y are of the same type; or, types of units x and y are not connected; ...
The pre-specified blockmodel restricts the feasible block types in each cell of the 3 x 3 model over clusters 1, 2, 3 to the sets {0, reg} or {reg}.
The result is a subset of the set of inductive solutions with 7 errors. One of the
solutions is the following:
C_1 = {{pm, m7, a3}, {m1, m2, m3, m4, m5, m6, a1}, {a2}}
The solution is presented also in Figure 3. The black dots on the arcs denote superfluous
arcs (errors) according to the ideal solution, and the white dots the missing arcs.
The image matrices π and the corresponding deviation matrices d of the two models are:

π   1    2    3        d   1   2   3
1   reg  0    0        1   0   4   0
2   reg  reg  0        2   0   0   0
3   0    reg  0        3   0   3   0

π   1    2    3        d   1   2   3
1   0    0    0        1   0   1   0
2   reg  reg  0        2   2   0   3
3   0    reg  0        3   0   0   2

We can constrain our pre-specified blockmodel further by an additional constraint on
units in clusters: all advisers are in cluster 3. We obtained a single solution with 8
errors.
C_2 = {{pm}, {m1, m2, m3, m4, m5, m6, m7}, {a1, a2, a3}}
The solution is also presented in Figure 4.
The results of model fitting show that the hypothetical hierarchical model was obtained
with a minimal increase of error compared to the inductive solution (from 7 to 8).
This indicates that it represents the network structure well.
6. Further Research
There are several possible directions for further research in the field of blockmodeling.
At least two questions need attention:
• to define additional block types which are more appropriate for describing specific
network structures;
References:
Batagelj, V. (1991): STRAN - STRucture ANalysis, Manual, Ljubljana.
Batagelj, V., Doreian, P. and Ferligoj, A. (1992): An optimizational approach to regular equivalence, Social Networks, 14, 121-135.
Batagelj, V., Ferligoj, A. and Doreian, P. (1992): Direct and indirect methods for structural equivalence, Social Networks, 14, 63-90.
Batagelj, V. (1993): Notes on block modelling, In: Abstracts and Short Versions of Papers, 3rd European Conference on Social Network Analysis, München: DJI, 1-9. Extended version in print in Social Networks 1997.
Borgatti, S.P. and Everett, M.G. (1989): The class of all regular equivalences: Algebraic structure and computation, Social Networks, 11, 65-88.
Doreian, P., Batagelj, V. and Ferligoj, A. (1994): Partitioning Networks on Generalized Concepts of Equivalence, Journal of Mathematical Sociology, 19, 1, 1-27.
Doreian, P. and Mrvar, A. (1996): A Partitioning Approach to Structural Balance, Social Networks, 18, 2, 149-168.
Ferligoj, A., Batagelj, V. and Doreian, P. (1994): On Connecting Network Analysis and Cluster Analysis, In: Contributions to Mathematical Psychology, Psychometrics, and Methodology (G.H. Fischer and D. Laming, Eds.), New York: Springer.
Hlebec, V. (1993): Recall versus recognition: Comparison of two alternative procedures for collecting social network data, In: Developments in Statistics and Methodology (A. Ferligoj and A. Kramberger, Eds.), Metodološki zvezki 9, Ljubljana: FDV, 121-128.
Lorrain, F. and White, H.C. (1971): Structural equivalence of individuals in social networks, Journal of Mathematical Sociology, 1, 49-80.
Sampson, S.F. (1968): A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships, PhD thesis, Cornell University.
White, D.R. and Reitz, K.P. (1983): Graph and semigroup homomorphisms on networks of relations, Social Networks, 5, 193-234.
Robust impurity measures in decision trees
Tomas Aluja-Banet, Eduard Nafria
Dept. of Statistics and Operational Research
Universitat Politècnica de Catalunya
C. Pau Gargallo 5, 08028 Barcelona, Spain
E-mail: [email protected]
Summary: Tree-based methods are a statistical procedure for automatic learning from
data, their main characteristic being the simplicity of the results obtained. Their virtue is
also their defect, since the tree growing process is very dependent on data; small fluctuations
in data may cause a big change in the tree growing process. Our main objective was to
define data diagnostics to prevent internal instability in the tree growing process before
a particular split has been made. We present a general formulation for the impurity of a
node, as a function of the proximity between the individuals in the node and its representative.
Then, we compute a stability measure of a split, and hence we can define more robust splits.
Also, we have studied the theoretical complexity of this algorithm and its applicability to
large data sets.
1. Introduction
The objective of tree-based methods is to automatically detect which variables serve
to explain the behaviour of a response variable, whether quantitative or categorical.
They can be applied in the same context as other alternative methods, such as multiple
regression, discriminant analysis, logistic regression or neural networks. Their main
advantage is the simplicity of the results obtained and the possibility of automatic
generation of decision rules. This property links this methodology with AI techniques.
Thus the main usage is decision making. Their strength is also their weakness,
since the tree growing process is very dependent on data; a small fluctuation in data
may cause a major change in the topology of the tree. This raises the problem of
the stability of a tree. We distinguish internal stability from external stability in
the same sense as that stated by Greenacre (1984). External stability refers to the
sensitivity of the tree to independent random samples, and can be assessed by means
of a test sample or cross-validation, whereas by internal stability we mean the influence
exerted by each observation in the learning sample on the formed tree. Another
problem relating to the tree methodology is the computational cost, due to the recursive
nature of the algorithms and the large number of possible splits, which can be
very costly for large data sets. For this reason we have studied the complexity of the
algorithm in order to optimise it, proposing an efficient heuristic capable of coping
with large data sets, with almost linear cost depending on the number of individuals
and variables and the depth of the tree.
Since the pioneering work on AID of Sonquist et al. (1964), tree growing methodology
has consisted of splitting each group of individuals (node) recursively into two
groups, starting from the total sample n, according to a statistical criterion relating
the condition for splitting to the response variable. Since then, a great deal of research
has been done into the threshold-based criterion. Kass (1980) developed tree
methodology for a categorical response variable using a Chi-square split criterion,
and Celeux et al. (1982) proposed for the latter case a split criterion based on a
distance between distribution functions, while Ciampi (1991) proposed instead the
use of the deviance of a generalised linear model. Although the results obtained can
be satisfactory in applied research, with an error rate of the same order as alternative
methods, they do not escape the criticism of the optimality and goodness of the tree
obtained. The CART approach, introduced by Breiman et al. (1984), was an attempt
to solve these problems. Its main innovations consisted of:
1. Unification of the case of a categorical response variable (classification trees) with
that of a quantitative response variable (regression trees) within a similar framework.
2. Use of an impurity index to measure the heterogeneity of a node.
3. Pruning from a maximal tree instead of using a stop criterion.
4. Giving honest estimates of the misclassification error.
where p(j|t) is the probability of class j in node t and y_i is the value of the response
for an individual in node t.
Impurity indices should have a maximum value for classes with equal probability, a
value of 0 for a pure node, and should be a decreasing function through the splitting
process:

i(t) ≥ α i(t_l) + (1 − α) i(t_r),   0 ≤ α < 1.

Examples are the Gini index, the variance, the misclassification index and the absolute
deviation index.
Then, the split criterion consists of selecting the split which maximises the weighted
reduction of impurity between the parent node and its offspring (left and right):

Δi(t) = i(t) − (n_{t_l}/n_t) i(t_l) − (n_{t_r}/n_t) i(t_r)   (2)
i(t) = Σ_{i∈t} w_{it} δ(i, m_t) / Σ_{i∈t} w_{it}   (3)
where δ(i, m_t) is the distance between an individual i and m_t (obviously, all individuals
in the same response class j share the same distance). This formula reduces, for a
categorical response variable with uniform weights, to:

Figure: A node represented in the simplex of the response classes (Class 1, Class 2, Class 3).
For a regression tree, the geometrical interpretation is easier since the values of the
response variable are represented on the real line, m_t being the point on the real line
representing the node t, which minimises its impurity.
This formulation being very general, we can choose the distances δ(j, m_t) in a very
general sense. In particular, we can use the L2 norm. Then, it is easy to show that
for a classification tree the representative of the node coincides with the multinomial
vector of probabilities of classes, and the impurity index reduces to the well-known
Gini index; and that for a regression tree, the representative of the node is the mean
of the response in the node and the impurity is the variance. On the other hand, in the
case of the L1 norm, for a classification tree the representative coincides with the class
of maximum probability and the index reduces to twice the misclassification index,
and for a regression tree the representative is the median of the response in the node
and the impurity is the absolute deviation.
Furthermore, this formulation can allow for different misclassification costs. Let C be
the matrix of misclassification costs, and c_{ij} the cost of misclassifying in class j an
individual that belongs to class i. Then C_i will represent the overall cost of
misclassifying an individual of class i.
(5)
We can see that, in fact, introducing the misclassification costs entails overweighting
those response classes for which it is most dangerous to make a wrong assignment.
In any case, the reduction of impurity can be expressed as a function of the distances
of the individuals to the representative of the parent node and the distance to the
corresponding successor:

Δi(t) = (Σ_{i∈t} w_{it} δ(i, m_t) − Σ_{i∈t_l} w_{it_l} δ(i, m_{t_l}) − Σ_{i∈t_r} w_{it_r} δ(i, m_{t_r})) / Σ_{i∈t} w_{it}   (6)

where t_l and t_r represent the left and right child nodes of node t.
4. Stability analysis
Thus, from Formula 6, it is easy to calculate the contribution of any individual to the
reduction of impurity. It is simply the difference in the distances of this individual to
the representative of the parent node and to the representative of its corresponding
child node:

c_i = w_{it} (δ(i, m_t) − δ(i, m_{t'}))   (7)

where t' is the child node to which i is assigned. Notice that the contribution to the
reduction of impurity can be positive or negative, although on average this contribution
coincides with the overall impurity reduction:

Δi(t) = Σ_{i∈t} c_i / Σ_{i∈t} w_{it}   (8)

Then, the ratio of c_i to the average reduction of impurity Δi(t) is an easy way to diagnose
individuals with a strong influence on the split. Moreover, the distance between the
representatives of the child nodes is an indicator of the stability of the present split.
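For a classification node under the L2 metric, (7) and (8) take a particularly simple form: individuals sit at the vertices of the class simplex and the representative is the vector of class proportions. A minimal sketch (Python with NumPy; uniform weights w_it = 1, function name ours):

import numpy as np

def split_contributions(y, k, left):
    # y: class labels 0..k-1 at the node; left: boolean mask of the split.
    e = np.eye(k)[np.asarray(y)]        # individuals as simplex vertices
    left = np.asarray(left)
    m_t = e.mean(axis=0)                # representative; i(t) is the Gini index
    c = np.empty(len(e))
    for mask in (left, ~left):
        m_child = e[mask].mean(axis=0)  # representative of the child node
        c[mask] = (((e[mask] - m_t) ** 2).sum(1)
                   - ((e[mask] - m_child) ** 2).sum(1))   # Formula 7
    return c, c.mean()                  # mean of the c_i equals (8)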
It is clear that in classification trees with the L2 metric, due to the quadratic form
of the impurity index, instabilities occur when the splitting process leads to nodes
with very few members of at least one response class j, whereas the most stable case
is when the probabilities of the classes are similar, that is, the representative of the node
m_t is close to the centre of the convex polygon. In the L1 metric, the locus of the
representative of the node coincides with the class with maximum probability, thus
the distance of the remaining classes to the representative is equal to 2, that is, each
individual of the latter classes has the same influence.
In regression trees, instability may occur when dealing with nodes with some outlying
values; a split attempting to accommodate these outliers will reduce the impurity
significantly, and hence the distance between the representatives of both child nodes
will be large. Of course, one way of achieving a robust split that is insensitive to the
effect of outliers would be to use the L1 norm, that is, using the absolute deviation
as an impurity measure.
For each predictor variable we can define a function of the impurity reduction relative
to the impurity in the parent node, defined over all possible splits of this variable,
as follows:
For the case of a categorical response variable, it is natural to compute for every node
the empirical distribution function of every response class, F_j, and compare them with
the average distribution function F̄ of node t. Then, the split point can be defined
as the split u maximising a distance between these functions:

d_u = Σ_{j=1}^{k} |F_j(u) − F̄(u)|   (10)

where F_j(u) and F̄(u) are the empirical distribution function of class j and the
average distribution function of node t evaluated at split u. To use this split criterion
there should be an ordering among the possible splits of the predictor variable.
It is easy to see that this distance coincides with the Smirnov distance for the case
of two classes (Figure 3).
Furthermore, it coincides with the Celeux-Lechevallier (1982) index with uniform
weighting of the response classes. In fact, the difference from this latter index consists
in the different weighting of the response classes.
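A sketch of this criterion for an ordered (e.g. continuous) predictor (Python with NumPy; taking the average distribution function as the pooled empirical distribution function of the node, which is one natural reading, and assuming every class occurs in the node):

import numpy as np

def best_split(x, y, k):
    # For every candidate cut u, sum over the k classes the gaps between
    # the class ECDFs F_j(u) and the node-average distribution function.
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    n = len(x)
    counts = np.array([np.cumsum(y == j) for j in range(k)], dtype=float)
    F = counts / counts[:, -1:]           # one ECDF per response class
    F_bar = np.arange(1, n + 1) / n       # average distribution function
    d = np.abs(F - F_bar).sum(axis=0)     # d_u of (10) at every cut u
    best = int(np.argmax(d[:-1]))         # the last cut would be trivial
    return x[best], d[best]               # split: x <= x[best]

With two classes the returned distance reduces to the Smirnov distance |F_1(u) − F_2(u)|, consistent with the remark above.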
6. Complexity
The search for an optimal tree is an NP-complete problem. Thus, we should use
efficient heuristics. The most used heuristic consists of finding at each step the best
split among the whole set of binary partitions. This solution leads to fairly good
results obtained in a hierarchical fashion. However, the computational cost of this
heuristic is very high, and it requires an efficient algorithm to guarantee a reasonable
speed.
We have designed an algorithm of almost linear cost in the parameters of a tree.
Let us first define the parameters of a tree. These are the number of individuals n,
the number of total splits s, and the maximum depth of the tree d. Obviously, the
total number of splits depends on the number of variables and their type. See Table 1
for the number of splits according to the type of variable. Let p be the total number
of variables, which can be split according to type: p = p_b + p_o + p_n + p_c. For each
variable we have s_j splits with Σ_{j=1}^{p} s_j = s, and for a maximum depth of d we have
l ≤ 2^d − 1 nodes.
1. Cost of a split for a given variable: O(n_t) + C. This cost depends only on the
number of individuals in the node plus a constant.
2. Cost of all splits for a given variable: Σ O(n_t) = O(n_t · s_j). This cost
depends on the number of splits of the variable, which according to its type can
be optimised to the following results:
   O(n_t)                  (for a binary variable)
   O(n_t) + O(k)           (for an ordinal variable)
   O(n_t) + O(2^{k−1})     (for a nominal variable)
   O(n_t) + O(n · log(n))  (for a continuous variable)
3. Cost of all splits for a node: Σ_{j=1}^{p} O(n_t · s_j) = O(n_t · s). This is, of course, simply
the sum of the costs for all the active variables in a node, which, according to
the above result, reduces in most cases to O(n_t · p).
4. Cost of all splits in every node: Σ_{t=1}^{2^d−1} O(n_t · p) = O(p) Σ O(n) = O(p · n · d).
Moreover, we have the cost of assigning every individual to its node: O(l · n).
A critical point is the total number of splits: when s ≈ n, then O(s · n · d) ≈ O(n² · d).
This is particularly dangerous for a nominal variable with a large number
of classes k; when k ≈ n_t the cost becomes quadratic or even exponential. See Mola
et al. (1992) for the treatment of a multiple-class response variable. Also, when l
becomes large the cost increases quadratically.
This algorithm, named SAAD (Segmentacio Automatica per Arbres de Decisio), runs
on a PC platform in a Windows environment and is able to cope with problems with
up to 100,000 individuals and 100 variables. Here we present the time in seconds for
two problems, one corresponding to a classification tree into 4 classes, with 9 explanatory
variables (6 of them categorical with a maximum of 8 categories each and 3
continuous), and the other being a regression tree with 18 explanatory variables
(half categorical and half continuous). For each problem we have varied
the number of individuals considered and the depth of the tree produced, obtaining
the results shown in Table 2. As can be seen, linearity is preserved approximately up
to a depth of 8.
References
Aluja, T., Nafria, E. (1995). Generalised impurity measures and data diagnostics in decision trees. Visualising Categorical Data, Cologne.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.
Ciampi, A. (1991). Generalized Regression Trees. Computational Statistics and Data Analysis, 12, 57-78, North Holland.
Gueguen, A., Nakache, J.P. (1988). Méthode de discrimination basée sur la construction d'un arbre de décision binaire. Revue de Statistique Appliquée, XXXVI (1), 19-38.
Kass, G.V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127.
Mola, F., Siciliano, R. (1992). A two-stage predictive splitting algorithm in binary segmentation. Computational Statistics, vol. 1, Y. Dodge and J. Whittaker (eds.), Physica Verlag.
Sonquist, J.A., Morgan, J.N. (1964). The Detection of Interaction Effects. Ann Arbor: Institute for Social Research, University of Michigan.
Induction of Decision Trees
Based on the Rough Set Theory
Tu Bao Ho, Trong Dung Nguyen, Masayuki Kimura
Japan Advanced Institute of Science and Technology, Hokuriku
Tatsunokuchi, Ishikawa, 923-12 JAPAN
Summary: This paper aimed at the two following objectives. One was the introduction of
a new measure (R-measure) of dependency between groups of attributes in a data set, inspired
by the notion of dependency of attributes in the rough set theory. The second was
the application of this measure to the problem of attribute selection in decision tree induction,
and an experimental comparative evaluation of decision tree systems using the R-measure
and other attribute selection measures, most of which are widely used in machine
learning: gain-ratio, gini-index, dN distance, relevance, χ².
1. Introduction
The goal of inductive classification learning is to learn a classifier from a training
set that correctly predicts classes of unseen instances. Among approaches to inductive
classification learning, decision trees are certainly the most active and applicable
one. During the last decade, many top-down induction of decision tree (TDIDT)
systems have been developed, most notably CART of Breiman et al. (1984), ID3
of Quinlan (1986) and its successor C4.5 (1993). There are two crucial problems in
TDIDT systems: variable selection (choosing the "best" attribute to split a decision
node) and pruning (avoiding overfitting). Most heuristics for estimating multi-valued
attributes are information-based measures, such as Quinlan's information gain
or gain-ratio (1986, 1993), Mantaras' normalized information gain (1991), etc., and
statistics-based measures, such as Breiman's Gini-index (1984), χ² (Liu and White,
1994), Baim's relevance (1989), etc. An analysis of eleven measures for estimating
the quality of multi-valued attributes was given by Kononenko (1995).
Rough set theory, introduced by Pawlak in the early 1980s (Pawlak, 1991), is a mathematical
tool to deal with vagueness and uncertainty, in particular for the approximation of
classifications. Although the rough set methodology of approximation sets has been
successful in many real-life applications, there are still several theoretical problems to
solve. For example, we will show that one of its fundamental notions, the measure
of dependency of an attribute set Q on an attribute set P, is not always robust
with noisy data and not sensitive enough to "partial" dependency between Q and
P. Inspired by this measure of dependency of attributes, we introduce in this paper
a new measure of attribute dependency, called the R-measure. We show experimentally
that the R-measure can be applied with success to the problem of attribute selection in
decision tree induction.
The starting point of the rough set theory is the assumption that our "view" of the
elements of the object set O depends on an equivalence relation E ⊆ O × O. Two
objects o_1, o_2 ∈ O are said to be indiscernible in E if o_1 E o_2. The lower and upper
approximations of any X ⊆ O consist of all objects which surely and possibly belong
to X, respectively, regarding the relation E. The lower approximation E_*(X)
and upper approximation E^*(X) are defined as
Consider how the attribute Flu depends on the attribute Temperature. We express
the causal relation between these attributes in the form of the usual rules:
If Temperature = normal then Flu = no
If Temperature = very_high then Flu = yes
The number of objects satisfying these rules is 5 out of 8. In other words, the proportion
of objects whose values of Flu are correctly predicted by values of Temperature
is 5/8. This argument is analogous to the definition of the degree of dependency, where
\[ \mu'_P(Q) = \frac{1}{\operatorname{card}(O)} \sum_{[o]_P} \max_{[o]_Q} \operatorname{card}([o]_Q \cap [o]_P). \tag{5} \]
The degree of dependency of Flu on Temperature calculated by (3) is 3/4. The main difference between μ_P(Q) and μ'_P(Q) is that the latter measures the dependency of Q on P by maximizing the predicted membership of an instance in the family of equivalence classes generated by Q given its membership in the family of equivalence classes generated by P. We have obtained the following property (Ho and Nguyen, 1997). From this theorem we can define that Q totally depends on P iff μ'_P(Q) = 1; Q partially depends on P iff max_{[o]_Q} card([o]_Q)/card(O) < μ'_P(Q) < 1; and Q is independent of P iff μ'_P(Q) = max_{[o]_Q} card([o]_Q)/card(O). In practice, to emphasize rules with higher generality we use the following formula, which we call the R-measure
\[ \tilde{\mu}_P(Q) = \frac{1}{\operatorname{card}(O)} \sum_{[o]_P} \max_{[o]_Q} \frac{\operatorname{card}([o]_Q \cap [o]_P)^2}{\operatorname{card}([o]_P)}. \tag{7} \]
Let p_{i.}, p_{.j}, and p_{ij} denote the approximations of the probabilities from the object set O. Let
\[ H_C = -\sum_i p_{i.} \log p_{i.}, \quad H_A = -\sum_j p_{.j} \log p_{.j}, \quad H_{CA} = -\sum_{i,j} p_{ij} \log p_{ij}, \quad H_{C|A} = H_{CA} - H_A \tag{10} \]
be the entropy of the classes, of the values of the given attribute, of the joint example class-attribute value, and of the class given the value of the attribute, respectively (all logarithms introduced here are to the base two).
The well-known decision tree algorithm C4.5 uses the gain-ratio (Quinlan, 1993)
\[ \mathrm{GainR} = \frac{H_C + H_A - H_{CA}}{H_A} \tag{11} \]
and Mantaras (1991) proposed the normalized distance
\[ d_N = 1 - \frac{H_C + H_A - H_{CA}}{H_{CA}}. \tag{12} \]
The author reported experiments with the two data sets "hepatitis" and "breast cancer". The Gini-index used in the decision tree learning algorithm CART (Breiman et al., 1984) can be rewritten as
\[ \mathrm{Gini} = \sum_j p_{.j} \sum_i p_{i|j}^2 - \sum_i p_{i.}^2. \tag{13} \]
Baim (1988) introduced a selection measure called relevance and showed an experiment in craniostenosis syndrome identification:
\[ \mathrm{Relev} = 1 - \frac{1}{k-1} \sum_j \sum_{i \neq i_m(j)} \frac{n_{ij}}{n_{i.}}, \qquad i_m(j) = \arg\max_i \{ n_{ij} \}. \tag{14} \]
The χ² measure compares the observed counts n_{ij} with the expected counts
\[ e_{ij} = \frac{n_{.j}\, n_{i.}}{n_{..}}. \tag{15} \]
Suppose that P = {A_1, A_2, ..., A_p} and Q = {B_1, B_2, ..., B_q}. Let n_{.j_1 j_2 ... j_p} denote the number of instances with the j_1-th, j_2-th, ..., j_p-th values of attributes A_1, A_2, ..., A_p, respectively, and n_{i_1 i_2 ... i_q | j_1 j_2 ... j_p} the number of instances with the i_1-th, i_2-th, ..., i_q-th values of attributes B_1, B_2, ..., B_q and with the j_1-th, j_2-th, ..., j_p-th values of A_1, A_2, ..., A_p simultaneously. We also denote by P_{.j_1 j_2 ... j_p} and P_{i_1 i_2 ... i_q | j_1 j_2 ... j_p} the approximations of these probabilities from the training set. We can rewrite (7) as
\[ \tilde{\mu}_P(Q) = \sum_{j_1, \dots, j_p} P_{.j_1 \dots j_p} \max_{i_1, \dots, i_q} P^2_{i_1 \dots i_q \mid j_1 \dots j_p}. \tag{16} \]
In the special case of (7) when Q stands for the class attribute C and P stands for the given attribute A, the measure can be used to quantify the dependency of the class attribute on A, and can be written as
\[ \tilde{\mu}_A(C) = \sum_j P_{.j} \max_i P^2_{i \mid j}. \]
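The special case above is straightforward to compute from a class-by-attribute contingency table. The following Python sketch, with hypothetical names and a made-up table (the paper's own example table is not reproduced here), evaluates this quantity:

import numpy as np

def r_measure(counts):
    # R-measure of the class attribute C on a single attribute A, where
    # counts[i, j] = number of instances with class i and attribute value j
    # (the special case of formula (16)).
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    n_j = counts.sum(axis=0)                      # column totals n_.j
    p_j = n_j / n                                 # P_.j
    p_i_given_j = np.divide(counts, n_j, out=np.zeros_like(counts),
                            where=n_j > 0)        # P_{i|j}
    return float(np.sum(p_j * (p_i_given_j ** 2).max(axis=0)))

# hypothetical 2x3 table (rows: classes, columns: attribute values)
table = np.array([[3, 0, 1],
                  [0, 2, 2]])
print(r_measure(table))

In a TDIDT system the attribute maximizing this value would be chosen to split the current node.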
Traditionally, in machine learning data are usually divided into two sets of training
and testing data. Training data are used to produce the classifier by a method and
testing data are used to estimate the prediction accuracy of the method. A single
train-and-test experiment is often used in machine learning for estimating perfor-
mance of learning systems.
It is recognized that multiple train-and-test experiments can do better than a single train-and-test experiment. Recent work showed that cross validation is suitable for accuracy estimation, particularly 10-fold stratified cross validation (Kohavi, 1995). However, cross validation is still not widely used in machine learning as it is computationally expensive.
Although the comparison of error rates cannot be used directly to evaluate measures, it may, however, provide a snapshot of measure comparison and show the stability of these measures. More reliable and newer results on R-measure are reported in Ho and Nguyen (1997).
In this paper we have introduced the R-measure for the degree of dependency between two groups of attributes in a data set. R-measure is inspired by the notion of attribute dependency in rough set theory and aims at overcoming some limitations of that notion. We have applied R-measure to the general scheme of decision tree induction and carefully carried out an experimental comparative study of R-measure with five attribute selection measures which are well known in the machine learning literature.
References:
Baim, P.W. (1988): A method for attribute selection in inductive learning systems. IEEE Trans. on PAMI, 10, 888-896.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees, Belmont, CA: Wadsworth.
Buntine, W., Niblett, T. (1991): A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75-85.
Dougherty, J., Kohavi, R. and Sahami, M. (1995): Supervised and Unsupervised Discretization of Continuous Features. Proceedings 12th International Conference on Machine Learning, Morgan Kaufmann, 194-202.
Ho, T.B., Nguyen, T.D. (1997): An interactive-graphic system for decision tree induction (under review).
Kononenko, I. (1995): On biases in estimating multi-valued attributes. Proc. 14th Inter. Joint Conf. on Artificial Intelligence, Montreal, Morgan Kaufmann, 1034-1040.
Kohavi, R. (1995): A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. Int. Joint Conf. on Artificial Intelligence IJCAI'95, 1137-1143.
Liu, W.Z., White, A.P. (1994): The importance of attribute selection measures in decision tree induction. Machine Learning, 15, 25-41.
López de Mantaras, R. (1991): A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81-92.
Mingers, J. (1989): An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3, 319-342.
Pawlak, Z. (1991): Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers.
Pawlak, Z., Grzymala-Busse, J., Slowinski, R., Ziarko, W. (1995): Rough sets. Communications of the ACM, 38, 89-95.
Quinlan, J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann.
Wille, R. (1992): Concept lattice and conceptual knowledge systems. Computers and Mathematics with Applications, 23, 493-515.
Visualizing Data in
Tree-Structured Classification
Francesco Mola, Roberta Siciliano
Dipartimento di Matematica e Statistica
Università di Napoli Federico II
Via Cintia - Monte S. Angelo
80126 - Naples - Italy
e-mail: [email protected]
[email protected]
Summary: This paper provides a classification tree methodology to analyze and visualize data in multi-way cross-classifications of a categorical response variable and a high number of predictors observed in a large sample. The idea is to apply recursively a factorial method such as nonsymmetric correspondence analysis to the successively finer partitions of the given sample. Some new insights on the graphic displays of nonsymmetric correspondence analysis are considered for defining classification criteria based on factorial scores. As a result, we grow particular types of classification trees with two aims: 1) to enrich the interpretation of the dependence structure using predictability measures and graphic displays; 2) to obtain a classification rule for new cases of unknown class on the basis of a factorial model.
We consider the set of M contingency tables where each table cross-classifies the response categories of Y with the categories of each predictor. In this phase we select the best predictor X* by maximizing the predictability index τ of Goodman and Kruskal, that is:
\[ \tau = \frac{\sum_j \sum_i p_{ij}^2 / p_{.j} - \sum_i p_{i.}^2}{1 - \sum_i p_{i.}^2} \tag{1} \]
where p_{ij} are the proportions of the selected contingency table that cross-classifies the response categories i = 1, ..., I of Y with the categories j = 1, ..., J of the best predictor X*. The usual dot notation is used for summations, i.e., Σ_j p_{ij} = p_{i.}.
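As a hedged illustration of how this selection phase can be implemented, the following Python sketch computes the index (1) from a contingency table; the function name is ours, not the authors':

import numpy as np

def goodman_kruskal_tau(counts):
    # Goodman and Kruskal's predictability index tau, formula (1);
    # counts[i, j]: response category i of Y versus predictor category j.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p_i = p.sum(axis=1)                 # row margins p_i.
    p_j = p.sum(axis=0)                 # column margins p_.j (assumed > 0)
    num = (p ** 2 / p_j).sum() - (p_i ** 2).sum()
    return float(num / (1.0 - (p_i ** 2).sum()))

The best predictor X* at a node is then the one whose table attains the largest value, e.g. max(tables, key=goodman_kruskal_tau).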
1.2 Dependence analysis by a factorial model
In this phase we analyze the dependency in the selected table using the factorial model of nonsymmetric correspondence analysis
\[ \frac{p_{ij}}{p_{.j}} - p_{i.} = \sum_{k=1}^{K^*} \lambda_k\, r_{ik}\, c_{jk} \]
for k = 1, ..., K* = min(I − 1, J − 1) and λ_1 ≥ ... ≥ λ_{K*} ≥ 0. The row scores r_{ik} and the column scores c_{jk} satisfy the centering and orthonormality conditions (4).
Using the orthonormality conditions (4) one can derive predictability measures: in particular, the index τ is proportional to Σ_k λ_k² (relation (5)), which shows that we can further decompose such predictability over the row categories as well as over the column categories. In particular, we can identify which predictor categories contribute most in predicting the response variable by defining the following predictability measure of category j: pred(C_j) = (p_{.j} Σ_k (c_{jk} λ_k)²) / (Σ_k λ_k²), which sums to one over the index j. When we consider the first factorial axis in the partitioning criterion this formulation simplifies to pred(C_j) = c²_{j1} p_{.j}. We can compare pred(C_j) with the weight p_{.j} of category j: we can say that category j is a strong category when pred(C_j) ≥ p_{.j}, whereas category j is a weak category when pred(C_j) < p_{.j}. By definition, we can equivalently check the conditions |c_{jk}| ≥ 1 and |c_{jk}| < 1, respectively.
2. Example
A real data set is analyzed with the proposed methodology. The set consists of 286 graduates of the Economy and Commerce Faculty of the University of Naples over the period 1986-1989, on which the following variables were observed:
• Final score with categories Low (L), Medium-Low (ML), Medium-High (MH), High (H);
• Sex with categories male, female;
• Origin with categories Naples, county, other counties.
The variable Final Score is the response variable. In table 1 we summarize the partitioning sequence to grow the binary tree shown in figure 1.
For the sake of brevity, we analyze only some partitions. The first table that is selected cross-classifies the response variable Final Score with the predictor Age. In figure 2 we present the two-dimensional factorial representation of nonsymmetric correspondence analysis applied to this table. The first factorial axis explains a very high percentage of the τ index (94%).
We notice an axial opposition between old graduates and young graduates as well as between graduates with high score and graduates with low score. Only the age category 26-30 years is a weak category (very near the origin). These considerations enrich the interpretation of the partitioning at node 1 that divides the 286 cases into two subgroups of 239 and 47 cases respectively.
At node 2 the selected predictor is Diploma. Figure 3 shows the graphic display where the first factorial axis explains again a high percentage of the τ index (88%). We notice an axial opposition between classical diploma and professional diploma, which represent the strong predictor categories. Considering the simultaneous interpretation of predictor and response categories we underline that graduates with high score are better predicted by the category classical diploma, whereas the category professional diploma predicts more the graduates with medium-low score. Instead, the category magistral diploma predicts more the graduates with medium-high score. The partitioning at node 2 divides the 239 cases into two subgroups of 54 and 185 cases respectively.
The selected predictor at node 5 is Study Plan. In the graphic display of figure 4 the prediction explained by the first factorial axis is 84%. The axial opposition is now provided by official and public against managerial and professional study plans. The other predictor categories, economics and quantitative, are weak categories (very near the origin). The partitioning divides the 185 cases at node 5 into two subgroups of 174 and 11 cases respectively.
(Figure 1: the binary classification tree grown from the partitioning sequence summarized in table 1.)
Figure 2. Factorial representation at node 1: Final Score vs Age
Figure 3. Factorial representation at node 2: Final Score vs Diploma
Figure 4. Factorial representation at node 5: Final Score vs Study Plan
Table 2 summarizes the terminal node information with the values of the CATANOVA statistic and the assigned class. The left side of table 3 shows the misclassification matrix obtained with the proposed tree-classification procedure, whereas the right side of table 3 shows the misclassification matrix obtained with a standard classification tree procedure such as the CART methodology (Breiman, Friedman, Olshen and Stone, 1984). We can compare the two misclassification matrices of table 3. Notice that the proportions of misclassified cases are very similar in the two analyses, and for some response categories the proposed method has classified better than CART.
predicted  ML   0.35  0.36  0.25        predicted  ML   0.37  0.41  0.27
class      MH   0.20  0.38  0.12        class      MH   0.17  0.31  0.18
           H    0.10  0.26  0.63                   H    0.11  0.27  0.55
As main results of applications on real data sets, the proposed methodology has been shown to achieve several purposes.
Acknowledgements: The Authors wish to thank Carlo Lauro and Jaromir Antoch for
helpful comments on a previous version of this paper. This research was supported for the
first author by CNR research funds number 92.1872 P and for the second author by CNR research funds number 95.02041.CT10.
References
Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984): Classification and Regression Trees, Belmont, CA: Wadsworth.
D'Ambra, L. and Lauro, C. (1989): Nonsymmetrical Analysis of Three-way Contingency
Tables, in Multiway Data Analysis, Coppi, R. and Bolasco, S. (eds.), North Holland, Ams-
terdam.
Lauro, N.C. and D'Ambra, L. (1984): L'Analyse non Symétrique des Correspondances. Data Analysis and Informatics III, E. Diday et al. (eds.), 433-446, North Holland, Amsterdam.
Lauro, N.C. and Siciliano, R. (1989): Exploratory methods and modelling for contingency
tables analysis: an integrated approach. Statistica Applicata. Italian Journal of Applied
Statistics, 1, 5-32.
Loh, W. and Vanichsetakul, N. (1988): Tree-Structured Classification via Generalized Dis-
criminant Analysis. Journal of the American Statistical Association, 83, 715-728.
Mola, F. and Siciliano, R. (1992): A Two-Stage Predictive Splitting Algorithm in Binary
Segmentation. Computational Statistics, Dodge, Y. and Whittaker, J. (eds.), 1, 179-184,
(Compstat '92 Proceedings). Physica Verlag.
Mola, F. and Siciliano, R. (1994): Alternative Strategies and CATANOVA Testing in Two-
-Stage Binary Segmentation. New Approaches in Classification and Data Analysis, Diday,
E. et al. (eds.), 316-323, Springer Verlag.
Mola, F. and Siciliano, R. (1995): Nonsymmetric correspondence analysis for tree-structured
classification, research internal report, conditionally accepted for Applied Stochastic Models
and Data Analysis.
Mola, F., Klaschka, J. and Siciliano, R. (1996): Logistic Classification Trees, Compstat 96 Proc., Prat, A. (ed.), Physica Verlag.
Siciliano, R. and Mola, F. (1997), Ternary classification trees: a factorial approach, Visu-
alization of categorical data (Greenacre, M., Blasius, J., eds.), Academic Press, CA.
Siciliano, R., Mooijaart, A. and van der Heijden, P.G.M. (1993): A Probabilistic Model for
Nonsymmetric Correspondence Analysis and Prediction in Contingency Tables. Journal of
the Italian Statistical Society, 2, 1, 85-106.
Adaptive Cluster Analysis Techniques -
Software and Applications
Hans-Joachim Mucha 1, Rainer Siegmund-Schultze 1 and Karl Dilbon 2
Summary: Well-known cluster analysis techniques like, for instance, the K-means method can be improved in almost every case by using adaptive distances. For this one has to estimate at least "appropriate" weights of variables, i.e. appropriate contributions to the cluster analysis. Recently, adaptive classification techniques for two-class models have been under development. Here usually both the weights of variables and the weights (masses) of observations play an important role. For instance, observations that are harder to classify get increasingly larger weights. Quite successful applications of these techniques can be reported from the area of credit scoring systems for consumer loans or credit cards. The software ClusCorr (running under Microsoft EXCEL) performs classification, cluster analysis and multivariate graphics of (huge) high-dimensional data sets containing numerical values (quantitative or categorical) as well as non-numerical information.
1. Introduction
What can our statistical software do for you? For example, with the help of the EXCEL-
Add-In ClusCorr one is able to look for classification rules in data sets like the follow-
ing one:
k1,kB,kFF,k2,k1,k1,k2,k0,k0,k0,kD,k1,k2,k4,k8,k7,k1
k0,kS,kME,k0,k1,k1,k3,k0,k0,k0,kD,k0,k4,k1,k0,k7,k1
k0,kS,kFF,k1,k1,k1,k1,k0,k0,k0,kD,k0,k4,k3,k2,k9,k0
k0,kS,kMB,k2,k0,k2,k1,k0,k0,k1,kD,k2,k0,k3,k5,k9,k1
k1,kS,kAD,k1,k1,k1,k1,k0,k0,k1,kD,k7,k6,k2,k7,k5,k1
k0,kS,kFF,k1,k0,k1,k3,k0,k0,k0,kD,k0,k4,k2,k7,k9,k0
Moreover, one can look at multivariate graphics of such data sets in order to get a first view of the data at hand. Above, every line contains information (whatever this may be in reality) about an applicant for credit. At the end of each line the code "k1" characterises a good applicant, whereas a "k0" stands for a bad one (who is not able to pay the amount of credit back to the bank, telephone company, mail-order house, or department store). Depending on the kind of credit, additional numerical (quantitative, categorical, ordinal, ...) information often has to be taken into account in order to optimise the decision about a new applicant. In fact, that is no problem for the classification technique described later on.
The squared distances
\[ d_Q^2(x_i, x_l) = (x_i - x_l)'\, Q\, (x_i - x_l) \tag{1} \]
between two observations x_i and x_l are well-known dissimilarity measures for metric scaled data which are often used in cluster analysis. Here the number of variables of a data matrix X is denoted by J. In the simple case which we consider here, the metric Q is diagonal with non-negative diagonal elements q_j which are usually unknown and therefore have to be estimated during an adaptive clustering procedure. For example, the general variant of the well-known K-means clustering method (MacQueen 1967) takes into consideration both the weights of variables q_j introduced above and the non-negative weights (masses) of observations m_i in order to minimise the sum of the within-clusters variances
\[ V_K = \sum_{k=1}^{K} \sum_{i=1}^{I} a_{ik}\, m_i\, d_Q^2(x_i, \bar{x}_k) \tag{3} \]
concerning a fixed number of clusters K. Here the number of observations of the data table X is denoted by I. The indicator function a_{ik} equals 1 if the observation x_i comes from cluster k, and 0 otherwise. The vector x̄_k contains the usual arithmetic mean values in cluster k. On the one hand the weights can be chosen in conformity with data and model (usually in this case these weights are termed special weights). On the other hand the weights can be adaptive ones, estimated in some way in an iterative manner regarding some calibration constraints in (3). In any case, as a result the contributions of the variables to the cluster analysis differ among one another. As a further result it is no longer necessary to provide a selection of variables.
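A minimal sketch of the K-means variant minimising (3) is given below, assuming fixed weights q_j and masses m_i (the adaptive variants of the text would re-estimate the weights between iterations); all function and argument names are ours:

import numpy as np

def weighted_kmeans(X, K, q, m, n_iter=100, seed=None):
    # K-means for criterion (3): sum over clusters and observations of
    # a_ik * m_i * d_Q^2(x_i, xbar_k), with diagonal metric Q = diag(q).
    rng = np.random.default_rng(seed)
    X, q, m = np.asarray(X, float), np.asarray(q, float), np.asarray(m, float)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # squared adaptive distances d_Q^2(x_i, xbar_k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2 * q).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            np.average(X[labels == k], axis=0, weights=m[labels == k])
            if np.any(labels == k) else centers[k]
            for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    V_K = float((m * d2.min(axis=1)).sum())   # criterion (3) at the solution
    return labels, centers, V_K

Note that the centroids are mass-weighted means, so the masses m_i directly change the amount of influence of the observations, as described in the text.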
The performance of adaptive clustering methods was investigated using extensive simulation studies (resampling as well as random error techniques). We used measures which were proposed by Hubert and Arabie (1985). The usefulness of adaptive techniques has been shown in practice in several applications. Because of the weights q_j and m_i, accompanied by various kinds of data transformations like the rank transformation, clustering techniques based on (3) have a wide area of applications. For instance, one can look for clusters in contingency tables using the K-means technique or Ward's hierarchical clustering method (Mucha and Klinke 1993; see Figure 1 below: a dendrogram which is drawn onto the plane of the first two factors of correspondence analysis). It should be mentioned already here that the dendrograms and other graphics of the software ClusCorr are linked by click with the data values. Another example of a successful application of adaptive clustering based on (3) is the detection of groups in rank order data (Mucha 1992).
Moreover, adaptive distances are very important in order to obtain highly informative multivariate plots. In that way, both the interactive data analysis and the interpretation of clustering results become much easier.
A basic approach for a simultaneous classification and visualization of data was proposed by Bock (1987). Some methods are described by Bock (1996). These methods are based on a well-specified goodness-of-fit criterion.
In the case of mixed data several distance measures can be used side by side, for example, in an additive fashion.
(Nishisato 1980). Generally, we want to obtain a new variable x so as to make the values (scores) within the given classes, SS_w, as similar as possible and the scores between the classes, SS_b, as different as possible. Let us look at a contingency table which can be obtained by crossing a categorical variable j (K_j categories, where K_j ≥ 2 is considered) with the class membership variable (2 categories at least). That is (regarding some constraints in the frame of the dual scaling approach), the squared correlation ratio has to be maximised:
\[ \eta^2 = \frac{SS_b}{SS_t}. \tag{4} \]
\[ t^*(x_i, x_l) = \sum_{j=1}^{J} x_{ij} \tag{7} \]
\[ t(x_i, x_l) = \frac{1}{J} \sum_{j=1}^{J} x_{ij} \tag{8} \]
where the normalized version (8) is preferred because it is independent of the number of variables J. Now the question arises: how can
a suitable cut-off point be determined on the distance scale? The simplest way is to take
the cut-off point which gives the minimum error rate for the training sample (Figure 2).
Usually (without consideration of assumptions on distributions) the cut-off point is cho-
sen by sampling techniques. For example, one can take the mean or median of a set of
cut-off points (corresponding to minimum error rates obtained from many different sam-
ples which are drawn randomly from the training data).
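The cut-off search just described is easy to sketch; the following Python fragment (all names ours) scans candidate cut-offs for the minimum training error and then takes the median over random samples of the training data:

import numpy as np

def best_cutoff(scores, labels):
    # labels: 1 = good, 0 = bad; we assume good applicants get the
    # smaller scores, so 'good' is predicted below the cut-off.
    order = np.argsort(scores)
    s, y = np.asarray(scores)[order], np.asarray(labels)[order]
    cands = (s[:-1] + s[1:]) / 2.0          # midpoints as candidate cut-offs
    errors = [np.mean((s < c) != (y == 1)) for c in cands]
    return float(cands[int(np.argmin(errors))])

def resampled_cutoff(scores, labels, B=100, seed=None):
    # median of cut-offs over B random samples drawn from the training data
    rng = np.random.default_rng(seed)
    s, y = np.asarray(scores), np.asarray(labels)
    cuts = [best_cutoff(s[idx], y[idx])
            for idx in (rng.choice(len(s), len(s), replace=True)
                        for _ in range(B))]
    return float(np.median(cuts))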
Afterwards, as usual, we are able to classify the observations of the test sample using the
scores (6) and the cut-off point of the training sample.
The obtained results of the first step (described above) can be improved in almost every
case by altering the weights of observations (i.e. by changing the amount of influence of
the observations). This leads to a local adaptive distance approach.
Fig. 2: Credit scoring. Example 1. Cut-off point on the distance axis versus error rate
(dark curve: training sample, grey curve: test sample).
4. The software
The XClust library (Mucha, 1995; Mucha and Klinke, 1993) of the interactive statistical computing environment XploRe contains well-known cluster analysis methods as well as new adaptive ones. Additionally, highly interactive and dynamic graphics of XploRe support both the search for and the interpretation of clusters. Everyone who is a little bit familiar with matrix notation can write new distance functions (for instance for mixed data) by using the macro language of XploRe.
In order to make cluster analysis techniques available for almost everyone (without any knowledge of algebra and statistical languages) the software ClusCorr, running under Microsoft Windows, is under development (Mucha 1996b). It is written in Visual Basic for Applications (VBA) to function as an Add-In under Microsoft EXCEL 5 (or higher). Hardware requirements are defined from this point. Having the wide family of EXCEL users in mind, ClusCorr is designed on the one hand to have teachware properties, including an extensive help system and decision support for the choice of distance measures, clustering techniques, multivariate graphics, and so on, and on the other hand to perform the cluster analysis of huge data sets.
Considering the kind of results obtained, one can distinguish roughly between hierarchical and partitioning techniques. The hierarchical methods form a sequence of nested partitions, a hierarchy (Figure 1). In ClusCorr (as well as in XClust) well-known hierarchical ascending clustering methods are available, for instance
• Ward's minimum variance method,
• Single Linkage (nearest neighbour),
• Complete Linkage (furthest neighbour),
• Average Linkage (group average), and
• Centroid Method (weighted centroid).
About twenty well-known distance measures can be selected simply by a click: Euclidean, L1, Jaccard, Cosine, ...
What is one to do if one has to carry out a hierarchical cluster analysis for a million observations? We offer a special principal axes cutting algorithm in order to reduce a huge data set to, for instance, 250 "pre-clusters". This algorithm is quite fast: it takes 25 minutes (on a Pentium 200 MHz) to reduce one million objects to a few hundred pre-clusters. Afterwards one can carry out a hierarchical cluster analysis; see the sketch below.
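The principal axes cutting step is not spelled out in detail here; one plausible reading, sketched below purely as an assumption, projects the data onto the leading principal axes, cuts each axis into equal intervals, and uses the non-empty grid cells as pre-clusters (each represented by its centroid and its size, to be used as a mass) for the subsequent hierarchical run:

import numpy as np

def preclusters(X, n_axes=2, n_cuts=16):
    # hypothetical principal-axes cutting: PCA scores, then a regular grid
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = Xc @ Vt[:n_axes].T                      # scores on the first axes
    lo, hi = S.min(axis=0), S.max(axis=0)
    cells = np.floor((S - lo) / (hi - lo + 1e-12) * n_cuts).astype(int)
    groups = {}
    for idx, key in enumerate(map(tuple, cells)):
        groups.setdefault(key, []).append(idx)
    # each pre-cluster: centroid and size (its mass m_i)
    return [(X[g].mean(axis=0), len(g)) for g in groups.values()]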
The (adaptive) K-means method is available in different variants: exchange method, gradient technique, minimum distance method, ...
The stability of cluster analysis results can be investigated by simulation studies based on measures for comparing partitions. In that way one can check automatically whether the adaptive clustering performs better or not. Furthermore, in that way one can validate the number of clusters, and the user can assess the importance of the variables.
Such a scoring system assigns points to the values of the variables (attributes) and adds them up to form a so-called creditworthiness score (see above). On this basis a decision about a new applicant is derived.
Example 1: The training sample A (see above) consists of 16795 applicants, whereas the test sample B contains 14745 persons. Several statistical methods (see, for example, Quinlan 1993) were used, which give the following total error rates:
Method A B
Example 2: Credit scoring data from Fahrmeir and Hamerle (1984): the sample consists of 1000 applicants, of which 300 are bad clients and 700 are good ones. All 20 variables are used. The following total error rates were obtained by resubstitution (R) and cross-validation (C) (leaving-one-out method):
Method R C
2 See Fahrmeir and Hamerle (1984). The following average error rates (sum of error rates of each class divided by 2) are given by the authors: quadratic discriminant analysis: 24.3% (R), 31.2% (C); linear discriminant analysis: 27.0% (R), 28.9% (C).
6. References
Bock, H. H. (1987): On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: Multivariate Statistical Modeling and Data Analysis, Bozdogan, H. and Gupta, A. K. (eds.), 17-34, D. Reidel, Dordrecht.
Bock, H. H. (1996): Simultaneous visualization and classification methods as an alternative to Kohonen's neural networks. In: Classification and Multivariate Graphics: Models, Software and Applications, Mucha, H.-J. and Bock, H. H. (eds.), Report No. 10 (ISSN 0956-8838), 15-24, Weierstrass Institute for Applied Analysis and Stochastics, Berlin.
Fahrmeir, L. and Hamerle, A. (1984): Multivariate statistische Verfahren. De Gruyter,
Berlin.
Hubert, L. J. and Arabie, P. (1985): Comparing partitions, Journal of Classification, 2, 193-218.
MacQueen, J. (1967): Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. Math. Statist. Prob. 1965/66, LeCam, L. and Neyman, J. (eds.), Vol. 1, 281-297, Univ. California Press, Berkeley.
Michie, D. M., Spiegelhalter, D. J. and Taylor, C. C. (eds.)(1994): Machine Learning,
Neural and Statistical Classification, Ellis Horwood, Chichester.
Mucha, H.-J. (1992): Clusteranalyse mit Mikrocomputern, Akademie Verlag, Berlin.
Mucha, H.-J. (1995): XClust: clustering in an interactive way. In: XploRe: an Interactive Statistical Computing Environment, Härdle, W., Klinke, S., and Turlach, B. A. (eds.), 141-168, Series Statistics and Computing, Springer-Verlag, New York.
Mucha, H.-J. (1996a): Distance based credit scoring. In: Classification and Multivariate Graphics: Models, Software and Applications, Mucha, H.-J. and Bock, H. H. (eds.), Report No. 10 (ISSN 0956-8838), 69-76, Weierstrass Institute for Applied Analysis and Stochastics, Berlin.
Mucha, H.-J. (1996b): ClusCorr: cluster analysis and multivariate graphics under MS EXCEL. In: Classification and Multivariate Graphics: Models, Software and Applications, Mucha, H.-J. and Bock, H. H. (eds.), Report No. 10 (ISSN 0956-8838), 97-105, Weierstrass Institute for Applied Analysis and Stochastics, Berlin.
Mucha, H.-J. and Klinke, S. (1993): Clustering techniques in the interactive statistical computing environment XploRe. Discussion Paper 9318, Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve.
Nishisato, S. (1980): Analysis of Categorical Data: Dual Scaling and its Applications, University of Toronto Press, Toronto.
Nishisato, S. (1994): Elements of Dual Scaling: An Introduction to Practical Data Analysis, Lawrence Erlbaum Associates, Publishers, Hillsdale.
Quinlan, J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Part III
Summary: The notion of 'the most random partition of a finite set' is defined and proposed as a null model in classification. A Hamming-type distance between two independent and most random partitions is used for justifying its randomness, and is used for testing this null hypothesis. The probability distribution of the distance is studied for the latter purpose.
1. Validation of classification
In order to justify a classification procedure from the viewpoint of classical statistics, one should test the null hypothesis that a given or resulting classification is purely random, and thus meaningless, against the alternative that the classification is somehow structured or meaningful. Many models for the null hypothesis were proposed, as reviewed by Gordon (1996) and Bock (1996) at this conference. Typically, a probabilistic null model assumes the uniform distribution of the observable variables on a domain, or a uniform random structure of a similarity or dissimilarity matrix.
The approach presented in this report is completely new and different from those which were previously proposed (see also Sibuya 1993a,b). The new proposal is based on the notion of 'the most random partition of a finite set', which yields directly the null hypothesis of 'random partition without any regularity'. A Hamming-type distance between two independent and most random partitions plays the role of the conventional chi-square goodness-of-fit statistic.
2. Preliminaries
2.1 Random partitions
In this paper, a classification means a partition M = {m_1, m_2, ...} of a finite set N_n = {1, 2, ..., n} of n objects or elements. The ordering of the classes or subsets is disregarded. Let the set of all partitions of N_n be denoted by A_n, and consider a discrete probability distribution P on A_n.
Definition (Rand, 1971). Let L, M ∈ A_n and i, j ∈ N_n, i ≠ j, and let d(L, M) denote the number of pairs {i, j} that are placed in the same class by exactly one of L and M (a Hamming-type distance between the two partitions). Further, for M ∈ A_n let s_j denote the number of classes of M with exactly j elements. This defines the vector S(M) = S = (s_1, ..., s_n), which is a partition of the natural number n.
If a random partition A = {a_1, ..., a_K} on A_n with K classes has the probability distribution
\[ P(K = k \text{ and } A = \{m_1, \dots, m_k\}) = \frac{\alpha^k}{\alpha^{[n]}} \prod_{j=1}^{n} \big((j-1)!\big)^{s_j} =: g(n; s), \qquad 1 \le k \le n,\ M = \{m_1, \dots, m_k\} \in A_n, \tag{3} \]
we say that it has the distribution P(n, α) and write A ~ P(n, α). Here S = S(M), α > 0 is a real parameter, and α^{[n]} = α(α + 1) ⋯ (α + n − 1). An illustrative interpretation of P(n, α) is given in Section 5.
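The distribution (3) can be sampled by a sequential construction: element i starts a new class with probability α/(α + i − 1) and joins an existing class of size j with probability j/(α + i − 1); multiplying these factors over all elements reproduces exactly the product form in (3). A minimal Python sketch (function name ours):

import random

def sample_partition(n, alpha=1.0):
    # draw a random partition of {1, ..., n} from P(n, alpha);
    # alpha = 1 gives the 'most random partition' of this paper
    classes = []
    for i in range(1, n + 1):
        r = random.uniform(0, alpha + i - 1)
        if r < alpha:
            classes.append({i})          # open a new class
        else:
            r -= alpha
            for c in classes:            # existing classes weighted by size
                if r < len(c):
                    c.add(i)
                    break
                r -= len(c)
    return classes

print(sample_partition(26, alpha=1.0))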
We can prove the following recursions for g(n; s); g with a negative argument s_j < 0, 1 ≤ j ≤ n, is regarded as 0.
(i)
\[ n\, g(n; s) = \frac{\alpha\, s_1}{\alpha + n - 1}\, g(n-1;\ s_1 - 1, s_2, \dots, s_{n-1}) + \frac{1}{\alpha + n - 1} \sum_{j \ge 1} j(j+1)\, s_{j+1}\, g(n-1;\ s_1, \dots, s_{j-1}, s_j + 1, s_{j+1} - 1, s_{j+2}, \dots, s_{n-1}). \tag{4} \]
This is a sort of dual of (4). It means that if an arbitrarily chosen element, say n, of A is deleted, the remaining random partition has the distribution P(n − 1, α).
(iii)
\[ g(n; s) = \frac{\alpha\, (j-1)!}{(\alpha + n - j)^{[j]}}\ g(n - j;\ s_1, \dots, s_{j-1}, s_j - 1, s_{j+1}, \dots, s_{n-j}). \tag{6} \]
This relation means that if an arbitrarily chosen element, say n, of A belongs to a class with j elements, and if the class is deleted from A, then the remaining partition has the distribution P(n − j, α).
Any one of the three conditions, with some part of the other two conditions, characterizes the distribution P(n, α) (Sibuya and Yamato, 1995).
1       2       3       4       5       6
az      abi     ajkwz   ag      af      ackqr
bilrs   cw      bcdegmt bix     bcmr    b
cfkmp   dp      fly     cdjmt   diu     d
dhtw    elrtx   hos     eh      eo      e
eq      fhjnou  iu      fsz     gkpqtz  fho
g       gmy     nx      kno     hlv     gimntz
jv      k       py      l       jn      jpwx
nox     qsz     qr      pqrw    swxy    l
uy      v       u       sv
        v       u
        y       y
In this table, the elements in a class are alphabetically ordered, and the classes of a partition are lexicographically ordered. The latter order, which is independent of the ordering of elements in a class, is natural and automatic if the elements of the original set are linearly ordered. In this paper, the orders of classes and elements are disregarded.
6. Applications
Let A, B be independent random partitions on A_n following P(n, 1). Let F_n be the distribution function (d.f.)
\[ F_n(x) = P\{\, d(A, B) \le x \,\}, \qquad 0 \le x \le \binom{n}{2}. \]
The d.f. F_n is used for measuring the performance of classification. Some properties of F_n are shown in the last section.
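Since closed forms for F_n are the subject of the last section, a Monte Carlo approximation is often sufficient in applications. The sketch below (reusing sample_partition from the earlier sketch; all names ours) takes the Hamming-type distance to be the number of object pairs on which the two partitions disagree:

from itertools import combinations

def pairs_together(classes):
    # unordered pairs placed in the same class
    together = set()
    for c in classes:
        together.update(combinations(sorted(c), 2))
    return together

def hamming_distance(A, B):
    # pairs joined by exactly one of the two partitions
    return len(pairs_together(A) ^ pairs_together(B))

def empirical_Fn(n, trials=2000, alpha=1.0):
    # Monte Carlo estimate of F_n for independent A, B ~ P(n, alpha)
    dist = sorted(hamming_distance(sample_partition(n, alpha),
                                   sample_partition(n, alpha))
                  for _ in range(trials))
    return lambda x: sum(d <= x for d in dist) / trials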
Chernoff's faces are useful for the subjective classification of a multivariate dataset. Their performance depends on the design of the comical faces, the allocation of variates to the face features, and the training of judges. The effects of these factors can be measured in terms of F_n(d(A, B)). A trial experiment and its analysis are reported by Harada and Sibuya (1991). The planning of the classification experiments of Section 1 can be examined in a similar way.
In numerical taxonomy, suppose that we have a 'standard' classification procedure,
and some new candidate procedure. Sample data are generated from a mixture of
some known population distributions. Each multivariate observation is known to be-
long to its true population, and observations from the same population are expected
to be classified in the same subset. Consider the distance between the classifications
A, B of sample data into the true populations and that obtained by a classification
procedure. The distance d(A, B) measures the difficulty of the classification of a
mixed population if the classification is the 'standard' one. Otherwise, the distance
measures the performance of a new candidate, to be compared with that of the stan-
dard one.
Proposition 4. The moment of D_n of degree r, for any n, can be calculated from the probability function of D_{2r}, r = 1, 2, ...
Acknowledgements
The author has benefited from valuable discussions with participants of the IFCS
conference and with Prof. H. Yamato.
References:
Bock, H. H. (1996): Probabilistic aspects in classification, IFCS'96, Kobe, March 1996,
Invited Lecture 5.
Ewens, W. J. (1990): Population genetics theory - the past and the future, In: Mathematical
and Statistical Developments of Evolutionary Theory, Lessard, S. ed., NATO Adv. Sci. Inst.
Ser. C-299, Kluwer, Dordrecht, 177-227.
Gordon, A. D. (1996): Cluster validation, IFCS'96, Kobe, March 1996, Invited Lecture 1.
Harada, M. and Sibuya, M. (1991): Effectiveness of the classification using Chernoff faces, Japanese Journal of Applied Statistics, 20, 39-48 (in Japanese).
Rand, W. M. (1971): Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850.
Sibuya, M. (1992): Distance between random partitions of a finite set, In: Distancia '92, Joly, S. and Le Calvé, G. (eds.), 143-145, June 22-26, Rennes.
Sibuya, M. (1993a): A random clustering process, Ann. Inst. Statist. Math., 45, 459-465.
Sibuya, M. (1993b): Random partition of a finite set by cycles of permutation, Japan Journal of Industrial and Applied Mathematics, 10, 69-84.
Sibuya, M. and Yamato, H. (1995): Characterization of some random partitions, Japan Journal of Industrial and Applied Mathematics, 12, 237-263.
Snijders, T. A. B. (1996): Private communications.
Zabell, S. L. (1992): Predicting the unpredictable, Synthese, 90, 205-232.
Zahn, C. T., Jr. (1964): Approximating symmetric relations by equivalence relations, J. Soc. Indust. Appl. Math., 12, 840-847.
A Mixture Model To Classify Individual
Profiles Of Repeated Measurements
Toshiro Tango
Division of Theoretical Epidemiology, The Institute of Public Health
4-6-1 Shirokanedai, Minato-ku, Tokyo 108, JAPAN
E-mail: [email protected]
1. Introduction
In clinical medicine, drugs are usually administered to control some response variable, X, reflecting the patient's disease state directly or indirectly, within a specified range. So, in many clinical trials, some response variable is scheduled to be observed at regular intervals for assessing changes from the baseline. Figure 1 shows the mean treatment profiles of (a) log-transformed serum levels of glutamate pyruvate transaminase (GPT) for 124 patients with chronic hepatitis randomly assigned to receive the new treatment A or the standard B in a double-blinded clinical trial, measured at baseline and at weekly intervals thereafter up to 4 weeks, and (b) its change from the baseline level for each of the treatment groups. In this paper, only complete cases are shown and used for illustration purposes. In this clinical trial, the effects of treatment can be observed as a "decrease" in the levels of GPT as compared with the baseline level and must be evaluated at the last observation time (4 weeks). In this kind of clinical trial, the difference between mean treatment profiles can generally be defined as the size of the interaction term TREATMENT × TIME. The classical and still most frequently used procedure in the medical literature is to repeat Student's t-test or Wilcoxon's rank sum test at each time point for the treatment difference in change from the baseline shown in Figure 1-(b). Repeated application of the two-tailed t-test resulted in no significant differences between the two groups at any time point. Since the test results are the same regardless of the time point, we tend to conclude "no difference". However, such multiple comparisons inflate the over-all significance level and often show significant differences at some points but not at other points, which will generate confusion and may lead to the post hoc selection of the most highly significant difference.
To avoid this problem, the following two kinds of procedures are well known: 1) univariate repeated measures ANOVA where the degrees of freedom associated with the F test for repeated factors are reduced by one of two procedures, the Greenhouse-Geisser method and the Huynh-Feldt method, and 2) maximum likelihood based ANOVA which can allow for a more general within-subjects covariance structure, missing values and irregularly spaced data. But all of these assume the within-group covariance structure to be homogeneous between treatment groups, which seems to be a difficult assumption to justify in clinical trials. The results of applying these procedures using BMDP
Skene and White (1992) proposed a similar mixture model, but it is unsuitable for data with "improper records" and undesirable since it estimates unrealistically ragged profiles.
The purpose of this paper is to make clear the difference between the two models, to present a generalized formulation of my model to cope with improper records, and to describe how these procedures are useful and essentially important for analyzing data and interpreting the results in some sorts of randomized clinical trials.
2. Model
Suppose that a randomized clinical trial specifies the following protocol:
1. Patients are randomly assigned to receive one of G treatments, with N_i patients in the ith treatment group.
2. The response variable X is measured T + 1 times, at baseline and at equally spaced intervals, where T is at most 4 to 6.
3. The effects of treatment are evaluated at the last measurement time.
But, in practice, the occurrence of missing values and of measurements at irregularly spaced intervals is inevitable. Further, the recent tendency of "intent-to-treat" requires that all the patients registered should be included in the analysis regardless of the degree of completeness of their records. Both Tango and Skene and White formulated the model only for complete data. Therefore, we shall here generalize the model to allow for incomplete records of patients. Let X_ij(t_ijk) denote the measurement made at the time t_ijk (k = 0, 1, ..., u_ij ≤ T) of the jth subject (j = 1, 2, ..., N_i) in the ith treatment group (i = 1, 2, ..., G), with t_ij0 = 0. Thus X_ij(0) indicates the baseline level. Without loss of generality, the "improvement" induced by a treatment is defined by the "decrease" in the levels of the response variable X as compared with the baseline level. Let Y_ij(t_ijk) = X_ij(t_ijk) − X_ij(0), the change from the baseline level, and assume the existence of M latent profiles common to all the treatment groups. Then, under the condition that the jth patient of the ith group follows the mth latent profile (m = 0, 1, 2, ..., M − 1), it can be assumed that
\[ Y_{ij}(t_{ijk}) = \mu_m(t_{ijk}) + \epsilon_{ijk}, \qquad \epsilon_{ijk} \sim N(0, \sigma^2), \tag{1} \]
where the mth latent profile and the ε_ijk are, conditional on m, mutually independent. With regard to the mean profile μ_m(t), Tango proposed a smooth function of time given by a low degree polynomial. For example, when M = 3, we have
\[ \mu(t) = \begin{cases} \mu_0(t) = 0 & \text{if the subject belongs to "unchanged",} \\ \mu_1(t) = \sum_{k=1}^{R} \beta_{1k} t^k \ (< 0) & \text{if the subject belongs to "improved",} \\ \mu_2(t) = \sum_{k=1}^{R} \beta_{2k} t^k \ (> 0) & \text{if the subject belongs to "worsened",} \end{cases} \tag{2} \]
where R is the degree of the polynomial common to all the profiles except for the unchanged profile. On the other hand, Skene and White proposed a profile vector μ_m = (μ_{m1}, ..., μ_{mT})′. This kind of parameterization is seemingly more flexible in representing profiles but tends to estimate undesirably ragged profiles. Further, it cannot allow for incomplete data with missing values or measurements at irregularly spaced intervals, which is also pointed out in the discussion section of their paper.
In this model, the response pattern for the jth subject of the ith group, Y_ij = (Y_ij(t_ij1), ..., Y_ij(t_iju_ij))′, j = 1, ..., N_i, has the following mixture density
\[ g_i(Y_{ij} \mid \theta) = \sum_{m=0}^{M-1} p_{im}\, f_m(Y_{ij}) \tag{3} \]
where p_im denotes the mixing proportion of the mth latent profile in the ith treatment group and f_m(·) denotes the density function of the mth latent profile, given (from model (1)) by
\[ f_m(Y_{ij}) = \prod_{k=1}^{u_{ij}} \phi\big(Y_{ij}(t_{ijk});\, \mu_m(t_{ijk}),\, \sigma^2\big), \qquad m = 0, \dots, M-1, \tag{4} \]
where m = 0 means the "unchanged" profile. The log-likelihood for the parameters θ = (p_im, β_mk, σ²), i = 1, ..., G; m = 0, 1, ..., M − 1; k = 1, ..., R, is
\[ L = \sum_{i=1}^{G} \sum_{j=1}^{N_i} \log g_i(Y_{ij} \mid \theta), \tag{5} \]
and the comparison of treatment effects might be reduced to the test of the following null hypothesis:
\[ H_0:\ p_{1m} = p_{2m} = \dots = p_{Gm}, \qquad m = 0, 1, \dots, M-1. \tag{7} \]
2. Step 1 [M-step]: Given the Q_ij(m), the parameters p_im are easily given by
\[ \hat{p}_{im} = \sum_{j=1}^{N_i} Q_{ij}(m) / N_i. \tag{8} \]
3. Step 2 [E-step]: Calculate the posterior probability Q_ij(m) based on the estimates θ̂ obtained in Step 1.
4. Step 3: Check to see if θ̂ has converged; if not, repeat M-step and E-step.
When we construct a particular alternative hypothesis, we need to apply some constraints to the p_im's, say p_11 = p_21. In this case, expression (9) must be changed. But this kind of extra work can easily be handled in GLIM or S-PLUS.
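A compressed sketch of the EM iteration follows, with the latent-profile densities f_m treated as known functions for clarity (in the full algorithm the β_mk and σ² are re-estimated in the M-step as well); all names are ours and hypothetical:

import numpy as np

def em_mixing_proportions(Y, group, f, M, n_iter=200, tol=1e-8):
    # Y: list of response vectors Y_ij; group: group index i per subject;
    # f: list of M density functions f_m; returns p_im and Q_ij(m)
    group = np.asarray(group)
    G = int(group.max()) + 1
    p = np.full((G, M), 1.0 / M)                     # uniform start
    dens = np.array([[f[m](y) for m in range(M)] for y in Y])
    for _ in range(n_iter):
        w = p[group] * dens                          # p_im * f_m(Y_ij)
        Q = w / w.sum(axis=1, keepdims=True)         # E-step: Q_ij(m)
        p_new = np.array([Q[group == i].mean(axis=0)
                          for i in range(G)])        # M-step: formula (8)
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return p, Q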
5. Examples
We shall consider again the data of GPT shown in Figure 1. Empirically the distribution of GPT in healthy subjects can be approximated by a log-normal, so let X_ij(t_ijk) denote here the transformed value log(GPT), natural logarithm. Further assume M = 5, since several other endpoints in this trial are to be evaluated in 5 ordered categories for each patient. As a criterion which gives initial values of Q_ij(m), we may use the 5 ordered categories based on the value of S_ij = Σ_{k=1}^{u_ij} Y_ij(t_ijk), where u_ij = M − 1 for all the patients in this complete set of data. Several other initial values were examined to assure that the result derived below is optimal. The main results are summarized in Table 2.
Based on the likelihood ratio tests, the alternative hypothesis H_1: p_11 = p_21, R = 3 was selected as the most appropriate model. Compared with the models for the null hypothesis H_0 for each of three kinds of mean profiles, R = 2, R = 3, and Skene and White's vector μ_m, fitting the model with the constraint p_11 = p_21 gave a significant decrease in deviance of 8.9 on 3 d.f., 9.6 on 3 d.f. and 8.5 on 3 d.f., respectively, regardless of the goodness-of-fit of the models. Among others, a cubic polynomial gave the largest decrease, with p = 0.022. The goodness-of-fit of these models was also investigated by observing each patient's response profile in relation to the estimated 95% region of
Figure 1. The mean treatment profiles and mean ± 2SD at each time point, of (a) log(GPT) and (b) its change from the baseline level for each of the new treatment A and the standard treatment B. The difference in change from the baseline was not significant (p > 0.05 by two-tailed Student's t test) at each time point.
Figure 2. (a) Individual profiles for all the patients. (b)-(f) Estimated 95% region of profiles, μ̂_m(t) ± 2σ̂, m = 0, 1, ..., 4, and individual profiles classified into the corresponding region regardless of the treatment group.
Table 2: −2 log L for each of the mixture models assuming M = 5. Degrees of freedom are shown in parentheses.
profiles, μ̂_m(t) ± 2σ̂, of the mth latent profile into which each patient was classified according to the maximum of the estimated Q_ij(m). In Figure 2, the estimated 95% regions of latent profiles for the optimal model with p_11 = p_21 and R = 3 are illustrated together with the individual profiles classified into the corresponding profile regardless of treatment group. Table 3 presents the classification of each patient into one of those 5 profiles. As would be expected, the χ² test based on this table yielded a P-value of 0.023, very close to that of the likelihood ratio test. These results are summarized
using the estimated mixing proportions p̂_im as follows: Compared with the standard treatment B, the treatment A has
1. the same proportion of "greatly improved" (4.1%),
2. higher proportions of "improved" (32.7% vs 24.7%), but also higher "worsened" (24.5% vs 16.7%) and "greatly worsened" (11.8% vs 2.9%),
3. a lower proportion of "unchanged" (26.7% vs 51.6%).
Therefore, based on the estimated proportions and latent profiles, we cannot say that treatment A is better than B, but we can say that the effects are significantly different. These characterizations of the efficacy of treatments seem to be medically important, especially for finding the key baseline factors that discriminate responders from non-responders, but they cannot be recognized by observing the mean treatment profiles over time. Figure 2 suggests how the mean treatment profiles shown in Figure 1 are misleading. The example illustrated here is not exceptional but rather a typical one. Tango (1989) illustrated the method with another two sets of data from randomized clinical trials.
        Greatly                                      Greatly
Group   improved   Improved   Unchanged   Worsened   worsened   Total
A       3          20         17          14         8          62
B       2          13         34          11         2          62
6. Discussion
It is well recognized that some unknown prognostic factors could have larger effects on the response variable than the treatment under study. Therefore, if they really exist, the random allocation of patients to one of the treatment groups helps these unknown prognostic factors to be distributed equally between groups. Namely, several distinct latent profiles common to all the groups could be explained by these prognostic factors, and the mixture model provides useful data to investigate and identify these unknowns in the next stage of research.
On the other hand, it seems to me that the recent literature concerning the analysis of repeated measurements has concentrated too much on modelling the within-group covariance structure assuming homogeneity between groups, which seems to be unrealistic especially for clinical trials. As Skene and White have pointed out, the observed autocorrelation could be a consequence of under-specifying the mean structure for the subjects of each treatment group. Therefore, before applying such statistically flexible but clinically unrealistic models, more attention should be placed on the validity of these assumptions and on the reasons why we take observations over time.
ACKNOWLEDGEMENTS
The author is indebted to the Japanese Foundation for Multidisciplinary Treatment of Cancer. This study was supported in part by a Grant-in-Aid for Scientific Research (Grant No. 05302064) from the Ministry of Education, Science and Culture of Japan.
References:
Crowder, M.J. and Hand, D.J. (1990): Analysis of Repeated Measures, Chapman and Hall.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977): Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-22.
Diggle, P., Liang, K.Y. and Zeger, S.L. (1994): Analysis of Longitudinal Data, Oxford Science Publications.
Everitt, B.S. (1981): A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions, Multiv. Behav. Res., 16, 171-180.
Frison, L. and Pocock, S.J. (1992): Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design, Statistics in Medicine, 11, 1685-1704.
McLachlan, G.J. (1987): On bootstrapping the likelihood ratio test statistics for the number of components in a normal mixture, Applied Statistics, 36, 318-324.
Self, S.G. and Liang, K.Y. (1987): Asymptotic properties of maximum likelihood estimates and likelihood ratio tests under nonstandard conditions, Journal of the American Statistical Association, 82, 605-610.
Skene, A.M. and White, S.A. (1992): A latent class model for repeated measurements experiments, Statistics in Medicine, 11, 2111-2122.
Tango, T. (1989): Mixture models for the analysis of repeated measurements in clinical trials, Japanese Journal of Applied Statistics, 18, 143-161.
Thode, Jr., H.C., Finch, S.J. and Mendell, N.R. (1988): Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals, Biometrics, 44, 1195-1201.
Titterington, D.M., Smith, A.F.M. and Makov, U.E. (1985): Statistical Analysis of Finite Mixture Distributions, New York, Wiley and Sons.
Irregularly Spaced AR (ISAR) Models
Jeffrey S.C. Pai 1, Wolfgang Polasek 2 and Hideo Kozumi 3
1 Faculty of Management, University of Manitoba
181 Freedman Crescent, Winnipeg, Manitoba, R3T 5V4, Canada
2 Institute of Statistics and Econometrics, University of Basel
Holbeinstrasse 12, CH-4051 Basel, Switzerland
3 Faculty of Economics and Business Administration, Hokkaido University
Kita 9 Nishi 7 Kita-ku, Sapporo 060, Japan
Summary: High frequency data in finance are time series which are often measured at
unequally or irregularly spaced time intervals. This paper suggests a modeling approach
by so-called AR response surfaces where the AR coefficients are declining functions in con-
tinuous lag time. The irregularly spaced ISAR models contain the usual AR models as a
special case if the time series is equally spaced. We illustrate our methodology with two
examples.
1. Introduction
For some years now the set of available data from financial markets has increased rapidly. So far only a small subset of the information available has been used. In the 1970s, most of the empirical studies were based on yearly, quarterly, or monthly data. This data could typically be modeled by random walks or linear models such as ARIMA models (Box and Jenkins, 1976). In the 1980s, the study of weekly and daily financial data led to non-linear models such as ARCH models (Engle, 1982). Recently, empirical studies analyzing intra-daily data have been gaining new insights into the behavior of financial markets (see Guillaume et al. 1994).
For the Foreign Exchange (FX) market, Müller et al. (1990) include the daily heteroskedasticity of the volatility, while Goodhart and Figliuoli (1991) discovered negative first order autocorrelation at one minute intervals. In fact, daily data are computed on the basis of the average of five intra-daily quoted prices of the largest banks around a particular time. The spot intra-daily FX data are observed as an irregularly spaced time series (see Olsen & Associates, 1993). The standard methods of data analysis, on the contrary, are based upon equally spaced data. Typically, a certain fixed interval is chosen, and some averaging procedure for all the transactions within the intervals is applied in order to use standard methods of analysis. Several problems arise if the data from irregularly spaced time series are converted to regularly spaced time series. The appropriate interval will depend upon the transaction frequency. These intervals vary with different markets. There could be a problem of missing data if the length of the interval is too short. On the other hand, information is lost if the length of the interval is too long.
Data from financial markets exhibit high correlation between the ticking frequencies
and the volatility of the time series. It is widely believed that the durations between
transactions may carry information about the volatility.
In Section 2, we describe a new class of AR models for irregularly spaced time series,
the ISAR(p) models, as well as the ordinary least squares result. Section 3 illustrates
our methodology with two examples. Section 4 concludes.
The ISAR(p) model is defined as
\[ Y_i = \sum_{j=1}^{p} \Phi_j[\Delta_j t_i]\, Y_{i-j} + \epsilon_i, \tag{1} \]
where Y_i is the observed time series at time t_i, and Δt_i is the time span between two adjacent observations. Δ_j t_i is the distance between observations which are j "ticks" apart: Δ_j t_i = t_i − t_{i−j}. For the residual process we assume ε_i ~ N(0, γ_0), where γ_0 is the white noise variance. Note that the parameter functions Φ_j[Δ_j t_i] are functions of Δ_j t_i as well. These functions will pick up the effect from the previous observations and noises. Possible parametrizations for the lag response functions, the decay of the Φ-functions, are:
- Constant functions (in Δ_j t_i, or in time)
\[ \Phi_j[\Delta_j t_i] = \phi_j. \tag{2} \]
This special case is the usual AR model for equally spaced time series.
- Exponential functions
\[ \Phi_j[\Delta_j t_i] = \phi_{a_j} + \phi_{b_j}\, e^{-\Delta_j t_i}. \tag{3} \]
- Reciprocal functions
\[ \Phi_j[\Delta_j t_i] = \phi_{a_j} + \phi_{b_j} / \Delta_j t_i. \tag{4} \]
For different φ_a and φ_b parameters, we will have different decay functions (in absolute value). The stationarity condition for the irregularly spaced model in (1) depends on the parameter space of the φ's as well as on the distribution of Δt_i. This may be obtained by taking the expectation of Φ_j[Δ_j t_i] with respect to Δt_i and assuming the Y_i are observed at regularly spaced intervals. For example, assume
(S1) the Δt_i are independent and identically distributed as Gamma random variables with parameters α and β,
(S2) Δt_i and ε_i are independent.
From the exponential function we have
\[ \phi_j = \phi_{a_j} + \phi_{b_j}\, (1 + \beta)^{-\alpha}. \tag{5} \]
For the process to be stationary, the roots of φ(z) = 0 must lie outside the unit circle.
Consider the linear equation Y = XΦ + ε, where the parameter vector is Φ = (φ_{a1}, φ_{b1}, ..., φ_{ap}, φ_{bp})′, and the dependent variable vector is Y = (Y_1, ..., Y_n)′. The n × 2p regression matrix X is built up by lagged unweighted and weighted dependent variables, where the weights depend on the elapsed duration time. For each lag j the first regressor component is
\[ X_{i,\,2j-1} = Y_{i-j}, \]
and the second regressor component is the duration-weighted lag, i.e. X_{i, 2j} = e^{−Δ_j t_i} Y_{i−j} or X_{i, 2j} = Y_{i−j} / Δ_j t_i for the exponential and reciprocal functions, respectively.
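As a hedged illustration (under the reciprocal parametrization assumed in (4) above, with the Gamma parameters of Example 1 below), the following Python fragment simulates an ISAR(1) series and recovers (φ_a, φ_b) by ordinary least squares on the stacked unweighted and duration-weighted lags:

import numpy as np

rng = np.random.default_rng(0)
n, phi_a, phi_b = 1000, 0.3, 0.3
dt = rng.gamma(2.0, 0.5, size=n)          # Delta t_i ~ Gamma(alpha=2, beta=0.5)
Y = np.zeros(n)
for i in range(1, n):
    Y[i] = (phi_a + phi_b / dt[i]) * Y[i - 1] + rng.normal()

# regression matrix: X_{i,1} = Y_{i-1} and X_{i,2} = Y_{i-1} / Delta t_i
X = np.column_stack([Y[:-1], Y[:-1] / dt[1:]])
coef, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)
print(coef)                               # estimates of (phi_a, phi_b)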
3. Illustrative examples
We present two examples to illustrate our methodology. The first example is based on a simulated ISAR(1) model and the second example is a high frequency exchange rate from the FX market.
Example 1: The first data set consists of three time series, each of length 1000, simulated from an ISAR(1) with
(i) constant function (φ = 0.3),
(ii) exponential function (φ_a = 0.3, φ_b = 0.3),
(iii) reciprocal function (φ_a = 0.3, φ_b = 0.3),
respectively. The Δt_i's are sampled from a Gamma distribution with parameters α = 2 and β = 0.5 (giving a mean ticking frequency of 1) and the ε_i's are sampled from a Normal distribution with mean 0 and variance 1. The ordinary least squares estimates together with the AIC values for the three different functions are shown in Table 1. Our approach is successful in two aspects: First, the OLS estimates are all very close to the values we sampled from (φ = 0.3):
(i) constant function (φ = 0.28),
(ii) exponential function (φ_a = 0.31, φ_b = 0.23),
(iii) reciprocal function (φ_a = 0.29, φ_b = 0.30).
Second, we are able to select the correct model based on AIC. More interesting results can be seen from Table 1. The estimated parameter (φ = 0.41) modeled by the constant function with data sampled from the exponential function is very close to 0.42, which is the value in (5) with α = 2 and β = 0.5. Similarly, from the reciprocal function we have
\[ \phi_j = \phi_{a_j} + \phi_{b_j} / ((\alpha - 1)\beta). \tag{6} \]
258
This procedure will give 0.9 which is very close to 0.86 from Table 1.
Example 2: The data consist of the FX rate quotes for the DEM-JPY exchange rate distributed by Olsen and Associates (1993). We first introduce the data definitions based on Guillaume et al. (1994). The logarithmic price Z is defined as the log of the geometric mean of the bid and the ask prices, $P_{bid}$ and $P_{ask}$, i.e., $Z_t = \frac{1}{2}(\log P_{bid,t} + \log P_{ask,t})$, and the return is
$$Y_t = Z_t - Z_{t - \delta t}, \qquad (8)$$
where $\delta t$ is some fixed time interval. The change of the logarithmic price is often referred to as the "FX return".
For the study of the irregularly spaced time series, we adopt the same data definitions as in Guillaume et al. except (8). We define the return as the duration-dependent difference of the logarithmic price $Z_{t_i}$:
$$Y_{t_i} = \frac{Z_{t_i} - Z_{t_{i-1}}}{t_i - t_{i-1}}. \qquad (9)$$
[Figure 1: plot of the DEM-JPY returns for the first week of June 1993.]
Differencing the time series means taking the difference of adjacent points and dividing it by the spacing between them. We define the difference of an irregularly spaced time series as the sequence of slopes between adjacent points. Instead of interpolating data on a fixed time interval $\delta t$, we define $Y_{t_i}$ in a natural way as shown in (9). The difference of the logarithmic series gives the growth rates; therefore the logarithmic price change is simply the average of the growth rates of the bid and the ask time series.
Figure 1 shows the plot of the return in hours for the first week of June 1993, with sample size n = 3000. Table 2 shows the ordinary least squares estimates obtained by fitting ISAR(p) models. The negative first-order autocorrelation is consistent with the study by Goodhart et al. (1991).
The OLS estimates in Table 2 for the ISAR(p) model show a clear negative estimate which is significant up to order 2. But the higher-order effects do not reduce the residual variance substantially. This result is also confirmed by the ML estimation method. A similar picture can be seen for the ISAR(p) models with exponential or reciprocal decay functions. All parameter estimates are negative, while the limiting decay parameters are not significant. The slope decay parameter is significant up to order 2, but the residual variance is again rather constant.
Table 2: Ordinary least squares estimates for the DEM/JPY exchange rate
(parameter estimates with standard errors in parentheses; for the exponential and reciprocal functions the second row of each order gives the decay parameters)

CONSTANT function
 p   phi_1            phi_2            phi_3            resid. var.
 1   -.163 (.018)                                       .0195
 2   -.170 (.018)     -.047 (.018)                      .0195
 3   -.171 (.018)     -.049 (.019)     -.016 (.018)     .0195

EXPONENTIAL function
 1    .964 (.314)                                       .0195
    -1.163 (.323)
 2   1.008 (.316)      .262 (.228)                      .0195
    -1.217 (.325)     -.328 (.241)
 3   1.007 (.316)      .274 (.232)      .048 (.189)     .0195
    -1.217 (.325)     -.343 (.245)     -.070 (.205)

RECIPROCAL function
 1    .038 (.029)                                       .0193
    -.0022 (.0003)
 2    .038 (.029)      .036 (.030)                      .0192
    -.0024 (.0003)    -.0024 (.0007)
 3    .037 (.029)      .037 (.030)      .021 (.032)     .0192
    -.0024 (.0003)    -.0026 (.0007)   -.0018 (.0013)
4. Conclusions
This paper has demonstrated how we can estimate irregularly spaced AR models by reciprocal or exponential ISAR(p) models, where the attributes "reciprocal" and "exponential" refer to the form of the response function of the AR model over the lag interval. Based on the setup for ISAR(p) models from the previous section, we can include an autoregressive conditional heteroskedasticity component in our irregularly spaced models (see Pai et al. 1995). The modeling process is also flexible enough to incorporate an ISAR model for the ticking process, i.e., the irregularly observed time spacing process, as well.
References:
Box, G.E.P. and Jenkins, G.M. (1976): Time Series Analysis: Forecasting and Control.
Holden Day: San Francisco.
Engle, R.F., (1982): Autoregressive conditional heteroskedasticity with estimates of the
variance of U.K. inflation. Econometrica, 50, 987-1008.
Goodhart, C.A.E. and Figliuoli, L. (1991): Every minute counts in financial markets. Journal of International Money and Finance, 10, 23-52.
Guillaume, D.M., Dacorogna, M.M., Davé, R.R., Müller, U.A., Olsen, R.B. and Pictet, O.V. (1994): From the bird's eye to the microscope: A survey of new stylized facts of the intra-daily foreign exchange. Olsen and Associates.
Müller, U.A., Dacorogna, M.M., Olsen, R.B., Pictet, O.V., Schwarz, M. and Morgenegg, C. (1990): Statistical study of foreign exchange rates, empirical evidence of a price change scaling law, and intraday analysis. Journal of Banking and Finance, 14, 1189-1208.
Olsen and Associates (1993): Data distribution for HFDF - 1.
Pai, J.S.C., Polasek, W. and Kozumi, H. (1995): Irregularly spaced AR and ARCH models.
WWZ-Discussion Paper Nr. 9509, University of Basel.
Two Types of Partial Least Squares Method
in Linear Discriminant Analysis
Summary: The partial least squares linear discriminant function (PLSD) is a new discriminant function proposed by Kim and Tanaka (1995a). PLSD uses the idea of the partial least squares (PLS) method, originally developed for multiple regression analysis, in discriminant analysis. In this paper, two types of PLSD are investigated and evaluated in a simulation study. In the first type, named PLSDA (all), a common pooled within-group covariance matrix of all groups is used in modeling PLSD to discriminate all pairs of groups. In the second type, named PLSDT (two), pooled within-group covariance matrices based on the two related groups are used in modeling PLSD to discriminate pairs of groups. As a result of the simulation study, PLSDA performs better than PLSDT in all situations when the covariance matrices are equal in all groups, while PLSDT is better than PLSDA in well-conditioned situations when the covariance matrices differ among the groups.
1. Introduction
Partial least squares regression (PLSR), which was originally developed by Wold
(1975) in the field of chemometrics, is a regression method which intends to reduce
the effect of multicollinearity by the reduction of dimensionality of the explanatory
variables, like principal components regression (PCR). The performance of PLSR has been investigated by several authors, including Frank and Friedman (1993) and Kim and Tanaka (1994), in simulation studies. PLSR is also known to have worked well in many practical problems in chemistry.
The basic idea of PLS is closely related to the conjugate gradient method for solving
linear equations or for calculating inverse matrices in numerical analysis (see, e.g.,
Wold et al. 1984). Taking this aspect into consideration, we can apply the algorithm
of PLS to other statistical methods in which we need to calculate the inverse of the
covariance matrix. In discriminant analysis the inverse of the covariance matrix is
needed, and the direct application of the ordinary linear discriminant function cannot succeed for so-called ill-conditioned or multicollinear data sets.
Kim and Tanaka (1995a) proposed a new linear discriminant function using the partial least squares method (PLSD) and compared the performance of PLSD with that of the ordinary linear discriminant function (LDF) by applying them to two real data sets, i.e., Fisher's iris data and Yoshimura's arc pattern data (see Yoshimura et al., 1993), and by a Monte Carlo simulation study (Kim and Tanaka, 1995a, 1996). The results of these studies suggest that there are no great differences between the performances of PLSD and LDF in the case of no multicollinearity, and that the performance of PLSD is remarkably better than that of LDF in the case of a high degree of multicollinearity or poorly conditioned situations.
In this paper we consider two types of PLSD. In the first type, a common pooled
within-group covariance matrix is calculated using the observations in all groups. In
the second type, each within-group covariance matrix is calculated using the observations in only the two related groups. We abbreviate the former PLSDA (all) and the latter PLSDT (two). These two types of PLSD are compared through a simulation study.
In sections 2, 3 and 4 the algorithms of PLSR, the ordinary LDF and the proposed PLSD are briefly reviewed, respectively. In section 5 the two types of PLSD are described and compared through a simulation study. Finally, section 6 provides a short discussion on PLSD.
2. PLSR
1. Center the data: $X_0 = X - \mathbf{1}\bar{x}^t$, $y_0 = y - \bar{y}\mathbf{1}$.
2. For $k = 1, 2, \ldots, K$:
2.1 $w_k = X_{k-1}^t y_{k-1}$
2.2 $t_k = X_{k-1} w_k$
2.3 $p_k = X_{k-1}^t t_k / t_k^t t_k\ (= X^t t_k / t_k^t t_k)$
2.4 $q_k = y_{k-1}^t t_k / t_k^t t_k\ (= y^t t_k / t_k^t t_k)$
2.5 $X_k = X_{k-1} - t_k p_k^t$
2.6 $y_k = y_{k-1} - t_k q_k$
3. Calculation of regression coefficients:
$$\hat{\beta}_K = W_K (W_K^t X^t X W_K)^{-1} W_K^t X^t y,$$
4. Prediction equation:
$$\hat{y} = (\bar{y} - \hat{\beta}_K^t \bar{x}) + \hat{\beta}_K^t x,$$
where n and p indicate the numbers of observations and explanatory variables, respectively, X is an $n \times p$ matrix of explanatory variables, y is an $n \times 1$ vector of the response variable, $\mathbf{1}$ is a vector with all elements equal to 1, $\bar{x}$ is the mean vector of X, $\bar{y}$ is the mean of y, and $K$ ($\le$ rank of X) is the number of components employed in the model. PLS becomes equivalent to OLS if it uses all possible components. An important issue in applying PLSR is how to determine the number of components K.
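For concreteness, a minimal Python sketch of the algorithm above (univariate-response PLS) could look as follows; the function name pls1 is ours, and the centering in step 1 follows the reconstruction above.

```python
import numpy as np

# Minimal sketch of the PLSR algorithm reviewed above (univariate response y).
def pls1(X, y, K):
    xbar, ybar = X.mean(axis=0), y.mean()
    Xk, yk = X - xbar, y - ybar                 # step 1: centering
    W = []
    for _ in range(K):                          # step 2
        w = Xk.T @ yk                           # 2.1 covariance vector
        t = Xk @ w                              # 2.2 score vector
        pk = Xk.T @ t / (t @ t)                 # 2.3 X-loading
        qk = yk @ t / (t @ t)                   # 2.4 y-loading
        Xk = Xk - np.outer(t, pk)               # 2.5 deflate X
        yk = yk - t * qk                        # 2.6 deflate y
        W.append(w)
    W = np.column_stack(W)
    Xc, yc = X - xbar, y - ybar
    # step 3: beta_K = W (W' X'X W)^{-1} W' X' y on the centered data
    beta = W @ np.linalg.solve(W.T @ Xc.T @ Xc @ W, W.T @ (Xc.T @ yc))
    return beta, ybar - xbar @ beta             # step 4: y_hat = intercept + x' beta
```

With K equal to the rank of X, this sketch reproduces the OLS solution, in line with the equivalence noted above.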
3. LDF
Suppose that there exist g groups $\pi_1, \pi_2, \ldots, \pi_g$ and that an observation x from group $\pi_i$ follows a p-variate normal distribution $N(\mu_i, \Sigma)$ with a common covariance matrix $\Sigma$. Also suppose that the costs $c(j \mid i)$ due to misclassifying an observation from $\pi_i$ to $\pi_j$, for $i, j = 1, 2, \ldots, g$, are the same for all pairs $(i, j)$. Then the LDF for the i-th group which minimizes the total cost is expressed as
(1)
where $q_i$ is the prior probability of drawing an observation from group $\pi_i$. In the sample version, $\mu_i$, $\Sigma$ and $q_i$ are replaced by the estimates $\bar{x}_i$, $\hat{\Sigma}$ and $\hat{q}_i$, respectively. Applying the sample-version LDF to, say, the i-th and j-th groups, we obtain $LDF_{ij}$ for these groups, defined by
$$LDF_{ij}(x) = (\bar{x}_i - \bar{x}_j)^t \hat{\Sigma}^{-1} x - \frac{1}{2}(\bar{x}_i - \bar{x}_j)^t \hat{\Sigma}^{-1}(\bar{x}_i + \bar{x}_j) + \ln(\hat{q}_i) - \ln(\hat{q}_j). \qquad (2)$$
The PLSR coefficient vector of the previous section can be written as
$$\hat{\beta}_{PLSR} = H_K \hat{\beta}_{OLS},$$
where
$$H_K = W_K (W_K^t X^t X W_K)^{-1} W_K^t X^t X. \qquad (3)$$
Let $U(W_K)$ be the linear subspace spanned by the columns of the matrix $W_K$, and let the inner product in U be defined by $\langle a, b \rangle = a^t X^t X b$. Then $H_K$ is the explicit expression for an orthogonal projector onto the K-dimensional subspace U (Rao, 1973). Consequently the coefficient $\hat{\beta}_{PLSR}$ is obtained by projecting the ordinary regression coefficient vector $\hat{\beta}_{OLS}$ onto the subspace U, which stabilizes the estimator by reducing the dimension.
In principal components regression (PCR), the regression coefficient vector is stabilized by reducing the dimension of the eigenspace of $X^t X$, which is calculated from X only. In PLSR, however, the regression coefficient vector is stabilized by reducing the dimension of the subspace spanned by successively derived covariance vectors between X and y. We are of the opinion that this difference between PCR and PLSR makes PLSR perform slightly better than PCR. The comparison between PLSR and PCR was reported by Frank and Friedman (1993) and Kim and Tanaka (1994), among others.
4.2 PLSD
PLSD was proposed by Kim and Tanaka (1995a) for the purpose of reducing the effects of multicollinearity. The proposed PLSD is obtained by replacing $\hat{\Sigma}^{-1}$ in $LDF_{ij}$ for the i-th and j-th groups by $H_{ijK}$ defined as
(4)
Namely,
$$PLSD_{ij}(x) = (\bar{x}_i - \bar{x}_j)^t H_{ijK}\, x - \frac{1}{2}(\bar{x}_i - \bar{x}_j)^t H_{ijK}(\bar{x}_i + \bar{x}_j) + \ln(\hat{q}_i) - \ln(\hat{q}_j). \qquad (5)$$
Here $W_{ijK}$ consists of K covariance vectors obtained by applying PLSR to the data of the i-th and j-th groups, where a dummy variable with the values $-n_j/(n_i + n_j)$ and $n_i/(n_i + n_j)$ is used as the response variable to indicate which group an observation belongs to. The basic idea behind this is that the LDF is mathematically equivalent to the OLS regression of binary responses.
In the case of more than two groups, the ${}_gC_2$ $PLSD_{ij}$'s are calculated and an observation x is assigned to the i-th or j-th group based on $PLSD_{ij}$ for all possible pairs. Then x is classified into the group to which it is most often assigned. The proposed method coped very well with real data with more than two groups, such as Fisher's iris data with three groups (Kim and Tanaka, 1995a) and Yoshimura et al.'s arc pattern data with twenty groups (Kim and Tanaka, 1996).
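A sketch of this pairwise majority-vote scheme, assuming a hypothetical helper plsd_ij(x, i, j) that returns the value of (5) (positive when x is assigned to group i):

```python
import numpy as np
from itertools import combinations

# Majority vote over all pairwise PLSD_ij discriminant scores.
def classify(x, g, plsd_ij):
    votes = np.zeros(g, dtype=int)
    for i, j in combinations(range(g), 2):     # all gC2 pairs of groups
        winner = i if plsd_ij(x, i, j) > 0 else j
        votes[winner] += 1
    return int(votes.argmax())                 # group most often assigned
```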
The classification obtained by $PLSD_{ij}$ becomes equivalent to that by the LDF if all possible components are used in $PLSD_{ij}$, for the same reason that PLSR becomes OLS regression if the full set of components is employed. This fact suggests that PLSD has at least equal and possibly better performance than the LDF if the number of components is properly determined. It is therefore important how to choose the number of components K, as in the case of PLSR.
As a criterion for choosing K, the correct discrimination rate (CDR), $CDR = 100 \times \sum_{i=1}^{g} e_i / n$, is computed using the cross-validation method, where $e_i$ is the number of correctly classified observations in the i-th group. We search for the PLSD model with the maximum value of cross-validated CDR. In the case where two or more PLSD models attain the same maximum value of CDR, we choose the PLSD model with the smallest number of components as a tentative rule.
Kim and Tanaka (1996) showed that PLSD performs better than the LDF, comparing the two methods through a simulation study and a real data set.
In PLSDA the common pooled within-group covariance matrix is estimated from all groups,
$$\hat{\Sigma}_A = \frac{1}{n - g}\sum_{k=1}^{g}(n_k - 1) S_k, \qquad (6)$$
while in PLSDT only the two related groups are used,
$$\hat{\Sigma}_{T,ij} = \frac{(n_i - 1) S_i + (n_j - 1) S_j}{n_i + n_j - 2}, \qquad (7)$$
where $S_k$ is the sample covariance matrix of the $n_k$ observations in the k-th group.
Naturally, it is expected that PLSDA fits the case where the covariance matrices
are equal for all groups and that PLSDT is suitable for the case where groups have
unequal covariance matrices.
[Figure: cross-validated correct discrimination rates of PLSDA and PLSDT plotted against the number of variables, in four panels: balanced/unbalanced data with equal/unequal covariances.]
As expected, PLSDA shows remarkably better performance than PLSDT in all situations in the case of equal covariances. In the case of unequal covariances, there are great differences between the CDRs of PLSDA and PLSDT at p = 5, 15 (well-conditioned situations), but no great differences at p = 25, 35, 45, 55 (poorly conditioned situations).
We can explain these results in such a way that the advantage of PLSDT in the case of unequal covariances is not large enough to overcome the disadvantage of the smaller degrees of freedom of the covariances compared to the case of PLSDA. That is, the precision of the estimate $\hat{\Sigma}_{ij}$ is worse in PLSDT than in PLSDA, because PLSDT always has a smaller sample size than PLSDA for estimating the covariance matrices. We think there is a relationship between sample size and dimension: increasing the number of dimensions of the explanatory data yields better results when the data are well conditioned for the given sample size, but not when the data set is ill-conditioned or multicollinear, because of the unstable covariance matrix. We therefore suggest that dimension-reduction techniques are useful for avoiding the singularity or multicollinearity problems caused by ill-conditioned or multicollinear data sets.
                        # of variables
Data / cov.      5          15         25         35         45         55
Bal.   Equal  2.16/2.11  2.69/3.29  3.42/4.69  3.56/4.00  4.10/5.38  3.45/4.83
Bal.   Uneq.  5.41/5.30  3.11/3.28  4.05/4.34  3.92/3.23  4.49/3.56  3.72/3.28
Unbal. Equal  1.65/2.58  3.48/3.83  3.82/5.53  4.23/5.77  4.95/4.91  3.96/3.78
Unbal. Uneq.  5.41/4.79  4.05/3.90  4.01/4.72  3.82/3.85  3.98/3.35  3.64/4.08
6. Discussion
In this paper two types of partial least squares linear discriminant function (PLSD) are investigated through a simulation study. In the first type, abbreviated PLSDA, the pooled within-group covariance matrix of all groups is used as an estimate of the common covariance matrix in the discriminant function, while in the second type, abbreviated PLSDT, the pooled within-group covariance matrix of only the i-th and j-th groups is used for discriminating these two groups. From the results of the simulation study we can conclude:
(1) When the covariance matrices are common to all groups, PLSDA performs better than PLSDT, as expected.
(2) When the covariance matrices are different, PLSDT is expected to perform better than PLSDA. However, we can say so only in well-conditioned situations, not in poorly or ill-conditioned situations.
As discussed by Flury et al. (1994), it is natural to use principal components in discriminant analysis (PCD) to reduce the dimensionality of the explanatory variables, as in regression analysis. We therefore plan to compare PLSD with PCD. Moreover, it remains an open task for future work to compare PLSD with other kinds of discriminant functions such as Friedman's (1989) regularized discriminant function (RDF) and functions using shrinkage estimators (see, e.g., James and Stein 1961; Efron and Morris 1976; Stigler 1990), which aim to reduce the effects of multicollinearity and can be applied in poorly conditioned or ill-conditioned situations (see, e.g., Titterington 1985; O'Sullivan 1986).
References:
Efron, B., and Morris, C. (1976), Multivariate Empirical Bayes and Estimation of Covariance Matrices, The Annals of Statistics, Vol. 4, pp.22-32.
Flury, B., Schmid, M. J. and Narayanan, A. (1994), Error Rates in Quadratic Discrimination with Constraints on the Covariance Matrices, Journal of Classification, Vol. 11, pp.101-120.
Frank, I. E. and Friedman, J. H. (1993), A Statistical View of Some Chemometrics Regression Tools, Technometrics, Vol. 35, No. 2, pp.109-148.
Friedman, J. H. (1989), Regularized Discriminant Analysis, Journal of the American Statistical Association, Vol. 84, pp.165-175.
James, W., and Stein, C. (1961), Estimation with Quadratic Loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp.361-379, Berkeley: University of California Press.
Kim, H. B., and Tanaka, Y. (1994), A Numerical Study of Partial Least Squares Regression with an Emphasis on the Comparison with Principal Component Regression, Proceedings of the Eighth Japan and Korea Joint Conference of Statistics, pp.83-88, Okayama, Japan.
Kim, H. B., and Tanaka, Y. (1995a), Linear Discriminant Function Using Partial Least Squares Method, Proceedings of the International Conference on Statistical Methods and Statistical Computing for Quality and Productivity Improvement (ICSQP '95), Vol. 2, pp.875-881, Seoul, Korea.
Kim, H. B., and Tanaka, Y. (1995b), Generating Artificial Data with Preassigned Degree of Multicollinearity by Using Singular Value Decomposition, The Journal of the Japanese Society of Computational Statistics, Vol. 8, pp.1-8.
Kim, H. B., and Tanaka, Y. (1996), Application of Partial Least Squares Linear Discriminant Function to Writer Identification in Pattern Recognition, Journal of the Faculty of Environmental Science and Technology, Okayama University, Vol. 1, pp.65-76.
O'Sullivan, F. (1986), A Statistical Perspective on Ill-Posed Inverse Problems, Statistical Science, Vol. 1, pp.502-527.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd Edition, John Wiley & Sons, Inc., New York.
Stigler, S. M. (1990), The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators, Statistical Science, Vol. 5, pp.147-155.
Titterington, D. M. (1985), Common Structure of Smoothing Techniques in Statistics, International Statistical Review, Vol. 53, pp.141-170.
Wold, H. (1975), Soft Modeling by Latent Variables: the Non-linear Iterative Partial Least Squares Approach, In Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, Edited by J. Gani, Academic Press, Inc., London.
Wold, S., Wold, H., Dunn, W. J., and Ruhe, A. (1984), The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses, SIAM Journal on Scientific and Statistical Computing, Vol. 5, pp.735-743.
Yoshimura, M., Yoshimura, I. and Kim, H. B. (1993), A Text-Independent Off-Line Writer Identification Method for Japanese and Korean Sentences, IEICE Trans. Inf. & Syst., Vol. E76-D, No. 4, pp.454-461.
Resampling Methods for Error Rate Estimation
in Discriminant Analysis
Masayuki Honda and Sadanori Konishi
1. Introduction
The main aim in discriminant analysis is to allocate a future observation to one of a
finite number of distinct groups or populations on the basis of several characteristics
of the observation. It is assumed here that an individual with an observation on a p-dimensional random vector is allocated to one of two p-variate populations, and
that allocation is carried out on the basis of Fisher's linear discriminant function or
the quadratic discriminant function.
In practice it is important to estimate the error rates in allocating a randomly selected
future observation. For the problem of estimating the actual error rates (conditional
error rates), Efron (1979, 1983) proposed nonparametric bootstrap methods, such as
the bootstrap bias-corrected apparent error rate and 'the 0.632 estimator'. Ganeshanandam and Krzanowski (1990) investigated several methods, including the 0.632 estimator, for error-rate estimation in Fisher's linear discriminant function. Konishi and Honda (1990) examined parametric and nonparametric methods for estimating the error rates in linear discriminant analysis under normal and nonnormal populations.
Very little work has been done on evaluating error rate estimation procedures for the linear and the quadratic discriminant functions simultaneously. In this paper we investigate several estimation methods for error rates in Fisher's linear discriminant function and the quadratic discriminant function, when the population distribution is assumed to be a mixture of two multivariate normal distributions. We examine the performance of the estimation methods through Monte Carlo simulations, with emphasis on evaluation in nonnormal situations and for the quadratic discriminant function.
(1)
where $\bar{x}_1$, $\bar{x}_2$, $S_1$, $S_2$ and $S$ are, respectively, the sample means, the sample covariance matrices and the pooled sample covariance matrix based on the training sample $X_n = \{x_\alpha^{(i)} : \alpha = 1, 2, \ldots, N_i,\ i = 1, 2\}$.
A future observation $x_0$ is allocated to $\Pi_1$ or $\Pi_2$ according as $x_0$ belongs to the region of discrimination $R_1$ or $R_2$ given by the discriminant rule. The apparent error rate is
$$e_i(\hat{F}_i; X_n) = \int I(x \mid R_j)\, d\hat{F}_i(x) = \frac{1}{N_i}\sum_{\alpha=1}^{N_i} I(x_\alpha^{(i)} \mid R_j)\quad (j \ne i), \qquad (6)$$
where
$$I(x \mid A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \qquad (7)$$
Usually the apparent error rate provides an optimistic assessment of the actual error rate, and hence much attention has been given to the development of estimation procedures.
In the next section we present several methods for the estimation of error rates in linear and quadratic discriminant analyses.
where $X_n^*$ denotes the bootstrap sample of size $(N_1 + N_2)$, the discrimination regions $\hat{R}_i\ (i = 1, 2)$ are constructed based on the bootstrap sample, and $\hat{F}_i^*$ is the empirical distribution function of the bootstrap sample $\{x_\alpha^{(i)*} : \alpha = 1, 2, \ldots, N_i\}$.
The bias of the apparent error rate is approximated by averaging $\{e_i(\hat{F}_i; X_n^*) - e_i(\hat{F}_i^*; X_n^*)\}$ over a large number of repeated bootstrap samples; denote the average by $\hat{b}_i$ for $i = 1, 2$. Then we have the bias-corrected apparent error rate $e_i(\hat{F}_i; X_n) + \hat{b}_i$, called the BS method.
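A minimal sketch of the BS method, assuming hypothetical helpers fit(X, y) (builds the discriminant rule) and err(rule, X, y) (error rate of the rule on given data); neither name comes from the paper.

```python
import numpy as np

# Bootstrap bias correction of the apparent error rate (BS method).
def bs_corrected_error(X, y, fit, err, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = err(fit(X, y), X, y)          # optimistic apparent error rate
    bias = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)     # bootstrap sample of size n
        rule = fit(X[idx], y[idx])           # rule built on the bootstrap sample
        # error under the original empirical distribution minus the bootstrap
        # apparent error approximates the optimism of the apparent error rate
        bias += err(rule, X, y) - err(rule, X[idx], y[idx])
    return apparent + bias / B
```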
3.2 The 0.632 bootstrap estimator
Efron (1983) proposed the 0.632 bootstrap estimator given by
$$\hat{e}_{0.632} = 0.368\, e(\hat{F}_1, \hat{F}_2; X_n) + 0.632\, \hat{e}_0, \qquad (10)$$
where $e(\hat{F}_1, \hat{F}_2; X_n)$ denotes the total apparent error rate of the discriminant rule ($n = N_1 + N_2$) and $\hat{e}_0$ is the error rate counted only over the observations excluded from the bootstrap samples used to construct the rule. Here $X_n^*(j)$ denotes the j-th bootstrap sample of size n for $j = 1, \ldots, B$, $\delta_{\alpha j}$ equals 1 when $x_\alpha^{(i)}$ does not belong to $X_n^*(j)$ and zero otherwise, and $n_b = \sum_{j=1}^{B}(\sum_{\alpha=1}^{N_1}\delta_{\alpha j} + \sum_{\alpha=1}^{N_2}\delta_{\alpha j})$. We call this estimator the 632 method.
The 0.632 estimator can be considered a weighted sum of the apparent error rate and a kind of cross-validation estimate with appropriate weights. Fitzmaurice et al. (1991) investigated the performance of the 0.632 estimator by a Monte Carlo simulation and showed that its performance depends on the true error rate.
3.3 Cross-validation method
In cross-validation, or the leaving-one-out method, an individual is removed from the training samples and the LDF $h(x \mid X_n)$ or QDF $q(x \mid X_n)$ is constructed with the remaining data. We then check whether or not the removed individual is allocated to the correct population. This is done for each individual in the samples in turn; the proportion of misallocated individuals gives the cross-validation estimate of the error rate.
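A sketch of the leaving-one-out computation, with the same hypothetical fit/predict helpers as before:

```python
import numpy as np

# Cross-validation (leaving-one-out) estimate of the error rate.
def cv_error(X, y, fit, predict):
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i              # remove the i-th individual
        rule = fit(X[keep], y[keep])          # rebuild the rule without it
        wrong += predict(rule, X[i]) != y[i]  # check its allocation
    return wrong / n                          # proportion misallocated
```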
The values of $\beta_{1,p}$ and $\beta_{2,p}$ given in Table 5 for a mixture of two multivariate normal distributions in (13) were calculated using the formulae obtained by Konishi and Honda (1990).
Several findings from the simulation study are summarized in the following.
• Under the assumption of multivariate normality (i.e. $\beta_{1,p} = 0$, $\beta_{2,p} = p(p+2)$) and equal covariance matrices, all methods besides the BS method in QDF provide estimates with small biases when the two populations are not close together (see the cases with $\epsilon = 0$ in Tables 1 and 2).
• Apparent error rates (AP method) clearly underestimate the actual error rates (TV) in both normal and nonnormal situations. The difference between TV and AP is larger in QDF than in LDF.
• The bootstrap method (BS method) performs well in LDF, but it is a little biased in QDF. High dimension adversely affects the BS method when QDF is used, because the AP method underestimates more severely in the case of p = 8 than in the case of p = 4 (see Tables 1 and 2).
• The 632 method performs fairly well with regard to mean square error in both normal and nonnormal situations. But it slightly underestimates the actual error rate in the high-dimensional case (Table 2) of nonnormal situations in QDF, and in the case of closer populations (Table 3).
• The cross-validation method (CV) has a little bias relative to the actual error rates but has larger mean square errors than the 632 method in LDF. When QDF is used, CV performs well with regard to unbiasedness.
• The parametric estimator (QM method) gives overestimated results in the cases further from normality and equality of covariances. But overall it performs very well in the case of $\epsilon = 0$.
• All things considered, the CV method is superior to the other methods with regard to unbiasedness, and the 632 method is far superior with regard to mean square error.
In practical situations, nonnormal populations and populations that lie close together often occur. When $\epsilon$ varies from 0.1 through 0.5, the populations appear nonnormal according to the measure of nonnormality given in Table 5. In such cases we would recommend the 632 and BS methods in linear discriminant analysis, and the CV and 632 methods in quadratic discriminant analysis.
The results of our simulation study also indicate that QDF is superior to LDF for two normal populations with unequal covariance matrices, provided that the sample sizes are sufficiently large. QDF performs poorly when the dimension p is high relative to the sample sizes. Sample size is a critical factor in choosing between LDF and QDF with normal data, as shown by Marks and Dunn (1974) and Wahl and Kronmal (1977).
Acknowledgment
The authors would like to thank referees for their helpful comments and suggestions.
References
Andrews, D. F. and Herzberg, A. M. (1985): Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag, New York.
Ashikaga, T. and Chang, P. C. (1981): Robustness of Fisher's linear discriminant function under two-component mixed normal models. J. Amer. Statist. Assoc. 76, 676-680.
Efron, B. (1979): Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B. (1983): Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78, 316-331.
Fitzmaurice, G. M., Krzanowski, W. J. and Hand, D. J. (1991): A Monte Carlo study of the 632 bootstrap estimator of error rate. J. of Classification 8, 239-250.
Ganeshanandam, S. and Krzanowski, W. J. (1990): Error-rate estimation in two-group discriminant analysis using the linear discriminant function. J. Statist. Comput. Simul. 36, 157-175.
Konishi, S. and Honda, M. (1990): Comparison of procedures for estimation of error rates in discriminant analysis under nonnormal populations. J. Statist. Comput. Simul. 36, 105-115.
Mardia, K. V. (1970): Measures of multivariate skewness and kurtosis with applications. Biometrika 57, 519-530.
Marks, S. and Dunn, O. J. (1974): Discriminant functions when covariance matrices are unequal. J. Amer. Statist. Assoc. 69, 555-559.
McLachlan, G. J. (1974): An asymptotic unbiased technique for estimating the error rates in discriminant analysis. Biometrics 30, 239-249.
Wahl, P. and Kronmal, R. (1977): Discriminant functions when covariances are unequal and sample sizes are moderate. Biometrics 33, 479-484.
A Short Overview of the Methods
for Spatial Data Analysis
Masaharu Tanemura
The Institute of Statistical Mathematics
4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan
Summary: Some methods of spatial data analysis are presented in the manner of a short overview, including recent developments in this field. First, it is shown that spatial indices based on quadrat counts or nearest neighbour distances, which have been devised mostly by ecologists, are still useful for preliminary analysis of spatial data. Then it is discussed that distance functions such as the nearest neighbour distribution and the K function are useful for the diagnostic analysis of spatial data. Further, it is shown that maximum likelihood procedures for estimating and fitting pair interaction potential models are very useful for a wide class of spatial patterns. Finally, it is pointed out that Markov chain Monte Carlo (MCMC) methods are powerful tools for spatial data analysis and for other fields.
In field studies, spatial data are often sampled according to the following methods:
1. Quadrat method: counts of individuals in contiguous quadrats are obtained. Let $s_i$ be the count in the i-th quadrat ($i = 1, 2, \ldots, q$), where q is the number of quadrats.
2. Nearest neighbour distance method: usually two types of nearest neighbour distances are measured; the distance from a randomly sampled point to its nearest individual ($r_1$), and the distance from a randomly sampled individual to its nearest individual ($r_2$).
For these types of spatial data, many spatial indices have been considered, mainly by ecologists, in order to represent the degree of aggregation of individuals.
2.1 Spatial indices for quadrat counts
As regards quadrat counts, we cite here only three indices. David and Moore (1954) presented the index $I = V/m - 1$, where m and V are, respectively, the sample mean and variance of the $s_i$'s. Morisita (1959) introduced the index $I_\delta = q \sum_{i=1}^{q} s_i(s_i - 1) / (N(N - 1))$, where $N = \sum_i s_i$; this is often called Morisita's $I_\delta$. Lloyd (1967) presented the index $\overset{*}{m} = m + V/m - 1$, called the mean crowdedness.
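A minimal sketch computing the three indices from quadrat counts; whether V is taken with divisor q or q - 1 is our assumption.

```python
import numpy as np

# David-Moore I, Morisita's I_delta, and Lloyd's mean crowdedness from counts s_i.
def quadrat_indices(s):
    s = np.asarray(s, dtype=float)
    q, N = len(s), s.sum()
    m, V = s.mean(), s.var(ddof=1)                       # sample mean and variance
    I = V / m - 1.0                                      # David and Moore (1954)
    I_delta = q * np.sum(s * (s - 1.0)) / (N * (N - 1))  # Morisita (1959)
    m_star = m + V / m - 1.0                             # Lloyd (1967)
    return I, I_delta, m_star
```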
Fig. 1: A pattern of 584 trees of longleaf pines (Pinus palustris) taken from Cressie (1993).
Let us show an example of the values of these indices. Figure 1 shows the stands of longleaf pines (Pinus palustris) (Cressie, 1993). For these data, the values of the above indices are, respectively, $I = 1.335$, $I_\delta = 1.930$ and $\overset{*}{m} = 2.765$. These values indicate that the stands of longleaf pines in Fig. 1 form a clustered pattern.
2.2 Spatial indices for nearest neighbour distances
As regards nearest neighbour distances, we also show three indices. Hopkins and Skellam (1954) gave the index $A = \sum r_1^2 / \sum r_2^2$, Clark and Evans (1954) proposed the index $R = 2\sqrt{\lambda}\, \sum r_2 / n$, and Besag and Gleaves (1973) presented the index $T = 2\sum r_1^2 / \sum t^2$. Here n is the number of samples of $r_2$, $\lambda$ is the intensity of the Poisson point process, and $t$ denotes the so-called T-square samples devised by Besag and Gleaves.
For the data of Fig. 1, the values of the indices given above are $A = 2.049$, $R = 0.849$ and $T = 1.646$. These values again indicate the clustered nature of the longleaf pines data.
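As one example, the Clark-Evans index can be computed from a point pattern as in the following sketch (a rectangular region of known area is assumed, and edge effects are ignored):

```python
import numpy as np
from scipy.spatial import cKDTree

# Clark-Evans index R = 2 sqrt(lambda) * mean nearest-neighbour distance;
# R < 1 suggests clustering, R > 1 regularity, R = 1 a Poisson pattern.
def clark_evans(points, area):
    pts = np.asarray(points, dtype=float)
    dist, _ = cKDTree(pts).query(pts, k=2)   # k=2: first hit is the point itself
    lam = len(pts) / area                    # intensity estimate
    return 2.0 * np.sqrt(lam) * dist[:, 1].mean()
```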
This is illustrated in Fig. 2, where the radius of each shaded disc is r. In order to compute the area of the union of discs, it is easiest to use the Voronoi tessellation as shown in Fig. 2.
It is also important to note that we can obtain an exact empirical 'density' $f(r) = dp(r)/dr$ in the following manner:
$$f(r) = |S|^{-1}\{\text{peripheral length of the union of discs with radius } r \text{ centered at every individual}\}.$$
Fig. 2: Illustration for computing the exact empirical distribution $p(r)$ and its density $f(r)$.
In Fig. 2, the value of $f(r)$ is obtained by computing the total peripheral length of the union of discs. This is again not difficult if we use the Voronoi tessellation as in Fig. 2.
Let us show an example. In Fig. 3, a pattern of nests of gray gulls (Larus modestus) is given. In Fig. 4(a), the exact empirical distribution estimated for the gray gull data of Fig. 3 is shown as the curve with crosses. We have performed computer simulations of the Poisson point process for the same number of individuals in the same size of area. In this figure, the envelopes of the $p(r)$'s obtained from 19 simulations are represented as curves. As a comparison, we give in Fig. 4(b) similar curves for the empirical distribution $q(r)$ for the same data sets as in Fig. 4(a). It is obvious that the range of the envelopes in Fig. 4(a) is narrower than in Fig. 4(b). This indicates that the power of the test against the Poisson model is greater for $p(r)$ than for $q(r)$. Actually, we can see from Fig. 4(a) that the curve of $p(r)$ for the gray gull data systematically deviates from the envelopes for the Poisson model. This indicates the rejection of the Poisson model. Figures such as Figs. 4(a) and (b) thus demonstrate their usefulness for diagnostic analyses.
Fig. 3: Pattern of 110 nests of gray gulls (Larus modestus) with its Voronoi tessellation.
[Fig. 4: (a) the exact empirical distribution p(r) and (b) the empirical distribution q(r) for the gray gull data, with envelopes from 19 simulations of the Poisson model.]
where $\tilde{Z} = Z/|S|^N$. For the Poisson model, $\Phi_\theta(r) \equiv 0$ and $Z(\Phi_\theta; N, S) = |S|^N$ hold.
In carrying out the likelihood procedure, we encounter a serious difficulty in obtaining $L(\Phi_\theta; X)$ and $Z(\Phi_\theta; N, S)$ for a general $\Phi_\theta(r)$ as a function of $\theta$. This is due to the high multiplicity of integration in the normalizing constant Z.
To overcome this difficulty, efforts have been made to devise methods for obtaining approximate log-likelihoods. Some of these methods are the cluster expansion, the virial expansion and the polynomial approximation through computer experiments (Ogata and Tanemura, 1981, 1984, 1989). For computer experiments with Gibbs point processes, the so-called Markov chain Monte Carlo method is most suitable. This is discussed further in the next section.
As for the interaction potential models, it is desirable to have models that can cover a wide class of patterns of individuals, including regular and clustered patterns, which respectively correspond to repulsive and attractive interactions. For that purpose, we need to consider potential models with several parameters, possibly from different classes of potential families. In order to select a suitable model among competing models, the information criterion AIC is useful:
AIC = -2 (maximum log-likelihood) + 2 (number of adjusted parameters).
The model with the minimal AIC value is most suitable. With this criterion, the Poisson model is always considered as a candidate, since AIC ≡ 0 for this model.
Here we show some examples. Let us consider the so-called 'soft-core potential models' $\Phi_{\sigma,n}(r) = (\sigma/r)^n$, $\sigma > 0$, $n > 2$ (Ogata and Tanemura, 1984, 1989). This family of potentials can cover a certain class of point patterns with repulsive interactions. Table 1 shows some of the results of fitting the soft-core models (Ogata and Tanemura, 1989).
Data     σ̂                  n̂     τ
Pines    0.14 meters        ∞     0.04
Gulls    2.26 meters        6.9   0.06
Iowa     16.8 miles         6.2   0.37
Balls    1.10 millimeters   5.8   0.42

Table 1: Results of fitting soft-core models to real data.
Here, $\tau = N\hat{\sigma}^2/A$ represents the crampedness of a pattern. The data for 'Gulls' correspond to Fig. 3.
is obtained by $(1/t)\sum_{i=1}^{t} g(X_i)$ after generating a Markov chain $(X_1, X_2, \ldots, X_t, \ldots)$ without knowing the normalizing constant of the density $u$. This is because $X_t \to X \sim u(X)$ and $(1/t)\sum_{i=1}^{t} g(X_i) \to E_u[g(X)]$ as $t \to \infty$, almost surely.
Let $q(x, x')$ be an arbitrary transition probability such that, if $X_t = x$, a sample $x'$ drawn from $q(x, x')$ is considered as a proposed value for $X_{t+1}$. Then the essence of the Markov chain Monte Carlo (MCMC) method is to choose the transition probability $p(x, x')$ in such a way that $u$ is its stationary distribution.
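The acceptance rule itself is elided above; a standard concrete choice (the Metropolis algorithm with a symmetric proposal) is sketched below under our own naming.

```python
import numpy as np

# Metropolis sampling from an unnormalized density u(x): the normalizing
# constant cancels in the acceptance ratio u(x') / u(x).
def metropolis(log_u, x0, step, t_max, seed=0):
    rng = np.random.default_rng(seed)
    x, chain = np.asarray(x0, dtype=float), []
    for _ in range(t_max):
        x_new = x + step * rng.standard_normal(x.shape)  # proposal from q(x, .)
        if np.log(rng.random()) < log_u(x_new) - log_u(x):
            x = x_new                                    # accept the proposal
        chain.append(x.copy())
    return np.array(chain)

# ergodic average (1/t) sum g(X_i) -> E_u[g(X)], e.g. for g(x) = x:
# est = metropolis(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 0.5, 10000).mean(axis=0)
```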
References:
Besag, J., Green, P., Higdon, D. and Mengersen, K. (1995): Bayesian computation and stochastic systems, Statistical Science, 10, 3-66.
Cressie, N.A.C. (1993): Statistics for Spatial Data, Revised edition, John Wiley & Sons, New York.
Geyer, C.J. (1992): Practical Markov chain Monte Carlo, Statistical Science, 7, 473-483.
Hasegawa, M. and Tanemura, M. (1986): Ecology of Territories - Statistics of Ecological Models and Spatial Patterns, Tokai University Press (in Japanese).
Morisita, M. (1959): Measuring of the dispersion and analysis of distribution patterns, Memoires of the Faculty of Science, Kyushu University, Series E, Biology, 2, 215-235.
Ogata, Y. and Tanemura, M. (1981): Estimation of interaction potentials of spatial point patterns through the maximum likelihood procedure, Annals of the Institute of Statistical Mathematics, 33B, 315-338.
Ogata, Y. and Tanemura, M. (1984): Likelihood analysis of spatial point patterns, Journal of the Royal Statistical Society, Series B, 46, 496-518.
Ogata, Y. and Tanemura, M. (1989): Likelihood estimation of soft-core interaction potentials for Gibbsian point patterns, Annals of the Institute of Statistical Mathematics, 41, 583-600.
Okabe, A. and Miki, K. (1981): A statistical method for the analysis of a point distribution in relation to networks and structural points, and its empirical application, DP-3, Department of Urban Engineering, University of Tokyo.
Tanemura, M. (1983): Statistics of spatial patterns - Territorial patterns and their formation mechanisms, Suri Kagaku (Mathematical Science), 246, 25-32 (in Japanese).
Choice of Multiple Representative Signatures for
On-line Signature Verification
Using a Clustering Procedure
Isao Yoshimura, Mitsu Yoshimura and Shin-ichi Matsuda
Summary: The task of signature verification is to judge whether the writer of a signature is truly the declared person or not, by referring to a previously provided signature database. A questioned signature is judged genuine when its observed dissimilarity to a set of signatures previously written by the declared person and included in the database is less than a given threshold. This paper shows, through a verification experiment using a signature database supplied by CADIX Co. Ltd., that the use of multiple representatives, each representing a cluster constructed by a clustering procedure, is effective. In the experiment, the resulting error rate decreased from about 6% to about 1% on average when the number of representatives was increased from one to three.
1. Introduction
This paper deals with an automatic signature verification for the identification of
an individual based on on-line information, where the problem is to verify an input
signature to be a realization of the declared autograph by comparing it with a set of
authentic signatures previously provided.
From a statistical viewpoint, this problem can be formulated as one of judging whether a questioned sample belongs to the declared population of authentic signatures, based on a set of samples from this population, referred to as the reference sample in the following. When we can assume a suitable distribution for the population, the problem reduces to estimating the population parameters from the reference sample and judging whether the questioned sample is located near the central part of the estimated distribution.
In real situations, however, the population is not so homogeneous that we can assume a simple distribution. It seems better to use signatures themselves to represent the population, without assuming any distribution. There then arises the problem of how to construct representative signatures from the reference sample.
In our experience, people often write their signatures in two or three different forms (examples are shown in Fig. 1), although most of them are similar, as shown in Fig. 2. From this observation we conceived the idea that we can construct a signature verification system with good performance by using two or three signatures as representatives. Fortunately, a suitable signature database (supplied by CADIX Co. Ltd.) was available to the first two authors of this paper.
Figure 1 An example of two different forms of signatures written by the same person
Figure 2 Examples of similar signatures written by the same person
2. Devised system
Our system receives any signature as a set of time series $\{(x(t), y(t), p(t));\ t = 1, 2, \ldots, T\}$ of pen position (x, y) and writing pressure p, where T is the number of sampling points, automatically fixed by the sampling time inherent in the input device. We retain in the system, in advance, a set of authentic signatures as a database to be referred to in verification. After a stage of preprocessing, such as the normalization of size and sampling points, the system classifies these reference signatures into a certain number (three in this paper) of clusters by using the complete linkage method (cf. Yanagisawa and Ohsumi (1979)) and chooses one representative from each cluster, where the distance for clustering is the dissimilarity measure explained in the next section.
When the system receives a questioned signature, it measures the dissimilarity between the questioned signature and each representative and uses their minimum as the dissimilarity between the questioned signature and the declared autograph. If the dissimilarity is less than a given threshold, the questioned signature is judged genuine; otherwise it is judged forged. The threshold is defined for each autograph as the mean value of the dissimilarity measure among authentic signatures multiplied by an optional coefficient C, which is specified by the system manager. Note that although there are some optional parts in the system, such as the method of normalization or the choice of weights, they are kept fixed in this paper because they do not seriously affect the conclusions drawn from the experiment.
3. Dissimilarity measure
Although there are various proposals concerning the dissimilarity measure (see Yoshimura and Yoshimura (1996)), we adopted a measure similar to that of Sato and Kogure (1982), whose characteristic feature is the adoption of a dynamic programming matching method for adjusting the time scales of two signatures. Note that this adjustment is very important in measuring the dissimilarity, because writers often make irregular pauses in pen movement during writing, and these meaningless time durations are better omitted.
The adjustment in this sense can be realized by determining a matching of two time coordinates, say $\xi$ and $\eta$, for the two signatures. The matching thus determined is referred to as the "warping function" in this paper; it is illustrated by the thick line in Fig. 3.
Practically, the system first considers a temporal dissimilarity measure $D^*(z_A, z_B)$ between two signatures $z_A$ and $z_B$ as
$$D^*(z_A, z_B) = \sum_{s=1}^{n(A,B)} [w_p \delta(s, s-1) + 1 - \delta(s, s-1)] \times [\,w_1\{(x_A(\xi_s) - x_B(\eta_s))^2 + (y_A(\xi_s) - y_B(\eta_s))^2\} + w_2\{p_A(\xi_s) - p_B(\eta_s)\}^2 + w_3\{(d_{xA}(\xi_s) - d_{xB}(\eta_s))^2 + (d_{yA}(\xi_s) - d_{yB}(\eta_s))^2\}\,], \qquad (1)$$
where the subscripts A and B correspond to the two signatures, $(\xi_s, \eta_s)$ is a coordinate of a warping function $T = \{(\xi_s, \eta_s);\ s = 1, 2, \ldots, n(A, B)\}$ which represents an adjustment in the time domain, $n(A, B)$ is the number of sampling points, $w_1$, $w_2$ and $w_3$ are properly chosen constants, $\delta$ is an indicator constant related to the s-th point on the warping function, and $(d_x(\xi_s), d_y(\xi_s))$ is the unit vector with the same direction as $(x(\xi_s) - x(\xi_{s-1}),\ y(\xi_s) - y(\xi_{s-1}))$ (see Figs. 3 and 4), i.e., as follows:
$$d_{xA}(\xi_s) = \frac{x_A(\xi_s) - x_A(\xi_{s-1})}{\sqrt{(x_A(\xi_s) - x_A(\xi_{s-1}))^2 + (y_A(\xi_s) - y_A(\xi_{s-1}))^2}}, \qquad (2)$$
and similarly for $d_{yA}(\xi_s)$. The dissimilarity between the two signatures is then defined as
$$D(z_A, z_B) = \min_T D^*(z_A, z_B), \qquad (4)$$
where the minimum is taken over the possible changes of the warping function T under a certain restriction, the details of which are explained in Yoshimura et al. (1991) and Yoshimura and Yoshimura (1992).
[Fig. 3: illustration of the warping function between the time axes $\xi$ and $\eta$. Fig. 4: definition of the unit direction vector $(d_x(\xi_s), d_y(\xi_s))$.]
In the proposed method, the dissimilarities between the questioned signature and the representative signatures are measured; there are five optional parameters $\delta$, $w_p$, $w_1$, $w_2$ and $w_3$ in the definition of this dissimilarity measure. Among them, $\delta$ and $w_p$ are introduced to impose a penalty for warping the time scale. In our proposal, $\delta$ was set to one when $\xi_s = \xi_{s-1}$ or $\eta_s = \eta_{s-1}$, and to zero otherwise; additionally, $w_p$ was set to two. These settings imply that when the increment in one time coordinate at the s-th step differs from that in the other coordinate, the dissimilarity value is increased depending on the degree of distortion of the time scale. The other constants represent the weights on the three variables x, y and p with respect to the differences between the two signatures. Their optimum values should be determined adaptively from the signature data. In the experiment below, they were set as $(w_1, w_2, w_3, w_p) = (0.5, 0.4, 0.1, 2.0)$ based on a preliminary experiment.
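A minimal sketch of the DP-matching computation of (1) and (4), assuming the simple step set {(1,0), (0,1), (1,1)} and omitting the w3 direction term for brevity; the paper's exact warping restrictions are those of Yoshimura et al. (1991).

```python
import numpy as np

# Dynamic-programming matching of two signatures zA, zB of shape (T, 3),
# whose columns are (x(t), y(t), p(t)); returns the minimum over warpings.
def dissimilarity(zA, zB, w1=0.5, w2=0.4, wp=2.0):
    def local(i, j, warped):
        d = zA[i] - zB[j]
        cost = w1 * (d[0]**2 + d[1]**2) + w2 * d[2]**2
        return (wp if warped else 1.0) * cost       # delta penalizes one-sided steps
    TA, TB = len(zA), len(zB)
    D = np.full((TA, TB), np.inf)
    for i in range(TA):
        for j in range(TB):
            if i == 0 and j == 0:
                D[i, j] = local(0, 0, False)
                continue
            best = np.inf
            if i > 0:
                best = min(best, D[i - 1, j] + local(i, j, True))
            if j > 0:
                best = min(best, D[i, j - 1] + local(i, j, True))
            if i > 0 and j > 0:
                best = min(best, D[i - 1, j - 1] + local(i, j, False))
            D[i, j] = best
    return D[-1, -1]                                # cf. equation (4)
```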
4. Experiment
An experiment to examine the effectiveness of our proposal was performed based on a database of on-line signatures provided by CADIX Co. Ltd. The database is composed of 2203 signatures, including both authentic and forged signatures, for 28 autographs (see Table 1). These contain 10 in Roman letters by non-Japanese, 14 in Japanese letters by Japanese and 4 in Roman letters by Japanese; "Roman letters" means the English alphabet (a, b, etc.). Examples of signatures in the database are shown in Fig. 5. In the experiment, ten authentic signatures and five forged signatures for each autograph were used as references, and the remaining signatures were used to evaluate the performance of our system.
In the experiment, the number of clusters was set to one, two and three in order to evaluate the effect of increasing the number of representative signatures.
The most difficult obstacle to achieving good performance of the system was the determination of the optional constants inherent in the system, such as the weights in the dissimilarity measure and the threshold coefficient C. We utilized the reference sample to optimize these constants. It was important that forged signatures were included in the reference because, by comparing the dissimilarities within authentic signatures with those between authentic and forged signatures, we could reasonably evaluate the suitability of various values of C.
The system is composed of two parts: a preprocessing part and a verification part. In the preprocessing part, various normalizations and standardizations, such as the reduction of sampling points, are possible, and these may affect the achieved error rates. In this paper, however, we fixed the normalization and standardization in one form because we confirmed, through preliminary experiments, that such preprocessing did not influence the relative capability of the three cases of the number of representative signatures.
[Fig. 5: examples of genuine and forged signatures in the database.]

Table 1  Composition of the database (Gn. = number of genuine signatures, Forg. = number of forged signatures, with the repetitions per forger in parentheses)

Case RN: Roman sig. by non-Japanese
No.  Gn.  Forg.  (repetitions)
1    30   54     (2,4,5,5,5,5,5,5,6,12)
2    25   64     (4,5,5,5,5,5,5,6,6,6,12)
3    51   53     (5,5,5,5,5,5,5,5,6,7)
4    25   51     (5,5,5,5,5,5,5,5,5,6)
5    25   51     (5,5,5,5,5,5,5,5,5,6)
6    33   56     (5,5,5,5,5,5,5,5,5,5,6)
7    30   65     (5,5,5,5,5,5,5,5,5,5,5,5,5)
8    28   60     (5,5,5,5,5,5,5,5,5,5,5,5)
9    25   64     (4,4,5,5,5,5,5,5,5,5,5,5,6)
10   35   40     (5,5,10,10,10)

Case RJ: Roman sig. by Japanese
No.  Gn.  Forg.  (repetitions)
1    27   62     (1,2,3,4,5,5,5,5,5,5,5,5,6,6)
2    42   62     (5,5,5,5,5,5,5,5,5,5,6,6)
3    31   55     (4,5,5,5,5,5,5,5,5,5,6)
4    25   64     (5,5,5,5,5,5,5,5,5,5,7,7)

Case JJ: Japanese sig. by Japanese
No.  Gn.  Forg.  (repetitions)
1    27   57     (4,5,5,5,5,6,6,6,6,9)
2    25   53     (5,5,5,5,5,5,5,5,8)
3    25   16     (2,6,8)
4    20   52     (3,4,5,5,5,5,5,5,5,5,5)
5    24   53     (5,5,5,5,5,5,5,5,6,7)
6    25   21     (2,7,12)
7    25   9      (9)
8    28   53     (5,5,5,5,5,5,5,6,6,6)
9    26   67     (4,5,5,5,5,5,5,5,5,6,7)
10   26   56     (4,5,5,5,5,5,5,5,6,11)
11   28   53     (5,5,5,5,5,5,5,6,6,6)
12   26   50     (5,5,5,5,5,5,5,5,5,5)
13   34   50     (5,5,5,5,5,5,5,5,5,5)
14   26   15     (2,3,5,5)
5. Result
The effectiveness of the use of multiple representatives is clearly shown in Table 2, which is a part of the results of the experiment with $w_1$, $w_2$, $w_3$ and $w_p$ fixed as 0.5, 0.4, 0.1 and 2.0, respectively. The error rates in Table 2 are an average of those for foreigners and Japanese. For any choice of the threshold coefficient C, at least two representatives are necessary to get good performance of the system in the verification, while more than three do not yield any improvement.

[Fig. 6: error rates against the threshold coefficient C, shown separately for Case RN and Case JJ.]

Table 2  Error rates in %
        # of representatives
  C       1      2      3
 1.4    5.22   1.51   1.46
 1.5    6.28   1.27   1.11
 1.6    8.07   1.83   1.21
 1.7   10.31   2.89   1.73
6. Discussions
The use of multiple representatives of authentic signatures based on a clustering procedure was shown, through an experiment using a signature database, to be effective in decreasing error rates in on-line signature verification. The authors think that this is due to the existence of clusters even in a set of authentic signatures written by one person. It also implies that the various types of a person's signature should be included in the database for signature verification.
The achieved error rates are an average of the two types of errors, and each error varies depending on the threshold, or the threshold coefficient C in our setting of the threshold. If C moves to greater values, type I errors decrease whereas type II errors increase. The trade-off relation is obvious from Fig. 6. The determination of a reasonable value of C is the responsibility of the system manager, while C = 1.5 is a generally good coefficient in our experience.
An example of clustering and of the representative signatures obtained from each cluster is shown in Fig. 7. While the result of the clustering is not visually clear, average error rates decreased, which may imply that automatic verification is better than visual verification. Although the experiment was limited to on-line verification, similar conclusions should hold for off-line verification, too.
Figure 7 Example of realized clusters and representatives, with the time series of (x, y, p)
The number placed at the head of each signature indicates the cluster it belongs to when three clusters are constructed. The bold number denotes the representative signature. The three curves traced against time on the abscissa show the time series of x(t), y(t) and p(t), in this order from left to right.
References:
Sato, Y. and Kogure, K. (1982): On-line signature verification based on shape, motion, and writing pressure, Proc. 6th Int. Conf. Pattern Recognition, 823-826.
Yanagisawa, Y. and Ohsumi, N. (1979): Evaluation procedure for estimating the number of clusters in hierarchical clustering system, The Japanese Journal of Applied Statistics, 8, 51-71.
Yoshimura, I. et al. (1991): On-line signature verification incorporating the direction of pen movement, Trans. IEICE Japan, E74, 2083-2092.
Yoshimura, I. and Yoshimura, M. (1992): On-line signature verification incorporating the direction of pen movement - An experimental examination of the effectiveness, in: From Pixels to Features III, Impedovo, S. and Simon, J. C. (eds.), 353-361, North Holland, Amsterdam.
Yoshimura, M. and Yoshimura, I. (1996): The state-of-the-art and issues to be addressed in writer recognition, Technical Report of IEICE, PRMU96-48, 81-90 (in Japanese).
Part IV
Summary: Algorithms for L1- and Lp-based fuzzy c-means are proposed. These algorithms calculate cluster centers within the general alternating algorithm of the fuzzy c-means. The algorithm for the L1 space is based on a simple linear search on the nodes of step functions derived from the derivatives of components of the objective function for the fuzzy c-means, whereas the algorithm for the Lp spaces uses binary search on the nodes and then on the interval to which the cluster center belongs. Termination of the algorithms based on different convergence criteria is discussed. The algorithm for the L1 space is proved to converge after a finite number of iterations. A numerical example is shown.
1. Introduction
The L1 and Lp spaces have sometimes been referred to in studies of data analysis, such as regression analysis in the L1 space (Bloomfield and Steiger, 1983), although these spaces have not yet been applied extensively to cluster analysis. Recently, many researchers have studied fuzzy clustering, which uses membership degrees in the unit interval interpreted as a fuzzy classification. The method of fuzzy c-means, abbreviated FCM, is the best known among the various techniques of fuzzy clustering (Bezdek, 1981).
Fuzzy c-means clustering based on the L1 or Lp space has recently been considered by Jajuga (1991) and by Bobrowski and Bezdek (1991). These two studies have shown the difficulties in solving fuzzy c-means in the Lp spaces. The general fuzzy c-means algorithm is an iterative procedure in which the step of determining grades and that of determining cluster centers are repeated. A unified formula can be used for the determination of grades for different distances, whereas the calculation of cluster centers strongly depends on the selected distance. Cluster centers in the Euclidean space are given by a simple formula of weighted averages, whereas the calculation of cluster centers seems to require much computation for the L1 and Lp spaces.
In this paper we propose efficient algorithms for the FCM in the Lp spaces. The results are divided into those for the L1 space and those for the Lp spaces; stronger results are obtained for the L1 case.
In the case of the L1 space, the fact that each coordinate of a cluster center is the minimizing element of a piecewise affine function is utilized, whereas a binary search is considered for the Lp spaces.
Moreover, we prove several theorems on the convergence of the algorithms. The proofs of the convergence theorems for the L1 case use the finiteness of the search region for the cluster centers and the uniqueness of the minimum of strictly convex functions with respect to U.
A numerical example is given to show that the algorithm actually works well. We use the distance
$$d(x_k, v_i) = \|x_k - v_i\|_p^p = \sum_{j=1}^{h} |x_{kj} - v_{ij}|^p,$$
where $v_i = (v_{i1}, \ldots, v_{ih})$. Namely, the objective function is
$$J(U, v) = \sum_{i=1}^{c}\sum_{k=1}^{n} (u_{ik})^m \|x_k - v_i\|_p^p. \qquad (1)$$
It is well known that the direct optimization of J with respect to $(U, v)$, i.e., $\min_{U \in M,\, v} J(U, v)$, is difficult. A two-stage iteration algorithm is therefore used.
A General Algorithm of FCM (cf. Bezdek, 1981):
(a) Initialize $U^{(0)}$; set $s = 0$.
(b) Calculate cluster centers $v^{(s)} = (v_1^{(s)}, \ldots, v_c^{(s)})$ that minimize $J(U^{(s)}, \cdot)$.
(c) Update the grades:
$$J(U^{(s+1)}, v^{(s)}) = \min_{U \in M} J(U, v^{(s)}).$$
(d) Check convergence using a given $\epsilon > 0$: if the convergence criterion is satisfied, stop; otherwise set $s = s + 1$ and go to (b).
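As a sketch, the alternating structure of the algorithm can be written as follows, assuming hypothetical helpers centers(U, X) for step (b) (norm-dependent, discussed below) and grades(V, X) for step (c), returning the new memberships and the objective value.

```python
import numpy as np

# Skeleton of the general alternating FCM algorithm (criterion (I)).
def fcm(X, c, centers, grades, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                  # (a) random initial fuzzy partition
    J_prev = np.inf
    for _ in range(max_iter):
        V = centers(U, X)               # (b) cluster centers for fixed U
        U, J = grades(V, X)             # (c) memberships for fixed centers
        if abs(J_prev - J) <= eps:      # (d) convergence check
            break
        J_prev = J
    return U, V
```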
The convergence criterion should in general be one of the following (I)-(III):
(I) $|J(U^{(s+1)}, v^{(s+1)}) - J(U^{(s)}, v^{(s)})| \le \epsilon$,
(II) $\max_{i,j} |v_{ij}^{(s)} - v_{ij}^{(s-1)}| \le \epsilon$,
(III) $\max_{i,k} |u_{ik}^{(s+1)} - u_{ik}^{(s)}| \le \epsilon$.
Remark: The above notion of convergence simply means that the algorithm will eventually terminate; the convergence guarantees neither that the obtained solution is the correct one nor that it is the global optimum. In general it is difficult to derive an efficient algorithm that guarantees the true optimal solution.
In general, the calculation of $U^{(s+1)}$ does not depend on the particular choice of norm. It is well known that $u_{ik}$ is easily derived by using a Lagrange multiplier. Namely, for $x_k$ such that $x_k \ne v_i$, $i = 1, \ldots, c$, and $m > 1$,
$$u_{ik} = \frac{1}{\displaystyle\sum_{l=1}^{c}\left(\frac{\|x_k - v_i\|_p^p}{\|x_k - v_l\|_p^p}\right)^{1/(m-1)}}. \qquad (2)$$
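In vectorized form, (2) might be implemented as the following sketch, assuming all distances are positive:

```python
import numpy as np

# Membership update (2): u_ik proportional to d_ik^(-1/(m-1)), normalized over i.
def update_memberships(d, m):
    # d[i, k] = ||x_k - v_i||_p^p, assumed nonzero
    r = d ** (-1.0 / (m - 1.0))
    return r / r.sum(axis=0)
```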
On the other hand, the calculation of cluster centers is not simple for the Lp spaces, and therefore the problem to be solved is the minimization with respect to v in step (b) of the above FCM. The algorithms for step (b) that we propose here are based on the following ideas.
(i) Each component of a cluster center can be calculated independently of the other coordinates, by decomposing the function $J(U, v)$ to be optimized with respect to v into a sum of the functions $F_{ij}(w) = \sum_{k=1}^{n} (u_{ik})^m |x_{kj} - w|^p$, $i = 1, \ldots, c$, $j = 1, \ldots, h$:
$$J(U, v) = \sum_{i=1}^{c}\sum_{j=1}^{h} F_{ij}(v_{ij}),$$
where each $F_{ij}$ depends solely on the j-th coordinate of a cluster center. Notice that U is a parameter in this subproblem. Thus, concerning the search for cluster centers, we can limit ourselves to the minimization of $F_{ij}(w)$:
$$\min_{w \in R} F_{ij}(w). \qquad (3)$$
(ii) The function $F_{ij}$ is convex with respect to each coordinate of a cluster center, and in particular it is a piecewise affine function in the L1 case.
(iii) Given the properties (i) and (ii), we can use a one-dimensional search for the minimization of $F_{ij}(w)$. For the L1 space, however, a more efficient algorithm can be derived using the piecewise affine property: the coordinate is calculated by a linear search on the derivative of the function $F_{ij}(w)$, which is remarkably simple.
Consider F_{ij} in the L1 case. We assume that when {x_{1j}, ..., x_{nj}} is ordered, the first
subscripts are changed using a permutation function q_j(k), k = 1, ..., n, that is,
x_{q_j(1)j} ≤ x_{q_j(2)j} ≤ ... ≤ x_{q_j(n)j}. Using {x_{q_j(k)j}},

F_{ij}(w) = Σ_{k=1}^{n} (u_{i q_j(k)})^m |w - x_{q_j(k)j}|.    (3)
Although F_{ij}(w) is not differentiable on R, we extend the derivative of F_{ij}(w) onto
{x_{q_j(k)j}}:

dF_{ij}(w) = Σ_{k=1}^{n} (u_{i q_j(k)})^m sign^+(w - x_{q_j(k)j}),

where

sign^+(z) = 1 (z ≥ 0),  -1 (z < 0).
Thus, dF_{ij}(w) is a step function which is right continuous and monotone nondecreasing,
in view of the convexity and the piecewise affine property of F_{ij}. Now, it is easy to see that
the minimizing element for (3) is one of the x_{q_j(k)j} at which dF_{ij}(w) changes its sign.
More precisely, x_{q_j(t)j} is the optimal solution of (3) if and only if dF_{ij}(w) < 0 for
w < x_{q_j(t)j} and dF_{ij}(w) ≥ 0 for w ≥ x_{q_j(t)j}.
It is easy to see that this algorithm correctly calculates one coordinate of the cluster
center.
Remark: A similar idea to the above algorithm can be found in L1 regression.
See Bloomfield and Steiger (1983).
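The linear search on dF_{ij} amounts to computing a weighted median. A minimal sketch, assuming NumPy and our own function name:

```python
import numpy as np

def l1_center_coordinate(xj, ui, m=2.0):
    """One coordinate of an L1 cluster center: minimize
    F(w) = sum_k (u_ik)^m |w - x_kj| by a linear search on dF."""
    order = np.argsort(xj)                 # permutation q_j
    x = xj[order]
    wgt = ui[order] ** m                   # weights (u_ik)^m
    total = wgt.sum()
    # just right of x[t], dF equals 2 * cumsum(wgt)[t] - total
    dF = 2.0 * np.cumsum(wgt) - total      # nondecreasing step values
    t = np.searchsorted(dF, 0.0)           # first index with dF >= 0
    return x[min(t, len(x) - 1)]
```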
(ii) Search for the solution w_{ij} in the interval [x_{q_j(r)j}, x_{q_j(r+1)j}] using, e.g., a binary search.
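For p > 1 the derivative F'_{ij}(w) = p Σ_k (u_{ik})^m |w - x_{kj}|^{p-1} sign(w - x_{kj}) is continuous and monotone nondecreasing, so its zero can be located by bisection. The sketch below (our own names and tolerance, bisecting over the whole data range rather than first locating the bracketing interval) illustrates the idea:

```python
import numpy as np

def lp_center_coordinate(xj, ui, m=2.0, p=3.0, tol=1e-10):
    """Minimize F(w) = sum_k (u_ik)^m |w - x_kj|^p by bisection
    on the monotone derivative F'(w) over [min(x), max(x)]."""
    wgt = ui ** m

    def dF(w):
        return (p * wgt * np.abs(w - xj) ** (p - 1)
                * np.sign(w - xj)).sum()

    lo, hi = xj.min(), xj.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dF(mid) < 0.0:
            lo = mid                       # minimizer lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```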
since the FCM algorithm is an alternating optimization with respect to U and v.
Then, an obvious result for the convergence of the above algorithms using (I) can be
stated as follows.
Theorem 1. For an arbitrary given ε > 0 and the convergence criterion (I) used in
FCM, the L1 and Lp algorithms converge.
(Proof) The proof is obvious, seeing that the sequence a(s) = J(U^{(s)}, v^{(s)}) is monotone
nonincreasing and bounded from below by the obvious bound a(s) ≥ 0. Hence the
basic theorem of convergence of monotone and bounded sequences applies. (QED)
Theorem 2. For ε = 0 and the convergence criterion (I), the L1 algorithm converges
after a finite number of iterations in FCM.
(Proof) Consider the set

Π = { v ∈ R^h : v_j ∈ {x_{1j}, ..., x_{nj}}, j = 1, ..., h }.

Let Π^c be the Cartesian product of Π: Π^c = Π × Π × ... × Π, and |Π^c| be the number
of its elements. This set is finite, and we can easily see that if J(U^{(s)}, v^{(s-1)}) >
J(U^{(s)}, v^{(s)}) then v^{(s-1)} and v^{(s)} are different points of Π^c. Thus, within |Π^c| + 1
iterations of FCM, J(U^{(s)}, v^{(s-1)}) = J(U^{(s)}, v^{(s)}) occurs, and then the algorithm for
the L1 space terminates with ε = 0. (QED)
In most cases of iterative calculations, the convergence is checked using the solutions,
as in (II) or (III), instead of the function values as in (I). Unfortunately, it is difficult to
prove a theoretical property of the Lp space algorithm under (II) or (III). In contrast,
the L1 algorithm can be analyzed further.
In general, J(U^{(s)}, v^{(s-1)}) = J(U^{(s)}, v^{(s)}) does not imply v^{(s-1)} = v^{(s)}, since the solution
of (3) may not be unique. It is, however, easy to modify the L1 algorithm so that
J(U^{(s)}, v^{(s-1)}) = J(U^{(s)}, v^{(s)})  implies  v^{(s-1)} = v^{(s)}    (5)
holds. The modification is obvious:
Check whether w = v_{ij}^{(s-1)}, the (i, j) component of the previous solution v^{(s-1)},
satisfies dF_{ij}(w) = 0. When dF_{ij}(w) = 0, i.e., the previous solution is
still optimal, use the previous solution as v_{ij}^{(s)} = w, the (i, j) component
of the new v^{(s)}.
Theorem 3. For ε = 0 and the convergence criterion (II), the L1 algorithm with
the above modification converges after a finite number of iterations in FCM.
(Proof) Since we have modified the algorithm so that (5) is satisfied, the conclusion
follows from the observation stated in the proof of Theorem 2: within |Π^c| + 1
iterations, J(U^{(s)}, v^{(s-1)}) = J(U^{(s)}, v^{(s)}) occurs. (QED)
γ > 1, the clusters i_1, i_2, ..., i_γ should be replaced by a unique cluster, say i_1, and the other
cluster numbers i_2, ..., i_γ will not be used thereafter. After this reduction we can set
u_{i_1 k} = 1, since x_k = v_{i_1}.
Theorem 4. For ε = 0 and the convergence criterion (III), the L1 algorithm with
the above two modifications converges after a finite number of iterations in FCM.
(Proof) By the above consideration, the matrix U is uniquely determined, since (1)
is a strictly convex function with respect to u_{ik} for an arbitrarily fixed v, provided x_k ≠ v_i
for all i. When x_k = v_i, the previous modification uniquely determines the corresponding
part of U. Thus, if v^{(s-1)} = v^{(s)}, we have U^{(s)} = U^{(s+1)}. (QED)
5. A numerical example
In this example we assume that m = 2, the number of clusters c = 2, the dimension
h = 2, and the number of points n = 10,000. A region is considered and data points
are scattered over it. Namely, two unit squares ABCD and EFGH with the
intersecting square PEQC are considered, as shown in Figure 1. The square
PEQC has edge length a (0 < a < 1). Data points have been scattered over the area
surrounded by ABQFGHPDA using uniformly distributed random numbers.
The initial value of the grade u_{1k}^{(0)} for each x_k has been generated by pseudorandom
numbers uniformly distributed over [0, 1]; u_{2k}^{(0)} = 1 - u_{1k}^{(0)} to form a fuzzy
partition. Ten trials with different initial grades U^{(0)} have been carried out. The
criterion (III) has been used for the convergence test in all cases.
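The data generation can be sketched by rejection sampling; the placement of the squares below (ABCD = [0,1]^2, EFGH = [1-a, 2-a]^2) is our own assumption consistent with the description, not the paper's exact coordinates:

```python
import numpy as np

def sample_region(n=10_000, a=0.2, seed=0):
    """Scatter n uniform points over the union of the two unit squares
    (rejection sampling from the bounding box).
    Assumed placement: ABCD = [0,1]^2, EFGH = [1-a, 2-a]^2."""
    rng = np.random.default_rng(seed)
    pts = np.empty((0, 2))
    while len(pts) < n:
        cand = rng.uniform(0.0, 2.0 - a, size=(2 * n, 2))
        in1 = (cand <= 1.0).all(axis=1)        # inside ABCD
        in2 = (cand >= 1.0 - a).all(axis=1)    # inside EFGH
        pts = np.vstack([pts, cand[in1 | in2]])
    return pts[:n]
```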
[Fig. 1: Two unit squares ABCD and EFGH with the intersecting square PEQC; the broken segment PQ separates the two clusters.]
Fuzzy clusters have been transformed into crisp clusters using the α-cut with α = 0.5.
Then, a measure of misclassification has been introduced for a quantitative evaluation
of the results. Namely, when a data point that is on the left and lower side of
the broken line segment PQ in Figure 1 is classified into the same class as the northeast
cluster, i.e., the one to which data in the area surrounded by PQFGHP belong,
the data point is called misclassified. In the same way, when a data point
that is on the right and upper side of the broken segment PQ in Figure 1 is classified
into the same class as the southwest cluster, i.e., the one to which data in the area
surrounded by QPDABQ belong, the data point is also called misclassified.
Table 1 shows the number of successes, the average number of misclassified data, the
maximum number of iterations, and the average CPU time (sec) throughout the ten
trials for three values of the parameter a: a = 0.1, 0.2, 0.3. Moreover, this table
compares results by the L1 c-means, the Euclidean c-means, and the Lp c-means with
p = 3.
The number of successes, e.g., seven, means that seven trials out of the ten
have produced good results, while the other three have led to unacceptable classifications
with large numbers of misclassified data. The average number of misclassifications
has been calculated from the successful trials: if seven trials are successful, the
data of the other three trials are not used for the calculation. The CPU time is for
one cycle of calculating v^{(s)} and U^{(s+1)} in the main loop of FCM. The total CPU time
needed until convergence is, for example, 0.758 × 11 ≈ 8.34 for a = 0.1 by L1
FCM.
Comparison of the statistics given in Table 1 leads to the following observations.
(a) The computation of one cycle by L1 FCM is faster than by Euclidean FCM, whereas
the Lp algorithm is far slower than the other two.
(b) The number of iterations by L1 FCM is smaller than that by Euclidean FCM in
every case; the Lp algorithm requires more iterations than Euclidean FCM.
(c) For the numbers of misclassifications, we do not find a remarkable difference
among these three algorithms.
(d) L1 FCM has failed 4 times in all 30 trials, while the Euclidean and Lp FCM
algorithms have succeeded in all trials.
We have analyzed the cases of failure of the L1 method, and found that the iteration
stopped at s = 1, i.e., after (v^{(1)}, U^{(2)}) had been calculated, or at s = 2. Thus,
the failure to produce an appropriate result occurred when the iteration terminated
too early. A simple technique for improving the algorithm is to incorporate an
empirical rule into the L1 algorithm, whereby if an early termination is detected, the
calculation starts again with renewed initial membership values.
This failure has not been caused by the present algorithm, since the L1 method
exactly calculates the optimal solution for the cluster centers, without any approximation.
In other words, one cannot theoretically expect a better result by replacing
the present L1 algorithm with any other procedure for calculating the cluster centers.
(This does not mean, however, that we are unable to improve the algorithm by using
heuristic or ad hoc rules.)
6. Conclusions
We have presented two algorithms, i.e., the L1 algorithm based on a linear search on
Table 1: The number of successes out of ten trials, the average number of misclassifications,
the maximum number of iterations, and the CPU time for a = 0.1, 0.2, 0.3 by L1, L2 (Euclidean), and Lp FCM (p = 3).

L1 FCM
a     successes  misclassifications  iterations (max)  CPU time (sec)
0.1   7          5.4                 11                0.758
0.2   9          10.8                13                0.752
0.3   10         19.8                13                0.755

Euclidean FCM
a     successes  misclassifications  iterations (max)  CPU time (sec)
0.1   10         3.3                 14                0.813
0.2   10         6.0                 15                0.812
0.3   10         24.4                14                0.812

Lp FCM (p = 3)
a     successes  misclassifications  iterations (max)  CPU time (sec)
0.1   10         3.90                17                36.465
0.2   10         13.90               18                34.366
0.3   10         25.10               18                35.053
the nodes of step functions, and the Lp algorithm using a binary search. Moreover,
we have shown theorems of convergence under three stopping criteria. The numerical
example has shown that the L1 algorithm is as efficient as the Euclidean algorithm.
The Lp algorithm has required about 40 times more processing time than the other
algorithms, since it uses an iterative procedure for calculating a cluster center.
For the Lp (p > 1) algorithm, improvements are still possible, whereas further
improvement cannot be expected for the L1 case, since the present algorithm is already
simple enough and has theoretically good properties, as shown by the convergence
theorems.
Further studies on L1 and Lp clustering include: (i) finding applications where
the L1 and Lp spaces are more appropriate than the Euclidean space; (ii) analysis of
data obtained from real applications using these two algorithms; (iii) development of
software for these methods.
Acknowledgment:
This research has partly been supported by TARA (Tsukuba Advanced Research
Alliance), University of Tsukuba.
References:
Bezdek, J.C. (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum.
Bloomfield, P. and Steiger, W.L. (1983): Least Absolute Deviations: Theory, Applications,
and Algorithms, Birkhauser.
Bobrowski, L. and Bezdek, J.C. (1991): c-means clustering with the l1 and l∞ norms,
IEEE Transactions on Systems, Man, and Cybernetics, 21, 3, 545-554.
Jajuga, K. (1991): L1-norm based fuzzy clustering, Fuzzy Sets and Systems, 39, 43-50.
General Approach to the Construction
of Measures of Fuzziness of Fuzzy K -Partitions
Slavka Bodjanova
Department of Mathematics,
Texas A&M University-Kingsville,
Kingsville, TX 78363, U.S.A.
Summary: One of the most important characterizations of fuzzy partitions is the amount
of their fuzziness. This paper proposes an axiomatic framework for measures of fuzziness
of nonhierarchical fuzzy partitions. Mathematical conditions for measures of fuzziness are
discussed and a way of measuring fuzziness in terms of dissimilarity between a fuzzy parti-
tion and its complement is presented.
1. Introduction
Research on the theory of fuzzy sets and on the broad variety of applications of fuzzy
sets has been growing steadily since the inception of the theory in the mid sixties by
Lotfi Zadeh (1965). In the area of classification, an impressive number of papers has
been published on fuzzy clustering algorithms, fuzzy pattern recognition, etc. As a
result of fuzzy clustering of a finite set of objects, each object may be assigned to
multiple clusters with some degree of certainty. The amount of fuzziness of the
resulting partition is an important characterization of the structure of the data. If the
fuzziness is low, the clusters are reasonably separable. On the other hand,
if the fuzziness is large, the fuzzy cluster separability is low, and either the partition
does not reflect the real structure well or there is no clear structure present in the
data.
Bezdek (1981) introduced the partition entropy as a measure of fuzziness of nonhierarchical
partitions, which is a complete formal analogue of Shannon's entropy and a
generalization of the nonprobabilistic entropy of fuzzy sets introduced by de Luca and
Termini (1972). Backer (1987) proposed to measure the fuzziness of fuzzy partitions in
terms of the fuzziness of fuzzy clusters. There are some other characteristics of fuzzy
partitions which could be interpreted as measures of fuzziness, but a comprehensive
theory of measures of fuzziness of fuzzy partitions is still missing.
The aim of this paper is to develop an axiomatic framework for measures of fuzziness
of nonhierarchical fuzzy partitions and to show a more general way of constructing
these measures.
In the first part of our paper we briefly review the matrix characterization of k-
partitions and the sharpness of fuzzy k-partitions.
De Luca and Termini (1972) formulated three essential requirements that adequately
capture the intuitive comprehension of the fuzziness of a fuzzy set. The generalization of
these requirements to fuzzy partitions is the cornerstone of our definition of a measure of
fuzziness of a fuzzy k-partition, introduced in the second part of our paper. Empotz
(1981) studied the mathematical background of measures of fuzziness of fuzzy sets.
We discuss conditions under which some nonnegative real functions on the interval
[0, 1] could be used for constructing measures of fuzziness of fuzzy partitions.
Fuzziness of fuzzy sets is often measured in terms of the lack of distinction between the
set and its complement, or by a metric distance between its membership grade function
and the characteristic function of the nearest crisp set (Klir and Yuan (1995)).
In the last part of our paper we show how this idea can be used in the evaluation of
the amount of fuzziness of a fuzzy partition.
2. Fuzzy k-partitions
2.1 Matrix characterization
Let X = {x_1, x_2, ..., x_n} be a given set of objects. Fix an integer k, 2 ≤ k < n, and
denote by V_{kn} the usual vector space of real k × n matrices. Bezdek (1981) proposed
the following matrix characterization of partitions of X into k clusters.
Fuzzy k-partition space associated with X:

P_{fk} = {U ∈ V_{kn} : u_{ij} ∈ [0, 1]; Σ_i u_{ij} = 1 for all j; Σ_j u_{ij} > 0 for all i}.    (1)

Hard k-partition space associated with X:

P_k = {U ∈ V_{kn} : u_{ij} ∈ {0, 1}; Σ_i u_{ij} = 1 for all j; Σ_j u_{ij} > 0 for all i}.    (2)

The degenerate spaces are obtained by dropping the condition Σ_j u_{ij} > 0:

P_{fk0} = {U ∈ V_{kn} : u_{ij} ∈ [0, 1]; Σ_i u_{ij} = 1 for all j},    (3)
P_{k0} = {U ∈ V_{kn} : u_{ij} ∈ {0, 1}; Σ_i u_{ij} = 1 for all j}.    (4)
2.2 Sharpness
Partitions from P_{k0} are certain, i.e., they have zero amount of uncertainty. On the
other hand, the partition Ū = [1/k] ∈ P_{fk0} is maximally uncertain, maximally fuzzy. If
a partition U ∈ P_{fk0} moves from Ū toward a partition V ∈ P_{k0}, its amount of fuzziness
decreases; we say that the partition U becomes "sharper". We propose the following
definition of sharpness of fuzzy k-partitions.
Definition 1: Let U, V ∈ P_{fk0}. We say that U is sharper than V, denoted by U ≺ V, if
u_{ij} ≤ v_{ij} whenever v_{ij} ≤ 1/k and u_{ij} ≥ v_{ij} whenever v_{ij} ≥ 1/k, for all i, j.
Note:
For k = 2 we get the sharpness relation defined by De Luca and Termini (1972) for
fuzzy sets.
Example 1:
Function F_1 : P_{fk0} → R defined by

(7)

Consider the partitions

U = | 0.10  1  0  0  0 |     V = | 0.10  1  0  0  0 |
    | 0.10  0  0  0  0 |         | 0.05  0  0  0  0 |
    | 0.30  0  1  1  0 |         | 0.30  0  1  1  0 |
    | 0.50  0  0  0  1 |         | 0.55  0  0  0  1 |

W = | 0.00  1  0  0  0 |
    | 0.00  0  0  0  0 |
    | 0.40  0  1  1  0 |
    | 0.60  0  0  0  1 |
Note:
Partition entropy with a = k is a normalized measure of fuzziness of U ∈ P_{fk0}.
Example 3:
The following functions satisfy the properties of Definition 3:

φ(U) = 1 - (k / (2n(k-1))) Σ_i Σ_j |u_{ij} - 1/k|.    (12)
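A minimal sketch of (12), assuming U is stored as a k × n NumPy array whose columns sum to one (names are ours):

```python
import numpy as np

def fuzziness(U):
    """Measure of fuzziness (12): 1 - k/(2n(k-1)) * sum |u_ij - 1/k|.
    Returns 0 for a crisp partition, 1 for the equilibrium u_ij = 1/k."""
    k, n = U.shape
    return 1.0 - k / (2.0 * n * (k - 1)) * np.abs(U - 1.0 / k).sum()

crisp = np.eye(3)                        # k = n = 3, crisp partition
print(fuzziness(crisp))                  # -> 0.0
print(fuzziness(np.full((3, 3), 1/3)))   # -> 1.0
```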
(17)

2. Let f : [0, 1] → R be any convex or any concave nonlinear function. Then there
exist constants α, β such that

φ_f(U) = α Σ_i Σ_j f(u_{ij}) + β    (18)
i)  A = inf_{0 ≤ r < y ≤ 1/k} (f(y) - f(r)) / (y - r) ≥ sup_{1/k ≤ r < y ≤ 1} (f(y) - f(r)) / (y - r) = B,  or

ii) C = sup_{0 ≤ r < y ≤ 1/k} (f(y) - f(r)) / (y - r) ≤ inf_{1/k ≤ r < y ≤ 1} (f(y) - f(r)) / (y - r) = D.

Then

ψ(U) = α Σ_i Σ_j f(u_{ij}) + β,    (19)

where

α = 1 / ( n [ k f(1/k) - (k-1) f(0) - f(1) ] )    (20)

and

β = [ -f(1) - (k-1) f(0) ] · α · n.    (21)
Example 5:
Examples of functions which could be used in (19):

f(t) = t                     if t ∈ [0, 1/k],
f(t) = (1/(1-k)) (t - 1)     if t ∈ (1/k, 1];

f(t) = 0           if t = 0,
f(t) = -t log t    if t ∈ (0, 1].
Example 6:
Let U, V be two fuzzy k-partitions of X = {x_1, ..., x_n}. Let r(u(x_j), v(x_j)) be the
distance function of the Minkowski class defined by

(24)

Then

(25)
Example 8:
The measure

D(U, V) = Σ_{j=1}^{n} ( Σ_{i=2}^{k} | |u_{ij} - u_{i-1,j}| - |v_{ij} - v_{i-1,j}| | ) / (u_{mj} v_{mj}),    (27)

where u_{mj} = min_i {u_{ij} : u_{ij} > 0} and v_{mj} = min_i {v_{ij} : v_{ij} > 0}, is a dissimilarity measure
where

~u_{ij} = ...   if min_i u_{ij} < 1/k,
~u_{ij} = ...   if min_i u_{ij} = 1/k.    (29)

Note:
Let U ∈ P_{fk0}, where k = 2. Then ~u_{ij} = 1 - u_{ij} for all i, j, which is Zadeh's
definition of the complementation of fuzzy sets.
Example 9:
Let us consider the fuzzy partition V ∈ P_{f30} given by the matrix ...
Complement of V: ...

D(V_1, ~V_1) = D(V_2, ~V_2) > D(W, ~W).    (30)

It is obvious that the fuzzier a partition U ∈ P_{fk0} is, the more similar it is to the
equilibrium partition Ū of P_{fk0}. Therefore, another way of measuring the amount of
fuzziness of a fuzzy partition is to evaluate its dissimilarity with Ū.
5. Conclusion
We have proposed a definition of a measure of fuzziness of fuzzy partitions. Our
definition is a generalization of the measure of fuzziness of fuzzy sets. We identified
classes of real functions which could be used for constructing measures of fuzziness
of fuzzy partitions. We also explained how the fuzziness of fuzzy partitions can be
evaluated by the dissimilarity between a fuzzy partition and its complement. Since
fuzziness is one of the most important characterizations of fuzzy partitions, more
theoretical work needs to be conducted in this area.
References:
Backer, E. (1987): Cluster analysis by optimal decomposition of induced fuzzy sets, Delftse
Universitaire Pers, Delft.
Bezdek, J.C. (1981): Pattern recognition with fuzzy objective function algorithms, Plenum
Press, New York.
Bodjanova, S. (1994): Complement of fuzzy k-partitions, Fuzzy Sets and Systems, 62,
175-184.
De Luca, A. and Termini, S. (1972) : A definition of a nonprobabilistic entropy in the
setting of fuzzy sets theory, Information and Control, 20, 4, 301-312.
Empotz, H. (1981) : Nonprobabilistic entropies and indetermination measures in the set-
ting of fuzzy set theory, Fuzzy Sets and Systems, 5, 307-317.
Klir, G.J. and Yuan, B. (1995): Fuzzy sets and fuzzy logic: theory and applications, Prentice
Hall, Englewood Cliffs.
Zadeh, L.A. (1965): Fuzzy sets, Information and Control, 8, 3, 338-353.
Additive Clustering Model and Its Generalization
Mika Sato 1, Yoshiharu Sato 2
1 Institute of Policy and Planning Science, University of Tsukuba
Tennodai 1-1-1, Tsukuba 305, Japan
e-mail: [email protected]
2 Department of Information and Management Science, Hokkaido University
Kita 13, Nishi 8, Kita-ku, Sapporo 060, Japan
e-mail: [email protected]
1. Introduction
The concept of ADCLUS (ADditive CLUStering model) was proposed by Shepard and
Arabie (1979). This model is intended to find the structure of a similarity relation between
pairs of objects by clusters. The model is defined by the following:

s_{ij} = Σ_{k=1}^{K} w_k p_{ik} p_{jk} + e_{ij},    (1)

where s_{ij} (0 ≤ s_{ij} ≤ 1; i, j = 1, 2, ..., n) is the observed similarity between objects
i and j, K is the number of clusters, and w_k is a weight representing the salience of the
property corresponding to cluster k. If object i has the property of cluster k, then p_{ik} = 1,
otherwise it is 0. In this model, the similarity is represented by the sum of the weights of the
clusters k_1, k_2, ..., k_m to which both objects i and j belong. That is,

s_{ij} = w_{k_1} + w_{k_2} + ... + w_{k_m} + e_{ij}.

This shows that if objects i and j belong to cluster k_c, then the degree of contribution to
the similarity is w_{k_c}. Moreover, if the pair of objects shares some common properties, the
grades which the pair of objects contributes to the similarity, w_{k_1}, ..., w_{k_m}, are additive.
In fuzzy clustering, a fuzzy cluster is defined to be a fuzzy subset of a set of objects, and
the fuzzy grade of each object represents its degree of belongingness. The degree of
belongingness of object i to cluster k is denoted by u_{ik} (0 ≤ u_{ik} ≤ 1). To avoid situations
in which objects do not belong to any cluster, we assume that

u_{ik} ≥ 0,  Σ_{k=1}^{K} u_{ik} = 1.    (2)
A pioneering work applying the concept of fuzzy sets to cluster analysis was made
by E. Ruspini (1969). Since the fuzzy c-means clustering algorithm was proposed by J.C.
Bezdek (1987) and J.C. Dunn (1973), several methods of fuzzy clustering have developed
rapidly and many applications have been suggested (Dave et al. (1992), Hall et al.
(1992)).
In order to construct a fuzzy clustering model, we have to define a function ρ(u_{ik}, u_{jk}) which
represents the grade of belongingness of objects i and j to cluster k. Generally, ρ(x, y) is a
function from [0,1] × [0,1] to [0,1]. By using this function ρ, the clustering model (1) is
extended as follows (M. Sato and Y. Sato (1994a, 1994b)):

s_{ij} = Σ_{k=1}^{K} ρ(u_{ik}, u_{jk}) + ε_{ij}.    (3)

Namely, the similarity s_{ij} is represented by the addition of the functions ρ(u_{ik}, u_{jk}), where
ε_{ij} is an error. T-norms (K. Menger (1942), M. Mizumoto (1989)) are well known as a class
of concrete functions ρ(x, y).
In practical applications, the similarity data s_{ij} are not always symmetric; for instance,
mobility data, input-output data, perceptual confusion data, and so on. Recently,
clustering techniques based on such asymmetric similarities have generated tremendous
interest among researchers.
Among conventional clustering methods for asymmetric similarity, A.D. Gordon (1987)
proposed a method using only the symmetric part of the data. This method is based on the
idea that the asymmetry of the given similarity data can be regarded as errors of the
symmetric similarity data, that is,

S̄ = (1/2)(S + S'),

where S is a similarity matrix and S' is the transpose of S. As the data, S̄ is used.
L. Hubert (1973) proposed methods to select the maximum or the minimum of the
corresponding elements, that is,

s̄_{ij} = max(s_{ij}, s_{ji}),  s_{ij} = min(s_{ij}, s_{ji}).
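These symmetrizations are immediate to compute; a small sketch with an arbitrary example matrix (names are ours):

```python
import numpy as np

S = np.array([[0.0, 0.9, 0.2],
              [0.4, 0.0, 0.7],
              [0.6, 0.1, 0.0]])    # an asymmetric similarity matrix

S_gordon = 0.5 * (S + S.T)         # Gordon (1987): symmetric part
S_max = np.maximum(S, S.T)         # Hubert (1973): max-symmetrization
S_min = np.minimum(S, S.T)         # Hubert (1973): min-symmetrization
```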
R.E. Tarjan (1983) proposed an algorithm to hierarchically decompose a directed graph
with weighted edges, which can be used for clustering of asymmetric similarity.
We have proposed a model under which a cluster is constructed from similar objects, but
the similarity between clusters is not symmetric (see M. Sato and Y. Sato (1995a, 1995b,
1995c)).
In the above conventional clustering algorithms, the concept of similarity between a pair
of objects is eventually reduced to a symmetric one. Therefore, we propose a new concept
of asymmetric aggregation operators in order to represent the asymmetric relationship
between a pair of objects. Introducing these asymmetric aggregation operators into the above
fuzzy clustering model (3), a new model is proposed in order to obtain clusters in which
objects are not only similar to each other but also asymmetrically related. The validity of this
model is shown by a numerical example and some features of the aggregation operators.
Let ρ(u_{ik}, u_{jl}) be a function which denotes the degree of simultaneous belongingness of the
pair of objects i and j to clusters k and l, namely, a degree of sharing common properties.
Then a general model for the similarity s_{ij} is defined as follows:

s_{ij} = Σ_{k=1}^{K} Σ_{l=1}^{K} w_{kl} ρ(u_{ik}, u_{jl}) + ε_{ij},  w_{kl} > 0.    (5)

In this model, the weight w_{kl} is considered to be a quantity which shows the asymmetric
similarity between the pair of clusters. That is, we assume that the asymmetry of the
similarity between the objects is caused by the asymmetry of the similarity between the
clusters.
represents the asymmetry between objects by the asymmetry between the clusters, as in the
foregoing section. The second is a way using the new approach, that is, the asymmetric
aggregation operators. In this case, we have to create new aggregation operators which
satisfy the following conditions: boundary conditions, monotonicity, and asymmetry.
Suppose f(x) is a generator function of t-norms, and φ(x) is a continuous monotone
decreasing function satisfying φ(x) ≥ 1 on (0, 1]. Using the generator function of the
Hamacher product (R. Fuller (1991)), i.e. f(x) = (1 - x)/x, and the monotone decreasing
function φ(x) = 1/x^m (m > 0), the asymmetric aggregation operator is defined as

γ(x, y) = x^m y / (1 - y + x^{m-1} y),

which is shown in Figure 1. In Figure 2, the dotted curve shows the intersection of
the surface shown in Figure 1 with the plane x = y, and the solid curve is the intersection
with x + y = 1. From the solid curve, we find the asymmetry of the proposed aggregation
operator. Figure 3 shows the asymmetric aggregation operator defined as

γ(x, y) = xy / (y + x(2 - x)^m (1 - y)),

where the generator function is that of the Hamacher product, i.e. f(x) = (1 - x)/x, and the
monotone decreasing function is φ(x) = (2 - x)^m. Figure 4 shows the intersecting curves
with x = y and x + y = 1. In the case of the generator function of the algebraic product,
i.e. f(x) = -log x, and the monotone decreasing function φ(x) = 2 - x^m (shown in Figure 5),
the asymmetric aggregation operator is defined as

γ(x, y) = x y^{(2 - x^m)}.
by f(x) + f(y) ≤ f(x) + φ(x) f(y), because φ(x) ≥ 1. Since the following inequality holds:

Σ_{k=1}^{K} γ(u_{ik}, u_{jk}) ≤ Σ_{k=1}^{K} ρ(u_{ik}, u_{jk}) ≤ 1,
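The three operators defined above can be transcribed directly; a sketch for x, y ∈ (0, 1], with our own function names:

```python
import numpy as np

def gamma1(x, y, m=1.0):
    """Hamacher generator f(x) = (1-x)/x with phi(x) = x**(-m)."""
    return x**m * y / (1.0 - y + x**(m - 1) * y)

def gamma2(x, y, m=1.0):
    """Hamacher generator with phi(x) = (2 - x)**m."""
    return x * y / (y + x * (2.0 - x)**m * (1.0 - y))

def gamma3(x, y, m=1.0):
    """Algebraic-product generator f(x) = -log(x) with phi(x) = 2 - x**m."""
    return x * y ** (2.0 - x**m)

# Asymmetry: gamma(x, y) differs from gamma(y, x) in general.
print(gamma3(0.3, 0.7), gamma3(0.7, 0.3))
```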
[Figures: surface plots of the asymmetric aggregation operators, e.g. γ(x, y) = xy/(y + (1-y)x(2-x)) and γ(x, y) = x y^{(2-x^m)}, together with their intersection curves γ(x, x) and γ(x, 1-x).]
4. Numerical Example
To demonstrate the application of Model (7), we use data which show telephone
traffic from one prefecture to another. In the optimization algorithm used in this example,
20 sets of initial values are given by using uniform pseudorandom numbers in the interval
[0, 1], and in the end, we select the best result. The number of clusters is determined based
on the value of fitness. By increasing the number of clusters, the value of fitness decreases,
but even if the number of clusters is greater than 4, there is no severe decrease of the fitness.
From the principle of parsimony, the number of clusters is determined to be 4.
The results of the analysis using the asymmetric aggregation operator defined in Figure 1
are shown in Table 1 and Figure 7. The monotone gradation in Figure 7 shows
the degree of belongingness of a prefecture to a cluster: the darker the shade, the larger
the degree of belonging. As for the results, we find that geographical distance is closely
connected with telephone communication. Moreover, the results show that large cities
become influx points.
[Figure 7: map of Japan; the shading of each prefecture shows its degree of belongingness to the clusters.]
Tab. 1: Degree of belongingness of each prefecture to clusters C1-C4.

Prefecture   C1      C2      C3      C4
Hokkaido     .3269   .1702   .1953   .3076
Aomori       .5469   .0992   .0999   .2540
Iwate        .6248   .0574   .0684   .2494
Miyagi       .7782   .0000   .0000   .2218
Akita        .5956   .0584   .0785   .2675
Yamagata     .6063   .0417   .0739   .2781
Fukushima    .4882   .0532   .0737   .3849
Ibaragi      .2093   .1022   .0974   .5911
Tochigi      .2484   .0915   .0908   .5693
Gunma        .2173   .1145   .1106   .5576
Saitama      .1862   .0003   .0056   .8079
Chiba        .0951   .0847   .0985   .7218
Tokyo        .0000   .0000   .0000   1.0000
Kanagawa     .0537   .0591   .1071   .7800
Niigata      .2979   .0999   .1762   .4260
Toyama       .2277   .1223   .3414   .3085
Ishikawa     .2063   .1060   .3934   .2942
Fukui        .1993   .1294   .4218   .2495
Yamanashi    .1880   .1665   .1612   .4843
Nagano       .1657   .1167   .2292   .4883
Gifu         .1261   .1177   .4326   .3236
Shizuoka     .1454   .1373   .2527   .4645
Aichi        .0000   .0000   .5544   .4456
Mie          .1286   .1217   .4716   .2780
Shiga        .1054   .1091   .5867   .1988
Kyoto        .0988   .0234   .6947   .1831
Osaka        .0000   .0000   1.0000  .0000
Hyogo        .0000   .1597   .6592   .1811
Nara         .1203   .1152   .6121   .1525
Wakayama     .1715   .1532   .4937   .1816
Tottori      .0808   .2584   .4700   .1908
Shimane      .0790   .3073   .4226   .1911
Okayama      .0660   .2810   .4652   .1878
Hiroshima    .0000   .3785   .4287   .1928
Yamaguchi    .0530   .4795   .2784   .1891
Tokushima    .1416   .2270   .4428   .1887
Kagawa       .0756   .2431   .4756   .2057
Ehime        .1033   .2811   .4153   .2003
Kochi        .1537   .2519   .3947   .1997
Fukuoka      .0000   .9098   .0000   .0902
Saga         .1106   .6236   .1127   .1531
Nagasaki     .0975   .5546   .1616   .1863
Kumamoto     .0738   .5996   .1456   .1810
Oita         .0945   .5480   .1741   .1834
Miyazaki     .1076   .5098   .1843   .1984
Kagoshima    .1063   .4796   .2045   .2096
Okinawa      .2347   .3276   .2050   .2327
Acknowledgement
A part of this work was supported by a grant for Scientific Research from the Ministry of
Education, Science and Culture of Japan.
References:
Bezdek, J.C. (1987). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press.
Dave, R.N. and Bhaswan, K. (1992). Adaptive Fuzzy c-Shells Clustering and Detection of Ellipses.
IEEE Transactions on Neural Networks, 3, 643-662.
Dunn, J.C. (1973). A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact
Well-Separated Clusters. Jour. Cybernetics, 3, 3, 32-57.
Fuller, R. (1991). On Hamacher-sum of Triangular Fuzzy Numbers. Fuzzy Sets and Systems, 42,
205-212.
Gordon, A.D. (1987). A Review of Hierarchical Classification. Journal of the Royal Statistical
Society, Series A, 150, 119-137.
Hall, L.O., Bensaid, A.M. et al. (1992). A Comparison of Neural Network and Fuzzy Clustering
Techniques in Segmenting Magnetic Resonance Images of the Brain. IEEE Transactions on
Neural Networks, 3, 672-682.
Hubert, L. (1973). Min and Max Hierarchical Clustering Using Asymmetric Similarity Measures.
Psychometrika, 38, 63-72.
Menger, K. (1942). Statistical Metrics. Proceedings of the National Academy of Sciences, 28, 535-537.
Mizumoto, M. (1989). Pictorial Representations of Fuzzy Connectives, Part I: Cases of t-Norms,
t-Conorms and Averaging Operators. Fuzzy Sets and Systems, 31, 217-242.
Ruspini, E.H. (1969). A New Approach to Clustering. Inform. Control, 15, 1, 22-32.
Sato, M. and Sato, Y. (1994a). An Additive Fuzzy Clustering Model. Japanese Journal of Fuzzy
Theory and Systems, 6, 2, 185-204.
Sato, M. and Sato, Y. (1994b). Structural Model of Similarity for Fuzzy Clustering. Journal of the
Japanese Society of Computational Statistics, 7, 27-46.
Sato, M. and Sato, Y. (1995a). Extended Fuzzy Clustering Models for Asymmetric Similarity.
Fuzzy Logic and Soft Computing, World Scientific, 228-237.
Sato, M. and Sato, Y. (1995b). A General Fuzzy Clustering Model Based on Aggregation Operators.
Behaviormetrika, 22, 2, 115-128.
Sato, M. and Sato, Y. (1995c). On a General Fuzzy Additive Clustering Model. International
Journal of Intelligent Automation and Soft Computing, 1, 4, 439-448.
Shepard, R.N. and Arabie, P. (1979). Additive Clustering: Representation of Similarities as
Combinations of Discrete Overlapping Properties. Psychological Review, 86, 87-123.
Tarjan, R.E. (1983). An improved algorithm for hierarchical clustering using strong components.
Information Processing Letters, 17, 37-41.
Weber, S. (1983). A General Concept of Fuzzy Connectives, Negations and Implications Based on
T-Norms and T-Conorms. Fuzzy Sets and Systems, 11, 115-134.
A Proposal of an Extended Model of the ADCLUS Model
Tadashi Imaizumi
Department of Management & Information Sciences
Tama University
4-1-1 Hijirigaoka, Tama-shi, Tokyo 206, Japan
Summary: This paper presents an extended model of the ADCLUS model as a model for
overlapping clustering. It is assumed that the degree of belonging of each object to a
cluster is represented as a binary random variable, and the similarity between two objects is
defined by a weighted expectation of a cross product of these variables. The problem of
visualizing the results is also discussed. An extension of the proposed model to
two-mode, two-way data is also described.
1. Introduction
When we want to analyze a proximity data set among objects, it is very important
to extract the hidden information (structure) in it by using several models and
methods. MDS (Multi-Dimensional Scaling) models and methods have been
used for extracting it. We can extract a continuous structure of a data set using these
models, in which dissimilarity or similarity is related to the inter-point distance of
the corresponding two points.
A combinatorial theoretic model will be more appropriate when the structure of the data set
is assumed to be discrete, though MDS methods are also applicable (Arabie
and Hubert, 1992). For example, combinatorial theoretic models have been used
to analyze data sets which represent the structure of confusions between 16
consonant phonemes (Arabie and Carroll, 1980; Soli, Arabie and Carroll, 1986) and that
of kinship relations (Carroll and Arabie, 1983).
Since Shepard and Arabie (1979) first introduced the ADCLUS model as a model for
overlapping clustering of one-mode, two-way data, several models and methods based on it
have been proposed. The basic ADCLUS model for a one-mode, two-way data matrix is
expressed as

s_{ij} = Σ_{t=1}^{R} w_t p_{it} p_{jt} + c + e_{ij}.    (1)
fitting the ADCLUS model. The extended model for three-way matrices is named the
INDCLUS model. In these models, N objects are clustered into R possibly overlapping
clusters. The similarity between two objects that are not members of a common cluster is
assumed to be zero.
There are four states (p_{it}, p_{jt}) of two objects which represent whether the two objects
belong to the same cluster or not. The state (1,1) represents both objects belonging to the
cluster, the state (0,0) represents neither belonging, and the states (1,0) or (0,1) represent
one object belonging to the cluster and the other not. In the ADCLUS model, only the
state (1,1) contributes to the similarity and the other three states are ignored. This
property of considering the state (1,1) only also leads to the number of clusters being
increased and the visualization of the resultant clusters being difficult.
This ADCLUS model can be rewritten as a matching model of two binary sequences
(p_{i1}, p_{i2}, ..., p_{it}, ..., p_{iR}) and (p_{j1}, p_{j2}, ..., p_{jt}, ..., p_{jR}) which represent
whether an object has property t or not. Since the binary values 0 and 1 are exchangeable
for any cluster t, the state (0,0) as well as (1,1) should contribute to the similarity between
the corresponding two objects. From this point of view, we propose an extended model
of the ADCLUS model in this paper.
2. Model
We present an extended model for overlapping clustering by defining the similarity between
two objects that are not members of a common cluster. Let X_{it} denote a binary random
variable such that

(4)

We also assume

(5)

where w_t, u_t and c denote real values, respectively. In this model, the similarity between
O_i and O_j for cluster t, S_{ijt}, takes one of three values

(6)

We also assume that the observed similarities s_{ij} (i = 1, 2, ..., n; j = 1, 2, ..., n; j ≠ i)
are realizations of E(S_{ij}). We have

S_{ij} = E[X_i W X_j' + (J - X_i) U (J - X_j)'] + c,  (i ≠ j),    (7)

where S = (S_{ij}) is the N × N similarity matrix of N objects, X = (X_{it}) is an N × R
binary random matrix, J is a 1 × R matrix whose elements are all 1, and W and U are R ×
R diagonal matrices of positive weights, respectively. This model can be expressed as

S_{ij} = trace(U) - E(X_i - X_j) U (X_i - X_j)' + E(X_i)(W - U)E(X_j)' + c,  (i ≠ j).    (8)
The expectation E(X_i - X_j) U (X_i - X_j)' is interpreted as the weighted squared distance
between X_i and X_j, and indicates the distance property of the proposed model. As
the term |w_t - u_t| decreases to 0 for each cluster t, a visualization of the estimated
parameters becomes possible. The following two models are special sub-models of the
proposed model.
2.1 Case W = U
In this case, two objects which are in the same state are equally weighted. The mean
similarity s̄_{ij} is expressed as

s̄_{ij} = trace(U) - E(X_i - X_j) U (X_i - X_j)' + c.    (9)

The resultant clustering can be represented as a geometric configuration.
3. The Algorithm
The observed mean similarity s̄_{ij} is represented as

s̄_{ij} = Σ_{r=1}^{R} { w_r p_{ir} p_{jr} + u_r (1 - p_{ir})(1 - p_{jr}) } + c + e_{ij},    (11)

with the constraints 0 ≤ p_{ir} ≤ 1. To estimate these parameters, we use the ordinary
least squares method with constraints on the parameters. The estimation procedure
for a given number of clusters R is an iterative process consisting of three steps.

Σ_{i=1}^{n} Σ_{j ≠ i, j=1}^{n} { s̄_{ij} - Σ_{r=1}^{R} [ w_r p_{ir} p_{jr} + u_r (1 - p_{ir})(1 - p_{jr}) ] - c }^2    (13)

with the constraints 0 ≤ p_{it} ≤ 1. We use a quadratic programming procedure with
the active set method to obtain a feasible solution for (p_{i1}, p_{i2}, ..., p_{iR}).
These iterative processes 3.2 and 3.3 are repeated until some convergence criteria
are satisfied. This iterative process is simpler than that of the MAPCLUS procedure,
since we estimate p_{it} instead of x_{it}, a realization of X_{it}.
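The model part of (11) can be written in matrix form; a minimal sketch, with our own names (P is n × R with entries p_{ir}, and w, u hold the diagonal weights):

```python
import numpy as np

def model_similarity(P, w, u, c):
    """s_ij = sum_r w_r p_ir p_jr + u_r (1 - p_ir)(1 - p_jr) + c."""
    S = (P * w) @ P.T + ((1.0 - P) * u) @ (1.0 - P).T + c
    np.fill_diagonal(S, 0.0)       # the diagonal is not modeled
    return S
```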
The quantity E(X_i)(W - U)E(X_j)' of two objects O_i and O_j indicates the degree to which
the resultant configuration is not represented as a space structure. To represent
the resultant configuration as a space structure, we assume the model with W = U,

(16)

as the target model to be represented. This configuration can be represented as a
space structure. We define the shifting factor F_i of object O_i toward the other objects
by

F_i = E(X_i)(W - U) Σ_{j ≠ i, j=1}^{n} E(X_j)' / 2(n - 1).    (17)

When the quantity F_i is positive, the weighted squared distance E(X_i - X_j) U (X_i - X_j)'
should be increased to be represented as a space structure. When F_i is negative,
the distance should be decreased. The quantity F_i is interpreted as a deviation from
the point E(X_i) U needed to embed it as a point. This suggests that the resultant
configuration will be more informative with a vector,

F̄_i = Σ_{j ≠ i, j=1}^{n} E(X_j)/(n - 1).    (18)
5. Applications
We applied the proposed model to the confusion matrix of Morse code signals
collected by Rothkopf (1957). This data set was analyzed by Shepard using Kruskal's
non-metric MDS procedure (Shepard, 1963). We analyzed the symmetrized similarity
matrix whose (i, j) element is defined by

(20)

where M_{ij} is the (i, j) element of the confusion matrix. The analysis was done using
numbers of clusters from 5 to 1. The ratios of the minimized squared loss to trace(S)
for 5 clusters down to 1 cluster were 0.051, 0.069, 0.082, 0.109, and 0.183. By the elbow
criterion, and in order to compare the result with MDS, we chose the two-cluster result
as the solution. The estimated w, u and c are shown in Table 1. The estimated P is
shown in Table 2 and illustrated in Figure 1.
Table 1 shows that in cluster 1 both the state (1,1), in which two Morse codes are members
of the cluster, and the state (0,0), in which neither is a member, contribute to the similarity,
since the ratio u_1/w_1 is about 0.679. In cluster 2, only the state (1,1) contributes to the
similarity, since the ratio u_2/w_2 is about 0.261, which seems small.
Tab. 2: The estimated P for number of clusters = 2

Morse code    Clus. 1  Clus. 2    Morse code    Clus. 1  Clus. 2
A  .-         0.201    0.068      S  ...        0.391    0.000
B  -...       1.000    0.051      T  -          0.154    0.099
C  -.-.       0.873    0.398      U  ..-        0.437    0.000
D  -..        0.622    0.000      V  ...-       0.675    0.000
E  .          0.194    0.083      W  .--        0.501    0.068
F  ..-.       0.747    0.099      X  -..-       1.000    0.151
G  --.        0.415    0.224      Y  -.--       0.872    0.512
H  ....       0.521    0.000      Z  --..       0.781    0.512
I  ..         0.211    0.084      1  .----      0.375    0.873
J  .---       0.643    0.544      2  ..---      0.565    0.744
K  -.-        0.668    0.024      3  ...--      0.686    0.316
L  .-..       0.943    0.105      4  ....-      0.690    0.032
M  --         0.234    0.125      5  .....      0.593    0.000
N  -.         0.189    0.086      6  -....      0.929    0.209
O  ---        0.393    0.292      7  --...      0.792    0.541
P  .--.       0.808    0.565      8  ---..      0.503    0.817
Q  --.-       0.818    0.676      9  ----.      0.233    0.855
R  .-.        0.524    0.000      0  -----      0.202    0.692
Fig. 1: The estimated P.
From Table 2 and Figure 1, cluster 1 seems to be interpreted as the degree of
mixture of dashes and dots. Cluster 2 is interpreted as the dot-to-dash ratio.
Fig. 2: A geometric representation with the shifting factor from each point.
Each vector from each point illustrates the shifting factor; each point was shifted in
the direction of the centroid of the other n - 1 points.
(21)

We also assume

and

X_{it} and Y_{jt} are also independently distributed. Then the model for a two-mode data
matrix of similarity is defined as

S_{ij} = Σ_{t=1}^{R} { w_t X_{it} Y_{jt} + u_t (1 - X_{it})(1 - Y_{jt}) } + c    (22)
      = E[X_i W Y_j' + (J - X_i) U (J - Y_j)'] + c.    (23)
7. Discussions
We proposed an extended model of the ADCLUS model with random variables. By
using this model, statistical inference on the parameters becomes applicable. The state
(0,0) of (X_{it}, X_{jt}) also contributes to the similarity S_{ij}. As S_{ijt} takes one of three
values, 0, w_t, and u_t, the maximum number of distinct values of the similarity matrix
is 3^R - 1; on the other hand, it is 2^R - 1 in the ADCLUS model. The number of
clusters needed will thus be less than that of the ADCLUS model for analyzing a
similarity matrix.
As the estimation procedure is a simple one, the weighted least squares method will also
be applicable by using a consistent estimator of the variance of S_{ij}.
The model discussed in section 6 is also applicable to the case of preference data
with some modification. It seems natural that each subject corresponds to an
ideal object. Then we relate the degree of preference of subject S_i for object O_j
to the similarity between the ideal object I_i and O_j. However, from a statistical point of
view, it seems difficult to regard the evaluated preference as a realization of the
expectation. We will analyze preference data with a latent class model in which the
definition of the expectation makes sense.
When the data matrix is one of preference data, we assume that there are G unobserved
groups and each subject belongs to one of these groups. Then we can apply this
model by assuming that the preference for object O_j of subject S_i, who belongs to group
g (g = 1, 2, ..., G), is related to the similarity between O_j and some ideal point of group g.
Let X_{gt} denote the binary random variable of group g in cluster t. We define the
similarity between group g and object O_j by

S_{gj} = Σ_{t=1}^{R} { w_{gt} X_{gt} Y_{jt} + u_{gt} (1 - X_{gt})(1 - Y_{jt}) } + c_g.    (25)

Each group weights each property differently in this model. We assume that the
preference for object O_j in group g is equal to S_{gj}. Each subject is assumed to be a
sample from one of the G groups. A clustering procedure will be embedded in the
estimation procedure for allocating each subject to one of the groups.
As extensions of the proposed model to analyze three-way data matrices, especially
for the analysis of individual differences, one model is the same as the INDCLUS model,
and the other model is based on the latent class model in which each individual belongs to
one of G groups.
8. References
Arabie, P. and Carroll, J. D. (1980): MAPCLUS: A mathematical programming approach
to fitting the ADCLUS model. Psychometrika, 45, 211-235.
Arabie, P. and Hubert, L. J. (1992): Combinatorial data analysis. Annual Review of
Psychology, 43, 169-203.
Carroll, J. D. and Arabie, P. (1983): INDCLUS: An individual differences generalization of
the ADCLUS model and the MAPCLUS algorithm. Psychometrika, 48, 157-169.
Rothkopf, E. Z. (1957): A measure of stimulus similarity and errors in some paired-associate
learning tasks. Journal of Experimental Psychology, 53, 94-101.
Soli, S. D., Arabie, P. and Carroll, J. D. (1986): Representation of discrete structure underlying
observed confusions between consonant phonemes. Journal of the Acoustical Society
of America, 79, 826-837.
Shepard, R. N. and Arabie, P. (1979): Additive clustering: Representation of similarities
as combinations of discrete overlapping properties. Psychological Review, 86, 87-123.
Comparison of Pruning Algorithms
in Neural Networks
Yoshihiko Hamamoto, Toshinori Hase, Satoshi Nakai, and Shingo Tomita
Faculty of Engineering, Yamaguchi University
Ube, 755 Japan
Summary: In order to select the right-sized network, many pruning algorithms have been
proposed. One may ask which of the pruning algorithms is best in terms of the generaliza-
tion error of the resulting artificial neural network classifiers. In this paper, we compare the
performance of four pruning algorithms in small training sample size situations. A com-
parative study with artificial and real data suggests that the weight-elimination method
proposed by Weigend et al. is best.
1. Introduction
There are two fundamental problems in the design of artificial neural network (ANN)
classifiers: finding training algorithms and selecting the right-sized network. Concern-
ing training algorithms, the back-propagation algorithm (Rumelhart et al., 1986) has
been widely used because of its simplicity. In the back-propagation learning, the
following error function is minimized:
E = Σ_{k ∈ T} (t_k - o_k)^2,    (1)

where o_k is the output in the output layer for the k-th training sample, t_k is the
corresponding target value, and T is the training set. However, BP has two serious
disadvantages: extremely long training times and the possibility of becoming trapped
in local minima. On the other hand, concerning right-sized network selection, the
issue is that a network that is too large or too small can overfit or underfit the
data, respectively. The use of the right-sized network leads to an improvement in
the performance of the resulting ANN classifier. Hence, the problem of selecting the
right-sized network is very important in neural networks. We will address only this
problem.
In order to select the right-sized network, many pruning algorithms have been pre-
sented (Reed, 1993). Unfortunately, little is known about experimental comparison
of the pruning algorithms in finite sample conditions. In this paper, we compare the
performance of four pruning algorithms in terms of the generalization error of ANN
classifiers in small training sample size situations. Our emphasis is on giving practical
advice to designers and users of ANN classifiers.
2. ANN classifiers
We will consider ANN classifiers with one hidden layer. The units in the input layer
correspond to the components of the feature vector to be classified. The hidden layer
has m units. The units in the output layer are associated with pattern class labels.
In the network discussed here, the inputs to the units in each successive layer are the
outputs of the preceding layer. Initial weights were distributed uniformly in -0.5 to
0.5.
3. Pruning algorithms
In this section, we briefly describe four pruning algorithms. We will follow Reed
(1993)'s notations.
A. Karnin's method (Karnin, 1990)
Karnin measures the sensitivity of the error function with respect to the removal of
each connection and then prunes the weights with low sensitivity. The sensitivity of
a weight w_{ij} is given as

s_{ij} = Σ_{n=0}^{N-1} [Δw_{ij}(n)]^2 · w_{ij}^f / ( η (w_{ij}^f - w_{ij}^i) ),    (2)

where N is the number of training epochs, η is a learning rate, w_{ij}^f is the final value of
the weight after training, and w_{ij}^i is the initial weight. Δw_{ij} in (2) can be calculated
by the back-propagation algorithm.
B. Optimal Brain Damage (OBD) method (Le Cun et al., 1990)
When the weight vector w is perturbed, the change in the error is approximately
given by

δE ≈ (1/2) Σ_{i ∈ C} (∂^2 E / ∂w_i^2) δw_i^2,    (3)

where the δw_i's are the components of δw and C is the set of all connections. The
second derivatives can be calculated by a modified back-propagation algorithm. The
saliency of the weight w_i is then

s_i = (∂^2 E / ∂w_i^2) · w_i^2 / 2.    (4)

Pruning is done iteratively: i.e., train to a reasonable error level, compute saliencies,
delete low-saliency weights, and resume training.
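Given the diagonal second derivatives, the saliencies (4) and a pruning mask are one-liners; a sketch with our own names and an assumed pruning fraction:

```python
import numpy as np

def obd_saliencies(h_diag, w):
    """OBD saliency (4): s_i = (d^2E/dw_i^2) * w_i^2 / 2."""
    return 0.5 * h_diag * w**2

def prune_mask(h_diag, w, frac=0.1):
    """Keep weights above the frac-quantile of saliency (frac is an
    illustrative choice, not a value from the paper)."""
    s = obd_saliencies(h_diag, w)
    return s > np.quantile(s, frac)    # True = keep the weight
```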
C. Weight-Elimination (WE) method (Weigend et al., 1991)
The following error function is minimized:

J = Σ_{k ∈ T} (t_k - o_k)^2 + λ Σ_{i ∈ C} (w_i^2 / w_0^2) / (1 + w_i^2 / w_0^2),    (5)
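The penalty term of (5) and its gradient (added to the back-propagation gradient of E) can be sketched as follows; λ and w_0 are tuning constants, and the names are ours:

```python
import numpy as np

def we_penalty(w, w0=1.0, lam=1e-4):
    """Weight-elimination penalty: lam * sum r / (1 + r), r = (w/w0)^2."""
    r = (w / w0) ** 2
    return lam * np.sum(r / (1.0 + r))

def we_penalty_grad(w, w0=1.0, lam=1e-4):
    """Derivative of the penalty with respect to each weight."""
    r = (w / w0) ** 2
    return lam * (2.0 * w / w0**2) / (1.0 + r) ** 2
```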
where λ is a positive constant controlling the strength of the penalty and w_0 is a scale
parameter.
D. Kruschke's method (Kruschke, 1988)
Here γ is a small positive constant, ∠(·, ·) denotes the angle between vectors, and the
superscript s indexes the layer. If unit i has weights parallel to those of unit j, then
the gain of each will decrease in proportion to the gain of the other, and the one with
4. Experimental Results
Jain and Chandrasekaran (1982) suggest that the size of training samples per class
should be at least five to ten times the dimensionality. However, in practice, the
number of available samples is limited. Hamamoto et al. (1996a) point out that
the evaluation of ANN classifiers in the small-sample, high-dimensional setting is
very important. In our experiments, thus, the ratio of the training sample size to
the dimensionality is small. On the other hand, a large test sample size was used
to accurately evaluate a classifier. Note that the estimated generalization error is a
random variable, because it is a function of training and test samples. Thus, it is
preferable to repeat the experiments several times independently.
To highlight the difference in the performance of four pruning algorithms, the follow-
ing experiments were conducted.
4.1 Experiment 1
We briefly describe the Ness data set (Van Ness, 1980), which is used in this experiment.
This data set has been used to evaluate the performance of classifiers
such as the nearest neighbor, Parzen, linear, quadratic and neural network classifiers
(Van Ness, 1980; Hamamoto et al., 1996a). The available samples were independently
generated from n-dimensional Gaussian distributions N(μ_k, Σ_k) with the following
parameters:

μ_1 = [0, 0, ..., 0]^T,  μ_2 = [Δ/2, 0, ..., 0, Δ/2]^T,

where Δ is the Mahalanobis distance between class ω_1 and class ω_2, μ_1 is the
n-dimensional zero vector, and I_n is the n × n identity matrix. The true Bayes error
can be controlled by the values of Δ and n. That is, the degree of overlap between
the two distributions can be controlled by the values of Δ and n. For that reason,
we used this data set.
The experimental condition is summarized as follows:
Figs. 1-3 provide the mean of the estimated generalization error. For comparison,
the generalization error of BP is also presented. It is well known that when a fixed
number of training samples is used to design a classifier, the error of the classifier
tends to increase as the dimensionality n gets large. That is, as n increases, the
generalization problem becomes severe. In our limited experiment, the WE method
works well regardless of the true Bayes error and the dimensionality, even in practical
situations where the training sample size is relatively small for the dimensionality.
Fig. 1: Comparison of pruning algorithms in terms of the generalization error
(Ness data set with Δ = 2).
Fig. 2: Comparison of pruning algorithms in terms of the generalization error
(Ness data set with Δ = 4).
Fig. 3: Comparison of pruning algorithms in terms of the generalization error
(Ness data set with Δ = 6).
4.2 Experiment 2
Next, we compare four pruning algorithms on a real data set. In this data set, each
class represents one of 10 handwritten numerals. This data set contains 1400 128-
dimensional feature vectors per class. In feature extraction, Gabor filters (Gabor,
1946) were applied to a character image. The outputs of Gabor filters produce a
128-dimensional feature vector. Gabor filters tend to detect line and edge segments,
which seem to be good discriminating features. We call this the Gabor data set. For
additional details refer to (Hamamoto et al., 1996b). We need to assure the inde-
pendence between training and test sets. Thus, the following handwritten numeral
character experiment was performed:
(1) Divide 1400 samples into the training set of size 100 and the test set of size
1300. Note that the two sets are mutually exclusive.
(2) Design an ANN classifier with 256 hidden units by using a pruning algorithm
with the above training set.
(3) Estimate the generalization error of the ANN classifier by using the test set.
(4) Repeat steps (1)-(3) 5 times independently.
(5) Compute the average of the generalization error and its standard deviation.
Results are shown in Tab. 1. The performance of the classifiers trained only on 25
training samples per class, which are randomly selected out of 100 training samples
per class, is also presented. It should be pointed out that as the training sample
size decreases, the generalization problem becomes severe. Again, the WE method
performs better than other pruning algorithms.
Tab. 1: The average generalization error (upper row) and its standard deviation (lower row) for five trials.

Training sample    Karnin  OBD   WE    Kruschke  BP
size per class
25                 8.77    8.57  7.65  7.72      8.58
                   0.81    0.77  0.58  0.66      0.72
100                4.98    4.59  4.28  4.92      4.65
                   0.61    0.28  0.17  0.26      0.33
5. Conclusions
We have compared four pruning algorithms for ANN classifier design, in small train-
ing sample size situations. The generalization error of resulting ANN classifiers was
estimated on artificial and real data. Experimental results show that the WE method
outperforms other pruning algorithms. Therefore, we believe that the WE method is
best for ANN classifier design.
References:
Gabor, D. (1946): Theory of communication, J. Inst. Elect. Engr., 93, 429-459.
Hamamoto, Y. et al. (1996a): On the behavior of artificial neural network classifiers in
high-dimensional spaces, IEEE Trans. Pattern Analysis and Machine Intelligence, 18, 5,
571-574.
Hamamoto, Y. et al. (1996b): Recognition of handwritten numerals using Gabor features,
In Proc. of 13th Int. Conf. Pattern Recognition, Vienna, in press.
Jain, A. K. and Chandrasekaran, B. (1982): Dimensionality and sample size considerations
in pattern recognition practice, In Handbook of Statistics, Vol. 2, P. R. Krishnaiah and L.
N. Kanal, Eds., North-Holland, 835-855.
Karnin, E. D. (1990): A simple procedure for pruning back-propagation trained neural
networks, IEEE Trans. Neural Networks, 1, 2, 239-242.
Kruschke, J. K. (1988): Creating local and distributed bottlenecks in hidden layers of back-
propagation networks, In Proc. 1988 Connectionist Models Summer School, 120-126.
Le Cun, Y. et al. (1990): Optimal brain damage, In Advances in Neural Information
Processing (2), Denver, 598-605.
Reed, R. (1993): Pruning algorithms - A survey, IEEE Trans. Neural Networks, 4, 5,
740-747.
Rumelhart, D. E. et al. (1986): Learning internal representations by error propagation, In
D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press.
Van Ness, J. (1980): On the dominance of non-parametric Bayes rule discriminant algo-
rithms in high dimensions, Pattern Recognition, 12, 355-368.
Weigend, A. S. et al. (1991): Generalization by weight-elimination with application to
forecasting, In Advances in Neural Information Processing (3), 875-882.
Classification Method by Using the Associative
Memories in Cellular Neural Networks
Akihiro Kanagawa, Hiroaki Kawabata, and Hiromitsu Takahashi
Summary: This paper deals with classification problems, such as medical diagnosis,
in which classes are defined in categorical form. Classification should be done by careful
and synthetic judgment over many characteristic values, taking individual
variations into account. We use the associative memory function of cellular neural
networks to classify by recalling one category from among the preregistered categories.
1. Introduction
Suppose there are K objects, and each object has n kinds of characteristic values q_1, q_2, ..., q_j, ..., q_n. The problem of classifying these objects into m given classes {C_1, C_2, ..., C_j, ..., C_m} based on their characteristic values has been discussed for a long time. The classification problem discussed here is a kind of diagnosis problem which involves individual variations. Classical methods based on multivariate normal distribution theory, including discriminant analysis, are difficult to apply to such problems. To cope with them, Pawlak (1984) proposed the concept of rough sets, which can be employed to discuss the consistency of the classification given by human experts with the observed attribute values of each sample. Shigenaga et al. (1993) modified Pawlak's method to reduce more attributes by considering the given classification. In the case of rough sets, the choice of descriptive functions or fundamental sets is difficult, while in the case of fuzzy if-then rules it is troublesome to determine a set of membership functions or a logical structure. In addition, lacking or missing data easily have a bad effect on the classification results of these methods. This paper aims to apply the associative memories of cellular neural networks (CNN) to this classification problem. The associative memory is a function of neural networks by which the network recalls a pattern from among patterns embedded in advance. Further, we propose a CNN in which each cell has three output values, to enhance its capability of association.
where x_ij and u_ij denote the state variable and the control variable, respectively, I_ij is the threshold value, and P_ij and S_ij are template matrices. P_ij * y_ij denotes the sum of the influence terms from the adjacent cells:

P_{ij} * y_{ij} = \sum_{(k,l) \in N_r(i,j)} p_{ij}(k,l)\, y_{kl}   (2)
\dot{x} = -x + T\,y + I
y = \mathrm{sat}(x),   (3)
where
n = N \times M,
y \in D^n := \{x \in \mathbb{R}^n : -1 \le x_i \le 1,\; i = 1, \dots, n\},
T = [T_{ij}] \in \mathbb{R}^{n \times n},
I = (I_1, \dots, I_n)^T \in \mathbb{R}^n,
\mathrm{sat}(x) = [\mathrm{sat}(x_1), \dots, \mathrm{sat}(x_n)]^T \in \mathbb{R}^n.
The original CNN has binary outputs, which suit the coding of black-and-white pictures. But binary outputs hamper higher recognition problems such as classification or diagnosis based on association. The reason is that such problems are essentially grasped in terms of more than two-valued logic, such as low-middle-high or grade A - grade B - grade C - grade D. For example, in a group examination, the personal condition of health is measured by each ingredient of a blood test. There is an ideal state (healthy condition) given by a certain range for each ingredient, and the degrees of deviation toward either side of that range are judged comprehensively.
In this study, in order to extend the associative memory function of the CNN, we propose a cellular neural network in which each cell neuron has three-valued outputs. We devise the output function so as to apply it to classification problems.
o_j = 1 for x_j >= 1.5, o_j = 0 for -0.5 <= x_j <= 0.5, and o_j = -1 for x_j <= -1.5, with linear transitions in between (see Fig. 2).
\dot{x} = -x + T\alpha + I.   (6)
As \beta = T\alpha + I is constant, this equation has the apparent equilibrium point x_e = \beta. If \beta \in C(\alpha), this equilibrium point is also asymptotically stable, since all eigenvalues of equation (5) are -1. From this, Liu and Michel derived a design method for
Fig. 1: CNN with r-neighbor (r = 2).   Fig. 2: Output function with three levels.
templates which can recall the image pattern, and called this the associative memory. In the same manner we can easily construct output functions with multiple levels.
Now we focus on the (i, j) cell in the TVCNN. Then the following condition must be satisfied:
(10)
where b_k and t_k are the k-th row vectors of the matrices B and T, respectively. If we take out of the matrix A and the vectors b_k and t_k the elements which belong to the r-neighbor cells of the k-th cell, we obtain the following equation:
b'_k = t'_k A'.   (11)
where b'_k, t'_k and A' are the vectors and the matrix from which the coupling coefficients having no influence on the k-th cell have been removed. Since A' is not a square matrix, we apply the singular value decomposition to A'. Then we obtain the relation:
(12)
where U_{k1}, U_{k2}, V_{k1} and V_{k2} are the unit orthogonal matrices which satisfy the decomposition (12), and [\Sigma_k] is a diagonal matrix with the non-zero singular values of the matrix (A')^T A'.
Thus we can obtain the desired matrix T and the vector I by calculating the above t_k in each cell.
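The design step above amounts, for each cell k, to solving the linear system b'_k = t'_k A' through the SVD of A'. The following is a minimal NumPy sketch of that step; the patterns, neighborhood size and target drive b'_k are invented toy values for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical toy setting: M = 3 registered patterns, 4 cells in the
# r-neighborhood of cell k.  Each column of A_prime is one registered
# pattern restricted to that neighborhood (values in {-1, 0, 1}).
A_prime = np.array([[ 1, -1,  1],
                    [-1,  1,  1],
                    [ 1,  1, -1],
                    [-1,  1,  1]], dtype=float)        # shape (4, 3)
b_prime = np.array([1.5, -1.5, 1.5])                   # desired drive of cell k per pattern

# Solve t_k @ A_prime = b_prime in the least-squares sense through the SVD
# (equations (11)-(12)): t_k = b_prime @ pinv(A_prime).
U, s, Vt = np.linalg.svd(A_prime, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T                 # Moore-Penrose pseudoinverse
t_k = b_prime @ A_pinv                                 # coupling coefficients of cell k

# Each registered pattern now reproduces the target drive of cell k.
print(t_k @ A_prime)   # close to [ 1.5 -1.5  1.5]
```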
Table 1: Medical inspection indices.
q1   UA       uric acid
q2   BUN      blood urea nitrogen
q3   LDH      lactate dehydrogenase
q4   PLt      platelet
q5   ALb      albumin
q6   γ-GTP    γ-glutamyl transpeptidase
q7   GOT      glutamic oxaloacetic transaminase
q8   LAP      leucine aminopeptidase
q9   TBil     total bilirubin
q10  GOT/GPT  the ratio of GOT to GPT
q11  GPT      glutamic pyruvic transaminase
q12  AFP      alpha-1 fetoprotein
q13  DBil     direct bilirubin
q14  ChE      cholinesterase
q15  ALP      alkaline phosphatase
q16  AFP      alpha-1 fetoprotein

Fig. 3: Allocation of the inspection indices to the 4 x 4 CNN cells:
UA    BUN      LDH  PLt
ALb   γ-GTP    GOT  LAP
TBil  GOT/GPT  GPT  AFP
DBil  ChE      ALP  AFP
variations into account. So even the medical specialist occasionally makes a wrong diagnosis when he must examine a large amount of patient data. On the contrary, in the case of diagnosis of diabetes, it is sufficient to know only the information of FPG (blood sugar).
(1) We select 14 medical inspection indices, shown in Table 1, and make a 4 x 4 CNN, each cell of which is allocated a medical inspection index as shown in Fig. 3. The neighborhood r is 4. In order to see the robustness against missing or lacking data, we allocated the q8 cell even though we have no data for LAP.
(2) We adopt a three-valued output function for all cells, because most of the inspection indices are grasped in three stages. For example, γ-GTP roughly has the following three levels: NORMAL (0 - 50), LIGHT EXCESS (50 - 100) and HEAVY EXCESS (100 - ). As another example, ChE has the levels SHORTAGE (0 - 200), NORMAL (200 - 400) and EXCESS (400 - ). Both SHORTAGE and EXCESS are regarded as extraordinary.
(3) We make the scaling functions by consulting some technical books of medical science. They are shown in Table 2.
Table 2: Scaling functions.
f1   UA       (q1 - 5) / 6
f2   BUN      (q2 - 14) / 12
f3   LDH      (q3 - 175) / 120
f4   PLt      (q4 - 25) / 30
f5   ALb      (q5 - 4.2) / 1.6
f6   γ-GTP    q6 / 100 - 1
f7   GOT      q7 / 80 - 1
f8   LAP      0 (lacking data)
f9   TBil     (q9 - 5) / 18
f10  GOT/GPT  ln(q7 / q11)
f11  GPT      q11 / 100 - 1
f12  AFP      (q12 - 210) / 380
f13  DBil     (q13 - 70) / 40
f14  ChE      (q14 - 375) / 150
f15  ALP      (q15 - 60) / 80
f16  AFP      = f12
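To make the mapping from raw inspection values to initial cell states concrete, the sketch below implements a few of the Table 2 scalings and a three-level quantizer in Python. The quantization threshold (±0.5) and the patient values are our assumptions for illustration; the paper itself only specifies the scaling functions.

```python
# A minimal sketch (not the authors' code) of the Table 2 scaling and a
# three-level quantization of the scaled values.
scaling = {
    "UA":   lambda q: (q - 5) / 6,
    "BUN":  lambda q: (q - 14) / 12,
    "LDH":  lambda q: (q - 175) / 120,
    "ALb":  lambda q: (q - 4.2) / 1.6,
    "yGTP": lambda q: q / 100 - 1,
    "LAP":  lambda q: 0.0,            # lacking data -> neutral grey
}

def three_level(x, threshold=0.5):
    """Quantize a scaled value to {-1, 0, 1} (assumed threshold)."""
    if x >= threshold:
        return 1
    if x <= -threshold:
        return -1
    return 0

# Hypothetical patient data (toy values, not from the Kawasaki data set).
patient = {"UA": 8.2, "BUN": 19.0, "LDH": 410.0, "ALb": 3.1, "yGTP": 160.0, "LAP": None}
initial_state = {k: three_level(scaling[k](v if v is not None else 0.0))
                 for k, v in patient.items()}
print(initial_state)   # e.g. {'UA': 1, 'BUN': 0, 'LDH': 1, 'ALb': -1, 'yGTP': 1, 'LAP': 0}
```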
(4) We take up four liver diseases. The registered patterns shown in Fig. 4 represent 'Healthy person', 'Hepatoma', 'Chronic hepatitis' and 'Liver cirrhosis', respectively. In these pictures, 1 corresponds to a black pixel, 0 to a neutral gray pixel and -1 to a white pixel.
(5) Fig. 5 shows an evolution process of association for the MVCNN. The initial state represents an actual patient's data. As a result, the MVCNN associates the pattern of 'Liver cirrhosis' with the initial pattern. This patient was in fact diagnosed with liver cirrhosis by a close medical examination.
Statistical inference results should be shown. We were offered forty actual patients' data from Kawasaki Medical College, consisting of ten healthy-person, ten hepatoma, ten chronic hepatitis and ten liver cirrhosis data. The diagnostic results of the TVCNN system are shown in Table 3, where 2 (#4) means that the TVCNN wrongly diagnosed two hepatoma cases as liver cirrhosis. In Table 3, 'irregular convergence' means that the TVCNN did not converge to any of the preregistered patterns, and 'diagnostic sensitivity' is the correct diagnosis rate reported in the previous literature (Shigenaga et al. (1993)), which used rough sets and fuzzy if-then rules.
Table 3: Diagnostic results by the TVCNN system (columns: right diagnosis, wrong diagnosis, irregular convergence, diagnostic sensitivity).
5. Conclusion
We propose a CNN in which each cell has three output values. It is easily extended to one in which each cell has multiple output values; we call this CNN a multiple-valued cellular neural network (MVCNN). The associative memory function of the MVCNN enables one to express several kinds of aspects. We give an application of the TVCNN to a diagnosis problem of liver troubles. The classification power of the proposed method is demonstrated to be equivalent or somewhat inferior in comparison with another method using fuzzy if-then rules. But it should be emphasized that in our method:
(1) the data for LAP are lacking, so that cell is allocated neutral grey;
(2) the scaling functions and disease patterns were made by us (amateurs) without consulting medical specialists;
(3) the sample size is rather small; Shigenaga et al. [8] used 500 sample data.
Thus our diagnosis system using the TVCNN has much room for improvement, so it cannot be concluded from Table 3 that the classification power of the TVCNN is somewhat inferior to that of fuzzy if-then rules. Generally, a fuzzy expert system is effective for these problems. It is, however, complicated and troublesome to make a great number of fuzzy if-then rules when one builds an expert system. On the contrary, the diagnosis system using the TVCNN is designable by a simple procedure.
Optimization of the diagnosis system using the TVCNN (MVCNN) is an important subject for future study.
References:
Fisher, R. A. (1936): The use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, 7, pp. 175-188.
Pawlak, Z. (1984): Rough Classification, Intern. J. of Man-Machine Studies, 30, pp. 457-473.
Shigenaga, T., Ishibuchi, H. and Tanaka, H. (1993): Fuzzy Inference of Expert System Based on Rough Sets and Its Application to Classification Problems, J. of Japan Society for Fuzzy Theory and Systems, 5, 2, pp. 358-366.
Application of Kohonen maps to the
GPS Stochastic Tomography of the Ionosphere
M. Hernandez-Pajares, J. M. Juan and J. Sanz
Research Group of Astronomy and Space Geodesy
Universitat Politecnica de Catalunya
Campus Nord Mod. C3, Bol
c/. Gran Capita, s/n, 08034 Barcelona, Spain
e-mail: [email protected]
Summary: The adaptive classification of the rays received from a constellation of geodetic satellites (GPS) by a set of ground receivers is performed using neural networks. This strategy makes it possible to improve the reliability of reconstructing the ionospheric electron density from GPS data. As an example, we present the evolution of the radially integrated electron density (Total Electron Content, TEC) during the day 18th October 1995, coinciding with an important geomagnetic storm. The problems in the vertical reconstruction of the electron density are also discussed, including the data coming from one Low Earth Orbiter GPS receiver: the GPS/MET. Finally, as the main conclusion, a new strategy is proposed to estimate the ionospheric electron distribution from GPS data at different scales: the 2-D distribution (TEC) at global scale and the 3-D distribution (electron density) at regional scale.
1. Introduction
about the Problem:
As is well known, the Ionosphere is the part of the Earth's atmosphere containing free ions; it causes a frequency-dependent delay in propagated EM signals, proportional to the columnar density of electrons (TEC) (see for instance Davies 1990, page 73).
This is a distorting physical effect for Space Geodesy and Satellite Telecommunications activities, but it can be used in a positive sense to estimate the global 3-D distribution of the free electrons in the atmosphere from dual-frequency delay observations, i.e. for the Stochastic Tomography of the Ionosphere.
about the Data:
To achieve this objective we need, during a certain time interval, a high sampling rate of the atmosphere, with as many rays in as many orientations as possible. Nowadays, the only system that provides so many observations, continuously and on a planetary scale, is the Global Positioning System (GPS). Its space segment contains a constellation of more than 24 satellites continuously emitting carrier and code phases at two frequencies, L1 (~1.6 GHz) and L2 (~1.2 GHz) (see for instance Seeber 1993, pages 209-349). In the GPS user segment, it is possible to get, a few hours later, the public domain data gathered from a global network of permanent receivers, such as the International GPS Service for Geodynamics (IGS, Zumberge et al. 1994), with more than 100 stations distributed worldwide, mainly concentrated in the Northern Hemisphere, in North America and Europe. Low Earth Orbiters containing GPS receivers (LEO) are also becoming usual. Hajj et al. (1994) conclude that the GPS/MET LEO observations are important to resolve the vertical structure of the Ionosphere.
2. The Model
The Scenario:
Figure 1: Layout of the GPS full constellation (24 satellites) orbiting around the Earth. The GPS rays corresponding to the observations of a given station at a given time are also represented.
[Figure 2: a ray from satellite j crossing the spherical layers k and k+1.]
integral equation:
f_t(\vec r_i, \vec r^{\,j}) = \int N(\vec r)\, ds + D_i + D^j   (1)
where:
• N(\vec r) is the electron density at position \vec r, a point belonging to the ray between station i and satellite j at distance s from the station.
• f_t(\vec r_i, \vec r^{\,j}) is the ionospheric combination corresponding to the ray from satellite j to station i at time t, obtained by preprocessing the GPS observations (see for instance Sardon et al. 1994).
• \vec r_i, \vec r^{\,j} are the position vectors of station i and satellite j at the observation time.
• D_i and D^j are the instrumental delays associated with station i and satellite j.
• The integral path is assumed to extend along the linear ray.
In order to estimate the density N from equation (1), we can expand it in a certain set of basis functions:
N(\vec r) = \sum_l A_l\, g_l(\vec r)   (2)
(3)
Our final purpose is to get an estimation of N from the data f, knowing \vec r_i, \vec r^{\,j}. The instrumental delays D_i, D^j are also unknowns. The general model proposed then consists of a certain number of geocentric spherical shells covering the part of the Ionosphere sampled by the GPS rays. These shells define the floor and ceiling of each layer, which is then partitioned into pixels.
For each layer k (see figure 2) we define the pixels {P_{k,l}}_l as those given by a set of centers {\vec w_{k,l}} with the minimum-distance criterion:
P_{k,l}(\vec r) = 1 if \|\vec r - \vec w_{k,l}\| \le \|\vec r - \vec w_{k,l'}\| for all l', and 0 otherwise.   (4)
(5)
where the unknowns A_l have been reinterpreted as the mean electron density N_{k,l} in each cell. From the last equations we get
f_t = \sum_{k,l} N_{k,l}\, \Delta s_{k,l} + D_i + D^j   (6)
where N_{k,l}, \Delta s_{k,l} are the mean density and the fraction of the ray length, respectively, corresponding to cell l of layer k.
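Once each ray is discretized into subrays, equation (6) becomes a linear system in the unknown mean densities N_{k,l}, which can be solved by least squares. A minimal NumPy sketch follows; the ray geometry, cell assignment and noise level are fabricated toy values (instrumental delays omitted), not the IGS data used in the paper.

```python
import numpy as np

# Toy example: 2 layers x 3 cells = 6 unknown mean densities N_{k,l}.
# Each row of the design matrix holds the subray lengths ds_{k,l} that a
# given ray spends in each cell (zero where the ray does not cross the cell).
n_cells, n_rays = 6, 50
rng = np.random.default_rng(0)
ds = rng.uniform(0.0, 50.0, size=(n_rays, n_cells))   # km per cell (toy geometry)
ds[rng.random(ds.shape) < 0.6] = 0.0                  # most rays miss most cells

N_true = np.array([2.0, 5.0, 3.0, 1.0, 4.0, 2.5])     # "true" densities (toy units)
f = ds @ N_true + rng.normal(0.0, 1.0, n_rays)        # observed ionospheric combination

# Least-squares estimate of the mean densities (equation 6, delays omitted).
N_hat, *_ = np.linalg.lstsq(ds, f, rcond=None)
print(np.round(N_hat, 2))                             # close to N_true
```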
Table 1: Comparison of some features of the regular and the adaptive grids used to get cells for each considered layer of the Ionosphere.
Figure 3: Scheme of the strategy adopted to classify the sub-rays into the corresponding cells of layer k with a minimum computation load (explanations in the text).
Estimated Models
We have computed basically two different models using Least Squares:
2-D Global Ionosphere: we use the Self-Organizing Map algorithm to generate the adaptive cells in order to compute the TEC for Data Set G (subsets 2-7h, 7-12h, 12-17h and 17-22h, see figure 5)². We have solved equation (6) for one single layer between 300 and 400 km, taking 400 cells (self-organized along a 20x20 Kohonen map) and a subray length of 5 km. The cells present sizes ranging from a few square degrees to 10-100 times greater, depending on the sparsity of the data. An important result appears in figure 5: the detection of the TEC increase due to the start of the geomagnetic storm in the last time interval (from 17 to 22h UT).
3-D Regional Ionosphere: we describe the electron density with 5 spherical layers in equation (6), with height boundaries at 50, 200, 350, 500, 650 and 800 km. The regular gridding consists of cells of 5x5°. We have performed two computations: one with the data set RGROUND (only ground data, see table 2) and another also using one GPS/MET occultation (data set RGROUND+RMET). In both cases we have replaced the real observations (ionospheric delays) by those coming
²During this period a geomagnetic storm happened, with a high variability of the ionospheric electron distribution (see for instance the Web document at http://bolero.gsfc.nasa.gov/gov/solart/cloud/cloud.html).
Table 2: Description of the data sets considered in the computations, all of them corresponding to 18th October 1995, with a geomagnetic storm and P-code not encrypted (Antispoofing off).

                           Global       Regional
                           Data set G   Data set RGROUND   Data set RMET
GPS Receivers              60           31                 1 (GPS/MET occultation 0207)
Time Interval              2-22h UT     18-23h UT          20h30m-20h31m UT approx.
Number of subsets          4            1                  1
Right ascension range
of the rays at 330 km      0 to 360°    185 to 230°        185 to 230° approx.
Declination range of
the rays at 330 km         -90 to 90°   20 to 60°          20 to 60° approx.
Elevation mask             0°           0°                 none (occultation geometry)
Number of rays             ~200000      6581               3347
from a very simple model: the Ionosphere as a single spherical layer with a constant density of 10^12 e/m^3, with boundaries at 240 and 400 km height. Nevertheless, the geometry (the rays) is the real one. The results (figure 6) are quite significant, showing the important improvement in the estimation when we add, to the ground data, the observations coming from one orbital GPS receiver such as the GPS/MET.
5. Conclusions
We present in this paper a study of a difficult problem: to reconstruct the 3-D ionospheric electron distribution from GPS data.
The use of the neural network in this work makes it possible to overcome the problem of non-homogeneous sampling, dividing each layer into a partition of cells or clusters with a similar number of rays. The topological relationship between neighboring centers in the Kohonen map is exploited to reduce the computation load. To this end, we have considered a new approach consisting of a basis of adaptive pixels (the Kohonen adaptive pixel basis), defined from a certain number of centers obtained with the Kohonen artificial neural network.
As the main conclusion, a new modelling of the Ionosphere is proposed in two steps: first a two-dimensional global model, describing the TEC and the instrumental delays by means of adaptive cells; and second, a three-dimensional regional model with regular cells for the electron density, within a region with a high number of observations (Northern Hemisphere). In this second step we ought to include data from Low Earth Orbiting GPS receivers, such as GPS/MET, to improve the vertical resolution, constraining the instrumental delays with the values obtained in step 1. The cells in this case are chosen regular in angular size, in order to diminish the discretization error in the description of the electron density, in a model with high correlations between layers.
This new strategy is supported by the main results obtained for the day 18th October 1995:
• the detection in the global model of the increase of electron content at the start of the geomagnetic storm;
• the capability to estimate the 3-D structure of the ionosphere with a large number of regional data, especially when we include orbital GPS occultation data from
"
" .. . ..
'1
"
··S
.~
..
"
.' .... , .. ,.
' ' ",,' ' ..
, ' ",,' MO'
'1
......
~ "
"II .. ! 'iii
'. "
.,
· ·S ··S
.~
" ,
.. ' ,. ' ,,,,' ,.. '
-' -'
Figure 5: Global model of the TEC for the day 18th October 1995, for the data subsets
2·7h, 7-12h, 12-17h and 17-22h respectively
[Figure 6: estimated electron density as a function of height (m), for the regional models with and without the GPS/MET data.]
orbital receivers. This last point confirms the study of Hajj et al. about the
importance of including orbital data to reconstruct the Ionosphere with GPS.
6. Acknowledgments
We would like to thank the International GPS Service for Geodynamics and the University Corporation for Atmospheric Research for the availability of the GPS data used in this research. This work has been partially supported by funds from the Spanish government projects PB94-1205 and PB94-0905 (DGICYT).
References:
Davies, K. (1990): Ionospheric Radio. IEE Electromagnetic Waves Series 31, Peter Peregrinus Ltd., London.
Hajj, G. A., Ibanez-Meier, R., Kursinski, E. R., Romans, L. J. (1994): Imaging the Ionosphere with the Global Positioning System. Imaging Systems and Technology, Vol. 5, 174-184.
Kohonen, T. (1990): The self-organizing map. Proceedings of the IEEE, Vol. 78, pages 1464-1480.
Murtagh, F., Hernandez-Pajares, M. (1995): The Kohonen self-organizing map method: an assessment. Journal of Classification, Vol. 12, 165-190.
Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T. (1986): Numerical Recipes. The Art of Scientific Computing. Cambridge Univ. Press, Cambridge.
Sanz, J., Juan, J. M., Hernandez-Pajares, M., Madrigal, A. M. (1996): GPS Ionosphere imaging during the "October-18th 1995" magnetic cloud. European Geophysical Society meeting, The Hague, May 1996.
Sardón, E., Rius, A., Zarraoa, N. (1994): Estimation of the transmitter and receiver differential biases and the ionospheric total electron content from Global Positioning System observations. Radio Science, Vol. 29, No. 3, pages 577-586.
Seeber, G. (1993): Satellite Geodesy. Walter de Gruyter, Berlin.
Zumberge, J., Neilan, R., Beutler, G., Gurtner, W. (1994): The International GPS Service for Geodynamics - Benefits to Users. ION GPS-94, Salt Lake City, Utah.
APPENDICES
Figure 7: Ordering induced by the Kohonen network: after training on the data, the centroids which are close within the representation space A will also be close within the input space S.
3. Process 2 is repeated over the whole database until a good final training is obtained.
The final point density function of {\vec w_1, ..., \vec w_C} is an approximation of the continuous probability density function of the vectorial input variable g(i) (Kohonen 1990, p. 1466).
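A compact sketch of the training loop just described is given below, in plain Python/NumPy. The Gaussian neighborhood function and the linearly decaying rates are our assumptions; the paper relies only on the generic Kohonen algorithm (Kohonen 1990).

```python
import numpy as np

def train_som(data, grid=(20, 20), epochs=10, lr0=0.5, sigma0=5.0, seed=0):
    """Self-Organizing Map sketch: returns centers w_c arranged on a 2-D grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each center in the representation space A.
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    w = data[rng.integers(0, len(data), rows * cols)].astype(float)  # init from data
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            t = step / n_steps
            lr, sigma = lr0 * (1 - t), sigma0 * (1 - t) + 1e-3       # decaying rates
            winner = np.argmin(((w - x) ** 2).sum(axis=1))           # best-matching center
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)        # grid distances
            h = np.exp(-d2 / (2 * sigma ** 2))                       # neighborhood function
            w += lr * h[:, None] * (x - w)                           # move centers toward x
            step += 1
    return w

# Toy usage: 2-D "pierce points" of rays; the trained centers concentrate
# where the sampling is dense, approximating its density (Kohonen 1990).
pts = np.random.default_rng(1).normal(size=(2000, 2))
centers = train_som(pts, grid=(10, 10), epochs=5)
```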
Capacities, Credibilities in Analysis of
Probabilistic Objects by Histograms and
Lattices
Edwin Diday¹, Richard Emilion²
¹ CEREMADE - Université Paris 9
INRIA Rocquencourt, 78153 Le Chesnay Cedex, France.
e-mail: [email protected]
if all the members of the class that it represents are yellow. When the individuals are described by probabilistic objects, the specialized class is described by "the probability that all the members of the class are yellow". This probability is called the "credibility" of the colour yellow for this class, since it can be shown that it satisfies the mathematical properties of a "credibility" (or "belief") in the sense of Shafer (1976). When the n random variables associated with each member of the class are independent, the capacity and credibility may be calculated by a t-conorm and a t-norm, respectively, as defined by Schweizer and Sklar (1983).
A "concept" is usually defined by an intent and an extent. It may be modelled by a "symbolic object", also defined by an intent and an extent, where the intent is the description of a second-order object (in terms of capacities or credibilities, in this paper) and the extent is the set of individuals (described by probabilistic objects, in this paper) which satisfy this description as well as possible. The general aim of this paper is to extract such concepts from a set of probabilistic objects via an extension of standard data analysis methods to such objects.
2. Probabilistic objects
2.1 Basic model:
Several real situations can be modelled as follows:
Let C be a set whose elements c are called objects and let \mathcal{C} denote a \sigma-algebra on C. Let (\Omega, \mathcal{F}, \mu) be a measure space and I a set of indices.
A description of the object c is a family (X_{c,i})_{i \in I} of measurable maps defined on \Omega and taking values in a measurable space (O_i, \mathcal{O}_i). If \mu is a probability P then the objects will be called probabilistic objects, described by the random variables X_{c,i} with laws denoted by \lambda_{c,i}. By definition the distribution law of X_{c,i} is the probability on O_i such that \lambda_{c,i}(O) = P\{X_{c,i}^{-1}(O)\} for any O \in \mathcal{O}_i.
2.2 Examples:
- Let C be a set of computer processors which work in various environments i, i ∈ I. The time of execution of a task ω ∈ Ω for the processor c under condition i is not deterministic; therefore it is given by X_{c,i}(ω) ∈ O_i, where X_{c,i} is a r.v.
- Let C be a set of individuals who are submitted to different tests of type i. The random result of the individual c on a test ω ∈ Ω of type i is X_{c,i}(ω).
- Let C be a set of specimens (for example, insects or plants from a same species) given with different descriptors i, i ∈ I (for example, the size, the age, etc.). Due to the variability of these descriptors' values on a same specimen as time varies, the random value of descriptor i of the specimen c at time ω ∈ Ω is X_{c,i}(ω).
3. Capacities and credibilities
3.1 Capacities
Let i ∈ I, A ∈ \mathcal{C} and O ∈ \mathcal{O}_i. We will say that the system A is not able to reach the objective O if, for all c ∈ A, μ(X_{c,i} ∈ O) = 0. Hence, we are tempted to evaluate the ability or capacity of A to reach O by the number μ(∪_{c∈A} (X_{c,i} ∈ O)) (for countable A) or by the number sup_{c∈A} μ(X_{c,i} ∈ O). In the above examples we then get the capacity of a set of processors to complete a task before t seconds, the capacity of a set of individuals to succeed, etc. It turns out that these definitions agree with the capacities as defined by Choquet (1954):
Definition:
A capacity on (C, \mathcal{C}) is a map κ from \mathcal{C} to \mathbb{R}_+ such that
i) κ(∅) = 0;   ii) κ(A_1 ∪ A_2) ≤ κ(A_1) + κ(A_2);
iii) A ⊆ B ⇒ κ(A) ≤ κ(B);   iv) κ(lim↑ A_n) = lim↑ κ(A_n).
The capacity κ is a strong capacity if, in addition, we have:
ii') κ(A_1 ∪ A_2) + κ(A_1 ∩ A_2) ≤ κ(A_1) + κ(A_2).
The capacity κ is a capacity of order ∞ if, in addition, we have
κ(A_1 ∪ A_2 ∪ ... ∪ A_n) ≤ Σ_i κ(A_i) − Σ_{i<j} κ(A_i ∩ A_j) + ... + (−1)^{n+1} κ(∩_i A_i).
Proposition 1:
Let κ(A, O) = sup_{finite B ⊆ A} μ(∪_{c∈B} (X_{c,i})^{-1}(O)); then the map A → κ(A, O) (resp. O → κ(A, O)) is a capacity of order ∞ on \mathcal{C} (resp. on \mathcal{O}_i).
3.2 Credibilities
What happens if, in section 3.1, we replace union by intersection? Actually, passing to complements, we get in a natural way credibilities instead of capacities.
Definition:
A credibility on (C, \mathcal{C}) is a map β from \mathcal{C} to \mathbb{R}_+ such that
i) β(∅) = 0;
ii) β(A_1 ∩ A_2 ∩ ... ∩ A_n) ≤ Σ_{j=1}^{n} β(A_j) − Σ_{i<j} β(A_i ∪ A_j) + ... + (−1)^{n+1} β(∪_{j=1}^{n} A_j).
Proposition 2:
Let β(O) = P{∩_{c∈A} (X_{c,i} ∈ O)} for any fixed countable A ∈ \mathcal{C}; then the map O → β(O)
distributed as a r.v. X_{c,i}. In this case, each X^j_{c,i} is defined on Ω^n. For example, let ω = (ω_1, ..., ω_n) ∈ Ω^n; the result of the individual c on the test ω_j of type i is X^j_{c,i}(ω). Consider the probability measure defined by:
P_n(ω)([s, t]) = (number of X^1_{c,i}(ω), X^2_{c,i}(ω), ..., X^n_{c,i}(ω) between s and t)/n.
Histograms of frequencies can be derived from this measure: given k real numbers s_1 < s_2 < ... < s_k, we represent the function H_{j,j+1}/(s_{j+1} − s_j) on [s_j, s_{j+1}[, where H_{j,j+1} denotes P_n(ω)([s_j, s_{j+1}[).
A consequence of the strong law of large numbers is that P_n([s, t]) is a good approximation of P(X_{c,i} ∈ [s, t]) if n is large enough. Moreover, as an application of the Lebesgue differentiation theorem, the preceding function is a good approximation of the density of X_{c,i} (if its distribution has a density) when the steps s_{j+1} − s_j are small enough.
Now, starting with two (or more) r.v. X_{c,i}, X_{d,i} as above, and putting A = {c, d}, it is natural to compute the capacity κ_{A,n}(ω) = P_{c,n}(ω) * P_{d,n}(ω) and the credibility β_{A,n} = P_{c,n}(ω) T P_{d,n}(ω), where T is a t-norm and * a t-conorm. (We have omitted the index i.) Note that in the case of independent r.v. X_{c,i} we can take u T v = u × v and u * v = u + v − uv.
Put κ_{s,t} = lim_{n→∞} κ_{A,n}(ω)[s, t[ and β_{s,t} = lim_{n→∞} β_{A,n}(ω)[s, t[. Due to the continuity of t-norms and t-conorms, these capacities and credibilities converge as n tends to ∞. Since we have κ_{s,u} ≤ κ_{s,t} + κ_{t,u} and β_{s,u} ≥ β_{s,t} + β_{t,u} for s ≤ t ≤ u, we get two new types of histograms: histograms of capacity, which are sub-additive, and histograms of credibility, which are super-additive, while usual frequency histograms are additive.
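As an illustration of these constructions, the following sketch computes bin-wise capacities with the probabilistic-sum t-conorm u * v = u + v − uv and credibilities with the product t-norm u T v = uv (the independence case mentioned above) for two toy empirical samples:

```python
import numpy as np

def empirical_bin_probs(sample, bins):
    """P_n(omega)([s_j, s_{j+1}[) for each bin of the histogram."""
    counts, _ = np.histogram(sample, bins=bins)
    return counts / len(sample)

rng = np.random.default_rng(0)
x_c = rng.normal(0.0, 1.0, 500)     # toy observations of X_{c,i}
x_d = rng.normal(0.5, 1.2, 500)     # toy observations of X_{d,i}
bins = np.linspace(-4, 4, 17)

p_c = empirical_bin_probs(x_c, bins)
p_d = empirical_bin_probs(x_d, bins)

# Independence case: t-conorm u*v = u+v-uv (capacity), t-norm uTv = uv (credibility).
capacity    = p_c + p_d - p_c * p_d    # sub-additive histogram
credibility = p_c * p_d                # super-additive histogram
frequency   = (p_c + p_d) / 2          # ordinary (additive) histogram, for comparison

# Totals: sub-additive total > 1, super-additive total < 1, additive total ~ 1.
print(capacity.sum(), credibility.sum(), frequency.sum())
```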
Theorem: If κ_{0,t} and β_{0,t} are O(|t|), then lim_{t→s} κ_{s,t}/(t − s) and lim_{t→s} β_{s,t}/(t − s) exist for almost all s as t tends to s. The limit function f is the smallest (resp. the greatest)
5. Lattices
5.1 Lattices of a set of probabilistic objects
For any fixed i let V_i ∈ \mathcal{O}_i, where V_i may be defined, for instance, by a percentile, and consider the subset (X_{c,i})^{-1}(V_i) = {ω ∈ Ω : X_{c,i}(ω) ∈ V_i}. Denoting by F_{V_i,c} the characteristic function of this set, the family d_c = {F_{V_i,c}, i ∈ I} is a partial description of the object c depending on the choice of the value sets V_i.
Consider the complete lattice generated by d_c, c ∈ C, with respect to the order F_{V_i,c} ≤ F_{V_i,d} for all i. This order corresponds of course to the inclusion order of the sets {X_{c,i}^{-1}(V_i)}.
Summary: As symbolic pattern classifiers, this paper presents region oriented methods based on the Cartesian system model, a mathematical model for treating symbolic data. Our region oriented methods are able to use locally effective information to discriminate between pattern classes. This property can achieve, at least superficially, a perfect discrimination of the pattern classes on a finite design set. Therefore, we have to strike a balance between the separability of the classes and the generality of the class descriptions. We describe this viewpoint theoretically and experimentally in order to assert the importance of feature selection, which is essential in any pattern classification problem. We also present an example based on symbolic data in order to illustrate the usefulness of our approach.
1. Introduction
Traditional approaches to pattern classification (e.g. Bow (1992)) may be divided
into two categories as follows.
1) Boundary oriented approach:
The purpose in this category is to find the equations of decision boundaries. Linear
classifiers and Bayes classifiers are examples for this category.
2) Similarity based approach: The purpose in this category is to find standard pat-
terns for pattern classes and to use an appropriate similarity measure between the
standard patterns and new patterns to be classified. Nearest neighbor rules and
various matching methods are examples for this category.
As a third category of classification methods, several authors have developed region oriented approaches. Stoffel (1974) used prime events to describe class regions for binary feature variables. The prime events for a class cover only the training samples for the class in the feature space. Michalski (1980) developed a very general approach to pattern classification based on his mathematical model, the so-called variable-valued logic system. In his approach, various feature types can be used simultaneously to describe sample patterns, and feature selection is performed in the process of finding class regions. Following recent terminology, we may use the term symbolic data (Diday (1988)) for this general type of sample patterns. Ichino (1979, 1981) used hyperrectangles to describe pattern classes in the feature space. This approach can treat ordinal and binary feature variables simultaneously. However, a further generalization is necessary to treat symbolic data (Ichino (1986, 1988, 1993, 1995)).
The purpose of this paper is to present symbolic pattern classifiers based on region oriented approaches. In particular, we point out the importance of feature selection under a limited number of design samples, since the pretended simplicity appearing in classification problems may prevent us from achieving a proper classification ability. In Section 2, we describe the Cartesian System Model (CSM) as the mathematical model for treating symbolic data. The CSM is represented as (U(d), ⊕, ⊗), where U(d) is the feature space in which each sample pattern is represented by
a mixture of various feature types, ⊕ is the Cartesian join operator, which generates a generalized description from given descriptions in the feature space, and ⊗ is the Cartesian meet operator, which extracts a common description from given descriptions in the feature space. In Section 3, we define two graphs, the so-called relative neighborhood graph (RNG) and the mutual neighborhood graph (MNG). The MNG for a pattern class yields the interclass structure against the other pattern class. On the other hand, the RNG for a pattern class indicates the intraclass structure of the class. Under the assumption that the sample size of each pattern class is finite, the MNG and the RNG approach complete graphs when we increase the number of features used to describe the sample patterns. The completeness of the MNG means that a perfect separability between pattern classes is achieved, while the completeness of the RNG means that the class description has a minimum generality even for the given design set. Then, based on the properties of the RNG and the MNG, we restate the Pretended simplicity theorem (Ichino (1993)) in an improved way in Section 4. We describe our symbolic pattern classifiers and compare our approach to other well-known approaches, the ID3 (Quinlan (1986)) and the backpropagation neural network (Rumelhart (1986)), using an example of symbolic data in Section 5. Section 6 is a summary.
where E_k is the feature value taken by the feature X_k. We can treat the following five feature types:
1) continuous quantitative feature (e.g. height, weight, etc.)
2) discrete quantitative feature (e.g. the number of family members, etc.)
3) ordinal qualitative feature (e.g. academic career, etc.; appropriate numerical coding is assumed)
4) nominal qualitative feature (e.g. sex, blood type, etc.)
5) tree structured feature (see Fig. 1, where terminal values are taken as feature values)
Feature types 1), 2) and 3) are permitted to take interval values of the form [a, b], and feature types 4) and 5) are permitted to take finite sets as feature values. The Cartesian product (1) described in terms of features of types 1) ~ 5) is called an event. It should be noted that a sample pattern is an event. Let U_k be the domain of the feature X_k: a finite interval when the feature type is 1), 2) or 3), and a finite set when the feature type is 4) or 5). Then the feature space is given by the product set
(2)
2.2 The Cartesian join operator
The Cartesian join A ⊕ B of a pair of events A and B in the feature space U(d) is defined by:
(3)
where A_k ⊕ B_k is the Cartesian join of the feature values A_k and B_k for feature X_k, defined as follows.
1) When X_k is a quantitative or an ordinal qualitative feature, A_k ⊕ B_k is the closed interval:
A_k ⊕ B_k = [min(A_kL, B_kL), max(A_kU, B_kU)]   (4)
[Fig. 1: An example of a tree structured feature: microprocessor types branching into Intel and others (80286, 80386, 80486, 68020, 68030, 68040, Z8000, V50, V60, HP).]
where A_kL and A_kU are the minimum value and the maximum value of the interval A_k, respectively; and min(A_kL, B_kL) and max(A_kU, B_kU) select the minimum and the maximum values from the sets {A_kL, B_kL} and {A_kU, B_kU}, respectively.
2) When X_k is a nominal feature, A_k ⊕ B_k is the union:
A_k ⊕ B_k = A_k ∪ B_k   (5)
3) When X_k is a tree structured feature, let N(A_k) be the nearest parent node common to all terminal values included in A_k. Then, if N(A_k) = N(B_k),
A_k ⊕ B_k = A_k ∪ B_k,   (6)
and if N(A_k) ≠ N(B_k),
A_k ⊕ B_k = the set of all terminal values branching from the node N(A_k ∪ B_k).   (7)
where A_k ⊗ B_k is the Cartesian meet of the k-th feature values A_k and B_k, defined by the intersection:
A_k ⊗ B_k = A_k ∩ B_k   (10)
When the intersection (10) takes the empty value φ for at least one feature, the events A and B have no common part. We denote this fact by
A ⊗ B = φ,   (11)
and we say that "A and B are completely distinguishable." Fig. 2(a) and Fig. 2(b) illustrate the Cartesian join and the Cartesian meet in the Euclidean plane, respectively.
We call the triple (U(d), ⊕, ⊗) the Cartesian System Model (CSM).¹
¹ We used the name Cartesian Space Model initially. However, following a suggestion by Prof. E. Diday, we renamed it to prevent misunderstanding about the model.
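To make the two operators concrete, here is a small Python sketch covering the interval and nominal cases; the tree structured case, which needs the parent-node function N(·), is omitted, and the representation of events as per-feature tuples/sets is our own choice for illustration.

```python
def cartesian_join(a, b):
    """Cartesian join A (+) B, feature by feature (eqs. (4) and (5)).

    Intervals (quantitative / ordinal features) are (low, high) tuples;
    nominal feature values are Python sets.  Tree-structured features
    (which need the nearest common parent node N(.)) are omitted.
    """
    joined = []
    for ak, bk in zip(a, b):
        if isinstance(ak, tuple):                       # interval feature, eq. (4)
            joined.append((min(ak[0], bk[0]), max(ak[1], bk[1])))
        else:                                           # nominal feature, eq. (5)
            joined.append(ak | bk)
    return joined

def cartesian_meet(a, b):
    """Cartesian meet A (x) B; returns None when some feature is empty (eq. 11)."""
    met = []
    for ak, bk in zip(a, b):
        if isinstance(ak, tuple):                       # interval intersection, eq. (10)
            low, high = max(ak[0], bk[0]), min(ak[1], bk[1])
            if low > high:
                return None                             # completely distinguishable
            met.append((low, high))
        else:
            common = ak & bk
            if not common:
                return None
            met.append(common)
    return met

# Example events: (height interval, blood type set) -- invented values.
A = [(160.0, 170.0), {"A", "O"}]
B = [(165.0, 180.0), {"B", "O"}]
print(cartesian_join(A, B))   # [(160.0, 180.0), {'A', 'B', 'O'}] (set order may vary)
print(cartesian_meet(A, B))   # [(165.0, 170.0), {'O'}]
```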
Figure 2: The Cartesian join and the Cartesian meet in the Euclidean plane.
(12)
(13)
If two samples are relative neighbors, the Cartesian join of these samples never includes other samples. In this sense, samples which are relative neighbors are isolated and singular with respect to the other samples. The relative neighborhood graph, written RNG(ω_i), is the graph constructed by joining all pairs of sample patterns E_ip, E_iq ∈ ω_i which are relative neighbors. Fig. 3(a) illustrates the RNG in the Euclidean plane, where we omit all edges expressing the fact that each sample is a relative neighbor of itself.
The mutual neighborhood graph (MNG) of ω_1 against ω_2, written MNG(ω_1|ω_2), is the graph constructed by joining all pairs of sample patterns E_1p, E_1q ∈ ω_1 which are mutual neighbors against ω_2.
Fig. 3(b) and Fig. 4(a) illustrate the MNG in the Euclidean plane, where we omit all edges expressing the fact that each sample pattern is a mutual neighbor of itself. When two pattern classes are well separated (Fig. 3(b)), the MNG becomes a complete graph or a nearly complete graph (e.g., MNG(ω_1|ω_2) and MNG(ω_2|ω_1) in Fig. 3(b)). On the other hand, the number of edges of the MNG decreases according to the closeness of the two pattern classes (e.g., MNG(ω_1|ω_2) in Fig. 4(a)).
The shaded regions in Fig. 4(b) are called the silhouettes of the pattern classes. The silhouette S(ω_i) approximates the region for pattern class ω_i and is defined by
(17)
where the union is taken over all mutual neighbor pairs in ω_i against the other class.
It should be noted that the MNG and the silhouette of a pattern class are descriptions of the class from the viewpoint of relativity against the other class. However, the MNG is an abstract mathematical description, while the silhouette is an actual description in the feature space.
where N_1 is the number of samples given for class ω_1. On the other hand, let s_ij be the number of samples in class ω_2 which are included in the Cartesian join of E_1i and E_1j under the feature set F. Then we define the separability of the Cartesian join of E_1i and E_1j from the other class ω_2 under the feature set F as follows:
[Fig. 6: samples of classes ω_1 and ω_2 in the two-dimensional feature space (X_1, X_2).]
and the generality and the separability become maximum when Gen(i, j|F) = 1 and Sep(i, j|F) = 1, respectively.
We illustrate the generality and the separability using the two-dimensional example in Fig. 6. In this figure we have 8 samples for class ω_1 and 7 samples for class ω_2. In the feature space given by X_1 alone, the Cartesian join of the samples E_1i and E_1j includes 2 ω_1 samples and 3 ω_2 samples. Hence, we have
On the other hand, in the two-dimensional feature space given by X_1 and X_2, we have
(24)
From this example it should be clear that the generality decreases monotonically and the separability increases monotonically as features are added to describe the sample patterns. We should point out that this monotonic property of the generality and the separability is based on the As the boy, so the man theorem.
4.3 Pretended simplicity theorem
Now we assume that the sample size of each pattern class is finite, and that, for each pair of samples E_1i and E_1j in ω_1, there exist features by which the Cartesian join E_1i ⊕ E_1j is completely distinguishable from any other sample E_1k in ω_1 and from any sample E_2k in ω_2. Then we have the following theorem.
Theorem 2 (Pretended simplicity theorem)
By adding features appropriately to the feature set F:
1) The generality Gen(i, j|F) becomes zero and the separability Sep(i, j|F) becomes one for each pair of samples E_1i and E_1j in ω_1;
2) The RNG(ω_1) and the MNG(ω_1|ω_2) approach complete graphs; and
3) The silhouette S(ω_1) has a perfect separability from class ω_2, but it has a minimum generality as a description of class ω_1.
Properties 1) and thus 2) are direct conclusions from the As the boy, so the man theorem. Property 3) is then derived from 1) and 2).
This theorem asserts that: 1) the silhouette S(ω_1) becomes a connected single cluster, and it never includes any sample of class ω_2, since all sample pairs in ω_1 are mutual neighbors; but 2) the silhouette S(ω_1) yields a very sparse description for class ω_1, and it yields only a very poor covering ability even for other design samples of the same class ω_1, since all sample pairs of class ω_1 are also relative neighbors. Therefore, the simplicity of the interclass structure obtained here is superficial, a "pretended simplicity", and thus the selection of globally effective features is absolutely important in order to achieve a realistic classification performance.
4.4 Example 1
We generate 2N d-dimensional Gaussian samples, where the d features are mutually independent and identically distributed with zero mean and unit variance. We divide the 2N samples randomly into two sets of N samples. These two sets are used as the design sets for pattern classes ω_1 and ω_2. Therefore, the two pattern classes overlap completely in the d-dimensional feature space. Fig. 7(a) illustrates the distributions in a three-dimensional feature space. The Pretended simplicity theorem asserts that if we fix N and increase d, the MNG(ω_1|ω_2) (MNG(ω_2|ω_1)) and RNG(ω_1) (RNG(ω_2)) approach complete graphs, i.e. their numbers of edges approach the maximum number NC2. Fig. 7(b) summarizes our experimental results. For example, when N = 500, the numbers of edges of the MNGs and RNGs increase with the addition of features and approach the maximum number NC2 = 124750 at around 11 features. This is a remarkable fact, since we can separate our mixed-up pattern classes using only a small number of very locally effective features. The silhouettes S(ω_1) and S(ω_2) may overlap each other, but they never include any sample from the counterpart pattern class. Therefore, we achieve a perfect separability between the classes in terms of our design sets, although it is a pretended simplicity from the viewpoint of the given interclass structure. Furthermore, the silhouette S(ω_k) includes all samples of the class ω_k, but each Cartesian join region of S(ω_k) spanned by a pair of samples of ω_k never includes other samples of ω_k beyond that pair. Therefore, the silhouette S(ω_k) has a minimum generality as a description of the class ω_k. In fact, for a new sample pattern independent of the design sets, the silhouettes S(ω_1) and S(ω_2) have exactly the same possibility of covering the pattern.
This example asserts again that we have to select only sufficiently effective features in order to strike a balance between the separability of the pattern classes and the generality of the class descriptions.
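The experiment is easy to reproduce in a reduced form. The sketch below counts, for Gaussian design sets of the kind just described, how many sample pairs of ω_1 are mutual neighbors against ω_2 (their Cartesian join, here a bounding box, contains no ω_2 sample) as the dimension d grows; N and d are toy values, much smaller than in Fig. 7.

```python
import numpy as np

def mutual_neighbor_edges(X1, X2):
    """Count pairs in X1 whose Cartesian join (bounding box) has no X2 sample."""
    n, edges = len(X1), 0
    for i in range(n):
        for j in range(i + 1, n):
            lo = np.minimum(X1[i], X1[j])
            hi = np.maximum(X1[i], X1[j])
            inside = np.all((X2 >= lo) & (X2 <= hi), axis=1)
            edges += not inside.any()
    return edges

rng = np.random.default_rng(0)
N = 60                                    # toy size (the paper uses up to N = 500)
for d in (1, 2, 4, 8, 16):
    X = rng.normal(size=(2 * N, d))       # identical distribution for both "classes"
    X1, X2 = X[:N], X[N:]
    total = N * (N - 1) // 2
    # The edge count approaches NC2 as d grows: pretended simplicity.
    print(d, mutual_neighbor_edges(X1, X2), "/", total)
```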
Figure 7: Example 1. (a) The distributions in a three-dimensional feature space; (b) the numbers of edges versus the number of features.
where M_i < N_i, i = 1, 2, in general. Then we can use the following decision rule to classify a given pattern sample E:
1) E is determined to come from class ω_i if there exists an R_ik for which E ⊆ R_ik and if E ⊄ R_pq for all q, where p ≠ i.
2) It is rejected as a type-I reject if it is covered by events of ω_1 and ω_2 simultaneously.
3) It is rejected as a type-II reject if no events cover it.
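With the events R_ik represented as hyperrectangles, this decision rule can be sketched in a few lines of Python; the events and test samples below are invented for illustration.

```python
def covers(event, x):
    """True when sample x (a point) falls inside the hyperrectangle event."""
    return all(lo <= v <= hi for (lo, hi), v in zip(event, x))

def classify(x, events_by_class):
    """Region-oriented decision rule: class label, or a type-I/type-II reject."""
    covering = [c for c, events in events_by_class.items()
                if any(covers(e, x) for e in events)]
    if len(covering) == 1:
        return covering[0]
    return "type-I reject" if covering else "type-II reject"

# Invented events R_ik (one or two hyperrectangles per class).
events_by_class = {
    "w1": [[(0.0, 2.0), (0.0, 2.0)]],
    "w2": [[(1.5, 4.0), (1.0, 3.0)], [(3.0, 5.0), (3.0, 5.0)]],
}
print(classify((0.5, 0.5), events_by_class))   # w1
print(classify((1.8, 1.5), events_by_class))   # type-I reject (covered by both)
print(classify((9.0, 9.0), events_by_class))   # type-II reject (covered by none)
```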
Now we can state our basic problem as: "Generate an appropriate set of events which satisfy (25) and (26)". We can assert the following theorem.
Theorem 3 (Existence theorem (Ichino (1988)))
If the given training sets ω_1 and ω_2 are mutually completely distinguishable, there exist events which satisfy (25) and (26).
This theorem becomes clear by assuming that R_ik = E_ik, k = 1, 2, ..., N_i, i = 1, 2. However, a realistic covering ability of the R_ik for new patterns will be achieved by events which are expanded from the given sample patterns.
As one approach, we may use the silhouette S(ω_i) in (17) to describe the region for each pattern class ω_i. We point out as a principle of relativity that silhouettes can be described relatively in the feature space according to the mutual separability of the pattern classes (see Fig. 4). Therefore, if we can find a minimum set of sample patterns which span the silhouettes, we may obtain a realistic symbolic classifier.
As a different approach, Ichino (1986, 1988) presented an algorithm which approximates the silhouette of a class by a smaller number of events. This algorithm generates events so that they cover the sample patterns which form the densely connected portions of the mutual neighborhood graph (see Fig. 8).
In the above decision rule, a given new sample is rejected without being assigned a class name when the sample is not included in any event. In this case, we can suggest the nearest pattern class ω_i by using the membership grade of the sample E = E_1 × E_2 × ... × E_d with respect to the event R_ik = R_ik1 × R_ik2 × ... × R_ikd for class ω_i, defined by
MG(E | R_ik) = (1/d) Σ_{p=1}^{d} |R_ikp ⊗ E_p| / |R_ikp|,  k = 1, 2, ..., M_i,   (27)
5.2 Example 2
As an example of a symbolic pattern classification problem, we treat here the data of "TOYOTA" and "NISSAN" car models in 1992. Each sample (car model) is described by 23 quantitative features and 3 qualitative features. We prepared 181 samples as the design set. The experiments were performed in the following way.
Step 1: We applied the furthest neighbor method (complete linkage method) for hierarchical clustering, based on the generalized Euclidean distance of Ichino and Yaguchi (1994), defined for a pair of samples A = A_1 × A_2 × ... × A_d and B = B_1 × B_2 × ... × B_d by
d(A, B) = [Σ_{k=1}^{d} ψ(A_k, B_k)²]^{1/2},   (28)
ψ(A_k, B_k) = (|A_k ⊕ B_k| − |A_k ⊗ B_k| + 0.5(2|A_k ⊗ B_k| − |A_k| − |B_k|)) / |U_k|,   (29)
where | · | is the same as in (27).
We found five clusters (pattern classes) which corresponded well to the commonly used concepts of "luxury cars", "sports cars", "leisure vehicles", etc.
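For purely interval-valued features, the distance (28)-(29) can be written compactly as below (γ fixed at 0.5, as in (29)); the sample values and domain sizes |U_k| are toy numbers, not the car-model data.

```python
def iy_component(ak, bk, uk, gamma=0.5):
    """psi(A_k, B_k) of eq. (29) for interval feature values (low, high)."""
    join = (min(ak[0], bk[0]), max(ak[1], bk[1]))          # Cartesian join
    lo, hi = max(ak[0], bk[0]), min(ak[1], bk[1])          # Cartesian meet
    meet_len = max(0.0, hi - lo)
    join_len = join[1] - join[0]
    len_a, len_b = ak[1] - ak[0], bk[1] - bk[0]
    return (join_len - meet_len + gamma * (2 * meet_len - len_a - len_b)) / uk

def iy_distance(A, B, domains, gamma=0.5):
    """Generalized Euclidean distance of eq. (28)."""
    return sum(iy_component(ak, bk, uk, gamma) ** 2
               for ak, bk, uk in zip(A, B, domains)) ** 0.5

# Toy samples with two interval features; `domains` holds |U_k| per feature.
A = [(1500.0, 1600.0), (4.0, 4.5)]
B = [(1550.0, 1750.0), (3.5, 4.2)]
print(iy_distance(A, B, domains=[1000.0, 3.0]))
```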
Step 2: We found events satisfying (25) and (26) for each pattern class by using the method in Ichino (1986, 1988), where our multiclass problem was treated as a set of dual-class problems in the sense of "class ω_i versus the other classes". Each pattern class was described by one or two event(s). Then, for each event, we found a minimum set of features by which the event is separated from the other pattern classes, by using a modified zero-one integer programming (Ichino (1986, 1988)). The total number of features was reduced from 26 to 16. The selected features were 1) Weight, 2) Width, 3) Length, 4) Height, 5) Wheel base, 6) Front tread, 7) Rear tread, 8) Minimum turning radius, 9) Maximum power, 10) Rev/Max power, 11) Max torque, 12) Rev/Max torque, 13) Engine stroke, 14) Cylinder layout, 15) Final gear ratio, and 16) 10-mode mileage, where the 14th feature is a tree structured feature and the others are quantitative features (some of them interval valued).
Decisions for sample car models:
No.  Company  Model          Correct Answer  Neural Network  ID3          Proposed system
1    HONDA    NSX            2               2               2,3,5        2
2    HONDA    Legend         1               1               1,2,3,4,5    1
3    HONDA    Prelude        3               3               2,3,5        3
4    HONDA    Accord Wagon   4               3               2,3,4,5      4
5    HONDA    Accord         3               3               2,3,4,5      3
6    HONDA    Integra        3               3               2,3,5        3
7    HONDA    Civic          3               3               2,3,5        3
8    MAZDA    Sentia         1               2               1,2,3,4,5    1
9    MAZDA    Eunos Cosmo    2               4               -            2
10   MAZDA    Efini RX-7     2               3               -            2
11   MAZDA    Efini MPV      4               1               -            4
12   MAZDA    Familia        3               3               -            3
13   MAZDA    Revue          5               2               -            5
14   MAZDA    Carol          5               5               -            5
15   SUBARU   Legacy wagon   4               4               -            4
16   SUBARU   Vivio          5               5               -            5
6. Concluding remarks
This paper presented region oriented methods for symbolic pattern classifiers based on the Cartesian system model. Our methods use the mutual neighborhood graph (MNG) as a tool to understand the interclass structure. This graph is able to pick up very local discrimination information to describe class regions. This property requires striking a balance between the separability of the classes and the generality of the class descriptions. In order to assert this viewpoint we presented the Pretended simplicity theorem, and we pointed out the importance of feature selection in symbolic pattern classification. We compared our approach to the well-known ID3 and backpropagation neural network on the symbolic data of car models.
Acknowledgment
The authors thank Professor Edwin Diday for his helpful discussions. The authors
wish to thank also the referees for their suggestions leading to improvements in this
paper.
References
Bow, S. T. (1992): Pattern Recognition and Image Preprocessing, Marcel Dekker.
Diday, E. (1988): The symbolic approach in clustering. In Classification and Related Methods of Data Analysis, Bock, H. H. (ed.), Elsevier.
Stoffel, J. C. (1974): A classifier design technique for discrete pattern recognition problems. IEEE Trans. Comput., C-23, pp. 428-441.
Michalski, R. S. (1980): Pattern recognition as rule-guided inductive inference. IEEE Trans. Pattern Anal. and Mach. Intell., PAMI-2, pp. 349-361.
Quinlan, J. R. (1986): Induction of Decision Trees, Machine Learning, 1, pp. 81-106.
Rumelhart, D. E. and McClelland, J. L. (1986): Parallel Distributed Processing, MIT Press.
Ichino, M. (1979): A nonparametric multiclass pattern classifier. IEEE Trans. Syst., Man, Cybern., 9, pp. 345-352.
Ichino, M. (1981): Nonparametric feature selection method based on local interclass structure, IEEE Trans. Syst., Man, Cybern., 11, pp. 289-296.
Ichino, M. and Sklansky, J. (1985): The relative neighborhood graph for mixed feature variables, Pattern Recognition, 18, 2, pp. 161-167.
Ichino, M. (1986): Pattern classification based on the Cartesian join system: A general tool for feature selection, In Proc. IEEE Int. Conf. on SMC (Atlanta).
Ichino, M. (1988): A general pattern classification method for mixed feature problems. Trans. IEICE Japan, J71-D, pp. 92-101 (in Japanese).
Ichino, M. (1993): Feature selection for symbolic data classification. In New Approaches in Classification and Data Analysis, Diday, E. et al. (eds.), Springer-Verlag.
Ichino, M. and Yaguchi, H. (1994): Generalized Minkowski metrics for mixed feature-type data analysis, IEEE Trans. Syst., Man, Cybern., 24, 4, pp. 698-708.
Ichino, M., Yaguchi, H. and Diday, E. (1995): A fuzzy symbolic pattern classifier. OSDA '95, Paris.
Yaguchi, H., Ichino, M. and Diday, E. (1995): A knowledge acquisition system based on the Cartesian space model. OSDA '95, Paris.
Extension based proximities between
constrained Boolean symbolic objects
Francisco de A. T. de Carvalho¹
¹ Departamento de Estatistica - CCEN / UFPE
Av. Prof. Luiz Freire, s/n - Cidade Universitaria
50.740-540 Recife - PE, BRASIL
Fax: ++55 +81 271 8422 and E-mail: fatdildi.ufpe.br
Summary: In conventional exploratory data analysis each variable takes a single value. In real-life applications the data are more general, ranging from single values to intervals or sets of values, and may include constraints between variables. Such data sets are identified as Boolean symbolic data. The purpose of this paper is to present two extension based approaches to calculate proximities between constrained Boolean symbolic objects. Both approaches compare a pair of these objects at the level of the whole set of variables, by functions based on the description potential of their join, union and conjunctions. The first comparison function is inspired by a function proposed by Ichino and Yaguchi (1994), while the others are based on the proximity indices related to arrays of binary variables.
1. Introduction.
Constrained Boolean symbolic objects (Diday (1991)) are better adapted than the usual objects of data analysis to describe classes of individuals, taking into account simultaneously variability, as a disjunction of values on a variable, and logical dependencies between variables. For example, if an expert wishes to describe the fruits produced by a village by the facts that "the weight is between 300 and 400, the colour is white or red, and if the colour is white then the weight is lower than 350", it is not possible to put this kind of information in a usual data table where rows represent villages and columns descriptors of the fruits. Instead, this description may be represented by the constrained Boolean symbolic object a_j = [weight = [300, 400]] ∧ [colour = {white, red}] ∧ [[colour = {white}] ⇒ [weight = [300, 350]]], where a_j, which represents the j-th village, is a mapping defined on the set of fruits such that for a given fruit w, a_j(w) = true iff the weight of w belongs to the interval [300, 400], its colour is white or red, and if it is white then its weight is less than 350.
between variables y_1 and y_2 and that there is no logical dependence between variables y_2 and y_3. We now consider several important cases.
[Fig.: (a) join, (b) union, (c) disjunction, (d) conjunction of Boolean symbolic objects on the variables y1, y2, y3.]
range(A_i) being the sum of the absolute values of the differences between the upper bound and the lower bound of each interval, where A_i is a set of real intervals.
Proposition 1 If {a_1, ..., a_n} is a set of Boolean symbolic objects, where a_j = ∧_{i=1}^{p} [y_i ∈ A_ij] with j ∈ {1, ..., n}, then
π(a_1 ∨ ... ∨ a_n) = Σ_{j=1}^{n} π(a_j) − Σ_{j<k} π(a_j ∧ a_k) + Σ_{j<k<l} π(a_j ∧ a_k ∧ a_l) + ... + (−1)^{n−1} π(a_1 ∧ ... ∧ a_n).
2.4.2 Constrained Boolean symbolic object.
Suppose now that there are logical dependencies between the variables in the knowledge base. Let a = ∧_{i=1}^{p} [y_i ∈ A_i] be a constrained Boolean symbolic object, where NA ∈ A_i if y_i may become inapplicable, and let {r_1, ..., r_t} be the set of rules expressing the dependencies between the variables. The description potential of a will now be calculated as the difference between the volume of the Cartesian product A_1 × ... × A_p and the part of this volume formed by the individual descriptions which are not coherent, i.e.,
π(a) = Π_{i=1}^{p} μ(A_i) − π(a ∧ (¬(r_1 ∧ ... ∧ r_t)))   (3)
We have
π(a ∧ (¬(r_1 ∧ ... ∧ r_t))) = π((a ∧ ¬r_1) ∨ ... ∨ (a ∧ ¬r_t))
and therefore, according to Proposition 1,
π(a ∧ (¬(r_1 ∧ ... ∧ r_t))) = Σ_{j=1}^{t} π(a ∧ ¬r_j) − Σ_{j<k} π((a ∧ ¬r_j) ∧ ¬r_k) + ... + (−1)^{t−1} π((((a ∧ ¬r_1) ∧ ¬r_2) ∧ ... ∧ ¬r_{t−1}) ∧ ¬r_t)   (4)
The complexity of the calculation of the description potential of a constrained Boolean symbolic object is exponential in the number of rules and linear in the number of variables for each connected graph of dependencies.
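For purely nominal variables, the description potential and the inclusion-exclusion of Proposition 1 can be sketched directly; the representation of objects as dictionaries of finite sets, with μ(A_i) = |A_i|, is our own simplification for illustration.

```python
from itertools import combinations
from math import prod

def conj(a, b):
    """Conjunction of two Boolean symbolic objects (per-variable intersection)."""
    c = {v: a[v] & b[v] for v in a}
    return c if all(c.values()) else None   # empty set on any variable -> empty object

def pot(a):
    """Description potential of an unconstrained object: product of mu(A_i) = |A_i|."""
    return prod(len(vals) for vals in a.values()) if a is not None else 0

def pot_disjunction(objs):
    """pi(a_1 v ... v a_n) by the inclusion-exclusion formula of Proposition 1."""
    total = 0
    for r in range(1, len(objs) + 1):
        for subset in combinations(objs, r):
            inter = subset[0]
            for o in subset[1:]:
                inter = conj(inter, o) if inter is not None else None
            total += (-1) ** (r + 1) * pot(inter)
    return total

# Toy objects on two nominal variables.
a1 = {"colour": {"white", "red"}, "size": {"S", "M"}}
a2 = {"colour": {"red", "blue"}, "size": {"M", "L"}}
print(pot(a1), pot(a2), pot_disjunction([a1, a2]))   # 4 4 7
```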
4. Case studies.
We present two examples in order to illustrate the usefulness of our dissimilarity
indices.
This data base, which includes continuous variables, cannot be studied by the comparison functions of equations 8, 9 and 10. Fig. 4 shows the dendrograms obtained by the complete linkage method using d_I (Ichino and Yaguchi function, Fig. 4a to Fig. 4d), d_2 (equation 6, Fig. 4e to Fig. 4h) and d_3 (equation 7, Fig. 4i to Fig. 4l). It is not necessary to show the results obtained by d_1 (equation 5), because this proximity measure is equivalent to d_2 (see proposition 1).
It seems that the parameter γ has no influence on the proximity indices d_2 and d_3. For fixed γ in d_2, d_3 and d_I, the join operator furnishes better results than the union operator. This is because in the case of continuous variables position is important. With the join operator, the proximity measure d_3 is able to recover the five groups of fats and oils indicated by the experts (Fig. 4l). This is not the case with the Ichino and Yaguchi index (Fig. 4d).
4.2 Freshwater insect orders.
The biological knowledge base (Vignes, 1991) concerns freshwater insect orders. It includes items described by 12 nominal qualitative variables (each of which takes finite sets as values), where the number of modalities is between 2 and 6, and there are 3 different pairs of variables presenting conditional dependencies. Both types of comparison functions (equations 5 to 10) may be used to study this knowledge base.
Fig. 5 shows the dendrograms obtained by the complete linkage method using d_3 (γ = 0.5) as the proximity measure, (a) under the hypothesis of logical independence between the variables (each insect order described by an unconstrained Boolean symbolic object) and (b) under the hypothesis of conditional dependencies between the variables (each insect order described by a constrained Boolean symbolic object). In this figure, I means imago, L means larva, N means naiad and P means pupa.
In both cases all the larva orders are grouped together, but only under the hypothesis of conditional dependencies are the naiad and imago orders in the same subgroup. It seems that we obtain better results when we describe each item of this knowledge base by a constrained Boolean symbolic object.
5. Conclusions.
Two approaches to calculate proximities (dissimilarities) between constrained Boolean symbolic objects have been presented. These approaches use comparison functions based on the description potential of the join or union and of the conjunction; they are inspired by the Ichino and Yaguchi (1994) functions and by the proximity indices of binary variables. Classical properties concerning proximity indices, such as type, equivalence and metric properties, are presented. Experiments with a chemical data set and with a biological knowledge base seem to corroborate the approaches.
[Fig. 4: Dendrograms by the complete linkage method (fats and oils data: linseed, perilla, cottonseed, sesame, camellia, olive, beef tallow, hog fat), for d_I, d_2 and d_3 with the join and union operators.]
-=================]-_______
[Dendrogram leaves: insect orders tagged L (larva), N (naiad), I (imago) or P (pupa), e.g. Planipennia_L, Diptera_L, Hymenoptera_L, Megaloptera_L, Coleoptera_L, Trichoptera_L, Lepidoptera_L, Ephemera_N, Plecoptera_N, Odonata_N, Heteroptera_N, Heteroptera_I, Coleoptera_I, Trichoptera_P, Diptera_P, Hymenoptera_P.]
Fig. 5: Dendrograms by the complete linkage method (freshwater insect orders).
References:
Summary: Boolean Symbolic Objects were introduced by Diday (1988) and since that
time a large number of applications using these objects have been defined, but relatively
few of them take constraints on the variables into account. Even in this case, when the
graph of dependencies becomes too large, the computational time becomes huge because
dependencies are treated in a combinatorial way. We present a method inspired by the
techniques used in relational data bases (Codd 1972), leading to a decomposition of symbolic
objects into a Normal Symbolic Form which allows an easier calculation, however large the
graph of dependency rules may be. We apply our method to distance computation
following a method due to De Carvalho and inspired by Ichino (1994), but the normal form
we present in this paper could be used for other purposes. In our first trials we obtained
a 90% reduction of the computational time. In the present text we only deal with
nominal Boolean Symbolic Objects, but the method could be used with other kinds of
symbolic objects.
1. Introduction
Constrained Boolean Symbolic Objects, defined by Diday (1991), are better adapted
than the usual objects of data analysis to describe classes of individuals such as
populations or species, being able to take variability into account. They are expressed
by a logical conjunction of elements called elementary events, and each of these
elementary events represents a set of values associated with a variable. Each Boolean
symbolic object describes a volume which is a subset of the Cartesian product of the
description variable domains.
A symbolic object can be constrained by different kinds of rules which express
logical dependencies between the variables. These rules reduce the description space
covered by symbolic objects and greatly affect the computation of distances
between them.
We shall use for distance computation a comparison function based on the
description potential (De Carvalho (1994)) of each object. We define the description
potential as the part of the volume described by a symbolic object which is coherent,
i.e. where all the values satisfy all the dependency rules.
Until now, the methods used to compute distances between symbolic objects
took rules into account by computing the incoherent part of each object
or of each computed element. This computation can become huge when the dependency
graph is deep, and it has to be repeated for each pair of objects each time
a different distance index is chosen.
To avoid this kind of problem, we propose a representation of
symbolic objects where only the coherent part of an object is represented. We recall
that a Boolean symbolic object (if no dependence rule applies) describes a subset of
a Cartesian product, which is just the definition of a relation.
People dealing with data bases have long been familiar with relations: they use
a relational model. E. Codd introduced normal forms to structure the relational
data base schema more efficiently, in particular the third one, which concerns
the case where functional dependencies exist between the variables. Normal forms
are used in relational data bases to offer a better factorization of data, thus providing
a simpler and easier way to update data.
All this induced us to introduce a normalization of Boolean symbolic objects,
inspired by Codd's third normal form, which allows representing only the
coherent part of a symbolic object. By reference to relational normalization we
call it the Normal Symbolic Form (NSF).
We give in Section 2 a mathematical definition of Boolean symbolic objects
and examine the different possible kinds of dependency rules. In Section 3 we study
the influence of rules on distance computation. In Section 4 we give the definition
of Codd's third normal form, the definition of the NSF and an example of the
decomposition it induces. In Section 5 we examine the principle of the decomposition
process, and in Section 6 we show how to use an NSF description of
symbolic objects to perform some of the usual computations needed for distance
calculation. We conclude in Section 7.
2. Constrained Boolean Symbolic Objects
Let Ω be a set of elementary objects, generally called "individuals", described by p
variables y_i, i ∈ {1..p}. Let O_i be the set of observations in which the variable y_i
takes its values. The set O = O_1 × O_2 × ... × O_p is then called the description space.
An elementary event e_i, denoted by the symbolic expression

    e_i = [y_i = V_i]

where i ∈ {1..p} and V_i ⊆ O_i, expresses that "the variable y_i takes its values in V_i".
A Boolean symbolic object a is a conjunction of elementary events of the form:

    a = ∧_{i ∈ {1..p}} e_i = ∧_{i ∈ {1..p}} [y_i = V_i]

For instance, a = [colour = {red, blue}] ∧ [size = {small, medium}]
means that the colour of a is red or blue and the size small or medium.
We define the symbolic union, denoted ∪_s, of two objects a = ∧_i [y_i = V_i] and a' = ∧_i [y_i = V'_i] as a ∪_s a' = ∧_i [y_i = V_i ∪ V'_i].
[Two panels in the (y1, y2) plane, axes 0 to 40, showing the descriptions of a1 and a2 without (left) and with (right) the dependence rule.]
Figure 1
The left side of Figure 1 shows a rectangle in plain line whose area represents
the description potential of a1, and a rectangle in dotted line whose area represents
the description potential of a2. When there is no dependence rule, the description
potential is equivalent to the Cartesian product given by the descriptions of a1 and
a2 respectively.
On the right side of Figure 1 we represent the description potentials of a1 and
a2 considering the following dependence rule:
if y1 ∈ [5, 15[ then y2 has No Sense;
Examining Figure 1, it seems that a1 and a2 are more similar when the dependence
rule is considered. We believe that the distance must take this evidence into account.
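To make the effect of such a rule concrete, the following Python sketch computes the description potential of an object on two interval variables under the rule above. The interval bounds are made up, and the convention that a No-Sense slice contributes a factor 1 to the volume is our assumption (it matches the counting used for nominal objects in section 6).

    # A hedged sketch of the description potential under the rule
    # "if y1 in [5, 15) then y2 has No Sense".  Bounds are illustrative.

    def overlap(lo1, hi1, lo2, hi2):
        """Length of the intersection of [lo1, hi1) and [lo2, hi2)."""
        return max(0.0, min(hi1, hi2) - max(lo1, lo2))

    def potential(y1, y2, rule=(5.0, 15.0)):
        """Coherent volume described by the object ([y1], [y2])."""
        (a, b), (c, d) = y1, y2
        dead = overlap(a, b, *rule)          # part of y1 where y2 has No Sense
        live = (b - a) - dead                # part where both variables apply
        return live * (d - c) + dead * 1.0   # No-Sense slice counts as factor 1

    a1 = ((0.0, 20.0), (10.0, 30.0))     # made-up description of a1
    print(potential(*a1))                # 210.0, with the rule
    print((20.0 - 0.0) * (30.0 - 10.0))  # 400.0, plain rectangle without it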
4. The Normal Symbolic Form
In a way, Boolean symbolic objects are very close to the relations used in data bases
(i.e. a subset of a Cartesian product). People using relational data bases have
long been familiar with the decomposition into normal forms introduced by Codd (1972).
We will focus on the third normal form, which states roughly: "a relation is in third
normal form if there are no functional dependencies between a non-key attribute S
and other non-key attributes S1, ..., Sn".
Attributes have the same meaning as variables, and relations are presented as arrays
where each column represents an attribute (or variable). Each line represents a tuple
of values (an individual) and each line can be identified by a key. A key is an
attribute (or a set of attributes) which can be used to identify a tuple in a relation.
Codd's definition of a functional dependency says that "an attribute Y is functionally
dependent on an attribute X if each X-value is associated with precisely one Y-value".
Usually it is necessary to decompose a relation if you want it to follow the third
normal form.
The third normal form is used in relational data bases to offer a better factorization
of data, thus providing a simpler and easier way to update data and a reduction of
the amount of space necessary.
Most of the time a symbolic object has to be decomposed to follow the Normal
Symbolic Form (NSF), as we can see in the following example.
        wings              wings_colour   Thorax_colour   Thorax_size
  a1    {present,absent}   {blue,red}     {blue,yellow}   {big,small}
  a2    {present,absent}   {green,red}    {blue,red}      {small}

Table 1: original table
The previous array represents two Boolean symbolic objects called a1 and a2; the
dependency rules r1 and r2 are associated with the definition:
if wings = absent then wings_colour = No_Sense. (r1)
if wings_colour = red then Thorax_colour = blue. (r2)
The description of the objects a1 and a2, representing two different (imaginary) insect
species, is obviously not in NSF, because the description of wings in a1 has two values
and there is a dependency between wings and wings_colour (r1). There is also a
dependency (r2) between wings_colour and Thorax_colour, and wings_colour is not
the first variable. The description therefore has to be transformed into the following
sequence of three tables to be in NSF. In these tables the upper left corner contains the
table name, and a new kind of column appears whose values are integers referring to
a line in another table with the same name as the column. The first table has no
name; it corresponds to the initial table.
        wings      wings_colour
  1     absent     4
  2     absent     5
  3     present    {1, 2}
  4     present    {1, 3}

secondary table 1
We now have three tables instead of a single one, but only the valid parts of the
objects are represented: the tables now include the rules.
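One possible in-memory rendering of this decomposition is sketched below in Python; the layout and the contents of secondary table 2 are our reconstruction from the potential computation of section 6, not a prescribed format.

    # A sketch of the NSF decomposition of a1, a2 as plain Python tables.
    # Row numbering is 1-based to match the tables in the text.
    main_table = {
        "a1": {"wings_ref": {1, 3}, "Thorax_size": {"big", "small"}},
        "a2": {"wings_ref": {2, 4}, "Thorax_size": {"small"}},
    }

    secondary_1 = {                  # wings -> wings_colour references
        1: ("absent",  {4}),
        2: ("absent",  {5}),
        3: ("present", {1, 2}),
        4: ("present", {1, 3}),
    }

    secondary_2 = {                  # wings_colour -> Thorax_colour
        1: ({"red"},    {"blue"}),           # rule r2: red implies blue
        2: ({"blue"},   {"blue", "yellow"}),
        3: ({"green"},  {"blue", "red"}),
        4: ("No_Sense", {"blue", "yellow"}), # rule r1: wings absent
        5: ("No_Sense", {"blue", "red"}),
    }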
5. How to Transform a Set of Descriptions into NSF
The aim of the NSF is to provide a description of a symbolic object where only
the part of the object satisfying the dependency rules is represented, so that the
description potential of the object can be calculated directly.
As mentioned before, we have to split the original table into a main
table and some secondary ones. We follow the dependency graph to perform
this task. Generally, each premise variable will generate a new table containing the
premise variable and all the conclusion variables depending on it.
First, we must specify that we only consider the case where the graph induced
between the variables by the dependencies forms a tree or a set of trees, i.e. no variable
can be the conclusion of more than one premise variable.
The transformation process can be decomposed into two phases:
1) definition of the new tables; 2) filling of the new tables.
The definition of the secondary tables follows the variable dependency graph. For each
non-terminal node N of the graph, we build a new table T_V composed of the variable
V associated with N as the first variable, and the variables associated with each of
the sons of N as the other variables. The variable V is replaced in its original table
by a reference variable R_V which contains line numbers of the new table.
The table filling is a little more complicated and the lack of space does not allow
us to describe it in full detail. It is decomposed into two processes: the first
is a construction process, the second a factorization process.
The first process makes the tables grow, the second one reduces them. On the
real examples we processed, the reduction factor outweighed the growth
factor, and we obtained a reduction in size of the secondary tables greater
than 30%.
For convenience we present the construction process in algorithmic form.

for each symbolic object
{ for each variable V
    if (V is not a premise nor a conclusion)
      put the value of V in the main table;
    else if (V is a premise)
      put the references provided by GetRef(V)
}

GetRef(V)
{ for each value Val of V        // the premise variable
    // build a new line in T_V
    for each other variable Vc in T_V
    { restrict the values of Vc according to Val and the rule
      if (Vc is not a premise)
        put the corresponding values in T_V
      else
        put the references provided by GetRef(Vc) }
  return the list of lines built
}
Once the construction is done, we need to factorize. For each newly built line L,
if an identical line L' is already present in the table, we redirect the reference to L'
and L is suppressed.
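A compact executable rendering of the construction and factorization phases, for a single premise variable with one conclusion variable, might look as follows; all function and rule names are illustrative, and only the tree-shaped dependency case discussed above is handled.

    # A minimal sketch of NSF construction plus factorization for one
    # premise variable and one conclusion variable.

    def restrict(values, premise_value, rule):
        """Conclusion values allowed when the premise takes one value."""
        forced = rule.get(premise_value)   # rule fires: conclusion is forced
        return forced if forced is not None else frozenset(values)

    def get_ref(premise_values, conclusion_values, rule, table):
        """Construction: one secondary-table line per premise value.
        Factorization: reuse an identical existing line instead of adding one."""
        refs = []
        for val in premise_values:
            line = (val, restrict(conclusion_values, val, rule))
            if line in table:              # identical line already present
                refs.append(table.index(line) + 1)
            else:
                table.append(line)
                refs.append(len(table))    # 1-based line numbers
        return refs                        # references kept in the main table

    # rule r1: if wings = absent then wings_colour = No_Sense
    r1 = {"absent": frozenset({"No_Sense"})}
    table = []
    print(get_ref(["present", "absent"], {"blue", "red"}, r1, table))   # [1, 2]
    print(get_ref(["present", "absent"], {"green", "red"}, r1, table))  # [3, 2]
    print(table)   # the "absent" line was factorized, not duplicated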
6. An Application to Distance Computation
At present, distance computation algorithms must generate (in the worst case) all
possible combinations of variables, and then verify which ones are valid. In that case
M^p combinations must be generated (M being the average number of modalities, p the
number of variables) and verified. The NSF avoids this huge amount of verification.
Our distance measure uses, in the comparison step of the distance calculus, the
volume of the description potential of the union of two objects. For nominal symbolic
objects we will use a distance due to De Carvalho (to be published), inspired by Ichino
(1994).
We will first show how to compute, using an NSF representation, the potential of a
symbolic object, and second, how to compute the potential of a symbolic union. We
illustrate the method using the two objects a1, a2 described in our previous example.
The computation of the potential of the union of a1 and a2 must be split into two
parts: the first concerning the lines where the premise is verified, the second
concerning the lines where the premise is not verified.
For each part one needs to compute the union of two lines l1 and l2, l1 participating
in the description of a1, l2 participating in the description of a2. These values must
be multiplied by the number of lines of the part participating in the union, and we
obtain the potential related to the part. The potential of the union is obtained by
summing the potentials computed for each part.
We denote by potU1(1,2) the potential of the union of lines 1 and 2 of secondary table 1.
We now show, on the previous example, how to compute

    potential(a1 ∪_s a2) = pot({big,small}) * potU1({1,3}, {2,4})

For secondary table 1:

    potU1({1,3}, {2,4}) = potU1(1,2) (premise verified) +
                          potU1(3,4) (premise not verified)
    potU1(1,2) = 1 * potU2(4,5)      potU1(3,4) = 1 * potU2({1,2}, {1,3})

For secondary table 2:

    potU2({1,2}, {1,3}) = potU2(1,1) (premise verified) +
                          potU2(2,3) (premise not verified)
    potU2(4,5) = pot({blue,red,yellow}) = 3      potU2(1,1) = 1
    potU2(2,3) = pot({blue,green}) * pot({blue,red,yellow}) = 2 * 3 = 6
This is expressed by the following tables:

        wings      wings_colour
  1     absent     4               pot(1,2) = 1*3 = 3
  2     absent     5
  3     present    {1, 2}          pot(3,4) = 1+6 = 7
  4     present    {1, 3}

main table / secondary table 1
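The recursion above translates directly into code. The Python sketch below hardcodes our reconstruction of secondary table 2 and recomputes the partial potentials (3, 1 and 6) and the overall value 2 * (3 + 7) = 20 implied by the numbers in the text.

    # A sketch of the recursive union-potential computation of section 6.
    sec2 = {  # line -> (wings_colour values, Thorax_colour values)
        1: ({"red"}, {"blue"}),
        2: ({"blue"}, {"blue", "yellow"}),
        3: ({"green"}, {"blue", "red"}),
        4: (None, {"blue", "yellow"}),   # None stands for No_Sense
        5: (None, {"blue", "red"}),
    }

    def pot_u2(l1, l2):
        """Potential of the union of two lines of secondary table 2."""
        c1, t1 = sec2[l1]
        c2, t2 = sec2[l2]
        thorax = len(t1 | t2)
        if c1 is None and c2 is None:    # No_Sense: colour contributes factor 1
            return thorax
        return len((c1 or set()) | (c2 or set())) * thorax

    # potU1({1,3},{2,4}) split by the premise (wings = absent):
    pot_u1 = pot_u2(4, 5) + pot_u2(1, 1) + pot_u2(2, 3)      # 3 + 1 + 6 = 10
    potential = len({"big", "small"} | {"small"}) * pot_u1   # Thorax_size union
    print(pot_u1, potential)                                 # 10, 20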
7. Conclusion
The decomposition of symbolic objects following the Normal Symbolic Form offers
an easy way to take dependency rules into account, as shown by our first
application to distance computation. Including the construction of the NSF, we have
obtained in our first trials a reduction of about 90% of the computational time. This
encourages us to carry on our work.
We need to test it with a larger set of examples, to better estimate the improvement
it can provide. This will lead us to a more formal analysis of the complexity
of the different computation phases needed in distance processing with and without
NSF. Because NSF is a normal form, we hope it will in the future allow better
and easier interfacing of large sets of symbolic objects with data bases.
References:
Codd, E.F. (1972): Further Normalization of the Data Base Relational Model. In: Data
Base Systems, Courant Computer Science Symposia Series, Vol. 6. Prentice-Hall,
Englewood Cliffs, N.J.
De Carvalho, F.A.T. (1994): Proximity Coefficients between Boolean symbolic objects. In:
E. Diday et al. (eds.): New Approaches in Classification and Data Analysis. Springer-Verlag,
387-394.
De Carvalho, F.A.T. (to be published): Extension based proximities between Boolean
symbolic objects. In: Proceedings of the Fifth Conference of the International Federation of
Classification Societies.
Diday, E. (1991): Des objets de l'analyse de données à ceux de l'analyse de connaissances.
In: Y. Kodratoff and E. Diday (eds.): Induction symbolique et numérique à partir de
données. Cépaduès-Editions, Toulouse, 9-75.
Diday, E. (1988): The Symbolic Approach in Clustering. In: H.H. Bock (ed.): Classifica-
tion and Related Methods of Data Analysis. North-Holland, 673-683.
Ichino, M. and Yaguchi, H. (1994): Generalized Minkowski Metrics for Mixed Feature-Type
Data Analysis. IEEE Transactions on Systems, Man, and Cybernetics, 24, 4, 698-708.
The SODAS Project: a Software for Symbolic Data Analysis
Georges Hebrail
Summary: This paper presents an ESPRIT European project whose goal is to develop
prototype software for symbolic data analysis. Symbolic data analysis is an extension of
standard methods of data analysis (such as clustering, discrimination, or factorial
analysis) to more complex data structures, called symbolic objects. After a short
presentation of the model of symbolic objects, the different parts of the software are
briefly described.
1. Introduction
Standard statistical data analysis methods, such as clustering, discrimination, or factorial
analysis, apply to data which are basically structured as arrays. Each row represents an
individual, and each cell of a row contains a single value: the value of a variable
describing the individuals. In many real world applications, for some
variables, an individual may be described by sets of values, intervals of values, or
probability distributions of values. Moreover, some a priori knowledge of the user may
be associated with the data, such as taxonomies in variable domains.
More complex data structures, called symbolic objects, have been proposed by Pr Diday
in the last decade (see Diday (1991)). These data structures capture the complexity
described above, but remain manageable with regard to the computations performed in
statistical data analysis methods. Beyond these data structures, some extensions of
standard methods have been studied and evaluated for application to symbolic objects. The
extensions include clustering, discrimination, and factorial analysis.
But these new methods remain difficult to use in real applications for two main reasons
(see Hebrail (1995)): there is no available software to do so (there are only disparate
pieces of software in various universities), and it is difficult to manage data objects with a
more complex structure than simple arrays. The goal of the SODAS project is to develop
prototype software to solve these problems and make these methods available to more
users.
The SODAS project (for Symbolic Official Data Analysis System) is a European project
within the DOSIS Programme (Development of Statistical Information Systems),
organized by EUROSTAT, the Statistical Office of the European Communities in
Luxembourg. It gathers several partners, including national official statistics offices,
industrial companies and universities; various European countries are represented. An important
part of the project is devoted to benchmarks of real world applications, which will be
used to specify and test the software. These benchmarks are mainly provided by the
national official statistics offices involved in the project.
In this communication, after a short presentation of the model of symbolic objects, we
describe the main contents of this project, and especially the different parts of the
software.
2. Symbolic objects
As mentioned before, standard methods of statistical data analysis accept as their input
INDIVIDUALS by VARIABLES arrays. Each cell of such arrays contains the value
taken by an individual for a variable. This value is said to be atomic in the sense that it is
not a list or a set of values. For instance, if individuals are people, and if variables are
AGE and SOCIO-PROFESSIONAL CATEGORY (SPC), the AGE cell for a person
contains one value (the age of the person) and the SPC cell contains one value (the SPC
of the person).
Symbolic objects introduced by Pr Diday extend the classical data structure to
INDIVIDUALS by VARIABLES arrays where the value taken by an individual on a
variable may be non-atomic: possibly a set of values, intervals of values, or a probability
distribution. For instance, if individuals represent groups of people, and if the variables are
still AGE and SPC, a cell of this new array may contain, for each individual (i.e. each
group of people), the interval of the ages of the people in the group for the AGE variable, and
the list of SPCs of the people of the group for the SPC variable.
We recall below, in an informal way, the basic data structures defined by Pr Diday.
Additional structures have already been defined (see Diday (1991)), but are not presented
here. The benchmarks of the project will be used to define the final list of data structures
supported by the software.
Assertions
Assertions are conjunctions of boolean and/or probabilistic elementary events, for
instance:
Group 125 = [AGE = {34, 29, 2, 1}] ∧ [SPC = {Employee, Worker}]
District 92 = [AGE = {[25,30](0.2), [31,35](0.23), ...}]
              ∧ [SPC = {Executive manager(0.6), Worker(0.2), ...}]
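Such assertions have a direct rendering as data; the Python sketch below (field layout ours) stores boolean events as plain sets and probabilistic events as dicts of weighted values.

    # A sketch of boolean and probabilistic assertions as Python values.
    group_125 = {
        "AGE": {34, 29, 2, 1},                   # boolean elementary event
        "SPC": {"Employee", "Worker"},
    }
    district_92 = {
        "AGE": {(25, 30): 0.2, (31, 35): 0.23},  # probabilistic event
        "SPC": {"Executive manager": 0.6, "Worker": 0.2},
    }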
The model of symbolic objects also enables the user to associate with the data some a
priori knowledge (i.e. metadata), which is then used by methods applied to symbolic
objects. This a priori knowledge can be defined by different means. We present below
three ways to do so: mother-daughter variables, rules, and taxonomies in variable
domains.
Mother-daughter variables
Mother-daughter variables offer the possibility of defining variables which are not
applicable to all people, but only to people verifying some properties. For instance:
SPC is applicable only if AGE > 18
[Diagram: taxonomy in the AGE variable domain; the root AGE splits into [0,10], (11,15], (16,18] and Adult, with Adult subdivided into (18,34], [35,45], [46,55], [56,60], [60,70], [70,?].]
As a summary, symbolic objects can represent individuals which are groups of elements
of an underlying population, featuring variation within these groups. These objects can
also describe uncertainty on the data (with probabilistic objects) and metadata (with
mother/daughter variables, taxonomies in variable domains, and rules).
The methods which will be developed include partitioning algorithms (see Chavent
(1995)) and hierarchical clustering (see Brito (1995)). Hierarchical clustering will
produce either disjoint or overlapping clusters (pyramids).
These two approaches will be available in SODAS through the following features:
- an interface to call standard methods from SODAS,
- a tool for building disjunctive arrays of data from symbolic objects,
- a tool for building distance matrices from symbolic objects (see Ichino (1994),
Carvalho (1996)),
- a tool for giving symbolic interpretations, by means of symbolic objects, of the results
of standard clustering and factor analysis methods (see Tong et al. (1996)).
6. User interface
A large part of the project will be devoted to the consideration of users' needs. As a
work package of the project, a users' group will be created and coordinated.
This users' group will:
- gather different benchmarks from national official statistics offices and from
industrial partners,
- list the symbolic object structures which are necessary in these benchmarks,
- check that the developed methods solve real problems and meet users' needs,
- test the software on real world applications.
From another point of view, the software will include a user-friendly interface to help the
end-user visualize symbolic objects graphically.
Within the budget of this project, it will not be possible to develop a fancy homogenized
interface to all the methods. But guidelines will be published to homogenize the interfaces
of the different methods of the software.
Finally, a scientific reference manual will be published for the software. This book will
contain a unified presentation of symbolic objects and of the methods applicable to them
in the SODAS prototype software.
7. Partners
The project gathers 18 partners from various European countries. While THOMSON-
CSF is the pilot of the project, Pr. Diday will be the scientific manager. The French
company CISIA will be responsible for the development of the kernel of the software.
For more information about this project and the distribution of tasks between partners,
see SODAS Project (1995).
The main partners of the project are: Thomson-CSF (France), Universite de Dauphine
(France), Facultes Universitaires Notre Dame de la Paix (Belgium), Instituto Nacional de
Estadistica (Portugal), and University of Athens (Greece).
The associated partners are: CISIA (France), Centre de recherche public STADE
(Luxembourg), Central Statistical Office (England), Universita degli Studi di Bari
(Italy), Universita Federico II - Napoli (Italy), Electricite de France - Research center
(France), EUSTAT (Spain), INRIA (France), Universidade de Lisboa (Portugal), Institute
for Statistics - RWTH (Germany), Service des etudes et de la statistique du ministere de
la Region wallonne (Belgium), and Universidad Complutense de Madrid (Spain).
8. References
Brito P. (1995): « Symbolic objects: order structure and pyramidal clustering », in Annals
of Operations Research, No. 55, pp. 277-297.
Carvalho F.A.T. (1996): « Extension based proximities between constrained boolean
symbolic objects », in Proceedings of the IFCS'96 Conference, Kobe, March 96.
Chavent M. (1995): « Choix de base pour un partitionnement d'objets symboliques », in
Actes des Rencontres de la Société Francophone de Classification (SFC-95),
Namur, Sept. 95.
Chouakria A., Cazes P., Diday E. (1995): « Extension de l'analyse factorielle des
correspondances multiples à des données de type intervalle et de type ensemble », in
Actes des Rencontres de la Société Francophone de Classification (SFC-95),
Namur, Sept. 95.
Diday E. (1991): « Des objets de l'analyse des données à ceux de l'analyse des
connaissances », in Induction Symbolique et Numérique à partir de Données, Editeurs
Y. Kodratoff et E. Diday, Cépaduès-Editions.
Granville V., Rasson J.P. (1995): « Multivariate discriminant analysis and maximum
penalized likelihood density estimation », Journal of the Royal Statistical Society, Series B,
57, pp. 501-517.
Hebrail G. (1995): « L'analyse de données symboliques : état de l'art et perspectives »,
EDF-DER Research report HI-23/95-018.
Ichino M. (1994): « Generalised Minkowski metrics for mixed features type data
analysis », in IEEE Transactions on Systems, Man, and Cybernetics, 24, 4, pp. 698-708.
Lebbe J., Vignes R. (1991): « Génération de graphes d'identification à partir de
descriptions de concepts », in Induction Symbolique et Numérique à partir de Données,
Editeurs Y. Kodratoff et E. Diday, Cépaduès Editions.
Tong H.T.T., Summa M., Perinel E., Ferraris J. (1996): « Generating symbolic
descriptions for classes », in Proceedings of the IFCS'96 Conference, Kobe, March 96.
SODAS Project (1995): Answer to the DOSIS Call for Proposals.
Classification Structures for Cognitive Maps
Stephen C. Hirtle and Guoray Cai
School of Information Sciences
University of Pittsburgh
Pittsburgh, PA 15260 USA
sch,[email protected]
Summary: The ability to create and manipulate meaningful data structures of cognitive
spaces remains a problem for designers of geographic information systems. Methods
to represent the inherent hierarchical structure in cognitive spaces are discussed. Several
alternative scaling techniques for developing hierarchical and overlapping representations,
including ordered trees, ultrametric trees, and semi-lattices, are presented and discussed.
To demonstrate the differences among these three representation schemes, each of the three
techniques is applied to two small datasets collected on the recall of capitals or countries
in Europe. The methods discussed here were chosen to illustrate the limitations of a strict
hierarchical representation and because they have been used in the past to model cognitive
spaces.
1. Introduction
The ability to create and manipulate meaningful data structures of cognitive spaces
remains a problem for designers of geographic information systems. The ability to
present and to interpret spatial data in a manner that is consistent with the internal
cognitive map of the user would lead to systems that are more flexible and
provide greater functionality in terms of cognitive spatial tasks (Hirtle and Heidorn,
1993; Medyckyj-Scott and Blades, 1992).
A common conclusion that has emerged from the research on the structure of cognitive
mapping is that spatial memory is organized hierarchically, which results in
processing biases and errors in judgments (Couclelis, et al., 1987; Golledge, 1992; Hirtle
and Jonides, 1985; McNamara, et al., 1989; Stevens and Coupe, 1978). However,
as Hirtle (1995) argued recently, the claim that mental representations are inherently
hierarchical is often made without providing an explicit alternative. For example,
the first author and his colleagues have argued that their data are consistent with a
"partially hierarchical model" (McNamara, et al., 1989) and have warned against the
conclusion that the only structure in a cognitive map is of a hierarchical nature (Hirtle
and Jonides, 1985). While such qualifications are intriguing, they are often stated
without proposing an explicit alternative. In this paper, several alternative scaling
techniques for developing hierarchical and overlapping representations, including
ordered trees, ultrametric trees, and semi-lattices, are considered.
2. Hierarchies
A strict hierarchy is often assumed for representing spatial concepts. For example,
Stevens and Coupe (1978) showed how people consistently misjudged certain directions,
such as assuming that Reno, Nevada is north and east of San Diego, California,
when in fact Reno is north and west of San Diego. To account for such effects, Stevens
and Coupe (1978) presented a nested, propositional model, with San Diego as part
of California, Reno as part of Nevada, and California to the west of Nevada. Here,
the reasoning processes operate on a hierarchical tree structure, which contains cities
nested within their states.
3. Ordered Trees
A technique that has proven useful for uncovering hierarchical structure in cognitive
maps has been the ordered tree algorithm for free-recall data (Hirtle and
Jonides, 1985; McNamara, et al., 1989). An ordered tree is a rooted tree where the
children of a node, at any level, may be ordered, as a unidirectional or bidirectional
node, or unordered, as a nondirectional node. Ordered trees, as discussed here, were
first introduced by Reitman and Rueter (1980) and differ from two other uses of the
term in the literature. Aho, et al. (1974) define an ordered tree as one in which all
children are strictly ordered from left to right. In a third use of the term, Barthelemy,
et al. (1986) define an ordered tree as a rooted tree where the nodes are ordered by
the height of the nodes. In this paper, the discussion is restricted to the first use of
the term, as defined by Reitman and Rueter (1980).
[Diagram: the collection {NH VT}, {ME NH VT}, {CN MA}, {MA RI}, {CN MA RI}, {CN MA ME NH RI VT}, drawn both as an ordered dendrogram and as a set inclusion diagram over the leaves NH VT ME CN MA RI.]
Fig. 1: Ordered dendrogram and set inclusion diagram for ordered tree.
An ordered tree is built by examining the regularities in a set of recalls over a fixed set
of items. In fact, an ordered tree is a generalization that allows for some overlapping
structure. As an example, the collection of sets {NH VT}, {ME NH VT}, {CN MA},
{MA RI}, {CN MA RI}, and {CN MA ME NH RI VT} cannot be represented by
a tree, since the sets {CN MA} and {MA RI} are overlapping and violate the
definition of a hierarchy, given above. However, this collection can be represented by the
ordered tree as seen in Figure 1.
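The hierarchy condition can be checked mechanically: a collection of sets is a strict hierarchy exactly when every pair of sets is either nested or disjoint. A small Python sketch over this example:

    # A sketch testing whether a collection of recall clusters forms a
    # strict hierarchy (every pair of sets nested or disjoint).
    from itertools import combinations

    def is_hierarchy(sets):
        for s, t in combinations(sets, 2):
            if s & t and not (s <= t or t <= s):  # overlapping, not nested
                return False
        return True

    clusters = [{"NH", "VT"}, {"ME", "NH", "VT"}, {"CN", "MA"},
                {"MA", "RI"}, {"CN", "MA", "RI"},
                {"CN", "MA", "ME", "NH", "RI", "VT"}]
    print(is_hierarchy(clusters))  # False: {CN,MA} and {MA,RI} overlap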
One might be tempted to conclude that an ordered tree is simply a variant of a
non-binary tree. However, this is not the case. Note that in the previous example, a
non-binary tree could be constructed by the removal of the overlapping sets {CN MA}
and {MA RI}. However, by the inclusion of these two sets, along with the explicit
exclusion of the set {CN RI}, the collection of sets can no longer be represented by a
strict hierarchy. Furthermore, many cognitive and real-world relations are best seen
as exactly this type of ordered structure.
4. Semi-lattices
A semi-lattice is a generalization of an ordered tree. It is defined formally as a collection
of sets such that, for any two overlapping sets in the collection, the intersection
of the sets is also in the collection (Alexander, 1965). Therefore, if the sets {A B C
D E F} and {B C E G H} are in the collection, then the set {B C E} must be in
the collection as well. As an example, consider the collection of sets {NH VT}, {CN
MA}, {ME NH VT}, {CN MA VT}, {CN MA RI}, and {CN MA ME NH RI VT}.
Such a collection cannot be represented as either a tree or an ordered tree, but can be
represented as a semi-lattice. The sets {ME NH VT}, {CN MA VT} and {CN MA
RI} are overlapping and thus violate the definition of an ordered tree, given above.
However, this collection can be represented by the graph structure shown in Figure 2.
[Fig. 2: semi-lattice for this collection, with root {CN MA ME NH RI VT}.]
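Whether a collection satisfies the semi-lattice condition, and what must be added to make it do so, can likewise be computed; the Python sketch below closes a collection under the intersection of overlapping sets.

    # A sketch computing the intersection-closure of a collection of sets,
    # i.e. the smallest semi-lattice (in Alexander's sense) containing it.
    from itertools import combinations

    def semi_lattice_closure(sets):
        closed = {frozenset(s) for s in sets}
        changed = True
        while changed:
            changed = False
            for s, t in combinations(list(closed), 2):
                inter = s & t
                if inter and inter not in closed:  # missing intersection
                    closed.add(inter)
                    changed = True
        return closed

    example = [{"NH", "VT"}, {"CN", "MA"}, {"ME", "NH", "VT"},
               {"CN", "MA", "VT"}, {"CN", "MA", "RI"},
               {"CN", "MA", "ME", "NH", "RI", "VT"}]
    print(sorted(map(sorted, semi_lattice_closure(example))))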
the countries in Europe, or all the capitals in Europe. No other instructions were
given to the subjects. To equate the two samples, the capitals were converted into
the country name for those receiving the capital task. It is further acknowledged
that the capital task was harder and that exclusions might occur, not from forgetting
the country, but because the subject does not know or is unsure of the name of the
capital. However, these two datasets are considered only to highlight the differences
between the representations discussed in this paper, and not to generalize about
specific regional understanding of European geography. Furthermore, the purpose of this
exercise was to explore the possible clustering that exists among countries and not
the rather trivial set inclusion principle of aggregating a capital to its host country.
A group of 18 subjects in Norway, who were asked to recall countries of Europe,
produced a total of 44 distinct entries. The entire set of recalls can be seen in Figure 3,
using a path-graph visualization developed by Hirtle (1991). Here, the line width
is proportional to the number of times two countries were recalled together.
By visually focusing on the thicker connections, several clear clusters, such as
the Scandinavian countries, begin to emerge.
Fig. 3: Path graph of European countries generated by the Norwegian subjects
Fig. 4: Path graph of the country names for the European capitals generated by the
Austrian subjects
A group of 12 subjects in Austria, who were asked to list all the capital cities of
Europe, produced a total of 32 distinct entries. The capitals were converted to the
country names, and the resulting complete path-graph is shown in Figure 4.
The ordered lists from each of the datasets were clustered into a strict hierarchical
tree, using an average-link clustering algorithm (UPGMA). This was done using two
different measures of distance, city-block and a log-based distance. The city-block
metric is equivalent to stating that the distance between any two countries is propor-
tional to the total number of intervening items between them across all the ordered
lists. However, as items are further separated on the list, the actual numerical differ-
ence becomes less important. Therefore, we replicated the analysis with the logarithm
of the difference. Furthermore, four countries were dropped from the analysis, due
to a lack of data for calculating pairwise distances. For simplicity, only the latter
distance analysis is reported here. The resulting tree is shown in Figure 5.
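Under stated assumptions (recall lists given as ordered sequences of labels, the pairwise distance averaged over the lists in which both countries occur, and log(1 + separation) as one plausible reading of the log-based distance), the clustering step can be replicated as follows in Python; the two recall lists are illustrative, not the study data.

    # A sketch of the average-link (UPGMA) analysis of the recall lists.
    import numpy as np
    from itertools import combinations
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    recalls = [["Norway", "Sweden", "Denmark", "Finland", "Germany"],
               ["Sweden", "Norway", "Finland", "Germany", "Denmark"]]
    items = sorted({c for lst in recalls for c in lst})
    idx = {c: i for i, c in enumerate(items)}
    n = len(items)
    total = np.zeros((n, n))
    count = np.zeros((n, n))

    for lst in recalls:
        pos = {c: p for p, c in enumerate(lst)}
        for a, b in combinations(lst, 2):
            d = np.log(1 + abs(pos[a] - pos[b]))   # log-based separation
            i, j = idx[a], idx[b]
            total[i, j] += d; total[j, i] += d
            count[i, j] += 1; count[j, i] += 1

    # Average over lists; pairs never co-recalled would be dropped, as the
    # four countries were in the text.  Here every pair co-occurs.
    D = total / np.maximum(count, 1)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")  # UPGMA
    print(Z)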
[Fig. 5: Hierarchical tree of the European countries generated by the Norwegian subjects (UPGMA, log-based distance).]
[Hierarchical tree diagram]
Fig. 6: Hierarchical tree of the country names for the European capitals generated
by the Austrian subjects
In examining these trees, the limitation of a hierarchical tree becomes obvious. Each
country is placed uniquely in a single cluster within the tree, by the very definition
of a tree. The multiple relationships, which Alexander (1965) argued for convincingly,
cannot be incorporated in the representational structure.
5.2 Ordered trees
An ordered tree might allow some overlapping relationships to emerge. Unfortu-
nately, an immediate application of the existing ordered tree algorithm of Reitman
and Rueter (1980) is not possible. The algorithm was developed to account for the
strong representational structures within a single subject for a domain of interest,
and not to build an average structure across many subjects. Thus, the algorithm is
deterministic and produces clusters that exist across all recall patterns. Within the
Norwegian sample, there was not a single cluster that was common to all subjects,
whereas in the Austrian sample, only the single cluster {Norway, Sweden} existed for
all the subjects.
However, by examining subgroups of subjects within each sample, one can identify
small groups of subjects with common strategies, for which one can calculate non-
trivial ordered trees. Figure 7 shows one tree from a subset of the Norwegian subjects
and Figure 8 shows trees from two subsets of the Austrian subjects. It is interest-
ing to note the predominance of the home country, as expected, in each sample. In
addition, the two ordered trees in Figure 8, from the Austrian sample, indicate two
very different strategies, one that is geographically oriented (Figure 8a), and another
that is ordered by prominence (Figure 8b). The former strategy resulted in Austria
clustered with Switzerland and Liechtenstein, whereas the latter strategy resulted in
Austria being followed by France and the United Kingdom.
[Fig. 7: Ordered tree from a subset of the Norwegian subjects; leaves in order: Norway, Sweden, Finland, Denmark, Iceland, United Kingdom, The Netherlands, Belgium, France, Spain, Portugal, Switzerland, Austria, Italy, Germany, Poland, Greece.]
A set of candidate overlapping clusters can first be obtained through additive
clustering (Shepard and Arabie, 1979). These clusters could then provide a
seed set of potential clusters upon which to build a semi-lattice. An initial application of
the MAPCLUS algorithm to the data from the Norwegian subjects resulted in four
clusters, with only the Scandinavian cluster being distinct from the others. The data
from the Austrian subjects also resulted in four overlapping clusters. One cluster con-
sisted of northern European countries, including Scandinavia and the British Isles.
The second consisted of eastern European countries. The third cluster consisted of
prominent central European countries, and the final of less prominent countries. While
such an analysis is promising, it is clear that any implementation of semi-lattice mod-
els will require the additional development of appropriate algorithms.
7. Acknowledgments
This paper was prepared, in part, while the first author was on sabbatical at the
Department of Computer Science, Molde College, in Molde, Norway. Their support
is gratefully appreciated. The authors wish to thank Adrijana Car, Kai Olsen, and
Phipps Arabie for their comments concerning the issues presented in this paper.
References:
Aho, A. V., et al. (1974): The design and analysis of computer algorithms, Addison-Wesley,
Reading, MA.
Alexander, C. (1965): A city is not a tree, Design, 46-55.
Barthelemy, J. P., et al. (1986): On the use of ordered sets in problems of comparison and
consensus of classification, Journal of Classification, 3, 187-224.
Carroll, J. D. and Corter, J. E. (1995): A graph-theoretic method for organizing overlap-
ping clusters into trees, multiple trees, or extended trees, Journal of Classification, in press.
Carroll, J. D. and Pruzansky, S. (1980): Discrete and hybrid scaling models. In: Similarity
and Choice, Lantermann, E. D. and Feger, H. (eds.), Hans Huber, Bern.
Couclelis, H., et al. (1987): Exploring the anchor-point hypothesis of spatial cognition,
Journal of Environmental Psychology, 7, 99-122.
Diday, E. (1986): Orders and overlapping clusters in pyramids. In: Multidimensional data
analysis, de Leeuw, J., et al. (eds.), 201-234, DSWO Press, Leiden.
Golledge, R. G. (1992): Place recognition and wayfinding: Making sense of space, Geofo-
rum, 23, 199-214.
Hirtle, S. C. (1991): Knowledge representations of spatial relations. In: Mathematical
psychology: Current developments, Doignon, J.-P. and Falmagne, J.-C. (eds.), 233-250,
Springer-Verlag, New York.
Hirtle, S. C. (1995): Representational structures for cognitive space: Trees, ordered trees,
and semi-lattices. In: Spatial information theory: A theoretical basis for GIS, Frank, A. U.
and Kuhn, W. (eds.), Springer-Verlag, Berlin.
Hirtle, S. C. and Heidorn, P. B. (1993): The structure of cognitive maps: Representations
and processes. In: Behavior and environment: Psychological and geographical approaches,
Garling, T. and Golledge, R. G. (eds.), 170-192, North-Holland, Amsterdam.
Hirtle, S. C. and Jonides, J. (1985): Evidence of hierarchies in cognitive maps, Memory
and Cognition, 13, 208-217.
Kim, H. and Hirtle, S. C. (1995): Spatial metaphors and disorientation in hypertext brows-
ing, Behaviour and Information Technology, 14, 239-250.
McNamara, T. P., et al. (1989): Subjective hierarchies in spatial memory, Journal of Ex-
perimental Psychology: Learning, Memory, and Cognition, 15, 211-227.
Medyckyj-Scott, D. J. and Blades, M. (1992): Human spatial cognition, Geoforum, 23,
215-226.
Reitman, J. S. and Rueter, H. R. (1980): Organization revealed by recall orders and con-
firmed by pauses, Cognitive Psychology, 12, 554-581.
Sattath, S. and Tversky, A. (1977): Additive similarity trees, Psychometrika, 42, 319-345.
Shepard, R. N. and Arabie, P. (1979): Additive clustering: Representation of similarities
as combinations of discrete overlapping properties, Psychological Review, 86, 87-123.
Stevens, A. and Coupe, P. (1978): Distortions in judged spatial relations, Cognitive Psy-
chology, 10, 422-437.
Van Cutsem, B. (ed.) (1994): Classification and dissimilarity analysis, Lecture Notes in
Statistics, No. 93, Springer-Verlag, New York.
Unsupervised Concept Learning
Using Rough Concept Analysis
Tu Bao Ho
Summary: Formal concept analysis (Wille, 1982) offers an algebraic tool for representing
and analyzing formal concepts, and the rough set theory (Pawlak, 1982) offers an alternative
tool to deal with vagueness and uncertainty. Rough concept analysis (Kent, 1994) is an
attempt to synthesize common features of these two theories. In this work we develop a
method for unsupervised concept learning in the framework of rough concept analysis that
aims at finding and using concepts with their approximations.
1. Introduction
The problem of finding not only hierarchical clusters of unlabelled objects but also
'good' conceptual descriptions of them was addressed early in data analysis, e.g., by Diday
and Simon (1976). This problem is referred to as unsupervised concept learning in
machine learning, and can be defined as:
• Given a set of unlabelled object descriptions;
• Find a hierarchical clustering that determines useful object subsets (clustering);
• Find intensional definitions for these subsets of objects (characterization).
Unsupervised concept learning techniques depend strongly on how concepts are un-
derstood and represented. The notion of concepts under the classical view and the
generality relation was mathematically formulated in the theory of formal concept
analysis by Wille and his colleagues during the last fifteen years (Wille, 1992). Recently,
several concept learning methods have been developed in this framework, e.g., those
of Godin and Missaoui (1994) and Carpineto and Romano (1996) that incrementally
learn all possible concepts, or the method OSHAM (Ho, 1995) that extracts a part of the
hypothesis space in the form of concept hierarchies.
The theory of rough sets is a mathematical tool to deal with vagueness and uncer-
tainty in interpreting given data (Pawlak, 1991). Recently, by combining common
features of the rough set theory and formal concept analysis, Kent (1994) in-
troduced a theory of rough concept analysis as a framework for representing and
learning approximations of concepts. In this paper we develop the unsupervised con-
ceptual clustering method A-OSHAM, an extension of OSHAM, for inducing concept
hierarchies with concept approximations by using rough concept analysis.
objects sharing these properties and accepted as members of the concept. Formal con-
cept analysis models formal concepts from a formal context, which is a triple (O, A, R)
where O is a set of objects, A is a set of attributes and R is a binary relation between
O and A, i.e., R ⊆ O × A. Notice that, for simplicity, formal concept analysis is
usually described with Boolean data; however, its notions can be extended to multi-
valued data. In general, oRa is understood as "object o has attribute a" in the Boolean
domain, and can be extended to nominal or discrete numeric domains as "object o
has attribute a with some value v". Data from continuous domains can be discretized
into discrete data to be used with the framework. A formal concept of a given formal
context is a pair of extent/intent (X, S) where the extent X contains precisely those
objects sharing all attributes in the intent S and, vice-versa, the intent S contains
precisely those attributes shared by all objects in the extent X. The relation between
the extent and the intent of concepts can be described by two operators λ and ρ:

    S = λ(X) = {a ∈ A | ∀o ∈ X: oRa}                          (1)
    X = ρ(S) = {o ∈ O | ∀a ∈ S: oRa} = ⋂_{a ∈ S} Ra ⊆ O      (2)
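The two derivation operators translate directly into code. The Python sketch below applies them to a four-object fragment of the living-organisms context (our selection of entries) and recovers the concept C9 = ({Le, Br, Fr}, {nw, lw, mo}) cited in the example below.

    # A sketch of the derivation operators rho and lambda of formal
    # concept analysis, on a fragment of the living-organisms context.
    R = {  # object -> set of attributes (a fragment, for illustration)
        "Le": {"nw", "lw", "mo"},
        "Br": {"nw", "lw", "mo", "lb"},
        "Fr": {"nw", "lw", "ll", "mo", "lb"},
        "Dg": {"nw", "ll", "mo", "lb", "sk"},
    }

    def rho(S):   # extent of an attribute set S
        return {o for o, attrs in R.items() if S <= attrs}

    def lam(X):   # intent of an object set X
        sets = [R[o] for o in X]
        return set.intersection(*sets) if sets else set()

    print(rho({"nw", "lw", "mo"}))   # {'Le', 'Br', 'Fr'}
    print(lam({"Le", "Br", "Fr"}))   # {'nw', 'lw', 'mo'}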
As an example, consider the set of living organisms O = {Leech (Le), Bream (Br),
Frog (Fr), Dog (Dg), Spike-Weed (SW), Bean (Bn), Maize (Mz), Reed (Rd)}. The living
organisms are described by the attributes A = {needs water (nw), lives in water (lw),
lives on land (ll), needs chlorophyll (nc), 2 leaf germination (2lg), 1 leaf germination
(1lg), is motile (mo), has limbs (lb), suckles young (sk)}. A mark in the context table
indicates when an object has an attribute, i.e., which living organism has which attribute. The
concept lattice L(O, A, R) consists of 19 formal concepts C1, C2, ..., C19 (denoted in
the figure only by their indexes), for example C9 = ({Le, Br, Fr}, {nw, lw, mo}).
2.2 Rough sets and rough concept analysis
There have been different methods of approximating concepts, e.g., those employing
the Bayesian decision theory or the well-known fuzzy set theory, which characterize
concepts approximately by a membership function with a range between 0 and 1.
Rough set theory can be considered as an alternative way of approximating concepts.
The starting point of this theory is the assumption that our "view" of the elements of
a set of objects O depends on some equivalence relation E on O. An approximation
space is a pair (O, E) consisting of O and an equivalence relation E ⊆ O × O. The
key notion of the rough set theory is the pair of lower and upper approximations of any subset
X ⊆ O, which consist of all objects surely and possibly belonging to X, respectively.
The lower approximation E•(X) and the upper approximation E°(X) are defined by

    E•(X) = {o ∈ O | [o]_E ⊆ X}      E°(X) = {o ∈ O | [o]_E ∩ X ≠ ∅}

where [o]_E denotes the equivalence class of objects indiscernible from o with respect to
the equivalence relation E. Kent (1994) has pointed out common features between the
theories of rough sets and formal concept analysis, and formulated rough concept
analysis. Saying that a given formal context (O, A, R) is not obtained completely and
precisely means that the relation R is incomplete and imprecise. Let (O, E) be any
approximation space on the objects O; we wish to approximate R in terms of E. The
lower approximation R•E and the upper approximation R°E of R w.r.t. E can be
defined element-wise as

    o(R•E)a ⟺ [o]_E ⊆ Ra      o(R°E)a ⟺ [o]_E ∩ Ra ≠ ∅

Now, any formal concept (X, S) ∈ L(O, A, R) can be approximated by R•E and R°E.
The lower and upper E-approximations of (X, S) are its corresponding concepts in
the approximate contexts (O, A, R•E) and (O, A, R°E). A rough concept of a formal
context (O, A, R) in (O, E) is a collection of concepts which have the same lower and
upper E-approximations (roughly equal concepts).
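The lower and upper approximations themselves are a few lines of code once the equivalence classes are known. In the Python sketch below, E is induced (as in Figure 2) by agreement on the attributes lw and nc; the context fragment is our reconstruction, for illustration only.

    # A sketch of rough-set lower/upper approximation of an object set X
    # under the equivalence "same values on (lw, nc)".
    context = {  # object -> attributes it possesses (fragment)
        "Le": {"nw", "lw"}, "Br": {"nw", "lw"}, "Fr": {"nw", "lw"},
        "Dg": {"nw"}, "SW": {"nw", "lw", "nc"}, "Bn": {"nw", "nc"},
    }
    features = ("lw", "nc")

    def eq_class(o):
        sig = tuple(f in context[o] for f in features)
        return {p for p in context
                if tuple(f in context[p] for f in features) == sig}

    def lower(X):  # objects whose whole class lies inside X
        return {o for o in context if eq_class(o) <= X}

    def upper(X):  # objects whose class meets X
        return {o for o in context if eq_class(o) & X}

    X = {"Le", "Br", "Fr"}
    print(lower(X), upper(X))  # both {'Le', 'Br', 'Fr'} on this fragment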
Figure 2: Lower and upper approximations of the living organisms w.r.t. (lw, nc)
Note that the approximate contexts of (O, A, R) in (O, E) vary according to the equiv-
alence relation E. Figure 2 (modified from Kent, 1994) illustrates the lower and
upper approximate contexts (left and right tables) as well as the lower and upper ap-
proximate concept lattices L(O, A, R•E) (top) and L(O, A, R°E) (bottom) of L(O, A, R),
where the indiscernibility relation E (denoted by E1) is determined with respect to
the two features lives in water (lw) and needs chlorophyll (nc). These approximate concept
lattices consist of 10 lower approximations and 9 upper approximations, denoted by
C•1, ..., C•10 and C°1, ..., C°9, respectively. Figure 3 illustrates those of
L(O, A, R) (in similar positions) where the indiscernibility relation E (denoted by E2) is
determined with respect to the two features lives on land (ll) and is motile (mo). These
approximate concept lattices also consist of 10 and 9 approximations, respectively.
We can determine the maps of concepts from L(O, A, R) to their approximations in
L(O, A, R•E1) and L(O, A, R°E1), and in L(O, A, R•E2) and L(O, A, R°E2).
Example: some assignments from these maps are given below.
Figure 3: Lower and upper approximations of the living organisms w.r.t. (ll, mo)
Consider the concept C13 = ({Br, Fr}, {nw, lw, mo, lb}) in L(O, A, R). It has lower ap-
proximations ({Le, Br, Fr}, {nw, lw, mo}) and ({Le, Br, Fr, Dg}, {nw, mo})
in L(O, A, R•E1) and L(O, A, R•E2), and upper approximations ({Le, Br, Fr},
{nw, lw, ll, mo, lb}) and ({Le, Br, Fr, Dg}, {nw, ll, mo, lb}) in L(O, A, R°E1) and
L(O, A, R°E2). Although some concepts in L(O, A, R) have similar indexes in differ-
ent approximate lattices, their approximations may have different intents and extents.
OSHAM (Ho, 1995) is an unsupervised concept learning method that induces a con-
cept hierarchy H from the concept lattice L(O, A, R). Inspired by this algorithm, we
propose the algorithm A-OSHAM for learning approximate concepts in the framework of
rough concept analysis. Essentially, A-OSHAM induces a concept hierarchy in which
each induced concept is associated with a pair of its lower and upper approximations.
The search for the concept hierarchy is carried out through the hypothesis space of
the concept lattice L(O, A, R). The basis of this search is a generate-and-test operator
that splits a concept C into subconcepts at a lower level of H. The associated lower and
upper approximations are computed from the approximate contexts generated cor-
responding to the heuristic used to induce the concept. Starting from the root concept
with the whole set of training instances, A-OSHAM induces the concept hierarchy H
recursively in a top-down direction, as described in Table 1. The procedure for computing
the lower and upper approximations in step 1.(d) is given in Table 2. The procedures for
steps 1.(a), 1.(b), 1.(c), 1.(e) and 1.(f) are the same as for OSHAM (Ho, 1995).
Table 1: Algorithm A-OSHAM(C_k, H)
1. Suppose that C_k1, ..., C_kn are the subconcepts of C_k = (X_k, S_k) found so far. While
C_k is still splittable, find a new subconcept C_k(n+1) = (X_k(n+1), S_k(n+1)) of C_k and its
approximations by doing:
(a) Find an attribute a* so that ⋃_{i=1..n} X_ki ∪ ρ({a*}) is the largest cover of X_k.
(b) Find the largest attribute set S containing a* satisfying λρ(S) = S.
(c) Form the subconcept C_k(n+1) with S_k(n+1) = S and X_k(n+1) = ρ(S).
(d) Find a lower approximation and an upper approximation of C_k(n+1) with respect
to the chosen equivalence relation E.
(e) Form intersecting subconcepts corresponding to the intersections of ρ(S_k(n+1)) with
the extents of the existing subconcepts.
2. Let X*_k = X_k \ ⋃_{i=1..n+1} X_ki. If one of the following conditions holds, then C_k is considered
unsplittable:
(a) There does not exist any attribute set S ⊇ S_k satisfying λρ(S) = S in X*_k.
(b) card(X*_k) ≤ α.
A-OSHAM forms concepts at different levels of generality in the hierarchy; each level
corresponds to a partition of O. A-OSHAM generates concepts with their approxima-
tions recursively and gradually: once a level of the hierarchy is formed, the procedure
is repeated for each class.
There are many possible lower and upper approximations of a concept, according to
the family F of equivalence relations E on O. There are at least two ways
of approximating concepts in the framework of rough concept analysis: (1) to
compute all possible approximations for each induced concept, and (2) to compute
one plausible lower and upper approximation for each induced concept.
We investigate in this paper the case of one plausible lower and upper approxima-
tion for each induced concept. In each attempt to split a concept at level n,
A-OSHAM finds its subconcepts at level n + 1 by means of the specialized
hypotheses with maximum coverage. Suppose that we want to find the approximations
(X_k, S_k)•E and (X_k, S_k)°E of the induced concept C_k = (X_k, S_k). As S_k is chosen
among the hypotheses specialized from the intent of its superconcept with the maxi-
mum coverage, it is reasonable to approximate C_k by the one-step further specialized
hypothesis generated in the same way. Suppose that P is the partition of the super-
concept extent with respect to S_k; we first generate a refinement of P with respect to
S_k ∪ {a*}, where the conjunction S_k ∪ {a*} forms the largest cover of the extent of the
superconcept of C_k; then we approximate C_k with the equivalence classes of this refinement
according to (8) and (9).

Table 2: Computing the approximations of an induced concept C_k
1. Find an attribute a* ∈ A \ S_k so that S_k ∪ {a*} is the largest cover of the extent of the
superconcept of C_k.
2. Find the equivalence classes of the superconcept extent of C_k according to the equivalence
relation E formed by adding a* to S_k.
3. Find the lower and upper approximations of C_k with respect to E, i.e., (X_k, S_k)•E
and (X_k, S_k)°E.
[Table fragment of induced concepts:
C12: ({SW, Rd}, {nw, lw, nc, 1lg})
C13: ({Br, Fr}, {nw, lw, mo, lb})]
1 The theory
Every researcher interested in the relations between variables (for example a psycholo-
gist, a methodologist, a didactics specialist, ...) asks himself the following question: "Let
a and b be two binary variables; can I affirm that the observation of a leads to the
observation of b?". This non-symmetrical point of view on the couple (a, b),
contrary to that of similarity analysis methods, expresses itself in the question: "Is it
right that if a then b?". Generally, a strict answer is not possible and the researcher
must content himself with a quasi-implication. With the statistical im-
plication, we propose a concept and a method which make it possible to measure the
degree of validity of an implicative proposition between variables (binary or not).
Furthermore, this method of data analysis makes it possible to represent the partial
order (or pre-order) which structures a set of variables.
  φ      a       b       c       d       e
  a              0.97    0.73
  b
  c      0.82    0.975                   0.82
  d              0.78                    0.92
  e
A. Larher (Larher 1991) has proved that the order between the intensities respects
the order between the cardinals. So, for each pair of variables, we only keep the
maximal intensity of the two couples defined by this pair. We have also proved (Gras
and Larher 1992) the relation existing between the linear correlation coefficient and
the statistical implication, and the relation between the χ² of independence and the
statistical implication.
This last notion has a (semantic) meaning only when the considered classes have a good cohesion. This cohesion must be measured, and the measure is founded on the implication intensity. For example, if we consider the class (a, b, c) and observe φ(a, b) = 0.97, φ(b, c) = 0.95 and φ(a, c) = 0.92, we can say that the oriented class from a to c has a good cohesion. It would not be the case if the implication intensities were respectively equal to 0.82, 0.38 and 0.48. We then define the cohesion of a class as a notion opposed to the entropy (of the class). Then we can write (Gras and Larher 1992):
Let p = max(φ(a, b), φ(b, a)); the entropy (in Shannon's definition) En of a class (a, b) equals: En = −p log₂(p) − (1 − p) log₂(1 − p).
Let coh(a, b) be the implicative cohesion of the class (a, b). It is defined as follows for a class with two elements:
coh(a, b) = 1            if p = 1,
coh(a, b) = √(1 − En²)   if 0.5 ≤ p < 1,
coh(a, b) = 0            if p < 0.5.
The cohesion coh(a, b) is an increasing function of the implication intensity between a and b. The cohesion of a general class C is the geometric mean of the cohesions of the elements of C taken two by two.
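A minimal sketch of these definitions in Python (the function names are ours), applied to the class (a, b, c) of the example above:

    import math

    def coh(p):
        # Implicative cohesion of a two-element class, with
        # p = max(phi(a, b), phi(b, a)).
        if p >= 1.0:
            return 1.0
        if p < 0.5:
            return 0.0
        en = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # Shannon entropy
        return math.sqrt(1 - en ** 2)

    def class_cohesion(pairwise_p):
        # Geometric mean of the pairwise cohesions of a class.
        values = [coh(p) for p in pairwise_p]
        return math.prod(values) ** (1.0 / len(values))

    print(class_cohesion([0.97, 0.95, 0.92]))  # a well-cohering class
    print(class_cohesion([0.82, 0.38, 0.48]))  # 0: one pair falls below 0.5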
For the previous example, we find coh(c, b) = 0.98, coh(d, e) = 0.91 and coh(a, (c, b)) = 0.89.
The implication between classes must integrate their cohesion. For a class A = {a₁, ..., a_r} and a class B = {b₁, ..., b_s}, we define the implication ψ(A, B) from class A to class B. The implication between A and B increases according to their cohesion and decreases with their cardinal. As in the case of two variables, we retain the implication A ⇒ B if ψ(A, B) > ψ(B, A), or the implication B ⇒ A if ψ(B, A) > ψ(A, B). We obtain, for the example, (a, (b, c)) ⇒ (d, e) with an implication intensity equal to 0.27.
Using the classical algorithm of hierarchical clustering, we can build a hierarchy.
However, here we refuse the aggregation of two classes if the cohesion of the resulting
class is equal to zero. For the example, we obtain:
[Figure: the hierarchical tree over the variables a, c, b, d, e]
An agreement between the partition and the numerical index of the implicative clustering occurs at some nodes of the hierarchical tree. These nodes are said to be significant and are studied in (Ratsimba-Rajohn 1992) and in (Gras and Ratsimba-Rajohn 1996). In these papers one also finds a statistical tool which attributes to a given class of the hierarchy the objects and sets of objects which contribute the most to this class. A software package, CHIC (Ag Almouloud 1992), is available. Now we shall present an application of this method.
analysis. These variables can help to explain the meaning of the classes whose objects are responsible for their formation.
The objective of this experiment is to provide the expert with the different relations between the different behaviour features in a population. Two populations have been examined: commercials (sales staff) of the firm (60 persons) and workers of the same firm (40 persons).
With the agreement of the expert, the number of examples (Cab), the conditional probabilities (the probability that b is true if a is true, Pab), and the forces of implication (Pfab) are the same in the discovery process on the two populations.
There are 5 examples which verify this rule, whose conditional probability is equal to 1, and whose intensity of implication is equal to 0.96. We notice that the condition of discovery is valid; this rule has no counter-example (it is a "logical" rule). This rule means that if the commercial has positive extraversion and negative anxiety, then his self-confidence is positive.
With the help of the expert psychologist, we will comment on some of the discovered rules. Their analysis permits us to confirm and to improve the expertise; it permits us to validate the adequacy of the theoretical model of the construction to the measured phenomena. We know that psychometry assigns dimensions to qualitative objects that are not measurable.
If a person (commercial) is aggressive and his intellectual dynamism is not very important, then he does not listen to other persons (no reception from another person). The coverage rate is equal to 0.22 (number of examples that satisfy the rule (13) / number of all examples (60)), and there are two counter-examples.
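A hedged sketch of how such coverage rates and counter-example counts can be computed (the feature coding below is hypothetical, not the actual PerformanSe data):

    def rule_stats(records, premise, conclusion):
        # Coverage rate = examples satisfying the whole rule / all examples;
        # counter-examples satisfy the premise but not the conclusion.
        n_rule = sum(1 for r in records if premise(r) and conclusion(r))
        n_counter = sum(1 for r in records if premise(r) and not conclusion(r))
        return n_rule / len(records), n_counter

    # Hypothetical coding: AGR = aggressiveness, DYN = intellectual
    # dynamism, REC = reception (listening to others)
    population = [
        {"AGR": "+", "DYN": "-", "REC": "-"},
        {"AGR": "+", "DYN": "-", "REC": "+"},
        {"AGR": "-", "DYN": "+", "REC": "+"},
    ]
    coverage, counters = rule_stats(
        population,
        premise=lambda r: r["AGR"] == "+" and r["DYN"] == "-",
        conclusion=lambda r: r["REC"] == "-",
    )
    print(coverage, counters)  # coverage rate and number of counter-examples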
Some rules which prove that the same behaviour may be deduced from different premisses:
IF EST = ESTO AND AFL = AFL- AND REC = REC- THEN P = P+ : 10 100 98
IF LED = LED+ AND REC = REC- THEN P = P+ : 12 100 99
If a person is animated by power motivation and he does not listen to other persons, then he is aggressive. The rate of coverage is equal to 0.2 and there is no counter-example.
The overall results are summarized in the next table, where we can notice, among others, an excellent distribution, for each concept, of the number of explicative rules and of the number of covered examples in each class of definition of the concept.

            N1               N2              N3
         -    0    +      -   0   +      -    0    +
E      142  462  220      3   6   2    138  460  211
P      193  386  245      3   6   4    186  374  242
N      186  438  200      5   6   3    185  428  190
ACH    178  478  168      4   6   2    173  474  157
CLV    205  242  377      7  12  10    194  239  374
CON    226  418  180      6  13   6    222  407  180
EST    163  495  166      2   4   2    154  494  162
LED    148  482  194      3   7   4    143  476  188
AFL    213  300  311      6   9   5    204  194  305
REC    220  384  220      6  14   7    205  382  218
2.4 Conclusion
After the extraction process, the database contains many rules, and many of them are not interesting, accurate, or useful enough with respect to the end user's objectives. The quality of each generated rule has to be verified, so we need probabilistic tests to check whether the rule actually describes some regularity in the data. The evaluation has to determine the usefulness of the extracted rules and decide which to save in the database. If a rule is not valid according to the indexes, and more precisely the intensity of the implication, then it will be considered not interesting and will not be saved in the database.
In the context of the collaboration between the Knowledge and Information System team at IRIN and the firm PerformanSe SA, studies of other populations are under way in the sports and education domains.
3. References
Briand, H. et al. (1995): Mesure statistique de la robustesse d'une implication pour l'apprentissage symbolique. Prépublication IRMAR 10-1995, Rennes.
Gras, R. (1979): Contribution à l'étude expérimentale et à l'analyse de certaines acquisitions cognitives et de certains objectifs didactiques. Thèse d'État, Université de Rennes I.
Gras, R. and Ratsimba-Rajohn, H. (1996): Analyse non symétrique de données par l'implication statistique. RAIRO - Recherche opérationnelle, 30-3, AFCET, Paris.
Gras, R. et al. (1996): Structuration of sets with implication intensity. Proceedings of the International Conference on Ordinal and Symbolic Data Analysis - OSDA 95, E. Diday, Y. Chevallier, O. Opitz (eds), Springer, Paris.
Lerman, I.C. et al. (1981): Évaluation et élaboration d'un indice d'implication pour des données binaires I et II. Mathématiques et Sciences Humaines, 74, 5-35 and 75, 5-47.
Correspondence Analysis,
Quantification Methods, and
Multidimensional Scaling
· Multidimensional Scaling
Correspondence Analysis,
Discrimination, and Neural Networks
Ludovic Lebart
Centre National de la Recherche Scientifique
Ecole Nationale Superieure des Telecommunications
46 rue Barrault, 75013, Paris, France.
A general framework (see, e.g., Baldi and Hornik (1989)) can deal simultaneously with the supervised and the unsupervised cases.
Let X be the (n, q) matrix whose n rows contain the n observations of an input q-vector, and let Y be the (n, p) matrix containing (as rows) the n observations of an output p-vector.
A designates the (q, r) matrix of weights (a_jm) (see Fig. 1) before the hidden layer, and B the (r, p) matrix of weights (b_mk) following it (r ≤ p and r ≤ q).
Fig. 1: Perceptron with one hidden layer (i-th observation)
In the case of identity transfer functions (Φ and Ψ) and null constant terms, the model collapses to a simpler form, where D_q (resp. D_p) is the diagonal matrix whose q (resp. p) diagonal elements are the counts of the q classes (resp. p classes).
The training of this particular Multilayer Perceptron entails the diagonalization of the matrix M*. Note that M* involves symmetrically the two sets (p columns of X on the one hand, q columns of Y on the other).
The Multilayer Perceptron will coincide with Correspondence Analysis if D_p is a scalar matrix (all the p classes have the same number of elements) or if the output matrix Y has been properly re-scaled during a preliminary step into Ỹ according to the following formula:
Ỹ = Y D_p^{-1/2}
The new matrix to be diagonalized,
M̃ = D_p^{-1/2} C D_q^{-1} C' D_p^{-1/2},
has the same eigenvalues as M*, and its eigenvectors can easily be derived from those of M*.
"if
Zi4 z;.$
input hidden layer output
Fig. 2: Auto association strangulated network
It is an apparently trivial situation. In fact, these networks are of great interest if the hidden layer is narrower than the others, thus realizing a compression of the input signal (Fig. 2). Bourlard and Kamp (1988) and Baldi and Hornik (1989) have stressed the link between SVD - and consequently Principal Component Analysis (PCA) - and these particular networks. The proof is straightforward if we replace both Y and X by Z in the formulas obtained in the previous section.
In this context, the matrix M given by equation (6) is nothing but the product-moment matrix Z^T Z.
In this setting, the equivalence with Correspondence Analysis is obtained if Z is derived from a contingency table K according to the transformation (with the usual notations):
z_ij = (k_ij − k_i. k_.j) / √(k_i. k_.j)    (7)
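As a sketch (assuming, as is usual in correspondence analysis, that the margins in (7) refer to relative frequencies), the CA axes can then be obtained from the SVD of Z:

    import numpy as np

    def ca_from_contingency(K):
        P = np.asarray(K, dtype=float)
        P /= P.sum()                       # relative frequencies k_ij
        r = P.sum(axis=1, keepdims=True)   # row margins k_i.
        c = P.sum(axis=0, keepdims=True)   # column margins k_.j
        Z = (P - r * c) / np.sqrt(r * c)   # transformation (7)
        return np.linalg.svd(Z)            # axes of Correspondence Analysis

    K = [[20, 5, 2], [3, 18, 4], [1, 6, 15]]
    U, s, Vt = ca_from_contingency(K)
    print(s)   # singular values; squared, they are the CA eigenvalues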
Note that the nature and the size of the input data involved in the two approaches of sections 2 and 3 are radically different.
The network of section 2 is "fed" by n individual observations. It learns how to predict the
output category corresponding to observation i, from the knowledge of its input category.
The network of section 3 is fed simultaneously by q observations of p categories (rows of
Z) or equivalently by p observations of q categories (columns of Z). It learns how to
summarize the input information.
Note that section 3 deals with properties common to Principal Component Analysis and
Correspondence Analysis.
with:
A_i = x_i x_i'   (x_i being the i-th column of X^T)
The classical iterated power algorithm can then be performed using this decomposition (cf. Wold (1966)), taking advantage of the possible sparsity of the data matrix X. Starting from a random vector u_0, step k of this algorithm, after setting u_k = 0, consists of n assignments such as:
for i = 1 to i = n, do: u_k ← u_k + A_i u_{k-1}    (8)
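A minimal sketch of one pass of algorithm (8), accumulating the rank-one products A_i u = x_i (x_i' u) row by row, which never forms X'X and so preserves sparsity (the names are ours):

    import numpy as np

    def power_pass(X, u_prev):
        # One step k of (8): u_k starts at 0 and accumulates A_i u_{k-1}
        # = x_i (x_i' u_{k-1}) over the n rows x_i of X.
        u = np.zeros_like(u_prev)
        for x_i in X:
            u += x_i * (x_i @ u_prev)
        return u

    rng = np.random.default_rng(0)
    X = rng.random((100, 5))
    u = rng.random(5)
    for _ in range(50):
        u = power_pass(X, u)
        u /= np.linalg.norm(u)   # normalization (orthonormalization if several)
    print(u)                     # approximates the dominant eigenvector of X'X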
Both linear adaptive networks corresponding to algorithms (8) and (9) can produce simultaneously several eigenvectors, provided that orthonormalizations are carried out with a frequency that depends on the available precision. It is by no means necessary to orthonormalize the estimates of the eigenvectors at each reading (i.e., for each value of the index j when using algorithm (9)).
It must be stressed that stochastic approximation algorithms such as algorithm (9) converge very slowly, their convergence being based on the divergence of the harmonic series. Iterated power algorithms (8) (whose first steps could be speeded up by using stochastic approximation (9)) perform well if they confine themselves to finding an s-dimensional space V_s containing the first t eigenvectors (with t ≪ s). Then, the t dominant eigenvectors (and their corresponding eigenvalues) can be efficiently computed through a classical diagonalization algorithm applied to the (s, s) product-moment matrix obtained after projection onto the subspace V_s.
5. References
Asoh, H. and Otsu, N. (1989): Nonlinear data analysis and multilayer perceptrons. IEEE IJCNN-89, 2, 411-415.
Baldi, P. and Hornik, K. (1989): Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2, 52-58.
Benzécri, J.-P. (1969a): Statistical analysis as a tool to make patterns emerge from clouds. In: Methodology of Pattern Recognition, S. Watanabe (ed.), Academic Press, 35-74.
Benzécri, J.-P. (1969b): Approximation stochastique dans une algèbre normée non commutative. Bull. Soc. Math. France, 97, 225-241.
Benzécri, J.-P. (1992): Correspondence Analysis Handbook. Marcel Dekker, New York.
Bourlard, H. and Kamp, Y. (1988): Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291-294.
Cheng, B. and Titterington, D.M. (1994): Neural networks: a review from a statistical perspective. Statistical Science, 9, 2-54.
Gallinari, P., Thiria, S. and Fogelman-Soulié, F. (1988): Multilayer perceptrons and data analysis. International Conference on Neural Networks, IEEE, 1, 391-399.
Gifi, A. (1990): Non Linear Multivariate Analysis. J. Wiley, Chichester.
Greenacre, M. (1984): Theory and Applications of Correspondence Analysis. Academic Press, London.
Guttman, L. (1941): The quantification of a class of attributes: a theory and method of scale construction. In: The Prediction of Personal Adjustment, Horst, P. (ed.), 251-264, SSRC, New York.
Hayashi, C. (1956): Theory and examples of quantification (II). Proc. of the Institute of Statist. Math., 4 (2), 19-30.
Hornik, K. (1994): Neural networks: more than "statistics for amateurs". In: COMPSTAT, Dutter, R., Grossmann, W. (eds.), Physica Verlag, Heidelberg, 223-235.
Summary: This paper gives an illustration of exploratory data analysis with Hayashi's
Quantification Method III using graphics (hereafter we abbreviate this method as HQM III).
It is shown that artificial data sets with ordinal structures can be expressed on the surface
of a torus, and it is also pointed out that the torus suggests the results of HQM III. Some
applications of HQM III to medical data using the graphical configuration are reported.
1. Introduction
Hayashi's quantification method III was presented by Hayashi in 1956. Correspondence analysis was later developed by the school of Benzécri in France in 1973. Artificial data sets with a one-dimensional structure have been proposed by Guttman (1950), Iwatsubo (1987) and Okamoto, among others. Iwatsubo (1987) also presented data with a circular structure.
The purpose of this report is to illustrate a graphical representation of qualitative data with ordinal structures and the usage of such graphics for exploratory analysis. We show that it is possible to express on the surface of a torus a family of artificial data sets with ordinal structures, which includes the Guttman data and the Iwatsubo data. Hayashi's quantification method III is used to illustrate these data sets by a graphical configuration in three-dimensional space.
D = {δ_i(jk)},
for i = 1, 2, ..., n; j, u = 1, 2, ..., m; k = 1, 2, ..., ℓ_j; v = 1, 2, ..., ℓ_u, where n is the number of subjects, m is the number of items, and ℓ_j (or ℓ_u) is the number of categories of item j (or item u).
A x = λ x,
where
A = {a_(jk)(uv)},  x = {x_(jk)},
a_(jk)(uv) = (1 / (m n_(jk))) { n_(jk)(uv) − n_(jk) n_(uv) / n }.
3. Torus Model
The following artificial data are representative data with a one-dimensional structure. Generally, the data matrix D is given by a square matrix (n × n), where n is the number of subjects, m is the number of items, and ℓ is the number of categories of an item, with
n = ℓ · m.
Then this data matrix D is defined by:
(1) for i ≤ n − m + 1: δ_ij = 1 if 0 ≤ j − i < m, and δ_ij = 0 otherwise;
(2) for i > n − m + 1: δ_ij = 0 if 1 ≤ i − j ≤ n − m, and δ_ij = 1 otherwise.
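A small sketch generating this matrix (using 0-based indices, under which the two cases (1) and (2) collapse into a single cyclic condition):

    import numpy as np

    def circular_data(m, ell):
        # n = ell * m; delta_ij = 1 iff column j lies within the window of
        # m positions starting (cyclically) at row i.
        n = ell * m
        i, j = np.indices((n, n))
        return ((j - i) % n < m).astype(int)

    print(circular_data(m=3, ell=2))   # a 6 x 6 example of the torus pattern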
Table 1.1 shows a few examples generated from such a matrix with n, m and ℓ being 20, 10 and 2, respectively.
Table 1.1: [Response patterns of the 20 subjects on the 10 two-category items: filled circles (category 1) shift cyclically into open circles (category 2) across the subjects.]
Table 1.2: [The same data with the two category columns of each of the 10 items shown side by side; the filled and open entries again shift cyclically across the subjects.]
Fig. 2(a): Circular model; Fig. 2(b): Its configuration by HQM III
Fig. 3(a): Guttman's perfect scale model; Fig. 3(b): Its configuration by HQM III
Fig. 4(a): Guttman's unidimensional scale model; Fig. 4(b): Its configuration by HQM III
In this section, we show the results of HQM III on an examination of heart functions. Let us consider a data set from a treadmill exercise study (Arai, 1987). This consists of data from 32 subjects (n = 32) on five variables (m = 5), and as shown in Table 2 the data set is augmented by item 6, a classification variable which will be used later.
Table 2: [Indicator matrix of the 32 subjects on the five examination items (with 3, 4, 3, 4 and 3 categories, labelled -, ±, +, ++, etc.) and the two-category classification item (Normal/Abnormal), together with the item number, group, category number and category score of each option; the column positions of the entries were lost in extraction.]

The subjects, ordered by their scores on the first axis (response-pattern columns omitted), are:
Subject    Score
   6       1.346
  12       1.334
  13       1.120
   9       1.116
   8       1.022
   4       1.010
   3       1.006
   5       0.890
  11       0.796
   7       0.780
   1       0.566
  10       0.566
  19       0.566
  14       0.318
  28       0.318
   2       0.163
  16      -0.085
  18      -0.085
  25      -0.103
  24      -0.141
  27      -0.141
  23      -0.197
  31      -0.202
  22      -0.447
  30      -0.506
  21      -0.810
  26      -1.198
  29      -1.213
  32      -1.697
  15      -1.751
  17      -1.999
  20      -2.342
Fig. 5: Evaluation of heart function (examination variables); solid: Normal, dashed: Abnormal
5. Conclusion
In this paper, we presented the use of the torus model and its comparison with HQM III. Artificial data by Guttman, Iwatsubo and Komazawa were used to investigate their properties by graphic data processing, specifically the torus representation and graphical display based on HQM III. It was shown that graphical displays were helpful to clarify otherwise hidden properties of those data.
Acknowledgments
We would like to thank the editor and referees for a careful reading of the manuscript and for comments that improved it. Anonymous referee "A" was particularly helpful in detecting typographical errors and bringing our attention to relevant references.
References:
Arai, C., Komazawa, T., et al. (1984): The effect of hypoxanthine riboside on integral value % for various hemodynamic parameters under upright treadmill exercise. The Journal of Japanese College of Angiology, 24, 1, 75-81 (in Japanese) & 90 (abstract in English).
Guttman, L. (1950): The principal components of scale analysis. In: Measurement and Prediction, Stouffer, S. A. et al. (eds.), 312-361, Wiley, New York.
Hayashi, C. (1956): Theory and examples of quantification, II. Proc. of ISM, 4, 1, 19-30 (in Japanese).
Iwatsubo, S. (1987): The Foundation of Quantification Method. Asakura-shoten, Tokyo (in Japanese).
Komazawa, T. and Tsuchiya, T. (1995): Exploratory data analysis for Hayashi's quantification method III by graphics: Its application of ordinal structure analysis to the ninth nationwide survey of the Japanese national character. Proc. of ISM, 43, 1, 161-176 (in Japanese).
Exploring Multidimensional Quantification Space
Shizuhiko Nishisato
The University of Toronto
OISE/UT, 252 Bloor Street West
Toronto, Ontario, Canada M5S 1V6
Summary: Dual scaling deals with two distinct types of categorical data, incidence data
and dominance data. While perfect row-column association of an incidence data matrix
does not mean that the data matrix can be fully explained by one solution, dominance
data with perfect association can be explained by one solution. Considering that a main role of quantification theory is to explain data in multidimensional space, the present study presents a non-technical look at some fundamental aspects of the quantification space that is used for the analysis of the two types of data.
1. Introduction
Consider principal component analysis (PCA) of n standardized variables. If the rank of the correlation matrix is 2, that is, r(R) = 2, then all the variables are positioned in two-dimensional space, and more specifically, on a circle of radius 1. In other words, each variable is located on a plane at distance 1 from the origin. Likewise, if r(R) = 3, all the variables are located on the surface of a sphere at distance 1 from the origin. One can extend the same statement to the general multidimensional case. Thus, no matter what rank the correlation matrix may have, one can infer that K dimensions are sufficient if the sum of squares of the K coordinates of each variable is close to 1. Thus, with standardized continuous variables, both graphical display and the statistic "percentage accounted for" can tell us the dimensionality of the data.
Suppose that we now consider PCA of deviation scores, that is, variables which are not standardized, but centered. This leads to principal component analysis of the variance-covariance matrix, V. Since the variance of each variable is different, even when r(V) = 2, variables are not positioned on a circle of any fixed radius. Although the shape of the distribution of the variables in the two-dimensional principal plane cannot tell us that r(V) = 2, the statistic of the percentage accounted for by each component tells us the dimensionality and the importance of each component. Thus, with non-standardized continuous variables, graphical display does not tell us the dimensionality of the data, but the statistic "percentage accounted for" still does.
From the practitioner's point of view, the real problem in choosing between R and V is not that of computational convenience, but rather that of which one of the two is more meaningful. In the social sciences, many variates such as personality scores do not have any rational unit of measurement, and one tends to opt for standardizing variables for the purpose of comparability. Even in the natural sciences, where there exists a rational unit of measurement, one may still face the choice between the use of available units (e.g., kilometers or miles) and standardization. There is no readily available guideline on which one to use. To make the task of choosing either R or V more important and difficult than one may think, the results of PCA from the two can be vastly different, and there does not seem to be any mathematically traceable relationship between the two sets of PCA results (e.g., Nishisato and Yamauchi, 1974). This makes it almost impossible to consider a rational way of choosing one over the other.
When one looks at quantification methods such as Hayashi's quantification theory, correspondence analysis, homogeneity analysis and dual scaling, they are conceptually the same as PCA. In fact, Torgerson (1958) called these quantification methods PCA of categorical data. There are, however, a number of differences between them and PCA. On one hand, "PCA of categorical data" is regarded as singular value decomposition of standardized categorical variables, suggesting a stronger resemblance to PCA of R than to PCA of V. On the other hand, there are many more numerical aspects that connect it to PCA of V rather than to PCA of R. If we consider multiple-choice data in the response-pattern format (i.e., in the form of an indicator matrix) that can be explained by two components, we would realize almost immediately that the quantification results are more like PCA results associated with V than with R, because a two-dimensional graph does not tell us that the data are two-dimensional. In fact, it is the present author's view that the quantification method as we know it by many different names is PCA of non-standardized categorical variables.
As French researchers in the area of correspondence analysis would say, graphical display of quantification results is almost indispensable for their interpretation. But, surprisingly, there seems to be little knowledge about the space used in quantification. For instance, under what circumstances can we infer from a graphical display that the categorical data in hand are two-dimensional? To answer this question, it seems essential to know what multidimensional quantification space we are dealing with. For some reason or other, this fundamental question on the space has not been a topic of intensive investigation. The present study will be concerned with this problem, and some preliminary considerations will be presented.
2. Dual Scaling
As a method of quantification, dual scaling (Nishisato, 1980, 1994, 1996) will be considered to see what kind of multidimensional space it uses. Dual scaling is known for two aspects: (1) it handles a wide variety of categorical data, in particular incidence data (contingency tables, multiple-choice data, sorting data) and dominance data (rank-order data, paired comparison data, successive categories data); (2) it employs two distinct objectives, one for incidence data and the other for dominance data. The former is the familiar low-rank approximation to the input data, and the latter is the low-rank approximation to the ranking of the input data by the ranking of the distances between each subject and a set of stimuli. The latter is, therefore, considered to offer a low-rank solution to the problem of multidimensional unfolding (Nishisato, 1994, 1996). Mathematical handling of the two major types of categorical data by dual scaling is based on singular value decomposition of appropriately transformed data matrices.
For incidence data, denote the data matrix by F, where the typical element f_ij is either 1 (presence) or 0 (absence), or the joint frequency of cells i and j, and denote the diagonal matrices of row marginals and column marginals by D_r and D_c, respectively. Then singular value decomposition is carried out as
D_r^{-1/2} F D_c^{-1/2} = V Λ W',
where V'V = I, W'W = I, and Λ = diag(λ_j). Optimal weight matrices for rows, Y, and columns, X, are given by
Y = D_r^{-1/2} V,  X = D_c^{-1/2} W,  so that F = f_t Y Λ X'.
Optimal vectors are scaled as x_k' D_c x_k = y_k' D_r y_k = f_t, that is, the sum of all the elements in F.
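A minimal sketch of this decomposition (leaving aside the f_t scaling of the optimal vectors):

    import numpy as np

    def dual_scaling_incidence(F):
        F = np.asarray(F, dtype=float)
        dr = F.sum(axis=1)                    # row marginals (diagonal of Dr)
        dc = F.sum(axis=0)                    # column marginals (diagonal of Dc)
        S = F / np.sqrt(np.outer(dr, dc))     # Dr^(-1/2) F Dc^(-1/2)
        V, lam, Wt = np.linalg.svd(S, full_matrices=False)
        Y = V / np.sqrt(dr)[:, None]          # Y = Dr^(-1/2) V
        X = Wt.T / np.sqrt(dc)[:, None]       # X = Dc^(-1/2) W
        return Y, lam, X

    F = np.array([[10.0, 2, 1], [3, 8, 2], [1, 2, 9]])
    Y, lam, X = dual_scaling_incidence(F)
    print(lam)   # singular values; the first (trivial) one equals 1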
For dominance data, the responses collected are first transformed into the subject-by-stimulus matrix of dominance numbers E, where a typical element e_ij indicates the number of times Subject i judges Stimulus j higher (larger, more attractive) than the other stimuli minus the number of times Subject i judges other stimuli higher than Stimulus j. If we indicate the number of subjects by N and the number of stimuli by n, the dominance matrix is N × n, and the sum of the elements of each row is zero. To define the diagonal matrices D_r and D_c for the dominance matrix, each element is considered as based on n − 1 comparisons. Thus, we can now specify the two diagonal matrices as follows: D_r = n(n − 1)I, and D_c = N(n − 1)I. Hence, f_t = nN(n − 1). Then, with these newly defined diagonal matrices and the dominance matrix E, we can carry out singular value decomposition of E to obtain optimal vectors y_k and x_k (Nishisato, 1978).
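For rankings, the dominance numbers have the closed form e_ij = n + 1 − 2 R_ij, where R_ij is the rank Subject i gives Stimulus j; a sketch:

    import numpy as np

    def dominance_numbers(R):
        # R: N x n matrix of rankings (1 = most preferred). A stimulus of
        # rank r beats n - r others and is beaten by r - 1 others, so
        # e = (n - r) - (r - 1) = n + 1 - 2r; each row then sums to zero.
        R = np.asarray(R)
        n = R.shape[1]
        return n + 1 - 2 * R

    R = np.array([[1, 3, 2],
                  [2, 1, 3]])
    E = dominance_numbers(R)
    print(E)               # [[ 2 -2  0], [ 0  2 -2]]
    print(E.sum(axis=1))   # [0 0]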
position on the continuum is called his or her ideal point, and the ranking of each subject is interpreted as the ranking of the stimuli on the continuum folded at the subject's ideal point. Thus, the problem of unfolding is that, given a set of rank orders (i.e., folded continua) from subjects, we wish to unfold the rank orders to recover the original single continuum, on which subjects and stimuli are jointly located. When a single continuum is replaced with multidimensional axes, unfolding becomes complicated, and the problem is then called that of multidimensional unfolding. A number of studies on the topic have been published (e.g., Coombs and Kao, 1960; Coombs, 1964; Schönemann, 1970; Schönemann and Wang, 1972; Carroll, 1972; Gold, 1973; Heiser, 1981; Greenacre and Browne, 1986). However, they have not made any reference to dual scaling, obviously being unaware of its relevance. Now we know that dual scaling always offers a perfect solution to the problem of multidimensional unfolding (Nishisato, 1994).
Table 2
Dominance Matrix E
 9. -3. -7. -9.  7. -1.  5. -5.  1.  3.
-1. -9. -7.  1.  5.  9. -3.  7.  3. -5.
-7. -5.  3.  5.  1. -1. -9.  7.  9. -3.
 7. -9.  1. -1.  5.  9.  3. -5. -3. -7.
 7. -9. -1. -3.  3.  9.  1.  5. -7. -5.
 9.  5.  1. -1. -3. -5.  7.  3. -9. -7.
-3. -9.  9. -1.  1.  5. -5.  3.  7. -7.
 7. -9. -1. -3.  3.  9.  1.  5. -7. -5.
 7. -9.  1. -5.  3.  9. -1.  5. -3. -7.
 7. -9.  1. -7. -5. -3.  3.  9.  5. -1.
-7. -9. -3. -1.  1.  9.  3.  7.  5. -5.
-1. -9. -3.  3.  7.  9.  5. -7. -5.  1.
 9. -9.  5. -7. -1.  3.  1.  7. -3. -5.
-5. -1.  1.  5. -9. -3. -7.  7.  9.  3.
-5. -9. -7. -1.  3.  9.  5.  7.  1. -3.
 5.  1. -9.  3. -1. -7. -5.  7.  9. -3.
 9. -9. -5. -7.  5.  1.  7. -1. -3.  3.
 1.  3. -7.  5. -9. -5. -3.  7.  9. -1.
 7. -9. -1. -3. -5.  9.  1.  3.  5. -7.
 9.  3.  7. -9. -7. -3. -1.  5.  1. -5.
 7. -9.  1. -3.  5.  9.  3. -1. -5. -7.
-1.  5. -7.  3. -9. -5. -3.  7.  9.  1.
-1. -7. -9.  3. -5. -3.  1.  7.  9.  5.
 1.  7.  9. -7. -9.  3. -5. -1.  5. -3.
 7. -9. -1. -3. -7.  9.  5.  3.  1. -5.
Suppose we consider the ranking of the ten services by the first two subjects. As one can easily see, there are no constraints on each of the ten columns, while each row is constrained by the condition that each row sum of dominance numbers is zero. Thus the analysis of this 2 × 10 matrix yields at most two solutions (see Table 3). Similarly, if we analyze the ranking by the first three subjects, the 3 × 10 dominance matrix yields three dual scaling solutions (see Table 4). Since the variates y_k and x_k do not span the same space, it is important that one of them is projected onto the space of
the other. For dual scaling to provide a perfect solution, we must project the stimuli onto the space for the subjects (Nishisato, 1994). In other words, we must plot y_ik and ρ_k x_jk. Figure 1 shows the results of dual scaling analysis of the 2 × 10 matrix: compute the distance between Subject 1 and each of the ten stimulus points, and rank order the distances from the closest (smallest distance) to the furthest; this reproduces the exact ranking of the ten services by Subject 1! Similarly, one can calculate the distances between Subject 2 and the stimuli, rank the distances, and see that the observed ranking is again reproduced. For dual scaling of the 3 × 10 matrix, we need all three solutions to recover the same ranking of the services by each of the three subjects from the plot of (y_ik, ρ_k x_jk).
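A sketch of this distance-ranking check, where y holds the normed subject coordinates y_ik and x_proj the projected stimulus coordinates ρ_k x_jk over the retained solutions (the values below are those of Table 3):

    import numpy as np

    def recovered_rankings(y, x_proj):
        # Squared distances between every subject point and stimulus point.
        d2 = ((y[:, None, :] - x_proj[None, :, :]) ** 2).sum(axis=2)
        # Rank the distances per subject: 1 = closest stimulus. With all
        # solutions retained, these ranks match the observed rankings.
        order = d2.argsort(axis=1)
        ranks = np.empty_like(order)
        rows = np.arange(d2.shape[0])[:, None]
        ranks[rows, order] = np.arange(1, d2.shape[1] + 1)
        return ranks

    y = np.array([[-0.9996, 1.0005],
                  [-1.0000, -0.9995]])        # normed subject weights
    x_proj = np.array([[-0.4442, 0.5557], [0.6668, 0.3331], [0.7778, -0.0003],
                       [0.4442, -0.5557], [-0.6666, 0.1114], [-0.4447, -0.5554],
                       [-0.1109, 0.4445], [-0.1114, -0.6666], [-0.2223, -0.1110],
                       [0.1113, 0.4444]])     # projected stimulus weights
    print(recovered_rankings(y, x_proj))      # reproduces the two observed rankings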
Figure 1: A perfect two-dimensional solution (Subjects 1 and 2 plotted with the ten projected stimulus points ρ_k x_jk, e.g., Postal Service, Sports/Recreation)
Suppose we increase the number of subjects to nine, and analyze the 9 × 10 dominance table. As expected, if we use all nine solutions, the ranking of the distances between subjects and stimuli in 9-dimensional space will reproduce perfectly the ranking in the original data (Table 1). How about the case in which there are more subjects than nine? Would we fail to recover the ranking of the services by some subjects? A truly remarkable property of dual scaling is that no matter how many more subjects than the number of stimuli we may have, the rankings of the stimuli by all the subjects are reproduced in (n − 1)-dimensional space, not N-dimensional space! This is a classical property of singular value decomposition. Table 5 shows the sums of squares of rank discrepancies between the input rankings and the rankings approximated by one, two, three, ..., and nine solutions. As one can see in the table, the rank-9 approximation yields 0 discrepancy, indicating that it is a perfect solution to the problem of multidimensional unfolding.
Table 3
Normed Weights for 2 Subjects and Projected Weights for 10 Stimuli (2 Solutions)

Subjects (normed weights):
1:  -0.9996   1.0005
2:  -1.0000  -0.9995

Stimuli (projected weights):
-0.4442   0.5557
 0.6668   0.3331
 0.7778  -0.0003
 0.4442  -0.5557
-0.6666   0.1114
-0.4447  -0.5554
-0.1109   0.4445
-0.1114  -0.6666
-0.2223  -0.1110
 0.1113   0.4444
Table 4
Normed Weights for 3 Subjects and Projected Weights for 10 Stimuli (3 Solutions)

Subjects (normed weights):
1:   0.9485  -1.0848  -0.9610
2:  -0.7304  -1.3498   0.8027
3:  -1.2517  -0.0342  -1.1967

Stimuli (projected weights):
 0.6677  -0.3027  -0.0398
 0.3699   0.5768   0.0608
-0.1956   0.6274  -0.0919
-0.5750   0.3053   0.1285
 0.0643  -0.5325  -0.1448
-0.2323  -0.4085   0.3475
 0.6740  -0.0395   0.1317
-0.6895  -0.1580   0.0758
-0.4633  -0.2016  -0.3453
 0.3797   0.1332  -0.1225
Thus, in dual scaling of dominance data, all we are interested in is the recovery of, or approximation to, the original rankings of the stimuli by the individual subjects. As such, the importance of the shape of the quantification space becomes secondary. The percentage of the original rankings approximated by K solutions is sufficient for the investigator to know. Even so, however, it should be mentioned that the multidimensional quantification space for dominance data is in some sense standardized, since the total contribution of each subject to the total space is fixed and equal to (n + 1)/[3(n − 1)], irrespective of N. One further characteristic of dominance data is that both D_r and D_c are scalar matrices.
space. Let us look at a numerical example (Table 6) and note the following regularities: (a) the sum of the squared singular values (correlation ratios in dual scaling) is equal to the average number of options minus 1, that is, 2; (b) the sum of squares of the quantified item scores is equal to nN(m_j − 1), that is, 3 × 7 × (3 − 1) = 42; (c) the sum of r_jt² over the total solutions (6) is equal to m_j − 1, which is 3 − 1 = 2; (d) the sum of the normed scores of each subject over the total solutions is equal to the total number of solutions; (e) the sum of squares of the option weights over the total solutions is equal to n(N − f_jp), where f_jp is the number of subjects who chose option p of item j; and (f) the inter-subject squared distance in the total normed space is 2N, which is 14 in the present example.
Table 5
Sum of Squares of Discrepancies between Observed and Reproduced Ranks

                      Number of Solutions
Subject     1    2    3    4    5    6    7    8    9
   1       88   78   90   46   42   14   16    0    0
   2       62   28   14    2    4    4    2    4    0
   3       60   80   80   12   12    0    0    0    0
   4       14   10   12   16   16   16    6    2    0
   5       12    8   14   14   14   10    8    6    0
   6      100  124   66   62    8    2    0    0    0
   7      104  106   78   14   12    8    8    6    0
   8       12    8   14   14   14   10    8    0    0
   9       10   16    8    6    6    0    0    0    0
  10      104   52   22    8    2    4    2    2    0
  11       84   34   12    8   10    4    0    0    0
  12       68   52    4    4    4    4    2    0    0
  13       40   42    8    4    4    0    0    0    0
  14       72   28   14    2    2    2    2    0    0
  15       68   38   10   10    6    4    2    2    0
  16      122   46   52   30   22   22   22    2    0
  17       52   54   54   14    6    0    0    0    0
  18       90   28   16   12   10   10    0    0    0
  19       44   28   16   16   14    8    0    0    0
  20      124  186    6    6    2    2    2    0    0
  21        4    2    4    4    4    2    2    0    0
  22       72   28   22   12   10    6    6    6    0
  23      152    8   12   10   10    4    2    0    0
  24      160  158   36    8    8    0    0    0    0
  25       44   24   12   12    8    6    2    0    0
  26       96   52   14    8    6    8    4    0    0
  27       88   18   16   10    8    8    4    2    0
  28      104   64   70   44    8    6    6    0    0
  29       10   12    6    6    8    2    2    2    0
  30      102   80   88   16   26   10    0    0    0
  31       30    4    4    2    4    2    2    2    0
As stated earlier, these relations will be violated depending on whether the number of total solutions is determined by the rows or by the columns. These cases, however, will not be discussed here due to space limitations.
Table 6
Summary Statistics on a Case Where the Rank is Determined by Both Rows and Columns

Solution     η²         α        δ(%)
   1       0.8116     0.8839    40.5777
   2       0.6071     0.6765    30.3575
   3       0.2476    -0.5195    12.3794
   4       0.1884    -1.1538     9.4206
   5       0.1075    -3.1501     5.3762
   6       0.0378   -11.7370     1.8886
 Sum       2.0000   -15.0000   100.0000

Weights over the six solutions (last column: SS):
1   -1.1013   0.3336  -0.8355  -0.4292  -0.2986  -1.3056   12.00
2    0.2846  -1.4987   1.7208   1.1650  -0.8413   0.3830   15.00
3    1.3673   0.9983  -0.4675  -0.5212   1.2892   1.5754   15.00
4   -1.3673   0.9983   0.4675  -0.5212  -1.2892   1.5754   15.00
5    1.1013   0.3336   0.8355  -0.4292   0.2986  -1.3056   12.00
6   -0.2846  -1.4987  -1.7208   1.1650   0.8413   0.3830   15.00
7    1.7207   1.6214  -1.4305   2.1860  -2.3437  -0.3043   18.00
8   -1.7207   1.6214   1.4305   2.1860   2.3437  -0.3043   18.00
9    0.0000  -0.6486   0.0000  -0.8744   0.0000   0.1217   16.00
4. Concluding Remarks
The current paper touched only on the surface of the topic. The two data types, incidence data and dominance data, are distinct, as reflected in their respective objectives for quantification. From the view of multidimensional decomposition of data, however, probably the most important distinction lies in the fundamental premise of "what is multidimensionality for the two data types?" For incidence data, perfect association in the data (i.e., a correlation ratio of 1) does not mean that a single solution (component) can explain the data exhaustively. A simple example is a 10 × 10 contingency table of perfect row-column association (e.g., all non-zero entries are found only in the main diagonal). This data matrix yields nine perfect correlation ratios (i.e., 1), yet it needs nine solutions to explain the data. In contrast, when the association is perfect in dominance data (e.g., all the subjects rank the stimuli in the same way), one solution explains the data completely. This distinction creates a number of different characteristics between the data types, some of which are discussed in Nishisato (1993, 1994, 1996). From the graphical point of view as well as the interpretation point of view, the distinction between the two types should be well understood, but a number of further investigations into the differences are still needed before we are certain about our full understanding of the implications of the differences.
The last remark on both types of data concerns the treatment of missing responses. Most imputation methods have the effect of increasing the total information (i.e., the sum of squared singular values) in the data. This is obviously undesirable, and should be regarded as fabrication of information. Thus, any method of imputation must be such that the total observed information is kept invariant. Such an example is rare (see dual scaling of rank-order data, Nishisato, 1994), and the effects of missing responses on multidimensional space need to be further investigated.
5. References
Carroll, J.D. (1972). Individual differences and multidimensional scaling. In R.N. Shepard, A.K. Romney, and S.B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Volume 1. New York: Seminar Press.
Coombs, C.H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57, 145-158.
Coombs, C.H. (1964). A Theory of Data. New York: Wiley.
Coombs, C.H., and Kao, R.C. (1960). On a connection between factor analysis and multidimensional unfolding. Psychometrika, 25, 219-231.
Gold, E.M. (1973). Metric unfolding: Data requirements for unique solution and clarification of Schönemann's algorithm. Psychometrika, 38, 555-569.
Greenacre, M.J., and Browne, M.W. (1986). An efficient alternating least-squares algorithm to perform multidimensional unfolding. Psychometrika, 51, 241-250.
Heiser, W.J. (1981). Unfolding analysis of proximity data. Doctoral dissertation, Leiden University, The Netherlands.
Nishisato, S. (1978). Optimal scaling of paired comparison and rank-order data: An alternative to Guttman's formulation. Psychometrika, 43, 263-271.
Nishisato, S. (1980). Analysis of Categorical Data: Dual Scaling and Its Applications. Toronto: University of Toronto Press.
Nishisato, S. (1993). On quantifying different types of categorical data. Psychometrika, 58, 617-629.
Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Nishisato, S. (1996). Gleaning in the field of dual scaling. Psychometrika, 61, 559-599.
Nishisato, S., and Yamauchi, H. (1974). Principal components of deviation scores and standardized scores. Japanese Psychological Research, 16, 162-170.
Schönemann, P.H. (1970). On metric multidimensional unfolding. Psychometrika, 35, 167-176.
Schönemann, P.H., and Wang, M.M. (1972). An individual difference model for the multidimensional analysis of preference data. Psychometrika, 37, 275-309.
Torgerson, W.S. (1958). Theory and Methods of Scaling. New York: Wiley.
Homogeneity Analysis for Partitioning
Qualitative Variables
Takahiro Tsuchiya
The Institute of Statistical Mathematics
4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan
1. Introduction
In the social and cultural sciences, uni-dimensional scaling is an important theme. A distinctive feature of scaling in those fields is that a data set often consists of qualitative variables. Multiple correspondence analysis, dual scaling and Hayashi's Quantification Method III (HQM III) are methods for analyzing the structure of qualitative variables. In the case that all the variables are homogeneous (namely, they share one uni-dimensionality), the first solution of those methods can be used as a uni-dimensional scale. It is rare, however, that all the variables have one uni-dimensionality in practical situations. In order to construct uni-dimensional scales, selection or partitioning of variables is usually needed. HQM III is not necessarily appropriate for this purpose. To demonstrate this, let us consider the following example.
Tab. 1 shows an artificial data set 1, which consists of eight variables and 13 observations. The numerals in the table indicate category labels and n is an observation number. For example, observation 13 chose category 4 of variable 1. Although there are 8 variables in the data set, they should be partitioned into two groups. That is, an indicator matrix D1 is obtained by using only variables 1 to 4 (Tab. 2). Zeros are omitted in the table. It is clear that D1 has a Guttman uni-dimensional structure. Another indicator matrix D2 with uni-dimensional structure is also obtained by using variables 5 to 8 (Tab. 3). Because the orders of the rows are, however, different between D1 and D2, we should say that the data set in Tab. 1 has two uni-dimensional structures.

Tab. 1: Artificial Data Set 1
            variable
 n     1  2  3  4  5  6  7  8
 1     1  1  1  1  2  1  1  1
 2     2  1  1  1  2  2  2  1
 3     2  2  1  1  3  3  2  2
 4     2  2  2  1  2  2  1  1
 5     2  2  2  2  3  3  3  3
 6     3  2  2  2  2  2  2  2
 7     3  3  2  2  1  1  1  1
 8     3  3  3  2  3  2  2  2
 9     3  3  3  3  4  4  4  3
10     4  3  3  3  4  4  3  3
11     4  4  3  3  3  3  3  2
12     4  4  4  3  4  4  4  4
13     4  4  4  4  4  3  3  3
Fig. 1 shows the first and the second axes of HQM III applied to this data set. In the figure, ○ indicates a category of variables 1 to 4, while ● indicates one of variables 5 to 8. It is well known that a horseshoe is obtained when a data set has one uni-dimensional structure. In Fig. 1, such a horseshoe appears, and we might incorrectly conclude that the data set has only one uni-dimensional structure.
Fig. 2 shows the first and the third axes. Carefully observing the figure, we can see that there are two flows of categories and that they correspond to the two groups of variables.
Fig. 1: The second axis versus the first.  Fig. 2: The third axis versus the first.
The above example illustrates that at least a three-dimensional display of the result is needed for partitioning variables in order to construct multiple uni-dimensional scales. It is impractical, however, to plot all the categories when the number of variables increases, because the figure will become illegible. In the example, the difference between the two groups of variables appeared on the third axis, but for other data sets it is not certain which axis should be used.
Thus this paper proposes an easy method for partitioning qualitative variables to construct multiple uni-dimensional scales.
2. Method
Let X_i (N × G_i) be the indicator matrix of variable i (i = 1, ..., I), where N is the number of observations and G_i is the number of categories of variable i. There is at most a single 1 in each row of X_i. Let A_i (N × N) = diag(X_i 1) be the diagonal matrix whose n-th diagonal element is the n-th element of X_i 1.
The method is developed based on homogeneity analysis (Gifi, 1990). It is well known that HQM III leads to the same equation as homogeneity analysis for analyzing X_1, ..., X_I. Homogeneity analysis is, however, more appropriate for explaining the following method. In homogeneity analysis, the sum of the squared distances between the optimally transformed score vectors X_i w_i and one common vector m is minimized under a constraint on m:

S(w_i, m) = Σ_{i=1}^{I} ||A_i (X_i w_i − m)||²  →  min,    (1)

where  Σ_{i=1}^{I} ||A_i m||² = Σ_{i=1}^{I} ||A_i 1||²,  Σ_{i=1}^{I} 1'A_i m = 0.
If all the variables have only one uni-dimensional structure, it is possible to transform every score vector X_i w_i close to each other. Since m in (1) is obtained as a mean vector of X_1 w_1, ..., X_I w_I, m can be a uni-dimensional score vector. In other words, a uni-dimensional score vector m has to be constructed from vectors which can be transformed near to each other. In the case of the artificial data set 1, it is impossible to transform all eight variables close to each other. Hence, homogeneity analysis fails to construct uni-dimensional scales from data set 1. However, the first four variables can be transformed close to each other, and the same is true for the last four variables. If we prepare two score vectors, m1 and m2, and the sum of the distances between m1 and the first four variables and between m2 and the last four is minimized, then both m1 and m2 can be uni-dimensional score vectors:
S(w_i, m1, m2) = Σ_{d=1}^{2} Σ_{i=1}^{8} δ_di ||A_i (X_i w_i − m_d)||²,

where δ_di = 1 if (d = 1 and 1 ≤ i ≤ 4) or (d = 2 and 5 ≤ i ≤ 8), and δ_di = 0 otherwise.
In practice, because we do not know in advance which variables have a uni-dimensional structure, weight parameters summing up to 1 are introduced instead of δ_di:

S(a_di, w_di, m_d) = Σ_{d=1}^{D} Σ_{i=1}^{I} a_di^k ||A_i (X_i w_di − m_d)||²,    (2)

where  Σ_{i=1}^{I} a_di^k ||A_i m_d||² = Σ_{i=1}^{I} a_di^k ||A_i 1||²,  Σ_{i=1}^{I} a_di^k m_d' A_i 1 = 0,
Σ_{d=1}^{D} a_di = 1,  a_di ≥ 0.
We call D the dimensionality because it indicates the number of uni-dimensional scales; k is given a priori in the region k > 1. To explain the meaning of (2), let us first
consider the case when m_1, ..., m_D have been obtained in some way. For each i, a_di is constrained by Σ_d a_di = 1. Hence, in order to minimize the quantity S, a large value has to be assigned to a_di if the corresponding e_di² = ||A_i (X_i w_di − m_d)||² is small. On the other hand, a small value must be assigned to a_di if e_di² is large. Thus a_di can be considered as an index of the degree of correlation between X_i w_di and m_d. Variable i is classified into dimension d if a_di is the largest among a_1i, ..., a_Di. The variables which are classified into the same dimension d are close to each other because they are all in the neighborhood of one score vector m_d.
The same principle is used in fuzzy c-means clustering (Bezdek, 1981). X;Wdi is,
however, fixed in fuzzy c-means unlike (2). Also, XiWdi depends on dimension din
(2) .
There are two reasons for not introducing k-means criterion (MacQueen, 1967) but
fuzzy c-means. One is that fuzzy c-means includes k-means by letting k -+ 1 in (2).
The other is that the influence of variables which should be removed is decreased
by letting adi = D- 1 . In practice, it is often the case that some variables have low
correlation with the other variables. k-means criterion classifies such a variable into
some dimension and md of that dimension can not be a uni-dimensional score.
The values of adi, Wdi, md, are obtained by means of alternating least squares method.
Because of space limitations, the details of the algorithm are omitted.
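As an illustration of one ALS substep (not the authors' full algorithm), the update of the weights a_di for fixed scores follows the usual fuzzy c-means form:

    import numpy as np

    def update_weights(e2, k):
        # e2[d, i] = ||A_i (X_i w_di - m_d)||^2 for the current scores.
        # Minimizing sum_d a_di^k * e2[d, i] under sum_d a_di = 1 gives
        # a_di proportional to e2[d, i] ** (-1 / (k - 1)).
        w = e2 ** (-1.0 / (k - 1))
        return w / w.sum(axis=0, keepdims=True)

    e2 = np.array([[0.1, 0.2, 0.9],    # residuals of 3 variables, dimension 1
                   [0.8, 0.7, 0.1]])   # residuals of 3 variables, dimension 2
    print(update_weights(e2, k=2))     # large a_di where e2[d, i] is small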
An index
λ = 1 − (D^{k−1} / Σ_{i=1}^{I} 1'A_i 1) min S,    (3)
λ_i = 1 − (D^{k−1} / 1'A_i 1) min S_i,    (4)
λ_d = 1 − (D^k / (N I)) min S_d,    (5)
3. Example
Three data sets are analyzed; k is set to 2 for all data sets. For data set 1, the order of categories of the last four variables is 3, 1, 4 and 2.
Tab. 7 shows the result of principal components analysis with optimal scaling followed by VARIMAX rotation. The analysis was performed with the PRINQUAL procedure in SAS (a similar procedure is PRINCALS in SPSS). The orders of categories are 1, 2, 3, 4 for variables 1 to 4 and 3, 1, 4, 2 for variables 5 to 8. Hence, the procedure found the proper orders. Variables are partitioned into two groups according to the factor loadings. The first group consists of variables 4 to 8 and the second group consists of variables 1 to 3. This result does not represent the original data structure. This is because the value assigned to each category is optimal for applying the PCA model but not optimal for the partitioning of variables.
Tab. 8 is a summary of a_di and λ_i, λ_d obtained by the proposed method. Variables 1 to 4 are classified into the first dimension and the other variables are classified into the second dimension. Hence, the proposed method succeeded in partitioning the variables in order to construct uni-dimensional scales.
Fig. 3 shows the first three axes of HQM III applied to the data set. There seem to be no distinctive structures in the figure.
Fig. 3: The second axis versus the first (left) and the third axis versus the first (right)
The proposed method was applied to the data set. As for data set 1, the algorithm was started at random 100 times. Tab. 9 summarizes the value of D, the frequency of the global minimum, and λ. When D = 3, there are quite a few local minima (34%), but the algorithm successfully reached the global minimum 66 times. This indicates that it is not too difficult to partition the 15 variables into three groups. Tab. 10 is a summary of a_di and λ_i, λ_d when D = 3.
"Honesty" and so on. This dimension means "external appearance" or "the first
impression". These three dimensions are easy to understand, but it is difficult to
obtain these dimensions from Fig. 3.
4. Conclusion
In this paper a new method to construct multiple uni-dimensional scales by partitioning qualitative variables has been proposed. For the presented data sets, the results of applying the model are successful.
References:
Bezdek, J.C. (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Gifi, A. (1990): Non Linear Multivariate Analysis. John Wiley & Sons, Chichester.
Kendall, M.G. et al. (1983): The Advanced Theory of Statistics, Volume 3, 4th ed. Charles Griffin.
MacQueen, J. (1967): Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.
Meulman, J.J. (1996): Fitting a distance model to homogeneous subsets of variables: points of view analysis of categorical data. Journal of Classification, 13, 249-267.
Sato, T. and Yanai, H. (1985): A method of simultaneous scaling of discrete variables. Behaviormetrika, 18, 39-51.
Determining the Distance Index, II
Matevž Bren¹ and Vladimir Batagelj²
¹ University of Maribor, Faculty of Organizational Sciences, Prešernova 11, 4000 Kranj, Slovenia
² University of Ljubljana, Department of Mathematics, Jadranska 19, 1000 Ljubljana, Slovenia
1. Preliminaries
1.1 Dissimilarities
A mapping d: E × E → IR is a dissimilarity measure on the set of objects E iff it is
P1. symmetric: d(x, y) = d(y, x) for all x, y ∈ E,
P2. straight: d(x, x) ≤ d(x, y) for all x, y ∈ E.
A dissimilarity measure d that is
We call the threshold value ρ = ρ(d) a distance index of the dissimilarity d. For details see Batagelj and Bren (1996).
2. Determining the Distance Index for Dissimilarities
on Binary Vectors
In the first part of the paper we present:
• an analytical solution of this problem for two families of dissimilarities obtained from functions defined in Gower and Legendre (1986),
S_θ = (a + d) / (a + d + θ(b + c))   and   T_θ = a / (a + θ(b + c))
(where θ > 0 to avoid negative values), that contain some well-known similarity measures (see Table 1: Kendall, Sokal-Michener; Rogers and Tanimoto; Jaccard; Dice, Czekanowski; Sokal and Sneath);
The quantities a, b, c and d have the usual meaning: for binary vectors x, y ∈ IB^m we denote by xy = Σ_{i=1}^{m} x_i y_i their scalar product, and by x̄ = [1 − x_i] the complementary vector of x. We define the counters a = xy, b = xȳ, c = x̄y and d = x̄ȳ, where a + b + c + d = m (see the sketch after this list). Using these counters, several resemblance measures on binary vectors are defined (see Table 1). The use of the symbol d for a dissimilarity measure and also for a counter might be confusing, but its meaning is always clear from the context.
• and a computational approach to determining the distance index for the other dissimilarity measures from Table 1.
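A minimal sketch of these counters and of a dissimilarity built from them (the Jaccard case; the function names are ours):

    import numpy as np

    def counters(x, y):
        # a = xy, b = x(1-y), c = (1-x)y, d = (1-x)(1-y); a + b + c + d = m.
        x, y = np.asarray(x), np.asarray(y)
        a = int(x @ y)
        b = int(x @ (1 - y))
        c = int((1 - x) @ y)
        d = int((1 - x) @ (1 - y))
        return a, b, c, d

    x = np.array([1, 1, 0, 0, 1])
    y = np.array([1, 0, 0, 1, 1])
    a, b, c, d = counters(x, y)
    print((a, b, c, d))          # (2, 1, 1, 1), summing to m = 5
    print(1 - a / (a + b + c))   # Jaccard dissimilarity 1 - a/(a+b+c)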
In this part we study, using a computational approach, dissimilarity measures ob-
tained by the following nonlinear transformations
D1 = d / (1 + d),   D2 = d / (1 − d),   D3(t) = d / (1 + t(1 − d)),
D4 = −ln(1 − d),   D5 = (2/π) arctan d,
D6 = 1 − |1 − 2d|,   D7 = 4d(1 − d)
on the dissimilarity measures from Table 1.
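These transformations are straightforward to tabulate; a sketch (note that D2 and D4 are undefined at d = 1, a fact used in the discussion below):

    import math

    transforms = {
        "D1": lambda d: d / (1 + d),
        "D2": lambda d: d / (1 - d),               # undefined at d = 1
        "D3": lambda d, t=2: d / (1 + t * (1 - d)),
        "D4": lambda d: -math.log(1 - d),          # undefined at d = 1
        "D5": lambda d: (2 / math.pi) * math.atan(d),
        "D6": lambda d: 1 - abs(1 - 2 * d),
        "D7": lambda d: 4 * d * (1 - d),
    }

    for name, f in transforms.items():
        print(name, f(0.75))   # e.g. D6(3/4) = 1/2 and D7(3/4) = 3/4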
Since the triangle inequality implies evenness, we consider only even dissimilarity measures.
For a dissimilarity measure that does not vanish on the diagonal, the triangle inequality implies
d(x, x) ≤ 2 d(x, y) for all x, y ∈ E.
But, since P2 holds, this is true for any dissimilarity measure.
For the indeterminate cases we use the definitions proposed in Batagelj and Bren (1995). To make the reading easier, the second column of Table 1 also includes the labels used there.
From the local minima (i*, j*, k*) obtained by this procedure we can usually guess a general pattern of 'extremal' triples, from which we compute an upper bound for ρ_m:
p̄_m = p(i*, j*, k*).
We conjecture that for the obtained triples (i*, j*, k*) the equality ρ_m = p̄_m often holds.
3. Results
In Table 2 the values p̄_20 for twelve even dissimilarity measures and their transformations, computed with this local optimization procedure, are given. The notation used:
In the following paragraphs we give an explanation of the results obtained for the most interesting cases.
Russel and Rao:  d_1 = 1 − S_1 = 1 − a/m ∈ [0, 1].
It does not vanish on the diagonal: for 0 = [0, ..., 0] ∈ IB^m we have d_1(0, 0) = 1.
The triangle inequality holds with equality on vectors of the form
i* = [0, ..., 0, 1, ..., 1],
j* = [1, ..., 1, 0, ..., 0],
k* = [1, ..., 1, 1, ..., 1] =: 1.
Table 2 (fragment). For each measure, the first value is p̄_20 for d_i, followed by the values for the transforms D1-D7 (N-d: transform not defined; N-e: transform not even); the p̄ entries beneath give the limits for m → ∞:

Driver S16   0.5535  0.7706  N-d  0.3286  N-d  0.6509  N-e  N-e
             (p̄ ≈ 0.5)
S17          0.5946  0.9319  N-d  0.3412  N-d  0.7257  N-e  N-e
             (p̄ ≈ 0.5645, 0.8755, 0.3286, 0.6938)
φ S18        0.5696  0.7383  N-d  0.4255  N-d  0.6067  0.6236  0.8965
             (p̄ ≈ 0.5645)
S19          0.3785  0.5386  N-d  0.2460  N-d  0.4337  N-e  N-e
             (p̄ ≈ 0, 0)
inequality holds, and it is an equality only for the "triples" 1, 1, i for all i ∈ IB^m. Indeed, for the previous triple i*, j*, 1 we have p̄ = 0.5.
The transform D4 is undefined because D4(d_1(i*, j*)) = −ln(1 − 1).
The transform D5 is a concave function on the interval [0, ∞]. Hence the strict triangle inequality holds.
The transformations D6 and D7 are not even: for the triple i = [0, 1, 1, 0], j = [0, 0, 0, 1] and k = [0, 1, 1, 1] we have

pair   a  b  c  d   d_1(·,·)   D6    D7
 ij    0  2  1  1     1        0     0
 ik    2  0  1  1     1/2      1     1
 jk    1  0  2  1     3/4      1/2   3/4
m = 3: equal to 1/2 — triangle equality holds;
m = 4: equal to ≈ 0.42 — the triangle inequality doesn't hold;
m → ∞: equal to ≈ 0.29 — the triangle inequality doesn't hold.
So p̄_20 = −ln 2 / ln(1 − 3/√19) ≈ 0.5946, and when m tends to infinity, p̄ = −ln 2 / ln(1 − 1/√2) ≈ 0.5645.
The transform D1 is a concave function on the interval [0, 1]; hence the triangle inequality holds for larger dimensions. For the previous triple we get D1(d_17(i*, j*)) = 1/2, and D1(d_17(i*, k*)) = D1(d_17(j*, k*)) is, for dimension
m = 2: also equal to 1/2 — triangle equality holds;
m = 3: equal to 1/3 — the triangle inequality holds;
m = 4: equal to ≈ 0.30 — the triangle inequality holds;
m = 10: equal to 1/4 — triangle equality holds;
m = 20: equal to (√19 − 3)/(2√19 − 3) ≈ 0.24 — the triangle inequality doesn't hold;
m → ∞: equal to (√2 − 1)/(2√2 − 1) ≈ 0.23 — the triangle inequality doesn't hold.
For m > 10 we obtain for the upper bound of p_m:

p̄_20 = ln 2 / ln((2√19 − 3)/(2√19 − 6)) ≈ 0.9319, and for m → ∞,
p̄ = ln 2 / ln((2√2 − 1)/(2√2 − 2)) ≈ 0.8755.
The transform D2 is undefined because D2(d17(i*, j*)) = 1/(1 − 1).
For t = 2 the transform is D3(2) = d/(3 − 2d). For the previous triple i*, j*, k* we get
D3(2)(d17(i*, j*)) = 1, and D3(2)(d17(i*, k*)) = D3(2)(d17(j*, k*)) is for dimension
m = 2: also equal to 1 - the triangle inequality holds;
m = 3: equal to 1/4 - the triangle inequality doesn't hold;
m = 20: equal to (√19 − 3)/(√19 + 6) ≈ 0.13 - the triangle inequality doesn't hold;
m → ∞: equal to (√2 − 1)/(√2 + 2) ≈ 0.12 - the triangle inequality doesn't hold.
Now we can calculate the upper bound of p_m:

p̄_20 = −ln 2 / ln((√19 − 3)/(√19 + 6)) ≈ 0.3412, and for m → ∞,
p̄ = −ln 2 / ln((√2 − 1)/(√2 + 2)) ≈ 0.3286.
If we calculate p̄_20 for different values of the parameter t we get for t = 1: p̄_20 = 0.4103,
for t = 2: p̄_20 = 0.3412, and for t = 3: p̄_20 = 0.3033.
The transform D4 is undefined because D4(d17(i*, j*)) = −ln(1 − 1).
The transform D5 is a concave function on the interval [0, ∞). For the previous triple
we get D5(d17(i*, j*)) = 1/2, and D5(d17(i*, k*)) = D5(d17(j*, k*)) is for dimension
m = 2: also equal to 1/2 - the triangle inequality holds;
m = 3: equal to (2/π) arctan(1/2) ≈ 0.29 - the triangle inequality holds;
m = 4: equal to (2/π) arctan(1 − 1/√3) ≈ 0.25 - the triangle inequality holds;
m = 5: equal to (2/π) arctan(1 − √6/4) ≈ 0.23 - the triangle inequality doesn't hold;
m = 20: equal to (2/π) arctan(1 − 3/√19) ≈ 0.19 - the triangle inequality doesn't hold;
m → ∞: equal to (2/π) arctan(1 − 1/√2) ≈ 0.18 - the triangle inequality doesn't hold.
Hence for m > 4 we can calculate the upper bound of p_m; for m = 20 this gives
p̄_20 ≈ 0.7257 (cf. Table 2).
[Fragment of a second pair table: ij: a = 1, b = 1, c = 2, d = 0, d(·,·) = 1, D6 = 0,
D7 = 0; ik: a = 1, b = 1, c = 1, d = 1, d(·,·) = 3/4, D6 = 1/2, D7 = 3/4.]
In the last two columns of Table 2 we can see that most of the considered dissimilarity
measures are not even. This occurs because the transformations D6 and D7 map
1 → 0. Hence the transformed dissimilarity measure is even if and only if the original
one has the property: for all pairs i, j ∈ 𝔹^m such that d(i, j) = 1.
Because the dissimilarity d11 has this property on 𝔹^m − {0, 1} we put the mark * in
its row.
4. Conclusion
In this paper we presented an approach to determining the distance index for a given
dissimilarity on binary vectors and applied it to some well-known dissimilarity mea-
sures and their transforms. The results obtained offer new information that can be
used when selecting a dissimilarity measure for applications.
We also expect that the proposed approach can be successfully applied to dissimilar-
ities between other types of units.
5. References
Sergio Bolasco
Faculty of Economy
University of Rome "La Sapienza"
Via del Castro Laurenziano, 9 - 00161 Roma - Italy
1. Introduction
In this paper we are concerned with the different phases of text pre-treatment necessitated
by a content analysis, based on multidimensional statistical techniques. These phases have
been modified in recent years by the growth in sizes of textual data corpora and their related
vocabularies and by the increased availability of lexical resources.
As a consequence, some new problems arise. The first is how to select the
fundamental core of the corpus vocabulary when it is composed of several thousands of
elements; in other words, how to identify the subset of characteristic words within a text,
regardless of their frequency, in order to optimize computing time and minimize
interpretation problems. The second problem is how to reduce the ambiguity of language
produced by the automatic treatment of a text. The main aspects of this are the choice of the
unit of analysis and of lemmatization.
We also propose the validation of the lemmatization choices in terms of the stability of the
word points on factorial planes in order to control the effects of this preliminary
intervention.
To solve these problems, it is possible to use both external and internal information
concerning the corpus, i.e. both meta-data and data. Some examples of our proposals are
applied to a very large corpus of parliamentary discourses on government programmes
(called Tpg from now on). The size of the Tpg corpus (Tpg Program Discourses and Tpg
Replies) is over 700.000 occurrences, and the Tpg vocabulary contains over 28.000
unlemmatized words, equivalent to 2500 pages of text.
The context and the situation are characterized with the aid of a specialized frequency
dictionary (political, scientific, or economic, etc.). In this event, the lexical inclusion
percentage of the corpus vocabulary in the reference language model is a basic measure.
With regard to the Tpg, the chosen frequency dictionary is the lexicon of Press and Press
Agencies Information (called Veli). This vocabulary is derived from a collection of over 10
million occurrences. On the assumption that the Veli vocabulary is the most pertinent neutral
model available of formal language in a social and political context, we can ask ourselves to
what extent the Tpg corpus resembles it, or differs from it.
In this sense the situation can be identified by studying the original terms not included in this
external knowledge base. In our case, the language of the situation is composed of the Tpg
terms which do not belong to the Veli. This sub-set is interesting in itself.
On the contrary, the context can be identified through the words in common between the above
two lexicons. Among these words, in general, the highly specific sectorial terms show the
largest diversities of use with respect to the chosen frequency dictionary.
In this way we are interested in identifying a sub-set of characteristic words. The peculiarity
or intrinsic specificity of this sub-set will be measured by calculating the diversities of use for
each pair of words. As Lyne says (1985: 165): "The specific words are terms whose
frequency differs characteristically from what is normal. The difference can be calculated
from the theoretical frequency of a word in a given text, on the assumption that the latter is
proportional to the length of the text." One possible measure of specificity could be the
classical measure z - a normalized difference of the frequencies -

z = (f − f*) / √f*

where f is the relative number of occurrences in the corpus and f* the corresponding one
in the frequency dictionary. Proposed by P. Guiraud in 1954, z is usually called écart réduit,
and it is equivalent to the square root of the chi-square.
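As a minimal sketch (function name and figures are ours, not from the paper), the computation of z for one headword, assuming the theoretical frequency f* is the reference rate rescaled to the corpus length:

```python
def ecart_reduit(f_obs, rate_ref, corpus_len):
    """Guiraud's z: normalized difference between the observed occurrences
    and the theoretical frequency f* expected from the reference dictionary,
    taken proportional to the length of the text."""
    f_star = rate_ref * corpus_len
    return (f_obs - f_star) / f_star ** 0.5

# Hypothetical figures: a verb seen 120 times in a 700000-token corpus,
# against a reference rate of 50 occurrences per million tokens.
z = ecart_reduit(120, 50 / 1_000_000, 700_000)
print(round(z, 2))   # 14.37: well above the selection threshold |z| >= 3
```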
It is possible to compare the coefficients of usage between the two vocabularies, where the
coefficient of usage is - for each headword - the frequency weighted by a measure of dispersion.
The above specificity measure can be either positive or negative. Using the Veli list as a
yardstick, we can investigate the Tpg vocabulary. In fact, as Lyne suggests (ibidem:
7): "The ranking favours those items which are most characteristic of our corpus, what we
shall call Positive Items. Conversely, towards the bottom of this list are found those items,
Negative Items, which, although still occurring (in some instances frequently) in our
corpus, are nevertheless least characteristic of it, since they occur relatively less frequently
than in the reference dictionary".
Once the relative differences between the Tpg and the Veli vocabulary are measured in terms
of z, it is possible to select and to visualize two comparative rankings of words in the above
vocabularies. The threshold of selection can be the classical level of the absolute value of z
(greater than or equal to 3). The set of these selected words can be visualized by using the
method of "parallel coordinates" (Wegman, 1990). As is known, Wegman's proposal consists
of using the parallel coordinate representation as a high-dimensional data analysis tool.
Wegman shows that this geometry has some interesting properties; in particular a statistical
interpretation of the correlation can be given. For highly negatively correlated pairs, the dual
line segments in parallel coordinates tend to cross near a single point between the two
parallel axes. So the level of correlation can be visualized by means of the set of these
segments (see Wegman's fig. 3, ibidem: 666).
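A hedged matplotlib sketch of such a two-axis parallel-coordinate display follows; the words and rank values below are invented for illustration only:

```python
import matplotlib.pyplot as plt

# Invented ranks of a few verbs in the Tpg and Veli vocabularies.
ranks = {"provvedere": (3, 410), "intendere": (8, 350),
         "dire": (520, 12), "fare": (480, 25)}

fig, ax = plt.subplots()
for word, (tpg, veli) in ranks.items():
    ax.plot([0, 1], [tpg, veli], marker="o")   # one dual segment per word
    ax.annotate(word, (0, tpg), ha="right")
ax.set_xticks([0, 1])
ax.set_xticklabels(["TPG rank", "VELI rank"])
ax.invert_yaxis()                               # rank 1 at the top
# Segments crossing near a single point between the axes signal strong
# negative correlation between the two rankings (Wegman, 1990).
plt.show()
```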
[Fig. 1a: Comparison between TPG and VELI rankings of the 100 verbs with the highest
peculiarity in the TPG (either positive or negative intrinsic specificity). Axes: VELI rank
vs. TPG rank.]
[Fig. 1c: Lif - Veli - Tpg rank comparison: the 15 most commonly used verbs in Italian and
a selection of some highly peculiar Tpg verbs with positive or negative specificity.]
Generally, only two dimensions are considered (fig. 1a, 1b), but it is possible to compare
several (more than two) ranking lists from the related frequency dictionaries (fig. 1c).
Figures 1a-1c illustrate the above selected verbs according to whether they occur more or less
markedly in our Tpg corpus than in the Veli corpus. In fig. 1a we show the 50 verbs with the
highest positive specificity, among these: <intendere>= to intend, <assicurare>= to assure,
<impegnarsi>= to involve, <provvedere>= to take measures, <favorire>= to favour,
<garantire>= to guarantee; and also the other 50 verbs with the highest negative specificity in
our Tpg. Among them, there are several most commonly used verbs like: <dire>= to say,
<stare>= to stay, <fare>= to do, <vedere>= to see, <parlare>= to talk, <venire>= to come,
but also <decidere>= to decide, <spiegare>= to explain, <andare>= to go. As can be seen,
the criterion of negative specificity can clearly characterize certain words as "infrequent"
words. In fact they are very relevant in their "rarity" (under-used or not so frequent) with
respect to the chosen frequency dictionary, being consciously or unconsciously avoided by
the writer or speaker. This selection of terms could also be the subject of a study in itself.
In fig. 1b we show the group of words that are not specific, also called "banal", which could
be discarded because they are not so relevant as expressions of the context.
A further selection of items could be derived from the comparison of 3 ranking lists (Tpg -
Veli - Lif). Figure 1c shows the 15 most common verbs and some specific Tpg verbs,
as Positive or Negative Items. From this illustration we can conclude that the most typical
governmental verbs, among the Positive Items, are "to take measures" and "to intend".
Conversely, the most relevant among the negative ones, in comparison with Veli and Lif, are
"to explain" and "to decide". Finally it is possible to observe the similar use, in
the three dictionaries, of the verbs "to assure", "to involve", "to insure" as a set of high
political peculiarity, due to their progressive ranking in the passage from the general language
(Lif) to the sectorial one (Veli) up to the more specific one of government programmes (Tpg).
inventories: one lexicon of over 110.000 simple entries - derived from a collection of 4 main
dictionaries of the Italian language - called DELAS; one lexicon of over 900.000 inflected
simple forms, called DELAF; one lexicon of over 600.000 inflected polyforms, derived from
250.000 lexias, called DELAC. A dictionary of over 800.000 bilingual terms, called DEBIS,
is also available. Elia's study shows - for example - that among 13.790 simple forms there are
1.406 polyrhematic constructions (a polyrhematic unit is a sequence of terms whose whole
meaning is different from that of its elementary components), composed of 3.500 simple
forms, equivalent to 25% of the vocabulary. As we can see, the density of polyrhematic forms
is very high.
Therefore it could be very important to construct some frequency dictionaries of polyforms,
in order to compare the corpus vocabulary of repeated segments (Salem, 1987) or, even
better, of quasi-segments (Becue, 1995), and to select those sequences that are more
significant. Up to now such frequency dictionaries have not been available: an initial attempt to
construct one is illustrated here in tab. 2, concerning the adverbial groups and other typical
expressions. Preliminary matching with the corpus under study allows us to isolate the
relevant parts of lexical items (either single or compound forms) and constitutes a valid
system of text pre-categorization.
An additional possibility for this disambiguation emerges from the data. In every corpus it is
possible to observe some equivalence of frequency - I call it iso-frequency - among the
inflected forms of the same adjectives or nouns. See in tab. 3 some examples of adjectives
like economic, important and legislative.
(s) singular, (p) plural, (ms) masculine singular, (fs) feminine singular,
(mp) masculine plural, (fp) feminine plural.
This iso-frequency can be the first clue to their equivalent use and meaning. On the contrary,
in some cases, the lack of iso-frequency among the inflected forms of the same headword
(Bolasco, 1993) suggests the need for disambiguation. In fact, this happens in the presence of
some compound forms, especially where the incidence of the occurrences of the simple
component forms is relevant, as can be seen in words like <forza> (force) and <livello>
(level). For example, when we take away from the frequency of the word "level" (187) the
compound forms like "at (local) level" (48) and "at level of" (19), we return to the presence of
iso-frequency (120) with the plural (110). As we will see later, the differences among the
inflected forms can be the clue to their different meanings. This should be verified by means
of a bootstrapping approach.
planes with confidence areas (Balbi, 1995). This assessment procedure is based on a
bootstrapping strategy that generates a set of "word by subtext" frequency matrices. We
adopt Balbi's hypothesis, which consists of generating a large number B of contingency
tables by resampling, with replacement, from the original contingency table.
This set of bootstrapped matrices generates a three-way data structure, which could be
analysed, for example, by means of a multiway technique, for constructing a reference matrix.
A technique such as STATIS can be used; see Lavit (1988). In our example, in order to
optimize computing time, the reference matrix is the average of these B matrices, due to the
large dimensions of the original matrix (786 x 46) and the number of bootstrapped
matrices (B = 200).
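A minimal sketch of this bootstrap, assuming multinomial resampling of the occurrences over the cells (the matrix size below is a toy stand-in for the 786 x 46 table of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_reference(table, B=200):
    """Draw B bootstrapped contingency tables by resampling the n
    occurrences with replacement (a multinomial over the cells), and
    return their average as the reference matrix."""
    table = np.asarray(table, dtype=float)
    n = int(table.sum())
    p = (table / n).ravel()                 # cell probabilities
    replicates = rng.multinomial(n, p, size=B)
    return replicates.mean(axis=0).reshape(table.shape)

X = rng.integers(1, 20, size=(10, 4))       # toy word-by-subtext table
reference = bootstrap_reference(X, B=200)
```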
The stability of word points is graphically established by projecting them, as supplementary
points, into the first factorial plane computed from a correspondence analysis of this
reference matrix. Balbi proposes to use non-symmetrical correspondence analysis
(ANSC). We have attempted this road, but the results have not been encouraging at the level of
interpretation. We believe that, in general, it is more appropriate to use simple
correspondence analysis, and only for special reasons the ANSC. The resulting clouds of
points (for each word) constitute the empirical confidence areas, delimited by a convex
hull. Fig. 2 shows the convex hull regarding the word <way> and its locution
<in/the/way>.
Fig. 2: Convex hulls of the locution IN THE WAY and of the word WAY
Fig. 3: Convex hulls of three different meanings of the word DEVELOPMENT (semantic disambiguation)
Let me give some examples concerning these situations. In fig. 4a "stato_verb" (equivalent
to "been" in English) is clearly distant from "stato_noun" ("state"); but, conversely, the
different unlemmatized forms <stato/a/e/i> of "stato_verb" (see fig. 4b) have their convex
hulls completely overlapping, and it is not important to distinguish them. Furthermore, if we
look at the two meanings of "stato_noun", they are further distinguished (fig. 4a):
<stato_s1> as state or nation and <stato_s2> as status or condition/situation (marital
status, state of mind) have their relative convex hulls separated.
Fig. 4a: Convex hulls of the Italian word "STATO" after disambiguation
Fig. 4b: Convex hulls of the different inflected forms of the past participle of the Italian verb "to be" (STATA = feminine singular, STATO = masculine singular, STATE = feminine plural, STATI = masculine plural)
Fig. 4c: Convex hulls of the synonyms of "STATE" as "condition" or "situation"
In particular, the latter does not overlap much with the other synonyms such as
<condition_s2> and <situation_s1>, as can be seen in fig. 4c. This shows how the use of
these terms has changed over time in political discourse. Paying particular attention to fig.
4a, these words are always distant from <stato> as the Italian State.
Now let me look at the significance and interpretation of convex hull sizes and positions, as
shown in the following scheme:
1) a small convex hull, and therefore closeness of points, means high stability of
representation; but: a) when the points are located around the origin of the axes, it means
evenness of these items in the various parts of the corpus, or b) when the points are in one
particular quadrant of the plane, distant from the origin, it means the item is very
characteristic and specific to some sub-set of the corpus. In this case, most of the time we
obtain convex hulls not so small as above, because the factor scale of this region depends on
the point distance from the origin (see the example of Politics in fig. 5);
[Fig. 5: convex hulls of POLITICA_S and POLITICHE_S]
2) a large convex hull, that is a wide dispersion of points, means a weaker stability
of representation and several different uses of this word in the corpus; but: a) if we do not
have overlapping convex hulls, this means that the relative items have different meanings and
that their fusion is not pertinent or, in other words, that their disambiguation is justified (see
in fig. 4a the case of nation and status), or b) if, conversely, we have overlapping convex
hulls, this means irrelevant disambiguation or justified fusion (factual synonyms).
In conclusion, having discussed how to identify the most significant part of the corpus and
how to construct a more restricted and highly peculiar vocabulary composed of items with a
high level of semantic quality, we can now finally proceed to an accurate and proper
multi-dimensional content analysis, based on the above vocabulary, in which all the relevant
units of analysis, which I have called "textual forms", are considered (Bolasco, 1993).
To this effect, such a vocabulary (see an example in tab. 4) will be composed of the items
which are: 1) not banal with respect to some model of language (high intrinsic specificity or
original terms); 2) significant as a minimal unit of meaning (lexia): either headwords (verbs
and adjectives), or unlemmatized significant inflected forms (such as nouns in the plural with
a different meaning from the singular, i.e. forza/forze), or the more frequent typical locutions
and other idiomatic expressions (phrasal verbs and nominal groups).
References:
Balbi, S. (1995): Non symmetrical correspondence analysis of textual data and confidence regions for graphical forms. In: JADT 1995 Analisi statistica dei dati testuali, Bolasco, S. et al. (eds.), II, 5-12, CISU, Roma.
Becue, M. et Haeusler, L. (1995): Vers une post-codification automatique. In: JADT 1995 Analisi statistica dei dati testuali, Bolasco, S. et al. (eds.), I, 35-42, CISU, Roma.
Bolasco, S. (1993): Choix de lemmatisation en vue de reconstructions syntagmatiques du texte par l'analyse des correspondances. Proc. JADT 1993, 399-410, ENST-Telecom, Paris.
Bolasco, S. (1994): L'individuazione di forme testuali per lo studio statistico dei testi con tecniche di analisi multidimensionale. Atti della XXXVII Riunione Scientifica della S.I.S., II, 95-103, CISU, Roma.
Bortolini, N., Tagliavini, C., Zampolli, A. (1971): Lessico di frequenza della lingua italiana contemporanea. Garzanti, Milano.
Dubois, J. et al. (1979): Dizionario di Linguistica. Zanichelli, Bologna.
Elia, A. (1995): Per una disambiguazione semi-automatica di sintagmi composti: i dizionari elettronici lessico-grammaticali. In: Ricerca Qualitativa e Computer, Cipriani, R. e Bolasco, S. (eds.), 112-141, Franco Angeli, Milano.
Cipriani, R. e Bolasco, S., eds. (1995): Ricerca Qualitativa e Computer. Franco Angeli, Milano.
Lavit, Ch. (1988): Analyse conjointe de tableaux quantitatifs. Masson, Paris.
Lebart, L. et Salem, A. (1994): Statistique textuelle. Dunod, Paris.
Lyne, A. A. (1985): The vocabulary of French business correspondence. Slatkine-Champion, Paris.
Salem, A. (1987): Pratique des segments répétés. Essai de statistique textuelle. Klincksieck, Paris.
Wegman, E. J. (1990): Hyperdimensional Data Analysis Using Parallel Coordinates. JASA, 85, 411, 664-675.
Clustering of Texts using Semantic Graphs.
Application to Open-ended Questions in Surveys
Monica Becue Bertaut 1 and Ludovic Lebart 2
studies. It could be simplistic to consider that two different words (or expressions) have
the same meaning for different categories of respondents. But it is clear that some units
having a common extra-textual reference are used to designate the same "object".
Whereas the syntactic meta-information can provide the user with new variables, the
semantic information defined over the pairs of statistical units (words, lemmas) is
described by a graph that can lead to a specific metric structure (see fig. 1).
a) The semantic graph can be constructed from an external source of information (a
dictionary of synonyms or a thesaurus, for instance). In such a case, a preliminary
lemmatization of the text must be performed.
b) It can be built up according to the associations observed in a separate (external) corpus.
c) Finally, the semantic graph can also be extracted from the corpus itself. In this latter
case, the similarity between two words (or other units) is derived from the proximity
between their distributions (lexical profiles) within the corpus.
The vertices of this weighted undirected graph are the distinct units (words) j, (j = 1, ..., p).
The edge (j, j') exists iff there is some non-zero similarity s(j, j') between j and j'. The
weighted adjacency matrix M = (m_jj'), of order (p, p), associated with this graph contains in
row j and column j' the weight s(j, j') of the edge (j, j'), or the value 0 if there is no
edge between j and j'.
The repeated presence of a pair of words within the same sentence of a text is a relevant
feature of a corpus. The words can help to disambiguate each other (see e.g. Lewis and
Croft, 1992). Taking co-occurrence relationships into account allows one to use words in
their most frequent contexts. In particular, we can also describe the most relevant co-
occurrences by using a weighted undirected complete graph linking lexical units. Each pair
of units is joined by an edge weighted by a co-occurrence intensity index.
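A minimal sketch of such a co-occurrence graph, using a plain sentence-level count as the intensity index (other indices could be substituted; the names are ours):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences):
    """Weighted undirected graph over the distinct words: the weight of
    edge (j, j') is the number of sentences in which both words occur."""
    weights = Counter()
    for sentence in sentences:
        for w1, w2 in combinations(sorted(set(sentence)), 2):
            weights[(w1, w2)] += 1
    return weights      # absent pairs implicitly have weight 0

sents = [["family", "peace", "health"], ["family", "money"],
         ["peace", "money", "family"]]
graph = cooccurrence_graph(sents)
print(graph[("family", "peace")])   # 2
```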
At this stage, we find ourselves within the scope of a series of descriptive approaches
working simultaneously with units and pairs of units (see for example Art et al., 1982),
including contiguity analysis or local analysis (see Lebart, 1969; Aluja and Lebart, 1984;
Escofier, 1989; Cazes and Moreau, 1991).
[Fig. 1: sketch of the data structure: an n-row table crossing respondents/texts with
statistical units (words, lemmas), grammatical categories, and other categories]
These visualization techniques are designed to modify the classical methods based on
Singular Value Decomposition by taking into account a graph structure over the entries
(rows and/or columns) of the data table. Visualizing the proximities using contiguity analysis
is equivalent to performing a projection pursuit algorithm as described in Burtschy and
Lebart (1991).
The classification can then be performed either:
1) by using as input data the principal coordinates obtained from these contiguity analyses,
or:
2) by computing a new similarity index between texts. This new index is built from
generalized lexical profiles (i.e. original profiles complemented with weighted units that
are neighbors (contiguous) in the semantic graph).
Since

Y = X(I + α(C − I)),

the matrix S to be diagonalized when performing a Principal Component Analysis of Y
reads:

S = (1/n) Yᵀ (I − (1/n)U) Y = (1 − α)² C + 2α(1 − α) C² + α² C³    (3)

Therefore, the eigenvectors of S are the same as the eigenvectors of C. However, to an
eigenvalue λ of C corresponds the eigenvalue μ of S such that:

μ = (1 − α)² λ + 2α(1 − α) λ² + α² λ³
The effect of the new metric is simply to re-weight the principal coordinates when
recomputing the distances to perform the classification.
If α = 1, for instance, we get λ³ instead of λ; thus the relative importance of the first
eigenvalues is strongly increased.
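The eigenvalue relation is easy to verify numerically; a sketch with a random positive semi-definite matrix standing in for C:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
C = A @ A.T / 6                       # symmetric PSD stand-in for C
alpha = 0.5

S = ((1 - alpha) ** 2 * C
     + 2 * alpha * (1 - alpha) * C @ C
     + alpha ** 2 * np.linalg.matrix_power(C, 3))

lam = np.linalg.eigvalsh(C)           # eigenvalues of C (ascending)
mu = np.linalg.eigvalsh(S)            # eigenvalues of S (ascending)
pred = ((1 - alpha) ** 2 * lam
        + 2 * alpha * (1 - alpha) * lam ** 2
        + alpha ** 2 * lam ** 3)
print(np.allclose(mu, pred))          # True: same eigenvectors, mapped values
```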
Such properties contribute to shed light on the prominent role played by the first principal
axes, particularly in the techniques of Latent Semantic Analysis used in Automatic
Information Retrieval: see Furnas et al. (1988).
4. Example
This methodology has been applied to a corpus of 1563 answers to an open-ended
question included in a sociological survey. We confine ourselves here to discussing the
choice of the semantic graph that has been used to improve the classification of the
respondents.
The following open-ended question was asked in a multinational survey conducted in
seven countries (Japan, USA, United Kingdom, Germany, France, Italy and the Netherlands)
in the late nineteen-eighties (Hayashi et al., 1992): "What is the single most important thing
in life for you?" It was followed by the probe: "What other things are very important to
you?". Our illustrative example is limited to the American sample (sample size: 1563).
Some aspects of this multinational survey concerning general social attitudes are described
in Sasaki and Suzuki (1989).
Examples of answers to the first question were:
1 - Family, being together as a family
2 - Mother, money, peace of mind, peace in the world
Some words are connected to a single lemma (or dictionary word) (be: is, are, being).
Also to be noted is the strong presence of function words (a, and, for, of, the). Note that
the concept of a function word (sometimes referred to as empty word, or tool word or
grammatical word in information retrieval) is widely used by text researchers. These words
are obviously excluded from the semantic networks.
This corpus has a length of 13 999 occurrences and contains 1378 distinct words. Only the
126 words used at least 16 times are kept, the total length being then reduced to 10 752
occurrences.
Figure 2 shows three branches of the dendrogram of words obtained through a direct
classification of the columns of X performed on the first 15 principal axes of a
Correspondence Analysis of the sparse contingency table X. We observe groupings of
words according to their meanings (home and members of family, standard of living),
together with function words (on) or pronouns (my), and repeated and isolated segments
(don't know).
We note that some topics are characteristic of the produced clusters. These topics are less
salient when a second dendrogram is computed from the first 45 principal axes.
If we cut both dendrograms at a level producing 30 classes, we obtain a ratio of variance
(between-class variance divided by total variance) of 0.87 (case of 15 axes) and 0.64
(case of 45 axes). This is a further empirical proof of the ability of the first axes to gather
structural features, an ability already mentioned in section 3.2.
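A hedged sketch of this comparison, with Ward clustering on a random stand-in for the principal coordinates (the scipy functions are real; the data are not):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def variance_ratio(coords, labels):
    """Between-class variance divided by total variance for a partition."""
    grand = coords.mean(axis=0)
    total = ((coords - grand) ** 2).sum()
    between = 0.0
    for k in np.unique(labels):
        cls = coords[labels == k]
        between += len(cls) * ((cls.mean(axis=0) - grand) ** 2).sum()
    return between / total

rng = np.random.default_rng(2)
coords = rng.standard_normal((126, 15))          # stand-in for 15 CA axes
labels = fcluster(linkage(coords, method="ward"),
                  t=30, criterion="maxclust")    # cut into 30 classes
print(variance_ratio(coords, labels))
```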
Fig. 2: Parts of the dendrogram obtained through a direct clustering of the columns of X
Fig. 3 presents some similar branches of the dendrogram obtained through clustering of
the columns of the aggregated table A = ZᵀX, where Z corresponds to a partition into 6
classes obtained by crossing the two nominal variables sex and age (3 categories). The
themes observed previously are now disseminated into various groups strongly influenced
by the criteria of grouping.
Fig. 3: Parts of the dendrogram obtained through clustering of the columns of the aggregated table A = ZᵀX (Z corresponds to a partition into 6 classes according to sex and age)
The category "female, over 55 years", being particular and homogeneous, has strongly
influenced the proximities between words. We no longer find wife (used by men), kids
(used by younger people, mostly by men), home (used by younger people), etc.
Obviously, a composite partition (crossing or stacking various criteria) should be chosen
instead of the nominal variable "sex-age" used here to obtain higher frequencies of words. The
quality of the partition of responses depends on the homogeneity of the classes and their
interpretability. In such a context, only a group of experts could assess the obtained
results. The use of a co-occurrence graph (with α = 1) issued from a direct classification
of the columns of X enables a better classification of the responses that have a poor lexical
profile, and leads to more meaningful groupings.
It is clear that we are dealing with experimental statistics: the value of the parameter α and
the number of neighbours to be taken into account could probably vary according to the field
of application. Reliable tools for assessing and comparing partitions are all the more
needed.
5. References
Aluja Banet, T., Lebart, L. (1984): Local and Partial Principal Component Analysis and Correspondence Analysis. COMPSTAT Proceedings, 113-118, Physica Verlag, Vienna.
Art, D., Gnanadesikan, R., and Kettenring, J. R. (1982): Data Based Metrics for Cluster Analysis. Utilitas Mathematica, 21A, 75-99.
Becue, M. (1991): Analisis de Datos Textuales. Metodos Estadisticos y Algoritmos. CISIA, Paris.
Burtschy, B., Lebart, L. (1991): Contiguity analysis and projection pursuit. In: Applied Stochastic Models and Data Analysis, Gutierrez, R. and Valderrama, M.J. (eds), World Scientific, Singapore, 117-128.
Cazes, P., Moreau, J. (1991): Analysis of a contingency table in which the rows and columns have a graph structure. In: Symbolic and Numeric Data Analysis and Learning, Diday, E. and Lechevallier, Y. (eds), 271-280, Nova Science Publishers, New York.
Celeux, G., Hebrail, G., Mkhadri, A., Suchard, M. (1991): Reduction of a large scale and ill-conditioned problem on textual data. In: Applied Stochastic Models and Data Analysis, Gutierrez, R. and Valderrama, M.J. (eds.), World Scientific, Singapore, 129-137.
Church, K. W., Hanks, P. (1990): Word association norms, mutual information and lexicography. Computational Linguistics, 16, 22-29.
Escofier, B. (1989): Multiple correspondence analysis and neighboring relation. In: Data Analysis, Learning Symbolic and Numeric Knowledge, Diday, E. (ed.), 55-62, Nova Science Publishers, New York.
Furnas, G. W. et al. (1988): Information retrieval using a singular value decomposition model of latent semantic structure. Proceedings of the 14th ACM Conference on R. and D. in Information Retrieval, 465-480.
Gordon, A.D. (1996): Hierarchical Classification. In: Clustering and Classification, P. Arabie, L. J. Hubert, G. De Soete (eds.), World Scientific, River Edge, NJ.
Summary: The first objective of this contribution is to give a description of our tex-
tual information retrieval system based on distributional semantics. The central idea of
the approach is to represent the retrievable units and the user queries in a unified way
as projections in a vector space of pertinent terms. The projections are derived from a
co-occurrence matrix computed on large reference (textual) corpora collecting the distribu-
tional semantic information. A similarity computation based on the cosine measure is then
used to characterize the semantic proximity between queries and documents.
Retrieval effectiveness can be further improved by the use of relevance feedback techniques.
A simple feedback method where document relevance is interactively integrated into the
original query will also be presented and evaluated.
Although our first experiments led to quite promising results, one major drawback of our
IR system in its original form is that the satisfaction of a query requires the evaluation of
the similarities between that query and all the documents in the textual base. Therefore,
the second objective of this contribution is to investigate how clustering techniques can
be applied to the textual database in order to retrieve the documents satisfying a query
through a partial exploration of the base. A tentative solution based on hierarchical clus-
tering will be suggested.
1. Introduction
Information Retrieval (IR) research is concerned with the analysis, the representation, and
the searching of heterogeneous textual databases with wide varieties of vocabularies and
unrestricted subject matters. Examples of such databases, the elements of which will be
called hereafter documents, are databases containing newspaper articles, newswires, tech-
nical or scientific articles, magazines, encyclopedia entries and so on. Due to the enormous
amount of information currently available on-line in the different computer networks (In-
ternet, ...) and in the library environments, simple keyword search and browsing are not
sufficient anymore. IR users need more sophisticated tools to help them reach the rele-
vant information.
Our retrieval model exploits co-occurrence properties of words to determine whether queries
and texts are semantically related (distributional semantics). More precisely, documents
and queries are represented in a unified way as projections of co-occurrence profile vectors
in a multidimensional vector space of selected informative terms, in which the proximity
is interpreted as semantic similarity. These co-occurrence profile vectors are derived from
co-occurrence matrices, computed on large reference textual corpora. The cosine similarity
measure is used to characterize the proximity and thus the relevance between user queries
and documents in the textual database.
2. Distributional Semantics
Using distributional information for automatic extraction of general morphologic, syntac-
tic or semantic properties of a given language has already been considered by several re-
searchers (Schütze (1992), Gallant et al. (1992), Rungsawang & Rajman (1995)). Such
properties correspond to observable regularities (frequency, distribution, co-frequency, ...)
in large textual corpora. In our approach, we use a co-occurrence matrix (c_ij) to fetch
the semantic information by automatic co-occurrence computation on a large reference cor-
pus of texts. The lines of such a matrix correspond to all distinct terms w_i found in the
reference corpus and the columns correspond to selected informative terms t_j, called perti-
nent terms, used to represent the meaning of all other terms in the textual database. Each
element c_ij records how often a term w_i co-occurs with a term t_j within some pre-defined
textual units (e.g. sentences or paragraphs) in the reference corpus.
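A minimal sketch of this construction, taking the sentence as the textual unit (function and variable names are ours, not the authors' code):

```python
import numpy as np

def cooccurrence_profiles(sentences, terms, pertinent):
    """Build the matrix (c_ij): c_ij counts how often term w_i co-occurs
    with pertinent term t_j within a sentence of the reference corpus.
    Row i is then the co-occurrence profile vector of w_i."""
    w_idx = {w: i for i, w in enumerate(terms)}
    t_idx = {t: j for j, t in enumerate(pertinent)}
    C = np.zeros((len(terms), len(pertinent)))
    for sentence in sentences:
        present = set(sentence)
        for w in present & w_idx.keys():
            for t in present & t_idx.keys():
                if w != t:
                    C[w_idx[w], t_idx[t]] += 1
    return C

sents = [["wing", "lift", "flow"], ["lift", "drag", "flow"]]
C = cooccurrence_profiles(sents, ["wing", "drag"], ["lift", "flow"])
```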
To build a co-occurrence matrix, several elements must be determined. First, the nature of
the primary linguistic units which will be used as terms has to be defined. Tokens, as pro-
duced by a simple stemmer (e.g. the Stopper and Porter stemmers; Frakes and Baeza-Yates,
1992), or words reduced to their radical forms (i.e. conjugated verbs to infinitives, nouns
to singular forms) are frequently used. Words with their part-of-speech tags, for example,
produced by a natural language tagger may also be considered.
Then, we need to determine the sets of terms and pertinent terms that define the rows
and the columns of the co-occurrence matrix. To cover the maximum semantic informa-
tion, all distinct terms (except perhaps the functional words) appearing in the reference
corpus should be used. However, feasibility constraints have to be taken into consideration.
Salton et al. (1975, 1976) indicate that terms which have the document frequency (i.e. the
proportion of documents in which they appear) ranging between 1/100 and 1/10, possess
good content discrimination in the document space and yield good retrieval effectiveness.
Therefore, we decided to reduce the w-dimension of the matrix to terms appearing in at
least 2 documents, and to use Salton's criterion for the pertinent terms in the t-dimension.
The third element to define is the textual unit in which the co-occurrences will be computed.
Usually, sentences, paragraphs, fixed-size word windows or fixed-size character windows are
chosen.
Once the co-occurrence matrix is built, a distributional semantic hypothesis is assumed pos-
tulating a correlation between terms that co-occur in similar distributional environments.
The semantics of a term Wi is then represented by its co-occurrence profile vector, the row
corresponding to term Wi in the co-occurrence matrix. The geometric proximity between
the co-occurrence profile vectors is interpreted as an indication of the semantic similarity
between the corresponding terms, provided that the reference corpus is large enough to
cover sufficient semantic information. The geometric proximity between these vectors is
measured by the cosine value of the angle between them.
vector space, the document-query relevance can be defined on the basis of the proximity
between the average IS vector representing the document and the IS vector representing
the query. We currently measure the proximity by the cosine of the angle between the IS
vectors.
A benefit that we expect from our approach is that any two documents may have a high sim-
ilarity score in the multi-dimensional space of pertinent terms even though they only have a
few terms in common. For example, one document might contain the words "corpus-based
linguistic analysis", whereas the other might contain the words "computational linguistics". If
the emphasized terms globally occur in the same distributional environment in the corpus
that was taken as reference, the resulting IS vectors should correspond to similar directions
in the multi-dimensional space of pertinent terms.
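A sketch of this relevance computation; averaging the term profiles is one simple way to realize the projection, and the names below are our own:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def is_vector(text_terms, term_index, C):
    """IS vector of a document or query: here, the average of the
    co-occurrence profile vectors (rows of C) of its known terms."""
    rows = [C[term_index[w]] for w in text_terms if w in term_index]
    return np.mean(rows, axis=0)

# Toy usage: relevance score of a document for a query.
idx = {"wing": 0, "drag": 1}
C = np.array([[2.0, 1.0], [0.5, 2.0]])
print(cosine(is_vector(["wing"], idx, C),
             is_vector(["drag", "wing"], idx, C)))
```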
4. Preliminary Experiments
In the first phase of this ongoing research, we have implemented (in the C and Perl languages
on a SPARC workstation) a prototype corresponding to the system described in the previous
section. With this prototype, we have conducted several experiments using the Cranfield
standard test collection. The Cranfield collection is a test collection of 1400 documents and
225 queries in the field of aerodynamics which also contains, for each query, the list of the
relevant documents. The collection is available in the SMART version 11.0 distribution 1,
and has been used for several years to test many retrieval algorithms.
The 11-point Recall-Precision curves (Salton and McGill (1983)) comparing our system (de-
noted DSIR 2) with the ones obtained with the standard version of the SMART system with
term-frequency weights (denoted SMART nnn weight) and augmented inverse-document-
frequency weights (denoted SMART atc weight) are given in figure 1.
e_new = e_old + β Σ_{i=1}^{k} δ_i IS_i − γ IS'_i    (1)

where e_new is the new query vector and e_old the original query vector; the IS_i are the
IS vectors of the k first previously retrieved documents; δ_i equals 1 when IS_i is the IS
vector of a relevant document (and 0 otherwise); and the IS'_i are the IS vectors of the first
non-relevant documents. β and γ are the importance weights given to the relevant and
non-relevant components, respectively. To eliminate the problem of the residual (or ranking)
effect (Hull (1993)), we created the feedback query vectors from one collection of documents
and applied them to another one (see the experimental setup below). The settings β = 0.75
and γ = 0.25 yield the best results in our experiments.
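A minimal sketch of this feedback step as we read equation (1), with names of our own choosing; only the first non-relevant document is subtracted, as in the FBQ program described below:

```python
import numpy as np

def feedback_query(q_old, retrieved_IS, relevant, beta=0.75, gamma=0.25):
    """Build the new query vector from the old one: add the IS vectors of
    the relevant documents among the k first retrieved (weight beta) and
    subtract the IS vector of the first non-relevant one (weight gamma)."""
    q_new = np.asarray(q_old, dtype=float).copy()
    subtracted = False
    for vec, rel in zip(retrieved_IS, relevant):
        if rel:
            q_new += beta * np.asarray(vec, dtype=float)
        elif not subtracted:
            q_new -= gamma * np.asarray(vec, dtype=float)
            subtracted = True
    return q_new

q = feedback_query([1.0, 0.0], [[0.2, 0.9], [0.8, 0.1]], [True, False])
```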
[Diagram of the experimental setup: feedback queries FBQ05, FBQ10, ... and retrieval
runs DSBASE, DSFB05, DSFB10]
the relevant documents corresponding to the original 106 queries, and present the result
in the form of an ordered list, where the first document (rank 1) is the most similar to the
query. Then (4), the FBQ program creates a new query based on the old one, the relevant
documents (as evaluated by the user) and the first non-relevant document, from rank 1 to
rank 5 (FBQ05), rank 1 to rank 10 (FBQ10), etc.
Continuing at the lower portion of the diagram, the DSBASE program will create (5) the
B2 document IS vectors, using previously calculated co-occurrence data and the B2 sub-
collection. Then (6), the B2 document IS vectors and the original 106 queries are used by
the DSSEARCH program to produce the DSBASE (result corresponding to the original
queries) that will be used as a reference to evaluate the retrieval with feedback. Finally (7),
the DSSEARCH program retrieves from the B2 document collection the relevant document
sets (DSFB05, DSFB10, etc.) in response to the feedback queries FBQ05, FBQ10, etc.
[Recall-precision curves comparing SMART and DSBASE: precision (0.0-1.0, vertical
axis) against recall (0.0-1.0, horizontal axis)]
reason for this behavior could be the unsuitable use of co-occurrence data derived from
the B1 sub-collection. However, the result remains very encouraging, especially when robust-
ness considerations are taken into account, because our system can satisfyingly deal with
many new documents (≈ 20000 from the B2 sub-collection) without any parameter change,
re-indexing or new co-occurrence computation. In addition to this, experiments conducted
with the B2 sub-collection as reference corpus still confirm better results than SMART.
As far as feedback is concerned, the curves denoted DSFB05, DSFB10 and DSFB15 in
figure 4 clearly indicate the improvement of our system's overall level of recall, and allow
an interesting quantification of this improvement when various maximal (document) ranks
are used for the feedback. The results in table 1 show the improvement in % of average pre-
cision (compared with SMART). These results confirm the usefulness of relevance
feedback as pointed out in several previous references (Salton & Buckley (1990), Harman
(1992), Allen (1995) and Buckley et al. (1995)).
6. Research Directions
The results of the experiments that we have conducted with our IR system on different
standard test collections have convinced us of the feasibility of our approach.
However, as far as algorithmic efficiency is concerned, one major drawback of our system in
its original form is that the satisfaction of a query requires the evaluation of the similarities
between the query and all the documents in the textual base. Therefore, we are now inves-
tigating how clustering techniques can be used in association with the similarity measure
in order to cluster the database in a way that allows the identification of the documents
satisfying the query through a partial exploration of the base. We are currently working on
a first tentative solution based on a very simple hierarchical partitioning process.
The process starts (step 0) with a unique initial class containing all the N documents of the
textual database. At step n, each class c defined at step n - 1 is divided into 2 subclasses
corresponding respectively to elements with negative and positive coordinates in the first
factorial dimension obtained by a factorial analysis performed on c. The partitioning pro-
cess is then iterated until all classes contain only one element.
By construction, the resulting set of classes corresponds to a binary tree T. In T, each non-
terminal node (i.e. class) is associated with a decision function based on the equation of
the corresponding one-dimensional factorial vector space. Furthermore, consider, for any
query q, the path P(q) starting at the root of T, in which, at each node, the branching
decision is taken according to the result, for q, of the decision function associated with the
node. Then P(q) satisfies the following interesting properties:
• P(q) leads to a leaf of T that corresponds to a document representing a good approx-
imation of the nearest document to q;
• the number of operations necessary to build P(q) (i.e. to retrieve the document
associated with the leaf) is at most log2(N); a sketch of the process is given below.
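A hedged Python sketch of this partitioning and search, with an SVD standing in for the factorial analysis of the paper (all names are ours):

```python
import numpy as np

def build_tree(indices, vectors):
    """Binary tree T: each non-terminal node splits its class into the
    documents with negative vs. non-negative first principal coordinate."""
    if len(indices) == 1:
        return int(indices[0])                        # leaf: one document
    X = vectors[indices] - vectors[indices].mean(axis=0)
    axis = np.linalg.svd(X, full_matrices=False)[2][0]
    coord = X @ axis
    neg, pos = indices[coord < 0], indices[coord >= 0]
    if len(neg) == 0 or len(pos) == 0:                # degenerate class
        neg, pos = indices[:1], indices[1:]
    return {"axis": axis, "mean": vectors[indices].mean(axis=0),
            "neg": build_tree(neg, vectors), "pos": build_tree(pos, vectors)}

def path_to_leaf(tree, q):
    """The path P(q): branch at each node on the sign of its decision
    function, descending to a leaf."""
    while isinstance(tree, dict):
        side = "neg" if (q - tree["mean"]) @ tree["axis"] < 0 else "pos"
        tree = tree[side]
    return tree

docs = np.random.default_rng(3).standard_normal((16, 5))
tree = build_tree(np.arange(16), docs)
print(path_to_leaf(tree, docs[7]))    # a stored document retrieves itself
```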
We are currently integrating the partitioning process described above in our IR system in
order to quantify, for the available test databases, the influence of such a method on the
overall retrieval efficiency of the system. In addition to this, we are currently implementing
a parallel version of our retrieval engine within the PVM programming environment (Geist
et al. (1994)). This work will give us the means to conduct a realistic evaluation of the
computational speed-up provided by the clustering-based model.
References:
Allen, J. (1995): Relevance Feedback with Too Much Data. In Proceedings of the 18th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Seattle, USA.
Buckley, C. et al. (1995): Automatic Query Expansion Using SMART: TREC 3. In the third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225.
Frakes, W.B. and Baeza-Yates, R. (1992): Information Retrieval: Data Structures & Algorithms. Prentice Hall.
Gallant, S.I. et al. (1992): HNC's MatchPlus System. SIGIR FORUM, 16(2).
Geist, A. et al. (1994): PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, MA.
Harman, D. (1992): Relevance Feedback Revisited. In Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark.
Hersh, W. and Buckley, C. (1994): OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland.
Hull, D. (1993): Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, USA.
Rungsawang, A. and Rajman, M. (1995): Textual Information Retrieval Based on the Concept of the Distributional Semantics. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data, Rome, Italy, December.
Schütze, H. (1992): Dimensions of Meaning. In IEEE Proceedings of Supercomputing '92.
Salton, G. and McGill, M.J. (1983): Introduction to Modern Information Retrieval. McGraw-Hill.
Salton, G. et al. (1975): A Theory of Term Importance in Automatic Text Analysis. Journal of the American Society for Information Science.
Salton, G. et al. (1976): Automatic Indexing Using Term Discrimination and Term Precision Measurement. Information Processing & Management, 12.
Salton, G. and Buckley, C. (1990): Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4).
Fitting the CANDCLUS/MUMCLUS Models
with Partitioning and Other Constraints
J. Douglas Carroll 1 and Anil Chaturvedi 2
1 Faculty of Management, Rutgers University, Management Education Center, Room 125, 81 New Street, Newark, NJ 07102-1895, USA
2 AT&T Bell Laboratories, Room 5C-133, 600 Mountain Avenue, Murray Hill, NJ 07974, USA
coding whether a particular object (or other entity corresponding to a given level of
a given mode/way) belongs (value = 1) or does not belong (value = 0) to a partic-
ular class or cluster. In CANDCLUS, the dimensions for the various ways/modes may
be continuous (spatial) or binary (clusterlike). Other possibilities include discretely
valued dimensions with k (> 2) distinct possible values for a particular dimension.
Specifically, the CANDCLUS model for a general N-way array Y can be stated in
the following form:

y_{i1 i2 ... iN} ≈ Σ_{r=1}^{R} a^1_{i1 r} a^2_{i2 r} ... a^N_{iN r}.    (1)

Define A_n as the parameter matrix of order I_n × R with elements a^n_{in r} for the rth
dimension and the nth way. The elements of the matrices A_n (n = 1, ..., N) can take on
any of the following values:
• Real values for all matrices An, (n = 1, ... , N). This results in the R-dimensional
CANDECOMP model of Carroll and Chang (1970; see also Carroll and Pruzan-
sky, 1984).
• Discrete integer (or a finite set of real number) values for all A_n, n = 1, ..., N.
• A mixture of real parameters for some ways and discrete values for the rest. Other
possibilities, such as those involving "hybrid" models, also exist: in a hybrid
model for a particular way/mode, some dimensions are defined via continuous,
and others via discrete, parameters. See Carroll (1976), Carroll and Pruzansky
(1980), De Soete and Carroll (1996), and Carroll and Arabie (in press) for
discussion of hybrid models.
Y_k ≈ A U_k B'    (2)

where A is an I × R_a binary matrix, B is a J × R_b binary matrix, while U_k is a
completely general R_a × R_b matrix.
Given an I₁ × I₂ × ... × I_N array with general entry y_{i1 i2 ... iN}, where i_n = 1, 2, ..., I_n
and n = 1, 2, ..., N, we fit it by a model of the general algebraic form

y_{i1 i2 ... iN} ≈ Σ_{t1=1}^{T1} Σ_{t2=1}^{T2} ... Σ_{tN=1}^{TN} a^1_{i1 t1} a^2_{i2 t2} ... a^N_{iN tN} u_{t1 t2 ... tN},    (3)
While both CANDCLUS and MUMCLUS, when one or more of the ways/modes
entails binary (clustering) "dimensions," are in general overlapping clustering mod-
els/methods, it is possible as a special case that the resulting clustering will cor-
respond to one in which the clusters are non-overlapping (mutually exclusive and
collectively exhaustive), that is, comprise a partition of the objects or other entities cor-
responding to the mode/way in question. Assuming for the moment that the mode
in question is treated completely cluster-wise (not in terms of a "hybrid" representa-
tion), the matrix (say A₁) for that mode must then have the property that each row
contains exactly one "1", all other entries in that row equaling 0.
Approaches for fitting either CANDCLUS or MUMCLUS models via either an OLS
or LAD (least absolute deviation) criterion are described in Carroll and Chaturvedi
(1995), for the general, unconstrained case. Slight modification of these methods
would enable fitting via a WLS (weighted least squares) or weighted LAD (WLAD)
criterion, or other even more general "additively decomposable" loss functions.
L = Σ_{in=1}^{In} Σ_{jn=1}^{Jn} ℓ[ y_{in jn}, ŷ_{in jn} ]    (4)

where J_n = I₁ I₂ ... I_{n−1} I_{n+1} ... I_N, while j_n is an index ranging from 1 to J_n,
varying systematically over the combinations of all values of all N − 1 subscripts
excluding i_n, while ℓ[z, ẑ] is a measure of discrepancy between the two scalar-valued
quantities z and ẑ; e.g., (z − ẑ)² or |z − ẑ|, in the case of an OLS or LAD loss
function L, respectively.
Given the additively decomposable structure of the loss function (see also Carroll and
Chaturvedi, 1995 for a general discussion of this in the context of the general unconstrained
CANDCLUS/MUMCLUS models), very efficient general algorithms can be formulated for
fitting these models via an OLS or LAD criterion, via what can be called a "one dimension
at a time" elementwise approach. That it is "one dimension at a time" simply means that, at
each stage of an "outer iteration" process, only one dimension, whether a continuous or a
discrete (usually binary) one (e.g., defining membership vs. non-membership in a cluster),
is estimated, conditional on fixed values of all the other R − 1 dimensions (and/or clusters).
This conditional estimation procedure is iterated over dimensions/clusters until conver-
gence occurs. Within each of these "one dimension at a time" estimation steps an-
other "inner estimation" process is used, in this case iterating over the individual values
of the (continuous or discrete) components of that dimension/cluster. Since this inner
iteration process is, in fact, iterating over certain elements of the set of parameter
matrices or arrays, we call this an elementwise procedure. At each of the most basic
computational steps of the composite iterative process all elements of all parameter
arrays are fixed, save the one which is currently being (re)estimated, conditional on
the fixed values of the remaining parameters.
Concretely, in the stage in which conditional estimates are being made for dimen-
sion/cluster r for the general CANDCLUS model, as in the CANDECOMP algorithm
(Carroll and Chang, 1970; Carroll and Pruzansky, 1984) when using the "one dimen-
sion at a time" approach, the CANDCLUS algorithm fixes the parameters for all ways
except the nth, and conditionally (re)estimates the parameters for that nth way. In
fact, if all parameters are continuous and OLS estimation is done, the CANDCLUS
algorithm is exactly equivalent to the CANDECOMP algorithm, implemented on the
"one dimension at a time" basis. Thus a basic step in this overall algorithm entails
conditional estimation of a vector of coordinates of just one (continuous or discrete)
dimension, for just one way of the multiway data array, with all parameters for all
other dimensions, and all other ways for the dimension currently being (re)estimated,
held fixed at their current values.
Chaturvedi was the first to note that this adaptation of the one-dimension-at-a-time
CANDECOMP algorithm to fitting clustering or other discrete models via an OLS
(and later LAD) criterion could be greatly accelerated computationally, based on a
separability property resulting from the additively decomposable structure of these
loss functions (see Chaturvedi and Carroll, 1994, Chaturvedi, Lakshmi-Ratan, and
Carroll, 1995 and Carroll and Chaturvedi, 1995 for details). This separability property
enables optimization of the objective function over the entire vector of discretely val-
ued parameter values via optimization for each discrete parameter separately. Thus,
for example, in the "standard" case of binary valued parameters, this optimization,
for a vector of I_n components (comprising the I_n values of the rth dimension for
the nth way), can be accomplished via 2I_n evaluations, rather than 2^{I_n} evaluations.
(In the case of a K-ary discrete parameter, KI_n rather than K^{I_n} evaluations are
required.) These discrete conditional estimation steps are implemented via what are
called the "elementary discrete least squares or LAD procedures," respectively. Con-
ditional OLS or LAD fitting of a vector of continuous parameters defining coordinates
of a continuous (spatial) dimension can also be done via a sequence of 2I_n (or KI_n,
in the K-ary case) quite simple OLS or LAD regression steps (called the "elementary
continuous least squares or LAD procedures," respectively). Because of the one-
dimension-at-a-time estimation in this case, it is quite straightforward to impose simple
constraints, such as nonnegativity, on these parameters.
It is straightforward to extend either the OLS or LAD estimation scheme for the un-
constrained CANDCLUS model to weighted least squares (WLS) or weighted LAD
(WLAD) estimation. It is also simple, in principle, to extend this to a criterion of fit
based on the L0 "counting metric," appropriate for categorical data. An application of this
"L0 loss function" to a clustering approach called "K-modes" has been discussed by Carroll,
Chaturvedi, and Green (1994) and Chaturvedi, Green, and Carroll (1996). OLS and
LAD fitting correspond, of course, to the L2 (or Euclidean) and L1 (or "city-block")
norms, respectively.
The elementary discrete estimation procedures for any Lp-norm based loss function
would be quite simple, merely entailing evaluating the loss function for each of the
2 (or K) values of each parameter, with all other parameters fixed, and choosing
the parameter value with the lower (lowest) value of the loss function. There is,
however, no closed-form solution, in general, for an Lp-norm based loss function for
p other than 1 or 2 (and, also, for p = 0, for appropriate models and data types,
where the solution will simply be the mode of certain values of an associated cat-
egorical variable, and for p = ∞, where the solution is the midrange of certain values).
In these cases, some form of line search algorithm or other unidimensional optimiza-
tion procedure would have to be used to solve the elementary continuous estimation
problem. These statements can be extended to any additively decomposable loss
function of the form defined in equation (4); again the elementary discrete proce-
dure would entail simple enumeration of the 2 (or K) parameter values and choice
of the one yielding the lower (lowest) value of the loss function, while the elemen-
tary continuous procedure would require either the use of a unidimensional optimization
method, or a closed-form solution, if available. The separability property discussed
above makes estimation of the CANDCLUS parameters based on any loss function of
the form stated in equation (4) particularly efficient computationally for the discrete
parameters, while reducing it to a unidimensional problem for the continuous ones
(with a straightforward way of imposing such constraints as nonnegativity, if desired).
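As an illustration of the separability idea for a single binary dimension under OLS, here is a sketch under a simplified two-way setup of our own (not the authors' code): with everything else fixed, each of the I_n binary entries is chosen independently by comparing the loss at 0 and at 1.

```python
import numpy as np

def fit_binary_column(Z, b):
    """Conditional OLS estimate of a binary vector a in the rank-one term
    a b^T, Z being the residual of the data after all other dimensions.
    Separability of the OLS loss lets each entry a_i be set independently:
    2*I evaluations in all, instead of 2**I over the whole vector."""
    loss0 = (Z ** 2).sum(axis=1)            # row loss if a_i = 0
    loss1 = ((Z - b) ** 2).sum(axis=1)      # row loss if a_i = 1
    return (loss1 < loss0).astype(int)

rng = np.random.default_rng(4)
a_true = rng.integers(0, 2, size=8)
b = rng.standard_normal(5)
Y = np.outer(a_true, b) + 0.1 * rng.standard_normal((8, 5))
print(fit_binary_column(Y, b), a_true)      # the estimate recovers a_true
```

A LAD variant would replace the squared row losses with sums of absolute deviations, with the same two-way comparison per entry.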
Carroll and Chaturvedi (1995) also discuss an extension of this approach to estima-
tion of the unconstrained MUMCLUS model, at least in the case of an OLS criterion
of fit. LAD, or most other loss functions, would not be nearly as tractable for fitting
MUMCLUS, however, since the estimation of the core array would be particularly
difficult in this case (requiring, generally, use of some form of multidimensional op-
timization procedure, one not being simplifiable at all via a separability property
such as that utilized in CANDCLUS). A WLS extension of MUMCLUS estimation
would be quite straightforward, however, since estimation of the core array could be
implemented in this case by use of a WLS multivariate regression procedure.
estimated simultaneously, rather than via the "one dimension at a time" elementwise strategy discussed in the case of CANDCLUS/MUMCLUS) enables fitting a solution constrained to have this partitioning form (for one or more of the ways/modes). In this approach, one simply seeks, for each row, the column (cluster) to which the single "1" should be assigned to optimize the (OLS, LAD or other) loss function being minimized. The separability property, again, can be used; in this case it means that this decision can be made separately for each row of the matrix, given that the other matrices are treated as fixed.
We describe the resulting approach for CANDCLUS, with partitioning constraints, below. To be specific, given current estimates of all other matrices, $A_2, A_3, \ldots, A_N$ (whether these are spatial/continuous, cluster-like/discrete, or hybrid), we define what is sometimes called the "columnwise Kronecker product" of $A_2, A_3, \ldots, A_N$, which we shall denote $A_2 \otimes_c A_3 \otimes_c \cdots \otimes_c A_N$, and denote when appropriate as $Q_1$, indexing $Q$ by the index of the matrix, $A_1$ in this case, which is omitted (see Carroll and Chang, 1970, and ten Berge and Kiers, 1996).
For two matrices, $A_n$ and $A_{n'}$, of orders $I_n \times R$ and $I_{n'} \times R$ respectively (note, in particular, that both have a common column order, $R$), the columnwise Kronecker product is defined as:
$$A_n \otimes_c A_{n'} = \left[\, a_1^{(n)} \otimes a_1^{(n')} \;\; a_2^{(n)} \otimes a_2^{(n')} \;\; \cdots \;\; a_R^{(n)} \otimes a_R^{(n')} \,\right], \qquad (5)$$
where $a_r^{(n)}$ is the $r$th column of matrix $A_n$, and $\otimes$ is the ordinary Kronecker product (applied in the present case separately to corresponding columns of $A_n$ and $A_{n'}$). Since $A_n$ is $I_n \times R$, and $A_{n'}$ is $I_{n'} \times R$, $A_n \otimes_c A_{n'}$ will be $I_n I_{n'} \times R$. We define such products as $A_2 \otimes_c A_3 \otimes_c \cdots \otimes_c A_N$ recursively (as in the case of ordinary Kronecker products); thus, for example:
$$A_2 \otimes_c A_3 \otimes_c A_4 = (A_2 \otimes_c A_3) \otimes_c A_4. \qquad (6)$$
We also define a matrix $Z_1$ ($I_1 \times I_2 I_3 \cdots I_N$) as the matrix defined by concatenating, over the $N-1$ subscripts $i_2, i_3, \ldots, i_N$, the corresponding entries of the data array to form a row vector of $I_2 I_3 \cdots I_N$ components for each $i_1 = 1, 2, \ldots, I_1$, and then defining the $i_1$th row of $Z_1$ as that row vector. (The order of the subscripts $i_2, i_3, \ldots, i_N$ is assumed to be identical to the order induced on these same subscripts in the columns of the columnwise Kronecker product $Q_1$.)
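For concreteness, the columnwise Kronecker product in (5) can be computed as below (a small NumPy sketch; the function name is ours):

```python
import numpy as np

def columnwise_kronecker(A, B):
    """Columnwise Kronecker (Khatri-Rao) product of A (I_n x R) and
    B (I_n' x R): column r of the result is the ordinary Kronecker product
    of column r of A with column r of B, giving an (I_n * I_n') x R matrix."""
    assert A.shape[1] == B.shape[1], "factors need a common column order R"
    return np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

A2 = np.array([[1, 0], [0, 1], [1, 1]])   # 3 x 2
A3 = np.array([[1, 2], [3, 4]])           # 2 x 2
Q1 = columnwise_kronecker(A2, A3)
print(Q1.shape)                           # (6, 2): I_2 * I_3 rows, R columns
```

Products of more than two matrices follow by applying the function recursively, as in (6); SciPy's `scipy.linalg.khatri_rao` computes the same two-factor product.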
Having so defined $Z_1$ and $Q_1$ or, more generally, $Z_n$ and $Q_n$, for all $n = 1, 2, \ldots, N$ (extending these definitions in the obvious way to the case of any one of the $N$ matrices in the overall decomposition of $Y$, say $A_n$, denoting the corresponding matrices as $Z_n$ and $Q_n$, respectively, in this case with $Q_n$ being defined in terms of current estimates of all $A$ matrices except $A_n$), we then can solve for a new estimate of $A_n$, $\hat{A}_n$, conditional on the current estimates of $A_{n'}$ for all $n' \neq n$, by finding the constrained OLS, WLS, LAD, or other estimate of $A_n$, by solving for an estimate, $\hat{A}_n$, in the equation
$$Z_n \approx \hat{A}_n Q_n', \qquad (7)$$
where $\approx$ can be taken as implying optimization of a fit criterion such as OLS, WLS, LAD, weighted LAD, or any other specified additively decomposable criterion of fit of the form given in equation (4).
Given this general form of the loss function, whether OLS, WLS, LAD, WLAD or other, we can very simply minimize $L$ by the use of a similar separability property as noted in the case of CANDCLUS/MUMCLUS. In this case, this separability of the overall loss function $L$ is defined rowwise, for the entire matrix $A_n$, since for a loss function of the additive form given in equation (4), given that $\hat{Z}_n = \hat{A}_n Q_n'$, we have $\hat{z}^{(n)}_{i_n j_n} = a^{(n)}_{i_n} \bigl(q^{(n)}_{j_n}\bigr)'$, so that
$$L = \sum_{i_n} \sum_{j_n} G\!\left[\, z^{(n)}_{i_n j_n} - a^{(n)}_{i_n} \bigl(q^{(n)}_{j_n}\bigr)' \,\right], \qquad (8)$$
so that $L$ is separable in each of the row vectors of the matrix $A_n$, generalizing quite directly the elementwise separability property defined earlier (and used for unconstrained CANDCLUS/MUMCLUS) to a rowwise separability property.
In the case of a partition, the problem is particularly simple, of course, since each of these row vectors is constrained to have one and only one 1, with all other elements = 0. Thus, the optimal vector can be selected, at each stage of the overall algorithm, by simply computing the objective function being optimized (minimized) for each of the $R$ unit vectors, and choosing the one optimizing (minimizing) the objective function; for each row this entails only $R$ computations, rather than the $2^R$ computations that would be required for the unconstrained case. The rowwise decomposition of the loss can be written
$$L = \sum_{i_n=1}^{I_n} f_{i_n}\!\left[a^{(n)}_{i_n}\right], \qquad (9)$$
where $f_{i_n}\!\left[a^{(n)}_{i_n}\right]$ is a function of the vector $a^{(n)}_{i_n}$ alone, since all other variables are treated as constant, while $f_{i_n}$ is minimized by a very simple exhaustive search. Given the constraint that $A_n$ have only one "1" per row for each of its $I_n$ rows, the vector $a^{(n)}_{i_n}$ must be one of the $R$ unit vectors $e_1, e_2, \ldots, e_R$. Thus $f_{i_n}$ need be evaluated only for those $R$ permissible unit vectors, entailing exactly $R$ evaluations.
Thus $A_n$ as a whole can, in view of the separability of the assumed loss function, be optimized via a total of $I_n R$ such evaluations, a quite manageable search process, since it is linear in the size ($I_n R$) of the matrix (as opposed to being proportional to $2^{I_n R}$, as would be true in the case of a totally general binary matrix and a non-separable loss function).
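A minimal sketch of this rowwise search, for the conditional estimation in (7) with a partition constraint (names and the loss argument are illustrative):

```python
import numpy as np

def fit_partition(Z, Q, loss=lambda r: float(np.sum(r ** 2))):
    """Conditionally estimate a partition-constrained A_n (I_n x R) in
    Z_n ~ A_n Q_n' (equation (7)).  By the rowwise separability in (9),
    each row of A_n is chosen independently among the R unit vectors
    e_1, ..., e_R: picking e_r makes row i of A_n Q_n' equal column r of Q."""
    I, R = Z.shape[0], Q.shape[1]
    A = np.zeros((I, R))
    for i in range(I):
        fits = [loss(Z[i] - Q[:, r]) for r in range(R)]  # R evaluations per row
        A[i, int(np.argmin(fits))] = 1.0
    return A
```

Replacing the default OLS row loss by `lambda r: float(np.sum(np.abs(r)))` gives the LAD (K-medians-like) variant.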
The conditional OLS, LAD (or other) estimation of the remaining matrices can be implemented either rowwise or in a "one dimension at a time" elementwise manner, as with unconstrained CANDCLUS/CANDECOMP, depending on the nature of the parameters, constraints, or other factors. For purposes of the approach described here, conditional estimates of each of the other $N-1$ matrices must be given, and treated as fixed for reestimation of $A_n$. This enables using the elementwise separability property described earlier for CANDCLUS to estimate overlapping cluster structures, or other discrete structures, for some modes and also, if desired, to implement continuous optimization of parameter matrices for other modes with (say) nonnegativity constraints. In the case of a matrix $A_k$ with continuous parameters, and in which OLS or WLS estimation is being done, $A_k$ may be estimated matrixwise via a (possibly weighted) multivariate regression procedure.
It should be noted that other forms of constrained overlapping cluster structures can also be estimated (conditionally) on a rowwise basis. For example, if one wanted to constrain a cluster structure to one in which each object or other entity corresponding to a level of a specified mode/way is contained in no more than $C$ clusters, the search could be restricted to those binary $R$-vectors having $C$ or fewer 1's. The additive decomposability of the objective function leads to a very simple algorithm for this problem. First, go through all components of a particular row vector, allowing each component to be 0 or 1 on an unconstrained elementwise basis. Then, if the number of 1's is $C$ or less, you're finished (for that row vector). If not, choose the $C$ components associated with the $C$ largest reductions (or "differentials") in the objective function being minimized. (This step requires storing these differentials for each such unit component, followed by a sorting algorithm aimed at choosing the $C$ largest absolute differentials among them.) This algorithm would enable choosing the optimal vector for the $i_n$th row of $A_n$ in considerably fewer than $\binom{R}{C}$ operations, the specific number of operations being dependent on the data and other parameters' current estimates. Other constraints are possible (e.g., that each object be in exactly $C$ clusters, or that the $R$-vectors be restricted to a predefined subset of all possible binary $R$-dimensional vectors). Any constraints that can be defined in terms of the set of permissible binary $R$-vectors for each row of a specific parameter matrix (even if a different set for each row is specified, so long as these constraints are independent of parameter values in the other rows) can be imposed in this manner.
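A sketch of the constrained step just described, for one row (the variable names are ours; `differentials` holds the loss reduction achieved by setting each component to 1):

```python
import numpy as np

def constrain_row_to_C(row_choice, differentials, C):
    """Impose the 'no more than C clusters' constraint on one row.

    row_choice:    binary R-vector from the unconstrained elementwise pass
    differentials: reduction in the loss achieved by each component set to 1
    If more than C components are 1, keep only the C associated with the
    largest reductions (the sorting step described above)."""
    row = row_choice.copy()
    ones = np.flatnonzero(row)
    if len(ones) > C:
        keep = ones[np.argsort(differentials[ones])[-C:]]  # C largest reductions
        row[:] = 0
        row[keep] = 1
    return row

print(constrain_row_to_C(np.array([1, 1, 0, 1]), np.array([0.9, 0.2, 0.0, 0.5]), C=2))
# -> [1 0 0 1]: keeps the two components with the largest loss reductions
```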
While we have only roughly outlined the constrained CANDCLUS approach, at least one special case merits particular attention. This is the special case involving two-mode two-way data, in which one mode is modeled by partitions and the other by continuous parameters. This particular special case can easily be shown to be equivalent to the well-known clustering procedure known as K-means (where $K = R$ in this case) if an OLS criterion is used, and to K-medians if a LAD criterion is used. We might mention, as a related model of interest, the case of unconstrained CANDCLUS fit to such two-mode data with one mode modeled by overlapping clusters and the other by continuous parameters. This model/method we have called "overlapping K-centroids" (Chaturvedi, Carroll, Green and Rotondo, 1994) elsewhere.
While we have dealt here only with the CANDCLUS model with constraints making the clustering (for one or more modes) a partition, this entire approach generalizes in a straightforward manner to a similarly constrained version of MUMCLUS, at least via OLS or WLS optimization. As discussed by Carroll and Chaturvedi (1995), estimation of the continuous parameters in the core array via LAD, WLAD or other additively decomposable fit criteria is not, in general, an easily implementable computational problem leading to a closed form solution. We shall not explore this last class of models and methods more fully, however, in the present paper.
3. CONCLUSION
The general CANDCLUS and MUMCLUS models have been defined and discussed,
including references to previously published work on unconstrained estimation of each
using various additively decomposable loss functions as fit criteria, based on a sepa-
rability property originally observed by Chaturvedi. An extension of the separability
property from individual elements of various parameter matrices to (row) vectors of
those matrices leads to a straightforward approach for estimating these models with
the clusters corresponding to certain ways or modes being constrained to satisfy par-
titioning constraints. Using this property to impose certain other kinds of constraints
on the clustering is also possible. One that is discussed entails constraining the ob-
jects to be contained in no more than a fixed number (C) of clusters.
Some special cases of CANDCLUS are discussed, including OLS and LAD estimation of the ADCLUS/INDCLUS models and a procedure generalizing K-means and K-medians to the case of overlapping clusters, called K-overlapping centroids clustering (which includes methods called overlapping K-means and overlapping K-medians as special cases), as well as some other potential applications (e.g., to fitting "hybrid" models entailing combinations of continuous and discrete dimensions for the same set of entities corresponding to one or more modes of the multiway data array).
We anticipate many future applications of these and other specific special cases of
constrained and unconstrained CANDCLUS/MUMCLUS models, some not yet even
contemplated, to a wide variety of data analytic situations.
New algorithmic developments may further improve fitting procedures for this class of
models, using various loss functions as fitting criteria-potentially reducing problems
of merely local optima and slow convergence, and generally increasing computational
efficiency so as to make dealing with the increasingly large data arrays arising in
many practical data analytic situations much more feasible.
4. References
Carroll, J. D. (1976): Spatial, non-spatial and hybrid models for scaling, Psychometrika,
41, 439-463.
Carroll, J. D. and Arabie, P. (in press): Multidimensional scaling, In: Handbook of Percep-
tion and Cognition. Volume 3: Measurement, Judgment and Decision Making, Birnbaum,
M. H. (ed.), San Diego, CA: Academic Press.
Carroll, J. D. and Chang, J. J. (1970): Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition, Psychometrika, 35, 283-319.
Carroll, J. D. and Chaturvedi, A. (1995): A general approach to clustering and multidimen-
sional scaling of two-way, three-way, or higher way data, In: Geometric Representations of
Perceptual Phenomena, Luce, R. D. et al. (eds.), 295-318, Mahwah, NJ: Erlbaum.
Carroll, J. D. et al. (1994): K-means, K-medians and K-modes: Special cases of partitioning multiway data. (Paper presented at meeting of the Classification Society of North America, Houston, TX.)
Carroll, J. D. and Pruzansky, S. (1980): Discrete and hybrid scaling models, In: Similarity
and Choice, Lantermann et al. (eds.), 108-139, Bern: Hans Huber.
Carroll, J. D. and Pruzansky, S. (1984): The CANDECOMP-CANDELINC family of models and methods for multidimensional data analysis, In: Research Methods for Multimode Data Analysis, Law, H. G. et al. (eds.), 372-402, New York: Praeger.
Chaturvedi, A. and Carroll, J. D. (1994): An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models, Journal of Classification, 11, 155-170.
Chaturvedi, A. et al. (1994): A feature based approach to market segmentation via overlapping K-centroids clustering. Manuscript submitted for publication.
Chaturvedi, A. et al. (1995): Two L1 norm procedures for fitting ADCLUS and INDCLUS. Manuscript submitted for publication.
Chaturvedi, A. et al. (1996): Market segmentation via K-modes clustering. (Paper presented at American Statistical Association Conference, Chicago, IL.)
De Soete, G. and Carroll, J. D. (1996): Tree and other network models for representing proximity data, In: Clustering and Classification, Arabie, P. et al. (eds.), 157-197, River Edge, NJ: World Scientific.
DeSarbo, W. S. (1982): GENNCLUS: New models for general nonhierarchical clustering analysis, Psychometrika, 47, 449-475.
ten Berge, J. M. F. and Kiers, H. A. L. (1996): Some uniqueness results for PARAFAC2,
Psychometrika, 61, 123-132.
A Distance-Based Biplot
for Multidimensional Scaling of Multivariate Data
Jacqueline J. Meulman
Department of Data Theory, University of Leiden
P.O. Box 9555, 2300 RB Leiden, The Netherlands
Summary: Least squares multidimensional scaling (MDS) methods are attractive candi-
dates to approximate proximities between subjects in multivariate data (Meulman, 1992).
Distances in the subject space will resemble the proximities as closely as possible, in con-
trast to traditional multivariate methods. When we wish to represent the variables in the
same display - after using MDS to represent the subjects - various possibilities exist. A
major distinction is between linear and nonlinear biplots. Both types will be discussed
briefly, including their drawbacks. To circumvent these drawbacks, a third alternative will
be proposed. By expanding the optimal p-space (where p denotes the dimensionality of the
subject space) into an m-dimensional space of rank p (with m > p), we obtain a coordinate
system that is appropriate for the evaluation of the MDS solution directly in terms of the
m original variables. The latter are represented graphically as vectors in p-space, and their
entries as markers that are located on these vectors. The overall approach, including the
analysis of mixed sets of continuous and categorical variables, can be viewed as a distance-
based alternative for the graphical display of multivariate data in Gifi (1990).
1. Introduction
In the approach to Multivariate Analysis (MVA) applied in this paper, the vari-
ables are used to define an observation or measurement space in which the units
are located according to their scores. The distances in this observation space are
regarded as proximities to be approximated by distances between subject points in
a low-dimensional representation space. If the m-dimensional observation space is
denoted by $Q$, giving coordinates for $n$ points in $m$ dimensions, the proximities between all pairs of subjects are given in the proximity matrix $D(Q)$, where $D(\cdot)$ is the Euclidean (also called Pythagorean) distance function. So the $n \times n$ matrix $D(Q)$ contains proximities $d_{ik}(Q)$ between subjects $i$ and $k$. Squared distances are given by:
$$D^2(Q) = v\mathbf{1}' + \mathbf{1}v' - 2QQ', \qquad (1)$$
where $v = \mathrm{vecdiag}(QQ')$ is the $n$-vector containing the diagonal elements of $QQ'$, and $\mathbf{1}$ is the $n$-vector of all 1's. Analogously, squared distances in the $p$-dimensional representation space $X$ are defined by $D^2(X) = v\mathbf{1}' + \mathbf{1}v' - 2XX'$, now with $v = \mathrm{vecdiag}(XX')$.
Approximation of a set of proximities by a set of distances in some low-dimensional
space is usually identified as a multidimensional scaling (MDS) task. In Meulman (1986), following Gower (1966), it is shown that techniques of multivariate analysis, like principal components, canonical correlation and homogeneity analysis, are equivalent to MDS tasks applied to particular derived proximities when the so-called classical Torgerson-Gower approach to MDS (Torgerson, 1958; Gower, 1966) is used. A basic ingredient is the original Young-Householder (1938) process that transforms a squared distance matrix $D^2(Q)$ into an $n \times n$ scalar product matrix $QQ'$, modified by locating the origin in the centroid of points through the use of the $n \times n$ centering operator $J = I - \mathbf{1}\mathbf{1}'/\mathbf{1}'\mathbf{1}$, which gives
$$-\tfrac{1}{2}J(D^2(Q))J = -\tfrac{1}{2}J(v\mathbf{1}' + \mathbf{1}v' - 2QQ')J = QQ'. \qquad (2)$$
($I$ is the $n \times n$ identity matrix; we assume that the variables $q_j$ have zero mean.) The scalar product matrix $QQ'$ is then approximated by another scalar product matrix of lower rank.
The approximation uses an objective function that can be written in the form
$$\mathrm{STRAIN} = \| QQ' - XX' \|, \qquad (3)$$
where $\|\cdot\|$ denotes a least squares discrepancy measure. (The term STRAIN is used after Carroll and Chang, 1972.) In our case, because $XX' = -\tfrac{1}{2}J(D^2(X))J$, (3) can also be written in terms of the squared distance matrices $D^2(Q)$ and $D^2(X)$. When the variables are coded by indicator matrices $G_j$, the corresponding objective is
$$\frac{1}{m} \sum_{j=1}^{m} \left\| J\left( D^2\!\left( G_j (G_j'G_j)^{-1/2} \right) - D^2(X) \right) J \right\|, \qquad (6)$$
where proximities are derived simultaneously from all $G_j$ separately.
The columns of the indicator matrix $G_j$ are divided by the square root of the marginals $G_j'G_j$; the latter operation defines the chi-squared metric. Finally, the proximities are approximated by Euclidean distances in $X$. As before, in the classical scaling approach, $X$ would not be normalized to represent the subjects in an orthonormal cloud, but instead the eigenvalues are used to give the representation space a certain shape, displaying the differential saliences.
In Meulman (1986, 1992) an alternative is proposed, which is to analyse multivariate data by minimizing a loss function that is directly defined on the distances (so it
508
does not approximate distances through inner products). The history of least squares MDS methods can be followed from Shepard (1962), Kruskal (1964), Guttman (1968), Takane, Young, and De Leeuw (1977), De Leeuw and Heiser (1980), Ramsay (1982), a.o. Least squares MDS methods are traditionally applied to a given proximity matrix, whose proximities are then approximated through minimization of some least squares loss function that is defined on (transformations of) proximities and distances in a representation space $X$. In the multivariate cases described above, we derive the proximities from the multivariate data. Then, in the distance-based modification of principal components analysis (distance-based PCA, for short), we minimize
$$\mathrm{STRESS}(X) = \| D(Q) - D(X) \|, \qquad (7)$$
which can be done by repeatedly computing the update
$$X^{+} = \frac{1}{n} B(X^{0}) X^{0}, \qquad (8)$$
$$B(X^{0}) = B^{+}(X^{0}) - B^{*}(X^{0}), \qquad (9)$$
where the elements of the matrix $B^{*}(X^{0})$ are given by $b_{ik}(X^{0}) = d_{ik}(Q)/d_{ik}(X^{0})$ if $i \neq k$ and $d_{ik}(X^{0}) \neq 0$; otherwise $b_{ik}(X^{0}) = 0$. The elements of the diagonal matrix $B^{+}(X^{0})$ are given by $b^{+}_{ii}(X^{0}) = \mathbf{1}' B^{*}(X^{0}) e_i$, where $e_i$ is the $i$th column of the identity matrix $I$. Repeatedly computing the update $X^{+}$ gives a convergent series of configurations.
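Under this reading of (8)-(9), one iteration can be sketched as follows (a minimal NumPy sketch; `DQ` holds the proximities $d_{ik}(Q)$):

```python
import numpy as np

def distances(X):
    v = np.sum(X ** 2, axis=1)
    return np.sqrt(np.maximum(v[:, None] + v[None, :] - 2 * X @ X.T, 0.0))

def update(X0, DQ):
    """One configuration update as in (8)-(9): B* holds the ratios
    d_ik(Q)/d_ik(X0) (zero on the diagonal and where d_ik(X0) = 0),
    B+ puts the column sums 1'B*e_i on the diagonal, and
    X = (1/n)(B+ - B*)X0."""
    n = X0.shape[0]
    DX = distances(X0)
    ok = (DX > 0) & ~np.eye(n, dtype=bool)
    Bstar = np.zeros((n, n))
    Bstar[ok] = DQ[ok] / DX[ok]
    Bplus = np.diag(Bstar.sum(axis=0))
    return (Bplus - Bstar) @ X0 / n

# Repeated updates give a convergent series of configurations:
rng = np.random.default_rng(1)
DQ = distances(rng.normal(size=(8, 4)))   # proximities from observation space
X = rng.normal(size=(8, 2))
for _ in range(200):
    X = update(X, DQ)
```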
A special feature of the Gifi-system is the possibility of differential treatment of variables in the analysis. For example, some variables may be treated as containing numerical scores, while others may be treated as nominal variables. The latter treatment is appropriate when a variable partitions the subjects into unordered classes. In distance-based PCA nominal treatment of variables is carried out as follows. First, the nominal variable $h_j$ is replaced by a binary indicator matrix $G_j$ with $n$ rows and $l_j$ columns, as above. Then, proximities are derived simultaneously from $Q$ and the $G_j$:
$$\delta^2(Q; G) = D^2(Q) + \sum_j D^2\!\left( G_j (G_j'G_j)^{-1/2} \right), \qquad (10)$$
where $j = 1, \ldots, m$ is the running index to indicate the variables in the analysis that classify the subjects into groups (there may be more than one classifying variable, and then multiple indicator matrices should be created). As in MCA and homogeneity analysis, the columns of the indicator matrix $G_j$ are divided by the square root of the marginals $G_j'G_j$ to give distances in the chi-squared metric. Finally, the proximities $\delta(Q; G)$ are approximated by Euclidean distances $D(X)$.
Meulman (1992) has shown that if we apply the classical scaling approach (as in Torgerson, 1958; Gower, 1966) to approximate $\delta(Q; G)$, then this results in a solution for $X$ that is equivalent to the subject scores in Gifi's PCA, with numerical and nominal variables. (Again, apart from a scaling factor per dimension, displaying the differential saliences; PCA usually displays the subject points as an orthonormal cloud.)
The variables can be represented by regressing each $q_j$ on the subject space $X$,
$$a_j = (X'X)^{-1} X' q_j. \qquad (12)$$
Using $a_j$ to represent the endpoint of the vector gives a linear biplot representation.
This biplot is obtained, however, through the use of different rationales for fitting the
subjects on the one hand and the variables on the other: the subject points are fitted
through the use of least squares distance fitting in (7), and the vectors representing
the variables through ordinary multiple regression in (12).
(1988). The latter nonlinear biplot was developed to obtain nonlinear representations of variables in a space that is generated through a principal coordinates analysis. The procedure discussed in Meulman and Heiser (1993) can be described as follows. First, regard each variable as a series of $s = 1, \ldots, S$ supplementary points (a trajectory) in the space $X$. In terms of the data, a supplementary point for variable $j$ has coordinates $e_{r_s}$ that are all equal to zero, except for the $j$th variable; so when $e_j$ is the $j$th column of the $m \times m$ identity matrix $I$, $e_{r_s} = r_s e_j$, where $\min(q_j) \le r_s \le \max(q_j)$. Next, for each supplementary point, the distance is calculated to the $n$ original points in observation space. The vector with squared distances between supplementary point $e_{r_s}$ and the subjects in $Q$ is given by
$$d^2(e_{r_s}; Q) = v + \mathbf{1}\, e_{r_s}' e_{r_s} - 2 Q e_{r_s}, \qquad (13)$$
with $v$ the $n$-vector containing the diagonal elements of $QQ'$. The $k$th element of $d^2(e_{r_s}; Q)$ gives the squared distance between the $s$th supplementary point and the $k$th subject point in observation space, and will be written as $d^2(e_{r_s}; q_k)$, where $q_k$ denotes the $k$th row of $Q$. Mapping the trajectory for variable $q_j$ involves the approximation of $d(e_{r_s}; Q)$ by $d(y_s; X)$, where $y_s$ gives $p$-dimensional coordinates in the space of $X$, for different values of $r_s$, $s = 1, \ldots, S$. Here $S$ denotes a prechosen number, appropriate to cover the range of $\min(q_j)$ to $\max(q_j)$. Each supplementary point has to be mapped separately, and the coherent method with respect to least squares multidimensional scaling as in (7) is the use of the corresponding STRESS loss function, now defined on $d(e_{r_s}; Q)$ and $d(y_s; X)$.
Define the singular value decomposition of the $m \times p$ matrix $Q'X$ as $Q'X = K\Lambda L'$, and the eigenvalue decomposition of the $p \times p$ matrix $X'QQ'X$ as $X'QQ'X = L\Lambda^2 L'$. Then the rotation-expansion matrix is found as
$$A = KL'.$$
Now the $m$-dimensional coordinate system $XA'$ can be used to evaluate the MDS solution directly in terms of the original variables, with the Pearson correlation coefficient as a natural measure of association. At the same time, the $j$th row in $A$ (denoted by $a_j$) gives the coordinates to display the variable $q_j$ in the space $X$. The scores $\{q_{ij}\}$ themselves can be represented as well; the projected coordinates in the space $X$ are given by $q_j a_j / a_j a_j'$. The latter set of quantities are called single category coordinates in Gifi's (1990) approach to PCA. Therefore, the approach proposed here can be regarded as a distance-based alternative for Gifi's biplot display of multivariate data. The series of points $q_j a_j / a_j a_j'$ are located on the vector that represents the variable $q_j$, and these are usually called markers (Gabriel, 1971; Gower and Hand, 1996).
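A sketch of the rotation-expansion and marker computation, assuming (as read above) $A = KL'$ from the SVD of $Q'X$:

```python
import numpy as np

def rotation_expansion(Q, X):
    """Expand the p-dimensional solution X toward the m original variables:
    from the SVD Q'X = K Lambda L', take A = KL' (m x p, with A'A = I),
    so the rows of XA' reproduce the inter-subject distances of X."""
    K, lam, Lt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return K @ Lt

Q = np.random.randn(20, 5); Q -= Q.mean(axis=0)   # n x m, zero-mean variables
X = np.random.randn(20, 2)                        # n x p MDS configuration
A = rotation_expansion(Q, X)
a_j = A[0]                                        # vector for variable q_1
markers = np.outer(Q[:, 0], a_j) / (a_j @ a_j)    # points q_1j a_j / a_j a_j'
```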
the centroids of the subject points in a particular class define the associated category points. In addition to Euclidean distance, the allocation used the posterior probabilities, by employing Bayes' rule and the a priori distribution over the categories. The resulting assignment is compared to the original time points and diagnostic group classification.
5. Results
The primary result of the analysis consists of the coordinates for the observational
units in two-dimensional space; the graph displaying the cloud of points is given in
Figure 1, top panel. The subjects have been labeled with their diagnosis. Visual in-
spection immediately suggests that the second dimension is related to the diagnostic
categories of eating disorder. We see that the anorexia subjects (label 1) form a group, and patients with atypical eating disorder (label 4) form a subgroup. The latter pa-
tients can be considered as anorectic patients for whom the loss of weight is unknown
or less than 15%. The second dimension separates the anorectic patients (classes 1
and 4) from the boulimic patients (classes 2: anorexia nervosa with boulimia nervosa,
and 3: boulimia nervosa after anorexia). Having connected the diagnosis category
points over the different points in time (by computing the appropriate centroids of
subject points), it is clear that the first dimension displays the development in time.
The variables are displayed in the bottom panel of Figure 1. Instead of displaying the variables and subjects together, we chose to display the variables with the group points, since this gives a more comprehensive biplot; groups of subjects are
represented by their centroid, which is associated with a particular point in time and
diagnosis.
The first thing to note is that all variables have a positive correlation with the first
dimension; this means there is a general factor that correlates positively with all the
variables. The second dimension separates the variables. We find three bundles of
variables: Bingeing (4), Vomiting (5) and Purging (6) are clearly distinguished in the
vertical direction of the graph; Preoccupation (15), Body Perception (16), Hyper-
activity (7), Sexual Behavior (13), and Fasting (3) correlate most with the general
factor, and Weight (1), Menstruation (2), Family Relations (8), Emancipation (9),
Work/School record (11) and Sexual Attitude (12) form a third bundle of variables.
Friends (10) and Mood (14) do not fit very well in the overall representation. The
fit of the variables, as measured by the Pearson correlation between observed scores
and fitted scores, is given in Table 1. We notice that distance-based PCA gives a
very decent fit for the variables, compared to the maximum that could be obtained
by applying standard PCA.
As described above, we compare the classification of the subjects on the basis of the
results from both analyses with the original ones. With respect to time, the grouping
is given in Table 2. Here, rows indicate the original time points, and columns the fitted
time points. From the percentage correctly assigned subjects in the bottom row, we
conclude that distance-based PCA performs better than standard PCA, although it
is obviously hard to distinguish between consecutive time points, especially between
3 and 4. Next, we inspect assignment to the eating disorders categories. Results are
given in Table 3, with the rows and columns ordered according to the subgroupings.
It is clear that neither approach to PCA can distinguish between groups 1 and 4
on the one hand, and groups 2 and 3 on the other hand. On the whole, distance-
based PCA performs better. This becomes more clear when we combine results from
Table 3 in its bottom row: now standard PCA finds 83% and 91% correctly, while
distance-based PCA obtains 92% and 97%.
Figure 1. Top panel: Subjects represented in two-dimensional space. Labels 1: anorexia, 2: anorexia with boulimia, 3: boulimia after anorexia, 4: atypical eating disorder. Trajectories represent the diagnostic categories in time (from left to right). Lower panel: Trajectories, now displayed with the variables (described in Table 1). Markers represent the three categories for variable 13.
To inspect these properties in a more general context, (replicated) artificial data were generated, with 75 subjects and 13 variables with a perfect representation in three dimensions. From this set, two variables were selected as partitioning variables, and five categories were created using an optimal discretization strategy. The remaining 11 variables were subjected to a fair amount of random error (with an average of 53%). The resulting set of variables was analyzed by distance-based PCA, and compared to standard PCA. In both cases, analyses were done in two dimensions. The number of replications was set to 100. Four criteria were inspected:
• 1. The fit per variable, as measured by the Pearson correlation between the true scores (without error) and the bilinear approximation.
• 2. The fit per dimension, as measured by the Pearson correlation between the true dimensions and the fitted dimensions.
• 3. The correct classification of subjects in two-space as compared to the original classes. The a priori distribution in the population was taken into account to compute the posterior probabilities.
• 4. The distances between the subjects in two-space as compared to the distances in true-space.
The results reported below were obtained by averaging the results after applying distance-based and standard PCA to the 100 samples of the artificially created structure described above. The first part of Table 4 gives the correlations between true scores and fitted scores; the second part reports on the results with respect to the original three dimensions (that are approximated in two-space; the correlations here are again obtained by applying the rotation-expansion strategy from Section 3, but now to the original dimensions: so if the original dimensions are denoted by $Z$, we minimize $\|Z - XB'\|$ over all $B$ satisfying $B'B = I$, and we compute the correlations between $Z$ and $XB'$). We notice that distance-based PCA performs better for each variable and each dimension separately. Results for the two classification variables were combined; the aggregated results are given in Table 5, where again rows indicate the original categories and columns the fitted categories. Except for the third category (73% versus 76% correct), distance-based PCA performs better; this effect is strongest for the extreme categories 1 and 5 (88% versus 78% and 87% versus 76% correct). Finally, the overall statistics are given in Table 6, confirming the superior performance of distance-based PCA over standard PCA.
7. Discussion
A distance-based biplot was developed for a least squares MDS analysis of multivari-
ate data. The fitted p-dimensional space is expanded into an m-dimensional space of
rank p that directly represents the fitted variables, while distances between subjects
are preserved. The biplot method was applied to a data set concerning patients with various types of eating disorders. The variables obtained a decent fit, also when com-
pared to standard PCA. The method was further studied in a Monte Carlo study.
Here results were compared with respect to the true structure, and the method pro-
posed performed better than standard PCA with respect to the criteria considered.
The analysis allows nominal variables to be included; their categories are represented
by centroids of subjects. The method could include optimal scoring of ordinal vari-
ables too. By representing nominal variables as points in the subject space, and the
other variables as vectors in the same space, we have actually developed a triplot,
with subjects, variables, and classes as its constituents.
References:
Carroll, J.D., and Chang, J.J. (1972): IDIOSCAL (Individual differences in orientation scaling): A generalization of INDSCAL allowing IDIOsyncratic reference systems as well as an analytic approximation to INDSCAL, Paper presented at the Psychometric Society Meeting, Princeton, NJ.
Carroll, J. D. (1972): Individual differences and multidimensional scaling, In: Multidimen-
sional scaling: Theory and applications in the behavioral sciences, R. N. Shepard, A. K. Romney, and S. B. Nerlove (eds.), Vol. 1, 105-155, Seminar Press, New York and London.
De Leeuw, J., and Heiser, W.J. (1980): Multidimensional scaling with restrictions on the configuration, In: Multivariate analysis, P.R. Krishnaiah (ed.), Vol. V, 501-522, North-Holland, Amsterdam.
Eckart, C. and Young, G. (1936): The approximation of one matrix by another of lower
rank, Psychometrika, 1, 211-218.
Gabriel, K.R. (1971): The biplot graphic display of matrices with application to principal component analysis, Biometrika, 58, 453-467.
Summary: A new class of multidimensional scaling models for the analysis of longitudi-
nal choice data is introduced. This class of models extends the work by Bockenholt and
Bockenholt (1991) who proposed a synthesis of latent-class and multidimensional scaling
(mds) models to take advantage of the attractive features of both classes of models. The
mds part provides graphical representations of the choice data while the latent-class part
yields a parsimonious but still general representation of individual differences. The exten-
sions discussed in this paper involve simultaneously fitting the mds latent-class model to
the data obtained at each time point and modeling stability and change in preferences by
autocorrelations and shifts in latent-class membership over time.
1. Introduction
Over the years, latent class models have proven to be a versatile tool for the analy-
sis of panel, survey, or experimentally derived choice data. For example, numerous
marketing applications demonstrate that these models can provide useful insights
into market structure by simultaneously segmenting and structuring a market at a
particular point in time or over time with the use of panel data. However, to some
extent the interpretation of classification results obtained by these models is compli-
cated by the fact that no information is provided about the perceptual space of the
choice options or the decision process of a respondent. To overcome this limitation,
Bockenholt and Bockenholt (1991) presented a synthesis of latent-class and multi-
dimensional scaling (mds) models that takes advantage of the attractive features of
both classes of models. The mds part determines the perceptual space of the choice
options while the latent class part yields a parsimonious but still general representa-
tion of individual differences.
This paper introduces an extension of this approach for the analysis of stability and
change in choice data collected at two time points. The new class of models represents
persons and items in a joint space. Individual differences are captured by different
vector termini or ideal point positions in a multidimensional space. Person-specific
changes are modelled by allowing for switching among the latent classes over time.
Changes in the perception of items are modelled by drifts in the item parameters.
As a result, the proposed approach provides a much stronger test of the stability and
validity of a multidimensional representation than possible on the basis of a data set
collected at a single point in time. Moreover, a rich set of hypotheses can be tested
to determine the locus of change in the time-dependent scaling results.
The remainder of this paper is structured as follows. First, parametric representa-
tions of ideal and vector models are reviewed. Next, a general modeling framework
is presented for the analysis of choice data collected at two time points. Various
special cases of the approach are derived. The paper concludes with an analysis of a
sociometric choice data set.
where $\pi_a$ represents the relative size or proportion of class $a$ and $\sum_a \pi_a = 1$. Al-
though the unconstrained latent-class model is well-suited to describe individual pref-
erence differences, it does not furnish information about the perception of the items
and the response process of the respondents. However, this information can be ex-
tracted from the data by constraining the class-specific probabilities, $\pi_{i|a}$, to be a
function of a scaling model. In particular, the ideal point and vector models may
prove useful in supplying succinct graphical representations of the choice behavior of
the respondents.
Bockenholt and Bockenholt (1991) showed that latent-class models and scaling models can be combined by setting $\pi_{i|a} = \Phi(z_{i|a})$, where $\Phi$ is the normal cumulative distribution function. The deviate $z_{i|a}$ is expressed as a function of an unfolding model or a vector
model. For example, the ideal point model specifies that an item is selected when it
is close to a person's ideal point position. As a result, for members of class a we may
write
$$z_{i|a} = -\tau_i - \sum_{h=1}^{r} (\lambda_{ah} - \beta_{ih})^2, \qquad (2)$$
where $\tau_i$ is an item-specific threshold parameter, and $\lambda_{ah}$ is the ideal point position of class $a$ on dimension $h$. The location of item $i$ on dimension $h$ is denoted by $\beta_{ih}$. Thus, according to (2) the items' locations are perceived homogeneously in an $r$-dimensional space; however, the positions of the ideal points differ among the classes. The smaller the distance is between an item and a class' ideal point position, the more members of this class prefer the item.
Occasionally, a special case of the ideal point model, the vector model, may prove
sufficient for describing individual differences in choice data. According to this model
each class is characterized by a preference vector $v_a = (v_{a1}, v_{a2}, \ldots, v_{ar})'$. An item's projection onto the preference vector of a class determines its probability of being chosen,
$$z_{i|a} = \tau_i + \sum_{h=1}^{r} v_{ah} \beta_{ih}. \qquad (3)$$
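The class-specific choice probabilities in (2) and (3) are easy to evaluate directly; a minimal sketch (parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ideal_point_prob(tau_i, lam_a, beta_i):
    """pi_{i|a} = Phi(-tau_i - sum_h (lam_ah - beta_ih)^2), equation (2)."""
    z = -tau_i - np.sum((np.asarray(lam_a) - np.asarray(beta_i)) ** 2)
    return norm.cdf(z)

def vector_model_prob(tau_i, v_a, beta_i):
    """pi_{i|a} = Phi(tau_i + sum_h v_ah beta_ih), equation (3)."""
    return norm.cdf(tau_i + float(np.dot(v_a, beta_i)))

# An item close to a class' ideal point is chosen with high probability:
print(ideal_point_prob(tau_i=-1.0, lam_a=[0.5, 0.5], beta_i=[0.4, 0.6]))
```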
By combining the vector or ideal point model with the latent-class representation we get
$$\Pr(i) = \sum_{a} \pi_a\, \Phi(z_{i|a}). \qquad (4)$$
Clearly, when $z_{i|a}$ is unconstrained, the latent class representation in (1) is obtained.
In Model (5), for choices observed at two time points, $\Phi_2\bigl(z^{(t)}_{i|a}, z^{(t+1)}_{i|b}; \rho_i\bigr)$ is the bivariate normal distribution function with correlation coefficient $\rho_i$ and upper integration limits $z^{(t)}_{i|a}$ and $z^{(t+1)}_{i|b}$. These limits are either unconstrained, or a function of the ideal point or vector models. For example, in the case of the unidimensional ideal point model
$$z^{(t)}_{i|a} = -\tau^{(t)}_i - \bigl(\lambda_a - \beta^{(t)}_i\bigr)^2. \qquad (6)$$
The class size parameter $\pi_{ab}$ refers to the probability of belonging to classes $a$ and $b$ at time points $t$ and $t+1$, respectively. Switching among class locations occurs
where $c$ is a constant, and $f_u$ denotes the observed number of persons with selection vector $y_u = (i^{(1)}, \ldots, q^{(1)}; i^{(2)}, \ldots, q^{(2)})$. The Expectation-Maximization (EM) algo-
rithm is used for parameter estimation (Dempster, et al., 1977). The implementation
of this algorithm is straightforward and not further discussed here because it is well
documented in the literature. For example, Hathaway (1985) and Lwin and Martin
(1989) provide a detailed discussion of the EM algorithm for estimating normal mix-
ture models, and Bockenholt (1992) reviews methods for the computation of normal
probabilities.
Large sample tests of fit are available based on the likelihood-ratio (LR) $\chi^2$ statistic ($G^2$) and/or Pearson's goodness-of-fit test statistic ($P^2$). Asymptotically, if a latent-class mds model provides an adequate description of the data, then both statistics follow a $\chi^2$-distribution with $(4^n - m - 1)$ degrees of freedom, where $m$ refers to the number of parameters to be estimated. The LR test is most useful when the number
number of parameters to be estimated. The LR test is most useful when the number
of items is small. Otherwise, only a small subset of the possible choice patterns may
be observed. In this case, it is doubtful that the test statistics will follow approx-
imately a X2 -distribution. However, useful information about a model fit may be
obtained by inspecting standardized differences between the observed and expected
model probabilities for certain partitions of the data (e.g., subsets of items). Al-
though these residuals are not independent, a careful inspection of their direction
5. Sociometric Choices
This section presents the analysis of a small study with two items observed at two
points in time. Although the number of items is too small for testing the mds part
of the latent-class models, the data set is useful for demonstrating the importance
of taking into account latent change and re-test effects in an analysis of longitudinal
data.
In a sociometric choice investigation reported by Langeheine (1994) students were
asked with respect to every classmate whether they would choose (C) or reject (R)
this person on the basis of two criteria measuring interpersonal attractiveness. Crite-
rion i is "share a table in case of a new seating arrangement" and criterion j is "share
a tent if a class would go for a camping trip". Table 1 contains the results of this
study for two measurement occasions separated by a one week interval. For example,
the majority of responses (673) rejects classmates on the basis of the two criteria
at both time points. For the most part the following results agree with the ones
obtained by Langeheine (1994). The main difference between his and the analyses
reported here is the application of Model (5).
Because the LR-test of the two-class no-change model,
$$\Pr\bigl(i^{(1)}, j^{(1)}, i^{(2)}, j^{(2)}\bigr) = \sum_{a=1}^{2} \pi_a\, \Phi\bigl(z^{(1)}_{i|a}\bigr)\, \Phi\bigl(z^{(1)}_{j|a}\bigr)\, \Phi\bigl(z^{(2)}_{i|a}\bigr)\, \Phi\bigl(z^{(2)}_{j|a}\bigr), \qquad (8)$$
yields a $G^2 = 239.1$ with df $= 10$, it is justified to test whether the poor fit is a result of variability in the evaluation of the items over time, person-specific changes,
or both. When allowing for switching among classes (i.e., $\pi_{ab} > 0$ when $a \neq b$), $G^2$ drops to 49.9 (df $= 8$), indicating a substantial latent change effect. In contrast, when, additionally, relaxing the assumption of time-homogeneous item probabilities (i.e., $\Phi(z^{(1)}_{i|a}) \neq \Phi(z^{(2)}_{i|a})$), $G^2$ reduces only slightly to 49.7 (df $= 4$). Clearly, item-specific
effects can be ignored after allowing for person-specific change. Because of the short
interval between the time points, it seems likely that the poor fit of the model is
caused by strong autodependencies of the choices. Support for this hypothesis is
obtained by inspecting the fitted frequencies of the latent change model in (the $\rho = 0$ column of) Table 1. In particular, the two choice patterns with identical choices are
underpredicted. This result suggests that the fit of the model can be improved con-
siderably by Model (5) because it allows for test-retest effects. The LR-test for this
model with equal correlations for both items is $G^2 = 5.99$ (df $= 7$), and the estimated correlation coefficient is $\rho = \rho_i = \rho_j = .49$. The predicted frequencies are given in (the $\rho = .49$ column of) Table 1 and the estimated choice probabilities and class size parameters are listed in Table 2. We note that the class consisting of reject responses is twice as large as the class consisting of pick responses. However, the transition probabilities of both classes are symmetric, with $\pi_{1|2} \approx \pi_{2|1}$.
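For reference, the $G^2$ statistic used throughout this section compares observed and fitted pattern frequencies; a minimal sketch (the counts below are made up for illustration, not the Table 1 frequencies):

```python
import numpy as np

def likelihood_ratio_G2(observed, fitted):
    """LR chi-square statistic G2 = 2 * sum f_u * log(f_u / fhat_u), summed
    over the observed choice patterns (zero counts contribute nothing)."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    mask = observed > 0
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / fitted[mask]))

# Hypothetical pattern counts, for illustration only:
obs = np.array([673, 60, 45, 122])
fit = np.array([650, 75, 52, 123])
print(likelihood_ratio_G2(obs, fit))
```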
6. Discussion
This paper presented latent-class scaling models for graphical representations of pref-
erence stability and change in choice data. The models provide a parsimonious frame-
work for the analysis of individual taste differences. A rich set of hypotheses can be
tested that examine the stability of the perceptual space of the items and shifts
among different preference states for two time points. In addition, as illustrated in
the application section test-retest effects can also be taken into account. Because
the approach can be applied with minor modifications to different data types (i.e.,
rankings, first choices, preference ratings) it can be viewed as a general framework
for obtaining graphical representations of preference changes.
Acknowledgements
This work was partially supported by the National Science Foundation (Grant No. SBR 94-09531).
References:
Bockenholt, U. (1992). Thurstonian models for partial ranking data. British Journal of Mathematical and Statistical Psychology, 45, 31-49.
Bockenholt, U. (1997). Modeling time-dependent preferences: Drifts in ideal points. In: Visualization of Categorical Data, Greenacre, M., and Blasius, J. (eds.). Lawrence Erlbaum Press.
Bockenholt, U., and Bockenholt, I. (1991). Constrained latent class analysis: Simultaneous classification and scaling of discrete choice data. Psychometrika, 56, 699-716.
Bockenholt, U., and Langeheine, R. (1996). Latent change in recurrent choice data. Psychometrika, 61, 285-302.
Carroll, J. D., and Pruzansky, S. (1980). Discrete and hybrid scaling models. In E. D.
Lantermann and H. Feger (eds.), Similarity and Choice. Vienna: Huber Verlag.
Coombs, C. H. (1964). A theory of data. New York: Wiley.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1-38.
Hagenaars, J. (1990). Categorical longitudinal data. Newbury Park: Sage.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Annals of Statistics, 13, 795-800.
Langeheine, R. (1994). Latent variables Markov models. In: A. von Eye, and C. C. Clogg (eds.), Latent variables analysis. Thousand Oaks: Sage.
Langeheine, R., and van de Pol, F. (1990). A unifying framework for Markov modeling in
discrete space and discrete time. Sociological Methods and Research, 18, 416-441.
Lazarsfeld, P. F., and Henry, N. W. (1968). Latent structure analysis. New York: Houghton-Mifflin.
Loewenstein, G. F., and Elster, J. (1992). Choice over Time. New York: Russell Sage
Foundation.
Lwin, T. and Martin, P. J. (1989). Probits of mixtures. Biometrics, 45, 721-732.
Poulsen, C. S. (1990). Mixed Markov and latent Markov modeling applied to brand choice data. International Journal of Research in Marketing, 7, 5-19.
Takane, Y. (1983). Choice model analysis of the "pick any/n" type of binary data. Handout at the European Psychometric and Classification Meetings, Jouy-en-Josas, France.
Wiggins, L. M. (1973). Panel analysis: Latent probability models for attitude and behaviour
processes. Amsterdam: Elsevier.
Part VI
1. Introduction
Feed-forward neural network (NN) models and statistical models have much in common. The former can be viewed as approximating nonlinear functions that connect inputs to outputs. Many statistical techniques can be viewed as approximating functions (often linear) that connect predictor variables to criterion variables. It is thus beneficial to exploit various developments in NN models in nonlinear extensions of linear statistical techniques. There is one aspect of nonlinear transformations by NN models that is particularly attractive in developing nonlinear multivariate analysis (MVA). It allows joint multivariate transformations of input variables, so that interactions among them can be captured automatically insofar as they are needed for prediction. In this paper we examine various properties of nonlinear MVA by NN models in two specific contexts: Cascade Correlation (CC) networks for nonlinear discriminant analysis simulating the learning of personal pronouns, and a five-layer auto-associative network for nonlinear principal component analysis (PCA) recovering two defining attributes of cylinders. In particular, we analyze the mechanism of function approximations in these networks.
such a way that all pre-existing units are connected to the new one. Input units are directly connected to output units (cross connections) as well as to all hidden units. The cross connections capture linear effects of input variables. Hidden units, on the other hand, produce nonlinear and interaction effects among the input variables that are necessary to connect inputs to outputs in some tasks. When a new hidden unit is recruited, the connection weights associated with its input connections are determined so as to maximize the correlation between residuals from network predictions at the particular stage and projected outputs from the recruited hidden unit, and are fixed throughout the rest of the learning process. This avoids the necessity of back-propagating error across different levels of the network, and leads to faster and more stable convergence. The weights associated with output connections are, however, re-estimated after each new hidden unit is recruited.
The CC algorithm constructs a net and estimates connection weights based on a sample of training patterns. For each input pattern, a unit in a trained net sends contributions to the units it is connected to. A contribution is defined as the product of the activation for the pattern at the sending unit and the weight associated with the connection between the sending unit and the receiving unit. The receiving unit forms its activation by summing up the contributions from other units and applying the sigmoid transformation to the summed contribution. An activation is computed at each unit and for each input pattern in the training sample. Let $a_1$ denote an input pattern (a vector of activations at input units and bias, which acts like a constant term in regression analysis), and let $w_1$ represent the vector of weights associated with the connections from the input and bias units to hidden unit 1 ($h_1$). Then, the activation for the input pattern at $h_1$ is obtained by $b_1 = f(a_1'w_1) - .5$, where $f$ is a sigmoid function, i.e., $f(t) = 1/\{1 + \exp(-t)\}$. Now $h_1$ as well as the input and bias units send contributions to $h_2$. The activation at $h_2$ is then obtained by $b_2 = f(a_2'w_2) - .5$. A similar process is repeated until an activation at the output unit is obtained, which is the network prediction for the output. In the training phase, connection weights are determined so that the network prediction closely approximates the output corresponding to the input pattern.
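A minimal sketch of this forward computation through a cascade of recruited hidden units (weight values and layout are illustrative, not a trained net):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cc_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade-correlation net: each hidden unit
    receives the inputs, the bias, and all earlier hidden units; its
    activation is the sigmoid of the summed contributions minus .5; the
    output unit receives contributions from every other unit."""
    a = np.concatenate(([1.0], x))            # bias acts like a regression constant
    for w in hidden_weights:                  # w has one weight per unit in a
        b = sigmoid(a @ w) - 0.5
        a = np.append(a, b)                   # cascade: unit feeds all later units
    return a @ output_weights                 # network prediction

# A net with two inputs and two recruited hidden units (random weights):
rng = np.random.default_rng(0)
w_h = [rng.normal(size=3), rng.normal(size=4)]
w_o = rng.normal(size=5)
print(cc_forward(np.array([3.0, 3.0]), w_h, w_o))
```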
3. Two-Number Identification
The CC network algorithm was first applied to the two-number identification problem, in which there are two input variables, $x_1$ and $x_2$ (excluding the bias). Pairs of $x_1$ and $x_2$ are classified into group 1 (indicated by output variable $y$ equal to .5) when the two numbers are identical, and are otherwise classified into group 2 (indicated by $y = -.5$). This is a simple two-group discrimination problem, but the function to be approximated is highly nonlinear, as can be seen in Figure 1(a). The problem is interesting because of its implications for real psychological problems; identifying two objects underlies many psychological phenomena, as illustrated in the next section.
One hundred training patterns, generated by factorially combining $x_1$ and $x_2$ varied systematically from 1 to 10 in steps of 1, were used in the training. The CC network algorithm constructed the network depicted in Figure 1(b). This net has three input units (including the bias), one output unit, and two recruited hidden units. Network predictions are computed in the manner described above. Figure 1(c) displays the function approximated by the CC net (the set of network predictions as a function of $x_1$ and $x_2$). The approximation looks quite good, although the ridge at $x_1 = x_2$ in the approximated function is not as "sharp" as in the original target function. This is due to the "crudeness" of the training sample. The minimum difference between two distinct numbers in the training sample is 1, so that the net was not required
to discriminate between two numbers whose difference is less than 1. The ridge in the approximated function can be made sharper if pairs of numbers with smaller differences are included in the training sample. Note that interpolations are done quite nicely. That is, although numbers like 5.5 were not included in the training sample, identifications involving such numbers are handled as expected. Extrapolation, on the other hand, seems a bit difficult, as indicated by a slight increase in function values toward the right-hand side corner. Note that the target function involves a form of interaction between $x_1$ and $x_2$, where the word "interaction" is construed broadly; the meaning of a specific value on one variable, say $x_1$, depends on the value on the other variable, $x_2$.
It is interesting to see how the approximated function is built up and what roles the two hidden units play. Figure 1(d) through (f) present contributions of the three input units to $h_1$. As described above, contributions are defined as products of activations at the input units and the weights associated with the connections leading to $h_1$. The contributions are summed up (Figure 1(g)), and further sigmoid-transformed to obtain the activation function at $h_1$ (Figure 1(h)). It seems that $h_1$ is identifying whether $x_1 \ge x_2$. The activation function at $h_2$ (Figure 1(i)) is similarly derived. Contributions now come from four units (three input units plus $h_1$); $h_2$ seems to be identifying whether $x_2 \ge x_1$. The output unit ($y$) receives contributions from all other units. However, $h_1$ and $h_2$ seem to play particularly important roles: $y$ takes the value .5 when and only when the input pattern satisfies both $x_1 \ge x_2$ and $x_2 \ge x_1$, and otherwise $-.5$. Interestingly, this is essentially how we prove $x_1 = x_2$ in mathematics.
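The role of the two hidden units can be caricatured in a few lines: with a half-unit margin (mirroring the minimum difference of 1 between distinct training numbers), two sigmoid units detecting $x_1 \ge x_2$ and $x_2 \ge x_1$ combine to output .5 only at equality. The weights below are hand-picked for illustration, not the trained ones:

```python
import numpy as np

def unit(u, sharpness=10.0):
    """A sigmoid hidden unit acting as a smooth indicator of u >= 0."""
    return 1.0 / (1.0 + np.exp(-sharpness * u))

def equality_net(x1, x2):
    """Hand-built caricature of the learned net: h1 fires when x1 >= x2,
    h2 fires when x2 >= x1 (each with a half-unit margin), and the output
    is near .5 only when both fire, i.e., when x1 = x2, else near -.5."""
    h1 = unit(x1 - x2 + 0.5)
    h2 = unit(x2 - x1 + 0.5)
    return h1 + h2 - 1.5

print(equality_net(4.0, 4.0))   # ~  0.5
print(equality_net(4.0, 6.0))   # ~ -0.5
```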
4. Pronoun Learning
We were interested in the two-number identification problem because of its implications for a real psychological problem, that is, the learning of first and second person pronouns. When the mother talks to her child, me refers to the mother and you to the child. However, when the child talks to the mother, me refers to the child, and you to the mother. The child has to learn the shifting reference of these pronouns. There are three relevant input variables in this problem (excluding bias) and one output variable indicating me ($y = .5$) or you ($y = -.5$). The three input variables are speaker (sp), addressee (ad), and referent (rf). The rule (or the function) to be learned is: "Use me when the speaker and the referent agree (i.e., $y = .5$ when sp = rf)", and "use you when the addressee and the referent agree (i.e., $y = -.5$ when ad = rf)". The network should be able to judge which two of the three input variables agree in their values. The two-number identification problem is thus a prerequisite to the pronoun learning problem.
How children learn the correct use of these pronouns has been studied by Oshima-Takane (1988, 1992) and her collaborators (Oshima-Takane, et al., 1996). Simulation studies by CC networks have also been reported in Oshima-Takane, et al. (1995), and in Takane, et al. (1995). All previous simulation studies, however, presupposed the existence of only two pronouns, me and you. This severely limits the scope of these studies. In particular, the operating rule may not coincide with the one assumed above. That is, seemingly correct behavior can follow from rules other than the one described above. For example, a rule such as me if sp = rf and you otherwise, or you if ad = rf and me otherwise, works equally well so far as only me and you are considered. That is, ad = rf is equivalent to sp ≠ rf, and sp = rf is equivalent to ad ≠ rf, when only me and you are to be distinguished.
We, therefore, first investigate what rule is in fact learned under the me-you-only condition. Forty training patterns were created by systematically varying the three
input variables from -2 to 2 in the step of 1, and by discarding all but me and you
patterns. Forty patterns were retained. (Remember that sp and ad cannot agree,
and such patterns were also discarded.) The CC network algorithm recruited two
hidden units to perform the task. The approximated function is depicted in Figure
2 in terms of ad on the y-axis and rf on the x-axis for nine different values of sp (-2,
-1.5, -1. -.5, O.. 5, I, 1..5, and 2). It looks like the output variable, y, takes the value
of -.5, as it is supposed to (see the diagonal "ditch" observed in each graph), but
it also takes the value of ..5 in all other cases, including sp = rf and sp #- rf #- ad.
This is correct for sp = rf, but not for sp #- rf #- ad. Remember that no training
patterns were given for the latter and so it is quite natural that the net responded
rather arbitrarily to the latter patterns. This implies, however, that pronouns other
than me and you are necessary to learn the correct use of these two pronouns. That
is, to learn to discriminate between sp = rf and sp #- rf #- ad, patterns involving other
pronouns such as he and she have to be included in the training sample.
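The pattern-generation step is easy to make concrete. This short sketch (variable names are mine) enumerates the grid described above and confirms that exactly forty me/you patterns survive:

```python
from itertools import product

# Training patterns for the me-you-only condition: sp, ad, rf each range
# over -2..2 in steps of 1; patterns with sp = ad are impossible and dropped,
# and only 'me' (sp = rf) and 'you' (ad = rf) patterns are kept.
patterns = []
for sp, ad, rf in product(range(-2, 3), repeat=3):
    if sp == ad:
        continue
    if sp == rf:
        patterns.append((sp, ad, rf, 0.5))    # me
    elif ad == rf:
        patterns.append((sp, ad, rf, -0.5))   # you
print(len(patterns))  # 40, as reported in the text
```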
To verify the above assertion, another simulation study was conducted, this time with pronouns other than me and you also included in the training sample. This condition, called the me-you-others condition, had 100 training patterns: 40 me-you patterns plus 60 others patterns. The net was trained to take the value of 0 (y = 0) when sp ≠ rf ≠ ad, in addition to y = .5 when sp = rf and y = −.5 when ad = rf. Figure 3 shows the approximated function under this condition, which looks as it is supposed to. The task is appreciably more complicated than before, and the CC network algorithm recruited five hidden units to perform the task.
Figure 2: The approximation function for the pronoun learning problem obtained under the me-you-only condition. The function is depicted as functions of 'addressee' (y-axis) and 'referent' (x-axis) at several values of 'speaker'. Function values (z-axis) at ad=rf should indicate you (y = −.5), and those at sp=rf me (y = +.5), if the pronouns are correctly learned. The problem is that the function takes the assumed value for me even if sp≠rf≠ad, for which no examples were given in the training. Discontinuities in the function correspond to points where sp=ad, which never occurs.
Figure 3: The same as Figure 2, but under the me-you-others condition. When the correct learning occurs, the function takes the value of you (y = −.5) if and only if ad=rf, the value of me (y = +.5) if and only if sp=rf, and y = 0 if and only if sp≠rf≠ad.
[Figure 4 (panels labelled Outputs, Hidden 3, Hidden 2, Hidden 1, and Inputs): activation functions created at the units of the network.]
has been applied to extracting components that determine facial expressions of emotion (DeMers & Cottrell, 1993) and internal color representation (Usui, et al., 1991). Takane (1995) examined recovery properties of nonlinear PCA by the NN models using several artificial data sets.
7. Discussion
NN models present interesting perspectives on nonlinear multivariate analysis by allowing joint multivariate nonlinear transformations of input variables. In this paper, we highlighted the mechanisms of these transformations in two specific contexts: CC
Figure 5: The activation functions created at units in hidden layer 1. These activation functions are linearly combined to obtain the activation functions ((b) and (c) of Figure 4), which are recovered component scores.
Figure 6: The same as Figure 5, but for hidden layer 3. These activation functions are linearly combined to obtain the output functions (activation functions at output units), some of which are given in (g)-(i) of Figure 4.
References
1. Introduction
Yanai & Takane (1992) proposed canonical correlation analysis with linear constraints
by imposing linear constraints upon parameters corresponding to two sets of variables.
In this paper, we extend the earlier results to the case where more than two sets of
variables are available, and thereby derive general explicit solutions for canonical cor-
relation analysis with more than two sets of variables by imposing linear constraints
upon the parameters corresponding to the sets of variables. We call this method the generalized canonical correlation analysis with linear constraints (GCCAC). Furthermore, we show that for categorical data our method yields the multiple correspondence analysis with linear constraints (MCC), or the Quantification method of the third type (Hayashi, 1952) with linear constraints, which covers the canonical analysis of contingency tables with linear constraints presented by Bockenholt & Bockenholt (1990) as well.
$$A = \begin{pmatrix} X_1'X_1 & X_1'X_2 & \cdots & X_1'X_m \\ \vdots & \vdots & & \vdots \\ X_m'X_1 & X_m'X_2 & \cdots & X_m'X_m \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} X_1'X_1 & & 0 \\ & \ddots & \\ 0 & & X_m'X_m \end{pmatrix}$$

are block matrices of order p × p where p = Σ_{i=1}^{m} p_i. Further, the p_j × p_j matrices X_j'X_j (j = 1, ..., m) are assumed to be nonsingular throughout the paper. Any solution of (1) is necessarily a solution of the following eigenvalue problem:
(4)

$$(XB^{-1}X')f = \lambda f \tag{5}$$

where a_j = (X_j'X_j)^{-1}X_j'f. Now, minimizing (7) with respect to f leads to (3). We show an important property of gP in the following corollary.

$$\sum_{i=1}^{m} P_{X_i} = gP,$$

thus establishing the desired result due to the Poincaré separation theorem (PST). (Q.E.D.)
Then, we have

$$-\frac{1}{m-1} \le R_k(X_1, \cdots, X_m) \le 1 \quad \text{for } k = 1, \cdots, \operatorname{rank}(gP). \tag{9}$$

Proof: First, multiply (3) by f'P_{X_j} from the left and sum over j. We get

$$\sum_{i \ne j}^{m} (P_{X_i}f_k, P_{X_j}f_k) \Big/ \Big(\sum_{i=1}^{m} \|P_{X_i}f_k\|^2\Big) = \lambda_k - 1,$$
and

$$A = X'X = \begin{pmatrix} X_1'X_1 & \cdots & X_1'X_m \\ \vdots & & \vdots \\ X_m'X_1 & \cdots & X_m'X_m \end{pmatrix}, \quad B = \begin{pmatrix} X_1'X_1 & & 0 \\ & \ddots & \\ 0 & & X_m'X_m \end{pmatrix},$$
and the p × s matrix G. Multiplying (14) from the left by XB^{-1} and expanding, we get the equation for λ and a:

$$(XB^{-1}X' - XB^{-1}G(G'B^{-1}G)^{-1}G'B^{-1}X')Xa = \lambda Xa,$$

thus leading, by g = Xa, to

$$\sum_{j=1}^{m}\left(P_{X_j} - P_{X_j(X_j'X_j)^{-1}G_j}\right)g = \lambda_c g, \tag{15}$$

and

$$\left(\sum_{j=1}^{m} P_{X_j}Q_{C_j}\right)f = \lambda_c f, \quad\text{where } Q_{C_j} = I - P_{C_j}, \tag{16}$$
which implies that adding the constraint G_j'a_j = 0 is meaningless under the condition stated above.
where Y = D_1^{-1/2} N_{12} D_2^{-1/2}. This is the canonical analysis of a contingency table with linear constraints by Bockenholt & Bockenholt (1990), which is here subsumed as a special case of GCCAC with m = 2 and X_1 and X_2 two matrices of dummy variables.
5. Numerical Examples
We give three numerical examples demonstrating the usefulness of our method.
Example 1: Given m = 3 dummy matrices A, B and C of orders 8 × 2, we compute the eigenvalues of gP where gP = P_A + P_B + P_C. Then we have λ_1(gP) = 3, λ_2(gP) = λ_3(gP) = λ_4(gP) = 1, and obtain generalized canonical correlation coefficients by equation (8): R_1(A,B,C) = 1, R_2(A,B,C) = R_3(A,B,C) = R_4(A,B,C) = 0.
$$A = \begin{pmatrix} 1&0\\ 1&0\\ 1&0\\ 1&0\\ 0&1\\ 0&1\\ 0&1\\ 0&1 \end{pmatrix}, \quad B = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 0&1\\ 0&1\\ 0&1\\ 1&0\\ 1&0 \end{pmatrix} \quad\text{and}\quad C = \begin{pmatrix} 1&0\\ 1&0\\ 0&1\\ 0&1\\ 1&0\\ 1&0\\ 0&1\\ 0&1 \end{pmatrix}.$$
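As a quick check of Example 1, the projectors and eigenvalues can be computed directly. The sketch below assumes, consistently with the values quoted for equation (8) above, that R_k = (λ_k − 1)/(m − 1); all names are illustrative:

```python
import numpy as np

# Dummy indicator matrices from Example 1 (8 x 2 each)
A = np.array([[1, 0]] * 4 + [[0, 1]] * 4)
B = np.array([[1, 0]] * 2 + [[0, 1]] * 4 + [[1, 0]] * 2)
C = np.array([[1, 0]] * 2 + [[0, 1]] * 2 + [[1, 0]] * 2 + [[0, 1]] * 2)

def proj(X):
    """Orthogonal projector onto the column space of X."""
    return X @ np.linalg.pinv(X.T @ X) @ X.T

gP = proj(A) + proj(B) + proj(C)          # gP = P_A + P_B + P_C
lam = np.sort(np.linalg.eigvalsh(gP))[::-1]
m = 3
R = (lam[:4] - 1) / (m - 1)               # assumed form of equation (8)
print(np.round(lam[:4], 6))               # expected: 3, 1, 1, 1
print(np.round(R, 6))                     # expected: 1, 0, 0, 0
```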
Further, we showed the eigenvalues of the four cases in Table 4. It is shown that the eigenvalue of case 1 is larger than any of the other three eigenvalues obtained from (3). The result also demonstrates the validity of Corollary 3.
6. Discussions
The method proposed in this paper may be regarded as a general method of multivariate analysis for finding appropriate linear combinations by imposing linear constraints upon parameters corresponding to many sets of variables. It is to be noted here that our method covers canonical correlation analysis, canonical discriminant analysis, principal component analysis, multiple correspondence analysis and also the Quantification method of the third type, taking linear constraints into consideration. In our paper, we have not given any comment on how to find the appropriate constraints for each of the methods. More often, natural forms of constraints may appear from specific empirical questions posed by the investigators concerned with the problems of their fields. With such formulations, one specifies the space in which the original parameter should lie, and then proceeds to test the hypothesis by means of an appropriate statistic. It should be noted, however, that little work has been done on such tests in the various kinds of multivariate analysis. From this viewpoint, how to test the hypothesis stated in (10) provides an interesting problem to be tackled in the future.
References:
Yanai, H. & Takane, Y. (1992): Canonical correlation analysis with linear constraints, Linear Algebra and its Applications, 176, 75-89.
Bockenholt, U. & Bockenholt, I. (1990): Canonical analysis of contingency tables with linear constraints, Psychometrika, 55, 633-639.
Lebart, L., Morineau, A. & Warwick, K. M. (1984): Multivariate Descriptive Statistical Analysis, John Wiley & Sons, New York.
Rao, C. R. & Yanai, H. (1979): General definition of a projector, its decomposition and application to statistical problems, J. of Statistical Planning and Inference, 3, 1-17.
Rao, C. R. (1973): Linear Statistical Inference and its Applications (Second Edition), John Wiley & Sons, New York.
Hayashi, C. (1952): On the prediction of phenomena from qualitative data from the mathematico-statistical point of view, Annals of the Institute of Statistical Mathematics, 3, 69-96.
Principal Component Analysis Based on
a Subset of Variables for Qualitative Data
Yuichi Mori¹, Yutaka Tanaka² and Tomoyuki Tarumi²
1 Kurashiki City College
160 Kojima Bieda-cho
Kurashiki 711, Japan
1. Introduction
Consider a situation where we wish to make a small dimensional rating scale to measure latent traits. On the one hand, from the validity aspect, all the variables should be included in the stage of constructing the rating scale. On the other hand, from the practical aspect, the number of variables should be as small as possible in the stage of application of the rating scale. Thus we meet the problem of variable selection in principal component analysis (PCA).
Suppose we observe quantitative variables and wish to make a small dimensional
rating scale by applying a PCA-like method. Furthermore, we wish to obtain the
rating scale or a set of principal components (PCs) in such a way that it is based
on only a subset of variables but it represents all the variables very well. If we can find such PCs, we may say that those PCs provide a multidimensional rating scale which has high validity and is easy to apply practically. To do this Tanaka and Mori (1996) proposed a method of PCA based on a subset of variables using the idea of Rao (1964)'s PCA of instrumental variables and called it a "modified PCA" (M.PCA), while the problems of variable selection in the ordinary PCA have been studied by Jolliffe (1972, 1973, 1986), Robert and Escoufier (1976), McCabe (1984) and Krzanowski (1987a, 1987b) among others.
In this paper we extend M.PCA so that it can deal with qualitative data with unordered or ordered categories. Namely, we perform both quantification of qualitative data and M.PCA at the same time, and study the performance of our method numerically by applying it to a real data set. For analyzing qualitative data with a PCA approach, a lot of techniques have been proposed, e.g., De Leeuw (1982, 1984), De Leeuw and Van Rijckevorsel (1980), Israels (1984), and Young et al. (1978). We use Young et al. (1978)'s PCA in which the alternating least squares (ALS) method is utilized. We call this qualitative PCA based on a subset of variables a "qualitative modified PCA" (QM.PCA) to distinguish it from M.PCA and others.
Let the covariance matrix of Y = (Y_1, Y_2) be

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$

The residual covariance matrix of Y after subtracting the best linear predictor is expressed as

(1)

where S_1 = (S_{11}, S_{12}). Then the problem becomes to maximize S_Reg. If it is formulated as the maximization problem of tr(S_Reg) among other possibilities, a generalized eigenvalue problem

(2)

is obtained. Let the q eigenvalues of (2) be ordered from the largest to the smallest as λ_1, λ_2, ..., λ_q and the associated eigenvectors be denoted by w_1, w_2, ..., w_q. Then, the solution is expressed as W = (w_1, ..., w_r), and the maximized value of the criterion tr(S_Reg) is given by Σ_{i=1}^{r} λ_i, or the proportion of the original variations explained by the r PCs is given by

$$P = \sum_{i=1}^{r} \lambda_i \Big/ \operatorname{tr}(S). \tag{3}$$

In QM.PCA this proportion P indicates the average squared multiple correlation between each of the original variables and the r PCs, since all y_j's are standardized in the quantification process.
Thus we apply the following two-stage procedure of variable selection as a practical strategy to find PCs which are based on a small number of variables but represent all the variables very well.
A. Initial fixed-variable stage
Assign q variables to subset Y_1, usually q := p, and solve the eigenvalue problem (EVP) (2). Looking carefully at the eigenvalues and the proportions P in (3), determine the number r of PCs to be used.
B. Variable selection stage (backward elimination)
B-1: Based on the results of stage A, start with the q variables - r PCs model.
B-2: Remove each one of the q variables in turn, solve the q EVPs (2) with (q − 1)-dimensional matrices, and find the best subset of size q − 1, that is, the one for which the proportion P in (3) is the largest. Remove the corresponding variable from the present subset of q variables and put q := q − 1.
B-3: If both P and the number of variables in Y_1 are larger than preassigned values, go back to B-2. Otherwise stop.
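A compact sketch of stage B may help. Since the EVP (2) is not reproduced in full above, the sketch assumes the instrumental-variables form [(S₁₁² + S₁₂S₂₁) − λS₁₁]w = 0, in the spirit of Rao (1964) as used by Tanaka and Mori (1996); function names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def mpca_eigvals(S, subset):
    """Eigenvalues (descending) of the assumed EVP (2) for a candidate
    subset: [(S11^2 + S12 S21) - lambda * S11] w = 0."""
    rest = [j for j in range(S.shape[0]) if j not in subset]
    S11 = S[np.ix_(subset, subset)]
    S12 = S[np.ix_(subset, rest)]
    return eigh(S11 @ S11 + S12 @ S12.T, S11, eigvals_only=True)[::-1]

def backward_elimination(S, r, min_vars):
    """Stage B: repeatedly drop the variable whose removal keeps the
    proportion P in (3) largest."""
    subset = list(range(S.shape[0]))
    while len(subset) > min_vars:
        candidates = [[v for v in subset if v != j] for j in subset]
        subset = max(candidates,
                     key=lambda c: mpca_eigvals(S, c)[:r].sum())
        P = mpca_eigvals(S, subset)[:r].sum() / np.trace(S)
        print(f"q = {len(subset)}, P = {P:.3f}")
    return subset
```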
2.3 Estimation of category scores and variable weights
Before performing M.PCA we have to estimate the unknown parameters a_j and then make optimally scaled variables. To do this we use the idea of PRINCIPALS (Young et al., 1978), which is an extension of ordinary PCA to the situation where the variables may be measured at a variety of scale levels. It is used to estimate both a_j and W simultaneously. Here, we use all variables in this estimation phase, i.e., q := p. Let

$$\hat{Y} = ZB, \tag{4}$$

where Z (= YW in this phase) is an n × r matrix of n component scores and B is an r × p coefficient matrix. We wish to obtain B so that Ŷ predicts Y as well as possible, where Y contains the unknown a_j's and Z contains the unknown W. The optimization criterion is expressed as

$$\theta = \operatorname{tr}(Y - \hat{Y})'(Y - \hat{Y}). \tag{5}$$

Based on the ALS principle, unknown parameters are estimated as follows.
(Step 0) Determine initial values of Y, which are assigned as category scores provided externally or assigned as random numbers. Standardize Y columnwise.
(Step 1) Apply M.PCA to the data matrix Y, that is, solve EVP (2) and obtain W. Successively obtain Z and B = (Z'Z)^{-1}Z'Y.
(Step 2) Evaluate θ. If the improvement in fit from the previous iteration to the present iteration is negligible, stop.
(Step 3) From Z and B compute Ŷ by eq. (4). Obtain the matrix of optimally scaled data Y which gives the minimum θ in (5) for the fixed Ŷ. The optimal scaling of data is performed for each variable separately and independently. Standardize the optimally scaled data and go back to Step 1.
Steps 1 through 3 are iterated until convergence is obtained.
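For intuition, here is a minimal sketch of the Step 0-3 loop for unordered categorical variables only, where least-squares optimal scaling reduces to replacing each category by the mean of the corresponding predictions in Ŷ; with q = p the M.PCA step reduces to ordinary PCA. All names are mine, and ordinal variables would need a monotone-regression step instead:

```python
import numpy as np

def standardize(Y):
    return (Y - Y.mean(0)) / Y.std(0)

def principals_nominal(G_codes, r, max_iter=100, tol=1e-8):
    """ALS sketch of Steps 0-3 for an n x p integer matrix of category codes."""
    Y = standardize(G_codes.astype(float))         # Step 0: initial scores
    theta_old = np.inf
    for _ in range(max_iter):
        # Step 1: with q = p, M.PCA reduces to ordinary PCA of Y
        lam, W = np.linalg.eigh(np.cov(Y, rowvar=False))
        W = W[:, ::-1][:, :r]
        Z = Y @ W
        B = np.linalg.solve(Z.T @ Z, Z.T @ Y)
        Yhat = Z @ B                               # prediction (4)
        theta = np.sum((Y - Yhat) ** 2)            # Step 2: loss (5)
        if theta_old - theta < tol:
            break
        theta_old = theta
        for j in range(Y.shape[1]):                # Step 3: optimal scaling
            for c in np.unique(G_codes[:, j]):
                mask = G_codes[:, j] == c
                Y[mask, j] = Yhat[mask, j].mean()  # category -> mean prediction
        Y = standardize(Y)
    return Y, W
```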
3. A numerical example
We illustrate the proposed method by applying it to the data gathered for the purpose of making a rating scale to measure the seriousness of mild disturbance of consciousness (MDOC) due to head injury and other lesions (Sano et al., 1977, 1983). The data set consists of 87 individuals and 25 variables (test items, four-point scale).
Fig. 1: Process of removing variables (QM.PCA and M.PCA). (● indicates the result of QM.PCA, which means M.PCA of Y, and the second marker indicates that of M.PCA, which is applied to X as quantitative variables.)
These data were originally analyzed by Sano, et al. (1977, 1983) to make a rating scale for MDOC, and later by Tanaka and Kodake (1981) and Tanaka (1983) using principal factor analysis with variable selection functions. All of these studies adopted a 23 variables - 2 factors model, since two of the 25 variables were thought not to be important, and analyzed the 23 variables as quantitative ones using scores 0, 1, 2 and 3 or 0 and 3 according to the grade of disturbance.
Now let us apply QM.PCA to this data set. We denote this data set by X. At first indicator matrices G_j are generated from X and are transformed into initial quantified vectors y_j by assigning original values of X_j to score vectors a_j (j = 1, ..., 23). After normalizing the y_j's so that they have zero means and unit variances, we apply the ordinary PCA to the initial Y. The eigenvalues and the cumulative proportions are λ_1 = 13.878 (60.34%) > λ_2 = 1.579 (67.20%) > λ_3 = 0.9580 (71.37%) > λ_4 = 0.8456 (75.05%) > λ_5 = 0.7432 (78.28%) > .... Looking at these values and referring to the previous studies, we decided to extract two PCs. Starting with this initial Y and r = 2, scores a_j's and weights W are estimated iteratively by means of the ALS method. The Y obtained at the final iteration is an optimally quantified matrix of the qualitative X. The eigenvalues and the cumulative proportions of the quantitative variables Y are λ_1 = 14.458 (62.86%) > λ_2 = 1.502 (69.39%) > λ_3 = 0.9401 (73.48%) > λ_4 = 0.7788 (76.86%) > λ_5 = 0.6948 (79.88%) > .... These proportions are larger than those obtained by the ordinary PCA applied to the original X.
Next, we apply M.PCA to these optimally scaled variables Y. The process of removing variables is shown in Tab. 1 and the change of P values is visualized in Fig. 1 (●). They illustrate that the change of the P value is very small until 11 or 12 variables are removed. Even when the number of variables is reduced drastically to 6, the reduction of P is less than 2.5%.
Fig.2 shows scatter plots of the correlation loadings which are defined as the correla-
tions between the original variables and the (varimax-rotated) derived PCs and which
play important roles for the interpretation of PCs. Three scatter plots correspond
to the cases of (a) q = 23, (b) q = 11 and (c) q = 6. Black circles in (b) and (c)
indicate the selected variables. Comparing these scatter plots, it is observed that the
configurations are almost the same between (a) and (b), but they are slightly different
between (a) and (c). This fact indicates that the meanings of the PCs do not change
when 12 variables are removed and they change only slightly when 17 variables are
removed. We can say that PCs based on the selected variables have almost the same
information as PCs based on the whole ones.
Comparing with the result of M.PCA applied to the original X, Fig. 1 shows that QM.PCA provides higher values of P at every step. This difference is due to the effect of optimal scaling by the ALS method.
To illustrate how our method selects variables from clusters of variables, the selected variables are marked on the dendrogram in Fig. 3 obtained by cluster analysis of Y using standardized squared Euclidean distances and the furthest neighbor method. As illustrations, the 11 variables and 6 variables selected by QM.PCA are marked below the variable numbers in Fig. 3. The dendrogram suggests that there exist three clusters which are composed of 11, 2 and 10 variables, respectively. It is noted that the selected variables are distributed in the two major clusters in a well-balanced manner. The reason why no variables are selected from the smallest cluster may be explained in such a way that variables No. 21 and No. 23 in this cluster are located near the origin in the scatter plots of the correlation loadings (Fig. 2) and therefore they do not play important roles in composing these PCs.
Finally let us compare our method with other variable selection methods. On the basis of the previous studies we consider the following methods.
(1) Method of regression analysis: Successively remove the one variable which has the largest multiple correlation with the remaining variables.
(2) Jolliffe (1972, 1973, 1986)'s method based on PCA: Apply PCA to all p variables. Looking at the loadings associated with the p-th largest eigenvalue, remove the one variable which has the largest loading. Next, looking at the loadings associated with the (p − 1)-th largest eigenvalue, remove one variable among the remaining ones in the same way, and so on. Iterate the process until the number of removed variables is p − q.
(3) Method based on cluster analysis: Form q clusters of variables using a method of
[Fig. 2: Scatter plots of the correlation loadings for (a) q = 23, (b) q = 11 and (c) q = 6; black circles in (b) and (c) indicate the selected variables.]
Fig. 3: Dendrogram of 23 variables (furthest neighbor method with standardized squared Euclidean distances). Eleven and six variables selected by QM.PCA are marked by X below the variable numbers.
[Fig. 4 plot: P versus number of variables (23 down to 2) for QM.PCA, Reg, PCA (by Jolliffe), Cluster analysis, RV1 and RV2.]
Fig. 4: Comparison of P obtained by applying various selection methods to Y.
(5) Method using Robert and Escoufier (1976)'s RV-coefficient (2): Find the q − 1 variables among the q variables which have the configuration of the modified PC scores closest to that of the original variables Y in terms of the RV-coefficient. In this case the RV-coefficient is computed as RV = {Σ_j λ_j / tr(S²)}^{1/2}, where the λ_j's are obtained by solving EVP (2) (see Tanaka and Mori, 1996).
Fig.4 indicates the changes of P for each selection method. P values at each step
are computed using Y1 obtained by each method as the instrumental variables and Y
as the main variables. The reason why we use the formulation based on instrumen-
tal variables to compute P is to make fair comparisons, because P values obviously
become smaller if the ordinary PCA is applied to the selected variables (this fact is
observed in Tab. 1, that is, P values in the last column, which are computed by the
ordinary PCA of the selected variables, are always smaller than those in the third
column). Fig.4 shows that the proposed method selects a subset of variables which
explains the original variables better than other methods.
4. Concluding remarks
Modified PCA (M.PCA), which was proposed by Tanaka and Mori (1996) to derive PCs which are formulated as linear combinations of a subset of variables but which represent all the variables very well, is extended so as to deal with qualitative data by combining a method of optimal scaling and M.PCA. The performance of the proposed method (QM.PCA) is studied numerically by applying it to a real data set. From the results of the numerical study we can say the following:
1) The proportion of variance explained by a specified number of PCs in QM.PCA is larger than in M.PCA without optimal scaling in every step of variable selection. It suggests the usefulness of optimal scaling in analyzing qualitative data.
2) In our numerical example the average proportion P of the variance of all the variables explained by the PCs does not change much by omitting at most 17 among the 23 variables. Also the loadings of PCs based on all the variables (q = 23) and on two subsets of variables (q = 11 and q = 6) are almost the same. These facts suggest that we can construct a multidimensional rating scale which is based on a small number of variables but which has almost the same information as the case using all the variables.
3) Comparison is made among the performances of our method and other variable selection methods in PCA. Our method is superior to the other methods in the sense that the proportion P of our method is larger than those of the other methods.
4) To study how our method selects variables, the selected variables are marked in the scatter plots and the dendrogram for cluster analysis of the variables. It seems that the variables are selected from the major clusters in a well-balanced manner.
References:
De Leeuw, J. (1982): Nonlinear principal component analysis. COMPSTAT 1982, 77-86, Physica-Verlag, Vienna.
De Leeuw, J. (1984): The Gifi system of nonlinear multivariate analysis. Data Analysis and Informatics III, Diday, E. et al. (eds.), 415-424, Elsevier Science Publishers, North-Holland.
De Leeuw, J. and Van Rijckevorsel, J. (1980): HOMALS & PRINCALS: Some generalizations of principal components analysis. Data Analysis and Informatics, Diday, E. et al. (eds.), 415-424, North-Holland Publishing Company.
Israels, A. Z. (1984): Redundancy analysis for qualitative variables. Psychometrika, 49, 331-346.
Jolliffe, I. T. (1972): Discarding variables in a principal component analysis. I. Artificial data. Applied Statistics, 21, 160-173.
Jolliffe, I. T. (1973): Discarding variables in a principal component analysis. II. Real data. Applied Statistics, 22, 21-31.
Jolliffe, I. T. (1986): Principal Component Analysis. Springer-Verlag, New York.
Krzanowski, W. J. (1987a): Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics, 36, 22-33.
Krzanowski, W. J. (1987b): Cross-validation in principal component analysis. Biometrics, 43, 575-584.
McCabe, G. P. (1984): Principal variables. Technometrics, 26, 137-144.
Rao, C. R. (1964): The use and interpretation of principal component analysis in applied research. Sankhya, A, 26, 329-358.
Robert, P. and Escoufier, Y. (1976): A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25, 257-265.
Sano, K. et al. (1977): Statistical studies on evaluation of mild disturbance of consciousness: Abstraction of characteristic clinical pictures by cross-sectional investigation. Sinkei Kenkyu no Shinpo, 21, 1052-1065 (in Japanese).
Sano, K. et al. (1983): Statistical studies on evaluation of mild disturbance of consciousness. Journal of Neurosurgery, 58, 223-230.
Tanaka, Y. and Kodake, K. (1981): A method of variable selection in factor analysis and its numerical investigation. Behaviormetrika, 10, 49-61.
Tanaka, Y. (1983): Some criteria for variable selection in factor analysis. Behaviormetrika, 13, 31-45.
Tanaka, Y. and Mori, Y. (1996): Principal component analysis based on a subset of variables: Variable selection and sensitivity analysis. American Journal of Mathematics and Management Sciences, Special Volume (to appear).
Young, F. et al. (1978): The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. Psychometrika, 43, 279-281.
Missing Data Imputation in Multivariate Analysis
Mario Romanazzi
Department of Statistics, "Ca' Foscari" University of Venice
2347 S. Polo, 30125 Venice, Italy
Summary: A new imputation method for incomplete data is suggested, to be used with
non-random multivariate observations. The principle is to fill the gaps in a data set so that
the partial complete-data configuration and the total filled-in configuration are similar, ac-
cording to a matrix correlation coefficient. The optimality criteria are Escoufier's RV and
Procrustes normalized statistic. Three examples are illustrated.
1. The method
Suppose p numerical variables have been observed on n units, but only n_1 units have complete data, whereas the other n_2 = n − n_1 have incomplete data. Let X be the n × p data matrix, with some empty cells, and let X_1 be the n × p_1 submatrix including the p_1 ≥ 1 variables with complete data. It is assumed that the variables (columns of X_1 and X) are centered with respect to the means. We suggest estimating the missing values in such a way that X_1 and X are as similar as possible, according to a matrix correlation coefficient. Crettaz De Roten and Helbling (1991) also use a matrix correlation criterion, but they consider an a priori partition of the variables in two groups, or try to maximize the match between X_1 and the submatrix of the incomplete-data variables.
A geometrical interpretation can be given. X_1 and X correspond to two constellations representing the same n points in p_1- and p-dimensional Euclidean space, respectively. The constellation associated with X_1 is fixed, whereas in the constellation corresponding to X, the relative positions of points and the overall shape change for each selection of values to substitute for missing values in the empty cells of X. Our method amounts to fixing X at the position of maximum similarity with X_1.
Obviously, a definition of "similar configurations" must be given. In multivariate
analysis there are two standard equivalence criteria: congruence up to linear transfor-
mations, typical of canonical correlation and multivariate regression, and congruence
up to orthogonal transformations, as in principal component and factor analysis. In
this work the match of Xl and X is evaluated under orthogonal congruence. It is
well-known that a maximal invariant is the pair of inner-product matrices XIX[,
X X T that are comparable by Escoufier's RV
trXTXxTx j
LS(X X) _ {t'(KTXXTK
1" 1 " 1
)1/2}2
_
{t'(KTXXTX
I" 1 j
)1/2}2
], - tr(XjXT) tr(XXT) - tr(XTXd tr(XTX)
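Both coefficients are easy to compute; the following sketch (function names are mine) uses the standard RV definition and the fact that tr(X_1^T XX^T X_1)^{1/2} equals the sum of the singular values of X_1^T X:

```python
import numpy as np

def rv(X1, X):
    """Escoufier's RV between the column-centered configurations X1 and X."""
    A, B = X1 @ X1.T, X @ X.T
    return np.trace(A @ B) / np.sqrt(np.trace(A @ A) * np.trace(B @ B))

def ls(X1, X):
    """Procrustes-type LS statistic; tr(X1' X X' X1)^(1/2) equals the sum of
    the singular values of X1' X."""
    nucl = np.linalg.svd(X1.T @ X, compute_uv=False).sum()
    return nucl**2 / (np.trace(X1.T @ X1) * np.trace(X.T @ X))
```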
$$X = (X_1\ \ X_2) = \begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix},$$

with x̄^(1) (x̄^(2)) denoting the vector of the means of the first p_1 (last p_2) variables. For j ∈ {1, 2}, it is useful to interpret x̄^(j) as the weighted mean of x̄_1^(j) and x̄_2^(j), the group averages computed from the first n_1 and last n_2 units. The relevant expression is

$$RV(X_1, X) \equiv RV(\theta) = \frac{1 + \phi(\theta)}{\sqrt{\ldots + 2\phi(\theta) + \psi(\theta)}}, \qquad LS(X_1, X) \equiv LS(\theta) = \frac{\{\operatorname{tr}(\Gamma + \Phi(\theta))^{1/2}\}^2}{\ldots}$$
X_1^T X_1 ≠ 0_{p_1}, thus RV(θ) and LS(θ) are positive for all θ ∈ ℝ^q.
Remark 4. RV(θ) = LS(θ) = 1 iff there exist a column-orthogonal p_1 × p_2 matrix Q(θ) and a scalar δ such that X_2(θ) = δX_1Q(θ).
Remark 5. If Φ(θ) does not depend on θ, LS(θ) attains its maximum value at the θ-vector where tr X_2(θ)^T X_2(θ) is minimized. Then the optimal value for all missing elements in the j-th incomplete-data variable is the mean of the observed data for that variable. If Φ(θ) does not depend on θ and all the data corresponding to the last n_2 units and last p_2 variables are missing, RV(θ) satisfies the same property.
Remark 6. If p_1 = p_2 = q = 1, then

where s_{11} = var X_1, s_{12}(θ) = cov(X_1, X_2) and s_{22}(θ) = var X_2. It can be shown that the value θ_LS which maximizes LS(θ) satisfies
2. Optimization of RV and LS
To obtain maximum similarity between X_1 and X(θ), the optimization problems max_{θ∈ℝ^q} RV(θ) or max_{θ∈ℝ^q} LS(θ) must be solved. The two functions vary in the (0,1] interval; if X_1^T X_1 is positive definite they are continuous and differentiable for all θ ∈ ℝ^q, but the search for the extrema might be complicated by the existence of local extrema. A detailed discussion of the behaviour of RV(θ) in the particular case q = 1 is given in Romanazzi (1995). When p and q are not very high (tentatively, p ≤ 10, q ≤ 4, in the case of RV(θ)), the optimizations are conveniently performed by symbolic computation, using for instance Mathematica or MATLAB. Otherwise, ad hoc algorithms based on steepest descent methods must be designed. The gradient vector of RV(θ) is (Romanazzi, 1995):
Here ∂vec Δ(θ)/∂θ ≡ Δ^(1) is the q × n_2p_2 matrix of the partial derivatives of the elements of vec Δ(θ) with respect to θ_1, ..., θ_q, and ∂β(θ)/∂θ ≡ β^(1) is the q × p_2 matrix of the partial derivatives of the components of β(θ) with respect to θ_1, ..., θ_q. Δ^(1) and β^(1) are constant matrices no longer depending on θ_1, ..., θ_q. The expressions are:
$$2\Bigl\{\Delta^{(1)}\Bigl[(I_{p_2}\otimes BB^T)\operatorname{vec}\Delta(\theta) + \tfrac{n_1n_2}{n}(I_{p_2}\otimes B)\operatorname{vec} a\beta(\theta)^T + \operatorname{vec} BA^TC\Bigr] + \tfrac{n_1n_2}{n}\beta^{(1)}\Bigl[\tfrac{n_1n_2}{n}(a^Ta)\beta(\theta) + \Delta(\theta)^TBa + C^TAa\Bigr]\Bigr\},$$

$$-\Bigl\{\Delta^{(1)}\Bigl[\Bigl(\Delta(\theta)^T\Delta(\theta) + C^TC + \tfrac{n_1n_2}{n}\beta(\theta)\beta(\theta)^T\Bigr)\otimes I_{n_2}\Bigr]\operatorname{vec}\Delta(\theta) + \tfrac{n_1n_2}{n}\beta^{(1)}\Bigl[\Delta(\theta)^T\Delta(\theta) + C^TC + \tfrac{n_1n_2}{n}(\beta(\theta)^T\beta(\theta))I_{p_2}\Bigr]\beta(\theta)\Bigr\}.$$

For LS(θ),

$$\frac{1}{\operatorname{tr}[\Gamma+\Phi(\theta)]^{1/2}\operatorname{tr}[X(\theta)^TX(\theta)]}\Bigl\{\operatorname{tr}[X(\theta)^TX(\theta)]\bigl[\Delta^{(1)}(\Phi^{(1)}_\Delta(\theta))^T + \tfrac{n_1n_2}{n}\beta^{(1)}(\Phi^{(1)}_\beta(\theta))^T\bigr]\operatorname{vec}[\Gamma+\Phi(\theta)]^{-1/2} - 2\operatorname{tr}[\Gamma+\Phi(\theta)]^{1/2}\bigl[\Delta^{(1)}\operatorname{vec}\Delta(\theta) + \tfrac{n_1n_2}{n}\beta^{(1)}\beta(\theta)\bigr]\Bigr\},$$

where

$$\frac{\partial\operatorname{vec}\Phi(\theta)}{\partial\operatorname{vec}\Delta(\theta)} = B^T\Delta(\theta)\otimes B^T + (B^T\otimes B^T\Delta(\theta))K_{n_2,p_2} + (B^T\otimes A^TC)K_{n_2,p_2} + \tfrac{n_1n_2}{n}\bigl(a\beta(\theta)^T\otimes B^T + B^T\otimes a\beta(\theta)^T\bigr),$$

$$\frac{\partial\operatorname{vec}\Phi(\theta)}{\partial\beta(\theta)} = \tfrac{n_1n_2}{n}\bigl(a\beta(\theta)^T\otimes a + a\otimes a\beta(\theta)^T\bigr) + a\otimes A^TC + A^TC\otimes a + a\otimes B^T\Delta(\theta) + B^T\Delta(\theta)\otimes a.$$

Here K_{n_2,p_2} denotes the commutation matrix of order n_2p_2, transforming vec Δ(θ) into vec(Δ(θ)^T).
Unfortunately, explicit expressions of θ̂_RV and θ̂_LS are not available: even in the simplest case - the RV coefficient with q = 1 - the optimal value of θ is a root of a fourth-degree equation. This is an important drawback in comparison with imputation based on linear regression.
The optimization of RV(θ) and LS(θ) is computationally feasible and the results are sensible if the number of missing values is small. In particular, at least one variable must have complete data. With large matrices scattered with missing values we suggest the following iterative procedure.
Step 0. Substitute a naive initial estimate (e.g., the mean) for each missing value.
Step 1. For k = 1, ..., q, determine the optimal value of θ_k, according to RV or LS, keeping the other θ_j's, j ≠ k, fixed at their previous values.
Step 2. Iterate Step 1 until the configuration of points no longer changes, i.e., |θ_k^(i) − θ_k^(i−1)| < ε_k, k = 1, ..., q.

[Figure 1: perspective plots of RV(θ) and LS(θ).]

While the general method determines the optimal values of θ_1, ..., θ_q simultaneously, the iterative procedure tries to locate the optimum θ-vector optimizing the
θ_k's one at a time. This implies that it will only find a sub-optimal configuration X(θ̃_RV) (X(θ̃_LS)) such that RV(θ̃_RV) ≤ RV(θ̂_RV) (LS(θ̃_LS) ≤ LS(θ̂_LS)).
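The cyclic Step 0-2 scheme is easy to sketch. Because the closed-form RV(θ) expressions above are only partially recoverable here, this illustration (all names mine) simply re-evaluates the RV coefficient directly at each trial value of a missing cell, using a bounded one-dimensional search:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rv_coeff(X1, X):
    """Escoufier's RV between centered configurations (as in the earlier sketch)."""
    A, B = X1 @ X1.T, X @ X.T
    return np.trace(A @ B) / np.sqrt(np.trace(A @ A) * np.trace(B @ B))

def iterative_rv_imputation(X, n_sweeps=20, tol=1e-6):
    """Step 0-2: fill NaNs with column means, then optimize one missing cell
    (one theta_k) at a time with the others held fixed, until stable."""
    X = X.copy()
    miss = list(zip(*np.where(np.isnan(X))))
    col_means = np.nanmean(X, axis=0)
    for (i, j) in miss:                       # Step 0: naive initial estimates
        X[i, j] = col_means[j]
    complete = [j for j in range(X.shape[1])
                if not any(c == j for _, c in miss)]
    X1 = X[:, complete] - X[:, complete].mean(0)
    for _ in range(n_sweeps):
        shift = 0.0
        for (i, j) in miss:                   # Step 1: one theta_k at a time
            old = X[i, j]
            span = 3 * X[:, j].std() + 1e-9
            def neg_rv(t):
                X[i, j] = t
                return -rv_coeff(X1, X - X.mean(0))
            X[i, j] = minimize_scalar(neg_rv, bounds=(old - span, old + span),
                                      method='bounded').x
            shift = max(shift, abs(X[i, j] - old))
        if shift < tol:                       # Step 2: stop when stable
            break
    return X
```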
3. Numerical illustrations
The following examples illustrate the behaviour of matrix correlation imputation in
three situations which are fairly different both in the characteristics of the data and
the number of missing values. A comparison with alternative procedures - imputa-
tion of means, multiple regression - is also given.
Example 1. The data set includes p = 4 measures of the school performance of a sample of n = 85 students of the faculty of sociology at the university of Trento. The measures are (1) final secondary-school mark, (2) average score in the university examinations, (3) and (4) mathematics and sociology examination scores, respectively. Variable 1 is measured on a sixty-point scale, with 36 and 60 corresponding to the pass-mark and maximum mark, respectively. The other variables are measured on a thirty-point scale, with 18 and 30 corresponding to the pass-mark and maximum mark. In the data set there are q = 2 missing values, since one student did not record the final secondary-school mark (the other data are x_2 = 28, x_3 = 21, x_4 = 28) and another one did not record the average examination score (the other data are x_1 = 42, x_3 = 25, x_4 = 28).
Imputation of the means over complete cases gives θ̂_AV,1 ≈ 49.6, θ̂_AV,2 ≈ 26.6. The predictions derived from multiple regression are θ̂_MR,1 ≈ 51.2 (average examination
Figure 2: Principal components of crimes data for 16 American cities (numbers cor-
respond to cities); bold (italic) numbers are cities with discarded data estimated by
optimization of LS (multiple regression).
score, mathematics and sociology scores as predictors), θ̂_MR,2 ≈ 25.7 (final secondary-school mark, mathematics and sociology scores as predictors). The optimal θ-vectors according to RV and LS are θ̂_RV ≈ (45.5, 34.5)^T, θ̂_LS ≈ (46.8, 26.0)^T. The optimal values of RV and LS are rather low (RV_max ≈ .353, LS_max ≈ .346), which suggests that the optimal four-dimensional configuration X(θ̂_RV) (X(θ̂_LS)) is not very similar to the two-dimensional configuration of the complete-data variables X_1. Moreover, Figure 1 reveals that the surface described by RV(θ) in a neighbourhood of θ̂_RV is almost flat in the direction of θ_2: this means that the value of θ̂_RV,2 can be altered without a substantial reduction in the value of RV(θ). Predictions derived from multiple regression should also be interpreted with caution, since the squared multiple correlations are low (.156 in the estimation of θ_1, .480 in the estimation of θ_2). It is clear that the estimates of θ_1 vary according to the imputation method, whereas the estimates of θ_2 are similar for all of the methods except optimization of RV. (However, θ̂_RV,2 ≈ 34.5 is not valid, since the maximum score is 30; this surprising result can be due to the fact that RV(θ) is almost constant in the direction of θ_2.) The discrepancy between the multiple regression and matrix correlation estimates of θ_1 follows from the "global" character of the first method and the "local" character of the second one. Multiple regression uses mainly the positive correlation between dependent and explanatory variables (in particular, average examination score) in the complete data set, thus producing a value somewhat greater than the mean. On the contrary, optimization of RV (LS) looks for the value of θ_1 for which the position of the missing-data student is as near as possible to the positions of the complete-data students with similar values of the average examination score and mathematics and sociology scores. Inspection of the data shows that this value is about 46.
Example 2. To explore the validity of the iterative optimization of RV(θ) and LS(θ), we discarded, at random, q = 4 values from the city-crimes data set described by Everitt (1984).

Table 1: Estimates of discarded values from city-crimes data (AV: average; MR: multiple regression; RV, LS: optimization of RV or LS; *: squared multiple correlation in brackets; **: results from iterative optimization in brackets).

The variables are crime rates for p = 7 different crimes in n = 16 American cities. The minimum, maximum and average (absolute) correlations are r_{1,6} ≈ .050, r_{2,4} ≈ .772 and .403, respectively. The scatter plot of the first three prin-
cipal components of the standardized data is shown in Figure 2. The first component
is interpreted as a size factor, with high scores associated with low levels of crime.
The second is a shape factor, contrasting the first four variables ("violent" crimes) to
the last three ("non-violent" crimes); high values correspond to a prevalence of "non-
violent" crimes. The third component is easily interpreted, being highly correlated
with variable no. 7 (auto-theft).
In Table 1 we record the discarded values and their matrix correlation imputations,
obtained by direct and iterative optimization. Results from imputation of means and
multiple regression are also given. Iterative estimates derived from RV are better
than global estimates in three cases out of four. Iterative estimates derived from LS
are very near to the global ones, except X2,6' Global and local optima are almost
equal (RV(ORV) ='= .921, RV(7J RV ) ='= .920, LS(Ois) ='= .774, LS(7JLS ) ='= .771), this
maybe indicating, as in Example 1, that the surfaces RV(O) and LS(O) are not very
steep in a neighbourhood of 0'. Finally, the results in Table 1 suggest that, in this
data set, matrix correlation imputation (in particular, optimization of LS) might
be superior to multiple regression imputation. This is confirmed by the scatter plot
in Figure 2 where, together with the complete-data units no. 1, 2, 10 and 15, we
represent as supplementary points the same units with discarded values replaced by
imputations derived from optimization of LS and multiple regression. It is clear that
the supplementary points corresponding to LS are closer to the true ones than those
corresponding to multiple regression.
Example 3. We consider the sons data discussed by Seber (1984): the n = 25 units form a compact cluster of points and the 4 variables are positively correlated. The minimum, maximum and average correlations are r_{2,3} ≈ .693, r_{3,4} ≈ .839 and .732, respectively. We also consider an artificial data set with n = 15, p = 4, whose characteristics mimic the city-crimes data. The minimum, maximum and average (absolute) correlations are |r_{1,2}| ≈ .126, |r_{2,3}| ≈ .889, and .387, respectively.
To assess the behaviour of imputation methods, all possible pairs of points are dis-
carded in turn from the configuration of X3 and X 4 , and the omitted values are
estimated by averages over complete data, multiple regressions using Xl and X 2 as
predictors, optimizations of RV and LS. As a summary of the results, the "sampling"
distribution of the sum of the squared deviations between true values and estimates
and the percentage of best results, i. e., minimum squared errors, are computed. Ac-
cording to the frequency distributions of squared errors in Table 2, the best results
are given by multiple regression for the sons data, and by the optimization of LS for the artificial data. In both cases, optimization of LS produces the highest percentage of minimum squared errors.
4. Concluding remarks
Imputation of values to non-observed data requires a criterion that is subjectively, if not arbitrarily, chosen. Our criterion is to fill the gaps in such a way that the partial complete-data configuration and the total "filled-in" configuration are as similar as possible according to a measure of matrix correlation. We confined the attention to coefficients invariant under separate orthogonal transformations of the matching configurations, but other choices are possible, such as coefficients that are invariant under linear transformations. The computational effort is not negligible - even in the simplest cases optimization routines are necessary - but this drawback is bearable if the aim is a truly multivariate solution, using all the available information.
Preliminary results suggest that when the data are homogeneous, with strong linear relations between the complete-data variables and those with missing values, imputation based on linear regression is superior. Conversely, when the units belong to different groups and there are no important linear relations, the matrix correlation method is more reliable. In these situations, LS is often better than RV.
5. References
Crettaz De Roten, F. and Helbling, J.-M. (1991): Une estimation de données manquantes basée sur le coefficient RV. Revue de Statistique Appliquée, 39, 47-57.
Everitt, B. S. (1984): An Introduction to Latent Variable Models. Chapman and Hall.
Romanazzi, M. (1995): Missing values imputation and matrix correlation. Quaderni di Statistica e Matematica Applicata alle Scienze Economico-Sociali, 15, 41-59.
Seber, G. A. F. (1984): Multivariate Observations. Wiley.
Recent Developments in Three-Mode Factor Analysis:
Constrained Three-Mode Factor Analysis and Core Rotations
Henk A.L. Kiers
Department of Psychology
University of Groningen
Grote Kruisstraat 211
9712 TS Groningen
The Netherlands
1. Introduction
1.1 Three-way data and three-way methods
Three-way data are data associated with three entries. Three-way data are collected in
various disciplines. For instance, in the behavioral sciences, three-way data may
consist of scores of a set of individuals on a set of variables at different occasions. As
another example, in spectroscopy, three-way data are obtained when measuring
absorbed energy at various absorption levels on various mixtures of substances that
have been exposed to various sorts of light emission.
Several methods have been proposed for the analysis of three-way data (for overviews
see Law, Snyder, Hattie & McDonald, 1984; Coppi & Bolasco, 1989; Carlier et ai.,
1989; Kiers, 1991). The most popular methods are probably Three-Mode Factor
Analysis (3MFA; Tucker, 1966; Kroonenberg & De Leeuw, 1980) and CANDE-
COMP/PARAFAC (Carroll & Chang, 1970; Harshman, 1970; Harshman & Lundy,
1984). In 3MFA the data Xijk are modelled by
$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} a_{ip}b_{jq}c_{kr}g_{pqr} + e_{ijk}, \tag{1}$$
i = 1, ..., I, j = 1, ..., J, k = 1, ..., K, where a_ip, b_jq, c_kr are elements of the matrices A, B, and C of orders I by P, J by Q, and K by R, and the additional parameters g_pqr denote the elements of the P by Q by R so-called "core array". The matrices A, B, and C
can be considered component matrices for "idealized subjects" (in A), "idealized
variables" (in B), and "idealized occasions" (in C), respectively. The elements of the
core indicate how the components from the different modes interact. The model is
fitted to the data by minimizing the sum of squared error terms. In this minimization,
the component matrices can be taken columnwise orthonormal without loss of fit,
since orthonormalizations of A, B. and C can be compensated for by transformations
of the core. In fact, any nonsingular transformation of the component matrices can be
compensated by applying the inverse transformation to the core. As a consequence,
the model parameters have a great deal of transformational freedom, comparable to
the rotational indeterminacy in factor analysis.
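As a concrete reading of model (1), the structural part is just a contraction of the core with the three component matrices; the one-liner below (names are mine) reproduces it for given A, B, C and core G:

```python
import numpy as np

def tucker_reconstruct(A, B, C, G):
    """Model part of (1): x_ijk = sum_pqr a_ip b_jq c_kr g_pqr, with
    A (I x P), B (J x Q), C (K x R) and core G (P x Q x R)."""
    return np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)
```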
In the CANDECOMP/PARAFAC model the data are modelled by

$$x_{ijk} = \sum_{r=1}^{R} a_{ir}b_{jr}c_{kr} + e_{ijk}, \tag{2}$$

i = 1, ..., I, j = 1, ..., J, k = 1, ..., K, where a_ir, b_jr, c_kr again denote elements of matrices A, B, and C. As has been noted by Carroll and Chang (1970, p. 312), the PARAFAC model can be considered as a version of the 3MFA model where the core is constrained to be "superdiagonal" (which implies that g_pqr is unconstrained if p=q=r and g_pqr is constrained to 0 otherwise). It follows that, if P=Q=R, the 3MFA fit is
always at least as good as the PARAFAC fit, because the 3MFA model uses not only
the superdiagonal elements of the core, but also the off-superdiagonal elements, which
may considerably enhance the fit. In contrast to the 3MFA model, the parameters in
the CANDECOMP/PARAFAC model are (usually) unique, up to scalar multiplica-
tions and permutations.
To increase insight into the difference between the two models and into the role of the core in the 3MFA model, we use a (simplified) tensorial description of the two models. Considering x as a vectorized version of the modelled three-way array, and e as a vector with error terms, the 3MFA model can be written as

$$x = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} g_{pqr}\,(a_p \otimes b_q \otimes c_r) + e, \tag{3}$$

and the CANDECOMP/PARAFAC model as

$$x = \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r + e, \tag{4}$$

where a_p, b_q, and c_r denote columns of A, B, and C.
It can now be seen easily that the essential difference between the 3MFA model and the CANDECOMP/PARAFAC model consists of the fact that the latter contains only a subset of the triple tensor products of the former. Specifically, the PARAFAC model contains only the triple products for which p=q=r. Another difference between the descriptions in (3) and (4) is that in (3) all triple tensor products are weighted by an element of the core matrix, whereas such weights in (4) are missing. This difference, however, is nonessential, because in (4) these weights can be understood to be subsumed in the columns of A, B or C.
Kroonenberg (1983, p.58) has observed that if the 3MFA core can be simplified into
a core with diagonal core planes, then the 3MFA model has the same form as the
CANDECOMP/PARAFAC model (which, as we saw above, is considerably less
complicated than the 3MFA model); it should be noted that the CANDECOMP/PA-
RAFAC model thus turns out to be equivalent to two different forms of the 3MFA
model, that is, not only the 3MFA model with a superdiagonal core, but also a
different 3MF A model employing a core with diagonal frontal planes. Kroonenberg
(1983, Chapter 5) has mentioned and demonstrated that procedures for diagonalizing
the frontal planes of a core that have been investigated in the related context of the
IDIOSCAL model (Cohen, 1974, 1975; MacCallum, 1976; De Leeuw & Pruzansky,
1978) apply equally to the 3MFA model. Thus, the first attempts at simplifying the
3MFA core consist of diagonalizing the frontal planes of the core by means of
nonsingular transformations. By applying nonsingular transformations to the core, and
the inverse transformations to A, B, and C, the fit of the model is unaffected. In this
way, the simplicity of the CANDECOMP/PARAFAC model is approximated, without
Obviously, the more elements of the core are constrained to zero, the poorer the fit of the model, and hence Constrained 3MFA (C3MFA) seems merely to offer a set of compromise methods, in which the benefit of a simpler solution can only be attained at the cost of a poorer fit. Fortunately, in practice, the costs of using a considerably simpler solution are usually low. Experience with C3MFA indicates that a fit almost as good as that of the full 3MFA model can often be attained by models with only a few more triple product terms than the CANDECOMP/PARAFAC model. In fact, it can be proven that the core of a 3MFA model can always be constrained to have a considerable number of zero elements without loss of fit. For instance, for P by Q by R cores with P = QR − 1, Murakami, Ten Berge and Kiers (1996) have shown that at least QR(QR−2) − (R−2) elements can be constrained to zero without loss of fit. For example, in a 5×3×2 core array as many as 24 of the 30 elements can be constrained to zero. The possibility of costlessly constraining core elements in other situations is still under study, but experience suggests that usually (far) more than half of the elements can be constrained to zero without affecting the fit.
As has been seen above, an important feature of C3MFA is that it offers models that are considerably more simple than the full 3MFA model and still fit nearly as well. An added property of some C3MFA models is that they give unique solutions, just as CANDECOMP/PARAFAC. In particular, Kiers, Ten Berge and Rocci (in press) have shown that C3MFA employing 3×3×3 cores yields unique solutions if the cores are constrained to have zeros in the positions indicated by 0 in the core, the frontal planes of which are given below:

    Y 0 0      0 0 X      0 X 0
    0 0 X      0 Y 0      X 0 0      (5)
    0 X 0      X 0 0      0 0 Y
Just as Kroonenberg (1983), Kiers (1992) searched for methods for rotation of the
core such that the core approximates a core associated with the CANDECOMP/PA-
RAFAC model. However, contrary to Kroonenberg's approach, Kiers suggested
rotating the core to approximate superdiagonality (rather than diagonality of the core
planes). The rationale behind this approach is twofold. On the one hand, the approximation is expected to yield more small-sized elements, simply because it explicitly aims at finding more elements close to zero. On the other hand, a 3MFA model employing a superdiagonal core is more directly related to the CANDECOMP/PARAFAC model than one employing a core with diagonal planes. When using the former, the component matrices are directly related to those of CANDECOMP/PARAFAC; when using the latter, one of the matrices can only be obtained after a transformation. Especially when only orthonormal rotations of the core are considered, the latter (usually nonorthonormal) transformation makes the relation with the CANDECOMP/PARAFAC model more indirect and complicated.
Kiers (1992) studied procedures for both oblique and orthonormal rotation to
superdiagonality. Two of the procedures for oblique rotation turned out to behave
poorly in that they frequently led to degenerate solutions, in which the transformation
matrices tended to singular matrices, and hence inverse transformations could not be
computed sensibly anymore. A third procedure for oblique rotation, which behaved
better, was deemed uninteresting because it imposed unwanted restrictions on the
transformation matrices. A procedure for oblique rotation that does not share these
problems has been devised only recently as a variant of a procedure for rotation to arbitrary simple structure (Kiers, 1995; see below).
The procedure for orthonormal rotation proposed by Kiers (1992) behaved well
computationally. In a simulation study with randomly rotated superdiagonal cores of
sizes 2x2x2, 3x3x3 and 4x4x4 (ten of each), the method recovered all 30 superdiago-
nal cores. Further properties of this rotation procedure, like the derivation of
theoretical bounds to the "superdiagonalizability" of core arrays, have been studied by
Henrion (1993). Practical experience with the method, however, often indicated that
superdiagonality was not well approximated, and that the resulting rotated core was
sometimes far from simple. Further research, therefore, mainly aimed at rotation of
cores to an arbitrary simple structure.
Rather than rotating the core to an a priori specified form, one may choose to rotate
the core to an unspecified simple structure. One such procedure was employed by
Murakami (1983) who rotated the frontal core planes to simple structure by means of
varimax rotation applied to the transpose of the supermatrix containing all core planes
next to each other. In Murakami's case, the core could be rotated in only one
direction. The general situation where the core can be rotated in all three directions
such that some kind of overall simple structure is attained was first dealt with by
Kruskal (1988). He proposed a method called "tri-quartimax" rotation, which is a
procedure for maximizing a combination of normalized quartimax functions applied to
the supermatrices consisting of the frontal, lateral and horizontal planes, respectively,
of the core. Kruskal proposed to maximize this function over oblique rotations, thus
ignoring the fact that quartimax was originally proposed as a criterion for orthonor-
mal rotation. Therefore, there is little reason to expect that the properties of "ordi-
nary" quartimax carry over to tri-quartimax. Unfortunately, no published information
is available on the performance of the method, nor on its implementation.
Kruskal's (1988) idea has been an important point of departure for Kiers' (in press)
"three-mode orthomax" rotation procedure. "Orthomax" (Jennrich, 1970) is a general
family of simple structure criteria containing the well-known varimax (Kaiser, 1958)
and quartimax (Carroll, 1953; Ferguson, 1954; Neuhaus & Wrigley, 1954; Saunders,
1953) criteria as special cases. Kiers' proposal consists of the application of the
orthomax criterion to the supermatrices consisting of the frontal, lateral and horizon-
tal planes of the core, thereby using a generalization of Kruskal's (1988) criterion. In
contrast to Kruskal's method, three-mode orthomax is restricted to orthonormal
rotations, thus respecting the orthonormality of the 3MFA component matrices.
Because of the latter fact, the method should be used with care: If the rotated core is not as simple as one would like, one should take into account that further simplicity might be attainable upon relaxing the orthonormality restriction. It seems, however, that in practice the method often gives quite simple solutions for the core, despite the restriction of orthonormality on the rotations.
Three-mode orthomax is not just a single method, but a class of methods. First,
different methods arise from different choices of the simplicity criterion (e.g.,
quartimax, varimax, or other orthomax criteria). Second, three-mode orthomax allows
for rotation of the core in all three directions simultaneously, but also, if desired, in
two or only one direction. Similarly, the criterion can measure simplicity of the core
in all three directions or a subset of those. The quartimax criterion always measures
simplicity in all three directions, because it operationalizes amount of simplicity by
the sum of fourth powers of all elements. Clearly, the order in which these fourth
powers are considered is irrelevant. For other criteria, the situation is quite different.
For instance, varimax measures simplicity as a sum of variances of columns of
squared loadings. In three-mode varimax these columns can be constituted in three
different ways, depending on the mode of interest. For instance, varimax in direction
A takes the variances of the columns that each contain the elements related to one of
the entries of mode A (e.g. "idealized individual"). Hence, varimax in direction A
aims at large variation between the elements corresponding to the same idealized
individual. Varimax simplicity in direction A does not imply varimax simplicity in
direction B, because large variances of squared loadings could be found per entry of
mode A, even when elements related to one entry in mode B are all equal. Finally, it
is possible to attach different weights to the criteria used for simplicity in direction A,
B, and C. Thus three-mode orthomax is a very flexible approach. Despite this
flexibility, the procedure for optimizing the three-mode orthomax criterion is
relatively simple: It consists of iteratively updating estimates for the three rotation
matrices by applying an ordinary orthomax procedure to a supermatrix computed
from the original core and the current values for the rotations. The method is
somewhat sensitive to local optima, but because the algorithm converges quickly,
using several restarts is an adequate and feasible way of dealing with this problem.
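A bare-bones version of this iterative scheme (varimax as the orthomax criterion, equal weights, rotation over all three modes) might look as follows; the function names, the SVD-based varimax routine, and the fixed number of sweeps are my own choices, not Kiers' published implementation:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Ordinary orthonormal varimax: returns T so that L @ T is simple."""
    n, k = L.shape
    T = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        LT = L @ T
        u, s, vt = np.linalg.svd(
            L.T @ (LT**3 - LT @ np.diag((LT**2).sum(0)) / n))
        T = u @ vt
        d_new = s.sum()
        if d_new - d < tol * d:
            break
        d = d_new
    return T

def three_mode_varimax(G, max_sweeps=50):
    """Cycle over the three modes, each time applying ordinary varimax to
    the transposed mode-n unfolding of the current rotated core."""
    P, Q, R = G.shape
    S, T, U = np.eye(P), np.eye(Q), np.eye(R)
    rotate = lambda: np.einsum('pqr,pa,qb,rc->abc', G, S, T, U)
    for _ in range(max_sweeps):
        H = rotate()
        S = S @ varimax(H.reshape(P, Q * R).T)                      # mode A
        H = rotate()
        T = T @ varimax(np.moveaxis(H, 1, 0).reshape(Q, P * R).T)   # mode B
        H = rotate()
        U = U @ varimax(np.moveaxis(H, 2, 0).reshape(R, P * Q).T)   # mode C
    return rotate(), S, T, U
```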
It can be seen that the varimax rotated core is considerably simpler than the unrotated core: most of its elements are extreme, and it has far fewer medium-sized elements (in the intervals [-.5,-.2] or [.2,.5], say) than the unrotated core.
Three-mode orthomax can also be used indirectly for oblique core rotations, for instance, by combining it with normalization operations, in analogy to Harris and Kaiser's (1964) orthoblique approach. Recently, Kiers (1995) proposed a different
procedure for oblique simple structure rotation of the core. He developed a straight-
forward generalization of the SIMPLIMAX procedure for oblique rotation of a
loading matrix to an optimal simple target (Kiers, 1994). Specifically, in three-mode
SIMPLIMAX, oblique rotation matrices for all three modes are found in such a way
that the m (a number to be specified in advance) smallest elements of the rotated core
have a minimal sum of squares (σ). The technique thus aims at simple structure in a very explicit way. For example, when a 3x3x3 core is analyzed by SIMPLIMAX with m=20, the method finds a core in which 20 elements are optimally close to zero (in the sense that their sum of squares is minimized), whereas the other seven elements will be relatively large. How close to zero the smallest elements
are depends on the choice for m. If m is chosen very small, then the method will
often succeed in setting all m elements to exactly 0, but as m increases, the small
values (or at least their sum of squares) will increase as well. Because of this trade-
off relationship one should apply SIMPLIMAX with different values for m, and search for the solution that has sufficiently many small elements that are sufficiently
close to zero. It should be noted that in three-mode SIMPLIMAX it is, just as in
three-mode orthomax, possible to rotate over only one or two modes, rather than all
three.
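The criterion itself is easy to state in code. A small sketch, assuming the oblique rotation search is carried out elsewhere, that only evaluates σ for a given rotated core and a given m (the function name is ours):

import numpy as np

def simplimax_sigma(rotated_core, m):
    # Sum of squares of the m absolutely smallest elements of the
    # rotated core -- the quantity three-mode SIMPLIMAX minimizes.
    v = np.sort(np.abs(np.asarray(rotated_core)).ravel())
    return float(np.sum(v[:m] ** 2))

Scanning simplimax_sigma over a range of m values, as recommended above, makes the trade-off between the number of small elements and their size directly visible.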
The method is computationally much less efficient than three-mode orthomax. The
algorithm for three-mode SIMPLIMAX consists of iterative application of the two-
mode SIMPLIMAX procedure, applied to supermatrices of frontal, lateral or horizon-
tal planes of the current rotated core matrix. This procedure turns out to be very
sensitive to local optima, and hence requires many starts from different starting
positions (say 200), which makes it relatively inefficient (e.g., about one hour for one full analysis of 200 runs on a 486 66MHz PC). Hence, better starting pro-
cedures, or approaches for avoiding local optima are called for. On the other hand, in
practice the cores are usually rather small, and not many different values of m need to
be tested, as in the empirical, and, as far as size is concerned, rather typical example
reported below.
Unrotated core array (Kroonenberg, 1994):

   24   -26    -5    -2     1     1
   18    11     9    -3    -3     0
   -2   -12    15    -1    -6     7

Core rotated to 14 zeros by three-way SIMPLIMAX (m = 14):

 35.4     0     0   3.1     0     0
    0  24.0     0     0     0     0
    0     0  18.9     0     0  12.4
The core array used in the present example was reported by Kroonenberg (1994), and
is based on a three-mode factor analysis on scores of 82 subjects measured on five
variables (pertaining to performance and drunkenness) at eight occasions (at which
different doses of alcohol had been administered to them). The present 3x3x2 core
array has been rotated here by means of three-way SIMPLIMAX. We used the values
m=12, ... ,16. For m=12 and m=13 three-way SIMPLIMAX found a solution in
which the smallest m elements were zero, up to the accuracy implied by the conver-
gence criterion used. In fact, it can be proven that any 3x3x2 core can be transformed into a core with as many as thirteen exactly zero elements (Ten Berge, 1995), which explains what we found here. Hence, the only nontrivial applications of SIMPLIMAX are those with m=14, m=15, and m=16. In each complete SIMPLIMAX analysis we used 200 random starts. The rotated cores, as well as the values of the function σ (the sums of the smallest squared core elements), are given in Table 2, with the (18-m) highest elements in bold face. It can be seen that the core rotated towards
14 zeros indeed gives only four important core elements, the others being about ten
times as small or smaller. Even in the core rotated to 15 zeros the high values are at
least eight times as large as the small values. However, in the core rotated to 16
zeros the smallest elements were no longer negligible compared to the high values. It
can hence be concluded that this 3x3x2 core can be simplified tremendously, and, in
fact, the main relations between components can be described in three or four terms.
Therefore, even though this procedure may destroy some of the simplicity of the
component matrices (Kroonenberg's matrix was obtained after 'simplicity' rotations of
two of the component matrices), the gain in simplicity of the core is worth consider-
ing.
The algorithms used in SIMPLIMAX can also be used for rotations to a target in which the positions of the small elements are fixed. In this way, the procedure can also be used for superdiagonalization, and hence a new technique has become available for superdiagonalization by means of oblique rotation.
4. Discussion
In the present paper, an overview has been given of recent developments concerning
simplification of the core. Simplification of the core has been proposed as an aid for
interpreting a 3MFA solution. The interpretation of a 3MFA solution usually starts by
giving interpretations to the components for the three modes, and next, the relations
between these modes, as reflected in the core, are considered. The developments
discussed here all focused on simplifying the core, none aimed at simplifying the
interpretation of the component matrices. In fact, methods yielding simplified cores
may yield component matrices that are hard to interpret. In other words, from the
former situation where component matrices could be interpreted easily, but the core
made the results rather complex, we are now at the other extreme where the relations
are made simple, but the component matrices may be rather complex. One way to
deal with this problem, as suggested by Kiers (1995, in press) is to consider rotation
of the core in only one or two directions, and not affect those component matrices
that are very important in interpretation, and for which simple interpretations are
available. Of course, such approaches do not always work. On the one hand, it is
possible that for all three component matrices simple interpretations are available,
which one does not wish to disturb; on the other hand, it is possible that using only
one or two rotation directions no longer leads to a simple core. However, even in
those situations it is conceivable that a solution exists in which the core as well as the
component matrices are reasonably simple. A way to find such solutions would be to
define and optimize criteria that combine simplicity of the component matrices and
simplicity of the core. Because of the interdependence of the rotation of the core and
of the component matrices, optimization of such combined simplicity criteria seems
far from trivial.
An alternative position can be taken as well: Methods that simplify the core can be
seen as methods that simplify the structure of the model, just as CANDECOMP/PA-
RAFAC is a simpler model than 3MFA. The fact that the components themselves are
not related in a simple way to the original individuals, variables, occasions, or
whatever, can be deemed less disturbing. For instance, it can be deemed acceptable
that, when a 3x3x3 core is reduced to only 4 nonnegligible elements, the 4 ensuing
tensor product terms are somewhat complicated to interpret. The alternative of trying
to grasp up to 27 interactions between (more simple) components does not seem more
attractive. Using a few tensor product terms based on more complex components is a
way of moving most of the interactions into the components, and once these are
conceptualized, the model becomes easy to grasp.
References:
Carlier, A., Lavit, Ch., Pages, M., Pernin, M.O. and Turlot, J.C. (1989): Analysis of data tables
indexed by time: a comparative review. In: Multiway data analysis, Coppi, R. and Bolasco, S. (Eds.),
85-101. Amsterdam, Elsevier Science Publishers.
Carroll, J.B. (1953): An analytic solution for approximating simple structure in factor analysis,
Psychometrika, 18, 23-38.
Carroll, J.D. and Chang, J.-J. (1970): Analysis of individual differences in multidimensional scaling
via an n-way generalization of "Eckart-Young" decomposition, Psychometrika, 35, 283-319.
Cohen, H.S. (1974): Three-mode rotation to approximate INDSCAL structure (TRIAS), Paper presented
at the Psychometric Society Meeting, Palo Alto.
Cohen, H.S. (1975): Further thoughts on three-mode rotation to INDSCAL structure, with jackknifed
confidence regions for points, Paper presented at U.S.-Japan seminar on Theory, Methods and
Applications of Multidimensional Scaling and Related Techniques. La Jolla.
Coppi, R. and Bolasco, S. (Eds.) (1989): Multiway data analysis, Amsterdam, Elsevier Science
Publishers.
De Leeuw, J. and Pruzansky, S. (1978): A new computational method to fit the weighted Euclidean
distance model, Psychometrika, 43, 479-490.
Ferguson, G.A. (1954): The concept of parsimony in factor analysis, Psychometrika, 19, 281-290.
Harris, C.W. and Kaiser, H.F. (1964): Oblique factor analytic solutions by orthogonal transformations,
Psychometrika, 29, 347-362.
Harshman, R.A. (1970): Foundations of the PARAFAC procedure: models and conditions for an
"explanatory" multi-mode factor analysis, UCLA Working Papers in Phonetics, 16, 1-84.
Harshman, R.A. and Lundy, M.E. (1984): The PARAFAC model for three-way factor analysis and
multidimensional scaling, In: Research methods for multimode data analysis, Law, H.G., Snyder,
C.W., Hattie, J.A. and McDonald, R.P. (Eds.), 122-215, New York, Praeger.
Henrion, R. (1993): Body diagonalization of core matrices in three-way principal components analysis:
Theoretical bounds and simulation, Journal of Chemometrics, 7, 477-494.
Jennrich, R.I. (1970): Orthogonal rotation algorithms, Psychometrika, 35, 229-235.
Kaiser, H.F. (1958): The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187-200.
Kiers, H.A.L. (1991): Hierarchical relations among three-way methods, Psychometrika, 56, 449-470.
Kiers, H.A.L. (1992): TUCKALS core rotations and constrained TUCKALS modelling, Statistica
Applicata, 4, 659-667.
Kiers, H.A.L. (1994): SIMPLIMAX: Oblique rotation to an optimal target with simple structure,
Psychometrika, 59, 567-579.
Kiers, H.A.L. (1995): Three-way SIMPLIMAX for oblique rotation of the three-mode factor analysis
core to simple structure, Manuscript submitted for publication.
Kiers, H.A.L. (in press): Three-mode Orthomax rotation, Psychometrika.
Kiers, H.A.L., ten Berge, J.M.F. and Rocci, R. (in press): Uniqueness of three-mode factor models
with sparse cores: The 3x3x3 case, Psychometrika.
Kroonenberg, P.M. (1983): Three-mode principal component analysis: Theory and applications,
Leiden, DSWO press.
Kroonenberg, P.M. (1994): The TUCKALS line: A suite of programs for three-way data analysis,
Computational Statistics and Data Analysis, 18, 73-96.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by
means of alternating least squares algorithms, Psychometrika, 45, 69-97.
Kruskal, J.B. (1988): Simple structure for three-way data: A new method intermediate between 3-mode
factor analysis and PARAFAC-CANDECOMP, Paper presented at the 53rd Annual Meeting of the
Psychometric Society, Los Angeles, June 27-29.
Law, H.G., Snyder, C.W., Hattie, J.A. and McDonald, R.P. (Eds.)(1984): Research methods for
multimode data analysis, New York, Praeger.
MacCallum, R.C. (1976): Transformations of a three-mode multidimensional scaling solution to
INDSCAL form, Psychometrika, 41, 385-400.
Murakami, T. (1983): Quasi three-mode principal component analysis - A method for assessing factor
change, Behaviormetrika, 14, 27-48.
Murakami, T., Ten Berge, J.M.F. and Kiers, H.A.L. (1996): A class of core matrices in three-mode
principal components analysis which can be transformed to have a majority of vanishing elements,
Manuscript submitted for publication.
Neuhaus, J.O. and Wrigley, C. (1954): The quartimax method: An analytic approach to orthogonal
simple structure, British Journal of Mathematical and Statistical Psychology, 7, 81-91.
Rocci, R. (1992): Three-mode factor analysis with binary core and orthonormality constraints, Journal
of the Italian Statistical Society, 3, 413-422.
Saunders, D.R. (1953): An analytic method for rotation to orthogonal simple structure, Research
Bulletin, RB 53-10, Princeton, New Jersey, Educational Testing Service.
Ten Berge, J.M.F. (1995): How sparse can core arrays get: The 3x3x2 case, Unpublished note.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis, Psychometrika, 31,
279-311.
Tucker2 as a Second-order
Principal Component Analysis
Takashi Murakami
School of Education, Nagoya University
Furo-cho, Chikusa-ku, Nagoya
464-01, Japan
Summary: Statistical properties of the Tucker2 (T2) model, a simplified version of three-mode principal component analysis (PCA), are investigated aiming at applications to the study of factor invariance. The T2 model is derived as a restricted form of second-order PCA in the situation comparing component loadings and component scores across occasions. Several statistical interpretations of coefficients obtained from the least squares algorithm of T2 are proposed, and several aspects of T2 are shown to be natural extensions
of characteristics of classical PCA. A scale free formulation of T2 and a new derivation of
the algorithm for the large sample case are also shown. The relationship with a generalized
canonical correlation model is suggested.
1. Introduction
Consider a set of data collected by administering an inventory consisting of p items to
n subjects on m occasions with or without changing the conditions. In this situation,
we may be interested in the comparisons between factor loadings of the same items
(variables) and between factor scores of the same subjects on different occasions. This
is the factor invariance problem.
In the present article, we will investigate how to use the Tucker2 (T2) model, a sim-
plified version of three-mode principal component analysis (three-mode PCA; Tucker,
1966; Kroonenberg & De Leeuw, 1980), as a tool for the study of factor invariance.
While T2 is a very general model of multidimensional analysis of three-way data,
we will specify the manner of preprocessing of input data and of transformation of
output coefficients which are appropriate to the restricted purpose. In Section 2, we
will recapitulate several formulations of classical PCA, and distinguish two solutions;
PCA-1 and PCA-2. We will prefer PCA-1 because of several favorable aspects; rota-
tional freedom, statistical convenience in interpretations of coefficients, and the scale
free derivation. In Section 3, we will formulate T2 as a restricted second-order PCA,
and will show that the least squares solution obtained by the TUCKALS2 algorithm
(Kroonenberg & De Leeuw, 1980) has almost all properties of classical PCA, which
",ill be listed in Section 2. We will find two classes of solutions; T2-1 and T2-2,
neither are the same as the standard formulation. We will conclude the T2-1 is more
favorable due to the similar convenient properties to PCA-1 while T2-2 is also useful
for straightforward derivation of the algorithm for large sample data.
F = P_qT and A* = Q_qD_qT,   (2)

where P_q contains the first q columns of P, Q_q contains the first q columns of Q, D_q is the upper left q x q submatrix of D, T is a q x q arbitrary orthonormal matrix, and q is the number of components satisfying q < p (cf. Ten Berge, 1993, pp. 35-36). The
formulation can be seen to be a kind of regression problem where columns of F are
predictor variables and elements of A* are regression coefficients to predict columns
of Z. The matrix A* in this sense is sometimes called the pattern matrix.
Because constraints are essentially inactive in this problem (Ten Berge, 1993, p. 35, etc.), we can change the constraints without loss of optimality; if we minimize ||Z - F*A'||^2 subject to A'A = I, we obtain

F* = P_qD_qU and A = Q_qU,   (4)

where U is an arbitrary square orthonormal matrix. In this formulation, it is usual to set U = I, since otherwise U destroys the orthogonality of the columns of F*.
We will call the class of formulations of PCA leading to (2) PCA-1, and the one producing (4) PCA-2. Solutions of PCA with various criteria and constraints result in either of them as long as the analysis is based on the centered and standardized data matrix Z. Of course, we can transform one solution into the other through F* = FD_q and A = A*D_q^{-1} if T = U = I, and F*A' = FA*' irrespective of T and U.
For example, we can see that both F* and F are matrices of linear composites of the input variables, namely, the columns of F* and F exist in the column space of Z, although this fact is not explicit in (1) and (3). Hence, we know that the minimum of the function f(F, A*) in (1) is equal to the minimum of

f(V, A*) = ||Z - ZVA*'||^2,   (7)

where V is the p x q matrix of weights constrained as V'RV = I, and given by V = Q_qD_q^{-1}T (Ten Berge & Kiers, 1996). This can be much simplified in PCA-2; substituting (6) into (3), we have
formula, for example, to transform the result of SVD of raw score matrix X into that
of Z. Because variables do not share the same measurement unit and origin in many
applications, centering and standardizing are practically useful. We want a scale free
formulation of PCA to justify it.
Meredith and Millsap (1985) proposed a scale free formulation justifying the standardization based on the maximization of the sum of squared multiple correlations. We will extend it to a justification of the centering.
Let us define the matrix A† as

A† = D_S^{-1/2}X'F,   (11)

where X is the matrix of raw input data, and D_S is the diagonal matrix of variances of the columns of X. We will introduce one more constraint, F'1 = 0, as well as F'F = I. If we define the n x n matrix J = I - n^{-1}11', then F'1 = 0 means F = JF (Ten Berge, 1993, p. 66), hence X'F = X'JF. Because X'J = (X - 1x̄')', we obtain that A† = Z'F, where Z' = D_S^{-1/2}(X - 1x̄')'. As a result, the elements of A† are correlation coefficients between variables and components although the variables are not centered in (11), and we know that A† is equal to A* in (5).
:\"ow, we have justified not only the preprocessing of centering and standardization
but also constraints of PCA-I. In other words. we can use classical PCA of standard-
ized variables with orthogonal rotation as the maximization of tr At' A t under the
constraints of F'l = 0 and P' F = I. Our result may not be so impressive because
it looks to depend solely on the word correlation. However, the same rationale will
produce the result which is not necessarily tri"vial on T2 in the next section.
Here, we will list the formulae for obtaining the final output of PCA-1, to compare them with those for T2: The loading matrix is given by

A* = K_qΛ^{1/2}T,   (12)

where Λ is the diagonal matrix of the q largest eigenvalues of R, K_q the matrix of corresponding eigenvectors as before, and T an arbitrary orthogonal matrix determined by, say, the varimax method. Then, the matrix of component scores is obtained by

F = ZK_qΛ^{-1/2}T,   (13)

which is equivalent to ZV in (7). These equations are the same as those resulting from the algorithms for PCA ("factor analysis with unit diagonals") in many standard statistical packages.
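In numpy, (12) and (13) amount to an eigendecomposition of R. The sketch below assumes the normalization used here, in which Z'Z equals the correlation matrix R, and leaves the choice of the rotation T to the caller:

import numpy as np

def pca1(Z, q, T=None):
    # PCA-1 loadings A* (eq. 12) and component scores F (eq. 13).
    R = Z.T @ Z                        # correlation matrix when Z'Z = R
    evals, evecs = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1][:q]
    lam, Kq = evals[order], evecs[:, order]
    T = np.eye(q) if T is None else T
    A_star = (Kq * np.sqrt(lam)) @ T   # A* = Kq Lambda^{1/2} T
    F = Z @ (Kq / np.sqrt(lam)) @ T    # F  = Z Kq Lambda^{-1/2} T
    return A_star, F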
Whereas F is the matrix of q derived random variables, A* is a matrix of coefficients with a definite statistical meaning rather than values of random variables. As was mentioned before, elements of A* have two implications: coefficients of correlation and coefficients of regression. While the coefficients themselves will be objects of interpretation, individual values on variables per se are not usually interpreted, especially in large sample cases. Properties of components are interpreted through the coefficient matrix and the relationships with external information.
Our treatment of PCA reflects these asymmetries. For example, we are willing to
rotate the result of PCA-l because the orthogonal rotation keeps the orthonormality
of derived variables.
The reason why we have emphasized the asymmetric treatment is that T2 had been
formulated as a method of MDA rather than MVA in Gifi's sense. In the sequel,
we will not necessarily follow the standard symmetric notations and treatments of
three-mode PCA as in, for example, Kroonenberg (1983) because we consider that
the factor invariance problem commonly involves the asymmetry.
3. T2 as a second-order PCA
3.1 Derivation of the T2 model as a second-order PCA
Let Z_k (k = 1, ..., m) be the n x p matrix of data on the k-th occasion, namely, the k-th frontal plane of a three-mode data array in the terminology of three-mode analysis. From the standpoint of the asymmetric roles of rows and columns of the data matrix mentioned above, we consider that the three-mode array consists of mp random variables. We will postpone specifying the manner of centering and standardization of Z_k, but we assume that Σ_{k=1}^m Z_k'Z_k = mI. We also assume that n > mp and that all mp columns of data are linearly independent.
The simplest way of applying classical PCA to these matrices may be the separate analysis of the p variables on each occasion, such as

Z_k ≈ F_kA_k',   k = 1, ..., m,   (14)

where F_k is an n x q (q < p) matrix of component scores, and A_k is a p x q matrix of component loadings. (We will adhere to PCA-1.) It looks easy to attain the two aims mentioned in the introduction, namely, comparisons between factor loadings and factor scores obtained on different occasions. To do so, one can compare matrices of loadings on different occasions directly, or compute indices such as coefficients of congruence between them (e.g. Ten Berge, 1986), and compute correlation coefficients between component scores. However, there are several problems in these methods.
First, the comparisons are not so easy when m and q are large. Second, rotation, which can be performed in each condition separately, may bring indeterminacy into any indices used for comparisons, for example, correlation coefficients between component scores. Third, there may be much redundancy in the mq columns of loadings and scores, which makes estimates of coefficients unstable and interpretations of results confusing.
A simple way to partially avoid these difficulties is applying PCA-1 to the mn x p matrix obtained by juxtaposing all frontal planes vertically;

Z_k ≈ F_kA*',   k = 1, ..., m,   (15)

where A* is a p x q common loading matrix, and we assume that p > q. (Here, we regard the data array as a sample of size nm on p variables temporarily.) Although (15)
is more parsimonious than (14), some of the mq columns of the F_k's can remain redundant. Hence, we will apply PCA-1 again to the n x mq matrix obtained by juxtaposing the F_k's horizontally. (We return to a sample of size n);

F_k ≈ GC_k',   k = 1, ..., m,   (16)

where C_k is a q x r matrix of second-order loadings and G is an n x r matrix of second-order component scores. From the relationship of ranks of matrices, it follows that r ≤ mq and q ≤ mr. By substituting (16) into (15), we can get an equation having the same form as T2: Z_k ≈ GC_k'A*'. This is a kind of second-order PCA with the equality restriction on first-order loadings (Bloxom, 1984), but ≈ does not mean the least squares approximation of the model. We will obtain the three matrices
simultaneously in the least squares sense by minimizing

f(G, C_k, A*) = Σ_{k=1}^m ||Z_k - GC_k'A*'||^2,   (17)

subject to G'G = I and Σ_{k=1}^m C_kC_k' = mI. We can impose constraints on G and C_k because the model of T2 is the product of three matrices. We will call the minimization of (17) the T2-1 problem. The original problem for T2 by Kroonenberg and De Leeuw (1980), with constraints different from those of T2-1, will be introduced later.
One can distinguish the change of loadings from the change of scores by checking the pattern appearing in C_k, provided the changes are not too drastic. Hence T2-1 can be used as a tool for the study of factor changes on the descriptive level. This is illustrated in Murakami (1983, pp. 31-34), and we will not repeat it here. We will only point out that the simple structure attained by orthogonal rotation in (43) is crucial for such interpretations.
Next, we redefine the first-order components in (15) as

F_k = GC_k',   k = 1, ..., m,   (18)

where G and C_k are given by the minimization of (17) rather than the heuristic solutions in (16). Correspondingly, we will call A* the matrix of the first-order loadings. F_k defined in (18) has some convenient aspects: First, it is orthonormal in the sense of m^{-1} Σ_{k=1}^m F_k'F_k = I. Second, as will be shown, the formula similar to (5),

A* = m^{-1} Σ_{k=1}^m Z_k'F_k,   (19)

will give the basis for a scale free formulation of T2. Third, another analogous equation is derived immediately from (18) by the use of G'G = I;

C_k = F_k'G,   k = 1, ..., m,   (20)

which means that C_k is a kind of structure matrix, whose elements are covariances between the first-order components and the second-order components.
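These relations are easy to verify numerically; a toy check with arbitrary dimensions (all names illustrative):

import numpy as np
rng = np.random.default_rng(0)
n, q, r, m = 100, 4, 2, 3
G, _ = np.linalg.qr(rng.standard_normal((n, r)))   # G'G = I
Cs = [rng.standard_normal((q, r)) for _ in range(m)]
Fs = [G @ Ck.T for Ck in Cs]                       # first-order components, (18)
# (20): C_k = F_k'G follows from G'G = I
assert all(np.allclose(Fk.T @ G, Ck) for Fk, Ck in zip(Fs, Cs))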
There may be another approach to the second-order PCA of three-mode data; we can derive it through the definition of linear composites. First, we define the matrix of the first-order composites for each occasion such as

F_k* = Z_kA,   k = 1, ..., m.   (21)

The weights held constant across occasions, such as used in (21), are sometimes called stationary weights (cf. Meredith & Tisak, 1982). Next, let us define the matrix of the second-order composites as

G* = Σ_{k=1}^m F_k*C̃_k.   (22)
Then, we can formulate the second-order PCA problem, in a way similar to Hotelling's PCA, as the maximization of the following function under the constraints of A'A = I and Σ_{k=1}^m C̃_k'C̃_k = I;

g(A, C̃_1, C̃_2, ..., C̃_m) = tr Σ_{k=1}^m Σ_{l=1}^m C̃_k'A'R_klAC̃_l,   (23)

where R_kl = Z_k'Z_l. C̃_k should be distinguished from C_k because they differ in the direction of the constraints. We will call the maximization of (23) the T2-2 problem.
As will be shown later, the relationship between G in (17) and G* in (22) is simple. However, the relationship between the first-order composites F_k* and the first-order components F_k defined in (18) is somewhat complicated and must be defined separately, because the former span an mq dimensional subspace of the mp columns of the Z_k's, but the latter exist only in the r dimensional column space of G. (Note that r ≤ mq.)
3.2 Equivalence of two formulations of second-order PCA to T2
Kroonenberg and De Leeuw (1980) defined TUCKALS2 as the algorithm minimizing the following function;

f(G, C_1*, ..., C_m*, A) = Σ_{k=1}^m ||Z_k - GC_k*'A'||^2,   (24)

where G is the n x r orthonormal matrix, C_k* the q x r frontal plane of a three-mode core matrix, and A the p x q orthonormal matrix. This is the original T2 formulation. Analogous to the case of (1) and (3), we can change the constraints without loss of optimality, hence we can also consider the minimization problem of (17) and of

f(G*, C̃_1, ..., C̃_m, A) = Σ_{k=1}^m ||Z_k - G*C̃_k'A'||^2,   (25)

subject to Σ_{k=1}^m C̃_k'C̃_k = I and A'A = I.
Assuming that we have the solution of (24), define

Λ = Σ_{k=1}^m C_k*C_k*' and Δ = Σ_{k=1}^m C_k*'C_k*.   (26)

Kroonenberg and De Leeuw (1980) showed that both Λ and Δ are the diagonal matrices of eigenvalues of positive definite matrices. Then,

A* = m^{-1/2}AΛ^{1/2}T and C_k = m^{1/2}T'Λ^{-1/2}C_k*U (with G replaced by GU),   (27)

and

G* = GΔ^{1/2} and C̃_k = C_k*Δ^{-1/2},   (28)

where T and U are arbitrary orthonormal square matrices. We did not introduce rotational freedom into (28) for the same reason as in the case of PCA-2. It is easy to verify that the matrices defined above produce the same optimum in (17) and (25) as that of (24), and that C_k, C̃_k, and G* satisfy their corresponding constraints.
Similar to the case of classical PCA, we can also convert the minimization problems into maximization ones. On the one hand, by applying regression theory to (17), and using Σ_{k=1}^m C_kG'GC_k' = mI, we obtain

A* = m^{-1} Σ_{k=1}^m Z_k'GC_k'.   (29)

On the other hand, using (25) and Σ_{k=1}^m C̃_k'A'AC̃_k = I, we obtain

G* = Σ_{k=1}^m Z_kAC̃_k,   (31)
which is equal to (22), and can be regarded as an extension of (6). Substituting this into (25), we also obtain a formula that is a natural extension of (8);

f(A, C̃_1, C̃_2, ..., C̃_m) = Σ_{k=1}^m ||Z_k - (Σ_{l=1}^m Z_lAC̃_l)C̃_k'A'||^2,   (32)

where the matrix AC̃_k has two roles: as a weight matrix and as a pattern matrix. It is easy to confirm that the minimization of (32) is equivalent to the T2-2 problem, the maximization of (23), because (32) can be written as mp - tr Σ_{k=1}^m Σ_{l=1}^m C̃_k'A'R_klAC̃_l. We also point out that the relationship between (29) and (31) is analogous to the dual relationship between (5) and (6).
As in the case of classical PCA, the implications of the two solutions are remarkably different. We prefer T2-1 to T2-2 for the same reason why we prefer PCA-1 to PCA-2: rotational freedom, the convenient properties of A*, and the (origin- and) scale free formulation derived below, notwithstanding the several attractive properties of T2-2 such as (32).
3.3 A scale free formulation
A scale free formulation can be derived as the basis of T2-1 in the same way as in classical PCA. We will start from the redefined structure matrix of the first-order composites;

A† = m^{-1}D_S^{-1/2} Σ_{k=1}^m X_k'GC_k',   (33)

where X_k is the matrix of raw input data on the k-th occasion, and D_S is the diagonal matrix of variances of the variables, which are centered in such a way as to transform A† into the correlation matrix.
The process is almost the same as in the case of classical PCA in Section 2.3. First, we will add one constraint, G'1 = 0, which means G = JG. Therefore, we have X_k'G = (X_k - 1x̄_k')'G, where x̄_k = n^{-1}X_k'1. Hence, we know that D_S must be defined as D_S = m^{-1} diag Σ_{k=1}^m (X_k - 1x̄_k')'(X_k - 1x̄_k').
This suggests that a sufficient condition for obtaining G such that A† defined in (33) has the interpretation of a structure matrix is the transformation

Z_k = (X_k - 1x̄_k')D_S^{-1/2},   (34)

which satisfies Z_k'1 = 0 and Σ_{k=1}^m Z_k'Z_k = mI, and we have A† = A*, where A* is given in (29).
Although the above discussion may look almost trivial, we should consider that there are some other possible methods of preprocessing. First, the seemingly plausible transformation Z_k° = (X_k - 1x̄')D_S^{-1/2}, where x̄ = (mn)^{-1} Σ_{k=1}^m X_k'1 and D_S = diag Σ_{k=1}^m (X_k - 1x̄')'(X_k - 1x̄'), is not a sufficient condition to make A† a structure matrix. Second, another transformation, Z_k° = (X_k - 1x̄_k')D_{S_k}^{-1/2}, where D_{S_k} = diag(X_k - 1x̄_k')'(X_k - 1x̄_k'), the standardization for each k, is also plausible. This possibility is not precluded, but it is somewhat spurious, notwithstanding Murakami's (1983) early recommendation.
We will not assert that the preprocessing (34) is universally valid. Theoretical and
empirical studies to find better methods of preprocessing (e.g. Kroonenberg, 1983)
are meaningful for the vast class of applications. Our conclusion is limited to the
study of factor comparisons which we define in 3.1.
3.4 The algorithm
As the sample size n is usually very large compared to mp in the study of factor invariance, an iterative algorithm based on R_kl is more convenient than one based on Z_k. Murakami (1983) derived such an algorithm from TUCKALS2 through algebraic manipulations. A very straightforward derivation of an improved version is possible on the basis of the T2-2 criterion in (23).
First, we will assume that A is given. We define the n x mp data matrix Z by arranging the frontal planes next to each other as Z = [Z_1 Z_2 ... Z_m], and compute the mp x mp covariance matrix as R = Z'Z. Next, we define the mq x mq matrix

H = (I ⊗ A)'R(I ⊗ A),   (35)

where ⊗ denotes the Kronecker product, and also define the mq by r matrix C̃ = [C̃_1' C̃_2' ... C̃_m']'. Then we can rewrite (23) as

g(C̃) = tr C̃'HC̃,   (36)

where the constraint is C̃'C̃ = I. The columns of C̃ will be given as the eigenvectors associated with the r largest eigenvalues of H. (For simplicity, we assume that all the eigenvalues are distinct.)
Next, we assume that C̃ is given. In addition, we also assume that we have a set of initial values of the elements of A. (It should be the one given in the previous iteration of an ALS process.) We will rewrite (23) as

g(A) = tr A' Σ_{k=1}^m Σ_{l=1}^m R_klAC̃_lC̃_k',   (37)

and consider the singular value decomposition

Σ_{k=1}^m Σ_{l=1}^m R_klAC̃_lC̃_k' = PΛQ',   (38)

and let

B = PQ'.   (39)

Using the Schwarz inequality, Ten Berge (1988) proved that

tr C̃'(I ⊗ B)'R(I ⊗ B)C̃ ≥ tr C̃'(I ⊗ A)'R(I ⊗ A)C̃,   (40)

or g(B) ≥ g(A). This means that the process replacing A by B increases the criterion monotonically. Therefore, we can alternate the eigendecomposition of H and the SVD in (38) until convergence is attained.
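A numpy sketch of this alternation, assuming R is the mp x mp matrix Z'Z of the juxtaposed data and A an initial p x q columnwise orthonormal matrix; the sum of singular values in (38) is used as a rough convergence monitor, and all names are ours:

import numpy as np

def t2_large_sample(R, m, p, q, r, A, n_iter=100, tol=1e-10):
    g_old = -np.inf
    for _ in range(n_iter):
        IA = np.kron(np.eye(m), A)                  # I (x) A, mp x mq
        H = IA.T @ R @ IA                           # eq. (35)
        evals, evecs = np.linalg.eigh(H)
        Ct = evecs[:, np.argsort(evals)[::-1][:r]]  # stacked [C1' ... Cm']', eq. (36)
        Cs = [Ct[k * q:(k + 1) * q, :] for k in range(m)]
        # sum_k sum_l R_kl A C_l C_k', the matrix decomposed in (38)
        M = sum(R[k * p:(k + 1) * p, l * p:(l + 1) * p] @ A @ Cs[l] @ Cs[k].T
                for k in range(m) for l in range(m))
        P_, s, Qt = np.linalg.svd(M, full_matrices=False)
        A = P_ @ Qt                                 # eq. (39); monotone by (40)
        if s.sum() - g_old < tol:
            break
        g_old = s.sum()
    return A, Cs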
Clearly, when the algorithm converges,

g(A, C̃) = tr Δ = tr Λ   (41)

holds, and we can use this as a criterion of convergence.
As the step for A in the algorithm described here performs the singular value decomposition of a p by q matrix, it is much better than that in Murakami (1983), which needs the eigendecomposition of a p by p matrix. However, more careful studies will be necessary to compare its efficiency with that of the new algorithm using Gram-Schmidt orthogonalization (Kiers et al., 1992).
Finally, we will list the formulae to complete the analysis of T2-1. First, we will obtain the first-order loading matrix;

A* = m^{-1/2}AΛ^{1/2}T,   (42)

where T is an orthogonal matrix which should be determined so as to attain simple structure of A*. Next, we will have the second-order loading matrix by

C_k = m^{1/2}T'Λ^{-1/2}C̃_kΔ^{1/2}U,   (43)

where U is also an orthogonal matrix, determined to reach simple structure of C_k. If necessary, the second-order component scores are obtained by

G = Σ_{k=1}^m Z_kA*C_k (Σ_{l=1}^m C_l'A*'A*C_l)^{-1}.   (44)
Eqs. (42) and (44) show apparent similarities to their counterparts for PCA-1, (12) and (13).
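For completeness, (44) in numpy, given the occasion matrices Z_k, the first-order loadings A*, and the second-order loadings C_k (an illustrative sketch, names ours):

import numpy as np

def second_order_scores(Zs, A_star, Cs):
    # Second-order component scores G via eq. (44).
    num = sum(Zk @ A_star @ Ck for Zk, Ck in zip(Zs, Cs))
    den = sum(Ck.T @ A_star.T @ A_star @ Ck for Ck in Cs)
    return num @ np.linalg.inv(den)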
3.5 Use of the first-order composites
In analyzing three-mode data, one may want to evaluate the correlations of components between occasions. For this purpose, the covariances between the first-order components defined in (18), F_k'F_l = C_kC_l', may be useful, and they are easily computed from the second-order loading matrices alone. However, the elements in this matrix often overestimate the correlations, because the columns of the first-order components exist in the r dimensional subspace mentioned in Section 3.1. Hence, the first-order composites defined in (21) may be more appropriate, because they are linear combinations of the input variables, and their columns span an mq dimensional space.
Although one can compute the correlation coefficients between the first-order composites directly, we can transform them in advance into variables with unit variances which are mutually orthogonal. In addition, we can maximize the congruence of the transformed first-order composites with the first-order components, which are conceptually on the same level as the first-order composites. That is, we will define the composites F̂_k, k = 1, ..., m, (45), which satisfy Σ_{k=1}^m F̂_k'F̂_k = mI and minimize Σ_{k=1}^m ||F̂_k - F_k||^2. Through somewhat complicated operations using the eigenequation of H and the rationale for the orthogonal Procrustes method (Cliff, 1966), we obtain the explicit expression (46), which is an extension of (7) and completes our list of the parallel relationships between classical PCA and T2.
Acknowledgment
The author is obliged to anonymous reviewers for their many helpful comments.
References:
Bloxom, B. (1984): Tucker's three-mode factor analysis model. In: Research Methods for Multimode Data Analysis, Law, H.G. et al. (eds.), 104-120, Praeger Publishers, New York.
Cliff, N. (1966): Orthogonal rotation to congruence, Psychometrika, 31, 33-42.
Gifi, A. (1990): Nonlinear Multivariate Analysis, Wiley, Chichester.
Kiers, H.A.L. et al. (1992): An efficient algorithm for TUCKALS3 on data with large numbers of observation units, Psychometrika, 57, 415-422.
Kroonenberg, P.M. (1983): Three-mode Principal Component Analysis, DSWO Press, Leiden.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika, 45, 69-97.
Meredith, W. and Millsap, R.E. (1985): On component analysis, Psychometrika, 50, 495-507.
Meredith, W. and Tisak, J. (1982): Canonical analysis of longitudinal and repeated measures data with stationary weights, Psychometrika, 47, 47-67.
Murakami, T. (1983): Quasi three-mode principal component analysis: A method for assessing the factor change, Behaviormetrika, 14, 27-48.
Nishisato, S. (1994): Elements of Dual Scaling: An Introduction to Practical Data Analysis, Lawrence Erlbaum, Hillsdale.
Ten Berge, J.M.F. (1986): Some relationships between descriptive comparisons of components from different studies, Multivariate Behavioral Research, 21, 29-40.
Ten Berge, J.M.F. (1988): Generalized approaches to the MAXBET problem and the MAXDIFF problem, with applications to canonical correlations, Psychometrika, 53, 487-494.
Ten Berge, J.M.F. (1993): Least Squares Optimization in Multivariate Analysis, DSWO Press, Leiden.
Ten Berge, J.M.F. and Kiers, H.A.L. (1996): Optimality criteria for principal component analysis and generalizations, British Journal of Mathematical and Statistical Psychology, 49, 335-345.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis, Psychometrika, 31, 279-311.
Parallel Factor Analysis with Constraints on the
Configurations: An overview
Pieter M. Kroonenberg and Willem J. Heiser
1. Introduction
The PARAFAC model is a data-analytic model for three-way data, in which each way
represents a different mode (three-mode data), for example, subjects (mode 1) have
scores on semantic differential scales (mode 2) under several conditions (mode 3), or
in case of a three-way analysis of variance design the mean yield of several varieties
of maize (mode 1) planted in several locations (mode 2) during several years (mode 3).
In most applications to date the full model has been used, sometimes with provisions
for missing data. However, it is possible to include constraints on the configurations
of the components in the model. In this paper an overview is given of constraints that
have been proposed in this context and the practical relevance of such constraints is
illustrated. Moreover, attention will be paid to ways of fitting the model to include
constraints. To this end the recently developed alternatives to the basic algorithm
will be reviewed, in particular, the triadic or component-wise estimation.
The model can be written as

x_ijk = Σ_{s=1}^S a_isb_jsc_ks + e_ijk,   (1)

with i = 1, ..., I, j = 1, ..., J, and k = 1, ..., K. The a_is, b_js, and c_ks are the elements of the components a_s, b_s, and c_s, respectively, and the e_ijk are the errors of approximation. Note that each component depends on only one of the indices i, j, k, and that each component s is present in all three ways. An alternative formulation of the model (seen as a generalization of the singular value decomposition) is
x_ijk = Σ_{s=1}^S λ_sa_isb_jsc_ks + e_ijk,   (2)
where the vectors (components) a_s, b_s, and c_s have lengths equal to 1 (or mean squares equal to 1), and the scale factors λ_s can be considered the three-way analogues of the singular values.
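In code, (2) is a single trilinear contraction. The sketch below assumes component matrices A (I x S), B (J x S), C (K x S) and scale factors lam of length S; the names are ours:

import numpy as np

def parafac_reconstruct(lam, A, B, C):
    # Rank-S trilinear reconstruction x_ijk = sum_s lam_s a_is b_js c_ks.
    return np.einsum('s,is,js,ks->ijk', lam, A, B, C)

# errors of approximation: E = X - parafac_reconstruct(lam, A, B, C)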
In the physical sciences, explicit models for physical processes occur frequently, and with respect to three-way and higher-way data several examples can be found in chemistry. Smilde, Van der Graaf, and Doornbos (1990) discuss a model for the multivariate calibration of reversed phase chromatographic systems which is identical to the PARAFAC model, and Leurgans and Ross (1992) discuss several three- and multimode models in spectroscopy. For instance, they discuss a model in which the measured light emission is separately linear (1) in the number of photons absorbed, (2) in the fraction of absorbed photons that lead to emission at a particular wavelength, and (3) in the concentrations of the light emitting entities.
In contrast with several other three-way models, the PARAFAC model is an identi-
fied model, so that after estimation the parameters of the model cannot be changed
without affecting the fit of the model to the data. In particular no transformations
of the components are possible without loss of fit. The identifiability is a great help
in evaluating and interpreting solutions, and it is this feature which makes the model
extremely relevant in those cases where an a priori model for the data is available. A
PARAFAC analysis is in that case not a search for a good fitting model, but a method
to obtain identified estimates for the parameters.
3. Constraints
Substantive reasons, parsimony, or modelling considerations may require constraints
on the parameters. In particular, the following situations may be considered, which
we will discuss in turn. (1) Orthogonality of components, (2) non-negativity of com-
ponents, (:3) linear constraints including design variables for the components, (4) fixed
components, (5) order constraints on the components, (6) mixed measurement levels,
(7) missing data. It should be noted that all constraints lead to a loss of fit, but
the constrained solution may be compared with an unconstrained one to assess the
importance of the constraints.
Constraints can enter the problem of finding a solution for the parameter estimates
in essentially two ways, i.e. as constraints on the parameters, cases (1)-(5), or as con-
straints on possible transformations of the data (6) and (7). An example of the latter
is optimal scaling in which case optimal transformations for the data are sought given
the measurement level of the variables, simultaneously with optimal parameter esti-
mates for the model. Another example is the estimation of the model in the presence
of missing data, in which case either the missing data are 'ignored' via a 0-1 weight
matrix for the data (the model is fitted around the missing data), or the missing data
are estimated simultaneously with the model in a form of the expectation-maximization algorithm.
Harshman and Lundy (1984a) also discuss fitting covariance and similarity data by
PARAFAC, and show that this indirect fitting (indirect with respect to fitting the
PARAFAC model to the raw data) implies that one mode (usually that of the subjects
or generally the data generators) is orthogonal.
Facet or factor"ial designs for variables. Tests and questionnaires are sometimes con-
structed according to a facet or factorial design and as above these variables may be
combined and then subjected to a standard PARAFAC analysis.
A priori clusters on the variables. Another type of constraint occurs when variables belong to certain explicitly defined a priori clusters and this is to be made evident via a simple structure on the components. In the three-way case one might like to fit such a constraint via the application of the PARAFAC model. The details of such an approach have been worked out by Krijnen (1993, chap. 5).
Another view of the same situation might be that external (two-way) information on
continuous variables is available. For instance, personality information is available
for the subjects, and it is desired to explain the structure of the analysis in terms
of these external variables. Within the analysis-of-variance context such procedures
have sometimes been called factorial regression. A discussion of this approach for the two-way case, as well as references to examples, can be found in Van Eeuwijk, Denis, and Kang (1995).
Fixing components for modelling purposes. It is possible to decide for each component in a PARAFAC model whether one, two, or three ways should have constant values for this component. By doing this, one can, for instance, perform a three-way analysis of variance with the PARAFAC model. At present, we are working on a viable way of carrying out three-way analysis of variance with multiplicative terms for the interactions via a PARAFAC model, as suggested by De Leeuw in Kroonenberg (1983, p. 141).
Missing data can be handled either by estimating the model and the missing data in turn, or by differentially weighting elements of the three-way data array, for instance, by weighting every valid data point with 1 and each missing data point with 0.
4. Algorithms
In this section we will give an overview of several algorithms proposed for the PARAFAC
model and comment on the way constraints can be handled. In particular, we will
concentrate on the standard or so-called CP algorithm and the triadic algorithm.
These algorithms minimize (variants of) the least squares discrepancy function

σ(A, B, C) = Σ_i Σ_j Σ_k (x_ijk - Σ_{s=1}^S a_isb_jsc_ks)^2.   (3)
The basic solution proposed by Carroll and Chang (1970) and Harshman (1970), which at a later date has also been independently worked out by several other authors (e.g. Mocks, 1988), consists of fixing two of the parameter matrices, solving for the third one via (multivariate) regression, performing this procedure for each permutation of the component matrices in turn, and repeating the whole procedure until convergence. Hayashi and Hayashi (1982) also presented an algorithm which shows similarities to the standard one. Technically the CP algorithm is straightforward to implement, but it turns out that the algorithm does not always converge, due to a mismatch of data and model. This situation is called degeneracy, and details can be found in Harshman and Lundy (1984b), Kruskal, Harshman, and Lundy (1989), and Krijnen and Kroonenberg (submitted). As an aside, the work of Pham and Mocks (1992) should be mentioned, as they prove the consistency and asymptotic normality of the least squares estimators.
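A compact numpy sketch of this basic alternating least squares idea (fix two component matrices, solve for the third by regression, and cycle); the unfolding and Khatri-Rao conventions are ours, a fixed number of sweeps is used, and no degeneracy diagnostics are included:

import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product of U (J x S) and V (K x S) -> JK x S.
    return np.einsum('js,ks->jks', U, V).reshape(-1, U.shape[1])

def cp_als(X, S, n_sweeps=200, seed=0):
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((d, S)) for d in (I, J, K))
    for _ in range(n_sweeps):
        # each step is a multivariate regression in normal-equations form
        A = X.reshape(I, J * K) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.moveaxis(X, 1, 0).reshape(J, I * K) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.moveaxis(X, 2, 0).reshape(K, I * J) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C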
Variants. A variant on the basic algorithm was proposed by Kiers and Krijnen (1991). They showed, by rearranging the calculations and by operating on the (multivariable-multioccasion) covariance matrix rather than the raw data, that the computation time becomes independent of the number of observations, but that at each iteration step the results are the same as those of the standard algorithm. Moreover, by operating on the covariance matrix, the storage space does not increase with the number of observations.
Using the assumption of normality for the errors, Mayekawa (1987) developed a maximum likelihood procedure for indirect fitting (see also Section 3.2), i.e. first the raw data are converted to covariance matrices per occasion (with possibly different samples per occasion). Note that these covariances are different from those in the Kiers and Krijnen (1991) proposal, because the latter use only a single (multivariable-multioccasion) covariance matrix, and thus necessarily assume repeated measurements.
Weights for error terms. Several authors (Harshman and Lundy, 1984b, pp. 242ff.; Carroll, De Soete and Kamenski, 1992; Heiser and Kroonenberg, 1994) considered a generalization of the discrepancy function by including weights for the error terms e_ijk;

σ_W(A, B, C) = Σ_i Σ_j Σ_k w_ijk (x_ijk - Σ_{s=1}^S a_isb_jsc_ks)^2.   (4)

The addition of weights changes the details of the algorithm in several places but not its basic character. This seemingly small change has considerable consequences for the ability of the algorithms to handle certain kinds of constraints. Harshman and Lundy (1984b) presented the weighted version of the standard CP algorithm in the context of reweighting or scaling of error terms to achieve equal error variances, but seemed to look only at diagonal weight matrices.
A further generalization also includes weights for the components themselves;

Ψ(A, B, C) = Σ_i Σ_j Σ_k w_ijk (z_ijk - Σ_{s=1}^S (p_isa_is)(q_jsb_js)(r_ksc_ks))^2,   (5)

where P = (p_is), Q = (q_js), and R = (r_ks) are, in general, known weight matrices for the components. Krijnen (1993, chap. 5) uses such binary weights as a priori cluster constraints. Moreover, Krijnen (1993, chap. 6) also proposed an algorithm for optimally clustering the coordinates of the variables, using explicitly the row-wise character of the standard algorithm, in which the binary weights had to be determined. Carroll, Pruzansky, and Kruskal (1980) showed that if a factorial design or linear constraints are specified on the components, there is no need for a special algorithm, because one may first reduce the data matrix according to the design and then
use the basic algorithm to solve for the parameter estimates (see also Franc, 1992, pp. 188ff.). The latter author also discusses the inclusion of different metrics for the component matrices and shows that this situation can be handled by rewriting the basic equations and then using the standard algorithm. Such a procedure is analogous to the use of the ordinary singular value decomposition to solve the singular value decomposition with weighted metrics in correspondence analysis (see e.g. Greenacre, 1984, p. 40).
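Evaluating the doubly weighted discrepancy (5) is straightforward; a sketch assuming error weights W (e.g. 0-1 for missing data, as in Section 3) and binary component-weight matrices P, Q, R encoding cluster or design constraints (the function name is ours):

import numpy as np

def weighted_discrepancy(Z, W, A, B, C, P, Q, R):
    # Eq. (5): sum_ijk w_ijk (z_ijk - sum_s (p_is a_is)(q_js b_js)(r_ks c_ks))^2
    model = np.einsum('is,js,ks->ijk', P * A, Q * B, R * C)
    return float(np.sum(W * (Z - model) ** 2))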
Φ(a_s, b_s, c_s) = Σ_i Σ_j Σ_k w_ijk (z_ijk - a_isb_jsc_ks)^2.   (6)

Note that after the s-th components have been estimated, the components s + 1 (or 1) will be treated next, after the new component s has been incorporated in z_ijk.
Variants. The triadic algorithm opens the way to including many more types of constraints, and it also allows the development of special variants of the PARAFAC model.
5. Programs
The standard algorithm has been included in (FORTRAN) programs such as CANDECOMP (Carroll and Chang, 1970) and PARAFAC (Harshman and Lundy, 1994), of which the latter program is probably the most extensive one, as it includes many special features relevant for the analysis of three-mode data.
Most of the authors who have contributed to the further development of PARAFAC have written their own programs, primarily in Matlab, Splus, or other matrix-based languages. The standard algorithm for the PARAFAC model, as well as the nonnegativity and orthogonality variants, have also been included in the analysis package for three-way data 3WAYPACK (Kroonenberg, 1994, 1996), and the triadic or component-wise version will be in the next version of 3WAYPACK. This package also includes other three-way models, such as those proposed by Tucker (1966, 1972).
6. References
Carroll, J.D. (1987): New algorithm for symmetric CANDECOMP. Unpublished manuscript, AT&T Bell Laboratories, Murray Hill, NJ.
Carroll, J.D. and Chaturvedi, A. (1995): A general approach to clustering and multidimensional scaling of two-way, three-way, and higher-way data. In: Geometric representations of perceptual phenomena: Papers in honor of Tarow Indow on his 70th birthday, Luce, R.D. et al. (eds.), Erlbaum, Mahwah, NJ.
Carroll, J.D. and Chang, J.-J. (1970): Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283-319.
Carroll, J.D., De Soete, G., and Kamenski, A.D. (1992): A modified CANDECOMP algorithm for fitting the latent class model: Implementation and evaluation. Applied Stochastic Models and Data Analysis, 8, 303-309.
Carroll, J.D., Pruzansky, S., and Kruskal, J.B. (1980): CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45, 3-24.
Denis, J.B. and Dhorne, T. (1989): Orthogonal tensor decomposition of 3-way tables. In: Multiway data analysis, Coppi, R. and Bolasco, S. (eds.), 31-38, Elsevier, Amsterdam.
DeSarbo, W.S., Carroll, J.D., Lehmann, D.R., and O'Shaughnessy, J. (1982): Three-way multivariate conjoint analysis. Marketing Science, 1, 323-350.
Durrell, S.R., Lee, C.-H., Ross, R.T., and Gross, E.L. (1990): Factor analysis of the near-ultraviolet absorption spectrum of plastocyanin using bilinear, trilinear, and quadrilinear models. Archives of Biochemistry and Biophysics, 278, 148-160.
Franc, A. (1992): Etude algebrique des multitableaux: Apports de l'algebre tensorielle. Unpublished PhD thesis, Universite de Montpellier II, France.
Gifi, A. (1990): Nonlinear multivariate analysis, Wiley, Chichester, UK.
Greenacre, M.J. (1984): Theory and applications of correspondence analysis, Academic Press, London.
Harshman, R.A. (1970): Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1-84. [Also available as University Microfilms, No. 10,0085].
Harshman, R.A. and Lundy, M.E. (1984a): The PARAFAC model for three-way factor analysis and multidimensional scaling. In: Research methods for multimode data analysis, Law, H.G. et al. (eds.), 122-215, Praeger, New York.
Harshman, R.A. and Lundy, M.E. (1984b): Data preprocessing and the extended PARAFAC model. In: Research methods for multimode data analysis, Law, H.G. et al. (eds.), 216-284, Praeger, New York.
Harshman, R.A. and Lundy, M.E. (1994): PARAFAC: Parallel factor analysis. Computational Statistics and Data Analysis, 18, 39-72.
Hayashi, C. and Hayashi, F. (1982): A new algorithm to solve PARAFAC-model. Behaviormetrika, 11, 49-60.
Heiser, W.J. and Kroonenberg, P.M. (1994): Dimensionwise fitting in Parafac-Candecomp with missing data and constrained parameters. Unpublished manuscript, Department of Data Theory, Leiden University, Leiden.
Kettenring, J.R. (1983): Components of interaction in analysis of variance models with no replications. In: Contributions to statistics: Essays in honor of Norman L. Johnson, Sen, P.K. (ed.), North-Holland, Amsterdam.
Kiers, H.A.L. and Krijnen, W.P. (1991): An efficient algorithm for PARAFAC of three-way data with large numbers of observation units. Psychometrika, 56, 147-152.
Krijnen, W.P. (1993): The analysis of three-way arrays by constrained PARAFAC methods, DSWO Press, Leiden.
Krijnen, W.P. and Kroonenberg, P.M. (submitted): Detecting degeneracy when fitting the PARAFAC model.
Krijnen, W.P. and Ten Berge, J.M.F. (1992): A constrained PARAFAC method for positive manifold data. Applied Psychological Measurement, 16, 295-305.
Kroonenberg, P.M. (1983): Three-mode principal component analysis: Theory and applications, DSWO Press, Leiden.
Kroonenberg, P.M. (1994): The TUCKALS line: A suite of programs for three-way data analysis. Computational Statistics and Data Analysis, 18, 73-96.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45, 69-97.
Kruskal, J.B., Harshman, R.A., and Lundy, M.E. (1989): How 3-MFA can cause degenerate PARAFAC solutions, among other relationships. In: Multiway data analysis, Coppi, R. and Bolasco, S. (eds.), 115-122, Elsevier, Amsterdam.
Lawson, C.L. and Hanson, R.J. (1974): Solving least squares problems, Prentice Hall, Englewood Cliffs, NJ.
Leurgans, S.E. and Ross, R.T. (1992): Multilinear models: Application in spectroscopy (with discussion). Statistical Science, 7, 289-319.
Mayekawa, S.-I. (1987): Maximum likelihood solution to the PARAFAC model. Behaviormetrika, 21, 45-63.
Mocks, J. (1988): Decomposing event-related potentials: A new topographic components model. Biological Psychology, 26, 129-215.
Paatero, P. and Tapper, U. (1994): Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111-126.
Paatero, P. (1995): User's guide for positive matrix factorization programs PMF2.EXE and PMF3.EXE, Department of Physics, University of Helsinki.
Pham, T.D. and Mocks, J. (1992): Beyond principal component analysis: A trilinear decomposition model and least squares estimation. Psychometrika, 57, 203-215.
Sands, R. and Young, F.W. (1980): Component models for three-way data: ALSCOMP3, an alternating least squares algorithm with optimal scaling features. Psychometrika, 45, 39-67.
Smilde, A.K., Van der Graaf, P.H., and Doornbos, D.A. (1990): Multivariate calibration of reversed-phase chromatographic systems. Some designs based on three-way data analysis. Analytica Chimica Acta, 235, 41-51.
Ten Berge, J.M.F. (1986): Three notes on three-way analysis. Paper presented at the Workshop on TUCKALS and PARAFAC, Leiden University, July 2.
Ten Berge, J.M.F., Kiers, H.A.L., and Krijnen, W.P. (1993): Computational solutions for the problem of negative saliences and nonsymmetry in INDSCAL. Journal of Classification, 10, 115-124.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279-311.
Tucker, L.R. (1972): Relations between multidimensional scaling and three-mode factor analysis. Psychometrika, 37, 3-27.
Van der Kloot, W.A. and Kroonenberg, P.M. (1985): External analysis with three-mode principal component analysis. Psychometrika.
Van Eeuwijk, F.A., Denis, J.-B., and Kang, M.S. (1995): Incorporating additional information on genotypes and environments in models for two-way genotype by environment tables. In: Genotype by environment interaction: New perspectives, Kang, M.S. and Gauch, H.G., Jr. (eds.), CRC Press, Boca Raton, USA.
Yoshizawa, T. (1988): Singular value decomposition of multiarray data and its applications. In: Recent developments in clustering and data analysis, Hayashi, C. et al. (eds.), Academic Press, New York.
Acknowledgement
The research of the first author was financially supported by a grant from the Nis-
san Fellowship Programme of the Netherlands Organization for Scientific Research
(NWO).
Regression Splines for Multivariate Additive
Modeling
Jean-François Durand
Probabilités et Statistique, Université Montpellier II, Place Eugène Bataillon, 34095 Montpellier, France
Unité de Biométrie, ENSAM-INRA-UM II, 9 Place Pierre Viala, 34060 Montpellier, France
Summary: Four additive spline extensions of some linear multiresponse regression meth-
ods are presented. Two of them are defined in this paper and their properties are compared
with those of two other recently devised methods. Dimension reduction aspects and quality
of the regression are discussed and illustrated on examples.
1. Introduction
Let $(x_1, \ldots, x_p)$ be a set of predictors related to a set of responses $(y_1, \ldots, y_q)$, all measured
on the same n individuals, with sample data matrices X (n x p) and Y (n x q). The goal
of this paper is to present methods for multivariate additive modeling and data reduction
which integrate regression splines in their settings. Piecewise polynomials or splines are extensively used in statistics and data analysis, see for example (Ramsay 1988), (Gifi 1990), (Hastie and Tibshirani 1990) and (Durand 1993). Spline transformations of the explanatory variables, presented in Section 2 and not necessarily monotonic in contrast to (Ramsay 1988),
are conjointly used with orthogonal projections on either such smoothed predictors or "synthetic" explanatory variables called additive components. Spline functions are attractive due to their appealing local sensitivity to data, and orthogonal projectors preserve some linear properties in the considered nonlinear methods. Section 3 outlines that problems arise
with least-squares splines (Eubank 1988) when the number of predictors is large. Scarcity of
data in a multivariate setting may harm additive modeling by least-squares splines, thus providing a point in favour of dimension reduction.
Four multiresponse additive regression models are considered, all presenting dimension reducing aspects. Because of their similar scope of applicability, the first three methods are compared on a "running example". In Section 4, the regression on Additive Spline Principal Components (ASPCs henceforth) is based on a new definition of ASPCs. Additive principal components have recently been defined by Donnell et al. (1994), who explore the low end of the component spectrum for detecting concurvities in additive models. Here, large uncorrelated ASPCs are used not for condensing the X sample matrix in an additive fashion, since the predictors are transformed, but rather for predicting the response data set linearly. The second method, Partial Least Squares regression via additive splines, referred to as ASPLS (Durand and Sabatier 1994), is summarized in Section 5. This method differs from the preceding in that dimension reduction is processed at the same time as the regression is computed. When predictor and response data sets are identical, ASPLS components are called self-ASPLS components. In Section 6, such self-ASPLS components are interpreted as an additive summary of the original predictor matrix and used for regression purposes. Finally, Principal Component Analysis with Instrumental Variables (Durand 1993), whose scope of applicability differs from that of the preceding methods, is presented in Section 7 with applications to simple regression and to nonlinear Discriminant Analysis.
For each predictor we take K interior knots at which piecewise polynomials of order m are required to join end to end, so that r = m + K is the dimension of the spline space. A spline function $s^i(\cdot)$ used for transforming the predictor $x_i$ is a linear combination of normalized B-spline basis functions $\{B_l^i(\cdot)\}_{l=1,\ldots,r}$:

$s^i(x_i) = \sum_{l=1}^{r} a_l^i B_l^i(x_i);$    (1)

see (De Boor 1978) for computational and mathematical properties of B-splines. The ith column of X, denoted $X^i$, is replaced by $X^i(a^i)$, depending linearly on $a^i$ through $X^i(a^i) = B^i a^i$, where $B^i$ is the n x r coding matrix of $x_i$ and $a^i$ is the vector of the r spline coefficients. The matrix X is thus transformed into an n x p matrix X(a) which is a function of the spline vectors. This matrix is columnwise denoted

$X(a) = [X^1(a^1) \,|\, \cdots \,|\, X^p(a^p)].$    (2)

In order to make $X^i(a^i)$ centered independently of the values of the spline coefficients, $B^i$ is column centered with respect to D, an n x n diagonal matrix of weights for the observations.
The knot sequence used for transforming the ith predictor is written $\{\xi_1, \ldots, \xi_{2m+K}\}$.
When m >= 2, particular spline coefficients called nodal coefficients (Durand 1993), given by $a_l^i = \mathrm{mean}(\xi_{l+1}, \ldots, \xi_{l+m-1})$, keep the ith predictor invariant, which gives $X^i(a^i) = X^i$. The existence of such spline coefficients implies, firstly, that the additive spline model as defined in Section 3 can effectively take account of possibly linear relationships and, secondly, that the different iterative algorithms can reasonably be initialized.
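To make the construction concrete, here is a minimal Python sketch of the coding matrix and the nodal coefficients, assuming SciPy's BSpline.design_matrix (SciPy 1.8 or later); the equally spaced interior knots and uniform weights D = n^{-1} I mirror the running example but are otherwise assumptions:

    import numpy as np
    from scipy.interpolate import BSpline   # assumes SciPy >= 1.8 for design_matrix

    def coding_matrix(x, K=2, m=2):
        """n x r coding matrix B^i of one predictor, r = m + K; column-centered for D = n^{-1} I."""
        degree, idx = m - 1, np.argsort(x)
        interior = np.linspace(x.min(), x.max(), K + 2)[1:-1]  # K equally spaced interior knots
        t = np.r_[[x.min()] * m, interior, [x.max()] * m]      # knot sequence, length 2m + K
        B = np.empty((len(x), m + K))
        B[idx] = BSpline.design_matrix(x[idx], t, degree).toarray()  # normalized B-splines
        return B - B.mean(axis=0), t

    x = np.random.default_rng(0).uniform(size=18)
    B, t = coding_matrix(x)
    # nodal coefficients a_l = mean(xi_{l+1}, ..., xi_{l+m-1}) keep the predictor invariant:
    a = np.array([t[j + 1 : j + 2].mean() for j in range(B.shape[1])])  # m = 2 here
    assert np.allclose(B @ a, x - x.mean())  # X^i(a^i) = X^i, up to the column centering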
$\hat{y}^j = \sum_{i=1}^{p} f_j^i(x_i), \qquad j = 1, \ldots, q,$    (3)

where the coordinate function $f_j^i(x_i)$ is a spline function as defined in (1) with spline coefficients stored in $a_j^i$. In matrix notation, the jth column of the n x q model matrix $\hat{Y}$ is the sum of the smoothed sample predictors

$\hat{Y}^j = \sum_{i=1}^{p} X^i(a_j^i).$    (4)

We will note whether or not an additive method is associated with a multivariate linear smoother (Hastie and Tibshirani 1990) defined by $\hat{Y} = SY$, where the smoother matrix S does not depend on Y.
Spline coefficients can be chosen to minimize the mean squared error $\|Y - BA\|_D^2$, where $\|X\|_D^2 = \mathrm{trace}(X'DX)$, $B = [B^1|\ldots|B^p]$ and A is the pr x q matrix of spline coefficients. Linear multiple regression on the n x pr design matrix B provides the so-called least-squares spline model (Eubank 1988), associated with the smoother matrix $S = P_B$, where $P_B$ is the D-orthogonal projector on span(B). However, using least-squares splines may cause problems when insufficient points are available for fitting surfaces in high dimensional spaces. Our O.E.C.D. "running" data set is typical of such an unlucky context since only eighteen points are available for fitting an additive surface in a space of fourteen dimensions (n = 18, p = 13). The scope of applicability of this method is therefore data sets with large samples of few predictors, since the scarcity of data is aggravated as the number of independent variables (the columns of the design matrix B) grows.
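A minimal sketch of this least-squares spline fit, reusing the hypothetical coding_matrix helper from the sketch above; with n = 18 and pr = 52 the normal equations are underdetermined, which is exactly the problem just described:

    def least_squares_spline(X, Y, K=2, m=2):
        """Regression of Y on the n x pr design matrix B = [B^1 | ... | B^p]; smoother S = P_B."""
        B = np.hstack([coding_matrix(X[:, i], K, m)[0] for i in range(X.shape[1])])
        A, *_ = np.linalg.lstsq(B, Y - Y.mean(axis=0), rcond=None)  # pr x q coefficients
        # for the O.E.C.D. data, n = 18 while pr = 13 * 4 = 52: the fit is underdetermined
        return Y.mean(axis=0) + B @ A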
The aim of the paper is to present four methods for additive modeling that include dimen-
sion reducing aspects in their settings. All are based on the column centered design matrix
X(a) defined by (2). In the different models, $a^i$ is expressed as a linear combination of M optimal spline vectors, all associated with the same ith predictor transformed at the different stages of the methods (M is the dimension of the model, that is, the number of additive components or latent variables). Model dimension M may be considered a tuning parameter and can be checked in the same way as in linear methods by using cross-validation. When it is computationally feasible, cross-validation is all the more important in nonlinear modeling because the risk of overfitting is increased by the greater flexibility of splines. Other tuning parameters allow one to choose the type of splines to be used: the order of the polynomials, and the number and position of the knots. Choosing a few well located knots generally suffices in multiresponse regression, but finding their optimal number and location is a difficult problem, so giving the optimal answer is beyond the scope of this paper. To summarize, the model defined by (3) and (4) belongs to the family of nonparametric additive models depending on the aforesaid tuning parameters.
The optimization defining the kth ASPC is subject to the constraints

$\|u\| = 1,$
$\mathrm{var}(X^i(a^i)) = \mathrm{var}(x_i), \quad i = 1, \ldots, p,$
$\mathrm{cov}(c, c^j) = 0, \quad j = 1, \ldots, k - 1.$

The last constraint is omitted when k = 1. Writing $(a^{1,k}, \ldots, a^{p,k}, u^k)$ as an optimal argument of f and $X(a^{(k)}) = [B^1 a^{1,k}|\ldots|B^p a^{p,k}]$ as an optimal matrix, the kth ASPC becomes $c^k = X(a^{(k)}) u^k$. Note that X as well as all coding matrices $B^i$ are column centered with respect to D in order to make $c^k$ centered for arbitrary spline coefficients.
The kth additive principal function may be defined as $c^k(x_1, \ldots, x_p) = \sum_{i=1}^{p} \phi^{i,k}(x_i)$ with $\phi^{i,k}(x_i) = u_i^k s^{i,k}(x_i)$, where $s^{i,k}$ is the optimal spline function used for transforming the ith predictor. A crucial problem is the choice of M, the number of components. Here we have no total variance decomposition theorem because the data sets change as successive ASPCs are computed. However, the sequence $\mathrm{var}(c^1), \ldots, \mathrm{var}(c^M)$ cannot increase because nested optimizations are considered (an exception to this rule may nevertheless occur when local optima are reached by the algorithm). The only question is then to set a stopping rule, that is, to estimate whether $\mathrm{var}(c^k)$ is "small". Since uncorrelated ASPCs are constructed for regression purposes only, one can pragmatically check the goodness-of-fit for different model dimensions, see Section 4.3.
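Such a pragmatic check can be sketched as follows (the components are assumed centered and stacked column-wise; names are illustrative):

    def r2_by_dimension(C, Y):
        """R^2 of each response regressed on the first k components, k = 1, ..., M."""
        Yc = Y - Y.mean(axis=0)
        rows = []
        for k in range(1, C.shape[1] + 1):
            coef, *_ = np.linalg.lstsq(C[:, :k], Yc, rcond=None)
            rows.append(1.0 - (Yc - C[:, :k] @ coef).var(axis=0) / Yc.var(axis=0))
        return np.array(rows)  # one row per model dimension, one column per response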
Predictors                                        Responses
POP    population                                 CAL    calories per capita and per day
DENS   density per km^2                           LODG   number of lodgings per 1000 cap.
POPG   population growth                          ELEC   electricity consumption
AGRF   % of farming & fishing population          EDUC   public expenditure for education
INDU   % of industrial population                 TV     number of TV sets per 1000 cap.
GNP    gross national product per capita
GDPA   % of GNP for agriculture
FCF    fixed capital formation
RR     running receipts
OFR    official reserves (million $)
DR     discount rate
IMP    importations (million $)
EXP    exportations (million $)
Variables are centered and standardized with equally distributed weights ($D = 18^{-1} I_{18}$), and for all the competing regression methods we will compare, the 13 predictors are transformed by B-spline functions of degree 1 (order 2) with 2 equally spaced interior knots. In this section, twelve ASPCs have been computed, whose variances are given in Table 2.
Table 2: Variances of the twelve ASPCs.
ASPC   1     2     3     4     5     6     7     8     9     10    11    12
var    7.15  6.62  4.76  3.52  3.02  2.48  1.99  1.82  1.73  1.59  1.09  0.98
In contrast to ordinary principal components, we observe, a fact shared by all the examples we have studied, that the sequence of variances decreases only mildly. We are now ready to address the problem of what to do with the ASPCs' D-orthogonal basis: the next section presents a simple way of selecting the components that best explain the responses. Table 3 shows the R^2 values of the O.E.C.D. responses according to different model dimensions. It is clear that some ASPCs may be deleted, for instance the second and the last three.
Table 3: R^2 of the responses for nested ASPC regression models.
Model dim.   CAL    LODG   ELEC   EDUC   TV     % of total variance
 1           0.056  0.005  0.084  0.034  0.608  15.74
 2           0.105  0.062  0.091  0.080  0.608  18.92
 3           0.118  0.101  0.188  0.482  0.700  31.78
 4           0.378  0.103  0.358  0.727  0.821  47.74
 5           0.378  0.373  0.359  0.744  0.823  53.54
 6           0.443  0.463  0.360  0.746  0.828  56.80
 7           0.615  0.718  0.671  0.781  0.904  73.78
 8           0.629  0.732  0.693  0.814  0.905  75.46
 9           0.769  0.838  0.714  0.832  0.905  81.16
10           0.865  0.838  0.737  0.837  0.929  84.12
11           0.880  0.844  0.785  0.853  0.952  86.28
12           0.923  0.871  0.827  0.878  0.966  89.30
Finally, one can reduce the model dimension M by choosing the ASPCs that best explain the responses. Here ASPCs 1, 3, 4, 5, 7 and 9 seem to provide a good trade-off between goodness of fit and dimension reduction. Table 4 presents the results for the corresponding nested models, and the goodness-of-fit may be compared with that of the model of dimension 6 in Table 3 (73.04% against 56.80%).
Table 4: R^2 of the responses for the nested models built on the selected ASPCs.
Model dim.   ASPC   CAL    LODG   ELEC   EDUC   TV     % of total variance
1            1      0.056  0.005  0.084  0.034  0.608  15.74
2            3      0.068  0.044  0.181  0.436  0.700  28.58
3            4      0.329  0.047  0.354  0.681  0.820  44.58
4            5      0.329  0.317  0.352  0.698  0.822  50.36
5            7      0.501  0.571  0.664  0.732  0.898  67.32
6            9      0.641  0.678  0.685  0.750  0.898  73.04
Responses TV and EDUC are well reconstituted by the six-dimension model, whereas CAL, LODG and ELEC are rather badly approximated. Let us now pay more attention to selecting the predictors of main influence for the variable EDUC, which will be our "running response" for comparing different models. Figure 1 shows coordinate function plots for the six main variables of the ASPCR model. The influence of a predictor on a response is here measured by the range of the transformed data, marked by their corresponding numbers: Germany (1 D), Austria (2 A), Belgium (3 B), Canada (4 CDN), Denmark (5 DK), Spain (6 E), USA (7 USA), Finland (8 FI), France (9 F), Greece (10 G), Ireland (11 IRL), Italy (12 I), Japan (13 J), Norway (14 N), the Netherlands (15 NL), Portugal (16 P), England (17 UK), Sweden (18 S). As a confirmatory point, note that EDUC is modeled in the same fashion when all ASPCs are used (same influential predictors with similar function shapes).
Figure 1: Coordinate function plots of the main predictors (in decreasing order from left to right)
for EDUC modeled by ASPCR. The dotted vertical lines indicate the position of the knots.
The examples below illustrate that a smaller number of components is needed in ASPLS (Additive Spline Partial Least Squares) than in ASPCR for explaining the same amount of Y-variance.
The optimization defining the kth ASPLS step is subject to the constraints

$\|w\|^2 = \|c\|^2 = 1,$
$\mathrm{var}(X^i(a^i)) = \mathrm{var}(x_i), \quad i = 1, \ldots, p,$
$\mathrm{cov}(t, t^j) = 0, \quad j = 1, \ldots, k - 1.$

In the same way as for ASPCs, the last constraint is omitted when k = 1. Writing $(a^{1,k}, \ldots, a^{p,k}, w^k, c^k)$ as an optimal argument of f and $X(a^{(k)}) = [B^1 a^{1,k}|\ldots|B^p a^{p,k}]$ as an optimal matrix, the kth ASPLS components become $t^k = X(a^{(k)}) w^k$ and $u^k = F_{k-1} c^k$. The final part of step k of ASPLS consists of updating $F_k$ as the residual of the regression of $F_{k-1}$ onto $t^k$,
$F_k = F_{k-1} - P_{t^k} F_{k-1}.$    (5)
It must be noted that fixing the spline coefficients equal to the nodal coefficients in the optimization problem above does not lead to linear PLS components, except for $t^1$ and $u^1$. Moreover, the linear PLS property of reconstructing the predictor matrix is obviously not preserved in ASPLS, whose aim is only to provide an additive approximation to the response variables.
As a consequence of (5), the ASPLS model is given by

$Y = F_0 = \sum_{k=1}^{M} \hat{Y}_k + F_M = \hat{Y} + F_M,$    (6)
where $\hat{Y}_k = P_{t^k} F_{k-1}$ is the kth partial model matrix of rank 1. Model dimension M can be determined by cross-validation. More pragmatically, the fact that the components $t^k$ are mutually uncorrelated implies the additive decomposition of the total Y-variance
$\|Y\|_D^2 = \sum_{k=1}^{M} \|\hat{Y}_k\|_D^2 + \|F_M\|_D^2.$    (7)
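The deflation scheme (5)-(6) can be sketched in its linear special case (spline coefficients frozen at their nodal values); the full ASPLS re-optimizes the spline vectors a^i at each step, which is not attempted in this sketch:

    def pls_deflation(X, Y, M):
        """Linear sketch of (5)-(6): components t^k and accumulated partial models P_{t^k} F_{k-1}."""
        F = Y - Y.mean(axis=0)
        Xc = X - X.mean(axis=0)
        T, Yhat = [], np.zeros_like(F)
        for _ in range(M):
            w = np.linalg.svd(Xc.T @ F)[0][:, 0]  # weight vector maximizing cov(Xc w, F)
            t = Xc @ w
            Pt = np.outer(t, t) / (t @ t)         # projector on t (uniform weights D)
            Yhat = Yhat + Pt @ F                  # partial model \hat{Y}_k, summed as in (6)
            F = F - Pt @ F                        # deflation step (5)
            T.append(t)                           # uncorrelatedness only approximated here
        return np.column_stack(T), Yhat, F        # Y = \hat{Y} + F_M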
Table 5: R^2 of the responses for nested ASPLS models.
Model dim.   CAL    LODG   ELEC   EDUC   TV     % of total variance
1            0.054  0.104  0.370  0.463  0.874  37.26
2            0.534  0.547  0.405  0.579  0.881  58.82
3            0.722  0.576  0.423  0.820  0.889  68.64
4            0.814  0.648  0.692  0.820  0.902  77.57
To determine the ASPLS model dimension, one can easily measure the part of each component in the reconstruction of the total response variance. It can be shown (Durand and Sabatier 1994) that (6) enters the additive framework (4).
Figure 2: Influential coordinate function plots (in decreasing order from left to right) for EDUC modeled by ASPLS. The dotted vertical lines indicate the position of the knots.
No linear smoother matrix can be associated with model (6), since Y does not enter linearly in the expression of $\hat{Y}$. One can also consult the latter paper for computational aspects of ASPLS, which are based on normal equations just like the ASPC algorithm of Section 4.2.
Figure 3: The (t1, t2) display is similar to a first principal component scatterplot used to explain the responses summarized by the (c1, c2) and (u1, u2) plots.
The response EDUC, whose eight coordinate functions of main influence are presented in Figure 2, is modeled by ASPLS similarly to ASPCR: predictors RR, DR, DENS, GNP, EXP and AGRF mainly participate in the additive fit. Because responses are projected on components, a particular aspect of the reducing properties of ASPLS is the possibility of "naming" predictor components by their capability in explaining the responses. In Figure 3, axis 1 summarizes the opposition between strong and weak consumer countries, while axis 2 contrasts countries according to responses LODG and CAL.
Figure 4: Comparison between common (a),(b) and self-ASPLS (c),(d) principal component plots.
The six self-ASPLS components explain respectively 43.98, 14.94, 12.08, 7.00, 5.26 and 4.11 percent (total 87.37%) of the total variance, against 45.12, 18.25, 11.91, 8.49, 7.52 and 3.80 percent (total 95.09%) for common principal components.
Linear regression on self-ASPLS components provides an additive model in the same way as ASPCR does in Section 4.3. Table 6 shows R^2 values for the O.E.C.D. responses regressed on six explanatory self-ASPLS components. Figure 5 displays coordinate function plots of the influential predictors for the "current" response EDUC. For all the models and methods studied, we observe a great stability in the prediction, since variables DR, RR, GNP, DENS and AGRF all chiefly intervene with similar transformations in the model.
Table 6: R^2 of the responses for six nested self-ASPLS component regression models.
Model dim.   CAL    LODG   ELEC   EDUC   TV     % of total variance
1            0.047  0.018  0.106  0.072  0.611  18.28
2            0.052  0.084  0.107  0.245  0.676  23.28
3            0.338  0.091  0.327  0.465  0.811  40.84
4            0.427  0.376  0.466  0.700  0.861  56.60
5            0.469  0.545  0.475  0.701  0.864  61.02
6            0.570  0.606  0.605  0.757  0.875  68.26
Figure 5: Main coordinate function plots for EDUC modeled by regression on self-ASPLS components. The dotted vertical lines indicate the position of the interior knots.
The first stage of the ASPCAIV method is to find a p x p metric R and a vector of spline coefficients a that minimize the objective function $f(R, a) = \mathrm{trace}\{[YQY'D - X(a)RX(a)'D]^2\}$. Then, dimension reduction is done (if needed) by solving the eigenanalysis of $X(a)RX(a)'D$. The objective function can be interpreted as a measure of discrepancy between eigen-matrices for components associated with the predictor and response sets of variables. For fixed a, an optimal metric is explicitly given by

(9)
Figure 6: Evolution of the smoother along the first 6 steps of the method applied to the example based on 200 $(x_i, y_i)$ observations (dots), $y_i = \sin(2\pi(1 - x_i)^2) + x_i \varepsilon_i$, with $x_i$ uniform on [0,1] and $\varepsilon_i$ standard normal. The signal (dashed), the spline smooth (solid), degree 2 with 3 knots.
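The simulated data of Figure 6 are easy to reproduce; the following sketch computes a single least-squares spline smooth with the hypothetical coding_matrix helper above, not the full ASPCAIV iteration shown in the figure:

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(size=200))
    y = np.sin(2 * np.pi * (1 - x) ** 2) + x * rng.normal(size=200)  # signal plus growing noise
    B, _ = coding_matrix(x, K=3, m=3)                                # degree 2, 3 interior knots
    coef, *_ = np.linalg.lstsq(B, y - y.mean(), rcond=None)
    smooth = y.mean() + B @ coef                                     # least-squares spline smooth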
By taking the indicator matrix of classes for the Y sample matrix, a direct application of this result concerns nonlinear Discriminant Analysis (Durand 1992, 1993, Hastie et al. 1994). Note that, here, linear Discriminant Analysis is obtained at the first step of the algorithm (because nodal spline coefficients are used) and that discriminating by additive spline variables can only improve the linear results. Discriminant variables are deduced from principal components by using a scaling factor (Escoufier 1987). Denoting $G = (Y'DY)^{-1} Y'DX(a)$, the matrix whose rows $G_j$ are the centroids of the classes defined by X(a) and Y, the classification rule is: the object x to be classified is transformed into t (considered as a row vector) by using the optimal B-spline functions, and then assigned to class j if

$\|t - G_j\| = \min_{j'} \|t - G_{j'}\|.$    (10)
As in linear Discriminant Analysis, other metrics can be chosen for the geometrical assignment rule (10). However, in order that the spline transformations make sense, the user has to verify that each observation lies within the range of the corresponding variable in the training sample.
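A sketch of the centroid matrix and of rule (10), assuming an indicator matrix Y and uniform weights D, in which case G reduces to the class means of X(a):

    def centroids(Xa, labels, n_classes):
        """G = (Y'DY)^{-1} Y'D X(a): with indicator Y and uniform D, the class means of X(a)."""
        return np.vstack([Xa[labels == j].mean(axis=0) for j in range(n_classes)])

    def classify(t, G):
        """Rule (10): assign the transformed observation t to the nearest centroid's class."""
        return int(np.argmin(((G - t) ** 2).sum(axis=1)))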
Figure 7: Misclassified items are marked with circles and centroids with black squares.
The application illustrating the performance of the method is based on three-class, two-dimensional data (x, y). Simulated data are generated as follows: the three class distributions are uniform on annuli centered at the origin with respective extreme radii (0, 1), (1.5, 2.5) and (3, 3.5). The numbers of items are 50 for the first group and 100 for each of the others, so that the training sample matrix X is 250 x 2 and Y, the indicator matrix of classes, is 250 x 3. The usual linear and quadratic discriminant methods perform poorly on this data set, while discrimination by additive spline discriminant variables provides good results: Figure 7 presents the seven misclassified items, marked with circles, in both the (x, y) plane and the first two discriminant variables. Only one discriminant variable is needed for separating the classes, and the first eigenvalue, close to one, yields well separated groups.
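The simulated three-class data can be generated as follows (area-uniform sampling on each annulus is assumed):

    rng = np.random.default_rng(2)

    def annulus(n, r0, r1):
        """n points uniformly distributed on the annulus with radii (r0, r1)."""
        r = np.sqrt(rng.uniform(r0 ** 2, r1 ** 2, n))  # sqrt gives area-uniform sampling
        theta = rng.uniform(0.0, 2.0 * np.pi, n)
        return np.c_[r * np.cos(theta), r * np.sin(theta)]

    X = np.vstack([annulus(50, 0.0, 1.0), annulus(100, 1.5, 2.5), annulus(100, 3.0, 3.5)])
    labels = np.repeat([0, 1, 2], [50, 100, 100])  # X is 250 x 2; indicator Y would be 250 x 3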
8. Conclusion
In this paper we have compared and illustrated various additive multiresponse regression
methods from the point of view of prediction as well as interpretation. It is well known
that in the case of extreme collinearity among the predictors, interpreting linear regression coefficients is dangerous. ASPCAIV is an additive modeling method in which the dimension reducing aspects occur after the regression is computed. Therefore, its scope of application is that of predictor data sets with many more observations than variables. The possibility of choosing different metrics leads us to view ASPCAIV as a unifying framework for additive extensions of some two-data-block linear methods.
In some chemometrics, econometrics or sociometrics applications, the number of predictors exceeds or approximately equals the number of observations, and nonlinear methods presenting dimension reduction stages before, or at the same time as, the prediction is computed are to be preferred. The ASPLS method, and the regressions on ASPCs as well as on self-ASPLS components, all construct a set of uncorrelated components which are additive functions of the predictors. In ASPLS, linear regression on such components occurs as soon as they are constructed, thus generally providing a more parsimonious additive model in the sense that fewer additive components are needed for explaining the same amount of the response variance. A great stability has been found in interpreting the different additive models on the O.E.C.D. data: the choice of a set of influential predictors is largely independent of the method, and the coordinate spline function shapes are similar.
References
Bertier, P. and Bouroche, J.-M. (1975), Analyse des données multidimensionnelles, Paris: PUF.
De Boor, C. (1978), A practical guide to splines, New York: Springer.
Donnell, D. J. et al. (1994), Analysis of additive dependencies and concurvities using smallest additive principal components (with discussion), The Annals of Statistics, 22, 1635-1673.
Durand, J. F. (1992), Additive spline discriminant analysis, in Computational Statistics, Vol. 1, (Y. Dodge and J. Whittaker, eds.), Physica-Verlag, 144-149.
Durand, J. F. (1993), Generalized principal component analysis with respect to instrumental vari-
ables via univariate spline transformations, Computational Statistics & Data Analysis, 16, 423-440.
Durand, J. F. and Sabatier, R. (1994), Additive splines for PLS regression, Tech. Rept. 94-05, Unité de Biométrie, ENSAM-INRA-UM II, Montpellier, France. In press in Journal of the American Statistical Association.
Escoufier, Y. (1987), Principal components analysis with respect to instrumental variables, Euro-
pean Courses in Advanced Statistics, University of Napoli, 285-299.
Eubank, R. L. (1988), Spline smoothing and nonparametric regression, New York and Basel: Dekker.
Frank, I. E., and Friedman, J. H. (1993), A statistical view of some chemometrics regression tools (with discussion), Technometrics, 35, 109-148.
Gifi, A. (1990), Nonlinear multivariate analysis, Chichester: Wiley.
Hastie, T. and Tibshirani, R. (1990), Generalized additive models, London: Chapman and Hall.
Hastie, T. et al. (1994), Flexible discriminant analysis by optimal scoring, Journal of the American Statistical Association, 89, 1255-1270.
Ramsay, J. O. (1988), Monotone regression splines in action (with discussion), Statistical Science,
3, 425-461.
Rao, C. R. (1964), The use and the interpretation of principal component analysis in applied re-
search, Sankhya A, 26, 329-356.
Wold, S. et al. (1983), The multivariate calibration problem in chemistry solved by the PLS method, in Proc. Conf. Matrix Pencils, Ruhe, A. and Kågström, B. (eds.), Lecture Notes in Mathematics, Heidelberg: Springer-Verlag, 286-293.
Bounded Algebraic Curve Fitting for
Multidimensional Data
Using the Least-Squares Distance
Masahiro Mizuta
Division of Systems and Information Engineering, Hokkaido University
N.13, W.8, Kita-ku, Sapporo-shi, Hokkaido 060, Japan
Summary: Linear regression and smoothing techniques are not adequate for curve fitting in cases in which neither variable can be designated as the response. We present a new method for fitting a bounded algebraic curve to multidimensional data using the least-squares distance between the data points and the curve. Numerical examples of the proposed method are also shown.
1. Introduction
In the data analysis process, we can sometimes investigate data structures with curve fitting methods. Curves can be represented by explicit functions, parametric functions, or implicit functions. Curves represented by explicit functions reveal the influences of other variables on one variable, as in regression analysis. Curves given by parametric functions provide a latent order of the data and are depicted easily with computer graphics. Curves given by implicit functions, i.e., algebraic curves, show relations among the variables. Many researchers have proposed curve fitting methods with algebraic curves. In particular, it is worth noticing that Keren et al. (1994) and Taubin et al. (1994) independently developed algorithms for bounded algebraic curves. However, most studies are based on approximate distances between the data points and the algebraic curve. We have developed a method to find the algebraic curve that minimizes the sum of squared exact distances. In this article, we propose a method to find bounded algebraic curves based on exact distances.
Fig. 1: Distance between the curve Z(f) and the point $(\alpha, \beta)$.
It has been said that the distance between a point and an algebraic curve cannot be computed by direct methods. So, Taubin proposed an approximate distance from a point a to Z(f) (Taubin (1991)). The point y that approximately minimizes the distance $\|y - a\|$ is given by

$y = a - (\nabla f(a)^T)^{+} f(a),$    (3)

where $(\nabla f(a)^T)^{+}$ is the pseudoinverse of $\nabla f(a)^T$. The distance from a to Z(f) is approximated by

$\mathrm{dist}(a, Z(f))^2 \approx \frac{f(a)^2}{\|\nabla f(a)\|^2}.$    (4)
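Approximation (4) is cheap to compute; a small sketch with the unit circle as Z(f), illustrating that the approximation underestimates the exact distance away from the curve:

    import numpy as np

    def taubin_distance(f, grad_f, a):
        """First-order approximate distance (4) from the point a to the zero set Z(f)."""
        return abs(f(*a)) / np.linalg.norm(grad_f(*a))

    f = lambda x, y: x ** 2 + y ** 2 - 1.0              # Z(f) is the unit circle
    grad_f = lambda x, y: np.array([2.0 * x, 2.0 * y])
    print(taubin_distance(f, grad_f, (2.0, 0.0)))       # 0.75, while the exact distance is 1.0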
(Figure: the exact distance versus the approximate distance from a data point to the curve f(x) = 0.)
$Q(x, \mu, r) = (x - \alpha)^2 + (y - \beta)^2 + \mu f(x, y) + 2^{-1} r\, f(x, y)^2$
Let $f_k(x, y)$ be the form of degree k of a polynomial f(x, y): $f(x, y) = \sum_{k=0}^{d} f_k(x, y)$. The leading form of a polynomial f(x, y) of degree d is defined as $f_d(x, y)$. For example, the leading form of $f(x, y) = x^2 + 2xy - y^2 + 5x - y + 3$ is $f_2(x, y) = x^2 + 2xy - y^2$.
Lemma: For an even positive integer d, any leading form $f_d(x, y)$ can be represented by $XAX^T$, where A is a symmetric matrix and $X = (x^{d/2}, x^{d/2-1}y, \ldots, xy^{d/2-1}, y^{d/2})$.
Remark: The symmetric matrix A is not unique. For example, for d = 4 one may take

$B = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_2 & a_4 & a_5 \\ a_3 & a_5 & a_6 \end{pmatrix}.$
Let $(\alpha_i, \beta_i)$ (i = 1, 2, ..., n) be n data points in the plane. The point in Z(f) that minimizes the distance from $(\alpha_i, \beta_i)$ is denoted by $(x_i, y_i)$ (i = 1, 2, ..., n). The sum of squares of distances is

$R = \sum_{i=1}^{n} R_i,$
where $Z(f) = \{(x, y) \mid f(x, y) = 0\}$.
We can minimize R with respect to the parameters of the polynomial f with the Levenberg-Marquardt method. The method requires the partial derivatives of R with respect to the coefficients $a_j$. The only thing left to discuss is a solution for $\partial x / \partial a_j$ and $\partial y / \partial a_j$. Hereinafter, the subscript i is omitted.
By differentiating both sides of $f(a_1, \ldots, a_q, x, y) = 0$ with respect to $a_j$ (j = 1, ..., q), we obtain

$\frac{\partial f}{\partial x}\frac{\partial x}{\partial a_j} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial a_j} + \frac{d f}{d a_j} = 0,$    (6)

where $d f / d a_j$ is the derivative of f with respect to $a_j$ when x and y are fixed.
Because $(\alpha, \beta)$ is on the normal line from (x, y), we have $(\alpha - x)\,\partial f/\partial y - (\beta - y)\,\partial f/\partial x = 0$. By differentiating both sides with respect to $a_j$, we obtain equation (7).
Equations (6) and (7) are simultaneous linear equations in the two variables $\partial x / \partial a_j$ and $\partial y / \partial a_j$, so we can obtain $\partial x / \partial a_j$ and $\partial y / \partial a_j$.
4. Numerical examples
Two examples are provided of bounded versus unbounded curve fitting. The first example is two-dimensional artificial data of size 40 that lie in the neighborhood of an astroid. We set the result of GPCA (Generalized Principal Components Analysis; Gnanadesikan (1977), Mizuta (1983)) as an initial curve, and fit an algebraic curve and a bounded algebraic curve of degree 4 with the proposed method (Fig. 5). The sum of squares of distances R is 0.088 in the case of bounded fitting. This value is greater than the value in the case of unbounded fitting (R = 0.026). But Figure 5 shows that the bounded algebraic curve reveals a suitable outline of the data points.
The second example is three-dimensional data of size 210. The 210 points almost lie on a closed cylinder (Fig. 6(a)). We also apply the method to the data with an algebraic surface and a bounded algebraic surface of degree 4 (Fig. 6(b), (c)). The value of R is 1.239 in the case of bounded fitting and 0.924 in the case of unbounded fitting. The unbounded algebraic surface reproduces the structure of the closed cylinder, and the bounded surface shows the global shape of the data points.
5. Concluding remarks
In this article, we did not discuss the fitting of curves in 3-dimensional space.
References:
Keren, D., Cooper, D. and Subrahmonia, J. (1994). Describing complicated objects by implicit polynomials, IEEE Trans. Patt. Anal. Machine Intell., 16, 1, 38-53.
Kriegman, D. J. and Ponce, J. (1990). On recognizing and positioning curved 3-D objects
from image contours, IEEE Trans. Patt. Anal. Machine Intell., 12, 12, 1127-1137.
Mizuta, M. (1983). Generalized principal components analysis invariant under rotations of
a coordinate system, J.Japan Statist. Soc., 14, 1-9.
Mizuta, M. (1995). A derivation of the algebraic curve for two-dimensional data using the least-squares distance, in Data Science and Its Application, Hayashi, C. et al. (eds.), 167-176, Academic Press, Tokyo.
Taubin, G. (1991). Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation, IEEE Trans. Patt. Anal. Machine Intell., 13, 11, 1115-1138.
Taubin, G. (1994). Distance approximations for rasterizing implicit curves, ACM Trans. on Graphics, 13, 1, 3-42.
Taubin, G., Cukierman, F., Sullivan, S., Ponce, J. and Kriegman, D. J. (1994). Parameterized families of polynomials for bounded algebraic curve and surface fitting, IEEE Trans. Patt. Anal. Machine Intell., 16, 3, 287-303.
Using the Wavelet Transform for
Multivariate Data Analysis and
Time Series Analysis
Fionn Murtagh 1, Alexandre Aussem 2
1 Faculty of Informatics, University of Ulster, Magee College, Londonderry BT48 7JL, Nth. Ireland
Email: [email protected]  Web: www.infm.ulst.ac.uk/~fionn
2 Université Blaise Pascal, Clermont-Ferrand II, ISIMA, Campus des Cézeaux, BP 125, 63173 Aubière Cedex, France
Email: [email protected]
Summary: We discuss the use of orthogonal wavelet transforms in multivariate data analysis methods such as clustering and dimensionality reduction. Wavelet transforms allow us to introduce multiresolution approximation, and multiscale nonparametric regression or smoothing, in a natural and integrated way into the data analysis. Applications illustrate the power of this new perspective on data analysis.
1. Introduction
Data analysis, whether for exploratory purposes or for prediction, is usually preceded by various data transformations and recoding. In fact, we would hazard a guess that 90% of the
work involved in analyzing data lies in this initial stage of data preprocessing. This
includes: problem demarcation and data capture; selecting non-missing data of fairly
homogeneous quality; data coding; and a range of preliminary data transformations.
The wavelet transform offers a particularly appealing data transformation, as a pre-
liminary to data analysis. It offers additionally the possibility of close integration
into the analysis procedure as will be seen in this article. The wavelet transform
may be used to "open up" the data to de-noising, smoothing, etc., in a natural and
integrated way.
to $x_{m+1}$, the detail signal, is yielded by the wavelet transform. If $\xi_m$ is this detail signal, then the following holds:

$x_{m+1} = H^T(m)\, x_m + G^T(m)\, \xi_m$    (1)

where G(m) and H(m) are matrices (linear transformations) depending on the wavelet chosen, and T denotes transpose (adjoint). An intermediate approximation of the original signal is immediately possible by setting detail components $\xi_{m'}$ to zero for $m' \ge m$ (thus, for example, to obtain $x_2$, we use only $x_0$, $\xi_0$ and $\xi_1$). Alternatively we can de-noise the detail signals before reconstituting x, and this has been termed wavelet regression (Bruce and Gao, 1994).
Define $\xi$ as the row-wise juxtaposition of all detail components, $\{\xi_m\}$, and the final smoothed signal, $x_A$, and consider the wavelet transform W given by

$Wx = (\xi, x_A).$    (2)

The right-hand side is a concatenation of vectors. Taking $W^T W = I$ (the identity matrix) is a strong condition for exact reconstruction of the input data, and is satisfied by an orthogonal wavelet transform. The important fact that $W^T W = I$ will be used below in our enhancement of multivariate data analysis methods. This permits use of the "prism" (or decomposition in terms of scale and location) of the wavelet transform.
Examples of these orthogonal wavelets, i.e. the operators G and H, are the Daubechies family and the Haar wavelet transform (Press et al., 1992; Daubechies, 1992). For the Daubechies D4 wavelet transform, H is given by

(0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226)

and G is given by

(-0.1294095226, -0.2241438680, 0.8365163037, -0.4829629131).

Implementation is by decimating the signal by two at each level and convolving with G and H; therefore the number of operations is proportional to $n + n/2 + n/4 + \cdots = O(n)$. Wrap-around (or "mirroring") is used by the convolution at the extremities of the signal.
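A minimal sketch of this decimated scheme, assuming periodic wrap-around and a signal length divisible by two at each level:

    import numpy as np

    H = np.array([0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226])    # smooth filter
    G = np.array([-0.1294095226, -0.2241438680, 0.8365163037, -0.4829629131])  # detail filter

    def dwt_step(x):
        """One decimate-by-two D4 step with wrap-around at the extremities."""
        xs = np.r_[x, x[:3]]  # periodic extension by the filter length minus one
        smooth = np.array([H @ xs[2 * i: 2 * i + 4] for i in range(len(x) // 2)])
        detail = np.array([G @ xs[2 * i: 2 * i + 4] for i in range(len(x) // 2)])
        return smooth, detail

    def dwt(x, levels):
        """Detail signals xi_m at each scale plus the final smooth x_A; O(n) work overall."""
        details = []
        for _ in range(levels):
            x, d = dwt_step(x)
            details.append(d)
        return details, x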
Euclidean metric, which nonetheless covers a considerable area of current data analysis practice.
Note that the wavelet basis is an orthogonal one, but it is not a principal axis one (which is orthogonal, but also optimal in terms of least squares projections). Wickerhauser (1994) proposed a method to find an approximate principal component basis by determining a large number of (efficiently calculated) wavelet bases, and keeping the one closest to the desired Karhunen-Loève basis. If we keep, say, an approximate representation allowing reconstitution of the original n components by n' components (due to the dyadic analysis, $n' \in \{n/2, n/4, \ldots\}$), then we see that the space spanned by these n' components will not be the same as that spanned by the first n' principal components.
Figure 1: Sample of 20 spectra (from the 45 used) with original flux measurements plotted on the y-axis.
Figure 2: Sample of 20 spectra (as in previous Fig.), each normalized to unit maximum value, then wavelet transformed, approximately 75% of wavelet coefficients set to zero, and reconstituted.
In wavelet space or in direct space, the assignment results obtained were identical.
With 76% of the wavelet coefficients zeroed, the result was very similar, indicating
that redundant information had been successfully removed. This approach to SOFM
construction leads to the following possibilities:
2. Data "cleaning" or filtering is a much more integral part of the data analysis
processing. If a noise model is available for the input data, then the data
can be de-noised at multiple scales. By suppressing wavelet coefficients at
certain scales, high-frequency (perhaps stochastic or instrumental noise) or low-
frequency (perhaps "background") information can be removed. Part of the
data coding phase, prior to the analysis phase, can be dealt with more naturally
in this new integrated approach.
A number of runs of the k-means partitioning algorithm were made. The exchange method, described in Spath (1985), was used. Four, or two, clusters were requested. Identical results were obtained for both data sets, which is not surprising given that this partitioning method is based on the Euclidean distance. For the 4-cluster and 2-cluster solutions we obtained respectively these assignments:

123213114441114311343133141121412222222121114
122211111111111111111111111121112222222121111
The case of principal components analysis was very interesting. We know that the basic PCA method uses Euclidean scalar products to define the new set of axes. Often PCA is used on a variance-covariance input matrix (i.e. the input vectors are centered), or on a correlation input matrix (i.e. the input vectors are rescaled to zero mean and unit variance). These two transformations destroy the Euclidean metric properties vis-à-vis the raw data. Therefore we used PCA on the unprocessed input data. We obtained identical eigenvalues and eigenvectors for the two input data sets; the eigenvalues agree up to numerical precision, and the eigenvectors are similarly identical. The actual projection values are entirely different. This is simply due to the fact that the principal components in wavelet space are themselves inverse-transformable to provide principal components of the initial data.
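This invariance follows from $W^T W = I$ and is easy to check numerically; in the sketch below a random orthogonal matrix stands in for the wavelet transform W:

    rng = np.random.default_rng(3)
    X = rng.normal(size=(45, 64))                         # e.g. 45 spectra with 64 channels
    Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))        # stand-in orthogonal transform
    ev_raw = np.linalg.eigvalsh(X.T @ X)                  # PCA on the unprocessed data
    ev_wav = np.linalg.eigvalsh((X @ Q).T @ (X @ Q))      # = Q'(X'X)Q, a similar matrix
    assert np.allclose(np.sort(ev_raw), np.sort(ev_wav))  # identical eigenvalues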
Various aspects of this relationship between original and wavelet space remain to be investigated. We have argued for the importance of this in the framework of data coding and preliminary processing. We have also noted that if most values can be set to zero with limited (and maybe beneficial) effect, then there is considerable scope for computational gain too. The processing of sparse data can be based on an "inverted file" data structure which maps non-zero data entries to their values. The inverted file data structure is then used to drive the distance and other calculations. Murtagh (1985, pp. 51-54 in particular) discusses various algorithms of this sort.
8. Wavelet-Based Forecasting
In experiments carried out on the sunspots benchmark dataset (yearly averages from 1720 to 1979, with forecasts carried out on the period 1921 to 1979; see, e.g., Tong, 1990), a wavelet transform was used for the values up to a time-point $k_0$. One-step-ahead forecasts were carried out independently at each level. These were summed to produce the overall forecast (cf. the additive decomposition of the original data provided by the wavelet transform). An interesting variant on this was also investigated: there was no need to use the same forecasting method at each level i. We ran autoregressive, multilayer perceptron and recurrent connectionist networks in parallel, and kept the best results indicated by cross-validation on withheld data at that level. We found the overall result to be superior to working with the original data alone, or with one forecasting engine alone. Details of this work can be found in Aussem and Murtagh (1996).
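The per-level forecasting idea can be sketched as follows; the Haar-like redundant (à trous style) decomposition and the AR stand-in engine are illustrative assumptions, not necessarily the filters or forecasters used by Aussem and Murtagh:

    def a_trous(x, J=3):
        """Redundant decomposition x = c_J + sum_j w_j (pointwise), so forecasts can be summed."""
        c, details = np.asarray(x, float), []
        for j in range(J):
            shifted = np.r_[c[:2 ** j], c[:-2 ** j]]  # causal shift by 2^j; edge values copied
            c_next = 0.5 * (c + shifted)
            details.append(c - c_next)
            c = c_next
        return details, c

    def ar_one_step(z, p=4):
        """Least-squares AR(p) one-step-ahead forecast, an illustrative stand-in engine."""
        Z = np.column_stack([z[i: len(z) - p + i] for i in range(p)])
        coef, *_ = np.linalg.lstsq(Z, z[p:], rcond=None)
        return z[-p:] @ coef

    def wavelet_forecast(x, J=3):
        details, smooth = a_trous(x, J)
        return sum(ar_one_step(w) for w in details) + ar_one_step(smooth)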
9. Conclusion
The results described here, from the multivariate data analysis perspective, are very exciting. They not only open up the possibility of computational advances but also provide a new approach in the area of data coding and preliminary processing.
The chief advantage of these wavelet methods is that they provide a multiscale de-
composition of the data, which can be directly used by multivariate data analysis
methods, or which can be complementary to them.
A major element of this work is to show the practical relevance of doing this. It has
been the aim of this paper to do precisely this in a few cases. Finding a symbiosis
between what are, at first sight, methods with quite different bases and quite differ-
ent objectives, requires new insights. Wedding the wavelet transform to multivariate
data analysis no doubt leaves many further avenues to be explored.
Further details of the experimentation described in this paper, details of code used,
and further information, can be found in Murtagh (1996).
References:
Aussem, A. and Murtagh, F. (1996): Combining neural network forecasts on wavelet-transformed time series, Connection Science, in press.
Bhatia, M., Karl, W.C. and Willsky, A.S. (1996): A wavelet-based method for multiscale tomographic reconstruction, IEEE Transactions on Medical Imaging, 15, 92-101.
Bijaoui, A., Starck, J.-L. and Murtagh, F. (1994): Restauration des images multi-échelles par l'algorithme à trous, Traitement du Signal, 11, 229-243.
Bruce, A. and Gao, H.-Y. (1994): S+Wavelets User's Manual, Version 1.0, Seattle, WA: StatSci Division, MathSoft Inc.
Daubechies, I. (1992): Ten Lectures on Wavelets, Philadelphia: SIAM.
Holschneider, M., Kronland-Martinet, R., Morlet, J. and Tchamitchian, Ph. (1989): A real-time algorithm for signal analysis with the help of the wavelet transform, in J.M. Combes, A. Grossmann and Ph. Tchamitchian (eds.), Wavelets: Time-Frequency Methods and Phase Space, Berlin: Springer-Verlag, 286-297.
Murtagh, F. (1985): Clustering Algorithms, Würzburg: Physica-Verlag.
Murtagh, F. and Hernandez-Pajares, M. (1995): The Kohonen self-organizing feature map method: an assessment, Journal of Classification, 12, 165-190.
Murtagh, F. (1996): Wedding the wavelet transform and multivariate data analysis, Journal of Classification, submitted.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992): Numerical Recipes, 2nd ed., Chapter 13, New York: Cambridge University Press.
Shensa, M.J. (1992): The discrete wavelet transform: wedding the à trous and Mallat algorithms, IEEE Transactions on Signal Processing, 40, 2464-2482.
Spath, H. (1985): Cluster Dissection and Analysis, Chichester: Ellis Horwood.
Starck, J.-L. and Bijaoui, A. (1994): Filtering and deconvolution by the wavelet transform, Signal Processing, 35, 195-211.
Starck, J.-L., Bijaoui, A. and Murtagh, F. (1995): Multiresolution support applied to image filtering and deconvolution, Graphical Models and Image Processing, 57, 420-431.
Strang, G. (1989): Wavelets and dilation equations: a brief introduction, SIAM Review, 31, 614-627.
Strang, G. and Nguyen, T. (1996): Wavelets and Filter Banks, Wellesley, MA: Wellesley-Cambridge Press.
Tong, H. (1990): Non-Linear Time Series, Oxford: Clarendon Press.
Wickerhauser, M.V. (1994): Adapted Wavelet Analysis from Theory to Practice, Wellesley, MA: A.K. Peters.
Visual Manipulation Environment for Data
Analysis System
Masahiro Mizuta 1, Hiroyuki Minami 2
1 Division of Systems and Information Engineering, Hokkaido University, N.13, W.8, Kita-ku, Sapporo-shi, Hokkaido 060, Japan
2 Department of Information and Management Science, Otaru University of Commerce, 3-5-21, Midori, Otaru-shi, Hokkaido 047, Japan
Summary: Most statistical software packages use their graphical facilities mainly to display data, not to construct and execute analyses. We have developed a data analysis system based on visual manipulation, and have improved it on the UNIX platform with Tcl/Tk, an interface builder available on many architectures and operating systems. We describe its features and introduce some examples in our environment.
(Figure: an example data flow; boxes are programs, connected by arrows, with status and execution outputs.)
An arrow stands for a datum and a box is a (statistical) method. The leftmost box stands for the import procedure for an observation, and the others are methods which are applied to the input. The data flow through many arrows and boxes.
A "flow chart", which shows the control of a program, is used in software development and is similar to our concept. It is really useful when we check the control flow of a program, but it cannot be mapped onto a real program since we cannot find the input and the output from such a chart.
The flows we construct can be executed directly as procedures since they show flows of data.
2.1 Overview of our system
We made a prototype of a data analysis system based on the data flow diagram on a PC (MS-DOS) (Mizuta (1990)). The prototype was ported to a workstation (X Window System on UNIX) in 1991.
Recently, the progress of computer environments has accelerated. We therefore decided to rebuild our system based on an environment available on many computer platforms.
Tcl/Tk is a kind of interface builder for graphical computer environments. It was originally developed on the X Window System and has been ported to many platforms; it is now available on MS-Windows, Windows 95, Windows NT and so on. It can handle interface parts (buttons, menus, bars, etc.) and assign their functions easily. These characteristics are suitable for our system, so we built a new prototype with Tcl/Tk.
Fig. 3: First phase
The selected procedure box appears in the view and follows the movement of the pointing device (mouse). The user can put it at any place he/she likes. When he/she decides its position, clicking the left button of the mouse fixes the box. Figure 4 shows a situation where some boxes have been put.
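Since the system is built on Tcl/Tk, the box-placement interaction can be sketched with Tkinter, Python's binding to the same toolkit (a toy illustration, not the system's actual code):

    import tkinter as tk

    root = tk.Tk()
    canvas = tk.Canvas(root, width=400, height=300, bg="white")
    canvas.pack()

    def drop_box(event):
        # fix a "procedure box" at the position where the user clicked
        canvas.create_rectangle(event.x - 40, event.y - 15, event.x + 40, event.y + 15)
        canvas.create_text(event.x, event.y, text="Procedure")

    canvas.bind("<Button-1>", drop_box)  # a left-button click places a box
    root.mainloop()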
Fig. 4: Some procedure boxes (e.g. Input, Output, Discrimination, Prediction, Tree, ScatterPlot, Histogram, BarPlot, Pie) placed in the view.
Next, the user makes connections between boxes to construct a flow. He/She clicks a source box, and an arrow appears. The start point is fixed to the source box and the other end follows the movement of the pointing device. He/She clicks on another box, and the two boxes are connected. The view in Figure 5 shows two streams. The upper flow performs a discriminant analysis and prints the results. The lower performs a regression analysis and plots the results.
Now the user can really do the analysis. He/She can click the rightmost box of a flow. A procedure box is executed if its input is calculated or ready to import. If not, the procedure which is expected to make the input is triggered. In short, if the rightmost box is activated, all boxes on the flow are executed recursively.
Figure 6 shows the result of a regression analysis. The user can get the result through the plot with the points and the regression line.
5. Concluding Remarks
The "Macro" feature may be effective for the system. The flow except an import box
Acknowledgment
We would like to thank Mr. Kikuchi (Graduate School of Information Engineering, Hokkaido University) for his programming support. A part of this work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of the Japanese Government.
References:
Becker, R.A., Chambers, J.M. and Wilks, A.R. (1988). The New S Language, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove.
Minami, M., Mizuta, M. and Sato, Y. (1993). A Knowledge Supporting System for Data Analysis, Journal of the Japanese Society of Computational Statistics, 6, 1, 85-97.
Mizuta, M. (1990). Data Analysis System with Visual Manipulations, Bulletin of the Computational Statistics of Japan, 3, 1, 23-29 (in Japanese).
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit, Addison-Wesley.
Shu, N. C. (1988). Visual Programming, Van Nostrand Reinhold Company.
Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley.
Human Interface for Multimedia Database with
Visual Interaction Facilities
Toshikazu Kato
Summary: This paper describes visual interaction mechanisms for image database systems. The typical mechanisms for visual interaction are query by visual example (QVE) and query by subjective description (QBD). The former includes a similarity retrieval function driven by showing a sketch, and the latter includes a sense retrieval function driven by learning the user's personal taste. We modeled the user's visual perception process at four levels: a physical level, a physiological level, a visual psychological level, and a visual cognition level. These models are automatically created by image analysis and statistical learning, and are referred to as multimedia indexes by database systems.
1. Introduction
"A picture is worth a thousand words." A human interface plays an important role in a multimedia information system. For instance, we require a content-based visual interface in order to communicate visual information itself to and from a multimedia database system, Iyenger and Kashyap (1988), Grosky and Mehrotra (1989). The algorithms for multimedia operations have to suit the user's subjective viewpoint, such as a similarity measure, a sense of taste, etc. Thus, we have to provide flexible interaction mechanisms to design a multimedia human interface.
We expect a multimedia database system to manage multimedia data themselves, such as image data, as well as alphanumeric data. We also expect it to provide a human interface to accomplish flexible man-machine communication in a user-friendly manner. Then, what is needed in multimedia interaction? We can summarize the essential needs in multimedia interaction as follows, Kato et al. (1991).
(a) Visual query on pictorial domain: We need to communicate multimedia data to and from the database in a user-friendly manner. For instance, we would like to show image data itself as a pictorial key to retrieve some visual information from database systems.
(b) Subjectivity of judging criteria: We want to adjust database operations as well as database schemata to each of our subjective views. In the case of similarity retrieval, we would like to get suitable candidates according to our subjective measures, where the measures may differ with each individual.
(c) Interpretation between multimedia domains: Some of the multimedia queries should evaluate multimedia data on different domains. In the case of content-based retrieval, we expect the system to retrieve some image data by describing their contents as text data.
We can answer these needs by a multimedia human interface with our visual interaction facilities. Let us show the general framework of visual interaction by typical user query requests in our applications. Our basic ideas are QVE (query by visual example) and QBD (query by subjective description). Multimedia interaction requires interpreting the contents of multimedia information in order to operate from the user's subjective viewpoint. Thus, interpretation algorithms have to suit the perception processes of each user. Such processes belong to a subjective human factor. In our multimedia human interface, the system refers to the object model of the multimedia information and the user model of the perception process to operate from the user's subjective viewpoint.
(Figure: the levels of the user's visual perception model: cognitive, psychological, and physiological.)
graphical features.
(3) Psychological level interaction: A user may wish to see some graphic symbols which give him a similar impression from his view. We have to notice that the criterion for similarity belongs to a subjective human factor. Although human beings have anatomically common organs, each person may show a different interpretation in classification and similarity measures. It means each person has his own weighting factors on graphical features. The system should evaluate similarity according to his subjective criterion. Therefore, the system should analyze and learn the subjective similarity measure on the images for each user. Graphical features are mapped into subjective features by the weighting factors.
(4) Cognitive level interaction: We often have different impressions, even when viewing the same painting. Each person may also give a unique interpretation even when viewing the same picture. It seems each person has his own correlation between concepts and graphical features and/or subjective features. The system should evaluate a subjective description according to his criterion. Therefore, the system should analyze and learn the correlation between the subjective descriptions and the images for each user.
A user model is needed to operate visual information based on the subjective viewpoint of
each user. We have to develop a simple learning algorithm to adjust the criteria for each
user.
3. Query by Visual Example at Physiological Level
This chapter describes the visual perception models and algorithms for query by visual
example (QVE), i.e. similarity retrieval on objective criteria.
(2) Spatial frequency RunB/W, RunW: The spatial frequency measures the complexity of graphic symbols. RunB/W approximates the frequency by the run-length distribution of each rectangular mesh. Here, the figure is divided into four horizontal meshes as well as four vertical meshes. Similarly, we defined RunW without distinguishing the black and white runs.
(3) Local correlation measure and local contrast measure Corr4, Cont4: The local correlation and the local contrast show the spatial structure, such as the regularity of the arrangement of partial figures:

$\mathrm{Corr4} = m_{ij} \times m_{i'j'}, \qquad \mathrm{Cont4} = \frac{m_{ij} - m_{i'j'}}{m_{ij} + m_{i'j'}} \qquad (0 \le i \le 4,\; 0 \le j \le 4),$

where $m_{ij}$, $m_{i'j'}$ are adjacent meshes. These parameters are defined on 4 x 4 square meshes.
[Alg. 1] GF space for graphic symbols
(1) Analyze the layout of a document image to extract graphic symbols. Normalize the image size of the graphic symbols.
(2) Calculate the GF vector $P_i$ for each graphic symbol.
(Diagram: construction of the subjective feature (SF) space from the graphical feature (GF) space; the linear mapping A maximizes $J = \mathrm{tr}(\Sigma_W^{-1} \Sigma_B)$, the ratio of the average inter-group covariance $\Sigma_B$ to the average intra-group covariance $\Sigma_W$, subject to $A^T \Sigma_W A = I$.)
Let us show the sketch retrieval algorithm for graphic symbols. Fig. 2 shows the outline of the whole QVE mechanism. In Fig. 2, the sketch retrieval process is enclosed by solid lines.
[Alg. 2] Sketch retrieval on GF space
(1) Normalize the image size of the sketch, i.e. the visual example.
(2) Calculate the GF vector $P_0$ of the sketch.
(3) Calculate the distance $d_i$ between the sketch $P_0$ and the graphic symbols $P_i$ in the database:
$d_i = \sum_k w_k \| p_{0,k} - p_{i,k} \|,$
where $p_k$ and $w_k$ denote the kth GF feature and its weight factor.
(4) Choose the graphic symbols in ascending order of $d_i$.
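Steps (3) and (4) amount to a weighted nearest-neighbour ranking; a small sketch, in which the per-feature combination of the garbled distance formula is an assumption:

    import numpy as np

    def sketch_retrieval(p0, P, w):
        """Rank database symbols by the weighted GF distance d_i (Alg. 2, steps 3-4)."""
        d = np.abs(P - p0) @ w  # assumed weighted sum of per-feature deviations
        return np.argsort(d)    # indices of the candidates in ascending order of d_i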
We have evaluated this algorithm in an experiment in which we showed fair copies, hand-written sketches and rough sketches as visual examples, 100 of each. (Currently, the TRADEMARK database manages about 2,000 graphic symbols.) We have tested the recall ratio. Here, the recall ratio shows the rate of retrieval of the original graphic symbol among the best ten candidates. For a fair copy, the system had an almost 100% recall ratio among the first ten candidates, using the GF features. Even for the rough sketches, it had about 95% recall. We may conclude that our GF features satisfy the requirements for a robust image model for sketch retrieval.
4. Query by Visual Example at Psychological Level
This chapter describes another aspect of query by visual example (QVE), i.e. similarity retrieval on subjective criteria.
The system retrieves graphic symbols by comparing their SF vectors on the personal index, and then shows suitable candidates. The algorithm for similarity retrieval is as follows; it is also shown enclosed by dotted lines in Fig. 2.
[Alg. 4] Similarity retrieval on SF space
(1) Apply the linear mapping A to the GF vector $P_0$ of the sketch: $r_0 = A' P_0$.
(2) Choose the neighboring graphic symbols $P_i$ on the personal index as the candidates for similarity retrieval.
(3) Calculate the distance $d_i$ between the sketch $r_0$ and the graphic symbols $r_i$ on the personal index: $d_i = \| r_i - r_0 \|$.
(4) Choose the graphic symbols in ascending order of $d_i$.
Such candidates could not be obtained by the sketch retrieval on the GF space, since their graphic features differ from those of the visual example.
We have evaluated the learning algorithm and the similarity retrieval algorithm in an experiment with eleven users. In this experiment, we used 230 graphic symbols out of 2,000 as the samples. The system retrieved at least one similar graphic symbol, with more than a 98% recall ratio, among the first ten candidates. We may conclude that our SF spaces satisfy the subjective similarity measure of each user.
(2) The user describes his impressions as the weights of the adjectives $a_i$ for each painting $k \in P$.
(Figure: query modes: pointing on the reduced SF space, subjective description, and visual example.)
Once the system has learned the linear mappings F and G, it can automatically construct the personal index from the GF vectors alone. This is a labor-saving algorithm for indexing.
Acknowledgments:
The author would like to thank his colleagues at the Electrotechnical Laboratory, especially Dr. Akio Tojo, Dr. Toshitsugu Yuba, Dr. Kunikatsu Takase, Dr. Hideo Tsukune, Mr. Koreaki Fujimura, and Mr. Takio Kurita, for their support in this research.
The author also thanks the students from the University of Library and Information Science (ULIS) and Tsukuba University, and the visitors from private companies.
Part VII
Jyuji Misumi
Institute of Social Research,
Institute of Nuclear Safety System, Incorporated
Keihanna Plaza, 1-7
Hikaridai, Seika-cho,
Soraku-gun, Kyoto 619-02,
Japan
Summary: This is our approach to the behavioral science of leadership, comprising a behavioral morphology and a behavioral dynamics, each with its specificity and generality. In this sense, it is different from current social science research, which lacks the category of general behavioral morphology and a cross-disciplinary perspective. A balance must be struck among the four areas of the above-mentioned paradigm if there is to be productive interdisciplinary research.
Mouton's (1964) model.
In the case of the PM concept, we consider P and M to be two axes on which the level of each type can be measured (high or low), thus obtaining four distinct types of leadership (see Fig. 1). The validity of these four PM types was proved using correspondence analysis, first developed by Guttman (1950) and later by Hayashi (1956).
[Fig. 1: Conceptual representation of the 4 patterns of PM leadership behavior (pm, Pm, pM, PM) on the P scale and the M scale (Misumi, J., 1984).]
                                  Situation
                          General               Specific
Behavioral-morphology     General behavioral    Specific behavioral
dimension                 morphology            morphology
Table 1
Factor Loadings of Main Items on Leadership (Misumi, 1984)

Items                                                                    Factor loadings
                                                                          I      II     III
59. Make subordinates work to maximum capacity                           .687  -.017  -.203
57. Fussy about the amount of work                                       .670  -.172   .029
50. Fussy about regulations                                              .664  -.072   .001
58. Demand finishing a job within time limit                             .639   .070   .065
51. Give orders and instructions                                         .546   .207   .198
60. Blame the poor job on the employee                                   .528   .113  -.121
74. Demand reporting on the progress of work                             .466   .303   .175
86. Support subordinates                                                 .071   .780   .085
96. Understand subordinates' viewpoint                                   .079   .775   .229
92. Trust subordinates                                                   .024   .753  -.003
109. Favor subordinates                                                  .067   .742  -.050
82. Subordinates talk to their superior without any hesitation          -.026   .722   .059
101. Concerned about subordinates' promotion, pay-raise, and so forth    .147   .713   .134
88. Show consideration for subordinates' personal problems               .132   .705   .150
94. Express appreciation for job well done                               .058   .651   .129
104. Impartial to everyone in work group                                -.143   .644   .164
95. Ask subordinates' opinion of how on-the-job problems should be solved  .049  .643  .121
85. Make efforts to fill subordinates' requests when they request improvement of facilities  .110  .606  .333
81. Try to resolve unpleasant atmosphere                                 .233   .538   .338
87. Give subordinates jobs after considering their feelings             -.276   .478   .457
76. Work out detailed plans for accomplishment of goals                  .229   .212   .635
75. No time is wasted because of inadequate planning and processing      .038   .333   .614
70. Inform of plans and contents of the work for the day                 .254   .278   .607
52. Set time-limit for the completion of the work                        .319   .299   .554
53. Indicate new method of solving the problem                           .251   .489   .479
56. Show how to obtain knowledge necessary for the work                  .295   .492   .472
61. Take proper steps for an emergency                                   .360   .451   .305
69. Know anything about the machinery and equipment subordinates are in charge of  .255  .304  .458
Table 2
Summary of the comparison of the effectiveness of the 4 patterns of P-M leadership behavior on various factors of the work group (the figures show the ranking of effectiveness in each factor) (Misumi, J., 1984)
Turnover                                          1 2 3 4
Job satisfaction                                  1 2 3 4
Team work                                         1 2 3 4
Communication                                     1 2 3 4
Mental hygiene (excessive tension and anxiety)    1 2 3 4
Hostility to supervisor                           1 2 3 4
It is noteworthy that this order of effectiveness is not limited to businesses, but is the same for teachers (Misumi, Yoshizaki & Shinohara, 1977), government offices (Misumi, Shinohara & Sugiman, 1977), sports coaches (Misumi, 1985), and religious groups (Kaneko, 1986).
Teikyo University
359 Ohtsuka, Hachiohji
Tokyo 192-03, Japan
Summary: How does the Japanese general public respond to the planned future progress of nuclear power generation? Research and surveys on the Japanese national character, which have been carried out in Japan for over 40 years, confirm that despite considerable change, the "core" of the Japanese national character continues to be firmly preserved. What consideration, therefore, should be given to such national character in connection with nuclear power generation, when nuclear power generation is expected to continue in the future, and even higher-level developments in nuclear power technology are likely to be pursued, both quantitatively and qualitatively?
In the present research project, data gathered via opinion survey are analyzed to elucidate how
Japanese general public attitudes toward nuclear power generation, whether favorable or
unfavorable, and the Japanese national character, which governs various aspects of societal life,
are interrelated within an attitude space, and how subjects can be classified and divided therein.
2. Findings
Replies to a group of closely correlated questions, such as questions concerning attitudes toward nuclear power generation, were analyzed first. When a group of questions was found to form a clear-cut scale, a category defined by the value range of the scale was regarded as representative of the replies to that group of questions, in the same manner as in other question categories. Replies to almost all questions presented in the survey were analyzed, yielding the following findings.
The first axis in the attitude space divides the subjects (respondents) into those who are completely indifferent and those who are not. Subjects with high respondent scores on this axis (on or above the inflection point of the distribution curve) can be distinguished as the indifferent group. This group accounts for 13% of all subjects.
The second axis divides the subjects into those who are strongly positive toward nuclear power generation and those who are not, while the third axis divides the subjects into those who are strongly negative toward nuclear power generation and those who are not. Respondent scores show that the strongly positive group accounts for 11%, and the strongly negative group for 9%.
The straight line formed by the "positive", "moderate", and "negative" categories, which runs diagonally on the plane demarcated by the second and third axes, becomes, when projected onto the parallel line passing through the origin, a scale that divides the three categories. The subjects are classified on this scale into groups that account for 12%, 50%, and 5%, respectively, of the total.
The positions of the item categories that correspond to these groups, the item categories of the Japanese national character, and other item categories indicate that respondents in the indifferent, strongly positive, and strongly negative groups have nothing to do with typical Japanese sentiments (Tab. 1, Tab. 2).
It is believed that for the majority of respondents, which excludes the people in these groups, communication in a somewhat Japanese style can be effective in promoting nuclear power generation.
[Figure: scatter plot of respondents on the plane of the second (II) and third (III) axes, showing the centers of the strongly negative group, the negative group, and the overall center.]
(Continued)
Category                                                 N     Axis I   Axis II  Axis III
(2) Know about positive/negative effects of nuclear power generation?
  1 Yes                                                 700   -2.537    3.061    1.591
  2 No                                                 3052    0.734   -0.700   -0.580
  3 Neither                                             924    0.189    0.669    0.699
(3) Environmental issues
  1 Very interested                                    1001   -1.181   -1.679    2.181
  2 Interested                                         1580   -0.881   -0.305   -0.389
  3 Slightly interested                                1592    0.969    0.540   -1.418
  4 Not very interested                                 503    4.117    3.830    1.345
(4) Fear (Sense of anxiety)
  1 Little                                             2052    0.508    0.802   -0.389
  2 Moderate                                           1849   -0.145   -0.104    0.607
  3 Much                                                775   -0.176   -1.070   -0.433
(5) Interest in accident (Sensitivity to risk)
  1 Little                                             1027    2.438    2.954   -0.636
  2 Moderate                                           2174   -0.537   -0.216    0.500
  3 Much                                               1475   -0.473   -1.316   -0.301
(6) Most dangerous thing in social life
  1 Traffic accident                                   1976   -0.382    0.599   -0.195
  2 Natural disaster                                    831   -0.011    0.256   -1.167
  3 Environmental pollution, destruction, abnormal weather  247  -1.766  -1.337   2.400
  4 Fire                                                244    0.174   -1.588   -1.717
  5 Crime, bullying, interpersonal trouble               46   -0.676    2.340    4.727
  6 War                                                  77    0.444    0.444    2.926
  7 Nuclear power (generation), radioactive contamination  54  -0.436  -1.981    3.282
  8 Disease, drug hazards, malpractice                  101   -0.423   -1.631    3.744
Views of science and civilization
  1 Very negative                                       563   -0.409   -1.675    2.769
  2 Somewhat negative                                   862   -0.479   -0.639    0.187
  3 Slightly negative                                  1288    1.180    0.239    0.166
  4 Moderate                                            738    0.859   -0.489   -0.777
  5 Somewhat positive                                   957   -0.454    1.287   -0.901
  6 Very positive                                       268   -1.635    3.504   -1.902
Social and political attitude
(1) No. of items considered important (importance of aircraft etc.)
  1 0 items                                             604    3.055    1.459    1.862
  2 1 item                                             1118    0.530   -0.809    1.949
  3 2 items                                            1703   -0.410   -0.685    0.078
  4 3 items                                            1251   -0.880    1.448   -2.755
(2) No. of items considered useful (usefulness of aircraft etc.)
  1 0-1 item                                            623    2.740    0.822    3.941
  2 2 items                                            1415   -0.019   -1.114    1.656
  3 3 items                                            2638   -0.395    0.639   -1.823
(3) Interest in political affairs
  1 Very interested                                     877   -1.984    1.707    1.321
  2 Somewhat interested                                2009   -0.410   -0.060   -0.245
  3 Not interested                                     1749    1.511   -0.405   -0.412
(4) Ideology
  1 Democracy and capitalism considered good           1884   -1.256    1.408   -0.349
  2 Depend on time and situation                       1843    1.146   -0.715    0.035
  3 Socialism considered good                           949    0.940   -0.750    0.613
Japanese national characteristics
(1) Tendency toward moderate opinions
  1 0-4 items (Tend to express opinions clearly)        462   -2.689    3.702   -0.818
  2 5-12 items                                         2792   -0.566   -0.132   -0.189
  3 13 or more items                                   1422    2.435   -0.505    0.328
(2) Scale of sense of trust
  1 0 items                                            1316    0.449    0.398    1.099
  2 1 item                                             1544    0.808   -0.577   -0.385
  3 2-3 items (Strong distrust of others)              1816   -0.661    0.546   -0.475
(3) Typical Japanese leadership
  1 0-2 items (Unfavorable to Japanese style of leadership)  412   1.268   0.884   3.837
  2 3-6 items                                          2480    0.617   -0.483    0.018
  3 7 or more items (Favorable)                        1784   -0.793    0.816   -0.917
(4) Scale of interest in supernatural beings
  1 0-3 items (Little interest)                         627    1.566    2.861    2.705
  2 4-9 items                                          2300    0.099   -0.065    0.124
  3 10 or more items (Much interest)                   1749   -0.326   -0.584   -1.139
(5) Superstition believable?
  1 0-2 items (Little influence)                        653   -0.288    2.324    3.615
  2 3-6 items                                          1797   -0.231   -0.184   -0.025
  3 7-8 items (Much influence)                         2226    0.558   -0.253   -1.045
Demographics
(1) Sex
  1 Male                                               2214   -0.784    1.486    0.653
  2 Female                                             2462    0.964   -1.083   -0.592
(2) Age
  1 18-29 years old                                    1121    0.972   -0.092   -0.533
  2 30-39 years old                                     967   -0.044   -0.170   -0.389
  3 40-59 years old                                    1937   -0.323    0.078    0.455
  4 60 years old or above                               651    0.334    1.138    0.126
(3) Education
  1 Elementary/secondary school graduate                664    1.410    0.436    0.490
  2 High school                                        2467    0.467   -0.441   -0.415
  3 University                                         1505   -1.023    0.918    0.410
(4) Residence
  1 Urban                                              4000    0.030    0.046    0.017
  2 Provincial                                          676    0.768    0.649   -0.115
R (correlation)                                               0.335    0.282    0.265
References:
Hayashi, C. and Morikawa. S. (1995) National Character and Communication. INSS Report May, 1995
Research Concerning the Consciousness of
Women's Attitude toward Independence
Setsuko Takakura
Tokyo International University
2509 Matoba, Kawagoe-shi
Saitama-ken 350-11, Japan
1. Outline of Survey
What do Japanese women nowadays understand by the expression "women's independence", and how does it relate to other items: consciousness of liberty, equality between the sexes, happiness, autonomy, identity, etc.? What are the obstructions to their independence? The purpose of this research is to clarify these subjects (we conducted a mail survey).
We have chosen 7 universities and 5 junior colleges situated within Tokyo and the surrounding areas. As the population we have taken the female graduates of these universities and junior colleges in the years 1958, '67, '75, '81, '86, and '91. The numbers of samples and of effective responses are shown in Table 1.
Table 1
Number of samples and number of effective responses

graduation year     '58     '67     '75     '81     '86     '91    TOTAL
7 universities      188     188     ...     ...     ...     315     1523
  responses         132      98     137     ...     ...     132      738
                   (74%)   (58%)   (52%)    ...     ...    (37%)    (56%)
5 colleges          160     160     220     ...     ...     367     1500
  responses          98      89     153     ...     ...     132      669
                   (64%)   (59%)   (45%)    ...     ...    (37%)    (48%)
TOTAL               348     348     ...     ...     ...     682     3023
  responses         230     ...     290     ...     ...     264     1407
                   (69%)    ...    (48%)    ...     ...    (41%)    (52%)
2. Results
2.1 Outline
We will show the results of the responses to the two main questions. (The numbers are the percentages of the responses.)
Q.11 Do you think that you are independent now? (no dependency upon anyone)
1. sufficiently independent (12%)   2. just independent (46%)
3. not sufficiently independent (24%)   4. hardly independent (13%)
5. D.K. (5%)
Regarding the degree of satisfaction with the level of one's own independence:
[Fig. 1-1 and Fig. 2-1: bar charts of the responses by type of school (university vs. college) and by marital status (married vs. single).]
[Fig. 1-2 and Fig. 2-2: bar charts of "sufficiently independent" and "just independent" responses by year of graduation ('58 to '91) and by occupation (full-time, part-time, self-employed, side job, without occupation).]
From these figures we understand that the highest degree of consciousness of independence was as follows: by age, among the oldest samples; by profession, among full-time workers and the self-employed; by marital status, among divorcees and widows.
Q.2 Some examples of women's life-styles are listed below. Please mark the types that attract you. (no more than 2)
1. fashionable single career woman able to make full use of her abilities in her profession (6%)
2. woman with a full-time profession living with husband (without children) (9%)
3. married woman with a child/children who devotes her energies to her profession, entrusting housework and child care to somebody else as necessary (25%)
4. woman with a full-time profession who shares partial charge of housework with husband and child/children (37%)
5. woman with a full-time profession who does housework and child care without help from anybody else (5%)
6. woman whose priority is taking care of husband and child/children and who works part-time locally during the daytime as long as she has time to spare (11%)
7. woman who takes care of husband and child/children and participates actively in some activity which contributes to society (30%)
8. woman who depends on the income of husband, and who efficiently deals with housework and child care and then participates actively in various free-time activities of her own choice (21%)
9. woman who, while preserving good judgement and making efforts to extend her knowledge, and although depending on the income of her husband, successfully takes care of the family and domestic management and acts as the mainstay of the family (25%)
Q.29 When you hear the expression "women's independence", how do you feel about it? Please mark any of the following categories which correspond to your feelings. (as many as you like)
Q.15 How important do you think each of the following items is, from the point of view of women's independence and men's independence?
F: women's independence   M: men's independence
(Response scale for each item, asked separately for F and M: very important / a little important / not very important / not at all important.)
[Figure: configurations of the Q.15 items (A to Q) on the first (1st.) and second (2nd.) axes of Quantification Method III, shown separately for women's independence (F) and men's independence (M).]
This classification reveals that the samples (female graduates) consider women's independence from four different aspects: financial power, psychological strength, human relations, and role in the family; concerning men's independence, by contrast, they consider almost only two aspects: financial and psychological strength, and human relations. (Items O and P are situated very far from the other items; this means that the samples who marked "very important" for items O and P are few and heterogeneous.)
We also applied Quantification Method III to all the responses to Q.2, Q.29, and Q.15-F (Fig. 5). We understand that the samples who regard financial power as very important for women's independence chose categories 1 or 2 in Q.2, in opposition to the samples who chose categories 6, 7, 8, or 9 in Q.2; the latter, who also chose "W" in Q.29, seem conservative; it seems that they do not consider "independence" positively. The samples who chose O, P, G, K, L, F, E, or N in Q.15-F respect the family role, and they chose type 5 in Q.2; it seems that they consider this kind of life-style an ideal.
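For readers who wish to reproduce this kind of analysis, the following is a minimal Python sketch of Quantification Method III, implemented here as correspondence analysis of a 0/1 response-pattern matrix via the SVD; the data are illustrative, not the survey's.

```python
import numpy as np

def quantification_iii(X):
    """Quantification Method III as correspondence analysis of a 0/1
    response-pattern matrix X (respondents x categories): returns respondent
    and category scores on the leading axes plus the singular values."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                                     # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_scores = U / np.sqrt(r[:, None]) * sv           # principal coordinates
    col_scores = Vt.T / np.sqrt(c[:, None]) * sv
    return row_scores, col_scores, sv

# Illustrative usage: 6 respondents x 5 response categories
X = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 1],
              [0, 1, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 0, 1, 1, 1]])
rows, cols, sv = quantification_iii(X)
print(cols[:, :2])       # category configuration on the first two axes
```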
[Fig. 5: joint configuration of the responses to Q.2 (life-style types 1 to 9), Q.29, and Q.15-F on the first and second axes of Quantification Method III.]
People say that financial power and psychological strength are very important for independence. We found here that women's (female graduates') opinions of independence differ according to whether they are considering women's independence or men's independence. That is, concerning men's independence they consider that both strengths, financial and psychological, are important, while concerning women's independence some samples attach importance to financial power and others to psychological strength.
Concerning women's independence we have taken the items A, H, and Q (which represent financial power); dividing the responses into two categories, "very important" and the others, we applied Quantification Method III again. Then, taking the first 'eigen-vector' as the outside criterion, we applied multiple regression analysis for the purpose of finding the effective factors; as predictor factors we took: kind of school, year of graduation, marital status, occupation, response to Q.11, and response to Q.13. We found from this result that occupation (having one or not) is the most remarkably effective factor, and that the kind of school (university or junior college) and the response to Q.11 are quite effective factors. We tried the same method on the items B, C, and D (which represent psychological strength). We found as effective factors (discriminating the samples who attach importance to these items for women's independence from those who do not) the kind of school (university or junior college), the response to Q.11, and the response to Q.13.
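A hedged sketch of this two-step procedure follows; the leading axis is approximated here by the first singular vector of the centered 0/1 response matrix, and all variable names are illustrative.

```python
import numpy as np

def regress_on_first_axis(Z, predictors):
    """Take the first 'eigen-vector' (leading-axis respondent scores) of the
    binary response matrix Z as the outside criterion, then fit an ordinary
    multiple regression on dummy-coded background factors."""
    U, sv, _ = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    criterion = U[:, 0] * sv[0]                  # respondent scores on axis 1
    X = np.column_stack([np.ones(len(Z)), predictors])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    return beta                                  # intercept + effect estimates
```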
3. Conclusion
After these analyses we can say that the consciousness of female graduates toward independence is not very high; in particular, the consciousness of young graduates is lower than that of older graduates. The consciousness of the samples who are widows or divorced is strong; most of them have an occupation and retain financial independence. Most of the samples who have husbands do not consider financial power a very important factor for women's independence; they attach importance to the psychological aspect rather than the financial aspect. A large part of the young women graduates who are not yet married do not have a strong consciousness of independence and are not satisfied with the level of their independence. It is particularly interesting that we found, by the method of Quantification Method III, a difference in women's (graduates') way of thinking between women's independence and men's independence. There have been many discussions of the important factors, financial power and psychological strength, regarding independence. We found that the samples who support the former aspect and the samples who support the latter aspect are not necessarily the same. The former are numerous among the samples who have an occupation; the latter are numerous among the samples who graduated from junior college and are confident of their own independence. We obtained in this way some new views concerning the consciousness of women's attitude toward independence.
Acknowledgment:
We conducted this survey with the help of a grant-in-aid from the Tokyo Women's Foundation, and this study was carried out under the ISM Cooperative Research Program (95-ISM-CRP A-34).
A Cross-National Analysis of the Relationship between
Genderedness in the Legal Naming of Same-Sex Sexual/Intimate
Relationships and the Gender System
Saori Kamano
Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku
Tokyo 106 Japan
1. Introduction
Feminist scholarship has noted that it is ultimately a gender issue whether and how same-sex sexual/intimate relationships are named or seen as a category apart from other types of relationships (see, e.g., Connell 1987). The very conceptual possibility of "same-sex sexual/intimate relationships" (abbreviated as SSSIR) as an identifiable type of relationship hinges on the viability of gender as a social category. In other words, it is only when "gender", or the social categories of "men" and "women", operates as a social divider and when heterosexuality is assumed that sexual/intimate relationships involving people of the same sex can be constructed as an identifiable and distinguishable category of relationships. Extant historical and anthropological studies have documented that such "naming" of same-sex sexual/intimate relationships varies over time and across societies (see, e.g., Duberman, et al. 1980; D'Emilio and Freedman 1988). However, such documentation of the variation awaits more rigorous theorizing as well as systematic analysis, which I will attempt in this study by focusing on one aspect of naming, the "genderedness" of the naming of SSSIR, and taking a first step toward a more theoretical and rigorous exploration of its cross-national variation.
I will first lay out my argument regarding the genderedness of the naming of SSSIR and the gender system before discussing how genderedness is operationalized and how countries are differentiated in this respect by socio-politico-economic factors. I will next consider whether and how genderedness in naming is related to the rigidity of gender categories and the level of gender inequality.
Since the naming of SSSIR is basically a gender issue, one can expect it to be related to the gender system of a society. Specifically, I argue that the rigidity of gender categories, which partially constitute the gender system, produces differences in how men and women are generally treated in a society, leading to gender differences in the naming of SSSIR. Highly rigid gender categories mean that men and women are treated and conceptualized separately and differently in a society. Similarly, the level of gender inequality is expected to contribute to whether or not men's and women's SSSIR are named in the same way. Gender inequality means that men and women are evaluated differently and unequally for who they are and what they do. In virtually all modern societies, men are granted more importance in the system; whatever men do receives more attention, whereas what women do tends to be disregarded. I argue that such a tendency is expected to be more prominent in societies with a higher level of gender inequality. Rigidity and inequality in the gender system are analytically distinct but empirically inseparable. I therefore expect the cross-national pattern of the naming of SSSIR to be as follows: the more rigid and/or unequal the gender system, the more likely the naming of SSSIR is gendered, i.e. only men's SSSIR (and not women's) are named.
[Table (fragment): countries whose legal statements reference men's SSSIR exclusively, by region]
  count:    5       0       1       7       5       3    |  21
  (%):    (17.9)   (0)    (4.2)  (36.8)  (17.9)  (37.5)  | (24)
  N:       19       2      12      19      28       8    |  88
  (%):    (100)   (100)   (100)   (100)   (100)   (100)  | (100)
The table indicates that 21 out of 88 countries, or 24% of the countries considered here,
name only men's SSSIR while 67 countries, or 76%, do so for SSSIR of both genders
and/or persons without gender specification. Examination by each geographic region
indicates that the tendency to name only men's relationships is stronger in Asian countries
and Oceanic countries (about 40%). The proportion of countries naming only men's
SSSIR is particularly low in Central and South American countries--only 1 out of 12
countries references only men's SSSIR.
The correlation analyses show that economically more developed countries and former French colonies tend to belong to the group of countries referencing both men and women or "persons" in the naming of SSSIR, while former British colonies are more likely to be grouped among countries exclusively referencing men in naming such relationships, as indicated by the statistically significant coefficients (at the .05 level) of -.322, -.177, and .401, respectively. Using a more generous criterion of the .10 level, one can also conclude that countries with a higher percentage of Protestant population and countries with a higher percentage of married women using contraceptives are countries which name SSSIR of both men and women or without gender specification.
Embedded in these political, social, and economic structures might also be differences in the gender system, which might underlie the observed differences in the particular gender group(s) referenced. For example, it might be the level of rigidity and inequality in gender categories that divides countries varying in economic development and differing in the naming of SSSIR. In the next section, I directly examine how the pattern of the naming of SSSIR is related to the gender system.
Apart from the link of the gender system to the cross-national patterns of the genderedness of the naming of SSSIR, statistically significant coefficients of some socio-politico-economic factors show the following patterns: (a) democracy increases the likelihood of naming SSSIR for both or unspecified genders; (b) countries with a history of British colonization tend to name both men's and women's SSSIR, while countries not colonized by Britain tend to name only men's relationships; (c) the higher the total fertility rate, the less likely SSSIR are named exclusively for men; and (d) the higher the proportion of Catholic and Protestant population, the more likely men are exclusively referenced.
The observed effect of democracy on gender reference in the naming of same-sex sexual/intimate relationships is consistent with conventional wisdom. If one grants that democratic countries are more "inclusive," then it follows that democratic countries either reference both men and women or do not make any distinctions by gender in the naming of SSSIR. If Britain has imposed its law, which exclusively references men, onto its colonies,
and these colonies kept the law in its original form into the 1980s, then one would expect a history of British colonization (COLBRI) to have a positive effect on the genderedness in naming (Dynes 1990). However, the effect observed here is negative. The key to understanding the unexpected effect might lie in the post-colonial developments in former British colonies. One can conjecture that Britain imposed its legal and cultural paradigm on its colonies which, upon independence, also showed a greater tendency to reverse these imposed legal and cultural norms. Testing these ideas requires a separate analysis. For the present purpose, it is sufficient to note that the presence or absence of British colonization divides countries in how they name SSSIR.
The positive effect of the total fertility rate means that countries with a higher fertility rate name only men's SSSIR, while countries with a lower fertility rate name SSSIR of both men and women or of gender-unspecified persons. The observed effect here might support my argument linking a higher level of gender inequality and a rigidity of gender categories to a stronger tendency to name only men's SSSIR. It is generally the case that the total fertility rate is higher in societies with more rigid and unequal gender categories, which in turn increases the likelihood of the exclusive reference to men in the naming of SSSIR.
The negative effect of the percentage of population affiliated with Catholicism and Protestantism indicates that the higher the proportion of Catholics and Protestants in the population, the more likely it is that same-sex sexual/intimate relationships are named exclusively for men. Given that the formal teachings of Judeo-Christianity tend to address men exclusively, this finding is perhaps readily comprehensible.
6. Conclusion
In this paper, I focused on the genderedness in the naming of same-sex sexual/intimate relationships, differentiating between countries that exclusively reference men in naming and those which reference both genders or gender-unspecified "persons". I argued that the gender system is one of the important factors differentiating the two types of naming of SSSIR: men are referenced exclusively in the naming of same-sex sexual/intimate relationships in societies with rigid and unequal gender categories. I reasoned that societies with rigid gender categories conceptualize and treat men and women as two distinctive groups, resulting in differences in the naming of men's and women's same-sex sexual/intimate relationships. Similarly, societies with a higher level of gender inequality by definition privilege men over women and grant the former more visibility and prominence. It is therefore expected that men tend to be referenced exclusively in the naming of same-sex sexual/intimate relationships in these types of societies.
After presenting a coding scheme of the legal naming of same-sex sexual/intimate
relationships, I discussed the results of correlation analysis of the genderedness of naming
and various socio-politico-economic factors, which showed that countries which are
former French colonies, Protestant, and economically more developed belong mostly to
the group that references both men and women or persons in the naming of SSSIR. In
contrast, countries which are former British colonies belong largely to the group
referencing only men in the naming of SSSIR. Furthermore, I undertook a logistic
regression analysis to examine the linkage between gender references in naming and the
rigidity of gender categories and the level of gender inequality, controlling for the effects
of socio-politico-economic factors. The anticipated pattern of the genderedness in the
naming of SSSIR in relation to the rigidity and inequality of gender categories was borne
out by the finding that the more rigid the gender categories are, the stronger the tendency to
exclusively reference men, rather than both genders or gender-unspecified "persons," in
naming SSSIR. More importantly, the analyses presented here affirm that aspects of the
gender system are important in differentiating among societies which name SSSIR
differently.
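A hedged sketch of the kind of logistic regression described above follows; all variable names and the data are illustrative placeholders, not the study's actual coding or measures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: y = 1 if a country names only men's SSSIR, 0 if both
# genders / "persons"; predictors are gender-system measures plus controls.
rng = np.random.default_rng(0)
n = 88                                    # countries, as in the table above
X = np.column_stack([
    rng.normal(size=n),                   # rigidity of gender categories
    rng.normal(size=n),                   # level of gender inequality
    rng.integers(0, 2, size=n),           # control: former British colony
    rng.normal(size=n),                   # control: total fertility rate
])
y = rng.integers(0, 2, size=n)            # gendered-naming indicator
coefs = LogisticRegression().fit(X, y).coef_
print(coefs)                              # coefficient signs give effect directions
```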
References:
Connell, R. W. (1987): Gender and Power. Stanford University Press, Stanford.
D'Emilio, J. and Freedman, E. B. (1988): Intimate Matters: A History of Sexuality in America. Harper and Row, New York.
Duberman, M. B. et al. (eds.) (1980): Hidden from History: Reclaiming the Gay and Lesbian Past. New American Library, New York.
Dynes, W. R. (ed.) (1990): Encyclopedia of Homosexuality. Garland, New York and London.
Kamano, S. (1995): Same-Sex Sexual/Intimate Relationships: A Cross-National Analysis of the Interlinkages among Naming, the Gender System, and Gay and Lesbian Resistance Activities. Ph.D. Dissertation, Stanford University.
Tielman, R. and de Jonge, T. (1988): A worldwide inventory of the legal and social situation of lesbians and gay men. In ILGA Pink Book: A Global View of Lesbian and Gay Liberation and Oppression, Tielman, R. and van der Veen, E. (eds.), 183-242, Utrecht University, Utrecht.
A Constrained Clusterwise Regression
Procedure for Benefit Segmentation
Daniel Baier
Institute of Decision Theory and Operations Research
University of Karlsruhe; Post Box 6980;
76128 Karlsruhe; Germany
Summary: A new procedure for benefit segmentation using clusterwise regression is pre-
sented. Constraints on the model parameters ensure that the derived benefit segments can
be easily attached to single competing products under consideration. The new procedure is
compared to other one-stage and two-stage procedures for benefit segmentation using data
from the European air freight market.
1. Introduction
Conjoint analysis is the label attached to a popular research tool for measuring buyers' tradeoffs among competing multiattributed products (see, e.g., Green, Srinivasan (1990) for a review): First, respondents are asked for preferential judgments w.r.t. a set of attribute-level-combinations (stimuli) which serve as (hypothetical) product descriptions. Then, the observed response data are analyzed using regression-like estimation procedures at the individual level. The resulting so-called part-worths (estimated preferences for attribute-levels) are later used to predict responses w.r.t. a set of competing products and, assuming, e.g., that each individual chooses the product with the highest predicted preference, shares of choices or market shares.
A research purpose served by many commercial applications of conjoint analysis is the identification and understanding of so-called benefit segments (see, e.g., Green, Krieger (1991), Wittink, Vriens, Burhenne (1994)): The observed response data are used in order to identify groups of buyers having similar preferences. For this purpose, various procedures have been proposed and successfully applied during the last years. Some of these procedures are so-called two-stage procedures where the individual part-worth estimates are used in an unrelated secondary stage as an input for clustering techniques. Others are so-called one-stage procedures (see, e.g., Kamakura (1988), DeSarbo, Oliver, Rangaswamy (1989), Wedel, Kistemaker (1989), Wedel, Steenkamp (1989), (1991), DeSarbo, Wedel, Vriens, Ramaswamy (1992), and Baier, Gaul (1995) for sample procedures or Wedel, DeSarbo (1994) for a review) where segment-specific part-worth functions and segment-membership indicators are simultaneously estimated using generalizations of well-known clusterwise regression procedures (see, e.g., Bock (1969), Späth (1983), DeSarbo, Cron (1988)). In both cases, the resulting parameter estimates are then used to predict preferences and choices at the segment level w.r.t. a set of competing products.
However, a major shortcoming of these procedures is the missing link between the computational derivation of segment-specific model parameters and the prediction of choices for the competing products: No guarantee is provided that each competing product is chosen by at least one benefit segment, a fact that, when analyzing real markets, could lead to the (surprising) situation that established products in these markets are predicted to have no buyers. For this reason, a new one-stage procedure for benefit segmentation is proposed where constraints are implemented which ensure that each competing product is selected by at least one segment.
with

    Σ_{t=1}^{T} h_{ti} = 1  ∀ i,    h_{ti} ∈ {0, 1}  ∀ t, i                    (3)
is minimized. The constraints in formula (3) ensure that segment j chooses competing product j, resp. that each competing product is selected by at least one segment (note that, for obvious reasons, the index j is also used for segments), and that the segmentation scheme is non-overlapping.
It should be mentioned that with n = 0 the standard versions of Wedel, Kistemaker's (1989) as well as Baier, Gaul's (1995) (non-overlapping) clusterwise regression procedures are contained in the model formulation. (In this case the inequality constraints in formula (3) are omitted.)
2.2 Algorithm
For the estimation of the model parameters, an exchange algorithm is proposed, as given in Tab. 1: In the initialization phase we start with a segmentation matrix H, segment-specific response estimates for the stimuli U, and segment-specific response estimates for the competing products Ũ, so that the constraints in formula (3) are fulfilled. Additionally, the initial loss function value is computed. In the iteration phase we repeatedly test whether an exchange of a respondent from one segment to another improves the loss function without violating the constraints in formula (3). If so, a new loss function value is calculated. If not, the exchange is cancelled (using the variables h_1, ..., h_T for restoration). In the final phase, segment-specific part-worth estimates are computed.
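The following is a minimal Python sketch of such an exchange algorithm under stated assumptions: Y holds the stimulus responses, B and B̃ are the dummy-coded design matrices; the paper's computational simplifications (5)-(6) and restoration variables are replaced here by a naive refit, so this is one reading of Tab. 1, not the authors' code.

```python
import numpy as np

def ccr_exchange(Y, B, Bt, T, n_sweeps=20, seed=0):
    """Naive exchange heuristic: Y is (stimuli x respondents); B and Bt are
    the design matrices for the stimuli and the competing products."""
    rng = np.random.default_rng(seed)
    N = Y.shape[1]
    seg = rng.integers(0, T, size=N)           # initial segmentation

    def evaluate(seg):
        """Squared loss of segmentwise OLS fits, plus the feasibility check
        that segment t most prefers competing product t (formula (3))."""
        total, feasible = 0.0, True
        for t in range(T):
            idx = np.flatnonzero(seg == t)
            if idx.size == 0:
                return np.inf, False           # empty segment: infeasible
            c, *_ = np.linalg.lstsq(B, Y[:, idx].mean(axis=1), rcond=None)
            total += ((Y[:, idx] - (B @ c)[:, None]) ** 2).sum()
            feasible &= (int(np.argmax(Bt @ c)) == t)
        return total, feasible

    best, _ = evaluate(seg)
    for _ in range(n_sweeps):
        for i in range(N):
            for t in range(T):
                if t == seg[i]:
                    continue
                old, seg[i] = seg[i], t        # tentative exchange
                val, ok = evaluate(seg)
                if ok and val < best:
                    best = val                 # keep improving feasible move
                else:
                    seg[i] = old               # cancel the exchange
    return seg, best
```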
Note that in the algorithm for constrained clusterwise regression, B is the dummy-coded design matrix for the stimuli, with elements B_{jl} ∈ {0, 1} indicating whether stimulus j contains the attribute-level coded by column l (formula (4)),
and B̃ is defined in the same way as the dummy-coded design matrix for the competing products. The algorithm uses some computational simplifications concerning parameter estimation when the segmentation matrix H with elements h_{ti} is additionally known: In this case, we get segment-specific response estimates for the stimuli U = BC with elements u_{jt} and for the competing products Ũ = B̃C with elements ũ_{jt} by using

    C = (B'B)^{-1} B' Y H'(HH')^{-1} = (B'B)^{-1} B' Y G                        (5)

where G := H'(HH')^{-1} has elements

    g_{it} = 1/N_t  if h_{ti} = 1,   g_{it} = 0  else,
    with N_t = Σ_{i=1}^{N} h_{ti}   ∀ t, i.                                     (6)
3. Comparisons
The new constrained clusterwise regression procedure (in the following referred to as CCR) was empirically compared to other one-stage and two-stage procedures for benefit segmentation. For the comparisons, data from the European air freight market were used (see Baier, Gaul (1995) for a more detailed discussion of the data) which describe the preferences of 150 respondents w.r.t. 18 hypothetical product descriptions for an over-night parcel service with house-to-airport delivery and European destination, collected from those responsible for parcel delivery at German companies with more than 25 air freight parcels per month within Europe. The reduced design of the 18 stimuli w.r.t. the attributes 'collection time', 'delivery time', 'transport control', 'agency type', and 'price' (for a 10 kg parcel) and the descriptions of six competing products selected according to Baier, Gaul (1995) are given in Tab. 2 and 3.
Tab. 2: 18 stimuli for data collection in the European air freight market
Tab. 3: Six products for simulations in the European air freight market
    VAF = 1 − [ Σ_{i=1}^{N} Σ_{j=1}^{n} ( y_{ji} − Σ_{t=1}^{T} h_{ti} u_{jt} )² ] / [ Σ_{i=1}^{N} Σ_{j=1}^{n} ( y_{ji} − ȳ )² ],
    with  ȳ = Σ_{i=1}^{N} Σ_{j=1}^{n} y_{ji} / (N n)                            (7)
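A direct transcription of formula (7) in Python; the array shapes are my assumption (Y is stimuli × respondents, H is the T × N segmentation matrix, and U = BC holds the segment-specific response estimates).

```python
import numpy as np

def vaf(Y, H, U):
    """Variance accounted for, formula (7)."""
    fitted = U @ H                          # sum_t h_ti * u_jt for each (j, i)
    ss_res = ((Y - fitted) ** 2).sum()
    ss_tot = ((Y - Y.mean()) ** 2).sum()    # y-bar is the grand mean of all y_ji
    return 1.0 - ss_res / ss_tot
```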
Even a first glance at Tab. 4 and 5 reveals that the best one-stage procedure (CR) performs better (with the exception of T=1 or T=2) than the other one-stage and two-stage procedures. CCR, the new procedure, competes well in this context, with only minor deteriorations w.r.t. various fit measures. However, to be honest, one should mention that this behavior heavily depends on the selected competing products in the market. (If, e.g., some benefit segments are not satisfied by the available products in the market, an application of CCR should lead to inferior classification results compared to an application of CR.)
[Table fragment (price part-worths and most preferred products of the six segments):]
    200DM   0.313   0.059   0.021   0.047   0.085   0.115
    240DM   0.000   0.000   0.000   0.000   0.000   0.000
    most preferred product:  prod. E   prod. E   prod. F   prod. D   prod. F   prod. D
In order to demonstrate the advantages of the new procedure, Tab. 6 and 7 show the six-segment solutions with part-worth functions and most preferred competing products from CR and CCR. (The attribute-levels of the most preferred products are underlined.) The part-worth functions are standardized in the usual way, so that the segment-specific part-worth estimates for the least preferred attribute-levels are fixed to 0 and the response estimates for the combinations of the most preferred attribute-levels (not necessarily the most preferred product) sum up to 1.
Using this standardization, the (relative) importances of the single attributes are given by the part-worth estimates for the most preferred attribute-levels: So, e.g., from Tab. 6 we can see that segment 1 is highly price-sensitive (the attribute 'price' accounts for 66.3% of the overall preference), that ('16:30', '10:30', 'active', 'airline company', '160DM') is its most preferred attribute-level-combination (with a segment-specific response estimate of 1), that 'product E' with attribute-level-combination ('16:30', '12:00', 'passive', 'integrator', '160DM') is its most preferred competing product (with a response estimate of 0.860), and that 'product E' is also the most preferred product of segment 2, where early collection times are most important. However, only the competing products 'product D' to 'product F' can be attached to benefit segments. The utilities for 'product A' to 'product C' are for all segments inferior to the utilities for 'product D', 'product E', or 'product F', so that under the usual assumptions these products are predicted to have no buyers.
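The importance computation described above can be written compactly; here is a small hedged helper in Python (the slicing of the part-worth vector into attributes is hypothetical, chosen only for illustration).

```python
import numpy as np

def attribute_importances(part_worths, attr_slices):
    """Relative attribute importances under the standardization above: the
    least preferred level of each attribute is shifted to 0, and importances
    are the part-worths of the most preferred levels, normalized to sum to 1."""
    pw = np.asarray(part_worths, dtype=float)
    shifted = np.concatenate([pw[s] - pw[s].min() for s in attr_slices])
    best = np.array([shifted[s].max() for s in attr_slices])
    return best / best.sum()

# e.g. three attributes with 2, 3, and 2 levels
slices = [slice(0, 2), slice(2, 5), slice(5, 7)]
print(attribute_importances([0.1, 0.3, 0.0, 0.2, 0.5, 0.4, 0.1], slices))
```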
This shortcoming is overcome in the six-segment solution from CCR as given in Tab. 7: Here, segments with interpretations similar to those in the CR solution can be found, e.g., a highly price-sensitive cluster (segment 1) and two clusters focusing on an early collection time (segments 3 and 5). Nevertheless, each product can be attached to a segment. So, e.g., segment 1 now prefers 'product A' whereas 'product E' is preferred by segment 5. (Note that segment 1 in the CR solution also had a fairly high response estimate of 0.769 for 'product A'.)
The cross-tabulation in Tab. 8 shows that only a few reallocations were necessary for achieving this easier interpretation: Segment 1 of both solutions consists mainly of the same respondents, whereas segment 2 of the CR solution was divided up into the two CCR segments 3 and 5.
References:
Baier, D., Gaul, W. (1995): Classification and Representation Using Conjoint Data, In: From Data to Knowledge, Gaul, W., Pfeifer, D. (eds.), Berlin, Springer, 298-307.
Baier, D., Gaul, W., Schader, M. (1996): Two-Mode Overlapping Clustering With Applications to Simultaneous Benefit Segmentation and Market Structuring, To appear in: Classification, Data Analysis and Knowledge Organization, Klar, R., Opitz, O. (eds.), Berlin, Springer.
Bock, H. H. (1969): The Equivalence of Two Extremal Problems and its Application to the Iterative Classification of Multivariate Data, In: Report on the Conference Medizinische Statistik, Forschungsinstitut Oberwolfach.
DeSarbo, W. S., Cron, W. L. (1988): A Maximum Likelihood Methodology for Clusterwise Regression, Journal of Classification, 5, 249-282.
DeSarbo, W. S., Oliver, R., Rangaswamy, A. (1989): A Simulated Annealing Methodology for Clusterwise Linear Regression, Psychometrika, 54, 707-736.
DeSarbo, W. S., Wedel, M., Vriens, M., Ramaswamy, V. (1992): Latent Class Metric Conjoint Analysis, Marketing Letters, 3, 273-288.
Green, P. E., Helson, K. (1989): Cross-Validation Assessment of Alternatives to Individual-Level Conjoint Analysis: A Case Study, Journal of Marketing Research, 26, 346-350.
Green, P. E., Krieger, A. M. (1991): Segmenting Markets with Conjoint Analysis, Journal of Marketing, 55, 20-31.
Green, P. E., Srinivasan, V. (1990): Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice, Journal of Marketing, 54, 3-15.
Kamakura, W. A. (1988): A Least Squares Procedure for Benefit Segmentation with Conjoint Experiments, Journal of Marketing Research, 25, May, 157-167.
Späth, H. (1983): Cluster-Formation und -Analyse, Oldenbourg, München.
Wedel, M., DeSarbo, W. S. (1994): A Review of Recent Developments in Latent Class Regression Models, In: Advanced Methods of Marketing Research, Bagozzi, R. P. (ed.), Basil Blackwell, Cambridge, MA, 352-388.
Wedel, M., Kistemaker, C. (1989): Consumer Benefit Segmentation Using Clusterwise Linear Regression, International Journal of Research in Marketing, 6, 45-49.
Wedel, M., Steenkamp, J.-B. E. M. (1989): A Fuzzy Clusterwise Regression Approach to Benefit Segmentation, International Journal of Research in Marketing, 6, 241-258.
Wedel, M., Steenkamp, J.-B. E. M. (1991): A Clusterwise Regression Method for Simultaneous Fuzzy Market Structuring and Benefit Segmentation, Journal of Marketing Research, 28, November, 385-396.
Wittink, D. R., Vriens, M., Burhenne, W. (1994): Commercial Use of Conjoint Analysis in Europe: Results and Critical Reflections, International Journal of Research in Marketing, 11, 41-52.
Application of Classification and Related Methods
to SQC Renaissance in Toyota Motor
Kakuro Amasaka
Summary: To capture the true nature of making products, and in the belief that the best personnel development is practical research that raises the technological level, we have been engaged in SQC promotion activities under the banner "SQC Renaissance". The aim of SQC as promoted by Toyota is to take up the challenge of solving vital technological assignments, and to conduct superior QCDS research by employing SQC in a scientific, recursive manner. To this end, we must build Toyota's technical methods for conducting scientific SQC as a key technology in all stages of the management process, from product planning and development through to manufacturing and sales. In particular, multivariate analysis resolves complex entanglements of cause and effect relationships for both quantitative and qualitative data. Some application examples are reported below, focusing on cluster analysis as a representative example.
[Figure: practical effort, development, and education for growing human resources.]
If we are to make products that can satisfy our customers, we must work under the most appropriate conditions for raising work quality and minimizing problems. If staff keep a careful watch over their work and apply SQC properly, SQC can assist them in remedying work processes and effectively raising the quality of work (Amasaka and Yamada 1991).
[Fig. 2: New schematic drawing of Scientific SQC for conducting superior QCDS research. Panel 2-1: business process for Customer Science (business, marketing for the customer, process as behavioral science, designing). Panel 2-2: Scientific SQC for improving technology (general problem solution by Scientific SQC, from problem to quality).]
[Fig. 3: MA (multivariate analysis) and DE (design of experiment) applied to structuring the problem and selecting the topic, problem-solving (level 1), and problem-solving (level 2).]
We started out with cluster analysis to unravel the complex technical assignments, and show that the application of a combination of different multivariate analysis methods has brought about the expected results.
The summaries of these two analytical results allow us to interpret, as shown in the lower section of Fig. 4.1, that group (1) is A: the realist group (which places importance on practical invention based on Toyota's engineering capability and on rights with practical effect); group (2) is B: the advance group (which places importance on advanced invention superior to competitors, or on specific rights competitors are eager to get); and group (3) is C: the futuristic group (which places importance on prospective invention at the construction stage and on rights that lead competitors internationally). For example, research section "b" mostly consists of the futuristic group, while Technical Administration "a" is largely composed of the realist group, and so on for the rest of the sections, reflecting the logical result of the analysis. Regarding these results, it is found that the respective departments are conscious of and in need of well-balanced activities for "good patents", as they understand what to expect from and what role to play in executing their jobs. We have thus obtained valuable results (of the analysis) which constitute the preparation for our future patent strategy.
[Fig. 5: The relation of inventive technique and patent right (canonical correlation analysis), showing group A: Realist, group B: Advance, and group C: Future, with items such as (1) practicality, (13) large system, (21) international, and (8) forward-looking.]
This case (1) shows a scientific application of SQC exactly as seen in Fig. 2, upgrading the quality of the job at the business-process stage in a proactive engineering area. It is thus judged that the application effect of multivariate analysis as the core of the Toyota Technical Method has been verified.
4.2 "Analysis of sources of variation for preventing vehicles' rusting"
One of the subjects for technological development of higher quality, longer life vehicles
is the quality assurance of anti-rusting of the body. From the engineering viewpoint of
anti-rusting of vehicle body, it involves consideration for structural design, adoption of
anti-rusting steel, adoption of local anti-rusting processing (such as the application
of wax, sealer, etc.), and paint design (conversion treatment and improvement of
electrodeposition coating, etc.). On the implementation stage of the research, these
anti-rusting measures are adopted singularly or in combination of multiple measures
as may be required by the construction of subject section or corrosion factors. In order
691
for us to proceed with advanced and timely QCDS study activities, it is necessary
to conduct variable factor analysis of vehicle's anti-rusting and the optimization. In
this sense, SQC centering around the multivariate analysis has much to contribute
as a scientific approach.
In this connection, this paper describes a characteristic case for study (Amasaka et
al. 1995c).
4.2.1 Anti-rusting Methods and How to Outline their Characteristics
To establish a superior quality assurance system, it is important to set up the network of quality assurance at the job processing stage so as to raise the reliability technology of all the sectors, including product planning, design, review, production engineering, process design, administration, inspection, and so on. Such a quality assurance activity under the cooperation of all the sectors has been established as Toyota's QA network, where SQC plays an important role as the behavioral science for enhancing quality performance.
Toyota has been incorporating various anti-rusting methods into various sections of the vehicle. To outline the deployment of quality performance, the application of the matrix diagram method is effective. For example, Table 2 outlines, in an anti-rusting QA network table, the complex correlations between the corrosive environmental factors of 72 sections of a vehicle (divided for the ease of arranging anti-rusting measures), the manufacturing process factors, and the respective anti-rusting methods adopted. This table is a summary of objective facts inductively gathered by engineers from multiple engineering sectors, which are then arranged deductively from a subjective point of view.
To make this table more effective, it is necessary to provide it with ease of visual recognition so that it shows at a glance where Toyota stands in its present activities for rust prevention quality assurance. Table 2 enables staff and engineers to make engineering judgments more accurately, subsequently contributing much to strategic decision making on the part of management.
[Table 2 (fragment): anti-rusting QA network, listing vehicle sections (hemming parts such as the lower parts of doors, the luggage shell, and the fuel filler lid; shell parts such as hood and door lock reinforcements; upper-body parts such as door side protection bars and back door lock reinforcements; underbody parts such as the floor and exhaust pipe reinforcements) against anti-rusting measures (adhesive, plug hole, sealer, type of steel sheet, lapping part).]
To proceed with such an aim and to make it much easier to understand the complex entanglements of cause and effect relationships, they are summarized visually, as shown in Fig. 6, by using quantification method type III. From a scatter diagram of the vehicle sections (axis I × axis II), it is apparent that Toyota's anti-rusting measures are taken mainly against the steel sheet joint portions, and that local anti-rusting processing and many other methods tend to be adopted as the measures reach the underbody sections. The diagram indicates that a combination of several types of anti-rusting methods is applied to the doors and underbody members positioned in the upperbody sections.
The adoption of such an analytical method makes it easy to evaluate competitors' vehicles, enabling reactive benchmarking for additional advantages. In addition, insight into proprietary technologies, with the knowledge acquired from the result of this analysis, enables us to grasp proactively our responses to future quality assurance, including the trends of competitors' anti-rusting measures and techniques, which reflect their thinking on target quality and the counter-marketability of anti-rusting materials.
4.2.2 Factorial Analysis Method for an Optimal Anti-Rusting of Vehicle Bodies
This subsection takes up the door hemming sections, to which the application of anti-rusting techniques presents particular difficulties. The spraying of snow-melting salts during winter to prevent the roads from freezing allows the infiltration of salt water, a corrosive factor, into the hemmed joint portions of the door outer-panels and inner-panels of rust prevention steel sheet. To prevent this, wax and/or sealer are adopted, or the joint portions of the door outer-panels and inner-panels are sealed with an adhesive agent for the dual purpose of adhesion and anti-rusting.
To evaluate the performance of the various anti-rusting measures, we conduct a market monitor test using actual vehicles, an in-house accelerated corrosion test, and an accelerated corrosion test on the bench using testpieces and/or parts. For any of these tests, it is imperative to conduct analysis of the sources of variation for the optimization of the anti-rusting of the body.
The example in Table 3 outlines test data showing the result of an analysis of the
relationship between the (combined) rust-prevention specification of a testpiece
and the depth of corrosion. A dendrogram, as shown in Fig. 7, can be generated
by subjecting these data to a cluster analysis. The dendrogram hierarchically
outlines the degree of effect of the anti-rusting factors by grouping the experiment
numbers of the corrosion test. Moreover, analysis using quantification method type I
enables us to verify the quantitative degree of effect of the anti-rusting factors
through the category scores and the size of the partial correlation coefficients.
From this diagram, it is possible to grasp the effect of the anti-rusting steel sheets
used in the door hemming sections and the validity of the quantitative effect of the
local anti-rusting processings A, B, and C from an engineering point of view; applied
concurrently with the analytical results of other testing methods using actual
vehicles, this enables the optimization of vehicle body rust prevention.
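The two analyses just described can be sketched as follows; the factor settings, corrosion depths and linkage choice are invented for illustration, and quantification method type I is implemented here, as is standard, as least squares on dummy-coded categories.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Invented factor settings mimicking X1..X5 of Table 3 (1/2 coded) and
# corrosion depths in mm; eight hypothetical corrosion-test runs.
X = np.array([[1, 1, 1, 1, 1],
              [1, 1, 2, 1, 1],
              [2, 1, 1, 2, 1],
              [2, 2, 2, 2, 2],
              [1, 2, 1, 1, 2],
              [2, 1, 2, 1, 2],
              [1, 2, 2, 2, 1],
              [2, 2, 1, 1, 1]], dtype=float)
y = np.array([0.15, 0.10, 0.08, 0.02, 0.07, 0.05, 0.04, 0.09])

# Hierarchical clustering of the runs, as in the dendrogram of Fig. 7.
Z = linkage(pdist(np.column_stack([X, y])), method='ward')
tree = dendrogram(Z, no_plot=True)       # grouping of experiment numbers

# Quantification type I: least squares on dummy-coded factors; the
# fitted coefficients play the role of category scores.
D = np.column_stack([np.ones(len(y)), (X == 2).astype(float)])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print(coef)                              # intercept, then one score per factor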
We have applied comprehensive technical insight to these results of analyses.
By adopting controllable factors effective for anti-rusting measures in addition
to environmental factors that reflect market environments, we have been able
to verify the factorial effects through the application of the design of experiments.
We have thus succeeded in realizing production with good QCDS performance
through the optimization of vehicle body rust prevention.
[Table 3: Corrosion-test data relating the combined rust-prevention specification of a testpiece to the depth of corrosion. X1: type of steel (outer), 1: steel sheet, 2: anti-corrosion steel sheet level 1; X2: type of steel (inner), 1: anti-corrosion steel sheet level 1, 2: anti-corrosion steel sheet level 2; X3, X4, X5: spot anti-corrosion processes A, B and C (1: no, 2: yes); 21 combinations.]
[Fig. 7: Cluster analysis of the corrosion-test runs, annotated with the category scores ( ) and partial correlation coefficients [ ] from quantification method type I for processes A, B and C.]
It is judged that this case (2), too, constitutes a new methodology of mountain-climbing
for problem-solving, effectively using SQC in the concurrent application of multivariate
analysis with N7 and design of experiments. We think that here, too, it has been
verified that the multivariate analytical method forms the core of the Toyota Technical
Method for the scientific implementation of SQC, as observed in Figs. 2 and 3.
5. Conclusion
We have been able to verify the following from the two demonstrative case studies
above: in combination with N7 and design of experiments, the various multivariate
analyses, starting with cluster analysis, have been established as the core of a
technology-advancing SQC, rather than a mere statistical analysis that tends to place
unbalanced importance on analysis alone.
We think we have verified that multivariate analysis offers a great application
effect as the core of the SQC methods in connection with the construction of the
presently advocated Toyota Technical Method for the scientific implementation of
SQC and the enhancement of the effectiveness of so-called SQC.
In unraveling confounded situations and complex entanglements of cause-and-effect
relationships, cluster analysis, one of the multivariate analysis methods, enables
engineers to clarify and organize them visually and logically. From this viewpoint,
engineers' latent abilities for pursuit and for new ideas are enhanced and improved.
It is considered that this new technological approach has great potential to unravel
and pursue complex problem assignments appropriately. The author appreciates the
valuable teaching and comments from the people concerned.
References:
Amasaka, K. (1993): "SQC Development and Effects at TOYOTA," (in Japanese) QUALITY, JSQC (Journal of the Japanese Society for Quality Control), 23, 4, 47-58.
Amasaka, K. (1995): "A Construction of SQC Intelligence System for Quick Registration and Retrieval Library, - A Visualized SQC Report for Technical Wealth -," Springer Lecture Notes in Economics and Mathematical Systems, 318-336.
Amasaka, K. et al. (1992): "A Method on Equipment Diagnosis of Grinder," (in Japanese) QUALITY, JSQC (Journal of the Japanese Society for Quality Control), The 42nd Technical Conference, 37-40.
Amasaka, K. et al. (1993): "A Study of Quality Assurance to Protect Plating Parts from Corrosion by SQC, - Improvement of Grinding Roughness for Rod Piston by Centerless Grinding -," (in Japanese) QUALITY, JSQC (Journal of the Japanese Society for Quality Control), 23, 2, 90-98.
Amasaka, K. et al. (1994): "Consideration of Efficient Countermeasure Method for Foundry, - Adaptability of Defects Control to Casting Iron Cylinder Block -," (in Japanese) JSQC (Journal of the Japanese Society for Quality Control), The 47th Technical Conference, 60-65.
Amasaka, K. et al. (1995a): "Aiming at Statistical Package Using in the Job Process," (in Japanese) JSQC (Journal of the Japanese Society for Quality Control), The 25th Annual Technical Conference, 3-6.
Amasaka, K. et al. (1995b): "A Study of Questionnaire Analysis of the Free Opinion, - The Analysis of Information Expressed in Words Using N7 and Multivariate Analysis Together -," (in Japanese) JSQC (Journal of the Japanese Society for Quality Control), The 50th Technical Conference, 43-46.
Amasaka, K. et al. (1995c): "The Q.A. Network Activity for Prevent Rusting of Vehicle by Using SQC," (in Japanese) JSQC (Journal of the Japanese Society for Quality Control), The 50th Technical Conference, 35-38.
Amasaka, K. et al. (1996a): "A Study on Estimating Vehicle Aerodynamics of Lift, - Combining the Usage of Neural Networks and Multivariate Analysis -," (in Japanese) The Institute of Systems, Control and Information Engineers, 9, 5, 229-237.
Amasaka, K. et al. (1996b): "Influence of Multicollinearity and Proposal of New Method of Variable Selection, - A Study of Applied Multiple Regression Analysis for Analysis of Source of Variation -," (in Japanese) JIMA (Japan Industrial Management Association), 46, 6, 573-584.
Amasaka, K. et al. (1996c): "A Study on Validity of the EN Method for Variable Selection, - A Study of Applied Multiple Regression Analysis for Analyzing Source of Variation Factors (Part II) -," (in Japanese) JIMA (Japan Industrial Management Association), 47, 4, 248-256.
Amasaka, K. et al. (1996d): "An Investigation of Engineers' Recognition and Feelings about Good Patents by New SQC Method," (in Japanese) JSQC (Journal of the Japanese Society for Quality Control), The 52nd Technical Conference, 17-24.
Amasaka, K. and Azuma, H. (1991): "The Practice SQC Education at TOYOTA, - For Growing Human Resource and Practical Effort -," (in Japanese) QUALITY, JSQC (Journal of the Japanese Society for Quality Control), 21, 1, 18-25.
Amasaka, K. and Ihara, M. (1996): "Latent Structure of Goodness-of-invention," Springer Lecture Notes in Economics and Mathematical Systems, 348-353.
Summary: The purpose of this paper is to show how statistical binary tree analysis can
be applied in the field of QC. An analysis discriminating uncollectible credit-loan accounts
in sales management is shown, using Classification And Regression Trees (CART).
1. Introduction
The quality of Japanese industrial products has been improved by modern quality control
(QC), which was introduced into Japanese industries after World War II. In particular,
the application of statistical techniques to QC (SQC) is one of the reasons why the quality
of products has improved. QC has spread not only to the manufacturing department
but also to almost all departments in a company to improve the quality of products,
services and jobs as total quality control (TQC), which is known as a management tool
of Japanese continuous quality improvement.
One of the basic concepts of SQC is how well we decompose the many causes affecting
variations of quality into individual causes without confounding; that is, how well the
causes are classified. To do this, engineers involved in quality planning, design, and
improvement of products and services apply many kinds of statistical techniques, such
as control charts, statistical tests and estimation, design of experiments and multivariate
statistical analysis. In particular, with the advance of computers during the recent
decade, they routinely apply multivariate statistical techniques such as multiple regression
analysis, discriminant analysis and cluster analysis to specify and classify causes
affecting the quality of products and services. They sometimes feel that the result
of analysis by such techniques does not always bring them useful information for
problem solving, since the statistical linear model differs from the model that they build
with their professional knowledge.
On the other hand, statistical binary tree analysis gives us a basic concept of
stratification and/or classification. As shown in Fig. 1, the process is binary because
parent nodes are always split into exactly two child nodes, and it is recursive because
the process can be repeated by treating each child node as a parent. It mirrors the
naturally simple way humans think: an answer of "yes" or "no" to successive
questions. Since anyone who applies the methodology to the real world can see which
successive causes are followed until a terminal node whose contents are sufficiently
homogeneous is reached, it is easy for him/her to understand the result.
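A minimal sketch of such recursive binary splitting, using scikit-learn's CART-style decision tree on synthetic data; the variable names follow the paper, but the data and the fitted tree are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the credit-loan data: 448 cases, two predictors
# named after the paper's variables, class 1 = collectible loan.
rng = np.random.default_rng(0)
X = rng.normal(size=(448, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=448) > 0).astype(int)

# Each internal node asks a yes/no question and sends cases to exactly
# two child nodes; splitting is repeated recursively on each child.
tree = DecisionTreeClassifier(criterion='gini', max_depth=3).fit(X, y)
print(export_text(tree, feature_names=['ANNGROSS', 'ANNINCOM']))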
[Fig. 1: A binary tree, consisting of a root node, parent nodes and terminal nodes.]
where ANNGROSS denotes the variable name of the annual gross sales amount and
ANNINCOM the rank of annual sales income.
In this output CART answers:
[CART node information output: Class 1 contains 325 cases (103 in the left child, 222 in the right child) and Class 2 contains 123 cases (97 left, 26 right). The best competitor splits listed are HEAVYMEC = 0,1,2 (improvement 0.245), EMPLOYEE <= 5.500 (0.242) and JURIDIC = 4,5 (0.203).]
In each case the question is used to split a node by sending the "yes" answers to the
left child node and the "no" answers to the right child node. In the credit loan data,
200 cases go to the left node and 248 cases go to the right node. The left node
contains 103 cases of Class 1 (collectible loans) and 97 of Class 2 (uncollectible
loans); the right node contains 222 of Class 1 and 26 of Class 2.
CART's method is to look at all possible splits for all variables included in the analysis.
For example, it considers splits on ANNGROSS at one hundred billion,
two hundred billion, and so on, all the way up to the highest annual gross sales amount
observed in the data. It then does the same for the rank of annual sales income,
whose variable name is ANNINCOM, and for all other variables as well. Since
there are at most 448 different values for each variable in this data set, any problem
has a finite number of candidate splits, and CART conducts a brute-force
search through them all, converting a continuous variable to an ordered categorical
variable using quantiles of the data. If a categorical variable is nominal, CART selects
the best of all iC2 combinations, where i denotes the number of categories.
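The candidate-split enumeration just described can be sketched as follows; the helper functions and test values are illustrative, not CART's internal code.

from itertools import combinations

def continuous_splits(values):
    # Every distinct observed value defines one candidate question
    # "is x <= v?", up to the largest value in the data.
    return sorted(set(values))

def nominal_splits(categories):
    # All ways of sending a nonempty proper subset of categories to the
    # left child; a subset and its complement describe the same split.
    cats = sorted(categories)
    for k in range(1, len(cats)):
        for left in combinations(cats, k):
            yield set(left)

print(continuous_splits([3, 1, 2, 3]))                  # [1, 2, 3]
print(sum(1 for _ in nominal_splits({'a', 'b', 'c'})))  # 6 subsets (3 distinct splits)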
3.2 Choosing a split: Measure of goodness-of-split criterion
CART's next activity is to rank-order each splitting rule on the basis of a goodness-of-split
criterion. One criterion commonly used is a measure of how well the splitting
rule separates the classes contained in the parent node. The goodness-of-split criterion
is defined through the node impurity

i(t) = 1 − Σ_j p(j|t)².

This criterion is called the Gini index of diversity, a measure of node impurity.
In practice, the following improvement of impurity by a split s of node t is employed as
the criterion for the selection of the split variable:

Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R),

where p_L and p_R are the proportions of cases in t sent to the left child t_L and the
right child t_R.
In the example, the improvement of the best split variable, ANNGROSS, was 0.093,
while the improvement of the best competitor variable, ANNINCOM, which appears
in the part labelled "Competitor", was 0.089. Thus, the goodness-of-split criterion
of the variable ANNGROSS is higher than that of the variable ANNINCOM, which
is the best competitor.
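The Gini index and the improvement can be computed directly; the sketch below follows the standard CART definitions (Breiman et al., 1984) and uses the node counts quoted above, though the program's reported improvement of 0.093 may additionally reflect priors or misclassification costs.

import numpy as np

def gini(counts):
    # Gini index of diversity: i(t) = 1 - sum_j p(j|t)^2.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def improvement(parent, left, right):
    # Decrease in impurity achieved by the split of the parent node.
    n = sum(parent)
    p_left, p_right = sum(left) / n, sum(right) / n
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# Node counts quoted above: 325/123 cases of the two classes, split into
# a left child with (103, 97) and a right child with (222, 26).
print(improvement([325, 123], [103, 97], [222, 26]))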
has cost equal to 1.000 × 0.500 = 0.500, and the tree with two terminal nodes has cost
0.605 × 0.500 = 0.3025. Therefore the cost complexities for the two trees are:
TREE SEQUENCE
Dependent variable: JUDGE

Tree   Terminal   Cross-Validated    Resubstitution   Complexity
       Nodes      Relative Cost      Relative Cost    Parameter
 6       35       0.722 +/- 0.060        0.248          0.000
 7       21       0.685 +/- 0.059        0.286          0.003
 8       20       0.685 +/- 0.059        0.292          0.003
 9       15       0.644 +/- 0.058        0.323          0.003
10       12       0.654 +/- 0.058        0.351          0.005
11*      11       0.612 +/- 0.057        0.363          0.006
12        9       0.643 +/- 0.058        0.392          0.007
13        7       0.650 +/- 0.058        0.447          0.014
14**      5       0.650 +/- 0.058        0.504          0.014
15        3       0.731 +/- 0.060        0.605          0.025
16        1       1.000 +/- 0.000        1.000          0.099
samples are then combined to form error rates for trees of each possible size; these
error rates are applied to the trees based on the entire learning sample. This complex
process yields a set of reliable estimates of the independent predictive accuracy of the
tree. The middle column in Fig. 4, labelled "Cross-Validated Relative Cost", contains
the relative misclassification rates: the first value is the average cross-validated
relative error and the second value is the standard error of the cross-validated relative cost.
We can see that the minimum cross-validated relative error is 0.612, for the tree with
11 terminal nodes. Note that the maximal tree does not always have the minimal value
of cross-validated relative error, although the maximal tree does have the minimal value of
resubstitution relative cost, i.e. the resubstitution estimate of the relative misclassification rate.
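A sketch of how such a pruned tree sequence with cross-validated error rates can be produced today; scikit-learn's minimal cost-complexity pruning stands in for the CART program used in the paper, and the data is synthetic.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=448, random_state=0)   # synthetic data

# One subtree per complexity parameter alpha, from the full tree down
# toward the root; smaller trees correspond to larger alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[:-1]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    cv_error = 1 - cross_val_score(tree, X, y, cv=10).mean()
    print(tree.get_n_leaves(), round(cv_error, 3), round(alpha, 4))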
[Fig. 5: Cross-validated relative cost versus the number of terminal nodes.]
The characteristic pattern, shown in Fig. 5, is a fairly rapid initial decrease followed
by a long, flat valley and then a gradual increase for larger trees.
CART employs the following 1-SE rule for selecting the right-sized tree:
1. Reduce the instability.
2. Choose the simplest tree whose accuracy is comparable to the minimal
misclassification rate.
In addition,
3. A less complex model is easier to understand and is preferred in
all applied statistics.

R^cv(T_k1) ≤ R^cv(T_k0) + SE(R^cv(T_k0)),   (3)

where T_k0 denotes the tree with the minimal misclassification cost and T_k1 denotes
the tree having the minimal number of nodes within the 1-SE rule.
In the analysis, the cross-validated relative cost of the tree with 5 terminal nodes is 0.650,
and it is within the 1-SE rule from the minimal cross-validated relative cost of the tree
with 11 terminal nodes.
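Applied to the TREE SEQUENCE values above, the 1-SE rule can be sketched as follows; only the selection logic is shown.

# Values copied from the TREE SEQUENCE table above.
nodes = [35, 21, 20, 15, 12, 11, 9, 7, 5, 3, 1]
cv    = [0.722, 0.685, 0.685, 0.644, 0.654, 0.612, 0.643, 0.650, 0.650, 0.731, 1.000]
se    = [0.060, 0.059, 0.059, 0.058, 0.058, 0.057, 0.058, 0.058, 0.058, 0.060, 0.000]

k0 = min(range(len(cv)), key=cv.__getitem__)     # tree with minimal CV cost
threshold = cv[k0] + se[k0]                      # one standard error above it
k1 = min((k for k in range(len(cv)) if cv[k] <= threshold),
         key=nodes.__getitem__)                  # simplest tree within the bound
print(nodes[k0], nodes[k1])                      # 11 and 5 terminal nodes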
In conclusion, the minimal-cost tree having 11 terminal nodes is employed, because
the variables X5 to X9 are considerably important in business, unlike the variable
importance in the analysis, which is defined as a measure of a variable's ability to mimic
the chosen tree and to play a role as a surrogate for the best splitting variable.
[Misclassification by class, and classification tree diagrams showing the terminal regions of the selected trees.]
References
Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984): Classification and Regression Trees, Wadsworth.
California Statistical Software (1994): CART Version uno, CalStat, Inc.
Morgan, J. N., Sonquist, J. A. (1963): Problems in the analysis of survey data, and a proposal, JASA, 58, 415-434.
Steinberg, D., Colla, P. (1992): CART, SYSTAT, Inc.
Analysis of Preferences for Telecommunication
Services in Each Area
Tohru UEDA 1 and Daisuke SATOH 2
1 Seikei University
3-3-1 Kichijoji-Kitamachi, Musashino-Shi
Tokyo 180, Japan
2 NTT Telecommunication Networks Laboratories
3-9-11 Midori-Cho, Musashino-Shi
Tokyo 180, Japan
1. Introduction
Conjoint analysis [Luce and Tukey (1964)] has been used in marketing research to
measure consumer preferences. It is a practical set of methods for predicting consumer
preferences for multi-attribute options in a wide variety of product and service
contexts. When developing new products and services, it is an effective method to
determine service characteristics. As for the evolution of conjoint analysis in marketing
research, see the reviews by Green and Srinivasan (1978, 1990), Wittink and Cattin
(1989), and Wittink et al (1994).
In this paper, we apply conjoint analysis to telecommunication services. We divide
telecommunication services into two classes: services that are independent of subscriber
networks and services that depend on subscriber networks. In the former
case we can use conjoint analysis to determine service characteristics because we can
regard typical consumer preference as reflecting general consumer preferences. Ueda
(1994) has recently applied conjoint analysis to existing telecommunication services
(voice mail). This kind of service offers a good opportunity to apply conjoint analysis.
In the latter case we must analyze overall consumer preferences in service areas because
the services depend on subscriber networks. Moreover, when we have several
service area candidates, we must forecast demands in order to build subscriber networks
economically. It has been difficult to use conjoint analysis to identify preference
tendencies on an aggregate basis, and especially in specific geographic areas, because it
has mainly focused on individual consumers. Conjoint analysis alone is insufficient to
measure consumer preferences for services such as cable television (CATV) in service
areas.
In this paper we propose a method of identifying likely preferences in various areas.
To do this we combine conjoint analysis with regression analysis. This combination
enables us to analyze preference tendencies in specific geographic areas. We apply
our method to new telecommunication services.
3. Sampling
We conduct a survey in two areas of Japan, getting 389 respondents from one area
and 200 from the other. In Table 2 each area denotes a major city in Japan. The
respondents in each area were obtained by a combination of random and purposive
sampling. Although there are very few subscribers to CATV service in those areas,
their opinions are very important, so we also intentionally selected respondents who
are CATV subscribers.
Table 2: Sampling
4. Estimation Method
Various algorithms for conjoint analysis have been proposed, such as MONANOVA
[Kruskal (1965)], TRADE-OFF [Johnson (1975)], LINMAP [Srinivasan and Shocker
(1973)], and RANKLOGIT [Ogawa (1987)]. LINMAP differs from the others in that
it uses linear programming, whereas the other approaches use nonlinear optimization.
The use of linear programming enables LINMAP to obtain globally optimal parameter
estimates, while the other approaches cannot be guaranteed to achieve global
optima.
Satoh and Ueda have discovered two problems with LINMAP solutions.
1: Even if there is a set of solutions that expresses the preference data perfectly,
LINMAP cannot always generate it. Instead it produces a set that expresses only
the partial rankings.
2: LINMAP cannot necessarily produce a set of solutions that matches an analyst's
inferences from observed data.
Satoh and Ueda have proposed an improvement of LINMAP [Satoh and Ueda]. Thus
we applied it to the new telecommunication services in Table 1. The algorithm is
composed of two steps as follows: STEP 1 is LINMAP and STEP 2 is a new additional
part.
STEP 1 (LINMAP):

min Σ_{i=1}^{n-1} γ_i   (5)

subject to

(x_i − x_{i+1})a + γ_i ≥ 0,  i = 1, 2, ..., n−1,   a ≥ 0,  γ_1, ..., γ_{n-1} ≥ 0,

where n is the total number of combinations, a is the vector whose components are
the utility parameters a_j for every attribute, and u_h and x_h are respectively the utility
and the combination vector of the combination ranked h-th by the respondent.

STEP 2:

min Σ_{i=1}^{n-1} γ_i   (6)

subject to

(x_i − x_{i+1})a + γ_i ≤ 1/(n−1) + ε_u,  i = 1, 2, ..., n−1,
(x_i − x_{i+1})a + γ_i ≥ 1/(n−1) − ε_l,  i = 1, 2, ..., n−1,
a ≥ 0,  γ_1, ..., γ_{n-1} ≥ 0,
0 ≤ ε_l ≤ 1/(n−1),  0 ≤ ε_u ≤ (n−2)/(n−1).
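As a concreteness check, the following is a minimal sketch of STEP 1 as a linear program in Python (scipy), under the reconstruction above; the normalization constraint (x_1 − x_n)a = 1, the data matrix X and all values are assumptions added here to exclude the trivial all-zero solution, not part of the paper.

import numpy as np
from scipy.optimize import linprog

# Combination vectors ordered from the highest-ranked to the lowest-
# ranked option; the 0/1 entries are invented attribute levels.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
n, p = X.shape
D = X[:-1] - X[1:]                        # rows: x_i - x_{i+1}

# Variables are [a_1..a_p, gamma_1..gamma_{n-1}]; minimize sum of gammas
# subject to (x_i - x_{i+1}) a + gamma_i >= 0 and a, gamma >= 0.
c = np.concatenate([np.zeros(p), np.ones(n - 1)])
A_ub = np.hstack([-D, -np.eye(n - 1)])    # -(D a + gamma) <= 0
b_ub = np.zeros(n - 1)
# Assumed normalization (x_1 - x_n) a = 1 to exclude the trivial a = 0.
A_eq = np.concatenate([X[0] - X[-1], np.zeros(n - 1)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (p + n - 1))
print(res.x[:p])                          # estimated utility parameters a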
5. Analysis
The sum of the range of every respondent's partworths for each attribute is shown
in Table 3 and Figs. 1 and 2.
[Figs. 1 and 2: Relative importance of the four attributes (monthly charge, registration fee, phone, V.O.D.) in each of the two areas; V.O.D. is the most important in both areas (34% and 39%), with the monthly charge at 22% and 17% respectively.]
[Equation (8), defining the importance weight of each attribute as the sum over respondents of the range of that respondent's partworths for the attribute.]
Thus, we can obtain U_i(A) for combination i in area A if we know the value of
Σ_{p∈A} B_pj. The factors we chose as explanatory variables B_pj are shown in Appendix 1.
Moreover, we can obtain the relative importance of attribute a_j(A) in area A through
multiple regression analysis by using conjoint analysis or the multiple regression equation

U_i(A) = a_1(A)δ_1(i) + a_2(A)δ_2(i) − a_3(A) log x_3(i) − a_4(A) log x_4(i),   (11)

where U_i(A) is the criterion variable and δ_j(i) and x_j(i) are explanatory variables,
which are defined by Eqs. (2) and (3), respectively.
Regression analysis gives the coefficients b_j. Hence we can obtain a preference-ranking
of the combinations in each area. The coefficients of determination obtained were:

Area   Coefficient of determination
A      0.3199
B      0.3526
C      0.3387
D      0.3671
E      0.3525
F      0.3702
G      0.3539
H      0.3111
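For illustration, a minimal sketch of this regression step follows; the data, variable counts and coefficients are invented for the example, and the coefficient of determination is computed in the usual way.

import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(389, 4))             # hypothetical explanatory variables
u = B @ np.array([0.4, 0.2, -0.3, 0.1]) + rng.normal(scale=1.2, size=389)

D = np.column_stack([np.ones(len(u)), B])             # add an intercept
coef, *_ = np.linalg.lstsq(D, u, rcond=None)          # coefficients b_j
resid = u - D @ coef
r_squared = 1 - resid.var() / u.var()     # coefficient of determination
print(round(r_squared, 4))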
7. Conclusion
We have applied conjoint analysis to new telecommunication services and investigated
a method of obtaining a preference-ranking of choice-set combinations in various areas.
We have chosen this elaborate method over other ones because it should enable
us to determine the relative importance of an attribute in various areas and to choose
a new service that most consumers in those areas prefer to another service. Although
the results had low accuracy due to a lack of effective explanatory variables, this
method will be effective if appropriate ones are found. Further studies will be needed
to transform our estimation into a forecast of the demand [Ueda et al (1995)] for new
telecommunication services in various areas.
References:
Green, P.E. and Srinivasan, V. (1978): Conjoint Analysis in Consumer Research: Issues and Outlook, Journal of Consumer Research, 5, 103-123.
Green, P.E. and Srinivasan, V. (1990): Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice, Journal of Marketing, 54, 3-19.
Johnson, R.M. (1975): A Simple Method for Pairwise Monotone Regression, Psychometrika, 40, 2, 163-168.
Kruskal, J.B. (1965): Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data, Journal of the Royal Statistical Society, Series B, 27, 251-263.
Luce, R.D. and Tukey, J.W. (1964): Simultaneous conjoint measurement: A new type of fundamental measurement, Journal of Mathematical Psychology, 1, 1-27.
Ogawa, K. (1987): An Approach to Simultaneous Estimation and Segmentation in Conjoint Analysis, Marketing Science, 6, 1, 66-81.
Satoh, D. and Ueda, T.: to be submitted.
Srinivasan, V. and Shocker, A.D. (1973): Estimating the Weights for Multiple Attributes in a Composite Criterion Using Pairwise Judgments, Psychometrika, 38, 473-493.
Ueda, T. (1994): Analysis of Preferences for Services Based on Conjoint Analysis, Singaku Ron, J77-B-I, 9, 542-549, (in Japanese).
Ueda, T. et al. (1995): A method of forecasting demand for new telecommunication services, 9th European Meeting of the Psychometric Society, 123.
Wittink, D. and Cattin, P. (1989): Commercial Use of Conjoint Analysis: An Update, Journal of Marketing, 53, 91-96.
Wittink, D. et al. (1994): Commercial Use of Conjoint in Europe: Results and Critical Reflections, International Journal of Research in Marketing, 11, 41-52.
Effects of End-Aisle Display and Flier
on the Brand-Switching of Instant Coffee
Akinori Okada
Department of Industrial Relations
School of Social Relations
Rikkyo (St. Paul's) University
3 Nishi Ikebukuro
Toshima-ku, Tokyo 171, Japan
Summary: Brand-switching data among instant coffee brands were analyzed by a nonmetric
asymmetric multidimensional scaling (Okada and Imaizumi, 1987) to identify effects
of the end-aisle display and of the flier. Two-dimensional solutions show that the end-aisle
display of a brand is in general not effective to induce switching to the brand and is
vulnerable against switching to other brands; that for some brands the flier of the brand
is effective to induce switching from similar brands to the brand and is defensive against
switching to other brands; but that for some brands the flier is not effective to induce
switching to the brand and is vulnerable against switching to similar brands.
1. Introduction
Since several asymmetric multidimensional scaling (MDS) models and procedures
were introduced (Zielman and Heiser, 1996), asymmetric MDS has been utilized
to analyze various sorts of data, such as attraction relationships (Chino, 1978; Collins,
1987), journal citations (Chino, 1978, 1990; Weeks and Bentler, 1982), word associations
(Chino, 1990; Harshman et al., 1982; Zielman and Heiser, 1993), telephone communication
(Okada, 1989), intergenerational occupational mobility (Okada, 1988a),
foreign trade (Chino, 1978), marriages among ethnic groups (Zielman, 1991), and data
from various areas of psychology and sociology.
One of the most important areas for applying asymmetric MDS seems to be marketing
research. Brand switching data have been analyzed by asymmetric MDS, because
asymmetries in brand switching might have a relationship with the differences in
attractiveness among brands (DeSarbo and De Soete, 1984; Zielman and Heiser, 1996).
Asymmetric MDS has been used to analyze brand switching data among car categories
or among soft drink brands (DeSarbo and Manrai, 1992; DeSarbo et al., 1992;
Harshman et al., 1982; Okada, 1988b; Zielman, 1991). In the present study, brand
switching data among instant coffee brands are analyzed by a nonmetric asymmetric
MDS (Okada and Imaizumi, 1987) to investigate the effects of the end-aisle display
and of the flier (a pamphlet or circular for mass distribution issued by a supermarket
informing of sales) on the brand switching.
2. Data
The brand switching data analyzed in the present study were derived from scanner
data of about 5,000 instant coffee purchases made in 1993 by a panel consisting of
796 households who frequently came to a supermarket. Eleven instant coffee brands
were analyzed in the present study (10 brands, plus other instant coffee brands
treated together as the 11-th brand); they are represented in Table 1. These brands
include three types of instant coffee: freeze-dried instant coffee (type a in Table 1),
regular instant coffee (type b in Table 1), and ones which are already mixed with
sugar and cream or which are already packed in a plastic or paper cup (type c in
Table 1). They also include brands of Nestle which dominates in the Japanese in-
stant coffee market and brands of Ajinomoto General Foods which is a joint venture
between Ajinomoto and General Foods.
Seven brands were purchased both when there was and when there was not an end-aisle
display of each of them at the supermarket. The others, the 11-th brand, were not
purchased when there was an end-aisle display of any of them. If we distinguish a
purchase of a brand when there was an end-aisle display of that brand (the brand
with the end-aisle display) from a purchase when there was not an end-aisle display of
that brand (the brand without the end-aisle display) as two different items, we have
18 items or brands. A switching matrix among the 18 items or brands was calculated
for each household. The sum of the 796 matrices was derived to construct the switching
matrix among the 18 items or brands for the panel. Table 2 shows the 18 x 18 switching
matrix, whose (j,k) element represents the frequency of switching from item or
brand j to k; this is called the end-aisle display data.
Six brands were purchased both when a flier accompanied each of the brands and
when it did not. If we distinguish a purchase of a brand when the flier
issued by the supermarket accompanied that brand (the brand with the flier) from a
purchase when the flier did not accompany the brand (the brand without the flier) as
two different items, we have 17 items or brands. The others were not purchased when a
flier accompanied any of them. Table 3 shows the switching matrix among the 17 items
or brands, which is called the flier data.
Tab. 2: Switching Matrix among 18 Items or Brands (with/without the End-Aisle Display).
                              to
from                    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 1 NGB150 w/o end      70  76  26  34   8   5   7  12   1   3   1   4   5   3   1   5   2  47
 2 NGB150 with end     65  92  39  71   8  21  15  23   1   3   5   6  10   9   1   7   5  52
 3 NGB100 w/o end      26  24  33  24   4   4  10  12   0   2   2   0   2   1   0   2   3 231
 4 NGB100 with end     27  37  39  25   4   6  11  51   0   9   4   0   2   6   0   3   2  47
 5 NEX250 w/o end       9  13   7   5  84  59   2   1   0   0   1   7  12   0   0  29   1  22
 6 NEX250 with end     10  13   8   4  60  88   0  10   2   4   2   5   8   3   0  18   1  26
 7 AMX100 w/o end      11  22   8  14   3   1  22  18   1   5   2   2   2  15   0   3   6  45
 8 AMX100 with end     11  31  47  12   5   4  38  65   0   1   3   0   4   7   2   6   2  51
 9 AMX2 w/o end         1   1   1   0   0   0   0   1  39   0   1   0   1   3   0   0   2  19
10 NCP w/o end          4   6   5   6   3   1   4   7   0 187  21   0   2   1   0   3   6  73
11 NCP with end         3   1   4   4   4   1   1   5   0  15  18   0   0   0   0   0   1  31
12 ABL250 w/o end       1   5   0   3   8   8   0   4   0   1   0   6  18   0   0   1   0  21
13 ABL250 with end     11  13   5   6  19  19   3   9   0   2   4  20  41   3   0   2   1  43
14 UCC100 w/o end       6  14   9  15   0   1  12  31   1   1   0   1   5  36   0   3   3  49
15 UCC100 with end      1   0   1   1   0   0   0   0   0   0   0   0   0   1   0   0   0   2
16 NEX150 w/o end       6   5   3   3  19  27   1   6   1   2   2   3   1   0   1  51   3  28
17 AMX30 w/o end        0   2   4   2   1   3   8   8   2   5   3   1   0   7   0   0  51  38
18 others              37  63  54  36  25  30  27  79  16  69  35  23  35  48   2  25  32 768
Tab. 3: Switching Matrix among 17 Items or Brands (with/without the Flier).
                               to
from                     1   2    3   4    5   6   7   8   9   10  11  12  13  14  15  16   17
 1 NGB150 w/o flier    279  22  139  11   30   6  43  11   2    9   2  23   2  14  11   7   97
 2 NGB150 with flier     2   0   20   0    1   5   1   2   0    1   0   0   0   0   1   0   21
 3 NGB100 w/o flier     99   2  116   1   12   3  55  27   0   14   1   4   0   6   3   4  661
 4 NGB100 with flier    13   0    4   0    3   0   1   1   0    2   0   0   0   1   2   1    4
 5 NEX250 w/o flier     28   4   17   1  160  52   3   2   1    1   1  25   1   1  37   1   35
 6 NEX250 with flier    12   1    6   0   50  29   3   5   1    5   0   6   0   2  10   1   13
 7 AMX100 w/o flier     56   2   60   3   10   1  86  13   1    9   0   7   0  22   5   8   80
 8 AMX100 with flier    16   1   16   2    1   1  32  12   0    2   0   1   0   2   4   0   16
 9 AMX2cup w/o flier     2   0    1   0    0   0   1   0  39    1   0   1   0   3   0   2   19
10 NCP w/o flier        12   1   17   1    7   1  17   0   0  227   8   1   1   1   3   7   97
11 NCP with flier        1   0    0   1    1   0   0   0   0    6   0   0   0   0   0   0    7
12 ABL250 w/o flier     25   2   12   2   34  13  11   3   0    6   0  78   5   2   3   1   60
13 ABL250 with flier     3   0    0   0    3   4   2   0   0    1   0   2   0   1   0   0    4
14 UCC100 w/o flier     19   2   24   2    1   0  28  15   1    1   0   6   0  37   3   3   51
15 NEX150 w/o flier     10   1    5   1   33  13   5   2   1    4   0   4   0   1  51   3   28
16 AMX30 w/o flier       2   0    6   0    2   2  15   1   2    8   0   1   0   7   0  51   38
17 others               96   4   88   2   38  17  80  26  16  100   4  52   6  50  25  32  768
Euclidean space as a point and a circle (in a two-dimensional space), a sphere (in a
three-dimensional space) or a hypersphere (in a four- or higher-dimensional space)
centered at that point. A configuration, which consists of points and circles (spheres,
hyperspheres), represents both symmetric and asymmetric proximity relationships
among items or brands. The Euclidean distance between two points imbedded in a
configuration corresponds to the symmetric switching between two items or brands
represented by those two points, and the difference of two radii of the circles centered
at those two points corresponds to the asymmetric switching from one item or brand
to the other.
We regard the number of purchases of item or brand k switched from item or brand
j as the similarity from items or brands j to k, and the number of purchases of item
or brand j switched from item or brand k as the similarity from items or brands k to
j. Let Sjk be the similarity from items or brands j to k, and Skj be the similarity from
items or brands k to j. Sjk is not necessarily equal to Slcj. The similarity is assumed
to be monotonically decreasingly related with Tnj/:. Tnjk is defined by Equation (1)
(1 )
(2)
item or brand)
item or brand k
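A minimal numeric sketch of Eqs. (1) and (2) as reconstructed above; the coordinates and radii below are illustrative values, not the fitted solution.

import numpy as np

X = np.array([[0.0, 0.0],                 # illustrative point coordinates
              [1.0, 0.5],
              [0.2, 1.2]])
r = np.array([0.3, 0.6, 0.2])             # illustrative radii

d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Eq. (2)
m = d - r[:, None] + r[None, :]                              # Eq. (1)
print(m)    # m_jk is monotone decreasingly related to the similarity s_jk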
4. Results
The minimized S in five- through one-dimensional spaces for the end-aisle display data
were 0.232, 0.265, 0.309, 0.367 and 0.471, and those for the flier data were 0.204, 0.252,
0.303, 0.377 and 0.499. These figures and the interpretation of the configurations suggest
that we choose the two-dimensional result as the solution for each of the two data
sets.
Figure 2 shows the two-dimensional solution for the end-aisle display data. (The
configuration was visually rotated so that the interpretation of the rotated configuration
is as clear as possible.) Each item or brand is represented as a point and a
circle. The bold circle represents the brand with the end-aisle display, and the light
circle represents the brand without the end-aisle display.

[Fig. 2: Two-dimensional configuration of the 18 items or brands for the end-aisle display data.]
[Fig. 3: Two-dimensional configuration of the 17 items or brands for the flier data.]

For most of the brands, the two
points representing the same brand with and without the end-aisle display are closely
located in the configuration. The vertical dimension seems to differentiate regular
instant coffee brands from freeze-dried and ones already mixed with sugar and cream
or already packed in a plastic or paper cup, and the horizontal dimension seems to
represent the difference between brands of Nestle and those of Ajinomoto General
Foods (Okada and Genji, 1995).
Figure 3 shows the two-dimensional solution for the flier data. (The configuration
was also visually rotated.) The bold circle represents the brand with the flier, and
the light circle represents the brand without the flier. Two points representing the
same brand with and without the flier are also closely located in the configuration.
The two dimensions appear to have the same meaning as those in Figure 2.
The obtained configuration derived from the analysis of the end-aisle data shows that
most of the brands with the end-aisle display are in the central part of the configura-
tion and that most of the brands without the end-aisle display are in the periphery of
the configuration. Most of the brands with the end-aisle display located in the central
part of the configuration have larger radii than those of the same brands without the
end-aisle display located in the periphery of the configuration. The obtained configu-
ration derived from the analysis of the flier data shows that most of the brands with
the flier are in the periphery of the configuration and that most brands without the
flier are in the central part of the configuration. Some of the brands with the flier
located in the periphery of the configuration have larger radii and some of them have
smaller radii than those of the same brands without the flier located in the central
part of the configuration.
5. Discussion
We would like to focus our attention on those brands which are represented as two
different items in a configuration; one with the end-aisle display or the flier and the
other without the end-aisle display or the flier. Combining the location of a point
(central part or periphery of a configuration) and the radius of a circle (smaller or
larger), we can classify these brands into four categories (a) to (d) shown below
(Okada and Genji, 1995). Characterization of a brand in each of four categories is
accompanied.
(a) A brand with the end-aisle display or the flier has a smaller radius than the same
brand without the end-aisle display or the flier and is located in rather a central part
of a configuration, while that brand without the end-aisle display or the flier having
a larger radius is located in rather a periphery of the configuration.
With the end-aisle display or the flier, a brand is likely to be switched from the same
brand without the end-aisle display or the flier as well as from other brands, and is
unlikely to be switched to the same brand without the end-aisle display or the flier as
well as to other brands. Without the end-aisle display or the flier, a brand is unlikely
to be switched from the same brand with the end-aisle display or the flier as well as
from other brands, and is likely to be switched to the same brand with the end-aisle
display or the flier as well as to other similar brands.
(b) A brand with the end-aisle display or the flier has a larger radius than the same
brand without the end-aisle display or the flier and is located in rather a central part
of a configuration, while that brand without the end-aisle display or the flier having
a smaller radius is located in rather a periphery of the configuration.
With the end-aisle display or the flier, a brand is unlikely to be switched from the
same brand without the end-aisle display or the flier as well as from other brands,
and is likely to be switched to the same brand without the end-aisle display or the
flier as well as to other brands. Without the end-aisle display or the flier, a brand is
likely to be switched from the same brand with the end-aisle display or the flier as
well as from other similar brands, and is unlikely to be switched to the same brand
with the end-aisle display or the flier as well as to other brands.
(c) A brand with the end-aisle display or the flier has a smaller radius than the same
brand without the end-aisle display or the flier and is located in rather a periphery
of a configuration, while that brand without the end-aisle display or the flier having
a larger radius is located in rather a central part of the configuration.
With the end-aisle display or the flier, a brand is likely to be switched from the same
brand without the end-aisle display or the flier as well as from other similar brands,
and is unlikely to be switched to the same brand without the end-aisle display or the
flier as well as to other brands. Without the end-aisle display or the flier, a brand is
unlikely to be switched from the same brand with the end-aisle display or the flier as
well as from other brands, and is likely to be switched to the same brand with the
end-aisle display or the flier as well as to other brands.
(d) A brand with the end-aisle display or the flier has a larger radius than the same
brand without the end-aisle display or the flier and is located in rather a periphery
of a configuration, while that brand without the end-aisle display or the flier having
a smaller radius is located in rather a central part of the configuration.
With the end-aisle display or the flier, a brand is unlikely to be switched from the
same brand without the end-aisle display or the flier as well as from other brands,
and is likely to be switched to the same brand without the end-aisle display or the
flier as well as to other similar brands. Without the end-aisle display or the flier, a
brand is likely to be switched from the same brand with the end-aisle display or the
flier as well as from other brands, and is unlikely to be switched to the same brand
with the end-aisle display or the flier as well as to other brands.
As mentioned earlier, seven brands were purchased both when there was and when
there was not the end-aisle display, and six brands were purchased both when there was
there was and when there was not the flier. For the end-aisle display data, these
seven brands are classified into categories (a) to (d) as shown in Table 4. For the flier
data, these six brands are classified as shown in Table 5.
Tab. 4: Classification of the Seven Brands for the End-Aisle Display Data.
                           location of a brand with the end-aisle display
radius of a brand with       central part         periphery
the end-aisle display
  smaller radius             (a) AMX100           (c) NCP
                                                      UCC100
  larger radius              (b) NGB150           (d) none
                                 NGB100
                                 NEX250
                                 ABL250
Five of the seven brands with the end-aisle display were in the central part of the
configuration (categories (a) or (b) in Table 4), and four of the five had larger radii
when they were with the end-aisle display than without the end-aisle display ((b) in
Table 4), suggesting that the end-aisle display is in general not effective to induce
switching from other brands and is vulnerable against switching to other brands. All
six brands with the flier were in the periphery of the configuration ((c) or (d) in
Table 5). Three of the six brands had smaller radii when they were with the flier
than without the flier ((c) in Table 5), suggesting that the flier is effective to induce
switching from similar brands as well as from the same brand without the flier and
is defensive against switching to other brands. The other three had larger radii when
they were with the flier than without the flier ((d) in Table 5), suggesting that the
flier is not effective to induce switching from other brands and is vulnerable against
switching to other similar brands.
Four of the six brands with the flier were always accompanied with the end-aisle dis-
play. For these four brands (NGBl50, NGBIOO, AMXIOO, and ABL250), the effect
of the flier actually means the effect of the end-aisle display and the flier. For the
two brands (NEX250 and NCP), the effect of the flier means the mixture of the effect
of the flier alone and the effect of the flier and the end-aisle display. To separate
the effect of the end-aisle display and the effect of the flier, a switching matrix was
constructed by treating each of the four brands above mentioned as three different
items; (1) an item with the end-aisle display and the flier, (2) an item with the end-
aisle display alone and (3) an item without the end-aisle display nor the flier, and by
treating each of the two brands above mentioned as four different items; (1) an item
with the end-aisle display and the flier, (2) an item with the end-aisle display alone,
(3) an item with the flier alone and (4) an item without the end-aisle display nor the
flier.
The resultant table was analyzed by the asymmetric MDS. Obtained results seem to
suggest adopting the two-dimensional configuration as the solution. The two dimen-
sions of the configuration have the same meaning as those in Figures 2 and 3. For
the three brands (NGB150, NGB100, and ABL250) of the four mentioned above, the
comparison between the brand with the end-aisle display alone and the same brand
with neither the end-aisle display nor the flier tells that these three brands are classified in
the same category (b) as shown in Table 4. For AMX100, the comparison shows that
this brand is classified into (c), not (a) as shown in Table 4 (locations were reversed).
Although the locations of the two items, one representing AMX100 with the end-aisle display
alone and the other representing AMX100 with neither the end-aisle display nor the flier,
were reversed, the two items were closely located. It seems that the effect of the end-aisle
display alone is almost the same as the mixture of the effect of the end-aisle display
alone and the effect of the end-aisle display and the flier. For these four brands, the
flier was always accompanied with the end-aisle display, and it seems impossible to
separate the effect of the flier and that of the end-aisle display.
For NEX250 and NCP, the comparison between the brand with the end-aisle display
alone and the same brand with neither the end-aisle display nor the flier shows that
NEX250 is classified into (c), not (b) as shown in Table 4 (both radii and locations
were reversed), and that NCP is classified in the same category (c) as shown in Table
4. Although the locations and radii of the two items, one representing NEX250 with the
end-aisle display alone and the other representing NEX250 with neither the end-aisle
display nor the flier, were reversed, the two items were closely located and the difference
of the two radii was small. The comparison between the brand with the flier alone and the
same brand with neither the end-aisle display nor the flier shows that NEX250 is classified
in the same category (d) as shown in Table 5, and that NCP is classified into (d), not (c)
as shown in Table 5 (radii were reversed). Although the radii of the two items, one
representing NCP with the flier alone and the other representing NCP with neither
the end-aisle display nor the flier, were reversed, the difference of the two radii was small.
This seems to suggest that the effect of the end-aisle display alone is almost the same as
the mixture of the effect of the end-aisle display alone and the effect of the end-aisle
display and the flier, and that the effect of the flier alone is almost the same as the
mixture of the effect of the flier alone and the effect of the end-aisle display and
the flier. For NEX250 and NCP, the item representing the brand with the end-aisle
display and the flier was located between the item representing the brand with the
end-aisle display alone and the item representing the brand with the flier alone, and
had a radius whose length was between the two radii of these two items, suggesting
that the interaction of the end-aisle display and of the flier is rather small.
Acknowledgment
The author would like to express his gratitude to Professor Dr. Wolfgang Gaul of the
University of Karlsruhe for his helpful comments and suggestions given at the presentation
at the IFCS-96 meeting. He also wishes to express his appreciation to Dr. Takeshi
Moriguchi of The Distribution Economics Institute of Japan for providing him with the
data. The author is indebted to H. A. Donovan for his helpful advice concerning English.
References
Chino, N. (1978): A graphical technique for representing the asymmetric relationships between N objects, Behaviormetrika, No. 5, 23-40.
Chino, N. (1990): A generalized inner product model for the analysis of asymmetry, Behaviormetrika, No. 27, 25-46.
Collins, L.M. (1987): Deriving sociograms via asymmetric multidimensional scaling, In: Multidimensional Scaling: History, Theory, and Applications, Young, F.W. et al. (eds.), 179-196, Lawrence Erlbaum Associates, Hillsdale, NJ.
DeSarbo, W.S., and De Soete, G. (1984): On the use of hierarchical clustering for the analysis of nonsymmetric proximities, Journal of Consumer Research, 11, 601-610.
DeSarbo, W.S. et al. (1992): TSCALE: A new multidimensional scaling procedure based on Tversky's contrast model, Psychometrika, 57, 43-69.
DeSarbo, W.S., and Manrai, A.K. (1992): A new multidimensional scaling methodology for the analysis of asymmetric proximity data in marketing research, Marketing Science, 11, 1-20.
Harshman, R.A. et al. (1982): A model for the analysis of asymmetric data in marketing research, Marketing Science, 1, 205-242.
Kruskal, J.B. (1964): Nonmetric multidimensional scaling: A numerical method, Psychometrika, 29, 115-129.
Summary: In hardly any other area is the availability of regional development data of such great importance as in
the sector of environmental protection and conservation. It is therefore the goal of every environmental information
system to provide relevant data collections for legislative bodies and for the daily execution of administrative tasks. In
this context, environmental information systems are mainly represented by the organisational association of data
collections by specialist information systems: organisational association, because the required data should be
accessible but must remain with the authorities responsible for the specialist information. The current technology of
data processing supports these requirements placed on distributed processing.
The proof of where which data can be found and processed, under which qualitative and quantitative conditions, is the
core of the system. The required 'common denominator' is the classification and the determination of the common
vocabulary and its relations (= thesaurus) as a metalanguage, in order to be able to secure comparable and combinable
research results regarding specialist information. For information systems this means that the 'language' and
the 'grammar' of the data used in the said information system are clearly defined and known to the user. The dialogue
with the user is organised and supported on the basis of these language patterns. For this reason, the importance of a
thesaurus and of a data model as the basis of a common language is given particular weight.
Due to the fact that the references, as described above, are based on extremely varied contents and regulations of how
the data is to be dealt with, it would appear justifiable to classify these points of view as being aspects of the object.
The content classification of data according to the topic, i.e. specialist origins and possible usability, is the aspect
pertaining to the structuring of the data. The question of the technical procedural position of the data within the data
bank points the way to an object-related classification and storage of data. Using a simple mathematical model, the
following sections are intended to demonstrate that the object-related classification of data as a content-independent
aspect is an important supplement to the data itself, and to point out several possibilities of an object-orientated
client-server technology.
1. Introduction
One of the most important preconditions for thought is the comparative classification of terms into
categories. It is only in this way that the remembrance of experiences can develop to become an
aid in life. As means of categorisation we have at our disposal, in addition to emotionally
experienced images, language and script, also the ability to quantify. Finally, however, even
quantitative expressions of orders require their linguistic expression. The entirety of these
experiences of orders and their cross-references to the model of reality is a part of what we mean
when we claim to be informed (Wenzlaff 1991).
1996). This statement is equally true for classification schemes and rules.
In this context, 'data' is to be understood as being definite values of statements whose meaning
(explicit or implicit, i.e. according to recognised agreement) has been determined. The meaning
of a document can be made explicit through texts, tabular diagrams, legends, or in files and data
banks, thesauri, etc. In this context, definite does not mean that it is a determined value in a
mathematical sense; even a parameter of a probability distribution has a definite value.
A classification system is therefore only relevant if the data it contains makes possible an
extension of the user's subjective model of reality (Feyerabend 1976). Therefore, the order of the
system with respect to the research, and the classification of the data according to the preconditions
described above, must be accessible.
Due to the fact that the majority of the available environmental data is regionally related, in
addition to the subject-related search, regionally related searches by town names, certain regional
units and co-ordinates are required and must be taken into consideration in classification systems
(Weihs 1993, LABO 1994, p. 21-22).
In this way, it is possible to carry out a regional search using both an order term and the co-ordinates
of a freely defined area. The regionally related access is necessary because the majority
of the information is to be evaluated on the basis of area structures (i.e. natural regional units,
such as river valleys, lakes, rivers and canals, aquatic and terrestrial terms or town names).
The basis for the indexing and search is the thesaurus, acting as a common 'vocabulary of
language'. A term b_n, as a defined succession of symbols (data in the above-mentioned sense),
is not reversibly and definitely based on only one object o_{g,n}. If it is homonymous (i.e. 'field'
as an expression for agricultural land or a data section, see diagram 2), then
b_n = {o_{g,n}}, g = 1, 2, ..., g⁺, with g⁺ = the number of all objects o_{g,n} for the term b_n.   (3)
Correspondingly, we treat in the same way objects that, from the point of view of different users,
are observed under different aspects: for example, 'pasture' has a very different meaning,
depending on whether it is considered from the point of view of land use, or of natural
preservation. Based on the same aspect (i.e. natural preservation versus land use), various objects
can be observed in the same way. We use g⁺ to denote the number of all possible
aspects. We will continue from here without limitation from an aspect-related point of view, due
to the fact that the homonyms, according to (3), are included. As we will see, the introduction of
the aspects means a further classification, independent of the subject-logical classification.
The relation expressed by (3) is not reversibly definite (i.e. b_n = 'field' in table 1, for
agriculture and/or data bank).
For b_j this means that there can be several terms with equal rights (synonyms) for one
object:

b_j ⇔ S_j = {s_ij}, with i = 1, 2, ..., i_j; s_ij ≠ s_mj for j ≠ m, and s_ij ∉ B, and   (4)

B* = B ∪ S.   (7)
The relation stated by (4) is rarely transitive in a logical linguistic sense: for the term b_1
= 'field', according to table 1, under the aspect 'land use' s_{i,k} = b_3 = 'pasture' is defined in a
synonymous manner, and under the aspect 'data bank' b_2 = 'variable'. Thus, in this case, the chain of
synonyms s_{i,1} ~ b_k ~ s_{i,k} leads to the erroneous relation of synonyms 'pasture' ~ 'variable'.
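The synonym-chain problem just described can be made concrete with a small sketch; the mapping below, keyed on term and aspect, is a toy structure assumed here, not part of the paper's formal system.

# Toy synonym table keyed on (term, aspect), mirroring Table 1.
synonyms = {
    ('field', 'land use'):  {'pasture'},
    ('field', 'data bank'): {'variable'},
}

def expand(term, aspect):
    # A term expands only to synonyms registered for the same aspect,
    # so no chain across aspects can relate 'pasture' to 'variable'.
    return {term} | synonyms.get((term, aspect), set())

print(expand('field', 'land use'))     # {'field', 'pasture'}
print(expand('field', 'data bank'))    # {'field', 'variable'}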
This then means for the process of indexing that data is not simply given a catchword or a
definition because a finite number of preconditions (= allocation of catchwords, codes, etc.) have
been fulfilled, but rather only if the habits relating to the expression, which make information out of
the data, are taken into consideration.
The homonymity of the terms in a formulated system of order, which we denote as the
thesaurus T, results in this generally not being of a transitive nature; these deficiencies are
circumnavigated by redundant structures. Due to the fact that transitive systems of order offer
considerable advantages in retrieval and in object-orientated procedural technology, the
model of a transitive thesaurus is the subject of discussion and will also be used in the
environmental information system.
Initially, the preconditions for a consistent thesaurus T are stated for just one aspect.
Subsequently, the required extensions for a consistent 'multi-lingual' thesaurus T are formulated
according to several aspects: the classification and the research terms must correspond with the
(specialist) linguistic understanding of the user according to various aspects g = 1, 2, .. , g+ .
"@
e::s E
(l) i:: (l)
!§.g '"cd
(l)
Related '" ,;::
;:1 ::s
Expression Synonyms terms I u o U ~
I
]>
(l)
"0
0.. i:: .~ ~ cd
;>, cd i:: 0 1il
E-< .....:l ~~ Q
II Field x x X X
13 Pasture X
Variable X
Pasture X
21 Variable X
Field X
31 Pasture X X X
II Field X
Field I X
2 The Thesaurus
The ordering and evaluation of information is essentially a precondition for successful and, above
all, for efficient work in all areas of knowledge and application. The drafting of classification
systems and classifications is an essential method of creating order with regard to one certain
criterion, i.e. retrieving information, for objects which require classification. Therefore, in the
process of retrieving information, one must not always base all considerations on the terms stored
in the data, as one is able to make use of the order created by the classification (i.e. research with
the aid of generic terms, catchwords, etc.). Conversely, the classification can be made
considerably easier by means of automation, provided that this contains the terms used by the
thesaurus or has derived its terms from the thesaurus (compare diagram 2, Weihs 1992).
The thesaurus T is an ordered sum of expressions B*, which forms an open system for the
specialist- and/or problem-orientated classification and ordering of terms; as a classification
system, it strives for the reversible, definite allocation of expressions b_n to the objects o_j. This
means that each term is contained in the thesaurus T once and only once and only refers to one
object o_j (excluding the demands of (3)).
o_n ⇔ b_n, with b_n ∈ B* = sum of all terms, n = 1, 2, 3, ..., n⁺, and   (8)

(9)

A definite allocation of the expressions b_n to the k⁺ categories (Wersig, 1982) is defined as being
the order system:

(10)

B*_k ⊆ B*,   (11)

B*_k \ B*_{2k+1} ∩ B*_{2k+1} = {0}; B*_k \ B*_{2k+1} ∪ B*_{2k} = B*_k, with k = 1, 2, ..., k⁺ categories.   (12)
Preconditions (1) and (2) ensure that a term will appear once and once only in the sum B*, and is therefore included in only one subset. The allocation of the synonyms to the categories according to (4) is also definite according to the preconditions (5) and (6), because A ∩ B = ∅. (10) through (12) define a strictly hierarchical, transitive system of order. The rule R, which, according to (10), must be defined for the ordering of the terms, is established on the basis of subject-logically determined preconditions, determined between these as super- and sub-orders of category forms. Nevertheless, the literature has examples of promising mathematical-statistical principles for the extraction of thesauri from texts or empirical material (see e.g. Bock 1993).
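The index scheme in (12), in which a category k has the two subcategories 2k and 2k+1, is simply the numbering of a binary tree, and the transitivity of the order system means that every term filed under a subcategory is also retrievable under all of its super-categories. The following minimal sketch (hypothetical Python, not part of the original model) illustrates this strictly hierarchical numbering:

    # Minimal sketch (hypothetical): categories numbered as in (12),
    # where category k has the subcategories 2k and 2k+1 (a binary tree).
    def subcategories(k):
        """Direct subcategories of category k under the numbering of (12)."""
        return (2 * k, 2 * k + 1)

    def supercategories(k):
        """Transitive chain of super-categories of k; by transitivity a term
        filed under k is also retrievable under every category in this chain."""
        chain = []
        while k > 1:
            k //= 2              # the common parent of 2k and 2k+1 is k
            chain.append(k)
        return chain

    print(subcategories(3))      # (6, 7)
    print(supercategories(13))   # [6, 3, 1]: category 13 lies under 6, 3 and 1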
Equation (3) demands the consideration of the homonyms, because that is the point at which the allocation between object and expression is no longer revocable. Conversely, (9) is the precondition for a subject-logical set of rules according to (10) for the division of expressions into categories. We will therefore extend (8) to the effect that we can allocate, in a definite sense, each of the h ≥ 1 varied objects with identical expression bn (e.g. "field", compare Table 1) to a certain aspect g = 1, ..., u, ..., g+:
oh,n <=> ag and oh,n <=> bn,g with bn,g ∈ Bg = sum of all terms of the aspect g (13)
(14)
Sg ∩ Bg = ∅ but Sg ∩ Bt ≠ ∅ (16)
Bk,g \ B2k+1,g ∩ B2k+1,g = {0}; Bk,g \ B2k+1,g ∪ B2k,g = Bk,g with k = 1, 2, ..., k+ categories (17)
Bk,g ∈ Cg: the Bk,g of Cg with a division rule, based on the aspect, according to g. (19)
In accordance with (19), it must be taken into consideration in the subject-logical classification rules that this is related to one and only one aspect ag.
(20)
Bjk,gt = sum of the homonyms from the expressions of the categories j, k of the aspects g and t.
A pyramid structure, in accordance with Brito (1990), results from the inclusion of several languages r, u, v, ... ∈ {g = 1, ..., g+}:
(21)
[Diagram 1: categories 1-15 arranged in a net-like/pyramidal system of order; the total C+ corresponds to B*]
The sum total (universal sum) of the expressions of the thesaurus T results from the terms and the synonyms allocated to them.
The subject-logical origin of a term is characterised by its affiliation to one (or several) aspects (Schilling, 1991). Depending on which super- and sub-orders have been permitted, the thesaurus will be, according to diagram 1, net-like, pyramidal or strictly hierarchical.
It is easily recognisable in diagram 2 that, through the introduction of the aspects, a (g+ + 1)-dimensional variable space is spanned. The aspects must not only be independent of each other and of the categories in a Cartesian sense, they must also be independent of each other in a subject-logical sense. The selection set, which is of interest for the retrieval of expressions, then results from the projection of the desired aspect area onto the term level.
[Diagram 2: the variable space spanned by the aspects g over the term level B]
In an earlier part of this paper, we examined the preconditions which make possible the definition of a thesaurus T which includes any number of aspects. The introduction of the aspects makes it possible to consider the homonymous expressions used in various specialist languages, i.e. specialist points of view. In this context, the precondition was presupposed that each expression appears once and only once in the thesaurus. The expressions are allocated to synonymous terms, depending on their aspect, and these are then equally components of the thesaurus. There is therefore a hierarchical clustering of terms within each aspect. The combination of various aspects then produces a pyramidal structure in the thesaurus. If one holds onto this precondition, it is possible to make a definite, non-redundant depiction of the expressions simply by spanning a (g+ + 1)-dimensional space. In accordance with the precondition, the aspects are now independent of each other. In so far as an aspect is accurate for the expression bn, the binary value 1 is allocated, otherwise 0. The (g+ + 1)-th dimension makes reference to the term area. Due to the fact that, according to (2), each term appears only once, all changes related to the term (i.e. updates of synonyms, category affiliation {k}, references to aspects) directly affect the relevant relations and objects. This is an essential precondition for the realisation of an object-orientated principle.
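As a rough illustration, such term objects can be held as binary vectors over the aspects; the following sketch is hypothetical (the aspect names and the data structure are illustrative, not taken from the environmental information system):

    # Hypothetical sketch: each expression carries a binary vector over the
    # aspects g = 1..g+; retrieval is the projection of one aspect onto the
    # term level.
    ASPECTS = ["agriculture", "statistics", "physics"]   # illustrative aspects

    thesaurus = {
        # expression -> (binary aspect vector, category k, synonym ring)
        "field":   ((1, 1, 1), 3, {"plot", "variable", "field of force"}),
        "pasture": ((1, 0, 0), 6, {"grassland"}),
    }

    def select(aspect):
        """All expressions for which the chosen aspect applies (value 1)."""
        i = ASPECTS.index(aspect)
        return [t for t, (vec, _, _) in thesaurus.items() if vec[i] == 1]

    print(select("statistics"))   # ['field'] -- the homonym appears once only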
Our term object b(n) is defined by the vector of the variable space of the aspects, the categories of the thesauri and the synonym ring:
(23)
The subject-logical separation of the aspect point of view from the classification determined by the contents makes possible an extended interpretation of the model: the aspect term can, without contradiction, be regarded as a technical reference (address) to interfaces, such as HTML pages on the Internet, databases or methods. This opens up a further area of use for a thesaurus of this kind.
References:
Bock, H.-H., Lenski, W., Richter, M.M. (Eds.) (1993): Information Systems and Data Analysis, Proceedings of the 17th Annual Conference of the Gesellschaft für Klassifikation e.V., Springer, Berlin.
Brito, P., Diday, E. (1990): Pyramidal representation of symbolic objects, in: Schader, M. (Ed.), Knowledge, Data and Decisions, Springer, Hamburg.
Feyerabend, P. (1976): Wider den Methodenzwang, Suhrkamp, Frankfurt 1976, 296 f., p. 352 ff.
Herrschaft, L.: Zur Bestimmung eines medienspezifischen Informationsbegriffs, in: Zeitschrift für Informationswissenschaft und -praxis, Vol. 47, No. 3, p. 171 ff.
Kommission der Europäischen Gemeinschaft, KoEU (1990): "Richtlinien über den freien Zugang zu Informationen über die Umwelt", Brüssel 1990.
Bund/Länder-Arbeitsgemeinschaft Bodenschutz LABO (1994): Aufgaben und Funktionen von Kernsystemen des Bodeninformationssystems als Teil von Umweltinformationssystemen, Umweltministerium Baden-Württemberg, Stuttgart 1994.
Nedobity, W. (1989): Ordnungsstrukturen für Begriffskategorien, Studien zur Klassifikation, Bd. 19 (SK 19), Hrsg. Ges. f. Klassifikation e.V., Darmstadt 1989, p. 183 f.
Opitz, O., Lausen, B., Klar, R. (Eds.) (1992): Information and Classification, Springer, Berlin, New York.
Schilling, P. (1991): Variabler Thesaurus - eine Schlüsselfunktion für die zukünftige Informationsverarbeitung in einer Verwaltung, in: Konzeption und Einsatz von Umweltinformationssystemen, Brauer, W., im Auftrag der Gesellschaft für Informatik (GI).
Weihs, E. (1992): On the Client-Server Concept of Text Related Data, in: Cognitive Paradigms in Knowledge Organisation, Sarada Ranganathan Endowment for Library Science (Hrsg.), 452-459, Madras.
Weihs, E. (1993): An Approach to a Space Related Thesaurus, in: Information and Classification, O. Opitz, B. Lausen, R. Klar (Hrsg.), 469-476, Springer, Berlin, Heidelberg.
Weihs, E. (1993): Datenbanken als Grundlage von Umweltinformationssystemen, Tagungsunterlagen zur 17. Jahrestagung der Ges. für Klassifikation, Kaiserslautern.
Wenzlaff, B. (1991): Vielfalt der Informationsbegriffe, Nachrichten für Dokumentation 42, Heft 5/1991, 335-361, Weinheim.
Wersig, G. (1985): Thesaurus-Leitfaden: eine Einführung in das Thesaurus-Prinzip in Theorie und Praxis, DGD-Schriftenreihe Bd. 8, K.G. Saur, München.
Whorf, B. L. (1956): Language, Thought and Reality, MIT Press, p. 12.
Cluster Analysis of Associated Words Obtained
from a Free Response Test on Tokyo Bay
Shinsuke Suga¹, Ko Oi¹ and Sadaaki Miyamoto²
Summary: This study is concerned with data analysis of associated words obtained from
a free association test on Tokyo Bay. It is shown that cluster analysis of the words is an
effective method to find respondents' concerns about the bay. Word clusters are
considered to give structures of inseparable conceptions of the object of association. We
analyze data from two survey areas near Tokyo Bay, i.e., one in a residential area and the
other in a town where fisheries are primary industries. The data analysis shows that water pollution is an important concern of respondents in both areas. Associated words showing
various industries or development works are classified in some specific clusters. Further,
we find a word cluster which indicates that Tokyo Bay is closely related to the lives of
respondents engaged in fisheries.
1. Introduction
In data analysis concerning the investigation of environmental problems, a variety of data are used. The authors have been analyzing word data obtained from a
questionnaire survey to find the environmental awareness of local residents. In the
questionnaire survey respondents were asked to write down freely what they associated
with a given stimulus word or a phrase. It is considered that people have a wide variety of
conceptions about environmental problems. Thus, a questionnaire survey based on free association is more useful for getting satisfactory information about residents' concerns than the usual survey in which respondents answer questions from a given list of individual items. We
consider the classification of the words obtained from the free association test. To this aim, cluster analysis is applied to the associated words.
Applying the method of classification is useful for examining the awareness of local residents through the associated words in the following senses. First, discussing groups of classified words in the whole data is more practical than examining each word one by one.
Second, a cluster of associated words is considered to give the cognitive structure of the awareness, if an appropriate measure of similarity is used. When people ponder on their living conditions, for example, do they associate the individual words "convenience", "road", "quiet", etc. separately? On the contrary, as mentioned in Oi et al. (1986), they seem rather to recognize these items as a group of inseparable conceptions. A word cluster is considered to indicate some notion related to respondents' concerns.
The authors have analyzed word data obtained from various questionnaire surveys asking
respondents to write down about living condition (Oi et al. (1988)), water side in general
and Lake Kasumigaura (Suga et al. (1993)), and acoustic environment (Kondoh et al.
(1993)). In the present paper, we examine local residents' awareness or images of Tokyo
Bay. For data analysis the word data obtained from a free association test which was
carried out in some regions near the inland sea are used. In the survey, respondents were
asked to write down freely what they associated with a stimulus phrase "Tokyo Bay".
736
737
2. Questionnaire survey
The questionnaire survey based on a free association concerning this study was planned to
find how people around Tokyo Bay evaluate the nearby sea area. In the survey, three
stimuli, "sea", "Tokyo Bay", and "the new road across Tokyo Bay" were used for the free
association. Respondents were asked to write down their association items in questionnaire sheets for each stimulus. In the present study, we analyze the words
associated with "Tokyo Bay".
The survey was carried out in four areas. Two areas were in Kawasaki City in Kanagawa
Prefecture, and the other two in Kisarazu City in Chiba Prefecture. The two cities are on opposite sides of Tokyo Bay. Data obtained from the two areas nearer the bay in each city are used in this study. One is a residential area adjacent to a coastal industrial district in Kawasaki City, located about five kilometers from the edge of the bay. The other is a rural area facing the bay on the Kisarazu City side. We call the former Kawasaki and the latter Kisarazu.
The questionnaires were mailed to 667 people in Kawasaki and 550 in Kisarazu selected
from the residential map of each area by systematic sampling. The average recovery rate of the questionnaires was 41%. About 45% of respondents in Kawasaki were office workers. In Kisarazu about 60% of respondents were fisheries workers. In fact, fisheries
are important industries in Kisarazu. The period of the survey was from February to
March, 1993. The whole results concerning the survey are shown in Suga and Oi (1995).
Clearly, 0 ≤ s(xi, xj) ≤ 1. This measure shows that two words associated by more common respondents are more similar to each other.
In this study, we set N=10 and N=7 for the data of Kawasaki and that of Kisarazu,
respectively. Thus, 50 words and 53 words are analyzed for Kawasaki and Kisarazu,
respectively. A computer package PAB developed by Miyamoto (1984) is employed. The
method of average linkage between the merged groups is used.
Though other similarity measures may be used for the classification of words, the measure defined by equation (1) is an effective one for considering respondents' concerns, in the sense that the similarity between two words is measured based on common respondents' associations. The efficiency of our measure will be shown in sections 5 and 6.
Kawasaki                              Kisarazu
ferry 32                              water 27
fishing 30                            fish 24
old days 29                           fisheries 23
pollution 28                          pollution 21
sludge 28                             life 21
shell gathering 28                    nature 20
river 26                              shell gathering 20
Edo-mae* 25                           Mt. Fuji 19
industrial area 22                    waste water 16
goby 21                               tidal flat 16
Tokyo 15                              ferry 12
port 15                               human being 12
Japan 14                              flat fish 11
reclaimed land 14                     fisherman 11
landscape 14                          development 11
houseboat 14                          tasty 11
new road across Tokyo Bay 13          goby 10
short-necked clam 13                  Tokyo 9
Chiba 13                              wind 9
fishing boat 12                       Edo-mae* 9
                                      sludge 7
                                      fishing net 7
                                      red tide 7
* The meaning is described in the text.
the other hand, some clusters containing only a few words, say fewer than three, are also difficult to interpret.
We describe the classification procedure of the associated words using the results in Kawasaki. In Figure 1, if the whole data are classified at level 0.054, then eight clusters A1 to A8 are obtained. Two clusters, A2 and A3, are divided further because they contain too many words. Finally, 14 word clusters a1 to a14 are obtained. Clusters A4, A7, and A8 contain only one word. If such a cluster is formed at a low level, the word is considered to be associated independently of other words. In the same way, the whole data in Kisarazu are classified into 14 clusters b1 to b14, as shown in Figure 2.
5.2.3 Common and different awareness between the two survey areas
We can find clusters composed of associated words relating to the water pollution of Tokyo Bay in both survey areas. This indicates that water pollution is an important concern of respondents about the inland sea in the two areas. Two clusters, a12 in Kawasaki and b5 in Kisarazu, show respondents' concerns about the relation between nature and development. They also seem to show respondents' interest in the change of nature caused by various development works in Tokyo Bay. The associated words showing fisheries or individual names of marine products are classified into some specific clusters. Typical examples in each area are clusters a7 and b1.
In Kawasaki, clusters containing words showing various industries or development works are found. In particular, cluster a9, which contains the words concerning a coastal oil industry, is characteristic. As described in section 4, respondents in Kisarazu wrote various words related to fisheries and marine products from Tokyo Bay, and those words constitute some symbolic clusters. Among them, cluster b1 includes the words showing
[Figure 2 (dendrogram; similarity measure from 1.0 down to 0.0, clusters formed at level 0.052):
B1: b1 (laver, short-necked clam, shell gathering, flat fish, goby, trough shell, sudate, new road across Tokyo Bay, ferry); b2 (tidal flat, Edo-mae, sludge); b3 (old days, fish, shellfish); b4 (construction); b5 (abundance, change, nature, environment, reclamation, development); b6 (fisherman, fishing net, sea, life, fisheries)
B2: b7 (ship, fishing)
B3: b8 (dirty, water, small); b9 (factory, death, wind, oil, litter, red tide, culture of marine products); b10 (Mt. Fuji, clean, human being, Tokyo, waste water)
B4: b11 (marine products, tasty, pollution, place of fisheries)
B5: b12 (industrial area)
B6: b13 (place of work, leisure)
B7: b14 (work, place of life)]
Figure 2 A sketch of the dendrogram showing the result of cluster analysis of word data in Kisarazu
several marine products. The formation of cluster b6 indicates that fisheries are not only part of the industries in Tokyo Bay but are also closely related to the lives of the respondents themselves in Kisarazu.
7. Concluding remarks
A free response test is useful for directly examining residents' concerns about environmental problems. However, the analysis of such data is not as easy as that of data obtained by a usual survey in which respondents choose their answers from given items in a questionnaire. The method of classification we use in this study gives clear structures of the association with Tokyo Bay. Grasping such structures is important for discussing issues about the bay in the future, for example development and conservation. It is not easy to reveal such structures through data analysis based on a usual survey.
Acknowledgment: The authors would like to express their appreciation to the subjects for cooperating in the survey.
References:
Summary: The data of 696 deer-train accidents which occurred over a 330.95 km stretch in eastern Hokkaido, Japan, from April 1988 to March 1995 were statistically analyzed. Many of the accidents occurred at particular sites and night hours, which suggests a relation to the habitat and diel activity of deer. Relative densities of deer were estimated where the train runs were constant.
1. Background
One of the serious problems between humans and animals is deer-train collisions. These involve the breakdown of or damage to trains, hence disturbance of the train schedule, and the death or injury of deer. Although several studies have been published on deer-car accidents (Allen and McCullough (1976), Schafer and Penland (1985), Waring et al. (1991), Reeve and Anderson (1993)), no actual data on deer-train accidents had been studied. Recently the number of accidents between trains and the Sika deer (Cervus nippon yesoensis) has greatly increased from year to year in eastern Hokkaido, Japan.
We have produced a data set including a total of 696 cases of deer-train accidents from
April 1987 to March 1995 (8 years) based on the driver reports of the Kushiro Branch of
Hokkaido Railway Company. Determining the altitude and representative vegetation at 0.5 km intervals along the line, on the basis of 1/50,000 scale topographical maps by the National Geographical Survey Institute, Japan, and 1/50,000 scale actual vegetation maps (Environment Agency, 1988), we have created another data set consisting of the number of accidents per 0.5 km and the environmental conditions. We present the results of statistical analysis on the data sets and an estimation of the relative densities of deer.
2. Statistics
Fig. 1 shows that the number of accidents increased year by year except for 1991. Since the number of train runs per year was almost the same, this suggests an increase in the number of deer that crossed the railway track, and hence an increase in the deer population. Fig. 2 gives the hourly change in the number of accidents. Most of the accidents (79%) occurred between 16:00 and 23:00, when deer activity is high. Fig. 3 presents the number of accidents per 10 km from Kamiochiai to Nemuro stations on the Nemuro
1: This study was in part carried out under the ISM Cooperative Research Program (95-ISM-CRP-A58).
[Fig. 1: number of accidents per year, 1987-1994 (rising from roughly 20 to 160). Fig. 2: number of accidents per hour of day, 0-24 h. Fig. 3: number of accidents per 10 km along the line, 100-450 km.]
Line. Many accidents occurred between Kushiro and Nemuro stations, where hunters
reported that many deer lived.
Table 1 gives the number of cases classified by the number of deer found. From January to April the number of deer was large, and the mean ranged from 4.1 to 5.0. However, the number of deer which collided with trains ranged from 0 to 4 (0 means a near-miss), and among these the cases of 1 accounted for 94.1% (Table 2). This situation is similar in every month.
Tab. 2: Frequency distribution of the number of deer which collided with trains.

Month    0    1    2   3   4   unknown   Total   Mean number of deer when collided
Jan           63   6                      70      1.11
Feb      7    58   6   1                  73      1.14
Mar      5    68   5   1       4          83      1.09
Apr      4    25   2           2          33      1.21
May      2    31                          33      1.00
Jun           12                          14      1.23
Jul      2    34   3                      41      1.13
Aug      1    35                          37      1.00
Sep      4    45   2                      51      1.04
Oct      5    68   1   2                  77      1.04
Nov      2    100  5                      108     1.02
Dec           71   4   1                  76      1.05
Total    32   610  29  7   2   16         696    1.08
The distances at which drivers found deer ranged from 0 to 300 m. Fig. 4 shows box plots for the distances in the daytime (8:00-16:00) and at night (20:00-24:00). Drivers found deer farther in front of them in the daytime than at night. The mean distance at which drivers
[Fig. 4: box plots of the distances (m) at which drivers found deer in the daytime (n = 49) and at night (n = 248)]
found deer in the daytime is 89 m with a standard deviation of 53 m, but 56 m (SD = 41 m) at night. Namely, the distance in the daytime is on average 33 m longer than that at night.
Fig. 5 shows box plots for the distances at which drivers found deer under five weather conditions. The distances were shorter in rainy or misty conditions than in fine or cloudy ones. The difference in the means is about 18 m.
[Fig. 5: box plots of the distances (m) under five weather conditions; n = 419 (fine), 170 (cloudy), 43 (rainy), 19 (misty), 19 (snowy); means = 63, 63, 45, 46, 78 m]
[Fig. 6: scatter plot of train speed (30-100 km/h) against distance to stop (m); fitted line Y = -173.625 + 5.684 X, R² = 0.761]
Fig. 6 shows the relation between train speed and distance to stop. As expected, the greater the train speed, the greater the stopping distance.
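The fitted line reported in Fig. 6 is an ordinary least-squares regression; a minimal sketch of the computation, with hypothetical speed/distance pairs standing in for the driver reports, is:

    # Minimal sketch: least-squares line for distance-to-stop vs. train speed,
    # the relation summarized in Fig. 6 (Y = -173.625 + 5.684 X, R^2 = 0.761);
    # the data points below are hypothetical stand-ins.
    import numpy as np

    speed = np.array([40, 50, 60, 70, 80, 90], dtype=float)      # km/h
    stop = np.array([60, 110, 170, 220, 280, 340], dtype=float)  # m

    slope, intercept = np.polyfit(speed, stop, deg=1)
    r = np.corrcoef(speed, stop)[0, 1]
    print(f"Y = {intercept:.1f} + {slope:.2f} * X, R^2 = {r ** 2:.3f}")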
The number of accidents on the 8 train runs was used as relative densities of deer along the railway.
[Fig. 7: locations of accidents on the 8 train runs (Nos. 5638-5647), plotted by distance along the line, 310-450 km]
Fig. 7 shows the accidents that occurred on the 8 train runs, and Fig. 8 the relative densities per 2.5 km along the Nemuro Line between Kushiro and Nemuro stations. The relative density at the distances of 420 to 430 km was very high. This suggests that there may be three or four large deer populations along the line between the two stations.
Hayashi's quantification method (type I) analysis performed on the data set of 0.5 km distances (270 cases) has revealed that the relative density of deer between Kushiro and Nemuro stations is related to the vegetation type and altitude category along the Nemuro Line. Since the multiple correlation coefficient is 0.468, other factors, such as the deer's behavioral habits themselves, may also be related.
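Hayashi's quantification method (type I) is numerically equivalent to least-squares regression of a numerical response on dummy-coded categorical predictors. The sketch below illustrates this equivalence with hypothetical vegetation and altitude categories; it is not the authors' actual 270-case data set.

    # Hedged sketch: quantification method (type I) as dummy-coded regression
    # of accidents per 0.5 km on vegetation type and altitude category
    # (hypothetical data).
    import numpy as np

    vegetation = ["forest", "grass", "forest", "crop", "grass", "forest"]
    altitude = ["low", "low", "high", "low", "high", "high"]
    accidents = np.array([3.0, 5.0, 1.0, 0.0, 4.0, 2.0])

    def dummies(levels):
        cats = sorted(set(levels))
        # drop the first category to avoid collinearity with the intercept
        return np.array([[1.0 if x == c else 0.0 for c in cats[1:]]
                         for x in levels])

    X = np.hstack([np.ones((len(accidents), 1)),
                   dummies(vegetation), dummies(altitude)])
    beta, *_ = np.linalg.lstsq(X, accidents, rcond=None)

    fitted = X @ beta
    R = np.corrcoef(fitted, accidents)[0, 1]   # multiple correlation coefficient
    print(f"multiple correlation coefficient R = {R:.3f}")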
[Fig. 8: relative densities of deer per 2.5 km between Kushiro and Nemuro stations (scale 0-50)]
4. Acknowledgement
We would like to thank the Kushiro Branch of Hokkaido Railway Company for offering
the material.
References:
Allen, R. and McCullough, D. (1976): Deer-car accidents in southern Michigan. Journal of Wildlife Management, 40, 317-325.
Environment Agency. (1988): Actual vegetation map: the 3rd national survey on the natural
environment (vegetation). Japan Wildlife Research Center, Tokyo.
Reeve, A F. and Anderson, S. H. (1993): Ineffectiveness of Swareflex reflectors at reducing
deer-vehicle collisions. Wildlife Society Bulletin, 21, 127-132.
Schafer, J. A and Penland, S. T. (1985): Effectiveness of Swareflex reflectors in reducing deer-
vehicle accidents. Journal of Wildlife Management, 49, 775-776.
Waring, G. H., Griffis, J. L. and Vaughn, M. E. (1991): White-tailed deer roadside behavior, wildlife warning reflectors, and highway mortality. Applied Animal Behaviour Science, 29, 215-223.
Comparison of some numerical data between the belisama group
of the genus Delias Hübner (Insecta: Lepidoptera) from Bali
Island, Indonesia
Sadaharu Morinaka
The University of the Air
2-11 Wakaba, Mihama-ku, Chiba-ken, 261 Japan
1. Introduction
The description of biota is very important for various natural sciences because it is the basis of them: for instance taxonomy, phylogenetics, population genetics, ethology and evolutionary biology. Since Linné, huge numbers of descriptions have been carried out, but they are far from complete even now. Many more biota await description. As for descriptions of insects, we can see them in the journals of various learned societies, for instance the Entomological Society of Japan, the Biogeographical Society of Japan and the Lepidopterological Society of Japan. In them we can see non-numerical expressions, for instance "antennae red, upper side of forewing blue and shining". We can also see numerical expressions such as "forewing 34 mm on average", but they are not well used in studies.
P = G + E (P: phenotypic value, G: genetic value, E: environmental value)
This formula (Kimura 1960) is well known in quantitative genetics. We know that insects, for instance butterflies or moths, become small adults when their larvae are not fed enough or live in severe conditions. Thus numerical data, for instance the size of wings, are affected by various environmental factors. Therefore I consider that such numerical data are difficult to use for taxonomic studies.
Recently Lande (1976) and Lynch (1988) discussed genetic models of evolution but hardly used actual data from organisms. Komatsu (1996) reported that variation in the size of the genitalia was independent of that of the body in general, but did not show actual data in his paper. In the course of a taxonomic study I found that the relative size of the phallus (a part of the male genitalia) differed distinctly from those of other organs, and showed, using actual data from two very closely related taxa, that it was peculiar and independent of the variation in the sizes of other organs (Morinaka 1996). I also suggested that it has some important relation to speciation and biological evolution (Morinaka 1996). In this paper I show again, using actual data, that the phallus differs distinctly from other organs, and suggest that this has a very important meaning for speciation and biological evolution between two very closely related taxa.
2. Materials
2.1 The genus Delias (Insecta: Lepidoptera, Pieridae)
The genus Delias Hübner, 1819 is a big genus which has more than two hundred species and belongs to the Pieridae. The constituent species are distributed broadly from India to New Caledonia. Many of them inhabit South East Asia, and the genus is also divergent in the highlands of New Guinea Island. Their larvae eat mistletoe plants and the adults have colorful wing markings on the undersides of the wings. These are known as remarkable characters of the genus Delias. Talbot divided this big genus into twenty groups (Talbot, 1928-1937). I have been studying Group 17 (the belisama group). It has 15 species which are closely related to each other and distributed from Nepal to New Caledonia, including Australia.
2.2 Materials
Delias belisama balina Fruhstorfer, 1910 (population from Bali Island) and Delias oraia bratana Kalis, 1941 (population from Bali Island) are used. The distributions of Delias belisama and its relatives are shown in Fig. 1. D. belisama inhabits Sumatera, Jawa and Bali Island. On the other hand, D. oraia inhabits the Lesser Sunda Islands, for instance Lombok Island and Flores Island, and also Bali Island beyond Wallace's line. They are closely related but clearly different species, because they fly together in the mountains of Bali Island while maintaining their separate identities. Males of both species are shown in Fig. 2. Their wing markings are nearly identical and it is sometimes difficult to distinguish them from each other.
[Fig. 1: distributions of Delias belisama and Delias oraia, separated by Wallace's line]
Fig. 2 Male adults of both species (A: Delias belisama balina, B: D. oraia bratana; upper: upperside, lower: underside)
3. Method and results
3.1 Method
Fig. 3 Measured portions of male genitalia. A: Ring (anterior), B: Juxta (posterior), C: Dorsum (dorsal), D: Valva (inner), E: Phallus (lateral and dorsal)
3.2 Results
The results are shown in Table 1. Generally, the sizes of D. oraia are larger than those of D. belisama. But the actual data are considered to include environmental effects. It is considered that originally one species existed on Bali Island and the other species invaded there; it can therefore be imagined that the environment is not so suitable for the latter species. It can also be imagined that large individuals have large genitalia. Therefore correlation coefficients between the sizes of the genitalia and forewing length (= f.l.), and also regression equations, are required. They are shown in Fig. 4 and Table 2. The ratios of genital size to forewing length are also shown in Table 2. It is clear that D. oraia is larger than D. belisama in some genital sizes at the same f.l. Therefore it is concluded that the male genitalia of Delias oraia are larger than those of Delias belisama. But the values of the correlation coefficients differ remarkably: the values for the dorsum and valva are large, but the value for the phallus is markedly small.
Table 1. Sizes (mm) of male genitalia and comparison of both species (*: P<0.05, **: P<0.01 by t-test).

                     D. belisama balina (n=35)   D. oraia bratana (n=24)   Comparison
                     mean ± S.E.                 mean ± S.E.
Ring long diameter   2.901 ± 0.027               2.995 ± 0.048             belisama < oraia
[further rows of Table 1 illegible in the source]
[Fig. 4: Correlation of forewing length and juxta height (A) or dorsum length (B) in Delias belisama balina (○) and D. oraia bratana (×)]
[Table 2: Correlation coefficients between the sizes of male genitalia and forewing length, regression equations, and each ratio to forewing length for D. belisama balina (n=35) and D. oraia bratana (n=24) (*: P<0.05, **: P<0.01 by t-test); most entries are illegible in the source, but the correlation coefficients for phallus length are markedly smaller than those for the other organs]
4. Discussion
Ratios are sometimes used for description because they are considered comparatively constant, the effects of the environment being excluded. In this study it is concluded, using ratios, that the male genitalia of Delias oraia are larger than those of Delias belisama. On the other hand, I also found that the values of the correlation coefficients differ remarkably. Regression equations of forewing length and ratios (genital sizes / forewing length), and the correlation coefficients, are shown in Fig. 5. In this correlation chart, the X axis is the forewing length and the Y axis is the ratio of two genital sizes to forewing length. The lower regression lines show valva width and the upper lines show phallus length. In the lower graph the ratio is constant for all forewing lengths; in the upper graph, on the other hand, the ratio is not constant, decreasing along the X axis. What does this mean? If the environmental effects on the genitalia and the forewing length were equal, these regression lines would have to be parallel to the X axis, as in the lower graph. In other words, the ratio would have to be constant for all sizes of the forewing. Certainly the regression line of the valva is constant, but that of the phallus is not, the ratio for the phallus decreasing as forewing length increases. I consider this to mean that the phallus length is comparatively constant, independent of forewing length. In other words, the environmental effects on the phallus and on the forewing length are not equal. The sizes of these butterflies' phalli are rather constant and are affected less by the environment than forewing length. I consider that such a character, hardly affected by the environment, as the phallus is in this case, is most important for discussing speciation and biological evolution.
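The argument of Fig. 5 can be checked numerically by regressing the ratio (organ size / forewing length) on forewing length: a slope near zero means the organ scales with the wing, while a negative slope means the organ is comparatively constant. The sketch below uses hypothetical measurements, not the specimens of this study.

    # Hedged sketch of the Fig. 5 argument (hypothetical measurements).
    import numpy as np
    from scipy import stats

    fl = np.array([31.0, 33.0, 35.0, 37.0, 39.0, 41.0])  # forewing length, mm
    rng = np.random.default_rng(0)
    phallus = 2.9 + rng.normal(0.0, 0.05, fl.size)       # ~constant organ size
    valva = 0.08 * fl + rng.normal(0.0, 0.05, fl.size)   # organ tracking the wing

    for name, organ in [("phallus", phallus), ("valva", valva)]:
        res = stats.linregress(fl, organ / fl)
        # a negative slope: the ratio falls as the wing grows (constant organ)
        print(f"{name}: slope of ratio vs. f.l. = {res.slope:+.5f}")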
[Fig. 5: Correlation of forewing length (30-42 mm) and the ratios of genital size to forewing length (0.05-0.1) in D. belisama balina and D. oraia bratana; the lower points and regression lines show valva width / f.l., the upper ones phallus length / f.l.]
Acknowledgments
I greatly acknowledge Prof. Dr. S. Sakai, Daito Bunka University, Saitama (President of the Biogeographical Society of Japan), who recommended and encouraged my talk at this IFCS-96 Conference. I wish to express my gratitude to Dr. H. Mohri, Director General of the National Institute for Basic Biology, Okazaki, and Prof. Dr. T. Nakazawa, The University of the Air, Chiba, for their kind support and encouragement of my study. I also express my hearty thanks to Dr. N. Minaka, National Institute of Agro-Environmental Sciences, Tsukuba, who gave me much helpful advice and critically read and corrected the manuscript. I express my cordial thanks to Mr. S. Sugi, Tokyo, who also critically read and corrected the manuscript.
References:
1. Introduction
Biological sequences are composed of strings of letters belonging to a finite-size alphabet. In the case of protein sequences, the alphabet A comprises 20 letters, each one representing one amino acid:

A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (1)
Similarity between amino acids is one of the important factors to be considered while computing the similarity between sequences. Several researchers have focussed their attention on this problem and put forward similarity matrices (see, for example, Dayhoff et al. (1983), George et al. (1990), Risler et al. (1988), Lerman et al. (1994b)). The "profile matrix" considered by Gribscov (1987) is a particular case of the standardized similarity matrix between letters used in the second approach of classification proposed by Lerman et al. (1994a). The first approach of classification
by-site comparison of sequences which is well adapted for aligned sequences. However,
the high variation in sequence length renders the overall comparison of sequences very
difficult and the site-by-site comparison altogether impossible in case of no prealign-
ment. The "significant windows" approach for classifying unaligned sequences may be summarized as follows: A fixed-size window is made to slide along the sequences to be compared. Each window (i.e. subsequence delimited by each window position) of the shorter sequence is compared with the set of all windows of the longer one, and the "most significant window" - with respect to a proper null hypothesis of independence - is selected by means of beam search. A similarity index is then defined as a function of the similarities resulting from the site-wise comparison with the most significant window. This index depends on some parameters, such as window size and window significance level, whose values need to be chosen by the users, and also on some other parameters, such as the percentage of homologous sites, that may be
estimated from the data. Finally, the similarity index is standardized with respect
to its observed distribution over the set of sequence pairs, and hierarchical clustering using the LLA program (Lerman et al. (1993)) is performed.
Let the two sequences to be compared be a1, a2, ..., aL1 and b1, b2, ..., bL2.
Let D = ((Dij)), 1 ≤ i, j ≤ 20, be a matrix of similarity scores associated with all possible letter pairs over the 20-letter alphabet A (such as, for instance, Dayhoff's mutation data matrix (PAM 250)), where Dij is the score of the letter pair (i, j), for (i, j) ∈ A × A. Let M = ((Mij)), 1 ≤ i, j ≤ 20, be a "match matrix" where Mij takes the value 1 if the ith letter "matches" the jth letter and 0 otherwise. We may build the match matrix in several ways and, in particular, by considering a classification of the set of letters. A window of fixed size l is made to slide along each of the sequences compared. Let us denote by wi1 the ith window of the first sequence (i.e. the subsequence ai, ai+1, ..., ai+l-1) and by wj2 the jth window of the second sequence (i.e. the subsequence bj, bj+1, ..., bj+l-1). The similarity score S(wi1, wj2) of a window pair (wi1, wj2), for 1 ≤ i ≤ (L1 - l + 1) and 1 ≤ j ≤ (L2 - l + 1), is defined as the sum of the similarity scores of the corresponding letter pairs, as given by the matrix D, i.e.
S(wi1, wj2) = Σ(k=0..l-1) D(ai+k, bj+k) (2)
As a matter of fact, we will rather consider the standardized scoring matrix DS instead of D (see Lerman et al. (1994a)).
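A minimal sketch of the window score (2), with a toy two-letter score matrix standing in for a standardized matrix such as DS:

    # Minimal sketch of equation (2): the score of a window pair is the sum
    # of the letter-pair scores D(a, b) over the l aligned sites (0-based i, j).
    def window_score(seq1, seq2, i, j, l, D):
        return sum(D[(seq1[i + k], seq2[j + k])] for k in range(l))

    # Toy score matrix over a two-letter alphabet (hypothetical values).
    D = {("A", "A"): 2, ("A", "C"): -1, ("C", "A"): -1, ("C", "C"): 2}
    print(window_score("ACCA", "ACAC", 0, 0, 3, D))   # 2 + 2 - 1 = 3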
2.1 Significant window pairs
Consider the window pair (wi1, wj2). The total number m of letter matches occurring in this pair is

m = Σ(k=0..l-1) M(ai+k, bj+k) (3)
Under the null hypothesis that the letters in both windows are randomly and independently distributed, the random variable associated with the number m is distributed as a Binomial(l, p) variable, where p is the probability of one match occurring in the sequence pair. The parameter p may be estimated from the observed frequencies of letters in the sequences as follows:

p = Σ(i,j) Mij freq(i, seq1) freq(j, seq2) (4)

where (i, j) ranges over the set of all letter pairs (ai, bj), freq(i, seq1) is the relative frequency of the letter ai in sequence 1 and freq(j, seq2) is the relative frequency of the letter bj in sequence 2. The window pair (wi1, wj2) is said to be significant or significantly
comparable at level α if mo > U, where mo is the observed number of matches and U is an integer determined such that

Pr{Binomial(l, p) > U} ≤ α (5)
2. Compare - using the beam search technique - each window inside the pertinent area of the shorter sequence (i.e. sequence 1) with all the windows inside the pertinent area of the longer one (i.e. sequence 2).
3. Rule out the window pairs that are not significantly comparable by means of the binomial test described in section 2.1 (a sketch of this test follows the list).
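A minimal sketch of that binomial test, assuming scipy for the binomial quantile (the threshold U is the smallest integer with P(m > U) ≤ α):

    # Hedged sketch of the test in section 2.1: under independence the match
    # count of a window pair is Binomial(l, p); the pair is kept only when the
    # observed count m_o exceeds the threshold U.
    from scipy.stats import binom

    def significant(m_o, l, p, alpha=0.05):
        # smallest integer U with P(Binomial(l, p) > U) <= alpha
        U = int(binom.ppf(1.0 - alpha, l, p))
        return m_o > U

    # p would be estimated from letter frequencies as in (4); toy value here.
    print(significant(m_o=7, l=10, p=0.3))   # True: 7 matches of 10 is rare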
The significantly comparable window pairs alone will contribute to the overall similarity between sequences. For each window of sequence 1, the best score associated with a window of sequence 2 (among those selected as above) is considered, and a "rough" similarity index between two given sequences is obtained by averaging it over all chosen window pairs. The standardized similarity index between sequences is then obtained by standardizing the "rough" index, first with respect to its empirical distribution over the set of window pairs and finally over the set of all sequence pairs. The matrix of probabilistic similarity indices required by LLA is obtained by applying the standard normal cumulative distribution function, as in the case of prealigned sequences.
2.3 Aggregation criterion used in LLA method
The basic data required by the LLA method of hierarchical classification is the matrix of probabilistic similarity indices between the sequence pairs. To fix ideas, let us consider the set O of all sequences, and the probabilistic similarity between the sequences o1 and o2 given by the equation

P(o1, o2) = Φ(Qs(o1, o2)), (o1, o2) ∈ P2(O) (6)

where Φ, Qs and P2(O) denote respectively the standard normal cumulative distribution function, the standardized similarity index between sequences (defined in section 2.2) and the set of all possible sequence pairs. The algorithm builds a classification tree iteratively, by joining together at each step the two (or more in case of ties) most similar sequences or classes of sequences until all clusters are merged together. Thus the aggregation criterion that is maximized at each step or "level" of the algorithm
is expressed as a similarity measure between two clusters. Suppose that C and D are any two arbitrary disjoint subsets (or clusters) of O comprising respectively r and s elements. Then a family of criteria of the "maximal link likelihood" type is defined by the following measure of similarity between C and D:

LLγ(C, D) = [max{P(c, d) : (c, d) ∈ C × D}]^((r·s)^γ), 0 ≤ γ ≤ 1 (7)

In the case of our data sets, γ = 0.5 was found to yield the best results.
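A minimal sketch of criterion (7), with a hypothetical table of probabilistic indices P:

    # Minimal sketch of equation (7): the similarity of clusters C and D is the
    # maximal pairwise probabilistic index raised to the power (r*s)**gamma.
    from itertools import product

    def LL(C, D, P, gamma=0.5):
        r, s = len(C), len(D)
        best = max(P[frozenset(pair)] for pair in product(C, D))
        return best ** ((r * s) ** gamma)

    # Hypothetical probabilistic similarity indices between four sequences.
    P = {frozenset({"o1", "o3"}): 0.9, frozenset({"o1", "o4"}): 0.4,
         frozenset({"o2", "o3"}): 0.7, frozenset({"o2", "o4"}): 0.5}
    print(LL({"o1", "o2"}, {"o3", "o4"}, P))   # 0.9 ** (4 ** 0.5) = 0.81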
3. Applications
The experiments on the protein sequences belonging to the cytochrome and globin families were carried out using different amino acid similarity matrices (e.g. Dayhoff et al. (1983), Risler et al. (1988), among others) and different values of the parameters, such as window size, level of significance for window comparison and percentage of homology. The hierarchical classification method based on the LLA approach was used. Similar experiments on the aa-tRNA ligase (also known as aminoacyl-tRNA synthetase) family of sequences were also conducted. The results were found to be rather unaffected by most of the above parameters, whereas the choice of the relevant sequence areas retained for comparison proved to be of utmost importance.
3.1 Sequences from cytochrome and globin families
A set of 89 sequences belonging to the cytochrome family was classified using the similarity index described in section 2 and the LLA method of hierarchical classification. The most significant level was found to be the 86th, where three main classes may be distinguished (see figure 1). Two of them group together the bacterial cytochromes and the third one is split into two subclasses corresponding respectively to plant and
[Figure 4: Classification of class I aa-tRNA ligases from E. coli with Risler's matrix, window size = 9, sequence area selection by maximum predictability classification; leaves include ValRS(EC), IleRS(EC), GlnRS(EC), GluRS(EC), TrpRS(EC) and TyrRS(EC)]
animal families. As for the globin family, a set of 42 sequences was classified, and figure 2 displays the corresponding classification tree. It may be observed that at the 35th level of the tree - which is most significant according to Lerman's global statistic (Lerman et al. 1993) - 7 classes are clearly visible. Two of them are characterized by the vertebrates (V), the others being characterized by bivalves (B), plants (P), annelids (An), gastropods (G) and arthropods (Ar). All arthropods but artemia are very clearly separated from the rest of the species; artemia is misclassified among the vertebrates. Similarly, excepting glycera, all the annelids are nicely put together, and the bacterial hemoglobin ggzlb, which is very "neutral", is associated with the gastropods. It may also be noticed that the two classes of vertebrates are not joined quickly enough. Notwithstanding the above remarks, the results are globally quite satisfactory and are comparable to those produced by the best methods available.
3.2 Sequences from the aa-tRNA ligase family
It is well known, from the biological standpoint, that the ligases may be considered as belonging to one of two groups (see Eriani et al. (1990), Landes et al. (1995)). Class I, comprising the sequences that recognize the amino acids Met, Ile, Leu, Val, Cys, Arg, Gln, Glu, Tyr and Trp, seems to be the most homogeneous; three subgroups may be distinguished therein: {Met, Ile, Leu, Val, Cys, Arg}, {Gln, Glu} and {Tyr, Trp}. The second group, corresponding to the amino acids Ser, Pro, Thr, Asp, Lys, His, Ala, Gly and Phe, is the least structured. The aim of our experiment was to validate our method by applying it to a test data set and producing results that are as close as possible to the biological knowledge to date. The first test data set was made up of 65 sequences of aa-tRNA ligases belonging to various species and was particularly hard to classify, due to the very high variation in the length of different ligases for the same species on the one hand, and in the length of the same ligase for different species on the other. For instance, for E. coli the length of TrpRS was 334 and that of ValRS was 951, whereas the length of GlnRS was 554 for E. coli and 809 for Saccharomyces cerevisiae. It was found that the selection of suitable sequence areas using the "maximum predictability classification" (Lebbe and Vignes (1993), Lebbe and Vignes (1992)) was particularly useful in this case. Figure 3 illustrates the hierarchical classification tree obtained by this method. The results are not fully satisfactory in that they do not agree perfectly with the biological classification given above. This is in fact explainable by a strong inter-species variation. The second test data set, containing the sequences of class I aa-tRNA ligases all belonging to a single species, namely E. coli, yielded much better results. Figure 4 clearly shows the three subgroups known to the biologists.
4. Conclusion and further research
A hierarchical classification method based on the significant windows approach for classifying unaligned biological sequences has been presented, and the results of the experiments with several not-easy-to-classify data sets have been described. On the whole, the results are close to the present knowledge of the phylogeny of the corresponding species. In the case of the aa-tRNA ligases, it was shown that the quality of the results depends on the consideration of the biologically pertinent sequence areas as well as on the degree of inter-species variation. A further direction of this research would be to improve the sensitivity of our method by a refinement of the clustering strategy on the one hand, and to reduce the need for parameter tuning (window size, similarity matrix, etc.) on the other.
References
Abe, K., Gita, N. (1982): Distances between strings of symbols: Review and remarks. ICPR6, Munich.
Barker, W.C., Hunt, L., George, D. (1988): Protein Seq. Data Anal., 1, 363.
Dayhoff, M.O., Barker, W.C., Hunt, L.T. (1983): Methods Enzymol., 91, 524-545.
Dickerson, R., Geis, I. (1983): Hemoglobin, Benjamin/Cummings, Menlo Park, CA.
Eriani, G., Delarue, M., Poch, O., Gangloff, J., Moras, D. (1990): Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequence motifs. Nature, 347, 203-206.
George, D., Barker, W., Hunt, L. (1990): Mutation Data Matrix and Its Uses. Methods Enzymol., 183, 313-330.
Gribscov, M., McLachlan, A., Eisenberg, D. (1987): Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci., 84, 4355-4358.
Landes, C., Henault, A., Risler, J.L. (1992): A comparison of several similarity indices used in the classification of protein sequences: a multivariate analysis. Nucleic Acids Research, 20, 3631-3637.
Landes, C., Perona, J.J., Brunie, S., Rould, M.A., Zelwer, C., Steitz, T.A., Risler, J.L. (1995): A structure-based multiple sequence alignment of all class I aminoacyl-tRNA synthetases. Biochimie, 77, 194-203.
Lebbe, J., Vignes, R. (1992): Sélection d'un sous-ensemble de descripteurs maximalement discriminant. Troisièmes journées Symbolique-Numérique, Université de Paris-Dauphine.
Lebbe, J., Vignes, R. (1993): Local predictability in biological sequences, algorithms and application. Biochimie, 75, 371-378.
Lerman, I.C., Peter, Ph., Leredde, H. (1993): Principes et calculs de la méthode implantée dans le programme CHAVL (Classification Hiérarchique par Analyse de la Vraisemblance des Liens). La revue de Modulad, Numéro 12, Déc. 93.
Lerman, I.C., Nicolas, J., Tallur, B., Peter, Ph. (1994a): Classification of aligned biological sequences. New Approaches in Classification and Data Analysis, Springer Verlag, Berlin.
Lerman, I.C., Peter, Ph., Risler, J.L. (1994b): Matrices AVL pour la classification et l'alignement de séquences protéiques. Publication IRISA No 866, IRISA, Rennes, France.
Risler, J.L., Delorme, M.O., Delacroix, H., Henault, A. (1988): Amino acid substitutions in structurally related proteins: a pattern recognition approach. Determination of a new and efficient scoring matrix. Journal of Molecular Biology, 204, 1019-1029.
An Approach to Determine the Necessity of
Orthognathic Osteotomy or Orthodontic Treatment
in a Cleft Individual
-comparison of craniomaxillo-facial structures in borderline
cases by roentgenocephalometrics-
1. Introduction
Cleft patients have severe dental problems related to their abnormal facial structures, disturbed facial growth patterns and tooth anomalies; therefore their habilitation is needed from childhood to adulthood. Early orthodontic treatment is often indicated in order to change an unfavorable growth pattern and to correct abnormal oral functions such as speech, mastication and swallowing. A majority of the treatment objectives can be achieved in some cases through orthodontic treatment alone, while for others surgical treatment must be applied in the long run. In adulthood, a combined approach between orthodontics and surgery, such as orthognathic osteotomy, would be the best way for patients with severe maxillo-mandibular three-dimensional disharmony which cannot be treated by orthodontic treatment alone. However, the selection of orthodontic or surgical orthodontic treatment remains subjective in nature, which in borderline cases often results in forced, long, continuous orthodontic treatment of what is in fact a surgical case. Besides, the decision for or against surgery is multifactorial, related not only to the maxillo-mandibular relationship but also to occlusion, the soft tissue profile and the consent of the patient to the surgery. If we could judge the treatment plan for a surgical case earlier, before the patient reaches maturity, we could avoid forcing long-term treatment of growth control which must finally be useless at the time of surgery. The earlier, the better.
This study was designed to investigate cephalometrically, on a longitudinal basis, the possibility of growth prediction for surgical-orthodontic treatment in cleft patients at as early a growth stage as possible, with the aid of cephalometric analysis (Fig. 1).
[Fig. 1: schema of the study design: borderline cases lie between the Non-OPE case and the OPE case; cephalometric reference points include S, Po and Cd]
3. Results
3.1 Descriptive statistics for the differences between the OPE and Non-OPE groups
Table 1 summarizes the results of the nonparametric statistical comparison of the differences between the OPE and Non-OPE groups. At stage A in males, the gonial angle (Ar-Go-Me) was significantly larger in the OPE group (p<0.001), and the mandibular ramus inclination angle (FH-ramus plane) was significantly smaller in the OPE group (p<0.05). In females, the SNB angle, which represents the anterior limit of the mandibular basal arch in relation to the anterior cranial base, was significantly larger in the OPE group (p<0.01); the mandibular plane angle and the Y-axis angle were significantly smaller in the OPE group (p<0.01).
At stage B in males, the mandibular plane angle and the gonial angle were larger in the OPE group (p<0.05, p<0.001); the ramus inclination angle, the posterior position of the maxilla relative to the cranial base (S'-Ptm') and the mandibular ramus length (Cd-Go) were significantly smaller
in the OPE group. In females, both the SNB and Y-axis angles showed significant differences, as at stage A (p<0.01, p<0.05). The linear measurements for the assessment of the mandibular ramus and the total length of the mandible (Gn-Cd) were significantly larger in the OPE group (p<0.01, p<0.05).
At stage C in males, both the SNB and gonial angles were significantly larger in the OPE group (p<0.05, p<0.001); the ramus inclination and ramus length were significantly smaller in the OPE group (p<0.05). In females, the SNB and Y-axis angles showed significant differences between the OPE and Non-OPE groups (p<0.05, p<0.01); the mandibular total length and ramus length were significantly larger in the OPE group (p<0.05).
Table 1: Measurements showing significant differences between the OPE and Non-OPE groups.

             male                               female
stage A      Gonial    134.1   125.9   ***      SNB       78.9    75.8    **
             Ramus     79.1    82.3    *        Mand P    30.7    35.2    **
                                                Y-axis    63.2    66.2    **
stage B      Mand P    33.9    30.5    *        SNB       78.1    75.0    **
             Gonial    132.2   125.3   ***      Y-axis    64.1    67.3    *
             Ramus     81.7    85.2    *        Ramus     81.7    85.2    *
             S'-Ptm'   17.2    19.8    **       Gn-Cd     107.9   103.0   **
             Cd-Go     51.2    55.4    **       Cd-Go     51.0    48.2    *
stage C      SNB       76.0    72.8    *        SNB       78.7    74.9    *
             Gonial    130.8   124.6   ***      Y-axis    63.8    67.6    **
             Ramus     81.8    85.8    *        Gn-Cd     116.0   110.1   *
             Cd-Go     ...     61.8    *        Cd-Go     55.5    51.6    *
***: p < 0.1%, **: p < 1% and *: p < 5% by Mann-Whitney test
             male                   female
stage A      77.8                   76.7
stage B      83.3                   74.1
stage C      83.8                   78.6
Predictive   Gonial angle           Y-axis
variables    Ramus inclination      SNB
             Cd-Go                  Ramus inclination
                                    Cd-Gn
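The group comparisons in Table 1 use the Mann-Whitney U test; a minimal sketch of one such comparison, with hypothetical gonial-angle values rather than the study's measurements, is:

    # Hedged sketch of one comparison from Table 1: Mann-Whitney U test between
    # OPE and Non-OPE gonial angles (hypothetical values).
    from scipy.stats import mannwhitneyu

    gonial_ope = [134.0, 133.5, 135.2, 132.8, 134.9, 133.1]
    gonial_non = [126.1, 125.4, 124.8, 126.7, 125.2, 126.0]

    stat, p = mannwhitneyu(gonial_ope, gonial_non, alternative="two-sided")
    print(f"U = {stat}, p = {p:.4f}")   # small p: gonial angle larger in OPE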
4. Discussion
Early treatment of cleft patients' malocclusion has generally been recommended by many authors for its favorable results on growth and the occlusal relationship. However, some disadvantages that might be encountered during early dentition treatment are: (1) early treatment may not always bring easier and better results; (2) the patient's cooperation may deteriorate because of long periods of active treatment; and (3) family finances may also have an influence on the length and timing of treatment. If prediction of the maxillo-facial relationship at the time growth is completed could be made at an early growth stage, such as childhood, the patient could be spared wasted treatment and the various sufferings that come with it. There are many reports about the criteria for deciding whether skeletal Class III patients should receive orthodontic treatment or surgical orthodontic treatment. However, these subjects are almost all adults; there are few reports on growing young people, especially children. The main reason may be that it is very hard to predict skeletal changes due to growth and orthodontic treatment at the early growth stage.
On the other hand, the number of variables expressing the morphological differences between the two groups increased with rising bone maturity. This suggests that growth prediction becomes easier with age. At stage C, at which growth is almost completed, it seems easier to make a diagnosis and treatment plan (whether the case can be finished by orthodontic treatment alone or must be combined with orthognathic osteotomy), because growth has stopped. In the present study a correct methodological approach for the evaluation of borderline cases with malocclusions was therefore initiated.
The cephalometric analysis we applied was suitable for a geometric evaluation of the maxillo-facial components. Several significant differences in craniofacial skeletal structures were found between the OPE group and the Non-OPE group. The following results were obtained.
1. There were significant morphological differences of the dentofacial complex, especially in the mandible, between the OPE and Non-OPE groups in both sexes. No significant differences in the maxillary components were noted between the groups.
2. The differences became clearer with growth.
3. The male OPE group samples were characterized by a short ramus in an anterior position with a wide gonial angle, while females showed a dominant anterior growth direction of the mandible and its larger size (Fig. 4).
4. The morphological information from the mandible was available for the determination of future surgical cases.
[Fig. 4: schematic of the characteristic mandibular measurements: ramus inclination, Cd-Go and SNB]
Comparing these data with the standard data gained from subjects with normal occlusion by Iizuka and Ishikawa, the male gonial angle and ramus inclination were larger than the standard. On the other hand, the female Y-axis and ramus inclination were almost the same as the standard, except for SNB, which was larger by 2 degrees. As the samples differ substantially in size and in the criteria of growth stage, the study and standard samples cannot easily be compared here.
In our opinion, a fundamental question arises from the obtained data. What were the reasons for the morphological differences in the orthognathic surgical cases between males and females? Moreover, why were the differences found almost entirely in the mandibular components and not in the maxillary components? There may be some important factors to be considered. One is the size of the sample in this study. The borderline cases were selected on the basis of the severity of malocclusion with anterior and lateral crossbite. There are data which indicate the similarity of malocclusion in the inter-maxillary relationship, in terms of the ANB angle and SNA angle, which show no significant difference between the groups. In fact, there may be several morphological patterns among cleft patients. The male operation group shows a downward rotation of the mandible, while the female cases show a typical skeletal Class III with relative overgrowth of the mandible. In the former it is very difficult to correct the anterior crossbite by backward rotation of the mandible, since further rotation of the mandible results in a long face and shallow overbite. In the latter, on the other hand, the overgrowth of the mandible is not easily controlled, because of its size, even if growth control is begun at an early growth stage. Treatment planning could be the other main cause influencing the results, since the decision on treatment planning may depend on several factors: the inter-maxillary relationship such as the ANB angle, the soft tissue profile from the point of view of aesthetics, tooth movement in the orthodontic treatment, and the consent to surgery of the patient and parents. Therefore it could be said that the borderline case is multifactorial. That may be settled by collecting more samples of borderline cases, which should be separated and
5. Concluding remarks
There were significant morphological differences in the dentofacial complex, especially in the mandible, between the OPE and Non-OPE groups in both sexes. The differences became more pronounced with growth. The male OPE group was characterized by a short, anteriorly positioned ramus with a wide gonial angle, while the female OPE group showed a dominant anterior growth direction of the mandible and a relatively large mandible. The morphological information from the mandible was useful for identifying future surgical cases. The previous study suggested that there are some parameters that effectively discriminate between surgical and non-surgical treatment for a cleft individual, and the possibility of predicting surgical cases at an early growth stage was indicated. Because of the small sample, we could not draw firm conclusions; we therefore intend to collect more varied cases and examine them in detail. Although growth prediction remains very difficult, we will pursue it further in the future.
References:
Battagel, J.M. (1994): The identification of Class III malocclusions by discriminant analysis, European Journal of Orthodontics, 16, 71-80.
Cassidy, D.W. et al. (1993): A comparison of surgery and orthodontics in "borderline" adults with Class II, Division 1 malocclusions, American Journal of Orthodontics and Dentofacial Orthopedics, 104, 455-470.
Downs, W.B. (1948): Variation in facial relationships: their significance in treatment and prognosis, American Journal of Orthodontics, 34, 812-840.
Iizuka, T. and Ishikawa, F. (1950): Points and landmarks in head plates, The Journal of Japan Orthodontic Society, 16, 66-75.
Sinclair, P.M. et al. (1993): Combined surgical and orthodontic treatment, in Contemporary Orthodontics, 2nd ed., Proffit, W.R. (ed.), 607-631, Mosby Year Book, St. Louis.
Tollaro, I., Baccetti, T. and Franchi, L. (1996): Craniofacial changes induced by early functional treatment of Class III malocclusion, American Journal of Orthodontics and Dentofacial Orthopedics, 109, 310-318.
Valko, R.M. (1968): Indications for selecting surgical or orthodontic correction of mandibular protrusion, Journal of Oral Surgery, 26, 230-238.
(†) This study was carried out in part under the ISM Cooperative Research Program (95-ISM-CRP-A54).
Data Analysis for Quality of Life and Personality
Kazue Yamaoka¹ and Mariko Watanabe²
1. Introduction
In the last decade, interest has increased in quality of life (QOL) measures in four broad health contexts, such as measuring the health of populations, assessing the benefits of alternative uses of resources, and making decisions on the treatment of individual patients. Each context requires an assessment of the impact of ill health on aspects of the everyday life of the individual. Increased attention is now being given to the importance of QOL, and methods of QOL measurement have accordingly been developed (Cox et al., 1992). Yamaoka et al. (1994) developed the QOL20 questionnaire for the measurement of the non-disease-specific QOL of a patient.
In the present study, we focused on data analysis for problems related to QOL measurement and its association with personality type; special attention was given to the classification of subjective attitudes measured by a questionnaire survey on the basis of structure analysis.
2.2 Questionnaire
QOL20
In general, QOL measures are classified into three types. The first type is a performance score, e.g. an activity score; this type of measure is evaluated by a third party, and the Karnofsky Performance Status Scale (Karnofsky, 1949) and the WHO Performance Status Scale (WHO, 1979) are established examples. The second type of evaluation employs objective data such as clinical findings; for this, nutritional parameters, the duration of inpatient/outpatient care, and so on have been used. The third type is subjective evaluation by the patient. Although there is still no standard method for evaluating QOL, many trials have been carried out, and the FLIC (Schipper et al., 1985) and the EORTC questionnaire (Aaronson et al., 1988) were proposed for QOL measurement in the West. Furthermore, although a large number of studies have been performed on QOL questionnaires, little attention has been given to developing a non-disease-specific QOL questionnaire for Japanese subjects. Kobayashi et al. (1994) summarized the problems in the investigation of QOL.
For the measurement of subjective QOL of patients, we developed the QOL20, which
belongs to the third type and consists of 20 questions related to psychological,
physiological, and environmental factors.
The QOL20 was constructed under two working hypotheses: 1) the QOL of a patient is similar to that of a healthy person, and 2) QOL includes two main factors, i.e., "state of disease" (D) and "attitude toward disease" (F), and the QOL of a patient changes with changes in these factors. The scoring method was developed on the basis of the structure of the questionnaire. Because the structure of QOL was recognized to be uni-dimensional, an additive scale, that is, the sum of the numbers of responses, was used for the measurement of QOL. The symmetry of the positive and negative scores was not guaranteed; that is, the distances in the configuration of the items were not equivalent among items. In such a case, we thought it better not to use a Likert scale but to calculate both a positive score (QTP) and a negative score (QTN).
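To make the scoring rule concrete, the following is a minimal sketch of the QTP/QTN computation, assuming each of the 20 items is coded +1 for a positive response, -1 for a negative response, and 0 otherwise. This item coding is an assumption on our part; the text above specifies only that positive and negative counts are tallied separately rather than combined into a single Likert-type score.

```python
# Minimal sketch of the QOL20 scoring described above. Assumed coding:
# +1 = positive response, -1 = negative response, 0 = neutral/no answer.
import numpy as np

def qol20_scores(responses):
    """Return (QTP, QTN): counts of positive and negative responses."""
    r = np.asarray(responses)
    qtp = int(np.sum(r == 1))    # positive score: number of +1 answers
    qtn = int(np.sum(r == -1))   # negative score: number of -1 answers
    return qtp, qtn

# Example: one respondent's 20 answers (hypothetical data).
answers = [1, 1, 0, -1, 1, 0, 1, -1, 1, 1, 0, 1, -1, 1, 0, 1, 1, -1, 0, 1]
print(qol20_scores(answers))     # -> (11, 4)
```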
2.3 Subjects
Using the above two questionnaires, the survey was conducted on the parents of 145
students. A hundred and twenty-five males and 145 females responded to the
questionnaires. We used the subjects who responded to both QOL20 and EPQ, and
whose ages were between 40 to 65 years old. Thus the subjects used for the analysis
3. Results
3.1 Structure of QOL20
The structure of QOL20 was examined using the QIII method (Hayashi's Quantification Method III), and a scattergram was plotted of the category values corresponding to the maximum latent root against those corresponding to the second maximum latent root. Although the items varied somewhat, a cup-shaped curve, as for a uni-dimensional Guttman scale, was reconfirmed (see Yamaoka et al., 1994). Therefore, the scoring method described above was recognized as valid.
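For readers unfamiliar with the QIII method, it is formally equivalent to correspondence analysis (dual scaling) of the subjects-by-items indicator matrix, so the category values can be obtained from a singular value decomposition. The sketch below, run on fabricated 0/1 data, computes the item coordinates on the first two nontrivial axes; these are the quantities plotted in the scattergram described above.

```python
# Sketch of a QIII-style structure analysis via correspondence analysis.
# The 0/1 response matrix below is fabricated; with genuinely
# uni-dimensional items the first-vs-second-axis plot shows the
# cup-shaped (horseshoe) curve of a Guttman scale.
import numpy as np

rng = np.random.default_rng(1)
F = (rng.random((50, 20)) < 0.5).astype(float)       # 50 subjects x 20 items

P = F / F.sum()                  # correspondence matrix
r = P.sum(axis=1)                # row (subject) masses
c = P.sum(axis=0)                # column (item) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Item (category) values on the first two nontrivial axes.
item_scores = (Vt[:2] / np.sqrt(c)).T * sv[:2]
print(item_scores[:5])           # coordinates of the first five items
```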
Table 1. Spearman rank correlation coefficients between QOL scores and EPQ dimensions (males: n = 120; females: n = 128). [Numerical entries not recoverable from the extracted text.]
Table 2. Personality types (tolerable, intolerable, other) by EPQ and QOL20 scores, with χ² values. Males (n = 120): tolerable 17, intolerable 16, other 78. Females (n = 128): tolerable 18, intolerable 23, other 87. [Cell values not recoverable from the extracted text.]
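The following hedged sketch shows how the statistics behind Tables 1 and 2 can be computed: a Spearman rank correlation between a QOL score and an EPQ dimension, and a chi-square test of QOL level across the three personality types. All data and layouts are invented for illustration, since the published cell values could not be recovered from the extracted text.

```python
# Sketch of the Table 1 / Table 2 statistics on fabricated data.
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

rng = np.random.default_rng(2)
n = 120                                  # e.g. the male subsample
qtp = rng.integers(0, 15, n)             # hypothetical positive QOL scores
extraversion = rng.normal(12.0, 4.0, n)  # one hypothetical EPQ dimension

rho, p = spearmanr(qtp, extraversion)    # Table 1-style rank correlation
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")

# Table 2-style 2x3 contingency table: rows = high/low QOL, columns =
# tolerable / intolerable / other personality types (invented counts).
table = np.array([[12, 5, 40],
                  [5, 11, 38]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```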
4. Conclusion
It is concluded that the QOL20 scores of tolerable-type subjects may be greater than those of intolerable-type subjects. Personality type is therefore thought to be a possible confounding factor in studies related to QOL. It is suggested that one must interpret the QOL scores of persons with a "tolerable type" personality carefully when they are used as diagnostic measures.
5. References
Aaronson, N.K. et al. (1988): A modular approach to quality of life assessment in cancer clinical trials. Recent Results Cancer Res, 111, 231-249.
Shigehisa, T. (1989): Behavioral regulation of dietary risk factor associated with stress-induced disease, in relation to personality and interpersonal behavior, in a sociocultural perspective. Tokyo Kasei Gakuin University Journal, 29, 25-45.
Schipper, H. et al. (1985): Measuring quality of life: risks and benefits. Cancer Treat Rep, 69, 1115-1125.
World Health Organization (1979): WHO handbook for reporting results of cancer treatment. Offset Publication No. 48, WHO, Geneva.
Yamaoka, K. et al. (1994): A Japanese version of the questionnaire for quality of life measurement. Ann Cancer Res Ther, 3, 45-53.
CONTRIBUTORS INDEX
Lapointe, F.-J. 71
Larrañaga, P. 117
Lebart, L. 423, 480
Lehel, J. 187
Levin, M.S. 154
Lozano, J.A. 117
Martin, M.C. 162
Matsuda, S. 284
McMorris, F.R. 187
Meulman, J.J. 506
Minami, H. 625
Misumi, J. 647
Mitsumochi, N. 746
Miyamoto, S. 295, 736
Mizuta, M. 610, 625
Mola, F. 191, 223
Mori, Y. 547
Morikawa, S. 653
Morinaka, S. 752
Mucha, H.-J. 231
Murakami, T. 575
Murtagh, F. 617
Nafria, E. 207
Nakai, S. 328
Nguyen, T.D. 215
Nicolas, J. 758
Nicolau, F.C. 89
Nishisato, S. 441
Rajman, M. 488
Rasson, J.-P. 99
Romanazzi, M. 555
Rungsawang, A. 488
Sanz, J. 341
Sato, M. 312
Sato, Y. 312
Satoh, D. 708
Shibasaki, Y. 766
Sibuya, M. 241
Siciliano, R. 191, 223
Siegmund-Schultze, R. 231
Simeone, B. 170
Suga, S. 736
Takahashi, H. 334
Takakura, S. 661
Takane, Y. 527
Tallur, B. 758
Tanaka, Y. 261, 547
Tanemura, M. 276
Tango, T. 247
Tarumi, T. 547
Tomita, S. 328
Tsuchiya, T. 431, 452
Ueda, T. 708