
Seventh International Conference on Multivariate Analysis

Barcelona Meeting, September 21-24, 1992


in "Multivariate Analysis, Future Directions", C. Cuadras, C.R. Rao, Eds, North Holland, 1993, p. 341-357.

(corrected version)

CORRESPONDENCE ANALYSIS AND CLASSIFICATION


LEBART L.,
Centre National de la Recherche Scientifique,
Ecole Nationale Supérieure des Télécommunications,
46, rue Barrault, 75013, Paris, France

and

MIRKIN B.G.
Department of Applied Statistics and Informatics
Central Economics-Mathematics Institute of Russian Academy of Sciences
(Currently at International Energy Agency,
2, rue André Pascal, 75775 Paris Cedex 16, France)

1. Introduction

The present paper contains a survey of some of the most salient results about the links and
the complementarity between clustering and correspondence analysis (CA) of
contingency tables. It also includes a presentation of certain new contributions and
domains of research.
Practitioners tend to complement one approach with the other when a thorough
exploration of the data is needed, since the two points of view may provide quite different
portraits of the data. The processes involved are obviously distinct (projection onto a principal
subspace on the one hand, grouping of similar categories on the other), but they can lead
to identical results in specific situations. In more general cases, the parameters they
produce are not independent. We focus below precisely on this interdependence and on these
specific situations.
Two characteristics of CA are in favour of a reconciliation with classification: the
symmetry of the roles of rows and columns in the process, and the property of
distributional equivalence (Benzécri, 1973; Escofier, 1978; Gilula, 1986; Greenacre,
1988), which ensures a great stability of the results when elements with
similar profiles are agglomerated. Agglomerating the rows or the columns of a contingency table is
"natural" in the sense that it merely replaces classes by classes (instead of replacing
individuals by groups, or variables by groups of variables...).
Questions of clustering in contingency tables based on the grouping of homogeneous
items are discussed in Cazes (1986), Escoufier (1988), Greenacre (1988), Gilula (1986),
Goodman (1981), and Jambu (1978).


2. Some links between the two approaches

One can find a series of theoretical bridges between these approaches, exemplified by
some particular models. We discuss below a set of such models, with two purposes in
mind: to unify previous developments and to propose certain new
approaches. Let us illustrate this discussion with a numerical example of a symmetric 8
by 8 contingency table KIJ = (kij) comprising k = 640 cases (table 1). The marginals ki and
kj are identical (all equal to 80) in this particular example, but all the results apply as well
to the case of unequal marginals.

Table 1
Contingency table KIJ

COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8

LIG1 30 18 12 12 2 2 2 2
LIG2 18 30 12 12 2 2 2 2
LIG3 12 12 27 21 2 2 2 2
LIG4 12 12 21 27 2 2 2 2
LIG5 2 2 2 2 24 20 14 14
LIG6 2 2 2 2 20 24 14 14
LIG7 2 2 2 2 14 14 23 21
LIG8 2 2 2 2 14 14 21 23

The agglomerative clustering methodology based on the chi-square distance and a
generalized Ward criterion (see Benzécri, 1973; Greenacre, 1988; Jambu, 1978)
agglomerates the elements pairwise, as shown in Figure 1.

[Figure 1 (sketched dendrogram): the rows agglomerate pairwise: ROW1 with ROW2 (node index .023), ROW3 with ROW4 (.006), ROW5 with ROW6 (.003), ROW7 with ROW8 (.001); the pairs {1,2} and {3,4} then merge (.090), as do {5,6} and {7,8} (.040); the top node (.640) joins {1,2,3,4} and {5,6,7,8}.]

Figure 1
Sketched dendrogram issued from the hierarchical clustering of the (8,8) table KIJ


The moving centers (k-means) method based on the chi-square distance gives similar results. For
example, starting from the two centers corresponding to elements 1 and 8, we easily
obtain the 2-class partition corresponding to the upper part of the dendrogram
(cf. Figure 1).
To express both the symmetry and the distributional equivalence in a unified form, let us
consider, for each i ∈ I and j ∈ J, the value

$$q_{ij} = \frac{k\,k_{ij}}{k_i k_j} - 1 \qquad (i \in I,\ j \in J)$$

which expresses the relative increment (or decrement) RIP(i/j) of the probability of row i
due to the knowledge of column j. The dual interpretation of $q_{ij}$ as the relative increment
RIP(j/i) of the probability of column j due to row i is straightforward. Relative increments
for subsets are defined in an analogous way, using the total probabilities (or frequencies). The
RIP values of table 1 are obtained by multiplying the entries by .1 and subtracting 1
afterwards. Note the two following relationships expressing the classical
chi-square $X^2$ as a function of the RIP coefficients:

$$X^2 = \sum_{i,j} k_{ij}\, q_{ij} = \frac{1}{k} \sum_{i,j} k_i k_j\, q_{ij}^2$$
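The sketch below (ours, not part of the paper; NumPy assumed) computes the RIP matrix Q of table 1 and checks both identities numerically.

```python
# A minimal check (our sketch) of the RIP values and the two chi-square
# identities on table 1.
import numpy as np

K = np.array([[30, 18, 12, 12, 2, 2, 2, 2], [18, 30, 12, 12, 2, 2, 2, 2],
              [12, 12, 27, 21, 2, 2, 2, 2], [12, 12, 21, 27, 2, 2, 2, 2],
              [2, 2, 2, 2, 24, 20, 14, 14], [2, 2, 2, 2, 20, 24, 14, 14],
              [2, 2, 2, 2, 14, 14, 23, 21], [2, 2, 2, 2, 14, 14, 21, 23]],
             dtype=float)
k, ki, kj = K.sum(), K.sum(axis=1), K.sum(axis=0)   # k = 640, marginals = 80
Q = k * K / np.outer(ki, kj) - 1                    # RIP matrix: here 0.1*K - 1

chi2_a = (K * Q).sum()                          # X2 = sum_ij k_ij q_ij
chi2_b = (np.outer(ki, kj) * Q**2).sum() / k    # X2 = (1/k) sum_ij k_i k_j q_ij^2
assert np.isclose(chi2_a, chi2_b)
```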

The RIP concept is useful in many respects (Mirkin, 1985, 1992). In the present context we
should point out that the RIP concept underlies the basic reconstruction formula of CA:

$$q_{ij} = \sum_{h \in H} \mu_h F_h(i) G_h(j) \qquad (1)$$

where $F_h$, $G_h$ are the CA factors corresponding to the singular value $\mu_h$ (h ∈ H).

2.1 A global approximation formulation

Using this concept, the distributional equivalence principle can be specified in a
symmetric form as follows. In rough terms, the block structure of the coinciding RIP
values in the matrix Q = {qij} reflects the CA representation in such a way that the sub-arrays
(boxes) of equal RIP values correspond to the sets of equal row or column points
in the CA space. This can also be expressed in terms of the equalities (1), using Boolean
vectors instead of CA factors. Explicitly, let the classes of some partitions {Vs : s ∈ S}
on I and {Wt : t ∈ T} on J represent the sets of coinciding row and column points in the CA
space. The formulas (1) express the principle if H = S×T and the Boolean Fh(i), Gh(j) are
defined for h = (s,t) as follows: Fh(i) = 1 iff i ∈ Vs, and Gh(j) = 1 iff j ∈ Wt. This form of the
principle allows us to formulate the partitioning problem of a contingency table KIJ as an
approximation problem: to find a pair of partitions, {Vs : s ∈ S} on I and {Wt : t ∈ T} on J,


and corresponding values $\mu_h$ for h = (s,t), approximating the RIP matrix Q = {qij}, that is,
minimizing the difference between the left and right parts of (1) (in the Boolean form)
as measured by the weighted least squares criterion L2:

$$L2 = \sum_{i,j} k_i k_j \Big[ q_{ij} - \sum_h \mu_h F_h(i) G_h(j) \Big]^2 \qquad (2)$$

(the weight of the entry (i,j) is taken equal to $k_i k_j$; Carroll, Pruzansky, and Green, 1977;
Escoufier, 1988). When the user wants to cluster only one of the sets I (or J), the
corresponding partition of J (or of I) consists of the set of singletons.
Evidently, for $F_h$ and $G_h$ (h ∈ H) fixed, the optimal values $\mu_h$ are equal to the
corresponding RIP values, that is, for each h = (s,t), the optimal value is $\mu_h = q_{st}$.
It is not difficult to prove also that the alternating algorithm for minimizing L2 is
equivalent to the chi-square distance moving centers method, and that an agglomerative
suboptimal algorithm is equivalent to the chi-square distance based agglomerative
clustering procedure using the generalized Ward criterion (Mirkin, 1992). The value of the
criterion can be expressed through the difference of the chi-square contingency
coefficients of the initial and aggregated contingency tables: L2 = X²(I,J) − X²(S,T). This
approach can account for various results and findings derived in Benzécri et al. (1980),
Cazes (1986), Moussaoui (1987), Jambu (1978).
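This identity can be checked numerically; the sketch below (ours) uses the 2-class cut of the dendrogram of Figure 1 on both rows and columns. Note that the identity holds when the weights $k_i k_j$ of (2) are normalized by k, which is the convention adopted in the sketch.

```python
# A small check (our sketch) that L2 = X2(I,J) - X2(S,T) on table 1,
# for the partitions {1,2,3,4} / {5,6,7,8} of rows and columns.
import numpy as np

K = np.array([[30, 18, 12, 12, 2, 2, 2, 2], [18, 30, 12, 12, 2, 2, 2, 2],
              [12, 12, 27, 21, 2, 2, 2, 2], [12, 12, 21, 27, 2, 2, 2, 2],
              [2, 2, 2, 2, 24, 20, 14, 14], [2, 2, 2, 2, 20, 24, 14, 14],
              [2, 2, 2, 2, 14, 14, 23, 21], [2, 2, 2, 2, 14, 14, 21, 23]],
             dtype=float)

def chi2(M):
    E = np.outer(M.sum(axis=1), M.sum(axis=0)) / M.sum()
    return ((M - E)**2 / E).sum()

V = [[0, 1, 2, 3], [4, 5, 6, 7]]                    # row classes V_s
W = [[0, 1, 2, 3], [4, 5, 6, 7]]                    # column classes W_t
A = np.array([[K[np.ix_(v, w)].sum() for w in W] for v in V])

k, ki, kj = K.sum(), K.sum(axis=1), K.sum(axis=0)
Q = k * K / np.outer(ki, kj) - 1
lab = np.repeat([0, 1], 4)                          # class label of each element
Qhat = (k * A / np.outer(A.sum(1), A.sum(0)) - 1)[np.ix_(lab, lab)]  # mu_h = q_st
L2 = (np.outer(ki, kj) * (Q - Qhat)**2).sum() / k   # weights k_i k_j / k
assert np.isclose(L2, chi2(K) - chi2(A))            # both equal 103.2 here
```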

2.2 Simultaneous clustering of rows and columns

This approximation clustering approach can be extended to the problem of finding
"mixed" clusters containing rows and columns simultaneously (an approach dating
back to Hartigan, 1972; Braverman et al., 1974; see also Govaert, 1977; Bock, 1979).
The chi-square distance concept cannot help in this matter, since no satisfactory concept
exists to measure the distance between a row and a column! But we can consider the model
(1) as a set of approximate equalities with arbitrary Boolean vectors Fh and Gh (and
corresponding cluster boxes Bh = {(i,j) : Fh(i) = 1 and Gh(j) = 1}) to be determined.
A suboptimal algorithm fitting the model (1) in this case was developed in Mirkin (1992):
the cluster boxes Bh are separated sequentially, maximizing the part of the
total X²(I,J) accounted for; the values $\mu_h$ are estimated by the RIP values of the cluster boxes
obtained.
More explicitly, each iteration h (the index h is omitted below for convenience) aims at
minimizing the following reduced form of criterion (2):

$$L2 = \sum_{i,j} k_i k_j \big[ q_{ij} - \mu F(i) G(j) \big]^2 \qquad (3)$$


µ, F(i) and G(j) are unknown, whereas the qij are the residuals computed after each
iteration (for the first iteration, the qij are the initial RIP values).
The optimal µ for any fixed box V×W (defined by V = {i : F(i) = 1} and W = {j : G(j) = 1})
is the weighted average of the qij computed within the box:

$$\mu = \sum_{i \in V} \sum_{j \in W} k_i k_j\, q_{ij} \,/\, (k_V k_W)$$

which equals $q_{VW}$.
Substituting this value into (3) leads to the following equality:

$$L2 = \sum_{i \in I} \sum_{j \in J} k_i k_j\, q_{ij}^2 - \mu^2 k_V k_W$$

which shows that minimizing L2 is equivalent to maximizing the following form of the
criterion, depending on the box V×W only:

$$g(V,W) = \mu^2 k_V k_W = \Big( \sum_{i \in V} \sum_{j \in W} k_i k_j\, q_{ij} \Big)^2 / \,(k_V k_W)$$

To maximize this criterion, the following step-by-step box generation procedure can be
performed: each step adds to the box issued from the previous step a single element, a
row or a column, chosen to maximize the increment of the criterion. At
the first step, two elements are selected simultaneously: a row i and a column j
maximizing g({i},{j}) over all pairs of singletons. The process stops when the
maximal increment becomes negative. The suboptimal cluster box obtained through this
algorithm has the following property (Mirkin, 1992): for each row i or column j outside
the cluster box, the absolute values of the relative increments $q_{Vj} = q_{jV}$ and $q_{iW} = q_{Wi}$ are
at most half the absolute value of the relative "internal" increment $q_{VW} = q_{WV}$.
The residual data in this sequential fitting procedure are obtained by subtracting the
solution provided by the h-th iteration from the residual data of the preceding iteration:

$$q_{ij,h+1} = q_{ij,h} - \mu_h F_h(i) G_h(j) \qquad (i \in I,\ j \in J)$$

For the first iteration, $q_{ij,1} = q_{ij}$ (i ∈ I, j ∈ J).
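The following sketch gives our reading of one box-extraction step (function and variable names are ours, not from Mirkin, 1992): grow a box V×W greedily, one row or column at a time, as long as g(V,W) increases.

```python
# Greedy growth of one cluster box maximizing
# g(V,W) = (sum_{i in V, j in W} k_i k_j q_ij)^2 / (k_V k_W)   (our sketch).
import numpy as np
from itertools import product

def g(Q, ki, kj, V, W):
    num = (np.outer(ki[V], kj[W]) * Q[np.ix_(V, W)]).sum()
    return num**2 / (ki[V].sum() * kj[W].sum())

def grow_box(Q, ki, kj):
    n, m = Q.shape
    # first step: select the best pair of singletons
    V, W = max((([i], [j]) for i, j in product(range(n), range(m))),
               key=lambda vw: g(Q, ki, kj, *vw))
    best = g(Q, ki, kj, V, W)
    while True:
        cands = ([(V + [i], W) for i in range(n) if i not in V] +
                 [(V, W + [j]) for j in range(m) if j not in W])
        if not cands:
            return V, W
        V2, W2 = max(cands, key=lambda vw: g(Q, ki, kj, *vw))
        if g(Q, ki, kj, V2, W2) <= best:            # no positive increment: stop
            return V, W
        V, W, best = V2, W2, g(Q, ki, kj, V2, W2)
```

On the RIP matrix of table 1, the first call returns the singleton box {1}×{1}, consistent with the sequence of boxes described below.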

Even in the case of overlapping boxes, the initial chi-square can be partitioned into
components corresponding to these boxes, in order to evaluate the contribution of each
cluster and to help fix the number of clusters (by using traditional values of the
accumulated contributions, or by testing the hypothesis of independence on the residual
data).
The boxes obtained are shown to correspond to certain fragments of the CA space
(maximally connected if $\mu_h > 0$, maximally disconnected if $\mu_h < 0$). In our example, the
algorithm first separates the singleton boxes {1}×{1} and {2}×{2} (each having RIP value 2
and accounting for 7.8% of X²(I,J)); then the pair segments {3,4}×{3,4},
{5,6}×{5,6}, and {7,8}×{7,8} are obtained sequentially, followed by the link boxes
({1}×{2} and {2}×{1}) for the first two elements. The RIP value of each of these boxes
is positive (evidently, the values are 1.4, 1.2, 1.2, .8, .8, respectively). Then the boxes
{1,2,3,4}×{5,6,7,8} and {5,6,7,8}×{1,2,3,4} appear, having the negative RIP value -.8. This
whole structure accounts for 95.8% of X²(I,J).

2.3 An example of coincidence between clustering and C.A.

Unfortunately, the Boolean form of the decomposition (1) no longer has the weighted
orthonormality properties of the CA factors. But for symmetric matrices KII (which is
exactly the case of our example), Benzécri (1973, vol.2, ch.11) has pointed out a situation
where discrete orthonormal eigenfunctions are relevant.
This author has derived a representation of a binary hierarchy H through a set of
orthogonal functions, allowing one to build a symmetric contingency table (through the
reconstruction formula) whose CA recovers the initial hierarchy.

The preceding symmetric (8,8) contingency table KIJ thus has the property of providing
an exact coincidence between correspondence analysis and hierarchical clustering (using
Ward's criterion), in the following sense: each eigenvalue of the CA corresponds
exactly to a node of the classification, and the associated axis of the CA separates the two
sets of elements constituting this node.
Table 2
Eigenvalues issued from the CA of KIJ

λ1 = .640 (80 % of the trace)
λ2 = .090 (11 %)
λ3 = .040 (5 %)
λ4 = .023 (3 %)
λ5 = .006 (.7 %)
λ6 = .003 (.4 %)
λ7 = .001 (.1 %)
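These eigenvalues can be reproduced, up to rounding, by a standard SVD implementation of CA; the sketch below (ours) diagonalizes the matrix of standardized residuals of table 1.

```python
# CA eigenvalues of table 1 via an SVD of the standardized residuals
# (our sketch); the squared singular values give the eigenvalues of table 2.
import numpy as np

K = np.array([[30, 18, 12, 12, 2, 2, 2, 2], [18, 30, 12, 12, 2, 2, 2, 2],
              [12, 12, 27, 21, 2, 2, 2, 2], [12, 12, 21, 27, 2, 2, 2, 2],
              [2, 2, 2, 2, 24, 20, 14, 14], [2, 2, 2, 2, 20, 24, 14, 14],
              [2, 2, 2, 2, 14, 14, 23, 21], [2, 2, 2, 2, 14, 14, 21, 23]],
             dtype=float)
P = K / K.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
mu = np.linalg.svd(S, compute_uv=False)              # singular values mu_h
print(np.round(mu**2, 3))           # compare with table 2: .640, .090, ...
```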


Correspondence analysis of the table KIJ leads to 7 clearly separated eigenvalues (see table 2).
The sequence of patterns observed in the columns of table 3 (eigenvectors) is
typical of a hierarchical structure: the non-zero coordinates on each principal axis take
only two distinct values, opposing two groups of elements.

Table 3
Principal coordinates issued from the CA of KIJ

Axes      1      2      3      4      5      6
ROW1   -.80    .42    .00    .30    .00    .00
ROW2   -.80    .42    .00   -.30    .00    .00
ROW3   -.80   -.42    .00    .00   -.15    .00
ROW4   -.80   -.42    .00    .00    .15    .00
ROW5    .80    .00   -.28    .00    .00    .10
ROW6    .80    .00   -.28    .00    .00   -.10
ROW7    .80    .00    .28    .00    .00    .00
ROW8    .80    .00    .28    .00    .00    .00

The first axis, for instance, opposes (ROW1 ... ROW4) to (ROW5 ... ROW8). The second axis,
within the first group isolated by axis 1, opposes (ROW1, ROW2) to (ROW3, ROW4), etc.
Correspondence analysis performs in this case like a divisive algorithm, working
iteratively from the upper to the lower levels of a hierarchy.

[Figure 2 (principal plane of the CA, axis 1: 80%, axis 2: 11%): ROW1 and ROW2 superimposed at (-.80, .42); ROW3 and ROW4 superimposed at (-.80, -.42); ROW5, ROW6, ROW7 and ROW8 superimposed at (.80, .00).]

Figure 2
Planar display of table KIJ through CA.

We notice that the configuration of points in the principal plane of the CA (Figure 2)
highlights only a limited part of the underlying structure, by comparison with the
dendrogram (also a planar representation) of Figure 1.
Figure 2 gives neither pertinent information about the distance between ROW1 and ROW2
(the corresponding points are superimposed on the plane, suggesting a null distance), nor
useful information about the distances between ROW5, ROW6, ROW7 and ROW8, also superimposed on
the graphical display. This shrinkage of distances, easily explained by the geometrical
properties of the initial swarm of points, should prompt users to use the two kinds of
methods simultaneously to obtain a reliable description of the data.

2.4 Properties of these "compatible" matrices

The above example concerns the case of a binary hierarchy H, each nonterminal
element h ∈ H of which can be partitioned in a unique way into two sets a(h) and b(h) belonging to
H. The orthonormal set of "3-valued" functions $f_h$ is defined as follows: $f_h(i)$ equals $d_a$ for
i ∈ a(h), $-d_b$ for i ∈ b(h), and 0 for the other elements i, where $d_a$, $d_b$ are chosen so as to make
the average of $f_h$ equal to zero, and its norm equal to 1. Evidently,

$$d_a = \big[\, k\,k_{b(h)} \,/\, (k_{a(h)} k_h) \,\big]^{1/2}, \qquad d_b = \big[\, k\,k_{a(h)} \,/\, (k_{b(h)} k_h) \,\big]^{1/2}$$
We say that a square symmetric contingency table is compatible if (1) holds for some
binary hierarchy H with $F_h(i) = f_h(i)$, $G_h(j) = f_h(j)$ and some $\mu_h > 0$ (h ∈ H). In general, a
method approximating the RIP values with such a 3-valued eigenfunction decomposition
can be developed. The method fits the model (1) sequentially, each iteration finding a
bipartition of the current set h into two subsets, a(h) and b(h), minimizing the weighted least
squares criterion or, equivalently, maximizing the "explained" part of the chi-square value,
which is shown to be equal to:

$$\mu_h^2 = \big( q_{a(h)a(h)} + q_{b(h)b(h)} - 2\,q_{a(h)b(h)} \big)^2$$

This divisive clustering procedure, in our example, leads to the hierarchy of Figure 1.
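As a small numerical check (ours), consider the node h = {1,2} of the hierarchy of Figure 1, with a(h) = {1} and b(h) = {2}; since all $k_i$ = 80 in table 1, $d_a = d_b = 2$, and $f_h$ indeed has zero mean and unit norm when, as the construction implies, mean and norm are weighted by the masses $k_i/k$.

```python
# The 3-valued function of the node h = {1,2} (a(h) = {1}, b(h) = {2})
# for table 1, where all k_i = 80 (our sketch).
import numpy as np

k_i = np.full(8, 80.0)
k = k_i.sum()                              # 640
ka, kb = 80.0, 80.0                        # k_a(h), k_b(h)
kh = ka + kb                               # k_h = 160
da = np.sqrt(k * kb / (ka * kh))           # = 2
db = np.sqrt(k * ka / (kb * kh))           # = 2
f = np.array([da, -db, 0, 0, 0, 0, 0, 0])  # d_a on a(h), -d_b on b(h), 0 elsewhere
w = k_i / k                                # masses
assert np.isclose((w * f).sum(), 0.0)      # zero weighted mean
assert np.isclose((w * f**2).sum(), 1.0)   # unit weighted norm
```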

3. Eigenvalues and indices

3.1 Some inequalities

The largest eigenvalue issued from the CA of a contingency table is greater than or equal to
the index of the highest node of a hierarchical clustering of the rows or
of the columns of this contingency table (using the chi-square distance and the generalized
Ward criterion, to ensure compatibility between the two techniques). Equality occurs
for special tables such as the compatible matrices dealt with in the previous section. This
upper bound for the indices can be derived easily from the above considerations, since
the indices and the eigenvalues appear as solutions of the same optimization problem,
with supplementary constraints for the indices. Benzécri and Cazes (1978) have shown more
generally that the quantity (λ1 + λ2 + ... + λp) is greater than or equal to the sum of
the p indices corresponding to the p highest nodes of the associated hierarchy (a property
which can be derived directly from the general criterion (2), where F and G are less
constrained in the case of CA). Moreover, these authors have produced a counter-example
showing that there exists no general lower bound for the index corresponding to the
highest node: one can find distributions such that the largest index remains an
arbitrarily small fraction of the largest eigenvalue.

3.2 The case of block-structured contingency tables

The limiting case of multiple eigenvalues λi = 1 (i = 1, ..., m) for the CA of a rectangular
contingency table is particularly interesting, since it is closely related to the classification
of rows and columns (the trivial eigenvalue 1, corresponding to a constant eigenvector, is
supposed to be removed beforehand). It is straightforward that such multiple unit
eigenvalues exist iff there exists a block structure of the contingency table into m+1
blocks (i.e., iff only m+1 diagonal blocks contain non-zero elements). Surprisingly
enough, no similar property holds for the most usual agglomerative algorithms. Kharchaf
and Rousseau (1988, 1989) present some counter-examples of block structures in
contingency tables easily recognised by CA, although undetected by an agglomerative
clustering technique.
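A tiny illustration of the block-structure property (our sketch, on an arbitrary made-up table): with m + 1 = 2 diagonal blocks, CA yields exactly one non-trivial eigenvalue equal to 1.

```python
# A rectangular table with two diagonal blocks: the largest non-trivial
# CA singular value equals 1 (our sketch).
import numpy as np

K = np.zeros((4, 5))
K[:2, :3] = [[5, 1, 2], [3, 4, 1]]          # arbitrary first block
K[2:, 3:] = [[2, 6], [7, 1]]                # arbitrary second block
P = K / K.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
print(np.linalg.svd(S, compute_uv=False))   # largest singular value: 1.0
```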

3.3 Experiments about the joint distribution of indices and eigenvalues

We give in this section some empirical results about the joint behaviour of the indices and
the eigenvalues issued from the same random contingency table.
Under the hypothesis of independence (also called homogeneity in the case of
contingency tables), a series of 1000 pseudo-random independent (8,8) contingency tables
with equal theoretical marginals is generated according to a multinomial scheme. For
each generated table, the total number of observations k is 1000.
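The sketch below reproduces this simulation design as we read it (seed and generator are arbitrary choices of ours).

```python
# 1000 multinomial (8,8) tables under independence with equal marginals,
# k = 1000; the 7 CA eigenvalues of each table are kept (our sketch).
import numpy as np

rng = np.random.default_rng(0)
eig = np.empty((1000, 7))
for t in range(1000):
    K = rng.multinomial(1000, np.full(64, 1 / 64)).reshape(8, 8).astype(float)
    P = K / K.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eig[t] = np.linalg.svd(S, compute_uv=False)[:7] ** 2
print(eig.mean(axis=0))        # compare with the EV lines of table 4
```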


Table 4
Mean values and standard deviations of the eigenvalues and of the clustering indices
(1000 independent random (8,8) contingency tables C; for each table C, k = 1000)

Identifier   Mean value   Standard    Standard deviation
                          deviation   of the mean

Eigenvalues
EV1          .02130       .00560      .00018
EV2          .01282       .00353      .00011
EV3          .00772       .00234      .00007
EV4          .00442       .00156      .00005
EV5          .00214       .00100      .00003
EV6          .00070       .00050      .00002
EV7          .00010       .00014      .00000

Indices of rows (INRi) and columns (INCi)
INR1         .01692       .00452      .00014
INR2         .01063       .00289      .00009
INR3         .00733       .00197      .00006
INR4         .00537       .00148      .00005
INR5         .00391       .00117      .00004
INR6         .00280       .00090      .00003
INR7         .00183       .00074      .00002
INC1         .01679       .00450      .00014
INC2         .01061       .00291      .00009
INC3         .00739       .00202      .00006
INC4         .00535       .00151      .00005
INC5         .00396       .00118      .00004
INC6         .00280       .00091      .00003
INC7         .00182       .00075      .00002

The 7 eigenvalues issued from the CA of each table, as well as the 7 indices of the
hierarchical classification of the rows and of the columns (always using the generalized
Ward criterion and the chi-square distance) of the same table, are computed, enabling us
to estimate the means, variances and correlations of these 21 variates.
Table 4 summarizes the results concerning the means, the standard deviations of the initial
variables and the standard deviations of the means.

The results concerning the eigenvalues are consistent with some previous approximations
(Lebart, 1976), since their distribution is similar to that of the eigenvalues of a Wishart
matrix (n = 7, p = 7). The sum τ of the means of the eigenvalues equals 0.0492;
the statistic kτ thus has the value 49.2 (no significant difference from the expectation of a
chi-square with 7×7 degrees of freedom).


The indices corresponding to the clustering of the rows and to that of the columns are distinct
for each simulated matrix. The statistical identity of their first and second order moments
is a further indication of the consistency of the simulation process.
As expected, the largest indices INR1 and INC1 are smaller than the largest eigenvalue
λ1 = EV1, whereas the smallest indices INR7 and INC7 are on average much larger than their
counterpart EV7.

[Figure 3 (line chart): the mean eigenvalues and the mean clustering indices plotted against their rank (1 to 7), on a vertical scale from 0 to 0.025; the eigenvalues start higher and end lower than the indices.]

Figure 3. Sequences of eigenvalues and indices

Figure 3 shows the compared trajectories of these two quantities, highlighting the smaller
range of variation of the indices.

Figure 4 below presents the scatter diagram of the joint distribution of the first
eigenvalue λ1 = EV1 and the first row-clustering index INR1, both issued from the same
pseudo-random matrix. The correlation coefficient between λ1 and INR1 is 0.91 (the
same value is obtained for the correlation coefficient between λ1 and INC1). The
theoretical constraint INR1 ≤ λ1 clearly defines the upper left boundary of the swarm of
points.


[Figure 4 (scatter diagram): largest clustering index INR1 (vertical axis, .000 to about .035) versus first eigenvalue EV1 (horizontal axis, .004 to .044) for the 1000 simulated tables. The swarm of points lies below the first diagonal; its upper left boundary materializes the constraint INR1 ≤ λ1.]

Figure 4
Correlation between λ1 = EV1 and the first clustering index INR1

To study the complex system of relationships between the various indices and the
eigenvalues, we visualize the corresponding correlation matrix through a principal
component analysis (PCA), which summarizes the main observable patterns.

Figure 5 shows the principal plane of a PCA in which the active elements are the
eigenvalues and the illustrative elements are the indices. A classical size effect (all the
coordinates on the first axis are positive) corresponds to the fact that all the involved
correlation coefficients are positive.

[Figure 5 (principal plane, axis 1: 40%, axis 2: 19%): all elements have positive coordinates on axis 1. The last eigenvalues (EV5, EV6, EV7) and the last indices (INR7, INC7) lie in the upper part of the plane, while the first eigenvalues (EV1, EV2) and the first indices (INR1, INC1, INR2, INC2) lie in the lower right part. Legend: EVi = eigenvalue i, INRi = row index i, INCi = column index i.]

Figure 5. Structure of the correlation between eigenvalues and indices

(Principal plane of a principal component analysis of the (1000,7) matrix containing the 1000
observations of the 7 eigenvalues EV1, ..., EV7. Note that the 7 row indices INR1, ..., INR7 and
the 7 column indices INC1, ..., INC7 have been projected afterwards as supplementary elements
onto this principal plane.)

The first indices are clearly correlated with the first eigenvalues. As mentioned
previously, the two correlation coefficients between each of the largest indices (INR1 and
INC1) and the first eigenvalue take the value 0.91 (the correlation between INR1 and
INR2 is only 0.80, but these relatively small differences are not visible on the display).
The positive autocorrelations between successive eigenvalues or indices entail regular
trajectories on the plane spanned by the first two principal components, but these
trajectories diverge for the smallest eigenvalues and indices.
This pattern, established from pseudo-random matrices, confirms the intuitive
experience of practitioners: on the one hand, the upper part of the dendrogram
provides the user with about the same results as the first axes; on the other hand, the
lower part of the dendrogram often pinpoints some interesting local properties of the data,
while the smallest eigenvalues correspond to unidentifiable noise.

4. Some hybrid methods

Two series of works involving both CA (or other principal axes methods) and clustering,
at different levels, are briefly mentioned below.

4.1 Clustering involving optimal coding

In the case of individuals described by several categorical variables (these variables could
be measured on nominal, ordinal or interval scales), van Buuren and Heiser (1989)
propose an algorithm achieving simultaneously a coding of the variables and a clustering
of the individuals. An alternating least squares algorithm is used, starting from a multiple
correspondence analysis of the data table.

4.2 Principal axes method for displaying or discovering clusters.

Some techniques related to projection pursuit and discrimination can also be considered
as an intermediate step between the two approaches.
Let us consider n objects described by p variables (yij being the value of variable j for object i).
Furthermore, these objects are also the vertices of a symmetric graph G, whose associated
matrix is M (mii' = 1 if nodes i and i' are joined by an edge, mii' = 0 otherwise). Such a
situation occurs when the objects are time points or geographic areas, or when they are assigned to a
priori classes. Contiguity analysis simultaneously uses the local covariance matrix C, such that

$$c_{jj'} = \frac{1}{2m} \sum_{i,i'} m_{ii'} \,(y_{ij} - y_{i'j})(y_{ij'} - y_{i'j'})$$

(m being the number of edges), and the global covariance matrix V. If
the graph is made of k disjoint complete subgraphs, C is very similar to the classical
"within covariance matrix" used in linear discriminant analysis, and coincides with it when
the graph is regular (i.e. each vertex is provided with the same number of edges). The
minimization of the ratio u'Cu / u'Vu (u being a p-vector) then provides a generalization
of linear discriminant analysis to the case of overlapping clusters (see for instance Aluja and
Lebart, 1984).
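A minimal sketch of contiguity analysis (ours, on hypothetical data; the chain graph stands for time-ordered observations): the local covariance C is accumulated over the edges of M, and the direction u minimizing u'Cu / u'Vu is obtained from a generalized eigenproblem.

```python
# Contiguity analysis on hypothetical data (our sketch).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 4))          # 30 objects, 4 variables
M = np.zeros((30, 30))
idx = np.arange(29)
M[idx, idx + 1] = M[idx + 1, idx] = 1     # chain graph: time-ordered points
m = M.sum() / 2                           # number of edges

D = Y[:, None, :] - Y[None, :, :]         # pairwise differences y_i - y_i'
C = np.einsum('ab,abj,abk->jk', M, D, D) / (2 * m)   # local covariance
V = np.cov(Y, rowvar=False, bias=True)               # global covariance

w, U = eigh(C, V)     # generalized eigenvalues in ascending order
u = U[:, 0]           # direction minimizing the local/global variance ratio
```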
Using more general similarity indices in place of the binary quantity mii' allows one to define a
series of indices analogous to those used in projection pursuit (see Caussinus, 1992).
It is easy to derive a contiguity matrix from the basic data array itself: any threshold applied
to the set of n(n-1) distances or similarities between observations defines a binary
relationship which can be described by a symmetric graph. Similarly, a contiguity matrix
can be derived from the k nearest neighbours of each observation.
The contiguity analysis applied to such matrices (Burtschy and Lebart, 1991) is closely
related to the techniques proposed by Gnanadesikan et al. (1982) and Art et al. (1982). It
produces planar (or low-dimensional) representations which can be viewed as
compromises between the outcomes of principal axes techniques (CA or PCA) and those
of clustering techniques.

5. Complementarity from a practical point of view

Various authors have insisted upon the complementarity between principal axes
techniques and classification, which concerns the comprehension of the data structure as
well as the interpretation of the results. Gower and Ross (1969), for example, have shown
how drawing a minimum spanning tree onto a principal plane issued from a
principal component analysis can enrich the interpretation of the represented distances
between points. Benzécri et al. (1980) have developed a thorough methodology for the
joint use of CA and hierarchical clustering, comprising various parameters which
describe the mutual links between axes and nodes.
CA, like PCA, can entail shrinkages and distortions due both to the projection onto the
principal dimensions and to the possible lack of robustness of the global fit (sensitivity to
outliers). It is then advisable to complement it with a classification performed in the
whole space. The clusters are not only used to mark out the factorial planes with a sample
of well-described areas: being derived in a much higher dimensional space, they can
supply elements of information that could have been hidden by the projection onto a
low-dimensional subspace.

A practical issue reinforces this need for both approaches: it is much easier to describe a
set of clusters than a continuous space. The most significant categories or variables for
each cluster can be selected automatically, therefore producing a computer-aided
description of the classes and, hence, of the whole space. A series of statistical tests
allows one to select and to sort (according to the computed levels of significance) the most
characteristic items for each cluster (see for instance Lebart et al., 1984).

From a purely computational point of view, when dealing with very large data sets such as
those provided by survey data files, it may prove efficient to perform the classification
using a limited number of factors issued from the CA, to increase the performance of the
techniques (Morineau and Lebart, 1986).

Finally, the user may wish to discover some unexpected latent factors or some hidden
groups within the data. Although the theoretical models underlying CA and
classification are seldom referred to by exploratory data analysts, it is clear that each tool
has its own vocation and idiosyncrasies. Even if the history of statistical applications
abounds in examples of groups discovered through eigen-analyses as well as latent factors
discovered through clustering, it seems wiser to use both techniques systematically.

References

Aluja Banet T., L. Lebart (1984). Local and Partial Principal Component Analysis and
Correspondence Analysis, COMPSTAT Proceedings, 113-118, Physica Verlag, Vienna.
Art D., Gnanadesikan R., Kettenring J.R. (1982) Data Based Metrics for Cluster
Analysis, Utilitas Mathematica, 21 A, 75-99.
Benzécri J.P. (1973) Analyse des Données. Paris: Dunod.
Benzécri, J.P. (1983) Analyse d'inertie intraclasse par l'analyse d'un tableau de
correspondance, Les Cahiers d'Analyse des Données, 8, no.3, 351-358.
Benzécri J.P., Cazes P. (1978) Problème sur la classification. Les Cahiers d'Analyse des
Données, 3, no.1, 95-101.
Benzécri J.P., Jambu M. (1976) Agrégation suivant le saut minimum et arbre de longueur
minimum. Les Cahiers d'Analyse des Données, 1, no.4, 441-452.
Benzécri, J.P., Lebeaux M.O., and Jambu M. (1980) Aides a l'interpretation en
classification automatique, Les Cahiers de l'Analyse des Données, vol.V, n.1, 101-123.
Bock H. H. (1979) Simultaneous clustering of objects and variables. in Analyse des
donnees et informatique, European C.C. Courses, INRIA, p 187-203.
Braverman E.M., Kiseleva N.E., Muchnik I.B., and Novikov, S.G. (1974) Linguistic
approach to the problem of processing large bodies of data, Automation and Remote
Control, 35, no.11, part 1, 1768-1788.


Burtschy B., and Lebart L. (1991) Contiguity analysis and projection pursuit, 117-128. in
Applied Stochastic Models and Data Analysis, World Scientific , Singapore.
van Buuren S., and Heiser W.J. (1989) Clustering N objects into k groups under optimal
scaling of variables, Psychometrika, 54, no.4, 699-706.
Carroll J.D., Pruzansky S., and Green P.F. (1977) Estimation of the parameters of
Lazarsfeld's Latent Class Model by application of canonical decomposition
CANDECOMP to multi-way contingency tables, AT&T Bell Laboratories, unpublished
paper, 18 p.
Cazes P. (1986) Correspondance entre deux ensembles et partition de ces deux
ensembles, Les Cahiers de l'Analyse des Données, vol.XI, no.3, 335-340.
Cazes P., and Moreau J. (1991) Contingency table in which the rows and columns have a
graph structure, in E.Diday, Y.Lechevallier (Eds) Symbolic-Numeric Data Analysis and
Learning, Nova Science Publishers: New York, 271-280.
Caussinus H.(1992). Projections Revelatrices in Modèles pour l'Analyse des Données
Multidimensionnelles, J.J. Droesbeke, B. Fichet, P.Tassi, eds, Economica, Paris.
Escofier B. (1978). Analyse factorielle et distances répondant au principe d'équivalence
distributionnelle. Revue de Statist. Appl. vol. 26, n°4, p 29-37.
Escoufier Y. (1988) Beyond correspondence analysis. In: H.H.Bock (Ed.) Classification
and Related Methods of Data Analysis. Elsevier Sc.P.
Gilula Z. (1986) Grouping and association in contingency tables: an exploratory
canonical correlation approach, Journal of American Statistical Association, vol.81,
no.395, 773-779.
Gnanadesikan R., Kettenring J.R., Landwehr J.M. (1982). Projection Plots for Displaying
Clusters, in Statistics and Probability, Essays in Honor of C.R. Rao, G. Kallianpur, P.R.
Krishnaiah, J.K.Ghosh, eds, North-Holland.
Goodman L.A. (1991) Measures, models, and graphical displays in the analysis of cross-
classified data (with Discussion), Journal of American Statistical Association, vol.86,
No.416, 1085-1138.
Goodman L.A.(1981) Criteria for determining whether certain categories in a cross-
classification table should be combined with special reference to occupational categories
in an occupational mobility table, American Journal of Sociology, 87, 612-650.
Govaert G. (1977) Algorithme de classification d'un tableau de contingence. In:
"Premières Journées Internationales Analyse des Données et Informatique (Versailles
1977)" INRIA, p. 487-500.


Gower J.C., Ross G. (1969) Minimum spanning tree and single linkage cluster analysis.
Appl.Statistics, vol 18, p 54-64.
Greenacre M.J. (1988) Clustering the rows and columns of a contingency table, Journal
of Classification, 5, 39-51.
Hartigan J.A. (1972) Direct clustering of a data matrix, Journal of American Statistical
Association, vol.67, p. 123-129.
Jambu M. (1978) Classification Automatique pour l'Analyse des Données, I- Méthodes et
Algorithms. Paris:Dunod.
Kharchaf I., Rousseau R. (1988, 1989) Reconnaissance de la structure de blocs d'un
tableau de correspondance par la classification ascendante hiérarchique: parts 1 and 2, Les
Cahiers de l'Analyse des Données, vol.XIII, n.4, 439-443; vol.XIV, n.3, 257-266.
Lebart L. (1976) The significance of eigenvalues issued from correspondence analysis.
Proceedings in Comp. Stat., COMPSTAT, Physica verlag, Wien, p 38-45.
Lebart L., Morineau A., Warwick K. (1984) - Multivariate Descriptive Statistical
Analysis, J.Wiley, New-York.
Marcotorchino F. (1987) Block seriation problems: a unified approach, Journal of
Applied Stochastical Models and Data Analysis, vol.3, no.3, 73-93.
Mirkin B.G. (1985) Grouping in SocioEconomic Studies. Finansy i Statistika Publishers,
Moscow (in Russian).
Mirkin B.G. (1992) Correspondence-wise clustering for contingency tables, submitted for
publication.
Morineau A., Lebart L. (1986) Specific Clustering Algorithms for Large data sets and
Implementation in SPAD Software. in Classification as a Tool of Research, Gaul W.,
Schader M., Eds, North Holland, 1986.
Moussaoui A.E. (1987) Sur la reconstruction approchée d'un tableau de correspondance a
partir du tableau cumulé par blocs suivant deux partitions des ensembles I et J, Les
Cahiers de l'Analyse des Données, vol.XII, n.3, 365-370.

Key-words:
Correspondence Analysis, Clustering techniques, Classification, Hybrid
approaches in Data Analysis, Contingency tables.
