
Studies in Classification, Data Analysis,

and Knowledge Organization

Chikio Hayashi · Keiji Yajima


Hans H. Bock · Noboru Ohsumi
Yutaka Tanaka · Yasumasa Baba

Data Science,
Classification, and
Related Methods
Studies in Classification, Data Analysis,
and Knowledge Organization

Managing Editors

H.-H. Bock, Aachen
O. Opitz, Augsburg
M. Schader, Mannheim

Editorial Board

W. H. E. Day, St. John's
E. Diday, Paris
A. Ferligoj, Ljubljana
W. Gaul, Karlsruhe
J. C. Gower, Harpenden
D. J. Hand, Milton Keynes
P. Ihm, Marburg
J. Meulman, Leiden
S. Nishisato, Toronto
F. J. Radermacher, Ulm
R. Wille, Darmstadt

Springer Japan KK
Titles in the Series

H.-H. Bock and P. Ihm (Eds.)


Classification, Data Analysis, and
Knowledge Organization
(out of print)

M. Schader (Ed.)
Analyzing and Modeling Data
and Knowledge

O. Opitz, B. Lausen, and R. Klar (Eds.)


Information and Classification
(out of print)

H.-H. Bock, W. Lenski, and M. M. Richter (Eds.)


Information Systems and Data Analysis
(out of print)

E. Diday, Y. Lechevallier, M. Schader, P. Bertrand,


and B. Burtschy (Eds.)
New Approaches in Classification and Data Analysis
(out of print)

W. Gaul and D. Pfeifer (Eds.)


From Data to Knowledge

H.-H. Bock and W. Polasek (Eds.)


Data Analysis and Information Systems

E. Diday, Y. Lechevallier and O. Opitz (Eds.)


Ordinal and Symbolic Data Analysis

R. Klar and O. Opitz (Eds.)


Classification and Knowledge Organization
C. Hayashi · N. Ohsumi
K. Yajima · Y. Tanaka
H.-H. Bock · Y. Baba (Eds.)

Data Science,
Classification,
and Related Methods
Proceedings of the Fifth Conference of the International
Federation of Classification Societies (IFCS-96),
Kobe, Japan, March 27-30, 1996

With 240 Figures

Springer
Prof. Emeritus Chikio Hayashi
The Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan

Prof. Keiji Yajima


School of Management, Science University of Tokyo
500 Shimokiyoku, Kuki, Saitama 346, Japan

Prof. Hans-Hermann Bock


Institut für Statistik
Rheinisch-Westfälische Technische Hochschule (RWTH)
D-52056 Aachen, Germany

Prof. Noboru Ohsumi


The Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan

Prof. Yutaka Tanaka


Faculty of Environmental Science & Technology, Okayama University
2-1-1 Tsushima-naka, Okayama 700, Japan

Prof. Yasumasa Baba


The Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan

ISBN 978-4-431-70208-5 ISBN 978-4-431-65950-1 (eBook)


DOI 10.1007/978-4-431-65950-1

This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other ways, and storage in data
banks.
© Springer Japan 1998
Originally published by Springer-Verlag Tokyo Berlin Heidelberg New York in 1998

The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
Product liability: The publishers cannot guarantee the accuracy of any information about the
application of operative techniques and medications contained in this book. In every individual
case the user must check such information by consulting the relevant literature.

SPIN 10634047 Printed on acid-free paper


PREFACE

This volume, Data Science, Classification, and Related Methods, contains a


selection of papers presented at the Fifth Conference of the International
Federation of Classification Societies (IFCS-96), which was held in Kobe, Japan,
from March 27 to 30, 1996.

The volume covers a wide range of topics and perspectives in the growing field
of data science, including theoretical and methodological advances in domains
relating to data gathering, classification and clustering, exploratory and
multivariate data analysis, and knowledge discovery and seeking.

It gives a broad view of the state of the art and is intended for those in the
scientific community who either develop new data analysis methods or gather
data and use search tools for analyzing and interpreting large and complex data
sets. Presenting a wide field of applications, this book is of interest not only to
data analysts, mathematicians, and statisticians but also to scientists from many
areas and disciplines concerned with complex data: medicine, biology, space
science, geoscience, environmental science, information science, image and
pattern analysis, economics, statistics, social sciences, psychology, cognitive
science, behavioral science, marketing and survey research, data mining, and
knowledge organization.

Data Science, Classification, and Related Methods contains 85 invited and


contributed refereed papers presented during IFCS-96. Four preceding
conferences were held in Aachen (Germany), Charlottesville (U.S.A.),
Edinburgh (U.K.), and Paris (France). This fifth IFCS-96 conference was
convened at the International Conference Center, Kobe, under the sponsorship
of the Ministry of Education, Science, Sports and Culture of Japan and the
Behaviormetric Society of Japan, and resulted from the close cooperation
between the following ten Members of the IFCS:

British Classification Society (BCS)
Classification Society of North America (CSNA)
Gesellschaft für Klassifikation (GfKl)
Japanese Classification Society (JCS)
Jugoslovenska Sekcija za Klasifikacije (JSK)
Société Francophone de Classification (SFC)
Società Italiana di Statistica (SIS)
Vereniging voor Ordinatie en Classificatie (VOC)
Section on Classification and Data Analysis of
Polish Statistical Society (SKAD)
Associação Portuguesa de Classificação e Análise de Dados (CLAD).


IFCS-96 was organized by the IFCS-96 Organizing Committee under the


auspices of the Japanese Classification Society. In addition, the Korean
Classification Society (KCS) joined the conference as a new member of the
IFCS.

Moreover, the organizers appreciated very much the co-sponsorship and


support of twenty-one related academic societies in Japan:

Biogeographic Society of Japan


Japan Association for Medical Informatics
Japan Society for Fuzzy Theory and Systems
Japan Society of Forest Planning
Japan Society of Medical Electronics and Biological Engineering
Japan Society of Plant Taxonomists
Japan Statistical Society
Japanese Society of Applied Statistics
Japanese Society of Computational Statistics
Japanese Wildlife Research Society
The Biometric Society of Japan
The Botanical Society of Japan
The Ecological Society of Japan
The Entomological Society of Japan
The Ichthyological Society of Japan
The Japanese Forestry Society
The Japanese Society for Quality Control
The Mammalogical Society of Japan
The Operations Research Society of Japan
The Society of Instrument and Control Engineers
Zoological Society of Japan

A software exhibition provided the opportunity for industrial companies,


research laboratories, and researchers to show their programs and data analysis
application software. The various prototypes from research laboratories
reflected the growing activity in this field.

The editors of this volume gratefully acknowledge the cooperation of many


colleagues who selected and reviewed papers or chaired sessions during the
conference. We also thank all the members of the International Scientific
Committee for their help and support. We very much appreciate the active
collaboration of all participants and authors, who came from more than nineteen
nations and who rendered possible the scientific success of IFCS-96.

The organizers of the conference are indebted to many industrial companies and
institutions that financially supported the conference:

Central Research Services, Inc.


Commemorative Association for the Japan World Exposition (1970)
Dentsu, Inc.
Government Housing Loan Corporation
Hitachi, Ltd.
Japan Tobacco, Inc.
Japan Travel Bureau (Foundation)

Kansai Electric Power Co., Inc.


Labourer's Health and Welfare Association
Marketing Service Co., Ltd.
Nikkei Research, Inc.
Portopia '81 Foundation
Research and Development, Inc.
Shin Joho Center, Inc.
Tokyo Electric Power Co., Inc.
Video Research, Ltd.
Yoron Kagaku Kyokai (Public Opinion Research Center)

Finally, the editors thank the staff of Springer-Verlag Tokyo for their support
and dedication and for the opportunity for publishing this volume in the series
Studies in Classification, Data Analysis, and Knowledge Organization.

Tokyo, June 1997

Chikio Hayashi
Noboru Ohsumi
CONFERENCE COMMITTEE

Conference President
Chikio Hayashi
The Institute of Statistical Mathematics
Professor Emeritus

Scientific Program Committee


Keiji Yajima (Chairperson)
Yutaka Tanaka (Co-chairperson)

International Scientific Program Committee


S. Aivazian, Russian Federation Willem J. Heiser, The Netherlands
Tomas Aluja-Banet, Spain Dekun Hu, People's Republic of China
Jean-Pierre Barthelemy, France Natale Carlo Lauro, Italy
Vladimir Batagelj, Slovenia Jae Chang Lee, Korea
Hans H. Bock, Germany Fred R. McMorris, U.S.A.
Srdjan Bogosavljevic, Yugoslavia Jacqueline J. Meulman, The Netherlands
Edwin Diday, France Konstantin Momirovic, Yugoslavia
Yadolah Dodge, Switzerland Clive M. Moncrieff, U. K.
Brian Everitt, United Kingdom Fernando Costa Nicolau, Portugal
Anuška Ferligoj, Slovenia Alfredo Rizzi, Italy
Wolfgang Gaul, Germany Michael Windham, U.S.A.

Local Scientific Program Committee


Shuhei Aida Sachiko Matsui Setsuo Suoh
Hirotugu Akaike Nobuhiro Minaka Yoshio Takane
Chooichiro Asano Hideo Miyahara Kiyoshi Tanaka
Yasumasa Baba Shunichi Miyai Yutaka Tanaka
Sadao Fujimura Masahiro Mizuta Tomoyuki Tarumi
Masashi Goto Masakatsu Murakami Mikinori Tsuiki
Atsuhiro Hayashi Yasuo Ohashi Shoichi Ueda
Chikio Hayashi Noboru Ohsumi Junzo Watada
Manabu Ichino Sumimasa Ohtsuka Nagasumi Yago
Tadashi Imaizumi Keiichi Onoyama Keiji Yajima
Shuichi Iwatsubo Seiroku Sakai Haruo Yanai
Sadanori Konishi Yoshiharu Sato Mitsu Yoshimura
Koji Kurihara Shingo Shirahata Tadashi Yoshizawa
Kazufumi Manabe Masae Shiyomi
Yoshiro Matsuda Meiko Sugiyama

LOCAL EDITORIAL BOARD
Chikio Hayashi The Institute of Statistical Mathematics
Keiji Yajima Science University of Tokyo
Noboru Ohsumi The Institute of Statistical Mathematics
Yutaka Tanaka Okayama University
Yasumasa Baba The Institute of Statistical Mathematics
Tadashi Imaizumi The Institute of Management and Information Science
Atsuhiro Hayashi The National Center for University Entrance Examination
Shoichi Ueda Ryukoku University

LIST OF REVIEWERS

Tomas Aluja-Banet Kiyoshi Karube Gavin J. S. Ross


Helena Bacelar-Nicolau Michiharu Kimura Yoshiharu Sato
Jean-Pierre Barthelemy Tsutomu Komazawa Shingo Shirahata
Vladimir Batagelj Sadanori Konishi Masae Shiyomi
Hans Hermann Bock Pieter M. Kroonenberg Meiko Sugiyama
Sergio Bolasco Koji Kurihara Setsuo Suoh
Hamparsum Bozdogan Natale Carlo Lauro Setsuko Takakura
J. Douglas Carroll Ludovic Lebart Yoshio Takane
Naohito Chino Israël-César Lerman Kiyoshi Tanaka
Edwin Diday Jacqueline J. Meulman Yutaka Tanaka
Jean-François Durand Nobuhiro Minaka Takahiro Tsuchiya
Bernard Fichet Hideo Miyahara Mikinori Tsuiki
Sadao Fujimura Sadaaki Miyamoto Shoichi Ueda
Wolfgang Gaul Masahiro Mizuta Bernard Van Cutsem
Allan Drummond Gordon Francesco Mola Maurizio Vichi
John C. Gower Masakatsu Murakami Milan Vlach
Andre Hardy Takashi Murakami Junzo Watada
Chikio Hayashi Yoshiteru Nakamori Suzanne Winsberg
Fumi Hayashi Shizuhiko Nishisato Nagasumi Yago
Willem J. Heiser Akinori Okada Keiji Yajima
Tu Bao Ho Noboru Ohsumi Kazue Yamaoka
Manabu Ichino Keiichi Onoyama Haruo Yanai
Tadashi Imaizumi Atsushi Ootaki Ryozo Yoshino
Shuichi Iwatsubo Sung H. Park Tadashi Yoshizawa

TABLE OF CONTENTS

Preface ..................................................................................................................... V
Conference Committee ............................................................................................. VIII
Local Editorial Board ............................................................................................... IX
List of Reviewers ...................................................................................................... IX

Part I: General Aspects of Data Science


Probabilistic Aspects in Classification
Hans H. Bock .................................................................................................. 3
Cluster Validation
A. D. Gordon .................................................................................................. 22
What is Data Science? - Fundamental Concepts and a Heuristic Example
Chikio Hayashi ................................................................................................ 40
Fitting Graphs and Trees With Multidimensional Scaling Methods
Willem J. Heiser .............................................................................................. 52
Classification and Data Analysis in Finance
Krzysztof Jajuga .............................................................................................. 63
How to Validate Phylogenetic Trees?: A Stepwise Procedure
François-Joseph Lapointe ............................................................................... 71
Some Trends in the Classification of Variables
F. Costa Nicolau and H. Bacelar-Nicolau ........................................................ 89
Convexity Methods in Classification
Jean-Paul Rasson ............................................................................................ 99

Part II: Methodologies in Classification


Evaluation and Assessment Procedures
How Many Clusters? - An Investigation of Five Procedures for
Detecting Nested Cluster Structure
A. D. Gordon .................................................................................................. 109
Partitional Cluster Analysis With Genetic Algorithms: Searching
for the Number of Clusters
J. A. Lozano, P. Larrañaga, and M. Graña ...................................................... 117
Explanatory Variables in Classifications and the Detection of
the Optimum Number of Clusters
János Podani .................................................................................................. 125


Random Dendrograms for Classifiability Testing


Bernard Van Cutsem, Bernard Ycart ............................................................... 133
The Lp-Product of Ultrametric Spaces and the Corresponding Product of Hierarchies
Bernard Fichet ................................................................................................ 145
Towards Comparison of Decomposable Systems
Mark Sh. Levin ................................................................................................ 154
Performance of Eight Dissimilarity Coefficients to Cluster a Compositional Data Set
Maria Cristina Martin ..................................................................................... 162
Topics in Clustering and Classification
Consensus of Hierarchical Classifications
Bruno Simeone and Maurizio Vichi ................................................................. 170
On the Minimum Description Length (MDL) Principle for Hierarchical Classifications
Peter C. Bryant ............................................................................................... 182
Consensus Methods for Pyramids and Other Hypergraphs
J. Lehel, F. R. McMorris, and R. C. Powers .................................................... 187
On the Behavior of Splitting Criteria for Classification Trees
Roberta Siciliano and Francesco Mola ........................................................... 191
Fitting Pre-Specified Blockmodels
Vladimir Batagelj, Anuška Ferligoj, and Patrick Doreian ................................. 199
Robust Impurity Measures in Decision Trees
Tomas Aluja-Banet and Eduard Nafria ........................................................... 207
Induction of Decision Trees Based on the Rough Set Theory
Tu Bao Ho, Trong Dung Nguyen, and Masayuki Kimura ................................. 215
Visualizing Data in Tree-Structured Classification
Francesco Mola and Roberta Siciliano ........................................................... 223
Adaptive Cluster Analysis Techniques - Software and Applications
Hans-Joachim Mucha, Rainer Siegmund-Schultze, and Karl Dubon ................ 231

Part III: Classification and Discrimination


Statistical Approaches for Classification Problems
The Most Random Partition of a Finite Set and Its Application to Classification
Masaaki Sibuya ............................................................................................... 241
A Mixture Model to Classify Individual Profiles of Repeated Measurements
Toshiro Tango ................................................................................................. 247
Irregularly Spaced AR (ISAR) Models
Jeffrey S. C. Pai, Wolfgang Polasek, and Hideo Kozumi ................................. 255
Discrimination and Pattern Analysis
Two Types of Partial Least Squares Method in Linear Discriminant Analysis
Hyun Bin Kim and Yutaka Tanaka ................................................................... 261
Resampling Methods for Error Rate Estimation in Discriminant Analysis
Masayuki Honda and Sadanori Konishi ........................................................... 268
A Short Overview of the Methods for Spatial Data Analysis
Masaharu Tanemura ....................................................................................... 276

Choice of Multiple Representative Signatures for On-Line Signature Verification


Using a Clustering Procedure
Isao Yoshimura, Mitsu Yoshimura, and Shin-ichi Matsuda .............................. 284

Part IV: Related Approaches for Classification


Fuzzy and Probabilistic Modeling Methods
Algorithms for L1 and Lp Fuzzy c-Means and Their Convergence
Sadaaki Miyamoto and Yudi Agusta ................................................................. 295
General Approach to the Construction of Measures of Fuzziness of Fuzzy K-Partitions
Slavka Bodjanova ............................................................................................ 303
Additive Clustering Model and Its Generalization
Mika Sato and Yoshiharu Sato ........................................................................ 312
A Proposal of an Extended Model of ADCLUS Model
Tadashi Imaizumi ............................................................................................ 320
Spatial Clustering and Neural Networks
Comparison of Pruning Algorithms in Neural Networks
Yoshihiko Hamamoto, Toshinori Hase, Satoshi Nakai, and Shingo Tomita ...... 328
Classification Method by Using the Associative Memories in Cellular Neural Networks
Akihiro Kanagawa, Hiroaki Kawabata, and Hiromitsu Takahashi ................... 334
Application of Kohonen Maps to the GPS Stochastic Tomography of the Ionosphere
M. Hernandez-Pajares, J. M. Juan, and J. Sanz ............................................... 341
Symbolic and Conceptual Data Analysis
Capacities, Credibilities in Analysis of Probabilistic Objects by Histograms and Lattices
Edwin Diday and Richard Emilion .................................................................. 353
Symbolic Pattern Classifiers Based on the Cartesian System Model
Manabu Ichino and Hiroyuki Yaguchi ............................................................. 358
Extension Based Proximities Between Constrained Boolean Symbolic Objects
Francisco de A. T. de Carvalho ....................................................................... 370
Towards a Normal Symbolic Form
Marc Csernel and Francisco de A.T. de Carvalho .......................................... 379
The SODAS Project: A Software for Symbolic Data Analysis
Georges Hébrail ............................................................................................. 387
Classification Structures for Cognitive Maps
Stephen C. Hirtle and Guoray Cai .................................................................. 394
Unsupervised Concept Learning Using Rough Concept Analysis
Tu Bao Ho ...................................................................................................... 404
Implicative Statistical Analysis
R. Gras, H. Briand, P. Peter, and J. Philippe .................................................. 412

Part V: Correspondence Analysis, Quantification Methods,


and Multidimensional Scaling
Correspondence Analysis and Its Application
Correspondence Analysis, Discrimination, and Neural Networks
Ludovic Lebart ............................................................................................... 423

Exploratory Data Analysis for Hayashi's Quantification Method III by Graphics


Tsutomu Komazawa and Takahiro Tsuchiya ................................................... 431
Exploring Multidimensional Quantification Space
Shizuhiko Nishisato ........................................................................................ 441
Homogeneity Analysis for Partitioning Qualitative Variables
Takahiro Tsuchiya .......................................................................................... 452
Determining the Distance Index, II
Matevž Bren and Vladimir Batagelj ................................................................ 460
Classification of Textual Data
Meta-Data and Strategies of Textual Data Analysis: Problems and Instruments
Sergio Bolasco ................................................................................................ 468
Clustering of Texts Using Semantic Graphs. Application to
Open-Ended Questions in Surveys
Monica Becue Bertaut and Ludovic Lebart ..................................................... 480
How to Find the Nearest by Evaluating Only Few?
-Clustering Techniques Used to Improve the Efficiency of
an Information Retrieval System Based on Distributional Semantics
Martin Rajman and Arnon Rungsawang ......................................................... 488
Multidimensional Scaling
Fitting the CANDCLUS / MUMCLUS Models with Partitioning and Other Constraints
J. Douglas Carroll and Anil Chaturvedi ......................................................... 496
A Distance-Based Biplot for Multidimensional Scaling of Multivariate Data
Jacqueline J. Meulman ................................................................................... 506
Latent-Class Scaling Models for the Analysis of Longitudinal Choice Data
Ulf Böckenholt ................................................................................................ 518

Part VI: Multivariate and Multidimensional Data Analysis


Multidimensional Data Analysis
Nonlinear Multivariate Analysis by Neural Network Models
Yoshio Takane ................................................................................................ 527
Generalized Canonical Correlation Analysis with Linear Constraints
Haruo Yanai ................................................................................................... 539
Principal Component Analysis Based on a Subset of Variables for Qualitative Data
Yuichi Mori, Yutaka Tanaka, and Tomoyuki Tarumi ....................................... 547
Missing Data Imputation in Multivariate Analysis
Mario Romanazzi ............................................................................................ 555
Multiway Data Analysis
Recent Developments in Three-Mode Factor Analysis
- Constrained Three-Mode Factor Analysis and Core Rotations
Henk A. L. Kiers ............................................................................................. 563
Tucker2 as a Second-Order Principal Component Analysis
Takashi Murakami .......................................................................................... 575
Parallel Factor Analysis with Constraints on the Configurations: An Overview
Pieter M. Kroonenberg and Willem J. Heiser .................................................. 587

Non-Linear Modeling and Visual Treatment


Regression Splines for Multivariate Additive Modeling
Jean-François Durand .................................................................................... 598
Bounded Algebraic Curve Fitting for Multidimensional Data
Using the Least-Squares Distance
Masahiro Mizuta ............................................................................................. 610
Using the Wavelet Transform for Multivariate Data Analysis and
Time Series Analysis
Fionn Murtagh and Alexandre Aussem ........................................................... 617
Visual Manipulation Environment for Data Analysis System
Masahiro Mizuta and Hiroyuki Minami .......................................................... 625
Human Interface for Multimedia Database With Visual Interaction Facilities
Toshikazu Kato ............................................................................................... 632

Part VII: Case Studies of Data Science


Social Science and Behavioral Science
Proposition of a New Paradigm for a Scientific Classification for
Leadership Behavior Research Perspective
Jyuji Misumi ................................................................................................... 647
Structure of Attitude Toward Nuclear Power Generation
Kiyoshi Karube, Chikio Hayashi, and Shinichi Morikawa ............................... 653
Research Concerning the Consciousness of Women's Attitude Toward Independence
Setsuko Takakura ............................................................................................ 661
A Cross-National Analysis of the Relationship Between Genderedness in
the Legal Naming of Same-Sex Sexual/Intimate Relationships and the Gender System
Saori Kamano ................................................................................................. 669
Management Science and Marketing Science
A Constrained Clusterwise Regression Procedure for Benefit Segmentation
Daniel Baier ................................................................................................... 676
Application of Classification and Related Methods to SQC Renaissance
in Toyota Motor
Kakuro Amasaka ............................................................................................. 684
Application of Statistical Binary Tree Analysis to Quality Control
Atsushi Ootaki ................................................................................................ 696
Analysis of Preferences for Telecommunication Services in Each Area
Tohru Ueda and Daisuke Satoh ...................................................................... 708
Effects of End-Aisle Display and Flier on the Brand-Switching of Instant Coffee
Akinori Okada ................................................................................................ 716
Environmental, Ecological, Biological, and Medical Sciences
On the Classification of Environmental Data in the Bavarian Environment
Information System Using an Object-Oriented Approach
Erich Weihs .................................................................................................... 728
Cluster Analysis of Associated Words Obtained From a Free Response Test
on Tokyo Bay
Shinsuke Suga, Ko Oi, and Sadaaki Miyamoto ................................................ 736
Data Analysis of Deer-Train Collisions in Eastern Hokkaido, Japan
Keiichi Onoyama, Noboru Ohsumi, Naoko Mitsumochi,
and Tsuyoshi Kishihara .................................................................................. 746
Comparison of Some Numerical Data Between the belisama Group of
the Genus Delias Hübner (Insecta: Lepidoptera) From Bali Island, Indonesia
Sadaharu Morinaka ........................................................................................ 752
A Method for Classifying Unaligned Biological Sequences
B. Tallur and J. Nicolas .................................................................................. 758
An Approach to Determine the Necessity of Orthognathic Osteotomy or
Orthodontic Treatment in a Cleft Individual - Comparison of
Craniomaxillo-Facial Structures in Borderline Cases by Roentgenocephalometrics
Sumimasa Ohtsuka, Fumiye Ohmori, Kazunobu Imamura,
Yoshinobu Shibasaki, and Noboru Ohsumi ...................................................... 766
Data Analysis for Quality of Life and Personality
Kazue Yamaoka and Mariko Watanabe ........................................................... 774

Contributor Index ..................................................................................................... 779


Part I

General Aspects of Data Science


Probabilistic Aspects in Classification
Hans H. Bock
Institute of Statistics, Technical University of Aachen,
Wüllnerstr. 3, D-52056 Aachen, Germany

Summary: This paper surveys various ways in which probabilistic approaches can be
useful in partitional ('non-hierarchical') cluster analysis. Four basic distribution models
for 'clustering structures' are described in order to derive suitable clustering strategies.
They are exemplified for various special distribution cases, including dissimilarity data
and random similarity relations. A special section describes statistical tests for checking
the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and
comparing it to standard clustering situations (comparative assessment of classifications,
CAC).

1. Introduction
Consider a finite set O = {1, ..., n} of objects whose properties are characterized by
some observed or recorded data (a table, a data matrix, verbal descriptions) such
that 'similarities' or 'dissimilarities' which may exist among these objects can be
determined by these data. Cluster analysis provides formal algorithms for subdividing
the set O into a suitable number of homogeneous subsets, called clusters (classes,
groups, etc.), such that all objects of the same cluster show approximately the same
class-specific properties while objects belonging to different classes behave differently
in terms of the underlying data (separation of clusters).
A range of clustering algorithms is based on a probabilistic point of view: Data
are considered as realizations of random variables, thus influenced by random errors
and natural fluctuations (variations), and even the finite set of objects O may be
considered as a random sample from an infinite universe (super-population Π). Then
any class or classification of objects must necessarily be defined in terms of probability
distributions for the data. This paper presents a survey on this probability-based part
of cluster analysis. While we can point only briefly to various topics, a more detailed
presentation with numerous references may be found in Bock (1974, 1977, 1985, 1987,
1989a, 1994, 1996a,b,c,d), Jain and Dubes (1988) and Milligan (1996); also see various
other papers of this volume, e.g., by Gordon, Hardy, Lapointe and Rasson.
Let us first specify the notation: The set of objects O = {1, ..., n} shall be partitioned
into a suitable number m of disjoint classes C_1, ..., C_m ⊂ O resulting in an m-partition
C = (C_1, ..., C_m) of O. In fact, we focus on partitional clusterings here, thus neglecting
hierarchical or overlapping classifications. We will consider three types of data:

1. A data matrix X = (x_{kj})_{n×p} = (x_1, ..., x_n)' where for each object k ∈ O, p
(quantitative or qualitative) variables have been sampled and compiled into the
observation vector x_k = (x_{k1}, ..., x_{kp})' (' denotes the transposition of a matrix).

2. A dissimilarity matrix D = (d_{kl})_{n×n} where d_{kl} quantifies the dissimilarity
existing between two objects k and l (e.g., d_{kl} = ||x_k − x_l||²); typically, we have
0 = d_{kk} ≤ d_{kl} = d_{lk} < ∞ for all k, l ∈ O.

3. A similarity relation S = (s_{kl})_{n×n} where s_{kl} = 1 or 0 if the objects k and l are
considered to be 'similar' or 'dissimilar', respectively.


In our probabilistic context, the observed x_k, d_{kl} and s_{kl} will be realizations of suitable
random variables X_k, D_{kl} and S_{kl}, respectively, whose probability distributions describe
the type and extent of the underlying clustering (or non-clustering) structure.
In contrast to deterministic (e.g., algorithmic or exploratory) approaches to cluster
analysis, this probabilistic framework can be helpful for the following purposes:
(1) Modeling clustering structures allowing for various shapes of clusters.
(2) Providing clustering criteria under more or less specified clustering assumptions
(to be distinguished from clustering algorithms that optimize these criteria).
(3) Describing and quantifying the performance or optimality of clustering methods
(e.g., in a decision-theoretic framework; see Remark 2.2).
(4) Investigating the asymptotic behaviour of clustering methods if, e.g., n approaches
∞ under a mixture model or under a 'randomness' hypothesis.
(5) Testing for the 'homogeneity' or 'randomness' of the data, either versus a general
hypothesis of 'non-randomness' or versus special clustering alternatives,
thus checking for the existence of a hidden clustering of the objects.
(6) Testing for the relevance of a calculated classification C* that has been obtained
by a special clustering algorithm.
(7) Determining the 'true' or an appropriate number m of classes (see Remark 2.1).
(8) Assessing the relevance of a special cluster C_i of objects that has been obtained
from a clustering algorithm (Gordon 1994).
In the following we will survey some of these topics in detail and consider several
special clustering models.

2. Some probabilistic clustering models


A major and basic problem of classification theory arises from the difficulty of defining
the concept of a 'class', which may be approached by philosophical, mathematical,
probabilistic or heuristic tools. Probabilistic approaches provide a flexible way of
describing the relationship between objects and classes, not just by building each
class from the set of objects that share the same descriptor values or feature combination
(as is common, e.g., in concept theory and monothetic classification), but
by allowing some variation between the objects of the same class, whose properties
may deviate, to some limited and random degree, from the typical class profile. In
technical terms this is realized by characterizing each class (classification, clustering
structure) by suitable probability distributions for the sampled data.
There are four basic probabilistic models that are often used in this context. They
will be formally described for the important case when the data is provided by n
independent random p-dimensional feature vectors X_1, ..., X_n (for dissimilarity and
similarity data see the sections 2.1.4 and 2.1.5).
2.1 The fixed-partition clustering model H_m
A fixed-partition model H_m assumes that there exist a fixed, but unknown partition
C = (C_1, ..., C_m) of the set O into m non-empty classes C_1, ..., C_m ⊂ O and a
system θ = (ϑ_1, ..., ϑ_m) of (unknown) parameter values ϑ_1, ..., ϑ_m ∈ R^s describing the
properties of these classes such that

X_k \sim f(\,\cdot\,; \vartheta_i) \quad \text{for all } k \in C_i \text{ and } i = 1, \ldots, m, \qquad (2.1)

where f(x; ϑ) originates from a given parametric family of distribution densities. In
this context, 'clustering' consists in estimating the unknown parameters, i.e., the
number of classes m ≥ 1, the m-partition C = (C_1, ..., C_m) and the parameter system
θ = (ϑ_1, ..., ϑ_m). Note that for n data vectors we have (at least) ms + n unknown
parameters, viz. ϑ_1, ..., ϑ_m and the class indicators l_1, ..., l_n where l_k = i iff k ∈ C_i.
Classical statistics provides various estimation strategies from which we consider the
maximum likelihood method here (see also Remark 2.2). Assuming m to be known,
maximizing the joint likelihood of the observed data vectors x_1, ..., x_n is equivalent
to:

g(C, \theta) := \sum_{i=1}^{m} \sum_{k \in C_i} \big( -\log f(x_k; \vartheta_i) \big) \;\to\; \min_{C, \theta} \qquad (2.2)

(maximum-likelihood classification approach). This is a combined combinatorial and
analytic optimization problem whose solution(s) can be approximated by the well-
known iterative k-means algorithm which partially minimizes g with respect to θ and
C in turn. In fact, partial minimization with respect to the parameter ϑ provides the
maximum likelihood (m.l.) estimates θ̂(C) := (ϑ̂_1, ..., ϑ̂_m) for a given C, thus reducing
(2.2) to the combinatorial problem

\sum_{i=1}^{m} \sum_{k \in C_i} \big( -\log f(x_k; \hat{\vartheta}_i) \big) \;\to\; \min_{C}. \qquad (2.3)

On the other hand, minimizing g with respect to C for a given parameter vector θ
yields the maximum-probability-assignment partition C(θ) := Ĉ := (Ĉ_1, ..., Ĉ_m) with
classes

\hat{C}_i := \{\, k \in O \mid f(x_k; \vartheta_i) = \max_{\nu = 1, \ldots, m} f(x_k; \vartheta_\nu) \,\}, \quad i = 1, \ldots, m \qquad (2.4)

(with suitable rules for avoiding ties and empty classes). C(θ) can often be interpreted
as a minimum-distance partition and will be termed in this way here. Using (2.4),
the optimization problem (2.2) reduces to

\gamma(\theta) := \sum_{k=1}^{n} \min_{\nu = 1, \ldots, m} \{ -\log f(x_k; \vartheta_\nu) \} \;\to\; \min_{\theta}. \qquad (2.5)

Thus, fixed-partition models provide, simultaneously, three equivalent clustering criteria
(2.2), (2.3) and (2.5) and a comfortable (and fast converging) optimization
strategy (k-means algorithm). We must emphasize, however, that in this formulation
any resulting (optimal) class C_i* is not necessarily a most 'homogeneous' one, described
by good internal properties or even well separated from other classes, but only
an element of an m-partition C* that optimizes the overall criterion (2.3), i.e., some
average homogeneity of classes. This fact explains why small classes are typically
neglected in this approach (except for those far away from the rest of the data,
e.g., 'outlier classes'). Empirical practice as well as theoretical investigations (Bock
1968, 1974) show that criteria of this type have the tendency to produce equally-sized
classes.
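
As a minimal illustration of the alternating optimization, the following Python sketch
implements the k-means iteration between the minimum-distance assignment (2.4) and the
m.l. estimation step for the spherical-normal case of section 2.1.1 below, where
-log f(x_k; ϑ_i) equals ||x_k − ϑ_i||²/(2σ²) up to constants; the function name and the
initialization by randomly chosen data points are merely common, illustrative choices.

```python
import numpy as np

def kmeans_ml_classification(X, m, n_iter=100, seed=0):
    """Alternating minimization of the m.l. classification criterion (2.2)
    for spherical normal classes, where it reduces to the SSQ criterion:
    assignment step = minimum-distance partition (2.4),
    estimation step = class means (m.l. estimates of the theta_i)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = X[rng.choice(n, size=m, replace=False)]   # illustrative initialization
    for _ in range(n_iter):
        # assign each object to the class with maximal f(x_k; theta_i),
        # i.e. minimal squared distance in the spherical-normal case
        dist2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        new_theta = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else theta[i] for i in range(m)])
        if np.allclose(new_theta, theta):
            break
        theta = new_theta
    g = ((X - theta[labels]) ** 2).sum() / n          # SSQ criterion value, cf. (2.7)
    return labels, theta, g
```
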
We illustrate the flexibility of the fixed-partition approach by five special distribution
models (see also Bock 1974, 1987, 1996a,b,c,d).
Remark 2.1: The determination of the unknown (or: a suitable) number m of classes
is a conceptually and technically difficult problem that is intensively discussed, e.g.,
in Bock (1968, 1974, 1996a), Gordon (1997b), Milligan (1981, 1996) and Milligan and
Cooper (1985), but will not be investigated here.

2.1.1 The normal distribution case and the variance criterion

Considering the case of quantitative data vectors x_1, ..., x_n ∈ R^p, let each class C_i be
described by a p-dimensional spherical normal distribution N(μ_i, σ²E_p) with class-
specific expectations μ_1, ..., μ_m ∈ R^p, σ² > 0 (known or unknown), and E_p the p × p
unit matrix. Then the criteria (2.2) and (2.3) reduce to the equivalent well-known
SSQ or variance criteria:

g(C, \theta) = \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \| x_k - \vartheta_i \|^2 \;\to\; \min_{C, \theta} \;=:\; g_m^* \qquad (2.6)

\frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \| x_k - \bar{x}_{C_i} \|^2 \;\to\; \min_{C} \;=\; g_m^*. \qquad (2.7)

Similar normal distribution models assume an N_p(μ_i, Σ) or N_p(μ_i, Σ_i) distribution in
C_i allowing for (possibly class-specific) dependencies among the p components (Bock
1974). More general approaches characterize each class C_i by a hyperplane H_i in R^p
(instead of by a single point μ_i only): principal component clustering (Bock 1969, 1974;
Diday 1973: analyse factorielle typologique) and regression clustering (Bock 1969),
or constrain the class centers μ_1, ..., μ_m to belong to the same (unknown) hyperplane
H ⊂ R^p of a small dimension s, say, such that the clustering structure is essentially
low-dimensional (projection pursuit clustering; Bock 1987).
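
Standard software computes the variance criterion directly: in the following minimal
example, the inertia_ attribute of scikit-learn's KMeans equals n · g_m^* from (2.7) for
the returned partition (the simulated two-class data serve only as an illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# two well-separated spherical normal classes in R^2
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(100, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.inertia_ / len(X))   # value of the variance criterion (2.7)
```
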
Remark 2.2: Clustering has been investigated in a decision-theoretic framework as
well, thereby looking for a clustering criterion that minimizes the expected loss incurred
by missing a hidden clustering structure (Binder 1968; Bock 1968, 1972, 1974;
Hayashi 1974, 1993; Bernardo 1994). For example, Bock derived various Bayesian
methods for normal distribution clustering models under suitable prior assumptions
on the underlying parameters and showed, e.g., that (2.7) is asymptotically optimum
in some cases. Similarly, Hayashi investigated the minimaxity of clustering criteria.

2.1.2 A semi-parametric convex cluster model

There are instances (e.g., in pattern recognition and image exploration) where clusters
can be characterized by uniform distributions U(D) concentrated on some convex set
D ⊂ R^p. A corresponding fixed-partition model for X_1, ..., X_n involves an m-partition
C and m unknown non-overlapping convex domains ('parameters') D_1, ..., D_m ⊂ R^p
such that X_k ~ U(D_i) for all k ∈ C_i. The resulting m.l. clustering criterion (2.3) is
given by

\sum_{i=1}^{m} |C_i| \cdot \log \mathrm{vol}_p(H(C_i)) \;\to\; \min_{C} \qquad (2.8)

where the m.l. estimate of D_i is just the convex hull D̂_i = H(C_i) := conv{x_k | k ∈ C_i}
of the data points belonging to the class C_i, and minimization is over all m-partitions
with non-overlapping H(C_1), ..., H(C_m) with positive volumes (Bock 1997; see also
section 3.2.3). Note that such a model is applicable only if the presumptive clusters
are clearly separated by some empty space.
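
For a given partition, the criterion (2.8) is easily evaluated with computational-geometry
tools; the following minimal sketch uses SciPy's convex hulls and simply assumes, without
checking, that the resulting hulls do not overlap.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_cluster_criterion(X, labels):
    """Evaluate sum_i |C_i| * log vol_p(H(C_i)) from (2.8).
    Each class needs at least p+1 affinely independent points,
    otherwise its hull is degenerate and the criterion is undefined."""
    crit = 0.0
    for i in np.unique(labels):
        pts = X[labels == i]
        hull = ConvexHull(pts)             # hull.volume is the area for p = 2
        crit += len(pts) * np.log(hull.volume)
    return crit
```
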
2.1.3 Qualitative data, contingency tables and entropy criteria
In the case of qualitative data where the j-th component X_{kj} of X_k takes its values
in a finite set X_j of alternatives (e.g., X_j = {0, 1} in the binary case), the observed
vectors x_1, ..., x_n belong to the Cartesian product X := ∏_{j=1}^p X_j which corresponds
to the cells of the p-dimensional contingency table N = (n_ξ)_{ξ∈X} = (n_{ξ_1,...,ξ_p})
that contains as its entries the number of objects k ∈ O with the same data vector
x_k = ξ = (ξ_1, ..., ξ_p)' ∈ X. Thus any clustering C of O corresponds to a decomposition
of N = N_1 + ... + N_m into m class-specific sub-tables N_1, ..., N_m.
Loglinear models provide a convenient tool for describing the distribution of multivariate
qualitative data. A loglinear model for a vector X involves various parameters
(stacked into a vector ϑ) which are distinguished into main effects (roughly describing
the size of marginal frequencies) and interaction parameters (of various orders)
that describe the association or dependencies that might exist among the p components
of X. Then the distribution density (probability function) of X takes the
form f(ξ; ϑ) = P(X = ξ) = c(ϑ) · exp{z(ξ)'ϑ} where z(ξ) is a binary dummy vector
that picks from ϑ the interaction parameters which correspond to the cell ξ of the
contingency table (c(ϑ) is a norming factor). - Assuming a fixed-partition model
with class-specific interaction vectors ϑ_1, ..., ϑ_m for the m classes, the m.l. method
yields the entropy clustering criterion

\sum_{i=1}^{m} |C_i| \cdot H(X, \hat{\vartheta}_i(C)) \;\to\; \min_{C} \qquad (2.9)

where H(X, \vartheta_i) := -\sum_{\xi \in X} f(\xi; \vartheta_i) \cdot \log f(\xi; \vartheta_i) > 0 is Shannon's entropy for the
distribution density f(·; ϑ_i) and ϑ̂_i = ϑ̂_i(C) the m.l. estimate for ϑ_i in the class C_i. -
This and similar entropy criteria (such as logistic regression clustering) are considered
in Bock (1986, 1993, 1994, 1996a,d) and reformulated in Celeux and Govaert (1991).
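
As a minimal special case, assume a loglinear model without interaction terms, i.e.,
class-wise independent binary components; then f(ξ; ϑ_i) factorizes, the m.l. estimates
are the class-wise marginal frequencies, and H(X, ϑ̂_i) is a sum of p binary entropies,
as in the following sketch (the independence assumption is a simplification chosen here
for illustration, not part of the general criterion).

```python
import numpy as np

def entropy_criterion_binary(X, labels, eps=1e-12):
    """Criterion (2.9) for binary data under class-wise independence:
    sum_i |C_i| * H(X, theta_i), with H a sum of p binary entropies."""
    crit = 0.0
    for i in np.unique(labels):
        Xi = X[labels == i]
        q = Xi.mean(axis=0)                     # m.l. marginal probabilities
        h = -(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
        crit += len(Xi) * h.sum()               # |C_i| * H(X, theta_i)
    return crit
```
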

2.1.4 Clustering models for dissimilarity data

The fixed-partition approach can also be used in the case of dissimilarity-based clustering
where the data is a matrix D = (D_{kl})_{n×n} of random dissimilarities D_{kl}. Recall
that a basic model for describing a 'no-clustering' or 'randomness' situation assumes
that all \binom{n}{2} variables D_{kl} with k < l, whilst being independent, have the same
distribution density f(d), like a suitable generic variable D* ≥ 0, say (e.g., an exponential
distribution Exp(1)). In this context, a clustering structure involving a partition
C = (C_1, ..., C_m) will intuitively result if we shrink the dissimilarities between objects
in the same class C_i by a factor ϑ_ii > 0, and stretch the dissimilarities between
objects belonging to different classes C_i, C_j, i ≠ j, by another factor ϑ_ij > 0 (typically
ϑ_ii < ϑ_ij for all i, j). The resulting dissimilarity clustering model reads as follows:

D_{kl} \sim \vartheta_{ij} \cdot D^* \quad \text{for all } k \in C_i,\ l \in C_j. \qquad (2.10)

The corresponding m.l. clustering criterion is given by

g(C, \theta) := \sum_{1 \le i \le j \le m} \Big[ \sum_{k \in C_i,\, l \in C_j} \big( -\log f(d_{kl}/\vartheta_{ij}) \big) + n_{ij} \cdot \log \vartheta_{ij} \Big] \;\to\; \min_{C, \theta} \qquad (2.11)

where for i = j the inner sum is over k < l only, and n_{ij} = |C_i| \cdot |C_j| and n_{ii} = \binom{|C_i|}{2}
is the number of terms in the inner sum for i ≠ j and i = j, respectively. For
exponentially distributed dissimilarities with f(d) = e^{-d} for d > 0, this reduces to:

g(C, \theta) = \sum_{1 \le i \le j \le m} n_{ij} \cdot \big[ \bar{D}_{C_i,C_j}/\vartheta_{ij} + \log \vartheta_{ij} \big] \;\to\; \min_{C, \theta} \qquad (2.12)

where \bar{D}_{C_i,C_j} = n_{ij}^{-1} \sum_{k \in C_i,\, l \in C_j} d_{kl} is the average dissimilarity between two classes C_i
and C_j, and \bar{D}_{C_i,C_i} = n_{ii}^{-1} \sum_{k,l \in C_i,\, k<l} d_{kl} measures the heterogeneity of C_i. - Note
that the unconstrained m.l. estimate for ϑ_ij is given by ϑ̂_ij = \bar{D}_{C_i,C_j} such that (2.12)
reduces to the log-distance clustering criterion

g(C, \hat{\theta}) - \binom{n}{2} = \sum_{1 \le i \le j \le m} n_{ij} \log \bar{D}_{C_i,C_j} \;\to\; \min_{C}. \qquad (2.13)
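
For a given partition, the log-distance criterion (2.13) can be evaluated directly from a
symmetric dissimilarity matrix, as in the following minimal sketch.

```python
import numpy as np

def log_distance_criterion(D, labels):
    """Criterion (2.13): sum over class pairs i <= j of n_ij * log(Dbar_ij),
    where Dbar_ij is the average dissimilarity between C_i and C_j
    (pairs k < l when i = j)."""
    classes = list(np.unique(labels))
    crit = 0.0
    for a, i in enumerate(classes):
        ki = np.where(labels == i)[0]
        for j in classes[a:]:
            kj = np.where(labels == j)[0]
            if i == j:
                r, c = np.triu_indices(len(ki), k=1)
                vals = D[np.ix_(ki, ki)][r, c]
            else:
                vals = D[np.ix_(ki, kj)].ravel()
            if vals.size:
                crit += vals.size * np.log(vals.mean())
    return crit
```
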
2.1.5 Clustering models for random similarity relations
In this last example, we consider binary similarity data where two objects k and
l are considered either as being 'similar' (s_{kl} = 1, e.g. for k = l) or 'dissimilar'
(s_{kl} = 0). (Note that these data can be interpreted as a similarity or association graph
with n vertices and M := Σ_{k≠l} s_{kl} links or edges.) Intuitively, a corresponding
clustering model should be such that links are more likely inside the clusters than
between different clusters. This can be modeled by introducing linking probabilities
p_{ij} between the classes C_i and C_j of C (typically p_{ii} > p_{ij} for i ≠ j). The resulting
fixed-partition model for the \binom{n}{2} random and independent similarity indicators S_{kl}, k < l
(with S_{kk} = 1 and S_{kl} = S_{lk}) reads as follows:

P(S_{kl} = 1) = p_{ij} \quad \text{for all } k \in C_i,\ l \in C_j \qquad (2.14)

and leads to the m.l. clustering criterion:

g(C, p) := - \sum_{1 \le i \le j \le m} \big( N_{ij} \log p_{ij} + (n_{ij} - N_{ij}) \log(1 - p_{ij}) \big) \;\to\; \min_{C, p} \qquad (2.15)

where, for i < j, N_{ij} (N_{ii}) is the number of observed links s_{kl} = 1 between C_i and
C_j (inside C_i). If side constraints are neglected, p̂_{ij} := N_{ij}/n_{ij} is the m.l. estimate
for p_{ij} and can be substituted into (2.15). - The model (2.14) has been described by
Bock (1989b, 1996a,c); related models are known under the heading 'block models'
(see, e.g., Snijders and Nowicki 1996).
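
With the plug-in estimates p̂_ij = N_ij/n_ij, the maximized criterion (2.15) can be
computed for a given partition as in the following minimal sketch (the diagonal of S is
ignored, and empty blocks are skipped).

```python
import numpy as np

def blockmodel_criterion(S, labels, eps=1e-12):
    """Criterion (2.15) with plug-in p_ij = N_ij / n_ij for a symmetric
    0/1 similarity matrix S."""
    classes = list(np.unique(labels))
    crit = 0.0
    for a, i in enumerate(classes):
        ki = np.where(labels == i)[0]
        for j in classes[a:]:
            kj = np.where(labels == j)[0]
            if i == j:
                r, c = np.triu_indices(len(ki), k=1)
                links = S[np.ix_(ki, ki)][r, c]
            else:
                links = S[np.ix_(ki, kj)].ravel()
            n_ij, N_ij = links.size, links.sum()
            if n_ij:
                p = N_ij / n_ij
                crit -= N_ij * np.log(p + eps) + (n_ij - N_ij) * np.log(1 - p + eps)
    return crit
```
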
2.2 Random-partition clustering models and the mixture model H_m^{mix}
Clustered populations are often described by mixture models: The random vectors
X_1, ..., X_n are assumed to be independent, all with the same mixture density

f(x) = f(x; \theta, \pi) := \sum_{i=1}^{m} \pi_i f(x; \vartheta_i) \qquad (2.16)

which involves m class-specific parameters ϑ_1, ..., ϑ_m and m unknown probabilities
π_1, ..., π_m (with Σ_{i=1}^m π_i = 1). Whilst this model incorporates no explicit clustering
of objects, it is well known that it is obtained by a two-stage process where, in a
first step, the objects of O are sampled independently from a superpopulation Π that
is decomposed into m subpopulations Π_1, ..., Π_m with relative sizes π_i and described
by the parameters ϑ_i. This first step results in a non-observable random partition
C = (C_1, ..., C_m) of O that is characterized by the (random) class indicators I_1, ..., I_n
according to C_i = {k ∈ O | I_k = i}, i = 1, ..., m. Conditionally on I_k = i (i.e., k ∈ C_i),
the vector X_k is distributed with the density f(·; ϑ_i).
Classical mixture analysis concentrates on the estimation of the unknown parameters
π_1, ..., π_m, ϑ_1, ..., ϑ_m for a suitable number m of components, typically by maximizing
the log-likelihood

L(\theta, \pi) := \sum_{k=1}^{n} \log \Big( \sum_{i=1}^{m} \pi_i \cdot f(x_k; \vartheta_i) \Big) \;\to\; \max_{\theta, \pi} \qquad (2.17)

(McLachlan and Basford 1988, Titterington et al. 1985). Whilst this criterion involves
no classification of objects, such a classification Ĉ = (Ĉ_1, ..., Ĉ_m) can be constructed
from the estimated parameters ϑ̂_i, π̂_i by using, in an additional stage, a plug-in
Bayesian rule that yields the classes

\hat{C}_i := \{\, k \in O \mid \hat{\pi}_i f(x_k; \hat{\vartheta}_i) = \max_{\nu = 1, \ldots, m} \hat{\pi}_\nu f(x_k; \hat{\vartheta}_\nu) \,\}, \quad i = 1, \ldots, m. \qquad (2.18)

Ĉ = (Ĉ_1, ..., Ĉ_m) is an 'estimate' for the random partition C.

A more appropriate approach that incorporates a partition of objects from the outset
is based on the random-partition model which comprises the n independent
pairs (I_1, X_1), ..., (I_n, X_n) with the joint 'density' π_i f(x; ϑ_i) (for x ∈ R^p, i = 1, ..., m)
and maximizes the joint likelihood l(\theta, \pi, I_1, ..., I_n; x_1, ..., x_n) = \prod_{k=1}^n \pi_{I_k} f(x_k; \vartheta_{I_k}) =
\prod_{i=1}^m \prod_{k \in C_i} \pi_i f(x_k; \vartheta_i) of these data with respect to θ, π = (π_1, ..., π_m) and the
'missing' values I_1, ..., I_n (or C, equivalently). Using the fact that partial maximization
with respect to π yields the estimates π̂_i = |C_i|/n, we obtain the clustering criterion

g(C, \theta) := \sum_{i=1}^{m} \sum_{k \in C_i} \big( -\log f(x_k; \vartheta_i) \big) - n \cdot \sum_{i=1}^{m} (|C_i|/n) \cdot \log(|C_i|/n) \;\to\; \min_{C, \theta} \qquad (2.19)

(Fahrmeir, Kaufmann and Pape 1980, Symons 1981, Anderson 1985). Obviously,
this adds an entropy term to the previous fixed-partition criterion (2.2). It can be
shown that the classes of an optimum m-partition C* are generated by the Bayesian
rule (2.18) (after replacing ϑ̂_i by the optimum values ϑ_i*). Computationally, the
likelihood l(θ, π, I_1, ..., I_n; x_1, ..., x_n) can be successively increased by using a modified
k-means algorithm where, in the t-th iteration step, a new partition C^{(t)} is obtained
by applying the Bayesian rule (2.18) with the previous parameter estimates ϑ_i^{(t-1)}
and π_i^{(t-1)} = |C_i^{(t-1)}|/n (obtained from the previous partition C^{(t-1)}), in analogy to
the maximum-probability-assignment partition (2.4) (Fahrmeir et al. 1980, Celeux
and Diebolt 1985, Bock 1996a, Fahrmeir 1996).
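
The following minimal sketch implements this modified k-means (often called the
classification EM algorithm), alternating the Bayesian rule (2.18) with plug-in estimates
π̂_i = |C_i|/n and the class-wise m.l. estimation; spherical normal classes with unit
variance are assumed here purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classification_em(X, m, n_iter=100, seed=0):
    """Modified k-means for criterion (2.19): reassign objects by the
    Bayesian rule (2.18), then re-estimate pi_i and theta_i per class."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, size=m, replace=False)]
    pi = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # log(pi_i * f(x_k; theta_i)) for every object k and class i
        logpost = np.log(pi)[None, :] + np.column_stack(
            [multivariate_normal.logpdf(X, mean=mu[i], cov=np.eye(p))
             for i in range(m)])
        labels = logpost.argmax(axis=1)          # rule (2.18)
        pi = np.maximum(np.bincount(labels, minlength=m) / n, 1e-12)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                           else mu[i] for i in range(m)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu, pi
```
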
2.3 Modal clusters and density-contour clusters
Another group of clustering models, designed primarily for data points x_1, ..., x_n in
R^p, looks for those regions of R^p where these data points are locally concentrated,
or, alternatively, for the regions in which the density of points exceeds some given
threshold. These regions (or the corresponding clouds of points) can be used and
interpreted as 'classes' or 'clusters', especially in the context of pattern recognition
and image processing.
More specifically, let f(x) be the common (smooth) distribution density of X_1, ..., X_n
and define, for a threshold c > 0, by B(c) := {x ∈ R^p | f(x) ≥ c} the level-c region
of f. Then the connected components B_1(c), B_2(c), ... of B(c) are termed high-density
clusters (Bock 1974, 1996a) or density-contour clusters of f at the level c.
For increasing values of c, these clusters split, but also disappear, and show insofar
a pseudo-hierarchical structure. The unknown density f can be approximated,
e.g., by a kernel estimate f̂_n(x) obtained from the data x_1, ..., x_n, and corresponding
estimates B̂_1(c), B̂_2(c), ... ⊂ R^p are found from f̂_n. From these estimated regions,
a (non-exhaustive) clustering of objects or data points is obtained by defining the
clusters C_i(c) := {k ∈ O | f̂_n(x_k) ≥ c, x_k ∈ B̂_i(c)} = B̂_i(c) ∩ {x_1, ..., x_n}, i = 1, 2, ....
Note that a cluster C_i(c) can show a very general (even ramified) shape in R^p, and will be
particularly useful if, for a fixed sufficiently large c, it is separated by broad 'density
valleys' from the rest of the data, and, for a varying c, if it is constant over a wide
range of values of c.
Except for the two-dimensional case, the geometrical description of high-density clusters
is difficult. Therefore many 'discretized' or modified versions of this clustering
strategy have been proposed (often using a weaker or discretized version of connectivity
in R^p; see Bock 1996a). From a theoretical point of view, Hartigan (1981)
showed that single linkage clustering fails in detecting high-density clusters for all
dimensions p ≥ 2.
A related clustering approach focusses on local density peaks in R^p, i.e., on the points
ξ_1, ξ_2, ... ∈ R^p where the underlying (smooth) density f (or its estimate f̂_n) has its
local maxima (modes): Clusters are formed by successively relocating each data point
x_k into a region with a larger value of f̂_n (by hill-climbing algorithms, steepest ascent,
etc.) and then collecting into the same cluster C_i, termed a mode cluster, all data
points which finally reach the same mode of f̂_n. Even if this approach can be criticized
for the instability of the cluster concept (small local variations of f or f̂_n can generate
an arbitrarily large number of modes or clusters), it is often used in image analysis and
pattern recognition, and there exist many algorithmic variations of this approach (cf.
Bock 1996a).
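
A crude discretized version of the density-contour construction can be sketched as
follows: estimate f by a kernel density estimate, retain the points with f̂_n(x_k) ≥ c,
and approximate the connected components B̂_i(c) by the connected components of a
radius graph on the retained points. The bandwidth and linking radius below are tuning
choices, not prescribed quantities, and this is only one of the many possible variants.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, radius_neighbors_graph
from scipy.sparse.csgraph import connected_components

def density_contour_clusters(X, c, bandwidth=0.5, link_radius=1.0):
    """Non-exhaustive clusters C_i(c): points with kernel density >= c,
    grouped by connectivity of a radius graph (a proxy for the connected
    components of the estimated level-c region)."""
    kde = KernelDensity(bandwidth=bandwidth).fit(X)
    dens = np.exp(kde.score_samples(X))          # f_n(x_k)
    keep = np.where(dens >= c)[0]
    if keep.size == 0:
        return dens, {}
    G = radius_neighbors_graph(X[keep], radius=link_radius)
    n_comp, comp = connected_components(G, directed=False)
    return dens, {i: keep[comp == i] for i in range(n_comp)}
```
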
2.4 Spatial clustering and clumping models
Motivated by biological and physical applications, spatial statistics provides various
other models for describing a clustering tendency of points in the space R^p. A first
non-parametric approach considers the data x_1, ..., x_n as a realization of a Poisson process
(restricted to a finite window G ⊂ R^p), either a homogeneous one with a constant
intensity λ (= the average number of data points per unit square) for describing a
'homogeneous' or 'non-clustered' sample, or with a location-dependent intensity λ(x)
in the case of a clustering structure: Here the modes and contour regions of λ(x) can
characterize clusters similarly as in section 2.3 (when using a distribution density f),
and will be determined by suitable non-parametric estimates of λ(x) (Ripley 1981,
Cressie 1991).
Another model is motivated by the spread of plants in a plane or the growing of
crystals around kernels: The Neyman-Scott process builds clusters in three separate
steps: (1) by placing random 'seed points' ξ_1, ξ_2, ... into R^p according to a homogeneous
Poisson process, (2) by choosing, for each ξ_i, a random integer N_i with a Poisson
distribution P(λ), and (3) by surrounding each 'parent' point ξ_i by N_i 'daughter'
points X_{i1}, ..., X_{iN_i} that are independently distributed according to h((x − ξ_i)/σ)/σ^p
(conditionally on the result of (1) and (2)) where h(x) is a spherically symmetric
density (typically, h ~ N(0, I) or h ~ U(K(0,1)), the uniform distribution in the
unit ball K(0,1)). The data are then identified with the set of all daughter points
X_{ik} inside a suitable window G. - There exist statistical methods for estimating the
unknown parameters λ, σ, etc. from these data, but the problem of reconstructing
the 'clusters' (families) from the data is largely unsolved. Insofar this model is representative
for a range of models (including Cox processes, Poisson cluster processes,
etc.) that focus more on the clustering tendency of the data than on the underlying
clustering of objects.
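
For experimentation, such a sample is easy to simulate; the following minimal sketch
follows the three steps above with a Gaussian scatter density h and a cubic window,
ignoring edge effects for parents near (or outside) the boundary of G.

```python
import numpy as np

def neyman_scott(parent_intensity, lam, sigma, window=(0.0, 10.0), p=2, seed=0):
    """Simulate a Neyman-Scott cluster process in a cubic window G:
    (1) Poisson parent points, (2) Poisson(lam) daughters per parent,
    (3) daughters scattered as N(xi_i, sigma^2 I) around each parent."""
    rng = np.random.default_rng(seed)
    lo, hi = window
    n_parents = rng.poisson(parent_intensity * (hi - lo) ** p)
    parents = rng.uniform(lo, hi, size=(n_parents, p))
    daughters = [xi + sigma * rng.standard_normal((rng.poisson(lam), p))
                 for xi in parents]
    X = np.vstack(daughters) if daughters else np.empty((0, p))
    inside = np.all((X >= lo) & (X <= hi), axis=1)   # keep daughters inside G
    return X[inside], parents
```
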

3. Hypothesis testing in cluster analysis

A major problem in cluster analysis consists in the interpretation of the constructed
clusters of objects and the assessment of their relevance for the underlying practical
problem. A range of strategies can be proposed in order to solve this problem, including:
(a) Descriptive and exploratory methods for determining the properties of the
clusters, either in terms of the observed data or by using secondary (background)
information that has not yet been used in the classification process (e.g.,
Bock 1981);
(b) A substance-related analysis of the classes that looks for intuitive, 'natural'
explanations or interpretations of the differences existing among the obtained
classes;
(c) A quantitative evaluation of the benefits that can be gained by using the constructed
classification in practice (e.g., in marketing, administration, official statistics,
libraries etc.);
(d) A qualitative or quantitative validation of the clusters by comparing them to
classifications obtained from other clustering methods, from alternative data
(for the same objects) or from traditional systematics (see Lapointe 1997);
(e) Inferential statistics which is based on probabilistic models and proceeds essentially
by classical hypothesis testing.
It is this latter issue (e) that will be discussed in this section. In fact, there is a long
list of clustering-related questions that can be investigated by hypothesis testing. A
full account is given in Bock (1996a). Here we will address only two of the major
problems: testing for homogeneity and checking the adequacy of a calculated classification
(model).

3.1 Testing for homogeneity


Each clustering algorithm provides a classification of objects even if the underlying
data show no cluster structure and are homogeneous in some sense. In this case, the
resulting classification will typically be an artifact of the algorithm and might lead
to wrong conclusions, e.g. when searching for 'natural' classes of objects. In order
to avoid this error, it will be useful to check, before applying a clustering algorithm, some hypothesis of 'homogeneity' or 'randomness' and to perform clustering only if this hypothesis is rejected. Depending on the type of data, the following models for 'homogeneity' or 'randomness' have been considered in this context (also see Bock (1985, 1989a, 1996a) and Gordon (1996, 1997)):

$H_G$: $X_1, \ldots, X_n$ are uniformly distributed in a finite domain $G \subset R^p$ (to be estimated from the data);

$H_{uni}$: $X_1, \ldots, X_n$ have the same (often unknown) unimodal distribution density $f_0(x)$ in $R^p$, with the special case:

$H_1$: $X_1, \ldots, X_n$ all have the same $p$-dimensional normal distribution $N_p(\mu, \sigma^2 E_p)$;

$H_D$: all $\binom{n}{2}$ dissimilarities $D_{kl}$, $k < l$, are i.i.d., each with an arbitrary (or a specified) continuous distribution density $f(d)$; this implies the two following models:

$H_{perm}$: all $\binom{n}{2}!$ rankings of the dissimilarities $D_{kl}$, $k < l$, are equally probable;

$H_{n,M}$: for each fixed number $M$ of 'similar' pairs of objects $\{k, l\}$ (i.e. with $D_{kl}$ smaller than a given threshold $d > 0$), these $M$ links are purely randomly assigned to the set of all $\binom{n}{2}$ pairs of objects.

For testing one of these hypotheses versus a general alternative of non-homogeneity we may consider, e.g., the empirical distribution of the Euclidean distances $D_{kl} := \|X_k - X_l\|$, the nearest-neighbour distances $D_k := \min_{l \neq k}\{D_{kl}\}$, maximin distances or gap statistics such as $T := \max_k\{D_k\}$ or $T^*$, the radius of the largest ball that can be placed in the window $G$ without containing a data point $X_k$, and various other test statistics. A survey of the resulting homogeneity tests is given, e.g., by Dubes and Jain (1979), Dubes and Zeng (1987) and Bock (1985, 1989a, 1996a).
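As an illustration, the null distribution of the gap statistic $T$ under $H_G$ can be approximated by simulation along the following lines (a Python sketch using numpy and scipy; taking the bounding box of the data as a stand-in for the estimated domain $G$ is our simplifying assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_gap_statistic(X):
    """T := max_k D_k, the largest nearest-neighbour distance in the sample."""
    d, _ = cKDTree(X).query(X, k=2)   # the first neighbour of each point is itself
    return d[:, 1].max()

def null_gap_distribution(X, n_sim=999, seed=0):
    """Simulated values of T under H_G, with G approximated by the bounding
    box of the data; the observed statistic is referred to these values."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return np.array([nn_gap_statistic(rng.uniform(lo, hi, size=X.shape))
                     for _ in range(n_sim)])
```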
A better power performance is to be expected from tests which are tailored to some
clustering alternative of the type that has been defined in section 2. For example,
there exists a range of tests for bimodality and multimodality which are suited to
the concept of mode clusters or mixtures (Silverman 1981, Hartigan 1985, Sawitzki
1996) or related to single linkage clusters (i.e., the connected components of suitable
similarity graphs). In particular, the graph-theoretical and combinatorial methods proposed by Ling (1973), Godehardt (1990), Godehardt and Horsch (1996), and Van Cutsem and Ycart (1996a,b, 1997) are designed to test the hypotheses $H_{perm}$ and $H_{n,M}$, just by comparing the single linkage hierarchy calculated from the data to the one to be expected for random data under these hypotheses.
3.2 Testing versus parametric clustering models
The fixed-partition clustering model $H_m$, (2.1), and the corresponding mixture model $H_m^{mix}$, (2.16), are specified by a parametric density family (typically with a unimodal density $f(x; \vartheta)$), and they reduce to the same homogeneity model $H_1$ for $m = 1$ or $\vartheta_1 = \cdots = \vartheta_m$. Thus testing $H_1$ against the alternatives $H_m$ or $H_m^{mix}$ can, in principle, be performed with the likelihood ratio test (LRT) statistics

\[
T_m := 2 \log \frac{L_{H_m}(\mathcal{C}_n^*)}{L_{H_1}}
\qquad \text{and} \qquad
T_m^{mix} := 2 \log \frac{L_{H_m^{mix}}}{L_{H_1}} \tag{3.1}
\]

where $L_{H_m}$, $L_{H_m^{mix}}$ and $L_{H_1}$ denote the likelihood of the data maximized under the models $H_m$, $H_m^{mix}$ and $H_1$, respectively, for a fixed number $m > 1$ of clusters, and $\mathcal{C}_n^*$ is the optimum $m$-partition resulting from (2.1) or (2.3). Unfortunately, the classical asymptotic LRT theory (yielding $\chi^2$ distributions for $T_m$) fails for these clustering models, due either to the fact that $H_1$ is on the 'boundary' of $H_m^{mix}$ under the parametrization (2.16) or to the discrete character of the parameter $\mathcal{C}$ in the fixed-partition model (2.1) (see also Hartigan 1985, Bock 1996a). However, there exist some special investigations relating to these two test criteria.
3.2.1 Testing versus the mixture model
The case of a one-dimensional normal mixture $f(x) = \sum_{i=1}^{m} \pi_i \, N(\mu_i, \sigma^2)$ has been investigated by Everitt (1981), Thode et al. (1988) and Böhning (1994) who present simulated percentiles of $T_m^{mix}$ under $N(0, 1)$ for various sample sizes $n$ and $m = 2$. The power of this LRT is investigated by Mendell et al. (1991, 1993) where it results, e.g., that $n \geq 50$ is needed to have 50% power to detect a difference $|\mu_1 - \mu_2| \geq 3\sigma$ with $0.1 \leq \pi_1 \leq 0.9$ (also see Milligan (1981, 1996), Bock (1996a)). The paper of Böhning (1994) extends these results to the case of one-dimensional exponential families and shows that inside these families, the asymptotic distribution of $T_m^{mix}$ remains (approximately) stable. For more general cases we recommend determining suitable percentiles of $T_m^{mix}$ by simulation instead of resorting, e.g., to heuristic formulas.
More theoretical investigations are presented by Titterington et al. (1985), Titterington (1990), Goffinet et al. (1992), and Böhning et al. (1994). Those authors show, for two-component mixtures (with partially fixed parameters), that under $H_1$ the asymptotic distribution of $T_2^{mix}$ (for $n \to \infty$) is a mixture of the unit mass at 0 and a $\chi_1^2$ distribution. Ghosh and Sen (1985) show that the asymptotic distribution of $T_2^{mix}$ is closely related to a suitable Gaussian process, and Berdai and Garel (1994) present the corresponding tabulations. An alternative method for testing $H_1$ versus $H_m^{mix}$ has been proposed by Bock (1977, 1985, 1996a, chap. 6.6); it uses, as test statistic, the average similarity among the sample points $X_1, \ldots, X_n$, which should be larger under $H_1$ than under the mixture alternative.
3.2.2 Testing for the fixed-classification model; the max-F test
In contrast to the mixture model, the fixed-classification model (2.1) is defined in terms of an unknown $m$-partition $\mathcal{C} = (C_1, \ldots, C_m)$ of the $n$ objects, for a fixed number $m$ of classes and a given family of densities $f(x; \vartheta)$. Therefore the LRT using $T_m$, (3.1), can be interpreted either as
• a test for homogeneity $H_1$ versus the clustering structure $H_m$;
• a test for the significance or suitability of the calculated (optimum) classification $\mathcal{C}_n^*$ of the $n$ objects;
• a test for the existence of $m > 1$ 'natural' classes in the data set (versus the hypothesis of one class only).
Thus, depending on the interpretation, the analysis of the LR test has many facets. Under the assumption that $X_1, \ldots, X_n$ are i.i.d., all with the same density $f(x)$ (describing either homogeneity or a mixture of distributions), the almost sure convergence and the asymptotic normality of the parameter estimates $\hat{\vartheta}_i$ have been intensively studied, e.g., in Bryant and Williamson (1978), Pollard (1982), Pärna (1986), and Bryant (1991). It appears that the asymptotic behaviour of these estimates is closely related to the solution of the 'continuous' clustering problem

\[
G(\mathcal{B}, \theta) := \sum_{i=1}^{m} \int_{B_i} \left[ -\log f(x; \vartheta_i) \right] \cdot f(x)\, dx \;\to\; \min_{\mathcal{B}, \theta} \tag{3.2}
\]

where minimization is over all $m$-partitions $\mathcal{B} = (B_1, \ldots, B_m)$ of $R^p$ and all parameter vectors $\theta = (\vartheta_1, \ldots, \vartheta_m)$. Instead of going into details here (see Bock 1996a) we will focus on the special case of the normal distribution model described in section 2.1.1 where $X_k \sim N_p(\mu_i, \sigma^2 E_p)$ for $k \in C_i$ under $H_m$, and $X_k \sim N_p(\mu, \sigma^2 E_p)$ for all $k$ in the case of $H_1$. Here the LR test reduces to the intuitive max-F test defined by

\[
k_{mn}^* := \max_{\mathcal{C}} \; \frac{\sum_{i=1}^{m} |C_i| \cdot \|\bar{x}_{C_i} - \bar{x}\|^2}{\sum_{i=1}^{m} \sum_{k \in C_i} \|x_k - \bar{x}_{C_i}\|^2} \tag{3.3}
\]

\[
= \max_{\mathcal{C}} \; \frac{SSB(\mathcal{C})}{SSW(\mathcal{C})} \;=\; \frac{SSB(\mathcal{C}_n^*)}{SSW(\mathcal{C}_n^*)} \;\;
\begin{cases} > c & \text{decide for } H_m \\ \leq c & \text{decide for } H_1 \end{cases} \tag{3.4}
\]

where $\mathcal{C}_n^*$ minimizes the variance criterion (2.7). In this case the continuous optimization problem (3.2) reduces to

\[
G(\mathcal{B}, \mu) := \sum_{i=1}^{m} \int_{B_i} \|x - \mu_i\|^2 \, f(x)\, dx \;\to\; \min_{\mathcal{B}, \mu} \;=:\; G_m^* \tag{3.5}
\]

as an analogue to (2.6), and its solution $\mathcal{B}^*$ is necessarily a stationary partition, i.e. a minimum-distance partition of $R^p$ generated by its own class centroids, the conditional expectations $\mu_i^* := E_f[X \mid X \in B_i^*]$, $i = 1, \ldots, m$.
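In practice the exactly optimum partition is unavailable, so the max-F statistic is approximated from the best partition found by an iterative algorithm. A minimal sketch (Python, using the k-means implementation of scikit-learn as a surrogate for exact minimization of the variance criterion (2.7)):

```python
import numpy as np
from sklearn.cluster import KMeans

def max_f_statistic(X, m):
    """Approximate k*_mn = SSB(C*)/SSW(C*) of (3.3); k-means stands in
    for the exact variance-minimizing m-partition."""
    km = KMeans(n_clusters=m, n_init=20, random_state=0).fit(X)
    ssw = km.inertia_                        # within-class sum of squared distances
    sst = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares
    return (sst - ssw) / ssw                 # SSB(C*) / SSW(C*)
```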
For the one-dimensional case, the optimum partition $\mathcal{B}^*$ of $R^1$ is given by Cox (1957) and Bock (1974, p. 179) for the cases $f \sim N(0,1)$ and $f \sim U([-\sqrt{3}, \sqrt{3}])$ with variance 1. For two- and three-dimensional normals $f \sim N_p(0, E_p)$ a range of stationary partitions $\mathcal{B}$ has been calculated by Baubkus (1985) ($m = 2, \ldots, 6$) and Flury (1993); the ellipsoidal case has been considered by Baubkus (1985), Flury (1993, $m = 4$), Kipper and Pärna (1992), Tarpey et al. (1995) and Jank (1996, $2 \leq m \leq 4$). For the two-dimensional normal $N_2(0, E_2)$ some stationary partitions as well as their numerical characteristics are reproduced in Tab. 1; for example, the three quite distinct 5-partitions $\mathcal{B}_{5,1}$, $\mathcal{B}_{5,2}$, $\mathcal{B}_{5,3}$ differ in their $G_5$-values by no more than 0.013 (for other cases see Bock 1996a). It is conjectured that for $m = 2$ to 5 this list includes the optimum partitions of $R^2$ (see the asterisks ** in Tab. 1), but a formal proof of optimality exists only for $m = 2$ and 3 classes (Baubkus 1985).

$m = 2$, $\mathcal{B}_2$: $\mu_i = (\pm 0.79789,\ 0)'$, $p_i \equiv 1/2$; $G_2 = 1.36338$ **, $\kappa_2 = 0.46694$
$m = 3$, $\mathcal{B}_3$: $\mu_i = 1.03648 \cdot (\cos(2\pi i/3),\ \sin(2\pi i/3))'$, $p_i \equiv 1/3$; $G_3 = 0.92570$ **, $\kappa_3 = 1.16052$
$m = 4$, $\mathcal{B}_{4,1}$: $\mu_i = 1.12838 \cdot (\cos(\pi i/2),\ \sin(\pi i/2))'$, $p_i \equiv 1/4$; $G_4 = 0.72676$ **, $\kappa_4 = 1.75194$
$m = 4$, $\mathcal{B}_{4,2}$: $\mu_i = 1.27910 \cdot (\cos(2\pi i/3),\ \sin(2\pi i/3))'$, $p_i = 0.24034$ ($i = 1, 2, 3$); $\mu_4 = (0, 0)'$, $p_4 = 0.27898$; $G_4 = 0.82034$, $\kappa_4 = 1.43801$
$m = 5$, $\mathcal{B}_{5,1}$: $\mu_i = 1.36334 \cdot (\cos(\pi i/2),\ \sin(\pi i/2))'$, $p_i = 0.18636$ ($i = 1, \ldots, 4$); $\mu_5 = (0, 0)'$, $p_5 = 0.25457$; $G_5 = 0.61448$ **, $\kappa_5 = 2.25477$
$m = 5$, $\mathcal{B}_{5,2}$: $\mu_{1,2} = (\pm 0.70505,\ 0.87119)'$, $p_{1,2} = 0.22217$; $\mu_{3,4} = (\pm 1.34\ldots,\ -0.52106)'$, $p_{3,4} = 0.14580$; $\mu_5 = (0,\ -0.89064)'$, $p_5 = 0.26405$; $G_5 = 0.62246$, $\kappa_5 = 2.21305$
$m = 5$, $\mathcal{B}_{5,3}$: $\mu_i = 1.17246 \cdot (\cos(2\pi i/5),\ \sin(2\pi i/5))'$, $p_i \equiv 1/5$; $G_5 = 0.62533$, $\kappa_5 = 2.19830$

Tab. 1: Stationary partitions $\mathcal{B} = (B_1, \ldots, B_m)$ of $R^2$ with $m = 2, \ldots, 5$ classes for the continuous variance criterion (3.5) if $f \sim N_2(0, E_2)$, with the class centers $\mu_i$, class percentages $p_i$ and criterion values $G_m := G(\mathcal{B}, \mu)$. The ratio $\kappa_m = (2 - G_m)/G_m$ is the continuous analogue to $k_{mn}^*$, (3.3). ** marks the optimum or best known $m$-partitions.

In order to apply the max-F test, the critical threshold (percentile) $c$ must be calculated from the null distribution of $k_{mn}^*$ under some $f \sim H_1$. While this distribution is intractable for finite $n$, the asymptotic normality of $g_{mn}^*$ and $k_{mn}^*$ has been proved under some regularity conditions on $f$, with asymptotic expectations $G_m^*$ and $\kappa_m^* := (E_f[\|X - E_f[X]\|^2] - G_m^*)/G_m^*$, the continuous analogues of the minimum variance and max-F criteria (2.7) and (3.3), respectively (see Bryant and Williamson 1978, Hartigan 1978 (for $p = 1$), Bock 1985 (for $p \geq 1$)). Since the regularity conditions include the uniqueness of the optimum partition $\mathcal{B}^*$ of (3.5), these results cannot be applied to the rotation-invariant density $f \sim N_p(0, E_p)$ if $p > 1$, but suitable simulations have been conducted (see Hartigan 1978, Bock 1996a, Jank 1996). For example, Jank (1996) and Jank and Bock (1996) found that for $n \geq 100$ (in particular: $n = 1000$) the null distribution of the standardized values $(g_{mn}^* - a)/b$ and $(k_{mn}^* - a)/b$ is satisfactorily approximated by a $N(0,1)$ distribution (in the range $[-2, 2]$, say) if $a$ and $b$ are chosen to be the empirical mean and standard deviation of the optimum values $g_{mn}^*$ and $k_{mn}^*$, respectively (results of the k-means algorithm), from $N = 1600$ simulations of $\{X_1, \ldots, X_n\}$ under $N_2(0, E_2)$ and $m = 2, \ldots, 5$.
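The simulation recipe just described can be condensed as follows (a sketch under the same assumptions as the previous fragment; `stat` would be, e.g., the max-F function above):

```python
import numpy as np

def null_percentile(stat, n, p, m, q=0.95, n_sim=500, seed=1):
    """Estimate an H_1 percentile of a clustering statistic by simulating
    data sets under N_p(0, E_p). Standardizing the simulated values by their
    empirical mean and standard deviation would reproduce the approximate
    N(0,1) behaviour reported above for large n."""
    rng = np.random.default_rng(seed)
    vals = np.array([stat(rng.standard_normal((n, p)), m) for _ in range(n_sim)])
    return float(np.quantile(vals, q))   # critical threshold c at level 1 - q
```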
3.2.3 The LRT for the convex cluster case; convex cluster tests
A less investigated case is provided by clustering models where each class is characterized by a uniform distribution on a convex domain of $R^p$. To be specific, we consider the following three convex clustering models which all involve a system of $m$ unknown non-overlapping convex sets $D_1, \ldots, D_m \subset R^p$ (to be estimated from the data):

$H_m$ Fixed-classification model:
$X_k \sim U(D_i)$ for all $k \in C_i$, with an unknown $m$-partition $\mathcal{C} = (C_1, \ldots, C_m)$ of $\mathcal{O}$.

$H_m^{mix}$ Mixture model:
$X_k \sim \sum_{i=1}^{m} \pi_i \cdot U(D_i)$ for $k = 1, \ldots, n$.

$H_m^{uni}$ Pseudo-mixture model:
$X_k \sim U(D_1 + \cdots + D_m)$ for $k = 1, \ldots, n$ (the uniform distribution on the union $D_1 + \cdots + D_m$).

These models have been introduced by Bock (1997) as a generalization of some work by Rasson et al. (1988, 1994) related to $H_m^{uni}$ (also see Rasson 1997).
Here we consider the problem of testing the hypothesis of homogeneity $H_G$ (i.e., $X_k \sim U(G)$ for some unknown convex $G \subset R^p$) versus one of these clustering alternatives. We find the following LR tests, where $G_n := \mathrm{conv}\{x_1, \ldots, x_n\}$ denotes the convex hull of all $n$ data points, $\hat{D}_i := H(C_i)$ the convex hull of all data points belonging to a class $C_i \subset \mathcal{O}$, and $c$ is a critical threshold to be obtained from the $H_G$ distribution of the test statistics:

• $H_G$ versus $H_m$:

\[
T_m := -\sum_{i=1}^{m} \frac{|C_i^*|}{n} \cdot \log \frac{vol_p(H(C_i^*))}{vol_p(G_n)} \;\;
\begin{cases} > c & \text{decide for } H_m \\ \leq c & \text{accept uniformity } H_G, \end{cases} \tag{3.6}
\]

where the partition $\mathcal{C}^* = (C_1^*, \ldots, C_m^*)$ minimizes the clustering criterion (2.8).

• $H_G$ versus $H_m^{mix}$: The corresponding LRT statistic $T_m^{mix}$ yields a test of the form

\[
T_m^{mix} \;\;
\begin{cases} > c & \text{decide for } H_m^{mix} \\ \leq c & \text{accept uniformity } H_G, \end{cases} \tag{3.7}
\]

where $\mathcal{C}^* = (C_1^*, \ldots, C_m^*)$ is the partition that minimizes the clustering criterion

\[
g_m^{mix}(\mathcal{C}) := \sum_{i=1}^{m} |C_i| \cdot \log \frac{vol_p(H(C_i))}{|C_i|} \;\to\; \min_{\mathcal{C}}
\]

over all partitions $\mathcal{C} = (C_1, \ldots, C_m)$ with disjoint convex hulls $H(C_1), \ldots, H(C_m)$, all with a positive volume.
• $H_G$ versus $H_m^{uni}$:
Denote by $\mathcal{C}^* = (C_1^*, \ldots, C_m^*)$ the $m$-partition which minimizes the volume clustering criterion

\[
A(\mathcal{C}) := \sum_{i=1}^{m} vol_p(H(C_i)) \;\to\; \min_{\mathcal{C}}, \tag{3.8}
\]

i.e. the sum of the volumes of the $m$ class-specific convex hulls $\hat{D}_i := H(C_i^*) \subset G_n$ (supposed to be non-overlapping). Then $V_n := G_n - (H(C_1^*) + \cdots + H(C_m^*))$ is the maximum 'empty space' that is left in $G_n$ outside of the cluster domains $\hat{D}_i$, and the LRT reduces to the following empty space test:

\[
\frac{vol_p(V_n)}{vol_p(G_n)} \;\;
\begin{cases} > c & \text{decide for clustering } H_m^{uni} \\ \leq c & \text{accept uniformity } H_G. \end{cases} \tag{3.9}
\]

This test is a multivariate generalization of various one-dimensional gap tests (see Bock (1989a, 1996a, 1997), Rasson and Kubushishi (1994), Hardy (1997)).
Since all these test criteria are invariant against any regular affine transformation of the data, their distribution under $H_G$ can be determined (e.g., by simulations) for some standardized form of $G$ such as the unit square or the unit ball in $R^p$, without restriction of generality.
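The empty space statistic (3.9) can be computed directly from convex hull volumes. A minimal sketch (Python with scipy.spatial; it assumes that the class-wise hulls are disjoint and that each class contains at least $p+1$ points in general position):

```python
import numpy as np
from scipy.spatial import ConvexHull

def empty_space_statistic(X, labels):
    """vol_p(V_n)/vol_p(G_n) of (3.9) for the partition encoded in 'labels';
    the class-wise convex hulls are assumed to be non-overlapping."""
    labels = np.asarray(labels)
    total = ConvexHull(X).volume   # vol_p(G_n); 'volume' is the area when p = 2
    class_vol = sum(ConvexHull(X[labels == g]).volume for g in np.unique(labels))
    return (total - class_vol) / total
```

Because of the affine invariance noted above, the simulated null distribution of this ratio needs to be computed only once, e.g. for $G$ the unit square.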
3.3 Power considerations and the comparative assessment of classifications

Practical experience and theoretical investigations have shown that the power of clustering tests is far from being satisfactory. For example, in normal distribution mixture cases a considerable separation of the classes is needed in order to detect a hidden clustering structure with a satisfactorily large probability (see Everitt 1981, Mendell et al. 1993, Bock 1996a). As a matter of fact, this difficulty seems to be an intrinsic feature of the classification problem, and not only a technical deficiency of our statistical tools. Therefore the result of any clustering test must be interpreted with care, more in an indicative than in a conclusive sense.

Quite generally, it may be doubted whether the classical hypothesis testing paradigm, with only two alternative decisions (acceptance and rejection of a hypothesis $H_0$), is appropriate for the clustering framework, where it would be much more realistic and useful to distinguish various grades of classifiability which interpolate between the extreme cases of 'homogeneity or randomness' and an 'obvious clustering structure'. Instead of defining corresponding quantitative measures of classifiability here, we propose a more qualitative approach, the comparative assessment of classifications (CAC): It proceeds by defining, prior to any clustering algorithm, some 'benchmark clustering situations' $H_m^\epsilon$, i.e., data configurations or distribution models which show a more or less marked classification structure, indexed by $\epsilon$ (not necessarily a real number, but possibly a measure of class separation; see below). These configurations can be selected in cooperation between practitioners and statisticians. Then, after having calculated a special (e.g., optimum) classification $\mathcal{C}_n^*$ for the given data $\{x_1, \ldots, x_n\}$, we compare this classification with those to be expected under $H_m^\epsilon$, for various degrees $\epsilon$. Thus we place the observed data into a network of different clustering situations $H_m^\epsilon$ in order to get an idea of their underlying structure.
In the case of the fixed-classification normal model $H_m$, (2.1), with class-specific distributions $N_p(\mu_i, \sigma^2 E_p)$, this idea can be realized as follows:
1. Determine suitably parametrized benchmark clusterings $H_m^\epsilon$, e.g.:
• A partition $\mathcal{C}$ of $\mathcal{O}$ with $m$ classes $C_i$ of equal sizes $n_i = |C_i| = n/m$, whose class centroids $\mu_i$ are sufficiently different, e.g., $\|\mu_i - \mu_j\|/\sigma \geq \epsilon$ for all $i \neq j$, or $\frac{1}{m} \sum_{i=1}^{m} \|\mu_i - \bar{\mu}\|^2 / \sigma^2 \geq \epsilon$.
• A normal mixture $\sum_{i=1}^{m} \pi_i \cdot N_p(\mu_i, \sigma^2 E_p)$ with $\epsilon := \sum_{i=1}^{m} \pi_i \cdot \|\mu_i - \bar{\mu}\|^2$.
2. Consider the LRT statistic $T_m$ or, equivalently, the max-F statistic $k_{mn}^*$ for the hypothesis $H_0$: $\mu_1 = \cdots = \mu_m$.
3. Determine or estimate some characteristics $Q^\epsilon$ of the (intractable) probability distribution of $T_m$ or $k_{mn}^*$ under the benchmark situations $H_m^\epsilon$ selected in 1., e.g., by simulating a large number of data sets $\{x_1, \ldots, x_n\}$ under $H_m^\epsilon$ and calculating the empirical mean, median or some other empirical percentile of the resulting values of $T_m$ or $k_{mn}^*$.
4. Compare the values $T_m$ or $k_{mn}^*$ calculated from the original data $\{x_1, \ldots, x_n\}$ (possibly after a suitable standardization) to the characteristics $Q^\epsilon$ of the benchmark situations.
5. The clustering tendency or classifiability of the data $\{x_1, \ldots, x_n\}$ and the relevance of the calculated classification $\mathcal{C}_n^*$ are then described, illustrated and quantified (by $\epsilon$) by confronting them with those benchmark situations $H_m^\epsilon$ which show a weaker clustering behaviour (in the sense that, e.g., $Q^\epsilon \leq T_m$ or $Q^\epsilon \leq k_{mn}^*$) and, on the other hand, with those which describe a stronger clustering structure (i.e., where the converse inequality holds).
It is obvious that this CAC strategy is related to a formal test of a hypothesis $\epsilon \leq \epsilon_0$ versus $\epsilon > \epsilon_0$, but it is more flexible due to the free choice of suitable benchmark situations; a minimal simulation sketch is given below. Its generalization to other clustering models is obvious.
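A minimal simulation sketch of steps 1 to 3 (Python with numpy; the particular benchmark layout, with class centroids placed $\epsilon$ standard deviations apart along coordinate axes, and all names are our own illustrative choices):

```python
import numpy as np

def cac_benchmarks(stat, n, p, m, eps_grid, n_sim=200, seed=2):
    """For each separation eps, simulate the benchmark H_m^eps (m equal-sized
    spherical normal classes, sigma = 1) and record the median of the chosen
    clustering statistic as the characteristic Q^eps."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % m                    # equal class sizes n_i = n/m
    medians = {}
    for eps in eps_grid:
        centers = np.zeros((m, p))
        for i in range(1, m):
            centers[i, (i - 1) % p] = eps        # centroids eps apart from mu_1
        vals = [stat(centers[labels] + rng.standard_normal((n, p)), m)
                for _ in range(n_sim)]
        medians[eps] = float(np.median(vals))    # the characteristic Q^eps
    return medians
```

The observed value of the statistic for the data at hand is then bracketed between benchmarks with smaller and larger characteristics $Q^\epsilon$ (steps 4 and 5).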

4. Final remarks
In this paper we have described various probabilistic and inferential tools for cluster

analysis. These methods provide a firm basis for deriving suitable clustering stra-
tegies and allow for a quantitative evaluation of classification results and clustering
methods, including error probabilities and risk functions. In particular, various test
statistics can safeguard against a too rash acceptance of a clustering structure and
help to validate calculated classifications.
On the other hand, the application of these methods is by no means easy or self-evident: Problems arise, e.g., when selecting a suitable clustering model and an appropriate family of densities $f(x; \vartheta)$, or when different types of cluster shapes occur simultaneously in the same data set. We have seen that the probability distribution of many test statistics is hard to obtain in many cases. Moreover, our analysis is always based on only one sample from the $n$ objects, so that we cannot evaluate the stability or variation of the resulting classification (as would be the case if repeated samples were available for the same objects).
When comparing the risks and benefits of probability-based versus deterministic clustering approaches (which proceed, e.g., by intuitive clustering criteria or heuristic algorithms) we see that the same deficiencies exist, in some other and disguised form, for the latter methods as well. We therefore recommend combining both approaches in an exploratory way and thereby profiting from both points of view. The CAC strategy presented above is an example of such an analysis.

References:
Anderson, J.J. (1985): Normal mixtures and the number of clusters problem. Computational Statis-
tics Quarterly 2, 3-14.
Arabie, P., Hubert, L., and De Soete, G. (Eds.) (1996): Clustering and classification. World Scientific Publishers, River Edge/NJ.
Baubkus, W. (1985): Minimizing the variance criterion in cluster analysis: Optimal configurations
in the multidimensional normal case. Diploma thesis, Institute of Statistics, Technical University of
Aachen, Germany.
Berdai, A., and Garel, B. (1994): Performances d'un test d'homogénéité contre une hypothèse de mélange gaussien. Revue de Statistique Appliquée 42 (1), 63-79.
Bernardo, J.M. (1994): Optimizing prediction with hierarchical models: Bayesian clustering. In:
P.R. Freeman and A.F.M. Smith (Eds.): Aspects of uncertainty. Wiley, New York, 1994, 67-76.
Binder, D.A. (1978): Bayesian cluster analysis. Biometrika 65, 31-38.
Bock, H.H. (1968): Statistische Modelle für die einfache und doppelte Klassifikation von normalverteilten Beobachtungen. Dissertation, Univ. Freiburg i. Brsg., Germany.
Bock, H.H. (1969): The equivalence of two extremal problems and its application to the iterative classification of multivariate data. Report of the Conference 'Medizinische Statistik', Forschungsinstitut Oberwolfach, February 1969, 10 pp.
Bock, H.H. (1972): Statistische Modelle und Bayes'sche Verfahren zur Bestimmung einer unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18, 120-132.
Bock, H.H. (1974): Automatische Klassifikation (Clusteranalyse). Vandenhoeck & Ruprecht, Göttingen, 480 pp.
Bock, H.H. (1977): On tests concerning the existence of a classification. In: Proc. First Symposium on Data Analysis and Informatics, Versailles, 1977, Vol. II. Institut de Recherche d'Informatique et d'Automatique (IRIA), Le Chesnay, 1977, 449-464.
Bock, H.H. (1984): Statistical testing and evaluation methods in cluster analysis. In: J.K. Ghosh and J. Roy (Eds.): Golden Jubilee Conference in Statistics: Applications and new directions. Calcutta, December 1981. Indian Statistical Institute, Calcutta, 1984, 116-146.
Bock, H.H. (1985): On some significance tests in cluster analysis. J. of Classification 2, 77-108.
Bock, H.H. (1986): Loglinear models and entropy clustering methods for qualitative data. In: W. Gaul and M. Schader (Eds.): Classification as a tool of research. North Holland, Amsterdam, 1986, 19-26.
Bock, H.H. (1987): On the interface between cluster analysis, principal component analysis, and

multidimensional scaling. In: H. Bozdogan and A.K. Gupta (Eds.): Multivariate statistical modeling
and data analysis. Reidel, Dordrecht, 1987, 17-34.
Bock, H.H. (Ed.) (1988): Classification and related methods of data analysis. Proc. First IFCS Conference, Aachen, 1987. North Holland, Amsterdam.
Bock, H.H. (1989a): Probabilistic aspects in cluster analysis. In: O. Opitz (Ed.): Conceptual and
numerical analysis of data. Springer-Verlag, Heidelberg, 1989, 12-44.
Bock, H.H. (1989b): A probabilistic clustering model for graphs and similarity relations. Paper pre-
sented at the Fall Meeting 1989 of the Working Group 'Numerical Classification and Data Analysis'
of the Gesellschaft für Klassifikation, Essen, November 1989.
Bock, H.H. (1994): Information and entropy in cluster analysis. In: H. Bozdogan et al. (Eds.): Mul-
tivariate statistical modeling, Vol. II. Proc. 1st US/Japan Conference on the Frontiers of Statistical
Modeling: An Informational Approach. Univ. of Tennessee, Knoxville, 1992. Kluwer, Dordrecht,
1994, 115-147.
Bock, H.H. (1996a): Probability models and hypotheses testing in partitioning cluster analysis. In:
P. Arabie et al. (Eds.), 1996, 377-453.
Bock, H.H. (1996b): Probabilistic models in cluster analysis. Computational Statistics and Data
Analysis 22 (in press).
Bock, H.H. (1996c): Probabilistic models in partitional cluster analysis. In: A. Ferligoj and A.
Kramberger (Eds.): Developments in data analysis. Metodoloski zvezki, 12, Faculty of Social Sci-
ences Press (Fakulteta za druzbene vede, FDV), Ljubljana, 1996, 3-25.
Bock, H.H. (1996d): Probabilistic models and statistical methods in partitional classification prob-
lems. Written version of a Tutorial Session organized by the Japanese Classification Society and the
Japan Market Association, Tokyo, April 2-3, 1996, 50-68.
Bock, H.H. (1997): Probability models for convex clusters. In: R. Klar and O. Opitz (Eds.): Clas-
sification and knowledge organization. Springer-Verlag, Heidelberg, 1997 (to appear).
Bock, H.H., and W. Polasek (Eds.) (1996): Data analysis and information systems: Statistical and
conceptual approaches. Springer-Verlag, Heidelberg, 1996.
Böhning, D., Dietz, E., Schaub, R., Schlattmann, P., and Lindsay, B.G. (1994): The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373-388.
Bryant, P. (1988): On characterizing optimization-based clustering methods. J. of Classification 5,
81-84.
Bryant, P.G. (1991): Large-sample results for optimization-based clustering methods. J. of Classi-
fication 8, 31-44.
Bryant, P.G., and J.A. Williamson (1978): Asymptotic behaviour of classification maximum likelihood estimates. Biometrika 65, 273-281.
Celeux, G., and Diebolt, J. (1985): The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, 73-82.
Cox, D.R. (1957): A note on grouping. J. Amer. Statist. Assoc. 52, 543-547.
Cressie, N. (1991): Statistics for spatial data. Wiley, New York.
Diday, E. (1973): Introduction à l'analyse factorielle typologique. Rapport de Recherche no. 27, IRIA, Le Chesnay, France, 13 pp.
Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B. (Eds.) (1994): New approaches in classification and data analysis. Studies in Classification, Data Analysis, and Knowledge Organization, vol. 6. Springer-Verlag, Heidelberg.
Dubes, R., and Jain, A.K. (1979): Validity studies in clustering methodologies. Pattern Recognition
11, 235-254.
Dubes, R.C., and Zeng, G. (1987): A test for spatial homogeneity in cluster analysis. J. of Classifi-
cation 4, 33-56.
Everitt, B. S. (1981): A Monte Carlo investigation of the likelihood ratio test for the number of
components in a mixture of normal distributions. Multivariate Behavioural Research 16, 171-180.
Fahrmeir, L., Hamerle, A. and G. Tutz (Eds.) (1996): Multivariate statistische Verfahren. Walter
de Gruyter, Berlin - New York.
Fahrmeir, L., Kaufmann, H.L., and H. Pape (1980): Eine konstruktive Eigenschaft optimaler Par-
titionen bei stochastischen Klassifikationsproblemen. Methods of Operations Research 37, 337-347.

Flury, B.D. (1993): Estimation of principal points. Applied Statistics 42, 139-151.
Gaul, W., and Pfeifer, D. (Eds.) (1996): From data to knowledge. Theoretical and practical aspects of classification, data analysis and knowledge organization. Springer-Verlag, Heidelberg.
Ghosh, J.K., and Sen, P.K. (1985): On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In: L.M. Le Cam and R.A. Olshen (Eds.): Proc. Berkeley Conference in honor of Jerzy Neyman and Jack Kiefer. Vol. II, Wadsworth, Monterey, 1985, 789-806.
Godehardt, E. (1990): Graphs as structural models. The application of graphs and multigraphs in
cluster analysis. Friedrich Vieweg & Sohn, Braunschweig, 240pp.
Godehardt, E., and Horsch, A. (1996): Graph-theoretic models for testing the homogeneity of data.
In: W. Gaul & D. Pfeifer (Eds.), 1996, 167-176.
Goffinet, B., Loisel, P., and B. Laurent (1992): Testing in normal mixture models when the propor-
tions are known. Biometrika 79, 842-846.
Gordon, A.D. (1994): Identifying genuine clusters in a classification. Computational Statistics and
Data Analysis 18, 561-581.
Gordon, A.D. (1996): Null models in cluster validation. In: W. Gaul and D. Pfeifer (Eds.), 1996,
32-44.
Gordon, A.D. (1997a): Cluster validation. This volume.
Gordon, A.D. (1997b): How many clusters? An investigation of five procedures for detecting nested
cluster structure. This volume.
Hardy, A. (1997): A split and merge algorithm for cluster analysis. This volume.
Hartigan, J.A. (1978): Asymptotic distributions for clustering criteria. Ann. Statist. 6, 117-131.
Hartigan, J.A. (1985): Statistical theory in clustering. J. of Classification 2, 63-76.
Hayashi, Ch. (19??): ..... .
Jain, A.K., and Dubes, R.C. (1988): Algorithms for clustering data. Prentice Hall, Englewood Cliffs, NJ.
Jank, W. (1996): A study on the variance criterion in cluster analysis: Optimum and stationary partitions of $R^p$ and the distribution of related clustering criteria. (In German.) Diploma thesis, Institute of Statistics, Technical University of Aachen, Aachen, 204 pp.
Jank, W., and Bock, H.H. (1996): Optimal partitions of R2 and the distribution of the variance and
max-F criterion. Paper presented at the 20th Annual Conference of the Gesellschaft für Klassifikation, Freiburg, Germany, March 1996.
Lapointe, F.-J. (1997): To validate and how to validate? That is the real question. This volume.
Ling, R.F. (1973): A probability theory of cluster analysis. J. Amer. Statist. Assoc. 68,159-164.
McLachlan, G.J., and KE. Basford (1988): Mixture models. Inference and applications to cluster-
ing. Marcel Dekker, New York - Basel.
Mendell, N.R., Thode, H.C., and Finch, S.J. (1991): The likelihood ratio test for the two-component normal mixture problem: power and sample-size analysis. Biometrics 47, 1143-1148. Correction: 48 (1992), 661.
Mendell, N.P., Finch, S.J., and Thode, H.C. (1993): Where is the likelihood ratio test powerful for
detecting two-component normal mixtures? Biometrics 49, 907-915.
Milligan, G. W. (1981): A review of Monte Carlo tests of cluster analysis. Multivariate Behavioural
Research 16, 379-401.
Milligan, G.W. (1996): Clustering validation: Results and implications for applied analyses. In: P.
Arabie et al. (Eds.), 1996, 341-375.
Milligan, G. W., and M.C. Cooper (1985): An examination of procedures for determining the num-
ber of clusters in a data set. Psychometrika 50, 159-179.
Pärna, K. (1986): Strong consistency of k-means clustering criterion in separable metric spaces. Tartu Riikliku Ülikooli Toimetised 733, 86-96.
Kipper, S., and Pärna, K. (1992): Optimal k-centres for a two-dimensional normal distribution. Acta et Commentationes Universitatis Tartuensis (Tartu Ülikooli Toimetised) 942, 21-27.
Pollard, D. (1982): A central limit theorem for k-means clustering. Ann. Probab. 10, 919-926.
Rasson, J.-P. (1997): Convexity methods in classification. This volume.
Rasson, J.-P., Hardy, A., and Weverbergh, D. (1988): Point process, classification and data analysis.

In: H.H. Bock (Ed.), 1988, 245-256.


Rasson, J.-P., and Kubushishi, T. (1994): The gap test: an optimal method for determining the number of natural classes in cluster analysis. In: E. Diday et al. (Eds.), 1994, 186-193.
Ripley, B.D. (1981): Spatial statistics. Wiley, New York.
Sawitzki, G. (1996): The excess-mass approach and the analysis of multi-modality. In: W. Gaul
and D. Pfeifer (Eds.), 1996, 203-211.
Silverman, B.W. (1981): Using kernel density estimates to investigate multimodality. J. Royal
Statist. Soc. B 43, 97-99.
Snijders, T.A.B., and Nowicki, K. (1996): Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. of Classification 13 (in press).
Symons, M.J. (1981): Clustering criteria and multivariate normal mixtures. Biometrics 37, 35-43.
Tarpey, T., Li, L., and Flury, B.D. (1995): Principal points and self-consistent points of elliptical distributions. Annals of Statistics 23, 103-112.
Thode, H.C., Finch, S.J., and Mendell, N.R. (1988): Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals. Biometrics 44, 1195-1201.
Titterington, D.M. (1990): Some recent research in the analysis of mixture distributions. Statistics
21,619-641.
Titterington, D.M., A.F.M. Smith and U.E. Makov (1985): Statistical analysis of finite mixture
distributions. Wiley, New York.
Van Cutsem, B., and Ycart, B. (1996a): Probability distributions on indexed dendrograms and
related problems of c1assifiability. In H.H. Bock and W. Polasek (Eds.), 1996, 73-87.
Van Cutsem, B., and Ycart, B. (1996b): Combinatorial structures and structures for classification.
Computational Statistics and Data Analysis (in press).
Van Cutsem, B., and Ycart, B. (1997): This volume.
Cluster Validation
A. D. Gordon
Mathematical Institute, University of St Andrews,
North Haugh, St Andrews KY16 9SS, Scotland

Summary: Clustering algorithms can provide misleading summaries of data, and attention
has been devoted to investigating ways of guarding against reaching incorrect conclusions,
by validating the results of a cluster analysis. The paper provides an overview of recent work
in this area of cluster validation. Material covered includes: the distinction between exter-
nal, internal, and relative clustering indices; types of null model, including 'data-influenced'
null models; tests of the complete absence of any class structure in a data set; and ways
of assessing the validity of individual clusters, partitions of data into disjoint clusters, and
hierarchical classifications. A discussion indicates areas in which further research seems
desirable.

1. Introduction
The topic of classification addresses the problem of summarizing the relationships
within a large set of objects by representing them as a smaller number of classes (or
clusters) of objects with the property that objects in the same class are similar to
one another and different to objects in other classes. On occasion, it can be relevant
to obtain a nested set of such partitions of the objects, providing a hierarchical clas-
sification which summarizes the class structure present at several different levels in
the data.
Many different clustering algorithms have been proposed for obtaining such classi-
fications; see, for example, Bock (1974), Hartigan (1975), Gordon (1981) and Jain
and Dubes (1988). However, clustering algorithms impose a classification on a data
set even if there is no real class structure present in it. Further, classifications of the
same data set obtained using different clustering criteria can differ markedly from one
another. In effect, each clustering criterion implicitly specifies a model for the data;
for example, Scott and Symons (1971) demonstrated links between Ward's (1963)
incremental sum-of-squares clustering criterion and a spherical normal components
model (see also Binder (1978), Marriott (1982) and Bock (1989, 1996)). If this model
is inappropriate for the data set under investigation, a misleading summary will be
provided.
Realization of this fact has led to the consideration by investigators of how one might
specify appropriate clustering strategies for data sets. Some adaptive clustering pro-
cedures have been proposed (e.g., Rohlf, 1970; Diday and Govaert, 1977; Lefkovitch, 1978, 1980; Art et al., 1982), the rationale being that the clustering procedure adapts
itself in response to the structure found in the data. In effect, such procedures just
involve a more general model for the data, and it seems unrealistic to expect to be
able to construct a model that is sufficiently general to be appropriate for the analysis
of any data set that may be encountered.
Jardine and Sibson (1971) presented a list of properties that one might require of a
clustering method, and proved that the single link method uniquely satisfies these
conditions. This provides a valuable characterization of the single link method, but
many investigators have regarded some of the conditions as undesirable. A less prescriptive approach was provided by Fisher and Van Ness (1971) and Van Ness (1973).
These authors presented a list of properties which one might expect clustering proce-
dures or the classes obtained from them to possess, and stated whether or not these
properties were possessed by each of several standard clustering criteria. From back-
ground information about the data, an investigator might be able to specify some
relevant conditions; the work of Fisher and Van Ness then indicates which clustering
criteria could be relevant for the analysis of that particular data set. More than
one clustering criterion might be indicated as relevant, and it can be informative to
analyse a data set using several different clustering criteria, and synthesize the results
in a consensus classification (e.g., Diday and Simon, 1976; Gordon, 1981, Chapter
6, 1996a), the rationale being that the results are less likely to be an artifact of a
clustering criterion and more likely to give an accurate representation of the structure
in the data.
Investigators have commonly combined a classification of a set of objects with a low-
dimensional configuration of points, in which each object is represented by a different
point, points which are close together representing objects that are similar to one an-
other. Such configurations can then be examined by eye, to establish whether or not
the points fall into distinct, well-separated classes; closed curves can also be drawn
around each class provided by the clustering criterion (e.g., Rohlf, 1970; Shepard,
1974) to assist in the assessment.
Further indications of the properties of classes provided by a clustering procedure are
given by various plots summarizing their homogeneity and isolation from one another
(e.g., Gnanadesikan et al., 1977; Rousseeuw, 1987).
It is rarely the case that investigators know with certainty which clustering procedure
should be used to analyse a set of objects, and increasing attention has been paid
in recent years to providing ways of testing the validity of cluster structure that are
more formal than those described in the previous paragraphs. This paper presents
an overview of this topic of cluster validation; earlier reviews were given by Perruchet
(1983), Bock (1985), Jain and Dubes (1988, Chapter 4) and Gordon (1995).
In statistics, it is common for an exploratory analysis to be carried out on a data set,
for example to assist in model formulation. Such exploratory analyses are generally
followed by confirmatory analyses, which are carried out on new data sets. This has
rarely occurred in classification, in which interest has usually resided in the set of
objects which is being classified, and not in some larger population of objects, from
which the classified set is regarded as comprising a representative sample. In much
of the work which is described in this paper, therefore, cluster generation and cluster
validation are carried out on the same data set. This has major implications for
cluster validation: for example, the classes provided by a clustering algorithm are
likely to be more homogeneous than those contained in a random partition of a set
of objects, and care must be taken to specify appropriate reference distributions for
test statistics used to assess the validity of the obtained classes.
In this context, it is relevant to distinguish three different types of validation test,
involving use of external, internal, or relative cluster indices. External tests compare
a classification, or part of a classification, with external information that was not
used in the construction of the classification. Internal tests compare a (part of a)
classification with the original data. Relative tests compare several different classifi-
cations of the same set of objects; such tests are relevant in both external and internal
investigations (F.-J. Lapointe, pers. comm.).
Cluster validation tests can be categorized into four classes, depending on the type
of cluster structure under investigation (Jain and Dubes, 1988, Chapter 4): these are
tests of (1) the complete absence of class structure, (2) the validity of an individual cluster, (3) the validity of a partition into disjoint classes, and (4) the validity of a hierarchical classification.
The remainder of the paper is organized as follows: the next section describes null
models for the absence of cluster structure; the following four sections provide an overview of tests, categorized as in the previous paragraph; and the final section
presents a discussion, indicating areas in which it would be useful to see further
research.

2. Null models
The information about objects on which a classification study is to be undertaken is usually provided in one of two formats. A pattern matrix is an $n \times p$ matrix $X = (x_{ik})$, where $x_{ik}$ denotes the value of the $k$th variable describing the $i$th object. If all the variables are continuous, the objects can be represented by a set of $n$ points in $p$-dimensional Euclidean space. A dissimilarity matrix is an $n \times n$ matrix $D = (d_{ij})$, where $d_{ij}$ denotes the dissimilarity between the $i$th and $j$th objects. Dissimilarities are symmetric ($d_{ij} = d_{ji}$) and self-dissimilarities ($d_{ii}$) are zero, hence it suffices to present only the $n(n-1)/2$ entries in the lower triangle of $D$. The relationships within a set of objects are sometimes described in terms of a symmetric similarity matrix, but such data can also be analysed by straightforward modifications of the theory described in this paper.
Four main classes of null model, based on either a pattern matrix or a dissimilarity
matrix, have been proposed, as described in the following four subsections.

2.1. Poisson model


This model assumes that the objects can be represented by points which are uniformly distributed in some region $A$ of $p$-dimensional space.
Other terms that have been used to describe this model include the Uniformity hypothesis (Bock, 1985) and the Random Position hypothesis (Jain and Dubes, 1988, Chapter 4). The region $A$ has usually been specified to be the unit $p$-dimensional hypercube or hypersphere (e.g., Zeng and Dubes, 1985a; Dubes and Zeng, 1987; Hartigan and Mohanty, 1992). However, the results of tests can depend markedly on the
region $A$ that is specified (Gordon, 1996b), and it can be difficult to provide rigorous justification for the choice of region. An alternative approach is to choose $A$ to be the convex hull of the points in the data set (Smith and Jain, 1984; Bock, 1985), invoking for justification a result due to Ripley and Rasson (1977) that if points are randomly generated within a convex region of the plane, an estimate of the boundary of the region is given by a uniform dilation of the convex hull about its centroid. The rationale of such data-influenced null models is that they enable the construction of tests that investigate departures from lack of cluster structure and that are not influenced by unimportant differences between model and data, such as the region within which the points are located.
However, algorithms for finding the convex hull of a set of points in $p$-dimensional space (e.g., Chand and Kapur, 1970; Edelsbrunner, 1987, Chapter 8) and algorithms that can be used to generate data within such regions (e.g., Dobkin and Lipton, 1976; Rubin, 1984; Chazelle, 1985) make heavy demands on computing resources. An algorithm proposed by Smith and Jain (1984) that generates points uniformly within a region approximating the convex hull can perform poorly in 'empty' regions close
to the convex hull. Use of the convex hull boundary for the Poisson model is thus
limited at present to data sets having small values of the dimensionality, p, though
one might hope that the development of more efficient algorithms or increase in com-
puting power might ease this restriction in the future.
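A simple, if computationally naive, way to generate data under such a data-influenced Poisson model is rejection sampling from the bounding box, testing hull membership via a Delaunay triangulation (a Python sketch with numpy and scipy; practical only for small values of p, echoing the caveats above):

```python
import numpy as np
from scipy.spatial import Delaunay

def uniform_in_hull(X, n, seed=3):
    """Generate n points uniformly distributed in the convex hull of the
    rows of X, by rejection from the bounding box."""
    rng = np.random.default_rng(seed)
    tri = Delaunay(X)            # find_simplex(q) < 0 iff q is outside the hull
    lo, hi = X.min(axis=0), X.max(axis=0)
    out = []
    while len(out) < n:
        cand = rng.uniform(lo, hi, size=(4 * n, X.shape[1]))
        out.extend(cand[tri.find_simplex(cand) >= 0])
    return np.asarray(out[:n])
```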

2.2. Unimodal model


This model assumes that the joint distribution of the variables describing the objects
is unimodal.
The assumption behind this model is that there is only one cluster in the data set.
However, instead of insisting that the points are uniformly distributed within this
cluster, the Unimodal model allows for the possibility that objects are closer to one
another nearer the centre of the cluster than they are on the boundary.
Many different unimodal distributions exist. The most common choice for continu-
ous variables has been the multivariate normal distribution with identity variance-
covariance matrix (e.g., Rohlf and Fisher, 1968; Gower and Banfield, 1975; Hartigan
and Mohanty, 1992). Data-influenced unimodal models have also been investigated,
for example by allowing the data to specify the variance-covariance matrix of the
multivariate normal distribution (Gordon, 1996b).

2.3. Random dissimilarity matrix model


This model assumes that the elements of the lower triangle of the dissimilarity matrix are ranked in random order, all $(n(n-1)/2)!$ rankings being equally likely (Ling, 1973a).
The model has also been referred to as the Random Graph hypothesis (Jain and Dubes, 1988, Chapter 4): if each object is represented by a vertex in a graph, and an edge is present between the $i$th and $j$th vertices if $d_{ij}$ is less than some specified threshold value, then the edges are inserted in random order. In variants of the hypothesis, a specified number of (randomly-selected) edges is present in the graph, or each edge is present with a specified probability, independently of all other edges.
In that only the order of the n(n - 1)/2 dissimilarities is taken into account, the
random dissimilarity matrix model might seem to be restricted to use in conjunction
with monotone admissible (Fisher and Van Ness, 1971) clustering procedures, and
it appears to have been used only with the single link and complete link clustering
criteria.
A major criticism of the random dissimilarity matrix model is that it ignores second- and higher-order relationships: thus, if $d_{ij}$ were small, one might expect $d_{ik}$ and $d_{jk}$ to have similar ranks. Ling (1973a) argued that clusters that were not significant
under this model were unlikely to be deemed significant under any other null model.
However, the model seems inappropriate for objects described by a pattern matrix
(Gordon, 1996b), although it might be relevant for testing for the presence of class
structure when the dissimilarity matrix is provided directly, rather than being derived
from a pattern matrix. Frank and Strauss (1986) proposed a more general random
graph model that allows for dependence between edges that meet the same vertex.
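Since only the ranks of the dissimilarities matter under this model, reference distributions for graph-based statistics (such as the component counts used in the tests of Section 3) can be simulated by inserting k uniformly chosen edges. A sketch in Python with numpy and scipy (the function names are ours):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def n_components_at_k(D, k):
    """Number of components of the threshold graph containing the k smallest
    of the n(n-1)/2 dissimilarities; D is a square dissimilarity matrix."""
    iu = np.triu_indices(D.shape[0], 1)
    order = np.argsort(D[iu])[:k]           # insert the k smallest edges
    adj = coo_matrix((np.ones(k), (iu[0][order], iu[1][order])),
                     shape=(D.shape[0], D.shape[0]))
    return connected_components(adj, directed=False)[0]

def null_components(n, k, n_sim=999, seed=4):
    """Reference distribution of the component count when k edges are
    chosen uniformly at random, as the model prescribes."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(n, 1)
    counts = []
    for _ in range(n_sim):
        pick = rng.choice(iu[0].size, size=k, replace=False)
        adj = coo_matrix((np.ones(k), (iu[0][pick], iu[1][pick])), shape=(n, n))
        counts.append(connected_components(adj, directed=False)[0])
    return np.array(counts)
```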

2.4. Random permutation model


This model considers independently permuting the entries in each of the $p$ columns of the pattern matrix. There are $(n!)^{p-1}$ essentially different matrices, each of which
is regarded as equally likely under the model.


Tests based on this null model and variants of it (e.g., Harper, 1978; Strauss, 1982;
Vassiliou et al., 1989) compare the class structure revealed in the analysis of the orig-
inal pattern matrix with that provided by the classification of randomly-generated
pseudo pattern matrices. This approach has a different rationale to use of the
previously-described null models, and it is not obvious that clusters in the origi-
nal data set that are more homogeneous than those found in many pseudo data sets
need have a high degree of absolute homogeneity. The approach also ignores the cor-
relation between variables, and generates pseudo-objects within a hyperrectangular
region in the space of variables.
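Generating a pseudo pattern matrix under this model is straightforward (a Python sketch with numpy):

```python
import numpy as np

def permuted_pattern_matrix(X, seed=5):
    """One pseudo pattern matrix under the random permutation model: each of
    the p columns of X is permuted independently, which destroys any joint
    class structure (and, as noted above, any correlation between variables)."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
```

Comparing a homogeneity index of the clusters found in the original matrix with the indices obtained from many such pseudo matrices gives the Monte Carlo reference distribution.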

2.5. Alternative hypotheses


Alternative models, specifying the presence of some clusters in the data, have been
presented. For objects described by pattern matrices, a multimodal distribution
for the variables describing the objects is clearly appropriate. One relevant imple-
mentation is that the data arise from a mixture of distributions that differ only in
their location parameters (Bock, 1985, 1996; Hartigan, 1985; Hartigan and Mohanty,
1992). Other models have arisen in the analysis of spatial point patterns, such as the
Neyman-Scott cluster process (Diggle, 1983, Chapter 4).
In a relevant 'clustering' alternative to the random dissimilarity matrix model, the
probability that an edge is present depends on whether the vertices that it links
represent objects that belong to the same cluster or different clusters in a partition
(Frank, 1978; Frank and Harary, 1982).
An alternative hypothesis for hierarchical classifications can be obtained by perturb-
ing data that have perfect ultrametric structure; such data could be presented in
a matrix of dissimilarities (Cunningham and Ogilvie, 1972) or their ranks (Baker,
1974), or in a pattern matrix (Gordon, 1996c).
The usefulness of tests for the absence of cluster structure can be assessed by estab-
lishing their power when data are generated under such alternative hypotheses. Such
investigations have rarely been carried out.

3. Tests of the absence of class structure


It might be expected that investigators would test a set of objects to establish if they
could be regarded as lacking class structure, before undertaking a classification of
them. Such preliminary investigations have rarely been carried out, possibly because
investigators were confident that classes existed, or because they intended to validate
the results using some of the methodology described later in the paper. It should also
be noted that some tests of the absence of class structure make use of the results of a
classification; some of these tests can thus also be used in the validation of particular
types of class structure in a data set.
Tests of the Poisson model have been based on: the number of interpoint distances less
than a specified threshold (Strauss, 1975; Kelly and Ripley, 1976; Saunders and Funk,
1977); the largest nearest neighbour distance within the set of objects (Bock, 1985);
and generalizations of tests for randomness of two-dimensional spatial point patterns
(Ripley, 1981; Diggle, 1983) that compare distances from a randomly-selected posi-
tion to the nearest point, with the distance between that (or a randomly-selected)
point and its nearest neighbour (Cross and Jain, 1982; Panayirci and Dubes, 1983;
Zeng and Dubes, 1985a), such tests requiring attention to be paid to edge effects.

Generalizations of a test proposed for use in two dimensions in Hopkins (1954) were
found to have superior performance in comparative studies of such tests carried out
by Zeng and Dubes (1985b) and Dubes and Zeng (1987).
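A p-dimensional index of this Hopkins type can be sketched as follows (Python with numpy and scipy; the bounding-box sampling region and the neglect of edge effects are our simplifications):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, n_probes=None, seed=6):
    """Compare distances from random positions to the nearest data point (u)
    with nearest-neighbour distances of randomly selected data points (w).
    Values near 0.5 suggest spatial randomness, values near 1 clustering."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = n_probes or max(1, n // 10)
    tree = cKDTree(X)
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, p))
    u = tree.query(probes, k=1)[0]                 # position-to-point distances
    sample = X[rng.choice(n, size=m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]           # point-to-neighbour distances
    return float((u ** p).sum() / ((u ** p).sum() + (w ** p).sum()))
```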
Tests based on the distribution of the lengths of edges in the minimum spanning
tree have been proposed under the Poisson model (Hoffman and Jain, 1983) and a
multivariate normal model (Rohlf, 1975). Smith and Jain (1984) considered adding
randomly-positioned points to the data set and using a test due to Friedman and
Rafsky (1979) based on the number of edges in the minimum spanning tree of the
combined data set that link original and added points. Brailovsky (1991) generated
modified data sets, in which each point was either retained with a specified prob-
ability or replaced by a randomly-positioned point, and compared the strength of
clustering of the original data set with that of the modified data sets.
Bock (1985) described theoretical results pertaining to the average pairwise similarity
of objects and the total within-class sum of squared distances when the objects were
optimally partitioned into a specified number of classes, deriving asymptotic distri-
butions of relevant test statistics; see also Hartigan (1977, 1978). However, more
work is required to establish how relevant such results are for assessing finite-sized
data sets. It has been more common for tests to involve simulation studies.
Some tests are based on searching for 'gaps' or multimodality. Hartigan and Mohanty
(1992) studied the properties under both Poisson and Unimodal models of a test based
on a single link dendrogram: for each single link class C, the number of objects n(C)
in its smallest offspring class is noted, and the test statistic is the maximum value of
n(C) over all classes C, large values suggesting the presence of bimodality. Müller and
Sawitzki (1991) described a test based on comparing the amounts of probability mass
exceeding various threshold values when there are c modes in the distribution, for
different values of c, but this test is at present computationally feasible only for small
values of the dimensionality, p. Hartigan (1988) considered evaluating the minimum
amount of probability mass required to render the distribution unimodal, but settled
for estimating departures from unimodality along the edges of an optimally-rooted
minimum spanning tree. Rozal and Hartigan (1994) described a test based on mini-
mum spanning trees, constrained so that edge lengths are non-increasing on all paths
to the root node(s) corresponding to the class centre(s). Sneath (1986) presented a
test based on the empirical distribution function of the number of internal nodes in
a dendrogram that are located at less than a specified height.
The following test statistics have been proposed for use in conjunction with the ran-
dom dissimilarity matrix model to test for the absence of class structure: the number
of edges required before the graph consists of a single component (Rapoport and Fil-
lenbaum, 1972; Schultz and Hubert, 1973; Ling, 1975; Ling and Killough, 1976); the
number of vertices not belonging to the largest component when a specified number
of edges is present (Ogilvie, 1969); the number of components when a specified num-
ber of edges is present (Ling, 1973b; Ling and Killough, 1976); the sizes of clusters
in a partition into two clusters (Van Cutsem and Ycart, 1996). Godehardt (1990)
described a generalization of the random dissimilarity matrix model to rnultigraphs,
in which each variable describing the objects defines a different dissimilarity matrix.

4. Assessing individual clusters


An early approach to the assessment of individual clusters was to define properties of
an ideal cluster, and identify which clusters satisfied these conditions. For example,
all within-cluster pairwise dissimilarities can be required to be less than all dissim-
ilarities between an object in the cluster and an object outside it (van Rijsbergen,
1970); other definitions of what might constitute an ideal cluster were provided by
McQuitty (1963, 1967), Jardine (1969), Ling (1972) and Hubert (1974a). However,
these are fairly restrictive requirements, and few data sets possess large ideal clusters.
Various measures of the internal cohesion and external isolation of clusters have been
defined (e.g., Estabrook, 1966; Bailey and Dubes, 1982). Ling (1973a) defined the
'lifetime' of a cluster to be the difference between the rank of the dissimilarity at
which it is formed and the rank at which it is incorporated into a larger cluster,
evaluating lifetime distributions of single link clusters under the random dissimilarity
matrix model. Matula (1977) proved that the distribution of the size of the maximal
complete subgraph of a random graph, in which each edge is independently present
with the same probability, is highly peaked, allowing an assessment of complete link
clusters. When k edges are present in a random graph, Bailey and Dubes (1982) ob-
tained inequalities for the probabilities of obtaining indices of cohesion and isolation
as extreme as the observed ones for single link and complete link clusters, plotting
these bounds for different values of k. Lerman (1970, Chapter 2, 1980, 1983) de-
fined U statistics comparing within-cluster and between-cluster dissimilarities, and
assessed both partitions and individual clusters under the hypothesis of random par-
titions having the same cluster sizes as observed in the results. Gordon (1994, 1996b)
obtained by simulation critical values of U statistics under all four types of null model
described in Section 2, by reanalysing random data sets using the same clustering
procedure used in the classification of the original set of objects. He noted unsatisfactory properties of the random dissimilarity matrix and random permutation models, and the dependence of the results for the random pattern matrix models on the precise specification of the null model.
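The basic ingredient of such U statistics is a rank comparison of within-cluster and between-cluster dissimilarities, as in this sketch (Python with numpy and scipy; as the surrounding discussion stresses, the nominal p-value is untrustworthy when the clusters were fitted to the same data, so critical values should be obtained by simulation):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def within_between_u(D, labels):
    """Mann-Whitney U comparison of within-cluster and between-cluster
    dissimilarities; D is a square dissimilarity matrix and 'labels'
    assigns each object to a cluster."""
    labels = np.asarray(labels)
    iu = np.triu_indices(D.shape[0], 1)
    same = labels[iu[0]] == labels[iu[1]]
    return mannwhitneyu(D[iu][same], D[iu][~same], alternative="less")
```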
Many tests of the validity of an individual cluster have involved examining the dis-
tinctness of its offspring classes (e.g., Engelman and Hartigan, 1969; Gnanadesikan et
al., 1977; Sneath, 1977,1979,1980; Barnett et al., 1979; Lee, 1979). These tests have
usually been restricted to univariate data, sometimes obtained by projection onto the
line joining the centroids of the two classes. Analytical results are difficult to obtain
(Hartigan, 1977), and recourse has usually been made to simulation studies.
Some tests of whether or not a cluster should be sub-divided have been used as 'local
stopping rules' for deciding on the number of clusters in a data set; such tests are
described in the next section.
The tests for assessing individual clusters described in this section are internal val-
idation tests, as defined in Section 1: cluster generation and cluster validation are
carried out on the same data set. They encounter the 'multiple comparison' prob-
lem that tests of different clusters in the same data set are not independent of one
another, which has implications for the significance levels of the tests. By contrast,
Gabriel and Sokal (1969) described a simultaneous test procedure with assured over-
all significance level, in which (possibly overlapping) largest homogeneous clusters
are identified; their approach is also applicable to determining coarsest acceptable
partitions.

5. Assessing partitions
A partition of a set of objects can be compared with an externally-specified parti-
tion of the objects by evaluating a relevant index comparing the partitions. The
significance of the value of this index can be determined by comparing it with its
distribution under random permutations of the labels of the objects that leave un-
changed the numbers of objects in each class of the partitions. A comparative study
of five indices conducted by Milligan and Cooper (1986) concluded that Hubert and
Arabie's (1985) modification of Rand's (1971) index was best suited to comparing a
specified partition with cluster output comprising several different numbers of clus-
ters.
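As a concrete illustration, the following sketch (with illustrative function names) computes Hubert and Arabie's (1985) index from the contingency table of two label vectors, together with a permutation significance level of the kind just described; permuting one label vector leaves the class sizes of both partitions unchanged.

    import numpy as np
    from scipy.special import comb

    def adjusted_rand(u, v):
        # Hubert and Arabie's (1985) chance-corrected Rand index between
        # two partitions u, v (label vectors over the same n objects).
        u = np.unique(u, return_inverse=True)[1]
        v = np.unique(v, return_inverse=True)[1]
        table = np.zeros((u.max() + 1, v.max() + 1), dtype=int)
        np.add.at(table, (u, v), 1)
        sum_ij = comb(table, 2).sum()
        sum_a = comb(table.sum(axis=1), 2).sum()
        sum_b = comb(table.sum(axis=0), 2).sum()
        expected = sum_a * sum_b / comb(len(u), 2)
        return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

    def ari_significance(u, v, n_perm=999, seed=None):
        # Locate the observed index in its random-permutation null distribution.
        rng = np.random.default_rng(seed)
        observed = adjusted_rand(u, v)
        exceed = sum(adjusted_rand(rng.permutation(np.asarray(u)), v) >= observed
                     for _ in range(n_perm))
        return observed, (exceed + 1) / (n_perm + 1)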
On occasion, the external information might not provide a partition of the objects to
be classified, but rather describe the classes to which they are believed to belong. At
the most formal level, this description could specify a statistical model comprising a
mixture of a known number of classes, each with known parametric form, but with
unknown mixing proportions; at the least formal level, the distribution function for
each class could be provided by the empirical distribution of a class provided by a
clustering algorithm (Gordon, 1996d).
However, in many classification studies, external information is not available, and it
has been more common for investigators to seek answers to the following question
about a given set of objects: 'How many clusters are there in the data (and what is
their composition)?' There are several ways in which this question may be posed:
1. Does a partition into c (say) clusters that has been provided by a clustering
algorithm comprise cohesive and isolated clusters?
2. What value(s) of c is (are) most strongly indicated as (an) informative representation(s) of the data?
These questions are addressed using, respectively, internal and relative validation
tests. The multiple comparison problem is again relevant: the value of c in (1) will
have been chosen by an investigator after studying the results of a classification, and
most tests for determining the appropriate value(s) of c in (2) have unknown signifi-
cance levels.
The first question stated above can be addressed by defining a measure of the ade-
quacy of a partition into c classes (e.g., the total within-class sum of squared distances
about the c centroids), and obtaining its distribution under a null model of the ab-
sence of class structure. Some asymptotic theoretical results have been obtained (e.g.,
Hartigan, 1977; Pollard, 1982; Bock, 1985), but further work is required to establish
their appropriateness for finite data sets. Baker and Hubert (1976) assessed parti-
tions into complete link clusters using the number of extraneous edges present under
the random graph model. Monte Carlo tests have been carried out by evaluating the
measure of partition adequacy when many randomly-generated data sets are parti-
tioned into c classes using the same clustering procedure as employed on the original
data (Arnold, 1979; Milligan and Mahajan, 1980; Milligan and Sokol, 1980).
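The sketch below illustrates such a Monte Carlo test, taking, purely for concreteness, k-means as the clustering procedure and a uniform distribution on the bounding box of the data as the null model of the absence of class structure; both choices are illustrative, and the result will depend on the precise null specification.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def within_ss(x, labels):
        # Total within-class sum of squared distances about the centroids.
        return sum(((x[labels == g] - x[labels == g].mean(axis=0)) ** 2).sum()
                   for g in np.unique(labels))

    def monte_carlo_adequacy(x, c, n_sim=99, seed=None):
        # Observed adequacy versus its distribution when random data sets are
        # partitioned into c classes by the same clustering procedure.
        rng = np.random.default_rng(seed)
        observed = within_ss(x, kmeans2(x, c, minit='points')[1])
        null = []
        for _ in range(n_sim):
            z = rng.uniform(x.min(axis=0), x.max(axis=0), size=x.shape)
            null.append(within_ss(z, kmeans2(z, c, minit='points')[1]))
        # Small observed values, rare under the null, indicate class structure.
        return observed, (1 + sum(w <= observed for w in null)) / (n_sim + 1)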
Partitions have also been assessed by investigating their stability when slightly modi-
fied versions of the data are reanalysed (e.g., Rand, 1971; Gnanadesikan et aI., 1977),
and by their replicability across subsamples (e.g., McIntyre and Blashfield, 1980;
Breckenridge, 1989).
A problem with internal validation tests of a partition of the set of objects into c
classes is the inter-relatedness of class structure at different values of c: thus, if there
were really c0 classes in the data, the hypothesis that there were c classes in the data
would probably be acceptable for a range of values of c close to c0. The second ques-
tion stated above is concerned with the problem of identifying appropriate values of
c.
Procedures that seek to find the single most appropriate value of c are generally
referred to as 'stopping rules'; this is because, in the absence of information about
relevant values of c, investigators often obtain a complete hierarchical classification
using an agglomerative algorithm, and wish to have guidance on when to stop amal-
gamating and regard the current partition as the optimal one. Most research on
cluster validation has involved the construction of stopping rules, and this overview
describes only a small selection of them.
Stopping rules can be categorized as either global or local. Global stopping rules make
use of all the information contained in a partition into c clusters for each value of
c. The value of c for which a specified criterion is satisfied is then identified. Many
of the criteria are based on seeking the optimal (maximum or minimum) value of
measures comparing within-cluster and between-cluster variability, and do not have
a natural definition when c = 1: it is clearly a disadvantage of stopping rules if they
do not allow one to reach the conclusion that the data comprise only a single cluster.
Selected global stopping rules are described in this paragraph. Calinski and Harabasz
(1974) identified the value of c that maximized a scaled ratio of between-cluster
to within-cluster sum of squared distances. The C-index identifies c which min-
imizes a standardized version of the sum of all within-cluster dissimilarities (D):
if the partition has a total of r within-cluster dissimilarities, Dmin (resp., Dmax)
is defined as the sum of the r smallest (resp., largest) dissimilarities, and C =
(D - Dmin)/(Dmax - Dmin). Goodman and Kruskal's (1954) gamma compares all within-
cluster dissimilarities with all between-cluster dissimilarities, defining a comparison
to be concordant (resp., discordant) if a within-cluster dissimilarity is less (resp.,
greater) than a between-cluster dissimilarity; the optimal value of c is defined to be
the one which maximises gamma = (S+ - S-)/(S+ + S-), where S+ (resp., S-) is the number
of concordant (resp., discordant) comparisons. A test based on the point biserial cor-
relation identifies the value of c which maximizes the correlation between the original
dissimilarities and an n x n matrix of 1's and 0's indicating whether or not the objects
belonged to the same cluster. The cubic clustering criterion (Sarle, 1983) is based on
R2, the proportion of the variance accounted for by the clusters, and identifies the
value of c which maximizes a function of R2 which includes terms derived from sim-
ulation studies under the Poisson model. Other global stopping rules were proposed
by Jackson (1969), Gower (1973), Davies and Bouldin (1979), Hill (1980), Ratkowsky
(1984), Krzanowski and Lai (1988), Xu et al. (1993), and many others; reviews and
comparative studies were presented by Milligan (1981), Milligan and Cooper (1985),
and Dubes (1987).
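Two of the rank-based criteria just described are readily computed from a dissimilarity matrix; the sketch below evaluates them for a given partition, and would be applied over a range of candidate values of c, choosing the value that minimizes the C-index or maximizes gamma.

    import numpy as np

    def c_index(d, labels):
        # C = (D - Dmin)/(Dmax - Dmin), where D is the sum of the r within-
        # cluster dissimilarities and Dmin (Dmax) is the sum of the r
        # smallest (largest) dissimilarities overall.
        labels = np.asarray(labels)
        iu = np.triu_indices(len(labels), k=1)
        same, dvec = (labels[:, None] == labels[None, :])[iu], d[iu]
        r, s = int(same.sum()), np.sort(d[iu])
        return (dvec[same].sum() - s[:r].sum()) / (s[-r:].sum() - s[:r].sum())

    def gk_gamma(d, labels):
        # gamma = (S+ - S-)/(S+ + S-) over all (within, between) comparisons;
        # ties contribute to neither count.
        labels = np.asarray(labels)
        iu = np.triu_indices(len(labels), k=1)
        same = (labels[:, None] == labels[None, :])[iu]
        within, between = d[iu][same], d[iu][~same]
        s_plus = (within[:, None] < between[None, :]).sum()
        s_minus = (within[:, None] > between[None, :]).sum()
        return (s_plus - s_minus) / (s_plus + s_minus)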
Local stopping rules are based on tests of whether or not a pair of clusters should be
amalgamated, or a single cluster should be subdivided. Unlike global stopping rules,
they are thus restricted to the analysis of hierarchically-nested sets of partitions, and
use only a part of the data. They often also require the specification of a threshold,
the value of which can markedly influence the properties of the stopping rule.
Selected local stopping rules are described in this paragraph. Duda and Hart (1973,
Section 6.12) compared W1, the within-cluster sum of squared distances, with W2,
the total within-cluster sum of squared distances when the cluster is optimally par-
titioned into two, rejecting the hypothesis of a single cluster if W2/W1 is sufficiently
small; amalgamations cease when the hypothesis is first rejected. A test with a sim-
ilar rationale proposed by Beale (1969) is based on (W1 - W2)/W2. Legendre et
al. (1985) categorized the dissimilarities in a cluster as either 'high' or 'low', and
assessed whether or not it should be subdivided into two sub-clusters by carrying
out a randomization test based on the proportion of high dissimilarities between the
sub-clusters compared to within the cluster. Other local stopping rules have been
proposed by Gnanadesikan et al. (1977) and Howe (1979), and some of the tests de-
scribed in Section 4 for assessing individual clusters can also be used as local stopping
rules.
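A minimal sketch of the Duda and Hart ratio follows, using two-means as a heuristic stand-in for the optimal bipartition (which it need not find); the threshold against which W2/W1 is compared is left to the user, since the approximate critical value depends on the cluster size and dimensionality.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def duda_hart_ratio(x):
        # W2/W1: within-cluster sum of squares after a split into two,
        # relative to that of the undivided cluster; reject the hypothesis
        # of a single cluster when the ratio is sufficiently small.
        w1 = ((x - x.mean(axis=0)) ** 2).sum()
        labels = kmeans2(x, 2, minit='points')[1]
        w2 = sum(((x[labels == g] - x[labels == g].mean(axis=0)) ** 2).sum()
                 for g in np.unique(labels))
        return w2 / w1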
For many data sets, a complete hierarchical classification has been obtained and a
stopping rule has then been used to provide guidance on where to section the hierar-
chy to provide a partition. Such work often disregards which clustering criterion has
been used to obtain the hierarchy, whereas one can expect different stopping rules -
just as different clustering criteria - to be more effective in analysing different types of
data. Stronger links between the processes of generating and validating partitions are
provided by work that assesses the stability of a partition by reanalysis of a perturbed
version of the data set (e.g., Begovich and Kane, 1982) or of bootstrap samples (Jain
and Moreau, 1987); or by separately reanalysing sub-samples of the data (Overall
and Magee, 1992); or by noting the value of c for which 'fuzzy partitions' are 'closest'
to 'hard' partitions (e.g., Roubens, 1978; Windham, 1981, 1982; Rivera et al., 1990).
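The perturbation approach can be sketched as follows, reusing the adjusted_rand function sketched above and taking k-means, Gaussian noise and the adjusted Rand index as illustrative choices; since the index is invariant to relabelling of the clusters, no matching of cluster labels is needed.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def perturbation_stability(x, c, n_rep=50, noise=0.05, seed=None):
        # Mean agreement between the partition of x and the partitions of
        # noise-perturbed copies of x, clustered by the same procedure.
        rng = np.random.default_rng(seed)
        base = kmeans2(x, c, minit='points')[1]
        scale = noise * x.std(axis=0)
        scores = [adjusted_rand(base,
                                kmeans2(x + rng.normal(0.0, scale, x.shape),
                                        c, minit='points')[1])
                  for _ in range(n_rep)]
        return float(np.mean(scores))

Values near 1 indicate a partition that is stable under small perturbations of the data.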
Proposals for determining the number of clusters present in a set of objects continue
to be published in the research literature, often with only a cursory examination of
their properties and little attempt to establish how they perform in comparison with
previously-published procedures. A detailed comparative study of thirty procedures
by Milligan and Cooper (1985) showed that many tests performed very poorly in
detecting reasonably clear-cut clusters. The five procedures that performed best in
Milligan and Cooper's (1985) simulation study were the first three global stopping
rules and the first two local stopping rules described earlier in this section. It is pos-
sible that these results have been influenced by the fact that the clusters generated in
Milligan and Cooper's (1985) study were sampled from mildly-truncated multivariate
normal distributions.
The work described above seeks to identify the single most appropriate value for the
number of clusters present in the data. However, it may be relevant to describe a set
of objects in terms of partitions into two or more (possibly nested) widely-separated
numbers of clusters, depicting the class structure present at several different scales in
the data. Gordon (1996c) presented a study investigating the ability of modifications
of Milligan and Cooper's (1985) five superior stopping rules to detect nested cluster
structure.

6. Assessing hierarchical classifications


Some investigators have been interested in comparing two or more independently-
derived hierarchical classifications of the same set of objects; see, for example, the
references cited in Lapointe and Legendre (1995). Much of this research is considered
to be outside the terms of reference of this paper, which concentrates on assess-
ing the validity of (aspects of) a single classification; for a discussion of a general
approach to synthesizing the results of several different classifications, see Lapointe
(1996). However, some of this methodology is relevant for conducting external vali-
dation tests comparing a hierarchical classification of data provided by a clustering
algorithm with an externally-specified hierarchical classification that does not make
use of these data: thus, the methodology is not relevant for comparing the results of
applying two or more different clustering algorithms to the same data set, but would
be appropriate if the externally-specified classification were based on the clustering
of data comprising different variables describing the same set of objects.
It is relevant to distinguish between several different ways in which the information
contained in a hierarchical classification can be summarized (Gordon, 1996a). A non-
ranked tree or n-tree (Bobisud and Bobisud, 1972; McMorris et al., 1983) specifies
only the nested set of clusters present in the hierarchy. A dendrogram also specifies
the height in the tree of the internal node subtending each cluster. In a ranked tree
(Boorman and Olivier, 1973; Murtagh, 1984), only the ordering of these heights is
specified. It is stressed that all trees referred to in this context are rooted.
Many different indices have been proposed for comparing two hierarchical classifi-
cations (Rohlf, 1982). The significance of the value taken by such an index can be
assessed by comparing it with its distribution under a suitable null hypothesis. Hy-
potheses that have been considered are the random label hypothesis, in which labels
are independently permuted, and several random tree hypotheses, in which distribu-
tions of various types of tree are considered (e.g., binary or multifurcating, ranked
or non-ranked or dendrograms). Because the numbers of different trees in the dis-
tributions increase rapidly with n, it is common for these investigations to involve
simulation studies, in which trees are randomly selected from an appropriate dis-
tribution. Furnas (1984) reviewed algorithms for the uniform generation of various
types of random tree, and Lapointe and Legendre (1991) described the generation of
random dendrograms. Using this theory, tests that could be used for the external
validation of a complete hierarchical classification have been proposed under the ran-
dom label hypothesis (Hubert and Baker, 1977) and various random tree hypotheses
(e.g., Simberloff, 1987). Tests can also be carried out on hypotheses that a specified
part of a hierarchical classification possesses a certain class structure (De Soete et
al., 1987; Hubert, 1987, Chapter 5; Lapointe and Legendre, 1990).
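A sketch in the spirit of such tests follows: two hierarchies over the same objects, given as scipy linkage matrices, are compared through the correlation of their cophenetic matrices, with significance assessed under the random label hypothesis by jointly permuting the rows and columns of one matrix. This illustrates the general mechanism rather than reproducing any of the cited procedures exactly.

    import numpy as np
    from scipy.cluster.hierarchy import cophenet
    from scipy.spatial.distance import squareform

    def random_label_test(z1, z2, n_perm=999, seed=None):
        # Correlate the cophenetic (ultrametric) distances of two hierarchies;
        # permuting object labels leaves the tree shapes unchanged.
        rng = np.random.default_rng(seed)
        a, b = squareform(cophenet(z1)), squareform(cophenet(z2))
        iu = np.triu_indices(len(a), k=1)
        observed = np.corrcoef(a[iu], b[iu])[0, 1]
        exceed = 0
        for _ in range(n_perm):
            p = rng.permutation(len(a))
            exceed += np.corrcoef(a[p][:, p][iu], b[iu])[0, 1] >= observed
        return observed, (exceed + 1) / (n_perm + 1)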
It is worth stressing, however, that the distribution of trees resulting from the appli-
cation of a clustering algorithm to data generated under a different null model can
differ markedly from being uniform over the set of trees (Frank and Svensson, 1981).
In the cognate topic of decision trees (e.g., Breiman et al., 1984), trees have been
'pruned' so as to remove branches below uninformative internal nodes, sometimes
using new data sets (e.g., Quinlan, 1987), and such methodology has been used to
simplify hierarchical classifications (Fisher, 1996).
Apart from such external validation tests, little work has been carried out on as-
sessing the validity of a hierarchical classification provided by a clustering algorithm.
Information has been obtained about the distributions, under Poisson, unimodal,
and random dissimilarity matrix models, of the distortion imposed on data when they
are represented in a hierarchical classification (e.g., Rohlf and Fisher, 1968; Hubert,
1974b; Gower and Banfield, 1975), but if such null hypotheses are rejected, it does
not follow that the complete hierarchy is validated. Lerman (1970, Chapter 4, 1981,
Chapter 3) defined a measure of the extent to which dissimilarities failed to satisfy
the ultrametric property defining a hierarchical classification, and investigated prop-
erties of this measure when sampling from binary pattern matrices with specified row
sums. Smith and Dubes (1980) divided the set of objects into two and compared
the classification of a subset of the data with the relevant part of the original clas-
sification, assessing the resemblance by reference to the random dissimilarity matrix
model. Other approaches have aimed at assessing the stability of a hierarchical classi-
fication by measuring the influence of each object, i.e. the extent to which the results
are altered if an object is deleted or differentially weighted (e.g., Gnanadesikan et
al., 1977; Jambu and Lebeaux, 1983, Chapter 6; Gordon and De Cata, 1988; Jolliffe
et al., 1988); separate trees, each based on (n - 1) objects, can be combined into a
consensus tree or trees (Lanyon, 1985; Lapointe et al., 1994).
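One conventional measure of this distortion is the cophenetic correlation between the original dissimilarities and the ultrametric distances induced by the hierarchy; the short example below computes it for three linkage methods, bearing in mind the caveat, discussed next, that different distortion measures tend to favour classifications obtained by different clustering procedures.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    x = np.random.default_rng(0).normal(size=(30, 4))  # illustrative data
    d = pdist(x)                                       # original dissimilarities
    for method in ('single', 'complete', 'average'):
        r, _ = cophenet(linkage(d, method=method), d)
        print(method, round(r, 3))                     # cophenetic correlation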
A validation test which would appear to be of considerable interest is a relative test
of which of two or more hierarchical classifications provides the best summary of
a given set of objects. One might consider addressing this problem by defining a
suitable measure of the distortion imposed in representing the data in a hierarchical
classification, and identifying the classification that has minimum distortion. How-
ever, different measures of distortion tend to favour classifications obtained using
different clustering procedures (e.g., Sokal and Rohlf, 1962; Sneath, 1969; Faust and
Romney, 1985): in effect, this approach simply reformulates the problem of specify-
ing appropriate clustering procedures in terms of defining appropriate measures of
distortion.
Investigators are often interested only in parts of a hierarchical classification, and
assess its constituent clusters and partitions using methodology described in the pre-
vious two sections. It can then be of interest to represent the data in a parsimonious
tree which retains only those parts of the classification that are deemed to be signifi-
cant (Lerman, 1980; Lerman and Ghazzali, 1991; Gordon, 1994).

7. Discussion
The topic of cluster analysis was initially perceived as being concerned primarily with
the exploratory analysis of sets of objects, with little attention being paid to assessing
the validity of the results. Some results have been assessed solely in terms of their
interpretability or usefulness, but there are clearly dangers in such an approach: the
human mind is quite capable of providing post hoc justification for results of dubious
validity. More recently, there has been an increased awareness of the importance of
cluster validation. However, few studies have included a validation phase: of those
that have, most have involved stopping rules, as described in Section 5.
Much research remains to be done in the field of cluster validation. Ideally, one would
like to be able to specify: appropriate null hypotheses of the absence of cluster struc-
ture; alternative hypotheses describing departures from such null hypotheses which
it is important to detect; test statistics with known properties, which are effective in
identifying the type of class structure that is present in the data.
The precise null model that is specified can markedly influence the results of a test
(Gordon, 1996b), and it would be useful to have further investigations, particularly
of data-influenced null models.
Many different test procedures have been proposed, particularly for addressing the
'how many clusters?' question, and the time would seem to have come when attention
should be devoted to carrying out further comparisons of these in order to determine
their properties. One problem is that a 'cluster' is a vaguely-defined concept, that
many different types of cluster could be present in data, and that one can expect
different test procedures to be effective at detecting different types of structure. It is
thus unrealistic to expect to be able to identify a single 'best' test procedure for each
type of investigation. Nevertheless, Milligan and Cooper's (1985) comparative study
indicated that some tests performed very poorly in detecting reasonably clear-cut
structure. It thus seems useful to advocate further studies, with the aims of elimi-
nating from further consideration tests that have poor performance, and identifying
a small number of 'superior' tests; such tests could then profitably be incorporated
into standard statistical and classification computer packages. It seems inevitable
that such comparative studies will be largely based on assessing the performance of
test procedures in the analysis of data sets that have known class structure.
Some cluster validation tests make heavy demands on computing resources, and the
kind of investigation which is feasible depends on the size of the data set and the
nature of the data. Nevertheless, one might hope that the future will see the develop-
ment of more efficient procedures and algorithms, and an increase in the power and
availability of computing facilities, thus facilitating a greater use of cluster validation
methodology.
References:
Arnold, S. J. (1979): A test for clusters. Journal of Marketing Research, 16, 545-551.
Art, D., Gnanadesikan, R. and Kettenring, J. R. (1982): Data-based metrics for cluster analysis.
Utilitas Mathematica, 21A, 75-99.
Bailey, T. A., Jr. and Dubes, R. (1982): Cluster validity profiles. Pattern Recognition, 15, 61-83.
Baker, F. B. (1974): Stability of two hierarchical grouping techniques case I: Sensitivity to data
errors. Journal of the American Statistical Association, 69, 440-445.
Baker, F. B. and Hubert, L. J. (1976): A graph-theoretic approach to goodness-of-fit in complete
link hierarchical clustering. Journal of the American Statistical Association, 71, 870-878.
Barnett, V., Kay, R. and Sneath, P. H. A. (1979): A familiar statistic in an unfamiliar guise -- A
problem in clustering. The Statistician, 28, 185-191.
Beale, E. M. L. (1969): Euclidean cluster analysis. Bulletin of the International Statistical Institute,
43(2), 92-94.
Begovich, C. 1. and Kane, V. E. (1982): Estimating the number of groups and group membership
using simulation cluster analysis. Pattern Recognition, 15, 335-342.
Binder, D. A. (1978): Bayesian cluster analysis. Biometrika, 65, 31-38.
Bobisud, H. M. and Bobisud, L. E. (1972): A metric for classification. Taxon, 21, 607-613.
Bock, H. H. (1974): Automatische Klassifikation: Theoretische und Praktische Methoden zur Grup-
pierung und Strukturierung von Daten (Cluster-Analyse). Vandenhoeck & Ruprecht, Göttingen.
Bock, H. H. (1985): On some significance tests in cluster analysis. Journal of Classification, 2,
77-108.
Bock, H. H. (1989): Probabilistic aspects in cluster analysis. In Conceptual and Numerical Analysis
of Data, Opitz, O. (ed.), 12-44, Springer-Verlag, Berlin.
Bock, H. H. (1996): Probability models and hypothesis testing in partitioning cluster analysis. In
Clustering and Classification, Arabie, P. et al. (eds.), 377-453, World Scientific Publishing, River
Edge, NJ.
Boorman, S. A. and Olivier, D. C. (1973): Metrics on spaces of finite trees. Journal of Mathematical
Psychology, 10, 26-59.
Brailovsky, V. L. (1991): A probabilistic approach to clustering. Pattern Recognition Letters, 12,
193-198.
Breckenridge, J. N. (1989): Replicating cluster analysis: Method, consistency and validity. Multi-
variate Behavioral Research, 24, 147-161.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984): Classification and Regression
Trees. Wadsworth, Belmont, CA.
Calinski, T. and Harabasz, J. (1974): A dendrite method for cluster analysis. Communications in
Statistics, 3, 1-27.
Chand, D. R. and Kapur, S. S. (1970): An algorithm for convex polytopes. Journal of the Associa-
tion for Computing Machinery, 17, 78-86.
Chazelle, B. (1985): Fast searching in a real algebraic manifold with applications to geometric com-
plexity. Lecture Notes in Computer Science, 185, 145-156.
Cross, G. C. and Jain, A. K. (1982): Measurement of clustering tendency. In Proceedings of IFAC
Symposium on Theory and Application of Digital Control (Volume 2), 24-29, New Delhi.
Cunningham, K. M. and Ogilvie, J. C. (1972): Evaluation of hierarchical grouping techniques: A
preliminary study. Computer Journal, 15, 209-213.
Davies, D. L. and Bouldin, D. W. (1979): A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-1, 224-227.
De Soete, G., Carroll, J. D. and DeSarbo, W. S. (1987): Least squares algorithms for constructing
constrained ultrametric and additive tree representations of symmetric proximity data. Journal of
Classification, 4, 155-173.
Diday, E. and Govaert, G. (1977): Classification automatique avec distances adaptatives. R.A.I.R.O. Informatique/Computer Sciences, 11, 329-349.
Diday, E. and Simon, J. C. (1976): Clustering analysis. In Digital Pattern Recognition (Communication and Cybernetics 10), Fu, K. S. (ed.), 47-94, Springer-Verlag, Berlin.
Diggle, P. J. (1983): Statistical Analysis of Spatial Point Patterns. Academic Press, London.
Dobkin, D. and Lipton, R. J. (1976): Multidimensional searching problems. SIAM Journal on
Computing, 5, 181-186.
Dubes, R. C. (1987): How many clusters are best? - An experiment. Pattern Recognition, 20,
645-663.
Dubes, R. C. and Zeng, G. (1987): A test for spatial homogeneity in cluster analysis. Journal of
Classification, 4, 33-56.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis. Wiley, New York.
Edelsbrunner, H. (1987): Algorithms in Combinatorial Geometry. Springer-Verlag, Berlin.
Engelman, L. and Hartigan, J. A. (1969): Percentage points of a test for clusters. Journal of the
American Statistical Association, 64, 1647-1648.
Estabrook, G. F. (1966): A mathematical model in graph theory for biological classification. Jour-
nal of Theoretical Biology, 12, 297-310.
Faust, K. and Romney, A. K. (1985): The effect of skewed distributions on matrix permutation
tests. British Journal of Mathematical and Statistical Psychology, 38, 152-160.
Fisher, D. (1996): Iterative optimization and simplification of hierarchical clusterings. Journal of
Artificial Intelligence Researeh, 4, 147-180.
Fisher, L. and Van Ness, J. W. (1971): Admissible clustering procedures. Biometrika, 58, 91-104.
Frank, O. (1978): Inferences concerning cluster structure. In COMPSTAT 1978, Corsten, L. C. A.
and Hermans, J. (eds.), 259-265, Physica-Verlag, Wien.
Frank, O. and Harary, F. (1982): Cluster inference by using transitivity indices in empirical graphs.
Journal of the American Statistical Association, 77, 835-840.
Frank, O. and Strauss, D. (1986): Markov graphs. Journal of the American Statistical Association,
81, 832-842.
Frank, O. and Svensson, K. (1981): On probability distributions of single-linkage dendrograms.
Journal of Statistical Computation and Simulation, 12, 121-131.
Friedman, J. H. and Rafsky, L. C. (1979): Multivariate generalizations of the Wald-Wolfowitz and
Smirnov two-sample tests. Annals of Statistics, 7, 697-717.
Furnas, G. W. (1984): The generation of random, binary unordered trees. Journal of Classification,
1, 187-233.
Gabriel, K. R. and Sokal, R. R. (1969): A new statistical approach to geographical variation anal-
ysis. Systematic Zoology, 18, 259-278.
Gnanadesikan, R., Kettenring, J. R. and Landwehr, J. M. (1977): Interpreting and assessing the
results of cluster analyses. Bulletin of the International Statistical Institute, 47(2), 451-463.
Godehardt, E. (1990): Graphs as Structural Models: The Application of Graphs and Multigraphs in
Cluster Analysis (2nd edn.). Friedr. Vieweg & Sohn, Braunschweig.
Goodman, L. A. and Kruskal, W. H. (1954): Measures of association for cross classifications. Jour-
nal of the American Statistical Association, 49, 732-764.
Gordon, A. D. (1981): Classification: Methods for the Exploratory Analysis of Multivariate Data.
Chapman and Hall, London.
Gordon, A. D. (1994): Identifying genuine clusters in a classification. Computational Statistics &
Data Analysis, 18, 561-581.
Gordon, A. D. (1995): Tests for assessing clusters. Statistics in Transition, 2, 207-217.
Gordon, A. D. (1996a): Hierarchical classification. In Clustering and Classification, Arabie, P. et
al. (eds.), 65-121, World Scientific Publishing, River Edge, NJ.
Gordon, A. D. (1996b): Null models in cluster validation. In From Data to Knowledge: Theoretical
and Practical Aspects of Classification, Data Analysis, and Knowledge Organization, Gaul, W. and
Pfeifer, D. (eds.), 32-44, Springer-Verlag, Berlin.
Gordon, A. D. (1996c): How many clusters? An investigation of five procedures for detecting nested
cluster structure. Paper presented at IFCS-96 Conference, Kobe, 27-30 March 1996.
Gordon, A. D. (1996d): External validation in cluster analysis. Submitted for publication.
Gordon, A. D. and De Cata, A. (1988): Stability and influence in sum of squares clustering. Metron,
46, 347-360.
Gower, J. C. (1973): Classification problems. Bulletin of the International Statistical Institute,
45(1), 471-477.
Gower, J. C. and Banfield, C. F. (1975): Goodness-of-fit criteria for hierarchical classification and
their empirical distributions. In Proceedings of the 8th International Biometric Conference, Corsten,
L. C. A. and Postelnicu, T. (eds.), 347-361, Constanta, Romania.
Harper, C. W., Jr. (1978): Groupings by locality in community ecology and paleoecology: Tests of
significance. Lethaia, 11, 251-257.
Hartigan, J. A. (1975): Clustering Algorithms. Wiley, New York.
Hartigan, J. A. (1977): Distribution problems in clustering. In Classification and Clustering, Van
Ryzin, J. (ed.), 45-71, Academic Press, New York.
Hartigan, J. A. (1978): Asymptotic distributions for clustering criteria. Annals of Statistics, 6,
117-131.
Hartigan, J. A. (1985): Statistical theory in clustering. Journal of Classification, 2, 63-76.
Hartigan, J. A. (1988): The span test for unimodality. In Classification and Related Methods of
Data Analysis, Bock, H. H. (ed.), 229-236, North-Holland, Amsterdam.
Hartigan, J. A. and Mohanty, S. (1992): The runt test for multimodality. Journal of Classification,
9, 63-70.
Hill, R. S. (1980): A stopping rule for partitioning dendrograms. Botanical Gazette, 141, 321-324.
Hoffman, R. and Jain, A. K. (1983): A test of randomness based on the minimal spanning tree.
Pattern Recognition Letters, 1, 175-180.
Hopkins, B. (1954): A new method for determining the type of distribution of plant individuals
(with an appendix by J. G. Skellam). Annals of Botany, NS, 18, 213-227.
Howe, S. E. (1979): Estimating Regions and Clustering Spatial Data: Analysis and Implementation
of Methods Using the Voronoi Diagram. Unpublished Ph.D. thesis, Brown University, Providence,
RI.
Hubert, L. J. (1974a): Some applications of graph theory to clustering. Psychometrika, 39, 283-309.
Hubert, L. (1974b): Approximate evaluation techniques for the single-link and complete-link hier-
archical clustering procedures. Journal of the American Statistical Association, 69, 698-704.
Hubert, L. J. (1987): Assignment Methods in Combinatorial Data Analysis. Marcel Dekker, New
York.
Hubert, L. and Arabie, P. (1985): Comparing partitions. Journal of Classification, 2, 193-218.
Hubert, L. J. and Baker, F. B. (1977): The comparison and fitting of given classification schemes.
Journal of Mathematical Psychology, 16, 233-253.
Jackson, D. M. (1969): Comparison of classifications. In Numerical Taxonomy, Cole, A. J. (ed.),
91-113, Academic Press, London.
Jain, A. K. and Dubes, R. C. (1988): Algorithms for Clustering Data. Prentice-Hall, Englewood
Cliffs, NJ.
Jain, A. K. and Moreau, J. V. (1987): Bootstrap techniques in cluster analysis. Pattern Recognition,
20, 547-568.
Jambu, M. and Lebeaux, M. O. (1983): Cluster Analysis and Data Analysis. North-Holland, Ams-
terdam.
Jardine, N. (1969): Towards a general theory of clustering (abstract). Biometrics, 25, 609-610.
Jardine, N. and Sibson, R. (1971): Mathematical Taxonomy. Wiley, London.
Jolliffe, I. T., Jones, B. and Morgan, B. J. T. (1988): Stability and influence in cluster analysis. In
Data Analysis and Informatics V, Diday, E. (ed.), 507-514, North-Holland, Amsterdam.
Kelly, F. P. and Ripley, B. D. (1976): A note on Strauss's model for clustering. Biometrika, 63,
357-360.
Krzanowski, W. J. and Lai, Y. T. (1988): A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, 44, 23-34.
Lanyon, S. M. (1985): Detecting internal inconsistencies in distance data. Systematic Zoology, 34,
397-403.
Lapointe, F.-J. (1996): To validate and how to validate? That is the real question. Paper presented
at IFCS-96 Conference, Kobe, 27-30 March 1996.
Lapointe, F.-J., Kirsch, J. A. W. and Bleiweiss, R. (1994): Jackknifing of weighted trees: Valida-
tion of phylogenies reconstructed from distance matrices. Molecular Phylogenetics and Evolution,
3,256-267.
Lapointe, F.-J. and Legendre, P. (1990): A statistical framework to test the consensus of two nested
classifications. Systematic Zoology, 39, 1-13.
Lapointe, F.-J. and Legendre, P. (1991): The generation of random ultrametric matrices represent-
ing dendrograms. Journal of Classification, 8, 177-200.
Lapointe, F.-J. and Legendre, P. (1995): Comparison tests for dendrograms: A comparative evalu-
ation. Journal of Classification, 12, 265-282.
Lee, K. L. (1979): Multivariate tests for clusters. Journal of the American Statistical Association,
74, 708-714.
Lefkovitch, L. P. (1978): Cluster generation and grouping using mathematical programming. Math-
ematical Biosciences, 41, 91-110.
Lefkovitch, L. P. (1980): Conditional clustering. Biometrics, 36, 43-58.
Legendre, P., Dallot, S. and Legendre, L. (1985): Succession of species within a community: Chrono-
logical clustering, with applications to marine and freshwater zooplankton. The American Natural-
ist, 125, 257-288.
Lerman, I. C. (1970): Les Bases de la Classification Automatique. Gauthier-Villars, Paris.
Lerman, I. C. (1980): Combinatorial analysis in the statistical treatment of behavioral data. Quality
and Quantity, 14, 431-469.
Lerman, I. C. (1981): Classification et Analyse Ordinale des Données. Dunod, Paris.
Lerman, I. C. (1983): Sur la signification des classes issues d'une classification automatique de
donnees. In Numerical Taxonomy, Felsenstein, J. (ed.), 179-198, Springer-Verlag, Berlin.
Lerman, I. C. and Ghazzali, N. (1991): What do we retain from a classification tree? An experiment
in image coding. In Symbolic-Numeric Data Analysis and Learning, Diday, E. and Lechevallier, Y.
(eds.), 27-42, Nova Science, New York.
Ling, R. F. (1972): On the theory and construction of k-clusters. Computer Journal, 15, 326-332.
Ling, R. F. (1973a): A probability theory for cluster analysis. Journal of the American Statistical
Association, 68, 159-164.
Ling, R. F. (1973b): The expected number of components in random linear graphs. Annals of
Probability, 1, 876-881.
Ling, R. F. (1975): An exact probability distribution on the connectivity of random graphs. Journal
of Mathematical Psychology, 12, 90-98.
Ling, R. F. and Killough, G. G. (1976): Probability tables for cluster analysis based on a theory of
random graphs. Journal of the American Statistical Association, 71, 293-300.
McIntyre, R. M. and Blashfield, R. K. (1980): A nearest-centroid technique for evaluating the
minimum-variance clustering procedure. Multivariate Behavioral Research, 15, 225-238.
McMorris, F. R., Meronk, D. B. and Neumann, D. A. (1983): A view of some consensus methods
for trees. In Numerical Taxonomy, Felsenstein, J. (ed.), 122-126, Springer-Verlag, Berlin.
McQuitty, L. L. (1963): Rank order typal analysis. Educational and Psychological Measurement,
23, 55-61.
McQuitty, L. L. (1967): A mutual development of some typological theories and pattern analytical
methods. Educational and Psychological Measurement, 27, 21-46.
Marriott, F. H. C. (1982): Optimization methods of cluster analysis. Biometrika, 69, 417-422.
Matula, D. W. (1977): Graph theoretic techniques for cluster analysis algorithms. In Classification
and Clustering, Van Ryzin, J. (ed.), 95-129, Academic Press, New York.
Milligan, G. W. (1981): A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46, 187-199.
Milligan, G. W. and Cooper, M. C. (1985): An examination of procedures for determining the
number of clusters in a data set. Psychometrika, 50, 159-179.
Milligan, G. W. and Cooper, M. C. (1986): A study of the comparability of external criteria for
hierarchical cluster analysis. Multivariate Behavioral Research, 21, 441-458.
Milligan, G. W. and Mahajan, V. (1980): A note on procedures for testing the quality of a clustering
of a set of objects. Decision Sciences, 11, 669-677.
Milligan, G. W. and Sokol, L. M. (1980): A two-stage clustering algorithm with robust recovery
characteristics. Educational and Psychological Measurement, 40, 755-759.
Müller, D. W. and Sawitzki, G. (1991): Excess mass estimates and tests for multimodality. Journal
of the American Statistical Association, 86, 738-746.
Murtagh, F. (1984): Counting dendrograms: A survey. Discrete Applied Mathematics, 7, 191-199.
Ogilvie, J. C. (1969): The distribution of number and size of connected components in random
graphs of medium size. Information Processing, 68, 1527-1530.
Overall, J. E. and Magee, K. N. (1992): Replication as a rule for determining the number of clusters
in hierarchical cluster analysis. Applied Psychological Measurement, 16, 119-128.
Panayirci, E. and Dubes, R. C. (1983): A test for multidimensional clustering tendency. Pattern
Recognition, 16, 433-444.
Perruchet, C. (1983): Une analyse bibliographique des épreuves de classifiabilité en analyse des données. Statistique et Analyse des Données, 8, 18-41.
Pollard, D. (1982): A central limit theorem for k-means clustering. Annals of Probability, 10, 919-
926.
Quinlan, J. R. (1987): Simplifying decision trees. International Journal of Man-Machine Studies,
27, 221-234.
Rand, W. M. (1971): Objective criteria for the evaluation of clustering methods. Journal of the
American Statistical Association, 66, 846-850.
Rapoport, A. and Fillenbaum, S. (1972): An experimental study of semantic structures. In Multi-
dimensional Scaling: Theory and Applications in the Behavioral Sciences, Volume II: Applications,
Romney, A. K. et al. (eds.), 93-131, Seminar Press, New York.
Ratkowsky, D. A. (1984): A stopping rule and clustering method of wide applicability. Botanical
Gazette, 145, 518-523.
Ripley, B. D. (1981): Spatial Statistics. Wiley, New York.
Ripley, B. D. and Rasson, J.-P. (1977): Finding the edge of a Poisson forest. Journal of Applied Probability, 14, 483-491.
Rivera, F. F., Zapata, E. L. and Carazo, J. M. (1990): Cluster validity based on the hard tendency
of the fuzzy classification. Pattern Recognition Letters, 11, 7-12.
Rohlf, F. J. (1970): Adaptive hierarchical clustering schemes. Systematic Zoology, 19,58-82.
Rohlf, F. J. (1975): Generalization of the gap test for the detection of multivariate outliers. Bio-
metrics, 31, 93-101.
Rohlf, F. J. (1982): Consensus indices for comparing classifications. Mathematical Biosciences, 59,
131-144.
Rohlf, F. J. and Fisher, D. R. (1968): Tests for hierarchical structure in random data sets. System-
atic Zoology, 17, 407-412.
Roubens, M. (1978): Pattern classification problems and fuzzy sets. Fuzzy Sets and Systems, 1,
239-253.
Rousseeuw, P. J. (1987): Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
Rozal, G. P. M. and Hartigan, J. A. (1994): The MAP test for multimodality. Journal of Classifi-
cation, 11, 5-36.
Rubin, P. A. (1984): Generating random points in a polytope. Communications in Statistics: Sim-
ulation and Computation, B 13, 375-396.
Sarle, W. S. (1983): Cubic Clustering Criterion. Technical Report A-108, SAS Institute, Cary, NC.
Saunders, R. and Funk, G. M. (1977): Poisson limits for a clustering model of Strauss. Journal of
Applied Probability, 14, 776-784.
Schultz, J. V. and Hubert, L. J. (1973): Data analysis and the connectivity of random graphs.
Journal of Mathematical Psychology, 10, 421-428.
Scott, A. J. and Symons, M. J. (1971): Clustering methods based on likelihood ratio criteria. Bio-
metrics, 27, 387-397.
Shepard, R. N. (1974): Representation of structure in similarity data: Problems and prospects.
Psychometrika, 39, 373-421.
Simberloff, D. (1987): Calculating probabilities that cladograms match: A method of biogeograph-
ical inference. Systematic Zoology, 36, 115-195.
Smith, S. P. and Dubes, R. (1980): Stability of a hierarchical clustering. Pattern Recognition, 12,
177-187.
Smith, S. P. and Jain, A. K. (1984): Testing for uniformity in multidimensional data. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, PAMI-6, 73-81.
Sneath, P. H. A. (1969): Evaluation of clustering methods (with Discussion). In Numerical Taxon-
omy, Cole, A. J. (ed.), 257-271, Academic Press, London.
Sneath, P. H. A. (1977): A method for testing the distinctness of clusters: A test of the disjunction
of two clusters in Euclidean space as measured by their overlap. Mathematical Geology, 9,123-143.
Sneath, P. H. A. (1979): The sampling distribution of the W statistic of disjunction for the arbitrary
division of a random rectangular distribution. Mathematical Geology, 11, 423-429.
Sneath, P. H. A. (1980): Some empirical tests for significance of clusters. In Data Analysis and
Informatics, Diday, E. et al. (eds.), 491-508, North-Holland, Amsterdam.
Sneath, P. H. A. (1986): Significance tests for multivariate normality of clusters from branching
patterns in dendrograms. Mathematical Geology, 18, 3-32.
Sokal, R. R. and Rohlf, F. J. (1962): The comparison of dendrograms by objective methods. Taxon,
11, 33-40.
Strauss, D. J. (1975): A model for clustering. Biometrika, 62, 467-475.
Strauss, R. E. (1982): Statistical significance of species clusters in association analysis. Ecology, 63,
634-639.
Van Cutsem, B. and Ycart, B. (1996): Indexed Dendrograms on Random Dissimilarities. Rapport MAI 23, CNRS, Université Joseph Fourier, Grenoble I.
Van Ness, J. W. (1973): Admissible clustering procedures. Biometrika, 60, 422-424.
van Rijsbergen, C. J. (1970): A clustering algorithm. Computer Journal, 13, 113-115.
Vassiliou, A., Ignatiades, L. and Karydis, M. (1989): Clustering of transect phytoplankton collec-
tions with a quick randomization algorithm. Journal of Experimental Marine Biology and Ecology,
130, 135-145.
Ward, J. H., Jr. (1963): Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236-244.
Windham, M. P. (1981): Cluster validity for fuzzy clustering algorithms. Fuzzy Sets and Systems,
5, 177-185.
Windham, M. P. (1982): Cluster validity for the fuzzy c-means clustering algorithm. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, PAMI-4, 357-363.
Xu, S., Karnath, M. V. and Capson, D. W. (1993): Selection of partitions from a hierarchy. Pattern
Recognition Letters, 14, 7-15.
Zeng, G. and Dubes, R. C. (1985a): A test for spatial randomness based on k-NN distances. Pattern
Recognition Letters, 3, 85-91.
Zeng, G. and Dubes, R. C. (1985b): A comparison of tests for randomness. Pattern Recognition,
18, 191-198.
What is Data Science?*
Fundamental Concepts and a Heuristic Example

Chikio Hayashi
The Institute of Statistical Mathematics
Sakuragaoka, Birijian 304
15-8 Sakuragaoka, Shibuya-ku
Tokyo 150, Japan

Summary: Data Science is not only a synthetic concept to unify statistics, data analysis
and their related methods but also comprises its results. It includes three phases, design for
data, collection of data, and analysis on data. Fundamental concepts and various methods
based on it are discussed with a heuristic example.

1. Introduction:
Statistics and data analysis have developed in their realms separately and contributed to the
development of science, showing their unique properties. The ideas and various methods
of statistics were very useful, well known and solved many problems. Mathematical
statistics succeeded it and developed new frontiers with the idea of statistical inference.
Thus the application of these view points brought us many useful results.
However, the development of mathematical statistics has devoted itself only to the
problems of statistical inference, an apparent rise of precision of statistical models, and to
the pursuit of exactness and mathematical refinement, so mathematical statistics have been
prone to be removed from reality.
On the other hand, the method of data analysis has developed in the fields disregarded by
mathematical statistics and has given useful results to solve complicated problems based on
mathematico-statistical methods (which are not always based on statistical inference but
rather are descriptive). Some results are found in the references.
In the development of data analysis, the following tendency is often found, that is to say,
data analysts have come to manipulate or handle only existing data without taking into
consideration both the quality of data and the meaning of data, to cope with the
methodological problem based on unrealistic artificial data with simple structure, to make
efforts only for the refinement of convenient and serviceable computer software and to
imitate popular ideas of mathematical statistics without considering the essential meaning.
As this differentiation proceeds with specialization, the innovation of useful methods of
statistics and data analysis seems to disappear and signs of stagnation appear. The reason is
that the essential aim of analysis of phenomena with data has been forgotten. For
extensive and profound development of intrinsically useful methods of statistics and data
analysis beyond the present state, the unification of statistics and data analysis is
necessary. For this purpose, the construction of a new point of view or a new paradigm is
a crucial problem. So, I will present "Data Science" as a new concept.

* The roundtable discussion "Perspectives in classification and the Future of IFCS" was
held at the last Conference under the chairmanship of Professor H. -H. Bock. In this
panel discussion, I used the phrase 'Data Science'. There was a question, "What is 'Data
Science'?" I briefly answered it. This is the starting point of the present paper.


2. Fundamental Concepts of Data Science

Data Science is not only a synthetic concept to unify statistics, data analysis and their
related methods, but also comprises its results. Data Science intends to analyze and
understand actual phenomena with "data". In other words, the aim of data science is to
reveal the features or the hidden structure of complicated natural, human and social
phenomena with data from a different point of view from the established or traditional
theory and method. This point of view implies multidimensional, dynamic and flexible
ways of thinking.
Data Science consists of three phases: design for data, collection of data and analysis on
data. It is important that the three phases are treated with the concept of unification based
on the fundamental philosophy of science explained below. In these phases the methods
which are fitted for the object and are valid, must be studied with a good perspective. The
strategy for research in Data Science through three phases is summarized in Fig. 1.

[Fig. 1 Strategy for Research. Data Science: design for data, collection of data, analysis on data. The leading idea of data treatment in the research process: simplification and structure finding by methods of classification, multidimensional data analysis and other statistical methods; diversification by finding and reconsidering deviations of "individuals" from the mean or class and structure; dynamics of both diversification and conceptualization or simplification.]

Generally speaking, phenomena are multifarious. First, these phenomena are formulated
and the planning of a survey or experiment is completed, based on the ideas of Data
Science (phase of design for data). Thus phenomena are expressed as multidimensional
and, frequently, time-series data. The characteristics or properties of the data are
necessarily made clear (phase of collection of data). The obtained data are too
complicated to draw a clear conclusion. So, by methods of classification and
multidimensional data analysis, and other mathematico-statistical methods, the data
structure is revealed. In other words, simplification and conceptualization are carried
out. However, this information generally turns out to be incomplete and unsatisfactory
even though the structure finding was realized. At this stage, by finding and
reconsidering the deviation of "individuals", which gives a vivid account of the
roughness of conceptualization or simplification, from the mean values or class-
belonging (classification) and structure, diversification of data is made. Based on this
multifariousness, structure finding or conceptualization is attained, in an advanced sense,
in the progressive stages. Such a circular movement of research then continues.
Dynamic movement of both simplification or conceptualization and diversification begins in turn. Further, having solved one problem, one expects to discover another new problem to be solved in an advanced sense. The developmental process in phases, design -> collection -> analysis -> design -> collection -> analysis -> ..., and the dynamic process mentioned above, that is to say, progress and regress, are indispensable in Data Science. This shows that the methodology of Data Science develops, as it were, in an ascending-spiral process, and research proceeds as on spiral stairs. The main points are schematically depicted in Fig. 1.
Thus we can say that data science comprises not only the results themselves of theory and
method but also all methodological results related to various processes which are necessary
to work out the results mentioned above. The former is called "hard results" and the latter
is called "soft results". Data Science includes simultaneously hard and soft results. It goes
without saying that a useful solution emerges in coping with the complicated problem in
question by the use of Data Science. It is repeatedly emphasized that the coherent idea
through all items shown in Fig. 1 flows in Data Science for the purpose of analysis of
phenomena with data.

3. Content of Data Science

Some concrete examples in social and medical surveys for the three phases are shown
below. Before everything, it is stressed that the relevant methods are always treated with
validity.

3.1 How to Design

The theory and method concerning this phase are next considered. Particularly, theoretical
and systematic construction of a questionnaire is a very important problem. The problems
in this phase are frequently solved using various kinds of methods of data analysis. For
example,

Sampling survey methods,
Design of experiment,
Evaluation of bias in quota sampling,
New systematic idea of survey planning for the solution of difficult and complicated problems,
Construction of questionnaire,
Theory-driven (which is an extension of hypothesis testing), Guttman's
Facet Theory,
Data-driven (exploratory approach), Hayashi's Cultural Link Analysis in
comparative study,
Utilization of various types of questions, for example, dynamic use of closed and
open-ended questions,
Use of various projective methods,
Design for evaluation of data quality and data characteristics,
Randomized response method,
Problems of translation in international comparative study,
etc.

3.2 How to Collect Data

Collection of data is not only a problem of practice, but must be theoretically and
concretely studied. The problems in this phase cannot be solved without information on the design for data and without the use of data analysis.
Evaluation of survey bias and evaluation of experimental bias including
question bias, interview bias, interviewer bias, observation bias, etc.
Evaluation of non-response error,
Evaluation of measurement error,
Evaluation of response error, inevitably variable response data, for example,
live data,
Method of diminution of the relevant bias and error,
etc.

3.3 How to Analyze Data

The problems in this phase are, of course, closely related to the previous two phases. The
main point is to obtain useful and instrumental information without distortion, that is, with validity. For this purpose, clear and lucid methods of analysis, free of unnecessary mathematical conditions imposed only for model building and of a too sophisticated style, are desirable.
For example,
Various methods of scaling, quantification methods, correspondence analysis
(analyse des donnees), multidimensional scaling, exploratory data
analysis, categorical data analysis and various methods of classification and
clustering,
Useful data analysis suitable for the purpose,
Useful coding of questions and their synthesis,
Valid analysis of data including various errors,
Evaluation of data quality and data analysis depending on data quality,
Analysis on probabilistic response,
Exploratory approach by data analysis,
Method of simultaneous realization of classification and structure finding,
Treatment of open answers in an open ended question for example, exploratory
approach for coding or automatic processing of textual data,
Probabilistic approach,
Computer experiments,
etc.

These three phases must be synthetically treated or taken into consideration with the
consistent idea in order to understand phenomena. This is the fundamental concept of
Data Science. Of course, each subject will be studied separately. However, each subject
must be studied in the context of Data Science. This idea will lead to the development of statistics and data analysis in a new direction. Thus their standpoint is heightened, and a new horizon will appear as innovative methods and theories are created in the three phases.

4. A Heuristic Example

As an example of the data-scientific approach, we now explain our national character survey in interchronological and international perspectives.

4.1 Fundamental Scheme of Study

What is national character? I define it operationally as collective character on belief systems, ways of thinking, and emotional attitudes, feelings or sentiments. By a survey of individuals, we can find individual response patterns on the items mentioned below.
Thus, we know that individuals have various response patterns. These are integrated in a
collective through mutual and social communications in so far as individuals live in a
society. This is collective character or national character (in some cases ethnic character) which is formed beyond individuals. In this situation, some principles emerge in the social
environment. Receiving impacts from the exterior, social norm, customs system,
paradigm, education, contemporary thought and arts, religious feelings, future course of
philosophy and science, etc. are formed, as a "cultural climate" is created. Individuals are
influenced by this cultural climate: the strongest influence is upon the response pattern in
general social items, the second upon that in national character items and the weakest upon
that in basic human feelings items. Such a perpetual circular movement continues. It is
our aim to represent the collective character in terms of Data Science.
Our point of view of research is not hypothesis testing (theory-driven) but to put the
emphasis on an exploratory approach (data-driven).

4.2 Time Series Data (Interchronological Approach) in Japan

First of all, we define the universe and population of the Japanese. A nation-wide sample
survey is done for a sample by stratified three-stage random sampling and by face-to-face interviewing using the same questionnaire, the contents of which cover the items shown
below.
1) Fundamental Attributes, 2) Religion, 3) Family,
4) Social Life, 5) Interpersonal Relations,
6) Politics, 7) Individual Attitude toward Other Unclassified Social Issues
The outline of our survey is shown in Fig. 2.

Nation-wide sample survey by stratified three-stage random sampling;
sample spots: 200-300; sample size: 2000-4000; face-to-face interviewing.

Survey        Symbol   Year
1st Survey    I        1953
2nd Survey    II       1958
3rd Survey    III      1963
4th Survey    IV       1968
5th Survey    V        1973
6th Survey    VI       1978
7th Survey    VII      1983
8th Survey    VIII     1988
9th Survey    IX       1993
(every 5 years)

[Research Committee on the Study of Japanese National Character of the Institute of Statistical Mathematics, Tokyo]

Fig. 2 Survey Design

The analysis from such time series data makes clear both enduring and changing aspects.
The next step is a comparative study of national character.

4.3 Comparative Study (International Approach)

In a comparative study, the following points are indispensable:

1. How to secure comparability in a scientific sense
   Design
   Sample
   Selection of questions and construction of questionnaire
   Translation*
2. Clarification of particularity and universality (community) or speciality and generality
3. Use of common logic and scientific methods for easy international understanding

* back translation, retranslation, confirmation by free questions and answers, etc.

Here, in a comparative study, we present a new idea for questionnaire construction and selection of nations to be compared. This is Cultural Link Analysis (CLA in abbreviation) (Hayashi et al., 1986, 1992), which belongs to a similar genre to Guttman's Facet Theory (Guttman, 1994), and reveals a relational structure of the collective characters of peoples in different cultural spheres (nations or ethnic groups).
a. A spatial link inherent in the selection of the subject culture or society.
The connections seen in such selection may be considered along the dimensions
of social environment, cuI ture and ethnic or national characteristics.
b. An item structure link inherent in the commonness and differences in item
response patterns within and across different cultures.
c . A temporal link inherent in longitudinal analysis.
An example of a. is shown in Fig.3.

[Figure: (a) groups surveyed, including Hawaii residents, with the number of times each group was surveyed; (b) multidimensional linkage]

Fig. 3 Cultural link survey design: selection of groups



As an example of b, concerning questionnaire construction, the idea is explained in Fig. 4.

[Figure: Questions common to modern societies; questions particular to Society A; questions particular to Society B; and questions concerning social life, basic emotions and feelings which are, to some extent, common to Societies A and B]

Fig. 4 Cultural Link Survey Design: Selection of Questions

As for c, time series surveys in various nations or ethnic groups and their comparison are informative.
Our international comparative surveys, which cover Americans in North America, English in the UK, French in France, Germans in the former West Germany, Dutch in the Netherlands, Italians in Italy, Japanese in Japan, Japanese-Americans in Hawaii, and Japanese-Brazilians in Brazil, are described in Fig. 5, and the conjectured link scheme is depicted in Fig. 6.
1971  Japanese-Americans in Hawaii (434)
1978  Honolulu residents including JA (751); Americans in North America (1571)
1983  Same as above (807)
1987  English (1043); Germans (1000); French (1013)
1988  Same as above (499); Americans in North America (1563)
1992  Japanese-Brazilians in Brazil (492); Italians (1048)
1993  Dutch (1083)

------ Nation-Wide Sampling ------    ( ) Sample Size

Fig. 5 Comparative Surveys by Cultural Link Analysis



[Figure: groups arranged along a dissimilarity-similarity axis, linking Japanese and Japanese-Brazilians/Japanese-Americans (Japanese culture), Americans in Hawaii and on the mainland of North America (Hawaiian and American culture), and English, Dutch and Italians (European culture: English, Continental and Latin cultural climates; Occidental culture)]

Fig. 6 Chain in Our Study

Further Remarks: It brings us very important information to include people of Japanese origin, who settled down in foreign countries, as a linkage in order to explore and reveal the characteristics of Japanese national character.
The attitude in which Japanese-Americans are between Japanese and Americans in response distributions or data structure and, what is more, Japanese-Brazilians are between Japanese and Portuguese, French and Italian in response distributions or data structure, is defined as J-attitude. It may be said that J-attitude remains to some extent in Japanese-Americans and Japanese-Brazilians even though the tendency in them is not so strong as in the Japanese. This fact suggests that J-attitude is a Japanese characteristic [Hayashi (1995), Hayashi C. and F. (1995)]. So, it is meaningful to include people of Japanese origin in a comparative survey. However, it goes without saying that the characteristics of the Japanese are found according to the items shown in Fig. 8, even though they are not J-attitude.

4.4 National Character in Statistical Terms

It is our aim to make clear and depict the following points by well-designed comparative surveys and their data analysis, i.e. "quantitative and data-scientific" methods:

"difference in some points and commonness or similarity in other points"
or
"particularity in some points and universality in other points"

Since such a way of research is based on universal logic, people even in different cultural spheres can understand the results of the analysis.

Mainly considering the view of Japanese national character itself, we can summarize our
study as in Fig.7. In contrast, mainly considering the comparison of national character in
different cultural spheres, we can summarize our study as in Fig. 8.

[Figure: Enduring, persistent features (time-dependent aspect) <-> Particularity (comparative aspect); changing features (time-dependent aspect) <-> Universality (comparative aspect)]

Fig. 7 Japanese National Character

From these two kinds of surveys, surveys both in time and space, that is to say,
continuing surveys and comparative surveys, we can define national character in statistical
terms corresponding to various levels. See Fig. 8.

                                            Temporally Stable,    Particular Characteristic
                                            Consistent            compared with others

1. Majority Opinion                         O                     O* or O
                                            O                     indifferent
                                            no datum              O or O*

2. Opinion Distributions                    O                     O* or O
                                            no datum              O or O*

3. Opinion Distributions by Various         O                     O* or O
   Breakdowns (for example, gender,
   age, education, rural-urban)             no datum              O or O*

4. Changing Patterns of Opinion             X                     O
   Distributions and Opinion Structure
   (including those by breakdowns and,
   for example, age-cohort analysis based
   on time series of opinion distributions)

5-1. Opinion Structure or                   O                     X
     systematic change

5-2. Comparison of Opinion Structures
   i.   Existence of the Same               O or no datum         O* or O (by comparison of
        Unidimensional Scale                                      the scale value of nation)
   ii.  Same Structure of Opinions          O or no datum         O or O* (by comparison of
        (more than 2-dimensional                                  the position of nation)
        structure)
   iii. Different Structure of Opinions     O or no datum         O or O* (by comparison of
                                                                  the position of nation based
                                                                  on the similarity or
                                                                  dissimilarity analysis of
                                                                  structure)

Fig. 8 Statistical Definition of National Character ---on various levels---

Here, majority opinion is defined as not only that supported by more than 2/3 of the individuals in the total but also that supported by more than 2/3 of the individuals in each breakdown by sex, age and education. In Fig. 8, O marks mean existence of the item, O* marks mean existence of temporally stable data, and "no datum" means non-existence of temporally stable evidence but existence of cross-section data. For example, as for 2. Opinion Distributions, the first line means a definition on the highest level, i.e. the opinion distribution is not only temporally stable but also particular or characteristic compared with those in different nations or ethnic groups, and the second line means a definition on a lower level, i.e. temporally stable data do not exist but the distribution is particular or characteristic compared with those in different nations or ethnic groups, in which temporally stable evidence occasionally exists. X marks mean there is no logical meaning.

4.5 Cross-Societal Surveys and Classification of Nations


--Realization of Cultural Link--
One example of the data analysis of comparative surveys is shown below, with the following groups being taken up: Japanese, Americans, English, French, Germans, Dutch, Italians, Japanese-Americans and Japanese-Brazilians.
For example, let the opinion distribution be given in each group. Here, only one key answer category is taken up in each question item. If the number of questions is R, the answer categories taken up are indexed by r = 1, ..., R. All questions are used except for the items of personal characteristics, for example sex, age, education, etc. We calculate the similarity index d_ij between i-nation and j-nation as below:

    d_ij = (1/R) Σ_{r=1}^{R} | P_ir - P_jr |

where P_ir is the percentage of i-nation on the single key answer category of the r-th question. d_ij is a fuzzy measure of the difference between i and j.

Thus we have a similarity matrix between i and j. Based on this fuzzy similarity matrix, a method of multidimensional data analysis, MDA-OR (Minimum Dimension Analysis - Ordered class belonging) [Hayashi 1974, 1976], which is one kind of so-called multidimensional scaling (MDS), is applied for a graphical representation of the groups. A quite similar configuration of groups is obtained by quantification method III or correspondence analysis using the matrix of d's directly. The result is shown in Fig. 9. This is a simple graphical summarization of the similarity relations. The degree of similarity is revealed as the distance in Euclidean space. Roughly speaking, consider that the distance corresponds to the similarity and the configuration gives a reasonable summarization of linked similarities. Here, the triangular relation mentioned above has been revealed.
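As a rough computational sketch of this procedure (in Python; the percentage table P is hypothetical, and sklearn's metric MDS serves here only as a stand-in for MDA-OR):

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical data: P[i, r] = percentage of nation i choosing the single
    # key answer category of question r (rows: groups, columns: questions).
    rng = np.random.default_rng(0)
    P = rng.uniform(0, 100, size=(9, 30))      # 9 groups, R = 30 questions

    R = P.shape[1]
    # Fuzzy difference measure: d_ij = (1/R) * sum_r |P_ir - P_jr|
    D = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2) / R

    # Graphical summary of the linked similarities; metric MDS stands in for
    # MDA-OR (or quantification method III applied to the matrix of d's).
    coords = MDS(n_components=3, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)
    print(coords.round(2))                     # one 3-dimensional point per group

The distances between the plotted points then approximate the d_ij, so nearby points correspond to groups with similar response distributions.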

The arrow indicates the direction of the value on the third axis in Fig. 9. A solid line means the plus direction, while a dotted line means the minus direction in the third dimension.
If the French and Italians are deleted and the same analysis is done, Fig. 10 is obtained. JB is found as a pole instead of the French and Italians.

A: Americans
E: English
F: French
G: Germans
H: Dutch
I: Italians
J: Japanese
JA: Japanese-Americans in Hawaii
JB: Japanese-Brazilians in Brazil
JB1: First generation of JB
JB2: Second and third generations of JB

Fig. 9 Configurations of Nations

Fig. 10 Configurations of Nations (Deleting French and Italians)

The similarity between the Japanese-Brazilians and the French and Italians is very interesting. The positions of the Japanese-Americans and Japanese-Brazilians are to be noted in connection with the J-attitude mentioned previously, the former being a linkage between Japanese and Americans, the latter a linkage between Japanese and French or Italians. The link relation in Fig. 6 has been revealed in Fig. 9 by the data analysis. Thus we could clarify the entire picture of the configuration of groups.

Then, we can proceed to a detailed analysis of the data without losing sight of the whole situation. For example, we can examine in which groups of questions the nations differ and in which they are common, i.e. a simultaneous classification of questions and nations, or the universality and particularity of data structure across the nations.

References

The following references are relevant to the various parts of this paper.

Arabie, P., Hubert, L.J. and De Soete, G. (Eds.) (1996): Clustering and Classification, World Scientific.
Benzecri, J.P. (1973): L'Analyse des Donnees, Dunod.
Benzecri, J.P. (1992): Correspondence Analysis Handbook, Marcel Dekker.
Bock, H.-H. and Polasek, W. (Eds.) (1996): Data Analysis and Information Systems, Springer.
Borg, I. and Shye, S. (1995): Facet Theory: Form and Content, Advanced Quantitative Techniques in the Social Sciences Series 5, Sage Publications.
Diday, E., Lemaire, J., Pouget, J. and Testu, F. (1983): Elements d'Analyse des Donnees, Dunod.
Diday, E., Celeux, G., Lechevallier, Y., Govaert, G. and Ralambondrainy, H. (1989): Classification automatique et Analyse des Donnees: Methodes et environnement informatique, Dunod.
Diday, E. and Lechevallier, Y. (1991): Symbolic-Numeric Data Analysis and Learning -Versailles Sept 91-, Nova Science Publishers.
Gaul, W. and Pfeifer, D. (Eds.) (1996): From Data to Knowledge, Springer.
Guttman, L. (1994): Louis Guttman on Theory and Methodology: Selected Writings, Shlomit Levy (Ed.), Dartmouth.
Hayashi, C. (1956): Theory and example of quantification (II), Proc. Inst. Statist. Math., 3, 69-98.
Hayashi, C. (1974): Minimum dimensional analysis MDA, Behaviormetrika, 1, 1-24.
Hayashi, C. (1976): Minimum dimensional analysis MDA-OR and MDA-UO, in: Essays in Probability and Statistics, Ikeda, S., et al. (Eds.), 395-412, Shinko Tsusho Co. Ltd.
Hayashi, C. (1993): Treatise on Behaviormetrics, Asakura Shoten.
Hayashi, C. (1993): Quantification of Qualitative Data --Theory and Method--, Asakura Shoten.
Hayashi, C. (1995): Changing and Enduring Aspects of Japanese National Character, The Institute of Social Research, Osaka, Japan.
Hayashi, C. and Suzuki, T. (1986): Data Analysis in Social Surveys, Iwanami Shoten. The English version by Hayashi, C., Suzuki, T. and Sasaki, M., "Data Analysis for Comparative Social Research: International Perspectives", was published by Elsevier, North-Holland in 1992.
Hayashi, C. and Hayashi, F. (1995): Comparative Study of National Character, Proceedings of the Institute of Statistical Mathematics, Vol. 43, No. 1, 27-80.
Jambu, M. (1989): Exploration Informatique et Statistique des Donnees, Dunod.
Jambu, M. (1991): Exploratory and Multivariate Data Analysis, Academic Press.
Lebart, L., Morineau, A. and Warwick, K.M. (1984): Multivariate Descriptive Statistical Analysis, John Wiley.
Lebart, L. and Salem, A. (1988): Analyse Statistique des Donnees Textuelles, Dunod.
Lebart, L. and Salem, A. (1994): Statistique Textuelle, Dunod.
Lebart, L., Morineau, A. and Piron, M. (1995): Statistique Exploratoire Multidimensionnelle, Dunod.
Van Cutsem, B. (1994): Classification and Dissimilarity Analysis, Springer.
Fitting Graphs and Trees with Multidimensional
Scaling Methods

Willem J. Heiser

Department of Data Theory, Leiden University


P.O. Box 9555, 2300 RB Leiden
The Netherlands

Summary: The symmetric difference between sets of qualitative elements (called features)
forms the basis of a distance model that can be used as a general framework for fitting a
particular class of graphs, which includes additive trees, hierarchical trees and circumplex
structures. It is shown how to parametrize this fitting problem in terms of a lattice of
subsets, and how inclusion relations between feature sets lead to additivity of distance
along paths in a graph. An algorithm based on alternating least squares and on the recent
method of cluster differences scaling is described, and illustrated for the general case.

1. Introduction: Fitting distances or coordinates


Graphs and trees are increasingly considered to be attractive discrete structures for
modelling general similarity or dissimilarity (or: proximity) data in the social and behavioral
sciences (Arabie and Hubert, 1992; Klauer, 1994), in biology, which can build upon a
conceptual tradition of long standing (Felsenstein, 1983), and in many other areas (Abdi,
1990; Barthelemy and Guenoche, 1991). Discrete structures are commonly contrasted
with, and thought to be alien from the continuous spatial structures that are used in
multidimensional scaling, although 'hybrid' models have been proposed (Carroll, 1976).
The purpose of this paper is to take some steps towards an integrated view of these two
types of models, by showing how we can deal with the problem of fitting graphs and trees
with the same formalism that is the basis of least squares methods used in multidimensional
scaling (MDS).
Within the framework of least squares, discrete structural representations of proximity data
are usually identified by fitting a distance matrix under constraints. For example, least
squares fitting of a hierarchical tree can be done by enforcing the ultrametric inequality
upon a set of non-negative quantities (Hartigan, 1967; Chandon, Lemaire and Pouget,
1980), for an additive tree we can impose the four-point condition (Sattath and Tversky,
1977; Cunningham, 1978; De Soete, 1983), and for network representations one criterion
that has been used is additivity of distances along every possible path (Klauer and Carroll,
1989). Most methods used in practice, while not being least squares, are nevertheless
based upon classic operations on a dissimilarity matrix to transform it into a constrained
distance matrix (e.g., both the single-link or minimum method and the complete-link or
maximum method for hierarchical clustering can be viewed as transformations of an
arbitrary dissimilarity matrix into an ultrametric distance matrix).
By contrast, an MDS model typically represents the objects in terms of points characterized
by coordinates, so that distances are not parameters to be estimated, but functions of other parameters. These functions may be Euclidean (as is most commonly the case) or non-Euclidean (e.g., Groenen, Mathar and Heiser, 1995). The present paper will show how to
set up such an indirect parametrization, which restricts the coordinates to be discrete (in
fact, binary), for a relatively large class of graphical structures. Following Tversky (1977)


and Shepard and Arabie (1979), we will use the concept of a feature space, in which each
object of analysis is represented by some subset of features, while the features in turn are
represented by subsets of objects. By restricting attention to models that can be formulated in terms of features, we are considering a particular subclass of graphical structures, to be called feature graphs.
The natural metric used in feature space is the city-block distance, which acquires several
remarkable properties when the coordinates are restricted to be binary. Before discussing
these in more detail, we need to introduce some notation.

2. Notation and reparametrization in terms of features


Let O = {o_1, ..., o_i, ..., o_n} be the set of objects of analysis, and suppose that the elements of the square table Δ = {δ_12, ..., δ_ij, ..., δ_nn} denote the values of a given dissimilarity function defined on the set of ordered pairs O × O. We are looking for a graph representation of {O, Δ} by a valued graph (or network) G = {V, R, Λ}, where the set V = {v_1, ..., v_i, ..., v_n} contains the nodes (vertices, points), and the set R the edges (arcs, lines) of G. Thus, R is the collection of unordered pairs of V that defines a relation, that is, a subset of V × V. An edge r_ij = {v_i, v_j} ∈ R is said to join the nodes v_i and v_j in the graph, and presence or absence of edges is indicated by the binary n × n matrix A = {a_ij}, called the adjacency matrix, which has a_ij = 1 if r_ij ∈ R, and a_ij = 0 if r_ij ∉ R. Finally, the graph G is valued: we associate with each edge present (a_ij = 1) some non-negative function value λ_ij, called the edge length, collected in the matrix Λ = {λ_ij}, where we define λ_ij = 0 when a_ij = 0. A metric on G is defined by the path-length distance

    d_ij(A, Λ) = Σ_{(v_k, v_l) ∈ P(v_i, v_j)} λ_kl        (1)

in which P(v_i, v_j) is the set of edges on the geodesic (shortest path) between v_i and v_j. We write d_ij(A, Λ) because the distance depends not only on Λ, but also on A via the lists P(v_i, v_j). Thus, the path length is the sum of the edge lengths along the geodesic. If all λ_ij are equal, d_ij(A, Λ) is the usual graphical distance: a count of the number of edges in the shortest path from v_i to v_j.
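As a small illustration of (1), not part of the original paper, the following Python sketch computes all path-length distances of a hypothetical valued graph with the Floyd-Warshall recursion:

    import numpy as np

    def path_length_distances(Lam):
        # All-pairs path-length distances d_ij(A, Lambda) of a valued graph.
        # Lam is an n x n symmetric matrix of edge lengths, with Lam[i, j] = 0
        # meaning 'no edge' (a_ij = 0) for i != j.
        n = Lam.shape[0]
        D = np.where(Lam > 0, Lam, np.inf)     # absent edges get infinite length
        np.fill_diagonal(D, 0.0)
        for k in range(n):                     # Floyd-Warshall recursion
            D = np.minimum(D, D[:, [k]] + D[[k], :])
        return D                               # sum of edge lengths on geodesics

    # Hypothetical 4-node valued graph: a path 0-1-2-3 plus a direct edge 0-3.
    Lam = np.zeros((4, 4))
    Lam[0, 1] = Lam[1, 0] = 2.0
    Lam[1, 2] = Lam[2, 1] = 1.0
    Lam[2, 3] = Lam[3, 2] = 3.0
    Lam[0, 3] = Lam[3, 0] = 7.0                # longer than the path via 1 and 2
    print(path_length_distances(Lam))          # d_03 = 6, not 7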
Let us first consider the question of embedding, or realizability: under what conditions on Δ can the objects be mapped into a valued graph with some path-length distance? The answer is that we may identify λ_ij = δ_ij for some subset R, provided that δ_ij is positive-definite and satisfies the triangle inequality δ_ij ≤ δ_il + δ_lj (Hakimi and Yau, 1965). Note that symmetry is not required (if we allow two edges between any two nodes); but if δ_ij is in addition symmetric, it is a metric, and the result says that any metric can be embedded into a valued graph. In the presence of error, it is much to be preferred to optimize some loss function measuring the lack of fit of feasible model distances, rather than to rely on idealized conditions evaluated directly in terms of the data. Therefore, we study the fitting problem of finding G (in particular, some A and Λ) so that the least squares loss function

    L(A, Λ) = Σ_{i<j} (δ_ij - d_ij(A, Λ))²        (2)

is minimal. Note that the major difficulty in (2) is finding A, since (1) is additive in the elements of Λ, so that, once we know A, finding Λ is just a non-negative regression problem, which can be solved by standard methods (Lawson and Hanson, 1974). How can we find out which edges to include and which to delete?
Our approach in the present paper will be to use a reparametrization of d_ij(A, Λ), which restricts attention to a certain subclass of graphs. To define the vertices of such a graph, we introduce a set of p discrete features F = {F_1, ..., F_t, ..., F_p}. On the feature set F we define a family S of n distinct nonempty subsets S = {S_1, ..., S_i, ..., S_n}, whose union is F. Furthermore, each feature F_t ∈ F is associated with some nonnegative feature discriminability parameter η_t. Every object will now be represented by some subset of features, that is, our goal will be to find a mapping τ: o_i ∈ O → S_i ∈ S.
To rephrase the fitting problem in terms of the mapping τ, we must have a metric on (sub)sets that parallels the path-length distance. Following Goodman (1951, 1977) and Restle (1959, 1961), we may define a metric on sets, here to be called the feature distance

    d(S_i, S_j) = μ[S_i - S_j]        (3)

where μ[·] is a measure function defined over the set of features (usually, just a count), and A - B is the symmetric set difference between sets A and B. Thus the feature distance measures the extent to which S_i possesses features that S_j does not have and vice versa. By elementary means it can be shown that (3) satisfies the metric axioms, and there are a number of alternative expressions of it that enable us to naturally include the feature discriminabilities η_t, which we will consider more closely in section 3. The first and foremost property of d(S_i, S_j), however, is stated in the following result.
Theorem (Flament, 1963, p. 17).
Let L(S) be the lattice obtained from ordering the elements of S by inclusion, and consider the graph representing L(S) having nodes v_i = S_i and an edge between v_i and v_j whenever S_i covers S_j or vice versa. Define d_ij(A) as the path-length distance (geodesic) between nodes v_i and v_j with all edge lengths λ_ij equal to unity. Then the feature distance d(S_i, S_j) is equal to d_ij(A) in the graph representation of the lattice L(S).
If S is an arbitrary selection of subsets, it is understood that L(S) includes the extension of S with all subsets that can be formed by union and intersection of its elements. In the graphical representation of this lattice of subsets there are generally several paths from S_i to S_j, but the crucial thing is that they all have equal length; hence, they are equivalent in terms of distance. Equivalence of distinct paths follows from the fact that each edge in the graphical representation of the lattice corresponds to one single element of F, which is the feature that distinguishes the covering subset from the covered one. While the graphical distance d_ij(A) is a count taken along a path of the distinguishing features in some particular order, the feature distance d(S_i, S_j) is the same count of distinguishing features, taken in any order.
Another property that turns out to be crucial for the present approach is that betweenness implies additivity: that is, if S_j is in between S_i and S_k in the sense that either S_i ⊃ S_j ⊃ S_k, or S_i ⊃ S_j and S_k ⊃ S_j, we have d(S_i, S_k) = d(S_i, S_j) + d(S_j, S_k) along the path from S_i to S_k. In this case, there need not be a direct edge from S_i to S_k. This characteristic allows us to first formulate the fitting problem (2) in terms of feature distances, next sort out the additivities in the fitted distances, and finally construct the graph by excluding edges that are sums of other edges. Thus the graphs to be constructed with the present approach will always be subgraphs of the graph representation of a lattice, which forms the embedding space of the given set of objects in much the same way as Euclidean space is used to embed a finite number of points in ordinary multidimensional scaling.

3. Introducing discriminability of features: Weighted counting


In the simplest case, the Goodman-Restle feature distance (3) is a straight count of the features in the set difference between the union and the intersection of two subsets. Although there are a number of interesting re-expressions of the feature distance in terms of set operations, what we need for our MDS-like fitting problem is an expression in terms of coordinates. Let E = {e_it} be a binary matrix of order n × p, which indicates which features of F are included in each of the n subsets in S that represent the objects. Since E characterizes objects as subsets of features (but also features as subsets of objects), it is (the transpose of) a point-set incidence matrix (see Roberts 1976, page 60). When μ[·] is just a counting measure, (3) becomes

    d(S_i, S_j) = Σ_t {(1 - e_it)e_jt + (1 - e_jt)e_it}
                = Σ_t {e_jt - e_it e_jt + e_it - e_jt e_it}
                = Σ_t (e_it - e_jt)² = Σ_t |e_it - e_jt| ,        (4)


where the last equality follows from the binary nature of E. Thus, the feature distance is equal to a city-block metric on a space with binary coordinates, a metric better known as the Hamming distance. This distance is commonly used as a dissimilarity coefficient, in a situation where the e_it are presence-absence data, or as a theoretical device (Boorman and Arabie, 1972), but - to the best of the author's knowledge - it has never been used as a structural model to be fitted to dissimilarity data.
Especially for fitting purposes, it is useful to take one further step and to go from a simple count to a weighted count, that is, to generalize (4) into

    d_ij(B) = Σ_t |b_it - b_jt| ,        (5)

where the b_it are still binary, albeit not necessarily (0,1) variables, collected in the n × p matrix B = {b_it = η_t e_it}, and where the discriminabilities η_t are nonnegative parameters to be estimated. Thus, the weighted feature distance defined in (5) allows for a differential contribution of the features to the overall length of the path from S_i to S_j. In geometrical terms, the introduction of feature discriminabilities turns the hypercube corresponding to E into a rectangular parallelepiped corresponding to B.
It can be shown that the Theorem stated in the previous section still holds for the weighted feature distance, if it is adjusted to allow for unequal λ_ij. Since each edge in the graphical representation of the lattice L(S) corresponds to one feature in F, we can associate exactly one feature discriminability η_t in (5) with each edge length λ_ij in (1). For example, if the set of edges on the shortest path between v_i and v_j would be P(v_i, v_j) = {(v_i, v_k), (v_k, v_l), (v_l, v_j)}, there will be three features F_1, F_2, and F_3 on which S_i and S_j are different, with the edge lengths being related by the one-to-one mapping λ_ik = η_1, λ_kl = η_2, and λ_lj = η_3. Hence we have d_ij(A, Λ) = λ_ik + λ_kl + λ_lj = η_1 + η_2 + η_3 = d_ij(B) in this example.
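In code, the plain count (4) and the weighted count (5) are each a single array expression. The following Python sketch, with a hypothetical incidence matrix E and discriminabilities η (not taken from the paper), illustrates both:

    import numpy as np

    # Hypothetical feature incidence matrix E (n = 4 objects, p = 3 features)
    E = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1]], dtype=float)
    eta = np.array([2.0, 1.0, 3.0])     # nonnegative discriminabilities eta_t

    B = eta * E                         # b_it = eta_t * e_it

    # Feature (Hamming) distance (4): straight count of differing features
    D_count = np.abs(E[:, None, :] - E[None, :, :]).sum(axis=2)

    # Weighted feature distance (5): d_ij(B) = sum_t |b_it - b_jt|
    D_weighted = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)

    print(D_count)
    print(D_weighted)                   # e.g. d_12 = |2-2| + |0-1| + |3-0| = 4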

4. Algorithm for fitting a feature graph


Due to the pioneering work of Hartigan (1967), Cunningham (1978), Arabie and Carroll (1980), De Soete (1983), Mirkin (1987), and others, least squares fitting of discrete models is gradually gaining ground over various ad hoc procedures that were once more common. After replacement of the path-length distance d_ij(A, Λ) by the feature distance d_ij(B), the least squares loss function (2) that we are interested in becomes

    L(B) = Σ_{i<j} (δ_ij - d_ij(B))² ,        (6)

which must be minimized over all binary valued matrices B. Because the feature distance is additive over features, it is possible to employ an alternating least squares (ALS) scheme, fitting the model one feature at a time, given some starting values {b_it}. Explicitly, given the current values {b_is} for s ≠ t, the corrected dissimilarity δ̃_ij is defined as δ̃_ij = δ_ij - Σ_{s≠t} |b_is - b_js|, the original dissimilarity corrected for the contribution of the fixed variables. Substituting (5) into (6), and inserting δ̃_ij, we find that the ALS subproblem for feature t is to minimize, given δ̃_ij,

    L_t(b_t) = Σ_{i<j} (δ̃_ij - |b_it - b_jt|)²        (7)

over the binary n-vector b_t. This minimization subtask is a one-dimensional MDS problem with the coordinates restricted to form a bipartition, and therefore the cluster differences scaling (CDS) algorithm of Heiser and Groenen (1996) applies, with the number of clusters equal to two. The ALS algorithm cycles over CDS subtasks until convergence.
Let us have a closer look at this particular CDS subtask, by resolving B again into its discrete and continuous factors. Writing |b_it - b_jt| = η_t {(1 - e_it)e_jt + (1 - e_jt)e_it}, setting the partial derivative of (7) with respect to η_t equal to zero and simplifying shows that, for any given bipartition {e_it | i = 1, ..., n}, the optimal value of the discriminability parameter for feature t is equal to max(0, η̂_t), with η̂_t denoting the unconstrained minimizer

    η̂_t = [ Σ_{i<j} {(1 - e_it)e_jt + (1 - e_jt)e_it} δ̃_ij ] / [ n_t (n - n_t) ] ,        (8)

where n_t is the number of objects in one group, and n - n_t the number of objects in the other. Thus, the length of edge t in the fitted feature graph will be equal to the average corrected dissimilarity between the two groups of objects that constitute that particular feature. If the features are exclusive, e_it(1 - e_jt)δ̃_ij = e_it(1 - e_jt)δ_ij, and (8) becomes just the average between-group dissimilarity. If the features are not exclusive and η̂_t is relatively large, then the corresponding bipartition must be a good discriminator by itself, on top of the contribution of the other features, since we always have δ̃_ij ≤ δ_ij; this justifies the name discriminability parameter.
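A minimal sketch of this update step, assuming a given bipartition e_t and a matrix of corrected dissimilarities (function name and example values are hypothetical):

    import numpy as np

    def update_discriminability(e_t, delta_tilde):
        # Optimal discriminability max(0, eta_hat_t) for a fixed bipartition,
        # following (8): the average corrected dissimilarity between the two
        # groups defined by the binary vector e_t.
        n = e_t.size
        n_t = int(e_t.sum())
        # Indicator of between-group pairs: (1 - e_it) e_jt + (1 - e_jt) e_it
        between = np.outer(1 - e_t, e_t) + np.outer(e_t, 1 - e_t)
        iu = np.triu_indices(n, k=1)                  # pairs with i < j
        eta_hat = (between[iu] * delta_tilde[iu]).sum() / (n_t * (n - n_t))
        return max(0.0, eta_hat)                      # edge lengths are nonnegative

    # Hypothetical example: two clear groups, between-group value about 5
    e_t = np.array([1, 1, 0, 0])
    delta_tilde = np.array([[0., 1., 5., 5.],
                            [1., 0., 5., 5.],
                            [5., 5., 0., 1.],
                            [5., 5., 1., 0.]])
    print(update_discriminability(e_t, delta_tilde))  # 5.0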
We still have to indicate how to find E. Loss function (7) is quadratic in one column (size n) of the binary matrix E, so this subtask still is a hard combinatorial problem, even though its size is reduced by a factor p with respect to loss function (6). The present implementation uses a nesting of several random starts (within features and across features), together with K-means type reallocations. Heiser and Groenen (1996) have described a strategy called Fuzzy Steps to alleviate the local minimum problem for CDS, but it looks like the problem here is especially difficult across features (in the ALS phase), not so much within features. A more extended discussion of the algorithmic aspects of finding E is in preparation.

5. Example: Henley's (1969) animal terms


To see how the feature graph procedure works, dissimilarity data originally collected by Henley (1969) will be analyzed as an example. In a psychological experiment on semantic memory structure, 21 subjects were instructed to freely list from memory any animal terms they knew. From the total set of animal terms mentioned, 12 common ones were selected. Dissimilarities were computed for each pair of terms as the average (across subjects) of the proportion of items separating them in each list. Figure 1 displays an additive tree representation of the animal terms, given by Abdi (1990), with a percentage of variance accounted for of 73.0%. Many other tree representations for this example can be found in Barthelemy and Guenoche (1991).
The feature-graph method was applied with the number of features ranging from 1 to 10. The one-feature solution, which is a simple two-cluster split, had a prevalence of 7 out of 10 random initial bipartitions, and accounted for 21.9% of the variance. It splits {lion, bear, pig, sheep, goat, cow, horse} from {cat, dog, mouse, rabbit, deer}. Note that this split cannot be obtained by cutting any one of the edges of the optimal tree, while it does

Fig. 1: Additive tree for Henley's animal terms

seem to represent the best split in terms of the dissimilarity data. The percentage of variance accounted for (VAF) for all ten solutions is given in Table 1. This table also gives for each solution the DAF (percentage of Dispersion Accounted For), defined as the sum of squared fitted distances divided by the sum of squared dissimilarities. DAF is the scale-free goodness-of-fit measure that is maximized when the badness-of-fit measure (6) is minimized.

Table 1. Goodness of fit for feature graph representations of the Henley (1969) data

# features:   1     2     3     4     5     6     7     8     9     10*
DAF          63.3  83.4  89.4  93.9  95.8  96.9  97.6  98.1  98.5  97.3
VAF          21.9  41.9  54.0  71.9  79.2  82.3  84.7  87.7  90.3  83.9

* solution with 4 unicities

We see from Table 1 that a VAF just above the percentage of the tree solution is reached with the five-feature solution (79.2%), which has a DAF of 95.8%. This solution, which does not yet discriminate all objects from each other (leaving seven objects in three small clusters), is shown in Figure 2. While the terms in the clusters {cat, dog} and {goat, cow, horse} in the feature graph are also close together in the additive tree in Figure 1, this is not the case for the cluster {bear, pig}. Another major difference is that the feature graph is not tree-like at all. As to the interpretation of the five-feature solution in Figure 2, it is clear that {deer} is an isolate (there is one feature that contrasts it with all other terms, with a discriminability of approximately 20), and that there is a "domestic" versus "wildlife" feature contrasting, from top to bottom, {sheep, cow, horse, goat, cat, dog} from {pig, bear, lion, mouse, rabbit, deer}, with a discriminability of about 14. A third important split is {cat, dog, lion, mouse} versus the rest, with discriminability 16.

A more differentiated representation arises if we look at one of the solutions with a higher
number of features. How many features to take is not only a matter of amount of fit that is

Fig. 2: Five-feature solution of Henley's data

deemed acceptable, but also depends on the issue of how many edges need to be kept, or conversely, how many additivities there are in the fitted distances. Judged by the number of edges needed, while still accounting for a reasonable amount of variance, a special ten-feature solution was selected as the best one (see the last column of Table 1; its graph with 24 edges is displayed in Figure 3). It consists of six common features (i.e., features shared by more than one object) and four unique features (not shared by any other object).
Figure 3 contains two types of nodes: the closed circles, which represent the objects of
analysis, and the open circles, called latent nodes, which represent subsets of features that
can be obtained by taking either the union or the intersection of the feature subsets
characterizing two other nodes. Remember that the fitted feature graph is a subgraph of the

[Figure]

Fig. 3: Ten-feature solution of Henley's data, with 6 common features and 4 unique ones (open circles are latent nodes)

graph representation of the lattice of feature subsets, and latent nodes are other elements of
this lattice that can be included afterwards, to make the graph simpler in terms of its
pathways and number of edges. As a good example of the effect of the introduction of a
latent node, consider four objects characterized by the subsets (BCD), (ACD), (ABD), and
(ABC), and assume equal discriminability of the features. Then all distances are equal, and
the objects are mapped as four points on a regular tetrahedron, with six edges. Introducing
the latent node (ABCD), which is the union of each of the pairs of subsets, allows us to
construct a star graph, in which there are only four edges, one between each of the
manifest nodes and the latent node, and none among the manifest nodes themselves.
The fitted edge lengths are also given in Figures 2 and 3 (rounded to integer numbers). An
edge is not included in the graph if its length is the sum of two other edge lengths (a rather
simple algorithm looping over all triads is sufficient to sort this out). To reconstruct the
distance between two terms (and hence their dissimilarity), we just have to add the edge
lengths along the shortest path between them. It will be noted that there are several
instances of distinct paths with equal length. Comparing the two feature-graph solutions, it
appears that there are primarily local changes: one is slightly more (less) differentiated than
the other, a result that makes sense.
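The triad loop mentioned above can be sketched as follows (a hypothetical helper, assuming a symmetric matrix of fitted edge lengths with zeros for absent edges):

    import numpy as np

    def prune_additive_edges(Lam, tol=1e-8):
        # Drop an edge (i, j) whenever its length equals the sum of two other
        # edge lengths forming an alternative path through some third node k;
        # the path-length distance between i and j is then preserved via k.
        L = Lam.copy()
        n = L.shape[0]
        for i in range(n):
            for j in range(i + 1, n):
                if L[i, j] <= 0:
                    continue
                for k in range(n):
                    if k in (i, j) or L[i, k] <= 0 or L[k, j] <= 0:
                        continue
                    if abs(L[i, j] - (L[i, k] + L[k, j])) < tol:
                        L[i, j] = L[j, i] = 0.0
                        break
        return L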

6. Some special cases


Modeling considerations can be formulated in terms of the lattice of subsets L(S) as follows: given Δ, or some approximation of it satisfying the triangle inequality, do there exist feature distances d(S_i, S_j) that arise from the feature graph of a family of subsets that has a certain property? Consequently, the fitting problem would become one of optimizing loss function (6) over a family of feature sets that have a specified structural property. In this section, we will briefly indicate some examples of structural properties that may be handled in the present framework; a more detailed treatment of them is in preparation.
Before considering the special cases, however, we discuss a technical issue that needs to be settled first. In a feature distance model, the role of presence and absence of features is symmetric, that is, we can always replace all elements e_it of a whole column of E by their complement 1 - e_it without changing the feature distance. To illustrate, the two matrices

    2 0 0 0         1 0 0 2  0  1/2  0   3/2
    0 4 0 0         0 1 2 0  0  1/2  0   3/2
    0 0 1 0   and   0 1 0 2  1/2  0  0   3/2
    0 0 0 3         0 1 0 2  0  1/2  3/2  0

generate the same feature distances among their rows. Thus we can freely add the complements of any column of the incidence matrix, provided that we halve the corresponding discriminabilities. Any n × 2 matrix formed by concatenating some column of E with its complement has the property that it has row sums equal to one, and such a matrix is called the indicator matrix of a feature.
Now suppose that the features are nested: that is, if G_t is the indicator matrix of feature F_t and G_s is the indicator matrix of feature F_s, then the matrix G_t'G_s has at least one element equal to zero. Nestedness implies that one feature separates a subgroup from one of the two groups formed by the other feature. For instance, the bipartitions {(ABCD), (EFG)} and {(EF), (ABCDG)} are nested, since (EF) is a subset of (EFG) and (ABCD) is a subset of (ABCDG). Then, by a famous result of Buneman's (1971), the feature distance satisfies the four-point property that characterizes additive trees if and only if all its features are nested. Additive trees thus form an important special case of feature graphs, in which each edge corresponds to exactly one feature (or split).
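A sketch of this nestedness test, with hypothetical helper names: form the two-column indicator of each feature and check that the cross-product of two indicators contains a zero cell:

    import numpy as np

    def indicator(e_t):
        # Indicator matrix of a feature: the column e_t concatenated with its
        # complement 1 - e_t; every row sums to one.
        e_t = np.asarray(e_t, dtype=float)
        return np.column_stack([e_t, 1 - e_t])

    def nested(e_t, e_s):
        # Two features are nested iff G_t' G_s has at least one zero element.
        cross = indicator(e_t).T @ indicator(e_s)   # 2 x 2 table of joint counts
        return bool(np.any(cross == 0))

    # Objects A..G; the bipartitions {(ABCD),(EFG)} and {(EF),(ABCDG)} of the text
    e_t = np.array([1, 1, 1, 1, 0, 0, 0])           # 1 = member of (ABCD)
    e_s = np.array([0, 0, 0, 0, 1, 1, 0])           # 1 = member of (EF)
    print(nested(e_t, e_s))                         # True: (ABCD) and (EF) are disjoint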

The case of a linear array (Goodman, 1951, 1977; Restle, 1959, 1961), called a Guttman scale in psychometrics, is obtained when the features are not only nested, but have an additional property. In terms of the feature incidence matrix E, this property implies that each column of E consists of either a single run of zeros followed by a single run of ones, or of a single run of ones followed by a single run of zeros. When n - 1 distinct features have this structure, the feature graph has n - 1 edges, connecting the objects in a certain order, and no latent nodes. Except for the two endpoints, which have degree one, all nodes have degree two. For an exact characterization of the Guttman scale, see Holman (1995).
A hierarchical tree is a rooted additive tree with the extra requirement that the distance from any endpoint to the root is equal. In a feature graph, the root corresponds to the latent node that has all features, that is, F. Then the first feature defines the first split into two groups of objects, the second feature splits one of these groups further down into subgroups, and so on. So the features are again nested. The hierarchical tree is a more parsimonious model than the additive tree, because the requirement of equal distance to the root puts restrictions on the discriminabilities. The characterization of trees in terms of a feature model is due to Tversky (1977).
The last example of a family of subsets that satisfies a specific structural property is the circumplex or radex (Guttman, 1954). It is characterized by the circular ones property, which implies that each column of E consists of either a single run of zeros bordered by a run of ones on one or both sides, or of a single run of ones bordered by one or two runs of zeros. The graph of a regular circumplex is like a closed simple chain, with exactly n edges, if each feature divides the objects into equal groups (when n is even). When divisions into unequally sized groups are included in the feature set, the graph of a circumplex becomes more complicated. In the complete case, it looks like a network spanned over a (half)sphere (Heiser, 1981, chapter 4).
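Assuming a fixed ordering of the objects, both run conditions are easy to verify on a column of E; the following sketch (helper names are mine) checks the single-run property of a Guttman scale and the circular ones property of a circumplex:

    import numpy as np

    def runs(col):
        # Number of maximal runs of equal values in a binary column.
        col = np.asarray(col)
        return 1 + int(np.sum(col[1:] != col[:-1]))

    def guttman_column(col):
        # Single run of zeros followed by ones (or vice versa): at most 2 runs.
        return runs(col) <= 2

    def circular_ones_column(col):
        # Circular ones property: at most 2 sign changes when the column is
        # read around a circle (first and last entries are neighbours).
        col = np.asarray(col)
        return int(np.sum(col != np.roll(col, 1))) <= 2

    print(guttman_column([0, 0, 1, 1, 1]))           # True
    print(guttman_column([1, 0, 1, 1, 0]))           # False: four runs
    print(circular_ones_column([1, 0, 0, 1, 1]))     # True: the ones wrap around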

7. Discussion
We have seen that a metric defined on the symmetric set difference between sets of features can be used as a general framework for fitting a particular class of graphs, which includes additive trees, hierarchical trees and circumplex structures. It was shown that we can find out which edges to include in the graph by formulating the problem in terms of a lattice of subsets, using a weighted count of feature differences (the feature discriminabilities). The algorithm presented, based on alternating least squares and on cluster differences scaling, is still in an early stage of development. It always converges to a local minimum, but, as usual in this type of problem, there are an awful lot of local minima. On the positive side, it is the first systematic method to fit the Hamming distance.
A crucial ingredient of this approach to finding graph representations is the fact that
inclusion relations between feature sets lead to additivity of distance along paths in a graph.
In fact, Hutchinson (1989) and Klauer and Carroll (1989) used the criterion of dropping
direct edges by looking at additivity of link length as their main graph construction
strategy. But they applied this criterion to the dissimilarities, rather than to the fitted
distances, as is proposed here. Feature graphs are similar to Corter and Tversky's (1986)
extended similarity trees, but exactly how these two models are related needs further study.
In any case, it seems clear that additive trees and other restricted representations do not
show up spontaneously in real examples, although the method reproduces a circumplex,
for example, when the data are error-free.
Choosing the number of features p is a matter that requires experience and cannot be settled
yet with clear-cut rules. In most examples analyzed so far, the number of features needed
to get good fit is in the neighborhood of n/2. Also, it appears that, as soon as p is in the

range of values where the fit stabilizes, solutions with one feature more or one feature less
are only different in their fine structure, as was the case in the example of the Henley
(1969) data. There is a trade-off to be made with the number of links in the graph, a
quantity that increases nonlinearly with p, and which we want to have as small as possible.
Making a good trade-off is complicated by the fact that we can often reduce the number of
links by including latent nodes, without it being clear how to do this optimally.
Unlike methods based on distance constraints, feature graph fitting can be extended without too much trouble to well-known variants of MDS, such as individual differences scaling (INDSCAL) and two-mode scaling (unfolding), possibly combined with nonlinear transformations of the data. An easy way to recognize this flexibility is to view the basic distance model (5) as a squared Euclidean distance (since the deviations e_it - e_jt are zero or (minus) one, we just have to reparametrize η_t as the square of some other non-negative parameter). Then the feature graph loss function (6) is identical to Takane et al.'s (1977) SSTRESS loss function with restrictions on the configuration.

References:
Abdi, H. (1990): Additive tree representations, In: Trees and Hierarchical Structures,
Dress, A. et al. (Eds.), 43-59, Springer Verlag, Berlin.
Arabie, P., and Carroll, J.D. (1980): MAPCLUS: A mathematical programming approach
to fitting the ADCLUS model, Psychometrika, 45, 211-235.
Arabie, P., and Hubert, L. (1992): Combinatorial data analysis, Annual Review of
Psychology, 43, 169-203.
Barthelemy, J.-P. and Guenoche, A. (1991): Trees and Proximity Representations, Wiley,
New York.
Boorman, S.A. and Arabie, P. (1972): Structural measures and the method of sorting, In:
Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Shepard,
R.N. et al. (Eds.), 225-249, Seminar Press, New York.
Buneman, P. (1971): The recovery of trees from measures of dissimilarity, In:
Mathematics in the Archaeological and Historical Sciences, Hodson, F.R. et al. (Eds.),
387-395, Edinburgh University Press, Edinburgh.
Carroll, J.D. (1976): Spatial, non-spatial and hybrid models for scaling, Psychometrika,
41, 439-463.
Chandon, J.L., Lemaire, J., and Pouget, J. (1980): Construction de l'ultrametrique la plus
proche d'une dissimilarite au sens des moindres carres, R.A.I.R.O. Recherche Operationnelle, 14, 157-170.
Corter, J.E., and Tversky, A. (1986): Extended similarity trees, Psychometrika, 51, 429-
451.
Cunningham, J.P. (1978): Free trees and bidirectional trees as representations of
psychological distance, Journal of Mathematical Psychology, 17, 165-188.
De Soete, G. (1983): A least squares algorithm for fitting additive trees to proximity data,
Psychometrika, 48, 621-626.
Felsenstein, J. (Ed.)(1983): Numerical Taxonomy, Springer Verlag, Heidelberg.
Flament, C. (1963): Applications of Graph Theory to Group Structure, Prentice-Hall,
Englewood Cliffs, New Jersey.

Goodman, N. (1951): The Structure of Appearance, Bobbs-Merrill, Indianapolis, Indiana.


Goodman, N. (1977): The Structure of Appearance (3rd ed.), Reidel, Dordrecht, Holland.
Groenen, P.J.F., Mathar, R., and Heiser, W.J. (1995): The majorization approach to multidimensional scaling for Minkowski distances, Journal of Classification, 12, 3-19.

Guttman, L. (1954): A new approach to factor analysis: The radex, In: Mathematical
thinking in the social sciences, Lazarsfeld, P.F. (Ed.), 258-348, The Free Press, Glencoe,
Illinois.
Hakimi, S.L., and Yau, S.S. (1965): Distance matrix of a graph and its realizability,
Quarterly of Applied Mathematics, 22, 305-317.
Hartigan, J.A. (1967): Representation of similarity matrices by trees, Journal of the
American Statistical Association, 62, 1140-1158.
Heiser, W.J. (1981): Unfolding analysis of proximity data, Unpublished doctoral
dissertation, University of Leiden, The Netherlands.
Heiser, W.J., and Groenen, P.J.F. (1996): Cluster differences scaling with a within-
clusters loss component and a fuzzy successive approximation strategy to avoid local
minima, Psychometrika, 61, in press.
Henley, N.M. (1969): A psychological study of the semantics of animal terms, Journal of Verbal Learning and Verbal Behavior, 8, 176-184.
Holman, E.W. (1995): Axioms for Guttman scales with unknown polarity, Journal of
Mathematical Psychology, 39, 400-402.
Hutchinson, J.W. (1989): NETSCAL: A network scaling algorithm for nonsymmetric
proximity data, Psychometrika, 54, 25-52.
Klauer, K.C. (1994): Representing proximities by network models, In: New Approaches
in Classification and Data Analysis, Diday, E. et al. (eds.), 493-501, Springer Verlag,
Heidelberg.
Klauer, K.C., and Carroll, J.D. (1989): A mathematical programming approach to fitting general graphs, Journal of Classification, 6, 247-270.
Lawson, C.L., and Hanson, R.J. (1974): Solving Least Squares Problems, Prentice Hall, Englewood Cliffs, NJ.
Mirkin, B.G. (1987): Additive clustering and qualitative factor analysis methods for
similarity matrices, Journal of Classification, 4, 7-31.
Restle, F. (1959): A metric and an ordering on sets, Psychometrika, 24, 207-220.
Restle, F. (1961): Psychology of Judgment and Choice, Wiley, New York.
Roberts, F.S. (1976): Discrete Mathematical Models, with Applications to Social,
Biological, and Environmental Problems, Prentice Hall, Englewood Cliffs, New Jersey.
Sattath, S., and Tversky, A. (1977): Additive similarity trees, Psychometrika, 42, 319-345.
Shepard, R.N., and Arabie, P. (1979): Additive clustering: Representation of similarities
as combinations of discrete overlapping properties, Psychological Review, 86, 87-123.
Takane, Y., Young, F.W., and De Leeuw, J. (1977): Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features, Psychometrika, 42, 7-67.
Tversky, A. (1977): Features of similarity, Psychological Review, 84, 327-352.
Classification and data analysis in finance
Krzysztof Jajuga
Wroclaw University of Economics
ul. Komandorska 118/120
53-345 Wroclaw, Poland

Summary: The paper gives a brief review of the main areas of financial applications where classification and data analysis methods can be used. First of all, historical context is given. It is shown that the emergence of modern finance was made possible due to the use of quantitative methods. The presented applications are divided into two main groups: 1) analysis of financial investments and markets, 2) corporate finance. The review is put in a framework where the relationship between a dependent variable and explanatory variables is determined.

1. Historical remarks

In the paper some links between classification and data analysis on one side and financial applications on the other side are shown. History has proved that classification and data analysis methods are useful in financial applications. It is also clear that the usefulness of these methods will grow in the future.
There are two important streams in the development of modern finance which, on the one hand, were the driving forces of this discipline and where, on the other hand, the contribution of statistics was crucial to its emergence and development. These are:
- forecasting of financial prices;
- portfolio theory.
Forecasting of prices in financial markets (as well as in commodity markets) has probably been the most exciting issue in financial research. First of all, this is a very difficult task, which has not been solved despite a lot of effort. Secondly, people believe that by finding good forecasts of financial prices they can make a lot of money. This makes the job even more exciting.
The first work on forecasting of financial prices was done by Louis Bachelier in 1900. His doctoral thesis "Theory of Speculation" is considered today a seminal work, which had gone unnoticed for more than fifty years. He completed the thesis for the degree of Doctor of Mathematical Sciences at the Sorbonne.
In the dissertation he proved two statements. The first one: prices in financial markets cannot be successfully predicted. He argued that: "contradictory opinions concerning market changes diverge so much that at the same instant buyers believe in a price increase and sellers believe in a price decrease. [...] It seems that the market, the aggregate of speculators, at a given instant can believe in neither a market rise nor a market fall, since for each quoted price, there are as many buyers as sellers" (Bachelier (1900)). The main conclusion of Bachelier is: the mathematical expectation of price changes is zero, therefore the best forecast of the next price is the current price. This means that the prices of financial instruments follow a random walk process.


The second statement of Bachelier was: the range of the interval of prices is proportional to the square root of time. This is reflected today in many stochastic price models of the ARIMA type. This was also confirmed empirically for many time series of prices.
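A small simulation sketch, with arbitrary parameters, illustrates both statements: under a random walk the mathematical expectation of price changes is zero, and the spread of prices grows like the square root of time:

    import numpy as np

    rng = np.random.default_rng(1)
    n_paths, n_steps = 10000, 400
    # Random walk: price changes with zero mathematical expectation
    changes = rng.normal(loc=0.0, scale=1.0, size=(n_paths, n_steps))
    prices = 100.0 + changes.cumsum(axis=1)

    # The spread of prices after t steps grows roughly like sqrt(t), and the
    # best forecast of the price at t+1 is the price at t.
    for t in (100, 400):
        print(t, prices[:, t - 1].std(), np.sqrt(t))   # std close to sqrt(t)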
Despite this and some other theoretical results, financial practitioners searched for effective forecasting tools. One such tool is so-called technical analysis, the foundations of which were laid by Charles Dow (also known as co-founder and editor of the Wall Street Journal and co-author of the most famous stock market index, the Dow Jones Industrial Average). Dow claimed that financial prices (especially stock prices) change according to some trends. The particularly attractive (from the practical point of view) claim of the proponents of technical analysis is the following: the directions of stock price movements can be predicted, and the forecasts can be used to develop trading strategies resulting in returns above average. The users of technical analysis try to detect regular patterns in past prices (by means of charts) which they believe will repeat in the future. This is a simplified version of the pattern recognition problem. Basically, the idea of the existence of regular patterns is used today in neural networks methodology.
The discussion between the advocates of two concepts, the first that effective forecasts of financial prices can be made, and the second that financial prices follow a random walk process, can be put in the framework of so-called market efficiency. This concept was proposed by Fama (1965). According to him, a market is called efficient if current financial prices instantaneously and fully reflect all available information. For the predictability of prices, so-called weak-form market efficiency is of particular importance. It is said that a market is weak-form efficient if current financial prices instantaneously and fully reflect all information contained in the past history of financial prices. This means that historical prices provide no information (about future prices) that will lead to higher than average returns by using trading rules based on forecasts of prices. Thus the best forecast of the next stock price is the current stock price. The search for methods of financial price forecasting (particularly stock price forecasting) is based on the conviction that markets are not weak-form efficient.
The second stream of the development of modern finance is connected to portfolio theory. Portfolio theory was laid down by Harry Markowitz. In his seminal paper (Markowitz (1952)), published in the "Journal of Finance", probably the most significant paper in the theory of finance, Markowitz introduced the concept of risk in financial investments. He was the first to propose the use in finance of the concept of a distribution of a random variable. As a measure of the return on the investment, the expected value of return was used. As a measure of investment risk, the standard deviation of return was used. Then he developed the concept of risk diversification. This means that the investment risk can be reduced by forming a portfolio of stocks, and this reduction depends on the correlation between the returns of the stocks belonging to the portfolio.
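A minimal numerical sketch of this diversification effect (the weights and standard deviations are hypothetical): the portfolio risk sqrt(w' Σ w) falls as the correlation between the two returns decreases:

    import numpy as np

    w = np.array([0.5, 0.5])           # equal weights on two stocks
    sigma = np.array([0.2, 0.2])       # standard deviations of the two returns

    for rho in (1.0, 0.5, 0.0):        # correlation between the returns
        cov = np.array([[sigma[0]**2,               rho * sigma[0] * sigma[1]],
                        [rho * sigma[0] * sigma[1], sigma[1]**2]])
        port_sd = np.sqrt(w @ cov @ w)             # portfolio risk sqrt(w' Sigma w)
        print(rho, round(port_sd, 4))              # 0.2, 0.1732, 0.1414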
At that time the approach proposed by Markowitz was entirely different from the traditional approach used in finance. For several years the paper of Markowitz went unnoticed. Then it caused a lot of discussion and criticism. The weak point in the Markowitz approach was that solving the portfolio problem required a very substantial amount of time using the computers available at that time. Today portfolio theory is widely used in practice. For his contribution to economic sciences, Harry Markowitz was awarded the Nobel Prize in 1990.
In the beginning of the seventies a substantial increase in the volatility (variability) of financial prices (particularly exchange rates and interest rates) was observed. This caused a search for ways to cope with the resulting risk. One solution was the introduction of new financial instruments, like options and financial futures. Another solution was to look for more sophisticated mathematical methods.

In the last fifteen years an enormous development in the area of computer technology has occurred. This was extremely beneficial as far as the use of sophisticated mathematical and statistical methods is concerned. At present the use of these methods is not time-consuming, which means that the costs of implementing these methods are relatively low. On the other hand, computer software designed to solve complicated financial problems is widely available. Statistical and data analysis methods are at the disposal of firms, banks and investors.

2. Statistical methods in financial applications

It is not easy to give a general framework for the presentation of applications of statistical methods in finance. A relatively simple way is to consider the financial applications through the analysis of the following function:

    Y = f(X_1, X_2, ..., X_m)        (1)

where:
Y - dependent variable,
X_1, X_2, ..., X_m - explanatory variables.
It is not easy to systematize all quantitative methods that can be used to solve financial problems. One possible way is to classify them into two groups:
- classical multivariate data analysis methods;
- financial cybernetics methods.
This taxonomy is based on a historical criterion, since the second group of methods emerged in the last ten years, when the use of methods requiring a large amount of computer time was made possible due to the development of computer technology.
The term "financial cybernetics" was used by Thomas E. Berghage (president of a company
developing artificial intelligence software for finance) to describe the process of enhancing
financial decision making by introducing artificial intelligence technologies. The term
"cybernetics" was used for the first time by Norbert Wiener in 1948. At the time it was a new
science dealing with modifying or enhancing human decision systems with artificial
electronic systems. In the area of finance this means enhancing financial decision making
with computer systems which to some extent resemble human systems.
One of the very first applications of financial cybernetics was the application of neural
networks by Lapedes and Farber (Lapedes and Farber (1987)). They attempted to forecast
the closing value of the Standard and Poor's 500 market index, using the closing values from
the ten previous weeks as explanatory variables. The back-propagation method was used as
the training algorithm.
Another useful way to classify the methods used to solve the main financial problems is
based on two criteria:
1. The type of dependent variable - Y is quantitative or categorical.
2. The knowledge of the values (or categories) of the dependent variable - the values (or
categories) are known or unknown.
Therefore four different classes of methods can be distinguished, leading to four research
situations (cases):
1. Y is a categorical variable and its categories are known.
2. Y is a quantitative variable and its values are known.
3. Y is a categorical variable and its categories are unknown.
4. Y is a quantitative variable and its values are unknown.
There are many methods that can be used in each case. Here is a sample list:
Situation 1 - discriminant analysis, neural networks.
Situation 2 - regression analysis, neural networks.
Situation 3 - cluster analysis.
Situation 4 - principal component analysis (or any method synthesizing a set of variables).
There is an opinion that financial cybernetics methods outperform classical methods (for
example, that neural networks outperform discriminant analysis). It seems that it is still too
early to draw general conclusions. As a rule, artificial intelligence methods are time-consuming
and difficult to interpret, and therefore difficult to explain to non-experts, who are the end
users of these methods.
It is worth making two very general remarks on the use of statistical methods in finance. The
first remark refers to the approach that can be assumed: stochastic or descriptive
(distribution-free). It is often the case that the stochastic approach, that is, the use of methods
based on distributional assumptions, is not justifiable in financial applications. Many
statistical methods rely on the assumption of a normal (univariate or multivariate)
distribution, but the distributions of financial variables are often heavy-tailed or asymmetric
(as a rule, skewed to the right). Moreover, the studied individuals often come from a finite
population, which also means that the stochastic approach may not be assumed. However,
many statistical methods can still be used as descriptive methods, provided statistical
inference is not made.
The second remark refers to the great opportunity to use classification methods in finance,
because heterogeneity occurs very often in financial data. This heterogeneity may be due to
the character of the studied problem - objects fall into separate classes which we want to
detect (for example, well-performing and badly-performing companies). It may also be due
to the existence of different classes known a priori (different industries, different sizes of
companies). Finally, it may be due to the intrinsic heterogeneity of the data, for example
resulting from the existence of outliers. These facts indicate the great opportunity for
classification methods.

3. A review of the most important areas of financial applications

We now give a brief review of the most important areas of financial applications, where the
use of classification and data analysis methods has proved useful. All financial applications
can be divided into two main groups:
- analysis of financial investments and markets;
- corporate finance.

Group 1. Analysis of financial investments and markets.


A. Bond rating
Investing in bonds is one of the basic financial investments. A bond gives its holder the right
to receive an amount of money equal to the nominal value of the bond at maturity and to
receive regular payments, so-called coupons, being the interest on the loan.
One of the main types of risk occurring when investing in bonds is so-called default risk.
Default risk means that the issuer of the bond does not pay back the loan and/or does not
pay the interest on the loan. From the investor's point of view it is very important to evaluate
default risk in order to avoid the negative consequences. This can be achieved by so-called
bond rating.
Bond rating consists in the determination of classes of bonds with approximately equal
levels of default risk. There are many institutions specializing in bond ratings (e.g. Standard
and Poor's Corporation and Moody's Investors Service). The usual way to determine a bond
rating is to ask experts to evaluate the different factors influencing the default risk. As a rule,
the past performance of the bond issuer is also taken into account when determining bond
ratings. It is worth mentioning that rating institutions claim that they link together the
financial statement data of bond issuers and experts' opinions.
The bond rating problem can be regarded as the determination of a function (1), where Y is
the categorical variable standing for the class of bond and the explanatory variables are the
factors influencing the default risk. From the point of view of the statistical methods used in
bond rating, we can distinguish two of the four situations mentioned above, namely:
- situation 1: the categories of the dependent variable are known;
- situation 3: the categories of the dependent variable are unknown.
In the first case two types of data can be used:
- historical data, that is, past bond ratings plus information on the previous values of the
factors influencing the default risk;
- expert opinions, obtained by assigning bond ratings to hypothetical values of the factors.
Either of these two types of data sets can be treated as a learning set and used to determine
a function which divides the bonds into classes. As a rule, discriminant analysis or neural
network methodology can be used. The classes corresponding to particular bond ratings can
be interpreted, which is very important for end users.
In the second case past data on bond ratings are not available and one uses classification
methods (for example cluster analysis) to determine the classes of bonds. Here, the resulting
classes of bonds may be difficult to interpret.
B. Financial prices forecasting
This is without doubt the most difficult financial problem. The task is important for different
types of investors: short-term and long-term, individual and institutional. The following
prices are usually predicted: commodity prices, exchange rates, interest rates and stock
prices.
From the point of view of classification and data analysis this problem is regarded via a
function (1). Here Y is usually a quantitative variable - the price. Sometimes it can be
categorical, assuming one of three categories: "the price will go up", "the price will go
down", "the price will stay within a defined interval". Historical data are used to determine
the function (1). This fits either situation 1 or situation 2.
There are many approaches to financial price forecasting. Basically they can be divided into
three broad categories:
- technical analysis;
- econometric regression and time series models;
- neural networks.
Technical analysis is a simple and widely used approach, which has already been mentioned.
"Technicians" use different types of charts of past prices to discover regular patterns, which
they believe will recur in the future. Their reasoning is supported by simple indicators
describing financial markets.
A large group of researchers uses econometric models emerging from the well-known
ARIMA approach. The development of statistical methodology and of computer technology
has allowed the implementation of these models in the real world. A detailed description of
these models is presented by Taylor (1986) and Mills (1993).

Neural networks, as well as some other approaches (genetic algorithms, chaos theory), are
relatively new approaches in financial forecasting. These models could only be developed
with the use of fast computers, since they require lengthy computations. The most popular
are probably neural networks. Here an algorithm is used to estimate a quite complicated
nonlinear function. This function approximates the past financial prices and is then used for
forecasting.
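To make this concrete, here is a minimal sketch (not the Lapedes and Farber original setup) that fits a small back-propagation-trained network to lagged weekly closing values; the price series, the network size and the scikit-learn dependency are all assumptions made for illustration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor  # network trained by back-propagation

    # Hypothetical weekly closing values of a market index.
    rng = np.random.default_rng(0)
    prices = 100.0 + np.cumsum(rng.normal(0.1, 1.0, 300))

    LAGS = 10  # closing values from the ten previous weeks as explanatory variables
    X = np.column_stack([prices[i:len(prices) - LAGS + i] for i in range(LAGS)])
    y = prices[LAGS:]

    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    net.fit(X[:-1], y[:-1])        # estimate the nonlinear function on past data
    print(net.predict(X[-1:]))     # one-step-ahead forecast from the last ten values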
As already mentioned, the question of market efficiency is crucial to the forecasting of
financial prices. Those who apply the mentioned methods believe that the market is not
efficient and that the changes of prices do not follow a random walk process.
C. Risk-return analysis
This is a classical financial problem, which traces back to the origin of portfolio theory. The
rationale behind this problem lies in the fact that most individual and institutional investors
try to maximize their return while keeping risk as low as possible. This behaviour of
investors is reflected in the portfolio theory proposed by Harry Markowitz. He considered a
portfolio of stocks. The portfolio problem can be regarded as the task of finding a
combination of individual stocks, called a portfolio, such that the expected return is as high
as possible and the risk is as low as possible. The main results of the classical portfolio
theory are the following (see the sketch below):
- the expected return on a portfolio is the weighted average of the expected returns on the
individual stocks;
- the risk of a portfolio depends on the risk of the individual stocks and on the correlations
of their returns; low, possibly negative, correlations lead to low risk for a constant (or even
decreasing) expected return.
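In symbols, for portfolio weights wi, the expected return is E(Rp) = sum_i wi E(Ri) and the variance is Var(Rp) = sum_i sum_j wi wj cov(Ri, Rj), so low or negative covariances shrink the second sum. A minimal numpy sketch with purely illustrative numbers:

    import numpy as np

    # Hypothetical expected annual returns of three stocks and their
    # covariance matrix (illustrative numbers only).
    mu = np.array([0.08, 0.12, 0.10])
    cov = np.array([[0.040,  0.006,  0.012],
                    [0.006,  0.090, -0.015],
                    [0.012, -0.015,  0.0625]])
    w = np.array([0.4, 0.3, 0.3])      # portfolio weights, summing to one

    exp_return = w @ mu                # weighted average of expected returns
    risk = np.sqrt(w @ cov @ w)        # portfolio standard deviation

    print(f"expected return {exp_return:.4f}, risk {risk:.4f}")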
Risk-return analysis can be treated as the analysis of the location and spread parameters of a
distribution. One possible solution is to use robust estimates of location and spread. Since
building a portfolio involves multivariate distributions, estimates of the multivariate location
vector and the multivariate scatter matrix are to be used.
Risk-return analysis can also be put in the framework of a function (1). There are two
explanatory variables, return and risk, and the dependent variable is unknown and
characterizes the attractiveness of the investment (for example: very attractive - high return
and low risk; medium attractive - high return and high risk, or low return and low risk; not
attractive - low return and high risk). This problem fits situation 3.
D. Beta analysis
The beta coefficient is one of the most important coefficients used by the financial industry.
This coefficient was proposed by William Sharpe (Sharpe (1963)) in the so-called
single-index model. This is a simple regression model which gives a linear relationship
between the return on a stock (or other investment) and the return on the market (measured
usually through the return on a market index). The slope in this regression is the beta
coefficient. It is a measure of the sensitivity of the return on a stock to changes in the return
on the market. It can also be regarded as a measure of so-called systematic risk or market
risk. Stocks with beta higher than 1 are called aggressive stocks and stocks with beta lower
than 1 are called defensive stocks.
From the point of view of function (1), beta analysis fits situation 2. However, it is usually
the case that people from the finance industry do not pay particular attention to the
justifiability of the linearity of the relationship and use ordinary least squares to estimate the
beta coefficient. This raises the issue of applying more advanced methods, for example
nonlinear regression, robust regression or segmented regression models.
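As a concrete illustration of the ordinary-least-squares practice just described, the following minimal sketch (with hypothetical return series) estimates beta as the OLS slope of the single-index model:

    import numpy as np

    # Hypothetical monthly returns of a stock and of a market index.
    rng = np.random.default_rng(1)
    r_market = rng.normal(0.01, 0.05, 60)
    r_stock = 0.002 + 1.3 * r_market + rng.normal(0.0, 0.02, 60)

    # OLS fit of the single-index model: r_stock = alpha + beta * r_market + e.
    beta, alpha = np.polyfit(r_market, r_stock, 1)
    kind = "aggressive" if beta > 1 else "defensive"
    print(f"beta = {beta:.3f} ({kind} stock)")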
E. Factor market models
Researchers who analyze financial markets try to find a theoretical model which explains
what determines the returns on financial instruments. Among several proposed models, a
widely accepted one is the so-called Arbitrage Pricing Theory (APT) model, proposed by
Ross (Ross (1976)). This is a linear model, given by the formula:

R = b1 F1 + b2 F2 + ... + bm Fm     (2)

where:
R - the return on the investment;
Fi - the i-th factor influencing the investment;
bi - the sensitivity coefficient of the return with respect to the i-th factor.
From the point of view of the determination of function (1), this fits situation 2. As a rule,
historical data are used. If the factors influencing the returns on the investments are known
(for example: interest rates, GDP growth rate, etc.), then regression analysis can be used.
It may also happen that one has no idea about the factors. In this case another solution can
be applied, in which factor analysis is used. Here the returns on different stocks are used to
extract the values of the unknown factors. However, it is often very difficult (if at all
possible) to give a useful interpretation of the extracted factors.
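As a rough illustration of this second solution, principal components of the return covariance matrix are often used as a stand-in for the unknown factors; the following sketch (hypothetical returns, one artificial common factor) extracts two components and their daily scores:

    import numpy as np

    # Hypothetical returns: 250 trading days x 10 stocks, driven by one
    # artificial common factor plus idiosyncratic noise.
    rng = np.random.default_rng(2)
    common = rng.normal(0.0, 0.02, 250)
    returns = np.outer(common, rng.uniform(0.5, 1.5, 10))
    returns += rng.normal(0.0, 0.01, (250, 10))

    # Eigendecomposition of the covariance matrix of the returns.
    eigval, eigvec = np.linalg.eigh(np.cov(returns, rowvar=False))
    order = np.argsort(eigval)[::-1]          # largest eigenvalue first
    loadings = eigvec[:, order[:2]]           # sensitivities to two factors
    scores = returns @ loadings               # extracted daily factor values

    print(eigval[order[:2]] / eigval.sum())   # share of variance explained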

Group 2. Corporate finance.

A. Analysis of the financial condition of the firm
This is one of the most basic financial problems, to be solved by investors purchasing stocks,
by the management of the firm making decisions on the firm's policy, and by banks making
loans to the company.
This problem can be regarded as the determination of the function (1), where Y is the
variable characterizing the financial condition of the firm. This can be a categorical variable
whose values correspond to the categories of financial condition ("very good", "good",
"average", "bad", "very bad"). It can also be a continuous variable whose values inform on
the level of the financial condition. The explanatory variables are different variables
characterizing the financial condition. Five groups of variables, called financial ratios, are
usually used:
- liquidity ratios - measure the firm's ability to fulfill its short-term commitments out of
current or liquid assets;
- debt management ratios - measure the firm's ability to pay its debt and the interest on the
debt;
- activity ratios - measure the quality of the management of the firm's assets;
- profitability ratios - measure the firm's ability to generate profit from the available
resources;
- market ratios - measure the attractiveness of the firm from the investor's point of view.
Therefore the analysis of the financial condition of the firm can be regarded as situation 3
or 4, depending on whether the variable characterizing the financial condition of the firm is
categorical or quantitative.
B. Bankruptcy prediction
Since bankruptcy is a potential threat to all firms, bankruptcy prediction becomes one of the
most important financial tasks. It is particularly important to all investors and creditors. This
problem can be regarded as the determination of a function (1), where Y is a categorical
variable, either a binary variable ("will go bankrupt", "will not go bankrupt") or a more
general nominal variable (for example three categories: "will go bankrupt within one year",
"will go bankrupt in the second year", "will not go bankrupt within two years"). The
financial ratios mentioned above are usually used as the explanatory variables.

From the point of view of the use of statistical methods this can be regarded as the
determination of a function (1), and it fits situation 1. The historical data on bankrupt and
non-bankrupt companies can be used to determine this function.
One of the first attempts to use discriminant analysis in bankruptcy prediction was made by
Altman (1968). This is the classical model, called the Altman model. Altman compared the
financial data of 33 manufacturers that went bankrupt with the data of 33 non-bankrupt
firms of similar industry and asset size. From a number of available financial variables he
finally used 5 ratios.
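A minimal sketch of this situation-1 setup (an illustration only, not Altman's estimated discriminant function), assuming scikit-learn and a hypothetical learning set of five financial ratios for 33 bankrupt and 33 non-bankrupt firms:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Hypothetical learning set: five financial ratios for 33 non-bankrupt
    # and 33 bankrupt manufacturers (labels 0 and 1).
    rng = np.random.default_rng(3)
    healthy = rng.normal([0.40, 0.30, 1.5, 0.12, 2.0], 0.1, (33, 5))
    failed = rng.normal([0.10, 0.00, 0.8, -0.05, 1.0], 0.1, (33, 5))
    X = np.vstack([healthy, failed])
    y = np.array([0] * 33 + [1] * 33)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    new_firm = [[0.2, 0.1, 1.0, 0.02, 1.4]]   # ratios of a firm to classify
    print(lda.predict(new_firm), lda.predict_proba(new_firm))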

References:

Altman, E. I. (1968): Financial ratios, discriminant analysis and the prediction of corporate
bankruptcy, Journal of Finance, 23, 589-609.
Bachelier, L. (1900): Theory of speculation, Gauthier-Villars, Paris.
Fama, E. (1965): The behavior of stock-market prices, Journal of Business, 38, 34-105.
Lapedes, A. and Farber, R. (1987): Non-linear signal processing using neural networks:
prediction and system modeling, Los Alamos National Laboratory Report.
Markowitz, H. M. (1952): Portfolio selection, Journal of Finance, 7, 77-91.
Mills, T. C. (1993): The econometric modelling of financial time series, Cambridge
University Press, Cambridge.
Ross, S. A. (1976): The arbitrage theory of capital asset pricing, Journal of Economic
Theory, 13, 341-360.
Sharpe, W. F. (1963): A simplified model for portfolio analysis, Management Science, 9,
277-293.
Taylor, S. (1986): Modelling financial time series, Wiley, New York.
How to validate phylogenetic trees?
A stepwise procedure

François-Joseph Lapointe

Département de sciences biologiques, Université de Montréal,
C.P. 6128, Succursale Centre-ville, Montréal, Québec, H3C 3J7, Canada
e-mail: [email protected]

Summary: In this paper, I review some of the methods and tests currently available to validate
trees, focusing on phylogenetic trees (dendrograms and cladograms). I first present some of the
more commonly used techniques to compare a tree with the data it is derived from (internal
validation), or to compare a tree to one or more other trees (external validation). I also
discuss some of the advantages of performing combined (total evidence) versus separate analyses
(consensus) of independent data sets for validation purposes. A stepwise validation procedure
defined across all levels of comparison is introduced, along with a corresponding statistical test: a
phylogeny will be said to be globally validated only if it satisfies all the tests. An application to the
phylogeny of kangaroos is presented to illustrate the stepwise procedure.

1. Introduction
The construction of a classification is a simple-minded task. First, you need data; second,
you need an algorithm; and then, like magic, you get a classification of your data. Indeed,
the sole purpose of a classification algorithm is to do just that; i.e., return a classification
(e.g., dendrogram, cladogram, pyramid, weak hierarchy, or any other type of
classification). The problem with such an approach is usually that no safeguards are
provided to ensure that the output is meaningful. Indeed, most algorithms (i.e., clustering
algorithms or phylogeny reconstruction algorithms) will return a solution, no matter what
data are fed into them. This implies that a classification can even be derived from pure noise
(i.e., randomly generated data). This is why validation becomes necessary.

In this paper, I will review some of the safeguards currently available to validate
classifications represented in the form of trees; those trees can either be obtained by
different algorithms or derived from independent data sets using the same algorithm. It is
not my goal to present an exhaustive review of all validation techniques for all types of
trees. I will focus my review on weighted trees such as those used in some phylogenetic
studies. Furthermore, given the number of statistical papers published on the subject, I will
emphasize validation methods based on permutation and/or resampling procedures. I will
first show (1) how one can assess whether any phylogenetic structure was present in the
data to begin with (internal validation). Then, (2) I will introduce some of the methods
designed to compare trees obtained from independent data sets (external validation). I will
also present (3) the rationale for combining those independent data sets before proceeding
with phylogenetic reconstruction (total evidence). The combined approach will then be
contrasted with (4) a consensus approach in which the trees are analyzed separately and
then combined. A stepwise procedure will finally be introduced to validate phylogenetic
trees using both internal and external validation methods as well as separate and combined
approaches. This stepwise procedure will be used to validate the phylogeny of kangaroos.


2. What is a phylogeny?
Biologically speaking, a phylogeny is a tree-like representation of evolutionary
relationships among n different taxa (e.g., species, genera or other taxonomic units).
Phylogenetic trees are usually derived from a character-state matrix representing
morphological or molecular data (n species by p characters), or from a square n x n
distance matrix; several algorithms and computer packages are currently available to do so
(see Penny et al., 1992; Swofford et al., 1996a). In mathematical terms, a phylogeny can be
defined as a connected graph without cycles. Such phylogenies are usually represented as
rooted trees with labeled leaves. They can be depicted in the form of weighted trees if the
branches of the phylogeny have lengths that represent the amount of evolutionary
divergence between the nodes of the tree. Therefore, the sum of the lengths along the path
of branches between any pair of taxa can be recorded in a path-length matrix (similarly, the
number of branches on a path can be recorded in a branch-distance matrix). When the rates
of change in the various branches of the phylogeny are identical, every terminal node will
be equidistant from the root, and the tree can be associated with a path-length matrix that
satisfies the ultrametric inequality (Hartigan, 1967):

d(i, j) ≤ max[d(i, k), d(k, j)], for every triplet of taxa i, j, k.

Such trees are usually defined as dendrograms. On the other hand, when rates vary among
lineages, the path-length matrix is not ultrametric but remains additive (i.e., ultrametric
trees represent a special case of additive trees with constant evolutionary rates); additive
distances satisfy the four-point condition (Buneman, 1971) and apply to cladograms:

d(i, j) + d(k, l) ≤ max[d(i, k) + d(j, l), d(i, l) + d(j, k)], for any quartet of taxa i, j, k, l.

The path-length matrices (ultrametric or not) are in one-to-one correspondence with a set
of weighted phylogenetic trees (Jardine et al., 1967; Buneman, 1974); this is also true for
branch-distance matrices (Zaretskii, 1965). Therefore, it is equivalent for validation
purposes to compare phylogenetic trees or their associated path-length (or branch-
distance) matrices.
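Both conditions can be checked directly on a path-length matrix. A minimal sketch, using a small hypothetical matrix that is additive but not ultrametric (the four-point condition is tested in its equivalent "two largest sums are equal" form):

    import itertools
    import numpy as np

    def is_ultrametric(d):
        """d(i, j) <= max(d(i, k), d(k, j)) for every triplet i, j, k."""
        n = len(d)
        return all(d[i][j] <= max(d[i][k], d[k][j]) + 1e-9
                   for i, j, k in itertools.permutations(range(n), 3))

    def is_additive(d):
        """Four-point condition: for any quartet i, j, k, l, the two
        largest of the three pairwise sums are equal."""
        n = len(d)
        for i, j, k, l in itertools.combinations(range(n), 4):
            s = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
            if abs(s[1] - s[2]) > 1e-9:
                return False
        return True

    # Hypothetical path-length matrix for four taxa.
    d = np.array([[0, 3, 7, 8],
                  [3, 0, 6, 7],
                  [7, 6, 0, 5],
                  [8, 7, 5, 0]], dtype=float)
    print(is_ultrametric(d), is_additive(d))   # False True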

3. How to test phylogenies?

When phylogenetic trees (or their matrix representations) are to be validated, standard
statistical procedures can rarely be applied (but see Li and Gouy, 1991; Li and Zharkikh,
1995). Indeed, the values in path-length (or branch-distance) matrices are not independent
from one another, and thus violate the most crucial assumption of any parametric test.
Furthermore, the non-independence of the observations implies that the degrees of
freedom of the tests are always overestimated. If one also notes that distances seldom meet
the normality assumption, and that their parameter distributions are rarely known, we have
enough reason to call for permutation methods. For a given test, the probability of the null
hypothesis is therefore assessed from a distribution of test statistics obtained under a
permutation/randomization/resampling model. The general testing procedure follows
Edgington (1995):

1. Compute a reference test statistic (i.e., REFSTAT) relevant to the question asked.
2. Permute (or resample) the data (i.e., distance, character-state, or path-length matrices).
3. Recompute the test statistic for the randomized data (i.e., RANDSTAT).
4. Repeat steps 2 and 3 a large number of times (e.g., NPERM = 1000).
5. Compute the p-value as follows (see Dwass, 1957):

p = [nb(RANDSTAT ≥ REFSTAT) + 1] / (NPERM + 1)

It is worth mentioning at this point that the statistical outcome of a permutation test is
likely to be affected by different aspects of the procedure, including (i) the maximum
possible number of random realizations of the null hypothesis, (ii) the actual number of
permutations performed, (iii) the permutation model, and (iv) the test statistic selected to
compute the test.
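A minimal generic sketch of steps 1-5, in which the test statistic and the permutation scheme are user-supplied arguments (the toy sign-flipping example at the end is purely illustrative):

    import numpy as np

    def permutation_test(data, statistic, permute, nperm=1000, seed=0):
        """Generic one-sided permutation test following Edgington (1995):
        p = (nb(RANDSTAT >= REFSTAT) + 1) / (NPERM + 1)."""
        rng = np.random.default_rng(seed)
        refstat = statistic(data)                            # step 1
        randstats = np.array([statistic(permute(data, rng))  # steps 2-4
                              for _ in range(nperm)])
        return (np.sum(randstats >= refstat) + 1) / (nperm + 1)  # step 5

    # Toy usage: is the mean of a sample large under a sign-flipping model?
    sample = np.array([0.8, 1.1, 0.4, 0.9, 1.3, 0.7])
    p = permutation_test(sample,
                         statistic=np.mean,
                         permute=lambda d, rng: d * rng.choice([-1, 1], size=len(d)))
    print(p)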

3.1 Random tree models

Depending on the type of trees compared (rooted or unrooted, labeled or unlabeled,
weighted or unweighted, ultrametric or nonultrametric), different random-tree generation
algorithms are distinguishable (Proskurowski, 1980; Furnas, 1984; Oden and Shao, 1984;
Quiroz, 1989; Lapointe and Legendre, 1991). The sampling distribution of the trees is also
important in the computations. The number of rooted trees, for example, is larger than the
number of unrooted ones for a given number of taxa (Phipps, 1975; Felsenstein, 1978), but
smaller than the number of dendrograms (Murtagh, 1984); i.e., distinct populations of
trees are considered. Still, random rooted trees can be sampled equiprobably or not; i.e.,
different distributions of trees with respect to the same population are considered
(Lapointe and Legendre, 1995).

In the case of phylogenies, three distribution models are usually defined (Simberloff et al.,
1981; Savage, 1983; Lapointe and Legendre, 1995). The first and simplest model is to
generate every topology equiprobably. In the second model, each tree is equally likely; this
is the "proportional-to-distinguishable-type" model of Simberloff et al. (1981). The third
model implies that every branching point is equally likely when growing a tree (Harding,
1971); this is the Markovian branching model of Simberloff et al. (1981). It is interesting to
note that dendrograms can be generated equiprobably under this Markovian model
(Lapointe and Legendre, 1991; Page, 1991).

3.2 Random data models

When data matrices are used for validation purposes instead of trees, the models available
differ with respect to the type of data considered. For example, character-state matrices
are not randomized as distance matrices would be. In the first case, the general model is
based on a permutation of the observed states within each character (Archie, 1989a;
1989c; Faith, 1991; Faith and Cranston, 1991; Källersjö et al., 1992). With such models,
the phylogenetic structure of the data is destroyed by permutations (or random data
generation, Klassen et al., 1991), and each state is equally likely to be assigned to any
taxon. This approach has been much debated (Bryant, 1992; Carpenter, 1992; Faith, 1992;
Källersjö et al., 1992; Alroy, 1994; Faith and Ballard, 1994). It nevertheless remains the
model of choice in phylogenetic studies (but see Goloboff, 1991a; 1991b).

In the case of distance matrices, two types of models have been proposed by Sneath
(1967). One option is to compute distance matrices from random points distributed in a
multidimensional space (see Gordon, 1987); the uniform model based on a Poisson
distribution and the unimodal model have been applied to generate such random
distributions of points (Bock, 1985). Another option is to randomize the distances directly,
as if all the observations in the matrix were independent from one another; this is the
random graph model of Ling (1973). In some testing procedures, the values in the matrix
are held constant but the rows and columns (the labels) are permuted (Mantel, 1967). It is
also possible to generate random distance matrices from permuted character-state matrices
using the random-data model described above.

3.3 Test statistics

Another important aspect of validation methods is the test statistic selected to compare the
actual tree to random realizations of the null hypothesis. Here again, data comparisons will
differ from tree comparisons, and distances will require different statistics than character-
state data. Common indices computed from a tree are its length, the consistency index
(Kluge and Farris, 1969), the retention index (Farris, 1989a), and the homoplasy excess
ratio (Archie, 1989b), among others (see also Farris, 1989b; Archie, 1990; Farris, 1991;
Goloboff, 1991a; Hillis, 1991; Meier et al., 1991; Bremer, 1995). All of these statistics are
used to measure how well the phylogeny fits the data; some are even used as optimality
criteria for phylogenetic reconstruction (see Swofford et al., 1996a). For distance data,
one usually calls for metric indices, like the matrix correlation (Rohlf, 1982), or any other
measure of the fit between original and path-length distances (Rohlf, 1974; Gower, 1982;
Gordon, 1987).
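For instance, the matrix correlation can be computed as the Pearson correlation over the off-diagonal entries of the two matrices; a minimal sketch with hypothetical distance matrices:

    import numpy as np

    def matrix_correlation(d1, d2):
        """Pearson correlation between the upper-triangular (off-diagonal)
        entries of two symmetric distance matrices."""
        iu = np.triu_indices_from(d1, k=1)
        return np.corrcoef(d1[iu], d2[iu])[0, 1]

    # Hypothetical original distances and path-length distances on a fitted tree.
    original = np.array([[0, 2, 9, 9], [2, 0, 8, 9],
                         [9, 8, 0, 3], [9, 9, 3, 0]], dtype=float)
    fitted = np.array([[0, 3, 7, 8], [3, 0, 6, 7],
                       [7, 6, 0, 5], [8, 7, 5, 0]], dtype=float)
    print(matrix_correlation(original, fitted))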

When trees are considered, it is always necessary to distinguish topological indices from
tree metric indices; the former are designed for unweighted trees (i.e., ignoring branch
lengths) whereas the latter are for weighted-tree comparisons (i.e., dendrograms or
cladograms). Test statistics (i.e., consensus indices sensu Day and McMorris, 1985)
available for topological comparisons include the partition metric (Bourque, 1978;
Robinson and Foulds, 1981), the neighborhood interchange metric (Robinson, 1971;
Waterman and Smith, 1978), the quartet metric (Estabrook et al., 1985; Day, 1986;
Estabrook, 1992), and the triples distance metric (Critchlow et al., 1996), among many
others (Bobisud and Bobisud, 1972; Margush, 1982; Hendy et al., 1984; Penny and Hendy,
1985b; Steel, 1988; Steel and Penny, 1993). When path-length matrices need to be
compared, modified versions of the topological indices can be used (Robinson and Foulds,
1979), in addition to specific indices designed for dendrograms (Sokal and Rohlf, 1962;
Day, 1983; Fowlkes and Mallows, 1983; Faith and Belbin, 1986; Lapointe and Legendre,
1990), or for any weighted trees (Williams and Clifford, 1971; Lapointe and Legendre,
1992a; Steel and Penny, 1993).

3.4 A note on resampling methods

Resampling methods (Efron, 1979; Efron and Gong, 1983; Efron and Tibshirani, 1993) are
in some ways related to permutation tests. Indeed, resampling is used to assess the stability
of some parts of the tree, or of the tree as a whole, by comparing actual phylogenies to trees
derived from resampled data. However, such methods are usually not designed as
statistical tests; i.e., p-values are rarely provided and cannot always be interpreted as such
(Felsenstein and Kishino, 1993). Here the values in a data matrix are not permuted but
sampled with or without replacement. The bootstrap (Felsenstein, 1985; Sanderson, 1989;
1995) and the jackknife (Davis, 1993; Farris et al., 1995a; 1995b) are among the most
popular resampling techniques in phylogenetic studies; both methods have been used to
validate phylogenies (see Hillis, 1995; Swofford et al., 1996a).

Bootstrapping proceeds by resampling the characters of a data matrix with replacement; a
new phylogeny is derived from each bootstrap replicate, and the frequencies of the
different clades are computed over the set of bootstrapped trees. The results are presented
in the form of a consensus tree bearing the clades found in the majority of the bootstrapped
trees. Character jackknifing is very similar to bootstrapping: sets of characters are sampled
at random from the original matrix without replacement (i.e., a fixed number of characters
is deleted in turn from the matrix), and the frequencies of the clades occurring in the
jackknife replicates are used to provide support for parts of the phylogeny. Other
resampling techniques (Penny and Hendy, 1985b) and generalizations of the bootstrap
(Hall and Martin, 1988; Zharkikh and Li, 1995) and the jackknife (Lanyon, 1985; Lapointe
et al., 1994) have been applied to phylogenetic studies.
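A minimal sketch of the character-resampling step common to both techniques, assuming a hypothetical n-taxa x p-characters matrix; build_tree is a placeholder for any reconstruction algorithm:

    import numpy as np

    def bootstrap_matrices(char_matrix, nboot, seed=0):
        """Resample the characters (columns) of an n-taxa x p-characters
        matrix with replacement, yielding one replicate per iteration."""
        rng = np.random.default_rng(seed)
        n_chars = char_matrix.shape[1]
        for _ in range(nboot):
            cols = rng.integers(0, n_chars, size=n_chars)
            yield char_matrix[:, cols]

    # Usage sketch: derive a tree from each replicate (build_tree is a
    # hypothetical placeholder), then tabulate clade frequencies.
    # trees = [build_tree(rep) for rep in bootstrap_matrices(data, nboot=100)]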

4. Internal versus external validation

Phylogenetic reconstruction represents a very difficult computational problem (Graham
and Foulds, 1982; Day, 1983c; 1987). When the size of the matrix becomes too large, no
algorithm is guaranteed to find the optimal solution. The different methods are thus
compared and assessed with respect to several criteria: efficiency, consistency, power,
robustness, and falsifiability (see Penny et al., 1992; Hillis, 1995). Whereas numerous
simulation studies (Huelsenbeck, 1995) have shown the relative accuracy of the competing
algorithms, it remains true that no single method can always converge to the correct tree.
It then becomes necessary to test the trees derived by a particular method against null
phylogenetic models: this is the purpose of validation techniques. Dubes and Jain (1979)
distinguish between internal and external validation methods. The first type of validation is
required to assess the reliability of a tree with respect to the data it is derived from, in
contrast to unstructured data (Milligan, 1981). External validation, however, refers to the
comparison of two or more trees derived from independent data sets. In this case, either a
reference tree becomes the comparison criterion or all trees are compared to one another,
sequentially or simultaneously.

4.1 Internal validation

To perform internal validation, one needs to compare a tree with the data it is derived from.
When it comes to a specific phylogeny, validation proceeds by comparing the actual tree
with others derived from randomly generated or permuted data (Archie, 1989a; Faith and
Cranston, 1991; Källersjö et al., 1992; Alroy, 1994). Using one of the random data models
described above, a distribution of the test statistic can then be computed, or tables of
critical values can be generated (e.g., Klassen et al., 1991). For example, the validity of a
tree as a whole can be assessed by comparing its length to a distribution of the lengths of
trees derived from random data (Le Quesne, 1989; Carter et al., 1990; Faith and Cranston,
1991; Steel et al., 1992; Archie and Felsenstein, 1993). When no phylogenetic structure
was present in the data to begin with, many random data sets (e.g., more than 5%) will lead
to trees as short as or shorter than the real phylogeny. In that situation, the original tree will
not be validated. Other methods based on different statistics (e.g., Alroy, 1994) test the
same null hypothesis, stating that a phylogeny derived from actual data is no better than
what would be expected from random data. The same approach can be used to test the
stability of parts of the tree (e.g., monophyletic groups) under various permutation and
resampling models (Faith, 1991; Faith and Trueman, 1996; Swofford et al., 1996b).

In resampling methods (e.g., the bootstrap and the jackknife), the effect of character
and/or taxonomic sampling on phylogenetic reconstruction is assessed. Mueller and Ayala
(1982) were among the first to test the validity of their trees with a resampling procedure.
Since Felsenstein (1985), the bootstrap has been the most popular validation technique in
phylogenetic studies (Hedges, 1992; Hillis and Bull, 1993; Rodrigo, 1993a; Dopazo, 1994;
Harshman, 1994; Zharkikh and Li, 1992a; 1992b; Li and Zharkikh, 1994; Berry and
Gascuel, 1996; Efron et al., 1996), with extensions for distance data (Krajewski and
Dickerman, 1990; Marshall, 1991). The method remains controversial (Sanderson, 1989;
1995), and has been greatly modified by some (Zharkikh and Li, 1995). The original
nonparametric bootstrap (Felsenstein, 1985) consists in resampling the characters of a data
(or distance) matrix with replacement to assess the stability of a tree (the parametric
bootstrap was introduced by Huelsenbeck et al., 1995). Jackknifing can proceed in a
similar fashion, except that characters are resampled without replacement (Davis, 1993).
This rationale also applies to taxonomic sampling (Lecointre et al., 1993). Lanyon (1985)
and Lapointe et al. (1994) have shown that deleting taxa from the analysis can be used to
evaluate the stability of phylogenetic trees. The consensus of the jackknife trees is then
used to evaluate the support of the tree as a whole or of part of it (for an application, see
Bleiweiss et al., 1994). When the trees tested with resampling models are not validated
(e.g., a partially-resolved tree is obtained), the original phylogenies should be treated with
caution; additional data must be gathered to improve the results.

4.2 External validation

Given two (or more) internally validated trees, the next task is to verify whether these trees
tell the same story or not; that is, whether they are congruent (Prager and Wilson, 1976;
Mickevich, 1978; Colless, 1980; Sokal and Rohlf, 1981; Penny et al., 1982; Swofford,
1991; Bledsoe and Raikow, 1992; Patterson et al., 1993). External validation proceeds by
comparing phylogenies to one another (or to a reference phylogeny) to assess whether the
observed measure of congruence could be expected by chance alone. As for internal
validation, permutation and resampling methods can be used for this test. In the case of the
random-data model, the data from which the original trees were derived are randomized
(Rodrigo et al., 1993; Farris et al., 1995b) to obtain new phylogenies which can be
compared in turn to build a distribution of the test statistic. For the random-tree model, the
actual pair of trees is compared to pairs of random trees (Hubert and Baker, 1977; Podani
and Dickinson, 1984; Simberloff, 1987; Nemec and Brinkhurst, 1988; Page, 1988;
Lapointe and Legendre, 1990; 1992a; Brown, 1994). Depending on the trees compared,
the topology of the phylogeny, its branch lengths, and the taxon positions can be
randomized (Lapointe and Legendre, 1995). No matter what method is used, the null
hypothesis states that the phylogenies compared are not more similar than randomly
generated trees would be (Lapointe and Legendre, 1990); a pair of trees is declared
congruent when more similar than the majority (e.g., 95%) of the pairs of random trees. To
avoid generating the null distribution for every test, tables of critical values for various
consensus indices and different models have been produced (Day, 1983b; Shao and
Rohlf, 1983; Shao and Sokal, 1986; Steel, 1988; Lapointe and Legendre, 1992b; Steel
and Penny, 1993).

Even though phylogenies should always be validated, it is worth mentioning that data sets
can be externally validated as well. The approach is similar to the random-data models
used to compare trees; the character-state or distance matrices are randomized (or
resampled) to assess their congruence. The Mantel test (1967) has been widely used to
compare distance matrices. For comparing character-state data, canonical correlations can
be applied with a testing procedure based on permutations (see Lapointe and Legendre,
1994). In any case, trees or data matrices that are not validated should be treated with
caution. One should never rely on ad hoc criteria to decide which of the phylogenies is the
best. It might be better to combine the data or trees to analyze them jointly.

5. Combined versus separate analysis

It is not unusual in phylogenetic studies (Bledsoe and Raikow, 1990; Patterson et al.,
1993) to obtain different trees from different data sets (e.g., morphological versus molecular
studies). In most cases, however, this problem can be solved by validation methods. Indeed,
the discrepancies between two phylogenies might disappear after internal validation (i.e.,
the trees might collapse to the same unresolved phylogeny). Likewise, two seemingly
different trees could well turn out to be significantly more congruent than would be
expected by chance alone. However, when statistically different but nonetheless valid trees
are dealt with, one faces a difficult choice. How to reconcile the trees to reach a unique
solution? Two options are available. The first is to combine the data sets in order to derive
a phylogeny based on total evidence (Kluge, 1989). The other is to analyze the data sets
separately, and then to combine the trees using a consensus method (Miyamoto and Fitch,
1995). Resolving this dilemma is currently one of the hottest debates in phylogenetics
(de Queiroz et al., 1995; Huelsenbeck et al., 1996): to combine or not to combine?

5.1 Total evidence

How should one choose among different phylogenies based on independent data sets?
Which one is closer to the true phylogeny? Given that different parts of the genome evolve
at different rates, it is very unlikely that one would obtain identical phylogenies for slow-
evolving versus fast-evolving genes (Russo et al., 1996), or even for morphological versus
molecular data (Hillis, 1987). The solution, according to Kluge (1989), is to include all
available data (i.e., character-state matrices) in one analysis (for a combination of distance
matrices, see Lapointe and Kirsch, 1995). The rationale is that a tree based on total
evidence rather than partial information will usually be more accurate as more data are
added (see also Barrett et al., 1991; Eernisse and Kluge, 1993). This approach has been
criticized by several authors (Huelsenbeck et al., 1996), and alternative views have been
proposed (Williams, 1994; Bandelt, 1995; Miyamoto and Fitch, 1995; Nixon and
Carpenter, 1996), one of which is to combine the data conditionally (Bull et al., 1993). The
question then becomes one of when to combine data or not.

The debate on conditional data-combination (Chippindale and Wiens, 1994; Huelsenbeck
et al., 1994; Wiens and Chippindale, 1994) is somewhat related to external validation.
According to Bull et al. (1993), one has to make sure that the data sets are not
heterogeneous before combining them. If it is shown, using randomization (Huelsenbeck
and Bull, 1996) or resampling techniques (de Queiroz, 1993; Rodrigo et al., 1993; Farris et
al., 1995a; 1995b), that the data are indeed more homogeneous than by chance alone, a
combined analysis can then be performed; otherwise, the data matrices should be treated
separately. In other words, one evaluates the congruence among the independent
phylogenies to assess whether the corresponding data should be combined or not. The
combined approach has been used in several applications since its publication (Miyamoto
et al., 1994; Olmstead and Sweere, 1994; Omland, 1994; Mason-Gamer and Kellogg,
1996; Poe, 1996; Sullivan, 1996), and generalized to allow the combination of overlapping
sets of taxa (Wiens and Reeder, 1995; see also Lapointe and Kirsch, 1995). It is important
to note, however, that a total-evidence tree can never be externally validated, since all data
are combined. It can only be compared with a previous total-evidence tree (i.e., one
constructed before some new data set was generated). Nevertheless, a total-evidence tree
must always be assessed with internal validation methods.

5.2 Consensus

Whether one decides to combine or not to combine data sets for statistical, practical, or
philosophical reasons (Barrett et al., 1991; 1993; de Queiroz, 1993; Nelson, 1993), the
problem remains the same: how to synthesize a profile of incongruent phylogenies?
Whereas data are combined in a total-evidence approach, trees are combined with a
consensus approach (Miyamoto, 1985; Anderberg and Tehler, 1990). A consensus tree
method (as opposed to consensus indices, Day and McMorris, 1985) takes as input a
profile of trees and returns a single solution that is in some sense representative of the entire
set (Leclerc and Cucumel, 1987). Several approaches, including the strict (Sokal and Rohlf,
1981), semi-strict (Bremer, 1990), median (Barthélemy and McMorris, 1986), and
majority-rule consensus (Margush and McMorris, 1981) methods, have been developed to
combine unweighted (see also Adams, 1972; Nelson, 1979; Stinebrickner, 1982;
McMorris and Neumann, 1983; McMorris et al., 1983; Neumann, 1983; McMorris, 1985;
Phillips and Warnow, 1996) or weighted trees (Stinebrickner, 1984; Lefkovitch, 1985;
Lapointe and Cucumel, 1997). Other methods are designed for the construction of
consensus supertrees from phylogenies bearing overlapping sets of taxa (Gordon, 1986;
Baum, 1992; Ragan, 1992; Steel, 1992; Baum and Ragan, 1993; Lanyon, 1993; Rodrigo,
1993b; Purvis, 1995a; Ronquist, 1996; Lapointe and Cucumel, 1997), or for the computation
of common pruned trees (Finden and Gordon, 1985) and reduced consensus trees
(Wilkinson, 1994; 1996).

The problem with consensus trees is that they are seldom validated. Assessing the
significance of a consensus phylogeny remains problematic. As for phylogeny-
reconstruction algorithms, a given consensus method will always return a solution. One
then has to evaluate whether the consensus representation is pertinent or not; i.e., is it
more structured than what would be expected from chance alone (Cucumel and Lapointe,
1997)? Consensus validation is somewhat related to the congruence tests used for external
validation. The problem with consensus trees is that more than two phylogenies are usually
considered at once; the tables of significance of consensus indices do not account for more
than two trees at a time (e.g., Shao and Rohlf, 1983; Shao and Sokal, 1986; Lapointe and
Legendre, 1992b). Furthermore, consensus trees are sometimes the synthesis of trees
bearing nonidentical sets of taxa (Purvis, 1995b; Kirsch et al., 1997), which makes them
even more difficult to test. Cucumel and Lapointe (1997) test the consensus by comparing
it to the trivial classification (i.e., a bush, or a star tree); a distribution of consensus trees
computed from randomly generated phylogenies (Lapointe and Legendre, 1990; 1992a) is
used to assess the significance of the null hypothesis. Another approach would be to check
whether the consensus falls within a confidence set (Sanderson, 1989) of the trees in the
input profile.

6. Stepwise validation procedure

The debate on character congruence versus taxonomic congruence (sensu Kluge, 1989)
has led to three different positions among systematists: (1) always combine data, (2) never
combine data, or (3) combine conditionally. In other words, this means: (1) never use a
consensus, (2) always use a consensus, or (3) sometimes use a consensus tree to represent
the congruence among different studies. I am here proposing to always do both (i.e., total
evidence AND consensus). The rationale is that a phylogeny is more likely to be accurate
when the different approaches converge to the same solution. This is related to what Kim
(1993) has shown by combining different algorithms to improve the accuracy of
phylogenetic estimations. In the present case, combined and separate analyses are
performed and the resulting phylogenies are assessed using a stepwise procedure (Fig. 1).

1. Initially, each and every tree produced has to be checked for internal validity (4.1).
2. Trees that satisfy the first test need to be compared to assess their congruence (4.2).
3. The congruent data sets must be combined to derive a total-evidence tree (5.1).
   That tree has to be validated.
4. The independent phylogenies must also be combined to obtain a consensus tree.
   That consensus has to be validated.
5. Finally, the trees obtained at steps 3 and 4 of the validation procedure must be compared.

[Fig. 1 appears here.]

Fig. 1. Flowchart of the stepwise procedure. Data 1 and Data 2 represent two independent sets of data.
Tree 1 and Tree 2 are the phylogenies derived from the corresponding data sets using any phylogenetic
reconstruction algorithm. Data 3 is obtained by combining Data 1 and Data 2; Tree 3 is the
corresponding phylogeny. Tree 4 is the consensus of Tree 1 and Tree 2. Numbers refer to the different
validation steps: (1) internal validation, (2) external validation, (3) total evidence, and (4) consensus.

The comparison of a total-evidence phylogeny with a consensus tree is not as simple as it
seems. One cannot rely on standard validation methods in this case. The problem is that
those trees are not independent from one another. The testing procedure is as follows: (i)
randomize (i.e., permute, resample, or simulate) the initial data sets independently, (ii)
analyze these random data sets separately, and (iii) compute a consensus of the
corresponding trees. At the same time, (iv) combine the random data sets and (v) derive a
total-evidence phylogeny. The total-evidence and consensus trees are then (vi) compared,
and the whole procedure is repeated a large number of times to build a distribution of the
test statistic measuring the congruence between the two trees. If the actual trees under
comparison are more similar than the majority of trees derived from random data, they are
declared congruent. In that situation, the phylogenetic tree is said to be globally validated.

7. Application
To illustrate how the stepwise validation procedure works with real data, I have applied
the methods to validate the kangaroo phylogeny of Kirsch et al. (1995), based on DNA-
hybridization data and depicting phylogenetic relationships among 12 species. It is
compared to the Baverstock et al. (1989) phylogeny of 14 kangaroo species, based on
immunological data. For the purpose of the demonstration, I have reduced the original
data sets to consider only the nine species common to both studies (Fig. 2).

[Fig. 2 appears here.]

Fig. 2. Illustration of the stepwise validation procedure. Data 1 is the Kirsch et al. (1995) DNA-hybridization
data. Data 2 is the Baverstock et al. (1989) immunological data. Data 1+2 is the average of the standardized
data sets. The corresponding phylogenies were reconstructed with the FITCH algorithm (Felsenstein,
1993). The consensus was derived using the average procedure (Lapointe et al., 1994; Lapointe and
Cucumel, 1997). All trees are rooted by Dorcopsulus.

The first step of any validation study is always to check for internal validity. In the present
case, each tree had already been validated by bootstrapping and/or jackknifing in the
original studies. I thus proceeded directly with external validation. Metric and topological
indices were selected to compare the phylogenies depicted in the form of additive trees;
the matrix correlation computed from path-length distances is 0.354, compared to 0.645
for branch distances. The latter is more extreme than would be expected from pairs of
random additive trees of the same size (Lapointe and Legendre, 1992b). That is, the two
kangaroo phylogenies are topologically congruent (i.e., the data are not heterogeneous).
The next step was, therefore, to combine the standardized data matrices from the different
studies. I did so by a simple average of the immunological and DNA-hybridization
distances among the nine species (Lapointe and Kirsch, 1995). A phylogeny was derived
from that total-evidence matrix (Fig. 2), and internal validation was performed with
taxonomic jackknifing (Lapointe et al., 1994). Finally, the average consensus (Lapointe
and Cucumel, 1997) of the original phylogenies was computed to account for branch
lengths (Fig. 2). The consensus was compared to a distribution of consensus trees derived
from pairs of random trees to assess its pertinence (Cucumel and Lapointe, 1997). As both
the total-evidence and consensus trees were validated, the last and crucial step consisted in
the comparison of those phylogenies.
The correlation between the path-length matrices is 0.996 in this case, whereas the
topological correlation value is 0.927. Using the significance test described above, it was
shown that this particular pair of trees is more similar than what would be expected from
most consensus and total-evidence trees based on random data. The kangaroo phylogeny
is thus said to be globally validated.

8. References
Adams, E. N., III (1972): Consensus techniques and the comparison of taxonomic trees, Systematic
Zoology, 21, 390-397.
Alroy, J. (1994): Four permutation tests for the presence of phylogenetic structure, Systematic Biology, 43,
430-437.
Anderberg, A. and Tehler, A. (1990): Consensus trees, a necessity in taxonomic practice, Cladistics, 6,
399-402.
Archie, J. W. (1989a): A randomization test for phylogenetic information in systematic data, Systematic
Zoology, 38, 219-252.
Archie, J. W. (1989b): Homoplasy excess ratios: New indices for measuring levels of homoplasy in
phylogenetic systematics and a critique of the consistency index, Systematic Zoology, 38, 253-269.
Archie, J. W. (1989c): Phylogenies of plant families: A demonstration of phylogenetic randomness in
DNA sequence data derived from proteins, Evolution, 43, 1796-1800.
Archie, J. W. (1990): Homoplasy excess statistics and retention indices: A reply to Farris, Systematic
Zoology, 39, 169-174.
Archie, J. W. and Felsenstein, J. (1993): The number of evolutionary steps on random and minimum
length trees for random evolutionary data, Theoretical Population Biology, 43, 52-79.
Bandelt, H. J. (1995): Combination of data in phylogenetic analysis, Plant Systematics and Evolution,
Supplementum 9, 355-361.
Barrett, M. et al. (1991): Against consensus, Systematic Zoology, 40, 486-493.
Barrett, M. et al. (1993): Crusade? A response to Nelson, Systematic Biology, 42, 216-217.
Barthélemy, J.-P. and McMorris, F. R. (1986): The median procedure for n-trees, Journal of Classification,
3, 329-334.
Baum, B. R. (1992): Combining trees as a way of combining data for phylogenetic inference, and the
desirability of combining gene trees, Taxon, 41, 3-10.
Baum, B. R. and Ragan, M. A. (1993): Reply to A. G. Rodrigo's "A comment on Baum's method for
combining phylogenetic trees", Taxon, 42, 637-640.
Baverstock, P. R. et al. (1989): Albumin immunologic relationships of the Macropodidae (Marsupialia),
Systematic Zoology, 38, 38-50.
Berry, V. and Gascuel, O. (1996): On the interpretation of bootstrap trees: Appropriate threshold of clade
selection and induced gain, Molecular Biology and Evolution, 13, 999-1011.
Bledsoe, A. H. and Raikow, R. J. (1990): A quantitative assessment of congruence between molecular and
nonmolecular estimates of phylogeny, Journal of Molecular Evolution, 30, 247-259.
Bleiweiss, R. et al. (1994): DNA-DNA hybridization-based phylogeny of "higher" nonpasserines:
Reevaluating a key portion of the avian family tree, Molecular Phylogenetics and Evolution, 3, 248-255.
Bobisud, H. M. and Bobisud, L. E. (1972): A metric for classifications, Taxon, 21, 607-613.
Bock, H. H. (1985): On some significance tests in cluster analysis, Journal of Classification, 2, 77-108.
Bourque, M. (1978): Arbres de Steiner et réseaux dont varie l'emplacement de certains sommets, Ph.D.
Thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal.
Bremer, K. (1990): Combinable component consensus, Cladistics, 6, 369-372.
Bremer, K. (1995): Branch support and tree stability, Cladistics, 10, 295-304.
Brown, J. K. M. (1994): Probabilities of evolutionary trees, Systematic Biology, 43, 78-91.
Bryant, H. N. (1992): The role of permutation tail probability tests in phylogenetic systematics, Systematic
Biology, 41, 258-263.
Bull, J. J. et al. (1993): Partitioning and combining data in phylogenetic analysis, Systematic Biology, 42,
384-397.
Buneman, P. (1971): The recovery of trees from measures of dissimilarity. In: Mathematics in the
Archaeological and Historical Sciences, Hodson, F. R. et al. (eds.), 387-395, Edinburgh University Press,
Edinburgh.
Buneman, P. (1974): A note on the metric properties of trees, Journal of Combinatorial Theory (B), 17,
48-50.
Carpenter, J. M. (1992): Random cladistics, Cladistics, 8, 147-153.
Carter, M. et al. (1990): On the distribution of lengths of evolutionary trees, SIAM Journal on Discrete
Mathematics, 3, 38-47.
Chippindale, P. T. and Wiens, J. J. (1994): Weighting, partitioning, and combining characters in
phylogenetic analysis, Systematic Biology, 43, 278-287.
Colless, D. H. (1980): Congruence between morphometric and allozyme data for Menidia species: A
reappraisal, Systematic Zoology, 29, 288-299.
Critchlow, D. E. et al. (1996): The triples distance for rooted bifurcating phylogenetic trees, Systematic
Biology, 45, 323-334.
Cucumel, G. and Lapointe, F.-J. (1997): Un test de la pertinence du consensus par une méthode de
permutations. In: Actes des XXIXe Journées de Statistique, 299-300, Carcassonne.
Davis, J. I. (1993): Character removal as a means for assessing stability of clades, Cladistics, 9, 201-210.
Day, W. H. E. (1983a): The role of complexity in comparing classifications, Mathematical Biosciences, 66,
97-114.
Day, W. H. E. (1983b): Distributions of distances between pairs of classifications. In: Numerical
Taxonomy, Felsenstein, J. (ed.), 127-131, Springer-Verlag, Berlin.
Day, W. H. E. (1983c): Computationally difficult parsimony problems in phylogenetic systematics,
Journal of Theoretical Biology, 103, 429-438.
Day, W. H. E. (1986): Analysis of quartet dissimilarity measures between undirected phylogenetic trees,
Systematic Zoology, 35, 325-333.
Day, W. H. E. (1987): Computational complexity of inferring phylogenies from dissimilarity matrices,
Bulletin of Mathematical Biology, 49, 461-467.
Day, W. H. E. and McMorris, F. R. (1985): A formalization of consensus index methods, Bulletin of
Mathematical Biology, 47, 215-229.
de Queiroz, A. (1993): For consensus (sometimes), Systematic Biology, 42, 368-372.
83

de Queiroz, A. el al. (1995): Separate versus combined analysis of phylogenetic evidence, Annual Review
of Ecology and Systematics, 26.657-681.
Dopazo. J. (1994): Estimating errors and confidence intervals for branch lengths in phylogenetic tres by a
bootstrap approach. loumal of Molecular Evolution, 38. 300-304.
Dubes. R. and Jain. A. K. (1979): Validity studies in clustering methodologies. Pattern Recognition. 11.
235-254.
Dwass, M. (1957): Modilied randomization tests for nonparametric hypotheses. Amzals of Mathematics
and Statistics, 28. 181-187.
Edgington. E. S. (1995): Randomization tests, 3rd Edition, Revised ami Expanded. Marcel Dekker. New
York.
Eernisse, D. J. and Kluge, A. G. (1993): Taxonomic congruence versus total evidence. and the phylogeny
of amniotes inferred from fossils. molecules and morphology. Molecular Biology and E,·olution. 10.
1170-1195.
Efron. B. (1979): Bootstrapping methods: Another look at the jackknife. Annals of Statistics, 7. 1-26.
Efron. B. and Gong. G. (1983): A leisurely look at the bootstrap. the jackknife. and cross-validation.
American Statisticiair, 37. 36-48.
Efron. B. and Tibshirani. R. J. (1993): An introduction to the bootstrap, Chapman and Hall. New York.
Efron. B. et al. (1996): Bootstrap confidence levels for phylogenetic trees. Proceedings of the National
Academy of Sciences. USA, 93. 13429-13434.
Estabrook. G. F. (1992): Evaluating undirected positional congruence of individual taxa between two
estimates of the phylogenetic tree for a group of taxa. Systematic Biology. 41. 172-177.
Estabrook. G. F. et al. (1985): Comparison of undirected phylogenetic trees based on subtrees of four
evolutionary units. Systematic Zoology, 34. 193-200.
Faith. D. P. (1991): Cladistk permutation tests for monophyly and uoumouophyly. Systematic Zoology, 40.
366-375.
Faith. D. P. (1992): Ou corroboration: A reply to Carpeuter. Cladistics, 8.265-273.
Faith. D. P. aud Ballard. J. W. O. (1994): Length differences topology-dependent tests: A response to
Kallersjii et al. Cladistics, 10,57-64.
Faith. D. P. and Belbin. L. (1986): Comparison of classifications using measures intermediate between
metric dissimilarity and consensus similarity. lournal of Classification, 3,257-280.
Faith, D. P. and Cranston. P. S. (1991): Could a c1adogram this short have arisen by chance alone'! on
permutation tests for cladistic structure. Cladistics, 7. 1-28.
Faith, D. P. and Trueman. J. W. H. (1996): When the topology-dependent permutation test (T-PTP) for
monophyly returns significant support for monophyly. should that be equated with (a) rejecting a null
hypothesis of nonmonophyly. (b) rejecting a null hypothesis of "no structure." (c) failing to falsify a
hypothesis of monophyly. or (d) none of the above? Systematic Biology, 45. 580-586.
Farris. J. S. (1989a): The retention index and the rescaled consistency index, Cladistics,S, 417-419.
Farris. J. S. (1989b): The retention index and homoplasy excess, Systematic Zoology, 38.406-407.
Farris. J. S. (1991): Excess homoplasy ratios. Cladistics, 7.81-91.
Farris. J. S. et al. (1995a): Constructing a significance test for incongruence, Systematic Biology, 44, 570-
572.
Farris. J. S. et al. (1995b): Testing significance of incongruencies, Cladistics, 10. 315-370.
Felsenstein. J. (1978): The number of evolutionary trees. Systematic Zoology, 27.27-33.
Felsenstein. J. (1985): Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39.
783-791.
Felsenstein. J. (1993): Pfm...IP: Phylogeny inference package, version 3.5c. distributed by the author.
University of Washington. Seattle.
Felsenstein. J. and Kishino. H. (1993): Is there something wrong with the bootstrap on phylogenies'! A
reply to Hillis and Bull. Systematic Biology, 42, 193-200.
Finden, C. R. and Gordou. A. D. (1985): Obtaining common pruned trees, loumal of Classification, 2,
225-276.
Fowlkes. E. B. and Mallows, C. L. (1983): A method for comparing two hierarchical c1usterings, loumal
84

of the American Statistical Association, 78, 553-569.


Furnas, G. W. (1984): The generation of random. binary unordered trees. lournal of Classification, 1.
187-233.
Goloboff. P. (1991a): Homoplasy and the choice among c1adograms.Cladistics, 7.215-232.
Goloboff. P. (1991b): Random data. homoplasy and information. Cladistics. 7.395-406.
Gordon. A. D. (1986): Consensus supertrees: the synthesis of rooted trees containing overlapping sets of
labeled leaves. loumal of Classification. 3. 335-348.
Gordon. A. D. (1987): A review of hierarchical c1assitications. lournal of the Royal Statistical Society (A).
150.119-137.
Gower. 1. C. (1983): Comparing classifications. In: Numerical Taxonomy. Felsenstein. 1. (ed.), 137-155,
Springer-Verlag. Berlin.
Graham. R. L. and Foulds. L. R. (1982): Unlikelihood that minimal phylogenies for a realistic biological
study can be constructed in reasonable computational time, Mathematical Biosciences. 60. 133-142 ..
Hall, P. and Martin. M. A. (1988): On bootstrap resampling and iterations. Biometrika, 75. 661-671.
Harding. E. F. (1971): The probabilities of rooted tree-shapes generated by random bifurcations. Advances
in Applied Probability, 4. 44-77.
Harshman. 1. (1994): The effect of irrelevant characters on bootstrap values. Systematic Biology, 43.419-
424.
Hartigan. 1. A. (1967): Representation of similarity matrices by trees. loumal of the American Statistical
Association, 62. 1140-1158.
Hedges. S. B. (1992): The number of replications needed for accurate estimation of the bootstrap P value
in phylogenetic studies. Molecular Biology ami EvolutiOfI, 9.366-369.
Hendy. M. D. et al. (1984): Comparing trees with pendant vertices labelled, SIAM 10l/mal in Applied
Mathematics, 44. 1054-1065.
Hillis. D. M. (1987): Molecular versus morphological approaches to systematics. Annual Review of
Ecology and Systematics, 18. 23-42.
Hillis. D. M. (1991): Discriminatin g between phylogenetic signal and random noise in DNA sequences.
In: Phylogenetic alia lysis of DNA sequences. Miyamoto. M. M. and Cracraft. 1. (eds.). 278-294. Oxford
University Press. New York.
l-Iillis. D. M. (1995): Approaches for assessing phylogenetic accuracy. Systematic Biology, 44.3-16.
Hillis. D. M. and Bull. 1. 1. (1993): An empirical test of bootstrapping as a method for assessing
confidence in phylogenetic analysis. Systematic Biology, 42.182-192.
Hubert. L. 1. and Baker. F. B. (1977): The comparison and fitting of given classification schemes. loumal
of Mathematical Psychology, 16. 233-253.
Huelsenbeck. 1. P. (1995): Performance of phylogenetic methods in simulation. Systematic Biology, 44.
17-48.
Huelsenbeck. 1. P. and Bull. 1. 1. (1996): A likelihood ratio test for detection of conflicting phylogenetic
signal. Systematic Biology, 45. 92-98.
Huelsenbeck. 1. P. et al. (1994): Is character weighting a panacea for the problem of data heterogeneity in
phylogenetic analysis? Systematic Biology, 43. 288-291.
Huelsenbeck. 1. P. et al. (1995): Parametric bootstrapping in molecular phylogenetics: Applications and
performance. In: Molecular Zoology: Strategies and Protocols. Ferraris. 1 .and Palumbi. S. (eds.). Wiley.
New York.
Huelsenbel'k. 1. P. et al. (1996): Combining data in phylogenetic analysis. Trellds in Ecology alld
Evolutioll. 11. 152-158.
1ardine. C. 1. c:t al. (1967): The structure and construction of taxonomic hierarchies. Mathematical
Biosciellces. 1. 173-179.
Kiillersjo. M. et al. (1992): Skewness and permutation. Cladistics, 8. 275-287.
Kim. 1. (1993): Improving the accuracy of phylogenetic estimation by combining different methods.
Systematic Biology, 42. 331-340.
Kirsch. 1. A. W. et al. (1995): Resolution of portions of the kangaroo phylogeny (Marsupialia:
Macropodidae) using DNA hybridization. Biological Journal of the Linneall Society, 55. 309-328.
85

Kirsch, I. A. W. et a\. (1997): DNA-hybridisation studies of marsupials and their implications for
metatherian Classification. Australian lournal of Zoology, in press.
Klassen, G. I. et a\. (1991): Consistency indk-es and random data, Systematic Zoology, 40,446-457.
Kluge, A. G. (1989): A concern for evidence and a phylogenetic hypothesis of relationships among
Epicrates (80idae, Serpentes), Systematic Biology, 38, 7-25.
Kluge, A. G. and Farris, I. S. (1969): Quantitative phyletics and the evolution of anurans. Systematic
Zoology, 18, 1-32.
Krajewski, C. and Dickerman, A. W. (1990): Bootstrap analysis of phylogenetic trees derived from DNA
hybridization matrices, Systematic Zoology, 39,383-390.
Lanyon, S. (1985): Detecting internal inconsistencies in distance data, Systematic Zoology, 34,397-403.
Lanyon, S. (1993): Phylogenetic frameworks: Towards a firmer foundation for the comparative approach,
Biological loumal of the Linnean Society, 49,45-61.
Lapointe, F.-I. and Cucumel, G. (1997): The average consensus procedure: combination of weighted trees
l'ontaining identical or overlapping sets of objects, Systematic Biology, 46, 306-312.
Lapointe, F.-I. and Legendre, P. (1990): A statistical framework to test the consensus of two nested
classitications, Systematic Zoology, 39, 1-13.
Lapointe, F.-I. and Legendre, P. (1991): The generation of random ultrametric matrices representing
dendrograms, lournal of Classification, 8, 177-200.
Lapointe, F.-I. and Legendre, P. (1992a): A statistical framework to test the consensus among additive
trees (cladograms), Systematic Biology, 41, 158-17l.
Lapointe, F.-I. and Legendre, P. (1992b): Statistical significance of the matrix correlation coefficient for
comparing independent phylogenetic trees, Systematic Biology, 41,378-384.
Lapointe, F.-J. and Legendre, P. (1994): A classitication of pure malt Scotch whiskies, Applied Statistics,
43, 237-257.
Lapointe, F.-I. and Kirsch, J. A. W. (1995): Estimating phylogenies from lacunose distance matrices, with
spedal reference to DNA hybridization data, Molecular Biology and Evolution, 12, 266-284.
Lapointe, F.-I. and Legendre, P. (1995): Comparison tests for dendrograms: A comparative evaluation,
loumal of Classification, 12,265-282.
Lapointe, F.-I. et a\. (1994): lackknifing of weighted trees: Validation of phylogenies reconstructed from
distances matrices, Molecillar Phylogenetics and EvolUlion, 3,256-267.
Leclerc, B. and Cucumd, G. (1987): Consensus en classification: Une revue bibliographique,
Mathematiques et Sciences Humaines, 100, 109-128.
Lecointre, G. H. et a\. (1993): Species sampling has a major impact on phylogenetic inference, Molecular
Phylogenetics and Evolution, 2, 205-224.
Lefkovitch, L. P. (1985): Euclidean consensus dendrograms and other classification structures,
Mathematical Biosciem:es, 74,1-15.
Le Quesne, W. (1989): Frequency distributions of lengths of possible networks from a data matrix,
Cladistics, S, 395-407.
Li, W.-H. and Guoy, M. (1991): Statistical methods for testing phylogenies, In: Phylogenetic analysis of
DNA sequences, Miyamoto; M. M. and Cracraft, I. (cds.), 249-277, Oxford University Press, New York.
Li, W.-H. and Zharkikh, A. (1994): What is the bootstrap technique?, Systematic Biology, 43,424-430.
Li, W.-If. and Zharkikh, A. (1995): Statistical tests of DNA phylogenies, Systematic Biology, 44,49-63.
Ling, R. F. (1973): A probability theory of cluster analysis, loumal of the American Statistical Association,
68, 159-164.
Mantel, N. (1967): The detection of disease clustering and a generalized regression approach, Cancer
Research, 27, 209-220.
Margush, T. (1982): Distances between trees, Discrete Applied Mathematics, 4, 281-290.
Mar~ush, T. and McMorris, F. R. (1981): Consensus n-trees, Bul/etin of Mathematical Biology, 43,239-
244.
Marshall, C. R. (1991): Statistical tests and bootstrapping: Assessing the reliability of phylogenies based
on distance data, Molecular Biology and Evolution, 8, 386-391.
Mason-Gamer, R. I. and Kellogg, E. K. (1996): Testing for phylogenetic conflict among molecnlar data
86

sets iD th~ tribe Triticeae (GramiD~ae), Systematic Biology, 45, 524-545.


McMorris, F. R. (1985): Axioms for CODseDSUS fuDctioDs OD uDdirected phylogeDetic trees, Mathematical
Biosciellces, 74, 17-21.
McMorris, F. R. et al. (1983): A view of some consensus methods for trees. ID: Numerical Taxollomy,
Felsenstein,1. (ed.), 122-126, Springer-Verlag, Berlin.
McMorris, F. R. and Neumann, D. (1983): Consensus functions defined on trees, Mathematical Social
Sciellces, 4, 131-136.
Meier, R. et al. (1991): Homoplasy slope ratio: A better measurement of observed homoplasy in cladistic
analyses, Systematic Zoology, 40, 74-88.
Mickevich, M. F. (1978): Taxonomic congruence, Systematic Zoology, 27, 143-158.
MilligaD, G. W. (1981): A Monte-Carlo study of 30 internal criterion measures for clust~r-aDalysis,
Psychometrika, 46, 187-195.
Miyamoto, M. M. (1985): Consensus cladograms aDd general classitications, Cla(listics, I, 186-189.
Miyamoto, M. M. et al. (1994): A congrueDce test of reliability usiDg linked mitochoDdrial DNA
sequences, Systematic Biology, 43, 236-249.
Miyamoto, M. M. and Fitch, W. M. (1995): Testing species phylogenies and phylogenetic methods with
congruence, Systematic Biology, 44, 64·76.
Mueller, I.. D. and Ayala, F. 1. (1982): Estimation aDd iDterpretatioD of geDetic distaDces in empirical
studies, Gelletical Research, 40, 127-137.
Murtagh, F. (1984): Counting dendrograms: A survey, Discrete Applied Mathematics, 7, 191-199.
Nelson, G. (1979): Cladistic analysis and synthesis: Principks and definitions, with a historical note on
Adanson's Famille des Pllllltes (1763·1764), Systematic Zoology, 28, 1-21.
Nelson, G. (1993): Why crusade against l'onsenslls"? A reply to Barrett, Donoghue. aDd Sober, Systematic
Biology, 42. 215-216.
Nemec, A. F. L. and BriDkhurst, R. O. (1988): The Fowlkes-Mallows statistic aDd the comparison of two
indepeDdeDtly determined dendrograms. Calladiall lOl/mal of Fisheries ami Aqllatic Sciellces, 45. 971-
975.
Neumann. D. A. (1983): Faithful conseDSUS methods for Il-tr~es. Mathematical Biosciellces, 63, 271-287.
NixoD. K. C. and 1. M. CarpeDter. (1996): OD simultaDeous aDalysis. Cladistics, 12.221-241.
Od~D. N. L. and Shao. K. T. (1984): Au algorithm to equiprobably generate all directed trees with k
labeled terminal Dodes aDd uDlabeled interior nodes. Bulletill of Mathematical Biology, 46.379-387.
Olmst~ad, R. G. and Sweere, 1. A. (1994): Combining data iD phylogenetic systematics: Au empirical
approal'h using three mokcular data sets in the Solanacae. Systematic Biology, 43.467-481.
Omland. K. E. (1994): Character cODgrueDce betweeD a molecular aDd a morphologil-al phylogeDY for
dabbliDg ducks (AIlas), Systematic Biolog-j, 43. 369·386.
Page. R. D. M. (1988): QuaDtitativ~ cladistic biogeography: CODstructiDg aDd compariDg area cladograms.
Systematic Zoology, 37. 254-270.
Page, R. D. M. (1991): RaDdom deDdrograms aDd Dull hypotheses iD cladistic biogeography, Systematic
Zoology, 40. 54-62.
Patterson. C. et al. (1993): CODgrueDce between molecular and morphological phylogenies. Amlllal Re\'iew
of Ecology ami Systematics, 24. 153-188.
PeDny. D. aDd Hendy. M. D. (1985a): The use of tree comparison metrics. Systematic Zoology. 34. 75-82.
PeDDY. D. and Hendy, M. D. (1985b): Testing methods of evolutionary tree constructioD. Cladistics, 1.
266-278.
PeDny. D. et al. (1982): TestiDg the theory of evolution by compariDg phylogenetic trees constructed from
five differeDt proteiD sequeDces, Nature, 297. 197·200.
PeDny. D. et al. (1992): Progress with methods for constructiDg evolutioDary trees, Trellds ill Ecology alld
Evolutioll. 7. 73-79.
Phillips, C. and Warnow. T. 1. (1996): The asymmetric median tree - A Dew model for buildiDg CllD~nSIlS
trees. Discrete Applied Mathematics, 71.311-335.
Phipps. J. B. (1975): The numbers of classificatioDs. Calladiall lOl/mal of BotallY, 54. 686-688.
87

Podani, J. and Dickinson, T. A. (1984): Comparison of dendrograms: A multivariate approach, Calladiall


Jotlmal of BotallY, 62, 2765-2778.
Poe, S. 1996. Data set incongrencc and the phylogeny of Crocodilians, Systematic Biology, 45, 393-414.
Prager, E. M. and Wilson, A. C. (1976): Congruency of phylogenies derived from different proteins,
Joumal of Molecular Evolutioll, 9,45-57.
Proskurowski, A. (1980): On the generation of binary trees, Joumal of the Associatioll of Computillg
Machillery, 27, 1-2.
Purvis, A. (1995a): A modification to Baum and Ragan's method for combining phylogenetic trees,
Systematic Biology, 44,251-255.
Purvis, A. (1995b): A composite estimate of primate phylogeny, Philosophical Trallsactiolls of the Royal
Society of LOlldoll (B), 348,405-421.
Quiroz, A. J. (1989): Fast random generation of binary, t-ary and other types of trees, Joumal of
Classificatioll, 6,223-231.
Ragan, M. A. (1992): Phylogenetic inference based on matrix representation of trees, Molecular
Phylogelletics ami Evolutioll, I, 53-58.
Robinson, D. F. (1971): Comp:lrison of labeled trees with valency Three.loumal of Combillatorial Theory,
11,105-119.
Robinson, D. F. and Foulds, L. R. (1979): Comparison of weighted labelled trees. In: l.ecture Notes ill
Matehmatics, Volume 748,119-126, Springer-Verlag, Berlin.
Robinson, D. F. and Foulds, L. R. (1981): Comparison of phylogenetic trees, Mathematical Biosciences,
53.131-147.
Rodrigo, A. G. (1993a): Calibrating the bootstrap test of monophyly.llltematiollal Joumal of Parasitology,
23. 507-514.
Rodrigo, A. G. (1993b): A comment on Baum's method for combining phylogenetic trees, Taxon, 42, 631-
636.
Rodrigo, A. G. et at (1993): A randomisation test of the null hypothesis that two c1adograms are sample
estimates of a parametric phylogenetic tree, New lealalld Joumal of Botany, 31, 257-268.
Rohlf, F. J. (1974): Methods of comparing classifications. Alllltlal Review of Ecology alld Systematics, 5,
101·113.
Rohlf. F. J. (1982): Consensus indices for comparing classifications, Mathematical Biosciellces, 59, 131-
144.
Ronquist, F. (1996): Matrix representations of trees, redudancy and weighting, Systematic Biology, 45,
247-253.
Russo, C. A. M. et al. (1996): Efliciencies of different genes and different tree-building methods in
(l'covering a known wrtebrate phylogeny, Molecular Biology alld Evollllioll, 13, 525-536.
Sanderson, M. J. (1989): Confidence limits on phylogenies: The bootstrap revisited, Cladistics,S, 113-
129.
Sanderson, M. J. (1995): Objections of bootstrapping phylogenies: A critique, Systematic Biology, 44,
299-320.
Savage, H. M. (1983): The shape of evolution: Systematic tree topology, BiologicalJournal of the Linneatl
Society, 20, 225-244.
Shao, K. and Rohlf, F. J. (1983): Sampling distribution of consensus indices when all bifurcating trees are
equally likely. In: Numerical Taxonomy, Felsenstein, J. (ed.). 132·136, Springer-Verlag, Berlin.
Shao, K. and Sokal, R. R. (1986): Significance tests of consensus indices, Systematic Zoology, 35, 582-
590.
Simberloff, D. (1987): Calculating probabilities that cladograms match: A method of biogeographic
iuference. Systematic Zoology, 36, 175-195.
Simberloff, D. et al. (1981): There have been no statistical tests of cladistics biogeographical h)potheses.
[u: ~'icariatlce Biogeography: A Critique, Nelson, G. and Rosen, D. E. (eds.), 40-63. Columbia University
Press, New York.
Sneath. P. H. A. (1967): Some statistical problems in numerical taxonomy. The Statisticiall, 17, 1-12.
aa
Sokal R. R. and Roblf, F. J. (1962): Tbe comparison of dendrograms by objective methods, Taxon, 9,33-
40.
Sokal R. R. and Roblf, F. 1. (1981): Taxonomic cougrueuce in tbe Leptopodomorpba re-examined,
Systematic Zoology, 30, 309-325.
Steel, M. A. (1988): Distribution of tbe symmetric difference mdric on phylogenetic trees, SIAM loumal
of Discrete Mathematics, 1,541-555.
Steel, M. A. (1992): The complexity of reconstructing trees from qualitative characters and subtrees,
loumal ofClassificatiotl, 9, 91-116.
Steel, M. A. and Penny, D. (1993): Distribution of tree comparison mdrics-Some new results, Systematic
Biology, 42,126-141.
Steel., M. A. et al. (1992): Signilicance oftbe lengtb oftbe shortest tree,loumal of Classification, 9,63-
70.
Stinebrickner, R. (1982): S-conseusus trees aud indices, Bul/etin of Mathematical Biology, 46, 923-935.
Stinebrickner, R. (1984): An extension of intersection methods from trees to dendrograms, Systematic
Zoology, 33,381-386.
Sullivan, J. (1996): Combining data witb different distributions of among-site variation, Systematic
Biology, 45, 375-379.
Swofford, D. L. (1991): Wben are phylogeny estimates from molecular aud morphological data
iucongruent"!, In: Phylogenetic analysis of DNA sequences, Miyamoto, M. M. and Cracraft, J. (cds;),
295-333, Oxlord UniVl.'rsity Press, New York.
Swofford, D. Let Oil. (199601): Phylogendic inference, In: Mol..cular Systematics, 2nd edition, Hillis, D.
M. et Oil. (cds.), 407-514, Sinauer, Suuderland.
Swollord, D. L. et al. (1996b): The topology-dept'ndent permutatiou test for monopbyly does uot test for
monopbyly, Systematic Biology, 45, 575-579.
Waterman, M. S. aud Smith, T. F. (1978): On tbe similarity of dcndrograms, loumal of Theoretical
Biology, 73,789-800.
Wiens, J. J. and Cbippindale, P. T. (1994): Combining and weighling cbaracters and the prior agreement
approal·h revisited, Systematic Biology, 43, 564-566.
Wiens, J. J. and Reeder, T. W. (1995): Combining dala sels with diftcrent numbers of taxa for
phylogcnetic analysis, Systematic Biology, 44, 548-558.
Wilkinsou, M. (1994): Common cladistic iulormation and its consensus representation: Reduced Adams
and reduced cladistic cousensus trees and proliles, SystenlCltic Biology, 43, 343-368.
Wilkiuson, M. (1996): Majority-rule reduced l·onsensus trees and their use in boostrapping, Molecldar
Biology ami E~·olution, 13,437-444.
Williams, D. M. (1994): Combining trees aud combining data, Taxotl, 43,449-453.
Williams, W. T. and ClilIord, H. T. (1971): On the comparison of two c1assilications on Ihe same set of
elemenls, TiL~OIl, 20,519-522.
Zaretskii, K. (1965): Constructing a tree on the basis of a set of distances between the hanging vertices,
Uspekhi Mathematika Nauk, 20, 90-92. (in Russian).
Zharkikb, A. and Li, W.-H. (1992a): Slatistical properties of bootstrap estimation of phylogenetic
variability from nucleotide sequences. I. Four taxa with a molecular clock, Molecular Biology ami
Evolution, 9, 1119-1147.
Zharkikh, A. and Li, W.-H. (1992b): Statistical properties of bootstrap estimalion of pbylogenetic
variability from uucleotide sequeuces. II. Four taxa wilbout a molecular dOl·k, 101/mal of Molecular
Evolution, 35, 356-366.
Zharkikb, A. and Li, W.-H. (1995): Estimalion of conlidence in pbylogeuy: Tbe full-and-partial bootstrap
lechnique, Molecular Phylogenetics atld Evolution, 4,44-63.
Some Trends in the Classification of Variables

F. Costa Nicolau (1), H. Bacelar-Nicolau (2)

(1) New University of Lisbon, Department of Mathematics, Faculty of Sciences and Technology, Laboratory of Statistics and Actuarial Mathematics (LEMA)

(2) University of Lisbon, Faculty of Psychology and Education, Laboratory of Statistics and Data Analysis (LEAD), CEA / JNICT, Portugal

Summary: In this paper we review a class of hierarchical clustering methods based on similarity coefficients and aggregation criteria associated with the integral transformation, by the (probabilistic) distribution function, of suitable sample statistics. Some properties of these methods that we have studied are recalled and/or derived here. Applications to both simulated and real data sets have shown that this approach performs better than the traditional one (using empirical clustering methods) in many situations. Moreover, we define some "hybrid" criteria, which we generalise in order to obtain mixed or parametric hierarchical clustering methods. Within such parametric families we are able to find, among different criteria, those that better fit the initial similarities, and to investigate the stability and validity of these methods.

1. Introduction

The most useful hierarchical clustering methods are of agglomerative type, iteratively transforming the partition into unit clusters {{x} | x ∈ E} into the trivial one {E}, merging at each step the most similar clusters, E being the set submitted to the analysis. These methods require a previous double choice: the comparison function (dissimilarity/similarity, denoted cf) γ_xy between pairs of elements x, y ∈ E, and the cf Γ(A, B) between pairs of clusters of E. The generalisation of γ to Γ may be performed in several ways, as there is a certain lack of consensus on the precise definition of "cluster" and of "resembling clusters". Thus the question of measuring the resemblance remains a central one.

In our approach to cluster analysis we assume the following reference hypothesis: 1) we are dealing with some similarity coefficient γ_xy following a unit uniform distribution; 2) the measure of resemblance between each pair of clusters A and B is based on the sample of γ-similarities {γ_ab | (a, b) ∈ A × B} crossing A and B, with size αβ, where α = card A, β = card B. Moreover, we shall suppose in the present paper that: 3) the αβ γ-similarities are i.i.d. uniform on [0, 1].

This reference hypothesis is quite suitable and natural for evaluating the global resemblance between clusters of E. Based on it, some good hierarchical and non-hierarchical (Nicolau and Brito, 1989) clustering methods and fruitful ideas arise. All those methods are included in the package CLASSIF, and the results obtained on both simulated and real data very often show a clear improvement over other traditional methods, especially in the cases of AVL, AVM and particular mixed methods (like AVB) which are generated from some parametric families of agglomerative methods.

In the sequel we briefly examine, in Section 2, the general procedure to obtain the cf's γ and Γ. For details see Bacelar-Nicolau (1972, 1979, 1980, 1981), Costa Nicolau (1980, 1983, 1985), Bacelar-Nicolau and Costa Nicolau (1981, 1985), and Lerman (1972, 1981). Section 3 concerns some properties of the AVL and AVM methods, and Section 4 finally presents some parametric families defining mixed aggregation criteria from the Single Linkage and AVL methods (Bacelar-Nicolau and Costa Nicolau, 1994).

2. Similarity and distribution function: a probabilistic approach to hierarchical clustering

In our probabilistic approach, the general procedure to obtain the cf's γ and Γ is the following:

Step 1: Defining a probabilistic similarity coefficient γ

Let S : E × E → ℝ be a similarity function (random variable), chosen according to the nature of the data, and let s_xy be its particular value on the pair (x, y). Let S be transformed by its cumulative distribution function, computed under some convenient reference hypothesis (this hypothesis will be defined in each particular case; occasionally we use an independence hypothesis, given our lack of knowledge about the structure of the data):

(x, y) → s_xy → γ_xy = Prob(S ≤ s_xy)

γ_xy = Prob(S ≤ s_xy) is a new, probabilistic, similarity coefficient: γ_xy ∈ [0, 1] and γ_xy = γ_yx; on the other hand, γ_xy being an increasing function of s_xy, globally invariant hierarchical criteria (Sibson (1972)) will give exactly the same clustering results either with s_xy or with γ_xy. The probabilistic similarity coefficient verifies the minimal properties usually required of a comparison function.

Often γ_xy will be calculated approximately, assuming our data set to have a large dimension. Then S' = (S − E(S))/σ_S follows an asymptotic standard normal N(0, 1) distribution, and in each case we can find:

γ_xy = Prob(S ≤ s_xy) = Prob(S' ≤ s'_xy) = Φ(s'_xy)

where Φ is the cumulative distribution function (cdf) of N(0, 1).

In the case of binary data, for instance, we can take S_xy = #{i ∈ D | 1_x(i)·1_y(i) = 1}, the number of common presences of x and y over the descriptive set D, where 1_x(i) = 1 if i verifies x, and 0 otherwise. Then S will have a hypergeometric or binomial (or Poisson) distribution as exact law, depending on the underlying reference hypothesis (concerning the marginal frequencies of the 2×2 contingency table associated with the pair (x, y)), and γ_xy can be estimated by the corresponding cdf of the normal approximation of each selected discrete distribution.
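As an illustration of Step 1, the following minimal sketch (our own, not the authors' CLASSIF code; the function name is an assumption) computes the VL similarity of two binary variables under the hypergeometric reference hypothesis, using the exact cdf where the paper would typically employ the normal approximation:

```python
# Minimal sketch of the VL similarity for binary data (hypothetical helper;
# hypergeometric reference hypothesis with both margins fixed).
import numpy as np
from scipy.stats import hypergeom

def vl_similarity(x, y):
    """x, y: 0/1 vectors over the same descriptive set D."""
    d = len(x)                              # card D
    s_xy = int(np.sum(x * y))               # number of common presences
    # Under the reference hypothesis, S ~ Hypergeometric(d, sum x, sum y),
    # and gamma_xy = Prob(S <= s_xy) is the probabilistic similarity.
    return hypergeom.cdf(s_xy, d, int(x.sum()), int(y.sum()))

x = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y = np.array([1, 0, 0, 1, 0, 1, 1, 1])
print(vl_similarity(x, y))                  # a value in [0, 1]
```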

Moreover, in the binary case, H. Bacelar-Nicolau (1981, 1987, 1989) found that the usual association coefficients are grouped in several distributionally equivalent classes, in the following sense: either they have the same exact distribution or they share the same asymptotic normal distribution. This means that, using the probabilistic coefficient, one can choose for hierarchical clustering purposes only one coefficient in each distributional equivalence class, the other ones giving (exactly or asymptotically) the same hierarchical results (the same dendrogram, for instance).

In the case of frequency or contingency tables we usually take the probabilistic coefficient based on the affinity coefficient (Matusita, 1951; Bacelar-Nicolau, 1988): the above reference hypothesis then refers either to a sampling scheme or to a permutational scheme, depending on the way the data have been observed. The sampling scheme has been extended to integer data (Bacelar-Nicolau and Nicolau, 1993), while the permutational scheme usually applies to real data.

Often we refer to the probabilistic coefficient γ_xy as the VL similarity (V for Validity, L for Linkage) because, as Tiago de Oliveira pointed out, the probabilistic coefficient validates on a probabilistic scale the basic linkage s between each pair (x, y). On the other hand, VL comes primarily from "Vraisemblance du Lien", that is, the likelihood of the link. In fact, much of our research on this matter finds its roots in the works of Lerman (1970) and Bacelar-Nicolau (1972), where the probabilistic coefficient was used for the first time in the binary case: it was called the VL ("Vraisemblance du Lien") similarity coefficient, and its first associated aggregation criterion AVL, "Algorithme de la Vraisemblance du Lien". Subsequent extensions have often kept the same label.

Step 2: Defining a probabilistic aggregation criterion

Assume that we are dealing with some similarity coefficient γ_xy following a unit uniform distribution, and that the measure of resemblance between each pair of clusters A and B is based on the set {γ_ab | (a, b) ∈ A × B} crossing A and B, with size αβ (where α = card A, β = card B). Also assume that the αβ γ-similarities are i.i.d. uniform on [0, 1].

This reference hypothesis is quite general and fits many applied situations well. On the other hand, it is well known that the integral transformation of a continuous random variable leads to the uniform distribution on [0, 1]. The random variable S of Step 1 being in fact of discrete type, the size of the descriptive set D is in general large enough in cluster analysis to apply this property to the sample of γ-similarities.

Let us now introduce some probabilistic cf's that generalise γ_xy to the cf Γ(A, B) between subsets of E. We have studied, in particular, the cf coefficients associated with the following statistics:
p_{A,B} = max{γ_ab | (a, b) ∈ A × B}

q_{A,B} = min{γ_ab | (a, b) ∈ A × B}

p̄_{A,B} = (1/αβ) Σ{γ_ab | (a, b) ∈ A × B}

t_{A,B} = (p_{A,B} + q_{A,B}) / 2

where p_{A,B}, q_{A,B} and p̄_{A,B} are the basic statistics for the single linkage, complete linkage and average linkage methods, respectively, and t_{A,B} is a statistic already used (with empirical dissimilarity coefficients) by some researchers to obtain a compromise between the single and complete linkage methods. Here we are not interested in directly using those statistics for clustering purposes; instead, we want to take their cumulative distribution functions, in order to apply the probabilistic approach to hierarchical classification. As we have pointed out, we assume throughout this work that the αβ γ-similarities are i.i.d.

The first probabilistic aggregation criterion, MaxProb, derived from p_{A,B}, is the so-called AVL method, from Validity-Linkage Algorithm or "Algorithme de la Vraisemblance du Lien", which was first proposed by Lerman (1970), its main properties being established by Bacelar-Nicolau (1972) in the context of the classification of variables. It was generalised in Lerman (1981), Bacelar-Nicolau (1981) and Nicolau and Bacelar-Nicolau (1981), for instance. The second statistic, q_{A,B}, was used as the support of a probabilistic clustering method for frequency data introduced by Goodall, where the probabilistic similarity is defined from the χ² distribution (in Legendre and Legendre (1983)). In our approach the probabilistic aggregation criterion MinProb associated with q_{A,B} did not perform very well. The third probabilistic criterion, AvmProb, derived from p̄_{A,B}, is also the so-called AVM method, Mean-Validity Algorithm. Finally, the probabilistic criterion generated by the "hybrid" statistic t_{A,B} will not be considered now, since we are most interested in studying parametric mixed criteria. Thus, among those four statistics we shall retain in the present paper only p_{A,B} and p̄_{A,B}. Under the i.i.d. assumption referred to above, we obtain for the corresponding AVL and AVM clustering criteria the following expressions: for AVL,

Γ(A, B) = (p_{A,B})^{αβ},

the distribution function of the maximum of αβ i.i.d. unit uniform variables evaluated at p_{A,B}, and for AVM,

Γ(A, B) = F_{αβ}(p̄_{A,B}) ≈ Φ(√(12αβ)·(p̄_{A,B} − 1/2)),

where F_{αβ} denotes the distribution function of the mean of αβ i.i.d. unit uniform variables, approximated for large αβ by its normal limit.
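The following minimal sketch (our own illustration, not the CLASSIF package; the normal approximation for the AVM criterion is an assumption of this sketch) evaluates both criteria for two clusters A and B, given a matrix of VL similarities:

```python
# Minimal sketch of the AVL (MaxProb) and AVM (AvmProb) aggregation criteria.
import numpy as np
from scipy.stats import norm

def avl(gamma, A, B):
    block = gamma[np.ix_(A, B)]          # the alpha*beta cross-similarities
    p = block.max()                      # basic statistic p_{A,B}
    return p ** block.size               # Prob(max of alpha*beta iid U(0,1) <= p)

def avm(gamma, A, B):
    block = gamma[np.ix_(A, B)]
    pbar = block.mean()                  # basic statistic for average linkage
    n = block.size
    # Normal approximation to the cdf of the mean of n iid U(0,1) variables.
    return norm.cdf(np.sqrt(12 * n) * (pbar - 0.5))

rng = np.random.default_rng(0)
gamma = rng.uniform(size=(6, 6))
gamma = np.triu(gamma, 1) + np.triu(gamma, 1).T      # symmetric VL similarities
print(avl(gamma, [0, 1], [2, 3, 4]), avm(gamma, [0, 1], [2, 3, 4]))
```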

Both the AVL and AVM methods generally perform quite well on both simulated and real data. Nevertheless they differ in some specific features, which can lead us to choose one or the other in different particular situations. In the next section we compare the two methods with respect to the presence/absence of inversions in the corresponding dendrograms and to the type of hierarchical structure they produce.

3. Comparing the MaxProb (AVL) and AvmProb (AVM) methods

3.1 Level index and absence/presence of inversions

Let Γ^(k−1) be the table of similarity values between clusters at the (k−1)-th step of an agglomerative process, and let Γ(k) = max Γ^(k−1). We call d(k) = 1 − Γ(k) a level index: it measures the lack of cohesion of the successive clusters formed at the sequential hierarchical tree levels; in this sense one usually hopes d(k) to be an increasing function of the level k. Nevertheless, we want to point out that we believe the existence of inversions in d(k) can be quite compatible with the construction of good hierarchical classifications.

Concerning the level index, the following property holds:

Property: The level index of the AVM method can present inversions, whereas the level index of the AVL method is strictly increasing.

Proof: The absence of monotonicity of the AVM level index is simply a consequence of the statistical convergence of the mean of an i.i.d. sample of the unit uniform distribution to the value 0.5: when the entries of Γ are updated at the end of each step of the clustering algorithm, the Γ values will increase if the mean of the VL similarities is greater than 1/2, and decrease if that mean is less than 1/2. So we can naturally have d(k+1) < d(k) at some levels of the AVM dendrogram.

Now, in order to see that for the AVL method one always has d(k+1) > d(k), let us first suppose that Γ^(k−1) has only one maximum value (binary tree), so that only one pair of clusters verifies the criterion at each step. Let (A, B) be the pair of clusters satisfying the AVL criterion (maximisation of Γ^(k−1)) and let P_k be the partition defined at the k-th step:

P_k = P_{k−1} − {A, B} ∪ {A ∪ B}.

Then (disjoint components):

Γ^(k) = Δ^(k−1) ∪ Γ^(k)_{A∪B},   Δ^(k−1) = Γ^(k−1) − (Γ_A^(k−1) ∪ Γ_B^(k−1)),

where

Γ_A^(k−1) = {Γ^(k−1)(A, X) | X ≠ A, X ∈ P_{k−1}} ⊂ Γ^(k−1)

(and similarly for Γ_B^(k−1)), while Γ^(k)_{A∪B} collects the updated similarities Γ^(k)(A ∪ B, C), C ∈ P_k.

Now, either max Γ^(k) = max Δ^(k−1), and then max Γ^(k) < max Γ^(k−1), since the former maximum, attained at the pair (A, B), has been removed;

or else max Γ^(k) = max Γ^(k)_{A∪B} = Γ^(k)_{A∪B,C} for some C ∈ P_k, and then, writing γ = card C and p_{A∪B,C} = max(p_{A,C}, p_{B,C}) = p_{A,C} (say),

max Γ^(k) = p_{A∪B,C}^{(α+β)γ} = p_{A,C}^{(α+β)γ} < p_{A,C}^{αγ} ≤ max Γ^(k−1).

The proof is then completed in the case of binary trees.

In the general case, if h pairs of clusters {A_i, B_i}, i = 1, …, h, verify the AVL criterion at the k-th step, then Γ^(k) becomes Γ^(k) = Δ^(k−1) ∪ Δ^(k), where now

Δ^(k−1) = Γ^(k−1) − [(∪_{i=1}^h Γ_{A_i}^(k−1)) ∪ (∪_{i=1}^h Γ_{B_i}^(k−1))] ⊂ Γ^(k−1)

and

Δ^(k) = ∪_{i=1}^h Γ^(k)_{A_i ∪ B_i},

so that the monotonicity property of the AVL level index always holds.

3.2 "Symmetry" versus "bipolarisation" effects

As the successive updatings of the cf Γ in the AVM method tend to reinforce the strong links and, conversely, to weaken the weak links, one usually observes in the AVM dendrograms a sort of "bipolarisation effect". Most examples treated so far by the AVM method clearly show the manner in which this method works: the kernel of each cluster grows by joining element after element in some kind of local chain effect (similar to the characteristic global chain effect of the single linkage method); once a kernel is finished, another one begins to be built in the same way, and the process continues until all the main clusters are formed; those clusters are then merged, without chain effect, producing good-looking and coherent trees. This double effect globally conserves the initial structure as expressed by the cf γ and is responsible, after all, for the trustworthy fit of the AVM trees to the data.

A different tendency can usually be observed in the AVL hierarchies, which are seldom associated with a chain effect. Instead, they produce quite regular trees, with clusters of equal size at each level of the tree. We call this the "symmetry effect" of the AVL method; it is due to the exponent αβ of p_{A,B} in the formula of Γ, which performs a sort of braking action on the chain effect at all levels of the dendrogram. One can easily understand the symmetry effect of AVL in the following example:

Suppose there are h clusters of equal size a at step k (k = 0, 1, 2, …), and let A, B ∈ P_k be the clusters which are going to be merged. Updating the cf between any other cluster C and the new merged cluster A ∪ B will then give

Γ^(k+1)(A ∪ B, C) = p_{A∪B,C}^{2a·a} = p_{A∪B,C}^{2a²},

while the link between C and any other cluster X of P_{k+1} is expressed by

Γ^(k+1)(C, X) = p_{C,X}^{a·a} = p_{C,X}^{a²}.

So A ∪ B will attract C, shaping a new cluster A ∪ B ∪ C of size 3a, only if

p_{A∪B,C}^{2a²} > p_{C,X}^{a²},

or equivalently

p_{A∪B,C} > √(p_{C,X}),   ∀X ≠ C, A ∪ B.

Therefore it is natural that clusters of size 2a arise before any group of size 3a could emerge.

Concerning "symmetry" effect of AVL versus "bipolarisation" etl"ect of AVM, recent work
of Nicolau and Bacelar-Nicolau has been developed which associate them with spatial
dilating and spatial contracting properties of agglomerative methods (Lance and Williams,
1967).

4. Parametric approach to adaptive clustering

Once the VL similarity is chosen as the cf γ_xy between pairs of elements, the Single Linkage (SL) and AVL methods, being associated with the same basic aggregation function, p_{A,B} = max{γ_ab | (a, b) ∈ A × B}, can be generated by a joint formula:

Γ(A, B) = p_{A,B}^{g(α,β)}

where α = card A, β = card B; g(α,β) = 1 for SL and g(α,β) = αβ for AVL.
One could expect that, by varying g(α,β) with 1 < g(α,β) < αβ, a sort of compromise will be built between the SL and AVL methods: Γ(A, B) = p_{A,B}^{g(α,β)} will be more polluted by the chain effect when g(α,β) remains near 1, and more contaminated by the symmetry effect as g(α,β) approaches αβ.

Therefore a bridge between the SL and AVL methods can be established by using, for instance, the following chain of exponents:

1 ≤ min(α,β) ≤ 2αβ/(α+β) ≤ √(αβ) ≤ (α+β)/2 ≤ max(α,β) ≤ αβ.

The corresponding agglomerative cf's based on the VL similarity define an iterative clustering procedure.

This methodology ensures economical computation, invariance with respect to the initial order of the elements or of the clusters to be merged, and some way of evaluating the braking action on both the chain and symmetry effects. In these respects, the two methods associated with the geometric and the arithmetic means of the cardinals of the clusters being compared perform quite well, producing fine, interpretable hierarchical trees.

The above iterative clustering procedure has been generalised by defining some suitable parametric families of aggregation criteria linking the SL and AVL methods.

The first idea for finding such a family was to take a scalar transformation of αβ: αβ → αβ/ξ, where 1 < ξ < αβ, with ξ fixed. This turned out not to be a good solution, since it is easy to prove that:

Given Γ_g(A, B) = p_{A,B}^{g(α,β)}, any other method Γ_{g'} such that g' = δ·g(α,β), with δ > 0 fixed, will give exactly the same hierarchical tree as Γ_g, differences existing only in the values of the level index.

The invariance does not hold under linear transformations of the exponent g(α,β) of the form δg(α,β) + ε: Γ_g(A, B) and Γ_{δg+ε}(A, B) = p_{A,B}^{δg(α,β)+ε} generally produce different hierarchical trees. Thus a natural solution to the question of defining an appropriate parametric family of agglomerative criteria linking the SL and AVL methods appears to be the following:

Γ(A, B) = p_{A,B}^{δαβ+ε}

where 1 ≤ δαβ + ε ≤ αβ, α = card A, β = card B, the exponent function being a convex linear combination of the exponents of SL and AVL, respectively 1 and αβ.
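A minimal sketch (our own illustration; the function names and the brute-force search over cluster pairs are assumptions, not the CLASSIF implementation) of the agglomerative family Γ(A, B) = p_{A,B}^{g(α,β)}, which recovers SL, AVL and the geometric-mean compromise by choosing g:

```python
# Minimal sketch of the parametric agglomerative family
# Gamma(A,B) = p_{A,B} ** g(alpha, beta).
import numpy as np
from itertools import combinations

def hierarchy(gamma, g):
    """gamma: symmetric matrix of VL similarities; g: exponent function g(alpha, beta)."""
    clusters = [[i] for i in range(len(gamma))]
    merges = []
    def crit(A, B):
        p = max(gamma[a][b] for a in A for b in B)   # basic statistic p_{A,B}
        return p ** g(len(A), len(B))                # parametric criterion
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: crit(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], crit(clusters[i], clusters[j])))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
                   + [clusters[i] + clusters[j]]
    return merges

rng = np.random.default_rng(1)
gamma = rng.uniform(size=(5, 5))
gamma = np.triu(gamma, 1) + np.triu(gamma, 1).T
for name, g in [("SL", lambda a, b: 1), ("AVL", lambda a, b: a * b),
                ("geometric mean", lambda a, b: (a * b) ** 0.5)]:
    print(name, hierarchy(gamma, g))
```

Varying g between the two extremes of the chain above moves the hierarchy from the chain effect of SL towards the symmetry effect of AVL.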

Experimental work was conducted on both simulated and real data in order to study this parametric family and its role in the search for hierarchical methods which better fit the initial similarities, as well as in the analysis of the stability and validity of those methods.

On the other hand, the above parametric family has later been extended in order to include, in the former iterative clustering procedure, the two probabilistic methods associated with the exponents given by the geometric and the arithmetic means of the cardinals of the clusters being compared. We get in this case g(α,β; ε,ξ) = 1 / (1 + ε((α·β)^ξ − 1)), where ε and ξ both take values in the interval [0, 1]. More recently, the whole family has been included in the probabilistic similarity version of the well-known Lance and Williams formula (Bacelar-Nicolau and Nicolau, 1994). The recursive Lance and Williams formula, designed for dissimilarity coefficients, can in fact easily be adapted to clustering methods based on similarities, and particularly on the VL similarity. One has:

Γ(A ∪ B, C) = δ₁Γ(A, C) + δ₂Γ(B, C) + βΓ(A, B) + γ|Γ(A, C) − Γ(B, C)|

where the constants δ₁, δ₂, β, γ vary according to the method we want to reproduce.

The extended formula, derived in order to include the probabilistic hierarchical family, needs only two more constants, which are the parameters ε, ξ above, and can be represented as follows:

Γ(A ∪ B, C) = [δ₁Γ(A, C)^{g(α,η;ε,ξ)} + δ₂Γ(B, C)^{g(β,η;ε,ξ)} + βΓ(A, B)^{g(α,β;ε,ξ)} + γ|Γ(A, C)^{g(α,η;ε,ξ)} − Γ(B, C)^{g(β,η;ε,ξ)}|]^{1/g(α+β,η;ε,ξ)}

where g(α,η; ε,ξ) = 1 / (1 + ε((α·η)^ξ − 1)), η being the cardinal of the cluster C.

We can easily see that setting ε = 0 we simply recover the first formula above. On the other hand, we find as particular cases in the family: the single linkage SL (δ₁ = δ₂ = γ = 1/2, β = ε = 0), the validity-linkage AVL (δ₁ = δ₂ = γ = 1/2, β = 0, ε = ξ = 1) and the brake-validity linkage AVB (δ₁ = δ₂ = γ = 1/2, β = 0, ε = 1, ξ = 1/2) algorithms.
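A minimal numerical check (our own sketch; the outer exponent 1/g(α+β, η; ε, ξ) is our reading of the garbled original formula) that this recursive update reproduces SL for ε = 0 and AVL for ε = ξ = 1:

```python
# Minimal check of the extended Lance-Williams update on VL similarities,
# with delta1 = delta2 = gamma = 1/2 and beta = 0.
def g(a, b, eps, xi):
    return 1.0 / (1.0 + eps * ((a * b) ** xi - 1.0))

def update(G_AC, G_BC, a, b, eta, eps, xi):
    u, v = G_AC ** g(a, eta, eps, xi), G_BC ** g(b, eta, eps, xi)
    inner = 0.5 * u + 0.5 * v + 0.5 * abs(u - v)     # equals max(u, v)
    return inner ** (1.0 / g(a + b, eta, eps, xi))

# AVL case: Gamma(A,C) = p_AC^(a*eta) and Gamma(B,C) = p_BC^(b*eta); the
# update should return p^((a+b)*eta) with p = max(p_AC, p_BC).
p_AC, p_BC, a, b, eta = 0.7, 0.9, 2, 3, 4
print(update(p_AC ** (a * eta), p_BC ** (b * eta), a, b, eta, eps=1, xi=1),
      max(p_AC, p_BC) ** ((a + b) * eta))            # the two values coincide
```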
Using such a parametric family enables us to analyse the robustness of our clustering methods: by varying ε and ξ from 0 to 1, the other coefficients remaining unchanged, we can study the stability of the AVL family of models, as before. But this procedure can of course now be generalised to the comparison between the probabilistic family and the agglomerative methods generated by the original Lance and Williams formula.

5. Conclusions

The concept of the probabilistic (VL) similarity, obtained by using the distribution function of a random variable (the uniform transformation), and its extension to agglomerative criteria allow us to establish a general and consistent probabilistic approach to hierarchical clustering methods. The validity-affinity coefficient combined with the AVL, AVB and AVM methods are good examples of this approach.

Moreover, the probabilistic similarity enables us to derive a parametric clustering approach to the hierarchical methodology, that is, a recursive formula generating criteria which appear as a mixture of the single linkage (chain effect) and AVL (symmetry effect) methods. Questions as important as stability, optimisation and compatibility with the data, concerning the hierarchies produced by the parametric agglomerative methods, may in that way be enlightened.

Finally, note that this concept of probabilistic similarity does not exhaust itself in the field of hierarchical agglomerative methods: we have also developed a non-hierarchical probabilistic approach based on the distribution function of the mean of a unit uniform sample, which gives an excellent non-hierarchical method of k-means type.

References:

Bacelar-Nicolau, H. (1972) - Analyse d'un algorithme de classification automatique - Thèse de 3ème Cycle, Univ. Paris VI (I.S.U.P.), Nov. 1972.
Bacelar-Nicolau, H. (1981) - Contribuições ao estudo dos coeficientes de comparação em Análise Classificatória - Doct. Thesis, FCL, Univ. de Lisboa.
Bacelar-Nicolau, H. (1987) - On the distributional equivalence in cluster analysis - NATO ASI Series, vol. F30, Pattern Recognition Theory and Applications, P. A. Devijver and J. Kittler (eds.), Springer-Verlag, 73-79.
Bacelar-Nicolau, H. (1988) - Two probabilistic models for classification of variables in frequency tables - Classification and Related Methods of Data Analysis, H. H. Bock (ed.), North Holland, 181-186.
Bacelar-Nicolau, H. (1989) - Sur l'équivalence distributionnelle entre coefficients d'association - Bulletin of the International Statistical Institute (ISI), 47th Session, Contributed Papers, Book 1, 89-90.
Bacelar-Nicolau, H.; Nicolau, F. C. (1993) - Classifying integer scale data by the affinity coefficient: a probabilistic approach - Proceedings of the Sixth Intern. Symp. on Applied Stochastic Models and Data Analysis (ASMDA), J. Janssen and C. H. Skiadas (eds.), World Scientific, vol. 1, 63-74.
Bacelar-Nicolau, H.; Nicolau, F. C. (1994) - Exploratory and confirmatory discrete multivariate analysis in a probabilistic approach for studying the regional distribution of AIDS in Angola - New Approaches in Classification and Data Analysis, E. Diday, Y. Lechevallier, M. Schader (eds.), Springer-Verlag, 610-618.
Lance, G. N.; Williams, W. T. (1967) - A general theory of classificatory sorting strategies. I. Hierarchical systems - The Computer Journal, vol. 9, no. 4, 373-380.
Legendre, L.; Legendre, P. (1983) - Numerical Ecology - Elsevier Sc. Publ.
Lerman, I. C. (1970) - Sur l'analyse des données préalable à une classification automatique - Rev. Math. et Sc. Hum., vol. 32, 8ème année, 5-15.
Lerman, I. C. (1973) - Étude distributionnelle de statistiques de proximité entre structures finies de même type, application à la classification automatique - Cahiers du BURO, Paris.
Lerman, I. C. (1981) - Classification et Analyse Ordinale des Données - Dunod.
Matusita, K. (1951) - On the theory of statistical decision functions - Ann. Inst. Stat. Math., vol. III, 1-30.
Nicolau, F. Costa (1981) - Critérios de análise classificatória hierárquica baseados na função de distribuição - Doct. Thesis, FCL, Univ. de Lisboa.
Nicolau, F. Costa; Bacelar-Nicolau, H. (1981) - Nouvelles méthodes d'agrégation basées sur la fonction de répartition - Collection Séminaires INRIA de Classification et Perception par Ordinateur 1981, INRIA, Domaine de Voluceau-Rocquencourt, France.
Nicolau, F. Costa (1983) - Cluster analysis and distribution function - Meth. Oper. Res., vol. 45, Verlag Anton Hain, 431-433.
Nicolau, F. Costa; Brito, M. P. (1989) - Improvement in NHMEAN method - Data Analysis, Learning Symbolic and Numerical Knowledge, E. Diday (ed.), Nova Science Publishers.
Sibson, R. (1972) - Order invariant methods for data analysis - J.R.S.S. B, vol. 34, no. 3, 311-338.
CONVEXITY METHODS IN CLASSIFICATION

Jean-Paul Rasson

F.U.N.D.P., Département de Mathématique, 8, Rempart de la Vierge, B-5000 Namur, Belgium
Tel.: +32 81 724928 - Fax: +32 81 724914
E-Mail: [email protected]

Summary: We investigate the solutions to the clustering and discriminant analysis problems when the points are supposed to be distributed according to Poisson processes on convex supports. This leads to very intuitive criteria for homogeneous Poisson processes, based on the Lebesgue measures of convex hulls. For non-homogeneous Poisson processes, the Lebesgue measures have to be replaced by intensities integrated over convex hulls. Similar geometrical tools, based on the Lebesgue measure, are used in the context of pattern recognition. First, a discriminant analysis algorithm is developed for estimating a convex domain when inside and outside points are available. Generalisation to non-convex domains is then explored.

Introduction

This research was initiated by the question of D. G. Kendall of how to estimate a bounded convex set when observing only the realization of a homogeneous Poisson process inside this convex set. The solution (Ripley and Rasson, 1977; Rasson, 1979) was a homothetic expansion of the convex hull of the sample from its centroid. The following question, raised by E. Diday, was naturally the problem of using these arguments in clustering. This led us to the maximum likelihood estimation of the hypothesized support for clustering, i.e. the union of K bounded convex sets. The solution was: "find the partition of the points into K subgroups such that the sum of the Lebesgue measures of their convex hulls is minimal" (Hardy and Rasson, 1982).
Then came the question of the corresponding discriminant analysis, raised by A. D. Gordon. The Bayesian solution under the same hypothesis (still for a homogeneous Poisson process) was to assign the new point to the sample for which the Lebesgue measure added by convexity to its convex hull is minimal (Baufays and Rasson, 1984a, 1984b, 1985). But this rule could not classify the points belonging to more than one convex hull, as is often the case for pixels in images.
To deal with such points, we moved to non-homogeneous independent Poisson processes with convex supports. The main change in the solution was to replace the Lebesgue measure by the integrated intensity. But the same equation also gave the solution for points lying in the intersections of the supports.

1. The Stationary Poisson Process

1.1. The model
We consider that we deal with a clustering problem where the points we observe are generated by a Poisson process and are distributed in D, where D is the union of g disjoint domains D_k, 1 ≤ k ≤ g. Our point of view will be that if we are able to find back these domains, making some inference about them, we will, in some sense, solve the clustering problem.
1.2. The maximum likelihood solution of the clustering problem.
Let x denote the sample vector (x₁, …, x_n) with x_i ∈ ℝ^d, i = 1, …, n. The indicator function of a set A at the point y is defined by 1_A(y) = 1 if y ∈ A, and 0 otherwise. Since, under our hypothesis, the points are independently and uniformly distributed on D, the likelihood function takes the form

F_D(x) = (1/m(D))^n · ∏_{i=1}^n 1_D(x_i)

where m(D), the Lebesgue measure of D, is the sum of the measures of the g subsets D_k (1 ≤ k ≤ g). The domain D, a parameter of infinite dimension, for which the likelihood is maximal is, among all those which contain all the points, the one whose Lebesgue measure is minimal.
If we do not impose further conditions on the subsets, we can easily find g sets D_k which contain all the points and are such that the sum of their measures is zero. Thus there are many trivial solutions to the problem. Nevertheless, we can easily see that the problem of estimating a domain is not well-posed and that the weakest assumption that makes the domain D estimable is the convexity of the D_k (Baufays and Rasson, 1984).
With a partition of the set of points into g sub-domains having disjoint convex hulls, we can associate a whole class of estimators; indeed we only have to find g disjoint convex sets, each of them containing one of the subgroups. For each partition, the likelihood has a local maximum: the convex hulls of the g subsets. The global maximum will be attained with the partition for which the sum of the Lebesgue measures of the convex hulls of the g subgroups is minimal. This is the solution we seek.
Practically, if the basic space is ℝ, we look for the g disjoint intervals containing all the points such that the sum of their lengths is minimal. In ℝ² (or ℝ³), we try to find the g groups of points such that the sum of the areas (volumes) of their disjoint convex hulls is minimal.
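A minimal brute-force sketch of this criterion (our own illustration; it enumerates all label assignments, which is feasible only for small n, and omits the disjoint-hull check):

```python
# Minimal sketch: choose the partition into g groups minimising the sum of the
# Lebesgue measures (areas, in the plane) of the groups' convex hulls.
import numpy as np
from itertools import product
from scipy.spatial import ConvexHull

def hull_area(points):
    pts = np.asarray(points)
    if len(pts) < 3:
        return 0.0                       # a degenerate hull has zero measure
    return ConvexHull(pts).volume        # in 2-D, .volume is the area

def best_partition(points, g):
    n, best, best_cost = len(points), None, np.inf
    for labels in product(range(g), repeat=n):       # all label assignments
        groups = [[points[i] for i in range(n) if labels[i] == k] for k in range(g)]
        if any(len(gr) == 0 for gr in groups):
            continue
        cost = sum(hull_area(gr) for gr in groups)
        if cost < best_cost:
            best, best_cost = labels, cost
    return best, best_cost

rng = np.random.default_rng(0)
pts = list(np.vstack([rng.uniform(0, 1, (6, 2)), rng.uniform(3, 4, (6, 2))]))
print(best_partition(pts, 2))            # recovers the two separated groups
```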
1.3. The statistical model and the rule for the associated discriminant analysis.
The conditional distribution for the k-th population is assumed to be uniform in a convex compact domain D_k, and the a priori probability π_k that an individual belongs to population k is proportional to the Lebesgue measure of D_k. The convex domains D_k are assumed disjoint. The density of population k, f_k(x), and the unconditional density f(x), are respectively equal to:

f_k(x) = (1/m(D_k)) · 1_{D_k}(x),   f(x) = (1/m(D)) · Σ_{k=1}^g 1_{D_k}(x).

The decision rule is the Bayesian one, with the unknown parameters (the convex sets D_k) replaced by their maximum likelihood estimates. Let X_k be the labeled sample of population k, H(X_k) its convex hull, and x the individual to be assigned to one of the g populations. If x is allocated to the k-th group, the estimates of the domains D_j are equal to:

D̂_j = H(X_j) if j ≠ k,   D̂_j = H(X_j ∪ {x}) if j = k.

The maximum likelihood estimate of π_k f_k(x) is therefore

π̂_k f̂_k(x) = 1 / (S + s_k(x)),   with S = Σ_{j=1}^g m(H(X_j)) and s_k(x) = m(H(X_k ∪ {x})) − m(H(X_k)).

As the convex sets are assumed disjoint, x can be allocated to the k-th group if and only if H(X_k ∪ {x}) and H(X_j) are disjoint for all j ≠ k. Otherwise, we set arbitrarily s_k(x) = +∞.

Figure 1.
The allocation rule is then:

assign x to the k-th population if and only if s_k(x) < s_j(x) for all j ≠ k.
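A minimal sketch of this rule in the plane (our own illustration; the disjointness check leading to s_k(x) = +∞ is omitted):

```python
# Minimal sketch of the convex-hull discriminant rule: assign x to the class
# whose convex hull grows least in area when x is added.
import numpy as np
from scipy.spatial import ConvexHull

def added_measure(sample, x):
    """s_k(x) = m(H(X_k u {x})) - m(H(X_k)) for a 2-D sample."""
    return ConvexHull(np.vstack([sample, x])).volume - ConvexHull(sample).volume

def allocate(samples, x):
    """samples: list of (n_k, 2) arrays, one labeled sample per population."""
    return int(np.argmin([added_measure(S, x) for S in samples]))

rng = np.random.default_rng(2)
X1, X2 = rng.uniform(0, 1, (20, 2)), rng.uniform(2, 3, (20, 2))
print(allocate([X1, X2], np.array([0.4, 0.6])))   # allocated to population 0
```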

2. The Non-stationary Poisson Process

In order to avoid a heuristic step in the supervised classification procedure, we now generalize the basic hypothesis, namely the homogeneous Poisson process, into the hypothesis of a non-homogeneous Poisson process.
2.1. The maximum likelihood solution for the clustering problem.
We shall therefore consider that we deal with a clustering problem where the points we observe are generated by a non-homogeneous Poisson process with intensity q(.) (satisfying q(x) > 0) and are observed only in D, where D is the union of g disjoint convex domains:

D = ∪_{k=1}^g D_k,   D_k convex and disjoint.

Our point of view will again be to try to find the maximum likelihood estimate of the D_k. If x denotes the sample vector (x₁, …, x_n) with x_i ∈ ℝ^d, i = 1, …, n, its likelihood will be

F_D(x) = (1/m(D))^n · ∏_{i=1}^n 1_D(x_i) · q(x_i)

where m(D) = ∫_D q(x) dx and q(.) is the intensity of the process.

Thus, if the intensity is known (or, possibly, has been estimated), the maximum likelihood clustering solution, since the maximum likelihood estimate of a convex domain based on a sample of points inside this domain is still the convex hull of this sample, is then:
Find the g groups of points for which the sum of the intensities integrated over their convex hulls is minimal.

2.2. The associated discriminant analysis.

It is easy to reconstruct the discriminant rule associated with this (unique) non-stationary Poisson process. We just have to replace the Lebesgue measure of any convex set by the intensity integrated over this convex set.

Thus, if q_k(.) is the intensity of the process, satisfying q_k(x) > 0 ⟺ x ∈ D_k (e.g. q_k(.) = q(.)·1_{D_k}(.) in the case of a unique process on disjoint sets), we may suppose that any point is distributed on D = ∪_{k=1}^g D_k with respect to the density function

f(x) = Σ_{k=1}^g π_k f_k(x)

where f_k(x) = q_k(x) / ∫_{D_k} q_k(y) dy and π_k = ∫_{D_k} q_k(y) dy / Σ_{j=1}^g ∫_{D_j} q_j(y) dy.

As usual, the possibly convex supports D_k are estimated by the convex hulls H(X_k) of the training set points. If we denote S = Σ_{k=1}^g ∫_{H(X_k)} q_k(y) dy and s_k(x) = ∫_{H(X_k ∪ {x}) \ H(X_k)} q_k(y) dy, then the Bayesian classification rule becomes:
assign the new point x to the class k such that π_k f_k(x) = q_k(x)/(S + s_k(x)) is maximal.

When the local intensities q_k(x) do not depend on k, this rule simply consists in assigning x to the class k such that the added intensity s_k(x) is minimal. Thus, in this case and when the convex hulls are disjoint, we still keep the convex admissibility property, i.e. "if the point x belongs to the convex hull of only one class, x is assigned to this class".

3. Pattern Recognition applications


3.1. The inside/outside problem: presentation.
Suppose that X is a Poisson point process within a fixed finite window F ⊂ R^d. In
F, we have a compact convex domain D. We suppose that the Poisson process X is
homogeneous on F, with density λ. We observe a fixed number t ≥ 1 of realizations
of X in F, from which n turn out to be inside the domain D and m outside of D
(t = n + m). We want to estimate the unknown convex domain D. This problem is
indeed the third problem of Grenander (1973).

The solution we propose to the problem is the use of the discriminant rule we have
just described in cluster analysis. The situation here is quite similar, as we have two
disjoint domains D and its complement D̄ = F \ D. The main difference lies in
the non-convexity of D̄.

3.2. The discriminant analysis rule and the inside/outside problem


The same idea is applied to the inside/outside problem, with D and D̄ playing the
role of C_1 and C_2. The major difference here is that D̄ is no longer a convex set.
Let us note (x, y) = {x_1, y_1, ..., x_{n+m}, y_{n+m}} the realizations of the homogeneous
Poisson process X and the labelling variable Y: y_i = 1 if x_i ∈ D and y_i = 2
otherwise. Let y_i = 1 for i = 1, ..., n and y_i = 2 for i = n + 1, ..., n + m, without
restricting generality.
Conditionally on n and m fixed, the likelihood function for (x, y) is

L_D(x, y) = ( (1/m(D)^n) ∏_{i=1}^{n} 1_{[x_i ∈ D]} ) ( (1/m(D̄)^m) ∏_{i=n+1}^{n+m} 1_{[x_i ∈ D̄]} ),

where J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n) is the "shadow" statistic defined in Hatchel, Meilijson and Nadas (1981). It is defined as:

J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n) = ∪_{i: y_i = 2} { x_i + λ(x_i − a) ∈ F | λ ≥ 0, a ∈ H(x_1, ..., x_n) }.

(H(·), J(·|·)) is a minimal sufficient statistic for the estimation of D. See Figure 2.

It is known that J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n) has similar properties as the convex hull
statistic H(x_1, ..., x_n). It is a consistent estimate of D̄. It is robust with respect
to small changes in the location of the data points. It satisfies the equivariance
requirement and underestimates the Lebesgue measure of D̄ in the same way as
m(H(x_1, ..., x_n)) does for m(D), i.e.

E[m(J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n))] = (1 − E[V_{m+1}]/(m+1)) m(D̄),

with E[V_{m+1}] being the expected number of extreme points of J(·|·) for m + 1 observations in D̄. See Ripley and Rasson (1977) and Remon (1994).

The use of H(x_1, ..., x_n) and J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n) as estimates of the two unknown domains [here D and D̄] in the criterion proposed by Baufays and Rasson is
the key idea of our discriminant algorithm.
One then gets the following boundary of the regions allocated to D and D̄. This is
the set of points x_0 such that:

p_D f_D(x_0) = p_{D̄} f_{D̄}(x_0),

i.e. S_1(x_0) = S_2(x_0), where

S_1(x_0) = m(H(x_1, ..., x_n, x_0)) − m(H(x_1, ..., x_n))

and

S_2(x_0) = m(J(x_{n+1}, ..., x_{n+m}, x_0 | x_1, ..., x_n)) − m(J(x_{n+1}, ..., x_{n+m} | x_1, ..., x_n)).

This boundary gives us a practical and easily computable estimate D̂ for the unknown
domain D. See Figure 3 for large data set results. The symmetric difference between
D and D̂ is denoted by D Δ D̂ = (D ∪ D̂) \ (D ∩ D̂).

" '.
....
" ..

. .: .
'.' .

.' .' .

.. :.

Figure 3a: Quadratic D with m(D) = 0.25: t = 250 observations with n = 59 from
D yield D̂ with m(D̂) = 0.26 and m(D Δ D̂) = 0.018.

Figure 3b: Ellipsoidal D with m(D) = 0.30: t = 300 observations with n = 68 from
D yield D̂ with m(D̂) = 0.29 and m(D Δ D̂) = 0.011.

3.3. Properties of D̂
The estimator D̂ yields a consistent estimation of D, as it is bounded by the two consistent
estimators H(·) and J(·|·). It has a piecewise continuous boundary. Unfortunately
it happens not to be a convex set. On the other hand, this last feature
turns out to be an advantage when a similar reasoning is applied to the estimation
of non-convex domains.
This estimator D̂ is robust with respect to small changes in the location of data
points, as it is based only on the shapes of H(·) and J(·|·), which are robust in this
sense. Such a property is rare in spatial statistics. For instance, the estimator D̂
proposed by Moore et al. (1988) can be very sensitive to small changes in the location
of data points.
Let us note that the time required for the computation of D̂ does not depend on the
number of points, except for the computation of the convex hull and shadow statistics.
The amount of required CPU time is only a function of the precision asked of the
estimator D̂.
3.4. Conclusions and future research.
Our estimate for D, based on a well-known discriminant analysis criterion, seems to
be a powerful tool for pattern recognition. Moreover, it is quite straightforward to
generalize it to a non-homogeneous Poisson process.
Current research addresses the recognition of non-convex domains. The first
results seem very encouraging, as shown by the estimation of the letter A. See Figure
4, where our algorithm is compared to a discriminant rule based on the distance to
the nearest neighbour.

Figure 4a: A non-convex body to be estimated: the letter A.
Figure 4b: Our discriminant analysis algorithm.
Figure 4c: The estimation based on the distance to the nearest neighbour.

References:
Baufays, P., Rasson, J.-P. (1984): Une nouvelle règle de classement, utilisant l'enveloppe
convexe et la mesure de Lebesgue, Statistique et Analyse des Données, 2, pp. 31-47.
Baufays, P., Rasson, J.-P. (1984): Propriétés théoriques et pratiques et applications d'une
nouvelle règle de classement, Statistique et Analyse des Données, vol. 9/3, pp. 1-10.
Baufays, P., Rasson, J.-P. (1985): A new geometric discriminant rule. Computational Statistics Quarterly, vol. 2, issue 1, 15-30.
Degytar, Y.U., Finkelshtein, M.Y. (1974): Classification Algorithms Based on Construction of Convex Hulls of Sets, Engineering Cybernetics, 12, pp. 150-154.
Duda, R.O., Hart, P.E. (1973): Pattern Classification and Scene Analysis, Wiley, Chichester.
Efron, B. (1965): The Convex Hull of a Random Set of Points. Biometrika, 52, pp. 331-343.
Fisher, L., Van Ness, J.W. (1971): Admissible Clustering Procedures. Biometrika, 58, pp. 91-104.
Fukunaga, K. (1972): Introduction to Statistical Pattern Recognition, Academic Press, New York.
Grenander, U. (1973): Statistical geometry: a tool for pattern analysis. Bulletin of the
American Mathematical Society, vol. 79, 829-856.
Hand, D.J. (1981): Discrimination and Classification, Wiley, Chichester.
Hardy, A., Rasson, J.-P. (1982): Une Nouvelle Approche des Problèmes de Classification
Automatique. Statistique et Analyse des Données, 7, pp. 41-56.
Hartigan, J.A. (1975): Clustering Algorithms, Wiley, Chichester.
McLachlan, G.J. (1992): Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York.
Moore, M., Lemay, Y. and Archambault, S. (1988): Algorithms to reconstruct a convex set
from sample points. In: Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, Eds. E.J. Wegman, D.T. Gantz and J.J. Miller, ASA, Virginia,
553-558.
Rasson, J.-P. (1979): Estimation de domaines convexes du plan. Statistique et Analyse des
Données, 1, pp. 31-46.
Remon, M. (1994): The estimation of a convex domain when inside and outside observations
are available. Supplemento ai Rendiconti del Circolo Matematico di Palermo, serie II, no
35, 227-235.
Remon, M. (1996): A Discriminant Analysis Algorithm for the Inside/Outside Problem.
Computational Statistics and Data Analysis.
Ripley, B.D., Rasson, J.-P. (1977): Finding the edge of a Poisson forest. Journal of Applied
Probability, 14, 483-491.
Toussaint, G.T. (1980): Pattern Recognition and Geometrical Complexity, In: Proc. Fifth
Int. Conf. Pattern Recognition, pp. 1324-1347, IEEE.
Part II

Methodologies in Classification

• Evaluation and Assessment Procedures
• Topics in Clustering and Classification


How Many Clusters? An Investigation of Five
Procedures for Detecting Nested Cluster Structure
A. D. Gordon
Mathematical Institute, University of St Andrews,
North Haugh, St Andrews KY16 9SS, Scotland

Summary: The paper addresses the problem of identifying relevant values for the number
of clusters present in a data set. The problem has usually been tackled by searching for a
best partition using so-called stopping rules. It is argued that it can be of interest to detect cluster structure at several different levels, and five stopping rules that performed well
in a previous investigation are modified for this purpose. The rules are assessed by their
performance in the analysis of simulated data sets which contain nested cluster structure.

1. Introduction
The aim of cluster analysis is to provide informative summaries of multivariate data
sets, and in particular to investigate whether or not a set of n (say) objects can
validly be described in terms of a smaller number of clusters of objects that have the
property that objects in the same cluster are similar to one another and different from
objects in other clusters. Clustering procedures provide little guidance on ways of
addressing the problem of determining relevant values for the numbers of clusters, c
(say), present in a data set. This has long been recognized as a challenging problem;
overviews of the topic have been presented by Jain and Dubes (1988, Chapter 4),
Bock (1996) and Gordon (1996).
This paper addresses the problem of determining which values of c are most strongly
indicated as providing informative representations of the data. Published work to
date has concentrated on identifying the single most appropriate value for c, and
the test procedures and rules that have been proposed for addressing this problem
are collectively usually referred to as 'stopping rules', since investigators have often
obtained a complete hierarchical classification using an agglomerative algorithm, and
wish to have guidance on when amalgamation should cease. Many different stopping
rules have been proposed in the research literature, often with only cursory examination of their performance. The most detailed comparative study of which the author
is aware was carried out by Milligan and Cooper (1985). These authors assessed the
ability of thirty stopping rules to predict the correct number of clusters in a collection
of different randomly-generated data sets after these had been analysed using four
standard clustering criteria implemented in an agglomerative algorithm (single link,
group average link, Ward's sum-of-squares criterion, and complete link). Some of
the proposed stopping rules performed very poorly, and cannot be recommended for
further use.
Specifying a single 'best' value for c will on occasion provide a misleading representation of the cluster structure present in data. The aim of the current study is to
assess the ability of modifications of five stopping rules - those whose performance
was best in Milligan and Cooper's (1985) study - to detect when several different,
widely-separated values of c would be appropriate, that is, when structure is present
in the data at several different levels. For example, it might be valid to summarize a
data set in terms of two different, nested partitions: one into three clusters, and the
other into twelve clusters.


The remainder of the paper is organized as follows. The stopping rules, and their
modifications to allow more than one value of c to be indicated, are described in the
next section. The rules are assessed by their performance in the analysis of randomly-
generated data whose structure is known; the method in which the data are generated
is described in the third section. The final section contains a summary of the results
of analysing the data and some general comments.

2. Stopping Rules
Stopping rules can be categorized as global or local. Global rules are based on the
complete data set, typically seeking the optimal value of some index that compares
within-cluster and between-cluster variability. The partitions into clusters for two
different values of c thus need not be hierarchically-nested, although in practice they
usually will be. There is often not a natural definition of the within/between variability corresponding to the case c = 1, and such indices possess the unsatisfactory
feature of being unable to indicate that the data comprise just a single cluster.
Local rules involve an assessment of whether or not a single cluster should be subdivided into two sub-clusters (or a pair of clusters should be amalgamated). They are
thus restricted to the assessment of hierarchically-nested sets of partitions, and are
based on a subset of the data; this latter property means that the effective sample
size for the test is usually much smaller than the size of the data set.
The five rules investigated in this study are defined below, in the order in which they
were ranked in Milligan and Cooper's (1985) investigation.
1. CH. An index proposed by Calinski and Harabasz (1974), for assessing a partition
into c clusters of a set of n objects described by numeric variables, is defined by

CH = [B/(c − 1)] / [W/(n − c)],

where W and B denote, respectively, the total within-cluster sum of squared distances
(about the centroids), and the total between-cluster sum of squared distances.
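A minimal sketch of computing the CH index from numeric data and integer cluster labels (illustrative only; assumes numpy):

import numpy as np

def ch_index(X, labels):
    labels = np.asarray(labels)
    n, c = len(X), len(set(labels.tolist()))
    centroid = X.mean(axis=0)
    W = B = 0.0
    for g in set(labels.tolist()):
        Xg = X[labels == g]
        W += ((Xg - Xg.mean(axis=0)) ** 2).sum()                   # within-cluster part
        B += len(Xg) * ((Xg.mean(axis=0) - centroid) ** 2).sum()   # between-cluster part
    return (B / (c - 1)) / (W / (n - c))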
2. DH. Duda and Hart (1973, Section 6.12) proposed a local rule for deciding if a
cluster should be subdivided into two sub-clusters, based on comparing the within-cluster sum of squared distances (W_1) with the sum of within-cluster sums of squared
distances when the cluster is optimally partitioned into two (W_2). The hypothesis of
a single cluster is rejected if

W_2/W_1 < 1 − 2/(πp) − z √( 2(1 − 8/(π²p)) / (mp) ),

where p denotes the dimensionality of the data, m denotes the number of objects
in the cluster being investigated, and z is a standard normal deviate specifying the
significance level of the test. Amalgamation has generally proceeded until the hypothesis can first be rejected.
3. C. This index is based on the sum of all within-cluster pairwise dissimilarities
(D). If the partition has r such dissimilarities, D_min (resp., D_max) is defined as the
sum of the r smallest (resp., largest) pairwise dissimilarities, and

C = (D − D_min)/(D_max − D_min).
4. γ. This index, proposed by Goodman and Kruskal (1954), has been widely used
for assessing cluster output (e.g., Hubert, 1974). In the present instance, comparisons
are made between all within-cluster pairwise dissimilarities (d_ij, say) and all between-cluster pairwise dissimilarities (d_kl, say): a comparison is deemed concordant (resp.,
discordant) if d_ij is strictly less (resp., greater) than d_kl. The index is defined by

γ = (S_+ − S_−)/(S_+ + S_−),

where S_+ (resp., S_−) denotes the number of concordant (resp., discordant) comparisons.
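A sketch of the γ computation from a dissimilarity matrix (naive counting, illustrative only; fine for small n):

import numpy as np

def gk_gamma(d, labels):
    labels = np.asarray(labels)
    n = len(labels)
    within, between = [], []
    for j in range(n - 1):
        for k in range(j + 1, n):
            (within if labels[j] == labels[k] else between).append(d[j, k])
    within, between = np.array(within), np.array(between)
    s_plus = np.sum(within[:, None] < between[None, :])    # concordant comparisons
    s_minus = np.sum(within[:, None] > between[None, :])   # discordant comparisons
    return (s_plus - s_minus) / (s_plus + s_minus)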
5. Beale. A test proposed by Beale (1969) has been used as a local stopping rule
for assessing whether or not a single cluster should be sub-divided. The test involves
comparing

[ (W_1 − W_2)/W_2 ] / [ ((m − 1)/(m − 2)) 2^{2/p} − 1 ]

(where W_1, W_2, m and p are defined in the DH test above) with an F_{p,(m−2)p} distribution. As for the DH test, amalgamation proceeds until the hypothesis can first be
rejected.
Three of these five rules are global, with the single most appropriate value for c being
indicated by the maximum value of CH or γ, or the minimum value of C (with the
restriction that not all values of c are investigated, as some indices can display distracting patterns for values of c close to n (Milligan and Cooper (1985)). Such global
rules can readily be extended for use in the detection of nested cluster structure, by
recording all local optima of the index. If the correct solution comprises partitions
into c_1, c_2, ..., c_k clusters, an ideal index would have local optima at all of, and only,
these values of c. One might hope that the values taken by the index at these local
optima were also the k most extreme values, but it is possible that this may not occur
because of the inter-relatedness of structure at neighbouring values of c: thus, the
value of the index may be more extreme at (c_i + 1) clusters than at c_j clusters.
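Recording the local optima is a simple scan; a minimal sketch (for an index that is maximized, such as CH or γ; endpoints are excluded):

def local_maxima(index_values, c_values):
    hits = []
    for i in range(1, len(index_values) - 1):
        if index_values[i] > index_values[i - 1] and index_values[i] > index_values[i + 1]:
            hits.append(c_values[i])
    return hits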
The local stopping rules proposed by Beale (1969) and Duda and Hart (1973) require the specification of a significance level α or threshold value z, and Milligan and
Cooper (1985) selected values that ensured the best possible performance of these
rules. Relevant threshold values depend on characteristics of the data sets under
investigation, such as the values of n and p: for example, in the use of the DH
stopping rule on their data, Milligan and Cooper (1985) chose z to be 3.2, whereas a
similar examination for the data in the current study would specify z to be 4.0. The
need to specify a threshold, whose most appropriate value varies in this way, is an
unsatisfactory feature of a stopping rule.
The two local rules have been modified for use in the detection of nested cluster structure by abandoning their more formal hypothesis-testing aspects and just identifying
large values of the corresponding z or F statistic. The critical values of the F_{p,(m−2)p}
distribution used in Beale's (1969) test depend on p and m, but for the values of p
used in the current study, the variation is not large for even moderately small values
of m. For neighbouring values of c, the z and F statistics will usually be evaluated
on disjoint subsets of the data, and it could be argued that the relevant values of c
are indicated by the k largest values of the statistics. However, it was found in this
study that this approach generally provided inferior results to those indicated by the
local maxima of the z and F statistics (in which neighbouring values of c cannot be
indicated), and the results which are presented here are based on this latter strategy.
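For reference, the two local statistics can be sketched directly from the definitions above (the z form follows from rearranging the DH inequality; both functions are illustrative, not the study's code):

import math

def duda_hart_z(W1, W2, m, p):
    # Large values favour splitting the cluster of m objects in p dimensions
    num = 1.0 - W2 / W1 - 2.0 / (math.pi * p)
    den = math.sqrt(2.0 * (1.0 - 8.0 / (math.pi ** 2 * p)) / (m * p))
    return num / den

def beale_F(W1, W2, m, p):
    # Compared with an F distribution on (p, (m - 2)p) degrees of freedom
    return ((W1 - W2) / W2) / (((m - 1) / (m - 2)) * 2 ** (2.0 / p) - 1.0)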

3. Parameters of Simulation Study


In each of the data sets that was generated, the number of objects, n, was equal to
60. Nested cluster structure was imposed at two levels: the objects were arranged in
c_2 high-level classes {A_j} (j = 1, ..., c_2), where c_2 ∈ {2, 3, 4}; each A_j was subdivided
into h_j low-level classes {B_js} (s = 1, ..., h_j), with l_js = card(B_js) denoting the
number of objects in B_js. Attention was restricted to cases in which the number
of low-level classes c_1 = Σ_j h_j was equal to 12. The class-size distributions at each
level were specified to be 'balanced' or 'unbalanced' as defined below. The high-level
classes are said to be balanced if h_j = 12/c_2 (j = 1, ..., c_2), and unbalanced if (h_1, h_2)
= (8, 4) for c_2 = 2; (h_1, h_2, h_3) = (5, 4, 3) for c_2 = 3; and (h_1, h_2, h_3, h_4) = (4, 3, 3, 2)
for c_2 = 4. The low-level classes are said to be balanced if l_js = 5 for all j, s, and
unbalanced if l_11 = 16 and l_js = 4 for all other j, s.
The data consist of n p-dimensional normal random vectors, where p = c_2 + 12 and
the p coordinate values are independent of each other and have the same variance σ².
The classes are distinguished by their centres: the p-dimensional vector specifying
the expected value of each object belonging to B_js has only two non-zero elements,
b > 0 in position j (characterizing A_j) and 1 in position c_2 + Σ_{k<j} h_k + s.
Three different pairs of values of (b, σ²) were chosen, to allow varying degrees of spread
of points within and between the classes. Let d_l denote the distance between a pair
of objects belonging to the same low-level class, d_m denote the distance between a
pair of objects belonging to the same high-level class but different low-level classes,
and d_h denote the distance between a pair of objects belonging to different high-level
classes. The aim is to have

prob(d_l > d_m) = α = prob(d_m > d_h).

For each value of c_2 (= 2, 3, 4), a large-scale simulation study established the values
of b and σ² necessary to ensure that α = 0.00005, 0.005, and 0.01.
For each combination of all the previous factor levels, three replications were carried
out. Thus, the number of data sets generated was 3 (number of high-level classes) ×
2 (low-level balance) × 2 (high-level balance) × 3 (amount of within-class variability)
× 3 (number of replications) = 108. Each of these data sets was analysed using the
same four clustering criteria employed by Milligan and Cooper (1985), to provide 432
analyses, each of which was assessed using the five rules described in the previous
section.
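The design can be sketched as follows (an illustrative generator, not the study's code; h and l encode the class-size distributions, and b and sigma2 are the calibrated spread parameters):

import numpy as np

def generate(h, l, b, sigma2, rng):
    # h[j]: low-level classes within high-level class j; l[j][s]: size of class B_js
    c2, p = len(h), len(h) + 12
    rows, low_index = [], 0
    for j in range(c2):
        for s in range(h[j]):
            mean = np.zeros(p)
            mean[j] = b                     # position j characterizes A_j
            mean[c2 + low_index] = 1.0      # position c2 + sum_{k<j} h_k + s
            rows.append(rng.normal(mean, np.sqrt(sigma2), size=(l[j][s], p)))
            low_index += 1
    return np.vstack(rows)

# e.g. a balanced design with c2 = 3 and 60 objects:
# X = generate([4, 4, 4], [[5] * 4] * 3, b, sigma2, np.random.default_rng(1))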

4. Results and Discussion


The performances of the five rules are tabulated in Tables 1-5, in which attention has
been restricted to recording the two values of c which provided the most extreme local
optima of the corresponding index; values of c greater than 20 were not considered
in this exercise, for the reasons given earlier. The results in the tables have been
accumulated over the less variable and less interesting factors. Increasing the amount
of within-cluster variability generally degraded the performance of the rules, a result
also noted by Cooper and Milligan (1988). However, imposing a lack of balance in
the cluster sizes did not have this effect, and on occasion led to an improvement in
performance. The value of c_2 had some effect on the results, but this was greatly
outweighed by differences due to the clustering criterion employed. These differences
are not surprising, because each clustering criterion implicitly involves a model for the
data and one should not expect them to be equally effective in detecting the clusters
that have been generated. Each cell in Tables 1-5 thus contains five numbers: the
counts of times that these numbers of clusters were indicated by the rule when applied
to the dendrograms provided by the single link (SL), group average link (AL), sum-of-squares (SSQ) and complete link (CL) criteria, together with their sum (Σ). When
the two optimal values were equal, each of the relevant entries in a table was
augmented by 0.5.

Table 1. Performance of the CH rule: the figures show the frequencies with which
various values of c were indicated as the numbers of clusters, for 108 data sets
analysed by four clustering criteria. Each cell lists the SL, AL, SSQ and CL counts
and their sum Σ.

Optimal value c2:
  second optimal 12:        SL 30, AL 72, SSQ 78, CL 77;  Σ = 257
  second optimal 11 or 13:  SL 27, AL 3,  SSQ 1,  CL 3;   Σ = 34
  second optimal other:     SL 27, AL 6,  SSQ 1,  CL 1;   Σ = 35
Optimal value 12:
  second optimal c2:        SL 23, AL 27, SSQ 27, CL 27;  Σ = 104
Optimal value 11 or 13:
  second optimal c2:        SL 1,  AL 0,  SSQ 0,  CL 0;   Σ = 1
All remaining cells are zero.

Tables 1-3 summarize the results for the three global rules; correct detections appear
in one of the two cells (12, c_2) or (c_2, 12). It should be noted that only 107 results are
reported for the sum-of-squares criterion in Table 1: in one of the data sets for which
c_2 = 2, the only local optimum occurred at c = 2 (i.e., no low-level clusters were
indicated). Further, these accumulated results hide the fact that when c_2 = 2, the
CH index never achieved its maximum value (but usually achieved its second-largest
local maximum value) at 12 clusters; as c_2 increased, so did the frequency with which
c = 12 was specified as the optimal solution.

Table 2. Performance of the C rule.

Optimal value c2:
  second optimal c2:        SL 10, AL 10, SSQ 11.5, CL 11.5;  Σ = 43
  second optimal 12:        SL 2,  AL 2,  SSQ 3,    CL 2;     Σ = 9
  second optimal other:     SL 14, AL 3,  SSQ 1,    CL 1;     Σ = 19
Optimal value 12:
  second optimal c2:        SL 32, AL 39, SSQ 58.5, CL 57.5;  Σ = 187
  second optimal other:     SL 2,  AL 10, SSQ 7,    CL 7;     Σ = 26
Optimal value 11 or 13:
  second optimal c2:        SL 8,  AL 12, SSQ 7,    CL 8;     Σ = 35
  second optimal other:     SL 2,  AL 6,  SSQ 5,    CL 6;     Σ = 19
Optimal value other:
  second optimal c2:        SL 28, AL 12, SSQ 5,    CL 5;     Σ = 50
  second optimal 12:        SL 1,  AL 2,  SSQ 3,    CL 4;     Σ = 10
  second optimal 11 or 13:  SL 1,  AL 3,  SSQ 2,    CL 2;     Σ = 8
  second optimal other:     SL 8,  AL 9,  SSQ 5,    CL 2;     Σ = 24
All remaining cells are zero.

The results for the CH rule are highly encouraging: the two largest local maxima
have indicated the correct values of c_1 and c_2 in more than 90% of the cases for
three of the clustering criteria; further, the numbers of times for which these are the
only local maxima are 79 (group average link), 99 (sum-of-squares), and 98 (complete
link), and no more than three local maxima are ever indicated for clusters provided
by the latter two clustering criteria.

Table 3. Performance of the γ rule.

Optimal value c2:
  second optimal c2:        SL 9,  AL 11, SSQ 12.5, CL 13.5;  Σ = 46
  second optimal 12:        SL 3,  AL 2,  SSQ 4,    CL 3;     Σ = 12
  second optimal other:     SL 15, AL 6,  SSQ 1,    CL 2;     Σ = 24
Optimal value 12:
  second optimal c2:        SL 33, AL 41, SSQ 54.5, CL 56.5;  Σ = 185
  second optimal other:     SL 1,  AL 7,  SSQ 9,    CL 6;     Σ = 23
Optimal value 11 or 13:
  second optimal c2:        SL 9,  AL 10, SSQ 7,    CL 7;     Σ = 33
  second optimal other:     SL 1,  AL 6,  SSQ 4,    CL 5;     Σ = 16
Optimal value other:
  second optimal c2:        SL 28, AL 11, SSQ 6,    CL 6;     Σ = 51
  second optimal 12:        SL 1,  AL 5,  SSQ 4,    CL 5;     Σ = 15
  second optimal 11 or 13:  SL 1,  AL 2,  SSQ 2,    CL 1;     Σ = 6
  second optimal other:     SL 7,  AL 7,  SSQ 4,    CL 3;     Σ = 21
All remaining cells are zero.

The results for the C and γ rules are summarized in Tables 2 and 3 respectively. For
these indices, there is less variation in the results for different values of c_2. When
one data set comprising 3 and 12 clusters was analysed using single link and group
average link, both of the indices obtained their optimal value when c = 3, 12 and 13;
these results were interpreted as a successful outcome, and the relevant numbers in
the (c_2, 12) and (12, c_2) cells were each augmented by 0.5.

Table 4. Performance of the DH rule.

Optimal value c2:
  second optimal c2:        SL 23, AL 16, SSQ 5,  CL 5;   Σ = 49
  second optimal 12:        SL 23, AL 25, SSQ 15, CL 16;  Σ = 79
  second optimal other:     SL 60, AL 60, SSQ 86, CL 85;  Σ = 291
Optimal value other:
  second optimal 11 or 13:  SL 1,  AL 0,  SSQ 0,  CL 0;   Σ = 1
  second optimal other:     SL 1,  AL 7,  SSQ 2,  CL 2;   Σ = 12
All remaining cells are zero.
The performances of the C and γ rules are very similar, each correctly specifying
both values of c_1 and c_2 for just under two-thirds of the data sets analysed by the
sum-of-squares and complete link clustering criteria; this proportion rises to nearly
three-quarters when near misses (specifying c_1 = 11 or 13, instead of 12) are included.
However, there is an increase in the mean number of local optima of the rules, and
values of both c_1 and c_2 are incorrectly identified about 10% of the time.
The results for the two local rules are summarized in Tables 4 and 5. It can be seen
that the modification of Duda and Hart's (1973) rule has proved very effective in
identifying the smaller number of clusters, but poor at detecting the 12 clusters that
are present. The modification of Beale's (1969) rule has performed very poorly. Both
of these rules have a tendency to provide a large number of local maxima. They
would appear to have little to offer in this kind of investigation.

Table 5. Performance of the Beale rule.

Optimal value c2:
  second optimal c2:        SL 1,  AL 7,  SSQ 2,  CL 3;   Σ = 13
  second optimal 12:        SL 12, AL 8,  SSQ 7,  CL 8;   Σ = 35
  second optimal other:     SL 34, AL 26, SSQ 36, CL 33;  Σ = 129
Optimal value 12:
  second optimal c2:        SL 0,  AL 10, SSQ 1,  CL 1;   Σ = 12
  second optimal other:     SL 5,  AL 1,  SSQ 2,  CL 1;   Σ = 9
Optimal value 11 or 13:
  second optimal c2:        SL 9,  AL 11, SSQ 5,  CL 5;   Σ = 30
  second optimal other:     SL 7,  AL 5,  SSQ 3,  CL 3;   Σ = 18
Optimal value other:
  second optimal c2:        SL 23, AL 22, SSQ 30, CL 30;  Σ = 105
  second optimal 12:        SL 5,  AL 4,  SSQ 5,  CL 6;   Σ = 20
  second optimal 11 or 13:  SL 2,  AL 6,  SSQ 3,  CL 3;   Σ = 14
  second optimal other:     SL 10, AL 8,  SSQ 14, CL 15;  Σ = 47
All remaining cells are zero.

By contrast, the three global rules, and in particular the CH rule, would seem to have
considerable potential for the detection of nested cluster structure. However, enthusiasm should be tempered by two observations. First, as in the Milligan and Cooper
(1985) study, the cluster structure present in the simulated data sets was reasonably
clear-cut. Secondly, in both the current investigation and that conducted by Milligan
and Cooper (1985), the simulated clusters were generated using multivariate normal distributions (mildly truncated, in Milligan and Cooper's (1985) study). Several
authors (e.g., Scott and Symons, 1971) have noted reasons why the sum-of-squares
criterion is particularly relevant for analysing such data, and one can speculate that
a rule based on total within- and between-cluster sums of squares like the CH index
might be better able to detect such clusters than clusters of other shapes. Nevertheless, further support has been provided for the rules that performed well in this
study, and - until presented with evidence to the contrary - one can recommend
their collective use to applied scientists seeking to understand the underlying cluster
structure in their data.

References

Beale, E. M. L. (1969): Euclidean cluster analysis. Bulletin of the International Statistical
Institute, 43(2), 92-94.
Bock, H. H. (1996): Probability models and hypotheses testing in partitioning cluster analysis. In Clustering and Classification, Arabie, P., Hubert, L. J. and De Soete, G. (eds.),
377-453, World Scientific, River Edge, NJ.
Calinski, T. and Harabasz, J. (1974): A dendrite method for cluster analysis. Communications in Statistics, 3, 1-27.
Cooper, M. C. and Milligan, G. W. (1988): The effect of measurement error on determining
the number of clusters in cluster analysis. In Data, Expert Knowledge and Decisions, Gaul,
W. and Schader, M. (eds.), 319-328, Springer-Verlag, Berlin.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis. Wiley,
New York.
Goodman, L. A. and Kruskal, W. H. (1954): Measures of association for cross-classifications.
Journal of the American Statistical Association, 49, 732-764.
Gordon, A. D. (1996): Cluster validation. Paper presented at IFCS-96 Conference, Kobe,
27-30 March, 1996.
Hubert, L. (1974): Approximate evaluation techniques for the single-link and complete-link
hierarchical clustering procedures. Journal of the American Statistical Association, 69,
698-704.
Jain, A. K. and Dubes, R. C. (1988): Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Milligan, G. W. and Cooper, M. C. (1985): An examination of procedures for determining
the number of clusters in a data set. Psychometrika, 50, 159-179.
Scott, A. J. and Symons, M. J. (1971): Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387-397.
Partitional Cluster Analysis with Genetic Algorithms:
Searching for the Number of Clusters¹
J. A. Lozano, P. Larrañaga and M. Graña
Dept. of Computer Science and Artificial Intelligence
University of the Basque Country
P.O. Box 649, 20080 San Sebastián, Spain
e-mail: [email protected]

Summary: In this article we deal with the problem of searching for the number of clusters
in partitional clustering in R². We set up the problem as an optimization problem by giving
a real function on the different partitions that is optimized when the number of clusters
and the classes are the most natural. We use a Genetic Algorithm for optimizing this
function. The algorithm has been applied to the well-known Ruspini data and to synthetically generated datasets, with different cluster numbers and underlying distributions. The
results are encouraging.

1. Introduction
Cluster Analysis (Hartigan (1975); Everitt (1974); Jain and Dubes (1988)) is an important technique in the field of exploratory data analysis. It is a tool for grouping a
set of objects into classes such that 'similar' ones are in the same class and 'different'
ones in different classes. Cluster analysis explores the data known about the objects
to be classified and tries to uncover the underlying structure without requiring the
assumptions common to most classical statistical methods. Two main different types
of clustering methods exist: hierarchical methods, which result in a nested sequence
of partitions, and partitional methods, which give one single partition.
Genetic Algorithms (G.A.'s) (Goldberg (1989)) are probabilistic search algorithms
which simulate natural evolution. They are based on the mechanics of natural selection and genetics. They combine 'survival of the fittest' among string structures with
a structured yet randomized information exchange. In G.A.'s the search space of a
problem is represented as a collection of individuals. The individuals are represented
by character strings, which are referred to as chromosomes. The purpose is to find
the individual from the search space with the best 'genetic material'. The quality of
an individual is measured with an objective function. The part of the search space to
be examined in each iteration is called the population. A G.A. works approximately
as follows. First, the initial population is chosen at random, and the quality of each
of its individuals is determined. Next, in every iteration parents are selected from
the population. These parents produce children, which are added to the population.
All newly created individuals of the resulting population 'mutate' with a probability
near zero, i.e. they change their hereditary distinctions. The population is reduced
to its initial size by removing some individuals from it according to some selection
criterion. One iteration of the algorithm is referred to as a generation.

¹This work is supported by the Diputación Foral de Gipuzkoa, under grant 95/1127 and by the
Basque Government, under grant PI 94/78.


Some attempts to solve the clustering problem with G.A.'s have already been made.
Krovi (1991) described the different aspects of designing a G.A. for cluster analysis
and explained how a set of objects can be grouped into two clusters using binary
strings. Bhuyan et al. (1991) developed a G.A. for the partitioning of n objects into
k clusters, where 1 ≤ k ≤ n and k is given. They started by considering three different representations for their individuals but finally decided on the so-called ordered
representation. Their preliminary experimental results reflected the superiority of
the genetic algorithms over the known heuristic methods. Cucchiara (1993) showed
the effectiveness of G.A.'s in clustering problems in image analysis. Jones and
Beltramo (1993) used integer encoding with the application of an operator used in the
travelling salesman problem, while Bezdek et al. (1994) used three different distances.
Babu and Murty (1994) did not tackle the clustering problem with G.A.'s but with
evolution strategies, another type of algorithm based on the principles of natural
selection. In all of the research that is mentioned above, the number of clusters into
which to group the objects is supposed to be given.
Yet, some research has been carried out on the problem of the optimal number of
clusters, using methods based on entropy-based statistical complexity criteria (Celeux
and Soromenho (1993); Bozdogan (1994)) and statistical tests (Hardy (1994); Rasson
and Kubushishi (1994); Gordon (1995)).
Our research on the use of G.A.'s for cluster analysis focuses upon the search for the
optimal number of clusters. We want to develop an algorithm that automatically
classifies the objects into an adequate number of clusters without this number being
specified. The main problem in the development of such an algorithm is the definition
of an evaluation function that makes it possible to compare the fitness of clusterings
consisting of a distinct number of clusters. Other difficult steps are the selection of
a suitable clustering representation and the development of the operators that define
the mutation and offspring production processes. An ongoing study about that using
G.A.'s can be seen in Luchian et al. (1994). We have carried out experiments with
five artificial data sets and with the well-known Ruspini data set, in order to test
our clustering method.

2. Setting up the problem as a combinatorial optimization problem
First of all, we set up the problem. A set X = {x_1, x_2, ..., x_n} of n two-dimensional
real objects is given; the set P_k(X) denotes all partitions of the set X in k nonempty
classes. We want to define a function

F: ∪_{k=1}^{n} P_k(X) → R                                        (1)

such that the global optimum of the function F will be found in the number of clusters
k and in the groups that are the most natural. It is important to note that the size of
the search space ∪_{k=1}^{n} P_k(X) can be expressed by the following expression (Bhuyan
et al. (1991)):

Σ_{k=1}^{n} (1/k!) Σ_{j=1}^{k} (−1)^{k−j} (k choose j) j^n.      (2)

In order to define this function we need to think about the characteristics that define
a natural cluster. An important characteristic is that there are no big empty
spaces inside a cluster (assuming that the cluster is not a ring), so we divide the
space that is occupied by the objects into small squares, all of these squares having
the same area. We check whether the squares are empty or not. Each empty square
is assigned a value of one, and a value of zero is assigned to non-empty squares.
With this grid it is possible to assign a real value to every partition of the set
X in each number of clusters. Given a partition {X_1, X_2, ..., X_k}, the value given to
it is the sum of the values given to each cluster (the algorithm thus lends itself to
parallelization). For a cluster we calculate its convex hull and then we
sum the values of the squares whose centres are inside the convex hull. If we denote by
H(X_i) the convex hull of the class X_i, by V(x, y) the value assigned to a square
with centre (x, y), and by C the set of centres of squares, an initial approximation to the
function can be written as follows:

F*({X_1, X_2, ..., X_k}) = Σ_{i=1}^{k} Σ_{(x,y) ∈ C ∩ H(X_i)} V(x, y).      (3)

However this function is not capable of distinguishing between the optimal partition,
which gives zero to the former function, and the partition that can be constructed
by splitting one of the clusters of the previous partition into two. Because of this,
we need to add to the preceding function a value Q × k, where Q denotes a positive
real number and k specifies the number of clusters. The final function is:

F({X_1, X_2, ..., X_k}) = F*({X_1, X_2, ..., X_k}) + Q × k.      (4)

At first sight the value of Q does not play an important role, and the only constraint
is Q < 1. The reason is that for Q ≥ 1 our function could assign a lower value to
a partition that has an empty square inside rather than to a partition with one more
cluster and without empty squares inside, which would be the natural partition. Later
we will see that the value of Q can be important in special cases.
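A sketch of evaluating (3)-(4) for a given partition is shown below (illustrative only, not the authors' implementation; it assumes scipy and uses a Delaunay triangulation for the point-in-hull test; `clusters` is a list of (n_i, 2) arrays and `r` the side of the grid squares):

import numpy as np
from scipy.spatial import Delaunay

def objective(clusters, r, Q=0.5):
    pts = np.vstack(clusters)
    x0, y0 = pts[:, 0].min(), pts[:, 1].min()
    xs = np.arange(x0, pts[:, 0].max() + r, r)
    ys = np.arange(y0, pts[:, 1].max() + r, r)
    centres = np.array([(x + r / 2, y + r / 2) for x in xs for y in ys])
    # V(x, y): 1 for a square containing no data point, 0 otherwise
    occupied = set((int((px - x0) // r), int((py - y0) // r)) for px, py in pts)
    V = np.array([(int((cx - x0) // r), int((cy - y0) // r)) not in occupied
                  for cx, cy in centres], dtype=float)
    total = 0.0
    for X_i in clusters:
        if len(X_i) >= 3:                       # a hull needs at least 3 points
            inside = Delaunay(X_i).find_simplex(centres) >= 0
            total += V[inside].sum()            # empty squares inside H(X_i)
    return total + Q * len(clusters)            # expression (4)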
Finally, there is another question left to answer: what is the size of the
squares? This is the key question in our approach. To calculate it we have used a
simple approximation that works, as we will see later, well enough. As we do not have
any information about the points, we are going to assume that the points have been
generated at random, following a uniform distribution. Then if the natural structure
is just one cluster and we want to discover it, we must not find an empty square in
the convex hull formed by all the points. This is because it would otherwise be possible to split
off two or more clusters in such a way that the empty square would not be in any cluster,
so that the optimum value of the objective function could be found in the partition in
two or three clusters. Figure 1 shows that the square marked with an arrow has a
value of 1, so the value of the objective function with one cluster is 1 + Q × 1, while the
value of the function with two clusters is 0 + Q × 2. Hence we are going to choose
the size of the squares such that the probability of finding such a square will be quite
small, in our case 0.001, i.e.

(1 − r²/S)^n ≤ 0.001,      (5)

where r is the size of the square, S is the area of the convex hull and n is the number
of points. We have taken in each experiment the smallest r that complies with the
constraint.
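Solving (5) for the smallest compliant r gives a closed form; a one-line sketch:

def square_side(S, n, eps=0.001):
    # (1 - r^2/S)^n <= eps  <=>  r >= sqrt(S (1 - eps^(1/n)))
    return (S * (1.0 - eps ** (1.0 / n))) ** 0.5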

3. The Genetic Algorithm

Fig. 1: Justification of the size of the square r.

The kind of G.A. that we have used is the so-called Steady-State Genetic Algorithm
(SSGA) (Whitley and Kauth (1988)). A pseudocode for this algorithm is:

begin SSGA
  Create initial population at random
  WHILE NOT stop DO
  BEGIN
    Select two parents from the population
    Let the selected parents produce a child
    Mutate the child with certain probability
    Extend the population by adding the child to it
    Reduce the extended population to the original size
  END
  Output the optimum of the population.
end SSGA

In this algorithm a population of size λ is maintained. In each step of the algorithm
two individuals of the population are chosen with a probability proportional
to their fitness; a series of operators, usually crossover operators and mutation
operators, are applied to produce a new individual. The fitness function is
evaluated for the new individual, and the individual is introduced into the population
if its fitness value is better than that of the worst individual in the population.
This cycle continues until a stopping criterion is met (convergence, number of iterations, ...).
An important point is that this algorithm converges to the optimum, unlike the
classical G.A. (Lozano et al. (1995)). Once the kind of algorithm
we use is established, we need to define its parameters. The first
point to define is the representation of the individuals. In the classic G.A. the
individuals used to be binary strings; however, we are going to use strings of integers.
An individual of the G.A. is a string of size n (the number of points) which contains a
permutation of the numbers 1, 2, ..., n. We want an individual of the population to
be a possible partition of the n objects in k clusters, k being each value between 1 and n.
Thus we decode each string of integers as follows: for each k, the points represented by
the first k integers of the permutation are taken as members and centres of a cluster,
and the next numbers are added in order to the cluster whose centre is nearest to
the represented point. Once a point is added to a cluster, the centre of this cluster
changes to the centre of gravity of the points in the cluster. With this decoding,
every permutation of the n numbers represents a partition of the n points for
every value of the number of clusters k.
Of course we now have another problem, that is, which value to assign to a permutation
of the numbers of objects. The former function assigns a value to each partition
of n objects in k clusters; however, every permutation represents a partition for every
value of k. We solve this problem by giving to each individual the smallest
value that the function takes over the different partitions for k = 1, 2, ..., n. Taking
this evaluation of each permutation into account, our algorithm can be seen
as a hybrid G.A. where a local optimizer is applied in each evaluation.
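The decoding can be sketched as follows (illustrative, assuming 0-based indices and numpy; the fitness of a permutation is then the minimum over k of the objective sketched earlier):

import numpy as np

def decode(points, perm, k):
    # First k points of the permutation seed the clusters; the rest join the
    # nearest centre, which is updated to the centre of gravity of its cluster.
    centres = [points[perm[i]].astype(float) for i in range(k)]
    members = [[perm[i]] for i in range(k)]
    for idx in perm[k:]:
        j = int(np.argmin([np.linalg.norm(points[idx] - c) for c in centres]))
        members[j].append(idx)
        centres[j] = points[members[j]].mean(axis=0)
    return [points[m] for m in members]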
The second point to note is the kind of operators that can be applied to the individuals
(strings of integers) to produce new individuals. Hence we
have studied the kind of operators that have been used in G.A.'s with permutation
representations. Most of the work on this sort of representation has been
directed to the design of genetic operators to solve the travelling salesman problem
with a path representation. We have some experience in applying these operators
to other research fields (Larrañaga et al. (1996)). The crossover operators that we
have used are: CX (cycle crossover), ER (edge-recombination crossover), OX1 (order
crossover) and PMX (partially mapped crossover). As mutation operators we have used:
SM (scramble mutation), SIM (simple inversion mutation), ISM (insertion mutation),
IVM (inversion mutation), EM (exchange mutation), DM (displacement mutation).
Finally, there remain the parameters of the G.A., i.e., size of the population, probability
of mutation, probability of crossover, and stopping criteria. These will be
discussed in the next section.

4. The experimental results

We have carried out ten experiments with each pair of operators and each of the six
datasets (see Figure 2): the Ruspini dataset and five other generated datasets. The
features of the five generated data sets can be seen in Table 1.

    data set   points   clusters   underlying dist.
    1          1000     1          uniform
    2           400     1          uniform
    3           800     3          uniform
    4          1000     5          uniform
    5          1000     5          normal

Tab. 1: The parameters of the datasets.
Every experiment has been carried out with the GENITOR software (Whitley and
Kauth (1988)) on a SparcServer 630 MP. The parameters we used are a population
size λ = 20, a mutation probability p_m = 0.1, and we used as a stopping rule
the following criterion: if the mean function value of the population did not change
in 20 generations we stopped the algorithm.
The previous parameters were set up when we carried out the experiments with the
Ruspini and the first of the generated datasets; they had a big influence on the results
obtained with the other datasets.
Fig. 2: The Datasets

obtained with the other datasets.


The search spaces are for the Ruspini data, dataset 1 and 2, dataset :3, and dataset 4
and ·5: 1.178 x 106°.2.7-5.5 X 10 393 ,2.7.5.5 X 10 793 ,2.7.55 X 10993 . It is irnportallt to keep
the size of these spiLces in mind to realize how quickly the G.A. finds the optimum.
In the case of generated data sets we onl._ search up to 10 clusters because in the
other case the seMch would be impossible because of the huge space and the time to
emluate the function.
The number of clu::;ters obtained in eilch evaluatioll, each operator and each dataset
can be seen in Table 2. For instance, with the operator ex in the dataset 3, we
reached 4 clusters in 4 e\'aluations, -5 in -56 evaluations and 6 clusters in 0 evaluations.

" Ruspilli I Data 1 I Data 2 I Data 3 I Data 4 " Data .s "


elus. " 4 I .5 I 4 I 4 I -5 I 6" .5 I 6 I 7 " -5 I 6 I 7 I8 I9 "
ex 60 0 60 60 4 56 0 59 1 0 0 56 4 0 0
ER GO 0 60 60 4 56 0 55 5 0 3 36 21 0 0
OXI GO 0 60 60 4 .56 0 .59 1 0 5 41 14 0 0
P~[X 59 1 60 60 4 5'i 2 46 11 3 8 31 17 3 1
S\[ 40 0 40 40 2 37 1 36 3 1 4 30 5 0 1
SI\1 39 1 40 40 .J 34 1 36 4 0 0 27 13 0 0
IS\[ 40 0 JO 40 0 40 0 3.5 4 1 1 31 8 0 0
IV~[ ·10 0 40 40 ·5 35 0 36 4 0 3 27 10 0 0
Dl 40 0 40 40 1 :39 0 38 2 0 4 23 12 1 0
0\[ 40 0 40 ·10 :3 4·j 0 38 1 1 4 26 8 2 0
Tab. 2: Results of the experiments.
A Kruskal-Wallis test has been applied to the results.

The following analysis has been made about the experiments:

Ruspini data. Every operator reaches the optimum without problems. It can be seen
that there is a statistically significant difference between the operators in relation
to the number of evaluations; PMX is the fastest.

Dataset 1. Again, this has been an easy problem for our approach. The operator PMX
is the fastest (84.46 function evaluations on average against the worst, 154.5).

Dataset 2. It has proved a difficult problem for our approach, but there is a way of
solving it: choose a parameter r (size of the square) in such a way that only an
empty square would be inside the ring, and a parameter Q bigger than 0.5.
Of course, it is not a natural approach.

Dataset 3. The result is not very good, but this is a problem of the small parameter
values used in the algorithm. In this problem the objective function gets the
optimum in three clusters, but the way in which we decode the permutation
makes it difficult for the algorithm to find the optimum. However, we have
carried out other experiments with this dataset where the optimum was reached.
If we take the operators into account, the PMX operator continues to be the
fastest (96.15 function evaluations) and the slowest is ER (154.7).

Dataset 4. This is the first dataset for which we find some difference between the
operators with respect to the objective function. The operator CX is the best,
i.e., it finds the correct number of clusters more times than the other operators.
In relation to speed, PMX again is the fastest.

Dataset 5. The results are not very good because of our way of choosing r (5). Some
more experiments with a small change in the parameter r allow our algorithm
to reach the correct number of clusters in nearly every execution. Again the
best operator in relation to the objective function is CX, and in relation to
the number of function evaluations PMX (119.7) is the best and the worst
is OX1 (227.1).
5. Conclusion and Future Work
We have given an algorithm that searches for the number of clusters in partitional
cluster analysis and at the same time finds the most natural classes. Moreover our
algorithm is very flexible in the sense that it could be used to find only the classes
given the number of clusters, or to find the optimum number of clusters between
given possible values. The results are encouraging but it is important to note the
dependence of our algorithm on the parameter r. Some more experiments changing
this parameter seem to be a good way to continue this work. Another
obvious step in our future work could be to generalize our approach to objects in R^d
spaces. This has a problem: while the size of the search space does not change,
the evaluation function is more expensive; the fundamental step, computing the convex
hull, is much more complicated in R^d.
In addition we plan to apply G.A.'s to hierarchical clustering and to the more modern
pyramidal clustering.

Acknowledgements
Thanks to Prof. A. Hardy for providing software and references.

References:
Babu, G.P. and Murty, M.N. (1994): Clustering with evolution strategies, Pattern Recognition, 27, 2, 321-329.
Bezdek, J.C. et al. (1994): Genetic Algorithm Guided Clustering, In: Proc. of The First
IEEE Conference on Evolutionary Computation, 34-40.
Bozdogan, H. (1994): Choosing the number of clusters, subset selection of variables, and
outlier detection in the standard mixture-model cluster analysis, In: Diday E., Lechevallier
Y., Schader M., Bertrand P. and Burtschy B. (eds.), New Approaches in Classification and
Data Analysis, Springer-Verlag, 169-177.
Bhuyan, J.N. et al. (1991): Genetic Algorithms for clustering with an ordered representation, In: Belew and Booker (eds.), Proceedings of the Fourth International Conference on
Genetic Algorithms, 408-415, Morgan Kaufmann.
Celeux, G. and Soromenho, G. (1993): An entropy criterion for assessing the number of
clusters in a mixture model, Technical Report 1874, INRIA, France.
Cucchiara, R. (1993): Analysis and comparison of different genetic models for the clustering
problem in image analysis, In: Albrecht R.F., Reeves C.R. and Steele N.C. (eds.), Artificial
Neural Networks and Genetic Algorithms, Springer-Verlag, 423-427.
Everitt, B.S. (1974): Cluster Analysis, John Wiley & Sons, Inc.
Goldberg, D.E. (1989): Genetic Algorithms in Search, Optimization, and Machine Learning,
Addison-Wesley.
Gordon, A.D. (1995): Tests for assessing clusters, Statistics in Transition, 2, 207-217.
Hardy, A. (1994): An examination of procedures for determining the number of clusters in
a data set, In: Diday E., Lechevallier Y., Schader M., Bertrand P. and Burtschy B. (eds.),
New Approaches in Classification and Data Analysis, Springer-Verlag, 178-185.
Hartigan, J.A. (1975): Clustering Algorithms, John Wiley & Sons, New York.
Jain, A.K. and Dubes, R.C. (1988): Algorithms for Clustering Data, Prentice Hall.
Jones, D.R. and Beltramo, M.A. (1993): Solving partitioning problems with Genetic Algorithms, In: Albrecht R.F., Reeves C.R. and Steele N.C. (eds.), Artificial Neural Networks
and Genetic Algorithms, Springer-Verlag, 423-427.
Krovi, R. (1991): Genetic Algorithms for clustering: A preliminary investigation, In: Proceedings of the Twenty-Fifth International Conference on System Sciences, 4, 540-544.
Larrañaga, P. et al. (1996): Learning Bayesian Network Structures by Searching for the
Best Ordering with Genetic Algorithms, IEEE Transactions on Systems, Man and Cybernetics, 26, 4. In press.
Lozano, J.A. et al. (1995): Genetic Algorithms: Bridging the Convergence Gap, submitted
to Evolutionary Computation.
Luchian, S. et al. (1994): Evolutionary automated classification, In: Proc. of The First
IEEE Conference on Evolutionary Computation, 585-589.
Rasson, J.P. and Kubushishi, T. (1994): The gap test: an optimal method for determining
the number of natural classes in cluster analysis, In: Diday E., Lechevallier Y., Schader M.,
Bertrand P. and Burtschy B. (eds.), New Approaches in Classification and Data Analysis,
Springer-Verlag, 186-193.
Whitley, D. and Kauth, J. (1988): Genitor: A different Genetic Algorithm, In: Proceedings
of the Rocky Mountain Conference on Artificial Intelligence, 2, 189-214.
Explanatory Variables in Classifications and
the Detection of the Optimum
Number of Clusters
János Podani

Department of Plant Taxonomy and Ecology
Loránd Eötvös University, Ludovika tér 2
H-1083 Budapest, Hungary
Fax: +36 1 1338764. Email: [email protected]

Summary: An ordinal approach to the a posteriori evaluation of the explanatory power of
variables in classifications is proposed. The contribution of each variable is assessed in a way
fully compatible with the distance or dissimilarity function used in the clustering process. Then,
a simple ranking-based measure is applied to express the relative agreement or disagreement of
variables with a given partition. This measure treats all variables equally, no matter how
influential they were when the classification was actually created. The sum of measures for all
variables reflects their overall agreement and can be used to select an optimal partition from a
hierarchical classification.

1. Introduction
An integral part of the interpretation of clustering results is to evaluate how the individual
variables explain the classes. Finding an order of importance of variables for an existing
classification is often called a posteriori feature selection (cf. Dale et al. 1986), as
opposed to a priori feature selection, when the variables are ranked before the analysis
starts (e.g., Orlóci 1973, Stephenson and Cook 1980), and to forward selection, in which
evaluation of variables is part of the algorithm (e.g., Jancey and Wells 1987, Fowlkes et
al. 1988).
Attention in a posteriori feature selection may be focused on two fundamental aspects of
classification: cluster cohesion and separation (sensu Gordon 1981). The analysis can be
restricted to either of these aspects (e.g., to contributions to within-cluster sum of squares
only). Alternatively, the effect of variables on the distinction between clusters as well as
on the internal "homogeneity" of clusters is simultaneously incorporated in the study,
even if the clustering method did not actually consider both. A simple possibility which
comes to mind first is to compute for each variable the ratio of within-group and
between-group sum of squares as an index of explanatory power.
It is emphasized, however, that there is no point in examining cluster cohesion and
separation in terms of sums of squares when, say, the starting matrix contained chord
distances or percentage dissimilarity values and the algorithm was single or complete
linkage sorting. In other words, the evaluation procedure has to be compatible with the
distance coefficient used in creating the classification. Since in many fields of science,
e.g., in biological taxonomy, relatively few classifications are based on sums of squares or
variance, and often the clustering models are not even Euclidean, a more generally
applicable, yet flexible, criterion is required. Godehardt's (1990) multigraph approach, in
which each variable is treated independently, seems to satisfy this requirement.
The third point emphasized here is that the importance of variables may be judged in two
ways. The more obvious one is the measurement of the absolute effect of each variable
upon the creation of clusters. For example, in the case of Euclidean distance and centroid
clustering, we can examine how far apart the cluster centroids are for each variable, and
then order the variables on this basis. This ordering will emphasize variables that
dominated the classification process, and may neglect others that are equally if not more
interesting for the a posteriori interpretation of clusters. In fact, any variable supporting
the given partition may prove useful in subsequent descriptions, no matter how small this
support is in absolute terms. Thus, an alternative procedure free from the implicit variable
weighting, i.e., measurement of the relative importance of the variable, may prove
useful. Lance and Williams (1977) are early proponents of this approach, by suggesting
taking the ratio of between-cluster sum of squares to the total for each continuous
variable, or computing Cramér's index (see also Anderberg 1973) for each binary or
multistate variable. These criteria are thus only data-type-dependent and do not consider
the manner in which the dissimilarities were calculated. The procedure described in this
paper avoids implicit weighting by introducing an ordinal measure of the explanatory
power of variables in non-hierarchical classifications. This measure also satisfies the
requirement of being compatible with the dissimilarities used and relies equally on both
cluster separation and cohesion.
I will also provide an alternative approach to the familiar problem of detecting the
optimum number of clusters. The sum of measures of explanatory power for all variables
will be defined as an overall measure of the agreement (in a sense: consensus) among
the variables regarding the partition of m objects into t clusters. Plotting the sum over a
reasonable range of t values provides a graphical means to find the optimum, if any. This
approach, as will be seen, is radically different from most of the methods reviewed and
compared by Milligan and Cooper (1985).

2. Variable contributions
The procedure starts with evaluating the contribution of each variable to the distances or dissimilarities between objects. To ensure compatibility, the determination of this contribution must be specific to the distance or dissimilarity function used. As an example, the total contribution of variable i to all the z = m(m−1)/2 values in the lower semimatrix of D² containing the squared Euclidean distances for m objects is computed as

$$\Phi_i = \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} g_{ijk},$$

where g_ijk = (x_ij − x_ik)² is the contribution of variable i to d²_jk and is written as an element of matrix G_i. The contributions are strictly additive; the matrix of squared distances is therefore reproduced as

$$D^2 = \sum_{i=1}^{n} G_i,$$
with n as the number of variables. Formulae for computing contributions have been derived and are presented without proofs for 15 other distance and dissimilarity measures (Table 1). The measures themselves are not shown here, because most of them are well known from the clustering literature. A full list is found in Podani (1994); the reader may also consult Anderberg (1973), Sneath and Sokal (1973) and Orlóci (1978).
Tab. 1: Contribution of variable i to d_jk for several well-known dissimilarity and distance functions. For presence/absence coefficients we assume that x_ij = 1 for presence and x_ij = 0 for absence. Contributions are ranked in ascending order for most measures, except for those marked with an *, for which ranking is the reverse (see text).

Euclidean distance: $(x_{ij} - x_{ik})^2$

Manhattan distance, 1 − simple matching coeff.: $|x_{ij} - x_{ik}|$

Penrose SIZE: $x_{ij} - x_{ik}$

Chord distance *: $\dfrac{x_{ij}\, x_{ik}}{\sqrt{\sum_h x_{hj}^2}\,\sqrt{\sum_h x_{hk}^2}}$

Canberra metric: $\dfrac{|x_{ij} - x_{ik}|}{|x_{ij}| + |x_{ik}|}$

Percentage difference, 1 − Sørensen: $\dfrac{|x_{ij} - x_{ik}|}{\sum_{h=1}^{n} (x_{hj} + x_{hk})}$

1 − Ruzicka, 1 − Jaccard: $\dfrac{|x_{ij} - x_{ik}|}{\sum_{h=1}^{n} \max[x_{hj}, x_{hk}]}$

1 − Similarity ratio *: $\dfrac{x_{ij}\, x_{ik}}{\sum_{h=1}^{n} (x_{hj}^2 + x_{hk}^2 - x_{hj} x_{hk})}$

1 − Russell–Rao: $(1 - x_{ij} x_{ik})/n$

1 − Rogers–Tanimoto: $\dfrac{2|x_{ij} - x_{ik}|}{n + \sum_{h=1}^{n} |x_{hj} - x_{hk}|}$

1 − Sokal–Sneath: $\dfrac{2|x_{ij} - x_{ik}|}{\sum_{h=1}^{n} \max[x_{hj}, x_{hk}] + \sum_{h=1}^{n} |x_{hj} - x_{hk}|}$

1 − Anderberg *: $\dfrac{1}{4}\left[\dfrac{x_{ij} x_{ik}}{\sum_h x_{hj}} + \dfrac{x_{ij} x_{ik}}{\sum_h x_{hk}} + \dfrac{(1 - x_{ij})(1 - x_{ik})}{\sum_h (1 - x_{hj})} + \dfrac{(1 - x_{ij})(1 - x_{ik})}{\sum_h (1 - x_{hk})}\right]$

1 − Kulczynski *: $\dfrac{1}{2}\left[\dfrac{\min[x_{ij}, x_{ik}]}{\sum_h x_{hj}} + \dfrac{\min[x_{ij}, x_{ik}]}{\sum_h x_{hk}}\right]$

Note that some indices known generally as similarity functions are expressed as complements.
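The following Python sketch, written for the squared Euclidean case only, illustrates the contribution matrices G_i and the additivity property defined above; the function name contribution_matrices and the random test data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def contribution_matrices(X):
    """G[i] holds g_ijk = (x_ij - x_ik)^2 for variable i (squared Euclidean case)."""
    X = np.asarray(X, dtype=float)
    return (X.T[:, :, None] - X.T[:, None, :]) ** 2

X = np.random.rand(6, 3)                       # 6 objects, 3 variables
G = contribution_matrices(X)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(G.sum(axis=0), D2)          # D^2 = sum_i G_i
phi = G.sum(axis=(1, 2)) / 2                   # Phi_i, summed over the lower semimatrix
```

For the other coefficients of Table 1, only the expression inside the function changes; the additivity of contributions to d_jk is what the subsequent ranking procedure relies on.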

3. A new measure of explanatory power


The second part of the analysis involves determining the rank order of the g_ijk scores for each variable i for a given partition P_t of m objects into t clusters, each with m_s objects, s = 1, ..., t. For most coefficients of distance, e.g., Manhattan and Euclidean distances, it is reasonable to state that a variable completely explains P_t if all of its within-cluster contributions are smaller than the between-cluster contributions, and therefore have the smallest ranks. (For functions with an asterisk in Table 1 the situation is the reverse, however. In these cases, ranking is done in descending order to keep the generality of the statement that within-cluster contributions should be ranked first in the optimal case. In the sequel, we assume the previous type to simplify discussion.) Consequently, we have the minimum sum of ranks of within-cluster contributions of any variable, denoted by R(t)_min. This quantity is obtained as

$$R(t)_{\min} = (q^2 + q)/2 \quad\text{where}\quad q = \sum_{s=1}^{t} \binom{m_s}{2}.$$
Let, further, R(i,t)_obs be an observed sum of ranks of within-cluster contributions for variable i. Clearly, R(t)_min ≤ R(i,t)_obs. Ties in the rank order can be resolved randomly, which has negligible effects for large values of z. The sum of ranks for between-cluster contributions will not be used, because it conveys no extra information.
Let R(t)_exp denote the random expectation for the null situation, i.e., when the variable makes no distinction as to whether a contribution is within- or between-clusters (indifferent variable). In other words, contributions are arranged at random, with the expected sum of within-cluster ranks given by

$$R(t)_{\exp} = p \cdot z(z+1)/2 = q(z+1)/2,$$

where p = q/z is the probability that a randomly chosen value in the rank order is a within-cluster contribution. The expectation will be used below as a reference basis for constructing the formula.
Then, the explanatory power of the variable is defined as the complement of the deviation of the actual sum of ranks from the minimum, divided by the deviation of the expectation from the minimum:

$$r(i,t) = 1.0 - \frac{R(i,t)_{obs} - R(t)_{\min}}{R(t)_{\exp} - R(t)_{\min}}.$$
r(i,t) values close to 1.0 indicate high explanatory power, values around zero reflect indifference, whereas negative values correspond to a situation when variable i is contradictory with P_t. (I deliberately avoid using the term discriminatory power, because it is usually associated with variance-related concepts as in discriminant analysis.) The variables may be ordered based on their r scores, to facilitate interpretation of clusters and to detect variables which happen to be indifferent or even contradictory with the given partition.
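A sketch of r(i,t) for the squared Euclidean case, reusing the hypothetical contribution_matrices() helper above; the tiny random jitter is one possible reading of the random resolution of ties mentioned above, and all names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def explanatory_power(X, labels, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    G = contribution_matrices(X)                  # shape (n, m, m)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)        # the z = m(m-1)/2 object pairs
    within = labels[iu[0]] == labels[iu[1]]
    z, q = within.size, int(within.sum())
    R_min = q * (q + 1) / 2                       # minimum sum of within-cluster ranks
    R_exp = q * (z + 1) / 2                       # random expectation
    r = np.empty(len(G))
    for i, Gi in enumerate(G):
        g = Gi[iu] + rng.uniform(0.0, 1e-12, z)   # resolve ties at random
        R_obs = rankdata(g)[within].sum()
        r[i] = 1.0 - (R_obs - R_min) / (R_exp - R_min)
    return r
```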

4. Detecting the optimum number of clusters


The coefficient of explanatory power can be used in turn to determine the optimum number of clusters in hierarchical classifications. The intuitive basis for this is that the more variables support a given partition, the more acceptable it is. The criterion to be used is defined as the sum of coefficients of explanatory power,

$$\sigma_t = \sum_{i=1}^{n} r(i,t),$$
which will be called the coefficient of cluster separation. The upper bound of this coefficient is n, reached in the unanimous situation with all variables fully explaining the partition. The σ_t coefficient is computed for each level of interest in the hierarchy and the results are plotted against t. For data sets with group structure, the curve shows a peak allowing one to detect the number of clusters at which the majority of variables support the same clustering in terms of their ranked within-cluster contribution scores. For very many clusters, each with 1-2 objects only, the increase of σ_t is a necessity, but such trivial clusters attract no interest anyway (these clusters are usually excluded from such studies, cf. Milligan and Cooper 1985). Absence of clear-cut peaks is indicative of either strong disagreement among variables as to the "optimum" value of t, or complete lack of group structure, so the r(i,t) values must be inspected.
Computer program SYN-TAX 5.02 (Podani 1994) designed for classification purposes
includes an option for computing the explanatory power of variables and the coefficient
of cluster separation at several levels in a dendrogram and for plotting the graph
automatically (available for PCs and Macintosh computers).
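A sketch of the σ_t curve, assuming the hypothetical explanatory_power() above and SciPy's complete linkage as a stand-in for the clustering step (SYN-TAX itself is not reproduced here); all names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_separation_curve(X, t_values):
    Z = linkage(X, method='complete', metric='euclidean')
    return np.array([explanatory_power(X, fcluster(Z, t, criterion='maxclust')).sum()
                     for t in t_values])

# e.g. plot cluster_separation_curve(X, range(2, 12)) against t and look
# for a peak, as in Fig. 2 below.
```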

5. Example
The method will be demonstrated by an actual example coming from community
ecology. A total of 80 vegetational plots (objects) represent a sample of dolomite
grasslands of Sas-hill, Budapest, Hungary (for more details, see Podani 1985). The plots
have been described in terms of percentage cover scores of 123 vascular plant species.
For the purpose of illustration, the matrix of Euclidean distances of objects was subjected
to complete linkage clustering (Fig. 1). The explanatory power of variables and the coefficient of cluster separation were computed for the top ten cut levels in the dendrogram, i.e., for t = 2 to 11. The plot of cluster separation against the number of clusters (Fig. 2) indicates high agreement of variables for two and three clusters, with the maximum at t = 3. When t is raised from 3 to 4, the coefficient drops by more than 50%
and for more groups it remains about the same. The analysis thus suggests that the given
classification is best supported by the species at the 3-cluster level. It is therefore
worthwhile to examine the rank order of variables based on their explanatory power
values for t=3 (Tab. 2). To save space, the table lists only the first ten and the last ten
species from the rank order. Those at the beginning of the list are the best indicators of
difference between closed (the two smaller groups) and open grasslands (the large
group), whereas species with negative scores counter-support this classification because
they tend to differentiate the large group even further. These species have been widely
used as discriminatory species to subdivide the relatively open communities. The analysis
revealed, however, that the majority of species are contradictory with this, showing that
the classical syntaxonomic classification was subjectively based on a narrow subset of
species.

Fig. 1: Dendrogram showing complete linkage clustering of 80 vegetational plots, based on the Euclidean distances among objects using percentage cover scores of 123 species.


Fig. 2: Plot showing the relationship between the number of clusters and the coefficient of cluster separation for the top ten partitions obtained from the dendrogram of Fig. 1.

Tab. 2: The first ten and the last ten species in the rank order of variables for the three-cluster partition obtained from the dendrogram in Fig. 1.

Rank  Species                 r_i     Rank  Species                 r_i
1     Cytisus hirsutus        .865    114   Dianthus serotinus      .057
2     Festuca sulcata         .835    115   Andropogon ischaemum    .050
3     Bupleurum falcatum      .803    116   Helianthemum canum      .047
4     Pimpinella saxifraga    .771    117   Thymus praecox          .024
5     Asyneuma canescens      .763    118   Sanguisorba minor       .005
6     Veronica spicata        .758    119   Stipa eriocaulis        -.004
7     Polygonatum odoratum    .757    120   Festuca pallens         -.051
8     Campanula sibirica      .725    121   Carex liparocarpos      -.053
9     Carlina intermedia      .719    122   Chrysopogon gryllus     -.162
10    Adonis vernalis         .717    123   Seseli leucospermum     -.177

6. Discussion
The measure of explanatory power proposed in this paper is a non-metric criterion
because actual differences are irrelevant: it is the rank order of contributions that matters.
Thus, even if a variable had negligible effects on the distances (because of lack of
commensurability, for example), it may turn out to be a good explanatory variable
afterwards. Also, ranking variables based on the r values reveals an aspect rarely
emphasized: the identification of variables that do not agree with the partition. Finding
these variables may lead to revisions of former classifications. The possibility is also
raised here that after the removal of these variables a repeated classification based on the
reduced set of variables may provide a more noise-free classification. This is certainly an
aspect which merits future investigations.
The coefficient of cluster separation, being based on ranked contributions, is considerably
different from the currently known indices of optimum number of clusters as reviewed by
Milligan and Cooper (1985). In addition to the ranking technique, the most substantial
difference is that whereas the other methods are less dependent on the number of
variables (so that they can be best demonstrated with a two-dimensional example), the
present technique is more meaningful when there are quite a few variables. Therefore, it
may perform very poorly in a low dimensional situation if compared to the other
methods, and evaluation of the method proposed here along the lines of Milligan and
Cooper's study would be irrelevant. The only exception seems to be the Ratkowsky and
Lance (1978) criterion, which involves computation of the ratio used by Lance and
Williams (1977) for each variable, and takes the average over variables. It is perhaps an
explanation for the fairly poor performance of this measure in the two-dimensional case
of Milligan and Cooper's study, though Ratkowsky and Lance reported high success,
usually with many dimensions. It is also noted here that whereas almost all methods
provide the same result after rigid rotation of the axes (rotation invariance) the
Ratkowsky and Lance criterion and the one suggested in this paper are exceptions.

As with other formulae for detecting the optimum number of clusters in hierarchical
classifications, the possibility to incorporate the measure directly as a clustering criterion
may be examined. The coefficient of cluster separation is computationally very

demanding, however. (The actual example presented in this paper took ten hours on a PC
486.) Building clusters based on a global nonmetric criterion similar to σ_t would provide a clustering procedure completely compatible with the optimality measure.

Acknowledgements:
The author expresses his sincerest thanks for receiving an OMFB Travel Grant (No.
MEC 96-0176) and an OTKA Travel Grant (No. U21456) to participate at IFCS'96,
Kobe, where this contribution was presented. This study was funded by the OTKA
Hungarian National Research Grant No. TI9364. I am grateful to A. D. Gordon
(University of St. Andrews, U.K.) for his comments on the manuscript, and to M. B. Dale
(CSIRO, Australia) and Sz. Bokros (ELTE, Budapest) for discussions.
References:
Anderberg, M. R. (1973): Cluster Analysis for Applications. Academic, New York.
Dale, M. B., Beatrice, M., Venanzoni, R. and Ferrari, C. (1986): A comparison of some methods of selecting species in vegetation analysis. Coenoses, 1, 35-52.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988): Variable selection in clustering. Journal of Classification, 5, 205-228.
Godehardt, E. (1990): Graphs as Structural Models: The Application of Graphs and Multigraphs in Cluster Analysis (2nd ed.). Vieweg & Sohn, Braunschweig.
Gordon, A. D. (1981): Classification: Methods for the Exploratory Analysis of Multivariate Data. Chapman and Hall, London.
Jancey, R. C. and Wells, T. C. (1987): Locality theory: the phenomenon and its significance. Coenoses, 2, 31-37.
Lance, G. N. and Williams, W. T. (1977): Attribute contributions to a classification. Australian Computer Journal, 9, 128-129.
Milligan, G. W. and Cooper, M. C. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
Orlóci, L. (1973): Ranking characters by a dispersion criterion. Nature, 244, 371-373.
Orlóci, L. (1978): Multivariate Analysis in Vegetation Research. Junk, The Hague.
Podani, J. (1985): Syntaxonomic congruence in a small-scale vegetation survey. Abstracta Botanica, 9, 99-128.
Podani, J. (1994): Multivariate Data Analysis in Ecology and Systematics. SPB Publishing, The Hague.
Ratkowsky, D. A. and Lance, G. N. (1978): A criterion for determining the number of groups in a classification. Australian Computer Journal, 10, 115-117.
Sneath, P.H.A. and Sokal, R. R. (1973): Numerical Taxonomy. Freeman, San Francisco.
Stephenson, W. and Cook, S. D. (1980): Elimination of species before cluster analysis. Australian Journal of Ecology, 5, 263-273.
Random dendrograms for classifiability testing
Bernard Van Cutsem, Bernard Ycart
Laboratoire de Modélisation et Calcul, Université Joseph Fourier, Grenoble
B.P. 53, F-38041 GRENOBLE Cedex 9,
Bernard.Van-Cutsem@imag.fr, Bernard.Ycart@imag.fr

Summary: We propose statistical tests to decide between a hypothesis of non classifiability of the data and the presence of a classification in some simple situations. We consider only the single link algorithm and two null hypotheses of non classifiability, according to whether the dissimilarities or the objects themselves are i.i.d. random variables.
Each choice for the distribution of the input of the single link algorithm induces a different
probability distribution on the output, which is a random indexed dendrogram. Certain
characteristics of these random indexed dendrograms are studied, and their asymptotic dis-
tributions computed under each null hypothesis. All these random variables can be used
to define statistical tests. Explicit examples of such tests are provided.

1. Introduction
Mathematical structures such as partitions, trees or dendrograms are commonly used
to analyze and represent classification structures on n objects. Classical algorithms
construct these structures either from dissimilarities between pairs of objects or from
the values of a set of variables on the n objects.
We consider here only hierarchical classifications also called indexed dendrograms
(ID), as introduced by Hartigan (1967), Johnson (1967), Jardine, Jardine and Sibson
(1967). In practical situations, ID's are deduced from dissimilarities using ascend-
ing hierarchical algorithms. Descriptions of such algorithms can be found in classical
books on classification such as Sneath and Sokal (1973) or Jain and Dubes (1988). We
focus here on the Single Link Algorithm (SLA) which is among the most commonly
used. The ID's obtained using the SLA will be called single link indexed dendrograms
(SLID).
Even if the data do not present any cluster (they are homogeneous in some sense),
the SLA will produce an ID. In some instances, external information on the data
can be used to confirm or not the exhibited structure. If this is not possible, it is
nevertheless important to be able to decide if this ID is significant or not. A method
is to consider a probability distribution on the set of data which corresponds to the
absence of classification structure (no clusters), and then derive the distribution of
the corresponding random SLID. Then statistical tests for this null hypothesis of non
classifiability of data against an hypothesis of existence of a classification structure
can be constructed. Bock (1996) is a thorough review on the problem of testing par-
titions as structures of classification.
In Van Cutsem and Ycart (1994), we made a first attempt in this direction by de-
scribing the finite set of stratified dendrograms on n objects. This set was endowed
with the equiprobability and the distributions of some characteristics were derived.
We considered mainly the number of levels of such dendrograms, and the sizes of par-
titions. Those results can be easily extended to binary and strictly binary stratified
dendrograms (see exact definitions below).


More realistic hypotheses bear on non classifiable data. In Van Cutsem and Ycart
(1996a), we considered two other types of null hypotheses corresponding to non clas-
sifiable data.
• Model 1. The first hypothesis concerns dissimilarities. It supposes that the
dissimilarities between objects are exchangeable random variables, and more
particularly i.i.d. random variables uniformly distributed on [0,1].
• Model 2. The second hypothesis concerns objects. It supposes that the n ob-
jects are i.i.d. points in a metric space distributed according either to a uniform
distribution on a convenient domain or to a unimodal distribution. Dissimilar-
ities between objects are then defined as their distances in the representation
space.
Model 1 has been considered long ago, for instance by Ling (1973). In Van Cut-
sem and Ycart (1996b), a probabilistic analysis of the main ascending hierarchical
algorithms was proposed. Different variables attached to indexed dendrograms were
introduced, including the sequence of levels, the survival time of an object (i.e. the
number of partitions in which this object is isolated) and the ultrametric distance
between two objects. The distributions of these random variables attached to ID's
produced by the SLA (and also the average link and complete link algorithms) ap-
plied to random data corresponding to model 1 were studied, and exact as well as
asymptotic results were proposed.
Our goal in the present paper is twofold. Firstly we want to extend the results
obtained in Van Cutsem and Ycart (1996b) to a particular case of model 2, more
precisely to i.i.d. objects uniformly distributed in the interval [0,1]. The main re-
sults for model 1 are recalled in theorem 3.2, where explicit asymptotic distributions
for large sets of objects are given. The corresponding results for model 2 are given in
theorem 4.3. Interestingly enough, the asymptotic distribution turns out to be the
same (scaled Gumbel distribution) for the last index of the dendrogram. However, it
is quite different for other variables. For instance the size of the smallest subset in
the penultimate partition tends to 1 in probability for i.i.d. dissimilarities (model 1),
whereas for i.i.d. points on [0,1], it has a uniform distribution.
Our second aim is to demonstrate the applicability of our results to some problems
of classifiability testing. To do this, we introduce simple classifiability hypotheses to
be tested against models 1 and 2. They consist in assuming that the set of objects is
partitionned into two subsets, such that the distribution of dissimilarities inside each
set is stochastically smaller than that of dissimilarities between the two subsets. For
both models, we derive explicit tests based on the last index of the SLID and compute
the probability of detecting the given partition under the classifiability hypothesis.
The article is organized as follows. Section 2 recalls basic definitions. We summarize in section 3 the results of Van Cutsem and Ycart (1996b) concerning model 1. Section 4 contains new results for i.i.d. points uniformly distributed on [0,1]. Applications to classifiability testing are presented in section 5.

2. Basic definitions
A dissimilarity on a set S of n objects is as usual a function d : S² → ℝ₊ such that d(a,a) = 0 and d(a,b) = d(b,a) for any pair of objects a and b in S. Moreover we suppose here that d is definite, that is: (d(a,b) = 0) ⟹ (a = b).
An indexed dendrogram is a sequence {(P_l, λ_l)}_{0 ≤ l ≤ l_max} such that
1) {P_l}_{0 ≤ l ≤ l_max} is a sequence of nested partitions of S such that P_0 is the partition into singletons and P_{l_max} is the partition with only one element {S}. The dendrogram is defined by {P_l}_{0 ≤ l ≤ l_max}.
2) {λ_l}_{0 ≤ l ≤ l_max}, the sequence of indices of the dendrogram, is strictly increasing and starts from λ_0 = 0.
A sequence {P_l}_{0 ≤ l} of partitions is nested if any subset of a partition P_l is a union of subsets of P_{l−1}. An ID is called strictly binary if any partition P_l is obtained by joining exactly two subsets of P_{l−1}. For a strictly binary dendrogram, l_max = n − 1. An ID is a stratified dendrogram (SD) if, for any l ∈ {0, ..., l_max}, λ_l = l.
If A and B are two disjoint subsets of S, we define

$$\delta(A,B) = \min\{d(a,b) : a \in A,\ b \in B\}.$$

The function δ defines a dissimilarity on pairwise disjoint subsets of S. We suppose in the sequel that all dissimilarities, either between objects or subsets, are distinct.
The SLA constructs an ID from a dissimilarity d by the iterative procedure which starts from (P_0, 0), and for each (P_{l−1}, λ_{l−1}),
1) computes δ_min = min{δ(A,B) : A ≠ B ∈ P_{l−1}},
2) defines the partition P_l by grouping together the pair of elements of P_{l−1} whose dissimilarity is δ_min,
3) sets λ_l = δ_min.
The algorithm stops when P_l = {S}.
The associated SLID is then a strictly binary indexed dendrogram. Moreover, for each l ∈ {0, ..., n−1}, P_l is a partition into exactly n − l subsets.
Given a dissimilarity d on S, with no ties (except 0), we attach to d two families of undirected graphs, the vertices of which are the objects in S.
• The discrete family is the sequence {g_m}_{0 ≤ m ≤ α(n)}, where for each m, g_m has m edges which correspond to the first m pairs {a,b} of objects, ranked in increasing order of dissimilarities.
• The continuous family is the family {g(λ), 0 ≤ λ}, where for each real λ ≥ 0, g(λ) is the graph with set of edges {{a,b} : d(a,b) ≤ λ}.
Of course these two dissimilarity graph families are closely related and their use will depend on whether the interest bears more on partitions (discrete) or on the sequence of indices (continuous). The SLA can be expressed according to these families of graphs as follows.
1) Rank the α(n) = n(n−1)/2 pairs (a,b) of objects according to the (strictly) increasing order of the dissimilarities.

2) Determine the strictly increasing sequence of integers m_l, by m_0 = 0 and, for any l ∈ [0, n−1],

$$m_l = \min\{m : g_m \text{ has } n - l \text{ connected components}\}.$$

3) Define P_l to be the partition of S in the connected components of g_{m_l} and λ_l = d(a_{m_l}, b_{m_l}).
This point of view was introduced long ago by Ling (1973), see also Godehardt (1988) and the many references therein. Thus a (strictly binary) SLID {(P_l, λ_l)}_{0 ≤ l ≤ n−1} can be described by a sequence of graphs indexed in three different ways according to the preference given to the discrete level m_l, to the continuous level λ_l, or to the rank level l.
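A minimal sketch of this graph formulation, using a union-find structure over the α(n) ranked pairs; the names single_link, objects and d are illustrative assumptions, and d is any callable returning the dissimilarity of a pair.

```python
import itertools

def single_link(objects, d):
    """Single link via the ranked pairs: returns [(m_l, lambda_l)] for l = 1..n-1."""
    parent = {a: a for a in objects}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    pairs = sorted(itertools.combinations(objects, 2), key=lambda p: d(*p))
    levels = []
    for m, (a, b) in enumerate(pairs, start=1):
        ra, rb = find(a), find(b)
        if ra != rb:                        # g_m loses one connected component
            parent[ra] = rb
            levels.append((m, d(a, b)))     # discrete level m_l, index lambda_l
    return levels
```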
A SLID is a complex object and our first objective is to summarize its characteristics into certain variables which, later on, will serve as test statistics. The characteristics we consider are the following. Each of them can be expressed from the three points of view described above, according to the choice of its scale value.

1. The sequence of indices, defined either by the continuous levels {λ_l}_{0 ≤ l ≤ n−1} or by the discrete levels {m_l}_{0 ≤ l ≤ n−1}. The analysis of the gaps in the sequence of indices can suggest the presence of clusters in the ID.

2. The survival time of a singleton, which is the minimum level l at which a given singleton is not an isolated point for the partition P_l. This survival time can also be evaluated either by the discrete level m_l or by the continuous one λ_l. An object which is isolated at too high a level may be significantly different from the others.

3. The ultrametric distance between two given objects, defined as the minimum level l at which both objects are in a same subset of the partition P_l. This ultrametric distance can be evaluated either by the discrete level m_l or by the continuous level λ_l of partition P_l. Two objects which are separated at too high a level may suggest the existence of two different clusters.

4. The size of subsets of partitions P_l. Balanced or unbalanced subsets may provide indications on the presence or not of clusters.
These variables are respectively denoted as follows.

λ_l      continuous level of the partition P_l
m_l      discrete level of the partition P_l
λ(a)     survival time of a expressed as a continuous level
m(a)     survival time of a expressed as a discrete level
l(a)     survival time of a expressed as a rank of partition
λ(a,b)   ultrametric distance of a and b expressed as a continuous level
m(a,b)   ultrametric distance of a and b expressed as a discrete level
l(a,b)   ultrametric distance of a and b expressed as a rank of partition
t_l(a)   size of the cluster containing a in partition P_l

3. Random dissimilarities on [0,1]


Let us denote by D(a,b) the random dissimilarity between objects a and b. Throughout this section, we shall suppose that the D(a,b)'s are independent and uniformly distributed on [0,1]. The exchangeability of the D(a,b)'s implies that, when pairs of objects are ranked according to the increasing order of the dissimilarities, the resulting order is uniformly distributed on the set of the α(n)! possible orders. We denote by Prob the reference probability distribution.
When applying the SLA to random dissimilarities, one obtains a random SLID (RSLID) and random families of graphs denoted as usual with capital letters {G_m}, {G(λ)}. The variables attached to a deterministic SLID that were described in the previous section now become random variables, denoted also with capital letters: Λ_l, M_l, Λ(a), M(a), L(a), Λ(a,b), M(a,b), L(a,b), T_l(a).

In Van Cutsem and Ycart (1996b), we derived the exact and asymptotic distributions of all these variables. Exact distributions involve some combinatorics on graphs and use mainly the following numbers:

γ(n,m)    = number of graphs with n vertices and m edges,
γ(n,m,k) = number of graphs with n vertices, m edges and k connected components.

These numbers can be computed using recurrence relations, but the actual implementation is quickly limited by numerical explosion.
We summarize below, without proofs, the distributions of Λ_l, Λ(a), Λ(a,b), M(a,b) and T_{n−2}. Details of the proofs and other related results can be found in Van Cutsem and Ycart (1996b).

Theorem 3.1 With the above hypotheses and notations,
• ∀l ∈ {0, 1, ..., n−1}, ∀λ ∈ [0,1],

$$\mathrm{Prob}(\Lambda_l \le \lambda) = \sum_{m=l}^{\alpha(n)} \sum_{k=1}^{n-l} \gamma(n,m,k)\, \lambda^m (1-\lambda)^{\alpha(n)-m}.$$

• ∀λ ∈ [0,1],

$$\mathrm{Prob}(\Lambda(a) \le \lambda) = 1 - (1-\lambda)^{n-1}.$$

• ∀λ ∈ [0,1],

$$\mathrm{Prob}(\Lambda(a,b) \le \lambda) = \sum_m \mathrm{Prob}(M(a,b) \le m) \binom{\alpha(n)}{m} \lambda^m (1-\lambda)^{\alpha(n)-m},$$

where ∀m = 1, ..., α(n−1) + 1,

$$\mathrm{Prob}(M(a,b) \le m) = \sum_{(n_1,m_1) \in A(n,m)} \binom{n-2}{n_1-2} \frac{\gamma(n_1,m_1,1)\, \gamma(n-n_1,\,m-m_1)}{\gamma(n,m)}$$

with
• ∀n_1 = 1, ..., n−1,

$$\mathrm{Prob}(T_{n-2}(a) = n_1) = \sum_{(m,m_1) \in B(n,n_1)} \frac{\binom{n-1}{n_1-1}\, \gamma(n_1,m_1,1)\, \gamma(n-n_1,\,m-m_1-1,\,1)}{\binom{\alpha(n)}{m}} \cdot \frac{n_1(n-n_1)}{\alpha(n)-m+1},$$

where

$$B(n,n_1) = \{(m,m_1) : n_1-1 \le m_1 \le \alpha(n_1),\ n-n_1-1 \le m-m_1-1 \le \alpha(n-n_1)\}.$$

The size of the numbers involved makes the exact computation of the probability distributions of theorem 3.1 impossible for n larger than a few tens. Fortunately, explicit asymptotic distributions are also available. These asymptotics are also derived in Van Cutsem and Ycart (1996b). The proofs are based on the theory of random graphs. The key observation is that {G_m} and {G(λ)} are random graph processes in the sense of Bollobás (1985). In particular the discrete and the continuous families of graphs are equivalent in some sense for n tending to infinity. More precisely, G_m and G(λ) will have similar properties if λ ∼ m/α(n). As an illustration, we summarize below the asymptotic distributions of some of our variables.
Theorem 3.2 With the above hypotheses and notations,
• ∀l ≥ 1, ∀x ∈ ℝ₊,

$$\lim_{n\to\infty} \mathrm{Prob}\Big(\frac{n^2}{2}\Lambda_l \le x\Big) = 1 - e^{-x}\Big(1 + x + \cdots + \frac{x^{l-1}}{(l-1)!}\Big).$$

• ∀x ∈ ℝ,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda_{n-1} - \log(n) \le x) = e^{-e^{-x}}.$$

• ∀x ∈ ℝ₊,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda(a) \le x) = 1 - e^{-x}.$$

• ∀x ∈ ]1, +∞[,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda(a,b) \le x) = \Big(1 - \frac{\varphi(x)}{x}\Big)^2,$$

where φ(x) is the only solution in ]0,1[ of the equation φe^{−φ} = xe^{−x}.
• lim_{n→∞} Prob(min(T_{n−2}(a), n − T_{n−2}(a)) = 1) = 1.
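As an illustration of the second limit, here is a Monte Carlo sketch under model 1 (i.i.d. uniform dissimilarities), reusing the hypothetical single_link() sketch from section 2; all names are illustrative.

```python
import math, random

def last_index_sample(n, rng):
    """One draw of n*Lambda_{n-1} - log(n) under i.i.d. uniform dissimilarities."""
    cache = {}
    def d(a, b):
        key = (a, b) if a < b else (b, a)
        if key not in cache:
            cache[key] = rng.random()      # D(a,b) ~ Uniform[0,1], drawn once
        return cache[key]
    lam_last = single_link(list(range(n)), d)[-1][1]
    return n * lam_last - math.log(n)

rng = random.Random(0)
sample = [last_index_sample(100, rng) for _ in range(200)]
# the empirical distribution of `sample` should be close to a standard Gumbel law
```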

4. Random points on [0,1]


We now consider the simplest case where the n objects are represented by n random points on the unit interval. Let U_1, U_2, ..., U_n denote n independent random variables uniformly distributed on [0,1]. As usual, we denote by U_(1), U_(2), ..., U_(n) the ordered variables, and by R_1, R_2, ..., R_n the sequence of ranks. The object a is then represented by the point U_a = U_(R_a). In this model, the dissimilarity D(a,b) between objects a and b is |U_a − U_b|.

It is easily checked that
1) the density of U_(·) = (U_(1), U_(2), ..., U_(n)) is

$$f(u_1, \ldots, u_n) = n!\; 1_{\Delta_n}(u_1, \ldots, u_n),$$

where

$$\Delta_n = \{(u_1, u_2, \ldots, u_n) : 0 \le u_1 \le u_2 \le \cdots \le u_n \le 1\}$$

and where 1_A denotes the indicator function of the set A;
2) the ranks R_(·) = (R_1, R_2, ..., R_n) define a permutation of the first n integers which is uniformly distributed on the n! permutations of these integers.
Let us define Δ_i = U_(i+1) − U_(i) to be the distance between two consecutive points in the representation. The density of Δ = (Δ_1, Δ_2, ..., Δ_{n−1}) is easily computed:

$$g(\delta_1, \ldots, \delta_{n-1}) = n!\, \Big(1 - \sum_{i=1}^{n-1} \delta_i\Big)\, 1_D(\delta_1, \ldots, \delta_{n-1}),$$

where

$$D = \Big\{(\delta_1, \ldots, \delta_{n-1}) : \delta_i \ge 0,\ \sum_{i=1}^{n-1} \delta_i \le 1\Big\}.$$

Then, for all i ∈ {1, 2, ..., n−1}, Δ_i is distributed according to a Beta distribution β(1, n). Moreover the variables Δ_1, Δ_2, ..., Δ_{n−1} are exchangeable.
We now associate a SLID to the dissimilarities D(a,b). Since almost surely there are no ties in dissimilarities, this ID is strictly binary. We shall consider separately the sequence of levels Λ_0 = 0, Λ_1, ..., Λ_{n−1} and the stratified strictly binary dendrogram denoted by BD (see Figure 1 for an example).

Figure 1: Six random points on [0,1] and their associated SLID.

Theorem 4.1 With the above hypotheses and notations,

1. The sequence of levels of the RSLID is the ordered sequence Δ_(1), Δ_(2), ..., Δ_(n−1) associated to the variables Δ_1, Δ_2, ..., Δ_{n−1}.

2. BD is uniformly distributed on the finite set B_n of stratified strictly binary dendrograms on n objects.

3. The sequence of levels and BD are independent.

The proof is elementary. Result 1 is obvious because of the properties of the SLA.
To prove 2 we remark that, as the variables Δ_i are exchangeable, their ranks define a permutation which is uniformly distributed on the set of all the (n−1)! permutations of {1, 2, ..., n−1}. Fix first a permutation of the n objects, and independently a permutation of the Δ_i's. Then the stratified dendrogram BD is determined. But 2^{n−1} such choices will lead to the same binary dendrogram, since the order of the pair of sets which are joined at each of the n−1 levels can be switched. Thus each binary dendrogram is obtained with probability

$$\frac{2^{n-1}}{n!\,(n-1)!}.$$

Result 3 is easy to check and uses once more the exchangeability of the variables Δ_i.
Remark. This decomposition of the RSLID into the product of two independent random structures given by the levels and the stratified binary dendrogram can be extended into a product of three independent random structures if we decompose a stratified dendrogram into the product of labels and of an unlabelled strictly binary dendrogram.
The consequences of this theorem are important. Firstly, the distribution of levels can be deduced directly from that of Δ_(·). Many results on survival times of objects, ultrametric distances between objects expressed as ranks of partitions, and consequently expressed as levels, can be directly obtained from combinatorics on the set B_n of strictly binary dendrograms on n objects. Here are for instance the distributions of survival times and ultrametric distances expressed as ranks of partitions.
Theorem 4.2 Let BD denote a stratified strictly binary dendrogram uniformly distributed on the set B_n. Then ∀l ∈ {1, ..., n−1},

$$\mathrm{Prob}(L(a) = l) = \frac{n-l}{\binom{n}{2}}, \qquad \mathrm{Prob}(L(a,b) = l) = \frac{n+1}{(n-1)\binom{n-l+2}{2}}.$$

Also, ∀n_1 ∈ {1, 2, ..., n−1},

$$\mathrm{Prob}(T_{n-2}(a) = n_1) = \frac{1}{n-1}.$$

The proof is easy.


The distributions of the levels Λ_k are more difficult to compute. We give only a few indications.
1) It is obvious that, using the exchangeability of the Δ_i's,

$$\mathrm{Prob}(\Lambda_k \le \lambda) = \mathrm{Prob}(\Delta_{(k)} \le \lambda) = \sum_{j=k}^{n-1} \binom{n-1}{j} \mathrm{Prob}(\Delta_1 \le \lambda, \ldots, \Delta_j \le \lambda,\ \Delta_{j+1} > \lambda, \ldots, \Delta_{n-1} > \lambda).$$

2) One can prove that, for any k ∈ {1, 2, ..., n−1},

$$\mathrm{Prob}(\Delta_1 > \lambda, \ldots, \Delta_k > \lambda) = ((1 - k\lambda)_+)^n,$$

where (x)_+ = max{x, 0}.
3) This yields

$$\mathrm{Prob}(\Delta_1 \le \lambda, \ldots, \Delta_h \le \lambda,\ \Delta_{h+1} > \lambda, \ldots, \Delta_k > \lambda) = \sum_{j=0}^{h} (-1)^j \binom{h}{j} ((1 - (k-h+j)\lambda)_+)^n.$$

A tedious computation allows one to derive Prob(Λ_(k) ≤ λ). For the survival time Λ(a) of a singleton, the exact formula is

$$\mathrm{Prob}(\Lambda(a) \le \lambda) = 1 - \frac{2}{n}(1-\lambda)^n - \frac{n-2}{n}((1 - 2\lambda)_+)^n.$$

The ultrametric distance of two elements Λ(a,b) can also be computed along the same lines. More interesting are the asymptotic distributions for continuous variables. They are obtained using the classical approximation by a Poisson process (cf. for instance Feller (1971)).
Theorem 4.3 With the above hypotheses and notations,
• ∀l ≥ 1, ∀x ∈ ℝ₊,

$$\lim_{n\to\infty} \mathrm{Prob}(n^2 \Lambda_l \le x) = 1 - e^{-x}\Big(1 + x + \cdots + \frac{x^{l-1}}{(l-1)!}\Big).$$

• ∀x ∈ ℝ,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda_{n-1} - \log(n) \le x) = e^{-e^{-x}}.$$

• ∀x ∈ ℝ₊,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda(a) \le x) = 1 - e^{-2x}.$$

• ∀x ∈ ]1, +∞[,

$$\lim_{n\to\infty} \mathrm{Prob}(n\Lambda(a,b) - \log(n/3) \le x) = e^{-e^{-x}}.$$

• ∀x ∈ ]0, 1/2[,

$$\lim_{n\to\infty} \mathrm{Prob}\big((1/n)\min(T_{n-2}(a),\, n - T_{n-2}(a)) \le x\big) = 2x.$$

Thus the asymptotic distribution of the last index Λ_{n−1} is the same for i.i.d. dissimilarities (theorem 3.2) and for i.i.d. points on [0,1]. The asymptotic distributions of Λ(a) and Λ(a,b) are different but with scalings of similar orders of magnitude. The main difference comes from the penultimate partition, which is very unbalanced in the case of i.i.d. dissimilarities and corresponds to a random cut in the case of i.i.d. points on [0,1]. The proof of theorem 4.3 uses the well-known Poisson approximation, under the following equivalent form. Consider n i.i.d. points, uniformly distributed on [0,n]. Let Δ_1, ..., Δ_k be a fixed number of distances between consecutive neighbors. As n tends to infinity, they are asymptotically independent, exponentially distributed with parameter 1. Recall that the infimum of n i.i.d. exponential r.v.'s is exponential with parameter n. Their supremum is asymptotically Gumbel, with location parameter log(n).

5. Examples of classifiability testing


Hartigan and Mohanty (1992) propose tests of classifiability based on the cluster sizes at different levels of a SLID. In this section, we shall consider only the last two levels. As an illustration of our tests, we define a partition P of the set S = {1, ..., n} into two subsets A = {1, ..., n_1} and B = {n_1 + 1, ..., n_1 + n_2}. Here, n_1 and n_2 = n − n_1 are meant as integer-valued functions of n, tending to infinity with n. Our classifiability hypotheses H_1 and H'_1 depend on the partition P and on a positive parameter μ.
The hypothesis H_1 is: the dissimilarities D(a,b) are independent and distributed as follows:
1) if (a,b) ∈ A² ∪ B², D(a,b) is uniformly distributed on [0,1],
2) if (a,b) ∈ A × B, D(a,b) is uniformly distributed on [μ, 1 + μ].
The null hypothesis H_0 is the hypothesis of non classifiability of section 3 (i.i.d. dissimilarities, model 1). We denote by Prob_μ the probability distribution corresponding to hypothesis H_1.
Using the SLA we derive a RSLID and consider the last index Λ_{n−1}, the partition P_{n−2} and the variable Z = min{D(a,b) : (a,b) ∈ A × B}. We introduce two more variables Λ′_{n_1} and Λ″_{n_2}, which are the highest levels of the two sub-ID's obtained by the SLA applied to the restrictions of D to A and B respectively.
We first consider the probability p_μ of good detection of P by the SLA:

$$p_\mu = \mathrm{Prob}_\mu(P_{n-2} = P).$$

It is clear, using properties of the SLA, that:

$$p_\mu = \mathrm{Prob}_\mu(\max(\Lambda'_{n_1}, \Lambda''_{n_2}) < Z).$$

This quantity evaluates the possibility for the SLA to detect the partition P. It is certainly bigger than p̃_μ below, which is a good approximation for large n_1 and n_2:

$$\tilde{p}_\mu = \mathrm{Prob}_\mu(\max(\Lambda'_{n_1}, \Lambda''_{n_2}) < \mu).$$

This last probability is easy to derive since the two variables Λ′_{n_1} and Λ″_{n_2} are independent and asymptotically distributed according to a Gumbel distribution with convenient location and scale parameters (cf. theorem 3.2).

Table 1 gives some values of p̃_μ in the case n_1 = n_2 = 100.

μ       0.04    0.05    0.06    0.07    0.08    0.09    0.10
p̃_μ    0.0257  0.2599  0.6091  0.8333  0.9351  0.9756  0.991

Table 1: Values of the probability of detection at the penultimate partition.
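A small numeric sketch of p̃_μ under the Gumbel approximation of theorem 3.2 applied to each sub-dendrogram, i.e. assuming Prob(Λ′_{n_1} < μ) ≈ exp(−exp(−(n_1 μ − log n_1))); the function name is illustrative.

```python
import math

def p_tilde(mu, n1, n2):
    gumbel_cdf = lambda k: math.exp(-math.exp(-(k * mu - math.log(k))))
    return gumbel_cdf(n1) * gumbel_cdf(n2)   # independence of the two sub-ID's

print(round(p_tilde(0.05, 100, 100), 4))     # ~0.26, cf. Table 1
```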

Let us now test H_0 against H_1, using the statistic Λ_{n−1}. We determine the critical region [c_α, 1], at significance level α, by

$$\mathrm{Prob}_0(\Lambda_{n-1} \ge c_\alpha) = \alpha.$$

Since the asymptotic distribution of Λ_{n−1} is Gumbel, for n large enough, c_α is determined by

$$c_\alpha = \frac{\log(n) - \log(-\log(1-\alpha))}{n}.$$
Table 2 presents a few values for n_1 = n_2 = 100.

α      0.1     0.05    0.01    0.005   0.001
c_α    0.0377  0.0413  0.0495  0.0530  0.0610

Table 2: Critical values for the test on the last index.
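A small numeric check of this expression for c_α (illustrative code):

```python
import math

def critical_value(n, alpha):
    return (math.log(n) - math.log(-math.log(1.0 - alpha))) / n

for alpha in (0.1, 0.05, 0.01, 0.005, 0.001):
    print(alpha, round(critical_value(200, alpha), 4))  # cf. Table 2 (n = n1 + n2 = 200)
```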

Notice that, somewhat paradoxically, the hypothesis of non classifiability H_0 may be rejected by the test on the last level, whereas the penultimate partition does not detect the correct structure.
The power of the test on the last level is given by

$$\pi(\mu) = \mathrm{Prob}_\mu(\Lambda_{n-1} \ge c_\alpha),$$

and it is clear that π(μ) = 1 if μ ≥ c_α. As lim_{n→+∞} c_α = 0, we see that, for any fixed μ > 0,

$$\lim_{n\to+\infty} \pi(\mu) = 1.$$
Consider now model 2. The null hypothesis H'_0 is: the n objects are i.i.d. points in [0,1]. For the hypothesis H'_1 we choose: the objects in A are i.i.d. points uniformly distributed in [0, μ_1], and the objects in B are i.i.d. points uniformly distributed in [μ_2, 1], with

$$0 < \mu_1 < \mu_2 < 1, \qquad \mu_2 - \mu_1 = \mu > 0.$$

If the test statistic is the last level Λ_{n−1}, since its asymptotic distribution is the same for model 1 as for model 2, we obtain exactly the same critical region as before. However the probability of detection p̃_μ has changed. Define again Λ′_{n_1} and Λ″_{n_2} as the last levels of the sub-SLID's on the sets A and B respectively. For n_1 and n_2 large enough, the asymptotic distributions of Λ′_{n_1} and Λ″_{n_2} are still Gumbel, but with different scalings. One has

$$\frac{n_1}{\mu_1}\Lambda'_{n_1} - \log(n_1) \ \text{and}\ \frac{n_2}{1-\mu_2}\Lambda''_{n_2} - \log(n_2) \ \text{asymptotically Gumbel.}$$

As an example, table 3 gives some values of p̃_μ in the case n_1 = n_2 = 100, μ_1 = (1 − μ)/2 and μ_2 = (1 + μ)/2 (the gap is centered).

μ       0.02    0.03    0.04    0.05    0.06    0.07    0.08
p̃_μ    0.0342  0.6625  0.9531  0.9947  0.9994  0.9999  1.

Table 3: Values of the probability of detection at the penultimate partition.

References:
Bock, H.H. (1996): Probability models and hypotheses testing in partitioning cluster anal-
ysis. In: Clustering and classification, Arabie, P. et al. (eds.), 377-453, World Scientific,
Singapore.
Bollobas, B. (1985): Random Graphs, Academic Press, London.
Feller, W. (1971): An introduction to probability theory and its applications, vol. II, Wiley,
London.
Godehardt, E. (1988): Graphs as Structural Models, Vieweg, Braunschweig/Wiesbaden.
Hartigan, J.A. (1967): Representations of similarity matrices by trees. J. Amer. Statist. Assoc., 62, 1140-1158.
Hartigan, J.A. and Mohanty, S. (1992): The RUNT test for multimodality. J. of Classification, 9, 63-70.
Jain, A.K. and Dubes, R.C. (1988): Algorithms for clustering data. Prentice Hall, Englewood Cliffs.
Jardine, C.J., Jardine, N., and Sibson, R. (1967): The structure and the construction of taxonomic hierarchies. Math. Biosci., 1, 171-179.
Johnson, S.C. (1967): Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Ling, R.F. (1973): A probability theory of cluster analysis, J. Amer. Statist. Assoc., 68, 159-164.
Sneath, P.H. and Sokal, R.R. (1973): Numerical Taxonomy, Freeman, San Francisco.
Van Cutsem, B. and Ycart, B. (1994): Renewal-type behaviour of absorption times in
Markov Chains. Adv. Appl. Probab., 26, 988-1005.
Van Cutsem, B. and Ycart, B. (1996a): Probability distributions on indexed dendrograms
and related problems of classifiability. In: Data Analysis and Information Systems, Bock,
H. (ed.), 73-87. Springer-Verlag, Berlin.
Van Cutsem, B. and Ycart, B. (1996b): Indexed dendrograms on random dissimilarities.
J. of Classification, to appear.
The Lp-product of ultrametric spaces and the
corresponding product of hierarchies
Bernard Fichet
Laboratoire de Biomathématiques
Université d'Aix-Marseille II
27 Boulevard Jean Moulin
13385 Marseille, France.

Summary: The Lp-product (1 ≤ p ≤ ∞) of r indexed hierarchies is introduced in connection with the Lp-product of the corresponding r ultrametric spaces. The Cartesian product of two hierarchies appears to be a quasi-hierarchy. Endowed with an index of Lp-type (p < ∞), this quasi-hierarchy is in bijection with the Lp-product of two ultrametric spaces. The indexed hierarchy associated with the supremum product of r ultrametric spaces is also characterized.

1 Introduction
This paper is devoted to the Lp-product of indexed hierarchies and its relationship with the Lp-product of ultrametric spaces. In this approach, quasi-hierarchies will appear as a fundamental structure. Recall that the main axiom of a quasi-hierarchy, axiom iii) below, has been investigated by Batbedat (1989) and Bandelt and Dress (1989) in defining weak hierarchies. This axiom stipulates that the intersection of three clusters always is the intersection of two clusters among them. Then an indexed quasi-hierarchy is defined by adding some usual axioms. Quasi-hierarchical classification may be regarded as a unifying way for two extensions of hierarchical classification. Indeed, indexed quasi-hierarchies extend indexed (but not weakly-indexed) pseudo-hierarchies, also called "pyramids", and additive trees. For references concerning the three previous concepts, see Durand and Fichet (1988), Bertrand and Diday (1991) and Buneman (1974).
A bijection has been established between indexed quasi-hierarchies and particular dissimilarities, called quasi-ultrametrics. See Diatta and Fichet (1994) or Bandelt (1992) via a four-point characterization.
In this paper we show that the Cartesian product of two hierarchies is a quasi-hierarchy. Moreover, from two indexed hierarchies a level index of Lp-type (p < ∞) is produced in connection with the Lp-product of the corresponding two ultrametric spaces. Finally the supremum product of r ultrametric spaces is ultrametric and the associated indexed hierarchy is characterized as a subclass of the Cartesian product of the corresponding r hierarchies.
The reader will find an analogy with the primary approach of Benzécri and Escofier defining correspondence analysis, see Escofier (1969). Introducing the χ²-metric on the row-set and the column-set of categories, they produce a global scattering which is nothing but the L2-product of the two metric sets.
Let us note that the results given here have been presented by the author at the 19th Annual Conference of the Gesellschaft für Klassifikation e.V., held in Basel, March


1995, and the I.F.C.S. 5th Conference held in Kobe, March 1996.

2 Preliminaries and notations


Let I be a finite set. A quasi-hierarchy ℋ on I is a collection of nonempty subsets of I obeying the following four axioms:
i) I ∈ ℋ
ii) ∀H ∈ ℋ, ∪{H' ∈ ℋ : H' ⊂ H} ∈ {H, ∅}
iii) ∀H_1, H_2, H_3 ∈ ℋ, H_1 ∩ H_2 ∩ H_3 ∈ {H_1 ∩ H_2, H_2 ∩ H_3, H_3 ∩ H_1}
iv) ∀H, H' ∈ ℋ, H ∩ H' ∈ ℋ ∪ {∅}
Such a definition stays valid whenever I is infinite. In the finite case as considered here, we have an equivalent definition. The class ℋ is a quasi-hierarchy if and only if it obeys i), iii), iv) and
ii') minimal elements of ℋ partition I.
A quasi-hierarchy ℋ is said to be total (definite) if and only if:
v) ∀i ∈ I, {i} ∈ ℋ
Note that v) implies ii) (or ii')).
Recall that a hierarchy obeys i), ii) (or ii')) and
vi) ∀H, H' ∈ ℋ, H ∩ H' ∈ {H, H', ∅}
Thus a hierarchy is a quasi-hierarchy.
Recall that a hierarchy admits a well-known visual display, called a dendrogram. In a quasi-hierarchy ℋ, a predecessor of a cluster (element) H in ℋ, H ≠ I, is any minimal element of the family {H' ∈ ℋ : H' ⊃ H}. It follows from vi) that in a hierarchy H has a unique predecessor. In a quasi-hierarchy, there exists a smallest cluster, say H_ij, containing two fixed units i and j in I. That derives from iv).
A level index is defined as usual. An indexed quasi-hierarchy is a pair (ℋ, f) where ℋ is a quasi-hierarchy and f is a function mapping ℋ into ℝ₊ and obeying:
vii) (H ∈ ℋ, H minimal) ⟹ f(H) = 0
viii) (H, H' ∈ ℋ, H ⊂ H') ⟹ f(H) < f(H')
A stratified quasi-hierarchy is a pair (ℋ, ⪯) where ℋ is a quasi-hierarchy and ⪯ is a (linear) quasi-order on ℋ obeying:
vii') (H ∈ ℋ, H minimal) ⟹ ∀H' ∈ ℋ, H ⪯ H'
viii') (H, H' ∈ ℋ, H ⊂ H') ⟹ H ⪯ H' and not H' ⪯ H
Clearly an indexed quasi-hierarchy induces a stratified quasi-hierarchy by:

∀H, H' ∈ ℋ, H ⪯ H' ⟺ f(H) ≤ f(H').

A dissimilarity on I is a mapping d : I² → ℝ₊ obeying

∀(i,j) ∈ I², d(i,j) = d(j,i) and ∀i ∈ I, d(i,i) = 0.

Such a dissimilarity is said to be proper (definite) iff d(i,j) = 0 ⟹ i = j.
(I,d) is called a dissimilarity space.
We note B_d(i,r) the (closed) ball with centre i and radius r ≥ 0, i.e. B_d(i,r) = {k ∈ I : d(i,k) ≤ r}. We define a 2-ball as the set B^d_ij = B_d(i, d(i,j)) ∩ B_d(j, d(i,j)). The family of 2-balls is denoted by B_d = {B^d_ij : i,j ∈ I}.

For any J ⊆ I, diam_d(J) is the diameter of J, i.e.

diam_d(J) = max{d(i,j) : i,j ∈ J}.

There is a well-known bijection between the set of indexed hierarchies and the set of particular dissimilarities, called ultrametrics; see Johnson (1967), Jardine et al. (1967), Benzécri (1973). An ultrametric obeys the ultrametric inequality: ∀i,j,k ∈ I, d(i,j) ≤ max[d(i,k), d(k,j)]. The set of ultrametrics on I will be denoted D_u(I).
In fact, there are many equivalent definitions for ultrametricity. The following five statements will be useful in this paper:

d ∈ D_u(I)
⟺ ∀(i,j) ∈ I², B_d(i, d(i,j)) = B_d(j, d(i,j))
⟺ ∀(i,j) ∈ I², B^d_ij = B_d(i, d(i,j))
⟺ [∀(i,j,k,l) ∈ I⁴, k,l ∈ B_d(i, d(i,j)) ⟹ B_d(k, d(k,l)) ⊆ B_d(i, d(i,j))]
⟺ ∀(i,j) ∈ I², diam_d B_d(i, d(i,j)) = d(i,j)
According to those properties, Diatta and Fichet (1994) define a quasi-ultrametric as a dissimilarity obeying both
ix) ∀(i,j,k,l) ∈ I⁴: k,l ∈ B^d_ij ⟹ B^d_kl ⊆ B^d_ij (inclusion condition)
x) ∀(i,j) ∈ I², diam_d B^d_ij = d(i,j) (diameter condition)
Thus an ultrametric is a quasi-ultrametric. We denote by D_qu(I) the set of quasi-ultrametrics on I.
Now, we have the ingredients to recall a theorem extending the one-to-one correspondence between indexed hierarchies and ultrametrics, see Diatta and Fichet (1994).

Theorem 1 Let d ∈ D_qu(I). Then (B_d, diam_d) is an indexed quasi-hierarchy. The quasi-hierarchy B_d is total iff d is proper. It is a hierarchy iff d ∈ D_u(I). Conversely, given an indexed quasi-hierarchy (ℋ, f), let δ define the mapping from I² into ℝ₊ by: ∀(i,j) ∈ I², δ(i,j) = f(H_ij). Then δ is the unique quasi-ultrametric such that (B_δ, diam_δ) = (ℋ, f).
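A minimal sketch of the 2-balls B^d_ij used in theorem 1, assuming the dissimilarity is given as a square numpy array; the function names are illustrative.

```python
import numpy as np

def two_ball(d, i, j):
    r = d[i, j]
    # B_d(i, r) ∩ B_d(j, r), returned as a frozenset of object indices
    return frozenset(np.flatnonzero((d[i] <= r) & (d[j] <= r)))

def two_ball_family(d):
    m = d.shape[0]
    return {two_ball(d, i, j) for i in range(m) for j in range(m)}
```

Paired with the diameter of each 2-ball as its level, this family yields the indexed quasi-hierarchy (B_d, diam_d) of the theorem whenever d is quasi-ultrametric.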
We end the section by specifying the Lp-product of dissimilarity spaces. Let (I_1,d_1), ..., (I_r,d_r) be r dissimilarity spaces. Let I = I_1 × ⋯ × I_r. For every 1 ≤ p ≤ ∞, define d : I² → ℝ₊ by: ∀i = (i_1, ..., i_r), j = (j_1, ..., j_r) ∈ I,

$$d(i,j) = \Big[\sum_{l=1}^{r} d_l^p(i_l, j_l)\Big]^{1/p}.$$

Clearly d is a dissimilarity and (I,d) is called the Lp-product of (I_1,d_1), ..., (I_r,d_r). We note d = d_1 ⊕_p ⋯ ⊕_p d_r. For p = 1, we have the direct product and the following simpler notation: d = d_1 ⊕ ⋯ ⊕ d_r.
When p = ∞, (I,d) is called the supremum product.
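A minimal sketch of the Lp-product for r = 2 dissimilarities stored as square numpy arrays; the resulting array is indexed by pairs, d[i1, i2, j1, j2], and the function name is an illustrative assumption.

```python
import numpy as np

def lp_product(d1, d2, p):
    """Lp-product of dissimilarity matrices d1 (m1 x m1) and d2 (m2 x m2)."""
    if np.isinf(p):                      # supremum product
        return np.maximum(d1[:, None, :, None], d2[None, :, None, :])
    return (d1[:, None, :, None] ** p + d2[None, :, None, :] ** p) ** (1.0 / p)

# d = lp_product(d1, d2, p) gives
# d[i1, i2, j1, j2] = ( d1(i1,j1)^p + d2(i2,j2)^p )^(1/p)
```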

3 The product of two hierarchies


Given r sets I_l, l = 1, ..., r, and a collection of subsets ℋ_l on every set I_l, we denote by ℋ = ℋ_1 × ⋯ × ℋ_r the set {H = H_1 × ⋯ × H_r : H_l ∈ ℋ_l, l = 1, ..., r}. Up to a bijection, ℋ is the Cartesian product of ℋ_1, ..., ℋ_r.

Proposition 1 Let ℋ_1 and ℋ_2 be two hierarchies on I_1 and I_2, respectively. Then ℋ = ℋ_1 × ℋ_2 is a quasi-hierarchy on I_1 × I_2, called the product of ℋ_1 and ℋ_2. It is total if and only if both ℋ_1 and ℋ_2 are total.

Proof. It is clear that axioms i), ii) (or ii')) and iv) of a quasi-hierarchy are fulfilled. In particular H = H_1 × H_2 is minimal in ℋ iff both H_1 and H_2 are minimal in ℋ_1 and ℋ_2.
Now, let H = H_1 × H_2, H' = H'_1 × H'_2, H'' = H''_1 × H''_2 be three elements of ℋ. Then H ∩ H' ∩ H'' = (H_1 ∩ H'_1 ∩ H''_1) × (H_2 ∩ H'_2 ∩ H''_2).
First suppose that H_1 ∩ H'_1 ∩ H''_1 = ∅. Since ℋ_1 is a hierarchy, two clusters, say H_1 and H'_1, are necessarily disjoint. Then:

H ∩ H' ∩ H'' = ∅ = (H_1 ∩ H'_1) × (H_2 ∩ H'_2) = H ∩ H'.

A similar property holds whenever H_2 ∩ H'_2 ∩ H''_2 = ∅.
Finally, suppose that H_1 ∩ H'_1 ∩ H''_1 and H_2 ∩ H'_2 ∩ H''_2 are nonempty. Without loss of generality (w.l.o.g.), let H_1 ⊆ H'_1 ⊆ H''_1. If H_2 = min(H_2, H'_2, H''_2), then H ∩ H' ∩ H'' = H = H ∩ H', for example. Otherwise, suppose w.l.o.g. that H'_2 = min(H_2, H'_2, H''_2). Then: H ∩ H' ∩ H'' = H_1 × H'_2 = H ∩ H'. Axiom iii) of a quasi-hierarchy is fulfilled.
The last assertion of the proposition is obvious. ∎

Let us observe that the quasi-hierarchy ℋ obtained in the previous proposition is very particular. For example, the following property is noteworthy. Every cluster H = H_1 × H_2 of ℋ, with H_1 ≠ I_1 and H_2 ≠ I_2, has exactly two predecessors, specifically H_1 × H'_2 and H'_1 × H_2, where H'_1 and H'_2 stand for the (unique) predecessors of H_1 in ℋ_1 and H_2 in ℋ_2.

Corollary 1 Let (ℋ_1, f_1) and (ℋ_2, f_2) be two indexed hierarchies on I_1 and I_2, respectively. Let ℋ = ℋ_1 × ℋ_2 and f : ℋ → ℝ₊ such that:

$$f(H_1 \times H_2) = \big[f_1^p(H_1) + f_2^p(H_2)\big]^{1/p}, \qquad 1 \le p < \infty.$$

Then (ℋ, f) is an indexed quasi-hierarchy, called the Lp-product of (ℋ_1, f_1) and (ℋ_2, f_2).

We note f = f_1 ⊕_p f_2 or simply f = f_1 ⊕ f_2 when p = 1.


For different values of p, the corresponding indexed quasi-hierarchies do not induce the same stratified quasi-hierarchy. For a counter-example, define (ℋ_1, f_1), (ℋ_2, f_2) and clusters H_1, H'_1 ∈ ℋ_1, H_2, H'_2 ∈ ℋ_2 such that:

f_1(H_1) = 1 − ε, f_1(H'_1) = 1/2, f_2(H_2) = ε', f_2(H'_2) = 1/2, 0 < ε' < ε < 1.

Then f_1 ⊕ f_2 (H_1 × H_2) < f_1 ⊕ f_2 (H'_1 × H'_2), whereas f_1 ⊕_2 f_2 (H_1 × H_2) > f_1 ⊕_2 f_2 (H'_1 × H'_2) for ε sufficiently small.
In terms of dissimilarities, we have the following proposition.

Proposition 2 Let (I_1,d_1) and (I_2,d_2) be two ultrametric spaces and let (I,d) be their Lp-product (1 ≤ p < ∞). Then:

1. ∀i,j ∈ I, B^d_ij = B_{d_1}(i_1, d_1(i_1,j_1)) × B_{d_2}(i_2, d_2(i_2,j_2)).

2. d is quasi-ultrametric.

Note that the family of 2-balls of (I,d) does not depend on p.


Similarly, since every point of a ball in an ultrametric space is a centre of the ball, we have: d(j,k) ≤ d(i,j). Thus k ∈ B^d_ij.
Conversely, let k ∈ B^d_ij. Then:

$$d_1^p(i_1,k_1) + d_2^p(i_2,k_2) \le d_1^p(i_1,j_1) + d_2^p(i_2,j_2) \quad\text{and}\quad d_1^p(j_1,k_1) + d_2^p(j_2,k_2) \le d_1^p(i_1,j_1) + d_2^p(i_2,j_2).$$

Suppose, by way of contradiction, that for example d_1(i_1,k_1) > d_1(i_1,j_1). Since d_1 is ultrametric, we have d_1(i_1,j_1) < d_1(i_1,k_1) = d_1(j_1,k_1). Thus, max[d_2(i_2,k_2), d_2(j_2,k_2)] < d_2(i_2,j_2). That violates the ultrametric inequality.
Consequently, Point 1 is proved.
Then, the inclusion condition and the diameter condition in terms of 2-balls for d derive from the same conditions in terms of balls for d_1 and d_2. ∎

Proposition 2 shows that the Lp-product of two ultrametric spaces is connected with an indexed quasi-hierarchy. Similarly, we deduce from Corollary 1 that the Lp-product of two indexed hierarchies is connected with a quasi-ultrametric. In fact, each mapping is the inverse of the other.

Proposition 3 The indexed quasi-hierarchy associated with the Lp-product of two ultrametric spaces is the Lp-product of the corresponding two indexed hierarchies.

Since the clusters of a hierarchy are the balls and the clusters of a quasi-hierarchy are the 2-balls, the proof is immediate from Proposition 2 and Corollary 1. We may also use the opposite way, by observing that the smallest cluster of ℋ_1 × ℋ_2 containing i and j is the Cartesian product of the smallest clusters of ℋ_1 and ℋ_2 containing the corresponding components of i and j.
The following coherent diagram summarizes the previous results.

[(ℋ_1, f_1)   (ℋ_2, f_2)]        ⟶   (ℋ_1 × ℋ_2, f_1 ⊕_p f_2)
        ↓                                    ↓
[d_1 ∈ D_u(I_1)   d_2 ∈ D_u(I_2)]  ⟶   d_1 ⊕_p d_2 ∈ D_qu(I_1 × I_2)

Given the Lp-product (ℋ_1 × ℋ_2, f_1 ⊕_p f_2), it is easy to recover the hierarchical components (ℋ_1, f_1) and (ℋ_2, f_2). Indeed, a subset H_1 of I_1 is in ℋ_1 iff H_1 × I_2 ∈ ℋ_1 × ℋ_2. Thus we have ℋ_1 and ℋ_2 and in particular their minimal elements. Then, for every H_1 ∈ ℋ_1, f_1(H_1) = f_1 ⊕_p f_2 (H_1 × H_2), where H_2 stands for any minimal element of ℋ_2.
A similar way stays valid for a more general problem. Is a given indexed quasi-hierarchy on I_1 × I_2 the Lp-product of two indexed hierarchies (ℋ_1, f_1) and (ℋ_2, f_2) on I_1 and I_2 for some p? Indeed, the previous procedure gives potential candidates as clusters of ℋ_1 and ℋ_2. Then it suffices to check whether ℋ_1 and ℋ_2 are hierarchies and whether ℋ = ℋ_1 × ℋ_2. Similarly we have potential level indices f_1 and f_2 and a unique potential real number p.
In terms of metric spaces, such a property remains clear still.

4 The supremum product of indexed hierarchies


This paragraph is devoted to the supremum product (p = ∞) of hierarchies. As opposed to the previous case (p < ∞), we need here a level index on each hierarchy.

Proposition 4 Let (ℋ_1, f_1), ..., (ℋ_r, f_r) be r indexed hierarchies on I_1, ..., I_r, respectively. Let ℋ' = ℋ_1 × ⋯ × ℋ_r. Define f* : ℋ' → ℝ₊ by: ∀H = H_1 × ⋯ × H_r ∈ ℋ', f*(H) = max_l f_l(H_l). Let ℋ be the subclass of ℋ' defined by:

H ∈ ℋ iff (H' ∈ ℋ', H ⊆ H', f*(H') ≤ f*(H)) ⟹ H = H'.

Let f be the restriction of f* to ℋ.

Then (ℋ, f) is an indexed hierarchy, called the supremum product of (ℋ_1, f_1), ..., (ℋ_r, f_r). We note (ℋ, f) = (ℋ_1, f_1) ⊗ ⋯ ⊗ (ℋ_r, f_r).

Proof. Clearly axiom i) of a hierarchy is fulfilled. The following equivalences are also obvious:

H = H_1 × ⋯ × H_r minimal in ℋ' ⟺ H_l minimal in ℋ_l for every l ⟺ f*(H) = 0.

Thus, a minimal element of ℋ' is in ℋ and the set of minimal elements of ℋ' is the set of minimal elements of ℋ. The class ℋ obeys axiom ii) (or ii')) of a hierarchy. Moreover, if H is minimal in ℋ, f(H) = 0.
Finally, let H = H_1 × ⋯ × H_r, H' = H'_1 × ⋯ × H'_r ∈ ℋ, with H ∩ H' ≠ ∅. Define H''_l = H_l ∪ H'_l for every l, and H'' = H''_1 × ⋯ × H''_r. Since H_l ∩ H'_l ≠ ∅ for every l, H''_l is equal to H_l or H'_l and belongs to ℋ_l.
Suppose w.l.o.g. that f*(H) ≤ f*(H'). Then, for every l, f_l(H''_l) ≤ f*(H'), so that f*(H'') ≤ f*(H'). From the definition of ℋ, we deduce H'' = H'.
Consequently we have: H ⊆ H' and ℋ is a hierarchy.
If H, H' ∈ ℋ with H ⊆ H', then clearly f(H) ≤ f(H') and from the definition of ℋ we deduce: f(H) = f(H') ⟹ H = H'. Moreover we have seen that f(H) = 0 whenever H is minimal, so that the proof is complete. ∎
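A small sketch of this construction for r = 2, assuming each indexed hierarchy is given as a list of (frozenset, level) pairs; the name supremum_product and the data layout are illustrative assumptions, not the author's notation.

```python
from itertools import product

def supremum_product(h1, h2):
    """Keep the product clusters H that no strictly larger H' dominates at the
    same or a lower max-level, as in the definition of the subclass above."""
    prods = [((H1, H2), max(v1, v2)) for (H1, v1), (H2, v2) in product(h1, h2)]
    keep = []
    for (A, B), v in prods:
        dominated = any(A <= A2 and B <= B2 and v2 <= v and (A, B) != (A2, B2)
                        for (A2, B2), v2 in prods)
        if not dominated:
            keep.append(((A, B), v))
    return keep

# e.g. h1 = [(frozenset({'i1'}), 0), (frozenset({'i2'}), 0),
#            (frozenset({'i1', 'i2'}), 2)], and similarly for h2
```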

Proposition 5 Let (I_1,d_1), ..., (I_r,d_r) be r ultrametric spaces and let (I,d) be their supremum product. Then d is ultrametric.

Proof. Let i,j,k ∈ I_1 × ⋯ × I_r. There exists s such that d(i,j) = d_s(i_s,j_s). Then: d(i,j) = d_s(i_s,j_s) ≤ max[d_s(i_s,k_s), d_s(k_s,j_s)] ≤ max[d(i,k), d(k,j)]. ∎

As in the previous section, we may establish a coherent diagram.

Proposition 6 The indexed hierarchy associated with the supremum product of r


ultrametric spaces is the supremum product of the corresponding r indexed hierarchies.

Proof. Denote by (ℋ1, f1), …, (ℋr, fr) the indexed hierarchies defined on I1, …, Ir,
respectively. Let I = I1 × ⋯ × Ir and let (ℋ, f) be the supremum product of
(ℋ1, f1), …, (ℋr, fr). For every i = (i1, …, ir), j = (j1, …, jr) ∈ I and for ev-
ery l = 1, …, r, let H_l(i_l, j_l) be the smallest cluster of ℋ_l containing i_l and j_l. Define
H'_l(i_l, j_l) as the greatest cluster of ℋ_l containing i_l and j_l, and such that f_l(H'_l(i_l, j_l)) ≤
max_l [f_l(H_l(i_l, j_l))]. Let s be an integer such that f_s(H_s(i_s, j_s)) = max_l [f_l(H_l(i_l, j_l))]. Then
H'_s(i_s, j_s) = H_s(i_s, j_s). We show that H' = H'_1(i_1, j_1) × ⋯ × H'_r(i_r, j_r) is in ℋ and is the smallest cluster
of ℋ containing i and j. Indeed the existence of H' is obvious and H' ∈ ℋ derives
from the definition of ℋ. Now, let H'' = H''1 × ⋯ × H''r ∈ ℋ be a cluster containing
i and j. For every l, H_l(i_l, j_l) ⊆ H''_l. Then we have: f(H'') ≥ f_s(H''_s) ≥ f_s(H_s(i_s, j_s)) =
f(H'). Thus H' ⊆ H'' and H' has the announced property. The result follows since
f(H') = max_l [f_l(H_l(i_l, j_l))]. ∎
We may also use the opposite way: observing that a ball B_d(i, d(i, j)) for the supre-
mum product may be expressed as B_{d1}(i1, d(i, j)) × ⋯ × B_{dr}(ir, d(i, j)), it suffices to
show that the family of such balls coincides with the supremum product of hierarchies
defined in Proposition 4.
The following diagram summarizes the previous properties.

    [(ℋ1, f1), …, (ℋr, fr)]  ------>  (ℋ1, f1) ⊗ ⋯ ⊗ (ℋr, fr)
            |                                  |
            v                                  v
    [d1 ∈ Du(I1), …, dr ∈ Du(Ir)]  ------>  supremum product d ∈ Du(I1 × ⋯ × Ir)
Although the supremum product does not contain all elements of the Cartesian prod-
uct, it is still possible to extract the hierarchical components from such a structure
and to solve a more general problem, as in paragraph 3. Indeed, for every H1 ∈ ℋ1,
there always exist some H_l ∈ ℋ_l, l = 2, …, r, such that H = H1 × H2 × ⋯ × Hr ∈ ℋ.
Thus ℋ1, …, ℋr follow. Furthermore, once the hierarchies are known, there is a minimal
H2 × ⋯ × Hr, with H_l ∈ ℋ_l, l = 2, …, r, such that H = H1 × H2 × ⋯ × Hr ∈ ℋ for a
fixed H1 ∈ ℋ1. Then f1(H1) = f(H). One might also use the more tedious but clearer
route via ultrametric spaces.
Replacing the maximum by the minimum, we may establish, with a similar proof, a
property analogous to the one of Proposition 4. From ℋ' = ℋ1 × ⋯ × ℋr, define
f_*: ℋ' → ℝ+ by: ∀ H = H1 × ⋯ × Hr ∈ ℋ', f_*(H) = min_l f_l(H_l). Let ℋ̲ be the
subclass of ℋ' defined by:

H ∈ ℋ̲ iff (H' ∈ ℋ', H' ⊆ H, f_*(H') ≥ f_*(H)) ⇒ H = H'.

Let f̲ be the restriction of f_* to ℋ̲.

Then (ℋ̲, f̲) is an indexed hierarchy in I = I1 × ⋯ × Ir (not on I), i.e. ℋ̲ obeys
only axioms ii) and vi) of a hierarchy.

5 Example
We give here a simple and illustrative example.
Two indexed hierarchies are given via their dendrograms in Figure 1.

Figure 2 exhibits the three clusters at the level 5 for the L1-product of the indexed
hierarchies given in Figure 1: {i1, i2, i3} × J, I × {j1, j2}, I × {j4, j5}. We may
imagine a practical procedure, with a computer, displaying in this way the clusters
at a given level.
The dendrogram for the supremum product of the same indexed hierarchies is
visualized in Figure 3.

[Dendrogram plots omitted; leaves i1, …, i6 and j1, …, j5.]
Figure 1: Two indexed hierarchies

[Plot omitted.]
Figure 2: The three clusters at the level 5 of the L1-product of hierarchies

[Dendrogram plot omitted: the supremum product on the 30 leaves of I × J.]
Figure 3: The supremum product of the indexed hierarchies

Towards Comparison of Decomposable Systems
Mark Sh. Levin
The University of Aizu, Fukushima 965-80, Japan

Summary: The paper focuses on the comparison of decomposable systems on the basis
of combinatorial descriptions of the systems and their parts. Our system description involves
the following interconnected hierarchies: a tree-like system model; criteria and restrictions
for system components (nodes of the model); design alternatives (DAs) for nodes; inter-
connection (Is) or compatibility between DAs of different system components; estimates of
DAs and Is. A vector-like proximity for rankings is described.

1. Introduction
Usually the following basic system analytical problems have been examined to an-
alyze complex systems: (1) to compare two system versions; (2) to classify system
versions; (3) to evaluate a set of system versions (e.g., aggregation, construction of con-
sensus, etc.); (4) to analyze tendencies of system changes (evolution, etc.); (5) to
reveal the most significant system parameters; (6) to plan an improvement process
for a system. Here we examine system representations on the basis of structural
or combinatorial objects. We assume that a system is decomposable and may
have several versions. Thus the following kinds of problems are basic ones: (a) de-
scription of the system and its parts; (b) operations of system analysis and
transformation. Our description of decomposable systems consists of the following
interconnected hierarchies (Levin, 1996b): (1) a system tree-like model; (2) require-
ments (criteria, restrictions) for system components (nodes of the model); (3) design
alternatives (DAs) for nodes; (4) interconnection (Is) or compatibility between DAs
of different components; (5) factors of compatibility.
The system proximity may be examined as the following structure corresponding to
the system model: proximity of the hierarchical tree-like model; proximity of the requirement
hierarchy (criterion hierarchy; restriction hierarchy; compatibility factors hierarchy);
proximity of DAs (sets of DAs, estimates on criteria, and priorities); proximity of
Is (set of Is with priorities). We consider three levels of combinatorial descriptions:
(1) basic combinatorial objects (points in a space; vectors; sets; partitions; rankings,
strings, trees, posets, etc.); (2) elements of the system description: leaf nodes; sets of
DAs and/or Is; tree-like system model; criteria for DAs; etc.; (3) basic system descrip-
tions, e.g., complete description; external requirements. A vector-like proximity for
rankings is then examined.

2. Measurement of proximity
First let us consider approaches to modeling a proximity (distance, similarity, close-
ness, dissimilarity, etc.) for combinatorial objects. Note that these investigations
have been carried out in various disciplines (e.g., mathematical psychology; decision
making; chemistry; linguistics; morphological schemes of systems in technological
forecasting; biology; genetics; data and knowledge engineering; network engineering;
architecture; combinatorics). A survey of coefficients for measures of similarity, dis-
similarity, and distance from the viewpoint of the statistical sciences is presented in (Gower,
1985). From a system viewpoint it is reasonable to examine some functions, op-
erations and corresponding requirements to mathematical models of the proximity
(Table 1).

Table 1. Some operations and requirements to mathematical models

| Functional phase | Operations | Requirements |
(The detailed entries, covering the phases development/design, representation, study and learning, and processing/utilization, are not reproduced here.)

Formal requirements to proximity models are based on the three Fréchet axioms specify-
ing metrics, and sometimes on additional axioms (Kemeny and Snell, 1972; etc.). In
some cases the triangulation axiom is rejected, e.g., for architectural objects (Zeitoun,
1977) or for rankings (Belkin and Levin, 1990). Measuring the proximity between com-
binatorial objects is based on the following approaches: (1) a metric in a parameter
space; (2) attributes of the largest common part of objects (intersection) or of a uni-
fication (the minimal covering construction); (3) the minimum of changes (change path)
which allows one to transform an initial object into a target one.
Secondly, let us consider scales of measurement. Traditionally R1, [0,1] or an ordinal
scale are applied. Hubert and Arabie use measures of agreement or consensus indices,
e.g., from [-1,1] (Hubert and Arabie, 1985). Recently some extensions of metric
spaces have been proposed (Pouzet and Rosenberg, 1994; Barthelemy and Guenoche,
1991; etc.), for example: (a) graphs and ordered sets (Jawhari et al., 1986); (b) con-
ceptual lattices for complex scaling (Ganter and Wille, 1989); (c) ordered sets and
semilattices for partitions (Barthelemy et al., 1986); (d) simplices for rankings (Belkin
and Levin, 1990). Generally, Arabie and Hubert have examined three approaches to
compare combinatorial objects (sequences, partitions, trees, graphs) through given
matrices (Arabie and Hubert, 1992): (a) an axiomatic approach to construct "good"
measures; (b) usage of structural representations; (c) usage of an optimization task.
Finally, in complex cases, the following approaches may be applied: (a) multidimen-
sional scaling (Torgerson, 1958; Kruskal, 1977; etc.); (b) usage of graphs and ordered
sets as a kind of metric space (Jawhari et al., 1986; Barthelemy et al., 1986;
Barthelemy and Guenoche, 1991; Ganter and Wille, 1989); (c) integrating or compos-
ing a global proximity from distances or proximities of system components.

3. Combinatorial objects and system description

Main approaches to compare combinatorial objects are the following (Arabie and
Hubert, 1992; Pouzet and Rosenberg, 1994; etc.):
1. Points in a space: traditional metrics.
2. Sets, systems of representatives: metrics (Margush, 1982; etc.).
3. Partitions: metrics (Mirkin and Chernyi, 1970; etc.), multidimensional scaling
(Arabie and Boorman, 1973), measures of agreement or consensus indices (Hubert
and Arabie, 1985), ordered sets, semilattices (Barthelemy et al., 1986).
4. Linear (ordinal) rankings: metrics (Kendall, 1962; Kemeny and Snell, 1972; Cook
and Kress, 1984).
5. Strings: maximum subsequence (Wagner and Fisher, 1974; Sellers, 1974), minimum
supersequence (Timkovskii, 1989; etc.), metrics (Hannenhalli and Pevzner, 1995).
6. Linear rankings with values (sets of numbers): metrics (Rote, 1991).
7. Group rankings (elements with ordinal priorities): metrics (Cook and Kress, 1984;
Kendall, 1962), vector-like proximity (Belkin and Levin, 1990).
8. Sets of strings: measures from [0,1] (Lemone, 1982).
9. Trees: metrics (Boorman and Oliver, 1973; Robinson and Foulds, 1981; Mar-
gush, 1982), distance as the minimum cost transformation and the largest common
substructure (Tai, 1979), proximity (Barthelemy and Guenoche, 1991; Akutsu and
Halldorsson, 1994).
10. Trees with labeled leaves: metrics (Hendy et al., 1984; Day, 1985).
11. Hierarchies: metrics (Botafogo et al., 1992).
12. Graphs and posets: metrics (Bogart, 1973; etc.), structural representations of
proximity (Arabie and Hubert, 1992).
An aggregation of combinatorial objects is studied in (Hubert and Arabie, 1985; Day,
1985; Barthelemy et al., 1986; Gordon, 1986; Arabie and Hubert, 1992; etc.). A re-
lationship between the basic elements of our system model and the combinatorial objects above
is the following: (1) points in a space: leaf nodes, DAs, Is, priorities of DAs, esti-
mates of Is, requirements; (2) sets: leaf nodes, DAs, Is, requirements; (3) partitions:
requirements; (4) ordinal rankings: DAs, Is, requirements; (5) strings: DAs, Is; (6)
linear rankings: DAs, Is, requirements; (7) sets of strings: DAs, requirements; (8)
trees: system model, requirements; (9) posets: estimates of DAs and Is, require-
ments. Clearly, it is reasonable to apply typical system descriptions of decomposable
systems as follows: (1) external requirements: criteria, factors, restrictions; (2) sys-
tem model: structural (tree-like) model, leaf nodes (components), DAs; (3) extended
system model: structural (tree-like) model, leaf nodes (components), DAs, Is, priori-
ties of DAs, ordinal estimates of Is.

4. Vector-like proximity for rankings

Usually the proximity measures used are scalar functions which satisfy Fréchet's
axioms for metrics (pseudo-metrics). Kendall's proximity measure is the most widely
used (Kendall, 1962). Here we describe a vector-like proximity for rankings
(Levin, 1988; Belkin and Levin, 1990). In this case the measurement scale is a simplex,
in which a poset may be constructed for a set of measured objects. Let G = (A, E)
be a digraph, where A = {1, …, i, …, n} is a set of vertices (i.e., objects, discrete
information units) and E is a set of edges corresponding to a preference.
By the above we may examine the following kinds of digraphs: (1) tree (denoted by
T); (2) parallel-series graph (P); (3) acyclic graph or partial ordering (R); (4) chain or
linear ordering (L); (5) layered structure S (group ordering, ranking or stratification),
in which the set A is divided into m subsets (layers) without intersections as follows:
(a) A(l), l = 1, …, m, and ∀ l1, l2, l1 ≠ l2: |A(l1) ∩ A(l2)| = 0; (b) if l1 < l2 then i1 ≻ i2
∀ i1 ∈ A(l1) and ∀ i2 ∈ A(l2); (6) fuzzy layered structure S_f, allowing any object
to belong to a group of successive layers: the number or the interval of layers
to which the i-th object belongs is defined ∀ i ∈ A: π(i), or δ(i) = [d(1,i), d(2,i)] with
1 ≤ d(1,i) ≤ d(2,i) ≤ m, and π(i) = d(1,i) if d(1,i) = d(2,i). Thus the system of
intervals {δ(i)} is specified. By analogy with the definitions above it is possible to specify
clusters and fuzzy clusters. Sometimes the comparison of structures representing the
union of similar graphs (e.g., chains, NL; layered structures, NS) is of particular
interest in practice. Let Θ(S) be the set of all layered structures on A.
Definition 1. We say that
δ_π(i, S, Q) = π(i, S) − π(i, Q),
δ_π(i, j, S, Q) = π(i, S) − π(j, S) − (π(i, Q) − π(j, Q)),
where π(i, S) = l ∀ i ∈ A(l) in S, are the first order error ∀ i ∈ A, and the second
order error ∀ (i, j) ∈ {A × A | i ≠ j}, ∀ S, Q ∈ Θ(S), respectively. Thus for an estimate
of the discordance between the structures S, Q ∈ Θ(S) with respect to i and (i, j) we
obtain an integer-valued scale with the following ranges: −(m−1) ≤ r ≤ m−1 for
δ_π(i, S, Q), and −2(m−1) ≤ r ≤ 2(m−1) for δ_π(i, j, S, Q).
Definition 2. Let
x(S, Q) = (x[−(m−1)], …, x[−1], x[1], …, x[m−1]),   (1)
y(S, Q) = (y[−2(m−1)], …, y[−1], y[1], …, y[2(m−1)]),   (2)
be vectors of an error (proximity) ∀ S, Q ∈ Θ(S) with respect to the components i (1st
order) and the pairs (i, j) (2nd order). The vector components are:

x[r] = |{i ∈ A | δ_π(i, S, Q) = r}| / n,

y[r] = 2 |{(i, j) ∈ {A × A | i ≠ j} | δ_π(i, j, S, Q) = r}| / (n(n−1)).

It may be reasonable to define similar vectors of higher order also. Moreover it is
possible to examine weighted errors of the first and the second order, taking
into account the dependence on the corresponding number l of the layer A(l) in the definition
of the vector components.
Now denote the set of arguments for the components of the vectors x and y as
Ω = {−k, …, k}, the negative values as Ω−, and the positive ones as Ω+. In addition, we use
vectors x with aggregate components of the following types (similarly, for y):

x[k1, k2] = Σ_{r=k1}^{k2} x[r],   x[≤ −k] = Σ_{r=−(m−1)}^{−k} x[r],   x[≥ k] = Σ_{r=k}^{m−1} x[r],   k > 0,

x[|r|] = x[r] + x[−r].


Definition 3. Let |x(S, Q)| = Σ_{r∈Ω} x[r], |y(S, Q)| = Σ_{r∈Ω} y[r] be the modules of the
vectors.
Hereafter we will consider the vector x as the basic one.
Definition 4. We will call vectors truncated ones if
(1) part of the terminal components is rejected, e.g.
x(S, Q) = (x[−k1], x[−(k1 − 1)], …, x[−1], x[1], …, x[k2 − 1], x[k2]),
and one or both of the following conditions are satisfied: k1 < m−1, k2 < m−1;
(2) aggregate components are used as follows:

x(S, Q) = (x[|1|], …, x[|r|], …, x[|k|]).   (3)

Definition 5. Let us call the vector x (y):
(a) two-sided, if |Ω+| ≠ 0 and |Ω−| ≠ 0;
(b) one-sided, if |Ω+| = 0 or |Ω−| = 0;
(c) symmetrical, if −r ∈ Ω− exists ∀ r ∈ Ω+, and vice versa;
(d) modular, if it is defined with respect to Definition 4 (3).
So we obtain a pair of linear orders on the components of the vectors x (1) and y (2):
component 1 (−1) ≺ … ≺ component k (−k).
Clearly, if the components are aggregate ones, the orders will be analogous.
Definition 6. x1(S, Q) ≽ x2(S, Q), Ω(x1) = Ω(x2), ∀ S, Q ∈ Θ(S), if any decrease
of 'weak' components of x1 in comparison with x2 is compensated by a corresponding
increase of its 'strong' components (r, p ∈ Ω+ or −r, −p ∈ Ω−):

Σ_{r≥u} x1[r] − Σ_{r≥u} x2[r] ≥ 0,   ∀ u ∈ Ω+ (∀ −u ∈ Ω−, −r ≤ −u).   (4)

It is possible to strengthen condition (4) by using a right-hand side equal to a positive
parameter.
Definition 7. Let M = {x ∈ X | Σ_{r∈Ω} x[r] = 1} be a marginal set (similarly for y).
Note that ∀ x (y) there exists a dominating subset D(x) = {η ∈ M | η ≽ x}.
Definition 8. Let a pair of vectors x1, x2 (y1, y2) be:
(a) comparable, if x1 ≽ x2 (therefore D(x2) ⊇ D(x1)), or vice versa;
(b) strongly uncomparable, if |D(x1, x2)| = |D(x1) ∩ D(x2)| = 0;
(c) weakly uncomparable, if |D(x1, x2)| ≠ 0 and D(x1, x2) includes neither D(x1) nor
D(x2).
Finally let us consider the properties of our vector-like proximity:
1. Condition (4) defines a poset.
2. 0 ≤ |x(S, Q)| ≤ 1, 0 ≤ |y(S, Q)| ≤ 1 ∀ S, Q ∈ Θ(S).
3. x(S, Q) ≽ (0, …, 0), y(S, Q) ≽ (0, …, 0) ∀ S, Q ∈ Θ(S).
4. The following condition is true for one-sided vectors: x(S, Q) ≼ (0, 0, …, 0, 1),
y(S, Q) ≼ (0, 0, …, 0, 1).
5. The following condition is true for any two-sided vector x(S, Q), ∀ S, Q ∈ Θ(S):
there exists a vector e = (e[−k1], 0, …, 0, e[k2]) ∈ M (k1, k2 > 0) such that x(S, Q) ≽ e
(similarly, for y).
6. For any modular vector the following is true: x(S, Q) = x(Q, S) (similarly, for y).
7. For any two-sided symmetrical vector the following is true: x(S, Q) = x⁻(Q, S),
where x⁻[r] = x[−r] (similarly, for y).
8. ∀ x(S, Q), ∀ S, Q ∈ Θ(S), the following is true: if x(S, Q) = (0, …, 0) then S = Q.
An assessment of the proximity between fuzzy layered structures is a more complicated
problem. Let us consider an example of a qualitative vector-like proximity for fuzzy
structures S_f, Q_f ∈ Θ(S_f), where Θ(S_f) is the set of all fuzzy layered structures on A.
Definition 9. Let z(S_f, Q_f) = (z[−(m−1)], …, z[−1], z[1], …, z[m−1]), where

z[r] = |{i ∈ A | d(2, i, S_f) − d(1, i, Q_f) = r, |δ(i, S_f) ∩ δ(i, Q_f)| = 0}| / n,   r > 0;

z[r] = |{i ∈ A | d(1, i, S_f) − d(2, i, Q_f) = −r, |δ(i, S_f) ∩ δ(i, Q_f)| = 0}| / n,   r < 0;

be the vector of the 1st order proximity between S_f, Q_f ∈ Θ(S_f) with respect to each
i ∈ A.

In the same way, we may describe properties for the vectors z, which are similar to those
of the vector x (y) (except for the 8th one).

5. Example
Let us consider an example: A = {1, 2, 3, 4, 5, 6, 7, 8, 9}; S1: A1 = {2, 4}, A2 =
{9}, A3 = {1, 3, 7}, A4 = {5, 6, 8}; and S2: A1 = {7, 9}, A2 = {1, 3}, A3 = {2, 5, 8}, A4 =
{4, 6}. Let ||g_ij|| (i, j ∈ A) be the adjacency matrix for a graph G:

g_ij = 1 if i ≻ j;   g_ij = 0 if i ≈ j;   g_ij = −1 if i ≺ j.

Then Kendall's proximity measure (metric) for graphs G1 and G2 is the following
(Kendall, 1962): ρ_K(G1, G2) = Σ_{i<j} |g¹_ij − g²_ij|, where g¹_ij, g²_ij are the elements of the adja-
cency matrices of the graphs G1 and G2, respectively. The adjacency matrices corresponding
to our example are the following:
[The two 9×9 adjacency matrices ||g¹_ij|| and ||g²_ij|| of S1 and S2 are not reproduced here.]
So, Kendall's distance is ρ_K(S1, S2) = 31. The proposed vector-like proximity allows us to
describe the dissimilarity between the two structures more prominently:
π(S1) = (π_1(S1), …, π_i(S1), …, π_9(S1)) = (3, 1, 3, 1, 4, 4, 3, 4, 2),
π(S2) = (π_1(S2), …, π_i(S2), …, π_9(S2)) = (2, 3, 2, 4, 3, 4, 1, 3, 1).

δ_π(S1, S2) = (1, −2, 1, −3, 1, 0, 2, 1, 1),
[The 9×9 matrix of second order errors δ_π(i, j, S1, S2) is not reproduced here.]

x(S1, S2) = (x[−3], x[−2], x[−1], x[1], x[2], x[3]) = (1, 1, 0, 5, 1, 0),

y(S1, S2) = (y[−6], y[−5], y[−4], y[−3], y[−2], y[−1], y[1], y[2], y[3], y[4], y[5], y[6]) =
(0, 1, 5, 5, 1, 6, 6, 0, 1, 1, 0, 0).

Here we do not use in x and y the coefficients 1/n and 2/(n(n−1)). Aggregate components
can be applied too: x(S1, S2) = (x[≤ −1], x[≥ 1]) = (2, 6).
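The first order computations of this example can be reproduced with a short Python sketch (our own illustration, not the author's code; the 1/n coefficient is dropped, as above):

```python
S1 = {2: 1, 4: 1, 9: 2, 1: 3, 3: 3, 7: 3, 5: 4, 6: 4, 8: 4}   # pi(i) in S1
S2 = {7: 1, 9: 1, 1: 2, 3: 2, 2: 3, 5: 3, 8: 3, 4: 4, 6: 4}   # pi(i) in S2
A, m = range(1, 10), 4

delta = [S1[i] - S2[i] for i in A]        # first order errors
assert delta == [1, -2, 1, -3, 1, 0, 2, 1, 1]

# components x[r], r = -(m-1)..m-1, r != 0, without the 1/n coefficient
x = {r: sum(1 for d in delta if d == r) for r in range(-(m - 1), m) if r != 0}
assert [x[r] for r in (-3, -2, -1, 1, 2, 3)] == [1, 1, 0, 5, 1, 0]

# aggregate components x[<= -1] and x[>= 1]
assert (sum(x[r] for r in x if r < 0), sum(x[r] for r in x if r > 0)) == (2, 6)
```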

6. Conclusion
A combinatorial approach (e.g., description, design, transformation) to decomposable
systems has been described in (Levin, 1996a; Levin, 1996b). Comparison of decom-
posable systems can be applied in decision making, information engineering (e.g.,
hypertext systems), network design, etc. Clearly, composing a global system
proximity from the proximities of the system parts is a central problem. Finally, let us point
out the basic kinds of changes for our system description: (1) internal changes: microlevel
(DAs), parts (subsystems, requirements), macrolevel (model); (2) external changes:
requirements. Note that the proposed vector-like proximity can be used in the aggregation of
rankings.

References:
Arabie, Ph. and Boorman, S.A. (1973): Multidimensional Scaling of Measures of Distance between Partitions. J. of Math. Psychology, 10, 2, 148-203.
Arabie, Ph. and Hubert, L.J. (1992): Combinatorial Data Analysis. Annual Review of Psychology, 43, 169-203.
Barthelemy, J.-P. and Guenoche, A. (1991): Trees and Proximity Representations, Wiley, New York.
Barthelemy, J.-P., Leclerc, B. and Monjardet, B. (1986): On the Use of Ordered Sets in Problems of Comparison and Consensus of Classifications. J. of Classification, 3, 2, 187-224.
Belkin, A.R. and Levin, M.Sh. (1990): Decision Making: Combinatorial Models of Information Approximation, Nauka Publishing House, Moscow (in Russian).
Bogart, K.P. (1973): Preference Structures I: Distances between Transitive Preference Relations. J. of Math. Sociology, 3, 49-67.
Boorman, S.A. and Oliver, D.C. (1973): Metrics on Spaces of Finite Trees. J. of Math. Psychology, 10, 1, 26-59.
Botafogo, R.A., Rivlin, E. and Shneiderman, B. (1992): Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics. ACM Trans. on Information Systems, 10, 2, 142-180.
Cook, W.D. and Kress, M. (1984): Relationships between l1 Metrics on Linear Ranking Spaces. SIAM J. on Appl. Math., 44, 1, 209-220.
Day, W.H.E. (1985): Optimal Algorithms for Comparing Trees with Labeled Leaves. J. of Classification, 2, 1, 7-28.
Ganter, B. and Wille, R. (1989): Conceptual Scaling. In: Applications of Combinatorics and Graph Theory to the Biological and Social Sciences, Roberts, F.S. (ed.), 140-167, Springer-Verlag, New York.
Gordon, A.D. (1986): Consensus Supertrees: The Synthesis of Rooted Trees Containing Overlapping Sets of Labeled Leaves. J. of Classification, 3, 2, 335-348.
Gower, J.C. (1985): Measures of Similarity, Dissimilarity, and Distance. In: Encyclopedia of Statistical Sciences, 5, Kotz, S. and Johnson, N.L. (eds.), 397-405, Wiley, New York.
Hannenhalli, S. and Pevzner, P. (1995): Transforming Men into Mice, Technical Report CSE-95-012, Dept. of Computer Science and Engineering, Pennsylvania State University.
Hendy, M.D., Little, C.H.C. and Penny, D. (1984): Comparing Trees with Pendant Vertices Labelled. SIAM J. on Appl. Math., 44, 5, 1054-1065.
Hubert, L. and Arabie, P. (1985): Comparing Partitions. J. of Classification, 2, 193-218.
Jawhari, E.M., Pouzet, M. and Misane, D. (1986): Retracts: Graphs and Ordered Sets from the Metric Point of View. In: Contemporary Mathematics, 57, Rival, I. (ed.), AMS, Providence.
Kemeny, J.G. and Snell, J.L. (1972): Mathematical Models in the Social Sciences, MIT Press, Cambridge, Mass.
Kendall, M. (1962): Rank Correlation Methods, 3rd ed., Hafner, New York.
Kruskal, J.B. (1977): The Relationship between Multidimensional Scaling and Clustering. In: Classification and Clustering, Van Ryzin, J. (ed.), 17-44, Academic Press, New York.
Lemone, K.A. (1982): Similarity Measures between Strings Extended to Sets of Strings. IEEE Trans. on Pattern Analysis and Machine Intelligence, 4, 3, 345-347.
Levin, M.Sh. (1988): Vector-like Criterion for Proximity of Structures. In: Abstracts of the Third Conf. "Problems and Approaches of Decision Making in Organizational Management Systems", Inst. for System Analysis, Moscow, 20-21 (in Russian).
Levin, M.Sh. (1996a): Towards Combinatorial Engineering of Decomposable Systems. In: 13th European Meeting on Cybernetics and Systems, 1, Vienna, 265-270.
Levin, M.Sh. (1996b): Hierarchical Morphological Design of Decomposable Systems. Concurrent Engineering: Research and Applications, 4, 2, 111-117.
Margush, T. (1982): Distances between Trees. Discr. Appl. Math., 4, 281-290.
Mirkin, B.G. and Chernyi, L.B. (1970): On Measurement of Distance between Partitions of a Finite Set of Units. Automation and Remote Control, 31, 786-792.
Pouzet, M. and Rosenberg, I.G. (1994): General Metrics and Contracting Operations. Discrete Mathematics, 130, 1-3, 103-169.
Robinson, D.F. and Foulds, L.R. (1981): Comparison of Phylogenetic Trees. Mathematical Biosciences, 53, 1/2, 131-147.
Rote, G. (1991): Computing the Minimum Hausdorff Distance between Two Point Sets on a Line under Translation. Information Processing Letters, 38, 3, 123-127.
Sellers, P.H. (1974): An Algorithm for the Distance between Two Finite Sequences. J. of Combinatorial Theory, Ser. A, 16, 253-258.
Tai, K.-C. (1979): The Tree-to-Tree Correction Problem. J. of the ACM, 26, 3, 422-433.
Timkovskii, V.G. (1989): Complexity of Common Subsequence and Supersequence Problems and Related Problems. English translation from Kibernetika, 5, 1-13.
Torgerson, W.S. (1958): Theory and Methods of Scaling, Wiley, New York.
Wagner, R.A. and Fisher, M.J. (1974): The String-to-String Correction Problem. J. of the ACM, 21, 1, 168-173.
Zeitoun, J. (1977): Trames planes. Introduction à une étude architecturale des trames, Bordas, Paris.
Performance of Eight Dissimilarity Coefficients to
Cluster a Compositional Data Set
Maria Cristina Martin
Faculty of Engineering Sciences, Osaka University
Machikaneyama-cho 1-3, Toyonaka-shi
Osaka 560, Japan

Summary: Concerned with the problem of clustering a compositional data set consisting
of vectors of positive components subject to a unit-sum constraint, as a first step we looked
for an appropriate dissimilarity coefficient or distance between two compositions. In this
paper we selected eight different dissimilarity measures, and their performance was evalu-
ated by means of graphics and cluster validity coefficients of six clustering methods applied
to three compositional data sets. Recent criteria for measures of compositional dif-
ference are also tested for those measures emerging as the best to cluster compositions.

1. Introduction
Any compositional data set consists of N D-part compositions, with x_ri the i-th
component of the r-th composition (r = 1, …, N; i = 1, …, D), satisfying the require-
ments that each component is non-negative and that the sum of all the components
in each composition is 1.
It is our objective to study the problem of Cluster Analysis for compositional data
sets. As a first step, we will look for an appropriate dissimilarity coefficient or dis-
tance between two compositions. Aitchison (1992) discussed some criteria to define
a measure of difference between two compositions, although no clustering problems
were tried by the author. Briefly, Aitchison's proposal can be stated as follows:
The appropriate sample space for a compositional data vector x = (x_1, …, x_D) is
the d-dimensional positive simplex (Aitchison (1986, Chapter 2))

S^d = {(x_1, …, x_D): x_1 > 0, …, x_D > 0 and x_1 + ⋯ + x_D = 1}.   (1)

As a way to eliminate the constant-sum constraint difficulty that each compositional
data vector x ∈ S^d must satisfy, and therefore to work in a space without restrictions,
Aitchison (1986, Sec. 4.6) proposed the following two transformations of the data:
• the logratio vector y = log(x_{−D} / x_D) ∈ R^d, where x_{−D} is the vector x with the last
component omitted, and
• the centred logratio vector z = log(x / g(x)) ∈ R^D, where g(x) = (x_1 ⋯ x_D)^{1/D} is the
geometric mean of the components.
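As a hedged numerical illustration of these two transformations (our own Python sketch; the composition values are invented):

```python
import numpy as np

x = np.array([0.2, 0.3, 0.5])        # a hypothetical 3-part composition

g = np.exp(np.log(x).mean())         # geometric mean g(x) = (x1*...*xD)**(1/D)
y = np.log(x[:-1] / x[-1])           # logratio vector, in R^(D-1)
z = np.log(x / g)                    # centred logratio vector; sums to 0
```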

Aitchison's ideas are based on three postulates:
1.- A composition contains information about relative, not absolute, magnitudes of
its components. In other words, by writing r_i = x_i / x_D (i = 1, …, d), a composition can
be completely determined by x_i = r_i / (r_1 + ⋯ + r_d + 1) (i = 1, …, d) and x_D = 1 / (r_1 + ⋯ + r_d + 1);
2.- Any discussion of the variability of a composition can be expressed in terms of
ratios of components, such as x_i / x_j;
3.- No compositional information is lost if log(x_i / x_j) is studied instead of the ratios them-
selves.


Then, Aitchison (1992) (see also Aitchison (1994)) derives

[ Σ_{i<j} ( log(x_ri / x_rj) − log(x_si / x_sj) )² ]^{1/2}   (2)

"as the simplest and most tractable measure of difference or distance between two
compositions x_r = (x_r1, …, x_rD) and x_s = (x_s1, …, x_sD)". The above expression,
apart from a constant factor, coincides with the Euclidean distance between two
centred logratio vectors, which is the first proposal of Aitchison (1986, p. 193) as the
required measure.
The fundamental point of Aitchison's criteria in order to obtain (2) is:
"Any scalar measure of difference between two compositions x_r and x_s must be
expressible in terms of ratios of components in the same composition (that is, in terms
of x_ri/x_rj and x_si/x_sj), and also as ratios of components in different compositions (that is, as
x_ri/x_si or x_sj/x_rj). In other words, the measure will be expressible as a function of the ratio
r/R (or, in a symmetric way, R/r), where r = x_ri/x_rj and R = x_si/x_sj".

In the present paper, we first review some dissimilarity coefficients or distances which
are currently being used in Cluster Analysis and select those we consider capable of
measuring the difference between two compositions. Then, we conduct a study of six
clustering methods applied to three compositional data sets in order to explore, and
compare with Aitchison's proposal, the performance of the different dissimilarities
displayed. This performance is evaluated by means of graphics and coefficients of cluster
validity of the clustering techniques used. Also, Aitchison's criteria are tested for those
coefficients emerging as the best to cluster compositions.

2. Dissimilarity Measures and Cluster Methods considered

As possible measures of difference between two compositions x_r = (x_r1, …, x_rD) and
x_s = (x_s1, …, x_sD) we consider:

City Block or Manhattan Distance: Σ_{i=1}^{D} |x_ri − x_si|
Euclidean Distance: [ Σ_{i=1}^{D} (x_ri − x_si)² ]^{1/2}
Chebychev Distance: max_i |x_ri − x_si|
Jeffreys-Matusita Distance: [ Σ_{i=1}^{D} (√x_ri − √x_si)² ]^{1/2}
Divergence Dissimilarity of Jeffreys: Σ_{i=1}^{D} log(x_ri / x_si) (x_ri − x_si)
Bhattacharyya (log): −log( Σ_{i=1}^{D} √x_ri √x_si )
Bhattacharyya (arccos): arccos( Σ_{i=1}^{D} √x_ri √x_si )
Aitchison Distance: [ Σ_{i=1}^{D} ( log(x_ri / g(x_r)) − log(x_si / g(x_s)) )² ]^{1/2}
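For concreteness, here is a hedged Python sketch of the eight measures (our own code; the function and key names are ours, the formulas follow the list above):

```python
import numpy as np

def dissimilarities(xr, xs):
    xr, xs = np.asarray(xr, float), np.asarray(xs, float)
    b = np.sum(np.sqrt(xr * xs))                      # Bhattacharyya coefficient
    clr = lambda v: np.log(v) - np.log(v).mean()      # centred logratio
    return {
        "manhattan":   np.sum(np.abs(xr - xs)),
        "euclidean":   np.sqrt(np.sum((xr - xs) ** 2)),
        "chebychev":   np.max(np.abs(xr - xs)),
        "jeffreys_matusita":   np.sqrt(np.sum((np.sqrt(xr) - np.sqrt(xs)) ** 2)),
        "divergence_jeffreys": np.sum(np.log(xr / xs) * (xr - xs)),
        "bhattacharyya_log":    -np.log(b),
        "bhattacharyya_arccos": np.arccos(min(b, 1.0)),
        "aitchison":   np.sqrt(np.sum((clr(xr) - clr(xs)) ** 2)),
    }

print(dissimilarities([0.2, 0.3, 0.5], [0.1, 0.4, 0.5]))
```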

Based on the above cited measures, four Hierarchical and two Partitioning Clus-
tering Methods were considered. Single, Complete and Average Linkage were the
Agglomerative Hierarchical Clustering Methods used, and the Macnaughton-Smith algo-
rithm was the Divisive Hierarchical Clustering Method applied. Among Partitioning
Clustering Methods we have considered Partitioning Around Medoids and Fuzzy Anal-
ysis. Single and Complete Linkage outputs were obtained by means of the SPSS or S-Plus
packages, whereas the programs AGNES (Average Linkage), DIANA (Divisive Analy-
sis), PAM (Partitioning Around Medoids) and FANNY (Fuzzy Analysis) of Kaufman
and Rousseeuw (1990) were used in the computation of the other algorithms.

3. Criteria of Comparison

The behavior of the cited dissimilarity coefficients or distances for grouping com-
positions was evaluated by means of:
3.1: The usual dendrogram, when Single and Complete Linkage were applied;
3.2: Banner plots and the corresponding Agglomerative and Divisive coefficients,
when AGNES and DIANA were used (Kaufman and Rousseeuw (1990, p. 212 and
p. 263)). Both coefficients range between 0 and 1; a coefficient near 1 indicates that
a strong clustering structure has been found;
3.3: Silhouette plots and the Silhouette coefficients, when PAM and FANNY were
considered (Kaufman and Rousseeuw (1990, p. 83)). Conclusions here are based on
Kaufman and Rousseeuw's subjective interpretations of the silhouette coefficient (SC):
0.75 ≤ SC ≤ 1.00, a strong structure has been found,
0.50 ≤ SC < 0.75, the structure is reasonable,
0.25 ≤ SC < 0.50, the structure is weak and could be artificial,
SC < 0.25, no substantial structure has been found;
3.4: The cluster validity was also evaluated by means of the normalized version
of Dunn's partition coefficient (Kaufman and Rousseeuw (1990, Chapter 4)) when
FANNY was applied. Dunn's partition coefficient always varies from 0 (entirely
fuzzy) to 1 (hard clustering).
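The normalized coefficient in 3.4 admits a one-line computation; a hedged Python sketch (our own code, following the normalization of Kaufman and Rousseeuw as we understand it):

```python
import numpy as np

def dunn_normalized(U):
    """U is an n x k fuzzy membership matrix whose rows sum to 1;
    returns 0 for an entirely fuzzy and 1 for a hard clustering."""
    n, k = U.shape
    F = np.sum(U ** 2) / n                  # Dunn's partition coefficient
    return (F - 1.0 / k) / (1.0 - 1.0 / k)  # normalized version
```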

4. Applications and Results

Three compositional data sets (two of them consisting of three-part compositions,
and therefore allowing a visual representation in a ternary diagram) were used in all
the comparisons:
- (Sand, Silt, Clay) compositions of 39 sediment samples in an Arctic lake (Aitchi-
son (1986, p. 359, Data 5));
- (Albite, Blandite, Cornite) compositions, abbreviated as (ABC), of 25 specimens of
Hongite and 25 specimens of Kongite (Aitchison (1986, p. 354-356, Data 1 and 2));
- (Albite, Blandite, Cornite, Daubite, Endite) or (ABCDE) compositions, for the 25
Hongite and 25 Kongite specimens (Aitchison (1986, p. 354-356, Data 1 and 2)).
In the ternary diagram of Figure 1 the ABC subcompositions for the Hongite and Kongite
specimens are represented. When Single Linkage was applied, the marked curvature
or banana configuration displayed by the data apparently prevails over the type of
dissimilarity proposed. This means that, independently of the measure used, from the
dendrogram by Single Linkage we can construct a cluster with 49 elements and the
remaining element in another cluster (the loop in Figure 1 was merely drawn to in-
dicate the isolated point). Thus the much criticized property of Single Linkage to give
a long serpentine cluster (the so-called "chaining") allowed in this example the joining of
compositions with insignificant proportions of cornite and compositions with insignif-
icant proportions of blandite. The same effect was observed from the dendrograms
of Single Linkage for the (ABCDE) compositions.

[Ternary diagram plot omitted.]

Figure 1: Ternary diagram of (A, B, C) subcompositions of 25 specimens of Hongite
and 25 specimens of Kongite

The ternary diagrams of Figure 2 summarize the observations from the dendrograms
by Single Linkage of the three-part compositions of the 39 samples in the Arctic lake
sediments.
The extensive variation in the ratio of clay to sand is again represented in the shape of
a banana. Even though the ternary diagrams show a clear cluster of compositions with
low proportions of sand and approximately equal proportions of silt and clay, this
cluster was not detected by Aitchison's approach when Single Linkage was applied.
Aitchison's distance prefers to isolate compositions with a very low proportion of
clay (Figure 2(a)), whereas for the other seven coefficients the same four- and six-
cluster solutions could be observed (Figure 2(b)). Clearly, the latter four-cluster solutions
are reasonable, but not the solutions using Aitchison's coefficient.

[Ternary diagram plots omitted: (a) left, (b) right.]

(a) 4 clusters using Aitchison's distance
(b) 4 clusters using 7 different coefficients

Figure 2: Ternary diagrams of the 39 (Sand, Silt, Clay) compositions in the Arctic lake
sediments and the 4-cluster solutions by Single Linkage with eight different dissimi-
larity coefficients

No special or different behavior of the dissimilarities could be observed from the den-
drograms by Complete Linkage, although the structures are more compact when
the Divergence of Jeffreys and Bhattacharyya (log) dissimilarities are used. However,
"very compact clusters, even if not necessarily well separated" is the known tendency
of Complete Linkage.
Tables 1 to 4 summarize the cluster coefficients of AGNES, DIANA, PAM and
FANNY for the eight dissimilarities under study in each of the compositional data sets.
Because of the amount of output, only coefficients (not plots) are shown here. Also,
we reproduce the silhouette coefficients by PAM (not by FANNY) and Dunn's partition
coefficients by FANNY for k = 2, …, 7, where k is the selected number of clusters
(results for k = 8, …, N−1 are, in general, still worse).
All these coefficients (Tables 1 to 4) underline the fact that the Divergence of Jeffreys and
Bhattacharyya (log) dissimilarities produce the better structures for the compositional
data sets considered.

Table 1: Agglomerative Coefficients of three different Compositional Data Sets using
eight different Dissimilarity Coefficients

| C.Data | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| Sedim. | 0.93 | 0.93 | 0.93 | 0.93 | 0.99 | 0.99 | 0.93 | 0.93 |
| HK(3)  | 0.94 | 0.94 | 0.94 | 0.92 | 0.99 | 0.99 | 0.93 | 0.94 |
| HK(5)  | 0.89 | 0.90 | 0.91 | 0.89 | 0.99 | 0.98 | 0.89 | 0.90 |

Table 2: Divisive Coefficients of three different Compositional Data Sets using eight
different Dissimilarity Coefficients

| C.Data | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| Sedim. | 0.95 | 0.94 | 0.95 | 0.94 | 0.99 | 0.99 | 0.94 | 0.96 |
| HK(3)  | 0.96 | 0.96 | 0.96 | 0.95 | 1.00 | 1.00 | 0.96 | 0.97 |
| HK(5)  | 0.93 | 0.94 | 0.95 | 0.94 | 1.00 | 1.00 | 0.94 | 0.95 |

Table 3: Silhouette Coefficients by PAM of three different Compositional Data Sets
using eight different Dissimilarity Coefficients

Arctic Lake Sediments - Sand, Silt and Clay compositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.72 | 0.71 | 0.72 | 0.70 | 0.87 | 0.88 | 0.71 | 0.67 |
| 3 | 0.52 | 0.52 | 0.52 | 0.55 | 0.71 | 0.71 | 0.55 | 0.54 |
| 4 | 0.52 | 0.52 | 0.52 | 0.51 | 0.67 | 0.67 | 0.50 | 0.50 |
| 5 | 0.52 | 0.53 | 0.52 | 0.55 | 0.74 | 0.74 | 0.54 | 0.51 |
| 6 | 0.51 | 0.51 | 0.51 | 0.48 | 0.75 | 0.74 | 0.45 | 0.52 |
| 7 | 0.54 | 0.53 | 0.54 | 0.46 | 0.60 | 0.60 | 0.47 | 0.49 |

Hongite and Kongite - ABC subcompositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.64 | 0.63 | 0.64 | 0.55 | 0.74 | 0.75 | 0.57 | 0.54 |
| 3 | 0.52 | 0.51 | 0.52 | 0.50 | 0.70 | 0.69 | 0.53 | 0.53 |
| 4 | 0.54 | 0.54 | 0.54 | 0.51 | 0.73 | 0.73 | 0.55 | 0.51 |
| 5 | 0.53 | 0.53 | 0.53 | 0.54 | 0.74 | 0.74 | 0.55 | 0.56 |
| 6 | 0.53 | 0.53 | 0.53 | 0.47 | 0.74 | 0.74 | 0.52 | 0.52 |
| 7 | 0.46 | 0.48 | 0.46 | 0.50 | 0.72 | 0.72 | 0.51 | 0.52 |

Hongite and Kongite - ABCDE compositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.58 | 0.60 | 0.61 | 0.52 | 0.72 | 0.72 | 0.53 | 0.49 |
| 3 | 0.46 | 0.48 | 0.52 | 0.46 | 0.66 | 0.67 | 0.47 | 0.43 |
| 4 | 0.45 | 0.47 | 0.48 | 0.44 | 0.64 | 0.64 | 0.44 | 0.40 |
| 5 | 0.44 | 0.46 | 0.48 | 0.39 | 0.63 | 0.63 | 0.39 | 0.41 |
| 6 | 0.44 | 0.46 | 0.50 | 0.30 | 0.62 | 0.58 | 0.30 | 0.39 |
| 7 | ?    | 0.37 | 0.41 | 0.29 | 0.59 | 0.59 | 0.32 | 0.39 |

Table 4: Dunn's Partition Coefficients by FANNY of three different Compositional
Data Sets using eight different Dissimilarity Coefficients

Arctic Lake Sediments - Sand, Silt and Clay compositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.58 | 0.57 | 0.58 | 0.56 | 0.82 | 0.82 | 0.56 | 0.51 |
| 3 | 0.42 | 0.42 | 0.42 | 0.42 | 0.72 | 0.72 | 0.42 | 0.41 |
| 4 | 0.42 | 0.42 | 0.42 | 0.39 | 0.71 | 0.71 | 0.39 | 0.37 |
| 5 | 0.36 | 0.36 | 0.36 | 0.32 | 0.64 | 0.64 | 0.32 | 0.34 |
| 6 | 0.32 | 0.32 | 0.32 | 0.32 | 0.65 | 0.64 | 0.30 | 0.34 |
| 7 | 0.34 | 0.33 | 0.34 | 0.33 | 0.67 | 0.66 | 0.32 | 0.35 |

Hongite and Kongite - ABC subcompositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.38 | 0.37 | 0.38 | 0.36 | 0.68 | 0.68 | 0.36 | 0.34 |
| 3 | 0.34 | 0.33 | 0.34 | 0.35 | 0.67 | 0.67 | 0.36 | 0.36 |
| 4 | 0.36 | 0.36 | 0.36 | 0.37 | 0.70 | 0.70 | 0.38 | 0.37 |
| 5 | 0.37 | 0.37 | 0.37 | 0.31 | 0.70 | 0.67 | 0.37 | 0.38 |
| 6 | 0.35 | 0.35 | 0.35 | 0.30 | 0.69 | 0.69 | 0.34 | 0.36 |
| 7 | 0.33 | 0.33 | 0.33 | 0.32 | 0.67 | 0.67 | 0.34 | 0.35 |

Hongite and Kongite - ABCDE compositions
| # cl. | Manhat | Euclid | Cheby. | J.Mats | Diverg | B(log) | B(acos) | Aitch. |
| 2 | 0.26 | 0.29 | 0.31 | 0.28 | 0.63 | 0.64 | 0.28 | 0.24 |
| 3 | 0.25 | 0.27 | 0.30 | 0.27 | 0.62 | 0.62 | 0.27 | 0.25 |
| 4 | 0.24 | 0.26 | 0.30 | 0.25 | 0.61 | 0.60 | 0.25 | 0.21 |
| 5 | 0.18 | 0.23 | 0.28 | 0.18 | 0.57 | 0.56 | 0.19 | 0.17 |
| 6 | 0.18 | 0.20 | 0.23 | 0.16 | 0.55 | 0.55 | 0.16 | 0.17 |
| 7 | 0.17 | 0.19 | 0.22 | 0.16 | 0.50 | 0.50 | 0.16 | 0.14 |

5. Conclusions
The rank order performance for the dissimilarity coefficients or distances examined
was replicated for the six clustering methods applied to the three compositional data
sets, namely:
5.1: The Divergence of Jeffreys and Bhattacharyya using the logarithm produce the better
(excellent, in general) recovery of cluster structures;
5.2: The relative merits of the other coefficients are not straightforward. They indi-
cate the lowest recovery values, especially when partitioning clustering methods are
used.
Partitioning Clustering Methods appear more sensitive than Hierarchical Clustering
Methods for evaluating the performance of the different dissimilarity coefficients in
clustering a compositional data set. For example, the Agglomerative and Divisive
coefficients, independently of the measure used, always ranged between 0.90 and 1
(strong clustering structures). However, the Silhouette and Dunn's partition coefficients
clearly allow the separation of the Divergence and Bhattacharyya (log), giving reasonable
or good structures, from the other measures under analysis, which show weak or poor
structures.

Finally, since Aitchison's proposal does not reveal advantages over the other coeffi-
cients investigated, and can also lead to unreasonable cluster structures, we
wish to know whether the Divergence of Jeffreys and Bhattacharyya (log) dissimilarities satisfy
the cited Aitchison criteria for measures of compositional difference. That is, can
we express the Divergence and Bhattacharyya (log) dissimilarities as a function of r/R, with
r = x_ri/x_rj and R = x_si/x_sj (r, s = 1, …, N; i, j = 1, …, D)?
Considering two-part compositions, the Divergence Dissimilarity of Jeffreys becomes:

log(r/R) [ (r + 1) − (R + 1) ] / [ (r + 1)(R + 1) ].   (3)
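A quick numerical check of (3), as a hedged Python sketch (the values of r and R are invented):

```python
import numpy as np

def jeffreys(xr, xs):
    xr, xs = np.asarray(xr), np.asarray(xs)
    return np.sum(np.log(xr / xs) * (xr - xs))

r, R = 3.0, 0.5                          # r = x_r1/x_r2, R = x_s1/x_s2
xr = np.array([r, 1.0]) / (r + 1.0)      # two-part composition with ratio r
xs = np.array([R, 1.0]) / (R + 1.0)

rhs = np.log(r / R) * ((r + 1) - (R + 1)) / ((r + 1) * (R + 1))
assert np.isclose(jeffreys(xr, xs), rhs)
```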

A similar computation does not allow us to express the Bhattacharyya (log) coefficient as
a function of the ratio r/R.
Thus, although it is a somewhat complex function of r/R, according to Aitchison's criteria the Di-
vergence Dissimilarity of Jeffreys is an admissible measure of difference between two
compositions. Additional research is essential to judge the Divergence's apparently su-
perior behavior and consequently to support it as "the" measure to cluster a compo-
sitional data set.

6. References
Aitchison, J. (1986): The Statistical Analysis of Compositional Data. Chapman and
Hall, London.
Aitchison, J. (1992): On Criteria for Measures of Compositional Difference. Mathe-
matical Geology, Vol. 24, No. 4, p. 365-379.
Aitchison, J. (1994): Principles of Compositional Data Analysis. In: Multivariate Anal-
ysis and its Applications, IMS Lecture Notes - Monograph Series, Vol. 24, p. 73-81.
Kaufman, L. and Rousseeuw, P.J. (1990): Finding Groups in Data. John Wiley and
Sons, Inc.
Consensus of Hierarchical Classifications
Bruno Simeone 1, Maurizio Vichi 2

1 Dipartimento di Statistica, Probabilità e Statistiche Applicate,
Università "La Sapienza" di Roma,
P.le A. Moro, 5, 00185, Roma, Italy.
2 Dipartimento di Metodi Quantitativi e Teoria Economica,
Università "G. D'Annunzio" di Chieti,
Viale Pindaro, 42, 65127, Pescara, Italy.

Summary: In this paper, after briefly reviewing the techniques to achieve one or more consensus
dendrograms, we propose new algorithms. These perform a sequence of elementary tree operations
to obtain at each step, in a greedy fashion, a dendrogram that is as close as possible to the given
ones. The time and space complexities of such procedures make them suitable for large scale
applications. A numerical example is used to show the proposed technique.

Keywords: consensus classification, hierarchical classification, n-trees, ultrametric matrices
1. Introduction.
Often in analyzing multivariate data, a set of hierarchical classifications can be determined as the
result of: (i) different hierarchical clustering methods applied on the same set of n objects; (ii) the
same clustering algorithm (aggregative or divisive) embodying different proximity measures
between pairs of objects; (iii) a hierarchical technique applied on r frontal slices of a three-way
data set, i.e., on different multivariate data sets, relative to the same n multivariate objects
examined on different occasions, such as times, spaces, etc. In this paper we will consider
especially the last case.
One general problem in numerical taxonomy is to compare and synthesize the given set of
hierarchical classifications, determining a consensus structure (often an n-tree, less frequently a
dendrogram). The intuitive reasons to search for a consensus may be different: in case (i) we are
looking for a single and more natural classification not depending on the clustering technique
considered; in case (ii), for a classification not depending on the chosen dissimilarity measure;
in case (iii), we may want to synthesize several different classifications into a single one, by
detecting the relevant and stable information in the individual classifications.
In consensus theory, much attention has been devoted to taxonomic models such as n-trees, and
three approaches have been identified (Barthelemy, Leclerc and Monjardet, 1986) to determine a
consensus n-tree: (1) constructive ones (Adams, 1972; Margush and McMorris, 1981; McMorris
et al., 1983), where purely combinatorial methods are proposed; these generally involve heuristic
algorithms with interesting properties; (2) axiomatic ones (McMorris and Neumann, 1983;
Neumann, 1983; Stinebrickner, 1984; Barthelemy and McMorris, 1986; Day et al., 1986;
Neumann and Norton, 1986), whereby, after the formulation of some general "particularly
desirable" properties of consensus procedures (or functions), methods satisfying these properties
are defined if possible; (3) optimization ones (Barthelemy and Monjardet, 1981), where, after the
choice of a distance index between each observed n-tree and the consensus n-tree, the consensus
minimizing such distance is determined. The three approaches are not independent, since, for
example, consensus methods defined via the optimization approach generally give rise to heuristic
procedures satisfying some given axioms or properties. These heuristic procedures are used to
define good initial solutions for the optimization algorithm.
An alternative approach searches for a common pruned tree (Rosen, 1978), pruning the least
number of leaves of the given trees in order to render them equivalent. An interesting overview of
consensus methods is given in Gordon (1996).


In general, a consensus n-tree discovers replicated classes belonging, completely or partially, to
more than one classification, but the distances among the replicated classes are not evaluated.
However, these distances are fundamental to help us choose one among the (at most) n−1 partitions
in the consensus n-tree. For this reason we are interested in determining a consensus dendrogram.
In this field, the few available consensus procedures generally follow a constructive or axiomatic
approach. In practice, a consensus dendrogram is produced on the basis of a consensus procedure
for n-trees (e.g., Neumann, 1983; Stinebrickner, 1984): in this way one obtains a consensus n-tree
and can use a level function for each node of the consensus n-tree. Hence, for example, we can
choose an intersection method (strict consensus, cardinality intersection, Durchschnitt rule, etc.)
and a level function such as the arithmetic mean.
Among the three approaches for determining a consensus dendrogram, we prefer to consider the
optimization one, which, in our opinion, is closer to the general concept of consensus, since it does
not necessarily need "desirable", but in any case subjective, properties to be defined and satisfied.
We will also always consider the constructive approach to write algorithms for determining good
initial solutions of the optimization problems.
In this paper, after briefly reviewing the techniques to achieve a consensus dendrogram, we
propose new algorithms which, starting from the given dendrograms, perform a sequence of
elementary tree operations so as to obtain at each step, in a greedy fashion, a dendrogram that is as
close as possible to the given ones. The time and space complexities of such procedures make
them suitable for large scale applications. Two numerical examples are provided to show the
proposed technique.

2. Notation and Basic results.

Let r hierarchical classifications of a set I of n objects be given in terms of:
(i) n-trees T1, T2, …, Tr, where each Th is a collection of at most 2n−1 subsets of I such that:
1) {i} ∈ Th, i = 1, …, n, and ∅ ∉ Th; 2) J, J' ∈ Th ⇒ either J ⊆ J' or J' ⊆ J or J ∩ J' = ∅; 3) I ∈ Th
(1). Thus, the n-tree is given by the n trivial clusters (leaves) {i} (i ∈ I) and the n−1 clusters of
units (internal nodes), obtained by the n−1 steps of fusion performed by a hierarchical algorithm
(hence, the last cluster is the root I).
As customary, any n-tree T is represented as a tree, generally as a binary tree, i.e., with exactly n−1
internal nodes, rooted at I, whose nodes are subsets of I belonging to T, and where there are two
directed edges (J, H) and (J, K) whenever {H, K} is a bipartition of J. The binary tree will still be
denoted, for simplicity, by T.
If (J, H) is any edge, we say that J is a predecessor of H and that H is a successor of J. If there is
a directed path from J to J', then J' is said to be a descendant of J, and J an ancestor of J'. In
particular, every node is both a descendant and an ancestor of itself. A terminal (or leaf) is any
node without successors. A node is internal if it is neither a terminal nor the root. Two nodes of T
are brothers if they have the same predecessor. Let J be a node of T. The sub-tree of T rooted at J,
denoted by T_J, is the rooted tree (with root J) whose nodes are the descendants of J in T and
where (J', H) is an edge if it is an edge in T. We denote by T^J the rooted tree obtained from T after
deletion of all the descendants of J other than J, together with all edges incident to them.
(ii) Dendrograms Δ1, Δ2, …, Δr, where Δh = {δ(I_1h), δ(I_2h), …, δ(I_{n−1,h})}. Here I_jh is the cluster
obtained right after the j-th fusion (initially there are n clusters formed by the individual objects),
and δ(I_jh) is the corresponding value of fusion. One must have δ(I_1h) ≤ δ(I_2h) ≤ ⋯ ≤ δ(I_{n−1,h}) and in
particular I_jh ⊆ I_j'h ⇒ δ(I_jh) ≤ δ(I_j'h).
(iii) Ultrametric matrices U1, U2, …, Ur, where Uh = {u_ijh: i, j ∈ I} and the elements u_ijh satisfy the
ultrametric inequality u_ijh ≤ max(u_ilh, u_ljh) ∀ (i, l, j) ∈ I (h = 1, …, r).

(1) In the following we represent the n-tree by its non-trivial classes: Th = {I_1h, I_2h, …}, where
1 < |I_jh| < n for all j.

3. Least Squares Consensus Dendrogram.

Lefkovitch (1985) defined a Euclidean consensus to be (usually) a dendrogram, embedding the
ultrametric matrices U1, U2, …, Ur into a Euclidean space with at most n−1 dimensions, via
classical scaling. Then, all the r sets of points are rotated, using polar rotation, so that each set of
points satisfies X_h X_h' = V_h Λ_h V_h', where X_h denotes the matrix of coordinates, obtained via scaling,
corresponding to the h-th ultrametric matrix, Λ_h is the diagonal matrix whose non-null elements are the
eigenvalues of X_h X_h', and V_h are the corresponding eigenvectors. The polar rotation allows the
representation of the r sets of points in the same coordinate system, with the identity matrix I as basis.
Another consequence following from the polar rotation is that it is easy to obtain, by a convex
combination with coefficients w_h, C = Σ_h w_h V_h Λ_h V_h', a consensus set of points that minimizes
Σ_h ||C − V_h Λ_h V_h'||². Hence, a consensus classification structure, generally but not always a
dendrogram, is obtained from this convex combination of the r rotated sets of points, reversing the
process used to embed the ultrametric matrices into a Euclidean space. Following the optimization
approach, Lapointe and Cucumel (1991) and Vichi (1993) determined the Average Consensus or Least
Squares Consensus Dendrogram, solving the following quadratic problem:

min Σ_{h=1}^{r} ||U_h − U||²
subject to   [P1]
u_ij ≤ max(u_il, u_lj) for every triplet (i, l, j) ∈ I
diag(U) = 0

where U = {u_ij: i, j ∈ I} is the closest least squares matrix subject to the ultrametric condition. The
zero diagonal condition, together with the ultrametric constraints, implies that U is non-negative
and symmetric. As u_ii = 0 for i = 1, …, n, the variables u_ii are not needed in [P1].
The ultrametric constraints can be written in the equivalent disjunctive form: "either u_ij ≤ u_il
or u_ij ≤ u_lj". Hence [P1] is a disjunctive program with a quadratic convex objective function.
Disjunctive programs are known to be NP-hard even for linear objective functions. The only
known exact solution methods are combinatorial, and thus their running times grow exponentially
as the problem size increases. Actually, a result of Krivanek and Moravek (1986) states that [P1]
is NP-hard.
For a given U, let C(U) = {(i, j, p): 1 ≤ i, j, p ≤ n; i ≠ j, j ≠ p, i ≠ p; u_ij ≤ u_jp, u_ij ≤ u_ip}.
Then [P1] is equivalent to

min ||U − Ū||²
subject to   [P2]
Σ_{(i,j,p) ∈ C(U)} (u_ip − u_jp)² = 0,
diag(U) = 0,

where Ū = (1/r)(U1 + ⋯ + Ur).


This is a consequence of the follo\\ ing remarks:
Ii) the ultrametric constraint II" ~ max(lI,p. IIIP) is equivalent to the condition that the two largest
elements among II". lI,p. 11,1' be equal:
Iii) One has
,. ,
L.11lI" - iTll: +1'11 iT - UII~ +2tl'[( iT - U)L. (U" - iT)] = constant + I'll iT- UI1 2 +0.
11=1

Notice that if V is ultrametric (a sufficient condition for this to happen is that the binary n-trees
T 1• T2••••• T,. associated with the ultra metric matrices are equal (Vichi 1995» then it is obviously
the optimal solution to [P2).

[P2] may be reformulated, following a classical approach, as an unconstrained quadratic
minimization problem, considering the penalty function:

min ||U − Ū||² + λ ( Σ_{(i,j,p) ∈ C(U)} (u_ip − u_jp)² ).   [P3]

Algorithms for finding good (although not necessarily optimal) solutions to the above problems
have been given by Vichi (1993) ([P1], [P2] versions), and by Carroll and Pruzansky (1980) and De
Soete (1984) ([P3] version).
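The penalty term of [P3] is easy to evaluate directly; a hedged Python sketch of our own (a brute-force illustration, cubic in n):

```python
import numpy as np

def penalty(U):
    """Sum over the triples (i,j,p) in C(U) of (u_ip - u_jp)^2; for a
    symmetric zero-diagonal matrix it vanishes iff U is ultrametric."""
    n = U.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            for p in range(n):
                in_C = (i != j and j != p and i != p
                        and U[i, j] <= U[j, p] and U[i, j] <= U[i, p])
                if in_C:
                    total += (U[i, p] - U[j, p]) ** 2
    return total
```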

4. Algorithms
In view of the NP-hardness of [P1], [P2], [P3], and of the exponentially growing running times of
the available exact solution algorithms, it makes sense to look for fast heuristic procedures
yielding "good", although not provably optimal, solutions. Any such solution is also useful as a
starting point for iterative global optimization procedures.
An initial feasible solution to [P1], [P2] and [P3] may be obtained through a hierarchical
classification algorithm having as input the mean matrix Ū. The best choice is the average link
method, which is known to give, among all aggregative methods, the ultrametric matrix at
minimum distance from the given dissimilarity matrix (in our case Ū); see for example
Cunningham and Ogilvie (1972). This variant will be called Algorithm 1. The resulting
dendrogram can be considered as a member of the family of consensus dendrogram methods
proposed by Ghashghai et al. (1989), as a generalization of Stinebrickner's top-down method
for dendrograms (Stinebrickner, 1984a) and of Neumann's generalized intersection methods for n-
trees (Neumann, 1983).
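For concreteness, Algorithm 1 can be sketched in a few lines of Python (assuming SciPy is available; the function name and the encoding of the input as square symmetric NumPy arrays are our own illustrative choices, not part of the paper):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def algorithm_1(U_list):
    # Algorithm 1 (sketch): average linkage applied to the mean matrix
    # U_bar = (1/r)(U_1 + ... + U_r) of the r given ultrametric matrices.
    U_bar = sum(U_list) / len(U_list)
    # squareform turns the square symmetric matrix into condensed form
    return linkage(squareform(U_bar, checks=False), method='average')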
A second heuristic algorithm (Algorithm 2) for [P1], [P2] and [P3] has been proposed by Vichi
(1994). Essentially, starting from Ū the algorithm: (i) computes a least upper bound ultrametric
(via complete linkage) and the largest lower bound ultrametric (via single linkage); (ii) finds the
mean matrix of these two ultrametrics; and (iii) repeats steps (i) and (ii) on the current mean
matrix until convergence is achieved. If the limit matrix (which always exists) happens not to be
ultrametric, then the average link method is applied to it.
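A minimal sketch of Algorithm 2, under the same illustrative conventions (SciPy's cophenet extracts the ultrametric induced by a linkage; the tolerance and iteration cap are our own assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

def algorithm_2(d, tol=1e-8, max_iter=200):
    # d: condensed (1-d) form of the mean matrix U_bar.
    d = np.asarray(d, dtype=float)
    for _ in range(max_iter):
        upper = cophenet(linkage(d, method='complete'))  # least upper bound ultrametric
        lower = cophenet(linkage(d, method='single'))    # largest lower bound ultrametric
        new = (upper + lower) / 2.0                      # mean of the two ultrametrics
        if np.max(np.abs(new - d)) < tol:                # iteration has converged
            break
        d = new
    # if the limit matrix is not ultrametric, the average link method is applied to it
    return linkage(d, method='average')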
A third algorithm is actually a procedure to improve the objective function

Σ_{h=1..r} ‖U_h − U‖²                                                        (1)

when a dendrogram approximating Ū is given. In our case the dendrogram is defined by one of the
previous two procedures.
Before introducing the algorithm we need to define a swap operation and some of its properties.
Let T be an n-tree containing, among others, the following sets:
(i) three pairwise disjoint sets J, H, K; (ii) J∪H; (iii) J∪H∪K. Then a swap operation transforms T
into the n-tree T' or T'', where J∪H is replaced by J∪K or H∪K, respectively.
The swap operation can be described by the tree representation in Fig. 1.

Fig. 1: Trees associated with a swap operation

[Figure: the three binary fusions of J, H, K — (1) (J∪H)∪K with levels of fusion a, b; (2) (J∪K)∪H with levels c, d; (3) (H∪K)∪J with levels e, f]

Thus, given the n-tree T with classes J, H, K, whose fusion is represented in Fig. 1.1, a swap
operation is represented in Fig. 1.2 or in Fig. 1.3. These fusions give rise to the n-trees T'' and T',
respectively.
Example 1: Given the n-tree (including also the trivial classes) T = {1, 2, 3, 4, 5, 12, 34, 125, I}, a swap on
the classes {1}, {2}, {5} gives the n-trees T' = {1, 2, 3, 4, 5, 15, 34, 125, I} and T'' = {1, 2, 3, 4, 5, 25, 34, 125, I}
(where, e.g., 12 denotes the class {1, 2}).
From an algebraic viewpoint, the trees T, T' and T'' correspond to the three possible ways to obtain
J∪H∪K by successive unions starting from J, H, K, namely (J∪H)∪K, (J∪K)∪H, (H∪K)∪J.
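The swap operation is easy to state on the representation of an n-tree as a set of clusters; the following Python sketch (our own illustrative encoding, with clusters as frozensets, not code from the paper) reproduces Example 1:

def swap(tree, J, H, K):
    # tree: an n-tree as a set of frozensets; J, H, K: pairwise disjoint clusters
    # such that J|H and J|H|K belong to the tree. The cluster J|H is replaced
    # by J|K (giving T') or by H|K (giving T'').
    JH = J | H
    assert JH in tree and (JH | K) in tree
    return (tree - {JH}) | {J | K}, (tree - {JH}) | {H | K}

# Example 1: T = {1, 2, 3, 4, 5, 12, 34, 125, I} with I = {1, ..., 5}
I5 = frozenset(range(1, 6))
T = {frozenset(c) for c in ([1], [2], [3], [4], [5], [1, 2], [3, 4], [1, 2, 5])} | {I5}
T1, T2 = swap(T, frozenset({1}), frozenset({2}), frozenset({5}))
# T1 contains {1,5} (i.e. T'), T2 contains {2,5} (i.e. T''); {3,4}, {1,2,5} and I are kept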
Theorem: Given two binary n-trees S and T on the same set I, it is always possible to obtain T
from S, or vice versa, by a finite sequence of swap operations.
Remark: If S and T are two n-trees on the same set I, if the sub-trees S_J and T_J are identical, and
if T^J can be obtained from S^J by a finite sequence of swaps (S^J denoting the tree obtained from S
by contracting the common sub-tree on J to a single leaf), then T can be obtained from S by a
finite sequence of swaps as well.
We are now in a position to prove the theorem.
Proof: By induction on n = |I|. For n = 2 or 3 the theorem is trivial. Suppose that the theorem holds
for all p-trees with p ≤ n−1, and let S and T be any two n-trees on I (n ≥ 4).
Case 1) There exist two identical sub-trees S_J and T_J with |J| ≥ 2.

Then by the inductive hypothesis T^J may be obtained from S^J through a finite sequence of swaps,
and thus by the above remark the theorem holds also for n-trees.
Case 2) There are no two identical sub-trees S_J and T_J with |J| ≥ 2.

Then consider any two brother terminals L and M in the tree T (two brother terminals must always
exist). Both L and M are terminals also in S, but they are not brothers in S, otherwise Case 1
would occur. We are going to describe a procedure that, starting from S, produces after a finite
sequence of swaps a tree S' where L and M are brother terminals. Then S' and T have two identical
sub-trees, namely those rooted at L ∪ M. Thus we fall back into Case 1) and the theorem
follows.

Fig. 2: Swaps in the procedure DOUBLE LIFTING
[Figure: the three swap configurations (a), (b) and (c) referred to in the procedure below]
In the procedure described below, and with reference to the current tree S', we denote by
pred(J) and brother(J) the predecessor and the brother of the node J (J ≠ I), respectively.

PROCEDURE DOUBLE LIFTING
S' := S;
K := pred(M);
While K ≠ I do
    N := brother(M); J := pred(K); H := brother(K);
    Perform on S' the swap indicated in Fig. 2(a);
    Let S' be the tree after the swap;
    K := pred(M);
End While
K := pred(L);
While pred(K) ≠ I do
    N := brother(L); J := pred(K); H := brother(K);
    Perform on S' the swap indicated in Fig. 2(b);
    Let S' be the tree after the swap;
    K := pred(L);
End While
N := brother(L); M := brother(K);
Perform on S' the swap indicated in Fig. 2(c);
Let S' be the tree after the swap;
output S';
end
Notice that: (i) after each swap in the first While-loop, the depth of M (that is, the number of
edges of the path from the root I to M) decreases by one, and after the loop it is exactly 1; (ii) after
each swap in the second While-loop, the depth of L decreases by one, and after the loop it is equal to
2; (iii) after the final swap, L and M become brother terminals; (iv) the overall complexity of
DOUBLE LIFTING is O(n).
An example of execution of DOUBLE LIFTING is shown in Fig. 3.
Fig. 3: (a) Two 6-trees S and T; terminals 1 and 4 are brothers in T, but not in S. (b) Execution of
a DOUBLE LIFTING: starting from S, after 5 swaps a tree S_5 is generated, in which 1 and 4 are
brothers.
[Figure: the trees S and T and the intermediate trees produced by the 5 swaps]

The above theorem motivates the following Algorithm 3 (although it is not enough to guarantee the
optimality of the dendrogram generated by such an algorithm). For a better understanding of the
algorithm, consider again the three sub-trees in Fig. 1.
The sub-trees represent three sub-dendrograms of a dendrogram Δ only if a ≤ b, or c ≤ d, or e ≤ f,
where a, b, c, d, e, f are the levels of fusion reported in Fig. 1, respectively.
Furthermore, the contribution of each sub-dendrogram to the objective function (1) is given by the
amounts:

s_1 = s_a + s_b = Σ_{h=1..r} Σ_{f∈J, g∈H} (u_fgh − a)² + Σ_{h=1..r} Σ_{f∈J∪H, g∈K} (u_fgh − b)²;

s_2 = s_c + s_d = Σ_{h=1..r} Σ_{f∈J, g∈K} (u_fgh − c)² + Σ_{h=1..r} Σ_{f∈J∪K, g∈H} (u_fgh − d)²;      (3)

s_3 = s_e + s_f = Σ_{h=1..r} Σ_{f∈H, g∈K} (u_fgh − e)² + Σ_{h=1..r} Σ_{f∈H∪K, g∈J} (u_fgh − f)²;

respectively.
The minimum increase s_1 is attained, for a sub-tree of type (1) in Fig. 1, when the levels of fusion are

a* = (1/(r|J||H|)) Σ_{h=1..r} Σ_{f∈J, g∈H} u_fgh;  b* = (1/(r|J∪H||K|)) Σ_{h=1..r} Σ_{f∈J∪H, g∈K} u_fgh,  if a* < b*;      (5)

otherwise, if a* ≥ b*, the minimum increase is given by

a** = b** = (2/(r m(m−1))) Σ_{h=1..r} Σ_{f,g∈J∪H∪K, f<g} u_fgh,  with m = |J∪H∪K|;      (6)

but in the last case the n-tree (1) in Fig. 1 becomes a bush, i.e., the level of fusion a is equal to the
level of fusion b. In fact, the value a* is the arithmetic mean of the dissimilarities among classes J
and H, while b* is the mean of the dissimilarities among classes J∪H and K. Thus, a* ≥ b* means
that the best feasible solution is a tree with one level of fusion (bush) equal to the mean of the
dissimilarities between the elements of cluster J∪H∪K.
Similarly, the minimum increase s_2 is attained, for a sub-tree of type (2) in Fig. 1, when the levels of
fusion are

c* = (1/(r|J||K|)) Σ_{h=1..r} Σ_{f∈J, g∈K} u_fgh;  d* = (1/(r|J∪K||H|)) Σ_{h=1..r} Σ_{f∈J∪K, g∈H} u_fgh,  if c* < d*;      (7)

otherwise, if c* ≥ d*, the minimum increase is given by c** = d** as in (6).

The sub-tree of type (3) in Fig. 1 has levels of fusion

e* = (1/(r|H||K|)) Σ_{h=1..r} Σ_{f∈H, g∈K} u_fgh;  f* = (1/(r|H∪K||J|)) Σ_{h=1..r} Σ_{f∈H∪K, g∈J} u_fgh,  if e* < f*;      (8)

otherwise, if e* ≥ f*, the minimum increase is given by e** = f** as in (6).
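As a worked illustration of (5)-(8), the candidate levels of fusion are simple block means of the entries of the r ultrametric matrices. The small Python helper below (the function name and the NumPy-based encoding are our own) computes a*, b* and the bush level a** = b** of (6):

import numpy as np
from itertools import combinations

def fusion_levels(U_list, J, H, K):
    # U_list: the r ultrametric matrices as square NumPy arrays;
    # J, H, K: pairwise disjoint lists of leaf indices.
    r = len(U_list)
    def mean_between(A, B):
        # mean of u_fgh over f in A, g in B and h = 1, ..., r, as in (5)
        return sum(U[np.ix_(A, B)].sum() for U in U_list) / (r * len(A) * len(B))
    a_star = mean_between(J, H)
    b_star = mean_between(J + H, K)
    pairs = list(combinations(J + H + K, 2))
    # bush level (6): mean dissimilarity over all pairs within J u H u K
    bush = sum(U[f, g] for U in U_list for (f, g) in pairs) / (r * len(pairs))
    return a_star, b_star, bush

The pairs (c*, d*) of (7) and (e*, f*) of (8) follow by permuting the roles of J, H and K in the same helper.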



ALGORITHM 3
Given a dendrogram Δ = {α(I_1), α(I_2), ..., α(I_{n−1})}, with at most n−1 internal nodes, where I_1, I_2, ..., I_{n−1}
are the non-singleton classes associated with the internal nodes:
Step 0: Set the iteration parameter k := 1;
Step 1: Visit the k-th internal node according to the order given by the non-decreasing levels of fusion
α(I_k):
    If cluster I_k is the root of a sub-tree with at least one internal node then
        If the sub-tree has more than one internal node then
            aggregate those clusters with the smallest level of fusion. Let J, H, K be
            the three leaves of this sub-tree. With a swap operation we have one of the
            three sub-trees in Fig. 1;
        End If;
Step 2: Compute (a*, b*), (c*, d*), (e*, f*);
        If a* ≤ b* or c* ≤ d* or e* ≤ f* (i.e., at least one pair is feasible) then
            compute for the feasible levels of fusion the increases s_1, s_2, s_3;
            consider the sub-dendrogram with the smallest increase of the
            objective function;
        End If;
    Else
        compute for the class with two elements {i, j} the mean of u_ijh, h = 1, ..., r;
    End If;
Step 3: k := k + 1;
repeat Step 1 to Step 3, n−1 times.
The worst-case time complexity of the algorithm is O(rn³), since processing the k-th node, which has at
most k proper descendants, takes O(k²r), k = 1, ..., n−1.
A fourth algorithm is illustrated below, and can be applied directly to the original
dissimilarity matrices D_1, D_2, ..., D_r.

ALGORITHM 4
Step 0 (initialization): let r matrices D_1, D_2, ..., D_r be given, where D_h = {d_ijh : i, j ∈ I}; these may be
dissimilarity or ultrametric matrices. Set the iteration parameter k := 1;
Step 1: For each matrix D_h, h = 1, ..., r, find the minimum value:
    d_il1 = min{D_1}; ...; d_lmh = min{D_h}; ...; d_pqr = min{D_r}.
These are the values of fusion between the groups (G_i, G_l), ..., (G_l, G_m), ..., (G_p, G_q), respectively, and
represent the k-th smallest value of fusion of the r dendrograms.
Step 2: Compute the means of the dissimilarities:
    E{d_ilv, v = 1, ..., r}; ...; E{d_lmv, v = 1, ..., r}; ...; E{d_pqv, v = 1, ..., r};
Step 3: Among the above means, choose the smallest one, let it be E{d_lmv, v = 1, ..., r}, which is at least as
large as the minimum mean detected at iteration k−1.
Step 4: The increment of the objective function in [P1] after the fusion of the groups (G_l, G_m) is:
    DEV{d_lmv, v = 1, ..., r} = Σ_v (d_lmv − E{d_lmv, v = 1, ..., r})².
Thus, the fusion of the groups (G_l, G_m), with cardinalities n_l and n_m, defines the k-th group I_k, with associated
value of fusion E{d_lmv, v = 1, ..., r}. Cluster I_k represents the k-th node of the consensus n-tree associated
with the r dendrograms.
Step 5: Update the matrices D_1, D_2, ..., D_r, i.e., the distances between the fused cluster I_k and a generic
cluster G_s:
    d_h(I_k, G_s) = (n_l / (n_l + n_m)) d_h(G_l, G_s) + (n_m / (n_l + n_m)) d_h(G_m, G_s),
where d_h(I_k, G_s) is the distance between G_l ∪ G_m and cluster G_s in the h-th matrix D_h.
Step 6: k := k + 1;
repeat Step 1 to Step 6, n−1 times.
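A compact Python sketch of Algorithm 4 (our own illustrative implementation, not code from the paper; Step 1 collects the minimizing pair of each matrix as a candidate fusion, and Step 5 applies the weighted-average update above):

import numpy as np

def algorithm_4(D_list):
    # D_list: r symmetric dissimilarity (or ultrametric) matrices, zero diagonal.
    r = len(D_list)
    D = [np.asarray(Dh, dtype=float).copy() for Dh in D_list]
    n = D[0].shape[0]
    clusters = {i: frozenset([i]) for i in range(n)}
    size = dict.fromkeys(range(n), 1)
    merges = []
    for _ in range(n - 1):
        act = sorted(clusters)
        # Step 1: the minimizing pair of each matrix is a candidate fusion
        candidates = set()
        for Dh in D:
            pair = min(((i, j) for a, i in enumerate(act) for j in act[a + 1:]),
                       key=lambda p: Dh[p])
            candidates.add(pair)
        # Steps 2-3: mean value of fusion across the r matrices; keep the smallest
        mean = lambda p: sum(Dh[p] for Dh in D) / r
        i, j = min(candidates, key=mean)
        level = mean((i, j))
        # Step 4: increment of the objective function of [P1]
        dev = sum((Dh[i, j] - level) ** 2 for Dh in D)
        merges.append((clusters[i], clusters[j], level, dev))
        # Step 5: weighted-average update of every matrix
        ni, nj = size[i], size[j]
        for Dh in D:
            for s in act:
                if s not in (i, j):
                    Dh[i, s] = Dh[s, i] = (ni * Dh[i, s] + nj * Dh[j, s]) / (ni + nj)
        clusters[i] |= clusters[j]
        size[i] = ni + nj
        del clusters[j], size[j]
    return merges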

5. Application
In order to show the behavior of the proposed algorithms, a well-known benchmark in cluster
analysis, given by Michner (1970), has been used. This data set arises from a taxonomic problem
on 11 types of Hoplites bees, described by 23 variables related to the form and characteristics of
the bees (further details are reported in the original paper). Using the distance matrix between pairs
of bees, Everitt (1993) compares the dendrograms obtained by single linkage, complete linkage and
average linkage between groups (UPGMA). The ultrametric matrices U_1, U_2 and U_3 are reported
in Vichi (1993). The mean matrix Ū = (1/3)(U_1 + U_2 + U_3) and the dendrogram of the average
linkage applied on Ū are shown in Fig. 4.


For each of the 10 steps of UPGMA, the class and the corresponding value of fusion are reported
[table largely illegible in the source]. The objective function (1) has the value 13.9372.


Algorithm 3 is applied on the dendrogram obtained by UPGMA on the mean matrix (Fig. 4).
Step 0: Given the dendrogram in Fig. 4, set k := 1;


• k = 1, 2. Both nodes {3,4} and {10,11}, with levels of fusion 0.303 and 0.645, have no internal node
among their successors and cannot be roots of sub-trees. The values 0.303 and 0.645 are the means of
the corresponding values in the three ultrametric matrices. The increase of the o.f. is 0 for both
nodes;
• k = 3. Step 1: The internal node {3,4,6}, with level of fusion 0.676, is the first node that can be
considered as the root of the sub-tree in Fig. 5 (i). With a swap the two sub-trees in Fig. 5 (ii) and (iii)
are obtained. Step 2: compute a* = 0.303, b* = 0.676; c* = 0.676, d* = 0.489; e* = 0.676, f* = 0.489. The
only feasible solution is a* and b*, and the increase of the o.f. is s_1 = 0.0420353.
• k = 4. This node {1,2}, with level of fusion 0.940, has no internal node among its successors and hence
it cannot be the root of a sub-tree. The value 0.940 is the mean of the elements between (1,2) in the three
ultrametric matrices. The increase of the o.f. is 0;

Fig. 4: Mean matrix Ū of the ultrametric matrices associated with single linkage, complete linkage
and average linkage applied on the Hoplites data (Everitt, 1993). The dendrogram is obtained applying
average linkage on the mean matrix.

        1      2      3      4      5      6      7      8      9      10     11
  1   0.000
  2   0.940  0.000
  3   1.383  1.333  0.000
  4   1.383  1.333  0.303  0.000
  5   1.391  1.391  1.066  1.066  0.000
  6   1.383  1.333  0.676  0.676  1.066  0.000
  7   1.530  1.530  1.630  1.630  1.630  1.630  0.000
  8   1.376  1.376  1.598  1.598  1.598  1.598  1.530  0.000
  9   1.383  1.333  0.947  0.947  0.996  0.947  1.630  1.598  0.000
 10   1.570  1.570  1.493  1.493  1.493  1.493  1.630  1.598  1.493  0.000
 11   1.570  1.570  1.493  1.493  1.493  1.493  1.630  1.598  1.493  0.645  0.000

• k = 5, 6, 7, 8. In Steps 1 and 2 the sub-dendrograms (iv), (vii), (x), (xiii) are found to be optimal,
respectively.
• k = 9. This node {7,8}, with level of fusion 1.53, has no internal node among its successors and it
cannot be the root of a sub-tree. The value 1.53 is the mean of the elements between (7,8) in the three
ultrametric matrices. The increase of the o.f. is 0.10492;
• k = 10. Step 1: The internal node {1,2,3,4,5,6,7,8,9,10,11} is the last node that can be considered as
the root of a sub-tree. The associated sub-tree with one internal node is shown in Fig. 5 (xvi). With
one swap the sub-trees (xvii) and (xviii) in Fig. 5 are obtained. Step 2: compute a* = 1.53033.

Fig. 5: Steps of Algorithm 3 for which a swap operation can be executed
[Figure: for k = 3, 5, 6, 7, 8, 10, the three candidate sub-trees obtained by swaps, panels (i)-(xviii).
Legible values: k = 3: a* = 0.303, c* = 0.676, e* = 0.676, (i) feasible, (ii)-(iii) not feasible;
k = 5: e* = 0.9467, (iv) feasible, (v)-(vi) not feasible;
k = 6: c* = 0.996, (vii) and (viii) feasible, (ix) not feasible;
k = 7: c* = 1.391, e* = 1.3579, (x) feasible, (xi)-(xii) not feasible;
k = 8: a* = 1.36450, c* = 1.570, e* = 1.413, (xiii) feasible, (xiv)-(xv) not feasible;
k = 10: a* = 1.53033, e* = 1.608, (xvi) feasible, (xvii)-(xviii) not feasible]

The objective function has the value 0 + 0 + 0.0420253 + 0.24454 + 0.083308 + 5.9516695 + 2.820236 +
2.17669 + 2.05085 = 12.8311064. The same solution has been obtained by Vichi (1993) solving [P3]
with the truncated-Newton method.

The second data set, analyzed by Carroll et al. (1984), describes over-the-counter pain reliever
usage in remedying three common maladies. On the three arrays of dissimilarities the average
linkage algorithm has yielded the dendrograms reported in Vichi (1994).
On these data Algorithm 4 has been applied.
k = 1; Step 1. u(4;7;1) = 20.29; u(3;4;2) = 20.24; u(5;10;3) = 20.03; Steps 2, 3. E(d(4;7); i = 1,2,3) = 26.76;
E(d(3;4); i = 1,2,3) = 22.68; E(d(5;10); i = 1,2,3) = 20.96; Step 4. DEV(d(5;10); i = 1,2,3) = 1.8066801. Thus, 5 and
10 are aggregated at level 20.96, and the o.f. is 1.8066801.
For k = 2, 3, 4, 5, 6, 7, 8, Steps 1 to 4 can be executed in a similar way.
k = 9: Step 1. u(3,4,1,7,6,9; 5,10,2,8; 1) = 37.41; u(3,4,1,7,6,9; 5,10,2,8; 2) = 34.65; u(3,4,1,7,6,9; 5,10,2,8; 3) =
33.46; Steps 2, 3. E(d(3,4,1,7,6,9; 5,10,2,8); i = 1,2,3) = 35.17; Step 4. DEV(d(3,4,1,7,6,9; 5,10,2,8); i = 1,2,3)
= 197.09015. Thus, fuse 3,4,1,7,6,9 with 5,10,2,8 at level 35.17; the o.f. value is 232.73522 +
197.09015 = 429.82538.
Thus, the dendrogram obtained can be synthesized as follows [summary table largely illegible in the
source; one legible fusion level: 31.940].
The same result has been obtained by Vichi (1994) through the solution of [P3] by the truncated-
Newton method and also through Algorithm 2, briefly outlined in this paper.

References
Adams, E. N. (1972): Consensus techniques and comparison of taxonomic trees, Systematic Zoology, 21, 390-397.
Barthelemy, J. P., Leclerc, B. and Monjardet, B. (1986): On the Use of Ordered Sets in Problems of Comparison and Consensus of Classifications, Journal of Classification, 3, 187-224.
Barthelemy, J. P. and McMorris, F. R. (1986): The median procedure for n-trees, Journal of Classification, 3, 329-334.
Barthelemy, J. P. and Monjardet, B. (1981): The Median Procedure in Cluster Analysis and Social Choice Theory, Mathematical Social Sciences, 1, 235-267.
Carroll, J. D., Clark, L. A. and De Sarbo, W. S. (1984): The representation of three-way proximity data by single and multiple tree structure models, Journal of Classification, 1, 24-74.
Carroll, J. D. and Pruzansky, S. (1980): Discrete and Hybrid Scaling Models, In: Similarity and Choice, E. D. Lantermann and H. Feger (Eds.), Huber, Bern, 108-139.
Cunningham, K. M. and Ogilvie, J. C. (1972): Evaluation of hierarchical grouping techniques: a preliminary study, Computer Journal, 15, 209-213.
Day, W. E., McMorris, F. R. and Meronk, D. B. (1986): Axioms for Consensus Functions Based on Lower Bounds in Posets, Mathematical Social Sciences, 12, 185-190.
De Soete, G. (1984): A Least Squares Algorithm for fitting an ultrametric tree to a dissimilarity matrix, Pattern Recognition Letters, 2, 133-137.
Everitt, B. S. (1993): Cluster Analysis, 3rd edition, Edward Arnold.
Ghashghai, E., Stinebrickner, R. and Suters, W. H. (1989): A Family of Consensus Dendrogram Methods, Abstract of the paper presented at the Second Conference of the IFCS, Charlottesville.
Gordon, A. D. (1987): A Review of Hierarchical Classification, Journal of the Royal Statistical Society, A, 150, 2, 119-137.
Gordon, A. D. (1996): Hierarchical Classification, In: P. Arabie et al. (Eds.), Clustering and Classification, World Scientific Publishing, 65-121.
Krivanek, M. and Moravek, J. (1986): NP-Hard Problems in Hierarchical-Tree Clustering, Acta Informatica, 23, 311-323.
Lapointe, F. J. and Cucumel, G. (1991): The Average Consensus, Abstract of the paper presented at the Third Conference of the IFCS, Edinburgh, Scotland.
Lefkovitch, L. P. (1985): Euclidean Consensus Dendrograms and Other Classification Structures, Mathematical Biosciences, 74, 1-15.
Margush, T. and McMorris, F. R. (1981): Consensus n-trees, Bulletin of Mathematical Biology, 43, 239-244.
McMorris, F. R., Meronk, D. B. and Neumann, D. A. (1983): A View of Some Consensus Methods for Trees, In: J. Felsenstein (Ed.), Numerical Taxonomy, Springer-Verlag, Berlin.
McMorris, F. R. and Neumann, D. A. (1983): Consensus Functions on Trees, Mathematical Social Sciences, 4, 131-136.
Neumann, D. A. (1983): Faithful consensus methods for n-trees, Mathematical Biosciences, 63, 271-287.
Neumann, D. A. and Norton, V. T. (1986): On Lattice Consensus Methods, Journal of Classification, 3, 225-255.
Powell, M. J. D. (1983): Variable Metric Methods for Constrained Optimization, In: Bachem, A. et al. (Eds.), Mathematical Programming: The State of the Art, Springer-Verlag, 288-311.
Rosen, D. E. (1978): Vicariant Patterns and Historical Explanation in Biogeography, Systematic Zoology, 27, 159-188.
Stinebrickner, R. (1984a): s-consensus trees and indices, Bulletin of Mathematical Biology, 46, 923-935.
Vichi, M. (1993): Un algoritmo dei minimi quadrati per interpolare un insieme di classificazioni gerarchiche con una classificazione consenso, Metron, 51, 3-4, 139-163.
Vichi, M. (1994): An algorithm for the consensus of hierarchical classifications, Proceedings of the Italian Statistical Society, 37, 261-268.
Vichi, M. (1995): Principal Classification analysis of a three-way data set, presented at the meeting: Analisi dei dati multidimensionali, Napoli, 30-31 October.
On the Minimum Description Length (MDL) Principle
for Hierarchical Classifications
Peter G. Bryant
Graduate School of Business Administration
University of Colorado at Denver
Campus Box 165
Denver, Colorado 80217-3364 U.S.A.

Summary: Hierarchical clustering procedures such as single-, average-, or complete-link
procedures produce a series of groupings of the data arranged in the form of a hierarchy,
or tree structure. In most cases, the choice of where to "cut" the tree is left to the user.
Occasional formal guidelines have usually been based on ideas of random sampling, but
that assumption is often violated in the contexts in which cluster analysis is used. This
paper explores the application of Rissanen's MDL principle to derive possible guidelines for
cutting the tree. These guidelines do not assume random sampling.

1. Background
1.1 Hierarchical clustering
Commonly used hierarchical clustering procedures group observations into a nested
sequence of classifications. Often that sequence is represented by a tree or dendrogram.
A Simple Example: To fix ideas, let us consider a simple example consisting of the
seven univariate observations

y = (1, 2, 5, 7, 12, 16, 20)^t,                                              (1)

which are to be grouped in some appropriate manner.
To use agglomerative hierarchical methods, we must specify an appropriate distance
measure such as Euclidean distance, city-block distance, etc., and an aggregation
criterion such as single linkage or complete linkage, which specifies how the distances
between groups of observations are determined from the distances between individual
observations. Such measures and aggregation criteria are discussed in standard text-
books such as Everitt (1993). The tree produced by single linkage clustering using
Euclidean distance for the data in (1) is given in Fig. 1.
1.2 The problem of cutting the tree
Hierarchical methods do not require that we specify a priori how many groups are
to be found, and this is often an advantage, but neither do they give us specific
guidance on how many groups we have actually found. For those problems in which
the tree is not the fundamental object of interest, but is simply a means to obtain a
final grouping, the user must determine at what point it is useful to "cut" the tree.
For example, in Fig. 1, if we cut the tree at (vertical) level 4.5, say, we obtain two
groups, (1,2,5,7) and (12,16,20), while if we cut the tree at level 3.5, we obtain 4
groups, (1,2,5,7), (12),(16), and (20). The finer the subdivision, the more accurate
(in some sense) the description of the data is, but the additional accuracy comes at
the expense of a more complicated model. At what point, then, should we cut the
tree?


Figure 1: Clustering for Simple Example Using Single Linkage
[Dendrogram: vertical axis "Euclidean distance" (1 to 5); leaves x = 1, 2, 5, 7, 12, 16, 20]
1.3 Possible approaches
At any given level of aggregation, that is, for any point at which we cut the tree,
we usually obtain a corresponding figure of merit of some kind, such as the "pooled
within group distance." The smaller this measure, the better the grouping describes
our data. One way to determine an appropriate grouping is to plot this figure of
merit versus the level of aggregation. The resulting curve often displays distinct dips
at one or more points, and such points indicate clusterings which appear "significant"
in some sense. How much of a dip is enough to be considered significant is harder to
specify, though. For Euclidean distance, Duda and Hart (1973), for example, suggest
referring the ratio of two successive figures of merit to some critical value, although
in many cases of clustering, the sampling theory assumptions underlying classical
statistical approaches to determining such critical values will be violated. In the
next section, I explore an approach based on the MDL principle, an approach which
doesn't depend on sampling theory.

2. The MDL Principle


2.1 General Ideas
The Minimum Description Length (MDL) principle articulated and developed by
Rissanen (1987, 1989, 1996 and elsewhere) suggests a way of choosing statistical
models for data when our purpose is simply to describe the given data rather than
to estimate the parameters of some hypothetical population. The MDL principle is
similar to a penalized maximum likelihood approach in that each proposed model
has a total figure of merit combining the ability of the model to describe the data
and the complexity of the model. The references cited give details of the approach.
I apply it here to the case of Euclidean distance models for hierarchical clustering.
2.2 An MDL Criterion for Gaussian Models
The MDL principle requires that certain somewhat subjective choices be made by
the user. Typically the result is not a single criterion but a family of criteria. For
the common case of linear models y = Xβ + e with Gaussian errors e, though, one
appropriate form of the criterion turns out to be

MDL = (n/2) ln s² + ((n−p)/2) ln(n/2) − ln Γ((n−p)/2) + (p/2) ln(nR²/(2π))       (2)

where s² = n⁻¹ × (Error Sum of Squares) is the estimated error variance for y, n is
the number of data points, and p < n is the number of variables in the linear model.
The parameter R may be chosen in various ways, as long as it satisfies

R ≥ max_{i≤p} γ̂_i − min_{i≤p} γ̂_i,

where γ̂ is derived from the vector of least squares coefficients, suitably standardized.
The details of this derivation are given in Bryant (1996).
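To make (2) concrete, the following short Python fragment (our own, assuming SciPy for the log-gamma function; it is not code from the paper) evaluates the criterion and reproduces the first two rows of Tab. 1 in Section 3:

import numpy as np
from scipy.special import gammaln

def mdl(n, p, s, R):
    # criterion (2) for a Gaussian linear model
    return ((n / 2) * np.log(s ** 2) + ((n - p) / 2) * np.log(n / 2)
            - gammaln((n - p) / 2) + (p / 2) * np.log(n * R ** 2 / (2 * np.pi)))

def pooled_s(groups):
    # s = sqrt(within-group sum of squares / n) for a list of 1-d arrays
    ss = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return np.sqrt(ss / sum(len(g) for g in groups))

one = [np.array([1., 2, 5, 7, 12, 16, 20])]
two = [np.array([1., 2, 5, 7]), np.array([12., 16, 20])]
print(mdl(7, 1, pooled_s(one), 1.348))   # 16.71
print(mdl(7, 2, pooled_s(two), 2.732))   # 12.16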
2.3 Application to Cutting a Hierarchical Tree
To each level of a tree representing p groups, there corresponds a Gaussian model
with p independent variables. For example, the design matrix X corresponding to the
top level of Fig. 1 (a single group) is

X = (1 1 1 1 1 1 1)^t,

while that corresponding to the division into the groups (1,2,5,7) and (12,16,20) is

X = (1 1 1 1 0 0 0; 0 0 0 0 1 1 1)^t,

and so forth. To each grouping represented by the tree, there will correspond an MDL
figure of merit (2), combining its ability to represent the data with the complexity
of the description. The best level at which to cut the tree is the one yielding the
smallest value of MDL.
The criterion given above is for univariate data. In clustering, it is more usual to have
m-variate data (m > 1), and for such cases, we replace s² in (2) by the total within-
group squared Euclidean distance divided by n' = nm and replace n and p by n' and
p' = pm, respectively. For non-Euclidean distance measures d, the corresponding
models would use a probability density measure proportional to e^{−d}, though the
detailed calculations to produce an analogue of (2) will often be messy.

3. Some Numerical Examples

3.1 Simple Example (continued)
For the example in Fig. 1, the figures of merit turn out to be those given in Tab. 1.

Table 1: MDL Criteria for Simple Example (n = 7)

Grouping                        MDL      p    s      R
(1,2,5,7,12,16,20)              16.71    1    6.676  1.348
(1,2,5,7)(12,16,20)             12.16    2    2.797  2.732
(1,2,5,7)(12,16)(20)            10.29    3    2.096  2.254
(1,2,5,7)(12)(16)(20)           10.20    4    1.803  2.621
(1,2)(5,7)(12)(16)(20)          10.05    5    .598   11.307
(1,2)(5)(7)(12)(16)(20)         10.52    6    .267   25.284
(1)(2)(5)(7)(12)(16)(20)        Undefined

They suggest that for these data, each finer subdivision of the data is preferred at
least slightly to the one which precedes it, until we reach the last: splitting the (5,7)
group into two components seems to cost more than it is worth.

3.2 Johnson and Wichern's Cereal Data
We may illustrate the MDL approach further using the data from Johnson and Wichern
(1988, page 587) listed in Tab. 2.

Table 2: Johnson and Wichern's Cereal Data

Id  Brand               Protein  Carbo  Fat  Calories  VitA
 1  Life                 6        19     1    110        0
 2  Grape Nuts           3        23     0    100       25
 3  Super Sugar Crisp    2        26     0    110       25
 4  Special K            6        21     0    110       25
 5  Rice Krispies        2        25     0    110       25
 6  Raisin Bran          3        28     1    120       25
 7  Product 19           2        24     0    110      100
 8  Wheaties             3        23     1    110       25
 9  Total                3        23     1    110      100
10  Puffed Rice          1        13     0     50        0
11  Sugar Corn Pops      1        26     0    110       25
12  Sugar Smacks         2        25     0    110       25

Five properties are given for each of twelve breakfast cereals. A clustering of these
observations using standardized variables, the complete linkage criterion, and Euclidean
distance is summarized in the tree in Fig. 2, and the corresponding figures of merit from
the MDL criterion are listed in Tab. 3 for several groupings derived by cutting the tree.

Figure 2: Complete Linkage Clustering of Johnson and Wichern's Cereal Data
[Dendrogram: vertical axis Euclidean distance (1 to 6); leaf order 10, 1, 4, 7, 9, 6, 8, 2, 11, 3, 5, 12 (Cereal ID Number)]
We see, for example, that the three-group clustering is not preferred to that with
two groups, but the proposed division into four groups is preferred to the division
into three groups. Such "reversals" may happen, since for Euclidean distance, the
optimal division into, say, k + 1 groups is not necessarily a subdivision of the division
into k groups. For these data, it seems likely that the results may be sensitive to the
particular distance and aggregation criteria used, and to the choice of a hierarchical
method rather than some other kind. The analyst would be well-advised to explore
these issues further, rather than simply accepting the complete linkage, Euclidean
distance results.

Table 3: MDL Criteria for Clustering Johnson and Wichern's Cereal Data (n' = 60)

Number of Groups   MDL     p'    s      R
2                  19.02   10    .760   .547
3                  20.92   15    .614   1.278
4                  17.00   20    .479   1.642
5                  20.39   25    .434   1.814

4. Remarks
The MDL approach as explored here is clearly easiest to apply in the case of mathematically
tractable measures of error, such as least squares, at least in the sense that the formulae
correspond naturally to the distances being used.
On the other hand, (2) could be used to assess any series of classifications whatever,
hierarchical or not, without reference to the distances or other criteria used to generate
them. It can thus be used as a kind of external check on the results of other methods.
It seems likely that the MDL measures will be most useful when combined with other
measures and results. They are intended to augment careful thought and reflection, not
to replace them.
The last subdivision, in which all observations are distinct and there is no clustering,
has no corresponding MDL figure of merit, as the sum of squared errors becomes 0.
This will often be of little practical consequence, though it is theoretically unappealing.
Finally, note that other MDL criteria are possible, too, and they will not necessarily
lead to identical conclusions. The differences among them arise from different
specifications of allowable ranges for the parameters, scaling of the observations, etc.
These different specifications are roughly analogous to different prior distributions in
Bayesian analysis, though the exact formalisms are different. Some remarks on this
are given in Bryant (1996).
References
Bryant, P. (1996): The Minimum Description Length Principle for Gaussian Regression, Working Paper 1996-08, University of Colorado at Denver, Graduate School of Business Administration, Denver, Colorado 80217-3364.
Duda, R. O. and Hart, P. E. (1973): Pattern Classification and Scene Analysis, John Wiley & Sons, New York.
Everitt, B. S. (1993): Cluster Analysis, Edward Arnold, London.
Johnson, R. A. and Wichern, D. W. (1988): Applied Multivariate Statistical Analysis, second edition, Prentice-Hall, Englewood Cliffs, N.J.
Rissanen, J. (1987): Stochastic complexity, Journal of the Royal Statistical Society, Series B, 49, 3, 223-265.
Rissanen, J. (1989): Stochastic Complexity in Statistical Inquiry, World Scientific Publishing Co., Singapore.
Rissanen, J. (1996): Shannon-Wiener information and stochastic complexity, In: Proceedings, N. Wiener Centenary Congress, East Lansing, Michigan.
Consensus Methods for Pyramids and Other
Hypergraphs
J. Lehel, F. R. McMorris, R. C. Powers
Department of Mathematics
University of Louisville
Louisville, KY 40292
U.S.A.

Summary: A classification can most generally be viewed as a hypergraph, which is simply
a set of subsets (clusters) of the finite set S of objects being studied. In this paper, we are
primarily concerned with consensus functions on tree hypergraphs such as pyramids and
totally balanced hypergraphs.

1. Introduction and Definitions


Let S be a finite set of n objects. A (simple) hypergraph H on S is a set of non-empty
subsets of S. If A ∈ H, A is called an edge of H. The set of all clusters of a
classification of S is thus a hypergraph with the clusters as edges, and so we use the
words 'cluster' and 'edge' interchangeably. Of course, the clusters of a hierarchical
classification usually are structured into a tree-like relationship. In what follows, we
will give an overview of several types of hypergraphs that have proved (or should
prove) useful in classification studies. We then briefly investigate possibilities for
consensus functions on some tree-like hypergraphs.
Throughout it is assumed that all hypergraphs H on S satisfy the following: {x} ∈ H
for all x ∈ S, and S ∈ H. A hypergraph H on S is a hierarchy (also frequently called
an n-tree) if A ∩ B ∈ {∅, A, B} for all clusters (edges) A, B ∈ H, and is a weak
hierarchy if A ∩ B ∩ C ∈ {A ∩ B, A ∩ C, B ∩ C} for all clusters A, B, C ∈ H. Let T
denote the set of all hierarchies on S and W the set of all weak hierarchies on S. A
hypergraph H is a pyramid if A ∩ B ∈ H ∪ {∅} for all clusters A, B ∈ H, and there is a
linear ordering of S so that each cluster of H is an interval in this ordering (Bertrand
and Diday (1991)). Let P denote the set of all pyramids on S. It is not hard to
see that T ⊆ P ⊆ W. There has been much interesting work relating the classes
T, P and W to various dissimilarity measures on S (see Bertrand (1995), Gaul and
Schader (1994), and the excellent collection of papers in Van Cutsem (1994)) but we
are concerned here only with the relationships that the resulting clusters have among
themselves.
A cycle (of length k) in a hypergraph H is an alternating sequence of vertices and
edges v_1, e_1, ..., v_k, e_k where e_1, ..., e_k are distinct edges, v_1, ..., v_k are distinct vertices,
v_i, v_{i+1} ∈ e_i for all i = 1, ..., k−1, and v_k, v_1 ∈ e_k. The cycle is special if k ≥ 3 and
v_i ∈ e_j if and only if j = i, j = i−1 or (i, j) = (1, k). Thus a cycle is not special
when there is an edge of the cycle that contains at least three vertices of the cycle. In
Bandelt and Dress (1989) it is noted that a hypergraph H is a weak hierarchy if and
only if H contains no special cycles of length 3. When a hypergraph has no special
cycles at all, it is called totally balanced.
Before showing how totally balanced hypergraphs fit into the scheme, we need to
consider hypergraphs whose clusters inherit structure from a relation on its vertices.
A hypergraph H is an interval hypergraph if there is a path P so that every A ∈ H is

an interval of P. (See Duchet (1995) for standard terminology. Note that a graph is
simply a hypergraph with each edge having two vertices.) Thus every pyramid is an
interval hypergraph. A hypergraph H is a tree hypergraph if there is a tree T so that
every A ∈ H is a subtree of T. Let I denote the set of interval hypergraphs on S, TB
the set of totally balanced hypergraphs on S, and TH the set of tree hypergraphs
on S. From results in Lehel (1983) and Lehel (1985) we have that I ⊆ TB, TB
= W ∩ TH, and H ∈ TB if and only if every subhypergraph of H is in TH. Thus
the complete list of inclusions is T ⊆ P ⊆ I ⊆ TB ⊆ W (TB ⊆ TH), with examples
existing that show proper inclusions.
Because of the above characterization of TB as those weak hierarchies that are also
tree hypergraphs, it is our opinion that totally balanced hypergraphs merit further
study for possible uses in classification theory. However, in this paper our concern is
with consensus methods for various hypergraphs and we now turn our attention to
this topic.

2. Consensus
Let ℋ denote a class of hypergraphs on S. A consensus function on ℋ is a mapping
C : ℋ^k → ℋ, where k is a fixed positive integer. Elements of ℋ^k are called profiles and
are denoted by π = (H_1, ..., H_k), π' = (H'_1, ..., H'_k), etc. Among the general types of
consensus functions are the counting rules and the intersection rules. A counting rule
puts a cluster in C(π) if it appears sufficiently often in the hypergraphs making up the
input profile π. For example, the majority rule on T (Margush and McMorris (1981))
puts a cluster in the output if it appears in more than half of the input hierarchies.
Counting rules from T^k into T were characterized in McMorris and Neumann (1983)
and counting rules from (T ∪ W)^k into W were characterized in McMorris and Powers
(1991). We will shortly investigate the possibilities for counting rules on P and TB.
An intersection rule puts a cluster in the output C(π) when it is the intersection
of certain clusters from the input hierarchies in π. For H ∈ ℋ, let h : H → Z_0
(Z_0 denotes the set of nonnegative integers) be defined by h(A) = t if and only if
S = A_0 ⊃ A_1 ⊃ ... ⊃ A_t = A (proper inclusion) with each A_i ∈ H, and the chain is of
maximum length. When ℋ = T, h is an easily visualized height function. For π = (H_1, ..., H_k) ∈
ℋ^k and j ∈ Z_0 let L_j(π) = {X_1 ∩ ... ∩ X_k : X_i ∈ H_i and h(X_i) = j for i = 1, ..., k}*.
(For a set of subsets R, R* = {X ∈ R : X ≠ ∅}.) Now set C_D(π) = ∪_{j≥0} L_j(π).
The driving motivation for considering C_D(π) is that by intersecting clusters at the
same level across the profile π one might be able to produce a hypergraph whose
clusters represent areas of partial overlap (agreement). This is precisely the type
of information lost when using counting rules. A problem, of course, is that C_D(π)
might not be the same type of hypergraph as those making up the profile π. However,
when ℋ = T then C_D(π) ∈ T, and in this case C_D was studied in Neumann (1983).
Intersection rules on T are further investigated in Adams (1986), Powers (1995) and
Vach (1994). Problems start to arise even as we pass from T to P, and in McMorris
and Powers (1996) it is noted that C_D(π) need not be a pyramid when π ∈ P^k.
However, when each pyramid in π is based on the same linear ordering of S then
C_D(π) is an interval hypergraph, from which a pyramid can be formed. The resulting
consensus function on P is characterized in McMorris and Powers (1996).
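The height function h and the rule C_D are straightforward to compute; the sketch below (our own encoding of hypergraphs as sets of frozensets; function names are illustrative) implements them for hierarchies:

from itertools import product

def height(H, A, S, memo):
    # h(A): length of a longest chain S = A0 ⊃ A1 ⊃ ... ⊃ At = A with each Ai in H
    if A == S:
        return 0
    if A not in memo:
        memo[A] = 1 + max(height(H, B, S, memo) for B in H if A < B)
    return memo[A]

def C_D(profile, S):
    # intersection rule: union over j of the nonempty intersections of one
    # level-j cluster taken from each hypergraph of the profile
    levels = [{A: height(H, A, S, {}) for A in H} for H in profile]
    out = set()
    for j in range(min(max(lv.values()) for lv in levels) + 1):
        layer = [[A for A, t in lv.items() if t == j] for lv in levels]
        for combo in product(*layer):
            X = frozenset.intersection(*combo)
            if X:
                out.add(X)
    return out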
We now seek counting rules for P and TB. Recall that a counting rule C : ℋ^k → ℋ
can be described by a threshold l. It is then referred to as an M_l-rule, with A ∈ M_l(π)
if and only if |{i : A ∈ H_i}| > lk, for π ∈ ℋ^k. The codomain of M_l is of concern. For
example, when ℋ = T then the majority rule M_{1/2}(π) ∈ T for all π ∈ T^k (Margush
and McMorris (1981)); while if ℋ = W, then M_{2/3}(π) ∈ W for all π ∈ W^k (McMorris
and Powers (1991)). Clearly M_1(π) ∈ ℋ for all π ∈ ℋ^k where ℋ ∈ {T, P, I, TB, W},
and M_1 is usually called the unanimity rule.
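As a sketch (our own), an M_l-rule is a one-line count over the profile; note that the unanimity case l = 1 asks for membership in every hypergraph:

from collections import Counter

def M_l(profile, l):
    # counting rule: clusters appearing in more than l*k of the k hypergraphs
    # (for l = 1, the unanimity rule: clusters appearing in all k of them)
    k = len(profile)
    counts = Counter(A for H in profile for A in H)
    if l == 1:
        return {A for A, c in counts.items() if c == k}
    return {A for A, c in counts.items() if c > l * k}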
Surprisingly, counting rules other than the unanimity rule fail for P and TB as our
example shows.
Example 1: Let S = {x_1, ..., x_n} with n ≥ 3. Define the hypergraphs H_1, ..., H_n as
follows:
H_1 = {S, {x_1}, ..., {x_n}, {x_1, x_2}, {x_2, x_3}, ..., {x_{n−1}, x_n}},
H_2 = {S, {x_1}, ..., {x_n}, {x_2, x_3}, {x_3, x_4}, ..., {x_n, x_1}},
...
H_n = {S, {x_1}, ..., {x_n}, {x_n, x_1}, {x_1, x_2}, ..., {x_{n−2}, x_{n−1}}}.
It is easy to see that each H_i ∈ P and thus H_i ∈ TB. Letting k = n and
π = (H_1, ..., H_k), we now see that M_l(π) has a special k-cycle for all l ∈ (0, 1) and is
thus not a totally balanced hypergraph (and hence not a pyramid). Therefore the
only l that works is l = 1.
If we try to dodge the problem pointed out in the example by requiring each pyramid
in π ∈ P^k to be defined from the same linear order of S, then any selection of clusters
from those that appear in the H_i's will give an interval hypergraph from which a
pyramid is easily formed by taking intersections of intervals and adding the singletons
and S. In particular, a cluster that appears in only one out of the k hypergraphs
could be part of the consensus output, and this is contrary to the notion of consensus.
Generalizing the M_{2/3}-rule for weak hierarchies gives the following result.
Theorem: Let π = (H_1, ..., H_k) where each hypergraph H_i has no special cycle of
length m. Set l = (m−1)/m and assume that ⌈lk⌉ < k. Then M_l(π) has no special cycle
of length m.
Proof: Let A_1, ..., A_m ∈ M_l(π). Then, for j = 1, ..., m, |{i : A_j ∈ H_i}| > lk. Since
m⌈lk⌉ > (m−1)k it follows that there exists j ∈ {1, ..., k} such that A_1, ..., A_m ∈ H_j.
Since H_j has no special cycle of length m we have that A_1, ..., A_m is not a special
cycle of length m. Hence M_l(π) has no special cycle of length m. □
We point out that if l < (m−1)/m, then M_l(π) might have a special cycle of length m. This
leads to another interpretation as to why M_1 is the only counting rule that works
on TB. If π ∈ (TB)^k and we are trying to eliminate special cycles of all lengths in
M_l(π), we must have lim_{m→∞} l = lim_{m→∞} (m−1)/m = 1.
We now are ready to make a proposal for an approach to consensus for hypergraphs
that utilizes clusters from both counting and intersection rules. This procedure is
first described in general terms as follows: Let ℋ be a fixed class of hypergraphs for
which there is a smallest l ∈ (0, 1] such that M_l(π) ∈ ℋ for all π ∈ ℋ^k. For π ∈ ℋ^k,
consider C_D(π) and add clusters from C_D(π) subject to preserving membership in ℋ
and other appropriate constraints.
To illustrate this approach consider ℋ = P. We have seen that l = 1 for pyramids, so
for π ∈ P^k we first form the unanimity consensus M_1(π). Next construct C_D(π) and sort
C_D(π) = {A_1, ..., A_m} according to a criterion such as size, |A_1| ≥ |A_2| ≥ ... ≥ |A_m|.
Now consider the hypergraph M_1(π) ∪ {A_1}. The idea is to have A_1 as a cluster in the
consensus output if and only if M_1(π) ∪ {A_1} is a pyramid. This procedure continues
until a decision is made about the last cluster A_m. Thus the final consensus pyramid
gives clusters that are obtained by either counting or intersection. One should, however,
consider exact algorithms and their associated complexities, and we leave this
for future work.
Acknowledgement
The research of F.R. McMorris was supported by the United States Office of Naval
Research Grant N00014-95-1-0109.

3. References
Adams, E. N. III (1986): N-trees as nestings: Complexity, Similarity, and Consensus, Journal of Classification, 3, 2, 299-317.
Bandelt, H.-J. and Dress, A. (1989): Weak hierarchies associated with similarity measures - an additive clustering technique, Bulletin of Mathematical Biology, 51, 1, 133-166.
Bertrand, P. (1995): Structural Properties of Pyramidal Clustering, In: Partitioning Data Sets, Cox, I. et al. (eds.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 19, 35-53, AMS, Providence, RI.
Bertrand, P. and Diday, E. (1991): Les pyramides classifiantes: une extension de la structure hiérarchique, C. R. Acad. Sci. Paris, Série I, 693-696.
Duchet, P. (1995): Hypergraphs, In: Handbook of Combinatorics, Graham, R. et al. (eds.), Vol. 1, 381-432, MIT Press, Cambridge, MA.
Gaul, W. and Schader, M. (1994): Pyramidal classification based on incomplete dissimilarity data, Journal of Classification, 11, 2, 171-193.
Lehel, J. (1983): Helly-hypergraphs and abstract interval structures, Ars Combinatoria, 16-A, 239-253.
Lehel, J. (1985): A characterization of totally balanced hypergraphs, Discrete Mathematics, 57, 59-65.
Margush, T. and McMorris, F. R. (1981): Consensus n-trees, Bulletin of Mathematical Biology, 43, 239-244.
McMorris, F. R. and Neumann, D. A. (1983): Consensus functions defined on trees, Mathematical Social Sciences, 4, 131-136.
McMorris, F. R. and Powers, R. C. (1991): Consensus weak hierarchies, Bulletin of Mathematical Biology, 53, 679-684.
McMorris, F. R. and Powers, R. C. (1996): Intersection rules for consensus hierarchies, In: Proceedings of the third international conference on ordinal and symbolic data analysis, Diday, E. et al. (eds.), 301-308, Springer-Verlag, Berlin.
Neumann, D. A. (1983): Faithful consensus methods for n-trees, Mathematical Biosciences, 63, 271-287.
Powers, R. C. (1995): Intersection rules for consensus n-trees, Applied Mathematics Letters, 8, 4, 51-55.
Vach, W. (1994): Preserving consensus hierarchies, Journal of Classification, 11, 1, 59-77.
Van Cutsem, B. (1994): Classification and dissimilarity analysis, Lecture Notes in Statistics, Springer-Verlag, New York.
On the Behavior of Splitting Criteria
for Classification Trees
Roberta Siciliano, Francesco Mola
Dipartimento di Matematica e Statistica
Università di Napoli Federico II
Via Cintia - Monte S. Angelo
80126 - Naples - Italy
e-mail: [email protected]
[email protected]

Summary: In the framework of classification trees, the behavior of splitting criteria is
investigated through a simulation study and applications on a real data set. Some emphasis
is placed on the strength of the dependency among variables, the choice of the
splitting rule, the role played by the type of predictors, and the stability of the classification
rule. Alternative splitting criteria and new splitting rules are also proposed to deal with
the computational effort of splitting procedures in large data sets.

1. Classification trees and splitting criteria

A classification tree procedure consists of a recursive sequential binary division of N
cases belonging to J groups. A sample of N cases is assumed to be available, on which are
observed a response variable Y with J classes and M predictors (X_1, ..., X_M), not necessarily
all of the same type. Such a sample (called the learning sample) is used
to grow a binary tree. A test sample, with the same structure as the learning
sample, or a v-fold cross-validation can be used to find an optimal size tree and then
to validate the final classification rule. As a result, a classification tree procedure
provides not only a classification rule for new cases of unknown class (decision tree)
but also an analysis of the dependence structure in large data sets (exploratory tree).
In a classification tree procedure a crucial moment is represented by the choice of
the splitting criterion to divide a group of cases at the node t into two subgroups
associated with the left node t_l and the right node t_r, respectively. A splitting criterion
usually satisfies the coherence principle in order to choose at each node the division
of cases such that the two subgroups are internally most homogeneous and externally
most heterogeneous (see for example Breiman et al., 1984; Mingers, 1988).
The set of splits at each node is formed considering for each predictor all possible
binary divisions of its categories. As an example, a numerical predictor with K distinct
values produces K − 1 possible splits; a nominal predictor with K categories
generates 2^{K−1} − 1 possible splits. A split s, also called a dichotomous splitting variable
or binary question, sends a proportion p_l of cases to the left node t_l and a proportion
p_r of cases to the right node t_r.
Several splitting criteria have been proposed in the literature, whose differences rely on
the type of predictors that can be considered and the adopted splitting rule; see for
instance the DNP (Discrimination Non Paramétrique) splitting procedure of Celeux
and Lechevallier (1982) and the RECPAM (RECursive Partitioning and AMalgamation)
approach of Ciampi and Thiffault (1987) (for a detailed review on binary segmentation
procedures see Mola, 1993). In the following, we will consider the CART
(Classification and Regression Trees) methodology of Breiman et al. (1984) and the two-
stage binary segmentation of Mola and Siciliano (1992, 1994).


1.1 CART splitting criterion

Breiman et al. (1984) have introduced the CART methodology providing several
innovations; one of these is represented by the possibility of dealing simultaneously with
both numerical and categorical predictors. This was allowed by the definition of a
general splitting criterion based on the concept of impurity at a given node. According
to the CART splitting criterion the following decrease in impurity is maximized
for each s ∈ Q:

Δi(s|t) = i(t) − (p_l i(t_l) + p_r i(t_r))                                    (1)

where i(·) is the impurity function, and p_l and p_r are the proportions of cases at the
left node and the right node, respectively. Depending on the definition of the impurity
function, different splitting rules can be used, such as for example the Gini
index of heterogeneity G_Y(t) = 1 − Σ_j p(j|t)² and the entropy index H_Y(t) =
−Σ_j p(j|t) log p(j|t), where p(j|t) is the proportion of cases with class j at node
t.
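As a small illustrative sketch (our own NumPy-based encoding, not code from the paper), the decrease in impurity (1) with the Gini index can be evaluated for a candidate split as follows:

import numpy as np

def gini(labels):
    # Gini heterogeneity 1 - sum_j p(j|t)^2 of a vector of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def decrease_in_impurity(y, go_left, impurity=gini):
    # Delta i(s|t) = i(t) - (p_l i(t_l) + p_r i(t_r)) for a boolean mask go_left,
    # e.g. decrease_in_impurity(y, X[:, 0] < 31.5)
    y = np.asarray(y)
    go_left = np.asarray(go_left, dtype=bool)
    p_l = go_left.mean()
    return impurity(y) - p_l * impurity(y[go_left]) - (1 - p_l) * impurity(y[~go_left])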
1.2 Two-stage splitting criterion
Mola and Siciliano (1992, 1994) have introduced a two-stage splitting criterion in
order to consider the "global" role played by the predictor as well as the "local" role
played by any splitting variable. The first stage provides a variable selection of one
or more significant predictors which are used to define the set of possible splits; the
second stage applies a splitting rule to select the best split. Three strategies can be
adopted with respect to the use of either (a) statistical indexes, or (b) statistical
modeling, or (c) factorial methods. We briefly describe the first strategy; for the second
strategy we refer to Mola, Klaschka and Siciliano (1996), for the third strategy
we refer to Mola and Siciliano (1997b) in this volume.
Originally, the two-stage splitting criterion was based on the predictability index τ
of Goodman and Kruskal

τ_{Y|X}(t) = [Σ_i Σ_j p²(j|i, t) p(i|t) − Σ_j p²(j|t)] / [1 − Σ_j p²(j|t)],

where p(j|i, t) is the proportion of cases that belong to class j given that they have category i of X at node
t, and p(i|t) is the proportion of cases that has category i of X at node t (Mola and
Siciliano, 1992). A further index can be proposed, namely the conditional entropy
index of Shannon H_{Y|X}(t) = −Σ_i p(i|t) Σ_j p(j|i, t) log p(j|i, t).
Both the above mentioned indexes can be proved to be special cases of the following
general measure for the proportional reduction in the heterogeneity of the response
variable Y due to the information provided by the predictor X (globally considered):

γ_{Y|X}(t) = [i_Y(t) − Σ_i p(i|t) i_{Y|i}(t)] / i_Y(t)                       (2)

where i_Y(t) is the measure of heterogeneity for the variable Y at node t and i_{Y|i}(t)
is the same measure for the conditional distribution of Y given the modality i of
predictor X; using the Gini index yields the predictability τ index, whereas using
the entropy index yields the conditional entropy index.¹
As a result, we can describe the two stages of the splitting criterion using such a general
index γ_{Y|·}(t), to be defined both for the predictors and for the splits generated by a given
predictor.

¹ Notice that the impurity measure in CART is nothing else than a heterogeneity index, and for
this reason we have adopted the same notation for the heterogeneity index here as for the impurity
measure in Section 1.1.

At the first stage, we maximize for each predictor m ∈ M:

max_{m∈M} γ_{Y|X_m}(t) = γ_{Y|X*}(t)                                         (3)

and, at the second stage, we maximize for each split s ∈ S of the predictor X*:

max_{s∈S} γ_{Y|s}(t),  where γ_{Y|s}(t) = [i_Y(t) − (p_l i_Y(t_l) + p_r i_Y(t_r))] / i_Y(t).   (4)

At the first stage we can select more than one predictor; namely, we can order the
predictors with respect to the values of γ_{Y|X_m}(t) so that we can rank the predictors
with respect to their predictability power. The selected predictors are used to generate
the set of splits used at the second stage.
Notice also that the numerator of (4) is equivalent to the decrease in impurity (1).
As a result, the splitting rule in CART can be defined in terms of the dependency
index γ instead of the decrease in impurity. Indeed, if we consider in place of S the
set Q of all possible splitting variables generated by all predictors, then we could use
(4) directly as a splitting rule in CART. In this way, we provide at least two new
interpretations of CART splitting rules: one in terms of the predictability τ index,
the other in terms of the conditional entropy index. Using this result, Mola and
Siciliano (1997a) have recently introduced a fast splitting algorithm that is related to
the two-stage criterion but finds the same solution as the CART splitting
criterion.
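A rough Python sketch of the first strategy (statistical indexes), with the Goodman and Kruskal τ as in the formulas above; the encoding of predictors as categorical columns and the restriction to one-category-versus-rest splits at the second stage are our own simplifications, not the paper's procedure:

import numpy as np

def gk_tau(y, x):
    # predictability index tau_{Y|X}(t) of Goodman and Kruskal
    y, x = np.asarray(y), np.asarray(x)
    classes = np.unique(y)
    p_j = np.array([np.mean(y == j) for j in classes])
    num, denom = -np.sum(p_j ** 2), 1.0 - np.sum(p_j ** 2)
    for i in np.unique(x):
        sel = x == i
        p_ji = np.array([np.mean(y[sel] == j) for j in classes])
        num += sel.mean() * np.sum(p_ji ** 2)
    return num / denom

def two_stage_split(y, X):
    # stage 1: select the predictor with maximal tau_{Y|X_m}(t), as in (3);
    # stage 2: best binary split of that predictor, as in (4) with the Gini index
    best_m = max(range(X.shape[1]), key=lambda m: gk_tau(y, X[:, m]))
    x = X[:, best_m]
    best_cat = max(np.unique(x), key=lambda i: gk_tau(y, (x == i).astype(int)))
    return best_m, best_cat

For a binary split s, gk_tau applied to the indicator of s coincides with (4) computed with the Gini heterogeneity, since its numerator then equals the decrease in Gini impurity.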

2. The behavior of splitting criteria

In this section we discuss the behavior of splitting criteria, focusing our attention
on some aspects concerning their performance with respect to the structure of the
dependency in the data, the problem of choosing among several splitting rules, the role
played by the type of predictors, and the stability of the classification rule. For the sake
of brevity, we present a simulation study and some applications on a real data set
in order to contribute to the discussion on the above mentioned aspects with some
methodological issues.
2.1 The dependence data structure
Although there are methodological differences in the splitting procedures proposed
in the literature, we believe that empirical differences in results strongly depend on the
structure of the data set analyzed. Our experience allows us to say that, when passing
from a data set without a structure of dependency to a data set where some of
the predictors have high predictability power on the response variable, different tree
procedures might converge to the same result.
As an example, we present some results of a simulation study comparing the
CART splitting procedure, using either the Gini index or the entropy index as a
splitting rule, with the two-stage splitting procedure, using either the predictability
τ index or the conditional entropy index as a splitting rule.
We have considered a 4 × 2 factorial design, with I (number of categories for each
categorical predictor) taking values 3, 4, 5, 6 and J (response classes) taking values
2, 3. For each combination of the factorial design we have generated 1000 matrices
of dimensions 100 × 11 (where 100 is the sample size, and 11 is given by 1 response
variable plus 10 predictors).
We have generated all the variables from a uniform distribution without imposing
any dependency. Tables 1 and 2 describe the results using the Gini index of heterogeneity
and the entropy index, respectively, in order to find the best split at the root
node. In particular, for each combination of the factorial design, each cell of the table
gives the percentage of times that the best split of the CART splitting criterion is found
by the two-stage splitting criterion through the first best predictor X_(1), the second
predictor X_(2) and the third predictor X_(3), respectively (see Section 1.2).
For instance, for I = 3 and J = 2 in Table 1 the best split according to CART has
been found in predictor X_(1) for 88% of the times and in predictor X_(2) for the remaining
12% of the times.

            J = 2                            J = 3
I    X(1)   X(2)   X(3)   overall     X(1)   X(2)   X(3)   overall
3    88%    12%    0%     100%        79%    16%    4%     99%
4    88%    10%    2%     100%        72%    22%    4%     98%
5    81%    16%    2%     99%         75%    14%    5%     94%
6    72%    21%    7%     100%        76%    13%    3%     92%

Table 1: Percentage of times that the two-stage splitting criterion finds the best split
of CART (using the Gini index of heterogeneity)

            J = 2                            J = 3
I    X(1)   X(2)   X(3)   overall     X(1)   X(2)   X(3)   overall
3    88%    12%    0%     100%        79%    18%    2%     99%
4    89%    9%     2%     100%        77%    15%    5%     97%
5    85%    12%    2%     99%         74%    20%    1%     95%
6    76%    17%    7%     100%        70%    11%    8%     89%

Table 2: Percentage of times that the two-stage splitting criterion finds the best split
of CART (using the entropy index)

Both Tables 1 and 2 show that, whatever the adopted splitting rule, considering the three
best predictors in the two-stage splitting criterion yields the same best split as CART
with a high percentage of times (see column "overall"), especially when J = 2,
although we have not imposed any dependency structure.
As soon as one of the predictors is related to the response variable, the
CART splitting criterion and the two-stage splitting criterion using only one best
predictor give the same result. This has been the result of a further simulation study
in which one of the predictors was generated according to a dependency structure
(passing from a low level of dependency to a high level of dependency as measured
with the dependency γ index).

2.2 The choice of the splitting rule

Another interesting point is the role played by the splitting rule in the adopted splitting
criterion, that is, to verify whether the selection of the best split depends on
the choice of the splitting rule. This aspect is strictly related to the problem discussed
in Section 2.1, as we believe that if there exists a dependency of the response variable
on some of the predictors, then different splitting rules agree on the same best split.
For this purpose we consider the factorial design described in Section 2.1 and we calculate
the percentage of times that the Gini index of heterogeneity and the entropy
index in the CART splitting procedure attain the same result; analogously, we calculate
the percentage of times that the predictability τ index and the conditional
entropy index in the two-stage splitting procedure yield the same result. We describe
these results in Tables 3 and 4, where we notice a certain coherence in the results,
especially for J = 2.

       J = 2           J = 3
 I   Gini-Entropy    Gini-Entropy
 3       99%             90%
 4       98%             84%
 5       97%             82%
 6       93%             70%

Table 3: Percentage of times that the best split is the same using the Gini index and
the entropy index in the CART splitting criterion

       J = 2           J = 3
 I    τ-Entropy       τ-Entropy
 3       99%             91%
 4       99%             88%
 5       97%             84%
 6       97%             72%

Table 4: Percentage of times that the best split is the same using the τ index and
the conditional entropy index in the two-stage splitting criterion

2.3 The role played by the type of predictors


One of the innovative aspects of the CART splitting criterion has been the simultaneous treatment of numerical and categorical predictors. Nevertheless, we have very often verified that numerical predictors play a privileged role in the analysis, in the sense that the best splits in the upper part of the tree are very often generated by numerical predictors rather than categorical ones.
As an example, we have considered a well known real data set, the "low birth weight" (medical) data set from Hosmer and Lemeshow (1990). The response variable is a dummy variable that is equal to 1 if birth-weight is less than 2500 g and 0 otherwise. For each mother, two numerical predictors were recorded: age of the mother in years (X1) and the weight in pounds at the last menstrual period (X2). Six categorical variables were also recorded: race (X3, 1=white, 2=black, 3=other), smoking status during pregnancy (X4, 0=non-smoker, 1=smoker), history of premature abort (X5, 0=no, 1=yes), history of hypertension (X6, 0=no, 1=yes), presence of uterine irritability (X7, 0=no, 1=yes), and physician visits during the last trimester (X8, 0=none, 1=one or more).
We have analyzed this data set using the CART splitting procedure. The split sequence to grow the final binary tree is shown in Table 5. Most of the splits have been generated by numerical predictors. We have repeated the analysis considering a categorization of the numerical predictors. The result is shown in Table 6. It is interesting to notice how the categorization of numerical predictors modifies the tree structure and thus the misclassification rate in both nonterminal and terminal nodes (see both columns "cases" and "error rate").

 split sequence                                          terminal
  t    cases %   error rate (%)   best pred.   split     label class
  1    100.00        31.0            X5        0 vs 1
  2     16.76        40.0            X1        < 31.5
  4     15.08        33.0            X2        < ?
  8     13.97        28.0                                     1
  9      1.12         0.0                                     0
  5      1.68         0.0                                     0
  3     83.24        25.0            X2        < 106.0
  6     15.64        46.0            X1        < 22.5
 12     10.06        28.0                                     0
 13      5.59        20.0                                     1
  7     67.70        21.0            X6        no
 14     63.69        18.0            X?        no
 28     56.42        14.0            X1        < 27
 29      7.26        46.0            X2        < 122.5
  ?         ?        2?.0                                     ?
  ?         ?        20.0                                     ?
 15         ?        29.0                                     1

Table 5: Split sequence using numerical and categorical predictors

 split sequence                                          terminal
  t    cases %   error rate (%)   best pred.   split     label class
  1    100.00        31.0            X5        0 vs 1
  2     16.76        40.0            X1        1,2 vs 3,4
  4      4.47        38.0                                     0
  5     12.29        32.0                                     1
  3     83.24        26.0            X?        0 vs 1
  6     78.77        23.0            X2        1 vs 2,3,4
 12     51.40        16.0                                     0
 13     27.37        37.0            X?        2 vs 0,1
 26     13.41        46.0            X?        0 vs 1
 52     10.61        37.0                                     0
 53      2.79        20.0                                     1
 27     13.97        28.0            X?        1 vs 0
 54      6.70        17.0                                     0
 55      7.26        38.0            X1        ?
110      6.15        27.0                                     0
111      1.12         0.0                                     1
  7      4.47        38.0                                     1

Table 6: Split sequence using categorized and categorical predictors

2.4 The stability of the classification rule

A skeptical researcher might consider the results of tree procedures "unstable": this is in fact a crucial point, which we discuss in this section.
Through some resampling methods we analyze the stability of classification tree procedures with respect to the structure of the final binary tree and the related misclassification rates (R(t)). In particular, we have analyzed the behavior of the CART splitting procedure when the test sample is used to validate the classification rule (see Breiman et al., 1984). Considering the same data set described in section 2.3, we have repeated the analysis 1000 times taking 30% of the cases, randomly chosen, in the test sample; again we have repeated the analysis 1000 times taking 20% of the cases in the test sample; finally, we have repeated the analysis 1000 times taking 10% of the cases in the test sample. For each analysis we have considered final classification trees with 3 and 4 terminal nodes respectively, and we have calculated the related misclassification rates. In Figure 1 we show two series of boxplots in order to describe the distribution of the misclassification rate considering 3 terminal nodes (boxplots above) and 4 terminal nodes (boxplots below). As a result, the misclassification rate appears to be quite "unstable" when the test sample takes 30% of the cases. In conclusion, when the learning sample is not too large, cross-validation is recommended since it can provide more stable validations of the classification rule.
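A schematic version of this resampling experiment follows (our own sketch, not the original program, with scikit-learn's CART-style trees standing in for the procedure used in the paper; X and y are assumed to hold the low-birth-weight predictors and response):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def test_error_distribution(X, y, test_frac, leaves, B=1000, seed=0):
    """Misclassification rates over B random test samples of a given fraction."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(B):
        Xtr, Xte, ytr, yte = train_test_split(
            X, y, test_size=test_frac, random_state=int(rng.integers(1 << 31)))
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves).fit(Xtr, ytr)
        rates.append(np.mean(tree.predict(Xte) != yte))
    return np.asarray(rates)

# for frac in (0.30, 0.20, 0.10):      # the three test-sample sizes studied
#     r = test_error_distribution(X, y, frac, leaves=3)
#     print(frac, r.mean(), r.std())   # spread mirrors the boxplots of Figure 1
```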

Figure 1: Boxplots of the misclassification error for different binary trees

3 Concluding remarks
In this paper we have discussed important aspects concerning the behavior of splitting criteria in classification trees. We have shown how the two-stage splitting criterion can be fruitfully used to select a number of predictors that generate, with a high confidence level, the best split according to the CART criterion.
We have also verified that the structure of the binary tree is not influenced by the choice among alternative splitting rules, but rather by the type of predictors and their treatment in the splitting procedure.
There are several classification tree procedures proposed in the literature, and in recent years a lot of attention has been given to specialized software for applying such procedures (e.g., CART, RECPAM). Furthermore, it is possible to find binary segmentation procedures also in statistical packages such as SPSS, S+, SPAD.S.
As a result, the number of users of such procedures is increasing, and "nonexpert researchers" might be willing to apply classification tree procedures for statistical analysis. It becomes then evident that a correct use of such methods requires a certain experience, or at least attention to some crucial aspects such as the simultaneous treatment of numerical and categorical predictors, the choice of the splitting rule, the method for validating the tree, and so on. We believe that it is worthwhile to discuss some of the problems and peculiar aspects of classification tree procedures as we have described in this paper; we hope that the present contribution provides a good step in this direction.

Acknowledgements: This research was supported for the first author by CNR research funds number 95.02041.CT10 and for the second author by CNR research funds number 92.1872.P

References

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984): Classification and Regression Trees, Wadsworth, Belmont, CA.
Celeux, G. and Lechevallier, Y. (1982): Méthodes de segmentation non paramétriques, Revue de Statistique Appliquée, 4, 39-53.
Ciampi, A. and Thiffault, J. (1987): Recursive Partition and Amalgamation (RECPAM) for Censored Survival Data: Criteria for tree selector, Statistical Software Newsletter, 14, 2, 78-81.
Hosmer, D.W. and Lemeshow, S. (1990): Applied Logistic Regression, J. Wiley, New York.
Mingers, J. (1988): An empirical comparison of selection measures for decision tree induction, Machine Learning, 3, 319-342.
Mola, F. (1993): Aspetti metodologici e computazionali delle tecniche di segmentazione binaria. Un contributo basato su funzioni di predizione, PhD dissertation, University of Naples.
Mola, F. and Siciliano, R. (1992): A Two-Stage Predictive Splitting Algorithm in Binary Segmentation, Computational Statistics (Compstat '92 Proceedings), Dodge, Y. and Whittaker, J. (eds.), 1, 179-184, Physica Verlag.
Mola, F. and Siciliano, R. (1994): Alternative Strategies and CATANOVA Testing in Two-Stage Binary Segmentation, New Approaches in Classification and Data Analysis, Diday, E. et al. (eds.), 316-323, Springer Verlag.
Mola, F. and Siciliano, R. (1997a): A Fast Splitting Procedure for Classification Trees, Statistics and Computing (to appear).
Mola, F. and Siciliano, R. (1997b): Visualizing Data in Tree-Structured Classification, Proceedings of IFCS-96: Data Science, Classification and Related Methods, Hayashi, C. et al. (eds.), Springer Verlag, Tokyo.
Mola, F., Klaschka, J. and Siciliano, R. (1996): Logistic Classification Trees, COMPSTAT 96 Proceedings, Prat, A. (ed.), Physica Verlag.
Taylor, P.C. and Silverman, B.W. (1993): Block Diagrams and Splitting Criteria for Classification Trees, Statistics and Computing, 3, 147-161.
Fitting Pre-specified Blockmodels
Vladimir Batagelj¹, Anuška Ferligoj², and Patrick Doreian³
¹ University of Ljubljana, FMF, Dept. of Mathematics
Jadranska 19, 1000 Ljubljana, Slovenia
² University of Ljubljana, Faculty of Social Sciences
P.O. Box 47, 1109 Ljubljana, Slovenia
³ University of Pittsburgh, Dept. of Sociology
Pittsburgh, PA 15260, USA

Summary: In this paper an optimization approach to blockmodeling for fitting an observed relation to a pre-specified blockmodel is used. The proposed deductive approach to blockmodeling is applied to a concrete example. Some further research directions are also suggested.

1. Introduction.
The goal of conventional blockmodeling is to reduce a large, potentially incoherent network to a smaller comprehensible structure that can be interpreted more readily. Blockmodeling, as an empirical procedure, is based on the idea that units in a network can be grouped according to the extent to which they are equivalent, under some meaningful definition of equivalence.
There are many inductive approaches for establishing blockmodels for a set of social relations defined over a set of social actors. Some form of equivalence is specified and clusterings are sought that are consistent with the specified equivalence. In all cases, the analyses respond to empirical information in order to establish the blockmodel.
Another view of blockmodeling is deductive in the sense of starting with a blockmodel that is specified in terms of substance prior to an analysis. In this paper we present methods where a set of observed relations are fitted to a pre-specified blockmodel.

2. Basic Terms
Network: Let $E = \{x_1, x_2, \ldots, x_n\}$ be a finite set of units. The units are related by binary relations $R_t \subseteq E \times E$, $t = 1, \ldots, r$, which determine a network $\mathcal{N} = (E, R_1, R_2, \ldots, R_r)$. In the following we restrict our discussion to a single relation $R$ described by a corresponding binary matrix $\mathbf{R} = [r_{ij}]_{n \times n}$ where
$$r_{ij} = \begin{cases} 1 & x_i \, R \, x_j \\ 0 & \text{otherwise.} \end{cases}$$
In some applications $r_{ij}$ can be a nonnegative real number expressing the strength of the relation $R$ between units $x_i$ and $x_j$.
Cluster, clustering: One of the main procedural goals of blockmodeling is to identify, in a given network, clusters (classes) of units that share structural characteristics defined in terms of $R$. The units within a cluster have the same or similar connection patterns to other units. They form a clustering $C = \{C_1, C_2, \ldots, C_k\}$ which is a partition of the set $E$: $\bigcup_i C_i = E$ and $i \neq j \Rightarrow C_i \cap C_j = \emptyset$. Each partition determines an equivalence relation (and vice versa).


complete row-dominant col-dominant

regular row-regular col-regular

null row-functional col-functional

Figure 1: Types of connection between two sets; the left set is the ego-set.

Let us denote by $\approx$ the equivalence relation determined by the partition $C$.
Block: A clustering $C$ also partitions the relation $R$ into blocks $R(C_i, C_j) = R \cap C_i \times C_j$. Each such block is defined by the units belonging to clusters $C_i$ and $C_j$ in terms of the arcs leading from cluster $C_i$ to cluster $C_j$. If $i = j$, a block $R(C_i, C_i)$ is called a diagonal block.
Blockmodel: A blockmodel consists of structures obtained by identifying all units from the same cluster of the clustering $C$. For an exact definition of a blockmodel we also have to be precise about which blocks produce an arc in the reduced graph and which do not. The reduced graph can be presented by a matrix, also called an image matrix.
Block Types: Several possible block types can be defined. In Figure 1 nine block types are presented (Batagelj, 1993). In the relational matrix below three types of blocks can be found:

1 1 1 1   1 1 0 0
1 1 1 1   0 1 0 1
1 1 1 1   0 0 1 0
1 1 1 1   1 0 0 0

0 0 0 0   0 1 1 1
0 0 0 0   1 0 1 1
0 0 0 0   1 1 0 1
0 0 0 0   1 1 1 0

3. Blockmodeling - Formalization
A blockmodel is an ordered sextuple $\mathcal{M} = (U, K, \mathcal{T}, Q, \pi, \alpha)$ where:

• $U$ is a set of types of units (images or representatives of classes);

• $K \subseteq U \times U$ is a set of connections;

• $\mathcal{T}$ is a set of predicates used to describe the types of connections between different classes (clusters, groups, types of units) in a network. We assume that $\mathrm{nul} \in \mathcal{T}$. A mapping $\pi : K \to \mathcal{T} \setminus \{\mathrm{nul}\}$ assigns predicates to connections;

• $Q$ is a set of averaging rules. A mapping $\alpha : K \to Q$ determines rules for computing the values of connections.

For $\mu : V \to U$ we define, for $t \in U$, $C(t) = \mu^{-1}(t) = \{x \in V : \mu(x) = t\}$. Therefore $C(\mu) = \{C(t) : t \in U\}$ is a partition (clustering) of the set of units $V$.
A (surjective) mapping $\mu : V \to U$ determines a blockmodel $\mathcal{M}$ of a network $\mathcal{N}$ iff it satisfies the conditions:
$$\forall (t, w) \in K : \pi(t, w)(C(t), C(w))$$
and
$$\forall (t, w) \in U \times U \setminus K : \mathrm{nul}(C(t), C(w)).$$
Note: $\mathcal{T} = \{\mathrm{nul}, \mathrm{com}\}$ implies a structural blockmodel (Lorrain and White, 1971); and $\mathcal{T} = \{\mathrm{nul}, \mathrm{reg}\}$ implies a regular blockmodel (White and Reitz, 1983).
Let $\approx$ be an equivalence relation over $V$ and $[x] = \{y \in V : x \approx y\}$. We say that $\approx$ is compatible with $\mathcal{T}$ over a network $\mathcal{N}$ iff
$$\forall x, y \in V \;\exists T \in \mathcal{T} : T([x], [y]).$$
It is easy to verify that the notion of compatibility for $\mathcal{T} = \{\mathrm{nul}, \mathrm{reg}\}$ reduces to the usual definition of regular equivalence (Borgatti and Everett, 1989). For a compatible equivalence $\approx$ the mapping $\mu : x \mapsto [x]$ determines a blockmodel.

4. Optimization
4.1 A Criterion Function
One of the possible ways of constructing a criterion function that directly reflects the considered equivalence is to measure the fit of a clustering to an ideal one with perfect relations within each cluster and between clusters according to the considered equivalence (Batagelj, Doreian, and Ferligoj, 1992; Batagelj, 1993; Doreian, Batagelj, and Ferligoj, 1994).
Given a set of types of connection $\mathcal{T}$ and a block $R(X, Y)$, $X, Y \subseteq V$, we can determine the strongest (according to the ordering of the set $\mathcal{T}$) type $T$ which is satisfied by $R(X, Y)$. In this case we set $\pi(\mu(X), \mu(Y)) = T$.
We also need to consider the (many) cases where no type from $\mathcal{T}$ is satisfied. One approach is to introduce the set of ideal blocks for a given type $T \in \mathcal{T}$
$$B(X, Y; T) = \{B \subseteq X \times Y : T(B)\}$$
and define the deviation $\delta(X, Y; T)$ of a block $R(X, Y)$ from the nearest ideal block.
We can efficiently test whether the block $R(X, Y)$ is of the type $T$ (see Table 1).
Table 1: Characterizations of types of blocks

null          all 0 (except possibly the diagonal)
complete      all 1 (except possibly the diagonal)
row-dominant  ∃ an all-1 row (except possibly the diagonal)
col-dominant  ∃ an all-1 column (except possibly the diagonal)
regular       1-covered rows and 1-covered columns

On the basis of these characterizations we can also construct the corresponding measures of deviation from the ideal realization. For the proposed types, all deviations are sensitive:
$$\delta(X, Y; T) = 0 \iff T(R(X, Y)).$$
Therefore a block $R(X, Y)$ is of a type $T$ exactly when the corresponding deviation $\delta(X, Y; T)$ is 0. In the deviation $\delta$ we can also incorporate the values of lines, $v$, if the network has valued arcs.
Based on the deviation $\delta(X, Y; T)$ we introduce the block-error $\epsilon(X, Y; T)$ of $R(X, Y)$ for type $T$. Two examples of block-errors are
$$\epsilon_1(X, Y; T) = w(T)\,\delta(X, Y; T) \quad\text{and}\quad \epsilon_2(X, Y; T) = \frac{w(T)\,(1 + \delta(X, Y; T))}{n_r n_c},$$
where $w(T) > 0$ is a weight for type $T$. We extend the block-error to the set of feasible types $\mathcal{T}$ by defining
$$\epsilon(X, Y; \mathcal{T}) = \min_{T \in \mathcal{T}} \epsilon(X, Y; T) \quad\text{and}\quad \pi(\mu(X), \mu(Y)) = \mathop{\arg\min}_{T \in \mathcal{T}} \epsilon(X, Y; T).$$
To make $\pi$ well-defined, we order (prioritize) the set $\mathcal{T}$ and select the first type from $\mathcal{T}$ which minimizes $\epsilon$. We combine block-errors into a total error - a blockmodeling criterion function
$$P(\mu; \mathcal{T}) = \sum_{(t, w) \in U \times U} \epsilon(C(t), C(w); \mathcal{T}).$$
The criterion functions based on block-errors $\epsilon_1$ and $\epsilon_2$ are denoted $P_1$ and $P_2$ respectively.
For the criterion function $P_1(\mu)$ we have $P_1(\mu) = 0 \iff \mu$ is an exact blockmodeling. Also for $P_2$, we obtain an exact blockmodeling $\mu$ iff the deviations of all blocks are 0.
The obtained optimization problem can be solved by local optimization. Once a partitioning $\mu$ and types of connection $\pi$ are determined, we can also compute the values of connections by using the averaging rules.
4.2 Local Optimization
For solving the blockmodeling problem we use a local optimization procedure (a relocation algorithm):
Determine the initial clustering C;
repeat:
    if in the neighborhood of the current clustering C
    there exists a clustering C' such that P(C') < P(C)
    then move to clustering C'.
Table 2: Student Government Matrix


m p m m m m m m a a a
1 2 3 4 5 6 7 8 9 10 11
minister 1 1 0 1 1 0 0 1 0 0 0 0 0
p.minister 2 0 0 0 0 0 0 0 1 0 0 0
minister 2 3 1 1 0 1 0 1 1 1 0 0 0
minister 3 4 0 0 0 0 0 0 1 1 0 0 0
minister 4 5 0 1 0 1 0 1 1 1 () 0 0
minister 5 6 0 1 0 1 1 0 1 1 0 0 0
minister 6 7 0 0 ·0 1 0 0 () 1 1 0 1
minister 7 8 0 1 () 1 0 0 1 0 0 0 1
adviser 1 9 0 0 0 1 () 0 1 1 () 0 1
adviser 2 10 1 0 1 1 1 0 0 () 0 0 0
adviser 3 11 0 0 0 0 0 1 0 1 1 0 0

The neighborhood in this local optimization procedure is determined by the following two transformations: moving a unit $x_k$ from cluster $C_p$ to cluster $C_q$ (transition); and interchanging units $x_u$ and $x_v$ from different clusters $C_p$ and $C_q$ (transposition).
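The following sketch (our own, not the MODEL2 program) illustrates the relocation idea for the simplest structural case T = {nul, com} with unit weights: each block's error is its distance to the nearer ideal pattern, P1 sums these errors, and transitions are tried until no move improves; transpositions, pre-specified cell types and diagonal handling are omitted for brevity.

```python
import numpy as np

def block_error(R, rows, cols):
    """epsilon_1 with w = 1: distance to the nearer of the null/complete ideals."""
    B = R[np.ix_(rows, cols)]
    ones = B.sum()
    return min(ones, B.size - ones)

def P1(R, clusters):
    """Total blockmodeling error over all pairs of clusters."""
    return sum(block_error(R, ci, cj) for ci in clusters for cj in clusters)

def relocate(R, clusters):
    """Accept the first transition (unit move) that lowers P1, until none exists."""
    clusters = [list(c) for c in clusters]
    best = P1(R, clusters)
    while True:
        move = None
        for p in range(len(clusters)):
            if len(clusters[p]) == 1:          # keep every cluster non-empty
                continue
            for x in clusters[p]:
                for q in range(len(clusters)):
                    if q == p:
                        continue
                    cand = [list(c) for c in clusters]
                    cand[p].remove(x)
                    cand[q].append(x)
                    err = P1(R, cand)
                    if err < best:
                        move, best = cand, err
                        break
                if move:
                    break
            if move:
                break
        if move is None:
            return clusters, best
        clusters = move
```

Starting the search from several random initial clusterings and keeping the best result guards against poor local minima.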
4.3 Benefits from the Optimization Approach
There are several benefits of using the optimization approach to blockmodeling:

• ordinary / inductive blockmodeling: Given a network $\mathcal{N}$ and a set of types of connection $\mathcal{T}$, determine $\mathcal{M}$, i.e., $\mu$, $\pi$ and $\alpha$;

• evaluation of the quality of a model, comparing different models, analyzing the evolution of a network (Sampson (1968) data; Doreian and Mrvar, 1996): Given a network $\mathcal{N}$, a model $\mathcal{M}$, and a blockmodeling $\mu$, compute the corresponding criterion function;

• model fitting / deductive blockmodeling: Given a network $\mathcal{N}$, a set of types $\mathcal{T}$, and a model $\mathcal{M}$, determine $\mu$ which minimizes the criterion function;

• we can fit the network to a partial model and analyze the residual afterward;

• we can also introduce different constraints on the model, for example: units $x$ and $y$ are of the same type; or, types of units $x$ and $y$ are not connected; ...

5. Example: Student Government
The example network consists of communication interactions among twelve members and advisors of the Student Government at the University of Ljubljana (Hlebec, 1993). The results of the measurement are not real interactions among actors but their cognitions about communication interactions.
Data were collected through face to face interviews that were conducted in May 1992. Communication flow among actors was identified by the following question: Of the members and advisors of the Student Government, whom do you (most often) talk with?
The content of the communication flow was limited to the matters of the Student Government. The time frame was also defined: the question referred to a six-month period. One respondent refused to cooperate in the experiment. As he was not considered in the analysis, the network consists of eleven actors and is presented in Table 2. The computations were done with the program MODEL2, which is available for PC at the WWW address http://vlado.fmf.uni-lj.si/pub/networks/.

Figure 2: Student Government - hypothetical block model

The hypothetical structure of the Student Government network, a hierarchy where each advisor communicates at least with one minister and the ministers with the prime minister, is presented in Figure 2.
First, let us analyze the network inductively (without assuming the hypothetical hierarchical structure), assuming 3 clusters and regular equivalence. We obtained many regular solutions with 7 errors.
The basis for the deductive blockmodeling is the assumed hierarchical structure of the Student Government presented in Figure 2.
The first pre-specified blockmodel is defined by assuming 3 clusters, regular equivalence, and a pre-specified image matrix with entries taken from 0, {0, reg} and {reg}.
The result is a subset of the set of inductive solutions with 7 errors. One of the solutions is the following:

C1 = {{pm, m7, a3}, {m1, m2, m3, m4, m5, m6, a1}, {a2}}
The solution is also presented in Figure 3. The black dots on the arcs denote superfluous arcs (errors) according to the ideal solution, and the white dots the missing arcs.
π   1    2    3        d   1  2  3
1   reg  0    0        1   0  4  0
2   reg  reg  0        2   0  0  0
3   0    reg  0        3   0  3  0

Figure 3: Student Government - first deductive solution

π   1    2    3        d   1  2  3
1   0    0    0        1   0  1  0
2   reg  reg  0        2   2  0  3
3   0    reg  0        3   0  0  2

Figure 4: Student Government - second deductive solution

We can constrain our pre-specified blockmodel further by an additional constraint on units in clusters: all advisers are in cluster 3. We obtained a single solution with 8 errors:

C2 = {{pm}, {m1, m2, m3, m4, m5, m6, m7}, {a1, a2, a3}}

The solution is also presented in Figure 4.
The results of model fitting show that the hypothetical hierarchical model was obtained with a minimal increase of error compared to the inductive solution (from 7 to 8). This indicates that it represents the network structure well.

6. Further Research
There are several possible directions for further research in the field of blockmodeling. At least two questions need attention:

• to define additional block types which are more appropriate for describing specific network structures;

• to elaborate blockmodeling of valued networks.

References:
Batagelj, V. (1991): STRAN - STRucture ANalysis, Manual, Ljubljana.
Batagelj, V., Doreian, P. and Ferligoj, A. (1992): An optimizational approach to regular equivalence, Social Networks, 14, 121-135.
Batagelj, V., Ferligoj, A. and Doreian, P. (1992): Direct and indirect methods for structural equivalence, Social Networks, 14, 63-90.
Batagelj, V. (1993): Notes on block modelling. In Abstracts and Short Versions of Papers, 3rd European Conference on Social Network Analysis, München: DJI, 1-9. Extended version in print in Social Networks 1997.
Borgatti, S.P. and Everett, M.G. (1989): The class of all regular equivalences: Algebraic structure and computation, Social Networks, 11, 65-88.
Doreian, P., Batagelj, V. and Ferligoj, A. (1994): Partitioning Networks on Generalized Concepts of Equivalence, Journal of Mathematical Sociology, 19, 1, 1-27.
Doreian, P. and Mrvar, A. (1996): A Partitioning Approach to Structural Balance, Social Networks, 18, 2, 149-168.
Ferligoj, A., Batagelj, V. and Doreian, P. (1994): On Connecting Network Analysis and Cluster Analysis. In Contributions to Mathematical Psychology, Psychometrics, and Methodology (G.H. Fischer and D. Laming, Eds.), New York: Springer.
Hlebec, V. (1993): Recall versus recognition: Comparison of two alternative procedures for collecting social network data. In Developments in Statistics and Methodology (A. Ferligoj and A. Kramberger, Eds.), Metodološki zvezki 9, Ljubljana: FDV, 121-128.
Lorrain, F. and White, H.C. (1971): Structural equivalence of individuals in social networks, Journal of Mathematical Sociology, 1, 49-80.
Sampson, S.F. (1968): A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships, PhD thesis, Cornell University.
White, D.R. and Reitz, K.P. (1983): Graph and semigroup homomorphisms on networks of relations, Social Networks, 5, 193-234.
Robust impurity measures in decision trees
Tomas Aluja-Banet, Eduard Nafria
Dept. of Statistics and Operational Research
Universitat Politècnica de Catalunya
c. Pau Gargallo, 5, 08028 Barcelona, Spain
E-mail: [email protected]

Summary: Tree-based methods are a statistical procedure for automatic learning from data, their main characteristic being the simplicity of the results obtained. Their virtue is also their defect, since the tree growing process is very dependent on data; small fluctuations in data may cause a big change in the tree growing process. Our main objective was to define data diagnostics to prevent internal instability in the tree growing process before a particular split has been made. We present a general formulation for the impurity of a node, as a function of the proximity between the individuals in the node and its representative. Then we compute a stability measure of a split, and hence we can define more robust splits. Also, we have studied the theoretical complexity of this algorithm and its applicability to large data sets.

1. Introduction
The objective of tree-based methods is to automatically detect which variables serve to explain the behaviour of a response variable, whether quantitative or categorical. They can be applied in the same context as other alternative methods, such as multiple regression, discriminant analysis, logistic regression or neural networks. Their main advantage is the simplicity of the results obtained and the possibility of automatic generation of decision rules. This property links this methodology with AI techniques; thus the main usage is decision making. Their strength is also their weakness, since the tree growing process is very dependent on data; a small fluctuation in data may cause a major change in the topology of the tree. This raises the problem of the stability of a tree. We distinguish internal stability from external stability in the same sense as that stated by Greenacre (1984). External stability refers to the sensitivity of the tree to independent random samples, and can be assessed by means of a test sample or cross-validation, whereas by internal stability we mean the influence exerted by each observation in the learning sample on the formed tree. Another problem relating to the tree methodology is the computational cost, due to the recursive nature of the algorithms and the large number of possible splits, which can be very costly for large data sets. For this reason we have studied the complexity of the algorithm in order to optimise it, proposing an efficient heuristic capable of coping with large data sets, with almost linear cost depending on the number of individuals and variables and the depth of the tree.

2. Tree growing methodology
Since the pioneering work of AID, Sonquist et al. (1964), tree growing methodology has consisted of splitting each group of individuals (node) recursively into two groups, starting from the total sample n, according to a statistical criterion relating the condition for splitting to the response variable. Since then, a great deal of research has been done into the threshold-based criterion. Kass (1980) developed tree methodology for a categorical response variable using a chi-square split criterion, and Celeux et al. (1982) proposed for the latter case a split criterion based on a distance between distribution functions, while Ciampi (1991) proposed instead the use of the deviance of a generalised linear model. Although the results obtained can be satisfactory in applied research, with an error rate of the same order as alternative methods, they do not escape the criticism of the optimality and goodness of the tree obtained. The CART approach, introduced by Breiman et al. (1984), was an attempt to solve these problems. Its main innovations consisted of:
1. Unification of the case of a categorical response variable (classification trees) with that of a quantitative response variable (regression trees) within a similar framework.
2. Use of an impurity index to measure the heterogeneity of a node.
3. Pruning from a maximal tree instead of using a stop criterion.
4. Giving right honest estimates of the misclassification error.

The impurity indices measure the heterogeneity of a node:
$$i(t) = F(p(j \mid t)) \quad \text{for a classification tree}$$
$$i(t) = F(\{y_i \mid i \in t\}) \quad \text{for a regression tree}$$
where $p(j \mid t)$ is the probability of class $j$ in node $t$ and $y_i$ is the value of the response for an individual in node $t$.
Impurity indices should have a maximum value for classes with equal probability, a value of 0 for a pure node, and should be a decreasing function through the splitting process:
$$i(t) \geq \alpha\, i(t_\ell) + (1 - \alpha)\, i(t_r), \qquad 0 \leq \alpha < 1.$$
The most usual impurity indices are:
$$i(t) = \sum_{j \neq k} p(j \mid t)\, p(k \mid t) \quad \text{(Gini index)}$$
$$i(t) = 1 - \max_j p(j \mid t) \quad \text{(misclassification index)}$$
$$i(t) = -\sum_j p(j \mid t) \log(p(j \mid t)) \quad \text{(entropy index)}$$
And for a regression tree:
$$i(t) = \frac{1}{n_t} \sum_{i \in t} (y_i - \bar{y}_t)^2 \quad \text{(variance index)}$$
$$i(t) = \frac{1}{n_t} \sum_{i \in t} |y_i - \mathrm{med}_t(y)| \quad \text{(absolute deviation index)}$$
Then, the split criterion consists of selecting the split which maximises the weighted reduction of impurity between the parent node and its offspring (left and right):
$$\Delta i(t) = i(t) - \frac{n_{t_\ell}}{n_t}\, i(t_\ell) - \frac{n_{t_r}}{n_t}\, i(t_r). \qquad (2)$$

The problem of defining a right-sized tree is solved, instead of using a threshold, by growing a maximal tree (a tree with every terminal node pure or with, say, five individuals or fewer) and, from that tree, defining a nested sequence of optimum subtrees by successively removing non-informative branches of the large tree, minimising an error-complexity measure (pruning step). Then, the problem is transformed into choosing the subtree with minimum misclassification error. This is done by using a test sample or a tenfold cross-validation to obtain honest estimates of the misclassification error.
Although this approach solves many of the shortcomings for which its ancestors were criticised, the tree growing process is still very dependent on data. It is easy to see that every split depends on the presence of some outliers in the actual node, leading to unstable trees. Moreover, the choice of an impurity measure is closely related to the stability of the tree. This is particularly true using, for example, the Gini index, which attempts to favour small but very pure nodes rather than equal-sized but less pure ones. This is why the pruning process very often results in a severe reduction. In order to tackle this problem, we have studied the internal stability of a partition and then defined more robust impurity indices.

3. General formulation of impurity
Let us suppose that we have a node $t$ with $n_t$ individuals to classify according to $k$ classes of a response variable², and let $w_{it}$ be the weight of one individual in node $t$. To generalise the notion of impurity, we use a geometric approach where each class defines a point $(1, 0, 0, \ldots) \in R^k$. Then, we define the impurity as a function of the distances between each individual in the node and the representative of the node $m_t$, defined as the point of the convex polygon of $R^k$ which minimises $i(t)$ (Figure 1):
$$i(t) = \frac{\sum_{i \in t} w_{it}\, \delta(i, m_t)}{\sum_{i \in t} w_{it}} \qquad (3)$$
where $\delta(i, m_t)$ is the distance between an individual $i$ and $m_t$ (obviously, all individuals in the same response class $j$ share the same distance). For a categorical response variable with uniform weights this formula reduces to:
$$i(t) = \frac{\sum_{j=1}^{k} n_{jt}\, \delta(j, m_t)}{n_t} \qquad (4)$$

Fig. 1 Convex polygon of classes of a node with its representative


² We first present the case of a categorical response variable.
For a regression tree, the geometrical interpretation is easier, since the values of the response variable are represented on the real line, $m_t$ being the point on the real line representing the node $t$ which minimises its impurity.
This formulation being very general, we can choose the distances $\delta(j, m_t)$ in a very general sense. In particular, we can use the $L_2$ norm. Then, it is easy to show that for a classification tree the representative of the node coincides with the multinomial vector of probabilities of classes, and the impurity index reduces to the well known Gini index; and that for a regression tree, the representative of the node is the mean of the response in the node and the impurity is the variance. On the other hand, in the case of an $L_1$ norm, for a classification tree the representative coincides with the class of maximum probability and the index reduces to twice the misclassification index, and for a regression tree the representative is the median of the response in the node and the impurity is the absolute deviation.
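A small numerical sketch (ours, not from the paper) makes formula (3) concrete for a regression node: the representative m_t is obtained by direct minimisation, so the L2 distance recovers the mean and variance while the L1 distance recovers the median and absolute deviation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def impurity(y, w, delta):
    """i(t) of formula (3) and its representative m_t for distance function delta."""
    obj = lambda m: np.sum(w * delta(y, m)) / np.sum(w)
    res = minimize_scalar(obj, bounds=(y.min(), y.max()), method="bounded")
    return res.fun, res.x          # (impurity, representative)

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])   # node with one outlying value
w = np.ones_like(y)
print(impurity(y, w, lambda y, m: (y - m) ** 2))   # ~ (variance, mean)
print(impurity(y, w, lambda y, m: np.abs(y - m)))  # ~ (abs. deviation, median)
```

On this node the L2 representative is pulled toward the outlier while the L1 representative stays near the bulk of the data, which is exactly the robustness argument developed in section 4.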
Furthermore, this formulation can allow for different misclassification costs. Let $\mathbf{C}$ be the matrix of misclassification costs, and $c_{ij}$ the cost of misclassifying an individual in class $j$ when it belongs to class $i$. Then $C_{\cdot i}$ will represent the overall cost of misclassifying an individual of class $i$:
$$C_{\cdot i} = \sum_{j=1}^{k} c_{ij}. \qquad (5)$$
We can see that, in fact, introducing the misclassification costs entails overweighting those response classes for which it is most dangerous to make a wrong assignment.
In any case, the reduction of impurity can be expressed as a function of the distances of the individuals to the representative of the parent node and the distance to the corresponding successor:
$$\Delta i(t) = \frac{\sum_{i \in t} w_{it}\, \delta(i, m_t) - \sum_{i \in t_\ell} w_{it_\ell}\, \delta(i, m_{t_\ell}) - \sum_{i \in t_r} w_{it_r}\, \delta(i, m_{t_r})}{\sum_{i \in t} w_{it}} \qquad (6)$$
where $t_\ell$ and $t_r$ represent the left and right child nodes of node $t$.

4. Stability analysis
Thus, from Formula 6, it is easy to calculate the contribution of any individual to the reduction of impurity. It is simply the difference in the distances of this individual to the representative of the parent node and to the representative of its corresponding child node:
$$c_i = w_{it}\,(\delta(i, m_t) - \delta(i, m_{t'})) \qquad (7)$$
where $t'$ denotes the child node to which individual $i$ is assigned. Notice that the contribution to the reduction of impurity can be positive or negative, although on average this contribution coincides with the overall impurity reduction:
$$\Delta i(t) = \frac{\sum_{i \in t} c_i}{\sum_{i \in t} w_{it}}. \qquad (8)$$
Then, the ratio of each contribution to the average reduction of impurity, $c_i / \Delta i(t)$, is an easy way to diagnose individuals with a strong influence on the split. Moreover, the distance between the representatives of the child nodes is an indicator of the stability of the present split.

It is clear that in classification trees with the $L_2$ metric, due to the quadratic form of the impurity index, instabilities occur when the splitting process leads to nodes with very few members of at least one response class $j$, whereas the most stable case is when the probabilities of the classes are similar, that is, the representative of the node $m_t$ is close to the centre of the convex polygon. In the $L_1$ metric, the locus of the representative of the node coincides with the class with maximum probability, thus the distance of the remaining classes to the representative is equal to 2, that is, each individual of the latter classes has the same influence.
In regression trees, instability may occur when dealing with nodes with some outlying values; a split attempting to accommodate these outliers will reduce the impurity significantly and hence the distance between the representatives of both child nodes will be large. Of course, one way of achieving a robust split that is insensitive to the effect of outliers would be by using the $L_1$ norm, that is, using the absolute deviation as an impurity measure.

4.1 Function of the impurity reduction
For each predictor variable we can define a function of the impurity reduction relative to the impurity in the parent node, defined over all possible splits of this variable, as follows:
$$f(u) = 1 - \frac{\sum_{i \in t_\ell} w_{it_\ell}\, \delta(i, m_{t_\ell}) + \sum_{i \in t_r} w_{it_r}\, \delta(i, m_{t_r})}{\sum_{i \in t} w_{it}\, \delta(i, m_t)} \qquad (9)$$
where $u$ denotes a particular split of the variable. This function can be represented graphically provided that there is an ordering of the splits of the predictor variable. This is the case for continuous and ordinal predictors, and also for any type of predictor when the response variable is continuous or binary³.
Then, a sharply peaked function is a sign of an unstable split, since a small change in the level of the predictor will imply a major change in the impurity reduction, whereas a smooth function would indicate more stable splits. Also, it is useful to plot the function of impurity reduction together with the distribution functions of the response classes in the case of a classification tree. In Figure 2 we represent two functions of impurity reduction (in thick) with the empirical distribution functions of the response classes (in thin). The first illustrates an unstable case, whereas the second corresponds to a more stable case, although it contains one inversion in the empirical distribution functions.

Fig. 2 Two examples of impurity reduction functions


³ Then, an ordering of splits is induced by the response variable.

5. A split criterion based on a distance between distribution functions
For the case of a categorical response variable, it is natural to compute for every node the empirical distribution function of every response class, $F_j$, and compare them with the average distribution function $\bar{F}$ in node $t$. Then, the split point can be defined by a distance between these functions:
$$\max_u \; d_u = \sum_{j=1}^{k} |F_j(u) - \bar{F}(u)| \qquad (10)$$
where $F_j(u)$ and $\bar{F}(u)$ are the empirical distribution function of class $j$ and the average distribution function of node $t$ evaluated at split $u$. To use this split criterion there should be an ordering among the possible splits of the predictor variable.
It is easy to see that this distance coincides with the Smirnov distance for the case of two classes (Figure 3):

Fig. 3 Distance between the distribution functions of classes

$$\max_u \left[ |F_1(u) - \bar{F}(u)| + |F_2(u) - \bar{F}(u)| \right] = \max_u |F_1(u) - F_2(u)|. \qquad (11)$$
Moreover, for this case, it also coincides with the misclassification index with uniform weighting of the classes:
$$\max_u \Delta i(t) \iff \min_u \left[ n_\ell\, i(t_\ell) + n_r\, i(t_r) \right] = \min_u \left[ n_\ell \Big(1 - \frac{n_{1\ell}}{n_\ell}\Big) + n_r \Big(1 - \frac{n_{2r}}{n_r}\Big) \right] = \max_u \left[ n_{1\ell} + (n_2 - n_{2\ell}) \right] = \max_u\, (F_1(u) - F_2(u)). \qquad (12)$$
Furthermore, it coincides with the Celeux-Lechevallier (1982) index with uniform weighting of the response classes. In fact, the difference with this latter index consists in the different weighting of the response classes.
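The following sketch (our own, with the pooled empirical distribution function standing in for the average function F̄) evaluates criterion (10) over the ordered candidate cuts of a predictor and returns the maximising split:

```python
import numpy as np

def best_split_ecdf(x, y):
    """Return the cut value u maximising d_u = sum_j |F_j(u) - F(u)|."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes = np.unique(y)
    best_u, best_d = None, -1.0
    for cut in np.unique(x)[:-1]:              # ordered candidate split points
        F_all = np.mean(x <= cut)              # pooled ECDF at the cut
        d = sum(abs(np.mean(x[y == c] <= cut) - F_all) for c in classes)
        if d > best_d:
            best_u, best_d = cut, d
    return best_u, best_d

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
print(best_split_ecdf(x, y))                   # cut lands where the classes separate
```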

6. Complexity
The search for an optimal tree is an NP-complete problem. Thus, we should use efficient heuristics. The most used heuristic consists of finding at each step the best split among the whole set of binary partitions. This solution leads to fairly good results obtained in a hierarchical fashion. However, the computational cost of this heuristic is very high, and it requires an efficient algorithm to guarantee a reasonable speed.
We have designed an algorithm of almost linear cost in the parameters of a tree. Let us first define the parameters of a tree. These are the number of individuals $n$, the number of total splits $s$, and the maximum depth of the tree $d$. Obviously, the total number of splits depends on the number of variables and their type. See Table 1 for the number of splits according to the type of variable. Let $p$ be the total number of variables, which can be split according to type: $p = p_b + p_o + p_n + p_c$. For each variable we have $s_j$ splits with $\sum_{j=1}^{p} s_j = s$, and for a maximum depth of $d$ we have $l \leq 2^d - 1$ nodes.

Type of variable                      splits
Binary                                1
Ordinal with k modalities             k - 1
Nominal with k modalities             2^(k-1) - 1
Continuous with m different values    m - 1

Table 1. Number of splits according to the type of variables
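As a direct transcription of Table 1 (our own helper with hypothetical names), the number of candidate binary splits per variable type is:

```python
def n_splits(kind, k=None, m=None):
    """Candidate binary splits for one predictor, per Table 1."""
    return {"binary": 1,
            "ordinal": k - 1,
            "nominal": 2 ** (k - 1) - 1,
            "continuous": m - 1}[kind]

# e.g. a nominal predictor with 8 modalities already yields 127 candidate splits
print(n_splits("nominal", k=8))
```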

The complexity can be decomposed into the following steps:

1. Cost of a split for a given variable: $O(n_t) + C$. This cost depends only on the number of individuals in the node plus a constant.
2. Cost of all splits for a given variable: $\sum_{u=1}^{s_j} O(n_t) = O(n_t \cdot s_j)$. This cost depends on the number of splits of the variable, which according to its type can be optimised to the following results:
   $O(n_t)$ (for a binary variable)
   $O(n_t) + O(k)$ (for an ordinal variable)
   $O(n_t) + O(2^{k-1})$ (for a nominal variable)
   $O(n_t) + O(n \cdot \log(n))$ (for a continuous variable)
3. Cost of all splits for a node: $\sum_{j=1}^{p} O(n_t \cdot s_j) = O(n_t \cdot s)$. This is, of course, simply the sum of the costs for all the active variables in a node, which, according to the above results, reduces in most cases to $O(n_t \cdot p)$.
4. Cost of all splits in every node: $\sum_{t=1}^{2^d - 1} O(n_t \cdot p) = O(p) \sum_{\mathrm{levels}} O(n) = O(p \cdot n \cdot d)$. Moreover, we have the cost of assigning every individual to its node: $O(l \cdot n)$.

Thus, the total cost can be written in the following way:
$$O(p_c \cdot n \cdot \log(n)) + O(l \cdot n) + O(p \cdot n \cdot d) \qquad (13)$$
A critical point is the total number of splits: when $s \approx n$, then $O(s \cdot n \cdot d) \approx O(n^2 \cdot d)$. This is particularly dangerous for a nominal response variable with a large number of classes $k$; when $k \approx n_t$ the cost becomes quadratic or even exponential. See Mola et al. (1992) for the treatment of a multiple-class response variable. Also, when $l$ becomes large the cost increases quadratically.
This algorithm, named SAAD (Segmentació Automàtica per Arbres de Decisió), runs on a PC platform in a Windows environment and is able to cope with problems with up to 100,000 individuals and 100 variables. Here we present the time in seconds for two problems, one corresponding to a classification tree into 4 classes, with 9 explanatory variables (6 of them categorical with a maximum of 8 categories each and 3 continuous), and the other being a regression tree with 18 explanatory variables (half categorical and the other half continuous). For each problem we have varied the number of individuals considered and the depth of the tree produced, obtaining the results shown in Table 2. As can be seen, linearity is preserved approximately up to a depth of 8.

individuals   depth   classification tree   regression tree
  1078          1              5"                  9"
                3              7"                 13"
                5              9"                 14"
                8             20"                 23"
               13             46"                 60"
  9862          1             21"                 32"
                3             30"                 50"
                5             41"                 62"
                8             91"                111"
               13            391"                381"
 37575          1            108"                155"
                3            126"                269"
                5            159"                352"
                8            376"                641"
               13           1544"               2847"

Table 2. Execution time in seconds for the SAAD segmentation algorithm on an HP 486/66

References

Aluja, T. and Nafria, E. (1995). Generalised impurity measures and data diagnostics in decision trees. Visualising Categorical Data, Cologne.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.

Celeux, G. and Lechevallier, Y. (1982). Méthodes de Segmentation non Paramétriques. Revue de Statistique Appliquée, XXX (4), 39-53.

Ciampi, A. (1991). Generalized Regression Trees. Computational Statistics and Data Analysis, 12, 57-78. North Holland.

Greenacre, M. (1984). Theory and Application of Correspondence Analysis. Academic Press.

Gueguen, A. and Nakache, J.P. (1988). Méthode de discrimination basée sur la construction d'un arbre de décision binaire. Revue de Statistique Appliquée, XXXVI (1), 19-38.

Kass, G.V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127.

Mola, F. and Siciliano, R. (1992). A two-stage predictive splitting algorithm in binary segmentation. Computational Statistics, vol. 1, Y. Dodge and J. Whittaker (eds.), Physica Verlag.

Sonquist, J.A. and Morgan, J.N. (1964). The Detection of Interaction Effects. Ann Arbor: Institute for Social Research, University of Michigan.
Induction of Decision Trees
Based on the Rough Set Theory
Tu Bao Ho, Trong Dung Nguyen, Masayuki Kimura
Japan Advanced Institute of Science and Technology, Hokuriku
Tatsunokuchi, Ishikawa, 923-12 JAPAN

Summary: This paper is aimed at the two following objectives. The first is the introduction of a new measure (R-measure) of dependency between groups of attributes in a data set, inspired by the notion of attribute dependency in the rough set theory. The second is the application of this measure to the problem of attribute selection in decision tree induction, and an experimental comparative evaluation of decision tree systems using the R-measure and other attribute selection measures, most of them widely used in machine learning: gain-ratio, Gini-index, d_N distance, relevance, χ².

1. Introduction
The goal of inductive classification learning is to learn a classifier from a training set that correctly predicts classes of unseen instances. Among approaches to inductive classification learning, decision trees are certainly the most active and applicable one. During the last decade, many top-down induction of decision tree (TDIDT) systems have been developed, most notably CART of Breiman et al. (1984), ID3 of Quinlan (1986) and its successor C4.5 (1993). There are two crucial problems in TDIDT systems: variable selection (choosing the "best" attribute to split a decision node) and pruning (avoiding overfitting). Most heuristics for estimating multi-valued attributes are information-based measures, such as Quinlan's information gain or gain-ratio (1986, 1993), Mantaras' normalized information gain (1991), etc., and statistics-based measures, such as Breiman's Gini-index (1984), χ² (Liu and White, 1994), Baim's relevance (1989), etc. An analysis of eleven measures for estimating the quality of multi-valued attributes was given by Kononenko (1995).
Rough set theory, introduced by Pawlak in the early 1980s (Pawlak, 1991), is a mathematical tool to deal with vagueness and uncertainty, in particular for the approximation of classifications. Although the rough set methodology of approximation sets has been successful in many real-life applications, there are still several theoretical problems to solve. For example, we will show that one of its fundamental notions - the measure of dependency of an attribute set Q on an attribute set P - is not always robust with noisy data and not sensitive enough to "partial" dependency between Q and P. Inspired by this measure of dependency of attributes, we introduce in this paper a new measure of attribute dependency, called the R-measure. We show experimentally that the R-measure can be applied with success to the problem of attribute selection in decision tree induction.

2. A measure for attribute dependency

2.1 Attribute dependency measure in rough set theory
The theory of rough sets was recognized as a fruitful theory for discovering relationships in data. Though closely related to statistics, its approach is entirely different: rough sets are based on equivalence relations describing partitions made of classes of indiscernible objects instead of employing probability to express data vagueness.
The starting point of the rough set theory is the assumption that our "view" on elements of the object set $O$ depends on an equivalence relation $E \subseteq O \times O$. Two objects $o_1, o_2 \in O$ are said to be indiscernible in $E$ if $o_1 E o_2$. The lower and upper approximations of any $X \subseteq O$ consist of all objects which surely and possibly belong to $X$, respectively, regarding the relation $E$. The lower approximation $E_*(X)$ and upper approximation $E^*(X)$ are defined as
$$E_*(X) = \{o \in O : [o]_E \subseteq X\} \qquad (1)$$
$$E^*(X) = \{o \in O : [o]_E \cap X \neq \emptyset\} \qquad (2)$$
where $[o]_E$ denotes the equivalence class of objects indiscernible with $o$ in the equivalence relation $E$.
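As a small illustration (ours, not from the paper), definitions (1) and (2) can be computed directly from the partition that E induces on O; the example uses the Temperature classes and the Flu = yes set of Table 1 below:

```python
def lower_upper(partition, X):
    """Lower and upper approximations of X w.r.t. the partition {[o]_E}."""
    X = set(X)
    lower = {o for block in partition for o in block if set(block) <= X}
    upper = {o for block in partition for o in block if set(block) & X}
    return lower, upper

# objects e1..e8 of Table 1, partitioned by Temperature
by_temp = [{"e1", "e4"}, {"e2", "e5", "e7"}, {"e3", "e6", "e8"}]
flu_yes = {"e2", "e3", "e6", "e8"}
print(lower_upper(by_temp, flu_yes))
# lower = {e3, e6, e8} (the very_high class); upper adds the whole high class
```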
Pawlak (1991) has pointed out that one of the most important and fundamental notions of the rough sets philosophy is the need to discover redundancy and dependencies between attributes. A key concept in the rough set theory is the measure of dependency of a set of attributes $Q$ on a set of attributes $P$, denoted by $\mu_P(Q)$ ($0 \leq \mu_P(Q) \leq 1$), defined as
$$\mu_P(Q) = \frac{\mathrm{card}\left(\bigcup_{[o]_Q} P_*([o]_Q)\right)}{\mathrm{card}(O)}. \qquad (3)$$
If $\mu_P(Q) = 1$ then $Q$ totally depends on $P$; if $0 < \mu_P(Q) < 1$ then $Q$ partially depends on $P$; if $\mu_P(Q) = 0$ then $Q$ is independent of $P$. The measure of dependency is a basic notion in the rough set theory, as many other notions are defined based on it, such as reducts and the significance of attributes. The formula (3) can be rewritten as
$$\mu_P(Q) = \frac{\mathrm{card}\left(\bigcup_{[o]_Q} P_*([o]_Q)\right)}{\mathrm{card}(O)} = \frac{\mathrm{card}(\{o \in O : [o]_P \subseteq [o]_Q\})}{\mathrm{card}(O)}. \qquad (4)$$
An interpretation of the formula (4) is given through Table 1, slightly modified from the example of Pawlak et al. (1995), consisting of eight objects described by two descriptive attributes, Temperature and Headache, and the class attribute Flu.

Table 1. Information table

      Temperature   Headache   Flu
e1    normal        yes        no
e2    high          yes        yes
e3    very_high     yes        yes
e4    normal        no         no
e5    high          no         no
e6    very_high     no         yes
e7    high          no         no
e8    very_high     yes        yes

Consider how the attribute Flu depends on the attribute Temperature. We express the causal relation between these attributes in the form of usual rules:
If Temperature = normal then Flu = no
If Temperature = very_high then Flu = yes
The number of objects satisfying these rules is 5 out of 8. In other words, the proportion of objects whose values on Flu are correctly predicted by values of Temperature is 5/8. This argument is analogous to the definition of the degree of dependency, where each rule corresponds to an equivalence class w.r.t. P which is included in an equivalence class w.r.t. Q.
2.2 R-measure
From the above argument we can see that the formula (3) takes into account only precise rules. However, in real-world data, rules with some probability may be found more often. For example, considering the dependency of Flu on Headache, the following probabilistic predictive rules can be seen from Table 1:
If Headache = yes then Flu = yes (3/4)
If Headache = no then Flu = no (3/4)
These rules show that Flu depends somehow on Headache, but the formula (3), by its value 0 in this case, says that Flu is independent of Headache. Taking into account only precise rules causes some limitations of the above degree of dependency: it is not robust with noisy data and not sensitive enough to "partial" dependency. Let us consider probabilistic rules further. Suppose that the value on Headache of a new object is known, and an agent wants to predict the value on Flu of this object. For example, if Headache = yes, then there are two possibilities: Flu = yes (3/4), or Flu = no (1/4). To minimize the probability of error, Flu = yes is certainly chosen as the value with the maximum likelihood of occurrence among all possibilities. Due to the risk of Flu = no, this prediction is uncertain and has an estimated accuracy of 3/4. Similarly, the value Flu = no will be predicted if Headache = no, with an estimated accuracy of also 3/4. Denoting by X the event that the prediction of the agent is true, we have
P(X) = P(Headache = yes) × P(X | Headache = yes) + P(Headache = no) × P(X | Headache = no) = 1/2 × 3/4 + 1/2 × 3/4 = 3/4.
This value can be interpreted as the degree of dependency of Flu on Headache established by the above argument. This argument can be generalized and formulated as a measure of the degree of dependency of an attribute set Q on an attribute set P:
$$\mu'_P(Q) = \frac{1}{\mathrm{card}(O)} \sum_{[o]_P} \max_{[o]_Q} \mathrm{card}([o]_Q \cap [o]_P). \qquad (5)$$
The degree of dependency of Flu on Headache calculated by (5) is 3/4. The main difference between $\mu_P(Q)$ and $\mu'_P(Q)$ is that the latter measures the dependency of $Q$ on $P$ by maximizing the predicted membership of an instance in the family of equivalence classes generated by $Q$, given its membership in the family of equivalence classes generated by $P$. We have obtained the following property (Ho and Nguyen, 1997).

Theorem. For every set $P$ and $Q$ we have
$$\frac{\max_{[o]_Q} \mathrm{card}([o]_Q)}{\mathrm{card}(O)} \leq \mu'_P(Q) \leq 1. \qquad (6)$$
From this theorem we can define that $Q$ totally depends on $P$ iff $\mu'_P(Q) = 1$; $Q$ partially depends on $P$ iff $\max_{[o]_Q} \mathrm{card}([o]_Q)/\mathrm{card}(O) < \mu'_P(Q) < 1$; $Q$ is independent of $P$ iff $\mu'_P(Q) = \max_{[o]_Q} \mathrm{card}([o]_Q)/\mathrm{card}(O)$. In practice, to emphasize rules that have higher generality, we use the following formula, and call it the R-measure:
$$\bar{\mu}_P(Q) = \frac{1}{\mathrm{card}(O)} \sum_{[o]_P} \max_{[o]_Q} \frac{\mathrm{card}([o]_Q \cap [o]_P)^2}{\mathrm{card}([o]_P)}. \qquad (7)$$
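The following sketch (ours) computes μ'_P(Q) of (5) and, with the squared option, the R-measure (7) for single attributes, reproducing the Flu example:

```python
from collections import Counter, defaultdict

def dependency(P_vals, Q_vals, squared=False):
    """mu'_P(Q) of (5), or the R-measure (7) when squared=True."""
    n = len(P_vals)
    joint = defaultdict(Counter)
    for p, q in zip(P_vals, Q_vals):
        joint[p][q] += 1                    # card([o]_Q ∩ [o]_P) per P-block
    total = 0.0
    for counts in joint.values():
        block = sum(counts.values())        # card([o]_P)
        m = max(counts.values())
        total += m * m / block if squared else m
    return total / n

temperature = ["normal", "high", "very_high", "normal",
               "high", "very_high", "high", "very_high"]
headache = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
flu = ["no", "yes", "yes", "no", "no", "yes", "no", "yes"]
print(dependency(headache, flu))      # 3/4, as in the text
print(dependency(temperature, flu))   # 7/8 by (5), versus 5/8 by the crisp (3)
```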

3. Decision tree induction and selection measures
In this paper the R-measure is applied to the attribute selection problem in decision tree induction. The R-measure is compared experimentally with five different attribute selection measures by a carefully designed benchmark. In order to facilitate a common understanding of the different attribute selection measures, we use the notations presented by Liu and White (1994) and Kononenko (1995). Suppose that we are dealing with a problem of learning a classifier with $k$ classes $C_i$ $(i = 1, \ldots, k)$ from a set $O$ of objects described by a set of attributes $A$. We assume that all attributes are discrete, each with a finite number of possible values. Let $n_{..}$ denote the total number of objects in $O$, $n_{i.}$ the number of objects from class $C_i$, $n_{.j}$ the number of objects with the $j$-th value of the given attribute $A$, and $n_{ij}$ the number of objects from class $C_i$ and with the $j$-th value of $A$. Let further
$$p_{ij} = \frac{n_{ij}}{n_{..}}, \quad p_{i.} = \frac{n_{i.}}{n_{..}}, \quad p_{.j} = \frac{n_{.j}}{n_{..}}, \quad p_{i|j} = \frac{n_{ij}}{n_{.j}} \qquad (8)$$

denote the approximations of the probabilities from the object set $O$. Let
$$H_C = -\sum_i p_{i.} \log p_{i.}, \qquad H_A = -\sum_j p_{.j} \log p_{.j}, \qquad (9)$$
$$H_{CA} = -\sum_i \sum_j p_{ij} \log p_{ij}, \qquad H_{C|A} = -\sum_j p_{.j} \sum_i p_{i|j} \log p_{i|j} \qquad (10)$$
be the entropy of the classes, of the values of the given attribute, of the joint example class-attribute value, and of the class given the value of the attribute, respectively (all logarithms introduced here are of base two).
The well-known decision tree algorithm C4.5 uses the gain-ratio (Quinlan, 1993)
$$GainR = \frac{H_C + H_A - H_{CA}}{H_A}. \qquad (11)$$
The measure $d_N$ (Lopez de Mantaras, 1991) is based on the definition of a distance between two partitions and can be rewritten as in (Kononenko, 1995)
$$d_N = 1 - \frac{H_C + H_A - H_{CA}}{H_{CA}}. \qquad (12)$$
The author has reported experiments with the two data sets "hepatitis" and "breast cancer". The Gini-index used in the decision tree learning algorithm CART (Breiman et al., 1984) can be rewritten as
$$Gini = \sum_j p_{.j} \sum_i p_{i|j}^2 - \sum_i p_{i.}^2. \qquad (13)$$
Baim (1988) introduced a selection measure called relevance and showed an experiment in craniostenosis syndrome identification:
$$Relev = 1 - \frac{1}{k - 1} \sum_j \sum_{i \neq i_m(j)} \frac{n_{ij}}{n_{i.}}, \qquad i_m(j) = \arg\max_i \{n_{ij}\}. \qquad (14)$$

Another statistics-based measure of interest is $\chi^2$:
$$\chi^2 = \sum_i \sum_j \frac{(n_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{n_{.j}\, n_{i.}}{n_{..}}. \qquad (15)$$
Suppose that $P = \{A_1, A_2, \ldots, A_p\}$ and $Q = \{B_1, B_2, \ldots, B_q\}$. Let $n_{.j_1 j_2 \ldots j_p}$ denote the number of instances with the $j_1$-th, $j_2$-th, ..., $j_p$-th values of attributes $A_1, A_2, \ldots, A_p$, respectively, and $n_{i_1 i_2 \ldots i_q | j_1 j_2 \ldots j_p}$ the number of instances with the $i_1$-th, $i_2$-th, ..., $i_q$-th values of attributes $B_1, B_2, \ldots, B_q$ and with the $j_1$-th, $j_2$-th, ..., $j_p$-th values of $A_1, A_2, \ldots, A_p$, simultaneously. We also denote by $p_{.j_1 j_2 \ldots j_p}$ and $p_{i_1 i_2 \ldots i_q | j_1 j_2 \ldots j_p}$ the approximations of these probabilities from the training set. We can rewrite (7) as
$$\bar{\mu}_P(Q) = \sum_{j_1, \ldots, j_p} p_{.j_1 \ldots j_p} \max_{i_1, \ldots, i_q} p^2_{i_1 \ldots i_q | j_1 \ldots j_p}. \qquad (16)$$
In the special case of (7) when $Q$ stands for the class attribute and $P$ stands for the given attribute $A$, $\bar{\mu}_P(Q)$ can be used to measure the dependency of the class attribute on $A$, and can be written as
$$\bar{\mu}_A = \sum_j p_{.j} \max_i \{p_{i|j}^2\}. \qquad (17)$$
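For concreteness, a compact sketch (ours) computes the single-attribute measures (11), (13) and (17) from the class-by-value count matrix n_ij; the example table is Flu against Temperature from section 2:

```python
import numpy as np

def measures(n):                     # n[i, j]: class i, attribute value j
    n = np.asarray(n, dtype=float)
    N = n.sum()
    pij, pi, pj = n / N, n.sum(axis=1) / N, n.sum(axis=0) / N
    p_i_given_j = n / n.sum(axis=0, keepdims=True)
    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    gain_ratio = (H(pi) + H(pj) - H(pij.ravel())) / H(pj)                 # (11)
    gini = (pj * (p_i_given_j ** 2).sum(axis=0)).sum() - (pi ** 2).sum()  # (13)
    r_measure = (pj * (p_i_given_j ** 2).max(axis=0)).sum()               # (17)
    return gain_ratio, gini, r_measure

# Flu (rows: no, yes) against Temperature (columns: normal, high, very_high)
print(measures([[2, 2, 0],
                [0, 1, 3]]))
```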

4. An experimental comparative study
The studies of Mingers (1989) and White and Liu (1994) have shown how difficult it is to be thorough and fair when evaluating the accuracy of learning methods. Generally, it is not adequate to evaluate methods either with a small number of data sets or with only a single train-and-test experiment. In order to attain a reliable estimation of the mentioned attribute selection measures, we designed experiments as follows:

1. Implement programs for the different measures in a unique scheme;
2. Use a large number of public data sets;
3. Use stratified cross-validation for the evaluation.
The six mentioned attribute selection measures are implemented in six systems based on the common scheme of the Concept Learning System (CLS) for decision tree induction. This scheme can be briefly described in the following steps:

1. Choose the "best" attribute by the given selection measure.
2. Extend the tree by each attribute value.
3. Sort training examples to leaf nodes.
4. If training examples are unambiguously classified Then Stop Else Goto step 1.

In this paper, to be independent of pruning techniques, we considered the results of these programs on unpruned trees. The pruning technique of cost complexity (Breiman, 1984) is used and its experimental results are reported in (Ho and Nguyen, 1997). In this work we use the entropy-based technique (Dougherty, 1995) for discretization of numeric attributes.
Twelve data sets are taken from the UCI repository of machine learning databases. They include Wisconsin Breast Cancer, Congressional Votes, Audiology, Hayes-Roth, Heart Disease, Image Segmentation, Ionosphere, Lung Cancer, Promoters and Splice-Junction of sequences in molecular biology, Solar Flare, Soybean Disease, Satellite Image, Tic-Tac-Toe endgame, and King-rook-vs-King-pawn. Their properties are listed in Table 2 with the following abbreviations: att, inst, and class stand for the number of attributes, instances, and classes; type stands for the type of attributes (symbolic or a mixture of symbolic and numeric).

Table 2. Properties of chosen data sets


Data sets att inst class type
Vote 16 435 2 sym
Breast cancer 9 699 2 sym
Soybean-small 35 46 4 sym
Tic-tac-toe 9 958 2 sym
Lung cancer 56 32 3 sym
Hayes-roth 5 160 3 sym
Kr-vs-kp 36 3196 2 sym
Audiology 69 226 24 sym
Splice 60 3190 3 sym
Promoters 58 106 2 sym
Heart-disease 13 303 2 mix
Solar Flare 13 1389 7 mix

Traditionally, in machine learning data are usually divided into two sets of training
and testing data. Training data are used to produce the classifier by a method and
testing data are used to estimate the prediction accuracy of the method. A single
train-and-test experiment is often used in machine learning for estimating perfor-
mance of learning systems.
It is recognized that multiple train-and-test experiments can do better than a single
train-and-test experiment. Recent work has shown that cross-validation is suitable
for accuracy estimation, particularly 10-fold stratified cross-validation (Kohavi,
1995). However, cross-validation is still not widely used in machine learning as it is
computationally expensive.

Table 3. Error rates estimated for different measures


data set        GainR   Gini    χ²      dN      Relev   R-m
Vote             5.5     6.2     6.2     7.6     6.4     6.4
Breast cancer    6.7     5.9     6.0     7.7     6.6     6.1
Soybean-small    2.2     2.2     0.0     2.2     2.2     2.2
Tic-tac-toe     12.0    13.8    13.6    13.7    11.9    14.1
Lung cancer     23.3    10.0    10.0    36.6    20.0    13.3
Hayes-roth      25.0    25.0    26.3    24.4    25.6    23.1
Kr-vs-kp         0.5     0.3     0.3     3.2     0.3     0.5
Audiology       23.4    22.5    21.7    31.7    40.0    21.6
Splice           8.0     1.9     8.4    10.4    10.3     8.4
Heart-disease   26.1    29.1    27.7    31.7    30.1    27.8
Flare           28.4    28.7    28.1    29.8    29.3    21.5
Promoters       11.5    14.5    16.0    29.5    17.2    14.5
Average         13.3    12.8    12.1    17.6    15.7    12.8

To obtain a fair estimation of the performance of the mentioned selection measures, we
carried out a 10-fold cross-validation in our experiments. The data set Θ is randomly
divided into 10 mutually exclusive subsets Θ_1, Θ_2, ..., Θ_{10} of approximately equal size.
Each attribute selection measure is tested 10 times. Each time, for k ∈ {1, 2, ..., 10},
a decision tree is generated on Θ \ Θ_k and tested on Θ_k. A k-fold stratified cross-validation
is a k-fold cross-validation in which the folds are stratified so that they contain
approximately the same proportions of labels as the original data set. The error rate
of each measure is the average of its error rates over the 10 runs.
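A bare-bones version of this scheme can be sketched as follows (ours; fit and
predict are hypothetical callables standing for any tree-growing routine):

import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Indices of k folds with approximately the class proportions of y."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for i, j in enumerate(idx):              # deal each class out in turn
            folds[i % k].append(j)
    return [np.array(f) for f in folds]

def cv_error(X, y, fit, predict, k=10):
    err, folds = 0.0, stratified_folds(y, k)
    for f in folds:
        train = np.setdiff1d(np.arange(len(y)), f)
        model = fit(X[train], y[train])          # tree grown on Θ \ Θ_k
        err += np.mean(predict(model, X[f]) != y[f])
    return err / k                               # average over the k runs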
Table 3 shows the estimated error rates of the six attribute selection measures obtained
in our experiments. Each line presents the error rates of the six measures for one data
set, in which the bold number indicates the measure with the minimum error rate and the
italic number indicates the measure with the maximum error rate. The abbreviation R-m
stands for R-measure.
Averages of error rates are given in the last line of Table 3. Though averages of error
rates cannot be used directly to evaluate measures, they may, however, provide a snapshot
of the measure comparison, and show the stability of these measures.

Figure 1: Main screen of the system

5. Discussions and conclusions


Some conclusions can be drawn from the above experimental results. First, we reconfirm
that the quality of decision tree construction is affected by the choice of
attribute selection measures. Somewhat differently from the experimental results in some
previous works, in these experiments the error rates of Relevance and the dN distance
were relatively higher than those of the others, and they were relatively unstable (as
shown by the results with the data sets Lung-cancer, Audiology, Splice, Promoters). The
measures Gain-ratio, Gini-index, χ² and R-measure have error rates which are not really
much different. In comparison with the estimation of learning performance by a single
train-and-test experiment, e.g. in Buntine and Niblett (1991), a multiple train-and-test
experiment such as the 10-fold stratified cross-validation used in these experiments often
gives higher error rates. Thus, the stratified cross-validation technique allows us
to avoid the usual overoptimistic estimation of learning performance. An extension
of these experiments has been made with more data sets and with noisy data on
pruned trees. Careful evaluation under these conditions hopefully allows us to obtain
more reliable and new results on the R-measure (Ho and Nguyen, 1997).
In this paper we have introduced the R-measure for the degree of dependency between
two groups of attributes in a data set. The R-measure is inspired by the notion of
attribute dependency in rough set theory and aims at overcoming some limitations
of that notion. We have applied the R-measure to the general scheme of decision
tree induction and carefully carried out an experimental comparative study of the
R-measure with five attribute selection measures which are well known in the machine
learning literature.

References:
Baim, P.W. (1988): A method for attribute selection in inductive learning systems. IEEE
Trans. on PAMI, 10, 888-896.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression
Trees, Belmont, CA: Wadsworth.
Buntine, W., Niblett, T. (1991): A further comparison of splitting rules for decision-tree
induction. Machine Learning, 8, 75-85.
Dougherty, J., Kohavi, R. and Sahami, M. (1995): Supervised and Unsupervised Dis-
cretization of Continuous Features. Proceedings of the 12th International Conference on
Machine Learning, Morgan Kaufmann, 194-202.
Ho, T.B., Nguyen, T.D. (1997): An interactive-graphic system for decision tree induction
(under review).
Kononenko, I. (1995): On biases in estimating multi-valued attributes. Proc. 14th Inter.
Joint Conf. on Artificial Intelligence, Montreal, Morgan Kaufmann, 1034-1040.
Kohavi, R. (1995): A study of cross-validation and bootstrap for accuracy estimation and
model selection. Proc. Int. Joint Conf. on Artificial Intelligence IJCAI'95, 1137-1143.
Liu, W.Z., White, A.P. (1994): The importance of attribute selection measures in decision
tree induction. Machine Learning, 15, 25-41.
López de Mántaras, R. (1991): A distance-based attribute selection measure for decision
tree induction. Machine Learning, 6, 81-92.
Mingers, J. (1989): An empirical comparison of selection measures for decision-tree induc-
tion. Machine Learning, 3, 319-342.
Pawlak, Z. (1991): Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Aca-
demic Publishers.
Pawlak, Z., Grzymala-Busse, J., Slowinski, R., Ziarko, W. (1995): Rough sets. Communi-
cations of the ACM, 38, 89-95.
Quinlan, J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann.
Wille, R. (1992): Concept lattice and conceptual knowledge systems. Computers and Math-
ematics with Applications, 23, 493-515.
Visualizing Data in
Tree-Structured Classification
Francesco Mola, Roberta Siciliano
Dipartimento di Matematica e Statistica
Università di Napoli Federico II
Via Cintia - Monte S. Angelo
80126 - Naples - Italy
e-mail: [email protected]
r.sic@idmsna.dms.unina.it

Summary: This paper provides a classification tree methodology to analyze and visualize
data in multi-way cross-classifications of a categorical response variable and a high number
of predictors observed in a large sample. The idea is to perform recursively a factorial
method such as nonsymmetric correspondence analysis on ever finer partitions of the given
sample. Some new insights on the graphic displays of nonsymmetric correspondence analysis
are considered for defining classification criteria based on factorial scores. As a result, we
grow particular types of classification trees with two aims: 1) to enrich the interpretation of
the dependence structure using predictability measures and graphic displays; 2) to obtain
a classification rule for new cases of unknown class on the basis of a factorial model.

1. The proposed methodology


Consider a data matrix with a categorical response variable Y and M categorical
predictors (X_1, ..., X_M) observed on a sample of N cases. The response variable is
used as criterion variable and its categories represent the a-priori classification of the
observed cases. The objective of the proposed methodology is to analyze and visualize
the dependency in such a data matrix and, at the same time, to obtain a classification
rule for new cases of unknown class.
The idea is to use recursively, on continuously finer partitions of the N cases, a factorial
method such as nonsymmetric correspondence analysis (Lauro and D'Ambra,
1984; D'Ambra and Lauro, 1989; Lauro and Siciliano, 1989; Siciliano, Mooijaart and
van der Heijden, 1993; Mola and Siciliano, 1995). The approach consists in selecting
at each node the best set of predictor categories for visualizing the dependence structure
by nonsymmetric correspondence analysis and, successively, in partitioning the cases on
the basis of some properties concerning the factorial axes.
As a result, we define a tree-structured classification of the N cases where, for passing
from a coarser partition to a finer partition, we use a factorial model and where a
factorial representation is assigned to each partition.
In the following subsections we describe the main phases of the partitioning procedure
to divide N cases into two subgroups. Then the overall procedure repeats for
each subgroup until stopping according to a given rule.

1.1 Table selection by the predictability index T

We consider the set of :Vl contingency tables where each table cross-classifies the
response categories of Y with the categories of each predictor. In this phase we select
the best predictor X' by maximizing the predictability index T of Goodman and
Kruskal, that is:

223
224

(1 )

where p,; are the proportions of the selected contingency table that cross-classifies
the response categories i = L .... [ of Y with the categories j = 1, .... J of the best
predictor X". The usual dot notation is used for summations. i.e., Lj pi] = Pi ..
1.2 Dependence analysis by a factorial model
In this phase we analyze the dependency in the selected table using the factorial
model of nonsymmetric correspondence analysis:

\frac{p_{ij}}{p_{.j}} = p_{i.} + \sum_k \lambda_k r_{ik} c_{jk},    (2)

for k = 1, ..., K* = min(I-1, J-1), and \lambda_1 \ge \cdots \ge \lambda_{K^*} \ge 0. The row scores r_{ik} and
the column scores c_{jk} satisfy the following centering and orthonormality conditions:

\sum_i r_{ik} = 0, \qquad \sum_j c_{jk} p_{.j} = 0,    (3)

\sum_i r_{ik} r_{ik'} = \delta_{kk'}, \qquad \sum_j c_{jk} c_{jk'} p_{.j} = \delta_{kk'},    (4)

where \delta_{kk'} is the Kronecker delta.

Nonsymmetric correspondence analysis decomposes the predictability index τ into a
number of dimensions: \tau_{Y|X^*} (1 - \sum_i p_{i.}^2) = \sum_k \lambda_k^2. The objective is, however, to decompose
the observed table by using a number of factors K lower than K*.
At each level of the partitioning procedure we assign a two-dimensional factorial
representation of the dependence structure between the response categories and the
selected predictor categories (see for example Lauro and Siciliano, 1989; Siciliano,
Mooijaart and van der Heijden, 1993).
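One standard way to obtain scores satisfying (2)-(4) is a singular value decomposition
of the centered column profiles weighted by the square roots of the column margins.
The following sketch (ours, not the authors' software) is one such implementation:

import numpy as np

def nsca(n):
    """Scores of model (2) via an SVD of the weighted, centered column profiles."""
    p = np.asarray(n, dtype=float) / np.sum(n)
    p_i, p_j = p.sum(axis=1), p.sum(axis=0)
    centered = p / p_j - p_i[:, None]                  # p_ij / p_.j - p_i.
    u, lam, vt = np.linalg.svd(centered * np.sqrt(p_j), full_matrices=False)
    r = u                                              # rows:  sum_i r_ik^2 = 1
    c = vt.T / np.sqrt(p_j)[:, None]                   # cols:  sum_j c_jk^2 p_.j = 1
    return r, c, lam                                   # sum_k lam_k^2 = numerator of tau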

1.3 Partitioning criterion using factorial scores

Let E = {e_1, ..., e_N} be the set of the N observed cases. We denote by G(2) = (g_l, g_r)
a partition of E into two subgroups, called "left" and "right", such that a case e_n can
belong to only one of the two subgroups and all cases are classified.
The classification criterion is defined by using the weighted predictor scores of the
best predictor X* in the first dimension, i.e., c_{jk} p_{.j} for k = 1, which sum up to zero
due to the centering condition (3): we can say that the predictor categories having a
negative score value behave in the opposite way with respect to the categories having
a positive score value. Notice in fact that the centering condition (3) for the column
scores holds since \sum_j (p_{ij}/p_{.j} - p_{i.}) p_{.j} = 0, where p_{i.} is the centroid of the J conditional
distributions. In this way we can discriminate between two subgroups of categories:
the first subgroup includes the categories j such that c_{j1} \le 0 and the second subgroup
includes the categories j such that c_{j1} > 0.
To this discrimination there corresponds a partition G(2) of the set E of N cases into
two subgroups: the "left" subgroup g_l includes the N_l cases that present one of the
predictor categories with nonpositive score value, whereas the "right" subgroup g_r
includes the N_r cases that present one of the predictor categories with positive score
value.
In addition, we characterize this partitioning of both categories and cases with some

predictability measures. Using the orthonormality conditions (4) we can prove that
the following relation holds:

\sum_k \lambda_k^2 = \sum_i \sum_k (r_{ik} \lambda_k)^2 = \sum_j p_{.j} \sum_k (c_{jk} \lambda_k)^2.    (5)

Since the index τ is proportional to \sum_k \lambda_k^2, (5) shows that we can further decompose
such predictability over the row categories as well as over the column categories. In
particular, we can identify which predictor categories contribute most to predicting
the response variable by defining the following predictability measure of category j:
pred(C_j) = (p_{.j} \sum_k (c_{jk} \lambda_k)^2) / (\sum_k \lambda_k^2), which sums up to one over the index j. When
we consider the first factorial axis in the partitioning criterion this formulation simplifies
to pred(C_j) = c_{j1}^2 p_{.j}. We can compare pred(C_j) with the weight p_{.j} of category j:
we can say that category j is a strong category when pred(C_j) \ge p_{.j}, whereas category
j is a weak category when pred(C_j) < p_{.j}. By definition, we can equivalently check
the conditions |c_{jk}| \ge 1 and |c_{jk}| < 1, respectively.
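Continuing the sketch above (it reuses the hypothetical nsca function from the previous
block), the split of the predictor categories and the strong/weak diagnosis could be
coded as follows:

import numpy as np

def split_categories(n):
    """Split of the predictor categories by the sign of c_j1, plus pred(C_j)."""
    r, c, lam = nsca(n)                       # hypothetical: the sketch above
    p_j = (np.asarray(n, dtype=float) / np.sum(n)).sum(axis=0)
    left = np.flatnonzero(c[:, 0] <= 0)       # categories sent to the left subgroup
    right = np.flatnonzero(c[:, 0] > 0)       # categories sent to the right subgroup
    pred1 = c[:, 0] ** 2 * p_j                # first-axis pred(C_j) = c_j1^2 p_.j
    strong = np.flatnonzero(pred1 >= p_j)     # equivalently |c_j1| >= 1
    return left, right, pred1, strong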

1.4 Stopping rule and class assignment rule

For stopping the recursive partitioning procedure we consider a stopping rule based
on the CATANOVA statistic, defined in our context as C_t = (N_t - 1)(I_t - 1)\tau_{s_t^*},
where s_t^* is the best split at node t, I_t is the number of non-empty response classes
at node t and N_t is the number of cases at node t (Mola and Siciliano, 1994); this
statistic is asymptotically chi-square distributed with (I_t - 1) degrees of freedom under
the null hypothesis of independence. Once we define the best split we test its
significance by applying the CATANOVA statistic. In case we reject the null hypothesis
of independence we continue partitioning the cases; otherwise, we stop at the current
node. When there are no groups that could be divided further, the partitioning
procedure is completed.
In the end, we can assign the class label to each subgroup of the final partition. This
can be done by considering the response category having the highest proportion of
cases within the given subgroup.

2. Example
A real data set is analyzed with the proposed methodology. The set consists of 286
graduates of the Economy and Commerce Faculty of the University of Naples over
the period 1986-1989, on which the following variables were observed:

• Final score with categories Low (L), Medium-Low (ML), Medium-High (MH), High (H);
• Sex with categories male, female;
• Origin with categories Naples, county, other counties;
• Age with categories -25, 26-30, 31-35, +36;
• Diploma with categories classical, scientific, technical, magistral, professional;
• Study plan with categories official, managerial, economics, quantitative, public, professional;
• Time needed to graduate with categories 4 years, 5-6 years, +7 years;
• Thesis subject with categories economy, law, quantitative, history, management.

The variable Final Score is the response variable. In table 1 we summarize the
partitioning sequence to grow the binary tree shown in figure 1.

node  N    N1  N2  N3  N4  predictor       %τ   left                                        right
1     286  54  78  81  73  Age             94%  -25 years, 26-30 years                      31-35 years, +36 years
2     239  33  59  74  73  Diploma         88%  Classical, Magistral, Technical             Professional, Scientific
3     47   21  19   7   0  Study Plan      87%  Official, Quantitative, Public              Managerial, Economics, Professional
4     54    5   3  19  27  Origin          87%  Naples, County                              Other counties
5     185  28  56  55  46  Study Plan      84%  Official, Public, Economics, Quantitative   Managerial, Professional
10    174  24  54  50  46  Origin          83%  Naples, County                              Other counties
20    81    7  18  28  28  Thesis subject  83%  Economy, Quantitative, Management           Law, History

Tab. 1: Partitioning Sequence (strong categories are in bold)

For the sake of brevity, we analyze only some partitions. The first table that is selected
cross-classifies the response variable Final Score with the predictor Age. In figure 2
we present the two-dimensional factorial representation of nonsymmetric correspondence
analysis applied to this table. The first factorial axis explains a very high
percentage of the τ index (94%).
We notice an axial opposition between older graduates and younger graduates as well as
between graduates with high score and graduates with low score. Only the age category
26-30 years is a weak category (very near the origin). These considerations
enrich the interpretation of the partitioning at node 1 that divides the 286 cases into
two subgroups of 239 and 47 cases respectively.
At node 2 the selected predictor is Diploma. Figure 3 shows the graphic display
where the first factorial axis explains again a high percentage of the τ index (88%).
We notice an axial opposition between classical diploma and professional diploma,
which represent the strong predictor categories. Considering the simultaneous interpretation
of predictor and response categories we underline that graduates with high
score are better predicted by the category classical diploma, whereas the category
professional diploma predicts more the graduates with medium-low score. Instead,
the category magistral diploma predicts more the graduates with medium-high score.
The partitioning at node 2 divides the 239 cases into two subgroups of 54 and 185
cases respectively.
The selected predictor at node 5 is Study Plan. In the graphic display of figure 4
the prediction explained by the first factorial axis is 84%. The axial opposition is
now provided by official and public against managerial and professional study plans.
The other predictor categories, economics and quantitative, are weak categories (very
near the origin). The partitioning divides the 185 cases at node 5 into two subgroups
of 174 and 11 cases respectively.

Figure 1: Final binary tree (strong categories are in a box)



Figure 2. Factorial representation at node 1: Final Score vs Age

Figure 3. Factorial representation at node 2: Final Score vs Diploma

Figure 4. Factorial representation at node 5: Final Score vs Study Plan



Table 2 summarizes the terminal node information with the values of the CATANOVA
statistic and the assigned class. The left side of table 3 shows the misclassification
matrix obtained with the proposed tree-classification procedure, whereas the right side
of table 3 shows the misclassification matrix obtained with a standard classification
tree procedure such as, for example, the CART methodology (Breiman, Friedman,
Olshen and Stone, 1984). We can compare the two misclassification matrices of table 3.
Notice that the proportions of misclassified cases are very similar in the two
analyses and for some response categories the proposed method has classified better
than CART.

node cases n1 n2 n3 n4 CATANOVA d.f. assigned class


6 17 2 8 7 0 4.6 2 2
7 30 19 11 0 0 3.29 1 1
8 29 0 0 7 22 3.96 1 4
9 25 5 3 12 5 9.96 3 3
11 11 4 2 5 0 6.11 2 3
21 93 17 36 22 18 11.03 3 2
40 53 5 10 14 24 7.72 3 4
41 28 2 8 14 4 4.00 3 3

Tab. 2: Terminal Node Information

               true class                          true class
predicted ML   0.35  0.36  0.25     predicted ML   0.37  0.41  0.27
class     MH   0.20  0.38  0.12     class     MH   0.17  0.31  0.18
          H    0.10  0.26  0.63               H    0.11  0.27  0.55

Tab. 3: Misclassification Matrices (left: proposed procedure; right: CART)

3. Some extensions and relations


It is interesting to consider some extensions of the basic method: in the first phase
it is possible to select more than one predictor. In fact, we can extend the proposed
method to the case of joint predictors that cross-classify two or more predictors.
In the partitioning phase we can consider a partition into three subgroups by considering
more levels of dependency (strong dependency with positive or negative score value
and weak dependency), thus growing not only binary trees but also ternary trees
(Siciliano and Mola, 1997). Moreover, the splitting criterion can be based on two
factorial axes instead of one. We also consider the possibility of having fuzzy subgroups,
that is, the possibility of sending the same case into two different subgroups (Mola and
Siciliano, 1995).
Our methodology has some relations with two-stage binary segmentation procedures
(Mola and Siciliano, 1992, 1994; Mola, Klaschka and Siciliano, 1996), although a
factorial model is used as a splitting rule. In contrast with the usual classification tree
procedures (CART, Breiman, Friedman, Olshen and Stone, 1984; FACT, Loh and
Vanichsetakul, 1988), the proposed method can also grow ternary classification trees
where it is possible to distinguish between nodes where there is a strong dependency
and nodes where there is a weak dependency.

As main results of applications on real data sets, the proposed methodology has
been shown to achieve several purposes:

• to build up a classification rule (binary tree),

• to predict the response classes (binary tree),

• to visualize the dependency structure in multi-way cross-classifications with


many predictors and large sample size (recursive factorial representations),

• to detect outliers (ternary tree).

Acknowledgements: The Authors wish to thank Carlo Lauro and Jaromir Antoch for
helpful comments on a previous version of this paper. This research was supported for the
first author by CNR research funds number 92.1872 P and for the second author by CNR
research funds number 95.02041.CTlO.

References

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and Regression
Trees, Belmont, CA: Wadsworth.
D'Ambra, L. and Lauro, C. (1989): Nonsymmetrical Analysis of Three-way Contingency
Tables, in Multiway Data Analysis, Coppi, R. and Bolasco, S. (eds.), North Holland, Amsterdam.
Lauro, N.C. and D'Ambra, L. (1984): L'Analyse non Symétrique des Correspondances.
Data Analysis and Informatics III, E. Diday et al. (eds.), 433-446, North Holland, Amsterdam.
Lauro, N.C. and Siciliano, R. (1989): Exploratory methods and modelling for contingency
tables analysis: an integrated approach. Statistica Applicata, Italian Journal of Applied
Statistics, 1, 5-32.
Loh, W. and Vanichsetakul, N. (1988): Tree-Structured Classification via Generalized Discriminant
Analysis. Journal of the American Statistical Association, 83, 715-728.
Mola, F. and Siciliano, R. (1992): A Two-Stage Predictive Splitting Algorithm in Binary
Segmentation. Computational Statistics, Dodge, Y. and Whittaker, J. (eds.), 1, 179-184
(Compstat '92 Proceedings), Physica Verlag.
Mola, F. and Siciliano, R. (1994): Alternative Strategies and CATANOVA Testing in Two-
Stage Binary Segmentation. New Approaches in Classification and Data Analysis, Diday,
E. et al. (eds.), 316-323, Springer Verlag.
Mola, F. and Siciliano, R. (1995): Nonsymmetric correspondence analysis for tree-structured
classification, research internal report, conditionally accepted for Applied Stochastic Models
and Data Analysis.
Mola, F., Klaschka, J. and Siciliano, R. (1996): Logistic Classification Trees, Compstat 96
Proc., Prat, A. (ed.), Physica Verlag.
Siciliano, R. and Mola, F. (1997): Ternary classification trees: a factorial approach, Visualization
of Categorical Data (Greenacre, M., Blasius, J., eds.), Academic Press, CA.
Siciliano, R., Mooijaart, A. and van der Heijden, P.G.M. (1993): A Probabilistic Model for
Nonsymmetric Correspondence Analysis and Prediction in Contingency Tables. Journal of
the Italian Statistical Society, 2, 1, 85-106.
Adaptive Cluster Analysis Techniques -
Software and Applications
Hans-Joachim Mucha¹, Rainer Siegmund-Schultze¹ and Karl Dilbon²

¹ Weierstrass Institute for Applied Analysis and Stochastics (WIAS)
Mohrenstrasse 39
D-10117 Berlin, Germany
² Daimler-Benz AG, Department Research and Technology
Postfach 2360
D-89013 Ulm, Germany

Summary: Well-known cluster analysis techniques like, for instance, the K-means
method can be improved in almost every case by using adaptive distances. For this one
has to estimate at least "appropriate" weights of variables, i.e. appropriate contributions
to the cluster analysis. Recently, adaptive classification techniques for two-class models
have been under development. Here usually both the weights of variables and the weights
(masses) of observations play an important role. For instance, observations that are harder
to classify get increasingly larger weights. Quite successful applications of these techniques
can be reported from the area of credit scoring systems for consumer loans or credit
cards. The software ClusCorr (running under Microsoft EXCEL) performs classification,
cluster analysis and multivariate graphics of (huge) high-dimensional data sets containing
numerical values (quantitative or categorical) as well as non-numerical information.

1. Introduction
What can our statistical software do for you? For example, with the help of the EXCEL
Add-In ClusCorr one is able to look for classification rules in data sets like the following
one:
kl,kB,kFF,k2,kl,kl,k2,kO,kO,kO,kD,kl,k2,k4,k8,k7,kl
kO,kS,kME,kO,kl,kl,k3,kO,kO,kO,kD,kO,k4,kl,kO,k7,kl
kO,kS,kFF,kl,kl,kl,kl,kO,kO,kO,kD,kO,k4,k3,k2,k9,kO
kO,kS,kMB,k2,kO,k2,kl,kO,kO,kl,kD,k2,kO,k3,k5,k9,kl
kl,kS,kAD,kl,kl,kl,kl,kO,kO,kl,kD,k7,k6,k2,k7,k5,kl
kO,kS,kFF,kl,kO,kl,k3,kO,kO,kO,kD,kO,k4,k2,k7,k9,kO

Moreover, one can look at multivariate graphics of such data sets in order to get a first
view on the way of understanding the data at hand. Above, every line contains information
(whatever this may be in reality) about an applicant for credit. At the end of each
line the code "k1" characterises a good applicant, whereas a "k0" stands for a bad one
(which is not able to pay the amount of credit back to the bank, telephone company,
mail-order house, or department store). Depending on the kind of credit, often additional
numerical (quantitative, categorical, ordinal, ...) information has to be taken into account
in order to optimise the decision about a new applicant. In fact that is no problem for the
classification technique described later on.


2. Adaptive clustering techniques


Some years ago new adaptive clustering techniques were developed (for details see, e.g.,
Mucha 1992, 1995). They appear to show some intelligence because of their ability to
learn appropriate (adaptive) distance measures. For instance, the squared weighted
Euclidean distance

d_Q^2(x_i, x_l) = \sum_{j=1}^{J} q_j (x_{ij} - x_{lj})^2    (1)

or the weighted absolute distance

t_Q(x_i, x_l) = \sum_{j=1}^{J} q_j |x_{ij} - x_{lj}|    (2)

between two observations x_i and x_l are well-known dissimilarity measures for metrically
scaled data which are often used in cluster analysis. Here the number of variables of a
data matrix X is denoted by J. In the simple case which we consider here, the metric Q is
diagonal with non-negative diagonal elements q_j which are usually unknown and therefore
have to be estimated during an adaptive clustering procedure. For example, the general
variant of the well-known K-means clustering method (MacQueen 1967) takes into
consideration both the weights of variables q_j introduced above and the non-negative
weights (masses) of observations m_i in order to minimise the sum of the within-clusters
variances

V_K = \sum_{k=1}^{K} \sum_{i=1}^{I} \delta_{ik} m_i d_Q^2(x_i, \bar{x}_k)    (3)

for a fixed number of clusters K. Here the number of observations of the data
table X is denoted by I. The indicator \delta_{ik} equals 1 if the observation x_i comes
from cluster k, and 0 otherwise. The vector \bar{x}_k contains the usual arithmetic mean values
in cluster k. On the one hand the weights can be chosen in conformity with data and
model (usually in this case these weights are termed special weights). On the other hand
the weights can be adaptive ones, estimated in an iterative manner regarding
some calibration constraints in (3). In any case, as a result the contributions of the
variables to the cluster analysis differ among one another. As a further result it is no longer
necessary to provide a selection of variables.
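A minimal sketch of a K-means pass minimising (3) follows (ours, not the ClusCorr
implementation); it assumes fixed weights q_j and masses m_i, and omits the adaptive
re-estimation of q_j between passes:

import numpy as np

def weighted_kmeans(X, K, q, m, iters=50, seed=0):
    """One way to minimise (3) for fixed variable weights q_j and masses m_i."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # squared weighted Euclidean distances d_Q^2(x_i, xbar_k) of (1)
        d2 = (((X[:, None, :] - centers[None, :, :]) ** 2) * q).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            w = m * (labels == k)
            if w.sum() > 0:
                centers[k] = w @ X / w.sum()     # mass-weighted cluster mean
    return labels, centers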
The performance of adaptive clustering methods was investigated using extensive simulation
studies (resampling as well as random error techniques). We used measures which
were proposed by Hubert and Arabie (1985). The usefulness of adaptive techniques has
been shown in practice in several applications. Because of the weights q_j and m_i, accompanied
by various kinds of data transformations like rank transformation, clustering techniques
based on (3) have a wide area of applications. For instance, one can look for
clusters in contingency tables using the K-means technique or Ward's hierarchical clustering
method (Mucha and Klinke 1993; see Figure 1 below: a dendrogram which is drawn
onto the plane of the first two factors of correspondence analysis). It should be mentioned
already here that the dendrograms and other graphics of the software ClusCorr are
linked by click with the data values. Another example of a successful application of adaptive
clustering based on (3) is the detection of groups in rank order data (Mucha 1992).

Fig. 1: Application in archaeology. The so-called plot-dendrogram combines both the
output of a hierarchical cluster analysis (clusters of tools from the Stone Age) and a
two-dimensional plot (for instance a scatterplot).

Moreover, adaptive distances are very important in order to obtain highly informative
multivariate plots. In that way, both the interactive data analysis and the interpretation of
clustering results become much easier.
A basic approach for a simultaneous classification and visualization of data is proposed
by Bock (1987). Some methods are described by Bock (1996). These methods are based
on a well-specified goodness-of-fit criterion.
In the case of mixed data several distance measures can be used side by side, for example,
in an additive fashion.

3. A distance approach in credit scoring

If some information about classes is known beforehand, then one has a problem in supervised
learning (see Michie et al. 1994). This knowledge has to be taken into account
in order to give categorical data a quantitative meaning (Nishisato 1994). Credit scoring
is a new area of application of adaptive distances. Let us consider very briefly the first
few steps of a simple distance approach for two-class models. A training data set containing
a class membership variable c is used to estimate the classification rule. The elements
of the vector c can either take the value 0 (bad applicant) or 1 (good applicant).
One can start with equal weights of observations, namely m_i = 1 for all i = 1, 2, ..., I.
First of all we recommend an optimal scaling of the data values with regard to the given
class membership by using a kind of correspondence analysis technique (Nishisato

1980). Generally, we want to obtain a new variable x so as to make the values (scores)
within the given classes (SS_w) as similar as possible and the scores between classes (SS_b) as
different as possible. Let us look at a contingency table which can be obtained by crossing
a categorical variable j (K_j categories, where K_j \ge 2 is considered) with the class
membership variable (2 categories at least). That is (regarding some constraints in the
frame of the dual scaling approach), the squared correlation ratio has to be maximised:

\eta^2 = \frac{SS_b}{SS_t}.    (4)

Because of the well-known variance decomposition

SS_t = SS_b + SS_w    (5)

the squared correlation ratio lies in the interval [0,1]. Considering the special case of two
classes now, the correspondence analysis can be used without the calculation of orthogonal
eigenvectors (Mucha 1996a): a given category y_{kj} of a variable j is transformed into
an optimally scaled one by

x_{kj} = \frac{p_{kj}^{(1)}}{p_{kj}^{(1)} + p_{kj}^{(0)}}.    (6)

Here p_{kj}^{(1)} is an estimate of the probability of being a good applicant when coming from
category k, whereas on the other side p_{kj}^{(0)} is an estimate of the probability of being a bad
applicant when coming from category k. Additionally, these estimates can take into consideration
weights of the observations. For instance, observations that are harder to classify
get increasingly larger weights during the adaptive classification process. The transformation
(6) is quite a simple one compared with, for example, another one given by
Fahrmeir and Hamerle (1984), but it has important advantages (Mucha 1996a).
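A sketch of transformation (6) follows (ours, with hypothetical names). We read
p_{kj}^{(1)} and p_{kj}^{(0)} as weighted relative frequencies of category k within the good
and the bad class respectively, one concrete reading under which (6) is non-trivial:

import numpy as np

def optimal_scores(cat, c, m=None):
    """x_kj of (6) for one categorical variable: cat[i] is a category, c[i] in {0, 1}."""
    cat, c = np.asarray(cat), np.asarray(c)
    m = np.ones(len(c)) if m is None else np.asarray(m, dtype=float)  # observation weights
    scores = {}
    for k in np.unique(cat):
        w = m * (cat == k)
        p1 = (w * (c == 1)).sum() / (m * (c == 1)).sum()  # weighted freq. of k among good
        p0 = (w * (c == 0)).sum() / (m * (c == 0)).sum()  # weighted freq. of k among bad
        scores[k] = p1 / (p1 + p0)                        # (6); assumes p1 + p0 > 0
    return scores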
Without loss of generality the hypothetical worst case x_l = 0 will be considered here
(otherwise one can look at the best-case model with x_l = 1). Then the L_1-distance (2)
between an observation (applicant) x_i and the worst case is simply

t^*(x_i, x_l) = \sum_{j=1}^{J} x_{ij}.    (7)

Instead of t^* here we prefer

t(x_i, x_l) = \frac{1}{J} \sum_{j=1}^{J} x_{ij}    (8)

because it is independent of the number of variables J. Now the question arises, how can
a suitable cut-off point be determined on the distance scale? The simplest way is to take
the cut-off point which gives the minimum error rate for the training sample (Figure 2).
Usually (without consideration of assumptions on distributions) the cut-off point is chosen
by sampling techniques. For example, one can take the mean or median of a set of
cut-off points (corresponding to minimum error rates obtained from many different samples
which are drawn randomly from the training data).
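The score (8) and the simplest cut-off choice can then be sketched as follows (ours;
X_scaled holds the optimally scaled values x_{ij} of the J variables):

import numpy as np

def creditworthiness_score(X_scaled):
    """t(x_i, x_l) of (8): mean optimally scaled value per applicant."""
    return X_scaled.mean(axis=1)

def best_cutoff(t, c):
    """Cut-off on the score scale minimising the training error rate."""
    grid = np.unique(t)
    errors = [np.mean((t >= g) != (c == 1)) for g in grid]  # classify good iff t >= cut-off
    return grid[int(np.argmin(errors))]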

Afterwards, as usual, we are able to classify the observations of the test sample using the
scores (6) and the cut-off point of the training sample.
The results obtained in the first step (described above) can be improved in almost every
case by altering the weights of observations (i.e. by changing the amount of influence of
the observations). This leads to a local adaptive distance approach.

Fig. 2: Credit scoring, Example 1. Cut-off point on the distance axis versus error rate
(dark curve, larger amplitude: training sample; grey curve: test sample). The vertical
axis shows the total error rate in %, the horizontal axis the score of creditworthiness.

4. The software
The XClust library (Mucha, 1995; Mucha and Klinke, 1993) of the interactive statistical
computing environment XploRe contains well-known cluster analysis methods as well as
new adaptive ones. Additionally, highly interactive and dynamic graphics of XploRe
support both the search for and the interpretation of clusters. Everyone who is a little bit
familiar with matrix notation can write new distance functions (for instance for mixed
data) by using the macro language of XploRe.
In order to make cluster analysis techniques available for almost everyone (without any
knowledge of algebra and statistical languages) the software ClusCorr, running under
Microsoft Windows, is under development (Mucha 1996b). It is written in Visual Basic
for Applications (VBA) to function as an Add-In under Microsoft EXCEL 5 (or higher).
Hardware requirements are defined from this point. Having the wide family of
EXCEL users in mind, ClusCorr is designed on the one hand to have teachware properties,
including an extensive help system and decision support for the choice of distance
measures, clustering techniques, multivariate graphics, and so on, and on the other hand
to perform the cluster analysis of huge data sets.
In consideration of the kind of results obtained one can distinguish roughly between
hierarchical and partitioning techniques. The hierarchical methods form a sequence of
nested partitions, a hierarchy (Figure 1). In ClusCorr (as well as in XClust) well-known
hierarchical ascending clustering methods are available, for instance
• Ward's minimum variance method,
• Single Linkage (nearest neighbour),
• Complete Linkage (furthest neighbour),
• Average Linkage (group average), and
• Centroid Method (weighted centroid).
About twenty well-known distance measures can be selected simply by a click:
Euclidean, L_1, Jaccard, Cosine, ...
What is one to do if one has to carry out a hierarchical cluster analysis for a million
observations? We offer a special principal axes cutting algorithm in order to reduce a
huge data set to, for instance, 250 "pre-clusters". This algorithm is quite fast: it
takes 25 minutes (on a Pentium 200 MHz) to reduce one million objects to some
few hundred pre-clusters. Afterwards one can carry out a hierarchical cluster analysis.
The (adaptive) K-means method is available in different variants: exchange method,
gradient technique, minimum distance method, ...
The stability of cluster analysis results can be investigated by simulation studies based
on measures for comparing partitions. In that way one can check automatically whether
the adaptive clustering performs better or not. Furthermore, in that way one can validate
the number of clusters, or the user can assess the importance of the variables.

5. Credit scoring - some results

In general, credit scoring systems should be able to estimate approximately an applicant's
creditworthiness. Usually, to this end one quantifies the variables (attributes)
numerically and adds them up to form a so-called creditworthiness score (see above). On
this basis a decision about a new applicant is derived.

Example 1: The training sample A (see above) consists of 16795 applicants, whereas the
test sample B contains 14745 persons. Several statistical methods (see, for example,
Quinlan 1993) are used, which give the following total error rates:

Method                                 A       B
C4.5 (decision tree by Quinlan)        3,3%    4,62%
logistic regression                    5,85%   5,03%
linear discriminant analysis           3,9%    5,67%
ClusCorr (local adaptive distances)    3,3%    4,47%

Example 2: Credit scoring data from Fahrmeir and Hamerle (1984): the sample consists
of 1000 applicants, of which 300 are bad clients and 700 are good ones. All 20 variables
are used. The following total error rates were obtained by resubstitution (R) and cross-validation
(C) (leaving-one-out method):

Method                                 R        C
quadratic discriminant analysis        26,7%²   32,3%²
linear discriminant analysis           27,3%²   29,0%²
ClusCorr (local adaptive distances)    19,7%    22,9%

² See Fahrmeir and Hamerle (1984). The following average error rates (sum of error rates
of each class divided by 2) are given by the authors: quadratic discriminant analysis:
24,3% (R), 31,2% (C); linear discriminant analysis: 27,0% (R), 28,9% (C).

6. References
Bock, H. H. (1987): On the interface between cluster analysis, principal component
analysis, and multidimensional scaling. In: Multivariate Statistical Modeling and Data
Analysis, Bozdogan, H. and Gupta, A. K. (eds.), 17-34, D. Reidel, Dordrecht.
Bock, H. H. (1996): Simultaneous visualization and classification methods as an alternative
to Kohonen's neural networks. In: Classification and Multivariate Graphics:
Models, Software and Applications, Mucha, H.-J. and Bock, H. H. (eds.), Report No. 10
(ISSN 0956-8838), 15-24, Weierstrass Institute for Applied Analysis and Stochastics,
Berlin.
Fahrmeir, L. and Hamerle, A. (1984): Multivariate statistische Verfahren. De Gruyter,
Berlin.
Hubert, L. J. and Arabie, P. (1985): Comparing partitions. Journal of Classification, 2,
193-218.
MacQueen, J. (1967): Some methods for classification and analysis of multivariate observations.
In: Proc. 5th Berkeley Symp. Math. Statist. Prob. 1965/66, LeCam, L. and
Neyman, J. (eds.), Vol. 1, 281-297, Univ. California Press, Berkeley.
Michie, D. M., Spiegelhalter, D. J. and Taylor, C. C. (eds.) (1994): Machine Learning,
Neural and Statistical Classification, Ellis Horwood, Chichester.
Mucha, H.-J. (1992): Clusteranalyse mit Mikrocomputern, Akademie Verlag, Berlin.
Mucha, H.-J. (1995): XClust: clustering in an interactive way. In: XploRe: an Interactive
Statistical Computing Environment, Härdle, W., Klinke, S., and Turlach, B. A. (eds.),
141-168, Series Statistics and Computing, Springer-Verlag, New York.
Mucha, H.-J. (1996a): Distance based credit scoring. In: Classification and Multivariate
Graphics: Models, Software and Applications, Mucha, H.-J. and Bock, H. H. (eds.), Report
No. 10 (ISSN 0956-8838), 69-76, Weierstrass Institute for Applied Analysis and
Stochastics, Berlin.
Mucha, H.-J. (1996b): ClusCorr: cluster analysis and multivariate graphics under MS
EXCEL. In: Classification and Multivariate Graphics: Models, Software and Applications,
Mucha, H.-J. and Bock, H. H. (eds.), Report No. 10 (ISSN 0956-8838), 97-105,
Weierstrass Institute for Applied Analysis and Stochastics, Berlin.
Mucha, H.-J. and Klinke, S. (1993): Clustering techniques in the interactive statistical
computing environment XploRe. Discussion Paper 9318, Institut de Statistique, Université
Catholique de Louvain, Louvain-la-Neuve.
Nishisato, S. (1980): Analysis of Categorical Data: Dual Scaling and its Applications.
University of Toronto Press, Toronto.
Nishisato, S. (1994): Elements of Dual Scaling: An Introduction to Practical Data
Analysis. Lawrence Erlbaum Associates, Publishers, Hillsdale.
Quinlan, J. R. (1993): C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.
Part III

Classification and Discrimination

• Statistical Approaches for Classification Problems

• Discrimination and Pattern Analysis


The Most Random Partition of a Finite Set
and Its Application to Classification
Masaaki Sibuya
Takachiho University
2-19-1 Ohmiya, Suginami-ku, Tokyo 168, Japan

Summary: The notion of 'the most random partition of a finite set' is defined and proposed
as a null model in classification. A Hamming-type distance between two independent
and most random partitions is used for justifying its randomness, and is used for testing
this null hypothesis. The probability distribution of the distance is studied for the latter
purpose.

1. Validation of classification
In order to justify a classification procedure from the viewpoint of classical statistics,
one should test the null hypothesis that a given or resulting classification is purely
random, thus meaningless, against the alternative that the classification is somehow
structured or meaningful. Many models for the null hypothesis were proposed, as
reviewed by Gordon (1996) and Bock (1996) at this conference. Typically, a probabilistic
null model assumes the uniform distribution of the observable variables on a
domain, or a uniform random structure of a similarity or dissimilarity matrix.
The approach presented in this report is completely new and different from those
which were previously proposed (see also Sibuya 1993a,b). The new proposal is
based on the notion of 'the most random partition of a finite set' which yields directly
the null hypothesis of 'random partition without any regularity'. A Hamming-type
distance between two independent and most random partitions plays the role of the
conventional chi-square goodness-of-fit statistic.

2. Preliminaries
2.1 Random partitions
In this paper, a classification means a partition M = \{m_1, m_2, \ldots\} of a finite set
N_n = \{1, 2, \ldots, n\} of n objects or elements. The ordering of the classes or subsets
is disregarded. Let the set of all partitions of N_n be denoted by A_n. A discrete
probability distribution P on A_n given by

P(M) \ge 0, \quad M \in A_n, \quad \text{and} \quad \sum_{M \in A_n} P(M) = 1,

is called a random partition A. Typically it has a random number K of classes:
A = \{a_1, \ldots, a_K\}. A special random partition, the 'most random partition', will be
introduced later.
2.2 Distance between partitions
The discrepancy between two given partitions of N_n is measured by a distance d
satisfying the axioms of a distance.


Definition (Rand, 1971). Let L, M \in A_n and i, j \in N_n, i \ne j, and let

d_0(i, j; L, M) =
  0, if (i, j) belong to the same class in L and to the same class in M,
  0, if (i, j) belong to different classes in L and to different classes in M,
  1, if (i, j) belong to different classes in L and to the same class in M,
  1, if (i, j) belong to the same class in L and to different classes in M.

Then the distance d between two partitions L and M is defined by

d(L, M) = \sum_{1 \le i < j \le n} d_0(i, j; L, M),    (1)

i.e., the number of inconsistent class assignments in L and M. We have d(L, M) = 0
if and only if L = M. Let V = \{N_n\} be the partition with the single class, and let
W = \{\{1\}, \ldots, \{n\}\} be the partition with all singletons. Then

d(L, M) \le d(V, W) = \binom{n}{2} \quad \text{for any } L, M \in A_n.    (2)

The distance d is used also by Zahn (1964).
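A direct O(n²) transcription of (1) (our sketch; partitions are given as lists of sets):

from itertools import combinations

def partition_distance(L, M):
    """d(L, M) of (1): number of pairs treated inconsistently by L and M."""
    def cls(P):                        # map element -> index of its class in P
        return {e: k for k, block in enumerate(P) for e in block}
    cL, cM = cls(L), cls(M)
    elems = sorted(cL)
    return sum((cL[i] == cL[j]) != (cM[i] == cM[j])
               for i, j in combinations(elems, 2))

# e.g. V = [set(range(1, 6))]; W = [{i} for i in range(1, 6)]
# partition_distance(V, W) == 10 == C(5, 2), consistent with (2)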

3. A single parameter family of random partitions

For an arbitrary partition M = \{m_1, \ldots, m_k\} \in A_n define by s_j the number of classes
of cardinality j, namely,

s_j := |\{i : |m_i| = j\}| \ge 0 \text{ for } 1 \le j \le n, \quad \sum_{j=1}^{n} s_j = k, \quad \text{and} \quad \sum_{j=1}^{n} j s_j = n.

This defines the vector S(M) = s = (s_1, \ldots, s_n), which is a partition of the natural
number n.
If a random partition A = \{a_1, \ldots, a_K\} on A_n with K classes has the probability
distribution

P(K = k \text{ and } A = \{m_1, \ldots, m_k\}) = \frac{\alpha^k}{\alpha^{[n]}} \prod_{j=1}^{n} ((j-1)!)^{s_j} =: g(n; s),
\quad \text{for } 1 \le k \le n, \; M = \{m_1, \ldots, m_k\} \in A_n,    (3)

we say that it has the distribution P(n, \alpha) and write A \sim P(n, \alpha). Here s =
S(M), \alpha > 0 is a real parameter and \alpha^{[n]} = \alpha(\alpha+1)\cdots(\alpha+n-1). An illustrative
interpretation of P(n, \alpha) is given in Section 5.
We can prove the following recursions for g(n; s); g with a negative argument s_j <
0, 1 \le j \le n, is regarded as 0.
(i)

n g(n; s) = p_{n-1} s_1 g(n-1; s_1 - 1, s_2, \ldots, s_{n-1})
+ \frac{1 - p_{n-1}}{n-1} \sum_{j=1}^{n-1} j(j+1) s_{j+1} g(n-1; s_1, \ldots, s_{j-1}, s_j + 1, s_{j+1} - 1, s_{j+2}, \ldots, s_{n-1}),    (4)

where p_{n-1} = \alpha/(\alpha + n - 1).


(ii)

g(n-1; s_1, \ldots, s_{n-1}) = g(n; s_1 + 1, s_2, \ldots, s_{n-1}, 0)
+ \sum_{j=1}^{n-1} s_j g(n; s_1, \ldots, s_{j-1}, s_j - 1, s_{j+1} + 1, s_{j+2}, \ldots, s_n).    (5)

This is a sort of dual of (4). It means that if an arbitrarily chosen element, say n, of
A is deleted, the remaining random partition has the distribution P(n - 1, \alpha).
(iii)

g(n; s) = \frac{\alpha (j-1)!}{(\alpha + n - j)^{[j]}} g(n-j; s_1, \ldots, s_{j-1}, s_j - 1, s_{j+1}, \ldots, s_{n-j}).    (6)

This relation means that if an arbitrarily chosen element, say n, of A belongs to a
class with j elements, and if the class is deleted from A, then the remaining partition
has the distribution P(n - j, \alpha).
Any one of the three conditions, with some part of the other two conditions, characterizes
the distribution P(n, \alpha) (Sibuya and Yamato, 1995).

4. The most random partition

Let A be any random partition on A_n, and let M_0 \in A_n be a fixed partition. If

E(d(A, M_0)) \le E(d(A, M)) \quad \text{for all } M \in A_n,

then M_0 is called the 'center', a sort of mean, of the random partition A.

Proposition 1 (Snijders, 1996) Suppose that A \sim P(n, \alpha).

(i) If \alpha < 1 the center of A is V.
(ii) If \alpha = 1, E(d(A, M)) is a constant for all M \in A_n, and any fixed partition M is
the center of A.
(iii) If \alpha > 1 the center of A is W.

Proposition 2 If A and B are any independent and identically distributed random
partitions, then

E(d(A, B)) \le \frac{1}{2} \binom{n}{2}.

Equality holds if and only if the marginal distribution of partitions of any subset A_0 =
\{a_i, a_j\} with two elements of A is P(2, 1). The condition is satisfied if A \sim P(n, 1).

Definition P(n, 1) defines the 'most random partition' A on A_n.

From the properties of P(n, \alpha) and from the propositions, P(n, 1) is qualified to be
called the most random partition. The next section illustrates this definition.
The following table lists six random partitions A from P(n, 1) for n = 26 objects
(letters):

1        2        3         4        5        6
az       abi      ajkwz     ag       af       ackqr
bilrs    cw       bcdegmt   bix      bcmr     b
cfkmp    dp       flv       cdjmt    diu      d
dhtw     elrtx    hos       eh       eo       e
eq       fhjnou   iu        fsz      gkpqtz   fho
g        gmy      nx        kno      hlv      gimntz
jv       k        py        l        jn       jpwx
nox      qsz      qr        pqrw     swxy     l
uy       v                  u                 sv
                            v                 u
                            y                 y

In this table, the elements in a class are alphabetically ordered, and the classes of
a partition are lexicographically ordered. The latter order, which is independent of
the ordering of elements in a class, is natural and automatic if the elements of the
original set are linearly ordered. In this paper, the order of classes and elements is
disregarded.

5. Geneses of the family and the most random partition

5.1 A clustering process
The random partition P(n, \alpha) is generated by the following model. Suppose that
balls and urns are labeled 1, 2, ... (one ball and one urn for each label) and the balls
are sequentially thrown at random into the urns as follows. Ball 1 is put into Urn
1. Ball 2 is put into Urn 1 with probability 1/(\alpha + 1), and into the empty Urn 2 with
probability \alpha/(\alpha + 1), where \alpha > 0 is a constant. After n balls, suppose that Urn
j has c_j balls, j = 1, ..., k, and c_1 + \cdots + c_k = n. Ball n + 1 is put into Urn j
with probability c_j/(\alpha + n) for j = 1, ..., k and into the new empty Urn k + 1 with
probability \alpha/(\alpha + n). After n steps, the balls in an urn are regarded as a cluster,
and we obtain an ordered random partition like the above table. Disregarding the
urn numbers we obtain a random partition A \sim P(n, \alpha). For \alpha = 1, the assignment
probabilities for Ball n + 1 are proportional to the class sizes c_1, ..., c_k and 1. See
Sibuya (1993a). This process is essentially the same as Hoppe's urn model (see, e.g.,
Ewens 1990).
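The urn scheme translates almost verbatim into a sampler for P(n, \alpha) (our sketch):

import random

def random_partition(n, alpha=1.0, seed=None):
    """Sample a partition of {1..n} from P(n, alpha) via the urn scheme."""
    rng = random.Random(seed)
    urns = []                                  # each urn is a list of balls
    for ball in range(1, n + 1):
        u = rng.uniform(0, alpha + ball - 1)
        for urn in urns:
            u -= len(urn)                      # join Urn j w.p. c_j / (alpha + ball - 1)
            if u < 0:
                urn.append(ball)
                break
        else:
            urns.append([ball])                # new urn w.p. alpha / (alpha + ball - 1)
    return urns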
If the balls are indistinguishable we observe a random partition |a_1| + |a_2| + \cdots + |a_K| =
n of the number n. Again, the partitioned numbers are unordered (or ordered) provided
that the urn numbers are disregarded (or considered). In the theory of population
genetics, these cases are discussed in Hoppe's urn model, and the unordered and
ordered partitions of a number are known as Ewens' sampling formula and the Donnelly-
Tavaré-Griffiths formula, respectively (see, e.g., Ewens 1990). The logical aspects of
these random partitions are studied by Zabell (1992).
5.2 Partition of a random permutation into cycles
Let \sigma be a permutation of N_n, i.e. a bijection of N_n onto itself. The sequences
(j, \sigma(j), \sigma(\sigma(j)), \ldots), j \in N_n, form cycles and partition N_n into cycles. This is
a natural way to get a partition of N_n. Assume that all n! permutations of N_n
are equally probable; then the probability distribution on A_n induced by the cycles of
the random permutation \sigma is P(n, 1). The model appears frequently in applied
probability, statistics and computer science, and was studied in Sibuya (1993b).
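For \alpha = 1 one can therefore also sample via the cycles of a uniform random
permutation (our sketch); combined with the distance function sketched after (1), this
gives a simple Monte Carlo route to F_n below and to checking, e.g., that
E(d(A, B)) = (1/2) C(n, 2) under P(n, 1):

import random

def cycle_partition(n, seed=None):
    """Partition of {1..n} by the cycles of a uniform random permutation."""
    rng = random.Random(seed)
    sigma = list(range(1, n + 1))
    rng.shuffle(sigma)                  # sigma[j - 1] is the image of j
    seen, cycles = set(), []
    for j in range(1, n + 1):
        if j not in seen:
            cyc, k = [], j
            while k not in seen:
                seen.add(k)
                cyc.append(k)
                k = sigma[k - 1]
            cycles.append(set(cyc))
    return cycles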

6. Applications
Let A, B be independent random partitions on A_n following P(n, 1). Let F_n be the
distribution function (d.f.)

F_n(x) = P(d(A, B) \le x), \quad 0 \le x \le \binom{n}{2}.

The d.f. F_n is used for measuring the performance of classification. Some properties
of F_n are shown in the last section.
Chernoff's faces are useful for the subjective classification of a multivariate dataset. Their
performance depends on the design of the comical faces, the allocation of variates to the
face features, and the training of judges. The effects of these factors can be measured
in terms of F_n(d(A, B)). A trial experiment and its analysis are reported by Harada
and Sibuya (1991). The planning of the classification experiments of Section 1 can
be examined in a similar way.
In numerical taxonomy, suppose that we have a 'standard' classification procedure,
and some new candidate procedure. Sample data are generated from a mixture of
some known population distributions. Each multivariate observation is known to belong
to its true population, and observations from the same population are expected
to be classified in the same subset. Consider the distance between the classifications
A, B of sample data into the true populations and that obtained by a classification
procedure. The distance d(A, B) measures the difficulty of the classification of a
mixed population if the classification is the 'standard' one. Otherwise, the distance
measures the performance of a new candidate, to be compared with that of the standard
one.

7. Distance between two independent most random partitions

For the computation of the probability function of D_n = d(A, B) and its moments, for
independent random partitions A and B of n objects following P(n, \alpha), the following
facts are useful:

Proposition 3 Let A be a random partition following P(n, \alpha) and M \in A_n be a
fixed partition. The probability distribution of d(A, M) is determined by S(M).
If \alpha = 1 the probability distribution of D_n is the same as that of d(A, M), for any
M \in A_n. The choices M = V or W are convenient.

Proposition 4 The moment of D_n of degree r, for any n, can be calculated from
the probability function of D_{2r}, r = 1, 2, ...

From these, we calculate the lower moments:

E(D_n) = \binom{n}{2} \frac{2\alpha}{(\alpha+1)^2},

E(D_n^2) = \binom{n}{2} \frac{2\alpha}{(\alpha+1)^2} + \binom{n}{3} \frac{12\alpha(3\alpha+2)}{(\alpha+1)^2(\alpha+2)^2} + \binom{n}{4} \frac{24\alpha(\alpha^3 + 9\alpha^2 + 21\alpha + 6)}{(\alpha+1)^2(\alpha+2)^2(\alpha+3)^2},

\mathrm{Var}\!\left( D_n \Big/ \binom{n}{2} \right) = \frac{4\alpha(\alpha^4 + 3\alpha^3 - 3\alpha^2 - 3\alpha + 6)}{(\alpha+1)^4(\alpha+2)^2(\alpha+3)^2} + O\!\left(\frac{1}{n}\right), \quad \text{for } n \to \infty.

The variance of D_n / \binom{n}{2} at \alpha = 1 is neither maximum nor minimum, but a rather
small value, 0.008. The numerical values of the probability function of D_n under
P(n, 1) for n = 2, ..., 8, and its moments up to degree four, are shown in Sibuya (1993b).
The probability function is multi-modal and is, as well as its limit, strongly
mesokurtic.

Acknowledgements
The author has benefited from valuable discussions with participants of the IFCS
conference and with Prof. H. Yamato.
References:
Bock, H. H. (1996): Probabilistic aspects in classification, IFCS'96, Kobe, March 1996,
Invited Lecture 5.
Ewens, W. J. (1990): Population genetics theory - the past and the future. In: Mathematical
and Statistical Developments of Evolutionary Theory, Lessard, S. (ed.), NATO Adv. Sci. Inst.
Ser. C-299, Kluwer, Dordrecht, 177-227.
Gordon, A. D. (1996): Cluster validation, IFCS'96, Kobe, March 1996, Invited Lecture 1.
Harada, M. and Sibuya, M. (1991): Effectiveness of the classification using Chernoff faces,
Japanese Journal of Applied Statistics, 20, 39-48 (in Japanese).
Rand, W. M. (1971): Objective criteria for the evaluation of clustering methods, Journal
of the American Statistical Association, 66, 846-850.
Sibuya, M. (1992): Distance between random partitions of a finite set. In: Distancia '92,
Joly, S. and Le Calvé, G. (eds.), 143-145, June 22-26, Rennes.
Sibuya, M. (1993a): A random clustering process, Ann. Inst. Statist. Math., 45, 459-465.
Sibuya, M. (1993b): Random partition of a finite set by cycles of permutation, Japan
Journal of Industrial and Applied Mathematics, 10, 69-84.
Sibuya, M. and Yamato, H. (1995): Characterization of some random partitions, Japan
Journal of Industrial and Applied Mathematics, 12, 237-263.
Snijders, T. A. B. (1996): Private communication.
Zabell, S. L. (1992): Predicting the unpredictable, Synthese, 90, 205-232.
Zahn, C. T., Jr. (1964): Approximating symmetric relations by equivalence relations, J.
Soc. Indust. Appl. Math., 12, 840-847.
A Mixture Model To Classify Individual
Profiles Of Repeated Measurements
Toshiro Tango
Division of Theoretical Epidemiology, The Institute of Public Health
4-6-1 Shirokanedai, Minato-ku, Tokyo 108, JAPAN
E-mail: [email protected]

Summary: To evaluate the treatment effects in a randomized experiment, Tango (1989,
Japanese Journal of Applied Statistics, 18, 143-161) and Skene and White (1992, Statistics
in Medicine, 11, 2111-2122) independently proposed similar mixture models in which the
effects of treatment were characterized by the shape of common latent profiles and the
mixing proportion of these profiles. This paper describes the difference between the two
procedures and presents a generalized model to cope with improper longitudinal records.
The usefulness of the proposed methods is illustrated with data from clinical trials.

1. Introduction
In clinical medicine, drugs are usually administered to control some response variable,
X, reflecting the patient's disease state directly or indirectly, within a specified
range. So, in many clinical trials, some response variable is scheduled to be observed
at regular intervals for assessing changes from the baseline. Figure 1 shows the mean
treatment profiles of (a) log-transformed serum levels of glutamate pyruvate transaminase
(GPT) for 124 patients with chronic hepatitis randomly assigned to
receive the new treatment A or the standard B in a double-blinded clinical trial,
which are measured at baseline and at 1-week intervals thereafter up to 4 weeks, and (b)
its change from the baseline level for each of the treatment groups. In this paper, only
complete cases are shown and used for illustration purposes. In this clinical trial, the
effects of treatment can be observed as a "decrease" in levels of GPT as compared with
the baseline level and must be evaluated at the last observation time (4 weeks). In this
kind of clinical trial, the difference between mean treatment profiles can generally
be defined as the size of the interaction term TREATMENT x TIME. The classical and
still most frequently used procedure in the medical literature is to repeat Student's
t-test or Wilcoxon's rank sum test at each time point for the treatment difference in
change from the baseline shown in Figure 1-(b). Repeated application of the two-tailed
t-test resulted in no significant differences between the two groups at all the time
points. Since the test results are the same regardless of the time point, we tend to
conclude "no difference". However, such multiple comparisons inflate the over-all
significance level and often show significant differences at some points but not at
other points, which will generate confusion and may lead to the post hoc selection of
the most highly significant difference.
To avoid this problem, the following two kinds of procedures are well known: 1)
univariate repeated measures ANOVA where the degrees of freedom associated with the
F test for repeated factors are reduced by one of two procedures, the Greenhouse-Geisser
method and the Huynh-Feldt method, and 2) maximum likelihood based ANOVA which
can allow for more general within-subjects covariance structures, missing values and
irregularly spaced data. But all assume the within-group covariance structure to
be homogeneous between treatment groups, which seems to be a difficult assumption
to justify in clinical trials. The results of applying these procedures using BMDP


Table 1: Comparison of existing procedures to test for the interaction term TREATMENT x TIME

Methods                                P-value   -2 log(L)
ANOVA with d.f. reduction
  Greenhouse-Geisser                   0.623
  Huynh-Feldt                          0.627
ANOVA with specified cov. structure
  Compound symmetry                    0.793     862
  First order autoregressive           0.802     528
  General autoregressive               0.873     472
  Unstructured                         0.879     429
Random-effects growth curve
  2nd degree polynomial                0.753     552
  3rd degree polynomial                0.870     490

Recently, the use of summary statistics for the analysis of repeated measurements in randomized clinical trials has received growing interest, instead of the complicated models described above. This seems to be mainly due to ease of interpretation and communication. Most types of summary statistics, say the slope, the change from the baseline, etc., can be expressed in the form of a weighted linear combination of the repeated measurements. Of course, these summary statistics are also devised for detecting differences in mean treatment profiles between groups. Amongst others, Frison and Pocock (1992) recommended analysis of covariance for the post-treatment mean adjusted for pretreatment baseline values. Therefore, let us apply ANCOVA for the last measurement (at 4 weeks) adjusted for the baseline value. The result is F = 0.148 with (1,121) degrees of freedom and p = 0.701, again nonsignificant.
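Computationally, the ANCOVA just described reduces to an ordinary regression of the week-4 value on the baseline value and a treatment indicator. The following is a minimal sketch (not the authors' BMDP code); the array names are hypothetical:

```python
# Minimal ANCOVA sketch, assuming x0 = baseline log(GPT), x4 = week-4
# log(GPT), and treat = a 0/1 treatment indicator; names are hypothetical.
import numpy as np

def ancova_last_on_baseline(x0, x4, treat):
    n = len(x4)
    X = np.column_stack([np.ones(n), x0, treat])    # intercept, covariate, group
    beta, *_ = np.linalg.lstsq(X, x4, rcond=None)
    resid = x4 - X @ beta
    df = n - X.shape[1]                             # here 124 - 3 = 121
    s2 = resid @ resid / df
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
    t = beta[2] / se                                # adjusted treatment effect
    return t ** 2, (1, df)                          # F = t^2 on (1, df) d.f.
```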
Based upon all the results discussed above, we might tend to conclude that the effects of the new treatment A are not statistically different from those of the standard B. However, what should be taken into account before applying these procedures is the implication of the mean treatment profiles of the response variable. If each group is really homogeneous, namely, if the SUBJECT x TIME interaction within each group is negligible, then the mean treatment profile could reasonably represent the pattern of the treatment effects over time in each group. However, if the SUBJECT x TIME interaction is not negligible, the problem is not so simple. As procedures capable of dealing with heterogeneity among patients' profiles, random-effects or random coefficient models have been proposed, but they are still based upon the "homogeneous" assumption. For this problem, Tango (1989) proposed a mixture model for the analysis of repeated measurements in a randomized clinical trial assuming that 1) each treatment group consists of a mixture of several distinct latent profiles of changes from the baseline level, common to all the treatment groups, 2) a low-degree polynomial can represent the "mean profile" for each of these latent profiles, and 3) the effects of treatment are characterized by the shape of the latent profiles and the mixing proportions of these latent profiles. Skene and White (1992), probably without knowing my paper since it was written in Japanese, also proposed

a similar mixture model, but theirs is unsuitable for data with "improper records" and is undesirable since it estimates unrealistically ragged profiles.
The purpose of this paper is to make clear the difference between the two models, to present a generalized formulation of my model to cope with improper records, and to describe how these procedures are useful and essentially important for analyzing data and interpreting the results in some sorts of randomized clinical trials.

2. Model
Suppose that a randomized clinical trial specifies the following protocol:
1. Patients are randomly assigned to receive one of G treatments, with N_i patients in the ith treatment group.
2. The response variable X is measured T + 1 times, at baseline and at equally spaced intervals, where T is at most 4 to 6.
3. The effects of treatment are evaluated at the last measurement time.
But, in practice, the occurrence of missing values and measurements at irregularly spaced intervals is inevitable. Further, the recent tendency toward "intent-to-treat" analysis requires that all the patients registered should be included in the analysis regardless of the degree of completeness of their records. Both Tango and Skene and White formulated the model only for complete data. Therefore, we shall here generalize the model to allow for incomplete records of patients. Let X_{ij}(t_{ijk}) denote the measurement made at time t_{ijk} (k = 0, 1, ..., U_{ij} \le T) of the jth subject (j = 1, 2, ..., N_i) in the ith treatment group (i = 1, 2, ..., G), with t_{ij0} = 0. Thus X_{ij}(0) indicates the baseline level. Without loss of generality, the "improvement" induced by a treatment is defined by the "decrease" in levels of the response variable X as compared with the baseline level. Let Y_{ij}(t_{ijk}) = X_{ij}(t_{ijk}) - X_{ij}(0), the change from the baseline level, and assume the existence of M latent profiles common to all the treatment groups. Then, under the condition that the jth patient of the ith group follows the mth latent profile (m = 0, 1, 2, ..., M - 1), it can be assumed that
Y_{ij}(t_{ijk}) = \mu_m(t_{ijk}) + \epsilon_{ijk},  \epsilon_{ijk} \sim N(0, \sigma^2),   (1)

where \mu_m(t) is the mth latent profile and the \epsilon_{ijk} are, conditional on m, mutually independent. With regard to the mean profile \mu_m(t), Tango proposed a smooth function of time, namely a low-degree polynomial. For example, when M = 3, we have

\mu_0(t) = 0                                          if the subject belongs to "unchanged",
\mu_1(t) = \sum_{k=1}^{R} \beta_{1k} t^k  (< 0)       if the subject belongs to "improved",
\mu_2(t) = \sum_{k=1}^{R} \beta_{2k} t^k  (> 0)       if the subject belongs to "worsened",   (2)

where R is the degree of the polynomial, common to all the profiles except for the unchanged profile. On the other hand, Skene and White proposed a profile vector \mu_m = (\mu_{m1}, ..., \mu_{mT})'. This kind of parameterization is seemingly more flexible in representing profiles but tends to estimate undesirably ragged profiles. Further, it cannot allow for incomplete data with missing values or measurements at irregularly spaced intervals, as is also pointed out in the discussion section of their paper.
In this model, the response pattern for the jth subject of the ith group, Y_{ij} = (Y_{ij}(t_{ij1}), ..., Y_{ij}(t_{ijU_{ij}}))', j = 1, ..., N_i, has the following mixture density

g_i(Y_{ij} | \theta) = \sum_{m=0}^{M-1} p_{im} f_m(Y_{ij}),   (3)

where p_{im} denotes the mixing proportion of the mth latent profile in the ith treatment group and f_m(.) denotes the density function of the mth latent profile, given by

f_m(Y_{ij}) = \prod_{k=1}^{U_{ij}} (2\pi\sigma^2)^{-1/2} \exp\{ -(Y_{ij}(t_{ijk}) - \mu_m(t_{ijk}))^2 / (2\sigma^2) \},  m = 0, ..., M-1,   (4)

where m = 0 means the "unchanged" profile. The log-likelihood for the parameters \theta = (p_{im}, \beta_{mk}, \sigma^2), i = 1, ..., G; m = 0, 1, ..., M-1; k = 1, ..., R, is

L = \sum_{i=1}^{G} \sum_{j=1}^{N_i} \log g_i(Y_{ij} | \theta),   (5)

and the comparison of treatment effects might be reduced to the test of the following null hypothesis:

H_0 : p_{1m} = p_{2m} = ... = p_{Gm},  m = 0, 1, ..., M - 1.   (6)

Furthermore, this model provides us with a criterion for the classification of subjects into one of the components, based upon the maximum of the posterior probability that the jth subject of the ith group comes from the mth profile:

Q_{ij}(m) = \hat{p}_{im} f_m(Y_{ij}; \hat{\theta}) \Big/ \sum_{l=0}^{M-1} \hat{p}_{il} f_l(Y_{ij}; \hat{\theta}),   (7)

where \hat{\theta} is the maximum likelihood estimate (MLE).


3. EM algorithm
To obtain the maximum likelihood estimates (MLE) of the parameters \theta of the mixture distribution, a Newton-Raphson algorithm or the EM algorithm can be used. Both Tango and Skene and White recommended the use of the EM algorithm, especially for its easy implementation. Its solution, however, heavily depends on the initial values, and careful examination is then necessary to assure that the global, rather than a local, maximum has been attained. The EM algorithm is briefly outlined below:
1. Step 0: Give starting values of the posterior probabilities Q_{ij}(m).
2. Step 1 [M-step]: Given the Q_{ij}(m), the parameters p_{im} are easily given by

\hat{p}_{im} = \sum_{j=1}^{N_i} Q_{ij}(m) / N_i.   (8)

The \beta_{mk} are obtained by using a weighted least squares procedure minimizing

\sum_{i=1}^{G} \sum_{j=1}^{N_i} \sum_{m=0}^{M-1} Q_{ij}(m) \sum_{k=1}^{U_{ij}} (Y_{ij}(t_{ijk}) - \mu_m(t_{ijk}))^2,

and \hat{\sigma}^2 is obtained as

\hat{\sigma}^2 = \sum_{i=1}^{G} \sum_{j=1}^{N_i} \sum_{m=0}^{M-1} Q_{ij}(m) \sum_{k=1}^{U_{ij}} (Y_{ij}(t_{ijk}) - \hat{\mu}_m(t_{ijk}))^2 \Big/ \sum_{i=1}^{G} \sum_{j=1}^{N_i} U_{ij}.   (9)

3. Step 2 [E-step]: Calculate the posterior probabilities Q_{ij}(m) based on the estimates \hat{\theta} obtained in Step 1.
4. Step 3: Check whether \hat{\theta} has converged; if not, repeat the M-step and E-step.
When we construct a particular alternative hypothesis, we need to apply some constraints to the p_{im}'s, say p_{11} = p_{21}. In this case, expression (8) must be changed accordingly. But this kind of extra work can easily be handled in GLIM or S-PLUS.
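The outline above translates almost line by line into code. The following is a minimal sketch of the EM iteration for complete, equally spaced data (t = 1, ..., T) under a common variance, with profile m = 0 fixed at zero; the initialisation, the sign constraints on the polynomial profiles and any constraints on the p_{im}'s are omitted, so this is an illustration under those assumptions, not the author's implementation:

```python
# EM sketch for the latent-profile mixture (complete, balanced data assumed).
# Y: (n, T) array of changes from baseline; group: (n,) labels in 0..G-1.
import numpy as np

def em_profiles(Y, group, G, M, R, n_iter=100, seed=0):
    n, T = Y.shape
    t = np.arange(1, T + 1.0)
    B = np.vander(t, R + 1, increasing=True)[:, 1:]            # T x R: t, .., t^R
    Q = np.random.default_rng(seed).dirichlet(np.ones(M), n)   # start posteriors
    beta = np.zeros((M, R))
    for _ in range(n_iter):
        # M-step: mixing proportions (8), then weighted LS for each profile
        p = np.vstack([Q[group == i].mean(axis=0) for i in range(G)])
        for m in range(1, M):                                  # profile 0 stays 0
            sw = np.sqrt(np.repeat(Q[:, m], T))
            Xw = np.tile(B, (n, 1)) * sw[:, None]
            beta[m] = np.linalg.lstsq(Xw, Y.ravel() * sw, rcond=None)[0]
        mu = beta @ B.T                                        # M x T fitted profiles
        res2 = ((Y[:, None, :] - mu[None]) ** 2).sum(axis=2)   # n x M
        sigma2 = (Q * res2).sum() / (n * T)                    # common variance (9)
        # E-step: posteriors Q_ij(m); the constant term cancels on normalising
        logq = np.log(p[group]) - 0.5 * res2 / sigma2
        logq -= logq.max(axis=1, keepdims=True)
        Q = np.exp(logq)
        Q /= Q.sum(axis=1, keepdims=True)
    return p, beta, sigma2, Q
```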

4. Number of latent profiles
As is well known, the classical asymptotic theory for the distribution of likelihood ratio tests cannot be applied to testing the number of components in mixture models, because the regularity conditions are not satisfied. Self and Liang (1987) discussed the theoretical issues and gave several examples in which the distribution can be expressed as a mixture of chi-squared distributions, but its application to mixture models is difficult. Everitt (1981) and Thode, Finch and Mendell (1988) conducted simulation studies to find the percentage points of the likelihood ratio test. McLachlan (1987) applied bootstrapping methods. But most of these works focused on the test of a single normal density versus a mixture of two normal densities in the univariate case. Skene and White examined the possibility of using the empirical semi-variogram plot as an exploratory tool. But it is subjective in nature, and it is not so easy to determine the number of profiles based on such plots. So, its use cannot be recommended, especially for confirmatory clinical trials. Basically, this problem is identical to that of choosing the number of clusters in cluster analysis. So far, attempts at this kind of problem have never succeeded in obtaining a clear statistical solution; all such procedures are exploratory. Therefore, especially in confirmatory trials such as phase III clinical trials, the number of latent profiles should not be selected statistically but should be carefully discussed before starting the clinical trial and fixed in the protocol. Given the number of latent profiles, the problem becomes simple and clear, and the usual likelihood ratio tests can be applied to compare the goodness of fit of nested models (hypotheses) and to estimate the optimal \theta and R.

5. Examples
We shall consider again the data on GPT shown in Figure 1. Empirically, the distribution of GPT in healthy subjects can be approximated by a log-normal distribution, so let X_{ij}(t_{ijk}) denote here the transformed value log(GPT) (natural logarithm). Further, assume M = 5, since several other endpoints in this trial are to be evaluated by 5 ordered categories for each patient. As a criterion giving initial values of Q_{ij}(m), we may use the 5 ordered categories based on the value of S_{ij} = \sum_{k=1}^{U_{ij}} Y_{ij}(t_{ijk}), where U_{ij} = M - 1 for all the patients in this complete set of data. Several other initial values were examined to assure that the result derived below is optimal. The main results are summarized in Table 2.
Based on the likelihood ratio tests, the alternative hypothesis H_1: p_{11} = p_{21}, R = 3 was selected as the most appropriate model. Compared with the models under the null hypothesis H_0 for each of the three kinds of mean profiles, R = 2, R = 3, and Skene and White's vector \mu_m, fitting the model with the constraint p_{11} = p_{21} gave a significant decrease in deviance of 8.9 on 3 d.f., 9.6 on 3 d.f. and 8.5 on 3 d.f., respectively, regardless of the goodness of fit of the models. Among them, the cubic polynomial yielded the largest decrease, with p = 0.022. The goodness of fit of these models was also investigated by observing each patient's response profile in relation to the estimated 95% region of

Figure 1. The mean treatment profiles and mean ± 2SD at each time point, of (a) log(GPT) and (b) its change from the baseline level, for each of the new treatment A and the standard treatment B. The difference in change from the baseline was not significant (p > 0.05 by two-tailed Student's t test) at any time point.

Figure 2. (a) Individual profiles for all the patients. (b)-(f) Estimated 95% regions of profiles, \hat{\mu}_m(t) ± 2\hat{\sigma}, m = 0, 1, ..., 4 ("greatly improved", "improved", "unchanged", "worsened", "greatly worsened"), and the individual profiles classified into the corresponding region, regardless of the treatment group.

Table 2: -2 log L for each of the mixture models assuming M = 5. Degrees of freedom are shown in parentheses.

                           form of latent profile
Hypothesis                 quadratic      cubic          Skene and White
                           polynomial     polynomial
H_0                        402.7 (13)     389.1 (17)     384.7 (21)
H_1 (p_11 = p_21)          393.8 (16)     380.5 (20)     376.2 (24)
H_1 (p_1m ≠ p_2m)          393.6 (17)     380.3 (21)     376.0 (25)
Difference: H_0 vs. H_1    8.9 (3)        9.6 (3)        8.5 (3)
p-value                    0.031          0.022          0.037

profiles, \hat{\mu}_m(t) ± 2\hat{\sigma}, of the mth latent profile into which each patient was classified according to the maximum of the estimated Q_{ij}(m). In Figure 2, the estimated 95% regions of the latent profiles for the optimal model with p_{11} = p_{21} and R = 3 are illustrated together with the individual profiles classified into the corresponding profile, regardless of treatment group. Table 3 presents the classification of each patient into one of those 5 profiles. As would be expected, the \chi^2 test based on this table yielded a P-value of 0.023, very close to that of the likelihood ratio test. These results are summarized using the estimated mixing proportions \hat{p}_{im} as follows: Compared with the standard treatment B, the treatment A has
1. the same proportion of "greatly improved" (4.1%),
2. higher proportions of "improved" (32.7% vs 24.7%), but also higher "worsened" (24.5% vs 16.7%) and "greatly worsened" (11.8% vs 2.9%),
3. a lower proportion of "unchanged" (26.7% vs 51.6%).
Therefore, based on the estimated proportions and latent profiles, we cannot say that treatment A is better than B, but we can say that the effects are significantly different. These characterizations of the efficacy of treatments seem to be medically important, especially for finding key baseline factors to discriminate responders from non-responders, but they cannot be recognized by observing the mean treatment profiles over time. Figure 2 suggests how the mean treatment profiles shown in Figure 1 are misleading. The example illustrated here is not exceptional but rather typical. Tango (1989) illustrated the method with another two sets of data from randomized clinical trials.

Table 3: Classification of patients by the optimal mixture model with M = 5, R = 3 and the constraint p_11 = p_21.

        Greatly                                           Greatly
Group   improved   Improved   Unchanged   Worsened   worsened   Total
A       3          20         17          14         8          62
B       2          13         34          11         2          62

6. Discussion
It is well recognized that some unknown prognostic factors could have larger effects on the response variable than the treatment under study. If they really exist, random allocation of patients to the treatment groups helps these unknown prognostic factors to be distributed equally between the groups. Namely, several distinct latent profiles common to all the groups could be explained by these prognostic factors, and the mixture model provides useful information for investigating and identifying these unknown factors in the next stage of research.
On the other hand, it seems to me that the recent literature concerning the analysis of repeated measurements has concentrated too much on modelling the within-group covariance structure assuming homogeneity between groups, which seems unrealistic, especially for clinical trials. As Skene and White have pointed out, the observed autocorrelation could be a consequence of under-specifying the mean structure for the subjects of each treatment group. Therefore, before applying such statistically flexible but clinically unrealistic models, more attention should be paid to the validity of these assumptions and to the reasons why we take observations over time.
ACKNOWLEDGEMENTS
The author is indebted to the Japanese Foundation for Multidisciplinary Treatment of Cancer. This study was supported in part by a Grant-in-Aid for Scientific Research (Grant No. 05302064) from the Ministry of Education, Science and Culture of Japan.
References:
Crowder, M.J. and Hand, D.J. (1990): Analysis of Repeated Measures, Chapman and Hall.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977): Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-22.
Diggle, P., Liang, K.Y. and Zeger, S.L. (1994): Analysis of Longitudinal Data, Oxford Science Publications.
Everitt, B.S. (1981): A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions, Multivariate Behavioral Research, 16, 171-180.
Frison, L. and Pocock, S.J. (1992): Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design, Statistics in Medicine, 11, 1685-1704.
McLachlan, G.J. (1987): On bootstrapping the likelihood ratio test statistics for the number of components in a normal mixture, Applied Statistics, 36, 318-324.
Self, S.G. and Liang, K.Y. (1987): Asymptotic properties of maximum likelihood estimates and likelihood ratio tests under nonstandard conditions, Journal of the American Statistical Association, 82, 605-610.
Skene, A.M. and White, S.A. (1992): A latent class model for repeated measurements experiments, Statistics in Medicine, 11, 2111-2122.
Tango, T. (1989): Mixture models for the analysis of repeated measurements in clinical trials, Japanese Journal of Applied Statistics, 18, 143-161.
Thode, Jr., H.C., Finch, S.J. and Mendell, N.R. (1988): Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals, Biometrics, 44, 1195-1201.
Titterington, D.M., Smith, A.F.M. and Makov, U.E. (1985): Statistical Analysis of Finite Mixture Distributions, Wiley, New York.
Irregularly Spaced AR (ISAR) Models
Jeffrey S.C. Pai1, Wolfgang Polasek2 and Hideo Kozumi3
1 Faculty of Management, University of Manitoba
181 Freedman Crescent, Winnipeg, Manitoba, R3T 5V4, Canada
2 Institute of Statistics and Econometrics, University of Basel
Holbeinstrasse 12, CH-4051 Basel, Switzerland
3 Faculty of Economics and Business Administration, Hokkaido University
Kita 9 Nishi 7 Kita-ku, Sapporo 060, Japan

Summary: High frequency data in finance are time series which are often measured at unequally or irregularly spaced time intervals. This paper suggests a modeling approach by so-called AR response surfaces where the AR coefficients are declining functions in continuous lag time. The irregularly spaced ISAR models contain the usual AR models as a special case if the time series is equally spaced. We illustrate our methodology with two examples.

1. Introduction
For some years now the set of available data from financial markets has increased
rapidly. So far only a small subset of the information available has been used. In
the 70'ies, most of the empirical studies were based on yearly, quarterly, or monthly
data. This data could typically be modeled by random walks or linear models such
as AruMA-models (Box and Jenkins, 1976). In the 80'ies, the study of weekly and
daily financial data lead to non-linear models such as ARCH models (Engle, 1982).
Recently, empirical studies through analyzing intra-daily data are gaining new in-
sights into the behavior of financial markets (see Guillaume et al. 1994).
For the Foreign Exchange (FX) market, Muller et al. (1990) include the daily het-
eroskedasticity of the volatility while Goodhart and Figliuoli (1991) discovered neg-
ative first order autocorrelation at one minute intervals. In fact, daily data are
computed on the basis of the average of five intra-daily quoted prices of the largest
banks around a particular time. The spot intra-daily FX data are observed as an
irregularly spaced time series (see Olsen & Associates, 1993). The standard methods
of data analysis, on the contrary, are based upon equally spaced da.ta. Typically, a
certain fixed interval is chosen, and some averaging procedure for all the transactions
within the intervals is done in order to apply standard methods of analysis. Several
problems arise if the data from irregularly spaced time series are converted to regu-
larly spaced time series. The appropriate interval will depend upon the transaction
frequency. These intervals vary with different markets. There could be a problem of
missing data if the length of the interval is too short. On the other hand, information
is lost if the length of the interval is too long.
Data from financial markets exhibit high correlation between the ticking frequencies
and the volatility of the time series. It is widely believed that the durations between
transactions may carry information about the volatility.
In Section 2, we describe a new class of AR models for irregularly spaced time series,
the ISAR(p) models, as well as the ordinary least squares result. Section 3 illustrates
our methodology with two examples. Section 4 concludes.


2. The ISAR(p) models
Consider the time series model with p general lag response functions \Phi_1[.], ..., \Phi_p[.] observed at n irregular time points

Y_{t_i} = \sum_{j=1}^{p} \Phi_j[\Delta_j t_i] Y_{t_{i-j}} + \epsilon_i,  i = 1, ..., n,   (1)

where Y_{t_i} is the observed time series at time t_i, \Delta t_i is the time span between two adjacent observations, and \Delta_j t_i is the distance between observations which are j "ticks" apart: \Delta_j t_i = t_i - t_{i-j}. For the residual process we assume \epsilon_i \sim N(0, \gamma_0), where \gamma_0 is the white noise variance. Note that the parameter functions \Phi_j[\Delta_j t_i] are functions of \Delta_j t_i as well. These functions pick up the effect of the previous observations and noises. Possible parametrizations for the lag response functions, i.e., for the decay of the \Phi-functions, are:
- Constant functions (in \Delta_j t_i, or in time):

\Phi_j[\Delta_j t_i] = \phi_j.   (2)

This special case is the usual AR model for equally spaced time series.
- Exponential functions:

\Phi_j[\Delta_j t_i] = \phi_{aj} + \phi_{bj} \exp(-\Delta_j t_i).   (3)

- Reciprocal functions:

\Phi_j[\Delta_j t_i] = \phi_{aj} + \phi_{bj} / \Delta_j t_i.   (4)
For different \phi_a and \phi_b parameters, we will have different decay functions (in absolute value). The stationarity condition for the irregularly spaced model in (1) depends on the parameter space of the \phi's as well as on the distribution of \Delta t_i. This may be obtained by taking the expectation of \Phi_j[\Delta_j t_i] with respect to \Delta t_i and assuming the Y_{t_i} are observed at regularly spaced intervals. For example, assume
(S1) the \Delta t_i are independent and identically distributed as Gamma random variables with parameters \alpha and \beta,
(S2) \Delta t_i and \epsilon_i are independent.
From the exponential function we then have, for p = 1,

\phi(z) = 1 - \{\phi_a + \phi_b (1 + \beta)^{-\alpha}\} z.   (5)

For the process to be stationary, the roots of \phi(z) = 0 must lie outside the unit circle.
Consider the linear equation Y = X\Phi + \epsilon, where the parameter vector is \Phi = (\phi_{a1}, \phi_{b1}, ..., \phi_{ap}, \phi_{bp})' and the dependent variable vector is Y = (Y_1, ..., Y_n)'. The n x 2p regression matrix X is built up from lagged unweighted and weighted dependent variables, where the weights depend on the elapsed duration time. For each lag j the first regressor component is

X_{i, 2j-1} = Y_{t_{i-j}},

and the second regressor component is

X_{i, 2j} = Y_{t_{i-j}} \exp(-\Delta_j t_i)  (for the exponential function), or
X_{i, 2j} = Y_{t_{i-j}} / \Delta_j t_i  (for the reciprocal function),

for i = 1, ..., n and j = 1, ..., p. Note that the constant model has an n x p regression matrix containing only the first regressor components. Thus, an unrestricted estimate of the decay parameters of the exponential or reciprocal model is given by the ordinary least squares estimate of \Phi:

\hat{\Phi} = (X^t X)^{-1} X^t Y.
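As a concrete illustration of this setup, the following sketch (ours, not the authors' code) simulates an ISAR(1) with a reciprocal decay function and recovers (\phi_a, \phi_b) by OLS on the unweighted and weighted lagged regressors; the parameter values follow Example 1 below:

```python
# Simulate Y_ti = (phi_a + phi_b/dt_i) Y_t(i-1) + eps_i with Gamma gaps,
# then estimate (phi_a, phi_b) by ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
n, phi_a, phi_b = 1000, 0.3, 0.3
dt = rng.gamma(shape=2.0, scale=0.5, size=n)      # alpha = 2, beta = 0.5, mean 1
eps = rng.standard_normal(n)
y = np.zeros(n)
for i in range(1, n):
    y[i] = (phi_a + phi_b / dt[i]) * y[i - 1] + eps[i]

X = np.column_stack([y[:-1], y[:-1] / dt[1:]])    # unweighted and weighted lags
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(coef)                                       # estimates of (phi_a, phi_b)
```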

3. Illustrative example
We present two examples to illustrate our methodology. The first example is based on a simulated ISAR(1) model and the second example is a high frequency exchange rate from the FX market.
Example 1: The first data set consists of three time series, each of length 1000, simulated from an ISAR(1) with
(i) a constant function (\phi = 0.3),
(ii) an exponential function (\phi_a = 0.3, \phi_b = 0.3),
(iii) a reciprocal function (\phi_a = 0.3, \phi_b = 0.3),
respectively. The \Delta t_i's are sampled from a Gamma distribution with parameters \alpha = 2 and \beta = 0.5 (giving a mean ticking frequency of 1), and the \epsilon_i's are sampled from a normal distribution with mean 0 and variance 1. The ordinary least squares estimates together with the AIC values for the three different functions are shown in Table 1. Our approach is successful in two respects. First, the OLS estimates are all very close to the values we sampled from (\phi = 0.3):
(i) constant function (\hat{\phi} = 0.28),
(ii) exponential function (\hat{\phi}_a = 0.31, \hat{\phi}_b = 0.23),
(iii) reciprocal function (\hat{\phi}_a = 0.29, \hat{\phi}_b = 0.30).
Table 1: Simulated ISAR(1) model: ordinary least squares estimates

                       Constant           Exponential         Reciprocal
True parameters        model*)            model*)             model*)
Const.  φ = 0.3        φ = 0.28 (.02)     φa = 0.32 (.04)     φa = 0.28 (.02)
                                          φb = 0.10 (.08)     φb = 0.00 (.01)
        AIC            8471.64            8472.15             8473.64
Expon.  φa = 0.3       φ = 0.41 (.02)     φa = 0.31 (.04)     φa = 0.38 (.02)
        φb = 0.3                          φb = 0.23 (.08)     φb = 0.02 (.01)
        AIC            8479.27            8472.25             8474.22
Recip.  φa = 0.3       φ = 0.86 (.01)     φa = 0.00 (.01)     φa = 0.29 (.01)
        φb = 0.3                          φb = 1.77 (.01)     φb = 0.30 (.01)
        AIC            30952.96           16502.87            8469.70
*) Standard errors are in parentheses.

Second, we are able to select the correct model based on AIC. More interesting results can also be seen in Table 1. The parameter estimate (\hat{\phi} = 0.41) obtained by fitting the constant function to data sampled from the exponential function is very close to 0.42, which is the value implied by (5) with \alpha = 2 and \beta = 0.5. Similarly, from the reciprocal function we have

E[\Phi_1[\Delta t_i]] = \phi_a + \phi_b E[1/\Delta t_i] = \phi_a + \phi_b / \{\beta(\alpha - 1)\}.   (6)

This procedure gives 0.9, which is very close to the value 0.86 in Table 1.
Example 2: The data consist of the FX rate quotes for the DEM-JPY exchange rate distributed by Olsen and Associates (1993). We first introduce the data definitions based on Guillaume et al. (1994). The logarithmic price Z is defined as the log of the geometric mean of the bid and the ask prices, P_bid and P_ask:

Z = \log \sqrt{P_{bid} P_{ask}} = \{\log(P_{bid}) + \log(P_{ask})\} / 2.   (7)

The change of price at time t_i, Y_{t_i}, is defined as

Y_{t_i} = (Z_{t_i} - Z_{t_{i-1}}) / \delta t,   (8)

where \delta t is some fixed time interval. The change of the logarithmic price is often referred to as the "FX return".
For the study of the irregularly spaced time series, we adopt the same data definitions as in Guillaume et al. except for (8). We define the return as the duration-dependent difference of the logarithmic price Z_{t_i}:

Y_{t_i} = (Z_{t_i} - Z_{t_{i-1}}) / \Delta t_i.   (9)

Figure 1: Irregularly spaced time series plot of the FX return in hours.

Differencing a time series means taking the difference of adjacent points and dividing it by the spacing between them. We therefore define the difference of an irregularly spaced time series as the sequence of slopes between adjacent points. Instead of interpolating the data on a fixed time interval \delta t, we define Y_{t_i} in a natural way, as shown in (9). The difference of the logarithmic series gives the growth rates; therefore the logarithmic price change is simply the average of the growth rates of the bid and the ask time

series.
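In code, definitions (7) and (9) amount to little more than two lines; the sketch below (with hypothetical input arrays of quote times, bid and ask prices) computes the duration-dependent returns:

```python
# Logarithmic price (7) and duration-dependent return (9) from quote data.
import numpy as np

def fx_returns(times, bid, ask):
    z = 0.5 * (np.log(bid) + np.log(ask))    # logarithmic price, eq. (7)
    dt = np.diff(times)                      # irregular durations Delta t_i
    return np.diff(z) / dt                   # slopes between quotes, eq. (9)
```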
Figure 1 shows the plot of the return in hours for the first week of June, 1993, with sample size n = 3000. Table 2 shows the ordinary least squares estimates obtained by fitting ISAR(p) models. The negative first order autocorrelation is consistent with the study by Goodhart et al. (1991).
The OLS estimates in Table 2 for the ISAR(p) model show a clear negative estimate which is significant up to order 2, but the higher order effects do not reduce the residual variance substantially. This result is also confirmed by the ML estimation method. A similar picture can be seen for the ISAR(p) models with exponential or reciprocal decay functions. All parameter estimates are negative where the limiting decay parameters are not significant. The slope decay parameter is significant up to order 2, but the residual variance is again rather constant.

Table 2: Ordinary least squares estimates for the DEM/JPY exchange rate

p    φ1               φ2               φ3               γ0
     parameter estimates (standard error)               residual variance
CONSTANT function
1    -.163 (.018)                                       .0195
2    -.170 (.018)     -.047 (.018)                      .0195
3    -.171 (.018)     -.049 (.019)     -.016 (.018)     .0195
EXPONENTIAL function
1     .964 (.314)                                       .0195
     -1.163 (.323)
2    1.008 (.316)      .262 (.228)                      .0195
     -1.217 (.325)    -.328 (.241)
3    1.007 (.316)      .274 (.232)      .048 (.189)     .0195
     -1.217 (.325)    -.343 (.245)     -.070 (.205)
RECIPROCAL function
1     .038 (.029)                                       .0193
     -.0022 (.0003)
2     .038 (.029)      .036 (.030)                      .0192
     -.0024 (.0003)   -.0024 (.0007)
3     .037 (.029)      .037 (.030)      .021 (.032)     .0192
     -.0024 (.0003)   -.0026 (.0007)   -.0018 (.0013)
(For the exponential and reciprocal functions, the upper row of each entry gives φa and the lower row φb.)

4. Conclusions
This paper has demonstrated how we can estimate irregularly spaced AR models by reciprocal or exponential ISAR(p) models, where the attributes "reciprocal" and "exponential" refer to the form of the response function of the AR model over the lag interval. Based on the setup for ISAR(p) models in the previous section, we can include an autoregressive conditional heteroskedasticity component in our irregularly spaced models (see Pai et al. 1995). Also, the modeling process is flexible enough to incorporate an ISAR model for the ticking process, i.e., for the irregularly observed time spacing process as well.

References:
Box, G.E.P. and Jenkins, G.M. (1976): Time Series Analysis: Forecasting and Control, Holden Day, San Francisco.
Engle, R.F. (1982): Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation, Econometrica, 50, 987-1008.
Goodhart, C.A.E. and Figliuoli, L. (1991): Every minute counts in financial markets, Journal of International Money and Finance, 10, 23-52.
Guillaume, D.M., Dacorogna, M.M., Dave, R.R., Müller, U.A., Olsen, R.B. and Pictet, O.V. (1994): From the bird's eye to the microscope: A survey of new stylized facts of the intra-daily foreign exchange, Olsen and Associates.
Müller, U.A., Dacorogna, M.M., Olsen, R.B., Pictet, O.V., Schwarz, M. and Morgenegg, C. (1990): Statistical study of foreign exchange rates, empirical evidence of a price change scaling law, and intraday analysis, Journal of Banking and Finance, 14, 1189-1208.
Olsen and Associates (1993): Data distribution for HFDF-1.
Pai, J.S.C., Polasek, W. and Kozumi, H. (1995): Irregularly spaced AR and ARCH models, WWZ-Discussion Paper Nr. 9509, University of Basel.
Two Types of Partial Least Squares Method
in Linear Discriminant Analysis

Hyun Bin Kim1 and Yutaka Tanaka2

1 System Engineering Research Institute,


P.O.BOX 1, Yusung-Gu, Taejon, 305-600, Korea
2 Department of Environmental and Mathematical Sciences, Okayama University,
Tsushima, Okayama 700, Japan

Summary: The partial least squares linear discriminant function (PLSD) is a new discriminant function proposed by Kim and Tanaka (1995a). PLSD uses the idea of the partial least squares (PLS) method, originally developed for multiple regression analysis, in discriminant analysis. In this paper, two types of PLSD are investigated and evaluated in a simulation study. In the first type, named PLSDA (all), a common pooled within-group covariance matrix of all groups is used in modeling PLSD to discriminate all pairs of groups. In the second type, named PLSDT (two), pooled within-group covariance matrices based on the related two groups are used in modeling PLSD to discriminate pairs of groups. The results of the simulation study show that PLSDA performs better than PLSDT in all situations when the covariance matrices are equal in all groups, while PLSDT is better than PLSDA in well-conditioned situations when the covariance matrices differ among the groups.

1. Introduction
Partial least squares regression (PLSR), originally developed by Wold (1975) in the field of chemometrics, is a regression method which intends to reduce the effect of multicollinearity through a reduction of the dimensionality of the explanatory variables, like principal components regression (PCR). The performance of PLSR has been investigated by several authors, including Frank and Friedman (1993) and Kim and Tanaka (1994), in simulation studies. It is known that PLSR has also worked well in many practical problems in chemical fields.
The basic idea of PLS is closely related to the conjugate gradient method for solving linear equations or for calculating inverse matrices in numerical analysis (see, e.g., Wold et al. 1984). Taking this aspect into consideration, we can apply the algorithm of PLS to other statistical methods in which we need to calculate the inverse of a covariance matrix. In discriminant analysis the inverse of the covariance matrix is needed, and the direct application of ordinary linear discriminant functions cannot be successful for so-called ill-conditioned or multicollinear data sets.
Kim and Tanaka (1995a) proposed a new linear discriminant function using the partial least squares method (PLSD) and compared the performance of PLSD with that of the ordinary linear discriminant function (LDF) by applying both to two real data sets, i.e., Fisher's iris data and Yoshimura's arc pattern data (see Yoshimura et al., 1993), and by a Monte Carlo simulation study (Kim and Tanaka (1995a, 1996)). The results of these studies suggest that there are no great differences between the performances of PLSD and LDF in the case of no multicollinearity, and that the performance of PLSD is remarkably better than that of LDF in the case of a high degree of multicollinearity, i.e., in poorly conditioned situations.
In this paper we consider two types of PLSD. In the first type, a common pooled within-group covariance matrix is calculated using the observations in all groups. In the second type, each within-group covariance matrix is calculated using the

observations in only the related two groups. We abbreviate the former as PLSDA (all) and the latter as PLSDT (two). These two types of PLSD are compared through a simulation study.
In Sections 2, 3 and 4 the algorithms of PLSR, the ordinary LDF and the proposed PLSD are briefly reviewed, respectively. In Section 5 the two types of PLSD are described and compared through a simulation study. Finally, Section 6 provides a short discussion on PLSD.

2. Wold's PLSR algorithm
PLSR (Wold 1975) is a method in which a covariance vector w_1 between y and X is calculated first and used as a coefficient vector to calculate t_1, a linear combination of the columns of X. Simple OLS regression is then conducted to predict each of X and y using t_1 as the explanatory variable, and the residuals X_1 and y_1 from these regressions are computed. Next, another covariance vector w_2 based on these residuals, and t_2, a linear combination of X_1, are calculated, and simple regressions on t_2 for each of X_1 and y_1 are conducted. The parts left unfitted by the regressions on t_1 are thus complemented by the regressions on t_2. In the same way, covariance vectors (w_3, w_4, ...) and linear combinations (t_3, t_4, ...) are calculated and simple regressions are conducted until a good fit is obtained. The algorithm of PLSR proposed by Wold is described as follows:
1. Initialization (centering):
X_0 <- X - 1\bar{x}^t,  y_0 <- y - \bar{y}1
2. For k = 1, 2, ..., K:
2.1 w_k = X_{k-1}^t y_{k-1}
2.2 t_k = X_{k-1} w_k
2.3 p_k = X_{k-1}^t t_k / t_k^t t_k  ( = X^t t_k / t_k^t t_k )
2.4 q_k = y_{k-1}^t t_k / t_k^t t_k  ( = y^t t_k / t_k^t t_k )
2.5 X_k = X_{k-1} - t_k p_k^t
2.6 y_k = y_{k-1} - t_k q_k
3. Calculation of regression coefficients:
\hat{\beta}_K = W_K (W_K^t X^t X W_K)^{-1} W_K^t X^t y
4. Prediction equation:
\hat{y} = (\bar{y} - \hat{\beta}_K^t \bar{x}) + \hat{\beta}_K^t x,
where n and p indicate the numbers of observations and explanatory variables, respectively, X is an n x p matrix of explanatory variables, y is an n x 1 vector of the response variable, 1 is a vector with 1 in all its elements, \bar{x} is the mean vector of X, \bar{y} is the mean of y, and K (\le rank of X) is the number of components employed in the model. PLS becomes equivalent to OLS if it uses all possible components. It is important how to determine the number of components K when applying PLSR.
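For reference, the algorithm above can be transcribed directly into a few lines of code; the following sketch follows the listing step by step (as in the listing, the w_k are not normalised, which only rescales t_k and leaves the fitted values unchanged):

```python
# A direct sketch of Wold's PLSR algorithm; X is n x p, y is length n.
import numpy as np

def plsr(X, y, K):
    xbar, ybar = X.mean(axis=0), y.mean()
    Xk, yk = X - xbar, y - ybar                   # step 1: centering
    W = []
    for _ in range(K):                            # step 2
        w = Xk.T @ yk                             # 2.1 covariance vector
        t = Xk @ w                                # 2.2 component scores
        p = Xk.T @ t / (t @ t)                    # 2.3 X loadings
        q = yk @ t / (t @ t)                      # 2.4 y loading
        Xk = Xk - np.outer(t, p)                  # 2.5 deflate X
        yk = yk - t * q                           # 2.6 deflate y
        W.append(w)
    W = np.column_stack(W)                        # W_K
    Xc = X - xbar
    A = W.T @ (Xc.T @ Xc) @ W
    beta = W @ np.linalg.solve(A, W.T @ Xc.T @ (y - ybar))  # step 3
    return beta, ybar - beta @ xbar               # slope and intercept (step 4)
```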

3. LDF
Suppose that there exist g groups \pi_1, \pi_2, ..., \pi_g and that an observation x from group \pi_i follows a p-variate normal distribution N(\mu_i, \Sigma) with a common covariance matrix \Sigma. Also suppose that the costs c(j|i) due to misclassifying an observation from \pi_i to \pi_j, for i, j = 1, 2, ..., g, are the same for all pairs (i, j). Then the LDF for the ith group which minimizes the total cost is expressed as

LDF_i(x) = \mu_i^t \Sigma^{-1} x - (1/2) \mu_i^t \Sigma^{-1} \mu_i + \ln(q_i),   (1)

where q_i is the prior probability of drawing an observation from group \pi_i. In the sample version, \mu_i, \Sigma and q_i are replaced by the estimates \bar{x}_i, \hat{\Sigma} and \hat{q}_i, respectively. Applying the sample version LDF to, say, the ith and jth groups, we obtain the LDF_{ij} for those groups, defined by

LDF_{ij}(x) = (\bar{x}_i - \bar{x}_j)^t \hat{\Sigma}^{-1} x - (1/2)(\bar{x}_i - \bar{x}_j)^t \hat{\Sigma}^{-1}(\bar{x}_i + \bar{x}_j) + \ln(\hat{q}_i) - \ln(\hat{q}_j),   (2)

with the classification rule:
assign x to \pi_i if LDF_{ij}(x) \ge 0, or to \pi_j if LDF_{ij}(x) < 0.

4. Classification by the PLS method
4.1 Regression coefficient vector of PLSR
The regression coefficient vector of PLSR can be written as follows:

\hat{\beta}_{PLSR} = W_K (W_K^t X^t X W_K)^{-1} W_K^t X^t y
                   = W_K (W_K^t X^t X W_K)^{-1} W_K^t (X^t X)(X^t X)^{-1} X^t y
                   = H_K \hat{\beta}_{OLS},

where

H_K = W_K (W_K^t X^t X W_K)^{-1} W_K^t (X^t X).   (3)

Let U(W_K) be the linear subspace spanned by the columns of the matrix W_K, and let the inner product in U be defined by (a, b) = a^t X^t X b. Then H_K is the explicit expression for an orthogonal projector onto the K-dimensional subspace U (Rao 1973). Consequently the coefficient \hat{\beta}_{PLSR} is obtained by projecting the ordinary regression coefficient vector \hat{\beta}_{OLS} onto the subspace U, stabilizing the estimator by reducing the dimensions.
In principal components regression (PCR), the regression coefficient vector is stabilized by reducing the dimension of the eigenspace of X^t X, which is calculated from X only. In PLSR, however, the regression coefficient vector is stabilized by reducing the dimension of the subspace spanned by the successively derived covariance vectors between X and y. We are of the opinion that this difference between PCR and PLSR gives PLSR a slightly better performance than PCR. Comparisons between PLSR and PCR were reported by Frank and Friedman (1993) and Kim and Tanaka (1994), among others.

4.2 PLSD
PLSD was proposed by Kim and Tanaka (1995a) for the purpose of reducing the effects of multicollinearity. The proposed PLSD is obtained by replacing \hat{\Sigma}^{-1} in LDF_{ij}, for the ith and jth groups, by H_{ijK} defined as

H_{ijK} = W_{ijK} (W_{ijK}^t \hat{\Sigma} W_{ijK})^{-1} W_{ijK}^t.   (4)

Namely,

PLSD_{ij}(x) = (\bar{x}_i - \bar{x}_j)^t H_{ijK} x - (1/2)(\bar{x}_i - \bar{x}_j)^t H_{ijK}(\bar{x}_i + \bar{x}_j) + \ln(\hat{q}_i) - \ln(\hat{q}_j).   (5)

Here W_{ijK} consists of the K covariance vectors obtained by applying PLSR to the data of the ith and jth groups, where a dummy variable with the values -n_j/(n_i + n_j) and n_i/(n_i + n_j) is used as the response variable to indicate which group an observation belongs to. The basic idea behind this is that the LDF is mathematically equivalent to the OLS regression of binary responses.
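To make the construction concrete, the sketch below (an illustration in our own notation, not the authors' program) builds W_ijK by running the PLS recursion on the two groups with the dummy response just described, forms H_ijK as in (4) from a supplied pooled covariance matrix S (Σ̂ of either the PLSDA or PLSDT type, introduced in Section 5), and returns the score function (5) under equal priors:

```python
# PLSD_ij sketch: data matrices X1, X2 (rows = observations), a pooled
# covariance estimate S, and K components; equal priors assumed.
import numpy as np

def plsd_rule(X1, X2, S, K):
    n1, n2 = len(X1), len(X2)
    X = np.vstack([X1, X2])
    y = np.r_[np.full(n1, -n2 / (n1 + n2)),
              np.full(n2,  n1 / (n1 + n2))]       # dummy response
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W = []
    for _ in range(K):                            # K covariance vectors W_ijK
        w = Xk.T @ yk
        t = Xk @ w
        Xk = Xk - np.outer(t, Xk.T @ t / (t @ t))
        yk = yk - t * (yk @ t / (t @ t))
        W.append(w)
    W = np.column_stack(W)
    H = W @ np.linalg.solve(W.T @ S @ W, W.T)     # H_ijK of eq. (4)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    mid = 0.5 * (X1.mean(axis=0) + X2.mean(axis=0))
    return lambda x: d @ H @ (x - mid)            # PLSD_ij(x) of eq. (5)
```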
In the case of three or more groups, all gC2 PLSD_{ij}'s are calculated and an observation x is assigned to the ith or jth group based on PLSD_{ij} for all possible pairs. Then x is classified into the group to which it is most often assigned. The proposed method coped very well with real data with three or more groups, such as Fisher's iris data with three groups (Kim and Tanaka, 1995a) and Yoshimura et al.'s arc pattern data with twenty groups (Kim and Tanaka, 1996).
The classification obtained by PLSD_{ij} becomes equivalent to that by the LDF if all possible components are used in PLSD_{ij}, for the same reason that PLSR becomes OLS regression if the full set of components is employed. This fact suggests that PLSD has at least equal and possibly better performance than the LDF if the number of components is properly determined. It is therefore important how to choose the number of components K, as in the case of PLSR.
As a criterion for choosing K, the correct discrimination rate (CDR), CDR = 100 x \sum_{i=1}^{g} e_i / \sum_{i=1}^{g} n_i, is computed using the cross-validation method, where e_i is the number of correctly classified observations in the ith group. We search for the PLSD model with the maximum value of the cross-validated CDR. In the case where two or more PLSD models have the same maximum value of CDR, we choose, as a tentative rule, the PLSD model with the least number of components; a schematic version of this rule follows.
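Here cv_cdr(K) stands for a hypothetical routine returning the cross-validated CDR of the PLSD model with K components:

```python
# Choose the smallest K attaining the maximum cross-validated CDR.
def choose_K(candidates, cv_cdr):
    rates = {K: cv_cdr(K) for K in candidates}
    best = max(rates.values())
    return min(K for K in candidates if rates[K] == best)
```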
It was shown in Kim and Tanaka (1996), where the two methods were compared through a simulation study and real data, that PLSD performs better than the LDF.
5. PLSDA and PLSDT
As discussed in Section 1, two types of PLSD, named PLSDA (All) and PLSDT (Two), are studied. In PLSDA the pooled within-group covariance matrix of all groups is used for \hat{\Sigma} in eq. (5):

\hat{\Sigma}_{all} = \sum_{i=1}^{g} (N_i - 1) S_i \Big/ \sum_{i=1}^{g} (N_i - 1),   (6)

and in PLSDT the pooled within-group covariance matrix of the related two groups, the ith and jth, is used for \hat{\Sigma} in eq. (5):

\hat{\Sigma}_{ij} = \{(N_i - 1) S_i + (N_j - 1) S_j\} / (N_i + N_j - 2),   (7)

where S_i denotes the sample covariance matrix of the ith group. Naturally, it is expected that PLSDA fits the case where the covariance matrices are equal for all groups and that PLSDT is suitable for the case where the groups have unequal covariance matrices.

5.1 Design of the simulation study
This section summarizes the design of the Monte Carlo experiments for evaluating the performance of the two types of PLSD. Twenty-four different situations were set up in this simulation study. The number of groups was fixed at three, and the distances among the centroids of the three groups were fixed with ratios 3:4:5. The situations were classified by the training-sample sizes (balanced data, n_1 = 20, n_2 = 20, n_3 = 20; unbalanced data, n_1 = 10, n_2 = 20, n_3 = 30), the structure of the (population) covariance matrices (equal covariances, \Sigma_1 = \Sigma_2 = \Sigma_3; unequal covariances, \Sigma_1 \ne \Sigma_2 \ne \Sigma_3) and the number of variables (p = 5, 15, 25, 35, 45, 55). Artificial data for the three groups in the case of equal covariances were generated so that they had the same (population) condition number \kappa, defined by \lambda_{max}/\lambda_{min} of the eigenvalues of \Sigma, with the value \kappa = 900.0. In the case of unequal covariances the condition numbers were assigned as \kappa_1 = 900.0, \kappa_2 = 400.0 and \kappa_3 = 1.0. For each of the 24 (= 2 (sample size) x 2 (covariance matrix) x 6 (number of variables)) situations, 50 (= N_t) repetitions were performed.
The criterion used to evaluate the performance was the average value of the CDRs over the 50 repetitions, in which the CDR was evaluated with independently generated test samples of 50 times the size of the training samples:

\overline{CDR} = (1/N_t) \sum_{i=1}^{N_t} CDR_i.

Artificial data with preassigned degrees of multicollinearity were generated in this experiment by the method proposed by Kim and Tanaka (1995b).

Figure 1: Comparisons between PLSDA and PLSDT. (Four panels: balanced/unbalanced data with equal/unequal covariances; in each panel curve 1 = PLSDA and curve 2 = PLSDT, plotted against the number of variables.)

5.2 Comparison between PLSDA and PLSDT
The results of the experiment are given in Figure 1. Note that, since the total training-sample size is fixed at 60 for all groups, the degrees of freedom of the sample covariance matrices decrease and, therefore, the degree of ill-conditioning increases as the number of variables increases.
As expected, PLSDA shows remarkably better performance than PLSDT in all situations in the case of equal covariances. In the case of unequal covariances, there are great differences between the CDRs of PLSDA and PLSDT at p = 5, 15 (well-conditioned situations), but no great differences at p = 25, 35, 45, 55 (poorly conditioned situations).
We can explain these results in such a way that the advantage of PLSDT in the case of unequal covariances is not large enough to overcome the disadvantage of the smaller degrees of freedom of its covariance estimates compared to the case of PLSDA. That is, the precision of the estimate \hat{\Sigma}_{ij} is worse in PLSDT than in PLSDA, because PLSDT always has a smaller sample size than PLSDA for estimating the covariance matrices. We think there is a relationship between sample size and dimension such that a larger number of dimensions in the explanatory data can yield better results in well-conditioned situations, but not for ill-conditioned or multicollinear data sets, because of the unstable covariance matrix. So, we can suggest that the technique of dimension reduction is useful in order to avoid the problems of singularity or multicollinearity resulting from ill-conditioned or multicollinear data sets.

p \ cov.        5          15         25         35         45         55
Bal. Equal      2.16/2.11  2.69/3.29  3.42/4.69  3.56/4.00  4.10/5.38  3.45/4.83
Bal. Uneq.      5.41/5.30  3.11/3.28  4.05/4.34  3.92/3.23  4.49/3.56  3.72/3.28
Unbal. Equal    1.65/2.58  3.48/3.83  3.82/5.53  4.23/5.77  4.95/4.91  3.96/3.78
Unbal. Uneq.    5.41/4.79  4.05/3.90  4.01/4.72  3.82/3.85  3.98/3.35  3.64/4.08

TABLE 1. Standard deviations of PLSDA and PLSDT in each situation. The left value in each cell is for PLSDA, the right for PLSDT.

6. Discussion
In this paper two types of partial least squares linear discriminant function (PLSD) are investigated through a simulation study. In the first type, abbreviated PLSDA, the pooled within-group covariance matrix of all groups is used as an estimate of the common covariance matrix in the discriminant function, while in the second type, abbreviated PLSDT, the pooled within-group covariance matrix of only the ith and jth groups is used for discriminating these two groups. From the results of the simulation study we can conclude:
(1) When the covariance matrices are common to all groups, PLSDA performs better than PLSDT, as expected.
(2) When the covariance matrices are different, PLSDT is expected to perform better than PLSDA. However, we can say so only in well-conditioned situations, not in poorly or ill-conditioned situations.
As discussed by Flury et al. (1994), it is natural to use principal components in discriminant analysis (PCD) to reduce the dimensionality of the explanatory variables, as in regression analysis; we therefore plan to compare PLSD with PCD. Moreover, it remains an open and future task to compare PLSD with other kinds of discriminant functions such as Friedman's (1989) regularized discriminant function (RDF) using shrinkage estimators (see, e.g., James and Stein 1961; Efron and Morris 1976; Stigler 1990), which intend to reduce the effects of multicollinearity and can be applied in poorly conditioned or ill-conditioned situations (see, e.g., Titterington 1985; O'Sullivan 1986).

References:
Efron, B. and Morris, C. (1976): Multivariate Empirical Bayes and Estimation of Covariance Matrices, The Annals of Statistics, Vol. 4, pp.22-32.
Flury, B., Schmid, M. J. and Narayanan, A. (1994): Error Rates in Quadratic Discrimination with Constraints on the Covariance Matrices, Journal of Classification, Vol. 11, pp.101-120.
Frank, I. E. and Friedman, J. H. (1993): A Statistical View of Some Chemometrics Regression Tools, Technometrics, Vol. 35, No. 2, pp.109-148.
Friedman, J. H. (1989): Regularized Discriminant Analysis, Journal of the American Statistical Association, Vol. 84, pp.165-175.
James, W. and Stein, C. (1961): Estimation with Quadratic Loss, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp.361-379, Berkeley: University of California Press.
Kim, H. B. and Tanaka, Y. (1994): A Numerical Study of Partial Least Squares Regression with an Emphasis on the Comparison with Principal Component Regression, Proceedings of the Eighth Japan and Korea Joint Conference of Statistics, pp.83-88, Okayama, Japan.
Kim, H. B. and Tanaka, Y. (1995a): Linear Discriminant Function Using Partial Least Squares Method, Proceedings of the International Conference on Statistical Methods and Statistical Computing for Quality and Productivity Improvement (ICSQP '95), Vol. 2, pp.875-881, Seoul, Korea.
Kim, H. B. and Tanaka, Y. (1995b): Generating Artificial Data with Preassigned Degree of Multicollinearity by Using Singular Value Decomposition, The Journal of the Japanese Society of Computational Statistics, Vol. 8, pp.1-8.
Kim, H. B. and Tanaka, Y. (1996): Application of Partial Least Squares Linear Discriminant Function to Writer Identification in Pattern Recognition, Journal of the Faculty of Environmental Science and Technology, Okayama University, Vol. 1, pp.65-76.
O'Sullivan, F. (1986): A Statistical Perspective on Ill-Posed Inverse Problems, Statistical Science, Vol. 1, pp.502-527.
Rao, C. R. (1973): Linear Statistical Inference and Its Applications, 2nd Edition, John Wiley & Sons, Inc., New York.
Stigler, S. M. (1990): The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators, Statistical Science, Vol. 5, pp.147-155.
Titterington, D. M. (1985): Common Structure of Smoothing Techniques in Statistics, International Statistical Review, Vol. 53, pp.141-170.
Wold, H. (1975): Soft Modeling by Latent Variables; the Non-linear Iterative Partial Least Squares Approach, in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, edited by J. Gani, Academic Press, Inc., London.
Wold, S., Wold, H., Dunn, W. J. and Ruhe, A. (1984): The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses, SIAM Journal on Scientific and Statistical Computing, Vol. 5, pp.735-743.
Yoshimura, M., Yoshimura, I. and Kim, H. B. (1993): A Text-Independent Off-Line Writer Identification Method for Japanese and Korean Sentences, IEICE Trans. Inf. & Syst., Vol. E76-D, No. 4, pp.454-461.
Resampling Methods for Error Rate Estimation
in Discriminant Analysis
Masayuki Honda1 and Sadanori Konishi2
1 Division of Medical Informatics
Chiba University Hospital
1-8-1 Inohana, Chuou-ku
Chiba 260, Japan
2 Graduate School of Mathematics
Kyushu University
6-10-1 Hakozaki, Higashi-ku
Fukuoka 812, Japan

Summary: The performance of resampling methods, like the bootstrap or cross-validation methods, was investigated for estimating the error rates in linear and quadratic discriminant analyses. A Monte Carlo experiment was carried out under the assumption that the two population distributions were characterized by a mixture of two multivariate normal distributions. Simulation results indicated that the bootstrap method gave good performance in the case of the linear discriminant function, but was somewhat biased when the quadratic discriminant function was used. The cross-validation method was superior in regard to unbiasedness, and the 0.632 bootstrap estimator performed best in regard to mean square error. The methods for error rate estimation were also examined through the analysis of real data in medical diagnosis.

1. Introduction
The main aim in discriminant analysis is to allocate a future observation to one of a finite number of distinct groups or populations on the basis of several characteristics of the observation. It is assumed here that an individual with an observation on a p-dimensional random vector is allocated to one of two p-variate populations, and that allocation is carried out on the basis of Fisher's linear discriminant function or the quadratic discriminant function.
In practice it is important to estimate the error rates in allocating a randomly selected future observation. For the problem of estimating the actual error rates (conditional error rates), Efron (1979, 1983) proposed nonparametric bootstrap methods, such as the bootstrap bias-corrected apparent error rate and 'the 0.632 estimator'. Ganeshanandam and Krzanowski (1990) investigated several methods, including the 0.632 estimator, for error-rate estimation in Fisher's linear discriminant function. Konishi and Honda (1990) examined parametric and nonparametric methods for estimating the error rates in linear discriminant analysis under normal and nonnormal populations.
Very little work has been done on evaluating error rate estimation procedures for the linear and the quadratic discriminant functions simultaneously. In this paper we investigate several estimation methods for the error rates of Fisher's linear discriminant function and the quadratic discriminant function when the population distributions are assumed to be mixtures of two multivariate normal distributions. We examine the performance of the estimation methods through Monte Carlo simulations, in which evaluation in nonnormal situations and for the quadratic discriminant function

receives special attention. Section 2 describes the formulation of the two-group discriminant problem. Section 3 presents the methods of error rate estimation. Simulation results evaluating the several estimation methods are summarized in Section 4. An application of the methods to real data from a medical problem is given in Section 5.

2. Two-group discriminant analysis
Suppose that the allocation of an individual x to one of two populations \Pi_i (i = 1, 2), the ith population having a p-variate probability distribution F_i(x), is carried out on the basis of Fisher's linear discriminant function (LDF)

h(x | X_n) = (\bar{x}_1 - \bar{x}_2)^t S^{-1} \{x - (\bar{x}_1 + \bar{x}_2)/2\},   (1)

or the quadratic discriminant function (QDF)

q(x | X_n) = (1/2) \log(|S_2|/|S_1|) - (1/2)(x - \bar{x}_1)^t S_1^{-1}(x - \bar{x}_1) + (1/2)(x - \bar{x}_2)^t S_2^{-1}(x - \bar{x}_2),   (2)

where \bar{x}_1, \bar{x}_2, S_1, S_2 and S are, respectively, the sample means, the sample covariance matrices and the pooled sample covariance matrix based on the training sample X_n = \{x_\alpha^{(i)} : \alpha = 1, 2, ..., N_i, i = 1, 2\}.
A future observation x_0 is allocated to \Pi_1 or \Pi_2 according as x_0 belongs to the region of discrimination R_1 or R_2 given by

R_1(X_n) = \{x; h(x | X_n) > 0\}  or  = \{x; q(x | X_n) > 0\},   (3)
R_2(X_n) = \{x; h(x | X_n) \le 0\}  or  = \{x; q(x | X_n) \le 0\},   (4)

under the assumption of equal a priori probabilities and costs of misallocation. Then the actual error rates are given by

e_i(F_i; X_n) = \int_{R_j} dF_i(x),  (i \ne j; i, j = 1, 2).   (5)

By replacing the unknown distribution F_i by the empirical distribution function \hat{F}_i, with equal weight 1/N_i assigned to each point of the training sample \{x_1^{(i)}, ..., x_{N_i}^{(i)}\}, we obtain the apparent error rate (AP method)

e_i(\hat{F}_i; X_n) = \int_{R_j} d\hat{F}_i(x) = (1/N_i) \sum_{\alpha=1}^{N_i} I(x_\alpha^{(i)} | R_j),   (6)

where I(x | A) is the indicator function defined by

I(x | A) = 1 if x \in A, and 0 if x \notin A.   (7)

Usually the apparent error rate provides an optimistic assessment of the actual error rate, and hence much attention has been given to the development of better estimation procedures.
In the next section we present several methods for the estimation of error rates in linear and quadratic discriminant analyses.

3. Methods for error rate estimation


3.1 Bootstrap bias-corrected apparent error rates
The bootstrap methods introduced by Efron (1979) provide an approach to estimate
the bias of the apparent error rate numerically.
Suppose that the unknown probability distribution F; (x) is estimated by the empiri-
cal distribution function F;(x) based on the training sample {x~) : Q = 1,2,'· . ,N;}
(i = 1,2). We call the random sample {x~)' : Q = 1.2, ... , N;} (i = 1,2) taken from
each F; as the bootstrap sample. The bootstrap analogues of the actual error rate in
(5) and the apparent error rate in (6) are, respectively,

1-. dF;(x) = ~. ~ I(x~)lk;),


R} • '0=1
(8)

1-R; dF;,(x) = ~N; ~ I(x~)'lk;)


0=1
(9)

where X_n* denotes the bootstrap sample of size (N_1 + N_2), the discrimination regions
R̂_i* (i = 1, 2) are constructed based on the bootstrap sample, and F̂_i* is the empirical
distribution function of the bootstrap sample {x_α^{(i)*} : α = 1, 2, ..., N_i}.
The bias of the apparent error rate is approximated by averaging { e_i(F̂_i; X_n*) -
ê_i(F̂_i*; X_n*) } over a large number of repeated bootstrap samples, say b_i for i = 1, 2.
Then we have the bias-corrected apparent error rate ê_i(F̂_i; X_n) + b_i, called the BS method.
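
A minimal sketch of the BS bias correction described above follows; `fit` stands for any rule-fitting procedure (e.g., one returning the LDF of the previous sketch), and all names are our own illustration.

import numpy as np

def bs_corrected_error(fit, x1, x2, b=200, rng=None):
    # Bootstrap bias-corrected apparent error rate (BS method).
    # fit(x1, x2) must return a rule g with g(x) > 0 allocating x to group 1.
    rng = rng or np.random.default_rng()

    def err(rule, y1, y2):  # misallocation frequency of `rule` on (y1, y2)
        e1 = np.mean([rule(x) <= 0 for x in y1])
        e2 = np.mean([rule(x) > 0 for x in y2])
        return (e1 + e2) / 2

    ap = err(fit(x1, x2), x1, x2)  # apparent error rate on the training sample
    bias = 0.0
    for _ in range(b):
        y1 = x1[rng.integers(len(x1), size=len(x1))]  # bootstrap sample from F1-hat
        y2 = x2[rng.integers(len(x2), size=len(x2))]  # bootstrap sample from F2-hat
        rule = fit(y1, y2)
        # averaging e_i(F-hat; X_n*) - e_i(F*-hat; X_n*) approximates the bias
        bias += err(rule, x1, x2) - err(rule, y1, y2)
    return ap + bias / b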
3.2 The 0.632 bootstrap estimator
Efron (1983) proposed the 0.632 bootstrap estimator given by

ê_{632} = 0.368 ê(F̂_1, F̂_2; X_n) + 0.632 ê_0,   (10)

where ê(F̂_1, F̂_2; X_n) denotes the total apparent error rate over the discriminant rule
(n = N_1 + N_2) and

ê_0 = (1/n_B) Σ_{j=1}^{B} [ Σ_{α=1}^{N_1} δ_{αj} I(x_α^{(1)}|R̂_2(X_n*(j))) + Σ_{α=1}^{N_2} δ_{αj} I(x_α^{(2)}|R̂_1(X_n*(j))) ].   (11)

Here X_n*(j) denotes the j-th bootstrap sample of size n for j = 1, ..., B, δ_{αj}
equals 1 when x_α^{(i)} does not belong to X_n*(j) and zero otherwise, and n_B = Σ_{j=1}^{B}(Σ_{α=1}^{N_1} δ_{αj}
+ Σ_{α=1}^{N_2} δ_{αj}). We call this estimator the 632 method.
The 0.632 estimator is considered to be a summation of the apparent error rate and
some kind of cross-validation with appropriate weights. Fitzmaurice et al. (1991)
investigated the performance of the 0.632 estimator by a Monte Carlo simulation and
showed that its performance depended on the true error rate.
3.3 Cross-validation method
In cross-validation, or the leaving-one-out method, an individual is removed from the
training samples and the LDF h(x|X_n) or QDF q(x|X_n) is constructed from the remain-
ing data. We then check whether or not the removed individual is allocated to the correct
population. This is done for each individual in the samples in turn. The proportion

of incorrect allocations gives the required estimate. We refer to this estimator as the
CV method.
3.4 Asymptotic results for linear discriminant analysis in normal samples
Suppose that the population distributions F_i(x) (i = 1, 2) are p-variate normal dis-
tributions with a common covariance matrix. McLachlan (1974) gave an asymptotically
unbiased estimator of the actual error rate, say the QM method, in the form

Φ(-D/2) + [ (p - 1)/(D N) + (D/(32N)){4(4p - 1) - D^2} ] φ(-D/2),   (12)

where D^2 = (x̄_1 - x̄_2)' S^{-1}(x̄_1 - x̄_2), N = N_1 + N_2 - 2, and Φ(x) and φ(x) are,
respectively, the standard normal distribution function and its density function.

4. Monte Carlo study


Monte Carlo experiments are carried out under the assumption that the two population
distributions are characterized by a mixture of two multivariate normal distributions.
To put the problem into canonical form (see Ashikaga and Chang (1981)), we make
a linear transformation so that

Π_1: f_1(x) = (1 - ε) n_p(0, I_p) + ε n_p(ν, σ_1^2 I_p),
Π_2: f_2(x) = (1 - ε) n_p(μ, I_p) + ε n_p(μ + ν, σ_2^2 I_p),   (13)

where ε is a rate of contamination (0 ≤ ε ≤ 1), μ = (m, 0, ..., 0)', ν = (ν_1,
ν_2, 0, ..., 0)', I_p is the identity matrix of order p, and n_p(μ, Σ) denotes a p-variate
normal density function with mean vector μ and covariance matrix Σ. Then the
population means and covariance matrices are given by

μ_1 = εν,   Σ_1 = {1 + ε(σ_1^2 - 1)} I_p + ε(1 - ε)νν',

μ_2 = μ + εν,   Σ_2 = {1 + ε(σ_2^2 - 1)} I_p + ε(1 - ε)νν'.   (14)

When ε equals 0 or 1, the population distributions are normal, and otherwise
nonnormal. Covariance matrices are unequal in the case of ε ≠ 0 and σ_1^2 ≠ σ_2^2.
We investigate the performance of several estimation methods not only in normal
and/or equal covariance situations, but also in nonnormal and/or unequal covariance
situations.
In the Monte Carlo simulation, random samples were generated from a population for
different combinations of parameters and sample sizes, and 200 bootstrap replications
were taken for each trial. For ν, we took the value ν = (1, 1, 0, ..., 0)'. Simulation
entries in Tables 1, 2, 3 and 4 are estimated by averaging over 200 repeated Monte
Carlo trials. As a criterion for evaluating the methods, we took the expected actual
error rate E[{e_1(F_1; X_n) + e_2(F_2; X_n)}/2], say TV. Values of TV were calculated
from 10,000 generated samples as future observations for each trial. The means and
the mean square errors (MSE) (×10^5) are computed and listed in Tables 1 and 2.
The dimension (p) in Table 2 is twice that in Table 1, with the sample sizes fixed.
The difference between TV and the error rate estimate based on each method, and the mean
square error, are respectively focused on in Tables 3 and 4, in which the two populations are
close together (m = 1). The degree of deviation from normality is assessed by the
multivariate skewness (β_{1,p}) and kurtosis (β_{2,p}) proposed by Mardia (1970). The values

of β_{1,p} and β_{2,p} given in Table 5 for a mixture of two multivariate normal distributions
in (13) were calculated using the formulae obtained by Konishi and Honda (1990).
Several findings as a summary of the simulation study are given below.
• Under the assumption of multivariate normality (i.e. β_{1,p} = 0, β_{2,p} = p(p + 2))
and equal covariance matrices, all methods except the BS method in QDF provide
estimates with small biases when the two populations are not close together (see
the cases with ε = 0 in Tables 1 and 2).
• Apparent error rates (AP method) clearly underestimate the actual error rates
(TV) in both normal and nonnormal situations. The difference between
TV and AP in QDF is larger than that in LDF.
• The bootstrap method (BS method) performs well in LDF, but it is a
little biased in QDF. High dimension adversely affects the BS method when
QDF is used, because the AP method underestimates more severely in the case
of p = 8 than in the case of p = 4 (see Tables 1 and 2).
• The 632 method performs fairly well with regard to the mean square errors in both
normal and nonnormal situations. But it slightly underestimates the
actual error rate in the high dimensional case (Table 2) of nonnormal situations in
QDF and in the case of closer populations (Table 3).
• The cross-validation method (CV) has a little bias with respect to the actual error rates but has
larger mean square errors than the 632 method in LDF. When QDF is used, CV
performs well with regard to unbiasedness.
• The parametric estimator (QM method) gives overestimated results in the cases
further from normality and equality of covariances. But it performs very
well overall in the case of ε = 0.
• On the whole, the CV method is superior to the other methods with regard to
unbiasedness, and the 632 method is far superior with regard to the mean square
error.

Table 1. Comparison of methods for estimation of error rates
(N_1 = N_2 = 30, p = 4, σ_1^2 = 3.0, σ_2^2 = 9.0, m = 3)

                        LDF                   |               QDF
 ε        TV    AP    BS    CV   632    QM    |   TV    AP    BS    CV   632
0.0 Mean .076  .058  .073  .076  .075  .074   |  .084  .051  .077  .083  .084
    MSE        135   125   120    95    70    |        196   127   133    98
0.1 Mean .107  .086  .107  .107  .107  .107   |  .121  .076  .112  .123  .119
    MSE        182   170   176   138   183    |        336   177   222   134
0.3 Mean .153  .124  .148  .149  .148  .152   |  .181  .120  .166  .180  .170
    MSE        287   238   258   197   301    |        544   253   269   192
0.5 Mean .198  .169  .199  .202  .200  .231   |  .229  .172  .224  .234  .222
    MSE        317   269   258   216   344    |        543   275   298   214
0.7 Mean .234  .197  .231  .236  .231  .255   |  .243  .177  .233  .246  .234
    MSE        399   313   300   261   287    |        661   285   303   219
0.9 Mean .265  .228  .262  .263  .259  .278   |  .224  .154  .209  .223  .216
    MSE        446   370   371   302   297    |        780   368   315   265
1.0 Mean .273  .240  .277  .279  .270  .288   |  .202  .134  .190  .204  .200
    MSE        360   303   311   245   273    |        669   264   275   199

Table 2. Comparison of methods for estimation of error rates
(N_1 = N_2 = 30, p = 8, σ_1^2 = 3.0, σ_2^2 = 9.0, m = 3)

                        LDF                   |               QDF
 ε        TV    AP    BS    CV   632    QM    |   TV    AP    BS    CV   632
0.0 Mean .089  .053  .085  .096  .091  .083   |  .125  .033  .099  .133  .132
    MSE        218   131   153   107    98    |        909   176   224   114
0.1 Mean .121  .074  .113  .121  .117  .120   |  .165  .049  .125  .168  .159
    MSE        342   173   178   128   152    |       1445   317   270   134
0.3 Mean .173  .116  .165  .175  .169  .192   |  .239  .098  .188  .245  .214
    MSE        530   284   311   224   311    |       2208   568   496   272
0.5 Mean .219  .157  .216  .228  .218  .244   |  .288  .139  .239  .299  .260
    MSE        634   326   316   259   359    |       2446   557   407   289
0.7 Mean .255  .180  .243  .256  .246  .270   |  .284  .135  .238  .299  .267
    MSE        845   382   365   288   357    |       2417   512   452   261
0.9 Mean .284  .205  .273  .288  .272  .291   |  .236  .088  .189  .247  .234
    MSE        922   415   484   329   365    |       2374   450   362   195
1.0 Mean .294  .212  .282  .291  .279  .296   |  .199  .050  .151  .202  .204
    MSE       1002   431   455   328   357    |       2340   416   375   172

Table 3. Difference between TV and each estimate: (TV - each estimate)
(N_1 = N_2 = 30, p = 4, σ_1^2 = 3.0, σ_2^2 = 9.0, m = 1)

               LDF                    |            QDF
 ε     AP    BS    CV    632    QM    |   AP    BS    CV    632
0.0  .049  .007  .003  .009   .007    |  .117  .031  .003  .027
0.1  .049  .002 -.004  .004  -.014    |  .112  .029  .005  .032
0.3  .064  .014  .007  .015  -.007    |  .101  .024  .005  .031
0.5  .063  .009 -.003  .013  -.007    |  .088  .014  .000  .021
0.7  .078  .030  .002  .024  -.004    |  .088  .016 -.002  .012
0.9  .074  .014  .000  .021  -.006    |  .089  .015 -.006  .004
1.0  .082  .022  .006  .028  -.005    |  .093  .021  .001  .009

Table 4. Mean square error
(N_1 = N_2 = 30, p = 4, σ_1^2 = 3.0, σ_2^2 = 9.0, m = 1)

               LDF                 |            QDF
 ε     AP    BS    CV   632   QM   |    AP    BS    CV   632
0.0   577   416   396   337   297  |  1560   455   460   342
0.1   595   457   509   364   407  |  1587   502   486   385
0.3   781   480   484   363   343  |  1312   401   417   346
0.5   723   434   536   339   415  |  1013   329   416   290
0.7  1015   571   730   456   550  |  1011   352   416   279
0.9   891   457   557   327   345  |  1051   353   408   262
1.0   999   481   603   398   429  |  1094   347   342   353

Table 5. Multivariate skewness (β_{1,p}) and kurtosis (β_{2,p})

                 p = 4                          p = 8
           Π_1           Π_2              Π_1            Π_2
 ε    β_{1,p} β_{2,p}  β_{1,p} β_{2,p}   β_{1,p} β_{2,p}   β_{1,p} β_{2,p}
0.0    0.00   24.0     0.00   24.0       0.00    80.0      0.00    80.0
0.1    0.64   31.8     2.82   68.6       1.04   103.0      4.76   225.6
0.3    1.11   32.3     2.12   51.6       1.93   107.3      3.66   172.6
0.5    0.74   29.5     0.95   38.7       1.34    99.3      1.65   130.1
0.7    0.30   26.7     0.30   30.8       0.56    90.4      0.52   103.8
0.9    0.04   24.7     0.03   25.8       0.07    83.0      0.05    86.5
1.0    0.00   24.0     0.00   24.0       0.00    80.0      0.00    80.0

In practical situations, nonnormal populations and populations that are close together
often occur. When ε varies from 0.1 through 0.5, the populations appear nonnormal
according to the measure of nonnormality given in Table 5. In such cases we would
recommend the 632 and BS methods in linear discriminant analysis and the CV and 632
methods in quadratic discriminant analysis.
The results of our simulation study also indicate that QDF is superior to LDF for two
normal populations with unequal covariance matrices, provided that the sample sizes
are sufficiently large. QDF performs poorly when the dimension p is high relative to
the sample sizes. Sample size is a critical factor in choosing between LDF and QDF
with normal data, as shown by Marks and Dunn (1974) and Wahl and Kronmal (1977).

5. Application to medical data


We applied the methods for error rate estimation to a problem of medical diagnosis.
The source data under consideration appear in Andrews and Herzberg (1985). The
two groups consist of 44 cases without crystals of calcium oxalate in urine and 33
cases with crystals. The data set consists of six physical characteristics of urine,
namely,
x_1: specific gravity   x_2: pH   x_3: osmolarity
x_4: conductivity   x_5: urea concentration   x_6: calcium concentration
The estimates of the actual error rate based on each method are
(LDF) AP: 0.189  BS: 0.227  CV: 0.246  632: 0.215  QM: 0.236
(QDF) AP: 0.186  BS: 0.243  CV: 0.288  632: 0.241
in which 200 bootstrap replications were taken for the BS and 632 methods.
Overall, the results suggest that the actual error rate is about 22% in the linear
discriminant analysis and about 28% in the quadratic discriminant analysis. We
recommend the simultaneous use of several estimators, with attention to the data
characteristics, for estimation of the actual error rates.

Acknowledgment
The authors would like to thank referees for their helpful comments and suggestions.

References
Andrews, D. F. and Herzberg, A. M. (1985): Data: A Collection of Problems from Many
Fields for the Student and Research Worker. Springer-Verlag, New York.
Ashikaga, T. and Chang, P. C. (1981): Robustness of Fisher's linear discriminant function
under two-component mixed normal models. J. Amer. Statist. Assoc. 76, 676-680.
Efron, B. (1979): Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B. (1983): Estimating the error rate of a prediction rule: Improvement on cross-
validation. J. Amer. Statist. Assoc. 78, 316-331.
Fitzmaurice, G. M., Krzanowski, W. J. and Hand, D. J. (1991): A Monte Carlo study of
the 632 bootstrap estimator of error rate. J. of Classification 8, 239-250.
Ganeshanandam, S. and Krzanowski, W. J. (1990): Error-rate estimation in two-group
discriminant analysis using the linear discriminant function. J. Statist. Comput. Simul.
36, 157-175.
Konishi, S. and Honda, M. (1990): Comparison of procedures for estimation of error rates
in discriminant analysis under nonnormal populations. J. Statist. Comput. Simul. 36,
105-115.
Mardia, K. V. (1970): Measures of multivariate skewness and kurtosis with applications.
Biometrika 57, 519-530.
Marks, S. and Dunn, O. J. (1974): Discriminant functions when covariance matrices are
unequal. J. Amer. Statist. Assoc. 69, 555-559.
McLachlan, G. J. (1974): An asymptotic unbiased technique for estimating the error rates
in discriminant analysis. Biometrics 30, 239-249.
Wahl, P. and Kronmal, R. (1977): Discriminant functions when covariances are unequal
and sample sizes are moderate. Biometrics 33, 479-484.
A Short Overview of the Methods
for Spatial Data Analysis
Masaharu Tanemura
The Institute of Statistical Mathematics
4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan

Summary: Some methods of spatial data analysis are given in the manner of a short overview.
The recent development of this field is also included. At first, it is shown that spatial
indices based on quadrat counts or nearest neighbour distances, which have been devised
mostly by ecologists, are still useful for preliminary analysis of spatial data. Then, it is
discussed that distance functions such as the nearest neighbour distribution and the K function
are useful for the diagnostic analysis of spatial data. Further, it is shown that maximum
likelihood procedures for estimating and fitting pair interaction potential models are very
useful for a wide class of spatial patterns. Finally, it is pointed out that Markov chain
Monte Carlo (MCMC) methods are powerful tools for spatial data analysis and for other
fields.

1. Spatial Data Analysis?


Suppose we have a set of data of spatially distributed objects. Let us assume the
objects are plants or animals and let us call each object an 'individual'. There might
be various types of spatial data. Among such types, we assume a mapped spatial
point pattern is given and let the coordinates of N individuals in an area S be
X = (x_1, x_2, ..., x_N). This is considered as the most informative.
Then our purposes in spatial data analysis would be such that: into which type of spatial
pattern the observed data is classified; how to model the observed pattern; what is
the estimated value of parameters for the model; and how to predict the pattern
which will appear in future.
The field of statistics which investigates such problems as above is called 'spatial
statistics'. In this short overview, we will summarize a part of its status by putting
emphasis on how the recent development of spatial statistics is related to the early
studies of spatial data analysis in other fields of science, such as ecology.
Due to the limited space of this article, the subjects considered here are mainly confined
to those with which the present author has been concerned, and some references cited here
are omitted from the list of references. For the full list and for more information, readers
can refer, for instance, to Hasegawa and Tanemura (1986) and Cressie (1993).

2. Obtaining Spatial Index


It will be quite natural to ask what the type of the observed point pattern is and how
it can be represented by a certain quantity.
As a preliminary study of spatial data analysis, it is usual to investigate whether the
observed pattern shows an indication of departure from complete spatial randomness
(or a homogeneous Poisson point process). More concretely, we put as a null hypothesis
H_0: the homogeneous Poisson point process, and make a statistical test against H_0. As
alternatives, we consider two types of patterns, i.e., 'regular' and 'clustered'.


In field studies, spatial data are often sampled according to the following
methods:
1. Quadrat method: Counts of individuals in contiguous quadrats are obtained.
Let s_i be the count in the i-th quadrat (i = 1, 2, ..., q), where q is the number
of quadrats.
2. Nearest neighbour distance method: Usually two types of nearest neighbour
distances are measured; the distance from a randomly sampled point to its
nearest individual (r_1), and the distance from a randomly sampled individual
to its nearest individual (r_2).
For these types of spatial data, many spatial indices have been considered, mainly by
ecologists, in order to represent the degree of aggregation of individuals.
2.1 Spatial indices for quadrat counts
As regards quadrat counts, we cite here only three indices. David and Moore (1954)
presented the index I = V/m - 1, where m and V are, respectively, the sample mean and
variance of the s_i's. Morisita (1959) introduced the index I_δ = q Σ_{i=1}^{q} s_i(s_i - 1)/(N(N - 1)).
This is often called Morisita's I_δ. Lloyd (1967) presented the index m* = m + V/m - 1,
which is called the mean crowdedness.
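
As a sketch of how these indices are computed from quadrat counts, a short Python function follows; the function name is ours, and we assume the unbiased sample variance, which the text does not specify.

import numpy as np

def quadrat_indices(counts):
    # David-Moore I, Morisita's I_delta, and Lloyd's mean crowdedness m*.
    s = np.asarray(counts, dtype=float)
    q, n = len(s), s.sum()
    m, v = s.mean(), s.var(ddof=1)                       # sample mean and variance
    i = v / m - 1.0                                      # David and Moore (1954)
    i_delta = q * np.sum(s * (s - 1.0)) / (n * (n - 1))  # Morisita (1959)
    m_star = m + v / m - 1.0                             # Lloyd (1967)
    return i, i_delta, m_star

# For a completely random (Poisson) pattern, I is near 0 and I_delta near 1.
print(quadrat_indices([0, 3, 1, 7, 0, 2, 5, 0, 4, 2]))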

Fig. 1: A pattern of 584 trees of longleaf pines (Pinus palustris) taken from Cressie (1993).

Let us show an example of the values of these indices. Figure 1 shows the stands of longleaf
pines (Pinus palustris) (Cressie, 1993). For these data, the values of the above indices
are, respectively, I = 1.335, I_δ = 1.930 and m* = 2.765. These values indicate that
the stand of longleaf pines in Fig. 1 is a clustered pattern.
2.2 Spatial indices for nearest neighbour distances
As regards nearest neighbour distances, we show three indices here, too. Hopkins
and Skellam (1954) gave the index A = Σ r_1^2 / Σ r_2^2, Clark and Evans (1954) proposed
the index R = 2√λ Σ r_2 / n, and Besag and Gleaves (1973) presented the index T =
2 Σ r_1^2 / Σ r_T^2. Here, n is the number of samples of r_2, λ is the intensity of the Poisson point
process, and r_T denotes the so-called T-square samples devised by Besag and Gleaves.
For the data of Fig. 1, the values of the indices given above are: A = 2.049, R = 0.849
and T = 1.646. These values again indicate the clustered nature of the longleaf pines
data.

3. Nearest Neighbour Distance Distribution


For the data of nearest neighbour distances, we can further consider their distributions
in order to get more detailed information from the data.
For that purpose, let {r_{1,i}; i = 1, 2, ..., n_1} be the nearest neighbour distance data
between point and individual, and let {r_{2,i}; i = 1, 2, ..., n_2} be the data between
individual and individual. Here, n_1 and n_2 are the numbers of respective samples.
3.1 Empirical distributions for r_1 and r_2
Then, we obtain the empirical distributions for {r_1} and {r_2} as:

p(r) = Σ_{i=1}^{n_1} 1(r_{1,i} ≤ r)/n_1   and   q(r) = Σ_{i=1}^{n_2} 1(r_{2,i} ≤ r)/n_2,

where 1(θ) is the indicator function, such that 1(θ) = 1 if θ is true, and = 0 if θ is
false.
It is interesting to note that, for the Poisson point process with intensity λ, p(r) = q(r) =
1 - exp(-λπr^2) holds in the limit n_1, n_2 → ∞.
3.2 Exact empirical distribution for r_1
It is further interesting to point out that an 'exact' empirical distribution p(r) for r_1
is obtained if the mapped point pattern is given. See Okabe and Miki (1981) and
Tanemura (1983). Here, the term 'exact' means the distribution in the limit of
n_1 → ∞.
This can be seen by considering the following relations:
p(r) = Pr{r_1 ≤ r}
     = Pr{a random point lies within distance r from some individual}
     = Pr{a random point lies within at least one of the discs with centers x_i (i = 1, ..., N)
       and radius r}
     = |S|^{-1}{area of the union of discs with radius r and with center at every
       individual}.

This is illustrated in Fig. 2. In Fig. 2, the radius of each shaded disc is r. In order to
compute the area of the union of discs, it is easiest to use Voronoi tessellations

as shown in Fig. 2.
It is also important to note that we can get an exact empirical 'density' f(r) =
dp(r)/dr for r_1 in the following manner:

f(r) = |S|^{-1}{peripheral length of the union of discs with radius r
       and with center at every individual}.

Fig. 2: Illustration for computing the exact empirical distribution p(r) and its density f(r).
In Fig. 2, the value of f(r) is obtained by computing the total peripheral length of
the union of discs. This is again not difficult if we consider the Voronoi tessellation as in
Fig. 2.
Let us show an example. In Fig. 3, a pattern of nests of gray gulls (Larus modestus) is
given. In Fig. 4(a), the exact empirical distribution estimated for the gray gull data
of Fig. 3 is shown as the curve with crosses. We have carried out computer simulations of
the Poisson point process for the same number of individuals in the same size of area.
In this figure, the envelopes of the p(r)'s obtained from 19 simulations are represented
as curves. As a comparison, we give in Fig. 4(b) similar curves for the empirical
distribution q(r) for the same data sets as in Fig. 4(a). It is obvious that the range
of the envelopes in Fig. 4(a) is narrower than in Fig. 4(b). This indicates that the power of the
test against the Poisson model is larger for p(r) than for q(r). Actually, we can see from
Fig. 4(a) that the curve of p(r) for the gray gull data systematically deviates from the
envelopes for the Poisson model. This indicates the rejection of the Poisson model.
Figures such as Figs. 4(a) and (b) indicate their usefulness for diagnostic analyses.

Fig. 3: Pattern of 110 nests of gray gulls (Larus modestus) with its Voronoi tessellation.


Fig. 4: Empirical distributions of nearest neighbour distance. (a) exact empirical
distribution p(r) for r_1. (b) empirical distribution q(r) for r_2.

4. Second Moment Measures


From the above example, we see that the number of nearest neighbour distance
data between individual and individual is at most of order N, i.e., the total number
of individuals. In contrast to this, if we consider the second neighbours, the third
neighbours, and so on, we can get more detailed information. If all of the distances between
individuals are considered, the number of data amounts to order N^2.
The distribution of distances between all pairs of individuals is already known as
the 'pair correlation function' in the field of statistical physics, and this function is
experimentally observed by X-ray diffraction for crystals or liquids.
In spatial statistics, the distribution is called the 'second moment measure'. It is
often called a K function and is defined by:

K(r) = λ^{-1} E{number of extra individuals within distance r
       from an arbitrary individual},

and its empirical distribution is given by

K̂(r) = λ̂^{-1} Σ_{i<j}^{N} 1(|x_i - x_j| ≤ r)/N,

where λ̂ is the intensity of the observed process.


The derivative of K(r) is sometimes convenient. It is called the 'radial distribu-
tion function' and is given by g(r) = (1/2πr) dK(r)/dr. The so-called L function
L(r) = √(K(r)/π) is also useful. For Poisson point processes, these functions have the
forms: K(r) = πr^2, g(r) = 1, L(r) = r. It is important to note that these functions
are also useful for diagnostic analysis.
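
A naive Python sketch of the empirical K function for a mapped pattern in a rectangular region follows; edge corrections, which matter in practice, are ignored, the factor 2 counts each ordered pair, and all names are ours.

import numpy as np

def k_hat(points, area, r):
    # Empirical K function: lambda^{-1} * mean number of extra individuals
    # within distance r of an individual (no edge correction).
    x = np.asarray(points)
    n = len(x)
    lam = n / area                                        # intensity estimate
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    pairs = d[np.triu_indices(n, k=1)]                    # each unordered pair once
    r = np.atleast_1d(r)
    counts = np.array([2.0 * np.sum(pairs <= ri) / n for ri in r])
    return counts / lam

rng = np.random.default_rng(1)
pts = rng.uniform(0, 10, size=(200, 2))                   # ~ Poisson in a 10 x 10 square
r = np.array([0.25, 0.5, 1.0])
print(k_hat(pts, 100.0, r))
print(np.pi * r ** 2)                                     # Poisson benchmark K(r) = pi r^2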

5. Spatial Point Process Models and Model Fitting


In order to proceed further with the analysis of spatial data, we should take into account
models which may include mechanisms of pattern formation. Among such models,
we consider here the Gibbs point process model because of its generality.
5.1 Gibbs point process model
Assume the observed mapped point pattern X = (x_1, x_2, ..., x_N) is a random sam-
ple from a Gibbs canonical distribution which is characterized by a pair interaction
potential function Φ(|x_i - x_j|):

f(X) = Z^{-1} exp{-U_N(X)},   where   U_N(X) = Σ_{i<j} Φ(|x_i - x_j|),

and

Z = Z(Φ; N, S) = ∫_{S^N} exp{-U_N(X)} dx_1 dx_2 ... dx_N.

Here, U_N(X) is called the 'total potential energy' and Z(Φ; N, S) a partition function
or a normalizing constant.
5.2 Maximum likelihood estimation of the pair interaction potential function
By parametrizing Φ(r) as {Φ_θ(r); θ ∈ Θ} in a certain parameter space Θ, we can get
a log-likelihood function from the above Gibbs canonical distribution as:

L(Φ_θ; X) = - Σ_{i<j}^{N} Φ_θ(|x_i - x_j|) - log Z̄(Φ_θ; N, S),

where Z̄ = Z/|S|^N.
For the Poisson model, Φ_θ(r) ≡ 0 and Z(0; N, S) = |S|^N hold.
In performing the likelihood procedure, we encounter a serious difficulty in
obtaining L(Φ_θ; X) and Z(Φ_θ; N, S) for a general Φ_θ(r) as a function of θ. This is
due to the high multiplicity of integration in the normalizing constant Z.
To overcome this difficulty, efforts have been made to devise methods for obtaining
approximate log-likelihoods. Some of such methods are the cluster expansion, the
virial expansion and the polynomial approximation through computer experiments
(Ogata and Tanemura, 1981, 1984, 1989). For the computer experiments of Gibbs
point processes, the so-called Markov chain Monte Carlo method is most suitable. This
is discussed further in the next section.
As for the interaction potential models, it is desirable that they cover a
wide class of patterns of individuals, including regular and clustered patterns which
respectively correspond to repulsive and attractive interactions. For that purpose, it
is necessary to consider potential models with several parameters, possibly from different
classes of potential families. In order to select a suitable model among competing models,
the information criterion AIC is useful:
AIC = -2 (maximum log-likelihood) + 2 (number of adjusted parameters).
The model which has the minimal AIC value is most suitable. When using this criterion,
the Poisson model is always considered as a candidate, since AIC = 0 for this model.
Here, we will show some examples. Let us consider the so-called 'soft-core potential
models' Φ_{σ,n}(r) = (σ/r)^n, σ > 0, n > 2 (Ogata and Tanemura, 1984, 1989). This
family of potentials can cover a certain class of point patterns with repulsive inter-
actions. Table 1 shows some of the results of fitting the soft-core models (Ogata and
Tanemura, 1989).

Data     σ̂                 n̂     τ
Pines    0.14 meters        ∞     0.04
Gulls    2.26 meters        6.9   0.06
Iowa     16.8 miles         6.2   0.37
Balls    1.10 millimeters   5.8   0.42

Table 1: Results of fitting soft-core models to real data.

Here, τ = Nσ̂^2/A represents the crowdedness of a pattern. The data for 'Gulls' corre-
spond to Fig. 3.

6. Markov Chain Monte Carlo Methods and Spatial Data Analysis

Finally, we mention the computational methods which we have already used to sim-
ulate Gibbs point processes. These methods have been developing recently due to their
usefulness in other fields as well as in spatial statistics (see, for example, Geyer,
1992; Besag et al., 1995).
Let u(X) be a certain probability density of X ∈ 𝒳. Then, the expectation of a
function g(X):

E_u[g(X)] = Σ_{X∈𝒳} g(X) u(X),

is obtained by (1/t) Σ_{i=1}^{t} g(X_i) after generating a Markov chain (X_1, X_2, ..., X_t, ...)
without knowing the normalizing constant of the density u. This is because X_t → X ~
u(X) and (1/t) Σ_{i=1}^{t} g(X_i) → E_u[g(X)] as t → ∞, almost surely.
Let q(X, X') be an arbitrary transition probability such that, if X_t = X, a sample X'
drawn from q(X, ·) is considered as a proposed possible value for X_{t+1}. Then, the
essence of the Markov chain Monte Carlo (MCMC) method is to choose the
transition probability p(X, X') in such a way that:

p(X, X') = { q(X, X') α(X, X')                        if X' ≠ X,
             1 - Σ_{X''≠X} q(X, X'') α(X, X'')        if X' = X,

where α(X, X') is the acceptance probability such that we accept X' for X_{t+1} with
probability α; otherwise we reject it and set X_{t+1} = X.
In the Metropolis-Hastings method,

α(X, X') = { min{ u(X')q(X', X) / (u(X)q(X, X')), 1 }   if u(X)q(X, X') > 0,
             1                                          if u(X)q(X, X') = 0,

is chosen. Then, u(X)p(X, X') = u(X')p(X', X) holds. This property is called
'detailed balance'. The property of 'ergodicity', u(X)p^{(n)}(X, X') → u(X') as n → ∞,
also holds. Thus, from the above choice of α(X, X'), it is obvious that a Markov chain
can be constructed without knowing the normalizing constant, as pointed out above.
Mainly owing to this property, Markov chain Monte Carlo (abbreviated as
MCMC) methods are becoming important tools, besides for spatial data analysis, for
image analysis and for other fields which are somehow related to Bayesian inferences.
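
The construction can be sketched in Python as follows for a one-dimensional toy target; the target density is known only up to its normalizing constant, exactly as in the Gibbs case (the names and the toy example are ours).

import numpy as np

def metropolis_hastings(log_u, propose, log_q, x0, t, seed=0):
    # Generic Metropolis-Hastings chain for an unnormalized density u.
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(t):
        x_new = propose(rng, x)
        # log of the Hastings ratio u(x')q(x', x) / (u(x)q(x, x'))
        log_a = log_u(x_new) + log_q(x_new, x) - log_u(x) - log_q(x, x_new)
        if np.log(rng.uniform()) < min(0.0, log_a):
            x = x_new                        # accept; otherwise keep the old state
        chain.append(x)
    return np.array(chain)

# Toy target: standard normal up to a constant; symmetric random-walk proposal,
# so the q terms cancel (log_q constant).
chain = metropolis_hastings(
    log_u=lambda x: -0.5 * x ** 2,
    propose=lambda rng, x: x + rng.normal(scale=1.0),
    log_q=lambda x, y: 0.0,
    x0=0.0, t=5000)
print(chain.mean(), chain.var())             # roughly 0 and 1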

References:
Besag, J., Green, P., Higdon, D. and Mengersen, K. (1995): Bayesian computation and
stochastic systems, Statistical Science, 10, 3-66.
Cressie, N.A.C. (1993): Statistics for Spatial Data, Revised edition, John Wiley & Sons,
New York.
Geyer, C.J. (1992): Practical Markov chain Monte Carlo, Statistical Science, 7, 473-483.
Hasegawa, M. and Tanemura, M. (1986): Ecology of Territories - Statistics of Ecological
Models and Spatial Patterns, Tokai University Press (in Japanese).
Morisita, M. (1959): Measuring of the dispersion and analysis of distribution patterns,
Memoires of the Faculty of Science, Kyushu University, Series E, Biology, 2, 215-235.
Ogata, Y. and Tanemura, M. (1981): Estimation of interaction potentials of spatial point
patterns through the maximum likelihood procedure, Annals of the Institute of Statistical
Mathematics, 33B, 315-338.
Ogata, Y. and Tanemura, M. (1984): Likelihood analysis of spatial point patterns, Journal
of the Royal Statistical Society, Series B, 46, 496-518.
Ogata, Y. and Tanemura, M. (1989): Likelihood estimation of soft-core interaction poten-
tials for Gibbsian point patterns, Annals of the Institute of Statistical Mathematics, 41,
583-600.
Okabe, A. and Miki, K. (1981): A statistical method for the analysis of a point distribution
in relation to networks and structural points, and its empirical application, DP-3, Depart-
ment of Urban Engineering, University of Tokyo.
Tanemura, M. (1983): Statistics of spatial patterns - Territorial patterns and their forma-
tion mechanisms, Suri Kagaku (Mathematical Science), 246, 25-32 (in Japanese).
Choice of Multiple Representative Signatures for
On-line Signature Verification
Using a Clustering Procedure
Isao Yoshimura 1, Mitsu Yoshimura 2 and Shin-ichi Matsuda 3

1 Faculty of Engineering, Science University of Tokyo


1-3 Kagurazaka, Shinjuku-ku, Tokyo 162, Japan
2 Chubu University
Matsumoto-cho, Kasugai-shi, Aichi 487, Japan
3 Nanzan University
Showa-ku, Nagoya-shi, Aichi 466, Japan

Summary: The task of signature verification is to judge whether the writer of a signature
is truly the declared person or not, by referring to a previously provided signature database.
The verification is completed when the observed dissimilarity between the questioned sig-
nature and a set of signatures, which were previously written by the declared person and
included in the database, is less than a given threshold. This paper shows, through a ver-
ification experiment using a signature database supplied by CADIX Co. Ltd., that the use
of multiple representatives, which represent respective clusters constructed by a clustering
procedure, is effective. According to the experiment, the resulting error rate decreased from
about 6% to about 1% on average by increasing the number of representatives from one to
three.

1. Introduction
This paper deals with an automatic signature verification for the identification of
an individual based on on-line information, where the problem is to verify an input
signature to be a realization of the declared autograph by comparing it with a set of
authentic signatures previously provided.
From a statistical viewpoint, this problem can be formulated as one of judging
whether a questioned sample belongs to the declared population of authentic sig-
natures based on a set of samples from this population. This set will be referred to as
the reference sample in the following. When we can assume a suitable distribution for
the population, the problem is reduced to one of estimating the population parame-
ters based on the reference sample and of judging whether the questioned sample is
located near the central part of the estimated distribution.
In real situations, however, the population is not so homogeneous that we can assume
a simple distribution. It seems better for us to use signatures themselves to represent
the population without assuming any distribution. Then there occurs the problem of
how to construct the representative signature from the reference sample.
In our experience, people often write their signatures in two or three different forms
(examples of which are shown in Fig. 1), although most of them are similar, as
shown in Fig. 2. From this observation we formed the idea that we can construct
a signature verification system with a good performance by using two or three sig-
natures as the representatives. Fortunately, the first two authors of this paper have

had some experience of signature verification; they devised a signature verification
system and, at the same time, provided a database of on-line signatures. They had
the idea of clustering the reference sample into two or three categories and choosing one
representative from each cluster. They subsequently tried to realize this idea in a
signature verification system.
The purpose of this paper is to explain the devised system and to show its effective-
ness through an experiment.

Figure 1 An example of two different forms of signatures written by the same person

Figure 2 Examples of a set of on-line signatures written by the same person


Sampling points are traced by straight lines along the time.

2. Devised system
Our system acquires any signature as a set of time series {(x(t), y(t), p(t)); t = 1, 2, ..., T}
of pen position (x, y) and writing pressure p, where T is the number of sampling points
automatically fixed by the sampling time inherent in the input device. We retain in
the system, in advance, a set of authentic signatures as a database to be referred
to in verification. After a stage of preprocessing, such as the normalization of size
and sampling points, the system classifies these reference signatures into a certain
number (three in this paper) of clusters by using the complete linkage method (cf.
Yanagisawa and Ohsumi (1979)) and chooses one representative from each cluster,
where the distance for clustering is the dissimilarity measure explained in the
next section.
When the system receives a questioned signature, it measures the dissimilarity be-
tween the questioned signature and each representative and uses their minimum as
the dissimilarity between the questioned signature and the declared autograph. If
the dissimilarity is less than a given threshold, the questioned signature is judged as
genuine; otherwise it is judged as forged. The threshold is defined for each autograph
as the mean value of the dissimilarity measure among authentic signatures multiplied
by an optional coefficient, C, which is specified by the system manager. Note that
although there are some optional parts such as the method of normalization or the
choice of weights in the system, they are fixed in this paper because they
do not seriously affect the conclusion induced from the experiment. A sketch of the
clustering and decision steps is given below.
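
The following Python sketch illustrates the two steps; the text above does not spell out how the representative of each cluster is selected, so the medoid used here is one natural choice rather than the authors' rule, and all names are ours. The matrix d is assumed symmetric with zero diagonal.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def choose_representatives(d, n_clusters=3):
    # Complete linkage clustering of the reference signatures from their
    # pairwise dissimilarity matrix d; one medoid per cluster as representative.
    z = linkage(squareform(d, checks=False), method='complete')
    labels = fcluster(z, t=n_clusters, criterion='maxclust')
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        reps.append(idx[np.argmin(d[np.ix_(idx, idx)].sum(axis=1))])
    return reps

def is_genuine(d_to_reps, mean_authentic_dissim, c=1.5):
    # Accept when the minimum dissimilarity to the representatives is below
    # C times the mean dissimilarity among the authentic reference signatures.
    return min(d_to_reps) < c * mean_authentic_dissim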

3. Dissimilarity measure
Although there are various proposals concerning the dissimilarity measure (see Yoshi-
mura and Yoshimura (1996)), we adopted a measure similar to that of Sato and
Kogure (1982), whose characteristic feature is the adoption of a dynamic program-
ming matching method for adjusting the time scales of two signatures. Note that this
adjustment is very important for measuring the dissimilarity, because anyone often pauses
irregularly in pen movement during the course of writing, and these meaningless
time durations are better omitted.
The adjustment in this sense can be realized by determining a matching of the two time
coordinates, say ξ and η, of the two signatures. The matching thus determined is
referred to as the 'warping function' in this paper, and is visually illustrated by a
thick line in Fig. 3.
Practically, the system first considers a temporal dissimilarity measure D*(z_A, z_B)
between two signatures z_A and z_B as

D*(z_A, z_B) = Σ_{s=1}^{n(A,B)} [w_p δ(s, s-1) + 1 - δ(s, s-1)]
               × [w_1 {(x_A(ξ_s) - x_B(η_s))^2 + (y_A(ξ_s) - y_B(η_s))^2}
                + w_2 {p_A(ξ_s) - p_B(η_s)}^2
                + w_3 {(d_{xA}(ξ_s) - d_{xB}(η_s))^2 + (d_{yA}(ξ_s) - d_{yB}(η_s))^2}],   (1)
where the subscripts A and B correspond to the two signatures respectively, (ξ_s, η_s) is a
coordinate of a warping function T = {(ξ_s, η_s); s = 1, 2, ..., n(A, B)} which represents
an adjustment in the time domain, n(A, B) is the number of matched sampling points, w_1, w_2,
and w_3 are properly chosen constants, δ is an indicator constant related to the s-th

Figure 3 Illustration of the warping function T and indicator constant δ.


T is a path of coordinates (ξ_s, η_s) on the grid running from the lower left (1, 1)
to the upper right (T, T). δ(s, s-1) = 1 if the span from the (s-1)-th grid point to the s-
th grid point is horizontal or vertical, and = 0 otherwise. In the above figure, (ξ_1, η_1) =
(1, 1), (ξ_2, η_2) = (2, 2), (ξ_3, η_3) = (3, 2), (ξ_4, η_4) = (3, 3), ...; δ(2, 1) = 0, δ(3, 2) = 1, ...


Figure 4 Illustration of the direction vector and related variables


(d_x(ξ_s), d_y(ξ_s)) is the unit vector with the same direction as (x(ξ_s) - x(ξ_{s-1}),
y(ξ_s) - y(ξ_{s-1})). When the angle between the two unit vectors (d_{xA}(ξ_s), d_{yA}(ξ_s)) and (d_{xB}(η_s), d_{yB}(η_s))
is θ, then (d_{xA}(ξ_s) - d_{xB}(η_s))^2 + (d_{yA}(ξ_s) - d_{yB}(η_s))^2 = 4 sin^2(θ/2), and therefore
the last term of equation (1) reflects the direction of the pen movement.

point on the warping function, and (d_x(ξ_s), d_y(ξ_s)) is the unit vector with the same
direction as (x(ξ_s) - x(ξ_{s-1}), y(ξ_s) - y(ξ_{s-1})) (see Figs. 3 and 4), i.e., as follows:

d_{xA}(ξ_s) = (x_A(ξ_s) - x_A(ξ_{s-1})) / √{(x_A(ξ_s) - x_A(ξ_{s-1}))^2 + (y_A(ξ_s) - y_A(ξ_{s-1}))^2},   (2)

d_{yA}(ξ_s) = (y_A(ξ_s) - y_A(ξ_{s-1})) / √{(x_A(ξ_s) - x_A(ξ_{s-1}))^2 + (y_A(ξ_s) - y_A(ξ_{s-1}))^2}.   (3)

Next it calculates, using a DP matching algorithm, the dissimilarity measure
D(z_A, z_B) as

D(z_A, z_B) = min_T D*(z_A, z_B),   (4)

where the minimum is taken over the possible choices of the warping function T under
a certain restriction, the details of which are explained in Yoshimura et al. (1991) and
Yoshimura and Yoshimura (1992).
Yoshimura and Yoshimura (1992).
In the proposed method, the dissimilarity of the questioned signature to the rep-
resentative signatures is measured, but there are five optional parameters δ, w_p, w_1, w_2,
and w_3 in the definition of this dissimilarity measure. Among them, δ and w_p are
introduced to impose a penalty for warping the time scale. As our proposal, δ was
set to one when ξ_s = ξ_{s-1} or η_s = η_{s-1}, and to zero otherwise; additionally, w_p was
set to two. These settings imply that when the increment in one time coordinate at the
s-th step is not the same as that in the other coordinate, the dissimilarity
value is increased depending on the degree of distortion of the time scale. The other
constants represent the weights on the three variables x, y, and p with respect to
the difference between the two signatures. Their optimum values are to be
determined adaptively to the signature data. In the experiment below, they were set
as (w_1, w_2, w_3, w_p) = (0.5, 0.4, 0.1, 2.0) based on a preliminary experiment.
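
The DP matching in (4) can be sketched in Python as follows; this is a simplified version under our own assumptions (the first direction vector is set to zero, and the exact path restrictions of the cited references are not reproduced).

import numpy as np

def dp_dissimilarity(za, zb, w=(0.5, 0.4, 0.1), wp=2.0):
    # za, zb: arrays of rows (x, y, p); returns a DP-matching dissimilarity
    # in the spirit of (1)-(4), with penalty wp on non-diagonal (warping) steps.
    w1, w2, w3 = w

    def unit_dirs(z):                     # pen-direction unit vectors
        d = np.diff(z[:, :2], axis=0)
        n = np.linalg.norm(d, axis=1, keepdims=True)
        return np.vstack([[0.0, 0.0], d / np.where(n > 0, n, 1.0)])  # first one zero: our assumption

    da, db = unit_dirs(za), unit_dirs(zb)

    def cost(i, j):                       # local cost of matching points i and j
        return (w1 * np.sum((za[i, :2] - zb[j, :2]) ** 2)
                + w2 * (za[i, 2] - zb[j, 2]) ** 2
                + w3 * np.sum((da[i] - db[j]) ** 2))

    ta, tb = len(za), len(zb)
    D = np.full((ta, tb), np.inf)
    D[0, 0] = cost(0, 0)
    for i in range(ta):
        for j in range(tb):
            if i == j == 0:
                continue
            c = cost(i, j)
            cand = []
            if i > 0 and j > 0:
                cand.append(D[i - 1, j - 1] + c)       # diagonal step: delta = 0
            if i > 0:
                cand.append(D[i - 1, j] + wp * c)      # vertical step: delta = 1
            if j > 0:
                cand.append(D[i, j - 1] + wp * c)      # horizontal step: delta = 1
            D[i, j] = min(cand)
    return D[-1, -1]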

4. Experiment
An experiment to examine the effectiveness of our proposal was performed based
on a database of on-line signatures provided by CADIX Co. Ltd. The database is
composed of 2203 signatures, including both authentic and forged signatures, for 28
autographs (see Table 1). These contain 10 in Roman letters by non-Japanese, 14 in
Japanese letters by Japanese and 4 in Roman letters by Japanese; 'Roman letters'
means the English alphabet such as a, b, etc. Examples of signatures in the database
are shown in Fig. 5. In the experiment, ten authentic signatures for each autograph
and five forged signatures were used as references, and the remaining signatures were
used to evaluate the performance of our system.
In the experiment, the number of clusters was set to each of one, two and three in
order to evaluate the effectiveness of increasing the number of representative signatures.
The most difficult obstacle to achieving a good performance of the system was the
determination of the optional constants inherent in the system, such as the weights in
the dissimilarity measure and the threshold coefficient C. We utilized the reference
sample to optimize these constants. It was important that forged signatures were
included in the reference because, by comparing the dissimilarities within authentic
signatures with those between authentic and forged ones, we could reasonably evaluate
the suitability of various values of C.

The system is composed of two parts: a preprocessing part and a verification part.
In the preprocessing part, various normalizations and standardizations, such as the
reduction of sampling points, are possible, which may affect the achieved error rates.
In this paper, however, we fixed the normalization and standardization in one form
because we confirmed, through preliminary experiments, that the effect of such
preprocessing did not influence the relative capability of the three cases of the number
of representative signatures.

Figure 5 Examples of signatures in the CADIX database (genuine and forgery)

Table 1 Number of signatures in the CADIX database

Case RN: Roman sig. by non-Japanese
No.  Gn.  Forg.  (repetitions)
1    30   54     (2,4,5,5,5,5,5,5,6,12)
2    25   64     (4,5,5,5,5,5,5,6,6,6,12)
3    51   53     (5,5,5,5,5,5,5,5,6,7)
4    25   51     (5,5,5,5,5,5,5,5,5,6)
5    25   51     (5,5,5,5,5,5,5,5,5,6)
6    33   56     (5,5,5,5,5,5,5,5,5,5,6)
7    30   65     (5,5,5,5,5,5,5,5,5,5,5,5,5)
8    28   60     (5,5,5,5,5,5,5,5,5,5,5,5)
9    25   64     (4,4,5,5,5,5,5,5,5,5,5,5,6)
10   35   40     (5,5,10,10,10)

Case RJ: Roman sig. by Japanese
No.  Gn.  Forg.  (repetitions)
1    27   62     (1,2,3,4,5,5,5,5,5,5,5,5,6,6)
2    42   62     (5,5,5,5,5,5,5,5,5,5,6,6)
3    31   55     (4,5,5,5,5,5,5,5,5,5,6)
4    25   64     (5,5,5,5,5,5,5,5,5,5,7,7)

Case JJ: Japanese sig. by Japanese
No.  Gn.  Forg.  (repetitions)
1    27   57     (4,5,5,5,5,6,6,6,6,9)
2    25   53     (5,5,5,5,5,5,5,5,8)
3    25   16     (2,6,8)
4    20   52     (3,4,5,5,5,5,5,5,5,5,5)
5    24   53     (5,5,5,5,5,5,5,5,6,7)
6    25   21     (2,7,12)
7    25   9      (9)
8    28   53     (5,5,5,5,5,5,5,6,6,6)
9    26   67     (4,5,5,5,5,5,5,5,5,6,7)
10   26   56     (4,5,5,5,5,5,5,5,6,11)
11   28   53     (5,5,5,5,5,5,5,6,6,6)
12   26   50     (5,5,5,5,5,5,5,5,5,5)
13   34   50     (5,5,5,5,5,5,5,5,5,5)
14   26   15     (2,3,5,5)

5. Result
The effectiveness of the use of multiple representatives is clearly shown in Table 2,
which is a part of the results of the experiment with w_1, w_2, w_3 and w_p fixed as
0.5, 0.4, 0.1, and 2.0, respectively. The error rates in Table 2 are an average of those
for foreigners and Japanese. For any choice of the threshold coefficient C, at least
two representatives are necessary to get a good performance of the system in the
verification, while more than three do not yield any improvement.

Table 2 Error rates in %

  C   |  number of representatives
      |      1      2      3
 1.4  |   5.22   1.51   1.46
 1.5  |   6.28   1.27   1.11
 1.6  |   8.07   1.83   1.21
 1.7  |  10.31   2.89   1.73

Figure 6 Type I and II errors in % against the threshold coefficient C
RN: Roman signatures by non-Japanese.
RJ: Roman signatures by Japanese.
JJ: Japanese signatures by Japanese.

6. Discussions
The use of multiple representatives of authentic signatures based on a clustering pro-
cedure was proved, through an experiment using a signature database, to be effective
in decreasing error rates in on-line signature verification. The authors think that this
is due to the existence of clusters even in a set of authentic signatures written by
one person. It also implies that various types of signatures should be included in the
database for signature verification.
The achieved error rates are an average of two types of errors, and each error varies
depending on the threshold, or the threshold coefficient C in our setting of the threshold.
If it moves to greater values, type I errors decrease whereas type II errors increase.
This trade-off relation is obvious from Fig. 6. The determination of a rea-
sonable value of C is the responsibility of the system manager, while C = 1.5 is a
generally good coefficient in our experience.
An example of clustering and of representative signatures obtained from each cluster
is shown in Fig. 7. While the result of clustering is not visually clear, average error
rates decreased, which may imply that automatic verification is better than vi-

sual verification. Although the experiment was limited to on-line verification, similar
conclusions would be expected for off-line verification, too.

Figure 7 Example of realized clusters and representatives with time series of (x, y, p)

The number placed at the head of each signature indicates the cluster it belongs to when
three clusters are constructed. The bold figure denotes the representative signature.
The three curves traced against time on the abscissa show the time series of x(t), y(t), p(t),
in this order from left to right.

References:

Sato, Y. and Kogure, K. (1982): On-line signature verification based on shape, motion, and
writing pressure, Proc. 6th Int. Conf. Pattern Recog., 823-826.
Yanagisawa, Y. and Ohsumi, N. (1979): Evaluation procedure for estimating the number
of clusters in hierarchical clustering system, The Japanese Journal of Applied Statistics, 8,
51-71.
Yoshimura, I. et al. (1991): On-line signature verification incorporating the direction of pen
movement, Trans. IEICE Japan, E74, 2083-2092.
Yoshimura, I. and Yoshimura, M. (1992): On-line signature verification incorporating the
direction of pen movement - An experimental examination of the effectiveness, In: From
Pixels to Features III, Impedovo, S. and Simon, J. C. (eds.), 353-361, North-Holland, Amster-
dam.
Yoshimura, M. and Yoshimura, I. (1996): The state-of-the-art and issues to be addressed
in writer recognition, Technical Report of IEICE, PRMU96-48, 81-90 (in Japanese).
Part IV

Related Approaches for Classification

• Fuzzy and Probabilistic Modeling Methods

• Spatial Clustering and Neural Networks

• Symbolic and Conceptual Data Analysis


Algorithms for L1 and Lp Fuzzy c-Means and
Their Convergence
Sadaaki Miyamoto 1 and Yudi Agusta 2
1 Institute of Information Sciences and Electronics
University of Tsukuba, Ibaraki 305, Japan
2 Program Research, Development and Documentation Division
Central Bureau of Statistics, Indonesia

Summary: Algorithms for L_1 and L_p based fuzzy c-means are proposed. These algorithms
calculate cluster centers in the general alternating algorithm of the fuzzy c-means. The
algorithm for the L_1 space is based on a simple linear search on the nodes of step functions
derived from derivatives of components of the objective function for the fuzzy c-means,
whereas the algorithm for the L_p spaces uses a binary search on the nodes and then on the interval
to which the cluster center belongs. Termination of the algorithms based on different criteria
for the convergence is discussed. The algorithm for the L_1 space is proved to be convergent
after a finite number of iterations. A numerical example is shown.

1. Introduction
The L_1 and L_p spaces have sometimes been referred to in studies of data analysis such
as regression analysis in the L_1 space (Bloomfield and Steiger, 1983), although
these spaces have not yet been extensively applied to cluster analysis. Recently, many
researchers have studied fuzzy clustering, which uses membership degrees in the unit
interval interpreted as fuzzy classification. The method of fuzzy c-means, abbreviated
as FCM, is the most well-known among different techniques in fuzzy clustering (Bezdek,
1981).
Fuzzy c-means clustering based on the L_1 or L_p space has recently been considered by
Jajuga (1991) and by Bobrowski and Bezdek (1991). These two studies have shown
difficulties in solving fuzzy c-means in the L_p spaces. The general fuzzy c-means
algorithm is an iterative procedure in which the step of determining grades and that
of determining cluster centers are repeated. A unified formula can be used for the
determination of grades for different distances, whereas the calculation of cluster
centers strongly depends on the selected distance. Cluster centers in the Euclidean
space are derived by a simple formula of weighted averages, whereas the calculation
of cluster centers seems to require much computation for the L_1 and L_p spaces.
In this paper we propose efficient algorithms for the FCM in the L_p spaces. The
results are divided into those for the L_1 space and those for the L_p spaces, and
stronger results are obtained for the L_1 case.
In the case of the L_1 space, the fact that each coordinate of a cluster center is the
minimizing element of a piecewise affine function is utilized, whereas a binary search
is considered for the L_p spaces.
Moreover we prove several theorems on the convergence of the algorithms. The proofs of
the convergence theorems for the L_1 case use the finiteness of the search region for
the cluster centers and the uniqueness of the minimum of strictly convex functions
with respect to U.
A numerical example is given to show that the algorithm actually works well on a

large set of data with a small computation time.

2. Fuzzy c-means based on Lp spaces


The problem herein is that n objects, each of which is represented by an h-dimensional
real vector x_k = (x_{k1}, ..., x_{kh}) ∈ R^h, k = 1, ..., n, should be divided into c fuzzy
clusters. Namely, the grade u_{ik}, 1 ≤ i ≤ c, 1 ≤ k ≤ n, by which the object k belongs
to the cluster i should be determined. For each object k, the grades of membership
should satisfy the condition of the fuzzy partition:

M = { (u_{ik}) : Σ_{i=1}^{c} u_{ik} = 1, 1 ≤ k ≤ n;  0 ≤ u_{ik} ≤ 1, 1 ≤ i ≤ c, 1 ≤ k ≤ n }.

The formulation by Bezdek (1981) is the optimization of the objective function

J(U, v) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_{ik})^m d(x_k, v_i),

in which d(x, v) is a measure of dissimilarity between x and v, m is a real parameter
such that m > 1, v_i is the center of the fuzzy cluster i, and U = (u_{ik}) and v =
(v_1, ..., v_c) ∈ R^{ch}.
Here the dissimilarity d is assumed to be the p-th power of the L_p (p ≥ 1) norm:

d(x_k, v_i) = ||x_k - v_i||_p^p = Σ_{j=1}^{h} |x_{kj} - v_{ij}|^p,   where v_i = (v_{i1}, ..., v_{ih}).

Namely, the objective function is

J(U, v) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_{ik})^m ||x_k - v_i||_p^p.   (1)
It is well-known that the direct optimization of J by (U, v): min_{U∈M, v∈R^{ch}} J(U, v) is
difficult. A two stage iteration algorithm is therefore used.
A General Algorithm of FCM (cf. Bezdek, 1981):
(a) Initialize U^(0); set s = 0.
(b) Calculate cluster centers v^(s) = (v_1^(s), ..., v_c^(s)) that minimize J(U^(s), ·):

J(U^(s), v^(s)) = min_{v∈R^{ch}} J(U^(s), v).

(c) Update U: calculate U^(s+1) that minimizes J(·, v^(s)):

J(U^(s+1), v^(s)) = min_{U∈M} J(U, v^(s)).

(d) Check convergence using a given ε > 0: If the convergence criterion is satisfied,
stop; otherwise set s = s + 1 and go to (b).
The convergence criterion in general should be one of the following (I)-(III).

(I) |J(U^(s+1), v^(s)) - J(U^(s), v^(s-1))| ≤ ε.

(II) max_{i,j} |v_{ij}^(s) - v_{ij}^(s-1)| ≤ ε.

(III) Using a suitable matrix norm, ||U^(s+1) - U^(s)|| ≤ ε.
Remark: The above term of convergence simply implies that the algorithm will
eventually terminate; the convergence guarantees neither that the obtained solution
is the correct one, nor that the solution is the global optimum. In general it is difficult to
derive an efficient algorithm that guarantees the true optimal solution.
In general, the calculation of U^(s+1) does not depend on a particular choice of the norm.
It is well-known that u_{ik} is easily derived by using a Lagrange multiplier. Namely,
for x_k such that x_k ≠ v_i, i = 1, ..., c, and m > 1,

u_{ik} = 1 / Σ_{l=1}^{c} ( d(x_k, v_i) / d(x_k, v_l) )^{1/(m-1)}.   (2)
On the other hand, the calculation of cluster centers is not simple for the L_p spaces, and
therefore the problem to be solved is the minimization with respect to v in the step
(b) of the above FCM. The algorithms for the step (b) that we propose here are based
on the following ideas.
(i) Each component of a cluster center can be calculated independently of the other
coordinates, by decomposing the function J(U, v) to be optimized with respect to v
into a sum of hc functions F_{ij}(w) = Σ_{k=1}^{n} (u_{ik})^m |x_{kj} - w|^p, i = 1, ..., c, j = 1, ..., h:

J(U, v) = Σ_{i=1}^{c} Σ_{j=1}^{h} F_{ij}(v_{ij}),

where each F_{ij} depends solely on the j-th coordinate of a cluster center. Notice that
U is a parameter in this subproblem. Thus, concerning the search for cluster centers,
we can limit ourselves to the minimization of F_{ij}(w).
(ii) The function F_{ij} is convex with respect to each coordinate of a cluster center, and
in particular, it is a piecewise affine function in the L_1 case.
(iii) Given the properties (i) and (ii), we can use a one-dimensional search for the
minimization of F_{ij}(w). For the L_1 space, however, a more efficient algorithm can be
derived using the piecewise affine property: the coordinate is calculated by a linear
search on the derivative of the function F_{ij}(w), which is remarkably simple.

3. Algorithms for calculating cluster centers


3.1 Algorithm for the L_1 space
Let us first consider the L_1 case. Since the function F_{ij}(w) is piecewise affine for the
L_1 space, the solution of

min_{w∈R} F_{ij}(w)   (3)

can be limited to the coordinates of the data: {x_{1j}, ..., x_{nj}}.

Two ideas are used in the following algorithm: ordering of {x_{kj}} and the derivative of

F_{ij}. We assume that when {x_{1j}, ..., x_{nj}} is ordered, the first subscripts are changed using
a permutation function q_j(k), k = 1, ..., n, so that x_{q_j(1)j} ≤ x_{q_j(2)j} ≤ ... ≤ x_{q_j(n)j}.
Using {x_{q_j(k)j}},

F_{ij}(w) = Σ_{k=1}^{n} (u_{iq_j(k)})^m |w - x_{q_j(k)j}|.

Although F_{ij}(w) is not differentiable on R, we extend the derivative of F_{ij}(w) onto
{x_{q_j(k)j}}:

dF_{ij}(w) = Σ_{k=1}^{n} (u_{iq_j(k)})^m sign+(w - x_{q_j(k)j}),

where

sign+(z) = { 1 (z ≥ 0), -1 (z < 0) }.

Thus, dF_{ij}(w) is a step function which is right continuous and monotone nondecreas-
ing in view of its convexity and piecewise affine property. Now, it is easy to see that
the minimizing element for (3) is one of the x_{q_j(k)j} at which dF_{ij}(w) changes its sign.
More precisely, x_{q_j(t)j} is the optimal solution of (3) if and only if dF_{ij}(w) < 0 for
w < x_{q_j(t)j} and dF_{ij}(w) ≥ 0 for w ≥ x_{q_j(t)j}.

Let w = x_{q_j(r)j}; then

dF_{ij}(x_{q_j(r)j}) = Σ_{k=1}^{r} (u_{iq_j(k)})^m - Σ_{k=r+1}^{n} (u_{iq_j(k)})^m.

These observations lead us to the next algorithm.


begin
    S := - Σ_{k=1}^{n} (u_{ik})^m;
    r := 0;
    while ( S < 0 ) do begin
        r := r + 1;
        S := S + 2(u_{iq_j(r)})^m
    end;
    output v_{ij} = x_{q_j(r)j} as the j-th coordinate of
    the cluster center v_i
end.

It is easy to see that this algorithm correctly calculates one coordinate of the cluster
center.
Remark: A similar idea to the above algorithm can be found in L_1 regression.
See Bloomfield and Steiger (1983).
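
The algorithm amounts to a weighted-median computation; a direct Python transcription (names ours) is:

import numpy as np

def l1_center_coordinate(xj, ui, m=2.0):
    # j-th coordinate of the L1 cluster center i by the linear search above;
    # xj: data coordinates x_kj, ui: grades u_ik.
    w = ui ** m
    order = np.argsort(xj)            # the permutation q_j
    s = -w.sum()                      # S := -sum_k (u_ik)^m
    r = -1
    while s < 0:                      # scan nodes until dF_ij changes sign
        r += 1
        s += 2.0 * w[order[r]]
    return xj[order[r]]

rng = np.random.default_rng(2)
xj = rng.normal(size=11)
print(l1_center_coordinate(xj, np.full(11, 0.5)))  # equal grades: the ordinary median
print(np.median(xj))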

3.2 Algorithm for the L_p spaces

For the L_p (p > 1) spaces, the function F_{ij}(w) is differentiable. For calculating the
solution w_{ij} of dF_{ij}/dw (w_{ij}) = 0, the following algorithm is natural.

(i) Search for x_{q_j(r)j} and x_{q_j(r+1)j} such that dF_{ij}/dw (x_{q_j(r)j}) ≤ 0 and dF_{ij}/dw (x_{q_j(r+1)j}) ≥ 0.

(ii) Search for the solution w_{ij} in the interval [x_{q_j(r)j}, x_{q_j(r+1)j}] using, e.g., binary search,
as in the sketch below.
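
The following Python sketch (names ours) implements steps (i) and (ii); the derivative dF_ij is monotone nondecreasing, so its zero is bracketed by consecutive data nodes and bisection applies.

import numpy as np

def lp_center_coordinate(xj, ui, p=1.5, m=2.0, tol=1e-10):
    # j-th coordinate of an Lp (p > 1) cluster center by bracketing the zero
    # of dF_ij/dw between data nodes and then bisecting.
    w = ui ** m

    def dF(t):  # derivative of F_ij(w) = sum_k (u_ik)^m |x_kj - w|^p
        return np.sum(w * p * np.abs(t - xj) ** (p - 1) * np.sign(t - xj))

    xs = np.sort(xj)
    vals = np.array([dF(t) for t in xs])   # nondecreasing along the nodes
    if vals[-1] <= 0:                      # minimizer at the largest node
        return xs[-1]
    idx = int(np.argmax(vals > 0))         # first node with dF > 0 (idx >= 1)
    lo, hi = xs[idx - 1], xs[idx]          # step (i): bracketing interval
    while hi - lo > tol:                   # step (ii): binary search
        mid = 0.5 * (lo + hi)
        if dF(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)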

4. Convergence of the algorithms

Different results are obtained for the foregoing convergence criteria (I)-(III). First, it
should be remarked that during the iterations, the value of the objective function is
monotone nonincreasing:

J(U^(s+1), v^(s)) ≤ J(U^(s), v^(s)) ≤ J(U^(s), v^(s-1)),

since the FCM algorithm is the alternating optimization with respect to U and v.
Then, an obvious result for the convergence of the above algorithms using (I) can be
stated as follows.
Theorem 1. For an arbitrary given ε > 0 and the convergence criterion (I) used in
FCM, the L_1 and L_p algorithms converge.
(Proof) The proof is obvious, seeing that the sequence a(s) = J(U^(s), v^(s)) is monotone
nonincreasing and bounded from below by the obvious bound a(s) ≥ 0. Hence the
basic theorem of convergence of monotone and bounded sequences applies. (QED)
Theorem 2. For ε = 0 and the convergence criterion (I), the L_1 algorithm converges
after a finite number of iterations in FCM.
(Proof) Consider the set

Π = { y = (y_1, ..., y_h) | y_j ∈ {x_{1j}, ..., x_{nj}}, j = 1, ..., h }.

Let Π^c be the c-fold Cartesian product of Π: Π^c = Π × Π × ... × Π, and |Π^c| be the number
of elements in Π^c. This set is finite and we can easily see that if J(U^(s), v^(s-1)) >
J(U^(s), v^(s)) then v^(s-1) and v^(s) are different points of Π^c. Thus, within |Π^c| + 1
iterations of FCM, J(U^(s), v^(s-1)) = J(U^(s), v^(s)) occurs and then the algorithm for
the L_1 space terminates with ε = 0. (QED)
In most cases of iterative calculations, the convergence is checked using the solutions,
as in (II) or (III), instead of the function values as in (I). Unfortunately, it is difficult to prove
a theoretical property for the L_p space algorithm using (II) or (III). In contrast,
however, the L_1 algorithm can be analyzed further.
In general J(U^(s), v^(s-1)) = J(U^(s), v^(s)) does not imply v^(s-1) = v^(s), since the solution
for (3) may not be unique. It is, however, easy to modify the L_1 algorithm so that

J(U^(s), v^(s-1)) = J(U^(s), v^(s))   implies   v^(s-1) = v^(s)   (5)

holds. The modification is obvious:
Check if w = v_{ij}^(s-1), the (i, j) component of the previous solution v^(s-1),
satisfies dF_{ij}(w) = 0. When dF_{ij}(w) = 0, i.e., the previous solution is
still optimal, use the previous solution as v_{ij}^(s) = w, the (i, j) component
of the new v^(s).
Theorem 3. For ε = 0 and the convergence criterion (II), the L_1 algorithm with
the above modification converges after a finite number of iterations in FCM.
(Proof) Since we have modified the algorithm so that (5) is satisfied, the conclusion
follows from the observation stated in the proof of Theorem 2: within |Π^c| + 1
iterations, J(U^(s), v^(s-1)) = J(U^(s), v^(s)) occurs. (QED)
300

Finally, we should be careful about reduction of the number of clusters. Namely,


when more than one clusters have the same cluster centers, these centers actually
indicate a unique cluster, and hence the number of clusters should accordingly be
reduced. Specifically, if

"y> 1,

the clusters i l , i z, ... , i.., should be replaced by a unique cluster, say i l , and the other
cluster numbers i z, .. :, i.., will not be used thereafter. After this reduction we can set
Ui,lo = 1, since Xio = Vi,·

Theorem 4. For f = a and the convergence criterion (III), the Ll algorithm with
the above two modifications converges after a finite number of iterations in FCM.
(Proof) By the above consideration, the matrL",{ U is uniquely determined, since (1)
is a strictly convex function with respect to Uik with an arbitrarily fixed v for XI. i=- Vi
for all i. When Xio = Vi, the previous modification uniquely determines the corre-
sponding part of U. Thus, if V{·-l) = v('), we have U(·) = U('Tl). (QED)

5. A numerical example
In this example we assume that m = 2, the number of clusters c = 2, the dimension
p = 2, and the number of points n = 10, 000. A region is considered and data points
are scattered over the region. Namely, two squares ABCD and EFGH of unit size
with the intersecting square P EQC are considered, as shown in Figure 1. The square
PEQC has edge length a (0 < a < 1). Data points have been scattered over the area
surrounded by ABQ F G H P D A using the uniformly distributed random numbers.
The initial value for the grade u~~ for each XI. has been generated by the pseudo
random numbers uniformly distributed over [0, IIi u~~ = 1 - u~~ to form a fuzzy
partition. Ten trials with different initial grades U(O) have been carried out. The
criterion (III) has been used for the convergence test in all cases.

H G

D p C
,,
,,
,
E IQ F

A B

Fig. 1: Two overlapped squares.


301

Fuzzy clusters have been transformed into crisp clusters using the Q - cut of Q = 0.5.
Then, a measure of miscIassification has been introduced for a quantitative evalua-
tion of the results. Namely, when a data point that is in the left and lower side of
the broken line segment PQ in Figure 1 is classified into the same class as the north
east cluster, i.e., the one to which data in the area surrounded by PQFGHP belong,
the former data point is called miscIassified. In the same way, when a data point
that is in the right and upper side of the broken segment PQ in Figure 1 is classified
into the same class as the south west cluster, i.e., the one to which data in the area
surrounded by QPDABQ belong, the data point is also called miscIassified.
Table 1 shows the number of successes, the average number of misclassified data, the
maximum number of iterations, and the average CPU time (sec) throughout the ten
trials for three values of the parameter a: a = 0.1, 0.2, 0.3. Moreover this table
compares results by the L1 c-means, Euclidean c-means, and the Lp c-means with
p= 3.
The number seven, for example, of successes means that seven trials out of the ten
have produced good results, while the other three have led to unacceptable classifica-
tions of large numbers of misclassified data. The average number of misclassifications
has been calculated from the successful trials: if the seven trials are successful, the
data of the other three trials are not used for the calculation. The CPU time is for
one cycle of calculating v(') and U(·';"l) in the main loop of FCM. The total CPU time
needed until the convergence is, for example, 0.758 x 11 ~ 8.34 for a = 0.1 by L1
FCM.
Comparison of the statistics given in Table 1 leads to the following observations.
(a) The computation for one cycle by L1 FCM is faster than Euclidean FCM, whereas
the Lp algorithm is far slower than the other two.
(b) The number of iterations by L1 FCM is less than that by Euclidean FCM in
every case; the Lp algorithm requires more iterations than Euclidean FCM.
(c) For the numbers of misclassifications, we do not find a remarkable difference
among these three algorithms.
(d) L1 FCM has failed 4 times in all 30 trials, while Euclidean and Lp FCM algo-
rithms have succeeded in all trials.
We have analyzed the cases of the failure of the L1 method, and found that the iter-
ation stopped at s = 1, i.e., after (V(l), U(2)) had been calculated, or at s = 2. Thus,
the failure to produce an appropriate result occurred when the iteration terminated
too early. A simple technique for improving the algorithm is to incorporate an em-
pirical rule into the L1 algorithm, whereby if an early termination is detected, the
calculation starts again with renewed initial membership values.
This failure has not been caused by the present algorithm, since the L1 method
exactly calculates the optimal solution for the cluster center, without any approxi-
mation. In other words, one cannot theoretically expect a better result by replacing
the present L1 algorithm by any other procedure for calculating the cluster centers.
(This does not mean, however, that we are unable to improve the algorithm by using
heuristic or ad hoc rules.)

6. Conclusions
We have presented two algorithms, i.e., the L1 algorithm based on a linear search on
302

Table 1: The number of successes out of ten trials, the average number of misclas-
sifications, and the ma:rimum number of iterations, and CPU time for a = 0.1, 0.2,
0.3 by L1 , L2 (Euclidean), and Lp FCM (p = 3).

L1 FCM
a successes misclassifications iterations( ma."C) CPU time(sec)
0.1 7 5.4 11 0.758
0.2 9 10.8 13 0.752
0.3 10 19.8 13 0.755
Euclidean FCM
a successes misclassifications iterations{ ma."C) CPU time{ sec)
0.1 10 3.3 14 0.813
0.2 10 6.0 15 0.812
0.3 10 24.4 14 0.812
Lp FCM(p = 3)
a successes misclassifications iterations( ma."C) CPU time( sec)
0.1 10 3.90 17 36.465
0.2 10 13.90 18 34.366
0.3 10 25.10 18 35.053

the nodes of step functions, and the Lp algorithm using the binary search. Moreover
we have shown theorems of convergence under three stopping criteria. The numerical
example has shown that the Ll algorithm is as efficient as the Euclidean algorithm.
The Lp algorithm has required 40 times more processing time than the other algo-
rithms, since it uses an iterative procedure in calculating a cluster center.
For the Lp (p > 1) algorithm, improvements are still possible, whereas further im-
provement cannot be expected for the Ll case, since the present algorithm is already
simple enough and has theoretically good properties as shown in the theorems of
convergence.
Further studies on the Ll and Lp clustering include: (i) to find applications when
the Ll and Lp spaces are more appropriate than the Euclidean space; (ii) analysis of
data obtained from real applications using these two algorithms; (iii) development of
a system of softwares for these methods.
Acknowledgment:
This research has partly been supported by TARA (Tsukuba Advanced Research
Alliance), University of Tsukuba.
References:
Bezdek, J.C. (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum.
Bloomfield, P. and Steiger, W.L. (1983): Least Absolute Deviations: Theory, Applications,
and Algorithms, Birkhauser.
Bobrowski, L. and Bezdek, J.C. (1991): c-means clustering with the II and ("'" norms.
IEEE Transactions on Systems, Man, and Cyhern., 21, 3, 545-554.
Jajuga, K. (1991): LI-norm based fuzzy clustering. Fuzzy Sets and Systems, 39, 43-50.
General Approach to the Construction
of Measures of Fuzziness of Fuzzy K -Partitions
Slavka Bodjanova
Department of Mathematics,
Texas A&M University-Kingsville,
Kingsville, TX 78363, U.S.A.

Summary: One of the most important characterizations of fuzzy partitions is the amount
of their fuzziness. This paper proposes an axiomatic framework for measures of fuzziness
of nonhierarchical fuzzy partitions. Mathematical conditions for measures of fuzziness are
discussed and a way of measuring fuzziness in terms of dissimilarity between a fuzzy parti-
tion and its complement is presented.

1. Introduction
Research on the theory of fuzzy sets and on the broad variety of applications of fuzzy
sets has been growing steadily since the inception of the theory in the mid sixties by
Lotfi Zadeh (1965). In the area of classification, an impressive number of papers has
been published on fuzzy clustering algorithms, fuzzy pattern recognition, etc. As a
result of fuzzy clustering of a finite set of objects, each object may be assigned to
the multiple clusters with some degree of certainty. The amount of fuzziness of the
resulting partition is an important characterization of the structure of data. If the
fuzziness is low, it means that clusters are reasonably separable. On the other hand,
if the fuzziness is large, the fuzzy cluster separability is low and either the partition
does not reflect the real structure well or there is not a clear structure present in
data.
Bezdek (1981) introduced the partition entropy as a measure of fuzziness of non-
hierarchical partitions which is a complete formal analogy to Shannon's entropy and
generalization of non probabilistic entropy of fuzzy sets introduced by de Luca and
Termini (1972). Backer (1987) proposed to measure fuzziness of fuzzy partitions in
terms of fuzziness of fuzzy clusters. There are some other characteristics of fuzzy par-
titions which could be interpreted as measures of fuzziness, but the comprehensive
theory of measures of fuzziness of fuzzy partitions is still missing.
The aim of this paper is to develop an axiomatic framework for measures of fuzziness
of nonhierarchical fuzzy partitions and to show a more general way of constructing
these measures.
In the first part of our paper we briefly review the matrix characterization of k-
partitions and the sharpness of fuzzy k-partitions.
De Luca and Termini (1972) formulated three essential requirements that adequately
capture the intuitive comprehension of fuzziness of fuzzy set. Generalization of these
requirements for fuzzy partitions is the cornerstone of our definition of measure of
fuzziness of a fuzzy k-partition introduced in the second part of our paper. Empotz
(1981) studied the mathematical background of measures of fuzziness of fuzzy sets.
We discuss conditions under which some nonnegative real functions on the interval
[0, 1] could be used for constructing measures of fuzziness of fuzzy partitions.
Fuzziness of fuzzy sets is often measured in terms of lack of distinction between the
set and its complement, or by a metric distance between its membership grade func-
303
304

tion and the characteristic function of the nearest crisp set (Klir and Yuan (1995)).
In the last part of our paper we show how this idea can be used in the evaluation of
the amount of fuzziness of fuzzy partition.

2. Fuzzy k-partitions
2.1 Matrix characterization
Let X = {Xl, X2, .•• , Xn} be a given set of objects. Fix the integer k, 2 $ k < n and
denote by Vkn the usual vector space of real k x n matrices. Bezdek (1981) proposed
the following matrix characterization of partitions of X into k clusters.
Fuzzy k-partition space associated with X:

P lk = {U E Vkn; tLij E [0,1]; L tLij = 1 for all j; L tLij > 0 for all i}. (1)

Here tLij is the grade of membership of object Xj E X in fuzzy cluster tLi.

Hard k-partition space associated with X:

Pk = {U E Vkn ; tLij E {O, I}; L tLij = 1 for all j; L tLij > 0 for all i}. (2)
j

If we relax the condition E


> 0 we will get degenerate partitions.
j tLij

Degenerate fuzzy k-partition space associated with X:

Plko = {U E Vkn ; tLij E [0,1]; L tLij = 1 for all j}. (3)

Degenerate hard k-partition space associated with X:

(4)

It is obvious that Pk C P lk C Pika and Pk C P ka C Pika'

2.2 Sharpness
Partitions from Pka are certain, i.e. they have zero amount of uncertainty. On the
other hand, the partiton (; = til E Pika is maximally uncertain, maximally fuzzy. If
a partition U E Plko is moving from D to a partition V E Pko its amount of fuzziness
decreases, we say that the partition U becomes" sharper". We propose the following
definition of sharpness of fuzzy k-partitions.
Definition 1 Let U, V E P lko ' We say that U is sharper than V denoted by U -: V, I

if and only if for all i, j:


1
tLi] < Vij for Vi] $ k' (5)
1
tLij > Vij for V·
I] -
> -k' (6)

Note:
For k = 2 we get the relation sharpness defined by De Luca and Termini (1972) for
fuzzy sets.
305

Property 1 Relation -< satisfies the following properties:


1. PJI<a is partially ordered by -<.
f. For all U E Plica: U -< [; = rtJ.
3. Let U E Plica such that for all icard {Utj : 0 < Utj < H = O.
Let V -< U. Then V = u.
4. Let U E PJI<a - Pica, U =f [;. Then there exists V E Pica such that V -< U, if and
only if for all j card {Utj : Utj > H= 1.
Definition 2 Function F : Plica -+ R is called a function preserving sharpness of
fuzzy partitions if and only if for all U, V E Plica such that U -< V we get
F(U) $ F(V).

Example 1:
Function Fl : Plica -+ R defined by

(7)

is a sharpness preserving function.


Example f:
Function F2 : PIka -+ R defined by
n Ic

F2 = L(L U tj Ui-l,j)/U m j, (8)


j=l i=2

where Umj = mini,.." ~ t {Uij}, is not a sharpness preserving function.


For example:
Let U, V, WE PI4a be partitions of X = {XbX2,X3,X4,XS}
given by matrices

0.1 0.05 1
1 0 0 00) V _ ( 0.10 0
U _ ( 0.1 000
- 0.3 o 1 1 0 - 0.30 0
0.5 o 0 0 1 0.55 0

0.0 1 0 0 0)
W _ ( 0.0 0 0 0 0
- 0.4 0 1 1 0
0.6 0 0 0 1

Obviously W -< V -< U, but F 2(W) = 0.6 $ F 2(V) = 0.66,


and F2(V) = 0.66;::: F2(U) = 0.63.
306

3. Fuzziness of fuzzy k-partitions


3.1 Measures of fuzziness
Several measures of uncertainty of fuzzy partitions have been used as objective func-
tions in fuzzy clustering algorithms or as measures of the quality of a fuzzy classifi-
cation. The most frequent is the partition entropy introduced by Bezdek (1981) as
follows:
H(U, k) = -~ LLUij log" Uij, (9)
i j

where a E (1,00) and Uij log" Uij = 0 for Uij = O.


Let us denote by F(X) the set of all fuzzy sets defined on X. De Luca and Termini
(1972) raised an interesting question of assigning to any fuzzy set U E F(X) some
measure of its fuzziness which would express the degree to which the boundary of U
is not sharp. They proposed as a measure of fuzziness a function F(X) ~ R+ e:
which satisfies the following properties:

1. e(u) = 0 if and only if u(x) E {O, I} for all x E X,


2. e(u) = maxvEF(X) e(v) if and only if u(x) = ~ for all x E X,
3. if U,v E F(X) such that u(x) ~ v(x) ~ ~ and u(x) ~ vex) ~ ! for all x E X,
then e(u) ~ e(v).

We propose the following definition of a measure of fuzziness of fuzzy k- partitions:


Definition 3 Function ¢J : Plica ~ R+ satisfying the following properties
1. ¢J is a sharpness preserving function, i.e. if U -< V then ¢J(U) ~ ¢J(V),
~. ¢J(U) = 0 if and only if U E Pica,
9. rp(U) = maxvEP1h ¢J(V) if and only if U = U,
is a measure of fuzziness of partitions from Plica'

We will say that ¢J is a normalized measure of fuzziness if ¢J(U) = 1.

Note:
Partition entropy where a = t is a normalized measure of fuzziness of U E Plica'

Example 9:
The following functions satisfy the properties of Definition 3:

¢J(U) " " U~·'1


= 1- .!.n~~ (10)
i j

¢J(U) =1- (1 - m!l-x(mjn


] , Uij» (11)

k" 1
tjJ(U) =1- 2n(k -1) ~ ~ lUi; - kl (12)
, J
307

3.2 Fuzziness measured by real functions


Measures of fuzziness (9), (10), (12) have the following general form:

¢I(U) = aLL I(u;j) + (3, (13)

where I is a real function on [0,1].


Two questions arise:
1. What are the properties of function I defined on [0,1] which guarantee that
function ¢I : PJlco -+ R defined by

¢I(U) = aLL I(u;j) + {3, (14)


j

is a measure of fuzziness of fuzzy k-partitions ?


2. How to find constants a and (3 in order to obtain a normalized measure
of fuzziness of fuzzy partitions ?
We answer these questions in the following remarks:

Property 2 Let I be a real continuous lunction defined on [0, 1] and let


.l J(y)-J(%) > SUp.l
A = inf
0:5%<1/:5. 1/-% -
J(.)-/(r) = B
.:5'<'9 .-r .
Then lor U, V E p/ ko such that U -< V
LLI(u;j):5 LLI(v;j). (15)
j j

Property 3 Let I be a real continuous function defined on [0, 1] and let


C /(1/)-/(%) <. f lW::.&l D
= sUPO:5%<II:5i 11_% - In 1:5'<'9 ,-, = .
Then lor U, V E p/ ko such that U -< V
LLI(u;j) ~ LLI(v;j). (16)
j j

As consequence of Property 3 we have:


1. Let I : [0,1] -+ R be any nonconstant continuous function increasing on [0, t] and
decreasing on [t,l] . Then there exist constants a, {3 such that

(17)
j

is a function preserving sharpness of fuzzy k-partitions.

2. Let I : [0, 1] -+ R be any convex or any concave nonlinear function. Then there
exist constants a, (3 such that
¢I(U) = aLL I(u;j) + {3 (18)
j

is a function preserving sharpness of fuzzy k-partitions.


308

Property 4 Let U E P fka - Pka, U =F U. Let f : [0, 1]-t R be a continuous function


satisfying one of the two next conditions i), ii) and the condition iii):

Z
.) A . f O<r<lI<l f(II)-_ fer) >
= In _ sUP~<r<.<1
f(.)- fer)
_r =B,
_ Y or
-Ir: lII'- _ ,

.. ) C <.
U = sUPO;5;r<II;5;:/; f(II)-f(r)
II-r
f
- In :/;;5;r<'9
f(.)-f(r)
'-r = D,

iii) (A - B)2 + (C - D)2 =F O.

Then
1jJ(U) = 0 L L f(Uij) + {3 (19)
i

where
1
(20)
0= n[k.f(t) - (k - 1).f(0) - f(l)]
and
{3 = [- f(l) - (k - l).f(O)]o.n (21)

is a normalized measure of fuzziness of fuzzy k-partitions.

Example 5:
Examples of functions which could be used in (19):

f(t) = t 2 for t E [0,1]

t if t E [0, t]
I(t) = { l~k(t - 1) if t E (t, 1]

o if t = 0
I(t) = {-tlogt ift E (0,1]

I(t) = 1 - It - tl for t E [0,1]


For k = 4 we can use
5t + 2 if t E [0,0.1]
{ 2.5 if t E (0.1,0.2]
I(t) = -lOt + 4.5 if t E (0.2, t]
lOt - 0.5 ift E (t, 1]
309

4. Dissimilarity and fuzziness of fuzzy partitions


4.1 Measures of dissimilarity

Definition 4 Function D : Pllco x Pllco -+ R+ is a measure of dissimilarity between


two partitions from Pllco if it satisfies the following properties:
D(U, V) = 0 iff U = V, (22)
D(U, V) = D(V, U). (23)
Definition 5 A dissimilarity measure D is called a dissimilarity measure preserving
sharpness of fuzzy partitions if for U, V E Plko such that U -< V we get
D(U, 0) ~ D(V, 0).

Property 5 Let d : [0,11 x [0,11 -+ R be a distance function. Let U E Plko' Then


D( U, V) = Ei E j d( Uij, Vij) is a sharpness preserving dissimilarity function which
can be used in construction of measures of fuzziness of fuzzy k-partitions.

Example 6:
Let U, V be two fuzzy k-partitions of X = {Xl,''''X n }, Let 1"(u(xj),v(Xj)) be the
distance function of Minkowski class defined by

(24)

Then
(25)

is measure of dissimilarity preserving sharpness of fuzzy partitions.


Example 7:
Measure
D(U, V) = 1)1- E~UijVij2 1) (26)
j (EiUijEiVij)l
is a dissimilarity measure preserving sharpness.

Example 8:
Measure
D(U, V) = t
j=1
E7-21l u ij - Ui-l,jl-I vij - Vi-l,jll
UmjVmj
(27)

I
> d Uij}, and Vmj = mini v· > d Vij}, is a dissimilarity measure
where Umj = mini u '1_1: I .,_"

which does not preserve sharpness of fuzzy partitions.


For example:
Let U, V, W E Pl40 be partitions given in Example 2. Obviously W -< V -< U, but
D(W,O) = 18 :$ D(V, 0) = 18.67, and D(V, 0) = 18.67 ~ D(U, 0) = 12.53.
310

4.2 Dissimilarity between a fuzzy partition and its complement


One way of measuring the amount of fuzziness of a fuzzy partition is to view the
fuzziness in terms of lack of distinction between the partition and its complement.
The less the partition differs· from its complement the fuzzier it is. We use the
definition of complement introduced by Bodjanova (1994).

Definition 6 Let U E Pfka. The complement of U is a fuzzy partition'" U E Pfka


defined by
~
Uij - :::-
'" U·· -
I) -
~-=-
1 - A.) (28)

where
;{. 1
IJ mIni Uij < k
;{. I (29)
IJ mIni Uij = k.

Note:
Let U E Pika, where k = 2. Then'" Uij = 1 - Uij for all i,j, which is Zadeh's
definition of complementation of fuzzy sets.

Property 6 Let U, V E Pika. Then

~. if V -< U then'" U -<'" V,


3. U ='" U iff U = 0, therefore 0 is the unique equilibrium of Pika.

Example 9:
Let us consider the fuzzy partition V E PI3a given by the matrix

0.9 0.3 0 0.4)


V = ( 0.1 0.2 1 0.2
0.0 0.5 0 0.4

Complement of V:

0.00 0.36 0.50 0.20)


'" V = ( 0.47 0.44 0.00 0.60
0.53 0.20 0.50 0.20

Property 7 Let U E Pika and D : Pika X Pika -+ R be a sharpness preserving


dissimilarity measure such that ifYt, V2 E Pka , and W E Pika - Pica then

D(V1 , '" Yt) = D(V2 , '" V2 ) > D(W, '" W). (30)
Then there exist constants a f. 0 and f3 E R such that

¢>( U) = aD( U, '" U) + f3 (31)

is a measure of fuzziness of fuzzy partitions from Pika.


311

It is obvious, that the fuzzier a partition U E PIka, the more similar to the equilib-
rium of PIka. Therefore, another way of measuring the amount of fuzziness of a fuzzy
partition is to evaluate its dissimilarity with U.

Property 8 Let U E Pika and D : Pika X PIka - t R be a sharpness preserving


dissimilarity measure such that if Vi, V2 E Pko , and W E P lko - P ko then

D(Vi,U) = D(V2'U) > D(W,U). (32)


Then there exist constants a f. 0 and (3 E R such that

<P(U) = aD( U, U) + {3 (33)


is a measure of fuzziness of fuzzy partitions from PIka.

5. Conclusion
We have proposed a definition of a measure of fuzziness of fuzzy partitions. Our
definition is a generalization of the measure of fuzziness of fuzzy sets. We identified
classes of real functions which could be used for constructing measures of fuzziness
of fuzzy partitions. We also explained how the fuzziness of fuzzy partitions can be
evaluated by the dissimilarity between a fuzzy partition and its complement. Since
fuzziness is one of the most important characterizations of fuzzy partitions, more
theoretical work needs to be conducted in this area.
References:
Backer, E. (1987): Cluster analysis by optimal decomposition of induced fuzzy sets, Delftse
Universitaire Pres, Delft.
Bezdek, J.C. (1981): Pattern recognition with fuzzy objective function algorithms, Plenum
Press, New York.
Bodjanova, S. (1994): Complement of fuzzy k-partitions, Fuzzy Sets and Systems , 62,
175-184.
De Luca, A. and Termini, S. (1972) : A definition of a nonprobabilistic entropy in the
setting of fuzzy sets theory, Information and Control, 20, 4, 301-312.
Empotz, H. (1981) : Nonprobabilistic entropies and indetermination measures in the set-
ting of fuzzy set theory, Fuzzy Sets and Systems, 5, 307-317.
Klir, J.G. and Youan, B. (1995): Fuzzy sets and fuzzy logic: theory and applicaitons, Pren-
tice Hall, Englewood Cliffs.
Zadeh, L.A. (1965): Fuzzy sets, Information and Control, 8, 3, 338-353.
Additive Clustering Model and Its Generalization
Mika Sato l , Yoshiharu Sat0 2
I Institute of Policy and Planning Science, University of Tsukuba
Tenodai 1-1-1, Tsukuba 305. Japan
e-mail: [email protected]
2 Department of Information and Management Science, Hokkaido University
Kita 13, Nishi 8, Kita-ku, Sapporo 060, Japan
e-mail: [email protected]

Summary: ADCLUS (ADditive CLUStering) is known as a clustering model which is designated


for the purpose of finding the structure of the similarity data. The aim of this paper is to gener-
alize this model from several points of view. The first point of view is to extend the degree of
belongingness of the objects to the continuous value in the interval [0,1], namely to an additive
fuzzy clustering model, because the combinatorial optimization is inevitable in the algorithm for
ADCLUS. The second point of view is to generalize the model for an asymmetric similarity data.
And the third point of view is that we introduce the aggregation operator in the model to represent
the degree of simultaneous belongingness of the objects to each cluster.

1. Introduction
The concept of ADCLUS (ADditive CLUSTering model) was proposed by Shepard and
Arabie (1979). This model is intended to find the structure of a similarity relation between
the pair of objects by clusters. The model is defined by the following:

1\'
Sij = L
k=1
U"kP,kP;k + Oij' (1)

where 8ij (0 ::; s') ::; 1; i, i = 1,2,···. n) is the observed similarity between objects
i and i, K is the number of clusters, and U'k is a weight representing the salience of the
property corresponding to cluster k. If object i has the property of cluster k, then Pi/c 1,=
otherwise it is O. In this model, the similarity is represented by the sum of weights of
clusters kl' 1..'2,' .. ,km' to which both objects i and i belong. That is,

This shows that if objects i and j belong to cluster kc, then the degree of contribution to
the similarities is U'k,. Moreover, if the pair of objects shares some common properties, the
grades which the pair of objects contributes to the similarities U'k" •..• 1Ckm are additive.
In fuzzy clustering, a fuzzy cluster is defined to be a fuzzy subset on a set of objects and
the fuzzy grade of each object represents the degree of belongingness. The degree of be-
longingness of object i to cluster k is denoted by Uik. (0 ::; U,k ::; 1). To avoid conditions
in which objects do not belong to any clusters, we assume that

I,
lIil.: :::: 0, L Itil.: = 1. (2)
k=1

A pioneering work for applying the concept of fuzzy sets to a cluster analysis was made
by E. Ruspini (1969). Since the fuzzy c-means clustering algorithm was proposed by lC.

312
313

Bezdek (1987) and J.e. Dunn (1973), several methods of fuzzy clustering have rapidly
developed and many applications have been suggested (Dave, et al. (1992), Hall, et al.
(1992».
In order to construct a fuzzy clustering model, we have to define a function p( llik, U jk) which
represents the grade of belongingness of objects i and j to cluster k. Generally, p(.T, y) is a
function from [0,1] x [0,1] to [0,1]. By using this function p, the clustering model (1) is
extended as follows (M. Sato and Y. Sato (1994a, 1994b»:
I\
sij = LP(Uikoltjk)+:ij. (3)
k=l
Namely, the similarity sij is represented by the addition of the functions P(Uik, Ujk), where
=ijis an error. T-norms (K. Menger (1942), M. Mizumoto (1989» are well known as a class
of concrete functions p(x, y).
In the practical applications, the similarity data Sij is not always symmetric - for instance,
the mobility data, the input-output data, the perceptual confusion data and so on. Recently,
clustering techniques based on such asymmetric similarity have generated tremendous in-
terest among a number of researchers.
In conventional clustering methods for asymmetric similarity, A.D. Gordon (1987) pro-
posed a method using only the symmetric part of the data. This method is based on the
idea in which the asymmetry of the given similarity data can be regarded as errors of the
symmetric similarity data, that is,
- 1
5 =2(5 + 5),
I

where 5 is a similarity matrix and 5' is the transposed matrix of 5. As for the data, S was
used.
L. Hubert (1973) proposed a method to select a maximum element in the corresponding
elements, that is,
= =
sij max(sij, Sji), oSij min(sij, Sji).
R.E. Tarjan (1983) proposed an algorithm to hierarchically decompose a directed graph
with weighted edges which is used for clustering of asymmetric similarity.
We have proposed a model under which the cluster is constructed with similar objects, but
the similarity between clusters is not symmetric_ (see M. Sato and Y. Sato (1995a, 1995b,
1995c).
In the above conventional clustering algorithm, the concept of similarity between a pair
of objects is eventually reduced to symmetry. Therefore, we will propose a new concept
of asymmetric aggregation operators in order to represent the asymmetric relationship be-
tween a pair of objects. Introducing these asymmetric aggregation operators into the above
fuzzy clustering model (3), a new model is proposed in order to obtain clusters in which ob-
jects are not only similar to each other but also asymmetrically related. The validity of this
model is shown by the numerical example and some features of the aggregation operators.

2. Generalized Clustering Model


Suppose that [( fuzzy clusters exist on a set of n objects, that is, the partition matrix [: =
(Uik) is given. ltik is a fuzzy grade (the degree ofbelongingness) in which object i belongs
to cluster k. This condition is assumed:
I\
ltik ~ 0, L ltik = 1. (4)
k=l
314

Let p( Il,/.:. II j/) be a function which denotes the degree of simultaneous belongingness of the
pair of objects i and j to clusters k and I, namely, a degree of sharing common properties.
Then a general model for the similarity sij is defined as follows:

(5)

Pij =(P(Ud, Ujl).···. P(U,I. Ujl.;),···. p(llil,. Iljl),"', p(Il,I\". 1l)1'»'


To state simply, we assume that if all of p(lt,/.:Il)I) are multiplied by (1, then the similarity is
also multiplied by (1. Therefore, the function.; itself must satisfy the condition "positively
homogeneous of degree 1 in the p", that is,

(1 > O.
We consider the following function as a typical function of .;:
I,'
8ij =L P(lti/.:. Ilj/.:) + :ij'
/,:=1

The P is assumed to satisfy the following conditions:

1. 0 ~ pelli/.:. Ujl) ~ 1, P(Ui/.:. 0) =O. P(Ui/.:. 1) = U,k


2. P(Uik. Ujl) ~ p(lt./.:. UII) whenever Ui/.: ~ llsbU)1 ~ Uti

3. P(lti/.:. lIjl) = p(ltjl. /t,d


We can consider the T-norms as a function which satisfies the above conditions. (Weber,
1983).
The structure of an observed similarity is usually unknown and complicated, so various
clustering models are required to identify the latent structure of the similarity data. There-
fore, we define the general class of clustering models so as to represent many different
structures of the similarity data. Moreover, the robustness of this model is confirmed by
simulation.

3. Model for Asymmetric Similarity Data


3.1 Similarity between Clusters
If the observed similarity data is asymmetric, then the proposed additive fuzzy clustering
models in the foregoing sections are not available. We then extend Model (5) as follows:
1\' I\"
8ij = LLU'/':IP(Ui~.UjI)+:ij.
k=1 1=1

In this model, the weight U'k/ is considered to be a quantity which shows the asymmet-
ric similarity between the pair of clusters. That is, we assume that the asymmetry of the
similarity between the objects is caused by the asymmetry of the similarity between the
clusters.

3.2 Asymmetric Aggregation Operator


In Model (5), if the obtained similarity is symmetric, then the model is represented by sym-
metric function P, and by a symmetric similarity between clusters.
On the other hand, if the similarity is asymmetric, two ways exist. The first is a way which
315

represents the asymmetry between objects by the asymmetry between the clusters in the
foregoing section. The second is a way using the new approach, that is the asymmetric
aggregation operators. In this case, we have to create new aggregation operators which
satisfy the following conditions: Boundary conditions, Monotonicity, and Asymmetry.
Suppose. f(x) is a generator function of t-norms, and 0(1") is a continuous monotone de-
creasing function satisfying

Q: [0.1] --+ [O,x], 0(1) = 1.

Then we define the following -,(.T, ./) as asymmetric aggregation operators:

-r(.r,./) = f[-lJ(f(·r) + o(.1')f(y».

Using the generator function of the Hamacher product (R Fuller (1991», i.e. f(.r) = 1 - .1'
.1'
1
and the monotone decreasing function 0(.1') = - (m > 0). the asymmetric aggregation
.rm
operator is defined as
.rmy
-,(.r, y) = 1(I) •
- Y +.r m- y
which is shown in Figure 1. In Figure 2, the dotted curve shows the intersecting curve of
the surface shown in Figure 1 and the plane x = y, and the solid curve is the intersection
with .1' + Y = 1. >From the solid curve, we find the asymmetry of the proposed aggregation
operator. Figure 3 shows the asymmetric aggregation operator defined as
xy
-/(X, y) = y + .r(2 _ .1')m(1 _ y)'

where the generator function is the Hamacher product, i.e. f(l') = 1 - .r and the monotone
.1'
decreasing function 0(.1') = (2 - .r)"'. Figure 4 shows intersecting curves with .r = ./) and
.1' + Y = 1. In the case of the generator function of the Algebraic product, i.e. f(.r) = - log.r
and the monotone decreasing function o(.r) =2 - .rm (shown in Figure 5). the asymmetric
aggregation operator is defined as
-f(.r, y) = .ry(2-x"'l.

Intersecting curves are shown in Figure 6.


Generally, -,(.r. y) satisfies the inequality

,(.r.y) ~ p(x,y) (6)

by f(.1') + fey) ~ f(x) + O(.r)f(y), because o(.r) ;:::: 1. Since the following inequality:
/,- 1,"
L -(Ulb Ujk) ~ L P(U,b up.,) :'S 1
k=1 k=1

is satisfied by (4) and (6), we assume the condition 0 ~ Sij ~ 1.


Using the asymmetric aggregation operators ,(.r. y), we define the model for asymmetric
similarity data as follows:
I,
Sij =L -(U,b lL)k) + oi), (7)
k=1
316

y (x, y)='i y/(l-y+xy) y(x.y)=,cyl(l·y+xv) .II

/
y(x.xr-......../

o.
y(x.l·x)

o.

Fig. 1: Asymmetric Aggregation Operator Fig. 2: Intersecting Curves


(m = 2)

y(x,y)=xy/(y+(1-y)x(2-x))

Fig. 3: Asymmetric Aggregation Operator Fig. 4: Intersecting Curves


(m = 1)

y (x,y)=xy"')

o.
y(x.l·x)

o
o
o.
\
........., ..•.............

Fig. 5: Asymmetric Aggregation Operator Fig. 6: Intersecting Curves


(17/ = 1)
317

4. Numerical Example
To demonstrate the applications of Model (7), we will use data which shows telephone
traffic from one prefecture to another. In the optimization algorithm used in this example,
20 sets of initial values are given by using uniform pseudorandom numbers in the interval
[0. 'fl, and in the end, we select the best result. The number of clusters is determined based
on the value of fitness. By increasing the number of clusters, the value of fitness decreases,
but even if the number of clusters is greater than 4, there is no severe decrease of the fitness.
>From the principle of parsimony, it should be considered that the number of clusters is
determined to be 4.
The results of the analysis using the asymmetric aggregation operator defined in Figure 1
are shown in Table 1 and Figure 7. In Figure 7, monotone gradation in this figure shows
the degree of belongingness of a prefecture to a cluster. The darker the shade, the larger
the degree of belonging. As for the results, we find that geographical distance is closely
connected with the telephone communication. Moreover, the results show that large cities
become influx points.

o
.

-~ --

Fig. 7: Fuzzy Clustering


318

Tab. 1: Clustering using Asymmetric Aggregation Operator

No. of Prefectures C 1 C1 C3 C4
Hokkaido .326~ .17U2 .1~)3 .3076
Aomori .5469 .0992 .0999 .2540
Iwate .6248 .0574 .0684 .2494
Miyagi .7782 .0000 .0000 .2218
Akita .5956 .0584 .0785 .2675
Yamagata .6063 .0417 .0739 .2781
Fukushima .4882 .0532 .0737 .3849
Ibaragi .2093 .1022 .0974 .5911
Tochigi .2484 .0915 .0908 .5693
Gunma .2173 .1145 .1106 .5576
Saitama .1862 .0003 .0056 .8079
Chiba .0951 .0847 .0985 .7218
Tokyo .0000 .0000 .0000 1.0000
Kanagawa .0537 .0591 .1071 .7800
Niigata .2979 .0999 .1762 .4260
Toyama .2277 .1223 .3414 .3085
Ishikawa .2063 .1060 .3934 .2942
Fukui .1993 .1294 .4218 .2495
Yamanashi .1880 .1665 .1612 .48·B
Nagano .1657 .1167 .2292 .4883
Gifu .1261 .1177 .4326 .3236
Shizuoka .1454 .1373 .2527 .4645
Aichi .0000 .0000 .5544 .4456
Miye .1286 .1217 .4716 .2780
Shiga .1054 .1091 .5867 .1988
Kyoto .0988 .0234 .6947 .1831
Osaka .0000 .0000 1.0000 .0000
Hyogo .0000 .1597 .6592 .1811
Nara .1203 .1152 .6121 .1525
Wakayama .1715 .1532 .4937 .1816
TOllori .0808 .2584 .4700 .1908
Shimane .0790 .3073 .4226 .1911
Okayama .0660 .2810 .4652 .1878
Hiroshima .0000 .3785 .4287 .1928
Yamaguchi .0530 .4795 .2784 .1891
Tokushima .1416 .2270 .4428 .1887
Kagawa .0756 .2431 .4756 .2057
Ehime .1033 .2811 .4153 .2003
Kochi .1537 .2519 .3947 .1997
Fukuoka .0000 .9098 .0000 .0902
Saga .1106 .6236 .1127 .1531
Nagasaki .0975 .5546 .1616 .1863
Kumamoto .0738 .5996 .1456 .1810
Oita .0945 .5480 .1741 .1834
Miyazaki .1076 .5098 .1843 .1984
Kagoshima .1063 .4796 .2045 .2096
Okinawa .2347 .3276 .2050 .2327

Goodness of fit = 0.0099


319

Acknowledgement
A part of this work was supported by a grant for Scientific Research from the Ministry of
Education, Science and Culture of Japan.

References:

Bezdek, J.e. (1987). Pattem Recogl/itiol/ with Fuzzy Objective FUilctiol/Algorithms. Plenum Press.
Dave, R.N. and Bhaswan, K. (1992). Adaptive Fuzzy c-Shells Clustering and Detection of Ellipses.
IEEE Tral/sactiol/s 01/ Neural NeMorks, 3, 643-662.
Dunn, J.e. (1973). A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact
Well-Separated Clusters. Jour. Cybemecics, 3,3,32-57.
Fuller, R. (1991). On Hamacher-sum of Triangular Fuzzy Numbers. Fuzzy Sets al/d Systems, 42,
205-212.
Gordon, A.D. (1987). A Review of Hierarchical Classification. Joumal of the Royal Statistical
Society, Series A, 150,119-137.
Hall, L.O. and Bensaid, A.M. et al.. (1992). A Comparison of Neural Network and Fuzzy Clus-
tering Techniques in Segmenting Magnetic Resonance Images of the Brain. IEEE Tral/sactiolls 01/
Neural NeMorks, 3, 672-682.
Hubert, L. (1973). Min and Max Hierarchical Clustering Using Asymmetric Similarity Measures.
Psychometrika, 38, 63-72.
Menger, K. (1942). Statistical Metrics. Mathematics, 28, 535-537.
Mizumoto, M. (1989). Pictorial Representations of Fuzzy Connectives, Part I: Cases of t-Norms,
t-Conorms And Averaging Operators. Flazy Sets alld Systems, 31,217-242.
Ruspini, E.H. (1969). A New Approach to Clustering,/llform. Colltrol., 15, 1,22-32.
Sato, M. and Sato, Y. (1994a). An Additive Fuzzy Clustering Model. Japallese Joumal of Fuzzy
Theory alld Systems, 6, 2, 185-204.
Sato, M. and Sato, Y. (1994b). Structural Model of Similarity for Fuzzy Clustering. Joumal of the
Japal/ese Society of Co mputa tiolla I Statistics, 7, 27-46.
Sato, M. and Sato, Y. (1995a). Extended Fuzzy Clustering Models for Asymmetric Similarity.
Ftazy Logic alld Soft Computillg, World Scientific, 228-237.
Sato, M. and Sato, Y. (1995b). A General Fuzzy Clustering Model Based On Aggregation Opera-
tors. Behaviormetrika, 22, 2, 115-128.
Sato, M. and Sato, Y. (1995c). On a General Fuzzy Additive Clustering Model. II/tematiol/al Jour-
I/al of buelligel/t A utomatioll al/d Soft Computil/g. I, 4, 439-448.
Shepard, R.N. and Arabie, P. (1979). Additive Clustering: Representation of Similarities as Com-
binations of Discrete Overlapping Properties. Psychological Review, 86, 87-123.
Tarjan. R.E. (1983). An improved algorithm for hierarchical clustering using strong components.
Illformatioll Processil/g Letters, 17. 37-4l.
Weber. S. (1983). A General Concept of Fuzzy Connectives, Negations and Implications Based on
T-Norms and T-Conorms. Ftazy Sets al/d Systems, 11, 115-134.
A Proposal of an Extended Model of ADCL US
Model
Tadashi Imaizumi
Department of Management & Information Sciences
Tama University
4-1-1 Hijirogaoka, Tama-shi,Tokyo 206, Japan

Summary: This paper presents an extended model of ADCLUS model as a model for
overlapping clustering. It is assumed that a degree of the belongings of each object to a
cluster is represented as a binary random variable, and similarity between two objects is
defined by a weighted expectation of a cross product of these variables. The problems on
the visualization of the results is also discussed. An extension of the proposed model to a
two-mode, two-way data is also described.

1. Introduction
When we want to analyze a proximity data set among objects, it is very important
how to extract a hidden information(structure) of it by using several models and
several methods. MDS(Multi-Dimensional Scaling) models and methods have been
used for extracting it. We can extract a continuos structure of a data set using these
model in which dissimilarity or similarity are related to the inter-point distance of
the corresponding two points.
A combinatorial theoretic model will be proper ones when the structure of data set is
assumed to be the descrete one though these MDS methods are also applicable(Arabie
and Hubert, 1992). For example, this combinatorial theoretic model has been used
to analyze the data set which represents the structure of a confusions between 16
consonant phonemes (Arabie and Carroll, 1980; Soli, Arabie and Carroll, 1986), that
of a kinship relations(Carroll and Arabie, 1983).
As Shepard and Arabie (1979) first introduced ADCLUS model as a model for over-
lapping clustering of one-mode, two-way data, several models and several methods
which based on this model are proposed. The basic ADCLUS model for one mode,
two-way data matrix is expressed as
R
Sij = L WtPitPjt + C + eij· ( 1)
t=1

An extended model for two-mode, two-way data matrix or three-way matrices is


proposed as
R
Sij = L WtPit%i + C + eij, (2)
t=1
R
Sijl< = L WlctPitPjt + Clc + eijlc, (3)
t=1

where Pit or %t is a binary variable whose value is 0 or 1, Wt or Wlct is a positive value.


Arabie and Carroll(1980) proposed an alternative algorithm,MAPCLUS program, for

320
321

fitting ADCLUS model. An extended models for three-way matrices are named IN-
DCLUS model. In these models, N objects are clustered into R possibly overlapping
clusters. Similarity between two objects that are not members of a cluster is assumed
to be zero.
There are four states ( Pil,Pjl ) of two objects which represents two objects belonging
to the same cluster or not. The state (1,1) represents two objects belonging to the
same cluster, the state (0,0) represents them not belonging, and the states (I,D) or
(0,1) represents one object belonging to the cluster and the other one not. The prop-
erty of ADCLUS model considering the state (1,1) only will also lead the number of
clusters being increased and the visualization of the resultant clusters being difficult.
In the ADCLUS model, only the state (1,1) contributes to the similarity and the
other three states are ignored.
As this ADCLUS model will be rewritten as the matching model of two binary
sequences (Pi1, Pi2, ... ,Pit, ... , PiR) and (Pj1, Pj2, ... , Pjl," . ,PjR) which represent
whether an object has the property t or not. The binary 0 or 1 are exchangeable
for any cluster t, the state (0,0) or (1,1) will contribute the similarity between the
corresponding two objects. From this point of view, we propose an extended model
of ADCL US model in this paper.

2. Model
We present an extended model for overlapping clustering by defining similarity be-
tween two objects that are not members of a cluster. Let XiI denote an binary random
variable such that
(4)
'vVe also assume

XiI and Xj' (i =1= j, t =1= s) are independently distributed.


The similarity between two objects 0i and OJ, Sij will be defined by
R
Sij = L {WIXiIXjl + Ut(1- X i, )(l - Xjt)} + c, (i =1= j) (5)
1=1

where WI, Uj and c denote real value respectively. In this model, similarity between
0i and OJ for cluster t, Sijl takes one of three values

(6)

We also assume that the observed similarity sij(i = 1,2,···, n;j = 1,2,···, n;j =1= i)
are a realization of E(Sij). We have
Sij = E[XiWXj + (J - Xi)U(J - Xj)'] + c, (i =1= j). ( 7)
where S = (Sij) is N x N a similarity matrix of N objects, X =(Xk ) is a N x R
binary random matrix, J is 1 x R matrix whose elements are aliI, tV and U is R x
R diagonal matrix of positive weights respectively. This model is expressed as
Sij = trace(U) - E(Xi - Xj)U(Xi - Xi!' + E(Xi)(W - U)E(Xj)' + c, (i =1= j). (8)
322

The expection E(Xi - Xj)U(Xi - X;)' is interpreted as the weighed squared distance
between Xi and Xj, and indicates the distance property of the proposed model. As
the term jWr - uri is decreasing to 0 for each cluster t, a visualization of estimated
parameters will be possible. The following two models are special sub-model of the
proposed model.
2.1 Case W=U
In this case, Two objects which are in the same state were equally weighted. A mean
similarity 5ij is expressed as
5ij = trace(U) - E(Xi - Xj)U(Xi - Xj)' + c. (9)
A resultant clustering will be represented as a geometric configuration.

2.2 Case W=O or U=O


The proposed model is same to the ADCLUS model. Since Wand U are exchange-
able, two objects which belongs to the same cluster are weighted.
( 10)

3. The Algorithm
The observed mean similarity 5ij is represented as
R
5ij = 2)WrPirPjr + ur(1- Pir)(1 - Pjr)} + c + eij, (ll)
r=1
with the constraints 0 ~ Pir ~ 1. To estimate these parameters, we use the ordinary
least squares method with the constraints on the parameters. The estimation pro-
cedure for given the number of clusters R is an iterative process consisted of three
steps.

3.1 Initial value of Pit


When there is no rational initial value of Pir, P~r' we generate the initial pO = (P~r)
by using uniform random number (0,1),
pO", Uniform(O, 1). (12)
3.2 The esthnation of WI> Ur and c
As there are constraints with Wr and Ur being positive, they are estimated by using
the quadratic programming procedure with the active set method which minimizes,
n n

L L
i=1 j;ioi,j=l
{Sij - WrPirPjr - ur(1- Pir)(l- Pjr) - cV (13)

3.3 The estimation of Pir


For given Wr ,Ur and c, we estimate Pir(t = 1,2,···, R) using ALS procedure with
other Pjr(j = 1,2, ... , i-I, i + 1" .. , n) being fixed. Similarity of object 0i, 5ij(j =
1,2, ... , i-I, i + 1" .. , n), will be expressed as follows:
R R
Sij = L ur{1 - Pir) + c + L {( wr + Ur)Pir - Ut}Pit + eii (14)
r=1 r=1
323

with the constraints 0 ::; Pil ::; 1. We use the quadratic programming prodecure with
the active set method to obtain a feasible solution of (Pil, Pi2," ·,PiR).
These iterative processes 3.2 and 3.3 are repeated until some the convergence criteria
are satisfied. This iterative process is simple one than that of MAPCLUS procedure
since we estimate Pil instead of XiI, a realization of XiI'

4. How to represent the resultant configuration as a space


structure
The resultant configuration represents the descrete structure of data set. We com-
pare each objects on the uni-dimensional representations of R clusters since cluster
analysis is not to embed objects as points in lower dimensional space but to find
cluster structure in data. However,an geometric representation will be informative
since we will be able to understand the relationship between clusters.
The geometric representation as space structure seems to be difficult. This repre-
sentation will be done in case of W = U or W = 0 or U = O. Since

E(Sij) = trace(U) - E(Xi - Xj)U(Xi - Xj)' + E(Xi)(W - U)E(Xd + c, (15)

the quantity E(Xi)(W - U)E(Xj)' of two object 0i and OJ will indicates the degree of
the resultant configuration being not represented as the space structure. To represent
the resultant configuration as space structure, we assume the model with W = U,

(16)
as the target model to be represented. This configuration can be represented as a
space structure. We define the shifting factor, Fi , of object 0i to the other objects
by
Fi = E(Xi)(W - U) L" E(Xj )'/2(n -1). (17)
i~jJ=l

When this quantity Fi is positive one, a weighted squared distance E(Xi - Xj )U(Xi-
X j ), should be increased to represent as space structure. When Fi is negative one,
the distance should be decreased. The quantity Fi is interpreted as a deviate from
the point E(Xi)U to embed as point. This suggests that the resultant configuration
will be more informative with a vector,
n

Fi = L E(Xj)/(n - 1) (18)
j~i.j=l

from the point E(X;)U.


This visual representation of the resultant configuration as a space structure will
be appropriate for the upto 3 clusters. When number of clusters R is 4 or more, we
must decide what representation is more informative. We will be able to choose these
clusters, for example, by computing the sum of cross product between any 2 clusters
N
cr.1 = LPi,Pjl, (s:l t), (19)
;=1

and inspecting this R x R matrix or the eigen vectors of this matrix.


324

5 Applications
We applied the proposed model to the confusion matrix of Morse Code signals
collected by Rothkopt(1957). This data was analyzed by Shepard using Kruskal's
non-metric MDS procedure(Shepard, 1963). We analyzed the symmetrized similarity
matrix Slj whose (i,j) element is defined by

(20)

where Mij is (i,j) element of the confusion matrix. The analysis was done using the
number of clusters from 5 to 1. Ratio of the minimized squared loss over trace(S)
in 5 clusters to 1 cluster were 0.051, 0.069, 0.082, 0.109, and 0.183. By using elbow
criterion and to compare result by MDS, we chose two-cluster result as a solution.
The estimated w, u and c are shown Table 1. The estimated P is shown Table 2 and
is illustrated in Figure 1,

Tab. 1: The estimated w, u and c in number of clusters =2


Cluster 1 Cluster 2
w 57.820 66.198
u 39.281 17.285
c -22.935

Table 1 shows that the state (1,1), two Morse code are a member of this cluster, and
the state (0,0), they are not a member of this cluster, are contributed to similarity
in cluster 1 since the ratio Ul/Wl is about 0.679. In cluster 2, the state (1,1) only
is contributed to similarity since the ratio U2/W2 is about 0.261 and it seems to be
small.
Tab. 2: The estimated P in number of clusters = 2
Morse code Clus.l Clus. 2 Morse code Clus. 1 Clus. 2
A .- 0.201 0.068 S ... 0.391 0.000
B -... 1.000 0.051 T- 0.154 0.099
C -.-. 0.873 0.398 U .. - 0.437 0.000
D -.. 0.622 0.000 V ... - 0.675 0.000
E. 0.194 0.083 W.-- 0.501 0.068
F .. -. 0.747 0.099 X -.. - 1.000 0.151
G --. 0.415 0.224 Y -.-- 0.872 0.512
H .... 0.521 0.000 Z -- .. 0.781 0.512
I .. 0.211 0.084 1 .- - -- 0.375 0.873
J .- -- 0.643 0.544 2 .. - -- 0.565 0.744
K -.- 0.668 0.024 3 ... -- 0.686 0.316
L .-.. 0.943 0.105 4 .... - 0.690 0.032
M -- 0.234 0.125 5 ..... 0.593 0.000
N -. 0.189 0.086 6 -.... 0.929 0.209
0-- - 0.393 0.292 7 --... 0.792 0.541
P .--. 0.808 0.565 8 ---.. 0.503 0.817
Q --.- 0.818 0.676 9 - - --. 0.233 0.855
R .-. 0.524 0.000 0----- 0.202 0.692
325

T· E. 0····· g.... 1··- 0··· U..· 8·-.. R:. 5..... J.- V.. : 4... : Z·· .. P-. y.:. 6· .... X'.:

Ld\h ~MdS lG-lWdH~$+tt,hIQ-~;y


~i~i~I~?~?i~~i~ (~
ClU3tr I

R· D· 5 U·4 ·W·· EN· T· M·· 6· 0·- C·· y •.. J-Q'" 2··· g••••

lr(k~wa·~~J~l-:bJ~-
~(~ ~X(, i
mm~~~~ o~rno~m
ClU3tr2
Fig. 1: The estimated P.
From Table 2 and Figure 1, the cluster 1 seems to be interpreted as the degree of
mixture of dash and dot. The cluster 2 is interpreted as dot-to-dash ratio.

9-\. \ 1 .----

rr
3.5 -
8 ---

3- 2
O-~ Q --.-
2.5 -
I
N

~
I-<
2-
J;t- jI.j . -
.!l
t)
~.-.
1.5 -
0---
----p.
y-
1- -<:1---
G-~
6-.~

'PV.K '\
L'
, X- ..-
0.5 -

0
~ fW.t,
S ..u
t~ ~~\~ . -.
..' 1~ .-. 5 ..... ~ ....
B -...
I I I I
0 1 2 3 4 5 6 7
Cluster 1
Fig. 2: A geometric representation with the shifting factor from each point.

In Figure 2, The estimated PU are illustrated as points in two-dimensional space.


326

Each vetcor from each point illustrates the shifting factor. each point was shifted to
the direction of the centroid of other n - 1 points.

6. An extension to Two-mode Data Matrix


This model is applicable to the two-mode data matri..x of similarity. Let Yjt denote
binary random ·variablefor column object.

(21 )

We also assume

Yjt and Y k• (j =1= k, t =1= s) are independently distributed,

and
Xit and Yj. are also independently distributed,
Then the model for two-mode data matri..x of similarity will be defined as
R
Sij = 2:: WtXitljt + Ut(1- X it )(1 - Yjt) + c (22)
t=1
= E[XiWY; + (J - XdU(J - lj)'] + c, (23)

where S = (Sij) is a N x M similarity matrix, X = (Xk) is a N x R binary random


matrix ,Y = (Yj) is a M x R binary random matrix Wand U is R x R diagonal
matri..x of positive weights.
The estimation procedure of Pit and qjt are similar to that in section 3. For given Wt
,Ut, c, and qjt(j = 1,2,···, m; t = 1,2,···, R) we estimate Pit(t = 1,2,···, R) using
ALS procedure. Similarity of row object Oi, Sij(j = 1,2,··" i-I, i + 1,···, m), will
be expressed as
R R
Sij = 2:: Ut(l- qje) + c + 2::{(Wt + Ut)qjl - Ut}Pit + eii (24)
t=1 t=1
with the constraints 0 :::; Pit:::; 1. We use the quadratic programming prodecure with
the active set method to obtain a feasible solution of (Pil,Pi2,··· ,P,R). Then qit(j =
1,2,'··, m) will be updated using these updated Pit(i = 1,2,···, n; t = 1,2,···, R)
by similar procedure.

7 Discussions
We proposed the extended model of the ADCLUS model with random variables. By
using this model, statistical inference on parameters will be applicable. The state
(0,0) of (Xit , Xit) is also contributed to the similarity Sij. As Sijl takes one of three
values, 0, Wt, and Ut, the maximum distinct values of similarity matrix is 3 + 2/- 1 •
The other hand, it is 2 + t - 1 in ADCLUS model. The number of clusters needed
will be less that that of ADCLUS model for analyzing a similarity matrix.
As the estimation procedure is simple one, the weighted least squares method will
be applicable by using a consistent estimator of variance of Sij.
327

The model discussed in section 6 is also applicable to the case of the preference data
with some modification. It seems to be natural that each subject is correspond to
each ideal object. Then we relate the degree of preference of subject Si to object OJ
with similarity between an idean object Ii and OJ. However,it seems to be difficult
that the evaluated preference is a realization of the expection from statistical point
of view. We will analyze the preference data with the latent class model in which the
definition of the expection will make a sense.
When data matrix is one of preference data, we assume that there are G unobserved
group and each subject belongs to one of these groups. Then we can apply this
model by assuming that the preference to object OJ of subject Si, who belongs group
9 (g = 1,2,···, G), is related to the similarity between OJ and some ideal point g".
Let X gt denote the binary random variable of group 9 in cluster t. We define the
similarity between group 9 and object OJ by
R
Sgj = L WgtXgtYit + U gt(1 - X gt )(1 - }jt) + cg • (25)
t=1

Each group differently weights each property in this model. We assume that the
preference to object OJ in group 9 is equall to Suj. Each subject is assumed to be a
sample from one of G groups. A clustering procedure will be rolled in the estimation
procedure for allocating each subject to one of groups.
As the extension of the proposed model to analyze three-way data matrices, specially
for analysis of individual differences, one model is same to INDCLUS model, and the
other model is based on the latent class model in which each individual belongs to
one of G groups.

8. References
Arabie, P. and Carroll, J. D.(1980): MAPCLUS: a mathematical programming approach
to fitting the ADCLUS model Psychometrika, 45,211-235.
Arabie, P. and Hubert, L. J .(1992): Combinatorial Data Analysis. Annual Review of
Psychology,43,169-203.
Carroll, J. D. and Arabie, P.(1983):INDCLUS: An individual differences generalization of
the ADCLUS model and the MAPCLUS algorithm Psychometrika, 48,157-169.
Rothkopt, E. Z.(1957): A measurement of stimulus similarity ans errors in some paired-
associate learning tasks. Journal of Ezperimental Psychology, 53 94-1Ol.
Soli, S. D., Arabie, P. and Carrol, J.D.(1986): Representation of descrete structure under-
lying observed confusions between consonant phonemes. Journal of the Acoustical Society
of America, 79,826-837
Shepard, R.N., and Arabie, P. (1979): Additive Clustering Representation of Similarities
as Combinations of Descrete Overlapping Properties. Psychological Review,86,87-123.
Comparison of Pruning Algorithms
in Neural Networks
Yoshihiko Hamamoto, Toshinori Hase, Satoshi Nakai, and Shingo Tomita
Faculty of Engineering, Yamaguchi University
Ube, 755 Japan

Summary: In order to select the right-sized network, many pruning algorithms have been
proposed. One may ask which of the pruning algorithms is best in terms of the generaliza-
tion error of the resulting artificial neural network classifiers. In this paper, we compare the
performance of four pruning algorithms in small training sample size situations. A com-
parative study with artificial and real data suggests that the weight-elimination method
proposed by Weigend et al. is best.

1. Introduction
There are two fundamental problems in the design of artificial neural network (ANN)
classifiers: finding training algorithms and selecting the right-sized network. Concern-
ing training algorithms, the back-propagation algorithm (Rumelhart et al., 1986) has
been widely used because of its simplicity. In back-propagation learning, the
following error function is minimized:

    E = Σ_{k∈T} (t_k − o_k)²,   (1)

where o_k is the output in the output layer for the k-th training sample, t_k is the
corresponding target value, and T is the training set. However, BP has two serious
disadvantages: extremely long training times and the possibility of being trapped
in local minima. On the other hand, concerning selection of the right-sized network,
the issue is that a network that is too large or too small can overfit or underfit the
data, respectively. The use of the right-sized network leads to an improvement in
the performance of the resulting ANN classifier. Hence, the problem of selecting the
right-sized network is very important in neural networks. We will address only this
problem.
In order to select the right-sized network, many pruning algorithms have been pre-
sented (Reed, 1993). Unfortunately, little is known about experimental comparison
of the pruning algorithms in finite sample conditions. In this paper, we compare the
performance of four pruning algorithms in terms of the generalization error of ANN
classifiers in small training sample size situations. Our emphasis is on giving practical
advice to designers and users of ANN classifiers.

2. ANN classifiers
We will consider ANN classifiers with one hidden layer. The units in the input layer
correspond to the components of the feature vector to be classified. The hidden layer
has m units. The units in the output layer are associated with pattern class labels.
In the network discussed here, the inputs to the units in each successive layer are the
outputs of the preceding layer. Initial weights were distributed uniformly in -0.5 to
0.5.


3. Pruning algorithms
In this section, we briefly describe four pruning algorithms. We follow the notation
of Reed (1993).
A. Karnin's method (Karnin, 1990)
Karnin measures the sensitivity of the error function with respect to the removal of
each connection and then prunes the weights with low sensitivity. The sensitivity of
a weight w_ij is given as

    S_ij = Σ_{n=0}^{N−1} [Δw_ij(n)]² · w_ij^f / ( η (w_ij^f − w_ij^i) ),   (2)

where N is the number of training epochs, η is the learning rate, w^f is the final value of
the weight after training, and w^i is the initial weight. Δw_ij in (2) can be calculated
by the back-propagation algorithm.
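As a concrete reading of Eq. (2), the following minimal Python sketch (our own illustration, not the authors' code; the function names are ours) accumulates the squared per-epoch weight updates during ordinary back-propagation and applies Karnin's scaling after training:

    import numpy as np

    def karnin_accumulate(acc, delta_w):
        # running sum of squared per-epoch weight updates, one entry per weight
        return acc + delta_w ** 2

    def karnin_sensitivity(acc, w_final, w_init, eta, eps=1e-12):
        # Eq. (2): S = sum_n [dw(n)]^2 * w^f / (eta * (w^f - w^i));
        # eps guards weights whose final and initial values almost coincide
        return acc * w_final / (eta * (w_final - w_init) + eps)

    rng = np.random.default_rng(0)
    w0 = rng.uniform(-0.5, 0.5, size=8)      # initial weights as in Sec. 2
    w, acc, eta = w0.copy(), np.zeros(8), 0.1
    for _ in range(50):                      # toy training epochs
        dw = rng.normal(scale=0.01, size=8)  # stands in for the BP update -eta*dE/dw
        acc = karnin_accumulate(acc, dw)
        w += dw
    print(karnin_sensitivity(acc, w, w0, eta))  # weights with small |S| are pruned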
B. Optimal Brain Damage (OBD) method (Le Cun et al., 1990)
When the weight vector w is perturbed, the change in the error is approximately
given by

    δE ≈ (1/2) Σ_{i∈C} (∂²E/∂w_i²) δw_i²,   (3)

where the δw_i's are the components of δw and C is the set of all connections. The
second derivatives can be calculated by a modified back-propagation algorithm. The
saliency of the weight w_i is then

    s_i = (∂²E/∂w_i²) · (w_i² / 2).   (4)

Pruning is done iteratively: i.e., train to a reasonable error level, compute saliencies,
delete low saliency weights, and resume training.
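Since the modified back-propagation pass that yields the second derivatives is not spelled out here, the following hedged Python sketch makes Eq. (4) concrete by approximating the diagonal of the Hessian with central finite differences (a slow stand-in, for illustration only):

    import numpy as np

    def obd_saliency(loss, w, h=1e-4):
        # s_i = (d^2 E / dw_i^2) * w_i^2 / 2, Eq. (4), with the second
        # derivative estimated by a central finite difference
        w = np.asarray(w, dtype=float)
        s = np.empty_like(w)
        e0 = loss(w)
        for i in range(w.size):
            wp, wm = w.copy(), w.copy()
            wp[i] += h
            wm[i] -= h
            d2 = (loss(wp) - 2.0 * e0 + loss(wm)) / h ** 2
            s[i] = d2 * w[i] ** 2 / 2.0
        return s

    # toy quadratic "error" E(w) = sum_i a_i * w_i^2, so s_i = a_i * w_i^2
    a = np.array([1.0, 5.0, 0.1])
    w = np.array([0.5, 0.2, 2.0])
    print(obd_saliency(lambda v: np.sum(a * v ** 2), w))  # low-saliency weights are deleted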
C. Weight-Elimination (WE) method (Weigend et al., 1991)
The following error function is minimized:

    J = Σ_{k∈T} (t_k − o_k)² + λ Σ_{i∈C} (w_i²/w_0²) / (1 + w_i²/w_0²),   (5)

where λ is a parameter dynamically adjusted during training. The second term represents
the complexity of the network as a function of the weight magnitudes relative to the
constant w_0.
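A hedged Python sketch of the penalty term of Eq. (5) and of its gradient (which is what the training algorithm adds to the back-propagated gradient); the function names are ours:

    import numpy as np

    def we_penalty(w, w0):
        # complexity term of Eq. (5): sum_i (w_i^2/w0^2) / (1 + w_i^2/w0^2)
        r = (w / w0) ** 2
        return np.sum(r / (1.0 + r))

    def we_penalty_grad(w, w0):
        # d/dw_i of the term above: 2 w_i / (w0^2 * (1 + w_i^2/w0^2)^2)
        r = (w / w0) ** 2
        return 2.0 * w / (w0 ** 2 * (1.0 + r) ** 2)

    w = np.array([-2.0, -0.05, 0.0, 0.4, 3.0])
    lam, w0 = 1e-3, 1.0
    print(we_penalty(w, w0), lam * we_penalty_grad(w, w0))

Near-zero weights feel almost no pull and large weights saturate towards a cost of 1, so the penalty effectively counts the weights that matter instead of shrinking all of them uniformly.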
D. Kruschke's method (Kruschke, 1988)
Units are redundant when their weight vectors are nearly parallel or antiparallel, and
they compete with others that have similar directions. The gains g are adjusted
according to

    Δg_i = −γ Σ_{j≠i} cos² ∠(w_i^s, w_j^s) · g_j,   (6)

where γ is a small positive constant, ∠(·, ·) denotes the angle between vectors, and the
superscript s indexes the layer. If unit i has weights parallel to those of unit j, then
the gain of each will decrease in proportion to the gain of the other, and the one with
the smaller gain will be driven to zero faster. Once a gain becomes zero, it remains
zero. A unit with zero gain has a constant output, so the unit can be removed.
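A hedged Python sketch of the gain dynamics of Eq. (6), applied to the fan-in weight vectors of one layer (our own illustration; the constant gamma and the clipping behavior are our choices):

    import numpy as np

    def kruschke_step(W, g, gamma=0.01):
        # W: (units, inputs) weights of one layer; g: current gains
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        U = W / np.maximum(norms, 1e-12)   # unit-length weight vectors
        cos2 = (U @ U.T) ** 2              # cos^2 of the pairwise angles
        np.fill_diagonal(cos2, 0.0)        # exclude j == i in the sum
        g_new = g - gamma * cos2 @ g       # Eq. (6)
        g_new[g <= 0.0] = 0.0              # once a gain is zero, it stays zero
        return np.maximum(g_new, 0.0)

    W = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])  # units 0, 1 nearly parallel
    g = np.ones(3)
    for _ in range(200):
        g = kruschke_step(W, g)
    print(g)  # the gains of the redundant pair decay much faster than unit 2's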

4. Experimental Results
Jain and Chandrasekaran (1982) suggest that the size of training samples per class
should be at least five to ten times the dimensionality. However, in practice, the
number of available samples is limited. Hamamoto et al. (l996a) point out that
the evaluation of ANN classifiers in the small-sample, high-dimensional setting is
very important. In our experiments, thus, the ratio of the training sample size to
the dimensionality is small. On the other hand, a large test sample size was used
to accurately evaluate a classifier. Note that the estimated generalization error is a
random variable, because it is a function of training and test samples. Thus, it is
preferable to repeat the experiments several times independently.
To highlight the difference in the performance of four pruning algorithms, the follow-
ing experiments were conducted.
4.1 Experiment 1
We briefly describe the Ness data set (Van Ness, 1980), which is used in this experi-
ment. This data set has been used in order to evaluate the performance of classifiers
such as the nearest neighbor, Parzen, linear, quadratic and neural network classifiers
(Van Ness, 1980; Hamamoto et al., 1996a). The available samples were independently
generated from n-dimensional Gaussian distributions N(μ_k, Σ_k) with the following
parameters:

    μ_1 = [0, 0, ..., 0]^T,   μ_2 = [Δ/2, 0, ..., 0, Δ/2]^T,

    Σ_1 = I_n,   Σ_2 = [ I_{n/2}  0 ; 0  I_{n/2}/2 ],

where Δ is the Mahalanobis distance between class ω_1 and class ω_2, μ_1 is the n-
dimensional zero vector, and I_n is the n × n identity matrix. The true Bayes error
can be controlled by the values of Δ and n; that is, the degree of overlap between
the two distributions can be controlled by the values of Δ and n. For that reason,
we used this data set.
The experimental conditions are summarized as follows:

    Data                      Ness data set
    Values of Δ               2, 4, 6
    Dimensionality            2, 10, 20
    No. of training samples   10 per class
    No. of test samples       100 per class
    Hidden unit size          100
    No. of trials             100

Figs. 1-3 provide the mean of the estimated generalization error. For comparison,
the generalization error of BP is also presented. It is well known that when a fixed
number of training samples is used to design a classifier, the error of the classifier
tends to increase as the dimensionality n gets large. That is, as n increases, the
generalization problem becomes severe. In our limited experiment, the WE method
works well regardless of the true Bayes error and the dimensionality, even in practical
situations where the training sample size is relatively small for the dimensionality.
Fig. 1: Comparison of pruning algorithms (Karnin, OBD, WE, Kruschke, BP) in terms of
the generalization error [%] versus dimensionality (Ness data set with Δ = 2).

Fig. 2: Comparison of pruning algorithms in terms of the generalization error [%] versus
dimensionality (Ness data set with Δ = 4).
Fig. 3: Comparison of pruning algorithms in terms of the generalization error [%] versus
dimensionality (Ness data set with Δ = 6).

4.2 Experiment 2
Next, we compare four pruning algorithms on a real data set. In this data set, each
class represents one of 10 handwritten numerals. This data set contains 1400 128-
dimensional feature vectors per class. In feature extraction, Gabor filters (Gabor,
1946) were applied to a character image. The outputs of Gabor filters produce a
128-dimensional feature vector. Gabor filters tend to detect line and edge segments,
which seem to be good discriminating features. We call this the Gabor data set. For
additional details refer to Hamamoto et al. (1996b). We need to ensure the inde-
pendence of training and test sets. Thus, the following handwritten numeral
experiment was performed (a code sketch of this protocol is given after the list):
(1) Divide 1400 samples into the training set of size 100 and the test set of size
1300. Note that the two sets are mutually exclusive.
(2) Design an ANN classifier with 256 hidden units by using a pruning algorithm
with the above training set.
(3) Estimate the generalization error of the ANN classifier by using the test set.
(4) Repeat steps (1)-(3) 5 times independently.
(5) Compute the average of the generalization error and its standard deviation.
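The following minimal Python sketch of steps (1)-(5) shows the structure of the protocol (a pooled split is used for brevity, whereas the paper splits each class separately; `train` and `error_rate` stand for any classifier-specific routines and are placeholders):

    import numpy as np

    def estimate_generalization_error(X, y, train, error_rate,
                                      n_train=100, n_trials=5, seed=0):
        # repeated independent splits into mutually exclusive train/test sets
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(n_trials):
            idx = rng.permutation(len(X))
            tr, te = idx[:n_train], idx[n_train:]
            model = train(X[tr], y[tr])
            errors.append(error_rate(model, X[te], y[te]))
        errors = np.asarray(errors)
        return errors.mean(), errors.std()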
Results are shown in Tab. 1. The performance of the classifiers trained only on 25
training samples per class, which are randomly selected out of 100 training samples
per class, is also presented. It should be pointed out that as the training sample
size decreases, the generalization problem becomes severe. Again, the WE method
performs better than other pruning algorithms.
Tab. 1: Comparison of pruning algorithms in terms of the generalization error [%] in
the 128-dimensional feature space. For each training sample size, the first line is the
mean over the 5 trials and the second line the standard deviation.

    Training sample size per class   Karnin   OBD    WE     Kruschke   BP
    25                               8.77     8.57   7.65   7.72       8.58
                                     0.81     0.77   0.58   0.66       0.72
    100                              4.98     4.59   4.28   4.92       4.65
                                     0.61     0.28   0.17   0.26       0.33

5. Conclusions
We have compared four pruning algorithms for ANN classifier design, in small train-
ing sample size situations. The generalization error of resulting ANN classifiers was
estimated on artificial and real data. Experimental results show that the WE method
outperforms other pruning algorithms. Therefore, we believe that the WE method is
best for ANN classifier design.

References:
Gabor, D. (1946): Theory of communication, J. Inst. Elect. Engr., 93, 429-459.
Hamamoto, Y. et al. (1996a): On the behavior of artificial neural network classifiers in
high-dimensional spaces, IEEE Trans. Pattern Analysis and Machine Intelligence, 18, 5,
571-574.
Hamamoto, Y. et al. (1996b): Recognition of handwritten numerals using Gabor features,
In Proc. of 13th Int. Conf. Pattern Recognition, Vienna, in press.
Jain, A. K. and Chandrasekaran, B. (1982): Dimensionality and sample size considerations
in pattern recognition practice, In Handbook of Statistics, Vol. 2, P. R. Krishnaiah and L.
N. Kanal, Eds., North-Holland, 835-855.
Karnin, E. D. (1990): A simple procedure for pruning back-propagation trained neural
networks, IEEE Trans. Neural Networks, 1, 2, 239-242.
Kruschke, J. K. (1988): Creating local and distributed bottlenecks in hidden layers of back-
propagation networks, In Proc. 1988 Connectionist Models Summer School, 120-126.
Le Cun, Y. et al. (1990): Optimal brain damage, In Advances in Neural Information Pro-
cessing (2), Denver, 598-605.
Reed, R. (1993): Pruning algorithms - A survey, IEEE Trans. Neural Networks, 4, 5,
740-747.
Rumelhart, D. E. et al. (1986): Learning internal representations by error propagation, In
D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press.
Van Ness, J. (1980): On the dominance of non-parametric Bayes rule discriminant algo-
rithms in high dimensions, Pattern Recognition, 12, 355-368.
Weigend, A. S. et al. (1991): Generalization by weight-elimination with application to
forecasting, In Advances in Neural Information Processing (3), 875-882.
Classification Method by Using the Associative
Memories in Cellular Neural Networks
Akihiro Kanagawa, Hiroaki Kawabata, and Hiromitsu Takahashi

Faculty of Computer Science and System Engineering


Okayama Prefectural University

Soja, Okayama 719-11, Japan

Summary: This paper deals with classification problems, such as medical diagnosis, in
which classes are defined in categorical forms. Classification should be done by careful
and synthetic judgement over many characteristic values, taking individual variations
into account. We use the associative memory function of cellular neural networks to
classify by recalling one category from among the preregistered categories.

1. Introduction
Suppose there are K objects, and each object has n kinds of characteristic values
q_1, q_2, ..., q_j, ..., q_n. The problem of classifying these objects into given m classes
{C_1, C_2, ..., C_j, ..., C_m} based on their characteristic values has been discussed for a
long time. The classification problem discussed here is a sort of diagnosis problem which
involves individual variations. Classic methods based on multivariate normal distribution
theory, including discriminant analysis, are difficult to apply to these problems. To cope
with this, Pawlak (1984) proposed the concept of rough sets, which can be employed to
discuss the consistency of the classification given by human experts with the observed
attribute values of each sample. Shigenaga et al. (1993) modified Pawlak's method to
reduce more attributes by considering the given classification. In the case of rough sets,
the choice of descriptive functions or fundamental sets is difficult, while in the case of
fuzzy if-then rules it is troublesome to determine a set of membership functions or a
logical structure. In addition, if there are some lacking or missing data, these methods
easily suffer in their classification results. This paper aims to apply the associative
memories of cellular neural networks to this classification problem. The associative
memory is a function of neural networks by which the network recalls a pattern from
among patterns embedded in advance. Further, we propose a CNN in which each cell
has three output values, to enhance its capability of association.

2. Three-valued cellular neural networks (TVCNN)

A cellular neural network (hereafter CNN) is composed of simple analog circuits called
'cells', arranged in a checkered pattern (see Fig. 1). A CNN can be regarded as a sort
of Hopfield network from the viewpoint of its connections: whereas the Hopfield network
is fully connected, the CNN has limited connections, namely each cell is connected with
the adjacent cells and varies its state by a dynamics of differential equations under the
influence of the adjacent cells. The differential equation of the cell located at row i and
column j is:

    ẋ_ij = −x_ij + P_ij * y_ij + S_ij * u_ij + I_ij,   (1)


where x_ij and u_ij denote the state variable and the control variable, respectively, I_ij
is the threshold value, and P_ij and S_ij are template matrices. P_ij * y_ij is the sum of
the influence terms from the adjacent cells: with P_ij the (2r+1) × (2r+1) template
matrix of entries p_ij(k, l), −r ≤ k, l ≤ r,

    P_ij * y_ij = Σ_{k=−r}^{r} Σ_{l=−r}^{r} p_ij(k, l) y_{i+k, j+l}.   (2)

For simplicity, we rewrite Eq. (1) in vector notation:

    ẋ = −x + T y + I,
    y = sat(x),   (3)

where n = N × M, y ∈ D^n := {x ∈ ℝ^n : −1 ≤ x_i ≤ 1, i = 1, ..., n},
T = [T_ij] ∈ ℝ^{n×n}, I = (I_1, ..., I_n)^T ∈ ℝ^n, and
sat(x) = [sat(x_1), ..., sat(x_n)]^T ∈ ℝ^n.

The original CNN has binary outputs, which suit coding as black-and-white pictures.
But binary outputs hamper higher recognition problems such as classification or
diagnosis based on association, because it is essential to grasp these problems with
more than two-valued logic, such as low - middle - high or grade A - grade B - grade C -
grade D. For example, in a group examination, personal health condition is measured by
each ingredient of the blood test. There is an ideal state (healthy condition) given by a
certain range for each ingredient, and degrees of separation from it toward either side
are comprehensively judged.
In this study, for the purpose of extending the associative memory function of the CNN,
we propose a cellular neural network whose cell neurons have three-valued outputs. Here
we devise the output function so as to apply it to classification problems.

    y = sat(x) = (1/2) ( |x + 3/2| − |x + 1/2| + |x − 1/2| − |x − 3/2| ).  (See Fig. 2)   (4)
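A minimal Python sketch of Eq. (4) (our illustration), confirming the three saturation levels with breakpoints at ±0.5 and ±1.5:

    import numpy as np

    def sat3(x):
        # piecewise-linear output with levels -1, 0, +1
        x = np.asarray(x, dtype=float)
        return 0.5 * (np.abs(x + 1.5) - np.abs(x + 0.5)
                      + np.abs(x - 0.5) - np.abs(x - 1.5))

    print(sat3([-3.0, -1.0, 0.0, 1.0, 3.0]))  # -> [-1.  -0.5  0.   0.5  1. ]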


For B^n := {x ∈ ℝ^n : |x_i| = 1 or x_i = 0; i = 1, ..., n} and a ∈ B^n, we can define

    C(a) = { x ∈ ℝ^n : x_i > 1.5 if a_i = 1;  |x_i| < 0.5 if a_i = 0;  x_i < −1.5 if a_i = −1 }.   (5)

Then, by construction of the output function, we have sat(x) = a for x ∈ C(a).
Therefore the evolution equation of x for x ∈ C(a) is

    ẋ = −x + T a + I.   (6)
Since β = Ta + I is constant, this equation has the apparent equilibrium point x_e = β.
If β ∈ C(a), this equilibrium point is also asymptotically stable, since all eigenvalues of
the linear system (6) are −1. From this, Liu and Michel discussed the design method of
templates which can recall the stored image patterns, and called this function the
associative memory. In the same manner we can easily construct output functions with
multiple levels.

Fig. 1: CNN with r-neighborhood (r = 2 illustrated).  Fig. 2: Output function with three
levels (breakpoints at ±0.5 and ±1.5).

3. Design procedure using a singular value decomposition

We now introduce the design procedure for the template matrix and bias vector given by
Liu and Michel (1993). Suppose that we are given m vectors α_1, α_2, ..., α_m in B^n
which are to be stored as memory vectors of the CNN. First we choose a real number
c > 1 and m vectors β_1, β_2, ..., β_m such that β_i = c α_i (i = 1, ..., m). Then, our
problem is to determine the template matrix T and the threshold vector I which
simultaneously satisfy the following equations:

    −β_1 + T α_1 + I = 0,
    ...
    −β_m + T α_m + I = 0,   (7)

where each memory vector β_i corresponds to one of the equilibrium steady states of the
vector differential equation system of the CNN. Here we set the following matrices:

    A = (α_1 − α_q, α_2 − α_q, ..., α_{q−1} − α_q),
    B = (β_1 − β_q, β_2 − β_q, ..., β_{q−1} − β_q).   (8)

Then the problem is equivalent to determining the matrix T and the vector I which
satisfy:

    B = T A,
    I = β_q − T α_q.   (9)

Now we focus on the (i, j) cell of the TVCNN. Then the following condition needs to be
satisfied:

    b_k = t_k A,   (10)

where b_k and t_k are the k-th row vectors of the matrices B and T, respectively. If we
take out of the matrix A and the vectors b_k and t_k the elements which belong to the
r-neighborhood of the k-th cell, we obtain the following equation:
    b'_k = t'_k A',   (11)

where b'_k, t'_k and A' are the vectors and the matrix from which the coupling
coefficients having no influence on the k-th cell have been removed. Since A' is not a
square matrix, we apply the singular value decomposition to A'. Then we obtain the
relation:

    t'_k = b'_k V_{k1} Λ^{−1/2} U_{k1}^T,   (12)

where U_{k1}, U_{k2}, V_{k1} and V_{k2} are the unit orthogonal matrices which satisfy:

    A' = [ U_{k1}  U_{k2} ] [ Λ^{1/2}  0 ; 0  0 ] [ V_{k1}  V_{k2} ]^T,   (13)

where Λ^{1/2} is a diagonal matrix with the non-zero singular values of the matrix A'.
Thus we can obtain the desired matrix T and the vector I by calculating the above t'_k for
each cell.
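A hedged Python sketch of the design equations (7)-(9) in their unconstrained form (our illustration): the pseudoinverse, computed through the SVD, solves B = TA, and I = β_q − Tα_q; the full procedure of Eqs. (10)-(13) additionally removes the couplings outside each cell's r-neighborhood, which we omit here.

    import numpy as np

    def design_templates(alphas, c=2.0):
        # alphas: (m, n) stored patterns with entries in {-1, 0, 1}; betas = c*alphas
        alphas = np.asarray(alphas, dtype=float)
        betas = c * alphas
        aq, bq = alphas[-1], betas[-1]
        A = (alphas[:-1] - aq).T        # columns alpha_i - alpha_q, Eq. (8)
        B = (betas[:-1] - bq).T
        T = B @ np.linalg.pinv(A)       # least-squares solution of B = TA via SVD
        I = bq - T @ aq                 # Eq. (9)
        return T, I

    patterns = np.array([[1, -1, 0, 1], [-1, 1, 1, -1], [0, 1, -1, 1]])
    T, I = design_templates(patterns)
    # each beta_i is an equilibrium of Eq. (6): -beta_i + T alpha_i + I = 0
    print(np.allclose(T @ patterns.T + I[:, None], 2.0 * patterns.T))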

4. Formulation into Classification / Diagnosis Problem


The three-valued CNN enables one to express three kinds of aspects, such as low -
middle - high or small - moderate - large and so on. Next, we introduce the following
classification procedure. Each datum is expressed by a bit-map pattern, and the TVCNN
associates it with one of the patterns registered beforehand; that is, the TVCNN
classifies by itself. The procedure is as follows:
(1) Make an n = M × N CNN, and allocate q_1, q_2, ..., q_j, ..., q_n to the cells.
(2) Determine the number of output levels of each cell. From 2 to 4 levels are
recommended. In this paper we use 3-level output functions for all cells.
(3) Make scaling functions f_1, f_2, ..., f_j, ..., f_n.
(4) Make M × N lattice patterns representing the m classes {C_1, C_2, ..., C_j, ..., C_m}.
(5) Embed these patterns in the TVCNN; concretely, determine the template matrix T and
the threshold vector I by the means mentioned in §3.
(6) Express the respective data by M × N lattice patterns using the scaling functions
f_1, f_2, ..., f_n.
(7) Operate the TVCNN with the data pattern as the initial state, and observe the
reached pattern.
(8) Classify the object to the class represented by the reached pattern.

5. Example of the classification method


In order to demonstrate this method, we take up an application to the automatic diagnosis
of liver troubles. Diagnosing liver disease is very difficult, since it is necessary to
judge synthetically over many medical inspection items, taking their individual
variations into account. So even a medical specialist occasionally makes a wrong
diagnosis when he must examine a large number of patient data. On the contrary, in the
case of diagnosis of diabetes, it is sufficient to know only the information of FPG
(blood sugar).

    q1   UA        uric acid
    q2   BUN       blood urea nitrogen
    q3   LDH       lactate dehydrogenase
    q4   PLt       platelet
    q5   ALb       albumin
    q6   γ-GTP     γ-glutamyl transpeptidase
    q7   GOT       glutamic oxaloacetic transaminase
    q8   LAP       leucine aminopeptidase
    q9   TBil      total bilirubin
    q10  GOT/GPT   the ratio of GOT to GPT
    q11  GPT       glutamic pyruvic transaminase
    q12  AFP       alpha-1 fetoprotein
    q13  DBil      direct bilirubin
    q14  ChE       cholinesterase
    q15  ALP       alkaline phosphatase
    q16  AFP       alpha-1 fetoprotein

Tab. 1: Medical inspection indices

Fig. 3: Allocation of each cell in the 4 × 4 grid (row by row): UA, BUN, LDH, PLt /
ALb, γ-GTP, GOT, LAP / TBil, GOT/GPT, GPT, AFP / DBil, ChE, ALP, AFP.

(1) We select 14 medical inspection indices shown in Tab. 1, and make a 4 × 4 CNN, each
cell of which is allocated one index as shown in Fig. 3. The neighborhood r is 4. In
order to test robustness against missing or lacking data, we allocate the q8 cell even
though we have no data for LAP.
(2) We adopt the three-valued output function for all cells, because most inspection
indices are grasped in three stages. For example, γ-GTP roughly has the three levels
NORMAL (0-50), LIGHT EXCESS (50-100) and HEAVY EXCESS (100-); ChE has the levels
SHORTAGE (0-200), NORMAL (200-400) and EXCESS (400-). Both SHORTAGE and EXCESS are
regarded as extraordinary.

(3) We make the scaling functions by consulting technical books of medical science;
they are shown in Tab. 2.

    f1   UA        (q1 − 5) / 6
    f2   BUN       (q2 − 14) / 12
    f3   LDH       (q3 − 175) / 120
    f4   PLt       (q4 − 25) / 30
    f5   ALb       (q5 − 4.2) / 1.6
    f6   γ-GTP     q6/100 − 1
    f7   GOT       q7/80 − 1
    f8   LAP       0 (lacking data)
    f9   TBil      (q9 − 5) / 18
    f10  GOT/GPT   ln(q7 / q11)
    f11  GPT       q11/100 − 1
    f12  AFP       (q12 − 210) / 380
    f13  DBil      (q13 − 70) / 40
    f14  ChE       (q14 − 375) / 150
    f15  ALP       (q15 − 60) / 80
    f16  AFP       = f12

Tab. 2: The scaling functions

(4) We take up four liver diseases. The registered patterns shown in Fig. 4 represent
'Healthy person', 'Hepatoma', 'Chronic hepatitis' and 'Liver cirrhosis', respectively.
In these pictures, 1 corresponds to a black pixel, 0 to a neutral gray pixel and −1 to a
white pixel.
Fig. 4: Registered patterns representing the four liver diseases: (1) Healthy person,
(2) Hepatoma, (3) Chronic hepatitis, (4) Liver cirrhosis.

(5) Fig. 5 shows an evolution process of association in the TVCNN. The initial state
represents an actual patient's data. As a result, the TVCNN associates the pattern of
'Liver cirrhosis' with the initial pattern. Actually, this patient was diagnosed with
liver cirrhosis by a close medical examination.

Fig. 5: Evolution process of the TVCNN

Statistical inference results should be shown. We were offered forty actual patients'
data, consisting of ten healthy person data, ten hepatoma data, ten chronic hepatitis
data and ten liver cirrhosis data, from Kawasaki Medical College. The diagnostic results
of the TVCNN system are shown in Table 3, where 2 (#4) means that the TVCNN wrongly
diagnosed two hepatoma cases as liver cirrhosis. In Table 3, "irregular convergent"
means that the TVCNN did not reach a pattern among the preregistered images;
"diagnostic sensitivity" is the right diagnosis rate in the past literature (Shigenaga
et al. (1993)), who used a concept of rough sets and fuzzy if-then rules.
                                 right       wrong       irregular    diagnostic
                                 diagnosis   diagnosis   convergent   sensitivity
    (1) healthy person (10)      10          0           0            100.0 %
    (2) hepatoma (10)            5           2 (#4)      3            58.8 %
    (3) chronic hepatitis (10)   6           1 (#1)      3            61.5 %
    (4) liver cirrhosis (10)     4           0           6            64.7 %

Tab. 3: Inference results and comparison with Shigenaga's method



6. Conclusion
We have proposed a CNN in which each cell has three output values. It is easily extended
to one whose cells have multiple output values; we call such a CNN a multiple-valued
cellular neural network (MVCNN). The associative memory function of the MVCNN enables
one to express several kinds of aspects. We gave an application of the TVCNN to a
diagnosis problem of liver troubles. The classification power of the proposed method was
demonstrated to be equivalent or somewhat inferior in comparison with another method
using fuzzy if-then rules. But it should be emphasized that in our method:
(1) the LAP data are lacking, so that cell is allocated neutral gray;
(2) the scaling functions and disease patterns were made by us (amateurs) without
consulting medical specialists;
(3) the sample size is rather small; Shigenaga et al. (1993) used 500 sample data.
Thus our diagnosis system using the TVCNN has much room for improvement, so it cannot
be concluded from Table 3 that the classification power of the TVCNN is somewhat
inferior to that of fuzzy if-then rules. Generally, fuzzy expert systems are effective
for these problems. It is, however, complicated and troublesome to make a great number
of fuzzy if-then rules when one builds an expert system. On the contrary, the diagnosis
system using the TVCNN is designable by a simple procedure.
Optimization of the diagnosis system using the TVCNN (MVCNN) is an important subject
for future study.

References:
Fisher, R. A. (1936): The Use of Multiple Measurements in Taxonomic Problems, Annals
of Eugenics, 7, pp. 175-188.
Liu, D. and Michel, A. N. (1993): Cellular Neural Networks for Associative Memories,
IEEE Trans. Circuits Syst. II, 40, pp. 119-121.
Pawlak, Z. (1984): Rough Classification, Intern. J. of Man Machine Studies, 30, pp.
457-473.
Shigenaga, T., Ishibuchi, H. and Tanaka, H. (1993): Fuzzy Inference of Expert System
Based on Rough Sets and Its Application to Classification Problems, J. of Japan Society
for Fuzzy Theory and Systems, 5, 2, pp. 358-366.
Application of Kohonen Maps to the
GPS Stochastic Tomography of the Ionosphere
M. Hernández-Pajares, J. M. Juan and J. Sanz
Research Group of Astronomy and Space Geodesy
Universitat Politècnica de Catalunya
Campus Nord, Mod. C3
c/ Gran Capità s/n, 08034 Barcelona, Spain
e-mail: [email protected]

Summary: The adaptive classification of the rays received from a constellation of geo-
detic satellites (GPS) by a set of ground receivers is performed using neural networks.
This strategy improves the reliability of reconstructing the ionospheric electron density
from GPS data. As an example, we present the evolution of the radially integrated electron
density (Total Electron Content, TEC) during the day 18th October 1995, coinciding with
an important geomagnetic storm. The problems in the vertical reconstruction of the
electron density are also discussed, including the data coming from one Low Earth Orbiter
GPS receiver, the GPS/MET. Finally, as the main conclusion, a new strategy is proposed to
estimate the ionospheric electron distribution from GPS data at different scales: the 2-D
distribution (TEC) at Global scale and the 3-D distribution (electron density) at Regional
scale.

1. Introduction
about the Problem:
As is well known, the Ionosphere is the part of the Earth's atmosphere containing free
ions; it causes a frequency-dependent delay in propagated EM signals, the delay being
proportional to the columnar density of electrons (TEC) (see for instance Davies 1990,
page 73).
This is a distorting physical effect for Space Geodesy and Satellite Telecommunications
activities, but it can be used in a positive sense to estimate the global 3-D distribution
of the free electrons in the atmosphere from dual-frequency delay observations, i.e.
for the Stochastic Tomography of the Ionosphere.
about the Data:
To achieve this objective we need, during a certain time interval, a high sampling rate
of the atmosphere, with as many rays in as many orientations as possible. Nowadays,
the unique system that provides so many observations, continuously and on a planetary
scale, is the Global Positioning System (GPS). Its space segment contains a constel-
lation of more than 24 satellites emitting continuously carrier and code phases on
two frequencies, L1 (≈1.6 GHz) and L2 (≈1.2 GHz) (see for more information
Seeber 1993, pages 209-349). In the GPS user segment, it is possible to get, a few
hours later, the public domain data gathered from a global network of permanent
receivers, such as the International GPS Service for Geodynamics (IGS, Zumberge et al.
1994), with more than 100 stations distributed worldwide, mainly concentrated in the
Northern Hemisphere, in North America and Europe. Also the Low Earth Orbiters
containing GPS receivers (LEO) are becoming usual. Hajj et al. (1994) conclude
that the GPS/MET LEO observations are important to resolve the vertical structure
of the Ionosphere.


about the Model and Goals:
But the amount of data implied (more than 1.5 × 10⁶ delays/day collected in the IGS
network), jointly with its inhomogeneous distribution, makes it difficult to solve the
problem, so that new algorithms and strategies must be considered to perform the
tomography of the Ionosphere.
In this paper we mainly discuss the different data analysis problems encountered in
the estimation of the ionospheric electron distribution using mainly ground data from
the IGS network, and the solutions adopted, emphasizing three points:
1. The adaptive clustering of the rays, using the Kohonen neural network al-
gorithm. This technique will be applied for the bidimensional modeling (TEC) of
the overall Ionosphere, i.e. at Global scale, generating cells adapted in size to
the variable sparsity of the data.
2. The problems coming from the bad geometry for resolving the vertical structure of the
Ionosphere using data coming solely from ground receivers are also discussed.
We adopt the model using regular cells instead of the adaptive one, due
to the better performance related to the lower discretization error in the presence
of high correlations.
3. The inclusion of a set of GPS/MET data, consisting of rays orthogonal to the
ground-data rays, improves the estimation of the ionospheric vertical structure.

2. The Model
The Scenario:

We have the following situation:

• From each ground station (see figure 1) we simultaneously measure, with a certain
sampling rate (i.e. 1 time/30 sec), the ionospheric delays experienced by the rays
received from the visible satellites (i.e. 4-8).
• These rays cross different parts of the Ionosphere near the respective station.
• Between observation epochs the Earth rotates, and the part of the Ionosphere
being sounded changes.
• We assume that the Ionosphere is stationary in a Sun-fixed reference system¹.
• Then, we have chosen a Geocentric Equatorial pseudo-Inertial reference system
(GEI), where the X-axis points towards the Vernal Equinox and the Z-axis
points towards the Geographic North Pole; the XY plane is the celestial equator.
In the GEI the Sun moves only 1 degree/day. The associated spherical
coordinates are the right ascension α (azimuthal angle) and the declination δ
(angle referred to the equator).

The General Model:

Not taking into account the bending effect of the ray, we have to solve the following
¹This is not true at second order, due to magnetic field effects and the variability of geomagnetic
conditions (see for example Sanz et al., 1996)

Figure 1: Layout of the GPS full constellation (24 satellites) orbiting around the
Earth. The GPS rays corresponding to the observations of a given station at a given
time are also represented.

Figure 2: A scheme with one station-satellite pair (satellite j above layers k and k+1).

integral equation:

    f_t(r_i, r^j) = ∫ N(r(s)) ds + D_i + D^j,   (1)

where:
• N(r) is the electron density at position r, a point belonging to the ray between
station i and satellite j, at distance s from the station.
• f_t(r_i, r^j) is the ionospheric combination corresponding to the ray from satel-
lite j to station i at time t, obtained by preprocessing the GPS observations
(see for instance Sardón et al. 1994).
• r_i, r^j are the position vectors of station i and satellite j at the observation
time.
• D_i and D^j are the instrumental delays associated to station i and satellite j.
• The integral path is assumed to extend over the linear ray.
In order to estimate the density N from equation 1, we can expand it within a certain
basis of functions,

    N(r) = Σ_l Λ_l g_l(r),   (2)

in such a way that:

    f_t(r_i, r^j) = Σ_l Λ_l ∫ g_l(r(s)) ds + D_i + D^j.   (3)

Our final purpose is to get an estimation of N from the data f, knowing r_i, r^j. The
instrumental delays D_i, D^j are also unknowns. The general model proposed then
consists of a certain number of geocentric spherical shells covering the Ionosphere
sampled by the GPS rays. These shells define the floor and ceiling of each layer, which
is going to be partitioned into pixels.
For each layer k (see figure 2) we define the pixels {P_{k,l}}_{∀l} as those given by a
set of centers {w_{k,l}} with the minimum-distance criterion:

    P_{k,l}(r) = 1 if ||r − w_{k,l}|| ≤ ||r − w_{k,l'}|| ∀l', and 0 otherwise.   (4)

Then, applying this to equation 3, we can write:

    f_t(r_i, r^j) = Σ_k Σ_l N_{k,l} ∫ P_{k,l}(r(s)) ds + D_i + D^j,   (5)

where the unknowns Λ have been reinterpreted as the mean electron densities N_{k,l}
in the cells. From the last equation we get:

    f_t(r_i, r^j) = Σ_k Σ_l N_{k,l} Δs_{k,l} + D_i + D^j,   (6)

where N_{k,l} and Δs_{k,l} are the mean density and the fraction of the ray length,
respectively, corresponding to cell l of layer k.

                              Regular grid        Adaptive cells
    Cells poorly populated?   Yes                 No
    Discretization error      Lower in general    Higher in general
    CPU time                  Fast                Slow (≈3 times slower)
    Kalman filtering          Easy                Non-immediate

Table 1: Comparison of some features of the regular and the adaptive grids used to
define cells for each considered layer of the Ionosphere

Finally, equation 6 gives a linear overdetermined system, which will be solved by
Least Squares, taking into account techniques such as the Singular Value Decomposition,
following Hajj et al. 1994.
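A hedged Python sketch of this step (our illustration, not the authors' software): the rows of the geometry matrix hold the path fractions Δs_{k,l} of each ray plus columns for the instrumental delays, and small singular values are truncated in the solve.

    import numpy as np

    def solve_tomography(G, f, rcond=1e-3):
        # G: (rays, cells + delays) matrix of Eq. (6); f: ionospheric combinations
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        keep = s > rcond * s[0]                 # truncate tiny singular values
        return Vt[keep].T @ ((U[:, keep].T @ f) / s[keep])

    # toy example: 6 rays through 3 cells plus one common delay column
    G = np.array([[100.,   0.,  50., 1.],
                  [ 80.,  70.,   0., 1.],
                  [  0., 120.,  30., 1.],
                  [ 60.,  60.,  60., 1.],
                  [  0.,   0., 150., 1.],
                  [ 90.,  40.,  20., 1.]])
    x_true = np.array([1.0, 2.0, 0.5, 3.0])     # cell densities and delay
    print(solve_tomography(G, G @ x_true))      # recovers x_true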
Adaptive pixels versus regular pixels:
If we choose a pixel basis function to expand the electron density, two approaches
-among others- are possible to define the pixels or cells:
1. To consider a regular grid.
2. To get adaptive cells in such a way that an approximately equal number of rays
per cell is assured.
Some advantages and disadvantages are summarized in Table 1.
The main reason to choose an adaptive basis in this paper is that it is a more robust
approach, adapting the cell size to the data density and avoiding in this manner the
existence of void or rarely visited cells. The centers are obtained by applying the
unsupervised classifier known as the Kohonen neural network or Self-Organizing Map
(Kohonen, 1990) to the crossing points of each ray with the mean shell of the given
layer. The resulting cells are adapted in size, allowing one to guarantee a minimum
number of crossing points per cell (see appendix A for a detailed explanation of the
algorithm, and Murtagh & Hernández-Pajares 1995 for an assessment).
Taking into account the main problems, i.e. poorly populated cells (cells crossed by
few rays) and the discretization error, we propose the following combined strategy to
estimate the electron density in equation 6:
Firstly, we estimate a 2-D Global Model of the overall Ionosphere (using all the IGS
data). To do so we solve equation 6 with only 1 layer, avoiding in this manner
the high correlations between layers, as commented in Hajj et al. 1994. Then
it is possible to use an adaptive basis of functions to avoid the problem of poorly
populated cells (mainly in the Southern Hemisphere, due to the lack of stations).
In fact, the low correlations of the 2-D model make the greater discretization error of
the adaptive grid model, relative to the regular grid model, not critical.
Secondly, we solve equation 6 with several layers, performing a 3-D Regional model
of the Ionosphere, in a region crossed by a high number of rays to avoid the
problem of poorly populated cells. Due to the high correlations between layers
that appear in the estimation, it is useful to constrain the instrumental delays
with the values obtained in the first run, and to use a regular gridding. The
regular cell model presents a lower discretization error, which is a critical aspect
when we have high correlations.

Figure 3: Scheme of the strategy adopted to assign the sub-rays to the corresponding
cells (layer k) with a minimum computation load; explanation in the text.

Some practical aspects of the problem:

In order to implement the adaptive strategy, we have to deal with the problem of the
non-regular boundaries between the cells. Indeed, to solve Eq. 6 we have to compute,
for each ray and for a given layer, the cells crossed and the ray path fraction in each
one. One procedure is to digitize the ray within sub-rays of length L, counting how
many of them are contained in the crossed cells (see figure 3).
This approach requires finding the cell to which each sub-ray belongs, and this can
be an expensive operation, due to the number of sub-rays we have. One way to
avoid such computation load is to use the physical continuity of the ray, knowing the
neighborhood relations between the cells. The Kohonen neural network, at the same time
as it constructs cells with an approximately equal number of elements, gives a
bidimensional topological map of the cells (see appendix A).
The strategy adopted is (a code sketch of the neighbor-restricted search follows the
list):
1. Computing the cells: From the crossing points of all the rays with the mean
spherical shell of the layer, the centers of the cells are obtained. Simultaneously
we get the cell membership of all the rays and the 2-D Kohonen map of the
centers, which maintains their 3-D neighborhood relationships. The cells are
defined in the usual way by the nearest-center criterion.
2. Computing the crossing fractions of the rays: For a given ray, we know
now to which cell the central sub-ray belongs (step 1), and hence where it is
placed in the Kohonen map. For a sufficiently small length, and taking into account
the continuity of the ray, the next upper (lower) sub-ray must belong to the same
or to a neighboring cell (see figure 3); then only a small subset of centers must be
explored, namely the neighbors in the Kohonen map of the center of the last sub-ray
(see again figure 3).
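A hedged Python sketch of step 2 (our illustration): the next sub-ray is assigned by searching only the centers that are neighbors of the current cell in the Kohonen map, instead of all c centers.

    import numpy as np

    def next_cell(point, centers, grid_pos, current, radius=1):
        # centers: (c, 3) cell centers; grid_pos: (c, 2) Kohonen-map indices (i_l, j_l);
        # current: index of the cell of the previous sub-ray
        di = np.abs(grid_pos[:, 0] - grid_pos[current, 0])
        dj = np.abs(grid_pos[:, 1] - grid_pos[current, 1])
        cand = np.where((di <= radius) & (dj <= radius))[0]  # map neighbors + itself
        d = np.linalg.norm(centers[cand] - point, axis=1)
        return cand[np.argmin(d)]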

Figure 4: IGS stations used in the computations.

3. Computations and Results

The Data
Some features of the data sets used, which have been provided by the IGS via
anonymous ftp, are summarized in Table 2. The station distribution is plotted
in figure 4.

Estimated Models
We have computed basically two different models using Least Squares:
2-D Global Ionosphere: we use the Self-Organizing Map algorithm to generate the
adaptive cells in order to compute the TEC for Data Set G (subsets 2-7h,
7-12h, 12-17h and 17-22h, see figure 5)². We have solved equation 6 for one
unique layer between 300 and 400 km, taking 400 cells (self-organized along
a 20×20 Kohonen map) and a sub-ray length of 5 km. The cells present sizes
ranging from a few square degrees to 10-100 times greater, depending on the
small or large sparsity of the data. An important result appears in figure 5: the
detection of the TEC increase due to the start of the geomagnetic storm in the
last time interval (from 17 to 22h UT).
3-D Regional Ionosphere: we describe the electron density with 5 spherical layers
in equation 6, with height boundaries at 50, 200, 350, 500, 650 and 800 km. The
regular gridding consists of 5°×5° cells. We have performed two computations:
one with the data set RGROUND (only ground data, see table 2) and the other using
also one GPS/MET occultation (data set RGROUND+RMET). In both cases
we have replaced the real observations (ionospheric delays) by those coming
²During this period a geomagnetic storm happened, with a high variability of the ionospheric
electron distribution (see for instance the Web document at
http://bolero.gsfc.nasa.gov/gov/solart/cloud/cloud.html)

                               Global        Regional
                               Data set G    Data set RGROUND   Data set RMET
    GPS receivers              60            31                 1 (GPS/MET occultation #0207)
    Time interval              2-22h UT      18-23h UT          20h30m-20h31m UT approx.
    Number of subsets          4             1                  1
    Right ascension range of
    the rays at 330 km height  0 to 360°     185 to 230°        185 to 230° approx.
    Declination range of the
    rays at 330 km height      -90 to 90°    20 to 60°          20 to 60° approx.
    Elevation mask             0°            0°                 none (occultation geometry)
    Number of rays             ≈200000       6581               3347

Table 2: Description of the data sets considered in the computations, all of them
corresponding to the 18th October 1995, with a geomagnetic storm and P-code not
encrypted (Antispoofing Off)

from a very simple model: the Ionosphere as one spherical layer with constant
density of 10¹² e/m³, with boundaries at 240 and 400 km height. Nevertheless,
the geometry (the rays) is the real one. The results (figure 6) are quite significant,
showing the important improvement in the estimation when we add to the ground
data the observations coming from one orbital GPS receiver such as the GPS/MET.

5. Conclusions
We present in this paper a study of a difficult problem: to reconstruct the 3-D
ionospheric electron distribution from GPS data.
The use of the neural network in this work makes it possible to overcome the problem
of the non-homogeneous sampling, dividing each layer into a partition of cells or
clusters with a similar number of rays. The existence of topological relationships
between neighboring centers in the Kohonen map is exploited to reduce the
computation load. In order to solve the problem, we have considered a new approach
that consists of a basis of adaptive pixels (Kohonen adaptive pixel basis), defined from
a certain number of centers obtained with the Kohonen artificial neural network.
As the main conclusion, a new modeling of the Ionosphere is proposed in two
steps: firstly, a bidimensional Global model, describing the TEC and the instrumental
delays by means of adaptive cells; and secondly, a tridimensional Regional model with
regular cells for the electron density, within a region with a high number of observations
(Northern Hemisphere). We ought to include Low Earth Orbiter GPS receiver data,
such as GPS/MET, to improve the vertical resolution, constraining the instrumental
delays with the values obtained in step 1. The cells in this case are chosen regular in
angular size, in order to diminish the discretization error in the description of the
electron density, in a model with high correlations between layers.
This new strategy is supported by the main results obtained for the day 18th October
1995:
• the detection in the global model of the increase of electron content at the start of
the geomagnetic storm;
• the capability to estimate the 3-D structure of the Ionosphere with a large number
of regional data, especially when we include orbital GPS occultation data from
Figure 5: Global model of the TEC for the day 18th October 1995, for the data subsets
2-7h, 7-12h, 12-17h and 17-22h respectively.

Figure 6: Comparison of the electron density vertical profile (electron density versus
height, in km) at α = 207.5° and δ = 37.50°, estimated from the data sets RGROUND and
RGROUND+RMET (day 18th October 1995), with the real rays but with ionospheric delays
simulated from a very simple model.

orbital receivers. This last point confirms the conclusion of Hajj et al. (1994) about the
importance of including orbital data to reconstruct the Ionosphere with GPS.

6. Acknowledgments
We would like to thank the International GPS Service for Geodynamics and the
University Corporation for Atmospheric Research for the availability of the GPS data
used in this research. This work has been partially supported with funds from the
Spanish government projects PB94-1205 and PB94-0905 (DGICYT).

References:

Davies, K. (1990): Ionospheric Radio. IEE ElectroMagnetic Waves Series 31, Peter
Peregrinus Ltd., London.
Hajj, G. A., Ibanez-Meier, R., Kursinski, E. R., Romans, L. J. (1994): Imaging the
Ionosphere with the Global Positioning System. Imaging Systems and Technology, Vol. 5,
174-184.
Kohonen, T. (1990): The self-organizing map. Proceedings of the IEEE, Vol. 78, pages
1464-1480.
Murtagh, F., Hernández-Pajares, M. (1995): The Kohonen self-organizing map method: an
assessment. Journal of Classification, Vol. 12, 165-190.
Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T. (1986): Numerical
Recipes. The Art of Scientific Computing. Cambridge Univ. Press, Cambridge.
Sanz, J., Juan, J. M., Hernández-Pajares, M., Madrigal, A. M. (1996): GPS Ionosphere
imaging during the "October 18th, 1995" magnetic cloud. European Geophysical Society
meeting, The Hague, May 1996.
Sardón, E., Rius, A., Zarraoa, N. (1994): Estimation of the transmitter and receiver dif-
ferential biases and the ionospheric total electron content from Global Positioning System
observations. Radio Science, Vol. 29, No. 3, pages 577-586.
Seeber, G. (1993): Satellite Geodesy. Walter de Gruyter, Berlin.
Zumberge, J., Neilan, R., Beutler, G., Gurtner, W. (1994): The International GPS Service
for Geodynamics - Benefits to Users. ION GPS-94, Salt Lake City, Utah.

APPENDICES

A The Self Organizing Feature Map algorithm


One of the competitive neural network algorithms is the Self-Organizing Map (SOMA,
also named Kohonen network). It has the special property of creating spatially or-
ganized representatives of the centroids (weights of the output neurons) found in the
input vectors. The resulting maps resemble real neural structures that appear in the
cortices of developed animal brains. The SOMA has also been successful in various
pattern recognition tasks involving noisy signals such as speech recognition (see a
summarized review in Kohonen 1990).
The basic aim of this neural network is to find a smaller set {w_1, ..., w_c} of c centroids
that provides a good approximation of the original set S of n objects (the input space),
with m attributes, encoded as vectors x ∈ S. Intuitively, this should mean that for
each x ∈ S the distance ||x − w_{c(x)}|| between x and the closest centroid w_{c(x)} should
be small. Simultaneously the algorithm arranges the centroids so that the associated
map f(·) from S to A,

    f: S ⊂ ℝ^m → A ⊂ ℕ²,   w_l → f(w_l) = (i_l, j_l),   (7)

reflects the topology of the set S in a least-distorting way, where A is the representation
space, a 2-dimensional set of indices known as the Self-Organizing Feature Map or
Kohonen Map. Proximity in A means similarity between the global properties of the
associated groups of objects.
From a detailed point of view, the Kohonen network is composed of a set of c nodes
or neurons. The algorithm scheme is as follows (a condensed code sketch is given at the
end of this appendix):
1. We initialize at random the weights of the c nodes of the grid with small values:
{w_1(0), ..., w_c(0)}. After training, every neuron l ∈ {1, ..., c} will represent
a group of objects with similar features, and the weight vector w_l will
approximate the centroid of these associated objects.
2. For each of the n training vectors of the overall database, x^p:
(a) We find the node k whose weight w_k best approaches x^p (d can represent
the Euclidean distance): d(w_k, x^p) ≤ d(w_l, x^p), ∀l ∈ {1, ..., c}.
(b) We update the weight of the winning node k and of its neighbours N_k(t),
approaching the training vector as closely as possible:

    w_l(t) = w_l(t−1) + α(t) H(t) (x^p − w_l(t−1))   if l ∈ N_k(t),
    w_l(t) = w_l(t−1)                                if l ∉ N_k(t),   ∀l ∈ {1, ..., c},   (8)

where:
• α(t) is the learning rate, a suitable monotonically decreasing sequence of
scalar-valued gain coefficients, 0 < α(t) < 1.
• The radius R_t of the activated neighbourhood N_k(t) is a monotonically
decreasing function of the iteration t.
• H(t) = exp(−d/R_t) is a function that represents, in Eq. 8, the decay of
the activation depending on the distance d in A between the winner
unit k and the considered unit l, d = √((i_l − i_k)² + (j_l − j_k)²).

Figure 7: Ordering induced by the Kohonen network (S: the m-dimensional input space;
A: the 2-dimensional representation space or Kohonen map): after training, centroids
which are close within the representation space A will also be close within the input
space S.

Updating the neighbours' weights, instead of just that of the winning node, assures the
ordering of the net, in such a way that centroids which are close in the representation
space A (dimension 2) are updated to become similar in the input space S of
dimension m (see Fig. 7).

3. Process 2 is repeated over the whole database until a good final training is ob-
tained.
The final point density function of {w_1, ..., w_c} is an approximation of the continuous
probability density function of the vectorial input variable g(x) (Kohonen 1990, p.
1466).
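The following hedged Python sketch condenses the above scheme (our illustration; the exact decay schedules for α(t) and R_t are our choices):

    import numpy as np

    def train_som(X, grid=(5, 5), epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        c = grid[0] * grid[1]
        pos = np.array([(i, j) for i in range(grid[0])
                        for j in range(grid[1])], dtype=float)   # Kohonen map A
        W = rng.normal(scale=0.1, size=(c, X.shape[1]))          # small random weights
        for t in range(epochs):
            alpha = 0.5 * (1.0 - t / epochs)                     # decreasing gain
            R = max(grid[0] / 2.0 * (1.0 - t / epochs), 0.5)     # shrinking radius
            for x in X[rng.permutation(len(X))]:
                k = np.argmin(np.linalg.norm(W - x, axis=1))     # winning node, step 2(a)
                d = np.linalg.norm(pos - pos[k], axis=1)         # distance in the map
                H = np.exp(-d / R) * (d <= R)                    # activated neighborhood
                W += alpha * H[:, None] * (x - W)                # update, Eq. (8)
        return W, pos

    X = np.random.default_rng(1).uniform(size=(400, 2))
    W, pos = train_som(X)   # W approximates the centroids; pos gives their map indices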
Capacities, Credibilities in Analysis of
Probabilistic Objects by Histograms and
Lattices
Edwin Diday¹, Richard Emilion²
¹ CEREMADE - Université Paris 9
INRIA Rocquencourt, 78153 Le Chesnay Cedex, France.
e-mail: [email protected]

² Centre de Calcul Informatique
Université de Dakar, Senegal.

Summary: Capacities and credibilities appear in the modelling of objects described


by random variables with probability distributions. The general aim is to extend
standard data analysis to such objects. Only histograms and lattices are investigated
in this more general setting. Sub-additive ergodic theorems are used as a tool in the
histogram analysis of these objects.
1. Introduction
In this paper, "first order objects" are described by rows and columns of a data
table (which generalizes standard data tables) in which each cell represents a random
variable and contains its probability distribution. For instance, if a row is associated
to an animal and the weight to a column, then the cell corresponding to this row and
this column contains the probability distribution of the weight of this animal. Such a
row defines what we call a "probabilistic object". Note that standard data tables are
a simple particular case of our model: take distributions concentrated on a single
point.
A "second order object"' is defined by a class of first order objects. For example, in
order to describe the colour of a class of individuals in which some are yellow and
others are red, we say that "the colour of a member of the class is yellow or red". In
this way we "generalize" the description of the individuals by giving the description
(yellow or red) to the colour of the class. In other words the colour yellow enter in
the description of the class if "at least one of its members is yellow". We extend this
notion of generalization to the case of individuals described by probabilistic objects,
in describing the class by the probability that at least one member of the class be
yellow. We ca.\l this probability the "capacity" of the class to be yellow since it can
be shown that it satisfies the mathematical properties of a "capacity" in the sense of
Choquet (19.')4).
We may also be interested in a specialization of the class. Instead of obtaining a
"generalized" second order object we obtain a "specialized" second order object. It is
described by all the properties which appear in the description of all the individuals
of the class. For example, the colour of a specialized second order object is yellow


if all the members of the class that it represents are yellow. When the individuals
are described by probabilistic objects, the specialized class is described by "the prob-
ability that all the members of the class are yellow". This probability is called the
"credibility" of the colour yellow for this class, since it can be shown that it satisfies
the mathematical properties of a "credibility" (or "belief") in the sense of Shafer
(1976). When the n random variables associated to each member of the class are in-
dependent, the capacity and credibility may respectively be calculated by a t-conorm
and a t-norm as defined by Schweizer and Sklar (1983).
A "concept" is usually defined by an intent and an extent. It may be modelled by a
"symbolic object", also defined by an intent and an extent, where the intent is the
description of a second order object (in terms of capacities or credibilities, in this
paper) and the extent is defined by the set of individuals (described by probabilistic
objects, in this paper) which satisfy this description as well as possible. The general
aim of this paper is to extract such concepts from a set of probabilistic objects
via an extension of standard data analysis methods to such objects.
2. Probabilistic objects
2.1 Basic model:
Several real situations can be modelled as follows:
Let C be a set whose elements c are called objects and let 𝒞 denote a σ-algebra on
C. Let (Ω, F, μ) be a measure space and I a set of indices.
A description of the object c is a family (X_{c,i})_{i∈I} of measurable maps defined on Ω
and taking values in a measurable space (O_i, 𝒪_i). If μ is a probability P then the
objects will be called probabilistic objects, described by the random variables X_{c,i}
with laws denoted by X̂_{c,i}. By definition the distribution law X̂_{c,i} is a probability on
𝒪_i such that X̂_{c,i}(O) = P{X_{c,i}⁻¹(O)} for any O ∈ 𝒪_i.
2.2 Examples:
- Let C be a set of computer processors which are working in various environments
i, i ∈ I. The time of execution of a task w ∈ Ω for the processor c under condition i
is not deterministic; therefore it will be given by X_{c,i}(w) ∈ O_i, where X_{c,i} is a r.v.
- Let C be a set of individuals who are submitted to different tests of type i. The
random result of the individual c on a test w ∈ Ω of type i will be X_{c,i}(w).
- Let C be a set of specimens (for example, insects or plants from the same species)
given with different descriptors i, i ∈ I (for example, the size, the age, etc.). Due
to the variability of these descriptors' values on the same specimen as time varies,
the random value of descriptor i of the specimen c at time w ∈ Ω will be X_{c,i}(w).
3. Capacities and credibilities
3.1 Capacities
Let i ∈ I, A ∈ 𝒞 and O ∈ 𝒪_i. We will say that the system A is not able to reach the
objective O if, for all c ∈ A, μ(X_{c,i} ∈ O) = 0. Hence, we are tempted to evaluate the
ability or the capacity of A to reach O by the number μ(∪_{c∈A} (X_{c,i} ∈ O)) (for
countable A) or by the number sup_{c∈A} μ(X_{c,i} ∈ O). In the above examples we will
then get the capacity of a set of processors to complete a task before t seconds, the
capacity of a set of individuals to succeed, etc. It turns out that these definitions agree
with the capacities as defined by Choquet (1954):

Definition:
A capacity on (C, 𝒞) is a map κ from 𝒞 to ℝ₊ such that
i) κ(∅) = 0;   ii) κ(A_1 ∪ A_2) ≤ κ(A_1) + κ(A_2);
iii) A ⊆ B ⇒ κ(A) ≤ κ(B);   iv) κ(lim↑ A_n) = lim↑ κ(A_n).
The capacity κ is a strong capacity if, in addition, we have:
ii') κ(A_1 ∪ A_2) + κ(A_1 ∩ A_2) ≤ κ(A_1) + κ(A_2).
The capacity κ is a capacity of order ∞ if, in addition, we have
κ(A_1 ∪ A_2 ∪ ... ∪ A_n) ≤ Σ_i κ(A_i) − Σ_{i<j} κ(A_i ∩ A_j) + ... + (−1)^{n+1} κ(∩_j A_j).
Proposition 1:
Let κ(A, O) = sup_{finite B ⊆ A} μ(∪_{c∈B} (X_{c,i})⁻¹(O)); then the map A → κ(A, O)
(resp. O → κ(A, O)) is a capacity of order ∞ on 𝒞 (resp. on 𝒪_i).
3.2 Credibilities
What happens if, in section 3.1, we replace union by intersection? Actually, passing
to complements, we get in a natural way credibilities instead of capacities.
Definition:
A credibility on (C, 𝒞) is a map β from 𝒞 to ℝ₊ such that
i) β(∅) = 0;
ii) β(A_1 ∩ A_2 ∩ ... ∩ A_n) ≤ Σ_{j=1}^{n} β(A_j) − Σ_{i<j} β(A_i ∪ A_j) + ... + (−1)^{n+1} β(∪_{j=1}^{n} A_j).
Proposition 2:
Let β(O) = P{∩_{c∈A} (X_{c,i})⁻¹(O)} for any fixed countable A ∈ 𝒞; then the map
O → β(O) is a credibility on 𝒪_i.


3.3 T-norms and t-conorms
We will now get some new capacities by using some special maps introduced by
Schweizer and Sklar (1983).
Definition:
A t-norm T is a map from [0, 1]² to [0, 1], (u, v) → u T v, such that for all t, u, v, w
in [0, 1]:
i) u T (v T w) = (u T v) T w (associativity);
ii) t T u ≤ v T w whenever t ≤ v, u ≤ w (monotonicity);
iii) u T v = v T u (commutativity);
iv) u T 1 = u (identity law).
Examples (a code sketch of some of these pairs follows the list):
u T_1 v = u × v and its dual u T_1* v = u + v − u × v;
u T_2 v = min(u, v) and its dual u T_2* v = max(u, v);
u T_3 v = max(u + v − 1, 0) and its dual u T_3* v = min(u + v, 1);
u T_4 v = 1 − min{1, ((1 − u)^p + (1 − v)^p)^{1/p}} and its dual
u T_4* v = min{1, (u^p + v^p)^{1/p}}, p > 0;
u T_5 v = log_a(1 + (a^u − 1)(a^v − 1)/(a − 1)) (0 < a < ∞) and its dual
u T_5* v = 1 − log_a(1 + (a^{1−u} − 1)(a^{1−v} − 1)/(a − 1)),
where log_a(u) denotes the logarithm to the base a > 0.
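A minimal Python sketch of some of these pairs (our illustration), checking the duality u T* v = 1 − (1 − u) T (1 − v):

    import numpy as np

    t1  = lambda u, v: u * v                        # T_1
    t1d = lambda u, v: u + v - u * v                # T_1*
    t2  = lambda u, v: np.minimum(u, v)             # T_2
    t2d = lambda u, v: np.maximum(u, v)             # T_2*
    t3  = lambda u, v: np.maximum(u + v - 1.0, 0.0) # T_3
    t3d = lambda u, v: np.minimum(u + v, 1.0)       # T_3*

    def t4(u, v, p=2.0):   # T_4
        return 1.0 - np.minimum(1.0, ((1 - u) ** p + (1 - v) ** p) ** (1.0 / p))

    def t4d(u, v, p=2.0):  # T_4*
        return np.minimum(1.0, (u ** p + v ** p) ** (1.0 / p))

    u, v = 0.6, 0.7
    for T, S in [(t1, t1d), (t2, t2d), (t3, t3d), (t4, t4d)]:
        print(T(u, v), S(u, v), 1 - T(1 - u, 1 - v))   # last two columns coincide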
4. Capacities and credibilities histograms
Let $X_{c,i}$ be a real-valued r.v. and $X_{c,i}^1, X_{c,i}^2, \ldots, X_{c,i}^n$ a sample of $n$ independent r.v.'s distributed as $X_{c,i}$. In this case, each $X_{c,i}^j$ is defined on $\Omega^n$. For example, if $\omega = (\omega_1, \ldots, \omega_n) \in \Omega^n$, the result of the individual $c$ on the test $\omega_j$ of type $i$ is $X_{c,i}^j(\omega)$.
Consider the probability measure defined by:
$P_n(\omega)([s, t]) = (\text{number of } X_{c,i}^1(\omega), X_{c,i}^2(\omega), \ldots, X_{c,i}^n(\omega) \text{ between } s \text{ and } t)/n$.
Histograms of frequencies can be derived from this measure: given $k$ real numbers $s_1 < s_2 < \cdots < s_k$, we represent the function $H_{s_j, s_{j+1}} / (s_{j+1} - s_j)$ on $[s_j, s_{j+1}[$, where $H_{s_j, s_{j+1}}$ denotes $P_n(\omega)([s_j, s_{j+1}[)$.
A consequence of the strong law of large numbers is that $P_n([s, t])$ is a good approximation of $P(X_{c,i} \in [s, t])$ if $n$ is large enough. Moreover, as an application of the Lebesgue differentiation theorem, the preceding function is a good approximation of the density of $X_{c,i}$ (if its distribution has a density) when the steps $s_{j+1} - s_j$ are small enough.
Now, starting with two (or more) r.v.'s $X_{c,i}, X_{d,i}$ as above, and putting $A = \{c, d\}$, it is natural to compute the capacity $\kappa_{A,n}(\omega) = P_{c,n}(\omega) * P_{d,n}(\omega)$ and the credibility $\beta_{A,n}(\omega) = P_{c,n}(\omega)\,T\,P_{d,n}(\omega)$, where $T$ is a t-norm and $*$ a t-conorm (we have omitted the index $i$). Note that in the case of independent r.v.'s $X_{c,i}$ we can take $u\,T\,v = u \times v$ and $u * v = u + v - uv$.
Put $\kappa_{s,t} = \lim_{n \to \infty} \kappa_{A,n}(\omega)[s, t[$ and $\beta_{s,t} = \lim_{n \to \infty} \beta_{A,n}(\omega)[s, t[$. Due to the continuity of t-norms and t-conorms, these capacities and credibilities converge as $n$ tends to $\infty$. Since we have $\kappa_{s,u} \leq \kappa_{s,t} + \kappa_{t,u}$ and $\beta_{s,u} \geq \beta_{s,t} + \beta_{t,u}$ for $s \leq t \leq u$, we get two new types of histograms: a histogram of capacity, which is sub-additive, and a histogram of credibility, which is super-additive, while usual frequency histograms are additive.
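As an illustration of this construction, the following Python fragment (our sketch, not code from the paper) builds the frequency, capacity and credibility histograms of two samples on a common grid of bins, using the product t-norm and its dual t-conorm, as suggested above for independent variables.

```python
# Illustrative sketch: frequency, capacity and credibility histograms for
# A = {c, d}, using u T v = u*v and u * v = u + v - u*v (independence case).
def freq(sample, s, t):
    """Empirical P_n([s, t[): the fraction of observations falling in [s, t[."""
    return sum(s <= x < t for x in sample) / len(sample)

def histograms(sample_c, sample_d, bins):
    rows = []
    for s, t in zip(bins, bins[1:]):
        pc, pd = freq(sample_c, s, t), freq(sample_d, s, t)
        capacity = pc + pd - pc * pd   # t-conorm: yields a sub-additive histogram
        credibility = pc * pd          # t-norm: yields a super-additive histogram
        rows.append((s, t, pc, pd, capacity, credibility))
    return rows

# Example with two small samples and bins s_1 < s_2 < ... < s_k:
for row in histograms([1.2, 2.5, 2.7, 3.9], [1.8, 2.1, 3.4, 3.6],
                      bins=[1.0, 2.0, 3.0, 4.0]):
    print(row)
```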
Theorem: If $\kappa_{s,t}$ and $\beta_{s,t}$ are $O(|t - s|)$ then $\lim_{t \to s} \frac{\kappa_{s,t}}{t - s}$ and $\lim_{t \to s} \frac{\beta_{s,t}}{t - s}$ exist for almost all $s$. The limit function $f$ is the smallest (resp. the greatest) positive function such that $\kappa_{s,t} \leq \int_{[s,t[} f$ (resp. $\beta_{s,t} \geq \int_{[s,t[} f$).
Proof:
Let $(\tau_t)_{t>0}$ be the translation semigroup operating on measurable positive functions $f$, that is, $\tau_t(f)(x) = f(x + t)$. Let $F_t(x) = \kappa_{x,x+t}$, so that
$F_{t+s}(x) = \kappa_{x,x+t+s} \leq \kappa_{x,x+t} + \kappa_{x+t,x+t+s} = F_t(x) + \tau_t F_s(x)$.
Since $\tau_t$ preserves the integral of integrable functions, the subadditive local ergodic theorem of Akcoglu-Krengel (1987, 1981) yields the convergence. The assertion concerning the limit follows from Feyel's (1982) proof of the local ergodic theorem. Moreover, $F_t = G_t + H_t$ with $G_t$ additive and $H_t$ subadditive such that $\lim_{t \to 0} H_t/t = 0$, and $G_t(x) = \int_{[x,x+t[} f + S_t$ with $S_t$ singular, $f$ being necessarily the limit.

5. Lattices
5.1 Lattices of a set of probabilistic objects
For any fixed $i$ let $V_i \subseteq O_i$, where $V_i$ may be defined, for instance, by a percentile, and consider the subset $(X_{c,i})^{-1}(V_i) = \{\omega \in \Omega / X_{c,i}(\omega) \in V_i\}$. Denoting by $F_{V_i,c}$ the characteristic function of this set, the family $d_c = \{F_{V_i,c}, i \in I\}$ is a partial description of the object $c$ depending on the choice of the value sets $V_i$.
Consider the complete lattice generated by $d_c$, $c \in C$, with respect to the order $F_{V_i,c} \leq F_{V_i,d}$ for all $i$. This order corresponds of course to the inclusion order of the sets $\{X_{c,i}^{-1}(V_i)\}$.
A natural valuation of the lattice is obtained by using capacities and credibilities: in the join semi-lattice of all the unions of $\{(X_{c,i})^{-1}(V_i)\}_{c \in B}$, the formula $\kappa(B, V_i) = P\left\{\bigcup_{c \in B} (X_{c,i})^{-1}(V_i)\right\}$ for $B$ countable is the order-$\infty$ capacity of Section 3.1. Similarly, in the meet semi-lattice of all the intersections $\{(X_{c,i})^{-1}(V_i)\}_{c \in B}$, $\beta(V_i) = P\left\{\bigcap_{c \in B} (X_{c,i})^{-1}(V_i)\right\}$ defines a credibility function on $O_i$ when $B$ is fixed.
5.2 Symbolic objects
A symbolic object is a mathematical model of a "concept" (see Diday (1995) for more details). It is defined by an intent (i.e. the set of properties satisfied by a set of individuals) and a way of computing its extent (i.e. the set of units which satisfy the intent). A symbolic object $a$ is more general (resp. more specific) than a symbolic object $b$ if the extent of $a$ contains (resp. is contained in) the extent of $b$.
An interesting problem is to associate to each node of the lattices defined above a symbolic object in such a way that the intent and the extent are compatible with the subset associated to this node. For instance, considering the lattice defined in (5.1), the symbolic object, denoted by $a$, associated to an upper semi-lattice node which is associated to the subset $\bigcup_{c \in A} (X_{c,i})^{-1}(V_i)$, may be defined as follows:
The intent restricted to each $V_i$ is defined by $\kappa(A, V_i) = P\left\{\bigcup_{c \in A} (X_{c,i})^{-1}(V_i)\right\}$ and denoted by: $a = \{(\kappa(A, V_i), V_i) / i \in I\}$.
The extent of $a$ is defined by a mapping denoted by "cap" from $C$ to $[0, 1]$ such that:
$\mathrm{cap}(d) = P\left(\bigcap_{i \in I} \left(\bigcup_{c \in A} (X_{c,i})^{-1}(V_i)\right) \cap (X_{d,i})^{-1}(V_i)\right) / P\left\{\bigcap_{i \in I} (X_{d,i})^{-1}(V_i)\right\}$.
This mapping may be considered as a membership function, which indicates the degree to which $d$ belongs to the extent of $a$. Note that $\mathrm{cap}(d) = 1$ if $d$ belongs to $A$.
Conclusion:
This work shows a way of extending standard data analysis to objects defined by random variables with probability distributions. We show that capacities and credibilities constitute a natural way of describing classes of such objects by generalization or specialization, and of calculating the intent and the extent of the associated concepts, modelled by symbolic objects. In Diday and Emilion (1996), lattices, hierarchies, pyramids, principal components analysis and decision trees are investigated in this more general approach.
Acknowledgements
We would like to thank the E.U. INTRS (93-725) program that partially supported this research.
References
Choquet, G. (1954): Theory of capacities. Ann. Inst. Fourier.
Diday, E. (1995): Probabilistic, possibilist and belief objects for knowledge analysis. Annals of Operations Research, 55, 227-276.
Diday, E. and Emilion, R. (1996): Capacities and credibilities in analysis of probabilistic objects. Proceedings of OSDA'95, Springer-Verlag.
Shafer, G. (1976): A mathematical theory of evidence. Princeton University Press.
Schweizer, B. and Sklar, A. (1983): Probabilistic metric spaces. Elsevier North-Holland, New York.
Symbolic Pattern Classifiers Based on
the Cartesian System Model
Manabu Ichino and Hiroyuki Yaguchi
Tokyo Denki University
Hatoyama, Saitama 350-03, Japan
E-mail: [email protected]

Summary: As symbolic pattern classifiers, this paper presents region oriented methods based on the Cartesian system model, which is a mathematical model to treat symbolic data. Our region oriented methods are able to use locally effective information to discriminate between pattern classes. This property may achieve, at least superficially, a perfect discrimination of the pattern classes under a finite design set. Therefore, we have to strike a balance between the separability of the classes and the generality of the class descriptions. We describe this viewpoint theoretically and experimentally in order to assert the importance of feature selection, which is essential in any pattern classification problem. We also present an example based on symbolic data in order to illustrate the usefulness of our approach.

1. Introduction
Traditional approaches to pattern classification (e.g. Bow (1992)) may be divided into two categories as follows.
1) Boundary oriented approach: The purpose in this category is to find the equations of decision boundaries. Linear classifiers and Bayes classifiers are examples of this category.
2) Similarity based approach: The purpose in this category is to find standard patterns for the pattern classes and to use an appropriate similarity measure between the standard patterns and new patterns to be classified. Nearest neighbor rules and various matching methods are examples of this category.
As a third category of classification methods, several authors developed region oriented approaches. Stoffel (1974) used prime events to describe class regions for binary feature variables. The prime events for a class cover only the training samples for the class in the feature space. Michalski (1980) developed a very general approach to pattern classification based on his mathematical model, the so-called variable valued logic system. In his approach, various feature types can be used simultaneously to describe sample patterns, and feature selection is performed in the process of finding class regions. According to the recent terminology, we may use the term symbolic data (Diday (1988)) for this general type of sample patterns. Ichino (1979, 1981) used hyperrectangles to describe pattern classes in the feature space. This approach can treat ordinal and binary feature variables simultaneously. However, a further generalization is necessary to treat symbolic data (Ichino (1986, 1988, 1993, 1995)).
The purpose of this paper is to present symbolic pattern classifiers based on region oriented approaches. In particular, we point out the importance of feature selection under a limited number of design samples, since the pretended simplicity appearing in classification problems may prevent us from achieving a proper classification ability. In Section 2, we describe the Cartesian System Model (CSM) as the mathematical model to treat symbolic data. The CSM is represented as $(U(d), \oplus, \otimes)$, where $U(d)$ is the feature space in which each sample pattern is represented by
a mixture of various feature types, $\oplus$ is the Cartesian join operator, which generates a generalized description from given descriptions in the feature space, and $\otimes$ is the Cartesian meet operator, which extracts a common description from given descriptions in the feature space. In Section 3, we define the graphs called the relative neighborhood graph (RNG) and the mutual neighborhood graph (MNG). The MNG for a pattern class yields the interclass structure against the other pattern class. On the other hand, the RNG for a pattern class indicates the intraclass structure of the class. Under the assumption that the sample size of each pattern class is finite, the MNG and the RNG approach complete graphs when we increase the number of features used to describe the sample patterns. The completeness of the MNG means that a perfect separability between the pattern classes is achieved, while the completeness of the RNG means that the class description has a minimum generality even for the given design set. Then, based on the properties of the RNG and the MNG, we restate the Pretended simplicity theorem (Ichino (1993)) in an improved way in Section 4. We describe our symbolic pattern classifiers and we compare our approach to other well-known approaches, the ID3 (Quinlan (1986)) and the backpropagation neural network (Rumelhart (1986)), using an example of symbolic data in Section 5. Section 6 is a summary.

2. The Cartesian system model

2.1 Description of sample patterns
Each sample pattern is represented by the Cartesian product set:
$E = E_1 \times E_2 \times \cdots \times E_d$, (1)
where $E_k$ is the feature value taken by the feature $X_k$. We can treat the following five feature types.
1) continuous quantitative feature (e.g. height, weight, etc.)
2) discrete quantitative feature (e.g. the number of family members, etc.)
3) ordinal qualitative feature (e.g. academic career, etc.; an appropriate numerical coding is assumed)
4) nominal qualitative feature (e.g. sex, blood type, etc.)
5) tree structured feature (see Fig. 1, where terminal values are taken as feature values)
Feature types 1), 2), and 3) are permitted to take interval values of the form $[a, b]$, and feature types 4) and 5) are permitted to take finite sets as feature values. The Cartesian product (1) described in terms of features of types 1)-5) is called an event. It should be noted that a sample pattern is an event. Let $U_k$ be the domain of the feature $X_k$: $U_k$ is a finite interval when the feature type is 1), 2), or 3), and a finite set when the feature type is 4) or 5). Then, the feature space is given by the product set
$U(d) = U_1 \times U_2 \times \cdots \times U_d$. (2)
2.2 The Cartesian join operator
The Cartesian join $A \oplus B$ of a pair of events $A$ and $B$ in the feature space $U(d)$ is defined by:
$A \oplus B = (A_1 \oplus B_1) \times (A_2 \oplus B_2) \times \cdots \times (A_d \oplus B_d)$, (3)
where $A_k \oplus B_k$ is the Cartesian join of the feature values $A_k$ and $B_k$ for feature $X_k$ and is defined as follows.
1) When $X_k$ is a quantitative or an ordinal qualitative feature, $A_k \oplus B_k$ is the closed interval given by:
$A_k \oplus B_k = [\min(A_{kL}, B_{kL}), \max(A_{kU}, B_{kU})]$, (4)
Figure 1: A tree structured feature.

where $A_{kL}$ and $A_{kU}$ are the minimum value and the maximum value of the interval $A_k$, respectively; and $\min(A_{kL}, B_{kL})$ and $\max(A_{kU}, B_{kU})$ select the minimum and the maximum values from the sets $\{A_{kL}, B_{kL}\}$ and $\{A_{kU}, B_{kU}\}$, respectively.
2) When $X_k$ is a nominal feature, $A_k \oplus B_k$ is the union:
$A_k \oplus B_k = A_k \cup B_k$. (5)

3) When $X_k$ is a tree structured feature, let $N(A_k)$ be the nearest parent node which is common to all terminal values included in $A_k$. Then, if $N(A_k) = N(B_k)$,
$A_k \oplus B_k = A_k \cup B_k$, (6)
and if $N(A_k) \neq N(B_k)$,
$A_k \oplus B_k$ = the set of all terminal values branched from the node $N(A_k \cup B_k)$, (7)
where, for each feature value $A_k$, we assume that
(8)
2.3 The Cartesian meet operator
The Cartesian meet $A \otimes B$ of $A$ and $B$ in the feature space $U(d)$ is defined by:
$A \otimes B = (A_1 \otimes B_1) \times (A_2 \otimes B_2) \times \cdots \times (A_d \otimes B_d)$, (9)
where $A_k \otimes B_k$ is the Cartesian meet of the $k$-th feature values $A_k$ and $B_k$, and is defined by the intersection:
$A_k \otimes B_k = A_k \cap B_k$. (10)
When the intersection (10) takes the empty value $\phi$ for at least one feature, the events $A$ and $B$ have no common part. We denote this fact by
$A \otimes B = \phi$, (11)
and we say that "$A$ and $B$ are completely distinguishable." Fig. 2(a) and Fig. 2(b) illustrate the Cartesian join and the Cartesian meet in the Euclidean plane, respectively.
We call the triple $(U(d), \oplus, \otimes)$ the Cartesian System Model (CSM).¹

¹We used the name Cartesian Space Model initially. However, following a suggestion by Prof. E. Diday, we renamed it to prevent misunderstanding about the model.
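The following Python sketch (ours, not the authors' code) implements the join and meet operators for interval-valued and nominal features; tree structured features are omitted for brevity. An event is modelled as a tuple of per-feature values: a pair (lo, hi) for an interval feature and a frozenset for a nominal one.

```python
# Illustrative sketch of the Cartesian join and meet operators for events
# with interval-valued and nominal (set-valued) features only.
def join_feature(a, b):
    if isinstance(a, tuple):                   # interval feature: (lo, hi), eq. (4)
        return (min(a[0], b[0]), max(a[1], b[1]))
    return a | b                               # nominal feature: union, eq. (5)

def meet_feature(a, b):
    if isinstance(a, tuple):                   # interval intersection
        lo, hi = max(a[0], b[0]), min(a[1], b[1])
        return (lo, hi) if lo <= hi else None  # None plays the role of phi
    inter = a & b                              # nominal intersection, eq. (10)
    return inter if inter else None

def join(A, B):                                # eq. (3)
    return tuple(join_feature(a, b) for a, b in zip(A, B))

def meet(A, B):                                # eq. (9); None when A and B are
    out = tuple(meet_feature(a, b) for a, b in zip(A, B))   # completely
    return None if any(f is None for f in out) else out     # distinguishable

# Two events over (height interval, blood type set):
A = ((160.0, 175.0), frozenset({"A", "B"}))
B = ((170.0, 182.0), frozenset({"B", "O"}))
print(join(A, B))   # interval [160, 182] and nominal set {A, B, O}
print(meet(A, B))   # interval [170, 175] and nominal set {B}
```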
Figure 2: The Cartesian join and the Cartesian meet in the Euclidean plane.

3. The graph concepts

3.1 Sample patterns
Let $\omega_1$ and $\omega_2$ be two pattern classes, and let their sets of sample patterns be
$\omega_1 = \{E_{11}, E_{12}, \ldots, E_{1N_1}\}$, $\omega_2 = \{E_{21}, E_{22}, \ldots, E_{2N_2}\}$, (12)
where the sample pattern $E_{kj}$, $j = 1, 2, \ldots, N_k$, is represented by
$E_{kj} = E_{kj1} \times E_{kj2} \times \cdots \times E_{kjd}$. (13)
We assume that each sample in $\omega_1$ is completely distinguishable from any sample in $\omega_2$:
$E_{1p} \otimes E_{2q} = \phi$, $p = 1, 2, \ldots, N_1$, $q = 1, 2, \ldots, N_2$. (14)

3.2 The relative neighborhood graph

The relative neighborhood graph (RNG) of Ichino and Sklansky (1985) yields relative relationships between the samples in a class, and is generalized under the CSM as follows. Two samples $E_{ip}$ and $E_{iq}$ in $\omega_i$ are relative neighbors if
$E_{ik} \otimes (E_{ip} \oplus E_{iq}) = \phi$, $k = 1, 2, \ldots, N_i$ $(k \neq p, q)$. (15)
If two samples are relative neighbors, the Cartesian join of these samples never includes other samples. In this sense, samples which are relative neighbors are isolated and singular from the other samples. The relative neighborhood graph, written RNG($\omega_i$), is the graph constructed by joining all pairs of sample patterns $E_{ip}, E_{iq} \in \omega_i$ which are relative neighbors. Fig. 3(a) illustrates the RNG in the Euclidean plane, where we omit all edges which represent the fact that each sample is itself a relative neighbor.

3.3 The mutual neighborhood graph and the silhouette

The mutual neighborhood graph (MNG) (Ichino (1986)) yields information about the interclass structure, and is defined as follows. Two samples $E_{1p}$ and $E_{1q}$ in $\omega_1$ are mutual neighbors against $\omega_2$ if
$E_{2k} \otimes (E_{1p} \oplus E_{1q}) = \phi$, $k = 1, 2, \ldots, N_2$. (16)
Figure 3: The RNG and the MNG in the Euclidean plane.

Figure 4: The MNG and the silhouette in the Euclidean plane.

The mutual neighborhood graph (MNG) of $\omega_1$ against $\omega_2$, written MNG($\omega_1|\omega_2$), is the graph constructed by joining all pairs of sample patterns $E_{1p}, E_{1q} \in \omega_1$ which are mutual neighbors against $\omega_2$.
Fig. 3(b) and Fig. 4(a) illustrate the MNG in the Euclidean plane, where we omit all edges which represent the fact that each sample pattern is itself a mutual neighbor. When two pattern classes are well separated (Fig. 3(b)), the MNG becomes a complete graph or a nearly complete graph (e.g., MNG($\omega_1|\omega_2$) and MNG($\omega_2|\omega_1$) in Fig. 3(b)). On the other hand, the number of edges of the MNG decreases according to the closeness of the two pattern classes (e.g., MNG($\omega_1|\omega_2$) in Fig. 4(a)).
The shaded regions in Fig. 4(b) are called the silhouettes of the pattern classes. The silhouette $S(\omega_i)$ approximates the region of pattern class $\omega_i$, and is defined by
$S(\omega_i) = \bigcup (E_{ip} \oplus E_{iq})$, (17)
where the union is taken over all mutual neighbor pairs in $\omega_i$ against the other class.
It should be noted that the MNG and the silhouette of a pattern class are descriptions of the class from the viewpoint of relativity against the other class. However, the MNG is an abstract mathematical description, while the silhouette is an actual description in the feature space.
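Both graphs can be built directly from the join and meet operators. The sketch below (ours, reusing the hypothetical join/meet helpers from Section 2) tests conditions (15) and (16) for every pair of samples.

```python
# Illustrative sketch: building the RNG and the MNG from conditions (15)
# and (16), using the join/meet helpers sketched in Section 2.
from itertools import combinations

def relative_neighbors(samples):
    """RNG edges: pairs whose join excludes every other sample of the class."""
    edges = []
    for p, q in combinations(range(len(samples)), 2):
        j = join(samples[p], samples[q])
        if all(meet(samples[k], j) is None
               for k in range(len(samples)) if k not in (p, q)):
            edges.append((p, q))
    return edges

def mutual_neighbors(class1, class2):
    """MNG(class1 | class2) edges: joins excluding every sample of class2."""
    edges = []
    for p, q in combinations(range(len(class1)), 2):
        j = join(class1[p], class1[q])
        if all(meet(e2, j) is None for e2 in class2):
            edges.append((p, q))
    return edges
```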

4. The Pretended simplicity theorem

The Pretended simplicity theorem (Ichino (1993)) was introduced in order to assert the importance of feature selection in the design of symbolic pattern classifiers. We restate this theorem here in an improved way.
Figure 5: Illustration of the invariability of join regions.

4.1 As the boy so the man theorem

We start from a simple example of three samples in the Euclidean plane (see Fig. 5(a)). In this example, the Cartesian join of the two samples $A$ and $B$ excludes sample $C$ along feature $X_1$. This exclusion property is invariable under the addition of feature $X_2$. In general, once the exclusion of a sample from a given Cartesian join region is achieved by a set of features, called simply a feature set, it is invariable under the addition of new features to the feature set. This invariability is a special property of our Cartesian join. In fact, if we use the circle of influence as the join region of two sample patterns, this invariability is no longer obtained (see Fig. 5(b)).
Now we assume again the data sets in (12). Let two samples $E_{1i}, E_{1j}$ in $\omega_1$ be mutual neighbors against $\omega_2$. Then for each $E_{2k}$ in $\omega_2$, the existence of at least one feature $X_p$ is required such that
$E_{2kp} \otimes (E_{1ip} \oplus E_{1jp}) = \phi$. (18)
The condition in (18), however, is not related to any other sample in $\omega_2$ and is not related to any other feature. In other words, once the property that the Cartesian join of the two samples $E_{1i}$ and $E_{1j}$ is completely distinguishable from the sample $E_{2k}$ is obtained for a feature set, the property is invariable under the addition of new features to the feature set. It should be noted that this invariability is also true for relative neighbors, by confining ourselves to a single class.
Theorem 1 (As the boy so the man theorem)
Once the properties of the relative neighbors and the mutual neighbors are obtained for a feature set, the properties are invariable under the addition of new features to the feature set.

4.2 Generality and separability

We introduce two additional terms, generality and separability, for the description by the Cartesian join. Let $F$ be a feature set. Let $E_{1i}$ and $E_{1j}$ be two samples of class $\omega_1$, and let $\alpha_{ij}$ be the number of samples of class $\omega_1$ which are included in the Cartesian join of $E_{1i}$ and $E_{1j}$ under the feature set $F$. Then we define the generality of the samples $E_{1i}$ and $E_{1j}$ under the feature set $F$ as follows:
$\mathrm{Gen}(i, j|F) = \alpha_{ij}/(N_1 - 2)$, (19)
where $N_1$ is the number of samples given for class $\omega_1$. On the other hand, let $\beta_{ij}$ be the number of samples of class $\omega_2$ which are included in the Cartesian join of $E_{1i}$ and $E_{1j}$ under the feature set $F$. Then, we define the separability of the Cartesian join of $E_{1i}$ and $E_{1j}$ from the other class $\omega_2$ under the feature set $F$ as follows:
$\mathrm{Sep}(i, j|F) = 1 - \beta_{ij}/N_2$, (20)


Figure 6: Illustration of the generality and the separability.

where $N_2$ is the number of samples given for class $\omega_2$. It is clear that
$0 \leq \mathrm{Gen}(i, j|F),\ \mathrm{Sep}(i, j|F) \leq 1$, (21)
and the generality and the separability become maximum when $\mathrm{Gen}(i, j|F) = 1$ and $\mathrm{Sep}(i, j|F) = 1$, respectively.
We illustrate the generality and the separability by using the two dimensional example in Fig. 6. In this figure, we have 8 samples for class $\omega_1$ and 7 samples for class $\omega_2$. In the feature space given by $X_1$, the Cartesian join of the samples $E_{1i}$ and $E_{1j}$ includes 2 $\omega_1$ samples and 3 $\omega_2$ samples. Hence, we have
$\mathrm{Gen}(i, j|X_1) = 2/(8 - 2) = 1/3,\ \mathrm{Sep}(i, j|X_1) = 1 - 3/7 = 4/7$. (22)
In the feature space given by $X_2$, we have
$\mathrm{Gen}(i, j|X_2) = 1/(8 - 2) = 1/6,\ \mathrm{Sep}(i, j|X_2) = 1 - 2/7 = 5/7$. (23)
On the other hand, in the two dimensional feature space given by $X_1$ and $X_2$, we have
(24)
From this example, it should be clear that the generality is monotonically decreased and the separability is monotonically increased by the addition of features to describe the sample patterns. We should point out that this monotonic property of the generality and the separability is based on the As the boy so the man theorem. A sketch of this computation is given below.
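As a concrete reading of equations (19) and (20), the following sketch (ours, again built on the hypothetical join/meet helpers of Section 2) counts the samples covered by a Cartesian join; each sample is assumed to be already restricted to the feature set $F$.

```python
# Illustrative sketch of Gen(i, j | F) and Sep(i, j | F) from eqs. (19)-(20).
def contains(event, sample):
    """True if the sample is included in the event (their meet is the sample)."""
    return meet(event, sample) == sample

def gen_sep(i, j, class1, class2):
    jn = join(class1[i], class1[j])
    alpha = sum(contains(jn, e) for k, e in enumerate(class1) if k not in (i, j))
    beta = sum(contains(jn, e) for e in class2)
    return alpha / (len(class1) - 2), 1.0 - beta / len(class2)
```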
4.3 Pretended simplicity theorem
Now we assume that the sample size of each pattern class is finite and that, for each pair of samples $E_{1i}$ and $E_{1j}$ in $\omega_1$, there exist features by which the Cartesian join $E_{1i} \oplus E_{1j}$ is completely distinguishable from any other sample $E_{1k}$ in $\omega_1$ and from any sample $E_{2k}$ in $\omega_2$. Then, we have the following theorem.
Theorem 2 (Pretended simplicity theorem)
By adding features appropriately to the feature set $F$:
1) The generality $\mathrm{Gen}(i, j|F)$ becomes zero and the separability $\mathrm{Sep}(i, j|F)$ becomes one for each pair of samples $E_{1i}$ and $E_{1j}$ in $\omega_1$;
2) The RNG($\omega_1$) and the MNG($\omega_1|\omega_2$) approach complete graphs; and
3) The silhouette $S(\omega_1)$ has a perfect separability from class $\omega_2$, but it has a minimum generality as a description of class $\omega_1$.
The properties 1) and thus 2) are direct conclusions from the As the boy so the man theorem. The property 3) is then derived from 1) and 2).
This theorem asserts that: 1) the silhouette $S(\omega_1)$ becomes a connected single cluster, and it never includes any sample of class $\omega_2$, since all sample pairs in $\omega_1$ are mutual neighbors; but 2) the silhouette $S(\omega_1)$ yields a very sparse description of class $\omega_1$, and it yields only a very poor covering ability even for the other design samples in the same class $\omega_1$, since all sample pairs of class $\omega_1$ are also relative neighbors. Therefore, the simplicity of the interclass structure obtained here is superficial, a "pretended simplicity", and thus the selection of globally effective features is absolutely important in order to achieve a realistic classification performance.
4.4 Example 1
We generate $2N$ $d$-dimensional Gaussian samples, where the $d$ features are mutually independent and identically distributed with zero mean and unit variance. We divide the $2N$ samples randomly into two sets of $N$ samples. These two sets are used as the design sets for the pattern classes $\omega_1$ and $\omega_2$. Therefore, the two pattern classes are completely overlapped in the $d$-dimensional feature space. Fig. 7(a) illustrates the distributions in a three dimensional feature space. The Pretended simplicity theorem asserts that if we fix $N$ and increase $d$, the MNG($\omega_1|\omega_2$) (MNG($\omega_2|\omega_1$)) and RNG($\omega_1$) (RNG($\omega_2$)) approach complete graphs, namely their numbers of edges approach the maximum number $_N C_2$. Fig. 7(b) summarizes our experimental results. For example, when $N = 500$, the numbers of edges of the MNGs and RNGs increase with the addition of features and approach the maximum number $_N C_2 = 124750$ at around 11 features. This is a remarkable fact, since we can separate our mixed-up pattern classes by using only a small number of very locally effective features. The silhouettes $S(\omega_1)$ and $S(\omega_2)$ may be mutually overlapped, but they never include any sample from their counter pattern class. Therefore, we achieved a perfect separability between the classes in terms of our design sets, although it is a pretended simplicity from the viewpoint of the given interclass structure. Furthermore, the silhouette $S(\omega_k)$ includes all samples of the class $\omega_k$, but each Cartesian join region of $S(\omega_k)$ which is spanned by a pair of samples of $\omega_k$ never includes other samples of $\omega_k$ except the pair of samples. Therefore, the silhouette $S(\omega_k)$ has a minimum generality in the description of the class $\omega_k$. In fact, for a new sample pattern independent of the design sets, the silhouettes $S(\omega_1)$ and $S(\omega_2)$ have exactly the same possibility of covering the pattern.
This example asserts again that we have to select only sufficiently effective features in order to strike a balance between the separability of the pattern classes and the generality of the class descriptions.

5. Symbolic pattern classifiers

5.1 Region oriented methods
For pattern class $\omega_i$, $i = 1, 2$, we assume the data sets in (12). Let $R_{ij}$, $j = 1, 2, \ldots, M_i$, $i = 1, 2$, be events such that
$E_{ij} \subseteq \bigcup_{k=1}^{M_i} R_{ik}$, $j = 1, 2, \ldots, N_i$, $i = 1, 2$, (25)
$E_{ij} \not\subseteq R_{pq}$, $j = 1, 2, \ldots, N_i$, $q = 1, 2, \ldots, M_p$ $(i \neq p)$, (26)
Figure 7: Example 1: (a) the distributions; (b) the numbers of edges against the number of features.

where $M_i < N_i$, $i = 1, 2$, in general. Then, we can use the following decision rule to classify a given sample pattern $E$.
1) $E$ is determined to come from class $\omega_i$ if there exists an $R_{ik}$ for which $E \subseteq R_{ik}$ and if $E \not\subseteq R_{pq}$ for all $q$, where $p \neq i$.
2) It is rejected as a type-I reject if it is covered by events of $\omega_1$ and $\omega_2$ simultaneously.
3) It is rejected as a type-II reject if no event covers it.
Now we can state our basic problem as: "Generate an appropriate set of events which satisfy (25) and (26)." We can assert the following theorem.
Theorem 3 (Existence theorem (Ichino (1988)))
If the given training sets $\omega_1$ and $\omega_2$ are mutually completely distinguishable, there exist events which satisfy (25) and (26).
This theorem is clear if we take $R_{ik} = E_{ik}$, $k = 1, 2, \ldots, N_i$, $i = 1, 2$. However, a realistic covering ability of the $R_{ij}$ for new patterns will be achieved by events which are expanded from the given sample patterns.
As one approach, we may use the silhouette $S(\omega_i)$ in (17) to describe the region of each pattern class $\omega_i$. We point out a principle of relativity: silhouettes can be relatively described in the feature space according to the mutual separability of the pattern classes (see Fig. 4). Therefore, if we can find a minimum set of sample patterns which span the silhouettes, we may obtain a realistic symbolic classifier.
As a different approach, Ichino (1986, 1988) presented an algorithm which approximates the silhouette of a class by a smaller number of events. This algorithm generates events so that they cover the sample patterns which yield the densely connected portions of the mutual neighborhood graph (see Fig. 8).
In the above decision rule, a given new sample is rejected from class-name assignment when the sample is not included in any event. In this case, we can suggest the nearest pattern class $\omega_i$ by using the membership grade of a sample $E = E_1 \times E_2 \times \cdots \times E_d$ with respect to the event $R_{ik} = R_{ik1} \times R_{ik2} \times \cdots \times R_{ikd}$ of class $\omega_i$, defined by
$MG(E|R_{ik}) = \frac{1}{d} \sum_{p=1}^{d} \frac{|R_{ikp}|}{|R_{ikp} \oplus E_p|}$, $k = 1, 2, \ldots, M_i$, (27)
Figure 8: The MNG and events.

where $|*|$ is the length of the interval $*$ when the $p$-th feature is continuous quantitative, and is the number of possible feature values included in $*$ when the $p$-th feature is discrete quantitative, ordinal qualitative, or structural. It is clear that this membership function takes its values in the unit interval $[0, 1]$. Thus, the function may be regarded as a fuzzy membership function.
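A minimal sketch of the membership grade (27) follows (ours; the helper names are hypothetical, and size() plays the role of $|*|$).

```python
# Illustrative sketch of the membership grade of eq. (27): size() implements
# | * | (interval length for continuous features, cardinality for discrete
# ones), and join_feature is the per-feature Cartesian join of Section 2.
def size(value):
    if isinstance(value, tuple):     # continuous feature: interval length
        return value[1] - value[0]
    return len(value)                # discrete feature: number of values

def membership_grade(E, R):
    """MG(E | R): averages |R_p| / |R_p join E_p|; equals 1 when E lies in R."""
    d = len(E)
    return sum(size(r) / size(join_feature(r, e)) for r, e in zip(R, E)) / d
```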

5.2 Example 2
As an example of a symbolic pattern classification problem, we treat here the data of the "TOYOTA" and "NISSAN" car models of 1992. Each sample (car model) is described by 23 quantitative features and 3 qualitative features. We prepared 181 samples as the design set. The experiments were performed in the following way.
Step 1: We applied the furthest neighbor method (complete linkage method) for hierarchical clustering based on the generalized Euclidean distance of Ichino and Yaguchi (1994), defined for a pair of samples $A = A_1 \times A_2 \times \cdots \times A_d$ and $B = B_1 \times B_2 \times \cdots \times B_d$ by
$d(A, B) = \left[\sum_{k=1}^{d} \psi(A_k, B_k)^2\right]^{1/2}$, (28)
$\psi(A_k, B_k) = \frac{|A_k \oplus B_k| - |A_k \otimes B_k| + 0.5(2|A_k \otimes B_k| - |A_k| - |B_k|)}{|U_k|}$, (29)
where $|*|$ is the same as in (27).
We found five clusters (pattern classes) which corresponded well to the commonly used concepts of "luxury cars", "sports cars", "leisure vehicles", etc.
Step 2: We found events satisfying (25) and (26) for each pattern class by using the method in Ichino (1986, 1988), where our multiclass problem was treated as a set of dual-class problems in the sense of "class $\omega_i$ versus the other classes except class $\omega_i$". Each pattern class was described by one or two event(s). Then, for each event, we found a minimum set of features by which the event is separated from the other pattern classes, by using a modified zero-one integer programming (Ichino (1986, 1988)). The total number of features was reduced from 26 to 16. The selected features were 1) Weight, 2) Width, 3) Length, 4) Height, 5) Wheel base, 6) Front tread, 7) Rear tread, 8) Minimum turning radius, 9) Maximum power, 10) Rev/Max power, 11) Max torque, 12) Rev/Max torque, 13) Engine stroke, 14) Cylinder layout, 15) Final gear ratio, and 16) 10-mode mileage, where the 14-th feature is a tree structured feature and the others are quantitative features (some of them are interval valued features).
No.  Company  Model         Correct Answer  Neural Network  ID3        Proposed System
1    HONDA    NSX           2               2               2,3,5      2
2    HONDA    Legend        1               1               1,2,3,4,5  1
3    HONDA    Prelude       3               3               2,3,5      3
4    HONDA    Accord Wagon  4               3               2,3,4,5    4
5    HONDA    Accord        3               3               2,3,4,5    3
6    HONDA    Integra       3               3               2,3,5      3
7    HONDA    Civic         3               3               2,3,5      3
8    MAZDA    Sentia        1               2               1,2,3,4,5  1
9    MAZDA    Eunos Cosmo   2               4               -          2
10   MAZDA    Efini RX-7    2               3               -          2
11   MAZDA    Efini MPV     4               1               -          4
12   MAZDA    Familia       3               3               -          3
13   MAZDA    Revue         5               2               -          5
14   MAZDA    Carol         5               5               -          5
15   SUBARU   Legacy Wagon  4               4               -          4
16   SUBARU   Vivio         5               5               -          5

Table 1: Results of Example 2.

Step 3: We classified the 1992 car models of "HONDA", "MAZDA", and "SUBARU" as the test data, by using the membership function defined in (27). Our classification results are shown in Table 1.
We also applied the well-known ID3 (Quinlan (1986)) and the backpropagation neural network (Rumelhart (1986)). In the case of the neural network system, most test samples were classified correctly; however, some test samples were incorrectly classified. We performed the training of the network 200 times, where we excluded the 14-th feature because of the limitation on the use of qualitative features. However, the result was still insufficient for classes 2, 4, and 5.
On the other hand, in the case of the ID3, all car models of "MAZDA" and "SUBARU" were rejected. Each car model of "HONDA" was assigned to several classes which include the correct one. This is because the ID3 evaluates a single feature at a time in the learning process to generate a decision tree.
In our system, all test samples were classified correctly. In the region oriented approach, each pattern is included or not in the predetermined events of the classes. This yields reliable classification results when the events are properly generated. However, many patterns appearing in the future may be rejected as type-II rejects because they are not included in any prepared event. In order to overcome this drawback we introduced the fuzzy membership function in (27).

6. Concluding remarks
This paper presented region oriented methods for symbolic pattern classifiers based on the Cartesian system model. Our methods use the mutual neighborhood graph (MNG) as a tool to understand the interclass structure. This graph is able to pick up very local discrimination information to describe the class regions. This property requires us to strike a balance between the separability of the classes and the generality of the class descriptions. In order to assert this viewpoint we presented the Pretended simplicity theorem, and we pointed out the importance of feature selection in symbolic pattern classification. We compared our approach to the well-known ID3 and the backpropagation neural network on the symbolic data of car models.

Acknowledgment
The authors thank Professor Edwin Diday for his helpful discussions. The authors also wish to thank the referees for their suggestions leading to improvements in this paper.

References
Bow, S. T. (1992): Pattern Recognition and Image Preprocessing, Marcel Dekker.
Diday, E. (1988): The symbolic approach in clustering. In Classification and Related Methods of Data Analysis, Bock, H. H. (ed.), Elsevier.
Stoffel, J. C. (1974): A classifier design technique for discrete pattern recognition problems. IEEE Trans. Comput., C-23, pp. 428-441.
Michalski, R. S. (1980): Pattern recognition as rule-guided inductive inference. IEEE Trans. Pattern Anal. and Mach. Intell., PAMI-2, pp. 349-361.
Quinlan, J. R. (1986): Induction of decision trees. Machine Learning, 1, pp. 81-106.
Rumelhart, D. E. and McClelland, J. L. (1986): Parallel Distributed Processing, MIT Press.
Ichino, M. (1979): A nonparametric multiclass pattern classifier. IEEE Trans. Syst., Man, Cybern., 9, pp. 345-352.
Ichino, M. (1981): Nonparametric feature selection method based on local interclass structure. IEEE Trans. Syst., Man, Cybern., 11, pp. 289-296.
Ichino, M. and Sklansky, J. (1985): The relative neighborhood graph for mixed feature variables. Pattern Recognition, 18, 2, pp. 161-167.
Ichino, M. (1986): Pattern classification based on the Cartesian join system: A general tool for feature selection. In Proc. IEEE Int. Conf. on SMC (Atlanta).
Ichino, M. (1988): A general pattern classification method for mixed feature problems. Trans. IEICE Japan, J71-D, pp. 92-101 (in Japanese).
Ichino, M. (1993): Feature selection for symbolic data classification. In New Approaches in Classification and Data Analysis, Diday, E. et al. (eds.), Springer-Verlag.
Ichino, M. and Yaguchi, H. (1994): Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst., Man, Cybern., 24, 4, pp. 698-708.
Ichino, M., Yaguchi, H. and Diday, E. (1995): A fuzzy symbolic pattern classifier. OSDA'95, Paris.
Yaguchi, H., Ichino, M. and Diday, E. (1995): A knowledge acquisition system based on the Cartesian space model. OSDA'95, Paris.
Extension based proximities between
constrained Boolean symbolic objects
Francisco de A. T. de Carvalho¹
¹ Departamento de Estatística - CCEN / UFPE
Av. Prof. Luiz Freire, s/n - Cidade Universitária
50.740-540 Recife - PE, BRASIL
Fax: ++55 +81 2718422; E-mail: fatd@di.ufpe.br

Summary: In conventional exploratory data analysis each variable takes a single value. In real life applications, the data are more general, ranging from single values to intervals or sets of values and including constraints between variables. Such data sets are identified as Boolean symbolic data. The purpose of this paper is to present two extension based approaches to calculate proximities between constrained Boolean symbolic objects. Both approaches compare a pair of these objects at the level of the whole set of variables by functions based on the description potential of their join or union and conjunction. The first comparison function is inspired by a function proposed by Ichino and Yaguchi (1994), while the others are based on the proximity indices related to arrays of binary variables.

1. Introduction.
Constrained Boolean symbolic objects (Diday (1991)) are better adapted than the usual objects of data analysis to describe classes of individuals, taking into account simultaneously variability, as a disjunction of values on a variable, and logical dependencies between variables. For example, if an expert wishes to describe the fruits produced by a village by the fact that "the weight is between 300 and 400 and the colour is white or red, and if the colour is white then the weight is lower than 350", it is not possible to put this kind of information in a usual data table where rows represent villages and columns descriptors of the fruits. Instead, this description may be represented by the constrained Boolean symbolic object $a_j = [\text{weight} = [300, 400]] \wedge [\text{colour} = \{\text{white}, \text{red}\}] \wedge [[\text{colour} = \{\text{white}\}] \Rightarrow [\text{weight} = [300, 350]]]$, where $a_j$, which represents the $j$th village, is a mapping defined on the set of fruits such that for a given fruit $\omega$, $a_j(\omega) = \text{true}$ iff the weight of $\omega$ belongs to the interval $[300, 400]$, its colour is white or red, and if it is white then its weight is less than 350.

2. The Boolean symbolic objects.

Let $\Omega$ be a set of individuals and $\omega \in \Omega$ an individual. A variable is a function $y_i: \Omega \to O_i$, where $O_i$ is the set of values that $y_i$ may take. A variable may take no value, a single value or several values (a discrete set or an interval) for a symbolic object. Let $y = \{y_1, \ldots, y_p\}$ be the set of $p$ variables defined on $\Omega$ and taking their values in $O_1, \ldots, O_p$, respectively. Let $V = \{V_1, \ldots, V_p\}$ where $V_i \subseteq O_i$, $i \in \{1, \ldots, p\}$.

2.1 No constrained Boolean symbolic objects.

A Boolean symbolic object is a logical conjunction of properties. Formally, a Boolean symbolic object $a$ is defined by the function $a_{yV}: \Omega \to \{\text{true}, \text{false}\}$ such that $a_{yV}(\omega) = \text{true}$ iff $\forall i \in \{1, \ldots, p\}$, $y_i(\omega) \in V_i$. The intent of $a$, denoted as $a = [y_1 \in V_1] \wedge \cdots \wedge [y_p \in V_p] = \wedge_{i=1}^{p} [y_i \in V_i]$, states that variable $y_1$ takes its values in $V_1$ and ... and variable $y_p$ takes its values in $V_p$. The extent of $a$ is
defined by $|a|_\Omega = \{\omega \in \Omega / y_i(\omega) \in V_i, \forall i \in \{1, \ldots, p\}\}$.


Example 1: Let $a = [\text{colour} \in \{\text{red}, \text{blue}\}] \wedge [\text{height} \in [0, 15]]$. An individual $\omega \in \Omega$ is such that $a(\omega) = \text{true}$ iff its colour is red or blue and its height is between 0 and 15.
2.2 Constrained Boolean symbolic objects.
To represent actual knowledge, the description of a class of individuals by a Boolean symbolic object must take into consideration different kinds of constraints given by logical dependencies between variables. We distinguish two kinds of dependencies: conditional and logical correlation dependencies.
2.2.1 Conditional dependence.
A variable $y_i$ may become inapplicable if another variable $y_j$ takes its values in a subset $S_j$ of its observation set $O_j$. As an example, we cannot describe the colour of the hat of a mushroom which has no hat. We write $y_i = NA$ to indicate that the variable $y_i$ has become inapplicable, and $y_i$ should then be considered as non-existent or meaningless. In any case $NA$ should not be considered as a variable value; it is just for notational convenience that we write $y_i = NA$.
In the case of single dependence, the conditional dependence is expressed by the rules $r_1: [[y_j \in S_j] \Rightarrow [y_i = NA]]$ and $r_2: [[y_i = NA] \Rightarrow [y_j \in S_j]]$. The rule $r_1$ expresses the fact that if $y_j$ takes its values in $S_j$ then $y_i$ becomes inapplicable, while rule $r_2$ expresses the fact that $y_i$ becomes inapplicable only if $y_j$ takes its values in $S_j$ (Vignes (1991)).
2.2.2 Logical correlation dependence.
The set $S_j \subseteq O_j$, where $O_j$ is the observation set of $y_j$, may be in correspondence with the set $S_i \subseteq O_i$, where $O_i$ is the observation set of $y_i$. In the case of single dependence, the logical correlation dependence is expressed by the rule $r_1: [[y_j \in S_j] \Rightarrow [y_i \in S_i]]$.
Example 2. Let $a = [\text{Height} \in [50, 150]] \wedge [\text{Weight} \in [25, 75]]$. Fig. 1 shows the Cartesian representation of $a$, respectively, (a) under the hypothesis of independence between the variables Height and Weight, (b) under the hypothesis of conditional dependence between these variables, expressed by the rules $r_1: [[\text{Height} \in [100, 150]] \Rightarrow [\text{Weight} = NA]]$ and $r_2: [[\text{Weight} = NA] \Rightarrow [\text{Height} \in [100, 150]]]$, and (c) under the hypothesis of logical correlation dependence between these variables, expressed by the rule $r_3: [[\text{Height} \in [100, 150]] \Rightarrow [\text{Weight} \in [25, 50]]]$.

Fig. 1: Cartesian representation of a: (a) independence; (b) conditional dependence; (c) logical correlation dependence.

2.2.3 Multiple logical dependencies.

The logical dependencies may be visualised as a set of connected graphs. In each graph, the nodes are the variables. These graphs are directed (from the premise variable to the conclusion variable) and labelled CD (conditional dependence) or LCD (logical correlation dependence). Figure 2 shows, for example, that there is a conditional dependence between the variables $y_1$ and $y_2$ and that there is no logical dependence between the variables $y_2$ and $y_3$. We now consider several important cases.

Fig. 2: Graphs of logical dependencies between variables.

Non Applicability propagation. Suppose that a variable $y_i$ may become inapplicable by a variable $y_j$ which at the same time may become inapplicable by another variable $y_k$:
$y_k \xrightarrow{CD} y_j \xrightarrow{CD} y_i$.
In this case, where the variable $y_j$ may itself become inapplicable, the conditional dependence between the variables $y_i$ and $y_j$ is expressed by the rules $r_1: [[y_j \in S_j] \Rightarrow [y_i = NA]]$, $r_2: [[y_i = NA] \Rightarrow [y_j \in S_j]]$ and $r_3: [[y_j = NA] \Rightarrow [y_i = NA]]$. The rule $r_3$ expresses the Non Applicability propagation, which is a kind of inheritance induced by the conditional dependence.
Relaxation. Suppose now that a variable $y_i$ has a logical correlation dependence with a variable $y_j$ which may become inapplicable by another variable $y_k$:
$y_k \xrightarrow{CD} y_j \xrightarrow{LCD} y_i$.
In this case, the logical correlation dependence between the variables $y_i$ and $y_j$ is expressed by the rules $r_1: [[y_j \in S_j] \Rightarrow [y_i \in S_i]]$ and $r_2: [[y_j = NA] \Rightarrow [y_i \in O_i]]$. The rule $r_2$ expresses the fact that when $y_j$ is not applicable, $y_i$ is not affected by the logical dependence (relaxation).
Subordination. Suppose that a variable $y_i$ has a logical correlation dependence with a variable $y_j$ and at the same time may become inapplicable by another variable $y_k$:
$y_j \xrightarrow{LCD} y_i \xleftarrow{CD} y_k$.
In that case, the logical correlation between the variables $y_i$ and $y_j$ is expressed by the rule $r_1: [[y_j \in S_j] \Rightarrow [y_i \in S_i \cup \{NA\}]]$. The rule $r_1$ expresses the fact that $y_j \in S_j$ implies $y_i \in S_i$ only if $y_i$ is applicable. The logical correlation dependence is subordinate to the conditional dependence.
A single variable may become inapplicable by multiple variables. Suppose a variable $y_i$ may become inapplicable by the variables $y_{j1}, \ldots, y_{jm}$. In that case, the $m$ conditional dependencies are expressed by the $m + 1$ rules $r_1: [[y_{j1} \in S_{j1}] \Rightarrow [y_i = NA]]$, ..., $r_m: [[y_{jm} \in S_{jm}] \Rightarrow [y_i = NA]]$ and $r_{m+1}: [[y_i = NA] \Rightarrow [y_{j1} \in S_{j1}] \vee \cdots \vee [y_{jm} \in S_{jm}]]$.
Multiple variables may become inapplicable by a single variable. Suppose the variables $y_{j1}, \ldots, y_{jm}$ may become inapplicable by a variable $y_i$. In that case, the $m$ conditional dependencies are expressed by the $m$ pairs of rules $r_{11}: [[y_i \in S_i] \Rightarrow [y_{j1} = NA]]$ and $r_{12}: [[y_{j1} = NA] \Rightarrow [y_i \in S_i]]$, ..., $r_{m1}: [[y_i \in S_i] \Rightarrow [y_{jm} = NA]]$ and $r_{m2}: [[y_{jm} = NA] \Rightarrow [y_i \in S_i]]$.
We say that a description of an individual (or a group of individuals) by a Boolean symbolic object is coherent if it does not contradict the set of rules which express the logical dependencies between the variables. A sketch of such a coherence check is given below.
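A coherence check can be sketched as follows (ours, not the author's code); a rule is modelled as a pair of predicates (premise, conclusion) over a description, with NA represented by None.

```python
# Illustrative sketch: a description is coherent iff it contradicts none of
# the dependency rules, i.e. every triggered premise has a true conclusion.
def coherent(description, rules):
    return all((not premise(description)) or conclusion(description)
               for premise, conclusion in rules)

# r1: [Height in [100, 150]] => [Weight = NA]
r1 = (lambda d: d["Height"] is not None and 100 <= d["Height"] <= 150,
      lambda d: d["Weight"] is None)
print(coherent({"Height": 120, "Weight": None}, [r1]))  # True
print(coherent({"Height": 120, "Weight": 60}, [r1]))    # False
```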
2.3 Join, Union, Disjunction, Conjunction and Equality.

Let $a = \wedge_{i=1}^{p} [y_i \in A_i]$ and $b = \wedge_{i=1}^{p} [y_i \in B_i]$, where $A_i = [l_i^a, u_i^a]$ and $B_i = [l_i^b, u_i^b]$ if $y_i$ is a quantitative or an ordinal qualitative variable. The join (Ichino and Yaguchi (1994)) of these two Boolean symbolic objects is defined as $a \oplus_\cdot b = \wedge_{i=1}^{p} [y_i \in A_i \oplus B_i]$, where $A_i \oplus B_i = [\min(l_i^a, l_i^b), \max(u_i^a, u_i^b)]$ if $y_i$ is a quantitative or an ordinal qualitative variable, and $A_i \oplus B_i = A_i \cup B_i$ if $y_i$ is a nominal qualitative variable. The union is defined as $a \cup_\cdot b = \wedge_{i=1}^{p} [y_i \in A_i \cup B_i]$, the disjunction is defined as $a \vee_\cdot b = \{\wedge_{i=1}^{p} [y_i \in A_i]\} \vee \{\wedge_{i=1}^{p} [y_i \in B_i]\}$, the conjunction is defined as $a \wedge_\cdot b = \wedge_{i=1}^{p} [y_i \in A_i \cap B_i]$, and we say that $a =_\cdot b$ iff $\forall i$, $A_i = B_i$. Figure 3 illustrates these operations.

Fig. 3: Symbolic operations: (a) join; (b) union; (c) disjunction; (d) conjunction.

2.4 Description potential of a Boolean symbolic object.

Let $a = \wedge_{i=1}^{p} [y_i \in A_i]$ be a Boolean symbolic object. We define the description potential of $a$, denoted $\pi(a)$, as the volume of the part of the Cartesian product $A_1 \times \cdots \times A_p$ formed only by the descriptions of individuals given by $a$ which are coherent.
2.4.1 No constrained Boolean symbolic objects.
If we suppose there are no logical dependencies between the variables in the knowledge base, the Boolean symbolic object $a$ is coherent, and so are all the individual descriptions given by the Cartesian product $A_1 \times \cdots \times A_p$. In this case $\pi(a)$ is calculated by the following expression:
$\pi(a) = \prod_{i=1}^{p} \mu(A_i)$ (1)
where
$\mu(A_i) = \begin{cases} \mathrm{cardinal}(A_i), & \text{if } y_i \text{ is qualitative or discrete quantitative} \\ \mathrm{range}(A_i), & \text{if } y_i \text{ is continuous quantitative} \end{cases}$ (2)
$\mathrm{range}(A_i)$ being the sum of the absolute values of the differences between the upper bound and the lower bound of each interval, where $A_i$ is a set of real intervals.
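For no constrained objects, equations (1) and (2) translate directly into code. A minimal sketch follows (ours, with hypothetical representations: a set for a qualitative or discrete variable, a list of intervals for a continuous one).

```python
# Illustrative sketch of the description potential pi(a) of eqs. (1)-(2)
# for a no constrained Boolean symbolic object a = [A_1, ..., A_p].
def mu(A_i):
    if isinstance(A_i, (set, frozenset)):
        return len(A_i)                           # cardinal(A_i)
    return sum(abs(hi - lo) for lo, hi in A_i)    # range(A_i): sum of lengths

def description_potential(a):
    prod = 1.0
    for A_i in a:
        prod *= mu(A_i)
    return prod

# a = [colour in {red, blue}] ^ [height in [0, 15]]  ->  pi(a) = 2 * 15 = 30
print(description_potential([{"red", "blue"}, [(0.0, 15.0)]]))
```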
Proposition 1. If $\{a_1, \ldots, a_n\}$ is a set of Boolean symbolic objects, where $a_j = \wedge_{i=1}^{p} [y_i \in A_{ij}]$ with $j \in \{1, \ldots, n\}$, then $\pi(a_1 \vee \cdots \vee a_n) = \sum_{j=1}^{n} \pi(a_j) - \sum_{j<k} \pi(a_j \wedge a_k) + \sum_{j<k<l} \pi(a_j \wedge a_k \wedge a_l) + \cdots + (-1)^{n-1} \pi(a_1 \wedge \cdots \wedge a_n)$.
2.4.2 Constrained Boolean symbolic objects.
Suppose now that there are logical dependencies between the variables in the knowledge base. Let $a = \wedge_{i=1}^{p} [y_i \in A_i]$ be a constrained Boolean symbolic object, where $NA \in A_i$ if $y_i$ may become inapplicable, and let $\{r_1, \ldots, r_t\}$ be the set of rules expressing the dependencies between the variables. The description potential of $a$ will now be calculated as the difference between the volume of the Cartesian product $A_1 \times \cdots \times A_p$ and the part of this volume which is formed by the individual descriptions which are not coherent, i.e.,
$\pi(a) = \prod_{i=1}^{p} \mu(A_i) - \pi(a \wedge (\neg(r_1 \wedge \cdots \wedge r_t)))$ (3)
We have
$\pi(a \wedge (\neg(r_1 \wedge \cdots \wedge r_t))) = \pi((a \wedge \neg r_1) \vee \cdots \vee (a \wedge \neg r_t))$
and therefore, according to Proposition 1,
$\pi(a \wedge (\neg(r_1 \wedge \cdots \wedge r_t))) = \sum_{j=1}^{t} \pi(a \wedge \neg r_j) - \sum_{j<k} \pi((a \wedge \neg r_j) \wedge \neg r_k) + \cdots + (-1)^{t-1} \pi((((a \wedge \neg r_1) \wedge \neg r_2) \wedge \cdots) \wedge \neg r_t)$ (4)
The complexity of the calculation of the description potential of a constrained Boolean symbolic object is exponential in the number of rules and linear in the number of variables of each connected graph of dependencies.

3. Extension based proximity indices.

Proximity measures play an important role in exploratory data analysis. Thus, clustering and ordination may take as input data a matrix of proximities between objects. Usually, to calculate the proximity between a pair of objects, we use a comparison function, in order to compare the objects at the level of each variable, and an aggregation function, in order to aggregate these comparisons into a global measure of proximity. The Minkowski metric is a classical example of a proximity measure which uses a comparison and an aggregation function.
In symbolic data analysis, approaches which use a comparison and an aggregation function to calculate the proximities between Boolean symbolic objects have been proposed. Some approaches concerning the measurement of the proximity between no constrained Boolean symbolic objects can be found in Ichino and Yaguchi (1994) and Gowda and Diday (1991). A first approach to calculate the proximity between constrained Boolean symbolic objects can be found in De Carvalho (1994).
In this paper, we present two extension based approaches to calculate the proximity between no constrained and constrained Boolean symbolic objects. Both approaches use only a comparison function, based on the description potential of the join or union and the conjunction of these objects. The description potential of a Boolean symbolic object is related to its extension.
Let $a = \wedge_{i=1}^{p} [y_i \in A_i]$ and $b = \wedge_{i=1}^{p} [y_i \in B_i]$, and let $y = \{y_1, \ldots, y_p\}$. We define $a \wedge_{y'} b = \wedge_{y_i \in y'} [y_i \in A_i \cap B_i]$, where $y' = \{y_i \in y / A_i \cap B_i \neq \emptyset\}$.
The first type of comparison function is inspired by the comparison function proposed by Ichino and Yaguchi (1994):
$d_1(a, b) = \pi(a \mathbin{op} b) - \pi(a \wedge b) + \gamma(2\pi(a \wedge b) - \pi(a) - \pi(b))$ (5)
where $op$ is the join operator or the union operator and $0 \leq \gamma \leq 0.5$.
Two normalised versions of equation 5 are:
$d_2(a, b) = \frac{\pi(a \mathbin{op} b) - \pi(a \wedge b) + \gamma(2\pi(a \wedge b) - \pi(a) - \pi(b))}{\pi(a^\Omega)}$ (6)
$d_3(a, b) = \frac{\pi(a \mathbin{op} b) - \pi(a \wedge b) + \gamma(2\pi(a \wedge b) - \pi(a) - \pi(b))}{\pi(a \mathbin{op} b)}$ (7)
where $0 < \gamma \leq 0.5$ and $a^\Omega$ denotes the Boolean symbolic object describing the whole description space.
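Given the description potentials, the three indices are direct to compute; a sketch follows (ours; the pi_* arguments are assumed to be precomputed, e.g. with the description_potential() sketch above).

```python
# Illustrative sketch of eqs. (5)-(7): pi_ab_op is pi(a op b) for op = join
# or union, pi_ab is pi(a ^ b), and pi_omega is pi(a^Omega), the potential
# of the object describing the whole description space.
def d1(pi_a, pi_b, pi_ab, pi_ab_op, gamma=0.5):
    return pi_ab_op - pi_ab + gamma * (2 * pi_ab - pi_a - pi_b)

def d2(pi_a, pi_b, pi_ab, pi_ab_op, pi_omega, gamma=0.5):
    return d1(pi_a, pi_b, pi_ab, pi_ab_op, gamma) / pi_omega

def d3(pi_a, pi_b, pi_ab, pi_ab_op, gamma=0.5):
    return d1(pi_a, pi_b, pi_ab, pi_ab_op, gamma) / pi_ab_op
```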

Proposition 2. Let $(a, b)$ be a pair of no constrained Boolean symbolic objects.

a) $d_1$ and $d_2$ are semi-metrics, i.e., the triangular inequality does not hold, and $d_1 \sim d_2$; $d_2$ is normalised between 0 and 1;

b) $d_3$ is a metric and it is normalised between 0 and 1;

c) the quasi-order defined as $(a, b) \preceq_d (c, d) \Leftrightarrow d_1(a, b) \leq d_1(c, d)$ is not affected by a change of scale of the continuous variables by a linear transformation;

d) $d_2$ and $d_3$ are not affected by a change of scale of the continuous variables by a linear transformation.
The second type of comparison functions is based on the proximity indices related to arrays of binary variables:
$d_4(a, b) = \frac{\pi(a \mathbin{op} b) - \pi(a \wedge_{y'} b)}{\pi(a \mathbin{op} b) + \gamma\, \pi(a \wedge_{y'} b)}$ (8)
$d_5(a, b) = \frac{1}{2}\left[\frac{\pi(a \mathbin{op} b) - \pi(a)}{\pi(a \mathbin{op} b) + \pi(a \wedge_{y'} b) - \pi(a)} + \frac{\pi(a \mathbin{op} b) - \pi(b)}{\pi(a \mathbin{op} b) + \pi(a \wedge_{y'} b) - \pi(b)}\right]$ (9)
$d_6(a, b) = 1 - \frac{\pi(a \wedge_{y'} b)}{\sqrt{(\pi(a \mathbin{op} b) + \pi(a \wedge_{y'} b) - \pi(a))(\pi(a \mathbin{op} b) + \pi(a \wedge_{y'} b) - \pi(b))}}$ (10)
where $op$ is again the join operator or the union operator and $\gamma \in \{-0.5, 0, 1\}$.
These comparison functions are normalised between 0 and 1 and they are valid only for discrete variables, because in the case of continuous variables it is possible to have $\pi(a \mathbin{op} b) < \pi(a \wedge_{y'} b)$. In the case of nominal qualitative variables, there is no difference between the join and union operators and we may use either of them in equations 8 to 10.

Proposition 3. Let $(a, b)$ be a pair of no constrained Boolean symbolic objects.

a) $d_4$ is a metric if $\gamma \in \{-0.5, 0\}$;

b) $d_4$ ($\gamma = 1$), $d_5$ and $d_6$ are semi-metrics;

c) $d_4$, $d_5$ and $d_6$ are not affected by a change of scale of the continuous variables by a linear transformation.

4. Case studies.
We present two examples in order to illustrate the usefulness of our dissimilarity indices.

4.1 Fats and Oils.

The chemical data base presented by Ichino and Yaguchi (1994) describes 8 items of fats and oils by 5 variables (four continuous, which take intervals as values, and one discrete, which takes finite sets as values). Each item is described by a no constrained Boolean symbolic object. It is known that each of the item pairs (linseed oil, perilla oil), (cottonseed oil, sesame oil), (camellia oil, olive oil) and (beef tallow, hog fat) has similar properties. Ichino and Yaguchi studied this data base by using one of their proximity measures, denoted $d_I$ below.
This data base, which includes continuous variables, cannot be studied by the comparison functions of equations 8, 9 and 10. Fig. 4 shows the dendrograms obtained by the complete linkage method using $d_I$ (the Ichino and Yaguchi function, Fig. 4a to Fig. 4d), $d_2$ (equation 6, Fig. 4e to Fig. 4h) and $d_3$ (equation 7, Fig. 4i to Fig. 4l). It is not necessary to show the results obtained by $d_1$ (equation 5), because this proximity measure is equivalent to $d_2$ (see Proposition 2).
It seems that the parameter $\gamma$ has no influence on the proximity indices $d_2$ and $d_3$. For $\gamma$ fixed in $d_2$, $d_3$ and $d_I$, the join operator furnishes better results than the union operator. This is because, in the case of continuous variables, position is important. With the join operator, the proximity measure $d_3$ is able to obtain the five groups of fats and oils indicated by experts (Fig. 4l). This is not the case with the Ichino and Yaguchi index (Fig. 4d).
4.2 Freshwater insect orders.
The biological knowledge base (Vignes, 1991) concerns freshwater insect orders. It includes items described by 12 nominal qualitative variables (each of which takes finite sets as values), where the number of modalities is between 2 and 6, and there are 3 different pairs of variables presenting conditional dependencies. Both types of comparison functions (equations 5 to 10) may be used to study this knowledge base.
Fig. 5 shows the dendrograms obtained by the complete linkage method using $d_3$ ($\gamma = 0.5$) as the proximity measure (a) under the hypothesis of logical independence between the variables (each insect order is described by a no constrained Boolean symbolic object) and (b) under the hypothesis of conditional dependencies between the variables (each insect order is described by a constrained Boolean symbolic object). In this figure, I means imago, L means larva, N means naiad and P means pupa.
In both cases all the larva orders are grouped together, but only under the hypothesis of conditional dependencies are the naiad and imago orders in the same subgroup. It seems that we obtain better results when we describe each item of this knowledge base by a constrained Boolean symbolic object.

5. Conclusions.
Two approaches to calculate proximities (dissimilarities) between constrained Boolean symbolic objects have been presented. These approaches use comparison functions which are based on the description potential of the join or union and the conjunction of the objects, and are inspired by the Ichino and Yaguchi (1994) functions and by the proximity indices for binary variables. Classical properties concerning proximity indices, such as type, equivalence and metric properties, are presented. Simulations with a chemical data base and with a biological knowledge base seem to corroborate the approaches.

Fig. 4: Dendrograms by the complete linkage method (Fats and Oils).

Fig. 5: Dendrograms by the complete linkage method (Freshwater insect orders): (a) no conditional dependence; (b) conditional dependencies between variables.

References:

De Carvalho, F.A.T. (1994): Proximity coefficients between Boolean symbolic objects. In New Approaches in Classification and Data Analysis, Diday, E. et al. (eds.), 387-394, Springer-Verlag, Heidelberg.
Diday, E. (1991): Des objets de l'analyse de données à ceux de l'analyse de connaissances. In Induction symbolique et numérique à partir de données, Diday, E. and Kodratoff, Y. (eds.), 9-75, Cepadues Editions, Toulouse.
Gowda, K.C. and Diday, E. (1991): Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24, 6, 567-578.
Ichino, M. and Yaguchi, H. (1994): Generalised Minkowski metrics for mixed feature type data analysis. IEEE Transactions on Systems, Man and Cybernetics, 24, 4, 698-708.
Vignes, R. (1991): Caractérisation automatique de groupes biologiques. Thèse de Doctorat, Université Paris VI Pierre et Marie Curie, Paris.
Towards A Normal Symbolic Form
Marc Csernel¹, Francisco de A.T. de Carvalho²
¹ Université Paris Dauphine, 75016 Paris, France, and
INRIA-Rocquencourt
Domaine de Voluceau
Le Chesnay Cedex, France
E-mail: [email protected]
² Departamento de Estatística - CCEN / UFPE
Av. Luiz Freire, s/n - Cidade Universitária
50.740-540 - Recife - PE, BRASIL
E-mail: [email protected], Fax: ++55 +81 2710359

Summary: Boolean Symbolic Objects were introduced by Diday (1988) and since that time a large number of applications have been defined using these objects, but relatively few of them take constraints on the variables into account. Even in this case, when the graph of dependencies becomes too large, the computational time becomes huge because dependencies are treated in a combinatorial way. We present a method inspired by the technique used in relational data bases (Codd 1972), leading to a decomposition of symbolic objects into a Normal Symbolic Form which allows an easier calculation, however large the graph of dependency rules may be. We will apply our method to distance computation following a method due to De Carvalho and inspired by Ichino (1994), but the normal form we present in this paper could be used for other purposes. In our first trials we obtained a 90% reduction of the computational time. In the present text we will only deal with nominal boolean Symbolic Objects, but the method could be used with other kinds of symbolic objects.

1. Introduction
Constrained Boolean Symbolic Objects defined by Diday (1991) are better adapted
than usual objects of data analysis to describe classes of individuals such as popu-
lations or species, being able to take into account variability. They are expressed
by a logical disjunction of elements called elementary events, and each of these ele-
mentary events represents a set of values associated with a variable. Each boolean
symbolic object describes a volume which is a subset of the Cartesian product of the
description variable.
A symbolic object can be constrained by different kinds of rules which express
logical dependencies between the variables. These rules reduce the description space
described by symbolic objects, they interfere greatly on computation of distance
between them.
We shall use for distance computation a comparison function based on the description potential (De Carvalho (1994)) of each object. We define the description potential as the part of the volume described by a symbolic object which is coherent, i.e. where all the values satisfy all the dependency rules.
Until now, the different methods used to compute distances between symbolic objects took rules into account by computing the incoherent part of each object or each computed element. This computation can become huge when the dependency graph is deep, and has to be repeated for each pair of objects each time a different distance index is chosen.
To avoid this kind of problem, we were induced to propose a representation of symbolic objects where only the coherent part of an object is represented. We recall

that a boolean symbolic object (if no dependence rule applies) describes a subset of a Cartesian product; this is just the definition of a relation.
People dealing with data bases have long been familiar with relations: they use a relational model. E. Codd introduced some normal forms to structure the data base relational schema more efficiently, particularly the third one, which concerns the case where functional dependencies exist between the variables. Normal forms are used in relational data bases to offer a better factorization of data, thus providing a simpler and easier way to update data.
All this has induced us to introduce a normalization of boolean symbolic objects, inspired by Codd's third normal form, which allows a representation of only the coherent part of a symbolic object. By reference to the relational normalization we call it the Normal Symbolic Form (NSF).
We shall give in Section 2 a mathematical definition of boolean symbolic objects and examine the different possible kinds of dependency rules. In Section 3 we shall see the influence of rules on distance computation. In Section 4 we shall see the definition of Codd's third normal form, the definition of the NSF, and an example of the decomposition it induces. In Section 5 we shall examine the principle of the decomposition process, and in Section 6 we shall see how to use an NSF description of symbolic objects to perform some usual calculations needed by distance computation; we then conclude in Section 7.
2. Constrained Boolean Symbolic Objects
Let Ω be a set of elementary objects, generally called "individuals", described by p variables y_i where i ∈ {1..p}. Let O_i be the set of observations where the variable y_i takes its values. The set O = O_1 × O_2 × ... × O_p is then called the description space.
An elementary event e_i, denoted by the symbolic expression

e_i = [y_i = V_i]

where i ∈ {1..p} and V_i ⊆ O_i, expresses that "the variable y_i takes its values in V_i".
A Boolean symbolic object a is a conjunction of elementary events of the form:

a = ∧_{i ∈ {1..p}} e_i = ∧_{i ∈ {1..p}} [y_i = V_i]

where the different V_i ⊆ O_i represent the intension of a set of individuals C ⊂ Ω. It can be interpreted as follows: the values of y_i for any individual c ∈ C are in V_i.

Example: a = [colour = {blue, red}] ∧ [size = {small, medium}]

means that the colour of a is red or blue and the size small or medium.
We define the symbolic union, denoted ∪_s, by:

a_1 ∪_s a_2 = ∧_{i ∈ {1..p}} (e_{i1} ∪ e_{i2}) with e_{i1} ∪ e_{i2} = [y_i = V_{i1} ∪ V_{i2}]
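As an illustration, here is a minimal Python sketch, not taken from the paper (the names a1, a2 and symbolic_union are ours), modelling a nominal boolean symbolic object as a dictionary mapping each variable y_i to its value set V_i, with the symbolic union computed variable-wise:

    # Boolean symbolic objects over nominal variables, as plain dicts of sets.
    a1 = {"colour": {"blue", "red"}, "size": {"small", "medium"}}
    a2 = {"colour": {"green", "red"}, "size": {"small"}}

    def symbolic_union(a, b):
        """Variable-wise union: e_i1 U e_i2 = [y_i = V_i1 U V_i2]."""
        assert a.keys() == b.keys()
        return {y: a[y] | b[y] for y in a}

    print(symbolic_union(a1, a2))
    # -> {'colour': {'blue', 'green', 'red'}, 'size': {'medium', 'small'}}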


Constrained boolean symbolic objects are usual symbolic objects associated with a Domain-Knowledge which constrains the description given by the objects. Such constraints are expressed as logical dependencies between the variables. We take into account two kinds of dependencies. The first one, called conditional dependence, is of the form if A = a1 then B has No Sense, as in:
if wings = absent then wings_colour = No_Sense. (r1)
The other form, called logical correlation dependence, is of the form if A = a1 then B = b1, as in:
if wings_colour = red then Thorax_colour = blue. (r2)
if Thorax_colour = blue then Thorax_size = big. (r3)
The term No_Sense means that the variable does not exist, hence its value is not applicable.
The difference between these two kinds of rules appears at different levels:
- In case of single dependencies:
• the conditional dependence rule reduces the description space, the logical correlation does not.
• the No_Sense value can be taken only when a conditional dependence rule is active.
- In case of multiple dependencies, the way each rule is propagated along the dependency graph is different, for example:
• If we consider r2 and r3 together, then wings_colour = red implies that Thorax_size = big. This is the usual propagation.
• If we consider r1 and r2 together, we can see that when wings = absent we can say nothing about wings_colour, which is not relevant, and by consequence we cannot consider any particular value of Thorax_colour. This is the rule relaxation.
• If we consider the following rules together:
if y1 = a1 then y3 = No_Sense; if y2 = b1 then y3 = c1;
the second rule implies that the variable y3 has a meaning, hence that the first rule premise is not true. The logical correlation is subordinated to the conditional dependence.
We shall call premise variable the variable associated with the premise and conclu-
sion variable the variable associated with the conclusion.
3. Influence of dependency rules on distance computation
Some approaches to calculating the proximity between constrained Boolean objects have been proposed, for instance by De Carvalho (1994). The use of dependency rules can be considered as creating "holes" in the Cartesian product of the variables, which must be taken into account for an accurate distance computation.
So far, existing algorithms have not solved this problem in a satisfactory way, especially when the constraints are dependent on one another. To accomplish distance computation we use a comparison measure based on the Description Potential of each object, i.e. the COHERENT part of the volume described by each object. We can see in the following example how a rule can interfere with the comparison between two objects. If we have the two following objects:
a1 = [y1 = [5,30]] ∧ [y2 = [10,30]]; a2 = [y1 = [15,30]] ∧ [y2 = [15,35]];
[Figure omitted: two panels plotting a1 and a2 in the (y1, y2) plane, without (left) and with (right) the dependence rule.]
Figure 1

The left side of Figure 1 shows a rectangle in solid line whose area represents the description potential of a1, and a rectangle in dotted line whose area represents the description potential of a2. While there is no dependence rule, the description potential is here equivalent to the Cartesian product given by the descriptions of a1 and a2 respectively.
In the right side of Figure 1 we represent the description potential of a1 and a2 considering the following dependence rule:
if y1 ∈ [5,15[ then y2 has No_Sense;
Examining Figure 1, it seems that a1 and a2 are more similar when the dependence rule is considered. We believe that the distance must take this evidence into account.
4. The Normal Symbolic Form
In a way, symbolic Boolean objects are very close to the relations used in data bases (i.e. a subset of a Cartesian product). People using relational data bases have long been familiar with the decomposition in normal forms introduced by Codd (1972). We will focus on the third normal form, which states roughly: "a relation is in third normal form if there are no functional dependencies between a non-key attribute S and other non-key attributes S_1, ..., S_n".
Attributes have the same meaning as variables, and relations are presented as arrays where each column represents an attribute (or variable). Each line represents a tuple of values (an individual) and each line can be identified by a key. A key is an attribute (or a set of attributes) which can be used to identify a tuple in a relation. Codd's definition of a functional dependency says that "an attribute Y is functionally dependent on an attribute X if each X-value is associated with precisely one Y-value". Usually it is necessary to decompose a relation if you want it to follow the third normal form.
The third normal form is used in relational data bases to offer a better factorization of data, thus providing a simpler and easier way to update data and a reduction of the space needed.

Considering Boolean symbolic objects as relations, when no dependency rules apply, suggests the possibility of a normal form.
We were induced to think that it can be useful to split an array of symbolic objects in two parts when dependencies exist between some variables.
If some logical correlation exists between a premise variable X and some conclusion variables Y_1, Y_2, ..., Y_n, we will then have to split, as for Codd's normal form, our original table in two, the second table containing all values associated with X, Y_1, Y_2, ..., Y_n. We will refer to the different tables as original, main and secondary.
We will refer the different tables, as original, main and secondary.
Simultaneously, we will represent the premise variable X in a flat way. By 'flat way' we mean that the values corresponding to a premise must appear one by one and not set by set, as occurs in the usual representation. This allows premise and conclusion values to be represented independently, and allows the inconsistent values not to be represented. We will refer to the elements of secondary tables by their line number.
Remark that, for each object, we will have to refer to as many lines in the secondary table as there were values in the premise variable of the initial table.
We define the Normal Symbolic Form (NSF) as follows: "if no dependency occurs between the variables, or if the dependencies occur between the first variable V1 and the others, then V1 must have a single value". One can remark that the variable order has no importance for the description, but it is more convenient to have the premise variable as the first one.
Most of the time a symbolic object has to be decomposed to follow the Normal Symbolic Form (NSF), as we can see in the following example.

        wings               wings_colour   Thorax_colour    Thorax_size
  a1    {present,absent}    {blue,red}     {blue,yellow}    {big,small}
  a2    {present,absent}    {green,red}    {blue,red}       {small}

Table 1: original table
The previous array represents two boolean symbolic objects called a1 and a2; the dependency rules r1 and r2 are associated with the definition.
if wings = absent then wings_colour = No_Sense. (r1)
if wings_colour = red then Thorax_colour = blue. (r2)
The description of the objects a1 and a2, representing two different (imaginary) insect species, is obviously not NSF, because the description of wings in a1 has two values and there is a dependency between wings and wings_colour (r1). There is also a dependency between wings_colour and Thorax_colour (r2), and wings_colour is not the first variable. The description then has to be transformed into the sequence of the three following tables to be NSF. In these tables the upper left corner contains the table name; a new kind of column appears whose values are integers referring to a line in another table with the same name as the column. The first table has no name; it refers to the initial table.

        wings   Thorax_size
  a1    {1,3}   {big,small}
  a2    {2,4}   {small}

  main table

        wings     wings_colour
  1     absent    4
  2     absent    5
  3     present   {1,2}
  4     present   {1,3}

  secondary table 1

        wings_colour   Thorax_colour
  1     red            {blue}
  2     blue           {blue,yellow}
  3     green          {blue,red}
  4     NS             {blue,yellow}
  5     NS             {blue,red}

  secondary table 2

We have now three tables instead of a single one, but only the valid parts of the
objects are represented: now, the tables include the rules.
5. How to Transform a Set of Descriptions into NSF Form
The aim of the NSF is to provide a description of a symbolic object where only the part of the object satisfying the dependency rules is represented, so that the description potential of the object can be calculated directly.
As mentioned before, we will have to split the original table into a main table and some secondary ones. We will follow the dependency graph to carry out this task. Generally, each premise variable will generate a new table containing the premise variable and all the conclusion variables depending on the premise.
First, we must make precise that we only consider the case where the graph between the variables induced by the dependencies forms a tree or a set of trees, i.e. no variable can be the conclusion of more than one premise variable.
The transformation process can be decomposed into two different phases:
1) definition of the new tables; 2) filling of the new tables.

The secondary table definitions follow the variable dependency graph. For each non-terminal node N of the graph, we build a new table T_V composed of the variable V associated with N as the first variable, and the variables associated with each of the sons of N as the other variables. The variable V will be replaced in its original table by a reference variable R_V which will contain line numbers of the new table.
The table filling is a little more complicated and the lack of space does not allow us to describe it in full detail. It is decomposed into two processes: the first one consists of a construction process, the second of a factorization process.
The first process induces a table growth, the second one a table reduction. With the real examples we processed, the reduction factor was more important than the growth factor, and we obtained a reduction in size of the secondary tables greater than 30%.
For convenience we present the construction process in algorithmic form.

    for each symbolic object
    {  for each variable V
          if (V is not a premise nor a conclusion)
             put the value of V in the main table;
          else if (V is a premise)
             put the references provided by GetRef(V)
    }

    GetRef(V)
    {  for each value Val of V          // the premise variable
       {  // build a new line in T_V
          for each other variable Vc in T_V
          {  restrict the values of Vc according to Val and the rule
             if (Vc is not a premise)
                put the corresponding values in T_V
             else
                put the references provided by GetRef(Vc)
          }
       }
       return the list of lines built
    }
Once the construction is done, we need to factorize. For each newly built line L, if a line L' = L is already present in the table, we change the reference into a reference to L' and L is suppressed.
6. An application to distance computation
At present, distance computation algorithms must generate (in the worst case) all possible combinations of variables, and then verify which ones are valid. In that case M^p combinations must be generated (M is the average number of modalities, p the number of variables) and verified. The NSF avoids this huge amount of verification.
Our distance measure uses, for the comparison process of the distance calculus, the
volume of the description potential of the union of two objects. For nominal symbolic
objects we will use a distance due to De Carvalho (to be published) inspired by Ichino
(1994).

d(a1, a2) = [potential(a1 ∪_s a2) − 0.5 (potential(a1) + potential(a2))] / potential(a1 ∪_s a2)

We will first show how to compute, using an NSF representation, the potential of a symbolic object and, second, how to compute the potential of a symbolic union. We will illustrate the method using the two objects a1, a2 described in our previous example.

6.1 Computing the potential of an object


Each line of a secondary table describes a coherent hyper-volume, and all the lines
contributing to the representation of the same object describe hyper-volumes which
do not intersect (by construction). So one has to sum up the potential described by
each line of a secondary table.
In the following example, the potential of each line of the secondary table 2 has to be computed first, then that of the lines of the secondary table 1, and at last the potential of each object described in the main table.
For example, line 3 of the secondary table 1 refers to lines 1 and 2 of the secondary table 2. Its potential is the sum of the potentials described by these two lines: 1 + 2 = 3. In the same way, the potential of a1 is obtained by multiplying the sum of the potentials of lines 1 and 3 of the secondary table 1 (2 + 3) by the potential due to the variable Thorax_size (2), giving the result 10.
        wings   Thorax_size   pot
  a1    {1,3}   {big,small}   10
  a2    {2,4}   {small}       5

  main table

        wings     wings_colour   pot
  1     absent    4              2
  2     absent    5              2
  3     present   {1,2}          3
  4     present   {1,3}          3

  secondary table 1

        wings_colour   Thorax_colour   pot
  1     red            {blue}          1
  2     blue           {blue,yellow}   2
  3     green          {blue,red}      2
  4     NS             {blue,yellow}   2
  5     NS             {blue,red}      2

  secondary table 2
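Continuing the Python sketch above (same hypothetical tables, TREE and main_a1), the recursive potential computation can be written as follows; line_potential and object_potential are our names:

    def line_potential(table_name, lineno):
        """Potential of one table line: product over its cells; a reference
        cell contributes the sum of the potentials of the lines it points to."""
        val, *cells = tables[table_name][lineno - 1]
        pot = 1
        for cell in cells:
            if isinstance(cell, tuple):          # references to a deeper table
                child = TREE[table_name][0]      # single conclusion assumed
                pot *= sum(line_potential(child, r) for r in cell)
            else:
                pot *= len(cell)                 # plain value set
        return pot

    def object_potential(refs, free_value_sets):
        """Sum over the object's premise lines, times its free variables."""
        pot = sum(line_potential("wings", r) for r in refs)  # root table assumed
        for s in free_value_sets:
            pot *= len(s)
        return pot

    refs_a1, thorax_size_a1 = main_a1
    print(object_potential(refs_a1, [thorax_size_a1]))   # -> 10, as above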

6.2 Computing the potential of the symbolic union

The computation of the potential of the union of a1 and a2 must be split into two parts: the first one concerning the lines where the premise is verified, the second concerning the lines where the premise is not verified.
For each part one needs to compute the union of two lines l1 and l2, l1 participating in the description of a1, l2 participating in the description of a2. These values must be multiplied by the number of lines of the part participating in the union, and we obtain the potential related to the part. The potential of the union is obtained by summing up the potentials computed for each part.
We will denote by potU1(1,2) the potential of the union of lines 1 and 2 of the secondary table 1.
We will now show, on the previous example, how to compute
potential(a1 ∪_s a2) = pot({big,small}) × potU1({1,3},{2,4})
For the secondary table 1:
potU1({1,3},{2,4}) = potU1(1,2) (premise verified) + potU1(3,4) (premise not verified)
potU1(1,2) = 1 × potU2(4,5)    potU1(3,4) = 1 × potU2({1,2},{1,3})
For the secondary table 2:
potU2({1,2},{1,3}) = potU2(1,1) (premise verified) + potU2(2,3) (premise not verified)
potU2(4,5) = pot({blue,red,yellow}) = 3    potU2(1,1) = 1
potU2(2,3) = pot({blue,green}) × pot({blue,red,yellow}) = 2 × 3 = 6
This is expressed by the following tables:
        wings     wings_colour   pot(U)
  1     absent    4              } potU1(1,2) = 1 × 3 = 3
  2     absent    5              }
  3     present   {1,2}          } potU1(3,4) = 1 + 6 = 7
  4     present   {1,3}          }

  secondary table 1

        wings_colour   Thorax_colour   pot
  1     red            {blue}          1
  2     blue           {blue,yellow}   2
  3     green          {blue,red}      2
  4     NS             {blue,yellow}   2
  5     NS             {blue,red}      2

  secondary table 2
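As a quick check of the value potU2(4,5) above, a two-line sketch in the same Python setting (the variable names are ours):

    # Union potential of two plain lines: size of the cell-wise union.
    line4 = frozenset({"blue", "yellow"})    # Thorax_colour of line 4
    line5 = frozenset({"blue", "red"})       # Thorax_colour of line 5
    print(len(line4 | line5))                # -> 3 = pot({blue,red,yellow})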

7. Conclusion
The decomposition of symbolic objects following the Normal Symbolic Form provides the easiest way to take dependency rules into account, as shown by our first application to distance computation. Including the construction of the NSF, we obtained in our first trials a reduction of about 90% of the computational time. This encourages us to carry on our work.
We need to test it with a larger set of examples, to better estimate the improvement it can provide. This will lead us to take a more formal approach to the complexity of the different computation phases needed in distance processing with and without the NSF. Because the NSF is a normal form, we hope it will induce in the future a better and easier interfacing of large sets of symbolic objects with data bases.
References:
Codd, E.F. (1972): Further Normalization of the Data Base Relational Model. In: Data Base Systems, Courant Computer Science Symposia Series, Vol. 6, Prentice-Hall, Englewood Cliffs, N.J.
De Carvalho, F.A.T. (1994): Proximity Coefficients between Boolean symbolic objects. In: E. Diday et al. (eds.): New Approaches in Classification and Data Analysis. Springer-Verlag, 387-394.
De Carvalho, F.A.T. (to be published): Extension based proximities between Boolean symbolic objects. In: Proceedings of the Fifth Conference of the International Federation of Classification Societies.
Diday, E. (1991): Des objets de l'analyse de données à ceux de l'analyse de connaissances. In: Y. Kodratoff and E. Diday (eds.): Induction symbolique et numérique à partir de données. Cepadues Editions, Toulouse, 9-75.
Diday, E. (1988): The Symbolic Approach in Clustering. In: H.H. Bock (ed.): Classification and Related Methods of Data Analysis. North-Holland, 673-683.
Ichino, M. and Yaguchi, H. (1994): Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Transactions on Systems, Man, and Cybernetics, 24, 4, 698-708.
The SODAS Project: a Software for Symbolic Data Analysis

Georges Hebrail

ELECTRICITE DE FRANCE - Research Center


1, Av. du General de Gaulle
92141 CLAMART CEDEX - FRANCE
E-mail: [email protected]

Summary: This paper presents an ESPRIT European project, whose goal is to develop a
prototype software for symbolic data analysis. Symbolic data analysis is an extension of
standard methods of data analysis (such as clustering, discrimination, or factorial
analysis) to more complex data structures, called symbolic objects. After a short
presentation of the model of symbolic objects, the different parts of the software are
briefly described.

1. Introduction
Standard statistical data analysis methods, such as clustering, discrimination, or factorial
analysis, apply to data which are basically structured as arrays. Each row represents an
individual and each column for a particular row contains a single value, which is the
value of a variable describing individuals. In many real world applications, for some
variables, an individual may be described by sets of values, intervals of values, or
probability distributions of values. Moreover, some a priori knowledge of the user may
be associated with data, such as taxonomies in variable domains.
More complex data structures - called symbolic objects - have been proposed by Pr Diday in the last decade (see Diday (1991)). These data structures capture the complexity described above, but remain manageable with regard to the computations performed in statistical data analysis methods. Beyond these data structures, some extensions of standard methods have been studied and evaluated to apply to symbolic objects. The extensions include clustering, discrimination, and factorial analysis.
But these new methods remain difficult to use in real applications for two main reasons (see Hebrail (1995)): there is no available software to do so (there are only disparate pieces of software in various universities), and it is difficult to manage data objects with a more complex structure than simple arrays. The goal of the SODAS project is to develop a prototype software to solve these problems and make these methods available to more users.
The SODAS project (for Symbolic Official Data Analysis System) is a European project within the DOSIS Programme (Development of Statistical Information Systems),
organized by EUROSTAT, the Statistical Office of the European Communities in
Luxembourg. It gathers several partners, including national official statistics offices,
industrials and universities; various European countries are represented. An important
part of the project is devoted to benchmarks of real world applications, which will be


used to specify and test the software. These benchmarks are mainly provided by the
national official statistics offices involved in the project.
In this communication, after a short presentation of the model of symbolic objects, we describe the main contents of this project, and especially the different parts of the software.

2. Symbolic objects
As mentioned before, standard methods of statistical data analysis accept as their input
INDIVIDUALS by VARIABLES arrays. Each cell of such arrays contains the value
taken by an individual for a variable. This value is said to be atomic in the sense that it is
not a list or a set of values. For instance, if individuals are people, and if variables are
AGE and SOCIO-PROFESSIONAL CATEGORY (SPC), the AGE cell for a person
contains one value (the age of the person) and the SPC cell contains one value (the SPC
of the person).
Symbolic objects introduced by Pr Diday extend the classical data structure to
INDIVIDUALS by VARIABLES arrays where the value taken by an individual on a
variable may be non-atomic: possibly a set of values, intervals of values, or a probability
distribution. For instance, if individuals represent groups of people, and if variables are
still AGE and SPC, a cell of this new array may contain, for each individual (i.e. each
group of people), the interval of people age values in the group for the AGE variable, and
the list of SPC of people of the group for the SPC variable.
We recall below, in an informal way, the basic data structures defined by Pr Diday. Additional structures have already been defined (see Diday (1991)), but are not presented here. The benchmarks of the project will be used to define the final list of data structures
supported by the software.

Boolean elementary events


Boolean elementary events are expressions of the form:
SPC = { Worker, Employee}
meaning that SPC takes one of the two values Worker or Employee.
AGE = { [25,27], [48,55] }
meaning that AGE is between 25 and 27, or between 48 and 55.

Probabilistic elementary events


Probabilistic elementary events are expressions of the form:
SPC = { Worker(0.8), Employee(0.1), Farmer(0.1) }
meaning that SPC is Worker in 80% of cases, Employee in 10% of cases, or
Farmer in 10% of cases.
AGE = { [26,30] (0.7), [31,35] (0.1), [36,40] (0.1), [41,45] (0.1) }
describing a probability distribution of ages on a sub-population (for instance a
district).

Assertions
Assertions are conjunctions of boolean and/or probabilistic elementary events, for
instance:
Group 125 = [AGE = { 34, 29, 2, 1 }] ∧ [SPC = { Employee, Worker }]
District 92 = [AGE = { [25,30](0.2), [31,35](0.23), ... }]
∧ [SPC = { Executive manager(0.6), Worker(0.2), ... }]
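For concreteness, here is a minimal Python sketch, not the SODAS data model, of how such assertions could be encoded; the names group_125 and district_92 mirror the examples above, and the tuple-keyed dict for interval probabilities is our own convention:

    # Boolean events as value sets, probabilistic events as value -> weight.
    group_125 = {
        "AGE": {34, 29, 2, 1},                              # boolean event
        "SPC": {"Employee", "Worker"},
    }
    district_92 = {
        "AGE": {(25, 30): 0.2, (31, 35): 0.23},             # probabilistic
        "SPC": {"Executive manager": 0.6, "Worker": 0.2},
    }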

The model of symbolic objects also enables the user to associate with the data some a
priori knowledge (i.e. metadata), which is then used by methods applied to symbolic
objects. This a priori knowledge can be defined by different means. We present below
three ways to do so: mother-daughter variables, rules, and taxonomies in variable
domains.

Mother-daughter variables
Mother-daughter variables offer the possibility of defining variables which are not applicable to all people, but only to people verifying some properties. For instance:
SPC is applicable only if AGE > 18

Expression of a priori knowledge with rules


Rules offer the possibility of defining a priori knowledge, in the form of a restriction of the possible combinations of values for the different variables, for instance:
If AGE > 60 Then SPC = Retired
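Such a rule can be read as a coherence predicate over the description space; a one-line Python sketch (the function name coherent is ours):

    def coherent(person):
        """A description violates the rule if AGE > 60 and SPC != Retired."""
        return not (person["AGE"] > 60 and person["SPC"] != "Retired")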

Variables with taxonomic domain


Variables with taxonomic domain offer the possibility of defining a taxonomy within the values taken by a variable. Figure 1 gives an example of such a taxonomy.

Figure 1: Example of a taxonomic variable

[Figure omitted: the AGE domain splits into the intervals [0,10], (11,15], (16,18] and an Adult branch, itself divided into Young ((18,34]), Middle-age ([35,45], [46,55], [56,60]) and Retired ([60,70], [70,?]).]
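One plausible encoding of this taxonomy, sketched in Python as a nested dict (the structure and names are ours, not the SODAS language):

    # Internal nodes are named categories; leaves (None) are intervals.
    age_taxonomy = {
        "AGE": {
            "[0,10]": None, "(11,15]": None, "(16,18]": None,
            "Adult": {
                "Young": {"(18,34]": None},
                "Middle-age": {"[35,45]": None, "[46,55]": None,
                               "[56,60]": None},
                "Retired": {"[60,70]": None, "[70,?]": None},
            },
        },
    }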

As a summary, symbolic objects can represent individuals which are groups of elements of an underlying population, featuring variation within these groups. These objects can also describe uncertainty on data (with probabilistic objects) and metadata (with mother/daughter variables, taxonomies in variable domains, and rules).

3. Construction, management and manipulation of symbolic objects


A symbolic object manager will be developed as the kernel of the system. A normalized language will be defined to describe symbolic objects, and a parser will enable the user to acquire symbolic objects from ASCII files. Once objects are acquired, the system will store them physically in an ad hoc structure and allow queries and updates to this symbolic object 'database'. Objects will also be easily accessible from programs by means of some basic access functions.
As in many applications data are stored in relational databases, an interface will be
developed to help the user to build symbolic objects from the contents of relational
databases. Several operators will enable the user to build assertions from an underlying
population stored in the database, as well as to extract a priori knowledge such as
taxonomies or mother/daughter variables when they are available in the database.

4. Methods of symbolic data analysis


Several methods will be developed to apply to symbolic objects. All these methods will
be extensions of standard methods to symbolic objects. There will be three classes of
methods in the system: clustering, discriminant analysis, and factorial analysis methods.

4.1 Clustering methods


Symbolic clustering methods build clusters which are represented by symbolic objects.
They can be applied either to standard data or to symbolic objects. In addition to that,
these methods provide the following features:
- the clustering criterion favours interpretation of clusters instead of a complete
optimization of a distance criterion.
- metadata associated with symbolic objects (mother/daughter variables,
taxonomies, ... ) are used in the clustering process.

The methods which will be developed include partitioning algorithms (see Chavent
(1995)) and hierarchical clustering (see Brito (1995)). Hierarchical clustering will
produce either disjoint or overlapping clusters (pyramids).

4.2 Discriminant analysis methods


Two standard methods of discriminant analysis will be extended to the case of
individuals defined by symbolic objects.
The first extension will concern decision tree construction. This extension will take into
account mother/daughter variables in addition to the possibility of discriminating
between individuals defined by boolean assertions (see Lebbe and Vignes (1991)).

The second extension is the extension of non-parametric Bayesian discriminant analysis


to symbolic objects defined by boolean and probabilistic assertions. In particular, it will be possible to discriminate between individuals described by variables which are probability distributions (see Granville and Rasson (1995)).

4.3 Factorial analysis


The method developed in the project will extend standard factorial analysis to individuals defined by symbolic objects featuring numerical variables of interval type. This method can be useful in the case of numerical data with uncertainty (see Chouakria et al. (1995)).

5. Link with standard data analysis


Two different approaches can be considered to use standard and symbolic data analysis jointly:
- the transformation of symbolic objects into classical data arrays, followed by the application of standard methods of data analysis,
- the use of the symbolic object formalism to describe groups of individuals found by standard methods.

These two approaches will be available in SODAS through the following features:
- an interface to call standard methods from SODAS,
- a tool for building disjunctive arrays of data from symbolic objects (a sketch follows this list),
- a tool for building distance matrices from symbolic objects (see Ichino (1994), Carvalho (1996)),
- a tool for attaching symbolic interpretations, expressed as symbolic objects, to the results of standard clustering and factor analysis methods (see Tong et al. (1996)).
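A hedged Python sketch of the first bridge above, turning boolean symbolic objects into a classical 0/1 disjunctive array; the objects, domain and column-naming scheme are hypothetical, not the SODAS interface:

    import pandas as pd

    objects = {
        "group_125": {"SPC": {"Employee", "Worker"}},
        "group_126": {"SPC": {"Farmer"}},
    }
    domain = {"SPC": ["Employee", "Farmer", "Worker"]}

    # One 0/1 column per (variable, value) pair of the domain.
    rows = {name: {f"SPC={v}": int(v in obj["SPC"]) for v in domain["SPC"]}
            for name, obj in objects.items()}
    print(pd.DataFrame.from_dict(rows, orient="index"))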

6. User interface
A large part of the project will be devoted to the consideration of user's needs. As a
working package of the project, a users' group will be created and animated.
This users' group will :
gather different benchmarks from national official statistics offices and from
industrial partners,
- list the symbolic object structures which are necessary in these benchmarks,
- check if the developed methods solve real problems and meet user's needs,
- test the software on real world applications.

From another point of view, the software will include a user-friendly interface to help the
end-user to visualize graphically symbolic objects.
Within the budget of this project, it will not be possible to develop a fancy homogenized
interface to all the methods. But some guidelines will be edited to homogenize interfaces
of different methods of the software.

Finally, a scientific reference manual will be edited for the software. This book will contain a unified presentation of symbolic objects and of the methods applicable to them in the SODAS prototype software.

7. Partners
The project gathers 18 partners from various European countries. While THOMSON-CSF is the pilot of the project, Pr. Diday will be the scientific manager. The CISIA French company will be responsible for the development of the kernel of the software.
For more information about this project and the distribution of tasks between partners,
see SODAS Project (1995).
The main partners of the project are: Thomson-CSF (France), Universite de Dauphine
(France), Facultes Universitaires Notre Dame de la Paix (Belgium), Instituto Nacional de
Estadistica (Portugal), and University of Athens (Greece).
The associated partners are: CISIA (France), Centre de recherche public STADE (Luxembourg), Central Statistical Office (England), Università degli Studi di Bari (Italy), Università Federico II - Napoli (Italy), Electricité de France - Research center (France), EUSTAT (Spain), INRIA (France), Universidade de Lisboa (Portugal), Institute for Statistics - RWTH (Germany), Service des études et de la statistique du ministère de la région wallonne (Belgium), and Universidad Complutense de Madrid (Spain).

8. References
Brito P. (1995): « Symbolic objects: order structure and pyramidal clustering », in Annals of Operations Research, N°55, pp. 277-297.
Carvalho F.A.T. (1996): « Extension based proximities between constrained boolean symbolic objects », in Proceedings of the IFCS'96 Conference, Kobe, March 96.
Chavent M. (1995): « Choix de base pour un partitionnement d'objets symboliques », in Actes des Rencontres de la Société Francophone de Classification (SFC-95), Namur, Sept. 95.
Chouakria A., Cazes P., Diday E. (1995): « Extension de l'analyse factorielle des correspondances multiples à des données de type intervalle et de type ensemble », in Actes des Rencontres de la Société Francophone de Classification (SFC-95), Namur, Sept. 95.
Diday E. (1991): « Des objets de l'analyse des données à ceux de l'analyse des connaissances », in Induction Symbolique et Numérique à partir de Données, Editeurs Y. Kodratoff et E. Diday, Cepadues-Editions.
Granville V., Rasson J.P. (1995): « Multivariate discriminant analysis and maximum penalized likelihood density estimation », Journal of the Royal Statistical Society, Series B, 57, pp. 501-517.
Hebrail G. (1995): « L'analyse de données symboliques : état de l'art et perspectives », EDF-DER Research report, n° HI-23/95-018.
Ichino M. (1994): « Generalised Minkowski metrics for mixed features type data analysis », in IEEE Transactions on Systems, Man, and Cybernetics, 24, 4, pp. 698-708.
Lebbe J., Vignes R. (1991): « Génération de graphes d'identification à partir de descriptions de concepts », in Induction Symbolique et Numérique à partir de Données, Editeurs Y. Kodratoff et E. Diday, Cepadues Editions.
Tong H.T.T., Summa M., Périnel E., Ferraris J. (1996): « Generating symbolic descriptions for classes », in Proceedings of the IFCS'96 Conference, Kobe, March 96.
SODAS Project (1995): Answer to the DOSIS Call for Proposals.
Classification Structures for Cognitive Maps
Stephen C. Hirtle and Guoray Cai
School of Information Sciences
University of Pittsburgh
Pittsburgh, PA 15260 USA
sch,[email protected]

Summary: The ability to create and manipulate meaningful data structures of cognitive spaces remains a problem for designers of geographic information systems. Methods to represent the inherent hierarchical structure in cognitive spaces are discussed. Several alternative scaling techniques for developing hierarchical and overlapping representations, including ordered trees, ultrametric trees, and semi-lattices, are presented and discussed. To demonstrate the differences among these three representation schemes, each of the three techniques is applied to two small datasets collected on the recall of capitals or countries in Europe. The methods discussed here were chosen to illustrate the limitations of a strict, hierarchical representation and because they have been used in the past to model cognitive spaces.

1. Introduction
The ability to create and manipulate meaningful data structures of cognitive spaces remains a problem for designers of geographic information systems. The ability to present and to interpret spatial data in a manner that is consistent with the internal cognitive map of the user would lead to systems that are more flexible and provide greater functionality in terms of cognitive spatial tasks (Hirtle and Heidorn, 1993; Medyckyj-Scott and Blades, 1992).
A common conclusion that has emerged from the research on the structure of cognitive mapping is that spatial memory is organized hierarchically, which results in processing biases and errors in judgments (Couclelis, et al., 1987; Golledge, 1992; Hirtle and Jonides, 1985; McNamara, et al., 1989; Stevens and Coupe, 1978). However, as Hirtle (1995) argued recently, the claim that mental representations are inherently hierarchical is often made without providing an explicit alternative. For example, the first author and his colleagues have argued that their data is consistent with a "partially hierarchical model" (McNamara, et al., 1989) and have warned against the conclusion that the only structure in a cognitive map is of a hierarchical nature (Hirtle and Jonides, 1985). While such qualifications are intriguing, they are often stated without proposing an explicit alternative. In this paper, several alternative scaling techniques for developing hierarchical and overlapping representations, including ordered trees, ultrametric trees, and semi-lattices, are considered.

2. Hierarchies
A strict hierarchy is often assumed for representing spatial concepts. For example, Stevens and Coupe (1978) showed how people consistently misjudged certain directions, such as assuming that Reno, Nevada is north and east of San Diego, California, when in fact Reno is north and west of San Diego. To account for such effects, Stevens and Coupe (1978) presented a nested, propositional model, with San Diego as part of California, Reno as part of Nevada, and California to the west of Nevada. Here, the reasoning processes occur on a hierarchical tree structure, which contains cities
nested within states. Thus, a hierarchy is assumed to be formally equivalent to a rooted tree, in a graph-theoretic form. A hierarchy can be defined formally as a collection of sets such that for any two sets in the collection, either one set is contained in the other or the two sets are disjoint (Alexander, 1965).
Many real-world phenomena can be represented by a tree, such as cities within states, states within countries, countries within continents, and so on. However, Hirtle (1995) argued that most attempts to force a hierarchy onto anything other than artificial examples usually fail. Gary, Indiana, in terms of influences, transportation, and even time zones, is more closely associated with Chicago than with the rest of Indiana. Lake Tahoe represents a single geographical "neighborhood" that lies in both California and Nevada. Such examples might be considered noise in the data, to be ignored. However, in discussing the structure of cities, Alexander (1965) has argued that a natural city is by nature not hierarchical, but contains overlapping clusters that are better represented in a semi-lattice. Hirtle (1995) explored this hypothesis by examining a small subsample of a larger dataset. Here, we expand on this analysis by including the entire dataset and examining alternative distance metrics. We begin by contrasting two partially hierarchical structures, ordered trees and semi-lattices.

3. Ordered Trees

A technique that has proven useful for uncovering hierarchical structure in cognitive maps has been that of the ordered tree algorithm for free-recall data (Hirtle and Jonides, 1985; McNamara, et al., 1989). An ordered tree is a rooted tree where the children of a node, at any level, may be ordered, as a unidirectional or bidirectional node, or unordered, as a nondirectional node. Ordered trees, as discussed here, were first introduced by Reitman and Rueter (1980) and differ from two other uses of the term in the literature. Aho, et al. (1974) define an ordered tree as one in which all children are strictly ordered from left to right. In a third use of the term, Barthelemy, et al. (1986) define an ordered tree as a rooted tree where the nodes are ordered by the height of the nodes. In this paper, the discussion is restricted to the first use of the term, as defined by Reitman and Rueter (1980).
[Figure omitted: ordered dendrogram over NH, VT, ME, CN, MA, RI and the corresponding set inclusion diagram, with internal sets {NH VT}, {ME NH VT}, {CN MA}, {MA RI}, {CN MA RI} and the root {CN MA ME NH RI VT}.]
Fig. 1: Ordered dendrogram and set inclusion diagram for ordered tree.

An ordered tree is built by examining the regularities in a set of recalls over a fixed set of items. In fact, an ordered tree is a generalization that allows for some overlapping structure. As an example, the collection of sets {NH VT}, {ME NH VT}, {CN MA}, {MA RI}, {CN MA RI}, and {CN MA ME NH RI VT} cannot be represented by a tree, since the sets {CN MA} and {MA RI} are overlapping and violate the definition of a hierarchy, given above. However, this collection can be represented by the ordered tree seen in Figure 1.

One might be tempted to conclude that an ordered tree is simply a variant of a non-binary tree. However, this is not the case. Note that in the previous example, a non-binary tree could be constructed by the removal of the overlapping sets {CN MA} and {MA RI}. However, by the inclusion of these two sets, along with the explicit exclusion of the set {CN RI}, the collection of sets can no longer be represented by a strict hierarchy. Furthermore, many cognitive and real-world relations are best seen as exactly this type of ordered structure.

4. Semi-lattices
A semi-lattice is a generalization of an ordered tree. It is defined formally as a collection of sets, such that for any two overlapping sets in the collection, the intersection of the sets is also in the collection (Alexander, 1965). Therefore, if the sets {A B C D E F} and {B C E G H} are in the collection, then the set {B C E} must be in the collection, as well. As an example, consider the collection of sets {NH VT}, {CN MA}, {ME NH VT}, {CN MA VT}, {CN MA RI}, and {CN MA ME NH RI VT}. Such a collection cannot be represented as either a tree or an ordered tree, but can be represented as a semi-lattice. The sets {ME NH VT}, {CN MA VT} and {CN MA RI} are overlapping and thus violate the definition of an ordered tree, given above. However, this collection can be represented by the graph structure shown in Figure 2.
[Figure omitted: set inclusion diagram with leaves {ME}, {NH}, {VT}, {CN}, {MA}, {RI}, the overlapping sets {ME NH VT}, {CN MA VT}, {CN MA RI}, and the root {CN MA ME NH RI VT}.]
Fig. 2: Set inclusion diagram for semi-lattice structure.
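The closure property can again be checked mechanically; a Python sketch over the example collection, with the singleton leaves of Figure 2 included (is_semilattice is our name):

    def is_semilattice(sets):
        """True iff every intersection of two overlapping members is a member."""
        S = {frozenset(s) for s in sets}
        return all((a & b) in S
                   for a in S for b in S
                   if a != b and (a & b) and not (a <= b or b <= a))

    C = [{"NH","VT"}, {"CN","MA"}, {"ME","NH","VT"}, {"CN","MA","VT"},
         {"CN","MA","RI"}, {"CN","MA","ME","NH","RI","VT"},
         {"ME"}, {"NH"}, {"VT"}, {"CN"}, {"MA"}, {"RI"}]   # leaves included
    print(is_semilattice(C))   # -> True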


Alexander (1965) notes that planned, or what he calls "artificial," cities often are designed using a strict tree structure. Two such examples are Columbia, Maryland, where clusters of exactly five neighborhoods combine to form villages, and a 1943 plan of Greater London by Abercrombie and Forshaw, which argues for a "large number of communities, each separated from all adjacent communities." Each community is further subdivided into neighborhoods, each with their own shops and schools. Alexander (1965) goes on to argue that a natural, living city, despite the ill-advised wishes of the urban planner, does not conform to the hierarchical structure of a tree. Rather, a post office, a local school, a social club, or a water authority all serve areas of different sizes and scope. Thus, the resulting structure is better conceptualized as that of a semi-lattice.

5. Mapping Data to Structures


5.1 Data and Trees
To demonstrate further the differences among these three representation schemes, we
turn to two small datasets collected on the recall of countries in Europe. During
the academic year of 1994-1995, students on two different campuses in two different
European countries were asked to make an ordered list, from memory, of either all

the countries in Europe, or all the capitals in Europe. No other instructions were given to the subjects. To equate the two samples, the capitals were converted into the country name for those receiving the capital task. It is further acknowledged that the capital task was harder and that exclusions might occur, not from forgetting the country, but because the subject does not know or is unsure of the name of the capital. However, these two datasets are considered only to highlight the differences between the representations discussed in this paper, and not to generalize about specific regional understanding of European geography. Furthermore, the purpose of this exercise was to explore the possible clustering that exists among countries and not the rather trivial set inclusion principle of aggregating a capital to its host country.
A group of 18 subjects in Norway, who were asked to recall countries of Europe,
produced a total of 44 distinct entries. The entire set of recalls can be seen using a
path-graph visualization developed by Hirtle (1991) in Figure 3. Here, the line width is proportional to the number of times two countries were recalled simultaneously. By visually focusing on the thicker connections, several clear clusters, such as the Scandinavian countries, begin to emerge, as seen in Figure 3.

[Path-graph figure omitted.]
Fig. 3: Path graph of European countries generated by the Norwegian subjects

[Path-graph figure omitted.]
Fig. 4: Path graph of the country names for the European capitals generated by the Austrian subjects
A group of 12 subjects in Austria, who were asked to list all the capital cities of Europe, produced a total of 32 distinct entries. The capitals were converted to the country names, and the resulting complete path graph is shown in Figure 4.
The ordered lists from each of the datasets were clustered into a strict hierarchical tree, using an average-link clustering algorithm (UPGMA). This was done using two different measures of distance, city-block and a log-based distance. The city-block metric is equivalent to stating that the distance between any two countries is proportional to the total number of intervening items between them across all the ordered lists. However, as items are further separated on the list, the actual numerical difference becomes less important. Therefore, we replicated the analysis with the logarithm of the difference. Furthermore, four countries were dropped from the analysis, due to a lack of data for calculating pairwise distances. For simplicity, only the latter distance analysis is reported here. The resulting tree is shown in Figure 5.
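As an illustration of this pipeline, a hedged Python sketch with toy recall lists (the data and helper names are ours; the real analysis used the Norwegian and Austrian recalls): the distance between two countries is the mean log separation of their recall positions, clustered with average linkage (UPGMA) via SciPy.

    import numpy as np
    from scipy.cluster.hierarchy import average

    recalls = [["Norway", "Sweden", "Denmark", "France", "Spain"],
               ["Sweden", "Norway", "France", "Spain", "Denmark"]]
    items = sorted({c for r in recalls for c in r})
    idx = {c: i for i, c in enumerate(items)}

    d = np.zeros((len(items), len(items)))
    n = np.zeros_like(d)
    for r in recalls:
        for i, a in enumerate(r):
            for j, b in enumerate(r[i + 1:], start=i + 1):
                ia, ib = sorted((idx[a], idx[b]))
                d[ia, ib] += np.log(j - i)     # log of the separation
                n[ia, ib] += 1
    # Pairs never recalled together would have to be dropped, as in the text.
    cond = [d[i, j] / n[i, j] for i in range(len(items))
            for j in range(i + 1, len(items))]
    tree = average(cond)                       # UPGMA linkage matrix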
[Dendrogram omitted; leaf order: Cyprus, Turkey, Greece, Poland, Austria, Switzerland, Italy, Andorra, Liechtenstein, Monaco, Bulgaria, Romania, Slovakia, Slovenia, Czech Republic, Hungary, Croatia, FR Yugoslavia, Bosnia-Herzegovina, Albania, Estonia, Lithuania, Latvia, Ukraine, Belorussia, Russia, Norway, Sweden, Denmark, Finland, Iceland, Ireland, U.K., Portugal, Spain, France, Belgium, Netherlands, Luxembourg, Germany.]
Fig. 5: Hierarchical tree of European countries generated by the Norwegian subjects


A group of 12 subjects in Austria, who were asked to list all the capital cities of Europe, produced a total of 32 distinct entries. The ordered recalls of these 32 countries were clustered into a strict hierarchical tree, also using the average-link clustering algorithm with the city-block distance and log-based distance. The resulting tree for log-based distances is shown in Figure 6.
[Dendrogram omitted; leaf order: Italy, San Marino, Monaco, Liechtenstein, Switzerland, Poland, Slovakia, Czech Republic, Bulgaria, Romania, Greece, Bosnia, FR Yugoslavia, Croatia, Slovenia, Hungary, Austria, Iceland, Ireland, France, UK, Portugal, Spain, Andorra, Norway, Sweden, Denmark, Finland, Belgium, Luxembourg, Netherlands, Germany.]
Fig. 6: Hierarchical tree of the country names for the European capitals generated
by the Austrian subjects
In examining these trees, the limitation of a hierarchical tree becomes obvious. Each country is placed uniquely in a single cluster within the tree, by the very definition of a tree. The multiple relationships, which Alexander (1965) argued convincingly for, cannot be incorporated in the representational structure.
5.2 Ordered trees
An ordered tree might allow some overlapping relationships to emerge. Unfortunately, an immediate application of the existing ordered tree algorithm of Reitman and Rueter (1980) is not possible. The algorithm was developed to account for the strong representational structures within a single subject for a domain of interest, and not to build an average structure across many subjects. Thus, the algorithm is deterministic and produces clusters that exist across all recall patterns. Within the Norwegian sample, there was not a single cluster that was common to all subjects, whereas in the Austrian sample, only the single cluster of Norway, Sweden existed for all the subjects.
However, by examining subgroups of subjects within each sample, one can identify
small groups of subjects with common strategies, for which one can calculate non-
trivial ordered trees. Figure 7 shows one tree from a subset of the Norwegian subjects
and Figure 8 shows trees from two subsets of the Austrian subjects. It is interest-
ing to note the predominance of the home country, as expected, in each sample. In
addition, the two ordered trees in Figure 8, from the Austrian sample, indicate two
very different strategies, one that is geographically oriented (Figure 8a), and another
that is ordered by prominence (Figure 8b). The former strategy resulted in Austria

clustered with Switzerland and Liechtenstein, whereas the latter strategy resulted in
Austria being followed by France and United Kingdom.
Norway
Sweden
Finland
Denmark
Iceland
United Kingdom
The Netherlands
Belgium
France
Spain
Portugal
Switzerland
Austria
Italy
Germany
Poland
Greece

Fig. 7: An example of an ordered tree for a subgroup of Norwegian subjects


(a)                   (b)
Austria               Austria
Germany               France
Liechtenstein         UK
Belgium               Netherlands
Netherlands           Italy
Germany               Spain
Spain                 Greece
France                Poland
UK                    Czech Republic
Italy                 Slovakia
Norway                Hungary
Sweden                Germany
Finland               Switzerland
Poland                Norway
Czech Republic        Sweden
Slovakia              Finland
Hungary               Liechtenstein
Greece                Belgium

Fig. 8: An example of ordered trees for two subgroups of Austrian subjects


Two benefits arise from the ordered tree over the strict hierarchical tree. First, any order that might exist within a cluster is preserved. This can be seen by the dominance of the home countries of Norway and Austria within their respective ordered trees in Figures 7 and 8. The ordered clusters also provide examples of implicit overlapping internal clusters. For example, in Figure 7, the ordered cluster of France, Spain, Portugal is created by the two underlying, overlapping clusters of France, Spain and Spain, Portugal, both of which have strong surface validity, while the excluded relationship France, Portugal has much weaker surface validity. A strict hierarchical model would imply that every pair of items in a cluster would be associated at the same level.
5.3 Semi-lattices
The final representational scheme, the semi-lattice, lacks any direct method of construction, which may account for why the representation of a semi-lattice for cognitive maps has not been considered to the extent of the previous two representations. One solution would be to use the MAPCLUS algorithm to fit the ADCLUS model (Shepard and Arabie, 1979) of overlapping clusters. These clusters could then provide a
ard and Arabie, 1979) of overlapping clusters. These clusters could then provide a
seed set of potential clusters to build a semi-lattice upon. An initial application of
the MAPCLUS algorithm to the data from the Norwegian subjects resulted in four
clusters, with only the Scandinavian cluster being distinct from the others. The data
from the Austrian subjects also resulted in four overlapping clusters. One cluster con-
sisted of northern European countries, including Scandinavia and the British Isles.
The second consisted of eastern European countries. The third cluster consisted of
prominent central European countries, and the final of less prominent countries. While such an analysis is promising, it is clear that any implementation of semi-lattice models will require the additional development of appropriate algorithms.

6. Conclusions and summary

In summary, a tree structure is one realization of a hierarchical structure for the representation of space. It is easily constructed and understood, but it is also a rigid structure that does not allow for overlap. Ordered trees provide an extension that allows for some degree of overlap, whereas a semi-lattice is an even richer structure that appears to be consistent with many aspects of cognitive space (Alexander, 1965). There are many other possibilities for representing spatial clusters, including additive trees (Sattath and Tversky, 1977), pseudo-hierarchies or pyramids (Diday, 1986), extended trees (Carroll and Corter, 1995), and hybrid scaling models (Carroll and Pruzansky, 1980). A survey and review of the mathematical properties of many of these representations can be found in Van Cutsem (1994). The methods discussed here were chosen to illustrate the limitations of a strict, hierarchical representation and because they have been used in the past to model cognitive spaces.
As spatial information systems develop and evolve, the importance of considering alternative structures to strict hierarchical trees, and the necessity of being explicit about the nature of the assumed representational structure, will only increase. Spatial information systems, multimedia systems, and large information systems including the World Wide Web require a user to navigate through complex and often poorly differentiated spaces (Kim and Hirtle, 1995). The ability to generate a meaningful multi-level structure should ease the cognitive burden imposed by the navigational task and allow users to focus on the informational task instead. Finally, it is important to note that a consistent theme behind all of the representations discussed is that of a highly structured representation. To replace the use of hierarchical trees with an unstructured representation, such as an undifferentiated network, would be a serious mistake. Rather, the goal of future research should be to clarify the exact nature of the underlying, structured representation of cognitive spaces.

7. Acknowledgments

This paper was prepared, in part, while the first author was on sabbatical at the
Department of Computer Science, Molde College, in Molde, Norway. Their support
is gratefully appreciated. The authors wish to thank Adrijana Car, Kai Olsen, and
Phipps Arabie for their comments concerning the issues presented in this paper.

References:
Aho, A. V., et al. (1974): The design and analysis of computer algorithms, Addison-Wesley, Reading, MA.
Alexander, C. (1965): A city is not a tree, Design, 46-55.
Barthelemy, J. P., et al. (1986): On the use of ordered sets in problems of comparison and consensus of classifications, Journal of Classification, 3, 187-224.
Carroll, J. D. and Corter, J. E. (1995): A graph-theoretic method for organizing overlapping clusters into trees, multiple trees, or extended trees, Journal of Classification, in press.
Carroll, J. D. and Pruzansky, S. (1980): Discrete and hybrid scaling models. In: Similarity and Choice, Lantermann, E. D. and Feger, H. (eds.), Hans Huber, Bern.
Couclelis, H., et al. (1987): Exploring the anchor-point hypothesis of spatial cognition, Journal of Environmental Psychology, 7, 99-122.
Diday, E. (1986): Orders and overlapping clusters in pyramids. In: Multidimensional data analysis, de Leeuw, J., et al. (eds.), 201-234, DSWO Press, Leiden.
Golledge, R. G. (1992): Place recognition and wayfinding: Making sense of space, Geoforum, 23, 199-214.
Hirtle, S. C. (1991): Knowledge representations of spatial relations. In: Mathematical psychology: Current developments, Doignon, J.-P. and Falmagne, J.-C. (eds.), 233-250, Springer-Verlag, New York.
Hirtle, S. C. (1995): Representational structures for cognitive space: Trees, ordered trees, and semi-lattices. In: Spatial information theory: A theoretical basis for GIS, Frank, A. U. and Kuhn, W. (eds.), Springer-Verlag, Berlin.
Hirtle, S. C. and Heidorn, P. B. (1993): The structure of cognitive maps: Representations and processes. In: Behavior and environment: Psychological and geographical approaches, Garling, T. and Golledge, R. G. (eds.), 170-192, North-Holland, Amsterdam.
Hirtle, S. C. and Jonides, J. (1985): Evidence of hierarchies in cognitive maps, Memory and Cognition, 13, 208-217.
Kim, H. and Hirtle, S. C. (1995): Spatial metaphors and disorientation in hypertext browsing, Behaviour and Information Technology, 14, 239-250.
McNamara, T. P., et al. (1989): Subjective hierarchies in spatial memory, Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 211-227.
Medyckyj-Scott, D. J. and Blades, M. (1992): Human spatial cognition, Geoforum, 23, 215-226.
Reitman, J. S. and Rueter, H. R. (1980): Organization revealed by recall orders and confirmed by pauses, Cognitive Psychology, 12, 554-581.
Sattath, S. and Tversky, A. (1977): Additive similarity trees, Psychometrika, 42, 319-345.
Shepard, R. N. and Arabie, P. (1979): Additive clustering: Representation of similarities as combinations of discrete overlapping properties, Psychological Review, 86, 87-123.
Stevens, A. and Coupe, P. (1978): Distortions in judged spatial relations, Cognitive Psychology, 10, 422-437.
Van Cutsem, B. (Ed.) (1994): Classification and dissimilarity analysis, Lecture Notes in Statistics, No. 93, Springer-Verlag, New York.
Unsupervised Concept Learning
Using Rough Concept Analysis
Tu Bao Ho¹

Japan Advanced Institute of Science and Technology
Tatsunokuchi, Ishikawa, 923-12 JAPAN
¹ Also with the Institute of Information Technology, Hanoi, Vietnam

Summary: Formal concept analysis (Wille, 1982) offers an algebraic tool for representing
and analyzing formal concepts, and the rough set theory (Pawlak, 1982) offers an alternative
tool to deal with vagueness and uncertainty. Rough concept analysis (Kent, 1994) is an
attempt to synthesize common features of these two theories. In this work we develop a
method for unsupervised concept learning in the framework of rough concept analysis that
aims at finding and using concepts with their approximations.

1. Introduction
The problem of finding not only hierarchical clusters of unlabelled objects but also
their 'good' conceptual descriptions was addressed early in data analysis, e.g., Diday
and Simon (1976). This problem is referred to as unsupervised concept learning in
machine learning, which can be defined as:
• Given a set of unlabelled object descriptions;
• Find a hierarchical clustering that determines useful object subsets (clustering);
• Find intensional definitions for these subsets of objects (characterization).
Unsupervised concept learning techniques depend strongly on how concepts are un-
derstood and represented. The notion of concepts under the classical view and the
generality relation was mathematically formulated in the theory of formal concept
analysis by Wille and his colleagues during the last fifteen years (Wille, 1992). Recently,
several concept learning methods have been developed in this framework, e.g., those
of Godin and Missaoui (1994) and Carpineto and Romano (1996) that incrementally
learn all possible concepts, or the method OSHAM (Ho, 1995) that extracts a part of the
hypothesis space in the form of concept hierarchies.
The theory of rough sets is a mathematical tool to deal with vagueness and uncer-
tainty in interpreting given data (Pawlak, 1991). Recently, by combining common
features of the rough set theory and formal concept analysis, Kent (1994) in-
troduced a theory of rough concept analysis as a framework for representing and
learning approximations of concepts. In this paper we develop the unsupervised con-
ceptual clustering method A-OSHAM, an extension of OSHAM, for inducing concept
hierarchies with concept approximations by using rough concept analysis.

2. Formal concept analysis and rough concept analysis


2.1 Formal concept analysis
The most widely held understanding of a concept is based on the function of collecting
individuals into a group with certain common properties. One distinguishes these
common properties as the intent of the concept, which determines its extent: the
objects sharing these properties and accepted as members of the concept. Formal con-
cept analysis models formal concepts from a formal context, which is a triple (O, A, R)
where O is a set of objects, A is a set of attributes and R is a binary relation between
O and A, i.e., R ⊆ O × A. Notice that, for simplicity, formal concept analysis is
always described with Boolean data. However, its notions can be extended to multi-
valued data. In general, oRa is understood as "object o has attribute a" in a Boolean
domain, and can be extended to nominal or discrete numeric domains as "object o
has attribute a with some value v". Data from continuous domains can be discretized
into discrete data to be used within the framework. A formal concept of a given formal
context is a pair of extent/intent (X, S) where the extent X contains precisely those
objects sharing all attributes in the intent S, and vice-versa, the intent S contains
precisely those attributes shared by all objects in the extent X. The relation between
extent and intent of concepts can be described by two operators λ and ρ:

S = λ(X) = {a ∈ A | ∀o ∈ X : oRa} = ⋂_{o∈X} oR ⊆ A   (1)

X = ρ(S) = {o ∈ O | ∀a ∈ S : oRa} = ⋂_{a∈S} Ra ⊆ O   (2)

            abb | nw  lw  ll  nc  2lg 1lg  mo  lb  sk
Leech       Le  |  ×   ×                    ×
Bream       Br  |  ×   ×                    ×   ×
Frog        Fr  |  ×   ×   ×                ×   ×
Dog         Dg  |  ×       ×                ×   ×   ×
Spike-Weed  SW  |  ×   ×       ×        ×
Reed        Rd  |  ×   ×   ×   ×        ×
Bean        Bn  |  ×       ×   ×   ×
Maize       Mz  |  ×       ×   ×        ×

Figure 1: Formal context and concept lattice of living organisms


A concept C1 is called a superconcept of a concept C2 (C2 is a subconcept of C1) if
the extent of C2 is included in the extent of C1. The basic theorem in this theory
is that the set of formal concepts of a formal context is a complete lattice² with respect to the
superconcept-subconcept relation, called the concept lattice and denoted by ℒ(O, A, R).
Example: Figure 1 shows Wille's illustration of a formal context of eight living
organisms whose names head the rows (with corresponding abbreviations)
{Leech (Le), Bream (Br), Frog (Fr), Dog (Dg), Spike-Weed (SW), Bean (Bn), Maize (Mz),
Reed (Rd)}. The living organisms are described by the attributes {needs water (nw), lives in
water (lw), lives on land (ll), needs chlorophyll (nc), 2 leaf germination (2lg), 1 leaf germination
(1lg), is motile (mo), has limbs (lb), suckles young (sk)}. Furthermore, an × indicates
when an object has an attribute, i.e., which living organism has which attribute. The
concept lattice ℒ(O, A, R) consists of 19 formal concepts C1, C2, ..., C19 (denoted in
the figure only by their indexes), for example C9 = ({Le, Br, Fr}, {nw, lw, mo}).
² A lattice ℒ is complete when each of its subsets X has a least upper bound and a greatest lower
bound in ℒ.
2.2 Rough sets and rough concept analysis
There have been different methods of approximating concepts, e.g., those employing
the Bayesian decision theory or the well-known fuzzy set theory, which characterize
concepts approximately by a membership function with a range between 0 and 1.
Rough set theory can be considered as an alternative way of approximating concepts.
The starting point of this theory is the assumption that our "view" on elements of
a set of objects O depends on some equivalence relation E on O. An approximation
space is a pair (O, E) consisting of O and an equivalence relation E ⊆ O × O. The
key notion of the rough set theory is the lower and upper approximations of any subset
X ⊆ O, which consist of all objects surely and possibly belonging to X, respectively.
The lower approximation E_*(X) and the upper approximation E^*(X) are defined by

E_*(X) = {o ∈ O : [o]_E ⊆ X}    E^*(X) = {o ∈ O : [o]_E ∩ X ≠ ∅}   (3)

where [o]_E denotes the equivalence class of objects indiscernible from o with respect to
the equivalence relation E. Kent (1994) has pointed out common features between the
theories of rough sets and formal concept analysis, and formulated rough concept
analysis. Saying that a given formal context (O, A, R) is not obtained completely and
precisely means that the relation R is incomplete and imprecise. Let (O, E) be any
approximation space on the objects O; we wish to approximate R in terms of E. The
lower approximation R_*E and the upper approximation R^*E of R w.r.t. E can be
defined element-wise as

(R_*E)a = E_*(Ra) = {o ∈ O | [o]_E ⊆ Ra}   (4)

(R^*E)a = E^*(Ra) = {o ∈ O | [o]_E ∩ Ra ≠ ∅}   (5)


The formal context (O, A, R) can then be roughly approximated by the two lower and
upper formal contexts (O, A, R_*E) and (O, A, R^*E). These approximate contexts
can be intuitively viewed as "truncated" and "filled up" contexts with respect to
the equivalence relation E, as illustrated in Figures 2 and 3. Two formal contexts
(O, A, R) and (O, A, R′) are E-roughly equal if they have the same lower and upper
formal contexts, i.e., R_*E = R′_*E and R^*E = R′^*E. A rough formal context in (O, E)
is a collection of formal contexts of object set O and attribute set A which have the
same lower and upper formal contexts (roughly equal formal contexts).
The rough extents of an attribute subset S ⊆ A w.r.t. R_*E and R^*E are defined as

ρ(S_*E) = ⋂_{a∈S} (R_*E)a    ρ(S^*E) = ⋂_{a∈S} (R^*E)a   (6)
Now, any formal concept (X, S) ∈ ℒ(O, A, R) can be approximated by R_*E and R^*E.
The lower and upper E-approximations of (X, S) are defined as

(X, S)_*E = (ρ(S_*E), λρ(S_*E)) ∈ ℒ(O, A, R_*E)   (7)

(X, S)^*E = (ρ(S^*E), λρ(S^*E)) ∈ ℒ(O, A, R^*E)   (8)

A rough concept of a formal context (O, A, R) in (O, E) is the collection of concepts
which have the same lower and upper E-approximations (roughly equal concepts).
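The following is a minimal sketch of the element-wise approximations (4) and (5), assuming the equivalence relation is the one induced by agreement on a chosen attribute subset (as the relation E1 of Figure 2 is induced by {lw, nc}); the encoding repeats the Figure 1 data and all names are illustrative.

CONTEXT = {
    "Le": {"nw", "lw", "mo"},             "Br": {"nw", "lw", "mo", "lb"},
    "Fr": {"nw", "lw", "ll", "mo", "lb"}, "Dg": {"nw", "ll", "mo", "lb", "sk"},
    "SW": {"nw", "lw", "nc", "1lg"},      "Rd": {"nw", "lw", "ll", "nc", "1lg"},
    "Bn": {"nw", "ll", "nc", "2lg"},      "Mz": {"nw", "ll", "nc", "1lg"},
}

def eq_class(o, E):
    """[o]_E: the objects agreeing with o on every attribute in E."""
    return {p for p in CONTEXT if CONTEXT[p] & E == CONTEXT[o] & E}

def lower(Ra, E):   # (R_*E)a = E_*(Ra), equation (4)
    return {o for o in CONTEXT if eq_class(o, E) <= Ra}

def upper(Ra, E):   # (R^*E)a = E^*(Ra), equation (5)
    return {o for o in CONTEXT if eq_class(o, E) & Ra}

E1 = {"lw", "nc"}
R_mo = {o for o in CONTEXT if "mo" in CONTEXT[o]}   # the column of 'mo'
print(sorted(lower(R_mo, E1)), sorted(upper(R_mo, E1)))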

[Lower and upper approximate context tables with their concept lattices]

Figure 2: Lower and upper approximations of living organisms w.r.t. (lw, nc)

3. Learning in the framework of rough concept analysis


3.1 Inducing concept hierarchies with concept approximations

Note that the approximate contexts of (O, A, R) in (O, E) vary according to the equiv-
alence relation E. Figure 2 (modified from Kent, 1994) illustrates the lower and
upper approximate contexts (left and right tables) as well as the lower and upper ap-
proximate concept lattices ℒ(O, A, R_*E1) (up) and ℒ(O, A, R^*E1) (down) of ℒ(O, A, R),
where the indiscernibility relation E (denoted by E1) is determined with respect to
the two features lives in water (lw) and needs chlorophyll (nc). These approximate concept
lattices consist of 10 lower approximations and 9 upper approximations, denoted
in the figure only by their indexes. Figure 3 illustrates those of
ℒ(O, A, R) (similar positions) where the indiscernibility relation E (denoted by E2) is
determined with respect to the two features lives on land (ll) and is motile (mo). These
approximate concept lattices also consist of 10 lower and 9 upper approximations.
We can determine the maps of concepts from ℒ(O, A, R) to their approximations in
ℒ(O, A, R_*E1) and ℒ(O, A, R^*E1), and to ℒ(O, A, R_*E2) and ℒ(O, A, R^*E2).
Example: In these maps, several concepts of ℒ(O, A, R) can share a single approximation;
for instance, {C2}, {C6, C8, C18} and {C9, C13, C16} are each assigned one lower
approximation and one upper approximation in ℒ(O, A, R_*E1) and ℒ(O, A, R^*E1), and
{C2}, {C9, C13, C16} and {C12, C18} likewise in ℒ(O, A, R_*E2) and ℒ(O, A, R^*E2).
[Lower and upper approximate context tables with their concept lattices]

Figure 3: Lower and upper approximations of living organisms w.r.t. (ll, mo)

Consider the concept C13 = ({Br, Fr}, {nw, lw, mo, lb}) in ℒ(O, A, R). It has lower ap-
proximations ({Le, Br, Fr}, {nw, lw, mo}) in ℒ(O, A, R_*E1) and ({Le, Br, Fr, Dg}, {nw, mo})
in ℒ(O, A, R_*E2), and upper approximations ({Le, Br, Fr}, {nw, lw, ll, mo, lb}) in
ℒ(O, A, R^*E1) and ({Le, Br, Fr, Dg}, {nw, ll, mo, lb}) in ℒ(O, A, R^*E2). Although some
concepts in ℒ(O, A, R) have similar indexes in different approximate lattices, their
approximations may have different intents and extents.
OSHAM (Ho, 1995) is an unsupervised concept learning method that induces a con-
cept hierarchy H from the concept lattice ℒ(O, A, R). Inspired by this algorithm, we
propose the algorithm A-OSHAM for learning approximate concepts in the framework of
rough concept analysis. Essentially, A-OSHAM induces a concept hierarchy in which
each induced concept is associated with a pair of its lower and upper approximations.
The search for the concept hierarchy is carried out through the hypothesis space of
the concept lattice ℒ(O, A, R). The basis of this search is a generate-and-test operator
to split a concept C into subconcepts at a lower level of H. Associated lower and
upper approximations are computed from the approximate contexts generated cor-
responding to the heuristic of inducing the concept. Starting from the root concept
with the whole set of training instances, A-OSHAM induces the concept hierarchy H
recursively in a top-down direction as described in Table 1. The procedure for computing
the lower and upper approximations in step 1.(d) is given in Table 2. The procedures
for the remaining steps 1.(a), 1.(b), 1.(c) and 1.(e) are the same as for OSHAM (Ho, 1995).
Table 1: Algorithm A-OSHAM(Ck, H)

Input: concept hierarchy H and an existing splittable concept Ck.
Result: H formed gradually.
Top-level call: A-OSHAM(root concept, ∅).
Variables: α is a given threshold.

1. Suppose that Ck1, ..., Ckn are the subconcepts of Ck = (Xk, Sk) found so far. While
Ck is still splittable, find a new subconcept Ck(n+1) = (Xk(n+1), Sk(n+1)) of Ck and its
approximations by doing:

(a) Find the attribute a* such that ⋃_{i=1}^{n} Xki ∪ ρ({a*}) is the largest cover of Xk.
(b) Find the largest attribute set S containing a* satisfying λρ(S) = S.
(c) Form the subconcept Ck(n+1) with Sk(n+1) = S and Xk(n+1) = ρ(S).
(d) Find a lower approximation and an upper approximation of Ck(n+1) with respect
to the chosen equivalence relation E.
(e) Form the intersecting subconcepts corresponding to intersections of ρ(Sk(n+1)) with
the extents of existing concepts on H, excluding its superconcepts, and find their
approximations.

2. Let X′k = Xk \ ⋃_{i=1}^{n+1} Xki. If one of the following conditions holds, then Ck is considered
unsplittable:

(a) There does not exist any attribute set S ⊇ Sk satisfying λρ(S) = S in X′k.
(b) card(X′k) ≤ α.

3. Apply A-OSHAM(Cki, H) to each Cki formed in step 1.
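The following hedged sketch illustrates steps 1.(a)-(c) of Table 1, reusing the intent/extent operators of Section 2.1; it is our reading of the generate-and-test split, not the authors' implementation, and all names are illustrative.

# Sketch of A-OSHAM's split step (Table 1, 1.(a)-(c)). 'extent' and 'intent'
# are the operators rho and lambda of (1)-(2); 'found_extents' holds the
# extents Xk1, ..., Xkn of the subconcepts found so far.
def split_once(X_k, S_k, found_extents, extent, intent, attributes):
    covered = set().union(*found_extents) if found_extents else set()
    # 1.(a): the attribute a* giving the largest cover of X_k
    a_star = max(attributes - S_k,
                 key=lambda a: len((covered | extent({a})) & X_k))
    # 1.(b): close S_k + {a*} under the operators (lambda-rho is idempotent)
    S = intent(extent(S_k | {a_star}))
    # 1.(c): the new subconcept (Xk(n+1), Sk(n+1))
    return extent(S), S

On the Figure 1 context, the first call at the root selects an attribute of maximal cover such as lives in water, then returns its closed extent/intent pair.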

A-OSHAM forms concepts at different levels of generality in the hierarchy; each level
corresponds to a partition of O. A-OSHAM generates concepts with their approxima-
tions recursively and gradually: once a level of the hierarchy is formed, the procedure
is repeated for each class.
There are many possible lower and upper approximations of a concept according to
the family ℱ of equivalence relations E on O. There are at least two ways
of approximating concepts in the framework of rough concept analysis: (1) to
compute all possible approximations for each induced concept, or (2) to compute
one plausible lower and upper approximation for each induced concept.
We investigate in this paper the case of one plausible lower and upper approxima-
tion for each induced concept. In each attempt at splitting a concept at level n,
A-OSHAM successively finds its subconcepts at level n + 1 by the specialized
hypotheses with maximum coverage. Suppose that we want to find the approximations
(Xk, Sk)_*E and (Xk, Sk)^*E of the induced concept Ck = (Xk, Sk). As Sk is chosen
among the hypotheses specialized from the intent of its superconcept with the maxi-
mum coverage, it is reasonable to approximate Ck by the one-step further specialized
hypothesis generated in the same way. Suppose that P is the partition of the super-
concept extent with respect to Sk. We first generate a refinement of P with respect to
Sk ∪ {a*}, where the conjunction Sk ∪ {a*} forms the largest cover of Ck's super-
concept extent; then we approximate Ck with the equivalence classes of this refinement
according to (7) and (8).

Table 2: Procedure to find approximations

1. Find an attribute a* ∈ A \ Sk so that Sk ∪ {a*} is the largest cover of the extent of the
superconcept of Ck.
2. Find the equivalence classes of the superconcept extent of Ck according to the equivalence
relation E formed by adding a* to Sk.
3. Find the lower and upper approximations of Ck with respect to E, i.e., (Xk, Sk)_*E
and (Xk, Sk)^*E.

Example: Figure 4 illustrates the concept hierarchy obtained by A-OSHAM from
the living organisms context described in Figure 1, with the parameter α = 2. The
indexes are kept from the search space (Figure 1).
Example: Find the approximations of the concept C4 = ({Le, Br, Fr, SW, Rd}, {nw, lw}).
The equivalence relation used to generate the approximate context is with respect to {nw,
lw, mo}; the lower and upper approximations are ({Le, Br, Fr}, {nw, lw, mo}) and ({Le,
Br, Fr, SW, Rd}, {nw, lw}), respectively.

[Concept hierarchy with root ({Le, Br, Fr, Dg, SW, Rd, Bn, Mz}, {nw}) and nodes including C12 = ({SW, Rd}, {nw, lw, nc, 1lg}) and C13 = ({Br, Fr}, {nw, lw, mo, lb})]
Figure 4: Concept hierarchy induced by A-OSHAM

3.2 Using concept approximations in predicting unknown instances


Concept approximations would have no meaning without some associated interpreter
to exploit them in deciding to which induced concepts an unknown instance belongs.
Traditionally, there are three types of outcomes when logically matching the induced
concept hierarchy H with an unknown instance o:
1. single-match: only one concept on H matches o;
2. multiple-match: many concepts on H match o;
3. no-match: no concept on H matches o.
Most conceptual clustering methods deal with the cases of no-match and multiple-
match by employing a probabilistic estimation. Michalski (1990) developed the mea-
sure of fit for no-match cases and an estimate of probability for multiple-match cases.
Recently, we have developed an interpretation of unsupervised induction results that
combines case-based reasoning with logical matching (Ho and Luong, 1997). Con-
cept approximations in the framework of rough concept analysis offer an alternative
solution to this problem.
The basis of this solution is a refinement of the three common match outcomes by the
concept approximations. For a concept (X, S) ∈ H, the approximations of its extent
X_*E and X^*E are interpreted as the sets consisting of all objects surely and possibly,
respectively, belonging to X with respect to E. We have X_*E ⊆ X ⊆ X^*E and dually
S^*E ⊆ S ⊆ S_*E for every E. We say that an object o strongly matches a concept
(X, S) with respect to E if it matches the lower approximation intent S_*E. We say that an
object o matches a concept (X, S) with respect to E if it matches the concept intent
S. We say that an object o weakly matches a concept (X, S) with respect to E if it
matches S^*E but does not match S. By the order S^*E ⊆ S ⊆ S_*E we have
the corresponding order of weak match, match and strong match when matching
an object o with (X, S). The three common match outcomes can now be refined as follows:
1. No match: no concept weakly matches o;
2. Single weak match: only one concept weakly matches o;
3. Multiple weak match: many concepts weakly match o but no concept matches o;
4. Single match but not strong: one concept matches o but not strongly;
5. Single strong match: only one concept strongly matches o;
6. Multiple match with one strong match: many concepts match o and
only one concept strongly matches o;
7. Multiple match but no strong match: many concepts match o and
no concept strongly matches o;
8. Multiple strong match: many concepts match o and more than
one concept strongly matches o.
The match outcomes 1-3 are a refinement of the exact no-match, the match out-
comes 4-5 are a refinement of the exact single-match, and the match outcomes
6-8 are a refinement of the exact multiple-match. The notion of degree of match
satisfaction can also be refined with regard to these matching outcomes. In fact, cases
2 and 6 immediately offer a decision about the membership of the unknown object
in the concept being considered; cases 3, 7 and 8 require the same consideration as the
usual multiple-match case.
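A minimal sketch of this refined matching follows, assuming intents are attribute sets and using the inclusions S^*E ⊆ S ⊆ S_*E established above; the function and parameter names are ours.

# Classify how an object's attribute set matches one concept (X, S) whose
# lower/upper approximation intents are S_lower (= S_*E) and S_upper (= S^*E).
def match_level(obj_attrs, S, S_lower, S_upper):
    if S_lower <= obj_attrs:      # matches the richest intent: strong match
        return "strong match"
    if S <= obj_attrs:            # matches the exact intent
        return "match"
    if S_upper <= obj_attrs:      # matches only the weakest intent
        return "weak match"
    return "no match"

# The eight refined outcomes of the text are then obtained by counting, over
# all concepts of H, how many reach each of these three levels for a given o.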
References:
Carpineto, C. and Romano, G. (1996): A lattice conceptual clustering system and its application to browsing retrieval, Machine Learning, 24, 95-122.
Diday, E. and Simon, J.C. (1976): Clustering analysis. In: Digital Pattern Recognition, Fu, K.S. (ed.), 47-94, Springer-Verlag.
Godin, R. and Missaoui, R. (1994): An incremental concept formation approach for learning from databases. Theoretical Computer Science, 133, 387-419.
Ho, T.B. (1995): An approach to concept formation based on formal concept analysis. IEICE Trans. Information and Systems, E78-D, 553-559.
Ho, T.B. and Luong, C.M. (1997): Using case-based reasoning in interpreting unsupervised inductive learning results, Inter. Joint Conf. on Artificial Intelligence, Nagoya (in press).
Kent, R.E. (1994): Rough concept analysis. Proc. Rough Sets, Fuzzy Sets and Knowledge Discovery, 248-255, Springer-Verlag.
Michalski, R.S. (1990): Learning flexible concepts: Fundamental ideas and a method based on two-tiered representation. In: Machine Learning: An Artificial Intelligence Approach, Vol. III, Michalski, R.S. and Kodratoff, Y. (eds.), Morgan Kaufmann.
Pawlak, Z. (1991): Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers.
Wille, R. (1992): Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications, 23, 493-515.
Implicative statistical analysis

R. Gras¹,², H. Briand³, P. Peter³ and J. Philippe³,⁴

¹ IRMAR - Campus de Beaulieu, F-35042 Rennes Cedex, France
² IRESTE - CP 3003, F-44087 Nantes Cedex 03, France
³ IRIN - Equipe SIC - 2 rue de la Houssinière, F-44072 Nantes Cedex 03, France
⁴ PERFORMANSE - 3 rue Racine, F-44000 Nantes, France

Summary: Implicative analysis, which originated in a problem of evaluation in education,
allows us to treat a table crossing subjects or objects and variables from a non-
symmetrical point of view. As a method of data analysis, it structures the set
of variables, leads to tree and hierarchical structures, and leads to the calculation of
the contribution of the objects to the structure of the variables. Furthermore, it appears
to be an effective tool in artificial intelligence for explaining a base of rules in a body of
knowledge. An example of the treatment of a large corpus of human behaviours is
presented. The results given by the method have been validated a posteriori by the
expert (a psychologist).

1 The theory
Every researcher interested in the relations between variables (for example a psycholo-
gist, a methodologist, a didactics specialist, ...) asks himself the following question: "Let
a and b be two binary variables; can I affirm that the observation of a leads to the
observation of b?". This non-symmetrical point of view on the couple (a, b),
contrary to methods of similarity analysis, expresses itself in the question: "Is it
right that if a then b?". Generally, a strict answer is not possible and the researcher
must content himself with a quasi-implication. With the statistical im-
plication, we propose a concept and a method which allow one to measure the degree of validity of an
implicative proposition between (binary or other) variables. Furthermore, this method
of data analysis allows one to represent the partial order (or pre-order) which structures
a set of variables.

1.1 Theoretical aspects of the binary case


This is the generic situation of the binary case (Gras 1979, Lerman et al. 1981). We
cross a set E of objects and a set V of variables. We now want to give a statistical
meaning to a quasi-implication a ⇒ b (logical implications are exceptional). We note
A (respectively B) the subset of E where the variable a (respectively b) takes the
value 1 (or true). Measuring the quasi-inclusion of A into B is similar to measuring


this reduced form of implication. Intuitively and qualitatively, we can say that a ⇒ b
is admissible if the number of counter-examples (objects of A ∩ B̄, verifying a ∧ ¬b)
in E is improbably small in comparison with the number of objects expected under an
absence-of-a-link hypothesis between a and b (or A and B).
The quality of the implication is measured with the implication intensity. The ap-
proach developed for elaborating the implication intensity is inspired by I. C. Ler-
man's theoretical considerations for designing his similarity indexes (Lerman 1981).
We associate A (and respectively B) with a random subset X (and respectively Y)
of E which has the same cardinal. We then compare the cardinal of A ∩ B̄ to
that of X ∩ Ȳ under an absence-of-a-link hypothesis. If the cardinal of A ∩ B̄ is im-
probably small in comparison with the cardinal of X ∩ Ȳ, the quasi-implication
a ⇒ b will be accepted; otherwise it will be refused. It has been demonstrated
(Lerman et al. 1981) that the random variable card(X ∩ Ȳ) follows a hyper-
geometric law and, under certain conditions, a Poisson law of param-
eter card(A)·card(B̄)/card(E). The implication intensity is defined by the func-
tion:

φ(a, b) = 1 − Pr(card(X ∩ Ȳ) ≤ card(A ∩ B̄)) = 1 − Σ_{s=0}^{card(A∩B̄)} (λ^s/s!) e^{−λ}

with λ = card(A)·card(B̄)/card(E). We can say that the quasi-implication a ⇒ b is ad-
missible at the confidence level 1 − ε if and only if φ(a, b) ≥ 1 − ε.
For example, suppose we have 100 students (card(E) = 100) who can have two behaviours a
and b, with card(A) = 6, card(B) = 15 and card(A ∩ B̄) = 1.
We can observe that the number of students (here 1) refuting the implication a ⇒ b
is improbably small under an absence-of-a-link hypothesis. In fact, φ(a, b) = 0.965, that
is to say a confidence level equal to 96.5 per cent for the implication, because the
probability that card(X ∩ Ȳ) ≤ 1 is equal to 0.035.
In (Larher 1991), the notion of statistical implication is extended to modal (or qual-
itative) variables and to numerical (or quantitative) variables.
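A minimal numerical sketch of the implication intensity under the Poisson approximation follows; the function and variable names are ours, chosen for illustration.

from math import exp, factorial

# phi(a,b) = 1 - P(Poisson(lam) <= n_counter), with
# lam = card(A) * card(B-bar) / card(E), as in Section 1.1.
def implication_intensity(n_E, n_A, n_B, n_counter):
    lam = n_A * (n_E - n_B) / n_E
    cdf = sum(lam**s / factorial(s) * exp(-lam) for s in range(n_counter + 1))
    return 1 - cdf

# Worked example from the text: 100 students, card(A) = 6, card(B) = 15,
# one counter-example; gives about 0.963, close to the quoted 0.965
# (the text's value may come from the exact hypergeometric law).
print(round(implication_intensity(100, 6, 15, 1), 3))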

1.2 Implication graph


A great interest of the statistical implication consists in studying all the vari-
ables together on a given population. We can associate with each couple (a, b) of variables a
measure of their implication intensity. This is represented by a valued oriented
edge. When the cardinals of A and B are equal, there are two oriented edges (a → b
and b → a). If we fix a condition of transitivity of the implication (generally 0.5), it is
possible to generate a transitive graph. For example, suppose we have a set of five variables
whose implication intensities greater than 0.5 are given in the following table:

~ a b c d e
a 0.97 0.73
b
c 0.82 0.975 0.82
d 0.78 0.92
e

we obtain the following graph:

[Implication graph on the five variables a, b, c, d, e]

A. Larher (Larher 1991) has proved that the order between the intensities respects
the order between the cardinals. So, for each pair of variables, we only keep the
maximal intensity of the two couples defined by this pair. One can also prove (Gras
and Larher 1992) the relation existing between the linear correlation coefficient and
the statistical implication, and the relation between the χ² of independence and the
statistical implication.

1.3 Implication between classes of variables


When we consider the above graph, we are faced with some questions:
- can we aggregate c and e?
- can we aggregate d and a?
- can we have an implication from one class to another?

This last notion has a (semantic) meaning only when the considered classes have
a good cohesion. This cohesion must be measured, and the measure is founded on the
implication intensity. For example, if we consider the class (a, b, c), we observe:
φ(a, b) = 0.97, φ(b, c) = 0.95 and φ(a, c) = 0.92. We can say that the oriented class
from a to c has a good cohesion. It would not be the case if the implication intensities
were respectively equal to 0.82, 0.38 and 0.48. We then define the cohesion of a class
as a notion which is opposed to the entropy (of the class). Then we can write (Gras
and Larher 1992):

Let p = max(φ(a, b), φ(b, a)); the entropy (in Shannon's definition) En of a class
(a, b) equals: En = −p log₂(p) − (1 − p) log₂(1 − p).
Let coh(a, b) be the implicative cohesion of the class (a, b). It is defined as follows for
a class with two elements:

coh(a, b) = 1            if p = 1
coh(a, b) = √(1 − En²)   if p ≥ 0.5
coh(a, b) = 0            if p < 0.5

The cohesion coh(a, b) is an increasing function of the implication intensity between
a and b. The cohesion of a general class C is the geometric average of the cohesions
of the elements of C taken two by two:
coh(C) = ( ∏_{i∈{1,…,r−1}, j∈{2,…,r}, j>i} coh(aᵢ, aⱼ) )^{2/(r(r−1))}   with C = (a₁, a₂, …, a_r)

For the previous example, we find coh(c, b) = 0.98, coh(d, e) = 0.91 and coh(a, (c, b)) =
0.89.
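A minimal sketch of these cohesion computations follows, assuming φ is already known for each oriented pair; the values used are those of the (a, b, c) illustration above, and the names are ours.

from math import log2, prod

def coh2(p):
    """Cohesion of a two-element class, p = max(phi(a,b), phi(b,a))."""
    if p < 0.5:
        return 0.0
    if p == 1.0:
        return 1.0
    en = -p * log2(p) - (1 - p) * log2(1 - p)   # Shannon entropy
    return (1 - en**2) ** 0.5

def coh_class(pairwise_p):
    """Geometric mean of the pairwise cohesions of a general class."""
    cohs = [coh2(p) for p in pairwise_p]
    return prod(cohs) ** (1 / len(cohs))

# Class (a, b, c) with phi(a,b) = 0.97, phi(b,c) = 0.95, phi(a,c) = 0.92:
print(round(coh_class([0.97, 0.95, 0.92]), 2))   # a good cohesion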

The implication between classes must integrate their cohesion. For a class A =
{a₁, …, a_r} and a class B = {b₁, …, b_s}, we define the implication from class A to
class B as follows:

ψ(A, B) = ( sup φ(aᵢ, bⱼ) )^{rs} · √(coh(A)·coh(B)),   i ∈ {1, …, r}, j ∈ {1, …, s}

The implication between A and B increases according to their cohesion and decreases
with their cardinal. As in the case of two variables, we retain the implication
A ⇒ B if ψ(A, B) > ψ(B, A) or the implication B ⇒ A if ψ(B, A) > ψ(A, B).
We obtain, for the example (a, (b, c)) ⇒ (d, e), an implication intensity equal to 0.27.
Using the classical algorithm of hierarchical clustering, we can build a hierarchy.
However, here we refuse the aggregation of two classes if the cohesion of the resulting
class is equal to zero. For the example, we obtain:

[Implicative hierarchy grouping a with (c, b), and d with e]
An appropriateness between the partition and the numerical index of the implicative
clustering occurs at some nodes of the hierarchical tree. These nodes are said to be
significant and are studied in (Ratsimba-Rajohn 1992) and in (Gras and Ratsimba-
Rajohn 1996). One also finds in these papers a statistical tool which attributes to a
given class of the hierarchy the objects and sets of objects which contribute the most
to this class. A software package, CHIC (Ag Almouloud 1992), is available. We shall now
present an application of this method.

1.4 Use in learning algorithms


H. Briand's bibliographies, linking the initial type of problems with those concerning
artificial intelligence, allow a critical reflection on the classical tools for
the processing of these types of problems. The statistical implication, as newly
defined, is resistant to noise, converges with the size of the sample, eliminates
trivial rules and can be used with an incremental algorithm.
Furthermore, by expanding the application of the concepts of statistical implication,
we have defined, as is done in similarity analysis, the notion of a significant node and
of a significant level.
Finally, we have defined the notion of "supplementary variables", as in factorial
analysis. These variables can help to explain the meaning of the classes and which objects
are responsible for their formation.

2 An experimental knowledge discovery system Fiable1

2.1 A set of examples in human resources
We worked on populations in which each person is represented by behaviour
features. These behaviour features are used by our system (the knowledge discovery
system Fiable1) to discover rules used by the decision aid system DIAL ECHO of
the company PerformanSe at Nantes, France. In this context a person may be charac-
terized by ten behaviour features: aggressive (P), anxious (N), self-confidence (EST),
need of affiliation (AFL), achievement (ACH), power (LED), intellectual dynamism
(CLV), professional conscience (CON), receptivity (REC), extraversion (E). These
ten features are defined by the expert. Each one may take one of these
values: average (0), negative (-), positive (+). Example: N+ : the person is very
anxious; N- : the person is not anxious; N0 : the person has an average degree of
anxiety.

The objective of this experiment is to provide the expert with the different relations
between the different behaviour features in a population. Two populations have been
examined: sales representatives of the firm (60 persons) and workers of the same firm (40 per-
sons).

With the agreement of the expert, the number of examples (Cab), the conditional
probability of having b true if a is true (Pab), and the intensity of implica-
tion (Pfab) are the same in the discovery process on the two populations.

2.2 The relations between concepts: results and use by the expert

Relation: Commercial
Condition of discovery: Cab ≥ 5 and Pab ≥ 0.8 and Pfab ≥ 0.95
This condition, defined by the expert, represents the knowledge discovery heuristics in
the two populations.

Rules without counter-examples

IF E = E+ AND N = N- THEN EST = EST+ : 5 100 96 (Cab = 5, Pab = 1, Pfab = 0.96)
E = E+ : Extraversion is positive
N = N- : Anxiety is negative
EST = EST+ : Self-confidence is positive

There are 5 examples which verify this rule, whose conditional probability is equal to
1 and whose intensity of implication is equal to 0.96. We notice that the condition of
discovery is satisfied; this rule has no counter-example (it is a "logical" rule). This rule
means that if the sales representative has positive extraversion and negative anxiety, then his
self-confidence is positive.

IF E = E0 AND P = P0 AND EST = EST0 THEN AFL = AFL0 : 6 100 96

IF E = E0 AND N = N0 AND EST = EST0 THEN ACH = ACH0 : 8 100 95

Rules with counter-examples

IF E = E0 AND EST = EST+ THEN REC = REC0 : 7 87 96

With the help of the expert psychologist, we will comment on some of the discovered rules.
Their analysis permits us to confirm and to improve the expertise; it permits us to
validate the adequacy of the theoretical model of the construct to the measured phenom-
ena. We know that psychometry assigns dimensions to qualitative objects that
are not measurable.

An example which confirms psychometric theory


IF P = P+ AND CLV = CLV- THEN REC = REC- : 13 81 98

If a person (sales representative) is aggressive and his intellectual dynamism is not very im-
portant, then he does not listen to other persons (no reception from another person).
The coverage rate is equal to 0.22 (number of examples that satisfy the rule (13) /
number of all examples (60)), and there are two counter-examples.

Some rules which prove that the same behaviour may be deduced from
different premises
IF EST = EST0 AND AFL = AFL- AND REC = REC- THEN P = P+ : 10 100 98
IF LED = LED+ AND REC = REC- THEN P = P+ : 12 100 99

If a person is animated by power motivation and he does not listen to other persons, then
he is aggressive. The rate of coverage is equal to 0.2 and there is no counter-example.

2.3 Explanation of the concepts from the items


The learning set is composed of 864 examples and constitutes a medium-size sample from
a student population given by the CIO (centre for information and careers advis-
ing). The goal of this study is to explain the ten base concepts. Each concept
is characterized by three conceptual classes (cf. 2.1). The concepts are determined
from a set of items. The results are used by the expert to validate the set of items
against the concepts to be determined. This complements the more classical process
used in psychometry. The discovered rules are of the form:

IF itemᵢ = (1,2) AND ··· AND itemⱼ = (1,2) THEN concept = concept(-,0,+).


We present here the explanation of extraversion from the items. The expert can
re-examine "item41", which seems to determine E0 whatever the answer
(rules 1 and 7).

1. IF item41 = 1 AND item47 = 1 AND item65 = 1 THEN E = E0 : 151 81 100

2. IF item15 = 2 AND item41 = 1 AND item65 = 1 THEN E = E+ : 154 82 100

3. IF item4 = 2 AND item7 = 1 AND item43 = 1 THEN E = E0 : 151 81 99

4. IF item10 = 1 AND item43 = 1 THEN E = E- : 90 82 100

5. IF item10 = 1 AND item30 = 1 AND item66 = 1 THEN E = E0 : 57 82 100

6. IF item10 = 1 AND item41 = 2 AND item43 = 2 THEN E = E- : 31 83 100

7. IF item41 = 2 AND item47 = 1 AND item65 = 1 THEN E = E0 : 44 84 100

8. IF item4 = 2 AND item7 = 1 AND item62 = 1 THEN E = E+ : 57 81 100

9. IF item7 = 1 AND item42 = 1 AND item53 = 1 THEN E = E0 : 43 81 96

10. IF item3 = 1 AND item47 = 1 THEN E = E- : 17 85 99

11. IF item7 = 2 AND item42 = 1 THEN E = E0 : 14 82 95

The overall results are summarized in the next table where we can notice, among
other things, an excellent distribution, for each concept, of the number of explicative rules
and of the number of covered examples in each class of definition of the concept.

            N1               N2            N3
        -    0    +      -   0   +     -    0    +
E      142  462  220     3   6   2    138  460  211
P      193  386  245     3   6   4    186  374  242
N      186  438  200     5   6   3    185  428  190
ACH    178  478  168     4   6   2    173  474  157
CLV    205  242  377     7  12  10    194  239  374
CON    226  418  180     6  13   6    222  407  180
EST    163  495  166     2   4   2    154  494  162
LED    148  482  194     3   7   4    143  476  188
AFL    213  300  311     6   9   5    204  194  305
REC    220  384  220     6  14   7    205  382  218

where N1 is the number of examples which are members of the concept,
N2 is the number of rules discovered per concept, and
N3 is the number of examples covered by the rules of each concept.

2.4 Conclusion
After the extraction process, the database contains many rules, many of which are
not interesting, accurate and useful enough with respect to the end user's objectives.
The quality of each generated rule has to be verified, so we need probabilistic tests
to check whether the rule actually describes some regularity in the data. The evaluation
thus has to determine the usefulness of the extracted rules and decide which to save in the
database. If a rule is not valid according to the indexes, and more precisely the intensity
of the implication, then it will be considered uninteresting and will not be saved in
the database.

In the context of the collaboration between the Knowledge and Information System team
at IRIN and the firm PerformanSe SA, studies of other populations are underway in
the sport and education domains.

3. References

Ag Almouloud, S. (1992): L'ordinateur, outil d'aide à l'apprentissage de la démonstration et traitement de données didactiques. Thèse de Doctorat de l'Université de Rennes I.

Briand, H. et al. (1995): Mesure statistique de la robustesse d'une implication pour l'apprentissage symbolique. Prépublication IRMAR 10-1995, Rennes.

Gras, R. (1979): Contribution à l'étude expérimentale et à l'analyse de certaines acquisitions cognitives et de certains objectifs didactiques. Thèse d'Etat, Université de Rennes I.

Gras, R. and Ratsimba-Rajohn, H. (1996): Analyse non symétrique de données par l'implication statistique. RAIRO - Recherche opérationnelle, 30-3, AFCET, Paris.

Gras, R. et al. (1996): Structuration sets with implication intensity. Proceedings of the International Conference on Ordinal and Symbolic Data Analysis - OSDA 95, Diday, E., Lechevallier, Y., Opitz, O. (eds.), Springer, Paris.

Larher, A. (1991): Implication statistique et applications à l'analyse de démarches de preuve mathématique. Thèse de Doctorat de l'Université de Rennes I.

Lerman, I.C. et al. (1981): Evaluation et élaboration d'un indice d'implication pour des données binaires I et II. Mathématiques et Sciences Humaines, 74, 5-35 et 75, 5-47.

Ratsimba-Rajohn, H. (1992): Contribution à l'étude de la hiérarchie implicative. Application à l'analyse de la gestion didactique des phénomènes d'ostension et de contradictions. Thèse de Doctorat de l'Université de Rennes I.

Totohasina, A. (1992): Méthode implicative en analyse des données et application à l'analyse de conceptions d'étudiants sur la notion de probabilité conditionnelle. Thèse de Doctorat de l'Université de Rennes I.
Part V

Correspondence Analysis, Quantification Methods, and Multidimensional Scaling

• Correspondence Analysis and Its Application

• Classification of Textual Data

• Multidimensional Scaling
Correspondence Analysis,
Discrimination, and Neural Networks
Ludovic Lebart
Centre National de la Recherche Scientifique
École Nationale Supérieure des Télécommunications
46 rue Barrault, 75013, Paris, France.

Summary: Correspondence Analysis of contingency tables (CA) is closely related to a


particular Supervised Multilayer Perceptron (MLP) or can be described as an Unsupervised
MLP as well. The unsupervised MLP model is also linked to various types of stochastic
approximation algorithms that mimic the cognition process involved in reading and
comprehending a data table.

1. CA: a tool at the junction of many different methods


Correspondence Analysis of contingency tables (CA), independently discovered by
various authors, can be presented from nearly as many points of view. It can be viewed,
for example, as a particular case of both Linear Discriminant Analysis (LDA) (performed
on dummy variables) and Singular Value Decomposition (SVD) (performed after a proper
scaling of the original data). After the seminal papers of Guttman (1941), Hayashi (1956)
and Benzecri (1969a), various presentations of CA can be found in the available literature
(see, for instance, Lebart et al. (1984), Greenacre (1984), Gifi (1990), Benzecri (1992),
Gower and Hand (1996)).
In the context of neural networks - cf. the recent reviews of this fast-growing field by
Cheng and Titterington (1994), Murtagh (1994), Ripley (1994) - Correspondence
Analysis is also at the meeting point of many different techniques.
It can be described as a particular supervised Multilayer Perceptron (MLP, section 2) (in
that case, the input and the output layers are respectively the rows and the columns of the
contingency table) or as an unsupervised Multilayer Perceptron (UMLP, section 3) (in
such a case the input layer, and the output layer as well, could be the rows, whereas the
observations - also named examples, or elements of the training set - could be the columns
of the table).
In both situations, the networks use the identity function as the transfer function.
More general transfer functions might lead to interesting non-linear extensions of the
method.
CA can also be obtained from Linear Adaptive Networks (section 4), a series of methods
closely related to stochastic approximation algorithms.

2. A particular supervised Multilayer Perceptron


2.1 Reminder about the Multilayer Perceptron
Equivalence between Linear Discriminant Analysis and supervised Multilayer Perceptron
(when transfer functions are identity functions) has been proved by Gallinari et al. (1988)
and generalized to the case of more general models (such as non-linear discriminant
analysis) by Asoh and Otsu (1989).

A general framework (see, e.g., Baldi and Hornik (1989)) can deal simultaneously with
the supervised and the unsupervised cases.
Let X be the (n, q) matrix whose n rows contain the n observations of an input q-vector,
and let Y be the (n, p) matrix containing (as rows) the n observations of an output p-
vector.
A designates the (q, r) matrix of weights (a_jm) (see Fig. 1) before the hidden layer, and B
the (r, p) matrix of weights (b_mk) following it (r ≤ p and r ≤ q).

Fig. 1: Perceptron with one hidden layer (i-th observation): input x_i1, ..., x_iq; output y_i1, ..., y_ip

A perceptron with a unique hidden layer is a model of the form:

ŷ_ik = Ψ( Σ_{m=1}^{r} b_mk Φ( Σ_{j=1}^{q} a_jm x_ij + α_m ) + β_k )   (1)

In the case of identity transfer functions (Φ and Ψ) and null constant terms, the model
collapses to the simpler form:

ŷ_ik = Σ_{m=1}^{r} Σ_{j=1}^{q} a_jm b_mk x_ij,   i.e.,   Ŷ = XAB   (2)
2.2 Estimating the parameters


The np equations (2) are summarized by:

Y = XAB + E   (3)

Denoting by Mᵀ the transpose of a matrix M, the loss function to be minimized can be
written:

f = trace EᵀE = trace (Y − XAB)ᵀ(Y − XAB),

under the constraint:

BBᵀ = I_r   (I_r is the identity (r, r) matrix).

This last constraint is introduced to remedy the indeterminacy of the model, since for any
non-singular (r, r) matrix H, AH and H⁻¹B are solutions of the minimization problem as
well as A and B.
A and B could be estimated through a back-propagation algorithm, complemented with an
orthonormalization of the rows of B at each step.
Since we are dealing here with the simpler case of identity transfer functions, we will
focus on a direct analytical solution.
The minimization of f leads to equations (4) and (5):

BYᵀX = AᵀXᵀX,   (4)

YᵀXA = BᵀL   (5)

(L is an (r, r) matrix of Lagrange multipliers).
Equations (4) and (5), together with the previous constraint, lead to the following
equation:

MBᵀ = BᵀL,

the matrix M being defined as:

M = YᵀX(XᵀX)⁻¹XᵀY   (6)

We get a new expression for the criterion f:

f = trace YᵀY − trace L.

Minimizing f is then equivalent to maximizing trace L.
We can easily deduce from the preceding relationships and from this new criterion that L is
a diagonal matrix containing the r largest eigenvalues of M as diagonal elements, the r
rows of B being the corresponding unit eigenvectors.
We can then derive the value of A:

A = (XᵀX)⁻¹XᵀYBᵀ

This formula provides a generalization of that obtained in simultaneous multiple
regression, since (3) can be written:

Y = XW + E,   (with W = AB).

This generalization concerns the new situation where the matrix of coefficients W
undergoes a rank constraint.
Note that:

P = X(XᵀX)⁻¹Xᵀ

is the (idempotent) projector onto the subspace spanned by the columns of X.
Hence:

M = (PY)ᵀ(PY).

Thus, the Multilayer Perceptron performs a projected principal axes analysis of Y, the
projection being performed onto the space spanned by the columns of X. This analysis is
also a projected Principal Component Analysis if Y is centered columnwise.
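The derivation above can be checked numerically; the following is a hedged sketch on random data, not the author's code, showing that the rank-r least-squares fit of Y on X is obtained from the dominant eigenvectors of M.

import numpy as np

# Rank-r least-squares solution Y ~ XAB via the top eigenvectors of
# M = Y'X (X'X)^{-1} X'Y, as in equations (4)-(6); toy data only.
rng = np.random.default_rng(0)
n, q, p, r = 200, 6, 5, 2
X = rng.normal(size=(n, q))
Y = X @ rng.normal(size=(q, p)) + 0.1 * rng.normal(size=(n, p))

M = Y.T @ X @ np.linalg.inv(X.T @ X) @ X.T @ Y
eigval, eigvec = np.linalg.eigh(M)          # ascending eigenvalue order
B = eigvec[:, -r:].T                        # rows: r dominant unit eigenvectors
A = np.linalg.inv(X.T @ X) @ X.T @ Y @ B.T  # the formula for A in the text
W = A @ B                                   # coefficients under the rank constraint
print(np.linalg.matrix_rank(W), round(np.sum((Y - X @ W) ** 2), 2))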

2.3 The case of binary disjunctive data


When Y and X are binary disjunctive tables (dummy variables describing two partitions of
the n observations into p and q classes), the matrix C defined as:

C = YᵀX

is the (p, q) contingency table crossing the two partitions.
The matrices D_q and D_p such that:

D_q = XᵀX   (resp. D_p = YᵀY)

are the diagonal matrices whose q (resp. p) diagonal elements are the counts of the q classes
(resp. p classes).
This particular Multilayer Perceptron, whose training entails the diagonalization of the
matrix M:

M = C D_q⁻¹ Cᵀ

performs a Non-Symmetrical Correspondence Analysis (Lauro and D'Ambra (1984)) of
the contingency table C.
A classical Correspondence Analysis would imply a diagonalization of the matrix M* such
that:

M* = D_p⁻¹ C D_q⁻¹ Cᵀ

Note that M* involves symmetrically the two sets (the p columns of Y on the one hand, the q
columns of X on the other).
The Multilayer Perceptron will coincide with Correspondence Analysis if D_p is a scalar
matrix (all the p classes have the same number of elements) or if the output matrix Y has
been properly re-scaled during a preliminary step into Ỹ according to the following
formula:

Ỹ = Y D_p^{-1/2}

The new matrix to be diagonalized:

M̃ = D_p^{-1/2} C D_q⁻¹ Cᵀ D_p^{-1/2}

has the same eigenvalues as M*, and has eigenvectors that can easily be derived from
those of M*.

3. An unsupervised Multilayer Perceptron


In auto-associative neural networks, the output Y coincides with the input X. The
common value of X and Y is denoted by Z.

Fig. 2: Auto-association strangulated network (input, narrower hidden layer, output)

It is an apparently trivial situation. In fact, these networks are of great interest if the hidden
layer is narrower than the others, thus realizing a compression of the input signal (Fig. 2).
Bourlard and Kamp (1988) and Baldi and Hornik (1989) have stressed the link between SVD
- and consequently Principal Component Analysis (PCA) - and these particular networks.
The proof is straightforward if we replace both Y and X by Z in the formulas obtained in
the previous section.

In this context, the matrix M given by equation (6) is nothing but the product-moment
matrix ZᵀZ.
In this setting, the equivalence with Correspondence Analysis is obtained if Z is derived
from a contingency table K according to the transformation (with the usual notations):

z_ij = (k_ij − k_i. k_.j) / √(k_i. k_.j)   (7)

Note that the nature and the size of the input data involved in the two approaches of sections
2 and 3 are radically different.
The network of section 2 is "fed" by n individual observations. It learns how to predict the
output category corresponding to observation i from the knowledge of its input category.
The network of section 3 is fed simultaneously by q observations of p categories (rows of
Z) or equivalently by p observations of q categories (columns of Z). It learns how to
summarize the input information.
Note that section 3 deals with properties common to Principal Component Analysis and
Correspondence Analysis.
4. A Linear Adaptive Network


4.1 Brief review of some computational techniques involved in CA
Several distinct computational algorithms can be involved in Correspondence Analysis:
reciprocal averaging, iterated power, QR and QL algorithms, the Jacobi method and its
generalizations, the Lanczos method, as well as other classical numerical procedures for SVD
(see, for example, Parlett (1980)).
The use of the back-propagation method and other techniques usually associated with the
Multilayer Perceptron provides new numerical approaches and a better insight into the
method. The unsupervised MLP model is also closely related to various types of stochastic
approximation algorithms that could roughly outline the cognition process involved in
perusing a data table. These algorithms are able to tackle huge data sets like those
encountered in Automatic Information Retrieval.
Benzecri (1969b) and Krasulina (1970) have independently proposed stochastic
approximation algorithms for determining the largest eigenvalues of the expectation of a
random matrix. Lebart (1974) has given a numerical proof of the convergence of Benzecri's
algorithm, and shown its interest in the case of sparse data matrices, such as those
involved in Multiple Correspondence Analysis. Oja and Karhunen (1981) have proposed
similar algorithms, adding new proofs and developments, reinforced by the results of
Kushner and Clark (1978). The first mention of neural networks can be found in Oja
(1982), who has since proposed a wide variety of algorithms (see Oja (1992)).

4.2 Basics of stochastic approximation algorithms


From our point of view, the basic idea is as follows:
X being the (n, p) matrix of properly re-scaled data, the product-moment matrix XᵀX can
be written as a sum of n terms A_i:

XᵀX = Σ_{i=1}^{n} A_i

with:

A_i = x_i x_iᵀ   (x_i being the i-th column of Xᵀ)

The classical iterated power algorithm can then be performed using this decomposition
(cf. Wold (1966)), taking advantage of the possible sparsity of the data matrix X.
Starting from a random vector u_0, the step k of this algorithm, after setting u_k = 0,
consists of n assignments such as:

for i = 1 to i = n, do: u_k ← u_k + A_i u_{k−1}   (8)

The vector u_{k−1} remains unchanged during the whole step k.

We can try to improve the algorithm by modifying the estimate of u_k during each
assignment, according to the process:

for j = 1 to j = ∞, do: u_j ← u_{j−1} + γ(j) A_{i(j)} u_{j−1}   (9)

where γ(j) is a gain parameter.
During each step k, the index i(j) of the matrix A takes values 1 to n.
At step k: i(j) = j − (k−1)n.
To ensure the convergence of u_j towards the largest eigenvector of XᵀX, the series Σγ(j)
must diverge whereas the series Σγ²(j) must converge. The series γ(j) could be chosen
among series closely related to the harmonic series, such as γ(j) = a/(b+j).
In fact, during step k, the iterated power algorithm (algorithm (8)) involves the operator:

Σ_i A_i   (10)

whereas the stochastic approximation algorithm (algorithm (9)) replaces the operator (10)
with the operator:

∏_i (I + γ(j) A_i)   (11)
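A hedged sketch of algorithm (9) on random data follows; it estimates the dominant eigenvector of XᵀX in one pass with gain γ(j) = a/(b+j), where the data, constants, and the per-step normalization are our illustrative choices.

import numpy as np

# One reading of the data with the stochastic approximation update (9).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)) @ np.diag([4, 2, 1, 1, 1, 1, 1, 1])
u = rng.normal(size=8)
a, b = 1.0, 10.0
for j, x in enumerate(X, start=1):
    u = u + (a / (b + j)) * x * (x @ u)   # A_i u = x (x'u), A_i never formed
    u /= np.linalg.norm(u)                # normalization for numerical comfort;
                                          # the text notes it is not needed at
                                          # every reading
w = np.linalg.eigh(X.T @ X)[1][:, -1]     # exact dominant eigenvector
print(round(abs(u @ w), 3))               # close to 1 once aligned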

4.3 Stochastic approximation versus iterated power


Actually, if the terms of the series γ(j) are small enough, the two operators defined by (10)
and (11) have similar unit eigenvectors. However, if the terms of the series γ(j) are not too
small, the operator (11) may have more separated eigenvalues, inducing a faster
convergence of algorithm (9). Therefore, there is a trade-off between two options: fast
convergence towards approximate eigenvectors, or slower convergence towards the exact
values.
After several steps, because of the decrease in the values of γ(j), operator (10) is definitely
superior to operator (11).
In this sense, algorithm (9) can be considered as a mere technique of acceleration of
algorithm (8) (Lebart (1982)).
Unlike algorithm (8), algorithm (9) depends on the order of the A_i within the sequence (A_1, A_2,
..., A_i, ..., A_n). It can be shown that the speed of convergence can be improved if two
consecutive sequences are read in reverse order (Lebart (1974)).
Both linear adaptive networks corresponding to algorithms (8) and (9) can produce
several eigenvectors simultaneously, provided that orthonormalizations are carried out with
a frequency that depends on the available precision. It is by no means necessary to
orthonormalize the estimates of the eigenvectors at each reading (i.e., for each value of the
index j when using algorithm (9)).
It must be stressed that stochastic approximation algorithms such as algorithm (9)
converge very slowly, their convergence being based on the divergence of the harmonic
series. Iterated power algorithms (8) (whose first steps could be sped up by using
stochastic approximation (9)) perform well if they confine themselves to finding an s-
dimensional space V_s containing the t first eigenvectors (with t ≪ s). Then, the t
dominant eigenvectors (and their corresponding eigenvalues) can be efficiently computed
through a classical diagonalization algorithm applied to the (s, s) product-moment matrix
obtained after projection onto the subspace V_s.

5. References
Asoh, H. and Otsu, N. (1989): Nonlinear data analysis and multilayer perceptrons. IEEE IJCNN-89, 2, 411-415.
Baldi, P. and Hornik, K. (1989): Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2, 53-58.
Benzecri, J.-P. (1969a): Statistical analysis as a tool to make patterns emerge from clouds. In: Methodology of Pattern Recognition, Watanabe, S. (ed.), Academic Press, 35-74.
Benzecri, J.-P. (1969b): Approximation stochastique dans une algèbre normée non commutative. Bull. Soc. Math. France, 97, 225-241.
Benzecri, J.-P. (1992): Correspondence Analysis Handbook. Marcel Dekker, New York.
Bourlard, H. and Kamp, Y. (1988): Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291-294.
Cheng, B. and Titterington, D.M. (1994): Neural networks: a review from a statistical perspective. Statistical Science, 9, 2-54.
Gallinari, P., Thiria, S. and Fogelman-Soulie, F. (1988): Multilayer perceptrons and data analysis. International Conference on Neural Networks, IEEE, I, 391-399.
Gifi, A. (1990): Non Linear Multivariate Analysis. J. Wiley, Chichester.
Greenacre, M. (1984): Theory and Applications of Correspondence Analysis. Academic Press, London.
Guttman, L. (1941): The quantification of a class of attributes: a theory and method of scale construction. In: The Prediction of Personal Adjustment, Horst, P. (ed.), 251-264, SSRC, New York.
Hayashi, C. (1956): Theory and examples of quantification (II). Proc. of the Institute of Statist. Math., 4 (2), 19-30.
Hornik, K. (1994): Neural networks: more than "statistics for amateurs". In: COMPSTAT, Dutter, R. and Grossmann, W. (eds.), Physica Verlag, Heidelberg, 223-235.
Krasulina, T. P. (1970): Method of stochastic approximation in the determination of the largest eigenvalue of the mathematical expectation of random matrices. Automation and Remote Control, Feb., 215-221.
Kushner, H. and Clark, D. (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, New York.
Lauro, N. C. and D'Ambra, L. (1984): L'analyse non-symétrique des correspondances. In: Data Analysis and Informatics, III, Diday et al. (eds.), North-Holland, 433-446.
Lebart, L. (1974): On Benzecri's method for finding eigenvectors by stochastic approximation. COMPSTAT, Proceedings in Computational Statistics, Physica Verlag, Vienna, 202-211.
Lebart, L. (1982): Exploratory analysis of large sparse matrices with application to textual data. COMPSTAT, Proceedings in Computational Statistics, Physica Verlag, Vienna, 67-76.
Lebart, L., Morineau, A. and Warwick, K. (1984): Multivariate Descriptive Statistical Analysis. J. Wiley, New York.
Murtagh, F. (1994): Neural networks and related massively parallel methods of statistics: a short overview. International Statistical Review, 62, 275-288.
Oja, E. (1982): A simplified neuron model as a principal components analyzer. J. of Math. Biology, 15, 267-273.
Oja, E. (1992): Principal components, minor components, and linear neural networks. Neural Networks, 5, 927-935.
Oja, E. and Karhunen, J. (1981): On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Report of the Helsinki University of Technology (Dept of Technical Physics), Otaniemi, Finland.
Parlett, B. N. (1980): The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs, N.J.
Ripley, B. D. (1994): Neural networks and related methods of classification. J. R. Statist. Soc. B, 56, 3, 409-456.
Wold, H. (1966): Estimation of principal components and related models by iterative least squares. In: Multivariate Analysis, Krishnaiah et al. (eds.), Academic Press, New York, 391-420.
Exploratory data analysis for
Hayashi's quantification method III by graphics
Tsutomu Komazawa and Takahiro Tsuchiya
The Institute of Statistical Mathematics
4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan

Summary: This paper gives an illustration of exploratory data analysis with Hayashi's
Quantification Method III using graphics (hereafter we abbreviate this method as HQM III).
It is shown that artificial data sets with ordinal structures can be expressed on the surface
of a torus, and it is also pointed out that the torus suggests the results of HQM III. Some
applications of HQM III to medical data using the graphical configuration are reported.

1. Introduction
Hayashi's quantification method III was presented by Hayashi in 1956. Correspondence analysis was later developed by the school of Benzécri in France in 1973. Artificial data sets with one-dimensional structure have been proposed by Guttman (1950), Iwatsubo (1987) and Okamoto, among others. Iwatsubo (1987) also presented data with a circular structure.
The purpose of this report is to illustrate a graphical representation of qualitative data with ordinal structures and the usage of such graphics for exploratory analysis. We show that it is possible to express on the surface of a torus a family of artificial data sets with ordinal structures, which includes the Guttman data and the Iwatsubo data. Hayashi's quantification method III is used to illustrate these data sets by a graphical configuration in three-dimensional space.

2. Solution of Hayashi's Quantification Method III (HQM III) for Item-category Data

We summarize the solution of HQM III for item-category data. The data matrix is

    D = {δ_i(jk)},

where δ_i(jk), n_(jk) and n_(jk)(uv) are given by

    δ_i(jk) = 1 if subject i chooses category k of item j, and 0 otherwise,

    n_(jk) = Σ_{i=1}^{n} δ_i(jk),    n_(jk)(uv) = Σ_{i=1}^{n} δ_i(jk) δ_i(uv),

for i = 1, 2, ..., n; j, u = 1, 2, ..., m; k = 1, 2, ..., ℓ_j; v = 1, 2, ..., ℓ_u, where n is the number of subjects, m is the number of items, and ℓ_j (or ℓ_u) is the number of categories of item j (or item u).


The solution of this problem of HQM III is given by

    AX = λX,

where

    A = {a_(jk)(uv)},   X = {x_(jk)},

    a_(jk)(uv) = (1 / (m · n_(jk))) { n_(jk)(uv) − n_(jk) · n_(uv) / n }.

The eigenvector X of matrix A is the solution giving scores to the item-categories, and the eigenvalue is the square of the correlation coefficient, r², between subject and item-category.
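As an illustration, this eigenproblem can be set up directly from the indicator matrix D; the following sketch is ours (the function and variable names are not from the paper) and assumes every category is chosen at least once, so that no n_(jk) is zero.

    import numpy as np

    def hqm3(D, m):
        """D: (n_subjects, total_categories) 0/1 indicator matrix; m: number of items.
        Returns eigenvalues (squared correlations) and category score vectors of A."""
        n = D.shape[0]
        njk = D.sum(axis=0).astype(float)        # n_(jk)
        njkuv = D.T @ D                          # n_(jk)(uv)
        A = (njkuv - np.outer(njk, njk) / n) / (m * njk[:, None])
        vals, vecs = np.linalg.eig(A)            # A is not symmetric in general
        order = np.argsort(-vals.real)
        return vals.real[order], vecs.real[:, order]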

3. Torus Model
The following artificial data sets are representative data with a one-dimensional structure.

(1) Guttman's perfect scale data


(2) Iwatsubo's (or Komazawa's) circular data

Generally, the data matrix D is given by a square matrix (n × n), where n is the number of subjects, m is the number of items, and ℓ is the number of categories of an item, with

    n = ℓ · m.

Then this data matrix D = {δ_ij} is given by:

(1) for i ≤ n − m + 1:
    δ_ij = 1 if 0 ≤ j − i < m, and δ_ij = 0 otherwise;

(2) for i > n − m + 1:
    δ_ij = 0 if 1 ≤ i − j ≤ n − m, and δ_ij = 1 otherwise.

Table 1.1 shows a few examples generated from a matrix with n, m and ℓ being 20, 10 and 2, respectively.

1) The entire 20 x 20 matrix in Table 1.1 presents a circular model.


2) The 11 x 20 top half of the matrix of Table 1.1 is an example of Guttman's
perfect scale model.

3) The 10 x 10 matrix, surrounded by a dotted line, is an example of Guttman's


unidimensional scale model.
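A small sketch, with hypothetical function and variable names of our own, that generates the matrix D defined above; the full matrix gives the circular model, and its top n − m + 1 rows give Guttman's perfect scale model:

    import numpy as np

    def circular_data(n=20, m=10):
        # definitions (1) and (2) above, with 1-based indices i, j
        D = np.zeros((n, n), dtype=int)
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if i <= n - m + 1:
                    D[i - 1, j - 1] = int(0 <= j - i < m)
                else:
                    D[i - 1, j - 1] = int(not (1 <= i - j <= n - m))
        return D

    D = circular_data()      # Table 1.1: circular model (n = 20, m = 10, l = 2)
    guttman = D[:11]         # top half: Guttman's perfect scale model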

Table 1.1
[The 20 × 20 data matrix D for n = 20, m = 10, ℓ = 2, with filled circles for category 1 of items 1-10 and open circles for category 2. Dotted lines delimit the 11 × 20 top half (Guttman's perfect scale model) and the 10 × 10 submatrix (Guttman's unidimensional scale model).]

Table 1.2
[The same matrix with the columns interleaved by item (item 1 categories 1 and 2, item 2 categories 1 and 2, ...) and the rows rearranged in similarity order.]

Table 1.2 arranges Table 1.1 in corresponding similarity order.



Those data matrices can be indicated on the surface of a torus as in Fig. 1.

[Fig. 1: Torus model, showing the cross-cut and annular directions]


In this figure of the torus, the column elements (item-categories) of each model can be represented in the cross-cut direction, and the row elements (subjects) in the annular direction. Using this graphical method, the three models are plotted in three-dimensional space, together with the results obtained from Hayashi's quantification method III, in Figures 2-4. Notice the remarkable similarities between the two kinds of graphs in many respects.

Fig. 2(a) Circular model    Fig. 2(b) Its configuration of HQM III

Fig. 3(a) Guttman's perfect scale model    Fig. 3(b) Its configuration of HQM III

Fig. 4(a) Guttman's unidimensional scale model    Fig. 4(b) Its configuration of HQM III

4. Application of the method (HQM III) to medical data

In this section, we show the results of HQM III on an examination of heart functions. Let us consider a data set from a treadmill exercise study (Arai et al., 1984). This consists of data from 32 subjects (n = 32) on five variables (m = 5), and as shown in Table 2 the data set is augmented by item 6, a classification variable which will be used later.

Table 2 Data on Five Variables of the Heart Function from 32 Subjects

Items: 1 (HB), 2 (SBP), 3 (DBP), 4 (MBP), 5 (PP), 6 (G)
Categories per item: 3, 4, 3, 4, 3, 2 (graded levels from − through ± and + to ++; G distinguishes the Normal and Abnormal groups)

[The body of the table is an indicator matrix: each of the 32 subjects has a single 1 under one category of each item; the category positions of the 1s are not recoverable from this extraction.]

[Notes] 1. HB: Heart Beat  2. SBP: Systolic Blood Pressure  3. DBP: Diastolic Blood Pressure  4. MBP: Mean Blood Pressure  5. PP: Pulse Pressure  6. G: Group

Table 3 Data Rearranged in Order by HQM III

The columns (item-categories) are arranged in the order found by HQM III:

    Item No.:      1 2 4 3 5 2 4 3 1 2 1 4 5 3 2 5 4
    Category No.:  1 1 1 1 1 2 2 2 2 3 3 3 2 3 4 3 4

The rows (subjects) are arranged by their HQM III scores:

    Subject   Score      Subject   Score
       6      1.346        16     -0.085
      12      1.334        18     -0.085
      13      1.120        25     -0.103
       9      1.116        24     -0.141
       8      1.022        27     -0.141
       4      1.010        23     -0.197
       3      1.006        31     -0.202
       5      0.890        22     -0.447
      11      0.796        30     -0.506
       7      0.780        21     -0.810
       1      0.566        26     -1.198
      10      0.566        29     -1.213
      19      0.566        32     -1.697
      14      0.318        15     -1.751
      28      0.318        17     -1.999
       2      0.163        20     -2.342

[The 1s of the rearranged indicator matrix and the category score row are not recoverable from this extraction.]


Fig. 5 Evaluation of Heart Function (Examination Variables)

Figure 5 is a plot of the five variables with their coordinates in the three-dimensional space, and Figure 6 shows a plot of the subjects in the three-dimensional space. In these figures, we use 'ball and polygon star' symbols to represent the categories of an item and the subjects, for convenience. Smooth round balls indicate normal conditions or normal subjects, while their successive changes toward stars indicate increased abnormality, as indicated on the scale shown in Figure 5. Figures 5 and 6 reveal more precise information about the ordinal data structure than the torus representation.

[Plot of the 32 subjects; legend: ● Normal, ✩ Abnormal]

Fig. 6 Evaluation of Heart Function


(Subjects)

5. Conclusion

In this paper, we presented the use of the torus model and its comparison with HQM III. Artificial data by Guttman, Iwatsubo and Komazawa were used to investigate their properties by graphic data processing, specifically the torus representation and the graphical display based on HQM III. It was shown that graphical displays are helpful to clarify otherwise hidden properties of those data.

Acknowledgments

We would like to thank the editor and referees for a careful reading of the manuscript and for comments that improved it. Anonymous referee "A" was particularly helpful in detecting typographical errors and bringing our attention to relevant references.

References:
Arai, C., Komazawa, T., et al. (1984): The effect of hypoxanthine riboside on integral value % for various hemodynamic parameters under upright treadmill exercise. The Journal of Japanese College of Angiology, 24, 1, 75-81 (in Japanese) & 90 (abstract in English).
Guttman, L. (1950): The principal components of scale analysis. In: Measurement and Prediction, Stouffer, S.A. et al. (eds.), 312-361, Wiley, New York.
Hayashi, C. (1956): Theory and examples of quantification, II. Proc. of ISM, 4, 1, 19-30 (in Japanese).
Iwatsubo, S. (1987): The Foundation of Quantification Methods. Asakura-shoten, Tokyo (in Japanese).
Komazawa, T. and Tsuchiya, T. (1995): Exploratory data analysis for Hayashi's quantification method III by graphics: Its application of ordinal structure analysis to the ninth nationwide survey of the Japanese national character. Proc. of ISM, 43, 1, 161-176 (in Japanese).
Exploring Multidimensional Quantification Space
Shizuhiko Nishisato
The University of Toronto
OISE/UT, 252 Bloor Street West
Toronto, Ontario, Canada M5S 1V6

Summary: Dual scaling deals with two distinct types of categorical data, incidence data and dominance data. While perfect row-column association of an incidence data matrix does not mean that the data matrix can be fully explained by one solution, dominance data with perfect association can be explained by one solution. Considering that a main role of quantification theory is to explain data in multidimensional space, the present study presents a non-technical look at some fundamental aspects of the quantification space that is used for the analysis of the two types of data.

1. Introduction
Consider principal component analysis (PCA) of n standardized variables. If the rank of the correlation matrix is 2, that is, r(R) = 2, then all the variables are positioned in two-dimensional space, and more specifically, on a circle of radius 1. In other words, each variable is located on a plane at distance 1 from the origin. Likewise, if r(R) = 3, all the variables are located on the surface of a sphere at distance 1 from the origin. One can extend the same statement to the general multidimensional case. Thus, no matter what rank the correlation matrix may have, one can infer that K dimensions are sufficient if the sum of squares of the K coordinates of each variable is close to 1. Thus, with standardized continuous variables, both graphical display and the statistic "percentage accounted for" can tell us the dimensionality of data.
Suppose that we now consider PCA of deviation scores, that is, variables which are not standardized, but centered. This leads to principal component analysis of the variance-covariance matrix, V. Since the variance of each variable is different, even when r(V) = 2, the variables are not positioned on a circle of any fixed radius. Although the shape of the distribution of the variables in the two-dimensional principal plane cannot tell us that r(V) = 2, the statistic of the percentage accounted for by each component tells us the dimensionality and the importance of each component. Thus, with non-standardized continuous variables, graphical display does not tell us the dimensionality of data, but the statistic "percentage accounted for" still does.
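The claim about standardized variables is easy to verify numerically. The following sketch is our own illustration (the simulated data and all names are assumptions): it builds six variables from two latent factors, so that r(R) = 2, and checks that the two principal coordinates of each variable have squared lengths summing to 1.

    import numpy as np

    rng = np.random.default_rng(1)
    F = rng.standard_normal((500, 2))          # two latent factors
    X = F @ rng.standard_normal((2, 6))        # six variables with rank-2 structure
    R = np.corrcoef(X, rowvar=False)           # correlation matrix, r(R) = 2
    vals, vecs = np.linalg.eigh(R)
    L = vecs[:, -2:] * np.sqrt(vals[-2:])      # principal coordinates (loadings)
    print((L ** 2).sum(axis=1))                # each entry is (numerically) 1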
From the practitioner's point of view, the real problem in choosing between R and V is not that of computational convenience, but rather that of which one of the two is more meaningful. In the social sciences, many variates such as personality scores do not have any rational unit of measurement, and one tends to opt for standardizing variables for the purpose of comparability. Even in the natural sciences, where there exists a rational unit of measurement, one may still face the choice between the use of available units (e.g., kilometers or miles) and standardization. There is no readily available guideline on which one to use. To make the task of choosing either R or V more important and difficult than one may think, the results of PCA from the two can be vastly different, and there does not seem to be any mathematically traceable relationship between the two sets of PCA results (e.g., Nishisato and Yamauchi, 1974). This makes it almost impossible to consider a rational way of choosing one over the other.
When one looks at quantification methods such as Hayashi's quantification theory, correspondence analysis, homogeneity analysis and dual scaling, they are conceptually the same as PCA. In fact, Torgerson (1958) called these quantification methods PCA of categorical data. There are, however, a number of differences between them and PCA. On one hand, "PCA of categorical data" is regarded as singular value decomposition of standardized categorical variables, suggesting a stronger resemblance to PCA of R than to PCA of V. On the other hand, there are many more numerical aspects that connect it to PCA of V rather than to PCA of R. If we consider such multiple-choice data in the response-pattern format (i.e., in the form of an indicator matrix) that can be explained by two components, we would realize almost immediately that the quantification results are more like PCA results associated with V than with R, because a two-dimensional graph does not tell us that the data are two-dimensional. In fact, it is the present author's view that the quantification method as we know it by many different names is PCA of non-standardized categorical variables.
As French researchers in the area of correspondence analysis would say, graphical display of quantification results is almost indispensable for their interpretation. But, surprisingly, there seems to be little knowledge about the space used in quantification. For instance, under what circumstances can we infer from graphical display that the categorical data in hand are two-dimensional? To answer this question, it seems essential to know what multidimensional quantification space we are dealing with. For some reason or other, this fundamental question on the space has not been a topic of intensive investigation. The present study will be concerned with this problem, and some preliminary considerations will be presented.

2. Dual Scaling
As a method of quantification, dual scaling (Nishisato, 1980, 1994, 1996) will be considered to see what kind of multidimensional space it uses. Dual scaling is known for two aspects: (1) it handles a wide variety of categorical data, in particular incidence data (contingency tables, multiple-choice data, sorting data) and dominance data (rank-order data, paired comparison data, successive categories data); (2) it employs two distinct objectives, one for incidence data and the other for dominance data. The former is the familiar low-rank approximation to the input data, and the latter is the low-rank approximation to the ranking of the input data by the ranking of the distances between each subject and a set of stimuli. The latter is, therefore, considered to offer a low-rank solution to the problem of multidimensional unfolding (Nishisato, 1994, 1996). Mathematical handling of the two major types of categorical data by dual scaling is based on singular value decomposition of appropriately transformed data matrices.
For incidence data, denote the data matrix by F, where the typical element f_ij is either 1 (presence) or 0 (absence), or the joint frequency of cells i and j, and denote the diagonal matrices of row marginals and column marginals by D_r and D_c, respectively. Then singular value decomposition is carried out as

    D_r^{-1/2} F D_c^{-1/2} = V Λ W',

where V'V = I, W'W = I, and Λ = diag(λ_k). Optimal weight matrices for rows, Y, and columns, X, are given by

    Y = D_r^{-1/2} V,   X = D_c^{-1/2} W,   so that   F = D_r Y Λ X' D_c / f_t.

Optimal vectors are scaled as x_k' D_c x_k = y_k' D_r y_k = f_t, that is, the sum of all the elements in F.
For dominance data, the responses collected are first transformed into the subject-by-stimulus matrix of dominance numbers, E, where the typical element e_ij indicates the number of times subject i judges stimulus j higher (larger, more attractive) than the other stimuli, minus the number of times subject i judges the other stimuli higher than stimulus j. If we indicate the number of subjects by N and the number of stimuli by n, the dominance matrix is N × n, and the sum of the elements of each row is zero. To define the diagonal matrices D_r and D_c for the dominance matrix, each element is considered as based on n − 1 comparisons. Thus, we can now specify the two diagonal matrices as follows: D_r = n(n − 1)I and D_c = N(n − 1)I. Hence, f_t = nN(n − 1). Then, with these newly defined diagonal matrices and the dominance matrix E, we can carry out singular value decomposition of E to obtain the optimal vectors y_k and x_k (Nishisato, 1978).
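The computations of this section can be sketched in a few lines of Python; the code below is our illustration, not the author's program, and the small ranking matrix R is hypothetical. It uses e_ij = (n + 1) − 2 r_ij, which equals the number of stimuli ranked below j minus the number ranked above j by subject i.

    import numpy as np

    R = np.array([[1, 3, 2, 4],              # hypothetical rankings, 1 = highest
                  [2, 1, 4, 3],
                  [1, 2, 3, 4]])
    N, n = R.shape
    E = (n + 1) - 2 * R                      # dominance numbers; each row sums to 0
    ft = n * N * (n - 1)

    # D_r = n(n-1)I and D_c = N(n-1)I are scalar, so the transform is a rescaling
    S = E / np.sqrt(n * (n - 1) * N * (n - 1))
    V, rho, Wt = np.linalg.svd(S, full_matrices=False)
    Y = np.sqrt(ft / (n * (n - 1))) * V      # normed subject weights, y'D_r y = f_t
    X = np.sqrt(ft / (N * (n - 1))) * Wt.T   # stimulus weights, x'D_c x = f_t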

3. Multidimensional Quantification Space


It is interesting to note that, probably due to the influence of the literature in factor analysis and mental test theory, most studies on quantification theory seem to have assumed more subjects than the number of stimuli. But dual scaling is symmetric in its decomposition of data, and as such it is important to consider both cases: when the number of rows of a data matrix determines the dimensionality of the space, and when the number of columns determines it.
Generally speaking, one can make the following statement. When the rank of a data matrix is determined by its rows (e.g., five members of a family rank ten movies according to their order of preference; 100 students answer a multiple-choice personality questionnaire with 200 questions), each row variable (i.e., each subject) has sufficient space to reveal his or her unique pattern of responses or idiosyncratic responses; but each column variable (i.e., each stimulus) does not have enough room to be fully accounted for. Similarly, when the rank of a data matrix is determined by its columns, each column variable (stimulus) has enough space to be fully explained, and it is not the case with each row variable (subject). Thus, in the former case, where the rank is determined by the rows, the contribution of each of those row variables to the quantification space can be expressed by a mathematical formula; in the latter case, the same applies to each of the column variables.

3.1. Multidimensional Analysis of Dominance Data


Since quantification of incidence data has been widely discussed in the literature of correspondence analysis, multiple correspondence analysis, homogeneity analysis and dual scaling, a number of aspects relevant to the theme of the current paper are relatively well known. Admittedly, however, they apply to the case in which there are more subjects (in the case of multiple correspondence analysis) than the rank of the data matrix. Therefore, the present paper will start with the quantification space pertaining to dual scaling of dominance data.
Table 1 shows the ranking of ten government services by 31 subjects, and Table 2 is the corresponding 31 × 10 dominance matrix. This dominance matrix can be fully explained by nine dual scaling solutions, and it is known that dual scaling offers a perfect solution to the problem of multidimensional unfolding (Nishisato, 1994, 1996).
Let us first explain what the unfolding problem is. Coombs (1950) proposed a model for ranking judgement, in which a joint continuum for stimuli and subjects is postulated in such a way that a subject ranks first the stimulus most closely located to him or her on this continuum, second the second closest, and so on.
Table 1
Ranking of Ten Government Services by 31 Subjects

     1   7   9  10   2   6   3   8   5   4
     6  10   9   5   3   1   7   2   4   8
     9   8   4   3   5   6  10   2   1   7
     2  10   5   6   3   1   4   8   7   9
     2  10   6   7   4   1   5   3   9   8
     1   3   5   6   7   8   2   4  10   9
     7  10   1   6   5   3   8   4   2   9
     2  10   6   7   4   1   5   3   9   8
     2  10   5   8   4   1   6   3   7   9
     2  10   5   9   8   7   4   1   3   6
     9  10   7   6   5   1   4   2   3   8
     6  10   7   4   2   1   3   9   8   5
     1  10   3   9   6   4   5   2   7   8
     8   6   5   3  10   7   9   2   1   4
     8  10   9   6   4   1   3   2   5   7
     3   5  10   4   6   9   8   2   1   7
     1  10   8   9   3   5   2   6   7   4
     5   4   9   3  10   8   7   2   1   6
     2  10   6   7   8   1   5   4   3   9
     1   4   2  10   9   7   6   3   5   8
     2  10   5   7   3   1   4   6   8   9
     6   3   9   4  10   8   7   2   1   5
     6   9  10   4   8   7   5   2   1   3
     5   2   1   9  10   4   8   6   3   7
     2  10   6   7   9   1   3   4   5   8
     7  10   9   5   2   6   3   1   4   8
     8   7  10   3   5   9   4   2   1   6
     3   8   6   7   5  10   9   2   4   1
     2  10   7   9   4   1   5   3   6   8
     2  10   9   1   4   7   5   3   6   8
     4  10   9   7   5   1   3   2   6   8
Columns: (1) Public transit system, (2) Postal service, (3) Medical care
(4) Sports/recreational facilities, (5) Police protection
(6) Public Libraries, (7) Cleaning streets, (8) Restaurants
(9) Theatres, (10) Overall planning and development

The subject's position on the continuum is called his or her ideal point, and the ranking of each subject is interpreted as the ranking of the stimuli on the continuum folded at each subject's ideal point. Thus, the problem of unfolding is that, given a set of rank orders (i.e., folded continua) from subjects, we wish to unfold the rank orders to recover the original single continuum, on which subjects and stimuli are jointly located. When a single continuum is replaced with multidimensional axes, unfolding becomes complicated, and the problem is then called that of multidimensional unfolding. A number of studies on the topic have been published (e.g., Coombs and Kao, 1960; Coombs, 1964; Schönemann, 1970; Schönemann and Wang, 1972; Carroll, 1972; Gold, 1973; Heiser, 1981; Greenacre and Browne, 1986). However, they have not made any reference to dual scaling, obviously being unaware of its relevance. Now we know that dual scaling always offers a perfect solution to the problem of multidimensional unfolding (Nishisato, 1994).

Table 2
Dominance Matrix E

     9  -3  -7  -9   7  -1   5  -5   1   3
    -1  -9  -7   1   5   9  -3   7   3  -5
    -7  -5   3   5   1  -1  -9   7   9  -3
     7  -9   1  -1   5   9   3  -5  -3  -7
     7  -9  -1  -3   3   9   1   5  -7  -5
     9   5   1  -1  -3  -5   7   3  -9  -7
    -3  -9   9  -1   1   5  -5   3   7  -7
     7  -9  -1  -3   3   9   1   5  -7  -5
     7  -9   1  -5   3   9  -1   5  -3  -7
     7  -9   1  -7  -5  -3   3   9   5  -1
    -7  -9  -3  -1   1   9   3   7   5  -5
    -1  -9  -3   3   7   9   5  -7  -5   1
     9  -9   5  -7  -1   3   1   7  -3  -5
    -5  -1   1   5  -9  -3  -7   7   9   3
    -5  -9  -7  -1   3   9   5   7   1  -3
     5   1  -9   3  -1  -7  -5   7   9  -3
     9  -9  -5  -7   5   1   7  -1  -3   3
     1   3  -7   5  -9  -5  -3   7   9  -1
     7  -9  -1  -3  -5   9   1   3   5  -7
     9   3   7  -9  -7  -3  -1   5   1  -5
     7  -9   1  -3   5   9   3  -1  -5  -7
    -1   5  -7   3  -9  -5  -3   7   9   1
    -1  -7  -9   3  -5  -3   1   7   9   5
     1   7   9  -7  -9   3  -5  -1   5  -3
     7  -9  -1  -3  -7   9   5   3   1  -5
    -3  -9  -7   1   7  -1   5   9   3  -5
    -5  -3  -9   5   1  -7   3   7   9  -1
     5  -5  -1  -3   1  -9  -7   7   3   9
     7  -9  -3  -7   3   9   1   5  -1  -5
     7  -9  -7   9   3  -3   1   5  -1  -5
     3  -9  -7  -3   1   9   5   7  -1  -5

Suppose we consider the ranking of the ten services by the first two subjects. As one can easily see, there are no constraints on each of the ten columns, while each row is constrained by the condition that each row sum of dominance numbers is zero. Thus the analysis of this 2 × 10 matrix yields at most two solutions (see Table 3). Similarly, if we analyze the ranking by the first three subjects, the 3 × 10 dominance matrix yields three dual scaling solutions (see Table 4). Since the variates y_ik and x_jk do not span the same space, it is important that one of them is projected onto the space of

the other. For dual scaling to provide a perfect solution, we must project the stimuli onto the space for the subjects (Nishisato, 1994). In other words, we must plot y_ik and ρ_k x_jk. Figure 1 shows the results of dual scaling analysis of the 2 × 10 matrix: compute the distance between Subject 1 and each of the ten stimulus points, and rank order the distances from the closest (smallest distance) to the furthest; this reproduces the exact ranking of the ten services by Subject 1! Similarly, one can calculate the distances between Subject 2 and the stimuli, rank the distances, and see that the observed ranking is again reproduced. For dual scaling of the 3 × 10 matrix, we need all three solutions to recover the same ranking of the services by each of the three subjects from the plot of (y_ik, ρ_k x_jk).
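This recovery property can be checked mechanically. Continuing our earlier sketch from Section 2 (the toy rankings R are hypothetical), we rank the distances from each subject's normed point to the projected stimulus points ρ_k x_jk over all solutions and compare with the observed ranking:

    import numpy as np

    R = np.array([[1, 3, 2, 4], [2, 1, 4, 3], [1, 2, 3, 4]])   # toy rankings
    N, n = R.shape
    E = (n + 1) - 2 * R
    ft = n * N * (n - 1)
    S = E / np.sqrt(n * (n - 1) * N * (n - 1))
    V, rho, Wt = np.linalg.svd(S, full_matrices=False)
    Y = np.sqrt(ft / (n * (n - 1))) * V
    X = np.sqrt(ft / (N * (n - 1))) * Wt.T

    P = X * rho                                   # projected stimuli rho_k x_jk
    for i in range(N):
        dist = np.linalg.norm(P - Y[i], axis=1)
        recovered = dist.argsort().argsort() + 1  # ranks of the distances
        print((recovered == R[i]).all())          # True by the unfolding theorem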

Figure 1
A Perfect Two-dimensional Solution
[Plot of the two subjects and the ten projected stimulus points ρ_k x_jk; labeled points include Postal Service and Sports/Recreation.]
Suppose we increase the number of subjects to nine, and analyze the 9 × 10 dominance table. As expected, if we use all nine solutions, the ranking of the distances between subjects and stimuli in 9-dimensional space will reproduce perfectly the ranking in the original data (Table 1). How about the case in which there are more subjects than nine? Would we fail to recover the ranking of the services by some subjects? A truly remarkable property of dual scaling is that, no matter how many more subjects than stimuli we may have, the rankings of the stimuli by all the subjects are reproduced in (n − 1)-dimensional space, not in N-dimensional space! This is a classical property of singular value decomposition. Table 5 shows the sums of squares of the rank discrepancies between the input rankings and the rankings approximated by one, two, three, ..., and nine solutions. As one can see in the table, the rank-9 approximation yields 0 discrepancy, indicating that it is a perfect solution to the problem of multidimensional unfolding.

Table 3
Normed Weights for 2 Subjects and Projected Weights for 10 Stimuli (2 Solutions)

Subjects:       1         2
            -0.9996    1.0005
            -1.0000   -0.9995

Stimuli:        1         2
            -0.4442    0.5557
             0.6668    0.3331
             0.7778   -0.0003
             0.4442   -0.5557
            -0.6666    0.1114
            -0.4447   -0.5554
            -0.1109    0.4445
            -0.1114   -0.6666
            -0.2223   -0.1110
             0.1113    0.4444

Table 4
Normed Weights for 3 Subjects and Projected Weights for 10 Stimuli (3 Solutions)

Subjects:       1         2         3
             0.9485   -1.0848   -0.9610
            -0.7304   -1.3498    0.8027
            -1.2517   -0.0342   -1.1967

Stimuli:        1         2         3
             0.6677   -0.3027   -0.0398
             0.3699    0.5768    0.0608
            -0.1956    0.6274   -0.0919
            -0.5750    0.3053    0.1285
             0.0643   -0.5325   -0.1448
            -0.2323   -0.4085    0.3475
             0.6740   -0.0395    0.1317
            -0.6895   -0.1580    0.0758
            -0.4633   -0.2016   -0.3453
             0.3797    0.1332   -0.1225

Thus, in dual scaling of dominance data, all we are interested in is the recovery of, or approximation to, the original rankings of the stimuli by the individual subjects. As such, the importance of the shape of the quantification space becomes secondary. The percentage of the original rankings approximated by K solutions is sufficient for the investigator to know. Even so, however, it should be mentioned that the multidimensional quantification space for dominance data is in some sense standardized, since the total contribution of each subject to the total space is fixed and equal to (n + 1)/[3(n − 1)], irrespective of N. One further characteristic of dominance data is that both D_r and D_c are scalar matrices.

3.2. Multidimensional Analysis of Incidence Data


All the complexities of the multidimensional quantification space for incidence data seem to arise from the fact that D_r and D_c are not scalar matrices. Rather than reviewing the relevant mathematical expressions for the row and column contributions of a variety of incidence data to the total space, let us look at one numerical example. Consider multiple-choice data collected from seven subjects answering three multiple-choice questions with three response options per question. In this case, the number of solutions is determined by both the rows and the columns of the incidence matrix because N − 1 = m − n = 6. This is a special case, in which both the row variables and the column variables have space sufficient to accommodate their full contributions to the total space. Let us look at a numerical example (Table 6) and note the following regularities: (a) the sum of the squared singular values (correlation ratios in dual scaling) is equal to the average number of options minus 1, that is, 2; (b) the sum of squares of the quantified item scores is equal to nN(m_j − 1), that is, 3 × 7 × (3 − 1) = 42; (c) the sum of r²_jt over the total solutions (6) is equal to m_j − 1, which is 3 − 1 = 2; (d) the sum of the squared normed scores of each subject over the total solutions is equal to the total number of solutions; (e) the sum of squares of the option weights over the total solutions is equal to n(N − f_jp), where f_jp is the number of subjects who chose option p of item j; and (f) the inter-subject squared distance in the total normed space is 2N, which is 14 in the present example.
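Regularity (a) is easy to check on any complete indicator matrix. The sketch below is our own, using a hypothetical 7 × 9 response pattern rather than the (unreproduced) data behind Table 6; it computes the dual scaling correlation ratios and verifies that they sum to the average number of options minus 1.

    import numpy as np

    Z = np.zeros((7, 9))
    choices = [(0,0,0), (1,1,1), (2,2,2), (0,1,2), (1,2,0), (2,0,1), (0,2,1)]
    for i, (p, q, r) in enumerate(choices):        # chosen option of each of 3 items
        Z[i, p] = Z[i, 3 + q] = Z[i, 6 + r] = 1

    S = np.diag(Z.sum(1) ** -0.5) @ Z @ np.diag(Z.sum(0) ** -0.5)
    sv = np.linalg.svd(S, compute_uv=False)
    eta2 = sv[1:] ** 2                             # drop the trivial solution
    print(eta2.sum())                              # 2.0 = average options - 1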

Table 5
Sum of Squares of Discrepancies between Observed and Reproduced Ranks

                         Number of Solutions
Subject     1     2     3     4     5     6     7     8     9
   1       88    78    90    46    42    14    16     0     0
   2       62    28    14     2     4     4     2     4     0
   3       60    80    80    12    12     0     0     0     0
   4       14    10    12    16    16    16     6     2     0
   5       12     8    14    14    14    10     8     6     0
   6      100   124    66    62     8     2     0     0     0
   7      104   106    78    14    12     8     8     6     0
   8       12     8    14    14    14    10     8     0     0
   9       10    16     8     6     6     0     0     0     0
  10      104    52    22     8     2     4     2     2     0
  11       84    34    12     8    10     4     0     0     0
  12       68    52     4     4     4     4     2     0     0
  13       40    42     8     4     4     0     0     0     0
  14       72    28    14     2     2     2     2     0     0
  15       68    38    10    10     6     4     2     2     0
  16      122    46    52    30    22    22    22     2     0
  17       52    54    54    14     6     0     0     0     0
  18       90    28    16    12    10    10     0     0     0
  19       44    28    16    16    14     8     0     0     0
  20      124   186     6     6     2     2     2     0     0
  21        4     2     4     4     4     2     2     0     0
  22       72    28    22    12    10     6     6     6     0
  23      152     8    12    10    10     4     2     0     0
  24      160   158    36     8     8     0     0     0     0
  25       44    24    12    12     8     6     2     0     0
  26       96    52    14     8     6     8     4     0     0
  27       88    18    16    10     8     8     4     2     0
  28      104    64    70    44     8     6     6     0     0
  29       10    12     6     6     8     2     2     2     0
  30      102    80    88    16    26    10     0     0     0
  31       30     4     4     2     4     2     2     2     0
As stated earlier, these relations will be violated depending on whether the number of total solutions is determined by the rows or the columns. These cases, however, will not be discussed due to the limitation of space.

Table 6
Summary Statistics on a Case Where the Rank is Determined by Both Rows and Columns

Solution     η²         α         δ(%)
   1       0.8116     0.8839    40.5777
   2       0.6071     0.6765    30.3575
   3       0.2476    -0.5195    12.3794
   4       0.1884    -1.1538     9.4206
   5       0.1075    -3.1501     5.3762
   6       0.0378   -11.7370     1.8886
 Sum       2.0000   -15.0000   100.0000

Sum of Squares on Each Item

Item 1    7.54   6.82   8.45    3.81    5.01   10.37   42.00
Item 2    7.54   6.82   8.45    3.81    5.01   10.37   42.00
Item 3    5.92   7.36   4.09   13.38   10.99    0.26   42.00

Squared Item-Total Correlation.


Item 1 0.8741 0.5915 0.2990 0.1026 0.0769 0.0560
Item 2 0.8741 0.5915 0.2990 0.1026 0.0769 0.0560
Item 3 0.6865 0.6385 0.1448 0.3601 0.1688 0.0014

Sums of Squared r²_jt over Dimensions


Item 1 2.00
Item 2 2.00
Item 3 2.00

Normed Scores of Subjects, y_i


SS
1 -0.9134 0.2923 -0.2465 -1.4013 -1.6140 0.6715 6.00
2 0.5128 -0.7759 1.7125 -0.1064 -0.5518 -1.3736 6.00
3 1.5501 1.2634 -0.7118 0.9488 -0.7685 -0.0591 6.00
4 -0.5128 -0.7759 -1.7125 -0.1064 0.5518 -1.3736 6.00
5 -1.5501 1.2634 0.7118 0.9488 0.7685 -0.0591 6.00
6 0.9134 0.2923 0.2465 -1.4013 1.6140 0.6715 6.00
7 0.0000 -1.5597 0.0000 1.1178 0.0000 1.5224 6.00

Normed Option Weights, x_jp

SS
1 -1.1013 0.3336 -0.8355 -0.4292 -0.2986 -1.3056 12.00
2 0.2846 -1.4987 1.7208 1.1650 -0.8413 0.3830 15.00
3 1.3673 0.9983 -0.4675 -0.5212 1.2892 1.5754 15.00
4 -1.3673 0.9983 0.4675 -0.5212 -1.2892 1.5754 15.00
5 1.1013 0.3336 0.8355 -0.4292 0.2986 -1.3056 12.00
6 -0.2846 -1.4987 -1.7208 1.1650 0.8413 0.3830 15.00
7 1.7207 1.6214 -1.4305 2.1860 -2.3437 -0.3043 18.00
8 -1.7207 1.6214 1.4305 2.1860 2.3437 -0.3043 18.00
9 0.0000 -0.6486 0.0000 -0.8744 0.0000 0.1217 16.00

Inter-Subject Squared Distances in the Normed Space


0.000
14.000 0.000
14.000 14.000 0.000
14.000 14.000 14.000 0.000
14.000 14.000 14.000 14.000 0.000
14.000 14.000 14.000 14.000 14.000 0.000
14.000 14.000 14.000 14.000 14.000 14.000 0.000

4. Concluding Remarks
The current paper touched only on the surface of the topic. The two data types, incidence data and dominance data, are distinct, as reflected in their respective objectives for quantification. From the point of view of multidimensional decomposition of data, however, probably the most important distinction lies in the fundamental premise of "what is multidimensionality for the two data types?" For incidence data, perfect association in the data (i.e., a correlation ratio of 1) does not mean that a single solution (component) can explain the data exhaustively. A simple example is a 10 × 10 contingency table of perfect row-column association (e.g., all non-zero entries are found only in the main diagonal). This data matrix yields nine perfect correlation ratios (i.e., 1), yet needs nine solutions to explain the data. In contrast, when association is perfect in dominance data (e.g., all the subjects rank the stimuli in the same way), one solution explains the data completely. This distinction creates a number of different characteristics between the data types, some of which are discussed in Nishisato (1993, 1994, 1996). From the graphical point of view as well as the interpretation point of view, the distinction between the two types should be well understood, but a number of further investigations into the differences are still needed before we are certain about our full understanding of the implications of the differences.
The last remark on both types of data is about the treatment of missing responses. Most imputation methods have the effect of increasing the total information (i.e., the sum of squared singular values) in the data. This is obviously undesirable, and should be regarded as fabrication of information. Thus, any method of imputation must be such that the total observed information is kept invariant. Such an example is rare (see dual scaling of rank order data, Nishisato, 1994), and the effects of missing responses on multidimensional space need to be further investigated.

5. References
Carroll, J.D. (1972). Individual differences and multidimensional scaling. In R.N. Shepard, A.K. Romney, and S.B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Volume 1. New York: Seminar Press.
Coombs, C.H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 51, 148-158.
Coombs, C.H. (1964). A Theory of Data. New York: Wiley.
Coombs, C.H., and Kao, R.C. (1960). On a connection between factor analysis and multidimensional unfolding. Psychometrika, 25, 219-231.
Gold, E.M. (1973). Metric unfolding: Data requirements for unique solution and clarification of Schönemann's algorithm. Psychometrika, 38, 555-569.
Greenacre, M.J., and Browne, M.W. (1986). An efficient alternating least-squares algorithm to perform multidimensional unfolding. Psychometrika, 51, 241-250.
Heiser, W.J. (1981). Unfolding analysis of proximity data. Doctoral dissertation, Leiden University, The Netherlands.
Nishisato, S. (1978). Optimal scaling of paired comparison and rank-order data: An alternative to Guttman's formulation. Psychometrika, 43, 263-271.
Nishisato, S. (1980). Analysis of Categorical Data: Dual Scaling and Its Applications. Toronto: University of Toronto Press.
Nishisato, S. (1993). On quantifying different types of categorical data. Psychometrika, 58, 617-629.
Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis. Hillsdale, NJ: Lawrence Erlbaum Associates.
Nishisato, S. (1996). Gleaning in the field of dual scaling. Psychometrika, 61, 559-599.
Nishisato, S., and Yamauchi, H. (1974). Principal components of deviation scores and standardized scores. Japanese Psychological Research, 16, 162-170.
Schönemann, P.H. (1970). On metric multidimensional unfolding. Psychometrika, 35, 167-176.
Schönemann, P.H., and Wang, M.M. (1972). An individual difference model for the multidimensional analysis of preference data. Psychometrika, 37, 275-309.
Torgerson, W.S. (1958). Theory and Methods of Scaling. New York: Wiley.
Homogeneity Analysis for Partitioning
Qualitative Variables
Takahiro Tsuchiya
The Institute of Statistical Mathematics
4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan

Summary: This paper proposes a method to construct multiple uni-dimensional scales


by partitioning qualitative variables into mutually exclusive groups. The method is based
on homogeneity analysis, and fuzzy c-means criterion is introduced for partitioning. Also,
some goodness of fit indexes are proposed. Two artificial data sets and one real data set are
analyzed as numerical examples. The results illustrate that the proposed method is more
effective for partitioning qualitative variables compared with PCA with optimal scaling and Hayashi's Quantification Method III, or HOMALS.

1. Introduction
In the social and cultural sciences, uni-dimensional scaling is an important theme. A distinctive feature of scaling in those fields is that a data set often consists of qualitative variables. Multiple correspondence analysis, dual scaling and Hayashi's Quantification Method III (HQM III) are methods for analyzing the structure of qualitative variables. In case all the variables are homogeneous (namely, they have only one uni-dimensionality), the first solution of those methods can be used as a uni-dimensional scale. It is rare, however, that all the variables have one uni-dimensionality in practical situations. In order to construct uni-dimensional scales, selection or partitioning of variables is usually needed. HQM III is not necessarily appropriate for this purpose. To demonstrate this, let us consider the following example.
Tab. 1 shows an artificial data set 1, which consists of eight variables and 13 observations. The numerals in the table indicate category labels and n is an observation number. For example, observation 13 chose category 4 of variable 1. Although there are 8 variables in the data set, they should be partitioned into two groups. That is, an indicator matrix D1 is obtained by using only variables 1 to 4 (Tab. 2). Zeros are omitted in the table. It is clear that D1 has a Guttman uni-dimensional structure. Another indicator matrix D2 with a uni-dimensional structure is also obtained by using variables 5 to 8 (Tab. 3). Because the orders of the rows are, however, different between D1 and D2, we should say that the data set in Tab. 1 has two uni-dimensional structures.

Tab. 1: Artificial Data Set 1

          variable
  n    1  2  3  4  5  6  7  8
  1    1  1  1  1  2  1  1  1
  2    2  1  1  1  2  2  2  1
  3    2  2  1  1  3  3  2  2
  4    2  2  2  1  2  2  1  1
  5    2  2  2  2  3  3  3  3
  6    3  2  2  2  2  2  2  2
  7    3  3  2  2  1  1  1  1
  8    3  3  3  2  3  2  2  2
  9    3  3  3  3  4  4  4  3
 10    4  3  3  3  4  4  3  3
 11    4  4  3  3  3  3  3  2
 12    4  4  4  3  4  4  4  4
 13    4  4  4  4  4  3  3  3

Tab. 2: Indicator Matrix D1 (variables 1-4)    Tab. 3: Indicator Matrix D2 (variables 5-8)

[Each table shows a 13 × 16 indicator matrix (four variables with four categories each; zeros omitted), with a single run of four 1s per row. With the rows of D1 in the order 1, 2, ..., 13 and the rows of D2 reordered as 7, 1, 4, 2, 6, 8, 3, 11, 5, 13, 10, 9, 12, both matrices exhibit Guttman's staircase pattern; the column positions of the 1s are not recoverable from this extraction.]

Fig. 1 shows the first and the second axes of HQM III applied to this data set. In the figure, ○ indicates a category of variables 1 to 4, while ● indicates a category of variables 5 to 8. It is well known that a horseshoe is obtained when a data set has one uni-dimensional structure. In Fig. 1, such a horseshoe appears, and we might incorrectly conclude that the data set has only one uni-dimensional structure.
Fig. 2 shows the first and the third axes. Carefully observing the figure, we can see that there are two flows of categories and that they correspond to the two groups of variables.

[Scatter plots of the category points: ○ variables 1-4, ● variables 5-8]

Fig. 1: The second axis versus the first    Fig. 2: The third axis versus the first

The above example illustrates that at least a three-dimensional display of the result is needed for partitioning variables in order to construct multiple uni-dimensional scales. It is impractical, however, to plot all the categories when the number of variables increases, because the figure will be illegible. In the example, the difference between the two groups of variables appeared on the third axis, but in the case of other data sets, it is not certain which axis should be used.
Thus this paper proposes an easy method for partitioning qualitative variables to construct multiple uni-dimensional scales.

2. Method
Let X_i (N × G_i) be an indicator matrix of variable i (i = 1, ..., I), where N is the number of observations and G_i is the number of categories of variable i. There is at most a single 1 in each row of X_i. Let A_i (N × N) = diag(X_i 1) be a diagonal matrix whose nn-th diagonal element is the n-th element of X_i 1.
The method is developed based on homogeneity analysis (Gifi, 1990). It is well known that HQM III leads to the same equation as homogeneity analysis for analyzing X_1, ..., X_I. Homogeneity analysis is, however, more appropriate for explaining the following method. In homogeneity analysis, the sum of the squares of the distances between the optimally transformed score vectors, X_i w_i, and one common vector m is minimized under a constraint on m:
    S(w_i, m) = Σ_{i=1}^{I} ||A_i(X_i w_i − m)||²  →  min,     (1)

    where  Σ_{i=1}^{I} ||A_i m||² = Σ_{i=1}^{I} ||A_i 1||²,   Σ_{i=1}^{I} 1'A_i m = 0.
If all the variables have only one uni-dimensional structure, it is possible to transform every score vector X_i w_i close to each other. Since m in (1) is obtained as a mean vector of X_1 w_1, ..., X_I w_I, m can be a uni-dimensional score vector. In other words, a uni-dimensional score vector m has to be constructed from vectors which can be transformed near to each other. In the case of the artificial data set 1, it is impossible to transform all the eight variables close. Hence, homogeneity analysis fails to construct uni-dimensional scales from data set 1. However, the first four variables can be transformed close to each other, and the same is true for the last four variables. If we prepare two score vectors, m_1 and m_2, and the sum of the distances between m_1 and the first four variables and between m_2 and the last four is minimized, then both m_1 and m_2 can be uni-dimensional score vectors:

    S(w_i, m_1, m_2) = Σ_{d=1}^{2} Σ_{i=1}^{8} δ_di ||A_i(X_i w_i − m_d)||²,

where

    δ_di = 1 if (d = 1 and 1 ≤ i ≤ 4) or (d = 2 and 5 ≤ i ≤ 8), and δ_di = 0 otherwise.
In practice, because we do not know in advance which variables have a uni-dimensional structure, weight parameters summing up to 1 are introduced instead of δ_di:

    S(a_di, w_di, m_d) = Σ_{d=1}^{D} Σ_{i=1}^{I} a_di^k ||A_i(X_i w_di − m_d)||²,     (2)

    where  Σ_{i=1}^{I} a_di^k ||A_i m_d||² = Σ_{i=1}^{I} a_di^k ||A_i 1||²,   Σ_{i=1}^{I} a_di^k m_d' A_i 1 = 0,

    Σ_{d=1}^{D} a_di = 1,   a_di ≥ 0.

We call D the dimensionality because it indicates the number of uni-dimensional scales. k is given a priori in the region k > 1. To explain the meaning of (2), let us first

consider the case when m_1, ..., m_D are obtained in some way. For each i, a_di is constrained by Σ_d a_di = 1. Hence, in order to minimize the quantity S, a large value has to be assigned to a_di if the corresponding e²_di = ||A_i(X_i w_di − m_d)||² is small. On the other hand, a small value must be assigned to a_di if e²_di is large. Thus a_di can be considered as an index of the degree of correlation between X_i w_di and m_d. Variable i is classified into dimension d if a_di is the largest among a_1i, ..., a_Di. The variables which are classified into the same dimension d are close to each other because they are all in the neighborhood of one score vector m_d.
Next let us consider the case when all the a_di's are determined. In that case, m_d is obtained as a mean vector of the X_i w_di weighted by a_di. This implies that m_d is constructed from vectors which are near to each other. Hence, we can say that m_d is a uni-dimensional score vector.
The same principle is used in fuzzy c-means clustering (Bezdek, 1981). X_i w_di is, however, fixed in fuzzy c-means, unlike in (2). Also, X_i w_di depends on the dimension d in (2).
There are two reasons for introducing the fuzzy c-means criterion rather than the k-means criterion (MacQueen, 1967). One is that fuzzy c-means includes k-means by letting k → 1 in (2). The other is that the influence of variables which should be removed is decreased by letting a_di = D^{-1}. In practice, it is often the case that some variables have low correlation with the other variables. The k-means criterion classifies such a variable into some dimension, and the m_d of that dimension cannot be a uni-dimensional score.
The values of a_di, w_di and m_d are obtained by means of an alternating least squares method. Because of space limitations, the details of the algorithm are omitted.
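Since the algorithm itself is omitted in the paper, the following is only our rough reconstruction of an alternating least squares scheme for criterion (2), restricted to complete data (A_i = I) and with a simplified normalization of m_d; all function and variable names are ours, and a practical implementation would need the paper's exact constraints, random restarts, and care with zero distances e²_di.

    import numpy as np

    def partition_scales(Xs, D=2, k=2.0, iters=200, seed=0):
        """Xs: list of (N, G_i) 0/1 indicator matrices.
        Returns memberships a (D, I) and score vectors M (D, N)."""
        rng = np.random.default_rng(seed)
        N, I = Xs[0].shape[0], len(Xs)
        M = rng.standard_normal((D, N))
        for _ in range(iters):
            # optimal quantifications: X_i w_di are the category means of m_d
            proj = np.array([[X @ np.linalg.lstsq(X, M[d], rcond=None)[0]
                              for X in Xs] for d in range(D)])        # (D, I, N)
            e2 = ((proj - M[:, None, :]) ** 2).sum(axis=2) + 1e-12    # e^2_di
            a = e2 ** (-1.0 / (k - 1.0))                # fuzzy c-means type update
            a /= a.sum(axis=0, keepdims=True)           # sum_d a_di = 1
            w = a ** k
            M = (w[:, :, None] * proj).sum(axis=1) / w.sum(axis=1)[:, None]
            M -= M.mean(axis=1, keepdims=True)          # centre each m_d
            M *= np.sqrt(N) / np.linalg.norm(M, axis=1, keepdims=True)
        return a, M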
An index

    λ = 1 − (D^{k−1} / Σ_{i=1}^{I} 1'A_i 1) min S,     (3)

which takes a value between 0 and 1, represents in a sense a goodness of fit to the D-dimensionality. Especially when D = 1, λ corresponds to the square of the correlation coefficient in HQM III.
In the case of A_i = I for all i, such a goodness of fit index is obtained for each variable and each dimension. For each variable i,

    λ_i = 1 − (D^{k−1} / N) min S_i,     (4)

is an index, where S_i = Σ_{d=1}^{D} a_di^k ||X_i w_di − m_d||². For each dimension d,

    λ_d = 1 − (D^k / NI) min S_d,     (5)

is an index, where S_d = Σ_{i=1}^{I} a_di^k ||X_i w_di − m_d||². There is the following relation among λ, λ_i and λ_d:

    λ = (1/I) Σ_{i=1}^{I} λ_i = (1/D) Σ_{d=1}^{D} λ_d.     (6)

3. Example
Three data sets are analyzed. k is set to 2 for all data sets.

3.1 Data Set 1

First, the artificial data set 1 of Tab. 1 is analyzed. Since an iterative algorithm often hits local minima, 100 random starts are used. The one giving the lowest S is considered the global minimum.
Tab. 4 summarizes, for each value of D, the frequency of the global minimum and λ. As the table indicates, the algorithm always converges to the same value when D = 2, but as D increases, it frequently falls into local minima. This result tells us that it is easy to partition the 8 variables into 2 groups, and that partitioning into 3 or 4 groups is difficult. Further, λ for D = 3 is a little smaller than that for D = 2. Therefore, we can conclude that D should be 2.
Tab. 5 shows the values of a_di and the goodness of fit indexes when D = 2. a_di of variables 1 to 4 is larger in dimension 1 than in dimension 2, while that of variables 5 to 8 is larger in dimension 2 than in dimension 1. That is, the first uni-dimensional scale can be constructed from variables 1 to 4 and the second from variables 5 to 8. We should notice that this conclusion is the same as in the case when Fig. 2 was used in the introduction.

Tab. 4: Summary of λ for each D

  D   Frequency of global minimum     λ
  1             100                 .819
  2             100                 .879
  3              23                 .878
  4              26                 .896
  5              12                 .916

Tab. 5: a_di when D = 2

 variable   dimension 1   dimension 2    λ_i
     1         .875          .125       .873
     2         .898          .102       .883
     3         .853          .147       .881
     4         .794          .206       .880
     5         .189          .811       .878
     6         .104          .896       .878
     7         .132          .868       .885
     8         .129          .871       .872
    λ_d        .880          .878       .879

3.2 Data Set 2

Next, the artificial data set 2 of Tab. 6 is analyzed. The purpose of this example is to compare the proposed method with the principal component analysis model (PCA) or the factor analysis model (FA). FA is usually used for partitioning quantitative variables. This example shows that PCA or FA for qualitative variables is not necessarily appropriate for partitioning qualitative variables.
The artificial data set 2 also has two uni-dimensional structures, as data set 1 does; that is, variables 1 to 4 have one uni-dimensionality and variables 5 to 8 have another uni-dimensionality. Unlike data set 1, the order of the categories of the last four variables is 3, 1, 4 and 2.

Tab. 6: Artificial Data Set 2

          variable
  n    1  2  3  4  5  6  7  8
  1    1  1  1  1  1  1  3  3
  2    2  1  1  1  1  1  1  3
  3    2  2  1  1  3  3  3  3
  4    2  2  2  1  1  3  3  3
  5    2  2  2  2  4  1  1  1
  6    3  2  2  2  4  4  1  1
  7    3  3  2  2  1  1  1  1
  8    3  3  3  2  4  4  4  4
  9    3  3  3  3  2  2  4  4
 10    4  3  3  3  2  2  2  4
 11    4  4  3  3  4  4  4  1
 12    4  4  4  3  2  2  2  2
 13    4  4  4  4  2  4  4  4
Tab. 7 shows the result of principal component analysis with optimal scaling followed by VARIMAX rotation. The analysis was performed with the PRINQUAL procedure in SAS (a similar procedure is PRINCALS in SPSS). The orders of the categories are 1, 2, 3, 4 for variables 1 to 4 and 3, 1, 4, 2 for variables 5 to 8. Hence, the procedure found the proper orders. The variables are partitioned into two groups according to the factor loadings. The first group consists of variables 4 to 8 and the second group consists of variables 1 to 3. This result does not represent the original data structure. This is because the assigned value for each category is optimal for applying the PCA model, but it is not optimal for the partitioning of variables.

Tab. 7: Factor loadings for optimally transformed data set 2

            factor loading               category
 variable      1      2           1      2      3      4
     1       .514   .814        1.29   1.87   3.05   4.00
     2       .350   .931        1.06   1.92   3.08   3.96
     3       .635   .735        1.02   1.88   3.24   3.72
     4       .688   .684        0.92   2.07   3.18   3.30
     5       .886   .411        1.16   3.85   0.36   2.65
     6       .896   .380        1.63   4.09   0.56   3.02
     7       .747   .614        1.93   4.13   0.88   3.59
     8       .772   .569        2.55   4.32   1.03   3.84

Tab. 8 is a summary of a_di and λ_i, λ_d obtained by the proposed method. Variables 1 to 4 are classified into the first dimension and the other variables are classified into the second dimension. Hence, the proposed method succeeded in partitioning the variables in order to construct uni-dimensional scales.

Tab. 8: a_di for data set 2

 variable   dimension 1   dimension 2    λ_i
     1         .766          .234       .895
     2         .828          .173       .889
     3         .634          .366       .912
     4         .537          .463       .905
     5         .184          .816       .896
     6         .237          .763       .884
     7         .404          .596       .906
     8         .299          .702       .892
    λ_d        .901          .894       .898

3.3 Kendall Data


The third example is taken from Kendall et al. (1983). The data consist of 15 variables on which 48 applicants for a post were judged. The original variables were scored on an 11-point scale, but Sato and Yanai (1985) recategorized each variable into three categories considering the category frequencies. This recategorized data matrix is used in this paper.

Fig. 3 shows the first three axes of HQM III applied to the data set. There seem to be no distinctive structures in the figure.

[Scatter plots of the category points: the second axis versus the first, and the third axis versus the first]

Fig. 3: Plots of categories of Kendall data

The proposed method was applied to the data set. As with data set 1, the algorithm was randomly started 100 times. Tab. 9 summarizes, for each value of D, the frequency of the global minimum and λ. When D = 3, there are quite a few local minima (34%), but the algorithm successfully reached the global minimum 66 times. This indicates that we cannot say it is too difficult to partition the 15 variables into three groups. Tab. 10 is a summary of a_di and λ_i, λ_d when D = 3.

Tab. 9: Summary of λ for each D

  D   Frequency of global minimum     λ
  1             100                 .469
  2             100                 .474
  3              66                 .492
  4               6                 .520

The first dimension consists of "Salesmanship", "Self-confidence", "Ambition" and so on. This dimension seems to represent "ambition for the job". The second dimension consists of "Potential", "Grasp", "Suitability" and "Experience". This dimension expresses the "ability of the applicant". The third dimension consists of "Likeability", "Honesty" and so on. This dimension means "external appearance" or "the first impression". These three dimensions are easy to understand, but it is difficult to obtain these dimensions from Fig. 3.

Tab. 10: a_di when D = 3

                                          dimension
 variable                            1      2      3      λ_i
  8 Salesmanship                   .834   .116   .050    .872
  5 Self-confidence                .679   .201   .120    .693
 11 Ambition                       .500   .366   .137    .693
 10 Drive                          .431   .410   .159    .655
  6 Lucidity                       .413   .395   .192    .493
 14 Keenness to join the company   .351   .312   .337    .342
 13 Potential                      .144   .751   .106    .815
 12 Grasp                          .226   .640   .134    .698
 15 Suitability                    .338   .374   .288    .361
  9 Experience                     .327   .347   .326    .078
  4 Likeability                    .032   .042   .926    .916
  7 Honesty                        .257   .271   .473    .260
  1 Form of letter of application  .306   .334   .360    .133
  2 Appearance                     .303   .346   .351    .302
  3 Academic ability               .327   .332   .342    .064
 λ_d                               .471   .465   .538    .492

4. Conclusion
In this paper a new method to construct multiple uni-dimensional scales by partitioning qualitative variables was proposed. For the data sets presented, the results of applying the model are successful.

References:

Bezdek, J.C. (1981): Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Gifi, A. (1990): Non Linear Multivariate Analysis. John Wiley & Sons, Chichester.
Kendall, M.G. et al. (1983): The Advanced Theory of Statistics, Volume 3, 4th ed. Charles Griffin.
MacQueen, J. (1967): Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.
Meulmann, J.J. (1996): Fitting a distance model to homogeneous subsets of variables:
points of view analysis of categorical data. Journal of Classification, 13, 249-267.
Sato, T. and Yanai, H. (1985): A method of simultaneous scaling of discrete variables.
Behaviormetrika, 18, 39-51.
Determining the Distance Index, II
Matevž Bren¹ and Vladimir Batagelj²
¹ University of Maribor, Faculty of Organizational Sciences,
Prešernova 11, 4000 Kranj, Slovenia
² University of Ljubljana, Department of Mathematics,
Jadranska 19, 1000 Ljubljana, Slovenia

Summary: We present the results of approximating the distance index by optimization.


We do it for 16 standard dissimilarities on binary vectors and for some usual nonlinear
transformations of these dissimilarities.

1. Preliminaries
1.1 Dissimilarities
A mapping d: E × E → ℝ is a dissimilarity measure on the set of objects E iff it is

P1. symmetric: d(x, y) = d(y, x) for all x, y ∈ E,
P2. straight: d(x, x) ≤ d(x, y) for all x, y ∈ E.

A dissimilarity measure d that is

D1. nonnegative: d(x, y) ≥ 0 for all x, y ∈ E, and
D2. vanishes on the diagonal: d(x, x) = 0 for all x ∈ E

is a dissimilarity (on E). The ordered pair (E, d) is a dissimilarity space. We denote
by D+ the set of all dissimilarities on E.
A dissimilarity d on E is said to be:

D3. definite iff d(x, y) = 0 ⇒ x = y;
D4. even iff d(x, y) = 0 ⇒ for all z ∈ E: d(x, z) = d(y, z);
D5. a semi-distance iff the triangle inequality
    d(x, y) ≤ d(x, z) + d(y, z) holds for all x, y, z ∈ E.

1.2 Dissimilarity Spaces


A dissimilarity space (E, d) is a semi-metric space if and only if d is a semi-distance,
and is a metric space if and only if d is also definite; in that case d is a distance.
We denote the set of all semi-distances on E by D∞.
In Joly and Le Calvé (1986) we can find the theorem:

• d ∈ D∞ ⇒ d^α ∈ D∞ for all α: 0 ≤ α ≤ 1;
• d ∈ D+ ⇒ there exists a unique positive number p ∈ ℝ such that d^α ∈ D∞
  for all α: α ≤ p, and d^α ∉ D∞ for all α: α > p.

We call the threshold value p = p(d) the distance index of the dissimilarity d. For
details see Batagelj and Bren (1996).
2. Determining the Distance Index for Dissimilarities
on Binary Vectors
In the first part of the paper we present:

• an analytical solution of this problem for two families of dissimilarities obtained
  from functions defined in Gower and Legendre (1986),

      S_θ = (a + d) / (a + d + θ(b + c))   and   T_θ = a / (a + θ(b + c))

  (where θ > 0 to avoid negative values), that contain some well-known similarity
  measures (see Table 1: Kendall, Sokal-Michener; Rogers and Tanimoto; Jaccard;
  Dice, Czekanowski; Sokal and Sneath).
  The quantities a, b, c and d have the usual meaning: for binary vectors x, y ∈ IB^m
  we denote by xy = Σ_{i=1}^m x_i y_i their scalar product, and by x̄ = [1 − x_i] the
  complementary vector of x. We define the counters a = xy, b = xȳ, c = x̄y
  and d = x̄ȳ, where a + b + c + d = m. Using these counters, several resemblance
  measures on binary vectors are defined (see Table 1; a small computational
  illustration of the counters is sketched after this list). The use of the symbol d
  for a dissimilarity measure and also for a counter might be confusing, but its
  meaning is always clear from the context;
• and a computational approach to determining the distance index for the other
  dissimilarity measures from Table 1.
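As a concrete illustration (our sketch, not part of the original paper), the counters
and two of the dissimilarities of Table 1 can be computed as follows in Python; the
function names are ours, and the indeterminate case a + b + c = 0 is resolved by
convention:

    import numpy as np

    def counters(x, y):
        # counters a, b, c, d for binary vectors x, y; a + b + c + d = m
        x, y = np.asarray(x), np.asarray(y)
        a = int(x @ y)               # both coordinates equal 1
        b = int(x @ (1 - y))         # 1 in x, 0 in y
        c = int((1 - x) @ y)         # 0 in x, 1 in y
        d = int((1 - x) @ (1 - y))   # both coordinates equal 0
        return a, b, c, d

    def d_jaccard(x, y):
        # d6 = 1 - a/(a + b + c); the indeterminate case is set to 0 by
        # convention (cf. the definitions in Batagelj and Bren, 1995)
        a, b, c, _ = counters(x, y)
        return 1 - a / (a + b + c) if a + b + c > 0 else 0.0

    def d_russel_rao(x, y):
        # d1 = 1 - a/m; note that d1 does not vanish on the diagonal (D2 fails)
        a, b, c, d = counters(x, y)
        return 1 - a / (a + b + c + d)

    x, y = [0, 1, 1, 0], [0, 0, 0, 1]
    print(counters(x, y))        # (0, 2, 1, 1)
    print(d_jaccard(x, y))       # 1.0
    print(d_russel_rao(x, x))    # 0.5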
In this part we study, using a computational approach, the dissimilarity measures
obtained by applying the following nonlinear transformations to the dissimilarity
measures from Table 1:

    D1 = d / (1 + d)        D2 = d / (1 − d)        D3(t) = d / (1 + t(1 − d))

    D4 = −ln(1 − d)         D5 = (2/π) arctan d

    D6 = 1 − |1 − 2d|       D7 = 4d(1 − d)
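For reference, these transformations read directly as Python functions (a transcription
of the formulas above; the cases undefined for d = 1 are left unguarded):

    import math

    D1 = lambda d: d / (1 + d)
    D2 = lambda d: d / (1 - d)              # undefined for d = 1
    D3 = lambda d, t=2: d / (1 + t * (1 - d))
    D4 = lambda d: -math.log(1 - d)         # undefined for d = 1
    D5 = lambda d: (2 / math.pi) * math.atan(d)
    D6 = lambda d: 1 - abs(1 - 2 * d)
    D7 = lambda d: 4 * d * (1 - d)

    print(D3(0.5))   # 0.25: D3(2) maps 1/2 to 1/4, a value used repeatedly below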
Since the triangle inequality implies evenness, we consider only even (D4) dissimilarity
measures.
For a dissimilarity measure that does not vanish on the diagonal, the triangle
inequality implies

    d(x, x) ≤ 2 d(x, y)   for all x, y ∈ E.

But, since P2 holds, this is true for any dissimilarity measure.
For the indeterminate cases we use the definitions proposed in Batagelj and Bren
(1995). To make this reading easier, the second column of Table 1 also includes the
labels used there.
Tab. 1: Resemblance measures on binary vectors, the associated dissimilarities d_i, and their properties D2-D5

      measure                                   s_i   definition of s_i                   d_i       D2 D3 D4 D5
  1.  Russel and Rao (1940)                     s1    a/m                                 1 − s     N  Y  Y  Y
  2.  Kendall, Sokal-Michener (1958)            s2    (a+d)/m                             1 − s     Y  Y  Y  Y
  3.  Jaccard (1900)                            s6    a/(a+b+c)                           1 − s     Y  Y  Y  Y
  4.  Kulczynski (1927), T⁻¹                    s7    a/(b+c)                             s⁻¹       Y  Y  Y  N
  5.  Kulczynski                                s10   (1/2)(a/(a+b) + a/(a+c))            1 − s     Y  Y  Y  N
  6.  Sokal & Sneath (1963), un4                s11   (1/4)(a/(a+b) + a/(a+c)
                                                        + d/(d+b) + d/(d+c))              1 − s     Y  Y  Y  N
  7.  Q0                                        s12   bc/(ad)                             s         N  N  N  N
  8.  Yule (1927), Q                            s14   (ad−bc)/(ad+bc)                     (1−s)/2   N  N  N  N
  9.  -                                         s15   4bc/m²                              s         N  N  N  N
 10.  Driver & Kroeber (1932), Ochiai (1957)    s16   a/√((a+b)(a+c))                     1 − s     Y  Y  Y  N
 11.  Sokal & Sneath (1963), un5                s17   ad/√((a+b)(a+c)(d+b)(d+c))          1 − s     Y  Y  Y  N
 12.  Pearson, φ                                s18   (ad−bc)/√((a+b)(a+c)(d+b)(d+c))     (1−s)/2   Y  Y  Y  N
 13.  Baroni-Urbani, Buser (1976), S**          s19   (a+√(ad))/(a+b+c+√(ad))             1 − s     Y  Y  Y  N
 14.  Braun-Blanquet (1932)                     s20   a/max(a+b, a+c)                     1 − s     Y  Y  Y  Y
 15.  Simpson (1943)                            s21   a/min(a+b, a+c)                     1 − s     Y  N  N  N
 16.  Michael (1920)                            s22   4(ad−bc)/((a+d)² + (b+c)²)          (1−s)/2   N  Y  Y  N

2.1 Approximating the Distance Index by Optimization


We can approach the problem of determining the distance index by solving numerically
the corresponding optimization problem

    p̂ = max { α : ∀ i, j, k ∈ IB^m : d^α(i, j) + d^α(j, k) ≥ d^α(i, k) }.

We use the local optimization procedure

    initial (read or random) i, j, k
    p := 1
    while ∃ (u, v, w) ∈ N(i, j, k) :
          d^p(u, v) + d^p(v, w) < d^p(u, w) do begin
        p := Solve( α : d^α(u, v) + d^α(v, w) = d^α(u, w) )
        i := u; j := v; k := w
    end

over the neighbourhood

    N(i, j, k) = { (i′, j′, k′) : ∃ r ∈ 1..m : (i′_r = ¬i_r ∨ j′_r = ¬j_r ∨ k′_r = ¬k_r) },

i.e. the triples obtained by complementing a single component of i, j or k.

From the local minima (i", j", k") obtained by this procedure we can usually guess a
general pattern of 'extremal' triples from which we compute an upper bound for Pm:

Pm = p(i·,j", k"),
We conjecture that for the obtained triples (i",j",k·) the equality Pm = Pm often
holds.
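A runnable version of this local search is sketched below (our own conventions:
dissimilarity values are assumed to lie in (0, 1], Solve is implemented by bisection,
and all orderings of the candidate triple are checked):

    import itertools, random

    def solve_alpha(s1, s2, long, lo=1e-9, hi=1.0, tol=1e-9):
        # bisection for the alpha with s1**alpha + s2**alpha = long**alpha;
        # near alpha = 0 the left-hand side dominates, at alpha = hi it does not
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if s1**mid + s2**mid >= long**mid:
                lo = mid
            else:
                hi = mid
        return lo

    def local_distance_index(d, m, restarts=100, seed=0):
        # randomly restarted local search over triples of binary m-vectors;
        # returns an upper bound for the distance index p_m of d
        rng = random.Random(seed)
        best = 1.0
        for _ in range(restarts):
            triple = [tuple(rng.randint(0, 1) for _ in range(m)) for _ in range(3)]
            p, moved = 1.0, True
            while moved:
                moved = False
                # neighbourhood N(i, j, k): complement one coordinate of one vector
                for idx, r in itertools.product(range(3), range(m)):
                    cand = [list(v) for v in triple]
                    cand[idx][r] ^= 1
                    cand = [tuple(v) for v in cand]
                    for u, v, w in itertools.permutations(cand):
                        duv, dvw, duw = d(u, v), d(v, w), d(u, w)
                        if min(duv, dvw, duw) > 0 and duv**p + dvw**p < duw**p:
                            p = solve_alpha(duv, dvw, duw, hi=p)
                            triple, moved = cand, True
            best = min(best, p)
        return best

Applied to a measure such as the Ochiai dissimilarity d16 with m = 20, restarts of
this kind should reproduce values close to the p̄₂₀ entries of Table 2.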
3. Results
Table 2 lists the values p̄₂₀ for twelve even dissimilarity measures and their
transformations, computed with this local optimization procedure. The notation used:

    Y      the triangle inequality (D5) holds;
    N-d    there exist binary vectors for which the transform D is not defined;
    N-e    D5 does not hold because the transform D is not even;
    N-p    some values of the transform D are not positive;
    p̄ →    p̄ tends to this value as the dimension m increases.

In the following paragraphs we explain the obtained results for the most interesting
cases.

Russel and Rao: d1 = 1 − s1 = 1 − a/m ∈ [0, 1]
It does not vanish on the diagonal: for 0 := [0, ..., 0] ∈ IB^m we have d1(0, 0) = 1.
The triangle inequality holds with equality on vectors of the form

    i* = [0, ..., 0, 1, ..., 1]   (r zeros followed by m − r ones)
    j* = [1, ..., 1, 0, ..., 0]   (r ones followed by m − r zeros)
    k* = [1, ..., 1, 1, ..., 1] =: 1

as 1 = d1(i*, j*) = d1(i*, 1) + d1(j*, 1) = r/m + (m − r)/m. So p̄ is equal to 1.
The transform D1 is a concave function on the interval [0, 1]. Hence the strict
triangle inequality holds, and it is an equality only for the "triples" (1, 1, i) for
all i ∈ IB^m.

Tab. 2: Values of p̄₂₀ for twelve even dissimilarity measures and their transformations

  measure           Linear          D1               D2    D3(2)          D4    D5             D6       D7
  Russel    s1      Y               Y                N-d   0.5            N-d   Y              N-e      N-e
  Kendall   s2      Y               Y                N-d   0.5            N-d   Y              Y        Y
  Jaccard   s6      Y               Y                N-d   0.5            N-d   Y              N-e      N-e
  T⁻¹       s7      N-d             Y                N-d   N-d            N-d   0.7761         N-p      N-p
  Kulczynski s10    0.3798          0.5283           N-d   0.2489         N-d   0.4375         N-e      N-e
                    (p̄ → 0)         (p̄ → 0)                (p̄ → 0)
  un4       s11     0.5004          0.6457           N-d   0.3847         N-d   0.5377         0.6186*  0.8385*
                    (p̄ → 0)
  Driver    s16     0.5535          0.7706           N-d   0.3286         N-d   0.6509         N-e      N-e
                    (p̄ → 0.5)
  un5       s17     0.5946          0.9319           N-d   0.3412         N-d   0.7257         N-e      N-e
                    (p̄ → 0.5645)    (Y for m ≤ 10;         (p̄ → 0.3286)         (p̄ → 0.6938)
                                     p̄ → 0.8755)
  φ         s18     0.5696          0.7383           N-d   0.4255         N-d   0.6067         0.6236   0.8965
                    (p̄ → 0.5645)
  S**       s19     0.3785          0.5386           N-d   0.2460         N-d   0.4337         N-e      N-e
                    (p̄ → 0)         (p̄ → 0)
  Braun     s20     Y               Y                N-d   0.5            N-d   Y              N-e      N-e
  Michael   s22     0.2811          0.2979           N-d   0.2265         N-d   0.2855         0.2811   0.2985
Indeed, for the previous triple i*, j*, 1 we have

    1/2 = D1(d1(i*, j*)) ≤ D1(d1(i*, 1)) + D1(d1(j*, 1)) = x/(m + x) + (m − x)/(2m − x),

where x = r. The equality holds only when x = m, that is, for j* = 1.
The transform D2 is undefined because D2(d1(i*, j*)) = 1/0.
If we choose the parameter t = 2 we get D3(2) = d/(3 − 2d), and for the previous triple

    D3(2)(d1(i*, j*)) = 1   and   D3(2)(d1(i*, 1)) + D3(2)(d1(j*, 1)) = x/(3m − 2x) + (m − x)/(m + 2x).

The expression on the right side is a function of x (0 ≤ x ≤ m) with minimum at
x = m/2. Therefore, for x = m/2, we get

    1/4 + 1/4 = 1/2 < 1,

and, solving 1 = 2 (1/4)^p, we obtain p̄ = ln 2 / ln 4 = 0.5.
The transform D4 is undefined because D4(d1(i*, j*)) = −ln(1 − 1).
The transform D5 is a concave function on the interval [0, ∞]. Hence the strict
triangle inequality holds.
The transformations D6 and D7 are not even: for the triple i = [0, 1, 1, 0],
j = [0, 0, 0, 1] and k = [0, 1, 1, 1], we have

    pair   a   b   c   d    d1(·,·)   D6    D7
    ij     0   2   1   1    1         0     0
    ik     2   0   1   1    1/2       1     1
    jk     1   0   2   1    3/4       1/2   3/4

Sokal & Sneath: d17 = 1 − s17 = 1 − ad/√((a+b)(a+c)(d+b)(d+c)) ∈ [0, 1]
is a definite dissimilarity (see Table 1) but it does not obey the triangle inequality:
for the triple i*, j*, k* ∈ IB^m

    i* = [0, ..., 0, 0, 1]
    j* = [0, ..., 0, 1, 0]
    k* = [0, ..., 0, 1, 1]

we get d17(i*, j*) = 1 and d17(i*, k*) = d17(j*, k*) = 1 − √((m − 2)/(2(m − 1))).
This is for dimension

    m = 2:    also equal to 1; the triangle inequality (D5) holds;
    m = 3:    equal to 1/2; the triangle equality holds;
    m = 4:    equal to 1 − √(1/3) ≈ 0.42; D5 doesn't hold;
    m → ∞:    equal to 1 − 1/√2 ≈ 0.29; D5 doesn't hold.

So p̄₂₀ = −ln 2 / ln(1 − 3/√19) ≈ 0.5946, and when m tends to infinity
p̄ = −ln 2 / ln(1 − 1/√2) ≈ 0.5645.
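These closed-form values are easy to check numerically; a minimal sketch (ours), using
the extremal triple above:

    import math

    m = 20
    q = 1 - math.sqrt((m - 2) / (2 * (m - 1)))  # d17(i*, k*) = d17(j*, k*)
    # the power p at which the triangle becomes an equality: 1 = 2 * q**p
    p = math.log(2) / -math.log(q)
    print(p)   # 0.5947..., i.e. the p-bar_20 above up to rounding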

The transform D1 is a concave function on the interval [0, 1]; hence the triangle
inequality holds up to larger dimensions than for d17 itself. For the previous triple
we get D1(d17(i*, j*)) = 1/2, and D1(d17(i*, k*)) = D1(d17(j*, k*)) is for dimension

    m = 2:     also equal to 1/2; D5 holds;
    m = 3:     equal to 1/3; D5 holds;
    m = 4:     equal to (√3 − 1)/(2√3 − 1) ≈ 0.30; D5 holds;
    m = 10:    equal to 1/4; the triangle equality holds;
    m = 20:    equal to (√19 − 3)/(2√19 − 3) ≈ 0.24; D5 doesn't hold;
    m → ∞:     equal to (√2 − 1)/(2√2 − 1) ≈ 0.23; D5 doesn't hold.

For m > 10 we obtain for the upper bound of p_m:

    p̄₂₀ = ln 2 / ln((2√19 − 3)/(2(√19 − 3))) ≈ 0.9319,
    and for m → ∞   p̄ = ln 2 / ln((2√2 − 1)/(2(√2 − 1))) ≈ 0.8755.
The transform D2 is undefined because D2(d17(i*, j*)) = 1/0.
For t = 2 the transform is D3(2) = d/(3 − 2d). For the previous triple i*, j*, k* we
get D3(2)(d17(i*, j*)) = 1, and D3(2)(d17(i*, k*)) = D3(2)(d17(j*, k*)) is for
dimension

    m = 2:     also equal to 1; D5 holds;
    m = 3:     equal to 1/4; D5 doesn't hold;
    m = 20:    equal to ≈ 0.13; D5 doesn't hold;
    m → ∞:     equal to ≈ 0.12; D5 doesn't hold.

Now we can calculate the upper bound of p_m:

    p̄₂₀ = ln 2 / ln((√19 + 6)/(√19 − 3)) ≈ 0.3412,
    and for m → ∞   p̄ = ln 2 / ln((√2 + 2)/(√2 − 1)) ≈ 0.3286.

If we calculate p̄₂₀ for different values of the parameter t we get for t = 1:
p̄₂₀ = 0.4103, for t = 2: p̄₂₀ = 0.3412, and for t = 3: p̄₂₀ = 0.3033.
The transform D4 is undefined because D4(d17(i*, j*)) = −ln(1 − 1).
The transform D5 is a concave function on the interval [0, ∞]. For the previous triple
we get D5(d17(i*, j*)) = 1/2, and D5(d17(i*, k*)) = D5(d17(j*, k*)) is for dimension

    m = 2:     also equal to 1/2; D5 holds;
    m = 3:     equal to (2/π) arctan(1/2) ≈ 0.29; D5 holds;
    m = 4:     equal to (2/π) arctan(1 − 1/√3) ≈ 0.25; D5 holds;
    m = 5:     equal to (2/π) arctan(1 − √(3/8)) ≈ 0.23; D5 doesn't hold;
    m = 20:    equal to (2/π) arctan(1 − 3/√19) ≈ 0.19; D5 doesn't hold;
    m → ∞:     equal to (2/π) arctan(1 − 1/√2) ≈ 0.18; D5 doesn't hold.

Hence for m > 4 we can calculate the upper bound of p_m:

    p̄₂₀ ≈ 0.7257   and for m → ∞   p̄ ≈ 0.6938.


The transformations D6 and D7 are not even: for the triple i = [1, 1, 0, 0],
j = [1, 0, 1, 1] and k = [1, 0, 1, 0], we get

    pair   a   b   c   d    d17(·,·)            D6                     D7
    ij     1   1   2   0    1                   0                      0
    ik     1   1   1   1    3/4                 1/2                    3/4
    jk     2   1   0   1    1 − 1/√3 ≈ 0.42     2(1 − 1/√3) ≈ 0.84     (4/3)(√3 − 1) ≈ 0.98

In the last two columns of Table 2 we can see that most of the considered dissimilarity
measures are not even. This occurs because the transformations D6 and D7 map 1 → 0.
Hence the transformed dissimilarity measure is even if and only if the original one has
the property: for all pairs i, j ∈ IB^m such that d(i, j) = 1,

    d(i, k) = d(j, k)   or   d(i, k) = 1 − d(j, k)   for all k ∈ IB^m.

Because the dissimilarity d11 has this property on IB^m − {0, 1} we put the * in its
row.

4. Conclusion
In this paper we presented an approach to determining the distance index for a given
dissimilarity on binary vectors, and applied it to some well-known dissimilarity
measures and their transforms. The results obtained offer new information that can be
used when selecting a dissimilarity measure for applications.
We also expect that the proposed approach can be successfully applied to dissimilarities
between other types of units.

5. References:

Batagelj, V., Bren, M. (1995): Comparing Resemblance Measures. Journal of Classification,
12, 1, 73-90.
Batagelj, V., Bren, M. (1996): Determining the Distance Index. In: Ordinal and Symbolic
Data Analysis: Proceedings of the International Conference on Ordinal and Symbolic Data
Analysis - OSDA'95, Paris, June 20-24, 1995, Diday, E. et al. (eds), 238-252, Springer-
Verlag: Berlin, Heidelberg, New York.
Critchley, F., Fichet, B. (1994): The Partial Order by Inclusion of the Principal Classes of
Dissimilarity on a Finite Set, and Some of Their Properties. In: Classification and Dissimi-
larity Analysis, Van Cutsem, B. (ed.), 5-67, Lecture Notes in Statistics, Springer-Verlag:
New York.
Gower, J.C., and Legendre, P. (1986): Metric and Euclidean properties of dissimilarity
coefficients. Journal of Classification, 3, 5-48.
Joly, S., Le Calvé, G. (1986): Etude des puissances d'une distance. Statistique et Analyse
des Données, 11, 3, 30-50.
Meta-data and Strategies of Textual Data Analysis:
Problems and Instruments

Sergio Bolasco

Faculty of Economy
University of Rome "La Sapienza"
Via del Castro Laurenziano, 9 - 00161 Roma - Italy

Summary: In order to develop a proper multidimensional content analysis, we discuss some
typical aspects of the pre-treatment of textual data, in particular: i) how to select the
peculiar subset of the words in a text; ii) how to reduce word ambiguity. Our proposal is to
use both frequency dictionaries and reference lexicons as external lexical knowledge bases
with respect to the corpus, by means of a comparison of rankings inspired by Wegman's
parallel coordinate method. The condition of iso-frequency of unlemmatized forms as an
indication of the need for lemmatization is considered. Finally, in order to evaluate the
appropriateness of the choices (both disambiguations and fusions), we propose the
reconstruction, by means of a bootstrapping strategy, of some convex hulls, as word
confidence areas, in a factorial plane. Some examples from a large corpus of parliamentary
discourses are presented.

1. Introduction
In this paper we are concerned with the different phases of text pre-treatment necessitated
by a content analysis based on multidimensional statistical techniques. These phases have
been modified in recent years by the growth in size of textual data corpora and their related
vocabularies, and by the increased availability of lexical resources.
As a consequence of this, some new problems arise. The first one is how to select the
fundamental core of the corpus vocabulary, when it is composed of several thousands of
elements. In other words how to identify the subset of the characteristic words within a text,
regardless of their frequency, in order to optimize computing time and minimize
interpretation problems. The second problem is how to reduce the ambiguity of language
produced by the automatic treatment of a text. The main aspects of this are the choice of the
unit of analysis and of lemmatization.
We also propose the validation of the lemmatization choices in terms of the stability of the
word points on factorial planes in order to control the effects of this preliminary
intervention.

To solve these problems, it is possible to use both external and internal information
concerning the corpus, i.e. both meta-data and data. Some examples of our proposals are
applied to a very large corpus of parliamentary discourses on government programmes
(called Tpg from now on). The size of the Tpg corpus (Tpg Program Discourses and Tpg
Replies) is over 700.000 occurrences, and the Tpg vocabulary comprises over 28.000
unlemmatized words, equivalent to 2500 pages of text.


2. How to identify the fundamental core of the corpus vocabulary


Regarding the first problem, frequency dictionaries and reference lexicons play a crucial
role as external lexical knowledge bases. The former can be regarded as models of
language. As a reminder, a frequency dictionary is a vocabulary ranked by decreasing
headword frequency, obtained by means of a very large corpus (at least one million
occurrences); this corpus is a representative sample of texts from some collections of
the language. A reference lexicon is a complete inventory of the inflected forms or of any
other collection of locutions or idiomatic expressions.
We can assume that every textual corpus (as discourse) is the reflection of an idiom, a
context and a situation (i.e. enunciation and historical period), so its vocabulary cannot
but come out of these three components.
The idiom is identifiable through the base-dictionary of a given natural language. In Italian
this base-dictionary is represented by a VdB of around 7000 most frequent words in
everyday language (or the 2000 most frequent words in LIF, see Bortolini et al. 1971).
Some of the words of the corpus which belong to the VdB could, in some cases, be
eliminated from the analysis inasmuch as they are necessary only for the construction of
sentences (for instance the grammatical words).
Words such as support-verbs or idiomatic phrases can be clearly identified, and their capture
will contribute to the reduction of ambiguity. This capture is possible by means of a
reference lexicon of locutions and phrasal verbs. For example, if we look at the Italian verb
<andare> (to go, in English), we see in Tab. 1, from a reference lexicon, that there are
over 200 different phrasal verbs that use this verb as support. Of these, of course, almost half
do not exist or do not have an equivalent in English.

Tab. 1: Examples of idioms of the verb "andare" (to go) as phrasal verb

    andar/bene = be/a/good/match
    andare/agli/estremi = go/to/extremes
    andare/a/fare/la/spesa = go/shopping
    andare/a/giornata = go/out/to/work/by the/day
    andare/a/male = go/badly
    andare/a/spasso = go/for a/walk
    andare/a/zonzo = saunter
    andare/avanti = progress
    andare/direttamente/allo/scopo = go/straight/to the/mark
    andare/fuori = get/out; go/out; set/out
    andare/fuori/uso = wear/out
    andare/oltre i/limiti = overstep/the/limits
    andare/per la/maggiore = be/very/popular
    andare/sotto il/nome/di = go/by the/name/of

(each entry carries grammatical tags in the lexicon, e.g. VAVV/V, VPN/V, VDETN/V)
and so on, with over 200 different examples in the Italian language and at least another 40
phrasal forms of "to go" in English.

The context and the situation are characterized with the aid of a specialized frequency
dictionary (political, scientific, economic, etc.). In this case, the lexical inclusion
percentage of the corpus vocabulary in the reference language model is a basic measure.
With regard to the Tpg, the chosen frequency dictionary is the lexicon of Press and Press
Agencies Information (called Veli). This vocabulary is derived from a collection of over 10
million occurrences. On the assumption that the Veli vocabulary is the most pertinent
neutral model available of formal language in the social and political context, we can ask
ourselves to what extent the Tpg corpus resembles it, or differs from it.
In this sense the situation can be identified by studying the original terms not included in
this external knowledge base. In our case, the language of the situation is composed of the
Tpg terms which do not belong to the Veli. This sub-set is interesting in itself.
On the contrary, the context can be identified through the words common to the above two
lexicons. Among these words, in general, the highly specific sectorial terms are those
showing the largest diversities of use with respect to the chosen frequency dictionary.
In this way we are interested in identifying one sub-set of characteristic words. The
peculiarity or intrinsic specificity of this sub-set will be measured by calculating the
diversity of use for each pair of words. As Lyne says (1985: 165): "The specific words are
terms whose frequency differs characteristically from what is normal. The difference can be
calculated from the theoretical frequency of a word in a given text, on the assumption that
the latter is proportional to the length of the text." One possible measure of specificity is the
classical z, a normalized difference of the frequencies,

    z = (f_j − f*) / √(f*)

where f_j is the relative number of occurrences in the corpus and f* the corresponding one
in the frequency dictionary. Proposed by P. Guiraud in 1954, z is usually called écart réduit,
and it is equivalent to the square root of the chi square.
It is also possible to compare the coefficients of usage between the two vocabularies, where
the latter is, for each headword, the frequency weighted by a measure of dispersion.
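In code, the computation reads as follows (a minimal sketch with hypothetical counts;
here the reference frequency is rescaled to the length of the corpus before comparison):

    def specificity_z(n_corpus, len_corpus, n_ref, len_ref):
        # Guiraud's ecart reduit z = (f_j - f*) / sqrt(f*), computed here on
        # counts, with the reference frequency rescaled to the corpus length
        f_star = n_ref * len_corpus / len_ref   # theoretical frequency
        return (n_corpus - f_star) / f_star ** 0.5

    # hypothetical counts: a verb seen 420 times in a 700,000-token corpus,
    # against 1,500 occurrences in a 10-million-token reference dictionary
    z = specificity_z(420, 700_000, 1500, 10_000_000)
    print(round(z, 1))   # 30.7, far beyond the |z| >= 3 selection threshold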

The above specificity measure can be either positive or negative. Using the Veli list as a
yardstick, we can investigate the Tpg vocabulary. In fact, as Lyne suggests (ibidem: 7):
"The ranking favours those items which are most characteristic of our corpus, what we
shall call Positive Items. Conversely, towards the bottom of this list are found those items,
Negative Items, which, although still occurring (in some instances frequently) in our
corpus, are nevertheless least characteristic of it, since they occur relatively less frequently
than in the reference dictionary."
Once the relative differences between the Tpg and the Veli vocabulary are measured in terms
of z, it is possible to select and to visualize two comparative rankings of words in the above
vocabularies. The threshold of selection can be the classical level of the absolute value of z
(greater than or equal to 3). The set of these selected words can be visualized by using the
method of "parallel coordinates" (Wegman, 1990). As is known, Wegman's proposal consists
in using the parallel coordinate representation as a high-dimensional data analysis tool.
Wegman shows that this geometry has some interesting properties; in particular, a statistical
interpretation of the correlation can be given. For highly negatively correlated pairs, the dual
line segments in parallel coordinates tend to cross near a single point between the two
parallel axes. So the level of correlation can be visualized by means of the set of these
segments (see Wegman's fig. 3, ibidem: 666).

[Figure: parallel-coordinate comparison of the VELI rank and the TPG rank. Examples of
items in the TPG with highest positive or negative specificity: intendere = to intend
(z = +35.1), assicurare = to assure (z = +32.6); dire = to say (z = -28.2), fare = to do
(z = -18.8).]

Fig. 1a: Comparison between TPG and VELI ranking of the 100 verbs with the highest
peculiarity in the TPG (either positive or negative intrinsic specificity)

[Figure: parallel-coordinate comparison of the VELI rank and the TPG rank. Examples of
items without intrinsic specificity in the TPG: finanziare = to finance (z = +0.70),
valutare = to evaluate (z = -0.35).]

Fig. 1b: Comparison between TPG and VELI ranking: examples of several banal verbs in the
TPG (neither positive nor negative items)

[Figure: three parallel axes (LIF, VELI and TPG ranks) with segments for verbs such as
TO INVOLVE and TO INSURE.]

Fig. 1c: LIF - VELI - TPG rank comparison: the 15 most commonly used verbs in Italian and a
selection of some highly peculiar TPG verbs with positive or negative specificity

Generally, only two dimensions are considered (figs. 1a, 1b), but it is possible to compare
several (more than two) ranking lists from the related frequency dictionaries (fig. 1c).
Figures 1a-1c illustrate the above selected verbs according to whether they occur more or
less markedly in our Tpg corpus than in the Veli corpus. In fig. 1a we show the 50 verbs
with the highest positive specificity, among these: <intendere> = to intend,
<assicurare> = to assure, <impegnarsi> = to involve, <provvedere> = to take measures,
<favorire> = to favour, <garantire> = to guarantee; and also the other 50 verbs with the
highest negative specificity in our Tpg. Among them, there are several of the most commonly
used verbs, like <dire> = to say, <stare> = to stay, <fare> = to do, <vedere> = to see,
<parlare> = to talk, <venire> = to come, but also <decidere> = to decide,
<spiegare> = to explain, <andare> = to go. As can be seen, the criterion of negative
specificity can clearly characterize certain words as "infrequent" words. In fact, they are
very relevant in their "rarity" (under-used or not so frequent) with respect to the chosen
frequency dictionary, being consciously or unconsciously avoided by the writer or speaker.
This selection of terms could also be the subject of a study by itself.
In fig. 1b we show the group of words that are not specific, also called "banal", which could
be discarded, because they are not so relevant as expressions of the context.
A further selection of items can be derived from the comparison of 3 ranking lists (Tpg -
Veli - Lif). Figure 1c shows the 15 most common verbs and some specific Tpg verbs, as
positive or negative items. From this illustration we can conclude that the most typical
governmental verbs, among the positive items, are "to take measures" and "to intend".
Conversely, the most relevant among the negative ones, in comparison with Veli and Lif, are
"to explain" and "to decide". Finally, it is possible to observe the similar use, in the three
dictionaries, of the verbs "to assure", "to involve", "to insure" as a set of high political
peculiarity, due to their progressive ranking in the passage from the general language (Lif)
to the sectorial one (Veli) up to the more specific one of government programs (Tpg).

3. How to solve problems of ambiguity


Regarding the two components idiom and context, the corpus should be analysed at the
level of headwords (lemmas) and therefore needs a lemmatization.
With respect to the third component (situation), it is instead preferable to analyse the
corpus in terms of inflected forms, such as graphical unlemmatized forms, or, even better,
through the choice of adequate units of analysis (like lexias, as linguists call them; a lexia
is the minimal significant unit of meaning).
In general, if a whole sequence of words induces meaning (for example an idiomatic
expression), it can be regarded as a single lexical item, and therefore as a single entry of
the vocabulary. If the frequency of the related forms composing the sequence is particularly
high with respect to the chosen frequency dictionary, this reflects a highly peculiar
terminology, and we can conclude that this segment is very representative and has an
intrinsic specificity of its own in the corpus.
In all the above cases, the corpus vocabulary is both more precise and unambiguous.
Moreover, it permits us to circumscribe the subsequent phases of lemmatization, that is,
disambiguation and fusion. A preliminary recognition of names, acronyms and polyforms
shortens the lemmatization phase, especially from a semantic point of view. This requires the
use of reference lexicons, such as a dictionary of locutions and of the principal support-verbs
(Elia, 1995). The Institute of Linguistics at the University of Salerno has developed an
integrated system of external lexical knowledge bases composed of the following

inventories: one lexicon of over 110.000 simple entries, derived from a collection of 4 main
dictionaries of the Italian language, called DELAS; one lexicon of over 900.000 inflected
simple forms, called DELAF; one lexicon of over 600.000 inflected polyforms, derived from
250.000 lexias, called DELAC. One dictionary of over 800.000 bilingual terms, called
DEBIS, is also available. Elia's study shows, for example, that among 13.790 simple forms
there are 1.406 polyrhematic constructions (a polyrhematic is a sequence of terms whose
whole meaning is different from that of its elementary components), composed of 3.500
simple forms, equivalent to 25% of the vocabulary. As we can see, the density of
polyrhematic forms is very high.
Therefore it could be very important to construct some frequency dictionaries of polyforms,
in order to compare the corpus vocabulary of repeated segments (Salem, 1987) or, even
better, of quasi-segments (Becue, 1995), and select those sequences that are more
significant. Up to now such frequency dictionaries are not available; an initial attempt to
construct one is illustrated here in Tab. 2, concerning the adverbial groups and other typical
expressions.

Tab. 2: Example of a Frequency Dictionary of Locutions derived from a collection of over 2
million occurrences (among a total of 250 locutions with occurrences > 30)

    ITALIAN WORD           ENGLISH TRANSLATION    GEN Total  TPG Progr  TPG Repl  Other Corpora
    DA PARTE               ON THE PART OF             855       227       368        260
    IN MODO                IN THE WAY                 853       309       288        256
    IN ITALIA              IN ITALY                   548        84        66        398
    PER QUANTO RIGUARDA    WITH REGARD TO             511       237       136        138
    NON SOLO               NOT ONLY                   477       176       119        182
    IN PARTICOLARE         IN PARTICULAR              453       270       100         83
    MA ANCHE               BUT ALSO                   431       153        92        186
    IN TERMINI             IN TERMS OF                429        92        94        243
    DI FRONTE              IN FRONT OF                424       113       240         71
    PER CUI                FOR WHICH                  421        19        34        368
    A LIVELLO              AT THE LEVEL               417        48        36        333
    SI TRATTA              DEALS WITH                 384       170       127         87
    SUL PIANO              ON THE LEVEL OF            373       167       141         65
    NELL'AMBITO            IN THE CONTEXT             368       149       132         87
    NEI CONFRONTI          DEALING WITH               331        79       140        112
    SEMPRE PIU             ALWAYS MORE                330       176        45        109
    IN MATERIA             ON THE SUBJECT             321       143       160         18
    NEL QUADRO             WITH REFERENCE TO          314       178       130          6
    NEL SENSO              IN THE SENSE               297        27        35        235
    IN CORSO               ONGOING                    297       159       124         14
    SULLA BASE             ON THE BASIS OF            277       153       102         22
    PER QUANTO             IN AS FAR AS               273        61        37        175
    NEL CAMPO              IN THE FIELD OF            273       107        76         90
    PER ESEMPIO            FOR EXAMPLE                259        35        74        150
    IN GRADO DI            ABLE TO                    255        70        26        159
    IN MANIERA             IN THE WAY                 246        36        31        161
    UNA VOLTA (da disambiguare)  ONCE, AT ONE TIME,
                           ONCE UPON A TIME           248        35        48        165
    AL FINE                IN ORDER TO                202       166        31          5

Preliminary matching with the corpus under study allows us to isolate the relevant parts of
lexical items (either single or compound forms) and constitutes a valid system of text
pre-categorization.
An additional possibility for this disambiguation emerges from the data. In every corpus it is
possible to observe some equivalence of frequency, which I call iso-frequency, among the

inflected forms of the same adjectives or nouns. See in tab. 3 some examples of adjectives
like economic, important and legislative.

Tab. 3: Examples of Iso-Frequency

    non ISO-FREQUENT NOUNS:
      LEGGE (law) (s) 622;  LEGGI (laws) (p) 208  (DS = 0.33)
      ECONOMIA (economy) (s) 262;  ECONOMIE (economies) (p) 35  (DS = 0.13)

    ISO-FREQUENT NOUNS:
      OBIETTIVO (purpose) (s) 243;  OBIETTIVI (purposes) (p) 286  (DS = 0.85)
      INTERESSE (interest) (s) 193;  INTERESSI (interests) (p) 178  (DS = 0.92)
      LIVELLI (levels) (p) 110;  LIVELLO (level) (s) 187 - 67 = 120  (DS = 0.91)
        [compound forms: <a/livello> 48, <al/livello> 19]
      FORZA (force) (s) 105;  FORZE (forces) (p) 259 - 166 = 93  (DS = 0.88)
        [compound forms: <forze politiche> 126, <forze sociali> 40]

    ISO-FREQUENT ADJECTIVES:
      ECONOMICO (ms) 315 (DS = 0.77);  ECONOMICA (fs) 461;
      ECONOMICHE (fp) 100;  ECONOMICI (mp) 100 (DS = 1.00)
      IMPORTANTE (s) 117;  IMPORTANTI (p) 116  (DS = 0.99)
      LEGISLATIVO (ms) 57 (DS = 0.84);  LEGISLATIVA (fs) 68;
      LEGISLATIVE (fp) 53;  LEGISLATIVI (mp) 58 (DS = 0.91)
      LIBERA (fs) 58;  LIBERO (ms) 55 (DS = 0.95);
      LIBERE (fp) 28;  LIBERI (mp) 25 (DS = 0.89)
      LOCALE (local) (s) 80;  LOCALI (local) (p) 195 - 90 = 105  (DS = 0.89)
        [compound form: <enti-locali> 90]

    Legend: DS = occ A / occ B, with occ A < occ B.
    (s) singular, (p) plural, (ms) masculine singular, (fs) feminine singular,
    (mp) masculine plural, (fp) feminine plural.

This iso-frequency can be a first clue to their equivalent use and meaning. On the contrary,
in some cases, the lack of iso-frequency among the inflected forms of the same headword
(Bolasco, 1993) suggests the need for disambiguation. In fact, this happens in the presence
of some compound forms, especially where the incidence of the occurrences of the simple
component forms is relevant, as in words like <forza> (force) and <livello> (level). For
example, when we take away from the frequency of the word "level" (187) the compound
forms "at (local) level" (48) and "at the level of" (19), we return to iso-frequency (120)
with the plural (110). As we will see later, the differences among the inflected forms can be
a clue to their different meanings. This should be verified by means of a bootstrapping
approach.

4. Strategies for evaluating the lemmatization choices


For an optimal reconstruction of the main semantic axes of latent sense in a corpus we can
use, as is well known, correspondence analysis (Lebart and Salem, 1994). Our objective, at
this level, is to obtain stable representations. To assess the appropriateness of both the
disambiguations and the fusions, we can test their significance by providing the factorial
planes with confidence areas (Balbi, 1995). This assessment procedure is based on a
bootstrapping strategy that generates a set of "word by subtext" frequency matrices. We
adopt Balbi's proposal, which consists in generating a large number B of contingency
tables by resampling, with replacement, from the original contingency table.
This set of bootstrapped matrices generates a three-way data structure, which could be
analysed, for example, by means of a multiway technique for constructing a reference
matrix; a technique such as STATIS can be used, see Lavit (1988). In our example, in order
to optimize computing time, the reference matrix is the average of these B matrices, due to
the large dimensions of the original matrix (786 x 46) and of the number of bootstrapped
matrices (B = 200).
The stability of the word points is established graphically by projecting them, as
supplementary points, onto the first factorial plane computed from a correspondence
analysis of this reference matrix. Balbi proposes to use non-symmetrical correspondence
analysis (ANSC). We have attempted this road, but the results have not been encouraging at
the level of interpretation. We believe that, in general, it is more appropriate to use simple
correspondence analysis, and the ANSC only for special reasons. The resulting clouds of
points (one for each word) constitute the empirical confidence areas, delimited by a convex
hull. Fig. 2 shows the convex hull regarding the word <way> and its locution <in/the/way>.

[Figure: two convex hulls on the first factorial plane, one for the locution <in/the/way>,
one for the word <way>.]

Fig. 2: Convex hulls of the locution IN THE WAY and of the word WAY
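A minimal sketch of the resampling step (ours, with numpy and scipy; the table sizes
follow the example above, the cell values are stand-ins):

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(786, 46)).astype(float)  # stand-in word x subtext table
    N, B = int(X.sum()), 200

    # resample N occurrences with replacement from the cell proportions and
    # average the B bootstrapped tables into the reference matrix
    p = X.ravel() / N
    boot = [rng.multinomial(N, p).reshape(X.shape) for _ in range(B)]
    X_ref = np.mean(boot, axis=0)

    # after a correspondence analysis of X_ref, the B replicated profiles of a
    # word are projected as supplementary points; its empirical confidence
    # area is the convex hull of the projections (placeholder coordinates here)
    pts = rng.normal(size=(B, 2))
    hull = ConvexHull(pts)
    print(X_ref.shape, hull.volume)   # (786, 46) and the area of the hull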

In practice, if two or more convex hulls do not overlap, disambiguation is absolutely
necessary. See also, in fig. 3, the semantic disambiguation of the word <sviluppo>
(development) into three different meanings: the first as "economic growth", the second as
"progress" (in a general political sense: social or civil), and the third as some "specific
technological advance".
On the contrary, if the relative convex hulls of different inflected forms or of some
synonyms are (strongly) overlapping or included one in another, their fusion is fruitful for
the analysis.

[Figure: three convex hulls for DEVELOPMENT_S3 (specific technological advance),
DEVELOPMENT_S2 (the social or civil progress) and DEVELOPMENT_S1 (economic development).]

Fig. 3: Convex hulls of three different meanings of the word DEVELOPMENT (semantic disambiguation)

Let me give some examples concerning these situations. In fig. 4a, "stato_verb" (equivalent
to "been" in English) is clearly distant from "stato_noun" ("state"); conversely, the
different unlemmatized forms <stato/a/e/i> of "stato_verb" (see fig. 4b) have their convex
hulls completely overlapping, and it is not important to distinguish them. Furthermore, if we
look at the two meanings of "stato_noun", they are further distinguished (fig. 4a):
<stato_s1>, as state or nation, and <stato_s2>, as status or condition/situation (marital
status, state of mind), have their relative convex hulls separated.

[Figure: separated convex hulls of the noun STATO: STATO_S1 as "state"/nation (621 occ.)
and STATO_S2 as "state"/status.]

Fig. 4a: Convex hulls of the Italian word "STATO" after disambiguation

[Figure: overlapping convex hulls of the inflected forms of the past participle, among them
STATA_V (116 occ.). Legend: STATA = feminine and singular; STATO = masculine and singular;
STATE = feminine and plural; STATI = masculine and plural.]

Fig. 4b: Convex hulls of the different Italian inflected forms of the past participle of the verb "to be"

[Figure: convex hulls of CONDITION_S1 (72 occ.), "condition" as position or situation;
STATE_S1 (100 occ.), "state" as status; and "situation" as position or condition.]

Fig. 4c: Convex hulls of the synonyms of "STATE" as "condition" or "situation"

In particular, the latter does not overlap much with the other synonyms, such as
"condition_s1" and "situation_s1", as can be seen in fig. 4c. This shows how the use of
these terms has changed over time in political discourse. Paying particular attention to fig.
4a, these words are always distant from state as the Italian State.

Now let me look at the significance and interpretation of convex hull sizes and positions, as
shown in the following scheme:
1) a small convex hull, and therefore closeness of points, means high stability of the
representation, but: a) when the points are located around the origin of the axes, it means
evenness of these items in the various parts of the corpus, or b) when the points are in one
particular quadrant of the plane, distant from the origin, it means the item is very
characteristic and specific to some sub-set of the corpus. In this case, most of the time we
obtain convex hulls not as small as above, because the factor scale of this region depends on
the distance of the points from the origin (see the example of Politics in fig. 5);

[Figure: convex hulls of the inflected forms POLITICA_S and POLITICHE_S, located away from the origin.]

Fig. 5: Convex hulls of the Italian inflected forms of the headword POLITICS

2) a large convex hull, that is, one with a wide dispersion of points, means a less strong
stability of the representation, and several different uses of this word in the corpus, but:
a) if the convex hulls do not overlap, this means that the relative items have different
meanings and that their fusion is not pertinent or, in other words, that their disambiguation
is justified (see in fig. 4a the case of nation and status); or b) if, conversely, the convex
hulls overlap, this means an irrelevant disambiguation or a justified fusion (factual
synonyms).
In conclusion, having discussed how to identify the most significant part of the corpus and
how to construct a more restricted and highly peculiar vocabulary composed of items with a
high level of semantic quality, we can now finally proceed to an accurate and proper
multidimensional content analysis, based on the above vocabulary, in which all the relevant
units of analysis, which I have called "textual forms", are considered (Bolasco, 1993).
To this effect, such a vocabulary (see an example in Tab. 4) will be composed of the items
which are: 1) not banal with respect to some model of language (high intrinsic specificity or
original terms); 2) significant as a minimal unit of meaning (lexia): either headwords (verbs
and adjectives), or unlemmatized significant inflected forms (such as nouns in the plural with
a different meaning from the singular, i.e. forza/forze), or the more frequent typical
locutions and other idiomatic expressions (phrasal verbs and nominal groups).

References:

Balbi, S. (1995): Non symmetrical correspondence analysis of textual data and confidence
regions for graphical forms. In: JADT 1995 - Analisi statistica dei dati testuali, Bolasco, S.
et al. (eds.), II, 5-12, CISU, Roma.
Becue, M. et Haeusler, L. (1995): Vers une post-codification automatique. In: JADT 1995 -
Analisi statistica dei dati testuali, Bolasco, S. et al. (eds.), I, 35-42, CISU, Roma.
Bolasco, S. (1993): Choix de lemmatisation en vue de reconstructions syntagmatiques du texte
par l'analyse des correspondances. Proc. JADT 1993, 399-410, ENST-Telecom, Paris.
Bolasco, S. (1994): L'individuazione di forme testuali per lo studio statistico dei testi con
tecniche di analisi multidimensionale. Atti della XXXVII Riunione Scientifica della S.I.S.,
II, 95-103, CISU, Roma.
Bortolini, N., Tagliavini, C., Zampolli, A. (1971): Lessico di frequenza della lingua italiana
contemporanea. Garzanti, Milano.
Dubois, J. et al. (1979): Dizionario di Linguistica. Zanichelli, Bologna.
Elia, A. (1995): Per una disambiguazione semi-automatica di sintagmi composti: i dizionari
elettronici lessico-grammaticali. In: Ricerca Qualitativa e Computer, Cipriani, R. e Bolasco,
S. (eds.), 112-141, Franco Angeli, Milano.
Cipriani, R. e Bolasco, S., eds. (1995): Ricerca Qualitativa e Computer. Franco Angeli, Milano.
Lavit, Ch. (1988): Analyse conjointe de tableaux quantitatifs. Masson, Paris.
Lebart, L. et Salem, A. (1994): Statistique textuelle. Dunod, Paris.
Lyne, A.A. (1985): The Vocabulary of French Business Correspondence. Slatkine-Champion, Paris.
Salem, A. (1987): Pratique des segments répétés. Essai de statistique textuelle. Klincksieck, Paris.
Wegman, E.J. (1990): Hyperdimensional Data Analysis Using Parallel Coordinates. JASA, 85,
411, 664-675.
Clustering of Texts using Semantic Graphs.
Application to Open-ended Questions in Surveys
Monica Becue Bertaut 1 and Ludovic Lebart 2

1 Universitat Politecnica de Catalunya,
Pau Gargallo 5,
08028 Barcelona, Spain

2 Centre National de la Recherche Scientifique,
ENST, 46 rue Barrault, 75013 Paris, France

Summary: A methodology for the automatic classification of short texts is proposed
(leading cases are responses to open-ended questions in sample surveys, and titles or
abstracts of papers in documentary data bases). It aims to take into account a graph
structure on the variables (elementary text units). This graph could be a semantic graph
provided by an external source, or a co-occurrence graph built from the corpus itself.

1. The basic problem


The starting point is to consider each text as described by its lexical profile, i.e. by a vector
that contains the frequencies of all the selected units in the text. The units could be words (or
lemmas, or types), segments (sequences of words appearing with a certain frequency) or
quasi-segments (segments allowing non-contiguous units). The corpus is represented by an
(n, p) matrix X whose i-th row (i.e. i-th respondent) contains the vector whose p components
are the frequencies of the units (words).
However, a usual classification algorithm applied to the rows of X could lead to
disappointing or misleading results:
1) the matrix X could be very sparse; many rows could have no element at all in common
(in terms of responses to open-ended questions in sample surveys, this means that two
answers may contain entirely distinct words);
2) a wealth of meta-data is available and needs to be utilized (syntactic relationships,
semantic networks, external corpora and lexicons, etc.).
To make the distances between the lexical profiles of the texts more meaningful, we can add
new variables obtained from a morpho-syntactic analyzer; it is then possible to tag the text
units depending on their category, and to complement the p-vector associated with a
response or a text with new components (Salem, 1995). It is also useful to take into
account the available semantic information about units, information which can be stored in
a "semantic weighted graph". From now on, we will focus on this latter issue and discuss
the different ways of deriving such a graph from the data themselves.
Note that most algorithms of constrained classification (see for example Gordon (1996) for
a recent review of hierarchical classification including this topic) involve a graph structure
upon the set of objects to be classified. We are dealing here with a quite different situation:
it is the set of descriptors (words) of the objects (texts) which is provided with a graph
structure. This structure will be used to modify the distances between pairs of objects.
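As an illustration of this representation (a sketch, not part of the original paper, using
scikit-learn; the two sample answers are those quoted in Section 4 below):

    from sklearn.feature_extraction.text import CountVectorizer

    responses = [
        "Family, being together as a family",
        "Mother, money, peace of mind, peace in the world",
    ]
    vectorizer = CountVectorizer()              # words as units; segments or
    X = vectorizer.fit_transform(responses)     # lemmas would need extra work
    print(X.toarray())                          # dense view of the (n, p) matrix
    print(vectorizer.get_feature_names_out())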

2. Semantic graph and contiguity analysis


There is no universal rule to establish that two words are semantically equivalent, despite
the existence of synonymy dictionaries and thesauri. In particular, in some sociological
studies, it could be simplistic to consider that two different words (or expressions) have
the same meaning for different categories of respondents. But it is clear that some units
having a common extra-textual reference are used to designate the same "object".
Whereas the syntactic meta-information can provide the user with new variables, the
semantic information defined over the pairs of statistical units (words, lemmas) is
described by a graph that can lead to a specific metric structure (see fig. 1).
a) The semantic graph can be constructed from an external source of information (a
dictionary of synonyms or a thesaurus, for instance). In such a case, a preliminary
lemmatization of the text must be performed.
b) It can be built up according to the associations observed in a separate (external) corpus.
c) Eventually, the semantic graph can also be extracted from the corpus itself. In this latter
case, the similarity between two words (or other units) is derived from the proximity
between their distributions (lexical profiles) within the corpus.
The vertices of this weighted undirected graph are the distinct units (words) j,
(j = 1, ..., p). The edge (j, j′) exists iff there is some non-zero similarity s(j, j′) between
j and j′. The weighted matrix M = (m_jj′), of order (p, p), associated to this graph
contains in line j and column j′ the weight s(j, j′) of the edge (j, j′), or the value 0 if
there is no edge between j and j′.
The repeated presence of a pair of words within a same sentence of a text is a relevant
feature of a corpus. The words can help to disambiguate each other (see e.g. Lewis and
Croft, 1990). Taking co-occurrence relationships into account allows one to use words in
their most frequent contexts. In particular, we can also describe the most relevant co-
occurrences by using a weighted undirected complete graph linking the lexical units, each
pair of units being joined by an edge weighted by a co-occurrence intensity index.
At that stage, we find ourselves within the scope of a series of descriptive approaches
working simultaneously with units and pairs of units (see for example Art et al., 1982),
including contiguity analysis or local analysis (see: Lebart, 1969; Aluja and Lebart, 1984;
Escofier, 1989; Cazes and Moreau, 1991).
[Figure: a rectangular data table crossing the respondents (texts) with the statistical units
(words, lemmas), grammatical categories and other categories; the space of statistical units
is equipped with the metric induced by the semantic networks whose nodes are the
statistical units (anisotropy of the space of words).]

Fig. 1: How to use meta-data in text classification: metric, or new variables?

These visualization techniques are designed to modify the classical methods based on
Singular Value Decomposition by taking into account a graph structure over the entries
(rows and/or columns) of the data table. Visualizing the proximities using contiguity analysis
is equivalent to performing a projection pursuit algorithm as described in Burtschy and
Lebart (1991).
The classification can then be performed either:
1) by using as input data the principal coordinates issued from these contiguity analyses,
or
2) by computing a new similarity index between texts. This new index is built from
generalized lexical profiles (i.e. original profiles complemented with weighted units that
are neighbors (contiguous) in the semantic graph).

3. Different types of associated graphs


3.1 Case of an external semantic graph
A simple way to take the semantic neighbors into account leads to the transformation

    Y = X(I + αM),    (1)

where M is the matrix associated with the graph defined previously, and α a scalar allowing
one to calibrate the importance given to the semantic neighborhood.
This is equivalent to providing the p-dimensional space of words with a metric defined by
the matrix

    Q = (I + αM)².

This leads immediately to a new similarity index that can be used for the classification of
the rows. Due to the size of the data tables involved, the classification is often performed
using distances computed with the r first principal coordinates (a current order of magnitude
is r = 100, whilst p = 1500).
The metric

    Q* = (I − αM)⁻²,

closely related to the previous one if α is small, also has some interesting properties.
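A small sketch of transformation (1) (ours; the toy matrices are illustrative):

    import numpy as np

    def generalized_profiles(X, M, alpha):
        # Y = X (I + alpha M): each profile is complemented with the
        # alpha-weighted frequencies of its semantic neighbours
        return X @ (np.eye(X.shape[1]) + alpha * M)

    # toy data: 3 texts, 4 words; words 0 and 1 are semantic neighbours
    X = np.eye(3, 4)
    M = np.zeros((4, 4))
    M[0, 1] = M[1, 0] = 1.0
    Y = generalized_profiles(X, M, alpha=0.5)

    # texts 0 and 1 share no word in X, yet are no longer orthogonal in Y
    print(X[0] @ X[1], Y[0] @ Y[1])   # 0.0 1.0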

3.2 Case of an internal co-occurrence graph: self-learning

A possible matrix M could be

    M = C − I,

where C is the correlation matrix between words (allowing negative weights, which
correspond to a negative co-occurrence intensity measure); if the columns of X are centered
with variances equal to 1,

    C = (1/n) Xᵀ(I − (1/n)U)X,    with u_ij = 1 for all i and j.    (2)

Since Y = X(I + α(C − I)), the matrix S to be diagonalized when performing a Principal
Component Analysis of Y reads

    S = (1/n) Yᵀ(I − (1/n)U)Y = (1 − α)²C + 2α(1 − α)C² + α²C³.    (3)

Therefore, the eigenvectors of S are the same as the eigenvectors of C. However, to an
eigenvalue λ of C corresponds the eigenvalue μ of S such that

    μ = (1 − α)²λ + 2α(1 − α)λ² + α²λ³.

The effect of the new metric is simply to re-weight the principal coordinates when
recomputing the distances to perform the classification. If α = 1, for instance, we get λ³
instead of λ; thus the relative importance of the first eigenvalues is strongly increased.
Such properties contribute to shed light on the prominent role played by the first principal
axes, particularly in the techniques of Latent Semantic Analysis used in Automatic
Information Retrieval: see Furnas et al. (1988).
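This re-weighting of eigenvalues is easy to verify numerically (a small check, ours):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, alpha = 200, 5, 0.5
    X = rng.normal(size=(n, p))
    X = (X - X.mean(0)) / X.std(0)     # standardized columns
    C = X.T @ X / n                    # correlation matrix, as in (2)

    Y = X @ (np.eye(p) + alpha * (C - np.eye(p)))
    S = Y.T @ Y / n                    # Y is centered because X is

    lam = np.linalg.eigvalsh(C)
    mu = (1 - alpha)**2 * lam + 2 * alpha * (1 - alpha) * lam**2 + alpha**2 * lam**3
    print(np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(mu)))   # True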

Construction of the matrix M through a hierarchical classification of words

Another way of deriving a matrix M from the data themselves is to perform a hierarchical
classification of words, and to cut the dendrogram at a low level of the index. This can
either provide a graph associated with a partition, or a more general weighted graph if the
nested set of partitions (corresponding to the lower values of the index) is taken into
account.
The approaches of Salton and Mc Gill (1983), Celeux et al. (1991), and Iwayama and
Tokunaga (1995), in the framework of discriminant analysis (also named in this context
"Text Categorization"), are similar to the case of a graph associated with a partition: to
remedy the sparsity of the matrix X (which generally leads to ill-conditioned problems of
discrimination), words below a certain threshold of distance are aggregated.
However, in our context, it would be awkward to confine ourselves to semantic graphs
having a partition structure, since the relation of semantic similarity is not transitive
(consider for example the semantic chain: fact - feature - aspect - appearance - illusion).
It is more appropriate to assign to each pair of words (i, j) a value

    m_ij = f(d(i, j))   for d(i, j) ≤ l₀,

f(t) being a decreasing function of t and d(i, j) being the index corresponding to the
smallest set containing both elements i and j, and

    m_ij = 0   for d(i, j) > l₀.

We could use the initial distance instead of the ultrametric associated with the hierarchy,
but it is convenient to obtain as a by-product the graphical representation induced by the
ultrametric.
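A sketch of this construction with scipy's hierarchical clustering (ours; the choice
f(t) = 1 − t/l₀ and the cut level l₀ are illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist, squareform

    profiles = np.random.default_rng(2).random((30, 10))  # 30 words, toy profiles
    Z = linkage(pdist(profiles, metric="cosine"), method="average")

    # d(i, j) as the ultrametric (cophenetic) index of the hierarchy
    ultra = squareform(cophenet(Z))

    l0 = 0.3                                          # cut level of the dendrogram
    M = np.where(ultra <= l0, 1.0 - ultra / l0, 0.0)  # f(t) = 1 - t/l0, decreasing
    np.fill_diagonal(M, 0.0)                          # no self-edges
    print(int((M > 0).sum() // 2), "edges kept")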

3.3 Internal co-occurrence graph after aggregation: another self-learning procedure

The preceding case of self-learning entails several serious problems. Deriving semantic
information from frequency distributions (see Harris (1954), Church and Hanks (1990))
becomes a fruitful operation only if the frequency profiles are defined by using a
substantial amount of occurrences.
We define as a substitution relationship between words the situation occurring in the
following two circumstances:
i) they have a similar context distribution;
ii) they don't appear (or seldom appear) in a same string of units.
More concretely, we may use a correspondence analysis (and/or a hierarchical clustering)
of the (q, p) matrix A, crossing aggregated responses and lexical units, to determine the
distribution similarities.
Let us call Z the (n, q) binary matrix describing an a priori partition of the set of
respondents (such that z_ij = 1 if respondent i belongs to category j, z_ij = 0 elsewhere).
We have in such a case A = ZᵀX (Zᵀ is the transpose of Z).
Then, we could exclude those units that appear within a same string through a search for
multiple co-occurrences. We will see in the example below that the grouping of responses
may have too strong an influence on the proximities between words. It appears to be more
efficient to stack several contingency tables corresponding to various criteria (in a sample
survey: sex, age, region, status, occupation, education level, ...).
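In code, the aggregation and the stacking read as follows (a sketch; the indicator
construction and the toy labels are ours):

    import numpy as np

    def indicator(labels):
        # (n, q) binary indicator matrix Z of a nominal variable
        cats = sorted(set(labels))
        return np.array([[lab == c for c in cats] for lab in labels], float)

    X = np.random.default_rng(3).poisson(0.2, (8, 5)).astype(float)  # toy corpus
    Z1 = indicator(["m", "f", "f", "m", "f", "m", "m", "f"])         # sex
    Z2 = indicator([1, 2, 3, 1, 2, 3, 1, 2])                         # age group

    A1, A2 = Z1.T @ X, Z2.T @ X    # one aggregated table A = Z'X per criterion
    A = np.vstack([A1, A2])        # stacked tables soften the grouping effect
    print(A.shape)                 # (5, 5): 2 + 3 stacked rows, 5 words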
Note that the methods of sections 3.2 and 3.3 do not provide semantic networks in the usual
sense of the term. However, they have the advantage of being independent of the language,
and of the context of the study as well. The "local network" obtained can be confronted
with any external semantic network that may be available.

4. Example
This methodology has been applied to a corpus of 1563 answers to an open-ended
question included in a sociological survey. We confine ourselves here to discussing the
choice of the semantic graph that has been used to improve the classification of the
respondents.
The following open-ended question was asked in a multinational survey conducted in
seven countries (Japan, USA, United Kingdom, Germany, France, Italy and the Netherlands)
in the late nineteen-eighties (Hayashi et al., 1992): "What is the single most important thing
in life for you?" It was followed by the probe: "What other things are very important to
you?". Our illustrative example is limited to the American sample (sample size: 1563).
Some aspects of this multinational survey concerning general social attitudes are described
in Sasaki and Suzuki (1989).
Examples of answers to the first question were:
1 - Family, being together as a family
2 - Mother, money, peace of mind, peace in the world
Some words are connected to a single lemma (or dictionary word) (be, is, are, being).
Also to be noted is the strong presence of function words (a, and, for, of, the). Note that
the concept of a function word (sometimes referred to as empty word, tool word or
grammatical word in information retrieval) is widely used by text researchers. These words
are obviously excluded from the semantic networks.
This corpus has a length of 13 999 occurrences and contains 1378 distinct words. Only the
126 words used at least 16 times are kept, the total length being then reduced to 10 752
occurrences.
Figure 2 shows three branches of the dendrogram of words obtained through a direct
classification of the columns of X performed on the 15 first principal axes of a
Correspondence Analysis of the sparse contingency table X. We observe groupings of
words according to their meanings (home and members of family, standard of living),
together with function words (on) or pronouns (my), and repeated and isolated segments
(don't know).
We note that some topics are characteristic of the produced clusters. These topics are less
salient when a second dendrogram is computed from the 45 first principal axes.
If we cut both dendrograms at a level producing 30 classes, we obtain a ratio of variance
(between-class variance divided by total variance) of 0.87 (case of 15 axes) and 0.64
(case of 45 axes). This is further empirical proof of the ability of the first axes to gather
structural features, an ability already mentioned in section 3.2.

Index Words

.87 grandchildren --------+


.16 husband
2.29 children ---*----*---------+
.25 kids ---+
.74 wife ---*---+
.25 home ---+
15.~1 my ---*---*----------*----------------------*--------*--/ 11-

.73 money -------+


.14 on --+
.29 live --*+
.24 comfortably
6.90 enough ---*---*------------------------------------------+
.06 don't --+
know -_._------------------------------------------------///--

Fig. 2: Parts of the dendrogram obtained through a direct clustering of the columns of X

Fig. 3 presents some similar branches of the dendrogram obtained through clustering of
the columns of the aggregated table A = ZᵀX, where Z corresponds to a partition into 6
classes obtained by crossing the two nominal variables sex and age (3 categories). The
themes observed previously are now disseminated into various groups strongly influenced
by the criteria of grouping.

Index Words

5.08 grandchildren --------------------------------+


2.15 husband ---------------+
.04 self
.07 religion
.28 peace --*+
.10 mind --+1
.80 in -_ •• _-+
3.80 children ------*--------*---------+
.08 people --+
.37 as --*-+
1. 49 health ----*------+
.35 church
.14 welfare --+ 1
.04 our -_. 1
world --*-*------*-------------*------ ---///---

Fig. 3: Parts of the dendrogram obtained through clustering of the columns of the
aggregated table A = ZᵀX (Z corresponds to a partition into 6 classes according to
sex and age).

The category "female, over 55 years", being both distinctive and homogeneous, has strongly
influenced the proximities between words. We no longer find wife (used by men), kids
(used by younger people, mostly men), home (used by younger people), etc.
Obviously, a composite partition (crossing or stacking various criteria) should be chosen
instead of the nominal variable "sex-age" used here to obtain higher word frequencies. The
quality of the partition of responses depends on the homogeneity of the classes and their
interpretability. In such a context, only a group of experts could assess the obtained
results. The use of a co-occurrence graph (with a = 1) derived from a direct classification
of the columns of X enables a better classification of the responses that have a poor lexical
profile, and leads to more meaningful groupings.
It is clear that we are dealing with experimental statistics: the value of the parameter a and the
number of neighbours to be taken into account could probably vary according to the field
of application. Reliable tools for assessing and comparing partitions are all the more
needed.

5. References
Aluja Banet, T., Lebart, L. (1984): Local and Partial Principal Component Analysis and
Correspondence Analysis. COMPSTAT Proceedings, 113-118, Physica Verlag, Vienna.
Art, D., Gnanadesikan, R., and Kettenring, J. R. (1982): Data Based Metrics for Cluster
Analysis. Utilitas Mathematica, 21A, 75-99.
Becue, M. (1991): Analisis de Datos Textuales. Metodos Estadisticos y Algoritmos.
CISIA, Paris.
Burtschy, B., Lebart, L. (1991): Contiguity analysis and projection pursuit. in: Applied
Stochastic Models and Data Analysis, Gutierrez R. and Valderrama M. J. (eds), World
Scientific, Singapore, 117-128.
Cazes, P., Moreau, J. (1991): Analysis of a contingency table in which the rows and
columns have a graph structure. in: Symbolic and Numeric Data Analysis and Learning,
Diday E. and Lechevallier Y. (eds), 271-280, Nova Science Publishers, New York.
Celeux, G., Hebrail, G., Mkhadri, A., Suchard, M. (1991): Reduction of a large scale
and ill-conditioned problem on textual data. in: Applied Stochastic Models and Data
Analysis, Gutierrez R. and Valderrama M. J. (eds), World Scientific, Singapore, 129-137.
Church, K. W., Hanks, P. (1990): Word association norms, mutual information and
lexicography. Computational Linguistics, 16, 22-29.
Escofier, B. (1989): Multiple correspondence analysis and neighboring relations. in: Data
Analysis, Learning Symbolic and Numeric Knowledge, Diday E. (ed.), 55-62,
Nova Science Publishers, New York.
Furnas, G. W. et al. (1988): Information retrieval using a singular value decomposition
model of latent semantic structure. Proceedings of the 14th ACM Conference on R. and
D. in Information Retrieval, 465-480.
Gordon, A. D. (1996): Hierarchical Classification. in: Clustering and Classification,
Arabie P., Hubert L. J., De Soete G. (eds), World Scientific, River Edge, NJ.
Harris, Z. S. (1954): Distributional Structure. Word, 2-3, 146-162.
Hayashi, C., Suzuki, T., Sasaki, M. (1992): Data Analysis for Social Comparative Research:
International Perspective. North-Holland, Amsterdam.
Iwayama, M., Tokunaga, T. (1995): Cluster-based text categorization: a comparison of
category search strategies. in: ACM/SIGIR'95, Fox E. A., Ingwersen P., Fidel R. (eds),
Seattle, WA, USA, 273-280.
Lebart, L. (1969): Analyse statistique de la contiguite. Publications de l'ISUP, 28, 81-112.
Lebart, L., Salem, A. (1994): Statistique Textuelle. Dunod, Paris.
Lebart, L., Salem, A., Berry, E. (1991): Recent developments in the statistical processing
of textual data. Applied Stochastic Models and Data Analysis, 7, 47-62.
Lewis, D. D., Croft, W. (1990): Term clustering of syntactic phrases. SIGIR-90, 385-404.
Salem, A. (1995): Les unites lexicometriques. in: Analisi Statistica dei Dati Testuali,
Bolasco et al. (eds), 19-27, CISU, Roma.
Salton, G., McGill, M. J. (1983): Introduction to Modern Information Retrieval,
International Student Edition.
Sasaki, M., Suzuki, T. (1989): New directions in the study of general social attitudes:
trends and cross-national perspectives. Behaviormetrika, 26, 9-30.
How to find the nearest by evaluating only a few?
Clustering techniques used to improve the
efficiency of an Information Retrieval system
based on Distributional Semantics
Martin Rajman and Arnon Rungsawang
ENST-Paris, Department of Computer Science
46 Rue Barrault, F-75634 Paris Cedex 13, France

Summary: The first objective of this contribution is to give a description of our tex-
tual information retrieval system based on distributional semantics. The central idea of
the approach is to represent the retrievable units and the user queries in a unified way
as projections in a vector space of pertinent terms. The projections are derived from a
co-occurrence matrix computed on large reference (textual) corpora collecting the distribu-
tional semantic information. A similarity computation based on the cosine measure is then
used to characterize the semantic proximity between queries and documents.
Retrieval effectiveness can be further improved by the use of relevance feedback techniques.
A simple feedback method where document relevance is interactively integrated to the orig-
inal query will also be presented and evaluated.
Although our first experiments lead to quite promising results, one major drawback of our
IR system in its original form is that the satisfaction of a query requires the evaluation of
the similarities between that query and all the documents in the textual base. Therefore,
the second objective of this contribution is to investigate how clustering techniques can
be applied to the textual database in order to retrieve the documents satisfying a query
through a partial exploration of the base. A tentative solution based on hierarchical clus-
tering will be suggested.

1. Introduction
Information Retrieval (IR) research is concerned with the analysis, the representation, and
the searching of heterogeneous textual databases with wide varieties of vocabularies and
unrestricted subject matters. Examples of such databases, the elements of which will be
called hereafter documents, are databases containing newspaper articles, newswires, technical
or scientific articles, magazines, encyclopedia entries and so on. Due to the enormous
amount of information currently available on-line in the different computer networks
(Internet, ...) and in the library environments, simple keyword search and browsing are no
longer sufficient. IR users need more sophisticated tools to help them reach the relevant
information.
Our retrieval model exploits co-occurrence properties of words to determine whether queries
and texts are semantically related (distributional semantics). More precisely, documents
and queries are represented in a unified way as projections of co-occurrence profile vectors
in a multidimensional vector space of selected informative terms, in which the proximity
is interpreted as semantic similarity. These co-occurrence profile vectors are derived from
co-occurrence matrices, computed on large reference textual corpora. The cosine similarity
measure is used to characterize the proximity and thus the relevance between user queries
and documents in the textual database.

2. Distributional Semantics
Using distributional information for automatic extraction of general morphologic, syntac-

488
489

tic or semantic properties of a given language has already been considered by several
researchers (Schütze (1992), Gallant et al. (1992), Rungsawang & Rajman (1995)). Such
properties correspond to observable regularities (frequency, distribution, co-frequency, ...)
in large textual corpora. In our approach, we use a co-occurrence matrix (c_ij) to fetch
the semantic information by automatic co-occurrence computation on a large reference
corpus of texts. The rows of such a matrix correspond to all distinct terms w_i found in the
reference corpus and the columns correspond to selected informative terms t_j, called
pertinent terms, used to represent the meaning of all other terms in the textual database. Each
element c_ij records how often a term w_i co-occurs with a term t_j within some pre-defined
textual units (e.g. sentences or paragraphs) in the reference corpus.
To build a co-occurrence matrix, several elements must be determined. First, the nature of
the primary linguistic units which will be used as terms has to be defined. Tokens, as pro-
duced by a simple stemmer (e.g. the Stopper and Porter stemmers; Frakes and Baeza-Yates
(1992)), or words reduced to their radical forms (i.e. conjugated verbs to infinitives, nouns
to singular forms) are frequently used. Words with their part-of-speech tags, for example
produced by a natural language tagger, may also be considered.
Then, we need to determine the sets of terms and pertinent terms that define the rows
and the columns of the co-occurrence matrix. To cover the maximum semantic information,
all distinct terms (except perhaps the functional words) appearing in the reference
corpus should be used. However, feasibility constraints have to be taken into consideration.
Salton et al. (1975, 1976) indicate that terms whose document frequency (i.e. the
proportion of documents in which they appear) lies between 1/100 and 1/10 possess
good content discrimination in the document space and yield good retrieval effectiveness.
Therefore, we decided to reduce the w-dimension of the matrix to terms appearing in at
least 2 documents, and to use Salton's criterion for the pertinent terms in the t-dimension.
The third element to define is the textual unit in which the co-occurrences will be computed.
Usually, sentences, paragraphs, fixed-size word windows or fixed-size character windows are
chosen.
Once the co-occurrence matrix is built, a distributional semantic hypothesis is assumed,
postulating a correlation between terms that co-occur in similar distributional environments.
The semantics of a term w_i is then represented by its co-occurrence profile vector, the row
corresponding to term w_i in the co-occurrence matrix. The geometric proximity between
the co-occurrence profile vectors is interpreted as an indication of the semantic similarity
between the corresponding terms, provided that the reference corpus is large enough to
cover sufficient semantic information. The geometric proximity between these vectors is
measured by the cosine of the angle between them.
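A minimal sketch of the construction just described, assuming sentences as the pre-defined textual unit; the two-sentence corpus and the absence of the document-frequency filter are placeholder simplifications, not the authors' implementation:

corpus_sentences = [
    "corpus based linguistic analysis of large corpora",
    "computational linguistics uses large reference corpora",
]

# rows: all distinct terms w_i; columns: pertinent terms t_j (here simply the
# whole vocabulary; in practice, terms filtered by Salton's criterion)
vocab = sorted({w for s in corpus_sentences for w in s.split()})
w_index = {w: i for i, w in enumerate(vocab)}
t_index = dict(w_index)

cooc = [[0] * len(t_index) for _ in vocab]
for sentence in corpus_sentences:
    words = set(sentence.split())
    for w in words:
        for t in words:
            if w != t and t in t_index:   # self co-occurrences excluded here by choice
                cooc[w_index[w]][t_index[t]] += 1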

3. Document and Query Representation


In our model, the retrieval of document passages of varying size has been considered for the
definition of document and query representation. Therefore, we consider that each document
may be decomposed into several retrievable text units (RTUs). The RTU definition
(i.e. document decomposition) is strongly related to the notion of a text excerpt, which
can be defined as a varying-size piece of a document covering an amount of information
that the user considers a satisfactory answer to his/her information need. In addition,
we do not formally distinguish between documents and queries. In our preliminary
experiments, we simply define the RTUs as paragraphs and sections.
According to the distributional semantic hypothesis previously mentioned, we represent
each RTU by the corresponding indexing structure (IS) vector. This IS vector is currently
defined as the mean vector of the co-occurrence profile vectors associated with the terms
contained in the RTU.
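A minimal numerical sketch of this representation, together with the cosine matching of section 2 (numpy assumed; the co-occurrence matrix and vocabulary below are random placeholders):

import numpy as np

rng = np.random.default_rng(1)
cooc = rng.random((50, 20))                   # hypothetical 50 terms x 20 pertinent terms
w_index = {f"term{i}": i for i in range(50)}  # hypothetical vocabulary

def is_vector(rtu_terms):
    # IS vector of an RTU = mean of the co-occurrence profiles of its terms
    return np.mean([cooc[w_index[w]] for w in rtu_terms if w in w_index], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = is_vector(["term1", "term7"])
docs = [is_vector([f"term{i}", f"term{i + 1}"]) for i in range(0, 10, 2)]
print(sorted(range(len(docs)), key=lambda d: cosine(query, docs[d]), reverse=True))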
Since documents and queries are represented in a unified way in a common multi-dimensional

[Figure 1: 11-point Recall-Precision curves (Cranfield standard test collection), comparing SMART atc weight, SMART nnn weight, and DSIR (unnormalized and normalized).]

vector space, the document-query relevance can be defined on the basis of the proximity
between the average IS vector representing the documents and the IS vector representing
the query. We currently measure this proximity by the cosine of the angle between the IS
vectors.
A benefit that we expect from our approach is that any two documents may have a high sim-
ilarity score in the multi-dimensional space of pertinent terms even though they only have a
few terms in common. For example, one document might contain the words "corpus-based
linguistic analysis", whereas the other might contain the words "computational linguistics". If
these terms globally occur in the same distributional environment in the corpus
that was taken as reference, the resulting IS vectors should correspond to similar directions
in the multi-dimensional space of pertinent terms.

4. Preliminary Experiments
In the first phase of this ongoing research, we have implemented (in C and Perl languages on
a SPARC workstation) a prototype corresponding to the system described in the previous
section. With this prototype, we have conducted several experiments using the Cranfield
standard test collection. The Cranfield collection is a test collection of 1400 documents and
225 queries in the field of aerodynamics which also contains, for each query, the list of the
relevant documents. The collection is available in the SMART version 11.0 distribution 1,
and has been used for several years to test many retrieval algorithms.
The 11-point Recall-Precision curves (Salton and McGill (1983)) comparing our system (denoted
DSIR 2) with the ones obtained with the standard version of the SMART system with
term-frequency weights (denoted SMART nnn weight) and augmented inverse-document-
frequency weights (denoted SMART atc weight) are given in figure 1.

1 This contribution can be obtained from "ftp.cs.cornell.edu".
2 Distributional Semantic Information Retrieval.

5. Experiments with Relevance Feedback


Relevance feedback is a process used to build improved queries especially through inter-
action with the user. It is based on the use of relevance information that the users can
associate with previously retrieved documents. In a standard feedback process, the user in-
dicates the relevance (or non-relevance) of either terms or documents to create new queries.
In our experiments, IS vectors of previously retrieved (and user-evaluated) documents are
merged with the original query vector to form the new one. More precisely, new feedback
queries of rank k are made up by adding to the original query the IS vectors of all the
relevant documents selected among the k first previously retrieved documents (ordered by
decreasing similarity with the original query) and by subtracting from it the IS vector of
the first non-relevant document (among the k):

\vec{q}_{new} = \vec{q}_{old} + \beta \sum_{i=1}^{k} \delta_i \, \vec{IS}_i - \gamma \, \vec{IS}_{\bar{i}} ,    (1)

where \vec{q}_{new} is the new query vector and \vec{q}_{old} the original query vector. The \vec{IS}_i are the IS vectors
of the k first previously retrieved documents. \delta_i equals 1 when \vec{IS}_i is the IS vector of
a relevant document (and 0 otherwise), and \vec{IS}_{\bar{i}} is the IS vector of the first non-relevant
document. \beta and \gamma are the importance weights given to the relevant and non-relevant
components, respectively. To eliminate the problem of residual (or ranking) effects (Hull (1993)), we created
the feedback query vectors from one collection of documents and applied them to another
one (see experimental setup below). The settings \beta = 0.75 and \gamma = 0.25 yielded the best results
in our experiments.
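A minimal sketch of the feedback update of equation (1) (numpy assumed; the vectors and relevance judgments below are placeholders):

import numpy as np

def feedback_query(q_old, IS, relevant, beta=0.75, gamma=0.25):
    # add the IS vectors of the relevant documents among the k first retrieved,
    # subtract the IS vector of the first non-relevant one (equation (1))
    q_new = q_old + beta * sum(IS[i] for i in range(len(IS)) if relevant[i])
    first_nonrel = next((i for i in range(len(IS)) if not relevant[i]), None)
    if first_nonrel is not None:
        q_new = q_new - gamma * IS[first_nonrel]
    return q_new

rng = np.random.default_rng(2)
q_old = rng.random(20)
IS = rng.random((5, 20))                     # k = 5 previously retrieved documents
relevant = [True, False, True, True, False]  # hypothetical user judgments
q_new = feedback_query(q_old, IS, relevant)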

5.1 Document collection used in feedback experiments


The document base used in our feedback experiments was the OHSUMED collection, contributed
by the Oregon Health Sciences University (Hersh (1994)). The OHSUMED collection
contains all MEDLINE references from 270 journals, spanning the years 1987 through
1991. It also includes 106 queries with the associated relevance judgments. Document
relevance was judged on a three-point scale: definitely relevant, possibly relevant, and not
relevant. We have chosen to consider as relevant the documents judged as either definitely
or possibly relevant. To cope with our limited computational resources, we only used a
subset of the OHSUMED collection. The reduced collection was created as follows.
We extracted a first sub-collection, called the B1 sub-collection, consisting of the OHSUMED
documents spanning from 1987 through 1989 (22040 documents, ≈ 24 Mbytes) along with
their relevance information. The second sub-collection, called the B2 sub-collection, contains
the OHSUMED documents spanning from 1990 through 1991 (19950 documents, ≈ 24 Mbytes),
also with the corresponding relevance judgment data.
For both sub-collections, we used the B1 sub-collection as the reference corpus for the co-occurrence
matrix calculation. The document pre-processing consisted of removing the functional
words with the Stopper algorithm, and normalizing the morphological variants to their
radical forms with the Stemmer algorithm. We used the algorithms available from Frakes'
contribution (Frakes and Baeza-Yates (1992)). Terms that only appeared in one document in
the collection were removed. To limit the second dimension of the co-occurrence matrix, a
simple frequency-based filtering technique was used to select pertinent attributes.

5.2 Our experimental setup


Our experimental setup follows the diagram shown in figure 2. Starting from the upper left
portion of the diagram (arrows labeled (1)), we first calculate the co-occurrence matrix
(DSCOOCC program) using data from the B1 sub-collection. Then, the co-occurrence
data and the B1 sub-collection are used to create (2) the B1 document IS vectors (DSBASE
program). Afterward, the DSSEARCH program retrieves (3) from the B1 collection

[Diagram: the feedback queries FBQ05 and FBQ10, built from the user-evaluated DSBASE results, feed the DSFB05 and DSFB10 retrieval runs.]

Figure 2: Feedback diagram.

the relevant documents corresponding to the original 106 queries, and presents the results
in the form of an ordered list, where the first document (rank 1) is the most similar to the
query. Then (4), the FBQ program creates a new query based on the old one, the relevant
documents (as evaluated by the user) and the first non-relevant document, from rank 1 to
rank 5 (FBQ05), rank 1 to rank 10 (FBQ10), etc.
Continuing at the lower portion of the diagram, the DSBASE program will create (5) the
B2 document IS vectors, using previously calculated co-occurrence data and the B2 sub-
collection. Then (6), the B2 document IS vectors and the original 106 queries are used by
the DSSEARCH program to produce the DSBASE (result corresponding to the original
queries) that will be used as a reference to evaluate the retrieval with feedback. Finally (7),
the DSSEARCH program retrieves from the B2 document collection the relevant document
sets (DSFB05, DSFB10, etc.) in response to the feedback queries FBQ05, FBQ10, etc.

5.3 The experimental evaluation and discussion


Our relevance feedback experiments were evaluated on the basis of the standard 11-point
recall-precision (RP) curves (Salton & McGill (1983)). We also conducted (on the B1 and
B2 sub-collections) comparative retrieval experiments with the SMART system (version
11.0). The document-query weighting "ann-atn" was used for all SMART experiments
(Hersh & Buckley (1994)).
The two curves shown in figure 3, and the first two columns in table 1 indicate the retrieval
performance obtained by our system (DSIR) and the system SMART in the case of no
relevance feedback on the B1 sub-collection.
The five curves given in figure 4, and the corresponding five columns in table 1 illustrate
the results obtained by the system SMART (without relevance feedback) and our DSIR
system (DSBASE: without relevance feedback, DSFB05: feedback with documents under
rank 5, etc.) on the B2 sub-collection. The curves (DSBASE vs. SMART) show the degradation
of our system's performance when no relevance feedback is used. DSIR does not seem to
retrieve relevant documents as well as SMART at lower recall levels. However, its overall
performance is still comparable to that of SMART (only 2.66% inferior). One possible


Figure 3: 11-point Recall-Precision curves derived from the B1 sub-collection, without
relevance feedback, using the SMART and DSIR systems.

reason for this behavior could be the unsuitable use of co-occurrence data derived from
the B1 sub-collection. However, the result remains very encouraging, especially when robustness
considerations are taken into account, because our system can satisfyingly deal with
many new documents (≈ 20000 from the B2 sub-collection) without any parameter change,
re-indexing or new co-occurrence computation. In addition, experiments conducted
with the B2 sub-collection as reference corpus still yield better results than SMART.
As far as feedback is concerned, the curves denoted DSFB05, DSFB10 and DSFB15 in
figure 4 clearly indicate the improvement of our system's overall level of recall, and allow
an interesting quantification of this improvement when various maximal (document) ranks
are used for the feedback. The results in table 1 show the improvement as a percentage of average
precision change (compared with SMART). These results confirm the usefulness of relevance
feedback as pointed out in several previous references (Salton & Buckley (1990), Harman
(1992), Allen (1995) and Buckley et al. (1995)).

6. Research Directions
The results of the experiments that we have conducted with our IR system on different
standard test collections have convinced us of the feasibility of our approach.
However, as far as algorithmic efficiency is concerned, one major drawback of our system in
its original form is that the satisfaction of a query requires the evaluation of the similarities
between the query and all the documents in the textual base. Therefore, we are now inves-
tigating how clustering techniques can be used in association with the similarity measure
in order to cluster the database in a way that allows the identification of the documents
satisfying the query through a partial exploration of the base. We are currently working on
a first tentative solution based on a very simple hierarchical partitioning process.
The process starts (step 0) with a unique initial class containing all the N documents of the
textual database. At step n, each class c defined at step n-1 is divided into 2 subclasses
corresponding respectively to the elements with negative and positive coordinates on the first
factorial dimension obtained by a factorial analysis performed on c. The partitioning pro-

Recall               SMART B1  DSBASE B1  SMART B2  DSBASE B2  DSFB05 B2  DSFB10 B2  DSFB15 B2
0.00                 0.7494    0.7460     0.7546    0.6900     0.7642     0.8238     0.8387
0.10                 0.7095    0.7315     0.7260    0.6963     0.7629     0.7956     0.8160
0.20                 0.6132    0.6267     0.6627    0.6276     0.6820     0.7234     0.7363
0.30                 0.5611    0.5867     0.5888    0.5819     0.6172     0.6563     0.6790
0.40                 0.5119    0.5124     0.5117    0.5111     0.5430     0.5767     0.5929
0.50                 0.4839    0.4713     0.4660    0.4536     0.4978     0.5303     0.5493
0.60                 0.4243    0.4254     0.3890    0.3982     0.4356     0.4504     0.4700
0.70                 0.3639    0.3691     0.3258    0.3556     0.3808     0.4015     0.4076
0.80                 0.2864    0.3317     0.2778    0.2738     0.2905     0.3045     0.3167
0.90                 0.1879    0.2393     0.2090    0.1912     0.2046     0.2001     0.2075
1.00                 0.0824    0.1143     0.1282    0.1283     0.1232     0.1196     0.1242
Average precision    0.4522    0.4686     0.4581    0.4459     0.4819     0.5075     0.5217
% Precision change      -      +3.6267       -      -2.6631    +5.1954    +10.7837   +13.8834

Table 1: Average precision calculated over 11 recall points


Figure 4: 11-point Recall-Precision curves derived from the B2 sub-collection, using
SMART and DSIR with feedback queries at ranks 5, 10, 15.

cess is then iterated until all classes contain only one element.
By construction, the resulting set of classes corresponds to a binary tree T. In T, each non-terminal
node (i.e. class) is associated with a decision function based on the equation of
the corresponding one-dimensional factorial vector space. Furthermore, consider, for any
query q, the path P(q), starting at the root of T, in which, at each node, the branching
decision is taken according to the result, for q, of the decision function associated with the
node. Then P(q) verifies the following interesting properties (see the sketch after this list):
• P(q) leads to a leaf of T that corresponds to a document representing a good approximation
of the nearest document to q;

• the number of operations necessary to build P(q) (i.e. to retrieve the document
associated with the leaf) is at most log2(N).
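The sketch below illustrates both the construction and the descent under stated assumptions (numpy; PCA via SVD standing in for the factorial analysis; degenerate splits are simply truncated; the documents are random placeholders):

import numpy as np

def build_tree(X, idx=None):
    # split each class by the sign of its members' coordinates on the class's
    # first principal axis (a stand-in for the factorial analysis of the text)
    idx = np.arange(len(X)) if idx is None else idx
    if len(idx) == 1:
        return {"leaf": int(idx[0])}
    Xc = X[idx] - X[idx].mean(axis=0)
    axis = np.linalg.svd(Xc, full_matrices=False)[2][0]  # first principal axis
    coord = Xc @ axis
    neg, pos = idx[coord < 0], idx[coord >= 0]
    if len(neg) == 0 or len(pos) == 0:                   # degenerate split: stop early
        return {"leaf": int(idx[0])}
    return {"mean": X[idx].mean(axis=0), "axis": axis,
            "neg": build_tree(X, neg), "pos": build_tree(X, pos)}

def descend(tree, q):
    # follow the path P(q): branch on each node's decision function
    while "leaf" not in tree:
        tree = tree["neg" if (q - tree["mean"]) @ tree["axis"] < 0 else "pos"]
    return tree["leaf"]

rng = np.random.default_rng(3)
docs = rng.normal(size=(32, 8))   # hypothetical document IS vectors
tree = build_tree(docs)
print(descend(tree, rng.normal(size=8)))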
We are currently integrating the partitioning process described above in our IR system in
order to quantify, for the available test databases, the influence of such a method on the
overall retrieval efficiency of the system. In addition to this, we are currently implementing

a parallel version of our retrieval engine within the PVM programming environment (Geist
et al. (1994)). This work will give us the means to conduct a realistic evaluation of the
computational speed-up provided by the clustering-based model.

References:
Allen, J. (1995): Relevance Feedback with Too Much Data. In Proceedings of the 18th Annual
International ACM/SIGIR Conference on Research and Development in Information
Retrieval, Seattle, USA.
Buckley, C. et al. (1995): Automatic Query Expansion Using SMART: TREC 3. In The
Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225.
Frakes, W.B. and Baeza-Yates, R. (1992): Information Retrieval: Data Structures & Algorithms.
Prentice Hall.
Gallant, S.I. et al. (1992): HNC's MatchPlus System. SIGIR FORUM, 16(2).
Geist, A. et al. (1994): PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for
Networked Parallel Computing. The MIT Press, Cambridge, MA.
Harman, D. (1992): Relevance Feedback Revisited. In Proceedings of the 15th Annual International
ACM/SIGIR Conference on Research and Development in Information Retrieval,
Copenhagen, Denmark.
Hersh, W. and Buckley, C. (1994): OHSUMED: An Interactive Retrieval Evaluation and
New Large Test Collection for Research. In Proceedings of the 17th Annual International
ACM/SIGIR Conference on Research and Development in Information Retrieval, Dublin,
Ireland.
Hull, D. (1993): Using Statistical Testing in the Evaluation of Retrieval Experiments. In
Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and
Development in Information Retrieval, Pittsburgh, USA.
Rungsawang, A. and Rajman, M. (1995): Textual Information Retrieval Based on the Concept
of the Distributional Semantics. In Proceedings of the 3rd International Conference
on Statistical Analysis of Textual Data, Rome, Italy, December.
Schütze, H. (1992): Dimensions of Meaning. In IEEE Proceedings of Supercomputing '92.
Salton, G. and McGill, M.J. (1983): Introduction to Modern Information Retrieval. McGraw-Hill.
Salton, G. et al. (1975): A Theory of Term Importance in Automatic Text Analysis. Journal
of the American Society for Information Science.
Salton, G. et al. (1976): Automatic Indexing Using Term Discrimination and Term Precision
Measurement. Information Processing & Management, 12.
Salton, G. and Buckley, C. (1990): Improving Retrieval Performance by Relevance Feedback.
Journal of the American Society for Information Science, 41(4).
Fitting the CANDCLUS/MUMCLUS Models
with Partitioning and Other Constraints
J. Douglas Carroll 1 and Anil Chaturvedi 2
1 Faculty of Management
Rutgers University
Management Education Center, Room 125
81 New Street
Newark, NJ 07102-1895
USA
2 AT&T Bell Laboratories
Room 5C-133
600 Mountain Avenue
Murray Hill, NJ 07974
USA

Summary: The CANDCLUS (for CANonical Decomposition CLUStering) model and
method is described for analysis of multiway data arrays in terms of multilinear models
in which some ways (or modes) are modeled by continuous parameters defining spatial
dimensions, other ways/modes by discrete parameters defining cluster or other categorical
structures, and still others by mixtures of continuous and discrete parameters defining
"hybrid" models in which spatial dimensional structure is combined with cluster-like
categorical structure. A generalization of CANDCLUS, called MUMCLUS (for MUltiMode
CLUStering), whose two-way special case corresponds to DeSarbo's GENNCLUS model, is
also defined and discussed. Methods previously published for unconstrained fitting of the
CANDCLUS/MUMCLUS family of models, based on a separability property observed by
Chaturvedi, are extended to allow certain constraints on the discrete parameters: in particular,
a constraint that the cluster structure be a partition, and another that each entity in
a particular mode may be a member of no more than C clusters. These constraints are
implemented via an extended separability property (for vectors of discrete parameters, rather
than for single parameters) which is defined. The possibility of fitting other constrained
versions of these models within this general framework is discussed.

1. THE CANDCLUS MODEL


Canonical decomposition clustering (CANDCLUS) is a general multilinear model for
multiway data arrays. We first state the CANDCLUS model in its most general form,
for the N-way array Y. It might be noted that this form of the CANDCLUS model is
identical to that of the Carroll and Chang CANDECOMP (CANonical DECOMPosition)
model (Carroll and Chang, 1970; Carroll and Pruzansky, 1984), whose most
important application to date has been to provide the computational underpinnings
of the INDSCAL approach to two-mode, three-way (individual differences) multidimensional
scaling (MDS). The most important difference between CANDECOMP
and CANDCLUS is that, while in the former, all parameters are assumed continuous,
so that the models being assumed and fitted in the concomitant data analysis are
all continuous spatial models (e.g., MDS models), in CANDCLUS, some or all of the
dimensions for some or all ways/modes may be constrained to be discrete, typically
binary (0-1) variables, which can be interpreted as class membership variables en-

coding whether a particular object (or other entity corresponding to a given level of
a given mode/way) belongs (value = 1) or does not belong (value = 0) to a partic-
ular class or cluster. In CANDCLUS, the dimensions for various ways/modes may
be continuous (spatial) or binary (clusterlike). Other possibilities include discretely
valued dimensions with k(> 2) distinct possible values for a particular dimension.
Specifically, the CANDCLUS model for a general N-way array Y can be stated in
the following form:

y_{i_1 i_2 \cdots i_N} \approx \sum_{r=1}^{R} a^{(1)}_{i_1 r} \, a^{(2)}_{i_2 r} \cdots a^{(N)}_{i_N r} .    (1)
r=l

Define A_n as the parameter matrix of order I_n \times R with elements a^{(n)}_{i_n r} for the rth
dimension and the nth way. The elements of the matrices A_n (n = 1, ..., N) can take on
any of the following values:
• Real values for all matrices A_n (n = 1, ..., N). This results in the R-dimensional
CANDECOMP model of Carroll and Chang (1970; see also Carroll and Pruzansky, 1984).

• Discrete integer (or finite sets of real number) values for all A_n, n = 1, ..., N.

• A mixture of real parameters for some ways and discrete values for the rest. Other
possibilities, such as those involving "hybrid" models, also exist; in a hybrid
model for a particular way/mode, some dimensions are defined via continuous,
and others via discrete, parameters. See Carroll (1976), Carroll and Pruzansky
(1980), De Soete and Carroll (1996), and Carroll and Arabie (in press) for
discussion of hybrid models.
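As a small numerical illustration of form (1), under stated assumptions (numpy; arbitrary sizes; one binary way and two continuous ways, chosen for the sketch and not prescribed by the model), a three-way array can be synthesized from its factor matrices:

import numpy as np

I1, I2, I3, R = 6, 5, 4, 3
rng = np.random.default_rng(4)
A1 = rng.integers(0, 2, size=(I1, R)).astype(float)  # discrete: cluster memberships
A2 = rng.normal(size=(I2, R))                        # continuous: spatial dimensions
A3 = rng.normal(size=(I3, R))                        # continuous

# y_{i1 i2 i3} = sum_r a1_{i1 r} a2_{i2 r} a3_{i3 r}
Y = np.einsum("ir,jr,kr->ijk", A1, A2, A3)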

2. MUMCLUS (MULTIMODE CLUSTERING), GENNCLUS,
GENERALIZED GENNCLUS, TUCKER'S THREE-MODE
AND MULTIMODE FACTOR/COMPONENTS ANALYSIS,
AND HYBRID MODELS OF GENERAL TUCKER FORM

2.1 DeSarbo's GENNCLUS Model (Two-way and Three-way Versions)


Given K nonsymmetric I \times J matrices S_k, k = 1, ..., K, the three-way GENNCLUS
model is of the form

S_k \approx A \, U_k \, B' ,    (2)

where A is an I \times R_a binary matrix, B is a J \times R_b binary matrix, while U_k is a
completely general R_a \times R_b matrix.
In generalized GENNCLUS, A and B can each be continuous, discrete, or a mixture


of continuous and discrete (i.e., hybrid) dimensions. U is generally assumed to have
continuously valued entries.

Tucker's three-mode factor/components analysis (TMFA) model corresponds to the


case in which both A and B (as well as U) are entirely continuously valued. It should
be noted that two-way GENNCLUS, as proposed by DeSarbo (1982), corresponds to
the case where K = 1. While DeSarbo refers to a possible three-way generalization,

he discusses only the two-way case.

2.2 N-Mode Factor/Components Analysis (NMFA), Multiway GENNCLUS,


and Hybrid Models

Given an I_1 \times I_2 \times \cdots \times I_N array, with general entry y_{i_1 i_2 \cdots i_N}, where i_n = 1, 2, ..., I_n
and n = 1, 2, ..., N, we fit it by a model of the general algebraic form

y_{i_1 i_2 \cdots i_N} \approx \sum_{t_1=1}^{T_1} \sum_{t_2=1}^{T_2} \cdots \sum_{t_N=1}^{T_N} a^{(1)}_{i_1 t_1} a^{(2)}_{i_2 t_2} \cdots a^{(N)}_{i_N t_N} \, u_{t_1 t_2 \cdots t_N} ,    (3)

where A_n is an I_n \times T_n matrix and U is a T_1 \times T_2 \times \cdots \times T_N array (a "generalized" core array
in Tucker's terminology). Each A_n may be continuous, discrete, or hybrid, while U
will have continuously valued entries. It should be noted that the multiway/mode
NMFA/GENNCLUS is the most general model, including all others described here
as special cases. We call this highly general multilinear model "MUMCLUS," for
MUlti-Mode CLUStering. (If all parameters are continuous, we call the resultant
special case, which is a generalization of TMFA, "MUMSCAL.")

While both CANDCLUS and MUMCLUS, when one or more of the ways/modes
entails binary (clustering) "dimensions," are, in general, overlapping clustering models/methods,
it is possible as a special case that the resulting clustering will correspond
to one in which the clusters are non-overlapping (mutually exclusive and
collectively exhaustive), i.e., comprise a partition of the objects or other entities corresponding
to the mode/way in question. Assuming for the moment that the mode
in question is treated completely cluster-wise (not in terms of a "hybrid" representation),
the matrix (say A_1) for that mode must then have the property that each row
contains exactly one "1", all other entries in that row equaling 0.

Approaches for fitting either CANDCLUS or MUMCLUS models via either an OLS
or LAD (least absolute deviation) criterion are described in Carroll and Chaturvedi
(1995), for the general, unconstrained case. Slight modification of these methods
would enable fitting via a WLS (weighted least squares) or weighted LAD (WLAD)
criterion, or other even more general "additively decomposable" loss functions.

An additively decomposable loss function, L, is of the form:

L = \sum_{i_n=1}^{I_n} \sum_{j_n=1}^{J_n} \ell \left[ z_{i_n j_n}, \hat{z}_{i_n j_n} \right] ,    (4)

where J_n = I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N, while j_n is an index ranging from 1 to J_n,
varying systematically over the combinations of all values of all N-1 subscripts
excluding i_n, and \ell[z, \hat{z}] is a measure of discrepancy between the two scalar-valued
quantities z and \hat{z}; e.g., (z - \hat{z})^2 or |z - \hat{z}|, in the case of an OLS or LAD loss
function, L, respectively.

As demonstrated for OLS fitting of the ADCLUS/INDCLUS model by Chaturvedi
and Carroll (1994), and by Chaturvedi, Lakshmi-Ratan, and Carroll (1995) for the
case of fitting ADCLUS/INDCLUS via a "least absolute deviation" (LAD) criterion
(see also Carroll and Chaturvedi, 1995 for a general discussion of this in the context
of the general unconstrained CANDCLUS/MUMCLUS models), very efficient
general algorithms can be formulated for fitting these models via an OLS or LAD
criterion, via what can be called a "one dimension at a time" elementwise approach.

That it is "one dimension at a time" simply means that, at each stage of an "outer
iteration" process, only one dimension, whether continuous or discrete (usually binary,
e.g., defining membership vs. non-membership in a cluster), is estimated,
conditional on fixed values of all the other R-1 dimensions (and/or clusters). This
conditional estimation procedure is iterated over dimensions/clusters until convergence
occurs. Within each of these "one dimension at a time" estimation steps another
"inner estimation" process is used, in this case iterating over individual values
of the (continuous or discrete) components of that dimension/cluster. Since this inner
iteration process is, in fact, iterating over certain elements of the set of parameter
matrices or arrays, we call this an elementwise procedure. At each of the most basic
computational steps of the composite iterative process all elements of all parameter
arrays are fixed, save the one which is currently being (re)estimated, conditional on
the fixed values of the remaining parameters.

Concretely, in the stage in which conditional estimates are being made for dimension/cluster
r for the general CANDCLUS model, as in the CANDECOMP algorithm
(Carroll and Chang, 1970; Carroll and Pruzansky, 1984) when using the "one dimension
at a time" approach, the CANDCLUS algorithm fixes the parameters for all ways
except the nth, and conditionally (re)estimates the parameters for that nth way. In
fact, if all parameters are continuous, and OLS estimation is done, the CANDCLUS
algorithm is exactly equivalent to the CANDECOMP algorithm, implemented on the
"one dimension at a time" basis. Thus a basic step in this overall algorithm entails
conditional estimation of a vector of coordinates of just one (continuous or discrete)
dimension, for just one way of the multiway data array, with all parameters for all
other dimensions, and all other ways for the dimension currently being (re)estimated,
held fixed at their current values.

Chaturvedi was the first to note that this adaptation of the one-dimension-at-a-time
CANDECOMP algorithm to fitting clustering or other discrete models via an OLS
(and later LAD) criterion could be greatly accelerated computationally based on a
separability property resulting from the additively decomposable structure of these
loss functions (see Chaturvedi and Carroll, 1994, Chaturvedi, Lakshmi-Ratan, and
Carroll, 1995 and Carroll and Chaturvedi, 1995 for details). This separability property
enables optimization of the objective function over the entire vector of discretely valued
parameter values via optimization for each discrete parameter separately. Thus,
for example, in the "standard" case of binary valued parameters, this optimization,
for a vector of I_n components (comprising the I_n values of the rth dimension for
the nth way), can be accomplished via 2I_n evaluations, rather than 2^{I_n} evaluations.
(In the case of a K-ary discrete parameter, KI_n rather than K^{I_n} evaluations are
required.) These discrete conditional estimation steps are implemented via what are
called the "elementary discrete least squares or LAD procedures," respectively. Conditional
OLS or LAD fitting of a vector of continuous parameters defining coordinates
of a continuous (spatial) dimension can also be done via a sequence of 2I_n (or KI_n,
in the K-ary case) quite simple OLS or LAD regression steps (called the "elementary
continuous least squares or LAD procedures," respectively). Because of the one dimension
at a time estimation in this case, it is quite straightforward to impose simple
equality or inequality constraints (e.g., a nonnegativity constraint) on the continuous
parameters, as is also discussed in the papers cited above.
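A minimal sketch of one such elementwise step, for the two-way OLS case (numpy assumed; Z ~ A B' with a binary A; this illustrates the separability argument, and is not published code):

import numpy as np

def update_binary_dimension(Z, A, B, r):
    # conditionally (re)estimate binary column r of A, all else held fixed;
    # by separability each element needs only 2 loss evaluations (2*I_n in all)
    resid = Z - A @ B.T + np.outer(A[:, r], B[:, r])  # residual with dimension r removed
    for i in range(Z.shape[0]):
        loss0 = np.sum(resid[i] ** 2)                 # loss with a_{ir} = 0
        loss1 = np.sum((resid[i] - B[:, r]) ** 2)     # loss with a_{ir} = 1
        A[i, r] = 0.0 if loss0 <= loss1 else 1.0
    return A

rng = np.random.default_rng(5)
Z = rng.normal(size=(8, 6))
A = rng.integers(0, 2, size=(8, 3)).astype(float)
B = rng.normal(size=(6, 3))
A = update_binary_dimension(Z, A, B, r=0)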

It is straightforward to extend either the OLS or LAD estimation scheme for the unconstrained
CANDCLUS model to weighted least squares (WLS) or weighted LAD
(WLAD) estimation. It is also simple, in principle, to extend this to a criterion of fit
based on minimizing an Lp-norm based loss function, for any 1 \le p \le \infty. In fact,
in certain circumstances this would also be sensible for an Lp norm for 0 \le p < 1,
where the L0 norm is the limiting case as p \to 0, corresponding to the case of a
"counting metric," appropriate for categorical data. An application of this "L0 loss
function" to a clustering approach called "K-modes" has been discussed by Carroll,
Chaturvedi, and Green (1994) and Chaturvedi, Green, and Carroll (1996). OLS and
LAD fitting correspond, of course, to the L2 (or Euclidean) and L1 (or "city-block")
norms, respectively.

The elementary discrete estimation procedures for any Lp-norm based loss function
would be quite simple, merely entailing evaluating the loss function for each of the
2 (or K) values of each parameter, with all other parameters fixed, and choosing
the parameter value with the lower (lowest) value of the loss function. There is,
however, no closed form solution, in general, for an Lp norm based loss function for
p other than 1 or 2 (and, also, for p = 0, for appropriate models and data types,
where the solution will simply be the mode of certain values of an associated categorical
variable, and for p = \infty, where the solution is the midrange of certain values).

In these cases, some form of line search algorithm or other unidimensional optimization
procedure would have to be used to solve the elementary continuous estimation
problem. These statements can be extended to any additively decomposable loss
function of the form defined in equation (4); again the elementary discrete procedure
would entail simple enumeration of the 2 (or K) parameter values and choice
of the one yielding the lower (lowest) value of the loss function, while the elementary
continuous procedure would require either use of a unidimensional optimization
method, or a closed form solution, if available. The separability property discussed
above makes estimation of the CANDCLUS parameters based on any loss function of
the form stated in equation (4) particularly efficient computationally for the discrete
parameters, while reducing it to a unidimensional problem for the continuous ones
(with a straightforward way of imposing such constraints as nonnegativity, if desired).

Carroll and Chaturvedi (1995) also discuss an extension of this approach to estima-
tion of the unconstrained MUMCLUS model, at least in the case of an OLS criterion
of fit. LAD, or most other loss functions, would not be nearly as tractable for fitting
MUMCLUS, however, since the estimation of the core array would be particularly
difficult in this case (requiring, generally, use of some form of multidimensional op-
timization procedure, one not being simplifiable at all via a separability property
such as that utilized in CANDCLUS). A WLS extension of MUMCLUS estimation
would be quite straightforward, however, since estimation of the core array could be
implemented in this case by use of a WLS multivariate regression procedure.

It turns out that a very straightforward modification of the CANDCLUS/MUMCLUS
algorithm (for fitting A_1 and/or the matrices for any other modes), namely fitting that
matrix at each stage of the algorithm rowwise (all "dimensions" and clusters being
estimated simultaneously, rather than via the "one dimension at a time" elementwise
strategy discussed in the case of CANDCLUS/MUMCLUS), enables fitting a solution
constrained to have this partitioning form (for one or more of the ways/modes).
In this approach, one simply seeks, for each row, the column (cluster) to which the
single "1" should be assigned to optimize the (OLS, LAD or other) loss function being
minimized. The separability property, again, can be used; in this case it means that
this decision can be made separately for each row of the matrix, given that the other
matrices are treated as fixed.
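A minimal OLS sketch of this rowwise step (numpy assumed; Z and Q stand for the unfolded data and the columnwise Kronecker product defined just below, here random placeholders):

import numpy as np

def update_partition_rows(Z, Q):
    # for each row of A, try each of the R unit vectors e_1..e_R (i.e. each
    # possible cluster assignment) and keep the best: I_n * R evaluations in all
    I, R = Z.shape[0], Q.shape[1]
    A = np.zeros((I, R))
    for i in range(I):
        losses = ((Z[i][:, None] - Q) ** 2).sum(axis=0)  # ||Z[i] - Q[:, r]||^2 for each r
        A[i, np.argmin(losses)] = 1.0
    return A

rng = np.random.default_rng(6)
Z = rng.normal(size=(10, 7))   # Z_n: one way's unfolding of the data array
Q = rng.normal(size=(7, 3))    # Q_n: columnwise Kronecker product of the other ways
A = update_partition_rows(Z, Q)

Note that this is exactly the assignment step of K-means, with the columns of Q playing the role of centroids; this anticipates the K-means/K-medians special case discussed at the end of this section.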

We describe the resulting approach for CANDCLUS, with partitioning constraints,
below. To be specific, given current estimates of all other matrices, A_2, A_3, ..., A_N
(whether these are spatial/continuous, cluster-like/discrete, or hybrid), we define
what is sometimes called the "columnwise Kronecker product" of A_2, A_3, ..., A_N,
which we shall denote A_2 \otimes_c A_3 \otimes_c \cdots \otimes_c A_N, and denote when appropriate as
Q_1, indexing Q by the index of the matrix, A_1 in this case, which is omitted (see
Carroll and Chang, 1970, and ten Berge and Kiers, 1996).

For two matrices A_n and A_{n'}, of orders I_n \times R and I_{n'} \times R respectively (note, in particular,
that both have a common column order, R), the columnwise Kronecker product is
defined as:

A_n \otimes_c A_{n'} = \left[ a^{(n)}_1 \otimes a^{(n')}_1, \; a^{(n)}_2 \otimes a^{(n')}_2, \; \ldots, \; a^{(n)}_R \otimes a^{(n')}_R \right] ,    (5)

where a^{(n)}_r is the rth column of matrix A_n, and \otimes is the ordinary Kronecker product
(applied in the present case separately to the corresponding columns of A_n and A_{n'}). Since
A_n is I_n \times R, and A_{n'} is I_{n'} \times R, A_n \otimes_c A_{n'} will be I_n I_{n'} \times R. We define such products
as A_2 \otimes_c A_3 \otimes_c \cdots \otimes_c A_N recursively (as in the case of ordinary Kronecker products);
thus, for example:

A_2 \otimes_c A_3 \otimes_c A_4 = (A_2 \otimes_c A_3) \otimes_c A_4 .    (6)
We also define a matrix Z_1 (I_1 \times I_2 I_3 \cdots I_N) as the matrix obtained by concatenating,
over the N-1 subscripts i_2, i_3, ..., i_N, the entries of Y to form a row vector of
I_2 I_3 \cdots I_N components for each i_1 = 1, 2, ..., I_1, and then defining the i_1th row
of Z_1 as that row vector. (The order of the subscripts i_2, i_3, ..., i_N is assumed to be
identical to the order induced on these same subscripts in the columns of the columnwise
Kronecker product Q_1.)
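A minimal numerical check of these definitions (numpy assumed; sizes arbitrary):

import numpy as np

def khatri_rao(A, B):
    # columnwise Kronecker product: column r of the result is kron(A[:, r], B[:, r])
    I, R = A.shape
    J = B.shape[0]
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(7)
A1, A2, A3 = rng.normal(size=(6, 3)), rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
Y = np.einsum("ir,jr,kr->ijk", A1, A2, A3)

Q1 = khatri_rao(A2, A3)        # A_2 (x)_c A_3, of order (I2*I3) x R
Z1 = Y.reshape(6, 5 * 4)       # rows indexed by i1, columns by (i2, i3)
print(np.allclose(Z1, A1 @ Q1.T))  # True: Z_1 = A_1 Q_1'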

Having so defined Z_1 and Q_1 or, more generally, Z_n and Q_n, for all n = 1, 2, ..., N
(extending these definitions in the obvious way to the case of any one of the N
matrices in the overall decomposition of Y, say A_n, denoting the corresponding
matrices as Z_n and Q_n, respectively, in this case with Q_n being defined in terms of
current estimates of all A matrices except A_n), we then can solve for a new estimate
of A_n, \hat{A}_n, conditional on the current estimates of A_{n'} for all n' \ne n, by finding the
constrained OLS, WLS, LAD, or other estimate of A_n, by solving for an estimate,
\hat{A}_n, in the equation

Z_n \approx \hat{A}_n Q_n'    (7)

(where \approx can be taken as implying optimization of a fit criterion such as OLS, WLS, LAD,
weighted LAD, or any other specified additively decomposable criterion of fit of the
form given in equation (4)).

Given this general form of the loss function, whether OLS, WLS, LAD, WLAD or
other, we can very simply minimize L by the use of a similar separability property as
noted in the case of CANDCLUS/MUMCLUS. In this case, this separability of the
overall loss function L is defined rowwise, for the entire matrix A_n, since for a loss
function of the additive form given in equation (4), given that Z_n = A_n Q_n', we have
\hat{z}^{(n)}_{i_n j_n} = a^{(n)}_{i_n} (q^{(n)}_{j_n})', so that

L = \sum_{i_n=1}^{I_n} \sum_{j_n=1}^{J_n} \ell \left[ z^{(n)}_{i_n j_n}, \; a^{(n)}_{i_n} \left( q^{(n)}_{j_n} \right)' \right] ,    (8)

so that L is separable in each of the row vectors of the matrix A_n, generalizing quite
directly the elementwise separability property defined earlier (and used for unconstrained
CANDCLUS/MUMCLUS) to a rowwise separability property.

In the case of a partition, the problem is particularly simple, of course, since each of
these row vectors is constrained to have one and only one 1, with all other elements
= 0. Thus, the optimal vector can be selected, at each stage of the overall algorithm,
by simply computing the objective function being optimized (minimized) for each
of the R unit vectors, and choosing the one optimizing (minimizing) the objective
function; for each row this entails only R computations, rather than the 2^R computations that
would be required for the unconstrained case.

Thus, the conditional minimization of L can be accomplished simply by sequentially
minimizing

f_{i_n}\left[ a^{(n)}_{i_n} \right] = \sum_{j_n=1}^{J_n} \ell \left[ z^{(n)}_{i_n j_n}, \; a^{(n)}_{i_n} \left( q^{(n)}_{j_n} \right)' \right] ,    (9)

where f_{i_n}[a^{(n)}_{i_n}] is a function of the vector a^{(n)}_{i_n} alone, since all other variables are
treated as constant, while f_{i_n} is minimized by a very simple exhaustive search. Given
the constraint that A_n have only one "1" per row for each of its I_n rows, the
vector a^{(n)}_{i_n} must be one of the R unit vectors e_1, e_2, ..., e_R. Thus f_{i_n} need be
evaluated only for those R permissible unit vectors, entailing exactly R evaluations.
Thus A_n as a whole can, in view of the separability of the assumed loss function, be
optimized via a total of I_n R such evaluations, a quite manageable search process,
since it is linear in the size (I_n R) of the matrix (as opposed to being proportional
to 2^{I_n R}, as would be true in the case of a totally general binary matrix and a non-separable
loss function).

The conditional OLS, LAD (or other) estimation of the remaining matrices can be
implemented either rowwise or in a "one-dimension at a time" elementwise manner,
as with unconstrained CANDCLUS/CANDECOMP, depending on the nature of the
parameters, constraints, or other factors. For purposes of the approach described
here, conditional estimates of each of the other N-1 matrices must be given,
and treated as fixed for reestimation of A_n. This enables using the elementwise separability
property described earlier for CANDCLUS to estimate overlapping cluster
structures, or other discrete structures, for some modes and also, if desired, to implement
continuous optimization of parameter matrices for other modes with (say)
nonnegativity constraints. In the case of a matrix A_k with continuous parameters,
and in which OLS or WLS estimation is being done, A_k may be estimated matrixwise
via OLS regression as in CANDECOMP, or via a straightforward generalization of
this to WLS matrixwise estimation via WLS regression. If the LAD (or WLAD)
criterion is to be used, the elementary continuous LAD (or WLAD) procedure must be
used, on an iterated "one-dimension at a time" elementwise basis.

It should be noted that other forms of constrained overlapping cluster structures can
also be estimated (conditionally) on a rowwise basis. For example, if one wanted to
constrain the cluster structure to one in which each object or other entity corresponding
to a level of a specified mode/way is contained in no more than C clusters,
the search could be restricted to those binary R-vectors having C or fewer 1's. The
additive decomposability of the objective function leads to a very simple algorithm
for this problem. First, go through all components of a particular row vector, allowing
each component to be 0 or 1 on an unconstrained elementwise basis. Then, if
the number of 1's is C or less, you're finished (for that row vector). If not, choose
the C components associated with the C largest reductions (or "differentials") in the
objective function being minimized. (This step requires storing these differentials for
each such unit component, followed by a sorting algorithm aimed at choosing the C
largest absolute differentials among them.) This algorithm would enable choosing the
optimal vector for the i_nth row of A_n in considerably fewer than \binom{R}{C} operations, the
specific number of operations being dependent on the data and the other parameters'
current estimates. Other constraints are possible (e.g., that each object be in exactly
C clusters, or that the R-vectors be restricted to a predefined subset of all possible
binary R-dimensional vectors). Any constraints that can be defined in terms of the
set of permissible binary R-vectors for each row of a specific parameter matrix (even
if a different set for each row is specified, so long as these constraints are independent
of parameter values in the other rows) can be imposed in this manner.
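A minimal OLS sketch of this differential heuristic for a single row (numpy assumed; not the authors' code):

import numpy as np

def constrained_row(z_row, Q, C):
    # at most C ones: unconstrained elementwise pass, then keep the C components
    # with the largest reductions ("differentials") in the squared loss
    R = Q.shape[1]
    a, gains = np.zeros(R), np.zeros(R)
    for r in range(R):
        resid = z_row - Q @ a
        gain = np.sum(resid ** 2) - np.sum((resid - Q[:, r]) ** 2)
        if gain > 0:
            a[r], gains[r] = 1.0, gain
    if a.sum() > C:
        keep = np.argsort(gains)[-C:]   # indices of the C largest differentials
        a = np.zeros(R)
        a[keep] = 1.0
    return a

rng = np.random.default_rng(8)
print(constrained_row(rng.normal(size=10), rng.normal(size=(10, 6)), C=2))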

While we have only roughly outlined the constrained CANDCLUS approach, at least
one special case merits particular attention. This is the special case involving two-mode
two-way data, in which one mode is modeled by partitions and the other by
continuous parameters. This particular special case can easily be shown to be equivalent
to the well-known clustering procedure known as K-means (where K = R in this
case) if an OLS, and K-medians if a LAD, criterion is used. We might mention, as a
related model of interest, the case of unconstrained CANDCLUS fit to such two-mode
data with one mode modeled by overlapping clusters and the other by continuous parameters.
This model/method we have called "overlapping K-centroids" (Chaturvedi,
Carroll, Green and Rotondo, 1994) elsewhere.

While we have dealt here only with the CANDCLUS model with constraints making
the clustering (for one or more modes) a partition, this entire approach generalizes in
a straightforward manner to a similarly constrained version of MUMCLUS, at least
via OLS or WLS optimization. As discussed by Carroll and Chaturvedi (1995), estimation
of the continuous parameters in the core array via LAD, WLAD or other
additively decomposable fit criteria is not, in general, an easily implementable computational
problem leading to a closed-form solution. We shall not explore this last
class of models and methods more fully, however, in the present paper.

3. CONCLUSION
The general CANDCLUS and MUMCLUS models have been defined and discussed,
including references to previously published work on unconstrained estimation of each
using various additively decomposable loss functions as fit criteria, based on a sepa-
rability property originally observed by Chaturvedi. An extension of the separability
property from individual elements of various parameter matrices to (row) vectors of
those matrices leads to a straightforward approach for estimating these models with
the clusters corresponding to certain ways or modes being constrained to satisfy par-
titioning constraints. Using this property to impose certain other kinds of constraints
on the clustering is also possible. One that is discussed entails constraining the ob-
jects to be contained in no more than a fixed number (C) of clusters.

Some special cases of CANDCLUS are discussed, including OLS and LAD estimation
of the ADCLUS/INDCLUS models and a procedure generalizing K-means and K-
medians to the case of overlapping clusters, called K-overlapping centroids clustering
(which includes methods called overlapping K-means and overlapping K-medians as
special cases), as well as some other potential applications (e.g., to fitting "hybrid"
models entailing combinations of continuous and discrete dimensions for the same
set of entities corresponding to one or more modes of the multiway data array).
We anticipate many future applications of these and other specific special cases of
constrained and unconstrained CANDCLUS/MUMCLUS models, some not yet even
contemplated, to a wide variety of data analytic situations.

New algorithmic developments may further improve fitting procedures for this class of
models, using various loss functions as fitting criteria, potentially reducing problems
of merely local optima and slow convergence, and generally increasing computational
efficiency so as to make dealing with the increasingly large data arrays arising in
many practical data analytic situations much more feasible.

4. References
Carroll, J. D. (1976): Spatial, non-spatial and hybrid models for scaling, Psychometrika,
41, 439-463.
Carroll, J. D. and Arabie, P. (in press): Multidimensional scaling, In: Handbook of Perception
and Cognition. Volume 3: Measurement, Judgment and Decision Making, Birnbaum,
M. H. (ed.), San Diego, CA: Academic Press.
Carroll, J. D. and Chang, J. J. (1970): Analysis of individual differences in multidimensional
scaling via an N-way generalization of "Eckart-Young" decomposition, Psychometrika, 35,
283-319.
Carroll, J. D. and Chaturvedi, A. (1995): A general approach to clustering and multidimensional
scaling of two-way, three-way, or higher-way data, In: Geometric Representations of
Perceptual Phenomena, Luce, R. D. et al. (eds.), 295-318, Mahwah, NJ: Erlbaum.
Carroll, J. D. et al. (1994): K-means, K-medians and K-modes: Special cases of partitioning
multiway data. (Paper presented at the meeting of the Classification Society of North America,
Houston, TX.)
Carroll, J. D. and Pruzansky, S. (1980): Discrete and hybrid scaling models, In: Similarity
and Choice, Lantermann et al. (eds.), 108-139, Bern: Hans Huber.
Carroll, J. D. and Pruzansky, S. (1984): The CANDECOMP-CANDELINC family of models
and methods for multidimensional data analysis, In: Research Methods for Multimode
Data Analysis, Law, H. G. et al. (eds.), 372-402, New York: Praeger.
Chaturvedi, A. and Carroll, J. D. (1994): An alternating combinatorial optimization approach
to fitting the INDCLUS and generalized INDCLUS models, Journal of Classification,
11, 155-170.
Chaturvedi, A. et al. (1994): A feature based approach to market segmentation via overlapping
K-centroids clustering. Manuscript submitted for publication.
Chaturvedi, A. et al. (1995): Two L1 norm procedures for fitting ADCLUS and INDCLUS.
Manuscript submitted for publication.
Chaturvedi, A. et al. (1996): Market segmentation via K-modes clustering. (Paper presented
at the American Statistical Association Conference, Chicago, IL.)
De Soete, G. and Carroll, J. D. (1996): Tree and other network models for representing
proximity data, In: Clustering and Classification, Arabie, P. et al. (eds.), 157-197, River
Edge, NJ: World Scientific.
DeSarbo, W. S. (1982): GENNCLUS: New models for general nonhierarchical clustering
analysis, Psychometrika, 47, 449-475.
ten Berge, J. M. F. and Kiers, H. A. L. (1996): Some uniqueness results for PARAFAC2,
Psychometrika, 61, 123-132.
A Distance-Based Biplot
for Multidimensional Scaling of Multivariate Data
Jacqueline J. Meulman
Department of Data Theory, University of Leiden
P.O. Box 9555, 2300 RB Leiden, The Netherlands

Summary: Least squares multidimensional scaling (MDS) methods are attractive candi-
dates to approximate proximities between subjects in multivariate data (Meulman, 1992).
Distances in the subject space will resemble the proximities as closely as possible, in con-
trast to traditional multivariate methods. When we wish to represent the variables in the
same display - after using MDS to represent the subjects - various possibilities exist. A
major distinction is between linear and nonlinear biplots. Both types will be discussed
briefly, including their drawbacks. To circumvent these drawbacks, a third alternative will
be proposed. By expanding the optimal p-space (where p denotes the dimensionality of the
subject space) into an m-dimensional space of rank p (with m > p), we obtain a coordinate
system that is appropriate for the evaluation of the MDS solution directly in terms of the
m original variables. The latter are represented graphically as vectors in p-space, and their
entries as markers that are located on these vectors. The overall approach, including the
analysis of mixed sets of continuous and categorical variables, can be viewed as a distance-
based alternative for the graphical display of multivariate data in Gifi (1990).

1. Introduction
In the approach to Multivariate Analysis (MVA) applied in this paper, the vari-
ables are used to define an observation or measurement space in which the units
are located according to their scores. The distances in this observation space are
regarded as proximities to be approximated by distances between subject points in
a low-dimensional representation space. If the m-dimensional observation space is
denoted by Q, giving coordinates for n points in Tn dimensions, the proximities be-
tween all pairs of subjects are given in the proximity matrix D( Q) where D(.) is the
Euclidean (also called Pythagorean) distance function. So the n x n matrix D( Q)
contains proximities dik(Q) between subject i and k. Squared distances are given by:
D2(Q) = vI' + Iv' - 2QQ', (1 )
where v = vecdiag( QQ') is the n-vector containing the diagonal elements of QQ',
and 1 is the n-vector of all 1 '". Analogously, squared distances in the p-dimensional
representation space X are defined by D"(X) = vI' + 1 v' - 2XX', now with v =
vecdiag( XX').
Approximation of a set of proximities by a set of distances in some low-dimensional
space is usually identified as a multidimensional scaling (MDS) task. In Meulman
(1986), following Gower (1966), it is shown that techniques of multivariate analysis,
like principal components, canonical correlation and homogeneity analysis, are
equivalent to MDS tasks applied to particular derived proximities when the so-called
classical Torgerson-Gower approach to MDS (Torgerson, 1958; Gower, 1966) is used.
A basic ingredient is the original Young-Householder (1938) process that transforms
a squared distance matrix D²(Q) into an n × n scalar product matrix QQ', modified
by locating the origin in the centroid of the points through the use of the n × n centering
operator J = I - (11'/1'1), which gives

-1/2 J(D²(Q))J = -1/2 J(v1' + 1v' - 2QQ')J = QQ'.   (2)


(I is the n × n identity matrix; we assume that the variables q_j have zero mean.) The
scalar product matrix QQ' is then approximated by another scalar product matrix
of lower rank, using an objective function that can be written in the form:

STRAIN(X) = ||QQ' - XX'||² = tr (QQ' - XX')'(QQ' - XX'),   (3)

so || · ||² denotes a least squares discrepancy measure. (The term STRAIN is used
after Carroll and Chang, 1972.) In our case, because XX' = -1/2 J(D²(X))J, (3)
can also be written as

STRAIN(X) = (1/4) ||J(D²(Q) - D²(X))J||²,   (4)


so Torgerson-Gower scaling approximates double-centered squared distances. To find
the coordinates in X, first an eigenanalysis of QQ' is performed: QQ' = KΛK',
where K is an n × t matrix containing t eigenvectors, Λ is a t × t diagonal matrix
containing the ordered positive eigenvalues, and t denotes the rank of Q (for t ≤ m).
The optimal solution in the Torgerson-Gower procedure for obtaining a p-dimensional
X (for p ≤ t) is given by X = K_p Λ_p^{1/2}; thus, the eigenvectors are rescaled using the
eigenvalues to give coordinates X for the units in p-space with dimensions that reflect
differential saliences.
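For concreteness, the Torgerson-Gower procedure just described can be sketched in a few lines of numpy; the data matrix, sizes and seed below are placeholders of our own, not material from the paper.

```python
import numpy as np

def classical_scaling(Q, p):
    """Torgerson-Gower scaling: recover QQ' from squared distances via
    double centering, then take X = K_p Lambda_p^{1/2}."""
    n = Q.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering operator J
    sq = np.sum(Q**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Q @ Q.T   # squared distances, cf. (1)
    B = -0.5 * J @ D2 @ J                          # equals QQ', cf. (2)
    evals, evecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:p]              # keep the p largest
    return evecs[:, idx] * np.sqrt(evals[idx])     # X = K_p Lambda_p^{1/2}

# Example: 10 units on 4 centered variables, scaled to p = 2 dimensions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 4))
Q -= Q.mean(axis=0)
X = classical_scaling(Q, 2)
```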
The multivariate case described above, with proximities D(Q), is usually called Principal
Coordinates Analysis (Gower, 1966); using the singular value decomposition
Q = KΛL', it can easily be shown that the solution for X that would be obtained in
a principal components analysis is equivalent to the solution for X obtained by minimizing
(3) or (4). Using the same strategy, it was shown in Meulman (1986, 1992)
that Multiple Correspondence Analysis (MCA), also called homogeneity analysis, can
be viewed as a classical scaling technique as well. In MCA, categorical variables are
analyzed, and each categorical variable h_j defines a binary indicator matrix G_j with
n rows and l_j columns, where l_j denotes the number of categories. Elements h_{ij} then
define elements g^j_{ir} as follows:

h_{ij} = r → g^j_{ir} = 1;    h_{ij} ≠ r → g^j_{ir} = 0,   (5)

where r = 1, ..., l_j is the running index to indicate the category number of variable j.
In analogy with (3) and (4), MCA can be written as a classical MDS problem since
its optimal solution for the subject scores X minimizes

1/m Σ_{j=1}^{m} ||J(D²(G_j(G_j'G_j)^{-1/2}) - D²(X))J||²,   (6)

where proximities are derived simultaneously from all G_j separately. The columns
of the indicator matrix G_j are divided by the square root of the marginals G_j'G_j;
the latter operation defines the chi-squared metric. Finally, the proximities are approximated
by Euclidean distances in X. As before, in the classical scaling approach,
X would not be normalized to represent the subjects in an orthonormal cloud, but
instead the eigenvalues are used to give the representation space a certain shape,
displaying the differential saliences.
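The indicator coding (5) and the chi-squared rescaling that enters (6) are easily illustrated; in this sketch the category codes and function names are ours.

```python
import numpy as np

def indicator(h, n_categories):
    """Binary indicator matrix G_j for one categorical variable h_j,
    with codes h[i] in {0, ..., n_categories - 1}; cf. rule (5)."""
    G = np.zeros((len(h), n_categories))
    G[np.arange(len(h)), h] = 1.0
    return G

def chi2_coordinates(G):
    """Columns of G_j divided by the square roots of the marginals
    (the diagonal of G_j'G_j), so that Euclidean distances between
    rows follow the chi-squared metric used in (6)."""
    return G / np.sqrt(G.sum(axis=0))

# Hypothetical nominal variable with 3 categories for 6 subjects.
h = np.array([0, 1, 1, 2, 0, 2])
Gs = chi2_coordinates(indicator(h, 3))
```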
In Meulman (1986, 1992) an alternative is proposed, which is to analyse multivariate
data by minimizing a loss function that is directly defined on the distances (so it

does not approximate distances through inner products). The history of least squares
MDS methods can be followed from Shepard (1962), Kruskal (1964), Guttman (1968),
Takane, Young, and De Leeuw (1977), De Leeuw and Heiser (1980), Ramsay (1982),
among others. Least squares MDS methods are traditionally applied to a given proximity
matrix, whose proximities are then approximated through minimization of some least
squares loss function that is defined on (transformations of) proximities and distances
in a representation space X. In the multivariate cases described above, we derive the
proximities from the multivariate data. Then, in the distance-based modification of
principal components analysis (distance-based PCA, for short), we minimize

STRESS(X) = ||D(Q) - D(X)||²   (7)

over X. As before, D(·) is an n × n matrix with Euclidean distances between subjects,
and X is the low-dimensional space in which distances should match the proximities
as closely as possible. To minimize loss function (7), we have to use an iterative
procedure since there is no analytic solution. In the present paper, the majorization
algorithm for MDS has been used (e.g., see De Leeuw and Heiser, 1980; Groenen and
Heiser, 1996). The majorization approach amounts to computing an update for X
that reduces the value of (7) from a starting point X0 by:

X = 1/n B(X0)X0.   (8)

Here, the n × n matrix B(X0) is defined as

B(X0) = B⁺(X0) - B*(X0),   (9)

where the elements of the matrix B*(X0) are given by b_{ik}(X0) = d_{ik}(Q)/d_{ik}(X0)
if i ≠ k and d_{ik}(X0) ≠ 0; otherwise b_{ik}(X0) = 0. The elements of the diagonal
matrix B⁺(X0) are given by b⁺_{ii}(X0) = 1'B*(X0)e_i, where e_i is the ith column of the
identity matrix I. Repeatedly computing the update X gives a convergent series of
configurations.
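The update (8)-(9) translates directly into numpy; the sketch below iterates the majorization step on proximities D(Q), with sizes, starting configuration and iteration count chosen arbitrarily for illustration.

```python
import numpy as np

def pairwise_distances(X):
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.sqrt(D2)

def smacof_update(delta, X0):
    """One majorization step X = (1/n) B(X0) X0, with B(X0) as in (9)."""
    n = X0.shape[0]
    D0 = pairwise_distances(X0)
    with np.errstate(divide="ignore", invalid="ignore"):
        Bstar = np.where(D0 > 0, delta / D0, 0.0)  # b*_ik = d_ik(Q)/d_ik(X0)
    np.fill_diagonal(Bstar, 0.0)
    B = np.diag(Bstar.sum(axis=1)) - Bstar         # B(X0) = B+(X0) - B*(X0)
    return B @ X0 / n

# Illustration: proximities D(Q) from random data, 200 update steps.
rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 5))
delta = pairwise_distances(Q)
X = rng.normal(size=(8, 2))
for _ in range(200):
    X = smacof_update(delta, X)
```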
A special feature of the Gifi-system is the possibility of differential treatment of variables
in the analysis. For example, some variables may be treated as containing
numerical scores, while others may be treated as nominal variables. The latter treatment
is appropriate when a variable partitions the subjects into unordered classes. In
distance-based PCA nominal treatment of variables is carried out as follows. First,
the nominal variable h_j is replaced by a binary indicator matrix G_j with n rows and
l_j columns, as above. Then, proximities δ(Q; G) are derived simultaneously from Q and G_j:

(10)

where j = 1, ..., m is the running index to indicate the variables in the analysis that
classify the subjects into groups (there may be more than one classifying variable, and
then multiple indicator matrices should be created). As in MCA and homogeneity
analysis, the columns of the indicator matrix G_j are divided by the square root of the
marginals G_j'G_j to give distances in the chi-squared metric. Finally, the proximities
δ(Q; G) are approximated by Euclidean distances D(X).
Meulman (1992) has shown that if we apply the classical scaling approach (as in
Torgerson, 1958; Gower, 1966) to approximate δ(Q; G), then this results in a solution
for X that is equivalent to the subject scores in Gifi's PCA, with numerical
and nominal variables. (Again, apart from a scaling factor per dimension, displaying
the differential saliences; PCA usually displays the subject points as an orthonormal
cloud, with equal dispersions.)

2. Supplementary representation of variables in an MDS solution
The idea of joint representation of subjects and variables, which originates with
Tucker (1960), was subsequently very successfully applied in the analysis of preference
data (Carroll, 1972), and has become well-known as the biplot (Gabriel, 1971);
the classical reference to the basic idea of lower-rank approximation is Eckart and
Young (1936). A recent book on biplots is Gower and Hand (1996). From the biplot
point of view, principal components analysis can be regarded as a bilinear model
(Kruskal, 1978). (A variable-oriented approach would regard PCA as the analysis of
a correlation matrix.) In the bilinear model, we minimize

σ(X) = ||Q - XA'||²,   (11)


over X and A: the observed scores in the m-dimensional space Q are approximated by
the inner product of the p-dimensional component scores X and component loadings
A (with p much smaller than m). Graphically, in the biplot representation, subjects
are represented as points, and variables as vectors, and the orthogonal projection of
the subject points onto the variable vectors gives an approximation of the observed
scores.
In the framework of distance-based PCA, as in (7), variables can be considered from
an internal and an external perspective. From the internal perspective, their role is
to provide the proximities between the subjects. From the external perspective, the
variables can be used afterwards to study whether we can account for the structure
among the subjects. The latter can be done through linear and nonlinear external
biplot methods. We call these biplots external, because the variables are fitted into
the subject space in a second step, while the subject points X are kept fixed. By
contrast, in internal biplots, the subject points and the vectors representing the vari-
ables are found simultaneously.

2.1 Linear external biplots through multiple regression


A straightforward way to fit a set of variables in a given configuration is through
"property fitting" (e.g., Carroll, 1972; also, s,ee Meulman, Heiser and Carroll, 1987).
For distance-based PCA, this amounts to the projection of a variable Clj into the space
of the subjects X by the use of multiple regression. In the regression, the columns
in X are the independent variables, and the weights obtained from the regression
determine the coordinates for the variable Clj in the space X. The optimal direction
aJfor the vector representing variable Clj in the p--space X is thus found as

(l2)

Using a_j to represent the endpoint of the vector gives a linear biplot representation.
This biplot is obtained, however, through the use of different rationales for fitting the
subjects on the one hand and the variables on the other: the subject points are fitted
through the use of least squares distance fitting in (7), and the vectors representing
the variables through ordinary multiple regression in (12).
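Since (12) is ordinary least squares per variable, the fit reduces to a single solve; a minimal sketch (the function name is ours):

```python
import numpy as np

def property_fit(X, Q):
    """Fit each variable q_j into the fixed subject space X by multiple
    regression; row j of the result is the vector endpoint a_j of (12)."""
    A, *_ = np.linalg.lstsq(X, Q, rcond=None)  # min ||q_j - X a_j||^2 per column
    return A.T                                 # one row per variable
```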

2.2 Nonlinear external biplots through unfolding


As a possible alternative, coherent method, Meulman and Heiser (1993) proposed a
least squares generalization of the so-called nonlinear biplot in Gower and Harding
(1988). The latter nonlinear biplot was developed to obtain nonlinear representations
of variables in a space that is generated through a principal coordinates analysis. The
procedure discussed in Meulman and Heiser (1993) can be described as follows. First,
regard each variable as a series of s = 1, ..., S supplementary points (a trajectory)
in the space X. In terms of the data, a supplementary point for variable j has
coordinates e_{r_s} in Q that are all equal to zero, except for the jth variable; so when
e_j is the jth column of the m × m identity matrix I, e_{r_s} = r_s e_j, where min(q_j) ≤
r_s ≤ max(q_j). Next, for each supplementary point, the distance is calculated to the
n original points in observation space. The vector with squared distances between
supplementary point e_{r_s} and the subjects in Q is given by

d²(e_{r_s}; Q) = v + r_s²·1 - 2Q e_{r_s},   (13)

with v the n-vector containing the diagonal elements of QQ'. The kth element of
d²(e_{r_s}; Q) gives the squared distance between the sth supplementary point and the
kth subject point in observation space, and will be written as d²(e_{r_s}; q_k), where
q_k denotes the kth row of Q. Mapping the trajectory for variable q_j involves the
approximation of d(e_{r_s}; Q) by d(y_s; X), where y_s gives p-dimensional coordinates in
the space of X, for different values of r_s, s = 1, ..., S. Here S denotes a prechosen
number, appropriate to cover the range from min(q_j) to max(q_j). Each supplementary
point has to be mapped separately, and the coherent method with respect to least
squares multidimensional scaling as in (7) is the use of

STRESS(y_s) = ||d(e_{r_s}; Q) - d(y_s; X)||²,   (14)


to be minimized over y_s for given Q and X. The loss function (14) represents the
least squares external unfolding problem; it is called unfolding because it fits distances
between two sets of points, X and Y, and it is called external because X is known
and fixed. There is no closed-form solution for STRESS-based external unfolding
as in (14), so the loss function has to be minimized iteratively. The latter can be
done using the SMACOF framework for unfolding (Heiser, 1981; 1987). The points
y_1, ..., y_s, ..., y_S, mapped in X by minimizing (14), will in general not be on a straight
line, because (14) represents a nonlinear mapping. Nonlinear biplot representations
have interesting properties, but some of them are not yet fully understood and are
currently under study (Groenen and Meulman, 1995). In most cases, nonlinear biplots
are harder to interpret than linear ones, and therefore here a biplot is presented
that is linear and, unlike regression, at the same time consistent with the criterion
minimized in distance-based PCA.
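Computing the unfolding targets of (13) is straightforward; only the minimization of (14) requires iteration and is omitted in this sketch. The grid size S is an arbitrary choice of ours.

```python
import numpy as np

def trajectory_targets(Q, j, S=10):
    """Squared distances d^2(e_{r_s}; Q) of (13): one row per value of
    r_s between min(q_j) and max(q_j), one column per subject."""
    v = np.sum(Q**2, axis=1)                 # diagonal of QQ'
    q_j = Q[:, j]
    r = np.linspace(q_j.min(), q_j.max(), S)
    # distance from r_s * e_j to subject k: v_k + r_s^2 - 2 r_s q_kj
    return v[None, :] + r[:, None]**2 - 2.0 * r[:, None] * q_j[None, :]
```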

3. The Distance-Based Biplot


The basic notion to arrive at a simple, linear, and coherent biplot is that distances
D(·) are invariant under rotation, i.e., D(X) = D(XA') if A'A = I. Usually, a rotation
matrix is of the order p × p; here, however, we will consider a matrix A of
order m × p. This matrix will be labeled a rotation-expansion matrix, because the
transformation preserves the distances in X, and at the same time expands the representation
space. Thus, the coordinates X in p-space are replaced by m-dimensional
coordinates in the space XA' of rank p, and since A'A = I

STRESS(X) = ||D(Q) - D(X)||² = ||D(Q) - D(XA')||².   (15)

To obtain the biplot coordinates, we have to minimize ||Q - XA'||² over all A satisfying
A'A = I. This amounts to an orthogonal Procrustes problem of the order p × p
that is solved as follows.
Define the singular value decomposition of the m × p matrix Q'X as Q'X = KΛL',
and the eigenvalue decomposition of the p × p matrix X'QQ'X as X'QQ'X = LΛ²L'.
Then the rotation-expansion matrix is found by

A = KL' = Q'XLΛ⁻¹L'.   (16)

Now the m-dimensional coordinate system XA' can be used to evaluate the MDS
solution directly in terms of the original variables, with the Pearson correlation coefficient
as a natural measure of association. At the same time, the jth row in A
(denoted by a_j) gives the coordinates to display the variable q_j in the space X. The
scores q_j themselves can be represented as well; the projected coordinates in the
space X are given by q_j a_j'/(a_j'a_j). The latter set of quantities are called single category
coordinates in Gifi's (1990) approach to PCA. Therefore, the approach proposed here
can be regarded as a distance-based alternative for Gifi's biplot display of multivariate
data. The series of points q_j a_j'/(a_j'a_j) are located on the vector that represents the
variable q_j, and these are usually called markers (Gabriel, 1971; Gower and Hand,
1996).
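A short sketch of (16) and of the marker coordinates; since A = KL', the singular value decomposition of Q'X delivers A directly, without forming the eigenvalue decomposition explicitly. Correlations between each q_j and the corresponding column of XA' then quantify the fit per variable.

```python
import numpy as np

def rotation_expansion(Q, X):
    """Rotation-expansion matrix A = KL' from the SVD Q'X = K Lambda L'
    of (16); A'A = I, so D(XA') = D(X) while XA' has m columns."""
    K, _, Lt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return K @ Lt            # row a_j is the vector for variable q_j

def markers(q_j, a_j):
    """Marker points q_j a_j' / (a_j'a_j): projections of the observed
    scores onto the vector representing variable j."""
    return np.outer(q_j, a_j) / (a_j @ a_j)
```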

4. Material and Methods


The data that are used in the application of the distance-based biplot were collected
by Van Strien and Van der Ham, at the Department of Psychiatry at Utrecht University
Hospital (see Van der Ham, Meulman, Van Strien and Van Engeland, 1997).
The data concern 16 variables that measure well-being at four points in time for 55
patients with eating disorders. The patients were diagnosed independently into four
categories (using the DSM-III-R): 1. Anorexia Nervosa (N1 = 25), 2. Anorexia with
Boulimia Nervosa (N2 = 9), 3. Boulimia Nervosa after Anorexia (N3 = 14), and 4.
Atypical Eating Disorder (N4 = 7). The total number of patients is N = 55, and since data
are available at four different time points, the total number of observational units
would be 4 × 55 = 220. However, there are a few subjects with missing observations
at one of the four time points, so the actual number of observational units is 55 + 53
+ 54 + 55 = 217.
In addition to the 16 variables that measure well-being, two additional nominal vari-
ables are used, the first a variable (with four categories) that associates each obser-
vational unit with a point in time, and the second a variable that links each patient
with a diagnosis (in one of the four eating disorder categories; the diagnosis was es-
tablished before time point one, and does not change over time). In summary, the
multidimensional scaling analysis is applied to a 217 × 217 matrix with proximities
between the observational units, with the proximities derived from 16 + 4 + 4 = 24
variables (time and diagnosis each have four categories). The analysis is intended to
result in a subgrouping of the eating disorders in a longitudinal perspective.
The multidimensional scaling task was performed in two dimensions; the solution as
a whole is judged on two different criteria. First, the correlation between the original
variable q_j and the fitted Xa_j is considered; variables were fitted into the subject
space by the biplot method discussed in Section 3. The Pearson correlation coefficient
is a natural goodness-of-fit measure for the biplot representation, since standard
PCA maximizes the average squared correlation. So for the distance-based biplot,
we can compare this particular goodness-of-fit measure with the optimum provided
by standard PCA. Second, we consider classification of subjects on the basis of the
Euclidean distances in two-dimensional space. Each subject is allocated to one of
the four time category points and to one of the four diagnosis points. Specifically,

the centroids of the subject points in a particular class define the associated category
points. In addition to Euclidean distance, the allocation used the posterior probabil-
ities, by employing Bayes rule and the a priori distribution over the categories. The
resulting assignment is compared to the original time points and diagnostic group
classification.
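The allocation rule can be sketched as follows. The paper specifies Euclidean distances to class centroids together with Bayes rule and the prior class distribution; the normal kernel exp(-d²/2) used below to turn a distance into a likelihood is our assumption, since the exact form is not spelled out in the text.

```python
import numpy as np

def classify(X, labels, priors):
    """Allocate each subject to the class with the highest posterior score;
    priors must be ordered like np.unique(labels). The normal kernel
    exp(-d^2/2) for the class likelihood is an assumption of this sketch."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    d2 = ((X[:, None, :] - centroids[None, :, :])**2).sum(axis=2)
    scores = priors[None, :] * np.exp(-0.5 * d2)   # Bayes-rule numerators
    return classes[np.argmax(scores, axis=1)]
```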

5. Results
The primary result of the analysis consists of the coordinates for the observational
units in two-dimensional space; the graph displaying the cloud of points is given in
Figure 1, top panel. The subjects have been labeled with their diagnosis. Visual in-
spection immediately suggests that the second dimension is related to the diagnostic
categories of eating disorder. We see that the anorexia subjects (label 1) form a group,
but patients with atypical eating disorder (label 4) form a subgroup. The latter pa-
tients can be considered as anorectic patients for whom the loss of weight is unknown
or less than 15%. The second dimension separates the anorectic patients (classes 1
and 4) from the boulimic patients (classes 2: anorexia nervosa with boulimia nervosa,
and 3: boulimia nervosa after anorexia). Having connected the diagnosis category
points over the different points in time (by computing the appropriate centroids of
subject points), it is clear that the first dimension displays the development in time.
The variables are displayed in the bottom panel of Figure 1. Instead of displaying
the variables and subjects together, it was chosen to display the variables with the
group points, since this gives a more comprehensive biplot; groups of subjects are
represented by their centroid, which is associated with a particular point in time and
diagnosis.
The first thing to note is that all variables have a positive correlation with the first
dimension; this means there is a general factor that correlates positively with all the
variables. The second dimension separates the variables. We find three bundles of
variables: Bingeing (4), Vomiting (5) and Purging (6) are clearly distinguished in the
vertical direction of the graph; Preoccupation (15), Body Perception (16), Hyper-
activity (7), Sexual Behavior (13), and Fasting (3) correlate most with the general
factor, and Weight (1), Menstruation (2), Family Relations (8), Emancipation (9),
Work/School record (11) and Sexual Attitude (12) form a third bundle of variables.
Friends (10) and Mood (14) do not fit very well in the overall representation. The
fit of the variables, as measured by the Pearson correlation between observed scores
and fitted scores, is given in Table 1. We notice that distance-based PCA gives a
very decent fit for the variables, compared to the maximum that could be obtained
by applying standard PCA.
As described above, we compare the classification of the subjects on the basis of the
results from both analyses with the original ones. With respect to time, the grouping
is given in Table 2. Here, rows indicate the original time points, and columns the fitted
time points. From the percentage correctly assigned subjects in the bottom row, we
conclude that distance-based PCA performs better than standard PCA, although it
is obviously hard to distinguish between consecutive time points, especially between
3 and 4. Next, we inspect assignment to the eating disorders categories. Results are
given in Table 3, with the rows and columns ordered according to the subgroupings.
It is clear that neither approach to PCA can distinguish between groups 1 and 4
on the one hand, and groups 2 and 3 on the other hand. On the whole, distance-
based PCA performs better. This becomes more clear when we combine results from
Table 3 in its bottom row: now standard PCA finds 83% and 91% correctly, while
distance-based PCA obtains 92% and 97%.


Figure 1. Top panel: Subjects represented in two-dimensional space. Labels 1: anorexia, 2: anorexia
with boulimia, 3: boulimia after anorexia, 4: atypical eating disorder. Trajectories represent the
diagnostic categories in time (from left to right). Lower panel: Trajectories, now displayed with the
variables (described in Table 1). Markers represent the three categories for variable 13.

Table 1: Correlations between Observed Variables and Fitted Variables
in Standard and Distance-Based PCA

Variables          Standard PCA   D-Based PCA
Weight                 0.69          0.68
Menstruation           0.72          0.72
Fasting                0.70          0.72
Bingeing               0.79          0.71
Vomiting               0.62          0.55
Purging                0.88          0.79
Hyperactivity          0.53          0.53
Family Relations       0.58          0.58
Emancipation           0.64          0.64
Friends                0.46          0.44
Work/School            0.60          0.63
Sexual Attitude        0.62          0.62
Sexual Behavior        0.70          0.71
Mood                   0.47          0.50
Preoccupation          0.76          0.74
Body Perception        0.63          0.63
Mean                   0.66          0.63

Table 2: Classification of Subjects in Time Categories

            Standard PCA              D-Based PCA
            1    2    3    4          1    2    3    4
1          47    6    2    0         48    5    2    0
2          13   15   10   15         13   16    5   19
3           9   15    5   25          6   16    8   24
4           6   10    4   35          5   12    2   36
% Correct  .85  .28  .09  .64        .87  .30  .15  .65

Table 3: Classification of Subjects in Diagnosis Categories

            Standard PCA              D-Based PCA
            1    4    2    3          1    4    2    3
1          82    2   11    2         92    0    3    2
4          20    0    5    3         22    1    2    3
2           3    0   15   18          1    0   19   16
3           2    3   20   31          2    0   25   29
% Correct  .85  .00  .42  .55        .95  .04  .53  .52
Combined      0.83      0.91            0.92      0.97

6. Monte Carlo Study

By definition, the distance-based biplot is outperformed by ordinary PCA with respect
to Pearson correlations between observed and fitted variables. The differences
in our empirical example are small, however, and distance-based PCA performs better
with respect to the recovery of the original classification of subjects into groups.

Table 4: Correlations between True Scores and Fitted Scores
in Distance-Based PCA compared to Standard PCA

Variables    D-Based PCA   Standard PCA
1                .80           .79
2                .81           .79
3                .79           .78
4                .78           .75
5                .77           .73
6                .73           .69
7                .68           .64
8                .65           .60
9                .66           .59
10               .57           .54
11               .51           .47

dim 1            .93           .92
dim 2            .66           .56
dim 3            .27           .20

To inspect these properties in a more general context, (replicated) artificial data were
generated, with 75 subjects and 13 variables with a perfect representation in three
dimensions. From this set, two variables were selected as partitioning variables, and
five categories were created using an optimal discretization strategy. The remaining
11 variables were subjected to a fair amount of random error (with an average of
53%). The resulting set of variables was analyzed by distance-based PCA, and compared
to standard PCA. In both cases, analyses were done in two dimensions. The
number of replications was set to 100. Four criteria were inspected:
• 1. The fit per variable, as measured by the Pearson correlation between the true
scores (without error) and the bilinear approximation.
• 2. The fit per dimension, as measured by the Pearson correlation between the true
dimensions and the fitted dimensions.
• 3. The correct classification of subjects in two-space as compared to the original
classes. The a priori distribution in the population was taken into account to compute
the posterior probabilities.
• 4. The distances between the subjects in two-space as compared to the distances
in true-space.
The results reported below were obtained by averaging the results after applying
distance-based and standard PCA to the 100 samples of the artificially created structure
described above. The first part of Table 4 gives the correlations between true
scores and fitted scores; the second part reports on the results with respect to the
original three dimensions (that are approximated in two-space; the correlations here
are again obtained by applying the rotation-expansion strategy from Section 3, but
now to the original dimensions: so if the original dimensions are denoted by Z, we
minimize ||Z - XB'||² over all B satisfying B'B = I, and we compute the correlations
between Z and XB'). We notice that distance-based PCA performs better for each
variable and each dimension separately. Results for the two classification variables
were combined; the aggregated results are given in Table 5, where again rows indicate
the original categories and columns the fitted categories. Except for the third
category, where the two methods perform about equally well (76% versus 75% correct),
distance-based PCA performs clearly better; this effect is strongest for the extreme
categories 1 and 5 (88% versus 78% and 87% versus 76% correct). Finally, the overall
statistics are given in Table 6, confirming the superior performance of distance-based
PCA over standard PCA.

Table 5: Classification of Subjects in the Monte Carlo Study

        D-Based PCA                     Standard PCA
      1    2    3    4    5          1    2    3    4    5
1   .88  .11  .01  .00  .00        .78  .20  .02  .00  .00
2   .08  .81  .11  .01  .00        .05  .74  .19  .02  .00
3   .01  .12  .76  .10  .01        .00  .13  .75  .12  .00
4   .00  .01  .10  .80  .09        .00  .02  .18  .76  .05
5   .00  .00  .01  .13  .87        .00  .00  .02  .22  .76

Table 6: Summary of Results of the Monte Carlo Study:
Overall Statistics

                 D-Based PCA   Standard PCA
Distances            0.91          0.88
Variables            0.74          0.70
Dimensions           0.62          0.56
Classification       0.81          0.75

7. Discussion
A distance-based biplot was developed for a least squares MDS analysis of multivari-
ate data. The fitted p-dimensional space is expanded into an m-dimensional space of
rank p that directly represents the fitted variables, while distances between subjects
are preserved. The biplot method was applied to data concerning patients with
various types of eating disorders. The variables obtained a decent fit, also when com-
pared to standard PCA. The method was further studied in a Monte Carlo study.
Here results were compared with respect to the true structure, and the method pro-
posed performed better than standard PCA with respect to the criteria considered.
The analysis allows nominal variables to be included; their categories are represented
by centroids of subjects. The method could include optimal scoring of ordinal vari-
ables too. By representing nominal variables as points in the subject space, and the
other variables as vectors in the same space, we have actually developed a triplot,
with subjects, variables, and classes as its constituents.

References:
Carroll, J.D., and Chang, J.J. (1972): IDIOSCAL (Individual differences in orientation
scaling): A generalization of INDSCAL allowing IDIOsyncratic reference systems as well
as an analytic approximation to INDSCAL, Paper presented at the Psychometric Society
Meeting, Princeton, NJ.
Carroll, J. D. (1972): Individual differences and multidimensional scaling, In: Multidimen-
sional scaling: Theory and applications in the behavioral sciences, R. N. Shepard, A. K.
Romney, and S. B. Nerlove (eds.), Vol. 1, 105-155, Seminar Press, New York and London.
De Leeuw, J., and Heiser, W.J. (1980): Multidimensional scaling with restrictions on the
configuration, In: Multivariate analysis, P.R. Krishnaiah (ed.), Vol. V, 501-522, North-
Holland, Amsterdam.
Eckart, C. and Young, G. (1936): The approximation of one matrix by another of lower
rank, Psychometrika, 1, 211-218.
Gabriel, K.R. (1971): The biplot graphic display of matrices with application to principal
components analysis, Biometrika, 58, 453-467.


Gifi, A. (1990): Nonlinear multivariate analysis, John Wiley and Sons, Chichester.
Gower, J.C. (1966): Some distance properties of latent roots and vector methods used in
multivariate analysis, Biometrika, 53, 325-338.
Gower, J.C. and Hand, D.J. (1996): Biplots, Chapman & Hall, London.
Gower, J.C., and Harding, S.A. (1988): Nonlinear biplots, Biometrika, 75, 445-455.
Groenen, P.J.F. and Heiser, W.J. (1996): The tunneling method for global optimization,
Psychometrika, 61, (in press).
Groenen, P.J.F. and Meulman, J.J. (1995): Joint nonlinear biplots through least squares
MDS, Paper presented at the 9th European Meeting of the Psychometric Society, Leiden.
Guttman, L. (1968): A general nonmetric technique for finding the smallest coordinate
space for a configuration of points, Psychometrika, 33, 469-506.
Heiser, W.J. (1981): Unfolding analysis of proximity data, Dept. of Data Theory, Leiden.
Heiser, W.J. (1987): Joint ordination of species and sites: the unfolding technique, In:
Developments in numerical ecology, P. Legendre and L. Legendre (eds.), 189-221, Springer,
New York.
Kruskal, J.B. (1964): Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis, Psychometrika, 29, 1-28.
Kruskal, J.B. (1978): Factor analysis and principal components analysis: bilinear methods.
In: International encyclopedia of statistics, W.H. Kruskal and J.M. Tanur (eds.), 307-330,
The Free Press, New York.
Meulman, J.J. (1986): A distance approach to nonlinear multivariate analysis, DSWO
Press, Leiden.
Meulman, J.J. (1992): The integration of multidimensional scaling and multivariate analysis
with optimal transformations of the variables, Psychometrika, 57, 539-565.
Meulman, J.J., and Heiser, W.J. (1993): Nonlinear biplots for nonlinear mappings. In:
Information and Classification: Concepts, Methods and Applications, O. Opitz, B. Lausen,
and R. Klar (eds.), 201-213, Springer Verlag, Berlin.
Meulman, J.J., Heiser, W.J., and Carroll, J.D. (1987): PREPMAP-3 User's Guide, Bell
Telephone Laboratories, Murray Hill, NJ.
Ramsay, J.O. (1982): Some statistical approaches to multidimensional scaling data, Journal
of the Royal Statistical Society, Series A, 145, 285-312.
Shepard, R.N. (1962): The analysis of proximities: Multidimensional scaling with an unknown
distance function I and II, Psychometrika, 27, 125-140, 219-246.
Takane, Y., Young, F.W., and De Leeuw, J. (1977): Nonmetric individual differences mul-
tidimensional scaling: An alternating least squares method with optimal scaling features,
Psychometrika, 42, 7-67.
Torgerson, W.S. (1958): Theory and methods of scaling, Wiley, New York.
Tucker, L. R. (1960): Intra-individual and inter-individual multidimensionality. In: Psychological
Scaling: Theory and Applications, H. Gulliksen and S. Messick (eds.), Wiley, New
York.
Van der Ham, Th., Meulman, J.J., Van Strien, D.C., and Van Engeland, H. (1997): Empirically
based subgrouping of eating disorders in adolescents: A longitudinal perspective,
British Journal of Psychiatry, in press.
Young, G. and Householder, A.S. (1938): Discussion of a set of points in terms of their
mutual distances, Psychometrika, 3, 19-22.
Latent-class scaling models for the analysis
of longitudinal choice data
Ulf Bockenholt 1
1 University of Illinois at Urbana-Champaign
Department of Psychology
603 E. Daniel Street
Champaign, IL 61820

Summary: A new class of multidimensional scaling models for the analysis of longitudi-
nal choice data is introduced. This class of models extends the work by Bockenholt and
Bockenholt (1991) who proposed a synthesis of latent-class and multidimensional scaling
(mds) models to take advantage of the attractive features of both classes of models. The
mds part provides graphical representations of the choice data while the latent-class part
yields a parsimonious but still general representation of individual differences. The exten-
sions discussed in this paper involve simultaneously fitting the mds latent-class model to
the data obtained at each time point and modeling stability and change in preferences by
autocorrelations and shifts in latent-class membership over time.

1. Introduction
Over the years, latent class models have proven to be a versatile tool for the analy-
sis of panel, survey, or experimentally derived choice data. For example, numerous
marketing applications demonstrate that these models can provide useful insights
into market structure by simultaneously segmenting and structuring a market at a
particular point in time or over time with the use of panel data. However, to some
extent the interpretation of classification results obtained by these models is compli-
cated by the fact that no information is provided about the perceptual space of the
choice options or the decision process of a respondent. To overcome this limitation,
Bockenholt and Bockenholt (1991) presented a synthesis of latent-class and multi-
dimensional scaling (mds) models that takes advantage of the attractive features of
both classes of models. The mds part determines the perceptual space of the choice
options while the latent class part yields a parsimonious but still general representa-
tion of individual differences.
This paper introduces an extension of this approach for the analysis of stability and
change in choice data collected at two time points. The new class of models represents
persons and items in a joint space. Individual differences are captured by different
vector termini or ideal point positions in a multidimensional space. Person-specific
changes are modelled by allowing for switching among the latent classes over time.
Changes in the perception of items are modelled by drifts in the item parameters.
As a result, the proposed approach provides a much stronger test of the stability and
validity of a multidimensional representation than possible on the basis of a data set
collected at a single point in time. Moreover, a rich set of hypotheses can be tested
to determine the locus of change in the time-dependent scaling results.
The remainder of this paper is structured as follows. First, parametric representa-
tions of ideal and vector models are reviewed. Next, a general modeling framework
is presented for the analysis of choice data collected at two time points. Various
special cases of the approach are derived. The paper concludes with an analysis of a
sociometric choice data set.


2. Latent class scaling models


A natural and economic way of investigating choice behavior is to ask respondents to
select the preferred items from a set of items. Coombs (1964) termed these data pick
any when the set of items is unconstrained, and pick any/n when the set of items is
fixed. In the latter case, a decision outcome may be written as (i, j, k̄, ..., q), indicating
that options i, j, and q are chosen and option k is not chosen. Clearly, persons may
differ strongly in their preferences for the various items (Takane, 1983). Latent-class
models are well-suited to account for these individual preference effects by grouping
respondents with high intragroup similarity and large intergroup differences (Lazars-
feld and Henry, 1968). Thus, preference variability is described by assigning judges to
latent classes such that members of a latent class share the same choice probabilities.
The unobserved classes are determined by invoking the principle of local indepen-
dence which states that the latent-class membership variable accounts completely for
any dependencies among the observed choices.
Let π_{i|a} denote the conditional probability that item i is selected by members of class
a. Under the local independence representation of the latent-class model the joint
probability of observing the choice outcome (i, j, k̄, ..., q) is

Pr(i, j, k̄, ..., q) = Σ_a π_a π_{i|a} π_{j|a} (1 - π_{k|a}) ··· π_{q|a},   (1)

where π_a represents the relative size or proportion of class a and Σ_a π_a = 1. Al-
though the unconstrained latent-class model is well-suited to describe individual pref-
erence differences, it does not furnish information about the perception of the items
and the response process of the respondents. However, this information can be ex-
tracted from the data by constraining the class-specific probabilities, π_{i|a}, to be a
function of a scaling model. In particular, the ideal point and vector models may
prove useful in supplying succinct graphical representations of the choice behavior of
the respondents.
Bockenholt and Bockenholt (1991) showed that latent-class models and scaling models
can be combined by setting π_{i|a} = Φ(z_{i|a}), where Φ is the normal cumulative distribution
function. The deviate z_{i|a} is expressed as a function of an unfolding model or a vector
model. For example, the ideal point model specifies that an item is selected when it
is close to a person's ideal point position. As a result, for members of class a we may
write

z_{i|a} = -τ_i - Σ_{h=1}^{r} (L_{ah} - β_{ih})²,   (2)

where τ_i is an item-specific threshold parameter, and L_{ah} is the ideal point position
of class a on dimension h. The location of item i on dimension h is denoted by β_{ih}.
Thus, according to (2) the items' locations are perceived homogeneously in an r-dimensional
space; however, the positions of the ideal points differ among the classes.
The smaller the distance is between an item and a class' ideal point position, the
more members of this class prefer the item.
Occasionally, a special case of the ideal point model, the vector model, may prove
sufficient for describing individual differences in choice data. According to this model
each class is characterized by a preference vector ν_a = (ν_{a1}, ν_{a2}, ..., ν_{ar})'. An item's
projection onto the preference vector of a class determines its probability of being
chosen,

z_{i|a} = τ_i + Σ_{h=1}^{r} ν_{ah} β_{ih},   (3)

where ν_{ah} is the h-th element of the preference vector ν_a.

By combining the vector or ideal point model with the latent-class representation we
get

Pr(i, j, k̄, ..., q) = Σ_a π_a Φ(z_{i|a}) Φ(z_{j|a}) Φ(-z_{k|a}) ··· Φ(z_{q|a}).   (4)

Clearly, when z_{i|a} is unconstrained the latent class representation in (1) is obtained.
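To make the construction concrete, the sketch below evaluates the mixture probability (4) of a pick any/n pattern with unidimensional ideal-point deviates (2). All parameter values are hypothetical, and `iota` is our name for the ideal point parameter written L_{ah} above.

```python
import numpy as np
from scipy.stats import norm

def pattern_probability(pattern, pi, tau, iota, beta):
    """Mixture probability (4) of a pick-any/n pattern (1 = chosen),
    with unidimensional ideal-point deviates (2):
    z_{i|a} = -tau_i - (iota_a - beta_i)^2."""
    z = -tau[None, :] - (iota[:, None] - beta[None, :])**2  # class x item
    p = norm.cdf(z)                                          # pi_{i|a}
    lik = np.where(pattern[None, :] == 1, p, 1 - p).prod(axis=1)
    return float(pi @ lik)

# Two classes, three items (all values hypothetical).
pi = np.array([0.6, 0.4])
tau = np.array([0.2, 0.0, -0.1])
iota = np.array([-1.0, 1.0])          # class ideal points
beta = np.array([-0.8, 0.1, 0.9])     # item locations
print(pattern_probability(np.array([1, 0, 1]), pi, tau, iota, beta))
```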

3. Modeling stability and change


For the investigation of stability and change in choice data consider the situation of N
individuals measured at two time points. If preferences or attitudes are stable, both
item and person parameters are time-homogeneous. In contrast, systematic response
differences at both time points may indicate changes in the perception of (some of)
the items and/or person-specific variability.
Within the framework of the latent-class scaling models, hypotheses about person-
specific change can be tested by allowing for time-dependent switches among latent
classes. Shifts among classes can be interpreted as shifts among preference states
because each latent-class position corresponds to a particular preference state. As a
result, the approach can account for both individual preference differences and pref-
erence changes over time.
Hypotheses about item-specific changes can be tested by letting the item location
or threshold parameters vary over time. When the perceptual space of the items is
not time-homogeneous, it is difficult to assess any changes in the person-specific pa-
rameters. Thus, although both hypotheses about the locus of change in choice data
are of interest, the interpretation of the data is greatly simplified when item-specific
changes are small and can be ignored.
In addition to the analysis of preference-related effects, it is desirable to take into
account "test-retest" effects (Hagenaars, 1990). In particular, when the time interval
separating the two choice occasions is small, it is likely that choices among identical
options by a person are not independent. Instead, outcomes of previous choices may
strongly influence current choices. In this case the local independence representation
is not appropriate and a more general approach that allows for correlated choices is
needed.
Based on these considerations the latent-class mds model for two time points is written
as

Pr(i^{(1)}, j^{(1)}, k̄^{(1)}, ..., q^{(1)}; i^{(2)}, j^{(2)}, k̄^{(2)}, ..., q^{(2)}) =
Σ_a Σ_b π_{ab} Φ₂(z^{(1)}_{i|a}, z^{(2)}_{i|b}; ρ_i) ··· Φ₂(z^{(1)}_{q|a}, z^{(2)}_{q|b}; ρ_q),   (5)

where Φ₂(z^{(1)}_{i|a}, z^{(2)}_{i|b}; ρ_i) is the bivariate normal distribution function with correlation
coefficient ρ_i and upper integration limits z^{(1)}_{i|a} and z^{(2)}_{i|b}. These limits are either unconstrained,
or a function of the ideal point or vector models. For example, in the
case of the unidimensional ideal point model

z^{(t)}_{i|a} = -τ_i^{(t)} - (L_a - β_i^{(t)})².   (6)

The class size parameter π_{ab} refers to the probability of belonging to classes a and
b at time points t and t + 1, respectively. Switching among class locations occurs
whenever π_{ab} > 0, provided a ≠ b.


Autodependencies of item i between two time points t and t + 1 are represented by ρ_i.
Note that this "test-retest" effect is not class-dependent. In contrast, dependencies
between different items are accounted for by the latent class representation. Thus,
by separately modeling between- and within-item dependencies we can distinguish
individual difference effects and local associations in the choices.
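The following sketch shows how a choice pattern probability under (5) can be evaluated with scipy's bivariate normal distribution function. The deviates z^{(1)}_{i|a} and z^{(2)}_{i|b} are assumed precomputed (from (2), (3) or (6)), and the rectangle-probability bookkeeping for chosen versus non-chosen items is our reading of the model.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def item_prob(c1, c2, z1, z2, rho):
    """Joint probability of choice indicators (c1, c2) for one item at two
    time points; Phi_2(z1, z2; rho) covers choosing at both occasions."""
    both = multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([z1, z2])
    if c1 and c2:
        return both
    if c1 and not c2:
        return norm.cdf(z1) - both
    if c2 and not c1:
        return norm.cdf(z2) - both
    return 1.0 - norm.cdf(z1) - norm.cdf(z2) + both

def pattern_prob_two_waves(c, Z1, Z2, pi_ab, rho):
    """Model (5): sum over joint memberships (a, b) of the product of
    bivariate item probabilities; c[t, i] is 1 if item i chosen at t."""
    total = 0.0
    for a in range(pi_ab.shape[0]):
        for b in range(pi_ab.shape[1]):
            prod = 1.0
            for i in range(c.shape[1]):
                prod *= item_prob(c[0, i], c[1, i], Z1[a, i], Z2[b, i], rho[i])
            total += pi_ab[a, b] * prod
    return total
```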
Model (5) includes a variety of published models as special cases. When the correlation
between choices among identical options is zero, and z^{(t)}_{i|a} is unconstrained, we
obtain a class of latent change models originally proposed by Wiggins (1973). Special
cases of this class include the latent Markov model (Langeheine and van de Pol,
1990; Poulsen, 1990) and the latent symmetry model (Bockenholt and Langeheine,
1996; Bockenholt, 1987). However, for testing the former model at least three time
points are required.
Hypotheses of interest may be roughly classified as follows: (a) no change, (b) person-specific
change, (c) item-specific change, and (d) test-retest effects. To test the no-change
hypothesis, we set z^{(1)}_{i|a} = z^{(2)}_{i|a} (for all items) and π_{ab} = 0 when a ≠ b. Hypotheses
regarding person-specific change can be tested by specifying a latent change
mechanism. For example, under the symmetric change model we expect π_{ab} = π_{ba}.
According to this constraint, an equal number of individuals switches from class a
to class b as from class b to a. Changes in the perception of item i can be tested
by constraining τ_i^{(1)} = τ_i^{(2)} and β_i^{(1)} = β_i^{(2)} for the ideal point and vector models.
Similarly, test-retest effects for item i are investigated by setting ρ_i = 0.

4. Estimation and Model Testing


Parameter estimates of the time-dependent latent-class mds model can be obtained
by using maximum likelihood methods. Under the assumption of random sampling
of N subjects for the pick any/n task the log-likelihood function is written as

ln L = c + Σ_u f_u ln{Pr(y_u)},   (7)

where c is a constant, and f_u denotes the observed number of persons with selection
vector y_u = (i^{(1)}, ..., q^{(1)}; i^{(2)}, ..., q^{(2)}). The Expectation-Maximization (EM) algo-
rithm is used for parameter estimation (Dempster, et al., 1977). The implementation
of this algorithm is straightforward and not further discussed here because it is well
documented in the literature. For example, Hathaway (1985) and Lwin and Martin
(1989) provide a detailed discussion of the EM algorithm for estimating normal mix-
ture models, and Bockenholt (1992) reviews methods for the computation of normal
probabilities.
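For orientation only, here is a sketch of the E-step for (7): posterior probabilities of joint class membership (a, b) for each observed selection vector, followed by the M-step update of the class sizes π_{ab}. The helper `pattern_prob`, returning the product of item probabilities for given classes (e.g., built as sketched after (5) above), is hypothetical.

```python
import numpy as np

def e_step(patterns, counts, pi_ab, pattern_prob):
    """Posterior probability of joint membership (a, b) per observed
    selection vector, and the updated class sizes pi_ab (M-step)."""
    A, B = pi_ab.shape
    post = np.zeros((len(patterns), A, B))
    for u, c in enumerate(patterns):
        joint = np.array([[pi_ab[a, b] * pattern_prob(c, a, b)
                           for b in range(B)] for a in range(A)])
        post[u] = joint / joint.sum()
    weighted = counts[:, None, None] * post   # expected frequencies
    return post, weighted.sum(axis=0) / counts.sum()
```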
Large sample tests of fit are available based on the likelihood-ratio (LR) χ² statistic
(G²) and/or Pearson's goodness-of-fit test statistic (P²). Asymptotically, if a latent-class
mds model provides an adequate description of the data, then both statistics
follow a χ²-distribution with (4^n - m - 1) degrees of freedom, where m refers to the
number of parameters to be estimated. The LR test is most useful when the number
of items is small. Otherwise, only a small subset of the possible choice patterns may
be observed. In this case, it is doubtful that the test statistics will follow approx-
imately a χ²-distribution. However, useful information about a model fit may be
obtained by inspecting standardized differences between the observed and expected
model probabilities for certain partitions of the data (e.g., subsets of items). Al-
though these residuals are not independent, a careful inspection of their direction

and size is useful in identifying sources of systematic model misfits.


Nested latent-class mds models can be compared by computing the difference between
their LR-statistics. This difference is asymptotically distributed as a χ²-statistic with
the degrees of freedom equal to the difference between the number of parameters in
the unrestricted and restricted models. For example, the hypothesis of test-retest
effects can be tested with n degrees of freedom by comparing the fits of Model (5)
with and without correlation coefficients.

5. Sociometric Choices
This section presents the analysis of a small study with two items observed at two
points in time. Although the number of items is too small for testing the mds part
of the latent-class models, the data set is useful for demonstrating the importance
of taking into account latent change and re-test effects in an analysis of longitudinal
data.
In a sociometric choice investigation reported by Langeheine (1994) students were
asked with respect to every classmate whether they would choose (C) or reject (R)
this person on the basis of two criteria measuring interpersonal attractiveness. Crite-
rion i is "share a table in case of a new seating arrangement" and criterion j is "share
a tent if a class would go for a camping trip". Table 1 contains the results of this
study for two measurement occasions separated by a one-week interval. For example,
the majority of responses (673) rejects class mates on the basis of the two criteria
at both time points. For the most part the following results agree with the ones
obtained by Langeheine (1994). The main difference between his and the analyses
reported here is the application of Model (5).
Because the LR-test of the two-class no-change model,

Pr(i^{(1)}, j^{(1)}, i^{(2)}, j^{(2)}) = Σ_{a=1}^{2} π_a Φ(z^{(1)}_{i|a}) Φ(z^{(1)}_{j|a}) Φ(z^{(2)}_{i|a}) Φ(z^{(2)}_{j|a}),   (8)

yields a G² = 239.1 with df = 10, it is justified to test whether the poor fit is a result
of variability in the evaluation of the items over time, person-specific changes,
or both. When allowing for switching among classes (i.e., π_{ab} > 0 when a ≠ b), G²
drops to 49.9 (df = 8), indicating a substantial latent change effect. In contrast,
when, additionally, relaxing the assumption of time-homogeneous item probabilities
(i.e., Φ(z^{(1)}_{i|a}) ≠ Φ(z^{(2)}_{i|a})), G² reduces by little to 49.7 (df = 4). Clearly, item-specific
effects can be ignored after allowing for person-specific change. Because of the short
interval between the time points, it seems likely that the poor fit of the model is
caused by strong autodependencies of the choices. Support for this hypothesis is
obtained by inspecting the fitted frequencies of the latent change model in (the ρ = 0
column of) Table 1. In particular, the two choice patterns with identical choices are
underpredicted. This result suggests that the fit of the model can be improved considerably
by Model (5) because it allows for test-retest effects. The LR-test for this
model with equal correlations for both items is G² = 5.99 (df = 7), and the estimated
correlation coefficient is ρ = ρ_i = ρ_j = .49. The predicted frequencies are given in
(the ρ = .49 column of) Table 1 and the estimated choice probabilities and class size
parameters are listed in Table 2. We note that the class consisting of reject responses
is twice as large as the class consisting of pick responses. However, the transition
probabilities of both classes are symmetric, with π_{1|2} ≈ π_{2|1}.

Table 1: Observed and expected frequencies of sociometric choices

i(1) j(1) i(2) j(2)   Obs. freq.   Exp. freq. (ρ = 0)   Exp. freq. (ρ = .49)
 R    R    R    R        673            668.7                 672.4
 R    R    R    C         32             35.5                  31.8
 R    R    C    R         27             29.8                  26.5
 R    R    C    C         62             61.5                  62.0
 R    C    R    R         25             27.9                  27.2
 R    C    R    C         25              8.9                  23.4
 R    C    C    R          6              4.3                   2.3
 R    C    C    C         29             44.0                  32.0
 C    R    R    R         24             27.9                  24.6
 C    R    R    C          3              4.4                   2.3
 C    R    C    R         10              2.5                  11.0
 C    R    C    C         17             19.3                  16.4
 C    C    R    R         37             35.9                  36.0
 C    C    R    C         29             43.2                  31.0
 C    C    C    R         12             18.3                  15.2
 C    C    C    C        249            230.9                 245.8

Table 2: Parameter estimates of Model (5)

Class   Size   Items          Transition Prob.
1        .67    .04   .04       .90   .10
2        .33    .85   .93       .11   .89
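As a check on Table 1, the likelihood-ratio statistic can be recomputed from the printed frequencies. Because the expected values are rounded to one decimal, the result lands near, not exactly at, the reported G² = 5.99.

```python
import numpy as np

# Observed and expected (rho = .49) frequencies from Table 1;
# G^2 = 2 * sum f * ln(f / e).
obs = np.array([673, 32, 27, 62, 25, 25, 6, 29,
                24, 3, 10, 17, 37, 29, 12, 249], dtype=float)
exp = np.array([672.4, 31.8, 26.5, 62.0, 27.2, 23.4, 2.3, 32.0,
                24.6, 2.3, 11.0, 16.4, 36.0, 31.0, 15.2, 245.8])
G2 = 2 * np.sum(obs * np.log(obs / exp))
print(round(G2, 2))   # close to the reported 5.99
```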

6. Discussion
This paper presented latent-class scaling models for graphical representations of pref-
erence stability and change in choice data. The models provide a parsimonious frame-
work for the analysis of individual taste differences. A rich set of hypotheses can be
tested that examine the stability of the perceptual space of the items and shifts
among different preference states for two time points. In addition, as illustrated in
the application section test-retest effects can also be taken into account. Because
the approach can be applied with minor modifications to different data types (i.e.,
rankings, first choices, preference ratings) it can be viewed as a general framework
for obtaining graphical representations of preference changes.

Acknowledgements

This work was partially supported by the National Science Foundation (Grant No.
SBR 94-09531).

References:
Bockenholt, U. (1992). Thurstonian models for partial ranking data. British Journal of
Mathematical and Statistical Psychology, 45, 31-49.
Bockenholt, U. (1997). Modeling time-dependent preferences: Drifts in ideal points. In:
Visualization of Categorical Data, Greenacre, M., and Blasius, J. (eds.). Lawrence Erlbaum
Press.
Bockenholt, U., and Bockenholt, I. (1991). Constrained latent class analysis: Simultaneous
classification and scaling of discrete choice data. Psychometrika, 56, 699-716.
Bockenholt, U., and Langeheine, R. (1996). Latent change in recurrent choice data. Psychometrika,
61, 285-302.
Carroll, J. D., and Pruzansky, S. (1980). Discrete and hybrid scaling models. In E. D.
Lantermann and H. Feger (eds.), Similarity and Choice. Vienna: Huber Verlag.
Coombs, C. H. (1964). A theory of data. New York: Wiley.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1-38.
Hagenaars, J. (1990). Categorical longitudinal data. Newbury Park: Sage.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation
for normal mixture distributions. Annals of Statistics, 13, 795-800.
Langeheine, R. (1994). Latent variables Markov models. In: A. von Eye, and C. C. Clogg
(eds.), Latent variables analysis. Thousand Oaks: Sage.
Langeheine, R., and van de Pol, F. (1990). A unifying framework for Markov modeling in
discrete space and discrete time. Sociological Methods and Research, 18, 416-441.
Lazarsfeld, P. F., and Henry, N. W. (1968). Latent structure analysis. New York: Houghton-
Mifflin.
Loewenstein, G. F., and Elster, J. (1992). Choice over Time. New York: Russell Sage
Foundation.
Lwin, T. and Martin, P. J. (1989). Probits of mixtures. Biometrics, 45, 721-732.
Poulsen, C. S. (1990). Mixed Markov and latent Markov modeling applied to brand choice
data. International Journal of Research in Marketing, 7, 5-19.
Takane, Y. (1983). Choice model analysis of the "pick any/n" type of binary data. Handout
at the European Psychometric and Classification Meetings, Jouy-en-Josas, France.
Wiggins, L. M. (1973). Panel analysis: Latent probability models for attitude and behaviour
processes. Amsterdam: Elsevier.
Part VI

Multivariate and Multidimensional


Data Analysis

· Multidimensional Data Analysis

· Multiway Data Analysis

· Non-Linear Modeling and Visual Treatment


Nonlinear Multivariate Analysis
by Neural Network Models
Yoshio Takane
Department of Psychology, McGill University
1205 Dr. Penfield Ave., Montreal, Quebec H3A 1B1, CANADA

Summary: Feedforward neural network (NN) models approximate nonlinear functions


that connect inputs to outputs by repeated applications of simple nonlinear transforma-
tions. By combining this feature of NN models with traditional multivariate analysis (MVA)
techniques, nonlinear versions of the latter can readily be constructed. In this paper, we ex-
amine various properties of nonlinear MVA by NN models in two specific contexts: Cascade
Correlation (CC) networks for nonlinear discriminant analysis simulating the learning of
personal pronouns, and a five-layer auto-associative network for nonlinear principal compo-
nent analysis (PCA) finding two defining features of cylinders. We analyze the mechanism
of function approximations, focussing, in particular, on how interaction effects among input
variables are captured by superpositions of sigmoidal transformations.

1. Introduction
Feed-forward neural network (NN) models and statistical models have much in common.
The former can be viewed as approximating nonlinear functions that connect
inputs to outputs. Many statistical techniques can be viewed as approximating functions
(often linear) that connect predictor variables to criterion variables. It is thus
beneficial to exploit various developments in NN models in nonlinear extensions of
linear statistical techniques. There is one aspect of nonlinear transformations by NN
models that is particularly attractive in developing nonlinear multivariate analysis
(MVA). It allows joint multivariate transformations of input variables, so that interactions
among them can be captured automatically in as much as they are needed
for prediction. In this paper we examine various properties of nonlinear MVA by NN
models in two specific contexts: Cascade Correlation (CC) networks for nonlinear
discriminant analysis simulating the learning of personal pronouns, and a five-layer
auto-associative network for nonlinear principal component analysis (PCA) recovering
two defining attributes of cylinders. In particular, we analyze the mechanism of
function approximations in these networks.

2. Cascade Correlation (CC) Network

NN models consist of a set of units, each performing a simple operation. Units receive
contributions from other units, compute activations by summing the incoming
contributions and applying prescribed (nonlinear) transformations (called transfer
functions) to the summed contributions, and send out their contributions according
to the activations and strengths of connections to other units. A network of such
units can produce rather complicated and interesting effects. It can produce almost
any kind of nonlinear effects and interactions among input variables by looking at
examples that show specific input-output relationships.
The CC learning network is capable of dynamically growing nets (Fahlman & Lebiere,
1990). It starts as a net without hidden units, and it adds hidden units to improve
its performance until a satisfactory degree of performance is reached. Thus, no a
priori net topology has to be specified. Hidden units are recruited one at a time in
such a way that all pre-existing units are connected to the new one. Input units are
directly connected to output units (cross connections) as well as to all hidden units.
The cross connections capture linear effects of input variables. Hidden units, on
the other hand, produce nonlinear and interaction effects among the input variables
that are necessary to connect inputs to outputs in some tasks. When a new hidden
unit is recruited, the connection weights associated with input connections are determined
so as to maximize the correlation between residuals from network predictions
at the particular stage and projected outputs from the recruited hidden unit, and are
fixed throughout the rest of the learning process. This avoids the necessity of backpropagating
error across different levels of the network, and leads to faster and more
stable convergence. The weights associated with output connections are, however,
re-estimated after each new hidden unit is recruited.
The CC algorithm constructs a net and estimates connection weights based on a
sample of training patterns. For each input pattern, a unit in a trained net sends
contributions to units it is connected to. A contribution is defined as the product of
the activation for the pattern at the sending unit and the weight associated with the
connection between the sending unit and the receiving unit. The receiving unit forms
its activation by summing up the contributions from other units and applying the
sigmoid transformation to the summed contribution. An activation is computed at
each unit and for each input pattern in the training sample. Let a1 denote an input
pattern (a vector of activations at input units and bias, which acts like a constant
term in regression analysis), and let w1 represent the vector of weights associated
with the connections from the input and bias units to hidden unit 1 (h1). Then,
the activation for the input pattern at h1 is obtained by b1 = f(a1'w1) - .5, where
f is a sigmoid function, i.e., f(t) = 1/{1 + exp(-t)}. Now h1 as well as the input
and bias units send contributions to h2. The activation at h2 is then obtained by
b2 = f(a2'w2) - .5, where a2 is the vector of activations at the input units, bias and h1.
A similar process is repeated until an activation at the output unit is obtained, which
is the network prediction for the output. In the training phase, connection weights are
determined so that the network prediction closely approximates the output
corresponding to the input pattern.
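To make the forward pass just described concrete, here is a minimal NumPy sketch. It is not the CC training algorithm itself, and the weight vectors are arbitrary placeholders (in the CC algorithm they are learned from data); only the flow of contributions and the shifted sigmoid follow the description above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cc_forward(x1, x2, w1, w2, w_out):
    a1 = np.array([x1, x2, 1.0])        # two inputs plus a bias unit
    b1 = sigmoid(a1 @ w1) - 0.5         # activation at hidden unit h1
    a2 = np.append(a1, b1)              # h2 receives inputs, bias and h1
    b2 = sigmoid(a2 @ w2) - 0.5         # activation at hidden unit h2
    a3 = np.append(a2, b2)              # the output unit receives everything
    return sigmoid(a3 @ w_out) - 0.5    # network prediction for y

rng = np.random.default_rng(0)          # arbitrary, untrained weights
print(cc_forward(3.0, 3.0, rng.normal(size=3),
                 rng.normal(size=4), rng.normal(size=5)))
```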

3. Two-Number Identification
The CC network algorithm was first applied to the two-number identification problem,
in which there are two input variables, x1 and x2 (excluding the bias). Pairs of
x1 and x2 are classified into group 1 (indicated by output variable y equal to .5) when
the two numbers are identical, and are otherwise classified into group 2 (indicated
by y = -.5). This is a simple two-group discrimination problem, but the function to
be approximated is highly nonlinear, as can be seen in Figure 1(a). The problem is
interesting because of its implication for real psychological problems; identifying two
objects underlies many psychological phenomena, as exemplified by an example given
in the next section.
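The training set described in the next paragraph is easy to reconstruct; the sketch below (a hypothetical reconstruction, not the authors' code) builds the grid of integer pairs and the group labels:

```python
import numpy as np

# The 10 x 10 grid of integer pairs, labelled .5 when the two numbers
# are identical and -.5 otherwise.
g1, g2 = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
X = np.column_stack([g1.ravel(), g2.ravel()])   # 100 training patterns
y = np.where(X[:, 0] == X[:, 1], 0.5, -0.5)     # target outputs
print(X.shape, (y == 0.5).sum())                # (100, 2) and 10 "same" pairs
```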
One hundred training patterns, generated by factorially combining x1 and x2 varied
systematically from 1 to 10 in steps of 1, were used in the training. The CC
network algorithm constructed the network depicted in Figure 1(b). This net has three
input units (including the bias), one output unit, and two recruited hidden units.
Network predictions are computed in the manner described above. Figure 1(c) displays
the function approximated by the CC net (the set of network predictions as a function
of x1 and x2). The approximation looks quite good, although the ridge at x1 = x2 in
the approximated function is not as "sharp" as in the original target function. This
is due to the "crudeness" of the training sample. The minimum difference between
two distinct numbers in the training sample is 1, so that the net was not required

[Figure 1 appears here. Panels: (a) The Target Function, (b) CC Network, (c) A Network Approx., (d) bias -> h1, (e) x1 -> h1, (f) x2 -> h1, (g) sum of (d)-(f), (h) act. at h1, (i) act. at h2.]

Figure 1: The mechanism of a function approximation for the two-number identification
problem by the CC network. (a) depicts the target function approximated by the
CC network (b), with the approximated function displayed in (c). (d) through (f)
are contributions from the three input units to h1, which are summed to obtain (g), which
is sigmoid-transformed to obtain the activation function (h) at h1. The activation
function at h2 (i) is similarly derived. These activation functions are used to define
contributions of the units to other units.

to discriminate between two numbers whose difference is less than 1. The ridge in
the approximated function can be made sharper if pairs of numbers with smaller differences
are included in the training sample. Note that interpolations are done quite
nicely. That is, although numbers like 5.5 were not included in the training sample,
identifications involving such numbers are handled as expected. Extrapolation,
on the other hand, seems a bit difficult, as indicated by a slight increase in function
values toward the right-hand side corner. Note that the target function involves a
form of interaction between x1 and x2, where the word "interaction" is construed
broadly; the meaning of a specific value on one variable, say x1, depends on the value
on the other variable, x2.
It is interesting to see how the approximated function is built up and what roles the
two hidden units play. Figure 1(d) through (f) present contributions of the three
input units to h1. As described above, contributions are defined as products of activations
at the input units and the weights associated with the connections leading to
h1. The contributions are summed up (Figure 1(g)), and further sigmoid-transformed
to obtain the activation function at h1 (Figure 1(h)). It seems that h1 is identifying
whether x1 ≥ x2. The activation function at h2 (Figure 1(i)) is similarly derived. Contributions
now come from four units (the three input units plus h1). h2 seems to be identifying
whether x2 ≥ x1. The output unit (y) receives contributions from all other units. However,
h1 and h2 seem to play particularly important roles. y takes the value .5 when
and only when the input pattern satisfies both x1 ≥ x2 and x2 ≥ x1, and -.5 otherwise.
Interestingly, this is essentially how we prove x1 = x2 in mathematics.

4. Pronoun Learning
We were interested in the two-number identification problem because of its implication
for a real psychological problem, that is, the learning of first and second person
pronouns. When the mother talks to her child, me refers to the mother and you to
the child. However, when the child talks to the mother, me refers to the child, and
you to the mother. The child has to learn the shifting reference of these pronouns.
There are three relevant input variables in this problem (excluding bias) and one
output variable indicating me (y = .5) or you (y = -.5). The three input variables
are speaker (sp), addressee (ad), and referent (rf). The rule (or the function) to be
learned is: "Use me when the speaker and the referent agree (i.e., y = .5 when sp =
rf)", and "use you when the addressee and the referent agree (i.e., y = -.5 when ad
= rf)". The network should be able to judge which two of the three input variables
agree in their values. The two-number identification problem is thus a prerequisite
to the pronoun learning problem.
How children learn the correct use of these pronouns has been studied by Oshima-Takane
(1988, 1992) and her collaborators (Oshima-Takane et al., 1996). Simulation
studies by CC networks have also been reported in Oshima-Takane et al. (1995), and
in Takane et al. (1995). All previous simulation studies, however, presupposed the
existence of only two pronouns, me and you. This severely limits the scope of these
studies. In particular, the operating rule may not coincide with the one assumed
above. That is, seemingly correct behavior can follow from rules other than the one
described above. For example, a rule such as me if sp = rf and you otherwise, or
you if ad = rf and me otherwise, works equally well so far as only me and you are
considered. That is, ad = rf is equivalent to sp ≠ rf, and sp = rf is equivalent to ad
≠ rf when only me and you are to be distinguished.
We, therefore, first investigate what rule is in fact learned under the me-you-only
condition. Forty training patterns were created by systematically varying the three

input variables from -2 to 2 in steps of 1, and by discarding all but me and you
patterns. Forty patterns were retained. (Remember that sp and ad cannot agree,
and such patterns were also discarded.) The CC network algorithm recruited two
hidden units to perform the task. The approximated function is depicted in Figure
2 in terms of ad on the y-axis and rf on the x-axis for nine different values of sp (-2,
-1.5, -1, -.5, 0, .5, 1, 1.5, and 2). It looks like the output variable, y, takes the value
of -.5 as it is supposed to (see the diagonal "ditch" observed in each graph), but
it also takes the value of .5 in all other cases, including sp = rf and sp ≠ rf ≠ ad.
This is correct for sp = rf, but not for sp ≠ rf ≠ ad. Remember that no training
patterns were given for the latter, and so it is quite natural that the net responded
rather arbitrarily to the latter patterns. This implies, however, that pronouns other
than me and you are necessary to learn the correct use of these two pronouns. That
is, to learn to discriminate between sp = rf and sp ≠ rf ≠ ad, patterns involving other
pronouns such as he and she have to be included in the training sample.
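The me-you-only training set just described is easy to reconstruct; the following sketch (hypothetical, not the authors' code) enumerates it and recovers the stated count of 40 patterns:

```python
import itertools

# sp, ad, rf each range over -2..2 in steps of 1; sp == ad is impossible,
# and patterns that are neither "me" (sp == rf) nor "you" (ad == rf) are dropped.
patterns = []
for sp, ad, rf in itertools.product(range(-2, 3), repeat=3):
    if sp == ad:
        continue
    if sp == rf:
        patterns.append((sp, ad, rf, 0.5))     # "me"
    elif ad == rf:
        patterns.append((sp, ad, rf, -0.5))    # "you"
print(len(patterns))                           # 40, as stated in the text
```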
To verify the above assertion, another simulation study was conducted, this time
with pronouns other than me and you also included in the training sample. This
condition, called the me-you-others condition, had 100 training patterns: the 40
me-you patterns plus 60 others patterns. The net was trained to take the value of 0
(y = 0) when sp ≠ rf ≠ ad, in addition to y = .5 when sp = rf and y = -.5 when ad
= rf. Figure 3 shows the approximated function under this condition, which looks as
it is supposed to. The task is appreciably more complicated than before, and the CC
network algorithm recruited five hidden units to perform the task.

5. Five-Layer Auto-Associative Network

The next example pertains to a five-layer auto-associative network. A simplified version
of this network is depicted in Figure 4(a). There are five layers of units including
the input layer at the bottom and the output layer at the top. Units are interconnected
between adjacent layers, but not within the same layer or between nonadjacent
layers. It is well known (e.g., Baldi & Hornik, 1989) that a three-layer neural network
with linear transfer functions at both the middle (hidden) and output layers has a rank
reducing capability when the number of units in the hidden layer is smaller than
both the number of input units and that of output units. This is a network version of
(linear) reduced-rank regression (Anderson, 1951), also known as PCA with instrumental
variables (Rao, 1964) and redundancy analysis (Van den Wollenberg, 1977).
The usual (linear) PCA results when inputs and outputs coincide, as in Figure 4(a).
The name "auto-associative" derives from the fact that this net attempts to reproduce
X from input X with a reduced number of components (the number of units in
the middle layer). The network version of PCA is not interesting in itself since there
are more efficient and accurate algorithms to do linear PCA. It becomes interesting
when the model is extended to nonlinear PCA by including two additional hidden
layers with nonlinear transfer functions (most often, with sigmoidal transformations),
one between the input layer and the middle layer, and the other between the middle
layer and the output layer, resulting in a five-layer network. Layer 2 (hidden layer 1)
and layer 4 (hidden layer 3) perform nonlinear input encoding and nonlinear output
decoding, respectively. Unlike the CC network, the network topology (the number
of layers, the number of units in each layer and how the units in different layers are
connected) is a priori specified and fixed throughout the learning process, in which
only connection weights are adjusted using the backpropagation (BP) algorithm.
The five-layer auto-associative network was proposed (apparently independently) by
several authors at about the same time (e.g., Irie & Kawato, 1989; Oja, 1991), and

[Figure 2 appears here: nine panels, one for each of sp = -2, -1.5, -1, -.5, 0, .5, 1, 1.5 and 2, plotting the approximated function against ad and rf.]

Figure 2: The approximated function for the pronoun learning problem obtained
under the me-you-only condition. The function is depicted as functions of 'addressee'
(y-axis) and 'referent' (x-axis) at several values of 'speaker'. Function values (z-axis)
at ad=rf should indicate you (y = -.5), and those at sp=rf me (y = +.5), if the
pronouns are correctly learned. The problem is that the function takes the assumed
value for me even if sp ≠ rf ≠ ad, for which no examples were given in the training.
Discontinuities in the function correspond to points where sp=ad, which never occurs.

[Figure 3 appears here: nine panels, one for each of sp = -2, -1.5, -1, -.5, 0, .5, 1, 1.5 and 2.]

Figure 3: The same as Figure 2, but under the me-you-others condition. When the
correct learning occurs, the function takes the value of you (y = -.5) if and only
if ad=rf, the value of me (y = +.5) if and only if sp=rf, and y = 0 if and only if
sp ≠ rf ≠ ad.

[Figure 4 appears here. Panels: (a) 5-Layer Network, (b) Comp. 1, (c) Comp. 2, (d) Variable 1, (e) Variable 2, (f) Variable 5, (g) Output Unit 1, (h) Output Unit 2, (i) Output Unit 5.]

Figure 4: The mechanism of a function approximation in the five-layer auto-associative
network. (a) depicts the basic construction of the network. (b) and
(c) represent recovered components as functions of the original components (ln a on
the y-axis, ln b on the x-axis). (d) through (f) display a sample of input functions
(out of 12 altogether), and (g) through (i) the recovered functions at the output units
corresponding to (d)-(f).

has been applied to extracting components that determine facial expressions of emotion
(DeMers & Cottrell, 1993) and internal color representation (Usui et al., 1991).
Takane (1995) examined recovery properties of nonlinear PCA by the NN models
using several artificial data sets.
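The architecture itself is easy to write down. The sketch below is a minimal NumPy rendition of the five-layer forward pass (biases omitted for brevity); the weight matrices are random placeholders, whereas in practice they would be fitted by backpropagation to minimize the reconstruction error:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def autoassociate(X, W1, W2, W3, W4):
    H1 = sigmoid(X @ W1)    # hidden layer 1: nonlinear input encoding
    Z = H1 @ W2             # middle layer: the nonlinear components
    H3 = sigmoid(Z @ W3)    # hidden layer 3: nonlinear output decoding
    return H3 @ W4          # output layer: reconstruction of X

rng = np.random.default_rng(0)
X = rng.normal(size=(169, 12))                      # e.g., 169 cylinders
W1, W2 = rng.normal(size=(12, 12)), rng.normal(size=(12, 2))
W3, W4 = rng.normal(size=(2, 12)), rng.normal(size=(12, 12))
print(autoassociate(X, W1, W2, W3, W4).shape)       # (169, 12)
```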

6. The Cylinder Problem

The example used to demonstrate nonlinear PCA by NN models was adapted from
Kruskal & Shepard (1974), who generated a set of cylinders by systematically varying
the log altitude (ln a) and log base area (ln b) of the cylinders. These two variables serve
as the two assumed components to be recovered by nonlinear PCA. Kruskal & Shepard
then measured the cylinders with respect to twelve variables which are all monotonic
functions of ln a and ln b: 1. altitude, 2. base area, 3. circumference, 4. side area,
5. volume, 6. moment of inertia, 7. slenderness ratio, 8. diagonal-base angle, 9.
diagonal-side angle, 10. electrical resistance, 11. conductance, and 12. torsional
deformability.
The training patterns used in the present study were generated in a similar way,
except that 1) ln a and ln b were systematically varied from -.6 to .6 in steps of
.1 to obtain 13 equally spaced levels, which were factorially combined to obtain 169
cylinders (as opposed to the 30 prescribed cylinders in Kruskal & Shepard), and 2) after
the same twelve variables were used to measure the cylinders, they were further transformed
by an arbitrary linear transformation to define a completely new set of twelve
variables, which may no longer be monotonic with either ln a or ln b. Three examples
of these variables are shown in Figure 4(d)-(f), as functions of ln a and ln b. These
variables are joint multivariate nonlinear transformations of ln a and ln b. Nonmetric
PCA allowing only variablewise monotonic transformations is expected to have great
difficulties in recovering the original components from such data. However, nonlinear
PCA by means of a five-layer auto-associative network with 12 units in each of the
first and third hidden layers (this number is the same as the number of input
units and that of output units) could almost perfectly recover the input data. The
recovered data are shown in Figure 4(g)-(i) for the variables corresponding to those
in Figure 4(d)-(f). The recovered variables at the output units look remarkably
similar to the corresponding input variables, except that small wiggles are observed
in the former. Figure 4(b) and (c) give the two recovered components plotted against
ln a and ln b. In both cases, the recovered components are fairly linear with the original
components.
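A hypothetical reconstruction of this data generation, for a few of the twelve measurements (the paper does not give the exact formulas, so the derived variables and the random mixing matrix below are illustrative assumptions):

```python
import numpy as np

g = np.round(np.arange(-0.6, 0.61, 0.1), 1)    # 13 equally spaced levels
lna, lnb = (m.ravel() for m in np.meshgrid(g, g))
a, b = np.exp(lna), np.exp(lnb)                # altitude, base area: 169 cylinders

V = np.column_stack([
    a,                           # 1. altitude
    b,                           # 2. base area
    2 * np.sqrt(np.pi * b),      # 3. circumference (since b = pi r^2)
    a * 2 * np.sqrt(np.pi * b),  # 4. side area
    a * b,                       # 5. volume
])                               # all monotone in ln a and ln b
rng = np.random.default_rng(0)
X = V @ rng.normal(size=(5, 5))  # arbitrary linear mixing: no longer monotone
print(X.shape)                   # (169, 5)
```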
It is interesting to see how the input variables are approximated (recovered) at the output
units with a reduced number of components in the middle layer (hidden layer 2).
Figures 5 and 6 display the activation functions created at hidden layers 1 and 3
(H1 and H3), respectively. The activation functions at H1 were obtained by sigmoid
transformations of linear combinations of input unit activations. They are in turn
linearly combined to obtain the two recovered components at H2. The recovered components
are then linearly combined and sigmoid-transformed to obtain the activation
functions at H3. These are then linearly combined to obtain the approximated input
functions at the output units.

7. Discussion
NN models present interesting perspectives on nonlinear multivariate analysis by allowing
joint multivariate nonlinear transformations of input variables. In this paper,
we highlighted the mechanisms of these transformations in two specific contexts: CC

[Figure 5 appears here: activation functions at H1 units 1 through 12.]
Figure 5: The activation functions created at units in hidden layer 1. These activation
functions are linearly combined to obtain the activation functions ((b) and (c) of
Figure 4), which are the recovered component scores.

[Figure 6 appears here: activation functions at H3 units 1 through 12.]

Figure 6: The same as Figure 5, but for hidden layer 3. These activation functions
are linearly combined to obtain the output functions (activation functions at output
units), some of which are given in (g)-(i) of Figure 4.

networks for nonlinear discriminant analysis and a five-layer auto-associative network


for nonlinear PCA. In the present studies, no random errors were added in the data
generation process. Investigating how the networks cope with random errors in the
data is an important next step to evaluate the viabilit.y of the approach as a general
method for developing nonlinear multivariate analysis techniques.

References

Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate
normal distributions. Annals of Mathematical Statistics, 22, 327-351.
Baldi, P. & Hornik, K. (1989). Neural networks and principal component analysis: Learning
from examples without local minima. Neural Networks, 2, 53-58.
DeMers, D., & Cottrell, G. (1993). Non-linear dimension reduction. In: Neural Information
Processing Systems 5, Hanson, S. J. et al. (eds.), 580-587, Morgan Kaufmann, San Mateo, CA.
Fahlman, S. E., & Lebiere, C. (1990). The cascade correlation learning architecture. In:
Neural Information Processing Systems 2, Touretzky, D. S. (ed.), 524-532, Morgan Kaufmann,
San Mateo, CA.
Irie, B. & Kawato, M. (1989). Taso paseputoron ni yoru naibu hyogen no kakutoku. [Acquisition
of internal representation by multi-layered perceptron.] Shingakugiho, NC89-15.
Kruskal, J. B., & Shepard, R. N. (1974). A nonmetric variety of linear factor analysis.
Psychometrika, 39, 123-157.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward
neural networks. In: Artificial Neural Networks, Kohonen, T. et al. (eds.), 737-745.
Oshima-Takane, Y. (1988). Children learn from speech not addressed to them: The case
of personal pronouns. Journal of Child Language, 15, 94-108.
Oshima-Takane, Y. (1992). Analysis of pronominal errors: A case study. Journal of Child
Language, 19, 111-131.
Oshima-Takane, Y. et al. (1995). The learning of personal pronouns: Network models and
analysis. McGill University Cognitive Science Center Technical Report, 2095. McGill University.
Oshima-Takane, Y. et al. (1996). Birth order effects on early language development: Do
second born children learn from overheard speech? Child Development, 67, 621-634.
Rao, C. R. (1964). The use and interpretation of principal component analysis in applied
research. Sankhya A, 26, 329-358.
Takane, Y. (1995). Nonlinear multivariate analysis by neural network models. Proceedings
of the 63rd Annual Meeting of the Japan Statistical Society, 258-260.
Takane, Y. et al. (1995). Network analyses: The case of first and second person pronouns.
Proceedings of the 1995 IEEE International Conference on Systems, Man and Cybernetics,
3594-3599.
Usui, S. et al. (1991). Internal color representation acquired by a five-layer neural network.
In: Artificial Neural Networks, Kohonen, T. et al. (eds.), 867-872.
Van den Wollenberg, A. L. (1977). Redundancy analysis: An alternative for canonical
correlation analysis. Psychometrika, 42, 207-219.
Generalized Canonical Correlation Analysis
with Linear Constraints
Haruo Yanai
Research Division
National Center for University Entrance Examination
2-19-23 Komaba, Meguro-ku, Tokyo 153, Japan

Summary: A method is introduced by imposing linear constraints upon the parameters
corresponding to more than two sets of variables. We call the method introduced here
'generalized canonical correlation analysis with linear constraints'. It covers the canonical
correlation analysis with linear constraints proposed by Yanai & Takane (1992) as a
special case. Further, by employing dummy variables, our method turns out to be
identical to multiple correspondence analysis with linear constraints.

1. Introduction

Yanai & Takane (1992) proposed canonical correlation analysis with linear constraints
by imposing linear constraints upon the parameters corresponding to two sets of variables.
In this paper, we extend the earlier results to the case where more than two sets of
variables are available, and thereby derive general explicit solutions for canonical correlation
analysis with more than two sets of variables by imposing linear constraints
upon the parameters corresponding to the sets of variables. We call this method
generalized canonical correlation analysis with linear constraints (GCCAC). Furthermore,
we show that for categorical data our method yields multiple correspondence
analysis with linear constraints (MCC), or the Quantification method of the
third type (Hayashi, 1952) with linear constraints, which covers the canonical analysis
of contingency tables with linear constraints presented by Bockenholt & Bockenholt
(1990) as well.

2. Generalized Canonical Correlation Analysis (GCCA)

2.1 Formulation of GCCA
Suppose that for n subjects data on m sets of variables X_1, X_2, ..., X_m were recorded
(where X_i comprises p_i variables) and are represented in terms of the matrix X =
(X_1, X_2, ..., X_m). The sizes of the matrices X_1, X_2, ..., X_m are n × p_1, n × p_2, ..., n × p_m,
respectively. Let a' = ((a_1)', ..., (a_m)') be a vector of weight vectors for X_1, X_2, ...,
X_m with a_i ∈ R^{p_i}. We look for a weight vector a that maximizes the following criterion:

    h(a) = Σ_{i=1}^m Σ_{j=1}^m (X_i a_i, X_j a_j) / Σ_{j=1}^m (X_j a_j, X_j a_j),    (1)


where ( , ) denotes the scalar product in R^n, and where X'X = (X_i'X_j), the full
block matrix, and D_XX = diag(X_1'X_1, X_2'X_2, ..., X_m'X_m), the block-diagonal matrix,
are block matrices of order p × p with p = Σ_{i=1}^m p_i. Further, the p_j × p_j matrices
X_j'X_j (j = 1, ..., m) are assumed to be nonsingular throughout the paper. Any
solution of (1) is necessarily a solution of the following eigenvalue problem:

    (X'X) a = λ D_XX a.    (2)

Thus the maximum value of (1) is given by λ_1, the maximum eigenvalue of (2). If
X_1, X_2, ..., X_m are m sets of dummy variables, then (2) is nothing but the eigenequation
of multiple correspondence analysis (MCA) (see Lebart et al. (1984)), or of
the Quantification method of the third type by Hayashi (1952).
We give an alternative equation to solve for a in the following theorem.
Theorem 1: The weight vectors a_j ∈ R^{p_j} maximizing (1) can be obtained by solving

    gP f = λ f  with  gP = P_{X_1} + P_{X_2} + ... + P_{X_m},  for λ and f ∈ R^n,    (3)

where P_{X_j} = X_j (X_j'X_j)^{-1} X_j' is the orthogonal projection matrix onto S(X_j), the
column subspace of X_j. From an eigenvector f of gP, the weight vectors

    a_j = (1/λ) (X_j'X_j)^{-1} X_j' f    (4)

provide a solution of (2).


Proof: Putting f = Xa and multiplying (2) from the left by X D_XX^{-1}, we get

    (X D_XX^{-1} X') f = λ f,    (5)

which is easily transformed into (3). Furthermore, by substituting f = Xa on the
left-hand side of (2), we obtain X'f = λ D_XX a. This yields a = (1/λ) D_XX^{-1} X' f, which
implies (4). (Q.E.D.)
An alternative method for deriving the eigenequations (3) and (4) is given by minimizing

    Q_2(a) = Σ_{j=1}^m || f - X_j a_j ||^2  →  min over a,    (6)

with respect to a_1, ..., a_m for a given f. The minimum value of the Q_2(a) defined
above is attained as

    Q_2(a*) = m f'f - f'(P_{X_1} + P_{X_2} + ... + P_{X_m}) f    (7)

when a_j* = (X_j'X_j)^{-1} X_j' f. Now, minimizing (7) with respect to f leads to (3). We
show an important property of gP in the following corollary.

Corollary 1: 0 ≤ λ_j(gP) ≤ m  (j = 1, ..., m).

Proof: Observe that I_n ≥ P_{X_j} (j = 1, ..., m) in the sense that I_n - P_{X_j} is nonnegative
definite. Then we have

    m I_n ≥ Σ_{i=1}^m P_{X_i} = gP,

thus establishing the desired result due to the Poincare separation theorem (PST). (Q.E.D.)

3. Generalized Canonical Correlation Coefficient

Corollary 2: Let λ_j = λ_j(gP) (j = 1, ..., k) be the j-th eigenvalue of (3) and f_j be the
corresponding normalized eigenvector. Let us denote

    R_k(X_1, ..., X_m) = (λ_k - 1)/(m - 1).    (8)

Then, we have

    -1/(m - 1) ≤ R_k(X_1, ..., X_m) ≤ 1  for k = 1, ..., rank(gP).    (9)

Proof: First, multiply (3) by f_k' P_{X_j} from the left and sum over j. We get

    Σ_{i≠j} (P_{X_i} f_k, P_{X_j} f_k) / Σ_{i=1}^m || P_{X_i} f_k ||^2 = λ_k - 1.

By Corollary 1, we get (9). (Q.E.D.)


Observe that if the intersection of all m subspaces S(X_j) (j = 1, ..., m) is not
empty, then R_1(X_1, X_2, ..., X_m) attains 1. Further, if X_i'X_j = 0 for i ≠ j, then
R(X_1, X_2, ..., X_m) = 0. Thus, if m = 2, then R(X_1, X_2) is the ordinary canonical
correlation coefficient between X_1 and X_2. From the above, it can be justified
that R_j(X_1, ..., X_m) is the j-th generalized canonical correlation coefficient among
X_1, X_2, ..., X_m.
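A small numerical sketch of Theorem 1 and of this coefficient; the conversion R_k = (λ_k - 1)/(m - 1) follows definition (8) as reconstructed above, and the random data are purely illustrative:

```python
import numpy as np

def projector(X):
    # Orthogonal projector P_X = X (X'X)^{-1} X' onto the column space of X
    return X @ np.linalg.inv(X.T @ X) @ X.T

def gcca(blocks):
    m = len(blocks)
    gP = sum(projector(X) for X in blocks)       # gP = P_X1 + ... + P_Xm
    lam = np.sort(np.linalg.eigvalsh(gP))[::-1]  # gP is symmetric
    return lam, (lam - 1.0) / (m - 1.0)          # eigenvalues and R_k

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(50, 3)) for _ in range(3)]   # m = 3 sets
lam, R = gcca(blocks)
print(np.round(lam[:3], 3), np.round(R[:3], 3))
# eigenvalues lie in [0, m]; the R_k lie in [-1/(m-1), 1], as in Corollary 2
```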

4. Generalized Canonical Correlation Analysis with Linear Constraints

Next, we consider maximizing (1) under some additional linear constraints on the a_j
(j = 1, ..., m).
Theorem 2: Let C_j (j = 1, ..., m) be matrices of orders p_j × s_j (where s_j > 0).
Maximizing (1) subject to the s = Σ_{j=1}^m s_j linear constraints

    C_j' a_j = 0  (j = 1, ..., m)    (10)

yields eigenequations of the form

    Σ_{j=1}^m (P_{X_j} - P_{X_j (X_j'X_j)^{-1} C_j}) g = λ^c g    (11)

and

    (Σ_{j=1}^m P_{X_j Q_{C_j}}) f = λ^c f,  where Q_{C_j} = I - P_{C_j}.    (12)

Further, the weight vectors a_j are given by

    a_j = (X_j'X_j)^{-1} X_j' f - (X_j'X_j)^{-1} C_j (C_j'(X_j'X_j)^{-1} C_j)^{-1} C_j'(X_j'X_j)^{-1} X_j' f  for j = 1, ..., m.    (13)
Proof: The proof uses the following auxiliary lemma.
Lemma (Rao (1973)): Suppose that A and B are symmetric matrices of the same
order, and let B be nonsingular. Then, maximization of a'Aa/a'Ba subject to
C'a = 0 leads to the necessary condition for a:

    (A - C (C'B^{-1}C)^{-1} C'B^{-1} A) a = λ B a.    (14)
Proof of Theorem 2: First, consider the p × p matrices A = X'X = (X_i'X_j) and
B = D_XX = diag(X_1'X_1, ..., X_m'X_m), and the p × s block-diagonal matrix C =
diag(C_1, ..., C_m).
Multiplying (14) from the left by X B^{-1} and expanding, we get the equation for λ
and a:

    (X B^{-1} X' - X B^{-1} C (C'B^{-1}C)^{-1} C'B^{-1} X') X a = λ X a,

thus leading, by g = Xa, to

    Σ_{j=1}^m (P_{X_j} - P_{X_j (X_j'X_j)^{-1} C_j}) g = λ^c g,    (15)

observing that (X B^{-1} C)'(X B^{-1} C) = C'B^{-1}C.


From the above, and using Theorem 2.1 of Yanai & Takane (1992), we immediately
have

    (Σ_{j=1}^m P_{X_j Q_{C_j}}) f = λ^c f,  where Q_{C_j} = I - P_{C_j}.    (16)

If rank(X_j, C_j) = rank(X_j) + rank(C_j), then we can use

which implies that adding the constraint C_j' a_j = 0 is meaningless under the condition
stated above.

We term the method explained in this theorem 'generalized canonical correlation
analysis with linear constraints (GCCAC)'.
Corollary 3: Let λ_j(X, C) be the j-th eigenvalue of (15) and (16). Then λ_j(X, C) ≤
λ_j(X) for all j.
Theorem 2 includes the result of the canonical correlation analysis with linear constraints
by Yanai & Takane (1992) as a special case.
Suppose that the data matrices X_1 and X_2 contain only dummy variables. Further,
let N_12 = X_1'X_2, D_1 = X_1'X_1 and D_2 = X_2'X_2. Then maximizing a'N_12 b
subject to E'a = 0 and F'b = 0 yields the singular value decomposition of

    (I - E (E'D_1^{-1/2} E)^{-1} E'D_1^{-1/2}) Y (I - F (F'D_2^{-1/2} F)^{-1} F'D_2^{-1/2}),

where Y = D_1^{-1/2} N_12 D_2^{-1/2}.
This is the canonical analysis of a contingency table with linear constraints by Bockenholt
& Bockenholt (1990), which is here subsumed as a special case of GCCAC
with m = 2 and X_1 and X_2 two matrices of dummy variables.

5. Numerical Examples
We give three numerical examples demonstrating the usefulness of our method.
Example 1: Given the m = 3 dummy matrices A, B and C of orders 8 × 2 shown below, we compute
the eigenvalues of gP where gP = P_A + P_B + P_C. Then we have λ_1(gP) = 3, λ_2(gP) =
λ_3(gP) = λ_4(gP) = 1, and obtain the generalized canonical correlation coefficients by
equation (8): R_1(A,B,C) = 1, R_2(A,B,C) = R_3(A,B,C) = R_4(A,B,C) = 0.
The rows of A, B and C are:

        A       B       C
       1 0     1 0     1 0
       1 0     1 0     1 0
       1 0     0 1     0 1
       1 0     0 1     0 1
       0 1     0 1     1 0
       0 1     0 1     1 0
       0 1     1 0     0 1
       0 1     1 0     0 1
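A quick numerical check of this example (a sketch; the projector and eigenvalue computations are as in the earlier snippet):

```python
import numpy as np

A = np.repeat(np.array([[1, 0], [0, 1]]), 4, axis=0)               # 8 x 2
B = np.array([[1,0],[1,0],[0,1],[0,1],[0,1],[0,1],[1,0],[1,0]])
C = np.array([[1,0],[1,0],[0,1],[0,1],[1,0],[1,0],[0,1],[0,1]])

def projector(X):
    return X @ np.linalg.inv(X.T @ X) @ X.T

gP = projector(A) + projector(B) + projector(C)
lam = np.sort(np.linalg.eigvalsh(gP))[::-1]
print(np.round(lam, 6))                  # 3, 1, 1, 1, 0, ... as stated above
print(np.round((lam[:4] - 1) / 2, 6))    # R_1..R_4 = 1, 0, 0, 0 by (8)
```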

Example 2: Let c_1 = (1, 2)', c_2 = (1, -2)' and c_3 = (2, 1)' be constraint vectors for
A, B and C, respectively. Then by means of formula (15), we have λ^c_1(gP) = 1.477,
λ^c_2(gP) = 0.900 and λ^c_3(gP) = 0.623, and the remaining eigenvalues are 0. Thus,
from Examples 1 and 2, we have demonstrated Corollary 3 of Theorem 2 numerically.
Further, the resulting weights are as follows.

weight         a_11     a_12     a_21     a_22     a_31     a_32
dimension 1    0.287   -0.144    0.361    0.181   -0.144    0.287
dimension 2   -0.300    0.150    0.000    0.000   -0.150    0.300
dimension 3    0.166   -0.083   -0.264   -0.132   -0.083    0.166
constraints    1        2        1       -2        2        1

Observe that 0.287 × 1 + (-0.144) × 2 = 0, 0.361 × 1 + 0.181 × (-2) = 0 and
-0.144 × 2 + 0.287 × 1 = 0, thus establishing that the obtained weights satisfy the
given constraints.
Example 3: As data X, we consider the answers of 25 subjects to 11 categories of four items
(see Tables 1 and 2). Using these data we computed the eigenvalues and the corresponding
weights for the categories, changing the constraints in the following four ways:

case 1: without any constraint

case 2: c_1 = (1, 1, 1), c_2 = (1, 1), c_3 = (1, 1, 1), c_4 = (1, 1, 1)

case 3: c_1 = (1, 1, -1), c_2 = (1, -1), c_3 = (1, 1, -1), c_4 = (1, 1, -1)

case 4: c_1 = (1, 2, 3), c_2 = (1, 2), c_3 = (1, 2, 3), c_4 = (1, 2, 3)

It follows that the maximum eigenvalue attains 4 in case 1. Thus, in Table 3 we show the
weights of the 11 categories corresponding to the second largest eigenvalue in case 1, and
the weights corresponding to the largest eigenvalue in each of the other
three cases. As can be seen from Table 3, the weights given to the categories of any of the
four items satisfy the constraints in cases 2, 3 and 4.

Table 1: Labels of 11 categories of four items

item          1) age group   2) sex    3) marital status   4) number of children
category 1    20-39          male      single              none
category 2    40-59          female    married             one and two
category 3    over 60        --        alternatives        more than three

Further, we show the eigenvalues for the four cases in Table 4. Each eigenvalue in
case 1, obtained from (3), is larger than the corresponding eigenvalues in the other
three cases. The result also demonstrates the validity of Corollary 3.

6. Discussion
The method proposed in this paper may be regarded as a general method of multivariate
analysis for finding appropriate linear combinations under linear constraints
imposed upon the parameters corresponding to many sets of variables. It is to be noted here
that our method covers canonical correlation analysis, canonical discriminant analysis,
principal component analysis, multiple correspondence analysis and also the Quantification
method of the third type, taking linear constraints into consideration. In

Table 2: Data of 25 subjects on the four items

            item 1 2 3 4              item 1 2 3 4              item 1 2 3 4
subject 1        3 1 2 3   subject 10      1 2 2 3   subject 19      2 2 2 2
subject 2        1 2 3 3   subject 11      3 2 1 1   subject 20      1 2 2 2
subject 3        2 1 2 2   subject 12      3 2 3 2   subject 21      1 1 2 1
subject 4        2 1 2 3   subject 13      2 2 3 3   subject 22      2 2 2 3
subject 5        1 1 2 1   subject 14      3 1 2 3   subject 23      3 1 1 1
subject 6        3 2 3 3   subject 15      2 2 2 2   subject 24      1 2 2 3
subject 7        1 2 2 2   subject 16      1 2 2 3   subject 25      1 2 3 3
subject 8        3 2 1 1   subject 17      2 2 2 3
subject 9        3 2 3 1   subject 18      2 2 3 3

Table 3: Weights of the 11 categories (second largest eigenvalue in case 1;
largest eigenvalue in cases 2-4)

        case 1    case 2    case 3    case 4
1-1      0.070    -0.073     0.139    -0.099
1-2      0.145    -0.122     0.134    -0.191
1-3     -0.223     0.195     0.273     0.160
2-1     -0.095     0.104     0.198     0.080
2-2      0.037    -0.104     0.198    -0.040
3-1     -0.459     0.295     0.097     0.412
3-2      0.077    -0.141     0.169    -0.117
3-3      0.031    -0.153     0.266    -0.059
4-1     -0.304     0.257     0.127     0.311
4-2      0.123    -0.113     0.115    -0.079
4-3      0.084    -0.143     0.242    -0.051

Table 4: Eigenvalues of the four cases

component    1      2      3      4      5      6      7      8
case 1     4.000  2.162  1.142  1.093  0.997  0.632  0.456  0.197
case 2     2.203  2.203  1.350  0.900  0.605  0.533  0.295  0.000
case 3     3.785  1.501  1.040  0.505  0.108  0.043  0.018  0.000
case 4     2.042  1.477  1.200  0.982  0.686  0.394  0.220  0.000

our paper, we have not commented on how to find appropriate constraints
for each of the methods. More often, natural forms of constraints will arise from
specific empirical questions posed by the investigators concerned with the problems of
their fields. With such formulations, one specifies the space in which the original
parameter should lie, and then proceeds to test the hypothesis by means of an appropriate
statistic. It should be noted, however, that little work has been done on
such tests for the various kinds of multivariate analysis. From this viewpoint, how to
test the hypothesis stated in (10) provides an interesting problem to be tackled in the future.

References:
Yanai, H. & Takane, Y. (1992): Canonical correlation analysis with linear constraints,
Linear Algebra & its Applications, 176, 75-89.
Bockenholt, U. & Bockenholt, I. (1990): Canonical analysis of contingency tables with
linear constraints, Psychometrika, 55, 633-639.
Lebart, L., Morineau, A. & Warwick, K. M. (1984): Multivariate Descriptive Statistical
Analysis, John Wiley & Sons, New York.
Rao, C. R. & Yanai, H. (1979): General definition of a projector, its decomposition and
application to statistical problems, J. of Statistical Planning and Inference, 3, 1-17.
Rao, C. R. (1973): Linear Statistical Inference and its Applications (Second Edition),
John Wiley & Sons, New York.
Hayashi, C. (1952): On the prediction of phenomena from qualitative data from the
mathematico-statistical point of view, Annals of the Institute of Statistical Mathematics,
3, 69-96.
Principal Component Analysis Based on
a Subset of Variables for Qualitative Data
Yuichi Mori¹, Yutaka Tanaka² and Tomoyuki Tarumi²
¹ Kurashiki City College
160 Kojima Bieda-cho
Kurashiki 711, Japan

² Department of Environmental and Mathematical Sciences
Okayama University
2-1-1 Tsushima-naka, Okayama 700, Japan

Summary: A principal component analysis based on a subset of variables was proposed
by Tanaka and Mori (1996) to derive principal components which are computed as linear
combinations of a subset of quantitative variables but which can reproduce all the variables
very well. The present paper discusses an extension of their modified principal component
analysis so that it can deal with qualitative variables by using the idea of the alternating
least squares method of Young et al. (1978). A backward elimination procedure is applied
to find a suitable sequence of subsets of variables. A numerical example is shown to illustrate
the performance of the proposed procedure.

1. Introduction
Consider a situation where we wish to make a small dimensional rating scale to measure
latent traits. On the one hand, from the validity aspect all the variables should
be included in the stage of constructing the rating scale. On the other hand, the
number of variables should be as small as possible in the stage of application of the
rating scale, from the practical aspect. Thus we meet the problem of variable selection
in principal component analysis (PCA).
Suppose we observe quantitative variables and wish to make a small dimensional
rating scale by applying a PCA-like method. Furthermore, we wish to obtain the
rating scale or a set of principal components (PCs) in such a way that it is based
on only a subset of variables but represents all the variables very well. If we can
find such PCs, we may say that those PCs provide a multidimensional rating scale
which has high validity and is easy to apply in practice. To do this Tanaka and
Mori (1996) proposed a method of PCA based on a subset of variables using the
idea of Rao (1964)'s PCA of instrumental variables and called it a "modified PCA"
(M.PCA), while the problems of variable selection in the ordinary PCA have been
studied by Jolliffe (1972, 1973, 1986), Robert and Escoufier (1976), McCabe (1984)
and Krzanowski (1987a, 1987b) among others.
In this paper we extend M.PCA so that it can deal with qualitative data with unordered
or ordered categories. Namely, we perform both the quantification of qualitative
data and M.PCA at the same time, and study the performance of our method numerically
by applying it to a real data set. For analyzing qualitative data with a
PCA approach, a lot of techniques have been proposed, e.g., De Leeuw (1982, 1984),
De Leeuw and Van Rijckevorsel (1980), Israels (1984), and Young et al. (1978). We
use Young et al. (1978)'s PCA in which the alternating least squares (ALS) method
is utilized. We call this qualitative PCA based on a subset of variables a "qualitative
modified PCA" (QM.PCA) to distinguish it from M.PCA and others.


2. PCA based on a subset of qualitative variables

2.1 Quantification of qualitative data
Let X be a data matrix with n observations and p variables (items), where the j-th
(j = 1, ..., p) variable x_j has c_j possible values (c_j categories) labeled 1, ..., c_j. To
quantify X we use an indicator matrix G_j (n × c_j) for each x_j. The columns of G_j
are dummy variables which have scores 0 and 1, score 1 indicating the category that
the observation belongs to. Then, as an optimally scaled matrix of X we make an
n × p quantified matrix Y whose j-th column is given by y_j = G_j a_j, where a_j is
a score vector for the categories of x_j. The scores a_j are assigned so that the criterion
of PCA is maximized under the constraints that all y_j's have zero means and unit
variances (as will be described in 2.3).
2.2 Modified PCA for the quantitative data matrix Y
Suppose we have obtained a quantitative data matrix Y by quantifying the qualitative
variables. We wish to represent this Y by r PCs as well as possible, where
r is preassigned and the PCs are linear combinations of a subset of variables of Y.
As discussed by Tanaka and Mori (1996), to derive such PCs we can use the PCA of
instrumental variables proposed by Rao (1964) by assigning the subset of variables
as instrumental variables, when the variables are quantitative.
We wish to make r linear combinations Z = Y_1 W, which reproduce the original p
variables as well as possible, where Y_1 is a subset of Y with q (1 ≤ q ≤ p) variables
and W is a q × r (1 ≤ r ≤ q) weight matrix for the q variables. Here the a_j (j = 1, ..., q)
and W are determined in such a way that Y can be predicted as well as possible by
means of linear functions of Z. Thus the predictive efficiency is maximized for Y by
using a linear predictor in terms of Z.
Let the covariance matrix of Y = (Y_1, Y_2) be S = (S_11 S_12; S_21 S_22), written in blocks.
The residual covariance matrix of Y after subtracting the best linear predictor is expressed as

    S_res = S - S_1' W (W' S_11 W)^{-1} W' S_1,    (1)

where S_1 = (S_11, S_12). Then the problem becomes to maximize the explained part
S_Reg = S_1' W (W' S_11 W)^{-1} W' S_1. If it is formulated as the maximization problem
of tr(S_Reg), among other possibilities, a generalized eigenvalue problem

    [(S_11^2 + S_12 S_21) - λ S_11] w = 0    (2)

is obtained. Let the q eigenvalues of (2) be ordered from the largest to the smallest
as λ_1, λ_2, ..., λ_q and the associated eigenvectors be denoted by w_1, w_2, ..., w_q.
Then, the solution is expressed as W = (w_1, ..., w_r), and the maximized value of
the criterion tr(S_Reg) is given by Σ_{i=1}^r λ_i, or the proportion of the original variations
explained by the r PCs is given by

    P = Σ_{i=1}^r λ_i / tr(S).    (3)

In QM.PCA this proportion P indicates the average squared multiple correlation between
each of the original variables and the r PCs, since all y_j's are standardized in
the quantification process.
Thus we apply the following two-stage procedure of variable selection as a practical
strategy to find PCs which are based on a small number of variables but represent
all the variables very well.
A. Initial fixed-variable stage
Assign q variables to the subset Y_1, usually q := p, and solve the eigenvalue problem
(EVP) (2). Looking carefully at the eigenvalues and the proportions P in (3), determine
the number r of PCs to be used.
B. Variable selection stage (backward elimination)
B-1: Based on the results of stage A, start with the q variables - r PCs model.
B-2: Remove each one of the q variables in turn, solve the q EVPs (2) with (q - 1)-dimensional
matrices, and find the best subset of size q - 1, that is, the one for which
the proportion P in (3) is the largest. Remove the corresponding variable from the
present subset of q variables and put q := q - 1.
B-3: If both P and the number of variables in Y_1 are larger than preassigned values,
go back to B-2. Otherwise stop.
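The following sketch implements EVP (2) in the form reconstructed above, the proportion (3), and one B-2 sweep of the backward elimination. The random data and the SciPy generalized eigensolver are illustrative assumptions, not part of the paper:

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

def mpca_P(S, idx, r):
    """Proportion (3) when the variables in idx form the subset Y1."""
    rest = [j for j in range(S.shape[0]) if j not in idx]
    S11 = S[np.ix_(idx, idx)]
    S12 = S[np.ix_(idx, rest)]
    # EVP (2): (S11^2 + S12 S21) w = lambda S11 w, with S21 = S12'
    lam = eigh(S11 @ S11 + S12 @ S12.T, S11, eigvals_only=True)
    return np.sort(lam)[::-1][:r].sum() / np.trace(S)

def backward_step(S, idx, r):
    """One B-2 sweep: drop the variable whose removal keeps P largest."""
    P, worst = max((mpca_P(S, [v for v in idx if v != j], r), j) for j in idx)
    return [v for v in idx if v != worst], P

rng = np.random.default_rng(0)                        # illustrative data
S = np.cov(rng.normal(size=(87, 10)), rowvar=False)
idx, P = backward_step(S, list(range(10)), r=2)
print(len(idx), round(P, 4))                          # 9 variables kept
```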
2.3 Estimation of category scores and variable weights
Before performing M.PCA we have to estimate the unknown parameters a_j and then
make the optimally scaled variables. To do this we use the idea of PRINCIPALS (Young
et al., 1978), which is an extension of ordinary PCA to the situation where the variables
may be measured at a variety of scale levels. It is used to estimate both the a_j and
W simultaneously. Here, we use all variables in this estimation phase, i.e., q := p.
Let

    Ŷ = Z B,    (4)

where Z (= Y W in this phase) is an n × r matrix of n component scores and B is an r × p
coefficient matrix. We wish to obtain B so that Ŷ predicts Y as well as possible, where
Y contains the unknown a_j's and Z contains the unknown W. The optimization criterion is
expressed as

    θ = tr(Y - Ŷ)'(Y - Ŷ).    (5)

Based on the ALS principle, the unknown parameters are estimated as follows.
(Step 0) Determine initial values of Y, which are assigned as category scores provided
externally or as random numbers. Standardize Y columnwise.
(Step 1) Apply M.PCA to the data matrix Y, that is, solve EVP (2) and obtain W.
Successively obtain Z and B = (Z'Z)^{-1} Z'Y.
(Step 2) Evaluate θ. If the improvement in fit from the previous iteration to the
present iteration is negligible, stop.
(Step 3) From Z and B compute Ŷ by eq. (4). Obtain the matrix of optimally scaled
data Y which gives the minimum θ in (5) for the fixed Ŷ. The optimal scaling of the
data is performed for each variable separately and independently. Standardize the
optimally scaled data and go back to Step 1.
Steps 1 through 3 are iterated until convergence is obtained.
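A compact sketch of this ALS loop follows, with two simplifications that are my assumptions rather than the paper's: ordinary PCA stands in for the M.PCA step, and the optimal-scaling step is written for the unordered (nominal) case, where each category score is replaced by the mean of Ŷ over the observations in that category:

```python
import numpy as np

def als_principals(Xcat, r, n_iter=50):
    """Sketch of Steps 0-3; a fixed iteration count stands in for Step 2."""
    n, p = Xcat.shape
    Y = Xcat.astype(float).copy()                  # Step 0: initial scores
    Y = (Y - Y.mean(0)) / Y.std(0)
    for _ in range(n_iter):
        # Step 1: ordinary PCA stands in here for M.PCA / EVP (2)
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        Z = U[:, :r] * s[:r]                       # component scores
        B = np.linalg.lstsq(Z, Y, rcond=None)[0]   # B = (Z'Z)^{-1} Z'Y
        Yhat = Z @ B                               # Step 3: model estimate
        for j in range(p):                         # optimal scaling, nominal case
            for c in np.unique(Xcat[:, j]):
                mask = Xcat[:, j] == c
                Y[mask, j] = Yhat[mask, j].mean()
        Y = (Y - Y.mean(0)) / Y.std(0)             # re-standardize
    return Y

Xcat = np.random.default_rng(0).integers(1, 4, size=(87, 5))
print(als_principals(Xcat, r=2).shape)             # (87, 5)
```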

3. A numerical example
We illustrate the proposed method by applying it to data gathered for the purpose
of making a rating scale to measure the seriousness of mild disturbance of consciousness
(MDOC) due to head injury and other lesions (Sano et al., 1977, 1983). The
data set consists of 87 individuals and 25 variables (test items, four-point scale).

Tab. 1 Process of removing variables.

Step   q    P        P_q      Removed Variable   P_0
 0    23   0.69389  1.00000                      0.69389
 1    22   0.69366  0.99304   V9                 0.69314
 2    21   0.69323  0.98847   V17                0.69276
 3    20   0.69286  0.97941   V10                0.69200
 4    19   0.69238  0.97238   V3                 0.69070
 5    18   0.69183  0.96316   V16                0.68965
 6    17   0.69135  0.95196   V12                0.68853
 7    16   0.69036  0.93944   V11                0.68733
 8    15   0.68933  0.91520   V21                0.68454
 9    14   0.68837  0.89372   V7                 0.68212
10    13   0.68734  0.88519   V19                0.67778
11    12   0.68602  0.87106   V4                 0.67560
12    11   0.68448  0.83876   V23                0.67551
13    10   0.68244  0.81823   V2                 0.67449
14     9   0.68019  0.80063   V5                 0.67133
15     8   0.67743  0.78617   V13                0.66767
16     7   0.67412  0.77004   V20                0.66034
17     6   0.67035  0.74620   V22                0.66114
18     5   0.66264  0.71730   V1                 0.65993
19     4   0.65145  0.68579   V6                 0.64770
20     3   0.63371  0.64599   V8                 0.63340
21     2   0.60734  0.60734   V15                0.60734
P_q is the P value using all q eigenvalues obtained by EVP (2).
P_0 is the P value computed by the ordinary PCA of the selected variables.

[Figure 1 appears here: P (y-axis) against Number of Variables (x-axis), with the removed-variable labels along each curve.]
Fig. 1: Process of removing variables (QM.PCA and M.PCA).
(• indicates the result of QM.PCA, which means M.PCA of Y, and ▲ indicates
that of M.PCA applied to X as quantitative variables.)

These data were originally analyzed by Sano et al. (1977, 1983) to make a rating
scale for MDOC, and later by Tanaka and Kodake (1981) and Tanaka (1983) using
principal factor analysis with variable selection functions. All of these studies
adopted a 23 variables - 2 factors model, since two of the 25 variables were thought not
to be important, and analyzed the 23 variables as quantitative ones using scores 0,
1, 2 and 3 or 0 and 3 according to the grade of disturbance.
Now let us apply QM.PCA to this data set. We denote this data set by X.
At first indicator matrices G_j are generated from X and are transformed into initial
quantified vectors y_j by assigning the original values of x_j to the score vectors a_j (j = 1,
..., 23). After normalizing the y_j's so that they have zero means and unit variances, we
apply the ordinary PCA to the initial Y. The eigenvalues and the cumulative proportions
are λ_1 = 13.878 (60.34%) > λ_2 = 1.579 (67.20%) > λ_3 = 0.9580 (71.37%)
> λ_4 = 0.8456 (75.05%) > λ_5 = 0.7432 (78.28%) > .... Looking at these values
and referring to the previous studies we decided to extract two PCs. Starting with this
initial Y and r = 2, the scores a_j and weights W are estimated iteratively by means of
the ALS method. The Y obtained at the final iteration is an optimally quantified matrix of
the qualitative X. The eigenvalues and the cumulative proportions of the quantitative
variables Y are λ_1 = 14.458 (62.86%) > λ_2 = 1.502 (69.39%) > λ_3 = 0.9401 (73.48%)
> λ_4 = 0.7788 (76.86%) > λ_5 = 0.6948 (79.88%) > .... These proportions are larger
than those obtained by the ordinary PCA applied to the original X.
Next, we apply M.PCA to these optimally scaled variables Y. The process of removing
variables is shown in Tab. 1 and the change of the P values is visualized in Fig. 1 (•).
They illustrate that the change of the P value is very small until 11 or 12 variables are
removed. Even when the number of variables is reduced drastically to 6, the reduction
of P is less than 2.5%.
Fig. 2 shows scatter plots of the correlation loadings, which are defined as the correlations
between the original variables and the (varimax-rotated) derived PCs and which
play important roles in the interpretation of the PCs. The three scatter plots correspond
to the cases of (a) q = 23, (b) q = 11 and (c) q = 6. Black circles in (b) and (c)
indicate the selected variables. Comparing these scatter plots, it is observed that the
configurations are almost the same between (a) and (b), but they are slightly different
between (a) and (c). This fact indicates that the meanings of the PCs do not change
when 12 variables are removed and they change only slightly when 17 variables are
removed. We can say that PCs based on the selected variables have almost the same
information as PCs based on the whole set.
Comparing with the result of M.PCA applied to the original X, Fig. 1 shows that
QM.PCA provides higher values of P at every step. This difference is due to the
effect of the optimal scaling by the ALS method.
To illustrate how our method selects variables from clusters of variables, the selected
variables are marked on the dendrogram in Fig. 3, obtained by cluster analysis of Y
using standardized squared Euclidean distances and the furthest neighbor method.
As illustrations, the 11 variables and 6 variables selected by QM.PCA are marked below
the variable numbers in Fig. 3. The dendrogram suggests that there exist three clusters,
which are composed of 11, 2 and 10 variables, respectively. It is noted that the
selected variables are distributed in the two major clusters in a well-balanced manner.
The reason why no variables are selected from the smallest cluster may be explained
in such a way that variables No. 21 and No. 23 in this cluster are located near the
origin in the scatter plots of the correlation loadings (Fig. 2) and therefore they do
not play important roles in composing these PCs.
Finally let us compare our method with other variable selection methods. On the
basis of the previous studies we consider the following methods.
(1) Method of regression analysis: Successively remove the variable which has the largest
multiple correlation with the remaining variables.
(2) Jolliffe (1972, 1973, 1986)'s method based on PCA: Apply PCA to all p variables.
Looking at the loadings associated with the p-th largest eigenvalue, remove the variable
which has the largest loading. Next, looking at the loadings associated with the (p - 1)-th
largest eigenvalue, remove one variable among the remaining ones in the same way,
and so on. Iterate the process until the number of removed variables is p - q.
(3) Method based on cluster analysis: Form q clusters of variables using a method of

[Figure 2 appears here: panels (a), (b) and (c); axes are the 1st and 2nd correlation loadings.]
Fig. 2: Scatter plots of the correlation loadings (QM.PCA).
(a) q = 23, (b) q = 11 and (c) q = 6.

[Figure 3 appears here.]
Fig. 3: Dendrogram of 23 variables (furthest neighbor method with standardized
squared Euclidean distances). The eleven and six variables selected by QM.PCA are
marked by × below the variable numbers.

cluster analysis and select one variable from each cluster.
(4) Method using Robert and Escoufier (1976)'s RV-coefficient (1): Find the q - 1 variables
among the q variables which have the closest configuration of the PC scores to that based
on the q variables, and remove the corresponding variable. The RV-coefficient is used as
the criterion to evaluate the closeness between the two configurations.

[Figure 4 appears here: P (y-axis) against Number of variables (x-axis); legend: QM.PCA, Reg, PCA (by Jolliffe), Cluster analysis, RV1, RV2.]
Fig. 4: Comparison of P obtained by applying various selection methods to Y.

(5) Method using Robert and Escoufier (1976)'s RV-coefficient (2): Find the q - 1 variables
among the q variables which have the closest configuration of the modified PC scores
to that of the original variables Y in terms of the RV-coefficient. In this case the
RV-coefficient is computed as RV = {Σ_{j=1}^r λ_j / tr(S^2)}^{1/2}, where the λ_j's are obtained by
solving EVP (2) (see Tanaka and Mori, 1996).
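For reference, here is a sketch of the general RV-coefficient of Robert and Escoufier (1976) between two column-centered configurations of the same n objects; this is the definition used in method (4), while the closed form quoted in method (5) is a special case computed from EVP (2). The data are illustrative:

```python
import numpy as np

def rv(X, Y):
    # RV = tr(Sxy Syx) / sqrt(tr(Sxx^2) tr(Syy^2)), with centered columns
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Sxy = X.T @ Y
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace((X.T @ X) @ (X.T @ X)) *
                  np.trace((Y.T @ Y) @ (Y.T @ Y)))
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(87, 23))
print(rv(X, X[:, :11]))   # closeness of an 11-variable sub-configuration
```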
Fig. 4 indicates the changes of P for each selection method. The P values at each step
are computed using the Y_1 obtained by each method as the instrumental variables and Y
as the main variables. The reason why we use the formulation based on instrumental
variables to compute P is to make fair comparisons, because the P values obviously
become smaller if the ordinary PCA is applied to the selected variables (this fact is
observed in Tab. 1: the P values in the last column, which are computed by the
ordinary PCA of the selected variables, are always smaller than those in the third
column). Fig. 4 shows that the proposed method selects a subset of variables which
explains the original variables better than the other methods.

4. Concluding remarks
Modified PCA (M.PCA), which was proposed by Tanaka and Mori (1996) to derive
PCs which are formulated as linear combinations of a subset of variables but which
represent all the variables very well, is extended so as to deal with qualitative data
by combining a method of optimal scaling and M.PCA. The performance of the proposed
method (QM.PCA) is studied numerically by applying it to a real data set. From
the results of the numerical study we can state the following:
1) The proportion of variance explained by a specified number of PCs in QM.PCA is
larger than in M.PCA without optimal scaling in every step of variable selection. This
suggests the usefulness of optimal scaling in analyzing qualitative data.
2) In our numerical example the average proportion P of the variance of all the variables
explained by the PCs does not change much by omitting at most 17 among the 23
variables. Also the loadings of the PCs based on all the variables (q = 23) and on two
subsets of variables (q = 11 and q = 6) are almost the same. These facts suggest that
we can construct a multidimensional rating scale which is based on a small number
of variables but which has almost the same information as the case using all the
variables.

3) Comparison is made among the performances of our method and other variable
selection methods in PCA. Our method is superior to the other methods in the sense
that the proportion P of our method is larger than those of the other methods.
4) To study how our method selects variables, the selected variables are marked in the
scatter plots and in the dendrogram for cluster analysis of the variables. It seems that
the variables are selected from the major clusters in a well-balanced manner.

References:
De Leeuw, J. (1982): Nonlinear principal component analysis. COMPSTAT 1982, 77-86,
Physica-Verlag, Vienna.
De Leeuw, J. (1984): The Gifi system of nonlinear multivariate analysis. Data Analysis
and Informatics III, Diday, E. et al. (eds.), 415-424, Elsevier Science Publishers, North-
Holland.
De Leeuw, J. and Van Rijckevorsel, J. (1980): HOMALS & PRINCALS: Some generaliza-
tions of principal components analysis. Data Analysis and Informatics, Diday, E. et al.
(eds.), 415-424, North-Holland Publishing Company.
Israels, A. Z. (1984): Redundancy analysis for qualitative variables. Psychometrika, 49,
331-346.
Jolliffe, I. T. (1972): Discarding variables in a principal component analysis. I. Artificial
data. Applied Statistics, 21, 160-173.
Jolliffe, I. T. (1973): Discarding variables in a principal component analysis. II. Real data.
Applied Statistics, 22, 21-31.
Jolliffe, I. T. (1986): Principal component analysis. Springer-Verlag, New York.
Krzanowski, W. J. (1987a): Selection of variables to preserve multivariate data structure,
using principal components. Applied Statistics, 36, 22-33.
Krzanowski, W. J. (1987b): Cross-validation in principal component analysis. Biometrics,
43, 575-584.
McCabe, G. P. (1984): Principal variables. Technometrics, 26, 137-144.
Rao, C. R. (1964): The use and interpretation of principal component analysis in applied
research. Sankhya, A, 26, 329-358.
Robert, P. and Escoufier, Y. (1976): A unifying tool for linear multivariate statistical meth-
ods: the RV-coefficient. Applied Statistics, 25, 257-265.
Sano, K. et al. (1977): Statistical studies on evaluation of mind disturbance of conscious-
ness: Abstraction of characteristic clinical pictures by cross-sectional investigation. Sinkei
Kenkyu no Shinpo, 21, 1052-1065 (in Japanese).
Sano, K. et al. (1983): Statistical studies on evaluation of mind disturbance of conscious-
ness. Journal of Neurosurgery, 58, 223-230.
Tanaka, Y. and Kodake, K. (1981): A method of variable selection in factor analysis and
its numerical investigation. Behaviormetrika, 10, 49-61.
Tanaka, Y. (1983): Some criteria for variable selection in factor analysis. Behaviormetrika,
13, 31-45.
Tanaka, Y. and Mori, Y. (1996): Principal component analysis based on a subset of vari-
ables: Variable selection and sensitivity analysis. American Journal of Mathematics and
Management Sciences, Special Volume (to appear).
Young, F. et al. (1978): The principal components of mixed measurement level multivariate
data: An alternating least squares method with optimal scaling features. Psychometrika,
43, 279-281.
Missing Data Imputation in Multivariate Analysis
Mario Romanazzi
Department of Statistics, "Ca' Foscari" University of Venice
2347 S. Polo, 30125 Venice, Italy

Summary: A new imputation method for incomplete data is suggested, to be used with
non-random multivariate observations. The principle is to fill the gaps in a data set so that
the partial complete-data configuration and the total filled-in configuration are similar, ac-
cording to a matrix correlation coefficient. The optimality criteria are Escoufier's RV and
Procrustes normalized statistic. Three examples are illustrated.

1. The method
Suppose p numerical variables have been observed on n units, but only n1 units have
complete data, whereas the other n2 = n − n1 have incomplete data. Let X be the
n × p data matrix, with some empty cells, and let X1 be the n × p1 submatrix including
the p1 ≥ 1 variables with complete data. It is assumed that the variables (columns
of X1 and X) are centered with respect to the means. We suggest estimating the
missing values in such a way that X1 and X are as similar as possible, according to
a matrix correlation coefficient. Crettaz De Roten and Helbling (1991) also use a
matrix correlation criterion, but they consider an a priori partition of the variables
in two groups, or try to maximize the match between X1 and the submatrix of the
incomplete-data variables.
A geometrical interpretation can be given. X1 and X correspond to two constel-
lations representing the same n points in p1- and p-dimensional Euclidean space,
respectively. The constellation associated with X1 is fixed, whereas in the constel-
lation corresponding to X, the relative positions of the points and the overall shape
change with each selection of values substituted for the missing values in the empty
cells of X. Our method amounts to fixing X at the position of maximum similarity
with X1.
Obviously, a definition of "similar configurations" must be given. In multivariate
analysis there are two standard equivalence criteria: congruence up to linear transfor-
mations, typical of canonical correlation and multivariate regression, and congruence
up to orthogonal transformations, as in principal component and factor analysis. In
this work the match of X1 and X is evaluated under orthogonal congruence. It is
well known that a maximal invariant is the pair of inner-product matrices X1 X1^T,
X X^T, which can be compared by Escoufier's RV,

    RV(X1, X) = tr(X1^T X X^T X1) / [tr{(X1^T X1)^2} tr{(X^T X)^2}]^{1/2},

or by the Lingoes-Schönemann normalized Procrustes statistic (Seber, 1984),

    LS(X1, X) = [tr{(X1^T X X^T X1)^{1/2}}]^2 / {tr(X1^T X1) tr(X^T X)}.

Up to the normalization factors appearing in the denominators, RV and LS are dif-
ferent functions of the singular values of the matrix X1^T X, whose elements are the
covariances of the observed variables.
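
As an illustration, the two coefficients can be computed directly from their definitions;
the following numpy sketch (our own, with invented function names) assumes column-
centered input matrices with the units in the rows.

    import numpy as np

    def rv_coefficient(X1, X):
        """Escoufier's RV between two column-centered configurations of the
        same n units (X1: n x p1 complete part, X: n x p filled-in matrix)."""
        S11, S = X1.T @ X1, X.T @ X
        num = np.trace(X1.T @ X @ X.T @ X1)
        return num / np.sqrt(np.trace(S11 @ S11) * np.trace(S @ S))

    def ls_statistic(X1, X):
        """Lingoes-Schonemann normalized Procrustes statistic as defined above."""
        M = X1.T @ X @ X.T @ X1                        # p1 x p1, symmetric psd
        eig = np.clip(np.linalg.eigvalsh(M), 0.0, None)
        num = np.sum(np.sqrt(eig)) ** 2                # [tr(M^{1/2})]^2
        return num / (np.trace(X1.T @ X1) * np.trace(X.T @ X))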


Suppose there are q ≥ 1 missing values θ1, ..., θq, forming the vector θ = (θ1, ...,
θq)^T. With no loss of generality, they can be thought of as pertaining to the last n2
units and the last p2 variables. Thus X can be partitioned as

    X = (X1  X2) = ( X11  X12 )
                   ( X21  X22 ),

where Xj is an n × pj matrix and Xij is an ni × pj matrix, i, j ∈ {1, 2}. The p × 1
vector of the means is x̄ = (x̄^(1)T, x̄^(2)T)^T, with x̄^(1) (x̄^(2)) denoting the vector of
the means of the first p1 (last p2) variables. For j ∈ {1, 2}, it is useful to interpret
x̄^(j) as the weighted mean of x̄1^(j) and x̄2^(j), the group averages computed from the
first n1 and the last n2 units. The relevant expression is

    x̄^(j) = (n1 x̄1^(j) + n2 x̄2^(j)) / n.

Hence the submatrices of X can be written as

    X11 = A − (n2/n) 1_{n1} α^T,   X12 = C − (n2/n) 1_{n1} β^T,
    X21 = B + (n1/n) 1_{n2} α^T,   X22 = Δ + (n1/n) 1_{n2} β^T,

where A, B, C and Δ are column-centered matrices, α = x̄2^(1) − x̄1^(1) and β = x̄2^(2) − x̄1^(2).
Since X1 is the submatrix of X corresponding to the p1 complete-data variables, A,
B, C and α are free of θ, whereas X2 ≡ X2(θ), Δ ≡ Δ(θ) and β ≡ β(θ) depend on the
missing values to be estimated. Thus the matrix X ≡ X(θ) is a function of θ, and it
makes sense to look for the vector θ* which gives RV (LS) its maximum value. This
amounts to choosing, among all possible matrices X(θ), the one most similar to X1
according to RV (LS). Putting Γ = (X1^T X1)^2, γ = tr Γ, Φ(θ) = X1^T X2(θ) X2(θ)^T X1,
φ(θ) = tr Φ(θ), and ψ(θ) = tr{X2(θ)^T X2(θ)}^2, the identities X1^T X(θ) X(θ)^T X1 = Γ + Φ(θ)
and tr(X^T X)^2 = γ + 2φ(θ) + ψ(θ) lead to

    RV(X1, X) ≡ RV(θ) = {γ + φ(θ)} / [γ {γ + 2φ(θ) + ψ(θ)}]^{1/2},

    LS(X1, X) ≡ LS(θ) = [tr{Γ + Φ(θ)}^{1/2}]^2 / [tr(X1^T X1) tr{X(θ)^T X(θ)}].
Several properties of RV(θ) and LS(θ) can be noted.

Remark 1. Since tr(X^T X)^2 ≥ tr(X1^T X1)^2 > 0, RV(θ) is an infinitely differentiable
function of θ ∈ R^q.
Remark 2. A sufficient condition for X1^T X(θ) X(θ)^T X1 to be positive definite is that
X1^T X1 is positive definite. In this case {X1^T X(θ) X(θ)^T X1}^{1/2} is an analytic function
of θ, which implies that LS(θ) is too. The behaviour of LS(θ) in the singular case is
being investigated.
Remark 3. RV(θ) = LS(θ) = 0 ⇔ X1^T X(θ) = (X1^T X1  X1^T X2(θ)) = 0_{p1,p}. But
X1^T X1 ≠ 0_{p1}; thus RV(θ) and LS(θ) are positive for all θ ∈ R^q.
Remark 4. RV(θ) = LS(θ) = 1 iff there exist a p1 × p2 column-orthogonal matrix Q(θ)
and a scalar δ such that X2(θ) = δ X1 Q(θ).
Remark 5. If Φ(θ) does not depend on θ, LS(θ) attains its maximum value at the
θ-vector where tr X2(θ)^T X2(θ) is minimized. Then the optimal value for all missing
elements in the j-th incomplete-data variable is the mean of the observed data for that
variable. If Φ(θ) does not depend on θ and all the data corresponding to the last n2 units
and last p2 variables are missing, RV(θ) satisfies the same property.
Remark 6. If p1 = p2 = q = 1, both coefficients can be expressed in terms of
s11 = var X1, s12(θ) = cov(X1, X2) and s22(θ) = var X2. It can be shown that the
value θ*_LS which maximizes LS(θ) and the value θ*_RV which maximizes RV(θ) have
the following property: in the optimal two-dimensional configuration including the
estimated point (x_{n,1}, θ*), θ* is the ordinary linear regression prediction, with the
slope adjusted by a positive factor that depends on the strength of the linear relation
between X1 and X2 and, for RV(θ), on the relative magnitudes of s11 and s22(θ*).

2. Optimization of RV and LS
To obtain maximum similarity between X1 and X(θ), the optimization problems
max_{θ∈R^q} RV(θ) or max_{θ∈R^q} LS(θ) must be solved. The two functions vary in the
(0, 1] interval; if X1^T X1 is positive definite they are continuous and differentiable for
all θ ∈ R^q, but the search for the extrema might be complicated by the existence of
local extrema. A detailed discussion of the behaviour of RV(θ) in the particular case
q = 1 is given in Romanazzi (1995). When p and q are not very high (tentatively,
p ≤ 10, q ≤ 4 in the case of RV(θ)), the optimizations are conveniently performed
by symbolic computation, using for instance Mathematica or MATLAB. Otherwise,
ad hoc algorithms based on steepest descent methods must be designed. The gradient
vector of RV(θ) is (Romanazzi, 1995):

    RV^(1)(θ) = [RV(θ) / (2{γ + φ(θ)}{γ + 2φ(θ) + ψ(θ)})] {2[φ(θ) + ψ(θ)] φ^(1)(θ) − [γ + φ(θ)] ψ^(1)(θ)}.

Here ∂vec Δ(θ)/∂θ ≡ Δ^(1) is the q × n2 p2 matrix of the partial derivatives of the
elements of vec Δ(θ) with respect to θ1, ..., θq, and ∂β(θ)/∂θ ≡ β^(1) is the q × p2
matrix of the partial derivatives of the components of β(θ) with respect to θ1, ..., θq.
Δ^(1) and β^(1) are constant matrices no longer depending on θ1, ..., θq. The expressions
of φ^(1)(θ) and ψ^(1)(θ) turn out to be

    φ^(1)(θ) = 2{Δ^(1) [(I_{p2} ⊗ B B^T) vec Δ(θ) + (n1 n2 / n)(I_{p2} ⊗ B) vec(α β(θ)^T) + vec(B A^T C)]
               + (n1 n2 / n) β^(1) [(n1 n2 / n)(α^T α) β(θ) + Δ(θ)^T B α + C^T A α]},

    ψ^(1)(θ) = 4{Δ^(1) [(Δ(θ)^T Δ(θ) + C^T C + (n1 n2 / n) β(θ) β(θ)^T) ⊗ I_{n2}] vec Δ(θ)
               + (n1 n2 / n) β^(1) [Δ(θ)^T Δ(θ) + C^T C + (n1 n2 / n)(β(θ)^T β(θ)) I_{p2}] β(θ)}.
n n

If X1^T X1 is positive definite, the gradient vector of LS(θ) is (Romanazzi, 1995):

    LS^(1)(θ) = [LS(θ) / (tr{Γ + Φ(θ)}^{1/2} tr{X(θ)^T X(θ)})] ×
                {tr[X(θ)^T X(θ)] [Δ^(1) (Φ_Δ^(1)(θ))^T + (n1 n2 / n) β^(1) (Φ_β^(1)(θ))^T] vec{Γ + Φ(θ)}^{−1/2}
                 − 2 tr{Γ + Φ(θ)}^{1/2} [Δ^(1) vec Δ(θ) + (n1 n2 / n) β^(1) β(θ)]},

where

    Φ_Δ^(1)(θ) ≡ ∂vec Φ(θ)/∂vec Δ(θ) = B^T Δ(θ) ⊗ B^T + (B^T ⊗ B^T Δ(θ)) K_{n2,p2} + (B^T ⊗ A^T C) K_{n2,p2}
                 + (n1 n2 / n)(α β(θ)^T ⊗ B^T + B^T ⊗ α β(θ)^T),

    Φ_β^(1)(θ) ≡ ∂vec Φ(θ)/∂β(θ) = (n1 n2 / n)(α β(θ)^T ⊗ α + α ⊗ α β(θ)^T) + α ⊗ A^T C + A^T C ⊗ α
                 + α ⊗ B^T Δ(θ) + B^T Δ(θ) ⊗ α.

Here K_{n2,p2} denotes the commutation matrix of order n2 p2, transforming vec Δ(θ)
into vec(Δ(θ)^T).
Unfortunately, explicit expressions of θ*_RV and θ*_LS are not available: even in the
simplest case (the RV coefficient with q = 1) the optimal value of θ is a root of a fourth-
degree equation. This is an important drawback in comparison with imputation based
on linear regression.
The optimization of RV(θ) and LS(θ) is computationally feasible, and the results are
sensible, if the number of missing values is small. In particular, at least one variable
must have complete data. With large matrices scattered with missing values we
suggest the following iterative procedure.
suggest the following iterative procedure.
Step 0. Substitute a naive initial estimate (e.g., the mean) for each missing value.
Step 1. For k = 1, ..., q, determine the optimal value of θk, according to RV or LS,
keeping the other θj's, j ≠ k, fixed at their previous values.
Step 2. Iterate Step 1 until the configuration of points no longer changes, i.e.,
|θk^(i) − θk^(i−1)| < εk, k = 1, ..., q.
While the general method determines the optimal values of θ1, ..., θq simultane-
ously, the iterative procedure tries to locate the optimum θ-vector by optimizing the
θk's one at a time. This implies that it will only find a sub-optimal configuration
X(η_RV) (X(η_LS)) such that RV(η_RV) ≤ RV(θ*_RV) (LS(η_LS) ≤ LS(θ*_LS)).

Figure 1: Graphs of RV(θ) and LS(θ). [Two surface plots over the (θ1, θ2) plane.]
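
The following Python sketch (our own illustration, not the author's implementation)
outlines Steps 0-2 with RV as the criterion; the helper names, the use of scipy's one-
dimensional Brent search, and the convergence tolerance are all assumptions made for
the example.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def rv(X1, X):
        # Escoufier's RV between column-centered X1 (complete variables) and X
        num = np.trace(X1.T @ X @ X.T @ X1)
        S11, S = X1.T @ X1, X.T @ X
        return num / np.sqrt(np.trace(S11 @ S11) * np.trace(S @ S))

    def impute_rv(X, n_sweeps=50, tol=1e-8):
        """Steps 0-2 with RV as criterion; X holds np.nan in the empty cells,
        and at least one column must be complete."""
        X = np.asarray(X, dtype=float).copy()
        miss = np.isnan(X)
        rows, cols = np.where(miss)
        complete = ~miss.any(axis=0)
        # Step 0: naive initial estimates (means of the observed values)
        X[rows, cols] = np.nanmean(X, axis=0)[cols]

        def crit(Xf):
            Xc = Xf - Xf.mean(axis=0)      # center w.r.t. the current means
            return rv(Xc[:, complete], Xc)

        prev = crit(X)
        for _ in range(n_sweeps):
            # Step 1: optimize each theta_k in turn, the others kept fixed
            for i, j in zip(rows, cols):
                def neg(t, i=i, j=j):
                    X[i, j] = t
                    return -crit(X)
                X[i, j] = minimize_scalar(neg).x   # 1-D search (may be local)
            cur = crit(X)
            if abs(cur - prev) < tol:      # Step 2: stop when nothing changes
                break
            prev = cur
        return X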

3. Numerical illustrations
The following examples illustrate the behaviour of matrix correlation imputation in
three situations which are fairly different both in the characteristics of the data and
the number of missing values. A comparison with alternative procedures - imputa-
tion of means, multiple regression - is also given.
Example 1. The data set includes p = 4 measures of school performance of a
sample of n = 85 students of the Faculty of Sociology at the University of Trento.
The measures are (1) final secondary-school mark, (2) average score in the university
examinations, (3) and (4) mathematics and sociology examination scores, respec-
tively. Variable 1 is measured on a sixty-point scale, with 36 and 60 corresponding
to the pass-mark and the maximum mark, respectively. The other variables are measured
on a thirty-point scale, with 18 and 30 corresponding to the pass-mark and the maximum
mark. In the data set there are q = 2 missing values, since one student did not record
the final secondary-school mark (the other data are x2 = 28, x3 = 21, x4 = 28)
and another one did not record the average examination score (the other data are
x1 = 42, x3 = 25, x4 = 28).

The means of the variables (standard deviations in parentheses) for the n1 = 83
complete-data students are x̄1 ≈ 49.6 (6.2), x̄2 ≈ 26.6 (1.49), x̄3 ≈ 25.1 (3.19), x̄4 ≈
27.7 (2.06). The pairwise linear correlations between the variables are positive and
low; the minimum correlation is r1,3 ≈ .203, the maximum correlation is r2,4 ≈ .558
and the average correlation is .363.

Imputation of the means over complete cases gives θ̂_AV,1 ≈ 49.6, θ̂_AV,2 ≈ 26.6.
Figure 2: Principal components of crimes data for 16 American cities (numbers cor-
respond to cities); bold (italic) numbers are cities with discarded data estimated by
optimization of LS (multiple regression). [Scatter plot of the first three principal
components: pc1 (49%), pc2, pc3 (16%).]

The predictions derived from multiple regression are θ̂_MR,1 ≈ 51.2 (average examination
score, mathematics and sociology scores as predictors) and θ̂_MR,2 ≈ 25.7 (final secondary-
school mark, mathematics and sociology scores as predictors). The optimal θ-vectors
according to RV and LS are θ*_RV ≈ (45.5, 34.5)^T and θ*_LS ≈ (46.8, 26.0)^T. The optimal
values of RV and LS are rather low (RV_max ≈ .353, LS_max ≈ .346), which suggests
that the optimal four-dimensional configuration X(θ*_RV) (X(θ*_LS)) is not very similar
to the two-dimensional configuration of the complete-data variables X1. Moreover,
Figure 1 reveals that the surface described by RV(θ) in a neighbourhood of θ*_RV is
almost flat in the direction of θ2: this means that the value of θ*_RV,2 can be altered
without a substantial reduction in the value of RV(θ). Predictions derived from mul-
tiple regression should also be interpreted with caution, since the squared multiple
correlations are low (.156 in the estimation of θ1, .480 in the estimation of θ2). It
is clear that the estimates of θ1 vary according to the imputation method, whereas
the estimates of θ2 are similar for all of the methods except optimization of RV.
(However, θ*_RV,2 ≈ 34.5 is not valid, since the maximum score is 30; this surprising
result may be due to the fact that RV(θ) is almost constant in the direction of θ2.)
The discrepancy between the multiple regression and matrix correlation estimates of θ1
follows from the "global" character of the first method and the "local" character of
the second one. Multiple regression mainly uses the positive correlation between the
dependent and explanatory variables (in particular, the average examination score) in the
complete data set, thus producing a value somewhat greater than the mean. On the
contrary, optimization of RV (LS) looks for the value of θ1 for which the position of
the missing-data student is as near as possible to the positions of the complete-data
students with similar values of the average examination score and the mathematics and
sociology scores. Inspection of the data shows that this value is about 46.
Example 2. To explore the validity of iterative optimization of RV(θ) and LS(θ),
we discarded, at random, q = 4 values from the city-crimes data set described by
Everitt (1984). The variables are crime rates for p = 7 different crimes in n = 16
American cities. The minimum, maximum and average (absolute) correlations are
r1,6 ≈ .050, r2,4 ≈ .772 and .403, respectively. The scatter plot of the first three prin-
cipal components of the standardized data is shown in Figure 2. The first component
is interpreted as a size factor, with high scores associated with low levels of crime.
The second is a shape factor, contrasting the first four variables ("violent" crimes) to
the last three ("non-violent" crimes); high values correspond to a prevalence of "non-
violent" crimes. The third component is easily interpreted, being highly correlated
with variable no. 7 (auto-theft).

    Discarded values     AV      MR*           RV**          LS**
    x1,3  = 106          277     293 (.719)    142 (23?)     137 (139)
    x2,6  = 669          1064    952 (.888)    858 (853)     810 (1004)
    x10,4 = 226          211     275 (.745)    351 (199)     212 (232)
    x15,4 = 148          211     145 (.745)    256 (188)     147 (143)

Table 1: Estimates of discarded values from the city-crimes data (AV: average; MR:
multiple regression; RV, LS: optimization of RV or LS; *: squared multiple corre-
lation in parentheses; **: results from iterative optimization in parentheses).
In Table 1 we record the discarded values and their matrix correlation imputations,
obtained by direct and iterative optimization. Results from imputation of means and
from multiple regression are also given. Iterative estimates derived from RV are better
than global estimates in three cases out of four. Iterative estimates derived from LS
are very near to the global ones, except for x2,6. Global and local optima are almost
equal (RV(θ*_RV) ≈ .921, RV(η_RV) ≈ .920, LS(θ*_LS) ≈ .774, LS(η_LS) ≈ .771), perhaps
indicating, as in Example 1, that the surfaces RV(θ) and LS(θ) are not very
steep in a neighbourhood of θ*. Finally, the results in Table 1 suggest that, in this
data set, matrix correlation imputation (in particular, optimization of LS) might
be superior to multiple regression imputation. This is confirmed by the scatter plot
in Figure 2 where, together with the complete-data units no. 1, 2, 10 and 15, we
represent as supplementary points the same units with the discarded values replaced by
imputations derived from optimization of LS and from multiple regression. It is clear
that the supplementary points corresponding to LS are closer to the true ones than
those corresponding to multiple regression.
Example 3. We consider the sons data discussed by Seber (1984): the n = 25
units form a compact cluster of points and the 4 variables are positively correlated.
The minimum, maximum and average correlations are r2,3 ≈ .693, r3,4 ≈ .839 and
.732, respectively. We also consider an artificial data set with n = 15, p = 4, whose
characteristics mimic the city-crimes data. The minimum, maximum and average
(absolute) correlations are |r1,2| ≈ .126, |r2,3| ≈ .889, and .387, respectively.
To assess the behaviour of the imputation methods, all possible pairs of points are dis-
carded in turn from the configuration of X3 and X4, and the omitted values are
estimated by averages over complete data, by multiple regressions using X1 and X2 as
predictors, and by optimization of RV and LS. As a summary of the results, the "sampling"
distribution of the sum of the squared deviations between true values and estimates,
and the percentage of best results, i.e., minimum squared errors, are computed.
    (a) Sons data, 300 trials

    Squared Error    AV    MR    RV    LS
      0 -  50        .12   .16   .06   .10
     50 - 100        .14   .20   .15   .24
    100 - 150        .17   .15   .12   .16
    150 - 200        .06   .20   .13   .13
    200 - 300        .12   .17   .28   .21
    300 - 500        .21   .12   .21   .15
    ≥ 500            .18   .00   .05   .01
    % Best Results    30    27    10    33

    (b) Artificial data, 105 trials

    Squared Error    AV    MR    RV    LS
     0 -  5          .01   .20   .16   .24
     5 - 10          .03   .16   .19   .14
    10 - 15          .04   .22   .25   .31
    15 - 20          .08   .18   .13   .11
    20 - 30          .24   .14   .10   .05
    30 - 50          .42   .09   .16   .14
    ≥ 50             .18   .01   .01   .01
    % Best Results    10    19    26    45

Table 2: Distributions of squared errors and percentages of best results (symbols as
in Table 1). (a) Sons data, 300 trials. (b) Artificial data, 105 trials.

According to the frequency distributions of squared errors in Table 2, the best results
are given by multiple regression for the sons data and by the optimization of LS for the
artificial data. In both cases, optimization of LS produces the highest percentage of
minimum squared errors.

4. Concluding remarks
Imputation of values to non-observed data requires a criterion that is subjectively, if
not arbitrarily, chosen. Our criterion is to fill the gaps in such a way that the partial
complete-data configuration and the total "filled-in" configuration are as similar as
possible according to a measure of matrix correlation. We confined our attention
to coefficients invariant under separate orthogonal transformations of the matching
configurations, but other choices are possible, such as coefficients that are invariant
under linear transformations. The computational effort is not negligible (even in the
simplest cases optimization routines are necessary), but this drawback is bearable if
the aim is a truly multivariate solution, using all the available information.
Preliminary results suggest that when the data are homogeneous, with strong linear rela-
tions between the complete-data variables and those with missing values, imputation
based on linear regression is superior. Conversely, when the units belong to different
groups and there are no important linear relations, the matrix correlation method is
more reliable. In these situations, LS is often better than RV.

5. References
Crettaz De Roten, F. and Helbling, J.-M. (1991): Une estimation de données man-
quantes basée sur le coefficient RV. Revue de Statistique Appliquée, 39, 47-57.
Everitt, B. S. (1984): An Introduction to Latent Variable Models. Chapman and Hall.
Romanazzi, M. (1995): Missing values imputation and matrix correlation. Quaderni
di Statistica e Matematica Applicata alle Scienze Economico-Sociali, 15, 41-59.
Seber, G. A. F. (1984): Multivariate Observations. Wiley.
Recent Developments in Three-Mode Factor Analysis:
Constrained Three-Mode Factor Analysis and Core Rotations
Henk A.L. Kiers

Department of Psychology
University of Groningen
Grote Kruisstraat 2/1
9712 TS Groningen
The Netherlands

Summary: A review is presented of some recent developments in three-mode factor analysis
that are all aimed at reducing the difficulties in interpreting three-mode factor analysis
solutions. First, variants of three-mode factor analysis with zero constraints on the core are
described, and attention is paid to algorithms for fitting these models, as well as to uniqueness
of the representations. Next, various methods for rotation of the core to simple structure are
discussed and related to two-way simple structure rotation techniques. In the concluding
section, new perspectives for simplification of the interpretation of three-mode factor analysis
solutions are discussed.

1. Introduction
1.1 Three-way data and three-way methods

Three-way data are data associated with three entries. Three-way data are collected in
various disciplines. For instance, in the behavioral sciences, three-way data may
consist of scores of a set of individuals on a set of variables at different occasions. As
another example, in spectroscopy, three-way data are obtained when measuring
absorbed energy at various absorption levels on various mixtures of substances that
have been exposed to various sorts of light emission.

Several methods have been proposed for the analysis of three-way data (for overviews
see Law, Snyder, Hattie & McDonald, 1984; Coppi & Bolasco, 1989; Carlier et ai.,
1989; Kiers, 1991). The most popular methods are probably Three-Mode Factor
Analysis (3MFA; Tucker, 1966; Kroonenberg & De Leeuw, 1980) and CANDE-
COMP/PARAFAC (Carroll & Chang, 1970; Harshman, 1970; Harshman & Lundy,
1984). In 3MFA the data x_ijk are modelled by

    x_ijk = Σ_{p=1}^{P} Σ_{q=1}^{Q} Σ_{r=1}^{R} a_ip b_jq c_kr g_pqr + e_ijk,        (1)

i = 1, ..., I, j = 1, ..., J, k = 1, ..., K, where a_ip, b_jq, c_kr are elements of the matrices A, B,
and C of orders I by P, J by Q, and K by R, and the additional parameters g_pqr denote
the elements of the P by Q by R so-called "core array". The matrices A, B, and C
can be considered component matrices for "idealized subjects" (in A), "idealized
variables" (in B), and "idealized occasions" (in C), respectively. The elements of the
core indicate how the components from the different modes interact. The model is


fitted to the data by minimizing the sum of squared error terms. In this minimization,
the component matrices can be taken columnwise orthonormal without loss of fit,
since orthonormalizations of A, B, and C can be compensated for by transformations
of the core. In fact, any nonsingular transformation of the component matrices can be
compensated by applying the inverse transformation to the core. As a consequence,
the model parameters have a great deal of transformational freedom, comparable to
the rotational indeterminacy in factor analysis.

The CANDECOMP/PARAFAC model is given by

    x_ijk = Σ_{r=1}^{R} a_ir b_jr c_kr + e_ijk,        (2)

i = 1, ..., I, j = 1, ..., J, k = 1, ..., K, where a_ir, b_jr, c_kr again denote elements of matrices
A, B, and C. As has been noted by Carroll and Chang (1970, p. 312), the PARAFAC
model can be considered as a version of the 3MFA model where the core is con-
strained to be "superdiagonal" (which implies that g_pqr is unconstrained if p = q = r and
g_pqr is constrained to 0 otherwise). It follows that, if P = Q = R, the 3MFA fit is
always at least as good as the PARAFAC fit, because the 3MFA model uses not only
the superdiagonal elements of the core, but also the off-superdiagonal elements, which
may considerably enhance the fit. In contrast to the 3MFA model, the parameters in
the CANDECOMP/PARAFAC model are (usually) unique, up to scalar multiplica-
tions and permutations.

To increase insight into the difference between the two models and into the role of the
core in the 3MFA model, we use a (simplified) tensorial description of the two
models. Considering x as a vectorized version of the modelled three-way array, and e
as a vector with error terms, the 3MFA model can be written as

    x = Σ_{p=1}^{P} Σ_{q=1}^{Q} Σ_{r=1}^{R} g_pqr (a_p ⊗ b_q ⊗ c_r) + e,        (3)

where a_p ⊗ b_q ⊗ c_r denotes the triple tensor product of column p of A, column q of B,
and column r of C. This tensor product is a vectorized version of the three-way array
computed from all possible products of elements from the three different vectors.
Specifically, it contains the products a_ip b_jq c_kr for i = 1, ..., I, j = 1, ..., J, k = 1, ..., K,
ordered such that index i runs slowest and index k runs fastest; the vector x is
composed analogously from the elements x_ijk. In the same notation, the CANDE-
COMP/PARAFAC model can be written as

    x = Σ_{r=1}^{R} (a_r ⊗ b_r ⊗ c_r) + e.        (4)

It can now easily be seen that the essential difference between the 3MFA model
and the CANDECOMP/PARAFAC model consists of the fact that the latter contains only
a subset of the triple tensor products of the former. Specifically, the PARAFAC
model contains only the triple products for which p = q = r. Another difference
between the descriptions in (3) and (4) is that in (3) all triple tensor products are
weighted by an element of the core array, whereas such weights are missing in (4).
This difference, however, is nonessential, because in (4) these weights can be
understood to be subsumed in the columns of A, B or C.
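
The following numpy sketch (ours; all dimensions and array names are invented for the
example) builds model (1) and model (2) explicitly and verifies that PARAFAC coincides
with 3MFA under a superdiagonal core.

    import numpy as np

    # Shapes: A (I x P), B (J x Q), C (K x R), core G (P x Q x R).
    I, J, K, P, Q, R = 6, 5, 4, 3, 3, 2
    rng = np.random.default_rng(0)
    A, B, C = rng.normal(size=(I, P)), rng.normal(size=(J, Q)), rng.normal(size=(K, R))
    G = rng.normal(size=(P, Q, R))

    # 3MFA / Tucker model, eq. (1): x_ijk = sum_pqr a_ip b_jq c_kr g_pqr
    X_tucker = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)

    # CANDECOMP/PARAFAC, eq. (2): x_ijk = sum_r a_ir b_jr c_kr
    # (requires A, B, C to have the same number of columns R)
    Ar, Br, Cr = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
    X_parafac = np.einsum('ir,jr,kr->ijk', Ar, Br, Cr)

    # PARAFAC as constrained 3MFA: superdiagonal core with g_rrr = 1, 0 elsewhere
    G_super = np.zeros((R, R, R))
    G_super[np.arange(R), np.arange(R), np.arange(R)] = 1.0
    assert np.allclose(np.einsum('ip,jq,kr,pqr->ijk', Ar, Br, Cr, G_super), X_parafac)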

1.2 Problems in interpreting three-mode factor analysis solutions

Above, we have d~scribed the CANDECOMP/PARAFAC model and the 3MFA


model. The latter method, 3MFA, is one of the earliest generalizations of ordinary
(two-mode) principal components analysis (PCA), but, as we see from (1) and (3),
the generalization is considerably more complicated than PCA itself. This is first of
all because it employs components for three rather than two modes. In addition, it
involves a three-way core array that describes the relations between all factors of all
three modes. The interpretation of the results is a somewhat burdensome enterprise,
especially because of the need to take into account all core elements. It would
therefore be attractive if many core elements could be ignored. Another complicating
aspect is that the solution is rotationally undetermined. This implies that a solution is
equivalent to infinitely many rotated versions of this solution, and part of the process
of interpreting one's results is to choose which rotated solution must be interpreted.
These complications of the 3MFA model have already been recognized by Tucker
(1966), and have since evoked several proposals for handling these problems. In these
proposals, the CANDECOMP/PARAFAC method played an important role. This is
because the CANDECOMP/PARAFAC model can be seen as the extremely simpli-
fied version of the three-mode factor analysis model where the core is superdiagonal.
Because of this simplification, CANDECOMP/PARAFAC solutions are much easier
to interpret. An added benefit is that CANDECOMP/PARAFAC solutions are unique,
hence no choice has to be made between various rotated versions of the solution. The
main disadvantage of CANDECOMP/PARAFAC is that it is much more restrictive
than the 3MFA model, and may therefore more often fail to fit the data well. For this
reason, researchers have not abandoned the 3MFA model, and from the seventies on,
procedures have been proposed for simplifying the core of a 3MFA solution.

1.3 Early proposals for simplification of the core

Kroonenberg (1983, p.58) has observed that if the 3MFA core can be simplified into
a core with diagonal core planes, then the 3MFA model has the same form as the
CANDECOMP/PARAFAC model (which, as we saw above, is considerably less
complicated than the 3MFA model); it should be noted that the CANDECOMP/PA-
RAFAC model thus turns out to be equivalent to two different forms of the 3MFA
model, that is, not only the 3MFA model with a superdiagonal core, but also a
different 3MFA model employing a core with diagonal frontal planes. Kroonenberg
(1983, Chapter 5) has mentioned and demonstrated that procedures for diagonalizing
the frontal planes of a core that have been investigated in the related context of the
IDIOSCAL model (Cohen, 1974, 1975; MacCallum, 1976; De Leeuw & Pruzansky,
1978) apply equally to the 3MFA model. Thus, the first attempts at simplifying the
3MFA core consist of diagonalizing the frontal planes of the core by means of
nonsingular transformations. By applying nonsingular transformations to the core, and
the inverse transformations to A, B, and C, the fit of the model is unaffected. In this
way, the simplicity of the CANDECOMP/PARAFAC model is approximated, without

losing the better fitting properties of 3MFA.

In practice, diagonalization of the core planes cannot be expected to work perfectly.


Therefore, the obtained cores have only approximately diagonal core planes, and
cannot be reduced to CANDECOMP/PARAFAC representations. This diagonalization
loses part of its motivation. What one can expect to find is a core with relatively high
elements on the diagonals of the frontal planes, and relatively small elements
elsewhere. In this way possibilities for further simplification may well be overlooked.
Some of the more recent developments have specifically aimed at simplifications of
the core that go beyond simplification to plane diagonal form. On the one hand,
approaches have been proposed that impose a (usually) very simple structure on the
core, and search among the thus constrained models for models that fit almost as well
as the unconstrained 3MFA model, thus attaining great simplicity at small cost. On
the other hand, procedures have been developed for rotations of the core that lead to
simpler forms than those consisting of approximately diagonal planes, thus enhancing
simplicity at no cost whatsoever. In Section 2, the constrained 3MFA approaches will
be discussed; the rotations to simpler cores are the subject of Section 3.

2. Constrained Three-mode Factor Analysis

As has been mentioned in the introduction, the CANDECOMP/PARAFAC model can


be seen as a variant of the 3MFA model in which the core is constrained to be
superdiagonal. It has also been mentioned that the CANDECOMP/PARAFAC model
often fits considerably more poorly than the 3MFA model. This observation has
motivated the study of methods in between the two models: In "Constrained 3MFA"
(Kiers, 1992; Rocci, 1992) the core is constrained to have certain elements equal to
zero, but usually not so many that it yields the CANDECOMP/PARAFAC model. In
this way, a class of new models is generated, that has a greater fitting potential than
CANDECOMP/PARAFAC has, and yet is not as intricate as the full three-mode
factor model.

Obviously, the more elements of the core are constrained to zero, the poorer the fit of
the model; hence Constrained 3MFA (C3MFA) seems merely to offer a set of
compromise methods, in which the benefit of a simpler solution can only be attained
at the cost of a poorer fit. Fortunately, in practice, the costs of using a considerably
simpler solution are usually low. Experience with C3MFA indicates that a fit almost
as good as that of the full 3MFA model can often be attained by models with only a
few more triple product terms than the CANDECOMP/PARAFAC model. In fact, it can
be proven that the core of a 3MFA model can always be constrained to have a consider-
able number of zero elements without loss of fit. For instance, for P by Q by R cores
with P = QR−1, Murakami, Ten Berge and Kiers (1996) have shown that at least
QR(QR−2)−(R−2) elements can be constrained to zero without loss of fit. For example,
in a 5×3×2 core array as many as 24 of the 30 elements can be constrained to zero.
The possibility of costlessly constraining core elements in other situations is still under
study, but experience suggests that usually (far) more than half of the elements can be
constrained to zero without affecting the fit.
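
As a quick check of this count, under the stated assumption P = QR − 1 (helper name ours):

    # Murakami, Ten Berge & Kiers (1996): for a P x Q x R core with P = QR - 1,
    # at least QR(QR-2) - (R-2) elements can be set to zero without loss of fit.
    def max_costless_zeros(Q, R):
        return Q * R * (Q * R - 2) - (R - 2)

    print(max_costless_zeros(3, 2))   # -> 24 of the 30 elements of a 5x3x2 core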

As has been seen above, an important feature of C3MFA is that it offers models that
are considerably simpler than the full 3MFA model and still fit nearly as well.
An added property of some C3MFA models is that they give unique solutions, just as
CANDECOMP/PARAFAC does. In particular, Kiers, Ten Berge and Rocci (in press) have
shown that C3MFA employing 3×3×3 cores yields unique solutions if the cores are
constrained to have zeros in the positions indicated by 0 in the core, the frontal planes
of which are given below:

    Y 0 0     0 0 X     0 X 0
    0 0 X     0 Y 0     X 0 0        (5)
    0 X 0     X 0 0     0 0 Y

The elements indicated by X may or may not be constrained to 0; the elements
indicated by Y must be unconstrained and nonzero. This is only one class of models
for which uniqueness has been proven. Practical experience has suggested that models
employing other sets of constraints on 3×3×3 cores, or employing smaller core sizes,
are nonunique (except, of course, the models corresponding to CANDECOMP/PA-
RAFAC). For models employing larger cores, to the author's knowledge no unique-
ness results are available as yet.

3. Rotation of the Core to Simple Structure

An alternative to constraining a three-mode factor model to have many zeros in the


core, is to rotate the core (in all three directions) such that it has many (near) zero
elements, and hence is easy to interpret as well. An advantage of rotating rather than
constraining the core is that rotation never affects the fit of the solution. A disad-
vantage is that it need not yield a solution that is as simple as desired. Two types of
approaches have been proposed recently. On the one hand, approaches for rotation to
a fixed simple form have been considered, thereby varying on the earlier approaches
by Kroonenberg (1983). On the other hand, attempts to rotation to arbitrary simple
structures have been made, following the approach to simple structure rotations
developed for PCA.

3.1 Rotation of the core to a fixed simple structure

Just as Kroonenberg (1983), Kiers (1992) searched for methods for rotation of the
core such that it approximates a core associated with the CANDECOMP/PA-
RAFAC model. However, contrary to Kroonenberg's approach, Kiers suggested
rotating the core to approximate superdiagonality (rather than diagonality of the core
planes). The rationale behind this approach is twofold. On the one hand, the approxi-
mation is expected to yield more small-sized elements, simply because it explicitly
aims at finding more elements close to zero. On the other hand, a 3MFA model
employing a superdiagonal core is more directly related to the CANDECOMP/PARA-
FAC model than one employing a core with diagonal planes. When using the former,
the component matrices are directly related to those of CANDECOMP/PARAFAC;
when using the latter, one of the matrices can only be obtained after a transformation.
Especially when only orthonormal rotations of the core are considered, the latter
(usually nonorthonormal) transformation makes the relation with the CANDECOMP/
PARAFAC model more indirect and complicated.

Kiers (1992) studied procedures for both oblique and orthonormal rotation to
superdiagonality. Two of the procedures for oblique rotation turned out to behave
poorly in that they frequently led to degenerate solutions, in which the transformation
matrices tended to singular matrices, so that inverse transformations could no longer
be computed sensibly. A third procedure for oblique rotation, which behaved
better, was deemed uninteresting because it imposed unwanted restrictions on the
transformation matrices. A procedure for oblique rotation that does not share these
problems has been devised only recently, as a variant of a procedure for rotation to
arbitrary simple structure (Kiers, 1995; see below).

The procedure for orthonormal rotation proposed by Kiers (1992) behaved well
computationally. In a simulation study with randomly rotated superdiagonal cores of
sizes 2x2x2, 3x3x3 and 4x4x4 (ten of each), the method recovered all 30 superdiago-
nal cores. Further properties of this rotation procedure, like the derivation of
theoretical bounds to the "superdiagonalizability" of core arrays, have been studied by
Henrion (1993). Practical experience with the method, however, often indicated that
superdiagonality was not well approximated, and that the resulting rotated core was
sometimes far from simple. Further research, therefore, mainly aimed at rotation of
cores to an arbitrary simple structure.

3.2 Rotation of the core to an arbitrary simple structure

Rather than rotating the core to an a priori specified form, one may choose to rotate
the core to an unspecified simple structure. One such procedure was employed by
Murakami (1983) who rotated the frontal core planes to simple structure by means of
varimax rotation applied to the transpose of the supermatrix containing all core planes
next to each other. In Murakami's case, the core could be rotated in only one
direction. The general situation where the core can be rotated in all three directions
such that some kind of overall simple structure is attained was first dealt with by
Kruskal (1988). He proposed a method called "tri-quartimax" rotation, which is a
procedure for maximizing a combination of normalized quartimax functions applied to
the supermatrices consisting of the frontal, lateral and horizontal planes, respectively,
of the core. Kruskal proposed to maximize this function over oblique rotations, thus
ignoring the fact that quartimax was originally proposed as a criterion for orthonor-
mal rotation. Therefore, there is little reason to expect that the properties of "ordi-
nary" quartimax carry over to tri-quartimax. Unfortunately, no published information
is available on the performance of the method, nor on its implementation.
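
The following Python sketch is our own reading of the one-direction rotation idea
described above (Murakami, 1983), not published code: a standard SVD-based varimax
routine is applied to the transposed supermatrix of core planes, and the orthonormal
counter-rotation is absorbed into the component matrix A, so that the fit is unaffected.
All function names are assumptions made for the example.

    import numpy as np

    def varimax(L, n_iter=100, tol=1e-8):
        """Standard SVD-based orthogonal varimax; returns the k x k rotation T
        such that L @ T has simple structure (L: p x k loadings)."""
        p, k = L.shape
        T, obj = np.eye(k), 0.0
        for _ in range(n_iter):
            Lam = L @ T
            U, s, Vt = np.linalg.svd(
                L.T @ (Lam**3 - Lam @ np.diag(np.sum(Lam**2, axis=0)) / p))
            T, d = U @ Vt, s.sum()
            if d < obj * (1 + tol):
                break
            obj = d
        return T

    def rotate_core_one_mode(G, A):
        """Rotate a P x Q x R core G in the A-direction only: varimax on the
        transposed supermatrix of core planes, counter-rotation into A."""
        P = G.shape[0]
        G_flat = G.reshape(P, -1)       # P x (Q*R) supermatrix of core planes
        T = varimax(G_flat.T)           # simple structure over the P columns
        return (T.T @ G_flat).reshape(G.shape), A @ T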

Kruskal's (1988) idea has been an important point of departure for Kiers' (in press)
"three-mode orthomax" rotation procedure. "Orthomax" (Jennrich, 1970) is a general
family of simple structure criteria containing the well-known varimax (Kaiser, 1958)
and quartimax (Carroll, 1953; Ferguson, 1954; Neuhaus & Wrigley, 1954; Saunders,
1953) criteria as special cases. Kiers' proposal consists of the application of the
orthomax criterion to the supermatrices consisting of the frontal, lateral and horizon-
tal planes of the core, thereby using a generalization of Kruskal's (1988) criterion. In
contrast to Kruskal's method, three-mode orthomax is restricted to orthonormal
rotations, thus respecting the orthonormality of the 3MFA component matrices.
Because of the latter fact, the method should be used with care: If the rotated core is

not as simple as one would like, one should take into account that further simplicity
might be attainable upon relaxing the orthonormality restriction. It seems, however,
that in practice the method often gives quite simple solutions for the core, despite the
restriction of orthonormality on the rotations.

Three-mode orthomax is not just a single method, but a class of methods. First,
different methods arise from different choices of the simplicity criterion (e.g.,
quartimax, varimax, or other orthomax criteria). Second, three-mode orthomax allows
for rotation of the core in all three directions simultaneously, but also, if desired, in
two or only one direction. Similarly, the criterion can measure simplicity of the core
in all three directions or a subset of those. The quartimax criterion always measures
simplicity in all three directions, because it operationalizes amount of simplicity by
the sum of fourth powers of all elements. Clearly, the order in which these fourth
powers are considered is irrelevant. For other criteria, the situation is quite different.
For instance, varimax measures simplicity as a sum of variances of columns of
squared loadings. In three-mode varimax these columns can be constituted in three
different ways, depending on the mode of interest. For instance, varimax in direction
A takes the variances of the columns that each contain the elements related to one of
the entries of mode A (e.g. "idealized individual"). Hence, varimax in direction A
aims at large variation between the elements corresponding to the same idealized
individual. Varimax simplicity in direction A does not imply varimax simplicity in
direction B, because large variances of squared loadings could be found per entry of
mode A, even when elements related to one entry in mode B are all equal. Finally, it
is possible to attach different weights to the criteria used for simplicity in direction A,
B, and C. Thus three-mode orthomax is a very flexible approach. Despite this
flexibility, the procedure for optimizing the three-mode orthomax criterion is
relatively simple: It consists of iteratively updating estimates for the three rotation
matrices by applying an ordinary orthomax procedure to a supermatrix computed
from the original core and the current values for the rotations. The method is
somewhat sensitive to local optima, but because the algorithm converges quickly,
using several restarts is an adequate and feasible way of dealing with this problem.

For practical applications of three-mode orthomax it is useful to have some guidelines


for which method to choose from the class of methods. First one has to decide in
which directions the core will be rotated. Of course, the more rotational freedom is
employed, the simpler the core can get. However, sometimes rotational freedom in
one or two directions may be exploited for rotating the component matrix rather than
the core to a desirable form. Next, one has to choose among the simplicity criteria.
Kiers (in press) reports a simulation study from which one might conclude that effects
of using different simplicity criteria tend to be small. Although in some situations the
use of quartimax should be discouraged because of its tendency to yield a general factor
(as is well known for the two-way case), in general quartimax and varimax behave
similarly, and it seems safest to start with varimax, keeping in mind that, if this does
not yield desirable results, other options should be considered.

As an example of the application of three-mode varimax, consider the cores reported in
Table 1. The first core is an unrotated 3MFA solution (details of which are to be
found in Kiers, in press). The second core results from three-mode varimax applied
to the unrotated core. It can be seen that the varimax rotated core is considerably
simpler than the unrotated core: most of its elements are extreme, and it has far
fewer medium-sized elements (in the intervals [−.5, −.2] or [.2, .5], say) than the
unrotated core.

Table 1: Unrotated exemplary core and three-mode varimax rotated core

Frontal planes of unrotated core

     2.07   0.06   0.21  |  0.07   0.14  -0.26  |  0.04   0.34  -0.25
    -0.19  -0.60  -0.39  |  0.99   0.08  -0.60  | -0.38   0.62  -0.36
    -0.12   0.59   0.29  |  0.13  -0.05  -0.13  | -0.14   0.22  -0.16

Frontal planes of three-mode varimax rotated core

     2.11   0.01  -0.03  |  0.00   0.33  -0.06  |  0.01   0.01   0.00
     0.03  -0.02   0.38  |  0.14   0.97  -0.06  |  0.97   0.13   0.59
     0.00  -0.14   0.84  |  0.12   0.06  -0.02  |  0.17   0.11   0.09

Three-mode orthomax can indirectly also be used for oblique core rotations, for
instance by combining it with normalization operations, in analogy to Harris and
Kaiser's (1964) orthoblique approach. Recently, Kiers (1995) proposed a different
procedure for oblique simple structure rotation of the core. He developed a straight-
forward generalization of the SIMPLIMAX procedure for oblique rotation of a
loading matrix to an optimal simple target (Kiers, 1994). Specifically, in three-mode
SIMPLIMAX, oblique rotation matrices for all three modes are found in such a way
that the m (a number to be specified in advance) smallest elements of the rotated core
have a minimal sum of squares (σ). The technique thus aims at simple structure in a
very explicit way. For example, when a 3×3×3 core is analyzed by SIMPLIMAX with
m = 20, the method finds a core in which 20 elements are optimally close to zero (as
expressed by the fact that the method minimizes their sum of squares), whereas the
other seven elements will be relatively large. How close to zero the smallest elements
are depends on the choice of m. If m is chosen very small, then the method will
often succeed in setting all m elements to exactly 0, but as m increases, the small
values (or at least their sum of squares) will increase as well. Because of this trade-
off relationship one should apply SIMPLIMAX with different values of m, and
search for the solution which has sufficiently many small elements that are sufficiently
close to zero. It should be noted that in three-mode SIMPLIMAX it is, just as in
three-mode orthomax, possible to rotate over only one or two modes, rather than all
three.

The method is computationally much less efficient than three-mode orthomax. The
algorithm for three-mode SIMPLIMAX consists of iterative application of the two-
mode SIMPLIMAX procedure, applied to supermatrices of frontal, lateral or horizon-
tal planes of the current rotated core matrix. This procedure turns out to be very
sensitive to local optima, and hence requires many starts from different starting
positions (say 200), which makes it relatively inefficient (e.g., using about one hour
for one full analysis of 200 runs on a 486 66MHz PC). Hence, better starting pro-
cedures, or approaches for avoiding local optima, are called for. On the other hand, in
practice the cores are usually rather small, and not many different values of m need to
be tested, as in the empirical, and, as far as size is concerned, rather typical example
reported below.

Table 2: Three-way SIMPLIMAX applied to a core reported by Kroonenberg (1994)

Original core (frontal planes next to each other)

     24  -26   -5  |  -2    1    1
     18   11    9  |  -3   -3    0
     -2  -12   15  |  -1   -6    7

Core rotated to 13 zeros (σ = 0.000)

     35.4    0      0    |   3.1    0      0
      0     24.0    0    |   0      0      0
      0      0     18.9  |   0      0     12.4

Core rotated to 14 zeros (σ = 2.958)

     35.5    0.0    0.0  |   0.9    0.3    0.0
      0.0   23.5    0.0  |   0.2   -1.4    0.0
      0.0    0.0   18.8  |   0.0    0.0   11.3

Core rotated to 15 zeros (σ = 10.967)

     33.9    0.0   -0.0  |   1.9   -0.6    0.0
      0.0   25.7   -0.1  |  -0.5   -2.5    0.0
      0.0    0.0    0.4  |  -0.0    0.0   21.6

Core rotated to 16 zeros (σ = 276.355)

     36.6    0.0   -0.7  |  -0.2    0.7   -1.4
      0.0   25.8    1.4  |   0.7    2.7   -0.4
      0.4    0.8   -9.3  |   2.2   -2.2   12.9

The core array used in the present example was reported by Kroonenberg (1994), and
is based on a three-mode factor analysis of scores of 82 subjects measured on five
variables (pertaining to performance and drunkenness) at eight occasions (at which
different doses of alcohol had been administered to them). The present 3×3×2 core
array has been rotated here by means of three-way SIMPLIMAX. We used the values
m = 12, ..., 16. For m = 12 and m = 13, three-way SIMPLIMAX found a solution in
which the smallest m elements were zero, up to the accuracy implied by the conver-
gence criterion used. In fact, it can be proven that any 3×3×2 core can be transfor-
med into a core with as many as thirteen exactly zero elements (Ten Berge, 1995),
which explains what we found here. Hence, the only nontrivial applications of
SIMPLIMAX are those with m = 14, m = 15 and m = 16. In each complete SIMPLI-
MAX analysis we used 200 random starts. The rotated cores, as well as the values of
the function σ (the sums of the smallest squared core elements), are given in Table 2, with
the (18 − m) highest elements in bold face. It can be seen that the core rotated towards
14 zeros indeed gives only four important core elements, the others being about ten
times as small or smaller. Even in the core rotated to 15 zeros the high values are at

least eight times as large as the small values. However, in the core rotated to 16
zeros the smallest elements are no longer negligible compared to the high values. It
can hence be concluded that this 3×3×2 core can be simplified tremendously and that,
in fact, the main relations between the components can be described in three or four
terms. Therefore, even though this procedure may destroy some of the simplicity of the
component matrices (Kroonenberg's matrix was obtained after 'simplicity' rotations of
two of the component matrices), the gain in simplicity of the core is worth consider-
ing.

The algorithms used in SIMPLIMAX can also be used for rotations to a target in
which the positions of the small elements are fixed. In this way, the procedure can
also be used for superdiagonalization, and hence a new technique has become available
for superdiagonalization by means of oblique rotation.

4. Discussion
In the present paper, an overview has been given of recent developments concerning
simplification of the core. Simplification of the core has been proposed as an aid for
interpreting a 3MFA solution. The interpretation of a 3MFA solution usually starts by
giving interpretations to the components for the three modes, and next, the relations
between these modes, as reflected in the core, are considered. The developments
discussed here all focused on simplifying the core, none aimed at simplifying the
interpretation of the component matrices. In fact, methods yielding simplified cores
may yield component matrices that are hard to interpret. In other words, from the
former situation where component matrices could be interpreted easily, but the core
made the results rather complex, we are now at the other extreme where the relations
are made simple, but the component matrices may be rather complex. One way to
deal with this problem, as suggested by Kiers (1995, in press) is to consider rotation
of the core in only one or two directions, and not affect those component matrices
that are very important in interpretation, and for which simple interpretations are
available. Of course, such approaches do not always work. On the one hand, it is
possible that for all three component matrices simple interpretations are available,
which one does not wish to disturb; on the other hand, it is possible that using only
one or two rotation directions no longer leads to a simple core. However, even in
those situations it is conceivable that a solution exists in which the core as well as the
component matrices are reasonably simple. A way to find such solutions would be to
define and optimize criteria that combine simplicity of the component matrices and
simplicity of the core. Because of the interdependence of the rotation of the core and
of the component matrices, optimization of such combined simplicity criteria seems
far from trivial.

An alternative position can be taken as well: Methods that simplify the core can be
seen as methods that simplify the structure of the model, just as CANDECOMP/PA-
RAFAC is a simpler model than 3MFA. The fact that the components themselves are
not related in a simple way to the original individuals, variables, occasions, or
whatever, can be deemed less disturbing. For instance, it can be deemed acceptable
that, when a 3x3x3 core is reduced to only 4 nonnegligible elements, the 4 ensuing
tensor product terms are somewhat complicated to interpret. The alternative of trying

to grasp up to 27 interactions between (more simple) components does not seem more
attractive. Using a few tensor product terms based on more complex components is a
way of moving most of the interactions into the components, and once these are
conceptualized, the model becomes easy to grasp.

One new development discussed here concerned constrained 3MFA methods.


Probably the most surprising finding was that the core can be constrained to have
very many zero elements at no or very little cost. This has led to the study of
procedures for reducing a core to a maximally parsimonious core without loss of fit.
This study has been and will be aided by the availability of three-way SIMPLIMAX.
By means of three-way SIMPLIMAX one can search how many elements of a
particular core can be set to zero by means of oblique rotations (and hence without
loss of fit). In this way, three-way simplicity rotation can be used to obtain useful
C3MFA models.

References:

Carlier, A., Lavit, Ch., Pages, M., Pernin, M.O. and Turlot, J.C. (1989): Analysis of data tables
indexed by time: a comparative review. In: Multiway data analysis, Coppi, R. and Bolasco, S. (Eds.),
85-101. Amsterdam, Elsevier Science Publishers.
Carroll, J. B. (1953): An analytic solution for approximating simple structure in factor analysis,
Psychometrika, 18, 23-38.
Carroll, J.D. and Chang, J.-J. (1970): Analysis of individual differences in multidimensional scaling
via an n-way generalization of "Eckart-Young" decomposition, Psychometrika, 35, 283-319.
Cohen, H.S. (1974): Three-mode rotation to approximate INDSCAL structure (TRIAS), Paper presented
at the Psychometric Society Meeting, Palo Alto.
Cohen, H.S. (1975): Further thoughts on three-mode rotation to INDSCAL structure, with jackknifed
confidence regions for points, Paper presented at U.S.-Japan seminar on Theory, Methods and
Applications of Multidimensional Scaling and Related Techniques. La Jolla.
Coppi, R. and Bolasco, S. (Eds.) (1989): Multiway data analysis, Amsterdam, Elsevier Science
Publishers.
De Leeuw, J. and Pruzansky, S. (1978): A new computational method to fit the weighted Euclidean
distance model, Psychometrika, 43, 479-490.
Ferguson, G.A. (1954): The concept of parsimony in factor analysis, Psychometrika, 19, 281-290.
Harris, C.W. and Kaiser, H.F. (1964): Oblique factor analytic solutions by orthogonal transformations,
Psychometrika, 29, 347-362.
Harshman, R.A. (1970): Foundations of the PARAFAC procedure: models and conditions for an
"explanatory" multi-mode factor analysis, UCLA Working Papers in Phonetics, 16, 1-84.
Harshman, R.A. and Lundy, M.E. (1984): The PARAFAC model for three-way factor analysis and
multidimensional scaling, In: Research methods for multimode data analysis, Law, H.G., Snyder,
C.W., Hattie, J.A. and McDonald, R.P. (Eds.), 122-215, New York, Praeger.
Henrion, R. (1993): Body diagonalization of core matrices in three-way principal components analysis:
Theoretical bounds and simulation, Journal of Chemometrics, 7, 477-494.
Jennrich, R.I. (1970): Orthogonal rotation algorithms, Psychometrika, 35, 229-235.
Kaiser, H.F. (1958): The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187-200.
Kiers, H.A.L. (1991): Hierarchical relations among three-way methods, Psychometrika, 56, 449-470.

Kiers, H.A.L. (1992): TUCKALS core rotations and constrained TUCKALS modelling, Statistica
Applicata, 4, 659-667.
Kiers, H.A.L. (1994): SIMPLIMAX: Oblique rotation to an optimal target with simple structure,
Psychometrika, 59, 567-579.
Kiers, H.A.L. (1995): Three-way SIMPLIMAX for oblique rotation of the three-mode factor analysis
core to simple structure, Manuscript submitted for publication.
Kiers, H.A.L. (in press): Three-mode Orthomax rotation, Psychometrika.
Kiers, H.A.L., ten Berge, J.M.F. and Rocci, R. (in press): Uniqueness of three-mode factor models
with sparse cores: The 3×3×3 case, Psychometrika.
Kroonenberg, P.M. (1983): Three-mode principal component analysis: Theory and applications,
Leiden, DSWO Press.
Kroonenberg, P.M. (1994): The TUCKALS line: A suite of programs for three-way data analysis,
Computational Statistics and Data Analysis, 18, 73-96.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by
means of alternating least squares algorithms, Psychometrika, 45, 69-97.
Kruskal, J.B. (1988): Simple structure for three-way data: A new method intermediate between 3-mode
factor analysis and PARAFAC-CANDECOMP, Paper presented at the 53rd Annual Meeting of the
Psychometric Society, Los Angeles, June 27-29.
Law, H.G., Snyder, C.W., Hattie, J.A. and McDonald, R.P. (Eds.)(1984): Research methods for
multimode data analysis, New York, Praeger.
MacCallum, R.C. (1976): Transformations of a three-mode multidimensional scaling solution to
INDSCAL form, Psychometrika, 41, 385-400.
Murakami, T. (1983): Quasi three-mode principal component analysis - A method for assessing factor
change, Behaviormetrika, 14, 27-48.
Murakami, T., Ten Berge, J.M.F. and Kiers, H.A.L. (1996): A class of core matrices in three-mode
principal components analysis which can be transformed to have a majority of vanishing elements,
Manuscript submitted for publication.
Neuhaus, J.O. and Wrigley, C. (1954): The quartimax method: An analytic approach to orthogonal
simple structure, British Journal of Mathematical and Statistical Psychology, 7, 81-91.
Rocci, R. (1992): Three-mode factor analysis with binary core and orthonormality constraints, Journal
of the Italian Statistical Society, 3, 413-422.
Saunders, D.R. (1953): An analytic method for rotation to orthogonal simple structure, Research
Bulletin, RB 53-10, Princeton, New Jersey, Educational Testing Service.
Ten Berge, J.M.F. (1995): How sparse can core arrays get: The 3×3×2 case, Unpublished note.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis, Psychometrika, 31,
279-311.
Tucker2 as a Second-order
Principal Component Analysis
Takashi Murakami
School of Education, Nagoya University
Furo-cho, Chikusa-ku, Nagoya
464-01, Japan

Summary: Statistical properties of the Tucker2 (T2) model, a simplified version of three-
mode principal component analysis (PCA), are investigated aiming at applications to the
study of factor invariance. The T2 model is derived as a restricted form of second-order
PCA in the situation comparing component loadings and component scores across occa-
sions. Several statistical interpretations of coefficients obtained from the least squares
algorithm of T2 are proposed, and several aspects of T2 are shown to be natural extensions
of characteristics of classical PCA. A scale free formulation of T2 and a new derivation of
the algorithm for the large sample case are also shown. The relationship with a generalized
canonical correlation model is suggested.

1. Introduction
Consider a set of data collected by administering an inventory consisting of p items to
n subjects on m occasions with or without changing the conditions. In this situation,
we may be interested in the comparisons between factor loadings of the same items
(variables) and between factor scores of the same subjects on different occasions. This
is the factor invariance problem.
In the present article, we will investigate how to use the Tucker2 (T2) model, a sim-
plified version of three-mode principal component analysis (three-mode PCA; Tucker,
1966; Kroonenberg & De Leeuw, 1980), as a tool for the study of factor invariance.
While T2 is a very general model of multidimensional analysis of three-way data,
we will specify the manner of preprocessing of input data and of transformation of
output coefficients which are appropriate to the restricted purpose. In Section 2, we
will recapitulate several formulations of classical PCA, and distinguish two solutions;
PCA-1 and PCA-2. We will prefer PCA-1 because of several favorable aspects; rota-
tional freedom, statistical convenience in interpretations of coefficients, and the scale
free derivation. In Section 3, we will formulate T2 as a restricted second-order PCA,
and will show that the least squares solution obtained by the TUCKALS2 algorithm
(Kroonenberg & De Leeuw, 1980) has almost all properties of classical PCA, which
",ill be listed in Section 2. We will find two classes of solutions; T2-1 and T2-2,
neither are the same as the standard formulation. We will conclude the T2-1 is more
favorable due to the similar convenient properties to PCA-1 while T2-2 is also useful
for straightforward derivation of the algorithm for large sample data.

2. Classical Principal Component Analysis Revisited


2.1 Two solutions of classical PCA
There are many ways to formulate classical PCA (e.g. Ten Berge & Kiers, 1996). We
will start from the approximation of the matrix of input data by the factor analytical
model, namely, the minimization of the following function,
$f(F, A^*) = \|Z - FA^{*\prime}\|^2, \qquad (1)$


where Z is an $n \times p$ matrix of input data, F an $n \times q$ matrix of component scores, and $A^*$ a $p \times q$ matrix of component loadings. For simplicity, we will assume that $n > p$ and that the rank of Z equals p. In addition, we assume that Z is centered and standardized: $Z'\mathbf{1} = 0$ and $\mathrm{diag}\,Z'Z = I$ (rather than $\mathrm{diag}\,Z'Z = nI$). Hence $R = Z'Z$ is the correlation matrix between the input variables.
The constraint is imposed on F such that the problem is to minimize (1) subject to
F' F = I. (We will use a superscript asterisk to denote that no constraints will be
imposed on the matrix.) The minimum will be attained by the use of the singular
value decomposition (SVD), Z = PDQ', where P is the n x p column-wise orthonor-
mal matrix, D, the p x p diagonal matrix of singular values arranged in descending
order, and Q, the p x p orthonormal matrix. Then, the solution is given by

$F = P_qT, \qquad A^* = Q_qD_qT, \qquad (2)$
where $P_q$ contains the first q columns of P, $Q_q$ contains the first q columns of Q, $D_q$ is the upper left $q \times q$ submatrix of D, T is a $q \times q$ arbitrary orthonormal matrix, and q is the number of components, satisfying $q < p$ (cf. Ten Berge, 1993, pp. 35-36). The
formulation can be seen to be a kind of regression problem where columns of F are
predictor variables and elements of A* are regression coefficients to predict columns
of Z. The matrix A* in this sense is sometimes called the pattern matrix.
Because constraints are essentially inactive in this problem (Ten Berge, 1993), we can change the constraints without loss of optimality and instead minimize

$f(F^*, A) = \|Z - F^*A'\|^2, \qquad (3)$

subject to A' A = I. The solution will be given by

$F^* = P_qD_qU, \qquad A = Q_qU, \qquad (4)$
where U is an arbitrary square orthonormal matrix. In this formulation it is usual to set U = I, since otherwise U destroys the orthogonality of the columns of $F^*$.
We will call the class of PCA formulations leading to (2) PCA-1, and the one producing (4) PCA-2. Solutions of PCA with various criteria and constraints result in either of them as long as the analysis is based on the centered and standardized data matrix Z. Of course, we can transform one solution into the other through $F^* = FD_q$ and $A = A^*D_q^{-1}$ if $T = U = I$, and $F^*A' = FA^{*\prime}$ irrespective of T and U.
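To make the two normalizations concrete, here is a minimal numerical sketch (our own illustration in Python with NumPy; it is not part of the original paper, and all variable names are ours) that computes both solutions from one SVD and checks that they give the same rank-q fit:

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 6, 2

# centered and standardized data: Z'1 = 0 and diag Z'Z = I
X = rng.standard_normal((n, p))
Z = X - X.mean(axis=0)
Z = Z / np.sqrt((Z ** 2).sum(axis=0))

P, d, Qt = np.linalg.svd(Z, full_matrices=False)
Pq, Dq, Qq = P[:, :q], np.diag(d[:q]), Qt[:q, :].T

F, A_star = Pq, Qq @ Dq          # PCA-1 (T = I): F'F = I
F_star, A = Pq @ Dq, Qq          # PCA-2 (U = I): A'A = I

assert np.allclose(F @ A_star.T, F_star @ A.T)      # identical fitted matrices
assert np.allclose(F_star, F @ Dq)                  # F* = F D_q
assert np.allclose(A, A_star @ np.linalg.inv(Dq))   # A = A* D_q^{-1}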

2.2 Statistical meanings of PCA


We will examine the statistical meanings of PCA in some detail. First, we will return to (1) and (3). By the use of regression theory, we obtain

$A^* = Z'F \qquad (5)$


and
$F^* = ZA, \qquad (6)$
irrespective of F and A as long as they satisfy the orthonormality constraints. Formally, these equations represent a kind of dual relation (Nishisato, 1994, p. 102). In
addition, they have some interesting interpretations; (5) defines the structure matrix,
namely, the correlation matrix between input variables and (standardized) component
scores while (6) defines the component scores as linear composites of input variables
by the use of A as the matrix of weights. These interpretations are not only useful
in applications but also give us some important insights into PCA.

For example, we can see that both $F^*$ and F are matrices of linear composites of the input variables; namely, the columns of $F^*$ and F lie in the column space of Z, although this fact is not explicit in (1) and (3). Hence, we know that the minimum of the function $f(F, A^*)$ in (1) is equal to the minimum of
$f(V, A^*) = \|Z - ZVA^{*\prime}\|^2, \qquad (7)$
where V is the $p \times q$ matrix of weights constrained by $V'RV = I$ and given by $V = Q_qD_q^{-1}T$ (Ten Berge & Kiers, 1996). This can be much simplified in PCA-2;
substituting (6) into (3), we have

$f(A) = \|Z - ZAA'\|^2, \qquad (8)$


in which the matrix A plays two roles: the weight matrix and the pattern matrix.
By the use of (8), we can rewrite the minimization problems into the maximization
problems. After some operations using A' A = I, we have

$f(A) = \mathrm{tr}\,R - \mathrm{tr}\,A'Z'ZA = p - \mathrm{tr}\,A'RA, \qquad (9)$


which means that the PCA-2 solution is also obtained by the maximization of $\mathrm{tr}\,A'RA$ under the constraint $A'A = I$. This is the most popular formulation of classical PCA in the literature, because it directly leads to the well-known solution $A = K_q$, where $K_q$ is the matrix of eigenvectors associated with the q largest eigenvalues of R.
Next, substituting (5) into (1), we obtain

$f(F, A^*) = \mathrm{tr}\,R - \mathrm{tr}\,F'ZZ'F = p - \mathrm{tr}\,A^{*\prime}A^*. \qquad (10)$


This implies that PCA-1 can be interpreted as the maximization of the sum of squares of the elements of the structure matrix, $\mathrm{tr}\,A^{*\prime}A^*$.
Ten Berge and Kiers (1996) called the formulation based on the maximization of
$\mathrm{tr}\,A'RA$ Hotelling PCA, and the approximation of a data matrix by the factor model like (1) Pearson PCA. Their classification does not perfectly overlap with our PCA-1 and PCA-2. For example, they classify (3) as Pearson PCA. They prefer Pearson PCA to Hotelling PCA due to its full rotational freedom including oblique rotation, the elegance of the idea of least squares predictive efficiency, and the convenience of orthonormal component scores (only if one uses the constraint of PCA-1 and does not perform an oblique rotation). We can expect that the loading matrix $A^*$ is the pattern matrix and the
structure matrix simultaneously, and this property does not change after orthogonal rotation because F is orthonormal. A squared element of $A^*$ is the amount of variance in the variable explained by the component, the row sums of these squares are the squared multiple correlations of the variables, and so forth. However, the matrix of weights V in (7) has a slightly more complicated relationship with $A^*$: $V = A^*(A^{*\prime}A^*)^{-1}$ (cf. Ten Berge, 1986, p. 36).
PCA-2 has other kinds of good properties: A plays the roles of both the weight matrix and the pattern matrix, as shown in (8), and the algorithm is derived straightforwardly as the eigenproblem of R. However, resistance to rotation seems to be a shortcoming of this solution. Moreover, the scale-free formulation of PCA given below will be based on PCA-1.
2.3 Scale free formulation of classical PCA
We have used the centered and standardized data array Z without any excuse so far.
The choice of scale transformations is a crucial problem because there is no simple
formula, for example, to transform the result of SVD of raw score matrix X into that
of Z. Because variables do not share the same measurement unit and origin in many
applications, centering and standardizing are practically useful. We want a scale free
formulation of PCA to justify it.
Meredith and Millsap (1985) proposed a scale free formulation justifying the standardization, based on the maximization of the sum of squared multiple correlations. We will extend it to a justification of the centering.
Let us define the matrix $A^\dagger$ as
$A^\dagger = D_S^{-1/2}X'F, \qquad (11)$

where X is the matrix of raw input data, and $D_S$ is the diagonal matrix of the variances of the columns of X. We will introduce one more constraint, $F'\mathbf{1} = 0$, as well as $F'F = I$. If we define the $n \times n$ matrix $J = I - n^{-1}\mathbf{1}\mathbf{1}'$, then $F'\mathbf{1} = 0$ means $F = JF$ (Ten Berge, 1993, p. 66), hence $X'F = X'JF$. Because $X'J = (X - \mathbf{1}\bar{x}')'$, we obtain that $A^\dagger = Z'F$, where $Z' = D_S^{-1/2}(X - \mathbf{1}\bar{x}')'$. As a result, the elements of $A^\dagger$ are correlation coefficients between variables and components although the variables are not centered in (11), and we know that $A^\dagger$ is equal to $A^*$ in (5).
:\"ow, we have justified not only the preprocessing of centering and standardization
but also constraints of PCA-I. In other words. we can use classical PCA of standard-
ized variables with orthogonal rotation as the maximization of tr At' A t under the
constraints of F'l = 0 and P' F = I. Our result may not be so impressive because
it looks to depend solely on the word correlation. However, the same rationale will
produce the result which is not necessarily tri"vial on T2 in the next section.
Here, we will list the formulae for the final output of PCA-1, to compare them with those for T2. The loading matrix is given by
$A^* = K_q\Lambda^{1/2}T, \qquad (12)$
where $\Lambda$ is the diagonal matrix of the q largest eigenvalues of R, $K_q$ the matrix of corresponding eigenvectors as before, and T an arbitrary orthogonal matrix determined by, say, the Varimax method. Then, the matrix of component scores is obtained by
$F = ZK_q\Lambda^{-1/2}T, \qquad (13)$
which is equivalent to ZV in (7). These equations are the same as those resulting from the algorithms for PCA or "factor analysis with unit diagonals" in many standard statistical packages.

2.4 Asymmetric role of scores and coefficients


Gifi (1990, pp. 49-50) distinguished multivariate analysis (MVA) from multidimensional analysis (MDA) based on the asymmetric roles of the rows (subjects) and columns (variables) of the $n \times p$ matrix of input data. While the latter treats the matrix as an arbitrary rectangular matrix, the former uses it as p random variables and regards the rows of the matrix as a random sample of size n, even when no probabilistic model is assumed explicitly. Their definition of MVA is the study of "systems of random variables or random samples from such systems." We accept the standpoint of Gifi, and treat PCA as MVA in his sense.
The asymmetry of rows and columns of a matrix of data is extended to the asym-
metry of two matrices obtained by PCA, for example, F and A* in (1). While F is
the matrix of q derived random variables, A* is the matrix of some coefficients with
the definite statistical meaning rather than values of random variables. As was men-
tioned before, elements of A* have two implications; coefficients of correlation and
coefficients of regression. While the coefficients themselves will be objects of interpretation, individual values on the variables per se are not usually interpreted, especially in
large sample cases. Properties of components are interpreted through the coefficient
matrix and the relationships with external information.
Our treatment of PCA reflects these asymmetries. For example, we are willing to rotate the result of PCA-1 because the orthogonal rotation keeps the orthonormality of the derived variables.
The reason why we have emphasized the asymmetric treatment is that T2 had been
formulated as a method of MDA rather than MVA in Gifi's sense. In the sequel,
we will not necessarily follow the standard symmetric notations and treatments of
three-mode PCA as in, for example, Kroonenberg (1983) because we consider that
the factor invariance problem commonly involves the asymmetry.

3. T2 as a second-order PCA
3.1 Derivation of the T2 model as a second-order PCA
Let $Z_k$ ($k = 1, \ldots, m$) be the $n \times p$ matrix of data on the k-th occasion, namely, the k-th frontal plane of a three-mode data array in the terminology of three-mode analysis. From the standpoint of the asymmetric roles of the rows and columns of a data matrix mentioned above, we consider that the three-mode array consists of mp random variables. We will postpone specifying the manner of centering and standardization of $Z_k$, but we assume that $\mathrm{diag}\sum_{k=1}^{m}Z_k'Z_k = mI$. We also assume that $n > mp$ and that all mp columns of data are linearly independent.
The simplest way of applying classical PCA to these matrices may be the separate
analysis of p variables on each occasion such as
$Z_k \approx F_kA_k', \qquad k = 1, \ldots, m, \qquad (14)$
where $F_k$ is an $n \times q$ ($q < p$) matrix of component scores, and $A_k$ is a $p \times q$ matrix of component loadings. (We will adhere to PCA-1.) It looks easy to attain the two aims mentioned in the introduction, namely, comparisons between factor loadings and factor scores obtained on different occasions. To do so, one can compare the matrices of loadings on different occasions directly, or compute indices such as coefficients of congruence between them (e.g. Ten Berge, 1986), and compute correlation coefficients between component scores. However, there are several problems with these methods.
First, the comparisons are not so easy when m and q are large. Second, rotation
which can be performed in each condition separately may bring indeterminacy to
any indices for comparisons, for example, correlation coefficients between component
scores. Third, there may be much redundancy in mq columns of loadings and scores
which makes estimates of coefficients unstable and interpretations of results confusing.
A simple way to partially avoid these difficulties is applying PCA-1 to the $mn \times p$ matrix obtained by juxtaposing all frontal planes vertically;
$Z_k \approx F_kA^{*\prime}, \qquad k = 1, \ldots, m, \qquad (15)$
where A* is a p x q common loading matrix, and we assume that p > q. (Here, we re-
gard the data array as a sample of size nm on p variables temporarily.) Although (15)
is more parsimonious than (14), some of the mq columns of the $F_k$'s can remain redundant. Hence, we will apply PCA-1 again to the $n \times mq$ matrix obtained by juxtaposing the $F_k$'s horizontally (we return to a sample of size n);
$F_k \approx GC_k', \qquad k = 1, \ldots, m, \qquad (16)$
where $C_k$ is a $q \times r$ matrix of second-order loadings and G is an $n \times r$ matrix of second-order component scores. From the relationship between the ranks of these matrices, it follows that $r \le mq$ and $q \le mr$. By substituting (16) into (15), we get an equation having the same form as T2: $Z_k \approx GC_k'A^{*\prime}$. This is a kind of second-order PCA with an equality restriction on the first-order loadings (Bloxom, 1984), but $\approx$ does not mean the least squares approximation of the model. We will obtain the three matrices
simultaneously in the least squares sense by minimizing
$f(G, C_k, A^*) = \sum_{k=1}^{m}\|Z_k - GC_k'A^{*\prime}\|^2, \qquad (17)$
subject to $G'G = I$ and $\sum_{k=1}^{m}C_kC_k' = mI$. We can impose constraints on G and $C_k$ because the T2 model is a product of three matrices. We will call the minimization of (17) the T2-1 problem. The original problem for T2 by Kroonenberg and De Leeuw (1980), with constraints different from those of T2-1, will be introduced later.
One can distinguish the change of loadings from the change of scores by checking the pattern appearing in $C_k$, unless the changes are very drastic. Hence T2-1 can be used as a tool for the study of factor changes on the descriptive level. This is illustrated in Murakami (1983, pp. 31-34), and we will not repeat it here. We will only point out that the simple structure attained by the orthogonal rotation in (43) is crucial for such interpretations.
Next, we redefine the first-order components in (15) as
$F_k = GC_k', \qquad k = 1, \ldots, m, \qquad (18)$
where G and $C_k$ are given by the minimization of (17) rather than by the heuristic solutions in (16). Correspondingly, we will call $A^*$ the matrix of the first-order loadings. $F_k$ defined in (18) has some convenient aspects. First, it is orthonormal in the sense of $m^{-1}\sum_{k=1}^{m}F_k'F_k = I$. Second, as will be shown, the formula similar to (5),
$A^* = m^{-1}\sum_{k=1}^{m}Z_k'F_k, \qquad (19)$
will give the basis for a scale free formulation of T2. Third, another analogous equation is derived immediately from (18) by the use of $G'G = I$;
$C_k = F_k'G, \qquad k = 1, \ldots, m, \qquad (20)$
which means that $C_k$ is a kind of structure matrix, whose elements are covariances between the first-order components and the second-order components.
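These identities are easy to verify numerically; in the following sketch (our own Python/NumPy illustration, not from the paper), errorless T2 data are generated from matrices satisfying the T2-1 constraints, so that (18)-(20) hold exactly:

import numpy as np

rng = np.random.default_rng(2)
n, p, m, q, r = 50, 4, 3, 3, 2

# G with G'G = I
G, _ = np.linalg.qr(rng.standard_normal((n, r)))
# C_k (q x r) rescaled so that sum_k C_k C_k' = m I
C = rng.standard_normal((m, q, r))
S = sum(Ck @ Ck.T for Ck in C)
L = np.linalg.cholesky(S / m)
C = np.array([np.linalg.inv(L) @ Ck for Ck in C])
A_star = rng.standard_normal((p, q))

Zs = [G @ Ck.T @ A_star.T for Ck in C]        # exact T2 data, Z_k = G C_k' A*'
Fs = [G @ Ck.T for Ck in C]                   # first-order components, eq. (18)

assert np.allclose(sum(Ck @ Ck.T for Ck in C), m * np.eye(q))          # constraint
assert np.allclose(sum(F.T @ F for F in Fs) / m, np.eye(q))            # orthonormality
assert np.allclose(sum(Z.T @ F for Z, F in zip(Zs, Fs)) / m, A_star)   # eq. (19)
assert np.allclose(Fs[0].T @ G, C[0])                                  # eq. (20)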
There may be another approach to the second-order PCA of three-mode data; we can derive it through the definition of linear composites. First, we define the matrix of the first-order composites for each occasion as
$F_k^* = Z_kA, \qquad k = 1, \ldots, m. \qquad (21)$



The weights held constant across occasions, such as those used in (21), are sometimes called stationary weights (cf. Meredith & Tisak, 1982). Next, let us define the matrix of the second-order composites as
$G^* = \sum_{k=1}^{m}F_k^*\tilde{C}_k. \qquad (22)$
Then, we can formulate the second-order PCA problem, in a way similar to Hotelling PCA, as the maximization of the following function under the constraints $A'A = I$ and $\sum_{k=1}^{m}\tilde{C}_k'\tilde{C}_k = I$;
$g(A, \tilde{C}_1, \tilde{C}_2, \ldots, \tilde{C}_m) = \mathrm{tr}\sum_{k=1}^{m}\sum_{l=1}^{m}\tilde{C}_k'A'R_{kl}A\tilde{C}_l, \qquad (23)$
where $R_{kl} = Z_k'Z_l$. The $\tilde{C}_k$ should be distinguished from the $C_k$ because they differ in the direction of their constraints. We will call the maximization of (23) the T2-2 problem.
As will be shown later, the relationship between G in (17) and $G^*$ in (22) is simple. However, the relationship between the first-order composites $F_k^*$ and the first-order components $F_k$ defined in (18) is somewhat complicated and must be defined separately, because the columns of the former span an mq-dimensional subspace of the space spanned by the mp columns of the $Z_k$'s, while the latter exist only in the r-dimensional column space of G. (Note that $r \le mq$.)
3.2 Equivalence of two formulations of second-order PCA to T2
Kroonenberg and De Leeuw (1980) defined TUCKALS2 as the algorithm minimizing the following function;
$f(G, C_1^*, \ldots, C_m^*, A) = \sum_{k=1}^{m}\|Z_k - GC_k^{*\prime}A'\|^2, \qquad (24)$
where G is the $n \times r$ orthonormal matrix, $C_k^*$ the $q \times r$ frontal plane of a three-mode core matrix, and A the $p \times q$ orthonormal matrix. This is the original T2 formulation.
Analogous to the case of (1) and (3), we can change the constraints without loss of optimality; hence we can also consider the minimization problem of (17) and of
$f(G^*, \tilde{C}_1, \ldots, \tilde{C}_m, A) = \sum_{k=1}^{m}\|Z_k - G^*\tilde{C}_k'A'\|^2 \qquad (25)$
subject to $\sum_{k=1}^{m}\tilde{C}_k'\tilde{C}_k = I$ and $A'A = I$.
Assuming that we have the solution of (24), define
$\Lambda = \sum_{k=1}^{m}C_k^*C_k^{*\prime} \quad \text{and} \quad \Delta = \sum_{k=1}^{m}C_k^{*\prime}C_k^*. \qquad (26)$
Kroonenberg and De Leeuw (1980) showed that both $\Lambda$ and $\Delta$ are diagonal matrices of eigenvalues of positive definite matrices. Then
$A^* = m^{-1/2}A\Lambda^{1/2}T \quad \text{and} \quad C_k = m^{1/2}T'\Lambda^{-1/2}C_k^*U \quad \text{(with G replaced by GU)}, \qquad (27)$
and
$G^* = G\Delta^{1/2} \quad \text{and} \quad \tilde{C}_k = C_k^*\Delta^{-1/2}, \qquad (28)$
where T and U are arbitrary orthonormal square matrices. We did not introduce rotational freedom into (28) for the same reason as in the case of PCA-2. It is easy to verify that the matrices defined above produce the same optimum in (17) and (25) as that of (24), and that $C_k$, $\tilde{C}_k$ and $G^*$ satisfy their corresponding constraints.
Similar to the case of classical PCA, we can also convert the minimization problems into maximization ones. On the one hand, by applying regression theory to (17), and noting that $\sum_{k=1}^{m}C_kG'GC_k' = mI$, we obtain
$A^* = m^{-1}\sum_{k=1}^{m}Z_k'GC_k', \qquad (29)$

which is equivalent to (19), and can be regarded as an extension of its counterpart in classical PCA, (5). Substituting (29) into (17), we have a formula of the same form as (10);
$f(G, C_1, C_2, \ldots, C_m, A^*) = \mathrm{tr}\sum_{k=1}^{m}R_{kk} - m\,\mathrm{tr}\,A^{*\prime}A^* = mp - m\,\mathrm{tr}\,A^{*\prime}A^*. \qquad (30)$
On the other hand, using (25) and $\sum_{k=1}^{m}\tilde{C}_k'A'A\tilde{C}_k = I$, we obtain
$G^* = \sum_{k=1}^{m}Z_kA\tilde{C}_k, \qquad (31)$

which is equal to (22), and can be regarded as an extension of (6). Substituting this into (25), we also obtain a formula which is a natural extension of (8);
$f(A, \tilde{C}_1, \tilde{C}_2, \ldots, \tilde{C}_m) = \sum_{k=1}^{m}\Bigl\|Z_k - \Bigl(\sum_{l=1}^{m}Z_lA\tilde{C}_l\Bigr)\tilde{C}_k'A'\Bigr\|^2, \qquad (32)$

where the matrix $A\tilde{C}_k$ has two roles: as a weight matrix and as a pattern matrix. It is easy to confirm that the minimization of (32) is equivalent to the T2-2 problem, the maximization of (23), because (32) can be written as $mp - \mathrm{tr}\sum_{k=1}^{m}\sum_{l=1}^{m}\tilde{C}_k'A'R_{kl}A\tilde{C}_l$. We also point out that the relationship between (29) and (31) is analogous to the dual relationship between (5) and (6).
As was the case with classical PCA, the implications of the two solutions are remarkably different. We prefer T2-1 to T2-2 for the same reasons we prefer PCA-1 to PCA-2 (rotational freedom, the convenient properties of $A^*$, and the origin- and scale free formulation derived below), notwithstanding several attractive properties of T2-2 such as (32).

3.3 Scale free formulation of T2-1


We will reformulate an origin- and scale free version of the second-order PCA on the basis of T2-1, in the same way as for classical PCA. We will start from the redefined structure matrix of the first-order composites;
$A^\dagger = m^{-1}D_S^{-1/2}\sum_{k=1}^{m}X_k'GC_k', \qquad (33)$

where $X_k$ is the matrix of raw input data on the k-th occasion, and $D_S$ is the diagonal matrix of variances of the variables, centered in such a way as to transform $A^\dagger$ into a correlation matrix.
The process is almost the same as in the case of classical PCA in Section 2.3. First, we add the constraint $G'\mathbf{1} = 0$, which means $G = JG$. Therefore, we have $X_k'G = (X_k - \mathbf{1}\bar{x}_k')'G$, where $\bar{x}_k = n^{-1}X_k'\mathbf{1}$. Hence, we know that $D_S$ must be defined as $D_S = m^{-1}\,\mathrm{diag}\sum_{k=1}^{m}(X_k - \mathbf{1}\bar{x}_k')'(X_k - \mathbf{1}\bar{x}_k')$.

This suggests that a sufficient condition for obtaining G such that $A^\dagger$ defined in (33) has the interpretation of a structure matrix is the transformation
$Z_k = (X_k - \mathbf{1}\bar{x}_k')D_S^{-1/2}, \qquad (34)$
which satisfies $Z_k'\mathbf{1} = 0$ and $\mathrm{diag}\sum_{k=1}^{m}Z_k'Z_k = mI$, and we then have $A^\dagger = A^*$, where $A^*$ is given in (29).
Although the above discussion may look almost trivial, we should consider that there are some other possible methods of preprocessing. First, the seemingly plausible transformation defined by $Z_k = (X_k - \mathbf{1}\bar{x}')D_S^{-1/2}$, where $\bar{x} = (mn)^{-1}\sum_{k=1}^{m}X_k'\mathbf{1}$ and $D_S = \mathrm{diag}\sum_{k=1}^{m}(X_k - \mathbf{1}\bar{x}')'(X_k - \mathbf{1}\bar{x}')$, is not a sufficient condition to make $A^\dagger$ a structure matrix. Second, another transformation, $Z_k = (X_k - \mathbf{1}\bar{x}_k')D_{S_k}^{-1/2}$ with $D_{S_k} = \mathrm{diag}\,(X_k - \mathbf{1}\bar{x}_k')'(X_k - \mathbf{1}\bar{x}_k')$, i.e. standardization for each k separately, is also plausible. This possibility is not precluded, but it is somewhat spurious, notwithstanding Murakami's (1983) early recommendation.
We will not assert that the preprocessing (34) is universally valid. Theoretical and empirical studies to find better methods of preprocessing (e.g. Kroonenberg, 1983) are meaningful for the vast class of applications. Our conclusion is limited to the study of factor comparisons as defined in 3.1.
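For reference, the preprocessing (34) amounts to the following small sketch (our own Python/NumPy illustration, not from the paper; the function name is ours): each occasion is centered separately, and all occasions are scaled by the pooled within-occasion variances.

import numpy as np

def preprocess(Xs):
    # Transformation (34): occasion-wise centering, pooled standardization.
    # Xs is a list of m raw data matrices X_k, each n x p; the result satisfies
    # Z_k'1 = 0 and diag(sum_k Z_k'Z_k) = m I.
    m = len(Xs)
    Xc = [X - X.mean(axis=0) for X in Xs]              # X_k - 1 xbar_k'
    D_S = sum((X ** 2).sum(axis=0) for X in Xc) / m    # pooled variances (diag of D_S)
    return [X / np.sqrt(D_S) for X in Xc]

rng = np.random.default_rng(3)
Zs = preprocess([rng.standard_normal((30, 4)) * 5 + 2 for _ in range(3)])
assert all(np.allclose(Z.sum(axis=0), 0) for Z in Zs)
assert np.allclose(np.diag(sum(Z.T @ Z for Z in Zs)), len(Zs) * np.ones(4))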
3.4 The algorithm
As the sample size n is usually very large compared to mp in the study of factor invariance, an iterative algorithm based on $R_{kl}$ is more convenient than one based on $Z_k$. Murakami (1983) derived such an algorithm from TUCKALS2 through algebraic manipulations. A very straightforward derivation of an improved version is possible on the basis of the T2-2 criterion in (23).
First, we will assume that A is given. We define the $n \times mp$ data matrix Z by arranging the frontal planes next to each other as $Z = [Z_1 \; Z_2 \; \cdots \; Z_m]$, and compute the $mp \times mp$ covariance matrix $R = Z'Z$. Next, we define the $mq \times mq$ matrix
$H = (I \otimes A)'R(I \otimes A), \qquad (35)$
where $\otimes$ denotes the Kronecker product, and also define the $mq \times r$ matrix $\tilde{G} = [\tilde{G}_1' \; \tilde{G}_2' \; \cdots \; \tilde{G}_m']'$. Then we can rewrite (23) as
$g(\tilde{G}) = \mathrm{tr}\,\tilde{G}'H\tilde{G}, \qquad (36)$
where the constraint is $\tilde{G}'\tilde{G} = I$. The columns of $\tilde{G}$ will be given as the eigenvectors associated with the r largest eigenvalues of H. (For simplicity, we assume that all these eigenvalues are distinct.)
Next, we assume that $\tilde{G}$ is given. In addition, we also assume that we have a set of initial values for the elements of A (in an ALS process, these are the values from the previous iteration). We rewrite (23) as
$g(A) = \mathrm{tr}\,A'\sum_{k=1}^{m}\sum_{l=1}^{m}R_{kl}A\tilde{G}_l\tilde{G}_k', \qquad (37)$
and consider the singular value decomposition
$\sum_{k=1}^{m}\sum_{l=1}^{m}R_{kl}A\tilde{G}_l\tilde{G}_k' = P\Lambda Q', \qquad (38)$

and let
$B = PQ'. \qquad (39)$
Using the Schwarz inequality, Ten Berge (1988) proved that
$\mathrm{tr}\,\tilde{G}'(I \otimes B)'R(I \otimes B)\tilde{G} \ge \mathrm{tr}\,\tilde{G}'(I \otimes A)'R(I \otimes A)\tilde{G}, \qquad (40)$
or $g(B) \ge g(A)$. This means that the process of replacing A by B increases the criterion monotonically. Therefore, we can alternate the eigendecomposition of H and the SVD in (38) until convergence is attained. Clearly, when the algorithm converges,
$g(A, \tilde{G}) = \mathrm{tr}\,\Delta = \mathrm{tr}\,\Lambda \qquad (41)$
holds, and we can use this as a criterion of convergence.
As the step for A in the algorithm described here performs the singular value decomposition of a $p \times q$ matrix, it is much better than that in Murakami (1983), which needs the eigendecomposition of a $p \times p$ matrix. However, more careful studies will be necessary to compare its efficiency with that of the new algorithm using Gram-Schmidt orthogonalization (Kiers et al., 1992).
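To fix the ideas, here is a compact sketch of the alternating procedure of this section (our own Python/NumPy illustration of steps (35)-(39) for complete data; it makes no attempt to reproduce any published program, and all names are ours):

import numpy as np

def t2_als(R, m, p, q, r, n_iter=200, tol=1e-10):
    # ALS for the T2-2 problem on the mp x mp covariance matrix R.
    # Alternates (i) eigenvectors of H = (I kron A)' R (I kron A) for G-tilde
    # and (ii) the SVD step (38)-(39) for A.
    A = np.linalg.qr(np.random.default_rng(4).standard_normal((p, q)))[0]
    g_old = -np.inf
    for _ in range(n_iter):
        H = np.kron(np.eye(m), A).T @ R @ np.kron(np.eye(m), A)       # eq. (35)
        evals, evecs = np.linalg.eigh(H)
        Gt = evecs[:, np.argsort(evals)[::-1][:r]]                    # mq x r, G'G = I
        Gk = [Gt[k * q:(k + 1) * q, :] for k in range(m)]
        # SVD step: sum_{k,l} R_kl A G_l G_k' = P Lam Q', then B = P Q'
        M = sum(R[k * p:(k + 1) * p, l * p:(l + 1) * p] @ A @ Gk[l] @ Gk[k].T
                for k in range(m) for l in range(m))
        P, lam, Qt = np.linalg.svd(M, full_matrices=False)
        A = P @ Qt                                                    # eq. (39)
        g = lam.sum()                                                 # criterion, eq. (41)
        if g - g_old < tol:
            break
        g_old = g
    return A, Gk

rng = np.random.default_rng(5)
m, n, p, q, r = 3, 100, 4, 2, 2
Z = np.hstack([rng.standard_normal((n, p)) for _ in range(m)])
A, Gk = t2_als(Z.T @ Z, m, p, q, r)

Note that the sketch works entirely on the covariance matrix R, as advocated above for large n; the raw data are needed only once, to build R.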
Finally, we will list the formulae needed to complete a T2-1 analysis. First, we obtain the first-order loading matrix
$A^* = m^{-1/2}A\Lambda^{1/2}T, \qquad (42)$
where T is an orthogonal matrix which should be determined so as to attain simple structure in $A^*$. Next, we obtain the second-order loading matrices by
$C_k = m^{1/2}T'\Lambda^{-1/2}\tilde{G}_k\Delta^{1/2}U, \qquad (43)$
where U is also an orthogonal matrix, determined so as to reach simple structure in the $C_k$. If necessary, the second-order component scores are obtained by
$G = \sum_{k=1}^{m}Z_kA^*C_k\Bigl(\sum_{l=1}^{m}C_l'A^{*\prime}A^*C_l\Bigr)^{-1}. \qquad (44)$

Eqs. (42) and (44) show the apparent similarities to their counterparts for PCA-1, (12) and (13).
3.5 Use of the first-order composites
In analyzing three-mode data, one may want to evaluate the correlations of components between occasions. For this purpose, the covariances between the first-order components defined in (18), $F_k'F_l = C_kC_l'$, may be useful, and they are easily computed from the second-order loading matrices alone. However, the elements of this matrix often overestimate the correlations, because the columns of the first-order components exist in the r-dimensional subspace mentioned in 3.1. Hence, the first-order composites defined in (21) may be more appropriate, because they are linear combinations of the input variables and their columns span an mq-dimensional space.
Although one can compute the correlation coefficients between the first-order composites directly, we can transform them in advance into variables with unit variances which are mutually orthogonal. In addition, we can maximize the congruence of the transformed first-order composites with the first-order components, which are conceptually on the same level as the first-order composites. That is, we define the composites
$\hat{F}_k = Z_kV, \qquad k = 1, \ldots, m, \qquad (45)$
which satisfy $\sum_{k=1}^{m}\hat{F}_k'\hat{F}_k = mI$ and minimize $\sum_{k=1}^{m}\|\hat{F}_k - F_k\|^2$. Through somewhat complicated operations using the eigenequation of H and the rationale of the orthogonal Procrustes method (Cliff, 1966), we obtain

$V = \tilde{A}(\tilde{A}'\bar{R}\tilde{A})^{-1/2}T, \qquad (46)$

where $\tilde{A} = A\Lambda^{-1/2}$, $\bar{R} = m^{-1}\sum_{k=1}^{m}R_{kk}$, and T is the orthogonal matrix in (42) attaining the simple structure of $A^*$. We can then obtain the covariances of the first-order composites between occasions as $V'R_{kl}V$.
One may wonder what criterion V optimizes, because it does not produce composites with maximum variances, while A in (15) does. Eq. (36) suggests that A in (17), on the basis of which V is defined, maximizes the sum of the r largest eigenvalues of H. As H is a covariance matrix rather than a correlation matrix, the criterion consists of the maximization of the variances of the composites and of the correlations between the composites. The meaning is somewhat ambiguous. However, it is interesting to point out that the criterion is almost equal to SUMCOV, a generalized canonical analysis of more than three sets of variables with stationary weights (Meredith & Tisak, 1982), if one uses the constraint $A'\bar{R}A = I$ instead of $A'A = I$. (A slight difference is that Meredith and Tisak employ the successive formulation instead of our simultaneous one.)
Because of the simple relationship between G and $G^*$ in (28), we can write
$G = \sum_{k=1}^{m}Z_kVW_k, \qquad (47)$
where $W_k = T'(\tilde{A}'\bar{R}\tilde{A})^{1/2}\Lambda^{1/2}\tilde{C}_k\Delta^{-1/2}U$. Substituting (47) into (17), we obtain
$f(V, W_1, W_2, \ldots, W_m, A^*, C_1, C_2, \ldots, C_m) = \sum_{k=1}^{m}\Bigl\|Z_k - \Bigl(\sum_{l=1}^{m}Z_lVW_l\Bigr)C_k'A^{*\prime}\Bigr\|^2, \qquad (48)$

which is an extension of (7), and completes our list of the parallel relationships be-
tween classical PCA and T2.

3.6 Concluding remarks


As long as one considers classical PCA to be a rank-q approximation of a data matrix via the truncated SVD, it is simple enough. However, if one utilizes the indeterminacy of the model to transform the output matrix of coefficients into weight, pattern, and structure matrices with orthogonal or oblique rotations, it is far from simple. Similarly, Tucker2 is also simple as a symmetric decomposition of a three-way array. We have reformulated Tucker2 in multiple ways such that it can be seen as a natural extension of classical PCA, which is surprisingly full of statistical implications. The multiplicity of transformations of the output coefficients and scores from T2 with orthogonal rotation, as well as the possibility of several formulations with different criteria, is expected to facilitate interpretations of the complex associations between variables in a three-mode array.

Acknowledgment
The author is obliged to anonymous reviewers for their many helpful comments.

References:
Bloxom, B. (1984): Tucker's three-mode factor analysis model. In: Research Methods for Multimode Data Analysis, Law, H.G. et al. (Eds.), 104-120, Praeger Publishers, New York.
Cliff, N. (1966): Orthogonal rotation to congruence, Psychometrika, 31, 33-42.
Gifi, A. (1990): Nonlinear Multivariate Analysis, Wiley, Chichester.
Kiers, H.A.L. et al. (1992): An efficient algorithm for TUCKALS3 on data with large numbers of observation units, Psychometrika, 57, 415-422.
Kroonenberg, P.M. (1983): Three-mode Principal Component Analysis, DSWO Press, Leiden.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika, 45, 69-97.
Meredith, W. and Millsap, R.E. (1985): On component analysis, Psychometrika, 50, 495-507.
Meredith, W. and Tisak, J. (1982): Canonical analysis of longitudinal and repeated measures data with stationary weights, Psychometrika, 47, 47-67.
Murakami, T. (1983): Quasi three-mode principal component analysis: A method for assessing the factor change, Behaviormetrika, 14, 27-48.
Nishisato, S. (1994): Elements of Dual Scaling: An Introduction to Practical Data Analysis, Lawrence Erlbaum, Hillsdale.
Ten Berge, J.M.F. (1986): Some relationships between descriptive comparisons of components from different studies, Multivariate Behavioral Research, 21, 29-40.
Ten Berge, J.M.F. (1988): Generalized approach to the MAXBET problem and the MAXDIFF problem, with applications to canonical correlations, Psychometrika, 53, 487-494.
Ten Berge, J.M.F. (1993): Least Squares Optimization in Multivariate Analysis, DSWO Press, Leiden.
Ten Berge, J.M.F. and Kiers, H.A.L. (1996): Optimality criteria for principal component analysis and generalizations, British Journal of Mathematical and Statistical Psychology, 49, 335-345.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis, Psychometrika, 31, 279-311.
Parallel Factor Analysis with Constraints on the
Configurations: An overview
Pieter M. Kroonenberg¹ and Willem J. Heiser²

¹Department of Education, Leiden University
²Department of Data Theory, Leiden University
Wassenaarseweg 52, 2333 AK Leiden, The Netherlands

Summary: The purpose of the paper is to present an overview of recent developments with respect to the use of constraints in conjunction with the Parallel Factor Analysis (PARAFAC) model (Harshman, 1970). Constraints and the way they can be incorporated in the estimation process of the model are reviewed. Emphasis is placed on the relatively new triadic algorithm, which provides a large number of new ways to use the PARAFAC model.

1. Introduction
The PARAFAC model is a data-analytic model for three-way data, in which each way
represents a different mode (three-mode data), for example, subjects (mode 1) have
scores on semantic differential scales (mode 2) under several conditions (mode 3), or
in case of a three-way analysis of variance design the mean yield of several varieties
of maize (mode 1) planted in several locations (mode 2) during several years (mode 3).

In most applications to date the full model has been used, sometimes with provisions
for missing data. However, it is possible to include constraints on the configurations
of the components in the model. In this paper an overview is given of constraints that
have been proposed in this context and the practical relevance of such constraints is
illustrated. Moreover, attention will be paid to ways of fitting the model to include
constraints. To this end the recently developed alternatives to the basic algorithm
will be reviewed, in particular, the triadic or component-wise estimation.

2. The PARAFAC model


The Parallel factor analysis model (PARAFAC) was formulated by Harshman (1970;
Harshman and Lundy, 1984a,b) and in a different context by Carroll and Chang
(1970). It can be considered as the three-way generalization of both component analysis and the singular value decomposition.

The standard formulation of the model is (seen as a generalization of component


analysis)
$z_{ijk} = \sum_{s=1}^{S}a_{is}b_{js}c_{ks} + e_{ijk}, \qquad (1)$
with $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, \ldots, K$. The $a_{is}$, $b_{js}$, and $c_{ks}$ are the elements of the components $\mathbf{a}_s$, $\mathbf{b}_s$, and $\mathbf{c}_s$, respectively, and the $e_{ijk}$ are the errors of approximation. Note that each component depends on only one of the indices i, j, k and that each component s is present in all three ways. An alternative formulation of the model
(seen as a generalization of the singular value decomposition)


$z_{ijk} = \sum_{s=1}^{S}\lambda_s a_{is}b_{js}c_{ks} + e_{ijk}, \qquad (2)$
where the vectors (components) $\mathbf{a}_s$, $\mathbf{b}_s$, and $\mathbf{c}_s$ have lengths equal to 1 (or mean squares equal to 1), and the scale factors $\lambda_s$ can be considered the three-way analogues of the singular values.
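In matrix terms, (1) and (2) say that the k-th frontal slice of the array equals $A\,\mathrm{diag}(c_{k1}, \ldots, c_{kS})B'$ plus error; a minimal sketch (our own Python/NumPy illustration, not part of the paper) builds an errorless PARAFAC array from given components:

import numpy as np

rng = np.random.default_rng(6)
I, J, K, S = 5, 4, 3, 2
A = rng.standard_normal((I, S))   # the a_s in the columns
B = rng.standard_normal((J, S))   # the b_s in the columns
C = rng.standard_normal((K, S))   # the c_s in the columns

# eq. (1), errorless case: z_ijk = sum_s a_is b_js c_ks
Z = np.einsum('is,js,ks->ijk', A, B, C)

# equivalently, slice-wise: Z[:, :, k] = A diag(c_k) B'
for k in range(K):
    assert np.allclose(Z[:, :, k], A @ np.diag(C[k]) @ B.T)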

In many applications, especially in the social and behavioral sciences, no a priori


models exist for three-way data, so that three-way models such as PARAFAC have to
be used in an exploratory fashion. The only exception known to us can be found in
the field of event-related potentials and has been formulated by Mocks (1988). In
particular, Mocks derives what he calls the topographic component model based on
biophysical considerations and continues to show that his model is identical to the
PARAFAC model.

In the physical sciences, explicit models for physical processes occur frequently, and
with respect to three-way and higher data several examples can be found in chemistry.
Smilde, Van der Graaf, and Doornbos (1990) discuss a model for the multivariate cali-
bration of reversed phase chromatographic systems which is identical to the PARAFAC
model, and Leurgans and Ross (1992) discuss several three- and multimode models
in spectroscopy. For instance, they discuss a model in which the measured light emission is separately linear (1) in the number of photons absorbed, (2) in the fraction of photons absorbed that leads to emission at a particular wavelength, and (3) in the concentrations of the light emitting entities.

In contrast with several other three-way models, the PARAFAC model is an identi-
fied model, so that after estimation the parameters of the model cannot be changed
without affecting the fit of the model to the data. In particular no transformations
of the components are possible without loss of fit. The identifiability is a great help
in evaluating and interpreting solutions, and it is this feature which makes the model
extremely relevant in those cases where an a priori model for the data is available. A
PARAFAC analysis is in that case not a search for a good fitting model, but a method
to obtain identified estimates for the parameters.

3. Constraints
Substantive reasons, parsimony, or modelling considerations may require constraints
on the parameters. In particular, the following situations may be considered, which
we will discuss in turn. (1) Orthogonality of components, (2) non-negativity of components, (3) linear constraints including design variables for the components, (4) fixed
components, (5) order constraints on the components, (6) mixed measurement levels,
(7) missing data. It should be noted that all constraints lead to a loss of fit, but
the constrained solution may be compared with an unconstrained one to assess the
importance of the constraints.

Constraints can enter the problem of finding a solution for the parameter estimates
in essentially two ways, i.e. as constraints on the parameters, cases (1)-(5), or as con-
straints on possible transformations of the data (6) and (7). An example of the latter
is optimal scaling in which case optimal transformations for the data are sought given
the measurement level of the variables, simultaneously with optimal parameter esti-
mates for the model. Another example is the estimation of the model in the presence
of missing data, in which case either the missing data are 'ignored' via a 0-1 weight
matrix for the data (the model is fitted around the missing data), or the missing data
are estimated simultaneously with the model in a form of expectation-maximization
algorithm.

3.1 Orthogonality constraints


This type of constraints requires that the components within a particular mode are
pairwise orthogonal, i.e. they form an orthogonal base in the space of the com-
ponents. In principle, restricting one of the modes to orthogonality is sufficient to
obtain a partitioning of the total variability by components (see e.g. Kettenring,
1983). The orthogonality restriction is not necessarily a logical constraint for the
PARAFAC model. After all, the parallel proportional profile principle which lies at
the heart of the model, refers to a property which is specified for a triplet (a., h31
c s ), and does not refer to a projection of, say the rows, into a lower-dimensional
space. (Similarly, confirmatory factor analysis is not concerned with a projection
into a lower-dimensional space either.) A more extensive discussion of this issue can
be found in Franc (1992, pp. 151-155), who discusses the differences between a model which consists of a sum of rank-one tensors (PARAFAC), and a model which projects the data matrix from the Cartesian product space $\mathbb{R}^{I \times J \times K}$ into a lower-dimensional product space $\mathbb{R}^{P \times Q \times R}$, where P, Q, and R are the numbers of components of the three ways in the three-mode principal component model defined by Tucker (1966).

Harshman and Lundy (1984a) suggested using orthogonality constraints to avoid so-called degenerate (non-converging) solutions. A detailed treatment of degeneracy is beyond the scope of this review paper; the reader is referred to Harshman and Lundy (1984b, pp. 271-281), Kruskal, Harshman, and Lundy (1989), and, for tests to detect degeneracy, to Krijnen and Kroonenberg (submitted).

Harshman and Lundy (1984a) also discuss fitting covariance and similarity data by
PARAFAC, and show that this indirect fitting (indirect with respect to fitting the
PARAFAC model to the raw data) implies that one mode (usually that of the subjects
or generally the data generators) is orthogonal.

3.2 Non-negativity constraints


Non-negativity is a very natural constraint to impose on components in several cases,
especially when quantities expressed in the components cannot be negative. In spec-
troscopy components cannot have negative elements because negative values for ab-
sorption, emission or concentration are nonsensical. Similarly, in the latent class
model, tackled with three-way methods by Carroll, De Soete, and Kamensky (1992),
the estimated probabilities should be nonnegative. As a final example, it is known
that subtests of intelligence tests generally correlate positively with each other, be-
cause all measure intelligence in some way. Krijnen and Ten Berge (1992; Krijnen,
1993, chap. 4) make a case for imposing nonnegativity constraints on the compo-
nents of PARAFAC when analyzing intelligence tests. This is particularly useful to
avoid so-called contrast components on which, for instance, the numerical tests have
negative coefficients and the verbal tests positive ones.

3.3 Linear constraints including design variables


Design present for the subjects. Suppose that subjects come in specific groups,
and that not their individual values but their group means are interesting. Carroll,
Pruzansky, and Kruskal (1980) showed that such constraints can be handled by first
averaging the original data according to the design, followed by a three-way analy-
sis on the condensed data. They also warned that on interpretational grounds this
does in general not seem a good procedure for the subject mode in the INDSCAL
model. DeSarbo, Carroll, Lehmann, and O'Shaughnessy (1982) employ such linear
constraints in their paper on three-way multivariate conjoint analysis, and also take
up the issue of the appropriateness of restrictions on the subject mode.

Facet or factorial designs for variables. Tests and questionnaires are sometimes con-
structed according to a facet or factorial design and as above these variables may be
combined and then subjected to a standard PARAFAC analysis.

A priori clusters on the variables. Another type of constraint occurs when variables belong to certain explicitly defined a priori clusters and this is to be made evident via a simple structure on the components. In the three-way case one might like to fit such a constraint via the application of the PARAFAC model. The details of such an approach have been worked out by Krijnen (1993, chap. 5).

A posteriori clusters on variables. Rather than knowing the clusters beforehand, a


PARAFAC solution might be desired in which optimal non-overlapping clusters are
searched at a prespecified number of clusters. In practice, several numbers of clusters
may be tried to find a solution which is optimal in some sense (Krijnen, 1993, chap. 6).

3.4 Fixed components


Constant component(s). Suppose one is uncertain whether the data at hand really
exhibit individual differences in all components. One notices differences in the first
component of the subjects, but the higher ones might not be very large. One could
consider a three-way analysis with the last component fixed at a constant value and
compare the fit of the solutions. In case of a negligible difference one can conclude
that the differences were not worth bothering about. In the extreme case of no indi-
vidual difference whatsoever, one will most likely end up with averaging over subjects.
One example of this might be the study of the question whether semantic differentials
are subject to individual differences.

Inclusion of external information. Most studies are part of a research tradition, so


that almost always earlier results exist using the same type of data and/or the same
variables. Thus one might have a configuration available for one of the modes, and
one might be interested in the question whether the older information is comparable
to the information contained in the new three-way data set. By fixing the configu-
ration of the mode one can estimate the other parameters in the model within the
context of the already available external information. By comparing the restricted
with the unrestricted model one may assess the differences between the present and
the previous study. For an example using the Tucker3 model (Tucker, 1966; Kroonenberg and De Leeuw, 1980), see the study by Van der Kloot and Kroonenberg (1985).

Another view of the same situation might be that external (two-way) information on
continuous variables is available. For instance, personality information is available
for the subjects, and it is desired to explain the structure of the analysis in terms
of these external variables. Within the analysis-of-variance context such procedures
have sometimes been called factorial regression. A discussion of this approach for
two-way case as well as references to examples can be found in Van Eeuwijk, Denis,
and Kant (1995).

Fixing components for modelling purposes. It is possible to decide for each component in a PARAFAC model whether one, two, or three ways should have constant values for this component. By doing this one can, for instance, perform a three-way analysis of variance with the PARAFAC model. At present, we are working on a viable way of carrying out three-way analysis of variance with multiplicative terms for the interactions via a PARAFAC model, as suggested by De Leeuw in Kroonenberg (1983, p. 141).

3.5 Order constraints


Order constraints for autocorrelated data. When one of the modes in the three-way
array is a time mode or two of the modes contain spatial coordinates, it may be de-
sirable to use this design information during the analysis. One way to do this would
be by requiring that the levels of a component obey certain order restrictions. One
might also try to take this further and require equal intervals, but then one more or
less returns to the previous section.

3.6 Constraints to create discrete models


Discrete and hybrid clustering. Carroll and co-workers have developed procedures to
perform clustering on two-mode three-way similarity data by using constraints of the
components. This approach demands that certain components can only have a lim-
ited number of integer values. Like in the previous section this is an example of using
constraints to fit specific models. Details can be found in Carroll and Chaturvedi
(1995).

3.7 Optimal scaling


As mentioned above, optimal scaling concerns constraints on the data rather than
on the model. Detailed discussions of optimal scaling can be found in Gifi (1990),
and the earlier references mentioned there in. The fundamental idea is that the mea-
surement level of a variable is specified by the transformations which may be applied
to the values of that variable without changing the meaning of the variable and its
interpretation. For example, any monotone transformation of the values of an ordi-
nal variable will maintain the ordering of the values and thus their basic meaning.
Sands and Young (1980) presented an optimal scaling version of the PARAFAC model
as well as the algorithm to compute its solution, but no further applications of this
approach are known (see, however, some comments by Harshman and Lundy, 1984,
pp. 188-191).

3.8 Constraints for single elements


Missing data. Missing data may lead to constrained estimation problems. They can
be handled within the PARAFAC model either by a kind of EM algorithm estimating
the model and the missing data in turn, or by differentially weighting elements of the
three-way data array, for instance, by weighting every valid data point with 1 and
each missing data point with 0.
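The weighting device is easy to sketch (our own Python/NumPy illustration; function and variable names are ours): a 0-1 weight array in the weighted discrepancy function of Section 4 simply removes the missing cells from the loss.

import numpy as np

def weighted_loss(Z, W, A, B, C):
    # Weighted PARAFAC discrepancy: W is 0 for missing cells, 1 otherwise.
    Zhat = np.einsum('is,js,ks->ijk', A, B, C)
    return np.sum(W * (Z - Zhat) ** 2)

rng = np.random.default_rng(7)
I, J, K, S = 4, 3, 3, 2
A, B, C = (rng.standard_normal((d, S)) for d in (I, J, K))
Z = np.einsum('is,js,ks->ijk', A, B, C)
W = (rng.random((I, J, K)) > 0.1).astype(float)   # 0 marks a missing cell
Z_obs = W * Z                                     # values at missing cells are irrelevant
assert weighted_loss(Z_obs, W, A, B, C) < 1e-20   # exact model fits the observed part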

Equality constraints. In certain applications it might be interesting to demand or test


whether certain elements in a component are equal to one another, where it is not
known beforehand what the size of these equal elements should be. Such constraints
clearly operate at the level of the individual parameters in the model, and are thus
model constraints.

4. Algorithms
In this section we will give an overview of several algorithms proposed for the PARAFAC
model and comment on the way constraints can be handled. In particular, we will
concentrate on the standard or so-called CP algorithm and the triadic algorithm.

4.1 Standard (unconstrained) algorithm


The by-now standard algorithm was independently worked out by Harshman (1970)
who called it the PARAFAC algorithm and by Carroll and Chang (1970) who called it
the CANDECOMP algorithm. Following Kiers and Krijnen (1991), we will refer to it as
the CP algorithm. As shown in Ten Berge, Kiers and Krijnen (1993) in the standard
CP algorithm the estimates for the parameter matrices are found row by row (even
though this is not always evident from its formulation), and thus the algorithm may
be called a row-wise algorithm for the PARAFAC model. The discrepancy function
(objective function, loss function) which lies at the basis of most algorithms is

$\Phi(A, B, C) = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\Bigl(z_{ijk} - \sum_{s=1}^{S}a_{is}b_{js}c_{ks}\Bigr)^2. \qquad (3)$

The basic solution proposed by Carroll and Chang (1970) and Harshman (1970), which at a later date was also independently worked out by several other authors (e.g. Mocks, 1988), consists of fixing two of the parameter matrices, solving for the third one via (multivariate) regression, performing this procedure for each permutation of the component matrices in turn, and repeating the whole procedure until convergence. Hayashi and Hayashi (1982) also presented an algorithm which shows similarities to the standard one. Technically the CP algorithm is straightforward to implement, but it turns out that the algorithm does not always converge, due to a mismatch of data and model. This situation is called degeneracy, and details can be found in Harshman and Lundy (1984b), Kruskal, Harshman, and Lundy (1989), and Krijnen and Kroonenberg (submitted). As an aside, the work of Pham and Mocks (1992) should be mentioned, as they prove the consistency and asymptotic normality of the least squares estimators.
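The regression steps can be written compactly with unfolded arrays and the Khatri-Rao (column-wise Kronecker) product; the following sketch (our own stripped-down Python/NumPy illustration of the unweighted case, not the published algorithm, with helper names of our choosing) cycles through the three regressions:

import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product of U (J x S) and V (K x S) -> (JK x S).
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def cp_als(Z, S, n_iter=200, seed=0):
    # Unweighted CP/PARAFAC via ALS; Z is an I x J x K array.
    rng = np.random.default_rng(seed)
    I, J, K = Z.shape
    A, B, C = (rng.standard_normal((d, S)) for d in (I, J, K))
    for _ in range(n_iter):
        # each step is a least squares regression with the other two matrices fixed
        A = np.linalg.lstsq(khatri_rao(B, C), Z.reshape(I, J * K).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), Z.transpose(1, 0, 2).reshape(J, I * K).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), Z.transpose(2, 0, 1).reshape(K, I * J).T, rcond=None)[0].T
    return A, B, C

rng = np.random.default_rng(8)
A0, B0, C0 = (rng.standard_normal((d, 2)) for d in (5, 4, 3))
Z = np.einsum('is,js,ks->ijk', A0, B0, C0)
A, B, C = cp_als(Z, S=2)
err = np.linalg.norm(np.einsum('is,js,ks->ijk', A, B, C) - Z) / np.linalg.norm(Z)
print(err)   # should be near zero for exact rank-2 data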

Variants. A variant of the basic algorithm was proposed by Kiers and Krijnen (1991). They showed, by rearranging the calculations and by operating on the (multivariable-multioccasion) covariance matrix rather than on the raw data, that the computation time becomes independent of the number of observations, while at each iteration step the results are the same as those of the standard algorithm. Moreover, by operating on the covariance matrix, the storage space does not increase with the number of observations, and multivariable-multioccasion covariance matrices from the literature can be directly used. A deficiency of the algorithm is that it cannot handle missing data other than during the building of the covariance matrix.

Using the assumption of normality for the errors, Mayekawa (1987) developed a maximum likelihood procedure for indirect fitting (see also Section 3.1), i.e. first the raw data are converted to covariance matrices per occasion (with possibly different samples per occasion). Note that these covariances are different from the Kiers and Krijnen (1991) proposal, because the latter use only a single (multivariable-multioccasion) covariance matrix, and thus necessarily assume repeated measurements.

Weights for error terms. Several authors (Harshman and Lundy, 1984b, pp. 242ff.;
Carroll, De Soete and Kamensky, 1992; Heiser and Kroonenberg, 1994) considered
a generalization of the discrepancy function by including weights for the error terms
$e_{ijk}$:
$\Phi(A, B, C) = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}w_{ijk}\Bigl(z_{ijk} - \sum_{s=1}^{S}a_{is}b_{js}c_{ks}\Bigr)^2. \qquad (4)$

The addition of weights changes the details of the algorithm in several places but not its basic character. This seemingly small change has considerable consequences for the ability of the algorithms to handle certain kinds of constraints. Harshman and Lundy (1984b) presented the weighted version of the standard CP algorithm in the context of reweighting or scaling of the error terms to achieve equal error variances, but seemed to look only at diagonal weight matrices (i.e. $w_{ijk} = w_j$).

Orthogonality. Because the basic algorithm operates on rows or whole matrices at


a time, in principle only constraints can be handled which operate on rows or en-
tire component matrices. As the least squares steps for each of the three parameter
matrices are independent, one may use different constraints on different component
matrices as long as at each step the discrepancy function is decreased. Orthogonality
constraints are an example of this but they are generally handled by including the
constraints in the discrepancy function via Lagrange multipliers, and thus changing
fundamentally the characteristics of the discrepancy function.

Weights on components. Several constraints can be incorporated via weight matrices


for the component matrices (rather than the errors) and the estimation can then
proceed via the standard algorithm or with specialized algorithms. This leads to a
very general discrepancy function

$\Phi(A, B, C) = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}w_{ijk}\Bigl(z_{ijk} - \sum_{s=1}^{S}(p_{is}a_{is})(q_{js}b_{js})(r_{ks}c_{ks})\Bigr)^2, \qquad (5)$

where $P = (p_{is})$, $Q = (q_{js})$, and $R = (r_{ks})$ are, generally known, weight matrices for the components. Krijnen (1993, chap. 5) uses such binary weights as a priori cluster constraints. Moreover, Krijnen (1993, chap. 6) also proposed an algorithm for optimally clustering the coordinates of the variables, in which the binary weights had to be determined, using explicitly the row-wise character of the standard algorithm. Carroll, Pruzansky, and Kruskal (1980) showed that if a factorial design or
linear constraints are specified on the components there is no need for a special algo-
rithm because one may first reduce the data matrix according to the design and then
use the basic algorithm to solve for the parameter estimates (see also Franc, 1992, pp. 188ff.). The latter author also discusses the inclusion of different metrics for the component matrices and shows that this situation can be handled by rewriting the basic equations and then using the standard algorithm. Such a procedure is analogous to the use of the ordinary singular value decomposition to solve the singular value decomposition with weighted metrics in correspondence analysis (see e.g. Greenacre, 1983, p. 40).

Nonnegativity. Nonnegativity constraints require an adaptation of the basic algorithm to solve the regressions at each step. One may use a non-negative least squares (NNLS) procedure such as described in Lawson and Hanson (1974, pp. 158-161) and discussed or referred to in Krijnen and Ten Berge (1992), Durrell, Lee, Ross, and Gross (1990), and Carroll, De Soete, and Kamensky (1992). The basic principle of the NNLS procedure is to set negative coordinate values to zero during iterations. Paatero and Tapper (1994) take a different approach and use a penalty function whose value depends on the size of the negative element. Paatero's procedure, Positive Matrix Factorization, is as yet only fully described for the two-way case, but a program with documentation (Paatero, 1996) exists which includes the estimation of the parameters of the nonnegative PARAFAC model as well.
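To make the adaptation concrete, the following sketch (ours, not taken from any of the cited programs) shows how an NNLS solver can replace the unconstrained regression in one mode of the alternating least squares scheme; the function names, the scipy routines, and the unfolding convention are our own assumptions.

```python
# Sketch: NNLS-constrained update of the first-mode component matrix A
# in one alternating least squares step of the CP/PARAFAC model.
import numpy as np
from scipy.optimize import nnls
from scipy.linalg import khatri_rao

def update_A_nonneg(Z1, B, C):
    """Z1: mode-1 unfolding (I x JK) of the data array, with columns ordered
    to match khatri_rao(C, B); B (J x S) and C (K x S) are held fixed."""
    M = khatri_rao(C, B)                  # JK x S design matrix
    A = np.zeros((Z1.shape[0], B.shape[1]))
    for i in range(Z1.shape[0]):          # one NNLS regression per row of A
        A[i], _ = nnls(M, Z1[i])          # negative coordinates forced to zero
    return A
```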

4.2 Triadic algorithm or component-wise algorithm


Recently, column-wise algorithms have been put forward which can handle a larger number of constraints. A column-wise algorithm for symmetric matrices was first worked out in unpublished notes by Ten Berge (1986) for the specific purpose of fitting the INDSCAL model, and it was published in Ten Berge, Kiers, and Krijnen (1993). In another unpublished manuscript, Carroll (1987) produced the same or a similar algorithm for the same purpose, and a full column-wise alternative to the CP algorithm was published in Carroll, De Soete and Kamensky (1992). Again independently, in yet another unpublished manuscript, Heiser and Kroonenberg (1994) also proposed the column-wise algorithm, or triadic algorithm as they called it.

In the standard algorithm the discrepancy function is solved by optimizing over a complete parameter matrix of one mode, holding the other two fixed. However, it is possible to rearrange the computations yet again by making the number of components the outer loop and computing, for each component $s$, first the component $a_s$, then $b_s$, followed by $c_s$, while holding the other two components fixed. This may be accomplished by using the following development of the discrepancy function

$$\Phi(a_s, b_s, c_s) = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\left(\tilde{z}_{ijk} - a_{is}b_{js}c_{ks}\right)^{2}, \qquad (6)$$

where $\tilde{z}_{ijk}$ denotes the data from which all components other than the $s$th have been removed. Note that after the $s$th components have been estimated, the components $s+1$ (or 1) will be treated next, after the new component $s$ has been incorporated in $\tilde{z}_{ijk}$.
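A minimal sketch of one such component-wise sweep is given below. It is our illustration of the rearranged computations, not the code of any of the cited authors; each update is simply the closed-form rank-one least squares regression of the residual array on the other two component vectors.

```python
import numpy as np

def triadic_pass(Z, A, B, C):
    """One sweep of a column-wise (triadic) CP update; Z is an I x J x K array.
    For each component s, the vectors a_s, b_s, c_s are refit against the
    residual array from which all other components have been removed."""
    S = A.shape[1]
    for s in range(S):
        # residual array excluding component s
        R = Z - np.einsum('is,js,ks->ijk', A, B, C) \
              + np.einsum('i,j,k->ijk', A[:, s], B[:, s], C[:, s])
        # rank-one least squares updates, one mode at a time
        A[:, s] = np.einsum('ijk,j,k->i', R, B[:, s], C[:, s]) / \
                  ((B[:, s] @ B[:, s]) * (C[:, s] @ C[:, s]))
        B[:, s] = np.einsum('ijk,i,k->j', R, A[:, s], C[:, s]) / \
                  ((A[:, s] @ A[:, s]) * (C[:, s] @ C[:, s]))
        C[:, s] = np.einsum('ijk,i,j->k', R, A[:, s], B[:, s]) / \
                  ((A[:, s] @ A[:, s]) * (B[:, s] @ B[:, s]))
    return A, B, C
```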

Variants. The triadic algorithm opens the way to including many more types of constraints, and it also allows the development of special variants of the PARAFAC model using these constraints. In particular, such features as structured components as in time series, design matrices for the modes, nonnegativity and equality constraints, etc. can be easily handled in this context, and different constraints may be imposed on different components. However, for including orthogonality constraints a change of the discrepancy function is again necessary. In terms of special models using constraints, Carroll, De Soete, and Kamensky (1992) used a constrained version of the column-wise algorithm to estimate the latent class model of Lazarsfeld, and Carroll and Chaturvedi (1995) developed both a clustering procedure and a hybrid clustering procedure for two-mode three-way similarity data. Finally, the estimation of analysis of variance with multiplicative components for all interactions seems to come within reach (Heiser and Kroonenberg, in preparation).

Other triadic algorithms. Yoshizawa (1988) discussed at a theoretical level an algorithm to solve the PARAFAC model which produces orthogonal triads, and which shows considerable similarity to an algorithm proposed by Denis and Dhorne (1989) for the same purpose; the latter, however, are able to completely decompose a three-way array with their method. No specific details about the feasibility or practical relevance of these algorithms seem to be known, nor have they been used to incorporate constraints.

5. Programs
The standard algorithm has been included in (FORTRAN) programs such as CANDECOMP (Carroll and Chang, 1970) and PARAFAC (Harshman and Lundy, 1994), of which the latter program is probably the most extensive one, as it includes many special features relevant for the analysis of three-mode data.

Most of the authors who have contributed to the further development of PARAFAC have written their own programs, primarily in Matlab, Splus, or other matrix-based languages. The standard algorithm for the PARAFAC model, as well as the nonnegativity and orthogonality variants, have also been included in the analysis package for three-way data 3WAYPACK (Kroonenberg, 1994, 1996), and the triadic or component-wise version will be in the next version of 3WAYPACK. This package also includes other three-way models, such as those proposed by Tucker (1966, 1972).

6. References
Carroll, J.D. (1987): New algorithm for symmetric CANDECOMP. Unpublished manuscript, AT&T Bell Laboratories, Murray Hill, NJ.
Carroll, J.D. and Chaturvedi, A. (1995): A general approach to clustering and multidimensional scaling of two-way, three-way, and higher-way data. In: Geometric representations of perceptual phenomena: Papers in honor of Tarow Indow on his 70th birthday, Luce, R.D. et al. (eds.), Erlbaum, Mahwah, NJ.
Carroll, J.D. and Chang, J.-J. (1970): Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283-319.
Carroll, J.D., De Soete, G., and Kamensky, A.D. (1992): A modified CANDECOMP algorithm for fitting the latent class model: Implementation and evaluation. Applied Stochastic Models and Data Analysis, 8, 303-309.
Carroll, J.D., Pruzansky, S., and Kruskal, J.B. (1980): CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45, 3-24.
Denis, J.B. and Dhorne, T. (1989): Orthogonal tensor decomposition of 3-way tables. In: Multiway data analysis, Coppi, R. and Bolasco, S. (eds.), 31-38, Elsevier, Amsterdam.
DeSarbo, W.S., Carroll, J.D., Lehmann, D.R., and O'Shaughnessy, J. (1982): Three-way multivariate conjoint analysis. Marketing Science, 1, 323-350.
Durrell, S.R., Lee, C.-H., Ross, R.T., and Gross, E.L. (1990): Factor analysis of the near-ultraviolet absorption spectrum of plastocyanin using bilinear, trilinear, and quadrilinear models. Archives of Biochemistry and Biophysics, 278, 148-160.
Franc, A. (1992): Etude algébrique des multitableaux: Apports de l'algèbre tensorielle. Unpublished PhD thesis, Université de Montpellier II, France.
Gifi, A. (1990): Nonlinear multivariate analysis, Wiley, Chichester, UK.
Greenacre, M.J. (1983): Theory and applications of correspondence analysis, Academic Press, London.
Harshman, R.A. (1970): Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1-84. [Also available as University Microfilms, No. 10,085.]
Harshman, R.A. and Lundy, M.E. (1984a): The PARAFAC model for three-way factor analysis and multidimensional scaling. In: Research methods in multimode data analysis, Law, H.G. et al. (eds.), 122-214, Praeger, New York.
Harshman, R.A. and Lundy, M.E. (1984b): Data preprocessing and the extended PARAFAC model. In: Research methods in multimode data analysis, Law, H.G. et al. (eds.), 216-284, Praeger, New York.
Harshman, R.A. and Lundy, M.E. (1994): PARAFAC: Parallel factor analysis. Computational Statistics and Data Analysis, 18, 39-72.
Hayashi, C. and Hayashi, F. (1982): A new algorithm to solve PARAFAC-model. Behaviormetrika, 11, 49-60.
Heiser, W.J. and Kroonenberg, P.M. (1994): Dimensionwise fitting in Parafac-Candecomp with missing data and constrained parameters. Unpublished manuscript, Department of Data Theory, Leiden University, Leiden.
Kettenring, J.R. (1983): Components of interaction in analysis of variance models with no replications. In: Contributions to statistics: Essays in honor of Norman L. Johnson, Sen, P.K. (ed.), North-Holland, Amsterdam.
Kiers, H.A.L. and Krijnen, W.P. (1991): An efficient algorithm for PARAFAC of three-way data with large numbers of observation units. Psychometrika, 56, 147-152.
Krijnen, W.P. (1993): The analysis of three-way arrays by constrained PARAFAC methods, DSWO Press, Leiden.
Krijnen, W.P. and Kroonenberg, P.M. (submitted): Detecting degeneracy when fitting the PARAFAC model.
Krijnen, W.P. and Ten Berge, J.M.F. (1992): A constrained PARAFAC method for positive manifold data. Applied Psychological Measurement, 16, 295-305.
Kroonenberg, P.M. (1983): Three-mode principal component analysis: Theory and applications, DSWO Press, Leiden.
Kroonenberg, P.M. (1994): The TUCKALS line: A suite of programs for three-way data analysis. Computational Statistics and Data Analysis, 18, 73-96.
Kroonenberg, P.M. (1996): 3WAYPACK user's manual, Leiden University, Leiden.
Kroonenberg, P.M. and De Leeuw, J. (1980): Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45, 69-97.
Kruskal, J.B., Harshman, R.A., and Lundy, M.E. (1989): How 3-MFA can cause degenerate PARAFAC solutions, among other relationships. In: Multiway data analysis, Coppi, R. and Bolasco, S. (eds.), 115-122, Elsevier, Amsterdam.
Lawson, C.L. and Hanson, R.J. (1974): Solving least squares problems, Prentice Hall, Englewood Cliffs, NJ.
Leurgans, S.E. and Ross, R.T. (1992): Multilinear models: Applications in spectroscopy (with discussion). Statistical Science, 7, 289-319.
Mayekawa, S.-I. (1987): Maximum likelihood solution to the PARAFAC model. Behaviormetrika, 21, 45-63.
Möcks, J. (1988): Decomposing event-related potentials: A new topographic components model. Biological Psychology, 26, 129-215.
Paatero, P. and Tapper, U. (1994): Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111-126.
Paatero, P. (1995): User's guide for positive matrix factorization programs PMF2.EXE and PMF3.EXE, Department of Physics, University of Helsinki.
Pham, T.D. and Möcks, J. (1992): Beyond principal component analysis: A trilinear decomposition model and least squares estimation. Psychometrika, 57, 203-215.
Sands, R. and Young, F.W. (1980): Component models for three-way data: ALSCOMP3, an alternating least squares algorithm with optimal scaling features. Psychometrika, 45, 39-67.
Smilde, A.K., Van der Graaf, P.H., and Doornbos, D.A. (1990): Multivariate calibration of reversed-phase chromatographic systems. Some designs based on three-way data analysis. Analytica Chimica Acta, 235, 41-51.
Ten Berge, J.M.F. (1986): Three notes on three-way analysis. Paper presented at the Workshop on TUCKALS and PARAFAC, Leiden University, July 2.
Ten Berge, J.M.F., Kiers, H.A.L., and Krijnen, W.P. (1993): Computational solutions for the problem of negative saliences and nonsymmetry in INDSCAL. Journal of Classification, 10, 115-124.
Tucker, L.R. (1966): Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279-311.
Tucker, L.R. (1972): Relations between multidimensional scaling and three-mode factor analysis. Psychometrika, 37, 3-27.
Van der Kloot, W.A. and Kroonenberg, P.M. (1985): External analysis with three-mode principal component analysis. Psychometrika.
Van Eeuwijk, F.A., Denis, J.-B., and Kang, M.S. (1995): Incorporating additional information on genotypes and environments in models for two-way genotype by environment tables. In: Genotype by environment interaction: New perspectives, Kang, M.S. and Gauch, H.G. Jr. (eds.), CRC Press, Boca Raton, USA.
Yoshizawa, T. (1988): Singular value decomposition of multiarray data and its applications. In: Recent developments in clustering and data analysis, Hayashi, C. et al. (eds.), Academic Press, New York.

Acknowledgement
The research of the first author was financially supported by a grant from the Nis-
san Fellowship Programme of the Netherlands Organization for Scientific Research
(NWO).
Regression Splines for Multivariate Additive Modeling

Jean-François Durand

Probabilités et Statistique, Université Montpellier II, Place Eugène Bataillon, 34095 Montpellier, France
Unité de Biométrie, ENSAM-INRA-UM II, 9 Place Pierre Viala, 34060 Montpellier, France

Summary: Four additive spline extensions of some linear multiresponse regression methods are presented. Two of them are defined in this paper and their properties are compared with those of two other recently devised methods. Dimension reduction aspects and the quality of the regression are discussed and illustrated on examples.

1. Introduction
Let $(x_1, \ldots, x_p)$ be a set of predictors related to a set of responses $(y_1, \ldots, y_q)$, all measured on the same $n$ individuals, with sample data matrices $X$ ($n \times p$) and $Y$ ($n \times q$). The goal of this paper is to present methods for multivariate additive modeling and data reduction which integrate regression splines in their settings. Piecewise polynomials or splines are extensively used in statistics and data analysis, see for example (Ramsay 1988), (Gifi 1990), (Hastie and Tibshirani 1990) and (Durand 1993). Spline transformations of the explanatory variables, presented in Section 2 and not necessarily monotonic in contrast to (Ramsay 1988), are used conjointly with orthogonal projections on either such smoothed predictors or "synthetic" explanatory variables called additive components. Spline functions are attractive due to their appealing local sensitivity to data, and orthogonal projectors preserve some linear properties in the considered nonlinear methods. Section 3 points out that problems arise with least-squares splines (Eubank 1988) when the number of predictors is large. Scarcity of data in a multivariate setting may cause harm in additive modeling by least-squares splines, thus providing a point in favour of dimension reduction.
Four multiresponse additive regression models are considered, all presenting dimension reducing aspects. Because of their similar scope of applicability, the first three methods are compared on a "running example". In Section 4, the regression on Additive Spline Principal Components (ASPCs henceforth) is based on a new definition of ASPCs. Additive principal components have been recently defined by Donnell et al. (1994), who explore the low end of the component spectrum for detecting concurvities in additive models. Here, large uncorrelated ASPCs are used not for condensing the $X$ sample matrix in an additive fashion, since the predictors are transformed, but rather for predicting the response data set linearly. The second method, Partial Least Squares regression via additive splines, referred to as ASPLS (Durand and Sabatier 1994), is summarized in Section 5. This method differs from the preceding in that dimension reduction is processed at the same time as the regression is computed. When predictor and response data sets are identical, ASPLS components are called self-ASPLS components. In Section 6, such self-ASPLS components are interpreted as an additive summary of the original predictor matrix and used for regression purposes. Finally, Principal Component Analysis with respect to Instrumental Variables (Durand 1993), whose scope of applicability differs from that of the preceding methods, is presented in Section 7 with applications to simple regression and to nonlinear Discriminant Analysis.

2. Spline transformations of the predictors


For simplicity, let us choose the same kind of spline functions for transforming the predictors: we take $K$ interior knots, in which piecewise polynomials of order $m$ are required to join end to end, so that $r = m + K$ is the dimension of the spline space. A spline function $s^i(\cdot)$ used for transforming the predictor $x^i$ is a linear combination of normalized B-spline basis functions $\{B_l(\cdot)\}_{l=1,\ldots,r}$:

$$s^i(x) = \sum_{l=1}^{r} a^i_l B_l(x), \qquad (1)$$

see (De Boor 1978) for computational and mathematical properties of B-splines. The $i$th column of $X$, denoted $X^i$, is replaced by $X^i(a^i)$, depending linearly on $a^i$ through $X^i(a^i) = B^i a^i$, where $B^i$ is the $n \times r$ coding matrix of $x^i$, and $a^i$ is the vector of the $r$ spline coefficients. The matrix $X$ is thus transformed into an $n \times p$ matrix $X(a)$ which is a function of the spline vectors. This matrix is column-wise denoted

$$X(a) = [X^1(a^1)|\cdots|X^p(a^p)]. \qquad (2)$$

In order to make $X^i(a^i)$ centered independently of the values of the spline coefficients, $B^i$ is column centered with respect to $D$, an $n \times n$ diagonal matrix of weights for the observations. The knot sequence used for transforming the $i$th predictor is written as $\{\xi^i_1, \ldots, \xi^i_{2m+K}\}$.
When $m \geq 2$, particular spline coefficients called nodal coefficients (Durand 1993), given by $a^i_l = \mathrm{mean}(\xi^i_{l+1}, \ldots, \xi^i_{l+m-1})$, keep the $i$th predictor invariant, which gives $X^i(a^i) = X^i$. The existence of such spline coefficients implies firstly that the additive spline model as defined in Section 3 can effectively take account of possibly linear relationships and secondly, that the different iterative algorithms can reasonably be initialized.
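As an illustration of this coding step, the following sketch (ours, assuming scipy's BSpline and uniform weights $D = n^{-1}I_n$; the function name is hypothetical) builds the column-centered coding matrix $B^i$ for one predictor.

```python
import numpy as np
from scipy.interpolate import BSpline

def coding_matrix(x, interior_knots, order=2):
    """n x r matrix of normalized B-spline values, r = order + #interior knots,
    column-centered with respect to uniform weights."""
    deg = order - 1                                     # order m = degree + 1
    t = np.r_[[x.min()] * order, np.sort(interior_knots), [x.max()] * order]
    r = order + len(interior_knots)                     # spline space dimension
    B = np.column_stack([BSpline(t, np.eye(r)[s], deg)(x) for s in range(r)])
    return B - B.mean(axis=0)                           # column centering
```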

3. Additive models and least-squares splines

In multiresponse regression, an additive spline model is a fit of the form

$$\hat{y}^j = \sum_{i=1}^{p} f^j_i(x_i), \qquad j = 1, \ldots, q, \qquad (3)$$

where the coordinate function $f^j_i(x_i)$ is a spline function as defined in (1) with spline coefficients stored in $a^j_i$. In matrix notation, the $j$th column of the $n \times q$ model matrix $\hat{Y}$ is the sum of the smoothed sample predictors

$$\hat{Y}^j = \sum_{i=1}^{p} B^i a^j_i. \qquad (4)$$

We will note whether or not an additive method is associated with a multivariate linear smoother (Hastie and Tibshirani 1990), defined by $\hat{Y} = SY$, where the smoother matrix $S$ does not depend on $Y$.
Spline coefficients can be chosen to minimize the mean squared error $\|Y - BA\|^2_D$, where $\|X\|^2_D = \mathrm{trace}(X'DX)$, $B = [B^1|\cdots|B^p]$, and $A$ is the $pr \times q$ matrix of spline coefficients. Linear multiple regression on the $n \times pr$ design matrix $B$ provides the so-called least-squares spline model (Eubank 1988), associated with the smoother matrix $S = P_B$, where $P_B$ is the $D$-orthogonal projector on $\mathrm{span}(B)$. However, using least-squares splines may cause problems when not enough points are available for fitting surfaces in high-dimensional spaces. Our O.E.C.D. "running" data set is typical of such an unlucky context, since only eighteen points are available for fitting an additive surface in a space of fourteen dimensions ($n = 18$, $p = 13$). The scope of applicability of this method concerns data sets with large samples of few predictors, since scarcity of data is increased as the number of independent variables (the columns of the design matrix $B$) gets larger.
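Given such coding matrices, the least-squares spline model is an ordinary multiresponse regression on the stacked design matrix; a minimal sketch (ours, again with uniform weights) follows.

```python
import numpy as np

def ls_spline_fit(B_list, Y):
    """Least-squares spline model of Section 3 (uniform weights): regress Y
    on the stacked n x pr design matrix; the smoother is the projector P_B."""
    B = np.hstack(B_list)                      # B = [B^1 | ... | B^p]
    A, *_ = np.linalg.lstsq(B, Y, rcond=None)  # pr x q spline coefficients
    return B @ A                               # fitted model matrix Y-hat
```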

The aim of the paper is to present four methods for additive modeling that include dimension reducing aspects in their settings. All are based on the column-centered design matrix $X(a)$ defined by (2). In the different models, $a^i$ is expressed as a linear combination of $M$ optimal spline vectors, all associated with the same $i$th predictor transformed at the different stages of the methods ($M$ is the dimension of the model, that is, the number of additive components or latent variables). The model dimension $M$ may be considered a tuning parameter and can be checked out in the same way as in linear methods by using cross-validation. When it is computationally feasible, cross-validation is more important in nonlinear modeling because the risk of overfitting is increased due to the greater flexibility of splines. Other tuning parameters allow one to choose what type of splines is to be used: the order of the polynomials, and the number and position of the knots. Choosing few well-located knots generally suffices in multiresponse regression, but finding their optimal number and location is a difficult problem, so that giving the optimal answer is beyond the scope of this paper. To summarize, the model defined by (3) and (4) belongs to the family of nonparametric additive models depending on the aforesaid tuning parameters.

4. Regression on additive spline principal components

Additive principal components have been recently defined by Donnell et al. (1994) to detect instability in additive regression models due to concurvity between the predictors. They explore the low end of the principal component spectrum, neglecting the use of the largest components for dimension reduction because plots of such components cannot be interpreted as a projection of the "same" data set. This insightful remark sets us the problem of what is really going on when using the largest components in dimension reduction, not for the purpose of exploratory data analysis but rather for multivariate additive modeling.
The definition of additive spline principal components (ASPCs) presented in the next section differs from that of Donnell et al. in these points: first, only finite-dimensional function spaces are used here, namely piecewise polynomials of space dimension $r = m + K$. Second, uncorrelatedness of ASPCs is not obtained through orthogonal sets of functions but constructed step by step through an orthogonalizing procedure. Finally, ASPCs are not deduced from the eigenanalysis of some variance matrix but instead are considered as mutually uncorrelated solutions of a maximum variance optimization problem.

4.1 Additive spline principal components

Some finite-dimensional nonlinear principal component extensions (Ramsay 1988, Gifi 1990) are based on the Eckart-Young theorem that constructs low rank approximations of a matrix. Others are directly deduced from the eigenanalysis of some covariance matrix (Donnell et al. 1994). Here, we follow a nested approach motivated by the fact that taking spline coefficients equal to nodal coefficients leads to common principal components.
DEFINITION. The $k$th ASPC is the linear composite of $X(a)$, uncorrelated with the previous components, which has the largest variance. Denoting $c = X(a)u$, the objective function $f(a^1, \ldots, a^p, u) = \mathrm{var}(c)$ is maximized subject to

$$\|u\|^2 = 1,$$
$$\mathrm{var}(X^i(a^i)) = \mathrm{var}(x^i), \quad i = 1, \ldots, p,$$
$$\mathrm{cov}(c, c^j) = 0, \quad j = 1, \ldots, k-1.$$

The last constraint is omitted when $k = 1$. Writing $(a^{1,k}, \ldots, a^{p,k}, u^k)$ as an optimal argument of $f$ and $X(a^{(k)}) = [B^1 a^{1,k}|\cdots|B^p a^{p,k}]$ as an optimal matrix, the $k$th ASPC becomes $c^k = X(a^{(k)})u^k$. Note that $X$ as well as all coding matrices $B^i$ are column centered with respect to $D$ in order to make $c^k$ centered for arbitrary spline coefficients.
The $k$th additive principal function may be defined as $c^k(x_1, \ldots, x_p) = \sum_{i=1}^{p} \phi^{i,k}(x_i)$ with $\phi^{i,k}(x_i) = u^k_i s^{i,k}(x_i)$, where $s^{i,k}$ is the optimal spline function used for transforming the $i$th

predictor. A crucial problem is the choice of $M$, the number of components. Here we have no total variance decomposition theorem, because the data sets change as successive ASPCs are computed. However, the sequence $\mathrm{var}(c^1), \ldots, \mathrm{var}(c^M)$ cannot increase, because nested optimizations are considered (an exception to this rule may nevertheless occur when local optima are reached by the algorithm). The only question is then that of setting a stopping rule, that is, of estimating whether $\mathrm{var}(c^k)$ is "small". Since uncorrelated ASPCs are constructed for regression purposes only, one can pragmatically check out the goodness-of-fit for different model dimensions, see Section 4.3.
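The following loose sketch (ours) illustrates the alternating logic behind this definition for $k = 1$. It is not the author's algorithm, which is based on normal equations (see Section 5); in particular, the rescaled regression used here for the spline-coefficient step is only a heuristic way of enforcing the variance constraints, and all names are hypothetical.

```python
import numpy as np

def first_aspc(B_list, target_var, n_iter=50, seed=0):
    """Heuristic alternating search for the first ASPC.
    B_list: column-centered coding matrices B^i (n x r);
    target_var[i] = var(x^i), the variance constraint for each transform.
    The paper suggests nodal coefficients as a sensible start; we use a
    random start for simplicity."""
    n = B_list[0].shape[0]
    rng = np.random.default_rng(seed)
    a = [rng.normal(size=Bi.shape[1]) for Bi in B_list]
    for _ in range(n_iter):
        Xa = np.column_stack([Bi @ ai for Bi, ai in zip(B_list, a)])
        vals, vecs = np.linalg.eigh(Xa.T @ Xa / n)   # u-step (exact):
        u = vecs[:, -1]                              # leading eigenvector
        c = Xa @ u                                   # current composite
        for i, Bi in enumerate(B_list):              # heuristic a-step
            ai, *_ = np.linalg.lstsq(Bi, c, rcond=None)
            col = Bi @ ai
            a[i] = ai * np.sqrt(target_var[i] * n / (col @ col))
    return u, a, c
```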

4.2 ASPCs from the O.E.C.D. predictors

Our "running example" is a data set from econometrics that consists of eighteen observations (countries) and eighteen variables, analyzed by Bertier and Bouroche (1975). More precisely, thirteen predictors characterizing the eighteen countries have been measured to explain five consumer responses, see Table 1.

Table 1. O.E.C.D. variables

Predictors                                      Responses
POP   population                                CAL   calories per capita and per day
DENS  density per km²                           LODG  number of lodgings per 1000 cap.
POPG  population growth                         ELEC  electricity consumption
AGRF  % of farming & fishing population         EDUC  public expenditure for education
INDU  % of industrial population                TV    number of TV sets per 1000 cap.
GNP   gross national product per capita
GDPA  % of GNP for agriculture
FCF   fixed capital formation
RR    running receipts
OFR   official reserves (million $)
DR    discount rate
IMP   importations (million $)
EXP   exportations (million $)

Variables are centered and standardized with equally distributed weights ($D = 18^{-1} I_{18}$) and, for all the competing regression methods we will compare, the 13 predictors are transformed by B-spline functions of degree 1 (order 2) with 2 equally spaced interior knots. In this section, twelve ASPCs have been computed, whose variances are given in Table 2.

Table 2. ASPC's variance for the O.E.C.D. data set

ASPC   1     2     3     4     5     6     7     8     9     10    11    12
var    7.15  6.62  4.76  3.52  3.02  2.48  1.99  1.82  1.73  1.59  1.09  0.98

In contrast to common principal components, we observe the fact, shared by all the examples we have studied, that the sequence of variances decreases only mildly. We are now ready to address the problem of what to do with the ASPCs' D-orthogonal basis: the next section presents a simple way of selecting the components that best explain the responses.

4.3 Regression on additive spline principal components

Denoting by $C$ the $n \times M$ matrix of $M$ selected ASPCs, the regression on additive spline principal components, in short ASPCR, is geometrically defined as the orthogonal projection of $Y$ on $\mathrm{span}(C)$, the linear space spanned by the columns of $C$. The multivariate additive model is given by $\hat{Y} = P_C Y$, where $P_C$ is the projection matrix on $\mathrm{span}(C)$. Obviously ASPCR is a multivariate linear smoothing procedure that enters the additive framework of Section 3, and any modeled response may be expressed as the sum of $p$ coordinate functions of the predictors separately, each function being a particular spline function. The choice of $M$ ASPCs can be checked by examining the fit of nested models. Table 3 shows the $R^2$ values of the O.E.C.D. responses according to different model dimensions. It is clear that some ASPCs may be deleted, for instance the second and the three last.

Table 3. R² of the responses for nested ASPCR models

Model dim.  CAL    LODG   ELEC   EDUC   TV     % of total variance
 1          0.056  0.005  0.084  0.034  0.608  15.74
 2          0.105  0.062  0.091  0.080  0.608  18.92
 3          0.118  0.101  0.188  0.482  0.700  31.78
 4          0.378  0.103  0.358  0.727  0.821  47.74
 5          0.378  0.373  0.359  0.744  0.823  53.54
 6          0.443  0.463  0.360  0.746  0.828  56.80
 7          0.615  0.718  0.671  0.781  0.904  73.78
 8          0.629  0.732  0.693  0.814  0.905  75.46
 9          0.769  0.838  0.714  0.832  0.905  81.16
10          0.865  0.838  0.737  0.837  0.929  84.12
11          0.880  0.844  0.785  0.853  0.952  86.28
12          0.923  0.871  0.827  0.878  0.966  89.30

Finally, one can reduce the model dimension $M$ by choosing the ASPCs that best explain the responses. Here ASPCs 1, 3, 4, 5, 7 and 9 seem to provide a good trade-off between goodness of fit and dimension reduction. Table 4 presents the results for the corresponding nested models, and the goodness-of-fit may be compared with that of the model of dimension 6 in Table 3 (73.04% against 56.80%).

Table 4. R² of the responses for nested selected ASPCR models

Model dim.  ASPC  CAL    LODG   ELEC   EDUC   TV     % of total variance
 1           1    0.056  0.005  0.084  0.034  0.608  15.74
 2           3    0.068  0.044  0.181  0.436  0.700  28.58
 3           4    0.329  0.047  0.354  0.681  0.820  44.58
 4           5    0.329  0.317  0.352  0.698  0.822  50.36
 5           7    0.501  0.571  0.664  0.732  0.898  67.32
 6           9    0.641  0.678  0.685  0.750  0.898  73.04

Responses TV and EDUC are well reconstituted by the six-dimensional model, whereas CAL, LODG and ELEC are rather badly approximated. Let us now pay more attention to selecting the predictors of main influence for the variable EDUC, which will be our "running response" for comparing different models. Figure 1 shows coordinate function plots for the six main variables of the ASPCR model. The influence of a predictor on a response is here measured by the range of the transformed data, marked by their corresponding numbers: Germany (1 D), Austria (2 A), Belgium (3 B), Canada (4 CDN), Denmark (5 DK), Spain (6 E), USA (7 USA), Finland (8 FI), France (9 F), Greece (10 G), Ireland (11 IRL), Italy (12 I), Japan (13 J), Norway (14 N), the Netherlands (15 NL), Portugal (16 P), England (17 UK), Sweden (18 S). As a confirmatory point, note that EDUC is modeled in the same fashion when all ASPCs are used (same influential predictors with similar function shapes).

Figure 1: Coordinate function plots of the main predictors (in decreasing order from left to right) for EDUC modeled by ASPCR. The dotted vertical lines indicate the position of the knots.

5. Additive splines for Partial Least Squares regression

This method, whose multivariate linear version is due to Wold et al. (1983), has been extended to additive spline multiresponse regression by Durand and Sabatier (1994). It mainly differs from ASPCR in that components or latent variables are constructed at the same time as the regression is computed, and not before, thus hopefully providing components with better explanatory potential. The example of the O.E.C.D. data will actually illustrate that a smaller number of components is needed in ASPLS (Additive Spline Partial Least Squares) than in ASPCR for explaining the same amount of Y-variance.

5.1 Mathematical background

DEFINITION. Using the standard notation of linear PLS, see for example (Frank and Friedman 1993), the $k$th ASPLS components are the linear composites $t = X(a)w$ and $u = F_{k-1}c$, with $F_0 = Y$, which maximize the objective function $f(a^1, \ldots, a^p, w, c) = w'X(a)'DF_{k-1}c = \mathrm{cov}(t, u)$ subject to

$$\|w\|^2 = \|c\|^2 = 1,$$
$$\mathrm{var}(X^i(a^i)) = \mathrm{var}(x^i), \quad i = 1, \ldots, p,$$
$$\mathrm{cov}(t, t^j) = 0, \quad j = 1, \ldots, k-1.$$

In the same way as ASPCs are defined, the last constraint is omitted when $k = 1$. Writing $(a^{1,k}, \ldots, a^{p,k}, w^k, c^k)$ as an optimal argument of $f$ and $X(a^{(k)}) = [B^1 a^{1,k}|\cdots|B^p a^{p,k}]$ as an optimal matrix, the $k$th ASPLS components become $t^k = X(a^{(k)})w^k$ and $u^k = F_{k-1}c^k$. The final part of step $k$ of ASPLS consists of updating $F_k$ as the residual of the regression of $F_{k-1}$ onto $t^k$:

$$F_k = F_{k-1} - P_{t^k}F_{k-1}. \qquad (5)$$
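For orientation, here is the linear PLS skeleton that ASPLS augments: with the spline transforms held fixed, each step extracts the weight vectors from $X'DF_{k-1}$, forms the component, and deflates as in (5). This is our sketch under uniform weights, not the authors' code; ASPLS additionally re-optimizes the spline coefficients $a^i$ within each step.

```python
import numpy as np

def pls_components(X, Y, M):
    """Linear PLS2 skeleton (uniform weights): only the component extraction
    and the deflation step (5) are illustrated."""
    Xk, F, T = X.copy(), Y.copy(), []
    for _ in range(M):
        U, s, Vt = np.linalg.svd(Xk.T @ F, full_matrices=False)
        w = U[:, 0]                              # maximizes cov(t, u), ||w|| = 1
        t = Xk @ w
        T.append(t)
        F = F - np.outer(t, t @ F) / (t @ t)     # deflation (5): F_k = F_{k-1} - P_t F_{k-1}
        Xk = Xk - np.outer(t, t @ Xk) / (t @ t)  # keeps successive t's uncorrelated
    return np.column_stack(T)
```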
It must be noted that fixing spline coefficients equal to nodal coefficients in the optimization problem above does not lead to the linear PLS components, except for $t^1$ and $u^1$. Moreover, the linear PLS property of reconstructing the predictor matrix is obviously not preserved in ASPLS, whose aim is only to provide an additive approximation to the response variables. As a consequence of (5), the ASPLS model is given by

$$Y = F_0 = \sum_{k=1}^{M} \hat{Y}_k + F_M = \hat{Y} + F_M, \qquad (6)$$

where $\hat{Y}_k = P_{t^k}F_{k-1}$ is the $k$th partial model matrix of rank 1. The model dimension $M$ can be determined by cross-validation. More pragmatically, the fact that the $t^k$s are mutually uncorrelated implies the additive decomposition of the total Y-variance

$$\|Y\|^2_D = \sum_{k=1}^{M} \|\hat{Y}_k\|^2_D + \|F_M\|^2_D. \qquad (7)$$

Table 5. R² of the responses for nested ASPLS models

Model dim.  CAL    LODG   ELEC   EDUC   TV     % of total variance
 1          0.054  0.104  0.370  0.463  0.874  37.26
 2          0.534  0.547  0.405  0.779  0.881  58.82
 3          0.722  0.576  0.423  0.820  0.889  68.64
 4          0.814  0.648  0.692  0.820  0.902  77.57

To determine the ASPLS model dimension, one can easily measure the part of each component in the reconstruction of the total response variance. It can be shown (Durand and Sabatier 1994) that (6) enters the additive framework (4). No linear smoother matrix can be associated with model (6), since $Y$ does not enter linearly in the expression of $\hat{Y}$. One can also consult the latter paper for computational aspects of ASPLS, which are based on normal equations just like the ASPC algorithm of Section 4.2.

Figure 2: Influential coordinate function plots (in decreasing order from left to right) for EDUC modeled by ASPLS. The dotted vertical lines indicate the position of the knots.

5.2 ASPLS applied to the O.E.C.D. data

Going back to the example, Table 5 shows the $R^2$ response values for up to 4 model dimensions. Four components provide a better fit than the six selected ASPCR components. Moreover, response EDUC, whose eight coordinate functions of main influence are presented in Figure 2, is modeled by ASPLS similarly to ASPCR: predictors RR, DR, DENS, GNP, EXP and AGRF mainly participate in the additive fit. Because responses are projected on components, a particular aspect of the reducing properties of ASPLS is the possibility of "naming" predictor components by their capability in explaining the responses. In Figure 3, axis 1 summarizes the opposition between strong and weak consumer countries, while axis 2 contrasts countries according to responses LODG and CAL.

Figure 3: The $(t_1, t_2)$ display is similar to a first principal component scatterplot, used to explain the responses summarized by the $(c_1, c_2)$ and $(u_1, u_2)$ plots.

6. Self-ASPLS components for exploratory analysis and additive modeling

An appealing idea is that of using ASPLS components to approximate the X data matrix itself. When the predictors are taken as responses, $\hat{X}$ computed by (6) can be seen as an additive spline summary of rank $M$ for $X$. This approximation cannot be better than that provided by the optimal Eckart-Young theorem, since here the objective function to be maximized is $\mathrm{cov}(t, u)$. The components $t^1, \ldots, t^M$, referred to as the self-ASPLS components, may be used as an exploratory tool. In fact, self-ASPLS components are close to linear principal components, and Figure 4 illustrates the comparison between self-ASPLS and common principal components, both computed from the predictors of the O.E.C.D. data.
Figure 4: Comparison between common (a), (b) and self-ASPLS (c), (d) principal component plots.

We observe great similarities in the component plots. Note however that the six self-ASPLS components explain respectively 43.98, 14.94, 12.08, 7.00, 5.26 and 4.11 percent (total 87.37%) of the total variance, against 45.12, 18.25, 11.91, 8.49, 7.52 and 3.80 percent (total 95.09%) for the common principal components.
Linear regression on self-ASPLS components provides an additive model in the same way as ASPCR does in Section 4.3. Table 6 shows $R^2$ values for the O.E.C.D. responses regressed on six explanatory self-ASPLS components. Figure 5 displays coordinate function plots of the influential predictors on the "current" response EDUC. For all the models and methods studied, we observe a great stability in the prediction, since variables DR, RR, GNP, DENS and AGRF all chiefly intervene, with similar transformations, in the model.

Table 6. R² of the responses for six nested self-ASPLS component regression models

Model dim.  CAL    LODG   ELEC   EDUC   TV     % of total variance
 1          0.047  0.018  0.106  0.072  0.611  18.28
 2          0.052  0.084  0.107  0.245  0.676  23.28
 3          0.338  0.091  0.327  0.465  0.811  40.84
 4          0.427  0.376  0.466  0.700  0.861  56.60
 5          0.466  0.545  0.475  0.701  0.864  61.02
 6          0.570  0.606  0.605  0.757  0.875  68.26

Figure 5: Main coordinate function plots for EDUC modeled by regression on self-ASPLS components. The dotted vertical lines indicate the position of the interior knots.

7. Additive splines for Principal Component Analysis on Instrumental Variables

Principal Component Analysis on Instrumental Variables was introduced by Rao (1964) and has been extended to a more general linear context (Escoufier 1987) by introducing adapted metrics in principal component analysis. A nonlinear additive spline extension of this approach, referred to as ASPCAIV, can be found in (Durand 1993). This method differs from those of Sections 4, 5 and 6 in that dimension reduction (here, linear PCA) occurs after a multivariate linear smoothing procedure has been applied. The associated smoother matrix is here $P_{X(a)}$, which is to be compared with that of the least-squares regression spline model of Section 3. Although projections are processed on spaces of smaller dimensions, ASPCAIV is sensitive to scarcity of data, and the O.E.C.D. example does not provide a sufficient number of samples to interpret the model. To illustrate the performance of the method, two examples are shown below. The first one presents a simple regression based on 200 observations. In the second, two predictors are related to three responses, all measured on 250 items.

7.1 Mathematical background

DEFINITION. Let $Q$ be a $q \times q$ positive definite matrix that defines a metric for computing Euclidean distances between objects in the response space $\mathbb{R}^q$. The aim of the ASPCAIV method is first to find a $p \times p$ metric $R$ and a vector of spline coefficients $a$ that minimize the objective function $f(R, a) = \mathrm{trace}\,[YQY'D - X(a)RX(a)'D]^2$. Then, dimension reduction is done (if needed) by solving the eigenanalysis of $X(a)RX(a)'D$.
The objective function can be interpreted as a measure of discrepancy between eigen-matrices for components associated with the predictor and response sets of variables. For fixed $a$, an optimal metric is explicitly given by

$$R = [X(a)'DX(a)]^+ [X(a)'DY]\, Q\, [Y'DX(a)]\, [X(a)'DX(a)]^+, \qquad (8)$$

but the reverse is false: for fixed $R$, no explicit optimal $a$ is available. An iterative algorithm is needed which alternates a step of computing the best $R$ for fixed $a$ with a step of gradient descent with respect to $a$. Due to (8), when a local optimum is obtained, solving the eigenproblem for $X(a)RX(a)'D$ leads to the same solutions as for $\hat{Y}Q\hat{Y}'D$, where

$$\hat{Y} = P_{X(a)}Y. \qquad (9)$$

Figure 6: Evolution of the smoother along the first 6 steps of the method applied to the example based on 200 $(x_i, y_i)$ observations (dots), $y_i = \sin(2\pi(1 - x_i)^2) + x_i\varepsilon_i$, with $x_i$ uniform on [0,1] and $\varepsilon_i$ standard normal. The signal (dashed), the spline smooth (solid), degree 2 with 3 knots.

As a consequence, ASPCAIV is a multivariate linear smoothing procedure associated with the smoother projection matrix $P_{X(a)}$. Figure 6 presents the evolution of the scatterplot smoother along the first steps of the method applied to the example from chapter 18 of the S-PLUS (1991) user's manual (here $Q = [1]$, and splines are of degree 2 with 3 equally spaced knots). Note that the initializing choice of nodal coefficients amounts to computing the common linear regression at the first step of the method.

7.2 Application to additive spline discriminant analysis

By (9), ASPCAIV provides a competing model for classical additive models (Hastie and Tibshirani 1990). However, to make use of ASPCAIV's abilities, the metric $Q$ can be different from the identity and specifically chosen according to specific multivariate regressions. For example, it can be shown that taking $Q = (Y'DY)^{-1}$ leads to considering the method as the best canonical correlation analysis between $Y$ and any smoothed predictor matrix defined by (2). More precisely, the optimization problem becomes

$$\mathrm{trace}\,(P_Y P_{X(\hat{a})})^2 \geq \mathrm{trace}\,(P_Y P_{X(a)})^2, \quad \text{for arbitrary } a.$$

By taking the indicator matrix of classes for the $Y$ sample matrix, a direct application of this result concerns nonlinear Discriminant Analysis (Durand 1992, 1993, Hastie et al. 1994). Note that, here, linear Discriminant Analysis is obtained at the first step of the algorithm (because nodal spline coefficients are used) and that discriminating by additive spline variables can only improve the linear results. Discriminant variables are deduced from the principal components by using a scaling factor (Escoufier 1987). Denoting $G = (Y'DY)^{-1}Y'DX(a)$ the matrix whose rows $G_i$ are the centroids of the classes defined by $X(a)$ and $Y$, the classification rule is: the object $x$ to be classified is transformed into $t$ (considered as a row vector) by using the optimal B-spline functions, and then assigned to class $j$ if

$$(t - G_j)\,[X(a)'DX(a)]^+\,(t - G_j)' = \min_{i=1,\ldots,q}\,(t - G_i)\,[X(a)'DX(a)]^+\,(t - G_i)'. \qquad (10)$$

As in linear Discriminant Analysis, other metrics can be chosen for the geometrical assignment rule (10). However, in order that the spline transformations make sense, the user has to verify that each observation lies within the range of the corresponding variable in the training sample.

Figure 7: Misclassified items are marked with circles and centroids with black squares.
The application illustrating the performance of the method is based on three-class, two-dimensional data $(x, y)$. Simulated data are generated as follows: the three class distributions are uniform on annuli centered at the origin with respective extreme radii (0,1), (1.5,2.5) and (3,3.5). The numbers of items are 50 for the first group and 100 for each of the others, so that the training sample matrix $X$ is 250 × 2 and $Y$, the indicator matrix of classes, is 250 × 3. The usual linear and quadratic discriminant methods perform poorly on this data set, while discrimination by additive spline discriminant variables provides good results: Figure 7 presents the seven misclassified items, marked with circles, in both the $(x, y)$ plane and the two first discriminant variables. Only one discriminant variable is needed for separating the classes, and the first eigenvalue, close to one, yields well separated groups.

8. Conclusion
In this paper we have compared and illustrated various additive multiresponse regression methods from the point of view of prediction as well as interpretation. It is well known that in case of extreme collinearity among predictors, interpreting linear regression coefficients is dangerous. ASPCAIV is an additive modeling method in which dimension reducing aspects occur after the regression is computed. Therefore, its scope of application is that of predictor data sets with many more observations than variables. The possibility of choosing different metrics leads us to view ASPCAIV as a unifying framework for additive extensions of some two-data-block linear methods.
In some chemometrics, econometrics or sociometrics applications, the number of predictors exceeds or approximately equals the number of observations, and nonlinear methods presenting dimension reduction stages before, or at the same time as, the prediction is processed are to be preferred. The ASPLS method, and the regression on ASPCs as well as on self-ASPLS components, all construct a set of uncorrelated components which are additive functions of the predictors. In ASPLS, linear regression on such components occurs as soon as they are constructed, thus generally providing a more parsimonious additive model, in the sense that fewer additive components are needed for explaining the same amount of the response variance. A great stability has been found in interpreting the different additive models on the O.E.C.D. data: the choice of a set of influential predictors is largely independent of the method, and the coordinate spline function shapes are similar.

References

Bertier, P. and Bouroche, J.-M. (1975), Analyse des données multidimensionnelles, Paris: PUF.
De Boor, C. (1978), A practical guide to splines, New York: Springer.
Donnell, D. J. et al. (1994), Analysis of additive dependencies and concurvities using smallest additive principal components (with discussion), The Annals of Statistics, 4, 1635-1673.
Durand, J. F. (1992), Additive spline discriminant analysis, in Computational Statistics, Vol. 1, (Y. Dodge and J. Whittaker, eds.), Physica-Verlag, 144-149.
Durand, J. F. (1993), Generalized principal component analysis with respect to instrumental variables via univariate spline transformations, Computational Statistics & Data Analysis, 16, 423-440.
Durand, J. F. and Sabatier, R. (1994), Additive splines for PLS regression, Tech. Rept. 94-05, Unité de Biométrie, ENSAM-INRA-UM II, Montpellier, France. In press in Journal of the American Statistical Association.
Escoufier, Y. (1987), Principal components analysis with respect to instrumental variables, European Courses in Advanced Statistics, University of Napoli, 285-299.
Eubank, R. L. (1988), Spline smoothing and nonparametric regression, New York and Basel: Dekker.
Frank, I. E. and Friedman, J. H. (1993), A statistical view of some chemometrics regression tools (with discussion), Technometrics, 35, 109-148.
Gifi, A. (1990), Nonlinear multivariate analysis, Chichester: Wiley.
Hastie, T. and Tibshirani, R. (1990), Generalized additive models, London: Chapman and Hall.
Hastie, T. et al. (1994), Flexible discriminant analysis by optimal scoring, Journal of the American Statistical Association, 89, 1255-1270.
Ramsay, J. O. (1988), Monotone regression splines in action (with discussion), Statistical Science, 3, 425-461.
Rao, C. R. (1964), The use and the interpretation of principal component analysis in applied research, Sankhya A, 26, 329-356.
Wold, S. et al. (1983), The multivariate calibration problem in chemistry solved by the PLS method, Proc. Conf. Matrix Pencils, Ruhe, A. and Kågström, B. (eds.), Lecture Notes in Mathematics, Heidelberg: Springer-Verlag, 286-293.
Bounded Algebraic Curve Fitting for Multidimensional Data Using the Least-Squares Distance

Masahiro Mizuta

Division of Systems and Information Engineering, Hokkaido University
N.13, W.8, Kita-ku, Sapporo-shi, Hokkaido 060, Japan

Summary: Linear regression or smoothing techniques are not adequate for curve fitting in cases in which neither variable can be designated as the response. We present a new method for fitting a bounded algebraic curve to multidimensional data using the least-squares distance between data points and the curve. Numerical examples of the proposed method are also shown.

1. Introduction
In the data analysis process, we can sometimes investigate data structures with methods of curve fitting. Curves can be represented by explicit functions, parametric functions, or implicit functions. Curves represented by explicit functions reveal the influence of other variables on one variable, like regression analysis. Curves given by parametric functions provide a latent order of the data and are depicted easily with computer graphics. Curves given by implicit functions, i.e., algebraic curves, show relations among variables.
Many researchers have proposed curve fitting methods with algebraic curves. In particular, it is worth noticing that Keren et al. (1994) and Taubin et al. (1994) independently developed algorithms for bounded algebraic curves. However, most studies are based on approximate distances between data points and the algebraic curve. We have developed a method to find the algebraic curve that minimizes the sum of squared exact distances. In this article, we propose a method to find bounded algebraic curves based on exact distances.

2. Algebraic curve fitting and bounded algebraic curves

A $p$-dimensional algebraic curve is the set of zeros of $k$ polynomials $f(x) = (f_1(x), \ldots, f_k(x))$ on $\mathbb{R}^p$,

$$Z(f) = \{x : f(x) = 0\}.$$

We restrict ourselves to $p = 2$ and $k = 1$ without loss of generality hereafter:

$$Z(f) = \{(x, y) : f(x, y) = 0\}. \qquad (1)$$

2.1 Approximate distance and exact distance

The distance from a point $a = (\alpha, \beta)$ to the curve $Z(f)$ is usually defined by

$$\mathrm{dist}(a, Z(f)) = \inf\{\|a - y\| : y \in Z(f)\} = \inf_{x,y}\left\{\sqrt{(x - \alpha)^2 + (y - \beta)^2} : f(x, y) = 0\right\}. \qquad (2)$$

Fig. 1: Distance between the curve Z(f) and the point (α, β).
It was said that the distance between a point and an algebraic curve cannot be computed by direct methods. Taubin therefore proposed an approximate distance from $a$ to $Z(f)$ (Taubin (1991)). The point $\hat{y}$ that approximately minimizes the distance $\|y - a\|$ is given by

$$\hat{y} = a - (\nabla f(a)^T)^+ f(a), \qquad (3)$$

where $(\nabla f(a)^T)^+$ is the pseudoinverse of $\nabla f(a)^T$. The distance from $a$ to $Z(f)$ is then approximated by

$$\mathrm{dist}(a, Z(f))^2 \approx \frac{f(a)^2}{\|\nabla f(a)\|^2}. \qquad (4)$$
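As a small self-contained illustration (our code, not the paper's), the first-order approximation (4) is immediate to evaluate; the unit-circle example below shows how it underestimates the true distance away from the curve.

```python
import numpy as np

def taubin_distance(f, grad_f, a):
    """First-order approximate distance (4) from point a to Z(f)."""
    return abs(f(a)) / np.linalg.norm(np.asarray(grad_f(a), dtype=float))

# Unit circle f(x, y) = x^2 + y^2 - 1: from (2, 0) the exact distance is 1,
# while the approximation gives |f| / ||grad f|| = 3/4 = 0.75.
f = lambda a: a[0]**2 + a[1]**2 - 1
grad_f = lambda a: (2 * a[0], 2 * a[1])
print(taubin_distance(f, grad_f, (2.0, 0.0)))
```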

Fig. 2: Approximate distance and exact distance.


Kriegman and Ponce (1990) have shown that elimination theory can be used to construct a closed-form expression for the exact distance, but the amount of computation is impractical.
We have developed a method to calculate the exact distance with a technique of constrained optimization (Mizuta (1995)). Let $x = (x, y)$ be the nearest point on $Z(f)$ to a point $a = (\alpha, \beta)$. The point $x = (x, y)$ can be calculated with an augmented Lagrangian function (Hestenes (1969)):

$$Q(x, \mu, r) = (x - \alpha)^2 + (y - \beta)^2 + \mu f(x, y) + \frac{r}{2} f(x, y)^2,$$

where $\mu$ is a Lagrangian parameter and $r$ is a penalty parameter. The algorithm proceeds with the following steps:

Initialization: Set $\mu_0 = 1$, $r_0 = 1$, $k = 0$. Set $x_0$ to the initial point.
Step 1: Minimize $Q(x, \mu_k, r_k)$ with respect to $x$; call the minimizer $x_k$.
Step 2: Stop if the value of $f(x, y)^2$ is below some threshold.
Step 3: $r_{k+1} = c\, r_k$, where $c$ is a constant greater than 1.
Step 4: $\mu_{k+1} = \mu_k + r_k f(x, y)$.
Step 5: $k = k + 1$, and return to Step 1.
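A direct transcription of these steps into code might look as follows; the inner minimization of Step 1 is delegated here to a general-purpose derivative-free optimizer (scipy's Nelder-Mead), which is our choice and not part of the original algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def exact_distance(f, a, x0, c=2.0, tol=1e-10, max_outer=30):
    """Nearest point on Z(f) to a = (alpha, beta) by the augmented
    Lagrangian iteration above (our transcription)."""
    a = np.asarray(a, dtype=float)
    x = np.asarray(x0, dtype=float)
    mu, r = 1.0, 1.0                                    # mu_0 = r_0 = 1
    for _ in range(max_outer):
        Q = lambda z: np.sum((z - a)**2) + mu * f(z) + 0.5 * r * f(z)**2
        x = minimize(Q, x, method='Nelder-Mead').x      # Step 1
        if f(x)**2 < tol:                               # Step 2: stopping rule
            break
        mu = mu + r * f(x)                              # Step 4 (uses r_k)
        r = c * r                                       # Step 3: r_{k+1} = c r_k
    return np.linalg.norm(x - a), x

# unit circle again: the exact distance from (2, 0) is 1
d, foot = exact_distance(lambda z: z[0]**2 + z[1]**2 - 1,
                         (2.0, 0.0), (1.5, 0.1))
```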
2.2 Methods for curve fitting
Taubin (1991) presented an algorithm to find the algebraic curve such that the sum of approximate squared distances between the data points and the curve is minimal. We have already developed an algorithm to find the algebraic curve with exact squared distances (Mizuta (1995)). Although algebraic curves can fit the data very well, they usually contain points far remote from the given data set. In 1994, Keren et al. (1994) and Taubin et al. (1994) independently developed algorithms for bounded (closed) algebraic curves with approximate squared distances. We will introduce the definition and properties of bounded algebraic curves in the next subsection.
2.3 Bounded algebraic curves
We call $Z(f)$ bounded iff there exists a constant $r$ such that $Z(f) \subset \{x : \|x\| < r\}$. For example, it is clear that $Z(x^2 + y^2 - 1)$ is bounded, but $Z(x^2 - y^2)$ is not bounded. Keren et al. (1994) defined $Z(f)$ to be stably bounded if a small perturbation of the coefficients of the polynomial leaves its zero set bounded. The algebraic curve $Z((x - y)^4 + x^2 + y^2 - 1)$ is bounded, but not stably bounded, because $Z((x - y)^4 + x^2 + y^2 - 1 + \epsilon x^3)$ is not bounded for any $\epsilon \neq 0$.

Fig. 3: Examples of bounded and stably bounded curves: (a) $f(x,y) = x^2 + y^2 - 1$, stably bounded; (b) $f(x,y) = x^2 - y^2$, not bounded; (c) $f(x,y) = (x-y)^4 + x^2 + y^2 - 1$, bounded but not stably bounded.

Let $f_k(x, y)$ be the form of degree $k$ of a polynomial $f(x, y)$: $f(x, y) = \sum_{k=0}^{d} f_k(x, y)$. The leading form of a polynomial $f(x, y)$ of degree $d$ is defined as $f_d(x, y)$. For example, the leading form of $f(x, y) = x^2 + 2xy - y^2 + 5x - y + 3$ is $f_2(x, y) = x^2 + 2xy - y^2$.
Lemma: For an even positive integer $d$, any leading form $f_d(x, y)$ can be represented by $XAX^T$, where $A$ is a symmetric matrix and $X = (x^{d/2}, x^{d/2-1}y, \ldots, xy^{d/2-1}, y^{d/2})$.
Remark: The symmetric matrix $A$ is not unique. For example,

$$x^4 - x^2y^2 + y^4 = (x^2, xy, y^2)\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} x^2\\ xy\\ y^2 \end{pmatrix} = (x^2, xy, y^2)\begin{pmatrix} 1 & 0 & -\tfrac{1}{2}\\ 0 & 0 & 0\\ -\tfrac{1}{2} & 0 & 1 \end{pmatrix}\begin{pmatrix} x^2\\ xy\\ y^2 \end{pmatrix}.$$

Theorem (Keren et al. (1994)): $Z(f)$ is stably bounded iff $d$ is even and there exists a symmetric positive definite matrix $A$ such that $f_d(x, y) = XAX^T$.
3. Bounded algebraic curve fitting based on exact distances

We must carry out two steps to find the bounded algebraic curve fitted to given data. In the first step, we use the method to calculate the distance from a given point to a given algebraic curve, already described in Section 2. The next step is to find the bounded curve that minimizes the sum of the squared distances from the data.
We parametrize a set of polynomials that induce (stably) bounded algebraic curves. In general, a polynomial $f$ of degree $p$ with $q$ parameters can be denoted by $f(a_1, \ldots, a_q, x, y)$, where $a_1, \ldots, a_q$ are the parameters of the polynomial.
For example, all the polynomials of degree 2 can be represented by

$$f(a_1, a_2, \ldots, a_6, x, y) = A^T X,$$

where $X = (1, x, y, x^2, xy, y^2)^T$ and $A = (a_1, a_2, \ldots, a_6)^T$.

For stably bounded algebraic curves of degree 4,

$$f(a_1, \ldots, a_{16}, x, y) = (x^2, xy, y^2)\,B\,(x^2, xy, y^2)^T + (a_7, \ldots, a_{16})(1, x, y, \ldots, y^3)^T,$$

where

$$B = \begin{pmatrix} a_1 & a_2 & a_3\\ a_2 & a_4 & a_5\\ a_3 & a_5 & a_6 \end{pmatrix}.$$
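The sketch below (ours, with hypothetical names) evaluates this parametrized quartic and tests the Keren et al. condition by checking whether this particular $B$ is positive definite; by the Remark above, the representing matrix is not unique, so the test is sufficient but not necessary for stable boundedness.

```python
import numpy as np

def quartic(a, x, y):
    """Our transcription of the degree-4 parametrization: a[0:6] fill the
    symmetric matrix B, a[6:16] are the ten lower-order coefficients."""
    B = np.array([[a[0], a[1], a[2]],
                  [a[1], a[3], a[4]],
                  [a[2], a[4], a[5]]])
    X = np.array([x * x, x * y, y * y])
    low = np.array([1, x, y, x*x, x*y, y*y, x**3, x*x*y, x*y*y, y**3])
    return X @ B @ X + np.asarray(a[6:16]) @ low

def leading_form_pd(a, eps=1e-9):
    """Sufficient check for stable boundedness: this B is positive definite."""
    B = np.array([[a[0], a[1], a[2]],
                  [a[1], a[3], a[4]],
                  [a[2], a[4], a[5]]])
    return np.linalg.eigvalsh(B).min() > eps
```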
Let $(\alpha_i, \beta_i)$, $i = 1, 2, \ldots, n$, be $n$ data points in the plane. The point in $Z(f)$ that minimizes the distance from $(\alpha_i, \beta_i)$ is denoted by $(x_i, y_i)$, $i = 1, 2, \ldots, n$. The sum of squares of the distances is

$$R = \sum_{i=1}^{n} R_i, \qquad R_i = (x_i - \alpha_i)^2 + (y_i - \beta_i)^2.$$

Fig. 4: The sum of squares of distances.

We can minimize $R$ with respect to the parameters of the polynomial $f$ with the Levenberg-Marquardt method. The method requires the partial derivatives of $R$ with respect to $a_j$:

$$\frac{\partial R_i}{\partial a_j} = 2\left[(x_i - \alpha_i)\frac{\partial x_i}{\partial a_j} + (y_i - \beta_i)\frac{\partial y_i}{\partial a_j}\right]. \qquad (5)$$

The only thing left to discuss is a solution for $\frac{\partial x_i}{\partial a_j}$ and $\frac{\partial y_i}{\partial a_j}$. Hereinafter, the subscript $i$ is omitted.
By taking the derivative of both sides of $f(a_1, \ldots, a_q, x, y) = 0$ with respect to $a_j$ ($j = 1, \ldots, q$), we obtain

$$\frac{\partial f}{\partial x}\frac{\partial x}{\partial a_j} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial a_j} + \frac{df}{da_j} = 0, \qquad (6)$$

where $\frac{df}{da_j}$ is the differential of $f$ with respect to $a_j$ when $x$ and $y$ are fixed.
Because $(\alpha_i, \beta_i)$ is on the normal line from $(x_i, y_i)$, we have $(y - \beta)\frac{\partial f}{\partial x} - (x - \alpha)\frac{\partial f}{\partial y} = 0$. By taking the derivative of both sides with respect to $a_j$, we obtain

$$\left((y-\beta)\frac{\partial^2 f}{\partial x^2} - \frac{\partial f}{\partial y} - (x-\alpha)\frac{\partial^2 f}{\partial x \partial y}\right)\frac{\partial x}{\partial a_j} + \left(\frac{\partial f}{\partial x} + (y-\beta)\frac{\partial^2 f}{\partial x \partial y} - (x-\alpha)\frac{\partial^2 f}{\partial y^2}\right)\frac{\partial y}{\partial a_j} + (y-\beta)\frac{\partial^2 f}{\partial x\, \partial a_j} - (x-\alpha)\frac{\partial^2 f}{\partial y\, \partial a_j} = 0. \qquad (7)$$

The equations (6) and (7) are simultaneous linear equations in the two variables $\frac{\partial x}{\partial a_j}$ and $\frac{\partial y}{\partial a_j}$. So we can get $\frac{\partial x}{\partial a_j}$ and $\frac{\partial y}{\partial a_j}$.
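If the analytic derivatives (5)-(7) are not implemented, a pragmatic alternative (ours, not the paper's method) is to hand the exact distances directly to a Levenberg-Marquardt-type solver with numerically approximated derivatives; the sketch below does this with scipy, reusing an exact_distance routine such as the one sketched in Section 2.1.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_curve(points, a0, f, foot0):
    """Sketch: minimize R by giving the per-point exact distances to a
    least-squares solver with finite-difference derivatives.
    f(a, z): the polynomial; foot0(p): a starting foot point for data
    point p; exact_distance as sketched in Section 2.1."""
    def residuals(a):
        return np.array([exact_distance(lambda z: f(a, z), p, foot0(p))[0]
                         for p in points])
    return least_squares(residuals, a0, method='lm').x
```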

4. Numerical examples
Two examples are provided of bounded versus unbounded curve fitting.
The first example data set is two-dimensional artificial data of size 40 that lies in the neighborhood of an asteroid. We set the result of GPCA (Generalized Principal Components Analysis: Gnanadesikan (1977), Mizuta (1983)) as an initial curve, and search for a fitted algebraic curve and a fitted bounded algebraic curve of degree 4 with the proposed method (Fig. 5). The sum of squared distances $R$ is 0.088 in the case of bounded fitting. This value is greater than the value in the case of unbounded fitting ($R = 0.026$), but Figure 5 shows that the bounded algebraic curve reveals a suitable outline of the data points.
The second is three-dimensional data of size 210. The 210 points almost lie on a

closed cylinder (Fig. 6 (a)). We also apply the method to these data, with an algebraic surface and a bounded algebraic surface of degree 4 (Fig. 6 (b), (c)). The value of $R$ is 1.239 in the case of bounded fitting and 0.924 in the case of unbounded fitting. The unbounded algebraic surface reproduces the structure of the closed cylinder, and the bounded surface shows a global shape of the data points.

Fig. 5: Asteroid data: bounded (left) and unbounded (right).

Fig. 6: Cylinder data: (a) data points, (b) bounded fitting, (c) unbounded fitting.

5. Concluding remarks
In this article, we have not discussed curve fitting in three-dimensional space:

$$Z(f) = \{(x, y, z) : f_1(x, y, z) = 0 \text{ and } f_2(x, y, z) = 0\}.$$

However, extending the proposed method is not difficult.

Taubin (1994) proposed the approximate distance of order $k$ and presented algorithms for rastering algebraic curves. The proposed algorithm for the exact distance can also be used for rastering algebraic curves.

References:

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. John Wiley & Sons.
Hestenes, M.R. (1969). Survey paper: Multiplier and gradient methods, J. of Optimization Theory and Applications, 4, 303-320.
Keren, D., Cooper, D. and Subrahmonia, J. (1994). Describing complicated objects by implicit polynomials, IEEE Trans. Patt. Anal. Machine Intell., 16, 1, 38-53.
Kriegman, D. J. and Ponce, J. (1990). On recognizing and positioning curved 3-D objects from image contours, IEEE Trans. Patt. Anal. Machine Intell., 12, 12, 1127-1137.
Mizuta, M. (1983). Generalized principal components analysis invariant under rotations of a coordinate system, J. Japan Statist. Soc., 14, 1-9.
Mizuta, M. (1995). A derivation of the algebraic curve for two-dimensional data using the least-squares distance, in Data Science and Its Application, Hayashi, C. et al. (eds.), 167-176, Academic Press, Tokyo.
Taubin, G. (1991). Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation, IEEE Trans. Patt. Anal. Machine Intell., 13, 11, 1115-1138.
Taubin, G. (1994). Distance approximations for rastering implicit curves, ACM Trans. on Graphics, 13, 1, 3-42.
Taubin, G., Cukierman, F., Sullivan, S., Ponce, J. and Kriegman, D. J. (1994). Parameterized families of polynomials for bounded algebraic curve and surface fitting, IEEE Trans. Patt. Anal. Machine Intell., 16, 3, 287-303.
Using the Wavelet Transform for
Multivariate Data Analysis and
Time Series Analysis
Fionn Murtagh¹, Alexandre Aussem²
¹ Faculty of Informatics, University of Ulster,
Magee College, Londonderry BT48 7JL, Nth. Ireland
Email: [email protected]  Web: www.infm.ulst.ac.uk/~fionn
² Université Blaise Pascal, Clermont-Ferrand II,
ISIMA, Campus des Cézeaux, BP 125, 63173 Aubière Cedex, France
Email: [email protected]

Summary: We discuss the use of orthogonal wavelet transforms in multivariate data analysis methods such as clustering and dimensionality reduction. Wavelet transforms allow us to introduce multiresolution approximation, and multiscale nonparametric regression or smoothing, in a natural and integrated way into the data analysis. Applications illustrate the power of this new perspective on data analysis.

1. Introduction
Data analysis, for exploratory purposes, or prediction, is usually preceded by various
data transformations and recoding. In fact, we would hazard a guess that 90% of the
work involved in analyzing data lies in this initial stage of data preprocessing. This
includes: problem demarcation and data capture; selecting non-missing data of fairly
homogeneous quality; data coding; and a range of preliminary data transformations.
The wavelet transform offers a particularly appealing data transformation, as a pre-
liminary to data analysis. It offers additionally the possibility of close integration
into the analysis procedure as will be seen in this article. The wavelet transform
may be used to "open up" the data to de-noising, smoothing, etc., in a natural and
integrated way.

2. Some Perspectives on the Wavelet Transform


We can think of our input data as a time-varying signal, e.g. a time series. If discretely
sampled (as will almost always be the case in practice), this amounts to considering
an input vector of values. The input data may be sampled at discrete wavelength
values, yielding a spectrum, or one-dimensional image. A two-dimensional, or more
complicated input image, can be fed to the analysis engine as a rasterized data stream.
Analysis of such a two-dimensional image may be carried out independently on each
dimension, but such an implementation issue will not be of further concern to us here.
Even though our motivation arises from the analysis of ordered input data vectors,
we will see below that we have no difficulty in using exactly the same approach with
(more common) unordered input data vectors.
Wavelets can be introduced in different ways. One point of view on the wavelet
transform is by means of filter banks. The filtering of the input signal is some
transformation of it, e.g. a low-pass filter, or convolution with a smoothing function.
Low-pass and high-pass filters are both considered in the wavelet transform, and their
complementary use provides signal analysis and synthesis.


3. The Wavelet Transform Used


The following discussion
is based on Strang (1989), Bhatia et al. (1996) and Strang and Nguyen (1996). Our
task is to consider the approximation of a vector x at finer and finer scales. The
finest scale provides the original data, x_N = x, and the approximation at scale m is x_m, where usually m = 2^0, 2^1, ..., 2^N. The incremental detail added in going from x_m to x_{m+1}, the detail signal, is yielded by the wavelet transform. If ξ_m is this detail signal, then the following holds:

    x_{m+1} = H^T(m) x_m + G^T(m) ξ_m        (1)

where G(m) and H(m) are matrices (linear transformations) depending on the wavelet chosen, and T denotes transpose (adjoint). An intermediate approximation of the
original signal is immediately possible by setting the detail components ξ_{m'} to zero for m' ≥ m (thus, for example, to obtain x_2, we use only x_0, ξ_0 and ξ_1). Alternatively, we can de-noise the detail signals before reconstituting x; this has been termed wavelet regression (Bruce and Gao, 1994).
Define ξ as the row-wise juxtaposition of all detail components {ξ_m} and the final smoothed signal x_0, and consider the wavelet transform W given by

    Wx = (ξ, x_0)        (2)

The right-hand side is a concatenation of vectors. Taking W^T W = I (the identity matrix) is a strong condition for exact reconstruction of the input data, and is satisfied by an orthogonal wavelet transform. The important fact that W^T W = I will be used below in our enhancement of multivariate data analysis methods. This permits use of the "prism" (or decomposition in terms of scale and location) of the wavelet transform.
transform.
Examples of these orthogonal wavelets, i.e. the operators G and H, are the Daubechies family and the Haar wavelet transform (Press et al., 1992; Daubechies, 1992). For the Daubechies D4 wavelet transform, H is given by

    (0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226)

and G is given by

    (-0.1294095226, -0.2241438680, 0.8365163037, -0.4829629131).

Implementation is by decimating the signal by two at each level and convolving with G and H; therefore the number of operations is proportional to n + n/2 + n/4 + ... = O(n). Wrap-around (or "mirroring") is used by the convolution at the extremities of the signal.
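As an illustration, here is a minimal sketch of this decimating transform in Python (our own code, assuming only NumPy; not taken from any of the cited implementations):

```python
import numpy as np

# Daubechies D4 low-pass (H) and high-pass (G) filters, as given above.
H = np.array([0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226])
G = np.array([-0.1294095226, -0.2241438680, 0.8365163037, -0.4829629131])

def dwt_step(x):
    # One level: convolve with H and G, decimate by two, wrap-around ends.
    n = len(x)
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(4)) % n
    return x[idx] @ H, x[idx] @ G      # (smooth x_m, detail xi_m)

def dwt(x):
    # Full transform: the list of detail signals plus the final smooth
    # signal, i.e. the concatenation (xi, x_0) of equation (2).
    # x: length a power of two (e.g. the 512-step spectra used below).
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) >= 4:
        x, d = dwt_step(x)
        details.append(d)
    return details, x
```

The loop halves the signal length at each level, so the operation count is n + n/2 + n/4 + ... = O(n), as noted above.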

4. Wavelet-Based Multivariate Data Analysis: Basis

We consider the wavelet transform Wx of x. Consider two vectors, x and y. The squared Euclidean distance between these is ||x − y||² = (x − y)^T (x − y). The squared Euclidean distance between the wavelet transformed vectors is ||Wx − Wy||² = (x − y)^T W^T W (x − y), and hence identical to the squared distance between the original vectors. For use of the Euclidean distance, the wavelet transform can therefore replace the original data in the data analysis. The analysis can be carried out in wavelet space rather than direct space. This in turn allows us to directly manipulate the wavelet transform values, using any of the approaches found useful in other areas. The results based on the orthogonal wavelet transform exclusively imply use of the Euclidean metric, which nonetheless covers a considerable area of current data analysis practice.
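This identity is easy to check numerically with the dwt() sketch of Section 3 (the flatten() helper is an assumption of ours):

```python
# Distances are preserved under the orthogonal (wrap-around D4) transform.
rng = np.random.default_rng(1)
x, y = rng.normal(size=512), rng.normal(size=512)

def flatten(transform):
    details, smooth = transform
    return np.concatenate(details + [smooth])

wx, wy = flatten(dwt(x)), flatten(dwt(y))
print(np.linalg.norm(x - y), np.linalg.norm(wx - wy))  # agree to ~1e-12
```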
Note that the wavelet basis is an orthogonal one, but is not a principal axis one (which is orthogonal, but also optimal in terms of least squares projections). Wickerhauser (1994) proposed a method to find an approximate principal component basis by determining a large number of (efficiently-calculated) wavelet bases, and keeping the one closest to the desired Karhunen-Loève basis. If we keep, say, an approximate representation allowing reconstitution of the original n components by n' components (due to the dyadic analysis, n' ∈ {n/2, n/4, ...}), then we see that the space spanned by these n' components will not be the same as that spanned by the n' first principal components.

5. Wavelet Filtering or Wavelet Regression

Foremost among modifications of the wavelet transform coefficients is to approximate the data, progressing from coarse representation to fine representation, but stopping at some resolution level m. As noted above, this implies setting wavelet coefficients ξ_{m'} to zero when m' ≥ m.
Filtering or non-linear regression of the data can be carried out by deleting insignificant wavelet coefficients at each resolution level (noise filtering), or by "shrinking" them (data smoothing). Reconstitution of the data then provides a cleaned data set. A practical overview of such approaches to data filtering (arising from work by Donoho and Johnstone at Stanford University) can be found in Bruce and Gao (1994, chapter 7). For other model-based work see Starck et al. (1995).
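A minimal sketch of both operations, reusing the dwt() sketch of Section 3 (the names hard() and soft(), and the threshold choice, which mirrors the experiment of the next section, are our assumptions):

```python
def hard(d, t):
    # Noise filtering: delete (zero) insignificant coefficients.
    return np.where(np.abs(d) > t, d, 0.0)

def soft(d, t):
    # Data smoothing: "shrink" coefficients towards zero.
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

details, smooth = dwt(x)            # x: an input signal, e.g. a spectrum
sigma = np.std(np.concatenate(details))
details = [hard(d, 0.1 * sigma) for d in details]
# Reconstituting with the inverse transform then yields the cleaned data.
```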

6. Examples of Multivariate Data Analysis in Wavelet Space

We used a set of 45 astronomical spectra. These were of the complex AGN (active galactic nucleus) object NGC 4151, and were taken with the small but very successful IUE (International Ultraviolet Explorer) satellite, which was still active in 1996 after nearly two decades of operation. We chose a set of 45 spectra observed with the SWP spectral camera, with wavelengths from 1191.2 Å to approximately 1794.4 Å, with values at 512 interval steps. There were some minor discrepancies in the wavelength values, which we discounted: an alternative would have been to interpolate flux values (vertical axis, y) in order to have values at identical wavelength values (horizontal axis, x), but we did not do this since the infrequent discrepancies were fractional parts of the most common regular interval widths. Fig. 1 shows a sample of 20 of these spectra. A wavelet transform (Daubechies 4 wavelet) version of these spectra was generated, with the number of scales allowed by dyadic decomposition. An overall threshold of 0.1σ (standard deviation, calculated on all wavelet coefficients) was used, and coefficient values below this were set to zero. Spectra which were apparently more noisy had relatively few coefficient values set to zero, e.g. 31%. Smoother spectra had up to over 91% of their coefficients set to zero. On average, 76% of the wavelet coefficients were zeroed in this way. Fig. 2 shows the relatively high quality spectra re-formed, following zeroing of wavelet coefficient values.
Figure 1: Sample of 20 spectra (from 45 used) with original flux measurements plotted on the y-axis.

The Kohonen "self-organizing feature map" (SOFM; Murtagh and Hernández-Pajares, 1995) was applied to this data. A 5 × 6 output representational grid was used.
Figure 2: Sample of 20 spectra (as in the previous figure), each normalized to unit maximum value, then wavelet transformed, approximately 75% of wavelet coefficients set to zero, and reconstituted.

In wavelet space or in direct space, the assignment results obtained were identical. With 76% of the wavelet coefficients zeroed, the result was very similar, indicating that redundant information had been successfully removed. This approach to SOFM construction leads to the following possibilities:

1. Efficient implementation: a good approximation can be obtained by zeroing most wavelet coefficients, which opens the way to more appropriate storage (e.g. offsets of non-zero values) and distance calculations (e.g. implementation loops driven by the stored non-zero values). Similarly, compression of large datasets can be carried out. Finally, calculations in a high-dimensional space, R^m, can be carried out more efficiently since, as seen above, the number of non-zero coefficients may well be m'' ≪ m with very little loss of useful information.

2. Data "cleaning" or filtering is a much more integral part of the data analysis
processing. If a noise model is available for the input data, then the data
can be de-noised at multiple scales. By suppressing wavelet coefficients at
certain scales, high-frequency (perhaps stochastic or instrumental noise) or low-
frequency (perhaps "background") information can be removed. Part of the
data coding phase, prior to the analysis phase, can be dealt with more naturally
in this new integrated approach.
A number of runs of the k-means partitioning algorithm were made. The exchange method, described in Späth (1985), was used. Four, or two, clusters were requested. Identical results were obtained for both data sets, which is not surprising given that this partitioning method is based on the Euclidean distance. For the 4-cluster and 2-cluster solutions we obtained, respectively, these assignments:

123213114441114311343133141121412222222121114

122211111111111111111111111121112222222121111

The case of principal components analysis was very interesting. We know that the basic PCA method uses Euclidean scalar products to define the new set of axes. Often PCA is used on a variance-covariance input matrix (i.e. the input vectors are centered), or on a correlation input matrix (i.e. the input vectors are rescaled to zero mean and unit variance). These two transformations destroy the Euclidean metric properties vis-à-vis the raw data. Therefore we used PCA on the unprocessed input data. We obtained identical eigenvalues and eigenvectors for the two input data sets. The eigenvalues agree up to numerical precision:

1911.217163  210.355377  92.042099  13.908587  7.481989  2.722113  2.304520

1911.220703  210.355392  92.042336  13.908703  7.481917  2.722145  2.304524

The eigenvectors are similarly identical. The actual projection values are entirely different. This is simply due to the fact that the principal components in wavelet space are themselves inverse-transformable to provide principal components of the initial data.
Various aspects of this relationship between original and wavelet space remain to be investigated. We have argued for the importance of this in the framework of data coding and preliminary processing. We have also noted that if most values can be set to zero with limited (and maybe beneficial) effect, then there is considerable scope for computational gain as well. The processing of sparse data can be based on an "inverted file" data structure which maps non-zero data entries to their values. The inverted file data structure is then used to drive the distance and other calculations. Murtagh (1985, pp. 51-54 in particular) discusses various algorithms of this sort.
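A minimal sketch of this inverted-file idea (our own illustrative code): keep only the non-zero coefficients in a dictionary and let its keys drive the distance calculation.

```python
import numpy as np

def sparsify(w, eps=1e-12):
    # Inverted file: map non-zero entry indices to their values.
    nz = np.flatnonzero(np.abs(w) > eps)
    return dict(zip(nz.tolist(), w[nz].tolist()))

def sparse_sq_dist(a, b):
    # Squared Euclidean distance driven only by the stored non-zeros.
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))
```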

7. An Isotropic Redundant Wavelet Transform

It is common in pattern recognition to speak of "features" when what is intended are small density perturbations in feature space, small glitches in time series, etc. Such "features" may include sharp (edge-like) phenomena which can be demarcated using wavelet transforms like the orthogonal ones described above. Sometimes the glitches which are of interest are symmetric or isotropic. If so, a symmetric wavelet may be more useful. The danger with an asymmetric wavelet is that the wavelet itself may impose artifacts.
The "à trous" (with holes) algorithm is such an isotropic wavelet transform. It does not have the orthogonality property of the transform described earlier. The French term is commonly used, and arises from an interlaced convolution which is used instead of the usual convolution (see Shensa, 1992; Holschneider et al., 1989; see also Starck and Bijaoui, 1994; and Bijaoui et al., 1994). The algorithm can be described as follows: (i) smooth p times with a B3 spline (hence Gaussian-like, but of compact support); (ii) the wavelet coefficients are given by the differences between successive smoothed versions of the signal. The latter provide the detail signal, which (we hope) in practice will capture small "features" of interpretational value in the data. The following attractive additive decomposition of the data follows immediately from the design of the above scheme:

    c_0(k) = c_p(k) + Σ_{i=1}^{p} w_i(k)        (3)

The set of values provided by c_p provides a "residual" or "continuum" or "background". Adding w_i values to this, for i = p, p−1, ..., gives increasingly more accurate approximations of the original signal. Note that no decimation is carried out here, which implies that the size or dimension of w_i is the same as that of c_0. This may be convenient in practice: cf. next section. It is readily seen that the computational complexity of the above algorithm is O(n) for an n-valued input, and the storage complexity is O(n²).
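A minimal sketch of the scheme (our own illustrative code; wrap-around at the extremities is an assumption, as boundary handling is not specified above):

```python
import numpy as np

B3 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # B3-spline kernel

def a_trous(c0, p):
    # Returns ([w_1, ..., w_p], c_p) with c0 == c_p + sum of the w_i.
    c, details = np.asarray(c0, dtype=float), []
    n = len(c)
    for level in range(p):
        gap = 2 ** level                       # the "holes" grow dyadically
        idx = (np.arange(n)[:, None] + gap * np.arange(-2, 3)) % n
        smooth = c[idx] @ B3                   # interlaced convolution
        details.append(c - smooth)             # w_{level+1}
        c = smooth
    return details, c
```

Each level is a fixed five-tap interlaced convolution over all n values, and the additive decomposition (3) holds exactly by construction.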

8. Wavelet-Based Forecasting
In experiments carried out on the sunspots benchmark dataset (yearly averages from 1720 to 1979, with forecasts carried out on the period 1921 to 1979; see, e.g., Tong, 1990), a wavelet transform was used for values k up to a time-point k_0. One-step-ahead forecasts were carried out independently at each w_i. These were summed to produce the overall forecast (cf. the additive decomposition of the original data provided by the wavelet transform). An interesting variant on this was also investigated: there was no need to use the same forecasting method at each level i. We ran autoregressive, multilayer perceptron and recurrent connectionist networks in parallel, and kept the best results indicated by cross-validation on withheld data at that level. We found the overall result to be superior to working with the original data alone, or with one forecasting engine alone. Details of this work can be found in Aussem and Murtagh (1996).
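The following is a minimal sketch of the summed per-level forecasts, reusing a_trous() from the previous section. The simple least-squares autoregression fit_ar() is our stand-in for the paper's per-level engines (autoregressive, multilayer perceptron and recurrent networks), and boundary effects of the transform are glossed over.

```python
import numpy as np

def fit_ar(z, order=2):
    # Least-squares AR(order) coefficients for one band; an illustrative
    # stand-in for the per-level forecasting engines used in the paper.
    X = np.column_stack([z[i:len(z) - order + i] for i in range(order)])
    coef, *_ = np.linalg.lstsq(X, z[order:], rcond=None)
    return coef

def one_step_forecast(series, p=4):
    # Forecast each w_i and the residual c_p independently, then sum.
    details, residual = a_trous(series, p)
    total = 0.0
    for band in details + [residual]:
        coef = fit_ar(band)
        total += band[-len(coef):] @ coef
    return total
```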

9. Conclusion
The results described here, from the multivariate data analysis perspective, are very exciting. They not only open up the possibility of computational advances but also provide a new approach in the area of data coding and preliminary processing.
The chief advantage of these wavelet methods is that they provide a multiscale decomposition of the data, which can be directly used by multivariate data analysis methods, or which can be complementary to them.
A major element of this work is to show the practical relevance of doing this. It has been the aim of this paper to do precisely this in a few cases. Finding a symbiosis between what are, at first sight, methods with quite different bases and quite different objectives requires new insights. Wedding the wavelet transform to multivariate data analysis no doubt leaves many further avenues to be explored.
Further details of the experimentation described in this paper, details of code used, and further information can be found in Murtagh (1996).

References:
Aussem, A. and Murtagh, F. (1996): Combining neural network forecasts on wavelet-transformed time series, Connection Science, in press.
Bhatia, M., Karl, W.C. and Willsky, A.S. (1996): A wavelet-based method for multiscale tomographic reconstruction, IEEE Transactions on Medical Imaging, 15, 92-101.
Bijaoui, A., Starck, J.-L. and Murtagh, F. (1994): Restauration des images multi-échelles par l'algorithme à trous, Traitement du Signal, 11, 229-243.
Bruce, A. and Gao, H.-Y. (1994): S+Wavelets User's Manual, Version 1.0, Seattle, WA: StatSci Division, MathSoft Inc.
Daubechies, I. (1992): Ten Lectures on Wavelets, Philadelphia: SIAM.
Holschneider, M., Kronland-Martinet, R., Morlet, J. and Tchamitchian, Ph. (1989): A real-time algorithm for signal analysis with the help of the wavelet transform, in J.M. Combes, A. Grossmann and Ph. Tchamitchian (eds.), Wavelets: Time-Frequency Methods and Phase Space, Berlin: Springer-Verlag, 286-297.
Murtagh, F. (1985): Clustering Algorithms, Würzburg: Physica-Verlag.
Murtagh, F. and Hernández-Pajares, M. (1995): The Kohonen self-organizing feature map method: an assessment, Journal of Classification, 12, 165-190.
Murtagh, F. (1996): Wedding the wavelet transform and multivariate data analysis, Journal of Classification, submitted.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992): Numerical Recipes, 2nd ed., Chapter 13, New York: Cambridge University Press.
Shensa, M.J. (1992): The discrete wavelet transform: wedding the à trous and Mallat algorithms, IEEE Transactions on Signal Processing, 40, 2464-2482.
Späth, H. (1985): Cluster Dissection and Analysis, Chichester: Ellis Horwood.
Starck, J.-L. and Bijaoui, A. (1994): Filtering and deconvolution by the wavelet transform, Signal Processing, 35, 195-211.
Starck, J.-L., Bijaoui, A. and Murtagh, F. (1995): Multiresolution support applied to image filtering and deconvolution, Graphical Models and Image Processing, 57, 420-431.
Strang, G. (1989): Wavelets and dilation equations: a brief introduction, SIAM Review, 31, 614-627.
Strang, G. and Nguyen, T. (1996): Wavelets and Filter Banks, Wellesley, MA: Wellesley-Cambridge Press.
Tong, H. (1990): Non-Linear Time Series, Oxford: Clarendon Press.
Wickerhauser, M.V. (1994): Adapted Wavelet Analysis from Theory to Practice, Wellesley, MA: A.K. Peters.
Visual Manipulation Environment for Data Analysis System
Masahiro Mizuta¹, Hiroyuki Minami²
¹ Division of Systems and Information Engineering, Hokkaido University
N.13, W.8, Kita-ku, Sapporo-shi, Hokkaido 060, Japan
² Department of Information and Management Science, Otaru University of Commerce
3-5-21, Midori, Otaru-shi, Hokkaido 047, Japan

Summary: Most statistical software packages use their graphical facilities mainly to display data, not to construct and execute analyses. We developed a data analysis system with visual manipulation, and have improved it on the UNIX platform with Tcl/Tk, an interface builder available on many architectures and operating systems. We describe its features and introduce some examples in our environment.

1. Visual Environment for Statisticians

We statisticians are familiar with graphical environments. We pick up and detect many features within observations through many analysis methods and plots. Most statistical software packages also have much graphical facility. For example, the S Language (Becker et al. (1988)) and SAS/GRAPH have functions not only to display data but to rotate and print it in various ways. Graphical environments provide us more potential for EDA (Tukey (1977)). We can see data from various viewpoints with computer software. For example, xgobi may tell us more than we expected with the "GrandTour" function.
In computer science, "Visualization" is one of the most attractive keywords. What we introduced above is the effect of "Visualization of Data." "Visualization" is useful not only to reveal data but to deal with analysis processes. A statistician is often forced to apply many methods to the same data in his/her analysis, since he/she wants to find the most reasonable result. The many results he/she made, however, may perplex him/her. If he/she can view an analysis flow (made of some statistical methods) on the graphical display, he/she can easily find the relations between data and applied methods. This is called "Visualization of Execution."
As we described, "Visualization" has much power, so we try to build a "Visual Environment" which synthesizes the merits of both kinds of "Visualization." In short, our target is to build a system which can view and display data, construct analysis flows, and execute them in the same environment. In such a system, if we can add and execute some statistical methods in an analysis flow, we can easily change some methods or apply another procedure to the original data or intermediate ones.
We had already developed prototypes and now improve them on the workstation with the Tcl/Tk environment for its interoperability. In addition, this idea is also effective for novices in statistics; however, they have less knowledge of statistics and may make a meaningless statistical flow. We have already recognized this problem and tried to add a supporting module which makes use of techniques from knowledge engineering.
In this paper, we introduce the system and some additional modules with examples.


2. Concept and Overview of Our System

Shu (1988) proposes a classification of visualizations according to their usage:
1. Visualization of Data or Information about Data
2. Visualization of a Program and/or Execution
3. Visualization of Software Design
What we use in analyses is classified into 1 and 2. We have tried to apply the concept of 3 to a data analysis system. The process of making software is similar to that of making data analysis flows. We regard a data analysis procedure as a flow of data. Our concept is shown in Figure 1, and flows of whole analysis processes are shown in Figure 2.

Fig. 1: Basic concept

Fig. 2: Analysis flows

An arrow stands for a datum and a box is a (statistical) method. The leftmost box stands for the import procedure for an observation, and the others are methods which are applied to the input. The data flow through many arrows and boxes.
The "Flow Chart", which shows the control of a program, is used in software development and is similar to our concept. It is really useful when we check the control in a program, but it cannot be mapped into a real program, since we cannot find the input and the output from such a chart. The flows we construct can be executed as procedures directly, since they show flows of data.
2.1 Overview of our system
We made a prototype of a data analysis system based on a data flow diagram on a PC (MS-DOS) (Mizuta (1990)). The prototype was ported to a workstation (X Window System on UNIX) in 1991.
Recently, the progress of computer environments has accelerated. We therefore decided to rebuild our system, based on an environment available on many computer platforms.
Tcl/Tk is a kind of interface builder for graphical computer environments. It was developed on X Window originally and has been ported to many platforms. Now it is available on MS-Windows, Windows 95, Windows NT and so on. It can handle interface parts (button, menu, bar, etc.) and assign their functions easily. These characteristics are suitable for our system, so we make a new prototype with Tcl/Tk.

3. How to build analysis flows and examples

Figure 3 is a snapshot of the system at the first phase. The user first clicks the "Put procedure" button, and the menu of procedures appears. The items in this menu are sorted by the procedures' purposes or the type of data.
Fig. 3: First phase

The selected procedure box appears in the view and follows the movement of the pointing device (mouse). The user can put it at any place he/she likes. When he/she decides its position, he/she clicks the left button of the mouse and the box is fixed. Figure 4 shows the situation after some boxes are put.
Fig. 4: Placed methods

!\;"ext, the user makes connections between boxes to construct a flow. He/She clicks
a source box then an arrow appears. The start point is fixed to the source box and
the other follows the movement of its pointing device. He/She clicks on the other
box then a source box and the other are connected. The view in Figure 5 shows tv;O
streams. The upper flow is to do discriminant analysis and print results. The lower
is to do regression analysis and plot results.
Now the user can really do analysis. He/She can click the rightmost box of a flow.
A procedure box is executed if the input is calculated or ready to import. If !lot, the
procedure which is expected to make the input is triggered. In short, if the rightmost
box is activated, all boxes on the flow arc executed recursively.
Figure 6 is the result of regression analysis. 'The user can get the [rsult through tilr
plot with the points and the regression line.
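The following is a minimal sketch of this demand-driven execution (a hypothetical Python class and names; the actual system is built on Tcl/Tk and the S Language):

```python
import numpy as np

class Box:
    # A (statistical) method; arrows are modelled by the `inputs` list.
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, list(inputs)
        self.result = None                    # cached once executed

    def execute(self):
        if self.result is None:               # input not ready: recurse
            args = [box.execute() for box in self.inputs]
            self.result = self.func(*args)
        return self.result

# A two-box flow: import an observation, then regression analysis.
data = Box("import", lambda: np.c_[np.arange(10.0), 2*np.arange(10.0) + 1])
reg = Box("regression", lambda d: np.polyfit(d[:, 0], d[:, 1], 1), [data])
print(reg.execute())                          # slope ~2.0, intercept ~1.0
```

Activating the rightmost box (reg) first triggers, recursively, the boxes that produce its inputs, exactly as described above.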

3.1 Parameter handling

Some methods may have parameters: for example, a parameter specifying the dimension for executing the MDS procedure. The system has a facility to set them on the same screen. Figure 7 shows that the dimension of MDS is 2, that the eigenvalues are output in addition to the result, and that an additive constant is used in the analysis.
Fig. 5: Two flows

Fig. 6: Results (with plot)


Fig. 7: Parameter handling (MDS)

4. Details and Applications

This system is developed from some independent routines.
As introduced already, the interface parts are written with Tcl/Tk (Ousterhout (1994)). As a statistical engine, we adopt the S Language. These are familiar under UNIX workstations.
Each engine, however, is developed independently; thus we can add another statistical engine (i.e. your original statistical library) as a method, under the restriction that its input device is "stdin" (standard input) and its output is "stdout" (standard output).
A new function can also be attached to this system easily. For example, we are about to add a new function, "Validity Check", which verifies whether the connection between statistical procedures is reasonable or not. If it is inappropriate, the routine suggests a suitable method. The way to realize this module is to check the input attribute of a procedure and the attribute of the data given as input. If both are consistent, the connection must be reasonable. If not, the system tries to find another procedure which satisfies the condition. If a suitable procedure is not found, the module searches for another procedure which can convert the attribute of the data into the required input attribute; a small sketch of this matching is given below.
This module (written with Prolog) is implemented in the MS-DOS version and works well. In addition, we have already developed a knowledge supporting system for data analysis (Minami et al. (1993)) as an extension of "Validity Check." We are going to synthesize the system and this module.
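A minimal sketch of this attribute matching (in Python for illustration; the original module is written in Prolog, and the attribute names and procedure table here are our assumptions):

```python
# Procedure table: name -> (input attribute, output attribute).
PROCS = {
    "ScatterPlot": ("numeric", None),
    "MDS": ("distance-matrix", "coordinates"),
    "DistMatrix": ("numeric", "distance-matrix"),
}

def validity_check(data_attr, proc):
    need, _ = PROCS[proc]
    if data_attr == need:
        return "connection is reasonable"
    # Search for a converter whose output attribute satisfies the input.
    for name, (p_in, p_out) in PROCS.items():
        if p_in == data_attr and p_out == need:
            return f"insert {name} before {proc}"
    return "no suitable procedure found"

print(validity_check("numeric", "MDS"))   # -> insert DistMatrix before MDS
```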

5. Concluding Remarks
A "Macro" feature may be effective for the system. A flow, except for an import box, can be regarded as one (synthesized) procedure, so we can consider it as another method. To realize this feature is not so hard, and we will offer it soon.
Our concept is not sufficient for all statistical methods, since it requires that the results of procedures be provided by value mainly. As is well known, some methods output their result as a function or formula. The current version of our system cannot handle these datatypes directly. The concept of "Object-Oriented" is important to extend the kinds of datatypes. Tcl/Tk is based on this concept. If we make all parts of the system object-oriented, we may handle formulas and functions the same as values.
This idea has an influence on the "Validity Check" module and supporting modules. The essence of these is to check the types of data and the input and output attributes of procedures. If procedures are managed in an object-oriented style, it is easy to make a checker routine, since the routine has only to check that their classes are valid.

Acknowledgment
We would like to thank Mr. Kikuchi (Graduate School of Information Engineering, Hokkaido University) for his programming support.
A part of this work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of the Japanese Government.

References:
Becker, R.A., Chambers, J.M. and Wilks, A.R. (1988). The New S Language, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove.
Minami, H., Mizuta, M. and Sato, Y. (1993). A Knowledge Supporting System for Data Analysis, Journal of the Japanese Society of Computational Statistics, 6, 1, 85-97.
Mizuta, M. (1990). Data Analysis System with Visual Manipulations, Bulletin of the Computational Statistics of Japan, 3, 1, 23-29 (in Japanese).
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit, Addison-Wesley.
Shu, N. C. (1988). Visual Programming, Van Nostrand Reinhold Company.
Tukey, J. W. (1977). Exploratory Data Analysis, Addison-Wesley.
Human Interface for Multimedia Database with Visual Interaction Facilities
Toshikazu Kato

Electrotechnical Laboratory (ETL), AIST, MITI
1-1-4, Umezono, Tsukuba Science City, 305 Japan

Summary: This paper describes visual interaction mechanisms for image database systems. The typical mechanisms for visual interaction are query by visual example (QVE) and query by subjective description (QBD). The former includes a similarity retrieval function by showing a sketch, and the latter includes a sense retrieval function by learning the user's personal taste. We modeled the user's visual perception process at four levels: a physical level, a physiological level, a visual psychological level, and a visual cognitive level. These models are automatically created by image analysis and statistical learning, and are referred to as multimedia indexes by database systems.

1. Introduction
"A picture is worth a thousand words." A human interface plays an important role in a multimedia information system. For instance, we require a content-based visual interface in order to communicate visual information itself to and from a multimedia database system, Iyenger and Kashyap (1988), Grosky and Mehrotra (1989). The algorithms of multimedia operations have to suit the user's subjective viewpoint, such as a similarity measure, a sense of taste, etc. Thus, we have to provide flexible interaction mechanisms to design a multimedia human interface.
We expect a multimedia database system to manage multimedia data themselves, such as image data, as well as alphanumeric data. We also expect it to provide a human interface to accomplish flexible man-machine communication in a user-friendly manner. Then, what is needed in multimedia interaction? We can summarize the essential needs in multimedia interaction as follows, Kato, et al. (1991).
(a) Visual query on the pictorial domain: We need to communicate multimedia data to and from the database in a user-friendly manner. For instance, we would like to show image data itself as a pictorial key to retrieve some visual information from database systems.
(b) Subjectivity of judging criteria: We want to adjust database operations as well as database schemata to each of our subjective views. In the case of similarity retrieval, we would like to get some suitable candidates according to our subjective measures, where the measures may differ with each individual.
(c) Interpretation between multimedia domains: Some multimedia queries should evaluate multimedia data on different domains. In the case of content-based retrieval, we expect the system to retrieve some image data by describing their contents as text data.
We can answer these needs with a multimedia human interface with our visual interaction facilities. Let us show the general framework of visual interaction through typical user query requests in our applications. Our basic ideas are QVE (query by visual example) and QBD (query by subjective description). Multimedia interaction requires interpreting the contents of multimedia information in order to operate from the user's subjective viewpoint. Thus, interpretation algorithms have to suit the perception processes of each user. Such processes belong to a subjective human factor. In our multimedia human interface, the system refers to the object model of multimedia information and the user model of the perception process to operate from the user's subjective viewpoint.

2. Intelligent Visual Interaction

2.1 Conventional Approach to Visual Interface
Several experimental image database systems have been proposed to provide visual interfaces. The QPE system provides a schema of pictorial data in a graphic form as well as in a tabular form, Chang and Fu (1980). While this system shows the data in a graphic form, its query style is only a substitute for the query languages on alphanumeric data. In the icon-based system, icons and their two-dimensional strings are referred to as pictorial keys to image data, Chang, et al. (1988). A user can specify the target images by placing icons on the graphic display as a visual query. The system evaluates only two-dimensional strings of icons in its retrieval process. Therefore, it is difficult to perform similarity retrieval according to the subjective measure of the user. The hypermedia system provides an indexing mechanism for multimedia data in a uniform style, Yankelovich, et al. (1988). Although this system enables a kind of subjective indexing, its process owes much to the user's effort in defining many links.
We originally want to process visual information and its contents according to our subjective views. On this point, although the above systems use graphic devices to show schemata, icons and guidelines, their facilities are not enough to perform such interaction in a user-friendly manner.
2.2 Computational Models of Visual Perception Process
In human-to-human communication, we can exchange not only pictorial data themselves as objective information but also emotion, personal taste, and so on, as subjective information. The latter type of communication, which we call "kansei-oriented communication," is rather important for expressing and understanding personal opinions well.
As a working hypothesis, we have developed an artificial sense model of the visual perception process.

Fig. 1: Conceptual schema of the visual perception process: a cognitive level, a psychological level, a physiological level, and a (signal) physical level, receiving stimuli from the world
(1) Physical level interaction: A picture may often remind us of similar images or related pictures. This process is a kind of similarity and associative retrieval of pictorial data by physical level interaction with a pictorial database.
(2) Physiological level interaction: As we can recognize hand-written characters as well as printed block characters, some graphical features have almost the same values to distinguish a category from the others. The early stage of the mammalian vision mechanism extracts this sort of graphical features, such as intensity levels, edges, contrast, correlation, spatial frequency, and so on. Visual perception may depend on such graphical features.
(3) Psychological level interaction: A user may wish to see some graphic symbols which give him a similar impression from his view. We have to notice that the criteria for similarity belong to a subjective human factor. Although human beings have anatomically the same organs, each person may show different interpretations in classification and similarity measure. It means each person has his own weighting factors on graphical features. The system should evaluate similarity according to his subjective criterion. Therefore, the system should analyze and learn the subjective similarity measure on the images with each user. Graphical features are mapped into subjective features by the weighting factors.
(4) Cognitive level interaction: We often have different impressions, even when viewing the same painting. Each person may also give a unique interpretation even when viewing the same picture. It seems each person has his own correlation between concepts and graphical features and/or subjective features. The system should evaluate a subjective description according to his criterion. Therefore, the system should analyze and learn the correlation between the subjective descriptions and the images with each user.
A user model is needed to operate visual information based on the subjective viewpoint of each user. We have to develop a simple learning algorithm to adjust the criteria for each user.
3. Query by Visual Example at the Physiological Level
This chapter describes the visual perception models and algorithms for query by visual example (QVE), i.e. similarity retrieval on objective criteria.

3.1 Visual Perception Model for Graphic Symbols

We have been developing an image database system called TRADEMARK. The TRADEMARK database is a collection of graphic symbols, Kato, et al. (1991). These figures are protected as intellectual property. At a Patent Office, an examiner compares each figure with tens of thousands of existing registered graphic symbols. It is a burdensome task that can be avoided if the image database system accepts query by visual example (QVE) using a sketch retrieval facility. This is the essential problem of an image database system.
Let us discuss the technical problems associated with QVE. We can point out these problems:
(i) Pixelwise pattern matching is quite a time-consuming task.
(ii) We cannot give a pictorial key which is exactly the same as the original design of the symbol.
(iii) Our visual impressions of graphic symbols are psychologically ambiguous. The similarity measure may differ with each of us.
We have assumed an image model using several kinds of graphic features (GF), based on our psychological experiments and on recent knowledge of visual physiology. We can simulate the process of visual perception of graphic symbols with a GF space. These features are primitive ones in the early stage of the human vision system. The TRADEMARK system refers to these GF vectors as the pictorial index of graphic symbols.
(1) Spatial distribution of the gray level, Gray8, Edge8: The distribution of black pixels represents the outline of the graphic symbol. For this purpose, the graphic symbol is divided into 8 x 8 square meshes. Gray8 denotes the number of black pixels m_ij in each mesh:

    Gray8 = {m_ij}  (0 ≤ i ≤ 7, 0 ≤ j ≤ 7)

Similarly, we defined Edge8 with the contour of the graphic symbol.
(2) Spatial frequency, RunB/W, RunW': The spatial frequency measures the complexity of graphic symbols. RunB/W approximates the frequency by the run-length distribution of each rectangular mesh. Here, the figure is divided into four horizontal meshes as well as four vertical meshes. Similarly, we defined RunW' without distinguishing the black and white runs.
(3) Local correlation measure and local contrast measure, Corr4, Cont4: The local correlation and the local contrast show the spatial structure, such as the regularity of arrangement of partial figures:

    Corr4 = {m_ij × m_i'j'},   Cont4 = {(m_ij − m_i'j') / (m_ij + m_i'j')}   (0 ≤ i ≤ 3, 0 ≤ j ≤ 3)

where m_ij, m_i'j' are adjacent meshes. These parameters are defined on 4 x 4 square meshes.
[Alg. 1] GF space for graphic symbols
(1) Analyze the layout of a document image to extract graphic symbols. Normalize the image size of the graphic symbols.
(2) Calculate the GF vector p_i for each graphic symbol.
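A minimal sketch of the Gray8 computation of [Alg. 1] (our own illustrative code, assuming a normalized binary image whose sides are divisible by 8):

```python
import numpy as np

def gray8(img):
    # img: 2-D 0/1 array (1 = black pixel), both sides divisible by 8.
    h, w = img.shape
    return img.reshape(8, h // 8, 8, w // 8).sum(axis=(1, 3))  # the m_ij

# Edge8 would apply the same mesh counting to a contour (edge) image.
```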

3.2 Sketch Retrieval on the GF space

We may expect that neighboring graphic symbols in a GF space will have a similar shape. For example, a fine copy and its rough sketch are neighbors in GF space. Therefore, the system can retrieve similar graphic symbols by comparing their GF vectors.
Fig. 2: Overview of the Query by Visual Example mechanism

Let us show the sketch retrieval algorithm for graphic symbols. Fig. 2 shows the outline of the whole QVE mechanism; there, the sketch retrieval process is enclosed by solid lines.
[Alg. 2] Sketch retrieval on the GF space
(1) Normalize the image size of the sketch, i.e. the visual example.
(2) Calculate the GF vector p_0 of the sketch.
(3) Calculate the distance d_i between the sketch p_0 and the graphic symbols p_i in the database:

    d_i = Σ_k w_k ||p_i^(k) − p_0^(k)||

where p^(k) and w_k denote the k-th GF vector and its weight factor.
(4) Choose the graphic symbols in ascending order of d_i.
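A minimal sketch of steps (3) and (4) (our own illustrative code; the feature names and weights are assumptions):

```python
import numpy as np

def qve_distance(sketch_gf, symbol_gf, weights):
    # sketch_gf, symbol_gf: dicts mapping feature name -> GF vector.
    return sum(w * np.linalg.norm(symbol_gf[k] - sketch_gf[k])
               for k, w in weights.items())

def retrieve(sketch_gf, database, weights, top=10):
    # database: dict mapping symbol id -> its dict of GF vectors.
    ranked = sorted(database,
                    key=lambda s: qve_distance(sketch_gf, database[s], weights))
    return ranked[:top]
```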

3.3 Experimental Results of Sketch Retrieval

Let us show our experimental results of the sketch retrieval algorithms for graphic symbols and full color paintings. Fig. 3 shows an example of sketch retrieval. A user has drawn the sketch shown as "your visual example" in the QVE window in Fig. 3. The system searches for the most similar graphic symbols on the pictorial index by comparing their GF vectors. The QVE window also shows the candidates for similar graphic symbols in descending order of priority. The first candidate is the original design of the rough sketch. We can also find similar graphic symbols in Fig. 3.

Fig. 3: Sketch retrieval of graphic symbols by showing a rough sketch

We have evaluated this algorithm in an experiment in which we showed fair copies, hand-written sketches and rough sketches for each of 100 visual examples. (Currently, the TRADEMARK database manages about 2,000 graphic symbols.) We have tested the recall ratio. Here, the recall ratio shows the rate of retrieval of the original graphic symbol among the best ten candidates. For a fair copy, the system had an almost 100% recall ratio among the first ten candidates, using the GF features. Even for the rough sketches, it had about 95% recall. We may conclude that our GF features satisfy the requirements for a robust image model for sketch retrieval.
4. Query by Visual Example at the Psychological Level
This chapter describes another aspect of query by visual example (QVE), i.e. similarity retrieval on subjective criteria.

4.1 Personal View Model for Graphic Symbols

Remember that human beings do not rely only on geometric processing to determine shape similarity. Psychological processes also relate to it. Therefore, we have to consider the psychological aspect of similarity retrieval. Such similarity differs with each user, even for the same figures. Therefore, the system should learn the subjective similarity measure as a personal view model of each user.
A user classifies test samples from the database into several clusters, judging similarity. The system extracts the GF vector of each graphic symbol. We need a subjective feature (SF) space which reflects the subjective similarity measure. We can construct such an SF space by discriminant analysis. Discriminant analysis is one of the multivariate analysis methods for evaluating a classification. The algorithm to construct the SF space and the personal index is as follows. The learning process is shown enclosed by dot-dashed lines in Fig. 2.
[Alg. 3] Learning the subjective similarity measure (SF space)
(1) Choose appropriate graphic symbols from the database to make the learning set P. The user classifies the graphic symbols into several clusters without overlapping.
(2) Normalize the image size. Calculate the GF vector p_k of each graphic symbol k ∈ P. (The GF vector is the same one as in the sketch retrieval algorithm.)
(3) Apply discriminant analysis to the clustering result by the user. The linear mapping A is given by solving the following eigenvalue problem:

    Σ_B A = Σ_W A Λ,   A' Σ_W A = I

where Σ_B and Σ_W denote the inter-group and intra-group covariance matrices of the GF vectors, respectively, A' means the transposed matrix of A, and Λ is the diagonal matrix of eigenvalues. Thus, we can define the SF space with the user:

    r_k = A' p_k

where r_k is the SF vector of the graphic symbol k.
(4) Calculate the SF vectors for every graphic symbol in the database: r_k = A' p_k.
We will refer to the SF space of r_k as the personal index. Note that we do not have to examine the similarity of all the graphic symbols in the database. Once the system has learned the linear mapping A, it can automatically construct the personal index from the GF vectors alone. This algorithm reduces the personnel expenses for indexing.
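Step (3) of [Alg. 3] is a generalized symmetric eigenproblem; a minimal sketch using SciPy (our own illustrative code) is:

```python
import numpy as np
from scipy.linalg import eigh

def sf_mapping(Sigma_B, Sigma_W, dim):
    # Solve Sigma_B A = Sigma_W A Lambda with A' Sigma_W A = I.
    evals, A = eigh(Sigma_B, Sigma_W)
    order = np.argsort(evals)[::-1]       # most discriminating directions
    return A[:, order[:dim]]

# SF vector of a symbol with GF vector p:  r = A.T @ p
```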

4.2 Similarity Retrieval on the SF space

We may expect that neighboring graphic symbols in an SF space will give a similar impression from the user's view. Just as with sketch retrieval, the user shows a sketch for which he wants to see similar figures. The system can retrieve similar graphic symbols by comparing their SF vectors on the personal index. Then, the system shows suitable candidates. The algorithm for similarity retrieval is as follows; it is also shown enclosed by dotted lines in Fig. 2.
[Alg. 4] Similarity retrieval on the SF space
(1) Apply the linear mapping A to the GF vector p_0 of the sketch: r_0 = A' p_0.
(2) Choose the neighboring graphic symbols r_i on the personal index as the candidates for similarity retrieval.
(3) Calculate the distance d_i between the sketch r_0 and the graphic symbols r_i on the personal index: d_i = ||r_i − r_0||.
(4) Choose the graphic symbols in ascending order of d_i.

4.3 Experimental Results of Similarity Retrieval

Fig. 4: Example of similarity retrieval. The upper QVE window and the lower one show the result of similarity retrieval on the SF space of the user and that of sketch retrieval on the GF space.

Fig. 4 shows an example of similarity retrieval. The upper QVE window in this figure shows the ten candidates for similarity retrieval on the SF space, while the lower one shows the ten candidates on the GF space. The second to the eighth candidates on the SF space matched the classification by this user. The system could not retrieve these candidates in the sketch retrieval on the GF space, since their graphic features differ from those of the visual example.
We have evaluated the learning algorithm and the similarity retrieval algorithm in an experiment with eleven users. In this experiment, we used 230 graphic symbols out of 2,000 as the samples. The system retrieved at least one similar graphic symbol among the first ten candidates with more than a 98% recall ratio. We may conclude that our SF spaces satisfy the subjective similarity measure of each user.

5. Query by Subjective Description at the Cognitive Level

This chapter gives a more complex visual interaction algorithm for full color paintings. A user has only to show several words in a query by subjective description (QBD), after constructing his visual perception model as the user model.

5.1 Full Color Paintings and Artistic Impressions

In conventional database systems, keyword indexes have been used to retrieve paintings which give us certain impressions. A user describes his request with a combination of such keywords, which are assigned by the indexer to each painting. Even when a keyword thesaurus is available, this approach has the following problems.
(a) The indexer has to assign several keywords to each painting in the database, which is laborious work.
(b) The indexer assigns such keywords according to his personal view, which affects the user's query. Even when viewing the same painting, such descriptions may differ with persons, nations and cultural backgrounds.
(c) While a keyword thesaurus is useful for enlarging the vocabulary of the user's query, the operations can be defined only on the text data domain.
Our approach aims to unify text data and image data to describe the contents of an image. Thus, we have to model how a person feels certain impressions when viewing a painting. Art critics view paintings from several aspects, such as motif, general composition and coloring. Chijiiwa (1983) reported that the dominant impression generated by paintings is coloring. This report suggests that there is a reasonable correlation between the coloring and the words in the reviews.
We have been developing an electronic art gallery called ART MUSEUM, Kato, et al. (1991). The ART MUSEUM is a collection of full color paintings. The ART MUSEUM system provides a QBD facility for sense retrieval. A user can retrieve full color paintings by presenting some words reflecting his personal taste.
The ART MUSEUM system has the personal index on unified features (UF), which are derived from graphic features (GF) on color and subjective features (SF) on impression words.
We can parameterize the coloring of a painting by the distribution of the RGB intensity values in the subpictures.
[Alg. 5] Pictorial index (GF space) for coloring features
(1) Divide a painting into 4 x 4 subpictures to approximate the combination and the arrangement of colors.
(2) Calculate the distribution of the RGB intensity values in the subpictures as the GF vector p_i for each painting.
We also need a subjective criterion on artistic impression. A user gives his artistic impressions of sample paintings as a weight vector of adjective words. (Currently, the adjectives are restricted to about 30.)
[Alg. 6] Inquiry for artistic impression (SF space)
(1) Choose appropriate paintings from the database to make the learning set P.
(2) The user describes his impressions as the weights of the adjectives a_k for each painting k ∈ P.

5.2 Personal View Model for Artistic Impressions

Let us show the algorithm for learning a personal view model for artistic impressions. We, of course, cannot directly compare the subjective words in a query and the coloring features of a painting, since they are in different domains.
Remember that the subjective descriptions of each user are related to the coloring of paintings. We may expect that the set of words and the parameterized coloring features correlate with each other. The ART MUSEUM system analyzes such correlation between the different domains. We will regard the correlation as the personal view model of the user. We can construct a unified feature (UF) space on this model to compare the subjective words and coloring features.
Fig. 5: Overview of QBD mechanism


We can construct such a UF space by canonical correlation analysis. The algorithm to construct the UF space and the personal index is as follows. (It is also shown enclosed by dot-dashed lines in Fig. 5.)
[Alg. 7] Learning the artistic impression measure
(1) Apply canonical correlation analysis to the result of the inquiry by the user. The linear mappings F and G make the correlation maximum:

    s_k = F' a_k,   t_k = G' p_k

where F', G' mean the transposed matrices of F, G, respectively.
(2) Calculate the UF vectors of the paintings in the database from the following formula: r_i = G' p_i.
We will refer to the UF space of r_i as the personal index of the user model. Note that we do not have to assign the adjectives a_i to every painting in the database. Once the system has learned the linear mappings F and G, it can automatically construct the personal index from the GF vectors alone. This is a labor-saving algorithm for indexing.
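A minimal numerical sketch of step (1) by whitening and the SVD (our own illustrative code; rows of Adj are the adjective vectors a_k, rows of Gf the coloring features p_k, and the small ridge terms added for stability are an assumption):

```python
import numpy as np

def cca(Adj, Gf, dim):
    A = Adj - Adj.mean(axis=0)
    P = Gf - Gf.mean(axis=0)
    n = len(A)
    Ra = np.linalg.cholesky(np.cov(A.T) + 1e-8 * np.eye(A.shape[1]))
    Rp = np.linalg.cholesky(np.cov(P.T) + 1e-8 * np.eye(P.shape[1]))
    # Whitened cross-covariance; its SVD gives the canonical directions.
    M = np.linalg.inv(Ra) @ (A.T @ P / (n - 1)) @ np.linalg.inv(Rp).T
    U, rho, Vt = np.linalg.svd(M)
    F = np.linalg.inv(Ra).T @ U[:, :dim]      # s_k = F' a_k
    G = np.linalg.inv(Rp).T @ Vt.T[:, :dim]   # t_k = G' p_k
    return F, G, rho[:dim]                    # rho: canonical correlations
```

The personal index is then r_i = G.T @ p_i for every painting, and a query maps through the diagonal matrix of the rho values as in [Alg. 8] below.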

5.3 Sense Retrieval on the UF space

We may expect that neighboring paintings in the UF space will give similar impressions of coloring to the user.
The user has only to show several words in QBD. The system evaluates the most suitable coloring for the words according to the personal view model. Then, the system can provide paintings of suitable coloring. The algorithm for sense retrieval on coloring is as follows. (It is shown enclosed in solid lines in Fig. 5.)
[Alg. 8] Sense retrieval on the UF space
(1) Apply the linear mappings F and Λ to the adjective vector a_0 of the subjective description in the user's query:

    r_0 = Λ F' a_0

where Λ is the regression given by the diagonal matrix of canonical correlation coefficients.
(2) Choose the neighboring paintings r_i on the UF space as candidates for sense retrieval.
The UF space gives a criterion to evaluate text data and image data by their contents. The UF space enables us to operate a multimedia query which has multimedia data of different domains in its parts. Therefore, this algorithm corresponds to a multimedia join on text data and image data.
Note that the sense retrieval algorithm evaluates the visual impression in the UF space. Therefore, we can retrieve paintings without assigning keywords to every painting; this reduces the labor cost of indexing.
For other applications of the UF space, we can retrieve paintings that give us a similar impression by showing a painting as a visual example. We can also infer suitable keywords for simulating the user's personal view, using the inverse mappings as follows:

    a_0 = F'⁻¹ Λ⁻¹ G' p_0

5.4 Experiments on Sense Retrieval


We have experimented with the learning algorithm and the sense retrieval algorithm on
our ART MUSEUM system. In this experiment, we adjusted the UF space according to
the personal view of female students: the learning algorithm was applied to the average
answers of female students. This is a user-group model of artistic impressions.
Fig. 6 shows an example of sense retrieval. The figure shows the best eight candidates
for the adjectives "romantic, soft and warm". These paintings roughly satisfied the
personal view of the subjects. We may conclude that the personal index on the UF space
reflects a personal sense of coloring.

[Figure: screen shot of sense retrieval in the electronic ART MUSEUM for the query "romantic, soft and warm"; the system simulates the users' sensibility learned by canonical correlation analysis.]
Fig. 6: Example of sense retrieval in QBD


The best eight paintings in the database appear for the words "romantic",
"soft", and "warm". The words in the query and the colorings of the paintings
are evaluated in a UF space of female students.
6. Summary
We proposed the concept of a cognitive view mechanism which interprets multimedia
data. We have developed algorithms for sketch retrieval, similarity retrieval and sense
retrieval to support visual interaction. The sketch retrieval algorithm accepts visual data as
an example. The similarity retrieval algorithm and the sense retrieval algorithm adjust their
interpretation criteria to the personal view of each user. We have shown a fundamental
method for evaluating a personal view in multimedia information systems. These
algorithms are implemented and tested in our experimental multimedia database systems,
TRADEMARK and ART MUSEUM. These functions provide visual interaction in a user-
friendly manner.
Our research gives a guiding principle for the cognitive view mechanism of multimedia
information systems. The methods in this paper are a basis for user-centered human
interface design and multimedia information understanding.

Acknowledgments:
The author would like to thank the colleagues at the Electrotechnical Laboratory, especially
Dr. Akio Tojo, Dr. Toshitsugu Yuba, Dr. Kunikatsu Takase, Dr. Hideo Tsukune, Mr.
Koreaki Fujimura and Mr. Takio Kurita, for their support of this research.
The author would also like to thank the students from the University of Library and
Information Science (ULIS) and Tsukuba University and visitors from private companies.

Part VII

Case Studies of Data Science

• Social Science and Behavioral Science

• Management Science and Marketing Science

• Environmental, Ecological, Biological, and Medical Sciences
Proposition of a New Paradigm for a Scientific Classification
for Leadership Behavior Research Perspective

Jyuji Misumi
Institute of Social Research,
Institute of Nuclear Safety System, Incorporated
Keihanna Plaza, 1-7 Hikaridai, Seika-cho,
Soraku-gun, Kyoto 619-02,
Japan

Summary: This is our approach to the behavioral science of leadership, comprising a behavioral
morphology and a behavioral dynamics, each with its specificity and generality. In this sense, it
is different from current social science research. The latter lacks the category of general
behavioral morphology and a cross-disciplinary perspective. A balance must be struck among the
four areas of the above-mentioned paradigm if there is to be productive interdisciplinary
research.

1. The Limitations of Traditional Leadership Types

The behavioral classification of leadership types taken over from common usage
has the following limitations.
First, it is unidimensional (democratic/autocratic, conservative/progressive,
liberal/authoritarian, hawk/dove, employee-centered/production-centered) rather than
multidimensional.
Second, the terms used in this classification have multiple meanings in common usage,
which makes them difficult to operationalize.
Third, these terms are heavily value-laden.
Fourth, the categories are based on historical concepts and functional concepts.

2. PM Leadership Concept

As a remedy, we developed the PM leadership concept, which 1) allows multidimensional
analysis, 2) can be operationally defined, 3) is itself value-neutral, and 4) makes possible
experimental research and statistical studies.
Measuring leadership, which is very much a group phenomenon, requires a group-
functional concept like the PM concept (Misumi, 1985).
In the PM concept, P stands for performance and represents the kind of leadership that
is oriented towards achievement of the group's goal and problem solving. M, an
abbreviation of maintenance, stands for the kind of leadership that is oriented towards
the group's self-preservation or maintenance and the strengthening of the group process
itself. These two conceptual elements (P and M) are similar to Bales's (1953) "task leader"
and "emotional leader".
The concept of PM is a constructive concept to classify and organize the factors obtained
from leadership at different levels. It is not merely a descriptive concept for the factors
obtained from factor analysis, but is at a higher level of abstraction. Because of this
abstraction, the PM concept applies not only to industrial organizations but also to many
other social groups; P concerns not production only but also more general group goals or
problem-solving tasks. This is what principally distinguishes it from Blake and
Mouton's (1964) model.
In the PM concept, we consider P and M to be two axes on which the level of
each type can be measured (high or low), thus obtaining four distinct types of leadership
(see Fig. 1). The validity of these four PM types was proved using correspondence
analysis, which was first developed by Guttman (1950) and later by Hayashi (1956).

[Figure: 2 x 2 diagram with the P scale on the horizontal axis and the M scale on the vertical axis; the four quadrants are pm, Pm, pM, and PM.]

Fig. 1. Conceptual representation of 4 patterns of PM leadership behavior
(Misumi, J., 1984)

3. A New Research Paradigm: Searching for Differences and Similarities

As a group phenomenon, leadership pertains to all the fields of social science.
Consequently, for a better comprehension of leadership, an interdisciplinary perspective
is indispensable. PM theory, more than any other leadership theory, is the result of
a broad interdisciplinary research program (Misumi & Hafsi, 1989; Misumi & Peterson,
1987).
This interdisciplinary orientation would have been difficult, if not impossible, without the
existence of an adequate research paradigm.
One of the principal characteristics of PM leadership theory is, as indicated by the
paradigm of Fig. 2, to apprehend the study of leadership in terms of two principal
perspectives: (a) behavioral morphology and (b) behavioral dynamics.

                                  Situation
                          General               Specific
Behavioral-morphology     General behavioral    Specific behavioral
dimension                 morphology            morphology
Behavioral-dynamics       General behavioral    Specific behavioral
dimension                 dynamics              dynamics

Fig. 2. Paradigm of the science of leadership behavior (Misumi, J., 1984)


649

4. Behavioral Morphology of Leadership

The behavioral morphology approach consists in identifying,
describing, naming, and categorizing the forms of leadership in both general and specific
settings. This distinction is based on the idea that there are both universal and
particularistic leadership situations.

5. Behavioral Dynamics of Leadership

The behavioral morphology approach alone is, however, not sufficient to apprehend the
leadership phenomenon. We also need a complementary approach that helps ascertain the
causal laws that govern or determine the effectiveness of leadership.
Like behavioral morphology, behavioral dynamics can be further subdivided into
general behavioral dynamics and specific behavioral dynamics, depending on the degree
of abstraction being considered (specific or general).
In the behavioral dynamics area, field research using quantitative behavioral methods has
been complemented by laboratory research.
This is our approach to the behavioral science of leadership, comprising a behavioral
morphology and a behavioral dynamics, each with its specificity and generality. In this
sense, it is different from current social science research. The latter lacks the category of
general behavioral morphology and a cross-disciplinary perspective. A balance must be
struck among the four areas of the above-mentioned paradigm if there is to be productive
interdisciplinary research.

6. Some Empirical Results

Our research on the PM model consisted of both field surveys in different kinds of
organizations and laboratory studies. Regarding measurement in the field, we found that
evaluation by subordinates of their superiors was more valid than evaluation by
superiors, peers, or self-evaluation. We therefore had subordinates evaluate the leadership
of their superiors on the P and M dimensions.
To determine the level of P and M leadership for each subject, we first calculated
the mean score of all subjects on each item of the two dimensions (P and M). As
discussed by Misumi (1985), these P and M items, represented in Table 1, are the
results of factor analysis. A leader whose score is higher than the mean in both P and M
is thought to provide leadership of the PM type. A leader whose score is
higher than the mean only in the P dimension is classified as providing P-type (or Pm-
type) leadership. When a leader's score is higher than the mean only in the M dimension,
he is referred to as M-type (or pM-type). When a leader obtains a score lower than the
mean in both dimensions, he is thought to provide leadership of the pm type. This
results in our final four-type classification: PM, P, M, and pm.
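
For illustration only, the mean-split rule just described might be coded as follows; the DataFrame layout and column names are assumptions of this sketch, not part of the original study.

    # Hypothetical sketch of the four-type PM classification by mean splits.
    import pandas as pd

    def pm_type(scores: pd.DataFrame) -> pd.Series:
        """Classify leaders into PM / Pm (P-type) / pM (M-type) / pm from
        per-leader mean ratings on the P and M dimensions."""
        p_high = scores["P"] > scores["P"].mean()
        m_high = scores["M"] > scores["M"].mean()
        labels = {(True, True): "PM", (True, False): "Pm",
                  (False, True): "pM", (False, False): "pm"}
        return pd.Series([labels[(bool(p), bool(m))] for p, m in zip(p_high, m_high)],
                         index=scores.index, name="PM_type")

Applied to a table of subordinates' mean P and M ratings per leader, this yields the four groups compared in Table 2.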
To test the validity and reliability of these leadership categories in industrial
organizations, we examined their relationship with objective and cognitive
variables such as productivity, accident rate, rate of turnover, job satisfaction,
satisfaction with compensation, sense of belongingness to company and labor union,
teamwork, quality of work-group meetings, mental hygiene, and performance norms.
More than 300,000 subjects were surveyed. As indicated in Table 2, of the four types
the PM type was found to provide the best results, and the pm type the worst. In the
long run, the M type ranks second, and in the short run, the P type ranks second.

Table 1
Factor Loadings of Main Items on Leadership (Misumi, 1984)

                                                                 Factor loadings
Items                                                            I       II      III
59. Make subordinates work to maximum capacity                  .687   -.017   -.203
57. Fussy about the amount of work                              .670   -.172    .029
50. Fussy about regulations                                     .664   -.072    .001
58. Demand finishing a job within time limit                    .639    .070    .065
51. Give orders and instructions                                .546    .207    .198
60. Blame the poor job on the employee                          .528    .113   -.121
74. Demand reporting on the progress of work                    .466    .303    .175
86. Support subordinates                                        .071    .780    .085
96. Understand subordinates' viewpoint                          .079    .775    .229
92. Trust subordinates                                          .024    .753   -.003
109. Favor subordinates                                         .067    .742   -.050
82. Subordinates talk to their superior without any hesitation -.026    .722    .059
101. Concerned about subordinates' promotion, pay-raise,
     and so forth                                               .147    .713    .134
88. Show consideration for subordinates' personal problems      .132    .705    .150
94. Express appreciation for job well done                      .058    .651    .129
104. Impartial to everyone in work group                       -.143    .644    .164
95. Ask subordinates' opinion of how on-the-job problems
    should be solved                                            .049    .643    .121
85. Make efforts to fill subordinates' requests when they
    request improvement of facilities                           .110    .606    .333
81. Try to resolve unpleasant atmosphere                        .233    .538    .338
87. Give subordinates jobs after considering their feelings    -.276    .478    .457
76. Work out detailed plans for accomplishment of goals         .229    .212    .635
75. No time is wasted because of inadequate planning and
    processing                                                  .038    .333    .614
70. Inform of plans and contents of the work for the day        .254    .278    .607
52. Set time-limit for the completion of the work               .319    .299    .554
53. Indicate new method of solving the problem                  .251    .489    .479
56. Show how to obtain knowledge necessary for the work         .295    .492    .472
61. Take proper steps for an emergency                          .360    .451    .305
69. Know anything about the machinery and equipment
    subordinates are in charge of                               .255    .304    .458

Table 2
The Summary of Comparison of the Effectiveness of 4 Patterns of P-M
Leadership Behavior on Various Kinds of Factors of Work Group (the figures
of this table show the ranking of effectiveness in each factor) (Misumi, J., 1984)

                                                   Pattern of leadership behavior
                                                      PM    M    P    pm
Productivity                      Long term            1    2    3    4
                                  Short term (a)       1    3    2    4
Accidents (b)                     Long term            1    2    3    4
                                  Short term (a)       1    3    2    4
Turnover (b)                                           1    2    3    4
Group norm for high performance   Long term            1    2    3    4
                                  Short term (a)       1    3    2    4
Job satisfaction                                       1    2    3    4
Satisfaction with salaries                             1    2    3    4
Team work                                              1    2    3    4
Evaluation of work group meeting                       1    2    3    4
Loyalty (belongingness) to        Company              1    2    3    4
                                  Labor union          1    2    3    4
Communication                                          1    2    3    4
Mental hygiene (excessive tension and anxiety) (c)     1    2    3    4
Hostility to supervisor (d)                            1    2    3    4

(a) Including the data obtained by laboratory studies.
(b) Smaller figures indicate lower rate of accidents or turnover.
(c) Smaller figures indicate less tension and anxiety.
(d) Smaller figures indicate less hostility to supervisor.

It is noteworthy that this order of effectiveness is not limited to businesses only, but is
the same for teachers (Misumi, Yoshizaki & Shinohara, 1977), government offices
(Misumi, Shinohara & Sugiman, 1977), sports coaches (Misumi, 1985) and
religious groups (Kaneko, 1986).

References:
Bales, R. F. (1953): The equilibrium problem in small groups. In: Parsons, T., Bales, R. F. &
Shils, E. A. (eds.), Working Papers in the Theory of Action. Glencoe, Ill.: Free Press.

Blake, R. R. & Mouton, J. S. (1964): The Managerial Grid. Houston: Gulf.

Guttman, L. (1950): Chaps. 2, 3, 6, and 8. In: Stouffer, S. (ed.), Measurement and Prediction.
Princeton: Princeton University Press.

Hayashi, C. (1956): Theory and examples of quantification (II). Proceedings of the Institute of
Statistical Mathematics, 4, 19-30.

Kaneko, S. (1986): Religious consciousness and behavior of religious adherents'
representatives. Kyoto Survey Research Center of Jodo-shin-shu, 65-86.

Misumi, J. (1984): The Behavioral Science of Leadership (Second Edition). Yuhikaku.

Misumi, J. & Hafsi, M. (1989): La theorie de leadership de PM (Performance-
Maintenance): une approche japonaise de l'etude scientifique de leadership. Bulletin de
Psychologie, Tome XLII, 392, 727-736.

Misumi, J. & Peterson, M. F. (1987): Developing a Performance-Maintenance (PM) Theory of
leadership. Bulletin of the Faculty of Human Sciences, Osaka University, 13, 135-170.

Misumi, J., Shinohara, H. & Sugiman, T. (1977): Measurement of leadership behavior of
administrative managers and supervisors in local government offices. The Japanese Journal
of Educational and Social Psychology, 16, 2, 77-98.

Misumi, J., Yoshizaki, S. & Shinohara, S. (1977): The study on teachers' leadership: its
measurement and validity. The Japanese Journal of Educational and Social
Psychology, 25, 2, 157-166.
Structure of Attitude toward Nuclear Power Generation

Kiyoshi Karube (1), Chikio Hayashi (2), Shinichi Morikawa (3)

(1) Teikyo University
359 Ohtsuka, Hachiohji, Tokyo 192-03, Japan

(2) Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan

(3) Institute of Nuclear Safety System, Incorporated
Keihanna Plaza, 1-7 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan

Summary: How does the Japanese general public respond to the planned future progress of
nuclear power generation? Research and surveys on the Japanese national character, which
have been carried out in Japan for over 40 years, confirm that despite considerable change, the
"core" of the Japanese national character continues to be firmly preserved. What consideration,
therefore, should be given to this national character in connection with nuclear power
generation, when nuclear power generation is expected to continue in the future, and even higher-
level developments in nuclear power technology are likely to be pursued, both quantitatively
and qualitatively?
In the present research project, data gathered via an opinion survey are analyzed to elucidate how
Japanese general public attitudes toward nuclear power generation, whether favorable or
unfavorable, and the Japanese national character, which governs various aspects of societal life,
are interrelated within an attitude space, and how subjects can be classified and divided therein.

1. Data collection and analysis

Data were collected in an opinion survey addressed to 1,138 subjects, 18 to 79 years of
age, selected by probability sampling in the Kansai area. The questionnaire included
questions concerning ideas and feelings about various matters that constitute attitude
toward nuclear power generation (e.g. images of nuclear power, understanding of
nuclear power, attitude toward energy and environmental issues, sense of anxiety,
sensitivity to risk, views of science and civilization that can influence the above, etc.);
social and political attitudes closely related to the above; and archetypal Japanese
characteristics (e.g. tendency toward moderate opinions, sense of trust, concept of
typical Japanese leadership, attitude toward supernatural beings such as ghosts,
psychological relationship with superstition, etc.), as well as questions concerning how
electric power companies should deal with the general public.
Hayashi's Quantification Method III (pattern classification) is used in the analysis.
When the maximum correlation coefficient is obtained in the calculation, using numerical
values assigned to respondents and to the respective reply categories, and taking into
account all reply categories, the categories generally present multi-dimensional category
scores, the corresponding respondent scores also being multi-dimensional. Categories
with proximate category scores are placed proximately in the attitude space, indicating
their close correlation. Since the respective dimensions of the respondent scores serve as
attitude scales that are independent of each other, respondents concentrated on the same
scale can be classified into one group.
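
To make the method concrete, here is a minimal sketch of Quantification Method III as a singular value decomposition of the respondents-by-categories indicator matrix; the function and variable names are illustrative assumptions, not the authors' software.

    # Minimal sketch of Hayashi's Quantification Method III (equivalent to a
    # correspondence analysis of the 0/1 response-pattern matrix).
    import numpy as np

    def quantification_iii(Z, n_axes=3):
        """Z: 0/1 indicator matrix (respondents x reply categories)."""
        Z = np.asarray(Z, dtype=float)
        r = Z.sum(axis=1, keepdims=True)            # respondent totals
        c = Z.sum(axis=0, keepdims=True)            # category totals
        S = Z / np.sqrt(r) / np.sqrt(c)             # standardized matrix
        U, d, Vt = np.linalg.svd(S, full_matrices=False)
        # The leading singular value belongs to the trivial constant axis and
        # is dropped; the following ones are the maximum correlations per axis.
        resp_scores = U[:, 1:n_axes + 1] / np.sqrt(r)
        cat_scores = Vt[1:n_axes + 1].T / np.sqrt(c.T)
        return resp_scores, cat_scores, d[1:n_axes + 1]

Respondent and category scores on the same axes can then be plotted together, giving the attitude space analyzed below.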

2. Findings
Replies to groups of closely correlated questions, such as the questions concerning
attitudes toward nuclear power generation, were analyzed first. When a group of
questions was found to form a clear-cut scale, a category defined by the value range of
the scale was regarded as representative of the replies to that group of questions, in the
same manner as the other question categories. Replies to almost all questions presented
in the survey were then analyzed, yielding the following findings.
The first axis in the attitude space divides the subjects (respondents) into those who
are completely indifferent and those who are not. Subjects with high respondent scores
on this axis (at or above the inflection point of the distribution curve) can be distinguished
as the indifferent group. The number of subjects in this group accounts for 13% of all
subjects.
The second axis divides the subjects into those who are strongly positive toward
nuclear power generation and those who are not, while the third axis divides the
subjects into those who are strongly negative toward nuclear power generation and
those who are not. Respondent scores show that the strongly positive account for 11%,
and the strongly negative for 9%.

[Figure: histogram of the first-axis sample scores, ranging from about -0.8 to 1.8 with frequencies up to about 13.5%; the long right-hand tail beyond the inflection point marks the indifferent group.]

Fig. 1: I-axis Sample Score Distribution



The straight line formed by the "positive", "moderate" and "negative" categories, which
runs diagonally on the plane demarcated by the second and third axes, becomes, when
projected onto the parallel line passing through the origin, a scale that divides
the three categories. The subjects are classified on this scale into groups that account
for 12%, 50%, and 5%, respectively, of the total.
The positions of the item categories that correspond to these groups, the item categories of
the Japanese national character, and other item categories indicate that respondents in
the indifferent, strongly positive, and strongly negative groups have nothing to do with
typical Japanese sentiments (Tab. 1, Tab. 2).
It is believed that for the majority of respondents, which excludes the people in these
groups, communication in a somewhat Japanese style can be effective in promoting nuclear
power generation.


[Figure: schematic scatter of respondent scores on the plane of axes II and III, with the centers of the strongly negative, negative, moderate, positive, and strongly positive groups marked along a diagonal line.]

Fig. 2: II- and III-axis Relationship (Model presentation)



Tab. 1 Characteristics of Subjects with Positive and Negative Attitude
toward Nuclear Power Generation

Strongly positive:
  - Positive views of science and civilization
  - Few centrist replies (tendency to express opinions clearly)
  - Not very interested in accidents

Strongly negative:
  - Development of aircraft, etc.: not useful
  - Do not like typical Japanese leadership
  - Consider most dangerous: diseases, drug hazards, malpractice, nuclear power,
    nuclear power generation (plant), radioactive contamination, war, environmental
    contamination and natural environmental destruction
  - Negative views of science and civilization
  - Consider Japan's power generation capacity sufficient
  - Strong interest in environmental issues

Common to the strongly positive and strongly negative groups:
  - Sufficient knowledge about positive and negative effects of nuclear power generation
  - Little interest in supernatural beings
  - Little influence of superstition
  - Consider crime, bullying and interpersonal trouble most dangerous
  - Strong distrust of others
  - Cause of nuclear power plant accidents: operation error
  - Remember Chernobyl accident very well

Negative:
  - Usefulness of aircraft, etc.: moderate
  - Importance of aircraft, etc.: moderate
  - Images of nuclear power: radioactivity, breakdown, environmental contamination,
    accident, explosion
  - Widely used power generation method in Japan: hydro
  - Regarding nuclear power generation, wish to know more about past incidents and
    accidents, treatment of waste materials, regional development, impact of
    radioactivity, necessity of nuclear power generation, difference from atomic bomb
  - Consider socialism good
  - Fearful of impact of nuclear power plant accidents that can last for generations
  - Moderate replies to many questions
  - Main cause of nuclear power accident: inadequate operation manual, equipment
    system failure
  - Sense of anxiety: moderate and strong
  - Interest in accident: moderate and strong
  - Somewhat negative views of science and civilization
  - Moderate view of typical Japanese leadership
  - Evaluation of ideologies: depends on times and situation
  - Images of nuclear power: danger

Common to the negative and positive groups:
  - Insufficient knowledge of positive and negative effects of nuclear power generation
  - Strong interest in politics
  - Male
  - Japan's power generation capacity: insufficient
  - Consider energy issues important
  - University graduate
  - 60 years of age or older
  - Little anxiety
  - Trust in others very much

Positive:
  - Consider democracy and capitalism good
  - Japan's power generation capacity: appropriate
  - Somewhat positive and moderate views of science and civilization
  - Strong influence of superstition
  - Strong interest in supernatural beings
  - Chernobyl accident: somewhat remember, do not remember
  - Widely used power generation method: nuclear
  - Consider natural disaster most dangerous
  - Consider typical Japanese leadership favorable
  - Images of nuclear power: electricity, power, power generation, energy, fuel, resources
  - Somewhat weak interest in environmental issues
  - Usefulness of aircraft, etc.: very useful
  - Importance of aircraft, etc.: very important
  - Fearful of nuclear power plant accident: exposure to radioactivity above
    permissible levels

Tab. 2 Category Score and Weighted Sum of Respondents
(Sample weight = population/respondents in each of urban and provincial strata)

                                                        N        Axis-1   Axis-2   Axis-3
                                                      (4676)

Attitude toward nuclear power generation
 1 Strongly negative                                    608      -0.245   -0.537    4.430
 2 Somewhat negative                                   1192       0.922   -1.570    0.8</!6
 3 Moderate                                            1141       0.713   -0.405   -0.621
 4 Somewhat positive                                   1122       0.235    0.550   -1.853
 5 Strongly positive                                    613      -2.226    4.350   -1.433
Image of nuclear power
 1 Energy, fuel, resource                               331      -0.443    1.397   -0.760
 2 Electricity, power, power generation                1801      -0.310    0.613   -0.504
 3 War, atomic bomb, nuclear weapon                    1339      -0.242   -0.209    0.148
 4 Accident, explosion                                  805      -0.849   -0.747    0.844
 5 Radioactivity, breakdown, environmental
   contamination                                        858      -0.874   -0.635    1.460
 6 Danger, fear                                         267       0.566   -1.049   -0.159
 7 Other                                                648       0.287    0.148   -0.140
Understanding of nuclear power
(1) Power generation method in Japan
 1 Thermal                                             1653      -0.875    0.720    0.395
 2 Hydro                                               1106       0.447   -0.564    1.056
 3 Nuclear                                             1546       0.492   -0.011   -1.323
 4 Other                                                371       2.233    0.197    0.575
(2) Know about positive/negative effects of
    nuclear power generation?
 1 Yes                                                  700      -2.537    3.061    1.591
 2 No                                                  3052       0.734   -0.700   -0.580
 3 Neither                                              924       0.189    0.669    0.699
(3) What you wish to know about nuclear
    power generation
 1 Mechanism                                           1416      -0.870   -0.580   -0.412
 2 Necessity                                           1351      -0.162   -1.718   -0.260
 3 Economic efficiency                                  692      -0.461   -1.317   -0.635
 4 Safety                                              3467      -0.1H    -0.526   -0.273
 5 Past incidents and accidents                        1153      -1.228   -0.867    0.509
 6 Disaster prevention system                          2919      -0.480   -0.564   -0.220
 7 Impact of radioactivity                             2818      -0.197   -1.025   -0.152
 8 Treatment and disposal of spent fuel and
   waste materials                                     2678      -0.899   -0.496    0.331
 9 Difference from atomic bomb                          974      -0.011   -2.268   -0.545
10 Regional development in localities around
   nuclear power plants                                 570      -0.986   -1.136   -0.916
11 Nothing in particular                                305       5.194    5.684    1.321
(4) Remember Chernobyl accident?
 1 Very well                                           2883      -1.022    0.340    0.837
 2 A little                                            1222       1.042   -0.440   -1.485
 3 No                                                   571       4.045    0.316   -1.067
(5) Main cause of accident
 1 Equipment system failure                            2692      -0.398   -0.286    0.270
 2 Operator error (human error)                        1896      -0.618    0.106    1.009
 3 Inadequate management system                        3308      -0.179   -0.030   -0.156
 4 Inadequate operation manual                         1191      -0.753   -0.396    0.524
(6) Impact of accident
 1 Death or injury in power plant or local
   community                                            341       1.964    2.318   -1.153
 2 Radioactivity exposure above permissible
   levels                                               575       0.809    1.822   -1.853
 3 Radioactive contamination spreading to
   other parts of the world                            1608      -0.392   -0.087   -0.100
 4 Impact on future generations                        2047      -0.134   -0.565    0.691
Energy, environmental issues, sense of anxiety,
sensitivity to risk
(1) Energy issues
 1 Very important                                      2036      -1.493    0.852    0.455
 2 Important                                           2318       0.893   -0.849   -0.791
 3 Not important                                        303       4.932    2.549    2.588
(2) Japan's power generation capacity
 1 Sufficient                                           619      -0.268   -0.118    2.267
 2 Almost sufficient                                    670       0.042   -0.690   -0.746
 3 Exactly appropriate                                 1520       1.147    0.064   -0.561
 4 Almost insufficient                                 1449      -0.284    0.273   -0.203
 5 Insufficient                                         386      -1.844    1.567    0.196
(3) Environmental issues
 1 Very interested                                     1001      -1.181   -1.679    2.181
 2 Interested                                          1580      -0.881   -0.305   -0.389
 3 Slightly interested                                 1592       0.969    0.540   -1.418
 4 Not very interested                                  503       4.117    3.830    1.345
(4) Fear (Sense of anxiety)
 1 Little                                              2052       0.508    0.802   -0.389
 2 Moderate                                            1849      -0.145   -0.104    0.607
 3 Much                                                 775      -0.176   -1.070   -0.433
(5) Interest in accident (Sensitivity to risk)
 1 Little                                              1027       2.438    2.954   -0.636
 2 Moderate                                            2174      -0.537   -0.216    0.500
 3 Much                                                1475      -0.473   -1.316   -0.301
(6) Most dangerous thing in social life
 1 Traffic accident                                    1976      -0.382    0.599   -0.195
 2 Natural disaster                                     831      -0.011    0.256   -1.167
 3 Environmental pollution, destruction,
   abnormal weather                                     247      -1.766   -1.337    2.400
 4 Fire                                                 244       0.174   -1.588   -1.717
 5 Crime, bullying, interpersonal trouble                46      -0.676    2.340    4.727
 6 War                                                   77       0.444    0.444    2.926
 7 Nuclear power (generation), radioactive
   contamination                                         54      -0.436   -1.981    3.282
 8 Disease, drug hazards, malpractice                   101      -0.423   -1.631    3.744
Views of science and civilization
 1 Very negative                                        563      -0.409   -1.675    2.769
 2 Somewhat negative                                    862      -0.479   -0.639    0.187
 3 Slightly negative                                   1288       1.180    0.239    0.166
 4 Moderate                                             738       0.859   -0.489   -0.777
 5 Somewhat positive                                    957      -0.454    1.287   -0.901
 6 Very positive                                        268      -1.635    3.504   -1.902
Social and political attitude
(1) No. of items considered important
    (importance of aircraft etc.)
 1 0 items                                              604       3.055    1.459    1.862
 2 1 item                                              1118       0.530   -0.809    1.949
 3 2 items                                             1703      -0.410   -0.685    0.078
 4 3 items                                             1251      -0.880    1.448   -2.755
(2) No. of items considered useful
    (usefulness of aircraft etc.)
 1 0-1 item                                             623       2.740    0.822    3.941
 2 2 items                                             1415      -0.019   -1.114    1.656
 3 3 items                                             2638      -0.395    0.639   -1.823
(3) Interest in political affairs
 1 Very interested                                      877      -1.984    1.707    1.321
 2 Somewhat interested                                 2009      -0.410   -0.060   -0.245
 3 Not interested                                      1749       1.511   -0.405   -0.412
(4) Ideology
 1 Democracy and capitalism considered good            1884      -1.256    1.408   -0.349
 2 Depend on time and situation                        1843       1.146   -0.715    0.035
 3 Socialism considered good                            949       0.940   -0.750    0.613
Japanese national characteristics
(1) Tendency toward moderate opinions
 1 0-4 items (tend to express opinions clearly)         462      -2.689    3.702   -0.818
 2 5-12 items                                          2792      -0.566   -0.132   -0.189
 3 13 or more items                                    1422       2.435   -0.505    0.328
(2) Scale of sense of trust
 1 0 items                                             1316       0.449    0.398    1.099
 2 1 item                                              1544       0.808   -0.577   -0.385
 3 2-3 items (strong distrust of others)               1816      -0.661    0.546   -0.475
(3) Typical Japanese leadership
 1 0-2 items (unfavorable to Japanese style of
   leadership)                                          412       1.268    0.884    3.837
 2 3-6 items                                           2480       0.617   -0.483    0.018
 3 7 or more items (favorable)                         1784      -0.793    0.816   -0.917
(4) Scale of interest in supernatural beings
 1 0-3 items (little interest)                          627       1.566    2.861    2.705
 2 4-9 items                                           2300       0.099   -0.065    0.124
 3 10 or more items (much interest)                    1749      -0.326   -0.584   -1.139
(5) Superstition believable?
 1 0-2 items (little influence)                         653      -0.288    2.324    3.615
 2 3-6 items                                           1797      -0.231   -0.184   -0.025
 3 7-8 items (much influence)                          2226       0.558   -0.253   -1.045
Demographics
(1) Sex
 1 Male                                                2214      -0.784    1.486    0.653
 2 Female                                              2462       0.964   -1.083   -0.592
(2) Age
 1 18-29 years old                                     1121       0.972   -0.092   -0.533
 2 30-39 years old                                      967      -0.044   -0.170   -0.389
 3 40-59 years old                                     1937      -0.323    0.078    0.455
 4 60 years old or above                                651       0.334    1.138    0.126
(3) Education
 1 Elementary/secondary school graduate                 664       1.410    0.436    0.490
 2 High school                                         2467       0.467   -0.441   -0.415
 3 University                                          1505      -1.023    0.918    0.410
(4) Residence
 1 Urban                                               4000       0.030    0.046    0.017
 2 Provincial                                           676       0.768    0.649   -0.115
R (correlation)                                                   0.335    0.282    0.265

Research Concerning the Consciousness of
Women's Attitude toward Independence

Setsuko Takakura
Tokyo International University
2509 Matoba, Kawagoe-shi
Saitama-ken 350-11, Japan

Summary: Applying Quantification Method III (correspondence analysis) to the results
of our survey, we found a pattern of thinking as regards the consciousness of women's
independence. The majority of the samples (female university graduates) consider "women's
independence" from three points of view: financial independence, psychological strength and
family role. It is said that financial independence is a very important aspect of women's
independence, but more than half of the samples value psychological independence rather
than financial independence, even though they consider that, concerning men's independence,
both financial power and psychological strength are necessary conditions.

1. Outline of Survey
What do Japanese women today understand by the expression "women's independence",
and what is its relation to other items: consciousness of liberty, equality between the
sexes, happiness, autonomy, identity, etc.? What are the obstructions to their
independence? The purpose of this research is to clarify these subjects (we conducted a
mail survey).
We chose 7 universities and 5 junior colleges situated within Tokyo and the surrounding
areas. As the population we determined the female graduates of these universities and
junior colleges in the years 1958, '67, '75, '81, '86, and '91. The numbers of samples
and of effective responses are given in Table 1.
Table 1: Number of samples and number of effective responses

[Table: numbers of samples and of effective responses by graduation year ('58, '67, '75, '81, '86, '91), for the 7 universities (738 effective responses, 56%), the 5 junior colleges (669, 48%), and in total (52%).]

We obtained 1,407 effective responses in total.


2. Results
2.1 Outline
We show the results of the responses to the two main questions. (The numbers
are the percentages of the responses.)

Regarding the consciousness of independence:

Q.11 Do you think that you are independent now? (no dependency upon anyone)
1. sufficiently independent (12%)   2. just independent (46%)
3. not sufficiently independent (24%)   4. hardly independent (13%)
5. D.K. (5%)

Regarding the degree of satisfaction with the level of one's own independence:

Q.13 What do you think of the level of your independence?

1. satisfied (27%)   2. unsatisfied, but have no choice (33%)
3. unsatisfied, then seeking some other way (27%)   4. nothing, D.K. (13%)

We show in Fig.-1 some aspects of responses 1 and 2 to Q.11, according to several
categories: university or junior college, year of graduation, occupation, marital status;
and we show in Fig.-2 response 1 to Q.13, according to the same categories.

[Figure: bar charts of the percentages of Q.11 responses 1 and 2 (left) and of Q.13 response 1, "satisfied" (right), by school type (university vs. junior college) and by marital status (single, married, divorcee, widow).]

Fig.-1-1                                    Fig.-2-1

[Figure: bar charts of the percentages of Q.11 responses 1 and 2 (left; split into "sufficiently independent" and "just independent") and of Q.13 response 1 (right), by year of graduation ('58, '67, '75, '81, '86, '91) and by occupation (full-time, part-time, self-employed, side job, without occupation).]

Fig.-1-2                                    Fig.-2-2

From these figures we see that the highest degree of consciousness of independence
was as follows: by age, among the oldest samples; by profession, among full-time
workers and the self-employed; by marital status, among divorcees and widows.

In order to ascertain the fundamental structure of women's consciousness regarding
independence we asked 3 questions: Q.2, Q.15, and Q.29. These questions are as follows.
(The numbers are the percentages of the responses.)

Q.2 Some examples of women's life-styles are listed below. Please mark the types
that attract you. (no more than 2)
1. fashionable single career woman able to make full use of her abilities in her
profession (6%)
2. woman with a full-time profession living with husband (without children) (9%)
3. married woman with a child/children who devotes her energies to her profession,
entrusting housework and child care to somebody else as necessary (25%)
4. woman with a full-time profession who shares partial charge of housework with
husband and child/children (37%)
5. woman with a full-time profession who does housework and child care without
help from anybody else (5%)
6. woman whose priority is taking care of husband and child/children and who
works part-time locally during the daytime as long as she has time to spare (11%)
7. woman who takes care of husband and child/children and participates actively
in some activity which contributes to society (30%)
8. woman who depends on the income of husband, and who efficiently deals with
housework and child care and then participates actively in various free-time activities
of her own choice (21%)
9. woman who, while preserving good judgement and making efforts to extend her
knowledge, and although depending on the income of her husband, successfully
takes care of the family and domestic management and acts as the mainstay of the
family (25%)

Q.29 When you hear the expression "women's independence" how do you feel about
it? Please mark any of the following categories which correspond to your feelings.
(as many as you like)

#. unattractive (4%)          R. impressive (27%)
S. self-confident (54%)       T. financially independent (56%)
U. vibrant (44%)              V. active (54%)
#. selfish (2%)               W. trendy (18%)
X. self-reliant (39%)         Y. self-assertive (24%)
#. too busy (3%)              Z. equality between the sexes (32%)

Q.15 How important do you think each of the following items is, from the point of
view of women's independence and of men's independence?

F: women's independence                                         M: men's independence
very / a little / not very / not at all                         very / a little / not very / not at all
important                     Item                              important

45  46   8  0   A. being able to support oneself financially      78  18   -  0
46  47   5  0   B. having one's own convictions                   62  32   4  0
56  41   -  0   C. acting according to one's own principles       70  26   -  0
55  42   2  0   D. being able to select one's own way of life     68  28   2  0
34  54  10  0   E. being able to manage a family budget           22  51  23  0
36  52  10  0   F. being able to produce a harmonious family
                   atmosphere                                     28  57  12  0
31  51  15  -   G. protecting and supporting the family           40  45  11  -
21  51  26  -   H. having a profession                            69  25   3  0
 4  44  47  4   I. participating actively in local social
                   activities                                      6  41  46  4
14  62  20  -   J. taking an interest in political and
                   economic affairs                               37  50   9  -
34  56   8  0   K. being able to cooperate with others            36  54   7  0
39  55   5  0   L. respecting the viewpoint of others             42  52   4  0
24  57  16  -   M. being able to accomplish one's aims            36  49  11  -
56  39   3  0   N. being able to look after oneself               42  48   7  0
31  49  18  -   O. being able to do housework                     12  49  35  2
33  46  17  2   P. being able to take care of children            10  46  38  3
10  43  42  2   Q. supporting the family financially              66  26   4  -

(The numbers are the percentages of the responses in each category; a dash marks an illegible value.)



2.2 The application of Quantification Method III (correspondence analysis)

We applied Quantification Method III (correspondence analysis) to the responses to
Q.15-F (except item I) regarding women's independence, taking notice of response 1
(very important) (Fig.-3); and, in the same way, to the responses to Q.15-M regarding
men's independence (Fig.-4).

[Figure: plot of the Q.15-F items on the first two axes; the items form the clusters A-H-Q, B-C-D, J-K-L-M-N, and P-O-E-F-G.]

Fig.-3 Q.15-F: concerning women's independence

[Figure: plot of the Q.15-M items on the first two axes; A, H, Q, B, C, and D fall together, E, F, G, J, K, L, M, and N form a second cluster, and O and P lie apart.]

Fig.-4 Q.15-M: concerning men's independence

As a result of these figures we were able to classify the items as follows:

Q.15-F (Fig. 3): concerning women's independence
  A, H, Q          financial power
  B, C, D          psychological strength
  J, K, L, M, N    human relations, etc.
  P, O, E, F, G    role in the family

Q.15-M (Fig. 4): concerning men's independence
  A, H, Q, B, C, D          financial power and psychological strength
  E, F, G, J, K, L, M, N    some role in the family and human relations, etc.
  O, P                      ability to do housework
667

This classification reveals that the samples (female graduates) consider women's
independence from four different aspects: financial power, psychological strength,
human relations and role in the family; while, concerning men's independence, they
consider almost only two aspects: financial and psychological strength on the one hand
and human relations etc. on the other. (Items O and P are situated very far from the
other items; this means that the samples who marked "very important" for items O and
P are few and heterogeneous.)

We also applied Quantification Method III to all the responses to Q.2, Q.29, and Q.15-F
(Fig.-5). We see that the samples who regard financial power as a very important item
for women's independence chose categories 1 or 2 in Q.2, in opposition to the samples
who chose categories 6, 7, 8, 9 in Q.2; the latter, also choosing "W" in Q.29, seem
conservative; it seems that they do not consider "independence" positively. The samples
who chose O, P, G, K, L, F, E, N in Q.15-F respect the family role and chose type 5 in
Q.2; it seems that they consider this kind of life-style an ideal.

[Figure: joint plot of the Q.2 life-style types (1-9), the Q.29 categories (R-Z), and the Q.15-F items (A-Q) on the first two axes.]

Fig.-5 Q.2: (1 -- 9); Q.29: (R -- Z); Q.15-F: (A -- Q)

People say that financial power and psychological strength are very important for
independence. We found here that women's (female graduates') opinions of independence
differ according to whether they are considering women's independence or men's
independence. That is, concerning men's independence they consider that the two
strengths, financial and psychological, are both important, while concerning women's
independence some samples attach importance to financial power and others to
psychological strength.

Concerning women's independence we have taken the items A, H, Q (which mean

financial power); dividing the responses into two categories, "very important" and the
others, we applied Quantification Method III again. Then, taking the first eigenvector
as the outside criterion, we applied multiple regression analysis for the purpose of
finding the effective factors; we took as predictor factors: kind of school, year of
graduation, marital status, occupation, response to Q.11 and response to Q.13.
We found from this result that occupation (having one or not) is the most remarkable
effective factor, and that the kind of school (university or junior college) and the response
to Q.11 are quite effective factors. We tried the same method concerning the
items B, C, D (which mean psychological strength). We found as effective factors
(to discriminate the samples who attach importance to these items for women's
independence from those who do not) the kind of school (university or junior
college), the response to Q.11 and the response to Q.13.
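
The two-step procedure just described (Quantification Method III on the dichotomized items, followed by a multiple regression of the first eigenvector on dummy-coded factors) might be sketched as follows; the DataFrame, its column names, and the random placeholder data are assumptions for illustration only.

    # Sketch: QIII axis on dichotomized A/H/Q, then regression on factors.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 300
    df = pd.DataFrame({
        "A": rng.integers(0, 2, n),                  # 1 = "very important"
        "H": rng.integers(0, 2, n),
        "Q": rng.integers(0, 2, n),
        "school": rng.choice(["university", "college"], n),
        "occupation": rng.choice(["full-time", "none"], n),
        "Q11": rng.integers(1, 6, n).astype(str),
    })

    # Quantification III: SVD of the standardized indicator matrix; the
    # second left singular vector is the first non-trivial respondent axis.
    Z = pd.get_dummies(df[["A", "H", "Q"]].astype(str)).to_numpy(float)
    S = Z / np.sqrt(Z.sum(1, keepdims=True)) / np.sqrt(Z.sum(0, keepdims=True))
    U, d, _ = np.linalg.svd(S, full_matrices=False)
    criterion = U[:, 1]                              # outside criterion

    # Multiple regression of the criterion on dummy-coded predictor factors.
    X = pd.get_dummies(df[["school", "occupation", "Q11"]], drop_first=True)
    effects = pd.Series(LinearRegression().fit(X, criterion).coef_,
                        index=X.columns).sort_values()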

3. Conclusion
After these analyses we can say that the consciousness of female graduates toward
independence is not so high; especially, the consciousness of young graduates is lower
than that of older graduates. The consciousness of the samples who are widows or
divorced is strong; most of them, having an occupation, retain financial independence.
Most of the samples who have husbands do not consider financial power a very
important factor for women's independence; they attach importance to the psychological
aspect rather than the financial aspect. A great part of the young women graduates
who are not yet married have not enough consciousness of independence and are not
satisfied with the level of their independence. It is particularly interesting that we
found, by the method of Quantification Method III, the difference in women's
(graduates') way of thinking between women's independence and men's independence.
There have been many discussions on the important factors, financial power and
psychological strength, regarding independence. We found that the samples who support
the former aspect and the samples who support the latter aspect are not necessarily the
same. The former are many among the samples who have an occupation; the latter are
many among the samples who graduated from junior college and are confident of the
situation of their own independence. We obtained in this way some new views
concerning the consciousness of women's attitude toward independence.


Acknowledgment:
We conducted this survey with the help of a grant in aid of The Tokyo Women's
Foundation, and this study was carried out under the ISM Cooperative Research
Program (95-ISM-CRP A-34).
A Cross-National Analysis of the Relationship between
Genderedness in the Legal Naming of Same-Sex Sexual/Intimate
Relationships and the Gender System
Saori Kamano
Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku
Tokyo 106, Japan

Summary: In this paper, I approach the naming of same-sex sexual/intimate relationships
from a social constructionist perspective and undertake a theoretically informed cross-
sectional, cross-national analysis. Classifying the countries in terms of the genderedness of
the legal naming of same-sex sexual/intimate relationships, I explore, first, how various
socio-politico-economic factors are related to the differentiation of the countries in the
naming of same-sex sexual/intimate relationships; and, second, how such differentiations
in naming are linked to the gender system. I undertake correlation analyses and logistic
regression analyses to examine these linkages.

1. Introduction
Feminist scholarship has noted that it is ultimately a gender issue as to whether and how
same-sex sexual/intimate relationships are named or seen as a category apart from other
types of relationships (see, e.g., Connell 1987). The very conceptual possibility of
"same-sex sexual/intimate relationships" (abbreviated as SSSIR) as an identifiable type of
relationship hinges on the viability of gender as a social category. In other words, it is
only when "gender", or the social categories of "men" and "women", operates as a social
divider and when heterosexuality is assumed that sexual/intimate relationships involving
people of the same sex can be constructed as an identifiable and distinguishable category of
relationships. Extant historical and anthropological studies have documented that such
"naming" of same-sex sexual/intimate relationships varies over time and across societies
(see, e.g., Durberman, et al. 1980; D'Emilio and Freedman 1988). However, such
documentation of the variation awaits more rigorous theorizing as well as systematic
analysis, which I will attempt in this study by focusing on one aspect of naming--the
"genderedness" of the naming of SSSIR--and take a first step toward a more theoretical
and rigorous exploration of its cross-national variation.
I will first lay out my argument regarding the genderedness of the naming of SSSIR and
the gender system before discussing how genderedness is operationalized and how
countries are differentiated in this respect by socio-politico-economic factors. I will next
consider whether and how genderedness in naming is related to the rigidity of gender
categories and the level of gender inequality.

2. Theoretical linkage between the genderedness in the naming
of SSSIR and the gender system

Since the naming of SSSIR is basically a gender issue, one can expect it to be related to the
gender system of a society. Specifically, I argue that the rigidity of gender categories--
which partially constitute the gender system--produces differences in how men and women
are generally treated in a society, leading to gender differences in the naming of SSSIR.
Highly rigid gender categories mean that men and women are treated and conceptualized
separately and differently in a society. Similarly, the level of gender inequality is expected
to contribute to whether or not men's and women's SSSIR are named in the same way.
Gender inequality means that men and women are evaluated differently and unequally for
who they are and what they do. In virtually all modern societies, men are granted more
importance in the system; whatever men do receives more attention whereas what women
do tends to be disregarded. I argue that such a tendency is expected to be more prominent
in societies with a higher level of gender inequality. Rigidity and inequality in the gender
system are analytically distinct but empirically inseparable. I therefore expect the cross-
national pattern of the naming of SSSIR to be as follows: the more rigid and/or unequal the
gender system, the more likely the naming of SSSIR is gendered--i.e. only men's SSSIR
(and not women's) are named.

3. Coding of Legal Naming of SSSIR

I operationalize legal naming as the codification of SSSIR in the law of each country. I
use as data the descriptions of laws regarding SSSIR for 117 countries compiled in the
"Country-by-Country Survey" in the ILGA Pinkbook (Tielman and de Jonge 1988). I first
classify the legal statements of the 88 countries where SSSIR are named to identify the
gender of the subject referenced: men only; women only; men and women; and gender-
unspecified persons. I then construct a dummy variable (MENREF) differentiating
countries that reference men exclusively in naming SSSIR from those referencing gender-
unspecified persons and/or both men and women. Such coding identifies countries in
which "men" are singled out in the naming of SSSIR and contrasts them with all other
cases. Table 1 shows the classification of countries by the genderedness in the naming of
SSSIR, by geographic region.

Table 1: Genderedness in the Naming of SSSIR, by Geographic Region

MENREF code: description            Africa   N.       C.& S.   Asia    Europe  Oceania   All
                                             America  America
0: countries with at least one        14       2        11       12      23       5       67
statement referencing SSSIR of      (73.7)   (100)    (91.7)   (63.2)  (82.1)  (62.5)    (76)
both men and women or gender-
unspecified persons

1: countries where statements          5       0         1        7       5       3       21
reference men's SSSIR               (17.9)    (0)      (4.2)   (36.8)  (17.9)  (37.5)    (24)
exclusively

N                                     19       2        12       19      28       8       88
%                                   (100)    (100)     (100)    (100)   (100)   (100)    (100)

The table indicates that 21 out of 88 countries, or 24% of the countries considered here,
name only men's SSSIR, while 67 countries, or 76%, do so for SSSIR of both genders
and/or persons without gender specification. Examination by geographic region
indicates that the tendency to name only men's relationships is stronger in Asian and
Oceanic countries (about 40%). The proportion of countries naming only men's
SSSIR is particularly low in Central and South America--only 1 out of 12
countries references only men's SSSIR.

4. Correlation analyses of the relationships between socio-politico-economic
factors and the genderedness in the legal naming of SSSIR

Having demonstrated the cross-national variation of genderedness in naming, I now explore,
through correlation analyses, how the cross-national pattern of the genderedness in
naming is related to various socio-politico-economic factors which affect a range of social
phenomena. The factors considered include: (a) political systems, as indicated by
democratic vs. non-democratic system (DEMOC) and communist vs. non-communist
system (COMM); (b) the level of legalization in a society, as indicated by the
number of institutional areas under jurisdiction (LEGAL); (c) the level of economic
development, as indicated by GNP per capita (LGNPCP); (d) history of colonization, as
indicated by whether or not a country has been colonized by Britain (COLBRI) and
whether or not a country has been colonized by France (COLFRN); (e) religious system,
as indicated by the percentages of population affiliated with Protestantism (PCTPROT),
Catholicism (PCTCATH) and Islam (PCTISLM), respectively; and (f) the strength of the
linkage between sex and procreation, as indicated by the percentage of married women
using contraceptives (CONTRA). All the indicators are measured in the 1980's, except for
the level of legalization of a society, which is measured in 1978. (See Kamano (1995) for
details of the data sources.)
The correlation coefficients between the genderedness in the naming of SSSIR and each of
the socio-politico-economic factors are presented in Table 2 below.

Table 2: Correlation Analyses of Socio-Politico-Economic Factors and Genderedness in
the Naming of SSSIR

                                         Correlation Coefficient   Meaning of the Findings
                                         (Pearson's R)
DEMOC: democracy/non-democracy                  .053               not significant
COMM: communist/non-communist                  -.009               not significant
LEGAL: # of institutional areas                 .082               not significant
  under jurisdiction
LGNPCP: natural log of GNP per capita          -.322***            More economically developed,
                                                                   less likely to exclusively reference men
COLBRI: former British colony                   .401***            Former British colony,
                                                                   more likely to exclusively reference men
COLFRN: former French colony                   -.177*              Former French colony,
                                                                   less likely to exclusively reference men
PCTCATH: % Catholic population                 -.096               not significant
PCTPROT: % Protestant population               -.134#              (Higher % of Protestant population,
                                                                   less likely to exclusively reference men)
PCTISLM: % Muslim population                   -.128               not significant
CONTRA: % of married women                     -.143#              (Higher % of contraceptive use,
  using contraceptives                                             less likely to exclusively reference men)

#: p < .10; *: p < .05; **: p < .01; ***: p < .005

The correlation analyses show that economically more developed countries and former
French colonies tend to belong to the group of countries referencing both men and women
or "persons" in the naming of SSSIR, while former British colonies are more likely
grouped among countries exclusively referencing men in naming such relationships, as
indicated by the statistically significant coefficients (at the .05 level) of -.322, -.177, and
.401, respectively. Using the more generous criterion of the .10 level, one can also conclude
that countries with a higher percentage of Protestant population and countries with a higher
percentage of married women using contraceptives tend to be countries which name SSSIR
of both men and women or without gender specification.
Embedded in these political, social and economic structures might also be differences in
the gender system, which might underlie the observed differences in the particular gender
group(s) referenced. For example, it might be the level of rigidity and inequality in gender
categories that divides countries varying in economic development and differing in the
naming of SSSIR. In the next section, I directly examine how the pattern of the
naming of SSSIR is related to the gender system.

5. Logistic regression analysis of the effect of gender system on
the genderedness in the legal naming of SSSIR

To explore how the cross-national pattern of the genderedness in the naming of SSSIR is
linked to the gender system, I undertake a logistic regression analysis with the genderedness
in the legal naming of same-sex sexual/intimate relationships as the dependent variable and
three measures of the level of gender inequality and the rigidity of gender categories as
explanatory variables, controlling for the socio-politico-economic variables I discussed
above. The rigidity of gender categories and the level of gender inequality are indicated by
gender segregation in occupation and education, the ratio of women-to-men in labor
force participation rate, and the legal rights of women. I expect the following cross-
national pattern: countries with a more rigid and more unequal gender system will tend to
name only men's SSSIR, rather than those of both men and women or of gender-unspecified
persons. Table 3 shows the results of the logistic regression analysis.
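
For illustration, a model of this form could be fitted as follows with statsmodels; the data frame and the random placeholder values are assumptions, and the column names simply follow the variable labels of Tables 2 and 3.

    # Illustrative sketch of the Table 3 model: a logit of MENREF on the
    # gender-system measures plus the socio-politico-economic controls.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    cols = ["SEGREG", "FMLFP", "EQLAW",              # explanatory variables
            "DEMOC", "COMM", "LEGAL", "LGNPCP",      # controls
            "COLBRI", "COLFRN", "PCTCATH", "PCTPROT",
            "PCTISLM", "CONTRA", "FERTILE"]
    df = pd.DataFrame(rng.random((72, len(cols))), columns=cols)  # placeholders
    df["MENREF"] = rng.integers(0, 2, 72)            # 1 = men referenced exclusively

    fit = sm.Logit(df["MENREF"], sm.add_constant(df[cols])).fit(disp=0)
    odds = np.exp(fit.params)      # exp(b), the odds-ratio column of Table 3
    delta_p = fit.params / 4       # change in P at P = .5, i.e. b * P * (1 - P)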
Among the three indicators of the rigidity of gender categories and gender inequality, the
statistically significant logit coefficient of one measure of gender segregation is consistent
with the expected pattern. The observed effect of this variable indicates that rigidity and
inequality in the gender system are related to the differentiation among countries in terms
of the genderedness in the naming. More specifically, societies where men and women are
more segregated (unequally evaluated and treated) in occupation and education tend to
reference men exclusively in the naming of SSSIR, while societies where men and women
are less segregated tend to either explicitly reference both men and women or do not
reference either gender. The level of gender inequality and the rigidity of gender categories
as measured by women's participation in the labor force (relative to men's) and women's
legal rights do not seem to differentiate the countries in whether they reference men
exclusively or reference both men and women or gender-unspecified persons in the
naming of SSSIR.
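Readers who wish to reproduce this kind of analysis can do so with standard software; the following is a minimal sketch in Python using statsmodels, assuming a country-level data set with the variable names of Table 3. The file name "sssir.csv" and the data layout are hypothetical, not the author's actual materials.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical country-level file with one row per country and the
# variables named as in Table 3 (MENREF: 1 = men only, 0 = both/"persons").
df = pd.read_csv("sssir.csv")

explanatory = ["SEGREG", "FMLFP", "EQLAW"]
controls = ["DEMOC", "COMM", "LEGAL", "LGNPCP", "COLBRI", "COLFRN",
            "PCTCATH", "PCTPROT", "PCTISLM", "CONTRA", "FERTILE"]

X = sm.add_constant(df[explanatory + controls])
model = sm.Logit(df["MENREF"], X).fit()
print(model.summary())

odds_ratios = np.exp(model.params)  # reproduces the exp(b) column of Table 3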

Table 3: Logistic Regression Analysis of the Effect of Gender System on the Genderedness in the Naming of SSSIR (Dependent Variable: MENREF: 1 = Men only; 0 = Men and Women and/or "Persons")
                                                   Logit Coefficient(i)  exp(b)(ii)  ΔP(iii)
                                                   (std. error)                      (P=.5)
EXPLANATORY VARIABLES
SEGREG:  gender segregation in occupation
         and education                              1.98 (1.07)*           7.27      .495
FMLFP:   ratio of women-to-men in labor force
         participation rate                         3.08 (4.72)           21.7       .770
EQLAW:   legal equality of women                   -1.43 (1.24)            .238      .358
CONTROL VARIABLES (iv)
DEMOC:   democracy/non-democracy                   -2.30 (.928)**          .100      .575
COMM:    communist/non-communist                   -1.50 (1.15)            .224      .375
LEGAL:   # of institutional areas under
         jurisdiction                                .021 (.024)          1.02       .005
LGNPCP:  natural log of GNP per capita              -.555 (.754)           .574      .139
COLBRI:  former British colony                     -1.66 (.820)*           .190      .415
COLFRN:  former French colony                       4.95 (24.2)          140.8      1.23
PCTCATH: % Catholic population                      -.032 (.019)#          .969      .008
PCTPROT: % Protestant population                    -.084 (.047)#          .920      .021
PCTISLM: % Muslim population                        -.030 (.027)           .970      .008
CONTRA:  % of married women using contraceptives     .085 (.076)          1.09       .021
FERTILE: total fertility rate (v)                    .154 (.079)*         1.17       .038
-2 x Log likelihood = 33.6 (p = .994)
Model X2 = 42.7 (p < .001)
Goodness of Fit = 95.0 (p < .001)    N = 72
#: p < .10; *: p < .05; **: p < .01; ***: p < .005
(i): log[P/(1-P)]; (ii): P/(1-P); (iii): the change in probability of referencing men only (ΔP)
at the level where the effects of Xb are maximized (where P=.5); (iv): correlation analyses
(results not shown here) among the explanatory variables and control variables have shown
that the variables are not highly correlated with one another (< .400), with the exceptions of
those among CONTRA, LGNPCP and FERTILE (> .700). It is possible that the high correlation
among these three variables has affected the overall results of the logistic regression;
(v): used as a "control variable" for CONTRA, intended to indirectly indicate the extent to
which sex and procreation are linked.
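The exp(b) and ΔP columns follow mechanically from the logit coefficients; a one-line derivation (standard logistic-regression algebra, supplied here for clarity) is

$$P = \frac{1}{1 + e^{-Xb}}, \qquad \frac{\partial P}{\partial x_k} = b_k\,P(1-P), \qquad \Delta P\Big|_{P=.5} \approx \frac{b_k}{4}.$$

For example, DEMOC has b = -2.30, so exp(-2.30) ≈ .100 and |ΔP| = 2.30/4 = .575, matching Table 3; likewise SEGREG yields 1.98/4 = .495.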

Apart from the link of the gender system to the cross-national patterns of the genderedness
of the naming of SSSIR, statistically significant coefficients of some socio-politico-
economic factors show the following patterns: (a) democracy increases the likelihood of
naming SSSIR for both or unspecified genders; (b) countries with a history of British
colonization tend to name both men's and women's SSSIR, while countries not colonized
by Britain tend to name only men's relationships; (c) the higher the total fertility rate, the
more likely SSSIR are named exclusively for men; and (d) the higher the proportion of
Catholic and Protestant population, the less likely are men exclusively referenced.
The observed effect of democracy on gender reference in the naming of same-sex
sexual/intimate relationships is consistent with conventional wisdom. If one grants that
democratic countries are more "inclusive," then it follows that democratic countries either
reference both men and women or do not make any distinctions by gender in the naming of
SSSIR. If Britain has imposed its law which exclusively references men onto its colonies,
674

and that these colonies have kept the law in its original form in the 1980's, then one would
expect a history of British colonization (COLBR!) to have a positive effect on the
genderedness in naming (Dynes 1990). However, the effect observed here is negative.
The key to understanding the unexpected effect might lie in understanding the post-colonial
developments in former British colonies. One can conjecture that Britain might have
imposed its legal and cultural paradigm on its colonies which, upon independence, also
show a greater tendency to reverse these imposed legal and cultural norms. Testing these
ideas requires a separate analysis. For the present purpose, it is sufficient to note that
presence/absence of British colonization divides countries in how they name SSSIR.
The positive effect of total fertility rate means that countries with a higher fertility rate name
only men's SSSIR, while the countries with a lower fertility rate name SSSIR of both men
and women or gender-unspecified persons. The observed effect here might support my
argument linking a higher level of gender inequality and a rigidity of gender categories to a
stronger tendency to name only men's SSSIR. It is generally the case that the total fertility
rate is higher in societies with more rigid and unequal gender categories, which in turn
increases the likelihood of the exclusive reference to men in the naming of SSSIR.
The negative effect of the percentage of population affiliated with Catholicism and
Protestantism indicates that the higher the proportion of Catholics and Protestants in the
population, the less likely that same-sex sexual/intimate relationships are named
exclusively for men. Given that the formal teachings of Judeo-Christianity tend to address
men exclusively, this finding is perhaps surprising.

6. Conclusion
In this paper, I focused on the genderedness in the naming of same-sex sexual/intimate
relationships, differentiating between countries that exclusively reference men in naming
and those which reference both genders or gender-unspecified "persons". I argued that the
gender system is one of the important factors differentiating the two types of the naming of
SSSIR: men are referenced exclusively in the naming of same-sex sexual/intimate
relationships in societies with rigid and unequal gender categories. I reasoned that
societies with rigid gender categories conceptualize and treat men and women as two
distinctive groups, resulting in differences in the naming of men's and women's same-sex
sexual/intimate relationships. Similarly, societies with a higher level of gender inequality
by definition privilege men over women and grant the former more visibility and
prominence. It is therefore expected that men tend to be referenced exclusively in the
naming of same-sex sexual/intimate relationships in these types of societies.
After presenting a coding scheme of the legal naming of same-sex sexual/intimate
relationships, I discussed the results of correlation analysis of the genderedness of naming
and various socio-politico-economic factors, which showed that countries which are
former French colonies, Protestant, and economically more developed belong mostly to
the group that references both men and women or persons in the naming of SSSIR. In
contrast, countries which are former British colonies belong largely to the group
referencing only men in the naming of SSSIR. Furthermore, I undertook a logistic
regression analysis to examine the linkage between gender references in naming and the
rigidity of gender categories and the level of gender inequality, controlling for the effects
of socio-politico-economic factors. The anticipated pattern of the genderedness in the
naming of SSSIR in relation to the rigidity and inequality of gender categories was borne
out by the finding that the more rigid the gender categories are, the stronger the tendency to
exclusively reference men, rather than both genders or gender-unspecified "persons," in
naming SSSIR. More importantly, the analyses presented here affirm that aspects of the

gender system are important in differentiating among societies which name SSSIR
differently.

References:
Connell, R. W. (1987): Gender and Power. Stanford University Press, Stanford.
D'Emilio, J. and Freedman, E. B. (1988): Intimate Matters: A History of Sexuality in
America. Harper and Row, New York.
Duberman, M. B. et al. (eds.) (1980): Hidden from History: Reclaiming the Gay and
Lesbian Past. New American Library, New York.
Dynes, W. R. (ed.) (1990): Encyclopedia of Homosexuality. Garland, New York and
London.
Kamano, S. (1995): Same-Sex Sexual/Intimate Relationships: A Cross-National Analysis
of the Interlinkages among Naming, the Gender System, and Gay and Lesbian Resistance
Activities. Ph.D. Dissertation, Stanford University.
Tielman, R. and de Jonge, T. (1988): A worldwide inventory of the legal and social
situation of lesbians and gay men. In: ILGA Pink Book: A Global View of Lesbian and
Gay Liberation and Oppression, Tielman, R. and van der Veen, E. (eds.), 183-242,
Utrecht University, Utrecht.
A Constrained Clusterwise Regression
Procedure for Benefit Segmentation
Daniel Baier
Institute of Decision Theory and Operations Research
University of Karlsruhe; Post Box 6980;
76128 Karlsruhe; Germany

Summary: A new procedure for benefit segmentation using clusterwise regression is pre-
sented. Constraints on the model parameters ensure that the derived benefit segments can
be easily attached to single competing products under consideration. The new procedure is
compared to other one-stage and two-stage procedures for benefit segmentation using data
from the European air freight market.

1. Introduction
Conjoint analysis is the label attached to a popular research tool for measuring buy-
ers' tradeoffs among competing multiattributed products (see, e.g., Green, Srinivasan
(1990) for a review): First, respondents are asked for preferential judgments w.r.t.
a set of attribute-level-combinations (stimuli) which serve as (hypothetical) product
descriptions. Then, the observed response data are analyzed using regressionlike
estimation procedures at the individual level. The resulting so-called part-worths
(estimated preferences for at tri bute-levels) are later used to predict responses W.r. t.
a set of competing products and-assuming, e.g., that each individual chooses the
product with the highest predicted preference-shares of choices or market shares.
A research purpose served by many commercial applications of conjoint analysis is
the identification and understanding of so-called benefit segments (see, e.g., Green.
Krieger (1991), Wittink, Vriens, Burhenne (1994)): The observed response data are
used in order to identify groups of buyers having similar preferences. For this pur-
pose, various procedures have been proposed and successfully applied during the last
years. Some of these procedures are so-called two-stage procedures where the indi-
vidual part-worth estimates are used in an unrelated secondary stage as an input
for clustering techniques. Others are so-called one-stage procedures (see, e.g., Ka-
makura (1988), DeSarbo, Oliver, Rangaswamy (1989), Wedel, Kistemaker (1989),
Wedel, Steenkamp (1989), (1991), DeSarbo, Wedel, Vriens, Ramaswamy (1992), and
Baier, Gaul (1995) for sample procedures or Wedel, DeSarbo (1994) for a review)
where segment-specific part-worth functions and segment-membership indicators are
simultaneously estimated using generalizations of well-known cluster wise regression
procedures (see, e.g., Bock (1969), Spath (1983), DeSarbo, Cron (1988)). In both
cases, the resulting parameter estimates are then used to predict preferences and
choices at the segment level W.r.t. a set of competing products.
However, a major shortcoming of these procedures consists in the (missing) link
between the computational derivation of segment-specific model parameters and the
prediction of choices for the competing products: No guarantee is provided that each
competing product is chosen by at least one benefit segment, a fact that-when ana-
lyzing real markets-could lead to the (surprising) situation that established products
in these markets are predicted to have no buyers. For this reason, a new one-stage
procedure for benefit segmentation is proposed where constraints are implemented
which ensure that each competing product is selected by at least one segment.


2. A constrained clusterwise regression procedure


2.1 Model formulation
Let $i$ be an index for $N$ respondents, $t$ an index for $T$ segments or homogeneous
groups of respondents, $j$ an index for $n$ stimuli, $\bar{j}$ an index for $\bar{n}$ ($\bar{n} \le T$) competing
products, $v$ an index for $V$ attributes, and $w$ an index for $W_v$ levels of attribute $v$.
The data are (binary) profile data $B_{111},\ldots,B_{nVW_V}$ for the stimuli and $E_{111},\ldots,E_{\bar{n}VW_V}$
for the competing products (where $B_{jvw}$ and $E_{\bar{j}vw}$ indicate whether stimulus $j$ resp.
product $\bar{j}$ has level $w$ of attribute $v$ (=1) or not (=0)), and response data $y_{11},\ldots,y_{nN}$
(where $y_{ji}$ describes the observed preference value for stimulus $j$ obtained from
respondent $i$).
The model parameters are the segment-membership indicators $h_{11},\ldots,h_{TN}$ (where
$h_{ti}$ denotes whether respondent $i$ belongs to segment $t$ (=1) or not (=0)) and segment-
specific part-worths $u_{111},\ldots,u_{TVW_V}$. With the help of the segment-specific response
estimates for the stimuli and the products

$$\hat{u}_{jt} = \sum_{v=1}^{V}\sum_{w=1}^{W_v} B_{jvw}\,u_{tvw} \;\;\forall j,t \quad\text{and}\quad \bar{u}_{\bar{j}t} = \sum_{v=1}^{V}\sum_{w=1}^{W_v} E_{\bar{j}vw}\,u_{tvw} \;\;\forall \bar{j},t, \qquad (1)$$

the least squares loss function

$$Z = \sum_{i=1}^{N}\sum_{j=1}^{n}\Big(y_{ji} - \sum_{t=1}^{T} h_{ti}\,\hat{u}_{jt}\Big)^{2} \qquad (2)$$

with

$$\sum_{t=1}^{T} h_{ti} = 1 \;\;\forall i, \qquad h_{ti}\in\{0,1\} \;\;\forall t,i, \qquad \bar{u}_{\bar{j}\bar{j}} \ge \bar{u}_{\bar{j}'\bar{j}} \;\;\forall \bar{j},\bar{j}' \qquad (3)$$

is minimized. The constraints in formula (3) ensure that segment $\bar{j}$ chooses competing
product $\bar{j}$, resp. that each competing product is selected by at least one segment
(note that for obvious reasons index $\bar{j}$ is also used for segments), and that the
segmentation scheme is non-overlapping.
It should be mentioned that with $\bar{n}=0$ the standard versions of Wedel, Kistemaker's
(1989) as well as Baier, Gaul's (1995) (non-overlapping) clusterwise regression
procedures are contained in the model formulation. (In this case the inequality constraints
in formula (3) are omitted.)
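To make the formulation concrete, the following is a minimal NumPy sketch of the quantities in (1)-(3); the function names and array layout are illustrative, not the author's implementation.

import numpy as np

def response_estimates(B, u):
    """(1): u-hat_{jt} = sum_{v,w} B_{jvw} u_{tvw}. B is the (n x L)
    dummy-coded stimulus design, u the (L x T) part-worth matrix."""
    return B @ u                                    # (n x T)

def loss(Y, B, u, H):
    """(2): least squares loss. Y is (n x N), H is (T x N) with h_{ti} in {0,1}."""
    return np.sum((Y - response_estimates(B, u) @ H) ** 2)

def constraints_ok(E, u):
    """Inequality part of (3): segment j-bar must prefer product j-bar, i.e.
    in the (n-bar x T) matrix of product estimates, the diagonal entry of
    column j-bar is the column maximum over all products; this forces every
    competing product to be chosen by at least one segment."""
    U_bar = E @ u                                   # (n-bar x T)
    return all(U_bar[j, j] >= U_bar[:, j].max() for j in range(E.shape[0]))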

2.2 Algorithm
For the estimation of the model parameters, an exchange algorithm is proposed as given
in Tab. 1: In the initialization phase we start with a segmentation matrix $H$,
segment-specific response estimates for the stimuli $\hat{U}$, and segment-specific response
estimates for the competing products $\bar{U}$ so that the constraints in formula (3) are
fulfilled. Additionally, the initial loss function value is computed. In the iteration
phase we repeatedly test whether an exchange of a respondent from one segment
to another improves the loss function without violating the constraints in formula
(3). If so, a new loss function value is calculated. If not, the exchange is cancelled
(using the variables $\bar{h}_1,\ldots,\bar{h}_T$ for restoration). In the final phase segment-specific
part-worth estimates are computed.

{Initialization phase:}
Set $s := 0$; choose a segmentation matrix $H^{(0)}$ fulfilling the constraints in formula (3)
together with the response estimates $\hat{U}^{(0)} := BC$ and $\bar{U}^{(0)} := \bar{B}C$ (with $C$ from (5)),
and compute $Z^{(0)} := \sum_{i=1}^{N}\sum_{j=1}^{n}\big(y_{ji} - \sum_{t=1}^{T} h^{(0)}_{ti}\,\hat{u}^{(0)}_{jt}\big)^{2}$.

{Iteration phase:}
Repeat Set $s := s+1$, $H^{(s)} := H^{(s-1)}$, $Z^{(s)} := Z^{(s-1)}$.
  For $i := 1$ to $N$ do
    For $t := 1$ to $T$ do
      Begin Set $\bar{h}_{t'} := h^{(s)}_{t'i}\ \forall t'$, $h^{(s)}_{t'i} := 0\ \forall t' \ne t$, $h^{(s)}_{ti} := 1$.
            Set $C := (B'B)^{-1}B'YH^{(s)\prime}(H^{(s)}H^{(s)\prime})^{-1}$, $\hat{U}^{(s)} := BC$, $\bar{U}^{(s)} := \bar{B}C$.
            If (($\bar{u}^{(s)}_{\bar{j}\bar{j}} \ge \bar{u}^{(s)}_{\bar{j}'\bar{j}}\ \forall \bar{j},\bar{j}'$) and
                ($\sum_{i=1}^{N}\sum_{j=1}^{n}\big(y_{ji} - \sum_{t=1}^{T} h^{(s)}_{ti}\,\hat{u}^{(s)}_{jt}\big)^{2} < Z^{(s)}$))
            Then Set $Z^{(s)} := \sum_{i=1}^{N}\sum_{j=1}^{n}\big(y_{ji} - \sum_{t=1}^{T} h^{(s)}_{ti}\,\hat{u}^{(s)}_{jt}\big)^{2}$
            Else Set $h^{(s)}_{t'i} := \bar{h}_{t'}\ \forall t'$.
      End.
Until $Z^{(s-1)} - Z^{(s)} < \epsilon$.

{Final phase:}
Set $u_{tvw} := c_{lt}$, with $l$ the column of $B$ assigned to level $w$ of attribute $v$
according to (4), if $w \ne W_v$, and $u_{tvw} := 0$ else, $\forall t,v,w$.

Tab. 1: An algorithm for constrained clusterwise regression

Note that in the algorithm for constrained clusterwise regression $B$ is the dummy-
coded design matrix for the stimuli with elements

$$B_{jl} = \begin{cases} 1, & \text{if } l = 1 \\ B_{j v_a w_a}, & \text{else,} \end{cases} \qquad
\begin{aligned} v_a &= \max\{v \mid l > 1 + (W_1{-}1) + \cdots + (W_{v-1}{-}1)\}, \\
w_a &= l - \big(1 + (W_1{-}1) + \cdots + (W_{v_a - 1}{-}1)\big), \end{aligned} \qquad (4)$$

and that $\bar{B}$ is defined in the same way as the dummy-coded design matrix for the
competing products. The algorithm uses some computational simplifications con-
cerning parameter estimation when the segmentation matrix $H$ with elements $h_{ti}$
is additionally known: In this case, we get segment-specific response estimates for
the stimuli $\hat{U} = BC$ with elements $\hat{u}_{jt}$ and for the competing products $\bar{U} = \bar{B}C$ with
elements $\bar{u}_{\bar{j}t}$ by using

$$C = (B'B)^{-1}B'Y\,\underbrace{H'(HH')^{-1}}_{=:\,G} \qquad (5)$$

where the elements of matrix $G$ can be easily computed via

$$g_{it} = \begin{cases} 1/N_t, & \text{if } h_{ti} = 1 \\ 0, & \text{else,} \end{cases} \;\;\forall t,i, \qquad\text{with } N_t = \sum_{i=1}^{N} h_{ti} \;\;\forall t. \qquad (6)$$
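Putting (4)-(6) together with Tab. 1, one possible transcription of the exchange algorithm is sketched below in Python/NumPy, under the assumptions that B has full column rank and that no segment becomes empty during a sweep; function names are illustrative, not the author's code.

import numpy as np

def constrained_clusterwise_regression(Y, B, E, H, eps=1e-8):
    """Y: (n x N) responses, B: (n x L) and E: (n-bar x L) dummy-coded
    designs of (4), H: (T x N) starting partition fulfilling (3)."""
    n_bar = E.shape[0]
    Bplus = np.linalg.pinv(B)                  # (B'B)^{-1} B' for full-rank B

    def estimate(Hc):
        G = Hc.T @ np.linalg.pinv(Hc @ Hc.T)   # G = H'(HH')^{-1}, cf. (5), (6)
        C = Bplus @ Y @ G                      # segment-specific coefficients
        return C, B @ C, E @ C                 # C, U-hat (n x T), U-bar

    def feasible(U_bar):                       # inequality part of (3)
        return all(U_bar[j, j] >= U_bar[:, j].max() for j in range(n_bar))

    C, U_hat, _ = estimate(H)
    Z = np.sum((Y - U_hat @ H) ** 2)
    improved = True
    while improved:                            # one 'Repeat ... Until' loop
        improved = False
        for i in range(H.shape[1]):            # respondents
            for t in range(H.shape[0]):        # candidate segments
                h_old = H[:, i].copy()
                H[:, i] = 0
                H[t, i] = 1                    # tentative exchange
                C_new, U_hat_new, U_bar_new = estimate(H)
                Z_new = np.sum((Y - U_hat_new @ H) ** 2)
                if feasible(U_bar_new) and Z_new < Z - eps:
                    C, Z = C_new, Z_new        # keep the exchange
                    improved = True
                else:
                    H[:, i] = h_old            # cancel the exchange
    return H, C, Z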

3. Comparisons
The new constrained clusterwise regression procedure (in the following referred to
as CCR) was empirically compared to other one-stage and two-stage procedures for
benefit segmentation. For the comparisons, data from the European air freight market
were used (see Baier, Gaul (1995) for a more detailed discussion of the data) which
describe preferences of 150 respondents w.r.t. 18 hypothetical product descriptions for
an over-night parcel service with house-to-airport delivery and European destination,
collected from those responsible for parcel delivery at German companies with more than
2.5 air freight parcels per month within Europe. The reduced design of the 18 stimuli
w.r.t. the attributes 'collection time', 'delivery time', 'transport control', 'agency
type', and 'price' (for a 10kg parcel) and descriptions of six competing products
selected according to Baier, Gaul (1995) are given in Tab. 2 and 3.

             collection  delivery  transport
             time        time      control   agency type        price
stimulus 1   16:30       10:30     active    airline company    160DM
stimulus 2   16:30       10:30     passive   airline company    200DM
stimulus 3   16:30       13:30     active    integrator         200DM
stimulus 4   16:30       13:30     passive   integrator         240DM
stimulus 5   16:30       12:00     active    forwarding agency  160DM
stimulus 6   16:30       12:00     active    forwarding agency  240DM
stimulus 7   17:30       13:30     active    airline company    160DM
stimulus 8   17:30       13:30     active    airline company    240DM
stimulus 9   17:30       12:00     passive   integrator         160DM
stimulus 10  17:30       12:00     active    integrator         200DM
stimulus 11  17:30       10:30     active    forwarding agency  200DM
stimulus 12  17:30       10:30     passive   forwarding agency  240DM
stimulus 13  18:30       12:00     passive   airline company    200DM
stimulus 14  18:30       12:00     active    airline company    240DM
stimulus 15  18:30       10:30     active    integrator         160DM
stimulus 16  18:30       10:30     active    integrator         240DM
stimulus 17  18:30       13:30     passive   forwarding agency  160DM
stimulus 18  18:30       13:30     active    forwarding agency  200DM

Tab. 2: 18 stimuli for data collection in the European air freight market

             collection  delivery  transport
             time        time      control   agency type        price
product A    17:30       13:30     passive   integrator         160DM
product B    17:30       13:30     active    integrator         240DM
product C    16:30       10:30     passive   integrator         200DM
product D    17:30       10:30     passive   forwarding agency  200DM
product E    16:30       12:00     passive   integrator         160DM
product F    18:30       13:30     active    integrator         200DM

Tab. 3: Six products for simulations in the European air freight market

Besides the new procedure, two-stage procedures using Ward- or k-Means-clustering
(referred to as Ward or k-Means) as well as one-stage procedures using Kamakura's
(1988) hierarchical clusterwise regression (HCR), Wedel, Kistemaker's (1989) cluster-
wise regression (CR) with Baier, Gaul's (1995) iterative minimum-distance algorithm,
Wedel, Steenkamp's (1989) fuzzy clusterwise regression (FCR), or DeSarbo, Wedel,
Vriens, Ramaswamy's (1992) latent class metric conjoint analysis (LCA) were applied
to the conjoint data in order to derive 1- to 8-segment solutions. For CCR, competing
products were selected in the fixed order from Tab. 3.
Afterwards, the observed preference values for the stimuli at the individual level were
compared to the respective segment-specific response estimates using different fit
measures like, e.g., DeSarbo et al.'s (1992) variance accounted for (VAF)

$$\mathrm{VAF} = 1 - \frac{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{n}\Big(y_{ji} - \sum_{t=1}^{T} h_{ti}\,\hat{u}_{jt}\Big)^{2}}{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{n}\big(y_{ji} - \bar{y}\big)^{2}} \qquad\text{with}\quad \bar{y} = \sum_{i=1}^{N}\sum_{j=1}^{n} y_{ji}/(Nn) \qquad (7)$$

(also applicable to fuzzy clustering), Kamakura's (1988) predictive accuracy index
(PAI), Green, Helson's (1989) 1st-choice hits, average product moment correlations
(Prod.mom.corr.) and average mean sums of errors (MSE) as well as average values
for Kendall's τ. Tab. 4 and 5 show the results based on 50 random starting solutions.
(Underline denotes best performance.) T=6 was selected for further comparisons
since there are six competing products for simulations.
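As an illustration, formula (7) translates directly into code; a minimal sketch in NumPy, assuming crisp (non-fuzzy) memberships:

import numpy as np

def vaf(Y, U_hat, H):
    """Variance accounted for, formula (7). Y: (n x N) observed preferences,
    U_hat: (n x T) segment-specific response estimates, H: (T x N) memberships."""
    resid = np.sum((Y - U_hat @ H) ** 2)
    total = np.sum((Y - Y.mean()) ** 2)   # y-bar averages over all N*n values
    return 1.0 - resid / total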

T    Ward     k-Means  HCR      CR       FCR      LCA      CCR
1    0.2413   0.2413   0.2413   0.2413   0.2413   0.2412   0.2413
2    0.4641   0.4641   0.4616   0.4645   0.4752   0.4294   0.4577
3    0.5311   0.5299   0.5257   0.5329   0.5329   0.4856   0.5320
4    0.5800   0.5787   0.5831   0.5867   0.5755   0.5590   0.5867
5    0.6140   0.6154   0.6273   0.6323   0.6164   0.5607   0.6322
6    0.6541   0.6538   0.6585   0.6645   0.6337   0.6377   0.6353
7    0.6813   0.6821   0.6886   0.6923   0.6442   0.6528   0.6639
8    0.7095   0.7119   0.7106   0.7142   0.6539   0.6479   0.6964

Tab. 4: Summary of VAF-values for different numbers of segments T

                 Ward     k-Means  HCR      CR       FCR      LCA      CCR
Prod.mom.corr.   0.7992   0.7992   0.8054   0.8078   0.7865   0.7818   0.7883
MSE              6.2258   6.2321   6.1479   6.0388   6.5934   6.5213   6.5650
1st-choice hits  0.8333   0.8533   0.8333   0.7933   0.8400   0.8133   0.7867
Kendall's τ      0.6489   0.6437   0.6544   0.6565   0.6494   0.6315   0.6366
VAF              0.6541   0.6538   0.6585   0.6645   0.6337   0.6377   0.6353
PAI              0.3579   0.3582   0.3534   0.3472   0.4833   0.3856   0.3771

Tab. 5: Summary of values according to various fit measures (T=6)

Even a first glance at Tab. 4 and 5 reveals that the best one-stage procedure (CR)
performs better (with the exception of T=1 or T=2) than the other one-stage and two-
stage procedures. CCR, the new procedure, competes well in this context with only
minor deteriorations w.r.t. various fit measures. However, to be honest, one should
mention that this behavior heavily depends on the selected competing products in the
market. (If, e.g., some benefit segments are not satisfied by the available products
in the market, an application of CCR should lead to inferior classification results
compared to an application of CR.)

                                segm. 1  segm. 2  segm. 3  segm. 4  segm. 5  segm. 6
attribute   level               (14.7%)  (37.3%)  ( 8.7%)  (14.7%)  (16.7%)  ( 8.0%)
collection  16:30               0.123    0.768    0.000    0.273    0.000    0.022
time        17:30               0.050    0.359    0.529    0.424    0.018    0.054
            18:30               0.000    0.000    0.617    0.000    0.014    0.000
delivery    10:30               0.043    0.068    0.172    0.129    0.061    0.608
time        12:00               0.025    0.036    0.108    0.056    0.004    0.427
            13:30               0.000    0.000    0.000    0.000    0.000    0.000
transport   active              0.048    0.043    0.067    0.106    0.491    0.030
control     passive             0.000    0.000    0.000    0.000    0.000    0.000
agency      airline company     0.123    0.002    0.021    0.000    0.080    0.095
type        integrator          0.049    0.000    0.059    0.082    0.250    0.000
            forwarding agency   0.000    0.005    0.000    0.200    0.000    0.078
price       160DM               0.663    0.117    0.084    0.141    0.182    0.217
            200DM               0.313    0.059    0.021    0.047    0.085    0.115
            240DM               0.000    0.000    0.000    0.000    0.000    0.000
most preferred product          prod. E  prod. E  prod. F  prod. D  prod. F  prod. D

Tab. 6: Standardized segment-specific part-worth functions and most preferred products of the six-segment solution from CR (VAF=0.6645)

                                segm. 1  segm. 2  segm. 3  segm. 4  segm. 5  segm. 6
attribute   level               (15.3%)  (12.7%)  (29.3%)  (17.3%)  (14.7%)  (10.7%)
collection  16:30               0.093    0.050    0.776    0.089    0.611    0.000
time        17:30               0.126    0.061    0.367    0.196    0.357    0.385
            18:30               0.000    0.000    0.000    0.000    0.000    0.494
delivery    10:30               0.025    0.085    0.091    0.350    0.044    0.139
time        12:00               0.034    0.007    0.056    0.228    0.000    0.082
            13:30               0.000    0.000    0.000    0.000    0.015    0.000
transport   active              0.053    0.497    0.052    0.126    0.042    0.112
control     passive             0.000    0.000    0.000    0.000    0.000    0.000
agency      airline company     0.119    0.014    0.000    0.000    0.009    0.092
type        integrator          0.052    0.203    0.011    0.008    0.000    0.154
            forwarding agency   0.000    0.000    0.008    0.147    0.019    0.000
price       160DM               0.667    0.153    0.069    0.181    0.283    0.100
            200DM               0.298    0.060    0.037    0.091    0.129    0.047
            240DM               0.000    0.000    0.000    0.000    0.000    0.000
most preferred product          prod. A  prod. B  prod. C  prod. D  prod. E  prod. F

Tab. 7: Standardized segment-specific part-worth functions and most preferred products of the six-segment solution from CCR (VAF=0.6353)

In order to demonstrate the advantages of the new procedure, Tab. 6 and 7 show the
six-segment solutions with part-worth functions and most preferred competing products
from CR and CCR. (The attribute-levels of the most preferred products are
underlined.) The part-worth functions are standardized in the usual way: the
segment-specific part-worth estimates for the least preferred attribute-levels are
fixed to 0, and the response estimates for the combinations of the most preferred
attribute-levels (not necessarily the most preferred product) sum up to 1.
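This standardization is easy to reproduce; a short sketch, in which the raw part-worth values are hypothetical:

import numpy as np

def standardize(u_raw):
    """Standardize one segment's part-worths as described above: shift each
    attribute so its least preferred level is 0, then scale so that the sum
    of the per-attribute maxima (the best attribute-level combination) is 1."""
    shifted = [np.asarray(a, dtype=float) - np.min(a) for a in u_raw]
    scale = sum(a.max() for a in shifted)
    return [a / scale for a in shifted]

# Hypothetical raw part-worths for the five attributes of one segment:
u_raw = [[0.2, 0.1, 0.0], [0.1, 0.05, 0.0], [0.05, 0.0],
         [0.1, 0.02, 0.0], [0.6, 0.3, 0.0]]
u_std = standardize(u_raw)
importances = [a.max() for a in u_std]  # relative attribute importances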
Using this standardization, the (relative) importances of the single attributes are
given by the part-worth estimates for the most preferred attribute-levels: So, e.g., from
Tab. 6 we can see that segment 1 is highly price-sensitive (attribute 'price' accounts
for 66.3% of the overall preference), that ('16:30', '10:30', 'active', 'airline company',
'160DM') is its most preferred attribute-level-combination (with a segment-specific
response estimate of 1), that 'product E' with attribute-level-combination ('16:30',
'12:00', 'passive', 'integrator', '160DM') is its most preferred competing product (with
a response estimate of 0.860), and that 'product E' is also the most preferred product
of segment 2 where early collection times are most important. However, only the
competing products 'product D' to 'product F' can be attached to benefit segments.
The utilities for 'product A' to 'product C' are for all segments inferior to the utilities
for 'product D', 'product E', or 'product F', s.t. under the usual assumptions these
products are predicted to have no buyers.
This shortcoming is overcome in the six-segment solution from CCR as given in
Tab. 7: Here, segments with similar interpretations as in the CR solution can be
found: e.g., a highly price-sensitive cluster (segment 1) and two clusters focusing on
an early collection time (segments 3 and 5). Nevertheless, each product can be
attached to a segment. So, e.g., segment 1 now prefers 'product A' whereas 'product E'
is preferred by segment 5. (Note that segment 1 in the CR solution also had a fairly
high response estimate of 0.769 for 'product A'.)

CCR:                segm. 1  segm. 2  segm. 3  segm. 4  segm. 5  segm. 6
CR:                 (15.3%)  (12.7%)  (29.3%)  (17.3%)  (14.7%)  (10.7%)
segm. 1 (14.7%)     19       0        0        0        3        0        prod. E
segm. 2 (37.3%)     0        0        42       0        14       0        prod. E
segm. 3 ( 8.7%)     1        0        0        0        0        12       prod. F
segm. 4 (14.7%)     2        1        2        12       5        0        prod. D
segm. 5 (16.7%)     1        18       0        2        0        4        prod. F
segm. 6 ( 8.0%)     0        0        0        12       0        0        prod. D
                    prod. A  prod. B  prod. C  prod. D  prod. E  prod. F

Tab. 8: Cross-tabulation of segment-membership

The cross-tabulation in Tab. 8 shows that only few reallocations were necessary for
achieving this easier interpretation: Segment 1 of both solutions consists mainly of
the same respondents whereas segment 2 of the CR solution was divided up into the
two CCR-segments 3 and 5.

4. Conclusions and Outlook
A new procedure for benefit segmentation has been presented, where segment-specific
constraints ensure that competing products can be easily attached to specific segments,
a feature that simplifies interpretation. Comparisons so far show that the new
procedure also competes well with other one-stage or two-stage procedures w.r.t.
various fit measures. The proposed procedure can be easily modified in order to support
other research purposes. So, e.g., if the most preferred attribute-level-combinations
of additional ($t > \bar{n}$) benefit segments were treated as candidates for new promising
products, the procedure could be used for simultaneous product design and benefit
segmentation. Another possible extension could be the integration of two-mode
clustering in this context (see, e.g., the model formulations in Wedel, Steenkamp (1991)
or Baier, Gaul, Schader (1996)). In this case, the procedure could be used for
simultaneous market structuring and benefit segmentation.

References:

Baier, D., Gaul, W. (1995): Classification and Representation Using Conjoint Data. In:
From Data to Knowledge, Gaul, W., Pfeifer, D. (eds.), Springer, Berlin, 298-307.
Baier, D., Gaul, W., Schader, M. (1996): Two-Mode Overlapping Clustering With Ap-
plications to Simultaneous Benefit Segmentation and Market Structuring. To appear in:
Classification, Data Analysis and Knowledge Organization, Klar, R., Opitz, O. (eds.),
Springer, Berlin.
Bock, H. H. (1969): The Equivalence of Two Extremal Problems and its Application to the
Iterative Classification of Multivariate Data. In: Report on the Conference Medizinische
Statistik, Forschungsinstitut Oberwolfach.
DeSarbo, W. S., Cron, W. L. (1988): A Maximum Likelihood Methodology for Clusterwise
Regression, Journal of Classification, 5, 249-282.
DeSarbo, W. S., Oliver, R., Rangaswamy, A. (1989): A Simulated Annealing Methodology
for Clusterwise Linear Regression, Psychometrika, 54, 707-736.
DeSarbo, W. S., Wedel, M., Vriens, M., Ramaswamy, V. (1992): Latent Class Metric Con-
joint Analysis, Marketing Letters, 3, 273-288.
Green, P. E., Helson, K. (1989): Cross-Validation Assessment of Alternatives to Individual-
Level Conjoint Analysis: A Case Study, Journal of Marketing Research, 26, 346-350.
Green, P. E., Krieger, A. M. (1991): Segmenting Markets with Conjoint Analysis, Journal
of Marketing, 55, 20-31.
Green, P. E., Srinivasan, V. (1990): Conjoint Analysis in Marketing: New Developments
with Implications for Research and Practice, Journal of Marketing, 54, 3-15.
Kamakura, W. A. (1988): A Least Squares Procedure for Benefit Segmentation with Con-
joint Experiments, Journal of Marketing Research, 25, May, 157-167.
Späth, H. (1983): Cluster-Formation und -Analyse, Oldenbourg, München.
Wedel, M., DeSarbo, W. S. (1994): A Review of Recent Developments in Latent Class Re-
gression Models. In: Advanced Methods of Marketing Research, Bagozzi, R. P. (ed.), Basil
Blackwell, Cambridge, MA, 352-388.
Wedel, M., Kistemaker, C. (1989): Consumer Benefit Segmentation Using Clusterwise Li-
near Regression, International Journal of Research in Marketing, 6, 45-49.
Wedel, M., Steenkamp, J.-B. E. M. (1989): A Fuzzy Clusterwise Regression Approach to
Benefit Segmentation, International Journal of Research in Marketing, 6, 241-258.
Wedel, M., Steenkamp, J.-B. E. M. (1991): A Clusterwise Regression Method for Simulta-
neous Fuzzy Market Structuring and Benefit Segmentation, Journal of Marketing Research,
28, November, 385-396.
Wittink, D. R., Vriens, M., Burhenne, W. (1994): Commercial Use of Conjoint Analysis in
Europe: Results and Critical Reflections, International Journal of Research in Marketing,
11, 41-52.
Application of Classification and Related Methods
to SQC Renaissance in Toyota Motor

Kakuro Amasaka

TQM Promotion Div., Toyota Motor Corporation


1, Toyota-cho, Toyota, Aichi 471, Japan

Summary: To capture the true nature of making products, and in the belief that the
best personnel development is practical research to raise the technological level, we have been
engaged in SQC promotion activities under the banner "SQC Renaissance". The aim of
SQC promoted by Toyota is to take up the challenge of solving vital technological assign-
ments, and to conduct superior QCDS research by employing SQC in a scientific, recursive
manner. To this end, we must build Toyota's technical methods for conducting scientific
SQC as a key technology in all stages of the management process, from product planning
and development through to manufacturing and sales. Especially, multivariate analysis
resolves complex entanglements of cause and effect relationships for both quantitative and
qualitative data. A part of the application examples are reported below, focusing on cluster
analysis as a representative example.

1. Toyota's SQC Renaissance for Making the Most of Staff


These days, a review of the changes in the environment surrounding the manufactur-
ing industry indicates an ever greater necessity for corporate efforts to amplify and
capitalize upon the technical prowess of the young engineering staff who bear the main
burden of these times. To capture the true nature of products, and in the belief that
the best personnel development is practical research to raise the technological level, we
have gained a new awareness of Statistical Quality Control (SQC) as a behavioral
science. In recent years, the entire company has been engaged in SQC promotion
activities as shown in Fig. 1, under the banner "SQC Renaissance" (Amasaka 1993).
Specifically, the objectives are that all members including the engineering staff and
the management should seek to reach excellent solutions for technical problems and
that they should realize practical achievements by improving both the proprietary
technologies and management technologies through SQC practices. Another aspect
of the objective is to develop the SQC promotion cycle activities in which the SQC
practices result in practical and full development of SQC education that facilitates
effective development of human resources, which, in turn, will be reflected in the
performance of operations (Amasaka and Azuma 1991).

[Fig. 1 shows the company-wide SQC promotion cycle: practical effort, development, education, and growing human resources.]
Fig. 1: Schematic Drawing of Company-wide SQC Promotion Cycle Activities


If we are to make products that can satisfy our customers, we must work under the
most appropriate conditions for raising work quality and minimizing problems. If
staff keep a careful watch over their work and apply SQC properly, SQC can assist
them in remedying work processes and effectively raise the quality of work (Amasaka
and Yamada 1991).

2. SQC for Improving Technology


The aim of SQC promoted by Toyota is to take up the challenge of solving vital
technological assignments, and to conduct superior Quality, Cost and Delivery (QCD)
research by employing SQC scientifically, with the exhibition of insight and an inductive
problem-solving methodology in addition to engineers' deductive work methods.
This SQC goes beyond reactive technological assignments, to solve proactive tech-
nological ones that must be anticipated. This is not a matter of merely performing
analytically-oriented SQC in the form of statistical analysis, but is a scientific appli-
cation of SQC at all stages from problem construction and assignment setting through to
the achievement of objectives, and it entails the planning of surveys and experiments
to ascertain the desirable scenarios, and devises approaches for tackling problems.
When engineers and managers place value on logical thinking, they can resolve the
cause and effect relationships of the gap between theory and practice, and can ob-
tain new facts and knowledge for improving proprietary technology and managerial
techniques. Rather than ending up with one-off solutions and partial solutions, they
can create technology for general solutions, which in turn leads to improved product
quality (Amasaka 1995).
By operating this way, a wide variety of SQC practical reports should be utilized as
guidelines that can contribute to the building up of a wealth of engineering technologies
as well as support for handing down and developing engineering technologies (Kamio
and Amasaka 1992).
Based on this viewpoint, we propose a new Schematic Drawing of Scientific SQC in
which all departments conduct superior QCDS research at each step of the business
process, as shown in Fig. 2.
[Fig. 2 consists of two schematic panels: 2-1 "Business Process for Customer Science" (marketing for the customer, planning, designing; the business process as behavioral science, in order to deliver attractive products to the customer) and 2-2 "Scientific SQC for Improving Technology" (general solution of problems by scientific SQC, raising quality and technology, in order to create technology for improving product quality).]
Fig. 2: New Schematic Drawing of Scientific SQC to Conduct Superior QCDS Research

3. SQC Established as a Toyota Technical Methodology

In order to be able to provide customer-oriented, attractive products, it is important
that we implement customer science that deftly reflects the feelings and voices of
customers in the products we make. To this end, we must build Toyota's technical
methods for conducting scientific SQC as a key technology in all stages of the man-
agement process, from planning and design through to manufacturing and marketing,
as shown in Fig. 2.
3.1 SQC as a core of the technical methods
Using proprietary technologies and acquired knowledge, SQC resolves complex entan-
glements of cause and effect relationships for both quantitative and qualitative data.
Hence, it is a highly convenient method for technological analyses for improving pro-
prietary technology. The new seven tools for Total Quality Control (N7) and basic SQC
methods enable full support for designing the experimental process and the analyti-
cal process, thus making it possible to analyze technology rapidly and with error-free
thinking.
In addition, by capitalizing upon proprietary technology, the use of the multivariate
analysis method enables 70% to 80% of the distance to a solution for a problem
to be covered. Combined with SQC methods such as design of experiments,
the remaining distance can be covered effectively.
The detailed practical reports shown in the references (Amasaka et al. 1992) (Amasaka
et al. 1993) (Amasaka et al. 1994) (Amasaka et al. 1996a) (Amasaka and Sakai 1996)
(Kusune and Amasaka 1992) (Takaoka and Amasaka 1991), unravel complex entan-
glements of cause and effect relationships, by capitalizing on SQC methods such as
N7, multivariate analysis and design of experiment in effective combination with the
physical and scientific methodology. The use of SQC methods brings about out-
standing achievements on the job, using analysis of sources of variation and modeling
for prediction and control (Amasaka et al. 1993) (Amasaka et al. 1994) (Kusune
and Amasaka 1992) (Takaoka and Amasaka 1991), and concurrent application with
neural networks as the new technical method (Amasaka et al. 1996a). Moreover, IT
(Information Technology) needed for production control (Amasaka and Sakai 1996)
and the equipment diagnostic technology (Amasaka et al. 1992) are capitalized on as
scientific support.
These methodologies for technological problem-solving have been established as new
SQC methodology and Toyota's technical method for improving quality works done
by engineering staff and managers. Fig. 3 shows a conceptual diagram of new SQC
methodology.
[Fig. 3 depicts "Mountain-Climbing for Problem-Solving", the new methodology by SQC: structuring the problem and selecting the topic, then problem-solving (level 1) and problem-solving (level 2), supported by MA (multivariate analysis) and DE (design of experiments).]
Fig. 3: Schematic Drawing of Toyota's Technical Methods for Conducting Scientific SQC in Toyota

3.2 Using multivariate analysis as the core of scientific SQC practices

Upstream product technology departments are required to quickly develop advanced
design technology. Large amounts of data cannot be obtained from individual R&D
projects, but much of the collected trial and experimental data has excellent features in
terms of status and condition control. For this reason, even time-series data from
similar projects, dead and buried in the R&D archives, can be used for gaining scientific
insights into the cause and effect relationships through multivariate analysis that
makes use of proprietary technology. Thus we are able to build logical quantitative
models for analysis of sources of variation, prediction and control (Amasaka et al.
1996a) (Takaoka and Amasaka 1991).
Midstream manufacturing preparation departments are also required to rapidly de-
velop new production technology and control systems. Comparatively large amounts
of data can be obtained, although they stem from R&D, trials, and attempts using
mass-production facilities, and rather than being planned, they tend to be of the trial
and error variety, changing over time series.
The soundness of these data is not as clear as those of data from upstream depart-
ments, but they are data gathered on the spot by engineers, and are reliable in terms
of the status and conditions pertaining at the time of collection. Hence, they lend
themselves to the application of multivariate analysis using proprietary technology,
so that technological knowledge and new facts can be probed for use in raising the
technological level (Amasaka and Sakai 1996) (Kusune and Amasaka 1992).
In the downstream manufacturing departments, the important task is to improve
manufacturing technology for making stable, high-quality products. Although the
reliability of the collected data is not as clear as for the upstream and midstream
departments, a great deal of raw data can be obtained if they are collected in a
purposeful and planned fashion.
Then, by taking advantage of the manufacturing engineers' unique insights into the
actual status of the site and the product, multivariate analysis can be performed to
analyze sources of variation, to extract factors that may aid in quality improve-
ment, and to control processes at the optimum level, thus contributing to improve-
ment of process capability (Amasaka et al. 1992) (Amasaka et al. 1993) (Amasaka
et al. 1994).
In this way, multivariate analysis can be used for flexible analysis of technology even
with varying quantities of diverse data collected in the past. It has taken root not
only as an analytic method of SQC specialists, but also as Toyota's technical methods
for skillfully raising the work quality of engineers (Amasaka et al. 1996b) (Amasaka
et al. 1996c) (Amasaka and Kosugi 1991) (Amasaka and Maki 1991).
Recently, we have developed and provided our staff with user-friendly SQC analysis
software to be used on personal computers (Amasaka et al. 1995a) (Amasaka and Maki
1992). In all types of technological fields, multivariate analysis has become the core
of the SQC practices in the scientific use of the technical field, and practical
research is progressing.
A part of the application examples are reported below, focusing on cluster analysis
as a representative example.

4. Examples of Applying the Multivariate Analysis Method Focusing on Cluster Analysis
The research examples outlined are "Analysis of sources of variation in vehicle rust-
ing", belonging to both the product and production technology arenas, and "Latent
structure of engineers' attitudes to good inventions and patents", a theme imping-
ing on the technological development and control areas. All these research examples
started out with cluster analysis to unravel the complex technical assignments, and
show that the application of a combination of different multivariate analysis methods
has brought about the expected results.

4.1 "Latent structure of engineers' attitudes to goodness of inventions and


patents"
Acquiring" good patents" that enable intelligent properties in possession to improve
the corporate quality, has become more important measures for allowing permanent
business activities, Hence, good patents recognized by managers and technical staff
(here-in-after referred to as engineers), containing the contents of both invention and
right, are clarified in terms of quality and required to conduct researches aimed of
encouraging the acquisition for strong and wide. patents. Therefore, this paper se-
lects the representative examples that grasp the conceptual structure of the" good
patents" admitted by the engineers (Amasaka et al. 1996d) (Amasaka and Ihara
1996) (Ihara and Amasaka 1996).
4.1.1 Characteristic Classification of Common Images of "Good Patents"
and Grasping their Structural Concept
A good patent is said to be one whose invention and right lead to the benefit of one's own
company while allowing the company to maintain its main business and affecting others. The
patent information currently available gives only simple statistical values; qualitative
analysis probing into the conscious region of engineers as inventors has not been studied.
In this connection, a survey aimed at objectifying the subjective structural
concept of the "good patent" is employed by grasping the engineers' latent status-quo
recognition interpreted as language information. Regarding the survey's pro-
cess (Amasaka et al. 1995b), free opinions are collected from engineers in advance
about "what a good patent is, while considering the recent environmental situations."
Those collected opinions are grouped and arranged into key words by employing
an affinity chart method and a relation chart method in cooperation with the
special department for patent applications. Then the key words are largely classi-
fied into "content of inventive technique" and "content of patent right." The respective
contents are summarized into a format of 11 questions as outlined in the Table 1 sheet.
The survey by questionnaire, marking the Table 1 sheet with one of the given answers
for selection, is conducted on a total of 97 persons selected from among those with
experience of patent acquisition in seven departments from Product Technology: Re-
search "b", Design "c", Development "d", Technical Administration "a", and Production
Engineering "e", "f", and "g", as shown in Fig. 4.
Fig. 4-1 shows the grouping by shared recognition of the "good patent" through the
cluster analysis. The results are classified mainly into three major clusters ((1), (2),
and (3)). The figure shows the percentage of the total number of persons who answered
the questionnaire and those of each cluster for the seven departments. The figure indi-
cates that the percentage is dispersed by department. The questionnaire results
are then analyzed with factor analysis by probing into the commonly shared struc-
tural concept of the "good patent." Fig. 4-2 shows the result of the analysis obtained
from the scatter diagram of factor loadings. The varimax method allows the reading of the
strength of the right and the engineering development capability standing opposite
to each other on axis 1, and commercialization and profit-mindedness on axis 2. The
result of this analysis confirms that there exist in the seven departments four types of
engineers who attach importance to commercialization, profit-mindedness, the strength
of the right, or the engineering development capability. Analysis of the similar scatter
diagram for the factor scores reveals interesting new knowledge: each engineer
has his own structural concept of the "good patent."

And the summaries of these two analytical results allow us to interpret, as shown
in the lower section of Fig. 4-1, that group (1) is A: the realist group (that places
importance on practical inventions based on Toyota's engineering capability and on
rights of practical effect); group (2) is B: the advance group (that places importance
on advanced inventions superior to competitors or specific rights competitors are eager
to get); and group (3) is C: the futuristic group (that places importance on
prospective inventions at the construction stage and rights to lead competitors
internationally). For example, research section "b" mostly consists of the futuristic
group while Technical Administration "a" is largely composed of the realist group,
and so on for the rest of the sections, reflecting a logical result of the analysis. Regarding
these results, it is found that the respective departments are conscious of and in need of
well-balanced activities for the "good patent" as they understand what to expect from
and what role to play in executing their job. We have thus obtained valuable results
(of the analysis) which constitute the preparation for our future patent strategy.
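For readers who want to retrace the two analyses behind Fig. 4, a minimal sketch in Python is given below. The 97 x 11 matrix of questionnaire scores is randomly generated and merely stands in for the survey responses, and scipy/scikit-learn stand in for whatever software was actually used.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import FactorAnalysis

# Hypothetical 97 respondents x 11 questionnaire items (scale 1..6).
rng = np.random.default_rng(0)
X = rng.integers(1, 7, size=(97, 11)).astype(float)

# Fig. 4-1: hierarchical clustering of respondents, cut into three clusters.
Z = linkage(X, method="ward")
groups = fcluster(Z, t=3, criterion="maxclust")

# Fig. 4-2: factor analysis with varimax rotation; the loadings span the
# two axes interpreted in the text, the scores describe single engineers.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = fa.components_.T   # (11 items x 2 factors)
scores = fa.transform(X)      # factor scores per respondent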

[Table 1 reproduces an example questionnaire sheet: "Please answer the questionnaire for constructing patent strategy. These days, patent disputes with other companies are increasing. Under the circumstances, how do you think about 'What is a good patent?' Encircle the number." Assessment scale: 1: very important; 2: rather important; 3: important; 4: either; 5: a bit unnecessary; 6: unnecessary. Section [1] asks about the "inventive technique" (e.g., (1) practicality, (2) cost, (3) commercialization) and section [2] about the "patent right" (e.g., (13) large system, (14) many examples); other items include forward-looking and international.]
Table 1: Example: "What is a good patent?"

Fig. 4-1: Cluster Analysis    Fig. 4-2: Factor Analysis
Fig. 4: Patterning and common understanding of good patents

4.1.2 Grasping Linkage of Recognition between "Good Invention" and "Good Right"
As mentioned in the preceding section, a "good patent" has two sides: the invented
technique and the subsequent right. To determine characteristic key words for the respective
sides as recognized by the realist, advance, and futuristic groups classified as the
results of the cluster and factor analyses, and to obtain the degrees of mutual linkage among
them, canonical correlation analysis is conducted. Fig. 5 shows the result of the
analysis. It is found from the main component relation scatter diagram in the figure
that most of the engineers belonging to A: the realist group can be related to
profit-mindedness in terms of the content of the invented art and to effect-mindedness in
terms of the content of the right, by the key words that each group is required to
have. Likewise, the figure reads that B: the advance group and C: the futuristic
group are configured with pioneer-mindedness and innovation-mindedness, and
competition-mindedness and monopoly-mindedness, respectively. This result of analysis
can be judged logically from the viewpoint of empirical technique. We are able to
obtain a verification to the effect that such cause and effect relationships can be a
qualitative general evaluation index for a patent application by weighting the respective
key words as in a multiple regression equation. At present, this process is in the
middle of transfer to the implementation stage as a regular task.
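A canonical correlation analysis of this kind can be sketched as follows, again on hypothetical score matrices; scikit-learn's CCA stands in for the software actually used, and the matrix dimensions are illustrative.

import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical key-word score matrices for the 97 engineers:
# X: items on the "inventive technique" side, Y: items on the "patent right" side.
rng = np.random.default_rng(1)
X = rng.normal(size=(97, 6))
Y = rng.normal(size=(97, 5))

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)                 # canonical variates
r1 = np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]   # first canonical correlation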

[Fig. 5 relates the key words on the inventive-technique side to those on the patent-right side for the three groups A: Realist, B: Advance, and C: Future.]
Fig. 5: The Relation of Inventive Technique and Patent Right (canonical correlation analysis)

This case (1) shows a scientific application of SQC exactly as seen in Fig. 2,
upgrading the quality of the job at the business process stage in a proactive engineering
area. It is thus judged that the application effect of multivariate analysis as the core
of the Toyota Technical Method has been verified.
4.2 "Analysis of sources of variation for preventing vehicles' rusting"
One of the subjects for technological development of higher quality, longer life vehicles
is the quality assurance of anti-rusting of the body. From the engineering viewpoint of
anti-rusting of vehicle body, it involves consideration for structural design, adoption of
anti-rusting steel, adoption of local anti-rusting processing (such as the application
of wax, sealer, etc.), and paint design (conversion treatment and improvement of
electrodeposition coating, etc.). On the implementation stage of the research, these
anti-rusting measures are adopted singularly or in combination of multiple measures
as may be required by the construction of subject section or corrosion factors. In order

for us to proceed with advanced and timely QCDS study activities, it is necessary
to conduct variable factor analysis of the vehicle's anti-rusting and its optimization. In
this sense, SQC centering around multivariate analysis has much to contribute
as a scientific approach.
In this connection, this paper describes a characteristic case study (Amasaka et
al. 1995c).
4.2.1 Anti-rusting Methods and How to Outline their Characteristics
To establish a superior quality assurance system, it is important to set up a network
of quality assurance at the job processing stage so as to raise the reliability technology
of all the sectors including product planning, design, review, production engineer-
ing, process design, administration, inspection and so on. Such a quality assurance
activity under the cooperation of all the sectors has been established as Toyota's QA
network, where SQC plays an important role as the behavioral science for enhancing
quality performance.
Toyota has been incorporating various anti-rusting methods into various sections of the
vehicle. To outline the deployment of quality performance, the application of the matrix
diagram method is effective. For example, Table 2 outlines the complex correlationships
between the corrosive environmental factors of 72 sections of a vehicle, divided for the ease
of arrangement of anti-rusting measures, the manufacturing process factors, and
the respective anti-rusting methods adopted, in an anti-rusting QA network table.
This table is a summary of objective facts inductively grasped by engineers from
multiple engineering sectors, which are then arranged deductively from the subjective
point of view.
To make this table more effective, it is necessary to provide it with ease of visual
recognition so that it shows at a glance where Toyota stands in its present activities
for rust prevention quality assurance. Table 2 enables staff and engineers to make
engineering judgments more accurately, subsequently contributing much to the
strategic decision making on the part of the management.

[Table 2 is an excerpt of the anti-corrosion QA matrix: 72 vehicle parts (e.g., hemming lower part of door, lower part of luggage, lower part of fuel filler lid, other door sections, hood lock R/F, door lock R/F, door side protection bar, back door lock R/F, floor under R/F, exhaust pipe R/F) are crossed with 28 categories of corrosion factors and anti-rusting measures (e.g., anti-corrosion steel sheet, wax, PVC, chip-resistant coating, electrodeposited coating, adhesive, plug hole, sealer), with entries marking which measures apply to which part.]
Table 2: Anti-corrosion QA Matrix

[Fig. 6 plots the individual scores of the 72 parts and 28 categories obtained by quantification method type III; axis 2 separates the upper body from the under body.]
Fig. 6: Characteristics of Toyota's Anti-Corrosion Measures

To proceed with such an aim and make it much easier to understand the complex en-
tanglements of cause and effect relationships, they are summarized visually as
shown in Fig. 6 by using quantification method type III. From the scatter diagram of
vehicle sections (axis I x axis II), it is apparent that Toyota's anti-rusting measures
are taken mainly against the steel sheet joint portions and that local anti-rusting
processing and many other methods tend to be adopted as the measures reach the
underbody sections. The diagram indicates that a combination of several types of
anti-rusting methods is applied to the doors and underbody members positioned in
the upperbody sections.
Adoption of such an analytical method makes it easy to evaluate competitors' vehicles,
enabling reactive benchmarking for additional advantages. In addition,
insight into proprietary technologies with the knowledge acquired from the result of
this analysis enables us to grasp proactively our responses to future quality as-
surance, including the trends of competitors' anti-rusting measures and techniques
that reflect their thinking on target quality and the counter-marketability of anti-
rusting materials.
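Quantification method type III is closely related to correspondence analysis of a 0/1 incidence matrix; the following sketch via the singular value decomposition is an illustration under that reading, with a randomly generated parts-by-measures matrix standing in for the real QA matrix.

import numpy as np

def quantification_iii(F, k=2):
    """Scores for rows (vehicle parts) and columns (anti-rusting measures)
    of a 0/1 incidence matrix F, via the SVD of the standardized residuals,
    as in correspondence analysis; assumes no empty rows or columns."""
    P = F / F.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    row = U[:, :k] * d[:k] / np.sqrt(r)[:, None]     # part scores (axis I, II)
    col = Vt.T[:, :k] * d[:k] / np.sqrt(c)[:, None]  # measure scores
    return row, col

# Hypothetical 72 parts x 28 measure categories; the first column is set to 1
# as a guard so that every part has at least one entry.
F = (np.random.default_rng(2).random((72, 28)) < 0.3).astype(float)
F[:, 0] = 1.0
rows, cols = quantification_iii(F)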
4.2.2 Factorial Analysis Method for an Optimal Anti-Rusting of Vehicle
Bodies
This subsection takes up the door hemming sections, where the application of anti-rusting techniques meets particular difficulties. Spraying of snow-melting salts during winter to keep roads from freezing allows the infiltration of salt water, a corrosive factor, into the hemmed joint portions of the door outer panels and inner panels of rust-prevention steel sheet. To prevent this, wax and/or sealer are applied, or the joint portions of the door outer panels and inner panels are sealed with an adhesive agent for the dual purpose of adhesion and rust prevention.
To evaluate the performance of the various anti-rusting measures, we conduct a market monitor test using actual vehicles, an in-house accelerated corrosion test, and an accelerated corrosion test on the bench using test pieces and/or parts. For all of these tests, it is imperative to analyze the sources of variation in order to optimize the anti-rusting of the body.
The example in Table 3 outlines the test data showing the result of an analysis of the relationship between the (combined) rust-prevention specifications of a test piece and the depth of corrosion. A dendrogram, shown in Fig. 7, can be generated by subjecting these data to a cluster analysis. The dendrogram hierarchically outlines the degree of effect of the anti-rusting factors by grouping the experiment numbers of the corrosion test. Moreover, analysis using quantification method type I enables us to verify the quantitative degree of effect of the anti-rusting factors through the category scores and the size of the partial correlation coefficients. From this diagram, it is possible to grasp, from the engineering point of view, the effect of the anti-rusting steel sheets used in the door hemming sections and the validity of the quantitative effects of the local anti-rusting processes A, B, and C; combined with the analytical results of other testing methods using actual vehicles, this enables the optimization of vehicle-body rust prevention.
We have applied comprehensive technical insight to these analytical results. By adopting controllable factors effective for anti-rusting measures, in addition to environmental factors chosen in consideration of market conditions, we have been able to verify the factorial effects with the design of experiments. We have thus succeeded in realizing production with good QCDS performance through the optimization of vehicle-body rust prevention.
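As a rough illustration of the clustering step described above, the following sketch builds a dendrogram with Ward's method, which Fig. 7 names; the data are hypothetical stand-ins for the corrosion-test results, not the paper's values.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical data: one row per experiment number, columns = coded
# factor levels plus measured corrosion depths.
rng = np.random.default_rng(0)
X = rng.random((21, 6))                      # 21 specification combinations

Z = linkage(X, method="ward")                # Ward's criterion, as in Fig. 7
dendrogram(Z, labels=[f"run {i + 1}" for i in range(21)])
plt.ylabel("merge distance")
plt.show()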

Table 3: Experiment Data of Test Pieces. Factors and levels:
X1: type of steel (outer): 1 = steel sheet, 2 = anti-corrosion steel sheet level 1
X2: type of steel (inner): 1 = anti-corrosion steel sheet level 1, 2 = anti-corrosion steel sheet level 2
X3: spot anti-corrosion process A (1 = no, 2 = yes)
X4: spot anti-corrosion process B (1 = no, 2 = yes)
X5: spot anti-corrosion process C (1 = no, 2 = yes)
Responses: corrosion depth (outer, inner); 21 combinations in total. For example, the combination (X1, X2, X3, X4, X5) = (1, 1, 2, 2, 2) gave corrosion depths of 0.15 (outer) and 0 (inner).

Fig. 7: Cluster Analysis (dendrogram, Ward method). [Diagram not reproduced.]

Fig. 8: Influence of Anti-corrosion Measures by Quantification Method (type I). Values in parentheses are category scores; values in square brackets are partial correlation coefficients. [Diagram not reproduced.]
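Quantification method type I amounts, in effect, to least-squares regression on dummy-coded categorical factors. A minimal sketch with hypothetical data (not the values behind Fig. 8):

import numpy as np

# Hypothetical data: corrosion depth regressed on dummy-coded two-level
# factors X1..X5 (level 2 coded as 1, level 1 as 0), as in Table 3.
rng = np.random.default_rng(0)
levels = rng.integers(1, 3, size=(21, 5))            # 21 runs, levels 1 or 2
D = (levels == 2).astype(float)                      # dummy variables
depth = 0.10 - 0.04 * D[:, 0] + 0.02 * D[:, 2] + rng.normal(0, 0.01, 21)

A = np.column_stack([np.ones(len(D)), D])            # intercept + dummies
coef, *_ = np.linalg.lstsq(A, depth, rcond=None)
# coef[1:] play the role of the category scores shown in Fig. 8;
# their signs and sizes indicate each factor's quantitative effect.
print(dict(zip(["X1", "X2", "X3", "X4", "X5"], coef[1:].round(3))))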

We judge that this case, too, constitutes a new "mountain-climbing" methodology for problem solving that uses SQC effectively, applying multivariate analysis concurrently with N7 and the design of experiments. We believe this again verifies that multivariate analysis forms the core of the Toyota Technical Method for the scientific implementation of SQC, as observed in Figs. 2 and 3.

5. Conclusion
We have been able to verify the following from the two demonstrative case studies above. In combination with N7 and the design of experiments, the various multivariate analysis methods, starting with cluster analysis, have been established as the core of a technology-advancing SQC, rather than a mere statistical analysis that places unbalanced importance on analysis alone.
We believe we have verified that multivariate analysis offers great practical benefit as the core of the SQC methods, both in constructing the presently advocated Toyota Technical Method for the scientific implementation of SQC and in enhancing the effectiveness of so-called SQC.
In unraveling confounded situations and complex entanglements of cause-and-effect relationships, cluster analysis, as one of the multivariate analysis methods, enables engineers to clarify and organize them visually and logically. From this viewpoint, engineers' latent abilities to pursue problems and generate new ideas are enhanced and improved. We consider that this new technological approach offers great insight for unraveling and pursuing complex problem assignments appropriately. The author appreciates the valuable advice and comments from all those concerned.
Application of Statistical Binary Tree
Analysis to Quality Control
Atsushi Ootaki

Department of Precision Engineering
School of Science and Technology
Meiji University
Higashi-mita 1-chome, Tama-ku
Kawasaki 214-71, JAPAN
E-mail: [email protected]

Summary: The purpose of this paper is to show how statistical binary tree analysis can be applied in the field of QC. An analysis discriminating uncollectible credit loans in sales management is shown, using Classification And Regression Trees (CART).

1. Introduction
The quality of Japanese industrial products has been improved by modern quality control (QC), which was introduced into Japanese industries after World War II. In particular, the application of statistical techniques to QC (SQC) is one of the reasons this quality improvement has succeeded. QC has spread not only to manufacturing departments but also to almost all departments in a company, to improve the quality of products, services, and jobs as total quality control (TQC), which is known as a management tool of Japanese continuous quality improvement.
One of the basic concepts of SQC is how well we decompose the many causes affecting variations in quality into individual causes without confounding; in other words, how well the causes are classified. To do this, engineers engaged in quality planning, design, and improvement of products and services apply many kinds of statistical techniques, such as control charts, statistical tests and estimation, design of experiments, and multivariate statistical analysis. In particular, with the advance of computers during the recent decade, they frequently apply multivariate statistical techniques such as multiple regression analysis, discriminant analysis, and cluster analysis to specify and classify causes affecting the quality of products and services. They sometimes feel, however, that the results of such analyses do not always provide useful information for problem solving, since the statistical linear model differs from the model they form from their professional knowledge.
Statistical binary tree analysis, on the other hand, gives us a basic framework for stratification and/or classification. As shown in Fig. 1, the process is binary because parent nodes are always split into exactly two child nodes, and recursive because the process can be repeated by treating each child node as a parent. It resembles the naturally simple way humans think: an answer of "yes" or "no" to successive questions. Since anyone applying the methodology to a real problem can follow the successive causes until reaching a terminal node whose contents are sufficiently homogeneous, the result is easy to understand.


Fig. 1: Binary tree. [Diagram showing a root node, internal parent nodes, and terminal nodes of a recursively split binary tree.]

2. Data set used


This paper shows how to apply statistical binary tree analysis in the field of QC. To do this, I will show an example: reducing uncollectible credit loans, which is one of the quality characteristics of the sales management of construction machines. In the analysis, 47 variables, comprising 31 categorical variables and 16 numerical variables, are employed, as shown in Fig. 2. The data set has 448 cases, including 123 uncollectible credit loans and 325 collectible loans. The response variable is "(19) judge", the judgment of whether a case of credit loan is collectible or not.

3. Overview of the binary tree by CART and analysis of the process
To analyze the loan problem, I employed Classification And Regression Trees (CART), developed by Breiman, Friedman, Olshen and Stone (1984). The CART methodology is technically known as binary recursive partitioning, and its computer program (1994) provides a powerful means of carrying out the procedures. As the name indicates, CART works on both classification and regression problems.
The earliest binary tree structure was proposed by Morgan and Sonquist (1963) as Automatic Interaction Detection (AID). The major difference between CART and AID lies in the pruning and estimation process, that is, growing and honest estimation.

% DATA SPECIFICATIONS FOR CART.

Variables = 47   % must be specified first.
Names            % Characters on a line following % are ignored.
 ( 1) juridic   ( 2) assess    ( 3) nominatn  ( 4) age       ( 5) dealings
 ( 6) capital   ( 7) establish ( 8) employee  ( 9) current   (10) anngross
 (11) grading   (12) stabilit  (13) credit    (14) debt      (15) payrisk
 (16) deposit   (17) undertak  (18) sufficnt  (19) judge     (20) unredeem
 (21) interest  (22) install   (23) evenpay   (24) marks     (25) annincom
 (26) selkap    (27) district  (28) trade     (29) bkcredit  (30) stocks
 (31) payable   (32) develomt  (33) agent     (34) fix       (35) brand
 (36) competit  (37) heavymec  (38) machine   (39) complete  (40) rumor
 (41) life      (42) mgntable  (43) mood      (44) native    (45) dishonor
 (46) jump      (47) creditrn *
Categories
 ( 1) 5 ( 2) 2 ( 3) 5 ( 5) 2 ( 9) 2 (12) 3 (19) 2
 (23) 4 (25) 6 (26) 6 (27) 6 (28) 6 (29) 6 (30) 7
 (31) 7 (32) 9 (33) 7 (34) 7 (35) 9 (36) 7 (37) 6
 (38) 6 (39) 4 (40) 7 (41) 4 (42) 7 (43) 7 (44) 5
 (45) 4 (46) 4 (47) 5 *
Missing
 ( 3) 9 ( 7) 9 ( 9) 9 (19) 9 (47) 9 ( 5) 9 (23) 9 ( 1) 9
 ( 2) 9 (12) 9 *

Fig. 2: Variables and data specifications.

The key elements of a CART analysis are:

(1) rules for splitting each node of a tree,
(2) deciding when a tree is complete, and
(3) assigning each terminal node to a class outcome (or a predicted value for regression).

3.1 Splitting rules

Figure 3 is a part of the result of the analysis by the CART software.
The first topic addresses the method CART uses to select its questions for splitting nodes. The process is considerably simplified because CART always asks questions that have a "yes" or "no" answer. For example, the questions might be:

    Is ANNGROSS <= 9.500 million yen?
    Is the rank of ANNINCOM <= 3.500?

(see the second line and the "Competitor" part in Fig. 3), where ANNGROSS denotes the annual gross sales amount and ANNINCOM the rank of annual sales income.
In this output CART answers:

    "Node 1 was split on variable ANNGROSS."
    "A case goes left if variable ANNGROSS <= 9.500."

NODE INFORMATION

* Node 1 was split on variable ANNGROSS
* A case goes left if variable ANNGROSS <= 9.500
* Improvement = 0.093     C.T. = 0.088
*
*    Node   Cases   Class   Cost
*       1     448       1   0.500
*       2     200       2   0.884
*       3     248       1   0.160
*
*           Number of Cases       Within Node Prob.
*   Class   Top   Left   Right    Top     Left    Right
*       1   325   103    222      0.950   0.884   0.984
*       2   123    97     26      0.050   0.116   0.016
*
*     Surrogate    Split     Assoc.   Improve
*   1 ANNINCOM     2.500     0.927    0.089
*   2 BKCREDIT     1,2,6     0.324    0.060
*   3 HEAVYMEC     0,1,2     0.245    0.056
*   4 EMPLOYEE     5.500     0.242    0.045
*   5 JURIDIC      4,5       0.203    0.024
*
*     Competitor   Split     Improve
*   1 ANNINCOM     2.500     0.089
*   2 DISTRICT     3.500     0.084
*   3 NOMINATN     1,2       0.072
*   4 ESTABLSH     44.500    0.071
*   5 CURRENT      1         0.067

Fig. 3: Node information for the first split.

In each case the question is used to split a node by sending the "yes" answers to the left child node and the "no" answers to the right child node. In the credit loan data, 200 cases go to the left node and 248 cases go to the right node. The left node contains 103 cases of Class 1 (collectible loans) and 97 of Class 2 (uncollectible loans); the right node contains 222 cases of Class 1 and 26 of Class 2.
CART's method is to look at all possible splits for all variables included in the analysis. For example, it would consider splits on ANNGROSS at one hundred billion, two hundred billion, and so on, all the way through the highest annual gross sales amount observed in the data. It would then do the same for the rank of annual sales income, the variable ANNINCOM, and for all other variables as well. Since there are at most 448 different values for each variable in this data set, any problem has a finite number of candidate splits, and CART conducts a brute-force search through them all, converting a continuous variable to an ordered categorical variable using quantiles of the data. If a categorical variable is nominal, CART selects the best of all iC2 combinations, where i denotes the number of categories.
3.2 Choosing a split: Measure of goodness-of-split criterion
CART's next activity is to rank order each splitting rule on the basis of a goodness-of-
split criterion. One criterion commonly used is a measure of how well the splitting
rule separates the classes contained in the parent node. The goodness-of-split crite-
rion is defined as follows:

i(t) = Σ_{i≠j} C(i|j) p(i|t) p(j|t)        (1)

where C(i|j): misclassification cost of classifying a class-j case as class i,
π(i): prior probability of class i,
p(j|t): proportion of cases in node t that belong to class j.

This criterion is called the Gini index of diversity, a measure of node impurity. In practice, the following improvement of impurity produced by a split is employed as the criterion for selecting the split variable:

Δi(t) = i(t) − p_L i(t_L) − p_R i(t_R)        (2)

where i(t): Gini index of the parent node,
i(t_L): Gini index of the left child node,
i(t_R): Gini index of the right child node,
p_L, p_R: proportions of cases that go from the parent node to the left and the right child node.

In the example, the improvement for the best split variable, ANNGROSS, was 0.093, while the improvement for the best competitor variable, ANNINCOM, which appears in the part labelled "Competitor", was 0.089. Thus the goodness-of-split criterion of ANNGROSS is higher than that of its best competitor, ANNINCOM.
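To make Equations (1) and (2) concrete, the following sketch performs a brute-force split search on a single numeric variable using the unit-cost Gini impurity; the data are hypothetical, not the credit-loan sample.

import numpy as np

def gini(y):
    """Gini impurity i(t) of Eq. (1) with unit costs C(i|j) = 1 (i != j):
    i(t) = 1 - sum_j p(j|t)^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Brute-force search for the threshold maximizing Eq. (2)."""
    parent = gini(y)
    best = (None, -np.inf)
    for c in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = y[x <= c], y[x > c]
        pL, pR = len(left) / len(y), len(right) / len(y)
        improvement = parent - pL * gini(left) - pR * gini(right)
        if improvement > best[1]:
            best = (c, improvement)
    return best

# hypothetical stand-ins for ANNGROSS and the collectible/uncollectible label
x = np.array([3.0, 5.0, 7.5, 9.5, 12.0, 15.0, 20.0, 25.0])
y = np.array([2,   2,   2,   1,   1,    1,    1,    1])
print(best_split(x, y))   # -> "a case goes left if x <= threshold"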

3.3 Prior probability

Generally, CART assumes that the relative population frequencies of the dependent-variable classes are approximately equal, regardless of their distribution in the learning data set. If uncollectible loans really occurred as frequently as in this data set (about 27%), any company would go bankrupt. To change this operating assumption, the PRIOR command specifies that the sample proportions of the dependent variable represent the population proportions; any population distribution can be specified with the PRIOR command. In this analysis it is assumed that the prior probability of an uncollectible credit loan is 0.05.

3.4 Misclassification cost

CART measures misclassification costs as the probability of misclassification. In doing so, CART implicitly treats all classification errors equally. Although the assumption of equal unit costs for all errors is often appropriate, there are circumstances in which non-unit costs are needed to describe a decision problem. For example, in some credit loan problems the aim is to separate out the high-risk, uncollectible loans. In the example presented here, it is assumed that the expenses of collecting an uncollectible loan are about 10 times those of a collectible loan.
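As an aside, in a modern implementation such as scikit-learn (used here purely for illustration; the paper's analysis used the CART program itself), the priors of Section 3.3 and the costs of Section 3.4 can be approximated jointly through per-class weights:

from sklearn.tree import DecisionTreeClassifier

# Sketch, not the original CART run: fold priors and misclassification
# costs into per-class weights proportional to prior(j) * cost of
# misclassifying class j. Class 1 = collectible, class 2 = uncollectible.
prior = {1: 0.95, 2: 0.05}       # assumed priors (Sec. 3.3)
cost  = {1: 1.0,  2: 10.0}       # relative cost of missing each class (Sec. 3.4)
weights = {c: prior[c] * cost[c] for c in prior}

clf = DecisionTreeClassifier(criterion="gini", class_weight=weights,
                             min_samples_split=5)   # mirrors the 5-case limit of Sec. 3.5
# clf.fit(X, y) would grow the tree on the 448-case learning sample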

3.5 Class assignment


Once a best split is found, CART repeats the search process for each child node, continuing recursively until further splitting is impossible or is stopped for some other reason. Splitting will be impossible if only one case remains in a particular node or if all the cases in that node are exact copies of each other (on the predictor variables). CART also allows splitting to be stopped for several other reasons, including that a node has too few cases. The default for this lower limit is 5 cases, but it may be set higher or lower to suit a particular analysis.


Once a terminal node is found, we must decide how to classify all cases falling within it. One simple criterion is the plurality rule: the group with the greatest representation determines the class assignment. Thus, as shown in Fig. 3, the left-most node has 88.4% collectible cases, so the entire node is classified as collectible. However, if the cases in the left-most node were instead classified as uncollectible, the left node would have an 88.4% misclassification cost. Similarly, the right-most terminal node has 98.4% collectible cases, so all cases falling into that node are classified as collectible.
CART goes a step further: because each node has the potential to become a terminal node, a class assignment is made for every node, whether it is terminal or not. The rule of class assignment can be modified from simple plurality to account for the costs of making a mistake in classification and to adjust for over- or under-sampling from certain classes.

3.6 Pruning trees

If the splitting is carried out to the point where each terminal node contains only one data case, then each node is classified by the case it contains, and the resubstitution estimate gives a zero misclassification rate. However, in virtually all of applied statistics, less complex models are easier to understand and, for a given data set, can usually be estimated with greater precision.
We are now concerned with two main issues: getting the right-sized tree, and getting a more accurate estimate of the probability of misclassification or of the true expected misclassification cost.
The preference for simplicity can be seen in the model selection criteria proposed for conventional least squares regression, such as the adjusted R-squared, the Akaike information criterion (AIC), etc. All of them involve a trade-off between the error sum of squares and the number of parameters in the model. An analogous concept has been developed for CART trees. A natural measure of the complexity of a tree is the number of terminal nodes it contains, and the misclassification rate estimated from the model is an accuracy measure that always improves as trees get larger. CART employs the following cost-complexity measure for any tree:

Cost Complexity = Estimate of Misclassification Rate + α × (Number of Terminal Nodes),

where α is the penalty imposed per additional terminal node.
If the penalty parameter α, also known as the complexity parameter, is set to 0, the largest possible tree will always have the lowest cost complexity, because the cost-complexity criterion then uses only the estimates of the misclassification rate to measure cost. On the other hand, if α is set sufficiently high, for example to infinity, a tree with only a root node will be preferred because it is the smallest tree. Values of α between infinity and 0 pick out different trees, with the selected trees becoming larger as α moves towards 0.
To see how the complexity parameter works, let us take another look at the tree summary information from the CART run shown in Fig. 4. In this output the right-most column, labelled "Complexity Parameter", contains the value of α that would make that tree optimal. We can see that α starts out at a value of 0.099 for the tree with just one terminal node and declines to 0 by the time we reach the largest tree, with thirty-five terminal nodes.
To use the cost-complexity formula we first need to convert the relative costs into their absolute equivalents. We do this by multiplying the relative costs by the initial misclassification cost, which is 0.500 in this run. Thus the tree with one terminal node has cost 1.000 × 0.500 = 0.500, and the tree with three terminal nodes has cost 0.605 × 0.500 = 0.3025. Therefore the cost complexities for the two trees are:

(One terminal node):    Cost Complexity = 0.500 + 1α,
(Three terminal nodes): Cost Complexity = 0.3025 + 3α.

The value of α that equalizes the two cost complexities is 0.099; an α of 0.099 is therefore sufficient to make the smaller tree preferable. The complexity parameters can be obtained successively in this way.
Instead of attempting to decide whether a given node is terminal or not, CART proceeds by growing trees until it is not possible to grow them any further. Once it has generated what is called the maximal tree, it examines smaller trees obtained by pruning away branches of the maximal tree. The important point is that CART trees are always grown bigger than they need to be and are then selectively pruned back. Unlike other systems, CART does not stop in the middle of the tree-growing process, even if the tree appears to be optimal at that point, because important branches of the most accurate tree might be missed.
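The same grow-then-prune sequence can be reproduced with the minimal cost-complexity pruning available in scikit-learn; this is an illustration under assumed, synthetic data, not the original CART run.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# hypothetical stand-in for the 448-case credit-loan sample (about 27% class 2)
X, y = make_classification(n_samples=448, n_features=10, weights=[0.73, 0.27],
                           random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)        # maximal tree
path = full.cost_complexity_pruning_path(X, y)

# path.ccp_alphas plays the role of the "Complexity Parameter" column of
# Fig. 4: each alpha makes one member of the nested pruned sequence optimal.
for a in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    print(f"alpha={a:.4f}  terminal nodes={t.get_n_leaves()}")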

TREE SEQUENCE

Dependent variable: JUDGE

 Tree   Terminal   Cross-Validated     Resubstitution   Complexity
        Nodes      Relative Cost       Relative Cost    Parameter
   1      35       0.722 +/- 0.060        0.248           0.000
   7      21       0.685 +/- 0.059        0.286           0.003
   8      20       0.685 +/- 0.059        0.292           0.003
   9      15       0.644 +/- 0.058        0.323           0.003
  10      12       0.654 +/- 0.058        0.351           0.005
  11*     11       0.612 +/- 0.057        0.363           0.006
  12       9       0.643 +/- 0.058        0.392           0.007
  13       7       0.650 +/- 0.058        0.447           0.014
  14**     5       0.650 +/- 0.058        0.504           0.014
  15       3       0.731 +/- 0.060        0.605           0.025
  16       1       1.000 +/- 0.000        1.000           0.099

Initial misclassification cost = 0.500
Initial class assignment = 1

Fig. 4: Tree summary information.

3.7 Best pruned subtree: Test sample / cross-validation

Once the maximal tree is grown and the various subtrees are derived from it, the best tree is determined by testing each for its error rate or cost. When there are sufficient data, the simplest method is to divide the sample into learning and test sub-samples. However, many problems will not have sufficient data to allow a separate test sample; the tree-growing methodology is data intensive and requires many more cases than classical regression.
When data are in short supply, CART employs the computer-intensive technique of cross-validation, with 10-fold cross-validation as the default. The results of the 10 mini test samples are then combined to form error rates for trees of each possible size; these error rates are applied to the trees based on the entire learning sample. This complex process yields a set of reliable estimates of the independent predictive accuracy of the tree. The middle column in Fig. 4, labelled "Cross-Validated Relative Cost", contains the relative misclassification rates: the first value is the cross-validated relative error and the second value is its standard error. We can see that the minimum cross-validated relative error, 0.612, is attained by the tree with 11 terminal nodes. Note that the maximal tree does not always have the minimal cross-validated relative error, although it does have the minimal resubstitution relative cost, i.e., the estimate of the relative misclassification rate on the learning sample.

Fig. 5: Cross-validated relative cost. [Plot of relative cost against the number of terminal nodes, showing the cross-validated relative cost R(t) with its R(t) ± Std(t) band, the resubstitution relative cost, and the complexity parameter (right-hand scale, 0 to 0.1).]

The characteristics, shown in Fig. 5, are a fairly rapid initial decrease, followed by a long, flat valley, and then a gradual increase for larger trees.
CART employs the following 1-SE rule for selecting the right-sized tree, in order to (1) reduce the instability and (2) choose the simplest tree whose accuracy is comparable to the minimal misclassification rate; in addition, (3) a less complex model is easier to understand and is preferred throughout applied statistics:

R^{cv}(T_{k1}) ≤ R^{cv}(T_{k0}) + SE(R^{cv}(T_{k0})),        (3)

where T_{k0} denotes the tree with the minimal cross-validated misclassification cost, T_{k1} denotes the tree having the fewest nodes within the 1-SE rule, and R^{cv} and SE denote the cross-validated cost and its standard error.
In the analysis, the cross-validated relative cost of the tree with 5 terminal nodes is 0.650, which is within one standard error of the minimal cross-validated relative cost, that of the tree with 11 terminal nodes.
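Applied to the Fig. 4 sequence, the 1-SE selection of Eq. (3) can be written in a few lines; the triples below are transcribed from Fig. 4.

# (terminal nodes, cross-validated relative cost, standard error), from Fig. 4
sequence = [(35, 0.722, 0.060), (21, 0.685, 0.059), (20, 0.685, 0.059),
            (15, 0.644, 0.058), (12, 0.654, 0.058), (11, 0.612, 0.057),
            (9, 0.643, 0.058), (7, 0.650, 0.058), (5, 0.650, 0.058),
            (3, 0.731, 0.060), (1, 1.000, 0.000)]

nodes_min, cost_min, se_min = min(sequence, key=lambda t: t[1])   # T_k0: 11 nodes
threshold = cost_min + se_min                                     # 0.612 + 0.057
# T_k1: the smallest tree whose CV cost is within one SE of the minimum
one_se_tree = min((t for t in sequence if t[1] <= threshold),
                  key=lambda t: t[0])
print(nodes_min, one_se_tree[0])   # -> 11 and 5 terminal nodes, as in the text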

4. Application of CART to the credit loans of construction machines
Finally, as shown in Fig. 6, the options used for the analysis are chosen from those provided by the CART software (1994). It is assumed that the prior probability of an uncollectible credit loan is 0.05 and that the expenses of collecting an uncollectible loan are about 10 times those of a collectible loan. As the result of the analysis, the tree with the minimal cross-validated relative error is selected by CART, as shown in Fig. 7.
The following split variables are employed out of the 46 candidate variables (the 47 variables excluding the response) to construct the classification tree.

X1: annual gross sales amount: (10) "anngross"
X2: undertaken cost: (17) "undertak"
X3: deposit: (16) "deposit"
X4: install: (22) "install"
X5: nominate rank: (3) "nominatn"
X6: rank in district: (27) "district"
X7: number of installment plan: (22) "install"
X8: number of own heavy machines: (37) "heavymec"
X9: amount of debt: (14) "debt"
As shown in Fig. 8, the cross-validated misclassification cost (probability) of the minimal-cost tree for misclassifying a Class 2 case (uncollectible loan) as Class 1 (collectible loan) is 0.317, while the opposite cost is 0.175. Unlike the costs on the learning sample, these cross-validated costs are the same as those of the optimal tree within the 1-SE rule, which has 5 terminal nodes. The cross-validated costs of the trees with 9 and 7 terminal nodes, also within the 1-SE rule, are likewise the same. The difference between the minimal-cost tree and the optimal tree is shown in Figs. 9 and 10.

In conclusion, the minimal-cost tree with 11 terminal nodes is employed, because the variables X5 to X9 are considerably important in business, in contrast to the variable importance in the analysis, which is defined as a measure of a variable's ability to mimic the chosen tree and to play the role of a surrogate for the best splitting variable.

% OPTION SPECIFICATIONS FOR CART.

% Characters after % are ignored.
% Refer to the document USING THE CART PROGRAMS for syntax rules.
% The response variable must be identified first.
Response = 19   Classes = 2         % Variable Name: JUDGE
Split = Gini                        % Gini, possibly with altered priors
Priors = thus 0.9500 0.5000E-01 *   % prior probabilities
Cost = thus                         % misclassification costs
  (1) 0.    10.00
  (2) 1.000 0. *
Delete = (11) *                     % delete the following variables
Secondary                           % secondary options
  atom = 5             % minimum node size to split
  sample = 1000        % larger nodes are subsampled
  nsurrogates = 10, 5  % surrogates used, printed
  ncompetitors = 5     % competitors printed
  ntrees = 100         % print errors for this many trees
  learnsize = 20000    % maximum size of learning sample
  nodemax = 750        % number of nodes in largest tree
  catmax = 1000        % maximum categorical splits
  linmax = 0           % maximum linear combination splits
  depthmax = 750       % maximum depth of tree *
Cross = 10             % 10-fold cross-validation *

Fig. 6: Option parameters for analysis.

Fig. 7: Classification tree diagram of the minimal-cost tree (total cases = 448). [The root is split on X1; the left branch is split successively on X2, X3, X4, X5, X6, X7, and X8, and the right branch on X9 and X8, yielding 11 terminal regions, each assigned to the collectible (Class 1) or uncollectible (Class 2) class.]



MISCLASSIFICATION BY CLASS

Minimal Cost Tree: Terminal Nodes = 11

              |---CROSS VALIDATION----|   |---LEARNING SAMPLE----|
        Prior        N Mis-                      N Mis-
Class   Prob.    N   classified   Cost       N   classified   Cost
  1     0.950   325      57      0.175      325      37      0.114
  2     0.050   123      39      0.317      123      18      0.146
Total   1.000   448      96                 448      55

Optimal Tree: Terminal Nodes = 5

              |---CROSS VALIDATION----|   |---LEARNING SAMPLE----|
        Prior        N Mis-                      N Mis-
Class   Prob.    N   classified   Cost       N   classified   Cost
  1     0.950   325      57      0.175      325      39      0.120
  2     0.050   123      39      0.317      123      34      0.276
Total   1.000   448      96                 448      73

Fig. 8: Comparison of cross-validated relative error between the minimal cost tree and the optimal tree.

Fig. 9: The tree with 11 terminal nodes, having the minimal cross-validated cost. [Tree diagram not reproduced.]

Fig. 10: The optimal tree with 5 terminal nodes. [Tree diagram not reproduced.]

References
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984): Classification and Regression Trees, Wadsworth.
California Statistical Software (1994): CART Version 1.1, CalStat, Inc.
Morgan, J. N. and Sonquist, J. A. (1963): Problems in the analysis of survey data, and a proposal, JASA, 58, 415-434.
Steinberg, D. and Colla, P. (1992): CART, SYSTAT, Inc.
Analysis of Preferences for Telecommunication
Services in Each Area
Tohru Ueda (1) and Daisuke Satoh (2)
(1) Seikei University
3-3-1 Kichijoji-Kitamachi, Musashino-Shi
Tokyo 180, Japan
(2) NTT Telecommunication Networks Laboratories
3-9-11 Midori-Cho, Musashino-Shi
Tokyo 180, Japan

Summary: A method combining conjoint analysis and regression analysis is proposed in order to analyze not only individual consumer preferences, but also preference tendencies on an aggregate basis, for example in various geographic areas. We apply this method to new telecommunication services.

1. Introduction
Conjoint analysis [Luce and Tukey (1964)] has been used in marketing research to measure consumer preferences. It is a practical set of methods for predicting consumer preferences for multi-attribute options in a wide variety of product and service contexts. When developing new products and services, it is an effective method for determining service characteristics. For the evolution of conjoint analysis in marketing research, see the reviews by Green and Srinivasan (1978, 1990), Wittink and Cattin (1989), and Wittink et al. (1994).
In this paper, we apply conjoint analysis to telecommunication services. We divide telecommunication services into two classes: services that are independent of subscriber networks and services that depend on subscriber networks. In the former case we can use conjoint analysis to determine service characteristics, because we can regard typical consumer preferences as reflecting general consumer preferences. Ueda (1994) has recently applied conjoint analysis to an existing telecommunication service (voice mail); this kind of service offers a good opportunity to apply conjoint analysis. In the latter case we must analyze overall consumer preferences in the service areas, because the services depend on subscriber networks. Moreover, when we have several candidate service areas, we must forecast demand in order to build subscriber networks economically. It has been difficult to use conjoint analysis to identify preference tendencies on an aggregate basis, especially in specific geographic areas, because it has mainly focused on individual consumers. Conjoint analysis alone is insufficient to measure consumer preferences in service areas for services such as cable television (CATV).
In this paper we propose a method of identifying likely preferences in various areas. To do this we combine conjoint analysis with regression analysis. This combination enables us to analyze preference tendencies in specific geographic areas. We apply our method to new telecommunication services.


2. Application of conjoint analysis to telecommunication services
We assume that the new telecommunication services (cable television services) have four attributes or factors that will influence consumer preference: video on demand (V.O.D.) service, telephone service, registration fee, and monthly charge.
Here we consider the number of possible alternatives. There are two alternatives for each of two of the attributes, the V.O.D. service and the telephone service, because each is either present or not. Four registration fees are being considered: 20,000 yen, 40,000 yen, 60,000 yen, and 80,000 yen. Four monthly charges are being considered: 3,000 yen, 4,000 yen, 5,000 yen, and 6,000 yen. Consequently, a total of 2 x 2 x 4 x 4 = 64 alternatives would have to be tested if we were to array all possible combinations of the four attributes.

Table 1: Combinations for CATV service

     V.O.D     Telephone service   Registration fee (yen)   Monthly charge (yen)
A    Yes [2]   Yes [2]             80,000 [1]               4,000 [3]
B    No  [1]   Yes [2]             80,000 [1]               3,000 [4]
C    No  [1]   Yes [2]             60,000 [2]               4,000 [3]
D    No  [1]   No  [1]             60,000 [2]               3,000 [4]
E    Yes [2]   Yes [2]             40,000 [3]               6,000 [1]
F    Yes [2]   No  [1]             40,000 [3]               5,000 [2]
G    Yes [2]   No  [1]             20,000 [4]               6,000 [1]
H    No  [1]   No  [1]             20,000 [4]               5,000 [2]

Table 1 shows an orthogonal array involving 8 of the 64 possible combinations to be tested in this case. As an example, we give tentative points, shown in square brackets in Table 1, to each alternative for each of the four attributes. Each combination has a total of 8 points, so we cannot judge one to be clearly superior to another.
We make inquiries about the preference ranking of the choice set of the 8 combinations in Table 1. We assume that every respondent would prefer a lower price and more services.
We consider the following utility function for combination i:

U_i = a_1 δ_1(i) + a_2 δ_2(i) − a_3 log x_3(i) − a_4 log x_4(i),        (1)

where δ_j(i) and x_h(i) are defined by

δ_j(i) = 1 if combination i has attribute j (j = 1, 2), and 0 otherwise,        (2)

x_h(i): the value of attribute h for combination i (h = 3, 4).        (3)

Respondents' preference for a lower price and more services is expressed by the following restrictions:

a_k ≥ 0        (k = 1, 2, 3, 4).        (4)
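As an illustration of Equations (1)-(4), the following sketch evaluates the utility of the eight combinations of Table 1 under purely hypothetical part-worth values a_1, ..., a_4 and prints the implied preference ranking.

import math

a1, a2, a3, a4 = 1.0, 0.5, 0.8, 1.2          # assumed part-worths, illustration only

# (V.O.D, telephone, registration fee, monthly charge), from Table 1
combinations = {
    "A": (1, 1, 80000, 4000), "B": (0, 1, 80000, 3000),
    "C": (0, 1, 60000, 4000), "D": (0, 0, 60000, 3000),
    "E": (1, 1, 40000, 6000), "F": (1, 0, 40000, 5000),
    "G": (1, 0, 20000, 6000), "H": (0, 0, 20000, 5000),
}

def utility(d1, d2, x3, x4):
    """Eq. (1): U = a1*d1 + a2*d2 - a3*log(x3) - a4*log(x4)."""
    return a1 * d1 + a2 * d2 - a3 * math.log(x3) - a4 * math.log(x4)

ranking = sorted(combinations, key=lambda k: utility(*combinations[k]),
                 reverse=True)
print(ranking)   # predicted preference ranking under the assumed part-worths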

3. Sampling
We conducted a survey in two areas of Japan, obtaining 389 respondents from one area and 200 from the other. In Table 2 each area denotes a major city in Japan. The respondents in each area were obtained by a combination of random and purposive sampling. Although there are very few subscribers to CATV service in those areas, their opinions are very important, so we also intentionally selected respondents who are CATV subscribers.

Table 2: Sampling

Area Random sampling Purposive sampling Total


A 300 89 389
B 162 38 200

4. Estimation method
Various algorithms for conjoint analysis have been proposed, such as MONANOVA [Kruskal (1965)], TRADE-OFF [Johnson (1975)], LINMAP [Srinivasan and Shocker (1973)], and RANKLOGIT [Ogawa (1987)]. LINMAP differs from the others in that it uses linear programming, whereas the other approaches use nonlinear optimization. The use of linear programming enables LINMAP to obtain globally optimal parameter estimates, while the other approaches cannot be guaranteed to reach a global optimum.
Satoh and Ueda have discovered two problems with LINMAP solutions:
1. Even if there is a set of solutions that expresses the preference data perfectly, LINMAP cannot always generate it; instead it may produce a set that expresses only partial rankings.
2. LINMAP cannot necessarily produce a set of solutions that matches an analyst's inferences from observed data.
Satoh and Ueda have therefore proposed an improvement of LINMAP [Satoh and Ueda]. We applied it to the new telecommunication services of Table 1. The algorithm is composed of two steps, as follows: STEP 1 is LINMAP, and STEP 2 is a new additional part.
STEP 1 (LINMAP):

min Σ_{i=1}^{n−1} γ_i        (5)

subject to

(x_1 − x_2)a + γ_1 ≥ 0,
(x_2 − x_3)a + γ_2 ≥ 0,
  ...
(x_{n−1} − x_n)a + γ_{n−1} ≥ 0,
(x_1 − x_n)a = 1,
a ≥ 0,   γ_1, ..., γ_{n−1} ≥ 0,

where U_h − U_{h+1} = (x_h − x_{h+1})a, n is the total number of combinations, a is the vector whose components are the utility parameters a_j for every attribute, and U_h and x_h are respectively the utility and the attribute vector of the combination ranked h-th by the respondent.
STEP 2:

min e_u + e_l        (6)

subject to

(x_1 − x_2)a + γ_1 ≤ 1/(n−1) + e_u,
(x_2 − x_3)a + γ_2 ≤ 1/(n−1) + e_u,
  ...
(x_{n−1} − x_n)a + γ_{n−1} ≤ 1/(n−1) + e_u,

(x_1 − x_2)a + γ_1 ≥ 1/(n−1) − e_l,
(x_2 − x_3)a + γ_2 ≥ 1/(n−1) − e_l,
  ...
(x_{n−1} − x_n)a + γ_{n−1} ≥ 1/(n−1) − e_l,

(x_1 − x_n)a = 1,
Σ_{i=1}^{n−1} γ_i = γ_min,
a ≥ 0,   γ_1, ..., γ_{n−1} ≥ 0,
0 ≤ e_l ≤ 1/(n−1),   0 ≤ e_u ≤ (n−2)/(n−1),

where γ_min is the optimal value of the objective function (5).
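A minimal sketch of STEP 1 as a linear program follows; the ranking is a hypothetical respondent's, and scipy is assumed purely for illustration. The constraint layout mirrors Eq. (5).

import numpy as np
from scipy.optimize import linprog

# Rows: attribute vectors of the 8 combinations, sorted from most to
# least preferred by one hypothetical respondent. Columns are mapped so
# that U = X a with a >= 0, as in Eqs. (1) and (4).
raw = np.array([[1, 1, 80000, 4000], [1, 1, 40000, 6000],
                [0, 1, 60000, 4000], [1, 0, 40000, 5000],
                [0, 1, 80000, 3000], [1, 0, 20000, 6000],
                [0, 0, 60000, 3000], [0, 0, 20000, 5000]], float)
X = np.column_stack([raw[:, 0], raw[:, 1], -np.log(raw[:, 2]), -np.log(raw[:, 3])])

n, m = X.shape
D = X[:-1] - X[1:]                              # rows: x_h - x_{h+1}

# variables: [a (m entries), gamma (n-1 entries)]; objective: sum of gammas
c = np.concatenate([np.zeros(m), np.ones(n - 1)])
# -(x_h - x_{h+1}) a - gamma_h <= 0  <=>  (x_h - x_{h+1}) a + gamma_h >= 0
A_ub = np.hstack([-D, -np.eye(n - 1)])
b_ub = np.zeros(n - 1)
# normalization: (x_1 - x_n) a = 1
A_eq = np.concatenate([X[0] - X[-1], np.zeros(n - 1)])[None, :]
b_eq = [1.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (m + n - 1))
print(res.x[:m])    # estimated part-worth parameters a_1..a_4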

5. Analysis
Table 3 and Figs. 1 and 2 show the sum, over all respondents, of the range of each respondent's part-worths for each attribute.

Table 3: Sum of partworths

Area V.O.D. Phone Registration fee Monthly charge


A 100.11 37.51 69.66 62.99
B 50.00 30.58 23.33 22.20

Figure 1: Sum of part-worths in area A (V.O.D. 34%, phone 20%, registration fee 24%, monthly charge 22%).

Figure 2: Sum of part-worths in area B (V.O.D. 39%, phone 24%, registration fee 20%, monthly charge 17%).

6. Preference ranking of choice-set combinations in each area
Let a_j^p denote the relative importance of attribute j to decision-maker p, and let U_i^p denote the utility of combination i for decision-maker p obtained through conjoint analysis. U_i^p is expressed as

U_i^p = a_1^p δ_1(i) + a_2^p δ_2(i) − a_3^p log x_3(i) − a_4^p log x_4(i).        (7)

We assume that U_i^p can be related to decision-maker p's characteristics θ_j^p as follows:

U_i^p = Σ_j b_j^i θ_j^p + ε^p,        (8)

where ε^p is the disturbance term and θ_j^p is defined by

θ_j^p = 1 if decision-maker p has characteristic j, and 0 otherwise.        (9)

We obtain the values of b_j^i through multiple regression analysis.
To get a preference ranking of the choice-set combinations in area A, we aggregate Eq. (8) over area A as follows:

U_i(A) = Σ_{p∈A} Σ_j b_j^i θ_j^p = Σ_j b_j^i Σ_{p∈A} θ_j^p.        (10)

Thus we can obtain U_i(A) for combination i in area A if we know the value of Σ_{p∈A} θ_j^p. The factors we chose as explanatory variables θ_j^p are shown in Appendix I.
Moreover, we can obtain the relative importance a_k(A) of an attribute in area A through multiple regression analysis, by using conjoint analysis together with the multiple regression equation

U_i(A) = a_1(A) δ_1(i) + a_2(A) δ_2(i) − a_3(A) log x_3(i) − a_4(A) log x_4(i),        (11)

where U_i(A) is the criterion variable and δ_j(i) and x_h(i) are the explanatory variables defined by Eqs. (2) and (3), respectively.
Regression analysis gives the coefficients b_j^i. Hence we can obtain a preference ranking of the choice-set combinations in various areas if the data for the explanatory variables in Appendix I are available in those areas. Moreover, we can estimate the relative importance of an attribute in various areas. The estimation, however, has low accuracy, as shown in Table 4.
Table 4: Coefficient of determination

Combination   Coefficient of determination
A             0.3199
B             0.3526
C             0.3387
D             0.3671
E             0.3525
F             0.3702
G             0.3539
H             0.3111
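A small sketch of Equations (8)-(10) with entirely hypothetical data: the respondents' conjoint utilities for one combination are regressed on their 0/1 characteristics, and the fitted coefficients are then aggregated over an area.

import numpy as np

rng = np.random.default_rng(1)
P, J = 589, 12                          # respondents (389 + 200), dummy characteristics
theta = rng.integers(0, 2, size=(P, J)).astype(float)   # theta_j^p of Eq. (9)
U = rng.normal(size=P)                  # hypothetical utilities U_i^p of combination i

b, *_ = np.linalg.lstsq(theta, U, rcond=None)   # coefficients b_j of Eq. (8)

area_A = np.arange(389)                 # indices of area-A respondents (hypothetical)
# Eq. (10): U_i(A) = sum_j b_j * sum_{p in A} theta_j^p
U_area_A = b @ theta[area_A].sum(axis=0)
print(U_area_A)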

7. Conclusion
We have applied conjoint analysis to new telecommunication services and investigated a method of obtaining a preference ranking of choice-set combinations in various areas. We chose this elaborate method over others because it should enable us to determine the relative importance of an attribute in various areas and to choose the new service that most consumers in those areas prefer over another service. Although the results had low accuracy due to a lack of effective explanatory variables, this method will be effective if appropriate variables are found. Further studies will be needed to transform our estimation into a forecast of the demand [Ueda et al. (1995)] for new telecommunication services in various areas.

Appendix I: The factors used as explanatory variables

1: Are you male or female? (2)
2: How old are you? (9)
3: What is your relationship to the head of your household? (4)
4: What does your head of household do? (8)
5: What kind of company does your head of household work at? (9)
6: What is the annual income of your household? (11)
7: What kind of house do you live in? (4)
8: How many rooms does your house have? (8)
9: How many television sets are there in your house? (4)
10: How many hours per day do you watch television on average? (10)
11: How many videocassette recorders do you have? (5)
12: How many rental videocassettes do you rent per month? (7)
13: How many personal computers are there in your household? (4)
14: Do you have video game machine A? (2)
15: Do you have video game machine B? (2)
16: Do you have video game machine C? (2)
17: Do you have video game machine D? (2)
18: Do you have video game machine E? (2)
19: Do you have video game machine F? (2)
20: Do you have video game machine G? (2)
21: Do you have video game machine H? (2)
22: Do you have video game machine I? (2)
23: Do you have video game machine J? (2)
24: Do you get satellite broadcast TV in your house? (2)
25: Do you get CATV? (2)
26: Would you pay an additional charge for pay channels on CATV if they were interesting? (2)

Note that these questions were designed for Japanese respondents and might be inappropriate in other countries. The number in parentheses is the number of categories.

References:

Green, P. E. and Srinivasan, V. (1978): Conjoint Analysis in Consumer Research: Issues and Outlook, Journal of Consumer Research, 5, 103-123.
Green, P. E. and Srinivasan, V. (1990): Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice, Journal of Marketing, 54, 3-19.
Johnson, R. M. (1975): A Simple Method for Pairwise Monotone Regression, Psychometrika, 40, 2, 163-168.
Kruskal, J. B. (1965): Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data, Journal of the Royal Statistical Society, Series B, 27, 251-263.
Luce, R. D. and Tukey, J. W. (1964): Simultaneous conjoint measurement: A new type of fundamental measurement, Journal of Mathematical Psychology, 1, 1-27.
Ogawa, K. (1987): An Approach to Simultaneous Estimation and Segmentation in Conjoint Analysis, Marketing Science, 6, 1, 66-81.
Satoh, D. and Ueda, T.: to be submitted.
Srinivasan, V. and Shocker, A. D. (1973): Estimating the Weights for Multiple Attributes in a Composite Criterion Using Pairwise Judgments, Psychometrika, 38, 473-493.
Ueda, T. (1994): Analysis of Preferences for Services Based on Conjoint Analysis, Transactions of the IEICE, J77-B-I, 9, 542-549 (in Japanese).
Ueda, T. et al. (1995): A method of forecasting demand for new telecommunication services, 9th European Meeting of the Psychometric Society, 123.
Wittink, D. and Cattin, P. (1989): Commercial Use of Conjoint Analysis: An Update, Journal of Marketing, 53, 91-96.
Wittink, D. et al. (1994): Commercial Use of Conjoint in Europe: Results and Critical Reflections, International Journal of Research in Marketing, 11, 41-52.
Effects of End-Aisle Display and Flier
on the Brand-Switching of Instant Coffee
Akinori Okada
Department of Industrial Relations
School of Social Relations
Rikkyo (St. Paul's) University
3 Nishi Ikebukuro
Toshima-ku, Tokyo 171, Japan

Summary: Brand-switching data among instant coffee brands were analyzed by a nonmetric asymmetric multidimensional scaling (Okada and Imaizumi, 1987) to identify the effects of the end-aisle display and of the flier. Two-dimensional solutions show that the end-aisle display of a brand is in general not effective in inducing switching to the brand and leaves it vulnerable to switching to other brands; that for some brands the flier is effective in inducing switching from similar brands to the brand and is defensive against switching to other brands; but that for some brands the flier is not effective in inducing switching to the brand and leaves it vulnerable to switching to similar brands.

1. Introduction
Since several asymmetric multidimensional scaling (MDS) models and procedures were introduced (Zielman and Heiser, 1996), asymmetric MDS has been utilized to analyze various sorts of data, such as attraction relationships (Chino, 1978; Collins, 1987), journal citations (Chino, 1978, 1990; Weeks and Bentler, 1982), word associations (Chino, 1990; Harshman et al., 1982; Zielman and Heiser, 1993), telephone communication (Okada, 1989), intergenerational occupational mobility (Okada, 1988a), foreign trade (Chino, 1978), marriages among ethnic groups (Zielman, 1991), and data from various areas of psychology and sociology.
One of the most important areas for applying asymmetric MDS seems to be marketing research. Brand-switching data have been analyzed by asymmetric MDS, because asymmetries in brand switching might be related to differences in attractiveness among brands (DeSarbo and De Soete, 1984; Zielman and Heiser, 1996). Asymmetric MDS has been used to analyze brand-switching data among car categories and among soft drink brands (DeSarbo and Manrai, 1992; DeSarbo et al., 1992; Harshman et al., 1982; Okada, 1988b; Zielman, 1991). In the present study, brand-switching data among instant coffee brands are analyzed by a nonmetric asymmetric MDS (Okada and Imaizumi, 1987) to investigate the effects of the end-aisle display and of the flier (a pamphlet or circular for mass distribution, issued by a supermarket to inform of sales) on brand switching.

2. Data
The brand-switching data analyzed in the present study were derived from scanner data on about 5,000 instant coffee purchases made in 1993 by a panel consisting of 796 households who frequently came to a supermarket. The eleven instant coffee brands analyzed in the present study (10 brands, plus all other instant coffee brands treated together as the 11th brand) are presented in Table 1. These brands include three types of instant coffee: freeze-dried instant coffee (type a in Table 1), regular instant coffee (type b in Table 1), and coffee that is already mixed with sugar and cream or already packed in a plastic or paper cup (type c in Table 1). They also include brands of Nestle, which dominates the Japanese instant coffee market, and brands of Ajinomoto General Foods, a joint venture between Ajinomoto and General Foods.

Tab. 1: Eleven Instant Coffee Brands.

     brand                                   type*   abbreviation
 1   Nescafe Goldblend 150g                    a     NGB150
 2   Nescafe Goldblend 100g                    a     NGB100
 3   Nescafe Excella 250g                      b     NEX250
 4   Ajinomoto General Foods Maxim 100g        a     AMX100
 5   Ajinomoto General Foods Maxim 2 cup       c     AMX2cup
 6   Nescafe Cappuccino                        c     NCP
 7   Ajinomoto General Foods Blendy 250g       b     ABL250
 8   UCC The Blend 144 100g                    a     UCC100
 9   Nescafe Excella 100                       b     NEX100
10   Ajinomoto General Foods Maxim 30g         a     AMX30
11   others**                                        others

* a: freeze-dried instant coffee
  b: regular instant coffee
  c: already mixed with sugar and cream or already packed in a plastic or paper cup
** Others are not a single brand, and consist of different types of instant coffee brands.

Seven brands were purchased both when there was and when there was not an end-aisle display of each of them at the supermarket. Others, the 11th brand, were not purchased when there was an end-aisle display of any of them. If we distinguish a purchase of a brand when there was an end-aisle display of that brand (the brand with the end-aisle display) from a purchase when there was not (the brand without the end-aisle display) as two different items, we have 18 items or brands. A switching matrix among the 18 items or brands was calculated for each household, and the sum of the 796 matrices was taken to construct the switching matrix among the 18 items or brands for the panel. Table 2 shows the 18 x 18 switching matrix, whose (j,k) element represents the frequency of switching from item or brand j to k; this is called the end-aisle display data.
Six brands were purchased both when a flier accompanied and when no flier accompanied each of them. If we distinguish a purchase of a brand when the flier issued by the supermarket accompanied that brand (the brand with the flier) from a purchase when it did not (the brand without the flier) as two different items, we have 17 items or brands. Others were not purchased when a flier accompanied any of them. Table 3 shows the switching matrix among the 17 items or brands, which is called the flier data.

3. Method and analysis


The nonmetric asymmetric MDS (Okada and Imaizumi, 1987) was applied to the
18 x 18 table of the end-aisle data and the 17 x 17 table of the flier data respec-
tively. The asymmetric MDS represents each item or brand in a multidimensional
Tab. 2: Switching Matrix among 18 Items or Brands (with/without the End-Aisle Display).

from \ to             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 1 NGB150 w/o end    70  76  26  34   8   5   7  12   1   3   1   4   5   3   1   5   2  47
 2 NGB150 with end   65  92  39  71   8  21  15  23   1   3   5   6  10   9   1   7   5  52
 3 NGB100 w/o end    26  24  33  24   4   4  10  12   0   2   2   0   2   1   0   2   3  23
 4 NGB100 with end   27  37  39  25   4   6  11  51   0   9   4   0   2   6   0   3   2  47
 5 NEX250 w/o end     9  13   7   5  84  59   2   1   0   0   1   7  12   0   0  29   1  22
 6 NEX250 with end   10  13   8   4  60  88   0  10   2   4   2   5   8   3   0  18   1  26
 7 AMX100 w/o end    11  22   8  14   3   1  22  18   1   5   2   2   2  15   0   3   6  45
 8 AMX100 with end   11  31  47  12   5   4  38  65   0   1   3   0   4   7   2   6   2  51
 9 AMX2 w/o end       1   1   1   0   0   0   0   1  39   0   1   0   1   3   0   0   2  19
10 NCP w/o end        4   6   5   6   3   1   4   7   0 187  21   0   2   1   0   3   6  73
11 NCP with end       3   1   4   4   4   1   1   5   0  15  18   0   0   0   0   0   1  31
12 ABL250 w/o end     1   5   0   3   8   8   0   4   0   1   0   6  18   0   0   1   0  21
13 ABL250 with end   11  13   5   6  19  19   3   9   0   2   4  20  41   3   0   2   1  43
14 UCC100 w/o end     6  14   9  15   0   1  12  31   1   1   0   1   5  36   0   3   3  49
15 UCC100 with end    1   0   1   1   0   0   0   0   0   0   0   0   0   1   0   0   0   2
16 NEX150 w/o end     6   5   3   3  19  27   1   6   1   2   2   3   1   0   1  51   3  28
17 AMX30 w/o end      0   2   4   2   1   3   8   8   2   5   3   1   0   7   0   0  51  38
18 others            37  63  54  36  25  30  27  79  16  69  35  23  35  48   2  25  32 768

Tab. 3: Switching Matrix among 17 Items or Brands (with/without the Flier).

from \ to               1   2    3   4    5   6   7   8   9   10  11  12  13  14  15  16   17
 1 NGB150 w/o flier   279  22  139  11   30   6  43  11   2    9   2  23   2  14  11   7   97
 2 NGB150 with flier    2   0   20   0    1   5   1   2   0    1   0   0   0   0   1   0    2
 3 NGB100 w/o flier    99   2  116   1   12   3  55  27   0   14   1   4   0   6   3   4   66
 4 NGB100 with flier   13   0    4   0    3   0   1   1   0    2   0   0   0   1   2   1    4
 5 NEX250 w/o flier    28   4   17   1  160  52   3   2   1    1   1  25   1   1  37   1   35
 6 NEX250 with flier   12   1    6   0   50  29   3   5   1    5   0   6   0   2  10   1   13
 7 AMX100 w/o flier    56   2   60   3   10   1  86  13   1    9   0   7   0  22   5   8   80
 8 AMX100 with flier   16   1   16   2    1   1  32  12   0    2   0   1   0   2   4   0   16
 9 AMX2cup w/o flier    2   0    1   0    0   0   1   0  39    1   0   1   0   3   0   2   19
10 NCP w/o flier       12   1   17   1    7   1  17   0   0  227   8   1   1   1   3   7   97
11 NCP with flier       1   0    0   1    1   0   0   0   0    6   0   0   0   0   0   0    7
12 ABL250 w/o flier    25   2   12   2   34  13  11   3   0    6   0  78   5   2   3   1   60
13 ABL250 with flier    3   0    0   0    3   4   2   0   0    1   0   2   0   1   0   0    4
14 UCC100 w/o flier    19   2   24   2    1   0  28  15   1    1   0   6   0  37   3   3   51
15 NEX150 w/o flier    10   1    5   1   33  13   5   2   1    4   0   4   0   1  51   3   28
16 AMX30 w/o flier      2   0    6   0    2   2  15   1   2    8   0   1   0   7   0  51   38
17 others              96   4   88   2   38  17  80  26  16  100   4  52   6  50  25  32  768

Euclidean space as a point and a circle (in a two-dimensional space), a sphere (in a
three-dimensional space) or a hypersphere (in a four- or higher-dimensional space)
centered at that point. A configuration, which consists of points and circles (spheres,
hyperspheres), represents both symmetric and asymmetric proximity relationships
among items or brands. The Euclidean distance between two points imbedded in a
configuration corresponds to the symmetric switching between two items or brands
represented by those two points, and the difference of two radii of the circles centered
at those two points corresponds to the asymmetric switching from one item or brand
to the other.
We regard the number of purchases of item or brand k switched from item or brand j as the similarity from item or brand j to k, and the number of purchases of item or brand j switched from item or brand k as the similarity from item or brand k to j. Let sjk be the similarity from item or brand j to k, and skj be the similarity from item or brand k to j. sjk is not necessarily equal to skj. The similarity is assumed to be monotonically decreasingly related with mjk. mjk is defined by Equation (1)

mjk = djk - rj + rk,   (1)

and mkj is defined by Equation (2)

mkj = djk - rk + rj,   (2)

where djk is the Euclidean distance between the two points representing items or brands j and k, and rj and rk are the radii of the circles (spheres, hyperspheres) centered at those points.

Figure 1 represents items or brands j and k geometrically in a two-dimensional space. Item or brand j is represented by a point and a circle of radius rj centered at that point, and item or brand k is represented by a point and a circle of radius rk centered at that point. The length of the arrow directed from item or brand j to k (from left to right) in Figure 1 shows mjk, and the length of the arrow directed from item or brand k to j (from right to left) in Figure 1 shows mkj.
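
As a concrete illustration of Equations (1) and (2), the following minimal Python sketch (ours, not part of the original paper; all names are illustrative) computes the quantities mjk for every ordered pair of items from a configuration of point coordinates and radii.

```python
import numpy as np

def model_asymmetries(coords, radii):
    """Compute m_jk = d_jk - r_j + r_k for all ordered pairs (j, k):
    d_jk is the Euclidean distance between the points of items j and k,
    and r_j, r_k are the radii of the circles centered at those points."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))       # symmetric distances d_jk
    return d - radii[:, None] + radii[None, :]  # asymmetric quantities m_jk

# toy configuration: two items in a two-dimensional space
coords = np.array([[0.0, 0.0], [3.0, 0.0]])
radii = np.array([1.0, 0.5])
m = model_asymmetries(coords, radii)
print(m[0, 1], m[1, 0])  # 2.5 and 3.5: the larger radius of item 1 makes m_12 < m_21
```

Since the similarity is monotonically decreasing in mjk, a smaller mjk corresponds to more switching from j to k.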

[Figure: two points representing items or brands j and k with circles of radii rj and rk, connected by arrows of lengths mjk and mkj]

Fig. 1: Geometric Representation of Items or Brands j and k in a Two-Dimensional Space.

When a point is in a central part of a configuration, that point is on average near to other points. The item or the brand represented by the point in a central part of a configuration is likely to be switched to/from other items or brands. When a point

is in a periphery of a configuration, that point is on average near to some points


representing similar items or brands but is rather distant to points representing dis-
similar items or brands. The item or the brand represented by a point in a periphery
of a configuration is likely to be switched to/from similar items or brands, but it is
unlikely to be switched to/from dissimilar items or brands.
The radius of a circle (sphere or hypersphere) centered at a point represents the
vulnerability of the item or the brand represented by that point in switching. When
the radius of a circle centered at a point is small, the item or the brand represented
by that point is unlikely to be switched to others and is likely to be switched from
others. When the radius of a circle centered at a point is large, the item or the
brand represented by that point is likely to be switched to others and is unlikely to
be switched from others. The larger the radius is, the more vulnerable the item or brand is.
A nonmetric algorithm based on Kruskal's nonmetric MDS (Kruskal, 1964) is used to derive a configuration (coordinates of points representing items or brands, and radii) where mjk is monotonically decreasingly related with the frequency of switching. By using the algorithm, the configuration which minimizes the badness-of-fit measure S (Equation (3.1) of Okada and Imaizumi, 1987) for a given dimensionality is derived. In the analysis the maximum dimensionality is determined, and in each of the maximum dimensional through unidimensional spaces the configuration which minimizes S is derived. In the maximum dimensional space, an initial configuration of items or brands and their radii is derived by processing the similarities among items or brands using a factor analytic procedure. In a lower dimensional space, an initial configuration based on the principal components of the higher dimensional result is used. The initial configuration is improved by the steepest descent method iteratively to minimize S. The iteration is stopped when (a) S is smaller than the stopping criterion (0.00001), (b) the improvement of S is smaller than the stopping criterion (0.00001), or (c) the number of iterations becomes larger than the maximum number (70).
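
The iteration scheme just described can be summarized by a generic steepest-descent skeleton. The sketch below is ours, not the authors' program: the stress S and its gradient are left abstract, and only the three stopping rules quoted in the text are encoded.

```python
EPS_S = 0.00001        # stopping criterion (a): S itself is small enough
EPS_IMPROVE = 0.00001  # stopping criterion (b): improvement of S is small enough
MAX_ITER = 70          # stopping criterion (c): maximum number of iterations

def minimize_stress(stress, gradient, params, step=0.01):
    """Improve an initial configuration `params` by steepest descent
    until one of the three stopping criteria is met."""
    s_old = stress(params)
    for _ in range(MAX_ITER):
        params = params - step * gradient(params)  # steepest descent step
        s_new = stress(params)
        if s_new < EPS_S or (s_old - s_new) < EPS_IMPROVE:
            break
        s_old = s_new
    return params, s_new
```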
Each data set was analyzed using maximum dimensionalities of seven through four. We had one S in seven-dimensional space, two S in six-dimensional space, three S in five-dimensional space, and four S in each of four- through unidimensional spaces. In six-dimensional space, the smaller S was chosen as the minimized S in six-dimensional space. In five-dimensional space, the smallest of the three S was chosen as the minimized S. In each of four- through unidimensional spaces, the smallest S was chosen as the minimized S in that dimensional space. Then we have the minimized S in each of seven- through unidimensional spaces.

4. Results
The minimized S in five- through unidimensional spaces for the end-aisle data were 0.232, 0.265, 0.309, 0.367 and 0.471, and those for the flier data were 0.204, 0.252, 0.303, 0.377 and 0.499. These figures and the interpretation of configurations suggest that we choose the two-dimensional result as the solution for each of the two data sets.
Figure 2 shows the two-dimensional solution for the end-aisle display data. (The configuration was visually rotated so that the interpretation of the rotated configuration seems as clear as possible.) Each item or brand is represented as a point and a circle. The bold circle represents the brand with the end-aisle display, and the light circle represents the brand without the end-aisle display. For most of the brands, two

[Figure: two-dimensional configuration of points and circles for the brands NGB150, NGB100, NEX250, AMX100, AMX2cup, NCP, ABL250, UCC100, NEX150, AMX30 and 'others'; bold circles mark brands with the end-aisle display, light circles brands without it]

Fig. 2: Two-Dimensional Configuration of 11 Instant Coffee Brands with/without the End-Aisle Display.

[Figure: two-dimensional configuration of points and circles for the same brands; bold circles mark brands with the flier, light circles brands without it]

Fig. 3: Two-Dimensional Configuration of 11 Instant Coffee Brands with/without the Flier.

points representing the same brand with and without the end-aisle display are closely
located in the configuration. The vertical dimension seems to differentiate regular instant coffee brands from freeze-dried ones and those already mixed with sugar and cream
or already packed in a plastic or paper cup, and the horizontal dimension seems to
represent the difference between brands of Nestle and those of Ajinomoto General
Foods (Okada and Genji, 1995).
Figure 3 shows the two-dimensional solution for the flier data. (The configuration
was also visually rotated.) The bold circle represents the brand with the flier, and
the light circle represents the brand without the flier. Two points representing the same brand with and without the flier are also closely located in the configuration. The two dimensions appear to have the same meaning as those in Figure 2.
The obtained configuration derived from the analysis of the end-aisle data shows that
most of the brands with the end-aisle display are in the central part of the configura-
tion and that most of the brands without the end-aisle display are in the periphery of
the configuration. Most of the brands with the end-aisle display located in the central
part of the configuration have larger radii than those of the same brands without the
end-aisle display located in the periphery of the configuration. The obtained configu-
ration derived from the analysis of the flier data shows that most of the brands with
the flier are in the periphery of the configuration and that most brands without the
flier are in the central part of the configuration. Some of the brands with the flier
located in the periphery of the configuration have larger radii and some of them have
smaller radii than those of the same brands without the flier located in the central
part of the configuration.

5. Discussion
We would like to focus our attention on those brands which are represented as two
different items in a configuration; one with the end-aisle display or the flier and the
other without the end-aisle display or the flier. Combining the location of a point
(central part or periphery of a configuration) and the radius of a circle (smaller or
larger), we can classify these brands into four categories (a) to (d) shown below
(Okada and Genji, 1995). A characterization of a brand in each of the four categories is given below.
(a) A brand with the end-aisle display or the flier has a smaller radius than the same
brand without the end-aisle display or the flier and is located in rather a central part
of a configuration, while that brand without the end-aisle display or the flier having
a larger radius is located in rather a periphery of the configuration.
With the end-aisle display or the flier, a brand is likely to be switched from the same
brand without the end-aisle display or the flier as well as from other brands, and is
unlikely to be switched to the same brand without the end-aisle display or the flier as
well as to other brands. Without the end-aisle display or the flier, a brand is unlikely
to be switched from the same brand with the end-aisle display or the flier as well as
from other brands, and is likely to be switched to the same brand with the end-aisle
display or the flier as well as to other similar brands.
(b) A brand with the end-aisle display or the flier has a larger radius than the same
brand without the end-aisle display or the flier and is located in rather a central part
of a configuration, while that brand without the end-aisle display or the flier having
a smaller radius is located in rather a periphery of the configuration.
With the end-aisle display or the flier, a brand is unlikely to be switched from the
same brand without the end-aisle display or the flier as well as from other brands,

and is likely to be switched to the same brand without the end-aisle display or the
flier as well as to other brands. Without the end-aisle display or the flier, a brand is
likely to be switched from the same brand with the end-aisle display or the flier as
well as from other similar brands, and is unlikely to be switched to the same brand
with the end-aisle display or the flier as well as to other brands.
(c) A brand with the end-aisle display or the flier has a smaller radius than the same
brand without the end-aisle display or the flier and is located in rather a periphery
of a configuration, while that brand without the end-aisle display or the flier having
a larger radius is located in rather a central part of the configuration.
With the end-aisle display or the flier, a brand is likely to be switched from the same
brand without the end-aisle display or the flier as well as from other similar brands,
and is unlikely to be switched to the same brand without the end-aisle display or the
flier as well as to other brands. Without the end-aisle display or the flier, a brand is
unlikely to be switched from the same brand "ith the end-aisle display or the flier as
well as from other brands, and is likely to be switched to the same brand "ith the
end-aisle display or the flier as well as to other brands.
(d) A brand with the end-aisle display or the flier has a larger radius than the same
brand without the end-aisle display or the flier and is located in rather a periphery
of a configuration, while that brand without the end-aisle display or the flier having
a smaller radius is located in rather a central part of the configuration.
With the end-aisle display or the flier, a brand is unlikely to be switched from the
same brand without the end-aisle display or the flier as well as from other brands,
and is likely to be switched to the same brand without the end-aisle display or the
flier as well as to other similar brands. Without the end-aisle display or the flier, a
brand is likely to be switched from the same brand with the end-aisle display or the
flier as well as from other brands, and is unlikely to be switched to the same brand
with the end-aisle display or the flier as well as to other brands.
As mentioned earlier, seven brands were purchased both when there was and when
there was not the end-aisle display, and six brands were purchased both when
there was and when there was not the flier. For the end-aisle display data, these
seven brands are classified into categories (a) to (d) as shown in Table 4. For the flier
data, these six brands are classified as shown in Table 5.
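
The assignment of a brand to the categories (a) to (d) depends only on two binary features, so it can be stated compactly in code. The sketch below is ours, with purely hypothetical input values.

```python
def categorize(location_with, radius_with, radius_without):
    """Map the location of the brand with the end-aisle display or the
    flier ('central' or 'periphery') and the comparison of its radius
    with that of the same brand without the promotion to (a)-(d)."""
    smaller = radius_with < radius_without
    if location_with == "central":
        return "a" if smaller else "b"
    return "c" if smaller else "d"

# hypothetical radii: a central brand whose radius grew -> category (b)
print(categorize("central", radius_with=0.9, radius_without=0.4))  # 'b'
```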

Tab. 4: Classification of the Seven Brands for the End-Aisle Display Data.

                                              location of a brand with the end-aisle display
radius of a brand with the end-aisle display  central part                          periphery
smaller radius                                (a) AMX100                            (c) NCP, UCC100
larger radius                                 (b) NGB150, NGB100, NEX250, ABL250    (d) none

Five of the seven brands with the end-aisle display were in the central part of the
configuration (categories (a) or (b) in Table 4), and four of the five had larger radii
when they were with the end-aisle display than without the end-aisle display ((b) in
Table 4), suggesting that the end-aisle display is in general not effective to induce

switching from other brands and is vulnerable against switching to other brands. All
six brands with the flier were in the periphery of the configuration ((c) or (d) in
Table 5). Three of the six brands had smaller radii when they were with the flier
than without the flier ((c) in Table 5), suggesting that the flier is effective to induce
switching from similar brands as well as from the same brand without the flier and
is defensive against switching to other brands. The other three had larger radii when
they were with the flier than without the flier ((d) in Table 5), suggesting that the
flier is not effective to induce switching from other brands and is vulnerable against
switching to other similar brands.

Tab. 5: Classification of the Six Brands for the Flier Data.

                                   location of a brand with the flier
radius of a brand with the flier   central part    periphery
smaller radius                     (a) none        (c) NGB150, AMX100, NCP
larger radius                      (b) none        (d) NGB100, NEX250, ABL250

Four of the six brands with the flier were always accompanied by the end-aisle display. For these four brands (NGB150, NGB100, AMX100, and ABL250), the effect of the flier actually means the effect of the end-aisle display and the flier. For the two brands (NEX250 and NCP), the effect of the flier means the mixture of the effect of the flier alone and the effect of the flier and the end-aisle display. To separate the effect of the end-aisle display and the effect of the flier, a switching matrix was constructed by treating each of the four brands mentioned above as three different items: (1) an item with the end-aisle display and the flier, (2) an item with the end-aisle display alone, and (3) an item with neither the end-aisle display nor the flier; and by treating each of the two brands mentioned above as four different items: (1) an item with the end-aisle display and the flier, (2) an item with the end-aisle display alone, (3) an item with the flier alone, and (4) an item with neither the end-aisle display nor the flier.
The resultant table was analyzed by the asymmetric MDS. The obtained results seem to suggest adopting the two-dimensional configuration as the solution. The two dimensions of the configuration have the same meaning as those in Figures 2 and 3. For three brands (NGB150, NGB100, and ABL250) of the four mentioned above, the comparison between the brand with the end-aisle display alone and the same brand with neither the end-aisle display nor the flier shows that these three brands are classified in the same category (b) as shown in Table 4. For AMX100, the comparison shows that this brand is classified into (c), not (a) as shown in Table 4 (locations were reversed). Although the locations of the two items, one representing AMX100 with the end-aisle display alone and the other representing AMX100 with neither the end-aisle display nor the flier, were reversed, the two items were closely located. It seems that the effect of the end-aisle display alone is almost the same as the mixture of the effect of the end-aisle display alone and the effect of the end-aisle display and the flier. For these four brands, the flier was always accompanied by the end-aisle display, and it seems impossible to separate the effect of the flier from that of the end-aisle display.
For NEX250 and NCP, the comparison between the brand with the end-aisle dis-

play alone and the same brand with neither the end-aisle display nor the flier shows that NEX250 is classified into (c), not (b) as shown in Table 4 (both radii and locations were reversed), and that NCP is classified in the same category (c) as shown in Table 4. Although the locations and radii of the two items, one representing NEX250 with the end-aisle display alone and the other representing NEX250 with neither the end-aisle display nor the flier, were reversed, the two items were closely located and the difference of the two radii was small. The comparison between the brand with the flier alone and the same brand with neither the end-aisle display nor the flier shows that NEX250 is classified in the same category (d) as shown in Table 5, and that NCP is classified into (d), not (c) as shown in Table 5 (radii were reversed). Although the radii of the two items, one representing NCP with the flier alone and the other representing NCP with neither the end-aisle display nor the flier, were reversed, the difference of the two radii was small. This seems to suggest that the effect of the end-aisle display alone is almost the same as the mixture of the effect of the end-aisle display alone and the effect of the end-aisle display and the flier, and that the effect of the flier alone is almost the same as the mixture of the effect of the flier alone and the effect of the end-aisle display and the flier. For NEX250 and NCP, the item representing the brand with the end-aisle display and the flier was located between the item representing the brand with the end-aisle display alone and the item representing the brand with the flier alone, and had a radius whose length was between the two radii of these two items, suggesting that the interaction of the end-aisle display and the flier is rather small.

Acknowledgment
The author would like to express his gratitude to Professor Dr. Wolfgang Gaul of the University of Karlsruhe for his helpful comments and suggestions given at the presentation at the IFCS-96 meeting. He also wishes to express his appreciation to Dr. Takeshi Moriguchi of The Distribution Economics Institute of Japan for providing him with the data. The author is indebted to H. A. Donovan for his helpful advice concerning English.

References
Chino, N. (1978): A graphical technique for representing the asymmetric relationships between N objects, Behaviormetrika, No. 5, 23-40.
Chino, N. (1990): A generalized inner product model for the analysis of asymmetry, Behaviormetrika, No. 27, 25-46.
Collins, L.M. (1987): Deriving sociograms via asymmetric multidimensional scaling, In: Multidimensional Scaling: History, Theory, and Applications, Young, F.W. et al. (eds.), 179-196, Lawrence Erlbaum Associates, Hillsdale, NJ.
DeSarbo, W.S., and De Soete, G. (1984): On the use of hierarchical clustering for the analysis of nonsymmetric proximities, Journal of Consumer Research, 11, 601-610.
DeSarbo, W.S. et al. (1992): TSCALE: A new multidimensional scaling procedure based on Tversky's contrast model, Psychometrika, 57, 43-69.
DeSarbo, W.S., and Manrai, A.K. (1992): A new multidimensional scaling methodology for the analysis of asymmetric proximity data in marketing research, Marketing Science, 11, 1-20.
Harshman, R.A. et al. (1982): A model for the analysis of asymmetric data in marketing research, Marketing Science, 1, 205-242.
Kruskal, J.B. (1964): Nonmetric multidimensional scaling: A numerical method, Psychometrika, 29, 115-129.


Okada, A. (1988a): An analysis of intergenerational occupational mobility by asymmetric multidimensional scaling, In: The Many Faces of Multivariate Analysis: Proceedings of the SMABS-88 Conference Vol. 1, Jansen, M.G.H. et al. (eds.), 1-15, RION, Institute for Educational Research, University of Groningen, Groningen.
Okada, A. (1988b): Asymmetric multidimensional scaling of car switching data, In: Data, Expert Knowledge and Decisions, Gaul, W. et al. (eds.), 279-290, Springer-Verlag, Berlin.
Okada, A. (1989): Asymmetric multidimensional scaling: Theory and application, The Japanese Journal of the Acoustical Society of Japan, 45, 131-137. (in Japanese)
Okada, A., and Genji, K. (1995): Brand switching of instant coffee and the effect of end-aisle display, Communications of the Operations Research Society of Japan, 40, 448-501. (in Japanese)
Okada, A., and Imaizumi, T. (1987): Nonmetric multidimensional scaling of asymmetric proximities, Behaviormetrika, No. 21, 81-96.
Weeks, D.G., and Bentler, P.M. (1982): Restricted multidimensional scaling models for asymmetric proximities, Psychometrika, 47, 201-208.
Zielman, B. (1991): Three-way scaling of asymmetric proximities (RR-91-01), Department of Data Theory, University of Leiden.
Zielman, B., and Heiser, W.J. (1993): Analysis of asymmetry by a slide vector model, Psychometrika, 58, 101-114.
Zielman, B., and Heiser, W.J. (1996): Models for asymmetric proximities, British Journal of Mathematical and Statistical Psychology, 49, 127-146.
On the classification of environmental data in the
Bavarian Environment Information System
using an object-oriented approach
Erich Weihs

Bavarian State Ministry for State Development and Environmental Affairs


P.O. Box 810140, D-81901 Munich, Germany, e-mail [email protected]

Summary: In hardly any other area is the availability of regional development data of such great importance as in the sector of environmental protection and conservation. It is therefore the goal of every environmental information system to provide relevant data collections for legislative bodies and for the daily execution of administrative tasks. In this context, environmental information systems are mainly represented by the organisational association of data collections by specialist information systems. Organisational association, because the required data should be accessible but must remain with the authorities responsible for the specialist information. The current technology of data processing supports these requirements placed on distributed processing.

The proof of where which data can be found and processed, and under which qualitative and quantitative conditions, is the core of the system. The required 'common denominator' is the classification and the determination of the common vocabulary and its relations (= thesaurus) as a metalanguage, in order to be able to secure comparable and combinable research results regarding specialist information. For information systems this means that the 'language' and the 'grammar' of the data used in the said information system must be clearly defined and known to the user. The dialogue with the user is organised and supported on the basis of these language patterns. For this reason, the importance of a thesaurus and of a data model as the basis of a common language is given particular weight.

Due to the fact that the references, as described above, are based on extremely varied contents and regulations of how the data is to be dealt with, it would appear justifiable to classify these points of view as being aspects of the object.

The content classification of data according to the topic, i.e. specialist origins and possible usability, is the aspect pertaining to the structuring of the data. The question of the technical procedural position of the data within the data bank points the way to an object-related classification and storage of data. Using a simple mathematical model, the following sections are intended to demonstrate

that the object-related classification of data as a content-independent aspect is an important supplement to the data itself, and

that several possibilities of an object-orientated client-server technology can be pointed out.

1. Introduction
One of the most important preconditions for thought is the comparative classification of terms into categories. It is only in this way that the remembrance of experiences can develop to become an aid in life. As means of categorisation we have at our disposal, in addition to emotionally experienced images, language and script, also the ability to quantify. Finally however, even quantitative expressions of orders require their linguistic expression. The entirety of these experiences of orders and their cross-references to the model of reality is a part of what we mean when we claim to be informed (Wenzlaff 1991).

'Information' is therefore a subjective recognition - or imagination - of excerpts of reality. Subjects of information can include individuals or collective groups, i.e. organisations. 'Intersubjective' communication of data may, for example, take place by means of data transfer. Due to the fact that the recognition and formation of imagination is always bound to the existence of thought patterns, the registration of information (i.e. by acknowledging data) is the supplement of the subject's model of reality. Should the data not add to this model of reality in any way, then it will not be regarded by the recipient subject as being information (Whorf 1976, Herrschaft


1996). This statement is equally true for classification schemes and rules.

In this context, 'data' is to be understood as being definite values of statements, whose meaning (explicit or implicit, i.e. according to recognised agreement) has been determined. The meaning of a document can be made explicit through texts, tabular diagrams, legends or in files and data banks, thesauri, etc. In this context, definite does not mean that it is a determined value in a mathematical sense. Even a parameter of a probability distribution has a definite value.

A classification system is therefore only relevant if the data it contains make possible an extension of the user's subjective model of reality (Feyerabend 1976). Therefore, the ordering system for research and the classification of the data according to the preconditions described above must be accessible.

In the sector of environmental conservation, linguistically-based order systems are of considerable importance: the development and use of existing data for environmental topics is receiving more and more recognition (KoEU, 1990). In this context, the establishment of verification and navigation systems - often described using the partially misleading title of metainformation systems - as the core of an environmental information system (compare LABO, 1994, p. 15) is increasingly being accounted for.

Due to the fact that the majority of the available environmental data is regionally related, then, in addition to the subject-related search, regionally-related searches by town names, certain regional units and co-ordinates are required and must be taken into consideration in classification systems (Weihs 1993, LABO 1994, p. 21-22).

In this way, it is possible to carry out a regional search using both an order term as well as the co-ordinates of a freely-defined area. The regionally-related access is necessary because the majority of the information is to be evaluated on the basis of area structures (i.e. natural regional units, such as river valleys, lakes, rivers and canals, aquatic and terrestrial terms, or town names).

The basis for the indexing and search is the thesaurus, acting as a common 'vocabulary of language', for example for:

the data model and the method catalogue of the system
the data and method verification in the specialist information system
the regulation of the access paths to the data stock
a bibliographic, topical map with relevant references (catalogue of maps)
the subject and fact data
the specialist integration of various and separate data stocks
the filing of procedures and correspondence
Typically, the expression bn

bn ∈ B, n = 1, 2, ..., j, ..., k, ..., n+ with (1)

bj ≠ bk ∈ B, B = finite sum of the expressions (2)

as a defined succession of symbols (data in the above mentioned sense) is not revocably and definitely based on only one object og,n. If it is homonymous (i.e. 'field' as an expression for agricultural land or a data section, see diagram 2) then

bn = {og,n}, g = 1, 2, ..., g+ with g+ = the number of all objects og,n for the term bn (3).

Correspondingly, we treat objects that, from the point of view of different users, are observed under different aspects in the same way: for example, 'pasture' has a very different meaning depending on whether it is considered from the point of view of land use or of natural preservation. Based on the same aspect (i.e. natural preservation versus land use), various objects can be observed in the same way. We use the factor g* to denote the number of all possible aspects. We will continue from here without limitation from an aspect-related point of view, due to the fact that the homonyms, according to (3), are included. As we will see, the introduction of the aspects means a further classification, independent of the subject-logical classification.

The relation expressed by (3) is not revocably definite (i.e. bn = 'field', unlike in table 1 for agriculture and/or data bank); (4) means that there can be g+ terms with equal rights (synonyms) for one object:

bj ≡ sj = {sij}, with i = 1, 2, ..., m, ..., i+; sij ≠ smj, i ≠ m and sij ∉ B and (4)

S = {sj} and (6)

B* = B ∪ S (7)

The relation stated by (4) is rarely transitive in a logical linguistic sense: for the term b1 = 'field', according to table 1, under the aspect 'land use' si,k = b3 = 'pasture' is defined in a synonymous manner, and under the aspect 'data bank' b2 = 'variable'. Thus, in this case, the chain of synonyms si,1 ~ bk ~ si,k leads to the erroneous relation of synonyms 'pasture' ~ 'variable'.
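
The pitfall can be made concrete with a small sketch (ours; the entries are the 'field'/'pasture'/'variable' examples of table 1). Keying synonym sets by (term, aspect) pairs keeps the aspects apart, whereas chaining synonyms while ignoring the aspect merges 'pasture' and 'variable' erroneously.

```python
# synonym sets are only valid within one aspect
synonyms = {
    ("field", "land use"): {"pasture"},
    ("field", "data bank"): {"variable"},
}

def chain_ignoring_aspects(term):
    """Naive chaining: collect every synonym of `term` regardless of
    the aspect under which it was defined -- the erroneous relation."""
    found = set()
    for (t, _aspect), syns in synonyms.items():
        if t == term:
            found |= syns
    return found

print(chain_ignoring_aspects("field"))  # {'pasture', 'variable'}: wrongly merged
print(synonyms[("field", "land use")])  # {'pasture'}: correct within one aspect
```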

It is a particularly frequent occurrence in the environmental sector to find a whole series of homonyms, i.e. one term denoting several different objects, which, amongst other things, result from the use of various specialist terminologies, hereafter known as aspects (of a specialist language). Unlike in other areas of knowledge, no uniform use of language exists in this sector (Feyerabend, 1976), which would enable a generally valid expression and allocation of environmental influences. Quite the contrary, environmental questions are cross-section questions that affect many established areas of knowledge. Every knowledge or specialist sector will categorise according to its aspect (i.e. table 1, 'field' for land use versus data processing) and according to its model of reality. The consideration of the various aspects in one thesaurus is an intention of the ideas presented here.

This then means for the process of indexing that data is not simply given a catchword or a definition because a finite number of preconditions (= allocation of catchwords, codes, etc.) have been fulfilled, but rather only if the habits relating to the expression, which make information out of the data, are taken into consideration.

The homonymicity of the terms in a formulated system of order, which we denote as the thesaurus T, results in this generally not being of a transitive nature, i.e. these deficiencies are circumnavigated by redundant structures. Due to the fact that transitive systems of order offer considerable advantages in retrieval and in object-orientated procedural technology, the

model of a transitive thesaurus is the subject of discussion and will also be used in the
environmental information system.

Initially, the preconditions for a consistent thesaurus T are stated for just one aspect. Subsequently, the required extensions for a consistent 'multi-lingual' thesaurus T are formulated according to several aspects: the classification and the research terms must correspond with the (specialist) linguistic understanding of the user according to the various aspects g = 1, 2, ..., g*.

[Table 1 lists, on the term level, the expressions bn together with their synonyms si,n and related terms bj, and marks with an 'x' the aspects ag (g = 1, 2, 3, 4, i.e. g*) under which each expression applies; e.g. 'field' is marked under several aspects such as land use and data bank, while 'pasture' and 'variable' are each marked under single aspects.]

Table 1: Relations of expressions

2. The Thesaurus
The ordering and evaluation of information is essentially a precondition for successful and, above all, efficient work in all areas of knowledge and application. The drafting of classification systems and classifications is an essential method of creating order, with regard to one certain criterion, i.e. retrieving information, for objects which require classification. Therefore, in the process of retrieving information, one need not always base all considerations on the terms stored in the data, as one is able to make use of the order created by the classification (i.e. research with the aid of generic terms, catchwords, etc.). Conversely, the classification can be made considerably easier by means of automation, provided that this contains the terms used by the thesaurus or has derived its terms from the thesaurus (compare diagram 2, Weihs 1992).

2.1 The Hierarchical Principle

The thesaurus T is an ordered sum of expressions B*, which forms an open system for the specialist- and/or problem-orientated classification and ordering of terms; as a classification system, it strives for the revocable, definite allocation of expressions bn to the objects oj. This means that each term is contained in the thesaurus T once and only once and only refers to one object oj (excluding the demands of (3)).

on ⇔ bn with bn ∈ B* = sum of all terms, n = 1, 2, 3, ..., j, ..., n+ and (8)

(9)

Due to (1) through (7), the transitive, synonymous expressions are contained in B*.

A definite allocation of the expressions bn to the k+ categories (Wersig, 1982) is defined as being the order system

(10)

which satisfies the preconditions (11), (12):

B*1 = B* (11)

B*2k ∩ B*2k+1 = {∅}, B*2k ∪ B*2k+1 = B*k with k = 1, 2, ..., k+ categories (12)

(1) and (2) ensure that a term will appear once and only once in the sum B*, and it is therefore included in only one subset. The allocation of the synonyms to the categories according to (4), bj = {sij}, is also definite according to the preconditions (5) and (6), because S ∩ B = ∅. (10) through (12) define a strictly hierarchical, transitive system of order. The rule R, which, according to (10), must be defined for the ordering of the terms, is established on the basis of subject-logically determined preconditions, determined between these as super- and sub-orders of category forms. Nevertheless, the literature has examples of promising mathematical-statistical principles for the extraction of thesauri from texts or empirical material (see e.g. Bock 1993).
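
A hedged sketch (ours; the set-based storage is an assumption, not the paper's data structure) of how the preconditions of a strictly hierarchical order system could be checked mechanically: every category Bk must split into the disjoint subcategories B2k and B2k+1 whose union restores Bk.

```python
def is_strict_hierarchy(categories):
    """`categories` maps a category index k to the set of terms in B_k.
    Returns True if every split B_k -> (B_2k, B_2k+1) is disjoint and
    exhaustive, as required by the preconditions above."""
    for k, b_k in categories.items():
        left, right = categories.get(2 * k), categories.get(2 * k + 1)
        if left is None and right is None:
            continue                      # leaf category: nothing to check
        left, right = left or set(), right or set()
        if left & right:                  # subcategories must be disjoint
            return False
        if left | right != b_k:           # together they must restore B_k
            return False
    return True

# toy order system: the root B_1 is split into B_2 and B_3
cats = {1: {"field", "pasture", "variable"}, 2: {"field", "pasture"}, 3: {"variable"}}
print(is_strict_hierarchy(cats))  # True
```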

2.2 The Extended Principle

Equation (3) demands the consideration of the homonyms, because that is the point at which the allocation between object and expression is no longer revocable. Conversely, (9) is the precondition for a logical subject set of rules according to (10) for the division of expressions into categories. We will therefore extend (8) to the effect that we can allocate, in a definite sense, each of the h ≥ 1 varied objects with identical expression bn (i.e. 'field', compare table 1) to a certain aspect g = 1, ..., u, ..., g*:

oh,n ⇔ ag and oh,n ⇔ bn,g with bn,g ∈ Bg = sum of all terms of the aspect g (13)

(14)

The definition of the synonymous terms is extended in accordance with (4):

bj,g ≡ sj,g = {sij,g}, with i = 1, 2, ..., m, ..., i+; sij ≠ smj, i ≠ m and (15)

Sg ∩ Bg = ∅ but Sg ∩ Bt ≠ ∅ (16)

The division rule is extended in an analogous manner according to (10):

B2k,g ∩ B2k+1,g = {∅}; B2k,g ∪ B2k+1,g = Bk,g with k = 1, 2, ..., k+ categories (17)

Bk,g ∈ Cg: the Bk,g of Cg, with a division rule based on the aspect g. (19)

In accordance with (19), it must be taken into consideration in the logical subject classification rules that this is related to one and only the one related aspect ag.

However, on the basis of (16), the following is also true:

(20)

Bjk,gt = sum of the homonyms from the expressions of the categories j, k of the aspects g and t.

A pyramid structure, in accordance with Brito (1990), results from the inclusion of several languages r, u, v, ... ∈ {g = 1, ..., g*}:

(21)

The Bk,g of Cg are the clusters of the pyramid structure (Brito, 1990) of the languages, i.e. aspects of {g*}.

[Diagram: categories 3 through 15 arranged in a net-like hierarchy over the total sum B*; arrows indicate references by means of catchwording and indexing]

Diagram 1: Indexing when dealing with several aspects

The sum total (universal sum) of the expressions of the thesaurus T results from the terms and the synonyms allocated to them.

The subject-logical origin of a term is characterised by its affiliation to one (or several) aspects (Schilling, 1991). Depending on which super- and sub-orders have been permitted, the thesaurus will be, according to diagram 1, net-like, pyramidal or strictly hierarchical.

It is easily recognisable in diagram 2 that, through the introduction of the aspects, a g* + 1 dimensional variable area is stretched out. The aspects must not only be independent of each other and of the categories in a Cartesian sense, they must also be independent of each other in a subject-logical sense. The selection quantity, which is of interest for the retrieval of expressions, then results from the projection of the desired aspect area onto the term level.
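
A minimal sketch (ours; the bit-vector layout is an assumption) of such a projection: an expression is selected when its binary aspect vector carries a 1 for every aspect of the desired aspect area.

```python
def project_terms(terms, wanted):
    """Project a desired aspect area onto the term level: keep the
    expressions whose aspect vector is 1 at every wanted aspect index."""
    return [t for t, bits in terms.items()
            if all(bits[g] == 1 for g in wanted)]

# hypothetical aspect axes: 0 = land use, 1 = data bank
terms = {"field": [1, 1], "pasture": [1, 0], "variable": [0, 1]}
print(project_terms(terms, wanted=[0]))     # ['field', 'pasture']
print(project_terms(terms, wanted=[0, 1]))  # ['field']
```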

[Diagram: the term level B spanned together with two aspect axes g and r; dotted arrows mark references to clusters according to (18), dashed arrows mark references to synonyms according to (15)]

Diagram 2: Relations in the g* + 1 dimensional aspect area

3. The Object-Orientated Principle

In an earlier part of this paper, we examined the preconditions which make possible the definition of a thesaurus T which includes any number of aspects. The introduction of the aspects makes possible the consideration of the homonymous expressions used in various specialist languages, i.e. specialist points of view. In this context, the precondition was presupposed that each expression would appear once and only once in the thesaurus. The expressions are allocated to synonymous terms, depending on their aspect, and these are then equally components of the thesaurus. There is therefore a hierarchical clustering of terms within each aspect. The combination of various aspects then produces a pyramidal structure in the thesaurus. If one holds onto this precondition, it is possible to make a definite, non-redundant depiction of the expressions, simply by stretching out a g* + 1 dimensional space. In accordance with the precondition, the aspects are now independent of each other. In so far as an aspect is accurate for the expression bn, the binary value 1 is allocated, otherwise 0. The (g* + 1)th dimension makes reference to the term area. Due to the fact that, according to (2), each term appears only once, all changes related to the term (i.e. update of synonyms {g+}, category affiliation {k}, references to aspects {s}) directly affect the relevant relations and objects. This is an essential precondition for the realisation of an object-orientated principle.

Our term object b(n) is defined by the vector of the variable area of the aspects, the categories of the thesauri and the synonymous ring:

(23)

The subject-logical division of the point of view of the aspect from the classification determined by the contents makes possible an extended interpretation of the model: the aspect term can, without contradiction, be regarded as being a technical reference (address) to interfaces, such as HTML pages of the Internet, data banks or methods. This therefore opens up a further area of use for a thesaurus of this kind.
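
To make the object-orientated principle tangible, the following sketch (ours; every field name is hypothetical) bundles an expression with its binary aspect vector, its category affiliation, its synonym ring and a technical reference, roughly in the spirit of the term object b(n) of (23).

```python
from dataclasses import dataclass, field

@dataclass
class TermObject:
    """A term object: the expression, a binary vector over the g* aspects
    (1 where the aspect applies), the category index k, the synonym ring,
    and a technical reference such as an HTML page or a data bank."""
    expression: str
    aspect_vector: list
    category: int
    synonyms: set = field(default_factory=set)
    reference: str = ""

# hypothetical instance for 'field' under the aspect g=1 (land use)
term = TermObject(expression="field", aspect_vector=[1, 0, 0, 0],
                  category=7, synonyms={"pasture"},
                  reference="http://example.org/thesaurus/field")
```

Because each expression appears only once, an update of this single object propagates to all relations that reference it, which is exactly the precondition stated above.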

References:
Bock, H.-H., Lenski, W., Richter, M.M. (Eds.) (1993): Information Systems and Data Analysis, Proceedings of the 17th Annual Conference of the Gesellschaft für Klassifikation e.V., Springer, Berlin.
Brito, P., Diday, E. (1990): Pyramidal representation of symbolic objects, In: Knowledge, Data and Decision, Schader (Ed.), Springer, Hamburg.
Feyerabend, P. (1976): Wider den Methodenzwang, Suhrkamp, Frankfurt 1976, 296 f., p. 352 ff.
Herrschaft, L. (1996): Zur Bestimmung eines medienspezifischen Informationsbegriffs, in: Zeitschrift für Informationswissenschaft und -praxis (Journal for Information Theory and Work), Vol. 47, No. 3, p. 171 ff.
Kommission der Europäischen Gemeinschaft, KoEU (1990): "Richtlinien über den freien Zugang zu Informationen über die Umwelt", Brüssel 1990.
Bund/Länderarbeitsgemeinschaft Bodenschutz LABO (1994): Aufgaben und Funktionen von Kernsystemen des Bodeninformationssystems als Teil von Umweltinformationssystemen, Umweltministerium Baden-Württemberg, Stuttgart 1994.
Nedobity, W. (1989): Ordnungsstrukturen für Begriffskategorien, Studie zur Klassifikation, Bd. 19 (SK 19), Hrsg. Ges. f. Klassifikation e.V., Darmstadt 1989, p. 183 f.
Opitz, O., Lausen, B., Klar, R. (Eds.) (1992): Information and Classification, Springer, Berlin, New York.
Schilling, P. (1991): Variabler Thesaurus - eine Schlüsselfunktion für die zukünftige Informationsverarbeitung in einer Verwaltung, In: Konzeption und Einsatz von Umweltinformationssystemen, Brauer, W. (Ed.), im Auftrag der Gesellschaft für Informatik (GI).
Weihs, E. (1992): On the Client-Server Concept of Text Related Data, In: Cognitive Paradigms in Knowledge Organisation, Sarada Ranganathan Endowment for Library Science (Ed.), 452-459, Madras.
Weihs, E. (1993): An Approach to a Space Related Thesaurus, In: Information and Classification, Opitz, O., Lausen, B., Klar, R. (Eds.), 469-476, Springer, Berlin, Heidelberg.
Weihs, E. (1993): Datenbanken als Grundlage von Umweltinformationssystemen, Tagungsunterlagen zur 17. Jahrestagung der Ges. für Klassifikation, Kaiserslautern.
Wenzlaff, B. (1991): Vielfalt der Informationsbegriffe, Nachrichten für Dokumentation 42, Heft 5/1991, 335-361, Weinheim.
Wersig, G. (1985): Thesaurus-Leitfaden: eine Einführung in das Thesaurus-Prinzip, Theorie und Praxis, DGD-Schriftenreihe Bd. 8, K.G. Saur, München.
Whorf, B.L. (1976): Language, Thought and Reality, MIT Press (first published 1956), p. 12.
Cluster Analysis of Associated Words Obtained
from a Free Response Test on Tokyo Bay
Shinsuke Suga 1, Ko Oi 1 and Sadaaki Miyamoto 2

1 National Institute for Environmental Studies
16-2, Onogawa, Tsukuba, Ibaraki 305, Japan

2 Institute of Information Sciences and Electronics
University of Tsukuba, 1-1, Tennodai, Tsukuba, Ibaraki 305, Japan

Summary: This study is concerned with data analysis of associated words obtained from
a free association test on Tokyo Bay. It is shown that cluster analysis of the words is an
effective method to find respondents' concerns about the bay. Word clusters are
considered to give structures of inseparable conceptions of the object of association. We
analyze data from two survey areas near Tokyo Bay, i.e., one in a residential area and the
other in a town where fisheries are primary industries. The data analysis shows that water
pollution is a respondents' important concern in the two areas. Associated words showing
various industries or development works are classified in some specific clusters. Further,
we find a word cluster which indicates that Tokyo Bay is closely related to the lives of
respondents engaged in fisheries.

1. Introduction
In the data analysis concerning the investigation in several environmental problems a
variety of data are used. The authors have been analyzing the word data obtained from a
questionnaire survey to find the environmental awareness of local residents. In the
questionnaire survey respondents were asked to write down freely what they associated
with a given stimulus word or a phrase. It is considered that people have a wide variety of
conceptions about environmental problems. Thus, the questionnaire survey based on a free
association is more useful to get satisfactory information about residents' concerns than the
usual survey in which respondents find questions in the given list of individual items. We
consider the classification of the words obtained from the free association test. In this aim,
cluster analysis is applied to the associated words.
Applying the method of classification is useful for examining the awareness of local residents through the associated words in the following senses. First, discussing groups of classified words in the whole data is more practical than examining each word one by one. Second, a cluster of associated words is considered to give the cognitive structures of the awareness, if an appropriate measure of similarity is used. When people ponder on their living condition, for example, do they associate individual words, "convenience", "road", "quiet", etc., separately? On the contrary, as mentioned in Oi et al. (1986), they seem rather to recognize these items as a group of inseparable conceptions. A word cluster is considered to indicate some notion related to respondents' concerns.
The authors have analyzed word data obtained from various questionnaire surveys asking
respondents to write down about living condition (Oi et al. (1988)), water side in general
and Lake Kasumigaura (Suga et al. (1993)), and acoustic environment (Kondoh et al.
(1993)). In the present paper, we examine local residents' awareness or images of Tokyo
Bay. For data analysis the word data obtained from a free association test which was
carried out in some regions near the inland sea are used. In the survey, respondents were
asked to write down freely what they associated with a stimulus phrase "Tokyo Bay".


This paper is constructed as follows. In section 2, we summarize the questionnaire survey used in this study. In section 3, we describe the similarity measure for cluster analysis of word data. In section 4, the associated words used for cluster analysis are shown. Examining each of the words, especially those written with greater frequency, gives interesting information about Tokyo Bay as respondents conceive it. Section 5 contains the results of cluster analysis and the interpretation of word clusters. We also describe characteristic residents' concerns in each survey area. In section 6, we discuss the efficiency of applying cluster analysis to the word data obtained from free association. We conclude the paper with some remarks.

2. Questionnaire survey
The questionnaire survey based on free association concerning this study was planned to find how people around Tokyo Bay evaluate the nearby sea area. In the survey, three stimuli, "sea", "Tokyo Bay", and "the new road across Tokyo Bay", were used for the free association. Respondents were asked to write down their association items in questionnaire sheets for each stimulus. In the present study, we analyze the words associated with "Tokyo Bay".
The survey was carried out in four areas. Two areas were in Kawasaki City in Kanagawa Prefecture, and the other two areas in Kisarazu City in Chiba Prefecture. The two cities are on opposite sides of Tokyo Bay from each other. Data obtained from the two areas nearer the bay in the cities are used in this study. One is a residential area, adjacent to a coastal industrial district in Kawasaki City, located about five kilometers from the edge of the bay. The other is a rural area facing the bay on the Kisarazu City side. We call the former Kawasaki and the latter Kisarazu, respectively.
The questionnaires were mailed to 667 people in Kawasaki and 550 in Kisarazu selected from the residential map of each area by systematic sampling. The average recovery ratio of the questionnaires was 41%. About 45% of respondents in Kawasaki were office workers. In Kisarazu about 60% of respondents were fisheries workers. In fact, fisheries are important industries in Kisarazu. The period of the survey was from February to March, 1993. The whole results concerning the survey are shown in Suga and Oi (1995).

3. Similarity measure of cluster analysis

To carry out cluster analysis of the words the following arrangement of answers is carried out. If phrases or sentences, written without any clear separation between words in Japanese, are given by a respondent, those are first decomposed into words. Then nouns, adjectives and some verbs which are meaningful even after the decomposition in describing the environment are chosen as associated words. Those words are merged with the words written down one by one from the beginning to make the set of associated words of a survey area. Among the associated words thus obtained, those appearing with higher frequency are employed in the data analysis.
We use the similarity measure by Miyamoto and Nakayama (1980). The measure between two associated words is defined in the following way. Let X = {x1, x2, ..., xn} be a set of words each of which is found more than N times among the associated words from an area. Then let Y = {y1, y2, ..., ym} be a set of respondents each of whom writes at least one word in X. Suppose a word xi is written by a respondent yk with the frequency pik; the similarity measure between the two words xi and xj is defined by

s(xi, xj) = Σ(k=1..m) min(pik, pjk) / Σ(k=1..m) max(pik, pjk). (1)

Clearly, 0 ≤ s(xi, xj) ≤ 1. This measure shows that two words associated by more common respondents are more similar to each other.
In this study, we set N=10 and N=7 for the data of Kawasaki and that of Kisarazu, respectively. Thus, 50 words and 53 words are analyzed for Kawasaki and Kisarazu, respectively. A computer package PAB developed by Miyamoto (1984) is employed. The method of average linkage between the merged groups is used.
Though other similarity measures may be used for classification of words, the measure defined by equation (1) is an effective one for considering respondents' concerns in the sense that the similarity between two words is measured based on common respondents' association. The efficiency of our measure will be shown in sections 5 and 6.
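
Equation (1) translates directly into a few lines of code. The following sketch is ours (it is not the PAB package mentioned above) and computes the full similarity matrix from a word-by-respondent frequency matrix.

```python
import numpy as np

def word_similarity(p):
    """p[i, k] = frequency with which respondent y_k wrote word x_i.
    Returns the matrix of s(x_i, x_j) as defined by equation (1)."""
    mins = np.minimum(p[:, None, :], p[None, :, :]).sum(axis=-1)
    maxs = np.maximum(p[:, None, :], p[None, :, :]).sum(axis=-1)
    return mins / maxs

# toy data: 3 words, 4 respondents
p = np.array([[1, 0, 2, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 3]])
s = word_similarity(p)
print(s[0, 1])  # 0.5: words 0 and 1 share common respondents
print(s[0, 2])  # 0.0: no common respondent
```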

4. Words for data analysis

Table 1 shows the words used for cluster analysis in the order of the frequency of their appearance in each survey area. Those words are translated from the associated words in the original Japanese into English. Phrases in Table 1 are written as one word in the original language. Among all respondents in the two survey areas, 221 respondents and 185 respondents answered the question about the stimulus "Tokyo Bay" in Kawasaki and in Kisarazu, respectively. In the table the words with an asterisk, "Edo-mae", "sudate", and "Odaiba", are not translated because appropriate English expressions cannot be found for them. People often call a dish made of marine products from Tokyo Bay, or the marine products themselves, "Edo-mae" in Japanese. The word "sudate" means a classical method of catching fish on a beach in the bay area. "Odaiba" is a tiny islet made at the end of the shogunate era in the 19th century to install a cannon.
Suga and Oi (1995) examined the words obtained from the survey and analyzed the
frequency of each word. We summarize briefly the words in Table 1. We can find various
images about Tokyo Bay which respondents conceive through the words. The words
showing characteristics of Tokyo Bay are seen in each area. For example, the words
"laver" and "short-necked clam" show marine products from the bay, while "reclamation"
and "new road across Tokyo Bay" indicate industrial development in the bay area.
Furthermore, "shell gathering" and "fishing" mean pastimes along a beach.
The words related to bad images are also seen in Table 1. The words "dirty" and
"pollution" are written with a greater frequency, within the 20th place, in each area.
Further, we can see "sludge" in each area. The words "death" and "red tide" are seen in
Kisarazu, while "bad smell" in Kawasaki in Table 1. It indicates that the bad images
related to pollution of Tokyo Bay are prominent in each survey area. We can see the words
"fisheries", " fisherman", "place of fisheries", "fishing net", "marine products", "flat
fish", "shellfish", and "trough shell" in Kisarazu. Most of these words are written with
extremely small frequency in Kawasaki (see Suga and Oi (1995)). Greater frequency of
the words showing marine products or fisheries is characteristic in Kisarazu.

5. Results of cluster analysis

5.1 Classification of words
Figures 1 and 2 show the results of cluster analysis of the associated words shown in Table 1, based on the similarity measure defined by equation (1). These figures show sketches of the actual dendrograms, in which the similarity values are segmented into 25 classes in the interval [0, 1]. Figures 1 and 2 correspond to the results of Kawasaki and Kisarazu, respectively. In order to find respondents' concerns about Tokyo Bay we will discuss the interpretation of each word cluster by examining the words belonging to it. From the authors' experience, it is often difficult to find the meaning of a word cluster containing many words. Thus, the whole word data are classified based on a dendrogram so that each cluster contains a suitable number of words for interpretation. On

Table 1 Associated words for data analysis

Kawasaki Kisarazu

word frequency word frequency

sea 105 laver 75


dirty 77 sea 74
ship 73 short-necked clam 48
reclamation 67 new road across Tokyo Bay 42
laver 62 dirty 34
clean 47 ship 32
litter 43 old days 30
water 41 reclamation 28
fish 40 marine products 28
10 nature 34 litter 28

11 ferry 32 water 27
fishing 30 fish 24
old days 29 fisheries 23
pollution 28 pollution 21
sludge 28 life 21
shell gathering 28 nature 20
river 26 shell gathering 20
Edo-mae* 25 Mt Fuji 19
industrial area 22 waste water 16
20 goby 21 tidal flat 16

21 human being 21 abundance 16


Kawasaki 20 culture of marine products 15
change 19 small 14
tanker 18 shellfish 14
industrial complex 18 trough shell 14
seaside 17 change 14
factory 17 clean 13
life 16 death 13
Haneda airport 16 factory 13
30 childhood 15 environment 13

31 Tokyo 15 ferry 12
port 15 human being 12
Japan 14 flat fish 11
reclaimed land 14 fisherman 11
landscape 14 development 11
houseboat 14 tasty 11
new road across Tokyo Bay 13 goby 10
short-necked clam 13 Tokyo 9
Chiba 13 wind 9
40 fishing boat 12 Edo-mae* 9

41 bad smell 12 place of work 9


culture of marine products 12 sudate* 8
waterfront 11 fishing 8
Kisarazu 11 oil 8
color 11 industrial area 8
bird 11 leisure 8
overcrowding 10 work 8
development 10 place of life 8
Odaiba* 10 place of fisheries 7
50 Yokohama 10 construction 7

51 sludge 7
fishing net 7
red tide 7
* The meaning is described in the text.

the other hand, some clusters containing only a few words, say, less than three words, are also difficult to consider.
We describe the classification procedure of the associated words by using the results in Kawasaki. In Figure 1, if the whole data are classified at level 0.054, then eight clusters A1 to A8 are obtained. Two clusters A2 and A3 are divided further because they contain too many words. Finally, 14 word clusters a1 to a14 are obtained. Clusters A4, A7, and A8 contain only one word. If such a cluster is formed at a lower level, the word is considered to be associated independently of other words. In the same way, the whole data in Kisarazu are classified into 14 clusters b1 to b14 as shown in Figure 2.
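
The classification procedure can be reproduced with standard tools. Below is a sketch (ours; the authors used the PAB package) with SciPy's average linkage: the similarity matrix is turned into distances, clustered, and the tree is cut at a similarity level such as 0.054.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_words(similarity, level):
    """Average linkage between merged groups on 1 - similarity,
    cut at the given similarity level (e.g. 0.054)."""
    dist = 1.0 - similarity
    condensed = dist[np.triu_indices(dist.shape[0], k=1)]  # condensed form
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=1.0 - level, criterion="distance")

# labels = cluster_words(s, level=0.054) assigns each word a cluster id
```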

S.2 Interpretation of word clusters


S.2.1 Clusters in Kawasaki
We consider the results in Kawasaki shown in Figure 1. As mentioned in section 1, the
word clusters are considered to give the respondents' concerns or images about Tokyo
Bay. In order to find the meaning of the clusters we examine the words belonging to each
cluster. First, we consider cluster a1. The words "Tokyo", "Chiba", "Kawasaki", "Yokohama" are seen in a1. Each of these words is the name of a city located on the shore of Tokyo Bay. The formation of a1 indicates that the cognition concerning the cities shows an inseparable structure in the free association with Tokyo Bay.
Next, we consider clusters a2 and a4. It is clear that cluster a2 is characterized by the
words showing the bad images of the sea area. Cluster a4 is considered to be an important
one in the sense that all words in a4 are written with a greater frequency within the 10th
place in Kawasaki (see Table i). Such words are considered to show respondents' primal
concerns, which depict an ordinary marine scenery related with pollution affairs dominant
in this sea area.
Regarding those four clusters as one cluster A2 is more useful to find respondents'
interests about Tokyo Bay. We should note that the words "litter", "bad smell", and
"dirty" belong to A2. This indicates that an inseparable conception on the bay is composed
of the bad images shown by those words. The dendrogram shows higher value of the
similarity measure between "dirty" and the words with a greater frequency, "sea", "water",
"ship", and "fish". Thus, we can fmd that the images of unclean water and dirty sea are the
typical respondents' concerns about the bay.
Now, we consider A3. This cluster contains a rather large number of words, many of
which are related to development works, various industries, and marine products.
Further, we note that few words concerning such things can be found in other clusters. This
shows that A3 gives a characteristic cognitive structure in the association. In order to find
the meanings of the association shown by those words more precisely, it is desirable to
consider the small clusters composing A3 separately.
Here, we consider three clusters a7 to a9 separately. Cluster a7 is characterized by
words concerning marine products. The cluster a7 indicates residents' recollection of the
days before the reclamation for heavy industries, when the culture of laver and other marine
products and shell gathering were prevalent. Among the words in a8, we note "new road
across Tokyo Bay", "ferry", and "Kisarazu". This cluster is considered to reveal
respondents' interests about the new road being constructed to connect Kawasaki City and
Kisarazu City across the bay. The word "ferry" shows the existing means of marine
transport between the two cities, which will no longer be used after the opening of the new road.
The meaning of a9 is clear. We can find the image of the pollution caused by the coastal oil
industry in the formation of the cluster. It is interesting that the two words "house boat" and
"fishing boat" belong to this cluster. Respondents may associate a scene of boating with a
coastal industrial area as a background.
Finally, we consider a12. Three words "nature", "change", and "development" are
characteristic. This cluster shows respondents' interest in the change of nature caused by
various development works around the bay.

Figure 1  A sketch of the dendrogram showing the result of the cluster analysis of word data in Kawasaki (similarity axis from 0.0 to 1.0; cut level 0.054). Cluster composition:
A1: a1 {Tokyo, Chiba, landscape, Kawasaki, Yokohama, port}
A2: a2 {litter, bad smell}; a3 {factory, color}; a4 {clean, water, sea, dirty, ship, fish}; a5 {river}
A3: a6 {industrial area, Odaiba, overcrowding}; a7 {childhood, short-necked clam, Haneda airport, culture of marine products, laver, shell gathering, reclamation}; a8 {reclaimed land, new road across Tokyo Bay, fishing, goby, ferry, Kisarazu}; a9 {house boat, fishing boat, sludge, tanker, industrial complex, pollution}
A4: a10 {Edo-mae}
A5: a11 {life, bird}
A6: a12 {nature, seaside, old days, change, development, human being}
A7: a13 {Japan}
A8: a14 {waterfront}

5.2.2 Clusters in Kisarazu


Now, we consider the results in Kisarazu shown in Figure 2. The whole data are
classified into seven clusters B1 to B7 at level 0.052. Two clusters B1 and B3 are further
divided in the same way as in the result of Kawasaki. Finally, we consider 14 word clusters
b1 to b14.
First, we try to characterize B1 as one cluster. Many of the words showing marine
products from Tokyo Bay and concerning fisheries belong to this cluster. It indicates that
respondents' awareness related to such words shows an inseparable conception in the
association in Kisarazu.
Next, we consider four clusters b1, b2, b5, and b6 separately. Cluster b1 is especially
concerned with marine products from the bay. Though cluster b2 contains only three
words, we should note that the two words "tidal flat" and "sludge" belong to the same cluster.
Actually, a conservation area of a natural tidal flat in the bay exists near Kisarazu
City. The formation of b2 reveals respondents' concerns about the conservation of the tidal
flat area from water pollution. In cluster b5, we note the three words "nature", "change", and
"development", which are also found in a12 in Kawasaki. Thus, those two clusters have a
common meaning. Examining all the words in b5 reveals that respondents'
interest in the contrast between nature and development relates to the environment and
the abundance of nature in Kisarazu.
The words concerning fisheries are seen in b6. The result that the three words "sea",
"life", and "fisheries" belong to one cluster shows that Tokyo Bay is closely related to the
life of respondents in Kisarazu. This cluster is considered to be formed by the fact that
about 60% of the respondents in this survey area engage themselves in fisheries.
Now, we consider two clusters b8 and b9. Although cluster b8 is composed of few
words, it is clear that this cluster shows the bad images about water in Tokyo Bay.
Various words related to pollution can be found in b9. The words "factory" and "oil" seem
to show causes of pollution. The word "death" is used in phrases such as "death of sea" or
"death of fish" in the texts of the answers. From this cluster we can find that the pollution
of the sea area is a serious problem for respondents in the fisheries community.
Finally, we consider cluster b10. The dendrogram shows that the value of the similarity
measure between "Mt. Fuji" and "clean" is high in b10. This indicates that respondents
in Kisarazu make a positive evaluation of the landscape of Tokyo Bay coupled with Mt.
Fuji. Actually, Mt. Fuji can be seen from the survey area Kisarazu. Further, we should
note that the word "waste water", showing images contrary to "clean", belongs to b10.
The formation of this cluster seems to reveal that respondents in Kisarazu recognize the
beautiful landscape of the bay and the bad images of waste water, contrary to each other,
as an inseparable notion in the association with Tokyo Bay.

5.2.3 Common and different awareness between the two survey areas
We can find clusters composed of associated words relating to the water pollution of
Tokyo Bay in both survey areas. This indicates that water pollution is an important
concern of respondents about the inland sea in the two areas. The two clusters a12 in
Kawasaki and b5 in Kisarazu show respondents' concerns about the relation between nature and
development. They also seem to show respondents' interest in the change of nature caused
by various development works in Tokyo Bay. The associated words showing fisheries or
individual names of marine products are classified into some specific clusters. Typical
examples in each area are clusters a7 and b1.
In Kawasaki, clusters containing words showing various industries or development
works are found. Especially, cluster a9, which contains the words concerning the coastal oil
industry, is characteristic. As described in section 4, respondents in Kisarazu write various
words related to fisheries and marine products from Tokyo Bay, and those words
constitute some symbolic clusters. Among them, cluster b1 includes the words showing
Figure 2  A sketch of the dendrogram showing the result of the cluster analysis of word data in Kisarazu (similarity axis from 0.0 to 1.0; cut level 0.052). Cluster composition:
B1: b1 {laver, short-necked clam, shell gathering, flat fish, goby, trough shell, sudate, new road across Tokyo Bay, ferry}; b2 {tidal flat, Edo-mae, sludge}; b3 {old days, fish, shellfish}; b4 {construction}; b5 {abundance, change, nature, environment, reclamation, development}; b6 {fisherman, fishing net, sea, life, fisheries}
B2: b7 {ship, fishing}
B3: b8 {dirty, water, small}; b9 {factory, death, wind, oil, litter, red tide, culture of marine products}; b10 {Mt. Fuji, clean, human being, Tokyo, waste water}
B4: b11 {marine products, tasty, pollution, place of fisheries}
B5: b12 {industrial area}
B6: b13 {place of work, leisure}
B7: b14 {work, place of life}

several marine products. The formation of cluster b6 indicates that fisheries are not only a part
of the industries in Tokyo Bay but also closely related to the lives of the respondents themselves
in Kisarazu.

6. Efficiency of cluster analysis


We may apply some methods other than the one based on cluster analysis to the free
response data. One is to examine each associated word in detail, one by one. Indeed,
the words written with especially great frequency in Table 1 give respondents' primary
concerns about Tokyo Bay. We can also find characteristic respondents' concerns in each
survey area by examining the difference of word frequencies between the survey areas. But
that method does not give the relationships between the associated words. Another way is
to read each answer in detail. Knowing the contents of the actual answers is important for
the data analysis in this study. This would be the best method for a small number of
answers. Using such a method, Oi et al. (1994) analyzed the complaints caused by various
pollution phenomena. However, it is not easy to integrate the contents of whole answers
when we have to use many answers, say, more than 100 in each area as in this study.
As we have seen, classification of the associated words gives clear structures in a free
association. Examination of the words in each word cluster allows us to find residents'
concerns without reading all the answers in detail. In order to find a further interpretation of
a cluster, the authors often read the original answers and examine how the words belonging
to the cluster are described in the actual answers. Reading the original answers with the
results of the cluster analysis of words at hand is more useful for a detailed interpretation of
the word clusters than reading them without those results.
As mentioned in section 3, the sentences or phrases appearing in the original answers are
decomposed into words. Hence, those data are regarded as sequences of words in
which the order of appearance of the words is closely related to the meaning of an
answer. Miyamoto et al. (1990) proposed a method, called a neighborhood method, of
generating similarity measures between a pair of words obtained from long sentences in a
free association. Suga et al. (1994) applied the method to free response data about
annoyance and trouble. The application of the method to the present data would reveal
another aspect of the awareness of the respondents.

7. Concluding remarks
A free response test is useful for examining residents' concerns about several
environmental problems directly. However, the analysis of such data is not as easy as that
of data obtained by a usual survey in which respondents choose their answers from
given items in a questionnaire. The method of classification we use in this study gives
clear structures of the association with Tokyo Bay. Grasping such structures is important
for discussing several issues about the bay in the future, for example development and
conservation. It is not easy to reveal such structures through data analysis based on a
usual survey.

Acknowledgment: The authors would like to express their appreciation to the subjects
for cooperating in the survey.

References:

Kondoh, Y. et al. (1993): The acoustic environmental awareness of residents in high-rise apartment houses by the free response method, Proceedings of the Japan Society of Civil Engineers, 458, 111-120 (in Japanese).
Miyamoto, S. and Nakayama, K. (1980): A hierarchical representation of citation relationship, IEEE Trans., Systems Man and Cybern., SMC-10, 899-903.
Miyamoto, S. (1984): Development of a Computer Program Package for Bibliometrics, Report of a research supported by the Grant in Aid for Fundamental Scientific Research of the Educational Ministry in fiscal 1983 (in Japanese).
Miyamoto, S. et al. (1990): Methods of digraph representation and cluster analysis for analyzing free association, IEEE Trans., Systems Man and Cybern., 20, 3, 695-701.
Oi, K. et al. (1986): Analysis of cognitive structures of environment of the local residents through word association method, Ecological Modelling, 32, 29-41.
Oi, K. et al. (1988): The range and the structure of cognition of the living environment conceived by local residents, Proceedings of the Japan Society of Civil Engineers, 389, 83-92 (in Japanese).
Oi, K. et al. (1994): Management of complaints caused by noise and other pollution phenomena filed by residents flowing into industrial areas, Proceedings of International Congress on Noise Control Engineering, 2, 1141-1144.
Suga, S. et al. (1993): Study of the awareness of local residents on an expanse of water by a free association test and cluster analysis, Proceedings of the Japan Society of Civil Engineers, 458, 91-100 (in Japanese).
Suga, S. et al. (1994): An application of a method of neighborhood to text of free response test on annoyance and trouble, Research Report from the National Institute for Environmental Studies, R-132-'94, 97-106 (in Japanese).
Suga, S. and Oi, K. (1995): A Survey of the Image of Sea through a Free Association Method, F-73-'95/NIES, National Institute for Environmental Studies (in Japanese).
Data Analysis of Deer-Train Collisions
in Eastern Hokkaido, Japan 1
Keiichi Onoyama 1, Noboru Ohsumi 2, Naoko Mitsumochi 1, and Tsuyoshi Kishihara 1

1 Obihiro University of Agriculture and Veterinary Medicine


Inada-cho, Obihiro 080, Japan

2 The Institute of Statistical Mathematics


4-6-7, Minami-Azabu, Minato-ku,
Tokyo 106, Japan

Summary: The data of 696 deer-train accidents which occurred over a 330.95 km distance in
eastern Hokkaido, Japan, from April 1987 to March 1995 were statistically analyzed. Many of the
accidents occurred at particular sites and during night hours, which suggests a relation with the habitat
and diel activity of deer. Relative densities of deer were estimated where the train runs were
constant.

1. Background
One of the serious problems between humans and animals is deer-train collisions. These
include the breakdown of or damage to trains, hence the disturbance of train
schedules, and the death or injury of deer. Although several studies have been published on
deer-car accidents (Allen and McCullough (1976), Schafer and Penland (1985), Waring et
al. (1991), Reeve and Anderson (1993)), no actual data of deer-train accidents have been
studied. Recently the number of accidents between trains and the Sika deer (Cervus
nippon yesoensis) has greatly increased from year to year in eastern Hokkaido, Japan.

We have produced a data set of a total of 696 cases of deer-train accidents from
April 1987 to March 1995 (8 years) based on the driver reports of the Kushiro Branch of
Hokkaido Railway Company. Determining the altitude and representative vegetation at
0.5 km intervals along the line on the basis of 1/50000 scale topographical maps by the
National Geographical Survey Institute, Japan, and 1/50000 scale actual vegetation maps
(Environment Agency, 1988), we have created another data set consisting of the number
of accidents per 0.5 km and the environmental conditions. We present the results of
statistical analysis on these data sets and an estimation of the relative densities of deer.

2. Statistics
Fig. 1 shows that the number of accidents increased year by year except for 1991. Since
the number of train runs per year was almost the same, this suggests an increase in the
number of deer that crossed the railway track, and hence an increase in the deer population.
Fig. 2 gives the hourly change in the number of accidents. Most of the accidents
(79%) occurred between 16:00 and 23:00, when deer activity is high. Fig. 3 presents
the number of accidents per 10 km from Kamiochiai to Nemuro stations on the Nemuro Line.

1: This study was in part carried out under the ISM Cooperative Research Program (95-ISM-CRP-A58).

Fig. 1: Yearly change in the number of deer-train accidents, 1987-1994

Fig. 2: Hourly change in the number of deer-train accidents (abscissa: hour of day, 0-24)

Fig. 3: Number of accidents per 10 km. Numerals on the abscissa are distances in km from the Takigawa station.

Many accidents occurred between Kushiro and Nemuro stations, where hunters
reported that many deer lived.

Table 1 gives the number of cases in which various numbers of deer were found. From January to
April the number of deer was large and the mean ranged from 4.1 to 5.0. However, the
number of deer which collided with trains ranged from 1 to 4 (0 means a near miss), and
among these, cases of 1 accounted for 94.1% (Table 2). This situation is similar in every
month.

Table 1: Number of cases in which various numbers of deer were found

         Number of deer found / case
Month     1    2    3    4    5  6-9 10-19  20-  Total  Mean
Jan      16   12    9   10   10    6     6    1     70   4.1
Feb      17   15   14    1    7    9     6    3     72   4.7
Mar      12   18   17   10    6   11     8          82   4.2
Apr       6    5    3    5    9    4                33   5.0
May      12    8    4    2    3    4     0          33   2.8
Jun       4    6                                    14   3.0
Jul      17   13    9    2    0    0                41   1.9
Aug      14   13    4    3    2    0                36   2.1
Sep      19   17    3    5    4    1                50   2.6
Oct      33   25    7    5    5    1     0          76   2.0
Nov      47   23   12    5    6   14     0         107   2.6
Dec      24   15   11   10    5    7     3          75   3.1
Total   221  166   96   57   48   68    29    4    689   3.2

Table 2: Frequency distribution of the number of deer which collided with trains

         Number of deer which collided with trains         Mean number of
Month      0    1    2    3    4  unknown  Total   deer when collided
Jan             63    6                       70        1.11
Feb        7   58    6    1                   73        1.14
Mar        5   68    5    1         4         83        1.09
Apr        4   25    2    2                   33        1.21
May        2   31                             33        1.00
Jun            12                             14        1.23
Jul        2   34    3                        41        1.13
Aug        1   35                             37        1.00
Sep        4   45    2                        51        1.04
Oct        5   68    1    2                   77        1.04
Nov        2  100    5                       108        1.02
Dec            71    4    1                   76        1.05
Total     32  610   29    7    2   16        696        1.08

The distances at which drivers found deer ranged from 0 to 300 m. Fig. 4 shows box plots for the
distances in the daytime (8:00-16:00) and at night (20:00-24:00). Drivers found deer
farther in front of them in the daytime than at night.

Fig. 4: Distances (m) at which drivers found deer in the daytime (n = 49) and at night (n = 248)

The mean distance at which drivers found deer is 89 m with a standard deviation of 53 m
in the daytime, but 56 m (SD = 41 m) at night. Namely, the distance in the daytime is on
average 33 m longer than that at night.
Fig. 5 shows box plots for the distances at which drivers found deer under five weather
conditions. The distances were shorter in rainy or misty conditions than in fine or cloudy
ones. The difference in the means is about 18 m.

Fig. 5: Distances (m) at which drivers found deer under each weather condition (Fine, Cloudy, Rainy, Misty, Snowy; n = 419, 170, 43, 19, 19; means = 63, 63, 45, 46, 78 m)

Fig. 6: Relation between train speed (km/h, abscissa) and distance to stop (m, ordinate). Fitted line: Y = -173.625 + 5.684 X, R^2 = 0.761

Fig. 6 shows the relation between train speed and distance to stop. As is expected, the
greater the train speed is, the greater the distance becomes.
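The straight line reported in Fig. 6 is an ordinary least-squares fit. A minimal sketch with made-up (speed, distance) pairs, since the raw observations are not tabulated here:

```python
import numpy as np

# hypothetical observations; the paper's fit was Y = -173.625 + 5.684 X, R^2 = 0.761
speed = np.array([40.0, 50.0, 60.0, 70.0, 80.0, 90.0])      # train speed, km/h
stop = np.array([60.0, 110.0, 170.0, 220.0, 280.0, 340.0])  # distance to stop, m

slope, intercept = np.polyfit(speed, stop, 1)               # degree-1 polynomial fit
pred = intercept + slope * speed
r2 = 1.0 - np.sum((stop - pred) ** 2) / np.sum((stop - stop.mean()) ** 2)
print(f"distance = {intercept:.1f} + {slope:.2f} * speed, R^2 = {r2:.3f}")
```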

3. Relative densities of deer


Since the number of train runs was different between hours and stations, the number of
accidents cannot be used as a direct indicator of the population density of deer.
However, the 8 selected train runs (train numbers 5638, 5639, 5640, 5641, 5642, 5645,
5647, and 3644) between 16:00 and 22:00 together gave a total of 6 runs between
Kushiro and Nemuro stations, also giving nearly constant hourly runs, so we may use the
number of accidents on the 8 train runs as relative densities of deer along the railway.
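Given the accident positions on those runs, the relative-density profile of Fig. 8 is a histogram over 2.5 km bins; a minimal sketch with hypothetical kilometre posts (not the actual accident records):

```python
import numpy as np

# hypothetical accident positions, km from the Takigawa station
km = np.array([334.2, 352.7, 421.3, 422.8, 424.1, 429.9, 430.2, 447.5])

edges = np.arange(330.0, 452.5, 2.5)        # 2.5 km bins covering 330-450 km
density, _ = np.histogram(km, bins=edges)
peak = density.argmax()
print(f"highest relative density in bin {edges[peak]:.1f}-{edges[peak + 1]:.1f} km")
```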

Fig. 7: The accidents that occurred on the 8 trains (5638, 5639, 5640, 5641, 5642, 5645, 5647, 3644) between Kushiro and Nemuro stations, plotted by hour (16:00-22:00). Ordinate: distance from the Takigawa Station in km (310-450).

Fig. 7 shows the accidents that occurred on the 8 train runs, and Fig. 8 the relative densities
per 2.5 km along the Nemuro Line between Kushiro and Nemuro stations. The relative
density at the distances of 420 to 430 km was very high. It is suggested that there may
be three or four large deer populations along the line between the two stations.
Hayashi's quantification method (type I) analysis performed on the data set of 0.5 km
segments (270 cases) has revealed that the relative density of deer between Kushiro and
Nemuro stations is related to the vegetation type and altitude category along the Nemuro
Line. Since the multiple correlation coefficient is 0.468, other factors such as the deer's
behavioral habits themselves may also be related.
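Quantification method type I amounts to least-squares regression of a numerical response on dummy-coded categorical predictors. A minimal sketch with hypothetical vegetation and altitude categories, not the paper's 270-segment data set:

```python
import numpy as np

vegetation = np.array([0, 1, 2, 1, 0, 2, 1, 0])   # vegetation type per 0.5 km segment
altitude = np.array([0, 0, 1, 1, 0, 1, 0, 1])     # altitude category per segment
accidents = np.array([3.0, 1.0, 0.0, 2.0, 4.0, 0.0, 1.0, 2.0])  # hypothetical counts

def dummies(codes, n_levels):
    """One-hot coding, dropping the first level to avoid collinearity."""
    return np.eye(n_levels)[codes][:, 1:]

X = np.hstack([np.ones((len(accidents), 1)),
               dummies(vegetation, 3), dummies(altitude, 2)])
coef, *_ = np.linalg.lstsq(X, accidents, rcond=None)   # category scores + intercept
fitted = X @ coef
multiple_r = np.corrcoef(fitted, accidents)[0, 1]      # cf. the reported 0.468
print(coef, multiple_r)
```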

Fig. 8: Number of accidents per 2.5 km on the Nemuro Line. Abscissa: distance in km from the Takigawa Station (330-450).

4. Acknowledgement
We would like to thank the Kushiro Branch of Hokkaido Railway Company for providing
the material.

References:
Allen, R. and McCullough, D. (1976): Deer-car accidents in southern Michigan. Journal of Wildlife Management, 40, 317-325.
Environment Agency (1988): Actual vegetation map: the 3rd national survey on the natural environment (vegetation). Japan Wildlife Research Center, Tokyo.
Reeve, A. F. and Anderson, S. H. (1993): Ineffectiveness of Swareflex reflectors at reducing deer-vehicle collisions. Wildlife Society Bulletin, 21, 127-132.
Schafer, J. A. and Penland, S. T. (1985): Effectiveness of Swareflex reflectors in reducing deer-vehicle accidents. Journal of Wildlife Management, 49, 775-776.
Waring, G. H., Griffis, J. L. and Vaughn, M. E. (1991): White-tailed deer roadside behavior, wildlife warning reflectors, and highway mortality. Applied Animal Behaviour Science, 29, 215-223.
Comparison of some numerical data between the belisama group
of the genus Delias Hübner (Insecta: Lepidoptera) from Bali
Island, Indonesia

Sadaharu Morinaka
The University of the Air
2-11 Wakaba, Mihama-ku, Chiba-ken, 261 Japan

Summary: In the taxonomic study of closely related taxa, numerical data seem to be
insufficiently used or discussed. In this study the actual sizes of various parts of the male genitalia
and their ratios (to forewing length) are compared and discussed. Generally, ratios are
considered useful for eliminating environmental effects, as seen in the case where the
male genitalia of Delias oraia are proved to be clearly larger than those of Delias belisama.
Moreover, it is clarified through the discussion that there is a character that is hardly
affected by the environment, and that such a character is therefore very important for discussing
speciation and biological evolution in related taxa.

1. Introduction
The description of biota is very important for various natural sciences because it is their
basis, for instance in taxonomy, phylogenetics, population genetics, ethology, and
evolutionary biology. Since Linné, huge numbers of descriptions have been carried out,
but they are far from complete even now. Many more biota await description. As for
descriptions of insects, we can see them in the journals of various learned societies, for
instance the Entomological Society of Japan, the Biogeographical Society of Japan, and the
Lepidopterological Society of Japan. We can see non-numerical expressions in them, for instance
"antennae red, upper side of forewing blue and shining". We can also see
numerical expressions such as "forewing 34 mm on average", but they are not well used in
studies.

P = G + E  (P: phenotypic value, G: genetic value, E: environmental value)

This formula (Kimura 1960) is well known in quantitative genetics. We know that insects,
for instance butterflies or moths, become small adults when their larvae are not fed enough
or live in severe conditions. Thus I consider that numerical data, for instance the size of wings,
are affected by various environmental factors, and therefore that numerical
data are difficult to use for taxonomic studies.
Recently Lande (1976) and Lynch (1988) discussed genetic models of evolution but
hardly used actual data of organisms. Komatsu (1996) reported that variation in the size of
genitalia was independent of that of the general body, but did not show actual data in his
paper. In the course of taxonomic study I found that the relative size of the phallus (a part
of the male genitalia) differed distinctly from those of other organs, and showed, using actual
data of two very closely related taxa, that it was peculiar and independent of the variation
in the sizes of other organs (Morinaka 1996). I also noted that it has an important relation to
speciation and biological evolution (Morinaka 1996). In this paper I show again, using
actual data, that the phallus differs distinctly from other organs, and suggest that this has a
very important meaning for the speciation and biological evolution of two very
closely related taxa.


2. Materials
2.1 The genus Delias (Insecta: Lepidoptera, Pieridae)

The genus Delias Hübner, 1819 is a big genus which has more than two hundred species
and belongs to the Pieridae. The constituent species are distributed broadly from India to New
Caledonia. Many of them inhabit South East Asia, and the genus is also divergent on the highlands of
New Guinea Island. Their larvae eat mistletoe plants and the adults have colorful wing
markings on the undersides of the wings. These are known as remarkable characters of
the genus Delias. Talbot divided this big genus into twenty groups (Talbot, 1928-1937). I
have been studying Group 17 (the belisama group). It has 15 species which are closely
related to each other and distributed from Nepal to New Caledonia, including Australia.

2.2 Materials

Delias belisama balina Fruhstorfer, 1910 (population from Bali Island) and Delias oraia
bratana Kalis, 1941 (population from Bali Island) are used. The distributions of Delias
belisama and its relatives are shown in Fig. 1. D. belisama inhabits Sumatera, Jawa and
Bali Island. On the other hand, D. oraia inhabits the Lesser Sunda Islands, for instance
Lombok Island and Flores Island, and also Bali Island beyond Wallace's line. They are
closely related but clearly different species, because they fly together in the mountains of
Bali Island while maintaining their separate identities. Males of both species are shown in
Fig. 2. Their wing markings are nearly identical and sometimes it is difficult to distinguish
them from each other.

Fig. 1 Distribution of Delias belisama and its relatives (map legend: Delias belisama; Delias oraia; Wallace's line)

Fig. 2 Male adults of both species (A: Delias belisama balina, B: D. oraia bratana; upper: upperside, lower: underside)
3. Method and results
3.1 Method

As mentioned above, comparison of the wing markings of D. belisama and D. oraia is
difficult. Therefore comparison of the genitalia becomes rather important. In this study,
comparisons using various sizes of the male genitalia are carried out: 35 males of D. belisama
balina and 24 males of D. oraia bratana were measured for each size of the male genitalia
as shown in Fig. 3.

Fig. 3 Measured portions of male genitalia. A: Ring (anterior), B: Juxta (posterior), C: Dorsum (dorsal), D: Valva (inner), E: Phallus (lateral and dorsal)

3.2 Results

The results are shown in Table 1. Generally, the sizes of D. oraia are larger than those of D.
belisama. But the actual data are considered to include environmental effects. It is
considered that originally one species existed on Bali Island and the other species invaded
it. Therefore it can be imagined that the environment is not so suitable for the latter
species. It can also be imagined that large individuals have large genitalia. Therefore
correlation coefficients between the sizes of the genitalia and the forewing length (= f.l.), and
also regression equations, are required. They are shown in Fig. 4 and Table 2. Ratios of
genital size to forewing length are also shown in Table 2. It is clear that D. oraia is larger than
D. belisama in some genital sizes at the same f.l. Therefore it is concluded that the male
genitalia of Delias oraia are larger than those of Delias belisama. However, the values of the
correlation coefficients differ remarkably: the values for the dorsum and valva are
large, but the value for the phallus is markedly small.
                      D. belisama balina (n=35)   D. oraia bratana (n=24)   Comparison
                      mean ± S.E.                  mean ± S.E.
Forewing length       36.229 ± 0.328               37.500 ± 0.458            belisama < oraia*
Abdominal length      16.057 ± 0.201               16.750 ± 0.235            belisama < oraia*
Juxta length           0.662 ± 0.012                0.799 ± 0.016            belisama < oraia**
Ring long diameter     2.901 ± 0.027                2.995 ± 0.048            belisama < oraia
Ring short diameter    1.465 ± 0.021                1.623 ± 0.032            belisama < oraia**
Dorsum length          2.380 ± 0.027                2.593 ± 0.037            belisama < oraia**
Uncus width            0.697 ± 0.008                0.757 ± 0.011            belisama < oraia**
Valva length           4.440 ± 0.042                4.830 ± 0.061            belisama < oraia**
Valva width            2.323 ± 0.021                2.530 ± 0.035            belisama < oraia**
Phallus length         2.744 ± 0.021                3.082 ± 0.040            belisama < oraia**

Table 1. Sizes of male genitalia and comparison of both species.
(*: P<0.05, **: P<0.01 by t-test)
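Each comparison in Table 1 is a two-sample t-test per measurement. A minimal sketch that simulates specimen samples from the reported means and standard errors (the individual measurements are not listed in the paper) and applies Welch's variant of the test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# simulate phallus-length samples; SD recovered as SE * sqrt(n) (an assumption)
belisama = rng.normal(2.744, 0.021 * np.sqrt(35), size=35)
oraia = rng.normal(3.082, 0.040 * np.sqrt(24), size=24)

t, p = stats.ttest_ind(belisama, oraia, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4g}")                        # expect P < 0.01, as in Table 1
```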

Fig. 4 Correlation of forewing length (mm) and juxta height (A) or dorsum length (B) in Delias belisama balina (O) and D. oraia bratana (X)

Table 2. Correlation coefficients between the sizes of the male genitalia and forewing length, regression equations, and ratios to forewing length. (*: P<0.05, **: P<0.01 by t-test)
4. Discussion
Ratios are sometimes used for description because they are considered to be
comparatively constant, the effects of the environment being excluded. In this study it is
concluded, using ratios, that the male genitalia of Delias oraia are larger than those of Delias belisama.
On the other hand, I also found that the values of the correlation coefficients differ
remarkably. Regression equations of forewing length and the ratios (genital size /
forewing length), with correlation coefficients, are shown in Fig. 5. In this correlation chart,
the X axis is the forewing length and the Y axis is the ratio of each of two genital sizes to forewing
length. The lower regression lines show valva width and the upper lines show phallus
length. In the lower graph the ratio is constant over all forewing lengths; in the upper graph,
on the other hand, the ratio is not constant, decreasing along the X axis. What does this mean?
If the environmental effects on the genitalia and the forewing length were equal, these regression
lines would have to be parallel to the X axis, like the lower graph. In other words, the ratio would
have to be constant for all sizes of the forewing. Certainly the regression line of the valva is
constant, but that of the phallus is not, the ratio for the phallus decreasing as forewing length
increases. I consider this to mean that the phallus length is comparatively constant,
independent of forewing length. In other words, the environmental effects on the phallus and on the
forewing length are not equal. The sizes of these butterflies' phalli are rather constant and
are affected less than the forewing length by the environment. I consider that a character
that is hardly affected by the environment, as the phallus is in this case, is most important for
discussing speciation and biological evolution.
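The constancy check behind Fig. 5 can be phrased as a regression of the ratio on forewing length: a slope near zero supports equal environmental scaling, while a negative slope indicates an organ whose size is comparatively independent of body size. A minimal sketch with hypothetical measurements (not the paper's specimens):

```python
import numpy as np

fl = np.array([32.0, 34.0, 35.0, 36.0, 37.0, 38.0, 40.0])       # forewing length, mm
phallus = np.array([2.70, 2.72, 2.74, 2.75, 2.76, 2.77, 2.79])  # nearly constant organ

ratio = phallus / fl
slope, intercept = np.polyfit(fl, ratio, 1)
r = np.corrcoef(fl, ratio)[0, 1]
# a clearly negative slope and r, as the paper reports for the phallus ratio
print(f"ratio = {intercept:.4f} + {slope:.5f} * f.l., r = {r:.2f}")
```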
Fig. 5 Correlation of forewing length and ratios (valva width / f.l. and phallus length / f.l.) in D. belisama balina and D. oraia bratana

Acknowledgments

I gratefully acknowledge Prof. Dr S. Sakai, Daito Bunka University, Saitama (President
of the Biogeographical Society of Japan), who recommended and encouraged my talk
at this IFCS-96 Conference. I wish to express my gratitude to Dr H. Mohri, Director
General of the National Institute for Basic Biology, Okazaki, and Prof. Dr T. Nakazawa, The
University of the Air, Chiba, for their kind support and encouragement of my study.
I also express my hearty thanks to Dr N. Minaka, National Institute of Agro-Environmental
Sciences, Tsukuba, who gave me much helpful advice and critically read and
corrected the manuscript. I express my cordial thanks to Mr S. Sugi, Tokyo, who also
critically read and corrected the manuscript.

References:

D'Abrera, B. (1986): Butterflies of the Oriental Region, Part 3. Lycaenidae and Riodinidae, Hill House, Victoria, Australia.
Kimura, M. (1960): An Outline of Population Genetics, Baifukan, Tokyo (in Japanese).
Komatsu, T. (1996): Morphometric study on stochastic processes and deterministic processes in evolution. Why do insect genitalia differentiate?, Fifth Conference of the International Federation of Classification Societies, Abstracts, 2, 212-215.
Lande, R. (1976): Natural selection and random genetic drift in phenotypic evolution, Evolution, 30, 314-334.
Lynch, M. (1988): The rate of polygenic mutation, Genetical Research, 51, 137-148.
Morinaka, S. (1988): A Study on the Belisama Group of the Genus Delias from Bali, Indonesia (1), Tyo to Ga, 39, 2, 137-148 (in Japanese).
Morinaka, S. (1990): Ditto (2) - Comparison of Male Genitalia between Delias belisama balina and D. oraia bratana -, Tyo to Ga, 41, 3, 139-147 (in Japanese).
Morinaka, S. (1996): Comparison of some numerical data between the belisama group of the genus Delias Hübner (Insecta: Lepidoptera) from Bali Island, Indonesia, Fifth Conference of the International Federation of Classification Societies, Abstracts, 2, 302-305.
Morinaka, S. (1996): Comparison of some numerical data of genitalia relating to the speciation, 43rd Annual Meeting of the Lepidopterological Society of Japan, Abstracts, 50 (in Japanese).
Talbot, G. (1928-1937): A Monograph of the Pierine Genus Delias, British Museum (Natural History), London.
Yagishita, A., Nakano, S. and Morita, S. (1993): An Illustrated List of the Genus Delias Hübner of the World, Nishiyama, Y. (ed), Khepera Publishers, Singapore (in Japanese).
Yata, O. and Morishita, K. (1981): Butterflies of the South East Asian Islands, Vol. II. Pieridae, Danaidae, Tsukada, E. (ed), Plapac, Tokyo (in Japanese).
A method for classifying unaligned biological
sequences
Tallur B. 1, Nicolas J. 1
1 IRISA, Campus Universitaire de Beaulieu,
Avenue de Gen. Leclerc, 35042 Rennes cedex, France

Summary: It is needless to emphasize the importance of the classification of protein sequences
in molecular biology. Various methods of classification are currently being used by biologists
(Landes et al. 1992), but most of them require the sequences to be prealigned - and thus
to be of equal length - using one of the several multiple alignment algorithms available, so
as to make the site-by-site comparison of sequences possible. Two LLA-based approaches
for classifying prealigned sequences were already proposed (Lerman et al. (1994a)) whose
results compared favourably with most currently used methods. The first approach made
use of the "preordonnance" coding and the second one of the idea of "significant windows".
New directions of research leading to a clustering method free from this somewhat
strong constraint were also suggested by the authors. The present paper gives an account
of the recent developments of our research, consisting of a new method that gets round the
sequence comparison problem faced while dealing with unaligned sequences, thanks
to the "significant windows" approach.

1. Introduction
Biological sequences are composed of strings of letters belonging to a finite-size
alphabet. In the case of protein sequences, the alphabet A comprises 20 letters, each one
representing one amino acid:

A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}    (1)
Similarity between amino acids is one of the important factors to be considered while
computing the similarity between sequences. Several researchers have focussed their
attention on this problem and put forward similarity matrices (see, for example,
Dayhoff et al. (1983), George et al. (1990), Risler et al. (1988), Lerman et
al. (1994b)). The "profile matrix" considered by Gribscov (1987) is a particular
case of the standardized similarity matrix between letters used in the second approach of
classification proposed by Lerman et al. (1994a). The first approach of classification
described by the latter is based upon the preordonnance coding and necessitates site-by-site
comparison of sequences, which is well adapted for aligned sequences. However,
the high variation in sequence length renders the overall comparison of sequences very
difficult, and the site-by-site comparison altogether impossible, in the case of no
prealignment. The "significant windows" approach for classifying unaligned sequences may
be summarized as follows: A fixed-size window is made to slide along the sequences
to be compared. Each window (i.e. subsequence delimited by each window position)
of the shorter sequence is compared with the set of all windows of the longer one
and the "most significant window" - with respect to a proper null hypothesis of
independence - is selected by means of beam search. A similarity index is then
defined as a function of the similarities resulting from the site-wise comparison with the
most significant window. This index depends on some parameters such as the window
size and the window significance level, whose values need to be chosen by the users, and
also on some other parameters, such as the percentage of homologous sites, that may be
estimated from the data. Finally, the similarity index is standardized with respect
to its observed distribution over the set of sequence pairs, and hierarchical clustering
using the LLA program (Lerman et al. (1993)) is performed.

2. Similarity between sequences

The sequences being of unequal length, their site-by-site comparison is impossible,
and hence the pairwise comparison of sequences will be carried out by first selecting
the "significantly comparable" windows and then using the site-by-site similarity
between the latter for measuring the overall similarity between sequences. Let us
consider two sequences of lengths L1 and L2, respectively, with L1 < L2:

a_1, a_2, ..., a_{L1}   and   b_1, b_2, ..., b_{L2}

Let D = ((D_ij)), 1 ≤ i, j ≤ 20, be a matrix of similarity scores associated with all possible
letter pairs over the 20-letter alphabet A (such as, for instance, Dayhoff's mutation
data matrix (PAM 250)), where D_ij is the score of the letter pair (i, j). Let
M = ((M_ij)), 1 ≤ i, j ≤ 20, be a "match matrix" where M_ij takes the value
1 if the i-th letter "matches" with the j-th letter and 0 otherwise. We may build
the match matrix in several ways and, in particular, by considering a classification
of the set of letters. A window of fixed size l is made to slide along each of the
sequences compared. Let us denote by w_i1 the i-th window of the first sequence (i.e.
the subsequence a_i, a_{i+1}, ..., a_{i+l-1}) and by w_j2 the j-th window of the second sequence
(i.e. the subsequence b_j, b_{j+1}, ..., b_{j+l-1}). The similarity score S(w_i1, w_j2) of a window
pair (w_i1, w_j2), for 1 ≤ i ≤ (L1 - l + 1) and 1 ≤ j ≤ (L2 - l + 1), is defined as the sum
of the similarity scores of the corresponding letter pairs, as given by the matrix D, i.e.

S(w_i1, w_j2) = Σ_{k=0}^{l-1} D(a_{i+k}, b_{j+k})    (2)

But as a matter of fact, we will rather consider the standardized scoring matrix D^s
instead of D (see Lerman et al. (1994a)).
2.1 Significant window pairs
Consider the window pair (w_i1, w_j2). The total number m of letter matches occurring
in this pair is

m = Σ_{k=0}^{l-1} M(a_{i+k}, b_{j+k})    (3)

Under the null hypothesis that the letters in both windows are randomly and independently
distributed, the random variable associated with the number m is distributed
as a Binomial(l, p) variable, where p is the probability of one match occurring in the
sequence pair. The parameter p may be estimated from the observed frequencies of
letters in the sequences as follows:

p = Σ_{(i,j) : M_ij = 1} freq(i, seq1) · freq(j, seq2)    (4)

where (i, j) ranges over the set of all letter pairs (a_i, b_j), freq(i, seq1) is the relative frequency of
the letter a_i in sequence1 and freq(j, seq2) is the relative frequency of the letter b_j
in sequence2. The window pair (w_i1, w_j2) is said to be significant or significantly
comparable at level α if m_o > U, where m_o is the observed number of matches and
U is an integer determined such that

Prob[m > U] ≤ α    (5)

2.2 Determination of the significant region of comparison
Computation of the similarity measure between the given sequences is carried out
according to an original idea that may be summarized as follows:

1. Determine the pertinent area (biologically speaking) in each of the sequences, to
which the search will be limited. As this information is generally not available,
we have considered the two following solutions:
• The first one is based on two hypotheses: (1) the homology h (i.e. the percentage
of homologous sites among the sequences) is either known or estimated
from the data; (2) the pertinent or biologically meaningful area is uniformly
distributed over the sequence.
• The second solution is obtained by using the maximum predictable
classification (Lebbe and Vignes 1992, 1993), based upon the idea of local
predictability of sequence areas from their contexts.

2. Compare - using the beam search technique - each window inside the pertinent
area of the shorter sequence (i.e. sequence1) to all the windows inside the
pertinent area of the longer one (i.e. sequence2).

3. Rule out the window pairs that are not significantly comparable by means of
the binomial test described in section 2.1.

The significantly comparable window pairs alone will contribute to the overall similarity
between sequences. For each window of sequence1, the best score associated
with a window of sequence2 (among those selected as above) is considered, and a
"rough" similarity index between the two given sequences is obtained by averaging it
over all chosen window pairs. The standardized similarity index between sequences is
then obtained by standardizing the "rough" index, first with respect to its empirical
distribution over the set of window pairs and finally over the set of all sequence pairs.
The matrix of the probabilistic similarity indices required by LLA is obtained by
applying the standard normal cumulative distribution function, as in the case of prealigned
sequences.
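A sketch of that indexing chain - the mean of the best window scores, two standardizations, then the normal cdf of eq. (6) below - with hypothetical score values throughout:

```python
import numpy as np
from scipy.stats import norm

# best significant-window score for each window of sequence1 (hypothetical)
best_scores = np.array([1.8, 2.4, 0.9, 3.1])
rough = best_scores.mean()                      # "rough" similarity index

# first standardization: over the empirical distribution of window-pair scores
window_scores = np.array([0.2, 0.5, 1.8, 2.4, 0.9, 3.1, 1.1])   # hypothetical
q1 = (rough - window_scores.mean()) / window_scores.std()

# second standardization: over the rough indices of all sequence pairs (hypothetical)
all_pairs = np.array([-0.4, 0.1, 0.8, q1, -1.0])
q = (q1 - all_pairs.mean()) / all_pairs.std()

P = norm.cdf(q)   # probabilistic similarity index fed to the LLA clustering
print(rough, q, P)
```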
2.3 Aggregation criterion used in the LLA method
The basic data required by the LLA method of hierarchical classification is the matrix
of probabilistic similarity indices between the sequence pairs. To fix ideas, let
us consider the set O of all sequences, and the probabilistic similarity between the
sequences o1 and o2 given by the equation

P(o1, o2) = Φ(Q_s(o1, o2)),  (o1, o2) ∈ P2(O)    (6)

where Φ, Q_s and P2(O) denote respectively the standard normal cumulative distribution
function, the standardized similarity index between sequences (defined in section
2.2) and the set of all possible sequence pairs. The algorithm builds a classification
tree iteratively, by joining together at each step the two (or more in case of ties) most
similar sequences or classes of sequences until all clusters are merged together. Thus
the aggregation criterion that is maximized at each step or "level" of the algorithm
is expressed as a similarity measure between two clusters.

Figure 1: Classification of cytochrome sequences with the significant windows approach, Dayhoff matrix, window size = 7, α = 0.10, percentage of homology = 70.
Figure 2: Classification of globin sequences with the significant windows approach, Dayhoff matrix, window size = 7, α = 0.10, percentage of homology = 60.

Suppose that C and D are
any two arbitrary disjoint subsets (or clusters) of O comprising respectively r and s
elements. Then a family of criteria of the "maximal link likelihood" type is defined by the
following measure of similarity between C and D:

LL_γ(C, D) = [max{P(c, d) | (c, d) ∈ C × D}]^((r×s)^γ),  0 ≤ γ ≤ 1    (7)

In the case of our data sets, γ = 0.5 was found to yield the best results.
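Reading eq. (7) as the maximal link probability moderated by the cluster sizes, a toy agglomeration loop can be sketched as follows (P is a hypothetical symmetric matrix of probabilistic indices; γ = 0.5 as in the paper):

```python
import numpy as np

def ll_gamma(P, C, D, gamma=0.5):
    """LL_gamma(C, D) = [max P(c, d)] ** ((|C| * |D|) ** gamma), cf. eq. 7."""
    best = max(P[c, d] for c in C for d in D)
    return best ** ((len(C) * len(D)) ** gamma)

rng = np.random.default_rng(2)
P = rng.random((5, 5)); P = (P + P.T) / 2     # toy probabilistic similarity matrix
clusters = [[i] for i in range(5)]
while len(clusters) > 1:
    # merge the pair of clusters maximizing the aggregation criterion
    i, j = max(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: ll_gamma(P, clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    print(clusters)                            # one "level" of the tree per merge
```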
3. Applications
The experiments on the protein sequences belonging to the cytochrome and globin
families were carried out using different amino acid similarity matrices (e.g. Dayhoff et
al. (1983), Risler et al. (1988), among others) and different values of the parameters
such as window size, level of significance for window comparison and percentage of
homology. The hierarchical classification method based on the LLA approach was
used. Similar experiments on the aa-tRNA ligase (also known as aminoacyl-tRNA
synthetase) family of sequences were also conducted. The results were found to be rather
unaffected by most of the above parameters, whereas the choice of the relevant
sequence areas retained for comparison proved to be of utmost importance.
3.1 Sequences from cytochrome and globin families
A set of 89 sequences belonging to the cytochrome family was classified using the
similarity index described in section 2 and the LLA method of hierarchical classification.
The most significant level was found to be the 86th, where three main classes may be
distinguished (see figure 1). Two of them group together the bacterial cytochromes
and the third one is split into two subclasses corresponding respectively to the plant and
animal families.

Figure 3: Classification of aa-tRNA ligases with the significant windows approach, Lerman's matrix, window size = 12, sequence area selection by maximum predictable classification.

Figure 4: Classification of class I aa-tRNA ligases from E. coli with Risler's matrix, window size = 9, sequence area selection by maximum predictable classification.

As for the globin family, a set of 42 sequences was classified and
figure 2 displays the corresponding classification tree. It may be observed that at
the 35th level of the tree - which is the most significant according to Lerman's global statistic
(Lerman et al. 1993) - 7 classes are clearly visible. Two of them are characterized by
the vertebrates (V), the others being characterized by bivalves (B), plants (P),
annelids (An), gastropods (G) and arthropods (Ar). All arthropods but artemia are very
clearly separated from the rest of the species; artemia is an exception, being classified
among the vertebrates. Similarly, excepting glycera, all the annelids are nicely put
together, and the bacterial hemoglobin ggzlb, which is very "neutral", is associated
with the gastropods. It may also be noticed that the two classes of vertebrates are
not joined quickly enough. Notwithstanding the above remarks, the results are globally
quite satisfactory and are comparable to those produced by the best methods
available.
3.2 Sequences from the aa-tRNA ligase family
It is well known, from the biological standpoint, that the ligases may be considered
as belonging to one of two groups (see Eriani et al. (1990), Landes et al. (1995)).
Class I, comprising the sequences that recognize the amino acids Met, Ile, Leu, Val,
Cys, Arg, Gln, Glu, Tyr and Trp, seems to be the most homogeneous; three
subgroups may be distinguished therein: {Met, Ile, Leu, Val, Cys, Arg}, {Gln, Glu} and {Tyr,
Trp}. The second group, corresponding to the amino acids Ser, Pro, Thr, Asp, Lys,
His, Ala, Gly and Phe, is the least structured. The aim of our experiment was to validate
our method by applying it to a test data set and producing results that are as close as possible
to the biological knowledge to date. The first test data set was made
up of 65 sequences of aa-tRNA ligases belonging to various species and was particularly
hard to classify, due to the very high variation in the length of the ligases of different amino
acids for the same species on the one hand, and in the length of the ligase of the same amino acid
for different species on the other. For instance, for E. Coli the length of TrpRS was
334 and that of ValRS was 951, whereas the length of GlnRS was 554 for E. Coli and
809 for Saccharomyces cerevisiae. It was found that the selection of suitable sequence
areas using the "maximum predictability classification" (Lebbe and Vignes (1993),
Lebbe and Vignes (1992)) was particularly useful in this case. Figure 3 illustrates
the hierarchical classification tree obtained by this method. The results are not fully
satisfactory in that they do not agree perfectly with the biological classification given
above. This is in fact explainable by a strong inter-species variation. The second test
data set, containing the sequences of the class I aa-tRNA ligases all belonging to the same
species, namely E. Coli, yielded much better results. Figure 4 shows clearly the three
subgroups known to the biologists.
4. Conclusion and further research
A hierarchical classification method based on the significant windows approach for
classifying unaligned biological sequences has been presented, and the results of the
experiments with several not-easy-to-classify data sets have been described. On the
whole, the results are close to the present knowledge of the phylogeny of the corresponding
species. In the case of the aa-tRNA ligases, it was shown that the quality of the results
depends on the consideration of the biologically pertinent sequence areas as well as
on the degree of inter-species variation. A further direction of this research would be to
improve the sensitivity of our method by a refinement of the clustering strategy on
the one hand, and to reduce the need for parameter tuning (window size, similarity
matrix etc.) on the other.

References
Abe, K., Gita, N. (1982): Distances between strings of symbols: Review and remarks. ICPR6, Munich.
Barker, W.C., Hunt, L., George, D. (1988): Protein Seq. Data Anal., 1, 363.
Dayhoff, M.O., Barker, W.C., Hunt, L.T. (1983): Methods Enzymol., 91, 524-545.
Dickerson, R., Geis, I. (1983): Hemoglobin, Benjamin/Cummings, Menlo Park, CA.
Eriani, G., Delarue, E., Poch, O., Gangloff, J., Moras, D. (1990): Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequence motifs. Nature, 347, 203-206.
George, D., Barker, W., Hunt, L. (1990): Mutation Data Matrix and Its Uses. Methods Enzymol., 183, 313-330.
Gribscov, M., McLachlan, A., Eisenberg, D. (1987): Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci., 84, 4355-4358.
Landes, C., Henault, A., Risler, J.L. (1992): A comparison of several similarity indices used in the classification of protein sequences: a multivariate analysis. Nucleic Acids Research, 20, 3631-3637.
Landes, C., Perona, J.J., Brunie, S., Rould, M.A., Zelwer, C., Steitz, T.A., Risler, J.-L. (1995): A structure-based multiple sequence alignment of all class I aminoacyl-tRNA synthetases. Biochimie, 77, 194-203.
Lebbe, J., Vignes, R. (1992): Sélection d'un sous-ensemble de descripteurs maximalement discriminant. Troisièmes journées Symbolique-Numérique, Université de Paris Dauphine.
Lebbe, J., Vignes, R. (1993): Local predictability in biological sequences, algorithms and application. Biochimie, 75, 371-378.
Lerman, I.C., Peter, Ph., Leredde, H. (1993): Principes et calculs de la méthode implantée dans le programme CHAVL (Classification Hiérarchique par Analyse de la Vraisemblance des Liens). La revue de Modulad, 12, Dec. 93.
Lerman, I.C., Nicolas, J., Tallur, B., Peter, Ph. (1994a): Classification of aligned biological sequences. New Approaches in Classification and Data Analysis, Springer Verlag, Berlin.
Lerman, I.C., Peter, Ph., Risler, J.L. (1994b): Matrices AVL pour la classification et l'alignement de séquences protéiques. Publication IRISA No 866, IRISA, Rennes, France.
Risler, J.L., Delorme, M.O., Delacroix, H. and Henault, A. (1988): Amino acid substitutions in structurally related proteins: a pattern recognition approach. Determination of a new and efficient scoring matrix. Journal of Molecular Biology, 204, 1019-1029.
An Approach to Determine the Necessity of
Orthognathic Osteotomy or Orthodontic Treatment
in a Cleft Individual
- comparison of craniomaxillo-facial structures in borderline
cases by roentgenocephalometrics -

Sumimasa Ohtsuka 1, Fumiye Ohmori 1, Kazunobu Imamura 1,


Yoshinobu Shibasaki 1 and Noboru Ohsumi 2
1Department of Orthodontics, Showa University School of Dentistry
2-1-1 Kitasenzoku, Ohta-ku
Tokyo 145, Japan

2The Institute of Statistical Mathematics


4-6-7, Minami-Azabu, Minato-ku
Tokyo 106, Japan

Summary: At an early stage of growth, it seems nearly impossible to predict whether the
malocclusion of a cleft patient can be corrected by surgical or by orthodontic treatment.
This study was done to obtain a clue for deciding which of the two treatment plans should
be adopted for a so-called "borderline case", by applying a method of data analysis to
roentgenocephalograms obtained longitudinally.
The subjects were unilateral cleft lip and palate patients, who were divided into two groups.
One was the OPE group, corrected by orthodontic treatment with orthognathic
osteotomy; the other was the Non-OPE group, corrected by orthodontics only.
The cephalograms were used to evaluate some characteristics of the maxillofacial structures.
The results showed a possibility of identifying the difference between the two groups by
utilizing some parameters at an early stage of growth.

1. Introduction
Cleft patients have severe dental problems related to their abnormal facial structures,
disturbed facial growth patterns and tooth anomalies; therefore their habilitation is needed
from childhood to adulthood. Early orthodontic treatment is often indicated in order to
change an unfavorable growth pattern and to correct abnormal oral functions such as speech,
mastication and swallowing. A majority of the treatment objectives can in some cases be
achieved through orthodontic treatment alone, while for others surgical treatment must be
applied in the long run. In adulthood, a combined approach of orthodontics
and surgery, such as orthognathic osteotomy, would be the best way for a patient with
severe maxillo-mandibular three-dimensional disharmony which could not be treated by
orthodontic treatment alone. However, the selection of orthodontic or surgical-orthodontic
treatment remains subjective in nature, which often results in forced, long, continuous
orthodontic treatment of surgical cases among the borderline cases. Besides, the decision for
or against surgery is multifactorial, related not only to the maxillo-mandibular
relationship but also to the occlusion, the soft tissue profile and the consent of the patient to the
surgery. If we could determine the treatment plan for a surgical case earlier, before the patient
reaches maturity, we could avoid forcing long-term treatment of growth control which
must finally be useless at the time of surgery. The earlier, the better.
This study was designed to investigate cephalometrically, on a longitudinal basis, the
possibility of growth prediction for surgical-orthodontic treatment in cleft patients at as
early a growth stage as possible, with the aid of cephalometric analysis (Fig. 1).

Fig. 1. Concept of the study: borderline cases lie between the Non-OPE case and the OPE case. How can an orthognathic surgical case be predicted at its early growth stage? What parameters detect the surgical case?

2. Materials and methods


2.1 SUbjects
The subjects were 67 Japanese unilateral cleft lip and palate (UCLP) patients (37 males and 30 females) at the Department of Orthodontics, Showa University Dental Hospital. Two orthodontists, each with more than ten years of clinical experience in treating CLP patients, judged the 67 subjects to be borderline cases from the condition of the malocclusion, using dental cast models. The selection criteria for borderline cases were based on the severity of the malocclusion: all cases had a slight anterior crossbite before treatment in the early mixed dentition. Cases that could obviously be diagnosed as surgical because of a severe maxillo-mandibular relationship, as well as cases favorable enough not to need surgery in the future, were excluded. Finally, 33 of the patients had orthodontic treatment with osteotomy (OPE group) and 34 had orthodontic treatment only (Non-OPE group).
The lateral radiographic cephalograms, dental casts and hand-wrist radiographs were assessed longitudinally for both groups. Craniofacial structures were measured on the cephalograms at three stages, namely the early mixed dentition, the late mixed dentition and the adult dentition, defined according to bone maturation on the hand-wrist radiographs. Bone maturity, as an index of growth, was quantitatively expressed as a percentage from 0 to 100 by Ryokawa's method. The stages are as follows:
stage A: early mixed dentition (Hellman dental stage IIIA), bone maturation 50-60%, about 6 years of age;
stage B: late mixed dentition (dental stage IIIB), bone maturation 60-70%, roughly the onset of adolescent growth;
stage C: permanent dentition (dental stage IIIC), bone maturation 90-100%, roughly the completion of bone growth, when it is clear whether treatment should be completed by orthodontics alone or combined with orthognathic surgery.

2.2 Cephalometric analysis


Cephalograms provide a quantitative medium for describing dynamic changes in the patient and for growth studies of the dentofacial pattern in general.
Lateral cephalograms were taken with the same x-ray device and by a single technician. The focus-median-plane distance was 150 cm and the film-median-plane distance was 10 cm, giving an enlargement of 10%. No correction was made for this radiographic enlargement, as it affected all the cephalograms of both groups in the same way.
In longitudinal cephalometric studies on growing subjects, reference lines should be traced through stable craniofacial structures. The radiographs were traced and the following landmarks were identified or constructed: sella turcica (S), nasion (N), orbitale (Or), anterior nasal spine (ANS), point A (A), point B (B), pogonion (Pog), gnathion (Gn), menton (Me), gonion (Go), articulare (Ar), condylion (Cd), porion (Po), posterior nasal spine (PNS).
The following reference planes were adopted: S-N plane (connects S to N), A-B line (connects A to B), Facial plane (N to Pog), FH plane (Po to Or), Ramus plane (Ar to the posterior border of the mandibular ramus), Y-axis (S to Gn), Mandibular plane (Me to the lower border of the mandible), Palatal plane (ANS to PNS). The definitions of all these landmarks and planes correspond to those given by Downs, Riedel and associates. The coordinates of each landmark on each cephalogram were recorded by means of a WACOM digitizer interfaced with a NEC PC-98Vm computer. The output values for each point were stored in coordinate representation on a disk for computer analysis. Nine linear and seventeen angular measurements were selected for the quantitative cephalometric evaluation (Figs. 2, 3).
Linear measurements for the assessment of
  cranial base dimensions: N-S.
  maxillary dimensions: N-ANS, S'-Ptm', A'-Ptm'.
  mandibular dimensions: N-Me, Gn-Cd, Pog'-Go, Cd-Go.
Angular measurements for the assessment of
  maxillary dimensions: SNA (S-N-A).
  mandibular dimensions: SNB (S-N-B), SNP (S-N-Pog), mandibular plane angle, gonial angle, ramus inclination, Y-axis angle, facial plane angle.
  intermaxillary relationship: ANB (A-N-B), A-B plane, convexity (N-A-Pog).

Fig. 2. Standard landmarks of roentgenocephalometrics.



Fig. 3. Standard reference lines of roentgenocephalometrics.
1, S-N plane (S to N); 2, FH plane (Po to Or); 3, Palatal plane (ANS to PNS); 4, A-B plane (A to B); 5, Facial plane (N to Pog); 6, Y-axis (S to Gn); 7, Ramus plane (Ar to constructed gonion); 8, Mandibular plane (Me to constructed gonion).

2.3 Statistical evaluation


Cephalogram data were evaluated using StatView II and JUSE/MA1, statistical packages for Apple and NEC-compatible personal computers. Initially, all variables were tested for validity and robustness. Then the morphological differences between the OPE and Non-OPE groups at each stage were tested by means of a nonparametric test, the Mann-Whitney U test. Moreover, the cephalometric data were analyzed by a multivariate statistical approach, discriminant analysis. Fisher-type linear discriminant analyses were carried out, using the treatment group (OPE or Non-OPE) as the dependent variable at each growth stage. Some variables were then selected as effective parameters for identifying the OPE group.
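As a concrete illustration of this two-step procedure, the following sketch reproduces it with modern open-source tools (scipy and scikit-learn) rather than with StatView II or JUSE/MA1; the data-frame layout and variable names are assumptions made for the example.

import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def analyze_stage(df: pd.DataFrame, variables: list) -> None:
    """df: one row per patient, a 'group' column ('OPE'/'Non-OPE')
    and one numeric column per cephalometric measurement."""
    ope = df[df["group"] == "OPE"]
    non = df[df["group"] == "Non-OPE"]
    # Nonparametric comparison of each measurement between the two groups.
    for v in variables:
        u, p = mannwhitneyu(ope[v], non[v], alternative="two-sided")
        print(f"{v}: U = {u:.1f}, p = {p:.4f}")
    # Fisher-type linear discriminant analysis, group as dependent variable.
    lda = LinearDiscriminantAnalysis().fit(df[variables], df["group"])
    correct = (lda.predict(df[variables]) == df["group"]).mean()
    print(f"correctly classified: {100 * correct:.1f}%")

Such a function would be called once per growth stage (A, B, C) and per sex, mirroring the stratification used in the study.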

3. Results
3.1 Descriptive statistics for the differences between the OPE and Non-OPE groups
Table 1 summarizes the results of the nonparametric statistical comparison of the OPE and Non-OPE groups. At stage A in males, the gonial angle (Ar-Go-Me) was significantly larger in the OPE group (p<0.001), and the mandibular ramus inclination angle (FH-Ramus plane) was significantly smaller in the OPE group (p<0.05). In females, the SNB angle, which represents the anterior limit of the mandibular basal arch in relation to the anterior cranial base, was significantly larger in the OPE group (p<0.01), while the mandibular plane angle and the Y-axis angle were significantly smaller in the OPE group (p<0.01).
At stage B in males, the mandibular plane angle and the gonial angle were larger in the OPE group (p<0.05, p<0.001), while the ramus inclination angle, the posterior position of the maxilla relative to the cranial base (S'-Ptm') and the mandibular ramus length (Cd-Go) were significantly smaller in the OPE group. In females, both SNB and the Y-axis angle showed significant differences, as at stage A (p<0.01, p<0.05), and the linear measurements of the total length of the mandible (Gn-Cd) and of the mandibular ramus were significantly larger in the OPE group (p<0.01, p<0.05).
At stage C in males, both SNB and the gonial angle were significantly larger in the OPE group (p<0.05, p<0.001), while the ramus inclination and the ramus length were significantly smaller in the OPE group (p<0.05). In females, SNB and the Y-axis angle showed significant differences between the OPE and Non-OPE groups (p<0.05, p<0.01), and the mandibular total length and ramus length were significantly larger in the OPE group (p<0.05).

Table 1. Mean values of the variables showing significant morphological differences between the OPE and Non-OPE groups at each growth stage, for males and females.

                     male                             female
         variable   OPE     Non     p      variable   OPE     Non     p
stage A  Gonial     134.1   125.9   ***    SNB        78.9    75.8    **
         Ramus      79.1    82.3    *      Mand P     30.7    35.2    **
                                           Y-axis     63.2    66.2    **
stage B  Mand P     33.9    30.5    *      SNB        78.1    75.0    **
         Gonial     132.2   125.3   ***    Y-axis     64.1    67.3    *
         Ramus      81.7    85.2    *      Ramus      81.7    85.2    *
         S'-Ptm'    17.2    19.8    **     Gn-Cd      107.9   103.0   **
         Cd-Go      51.2    55.4    **     Cd-Go      51.0    48.2    *
stage C  SNB        76.0    72.8    *      SNB        78.7    74.9    *
         Gonial     130.8   124.6   ***    Y-axis     63.8    67.6    **
         Ramus      81.8    85.8    *      Gn-Cd      116.0   110.1   *
         Cd-Go      5J..!L  61.8    *      Cd-Go      55.5    51.6    *

*** p<0.1%, ** p<1% and * p<5% by the Mann-Whitney U test

3.2 Discriminant analysis of all cases

Discriminant analysis was carried out at each stage; the eligible variables and the percentage correctly classified, however, differed somewhat between stages. In males, the variables with an F-ratio above 2.0 at stage A were the mandibular plane (MP), the gonial angle, the ramus inclination and Cd-Gn; at stage B they were MP, the gonial angle, the ramus inclination, N-ANS and S'-Ptm'; at stage C they were the convexity, the A-B plane, the SNP angle, SNB, ANB, the gonial angle, the ramus inclination, Pog'-Go and Cd-Gn. The same findings were observed in the female samples. Therefore, from the clinical point of view, eligible variables common to all stages were selected. For males, a 3-factor model composed of the gonial angle, the ramus inclination and Cd-Go was generated, giving 77.8, 83.3 and 83.8 percent correct classification of the OPE group at the three growth stages. For females, a 4-factor model composed of the Y-axis angle, SNB, the ramus inclination and Cd-Gn was generated, giving 76.7, 74.1 and 78.6 percent correct classification of the OPE group (Table 2).
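The variable-screening step described above (retaining variables whose between-group F-ratio exceeds 2.0 before fitting the discriminant function) can be sketched as follows; this is an illustrative reconstruction under assumed variable names, not the MA1 procedure itself.

import pandas as pd
from scipy.stats import f_oneway
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def screen_and_fit(df: pd.DataFrame, variables: list, threshold: float = 2.0):
    """Keep variables with a two-group F-ratio above `threshold`,
    then fit an LDA model on the retained subset."""
    ope = df[df["group"] == "OPE"]
    non = df[df["group"] == "Non-OPE"]
    kept = [v for v in variables
            if f_oneway(ope[v], non[v]).statistic > threshold]
    lda = LinearDiscriminantAnalysis().fit(df[kept], df["group"])
    rate = (lda.predict(df[kept]) == df["group"]).mean()
    return kept, 100 * rate  # retained variables, % correctly classified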

Table 2. Discriminant analysis generated by MA1: percentage of cases correctly classified into the OPE group, for males and females.

                       male                 female
stage A                77.8                 76.7
stage B                83.3                 74.1
stage C                83.8                 78.6
Predictive variables   Gonial angle         Y-axis
                       Ramus inclination    SNB
                       Cd-Go                Ramus inclination
                                            Cd-Gn

4. Discussion
Early treatment of the cleft patient's malocclusion has generally been recommended by many authors for its favorable results on growth and the occlusal relationship. However, some disadvantages that may be encountered with early dentition treatment are: (1) early treatment does not always bring easier and better results; (2) the patient's cooperation may deteriorate over long periods of active treatment; and (3) family finances may also influence the length and timing of treatment. If the maxillofacial relationship at the completion of growth could be predicted at an early growth stage such as childhood, wasted treatment and the suffering that accompanies it could be avoided. There are many reports on the criteria for deciding whether skeletal Class III patients should receive orthodontic or surgical-orthodontic treatment. However, these reports concern mostly adults; few deal with growing young people, especially children. The main reason may be that it is very hard to predict skeletal changes due to growth and orthodontic treatment at the early growth stage.
On the other hand, the number of variables expressing morphological differences between the two groups increased with rising bone maturity, which suggests that growth prediction becomes easier with age. At stage C, when growth is almost complete, it is easier to decide whether treatment can be completed by orthodontics alone or must be combined with orthognathic osteotomy, because growth has stopped. In the present study, a methodologically sound approach to the evaluation of borderline cases with malocclusion was therefore initiated.
The cephalometric analysis we applied was suitable for a geometric evaluation of the maxillofacial components. Several significant differences in craniofacial skeletal structures were found between the OPE and Non-OPE groups. The following results were obtained.
1. There were significant morphological differences in the dentofacial complex, especially in the mandible, between the OPE and Non-OPE groups in both sexes. No significant differences in the maxillary components were noted between the groups.
2. The differences became clearer with growth.
3. The male OPE group was characterized by a short, anteriorly positioned ramus with a wide gonial angle, whereas the female OPE group showed a dominant anterior growth direction of the mandible and a larger mandibular size (Fig. 4).
4. The morphological information from the mandible was useful for identifying future surgical cases.

Fig. 4. Schematic morphological characteristics of male and female OPE cases (labels: Ramus, Cd-Go, SNB). The male orthognathic surgical case has a short ramus and a downward-rotated mandible with a downward growth direction; the female surgical case, in contrast, shows a larger mandible with a forward growth pattern.

Comparing these data with the standard data obtained from subjects with normal occlusion by Iizuka and Ishikawa, the male gonial angle and ramus inclination were larger than the standard. On the other hand, the female Y-axis and ramus inclination were almost the same as the standard, except for SNB, which was larger by 2 degrees. Since the samples differ substantially in size and in the criteria used to define the growth stages, the present study and the standard sample cannot easily be compared here.
In our opinion, a fundamental question arises from the obtained data: what were the reasons for the morphological differences between male and female orthognathic surgical cases? Moreover, why were the differences found almost entirely in the mandibular rather than in the maxillary components? Several factors have to be considered. One is the sampling in this study. The borderline cases were selected on the basis of the severity of a malocclusion with anterior and lateral crossbite; the data indicate the similarity of the malocclusions with respect to the intermaxillary relationship, in that the ANB and SNA angles show no significant difference between the two groups. In fact, there may be several morphological patterns among cleft patients. The male operation group shows a downward-rotated mandible, while the female cases show a typical skeletal Class III with relative overgrowth of the mandible. In the former, it is very difficult to correct the anterior crossbite by backward rotation of the mandible, since further rotation results in a long face and a shallow overbite. In the latter, the overgrowth of the mandible is not easily controlled, because of its size, even if growth control is begun at an early growth stage. Treatment planning could be another main cause influencing the results, since the decision on a treatment plan may depend on several factors: the intermaxillary relationship, such as the ANB angle; the soft tissue profile, from the aesthetic point of view; tooth movement during orthodontic treatment; and consent to surgery by the patient and parents. It can therefore be said that the borderline case is multifactorial. This issue may be settled by collecting more borderline cases, which should be separated and examined factor by factor.


The previous study suggested that there are some effective parameters for distinguishing surgical from non-surgical orthodontic treatment in a cleft individual, and that prediction of the surgical case may be possible at an earlier growth stage. After the patients have been treated for a period of time from the first examination in the mixed dentition, re-examination at about the onset of the pubertal growth spurt, as assessed on the hand-wrist radiographs, may be of advantage in predicting future treatment procedures.

5. Concluding remarks
There were significant morphological differences in the dentofacial complex, especially in the mandible, between the OPE and Non-OPE groups in both sexes, and these differences became clearer and clearer with growth. The male OPE group was characterized by a short, anteriorly positioned ramus with a wide gonial angle, while a dominant anterior growth direction of the mandible and a larger mandibular size were seen in the females. The morphological information from the mandible was useful for identifying future surgical cases. The previous study suggested that there are some effective parameters for distinguishing surgical from non-surgical treatment in a cleft individual, and the possibility of predicting the surgical case at an early growth stage was indicated. Because of the small sample size we could not draw a definitive conclusion; we would therefore like to collect more varied cases and examine them in detail. Although growth prediction is a very hard problem, we will pursue it further in the future.

References:
Battagel, J.M. (1994): The identification of Class III malocclusions by discriminant analysis, European Journal of Orthodontics, 16, 71-80.
Cassidy, D.W. et al. (1993): A comparison of surgery and orthodontics in "borderline" adults with Class II, Division 1 malocclusions, American Journal of Orthodontics and Dentofacial Orthopedics, 104, 455-470.
Downs, W.B. (1948): Variation in facial relationships: their significance in treatment and prognosis, American Journal of Orthodontics, 34, 812-840.
Iizuka, T. and Ishikawa, F. (1950): Points and landmarks in head plates, The Journal of Japan Orthodontic Society, 16, 66-75.
Sinclair, P.M. et al. (1993): Combined surgical and orthodontic treatment, in: Contemporary Orthodontics, 2nd ed., Proffit, W.R. (ed.), 607-631, Mosby Year Book, St. Louis.
Tollaro, I., Baccetti, T. and Franchi, L. (1996): Craniofacial changes induced by early functional treatment of Class III malocclusion, American Journal of Orthodontics and Dentofacial Orthopedics, 109, 310-318.
Valko, R.M. (1968): Indications for selecting surgical or orthodontic correction of mandibular protrusion, Journal of Oral Surgery, 26, 230-238.

(†) This study was carried out in part under the ISM Cooperative Research Program (95-ISM-CRP-A54).

Data Analysis for Quality of Life and Personality
Kazue Yamaoka 1 and Mariko Watanabe 2

1 Department of Hygiene and Public Health, Teikyo University School of Medicine, 2-11-1 Kaga, Itabashi-ku, Tokyo 173, Japan

2 Department of Food Science, Showa Women's University Junior College, 1-7 Taishido, Setagaya-ku, Tokyo 154, Japan

Summary: In order to examine the influence of personality on Quality of Life (QOL) measurement, a questionnaire survey was conducted and analyzed. Using Hayashi's Quantification Method Type III, the structure of the QOL questionnaire was confirmed, and QOL scores were calculated on the basis of this structure. The analysis of the association between QOL and personality indicated that the QOL scores of tolerable-type subjects were greater than those of intolerable-type subjects. Personality type is therefore thought to be a possible confounding factor in studies related to QOL.

1. Introduction
In the last decade, interest has increased in quality of life (QOL) measures in four broad health contexts, such as measuring the health of a population, assessing the benefit of alternative uses of resources, and making decisions on treatment for an individual patient. Each context requires an assessment of the impact of ill health on aspects of the everyday life of the individual. Increased attention is now being given to the importance of QOL, and methods of QOL measurement have accordingly been developed (Cox, et al., 1992). Yamaoka, et al. (1994) developed the QOL20 questionnaire for the measurement of the non-disease-specific QOL of a patient.
In the present study, we focused on the data analysis of problems related to QOL measurement and its association with personality type; special attention was given to the classification of subjective attitudes measured by a questionnaire survey on the basis of a structure analysis.

2. Study subjects and Method

2.1 Working Hypothesis

In general, there is a possibility that personality influences subjective responses to a QOL questionnaire, so we focused on the influence of personality on QOL measurement. That is, since the QOL20 (Yamaoka, et al., 1994) is a subjective measurement, it may be influenced by personality. In the present study, we hypothesized that the QOL20 scores may differ among personality types. We therefore classified personality type using the Eysenck Personality Questionnaire (EPQ; Shigehisa, 1989) described below.

2.2 Questionnaire
QOL20
In general, QOL measures are classified into three types. The first type is a performance score, e.g. an activity score; this type of measure is evaluated by a third party, and the Karnofsky Performance Status Scale (Karnofsky, 1949) and the WHO Performance Status Scale (WHO, 1979), for instance, were established for this purpose. The second type of evaluation employs objective data such as clinical findings; nutritional parameters, duration of inpatient/outpatient stay, and so on have been used here. The third type is subjective evaluation by the patient. Although there is still no standard method for evaluating QOL, many trials have been carried out, and FLIC (Schipper, 1985) and EORTC (Aaronson, et al., 1988) were proposed for QOL measurement in the West. Furthermore, although a large number of studies have been performed on QOL questionnaires, little attention has been given to developing a non-disease-specific QOL questionnaire for Japanese. Kobayashi, et al. (1994) summarized the problems in the investigation of QOL.
For the measurement of the subjective QOL of patients, we developed the QOL20, which belongs to the third type and consists of 20 questions related to psychological, physiological and environmental factors.
The QOL20 was constructed under the working hypotheses that 1) the QOL of a patient is similar to that of a healthy person, and 2) QOL includes two main factors, i.e., "state of disease" (D) and "attitude toward disease" (F), and the QOL of a patient changes with these factors. The scoring method was developed on the basis of the structure of the questionnaire. Because the structure of QOL was recognized to be unidimensional, an additive scale, that is, the sum of the numbers of responses, was used for the measurement of QOL. The symmetry of the positive and negative scores was not guaranteed; this means that the distances in the configuration of the items were not equivalent among the items. In such a case, we thought it better not to use a Likert scale but to calculate both a positive score (QTP) and a negative score (QTN).
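A minimal sketch of this additive scoring is given below, assuming that each of the 20 items has been coded as +1 for a positive response, -1 for a negative response and 0 otherwise; the coding scheme is an assumption made for illustration, not the published scoring key.

def qol20_scores(responses):
    """Return (QTP, QTN) for one subject.

    responses: sequence of 20 coded item responses (+1, -1 or 0).
    QTP counts the positive responses (0..20); QTN is the count of
    negative responses expressed as a non-positive number, matching
    the score ranges reported in Section 3.2.
    """
    assert len(responses) == 20
    qtp = sum(1 for r in responses if r > 0)
    qtn = -sum(1 for r in responses if r < 0)
    return qtp, qtn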

Eysenck Personality Questionnaire (EPQ)

Personality type was classified using the EPQ, which consists of 25 questions on a 4-category Likert scale (Shigehisa, 1989). Each question belongs to one of the dimensions extraversion (E), neuroticism (N), psychoticism (P) and dissimulation (L). Using the score for the target dimension (calculated as the sum of the Likert scales), we classified the subjects into personality types: if a dimension score was larger than the mean, the subject was classified as positive ("+") on that dimension, and otherwise as negative ("-"). In this study, we concentrated on the tolerable type (E+, N-, P+), the intolerable type (E-, N+, P-), and the other types, classified using the above dimensions; the rule is sketched below.
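The following sketch restates this mean-split typing rule; the column names and the data-frame layout are assumptions made for the example.

import pandas as pd

def personality_type(df: pd.DataFrame) -> pd.Series:
    """df has numeric columns 'E', 'N', 'P' (sums of 4-point Likert items).
    Returns 'tolerable', 'intolerable' or 'other' per subject."""
    plus = {d: df[d] > df[d].mean() for d in ("E", "N", "P")}
    tolerable = plus["E"] & ~plus["N"] & plus["P"]      # E+, N-, P+
    intolerable = ~plus["E"] & plus["N"] & ~plus["P"]   # E-, N+, P-
    out = pd.Series("other", index=df.index)
    out[tolerable] = "tolerable"
    out[intolerable] = "intolerable"
    return out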

2.3 Subjects
Using the above two questionnaires, a survey was conducted on the parents of 145 students; 125 males and 145 females responded to the questionnaires. We used the subjects who responded to both the QOL20 and the EPQ and whose ages were between 40 and 65 years. Thus, the subjects used for the analysis were 120 males and 128 females.

2.4 Statistical Method

In order to examine the validity of the construction of the QOL20, we analyzed the structure of the QOL20 using Hayashi's Quantification Method Type III (QIII) (Hayashi, 1952). The correlations between the QOL20 scores and the EPQ dimensions were examined using Spearman's rank correlation coefficient. The differences in the QOL20 scores among the personality types were examined by the Kruskal-Wallis test.
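QIII is commonly computed as a singular value decomposition of the standardized subject-by-category indicator matrix (it is formally equivalent to correspondence analysis / dual scaling); the sketch below follows that standard formulation and is an illustration, not the code used for this study.

import numpy as np

def quantification_iii(Z, n_axes=2):
    """Z: 0/1 indicator matrix (rows = subjects, columns = categories).

    Returns (category_values, latent_roots) for the leading non-trivial
    axes; plotting the first column of category_values against the
    second gives a scattergram like the one described in Section 3.1.
    """
    P = Z / Z.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    category_values = Vt[:n_axes].T / np.sqrt(c)[:, None]
    return category_values, sv[:n_axes] ** 2  # latent roots = squared s.v.

The Spearman correlations and the Kruskal-Wallis comparison correspond to scipy.stats.spearmanr and scipy.stats.kruskal, respectively.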

3. Results
3.1 Structure of QOL20
The structure of the QOL20 was examined using the QIII method, and a scattergram of the category values corresponding to the maximum latent root against those corresponding to the second largest latent root was drawn. Although the items varied somewhat, a cup-shaped curve like that of a unidimensional Guttman scale was reconfirmed (see Yamaoka, et al., 1994). The scoring method described above was therefore recognized as valid.

3.2 QOL20 scores by personality types

The minimum and maximum values of the QOL20 positive scores for males were 1 and 20, and those of the negative scores were -8 and 0, respectively. For females, the positive scores ranged from 0 to 20 and the negative scores from -8 to 0. The Spearman rank correlation coefficients between the positive and negative QOL scores and the EPQ dimensions are shown in Tab. 1. Although the correlation coefficient between the positive and the negative score of the QOL20 was significant, the two scores were not strongly correlated. Furthermore, the Spearman rank correlations between the QOL scores and the EPQ dimensions were not strong either; however, the tendency for females was similar to that for males.
The QOL20 scores by personality type are shown in Tab. 2. The results for males indicated that, in terms of a positive response tendency to the QOL20, the QOL20 scores of the tolerable-type subjects were significantly greater than those of the intolerable type (p<0.01), in the order tolerable type, other (intermediate) type, intolerable type. In terms of a negative response tendency to the QOL20, a similar ordering was recognized (p<0.001). For females, a similar tendency was recognized, but it was not significant.

Table 1. Spearman rank correlation coefficients between QOL scores and EPQ dimensions.

males; n=120
        QTP        E         N          P      L
QTP     1.00       0.32 **   -0.09      0.17   0.15
QTN     0.30 ***   0.10      -0.31 ***  0.14   0.13

females; n=128
QTP     1.00       0.14      -0.10      0.10   0.21 *
QTN     0.31 ***   0.05      -0.27 **   0.13   0.23 **

E: extraversion, N: neuroticism, P: psychoticism, L: dissimulation
*** p<0.001, ** p<0.01, * p<0.05

Table 2. QOL20 scores by personality type (tolerable, intolerable, other) classified by the EPQ.

Males (n=120)
QOL       Tolerable (17)   Intolerable (16)   Other (78)     χ² value
Positive  11.5 (6, 13)     5 (4, 7)           9 (7, 11)      9.42 **
Negative  0 (-1, 0)        -2 (-6, -1)        -0.5 (-2, 0)   14.03 ***

Females (n=128)
QOL       Tolerable (18)   Intolerable (23)   Other (87)     χ² value
Positive  9 (7, 12)        7 (5.5, 9.5)       9 (7, 12)      5.81
Negative  0 (-2, 0)        -1 (-4.5, -0.5)    -0.5 (-2, 0)   4.16

Tolerable type: E+N-P+; intolerable type: E-N+P-.
Table entries are median (25%tile, 75%tile).
Kruskal-Wallis test: *** p<0.001, ** p<0.01, * p<0.05

4. Conclusion
It is concluded that the QOL20 scores of tolerable-type subjects may be greater than those of intolerable-type subjects. Personality type is therefore thought to be a possible confounding factor in studies related to QOL. It is suggested that, when used as diagnostic measures, the QOL scores of persons with a "tolerable type" personality must be interpreted carefully.

5. References
Aaronson, N.K. et al. (1988): A modular approach to quality of life assessment in cancer clinical trials, Recent Results in Cancer Research, 111, 231-249.

Cox, D.R. et al. (1992): Quality-of-life assessment: Can we keep it simple? Journal of the Royal Statistical Society, Series A, 155, 353-393.

Kobayashi, K. et al. (1994): Quality of life (QOL) studies in Japan: Problems in the investigation of QOL, Annals of Cancer Research and Therapy, 3(2), 83-89.

Karnofsky, D.A. and Burchenal, J.H. (1949): The clinical evaluation of chemotherapeutic agents in cancer, in: Evaluation of Chemotherapeutic Agents (MacLeod, C.M., ed.), Columbia Univ. Press, New York, 191-205.

Shigehisa, T. (1989): Behavioral regulation of dietary risk factor associated with stress-induced disease, in relation to personality and interpersonal behavior, in a sociocultural perspective, Tokyo Kasei Gakuin University Journal, 29, 25-45.

Schipper, H. et al. (1985): Measuring quality of life: Risks and benefits, Cancer Treatment Reports, 69, 1115-1125.

World Health Organization (1979): WHO Handbook for Reporting Results of Cancer Treatment, Offset Publication No. 48, WHO, Geneva.

Yamaoka, K. et al. (1994): A Japanese version of the questionnaire for quality of life measurement, Annals of Cancer Research and Therapy, 3, 45-53.
CONTRIBUTORS INDEX

Agusta, Y. 295
Aluja-Banet, T. 207
Amasaka, K. 684
Aussem, A. 617
Bacelar-Nicolau, H. 89
Baier, D. 676
Batagelj, V. 199, 460
Bertaut, M.B. 480
Bock, H.H. 3
Böckenholt, U. 518
Bodjanova, S. 303
Bolasco, S. 468
Bren, M. 460
Briand, H. 412
Bryant, P.G. 182
Cai, G. 394
Carroll, J.D. 496
Chaturvedi, A. 496
Csernel, M. 379
de Carvalho, F. de A.T. 370, 379
Diday, E. 353
Doreian, P. 199
Dübon, K. 231
Durand, J.-F. 598
Emilion, R. 353
Ferligoj, A. 199
Fichet, B. 145
Gordon, A.D. 22, 109
Graña, M. 117
Gras, R. 412
Hamamoto, Y. 328
Hase, T. 328
Hayashi, C. 40, 653
Hébrail, G. 387
Heiser, W.J. 52, 587
Hernandez-Pajares, M. 341
Hirtle, S.C. 394
Ho, T.B. 215, 404
Honda, M. 268
Ichino, M. 358
Imaizumi, T. 320
Imamura, K. 766
Jajuga, K. 63
Juan, J.M. 341
Kamano, S. 669
Kanagawa, A. 334
Karube, K. 653
Kato, T. 632
Kawabata, H. 334
Kiers, H.A.L. 563
Kim, H.B. 261
Kimura, M. 215
Kishihara, T. 746
Komazawa, T. 431
Konishi, S. 268
Kozumi, H. 255
Kroonenberg, P.M. 587
Lapointe, F.-J. 71
Larrañaga, P. 117
Lebart, L. 423, 480
Lehel, J. 187
Levin, M.S. 154
Lozano, J.A. 117
Martin, M.C. 162
Matsuda, S. 284
McMorris, F.R. 187
Meulman, J.J. 506
Minami, H. 625
Misumi, J. 647
Mitsumochi, N. 746
Miyamoto, S. 295, 736
Mizuta, M. 610, 625
Mola, F. 191, 223
Mori, Y. 547
Morikawa, S. 653
Morinaka, S. 752
Mucha, H.-J. 231
Murakami, T. 575
Murtagh, F. 617
Nafria, E. 207
Nakai, S. 328
Nguyen, T.D. 215
Nicolas, J. 758
Nicolau, F.C. 89
Nishisato, S. 441
Ohmori, F. 766
Ohsumi, N. 746, 766
Ohtsuka, S. 766
Oi, K. 736
Okada, A. 716
Onoyama, K. 746
Ootaki, A. 696
Pai, J.S.C. 255
Peter, P. 412
Philippe, J. 412
Podani, J. 125
Polasek, W. 255
Powers, R.C. 187
Rajman, M. 488
Rasson, J.-P. 99
Romanazzi, M. 555
Rungsawang, A. 488
Sanz, J. 341
Sato, M. 312
Sato, Y. 312
Satoh, D. 708
Shibasaki, Y. 766
Sibuya, M. 241
Siciliano, R. 191, 223
Siegmund-Schultze, R. 231
Simeone, B. 170
Suga, S. 736
Takahashi, H. 334
Takakura, S. 661
Takane, Y. 527
Tallur, B. 758
Tanaka, Y. 261, 547
Tanemura, M. 276
Tango, T. 247
Tarumi, T. 547
Tomita, S. 328
Tsuchiya, T. 431, 452
Ueda, T. 708
Van Cutsem, B. 133
Vichi, M. 170
Watanabe, M. 774
Weihs, C. 728
Yaguchi, H. 358
Yamaoka, K. 774
Yanai, H. 539
Ycart, B. 133
Yoshimura, I. 284
Yoshimura, M. 284
