
Studies in Classification, Data Analysis,

and Knowledge Organization

Paula Brito · José G. Dias · Berthold Lausen · Angela Montanari · Rebecca Nugent
Editors

Classification
and Data Science
in the Digital Age
Studies in Classification, Data Analysis,
and Knowledge Organization

Managing Editors

Wolfgang Gaul, Karlsruhe, Germany
Maurizio Vichi, Rome, Italy
Claus Weihs, Dortmund, Germany

Editorial Board

Daniel Baier, Bayreuth, Germany
Frank Critchley, Milton Keynes, UK
Reinhold Decker, Bielefeld, Germany
Edwin Diday†, Paris, France
Michael Greenacre, Barcelona, Spain
Carlo Natale Lauro, Naples, Italy
Jacqueline Meulman, Leiden, The Netherlands
Paola Monari, Bologna, Italy
Shizuhiko Nishisato, Toronto, Canada
Noboru Ohsumi, Tokyo, Japan
Otto Opitz, Augsburg, Germany
Gunter Ritter, Passau, Germany
Martin Schader, Mannheim, Germany
Studies in Classification, Data Analysis, and Knowledge Organization is a book
series which offers constant and up-to-date information on the most recent
developments and methods in the fields of statistical data analysis, exploratory
statistics, classification and clustering, handling of information and ordering of
knowledge. It covers a broad scope of theoretical, methodological as well as
application-oriented articles, surveys and discussions from an international
authorship and includes fields like computational statistics, pattern recognition,
biological taxonomy, DNA and genome analysis, marketing, finance and other
areas in economics, databases and the internet. A major purpose is to show the
intimate interplay between various, seemingly unrelated domains and to foster the
cooperation between mathematicians, statisticians, computer scientists and practi-
tioners by offering well-based and innovative solutions to urgent problems of
practice.
Paula Brito · José G. Dias · Berthold Lausen · Angela Montanari · Rebecca Nugent

Editors

Classification and Data Science in the Digital Age
Editors

Paula Brito
Faculty of Economics, University of Porto, Porto, Portugal
INESC TEC, Centre for Artificial Intelligence and Decision Support (LIAAD), Porto, Portugal

José G. Dias
Business Research Unit, University Institute of Lisbon, Lisbon, Portugal

Berthold Lausen
Department of Mathematical Sciences, University of Essex, Colchester, UK

Angela Montanari
Department of Statistical Sciences “Paolo Fortunati”, University of Bologna, Bologna, Italy

Rebecca Nugent
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA

ISSN 1431-8814 ISSN 2198-3321 (electronic)


Studies in Classification, Data Analysis, and Knowledge Organization
ISBN 978-3-031-09033-2 ISBN 978-3-031-09034-9 (eBook)
https://doi.org/10.1007/978-3-031-09034-9
Mathematics Subject Classification: 62H30, 62H25, 62R07, 68T09, 62H86, 68T10, 94A16, 68T30

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adap-
tation, distribution and reproduction in any medium or format, as long as you give appropriate credit to
the original author(s) and the source, provide a link to the Creative Commons license and indicate if
changes were made.
The images or other third party material in this book are included in the book's Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book's
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publi-
cation does not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

“Classification and Data Science in the Digital Age”, the 17th Conference of the In-
ternational Federation of Classification Societies (IFCS), is held in Porto, Portugal,
from July 19th to July 23rd 2022, locally organised by the Faculty of Economics of
the University of Porto and the Portuguese Association for Classification and Data
Analysis, CLAD.

The International Federation of Classification Societies (IFCS), founded in 1985, is an international scientific organization with non-profit and non-political motives.
Its purpose is to promote mutual communication, co-operation and interchange of
views among all those interested in scientific principles, numerical methods, theory
and practice of data science, data analysis, and classification in a broad sense and in as
wide a range of applications as possible; to serve as an agency for the dissemination
of scientific information related to these areas of interest; to prepare international
conferences; to publish a newsletter and other publications. The scientific activities
of the Federation are intended for all people interested in theory of classification
and data analysis, and related methods and applications. IFCS 2022 – originally
scheduled for August 2021, and postponed due to the Covid-19 pandemic – will be
its 17th edition; previous editions were held in Thessaloniki (2019), Tokyo (2017)
and Bologna (2015).

Keynote lectures are delivered by Genevera Allen (Rice University, USA), Charles Bouveyron (Université Côte d’Azur, Nice, France), Dianne Cook (Monash University, Melbourne, Australia), and João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal). The conference program includes two tutorials: “Analysis of Data Streams” by João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal) and “Categorical Data Analysis and Visualization” by Rosaria Lombardo (Università degli Studi della Campania Luigi Vanvitelli, Italy) and Eric Beh (University of Newcastle, Australia). IFCS 2022 has highlighted topics, which led to Semi-Plenary Invited Sessions. The Conference program also includes Thematic Tracks on specific areas, as well as free contributed sessions on different topics (both oral communications and posters).

The Conference Scientific Program Committee is co-chaired by Paula Brito, José G. Dias, Berthold Lausen, and Angela Montanari, and includes representatives of the
IFCS member societies: Adalbert Wilhelm – GfKl, Ahmed Moussa – MCS, Arthur
White – IPRCS, Brian Franczak – CS, Eva Boj del Val – SEIO, Fionn Murtagh –
BCS, Francesco Mola – CLADAG, Hyunjoong Kim – KCS, Javier Trejos Zelaya –
SoCCCAD, Koji Kurihara – JCS, Krzysztof Jajuga – SKAD, Mark de Rooij – VOC,
Mohamed Nadif – SFC, Niel le Roux – MDAG, Simona Korenjak Černe – SSS,
Theodore Chadjipadelis – GSDA, who were responsible for the Conference Scien-
tific Program, and whom the organisers wish to thank for their precious cooperation.
Special thanks are also due to the chairs of the Thematic Tracks, for their invaluable
collaboration.

The papers included in this volume present new developments in relevant topics
of Data Science and Classification, constituting a valuable collection of method-
ological and applied papers that represent the current research in highly developing
areas. Combining new methodological advances with a wide variety of real appli-
cations, this volume is certainly of great value for Data Science researchers and
practitioners alike.

First of all, the organisers of the Conference and the editors would like to thank
all authors, for their cooperation and commitment. We are especially grateful to all
colleagues who served as reviewers, and whose work was decisive to the scientific
quality of these proceedings. We also thank all those who have contributed to the de-
sign and production of this Book of Proceedings at Springer, in particular Veronika
Rosteck, for her help concerning all aspects of publication.

The organisers would like to express their gratitude to the Portuguese Association
for Classification and Data Analysis, CLAD, as well as to the Faculty of Economics
of the University of Porto (FEP–UP), who enthusiastically supported the Conference
from the very start, and contributed to its success. We cordially thank all members
of the Local Organising Committee – Adelaide Figueiredo, Carlos Ferreira, Carlos
Marcelo, Conceição Rocha, Fernanda Figueiredo, Fernanda Sousa, Jorge Pereira,
M. Eduarda Silva, Paulo Teles, Pedro Campos, Pedro Duarte Silva, and Sónia Dias
– and all people at FEP–UP who worked actively for the conference organisation,
and whose work is much appreciated. We are very grateful to all our sponsors, for
their generous support. Finally, we thank all authors and participants, who made the
conference possible.

Porto, July 2022

Paula Brito
José G. Dias
Berthold Lausen
Angela Montanari
Rebecca Nugent
Acknowledgements

The Editors are extremely grateful to the reviewers, whose work was determinant
for the scientific quality of these proceedings. They were, in alphabetical order:

Adalbert Wilhelm
Agustín Mayo-Iscar
Alípio Jorge
André C. P. L. F. de Carvalho
Ann Maharaj
Anuška Ferligoj
Arthur White
Berthold Lausen
Brian Franczak
Carlos Soares
Christian Hennig
Conceição Amado
Eva Boj del Val
Francesco Mola
Francisco de Carvalho
Geoff McLachlan
Gilbert Saporta
Glòria Mateu-Figueras
Hans Kestler
Hélder Oliveira
Hyunjoong Kim
Jaime Cardoso
Javier Trejos
Jean Diatta
José A. Lozano
José A. Vilar
José Matos
Koji Kurihara
Krzysztof Jajuga
Laura Palagi
Laura Sangalli
Lazhar Labiod
Luis Angel García-Escudero
Luis Teixeira
M. Rosário Oliveira
Margarida G. M. S. Cardoso
Mark de Rooij
Michelangelo Ceci
Mohamed Nadif
Niel Le Roux
Paolo Mignone
Patrice Bertrand
Pedro Campos
Pedro Duarte Silva
Pedro Ribeiro
Peter Filzmoser
Rosanna Verde
Rosaria Lombardo
Salvatore Ingrassia
Satish Singh
Simona Korenjak-Černe
Theodore Chadjipadelis
Veronica Piccialli
Vladimir Batagelj

Partners & Sponsors

We are extremely grateful to the following institutions whose support contributes to the success of IFCS 2022:

Sponsors

Banco de Portugal

Berd

Comissão de Viticultura da Região dos Vinhos Verdes

Indie Campers

INESC/TEC

Luso-American Development Foundation

PSE

Sociedade Portuguesa de Estatística

Instituto Nacional de Estatística/Statistics Portugal

Unilabs

Universidade do Porto


Partners

Associação Portuguesa para a Investigação Operacional

Associação Portuguesa de Reconhecimento de Padrões

Associação de Turismo do Porto e Norte

Centro Internacional de Matemática

Faculdade de Engenharia da Universidade do Porto

International Association of Statistical Computing

International Association of Statistical Education

Sociedade Portuguesa de Matemática

Springer

Organisation

CLAD - Associação Portuguesa de Classificação e Análise de Dados

Faculdade de Economia da Universidade do Porto


Contents

A Topological Clustering of Individuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


Rafik Abdesselam
Model Based Clustering of Functional Data with Mild Outliers . . . . . . . . 11
Cristina Anton and Iain Smith

A Trivariate Geometric Classification of Decision Boundaries for Mixtures of Regressions . . . . . . . 21
Filippo Antonazzo and Salvatore Ingrassia
Generalized Spatio-temporal Regression with PDE Penalization . . . . . . . . 29
Eleonora Arnone, Elia Cunial, and Laura M. Sangalli
A New Regression Model for the Analysis of Microbiome Data . . . . . . . . . 35
Roberto Ascari and Sonia Migliorati
Stability of Mixed-type Cluster Partitions for Determination of the
Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Rabea Aschenbruck, Gero Szepannek, and Adalbert F. X. Wilhelm
A Review on Official Survey Item Classification for Mixed-Mode Effects
Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Afshin Ashofteh and Pedro Campos

Clustering and Blockmodeling Temporal Networks – Two Indirect Approaches . . . . . . . 63
Vladimir Batagelj
Latent Block Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Rafika Boutalbi, Lazhar Labiod, and Mohamed Nadif


Using Clustering and Machine Learning Methods to Provide Intelligent Grocery Shopping Recommendations . . . . . . . 83
Nail Chabane, Mohamed Achraf Bouaoune, Reda Amir Sofiane Tighilt, Bogdan Mazoure, Nadia Tahiri, and Vladimir Makarenkov

COVID-19 Pandemic: a Methodological Model for the Analysis of Government’s Preventing Measures and Health Data Records . . . . . . . 93
Theodore Chadjipadelis and Sofia Magopoulou
pcTVI: Parallel MDP Solver Using a Decomposition into Independent
Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Jaël Champagne Gareau, Éric Beaudry, and Vladimir Makarenkov
Three-way Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Cinzia Di Nuzzo and Salvatore Ingrassia
Improving Classification of Documents by Semi-supervised Clustering
in a Semantic Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Jasminka Dobša and Henk A. L. Kiers
Trends in Data Stream Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
João Gama

Old and New Constraints in Model Based Clustering . . . . . . . . . . . . . . . . . 139


Luis A. García-Escudero, Agustín Mayo-Iscar, Gianluca Morelli, and Marco
Riani
Clustering Student Mobility Data in 3-way Networks . . . . . . . . . . . . . . . . . 147
Vincenzo Giuseppe Genova, Giuseppe Giordano, Giancarlo Ragozini, and
Maria Prosperina Vitale
Clustering Brain Connectomes Through a Density-peak Approach . . . . . . 155
Riccardo Giubilei
Similarity Forest for Time Series Classification . . . . . . . . . . . . . . . . . . . . . . 165
Tomasz Górecki, Maciej Łuczak, and Paweł Piasecki

Detection of the Biliary Atresia Using Deep Convolutional Neural Networks Based on Statistical Learning Weights via Optimal Similarity and Resampling Methods . . . . . . . 175
Kuniyoshi Hayashi, Eri Hoshino, Mitsuyoshi Suzuki, Erika Nakanishi, Kotomi Sakai, and Masayuki Obatake

Some Issues in Robust Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


Christian Hennig
Robustness Aspects of Optimized Centroids . . . . . . . . . . . . . . . . . . . . . . . . . 193
Jan Kalina and Patrik Janáček

Data Clustering and Representation Learning Based on Networked Data 203


Lazhar Labiod and Mohamed Nadif
Towards a Bi-stochastic Matrix Approximation of 𝑘-means and Some
Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Lazhar Labiod and Mohamed Nadif
Clustering Adolescent Female Physical Activity Levels with an Infinite
Mixture Model on Random Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Amy LaLonde, Tanzy Love, Deborah R. Young, and Tongtong Wu

Unsupervised Classification of Categorical Time Series Through Innovative Distances . . . . . . . 233
Ángel López-Oriona, José A. Vilar, and Pierpaolo D’Urso
Fuzzy Clustering by Hyperbolic Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . 243
David Masís, Esteban Segura, Javier Trejos, and Adilson Xavier

Stochastic Collapsed Variational Inference for Structured Gaussian Process Regression Networks . . . . . . . 253
Rui Meng, Herbert K. H. Lee, and Kristofer Bouchard
An Online Minorization-Maximization Algorithm . . . . . . . . . . . . . . . . . . . 263
Hien Duy Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé
Detecting Differences in Italian Regional Health Services During Two
Covid-19 Waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Lucio Palazzo and Riccardo Ievoli
Political and Religion Attitudes in Greece: Behavioral Discourses . . . . . . . 283
Georgia Panagiotidou and Theodore Chadjipadelis
Supervised Classification via Neural Networks for Replicated Point
Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Kateřina Pawlasová, Iva Karafiátová, and Jiří Dvořák

Parsimonious Mixtures of Seemingly Unrelated Contaminated Normal Regression Models . . . . . . . 303
Gabriele Perrone and Gabriele Soffritti
Penalized Model-based Functional Clustering: a Regularization
Approach via Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Nicola Pronello, Rosaria Ignaccolo, Luigi Ippoliti, and Sara Fontanella
Emotion Classification Based on Single Electrode Brain Data:
Applications for Assistive Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Duarte Rodrigues, Luis Paulo Reis, and Brígida Mónica Faria

The Death Process in Italy Before and During the Covid-19 Pandemic: a
Functional Compositional Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Riccardo Scimone, Alessandra Menafoglio, Laura M. Sangalli, and
Piercesare Secchi

Clustering Validation in the Context of Hierarchical Cluster Analysis: an Empirical Study . . . . . . . 343
Osvaldo Silva, Áurea Sousa, and Helena Bacelar-Nicolau
An MML Embedded Approach for Estimating the Number of Clusters . . 353
Cláudia Silvestre, Margarida G. M. S. Cardoso, and Mário Figueiredo
Typology of Motivation Factors for Employees in the Banking Sector: An
Empirical Study Using Multivariate Data Analysis Methods . . . . . . . . . . . 363
Áurea Sousa, Osvaldo Silva, M. Graça Batista, Sara Cabral, and Helena
Bacelar-Nicolau

A Proposal for Formalization and Definition of Anomalies in Dynamical Systems . . . . . . . 373
Jan Michael Spoor, Jens Weber, and Jivka Ovtcharova
New Metrics for Classifying Phylogenetic Trees Using 𝐾-means and the
Symmetric Difference Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
Nadia Tahiri and Aleksandr Koshkarov
On Parsimonious Modelling via Matrix-variate t Mixtures . . . . . . . . . . . . 393
Salvatore D. Tomarchio
Evolution of Media Coverage on Climate Change and Environmental
Awareness: an Analysis of Tweets from UK and US Newspapers . . . . . . . . 403
Gianpaolo Zammarchi, Maurizio Romano, and Claudio Conversano
A Topological Clustering of Individuals

Rafik Abdesselam

Abstract The clustering of objects-individuals is one of the most widely used ap-
proaches to exploring multidimensional data. The two common unsupervised cluster-
ing strategies are Hierarchical Ascending Clustering (HAC) and k-means partitioning
used to identify groups of similar objects in a dataset to divide it into homogeneous
groups. The proposed Topological Clustering of Individuals, or TCI, studies a homo-
geneous set of individual rows of a data table, based on the notion of neighborhood
graphs; the columns-variables are more-or-less correlated or linked according to
whether the variable is of a quantitative or qualitative type. It enables topological analysis of the clustering of individuals described by variables which can be quantitative, qualitative or a mixture of the two. It first analyzes the correlations or associations observed
between the variables in a topological context of principal component analysis (PCA)
or multiple correspondence analysis (MCA), depending on the type of variable, then
classifies individuals into homogeneous groups, relative to the structure of the variables considered. The proposed TCI method is presented and illustrated here using
a real dataset with quantitative variables, but it can also be applied with qualitative
or mixed variables.

Keywords: hierarchical clustering, proximity measure, neighborhood graph, adja-


cency matrix, multivariate data analysis

1 Introduction

The objective of this article is to propose a topological method of data analysis in the
context of clustering. The proposed approach, Topological Clustering of Individuals

Rafik Abdesselam
University of Lyon, Lyon 2, ERIC - COACTIS Laboratories
Department of Economics and Management, 69365 Lyon, France,
e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_1

(TCI) is different from those that already exist and with which it is compared. There
are approaches specifically devoted to the clustering of individuals, for example, the
Cluster procedure implemented in SAS software, but as far as we know, none of
these approaches has been proposed in a topological context.
Proximity measures play an important role in many areas of data analysis [16, 5, 9].
The results of any operation involving structuring, clustering or classifying objects
are strongly dependent on the proximity measure chosen.
This study proposes a method for the topological clustering of individuals what-
ever type of variable is being considered: quantitative, qualitative or a mixture of
both. The eventual associations or correlations between the variables partly depend on the database being used, and the results can change according to the selected prox-
imity measure. A proximity measure is a function which measures the similarity or
dissimilarity between two objects or variables within a set.
Several topological data analysis studies have been proposed both in the context
of factorial analyses (discriminant analysis [4], simple and multiple correspondence
analyses [3], principal component analysis [2]) and in the context of clustering of
variables [1], clustering of individuals [10] and this proposed TCI approach.
This paper is organized as follows. In Section 2, we briefly recall the basic
notion of neighborhood graphs, we define and show how to construct an adjacency
matrix associated with a proximity measure within the framework of the analysis
of the correlation structure of a set of quantitative variables, and we present the
principles of TCI according to continuous data. This is illustrated in Section 3 using
an example based on real data. The TCI results are compared with those of the well-
known classical clustering of individuals. Finally, Section 4 presents the concluding
remarks on this work.

2 Topological Context

Topological data analysis is an approach based on the concept of the neighborhood


graph. The basic idea is actually quite simple: for a given proximity measure for
continuous or binary data and for a chosen topological structure, we can match a
topological graph induced on the set of objects.
In the case of continuous data, we consider 𝐸 = {𝑥 1 , · · · , 𝑥 𝑗 , · · · , 𝑥 𝑝 }, a set of 𝑝
quantitative variables. We can see in [1] cases of qualitative or even mixed variables.
We can, by means of a proximity measure u, define a neighborhood relationship V_u as a binary relationship on E × E.
Thus, for a given proximity measure u, we can build a neighborhood graph on 𝐸,
where the vertices are the variables and the edges are defined by a property of the
neighborhood relationship.
One
can choose the Minimal Spanning Tree (MST) [7], the Gabriel Graph (GG) [11] or,
as is the case here, the Relative Neighborhood Graph (RNG) [14].

For any given proximity measure u, we can construct the associated binary symmetric adjacency matrix V_u of order p, where all pairs of neighboring variables in E satisfy the following RNG property:

$$ V_u(x^k, x^l) = \begin{cases} 1 & \text{if } u(x^k, x^l) \le \max\left[\, u(x^k, x^t),\, u(x^t, x^l) \,\right], \quad \forall x^k, x^l, x^t \in E,\ x^t \ne x^k \text{ and } x^t \ne x^l \\ 0 & \text{otherwise.} \end{cases} $$

Fig. 1 Data - RNG structure - Euclidean distance - Associated adjacency matrix.

Figure 1 shows a simple illustrative example in $\mathbb{R}^2$ of a set of quantitative variables that verify the structure of the RNG graph, with the Euclidean distance as proximity measure: $u(x^k, x^l) = \sqrt{\sum_{j=1}^{2} (x_j^k - x_j^l)^2}$.

This generates a topological structure on the objects in E, which is completely described by the binary adjacency matrix V_u.
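As a concrete illustration of the construction above, the following R sketch (our own, not code from the paper; the function name rng_adjacency is hypothetical) builds the RNG adjacency matrix V_u for the columns-variables of a data matrix, using the Euclidean distance between variables as the proximity measure u.

```r
# Hypothetical helper (not from the paper): RNG adjacency matrix V_u for the
# columns of a data matrix X, with Euclidean distance as proximity measure u.
rng_adjacency <- function(X) {
  p <- ncol(X)
  u <- as.matrix(dist(t(X)))   # p x p matrix of distances u(x^k, x^l) between variables
  V <- diag(1, p)
  dimnames(V) <- list(colnames(X), colnames(X))
  for (k in 1:p) {
    for (l in 1:p) {
      if (k == l) next
      others <- setdiff(1:p, c(k, l))
      # RNG property: x^k and x^l are neighbours if no third variable x^t is
      # simultaneously closer to both of them than they are to each other
      V[k, l] <- as.numeric(all(u[k, l] <= pmax(u[k, others], u[others, l])))
    }
  }
  V
}

# Toy usage: adjacency among 5 variables observed on 30 individuals
set.seed(1)
X <- matrix(rnorm(30 * 5), 30, 5, dimnames = list(NULL, paste0("x", 1:5)))
rng_adjacency(X)
```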

2.1 Reference Adjacency Matrices

Three topological factorial approaches are described in [1] according to the type of
variables considered: quantitative, qualitative or a mixture of both. We consider here
the case of a set of quantitative variables.
We assume that we have at our disposal a set 𝐸 = {𝑥 𝑗 ; 𝑗 = 1, · · · , 𝑝} of 𝑝
quantitative variables and 𝑛 individuals-objects. The objective here is to analyze in
a topological way, the structure of the correlations of the variables considered [2],
from which the clustering of individuals will then be established.
We construct the reference adjacency matrix named 𝑉𝑢★ from the correlation
matrix. Expressions of suitable adjacency reference matrices for cases involving
qualitative variables or mixed variables are given in [1].
To examine the correlation structure between the variables, we look at the sig-
nificance of their linear correlation. The reference adjacency matrix 𝑉𝑢★ associated
with reference measure 𝑢★, can be written using the Student’s t-test of the linear
correlation coefficient 𝜌 of Bravais-Pearson:

Definition 1 For quantitative variables, $V_{u^\star}$ is defined as:

$$ V_{u^\star}(x^k, x^l) = \begin{cases} 1 & \text{if } p\text{-value} = P\left[\, |T_{n-2}| > \text{t-value} \,\right] \le \alpha, \quad \forall k, l = 1, \dots, p \\ 0 & \text{otherwise,} \end{cases} $$

where the 𝑝-value is the significance test of the linear correlation coefficient for
the two-sided test of the null and alternative hypotheses, 𝐻0 : 𝜌(𝑥 𝑘 , 𝑥 𝑙 ) = 0 vs.
𝐻1 : 𝜌(𝑥 𝑘 , 𝑥 𝑙 ) ≠ 0.
Let 𝑇𝑛−2 be a t-distributed random variable of Student with 𝜈 = 𝑛 − 2 degrees of
freedom. In this case, the null hypothesis is rejected if the 𝑝-value is less than or equal
to a chosen 𝛼 significance level, for example, 𝛼 = 5%. Using a linear correlation
test, if the 𝑝-value is very small, it means that there is a very low likelihood that the
null hypothesis is correct, and consequently we can reject it.
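In code, Definition 1 amounts to thresholding the p-values of pairwise correlation tests. The sketch below (ours, not the author's implementation; the function name is hypothetical) uses the base R function cor.test(); for the illustrative example of Section 3, a significant negative correlation would additionally be coded −1 instead of 1.

```r
# Hypothetical sketch of Definition 1: reference adjacency matrix V_u* from the
# two-sided Student t-test of the Bravais-Pearson correlation coefficient.
reference_adjacency <- function(X, alpha = 0.05) {
  p <- ncol(X)
  V <- diag(1, p)
  dimnames(V) <- list(colnames(X), colnames(X))
  for (k in 1:(p - 1)) {
    for (l in (k + 1):p) {
      test <- cor.test(X[, k], X[, l])              # H0: rho(x^k, x^l) = 0
      V[k, l] <- V[l, k] <- as.numeric(test$p.value <= alpha)
    }
  }
  V
}
```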

2.2 Topological Analysis - Selective Review

Whatever the type of variable set being considered, the built reference adjacency
matrix 𝑉𝑢★ is associated with an unknown reference proximity measure 𝑢★.
The robustness of the results with respect to the α error risk chosen for the null hypothesis (no linear correlation in the case of quantitative variables, or positive deviation from independence in the case of qualitative variables) can be studied by setting a minimum threshold in order to analyze the sensitivity of the results. Certainly the numerical results will change, but probably not their interpretation.
We assume that we have at our disposal {𝑥 𝑘 ; 𝑘 = 1, .., 𝑝} a set of 𝑝 homogeneous
quantitative variables measured on 𝑛 individuals. We will use the following notations:
- X(n, p) is the data matrix with n rows-individuals and p columns-variables,
- V_u★ is the symmetric adjacency matrix of order p, associated with the reference measure u★ which best structures the correlations of the variables,
- X̂(n, p) = X V_u★ is the projected data matrix with n individuals and p variables,
- M_p is the matrix of distances of order p in the space of individuals,
- D_n = (1/n) I_n is the diagonal matrix of weights of order n in the space of variables.
We first analyze, in a topological way, the correlation structure of the variables using a Topological PCA, which consists of carrying out the standardized PCA [6, 8] of the triplet (X̂, M_p, D_n) of the projected data matrix X̂ = X V_u★ and, for comparison, the duality diagram of the Classical standardized PCA triplet (X, M_p, D_n) of the initial data matrix X. We then proceed with a clustering of individuals based on the significant principal components of the previous topological PCA.

Definition 2 TCI consists of performing a HAC, based on the Ward criterion¹ [15], on the significant factors of the standardized PCA of the triplet (X̂, M_p, D_n).

¹ Aggregation based on the criterion of minimal loss of inertia.
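A minimal sketch of the full TCI chain of Definition 2, written with base R only (prcomp for the standardized PCA and hclust with Ward's criterion for the HAC). This is our illustration, not the author's implementation: the signed reference matrix V_ustar and the number d of retained components are assumed to be given, and the exact weighting of the duality diagram is simplified.

```r
# Hypothetical sketch of TCI (Definition 2).
tci_clustering <- function(X, V_ustar, d, K) {
  X_hat  <- scale(X) %*% V_ustar                      # projected data matrix X^ = X V_u*
  pca    <- prcomp(X_hat, scale. = TRUE)              # standardized PCA of the projected data
  scores <- pca$x[, 1:d, drop = FALSE]                # significant principal components
  hac    <- hclust(dist(scores), method = "ward.D2")  # HAC with the Ward criterion
  cutree(hac, k = K)                                  # partition of the n individuals
}
```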



3 Illustrative Example

The data used [13] to illustrate the TCI approach describe the renewable electricity (RE) of the 13 French regions in 2017, through 7 quantitative variables relating to RE. The growth of renewable energy in France is significant. Some French regions
have expertise in this area; however, the regions’ profiles appear to differ.
The objective is to specify regional disparities in terms of RE by applying topo-
logical clustering to the French regions in order to identify which were the country’s
greenest regions in 2017. Statistics relating to the variables are displayed in Table 1.

Table 1 Summary statistics of renewable energy variables.

Variable                      Frequency   Mean   Standard deviation (N)   Coefficient of variation (%)   Min    Max
Total RE production (TWH)     13          6.84   6.58                     96.19                          0.59   2.34
Total RE consumption (TWH)    13          3.70   1.87                     50.67                          2.18   7.06
Coverage RE consumption (%)   13          0.18   0.11                     59.01                          0.02   0.36
Hydroelectricity (%)          13          0.34   0.30                     87.47                          0.01   0.89
Solar electricity (%)         13          0.13   0.09                     72.57                          0.02   0.31
Wind electricity (%)          13          0.39   0.29                     76.12                          0.01   0.86
Biomass electricity (%)       13          0.15   0.19                     130.54                         0.01   0.79

Table 2 Correlation matrix (p-value) - Reference adjacency matrix V_u★.

                   Prod.     Cons.     Cover.    Hydro.    Solar     Wind      Biom.
Production         1.000
Consumption        0.575     1.000
                   (0.040)
Coverage           0.798     0.090     1.000
                   (0.001)   (0.771)
Hydroelectricity   0.720     0.138     0.872     1.000
                   (0.006)   (0.653)   (0.000)
Solar             -0.272    -0.477     0.105     0.168     1.000
                   (0.369)   (0.099)   (0.734)   (0.582)
Wind              -0.408    -0.305    -0.524    -0.772    -0.395     1.000
                   (0.167)   (0.311)   (0.066)   (0.002)   (0.181)
Biomass           -0.365     0.489    -0.609    -0.459    -0.149    -0.135     1.000
                   (0.220)   (0.090)   (0.027)   (0.114)   (0.627)   (0.660)

Significance level: p-value ≤ α = 5%

          |  1   1   1   1   0   0   0 |
          |  1   1   0   0   0   0   0 |
          |  1   0   1   1   0   0  -1 |
V_u★  =   |  1   0   1   1   0  -1   0 |
          |  0   0   0   0   1   0   0 |
          |  0   0   0  -1   0   1   0 |
          |  0   0  -1   0   0   0   1 |

The adjacency matrix V_u★, associated with the proximity measure u★ adapted to the data considered, is built from the correlation matrix in Table 2 according to Definition 1. Note that in this case, which uses quantitative variables, it is considered
that two positively correlated variables are related and that two negatively correlated
variables are related but remote. We will therefore take into account any sign of
correlation between variables in the adjacency matrix.
We first carry out a Topological PCA to identify the correlation structure of the
variables. A HAC, according to Ward’s criterion, is then applied to the significant
principal components of the PCA of the projected data. We then compare the results
of a topological and a classical PCA.
Figure 2 presents, for comparison on the first factorial plane, the correlations
between principal components-factors and the original variables.

We can see that these correlations are slightly different, as are the percentages of
the inertias explained on the first principal planes of Topological and Classic PCA.

Fig. 2 Topological & Classical PCA of RE of the French regions.

The two first factors of the Topological PCA explain 57.89% and 26.11%, re-
spectively, accounting for 83.99% of the total variation in the data set; however, the
two first factors of the Classical PCA add up to 75.20%. Thus, the first two factors
provide an adequate synthesis of the data, that is, of RE in the French regions. We
restrict the comparison to the first significant factorial axes.
For comparison, Figure 3 shows dendrograms of the Topological and Classical
clustering of the French regions according to their RE. Note that the partitions chosen
in 5 clusters are appreciably different, as much by composition as by characterization.
The percentage variance produced by the TCI approach, 𝑅 2 = 86.42%, is higher than
that of the classic approach, 𝑅 2 = 84.15%, indicating that the clusters produced via
the TCI approach are more homogeneous than those generated by the Classical one.
Based on the TCI analysis, the Corse region alone constitutes the fourth cluster,
and the Nouvelle-Aquitaine region is found in the second cluster with the Grand-
Est, Occitanie and Provence-Alpes-Côte-d’Azur (PACA) regions; however, in the
Classical clustering, these two regions - Corse and Nouvelle-Aquitaine - together
constitute the third cluster.
Figure 4 summarizes the significant profiles (+) and anti-profiles (-) of the two
typologies; with a risk of error less than or equal to 5%, they are quite different.
The first cluster produced via the TCI approach, consisting of a single region,
Auvergne-Rhône-Alpes (AURA), is characterized by a high share of hydroelectricity,
a high level of coverage of regional consumption, and high RE production and con-
sumption. The second cluster - which groups together the four regions of Grand-Est,
Occitanie, Provence-Alpes-Côte-d’Azur (PACA) and Nouvelle-Aquitaine - is consid-
ered a homogeneous cluster, which means that none of the seven RE characteristics
differ significantly from the average of these characteristics across all regions. This
cluster can therefore be considered to reflect the typical picture of RE in France.

Fig. 3 Topological and Classical dendrograms of the French regions.

Fig. 4 Typologies - Characterization of TCI & Classical clusters

Cluster 3, which consists of six regions, is characterized by a high degree of wind


energy, a low degree of hydroelectricity, low coverage of regional consumption, and
low production and consumption of RE compared to the national average. Cluster
4, represented by the Corse region, is characterized by a high share of solar energy
and low production and consumption of RE. The last class, represented by the Ile-
de-France region, is characterized by a high share of biomass energy. Regarding the
other types of RE, their share is close to the national average.

4 Conclusion

This paper proposes a new topological approach to the clustering of individuals which
can enrich classical data analysis methods within the framework of the clustering of
objects. The results of the topological clustering approach, based on the notion of a
neighborhood graph, are as good - or even better, according to the R-squared results
- than the existing classical method. The TCI approach is be easily programmable
from the PCA and HAC procedures of SAS, SPAD or R software. Future work will
involve extending this topological approach to other methods of data analysis, in
particular in the context of evolutionary data analysis.

References

1. Abdesselam, R.: A topological clustering of variables. Journal of Mathematics and System


Science. Accepted (2022)
2. Abdesselam, R.: A topological approach of Principal Component Analysis. International
Journal of Data Science and Analysis. 77(2), 20–31 (2021)
3. Abdesselam, R.: A topological Multiple Correspondence Analysis. Journal of Mathematics
and Statistical Science, ISSN 2411-2518, 5(8), 175–192 (2019)
4. Abdesselam, R.: A topological Discriminant Analysis. Data Analysis and Applications 2,
Utilization of Results in Europe and Other Topics, Vol.3, Part 4. pp. 167–178 Wiley, (2019)
5. Batagelj, V., Bren, M.: Comparing resemblance measures. Journal of Classification, 12(1),
73–90 (1995)
6. Caillez, F., Pagès, J. P.: Introduction à l’Analyse des Données. S.M.A.S.H., Paris (1976)
7. Kim, J. H. and Lee, S.: Tail bound for the minimal spanning tree of a complete graph. In
Statistics & Probability Letters, 4(64), 425–430 (2003)
8. Lebart, L.: Stratégies du traitement des données d’enquêtes. La Revue de MODULAD, 3,
21–30 (1989)
9. Lesot, M. J., Rifqi, M., Benhadda, H.: Similarity measures for binary and numerical data: a
survey. In: IJKESDP, 1(1), 63–84 (2009)
10. Panagopoulos, D.: Topological data analysis and clustering. Chapter for a book, Algebraic
Topology (math.AT) arXiv:2201.09054, Machine Learning (2022)
11. Park, J. C., Shin, H., Choi, B. K.: Elliptic Gabriel graph for finding neighbors in a point set and
its application to normal vector estimation. Computer-Aided Design Elsevier, 38(6), 619–626
(2006)
12. SAS Institute Inc. SAS/STAT Software, the Cluster Procedure, Available via DIALOG.
https://support.sas.com/documentation/onlinedoc/stat/142/cluster.pdf
13. Selectra: Electricité renouvelable: quelles sont les régions les plus vertes de France ?
https://selectra.info/energie/actualites/expert/electricite-renouvelable-regions-plus-vertes-france (2020)
14. Toussaint, G. T.: The relative neighbourhood graph of a finite planar set. Pattern Recognition,
12(4) 261–268 (1980)
15. Ward, J. H.: Hierarchical grouping to optimize an objective function. Journal of the American
Statistical Association, 58(301), 236–244 (1963)
16. Zighed, D., Abdesselam, R., Hadgu, A.: Topological comparisons of proximity measures. In: Tan et al. (eds.) Proc. 16th PAKDD 2012 Conference, pp. 379–391. Springer (2012)
Model Based Clustering of Functional Data with
Mild Outliers

Cristina Anton and Iain Smith

Abstract We propose a procedure, called CFunHDDC, for clustering functional data


with mild outliers which combines two existing clustering methods: the functional
high dimensional data clustering (FunHDDC) [1] and the contaminated normal mix-
ture (CNmixt) [3] method for multivariate data. We adapt the FunHDDC approach
to data with mild outliers by considering a mixture of multivariate contaminated nor-
mal distributions. To fit the functional data in group-specific functional subspaces
we extend the parsimonious models considered in FunHDDC, and we estimate the
model parameters using an expectation-conditional maximization algorithm (ECM).
The performance of the proposed method is illustrated for simulated and real-world
functional data, and CFunHDDC outperforms FunHDDC when applied to functional
data with outliers.

Keywords: functional data, model-based clustering, contaminated normal distribu-


tion, EM algorithm

1 Introduction

Recently, model-based clustering for functional data has received a lot of attention.
Real data are often contaminated by outliers that affect the estimations of the model
parameters. Here we propose a method for clustering functional data with mild
outliers. Mild outliers are usually sampled from a population different from the

Cristina Anton
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada,
e-mail: [email protected]
Iain Smith
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada,
e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_2

assumed model, so we need to choose a model flexible enough to accommodate


them.
Functional data live in an infinite dimensional space and model-based methods
for clustering are not directly available because the notion of probability density
function generally does not exist for such data. A first approach is to use a two-
step method and first do a discretization or a decomposition of the functional data
in a basis of functions (such as Fourier series, B-splines, etc.), and then directly
apply multivariate clustering methods to the discretization or the basis coefficients.
A second approach, which allows the interaction between the discretization and the
clustering steps, is based on a probabilistic model for the basis coefficients [1, 2].
We follow the second approach, and we propose a method, called CFunHDDC,
which extends the functional high dimensional data clustering (FunHDDC) [1] to
clustering functional data with mild outliers. There are several methods to detect
outliers of functional data and a robust clustering methodology based on trimming
is presented in [4]. Our approach does not involve trimming the outliers and it is
inspired by the method CNmixt [3] for clustering multivariate data with mild outliers.
We propose a model for the basis coefficients based on a mixture of contaminated
multivariate normal distributions. A multivariate contaminated normal distribution
is a two-component normal mixture in which the bad observations (outliers) are
represented by a component with a small prior probability and an inflated covariance
matrix.
In the next section we present the model and its parsimonious variants. Parameter
estimation is included in Section 3. In Section 4 we present applications to simulated
and real-world data. The last section includes the conclusions.

2 The Model

We suppose that we observe 𝑛 curves {𝑥1 , . . . , 𝑥 𝑛 } and we want to cluster them


in 𝐾 homogeneous groups. For each curve 𝑥𝑖 we have access to a finite set of
values 𝑥𝑖 𝑗 = 𝑥𝑖 (𝑡 𝑖 𝑗 ), where 0 ≤ 𝑡 𝑖1 < 𝑡 𝑖2 < . . . < 𝑡𝑖𝑚𝑖 ≤ 𝑇. We assume that the
observed curves are independent realizations of a 𝐿 2 − continuous stochastic process
𝑋 = {𝑋 (𝑡)}𝑡 ∈ [0,𝑇 ] for which the sample paths are in 𝐿 2 [0, 𝑇]. To reconstruct the
functional form of the data we assume that the curves belong to a finite dimensional
space spanned by a basis of functions {𝜉1 , . . . , 𝜉 𝑝 }, so we have the expansion for
each curve
$$ x_i(t) = \sum_{j=1}^{p} \gamma_{ij} \, \xi_j(t). $$

Here we assume that the dimension 𝑝 is fixed and known. We consider a model based
on a mixture of multivariate contaminated normal distributions for the coefficients
vectors {𝛾1 , . . . , 𝛾𝑛 } ⊂ R 𝑝 , 𝛾𝑖 = (𝛾𝑖1 , . . . , 𝛾𝑖 𝑝 ) > ∈ R 𝑝 , 𝑖 = 1, . . . , 𝑛.
We suppose that there exists two unobserved random variables 𝑍 = (𝑍1 , . . . , 𝑍 𝐾 ),
Υ = (Υ1 , . . . , Υ𝐾 ) ∈ {0, 1} 𝐾 where 𝑍 indicates the cluster membership and Υ

whether an observation is good or bad (outlier). 𝑍 𝑘 = 1 if 𝑋 ∈ 𝑘th cluster and 𝑍 𝑘 = 0


otherwise, and Υ𝑘 = 1 if 𝑋 ∈ 𝑘th cluster and it is a good observation, and Υ𝑘 = 0
otherwise. For clustering we need to predict the value 𝑧𝑖 = (𝑧 𝑖1 , . . . , 𝑧 𝑖𝐾 ) of 𝑍, and
to determine the bad observations we need to predict the value 𝜈𝑖 = (𝜈𝑖1 , . . . , 𝜈𝑖𝐾 )
of Υ for each observed curve 𝑥𝑖 , 𝑖 = 1, . . . , 𝑛.
We consider a set of 𝑛 𝑘 observed curves of the 𝑘th cluster with the coefficients
{𝛾1 , . . . , 𝛾𝑛𝑘 } ⊂ R 𝑝 . We assume that {𝛾1 , . . . , 𝛾𝑛𝑘 } are independent realizations
of a random vector Γ ∈ R 𝑝 , and that the stochastic process associated with the
𝑘th cluster can be described in a lower dimensional subspace E 𝑘 [0, 𝑇] ⊂ 𝐿 2 [0, 𝑇]
with dimension 𝑑 𝑘 ≤ 𝑝 and spanned by the first 𝑑 𝑘 elements of a group specific
basis of functions {𝜙 𝑘 𝑗 } 𝑗=1,...,𝑑𝑘 that can be obtained from {𝜉 𝑗 } 𝑗=1,..., 𝑝 by a linear
transformation
$$ \phi_{kj} = \sum_{l=1}^{p} q_{k,jl} \, \xi_l, $$

with a p × p orthogonal matrix Q_k = (q_{k,jl}). In [1], for FunHDDC, the assumption is that the distribution of Γ for the kth cluster is Γ ∼ N(μ_k, Σ_k), with Σ_k = Q_k Δ_k Q_k^⊤, where Δ_k is the p × p diagonal matrix

$$ \Delta_k = \operatorname{diag}\big(\underbrace{a_{k1}, \ldots, a_{k d_k}}_{d_k},\ \underbrace{b_k, \ldots, b_k}_{p - d_k}\big), $$

with a_{ki} > b_k, i = 1, ..., d_k. We can say that the variance of the actual data in the
𝑘th cluster is modeled by 𝑎 𝑘1 , . . . , 𝑎 𝑘 𝑑𝑘 and the parameter 𝑏 𝑘 models the variance
of the noise [1].
We follow the approach in [3] and we assume that Γ for the 𝑘th cluster has the
multivariate contaminated normal distribution with density

𝑓 (𝛾𝑖 ; 𝜃 𝑘 ) = 𝛼 𝑘 𝜙(𝛾𝑖 ; 𝜇 𝑘 , Σ 𝑘 ) + (1 − 𝛼 𝑘 )𝜙(𝛾𝑖 ; 𝜇 𝑘 , 𝜂 𝑘 Σ 𝑘 ), (1)

where 𝛼 𝑘 ∈ (0.5, 1), 𝜂 𝑘 > 1, 𝜃 𝑘 = {𝛼 𝑘 , 𝜇 𝑘 , Σ 𝑘 , 𝜂 𝑘 }, and 𝜙(𝛾𝑖 ; 𝜇 𝑘 , Σ 𝑘 ) is the density


for the 𝑝−variate normal distribution 𝑁 (𝜇 𝑘 , Σ 𝑘 ):
 
− 𝑝/2 −1/2 1 > −1
𝜙(𝛾𝑖 ; 𝜇 𝑘 , Σ 𝑘 ) = (2𝜋) |Σ 𝑘 | exp − (𝛾𝑖 − 𝜇 𝑘 ) Σ 𝑘 (𝛾𝑖 − 𝜇 𝑘 ) (2)
2

Here 𝛼 𝑘 defines the proportion of uncontaminated data in the 𝑘th cluster and 𝜂 𝑘
represents the degree of contamination. We can see 𝜂 𝑘 as an inflation parameter that
measures the increase in variability due to the bad observations.
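The density (1) is straightforward to evaluate. The sketch below is our illustration, not the authors' code; it assumes the dmvnorm() function of the mvtnorm package for the multivariate normal component (2), and the helper name dcontam is hypothetical.

```r
# Hypothetical sketch of the contaminated normal density of Eq. (1).
library(mvtnorm)

dcontam <- function(gamma, mu, Sigma, alpha, eta) {
  # alpha in (0.5, 1): proportion of good observations; eta > 1: inflation of the
  # covariance matrix for the bad observations (outliers)
  alpha       * dmvnorm(gamma, mean = mu, sigma = Sigma) +
  (1 - alpha) * dmvnorm(gamma, mean = mu, sigma = eta * Sigma)
}
```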
Each curve x_i has a basis expansion with coefficients γ_i such that γ_i is a random vector whose distribution is a mixture of contaminated Gaussians with density

$$ p(\gamma; \theta) = \sum_{k=1}^{K} \pi_k \, f(\gamma; \theta_k) \qquad (3) $$

where π_k = P(Z_k = 1) is the prior probability of the kth cluster and θ = ∪_{k=1}^{K} (θ_k ∪ {π_k}) is the set formed by all the parameters. We refer to this model as
FCLM[𝑎 𝑘 𝑗 , 𝑏 𝑘 , 𝑄 𝑘 , 𝑑 𝑘 ] (functional contaminated latent mixture). As in [1] we con-
sider the parsimonious sub-models: FCLM[𝑎 𝑘 𝑗 , 𝑏, 𝑄 𝑘 , 𝑑 𝑘 ], FCLM[𝑎 𝑘 , 𝑏 𝑘 , 𝑄 𝑘 , 𝑑 𝑘 ],
FCLM[𝑎, 𝑏 𝑘 , 𝑄 𝑘 , 𝑑 𝑘 ], FCLM[𝑎 𝑘 , 𝑏, 𝑄 𝑘 , 𝑑 𝑘 ], FCLM[𝑎, 𝑏, 𝑄 𝑘 , 𝑑 𝑘 ].

3 Model Inference

To fit the models we use the ECM algorithm [3], which is a variant of the EM
algorithm. In the ECM algorithm we replace the M-step in the EM algorithm by two
simpler CM-steps given by the partition of the set with the parameters 𝜃 = {Ψ1 , Ψ2 },
where Ψ1 = {𝜋 𝑘 , 𝛼 𝑘 , 𝜇 𝑘 , 𝑎 𝑘 𝑗 , 𝑏 𝑘 , 𝑞 𝑘 𝑗 , 𝑘 = 1, . . . , 𝐾, 𝑗 = 1, . . . , 𝑑 𝑘 }, Ψ2 = {𝜂 𝑘 , 𝑘 =
1, . . . , 𝐾 }, and 𝑞 𝑘 𝑗 is the 𝑗th column of 𝑄 𝑘 .
We have two sources of missing data: the clusters’ labels and the type of observa-
tion (good or bad). Thus the complete data are given by 𝑆 = {𝛾𝑖 , 𝑧𝑖 , 𝜈𝑖 }𝑖=1,...,𝑛 , and
the complete-data likelihood is
$$ L_c(\theta; S) = \prod_{i=1}^{n} \prod_{k=1}^{K} \Big\{ \pi_k \, [\alpha_k \phi(\gamma_i; \mu_k, \Sigma_k)]^{\nu_{ik}} \, [(1 - \alpha_k)\, \phi(\gamma_i; \mu_k, \eta_k \Sigma_k)]^{1 - \nu_{ik}} \Big\}^{z_{ik}} $$

We denote the complete-data log-likelihood by 𝑙 𝑐 (𝜃; 𝑆) = log(𝐿 𝑐 (𝜃; 𝑆)).


Next we present the ECM algorithm for the model FCLM[a_{kj}, b_k, Q_k, d_k]. At the qth iteration of the ECM algorithm, in the E-step we calculate E[l_c(θ^{(q−1)}; S) | γ_1, ..., γ_n, θ^{(q−1)}], given the current values of the parameters θ^{(q−1)}. This reduces to the calculation of z_{ik}^{(q)} := E[Z_{ik} | γ_i, θ^{(q−1)}] and ν_{ik}^{(q)} := E[Υ_{ik} | γ_i, z_i, θ^{(q−1)}].
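For a mixture of contaminated normals these conditional expectations have the usual closed form: posterior cluster probabilities and, within a cluster, the posterior probability of being a good observation. A sketch of our own (dmvnorm() from mvtnorm as above; the function name and the layout of the parameter list are assumptions):

```r
# Hypothetical E-step sketch: z[i, k] = E[Z_ik | gamma_i], nu[i, k] = E[Y_ik | gamma_i, Z_ik = 1].
e_step <- function(Gamma, pars) {
  # Gamma: n x p matrix of basis coefficients; pars: list with pi, alpha, eta
  # (length-K vectors), mu (K x p matrix), Sigma (list of K covariance matrices)
  n <- nrow(Gamma); K <- length(pars$pi)
  z <- nu <- matrix(0, n, K)
  for (k in 1:K) {
    good <- dmvnorm(Gamma, mean = pars$mu[k, ], sigma = pars$Sigma[[k]])
    bad  <- dmvnorm(Gamma, mean = pars$mu[k, ], sigma = pars$eta[k] * pars$Sigma[[k]])
    fk   <- pars$alpha[k] * good + (1 - pars$alpha[k]) * bad
    z[, k]  <- pars$pi[k] * fk                 # unnormalized cluster posterior
    nu[, k] <- pars$alpha[k] * good / fk       # P(good | gamma_i, cluster k)
  }
  list(z = z / rowSums(z), nu = nu)
}
```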
In the first CM-step at the qth iteration of the ECM algorithm we calculate Ψ_1^{(q)} as the value of Ψ_1 that maximizes l_c^{(q−1)} with Ψ_2 fixed at Ψ_2^{(q−1)}. We obtain

$$ \pi_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)}}{n}, \qquad \alpha_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)} \nu_{ik}^{(q)}}{\sum_{i=1}^{n} z_{ik}^{(q)}}, \qquad \mu_k^{(q)} = \frac{\sum_{i=1}^{n} z_{ik}^{(q)} \Big( \nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}} \Big)\, \gamma_i}{\sum_{i=1}^{n} z_{ik}^{(q)} \Big( \nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}} \Big)} \qquad (4) $$

$$ \Sigma_k^{(q)} = \frac{1}{\sum_{i=1}^{n} z_{ik}^{(q)}} \sum_{i=1}^{n} z_{ik}^{(q)} \Big( \nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}} \Big) (\gamma_i - \mu_k^{(q)})(\gamma_i - \mu_k^{(q)})^{\top} \qquad (5) $$

We introduce a value 𝛼∗ and we constrain 𝛼 𝑘 ∈ (𝛼∗ , 1). If the estimation 𝛼 𝑘(𝑞) in


(4) is less than 𝛼∗ , we use the optimize() function in the stats package in R to do a
numerical search for 𝛼 𝑘(𝑞) .
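The updates (4)-(5) translate into a few weighted sums. A sketch for a single cluster k (our illustration, continuing the hypothetical E-step sketch above; the handling of the α* constraint is deliberately simplified):

```r
# Hypothetical sketch of the first CM-step for cluster k, Eqs. (4)-(5).
cm_step1_k <- function(Gamma, z_k, nu_k, eta_k_prev, alpha_star = 0.75) {
  w       <- z_k * (nu_k + (1 - nu_k) / eta_k_prev)       # per-observation weights
  pi_k    <- sum(z_k) / nrow(Gamma)                        # Eq. (4), first update
  alpha_k <- max(sum(z_k * nu_k) / sum(z_k), alpha_star)   # crude clamp; the paper uses optimize()
  mu_k    <- colSums(w * Gamma) / sum(w)                   # Eq. (4), third update
  centred <- sweep(Gamma, 2, mu_k)
  Sigma_k <- crossprod(centred * sqrt(w)) / sum(z_k)       # Eq. (5)
  list(pi = pi_k, alpha = alpha_k, mu = mu_k, Sigma = Sigma_k)
}
```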

As in [1] we get the updated values a_{kj}^{(q)}, b_k^{(q)}, q_{kj}^{(q)}, k = 1, ..., K, j = 1, ..., d_k from the sample covariance matrix Σ_k^{(q)} of cluster k, using also the matrix of inner products between the basis functions W = (w_{jl})_{1 ≤ j,l ≤ p}, where w_{jl} = ∫_0^T ξ_j(t) ξ_l(t) dt.
In the second CM-step at the qth iteration of the ECM algorithm we calculate η_k^{(q)} as the value that maximizes l_c^{(q−1)} with Ψ_1 fixed at Ψ_1^{(q)}.
At the end of the ECM algorithm, we do a two-step classification to provide the expected clustering. If q_f is the last iteration of the algorithm before convergence, an observation γ_i ∈ R^p is assigned to the cluster k_0 ∈ {1, ..., K} with the largest z_{ik}^{(q_f)}. Next, an observation γ_i that was assigned to the cluster k_0 is considered good if ν_{ik_0}^{(q_f)} > 0.5, and it is considered bad otherwise. After the classification step we can eliminate the bad observations and run FunHDDC to re-cluster the remaining observations.
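In code, this final classification is a MAP assignment followed by thresholding the posterior probability of being good at 0.5 (a sketch continuing the hypothetical helpers above, with Gamma and the converged parameter list pars assumed to be available):

```r
# Hypothetical sketch of the final two-step classification.
post    <- e_step(Gamma, pars)                                    # E-step quantities at convergence
cluster <- max.col(post$z)                                        # cluster k0 with the largest z_ik
outlier <- post$nu[cbind(seq_len(nrow(Gamma)), cluster)] <= 0.5   # flagged bad observations
```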
The class specific dimension 𝑑 𝑘 is selected through the scree-test of Cattell by
comparison of the difference between eigenvalues with a given threshold [1]. The
number of clusters 𝐾 as well as the parsimonious model are selected using the BIC
criterion.

4 Applications

Fig. 1 Smooth data simulated without outliers (a), according to scenario A (b), scenario B (c), and scenario C (d), coloured by group for one simulation.

We simulate 1000 curves based on the model FCLM[𝑎 𝑘 , 𝑏 𝑘 , 𝑄 𝑘 , 𝑑 𝑘 ]. The number


of clusters is fixed to 𝐾 = 3 and the mixing proportions are equal 𝜋1 = 𝜋2 = 𝜋3 = 1/3.
We consider the following values of the parameters
Group 1: 𝑑 = 5, 𝑎 = 150, 𝑏 = 5, 𝜇 = (1, 0, 50, 100, 0, . . . , 0)
Group 2: 𝑑 = 20, 𝑎 = 15, 𝑏 = 8, 𝜇 = (0, 0, 80, 0, 40, 2, 0, . . . , 0)

Group 3: 𝑑 = 10, 𝑎 = 30, 𝑏 = 10, 𝜇 = (0, . . . , 0, 20, 0, 80, 0, 0, 100),


where 𝑑 is the intrinsic dimension of the subgroups, 𝜇 is the mean vector of size 70,
𝑎 is the value of the first 𝑑 diagonal elements of Δ, and 𝑏 the value of the last 70 − 𝑑 ones. Curves are smoothed using 35 Fourier basis functions. We repeat the simulation
100 times. A sample of theses data is plotted in Figure 1 a. We consider the following
contamination schemes where the scores are simulated from contaminated normal
distributions with the previous parameters and
A: 𝛼𝑖 = 0.9, 𝑖 = 1, . . . , 3, and 𝜂1 = 7, 𝜂2 = 10, 𝜂3 = 17.
B: 𝛼𝑖 = 0.9, 𝑖 = 1, . . . , 3, and 𝜂1 = 5, 𝜂2 = 50, 𝜂3 = 15.
C: 𝛼𝑖 = 0.9, 𝑖 = 1, . . . , 3, and 𝜂1 = 100, 𝜂2 = 70, 𝜂3 = 170.
Samples for data generated according to scenarios A, B, C are plotted in Figure 1
b, c, d, respectively. We notice that there is more overlapping between the 3 groups
when we increase the values of 𝜂.
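For concreteness, the following sketch shows how contaminated scores of this kind could be drawn for one group, taking Q_k as the identity for illustration (the actual simulation design may differ) and rmvnorm() from the assumed mvtnorm package; the group size 333 is simply 1000/3 rounded.

```r
# Hypothetical simulation of contaminated basis scores for one group.
simulate_scores <- function(n, mu, a, b, d, alpha, eta) {
  p     <- length(mu)
  Sigma <- diag(c(rep(a, d), rep(b, p - d)))     # Delta_k, with Q_k = identity for illustration
  good  <- rbinom(n, 1, alpha)                   # 1 = good observation, 0 = outlier
  t(sapply(seq_len(n), function(i)
    mvtnorm::rmvnorm(1, mean = mu, sigma = if (good[i] == 1) Sigma else eta * Sigma)))
}

# Group 1 of the design above, contaminated as in scenario A
gamma1 <- simulate_scores(n = 333, mu = c(1, 0, 50, 100, rep(0, 66)),
                          a = 150, b = 5, d = 5, alpha = 0.9, eta = 7)
```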

Table 1 Mean (and standard deviation) of ARI for BIC best model on 100 simulations. Bold values
indicates the highest value for each method.
Scenario Method 𝛼∗ 𝜖 ARI ARI Outliers

A FunHDDC - 0.05 0.519 (0.11) -


A FunHDDC - 0.1 0.499(0.05) -
A FunHDDC - 0.2 0.494 (0.01) -
A CFunHDDC 0.75 0.05 0.769 (0.23) 0.959(0.04)
A CFunHDDC 0.75 0.1 0.986(0.08) 0.998(0.01)
A CFunHDDC 0.75 0.2 0.9995 (0.001) 1 (0)
B FunHDDC - 0.05 0.861 (0.23) -
B FunHDDC - 0.1 0.754(0.25) -
B FunHDDC - 0.2 0.52 (0.09) -
B CFunHDDC 0.75 0.05 0.807 (0.22) 0.961(0.05)
B CFunHDDC 0.75 0.1 0.948 (0.14) 0.99(0.03)
B CFunHDDC 0.75 0.2 0.990 (0.062) 0.971 (0.149)
C FunHDDC - 0.05 0.490 (0.02) -
C FunHDDC - 0.1 0.491(0.02) -
C FunHDDC - 0.2 0.494 (0.01) -
C CFunHDDC 0.75 0.05 0.736 (0.23) 0.928(0.10)
C CFunHDDC 0.75 0.1 0.911 (0.18) 0.958(0.15)
C CFunHDDC 0.75 0.2 0.965 (0.11) 0.994 (0.03)

The quality of the estimated partitions obtained using FunHDDC and CFunHDDC
is evaluated using the Adjusted Rand Index (ARI) [3], and the results are included in
Table 1. For FunHDDC we use the library funHDDC in R. We run both algorithms
for 𝐾 = 3 with all 6 sub-models and the best solution in terms of the highest BIC
value for all those submodels is returned. The initialization is done with the 𝑘-means

Table 2 Correct classification rates for each method.

FunHDDC:               ε = 0.01: 0.68   ε = 0.05: 0.64   ε = 0.1: 0.59    ε = 0.2: 0.57
CFunHDDC (α* = 0.85):  ε = 0.01: 0.67   ε = 0.05: 0.70   ε = 0.1: 0.70    ε = 0.2: 0.6
CNmixt:                α* = 0.5: 0.67   α* = 0.75: 0.66  α* = 0.85: 0.67  α* = 0.9: 0.66

strategy with 50 repetitions, and the maximum number of iterations is 200 for the
stopping criterion. We use 𝜖 ∈ {0.05, 0.1, 0.2} in the Cattell test.
We notice that CFunHDDC outperforms FunHDDC, and it gives excellent results
even in Scenario C. For CFunHDDC the best results are obtained for 𝜖 = 0.2 in the
Catell test, and the values of the ARI are close to 1.
Next, we consider the NOx data available in the fda.usc library in R and repre-
senting daily curves of Nitrogen Oxides (NOx) emissions in the neighborhood of
the industrial area of Poblenou, Barcelona (Spain). The measurements of NOx (in
𝜇g/m3 ) were taken hourly resulting in 76 curves for “working days” and 39 curves
for “non-working days” (see Figure 2 a). Since NOx is a contaminant agent, the
detection of outlying emission is useful for environmental protection. This data set
has been used for testing methods for the detection of outliers and to illustrate robust
clustering based on trimming for functional data [4].

We apply CFunHDDC, FunHDDC, and CNmixt to the NOx data. Curves are
smoothed using a basis of 8 Fourier functions, and we run the algorithms for 𝐾 = 2
clusters. For CFunHDDC, FunHDDC we use 𝜖 ∈ {0.001, 0.05, 0.1, 0.2} in the
Cattell test and the rest of the settings are the same as in the simulation study. We
run CNmixt for all 14 models from the ContaminatedMixt R library, based on the
coefficients in the Fourier basis, with 1000 iterations for the stopping criteria, and
initialization done with the 𝑘-means method. The correct classification rates (CCR)
are reported in Table 2.
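A sketch of the data preparation behind this comparison (our code, not the authors'; it assumes the poblenou data set of fda.usc is organised as in its documentation, and uses one possible hand-built set of 8 Fourier functions rather than a basis object):

```r
# Hypothetical preprocessing sketch: ordinary least-squares coefficients of each
# NOx curve in a basis of 8 Fourier functions.
library(fda.usc)
data(poblenou)
Y     <- poblenou$nox$data            # 115 x 24 matrix, one daily NOx curve per row
hours <- poblenou$nox$argvals         # hourly measurement times
omega <- 2 * pi / 24                  # assumed period of one day
# one possible choice of 8 Fourier terms: constant, three sine/cosine pairs, sin(4.)
B <- cbind(1, sin(omega * hours), cos(omega * hours),
              sin(2 * omega * hours), cos(2 * omega * hours),
              sin(3 * omega * hours), cos(3 * omega * hours),
              sin(4 * omega * hours))
Gamma <- t(solve(crossprod(B), crossprod(B, t(Y))))   # 115 x 8 coefficient matrix
```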
The CCR for CFunHDDC are slightly better than the ones for FunHDDC and
CNmixt, and are comparable with the ones reported in Table 1 in [4] for Funclust,

Fig. 2 a. Daily NOx curves (in μg/m3) for 115 days; b., c. Clustering obtained with CFunHDDC, ε = 0.05, α* = 0.85; non-working days (blue), working days (red), outliers (green).

RFC, and TrimK. In Figure 2 b, c we present the clusters and the detected outliers
for 𝜖 = 0.05 and 𝛼∗ = 0.85. The curves that are detected as outliers (green lines)
exhibit different patterns from the rest of the curves.
One of the advantages of extending the FunHDDC to CFunHDDC is the outlier
detection. For 𝛼∗ = 0.85 and 𝜖 = 0.05, CFunHDDC detects 16 outliers, which are the
same as the outliers mentioned in [4]. For the data without outliers, CFunHDDC
becomes equivalent to FunHDDC, and for the trimmed data the CCR increases to
0.79.

5 Conclusion

We propose a new method, CFunHDDC, that extends the FunHDDC functional clus-
tering method to data with mild outliers. Unlike other robust functional clustering
algorithms, CFunHDDC does not involve trimming the data. CFunHDDC is based
on a model formed by a mixture of contaminated multivariate normal distributions,
which makes parameter estimation more difficult than for FunHDDC, so we use an
ECM instead of an EM algorithm. The clustering and outlier detection performance
of CFunHDDC is tested on simulated data and on the NOx data, and it always
outperforms FunHDDC. Moreover, CFunHDDC has a performance comparable with
robust functional clustering methods based on trimming, such as RFC and TrimK,
and it has similar or better performance when compared to a two-step method based
on CNmixt. Although there are several model-based methods for multivariate data
with outliers that can be used to construct two-step methods for functional data, as
observed in [1], these two-step methods always suffer from the difficulty of choosing
the best discretization. CFunHDDC can be extended to multivariate functional data,
and recently, independently of our work, a similar approach was followed in [5], but
without considering the parsimonious models and the value 𝛼∗ .

References

1. Bouveyron, C., Jacques, J.: Model-based clustering of time series in group-specific functional
subspaces. Adv. Data. Anal. Classif. 5(4), 281–300 (2011)
2. Jacques, J., Preda, C.: Funclust: a curves clustering method using functional random variables
density approximation. Neurocomputing 112, 164–171 (2013)
3. Punzo, A., McNicholas, P. D.: Parsimonious mixtures of multivariate contaminated normal
distributions. Biom. J. 58, 1506–1537 (2016)
4. Rivera-Garcia, D., Garcia-Escudero, L. A., Mayo-Iscar, A., Ortega, J.: Robust clustering for
functional data based on trimming and constraints. Adv. Data Anal. Classif. 13, 201–225
(2019)
5. Amovin-Assagba, M., Gannaz, I., Jacques, J.: Outlier detection in multivariate functional data
through a contaminated mixture model. arXiv preprint (2021) https://doi.org/10.48550/arXiv.2106.07222
A Trivariate Geometric Classification of
Decision Boundaries for Mixtures of Regressions

Filippo Antonazzo and Salvatore Ingrassia

Abstract Mixtures of regressions play a prominent role in regression analysis when
the population of interest is known to be divided into homogeneous and disjoint
groups. This typically consists in partitioning the observational space into several
regions through particular hypersurfaces called decision boundaries. A geometrical
analysis of these surfaces allows one to highlight properties of the classifier in use. In
particular, a geometrical classification of decision boundaries for the three most
used mixtures of regressions (with fixed covariates, with concomitant variables and
with random covariates) was provided in the case of one and two covariates, under
Gaussian assumptions and in the presence of a single real response variable. This
work aims to extend these results to a more complex setting where three independent
variables are considered.

Keywords: mixtures of regressions, decision boundaries, hyperquadrics, model-based clustering

Filippo Antonazzo ( )
Inria, Université de Lille, CNRS, Laboratoire de mathématiques Painlevé, 59650 Villeneuve d’Ascq, France, e-mail: [email protected]
Salvatore Ingrassia
Dipartimento di Economia e Impresa, Università di Catania, Corso Italia 55, 95129 Catania, Italy, e-mail: [email protected]

1 Introduction

Linear regression is commonly employed to model the relationship between a 𝑑-dimensional
real vector of covariates X and a real response variable 𝑌. It is well suited
if we can assume that the regression coefficients are fixed over all possible realizations
(x, 𝑦) ∈ R^{𝑑+1} of the couple (X, 𝑌). This assumption fails if it is a-priori known
that realizations come from a population Ω which can be partitioned into 𝐺 disjoint
homogeneous groups Ω𝑔 , 𝑔 = 1, . . . , 𝐺. In this case, a mixture of linear regressions


(or clusterwise regression) is a more appropriate statistical tool. According to their
degree of flexibility and generality, we can distinguish three types of mixtures of
regressions: mixtures of regressions with fixed covariates (MRFC) [3]; mixtures of
regressions with concomitant variables (MRCV) [6] and mixtures of regressions
with random covariates (MRRC), also referred to in literature as cluster-weighted
models [3, 4].
Mixtures of regressions can also be employed from a classification point of view
to identify the group membership of each observation. In this case, the generated
classifier divides the real space into 𝐺 regions through particular R𝑑+1 surfaces
called decision boundaries. In [5], the decision boundaries generated by each type
of mixture are analyzed from a geometrical point of view, especially in those cases
where 𝑑 = 1, 2 and 𝐺 = 2. The aim of the present work is to extend the results
presented in the aforementioned paper to a higher dimensional case where 𝑑 = 3,
giving more insight into the properties of these classifiers. The rest of the paper is
organized as follows. In Section 2 we summarize the main ideas about mixtures of
regressions. In Section 3 decision boundaries are defined, and a geometrical
classification for 𝑑 = 3 and 𝐺 = 2 is proposed in Section 4. In Section 5, we
conclude by investigating, with a practical example, the shape of three-dimensional
decision boundaries in the presence of variables following heavy-tailed 𝑡-distributions.

2 Mixtures of Regressions

Below we briefly define three types of mixtures of regressions, ordered according to


their generality and flexibility, given by an increasing number of parameters.

MRFC. Mixtures of regressions with fixed covariates have the following density:
p(y|x; ψ) = Σ_{g=1}^{G} π_g f(y|x; θ_g).    (1)

The density f(y|x; θ_g) is indexed by a parameter vector θ_g belonging to a Euclidean
parametric space Θ_g. Moreover, every π_g is positive and Σ_{g=1}^{G} π_g = 1. The vector
ψ = (π_1, . . . , π_G, θ_1, . . . , θ_G) denotes the set of all the parameters of the model.

MRCV. The density of a mixture of regressions with concomitant variables is:

p(y|x; ψ) = Σ_{g=1}^{G} f(y|x; θ_g) p(Ω_g|x; α),    (2)

where the vector ψ = (θ_1, . . . , θ_G, α) contains all the parameters indexing the model.
More specifically, p(Ω_g|x; α) is a function depending on x according to a vector
of real parameters α. Typically, the probability p(Ω_g|x; α) is a multinomial logistic
density with α = (α_1^t, . . . , α_G^t)^t and α_g = (α_g0, α_g1^t)^t ∈ R^{d+1}, i.e.:

p(Ω_g|x; α) = exp(α_g0 + α_g1^t x) / Σ_{j=1}^{G} exp(α_j0 + α_j1^t x).

For identifiability reasons, it is necessary to add the constraint α_1 = 0, see [2].

MRRC. Mixtures of regressions with random covariates propose the following
decomposition for the joint density p(x, y; ψ):

p(x, y; ψ) = Σ_{g=1}^{G} f(y|x; θ_g) p(x; ξ_g) π_g,    (3)

where π_g > 0 and Σ_{g=1}^{G} π_g = 1. Furthermore, the model is totally parametrized
by the vector ψ = (π_1, . . . , π_G, θ_1, . . . , θ_G, ξ_1, . . . , ξ_G), where each θ_g indexes the
conditional density f(y|x; θ_g), while each ξ_g refers to the density of X in the group
Ω_g, denoted with p(x; ξ_g).
In particular, under Gaussian assumptions it results that Y|x, Ω_g ∼ N(β_g0 + β_g1^t x, σ_g^2),
where each β_g = (β_g0, β_g1) is a vector of real parameters. Only for the MRRC model, we
will further assume X|Ω_g ∼ N(μ_g, Σ_g) for all g = 1, . . . , G, where μ_g denotes the
mean of the Gaussian distribution, while Σ_g is its covariance matrix. Denoting with
φ(·) the Gaussian density function, equations (1)-(3) can be, respectively, rewritten as

p(y|x; ψ) = Σ_{g=1}^{G} φ(y; β_g0 + β_g1^t x, σ_g^2) π_g,    (4)
p(y|x; ψ) = Σ_{g=1}^{G} φ(y; β_g0 + β_g1^t x, σ_g^2) p(Ω_g|x; α),    (5)
p(x, y; ψ) = Σ_{g=1}^{G} φ(y; β_g0 + β_g1^t x, σ_g^2) φ(x; μ_g, Σ_g) π_g.    (6)

Maximum likelihood estimates for ψ are usually obtained with the Expectation-
Maximization (EM) algorithm. Then, the final estimate is used to build classifiers
which group observations into 𝐺 disjoint classes.

3 Decision Boundaries: Generality

There are different ways to build classifiers. One of the best known is the method of
discriminant functions. The aim of this procedure is to define 𝐺 functions 𝐷 𝑔 (x, 𝑦; 𝜓)
and a decision rule to divide the real space R𝑑+1 into 𝐺 decision regions, named
R 1 , . . . , R 𝐺 . The decision regions have a one-to-one relationship with the subgroups


Ω𝑔 , i.e., if an observation (x, 𝑦) ∈ R𝑑+1 is assigned to R 𝑔 , it is classified as part
of Ω𝑔 . Among all possible decision rules, the most used one consists in assigning
(x, 𝑦) to R 𝑔 if:
𝐷 𝑔 (x, 𝑦; 𝜓) > 𝐷 𝑗 (x, 𝑦; 𝜓) ∀ 𝑗 ≠ 𝑔. (7)
Then, decision boundaries are defined as the surfaces in R𝑑+1 separating the de-
cision regions R 𝑔 , where observations cannot be uniquely classified. Formally,
each decision boundary is a hypersurface represented by the mathematical equa-
tion 𝐷 𝑗 (x, 𝑦; 𝜓) − 𝐷 𝑘 (x, 𝑦; 𝜓) = 0, 𝑗 ≠ 𝑘.
Different choices for discriminant functions are possible: under Gaussian assump-
tions it is convenient to define 𝐷 𝑔 (·) as the logarithm of the 𝑔-th component mixture
density, as it conveys useful computational simplification [5]. So, we can define, for
all the three models, these discriminant functions:
MRFC: D_g(x, y; ψ) = ln[φ(y; β_g0 + β_g1^t x, σ_g^2) π_g]    (8)
MRCV: D_g(x, y; ψ) = ln[φ(y; β_g0 + β_g1^t x, σ_g^2) exp(α_g0 + α_g1^t x)]    (9)
MRRC: D_g(x, y; ψ) = ln[φ(y; β_g0 + β_g1^t x, σ_g^2) φ(x; μ_g, Σ_g) π_g]    (10)

3.1 The Case with 𝑮 = 2

In the case of interest where 𝐺 = 2, there is a single decision boundary defined by


the equation 𝐷 (x, 𝑦; 𝜓) = 𝐷 2 (x, 𝑦; 𝜓) − 𝐷 1 (x, 𝑦; 𝜓) = 0. Thus, the assignment rule
for every point (x, 𝑦) ∈ R𝑑+1 is based on the sign of 𝐷 (x, 𝑦; 𝜓). It assigns (x, 𝑦) to
Ω2 if 𝐷 (x, 𝑦; 𝜓) > 0; to Ω1 , otherwise.
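As a toy illustration of this assignment rule (a minimal base-R sketch; the helper names and the parameter values below are our own inventions, not taken from the paper), the sign of D(x, y; ψ) for an MRFC model with d = 3 can be computed directly from the discriminant functions in (8):

# log of the g-th MRFC component density, cf. Eq. (8)
D_mrfc <- function(x, y, beta0, beta1, sigma2, pig) {
  dnorm(y, mean = beta0 + sum(beta1 * x), sd = sqrt(sigma2), log = TRUE) + log(pig)
}

# G = 2 assignment rule: Omega_2 if D(x, y; psi) = D_2 - D_1 > 0, Omega_1 otherwise
classify_mrfc <- function(x, y, par1, par2) {
  D <- do.call(D_mrfc, c(list(x = x, y = y), par2)) -
       do.call(D_mrfc, c(list(x = x, y = y), par1))
  if (D > 0) 2L else 1L
}

par1 <- list(beta0 = 1, beta1 = c(2, -3, 1),   sigma2 = 0.5, pig = 0.3)  # group 1
par2 <- list(beta0 = 1, beta1 = c(-4, 3, 0.5), sigma2 = 0.5, pig = 0.7)  # group 2
classify_mrfc(x = c(0.2, -0.1, 0.4), y = 1.5, par1, par2)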
In [5] the geometrical properties of the hypersurfaces, defined by the equation
𝐷 (x, 𝑦; 𝜓) = 0, have been investigated up to dimension 𝑑 = 2, providing the follow-
ing propositions for quadrics.

Proposition 1 (MRFC quadrics) The decision boundary between Ω1 and Ω2 is


always a degenerate quadric.

Proposition 2 (MRCV quadrics) If 𝛼𝑡 (𝛽21 − 𝛽11 ) ≠ 0, then the decision boundary


between Ω1 and Ω2 is a paraboloid; otherwise it is a degenerate quadric.

Proposition 3 (MRRC quadrics) Under convenient conditions, the decision bound-


ary between Ω1 and Ω2 can be a degenerate quadric, but it can also assume any
of the general quadric forms.

These results show that models with more flexibility, i.e. with more parameters,
can generate a greater variety of decision boundaries. In the following section, we will
extend these statements to dimension 𝑑 = 3.

4 Geometrical Classification of Decision Boundaries with 𝑮 = 2 and 𝒅 = 3

In this section we extend the previous results for mixtures of regressions in the presence of
two classes and 𝑑 = 3, where the decision boundaries turn out to be hyperquadrics in
R^4. Mathematical proofs of the results for the MRFC and MRCV models are based on an
algebraic analysis of the matrices representing these hyperquadrics.

MRFC. Mixtures of regressions with fixed covariates are characterized by a low


degree of flexibility. Indeed, all decision boundaries are degenerate hyperquadrics
as the following result shows.
Proposition 4 (MRFC hyperquadrics) The decision boundary between Ω1 and Ω2
is a degenerate hyperquadric of rank at most equal to 3. The rank is less than 3 if
𝛽11 = 𝛽21 or (1 − π_1)/π_1 = σ_1/σ_2.

MRCV. A MRCV allows more degrees of freedom than a MRFC. A consequence is


that the obtained decision boundaries are higher rank hyperquadrics, as the following
result states.
Proposition 5 (MRCV hyperquadrics) The decision boundary between Ω1 and Ω2
is a degenerate hyperquadric with rank at most equal to 4. In particular, the rank is
equal to 4 if 𝛼𝑡 (𝛽21 − 𝛽11 ) ≠ 0. In addition, if 𝛼𝑡 (𝛽21 − 𝛽11 ) = 0 and σ_1^2 = σ_2^2, the
matrix has rank at most equal to 2, and therefore the hyperquadric is reducible.

MRRC. Proposition 3 shows that MRRC models exhibit a large number of possible types of


conics and quadrics [5]. This fact is confirmed in dimension 𝑑 = 3, even if a strong
theoretical result is difficult to obtain with simple algebra due to the mathemati-
cal complexity of the MRRC hyperquadric matrix. Indeed, it is possible to show
such flexibility by building several practical examples (not displayed here), where
hyperquadrics of various shapes arise.

Analyzing the provided results, we can note that they perfectly match the hierar-
chy established in dimension 𝑑 = 2. Indeed, a MRFC can generate only degenerate
hyperquadrics of rank 3; the surfaces generated by a MRCV, which has more param-
eters, are still degenerate, but with a higher rank (equal to 4) depending on the same
mathematical condition of Proposition 2; finally a MRRC, the most flexible model in
terms of number of parameters, can give rise to various hyperquadrics, as in 𝑑 = 2.

5 Beyond Gaussian Assumptions: 𝒕-distribution in 𝒅 = 2


In [5], Gaussian assumptions were relaxed by illustrating the case of a simple linear
regression (𝐺 = 2 and 𝑑 = 1) where more general 𝑡-distributions were required
for robustness reasons. It is shown that the generated decision boundaries are more
flexible than their Gaussian counterparts, as they can assume a wider variety of shapes,
although these surfaces can be calculated only numerically. In this section, we
continue the exploration of the 𝑡-distribution case by adding one more variable, thus
𝑑 = 2. Under these more general assumptions, the discriminant functions (8)–(10)

𝑡
𝑀 𝑅𝐹𝐶-𝑡 : 𝐷 𝑔 (x, 𝑦; 𝜓) = ln[𝑞(𝑦; 𝛽𝑔0 + 𝛽𝑔1 x, 𝜎𝑔2 , 𝜂𝑔 )𝜋𝑔 ], (11)
𝑡
𝑀 𝑅𝐶𝑉-𝑡 : 𝐷 𝑔 (x, 𝑦; 𝜓) = ln[𝑞(𝑦; 𝛽𝑔0 + 𝛽𝑔1 x, 𝜎𝑔2 , 𝜂𝑔 ) exp(𝛼𝑔0 + 𝛼𝑔1
𝑡
x)], (12)
𝑀 𝑅𝑅𝐶-𝑡 : 𝐷 𝑔 (x, 𝑦; 𝜓) = ln[𝑞(𝑦; 𝛽𝑔0 + 𝑡
𝛽𝑔1 x, 𝜎𝑔2 , 𝜂𝑔 )𝑞(x; 𝜇𝑔 , 𝚺𝑔 , 𝜈𝑔 )𝜋𝑔 ], (13)

where 𝑞(𝑦; 𝛽𝑔0 + 𝛽𝑔1


𝑡
x, 𝜎𝑔2 , 𝜂𝑔 ) denotes a generalized 𝑡-distribution density, with non-
centrality parameter equal to 𝛽𝑔0 + 𝛽𝑔1 𝑡
x, scaling parameter equal to 𝜎𝑔2 and degrees
of freedom given by 𝜂𝑔 . Similarly, 𝑞(x; 𝜇𝑔 , 𝚺𝑔 , 𝜈𝑔 ) is a multivariate generalized
𝑡-distribution density, where 𝜇𝑔 is the non-centrality parameter, 𝚺𝑔 denotes the
scaling and 𝜈𝑔 represents the degrees of freedom. Figure 1-2 display the decision
boundaries for the three considered models whose parameters are presented in Table
1: they clearly show the gain in flexibility given by the more general distributional
assumptions. Moreover, 𝑡-boundaries with 𝜂1 = 𝜂2 = 10 (Figure 2; red curves) seem
to be closer to Gaussian ones (blue curves) than those with 𝜂1 = 𝜂2 = 3 (Figure 1;
orange curves): this is coherent with standard probabilistic theory.
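The generalized 𝑡 density q(·) appearing in (11)-(13) is not built into base R, but it can be obtained by rescaling dt(); the short sketch below (our own illustration, with arbitrary parameter values) evaluates the MRFC-𝑡 discriminant of Eq. (11):

# log of a location-scale t density: q(y; m, sigma2, eta) = dt((y - m)/sigma, eta)/sigma
ldgt <- function(y, m, sigma2, eta) {
  s <- sqrt(sigma2)
  dt((y - m) / s, df = eta, log = TRUE) - log(s)
}

# MRFC-t discriminant, cf. Eq. (11)
D_mrfc_t <- function(x, y, beta0, beta1, sigma2, eta, pig) {
  ldgt(y, m = beta0 + sum(beta1 * x), sigma2 = sigma2, eta = eta) + log(pig)
}

D_mrfc_t(x = c(0.5, -1), y = 2, beta0 = 1, beta1 = c(2, -3), sigma2 = 0.5, eta = 3, pig = 0.3)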

Table 1 Parameters used in Figures 1 and 2. MRRC: covariance matrices 𝚺1 and 𝚺2 are equal to the identity matrix I2.
Model Group 𝜋𝑔 𝛽𝑔0 𝛽𝑔1 𝜎𝑔2 𝛼𝑔0 𝛼𝑔1 𝜇𝑔 𝜈𝑔
MRFC 1 0.3 1 (2,-3) 0.5
2 0.7 1 (-4,3) 0.5
MRCV 1 0.3 1 (2,-3) 0.5
2 0.7 1 (-4,3) 0.5 1 (-1,0.5)
MRRC 1 0.3 1 (2,-3) 0.5 (1,2) 5
2 0.7 1 (-4,3) 0.5 1 (-1,0.5) (-1,-2) 5

6 Conclusions
This work has provided a trivariate geometrical classification of the
decision boundaries generated by mixtures of regressions in the presence of two classes.
Under Gaussian assumptions, our results confirmed the same hierarchy that was
shown for 𝑑 = 2, as the MRRC turns out to exhibit a wide variety of decision boundaries,
while the other models generate only degenerate surfaces. This is coherent with its high
degree of flexibility given by its very general parametrization.

Fig. 1 Decision boundaries under assumptions of Gaussian (in blue) and 𝑡-distributed variables
with η_1 = η_2 = 3 (in orange) for the three considered mixtures of regressions.

Fig. 2 Decision boundaries under assumptions of Gaussian (in blue) and 𝑡-distributed variables
with η_1 = η_2 = 10 (in red) for the three considered mixtures of regressions.

The provided results could help to select the right model depending on the shape of the data. For example,
if in a descriptive analysis data turn out to be approximately separated by a simple
degenerate hyperquadric, it will be better to estimate a MRFC or a MRCV instead
of a complex MRRC. On the contrary, if the separation surface seems to be non-
degenerate, then it will be preferable to fit a general MRRC. Moreover, this work
also showed that the degree of flexibility (and thus the variety of possible decision
boundaries) can be enhanced by going beyond Gaussianity, assuming, for example,
𝑡-distributed variables. This encourages further extensions in which more general
distributions can be included, allowing a better comprehension of mixtures and
possible applications to generalized linear models where categorical variables are
considered.

References
1. DeSarbo, W. S., Cron, W. L.: A maximum likelihood methodology for clusterwise linear
regression. J. Classif. 5, 249–282 (1988)
2. Grün, B., Leisch, F.: FlexMix version 2: finite mixtures with concomitant variables and varying
and constant parameters. J. Stat. Softw. 28, 1–35 (2008)
3. Hennig, C.: Identifiablity of models for clusterwise linear regression. J. Classif. 17, 273–296
(2000)
4. Ingrassia, S., Minotti, S. C., Vittadini, G.: Local Statistical Modeling via a Cluster-Weighted
Approach with Elliptical Distributions. J. Classif. 29, 363–401 (2012)
5. Ingrassia, S., Punzo, A.: Decision boundaries for mixtures of regressions. J. Korean Stat. Soc.
45, 295–306 (2016)
6. Wedel, M.: Concomitant variables in finite mixture models. Stat. Neerl. 56, 362–375 (2002)

Generalized Spatio-temporal Regression with
PDE Penalization

Eleonora Arnone, Elia Cunial, and Laura M. Sangalli

Abstract We develop a novel generalised linear model for the analysis of data dis-
tributed over space and time. The model involves a nonparametric term 𝑓 , a smooth
function over space and time. The estimation is carried out by the minimization of an
appropriate penalized negative log-likelihood functional, with a roughness penalty
on 𝑓 that involves space and time differential operators, in a separable fashion, or
an evolution partial differential equation. The model can include covariate informa-
tion in a semi-parametric setting. The functional is discretized by means of finite
elements in space, and B-splines or finite differences in time. Thanks to the use of
finite elements, the proposed method is able to efficiently model data sampled over
irregularly shaped spatial domains, with complicated boundaries. To illustrate the
proposed model we present an application to study the criminality in the city of
Portland, from 2015 to 2020.

Keywords: functional data analysis, spatial data analysis, semiparametric regression with roughness penalty

Eleonora Arnone ( )
Dipartimento di Scienze Statistiche, Università di Padova, Via Cesare Battisti, 241, 35121 Padova,
Italy, e-mail: [email protected]
Elia Cunial
Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano,
Italy, e-mail: [email protected]
Laura M. Sangalli
Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano,
Italy, e-mail: [email protected]


1 Introduction

In this work we develop a novel generalised linear model for the analysis of data
distributed over space and time. Let 𝑌 be a real-valued variable of interest, and W a
vector of 𝑞 covariates, observed in 𝑛 spatio-temporal locations {p𝑖 , 𝑡𝑖 }𝑖=1,...,𝑛 ∈ Ω×𝑇,
where Ω ⊂ R2 is a bounded spatial domain, and 𝑇 ⊂ R a temporal interval. We
assume that the expected value of 𝑌 , conditional on the covariates and the location
of observation, can be modeled as:

g(E[Y | W, p, t]) = W^⊤ 𝜷 + f(p, t),

where 𝑔 is a known monotone link function, chosen on the basis of the stochastic
nature of 𝑌 , 𝜷 ∈ R𝑞 is an unknown vector of regression coefficients, and 𝑓 : Ω×𝑇 →
R is an unknown deterministic function, which captures the spatio-temporal variation
of the phenomenon under study. Starting from the values {y_i, w_i}_{i=1,...,n} of the
observed response variable and covariates, we estimate 𝜷 and f in a semiparametric
fashion. In particular, following the approach in [9], which considers a similar problem
for data scattered over space only, we minimize the functional

ℓ({y_i, w_i, p_i, t_i}_{i=1,...,n}; 𝜷, f) + P(f),

where ℓ is the appropriate negative log-likelihood, and P ( 𝑓 ) is a penalty that enforces


𝑓 to be a regular function.
Similarly to the regression methods in [1, 2, 3, 4, 5, 7, 8], the roughness penalty
on 𝑓 , P ( 𝑓 ), involves some partial differential operators. In particular, our aim is
to extend the Spatial-Temporal regression with partial differential equations regu-
larization (ST-PDE), developed in [2, 3, 4], to generalized linear model settings,
further broadening the class of regression models with PDE regularization reviewed
in [6]. Hence, like ST-PDE, the proposed generalized linear model has a rough-
ness penalty that involves a second order linear differential operator 𝐿 applied to 𝑓 .
Specifically, as in [4], we may consider the penalty
P(f) = λ_T ∫_Ω ∫_0^T (∂²f/∂t²)² + λ_S ∫_Ω ∫_0^T (Lf)²,

where the first term accounts for the regularity of the function in time, while the
second accounts for the regularity of the function in space; the importance of each
term is controlled by two smoothing parameters 𝜆𝑇 and 𝜆 𝑆 . Alternatively, as in
[2], we may consider a single penalty which accounts for the spatial and temporal
regularity:
P(f) = λ ∫_Ω ∫_0^T (∂f/∂t + Lf − u)².
Differently from the models in [2, 3, 4], the estimation functional to be minimized
is not quadratic. This poses increased difficulties from the computational point
of view. The minimization is performed via a functional version of the penalized


iterative reweighted least squares (PIRLS) algorithm.
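To fix ideas, the sketch below shows a simplified finite-dimensional analogue of one such scheme (our own illustration, not the authors' implementation): a penalized IRLS loop for a Poisson log-link model with coefficient vector c, design matrix X and a generic penalty matrix P standing in for the discretized roughness penalty.

# penalized IRLS for a Poisson log-link model: minimize -loglik + lambda * c' P c
pirls_poisson <- function(y, X, P, lambda, maxit = 50, tol = 1e-8) {
  cvec <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% cvec)
    mu  <- exp(eta)                       # inverse log link
    W   <- mu                             # Poisson working weights
    z   <- eta + (y - mu) / mu            # working (pseudo-)data
    A   <- crossprod(X, W * X) + lambda * P
    cnew <- drop(solve(A, crossprod(X, W * z)))
    if (max(abs(cnew - cvec)) < tol) { cvec <- cnew; break }
    cvec <- cnew
  }
  cvec
}

# toy usage with a ridge-type penalty (illustrative only)
set.seed(2)
X <- cbind(1, rnorm(100)); y <- rpois(100, exp(0.5 + 0.8 * X[, 2]))
pirls_poisson(y, X, P = diag(2), lambda = 0.1)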
The estimation problem is appropriately discretized. In particular, in time, the
discretization involves either cubic B-splines, for the two-penalties case, or finite
differences, when the single penalty is employed. The discretization in space is
performed via finite elements, on a triangulation of the spatial domain of interest.
This enables one to appropriately handle spatial domains with complicated bound-
aries, such as the one considered in the following section, concerning the study of
criminality data over the city of Portland.

2 Application to Criminality Data

This section describes the Portland criminality data, which will be used to illustrate
the proposed methodology. We will present a Poisson model for the crime counts in
the city, and study their evolution from April 2015 to November 2020. In addition,
we shall consider the population of the city neighborhoods as a covariate. The crime
data are publicly available on the website of the Police Bureau of the city.¹
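Since the response is a count, a natural reading of the model of Section 1 for these data (our sketch of the intended specification, with neighborhood population w as the single covariate) uses the Poisson family with canonical log link:

log E[Y_i | w_i, p_i, t_i] = w_i β + f(p_i, t_i),   Y_i | w_i, p_i, t_i ∼ Poisson,

so that f captures the spatio-temporal pattern of criminality not explained by population.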
The crime counts are aggregated by trimester and at the neighborhood level.
Figure 1 shows the city neighborhoods, each neighborhood colored according to its
total population. The bottom part of the same figure shows the temporal evolution
of the crimes in each neighborhood. Each curve corresponds to a neighborhood
and is colored according to the neighborhood population. In both panels, the three
neighborhoods with the highest number of crimes are indicated by numbers 1, 2
and 3. The figure highlights the presence of some correlation between neighborhood
population and the number of crimes. However, criminality is not fully explained by
population. For instance, neighborhoods 1 and 3 present an high number of crimes
with a moderate population. This raises the interest towards a semiparametric
generalized linear model, as the one introduced in Section 1, with a nonparametric
term accounting for the spatio-temporal variability in the phenomenon, that cannot
be explained by population or other census quantities. Figure 2 shows the same data
for four different trimesters on the Portland map. As already pointed out, the three
area with the highest number of crimes are in the city center, and in the Hazelwood
neighborhood, in the east part of the city.
From Figures 1 and 2 we can see that the shape of the domain is complicated; the
city is indeed crossed by a river, with few bridges connecting the two parts, most of
them placed downtown. Therefore, neighborhoods on opposite sides of the river and
far from the center, where most bridges are located, are close in Euclidean distance
but far apart in reality. This particular morphology influences the phenomenon under
study: for example, in the north of the city, the east side of the river is characterized by
a higher number of crimes than the west side. Due to these characteristics
of the data and the domain, it is of crucial importance to take the shape of the domain
into account during the estimation process.

¹ Police Bureau crime data: https://www.portlandoregon.gov/police/71978

Fig. 1 Top: the city of Portland divided into neighborhoods, each neighborhood colored according
to its total population. Bottom: the total crimes over time for each neighborhood; each curve
corresponds to a neighborhood and is colored according to the neighborhood’s population. The
three neighborhoods with the highest number of crimes are indicated by numbers 1, 2 and 3.

Fig. 2 Total crime counts per neighborhood per trimester; green indicates a lower number of crimes,
red indicates a higher number of crimes.

For this reason, estimation based on
classical semiparametric models, such as those based on thin-plate splines, would
give poor results, while the proposed method is particularly well suited, being able
to comply with the nontrivial form of the domain.

References

1. Aguilera-Morillo, M. C., Durbán, M., Aguilera, A. M.: Prediction of functional data with
spatial dependence: a penalized approach. Stoch. Environ. Res. Risk Assess. 31, 7–22 (2017)
2. Arnone, E., Azzimonti, L., Nobile, F., Sangalli, L. M.: Modeling spatially dependent functional
data via regression with differential regularization. J. Multivariate Anal. 170, 275–295 (2019)
3. Arnone, E., Sangalli, L. M., Vicini, A.: Smoothing spatio-temporal data with complex missing
data patterns. Stat. Model. Int. J. (2021)
4. Bernardi, M. S., Sangalli, L. M., Mazza, G., Ramsay, J. O.: A penalized regression model for
spatial functional data with application to the analysis of the production of waste in Venice
province. Stoch. Environ. Res. Risk Assess. 31, 23–38 (2017)
5. Marra, G., Miller, D. L., Zanin, L.: Modelling the spatiotemporal distribution of the incidence
of resident foreign population. Statistica Neerlandica 66(2) 133–160 (2012)
6. Sangalli, L. M.: Spatial regression with partial differential equation regularization. Int. Stat.
Rev. 89(3), 505–531 (2021)
7. Ugarte, M. D., Goicoa, T., Militino, A. F., Durbán, M.: Spline smoothing in small area trend
estimation and forecasting. Comput. Stat. Data Anal. 53(10), 3616–3629 (2009)
8. Ugarte, M. D., Goicoa, T., Militino, A. F.: Spatio-temporal modeling of mortality risks using
penalized splines. Environmetrics 21, 270–289 (2010)
9. Wilhelm M., Sangalli L. M.: Generalized spatial regression with differential regularization. J.
Stat. Comput. Simulat. 86(13), 2497–2518 (2016)

A New Regression Model for the Analysis of
Microbiome Data

Roberto Ascari and Sonia Migliorati

Abstract Human microbiome data are becoming extremely common in biomed-


ical research due to the relevant connections with different types of diseases.
A widespread discrete distribution to analyze this kind of data is the Dirichlet-
multinomial. Despite its popularity, this distribution often fails in modeling micro-
biome data due to the strict parameterization imposed on its covariance matrix. The
aim of this work is to propose a new distribution for analyzing microbiome data
and to define a regression model based on it. The new distribution can be expressed
as a structured finite mixture model with Dirichlet-multinomial components. We
illustrate how this mixture structure can improve a microbiome data analysis to
cluster patients into "enterotypes", which are a classification based on the bacterio-
logical composition of gut microbiota. The comparison between the two models is
performed through an application to a real gut microbiome dataset.

Keywords: count data, Bayesian inference, mixture model, multivariate regression

Roberto Ascari ( )
Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, Milan, Italy, e-mail: [email protected]
Sonia Migliorati
Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, Milan, Italy, e-mail: [email protected]

1 Introduction

The human microbiome is defined as the set of genes associated with the micro-
biota, i.e. the microbial community living in the human body, including bacteria,
viruses and some unicellular eukaryotes [1, 8]. The mutualistic relationship be-
tween microbiota and human beings is often beneficial, though it can sometimes
become detrimental for several health outcomes. For example, changes in the gut
microbiome composition can be associated with diabetes, cardiovascular disease,
obesity, autoimmune disease, anxiety and many other factors impacting on human
health [1, 5, 12, 14]. Moreover, the development of next-generation sequencing
technologies allows nowadays to survey the microbiome composition using direct
DNA sequencing of either marker genes or the whole metagenomics, without the
need for isolation and culturing. These are the two main reasons for the recent ex-
plosion of research on the microbiome, and highlight the importance of understanding
the association between microbiome composition and biological and environmental
covariates.
A widespread distribution for handling microbiome data is the Dirichlet-
multinomial (DM) (e.g., see [4, 16]), a generalization of the multinomial distribution
obtained by assuming that, instead of being fixed, the underlying taxa proportions
come from a Dirichlet distribution. This allows to model overdispersed data counts,
that is data showing a variance much larger than that predicted by the multinomial
model. Despite its popularity, the DM distribution is often inadequate to model
real microbiome datasets due to the strict covariance structure imposed by its pa-
rameterization, which hinders the description of co-occurrence and co-exclusion
relationships between microbial taxa.
The aim of this work is to propose a new distribution that generalizes the DM,
namely the flexible Dirichlet-multinomial (FDM), and a regression model based on
it. The new model provides a better fit to real microbiome data, still preserving a
clear interpretation of its parameters. Moreover, being a finite mixture with DM
components, it enables one to account for the latent group structure of the data, and thus to
identify clusters sharing similar biota compositions.

2 Statistical Models for Microbiome Data

In this section, we define a new distribution for multivariate counts and a regression
model based on it, which allows microbiome abundances to be linked with covariates. Note
that, once the DNA sequence reads have been aligned to the reference microbial
genomes, the abundances of microbial taxa can be quantified. Thus, microbiome data
represent the count composition of 𝐷 bacterial taxa in a specific biological sample,
and a microbiome dataset is a sequence of 𝐷-dimensional vectors Y1 , Y2 , . . . , Y 𝑁 ,
where 𝑌𝑖𝑟 counts the number of occurrences of taxon 𝑟 in the 𝑖-th sample (𝑖 =
1, . . . , 𝑁 and 𝑟 = 1, . . . , 𝐷). Since the 𝑖-th sample contains a number 𝑛𝑖 of bacteria,
microbiome observations are subject to a fixed-sum constraint, that is Σ_{r=1}^{D} Y_ir = n_i.

2.1 Count Distributions

Following a compound approach, we assume that Y | 𝚷 = 𝝅 ∼ Multinomial(n, 𝝅),
and we consider suitable distributions for the vector of probabilities 𝚷 ∈ S^D. The
set S^D = {𝝅 = (π_1, . . . , π_D)^⊤ : π_r > 0, Σ_{r=1}^{D} π_r = 1} is the D-part simplex and
it is the proper support of continuous compositional vectors. A distribution for Y is
obtained by marginalizing the joint distribution of (Y, 𝚷). A common choice for
this distribution is the mean-precision parameterized Dirichlet, whose probability
density function (p.d.f.) is

f_Dir(𝝅; 𝝁, α+) = [Γ(α+) / ∏_{r=1}^{D} Γ(α+ μ_r)] ∏_{r=1}^{D} π_r^{α+ μ_r − 1},

where 𝝁 = E[𝚷] ∈ S 𝐷 , and 𝛼+ > 0 is a precision parameter. Compounding the


multinomial distribution with the Dirichlet one leads to the DM distribution, widely
used in microbiome data analysis, whose probability mass function (p.m.f.) is

f_DM(y; n, 𝝁, α+) = [n! Γ(α+) / Γ(α+ + n)] ∏_{r=1}^{D} Γ(α+ μ_r + y_r) / (y_r! Γ(α+ μ_r)).

The mean vector of a DM distribution is E[Y] = 𝑛𝝁, so that the parameter 𝝁 =


E[Y]/n can be thought of as a scaled mean vector. Moreover, its covariance matrix is

V[Y] = n M (1 + (n − 1)/(α+ + 1)),    (1)

where M = Diag(𝝁) − 𝝁𝝁^⊤. Equation (1) highlights how the additional parameter
α+ allows for increased flexibility in the variability structure with respect to the standard
multinomial distribution.
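As a quick numerical check of this parameterization (a base-R sketch of ours, not tied to any particular package), the DM log-p.m.f. can be evaluated through log-gamma functions:

# log p.m.f. of the Dirichlet-multinomial with scaled mean mu and precision alpha (= alpha+)
ldm <- function(y, mu, alpha) {
  n <- sum(y)
  lgamma(n + 1) + lgamma(alpha) - lgamma(alpha + n) +
    sum(lgamma(alpha * mu + y) - lgamma(y + 1) - lgamma(alpha * mu))
}

ldm(y = c(6, 3, 1), mu = c(0.5, 0.3, 0.2), alpha = 2)   # D = 3 taxa, n = 10 reads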
We propose to take advantage of an alternative sound distribution defined on S 𝐷 ,
namely the flexible Dirichlet (FD) [7, 9]. The latter is a structured finite mixture with
Dirichlet components, entailing some constraints among the components’ parameters
to ensure model identifiability. Thanks to its mixture structure, the p.d.f. of a FD-
distributed random vector can be expressed as

f_FD(𝝅; 𝝁, α+, w, p) = Σ_{j=1}^{D} p_j f_Dir(𝝅; 𝝀_j, α+/(1 − w)),    (2)

where
𝝀_j = 𝝁 − w p + w e_j    (3)
is the mean vector of the j-th component, 𝝁 = E[𝚷] ∈ S^D, α+ > 0, p ∈ S^D,
0 < w < min{1, min_{r ∈ {1,...,D}} μ_r/p_r}, and e_j is a vector with all elements equal to
zero except for the j-th, which is equal to one.

Equation (2) points out that the Dirichlet components have different mean vectors
and a common precision parameter, the latter being determined by 𝛼+ and 𝑤. In
particular, inspecting Equation (3), it is easy to observe that any two vectors 𝝀𝑟 and
𝝀 ℎ , 𝑟 ≠ ℎ, coincide in all the elements except for the 𝑟-th and the ℎ-th.
If 𝚷 is supposed to be FD distributed, a new discrete distribution for count vectors
can be defined, which we shall call the flexible Dirichlet-multinomial (FDM). The p.m.f. of
the FDM can be expressed as

f_FDM(y; n, 𝝁, α+, p, w) = Σ_{j=1}^{D} p_j f_DM(y; n, 𝝀_j, α+/(1 − w))    (4)
  = Σ_{j=1}^{D} p_j [n! Γ(α+/(1 − w)) / Γ(α+/(1 − w) + n)] ∏_{r=1}^{D} Γ(α+/(1 − w) λ_jr + y_r) / (y_r! Γ(α+/(1 − w) λ_jr)),

where 𝝀 𝑗 is defined in Equation (3). Interestingly, it is possible to recognize the


flexible beta-binomial (FBB) [3] distribution as a special case of the FDM. The
FBB is a generalization of the binomial distribution successful in dealing with
overdispersion. Moreover, note that when p = 𝝁 and 𝑤 = 1/(𝛼+ + 1) the DM
distribution is recovered.
Equation (4) shows that the FDM is a finite mixture with DM components
displaying a common precision parameter and different scaled mean vectors 𝝀 𝑗 ,
𝑗 = 1, . . . , 𝐷. The overall mean vector and the covariance matrix of the FDM can
be expressed as

E[Y] = n 𝝁,
V[Y] = n M (1 + (n − 1)/(φ + 1)) + n (n − 1) φ w² /(φ + 1) P,    (5)

where M = Diag(𝝁) − 𝝁𝝁^⊤, P = Diag(p) − pp^⊤, and φ = α+/(1 − w) is
the common precision parameter of the DM components. A comparison between
Equations (5) and (1) points out that the covariance matrix of the FDM distribution
is a very easily interpretable extension of the DM’s covariance matrix. Indeed, it is
composed of two terms, the first one coinciding with the DM’s covariance matrix,
whereas the second one depends on the mixture structure of the FDM model. In
particular, the FDM covariance matrix has 𝐷 additional parameters with respect to
the DM, namely 𝐷 − 1 distinct elements in the vector of mixing weights p, and the
parameter 𝑤 which controls the distance among the components’ barycenters [7].
This is the key element explaining the better ability of the FDM in modeling a wide
range of scenarios.
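Building on the ldm() sketch given after Equation (1) (again our own base-R illustration), the FDM log-p.m.f. in (4) can be evaluated as a log-sum-exp over the D mixture components, with component means 𝝀_j from Equation (3):

# log p.m.f. of the flexible Dirichlet-multinomial, cf. Eq. (3)-(4); requires ldm() from above
lfdm <- function(y, mu, alpha, p, w) {
  D   <- length(y)
  phi <- alpha / (1 - w)                           # common DM precision alpha+/(1 - w)
  comp <- sapply(seq_len(D), function(j) {
    ej <- replace(numeric(D), j, 1)
    lambda_j <- mu - w * p + w * ej                # Eq. (3)
    log(p[j]) + ldm(y, lambda_j, phi)
  })
  m <- max(comp)
  m + log(sum(exp(comp - m)))                      # log-sum-exp over the D components
}

lfdm(y = c(6, 3, 1), mu = c(0.5, 0.3, 0.2), alpha = 2, p = c(0.4, 0.4, 0.2), w = 0.3)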

2.2 Regression Models

With the aim of performing a regression analysis, let Y = (Y1 , . . . , Y 𝑁 ) | be a set of


independent multivariate responses collected on a sample of 𝑁 subjects/units. For
the 𝑖-th subject, Y𝑖 counts the number of times that each of 𝐷 possible taxa occurred
among 𝑛𝑖 trials, and x𝑖 is a (𝐾 + 1)-dimensional vector of covariates.
A parameterization of the FDM useful in a regression perspective is the one based
on 𝝁, p, α+, and w̃, where

w̃ = w / min{1, min_r (μ_r/p_r)} ∈ (0, 1).    (6)

We can define the FDM regression (FDMReg) and the DM regression (DMReg)
models assuming that Y_i follows an FDM(n_i, 𝝁_i, α+, p, w̃) or a DM(n_i, 𝝁_i, α+) dis-
tribution, respectively. Even if the FDM and DM distributions do not belong to the
dispersion-exponential family, we can follow a GLM-type approach [6], linking
the parameter 𝝁_i to the linear predictor through a proper link function such as the
multinomial logit link function, that is

g(μ_ir) = log(μ_ir/μ_iD) = x_i^⊤ 𝜷_r,    r = 1, . . . , D − 1,    (7)
where 𝜷𝑟 = (𝛽𝑟 0 , 𝛽𝑟 1 , . . . , 𝛽𝑟 𝐾 ) | is a vector of regression coefficients for the 𝑟-th
element of 𝝁𝑖 . Note that the last category has been conventionally chosen as baseline
category, thus 𝜷 𝐷 = 0.
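For a single observation, the inverse of the multinomial logit link (7) maps the linear predictors into the scaled mean vector 𝝁_i through a softmax with the last category as baseline; a small base-R sketch of ours (the coefficient values are purely illustrative):

# inverse multinomial logit link, cf. Eq. (7): beta is a (D - 1) x (K + 1) matrix
inv_mlogit <- function(x, beta) {
  eta <- c(beta %*% x, 0)            # linear predictors; baseline category D has eta = 0
  exp(eta) / sum(exp(eta))           # mu_i on the D-part simplex
}

beta <- rbind(c( 2.2, -0.04, -0.26),   # category 1 vs. baseline
              c(-1.2, -0.05,  0.03))   # category 2 vs. baseline
inv_mlogit(x = c(1, 0.5, -0.2), beta)  # x = (intercept, two covariates)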
The parameterization of the FDMReg based on 𝝁, p, 𝛼+ , and 𝑤˜ defines a variation
independent parameter space, meaning that no constraints exist among parameters. In
a Bayesian framework, this allows one to assume prior independence and, consequently,
we can specify a prior distribution for each parameter separately. In order to induce
minimum impact on the posterior distribution, we select weakly-informative priors:
(i) 𝜷𝑟 ∼ 𝑁 𝐾 +1 (0, Σ), where 0 is the (𝐾 + 1)-vector with zero elements, and Σ is
a diagonal matrix with ‘large’ variance values, (ii) 𝛼+ ∼ 𝐺𝑎𝑚𝑚𝑎(𝑔1 , 𝑔2 ) for small
values of 𝑔1 and 𝑔2 , (iii) 𝑤˜ ∼ 𝑈𝑛𝑖 𝑓 (0, 1), and (iv) a uniform prior on the simplex for
p.
Inferential issues are dealt with by a Bayesian approach through a Hamilto-
nian Monte Carlo (HMC) algorithm [10], which is a popular generalization of the
Metropolis-Hastings algorithm. The Stan modeling language [13] allows implement-
ing an HMC method to obtain a simulated sample from the posterior distribution.
To compare the fit of the models we use the Watanabe-Akaike information crite-
rion (WAIC) [15, 17], a fully Bayesian criterion that balances between goodness-of-
fit and complexity of a model: lower values of WAIC indicate a better fit.

3 A Gut Microbiome Application

In this section, we fit the DM and the FDM regression models to a microbiome dataset
analyzed by Xia et al. [19] and previously proposed by Wu et al. [18]. They collected
gut microbiome data on 98 healthy volunteers. In particular, the counts of three
bacteria genera were recorded, namely Bacteroides, Prevotella, and Ruminococcus.
Arumugam et al. [2] used these three bacteria to define three groups they called
enterotypes. These enterotypes provide information about the human body’s ability
to produce vitamins.
Wu et al. analyzed the same dataset conducting a cluster analysis via the ‘par-
titioning around medoids’ (PAM) approach. They detected only two of the three
enterotypes defined in the work by Arumugam et al. Moreover, these two clusters
are characterized by different frequencies: 86 out of the 98 samples were allocated
to the first enterotype, whereas only 12 samples were clustered into enterotype 2.
This is due to the small number of subjects with a high abundance of Prevotella (i.e.,
only 36 samples showed a Prevotella count greater than 0).
Besides the bacterial data, we consider also 𝐾 = 9 covariates, representing in-
formation on micro-nutrients in the habitual long-term diet collected using a food
frequency questionnaire. These 9 additional variables have been selected by Xia et
al. using a 𝑙1 penalized regression approach.
Table 1 shows the posterior mean and 95% credible set (CS) of each parameter
involved in the DMReg and the FDMReg models. Though the significant covariates
are the same across the models, the FDMReg shows a lower WAIC, thus being the
best model in terms of fit. This is due to the additional set of parameters involved in
the mixture structure that help in providing information on this dataset.
The mixture structure of the FDMReg model can be exploited to cluster ob-
servations into groups through a model-based approach. More specifically, each
observation can be allocated to the mixture component that most likely generated it.
Indeed, note that the mixing weights estimates (0.637, 0.357 and 0.006, from Table
1) confirm the presence of two out of the three enterotypes defined by Arumugam
et al. [2]. To further illustrate the benefits of the FDReg model in a microbiome
data analysis, we compare the clustering profile obtained by the FDMReg model and
the one obtained with the PAM approach used by Wu et al. In particular, Table 2
summarizes this comparison in a confusion matrix. Despite the clustering generated
by the FDMReg being based on some distributional assumptions (i.e., the response
is FDM distributed), it highly agrees with the one obtained by the PAM algorithm for
84% of the observations. This percentage is obtained using the covariates selected
by Xia et al. in a logistic normal multinomial regression model context. Clearly,
the results could be improved by developing an ad hoc variable selection procedure
for the FDMReg model. The main advantage to considering the FDMReg (that is a
model-based clustering approach) is that, besides the clustering of the data points,
it provides also some information on the detected clusters (e.g., their size and a
measure of their distance) and the relationship between the response and the set of
covariates. This additional information may increase the insight we can gain from the
data. Further improvements could be obtained by considering an even more flexible
distribution for 𝚷, that is, the extended flexible Dirichlet [11].

Table 1 Posterior mean and 95% CS for the parameters of the DMReg and FDMReg models.
Regression coefficients in bold are related to 95% CS’s not containing the zero value.
                                   DM                         FDM
                                   Post. Mean  95% CS         Post. Mean  95% CS
Bacteroides
  Intercept                         2.197  (1.844, 2.546)      2.642  (2.215, 3.034)
  Proline                          -0.039  (-0.344, 0.273)    -0.036  (-0.325, 0.261)
  Sucrose                          -0.257  (-0.555, 0.039)    -0.208  (-0.471, 0.064)
  Vitamin E, food fortification    -0.016  (-0.351, 0.336)    -0.043  (-0.351, 0.299)
  Beta cryptoxanthin               -0.073  (-0.357, 0.237)    -0.059  (-0.334, 0.214)
  Added germa from wheats          -0.147  (-0.477, 0.196)    -0.042  (-0.411, 0.271)
  Vitamin C                         0.300  (-0.031, 0.771)     0.267  (-0.035, 0.673)
  Maltose                          -0.031  (-0.311, 0.260)     0.034  (-0.237, 0.302)
  Palmitelaidic trans fatty acid    0.019  (-0.292, 0.328)    -0.044  (-0.336, 0.251)
  Acrylamide                        0.133  (-0.167, 0.455)     0.184  (-0.094, 0.474)
Prevotella
  Intercept                        -1.196  (-1.715, -0.699)   -0.402  (-1.094, 0.245)
  Proline                          -0.053  (-0.571, 0.443)    -0.018  (-0.663, 0.546)
  Sucrose                           0.029  (-0.437, 0.476)     0.126  (-0.335, 0.591)
  Vitamin E, food fortification     0.109  (-0.355, 0.548)     0.113  (-0.473, 0.574)
  Beta cryptoxanthin                0.263  (-0.230, 0.762)     0.349  (-0.386, 0.812)
  Added germa from wheats           0.280  (-0.137, 0.701)     0.121  (-0.298, 0.604)
  Vitamin C                        -0.169  (-1.196, 0.623)    -0.021  (-1.131, 0.738)
  Maltose                           0.640  (0.164, 1.126)      0.877  (0.260, 1.400)
  Palmitelaidic trans fatty acid   -0.530  (-1.008, -0.043)   -0.716  (-1.209, -0.140)
  Acrylamide                        0.780  (0.362, 1.206)      0.800  (0.382, 1.231)
α+                                  1.541  (1.104, 2.040)      2.275  (1.489, 3.208)
p1                                  —      —                   0.637  (0.420, 0.797)
p2                                  —      —                   0.357  (0.197, 0.570)
p3                                  —      —                   0.006  (0.000, 0.027)
w̃                                  —      —                   0.914  (0.791, 0.991)
WAIC                                1686.2                     1662.3


Table 2 Confusion matrix for clustering based on the FDMReg model compared to the PAM
algorithm.
              FDMReg
               1     2
PAM     1     70    16
        2      0    12

References

1. Amato, K.: An introduction to microbiome analysis for human biology applications. Am. J.
Hum. Biol. 29 (2017)

2. Arumugam, M. et al.: Enterotypes of the human gut microbiome. Nature. 473, 174–180 (2011)
3. Ascari, R., Migliorati, S.: A new regression model for overdispersed binomial data accounting
for outliers and an excess of zeros. Stat. Med. 40(17), 3895–3914 (2021)
4. Chen, J., Li, H.: Variable selection for sparse Dirichlet-multinomial regression with an appli-
cation to microbiome data analysis. Ann. Appl. Stat. 7(1), 418–442 (2013)
5. Koeth, R. A. et al.: Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat,
promotes atherosclerosis. Nat. Med. 19(5) (2013)
6. McCullagh, P., Nelder, J. A.: Generalized Linear Models. Chapman & Hall (1989)
7. Migliorati, S., Ongaro, A., Monti, G. S.: A structured Dirichlet mixture model for composi-
tional data: inferential and applicative issues. Stat. Comput. 27(4), 963–983 (2017)
8. Morgan, X. C., Huttenhower, C.: Human microbiome analysis. PloS Computational Biology.
8(12) (2012)
9. Ongaro, A., Migliorati, S.: A generalization of the Dirichlet distribution. J. Multivar. Anal.
114, 412–426 (2013)
10. Neal, R. M.: An improved acceptance procedure for the hybrid Monte Carlo algorithm. Tech.
Rep. (1994)
11. Ongaro, A., Migliorati, S., Ascari, R.: A new mixture model on the simplex. Stat. Comput.
30(4), 749–770 (2020)
12. Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., Liang, S., Zhang, W., Guan, Y., Shen, D.,
Peng, Y.: A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature.
490 (2012)
13. Stan Development Team: Stan Modeling Language Users Guide and Reference Manual (2017)
14. Turnbaugh, P. J. et al.: A core gut microbiome in obese and lean twins. Nature. 457 (2009)
15. Vehtari, A., Gelman, A., Gabry, J.: Practical Bayesian model evaluation using leave-one-out
cross-validation and WAIC. Stat. Comput. 27(5), 1413–1432 (2017)
16. Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A., Van-
nucci, M.: An integrative Bayesian Dirichlet-multinomial regression model for the analysis of
taxonomic abundances in microbiome data. BMC Bioinformatics. 18(94) (2017)
17. Watanabe, S.: A widely applicable Bayesian information criterion. J. Mach. Learn. Tech.
14(1), 867–897 (2013)
18. Wu., G. D. et al.: Linking long-term dietary patterns with gut microbial enterotypes. Science.
334, 105–109 (2011)
19. Xia, F., Chen, J., Fung, W. K., Li, H.: A logistic normal multinomial regression model for
microbiome compositional data analysis. Biometrics. 69(4), 1053–1063 (2013)

Stability of Mixed-type Cluster Partitions for
Determination of the Number of Clusters

Rabea Aschenbruck, Gero Szepannek, and Adalbert F. X. Wilhelm

Abstract For partitioning clustering methods, the number of clusters has to be
determined in advance. One approach to dealing with this issue is the use of stability indices.
In this paper several stability-based validation methods are investigated with regard
to the 𝑘-prototypes algorithm for mixed-type data. The stability-based approaches
are compared to common validation indices in a comprehensive simulation study in
order to analyze preferability as a function of the underlying data generating process.

Keywords: cluster stability, cluster validation, mixed-type data

Rabea Aschenbruck ( )
Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: [email protected]
Gero Szepannek
Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: [email protected]
Adalbert F.X. Wilhelm
Jacobs University Bremen, Campus Ring 1, 28759 Bremen, Germany, e-mail: [email protected]

1 Introduction

In cluster analysis practice, it is common to work with mixed-type data (i.e. nu-
merical and categorical variables), while in theoretical development the research is
traditionally often restricted to numerical data. A comprehensive overview on cluster
analysis based on mixed-type data is given in [1]. To cluster these mixed-type data, a
popular approach is the 𝑘-prototypes algorithm, an extension of the well-known 𝑘-means
algorithm, as proposed in [2] and implemented in [3].
As for all partitioning clustering methods, the number of clusters has to be spec-
ified in advance. In the past, several validation methods have been identified for the
𝑘-prototypes algorithm to enable the rating of clusters and to determine the index-optimal
number of clusters. A brief overview is given in Section 2, followed by an
examination of the investigated stability indices to improve the clustering of mixed-type
data (the mentioned and analyzed stability indices will extend the R package clustMixType [4]).
In Section 3, a simulation study is conducted in order to compare the
performance of the stability indices, as well as a newly proposed adjustment, and addi-
tionally to rate their performance with respect to internal validation indices. Finally, a
summary, which does not state a general superiority of the stability-based approaches over
internal validation indices, and an outlook are given in Section 4.

2 Stability of Cluster Partitions

The assessment of cluster quality can be used for the comparison of clusters resulting
from different methods or from the same method but with different input parameters,
e.g., with a different number of clusters. Especially the latter has already been an
important issue in partitioning clustering many decades ago [5]. Since then, some
work has been done on this subject. Hennig [6] points out that nowadays some liter-
ature uses the term cluster validation exclusively for methods that decide about the
optimal number of clusters, in the following named internal validation. An overview
of internal validation indices is given, e.g., in [7] or [8]. In [9], a set of internal cluster
validation indices for mixed-type data to determine the number of clusters for the
𝑘-prototypes algorithm was derived and analyzed. In the following, stability indices
are presented, before they are compared to each other and additionally to internal
validation indices in Section 3. Since cluster stability is a model agnostic method,
the indices are applicable to any clustering algorithm and not limited to numerical
data [10].
A partition 𝑆 splits data 𝑌 = {𝑦 1 , . . . , 𝑦 𝑛 } into 𝐾 groups 𝑆1 , . . . , 𝑆 𝐾 ⊆ 𝑌 . The
focus of this paper is on the evaluation and rating of cluster partitions with so-
called stability indices. To calculate these, as discussed by Dolnicar and Leisch
[11] or mentioned by Fang and Wang [12], 𝑏 ∈ {1, . . . , 𝐵} bootstrap samples 𝑌 𝑏
(with replacement, see e.g. [13]) from the original data set 𝑌 are drawn. For every
bootstrap sample 𝑌 𝑏 , a cluster partition 𝑆 𝑏 = {𝑆1𝑏 , . . . , 𝑆 𝑏𝐿𝑏 } is determined. For the
validation of the different results of these bootstrap samples, the set of points from
the original data set that are also part of the 𝑏-th bootstrap sample 𝑋 𝑏 = 𝑌 ∩ 𝑌 𝑏 is
used, where n_b is the size of X^b. Furthermore, C^b = {S_k ∩ X^b | k = 1, . . . , K} and
D^b = {S_l^b ∩ X^b | l = 1, . . . , L_b}, with B*_C being the number of bootstrap samples for
which C^b ≠ ∅, and n_{S_k}, n_{C_k^b}, n_{S_l^b}, and n_{D_l^b}, with k ∈ {1, . . . , K} and l ∈ {1, . . . , L_b},
are the numbers of objects in the cluster groups S_k, C_k^b, S_l^b and D_l^b, respectively.
In 2002, Ben-Hur et al. [14] presented stability-based methods, which can be used
to define the optimal number of clusters. In their work, the basis for the calculation
of the stability indices is a binary matrix P^{C^b}, which represents the cluster partition
C^b in the following way:

P_ij^{C^b} = 1, if objects x_i^b, x_j^b ∈ X^b are in the same cluster and i ≠ j; 0, otherwise.    (1)
With P^{D^b} defined analogously, the dot product of the two cluster partitions C^b and
D^b is defined as D(P^{C^b}, P^{D^b}) = Σ_{i,j} P_ij^{C^b} P_ij^{D^b}. This leads to a Jaccard coefficient
based index of two cluster partitions C^b and D^b:

Stab_J(P^{C^b}, P^{D^b}) = D(P^{C^b}, P^{D^b}) / [D(P^{C^b}, P^{C^b}) + D(P^{D^b}, P^{D^b}) − D(P^{C^b}, P^{D^b})].    (2)

Hennig proposed a so-called local stability measure for every cluster group in a
cluster partition based on the Jaccard coefficient as well [15]. To obtain one stability
value $Stab_{\mathrm{J;cw}}$ for the whole partition, the weighted mean of the cluster-wise
values with respect to the size of the cluster groups is determined. Another stability-based
index presented by Ben-Hur et al., based on the simple matching coefficient, is called
Rand index [16] and defined as

$$Stab_{\mathrm{R}}(P^{C^b}, P^{D^b}) = 1 - \frac{1}{n^2} \, \big\| P^{C^b} - P^{D^b} \big\|^2 \,. \qquad (3)$$

Additionally, they present the stability index based on a similarity measure, which
was originally mentioned by Fowlkes and Mallows [17],

$$Stab_{\mathrm{FM}}(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{\sqrt{D(P^{C^b}, P^{C^b}) \, D(P^{D^b}, P^{D^b})}} \,. \qquad (4)$$
For the determination of the number of clusters, Ben-Hur et al. proposed analyzing
the distribution of the index values calculated between pairs of clustered sub-samples,
where high pairwise similarities indicate a stable partition. The aim suggested by the
authors is to examine the transition from a stable to an unstable clustering state. In
the simulation study, this qualitative criterion was numerically approximated by the
differences in the areas under these curves. Furthermore, von Luxburg [18] published
an approach to obtain the cluster partition stability based on the minimal matching
distance, where the minimum is taken over all permutations of the $K$ cluster labels.
The distances are then simply summarized by their mean to obtain
$Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$ and, respectively, $Stab_{\mathrm{L}}(P^{C^b}, P^{D^b}) = 1 - Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$.
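For illustration, the following small Python sketch (our own illustration, not the implementation announced for clustMixType; all function names are ours) computes the Jaccard-, Rand- and Fowlkes-Mallows-based stability of two partitions of the same point set X^b, given as label vectors.

import numpy as np

def comembership(labels):
    # binary matrix P with P[i, j] = 1 iff objects i and j share a cluster and i != j, cf. Eq. (1)
    labels = np.asarray(labels)
    P = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(P, 0.0)
    return P

def dot(PC, PD):
    # dot product D(PC, PD) = sum_{i,j} PC_ij * PD_ij
    return float((PC * PD).sum())

def stability_indices(labels_C, labels_D):
    # Jaccard- (Eq. 2), Rand- (Eq. 3) and Fowlkes-Mallows-based (Eq. 4) stability
    PC, PD = comembership(labels_C), comembership(labels_D)
    n = len(labels_C)
    jac = dot(PC, PD) / (dot(PC, PC) + dot(PD, PD) - dot(PC, PD))
    rand = 1.0 - np.sum((PC - PD) ** 2) / n ** 2
    fm = dot(PC, PD) / np.sqrt(dot(PC, PC) * dot(PD, PD))
    return {"Jaccard": jac, "Rand": rand, "FM": fm}

# toy example: two clusterings of the same six points
print(stability_indices([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))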

3 Simulation Study

A simulation study was conducted in order to compare the stability indices of the
cluster partition with each other and, afterwards, with the internal validation indices.
In the following, the setup and execution of this simulation study, starting with the
data generation, are briefly presented, and subsequently the results are evaluated.

3.1 Data Generation and Execution of Simulation Study

The simulation study is based on artificial data, which are generated for different
scenarios. In Table 1, the features that define the data scenarios and their corre-
sponding parameter values are listed. Since a full factorial design is used, there are
120 different data settings in the conducted simulation study.² The selection of the
considered features follows the characteristics of the simulation study in [19] and
was extended with respect to the ratio of the variable types as in [20].
Table 1 Features and the associated feature specifications used to generate the data scenarios.

data parameter                                        | feature specification | short
number of clusters                                    | 2, 4, 8               | nC
clusters of equal size (FALSE: randomly drawn sizes)  | TRUE, FALSE           | symm
number of variables                                   | 2, 4, 8               | nV
ratio of factor to numerical variables                | 0.25, 0.5, 0.75       | fac_prop
overlap between cluster groups                        | 0, 0.05, 0.1          | overlap

The clusters of the 200 observations are defined by the feature settings. Each
variable can either be active or inactive. For the numerical variables, active means
drawing values from the normal distribution $X_1 \sim \mathcal{N}(\mu_1, 1)$, with random
$\mu_1 \in \{0, \ldots, 20\}$, and inactive means drawing from $X_0 \sim \mathcal{N}(\mu_0, 1)$ with
$\mu_0 = 2 \cdot q_{1-\frac{v}{2}} - \mu_1$, where $q_\alpha$ is the $\alpha$-quantile of $\mathcal{N}(\mu_1, 1)$ and
$v \in \{0.05, 0.1\}$. This results in an overlap of $v$ between the two normal distributions.
To achieve an overlap of $v = 0$, the inactive variable is drawn from $\mathcal{N}(\mu_1 - 10, 1)$.
Furthermore, each factor variable has two levels, $l_0$ and $l_1$. For an active variable, the
probability of drawing $l_0$ is $v$ and that of $l_1$ is $(1 - v)$. For an inactive variable, the
probability of $l_0$ is $(1 - v)$ and that of $l_1$ is $v$.
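For illustration, the construction of a single active/inactive numerical variable and of a factor variable can be sketched as follows (a Python sketch of the description above; the original study was run in R, and all names here are ours).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def numeric_variable(active, mu1, v, size):
    # active: X_1 ~ N(mu1, 1); inactive: X_0 ~ N(mu0, 1) with mu0 = 2 * q_{1 - v/2} - mu1
    if active:
        return rng.normal(mu1, 1, size)
    if v == 0:
        return rng.normal(mu1 - 10, 1, size)                  # no overlap
    mu0 = 2 * norm.ppf(1 - v / 2, loc=mu1, scale=1) - mu1     # overlap of v
    return rng.normal(mu0, 1, size)

def factor_variable(active, v, size):
    # two levels l0, l1: P(l0) = v if active and 1 - v if inactive
    p_l0 = v if active else 1 - v
    return rng.choice(["l0", "l1"], size=size, p=[p_l0, 1 - p_l0])

x_active = numeric_variable(True, mu1=7, v=0.05, size=100)
x_inactive = numeric_variable(False, mu1=7, v=0.05, size=100)
f_active = factor_variable(True, v=0.05, size=100)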
Below, the code structure of the simulation study is presented. For each of the 120
data scenarios, a repetition of 𝑁 = 10 runs was performed. This should mitigate the
influence of the random initialization of the 𝑘-prototypes algorithm. For the range of
two up to nine cluster groups, the stability indices are determined based on bootstrap
samples as suggested in [21]. In order to rank the performance of the stability-based
indices, the internal validation indices were also determined on the same data.

Pseudo-Code Simulation Study


for(every data situation){
  for(i in 1:N){  # N = 10 iterations to mitigate random influences
    data <- create.data(data situation)
    for(q in 2:9){
      output <- kproto(data, k = q, nstarts = 20)
      # stability-based indices determined using 100 bootstrap samples
      stab_val_method <- stab_kproto(output, B = 100, method)
      int_val_method  <- validation_kproto(output, method)  # internal validation
    }
    # determine the index-optimal number of clusters for every method
    cs_method <- max/min(int_val_method or stab_val_method)
  }
}

² There is no data scenario with two variables and eight cluster groups. Additionally, if there are
two variables, obviously only the 0.5 ratio between factor and numerical variables is possible.

Fig. 1 The evaluations of the four stability-based cluster indices are presented. There are ten
repetitions of rating the data situation for 𝑘 clusters in the range of two to nine and the index-
optimal number of clusters is highlighted. The parameters of the underlying data structure are nV
= 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. The number of clusters nC in the
data structure varies row-wise.

3.2 Analysis of the Results

Figure 1 shows exemplary results of the simulation study for three different data
scenarios over the 10 repetitions. Each row of the figure shows a different data
scenario and each column shows one of the four stability-based indices. The first row
is related to a data scenario with two clusters (marked by a vertical green line). Each
plot shows the examined number of clusters and the determined index value for the 10
repetitions. The maximum index value for each repetition is highlighted with a larger
dot and marks the index-optimal number of clusters of this repetition. It can be seen
that all of the four different indices detected the two clusters in the underlying data
structure. Rows two and three show the evaluations of data with cluster partitions
of four and eight clusters, respectively. It can be seen that the generated number of
clusters is not always rated as index optimal (for example, with four clusters, two or
three clusters were often also evaluated as optimal). Since the results shown here are
representative for all scenarios, the four cluster indices and their interpretation were
examined in more detail.
In the left part of Figure 2, different transformations of the index values are pre-
sented. Besides the standard index values (green line), the numerical approximation
of the approach of Ben-Hur et al. mentioned above is also shown (red line). For
the Jaccard-based evaluation, the cluster-wise stability determination proposed by
Hennig is presented in orange. Additionally, we propose an adjustment of the index
values (hereinafter referred to as new adjust), similar to [22], which takes into account
not only the magnitude of the index but also the local slope: the index value scaled by
the geometric mean of its changes relative to the neighboring values is presented in
dark green.
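The verbal description of this adjustment leaves some freedom; one possible reading (an assumption on our side, not necessarily the exact formula used in the study) is sketched below, with the differences to the two neighboring values floored at zero so that the geometric mean stays real and boundary values left undefined.

import numpy as np

def new_adjust(stab):
    # scale each index value by the geometric mean of its changes to the two neighbors
    stab = np.asarray(stab, dtype=float)
    adj = np.full_like(stab, np.nan)          # boundary values left undefined here
    for q in range(1, len(stab) - 1):
        d_left = max(stab[q] - stab[q - 1], 0.0)
        d_right = max(stab[q] - stab[q + 1], 0.0)
        adj[q] = stab[q] * np.sqrt(d_left * d_right)
    return adj

# index values for k = 2, ..., 9: the adjusted curve peaks where the original
# curve is both high and locally peaked
print(new_adjust([0.81, 0.78, 0.83, 0.70, 0.66, 0.61, 0.58, 0.55]))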

Fig. 2 Left: Example of the variations of the index values at an iteration of the data scenario with
the parameters nC = 4, nV = 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. Right:
Proportion of correct determinations, partitioned according to the different number of clusters in
the underlying data structure.

Again, for each variation of the indices, the index-optimal value is highlighted. The
numerically determined index values according to the approach of Ben-Hur et al.
provide no benefit; thus it can be concluded that this quantification is not appropriate
for the purpose and that further research is required. The cluster-wise stability deter-
mination of the Jaccard index also does not seem to improve the determination of
the number of clusters to a large extent. In the example in Figure 2, the new adjust-
ment clearly strengthens the local slope at four evaluated cluster groups, which leads
to a determination of four cluster groups (the generated number of clusters). Since
only one iteration of one data scenario is shown on the left, the sum of correctly
determined numbers of clusters with respect to the generated number of clusters is
shown on the right-hand side of Figure 2. These sums for two, four and eight clusters
in the underlying data structure point out the improvement due to the proposed
adjustment of the index values. Especially for more than two clusters, the rate of
correctly determined numbers of clusters can be increased.
Finally, the internal validation indices were comparatively examined. For analyz-
ing the outcome of the simulation study, the determined index optimal numbers of
clusters are shown in Table 2. While the comparison for two clusters in the underly-
ing data shows a slight advantage for the stability-based indices, especially for eight
clusters the preference is in favor of the internal validation indices. To gain a better
understanding of the mean success rate of determining the correct number of clusters
for each data scenario, Figure 3 further shows the results of a linear regression on
the various data parameters. It can be seen that in most cases there is not too much
difference between the considered methods. The stability-based indices do a better
job of determining the number of clusters for data with equally large cluster groups.
Obviously, a larger number of variables causes a better determination of the number
of clusters. The largest variation in the influence on the proportion of correct deter-
mination can be seen for the parameter number of clusters. The more cluster groups
are available in the underlying data structure, the worse the determination becomes
(especially for the stability-based indices and the indices Ptbiserial and Tau).

Table 2 Determined number of clusters for all data scenarios with nC ∈ {2, 4, 8}, summarized by
the stability-based as well as internal validation indices and the evaluated number of clusters.
evaluated clusters (nC = 2): 2 3 4 5 6 7 8 9 | (nC = 4): 2 3 4 5 6 7 8 9 | (nC = 8): 2 3 4 5 6 7 8 9
Jnewadj 403 17 0 0 0 0 0 0 47 74 298 1 0 0 0 0 90 70 27 16 16 26 104 11
Rnewadj 391 18 5 0 1 1 1 3 56 99 258 3 2 0 0 2 38 68 22 17 16 32 133 34
FMnewadj 402 17 1 0 0 0 0 0 50 80 289 1 0 0 0 0 88 71 26 16 15 26 106 12
Lnewadj 394 21 5 0 0 0 0 0 53 83 282 2 0 0 0 0 100 97 31 20 16 16 76 4
CIndex 313 13 2 2 1 4 18 67 7 27 344 13 3 2 5 19 2 0 2 4 22 28 211 91
Dunn 386 24 4 2 0 1 1 2 39 56 307 8 7 3 0 0 19 9 17 7 37 53 190 28
Gamma 343 9 1 0 1 2 14 50 9 16 356 15 3 1 5 15 2 1 4 4 16 16 198 119
GPlus 319 8 1 0 0 0 9 83 6 10 319 12 5 2 15 51 2 1 1 4 14 12 175 151
McClain 71 3 1 1 5 12 57 270 0 0 17 4 4 13 87 295 0 0 0 0 0 9 34 317
Ptbiserial 400 11 6 0 3 0 0 0 72 120 225 3 0 0 0 0 31 62 79 65 55 39 26 3
Silhouette 388 3 1 4 4 5 8 7 14 37 348 7 0 0 8 6 6 0 3 1 12 46 220 72
Tau 391 16 9 0 4 0 0 0 68 144 205 3 0 0 0 0 33 82 119 68 40 14 3 1

Fig. 3 Linear regression coefficients for the parameters of the five data set features; coefficients
whose confidence intervals contain 0 are displayed as transparent.

4 Conclusion

The aim of this study was to investigate the determination of the optimal number of
clusters based on stability indices. Several variations of analysis methods of stability-
based index values were presented and comparatively analyzed in a simulation study.
The proposed adjustment of the index values with respect not only to their magnitude
but also to the local slope was able to improve the standard stability indices, especially
for a smaller number of clusters. The simulation study did not show any general
superiority of stability-based approaches over internal validation indices.
In the future, the various methods of analyzing the stability-based index values
should be examined in more detail, e.g., taking into account the Adjusted Rand
Index. For this purpose, further research may address the characteristics of the
evaluated curves more precisely, or further extend the approach of Ben-Hur et al. as
a quantitative determination method, which has not been done yet.

References

1. Ahmad, A., Khan, S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE
Access, 31883–31902 (2019)
2. Huang, Z.: Extension to the k-Means algorithm for clustering large data sets with categorical
values. Data Min. Knowl. Discov. 2(6), 283–304 (1998)
3. Szepannek, G.: clustMixType: User-friendly clustering of mixed-type data in R. The R J.
10(2), 200–208 (2018)
4. Szepannek, G., Aschenbruck, R.: clustMixType: k-prototypes clustering for mixed variable-
type data. R package version 0.2-15 (2021)
https://CRAN.R-project.org/package=clustMixType
5. Thorndike, R. L.: Who belongs in the family. Psychometrika 18(4), 267–276 (1953)
6. Hennig, C.: Clustering strategy and method selection. In: Hennig, C., Meila, M. , Murtagh, F.,
Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 703–730. Chapman and Hall/CRC, New
York (2015)
7. Halkidi, M., Vazirgiannia, M., Hennig, C.: Method-independent indices for cluster validation
and estimating the number of clusters. In: Hennig, C., Meila, M. , Murtagh, F., Rocci, R. (eds.)
Handbook of Cluster Analysis, pp. 595–618. Chapman and Hall/CRC, New York (2015)
8. Desgraupes, B.: clusterCrit: clustering indices. R package version 1.2.8 (2018)
https://CRAN.R-project.org/package=clusterCrit
9. Aschenbruck, R., Szepannek, G.: Cluster validation for mixed-type data. Arch. Data Sci., Ser.
A 6(1), 1–12 (2020)
10. Lange, T., Roth, V., Braun, M. L., Buhmann, J. M.: Stability-based validation of clustering
solutions. Neural. Comput. 16(6), 1299–1323 (2004)
11. Dolnicar, S., Leisch, F.: Evaluation of structure and reproducibility of cluster solutions using
bootstrap. Mark. Lett. 21, 83–101 (2010)
12. Fang, Y., Wang, J.: Selection of the number of clusters via the bootstrap method. Comput.
Stat. Data Anal. 56(3), 468–477 (2012)
13. Mucha, H.-J., Bartel, H.-G.: Validation of k-means clustering: why is bootstrapping better
than subsampling. Arch. Data Sci., Ser. A 2(1), 1–14 (2017)
14. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in
clustered data. In: Pac. Symp. Biocomput. 2002, 6–17 (2001)
15. Hennig, C.: Cluster-wise assessment of cluster stability. Comput. Stat. Data Anal. 52(1),
258–271 (2007)
16. Rand, W. M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc.
66(336) 846–850 (1971)
17. Fowlkes, E. B., Mallows, C. L.: A method for comparing two hierarchical clusterings. J. Am.
Stat. Assoc. 78(383) 553–569 (1983)
18. von Luxburg, U.: Clustering stability: an overview. Found. Trends® Mach. Learn. 2(3), 235–
274 (2010)
19. Dangl, R., Leisch, F.: Effects of resampling in determining the number of clusters in a data
set. J. Classif. 37(3), 558–583 (2020)
20. Jimeno, J., Roy, M., Tortora, C.: Clustering mixed-type data: a benchmark study on KAMILA
and k-prototypes. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A.,
Nugent, R. (eds.) Data Analysis and Rationality in a Complex World, 83–91, Springer Inter-
national Publishing, Cham (2021)
21. Leisch, F.: Resampling methods for exploring cluster stability. In: Hennig, C., Meila, M.,
Murtagh, F., Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 637–652. Chapman and
Hall/CRC, New York (2015)
22. Ilies, J., Wilhelm, A. F. X.: Projection-based partitioning for large, high-dimensional datasets.
J. Comp. Graph. Stat. 19(2), 474–492 (2010)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
A Review on Official Survey Item Classification
for Mixed-Mode Effects Adjustment

Afshin Ashofteh and Pedro Campos

Abstract The COVID-19 pandemic has had a direct impact on the development, pro-
duction, and dissemination of official statistics. This situation led National Statistics
Institutes (NSIs) to make methodological and practical choices for survey collection
without the need for the direct contact of interviewing staff (i.e. remote survey data
collection). Mixing telephone interviews (CATI) and computer-assisted web inter-
viewing (CAWI) with face-to-face interviewing constitutes a new way of collecting
data at the time of the COVID-19 crisis. This paper presents a literature review to
summarize the role of statistical classification and design weights in controlling
coverage errors and non-response bias in mixed-mode questionnaire design. We
identified 289 research articles with a computerized search over two databases,
Scopus and Web of Science. It was found that, although employing mixed-mode
surveys can be considered a substitute for traditional face-to-face interviews (CAPI),
proper statistical classification of survey items and responders is important to control
the nonresponse rates and the coverage error risk.

Keywords: mixed-mode official surveys, item classification, weighting methods,
clustering, measurement error

Afshin Ashofteh ( )
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de
Infomação) and NOVA Information Management School (NOVA IMS) and MagIC, Universidade
Nova de Lisboa, Lisboa, Portugal, e-mail: [email protected]
Pedro Campos
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de
Infomação) and Faculty of Economics, Universidade do Porto, and LIAAD INESC TEC, Portugal,
e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_7

1 Introduction

This paper provides a summary of a systematic literature review of the role of
classification variables and weighting methods of mixed-mode surveys in minimizing
the measurement error, coverage error, and nonresponse bias.
Before the COVID-19 pandemic, the statistical adjustment of mode-specific mea-
surement effects was studied by many scholars. After the pandemic began, however,
survey methodologists made a strong effort to meet the challenges of new restrictions
on collecting data of adequate quality [1]. Collecting data by mixing different modes,
while considering the contribution of each mode to the overall published statistics,
was considered a solution by NSIs. Methodologists have been trying to use technol-
ogy, data science, and mixed-device surveys, rather than the traditional interviewer-
assisted and paper survey modes, to decrease the expected coverage error and non-
response bias with the new target populations at the time of the pandemic [2]. This
coverage error is caused by the change of the target population from the general
population to the general population accessible with technological devices. Te Braak
et al. [3] highlighted how the representativeness of self-administered online surveys
is expected to be impacted by decreased response rates. Their research demonstrates
that a large group of respondents drop out selectively and that this selectivity varies
depending on the dropout moment and demographic categorical information.
According to studies at Statistics Portugal, using classification methods based on
categorical variables and applying repeated weighting techniques seems to be fruitful
for estimating and adjusting for mode and device effects. Fortunately, many authors
have discussed the use of weights in statistical analysis [4]. It is important to improve
inference in cases where mixed-mode effects are combined with measurement errors
caused by primary data collection on categorical variables and socio-demographic
information. On the one hand, when the categorical variables are collected from the
responders themselves (primary data), the survey mode has a strong impact on an-
swering behaviors and answering conditions. Respondents might evaluate some of
the new categorical variables as sensitive or privacy-intrusive information. They may
not be willing to share these personal data by telephone or technological devices,
although such data are necessary for statistical classification. Additionally, for NSIs,
the new data collection channels are costly and the redesign of the survey estimation
methodology is time consuming. On the other hand, the categorical variables should
be available in the sampling frames (secondary data), and then the coverage error is
the main concern. For instance, in CATI surveys of Statistics Portugal after COVID-
19, the population was considered as belonging to the following categories: (i) house-
holds with a listed landline telephone, (ii) households that do not have a landline
telephone but use only a mobile telephone, and (iii) households that do not have a
telephone at all (or whose number is unknown). We could expect these households
to have very different socioeconomic characteristics, and new methods of classifi-
cation or clustering to be helpful for measurement error adjustment at the time of
the pandemic. However, if the groups differ in the important categorical variables
of our survey, then a weighting solution could amplify a part of the sample which
does not represent the population. As a result, statistical classification would be
another source of bias instead of
solving the problem. Therefore, we could expect two approaches. First, we could
ignore classification, simply because we consider the groups to be homogeneous;
then weighting can be recommended to adjust for the COVID-19 pandemic situation
and for non-observation errors. Second, the groups of responders are different and we
need the categorical variables. In this case, the non-observation errors of CATI and
CAWI cannot be covered by changing only the weights, and we have to recommend
CAPI to collect the categorical information and to apply both clustering and weighting
together to obtain a reasonable coverage by mixed modes.
This study undertakes a systematic literature review on this topic guided by the
following question: What is the best methodology or modified estimation strategy
to mitigate the mode-effect problems based on design weighting and classification?
To answer this question, we performed a systematic review analysis limited to the
following databases: Web of Science, Scopus, and working papers from NSIs. We
only considered papers written in English. This article is organized as follows:
Section 2 presents the research methodology, covering the keyword identification
search, the databases, and the bibliometric analysis. In Section 3, we present the
results, including the PRISMA flow diagram, the characteristics of the articles, the
author co-authorship analysis, as well as the keyword occurrence over the years.
In Section 4, we discuss the content analysis. Section 5 presents the main conclusions
and, finally, in Section 6, the main research gaps and future works are outlined.

2 Methods

To accomplish the research, the Preferred Reporting Items for Systematic Reviews
and Meta-Analyses (PRISMA) methodology was adopted. The selection of papers
from the databases (Scopus and WoS) was based on a screening started by the search
keywords
((mixed-mode* OR "Mode effect*") AND (weighting OR weight* OR classifica-
tion) AND ("Measurement error*" OR "Non-response bias" OR "Data quality" OR
"response rate*" ) AND ( capi OR "Computer Assisted Personal Interview*" OR
cawi OR "Assisted Web Interview*" OR cati OR "Computer Assisted Telephone In-
terview*" OR "web survey*" OR "mail survey*" OR "telephone survey*" )) and then
the result was filtered by “official statistics”. The results of the two databases were
merged, and then duplication was removed. For bibliometric analysis, the Mendeley
open-source tool was used to extract metadata and eliminate duplicates. For network
analysis, the VOSviewer open-source tool has been applied to visualize the extracted
information from the data set and obtain the quantitative and qualitative outcomes.
After assessing the eligibility, books and review papers were omitted from the results
and the relevant articles were retained from the databases. The final dataset was selected ac-
cording to the visual abstract in Figure 2, which shows detailed information about
this systematic literature review.

Fig. 1 Literature review flow diagram. (Source: Author’s preparation).

Fig. 2 Density visualization analysis of the 22 leader authors who have at least 3 papers.

3 Results

The 28 leader authors who had at least 4 papers are presented in Figure 2. An author
occurrence analysis was performed by applying the VOSviewer research tool for
network analysis. The top three leader authors were Mick P. Couper with 14 articles,
Barry Schouten with 14 articles, and Roger Tourangeau with 11 articles. With the
help of VOSviewer, a keyword analysis was also accomplished. We analyzed the co-
occurrence of author keywords with the full counting method. In the first step, we set
the minimum occurrence of a keyword to one, which resulted in 711 keywords. The
use of the keywords over the years can be seen in Figure 3. Some of the keywords
were not written in exactly the same way, but their use and meaning were the same.

Fig. 3 Application of keywords over years.

We decided to match similar words to make the output clearer. Choosing the full
counting method resulted in a total of 592 authors meeting the threshold.

4 Content Analysis

The studies emphasize the dramatic change in mixed-mode strategies over the last
decades, based on design-based and model-assisted survey sampling, time series
methods, and small area estimation [6], and the high expectation of further changes,
especially after the substantial experience gained by NSIs in trying new modes after
the COVID-19 pandemic [7].
The problem concerns mixed-mode effects and calibration and, briefly, several
approaches can be followed, such as design weighting to find sampling weights, non-
response weighting adjustment, and calibration. The design weight of a unit may be
interpreted as the number of units from the population represented by a specific
sample unit. Most surveys, if not all, suffer from item or unit nonresponse. Auxiliary
information can be used to improve the quality of design-weighted estimates. An
auxiliary variable must have at least two characteristics to be considered in calibra-
tion: (i) it must be available for all sample units; and (ii) its population total must be
known.
The categorical variables from the demographic information of nonrespondents
such as education level, age, income, location, language, and marital status could
help the survey methodologists to categorize the target population and recognize
the best sequence of the modes [8]. Van Berkel et al. [9] considered nine strata in
their classification tree by using age, ethnicity, urbanization, and income as explana-
tory variables. A re-interview design and the inverse regression estimator (IREG) are
among the best approaches to adjust for measurement bias by using related auxiliary
information [10].
The focus of this approach is on the weights of the estimators rather than on the bias
of the measurements. We denote by $y_{i,m}$ the measurement obtained from unit $i$
through mode $m$. The value $y_{i,m}$ consists of $u_i$, the true value for respondent $i$,
an additive mode-dependent measurement bias $b_m$, and a mode-dependent measure-
ment error $\varepsilon_{i,m}$ with an expected value equal to zero. Equation (1) shows the
measurement error model:

$$y_{i,m} = u_i + b_m + \varepsilon_{i,m} \qquad (1)$$

If we consider two different modes $m$ and $\dot{m}$, then the differential measurement
error between these two modes is given by

$$y_{i,m} - y_{i,\dot{m}} = (b_m - b_{\dot{m}}) + \varepsilon_{i,m} - \varepsilon_{i,\dot{m}} \qquad (2)$$
The expected value of $(b_m - b_{\dot{m}})$ is the differential measurement bias. If we
consider $\hat{t}_y$ as an estimate of the total of variable $y$ based on its observations
in the different modes $y_{i,m}$, then

$$\hat{t}_y = \sum_{i=1}^{n} \omega_i\, y_{i,m} \qquad (3)$$

where $\omega_i$ is the survey weight assigned to unit $i$ and $n$ is the number of respondents.
From a combination of equations (2) and (3), and taking the expectation over the
measurement error model (1), we would have

$$E\big(\hat{t}_y\big) = E\Big(\sum_{i=1}^{n} \omega_i\, y_{i,m}\Big) = \sum_{i=1}^{n} \omega_i\, u_i + b_m \sum_{i=1}^{n} \omega_i\, \partial_{i,m} + \sum_{i=1}^{n} \omega_i\, \partial_{i,m}\, E\big(\varepsilon_{i,m}\big) \qquad (4)$$

with $\partial_{i,m} = 1$ if unit $i$ responded through mode $m$, and zero otherwise. Since
$E\big(\varepsilon_{i,m}\big) = 0$,

$$E\big(\hat{t}_y\big) = E\Big(\sum_{i=1}^{n} \omega_i\, y_{i,m}\Big) = \sum_{i=1}^{n} \omega_i\, u_i + \sum_{i=1}^{n} \omega_i\, \partial_{i,m}\, b_m \qquad (5)$$

stating that the expected total of the survey estimate for $Y$ consists of the estimated
true total of $U$, plus the true total of $b_m$ from the data collected through mode $m$.
Since $b_m$ is an unobserved mode-dependent measurement bias, the term
$\sum_{i=1}^{n} \omega_i\, \partial_{i,m}\, b_m$ in equation (5) indicates the existence of an unknown
mode-dependent bias in the estimation of $t_y$. According to Equation (5), there is an
unknown measurement bias in sequential mixed-mode designs that might be adjusted
by different estimators. Data obtained
via a re-interview design or from a sub-set of the respondents to the first stage of a
sequential mixed-mode survey provides the necessary auxiliary information to adjust
the measurement bias in sequential mixed-mode surveys. Klausch et al. [10] propose
six different estimators and show that an inverse version of the regression estimator
(IREG) performs well under all considered scenarios. The idea of IREG is to use the
re-interview data to estimate the inverse slope of an ordinary or generalized least
squares linear regression of the benchmark measurements $y^{m_b}$ on $y^{m_j}$ as follows [11]

$$y_i^{m_j} = \hat{\beta}_0 + \hat{\beta}_1\, y_i^{m_b} \qquad (6)$$

and to estimate the measurement of the target variable by applying the inverse of
$\hat{\beta}_1$ in the following estimator, the so-called inverse regression estimator

$$\hat{y}^{\,ireg}_{r_m} = \frac{1}{\hat{N}_{m_1} + \hat{N}_{m_2}} \left( \sum_{i=1}^{n_{m_b}} d_i\, y_i^{m_b} + \sum_{i=1}^{n_{m_j}} d_i \Big( \hat{y}^{m_b}_{re} - \frac{1}{\hat{\beta}_1}\big( \hat{y}^{m_j}_{re} - y_i^{m_j} \big) \Big) \right), \quad b, j = 1, 2;\ b \neq j \qquad (7)$$

where $\hat{y}^{m_j}_{re}$ and $\hat{y}^{m_b}_{re}$ are the respondent means of the focal and benchmark mode
outcomes in the re-interview and $d_i$ denotes the design weight of the sample design.
For a detailed presentation and discussion of the methods see Chapter 8.5 in [12].
However, for longitudinal studies with different modes at different time points, the
effect of time on the respondents makes it difficult to estimate the pure mixed-mode
effect, especially for volatile classification variables such as the address of immigrants.
A solution could be to conduct the survey on parallel or separate samples in order to
evaluate the time effect and the mode effect separately.
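As a small illustration of Equations (1)–(5) (a simulation sketch with assumed modes, bias values, and weights that are purely illustrative and not taken from any survey), the unknown bias term of Equation (5) can be made visible as follows.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
u = rng.normal(50, 10, n)                        # true values u_i
mode = rng.choice(["CAWI", "CATI"], n)           # mode through which unit i responded
b = {"CAWI": 0.0, "CATI": 2.5}                   # mode-dependent measurement bias b_m
eps = rng.normal(0, 1, n)                        # measurement error, E(eps) = 0
y = u + np.array([b[m] for m in mode]) + eps     # Eq. (1)

w = np.full(n, 1.0)                              # survey weights omega_i (equal here)
t_hat = np.sum(w * y)                            # Eq. (3)

true_total = np.sum(w * u)
bias_term = sum(b[m] * np.sum(w * (mode == m)) for m in b)   # sum_i w_i d_im b_m
print(round(t_hat), round(true_total + bias_term))           # close, as Eq. (5) predicts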
In practice, Statistics Portugal has been using the information available in the
sampling frame, as part of the FNA (the national dwellings register database), at the
time of COVID-19. In this setting, telephone numbers are linked to a sample drawn
from a population register in the FNA for the samples of CATI rotation-scheme
surveys such as the Labour Force Survey. In 2020, the Labour Force Survey (LFS)
in Portugal, a mandatory survey for the member states of the EU, was adjusted for
the undercoverage related to the percentage of households with a listed landline
telephone. As a result, the comparison of these surveys before and after COVID-19
shows the usefulness of the discussed methodologies. In 2021, the successful CAWI-
mode census by Statistics Portugal showed that respondents tend to favor the web-
based questionnaire to avoid the risk of COVID-19 infection associated with a face-
to-face interview. This indicates a potential change in the mode preference of the
responders.

5 Conclusions

The COVID-19 crisis led to new solutions for item classification for mixed-mode
effects adjustment, such as applying mode calibration to population subgroups de-
fined by categorical variables such as gender, region, age group, etc. Studies suggest
a sequential mixed-mode design that starts with CAWI as the cheapest mode, sup-
ported by an initial postal mail or telephone contact and a possible cash incentive.
With a lag, the non-respondents are followed up and given a choice between CAPI
and CATI according to their specific classification group and demographic infor-
mation, such as education level, age, income, location, language, and marital status.
This is fruitful for reducing the cost and increasing the accuracy simultaneously.
This study showed that sampling frames, which are based on choices made several
years ago, might need updates to include the necessary categorical information.
Additionally, more research seems necessary on ethical concerns, privacy regulations,
and standards for using categorical variables and classification information in social
mixed-mode surveys and official statistics.

References

1. Ashofteh, A., Bravo, J. M.: A study on the quality of novel coronavirus (COVID-19) official
datasets. Stat. J. IAOS, 36(2), 291–301, (2020) doi: 10.3233/SJI-200674
2. Ashofteh, A., Bravo, J. M.: Data science training for official statistics: A new scientific
paradigm of information and knowledge development in national statistical systems. Stat. J.
IAOS, 37(3), 771–789, (2021) doi: 10.3233/SJI-200674
3. Te Braak, P., Minnen, J., Glorieux, I.: The representativeness of online time use surveys.
Effects of individual time use patterns and survey design on the timing of survey dropout. J.
Off. Stat., 36(4), 887–906, (2020)
4. Szymkowiak, M., Wilak, K.: Repeated weighting in mixed-mode censuses. Econ. Bus. Rev.,
7(1), 26–46, (2021)
5. Zax, M., Takahashi, S.: Cultural influences on response style: comparisons of Japanese and
American college students. J. Soc. Psychol., 71(1), 3–10, (1967)
6. Pfeffermann, D.: New important developments in small area estimation. Stat. Sci., 28(1),
40–68, (2013)
7. Toepoel, V., de Leeuw, E., Hox, J.: Single- and Mixed-Mode Survey Data Collection. SAGE
Res. Methods Found, (2020) doi: 10.4135/9781526421036876933
8. Kim, S., Couper, M. P.: Feasibility and quality of a national RDD smartphone web survey:
comparison with a cell phone CATI survey. Soc. Sci. Comput. Rev., 39(6), 1218–1236, (2021)
9. Van Berkel, K., Van Der Doef, S., Schouten, B.: Implementing adaptive survey design with an
application to the Dutch health survey. J. Off. Stat., 36(3), 609–629, (2020) doi: 10.2478/jos-
2020-0031
10. Klausch, T., Schouten, B., Buelens, B., van den Brakel, J.: Adjusting measurement bias in
sequential mixed-mode surveys using re-interview data. J. Surv. Stat. Methodol., 5(4), 409–
432, (2017) doi: 10.1093/jssam/smx022
11. Särndal, C. E., Lundström, S.: Estimation in surveys with nonresponse. Estimation in surveys
with nonresponse. John Wiley (2005) doi: 10.1002/0470011351
12. Schouten, B., Brakel, J. van den, Buelens, B., Giesen, D., Luiten, A., Meertens, V.: Mixed-
Mode Official Surveys. Chapman and Hall/CRC (2021) doi: 10.1201/9780429461156

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Clustering and Blockmodeling Temporal
Networks – Two Indirect Approaches

Vladimir Batagelj

Abstract Two approaches to clustering and blockmodeling of temporal networks
are presented: the first is based on an adaptation of the clustering of symbolic data
described by modal values and the second is based on clustering with relational
constraints. Different options for describing a temporal block model are discussed.

Keywords: social networks, network analysis, blockmodeling, symbolic data analysis,
clustering with relational constraints

1 Temporal Networks

Temporal networks described by temporal quantities (TQs) were introduced in the
paper [2]. We get a temporal network NT = (V, L, T , P, W) by attaching the time
T to an ordinary network, where V is the set of nodes, L is the set of links, P is
the set of node properties, W is the set of link weights, and T = [𝑇𝑚𝑖𝑛 , 𝑇𝑚𝑎𝑥 ) is a
linearly ordered set of time points 𝑡 ∈ T which are usually integers or reals.
In a temporal network nodes/links activity/presence, nodes properties, and links
weights can change through time. These changes are described with TQs. A TQ
is described by a sequence $a = [(s_r, f_r, v_r) : r = 1, 2, \ldots, k]$ where $[s_r, f_r)$
determines a time interval and $v_r$ is the value of the TQ $a$ on this interval. The set
$T_a = \bigcup_r [s_r, f_r)$ is called the activity set of $a$. For $t \notin T_a$ its value is undefined,
$a(t) = \ast$.
Assuming that for every $x \in \mathbb{R} \cup \{\ast\}$: $x + \ast = \ast + x = x$ and
$x \cdot \ast = \ast \cdot x = \ast$, we can extend the addition and multiplication to TQs

Vladimir Batagelj ( )
IMFM, Jadranska 19, 1000 Ljubljana, Slovenia & IAM UP, Muzejski trg 2, 6000 Koper, Slovenia
& HSE, 11 Pokrovsky Bulvar, 101000 Moscow, Russian Federation,
e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_8

(𝑎 + 𝑏)(𝑡) = 𝑎(𝑡) + 𝑏(𝑡) and 𝑇𝑎+𝑏 = 𝑇𝑎 ∪ 𝑇𝑏


(𝑎 · 𝑏)(𝑡) = 𝑎(𝑡) · 𝑏(𝑡) and 𝑇𝑎·𝑏 = 𝑇𝑎 ∩ 𝑇𝑏
Let 𝑇𝑉 (𝑣) ⊆ T , 𝑇𝑉 ∈ P, be the activity set for a node 𝑣 ∈ V and 𝑇𝐿 (ℓ) ⊆ T ,
𝑇𝐿 ∈ W, the activity set for a link ℓ ∈ L. The following consistency condition
must be fulfilled for activity sets: If a link ℓ(𝑢, 𝑣) is active at the time point 𝑡 then its
end-nodes 𝑢 and 𝑣 should be active at the time point 𝑡 : 𝑇𝐿 (ℓ(𝑢, 𝑣)) ⊆ 𝑇𝑉 (𝑢) ∩𝑇𝑉 (𝑣).
In the following we will need:
1. Total: $\mathrm{total}(a) = \sum_i (f_i - s_i) \cdot v_i$
2. Average: $\mathrm{average}(a) = \dfrac{\mathrm{total}(a)}{|T_a|}$ where $|T_a| = \sum_i (f_i - s_i)$
3. Maximum: $\max(a) = \max_i v_i$
To support the computations with TQs we developed in Python the libraries TQ
and Nets, see https://github.com/bavla/TQ .
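As an illustration (a minimal standalone sketch; it does not reproduce the API of the TQ library, and all names are ours), a TQ can be stored as a list of (s, f, v) triples, and the sum of two TQs as well as total, average, and max can be computed as follows.

# A temporal quantity (TQ) as a list of (s, f, v) triples with s < f,
# non-overlapping and sorted by s; outside these intervals the TQ is undefined.

def tq_sum(a, b):
    # (a + b)(t) = a(t) + b(t); undefined values act as 0, so T_{a+b} = T_a union T_b
    times = sorted({t for tq in (a, b) for s, f, _ in tq for t in (s, f)})
    out = []
    for s, f in zip(times, times[1:]):
        vals = [v for tq in (a, b) for (ss, ff, v) in tq if ss <= s and f <= ff]
        if vals:
            out.append((s, f, sum(vals)))
    return out   # note: adjacent intervals with equal values are not merged here

def total(a):
    return sum((f - s) * v for s, f, v in a)

def average(a):
    t_a = sum(f - s for s, f, _ in a)
    return total(a) / t_a

def tq_max(a):
    return max(v for _, _, v in a)

a = [(1, 5, 2), (7, 9, 1)]
b = [(3, 8, 4)]
print(tq_sum(a, b))                      # [(1, 3, 2), (3, 5, 6), (5, 7, 4), (7, 8, 5), (8, 9, 1)]
print(total(a), average(a), tq_max(a))   # 10, 10/6, 2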

2 Traditional (Generalized) Blockmodeling Scheme

A blockmodel (BM) [11] consists of structures obtained by identifying all units from
the same cluster of the clustering / partition $\mathcal{C} = \{C_i\}$, $\pi(v) = i \Leftrightarrow v \in C_i$.
Each pair of clusters $(C_i, C_j)$ determines a block consisting of links linking $C_i$ to
$C_j$. For an exact definition of a blockmodel we have to be precise also about which
blocks produce an arc in the reduced graph on classes and which do not, what is the
weight of this arc, and, in the case of generalized BM, of what type. The reduced
graph can be represented by a relational matrix, also called an image matrix.

Fig. 1 Blockmodel.
To develop a BM method we specify a criterion function 𝑃(𝜇) measuring the
"error" of the BM 𝜇. We can introduce additional knowledge by constraining the
partitions to a set Φ of feasible partitions. We are searching for a partition 𝜋 ∗ ∈ Φ
such that the corresponding BM 𝜇∗ minimizes the criterion function 𝑃(𝜇).

3 BM of Temporal Networks

For an early attempt of temporal network BM see [2, 5]. To the traditional BM
scheme we add the time dimension. We assume that the network is described using
temporal quantities [2] for nodes/links activity/presence, and some nodes properties
and links weights. Then also the BM partition 𝜋 is described for each node 𝑣 with a
temporal quantity 𝜋(𝑣, 𝑡): 𝜋(𝑣, 𝑡) = 𝑖 means that in time 𝑡 node 𝑣 belongs to cluster
𝑖. The structure and activity of clusters 𝐶𝑖 (𝑡) = {𝑣 : 𝜋(𝑣, 𝑡) = 𝑖} can change through
time, but they preserve their identity.
For the BM 𝜇 the clusters are mapped into BM nodes 𝜇 : 𝐶𝑖 → [𝑖]. To determine
the BM we still have to specify how the links from 𝐶𝑖 to 𝐶 𝑗 are represented in the
BM – in general, for the model arc ([𝑖], [ 𝑗]), we have to specify two TQs: its weight
𝑎 𝑖 𝑗 and, in the case of generalized BM, its type 𝜏𝑖 𝑗 . The weight can be an object of a
different type than the weights of the block links in the original temporal network.
We assume that in a temporal network N = (V, L, T , P, W) the links weight is
described by a TQ 𝑤 ∈ W. In the following years we intend to develop BM methods
case by case.
1. constant partition – nodes stay in the same cluster all the time:
   a. indirect approach based on clustering of TQs: $p(v) = \sum_{u \in N(v)} w(v, u)$,
      hierarchical clustering and leaders;
   b. indirect approach by conversion to the clustering with relational constraints
      (CRC);
   c. direct approach by (local) optimization of the criterion function $P$ over $\Phi$;
2. dynamic partition – nodes can move between clusters through time. The details
   are still to be elaborated.
In this paper, we present approaches for cases 1.a and 1.b.
In the literature there exist other approaches to BM of temporal networks. A
recent overview is available in the book [12].

3.1 Adapted Symbolic Clustering Methods

In [8] we adapted traditional leaders [13, 10] and agglomerative hierarchical [14, 1]
clustering methods for clustering of modal-valued symbolic data. They can be almost
directly applied for clustering units described by variables that have for their values
temporal quantities.
For a unit $X_i$, each variable $V_j$ is described with a size $h_{ij}$ and a temporal quantity
$\mathbf{x}_{ij}$, $X_{ij} = (h_{ij}, \mathbf{x}_{ij})$. In our algorithms we use normalized values of temporal
variables $V' = (h, \mathbf{p})$ where

$$\mathbf{p} = [(s_r, f_r, p_r) : r = 1, 2, \ldots, k] \quad \text{and} \quad p_r = \frac{v_r}{h} \,.$$

In the case when $h = \mathrm{total}(\mathbf{x})$, the normalized TQ $\mathbf{p}$ is essentially a probability
distribution.
Both methods create cluster representatives that are represented in the same way.
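For illustration (a small sketch of ours, not the clamix code), the normalization is straightforward:

def normalize(x):
    # V' = (h, p) with p_r = v_r / h; with h = total(x) the p-component integrates to 1
    h = sum((f - s) * v for s, f, v in x)
    return h, [(s, f, v / h) for s, f, v in x]

print(normalize([(1, 3, 2.0), (5, 6, 4.0)]))   # (8.0, [(1, 3, 0.25), (5, 6, 0.5)])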

3.2 Clustering of Temporal Network and CRC

To use the CRC in the construction of a nodes partition we have to define a dissim-
ilarity measure 𝑑 (𝑢, 𝑣) (or a similarity 𝑠(𝑢, 𝑣)) between nodes. An obvious solution
is 𝑠(𝑢, 𝑣) = 𝑓 (𝑤(𝑢, 𝑣)), for example
1. Total activity: 𝑠1 (𝑢, 𝑣) = total(𝑤(𝑢, 𝑣))
2. Average activity: 𝑠2 (𝑢, 𝑣) = average(𝑤(𝑢, 𝑣))
3. Maximal activity: 𝑠3 (𝑢, 𝑣) = max(𝑤(𝑢, 𝑣))
We can transform a similarity $s(u, v)$ into a dissimilarity by $d(u, v) = \frac{1}{s(u, v)}$ or
$d(u, v) = S - s(u, v)$ where $S > \max_{u,v} s(u, v)$. In this way, we transformed the
temporal network partitioning problem into a clustering with relational constraints
problem [6, 360–369]. It can be efficiently solved also for large sparse networks.

3.3 Block Model

Having the partition $\pi$, to produce a BM we have to specify the values on its links.
There are different options for the model link weights $a(([i], [j]))$:
1. Temporal quantities: $a(([i], [j])) = \mathrm{activity}(C_i, C_j) = \sum_{u \in C_i, v \in C_j} w(u, v)$ for
   $i \neq j$, and $a(([i], [i])) = \frac{1}{2}\, \mathrm{activity}(C_i, C_i)$.
2. Total intensities: $a_t(([i], [j])) = \mathrm{total}(a(([i], [j])))$.
3. Geometric average intensities: $a_g(([i], [j])) = \dfrac{a_t(([i], [j]))}{\sqrt{|C_i| \cdot |C_j|}}$.
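A minimal sketch of options 2 and 3 (our own illustration with toy data; an undirected network whose links are listed once, with per-link total intensities and a constant partition):

import numpy as np

def image_matrices(totals, part, K):
    # totals: {(u, v): total intensity of the undirected link u-v, each link listed once}
    # part:   {node: cluster index in 0..K-1}
    A = np.zeros((K, K))                          # total intensities a_t([i],[j])
    sizes = np.zeros(K)
    for u, i in part.items():
        sizes[i] += 1
    for (u, v), t in totals.items():
        i, j = sorted((part[u], part[v]))
        A[i, j] += t                              # upper triangle; diagonal sums within-cluster links once
    A_geo = A / np.sqrt(np.outer(sizes, sizes))   # geometric average intensities a_g([i],[j])
    return A, A_geo

totals = {("a", "b"): 4.0, ("a", "c"): 1.0, ("b", "c"): 2.0, ("c", "d"): 6.0}
part = {"a": 0, "b": 0, "c": 1, "d": 1}
A, A_geo = image_matrices(totals, part, K=2)
print(A)
print(A_geo)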

4 Example: September 11th Reuters Terror News

The Reuters Terror News network was obtained from the CRA (Centering Resonance
Analysis) networks produced by Steve Corman and Kevin Dooley at Arizona State
University. The network is based on all the stories released during 66 consecutive
days by the news agency Reuters concerning the September 11 attack on the U.S.,
beginning at 9:00 AM EST 9/11/01.
The nodes, 𝑛 = 13332, of this network are important words (terms). For a given
day, there is an edge between two words iff they appear in the same utterance (for
details see the paper [9]). The network has 𝑚 = 243447 edges. The weight of an
edge is its daily frequency. There are no loops in the network. The network Terror
News is undirected – so will be also its BM.
The Reuters Terror News network was used as a case network for the Viszards
visualization session on the Sunbelt XXII International Sunbelt Social Network
Conference, New Orleans, USA, 13-17 February 2002. It is available at
http://vlado.fmf.uni-lj.si/pub/networks/data/CRA/terror.htm .

We transformed the Pajek version of the network into the NetsJSON format used in
the libraries TQ and Nets. For a temporal description of each node/word for clustering
we took its activity (the sum of the TQs on all edges adjacent to a given node $v$):

$$\mathrm{act}(v) = \sum_{u \in N(v)} w(v{:}u) \,.$$

Our leaders’ and hierarchical clustering methods are compatible – they are based
on the same clustering error criterion function. Usually, the leaders’ method is used
to reduce a large clustering problem to up to some hundred units. With hierarchical
clustering of the leaders of the obtained clusters, we afterward determine the "right"
number of clusters and their representatives.

Fig. 2 Hierarchical clustering of 100 leaders in Terror News.

To cluster all 13332 words (nodes) in Terror News we used the adapted leaders’
method searching for 100 clusters. We continued with the hierarchical clustering of
the obtained 100 leaders. The result is presented in the dendrogram in Figure 2.

Fig. 3 Word clouds for clusters 𝐶58 and 𝐶81.



To get an insight into the content of a selected cluster we draw the corresponding
word cloud based on the cluster’s leader. In Figure 3 the word clouds for clusters
𝐶58 and 𝐶81 (|𝐶58| = 1396, |𝐶81| = 2226 ) are presented.
We can also compare the activities of pairs of clusters by considering the overlap
of p-components (probability distributions) of their leaders. In Figure 4, we com-
pare cluster 𝐶58 with cluster 𝐶81, and cluster 𝐿96 with cluster 𝐶66. In the right
diagram some values are outside the display area: 𝐿96[15] = 0.3524, 𝐶66[4] =
0.1961, 𝐶66[5] = 0.2917.

Fig. 4 Comparing activities of clusters (blue – first cluster, red – second cluster, violet – overlap);
left panel: C58:C81, right panel: L96:C66.

We decided to consider in the BM the clustering of Terror News into 5 clusters
C = {𝐶94, 𝐶88, 𝐶95, 𝐿43, 𝐿74}. The split of cluster C95 gives clusters of sizes 325
and 629 (for sizes, see the right side of Figure 5). Both clusters C94 and C88 have a
chaining pattern at their top levels.
Because of the large differences in the cluster sizes, it is difficult to interpret the total
intensities image matrix. An overall insight into the BM structure is obtained from the
geometric average intensities image matrix (right side of Figure 5) and the correspond-
ing BM network (cut level 0.3) on the left side of Figure 5.

𝑖 cluster 1 2 3 4 5
1 𝐶94 23.85 12.23 2.26 1.57 1.42
2 𝐶88 3.58 0.33 0.22 0.19
3 𝐶95 0.56 0.07 0.07
4 𝐿43 0.38 0.08
5 𝐿74 0.39
size 6018 5109 954 535 716

Fig. 5 Block model and image matrix.


A more detailed BM is presented by the activities (𝑝-components) image matrix
in Figure 6.

Fig. 6 BM represented as 𝑝-components of temporal activities of links between pairs of clusters.

A more compact representation of a temporal BM is a heatmap display of this
matrix in Figure 7. Because of some relatively very large values, it turns out that the
display of the matrix with logarithmic values provides much more information.

Fig. 7 BM heatmap with log2 values.

To the Terror News network, we applied also the clustering with relational constraints
approach. Because of the limited space available for each paper, we can not present it
here. A description of the analysis with the corresponding code is available at
https://github.com/bavla/TQ/wiki/BMRC .

5 Conclusions

The presented research is a work in progress. It only deals with the two simplest
cases of temporal blockmodeling. We provided some answers to the problem of
normalization of model weights TQs when comparing them and some ways to
present/display the temporal BMs.
We used different tools (R, Python, and Pajek) to obtain the results. We intend to
provide the software support in a single tool – probably in Julia. We also intend to
create a collection of interesting and well-documented temporal networks for testing
and demonstrating the developed software.

Acknowledgements The paper contains an elaborated version of ideas presented in my talks at the
XXXX Sunbelt Social Networks Conference (on Zoom), July 13-17, 2020 and at the EUSN 2021
– 5th European Conference on Social Networks, Naples (on Zoom), September 6-10, 2021.
This work is supported in part by the Slovenian Research Agency (research program P1-0294
and research projects J1-9187, J1-2481, and J5-2557), and prepared within the framework of the
HSE University Basic Research Program.

References

1. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, New York (1973)
2. Batagelj, V., Praprotnik, S.: An algebraic approach to temporal network analysis based on
temporal quantities. Soc. Netw. Anal. Min. 6(1), 1–22 (2016)
3. Batagelj, V., Ferligoj, A.: Clustering relational data. In: Gaul, W., Opitz, O., Schader, M. (Eds.)
Data Analysis / Scientific Modeling and Practical Application, pp. 3–15. Springer (2000)
4. Batagelj, V.: Generalized Ward and related clustering problems. In: Bock H.-H. (ed) Classifi-
cation and Related Methods of Data Analysis, pp. 67–74. North-Holland, Amsterdam (1988)
5. Batagelj, V., Ferligoj, A., Doreian, P.: Indirect blockmodeling of 3-way networks. In: Brito,
P., Bertrand, P., Cucumel, G., de Carvalho, F. (eds.) Selected Contributions in Data Analysis
and Classification, pp. 151–159. Springer (2007)
6. Batagelj, V., Doreian, P., Ferligoj, A., Kejžar, N.: Understanding Large Temporal Networks
and Spatial Networks: Exploration, Pattern Searching, Visualization and Network Evolution.
Wiley (2014)
7. Batagelj, V., Kejžar, N.: Clamix – Clustering Symbolic Objects (2010) Program in R
https://r-forge.r-project.org/projects/clamix/
8. Kejžar, N., Korenjak-Černe, S., Batagelj, V.: Clustering of modal-valued symbolic data. Adv.
Data Anal. Classif. 15, 513–541 (2021)
9. Corman, S. R., Kuhn, T., McPhee, R. D., Dooley, K. J.: Studying complex discursive systems:
Centering resonance analysis of communication. Hum. Commun. Res. 28(2), 157–206 (2002)
10. Diday, E.: Optimisation en Classification Automatique. Tome 1.,2. INRIA, Rocquencourt (in
French) (1979)
11. Doreian, P., Batagelj, V., Ferligoj, A.: Generalized Blockmodeling. Structural Analysis in the
Social Sciences. Cambridge University Press (2005)
12. Doreian, P., Batagelj, V., Ferligoj, A. (Eds.) Advances in Network Clustering and Blockmod-
eling. Wiley (2020)
13. Hartigan, J. A.: Clustering Algorithms. Wiley-Interscience, New York (1975)
14. Ward, J. H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58,
236–244 (1963)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Latent Block Regression Model

Rafika Boutalbi, Lazhar Labiod, and Mohamed Nadif

Abstract When dealing with high dimensional sparse data, such as in recommender
systems, co-clustering turns out to be more beneficial than one-sided clustering, even
if one is interested in clustering along one dimension only. Thereby, co-clusterwise is
a natural extension of clusterwise. Unfortunately, none of the existing approaches
considers covariates on both dimensions of a data matrix. In this paper, we propose
a Latent Block Regression Model (LBRM) overcoming this limit. For inference,
we propose an algorithm performing simultaneously co-clustering and regression
where a linear regression model characterizes each block. Placing the estimate of the
model parameters under the maximum likelihood approach, we derive a Variational
Expectation-Maximization (VEM) algorithm for estimating the model’s parameters.
The effectiveness of the proposed VEM-LBRM is illustrated through simulated datasets.

Keywords: co-clustering, clusterwise, tensor, data mining

1 Introduction

The cluster-wise linear regression algorithm CLR (or Latent Regression Model) is
a finite mixture of regressions and one of the most commonly used methods for
simultaneous learning and clustering [14, 5]. It aims to find clusters of entities to
minimize the overall sum of squared errors from regressions performed over these
clusters. Specifically, X = [𝑥𝑖 𝑗 ] ∈ R𝑛×𝑣 is the covariate matrix and Y ∈ R𝑛×1 the
response vector. The cluster-wise method aims to find 𝑔 clusters 𝐶1 , . . . , 𝐶𝑔 and
regression coefficients 𝜷 (𝑘) ∈ R𝑑×1 by minimizing the following objective function
$\sum_{k=1}^{g}\sum_{i \in C_k}\bigl(y_i - \sum_{j=1}^{v}\beta_j^{(k)} x_{ij} + b_k\bigr)^2$ where:

• 𝑦 𝑖 is the value of the dependent variable for subject/observation 𝑖 defined by


x𝑖 = (𝑥𝑖1 , . . . , 𝑥 𝑖𝑑 ),
• 𝑥𝑖 𝑗 is the value of the 𝑗-th independent variable for subject/observation 𝑖,
• $\beta_j^{(k)}$ is the 𝑗-th multiple regression coefficient and 𝑏𝑘 is the intercept.
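To make the objective above concrete, the following is a minimal sketch (ours, not the algorithm of [14] or [5]) that alternates between fitting one least-squares model per cluster and reassigning each observation to the cluster whose model gives the smallest squared residual; the intercept is handled through an appended constant column, and all names below are illustrative.

```python
# Minimal cluster-wise linear regression sketch (illustration only).
import numpy as np

def clusterwise_regression(X, y, g, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])        # append a constant column for the intercept
    labels = rng.integers(0, g, size=n)         # random initial partition into g clusters
    coefs = np.zeros((g, Xb.shape[1]))
    for _ in range(n_iter):
        # fit one ordinary least-squares model per cluster
        for k in range(g):
            idx = labels == k
            if idx.sum() > Xb.shape[1]:         # skip degenerate (too small) clusters
                coefs[k], *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        # reassign each observation to its best-fitting cluster
        errors = (y[:, None] - Xb @ coefs.T) ** 2
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs
```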

Rafika Boutalbi ( )
Institute for Parallel and Distributed Systems, Analytic Computing, University of Stuttgart, Ger-
many, e-mail: [email protected]
Lazhar Labiod · Mohamed Nadif
Centre Borelli UMR 9010, Université Paris Cité, France,
e-mail: [email protected];[email protected]

© The Author(s) 2023 73


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_9

Various adjustments have been made to this model to improve its performance in
terms of clustering and prediction. In our contribution, we propose to embed the
co-clustering in the model.
Co-clustering is a simultaneous clustering of both dimensions of a data matrix
that has proven to be more beneficial than traditional one-sided clustering, especially
when dealing with sparse data. When dealing with high dimensional data sparse or
not, co-clustering turns out to be more valuable than one-sided clustering [1, 13],
even if one is interested in clustering along one dimension only. In [4] the authors
proposed the SCOAL approach (Simultaneous Co-clustering and Learning model),
leading to co-clustering and prediction for binary data; they generalized the model
to continuous data. However, this model does not take into account the sparsity
of data in the sense that it does not lead to homogeneous blocks. The obtained
results in terms of Mean Square Error (MSE) are good, but in terms of co-clustering
(homogeneity of co-clusters), no analysis has been presented. This model is also
related to the soft PDLF (Predictive Discrete Latent Factor) model [2], where the
value of response 𝑦 𝑖 𝑗 ’s in each co-cluster is modeled as a sum 𝛽𝑇 𝑥 𝑖 𝑗 + 𝛿 𝑘ℓ where
𝛽 is a global regression model. In contrast, 𝛿 𝑘ℓ is a co-cluster specific offset. More
recently, in [17] the authors proposed an algorithm taking into account only row
covariates information to realize co-clustering and regression simultaneously. To
this end, the authors rely on the latent block models [8]. In our contribution,
we propose to rely also on this model but considering row and column covariates.
The proposed Latent Block Regression Model (LBRM) is an extension of fi-
nite mixtures of regression models where the co-clustering is embedded. It allows
us to deal with co-clustering and regression simultaneously while taking into ac-
count covariates. To estimate the parameters we rely on a Variational Expectation-
Maximization algorithm [7] referred to as VEM-LBRM.

2 From Clusterwise Regression to Co-clusterwise Regression

2.1 Latent Block Model (LBM)

Given an 𝑛 × 𝑑 data matrix X = (𝑥𝑖 𝑗 , 𝑖 ∈ 𝐼 = {1, . . . , 𝑛}; 𝑗 ∈ 𝐽 = {1, . . . , 𝑑}). It is


assumed that there exists a partition on 𝐼 and a partition on 𝐽. A partition of 𝐼 × 𝐽 into
𝑔 × 𝑚 blocks will be represented by a pair of partitions (z, w). The 𝑘-th row cluster
corresponds to the set of rows 𝑖 such that 𝑧𝑖𝑘 = 1 and 𝑧𝑖𝑘′ = 0 ∀𝑘′ ≠ 𝑘. Thereby, the
partition represented by z can also be represented by a matrix of elements in {0, 1}^𝑔
satisfying $\sum_{k=1}^{g} z_{ik} = 1$. Similarly, the ℓ-th column cluster corresponds to the set of
columns 𝑗, and the partition w can be represented by a matrix of elements in {0, 1}^𝑚
satisfying $\sum_{\ell=1}^{m} w_{j\ell} = 1$.
Considering the Latent Block Model (LBM) [6], it is assumed that each ele-
ment 𝑥𝑖 𝑗 of the 𝑘ℓth block is generated according to a parameterized probabil-
ity density function (pdf) 𝑓 (𝑥𝑖 𝑗 ; 𝛼 𝑘ℓ ). Furthermore, in the LBM the univariate
random variables 𝑥𝑖 𝑗 are assumed to be conditionally independent given (z, w).
Thereby, the conditional pdf of X can be expressed as 𝑃(𝑧𝑖𝑘 = 1, 𝑤 𝑗ℓ = 1|X) =
𝑃(𝑧𝑖𝑘 = 1|X)𝑃(𝑤 𝑗ℓ = 1|X). From this hypothesis, we then consider the latent
block model where the two sets 𝐼 and 𝐽 are considered as random samples and the
row, and column labels become latent variables. Therefore, the parameter of the
latent block model is 𝚯 = (𝝅, 𝝆, 𝜶), with 𝝅 = (𝜋1 , . . . , 𝜋𝑔 ) and 𝝆 = (𝜌1 , . . . , 𝜌 𝑚 )
where (𝜋 𝑘 = 𝑃(𝑧𝑖𝑘 = 1), 𝑘 = 1, . . . , 𝑔), (𝜌ℓ = 𝑃(𝑤 𝑗ℓ = 1), ℓ = 1, . . . , 𝑚) are
the mixing proportions and 𝜶 = (𝛼 𝑘ℓ ; 𝑘 = 1, . . . 𝑔, ℓ = 1, . . . , 𝑚) where 𝛼 𝑘ℓ
is the parameter of the distribution of block 𝑘ℓ. Considering that the complete
data are the vector (X, z, w), i.e, we assume that the latent variable z and w
are known, the resulting complete data log-likelihood of the latent block model
𝐿 𝐶 (X, z, w, 𝚯) = log 𝑓 (X, z, w; 𝚯) can be written as follows
$$\sum_{k=1}^{g} z_{.k}\log\pi_k \;+\; \sum_{\ell=1}^{m} w_{.\ell}\log\rho_\ell \;+\; \sum_{i=1}^{n}\sum_{j=1}^{d}\sum_{k=1}^{g}\sum_{\ell=1}^{m} z_{ik}\,w_{j\ell}\log\boldsymbol{\phi}_{k\ell}(x_{ij};\boldsymbol{\alpha}_{k\ell}).$$

where the 𝜋 𝑘 ’s and 𝜌ℓ ’s denote the proportions of row and columns clusters re-
spectively; see for instance [8]. Note that the complete-data log-likelihood breaks
into three terms: the first one depends on proportions of row clusters, the second on
proportions of column clusters and the third on the pdf of each block or co-cluster.
The objective is then to maximize the function 𝐿 𝐶 (z, w, 𝚯).

2.2 Latent Block Regression Model (LBRM)

For co-clustering of continuous data, the Gaussian latent block model can be used. For
instance, note that it is easy to show that the minimization of the well-known criterion
$||\mathbf{X}-\mathbf{z}\boldsymbol{\mu}\mathbf{w}^{T}||^2=\sum_{k=1}^{g}\sum_{\ell=1}^{m}\sum_{i|z_{ik}=1}\sum_{j|w_{j\ell}=1}(x_{ij}-\mu_{k\ell})^2$, where $\mathbf{z}\in\{0,1\}^{n\times g}$,
$\mathbf{w}\in\{0,1\}^{d\times m}$ and $\boldsymbol{\mu}\in\mathbb{R}^{g\times m}$, is associated with the Gaussian latent block model with
$\boldsymbol{\alpha}_{k\ell}=(\mu_{k\ell},\sigma^2_{k\ell})$, where the proportions of row clusters and column clusters are equal and

in addition the variances of blocks are identical [9]. Note that 1) the characteristic
of the latent block model is that the rows and the columns are treated symmetrically
2) the estimation of the parameters requires a variational approximation [7, 17]. In
the sequel, we see how can we integrate a regression model. Hereafter, we propose a
novel Latent Block Regression model for co-clustering and learning simultaneously.
The model considers the response matrix Y = [𝑦 𝑖 𝑗 ] ∈ R𝑛×𝑑 and the covariate tensor
X = [1, x𝑖 𝑗 ] ∈ R𝑛×𝑑×𝑣 where 𝑛 is the number of rows, 𝑑 the number of columns, and
𝑣 the number of covariates. Figure 1 presents data structure for the proposed model
LBRM.
In the following we propose the integration of mixture of regression [5] per block
in the Latent Block model (LBM) considering the distribution $\Phi(y_{ij}|\mathbf{x}_{ij};\lambda_{k\ell})$. We
assume in the following the normality of $\Phi$,
$$\Phi(y_{ij}|\mathbf{x}_{ij};\lambda_{k\ell}) = p(y_{ij}|\mathbf{x}_{ij},\boldsymbol{\beta}_{k\ell},\sigma^2_{k\ell}) = (2\pi\sigma^2_{k\ell})^{-0.5}\exp\Bigl\{-\frac{1}{2\sigma^2_{k\ell}}\,(y_{ij}-\boldsymbol{\beta}^{\top}_{k\ell}\mathbf{x}_{ij})^2\Bigr\}$$

Fig. 1 Data representation for proposed model.

With the LBRM model, the parameter 𝛀 is composed of row and column proportions
𝝅, 𝝆 respectively, $\boldsymbol{\beta} = \{\boldsymbol{\beta}_{11}, \ldots, \boldsymbol{\beta}_{gm}\}$ with $\boldsymbol{\beta}^{\top}_{k\ell} = (\beta^{k\ell}_{0}, \beta^{k\ell}_{1}, \ldots, \beta^{k\ell}_{v})$, where $\beta^{k\ell}_{0}$
represents the intercept of regression and $\boldsymbol{\sigma} = \{\sigma_{11}, \ldots, \sigma_{gm}\}$. The classification
log-likelihood can be written:
$$\sum_{i,k} z_{ik}\log\pi_k + \sum_{j,\ell} w_{j\ell}\log\rho_\ell - \frac{1}{2}\sum_{k,\ell} z_{.k}\,w_{.\ell}\log(\sigma^2_{k\ell}) - \frac{1}{2\sigma^2_{k\ell}}\sum_{i,j,k,\ell} z_{ik}w_{j\ell}\,(y_{ij}-\boldsymbol{\beta}^{\top}_{k\ell}\mathbf{x}_{ij})^2$$

with $z_{.k} = \sum_i z_{ik}$ and $w_{.\ell} = \sum_j w_{j\ell}$.

3 Variational EM Algorithm

To estimate 𝛀, the EM algorithm [3] is a candidate for this task. It maximizes


the log-likelihood 𝑓 (X, 𝛀) w.r. to 𝛀 iteratively by maximizing the conditional
expectation of the complete data log-likelihood 𝐿 𝐶 (z, w; 𝛀) w.r. to 𝛀, given a
previous current estimate 𝛀 (𝑐) and the observed data x. Unfortunately, difficulties
arise owing to the dependence structure among the variables 𝑥𝑖 𝑗 of the model. To solve
this problem an approximation using the [12] interpretation of the EM algorithm can
be proposed; see, e.g., [7, 8]. Hence, the aim is to maximize the following lower bound
of the log-likelihood
Í criterion: 𝐹𝐶 (z̃, w̃; 𝛀) = 𝐿 𝐶 (z̃, w̃, 𝛀) +Í𝐻 (z̃) + 𝐻 ( w̃) where
𝐻 (z̃) = − 𝑖,𝑘 𝑧˜𝑖𝑘 log 𝑧˜𝑖𝑘 with 𝑧˜𝑖𝑘 = 𝑃(𝑧 𝑖𝑘 = 1|X), 𝐻 ( w̃) = − 𝑗,ℓ 𝑤˜ 𝑗ℓ log 𝑤˜ 𝑗ℓ with
𝑤˜ 𝑗ℓ = 𝑃(𝑤 𝑗ℓ = 1|X), and 𝐿 𝐶 (z̃, w̃; 𝛀)
˜ is the fuzzy complete data log-likelihood (up
to a constant). 𝐿 𝐶 (z̃, w̃; 𝛀) is given by
$$L_C(\tilde{\mathbf{z}},\tilde{\mathbf{w}},\boldsymbol{\Omega}) = \sum_{i,k}\tilde{z}_{ik}\log\pi_k + \sum_{j,\ell}\tilde{w}_{j\ell}\log\rho_\ell - \frac{1}{2}\sum_{k,\ell}\tilde{z}_{.k}\,\tilde{w}_{.\ell}\log(\sigma^2_{k\ell}) - \frac{1}{2\sigma^2_{k\ell}}\sum_{i,j,k,\ell}\tilde{z}_{ik}\tilde{w}_{j\ell}\,(y_{ij}-\boldsymbol{\beta}^{\top}_{k\ell}\mathbf{x}_{ij})^2$$

The maximization of 𝐹𝐶 (z̃, w̃, 𝛀) can be reached by realizing the three following
optimizations: update z̃ by $\mathrm{argmax}_{\tilde{\mathbf{z}}}\,F_C(\tilde{\mathbf{z}},\tilde{\mathbf{w}},\boldsymbol{\Omega})$, update w̃ by $\mathrm{argmax}_{\tilde{\mathbf{w}}}\,F_C(\tilde{\mathbf{z}},\tilde{\mathbf{w}},\boldsymbol{\Omega})$,
and update 𝛀 by argmax𝛀 𝐹𝐶 (z̃, w̃, 𝛀). In what follows, we detail the Expectation
(E) and Maximization (M) step of the Variational EM algorithm for tensor data.

E-step. It consists in computing, for all 𝑖, 𝑘, 𝑗, ℓ the posterior probabilities 𝑧˜𝑖𝑘


and 𝑤˜ 𝑗ℓ maximizing 𝐹𝐶 (z̃, w̃, 𝛀) given the estimated parameters 𝛀 𝑘ℓ . It is easy
to show that the posterior probability $\tilde{z}_{ik}$ maximizing $F_C(\tilde{\mathbf{z}},\tilde{\mathbf{w}},\boldsymbol{\Omega})$ is given by:
$\tilde{z}_{ik} \propto \pi_k \exp\bigl(\sum_{j,\ell} \tilde{w}_{j\ell} \log p(y_{ij}|\mathbf{x}_{ij}, \boldsymbol{\beta}_{k\ell}, \sigma_{k\ell})\bigr)$. In the same manner, the posterior
probability $\tilde{w}_{j\ell}$ is given by: $\tilde{w}_{j\ell} \propto \rho_\ell \exp\bigl(\sum_{i,k} \tilde{z}_{ik} \log p(y_{ij}|\mathbf{x}_{ij}, \boldsymbol{\beta}_{k\ell}, \sigma_{k\ell})\bigr)$.
M-step. Given the previously computed posterior probabilities z̃ and w̃, the M-step
consists in updating , ∀𝑘, ℓ, the parameters of the model 𝜋 𝑘 , 𝜌ℓ , 𝝁 𝑘ℓ and 𝝀 𝑘ℓ maxi-
mizing 𝐹𝐶 (z̃, w̃, 𝛀). Using the computed quantities from step E, the maximization
step (M-step) involves the following closed-form updates.
• Taking into account the constraints $\sum_k \pi_k = 1$ and $\sum_\ell \rho_\ell = 1$, it is easy to show
  that $\pi_k = \frac{\sum_i \tilde{z}_{ik}}{n} = \frac{\tilde{z}_{.k}}{n}$ and $\rho_\ell = \frac{\sum_j \tilde{w}_{j\ell}}{d} = \frac{\tilde{w}_{.\ell}}{d}$.
• The update of $\lambda_{k\ell}$, which is formed by $(\boldsymbol{\beta}_{k\ell}, \sigma_{k\ell})$, can be given by simple
  derivatives of $F_C(\tilde{\mathbf{z}},\tilde{\mathbf{w}},\boldsymbol{\Omega})$ with respect to $\boldsymbol{\beta}_{k\ell}$ and $\sigma_{k\ell}$, respectively. This leads to
  $$\boldsymbol{\beta}_{k\ell} = \Bigl(\sum_{i,j}\tilde{z}_{ik}\tilde{w}_{j\ell}\, y_{ij}\,\mathbf{x}_{ij}\Bigr)\Bigl(\sum_{i,j}\tilde{z}_{ik}\tilde{w}_{j\ell}\,\mathbf{x}_{ij}\mathbf{x}^{\top}_{ij}\Bigr)^{-1}, \qquad
  \sigma^2_{k\ell} = \frac{\sum_{i,j}\tilde{z}_{ik}\tilde{w}_{j\ell}\,(y_{ij}-\boldsymbol{\beta}^{\top}_{k\ell}\mathbf{x}_{ij})^2}{\sum_{i,j}\tilde{z}_{ik}\tilde{w}_{j\ell}}.$$

The proposed algorithm for tensor data, referred to as VEM-LBRM, alternates the two
previously described Expectation and Maximization steps. At the convergence, a hard
co-clustering is deduced from the posterior probabilities.
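For illustration only, the sketch below implements the E and M updates described above with NumPy. It is not the authors' VEM-LBRM implementation: the initialization, the fixed number of iterations and all variable names are our assumptions; Y is the n x d response matrix and X the n x d x v covariate tensor (with the first covariate set to 1 for the intercept).

```python
import numpy as np

def log_gauss(Y, X, beta, sigma2):
    # log p(y_ij | x_ij, beta_kl, sigma2_kl) for one block (k, l)
    mean = X @ beta                                     # n x d matrix of beta' x_ij
    return -0.5 * (np.log(2 * np.pi * sigma2) + (Y - mean) ** 2 / sigma2)

def vem_lbrm(Y, X, g, m, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    n, d, v = X.shape
    z = rng.dirichlet(np.ones(g), size=n)               # soft row assignments, n x g
    w = rng.dirichlet(np.ones(m), size=d)               # soft column assignments, d x m
    beta = rng.normal(size=(g, m, v))
    sigma2 = np.ones((g, m))
    pi, rho = np.full(g, 1.0 / g), np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # block log-densities, indexed [k, l, i, j]
        logp = np.array([[log_gauss(Y, X, beta[k, l], sigma2[k, l])
                          for l in range(m)] for k in range(g)])
        # E-step: variational posteriors, normalized in the log domain for stability
        lz = np.log(pi)[None, :] + np.einsum('jl,klij->ik', w, logp)
        z = np.exp(lz - lz.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        lw = np.log(rho)[None, :] + np.einsum('ik,klij->jl', z, logp)
        w = np.exp(lw - lw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates of the proportions and block parameters
        pi, rho = z.mean(axis=0), w.mean(axis=0)
        for k in range(g):
            for l in range(m):
                wt = np.outer(z[:, k], w[:, l])          # n x d weights z_ik * w_jl
                A = np.einsum('ij,ijp,ijq->pq', wt, X, X)
                b = np.einsum('ij,ij,ijp->p', wt, Y, X)
                beta[k, l] = np.linalg.solve(A, b)
                resid = (Y - X @ beta[k, l]) ** 2
                sigma2[k, l] = (wt * resid).sum() / wt.sum()
    # hard co-clustering deduced from the posterior probabilities
    return z.argmax(axis=1), w.argmax(axis=1), beta, sigma2
```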

4 Experimental Results

First, we evaluate the proposed VEM-LBRM on three synthetic datasets in terms of co-
clustering and regression. We compare VEM-LBRM with some clustering and regres-
sion methods namely Global model which is a single multiple linear regression
model performed on all observations, K-means, Clusterwise, Co-clustering
and SCOAL. We retain two widely used measures to assess the quality of clustering,
namely the Normalized Mutual Information (NMI) [16] and the Adjusted Rand In-
dex (ARI) [15]. Intuitively, NMI quantifies how much the estimated clustering is
informative about the true clustering. The ARI metric is related to the clustering
accuracy and measures the degree of agreement between an estimated clustering and
a reference clustering. Both NMI and ARI are equal to 1 if the resulting clustering
is identical to the true one. On the other hand, we use RMSE (Root MSE) and MAE
(Mean Absolute Error) metrics to evaluate the precision of prediction; RMSE
is a loss function suitable for Gaussian noise, whereas MAE uses the absolute
value, which is less sensitive to extreme values.
We generated tensor data X with size 200 × 200 × 2 according to a Gaussian
model per block. In the simulation study, we considered three scenarios by varying
the regression parameters — the examples have blocks with different regression
collinearity and different co-clusters structure complexity. The parameters for each
example are reported in Table 1. In Figures 2 and 3 are depicted the true regression
planes and the true simulated response matrix Y.
Table 1 Parameters generation for examples.
Dataset Example 1 Example 2 Example 3
𝝅 = [0.35, 0.35, 0.3] , 𝝆 = [0.55, 0.45]
𝜎       𝜎 = 5                 𝜎 = 7                    𝜎 = 7
𝚺       𝚺 = [1 0; 0 1]        𝚺 = [2 0.3; 0.3 2]       𝚺 = [1 2; 2 1]
Co-clusters 𝜷 𝑘ℓ 𝝁 𝑘ℓ 𝜷 𝑘ℓ 𝝁 𝑘ℓ 𝜷 𝑘ℓ 𝝁 𝑘ℓ
Cluster (1,1) [1, -10, 1] [5,20] [1, -10, 1] [5,20] [1, -10, 1] [5,20]
Cluster (1,2) [10, 4, 13] [5,10] [1, -10, 1] [5,10] [1, -10, 1] [5,10]
Cluster (2,1) [3, 20, -2] [10,20] [1, -10, 1] [10,20] [1, -10, 1] [5,30]
Cluster (2,2) [-5, -2, -6] [10,10] [7, 5, -10] [10,10] [7, 5, -10] [20,10]
Cluster (3,1) [-10, 20, 10] [20,20] [7, 5, -10] [20,20] [7, 5, -10] [20,20]
Cluster (3,2) [7, 5, -10] [20,10] [7, 5, -10] [20,10] [7, 5, -10] [20,30]

Fig. 2 Synthetic data: True regression plans according to the chosen parameters.

Fig. 3 Synthetic data: True co-clustering according to the chosen parameters.

In our illustrations, we consider co-clustering and regression challenges. All


metrics concerning rows and columns are computed by averaging over ten random
training/testing data splits, using 80% of the data for training and 20% for validation.
Thereby, we compare VEM-LBRM with Global model (which is a multiple linear
regression), K-means, Clusterwise by reshaping the tensor to matrix with size
𝑁 × 𝑣 where 𝑁 = 𝑛 × 𝑑. On the other hand, the VEM algorithm for co-clustering is
applied on response matrix Y. Furthermore, for clustering algorithms, the RMSE,
MAE, and R-squared are computed by applying linear regression on each obtained
co-cluster. In Table 2, the performances of all algorithms are reported. The missing
values represent measures that cannot be computed by the corresponding models.
From these comparisons, we observe that, whether the block structure is easy to
identify or not, VEM-LBRM is able to outperform the other algorithms.
To go further, note that in [11], the authors reformulated the clusterwise and
introduced the linear cluster-weighted model (CWM) in a statistical setting and
showed that it is a general and flexible family of mixture models. They included in

Table 2 (co)-clustering and prediction: mean and sd in parentheses.


Examples  Algorithms  RMSE (Training)  RMSE (Test)  MAE (Training)  MAE (Test)  Rsquare (Avg.)  ARI (Row)  ARI (Col)  NMI (Row)  NMI (Col)
Global model 164.38 164.05 145.29 145.05 0.46 - - - -
( 0.03 ) (0.49 ) ( 0.08 ) ( 0.71 ) ( 0.0 ) - - - -
K-means 49.62 49.51 34.86 34.91 0.8 0.61 - 0.49 -
( 60.2 ) (67.48 ) ( 33.56 ) ( 35.79 ) ( 0.02 ) 0.02 - 0.03 -
Example 1
Clusterwise 154.57 154.47 127.77 127.93 0.52 0.07 - 0.01 -
(𝑔 = 3) ( 0.01 ) (0.36 ) ( 0.03 ) ( 0.45 ) ( 0.0 ) 0.0 - 0.0 -
Co-clustering 10.86 10.83 7.29 7.29 0.88 0.84 1.0 0.71 1.0
(𝑔 = 3) ( 14.76 ) (14.36 ) ( 4.67 ) ( 4.59 ) ( 0.0 ) 0.01 0.0 0.04 0.0
SCOAL 14.99 14.92 10.45 10.41 0.99 0.91 1.0 0.84 1.0
(𝑔 = 3, 𝑚= 2) ( 207.56 ) (208.91 ) ( 89.48 ) ( 90.55 ) ( 0.0 ) 0.01 0.0 0.04 0.0
VEM-LBRM 7.1 7.06 5.29 5.26 0.99 0.95 1.0 0.92 1.0
(𝑔 = 3, 𝑚= 2) ( 17.71 ) (16.86 ) ( 6.8 ) ( 6.32 ) ( 0.0 ) (0.01) (0.0) (0.03) (0.0)
Global model 29.15 29.21 24.64 24.68 0.34 - - - -
( 0.04 ) (0.15 ) ( 0.04 ) ( 0.12 ) ( 0.0 ) - - - -
K-means 10.43 10.49 7.73 7.77 0.71 0.56 - 0.45 -
( 0.25 ) (0.24 ) ( 0.17 ) ( 0.16 ) ( 0.01 ) 0.0 - 0.0 -
Example 2
Clusterwise 18.54 18.62 11.33 11.38 0.73 0.15 - 0.16 -
(𝑔 = 3) ( 0.09 ) (0.27 ) ( 0.06 ) ( 0.14 ) ( 0.0 ) 0.0 - 0.0 -
Co-clustering 7.5 7.49 5.89 5.9 0.8 0.95 1.0 0.94 1.0
(𝑔 = 3) ( 1.35 ) (1.38 ) ( 0.82 ) ( 0.86 ) ( 0.07 ) 0.14 0.0 0.17 0.0
SCOAL 12.63 12.69 8.75 8.81 0.81 0.97 1.0 0.94 1.0
(𝑔 = 3, 𝑚= 2) ( 12.57 ) (12.81 ) ( 7.38 ) ( 7.58 ) ( 0.35 ) 0.1 0.0 0.17 0.0
VEM-LBRM 6.99 6.99 5.57 5.57 0.96 1.0 1.0 1.0 1.0
(𝑔 = 3, 𝑚= 2) ( 0.01 ) (0.04 ) ( 0.01 ) ( 0.02 ) ( 0.0 ) (0.0) (0.0) (0.0) (0.0)
Global model 45.38 45.24 38.33 38.21 0.49 - - - -
( 0.06 ) (0.24 ) ( 0.07 ) ( 0.26 ) ( 0.0 ) - - - -
K-means 10.47 10.41 7.44 7.42 0.83 0.54 - 0.45 -
( 1.73 ) (1.74 ) ( 1.08 ) ( 1.08 ) ( 0.08 ) 0.01 - 0.01 -
Example 3
Clusterwise 23.09 23.18 12.09 12.15 0.87 0.09 - 0.09 -
(𝑔 = 3) ( 1.84 ) (2.02 ) ( 1.23 ) ( 1.29 ) ( 0.02 ) 0.0 - 0.0 -
Co-clustering 9.48 9.39 6.98 6.93 0.73 0.74 1.0 0.7 1.0
(𝑔 = 3) ( 0.16 ) (0.22 ) ( 0.01 ) ( 0.02 ) ( 0.02 ) 0.04 0.0 0.08 0.0
SCOAL 27.32 27.14 16.82 16.73 0.57 0.98 1.0 0.96 1.0
(𝑔 = 3, 𝑚= 2) ( 41.97 ) (41.83 ) ( 24.13 ) ( 24.16 ) ( 0.93 ) 0.07 0.0 0.12 0.0
VEM-LBRM 7.21 7.21 5.71 5.71 0.99 0.98 1.0 0.96 1.0
(𝑔 = 3, 𝑚= 2) ( 0.68 ) (0.7 ) ( 0.42 ) ( 0.42 ) ( 0.0 ) (0.07) (0.0) (0.12) (0.0)

the classical clusterwise model the probability $\Phi_0(\mathbf{x}_i|\boldsymbol{\Omega}_k)$ to model the covariates,
whereas the classical clusterwise approach models the output only, using $\Phi(y_i|\mathbf{x}_i;\lambda_k)$. They
proved that sufficient conditions for model identifiability hold under a suitable
assumption of Gaussian covariates [10]. We can include in LBRM a joint probability
Φ0 (x𝑖 𝑗 |𝛀 𝑘ℓ ) where 𝛀 𝑘ℓ = [𝝁 𝑘ℓ , 𝚺 𝑘ℓ ] to evaluate its impact in terms of clustering
and regression. Figure 4 presents the graphical model of LBRM and its extension.
The first experiments on real datasets give encouraging results.

Fig. 4 Graphical model of LBRM (left) and its extension (right).



5 Conclusion

Inspired by the flexibility of the latent block model (LBM), we proposed extending it
to tensor data aiming at both tasks: co-clustering and prediction. This model (LBRM)
gives rise to a variational EM algorithm for co-clustering and prediction referred to
as VEM-LBRM. This algorithm, which can be viewed as a co-clusterwise algorithm,
can easily deal with sparse data. Empirical results on synthetic data showed that
VEM-LBRM does give more encouraging results for clustering and regression than
some algorithms that are devoted to one or both tasks simultaneously. For future
work, we plan to develop the extension of LBRM and apply the proposed models for
the recommender system task.

Acknowledgements Our work is funded by the German Federal Ministry of Education and Re-
search under Grant Agreement Number 01IS19084F (XAPS).

References

1. Affeldt, S., Labiod, L., Nadif, M.: Regularized bi-directional co-clustering. Statistics and
Computing, 31(3), 1-17 (2021)
2. Agarwal, D., and Merugu, S.: Predictive discrete latent factor models for large scale dyadic
data. In: SIGKDD, pp. 26–35 (2007)
3. Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39(1),
1–22 (1977)
4. Deodhar, M., Ghosh, J.: A framework for simultaneous co-clustering and learning from
complex data. In: SIGKDD, pp. 250–259 (2007)
5. DeSarbo, W. S., and Cron, W. L.: A maximum likelihood methodology for clusterwise linear
regression. Journal of Classification, 5(2), 249–282 (1988)
6. Govaert, G., Nadif, M.: Clustering with block mixture models. Pattern Recognition, 36, 463-
473, (2003)
7. Govaert, G., Nadif, M.: An EM algorithm for the block mixture model. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 27(4), 643–647 (2005)
8. Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: Comparison of
different approaches. Computational Statistics & Data Analysis, 3233–3245 (2008)
9. Govaert, G., Nadif, M.: Co-clustering: Models, Algorithms and Applications. John Wiley &
Sons (2013)
10. Ingrassia, S., Minotti, S. C., Punzo, A.: Model-based clustering via linear cluster-weighted
models. Computational Statistics & Data Analysis, 71, 159–182 (2014)
11. Ingrassia, S., Minotti, S. C., Vittadini, G.: Local statistical modeling via a cluster-weighted
approach with elliptical distributions. In: Journal of Classification, 29(3), 363–401 (2012)
12. Neal, R. M., Hinton, G. E.: A view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in Graphical Models, pp. 355–368. Springer (1998)
13. Salah, A., Nadif, M.: Directional co-clustering. Advances in Data Analysis and Classification,
13(3), 591-620 (2019)
14. Späth, H.: Algorithm 39 clusterwise linear regression. Computing, 22(4), 367–373 (1979)
15. Steinley, D.: Properties of the Hubert–Arable Adjusted Rand Index. Psychological Methods,
9(3), 386 (2004)
16. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining mul-
tiple partitions. Journal of Machine Learning Research, 3, 583–617 (2002)
17. Vu, D., Aitkin, M.: Variational algorithms for biclustering models. Computational Statistics
& Data Analysis, 89, 12–24 (2015)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Using Clustering and Machine Learning
Methods to Provide Intelligent Grocery
Shopping Recommendations

Nail Chabane, Mohamed Achraf Bouaoune, Reda Amir Sofiane Tighilt, Bogdan
Mazoure, Nadia Tahiri, and Vladimir Makarenkov

Abstract Nowadays, grocery lists are part of the shopping habits of many customers.
With the popularity of e-commerce and the plethora of products and promotions avail-
able in online stores, it can become increasingly difficult for customers to identify
products that both satisfy their needs and represent the best deals overall. In this
paper, we present a grocery recommender system based on the use of traditional
machine learning methods aiming at assisting customers with creation of their gro-
cery lists on the MyGroceryTour platform which displays weekly grocery deals in
Canada. Our recommender system relies on the individual user purchase histories,
as well as the available products’ and stores’ features, to constitute intelligent weekly
grocery lists. The use of clustering prior to supervised machine learning methods
allowed us to identify customer profiles and reduce the choice of potential products
of interest for each customer, thus improving the prediction results. The highest
average F-score of 0.499 for the considered dataset of 826 Canadian customers was
obtained using the Random Forest prediction model which was compared to the
Decision Tree, Gradient Boosting Tree, XGBoost, Logistic Regression, Catboost,
Support Vector Machine and Naive Bayes models in our study.

Keywords: clustering, dimensionality reduction, grocery shopping recommenda-


tion, intelligent shopping list, machine learning, recommender systems

Nail Chabane · Mohamed Achraf Bouaoune · Reda Amir Sofiane Tighilt ·


Vladimir Makarenkov ( )
Université du Québec à Montreal, 405 Rue Sainte-Catherine Est, Montreal, Canada,
e-mail: [email protected]
e-mail: [email protected]
e-mail: [email protected];[email protected]
Bogdan Mazoure
McGill University and MILA - Quebec AI Institute, 845 Rue Sherbrooke O, Montreal, Canada,
e-mail: [email protected]
Nadia Tahiri
University of Sherbrooke, 2500 Bd de l’Université, Sherbrooke, Canada,
e-mail: [email protected]

© The Author(s) 2023 83


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_10

1 Introduction

Grocery shopping is a common activity that involves different factors such as budget
and impulse purchasing pressure [1]. Customers typically rely on a mental or digital
list to facilitate their grocery trips. Many of them show a favorable interest towards
tools and applications that help them manage their grocery lists, while keeping
them updated with special offers, coupons and promotions [2, 3]. Major retailers
throughout the world typically offer discounts on different products every week in
order to improve sales and attract new customers. This very common practice leads
to the fact that thousands of items go on special simultaneously across different
retailers at a given week. The resulting information overload often makes it difficult
for customers to quickly identify the deals that best suit their needs, which can become
a source of frustration [4]. To address this problem, many grocery stores have taken
advantage of the popularity of e-commerce to set up their own websites featuring
various functionalities, including Recommender Systems, to assist customers during
the shopping process.
Recommender Systems (RSs) [5] are tools and techniques that offer personalized
suggestions to users based on several parameters (e.g. their past behavior). RSs have
recently become a field of interest for researchers and retailers, as many e-commerce sites,
online book stores and streaming platforms have started to offer this service on their
websites (e.g. Amazon, Netflix and Spotify). Here, we recall some recent works in
this field. Faggioli et al. [6] used the popular Collaborative Filtering (CF) approach
to predict the customer’s next basket in a context of grocery shopping, taking into
account the recency parameter. When comparing their model with the CF baseline
models, Faggioli et al. observed a consistent improvement of their prediction results.
Che et al. [7] used attention-based recurrent neural networks to capture both inter-
and intra-basket relationships, thus modelling users’ long-term preferences and dynamic
short-term decisions.
Content-based recommendation has also proven efficient in the literature, as
demonstrated by Xia et al. [8] who proposed a tree-based model for coupons recom-
mendation. By processing their data with undersampling methods, the authors were
able to increase the estimated click rate from 1.20% to 7.80% as well as to improve
significantly the F-score results using Random Forest Classifier and the recall results
using XGBoost. Dou [9] presented a statistical model to predict whether a user will
buy or not buy an item using Yandex’s CatBoost method [10]. Dou relied on contex-
tual and temporal features as well as on some session features, such as the time of
visit of specific web pages, to demonstrate the efficiency of CatBoost in this context.
Finally, Tahiri et al. [11] used recurrent and feedforward neural networks (RNNs and
FFNs) in combination with non-negative matrix factorization and gradient boosting
trees to create intelligent weekly grocery baskets to be recommended to the users
of MyGroceryTour. Tahiri et al. considered features (different from those in our study) char-
acterizing the users of MyGroceryTour to provide their predictions, with the best
F-score results of 0.37 obtained from the augmented dataset.

2 Materials and Methods

2.1 Data Considered


In this section we describe the dataset obtained from MyGroceryTour website used
in our research. MyGroceryTour [11] is a Canadian grocery shopping website and
database available in both English and French languages. The main purpose of the
website is to present weekly specials offered by the major grocery retailers in Canada.
It allows users to display grocery products available in their living area, compare
their products over different stores as well as to build their grocery shopping baskets
based on the provided insights. MyGroceryTour users can easily archive and manage
their grocery lists and access them at any given time.
In this study, we considered 826 MyGroceryTour users with varying numbers of
grocery baskets (between 3 and 100 baskets were available per user). The grocery
baskets contained different products added by users when they were creating their
weekly shopping lists. In our recommender system (i.e current basket prediction
experiment), we have considered the following features:
• user_id : unique user identifier (numerical)
• basket_id : unique basket identifier (numerical)
• product_id : unique product identifier (numerical)
• category : category of the product (categorical)
• price : price of the product (numerical)
• special : discount on the product (in %) compared to regular price (numerical)
• distance_min : minimal distance between user’s home and the closest store where
the product was available (numerical)
• distance_mean : mean distance between user’s home and all stores where the
product was available (numerical)
• availability : availability of the product at different stores (binary)
In addition, we engineered the total_bought feature which represents, for each
product, the total number of times it has been bought over all users.

2.2 Data Normalization


Data normalization is an important data preprocessing step in both unsupervised and
supervised machine learning [12] as well as in data mining [13]. Prior to feeding the
data to our models we rescaled the available features using z-score standardization.
Thus, all rescaled features had the mean of 0 and the standard deviation of 1:
$$z(x_f) = \frac{x_f - \mu_f}{\sigma_f}, \qquad (1)$$

where 𝑥 𝑓 is the original value of the observation at feature 𝑓 , 𝜇 𝑓 is the mean and
𝜎 𝑓 is the standard deviation of 𝑓 .
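As a small illustration (the variable names are ours), the standardization of Eq. (1) can be applied column-wise as follows; scikit-learn's StandardScaler performs the same rescaling.

```python
import numpy as np

def zscore(X):
    # rescale every feature (column) to zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)
```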

2.3 Further Data Preprocessing Steps


In order to determine which weekly products could be recommended to a given
user we propose to classify them using both clustering (unsupervised learning)
and traditional supervised machine learning methods. The final recommendation is
obtained based on the availability of the products, the data on the products’ regular
prices and available discounts, as well as on the user’s shopping history. In our
context, the baskets contain only the products bought by the users. The information
about the other available products (not selected by the user at the moment he/she
organized his/her shopping basket) is also available on MyGroceryTour. It has been
used to create a large class of available items that were not bought by the user.
While we considered the items bought by a given user as positive feedback, we
regarded the items that were available to this user at the time of the order, but not
acquired by him/her, as a negative feedback. For an order of size 𝑃, if 𝑇 is the total
amount of items available to the user at the time of the order, the negative feedback
𝑁 for that order is 𝑁 = 𝑇 − 𝑃. In this context, 𝑁 usually represents thousands of
products, while 𝑃 is typically inferior to 50. This difference in size between positive
and negative feedback can lead to a situation of imbalanced training data and could
result in an important loss in performance. Similarly to Xia et al. [8], we applied an
undersampling method to balance our data instead of considering all of the available
disregarded items as the negative feedback.
To identify customer profiles and perform a preselection of products that are
susceptible to be of interest to a given user, we first carried out the clustering of the
normalized original dataset (the 𝐾-means [14] and DBSCAN [15] data partitioning
algorithms were used). Then, we limited the choice of the items offered to a given user
to the products purchased by the members of his/her cluster. By doing so, we managed
to reduce the amount of products which could be recommended to the user and thus
minimize possible classification mistakes. The clustering phase is detailed in the
Subsection 2.4. Then traditional machine learning methods were used to provide
the final weekly recommendation. The size 𝑆 of the weekly basket recommended
to a given user was equal to the mean size of his/her previous shopping baskets.
As the number of items to be recommended by the machine learning methods
was often greater than 𝑆, we retained as final recommendation the top 𝑆 items,
ranked according to the confidence score (i.e. the probability estimate for a given
observation, computed using the predict_proba function from the scikit-learn [16]
library).
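A minimal sketch of this ranking step is shown below. It assumes an already fitted binary scikit-learn classifier and a feature matrix with one row per candidate product; only the predict_proba call mentioned above is taken from the library, while the function and variable names are ours.

```python
import numpy as np

def recommend_top_s(model, X_candidates, candidate_ids, S):
    # probability of the positive class ("will be bought") for each candidate product
    scores = model.predict_proba(X_candidates)[:, 1]
    top = np.argsort(scores)[::-1][:S]          # indices of the S highest confidence scores
    return [candidate_ids[i] for i in top]
```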

2.4 Data Clustering


In this section, we present the steps we carried out to obtain the clusters of users. As
explanatory features used to generate clusters, we considered the mean prices and
mean specials of the products purchased by the user as well as a new feature, called
here the fidelity ratio 𝐹 𝑅𝑢 , which is meant to give insight on whether a given user 𝑢
has a favorite store where he/she makes most of his/her grocery purchases. 𝐹 𝑅𝑢 is
defined as follows:

$$FR_u = \frac{X_{max,u} - \frac{1}{n-1}\sum_{i=2}^{n} X_{i,u}}{X_{total,u}}, \qquad (2)$$
where 𝑋𝑚𝑎𝑥,𝑢 is the total number of products bought by user 𝑢 at the store where
he/she made most of his/her purchases, 𝑛 (𝑛 > 1) is the total number of stores visited by
user 𝑢, and $X_{total,u}$ ($X_{total,u} = X_{max,u} + \sum_{i=2}^{n} X_{i,u}$) is the total number of products
purchased by user 𝑢 over all stores he/she visited. A high fidelity ratio means that
user 𝑢 buys most of his/her products at the same store, whereas a low fidelity ratio
indicates that user 𝑢 buys his/her products at different stores. When user 𝑢 purchases
all of his/her products at the same store (𝑋𝑚𝑎𝑥,𝑢 = 𝑋𝑡𝑜𝑡 𝑎𝑙,𝑢 and 𝑛 = 1), the fidelity
ratio equals 1. It equals 0 when he/she purchases the same number of products at
different stores.
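The following sketch computes the fidelity ratio of Eq. (2) for one user from the per-store purchase counts; the single-store convention FR_u = 1 follows the description above, and the names are illustrative.

```python
import numpy as np

def fidelity_ratio(counts):
    # counts: number of products bought by the user in each visited store
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    if counts.size == 1:                        # all purchases in a single store
        return 1.0
    x_max, rest = counts[0], counts[1:]
    return (x_max - rest.mean()) / counts.sum()

# all purchases in one store -> 1.0; equal purchases in two stores -> 0.0
print(fidelity_ratio([30]), fidelity_ratio([10, 10]))
```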
The 𝐾-means [14] and DBSCAN [15] algorithms were used to perform clustering.
Here we present the results of DBSCAN, as the clusters provided by DBSCAN
had less entity overlap than those provided by 𝐾-means. The main advantage of
DBSCAN is that this density-based algorithm is able to capture clusters of any
shape.

Fig. 1 Davies-Bouldin cluster validity index variation with respect to the number of clusters.

We used the Davies-Bouldin (DB) [17] cluster validity index to determine the
number of clusters in our dataset. The Davies-Bouldin index is the average similarity
between each cluster 𝐶𝑖 for 𝑖 = 1, ..., 𝑘 and its most similar counterpart 𝐶 𝑗 . It is
calculated as follows:
$$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{i \neq j} R_{ij}, \qquad (3)$$

where 𝑅𝑖 𝑗 is the similarity measure between clusters calculated as (𝑑𝑖 + 𝑑 𝑗 )/𝛿𝑖 𝑗 ,


where 𝑑𝑖 (𝑑 𝑗 ) is the mean distance between objects of cluster 𝐶𝑖 (𝐶 𝑗 ) and the
cluster centroid and 𝛿𝑖 𝑗 is the distance between the centroids of clusters 𝐶𝑖 and 𝐶 𝑗 .
Figure 1 illustrates the variation of the Davies-Bouldin cluster validity index
whose lowest (i.e. best) value was reached for our dataset with 6 clusters. The
resulting clusters are represented in Figure 2. After performing the data clustering,
we applied the t-SNE [18] dimensionality reduction method for data visualisation
purposes. Since t-SNE is known to preserve the local structure of the data but not
the global one, we used the PCA initialization parameter to mitigate this issue.
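A schematic version of this clustering step is given below. The grid of DBSCAN eps values and the user feature matrix U (mean price, mean special, fidelity ratio per user) are assumptions made for illustration; davies_bouldin_score and TSNE with PCA initialization are the scikit-learn counterparts of the index and embedding described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score
from sklearn.manifold import TSNE

def cluster_users(U, eps_grid=(0.3, 0.5, 0.7, 1.0), min_samples=5):
    best = None
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(U)
        kept = labels != -1                      # ignore points flagged as noise
        if len(set(labels[kept])) < 2:
            continue
        score = davies_bouldin_score(U[kept], labels[kept])   # lower is better
        if best is None or score < best[0]:
            best = (score, labels)
    return best                                  # (best DB score, cluster labels)

def embed_2d(U):
    # 2-D t-SNE embedding (PCA initialization), used only to display the clusters
    return TSNE(n_components=2, init="pca").fit_transform(U)
```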

Fig. 2 Clustering results : Clustering obtained with DBSCAN with the best number of clusters
according to the Davies-Bouldin index. Data reduction was performed using t-SNE. The 6 clusters
of customers found by DBSCAN are represented by different symbols.

We have noticed that the users in Cluster 1 (see Fig. 2) are fairly sensitive to
specials and have a high fidelity score, the users in Cluster 2 mostly purchase
products on special in different stores, the users in Cluster 3 seem to be sensitive
to the total price of their shopping baskets, Cluster 4 includes the users who are
sensitive to specials but have a low fidelity score, Cluster 5 includes the users who
are not very attracted by specials but are rather loyal to their favorite store(s), and
the users in Cluster 6 tend to buy products on special and have high fidelity scores.

3 Application of Supervised Machine Learning Methods

To predict the products to be recommended for the current weekly basket, we used
the following supervised machine learning methods: Decision Tree, Random Forest,

Gradient Boosting Tree, XGBoost, Logistic Regression, Catboost, Support Vector


Machine and Naive Bayes. These methods were used through their scikit-learn
implementations [16]. Due to the lack of large datasets we did not use deep learning
models in our study. We decided to use these classical machine learning methods
because they are usually recommended for smaller datasets, contrary to
their deep learning counterparts. Also, deep learning algorithms usually do not properly
handle the mixed types of features present in our data. Most of the methods we used
are ensemble methods, i.e. they use multiple replicates to reduce the variance.
The F-score results provided by each method without (using all products available)
and with clustering (using only the products purchased by the cluster members) are
presented in Table 1.
As shown in Table 1, Random Forest outperformed the other competing methods
without and with data clustering, providing the average F-scores of 0.494 and 0.499
(obtained over all users), respectively. Tree-based models relying on gradient boost-
ing performed relatively well and could possibly give better results with a different
data processing. We can also notice that all the methods, except CatBoost, benefited
from the data clustering process.

Table 1 F-scores provided by ML methods without and with clustering of MyGroceryTour users.
Machine learning methods Results without clustering Results with clustering

CatBoost 0.438 0.438


Decision Tree 0.463 0.468
Gradient Boosting Tree 0.488 0.495
Logistic Regression 0.474 0.478
Naive Bayes 0.433 0.436
Random Forest 0.494 0.499
SVM-RBF 0.392 0.397
XGBoost 0.476 0.481
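For reference, a comparison harness of this kind could look as follows. It is our own sketch restricted to models shipped with scikit-learn (CatBoost and XGBoost are therefore omitted), and the feature matrix X and purchase labels y are assumed to be prepared as described in Section 2.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def compare_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    models = {
        "Random Forest": RandomForestClassifier(random_state=0),
        "Gradient Boosting Tree": GradientBoostingClassifier(random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": GaussianNB(),
    }
    # F-score of each fitted model on the held-out products
    return {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```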

4 Conclusion

In this paper, we presented a novel recommender system that is intended to predict


the content of the customer’s weekly basket depending on his/her purchase history.
Our system is also able to predict the store(s) where the purchase(s) will take place.
The clustering step allowed us to identify customer profiles and to improve the F-
score result for every tested machine learning model, except CatBoost. Using our
methodology and the new data available on MyGroceryTour, we were able to improve
the F-score performance by the margin of 0.129, compared to the results obtained
by Tahiri et al. [11]. Our model is able to predict products that will be purchased
again or acquired for the first time by a given user, but it is not yet able to predict the
optimal quantity for each product to be bought. Another important issue is how to
provide plausible recommendations for customers without shopping history (i.e. the
cold start problem). We will tackle these important issues in our future work.

References

1. Vincent-Wayne, M., Aylott, R.: An exploratory study of grocery shopping stressors. Int. J.
Retail. Distrib. Manag. 26, 362–373 (1998)
2. Newcomb, E., Pashley, T., Stasko, J.: Mobile computing in the retail arena. In: Proceedings
of the SIGCHI conference on Human factors in computing systems, pp. 337-344. Association
for Computing Machinery, New York (2003)
3. Sourav, B., Floréen, P., Forsblom, A., Hemminki, S., Myllymäki, P., Nurmi, P., Pulkkinen,
T., Salovaara, A.: An Intelligent Mobile Grocery Assistant. In: 2012 Eighth International
Conference on Intelligent Environments, pp. 165-172. IEEE, Guanajuato (2012)
4. Park, Y., Chang, K.: Individual and group behavior-based customer profile model for person-
alized product recommendation. Expert Systems with Applications 36(2), 1932-1939 (2009)
5. Ricci, F., Rokach, L., Shapira, B.: Recommender Systems: Introduction and Challenges. In:
Recommender Systems Handbook, pp. 1-34. Springer, Boston (2015)
6. Faggioli, G., Mirko P., Fabio A.: Recency aware collaborative filtering for next basket recom-
mendation. In: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and
Personalization, pp. 80-87. Association for Computing Machinery, New York (2020)
7. Che et al.: Inter-basket and intra-basket adaptive attention network for next basket recommen-
dation. IEEE Access 7, 80644-80650 (2019)
8. Xia, Y. Giuseppe, D. F., Shikhar, V., Ankur, D.: A content-based recommender system for e-
commerce offers and coupons. In: Proceedings of the SIGIR 2017 eCom workshop. eCOM@
SIGIR, Tokyo (2017)
9. Dou, X.: Online purchase behavior prediction and analysis using ensemble learning. In : 2020
IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA),
pp. 532-536. IEEE (2020)
10. Prokhorenkova, L., Gleb G., Aleksandr V., Anna V. D., Andrey G.: CatBoost: unbiased
boosting with categorical features (2017) Available via arXiv.
11. Tahiri, N., Mazoure, B. and Makarenkov, V.: An intelligent shopping list based on the appli-
cation of partitioning and machine learning algorithms. In: Proceedings of the 18th Python in
Science Conference (SCIPY 2019), pp. 85-92. Austin, Texas (2019)
12. Kotsiantis, S. B., Kanellopoulos, D., Pintelas, P. E.: Data preprocessing for supervised leaning.
Int. J. Comput. Sci. 2, 111-117 (2006)
13. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Cham,
Switzerland (2015)
14. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol.
1, no. 14, pp. 281-297 (1967)
15. Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd International
Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 226–231. AAAI
Press, Portland, Oregon (1996)
16. Pedregosa et al.: Scikit-learn: Machine Learning in Python. JMLR 12, 2825-2830 (2011)
17. Davies, D. L., Bouldin, D. W.: A Cluster Separation Measure. In: IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224-227 (1979)
18. Van der Maaten, L. J. P., Hinton, G.: Visualizing High-Dimensional Data Using t-SNE. J.
Mach. Learn. Res. 9, 2579-2605 (2008)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
COVID-19 Pandemic: a Methodological Model
for the Analysis of Government’s Preventing
Measures and Health Data Records

Theodore Chadjipadelis and Sofia Magopoulou

Abstract The study aims to investigate the associations between the government’s
response measures during the COVID-19 pandemic and weekly incidence data (pos-
itivity rate, mortality rate and testing rate) in Greece. The study focuses on the
period from the detection of the first case in the country (26th February 2020) to the
first week of 2022 (08th January 2022). Data analysis was based on Correspondence
Analysis on a fuzzy-coded contingency table, followed by Hierarchical Cluster Anal-
ysis (HCA) on the factor scores. Results revealed distinct time periods during which
interesting interactions took place between control measures and incidence data.

Keywords: hierarchical cluster analysis, correspondence analysis, COVID-19,


evidence-based policy making

1 Introduction

The present study focuses on the period of the COVID-19 pandemic in Greece, from
the detection of the first case of COVID-19 to the first week of 2022. This period
can be divided into five distinct phases. The first phase extends from the beginning
of 2020 until the first lockdown, i.e., from the first case reported in Greece until
the end of the first quarantine period in May 2020. The second phase concerns the
interim period from June to October 2020, when the pandemic indices improved,
and policies were loosened for the opening of tourism. The third phase concerns the
second lockdown and the evolution of the pandemic in the country from November
2020 to April 2021, when the first vaccination period of the adult population took
place. The fourth phase includes the interim period from May 2021 to October

Theodore Chadjipadelis ( )
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
Sofia Magopoulou
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]

© The Author(s) 2023 93


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_11

2021, where a general stabilization of the number of cases occurred, while the last
period refers to a significant increase in the number of cases from November 2021
to January 2022.
Overall, from March 2020 to January 2022, a total of 1.79 million cases of COVID-
19 and a total of 22,635 deaths were recorded in Greece (Figure 1). Vaccination
coverage as of January 2022 is over 65% of the country’s population, i.e., 7,241,468
fully vaccinated citizens.

Fig. 1 Record of cases of COVID-19 in Greece (March 2020-January 2022).

In this study, a combination of multivariate data analysis methods was employed


to analyze COVID-19-related data so as to assess the quality of decision-making
outputs during the crisis and improve evidence-based decision-making processes.
Section 2 presents the methodology and describes the data sources and the data
analysis workflow. Section 3 presents the study results, and Section 4 discusses them,
proposes methodological tools, and presents the conclusions of the paper.

2 Methodology

2.1 Data

For the study purposes, data were obtained from the Oxford Covid-19 Government
Response Tracker (OxCGRT) and were combined with self-collected Covid-19 data
for Greece [3], updated daily in Greek. The Oxford Covid-19 Government Response
Tracker (OxCGRT) collects publicly available information reflecting government re-
sponse from 180 countries since 1 January 2020 [4]. The tracker is based on data for
23 indicators. In this study, two groups of indicators were considered: Containment &
Closure and Health Systems in the case of Greece. The first group of indicators refers
to “collective” level policies and measures, such as school closures and restriction in

mobility, while the second refers to “individual” level policies and measures, such as
testing and vaccination. Specifically, the collective level indicators refer to policies
taken by the government and reflect on the society at a collective level: school
closing, workplace closing, cancelation of public events, restrictions on gathering,
closure of public transport, stay at home requirements, internal movement restric-
tions and international travel controls. The health system policies primarily touch
upon the individual level and specifically refer to: public information campaigns,
testing, contact tracing, healthcare facilities, vaccines’ investments, facial coverings,
vaccination and protection of the elderly people. All collective-level indicators (C1 to
C8) were summed to yield a total score (ranging from 0 to 16). Similarly, individual-
level indicators (H1 to H3 and H6 to H8) were summed to compute a total score
(ranging from 0 to 12).
The self-collected data refer to positive cases, number of Covid-19-related deaths,
number of tests and total number of vaccinations administered. These data have been
recorded daily since March 2020 from public announcements by official and verified
sources. A total of 94 time points were considered in the present study, corresponding
to weekly data (Monday was used as a reference). Three quantitative indicators were
derived, a positivity index (#cases / #tests), a mortality index (#deaths / #cases) and a
testing index (#tests / #people). The number of vaccinations is not used in the present
study because the vaccination process began in January 2021 and the administration
of the booster dose began in September 2021. The final data set consisted of five
indicators: two ordinal total scores, and three quantitative indices.
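A sketch of how the three weekly indices could be derived from the daily records with pandas is given below; the column names (date, cases, deaths, tests), the population argument and the Monday-anchored resampling rule are our assumptions, not the authors' code.

```python
import pandas as pd

def weekly_indices(daily: pd.DataFrame, population: int) -> pd.DataFrame:
    # daily: one row per day with columns date (datetime), cases, deaths, tests
    weekly = (daily.set_index("date")
                   .resample("W-MON", label="left", closed="left")
                   .sum())
    weekly["positivity"] = weekly["cases"] / weekly["tests"]
    weekly["mortality"] = weekly["deaths"] / weekly["cases"]
    weekly["testing"] = weekly["tests"] / population
    return weekly[["positivity", "mortality", "testing"]]
```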

2.2 Data Analysis

A four-step data analysis strategy was adopted. In the first step, the three quantitative
variables (positivity rate, mortality rate and testing rate) were transformed into
ordinal variables, via a method used in [7] (see Step 1): transformation of continuous
variables into ordinal categorical variables with minimum information loss. Three
ordinal variables were derived. In the second step, the five ordinal variables (i.e., the
three recoded variables and the two ordinal total scores), were fuzzy-coded into three
categories each, using the barycentric coding scheme proposed in [7]. This scheme
has been recently evaluated in the context of hierarchical clustering in [7] and was
applied with the DIAS Excel add-in [6]. Barycentric coding allows us to convert an
m-point ordinal variable into an n-point fuzzy-coded variable [6, 7]. In other words,
the transformation of the three quantitative variables into ordinal variables resulted
in a generalized 0-1 matrix (fuzzy-coded matrix), where for each variable we obtain
the estimated probability for each category. A drawback of the proposed approach is
that the ordinal information in the 5 ordinal variables is lost.
The third step involved the application of Correspondence Analysis (CA) on
the fuzzy-coded table with the 94 weeks as rows and the fifteen fuzzy categories
as columns (see [1] for a similar approach). The number of significant axes was
determined based on percentage of inertia explained and the significant points on each

axis were determined based on the values of two statistics that accompany standard
CA output; quality of representation (COR) greater than 200 and contribution (CTR)
greater than 1000/(n + 1), where n is the total number of categories (i.e., 15 in our
case). In the final step, Hierarchical Cluster Analysis (HCA) using Benzecri’s chi-
squared distance and Ward’s linkage criterion [2, 8] was employed to cluster the
94 points (weeks) on the CA axes obtained from the previous step. The number of
clusters was determined based on the empirical criterion of the change in the ratio of
between-cluster inertia to total inertia when moving from a partition with r clusters
to a partition with r − 1 clusters [8]. Lastly, we interpret the clusters after determining
the contribution of each indicator to each cluster. All analyses were conducted with
the M.A.D. [Méthodes de l’Analyse des Données] software [5].
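Since the analyses were run with the M.A.D. software, the following is only a schematic NumPy/SciPy sketch of steps three and four: correspondence analysis through the SVD of the standardized residuals of the fuzzy-coded table, followed by Ward clustering of the 94 week points on the retained axes. Euclidean distances on the CA coordinates are used here as a stand-in for the chi-squared distance and Ward criterion of [2, 8].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ca_row_scores(F, n_axes=4):
    # F: 94 x 15 fuzzy-coded table (weeks x fuzzy categories)
    P = F / F.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)                    # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))     # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv**2 / (sv**2).sum()                        # share of inertia per axis
    rows = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]   # principal coordinates
    return rows, inertia[:n_axes]

def ward_clusters(row_scores, n_clusters=7):
    Z = linkage(row_scores, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```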

3 Results

Correspondence Analysis resulted in four significant axes, which explain 74.91% of


the total inertia (Figure 2). For each axis, we describe the main contrast between
groups of categories based on their coordinates, COR and CTR values (Figure 3).
“Low and moderate mortality rates” and “high factor testing rates” define a pole on
the 1st axis, which is opposed to “average and high levels of “individual” measures”.
On the second axis, “low positivity rate” and “average levels of collective measures”
define a pole, while “average and high positivity rate” and “high levels of collective
measures” define the opposite pole. The third axis is characterized by “moderate
and high mortality rate”, “high levels of collective measures” and “average levels
of individual measures” that are opposed to “average levels of collective measures”.
On the fourth axis, “average levels of collective measures” are opposed to “average
testing rate” and “high levels of collective measures”.

Fig. 2 Explained inertia by axis.



Fig. 3 Category coordinates on the four CA axes (#G), quality of representation (COR) and
contribution (CTR). COR values greater than 200 and CTR values greater than 1000 / 16 = 62.5
are shown in yellow. Positive coordinates are shown in green and negative in pink.

Hierarchical Cluster Analysis on the factor scores resulted in seven clusters using
the empirical criterion for cluster determination (see Section 2.2). The corresponding
dendrogram is shown in Figure 4. The seven nodes in the figure that correspond to
the seven clusters are 182, 181, 175, 177, 171, 133 and 179. Cluster content
reflects the different periods (phases) presented in the introductory section.

Fig. 4 Dendrogram of the HCA.



The first cluster (182) combines data points from March 2020, the onset of the
pandemic, with data points from a period following the summer of 2020 (October
and November). This cluster is characterized by high positivity rate, low testing
rate, high levels of “collective” measures (containment & closure) and low levels of
“individual” measures (health system). The second cluster (181) contains data points
from April and May 2020 and is characterized by low positivity rate, average to high
mortality rate, low testing rate, high levels of “collective” measures (containment &
closure) and average levels of “individual” measures (health system). The third clus-
ter (175) combines summer months of 2020 and 2021. This cluster is characterized
by low positivity rate, low testing rate and average levels of “collective” measures
(containment & closure). The fourth cluster (177) marks the period of December
2020 and the period of spring of 2021, with average positivity rate and high levels of
“collective” measures (containment & closure). The fifth cluster (171) refers to the
period from December 2020 to February 2021, but also includes August 2021, with
high levels of “collective” measures (containment & closure). The sixth cluster (133)
refers to the period following the summer of 2021 (September and October 2021).
In this cluster, average positivity rates were observed but also strict containment and
closure measures.
Lastly, the seventh cluster (179) refers to November and December 2021, including
also January 2022, with high positivity and high testing rates, while high levels of
containment and closure and health system measures were observed. Figure 5 shows
the contributions of each indicator in each cluster.

Fig. 5 Cluster description (contribution values of the indicators in each cluster - node).

4 Discussion

Based on the study results, we can argue that, when it comes to measures and
real time data following a situation such as the pandemic, “the chicken and egg”
dilemma arises. The question is whether “collective” and “individual” measures
affect daily incidence data or the inverse (i.e., that the daily data lead to measures).
We conclude that in fact the two should be perceived as working in conjunction
and not independently from one another. The analysis showed that lower positivity
rate is accompanied by average levels of measures from the government at both
the “individual” and the “collective” level. Furthermore, higher positivity rate is
accompanied by higher levels of measures, as a response. With regard to mortality
rate, we observed that higher mortality invokes higher levels of “collective” measures
and average levels of “individual” measures, whereas average levels of “collective”
measures are associated with higher mortality rate.
It is therefore evident that when it comes to decision making in crisis situations, a
systematic collection, analysis and use of data is linked to more effective government
response overall. Therefore, evidence-based policy making should be linked to crisis
management. This paper presents a first attempt to capture an ongoing phenomenon;
it is therefore crucial that the collection and analysis of data be continued until the
end of the phenomenon.

References

1. Aşan, Z., Greenacre, M.: Biplots of fuzzy coded data. Fuzzy Sets and Systems, 183(1), 57–71
(2011)
2. Benzècri, J. P.: L’Analyse des Données. 2. L’Analyse des Correspondances. Dunod, Paris
(1973)
3. Chadjipadelis, T.: Facebook profile (2022).
https://www.facebook.com/theodore.chadjipadelis
4. Hale, T., Petherick, A., Phillips, T., Webster, S.: Variation in government responses to COVID-
19. Blavatnik School of Government Working Paper, 31, 2020-11 (2020)
5. Karapistolis, D.: Software Method of Data Analysis MAD. (2010)
http://www.pylimad.gr/
6. Markos, A., Moschidis, O., Chadjipadelis, T.: Hierarchical clustering of mixed-type data based
on barycentric coding (2022) https://arxiv.org/submit/4142768
7. Moschidis, O., Chadjipadelis, T.: A method for transforming ordinal variables. In: Palumbo,
F., Montanari, A., Vichi, M. (eds) Data Science, pp. 285-294. Studies in Classification, Data
Analysis, and Knowledge Organization. Springer, Cham. (2017)
https://doi.org/10.1007/978-3-319-55723-6\_22
8. Papadimitriou, G., Florou, G.: Contribution of the Euclidean and chi-square metrics to de-
termining the most ideal clustering in ascending hierarchy (in Greek). In Annals in Honor of
Professor I. Liakis, 546-581. University of Macedonia, Thessaloniki (1996)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
pcTVI: Parallel MDP Solver Using a
Decomposition into Independent Chains

Jaël Champagne Gareau, Éric Beaudry, and Vladimir Makarenkov

Abstract Markov Decision Processes (MDPs) are useful to solve real-world proba-
bilistic planning problems. However, finding an optimal solution in an MDP can take
an unreasonable amount of time when the number of states in the MDP is large. In
this paper, we present a way to decompose an MDP into Strongly Connected Com-
ponents (SCCs) and to find dependency chains for these SCCs. We then propose a
variant of the Topological Value Iteration (TVI) algorithm, called parallel chained
TVI (pcTVI), which is able to solve independent chains of SCCs in parallel lever-
aging modern multicore computer architectures. The performance of our algorithm
was measured by comparing it to the baseline TVI algorithm on a new probabilistic
planning domain introduced in this study. Our pcTVI algorithm led to a speedup
factor of 20, compared to traditional TVI (on a computer having 32 cores).

Keywords: Markov decision process, automated planning, strongly connected components, dependency chains, parallel computing

1 Introduction

Automated planning is a branch of Artificial Intelligence (AI) aiming at finding


optimal plans to achieve goals. One example of problems studied in automated
planning is the electric vehicle path-planning problem [1]. Planning problems with
non-deterministic actions are known to be much harder to solve. Markov Decision

Jaël Champagne Gareau ( )


Université du Québec à Montréal, Canada, e-mail: [email protected]
Éric Beaudry
Université du Québec à Montréal, Canada, e-mail: [email protected]
Vladimir Makarenkov
Université du Québec à Montréal, Canada, e-mail: [email protected]

© The Author(s) 2023


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_12

Processes (MDPs) are generally used to solve such problems leading to probabilistic
models of applicable actions [2].
In probabilistic planning, a solution is generally a policy, i.e., a mapping specifying
which action should be executed in each observed state to achieve an objective.
Usually, dynamic programming algorithms such as Value Iteration (VI) are used to
find an optimal policy [3]. Since VI is time-expensive, many improvements have
been proposed to find an optimal policy faster, using for example the Topological
Value Iteration (TVI) algorithm [4]. However, very large domains often remain out
of reach. One unexplored way to reduce the computation time of TVI is by taking
advantage of the parallel architecture of modern computers and by decomposing an
MDP into independent parts which could be solved concurrently.
In this paper, we show that state-of-the-art MDP planners such as TVI can run
an order of magnitude faster when considering task-level parallelism of modern
computers. Our main contributions are as follows:
• An improved version of the TVI algorithm, parallel-chained TVI (pcTVI), which
decomposes MDPs into independent chains of strongly connected components
and solves them concurrently.
• A new parametric planning domain, chained-MDP, and an evaluation of pcTVI’s
performance on many instances of this domain compared to the VI, LRTDP [5]
and TVI algorithms.

2 Related Work

Many MDP solvers are based on the Value Iteration (VI) algorithm [3], or more
precisely on asynchronous variants of VI. In asynchronous VI, MDP states can be
backed up in any order and do not need to be considered the same number of times.
One way to take advantage of this is by assigning a priority to every state and by
considering them in priority order.
Several state-of-the-art MDP algorithms have been proposed to increase the speed
of computation. Many of them are able to focus on the most promising parts of MDP
through heuristic search algorithms such as LRTDP [5] or LAO* [6]. Some other
MDP algorithms use partitioning methods to decompose the state-space in smaller
parts. For example, the P3VI (Partitioned, Prioritized, Parallel Value Iteration) al-
gorithm partitions the state-space, uses a priority metric to order the partitions in an
approximate best solving order, and solves them in parallel [7]. The biggest disad-
vantage of P3VI is that the partitioning is done on a case-by-case basis depending on
the planning domain, i.e., P3VI does not include a general state-space decomposition
method. The inter-process communication between the solving threads also incurs
an overhead on the computation time. The more recent TVI (Topological Value Iter-
ation) algorithm [4] also decomposes the state-space, but does it by considering the
topological structure of the underlying graph of the MDP, making it more general
than P3VI. Unfortunately, to the best of our knowledge, no parallel version of TVI
has been proposed in the literature.

3 Problem Definition

There exist different types of MDP, including Finite-Horizon MDP, Infinite-Horizon


MDP and Stochastic Shortest Path MDP (SSP-MDP) [2]. The first two of them can
be viewed as special cases of SSP-MDP [8]. In this work, we focus on SSP-MDPs,
which we describe formally in Definition 1 below.

Definition 1 A Stochastic Shortest Path MDP (SSP-MDP) is given by a tuple
(𝑆, 𝐴, 𝑇, 𝐶, 𝐺), where:
• 𝑆 is a finite set of states;
• 𝐴 is a finite set of actions;
• 𝑇 : 𝑆 × 𝐴 × 𝑆 → [0, 1] is a transition function, where 𝑇 (𝑠, 𝑎, 𝑠′) is the probability
of reaching state 𝑠′ when applying action 𝑎 while in state 𝑠;
• 𝐶 : 𝑆 × 𝐴 → R+ is a cost function, where 𝐶 (𝑠, 𝑎) gives the cost of applying the
action 𝑎 while in state 𝑠;
• 𝐺 ⊆ 𝑆 is the set of goal states (which can be assumed to be sink states).

We generally search for a policy 𝜋 : 𝑆 → 𝐴 that tells us which action should be
executed at each state, such that an execution following the actions given by 𝜋 until
a goal is reached has a minimal expected cost. This expected cost is given by a value
function 𝑉 𝜋 : 𝑆 → R. The Bellman Optimality Equations are a system of equations
satisfied by any optimal policy.

Definition 2 The Bellman Optimality Equations are the following:

    𝑉 (𝑠) = 0                                                                 if 𝑠 ∈ 𝐺,
    𝑉 (𝑠) = min_{𝑎 ∈ 𝐴} [ 𝐶 (𝑠, 𝑎) + ∑_{𝑠′ ∈ 𝑆} 𝑇 (𝑠, 𝑎, 𝑠′) 𝑉 (𝑠′) ]          otherwise.

The expression between square brackets is called the Q-value of a state-action pair:

    𝑄 (𝑠, 𝑎) = 𝐶 (𝑠, 𝑎) + ∑_{𝑠′ ∈ 𝑆} 𝑇 (𝑠, 𝑎, 𝑠′) 𝑉 (𝑠′).

When an optimal value function 𝑉 ★ has been computed, an optimal policy 𝜋★
can be found greedily:

    𝜋★ (𝑠) = argmin_{𝑎 ∈ 𝐴} 𝑄★ (𝑠, 𝑎).

Most MDP solvers are based on dynamic programming algorithms like Value
Iteration (VI), which update iteratively an arbitrarily initialized value function until
convergence with a given precision 𝜖. In the worst case, VI needs to do |𝑆| sweeps of
the state space, where one sweep consists in updating the value estimate of every state
using the Bellman Optimality Equations. Hence, the number of state updates (called
a backup) is O(|𝑆|²). When the MDP is acyclic, most of these backups are wasteful,
since the MDP can in this situation be solved using only |𝑆| backups (ordered in
reverse topological order), thus allowing one to find an optimal policy in O(|𝑆|) [8].
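
For illustration, a minimal value iteration sketch for such an SSP-MDP could look as follows;
the dictionary-based encoding of 𝑇, 𝐶 and 𝐺 is a hypothetical convenience chosen for brevity,
not the data structure used in the implementation evaluated later in this paper.

# Minimal value iteration sketch for an SSP-MDP (illustrative only).
# A[s] lists the applicable actions, C[s][a] is the action cost,
# T[s][a] is a list of (next_state, probability) pairs, G is the set of goal states.
def value_iteration(S, A, T, C, G, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        residual = 0.0
        for s in S:
            if s in G:
                continue
            # Bellman backup: V(s) = min_a [ C(s,a) + sum_{s'} T(s,a,s') V(s') ]
            q_best = min(C[s][a] + sum(p * V[sn] for sn, p in T[s][a]) for a in A[s])
            residual = max(residual, abs(q_best - V[s]))
            V[s] = q_best
        if residual < eps:
            return V

def greedy_policy(S, A, T, C, G, V):
    # pi*(s) = argmin_a Q*(s, a)
    return {s: min(A[s], key=lambda a: C[s][a] + sum(p * V[sn] for sn, p in T[s][a]))
            for s in S if s not in G}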

4 Parallel-chained TVI

In this section, we describe an improvement to the TVI algorithm, named pcTVI
(Parallel-Chained Topological Value Iteration), which is able to solve an MDP in
parallel (as P3VI). pcTVI uses the decomposition proposed by TVI, known to give
good performance on many planning domains. We start by summarizing how the
original TVI algorithm works.
First, TVI uses Kosaraju’s graph algorithm on a given MDP to find the strongly
connected components (SCCs) of its graphical structure (the graph corresponding
to its all-outcomes determinization). The SCCs are found by Kosaraju’s algorithm
in reverse topological order, which means that for every 𝑖 < 𝑗, there is no path
from a state in the 𝑖 th SCC to a state in the 𝑗 th SCC. This property ensures that
every SCC can be solved separately by VI sweeps if previous SCCs (according to
the reverse topological order) have already been solved. The second step of TVI is
thus to solve every SCC one by one in that order. Since TVI divides the MDP in
multiple subparts, it maximizes the usefulness of every state backup by ensuring
that only useful information (i.e., converged state values) is propagated through the
state-space.
Unfortunately, TVI can only solve one SCC at a time. Since modern computers
have many computing units (cores) which can work in parallel, we could theoretically
solve many SCCs in parallel to greatly reduce computation time. Instead of choosing
SCCs to solve in parallel arbitrarily or using a priority metric (as in P3VI), which
incur a computational overhead to propagate the values between the threads, we
want to consider their topological order (as in TVI) to minimize redundant or useless
computations. One way to share the work between the processes is to find independent
chains of SCCs which can be solved in parallel. The advantage of independent chains
is that no coordination and communication is needed between the SCCs, which both
removes some running-time overhead and simplifies the implementation.
The Parallel-Chained TVI algorithm we propose (Algorithm 1) works as follows.
First, we find the graph 𝐺 corresponding to the graphical structure of the MDP,
decompose it into SCCs, and find the reverse topological order of the SCCs (as in
TVI, but we use Tarjan’s algorithm instead of Kosaraju’s algorithm since it is about
twice as fast). We then build the condensation of the graph 𝐺, i.e., the graph 𝐺 𝑐
whose vertices are SCCs of 𝐺, where an edge is present between two vertices 𝑠𝑐𝑐 1
and 𝑠𝑐𝑐 2 if there exists an edge in 𝐺 between a state 𝑠1 ∈ 𝑠𝑐𝑐 1 and a state 𝑠2 ∈ 𝑠𝑐𝑐 2 .
We also store the reversed edges in 𝐺 𝑐 and a counter 𝑐 𝑠𝑐𝑐 on every vertex 𝑠𝑐𝑐 which
indicates how many incoming neighbors have not yet been computed. We use this
(usually small) graph 𝐺 𝑐 to detect which SCCs are ready to be considered (the SCCs
whose incoming neighbors have all been determined with precision 𝜖, i.e., the SCCs
whose associated counter 𝑐 𝑠𝑐𝑐 is 0). When a new SCC is ready, it is inserted into a
work queue from which the waiting threads acquire their next task.

Algorithm 1 Parallel-Chained Topological Value Iteration


1: procedure pcTVI(𝑀 : MDP, 𝑡: Number of threads)
2: ⊲ Find the SCCs of 𝑀
3: 𝐺 ← Graph( 𝑀 ) ⊲ 𝐺 implicitly shares the same data structures as 𝑀
4: 𝑆𝐶𝐶𝑠 ← Tarjan(𝐺) ⊲ SCCs are found in reverse topological order
5:
6: ⊲ Build the graph of SCCs of 𝐺
7: 𝐺𝑐 ← GraphCondensation(𝐺, 𝑆𝐶𝐶𝑠)
8:
9: ⊲ Solve in parallel independent SCCs
10: 𝑃𝑜𝑜𝑙 ← CreateThreadPool(𝑡) ⊲ Create 𝑡 threads
11: 𝑉 ← NewValueFunction() ⊲ Arbitrarily initialized; Shared by all threads
12: 𝑄 ← CreateQueue() ⊲ Shared by all threads
13: Insert(𝑄, Head(𝑆𝐶𝐶𝑠)) ⊲ The goal SCC is inserted in the queue
14: while NotEmpty(𝑄) do ⊲ Only one thread runs this loop
15: 𝑠𝑐𝑐 ← ExtractNextItem(𝑄)
16: for all 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ∈ Neighbors(𝑠𝑐𝑐) do
17: Decrement NumIncomingNeighbors(𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 )
18: if NumIncomingNeighbors(𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ) = 0 then
19: AssignTaskToAvailableThread( 𝑃𝑜𝑜𝑙, PartialVI( 𝑀 , 𝑉 , 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ))
20: Push(𝑄, 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ) ⊲ 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 is ready to be considered next
21: end if
22: end for
23: end while
24:
25: ⊲ Compute and return an optimal policy using the computed value function
26: Π ← GreedyPolicy(𝑉 )
27: return Π
28: end procedure
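
As a concrete illustration of the scheduling logic of Algorithm 1, the sketch below realizes the
dependency counters with a thread pool. The names (partial_vi, neighbors, n_pending) and the
queue-free formulation are illustrative assumptions rather than the authors' C++ code; note also
that in CPython such threads only yield a real speedup when partial_vi releases the GIL (e.g.,
runs in native code), so this shows the scheduling structure only.

# Sketch of the SCC scheduling in Algorithm 1 (illustrative assumptions only).
# sccs        : SCCs in reverse topological order (sccs[0] contains the goal states)
# neighbors[i]: indices of the SCCs whose solution must wait for scc i
# n_pending[j]: number of SCCs that scc j still waits for (0 means "ready")
# partial_vi  : hypothetical function running VI sweeps restricted to one SCC
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def pc_tvi(sccs, neighbors, n_pending, partial_vi, n_threads):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        running = {pool.submit(partial_vi, sccs[0]): 0}   # the goal SCC is ready first
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                i = running.pop(fut)
                fut.result()                              # re-raise exceptions, if any
                for j in neighbors[i]:
                    n_pending[j] -= 1
                    if n_pending[j] == 0:                 # all its predecessors are solved
                        running[pool.submit(partial_vi, sccs[j])] = j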

5 Empirical Evaluation

In this section, we evaluate empirically the performance of pcTVI, comparing it to the
three following algorithms: (1) VI – the standard dynamic programming algorithm
(here we use its asynchronous round-robin variant), (2) LRTDP – a well-known
heuristic search algorithm, and (3) TVI – the Topological Value Iteration algorithm
described in Section 4. In the case of LRTDP, we used the admissible and
domain-independent ℎmin heuristic, first described in the original paper introducing
LRTDP [5]:

    ℎmin (𝑠) = 0                                                                 if 𝑠 ∈ 𝐺,
    ℎmin (𝑠) = min_{𝑎 ∈ 𝐴𝑠} [ 𝐶 (𝑠, 𝑎) + min_{𝑠′ ∈ 𝑠𝑢𝑐𝑐𝑎 (𝑠)} 𝑉 (𝑠′) ]            otherwise,

where 𝐴𝑠 denotes the set of applicable actions in state 𝑠 and 𝑠𝑢𝑐𝑐 𝑎 (𝑠) is the set of
successors when applying action 𝑎 at state 𝑠. The four competing algorithms (VI,
TVI, LRTDP and pcTVI) were implemented in C++ by the authors of this paper and
compiled using the GNU g++ compiler (version 11.2). All tests were performed on a
computer equipped with four Intel Xeon E5-2620V4 processors (each of them having
8 cores at 2.1 GHz, for a total of 32 cores). For every test domain, we measured
the running time of the four compared algorithms carried out until convergence to
an 𝜖-optimal value function (we used 𝜖 = 10⁻⁶). Every domain was tested 15 times
with randomly generated MDP instances. To minimize random factors, we report
the median values obtained over these 15 MDP instances.
Since there is no standard MDP domain in the scientific literature suitable to
benchmark a parallel MDP solver, we propose a new general parametric MDP
domain that we use to evaluate the algorithms. This domain, which we call chained-
MDP, uses 5 parameters: (1) 𝑘, the number of independent chains {𝑐 1 , 𝑐 2 , . . . , 𝑐 𝑘 } in
the MDP; (2) 𝑛𝑠𝑐𝑐 , the number of SCCs {𝑠𝑐𝑐 𝑖,1 , 𝑠𝑐𝑐 𝑖,2 , . . . , 𝑠𝑐𝑐 𝑖,𝑛𝑠𝑐𝑐 } in every chain
𝑐 𝑖 ; (3) 𝑛𝑠𝑝𝑠 , the number of states per SCC; (4) 𝑛𝑎 , the number of applicable actions
per state, and (5) 𝑛𝑒 the number of probabilistic effects per action. The possible
successors 𝑠𝑢𝑐𝑐(𝑠) of a state 𝑠 in 𝑠𝑐𝑐 𝑖, 𝑗 are states in 𝑠𝑐𝑐 𝑖, 𝑗 and either the states
in 𝑠𝑐𝑐 𝑖, 𝑗+1 if it exists, or the goal state otherwise. When generating the transition
function of a state-action pair (𝑠, 𝑎), we sampled 𝑛𝑒 states uniformly from 𝑠𝑢𝑐𝑐(𝑠)
with random probabilities. In each of our tests, we used 𝑛𝑠𝑐𝑐 = 2, 𝑛 𝑎 = 5 and 𝑛𝑒 = 5.
A representation of a Chained-MDP instance is shown in Figure 1.

Fig. 1 A chained-MDP instance where 𝑛𝑐 = 3 and 𝑛𝑠𝑐𝑐 = 4. Each ellipse represents a strongly
connected component.
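
A small generator for such an instance may help fix ideas. The sampling details below (uniform
successor choice, random normalized probabilities, unit-plus-random costs, and a single
simplified start state with one action per chain) are assumptions made for illustration, since
the exact generation procedure is not spelled out above.

# Sketch of a chained-MDP instance generator (illustrative assumptions only).
import random

def chained_mdp(k, n_scc, n_sps, n_a, n_e, seed=0):
    """Returns (states, T, goal); T[s] is a list of (cost, [(succ, prob), ...]) actions."""
    rng = random.Random(seed)
    goal, states, T = "goal", ["goal"], {}
    chains = [[[f"s_{c}_{j}_{i}" for i in range(n_sps)] for j in range(n_scc)]
              for c in range(k)]
    for c in range(k):
        for j in range(n_scc):
            nxt = chains[c][j + 1] if j + 1 < n_scc else [goal]
            for s in chains[c][j]:
                states.append(s)
                succ = chains[c][j] + nxt                # same SCC or the next one
                T[s] = []
                for _ in range(n_a):
                    eff = rng.sample(succ, min(n_e, len(succ)))
                    w = [rng.random() + 1e-9 for _ in eff]
                    T[s].append((1.0 + rng.random(),
                                 [(e, x / sum(w)) for e, x in zip(eff, w)]))
    # simplified start state: one deterministic action per chain (committing to it)
    states.append("start")
    T["start"] = [(1.0, [(chains[c][0][0], 1.0)]) for c in range(k)]
    return states, T, goal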

Figure 2 presents the obtained results for the Chained-MDP domain when varying
the number of states and fixing the number of chains (32). We can observe that when
the number of states is small, pcTVI does not provide an important advantage over
the existing algorithms since the overhead of creating and managing the threads is
taking most of the possible gains. However, as the number of states increases, the gap
in the running time between pcTVI and the three other algorithms increases. This
indicates that pcTVI is particularly useful on very large MDPs, which are usually
needed when considering real-world domains.
Figure 3 presents the obtained results for the same Chained-MDP domain when
varying the number of chains and fixing the number of states (1M). When the
number of chains increases, the total number of SCCs implicitly increases (which
also implies that the number of states per SCC decreases). This explains why each tested
algorithm becomes faster (TVI becomes faster by design, since it solves SCCs
one-by-one without doing useless state backups, and VI and LRTDP become faster
due to an increased locality of the considered states in memory, which improves
cache performance). The performance of pcTVI increases as the number of chains
increases (for the same reasons as the other algorithms, but also due to increased
parallelization opportunities). We can also observe that for domains with only 4 chains,
pcTVI still clearly outperforms the other methods. This means that pcTVI does not
need a highly parallel server CPU and can be used on a standard 4-core computer.

Fig. 2 Average running times (in s) for the Chained-MDP domain with varying number of states
and fixed number of chains (32).

Fig. 3 Average running times (in s) for the Chained-MDP domain with varying number of chains
and fixed number of states (1M).

6 Conclusion

The main contributions of this paper are two-fold. First, we presented a new algo-
rithm, pcTVI, which is, to the best of our knowledge, the first MDP solver that takes
into account both the topological structure of the MDP (as in TVI) and the parallel
capacities of modern computers (as in P3VI). Second, we introduced a new para-
metric planning domain, Chained-MDP, which models any situation where different
strategies (corresponding to a chain) can reach a goal, but where, once committed
to a strategy, it is not possible to switch to a different one. This domain is ideal to
evaluate the parallel performance of an MDP solver. Our experiments indicate that
pcTVI outperforms the other competing methods (VI, LRTDP, and TVI) on every
tested instance of the Chained-MDP domain. Moreover, pcTVI is particularly effec-
tive when the considered MDP has many SCC chains (for increased parallelization
opportunities) of large size (for decreased overhead of assigning small tasks to the
threads). As future work, we plan to investigate ways of pruning provably subopti-
mal actions, which would allow more SCCs to be found. While this paper focuses
on the automated planning side of MDPs, the proposed optimization and parallel
computing approaches could also be applied when using MDPs with Reinforcement
Learning and other ML algorithms.

Acknowledgements This research has been supported by the Natural Sciences and Engineering
Research Council of Canada (NSERC) and the Fonds de Recherche du Québec — Nature et
Technologies (FRQNT).

References

1. Champagne Gareau, J., Beaudry E., Makarenkov, V.: A fast electric vehicle planner using
clustering. In: Stud. in Classif., Data Anal., and Knowl. Organ., 5, 17-25. Springer (2021)
2. Mausam, Kolobov, A.: Planning with Markov Decision Processes: An AI Perspective. Morgan
& Claypool (2012)
3. Bellman, R.: Dynamic Programming. Prentice Hall (1957)
4. Dai, P., Mausam, Weld, D. S., Goldsmith, J.: Topological value iteration algorithms. J. Artif.
Intell. Res., 42, 181-209 (2011)
5. Bonet, B., Geffner, H.: Labeled RTDP: Improving the convergence of real-time dynamic
programming. In: Proc. of ICAPS, pp. 12-21 (2003)
6. Hansen, E., Zilberstein, S.: LAO*: A heuristic search algorithm that finds solutions with loops.
Artif. Intell., 129(1-2), 35-62 (2001)
7. Wingate, D., Seppi, K.: P3VI: A partitioned, prioritized, parallel value iterator. In: Proc. of
the Int. Conf. on Mach. Learn. (ICML), 863-870 (2004)
8. Bertsekas, D.: Dynamic Programming and Optimal Control, vol. 2. Athena scientific Belmont,
MA (2001)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Three-way Spectral Clustering

Cinzia Di Nuzzo and Salvatore Ingrassia

Abstract In this paper, we present a spectral clustering approach for clustering
three-way data. Three-way data concern data characterized by three modes: 𝑛 units,
𝑝 variables, and 𝑡 different occasions. In other words, three-way data contain a 𝑡 × 𝑝
observed matrix for each statistical observation. The units generated by simultaneous
observation of variables in different contexts are usually structured as three-way data,
so each unit is basically represented as a matrix. In order to cluster the 𝑛 units in 𝐾
groups, the application of spectral clustering to three-way data can be a powerful tool
for unsupervised classification. Here, an example on real three-way data is presented,
showing that spectral clustering is a competitive method for clustering this type of
data.

Keywords: spectral clustering, kernel function, three-way data

1 Introduction

Spectral clustering methods are based on the graph theory, where the units are
represented by the vertices of an undirected graph and the edges are weighted by
the pairwise similarities coming from a suitable kernel function, so the clustering
problem is reformulated as a graph partition problem, see e.g. [16, 6]. The spectral
clustering algorithm is a very powerful method for finding non-convex clusters of
data, moreover, it is a handy approach for handling high-dimensional data since it
works on a transformation of the raw data having a smaller dimension than the space
of the original data.

Cinzia Di Nuzzo ( )
Department of Statistics, University of Roma La Sapienza, Piazzale Aldo Moro, 5, 00185 Roma,
Italy, e-mail: [email protected]
Salvatore Ingrassia
Department of Economics and Business, University of Catania, Piazza Università, 2, 95131 Catania,
Italy, e-mail: [email protected]

© The Author(s) 2023


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_13

Three-way data derives from the observation of various attributes measured on a
set of units in different situations; some examples are longitudinal data on multiple
response variables and multivariate spatial data. Three-way data can also derive
from temporal measurements of a feature vector, thus having the dataset composed
of three modes: 𝑛 units (matrices), 𝑝 variables (columns), and 𝑡 times (rows). Clus-
tering of three-way data has attracted growing interest in the literature, see e.g. [14],
[1]; model-based clustering of three-way data has been introduced by [15] in the
framework of matrix-variate normal mixtures. Recent papers include [9], which deals
with parsimonious models for modeling matrix data; [11], which introduces two
matrix-variate distributions, both elliptical heavy-tailed generalizations of the matrix-
variate normal distribution; [12], which deals with three-way data clustering using
matrix-variate cluster-weighted models (MV-CWM); and [13], which considers an
application to educational data via mixtures of parsimonious matrix-normal distributions.
In this paper, we present a spectral clustering approach for clustering three-way
data and a suitable kernel function between matrices is introduced. As a matter of
fact, the data matrices represent the vertices of the graph, consequently, the edges
must be weighted by a single value.
The rest of the paper is organized as follows: in Section 2 the spectral clustering
method is summarized; in Section 3 a method to select the parameters in the spectral
clustering algorithm is described; in Section 4 the three-way spectral clustering with
a new kernel function is introduced; in Section 5 an application based on real
three-way data is presented. Finally, in Section 6 we provide concluding remarks.

2 Spectral Clustering

The spectral clustering algorithm for two-way data has been described in [8, 16, 6]. Here,
we summarize the main steps of this algorithm.
Let 𝑉 = {𝒙 1 , 𝒙 2 , . . . , 𝒙 𝑛 } be a set of points in X ⊆ R 𝑝 . In order to group the data
𝑉 in 𝐾 clusters, the first step concerns the definition of a symmetric and continuous
function 𝜅 : X × X → [0, ∞) called the kernel function. Afterwards, a similarity
matrix 𝑊 = (𝑤𝑖 𝑗 ) can be assigned by setting 𝑤𝑖 𝑗 = 𝜅(𝒙 𝑖 , 𝒙 𝑗 ) ≥ 0, for 𝒙 𝑖 , 𝒙 𝑗 ∈ X,
and finally the normalized graph Laplacian matrix 𝐿 sym ∈ R𝑛×𝑛 is introduced

𝐿 sym = 𝐼 − 𝐷 −1/2𝑊 𝐷 −1/2 , (1)

where 𝐷 = diag(𝑑1 , 𝑑2 , . . . , 𝑑 𝑛 ) is the degree matrix, 𝑑𝑖 = ∑_{𝑗≠𝑖} 𝑤𝑖 𝑗 is the degree of
the vertex 𝒙 𝑖 , and 𝐼 denotes the 𝑛 × 𝑛 identity matrix. The
Laplacian matrix 𝐿 sym is positive semi-definite with 𝑛 non-negative eigenvalues. For
a fixed 𝐾 ≪ 𝑛, let {𝜸 1 , . . . , 𝜸 𝐾 } be the eigenvectors corresponding to the smallest 𝐾
eigenvalues of 𝐿 sym . Then, the normalized Laplacian embedding in the 𝐾 principal
subspace is defined as the map Φ𝚪 : {𝒙 1 , . . . , 𝒙 𝑛 } → R𝐾 given by

Φ𝚪 (𝒙 𝑖 ) = (𝛾1𝑖 , . . . , 𝛾𝐾 𝑖 ), 𝑖 = 1, . . . , 𝑛,

where 𝛾1𝑖 , . . . , 𝛾𝐾 𝑖 are the 𝑖-th components of 𝜸 1 , . . . , 𝜸 𝐾 , respectively. In other
words, the function Φ𝚪 (·) maps the data from the input space X to a feature space
defined by the 𝐾 principal subspace of 𝐿 sym . Afterwards, let 𝒀 = (𝒚′1 , . . . , 𝒚′𝑛 ) be the
𝑛 × 𝐾 matrix given by the embedded data in the feature space, where 𝒚 𝑖 = Φ𝚪 (𝒙 𝑖 ) for
𝑖 = 1, . . . , 𝑛. Finally, the embedded data 𝒀 are clustered according to some clustering
procedure; usually, the 𝑘-means algorithm is taken into account in literature. How-
ever, to this end Gaussian mixtures have been proposed because they yield elliptical
cluster shapes, i.e. more flexible cluster shapes with respect to the 𝑘-means, see [2].
Finally, we point out that the performances of other mixture models based on non-
Gaussian component densities have been analyzed, but Gaussian mixture models
can be considered as a good trade-off between model simplicity and effectiveness,
see [3] for details.
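
The steps of this section can be condensed into a short sketch, anticipating the self-tuning
kernel (2) of Section 3 for concreteness; the use of scikit-learn's GaussianMixture for the final
step is an illustrative choice and not necessarily the exact implementation used in the paper.

# Sketch of the spectral clustering steps of Section 2 (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

def self_tuning_affinity(X, h):
    # w_ij = exp(-||x_i - x_j||^2 / (eps_i eps_j)), eps_i = distance to the h-th neighbour
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    eps = np.sqrt(np.sort(D2, axis=1)[:, h])
    W = np.exp(-D2 / np.outer(eps, eps))
    np.fill_diagonal(W, 0.0)
    return W

def spectral_embedding(W, K):
    d = W.sum(axis=1)
    L_sym = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))   # I - D^{-1/2} W D^{-1/2}
    eigval, eigvec = np.linalg.eigh(L_sym)
    return eigvec[:, :K]            # eigenvectors of the K smallest eigenvalues

def spectral_clustering(X, K, h):
    Y = spectral_embedding(self_tuning_affinity(X, h), K)
    return GaussianMixture(n_components=K, random_state=0).fit_predict(Y)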

3 A Graphical Approach for Parameter Selection

According to the spectral clustering algorithm introduced in Section 2, the spectral
approach requires setting: i) the number of clusters 𝐾, ii) the kernel function 𝜅 (with
the corresponding parameter). In order to select these quantities, in the following we
summarize the method proposed in [4].
To begin with, we point out that the choice of the kernel function affects the entire
data structure in the graph, and consequently, the structure of the Laplacian matrix
and its eigenvectors. An optimal kernel function should lead to a similarity matrix
𝑊 having (as much as possible) diagonal blocks: in this case, we get well-separated
groups and we are also able to understand the number of groups in that data set
by counting the number of blocks. For the sake of simplicity, we consider here the
self-tuning kernel introduced by [17]

    𝜅(𝒙 𝑖 , 𝒙 𝑗 ) = exp( − ‖𝒙 𝑖 − 𝒙 𝑗 ‖² / (𝜖𝑖 𝜖 𝑗 ) )                              (2)

with 𝜖𝑖 = ‖𝒙 𝑖 − 𝒙 ℎ ‖, where 𝒙 ℎ is the ℎ-th neighbor of point 𝒙 𝑖 (similarly for 𝜖 𝑗 ).
This function allows us to obtain a similarity matrix that does not depend on any global
parameter, so that the spectral clustering algorithm is based on the pairwise proximity
between units. However, we still need to select the ℎ-th neighbor of each unit in (2).
The main novelty of the joint-graphical approach concerns the analysis of some
graphic features of the Laplacian matrix including the shape of the embedded space.
Indeed, the embedded data provide useful information for the clustering, in particular
the main results in [10] and [5] allow to deduce that if the embedded data assume a
cones structure, then the number of clusters is equal to the number of the cones/spikes
in the feature space; furthermore, a clearer clustering structure emerges when the
spikes are narrower and well separated.
The idea behind the graphical approach is to select the number 𝐾 of groups and the
parameter ℎ in the kernel function from a joint analysis of three main characteristics:
the plot of the Laplacian matrix; the maxima values of the eigengaps between two
consecutive eigenvalues; the scatter plot of the mapped data in the feature space and
in particular the number of spikes counted in the embedded data space.
We remark that we cannot analyze all possible values of ℎ ∈ {1, 2, . . . , 𝑛 − 1} and
hence we choose a suitable subset H ⊂ {1, 2, . . . , 𝑛 − 1}, in particular we choose
H = {1%, 2%, 5%, 10%, 15%, 20%} × 𝑛 ⊂ {1, 2, . . . , 𝑛 − 1}, and select ℎ ∈ H , see
the following procedure for details.

Parameter selection (𝐾 and ℎ)


Input: data set 𝑉, kernel function 𝜅, H .
1. For each ℎ in H , compute the matrix 𝑀𝑠 and analyze the block structure in the
greyscale plot of 𝑀𝑠 .
2. For each ℎ in H , plot the embedded data in the feature space and analyze the
shape of the cone structure.
3. If the number of blocks in Step 1 is equal to the number of spikes in Step 2, then
set 𝐾 equal to the number of blocks. Go to Step 5.
4. Otherwise, analyze the eigengap plot.
a. If this plot shows a unique maximum eigengap for each ℎ ∈ H , then set 𝐾
according to this maximum. Go to Step 5.
b. If this plot shows multiple maxima for different ℎ ∈ H , select the number
of clusters 𝐾 not to be smaller than the number of tight spikes in the
corresponding plot of the embedded data.
5. Select ℎ ∈ H such that the clearest orthogonal data structure emerges from the
plot of the embedded data.
6. Stop.
Output: 𝐾, ℎ.

4 Three-way Spectral Clustering

In this section, we propose a spectral approach for clustering three-way data. Three-
way data consists of a data set referring to the same sets of units and variables,
observed in different situations, i.e., a set of multivariate matrices, that can be
organized in three modes: 𝑛 units, 𝑝 variables, and 𝑡 situations. Therefore, given
𝑛 matrices that represent the vertices of the graph, each matrix is composed by 𝑝
columns that represent our variables and 𝑡 rows that represent the time or another
feature. So we have a tensor of dimension 𝑛 ×𝑡 × 𝑝, thus the dataset is a tensor { 𝑿}𝑖𝑠𝑘
for 𝑖 = 1, . . . , 𝑛, 𝑠 = 1, . . . , 𝑡, 𝑘 = 1, . . . , 𝑝.
We define a distance function 𝛿 𝑀 between two matrices 𝐴, 𝐵 ∈ R 𝑡× 𝑝 , with
𝛿 𝑀 : R 𝑡× 𝑝 × R 𝑡× 𝑝 → [0, +∞), defined as

    𝛿 𝑀 ( 𝐴, 𝐵) := ‖ 𝐴 − 𝐵‖𝐹 = √( ∑_{𝑠=1}^{𝑡} ∑_{𝑘=1}^{𝑝} |𝑎 𝑠𝑘 − 𝑏 𝑠𝑘 |² )              (3)

where ‖ · ‖𝐹 is the Frobenius norm¹. Thus the distance between two units in the matrix
data 𝑿 is equal to

    𝛿 𝑀 (𝑋𝑖1 𝑠𝑘 , 𝑋𝑖2 𝑠𝑘 ) = √( ∑_{𝑠=1}^{𝑡} ∑_{𝑘=1}^{𝑝} |𝑋𝑖1 𝑠𝑘 − 𝑋𝑖2 𝑠𝑘 |² ),   for 𝑖 1 , 𝑖2 = 1, . . . , 𝑛.   (4)

For simplicity, in the following, we denote 𝛿 𝑀 (𝑋𝑖1 𝑠𝑘 , 𝑋𝑖2 𝑠𝑘 ) by 𝛿 𝑀 (𝑖1 , 𝑖2 ). Moreover,
we define the three-way self-tuning kernel function as

    𝜅 𝑆 : 𝑿 × 𝑿 → [0, +∞),    𝜅 𝑆 (𝑖 1 , 𝑖2 ) = exp( − 𝛿 𝑀 (𝑖 1 , 𝑖2 ) / (𝜖𝑖1 𝜖𝑖2 ) )        (5)

where 𝜖𝑖1 and 𝜖𝑖2 need to be selected like in the kernel defined in (2).
Afterwards, we compute the similarity matrix 𝑊 given by 𝑤𝑖1 𝑖2 = 𝜅(𝑖 1 , 𝑖2 ), so that
we can apply the spectral clustering algorithm.
Finally, we point out that, differently from approaches based on mixtures of
matrix-variate data, the number of variables of the data set is not a critical issue
because the spectral clustering algorithm is based on distance measures.
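
The similarity matrix of (3)–(5) can then be computed in a few lines; the NumPy sketch below
assumes the data are stored as an array of shape (𝑛, 𝑡, 𝑝) and is only an illustration (the file
name in the usage comment is hypothetical).

# Sketch: Frobenius distances (4) and the three-way self-tuning kernel (5).
import numpy as np

def three_way_affinity(X, h):
    """X: array of shape (n, t, p); returns the n x n similarity matrix W."""
    n = X.shape[0]
    flat = X.reshape(n, -1)
    # delta_M(i1, i2): Frobenius norm of the difference of the two t x p matrices
    delta = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    eps = np.sort(delta, axis=1)[:, h]          # distance to the h-th neighbour
    W = np.exp(-delta / np.outer(eps, eps))
    np.fill_diagonal(W, 0.0)
    return W

# Example (hypothetical file): 103 provinces, t = 5 years, p = 11 variables
# X = np.load("insurance_tensor.npy")          # shape (103, 5, 11)
# W = three_way_affinity(X, h=15)              # then proceed as in Section 2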

5 A Real Data Application

We apply the three-way spectral clustering to the analysis of the Insurance data set,
available in the splm R package. This dataset was initially introduced by [7] and
has recently been analyzed by [12]. The goal is to study the consumption of non-life
insurance during the years 1998-2002 in the 103 Italian provinces, so 𝑡 = 5 and
𝑛 = 103. As regards the number of variables, we consider all the variables contained
in the data set, so 𝑝 = 11. Thus, we have 103 matrices of dimensions 5 × 11.
The 103 Italian provinces are divided into north-west (24 provinces), north-
east (22 provinces), center (21 provinces), south (23 provinces), and islands (13
provinces).
As regards the choice of 𝐾 and ℎ, we consider the graphical approach introduced
in Section 3. In Figure 1 the geometric features of spectral clustering are plotted
as ℎ varies. From the number of blocks of the Laplacian matrix (Figure 1-𝑎)), the
first maximum eigengap (Figure 1-𝑏)) and the number of spikes in the feature space
(Figure 1-𝑐)), we deduce that the number of clusters is 𝐾 = 2. For the selection of

¹ In general, given a matrix 𝐴 = (𝑎𝑖 𝑗 ) ∈ R𝑛×𝑚 , 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑚, the Frobenius norm
is defined by ‖ 𝐴‖𝐹 := √( ∑_{𝑗=1}^{𝑚} ∑_{𝑖=1}^{𝑛} |𝑎𝑖 𝑗 |² ).
Fig. 1 Insurance data. Spectral clustering features for ℎ = 1, 2, 5, 10, 15, 21 (i.e., 1%, 2%, 5%,
10%, 15% and 20% of 𝑛): 𝑎) plot of the Laplacian matrix in greyscale; 𝑏) plot of the first eight
eigengap values; 𝑐) scatterplot of the embedded data along the directions (𝜸 1 , 𝜸 2 ).

Table 1 Insurance data. Spectral clustering result.

Cluster 1: NORTH-WEST (24 provinces), NORTH-EAST (22 provinces), CENTRE (15 provinces)
Cluster 2: CENTRE (6 provinces), SOUTH (23 provinces), ISLANDS (13 provinces)

ℎ we choose indifferently ℎ = 15 or ℎ = 21, because in these cases the maximum
eigengap corresponds to 𝐾 = 2. In Table 1 the
clustering results are presented. This table shows that only 6 center provinces are
classified together with the southern provinces. To check whether these provinces
are adjacent to the southern provinces, let us analyze the spectral clustering results on the
map of Italy. Figure 2-𝑎) illustrates the partition deriving from spectral clustering
on the political map of Italy, where Italian regions are delimited by yellow lines,
while provinces by black lines. The result shows a clear separation
between center-north Italy and south-insular Italy, in fact, the center-north has a
level of insurance penetration close to the European averages, while the South is
less developed economically. However, the Massa-Carrara province should belong
to the centre-north group. Moreover, we remark that the Rome province, being the
capital of Italy, has a socio-economic development comparable to that of north
Italy justifying belonging to the centre-north group.
Furthermore, in Figure 2-𝑏) we also represent the partition produced by the MN-
CWM proposed in [12]; we note that the two clustering results are very similar to
each other and differ only for one province of central Italy (precisely for the province
of Terni). It should also be emphasized that the dataset analyzed by [12] is different
from the one analyzed here, since, to avoid excessive parameterization of the models,
the authors select only 𝑝 = 5 variables in the data set.


Fig. 2 Insurance data. 𝑎) Three-way spectral clustering; 𝑏) Method proposed by [12].



6 Conclusion

In this paper, a spectral approach to cluster three-way data has been proposed. The
data are organized in a tensor, and the vertices of the graph are represented by
matrices of dimension 𝑡 × 𝑝. In order to weight the matrices in the graph, a
kernel function based on the Frobenius norm of the matrix difference has been
introduced. The performance of the spectral clustering algorithm has been shown on
one real three-way data set. Our method is competitive with respect to other clustering
methods proposed in the literature for matrix-data clustering. Finally, as a suggestion
for future research, other kernel functions can be introduced, based on distances other
than the Frobenius norm.

Acknowledgements This work was supported by the University of Catania grant PIACERI/CRASI
(2020).

References
1. Bocci, L., Vicari, D.: ROOTCLUS: Searching for "ROOT CLUSters" in Three-Way Proximity
Data. Psychometrika. 84, 941–985 (2019)
2. Di Nuzzo, C., Ingrassia, S.: A mixture model approach to spectral clustering and application
to textual data. Stat. Meth. Appl. Forthcoming (2022)
3. Di Nuzzo, C.: Model selection and mixture approaches in the spectral clustering algorithm.
Ph.D. thesis, Economics, Management and Statistics, University of Messina (2021)
4. Di Nuzzo, C., Ingrassia, S.: A joint graphical approach for model selection in the spectral
clustering algorithm. Tech. Rep. (2022)
5. Garcia Trillos, N., Hoffman, F., Hosseini, B.: Geometric structure of graph Laplacian embed-
dings. arXiv preprint arXiv:1901.10651. (2019)
6. Meila, M.: Spectral clustering. In Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.).
Handbook of Cluster Analysis. Chapman and Hall/CRC (2015)
7. Millo, G., Carmeci, G.: Non-life insurance consumption in Italy: A sub-regional panel data
analysis. J. Geogr. Syst. 12, 1–26 (2011)
8. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. Adv. Neural
Inf. Process. Syst. 14 (2002)
9. Sarkar, S., Zhu, X., Melnykov, V., Ingrassia, S.: On parsimonious models for modeling matrix
data. Comput. Stat. Data Anal. 142, 106822 (2020)
10. Schiebinger, G., Wainwright, M. J., Yu, B.: The geometry of kernelized spectral clustering.
Ann. Stat. 43(2), 819–846 (2015a)
11. Tomarchio, S. D., Punzo, A., Bagnato, L.: Two new matrix-variate distributions with applica-
tion in model-based clustering. Comput. Stat. Data Anal. 152, 107050 (2020)
12. Tomarchio, S. D., McNicholas, P., Punzo, A.: Matrix normal cluster-weighted models. J.
Classif. 38, 556-575 (2021)
13. Tomarchio, S. D., Ingrassia, S., Melnykov, V.: Modeling students’ career indicators via mix-
tures of parsimonious matrix-normal distributions. Aust. New Zeal. J. Stat. Forthcoming
(2022)
14. Vichi, M., Rocci, R., Kiers, H. A. L.: Simultaneous component and clustering models for
three-way data: Within and between approaches. J. Classif. 24, 71–98 (2007)
15. Viroli, C.: Finite mixtures of matrix normal distributions for classifying three-way data. Stat.
Comput. 21, 511–522 (2011)
16. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
17. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. Adv. Neural Inf. Process. Syst.
17 (2004)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Improving Classification of Documents by
Semi-supervised Clustering in a Semantic Space

Jasminka Dobša and Henk A. L. Kiers

Abstract In this paper we propose a method for the representation of documents in a
semantic lower-dimensional space, based on a modified Reduced 𝑘-means method
which penalizes clusterings that are distant from the classification of training documents
given by experts. Reduced 𝑘-means (RKM) enables simultaneous clustering of
documents and extraction of factors. By projection of documents represented in the
vector space model on extracted factors, documents are clustered in the semantic
space in a semi-supervised way (using penalization) because clustering is guided by
classification given by experts, which enables improvement of classification perfor-
mance of test documents.
Classification performance is tested for classification by logistic regression and sup-
port vector machines (SVMs) for the classes of the Reuters-21578 data set. It is shown that
the representation of documents by the RKM method with penalization improves the
average precision of classification by SVMs for the 25 largest classes of the Reuters
collection by about 5.5%, with the same level of average recall, in comparison to
the basic representation in the vector space model. In the case of classification by
logistic regression, the representation by RKM with penalization improves the average
recall by about 1% in comparison to the basic representation.

Keywords: classification of textual documents, LSA, reduced 𝑘-means

Jasminka Dobša ( )
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 40000 Varaždin,
Croatia, e-mail: [email protected]
Henk A. L. Kiers
Department of Psychology, University of Groningan, Grote Kruisstraat 2/1, 9712 TS Groningen,
The Netherlands, e-mail: [email protected]

© The Author(s) 2023


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_14

1 Introduction

There are two main families of methods that deal with representation of documents
and words that index them: global matrix factorization methods such as Latent Se-
mantic Analysis (LSA) [2] and local context window methods such as the continuous
bag of words (CBOW) model and the continuous skip-gram model [8]. The latter
use neural networks for learning of representations of words and are intensively
explored lately in the scientific community since the development of fast processors
has enabled processing of huge amounts of data, which has resulted in improved
performance on a wide spectrum of text mining and natural language tasks. However,
representation of words solely by context window methods has a drawback due to
the neglect of information about global corpus statistics [9].
In this paper we propose a method for representation of documents by application
of a penalized version of the RKM method [4] on a term-document matrix. The
corpus of textual documents is represented by a sparse term-document matrix in
which entry (i, j) is equal to the weight of the i-th index term for the j-th document.
Weights of terms are given by the TfIdf weighting which utilizes local information
about the frequency of the i-th term in the j-th document and global information about
usage of the i-th term in the entire collection. A benchmark method that utilizes global
matrix factorization on term-document matrices is LSA [2] which uses truncated
singular value decomposition (SVD) for representation of terms and documents in
lower-dimensional semantic space. SVD does not capture the clustering structure of
data which motivates application of the RKM.
The rest of the paper is organized as follows: the second section describes related
work on representation of documents and words and methods of dimensionality
reduction related to RKM. The third section describes the modified RKM method
with penalization, while the fourth section describes an experiment on Reuters-21578
data set. In the last section conclusions and directions for further work are given.

2 Related Work

2.1 Representation by Matrix Factorization Methods

A benchmark method among methods that utilize matrix factorization for repre-
sentation of textual documents is the method of LSA introduced in 1994 [2]. By
LSA a sparse term-document matrix is transformed via SVD into a dense matrix
of the same term-document type with representations of words (index terms) and
documents in a lower-dimensional space. The idea is to map similar documents, or
those that describe the same topics, closer to each other regardless of the terms that
are used in them. A very efficient application of LSA is in cross-lingual information
retrieval where relevant documents for a query in one language are retrieved from a
set of documents in another language [7]. To our knowledge, the application
of methods that simultaneously cluster objects and extract factors in the field of text
mining is very limited. In [6] a method is proposed for cross-lingual information
retrieval based on the RKM method.
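
As a reminder of how the LSA representation used as a baseline in Section 4 can be obtained,
a truncated SVD of the TfIdf term-document matrix suffices; the sketch below (e.g., with
k = 85 as in the experiment) is an illustration and not the authors' implementation.

# Sketch: LSA representation by truncated SVD of a TfIdf term-document matrix.
import numpy as np
from scipy.sparse.linalg import svds

def lsa(X, k):
    """X: sparse m x n term-document matrix; returns term and document coordinates."""
    U, s, Vt = svds(X, k=k)                  # k largest singular triplets
    order = np.argsort(-s)                   # sort the triplets in descending order
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    terms = U                                # m x k term coordinates
    docs = (s[:, None] * Vt).T               # n x k document coordinates
    return terms, docs

# Test documents can be folded into the same space by projection on the term factors:
# docs_test = X_test.T @ terms               # X_test: m x n_test TfIdf matrix (hypothetical)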

2.2 Neural Network Word Embeddings

Another approach is to learn representations of words, or so-called embeddings, by
using local context windows. In 2003 Bengio and coauthors [1] proposed a neural
probabilistic language model that uses simple neural network architecture to learn
distributed representations for each word as well as probability functions for word
sequences, expressed in terms of these representations. Mikolov and coauthors [8]
proposed in 2013 two models based on single-layer neural network architectures:
the skip-gram model, which predicts context words given the current word, and the
continuous bag of words model which predicts current words based on the context.
In 2014 the GloVe model [9] was proposed, based on the critique that neural network
models suffer from the disadvantage that they do not utilize co-occurrence statistics
of the entire corpus, but scan only context windows of words ignoring vast amounts
of repetition in the data. That model exploits the advantages of global matrix factor-
ization methods by utilization of term-term co-occurrence matrices and local context
window methods.
Word embeddings can be classified as static, such as word2vec [8] and GloVe
[9], and contextual, such as ELMo [10] and BERT [5]. Contextual representations
were introduced in [10] in order to model, on the one hand, characteristics of word use
(syntax and semantics) and, on the other, the variation in word representation due to
the context in which words appear.

2.3 Methods for Simultaneous Clustering and Factor Extraction

A standard procedure for clustering of objects in a lower-dimensional space is tandem
analysis, which includes projection of the data by principal components and clustering
of data in a lower-dimensional space. Such an approach was criticized in [3] and
[4] since principal components may extract dimensions which do not necessarily
significantly contribute to the identification of a clustering structure in the data.
As a response, De Soete and Carroll proposed the method of RKM [4] which
simultaneously clusters data and extracts the factors of variables by reconstructing
the original data with only centroids of clusters in a lower-dimensional space. The
algorithm of Factorial 𝑘-means (FKM) proposed by Vichi and Kiers [13] has the
same aim of simultaneous reduction of objects and variables and it reconstructs the
data in a lower-dimensional space by its centroids in the same space. The application
of the latter method is limited in text mining since the method is limited to cases in
which the number of variables is less than the number of cases. In [11] the RKM
and FKM methods are compared using simulations and theoretically in order to
identify cases for their application. Timmerman and associates also propose method
of Subspace 𝑘-means [12] which gives an insight into cluster characteristics in terms
of relative positions of clusters given by centroids and the shape of the clusters given
by within cluster residuals.

3 Reduced 𝒌-Means with Penalization

Let X be 𝑚 × 𝑛 term-document matrix. We use the following notation:


• A is an 𝑚 × 𝑘 columnwise orthonormal matrix of extracted factors;
• M is an 𝑛 × 𝑐 membership matrix, where c is a predefined number of clusters;
𝑚 𝑖𝑐 = 1 if object (document) i belongs to cluster c and 0 otherwise;
• Y is a 𝑐 × 𝑘 matrix which gives centroids of clusters in the lower-dimensional
space.
By definition, we suppose that every document in the collection belongs to exactly
one cluster. The RKM method minimizes the loss function

    F(M, A) = ‖X − AY𝑇 M𝑇 ‖²                                            (1)

in the least squares sense. The dimension of the lower-dimensional space must be
less than or equal to the number of clusters. The modified RKM with penalization minimizes
the loss function

    F(M, A) = ‖X − AY𝑇 M𝑇 ‖² + 𝜆‖M − G‖²                                (2)


where G is an 𝑛 × 𝑐 membership matrix based on expert judgements: if 𝑐 is the number of
classes, then 𝑔𝑖𝑐 = 1 if object (document) 𝑖 belongs to class 𝑐, and 0 otherwise. By the
second summand in the loss function we penalize clusterings that are distant from
the classes given by expert judgements, using the parameter 𝜆 to regulate the importance
of that penalization. We use the alternating least squares (ALS) algorithm analogous
to the one in [4] which alternates between corrections of the loading matrix A in
one step and of the membership matrix M in another. As each of the steps in the
ALS algorithm improves the loss function, the algorithm converges to at least a local
minimum. By starting the procedure from a large number of random initial estimates
and choosing the best solution, the chances of obtaining the global minimum are
increased.
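
To make the procedure more tangible, one possible organization of the ALS steps for the loss
(2) is sketched below: for a fixed M, the loadings A and centroids Y follow from the usual RKM
least-squares updates, and the penalty only affects the reassignment step. The update formulas
and the dense-matrix assumption are illustrative choices and need not coincide with the authors'
Matlab implementation; in practice the procedure would be restarted from several random
initializations and the solution with the smallest loss retained, as described above.

# Sketch of an ALS scheme for the penalized loss (2) (illustrative assumptions only).
import numpy as np

def rkm_penalized(X, G, k, lam, n_iter=50, seed=0):
    """X: m x n term-document matrix (dense), G: n x c expert membership matrix (0/1)."""
    rng = np.random.default_rng(seed)
    n, c = G.shape
    labels = rng.integers(c, size=n)              # random initial clustering
    for _ in range(n_iter):
        M = np.eye(c)[labels]                     # n x c membership matrix
        sizes = np.maximum(M.sum(axis=0), 1.0)    # guard against empty clusters
        # Step 1: given M, update A and Y by least squares (as in the original RKM)
        means = (X @ M) / sizes                   # m x c cluster means, X M (M'M)^{-1}
        B = means * np.sqrt(sizes)                # equals X M (M'M)^{-1/2}
        U, _, _ = np.linalg.svd(B, full_matrices=False)
        A = U[:, :k]                              # m x k orthonormal loadings
        Y = (A.T @ means).T                       # c x k centroids in the reduced space
        # Step 2: given A and Y, reassign documents; the penalty in (2) adds
        # lam * ||e_c - g_i||^2 to assigning document i to cluster c
        Z = (A.T @ X).T                           # n x k reduced documents
        dist = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        dist += lam * ((np.eye(c)[None, :, :] - G[:, None, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
    return A, Y, np.eye(c)[labels]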

4 Experiment

4.1 Design of Experiment

Experiments are conducted for classification on the Reuters-21578 data set, specifi-
cally using the ModApte Split which assigns Reuters reports from April 7, 1987 and
before to the training set, and after, until end of 1987, to the test set. It consists of
9603 training and 3299 test documents. The collection has 90 classes which contain
at least one training and test document. Documents are represented by a bag of words
representation. A list of index terms is formed based on terms that appear in at least
four documents of the collection, which resulted in a list of 9867 index terms.
Classification is conducted by logistic regression (LR) and SVM algorithm. The
basic model is the bag of words representation (full representation), while repre-
sentations in the lower-dimensional space are obtained by SVD (Latent Sematic
Analysis), RKM and RKM with penalization (𝜆 = 0.1, 0.2, 0.4, 0.6). For RKM and
RKM with penalization representations are obtained by applying matrix factoriza-
tion on the term-document matrix of the training documents, and by projection of
test documents on factors given by matrix A in the factorization. RKM is computed
for 90 clusters (which corresponds to the number of classes in the collection) using
as dimension of the lower-dimensional space 𝑘 = 85, and truncated SVD is com-
puted for 𝑘 = 85 as well. The RKM and RKM with penalization algorithms are run
10 times (with different starting estimates), and the representation and factorization
with the minimal loss function is chosen. The optimal cost parameter for LR and
SVM is chosen by grid search technique from the set of values 0.1, 0.5, 1, 10, 100
and 1000. For the classification methods, the LiblineaR library in R is used, while
RKM and RKM with penalization algorithm are implemented in Matlab.
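As a small illustration of the projection step mentioned above, test documents can be mapped into the factor space as follows; this assumes the usual orthogonal projection onto the columnwise orthonormal matrix A, and the function name is ours.

import numpy as np

def project_documents(A, X):
    # Project the columns (documents) of the m x n matrix X onto the k factors
    # stored in the columns of the m x k, columnwise orthonormal, matrix A.
    return A.T @ X   # k x n lower-dimensional representation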

4.2 Results

Results are given in terms of precision, recall, and 𝐹1 measure of the classification.
Recall is the proportion of correctly classified samples among all positive samples (i.e., samples actually belonging to the class, according to the expert), while precision is the proportion of correctly classified samples among all samples classified as positive
by the model. Figures 1 and 2 show the average 𝐹1 measures of classification for groups of 5 classes, sorted in descending order by their size, i.e. the number of training documents (2877 to 389 for classes 1-5, 369 to 181 for classes 6-10, 140 to 111 for classes 11-15, 101 to 75 for classes 16-20, 75 to 55 for classes 21-25, 50 to 41 for classes 26-30, 40 to 37 for classes 31-35, 35 to 24 for classes 36-40, 23 to 19 for classes 41-45, 18 to 16 for classes 46-50, 16 to 13 for classes 51-55, and 13 to 10 for classes 56-60). Figure 1 shows the results for classification by LR, while Figure 2 shows those for classification by SVM.

Fig. 1 Average 𝐹1 measure of classification by LR for groups of 5 classes sorted by their size.

Only the 60 largest classes are considered, since smaller classes (fewer than 10 training documents) are not interesting for this research: for those classes recall is low, and the full bag of words representation can be expected to give better recognition, since such classes can possibly be recognized by key words but not by transformed representations. It can be seen that the 𝐹1 measures are comparable for the full representation and the various representations by RKM with penalization, for both classification algorithms, for the 25 largest classes. For smaller classes, the results for the representation by RKM with penalization are unstable, although for some classes they were better than for the basic representation (in the case of LR). Classification for the representations obtained by SVD and RKM without penalization resulted in lower 𝐹1 measures for all class sizes.
Table 1 shows the average precision, recall and 𝐹1 measures for the 25 largest classes, for both classification algorithms and all observed representations. In the case of classification by LR, the average recall for the representation by RKM with penalization (𝜆 = 0.4) is improved by approximately 1% compared to the basic full representation. For classification by SVM, the average precision is improved by almost 6% for the representation by RKM with penalization (𝜆 = 0.6), and the 𝐹1 measure is improved by 2% for RKM with penalization (𝜆 = 0.4), in comparison to the basic full representation. The best results are obtained for classification by the SVM algorithm with the RKM with penalization representation (𝜆 = 0.2), for which precision is improved by 5% with a similar level of recall as in the basic representation.

Fig. 2 Average 𝐹1 measure of classification by SVM for groups of 5 classes sorted by their size.

Table 1 Average precision, recall, and 𝐹1 measure of classification for the 25 largest classes.

Class. algorithm          Logistic regression             SVM
Representation            Precision   Recall   𝐹1         Precision   Recall   𝐹1
Full                      86.31       70.24    76.84      82.76       71.72    76.47
SVD                       82.80       64.84    71.42      85.24       61.61    68.99
RKM                       80.80       61.10    68.44      82.93       55.66    63.83
RKMPenal, 𝜆 = 0.1         84.24       70.71    76.27      87.24       71.01    77.62
RKMPenal, 𝜆 = 0.2         84.68       71.23    76.72      87.78       72.16    78.57
RKMPenal, 𝜆 = 0.4         84.72       71.38    76.88      87.86       64.93    73.87
RKMPenal, 𝜆 = 0.6         85.89       70.40    76.80      88.40       66.11    74.75

5 Conclusions and Further Work

In this paper we propose a modification of the RKM method that simultaneously clusters documents and extracts factors on one side, and penalizes clusterings that are distant from the classification of the training documents given by experts on the other side. We show that such a modification enables a representation of textual documents in a semantic lower-dimensional space that improves classification performance.
The method is tested on classes of the Reuters-21578 data set and compared to the full bag of words representation and the LSA method. It is also shown that the

original RKM method without the proposed modification does not have the same effect on classification performance; it has a similar effect to the LSA method.
The proposed representation method can improve the precision and recall of classification for sufficiently large classes, i.e. those that have enough training documents to enable capturing the semantic relations and characteristics of the classes. The more important effect is observed in the improvement of precision.
In the future we plan to investigate hybrid models using representations of words from neural language models, and applications in different domains, such as classification of images.

References

1. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model.
Journal of Machine Learning Research 3, 1137-1155 (2003)
2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R. A.: Indexing
by latent semantic analysis. Journal of the American Society for Information Science 41(6),
381-407 (1990)
3. De Sarbo, W. S., Jedidi, K., Cool, K., Schendel, D.: Simultaneous multidimensional unfolding
and cluster analysis: an investigation of strategic groups. Marketing Letters, 2, 129-146 (1990)
4. De Soete, G., Carroll, J. D.: 𝐾 -means clustering in a low-dimensional Euclidean space. In:
Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., Burtschy, B. (eds.) New Approaches
in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge
Organization, pp. 212-219. Springer, Heidelberg (1994)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional
transformers for language understanding. In: Proceedings of Annual Conference of the North
American Chapter of the Association for Computation Linguistic, pp. 4171-4186, Association
for Computational Linguistic (2019)
6. Dobša, J., Mladenić, D., Rupnik, J., Radošević, D., Magdalenić, I.: Cross-language information
retrieval by Reduced 𝑘-means, International Journal of Computer Information Systems and
Industrial Management Applications, 10, 314-322 (2018)
7. Dumais, S., Letsche, T., Littman, M., Landauer, T.: Automatic cross-language retrieval using
latent semantic indexing. In: Proceedings of the AAAI spring symposium on cross-language
text and speech retrieval, pp. 15-21. American Association for Artificial Intelligence (1997)
8. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations
in vector space (2013) Available via arXiv.org
https://arxiv.org/abs/1301.3781. Cited 21 Jan 2022
9. Pennington, J., Socher, R., Manning, C. D.: GloVe: Global vectors for word representation. In:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 1532-1543, Association for Computational Linguistics, (2014)
10. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep
contextualized word representations. In Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
1:2227-2237 (2018)
11. Timmerman, M. E., Ceulemans, E., Kiers, H. A. L., Vichi, M.: Factorial and Reduced 𝑘-means
reconsidered. Computational Statistics & Data Analysis, 54, 1856-1871 (2010)
12. Timmerman, M. E., Ceulemans, E., De Roover, K., Van Leeuwen, K.: Subspace 𝑘-means
clustering. Behavior Research Methods, 45, 1011-1023 (2013)
13. Vichi, M., Kiers, H. A. L.: Factorial 𝑘-means analysis for two-way data, Computational
Statistics & Data Analysis, 37, 49-64 (2001)

Trends in Data Stream Mining

João Gama

Abstract Learning from data streams is a hot topic in machine learning and data
mining. This article presents our recent work on the topic of learning from data
streams. We focus on emerging topics, including fraud detection and hyper-parameter
tuning for streaming data. The first study is a case study on interconnect bypass
fraud. This is a real-world problem from high-speed telecommunications data that
clearly illustrates the need for online data stream processing. In the second study,
we present an optimization algorithm for online hyper-parameter tuning from non-
stationary data streams.

Keywords: fraud detection, hyperparameter tuning, learning from data streams

1 Introduction

The development of information and communication technologies has dramatically changed data collection and processing methods. What distinguishes current data
sets from earlier ones are automatic data feeds. We do not just have people entering
information into a computer. We have computers entering data into each other. In
most challenging applications, data are modeled best not as persistent tables, but
rather as transient data streams.
This article presents our recent work on the topic of learning from data streams.
It is organized into two main sections. The first one is a real-world application of data
stream techniques to a telecommunications fraud detection problem. It is based on
the work presented in [5]. The second topic discusses the problem of hyperparameter
tuning in the context of data stream mining. It is based on the work presented in [4].

João Gama ( )
FEP-University of Porto and INESC TEC
R. Dr. Roberto Frias, Porto, Portugal, e-mail: [email protected]


2 Fraud Detection: a Case Study

The high asymmetry of international termination rates with regard to domestic ones,
where international calls have higher charges applied by the operator where the call
terminates, is fertile ground for the appearance of fraud in Telecommunications.
There are several types of fraud that exploit this differential, with Interconnect Bypass Fraud being one of the most significant [1, 3].
In this type of fraud, one of several intermediaries responsible for delivering the
calls forwards the traffic over a low-cost IP connection, reintroducing the call in the
destination network already as a local call, using VOIP Gateways. This way, the
entity that sent the traffic is charged the amount corresponding to the delivery of
international traffic. However, once it is illegally delivered as national traffic, it will
not have to pay the international termination fee, appropriating this amount.
Traditionally, the telecom operators analyze the calls of these Gateways to detect
the fraud patterns and, once identified, have their SIM cards blocked. The constant
evolution in terms of technology adopted on these gateways allows them to work
like real SIM farms capable of manipulating identifiers, simulating standard call
patterns similar to the ones of regular users, and even being mounted on vehicles to
complicate the detection using location information.
The interconnect bypass fraud detection algorithms typically consume a stream 𝑆 of events, where 𝑆 contains information about the origin number (A-Number), the destination number (B-Number), the associated timestamp, and the status of the call (accomplished or not). The expected output of this type of algorithm is a set of potentially fraudulent A-Numbers that require validation by the telecom operator. This process is not fully automated, to avoid blocking legitimate A-Numbers and incurring penalties. In interconnect bypass fraud, we can observe three different types of abnormal behavior:
1. bursts of calls: A-Numbers that produce enormous quantities of #calls (above the #calls of all A-Numbers) during a specific time window 𝑊; the size of this time window is typically small;
2. repetitions: the repetition of some pattern of #calls produced by an A-Number during consecutive time windows 𝑊;
3. mirror behaviors: two distinct A-Numbers (typically from the same country) that produce the same pattern of calls (#calls) during a time window 𝑊.

Algorithm 2 The Lossy Counting Algorithm.
1: procedure LossyCounting(𝑆: A Sequence of Examples; 𝜖: Error margin; 𝛼: fast forgetting parameter)
2:   𝑛 ← 0; Δ ← 0; 𝑇 ← 0;
3:   for example 𝑒 ∈ 𝑆 do
4:     𝑛 ← 𝑛 + 1
5:     if 𝑒 is monitored then
6:       Increment Count_𝑒
7:     else
8:       𝑇 ← 𝑇 ∪ {𝑒, 1 + Δ}
9:     end if
10:    if 𝑛𝜖 ≠ Δ then
11:      Δ ← 𝑛𝜖
12:    end if
13:    for all 𝑗 ∈ 𝑇 do
14:      if Count_𝑗 < 𝛿 then
15:        𝑇 ← 𝑇 \ {𝑗}
16:      end if
17:    end for
18:  end for
19: end procedure

Algorithm 3 The Lossy Counting with Fast Forgetting Algorithm.
1: procedure LossyCounting(𝑆: A Sequence of Examples; 𝜖: Error margin; 𝛼: fast forgetting parameter)
2:   𝑛 ← 0; Δ ← 0; 𝑇 ← 0;
3:   for example 𝑒 ∈ 𝑆 do
4:     𝑛 ← 𝑛 + 1
5:     if 𝑒 is monitored then
6:       Increment Count_𝑒
7:     else
8:       𝑇 ← 𝑇 ∪ {𝑒, 1 + Δ}
9:     end if
10:    if 𝑛𝜖 ≠ Δ then
11:      Δ ← 𝑛𝜖
12:    end if
13:    for all 𝑗 ∈ 𝑇 do
14:      Count_𝑗 ← 𝛼 ∗ Count_𝑗
15:      if Count_𝑗 < 𝛿 then
16:        𝑇 ← 𝑇 \ {𝑗}
17:      end if
18:    end for
19:  end for
20: end procedure
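A compact Python sketch of the fast-forgetting variant is given below. The data structure and the bucket rule follow the pseudo-code above, but the function name and the pruning threshold (here taken equal to the current value of Δ) are assumptions of this sketch rather than the implementation used in [5].

def lossy_counting_ff(stream, eps=0.001, alpha=0.9):
    # Approximate counting of frequent items with fast forgetting (sketch).
    counts = {}          # T: monitored items -> approximate count
    delta = 0            # current bucket index
    n = 0
    for e in stream:
        n += 1
        if e in counts:              # monitored item: increment its count
            counts[e] += 1
        else:                        # new item: start at 1 plus the current error bound
            counts[e] = 1 + delta
        bucket = int(n * eps)
        if bucket != delta:          # bucket boundary reached
            delta = bucket
            for item in list(counts):
                counts[item] *= alpha          # fast forgetting of old counts
                if counts[item] < delta:       # prune items with low (decayed) counts
                    del counts[item]
    return counts

Keeping, at any time, the ten items with the largest counts reproduces the kind of evolving top-10 shown in Figures 1 and 2 below.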

Figures 1 and 2 present the evolving top-10 most active phone numbers. Figure 1 presents the top-10 cumulative counts, while Figure 2 presents the top-10 counts with forgetting.

3 Learning to Learn Hyperparameters

A hyperparameter is a parameter whose value is used to control the learning process. Hyperparameter optimization (or tuning) is the problem of choosing a set of optimal hyper-parameters for a learning algorithm. For this purpose we adapt the Nelder-Mead algorithm [4] to the streaming context. This algorithm is a simplex search
algorithm for multidimensional unconstrained optimization without derivatives. The
vertexes of the simplex, which define a convex hull shape, are iteratively updated
in order to sequentially discard the vertex associated with the largest cost function
value.
The Nelder-Mead algorithm relies on four simple operations: reflection, shrink-
age, contraction and expansion. Figure 3 illustrates the four corresponding Nelder-
Mead operators 𝑅, 𝑆, 𝐶 and 𝐸. Each vertex represents a model containing a set of
hyper-parameters. The vertexes (models under optimisation) are ordered and named
according to the root mean square error (RMSE) value: best (𝐵), good (𝐺), which is

Fig. 1 Approximate Counts with Lossy Counting.

Fig. 2 Approximate Counts with Lossy Counting and Fast Forgetting.

the closest to the best vertex, and worst (𝑊). 𝑀 is a mid vertex (auxiliary model). The
bottom panel in Figure 3 describes the four operations: Contraction, Reflection,
Expansion, and Shrink.
For each Nelder-Mead operation, it is necessary to compute an additional set of
vertexes (midpoint 𝑀, reflection 𝑅, expansion 𝐸, contraction 𝐶 and shrinkage 𝑆)
and verify if the calculated vertexes belong to the search space. First, the algorithm
computes the midpoint (𝑀) of the best face of the shape as well as the reflection
point (𝑅). After this initial step, it determines whether to reflect or expand based on
the set of heuristics.
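To fix ideas, the following minimal sketch computes candidate vertexes for these operations on hyper-parameter vectors. The coefficients are the textbook Nelder-Mead defaults and the function name is ours, so the actual heuristics used in [4] may differ.

import numpy as np

def nelder_mead_candidates(B, G, W):
    # Candidate vertexes for one step of the simplex search (sketch).
    # B, G, W are the best, good and worst vertexes (hyper-parameter vectors).
    M = (B + G) / 2.0    # midpoint of the best face
    R = M + (M - W)      # reflection of the worst vertex through M
    E = R + (R - M)      # expansion further along the reflection direction
    C = (M + W) / 2.0    # contraction between the midpoint and the worst vertex
    S = (B + W) / 2.0    # shrinkage of the worst vertex towards the best one
    return {"M": M, "R": R, "E": E, "C": C, "S": S}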
The dynamic sample size, which is based on the RMSE metric, attempts to identify
significant changes in the streamed data. Whenever such a change is detected, the
Nelder-Mead compares the performance of the 𝑛 + 1 models under analysis to choose
the most promising model. The sample size 𝑆 𝑠𝑖𝑧𝑒 is given by Equation 1 where 𝜎


Fig. 3 SPT working modes: Exploration and Deployment. Bottom panel illustrates the Nelder &
Mead operators.

represents the standard deviation of the RMSE and 𝑀 the desired error margin. We
use 𝑀 = 95%.
𝑆_size = 4𝜎² / 𝑀²    (1)
However, to avoid using small samples, which imply error estimates with large variance, we defined a lower bound of 30 samples. The adaptation of the Nelder-
Mead algorithm to on-line scenarios relies extensively on parallel processing. The
main thread launches the 𝑛+1 model threads and starts a continuous event processing
loop. This loop dispatches the incoming events to the model threads and, whenever
it reaches the sample size interval, assesses the running models, and calculates the
new sample size. The model assessment involves the ordering of the 𝑛 + 1 models
by RMSE value and the application of the Nelder-Mead algorithm to substitute the
worst model. The Nelder-Mead parallel implementation creates a dedicated thread
per Nelder-Mead operator, totaling seven threads. Each Nelder-Mead operator thread
generates a new model and calculates the incremental RMSE using the instances of
the last sample size interval. The worst model is substituted by the Nelder-Mead
operator thread model with the lowest RMSE.
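As an illustration, the sample-size rule in Equation (1), together with the lower bound of 30 observations, can be written as in the following sketch; the function name and arguments are ours, not those of the implementation in [4].

import statistics

def dynamic_sample_size(recent_rmse, error_margin, min_size=30):
    # Equation (1): S_size = 4*sigma^2 / M^2, where sigma is the standard
    # deviation of recent RMSE values and M is the desired error margin;
    # the result is floored at min_size to avoid high-variance estimates.
    sigma = statistics.stdev(recent_rmse)
    return max(int(4 * sigma ** 2 / error_margin ** 2), min_size)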
Figure 4 presents the critical difference diagram [2] of three hyper-parameter
tuning algorithms: SPT, Grid search, and default parameter values, on four benchmark
classification datasets. The diagram clearly illustrates the good performance of SPT.

4 Conclusions

This paper reviews our recent work in learning from data streams. The two works
present different approaches to dealing with high-speed and time-evolving data:
from applied research in fraud detection to fundamental research on hyperparameter

Fig. 4 Critical Difference Diagram comparing Self hyperparameter tuning, Grid hyperparameter
tuning, and default parameters in 4 classification problems.

optimization for streaming algorithms. The first work identifies bursts of activity
in phone calls, using approximate counting with forgetting. The last work presents a
streaming optimization method to find the minimum of a function and its application
in finding the hyper-parameter values that minimize the error. We believe that the
two works reported here will have an impact on the work of other researchers.

Acknowledgements I would like to thank my collaborators Bruno Veloso and Rita P. Ribeiro, who contributed to this work.

References

1. Ali, M. A., Azad, M. A., Centeno, M. P., Hao, F., van Moorsel, A.: Consumer-facing technology
fraud: Economics, attack methods and potential solutions. Future Generation Computer Systems,
100, 408–427 (2019)
2. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research, 7(Jan), 1–30 (2006)
3. Laleh, N., Azgomi, M. A.: A taxonomy of frauds and fraud detection techniques. In International
Conference on Information Systems, Technology and Management, pp. 256–267. Springer
(2009)
4. Veloso, B., Gama, J., Malheiro, B., Vinagre, J.: Hyperparameter self-tuning for data streams.
Information Fusion, 76, 75–86 (2021)
5. Veloso, B., Tabassum, S., Martins, C., Espanha, R., Azevedo, R., Gama, J.: Interconnect bypass
fraud detection: a case study. Annales des Télécommunications, 75(9), 583–596 (2020)

Old and New Constraints in Model Based
Clustering

Luis A. García-Escudero, Agustín Mayo-Iscar, Gianluca Morelli, and Marco Riani

Abstract Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constraints on the scatter matrices of the components, these maximizations result in ill-posed problems. Moreover, without constraints, non-interesting or “spurious" clusters are often detected by the EM and CEM algorithms traditionally used for the maximization of the likelihood criteria. A useful approach to avoid spurious solutions is to restrict the relative scatter of the components by a prespecified tuning constant. Recently, new methodologies for constrained parsimonious model-based clustering have been introduced which include, as limit cases, the 14 parsimonious models that are often applied in model-based clustering when assuming normal components. In this paper we initially review the traditional approaches and illustrate through an example the benefits of the adoption of the new constraints.

Keywords: model based clustering, mixture modelling, constraints

L. A. García-Escudero
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain,
e-mail: [email protected]
A. Mayo-Iscar
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain,
e-mail: [email protected]
G. Morelli
Department of Economics and Management and Interdepartmental Centre of Robust Statistics,
University of Parma, Italy, e-mail: [email protected]
M. Riani ( )
Department of Economics and Management and Interdepartmental Centre of Robust Statistics,
University of Parma, Italy, e-mail: [email protected]


1 Introduction

Given a sample of observations {𝑥1 , ..., 𝑥 𝑛 } in R 𝑝 , a widely used method in unsupervised learning is to assume multivariate normal components and to adopt a maximum likelihood approach for clustering purposes. With this idea in mind, well-
maximum likelihood approach for clustering purposes. With this idea in mind, well-
known classification and mixture likelihood approaches can be followed.
In this work, we use 𝜙(·; 𝜇, Σ) to denote the probability density function of a
𝑝-variate normal distribution with mean 𝜇 and covariance matrix Σ.
In the classification likelihood approach we search for a partition {𝐻1 , ..., 𝐻 𝑘 } of
the indices {1, ..., 𝑛}, centres 𝜇_1, ..., 𝜇_𝑘 in R^𝑝, symmetric positive semidefinite 𝑝 × 𝑝 scatter matrices Σ_1, ..., Σ_𝑘 and positive weights 𝜋_1, ..., 𝜋_𝑘 with ∑_{𝑗=1}^{𝑘} 𝜋_𝑗 = 1, which maximize

∑_{𝑗=1}^{𝑘} ∑_{𝑖∈𝐻_𝑗} log( 𝜋_𝑗 𝜙(𝑥_𝑖 ; 𝜇_𝑗 , Σ_𝑗 ) ).    (1)

On the other hand, in the mixture likelihood approach, we seek the maximization
of
∑_{𝑖=1}^{𝑛} log( ∑_{𝑗=1}^{𝑘} 𝜋_𝑗 𝜙(𝑥_𝑖 ; 𝜇_𝑗 , Σ_𝑗 ) ),    (2)
with similar notation and conditions on the parameters as above. In this second
approach, a partition into 𝑘 groups can be also obtained, from the fitted mixture
model, by assigning each observation to the cluster-component with the highest
posterior probability.
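As a simple illustration (not the code used later in the paper), the mixture log-likelihood (2) and the posterior-based assignment can be computed as follows for given parameter values.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik_and_labels(X, weights, means, covs):
    # Evaluate the mixture log-likelihood (2) for an n x p data matrix X and
    # assign each observation to the component with the highest posterior.
    dens = np.column_stack([w * multivariate_normal(mean=m, cov=S).pdf(X)
                            for w, m, S in zip(weights, means, covs)])   # n x k
    loglik = np.log(dens.sum(axis=1)).sum()   # objective (2)
    labels = dens.argmax(axis=1)              # MAP cluster assignment
    return loglik, labels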
Unfortunately, it is well-known that the maximization of “log-likelihoods" like
(1) and (2) without constraints on the Σ 𝑗 matrices is a mathematically ill-posed
problem [1, 2]. To see this unboundedness issue, we can just take 𝜇1 = 𝑥 1 , 𝜋1 > 0
and |Σ1 | → 0, making (2) diverge to infinity, or (1) diverge as well with 𝐻1 = {1}.
This lack of boundedness can be solved by just focusing on local maxima of
the likelihood target functions. However, many local maxima are often found and
it is difficult to know which are the most interesting ones. See [3] for a detailed
discussion of this issue. In fact, non-interesting local maxima denoted as “spurious"
solutions, which consist of a few, almost collinear, observations, are often detected
by the Classification EM algorithm (CEM), traditionally applied when maximizing
(1), and by the EM algorithm, traditionally applied when maximizing (2). A recent
review of approaches for dealing with this lack of boundedness and for reducing the
detection of spurious solutions can be found in [4].
It is also common to enforce constraints on the Σ 𝑗 scatter matrices when maxi-
mizing (1) or (2). Among them, the use of “parsimonious” models [5, 6] is one of
the most popular and widely applied approaches in practice. These parsimonious
models follow from a decomposition of the Σ 𝑗 scatter matrices as

Σ_𝑗 = 𝜆_𝑗 Ω_𝑗 Γ_𝑗 Ω′_𝑗 ,    (3)

with 𝜆_𝑗 = |Σ_𝑗 |^{1/𝑝} (volume parameters),


Γ_𝑗 = diag(𝛾_𝑗1 , ..., 𝛾_𝑗𝑙 , ..., 𝛾_𝑗𝑝 )  with  det(Γ_𝑗 ) = ∏_{𝑙=1}^{𝑝} 𝛾_𝑗𝑙 = 1

(shape matrices), and Ω_𝑗 (rotation matrices) with Ω_𝑗 Ω′_𝑗 = I_𝑝 . Different constraints on the 𝜆_𝑗 , Ω_𝑗 and Γ_𝑗 elements are considered across components to get 14 par-
simonious models (which are coded with a combination of three letters). These
models reduce notably the number of free parameters to be estimated, so improving
efficiency and model interpretability. Moreover, many of them turn the constrained
maximization of the likelihoods into well-defined problems and help to avoid spu-
rious solutions. Unfortunately, the problems remain for models with unconstrained
𝜆 𝑗 volume parameters, which are coded with the first letter as a V (V** models).
Aside from relying on good initializations, it is common to consider the early stop-
ping of iterations when approaching scatter matrices with very small eigenvalues
or when detecting components accounting for a reduced number of observations. A
not fully iterated solution (or no solution at all) is then returned in these cases. The
idea is known, for instance, to be problematic when dealing with (well-separated)
components made up of a few observations.
Starting from a seminal paper by [7], an alternative approach is to constrain the
Σ 𝑗 scatter matrices by specifying some tuning constants that control the strength of
the constraints. In this direction, the ratio between the largest and the smallest of the
𝑘 × 𝑝 eigenvalues of the Σ 𝑗 matrices was forced to be smaller than a given fixed
constant 𝑐∗ ≥ 1 [8, 9, 10, 11, 12]. This means that the maximization of (1) and (2)
is done under the (simpler) constraint:

max_{𝑗,𝑙} 𝜆_𝑙 (Σ_𝑗 ) / min_{𝑗,𝑙} 𝜆_𝑙 (Σ_𝑗 ) ≤ 𝑐∗ ,    (4)

where {𝜆_𝑙 (Σ_𝑗 )}_{𝑙=1}^{𝑝} is the set of eigenvalues of the Σ_𝑗 matrix, 𝑗 = 1, ..., 𝑘.
affine equivariance. Unfortunately, such a high 𝑐∗ value does not always successfully
prevent us from incurring into spurious solutions.

2 The New Constraints

García-Escudero et al. [13] have recently introduced three different types of con-
straints on the Σ 𝑗 matrices which depend on three constants 𝑐 det , 𝑐 shw and 𝑐 shb all of
them being greater than or equal to 1.
The first type of constraint serves to control the maximal ratio among determinants
and, consequently, the maximum allowed difference between component volumes:
“deter":   max_{𝑗=1,...,𝑘} |Σ_𝑗 | / min_{𝑗=1,...,𝑘} |Σ_𝑗 | = max_{𝑗=1,...,𝑘} 𝜆_𝑗^𝑝 / min_{𝑗=1,...,𝑘} 𝜆_𝑗^𝑝 ≤ 𝑐_det .    (5)

The second type of constraint controls departures from sphericity “within” each
component:
shape-“within":   max_{𝑙=1,...,𝑝} 𝛾_𝑗𝑙 / min_{𝑙=1,...,𝑝} 𝛾_𝑗𝑙 ≤ 𝑐_shw   for 𝑗 = 1, ..., 𝑘.    (6)

This provides a set of 𝑘 constraints that, in the most constrained case 𝑐_shw = 1, impose Γ_1 = ... = Γ_𝑘 = I_𝑝 , where I_𝑝 is the identity matrix of size 𝑝, i.e., sphericity of the components.
Note that the new determinant-and-shape constraints (based on 𝑐_det > 1 and 𝑐_shw = 1) allow us to deal with spherical “heteroscedastic" cases, whereas the eigenvalue ratio constraint in (4) with 𝑐∗ = 1 can only handle the spherical “homoscedastic" case.
case. Constraints (5) and (6) were the basis for the “deter-and-shape” constraints in
[14]. These two constraints alone resulted in mathematically well-defined constrained
maximizations of the likelihoods in (1) and (2). However, although highly operative
in many cases, they do not include, as limit cases, all the already mentioned 14
parsimonious models. For instance, we may be interested in the same (or not very
different) Γ 𝑗 or Σ 𝑗 matrices for all the mixture components and these cannot be
obtained as limit cases from the “deter-and-shape” constraints.
The third constraint serves to control the maximum allowed difference between
shape elements “between” components:
shape-“between":   max_{𝑗=1,...,𝑘} 𝛾_𝑗𝑙 / min_{𝑗=1,...,𝑘} 𝛾_𝑗𝑙 ≤ 𝑐_shb   for 𝑙 = 1, ..., 𝑝.    (7)

This new type of constraint allows us to impose “similar” shape matrices for the
components and, consequently, enforce Γ1 = ... = Γ𝑘 in the most constrained
𝑐 shb = 1 case.
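For given scatter-matrix estimates, the quantities bounded in (5), (6) and (7) can be computed as in the small NumPy sketch below. This is only an illustration, not the constrained estimation itself nor the tclust/FSDA code, and sorting the shape elements in decreasing order within each component is an assumption of the sketch.

import numpy as np

def constraint_ratios(scatters):
    # Empirical ratios controlled by constraints (5)-(7) for a list of p x p
    # scatter matrices: determinant ratio, largest shape-"within" ratio over
    # components, and largest shape-"between" ratio over coordinates.
    p = scatters[0].shape[0]
    lam = np.array([np.linalg.det(S) ** (1.0 / p) for S in scatters])   # volumes lambda_j
    gam = np.array([np.sort(np.linalg.eigvalsh(S))[::-1] / l            # shape elements gamma_jl
                    for S, l in zip(scatters, lam)])
    deter = (lam.max() / lam.min()) ** p                                # constraint (5)
    shape_within = (gam.max(axis=1) / gam.min(axis=1)).max()            # constraint (6)
    shape_between = (gam.max(axis=0) / gam.min(axis=0)).max()           # constraint (7)
    return deter, shape_within, shape_between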

3 An Illustration Example of the New Constraints

Figure 1 shows an example based on three groups. The data have been generated imposing equal determinants (𝑐_det = 1), a considerable departure from sphericity “within" each component (𝑐_shw = 40), and a very moderate difference in shape elements “between" components (𝑐_shb = 1.3). No constraint has been imposed on the rotation matrices. Finally, an average overlap of 0.10 has been imposed. These data sets have been generated through the MixSim method of [15], as extended by [16] and incorporated into the FSDA Matlab toolbox [17]. The overlap is defined as a sum of pairwise misclassification probabilities; see [16] for more details.
The application of the traditional tclust approach with maximum ratio between eigenvalues (𝑐∗) equal to 128 and 10^10, respectively, produces the classifications shown in the left panels of Figure 2. In fact, it can be seen that the results in the top left panel would be exactly the same for any choice of 𝑐∗ within the interval [16, 128]. This means that a higher value of 𝑐∗ would apparently be needed to detect


Fig. 1 An example with simulated data with 3 clusters in two dimensions. The average overlap is 0.10. The data have been generated using equal determinants, a moderate difference in shape elements “between" components, and a considerable departure from sphericity “within" each component.

those two almost parallel clusters shown in Figure 1. However, choosing a greater value for 𝑐∗ may destroy the desired protection against spurious solutions provided by the constraints. For example, we see in the lower left panel how the choice 𝑐∗ = 10^10 results in the detection of a spurious group consisting of a single observation.
The panels on the right, on the other hand, show the partitions resulting from the
3 new constraints imposed on the components covariance matrices. The top right
panel shows the result of applying the 3 new restrictions with values of the tuning
constants very close to the real values used to generate the dataset. We can see
that, in this case, it is possible to recover the real structure of the data generating
process. Moreover, the real cluster structure is also recovered in the lower right panel
by choosing larger values of these tuning constants, but not too large just to avoid
detection of spurious solutions. Some guidelines about how to choose these tuning
constants can be found in [13].


Fig. 2 Comparison between the traditional (left panels) and new tclust procedure (right panels).

References

1. Kiefer, J., Wolfowitz, J.: Consistency of the maximum likelihood estimator in the presence of
infinitely many incidental parameters. Ann. Math. Stat. 27, 887-906 (1956)
2. Day, N. E.: Estimating the components of a mixture of normal distributions. Biometrika, 56,
463-474 (1969)
3. McLachlan, G., Peel, D. A.: Finite Mixture Models. Wiley Series in Probability and Statistics,
New York (2000)
4. García-Escudero, L. A., Gordaliza, A., Greselin, F., Ingrassia, S., Mayo-Iscar, A.: Eigenvalues
and constraints in mixture modeling: geometric and computational issues. Adv. Data Anal.
Classif. 12, 203-233 (2018)
5. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28, 781-
793 (1995)
6. Banfield, J. D., Raftery, A. E.: Model-based Gaussian and non-Gaussian clustering. Biometrics
49, 803-821 (1993)
7. Hathaway, R. J.: A constrained formulation of maximum likelihood estimation for normal
mixture distributions. Ann. Stat. 13, 795-800 (1985)
8. Ingrassia, S., Rocci, R.: Constrained monotone EM algorithms for finite mixture of multivariate
Gaussians. Comput. Stat. Data Anal. 51, 5339-5351 (2007)
9. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming
approach to robust cluster analysis. Ann. Stat. 36, 1324-1345 (2008)
10. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Exploring the number of
groups in robust model-based clustering. Stat. Comput. 21, 585-599 (2011)
11. García-Escudero, L. A., Gordaliza, A., Mayo-Iscar, A.: A constrained robust proposal for
mixture modeling avoiding spurious solutions. Adv. Data Anal. Classif. 8, 27-43 (2014)

12. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Avoiding spurious local
maximizers in mixture modeling. Stat. Comput. 25, 619-633 (2015)
13. García-Escudero, L. A., Mayo-Iscar, A., Riani, M.: Constrained parsimonious model-based
clustering. Stat. Comput. 32 (2022)
14. García-Escudero, L. A., Mayo-Iscar, A., Riani, M.: Model-based clustering with determinant-
and-shape constraint. Stat. Comput. 25, 1-18 (2020)
15. Maitra, R., Melnykov, V.: Simulating data to study performance of finite mixture modeling
and clustering algorithms. J. Comput. Graph. Stat. 19, 354-376 (2010)
16. Riani, M., Cerioli, A., Perrotta, D., Torti, F.: Simulating mixtures of multivariate data with
fixed cluster overlap in FSDA library. Adv. Data Anal. Classif. 9, 461-481 (2015)
17. Riani, M., Perrotta, D., Torti, F.: FSDA: a Matlab toolbox for robust analysis and interactive
data exploration. Chemometr. Intell. Lab. Syst. 116, 17-32 (2012)

Clustering Student Mobility Data in 3-way
Networks

Vincenzo Giuseppe Genova, Giuseppe Giordano, Giancarlo Ragozini, and Maria Prosperina Vitale

Abstract The present contribution aims at introducing a network data reduction method for the analysis of 3-way networks in which classes of nodes of different
types are linked. The proposed approach enables simplifying a 3-way network into
a weighted two-mode network by considering the statistical concept of joint depen-
dence in a multiway contingency table. Starting from a real application on student
mobility data in Italian universities, a 3-way network is defined, where provinces of
residence, universities and educational programmes are considered as the three sets
of nodes, and occurrences of student exchanges represent the set of links between
them. The Infomap community detection algorithm is then chosen for partitioning
two-mode networks of students’ cohorts to discover different network patterns.

Keywords: 3-way network, complex network, community detection, mobility data, tertiary education

Vincenzo Giuseppe Genova


Department of Economics, Business, and Statistics, University of Palermo, Italy,
e-mail: [email protected]
Giuseppe Giordano
Department of Political and Social Studies, University of Salerno, Italy,
e-mail: [email protected]
Giancarlo Ragozini
Department of Political Science, Federico II University of Naples, Italy,
e-mail: [email protected]
Maria Prosperina Vitale ( )
Department of Political and Social Studies, University of Salerno, Italy,
e-mail: [email protected]


1 Introduction

Many complex relational data structures can be described as multimode or multiway networks in which nodes belonging to different modes are linked. The most common
multimode network in social networks is represented by the affiliation network,
where two-mode data, actors and events, form a bipartite graph divided into two
groups [6]. In the case of tripartite networks, we deal with three types of nodes, and
different graph structures can be defined.
Although only a few papers deal with methods for these networks, in recent years, a
growing number of works have appeared –especially in bipartite and tripartite cases–
to disentangle the inherent complexity of such kinds of data structures. Looking at
clustering and community detection algorithms proposed to partition a network into
groups, we can identify some strands, all deriving from generalizations of methods
suited for one-mode [19] and two-mode networks [2]. A classical approach consists
of applying the usual community detection algorithms on a unique supra-adjacency
matrix defined by combining all the possible two-mode networks in a block matrix
[11, 15]. Alternative methods rely on projecting each two-mode networks and on
applying separately the usual community detection algorithms on these matrices
[10]. In addition, there are methods adopting both an optimization procedure for
3-way networks [16, 17, 14] by extending the idea of bipartite modularity [2], and
an indirect blockmodeling approach by deriving a dissimilarity measure based on
structural equivalence concept [3].
In our opinion, approaches that analyse the 𝑘 modes by considering the collection of the 𝑘(𝑘 − 1)/2 two-mode networks [10] cannot take into account statistical associations among all modes at the same time. Hence, the aim of the
contribution is to present a network data reduction method based on the concept of
joint dependence in a multiway contingency table [1].
Starting from real applications on the Italian student mobility phenomenon in
higher education [12, 21, 7, 8, 13, 22], a 3-way network is defined, where provinces
of residence, universities and educational programmes are considered as the three
modes. Student mobility flows, measured in terms of occurrences, represent the set
of links between them. Assuming that the statistical dependency between the set of
nodes provinces of residence and the other two sets of nodes can be captured by
the joined pair of nodes (universities and educational programmes), the tripartite
network is transformed into a bipartite network, where the two modes are given by
Italian provinces of residence (first mode) and the set of nodes given by all possi-
ble pairs of universities and educational programmes (second mode). Thus, taking
advantage of this approach of network simplification, network indexes and cluster-
ing techniques for bipartite networks are available. Hence, the Infomap community
detection algorithm is adopted [9, 4] to partition the derived network.
The remainder of the paper is organized as follows. Section 2 presents the details
of the proposed strategy of analysis, and the main results are reported from the
analysis of student mobility data of Italian universities. Section 3 provides final
remarks.

2 Simplification of 3-way Networks

In the present paper, the case of a tripartite network is considered as an example


to show how the proposed network data simplification method works. In particular,
we consider the real case study of student mobility paths in Italian universities. The
MOBYSU.IT dataset1 enables reconstruction of network data structures considering
student mobility flows among territorial units and universities.
More formally, given V𝑃 ≡ {𝑝 1 , . . . , 𝑝 𝑖 , . . . , 𝑝 𝐼 }, the set of 𝐼 provinces of
residence; V𝑈 ≡ {𝑢 1 , . . . , 𝑢 𝑗 , . . . , 𝑢 𝐽 }, the set of 𝐽 Italian universities, and
V𝐸 ≡ {𝑒 1 , . . . , 𝑒 𝑘 , . . . , 𝑒 𝐾 }, the set of 𝐾 educational programmes, a weighted tri-
partite 3-uniform hyper-graph T can be defined, consisting of a triple (V, L, W),
with V = {V𝑃 , V𝑈 , V𝐸 } the collection of three sets of vertices, one for each mode,
and being L = {L 𝑃𝑈 𝐸 }, L 𝑃𝑈 𝐸 ⊆ V𝑃 × V𝑈 × V𝐸 , the collection of hyper-edges,
with generic term ( 𝑝 𝑖 , 𝑢 𝑗 , 𝑒 𝑘 ), which is the link joining the 𝑖-th province, the 𝑗-th
university, and the 𝑘-th educational programme. Finally, W is the set of weights,
obtained by the function 𝑤 : L 𝑃𝑈 𝐸 → N, and 𝑤( 𝑝 𝑖 , 𝑢 𝑗 , 𝑒 𝑘 ) = 𝑤𝑖 𝑗 𝑘 is the number
of students moving from a province 𝑝 𝑖 towards a university 𝑢 𝑗 in an educational
programme 𝑒 𝑘 . Such a network structure can be described as a three-way array
A = (𝑎 𝑖 𝑗 𝑘 ), with 𝑎 𝑖 𝑗 𝑘 ≡ 𝑤𝑖 𝑗 𝑘 , and it has been called a 3-way network [3].
To deal with such a complex network structure and aiming at obtaining commu-
nities in which three modes are mixed, we wish to simplify the tripartite nature of
the graph, without losing any significant information. In statistical terms, the array
A can be interpreted as a 3-way contingency table, and then the statistical techniques
to evaluate the association among variables (i.e. the modes) can be exploited [1].
Because a 3-way contingency table is a cross-classification of observations by the
levels of three categorical variables, we are defining a network structure where the
sets of nodes are the levels of the categorical variables. Specifically, we assume that
if two modes are jointly associated –as are, for their own nature, universities and
educational programmes– the tripartite network can be logically simplified into a
bipartite one. In the student mobility network, we join the pair of nodes in V𝑈 and
in V𝐸 , and then we deal with the relationships between these dyads and the nodes
in V𝑃 .
Following this assumption, the sets of nodes V𝑈 and V𝐸 are put together into a
set of joint nodes, namely V𝑈 𝐸 . The tripartite network T can now be represented
as a bipartite network B given by the triple {V ∗ , L ∗ , W ∗ }, with V ∗ = {V𝑃 , V𝑈 𝐸 }.
The set of hyper-edges L is thus simplified into a set of edges L ∗ = {L 𝑃,𝑈 𝐸 },
L 𝑃,𝑈 𝐸 ⊆ V𝑃 × V𝑈 𝐸 . The new edges ( 𝑝 𝑖 , (𝑢 𝑗 ; 𝑒 𝑘 )) connect a province 𝑝 𝑖 with an
educational programme 𝑒 𝑘 running in a given university 𝑢 𝑗 . The weights W ∗ are
the same as in the hyper-graph T , i.e., 𝑤∗_{𝑖,𝑗𝑘} = 𝑤_{𝑖𝑗𝑘} . Note that the weights contained
in the 3-way array A are preserved, but are now organized in a rectangular matrix A
of 𝐼 rows and (𝐽 × 𝐾) columns.

1 Database MOBYSU.IT [Mobilità degli Studi Universitari in Italia], research protocol MUR -
Universities of Cagliari, Palermo, Siena, Torino, Sassari, Firenze, Cattolica and Napoli Federico II,
Scientific Coordinator Massimo Attanasio (UNIPA), Data Source ANS-MUR/CINECA.
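In practice, if the counts 𝑤_{𝑖𝑗𝑘} are stored in a 3-way array, the flattening just described amounts to a reshape. The sketch below uses a hypothetical NumPy array W with illustrative dimensions, and shows the correspondence between columns of the flattened matrix and (university, programme) pairs.

import numpy as np

# hypothetical counts w_ijk for I provinces, J universities, K programmes
I, J, K = 107, 80, 45
W = np.random.poisson(0.05, size=(I, J, K))

# I x (J*K) weighted biadjacency matrix of the bipartite network B
A_flat = W.reshape(I, J * K)

# column c of A_flat corresponds to the pair (university j, programme k)
pairs = [(j, k) for j in range(J) for k in range(K)]
assert A_flat[0, 3] == W[0, pairs[3][0], pairs[3][1]]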

Taking advantage of this method, we aim to analyse weighted bipartite graphs adopting clustering methods. Among others, we use the Infomap community de-
tection algorithm [9, 4] to study the flows’ patterns in network structures instead
of modularity optimization proposed in topological approaches [18, 5]. Indeed, the
rationale of this algorithm –map equation– takes advantage of the duality between
finding communities and minimizing the length –codelength– of a random walker’s
movement on a network. The partition with the shortest path length is the one that
best captures the community structure in the bipartite data. Formally, the algorithm
defines a module partition M of n vertices into m modules such that each vertex is
assigned to one and only one module. The Infomap algorithm looks for the best M
partition that minimizes the expected codelength, 𝐿(𝑀), of a random walker, given
by the following map equation:
𝐿(𝑀) = 𝑞_↷ 𝐻(Q) + ∑_{𝑖=1}^{𝑚} 𝑝_↻^𝑖 𝐻(P^𝑖 )    (1)

In equation (1), 𝑞_↷ 𝐻(Q) represents the entropy of the movement between modules, weighted by the probability that the random walker switches modules on any given step (𝑞_↷), and ∑_{𝑖=1}^{𝑚} 𝑝_↻^𝑖 𝐻(P^𝑖 ) is the entropy of movements within modules, weighted by the fraction of within-module movements that occur in module 𝑖 plus the probability of exiting module 𝑖 (𝑝_↻^𝑖 ), such that ∑_{𝑖=1}^{𝑚} 𝑝_↻^𝑖 = 1 + 𝑞_↷ [9].
In our case, the Infomap algorithm is adopted to discover communities of students
characterized by similar mobility patterns. Indeed, to analyse mobility data, where
links represent patterns of student movement among territorial units and universities,
flow-based approaches are likely to identify the most important features. Finally, in
our student mobility network, to focus only on relevant student flows, a filtering
procedure is adopted by considering the Empirical Cumulative Distribution Function (ECDF) of the links’ weights.

2.1 Main Findings

Students’ cohorts enrolled in Italian universities in four academic years (a.y.) 2008–
09, 2011–12, 2014–15, and 2017–18 are analysed. The number of nodes for the sets
V𝑃 (107 provinces), V𝑈 (79-80 universities), and V𝐸 (45 educational programmes),
and the number of students involved in the four cohorts are quite stable over time
(Table 1). Furthermore, the percentage of movers (i.e., students enrolled in a univer-
sity outside of their region of residence) increased, from 16.4% in the a.y. 2008–09
to 20.6% in the a.y. 2017–18, and it is higher for males than females.

Table 1 Percentage of students according to their mobility status by cohort and gender.

                                          Mover status
Cohort      Gender   Students    Stayers%    Movers%
2008–09     F        136,381     84.2        15.8
            M        106,950     82.8        17.2
            Total    243,331     83.6        16.4
2011–12     F        126,606     81.7        18.3
            M        102,479     80.9        19.1
            Total    229,085     81.0        19.0
2014–15     F        121,121     80.5        19.5
            M        102,358     80.4        19.6
            Total    223,479     80.5        19.5
2017–18     F        134,315     79.1        20.9
            M        113,496     79.8        20.2
            Total    247,811     79.4        20.6

Following the network simplification approach, the tripartite networks –one for each cohort– are simplified into bipartite networks, and the four ECDFs of the links’ weights are considered to filter relevant flows. The distributions suggest that more than 50% of the links between pairs of nodes have weight equal to 1 (i.e., flows of only one student), and about 95% of the links have single-digit weights. Thus, networks holding links with a weight greater than or equal to 10 are further analysed.
To reveal groups of universities and educational programmes attracting students, the Infomap community detection algorithm is applied. Looking at Table 2, we notice a reduction in the number of communities from the first to the last student cohort, suggesting a sort of stabilization in the trajectories of movers towards brand universities of the center-north, together with an increase in north-north mobility [20] and a relevant dichotomy between scientific and humanistic educational programmes. Network visualizations by groups (Figures 1 and 2) confirm that the most attractive universities are located in the north of Italy, especially for educational programmes in economics and engineering (the Bocconi University, the Polytechnic of Turin and the Cattolica University).

Table 2 Number of communities, codelength, and relative saving codelength per cohort.

Cohort      Communities    Codelength    Relative saving codelength
2008–09     14             0.96          83%
2011–12     17             1.72          70%
2014–15     3              5.23          12%
2017–18     3              1.00          83%

Fig. 1 Network visualization by groups, student cohort a.y. 2008–09.

Fig. 2 Network visualization by groups, student cohort a.y. 2017–18.



3 Concluding Remarks

The proposed network simplification strategy on tripartite graphs defined for student mobility data provides interesting insights into the phenomenon under analysis. The main attractive destinations remain the northern universities, for educational programmes such as engineering and business. Besides the well-known south-to-north route, other interregional routes in the northern area appear. In addition, the reduction in the number of communities suggests a sort of stabilization in the mobility routes of movers towards brand universities, highlighting university destination choices that are close to labor market demand.
Hyper-graphs and multipartite networks still remain very active areas for research
and challenging tasks for scholars interested in discovering the complexities underly-
ing these kinds of data. Specific tools for such complex network structures should be
designed combining network analysis and other statistical techniques. As future lines
of research, the comparison of community detection algorithms that better represent
the structural constraints of the phenomena under analysis and the assessment of
other backbone approaches to filter the significant links will be developed.

Acknowledgements The contribution has been supported from Italian Ministerial grant PRIN 2017
“From high school to job placement: micro-data life course analysis of university student mobility
and its impact on the Italian North-South divide", n. 2017 HBTK5P - CUP B78D19000180001.

References

1. Agresti, A.: Categorical Data Analysis (Vol. 482). John Wiley & Sons, New York (2003)
2. Barber, M. J.: Modularity and community detection in bipartite networks. Phys. Rev. E, 76,
066102 (2007)
3. Batagelj, V., Ferligoj, A., Doreian, P.: Indirect Blockmodeling of 3-Way Networks. In: Brito
P., Cucumel G., Bertrand P., de Carvalho F. (eds) Selected Contributions in Data Analysis
and Classification. Studies in Classification, Data Analysis, and Knowledge Organization, pp.
151–159. Springer, Berlin, Heidelberg (2007)
4. Blöcker, C., Rosvall, M.: Mapping flows on bipartite networks. Phys. Rev. E, 102, 052305
(2020)
5. Blondel, V. D., Guillaume, J. L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities
in large networks. J. Stat. Mech.-Theory E, 10, P10008 (2008)
6. Borgatti, S. P., Everett, M. G.: Regular blockmodels of multiway, multimode matrices. Soc.
Networks, 14, 91–120 (1992)
7. Columbu, S., Porcu, M., Primerano, I., Sulis, I., Vitale, M.P.: Geography of Italian student
mobility: A network analysis approach. Socio. Econ. Plan. Sci. 73, 100918 (2021)
8. Columbu, S., Porcu, M., Primerano, I., Sulis, I., Vitale, M. P.: Analysing the determinants of
Italian university student mobility pathways. Genus, 77, 34 (2021)
9. Edler, D., Bohlin, L., Rosvall, M.: Mapping higher-order network flows in memory and
multilayer networks with infomap. Algorithms, 10, 112 (2017)
10. Everett, M. G., Borgatti, S.: Partitioning multimode networks. In: Doreian, P., Batagelj, V.,
Ferligoj, A. (eds.) Advances in Network Clustering and Blockmodeling, pp. 251-265, John
Wiley & Sons, Hoboken, USA (2020)

11. Fararo, T. J., Doreian, P.: Tripartite structural analysis: Generalizing the Breiger-Wilson for-
malism. Soc. Networks, 6, 141–175 (1984)
12. Genova, V. G., Tumminello, M., Aiello, F., Attanasio, M.: Student mobility in higher educa-
tion: Sicilian outflow network and chain migrations. Electronic Journal of Applied Statistical
Analysis, 12, 774–800 (2019)
13. Genova, V. G., Tumminello, M., Aiello, F., Attanasio, M.: A network analysis of student
mobility patterns from high school to master’s. Stat. Method. Appl., 30, 1445–1464 (2021)
14. Ikematsu, K., Murata, T.: A fast method for detecting communities from tripartite networks.
In: International Conference on Social Informatics, pp. 192-205. Springer, Cham (2013)
15. Melamed, D., Breiger, R. L., West, A. J.: Community structure in multi-mode networks:
Applying an eigenspectrum approach. Connections, 33, 18–23 (2013)
16. Murata, T.: Detecting communities from tripartite networks. In: Proceedings of the 19th
international conference on world wide web, pp. 1159-1160. (2010)
17. Neubauer, N., Obermayer, K.: Tripartite community structure in social bookmarking data.
New Rev. Hypermedia M., 17, 267-294 (2011)
18. Newman, M. E., Girvan, M.: Finding and evaluating community structure in networks. Phys.
Rev. E, 69, 026113 (2004)
19. Newman, M. E.: Modularity and community structure in networks. Proceedings of the National
Academy of Sciences, 103, 8577-8582 (2006)
20. Rizzi, L., Grassetti, L. Attanasio, M.: Moving from North to North: how are the students’
university flows? Genus 77, 1–22 (2021)
21. Santelli, F., Scolorato, C., Ragozini, G.: On the determinants of student mobility in an inter-
regional perspective: A focus on Campania region. Statistica Applicata - Italian Journal of
Applied Statistics, 31, 119–142 (2019)
22. Santelli, F., Ragozini, G., Vitale, M. P.: Assessing the effects of local contexts on the mobility
choices of university students in Campania region in Italy. Genus, 78, 5 (2022)

Clustering Brain Connectomes Through a
Density-peak Approach

Riccardo Giubilei

Abstract The density-peak (DP) algorithm is a mode-based clustering method that identifies cluster centers as data points that are surrounded by neighbors with lower density and far away from points with higher density. Since its introduction in 2014,
DP has reaped considerable success for its favorable properties. A striking advantage
is that it does not require data to be embedded in vector spaces, potentially enabling
applications to arbitrary data types. In this work, we propose improvements to
overcome two main limitations of the original DP approach, i.e., the unstable density
estimation and the absence of an automatic procedure for selecting cluster centers.
Then, we apply the resulting method to the increasingly important task of graph
clustering, here intended as gathering together similar graphs. Potential implications
include grouping similar brain networks for ability assessment or disease prevention,
as well as clustering different snapshots of the same network evolving over time to
identify similar patterns or abrupt changes. We test our method in an empirical
analysis whose goal is clustering brain connectomes to distinguish between patients
affected by schizophrenia and healthy controls. Results show that, in the specific
analysis, our method outperforms many existing competitors for graph clustering.

Keywords: nonparametric statistics, mode-based clustering, networks, graph clus-


tering, kernel density estimation

1 Introduction

Clustering is the task of grouping elements from a set in such a way that elements
in the same group, also defined as cluster, are in some sense similar to each other,
and dissimilar to those from other groups. Mode-based clustering is a nonparametric
approach that works by first estimating the density, and then identifying in some

Riccardo Giubilei ( )
Luiss Guido Carli, Rome, Italy, e-mail: [email protected]

way its modes and the corresponding clusters. An effective method to find modes
and clusters is through the density-peak (DP) algorithm [12], which has drawn
considerable attention since its introduction in 2014. One of the striking advantages
of DP is that it does not require data to be embedded in vector spaces, implying that
it can be applied to arbitrary data types, provided that a proper distance is defined.
In this work, we focus on its application to clustering graph-structured data objects.
The expression graph clustering can refer either to within-graph clustering or
to between-graph clustering. In the first case, the elements to be grouped are the
vertices of a single graph; in the second, the objects are distinct graphs. Here, graph
clustering is intended as between-graph clustering. Between-graph clustering is an
emerging but increasingly important task due to the growing need of analyzing and
comparing multiple graphs [10, 4]. Potential applications include clustering: brain
networks of different people for ability assessment, disease prevention, or disease
evaluation; online social ego networks of different users to find people with similar
social structures; different snapshots of the same network evolving over time to
identify similar patterns, cycles, or abrupt changes.
Heretofore, the task of between-graph clustering has not been exhaustively in-
vestigated in the literature, implying a substantial lack of well-established methods.
The goal of this work is to improve and adapt the density-peak algorithm to define a
fairly general method for between-graph clustering. For validation and comparison
purposes, the resulting procedure and its main competitors are applied to grouping
brain connectomes of different people to distinguish between patients affected by
schizophrenia and healthy controls.

2 Related Work

Existing techniques for between-graph clustering can be divided into two main
categories: 1) transforming graph-structured data objects into Euclidean feature
vectors in order to apply standard clustering algorithms; 2) using the distances
between the original graphs in distance-based clustering methods.
The most common technique within the first category is the use of classical
clustering techniques on the vectorized adjacency matrices [10]. Nonetheless, more
advanced numerical summaries have been proposed to better capture the structural
properties of the graphs and to decrease feature dimensionality. Examples include:
shell distribution [1], traces of powers of the adjacency matrix [10], and graph
embeddings such as graph2vec [11]; see [4] for a longer list. Techniques from the
first category share an important drawback: the transformation into feature vectors
necessarily implies loss of information. Additionally, methods for extracting features
may be domain-specific.
The second category features Partitioning Around Medoids (PAM) [7], or k-
medoids, which finds representative observations by iteratively minimizing a cost
function based on the distances between data objects, and assigns other observations
to the closest medoid. PAM’s main limitations are that it requires the number of
clusters in advance and can only identify convex-shaped groups. Density-based


spatial clustering of applications with noise [3], or DBSCAN, overcomes these two
constraints by computing the density of data points starting from their distances,
and defining clusters as samples of high density that are close to each other (and
surrounded by areas of lower density). A similar approach is the DP, which is
described in greater detail in Section 3.1. Alternatively, hierarchical clustering can
be applied to distances between graphs, as in [13], where a spectral Laplacian-based
distance is proposed and used. Finally, 𝑘-groups [8] is a clustering technique within
the Energy Statistics framework [14] where the goal is minimizing the total within-
cluster Energy distance, which is computed starting from the distances between
original observations.

3 Methods

In this section, we first describe the original DP approach; then, we introduce the
DP-KDE method, which is partly named after Kernel Density Estimation; finally,
we discuss how to employ it for graph clustering.

3.1 Original DP

The density-peak algorithm [12] is based on a simple idea: since cluster centers are
identified as the distribution’s modes, they must be 1) surrounded by neighbors with
lower density, and 2) at a relatively large distance from points with higher density.
Consequently, two quantities are computed for each observation 𝑥𝑖 : the local density
𝜌𝑖 , and the minimum distance 𝛿𝑖 from other data points with higher density. The
local density 𝜌𝑖 of 𝑥𝑖 is defined as:
$$\rho_i = \sum_{j} I(d_{ij} - d_c), \qquad (1)$$

where 𝐼 ( ·) is the indicator function, 𝑑𝑖 𝑗 = 𝑑 (𝑥𝑖 , 𝑥 𝑗 ) is the distance between 𝑥 𝑖 and


𝑥 𝑗 , and 𝑑 𝑐 is a cutoff distance. In simple terms, 𝜌𝑖 is the number of points that are
closer than 𝑑 𝑐 to 𝑥𝑖 . The DP algorithm is robust with respect to 𝑑 𝑐 , at least with large
datasets [12]. Once the density is computed, the definition of the minimum distance
𝛿𝑖 between point 𝑥𝑖 and any other point 𝑥 𝑗 with higher density is straightforward:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij}). \qquad (2)$$

By convention, the point with highest density has 𝛿𝑖 = 𝑚𝑎𝑥 𝑗 (𝑑𝑖 𝑗 ). The interpretation
of 𝛿𝑖 reflects the algorithm’s core idea: data points that are not local or global maxima
have their 𝛿𝑖 constrained by other points within the same cluster, hence cluster centers
have large values of 𝛿𝑖 . However, this is not sufficient: they also need to have a large 𝜌𝑖
because otherwise the point could be merely distant from any other. After identifying
cluster centers, other observations are assigned to the same cluster as their nearest
neighbor of higher density.
The density-peak algorithm has many favorable properties: it manages to detect nonspherical clusters; it does not require the number of clusters in advance or data to be embedded in vector spaces; it is computationally fast, because it does not explicitly maximize each data point's density field and performs cluster assignment in a single step; it estimates a clear population quantity; and it has only one tuning parameter (the cutoff distance 𝑑𝑐).
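To make Equations (1) and (2) concrete, the following sketch computes both quantities from a symmetric matrix of pairwise distances. The function and variable names (dp_quantities, D, d_c) are illustrative assumptions and not taken from the original DP implementation.

```python
import numpy as np

def dp_quantities(D, d_c):
    """Local densities rho (Eq. (1)) and minimum distances delta (Eq. (2))
    from a symmetric pairwise distance matrix D and a cutoff distance d_c."""
    n = D.shape[0]
    # rho_i: number of points closer than d_c to x_i (self excluded)
    rho = np.array([np.sum(D[i, np.arange(n) != i] < d_c) for i in range(n)])
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points with higher density
        if higher.size > 0:
            delta[i] = D[i, higher].min()    # Eq. (2)
        else:
            delta[i] = D[i].max()            # convention for the highest-density point
    return rho, delta
```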

3.2 DP-KDE

The density-peak approach also has drawbacks. Over the last few years, many articles
have proposed improvements to overcome two main critical points: the unstable
density estimation and the absence of an automatic procedure for selecting cluster
centers. In this work, we explicitly tackle these two aspects.
The unstable density estimation induced by Equation (1) has been widely shown
[9, 16, 15]. Although many solutions have been proposed, we espouse the research
line suggesting the use of Kernel Density Estimation (KDE) to compute 𝜌𝑖 [9, 15]:
$$\rho_i = \frac{1}{nh} \sum_{j=1}^{n} K\!\left( \frac{x_i - x_j}{h} \right). \qquad (3)$$

In Equation (3), ℎ is the bandwidth, which is a smoothing parameter, and 𝐾 (·) is


the kernel, which is a non-negative function weighting the contribution of each data
point to the density of the 𝑖-th observation. We use the Epanechnikov kernel, which
is normalized, symmetric, and optimal in the Mean Square Error sense [2]:
$$K(u) = \begin{cases} \tfrac{3}{4}(1 - u^2), & |u| \le 1 \\ 0, & |u| > 1 \end{cases} \qquad (4)$$
Equation (4) implies a null contribution of observation 𝑗 to the 𝑖-th density whenever
|(𝑥 𝑖 −𝑥 𝑗 )/ℎ| ≥ 1, while, in the opposite case, it results in a positive weight depending
quadratically on (𝑥𝑖 − 𝑥 𝑗 )/ℎ. Consequently, ℎ may be regarded as the cutoff distance
for the DP-KDE method.
The automatic selection of cluster centers involves many aspects: the cutoff dis-
tance, the number of clusters, and which data points to select. In the following, we
use a cutoff distance ℎ such that the average number of neighbors is between 1 and
2% of the sample size, as suggested by [12]. The number of clusters 𝑘 is here con-
sidered as a given parameter, leaving the search for its optimal value for future work.
Finally, the method for selecting data points as cluster centers is obtained refining
an intuition contained in [12], where candidates are observations with sufficiently
large values of 𝛾𝑖 = 𝛿𝑖 𝜌𝑖 . However, this quantity has two major drawbacks: first, if
𝛿𝑖 and 𝜌𝑖 are not defined over the same scale, results could be misleading; second, it
implicitly assumes that 𝛿𝑖 and 𝜌𝑖 shall be given the same weight in the decision. We
overcome these two limitations by first normalizing both 𝛿𝑖 and 𝜌𝑖 between 0 and
1, and then giving them different weights that are based on their informativeness.
We measure the latter using the Gini coefficient of the two (normalized) quantities,
under the assumption that the least concentrated distribution between the two is the
most informative. Specifically, each observation is given a measure of importance
that is defined as:
$$\gamma_i^G = \delta_{01,i}^{\,G(\delta_{01})} \; \rho_{01,i}^{\,G(\rho_{01})}, \qquad (5)$$
where 𝛿01 and 𝜌01 are the normalized versions of 𝛿 and 𝜌 respectively, 𝛿01,𝑖 and 𝜌01,𝑖
are the corresponding 𝑖-th values, and 𝐺 (𝑥) denotes the Gini coefficient of 𝑥. Then,
the selected cluster centers are the top 𝑘 observations in terms of 𝛾𝑖𝐺 . Assigning
observations to the same cluster as their nearest neighbor of higher density is what
concludes the DP-KDE method.
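A minimal sketch of this automatic selection rule is given below, assuming the densities and distances have already been computed and that their values are not all equal; the names gini and select_centers are illustrative rather than taken from the paper.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative vector x (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    if cum[-1] == 0:
        return 0.0
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def select_centers(rho, delta, k):
    """Top-k cluster centers according to the importance gamma_i^G of Eq. (5)."""
    rho01 = (rho - rho.min()) / (rho.max() - rho.min())         # normalize to [0, 1]
    delta01 = (delta - delta.min()) / (delta.max() - delta.min())
    gamma = delta01 ** gini(delta01) * rho01 ** gini(rho01)     # Eq. (5)
    centers = np.argsort(-gamma)[:k]
    return centers, gamma
```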

3.3 Graph Clustering

A graph is a mathematical object composed of a collection of vertices linked by


edges between them. Formally, a graph is denoted with G = (𝑉, 𝐸), where 𝑉 is
the set of vertices and 𝐸 is the set of edges. If 𝑒 ∈ 𝐸 joins vertices 𝑢, 𝑣 ∈ 𝑉, i.e.,
𝑒 = {𝑢, 𝑣}, then 𝑢 and 𝑣 are adjacent or neighbors. The number of edges incident with
any vertex 𝑣 is the degree of 𝑣. Each edge 𝑒 ∈ 𝐸 is represented through a numerical
value 𝑤𝑒 called edge weight: if weights are equal to 1 for all and only the existent
edges, and 0 for the others, G is unweighted; when existent edges have real-valued
weights, G is weighted. If 𝑤 {𝑢,𝑣 } = 𝑤 {𝑣,𝑢 } for all 𝑢, 𝑣 ∈ 𝑉, the graph G is undirected;
otherwise, it is directed. The entire information about G’s connectivity is stored in
a |𝑉 | × |𝑉 | adjacency matrix A whose generic entry in the 𝑢-th row and 𝑣-th column
is 𝑤𝑒 , where 𝑒 = {𝑢, 𝑣} and 𝑢, 𝑣 ∈ 𝑉.
The DP-KDE method can be used for graph clustering if a proper distance between
graphs is defined. In this work, we employ the Edge Difference Distance [6], which is
defined as the Frobenius norm of the difference between the two graphs’ adjacency
matrices. The choice is motivated by many factors: a flexible definition that can
be directly applied also to directed and weighted graphs, the reasonable results it
yields when node correspondence is a concern, and its limited computational time
complexity. Formally, the Edge Difference Distance between two graphs 𝑥𝑖 and 𝑥 𝑗
is defined as:
$$d_{ED}(x_i, x_j) = \| \mathbf{A}^i - \mathbf{A}^j \|_F := \sqrt{ \sum_{p} \sum_{q} \big| A^i_{pq} - A^j_{pq} \big|^2 }, \qquad (6)$$

where A𝑖 and A 𝑗 are the adjacency matrices of 𝑥 𝑖 and 𝑥 𝑗 respectively, and || · ||𝐹
denotes the Frobenius norm.
Consequently, the two fundamental quantities of the DP-KDE method are com-
puted in the following as:
$$\rho_i = \sum_{j=1}^{n} K\!\left( \frac{d_{ED}(x_i, x_j)}{h} \right), \qquad (7)$$

where 𝐾 (·) is the Epanechnikov kernel defined in Equation (4) and the normalizing
constant is omitted because we are simply interested in the ranking between the
densities, and:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} \big( d_{ED}(x_i, x_j) \big). \qquad (8)$$

Finally, cluster centers are selected as the observations with the largest values of 𝛾𝑖𝐺, as defined in Equation (5), and the other observations are assigned to the same cluster as their nearest neighbor of higher density, i.e., the point attaining 𝛿𝑖.
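Putting the pieces together, the sketch below outlines the whole procedure for a list of adjacency matrices of equal size. It reuses the select_centers helper sketched above, and all names are illustrative assumptions rather than the author's code.

```python
import numpy as np

def edge_difference_matrix(adjs):
    """Pairwise Edge Difference Distances (Eq. (6)) between adjacency matrices."""
    n = len(adjs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.linalg.norm(adjs[i] - adjs[j], ord="fro")
    return D

def dp_kde_graphs(adjs, h, k):
    """DP-KDE graph clustering: Epanechnikov densities (Eq. (7)), deltas (Eq. (8)),
    gamma-based centers, and nearest-higher-density assignment."""
    D = edge_difference_matrix(adjs)
    u = D / h
    # self term adds a constant 0.75 to every rho and does not change the ranking
    rho = np.sum(np.where(np.abs(u) < 1, 0.75 * (1.0 - u ** 2), 0.0), axis=1)
    n = len(adjs)
    delta, nearest_higher = np.empty(n), np.full(n, -1)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size > 0:
            nearest_higher[i] = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, nearest_higher[i]]
        else:
            delta[i] = D[i].max()
    centers, _ = select_centers(rho, delta, k)   # from the previous sketch
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    for i in np.argsort(-rho):                   # process points by descending density
        if labels[i] != -1:
            continue
        j = nearest_higher[i]
        if j == -1:                              # unlabeled density peak: nearest center
            j = centers[np.argmin(D[i, centers])]
        labels[i] = labels[j]
    return labels, centers
```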

4 Empirical Analysis

The DP-KDE method for graph clustering is employed in an unsupervised empirical


analysis where the ground truth is known, and its performance is compared in terms
of accuracy both with natural competitors and with a method treating the problem
as supervised. The ultimate goal is clustering brain connectomes, one for each
individual, correctly distinguishing between patients affected by schizophrenia (SZ)
and healthy controls.
We use publicly available1 data from a recent study [5] whose aim is finding
relevant links between Regions of Interest (ROIs) for predicting schizophrenia from
multimodal brain connectivity data. The cohort is composed of 27 schizophrenic
patients and 27 age-matched healthy participants acting as control subjects. In the
current work, we focus only on this cohort’s functional Magnetic Resonance Imaging
(fMRI) connectomes. Functional connectivity matrices have been computed starting
from fMRI scans, treating them as time series, and computing Pearson’s correlation
coefficient between time series for distinct ROIs. The resulting matrices are weighted,
undirected, and made of 83 nodes.
The aforementioned study [5] treats every functional connectivity matrix as a
single multivariate realization of (83 · 82)/2 = 3403 numeric variables, each repre-
senting a connection between two of the 83 ROIs. They reduce feature dimensional-
ity by performing Recursive Feature Elimination based on Support Vector Machines
(SVM-RFE), and tackle the classification problem as supervised using 20 repetitions
of nested 5-fold cross-validation. When using only functional connectivity data, they
achieve an average accuracy of 68.28%2 over the resulting 100 test sets.

1 https://doi.org/10.5281/zenodo.3758534.
2 This exact figure is not included in the article, but the analysis is fully reproducible since the authors
made their source code available at https://github.com/leoguti85/BiomarkersSCHZ.
The approach we adopt in this work is rather different. First, graphs are analyzed
in their original form, without any simplification to numeric variables, resulting in
only one graph-structured variable. Observations are 54, each one representing the
functional connectome of a different individual. We tackle the problem with an un-
supervised classification approach seeking to cluster connectomes into two groups:
schizophrenic and healthy. To this end, we use the DP-KDE method for graph cluster-
ing described in Section 3.3. Starting from the 54 connectomes, each observation’s
local density 𝜌𝑖 and minimum distance 𝛿𝑖 are computed using Equations (7) and
(8), respectively. The centers of the two clusters are those whose 𝛾𝑖𝐺 is largest.
Then, other observations are assigned to the same cluster as their nearest neighbor
of higher density. Finally, the clustering performance is evaluated by comparing
the algorithm’s assignment to the ground truth. The DP-KDE method achieves an
accuracy of 70.37%, which is more than 2% higher than the one obtained in [5].
Table 1 includes the performance in terms of accuracy of both the DP-KDE
and the SVM-RFE methods, as well as that of other graph clustering competitors.
Specifically, we consider: the classical DP algorithm on the original data objects,
with the same cutoff distance as in DP-KDE and manually selected cluster centers;
k-means clustering on the 3403 numeric variables obtained from vectorizing the
adjacency matrices; DBSCAN on the original data objects, with parameters 𝜀 = 20.2
and 15 as the minimum number of points required to form a dense region; PAM and
𝑘-groups on the original data objects. In all these cases, the number of clusters has
been kept fixed to 𝑘 = 2. The method that yields the best accuracy in the specific
problem is the DP-KDE.

Table 1 Accuracy for DP-KDE and some of its possible competitors.


Method DP-KDE SVM-RFE DP k-means DBSCAN PAM 𝑘-groups

Accuracy (%) 70.37 68.28 62.96 62.96 61.11 62.96 62.96

5 Concluding Remarks

After explaining the importance of graph clustering and briefly reviewing some
existing methods to perform this task, we have considered the possibility of adopting
a density-peak approach. We have improved the original DP algorithm by using
a more robust definition of the density 𝜌𝑖 , and by automatically selecting cluster
centers based on the quantity 𝛾𝑖𝐺 we have introduced. We have also selected a proper
distance between graphs, namely, the Edge Difference Distance. Finally, we have
used the resulting method in an empirical analysis with the goal of clustering brain
connectomes to distinguish between schizophrenic patients and healthy controls.
Our method outperforms another one treating the specific task as supervised, and it
is by far the best one with respect to many graph clustering competitors.
An initial idea for future work is the search for the optimal number of clusters.
This may be achieved either by fixing a threshold for 𝛾𝑖𝐺 or by selecting all the data
points after the largest increase in terms of 𝛾𝑖𝐺 . Also the cutoff distance could be
tuned, possibly maximizing in some way the dispersion of points in the bivariate
distribution of 𝜌 and 𝛿. Then, the DP-KDE method needs to be extended beyond the
univariate case. Finally, other distances between graphs could be considered to better
reflect alternative application-specific needs, e.g., when graphs are not defined over
the same set of nodes.

Acknowledgements The author would like to thank Pierfrancesco Alaimo Di Loro, Federico Car-
lini, Marco Perone Pacifico, and Marco Scarsini for several engaging and stimulating discussions.

References

1. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of Internet topology using
k-shell decomposition. Proc. Natl. Acad. Sci. 104, 11150–11154 (2007)
2. Epanechnikov, V.: Non-parametric estimation of a multivariate probability density. Theory
Probab. Its Appl. 14, 153–158 (1969)
3. Ester, M., Kriegel, H., Sander, J., Xu, X., et al.: A density-based algorithm for discovering
clusters in large spatial databases with noise. KDD-96, 34, 226–231 (1996)
4. Gutiérrez-Gómez, L., Delvenne, J.: Unsupervised network embeddings with node identity
awareness. Appl. Netw. Sci. 4, 1–21 (2019)
5. Gutiérrez-Gómez, L., Vohryzek, J., Chiêm, B., Baumann, P., Conus, P., Do Cuenod, K.,
Hagmann, P., Delvenne, J.: Stable biomarker identification for predicting schizophrenia in the
human connectome. NeuroImage Clin. 27, 102316 (2020)
6. Hammond, D., Gur, Y., Johnson, C.: Graph diffusion distance: A difference measure for
weighted graphs based on the graph Laplacian exponential kernel. IEEE GlobalSIP 2013, pp.
419–422 (2013)
7. Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. Proc. of the Statistical Data
Analysis based on the L1 Norm Conference, Neuchatel, Switzerland, pp. 405–416 (1987)
8. Li, S., Rizzo, M.: K-groups: A generalization of k-means clustering. ArXiv Preprint
ArXiv:1711.04359 (2017)
9. Mehmood, R., Zhang, G., Bie, R., Dawood, H., Ahmad, H.: Clustering by fast search and find
of density peaks via heat diffusion. Neurocomputing. 208, 210–217 (2016)
10. Mukherjee, S., Sarkar, P., Lin, L.: On clustering network-valued data. NIPS2017, pp. 7074–
7084 (2017)
11. Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., Jaiswal, S.: graph2vec:
Learning distributed representations of graphs. ArXiv Preprint ArXiv:1707.05005 (2017)
12. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344,
1492–1496 (2014)
13. Shimada, Y., Hirata, Y., Ikeguchi, T., Aihara, K.: Graph distance for complex networks. Sci.
Rep. 6, 1–6 (2016)
14. Székely, G., Rizzo, M.: The energy of data. Annu. Rev. Stat. Appl. 4, 447–479 (2017)
15. Wang, X., Xu, Y.: Fast clustering using adaptive density peak detection. Stat. Methods Med.
Res. 26, 2800–2811 (2017)
16. Xie, J., Gao, H., Xie, W., Liu, X., Grant, P.: Robust clustering by detecting density peaks and
assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Similarity Forest for Time Series Classification

Tomasz Górecki, Maciej Łuczak, and Paweł Piasecki

Abstract The idea of similarity forest comes from Sathe and Aggarwal [19] and is
derived from random forest. Random forests, during already 20 years of existence, have proved to be one of the most effective methods, showing top performance across a vast array of domains while preserving simplicity, time efficiency, and interpretability. However, their usage is limited to multidimensional data. Similarity forest does not require such a representation; it only needs similarities between observations to be computed. Thus, it may be applied to data for which a multidimensional representation is not available. In this paper, we propose the implementation of
similarity forest for time series classification. We investigate 2 distance measures:
Euclidean and dynamic time warping (DTW) as the underlying measure for the
algorithm. We compare the performance of similarity forest with 1-nearest neighbor
and random forest on the UCR (University of California, Riverside) benchmark
database. We show that similarity forest with DTW, taking into account mean ranks,
outperforms other classifiers. The comparison is enriched with statistical analysis.

Keywords: time series, time series classification, random forest, similarity forest

Tomasz Górecki ( )
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]
Maciej Łuczak
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]
Paweł Piasecki
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]


1 Introduction

Time series classification is a rapidly developing research field that has gained much attention from researchers and business during the last two decades, largely because more and more of the data around us is located in the time domain and thus fulfils the definition of a time series. Predictive maintenance [18], quality monitoring [22], stock market analysis [20] or sales forecasting [17] are just a few examples of present-day problems where time series are indeed present. The reason why we usually apply different methods to time series than to regular (non-time series) data is that time series are ordered in time (or some other space with an ordering), and it is beneficial to use the information conveyed by this ordering.
In recent years, one could observe many advances on the field of time series
classification. In 2017, Bagnall et al. presented a comprehensive comparison of time
series classification algorithms [2], showing that despite there are dozens of far
more complex methods, 1-Nearest Neighbour (1NN) [6, 11] coupled with DTW [3]
distance constitutes a good baseline. In fact, it has been outperformed by several
classifiers, with Collective of Transformation Ensembles (COTE) [1] as the most
efficient one. Furthermore, COTE was extended with Hierarchical Vote system, first
to HIVE-COTE [13] and then finally to HIVE-COTE 2.0 [15] – a current state of
the art classifier for time series. In general, the success of COTE-family classifiers
is based on the observation, that in the case of time series it is highly beneficial
to use different data representations. For example, HIVE-COTE 1.0 utilizes five
ensembles based on different data transformation domains. However, a common
criticism of such an approach is its time complexity. In the case of HIVE-COTE,
it equals 𝑂 (𝑛2 𝑙 4 ), where 𝑛 is a number of observations and 𝑙 is a length of series.
Another drawback, especially significant for practitioners is the complex structure
of the model ensembles that makes it hard to use HIVE-COTE without spending a
decent amount of time studying its components beforehand.
As an alternative to such complex models may be trying to achieve possibly
slightly worse performance in favour of model simplicity and reduced computation
time. A group of classifiers that seems to hold a great potential are those inspired
by Random Forest (RF) [4]. This already 20-years old algorithm remains in the
classifiers’ forefront, showing extremely good performance and robustness across
multiple domains. Fernandez-Delgado et al. [10] performed a comparison of 179
classifiers on 121 non-time series data sets originated from UCI Machine Learning
Repository [9], concluding RF to be the most accurate one. Unfortunately, the usage
of RF is essentially limited to multidimensional data, as they sample features from
original space while creating each node of decision trees.
In this paper, we propose a method for extending RF to work with time series
using similarity forests (SF). We significantly extend the applicability of the RF
method to time series data. Furthermore, the approach even outperforms traditional
classifiers for time series. The main goal of this paper is to enrich the pool of time
series classifiers by Similarity Forest for time series classification. SF was initially
proposed by Sathe and Aggarwal in 2017 [19], as a method extending Random Forests
to deal with arbitrary data sets, provided that we are able to compute similarities
between observations. We would like to implement and tune the method to time series
data. We investigate the performance of the model using two distance measures (the
algorithm’s hyper-parameter): Euclidean and DTW. Also, a comparison with other
selected time series classifiers is provided. We compare its performance against
1NN-ED, 1NN-DTW and RF.
The rest of the paper is structured as follows. In Section 2, we provide details
of similarity forest and we give more details about random forests. Additionally, we
discuss how similarity forest is related to random forest. Section 3 describes data
sets that we used and the comparison methodology. The corresponding results are
presented in Section 4. Finally, in Section 5 we give a brief summary of our research.

2 Classification Methods Used in Comparison

In this paper, we compare the standard random forest and the similarity forest with two distance measures: ED (Euclidean distance) and DTW (dynamic time warping distance). As benchmark methods, we also use the nearest neighbor method (1NN) with the distance measures ED and DTW. 1NN-ED and 1NN-DTW are very common
classification methods for time series classification [2]. For a review of these methods
refer to [14].

2.1 General Method of Random Forest Construction

Random forest consists of random decision trees. For the construction of a random
forest we usually take decision trees as simple as possible — without special criteria
for stopping, pruning, etc.
When building a decision tree, we start at a node 𝑁, which contains the entire
data set (bootstrap sample). Then, according to an established criterion, we split the
node 𝑁 into two subnodes 𝑁1 and 𝑁2 . In each subnode there are data subsets of
the data set from node 𝑁. We make this split in a way that is optimal for a given
split method. In each node, we write down how the split occurred. Then, proceeding
recursively, we split subsequent nodes into subnodes until the stop criterion occurs. In our case we take the simplest such criterion, namely we stop splitting a given node when only elements of the same class are included in it. We call such a node a leaf and assign it the label that the elements of the node (leaf) have.
Having built a tree, we can now use it (in the testing phase) to classify a new
observation. We pass this observation through the trained tree — starting from the
node 𝑁 selecting each time one of the subnodes, according to the condition stored
in the node. We do this until we reach one of the leaves, and then we assign the test
observation to the class of the leaf.
Now, to construct the random forest, we collect a certain number of decision trees, train them independently according to the above method and, in the test phase, use each of the trees to classify the new observation. Thus, each tree assigns a label to the test observation. The final label (for the entire forest) is obtained by voting: we choose the most frequently appearing label among the decision trees.

2.2 Classical Random Forest

To create a (classical) random tree and a random forest [4], we proceed as described
above using the following node split method:
To obtain split conditions for a single tree, we randomly select a certain number of features (√𝑘 for classification, where 𝑘 is the number of features), and for each feature
we create a feature vector (column, variable) made of all elements of the data set
(bootstrap sample). For a given feature vector (variable), we determine a threshold
vector. First, we sort values of the feature vector (uniquely — without repeating
values). Let us name this sorted feature vector as 𝑉 = (𝑉1 , 𝑉2 , . . . ). Then we take the
values of the split as means of successive values of the vector 𝑉 :
$$v_i = \frac{V_i + V_{i+1}}{2}, \qquad i = 1, 2, \ldots \qquad (1)$$
Each splitting value divides the data set in node 𝑁 into two subsets — the one (left)
in which we have elements with feature values smaller than 𝑣𝑖 and the second (right)
with other elements. Then we check the quality of such a split.
The splitting point is chosen such that it minimizes the Gini index of the children
nodes. If $p_1, p_2, \ldots, p_c$ are the fractions of data points belonging to the $c$ different classes in node $N$, then the Gini index of that node is given by $G(N) = 1 - \sum_{i=1}^{c} p_i^2$.
Then, if the node 𝑁 is split into two children nodes 𝑁1 and 𝑁2 , with 𝑛1 and 𝑛2
points, respectively, the Gini quality of the children nodes is given by:

$$GQ(N_1, N_2) = \frac{n_1 G(N_1) + n_2 G(N_2)}{n_1 + n_2}.$$
The quality of the split is given by $GQ(N) = G(N) - GQ(N_1, N_2)$.
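As a small illustration of these two formulas (a sketch with hypothetical names, not the authors' implementation), the Gini index and the resulting split quality can be computed as follows:

```python
import numpy as np

def gini_index(labels):
    """Gini index of a node: G(N) = 1 - sum_i p_i^2 over the class fractions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_quality(left, right):
    """Gini quality of a split: GQ(N) = G(N) - (n1 G(N1) + n2 G(N2)) / (n1 + n2)."""
    n1, n2 = len(left), len(right)
    children = (n1 * gini_index(left) + n2 * gini_index(right)) / (n1 + n2)
    return gini_index(np.concatenate([left, right])) - children
```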

2.3 Similarity Forest

The similarity forest [19] differs from the ordinary (classical) random forest only in
the way we split nodes of trees. Instead of selecting a certain number of features,
we select randomly a pair of elements 𝑒 1 , 𝑒 2 with different classes. Then, for each
element 𝑒 of the subset of elements in a given node, we calculate the difference of
the squared distances to the elements 𝑒 1 and 𝑒 2 :

$$w(e) = d(e, e_1)^2 - d(e, e_2)^2,$$
where $d$ is any fixed distance measure between elements of the data set. We sort the vector $w$ uniquely (without duplicates), creating the vector $V$, and continue as for the classical decision tree: we calculate the split values $v_i$ as in Equation (1), calculate the quality of the node split using the Gini quality from Section 2.2, and choose the best split. In the learning phase, in each node we write down the optimal split: the elements 𝑒1, 𝑒2, and the split value on 𝑤(𝑒).
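A sketch of a single similarity-based node split, using the definitions above and the split_quality helper from the previous sketch, might look as follows; the names and the two-class simplification are assumptions for illustration, not the authors' code.

```python
import numpy as np

def similarity_split(X, y, dist, rng):
    """Choose the best threshold on w(e) = d(e, e1)^2 - d(e, e2)^2 in one node.
    X: list of observations (e.g. time series), y: NumPy array of labels,
    dist: any distance function, rng: np.random.Generator."""
    classes = np.unique(y)                       # two-class node assumed here
    e1 = X[rng.choice(np.where(y == classes[0])[0])]
    e2 = X[rng.choice(np.where(y == classes[1])[0])]
    w = np.array([dist(e, e1) ** 2 - dist(e, e2) ** 2 for e in X])
    V = np.unique(w)                             # sorted, without repeated values
    thresholds = (V[:-1] + V[1:]) / 2            # split values v_i, as in Eq. (1)
    best_v, best_q = None, -np.inf
    for v in thresholds:
        q = split_quality(y[w < v], y[w >= v])   # Gini quality from Section 2.2
        if q > best_q:
            best_v, best_q = v, q
    return e1, e2, best_v                        # information stored in the node
```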

2.4 Random Forest vs Similarity Forest

The difference between a classical random tree and a similarity tree is that instead of selecting √𝑘 of the features, we select only one pair of elements 𝑒1, 𝑒2. Generally, we have far fewer possible node splits, which has a very good effect on the computation time.
The second important difference is that in the similarity tree we use any distance
measure between elements of the data set. Therefore, we can use distance measures
specific to a data set. For example, for time series we can use the DTW distance,
much better suited for calculating the distance between time series, instead of the
Euclidean distance.
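For completeness, a standard textbook dynamic-programming implementation of the DTW distance (not the authors' implementation, and without a warping-window constraint) is sketched below; any such function can be passed as the distance to the similarity tree.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance between two 1-D series a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```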

3 Experimental Setup

We investigated the performance of similarity forest on UCR time series repository


[7] (128 data sets). The latest update of the UCR database introduced several data
sets with missing observations and uneven sample lengths. However, the repository
includes a standardized version of the database without these impediments, and that
is the version we used.
All data sets are split into a training and testing subset, and all parameter opti-
mization is conducted on the training set only. We combined both parts and in the
next step, we used 100 random train/test splits.

4 Results

The error rates for each classifier can be found on the accompanying website1. In Table 1 we show a short summary of the results, including the number of wins (a draw is not counted as a win) and mean ranks. Taking into account mean ranks, SF-DTW is the best classifier, slightly ahead of RF (their mean ranks equal 2.64

1 https://github.com/ppias/similarity_forest_for_tsc

Table 1 Number of wins (ties not counted as wins) and mean ranks for the examined methods.
Method 1NN-ED 1NN-DTW RF SF-ED SF-DTW
Wins 12 28 38 10 31
Mean rank 3.59 2.89 2.69 3.19 2.64

and 2.69, respectively). Figure 1 compares the error rates and ranks of the classifiers. These results lead to the conclusion that, even though there is no clear winner, the top positions are dominated by the RF- and SF-based classifiers. Figure 2 shows scatter plots of errors for pairs of classifiers.

[Figure: distributions of error rates (0.00–0.75) and ranks (1–5) for 1NN-DTW, 1NN-ED, RF, SF-DTW, and SF-ED.]
Fig. 1 Comparison of error rates and ranks.

[Figure: pairwise scatter plots of error rates (SF-ED vs. RF, SF-DTW vs. RF, and SF-DTW vs. SF-ED), with the diagonal separating the regions where each method performs better.]
Fig. 2 Comparison of error rates.

To identify differences between the classifiers, we present a detailed statistical


comparison. In the beginning, we test the null hypothesis that all classifiers perform
the same and the observed differences are merely random. The Friedman test with
Iman & Davenport extension is probably the most popular omnibus test, and it is
usually a good choice when comparing different classifiers [12]. The 𝑝-value from
this test is equal to 0. The obtained 𝑝-value indicates that we can safely reject the
null hypothesis that all the algorithms perform the same. We can therefore proceed
with the post-hoc tests in order to detect significant pairwise differences among all
of the classifiers.
Demšar [8] proposes the use of the Nemenyi’s test [16] that compares all the
algorithms pair-wise. For a significance level 𝛼 the test determines the critical
difference (CD). If the difference between the average ranking of two algorithms is
greater than CD the null hypothesis that the algorithms have the same performance
is rejected. Additionally, Demšar [8] creates a plot to visually check the differences,
the CD plot. In the plot, those algorithms that are not joined by a line can be regarded
as different.
In our case, with a significance level of 𝛼 = 0.05, any two algorithms with a difference in mean rank above 0.54 will be regarded as not equal (Figure 3). We can see
that we have three groups of methods. In the first group we have SF-DTW, RF and
1NN-DTW, in the second we have RF, 1NN-DTW and SF-ED and in the last group
we have SF-ED and 1NN-ED. Unfortunately, groups are not disjoint. The first group
is the group with the highest accuracy of classification. Hence, SF-DTW does not
statistically outperform RF. However, we can recommend it over RF because of its statistically equivalent quality and much better computational properties.
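The critical difference underlying Figure 3 follows Demšar's formula CD = q_α √(k(k+1)/(6N)) for k classifiers compared on N data sets. The short sketch below reproduces the 0.54 threshold; the Nemenyi constant q_0.05 for k = 5 is taken as an assumption from the standard table.

```python
import numpy as np

q_alpha = 2.728                      # Nemenyi q value for k = 5, alpha = 0.05
k, N = 5, 128                        # classifiers and data sets in this study
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 2))                  # ~0.54

mean_ranks = {"1NN-ED": 3.59, "1NN-DTW": 2.89, "RF": 2.69,
              "SF-ED": 3.19, "SF-DTW": 2.64}
# e.g. |rank(SF-DTW) - rank(RF)| = 0.05 < CD, so the two are not significantly different
```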

[Figure: critical difference diagram over mean ranks (scale 2–4) for SF-DTW, RF, 1NN-DTW, SF-ED, and 1NN-ED; methods joined by a line are not significantly different.]
Fig. 3 Critical difference plot.

5 Conclusions

Our contribution is to implement similarity forest for time series classification using
two distance measures: Euclidean and DTW. Comparison based on the recently
updated UCR data repository (128 data sets) was presented. We showed that SF-
DTW outperforms other classifiers, including 1NN-DTW which has been considered
as a strong baseline hard to beat for years. The statistical comparison showed, that RF
and SF-DTW are statistically insignificantly different, however taking into account
mean ranks the latter one is the best one.
There are many improvements that could be applied to the implementation that
we propose. For example, we could test other distance measures such as LCSS [21]
or ERP [5] that were successfully used in time series tasks. Another idea could be
to investigate the usage of boosting algorithm.

Acknowledgements The research work was supported by grant No. 2018/31/N/ST6/01209 of the
National Science Centre.

References

1. Bagnall, A., Lines, J., Hills, J., Bostrom A.: Time-series classification with COTE: The
collective of transformation-based ensembles. IEEE Trans. on Knowl. and Data Eng. 27,
2522–2535 (2015)
2. Bagnall, A., Lines, J., Bostrom, A., Large J., Keogh, E.: The great time series classification
bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. and
Knowl. Discov. 31, 606–660 (2017)
3. Berndt, D. J., Clifford, J.: Using dynamic time warping to find patterns in time series. Proc.
of the 3rd Int. Conf. on Knowl. Discov. and Data Min., pp. 359–370 (1994)
4. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
5. Chen, L., Ng, R.: On the marriage of 𝐿 𝑝 -norms and edit distance. Proc. of the 30th Int. Conf.
on Very Large Data Bases 30, pp. 792–803 (2004)
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. on Inf. Theor. 13,
21–27 (1967)
7. Dau, H.A., Keogh, E., Kamgar, K., Yeh, Chin-Chia M., Zhu, Y.,Gharghabi, S., Ratanama-
hatana, C.A., Yanping, C., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., Hexagon-
ML: The UCR time series classification archive (2019). https://www.cs.ucr.edu/~eamonn/time_series_data_2018
8. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. of Mach. Learn.
Res. 7, 1–30 (2006).
9. Dua, D., Graff, C.: UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml
10. Fernandez-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of
classifiers to solve real world classification problems?. J. of Mach. Learn. Res. 15, 3133–3181
(2014)
11. Fix, E., Hodges, J. L.: Discriminatory analysis: nonparametric discrimination, consistency
properties. Techn. Rep. 4 (1951)
12. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple
comparisons in the design of experiments in computational intelligence and data mining:
Experimental Analysis of Power. Inf. Sci. 180, 2044–2064 (2010)
13. Lines, J., Taylor S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of transfor-
mation based ensembles for time series classification. IEEE Int. Conf. on Data Min., pp.
1041–1046 (2016)
14. Maharaj, E. A., D’Urso, P., Caiado, J.: Time Series Clustering and Classification. Chapman
and Hall/CRC. (2019)
15. Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.: HIVE-COTE 2.0:
a new meta ensemble for time series classification. (2021)
https://arxiv.org/abs/2104.07551
16. Nemenyi, P.: Distribution-free multiple comparisons. PhD thesis, Princeton University (1963)
17. Pavlyshenko, B. M.: Machine-learning models for sales time series forecasting. Data 4, 15
(2019)
18. Rastogi, V., Srivastava, S., Mishra, M., Thukral, R.: Predictive maintenance for SME in
industry 4.0. 2020 Glob. Smart Ind. Conf., pp. 382–390 (2020)
19. Sathe, S., Aggarwal, C. C.: Similarity forests. Proc. of the 23rd ACM SIGKDD, pp. 395–403
(2017)
20. Tang, J., Chen, X.: Stock market prediction based on historic prices and news titles. Proc. of
the 2018 Int. Conf. on Mach. Learn. Techn., pp. 29–34 (2018)
21. Vlachos, M., Kollios, G., Gunopulos, D.: Discovering similar multidimensional trajectories.
Proc. 18th Int. Conf. on Data Eng., pp. 673–684 (2002)
22. Wuest, T., Irgens, C., Thoben, K. D.: An approach to quality monitoring in manufacturing
using supervised machine learning on product state data. J. of Int. Man. 25, 1167–1180 (2014)
Detection of the Biliary Atresia Using Deep
Convolutional Neural Networks Based on
Statistical Learning Weights via Optimal
Similarity and Resampling Methods

Kuniyoshi Hayashi, Eri Hoshino, Mitsuyoshi Suzuki, Erika Nakanishi,


Kotomi Sakai, and Masayuki Obatake

Abstract Recently, artificial intelligence methods have been applied in several fields,
and their usefulness is attracting attention. These methods are techniques that corre-
spond to models using batch and online processes. Because of advances in compu-
tational power, as represented by parallel computing, online techniques with several
tuning parameters are widely accepted and demonstrate good results. Neural net-
works are representative online models for prediction and discrimination. Many
online methods require large training data to attain sufficient convergence. Thus,
online models may not converge effectively for low and noisy training datasets. For
such cases, to realize effective learning convergence in online models, we introduce
statistical insights into an existing method to set the initial weights of deep convo-
lutional neural networks. Using an optimal similarity and resampling method, we
proposed an initial weight configuration approach for neural networks. For a practice
example, identification of biliary atresia (a rare disease), we verified the usefulness

Kuniyoshi Hayashi ( )
Graduate School of Public Health, St. Luke’s International University, 3-6 Tsukiji, Chuo-ku, Tokyo,
Japan, 104-0045, e-mail: [email protected]
Eri Hoshino · Kotomi Sakai
Research Organization of Science and Technology, Ritsumeikan University, 90-94 Chudoji Awat-
acho, Shimogyo Ward, Kyoto, Japan, 600-8815,
e-mail: [email protected];[email protected]
Mitsuyoshi Suzuki
Department of Pediatrics, Juntendo University Graduate School of Medicine, 2-1-1 Hongo, Bunkyo-
ku, Tokyo, Japan, 113-8421, e-mail: [email protected]
Erika Nakanishi
Department of Palliative Nursing, Health Sciences, Tohoku University Graduate School of
Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan, 980-8575,
e-mail: [email protected]
Masayuki Obatake
Department of Pediatric Surgery, Kochi Medical School, 185-1 Kohasu, Oko-cho, Nankoku-shi,
Kochi, Japan, 783-8505, e-mail: [email protected]


of the proposed method by comparing existing methods that also set initial weights
of neural networks.

Keywords: AUC, bootstrap method, leave-one-out cross-validation, projection ma-


trix, rare disease, sensitivity and specificity

1 Introduction

The core technique in deep learning corresponds to neural networks, including the
convolutional process. Since 2012, deep learning architectures have been frequently
used for image classification [1, 2]. Moreover, deep convolutional neural networks (DCNN) are representative nonlinear classification methods for pattern recognition.
The DCNN technique is used as a powerful framework for the entirety of image
processing [3]. The clinical medicine field presents many opportunities to perform
diagnoses using imaging data from patients. Therefore, DCNN techniques are ap-
plied to enhance diagnostic quality, e.g., applying a DCNN to a chest X-ray dataset
to classify pneumonia [2] and detecting breast cancer [4]. However, DCNN architec-
tures involve many parameters to be learned using training data. Therefore, effective
and efficient model development must realize effective learning convergence for
such parameters. Notably, it is important to set the initial parameter values to achieve
better learning convergence. Furthermore, several methods have been proposed to
set initial parameter values in the artificial intelligence (AI) field [5, 6]. However,
there are no clear guidelines for determining which existing methods should be used
in different situations. Thus, we propose an efficient initial weight approach using
existing methods from the viewpoints of optimal similarity and resampling methods.
Using a real-world clinical biliary atresia (BA) dataset, we evaluate the performance
of the proposed method compared with existing DCNNs. Additionally, we show the
usefulness of the proposed method in terms of learning convergence and prediction
accuracy.

2 Background

BA is a rare disease that occurs in children and is fatal unless treated early. Previous
studies have investigated models to identify BA by applying neural networks to pa-
tient data [7] and using an ensemble deep learning model to detect BA [8]. However,
these models were essentially for use in medical institutions, e.g., hospitals. Gener-
ally, certain stool colors in infants and children are highly correlated with BA [9]. In
Japan, the maternal and child health handbook includes a stool color card so parents
can compare their child’s stool color to the information on the card. Such fecal color
cards are widely used to detect BA because of their easy accessibility outside the
clinical environments. However, this stool color card screening approach for BA is
subjective; thus, accurate and objective diagnoses are not always possible. Previ-
ously, we developed a mobile application to classify BA and non-BA stools using
baby stool images captured using an iPhone [10]. Here, a batch type classification
method was used, i.e., the subspace method, originating from the pattern recognition
field. Since BA is a rare disease, the number of events in the case group is generally small. Thus, when we set the explanatory variables of the target observation as the pixel
values of a target image, the number of explanatory variables is much higher than the
number of observations, especially the disease group. With the subspace method, we
can efficiently discriminate such high-dimensional small-sample data. For example,
our previous study using the subspace method to classify BA and non-BA stools
showed that BA could be discriminated with reasonable accuracy by applying the
proposed method to image pixel data of the stool image data captured by a mobile
phone [10]. This application was an automated version of the stool color card from
the maternal and child health handbook. Unlike previous studies by [7, 8], the appli-
cation is widely available outside hospital environments. As described previously,
DCNNs are useful for image classification, including the automatic classification of
stool images for early BA detection.

3 Proposed Method

Dimension reduction and discrimination processing can be realized using the sub-
space method and DCNN techniques. In DCNN, layers based on padding, convo-
lution, and pooling correspond to the dimension reduction functions, and the affine
layer performs the discrimination. The primary motivation of this study is to propose
a method that properly sets the initial weights of the parameters in a DCNN using
statistical approaches. Our secondary motivation is to apply the proposed method to
real-world, high-dimensional, and small-sample clinical data.

3.1 Description of Related Procedures of the Convolution

For image discrimination in pattern recognition and machine learning fields, the pixel
values of the image data are set as the explanatory variables for the target outcome.
Here, the data to be classified correspond to a high-dimensional observation. To
improve efficiency and demonstrate the feasibility of discriminant processing, the
dimensionality must be reduced to a manageable size before classification. The most
representative dimensionality reduction method is convolution in pattern recognition
and machine learning, which involves padding, convolution, and pooling operations.
After converting the input image to a pixel data matrix, the pixel data matrix is
surrounded with a numeric value of 0. Using a convolution filter, we reconstruct the
pixel data matrix while considering pixel adjacency information. Generally, the size
and convolution filter type are parameters that need optimization to realize sufficient
prediction accuracy. However, some representative convolution filters that exhibit


good performance are known in the AI field, and we can essentially fix the size and
type of the convolution filter. Finally, pooling is performed to reduce the size of the
pixel data matrix after convolution. Here, we refer to the sequence of processing
from padding to pooling as the layer for feature selection.
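A generic sketch of this feature-selection layer is given below; the 3 × 3 averaging filter, the 2 × 2 pooling window, and all names are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def conv2d(X, F):
    """Valid 2-D convolution (cross-correlation) of a pixel matrix X with filter F."""
    p, q = X.shape
    fp, fq = F.shape
    out = np.zeros((p - fp + 1, q - fq + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + fp, j:j + fq] * F)
    return out

def max_pool(X, size=2):
    """Non-overlapping max pooling with a size-by-size window."""
    p, q = X.shape
    X = X[:p - p % size, :q - q % size]
    return X.reshape(p // size, size, q // size, size).max(axis=(1, 3))

def feature_layer(X, F=np.ones((3, 3)) / 9.0):
    """Padding, convolution, and max pooling, as described in Section 3.1."""
    padded = np.pad(X, 1, mode="constant", constant_values=0)  # surround with zeros
    return max_pool(conv2d(padded, F))
```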

3.2 Setting Conditions Assumed in This Study

We denote the input pattern matrices comprising numerical pixel values in hue (H),
saturation (S), and value (V) as X 𝐻 (∈ R 𝑝×𝑞 ), X𝑆 (∈ R 𝑝×𝑞 ), and X𝑉 (∈ R 𝑝×𝑞 ),
respectively. First, we performed padding for the input pattern matrices in H, S, and
V, respectively, and then, performed a convolution in each signal pattern matrix using
a convolution filter. Next, we then applied max pooling to each pattern matrix after
convolution. Here, we denote the pattern matrices after the padding, convolution, and max pooling as $\tilde{X}^H (\in \mathbb{R}^{p' \times q'})$, $\tilde{X}^S (\in \mathbb{R}^{p' \times q'})$, and $\tilde{X}^V (\in \mathbb{R}^{p' \times q'})$, respectively, where $p'$ and $q'$ are less than $p$ and $q$. Therefore, we combine the component values of each pattern matrix after padding, convolution, and max pooling into a single pattern matrix by simply adding them together. The combined pattern matrix after applying the feature selection layer is expressed as $\tilde{X} (\in \mathbb{R}^{p' \times q'})$. Next, we applied convolution and max pooling to the combined pattern matrix $k$ times. Additionally, the input vector after performing the convolution and max pooling $k$ times is denoted by $x (\in \mathbb{R}^{\ell \times 1})$, and the output of the DCNN and the label vectors are denoted $y (\in \mathbb{R}^{1 \times 1})$ and $t (\in \mathbb{R}^{1 \times 1})$, respectively. In this study, we evaluated the difference between $y$ and $t$ according to the mean square error function, i.e., $L(y, t) = \frac{1}{\ell} \| t - y \|_2^2$.
Here, we consider a simple neural network with three layers. Concretely, between
the first and second layers, we perform a linear transformation using W1 (∈ R2×ℓ )
and b1 (∈ R2×1 ). Then, a linear transformation is performed using W2 (∈ R1×2 ) and
b2 (∈ R1×1 ) between the second and third layers. Next, we defined 𝑓1 (x) and 𝑓2 (x)
as W1 x + b1 and W2 𝑓1 (x) + b2 , respectively. Note that we assume 𝜂2 is a nonlinear
transformation between the second and third layers, and we calculated the output
y as 𝜂2 ( 𝑓2 ◦ 𝑓1 (x)). Generally, y is calculated as a continuous value. For example,
with classification and regression tree methods, we can determine the optimal cutoff
point of y𝑠 from a prediction perspective.

3.3 General Approach to Update Parameters in CNNs

Here, we denote 𝑓1 (x) and 𝑓2 ◦ 𝑓1 (x) in the previous subsection as u1 and u2 ,


respectively. By performing the partial derivative of $L(y, t)$ with respect to $W_2$, we obtain
$$\frac{\partial L}{\partial W_2^T} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial W_2^T},$$
where $\frac{\partial L}{\partial y} = -\frac{2}{\ell}(t - y)$, $\frac{\partial y}{\partial u_2} = \frac{\partial \eta_2(u_2)}{\partial u_2}$, and $\frac{\partial u_2}{\partial W_2^T} = u_1$. Additionally, we calculate $\eta_2(u_2)$ as $1/(1 + \exp(-u_2))$ using the representative sigmoid function. Then, $\frac{\partial y}{\partial u_2}$ is calculated as $\eta_2(u_2)(1 - \eta_2(u_2))$. Therefore, we obtain $\frac{\partial L}{\partial W_2^T} = -\frac{2}{\ell}(t - y)\,\eta_2(u_2)(1 - \eta_2(u_2))\,u_1$. With the learning coefficient $\gamma_2$, we update $W_2^T$ to $W_2^T - \gamma_2 \frac{\partial L}{\partial W_2^T}$. Then, when performing the partial derivative of $L(y, t)$ with respect to $W_1$, we can obtain
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial W_1},$$
where $\frac{\partial L}{\partial y} = -\frac{2}{\ell}(t - y)$, $\frac{\partial y}{\partial u_2} = \eta_2(u_2)(1 - \eta_2(u_2))$, $\frac{\partial u_2}{\partial u_1} = W_2^T$, and $\frac{\partial u_1}{\partial W_1} = 2x^T$. Thus, we obtain $\frac{\partial L}{\partial W_1} = -\frac{4}{\ell}(t - y)\,\eta_2(u_2)(1 - \eta_2(u_2))\,W_2^T x^T$. With the learning coefficient $\gamma_1$, we update $W_1$ to $W_1 - \gamma_1 \frac{\partial L}{\partial W_1}$.
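A compact sketch of one update step, following the expressions above exactly as written (including the $2x^T$ factor in $\partial u_1 / \partial W_1$), is shown below; the shapes and names are illustrative assumptions, and the bias terms are kept fixed since their updates are not spelled out in the text.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W1, b1, W2, b2):
    u1 = W1 @ x + b1          # f1(x), shape (2, 1)
    u2 = W2 @ u1 + b2         # f2(f1(x)), shape (1, 1)
    return u1, u2, sigmoid(u2)

def update_step(x, t, W1, b1, W2, b2, gamma1, gamma2, ell):
    """One gradient step using the partial derivatives given in Section 3.3."""
    u1, u2, y = forward(x, W1, b1, W2, b2)
    s = y * (1.0 - y)                                 # eta2(u2) (1 - eta2(u2))
    dW2T = -(2.0 / ell) * (t - y) * s * u1            # dL / dW2^T
    dW1 = -(4.0 / ell) * (t - y) * s * (W2.T @ x.T)   # dL / dW1, as written in the text
    W2_new = W2 - gamma2 * dW2T.T
    W1_new = W1 - gamma1 * dW1
    return W1_new, W2_new
```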

3.4 Setting the Initial Weight Matrix in the Affine Layer

To ensure proper learning convergence in situations with limited training datasets, we


proposed a method using optimal similarity and bootstrap methods. Here, the number
of training data and the training dataset are denoted 𝑛 and 𝑆(3 x 𝑗 ), respectively, where
x 𝑗 is the 𝑗-th training observation ( 𝑗 takes values 1 to 𝑛). Additionally, we normalized
each observation vector, such that its norm is one. By considering the discrimination
problem of two groups whose outcomes are 0 and 1, respectively, we divided {x 𝑗 }
into {x 𝑗 |y 𝑗 = 0} and {x 𝑗 |y 𝑗 = 1}. Next, we defined {x 𝑗 |y 𝑗 = 0} and {x 𝑗 |y 𝑗 = 1}
as 𝑆0 and 𝑆1 , respectively. First, we calculated the autocorrelation matrix with the
observations belonging to 𝑆0 . Then, using the eigenvalues (𝜆ˆ 𝑠0 ) and eigenvectors
(û𝑠0 ) for the autocorrelation matrix, we calculated the following projection matrix:
$$\hat{P}_0 := \sum_{s_0=1}^{\ell_0'} \hat{u}_{s_0} \hat{u}_{s_0}^T, \qquad (1)$$
where $\ell_0'$ takes values 1 to $\ell$ in Equation (1). Similarly, we calculated the autocorrelation matrix with the observations belonging to $S_1$. Then, with the eigenvalues ($\hat{\lambda}_{s_1}$) and eigenvectors ($\hat{u}_{s_1}$) for the autocorrelation matrix, we calculate the following projection matrix:
$$\hat{P}_1 := \sum_{s_1=1}^{\ell_1'} \hat{u}_{s_1} \hat{u}_{s_1}^T, \qquad (2)$$
where $\ell_1'$ takes values 1 to $\ell$ in Equation (2). Here, if the value of $x^T (\hat{P}_1 - \hat{P}_0) x > 0$,
we classify x into 𝑆1 ; otherwise, we classify x into 𝑆0 .
From a prediction perspective, using the leave-one-out cross-validation [11],
we determined the optimal ℓˆ00 and ℓˆ10 , which are minimum values satisfying 𝜏 <
Íℓ 0 Í Íℓ 0 Í
( 𝑠00 =1 𝜆ˆ 𝑠0 )/( ℓ𝑠0 =1 𝜆ˆ 𝑠0 ) and 𝜏 < ( 𝑠11 =1 𝜆ˆ 𝑠1 )/( ℓ𝑠1 =1 𝜆ˆ 𝑠1 ), respectively. Here, 𝜏 is
a tuning parameter to be optimized using the leave-one-out cross-validation. In the
second step, based on 𝑃ˆ1 , we estimated ŷ 𝑗 as x𝑇𝑗 𝑃ˆ1 x 𝑗 . In the third step, using existing
approaches [5, 6], we generated normal random numbers and set an
initial matrix, vector, and scalar as Ŵ2 , b̂1 , and b̂2 , respectively. Next, we extracted

𝑚 observations randomly using the bootstrap method [12]. Using Ŵ2 , b̂1 , b̂2 , and a
bootstrap sample of size 𝑚, we estimated W2 W1 as follows:
$$\hat{\mathbf{W}}_2 \hat{\mathbf{W}}_1 = \frac{1}{m} \sum_{i=1}^{m} \Big(\eta_2^{-1}(\hat{y}_i) - (\hat{\mathbf{W}}_2 \hat{\mathbf{b}}_1 + \hat{\mathbf{b}}_2)\Big)\,\mathbf{x}_i^T (\mathbf{x}_i \mathbf{x}_i^T)^{-1}, \qquad (3)$$

where we estimate the inverse of x𝑖 x𝑇𝑖 in Equation (3) using the naive approach from
the diagonal elements in x𝑖 x𝑇𝑖 . Additionally, using the generalized inverse approach,
we obtained Ŵ1 in the basis of Ŵ2 and Ŵ2 Ŵ1 . Finally, b̂1 , b̂2 , Ŵ1 , and Ŵ2 were
used as initial vectors and matrices to update the parameters of the convolutional
neural network.
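A rough R sketch of this initialization step is given below, under the assumption that the naive diagonal approach means inverting only the diagonal elements of $\mathbf{x}_i\mathbf{x}_i^T$; the dimensions, the bootstrap size $m$, the stand-in values for $\hat{y}_j$, and the random initial parameters are placeholders.

```r
# Sketch of Equation (3): initializing W2 W1 from a bootstrap sample (placeholder values).
set.seed(2)
logit <- function(p) log(p / (1 - p))                  # inverse of the sigmoid eta2

d <- 50; n <- 35; m <- 20
X <- matrix(rnorm(n * d), nrow = n); X <- X / sqrt(rowSums(X^2))  # normalized observations in rows
yhat <- pmin(pmax(X %*% rnorm(d), 0.05), 0.95)         # stand-in for x_j^T P1_hat x_j, kept in (0, 1)

W2 <- matrix(rnorm(2), nrow = 1)                       # random initial values standing in for [5, 6]
b1 <- matrix(rnorm(2), nrow = 2); b2 <- rnorm(1)

idx   <- sample(seq_len(n), m, replace = TRUE)         # bootstrap sample of size m [12]
terms <- lapply(idx, function(i) {
  xi      <- X[i, ]
  inv_xxT <- diag(1 / xi^2)                            # naive inverse using only the diagonal of x_i x_i^T
  (logit(yhat[i]) - c(W2 %*% b1 + b2)) * t(xi) %*% inv_xxT
})
W2W1 <- Reduce(`+`, terms) / m                         # estimate of W2 W1, a 1 x d matrix
```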

4 Analysis Results on Real-world Data

In this paper, all analyses were performed using R version 4.1.2 (R Foundation for
Statistical Computing). We applied the proposed method to a real BA dataset. Here,
stool image data, in which extraneous objects such as diapers were partially visible in the images, were used. In this numerical experiment, we randomly divided the 35 images into 15 training and 20 test images. Next, we compared the proposed and existing methods
relative to the learning convergence and prediction accuracy on the training and test
data, respectively. Here, we set both learning coefficients 𝛾1 and 𝛾2 to 0.1. Also, we prepared a single feature selection layer and performed the convolution and max pooling process seven times. Each time, an initial value was set randomly and learning was performed for 1000 iterations using the 15 training images; learning was judged to have converged when the sum of the absolute differences between ŷ 𝑗 and t 𝑗 , divided by 1000, fell below 0.01. We repeated this random division of the 35 images into 15 training and 20 test images five times.
As a result, we created five datasets. For each dataset, the sensitivity, specificity, and
AUC values of the training and test data were calculated using the parameters (b̂1 , b̂2 ,
Ŵ1 , and Ŵ2 ) at the time the learning first converged in the existing and our proposed
methods. Figure 1 shows, for each method, the average over the five datasets of the absolute differences between the correct label and the predicted value at each step, for the run in which learning first converged. We can observe that the error decreased more steadily for the proposed method than for the existing methods as learning progressed. When the model was constructed using the weights at the learning convergence point and applied to the 15 training images of each repetition, the average sensitivity and specificity were 100.0% and the average AUC was 1.000 for all methods. However, a difference
was observed among the compared methods on the test data. For the method by [5],
the average values of sensitivity, specificity, and AUC in the test data were 83.3%,
42.5%, and 0.629, respectively. Also, for that of [6], the average values of sensitivity,
specificity, and AUC in the test data were 85.0%, 40.0%, and 0.625, respectively.
With the proposed method, the average values of sensitivity, specificity, and AUC
obtained on the test data were 85.0%, 67.5%, and 0.763, respectively.

Fig. 1 Transition of learning in each method.

5 Conclusion and Limitations

In this paper, we considered a discrimination problem using a DCNN for high-dimensional, small-sample data and proposed a method for setting the initial weight matrix in the affine layer. Although transfer learning can be used when training data are limited, we instead proposed an efficient learning method based directly on the DCNN. In terms of learning convergence and the results obtained on the test data, the proposed method performed favorably. However, the results presented in this paper are limited, and the proposed method needs to be examined in more detail.
Therefore, in the future, through large-scale simulation studies and other real-world
data applications, we plan to investigate the differences between the proposed method
and existing methods by changing the number of feature selection layers and using
different convolution filters. We also plan to investigate the robustness of the proposed method by placing outliers in simulated data.

Acknowledgements We thank Shinsuke Ito, Takashi Taguchi, Dr. Yusuke Yamane, Ms. Saeko
Hishinuma, and Dr. Saeko Hirai for their advice. In addition, we acknowledge the biliary atresia
patients’ community (BA no kodomowo mamorukai) for their generous support of this project. This
work was supported by the Mitsubishi Foundation.

References

1. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a compre-
hensive review. Neural Comput. 29(9), 2352–2449 (2017)
2. Yadav, S. S., Jadhav, S.M.: Deep convolutional neural network based medical image classifi-
cation for disease diagnosis. J. Big Data (2019) doi: 10.1186/s40537-019-0276-2
3. Huang, J., Xu, Z.: Cell detection with deep learning accelerated by sparse kernel. In: Lu, L.
et al. (eds.) Advances in Computer Vision and Pattern Recognition, pp. 137-157. Springer,
Switzerland (2017)
4. Abdelhafiz, D., Yang, C., Ammar, R., Nabavi, S.: Deep convolutional neural networks
for mammography: advances, challenges and applications. BMC Bioinform. (2019) doi:
10.1186/s12859-019-2823-4
5. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural
networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics, pp. 249-256. (2010)
6. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level
performance on ImageNet classification. In: Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pp. 1026-1034. (2015)
7. Liu, J., Dai, S., Chen, G., Sun, S., Jiang, J., Zheng, S., Zheng, Y., Dong, R.: Diagnostic value
and effectiveness of an artificial neural network in biliary atresia. Front. Pediatr. (2020) doi:
10.3389/fped.2020.00409
8. Zhou, W., Yang, Y., Yu, C., Liu, J., Duan, X., Weng, Z., Chen, D., Liang, Q., Fang, Q., Zhou,
J., Ju, H., Luo, Z., Guo, W., Ma, X., Xie, X., Wang, R., Zhou, L.: Ensembled deep learning
model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder
images. Nat. Commun. (2021) doi: 10.1038/s41467-021-21466-z
9. Gu, Y.H., Yokoyama, K., Mizuta, K., Tsuchioka, T., Kudo, T., Sasaki, H., Nio, M., Tang, J.,
Ohkubo, T., Matsui, A.: Stool color card screening for early detection of biliary atresia and
long-term native liver survival: a 19-year cohort study in Japan. J. Pediatr. 166(4), 897–902
(2015)
10. Hoshino, E., Hayashi, K., Suzuki, M., Obatake, M., Urayama, K.Y., Nakano, S., Taura, Y.,
Nio, M., Takahashi, O.: An iPhone application using a novel stool color detection algorithm
for biliary atresia screening. Pediatr. Surg. Int. 33(10), 1115–1121 (2017)
11. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition. Springer, New York (2009)
12. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)

Some Issues in Robust Clustering

Christian Hennig

Abstract Some key issues in robust clustering are discussed with focus on the
Gaussian mixture model based clustering, namely the formal definition of outliers,
ambiguity between groups of outliers and clusters, the interaction between robust
clustering and the estimation of the number of clusters, the essential dependence
of (not only) robust clustering on tuning decisions, and shortcomings of existing
measurements of cluster stability when it comes to outliers.

Keywords: Gaussian mixture model, trimming, noise component, number of clusters, user tuning, cluster stability

1 Introduction

Cluster analysis is about finding groups in data. Robust statistics is about methods
that are not affected strongly by deviations from the statistical model assumptions or
moderate changes in a data set. Particular attention has been paid in the robustness
literature to the effect of outliers. Outliers and other model deviations can have a
strong effect on cluster analysis methods as well. There is now much work on robust
cluster analysis, see [1, 19, 9] for overviews.
There are standard techniques of assessing robustness such as the influence func-
tion and the breakdown point [15] as well as simulations involving outliers, and these
have been applied to robust clustering as well [19, 9].
Here I will argue that due to the nature of the cluster analysis problem, there are
issues with the standard reasoning regarding robustness and outliers.
The starting point will be clustering based on the Gaussian mixture model, for
details see [3]. For this approach, 𝑛 observations are assumed i.i.d. with density

Christian Hennig ( )
Dipartimento di Scienze Statistiche “Paolo Fortunati”, University of Bologna, Via delle Belle Arti
41, 40126 Bologna, Italy, e-mail: [email protected]


$$f_\eta(x) = \sum_{k=1}^{K} \pi_k \varphi_{\mu_k, \Sigma_k}(x),$$
$x \in \mathbb{R}^p$, with $K$ mixture components with proportions $\pi_k$, $\varphi_{\mu_k, \Sigma_k}$ being the Gaussian density with mean vectors $\mu_k$, covariance matrices $\Sigma_k$, $k = 1, \ldots, K$, $\eta$ being a vector
of all parameters. For given 𝐾, 𝜂 can be estimated by maximum likelihood (ML)
using the EM-algorithm, as implemented for example in the R-package “mclust”.
A standard approach to estimate 𝐾 is the optimisation of the Bayesian Information
Criterion (BIC). Normally, mixture components are interpreted as clusters, and
observations 𝑥𝑖 , 𝑖 = 1, . . . , 𝑛, can be assigned to clusters using the estimated posterior
probability that 𝑥 𝑖 was generated by mixture component 𝑘. A problem with ML
estimation is that the likelihood degenerates if all observations assigned to a mixture
component lie on a lower dimensional hyperplane, i.e., a Σ 𝑘 has an eigenvalue of
zero. This can be avoided by placing constraints on the eigenvalues of the covariance
matrices [8]. Alternatively, a non-degenerate local optimum of the likelihood can
be used, and if this cannot be found, constrained covariance matrix models (such as
Σ1 = . . . = Σ𝐾 ) can be fitted instead, as is the default of mclust. Several issues with
robustness that occur here are also relevant for other clustering approaches.
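For concreteness, the following R snippet shows the standard mclust workflow referred to above: ML estimation via the EM algorithm, selection of the number of components by the BIC, and assignment of observations via the estimated posterior probabilities; the simulated two-cluster data are only a placeholder.

```r
# Standard Gaussian mixture model based clustering with mclust (illustrative data).
library(mclust)
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))     # two well separated Gaussian clusters

fit <- Mclust(x, G = 1:5)          # number of components chosen by the BIC over 1, ..., 5
fit$G                              # estimated number of mixture components
fit$modelName                      # selected covariance matrix model
head(fit$classification)           # assignments by estimated posterior probabilities
```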

2 Outliers vs Clusters

It is well known that the sample mean and sample covariance matrix as estimators
of the parameters of a single Gaussian distribution can be driven to breakdown by
a single outlier [15]. Under a Gaussian mixture model with fixed 𝐾, an outlier must
be assigned to a mixture component 𝑘 and will break down the estimators of 𝜇 𝑘 , Σ 𝑘
(which are weighted sample means and covariance matrices) for that component in
the same manner; the same holds for a cluster mean in 𝑘-means clustering.
Addressing this issue, and dealing with more outliers in order to achieve a high
breakdown point, is a starting point for robust clustering. Central ideas are trimming
a proportion of observations [7], adding a “noise component” with constant density
to catch the outliers [4, 3], mixtures with more robust component-wise estimators
such as mixtures of heavy-tailed distributions (Sec. 7 of [18]).
But cluster analysis is essentially different from estimating a homogeneous popu-
lation. Given a data set with 𝐾 clear Gaussian clusters and standard ML-clustering,
consider adding a single outlier that is far enough away from the clusters. Assuming
a lower bound on covariance matrix eigenvalues, the outlier will form a one-point
cluster, the mean of which will diverge with the added outlier, and the original
clusters will be merged to form 𝐾 − 1 clusters [10].
The same will happen with a group of several outliers being close together,
once more added far enough away from the Gaussian clusters. “Breakdown” of an
estimator is usually understood as the estimator becoming useless. It is questionable
that this is the case here. In fact, the “group of outliers” can well be interpreted as
a cluster in its own right, and putting all these points together in a cluster could be

seen as desirable behaviour of the ML estimator, at least if two of the original 𝐾 clusters are close enough to each other that merging them will produce a cluster that
is fairly well fitted by a single Gaussian distribution; note that the Gaussian mixture
model does not assume strong separation between components, and a mixture of
two Gaussians may be unimodal and in fact very similar to a single Gaussian. A
breakdown point larger than a given 𝛼, 0 < 𝛼 < 1/2, may not be seen as desirable in
cluster analysis if there can be clusters containing a proportion of less than 𝛼 of the
data, as a larger breakdown point will stop a method from taking such clusters (when
added in large distance from the rest of the data) appropriately into account.
The core problem is that it is not clear what distinguishes a group of outliers
from a legitimate cluster. I am not aware of any formal definition of outliers and
clusters in the literature that allows this distinction. Even a one-point cluster is not
necessarily invalid. Here are some possible and potentially conflicting aspects of
such a distinction.
• A certain minimum size may be required for a cluster; smaller groups of points
may be called outliers.
• Groups of points in low density areas of the data may be called outliers. Note that
this particularly means that very widely spread Gaussian mixture components
would also be defined as outliers, deviating from the standard interpretation of
Gaussian mixture components as clusters.
• Members of non-Gaussian mixture components may be called outliers. This does
not seem to be a good idea, because Gaussianity cannot be assessed for too small
groups of observations, and furthermore in practice model assumptions are never
perfectly fulfilled, and it may be desirable to interpret homogeneous or unimodal
non-Gaussian parts of the data as “cluster” and fit them by a Gaussian component.
• The term “outlier” suggests that outliers lie far away from most other observa-
tions, so it may be required that outliers are farther away from the clusters than
the clusters are from each other. But this would be in conflict with the intuition
that strong separation is usually seen as a desirable feature for well interpretable
clusters. It may only be reasonable in applications in which there is prior informa-
tion that there is limited variation even between clusters, as is implied by certain
Bayesian approaches to clustering [17].
• The term “cluster” may be seen as flexible enough that a definition of an outlier
is not required. Clustering should accommodate whatever is “outlying” by fitting
it by one or more further clusters, if necessary of size one (single linkage clus-
tering can be useful for outlier detection, even though it is inappropriate for most
clustering problems).
Most of these items require specific decisions that cannot be made in any objective
and general manner, but only taking into account subject matter information, such
as the minimum size of valid clusters or the density level below which observations
are seen as outliers (potentially compared to density peaks in the distribution). This
implies that an appropriate treatment of outliers in cluster analysis cannot be expected
to be possible without user tuning.

3 Robustness and the Number of Clusters

The last item suggests that there is an interplay between outlier identification and the
number of clusters, and that adding clusters might be a way of dealing with outliers;
as long as clusters are assumed to be Gaussian, a single additional component may
not be enough. More generally, concentrating robustness research on the case of
fixed 𝐾 may be seen as unrealistic, because 𝐾 is rarely known, although estimating
𝐾 is a notoriously difficult problem even without worrying about outliers [13].
The classical robustness concepts, breakdown point and influence function, as-
sume parameters from R𝑞 with fixed 𝑞. If 𝐾 is not fixed, the number of parameters
is not fixed either, and the classical concepts do not apply.
As an alternative to the breakdown point, [11] defined a “dissolution point”.
Dissolution is measured in terms of cluster memberships of points rather than in
terms of parameters, and is therefore also applicable to nonparametric clustering
methods. Furthermore, dissolution applies to individual clusters in a clustering;
certain clusters may dissolve, i.e., there may be no sufficiently similar cluster in a
new clustering computed after, e.g., adding an outlier; and others may not dissolve.
This does not require 𝐾 to be fixed; the definition is chosen so that if a clustering
changes from 𝐾 to 𝐿 < 𝐾 clusters, at least 𝐾 − 𝐿 clusters dissolve.
Hennig [10, 11] showed that when estimating 𝐾 using the BIC and standard ML
estimation, reasonably well separated clusters do not dissolve when adding possibly
even a large percentage of outliers (this does not hold for every method to estimate
the number of clusters, see [11]). Furthermore, [11] showed that no method with
fixed 𝐾 can be robust for data in which 𝐾 is misspecified - already [7] had found
that robustness features in clustering generally depend on the data.
An implication of these results is that even in the fixed 𝐾 problem, the standard
ML method can be a valid competitor regarding robustness if it comes with a rule
that allows to add one or possibly more clusters that can then be used to fit the
outliers (this is rarely explored in the literature, but [18], Sec. 7.7, show an example
in which adding a single component does not work very well).
An issue with adding clusters to accommodate outliers is that in many applications
it is appropriate to distinguish between meaningful clusters, and observations that
cannot be assigned to such clusters (often referred to as “noise”). Even though adding
clusters of outliers can formally prevent the dissolution of existing clusters, it may
be misleading to interpret the resulting clusters as meaningful, and a classification
as outliers or noise can be more useful. This is provided by the trimming and noise
component approaches to robust clustering. Also some other clustering methods such
as the density-based DBSCAN [5] provide such a distinction. On the other hand,
modelling clusters by heavy-tailed distributions such as in mixtures of t-distributions
will implicitly assign outlying observations to clusters that potentially are quite far
away. For this reason, [18], Sec. 7.7, provide an additional outlier identification
rule on top of the mixture fit. [6] even distinguish between “mild” outliers that are
modelled as having a larger variance around the same mean, and “gross” outliers to
be trimmed. The variety of approaches can be connected to the different meanings
that outliers can have in applications. They can be erroneous, they can be irrelevant

noise, but they can also be caused by unobserved but relevant special conditions (and
would as such qualify as meaningful clusters), or they could be valid observations
legitimately belonging to a meaningful cluster that regularly produces observations
further away from the centre than modelled by a Gaussian distribution.
Even though currently there is no formal robustness property that requires both the
estimation of 𝐾 and an identification or downweighting of outliers, there is demand
for a method that can do both.
Estimating 𝐾 comes with an additional difficulty that is relevant in connection
with robustness. As mentioned before, in clustering based on the Gaussian mixture
model normally every mixture component will be interpreted as a cluster. In reality,
however, meaningful clusters are not perfectly Gaussian. Gaussian mixtures are very
flexible for approximating non-Gaussian distributions. Using a consistent method
for estimating 𝐾 means that for large enough 𝑛 a non-Gaussian cluster will be
approximated by several Gaussian mixture components. The estimated 𝐾 will be
fine for producing a Gaussian mixture density that fits the data well, but it will
overestimate the number of interpretable clusters. The estimation of 𝐾, if interpreted
as the number of clusters, relies on precise Gaussianity of the clusters, and is as such
itself riddled with a robustness problem; in fact slightly non-Gaussian clusters may
even drive the estimated 𝐾 → ∞ if 𝑛 → ∞ [12, 14].
This is connected with the more fundamental problem that there is no unique
definition of a cluster either. The cluster analysis user needs to specify the cluster
concept of interest even before robustness considerations, and arguably different
clustering methods imply different cluster concepts [13]. A Gaussian mixture model
defines clusters by the Gaussian distributional shape (unless mixture components
are merged to form clusters [12]). Although this can be motivated in some real situ-
ations, robustness considerations require that distributional shapes fairly close to the
Gaussian should be accepted as clusters as well, but this requires another specifica-
tion, namely how far from a Gaussian a cluster is allowed to be, or alternatively how
separated Gaussian components have to be in order to count as separated clusters. A
similar problem can also occur in nonparametric clustering; if clusters are associated
with density modes or level sets, the cluster concept depends on how weak a mode
or gap between high level density sets is allowed to be to be treated as meaningful.
Hennig and Coretto [14] propose a parametric bootstrap approach to simultane-
ously estimate 𝐾 and assign outliers to a noise component. This requires two basic
tuning decisions. The first one regards the minimum percentage of observations so
that a researcher is willing to add another cluster if the noise component can be re-
duced by this amount. The second one specifies a tolerance that allows a data subset
to count as a cluster even though it deviates to some extent from what is expected
under a perfectly Gaussian distribution. There is a third tuning parameter that is in
effect for fixed 𝐾 and tunes how much of the tails of a non-Gaussian cluster can be
assigned to the noise in order to improve the Gaussian appearance of the cluster. One
could even see the required constraints on covariance matrix eigenvalues as a further
tuning decision. Default values can be provided, but situations in which matters can
be improved deviating from default values are easy to construct.

4 More on User Tuning

User tuning is not popular, as it is often difficult to make appropriate tuning decisions.
Many scientists believe that subjective user decisions threaten scientific objectivity,
and also background knowledge dependent choices cannot be made when investigat-
ing a method’s performance by theory and simulations. The reason why user tuning
is indispensable in robust cluster analysis is that it is required in order to make the
problem well defined. The distinction between clusters and outliers is an interpre-
tative one that no automatic method can make based on the data alone. Regarding
the number of clusters, imagine two well separated clusters (according to whatever
cluster concept of interest), and then imagine them to be moved closer and closer
together. Below what distance are they to be considered a single cluster? This is
essentially a tuning decision that the data cannot make on their own.
There are methods that do not require user tuning. Consider the mclust imple-
mentation of Gaussian mixture model based clustering. The number of clusters is by
default estimated by the BIC. As seen above, this is not really appropriate for large
data sets, but its derivation is essentially asymptotic, so that there is no theoretical
justification for it for small data sets either. Empirically it often but not always works
well, and there is little investigation of whether it tends to make the “right” decision
in ambiguous situations where it is not clear without user tuning what it even means
to be “right”. Covariance matrix constraints in mclust are not governed by a tuning of
eigenvalues or their ratios to be specified by the user. Rather the BIC decides between
different covariance matrix models, but this can be erratic and unstable, as it depends
on whether the EM-algorithm gets caught in a degenerate likelihood maximum or
not, and in situations where two or more covariance matrix models have similar BIC
values (which happens quite often), a tiny change in the data can result in a different
covariance matrix model being selected, and substantial changes in the clustering. A
tunable eigenvalue condition can result in much smoother behaviour. When it comes
to outlier identification, mclust offers the addition of a uniform “noise” mixture
component governed by the range of the data, again supposedly without user tuning.
This starts from an initial noise estimation that requires tuning (Sec. 3.1.2 of [3]) and
is less robust in terms of breakdown and dissolution than trimming and the improper
noise component, both of which require tuning [10, 11]. The ICL, an alternative to
the BIC (Sec. 2.6 of [3]), on the other hand, is known to merge different Gaussian
mixture components already at a distance at which they intuitively still seem to
be separated clusters. Similar comments apply to the mixture of t-distributions; it
requires user tuning for identifying outliers, scatter matrix constraints, and it has the
same issues with BIC and ICL as the Gaussian mixture.
Summarising, both the identification of and robustness against outliers and the
estimation of the number of clusters require tuning in order to be well defined
problems; user tuning can only be avoided by taking tuning decisions out of the
user’s hands and making them internally, which will work in some situations and
fail in others, and the impression of automatic data driven decision making that a
user may have is rather an illusion. This, however, does not free method designers
from the necessity to provide default tunings for experimentation and cases in which

the users do not feel able to make the decisions themselves, and tuning guidance for
situations in which more information is available. A decision regarding the smallest
valid size of a cluster is rather well interpretable; a decision regarding admissible
covariance matrix eigenvalues is rather difficult and abstract.

5 Stability Measurement

Robustness is closely connected to stability. Both experimental and theoretical investigation of the stability of clusterings require formal stability measurements, usually
comparing two clusterings on the same data (potentially modified by replacing or
adding observations). Not assuming any parametric model, proximity measures such
as the Adjusted Rand Index (ARI; [16]), the Hamming distance (HD; [2]), or the
Jaccard distance between individual clusters [11] can be used. Note that [2], standard
reference on cluster stability in the machine learning community, state that stability
and instability are caused in the first place by ambiguities in the cluster structure
of the data, rather than by a method’s robustness or lack of it. Although the outlier
problem is ignored in that paper, it is true that cluster analysis can have other stability
issues that are as serious as or worse than gross outliers.
To my knowledge, none of the measures currently in use allow for a special
treatment of a set of outliers or noise; either these have to be ignored, or treated just
as any other cluster. Both ARI and HD, comparing clusterings C1 and C2 , consider
pairs of observations 𝑥𝑖 , 𝑥 𝑗 and check whether those that are in the same cluster
in C1 are also in the same cluster in C2 . An appropriate treatment of noise sets
𝑁1 ∈ C1 , 𝑁2 ∈ C2 would require that 𝑥𝑖 , 𝑥 𝑗 ∈ 𝑁1 are not just in the same cluster in
C2 but rather in 𝑁2 , i.e., whereas the numberings of the regular clusters do not have
to be matched (which is appropriate because cluster numbering is meaningless), 𝑁1
has to be matched to 𝑁2 . Corresponding re-definitions of these proximities will be
useful to robustness studies.
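As one possible, purely illustrative implementation of such a re-definition (not an established index), the R sketch below counts a pair of points assigned to the noise set of one clustering as agreeing only if it is also assigned to the noise set of the other clustering; the noise label 0 and the toy clusterings are assumptions made only for the example.

```r
# Illustrative pair-counting agreement with special treatment of a noise set (label 0 assumed).
pair_agreement_noise <- function(c1, c2) {
  n <- length(c1); agree <- 0; total <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    noise1 <- (c1[i] == 0 && c1[j] == 0)               # pair lies in the noise set of clustering 1
    noise2 <- (c2[i] == 0 && c2[j] == 0)               # pair lies in the noise set of clustering 2
    same1  <- (c1[i] == c1[j]); same2 <- (c2[i] == c2[j])
    # a pair joined within a noise set must be joined within the *noise* set of the
    # other clustering, not merely within some common cluster
    ok <- if (noise1 || noise2) (noise1 && noise2) else (same1 == same2)
    agree <- agree + ok; total <- total + 1
  }
  agree / total
}

c1 <- c(1, 1, 2, 2, 0, 0)            # two clusters plus a noise set (label 0)
c2 <- c(1, 1, 2, 2, 3, 3)            # same partition, but the former noise forms a regular cluster
pair_agreement_noise(c1, c2)         # < 1: turning noise into a regular cluster is penalized
mclust::adjustedRandIndex(c1, c2)    # = 1: the plain ARI treats the noise set as any other cluster
```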

6 Conclusion

Key practical implications of the above discussions are:


• Outliers can be treated as forming their own clusters, or be collected in out-
lier/noise or trimmed sets, or be integrated in clusters of non-outliers. Which of
these is appropriate depends on the nature of outliers in a given application.
• Methods that do not identify outliers but add clusters in order to accommodate
them are valid competitors of robust clustering methods, as are nonparametric
density-based methods.
• Cluster analysis involving estimating the number of clusters and robustness require
tuning in order to define the problem they are meant to solve well. Method

developers need to provide sensible defaults, but also to guide the users regarding
a meaningful interpretation of the tuning decisions.

References

1. Banerjee, A., Davé, R. N.: Robust clustering. WIREs Data Mining Knowl. Discov. 2, 29–59
(2012)
2. Ben-David, S., von Luxburg, U., Pál, D.: A sober look at clustering stability. In: Proceedings
of the 19th annual conference on Learning Theory (COLT’06), pp. 5–19, Springer, Berlin
(2006)
3. Bouveyron, C., Celeux, G., Murphy, T. B., Raftery, A. E.: Model-based clustering and classi-
fication for data science. Cambridge University Press, Cambridge MA (2019)
4. Coretto, P., Hennig, C.: Consistency, breakdown robustness, and algorithms for robust im-
proper maximum likelihood clustering. J. Mach. Learn. Res.18, 1–39 (2017)
5. Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters
in large spatial databases with noise. In: Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, pp. 226-231, AAAI Press, Portland OR (1996)
6. Farcomeni, A., Punzo, A.: Robust model-based clustering with mild and gross outliers. TEST
29, 989–1007 (2020)
7. García-Escudero, L. A., Gordaliza, A.: Robustness properties of k-means and trimmed k-
means. J. Am. Stat. Assoc. 94, 956–969 (1999)
8. García-Escudero, L. A., Gordaliza, A., Greselin, F., Ingrassia, S., Mayo-Iscar, A.: Eigenvalues
and constraints in mixture modeling: geometric and computational issues. Adv. Data Anal.
Classi. 12, 203–233 (2018)
9. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A., Hennig, C.: Robustness
and outliers. In: Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.) Handbook of Cluster
Analysis, pp. 653–678. Chapman & Hall/CRC, Boca Raton FL (2016)
10. Hennig, C.: Breakdown points for maximum likelihood estimators of location-scale mixtures.
Ann. Stat. 32, 1313–1340 (2004)
11. Hennig, C.: Dissolution point and isolation robustness: robustness criteria for general cluster
analysis methods. J. Multivariate Anal. 99, 1154–1176 (2008)
12. Hennig, C.: Methods for merging Gaussian mixture components. Adv. Data Anal. Classi. 4,
3–34 (2010)
13. Hennig, C.: Clustering strategy and method selection. In: Hennig, C., Meila, M., Murtagh, F.,
Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 703–730. Chapman & Hall/CRC, Boca
Raton FL (2016)
14. Hennig, C., Coretto, P.: An adequacy approach for deciding the number of clusters for
OTRIMLE robust Gaussian mixture-based clustering. Aust. N. Z. J. Stat. (2021) doi:
10.1111/anzs.12338
15. Huber, P. J., Ronchetti, E. M.: Robust Statistics (2nd ed.). Wiley, Hoboken NJ (2009)
16. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
17. Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B.: Identifying mixtures of mixtures using
Bayesian estimation. J. Comput. Graph. Stat. 26, 285–295 (2017)
18. McLachlan, G. J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
19. Ritter, G.: Robust cluster analysis and variable selection. Chapman & Hall/CRC, Boca Raton
FL (2015)

Robustness Aspects of Optimized Centroids

Jan Kalina and Patrik Janáček

Abstract Centroids are often used for object localization tasks, supervised seg-
mentation in medical image analysis, or classification in other specific tasks. This
paper starts by contributing to the theory of centroids by evaluating the effect of
modified illumination on the weighted correlation coefficient. Further, robustness
of various centroid-based tools is investigated in experiments related to mouth lo-
calization in non-standardized facial images or classification of high-dimensional
data in a matched pairs design. The most robust results are obtained if the sparse
centroid-based method for supervised learning is accompanied with an intrinsic vari-
able selection. Robustness, sparsity, and energy-efficient computation turn out not to
contradict the requirement on the optimal performance of the centroids.

Keywords: image processing, optimized centroids, robustness, sparsity, low-energy replacements

1 Introduction

Methods based on centroids (templates, prototypes) are simple yet widely used for
object localization or supervised segmentation in image analysis tasks and also within
other supervised or unsupervised methods of machine learning. This is true e.g. in
various biomedical imaging tasks [1], where researchers typically cannot afford a too
large number of available images [3]. Biomedical applications also benefit from the
interpretability (comprehensibility) of centroids [11].
This paper is focused on the question how are centroid-based methods influenced
by data contamination. Section 2 recalls the main approaches to centroid-based
object localization in images, as well as a recently proposed method of [6] for op-

Jan Kalina ( ) · Patrik Janáček


The Czech Academy of Sciences, Institute of Computer Science, Pod Vodárenskou věží 2, 182 07
Prague 8, Czech Republic, e-mail: [email protected];[email protected]


timizing centroids and their weights. The performance of these methods to data
contamination (non-standard conditions) has not been however sufficiently investi-
gated. Particularly, we are interested in the performance of low-energy replacements
of the optimal centroids and in the effect of posterior variable selection (pixel selec-
tion). Section 2.1 presents novel expressions for images with a changed illumination.
Numerical experiments are presented in Section 3. These are devoted to mouth lo-
calization over raw facial images as well as over artificially modified images; other
experiments are devoted to high-dimensional data in a matched pairs design. The
optimized centroids of [6] and especially their modification proposed here turn out
to have remarkable robustness properties. Section 4 brings conclusions.

2 Centroid-based Classification (Object Localization)

Commonly used centroid-based approaches to object localization (template match-


ing) in images construct the centroid simply as the average of the positive examples
and typically use Pearson product-moment correlation coefficient 𝑟 as the most com-
mon measure of similarity between a centroid c and a candidate part of the image
(say x). While the centroid and candidate areas are matrices of size (say) 𝐼 × 𝐽 pixels,
they are used in computations after being transformed to vectors of length 𝑑 := 𝐼𝐽.
This allows us to use the notation c = (𝑐 1 , . . . , 𝑐 𝑑 )𝑇 and x = (𝑥 1 , . . . , 𝑥 𝑑 )𝑇 .
Assumptions A: We assume the whole image to have size $N_R \times N_C$ pixels. We assume the centroid $\mathbf{c} = (c_{ij})$ with $i = 1, \ldots, I$ and $j = 1, \ldots, J$ to be a matrix of size $I \times J$ pixels. A candidate area $\mathbf{x}$ and nonnegative weights $\mathbf{w}$ with $\sum_i \sum_j w_{ij} = 1$ are assumed to be matrices of the same size as $\mathbf{c}$.
For a given image, E will denote the set of its rectangular candidate areas of size $I \times J$. The candidate area fulfilling
$$\arg\max_{\mathbf{x} \in E}\, r(\mathbf{x}, \mathbf{c}) \qquad (1)$$
or (less frequently)
$$\arg\min_{\mathbf{x} \in E}\, \|\mathbf{x} - \mathbf{c}\|_2 \qquad (2)$$
is classified to correspond to the object (e.g. mouth).
Let us consider here replacing $r$ by the weighted correlation coefficient $r_w$,
$$\arg\max_{\mathbf{x} \in E}\, r_w(\mathbf{x}, \mathbf{c}; \mathbf{w}), \qquad (3)$$
with given non-negative weights $\mathbf{w} = (w_1, \ldots, w_d)^T \in \mathbb{R}^d$ with $\sum_{i=1}^{d} w_i = 1$, where $\mathbb{R}$ denotes the set of all real numbers. Let us further use the notation $\bar{x}_w = \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$ and $\bar{c}_w = \mathbf{w}^T \mathbf{c}$. We may recall $r_w$ between $\mathbf{x}$ and $\mathbf{c}$ to be defined as
$$r_w(\mathbf{x}, \mathbf{c}; \mathbf{w}) = \frac{\sum_{i=1}^{d} w_i (x_i - \bar{x}_w)(c_i - \bar{c}_w)}{\sqrt{\sum_{i=1}^{d} \big[w_i (x_i - \bar{x}_w)^2\big] \sum_{i=1}^{d} \big[w_i (c_i - \bar{c}_w)^2\big]}}. \qquad (4)$$
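For reference, (4) can be computed directly; the following R helper is a minimal sketch with arbitrary example vectors and weights, cross-checked against the weighted correlation returned by cov.wt.

```r
# Weighted correlation coefficient of formula (4) (minimal sketch).
r_w <- function(x, cen, w) {
  w  <- w / sum(w)                                     # weights summing to one
  xb <- sum(w * x); cb <- sum(w * cen)                 # weighted means
  sum(w * (x - xb) * (cen - cb)) /
    sqrt(sum(w * (x - xb)^2) * sum(w * (cen - cb)^2))
}

set.seed(3)
x <- rnorm(10); c0 <- rnorm(10); w <- runif(10)
r_w(x, c0, w)
cov.wt(cbind(x, c0), wt = w, cor = TRUE)$cor[1, 2]     # agrees with the base-R weighted correlation
```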

Fig. 1 The workflow of the optimization procedure of [6].

A detailed study of [2] investigated theoretical foundations of centroid-based classification, however for the rare situation when (1) is replaced by (2).
The sophisticated centroid optimization method of [6], outlined in Figure 1,
requires to minimize a nonlinear loss function corresponding to a regularized margin-
like distance (exploiting 𝑟 𝑤 ) evaluated for the worst pair from the worst image over
the training database (i.e. the worst with respect to the loss function). Subsequently,
optimization of the weights may be also performed, ensuring many pixels to obtain
zero weights (i.e. yielding a sparse solution). The optimal centroid may be used
as such, even without any weights at all; still, optimization of the weights leads
to a further improvement of the classification performance. In the current paper,
we always consider a linear (i.e. approximate) approach to centroid optimization,
although a nonlinear optimization is also successful as revealed in the comparisons
in [6].

2.1 Centroid-Based Object Localization: Asymmetric Modification of the Candidate Area

In the context of object localization as described above, our aim is to express $r_w(\mathbf{x}^*, \mathbf{c}; \mathbf{w})$ under modified candidate areas (say $\mathbf{x}^*$) of the image $\mathbf{x}$; we stress that
the considered modification of the image does not allow to modify the centroid c and
weights w. These considerations are useful for centroid-based object localization,
when asymmetric illumination is present in the whole image or its part. The weighted
variance 𝑆 2𝑤 (x; w) of x with weights w and the weighted covariance 𝑆 𝑤 (x, c) between
x and c are denoted as
Õ Õ
𝑆 2𝑤 (x) = 𝑤𝑖 𝑗 (𝑥𝑖 𝑗 − 𝑥¯ 𝑤 ) 2 , 𝑆 𝑤 (x, c) = 𝑤𝑖 𝑗 (𝑥𝑖 𝑗 − 𝑥¯ 𝑤 )(𝑐 𝑖 𝑗 − 𝑐¯𝑤 ). (5)
𝑖, 𝑗 𝑖, 𝑗

Further, the notation x + 𝑎 with x = (𝑥𝑖 𝑗 )𝑖, 𝑗 is used to denote the matrix (𝑥 𝑖 𝑗 + 𝑎)𝑖, 𝑗
for a given 𝑎 ∈ R. We also use the following notation. The image x is divided into two parts $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)^T \in \mathbb{R}^d$, where $\sum_I$ or $\sum_{II}$ denote the sum over the pixels of the
first or second part, respectively.

Theorem 1 Under Assumptions A, the following statements hold.

1. For $\mathbf{x}^* = \mathbf{x} + \varepsilon$, it holds $r_w(\mathbf{x}^*, \mathbf{c}) = r_w(\mathbf{x}, \mathbf{c})$ for $\varepsilon > 0$.
2. For $\mathbf{x}^* = k\mathbf{x}$ with $k > 0$, it holds $r_w(\mathbf{x}^*, \mathbf{c}) = r_w(\mathbf{x}, \mathbf{c})$.
3. For $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)^T$ and $\mathbf{x}^* = (\mathbf{x}_1, \mathbf{x}_2 + \varepsilon)^T$, it holds
$$r_w(\mathbf{x}^*, \mathbf{c}) = \frac{S_w(\mathbf{x}, \mathbf{c}) + \varepsilon \sum_{II} w_{ij} c_{ij} - \varepsilon v_2 \bar{c}_w}{S_w(\mathbf{c})\,\sqrt{S_w^2(\mathbf{x}) + v_2(1 - v_2)\varepsilon^2 + 2\varepsilon(2v_2 - 1)\big(\sum_{II} w_{ij} x_{ij} - v_2 \bar{x}_w\big)}}\,, \qquad (6)$$
where $v_2 = \sum_{II} w_{ij}$ and $\varepsilon \in \mathbb{R}$.
4. For $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)^T$ and $\mathbf{x}^* = (\mathbf{x}_1, k\mathbf{x}_2)^T$ with $k > 0$, it holds
$$r_w(\mathbf{x}^*, \mathbf{c}) = r_w(\mathbf{x}, \mathbf{c})\,\frac{S_w(\mathbf{x})}{S_w^*(\mathbf{x})} + \frac{(k - 1)\sum_{II} w_{ij}\, x_{ij}\,(c_{ij} - \bar{c}_w)}{S_w(\mathbf{c})\,S_w^*(\mathbf{x})}\,, \qquad (7)$$
where
$$S_w^*(\mathbf{x})^2 = S_w^2(\mathbf{x}) + (k^2 - 1)\sum_{II} w_{ij} x_{ij}^2 - \frac{k^2 - 1}{n}\Big(\sum_{II} w_{ij} x_{ij}\Big)^2 - \frac{2}{n}(k - 1)\Big(\sum_{I} w_{ij} x_{ij}\Big)\Big(\sum_{II} w_{ij} x_{ij}\Big). \qquad (8)$$

The proofs of the formulas are technical but straightforward, exploiting known
properties of 𝑟 𝑤 . The theorem reveals 𝑟 𝑤 to be vulnerable to the modified illumina-
tion, i.e. all the methods based on centroids of Section 2 may be too influenced by
the data modification.
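Statements 1 and 2 are easy to verify numerically; the short R sketch below does so on arbitrary placeholder data using a direct implementation of $r_w$.

```r
# Numerical check of statements 1 and 2 of Theorem 1 (placeholder data).
r_w <- function(x, cen, w) {
  xb <- sum(w * x); cb <- sum(w * cen)
  sum(w * (x - xb) * (cen - cb)) /
    sqrt(sum(w * (x - xb)^2) * sum(w * (cen - cb)^2))
}
set.seed(4)
x   <- rnorm(26 * 56); cen <- rnorm(26 * 56)           # vectorized candidate area and centroid
w   <- runif(26 * 56); w <- w / sum(w)

r_w(x, cen, w) - r_w(x + 5, cen, w)                    # statement 1: a global shift leaves r_w unchanged
r_w(x, cen, w) - r_w(3 * x, cen, w)                    # statement 2: a global rescaling leaves r_w unchanged
```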

3 Experiments

3.1 Data

Three datasets are considered in the experiments. In the first dataset, the task is to
localize the mouth in the database containing 212 grey-scale 2D facial images of faces
of healthy individuals of size 192 × 256 pixels. The database previously analyzed
in [6] was acquired at the Institute of Human Genetics, University of Duisburg-
Essen, within research of genetic syndrome diagnostics based on facial images [1]
under the projects BO 1955/2-1 and WU 314/2-1 of the German Research Council
(DFG). We consider the training dataset to consist of the first 124 images, while the
remaining 88 images represent an independent test set acquired later but still under
the same standardized conditions fulfilling assumptions of unbiased evaluation. The
centroid described below is used with 𝐼 = 26 and 𝐽 = 56.
Using always raw training images, the methods are applied not only to the raw test
set, but also to the test set after being artificially modified using models inspired by
Section 2.1. On the whole, six different versions of the test database are considered;
the modifications required that we first manually localized the mouths in the test
images:
1. Raw images.

2. Illumination. If we consider a pixel $[i, j]$ with grey-scale intensity $f_{ij}$ in an image (say) $f$, then the modified intensity will be
$$f_{ij}^* = f_{ij} + \lambda |j - j_0|, \quad i = 1, \ldots, I, \; j = 1, \ldots, J, \qquad (9)$$
where $[i_0, j_0]$ are the coordinates of the mouth and $\lambda = 0.002$ (see the sketch after this list).


3. A more severe version of the modification (ii) with 𝜆 = 0.004.
4. Asymmetry. In every test image, each true mouth x of size 26 × 56 pixels with
intensities $x_{ij}$ is replaced by
$$x_{ij}^* = \begin{cases} x_{ij} + 0.2, & i = 1, \ldots, 26, \; j = 1, \ldots, 15, \\ x_{ij}, & i = 1, \ldots, 26, \; j = 16, \ldots, 41, \\ x_{ij} + 0.1, & i = 1, \ldots, 26, \; j = 42, \ldots, 56. \end{cases} \qquad (10)$$

5. Rotation. Such candidate area is classified as the mouth in the given image,
which maximizes the loss (1) or (3) over the three versions of the image, namely
after rotations by +5, 0, and −5 degrees.
6. Image denoising (for raw images). The LWS-filter [5], replacing each grey
value by the least weighted squares estimate [7] computed from a circular
neighborhood with radius 4 pixels, was applied to each test image.
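As referenced in modification 2 above, the following R sketch applies the illumination model (9) to a grey-scale image matrix; the image, the mouth column $j_0$, and $\lambda$ are placeholders, with $\lambda = 0.004$ corresponding to the more severe variant 3.

```r
# Sketch of the illumination modification (9) on a grey-scale image (placeholder values).
NR <- 192; NC <- 256
f  <- matrix(runif(NR * NC), nrow = NR)                # grey values in [0, 1] (placeholder image)
j0 <- 130                                              # column of the manually localized mouth (placeholder)

illuminate <- function(f, j0, lambda) {
  f + lambda * abs(col(f) - j0)                        # f*_ij = f_ij + lambda * |j - j0|
}
f_mild   <- illuminate(f, j0, lambda = 0.002)          # modification 2
f_severe <- illuminate(f, j0, lambda = 0.004)          # the more severe modification 3
```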
The optimized centroids were explained in [6] to be applicable also to classifi-
cation tasks for other data than images, if they follow a matched pairs design. We
use two datasets from [6] in the experiments and their classification accuracies are
reported in a 10-fold cross validation.
• AMI. The gene expressions of 4000 genes over 92 individuals in two versions (raw
or contaminated by outliers). The aim is to learn a classification rule allowing to
assign a new individual to one of the two given groups (controls or patients with
acute myocardial infarction (AMI)).
• Simulated data. The design mimicks a 1:1 matched case-control study with 2500
variables over 60 individuals in two versions (raw or contaminated by outliers)
and the aim is again to classify between two given groups (patients and controls).

Fig. 2 The average centroid used as the initial choice for the centroid optimization.

3.2 Methods

The following methods are compared in the experiments; standard methods are
computed using R software and we use our own C++ implementation of centroid-
based methods. The average centroid is obtained as the average of all mouths of the
training set, or the average across all patients. The centroid optimization starts with
the average centroid as the initial one, and the optimization of weights starts with
equal weights as the initial ones:

A. Centroid-based method (2).


B. Centroid-based method (1) with average centroid (Figure 2) and equal weights.
C. Centroid-based method (1) with average centroid, replacing 𝑟 𝑤 by cosine sim-
ilarity defined for x ∈ R 𝑑 and y ∈ R 𝑑 as
$$\cos\theta = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|_2\,\|\mathbf{y}\|_2} = \frac{\sum_{i=1}^{d} x_i y_i}{\big(\sum_{i=1}^{d} x_i^2\big)^{1/2}\big(\sum_{j=1}^{d} y_j^2\big)^{1/2}}. \qquad (11)$$

D. Centroid-based method (1) with optimal centroid and equal weights [6].
E. Centroid-based method (1) with optimal centroid and optimal weights as in
[6] (optimizing the centroid and only after that the weights), i.e. with posterior
variable selection (pixel selection).
F. Centroid-based method (1) as in [6], where however the weights are optimized
first, and then the centroid is optimized.
G. Centroid-based method (1) as in [6], where however each step of centroid
optimization is immediately followed by optimization of the weights; this method
performs (in contrary to [6]) intrinsic variable selection.
H. Centroid-based method (1) as in [6], where however each optimization step
proceeds over 10 worst images (instead of the very worst image).
I. Centroid-based method (1) with average centroid, where 𝑟 𝑤 is used as 𝑟 LWS [7]
with weight function
$$\psi_1(t) = \exp\left(-\frac{t^2}{2\tau^2}\right) \mathbf{1}\!\left[t < \tfrac{3}{4}\right], \quad t \in [0, 1], \qquad (12)$$

corresponding to a (trimmed) density of the Gaussian N(0, 1) distribution; $\mathbf{1}$ denotes an indicator function. To explain, the computation of $r_{\mathrm{LWS}}(x, y)$ starts by
fitting the LWS estimator in the linear regression of 𝑦 as the response of 𝑥, and
𝑟 𝑤 is used with the weights determined by the LWS estimator.
 
J. The method (I) with the weight function $\psi_2(t) = \mathbf{1}\left[t < \tfrac{3}{4}\right]$ for $t \in [0, 1]$.
K. The approach of [12] that is meaningful however only for the mouth localization
dataset.

Table 1 Classification accuracy for three datasets. For the mouth localization data, modifications of the test images are described in Section 3: (i) None (raw images); (ii) Illumination (𝜆 = 0.002); (iii) Illumination (𝜆 = 0.004); (iv) Asymmetry; (v) Rotation; (vi) Image denoising. A detailed description of the methods is given in Section 3.2.
Dataset
Mouth localization AMI Simul.
Method (i) (ii) (iii) (iv) (v) (vi) Raw Cont. Raw Cont.
A 0.90 0.86 0.81 0.88 0.81 0.93 0.73 0.66 0.71 0.67
B 0.93 0.90 0.86 0.92 0.86 0.95 0.76 0.70 0.77 0.70
C 0.89 0.84 0.74 0.89 0.84 0.93 0.72 0.61 0.70 0.64
D 1.00 0.98 0.95 0.99 0.93 0.98 0.85 0.83 0.80 0.77
E 1.00 1.00 0.98 1.00 0.95 0.98 0.87 0.85 0.83 0.80
F 1.00 0.98 0.96 1.00 0.89 0.97 0.86 0.82 0.79 0.73
G 1.00 0.96 0.95 1.00 0.93 0.99 0.88 0.85 0.86 0.82
H 1.00 1.00 0.98 1.00 0.92 0.96 0.86 0.83 0.84 0.79
I 0.96 0.96 0.93 0.99 0.94 0.96 0.77 0.72 0.75 0.71
J 0.94 0.93 0.89 0.95 0.89 0.93 0.74 0.69 0.72 0.66
K 1.00 1.00 0.97 0.95 0.97 0.96 Not meaningful

3.3 Results

The results as ratios of correctly classified cases are presented in Table 1. For the
mouth localization, the optimized centroids of methods D, F, and H turn out to out-
perform simple centroids (A, B, and C); the novel modifications E and G performing
intrinsic variable selection yield the best results. Simple standard centroids (A, B,
and C) are non-robust to data contamination; this follows from Section 2.1 and from
analogous considerations for other types of contaminating the images. On the other
hand, the robustness of optimized centroids is achieved by their optimization (but
not by using 𝑟 𝑤 as such). Methods E and G are even able to overcome methods I
and J based on 𝑟 LWS . We recall that 𝑟 𝐿𝑊 𝑆 is globally robust in terms of the break-
down point [4], is computationally very demanding, and does not seem to allow
any feasible optimization. Other results reported previously in [6] revealed that also
numerous standard machine learning methods are too vulnerable (non-robust) with
respect to data contamination, if measuring the similarity by 𝑟 or 𝑟 𝑤 .
For the AMI dataset, methods E and G with variable selection perform the best
results for raw as well as contaminated datasets. For the simulated data, the method G
yields the best results and the method E stays only slightly behind as the second best
method.

4 Conclusions

Understanding the robustness of centroids represents a crucial question in image processing with applications for convolutional neural networks (CNNs), because
centroids are very versatile tools that may be based on deep features learned by deep

learning. We focus on small datasets, for which CNNs cannot be used [10]. This
paper is interested in performance of centroid-based object localization over small
databases with non-standardized images, which commonly appear e.g. in medical
image analysis.
The requirements on robustness with respect to modifications of the images turn
out not to contradict the requirements on optimality of the centroids. The method G
applying an intrinsic variable selection on the optimal centroid and weights [6]
can be interpreted within a broader framework of robust dimensionality reduction
(see [8] for an overview) or low-energy approximate computation. Additional results
not presented here reveal the method based on optimized centroids to be robust also
to small shift. Neither the theoretical part of this paper nor the experiments exploit
any specific properties of faces. The presented robust method has potential also for
various other applications, e.g. for deep fake detection by centroids, robust template
matching by CNNs [9], or applying filters in convolutional layers of CNNs.

Acknowledgements The research was supported by the grant 22-02067S of the Czech Science
Foundation.

References

1. Böhringer, S., de Jong, M. A.: Quantification of facial traits. Frontiers in Genetics 10, 397
(2019)
2. Delaigle, A., Hall, P.: Achieving near perfect classification for functional data. Journal of the
Royal Statistical Society 74, 267–286 (2012)
3. Gao, B., Spratling, M. W.: Robust template matching via hierarchical convolutional features
from a shape biased CNN. ArXiv:2007.15817 (2021)
4. Jurečková, J., Picek, J., Schindler, M.: Robust statistical methods with R. 2nd edn. CRC Press,
Boca Raton (2019)
5. Kalina, J.: A robust pre-processing of BeadChip microarray images. Biocybernetics and
Biomedical Engineering 38, 556–563 (2018)
6. Kalina, J., Matonoha, C.: A sparse pair-preserving centroid-based supervised learning method
for high-dimensional biomedical data or images. Biocybernetics and Biomedical Engineering
40, 774–786 (2020)
7. Kalina, J., Schlenker, A.: A robust supervised variable selection for noisy high-dimensional
data. BioMed Research International 2015, 320385 (2015)
8. Rousseeuw, P. J., Hubert, M.: Anomaly detection by robust statistics. WIREs Data Mining
and Knowledge Discovery 8, e1236 (2018)
9. Sun, L., Sun, H., Wang, J., Wu, S., Zhao, Y., Xu, Y.: Breast mass detection in mammography
based on image template matching and CNN. Sensors 2021, 2855 (2021)
10. Sze, V., Chen, Y. H., Yang, T. J., Emer, J. S.: Efficient processing of deep neural networks.
Morgan & Claypool Publishers, San Rafael (2020)
11. Watanuki, S.: Watershed brain regions for characterizing brand equity-related mental pro-
cesses. Brain Sciences 11, 1619 (2021)
12. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the
wild. IEEE Conference on Computer Vision and Pattern Recognition 2012. IEEE, New York,
pp. 2879–2886 (2012)

Data Clustering and Representation Learning
Based on Networked Data

Lazhar Labiod and Mohamed Nadif

Abstract To deal simultaneously with both, the attributed network embedding and
clustering, we propose a new model exploiting both content and structure infor-
mation. The proposed model relies on the approximation of the relaxed continuous
embedding solution by the true discrete clustering. Thereby, we show that incorporat-
ing an embedding representation provides simpler and easier interpretable solutions.
Experiment results demonstrate that the proposed algorithm performs better, in terms
of clustering, than the state-of-art algorithms, including deep learning methods de-
voted to similar tasks.

Keywords: networked data, clustering, representation learning, spectral rotation

1 Introduction

In recent years, Networks [4] and Attributed Networks (AN) [8] have been used to
model a large variety of real-world networks, such as academic and health care
networks where both node links and attributes/features are available for analysis.
Unlike plain networks in which only node links and dependencies are observed,
with AN, each node is associated with a valuable set of features. In other words, we have X and W obtained/available independently of each other. More recently, the learning
representation has received a significant amount of attention as an important aim
in many applications including social networks, academic citation networks and
protein-protein interaction networks. Hence, Attributed network Embedding (ANE)
[2] aims to seek a continuous low-dimensional matrix representation for nodes
in a network, such that original network topological structure and node attribute
proximity can be preserved in the new low-dimensional embedding.
Although, many approaches have emerged with Network Embedding (NE), the
research on ANE (Attributed Network Embedding) still remains to be explored

Lazhar Labiod ( ) · Mohamed Nadif


Centre Borelli UMR9010, Université Paris Cité, 75006-Paris, France,
e-mail: [email protected], e-mail: [email protected]


[3]. Unlike NE that learns from plain networks, ANE aims to capitalize both the
proximity information of the network and the affinity of node attributes. Note that,
due to the heterogeneity of the two information sources, it is difficult for the existing
NE algorithms to be directly applied to ANE. To sum up, the learned representation
has been shown to be helpful in many learning tasks such as network clustering [13]. However, ANE remains a challenging research problem due to the high-dimensionality, sparsity and non-linearity of the graph data.
The paper is organized as follows. In Section 2 we formulate the objective function
to be optimized, describe the different matrices used, and present a Simultaneous
Attributed Network Embedding and Clustering (SANEC) framework for embedding
and clustering. Section 3 is devoted to numerical experiments. Finally, the conclusion
summarizes the advantages of our contribution.

2 Proposed Method

In this section, we describe the SANEC method. We will present the formulation of
an objective function and an effective algorithm for data embedding and clustering.
But first, we show how to construct two matrices S and M integrating both types of
information –content and structure information– to reach our goal.

2.1 Content and Structure Information

An attributed network G = (V, E, X) consists of V, the set of nodes, E ⊆ V × V, the set of links, and X = [x_1, x_2, . . . , x_n], where n = |V| and x_i ∈ R^d is the feature/attribute vector of node v_i. Formally, the graph can be represented by two types of information: the content information X ∈ R^{n×d} and the structure information A ∈ R^{n×n}, where A is the adjacency matrix of G and a_ij = 1 if e_ij ∈ E and 0 otherwise; we consider each node to be a neighbor of itself, so we set a_ii = 1 for all nodes. Thereby, we model the node proximity by an (n × n) transition matrix W given by W = D^{-1}A, where D is the degree matrix of A defined by d_ii = Σ_{i'=1}^{n} a_{i'i}.
In order to exploit additional information about node similarity from X, we preprocess X to produce a similarity graph W_X of size (n × n); we construct a K-Nearest-Neighbor (KNN) graph. To this end, we use the heat kernel with the L_2 distance, a KNN neighborhood with K = 15, and a neighborhood width σ = 1. Note that any appropriate distance or dissimilarity measure can be used. Finally, we combine in an (n × n) matrix S the node proximities from both the content information X and the structure information W. In this way, we intend to perturb the similarity W by adding the similarity from W_X; we choose to take S defined by S = W + W_X (Figure 1).
Fig. 1 Model and objective function of SANEC.

As we aim to perform clustering, we propose to integrate it in the formulation of a new data representation by assuming that nodes with the same label tend to have similar social relations and similar node attributes. This idea is inspired by the fact
that the labels are strongly influenced by both content and structure information and are inherently correlated with both these information sources. Thereby the new data representation, referred to as M = (m_ij) of size (n × d), can be considered as a multiplicative integration of W and X, obtained by replacing each node by the centroid (barycenter) of its neighborhood, i.e., m_ij = Σ_{k=1}^{n} w_ik x_kj, ∀i, j, or M = WX. In
this way, given a graph 𝐺, a graph clustering aims to partition the nodes in 𝐺 into
𝑘 disjoint clusters {𝐶1 , 𝐶2 , . . . , 𝐶 𝑘 }, so that: (1) nodes within the same cluster are
close to each other while nodes in different clusters are distant in terms of graph
structure; and (2) the nodes within the same cluster are more likely to have similar
attribute values.
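To make this construction concrete, the following NumPy sketch builds W, W_X, S and M from an adjacency matrix and an attribute matrix under the choices stated above (K = 15 neighbours, heat kernel with σ = 1); the function name and implementation details are ours and not part of the paper.

import numpy as np

def build_sanec_inputs(X, A, n_neighbors=15, sigma=1.0):
    # X: (n, d) attribute matrix, A: (n, n) binary adjacency matrix.
    n = X.shape[0]
    # Structure information: add self-loops, then row-normalise, W = D^{-1} A.
    A = np.minimum(A + np.eye(n), 1.0)
    W = A / A.sum(axis=1, keepdims=True)
    # Content information: K-NN graph on X weighted by the heat kernel.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    heat = np.exp(-sq_dists / (2.0 * sigma ** 2))
    keep = np.argsort(sq_dists, axis=1)[:, :n_neighbors + 1]   # each node plus its K neighbours
    WX = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors + 1)
    WX[rows, keep.ravel()] = heat[rows, keep.ravel()]
    S = W + WX            # perturbed similarity used in the second term of (1)
    M = W @ X             # each node replaced by the barycentre of its neighbourhood
    return W, WX, S, M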

2.2 Model, Optimization and Algorithm

Let 𝑘 be the number of clusters and the number of components into which the data
is embedded. With M and S, the SANEC method that we propose aims to obtain
the maximally informative embedding according to the clustering structure in the
attributed network data. Therefore, we propose to optimize

min_{B,Z,Q,G}  ||M - BQ^⊤||^2 + λ ||S - GZB^⊤||^2   s.t.  B^⊤B = I,  Z^⊤Z = I,  G ∈ {0, 1}^{n×k}     (1)

where G = (g_ij) of size (n × k) is a cluster membership matrix, B = (b_ij) of size (n × k) is the embedding matrix, and Z = (z_ij) of size (k × k) is an orthonormal rotation matrix which most closely maps B to G ∈ {0,1}^{n×k}. Q ∈ R^{d×k} is the feature embedding matrix. Finally, the parameter λ is a non-negative value and can be viewed as a regularization parameter. The intuition behind the factorization of M and S is to encourage nodes with similar proximity, i.e., those with higher similarity in both matrices, to have closer representations in the latent space given by B. In doing so, the optimisation of (1) leads to a clustering of the nodes into k clusters given by G. Note that both tasks, embedding and clustering, are performed simultaneously and supported by Z; it is the key to attaining good embedding while


taking into account the clustering structure. To infer the latent factor matrices Z, B,
Q and G, we shall derive an alternating optimization algorithm. To this end, we rely
on the following proposition.

Proposition 1. Let S ∈ R^{n×n}, G ∈ {0,1}^{n×k}, Z ∈ R^{k×k} and B ∈ R^{n×k} with B^⊤B = I; we have

||S - GZB^⊤||^2 = ||S - BB^⊤S||^2 + ||SB - GZ||^2     (2)

Proof. We first expand the matrix norm of the left term of (2):

||S - GZB^⊤||^2 = ||S||^2 + ||GZB^⊤||^2 - 2Tr(SGZB^⊤)     (3)

In a similar way, we obtain for the two terms of the right-hand side of (2)

||S - SBB^⊤||^2 = ||S||^2 - ||SB||^2   due to B^⊤B = I     (4)

and ||SB - GZ||^2 = ||SB||^2 + ||GZ||^2 - 2Tr(SBZG^⊤). Due also to B^⊤B = I, we have

||SB - GZ||^2 = ||SB||^2 + ||GZB^⊤||^2 - 2Tr(SGZB^⊤)     (5)

Summing the two terms (4) and (5) leads to the left term of (2):

||S||^2 + ||GZ||^2 - 2Tr(SGZB^⊤) = ||S - GZB^⊤||^2,   due to ||GZ||^2 = ||GZB^⊤||^2.
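Identity (2) is easy to verify numerically; the following small check (our own illustration, using a symmetric S and a matrix B with orthonormal columns) confirms that both sides coincide up to floating-point error.

import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 5
S = rng.random((n, n)); S = (S + S.T) / 2                          # symmetric similarity matrix (assumption of this check)
B = np.linalg.qr(rng.standard_normal((n, k)))[0]                   # B with orthonormal columns, B.T @ B = I
Z = np.linalg.qr(rng.standard_normal((k, k)))[0]                   # orthonormal rotation
G = np.zeros((n, k)); G[np.arange(n), rng.integers(0, k, n)] = 1   # cluster membership matrix

lhs = np.linalg.norm(S - G @ Z @ B.T) ** 2
rhs = np.linalg.norm(S - B @ B.T @ S) ** 2 + np.linalg.norm(S @ B - G @ Z) ** 2
print(abs(lhs - rhs))   # ~1e-10: the two sides of (2) coincide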

Compute Z. Fixing G and B, the problem which arises from (1) is equivalent to min_Z ||S - GZB^⊤||^2. From Proposition 1, we deduce that

min_Z ||S - GZB^⊤||^2  ⇔  min_Z ||S - BB^⊤S||^2 + ||SB - GZ||^2     (6)

which can be reduced to max_Z Tr(G^⊤SBZ) s.t. Z^⊤Z = I. As proved on page 29 of [1], let UΣV^⊤ be the SVD of G^⊤SB; then Z = UV^⊤.

Compute Q. Given G, Z and B, the optimization problem (1) is equivalent to min_Q ||M - BQ^⊤||^2, and we get

Q = M^⊤B.     (7)

Thereby Q can itself be viewed as an embedding of the attributes.


Compute B. Given G, Q and Z, the problem (1) is equivalent to

max_B Tr((MQ + λSGZ)B^⊤)   s.t.  B^⊤B = I.

In the same manner as for the computation of Z, let ÛΣ̂V̂^⊤ be the SVD of (MQ + λSGZ); we get

B = ÛV̂^⊤.     (8)
It is important to emphasize that, at each step, B exploits the information from the
matrices Q, G, and Z. This highlights one of the aspects of the simultaneity of
embedding and clustering.
Compute G. Finally, given B, Q and Z, the problem (1) is equivalent to min_G ||SB - GZ||^2. As G is a cluster membership matrix, its computation is done as follows: we fix Q, Z, B, let B̃ = SB and calculate

g_ik = 1 if k = argmin_{k'} ||b̃_i - z_{k'}||^2, and 0 otherwise.     (9)

In summary, the steps of the SANEC algorithm relying on S, referred to as SANEC_S, are summarized in Algorithm 1. The convergence of SANEC_S is guaranteed, but it depends on the initialization and may reach only a local optimum. Hence, we start the algorithm several times and select the result which best minimizes the objective function (1).

Algorithm 1 : SANEC_S algorithm

Input: M and S from structure matrix W and content matrix X;
Initialize: B, Q and Z with arbitrary orthonormal matrices;
repeat
(a) - Compute G using (9)
(b) - Compute B using (8)
(c) - Compute Q using (7)
(d) - Compute Z using (6)
until convergence
Output: G: clustering matrix, Z: rotation matrix, B: node embedding and Q: attribute embedding
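For illustration, a compact NumPy transcription of these updates is given below; the initialization, the value of λ and the fixed number of iterations are simplifications of ours, and the function name is not from the paper.

import numpy as np

def sanec_s(M, S, k, lam=1e-3, n_iter=50, seed=0):
    # Sketch of the SANEC_S alternating updates (6)-(9).
    rng = np.random.default_rng(seed)
    n, d = M.shape
    B = np.linalg.qr(rng.standard_normal((n, k)))[0]   # arbitrary orthonormal initialisation
    Q = np.linalg.qr(rng.standard_normal((d, k)))[0]
    Z = np.eye(k)
    for _ in range(n_iter):
        # (a) G, eq. (9): assign row i of SB to the closest row of Z
        SB = S @ B
        labels = ((SB[:, None, :] - Z[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        G = np.zeros((n, k)); G[np.arange(n), labels] = 1
        # (b) B, eq. (8): Procrustes-type solution from the SVD of (MQ + lam * S G Z)
        U, _, Vt = np.linalg.svd(M @ Q + lam * S @ G @ Z, full_matrices=False)
        B = U @ Vt
        # (c) Q, eq. (7)
        Q = M.T @ B
        # (d) Z, eq. (6): SVD of G^T S B
        U, _, Vt = np.linalg.svd(G.T @ S @ B)
        Z = U @ Vt
    return labels, B, Q, Z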

3 Numerical Experiments

In the following, we compare SANEC with competitive methods described below. The performance of all clustering methods is evaluated on challenging real-world datasets commonly used to test ANE, where the clusters are known. Specifically, we consider three public citation network data sets, Citeseer, Cora and Wiki, which contain a sparse bag-of-words feature vector for each document and a list of citation links between documents. Each document has a class label. We treat documents as nodes and the citation links as edges. The characteristics of the datasets are summarized in Table 1. The balance coefficient is defined as the ratio of the number of documents in the smallest class to the number of documents in the largest class, while nz denotes the percentage of sparsity.
Table 1 Description of datasets (#: the cardinality).
datasets # Nodes # Attributes # Edges #Classes 𝑛𝑧(%) Balance
Cora 2708 1433 5294 7 98.73 0.22
Citeseer 3312 3703 4732 6 99.14 0.35
Wiki 2405 4973 17981 17 86.46 0.02

In our comparison we include standard methods as well as recent deep learning methods; these differ in the way they use the available information. Some of them (such as k-means) use only X as a baseline, while others are more recent algorithms based on X and W. The compared methods include TADW [14], DeepWalk [7] and Spectral Clustering [11]; using X and W, we also evaluated GAE and VGAE [5], ARVGA [6], AGC [15] and DAEGC [12].
With the SANEC model, the parameter λ controls the role of the second term ||S - GZB^⊤||^2 in (1). To measure its impact on the clustering performance of SANEC_S, we vary λ in {0, 10^{-6}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{3}}. Through many experiments, as illustrated in Figure 2, we choose to take λ = 10^{-3}. The choice of λ warrants in-depth evaluation.

[Figure 2: three panels (Cora, Citeseer, Wiki) showing clustering performance (%) in terms of ACC, NMI and ARI as a function of λ ∈ {0, 10^{-6}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{3}}.]

Fig. 2 Sensitivity analysis of λ using ACC, NMI and ARI.

Compared to the true available clusters, the clustering performance in our experiments is assessed by accuracy (ACC), normalized mutual information (NMI) and the adjusted Rand index (ARI). We repeat the experiments 50 times with different random initializations, and the averages (means) are reported in Table 2; the best performance for each dataset is highlighted in bold.
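These measures can be computed, for instance, with scikit-learn and SciPy; ACC requires the best one-to-one matching between predicted and true labels, which the sketch below obtains with the Hungarian algorithm (our own utility, not part of the paper).

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # Best one-to-one matching between predicted and true labels (Hungarian algorithm).
    labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((labels_pred.size, labels_true.size))
    for i, p in enumerate(labels_pred):
        for j, t in enumerate(labels_true):
            cost[i, j] = -np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / y_true.size

# acc = clustering_accuracy(y, labels)
# nmi = normalized_mutual_info_score(y, labels)
# ari = adjusted_rand_score(y, labels)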
First, we observe the high performance of methods integrating information from W. For instance, RTM and RMSC are better than classical methods using only either X or W. On the other hand, all methods relying on both X and W, including the deep learning algorithms, are better still. Regarding SANEC, with both versions, relying on W (SANEC_W) or on S (SANEC_S), we note high performance on all the datasets; with SANEC_S we see the impact of W_X: it learns low-dimensional representations that suit the clustering structure.
To go further in our investigation, and given the sparsity of X, we applied tf-idf weighting followed by L_2 normalization, as is often done when processing document-term matrices; see e.g. [9, 10]. In the construction of W_X we then used the cosine metric. The results, reported in Figure 3, show a slight improvement.

Table 2 Clustering performances (ACC % , NMI % and ARI %).


Datasets
Methods Input Cora Citeseer Wiki
ACC NMI ARI ACC NMI ARI ACC NMI ARI
K-means X 49.22 32.10 22.96 54.01 30.54 27.86 41.72 44.02 15.07
Spectral W 36.72 12.67 03.11 23.89 05.57 01.00 22.04 18.17 01.46
DeepWalk W 48.40 32.70 24.27 33.65 08.78 09.22 38.46 32.38 17.03
RTM X, W 43.96 23.01 16.91 45.09 23.93 20.26 43.64 44.95 13.84
RMSC X, W 40.66 25.51 08.95 29.50 13.87 04.88 39.76 41.50 11.16
TADW X, W 56.03 44.11 33.20 45.48 29.14 22.81 30.96 27.13 04.54
VGAE X, W 50.20 32.92 25.47 46.70 26.05 20.56 45.09 46.76 26.34
ARGE X, W 64.0 44.9 35.2 57.3 35.0 34.1 47.34 47.02 28.16
ARVGE X, W 63.8 45.0 37.74 54.4 26.1 24.5 46.45 47.8 29.65
SANEC_W X, W 64.47 43.30 36.19 64.71 38.61 39.20 46.21 42.83 28.30
SANEC_S X, S 67.38 47.14 39.88 66.77 40.60 41.78 52.80 50.02 35.57

[Figure 3: three bar-chart panels (Cora, Citeseer, Wiki) comparing SANEC with L_2 normalization and SANEC with tf-idf normalization in terms of ACC, NMI and ARI (clustering performance, %).]

Fig. 3 Evaluation of SANEC_S using tf-idf normalization of X and the cosine metric for W_X.

4 Conclusion

In this paper, we proposed a novel matrix decomposition framework for simultaneous attributed network data embedding and clustering. Unlike known methods that treat the objective function of AN embedding and the objective function of clustering separately, we proposed a new single framework, SANEC_S, performing AN embedding and node clustering jointly. We showed that the optimized objective function can be decomposed into three terms: the first is the objective function of a kind of PCA applied to X, the second is the graph embedding criterion in a low-dimensional space, and the third is the clustering criterion. We also integrated a discrete rotation functionality, which allows a smooth transformation from the relaxed continuous embedding to a discrete solution, and guarantees a tractable optimization problem with a discrete solution. Thereby, we developed an effective algorithm capitalizing on representation learning and clustering. The obtained results show the advantages of combining both tasks over other approaches. SANEC_S outperforms all recent methods devoted to the same tasks, including deep learning methods which require pretraining of deep models. However, there are other points that warrant in-depth evaluation, such as the choice of λ and the complexity of the algorithm in terms of network size.
The proposed framework offers several perspectives for further investigation. We have noted that the construction of M and S is important; it highlights the role of W. As for W_X, we have observed that it is fundamental, as it makes it possible to link the information from X to the network; this has been verified by many experiments. First, we would like to be able to measure the impact of each matrix W and W_X in the construction of S by considering two different weights, for instance S = αW + βW_X. Finally, as we have stressed that Q is an embedding of the attributes, this suggests also considering simultaneous ANE and co-clustering.

References

1. Ten Berge, J. M. F: Least Squares Optimization in Multivariate Analysis. DSWO Press, Leiden
University Leiden, (1993)
2. Cai, H. Y., Zheng, V. W., Chang, K. C. C.: A comprehensive survey of graph embedding:
Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30(9), 1616-1637
(2018)
3. Chang, S., Han, W., Qi, G. J., Aggarwal, C. C., Huang, T.S.: Heterogeneous network embedding
via deep architectures. In SIGKDD, pp. 119–128 (2015)
4. Doreian, P., Batagelj, V., Ferligoj, A.: Advances in network clustering and blockmodeling.
John Wiley & Sons (2020)
5. Kipf, T. N., Welling, M.: Variational graph auto-encoders. In NIPS Workshop on Bayesian
Deep Learning, (2016)
6. Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., Zhang, C.: Adversarially regularized graph
autoencoder for graph embedding. In IJCAI, pp. 2609-2615, (2018)
7. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In
SIGKDD, pp. 701-710 (2014)
8. Qi, G.J., Aggarwal, C. C., Tian, Q., Ji, H., Huang, T. S.: Exploring context and content links in
social media: A latent space method. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 850-862
(2012)
9. Salah, A., Nadif, M.: Model-based von Mises-Fisher co-clustering with a conscience. In SDM,
pp. 246–254. SIAM (2017)
10. Salah, A., Nadif, M.: Directional co-clustering. Data Analysis and Classification. 13(3), 591-
620 (2019)
11. Tang, L., Liu, H.: Leveraging social media networks for classification. Data mining and
knowledge discovery. 23(3), 447-478 (2011)
12. Wang, C., Pan, S., Hu, R., Long, G., Jiang, J., Zhang, C.: Attributed graph clustering: A deep attentional embedding approach. arXiv preprint arXiv:1906.06532 (2019). Available via https://arxiv.org/pdf/1906.06532.pdf
13. Wang, C., Pan, S., Long, G., Zhu,X., Jiang, J.: Mgae: Marginalized graph autoencoder for
graph clustering. In CIKM, pp. 889-898, (2017)
14. Yang, C., Liu, Z., Zhao, D., Sun, M., Chang, E. Y.: Network representation learning with rich
text information. In IJCAI, pp. 2111-2117 (2015)
15. Zhang, X., Liu, H., Li, Q., Wu, X. M.: Attributed graph clustering via adaptive graph convolution. arXiv preprint arXiv:1906.01210 (2019). Available via https://arxiv.org/pdf/1906.01210.pdf

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Towards a Bi-stochastic Matrix Approximation
of 𝒌-means and Some Variants

Lazhar Labiod and Mohamed Nadif

Abstract The k-means algorithm and some of its variants have been shown to be useful and effective for tackling the clustering problem. In this paper we embed k-means variants in a bi-stochastic matrix approximation (BMA) framework. We then derive from the k-means objective function a new formulation of the criterion. In particular, we show that some k-means variants are equivalent to an algebraic problem of bi-stochastic matrix approximation under suitable constraints. For optimizing the derived objective function, we develop two algorithms; the first consists in learning a bi-stochastic similarity matrix, while the second seeks the optimal partition, which is the equilibrium state of a Markov chain process. Numerical experiments on real data sets demonstrate the interest of our approach.

Keywords: 𝑘-means, reduced 𝑘-means, factorial 𝑘-means, bi-stochastic matrix

1 Introduction

In the last decades, unsupervised learning, and specifically clustering, has received a significant amount of attention as an important problem with many applications in data science. Let A = (a_ij) be an n × m continuous data matrix, where the set of rows (objects, individuals) is denoted by I and the set of columns (attributes, features) by J. Many clustering methods, hierarchical or not, aim to construct an optimal partition of I or, sometimes, of J.
In this paper we show how some 𝑘-means variants can be presented as a bi-
stochastic matrix approximation problem under some suitable constraints generated
by the properties of the reached solution. To reach this goal, we first demonstrate that
some variants of 𝑘-means are equivalent to learning a bi-stochastic similarity matrix
having a diagonal block structure. Based on this formulation, referred to as BMA,
we derive two iterative algorithms, the first algorithm learns a bi-stochastic 𝑛 × 𝑛
similarity matrix while the second directly seeks an optimal clustering solution.
Our main contribution is to establish the theoretical connection of conventional k-means and some of its variants to the BMA framework. The implications of the reformulation of k-means as a BMA problem are multifold:

Lazhar Labiod ( ) · Mohamed Nadif


Centre Borelli UMR9010, Université Paris Cité, 75006-Paris, France,
e-mail: [email protected], e-mail: [email protected]


• It makes connections with recent clustering methods like spectral clustering and
subspace clustering.
• It learns a well normalized (bi-stochastic normalization) similarity matrix, bene-
ficial for spectral clustering [12].
• Unlike existing spectral and subspace methods, which combine the steps of similarity learning and clustering derivation in a sequential way, our proposed method jointly learns a block-diagonal bi-stochastic affinity matrix which naturally expresses a clustering structure.
The rest of the paper is organized as follows. Section 2 introduces some variants of
𝑘-means. Section 3 provides Matrix Factorization (MF) and BMA formulations of
𝑘-means variants. Section 4 discusses the BMA clustering algorithm and section 5
is devoted to numerical experiments. Finally, the conclusion summarizes the interest
of our contribution.

2 Variants of 𝒌-Means

Given a data matrix 𝐴 = (𝑎 𝑖 𝑗 ) ∈ 𝑅 𝑛×𝑚 , the aim of clustering is to cluster the rows
or the columns of 𝐴, so as to optimize the difference between 𝐴 = (𝑎 𝑖 𝑗 ) and the
clustered matrix revealing significant block structure. More formally, we seek to
partition the set of rows 𝐼 = {1, . . . , 𝑛} into 𝑘 clusters 𝐶 = {𝐶1 , . . . , 𝐶𝑙 , . . . , 𝐶 𝑘 }.
The partitioning naturally induces a clustering index matrix R = (r_il) ∈ R^{n×k}, defined as a binary classification matrix such that r_il = 1 if row a_i ∈ C_l, and 0 otherwise. On the other hand, we denote by S ∈ R^{m×k} a reduced matrix specifying the cluster representation. The detection of homogeneous clusters of objects can be achieved by looking for the two matrices R and S minimizing the total squared residue measure
J𝐾 𝑀 (𝑅, 𝑆) = || 𝐴 − 𝑅𝑆 > || 2 (1)
The term 𝑅𝑆 > characterizes the information of 𝐴 that can be described by the clusters
structure. The clustering problem can be formulated as a matrix approximation
problem where the clustering aims to minimize the approximation error between the
original data 𝐴 and the reconstructed matrix based on the cluster structures.
Factorial 𝑘-means analysis (FKM) [9] and Reduced 𝑘-means analysis (RKM)
[1] are clustering methods that aim at simultaneously achieving a clustering of the
objects and a dimension reduction of the features. The advantage of these methods
is that both clustering of objects and low-dimensional subspace capturing the cluster
structure are simultaneously obtained. To achieve this objective, RKM is defined by the minimization of the criterion

J_RKM(R, S, Q) = ||A - RS^⊤Q^⊤||^2     (2)

and FKM by the minimization of the criterion

J𝐹 𝐾 𝑀 (𝑅, 𝑆, 𝑄) = || 𝐴𝑄 − 𝑅𝑆 > || 2 (3)



where S ∈ R^{p×k} for RKM and FKM, and Q is an m × p column-wise orthonormal loading matrix.
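For concreteness, the three criteria (1)-(3) can be evaluated directly in NumPy as below (our own helper functions; note that S is m × k for k-means and p × k for RKM and FKM).

import numpy as np

def J_KM(A, R, S):            # criterion (1), S is (m, k)
    return np.linalg.norm(A - R @ S.T) ** 2

def J_RKM(A, R, S, Q):        # criterion (2), S is (p, k), Q is (m, p)
    return np.linalg.norm(A - R @ S.T @ Q.T) ** 2

def J_FKM(A, R, S, Q):        # criterion (3)
    return np.linalg.norm(A @ Q - R @ S.T) ** 2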

3 Bi-stochastic Matrix Approximation of 𝒌-Means Variants

3.1 Low-rank Matrix Factorization (MF)

By considering k-means as a lower-rank matrix factorization with constraints, rather than as a clustering method, we can formulate constraints to impose on the MF formulation. Let D_r^{-1} ∈ R^{k×k} be the diagonal matrix defined by D_r^{-1} = Diag(r_1^{-1}, . . . , r_k^{-1}), where r_l denotes the size of cluster C_l. Using the matrices D_r, A and R, the summary matrix S can be expressed as S^⊤ = D_r^{-1}R^⊤A. Plugging S into the objective function in equation (1) leads to optimizing ||A - R(D_r^{-1}R^⊤A)||^2, equal to

J_{MF-KM}(ℛ) = ||A - ℛℛ^⊤A||^2,   where ℛ = R D_r^{-0.5}.     (4)

On the other hand, it is easy to verify that the approximation ℛℛ^⊤A of A consists of the same values within each block A_l (l = 1, . . . , k). Specifically, the matrix ℛ^⊤A plays the role of a summary of A, like S^⊤, and absorbs the different scales of A and ℛ. Finally, ℛℛ^⊤A gives the row-cluster mean vectors. Note that it is easy to show that ℛ verifies the following properties

ℛ ≥ 0,  ℛ^⊤ℛ = I_k,  ℛℛ^⊤1 = 1,  Trace(ℛℛ^⊤) = k,  (ℛℛ^⊤)^2 = ℛℛ^⊤     (5)

Next, in a similar way, we can derive an MF formulation of FKM,

J_{MF-FKM}(ℛ) = ||AQ - ℛℛ^⊤AQ||^2,     (6)

and of RKM,  J_{MF-RKM}(ℛ) = ||A - ℛℛ^⊤AQQ^⊤||^2.     (7)
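The equivalence between criterion (1) with the optimal summary S^⊤ = D_r^{-1}R^⊤A and formulation (4), together with the properties (5), can be checked numerically with a few lines of NumPy (an illustration of ours):

import numpy as np

rng = np.random.default_rng(1)
n, m, k = 30, 8, 4
A = rng.standard_normal((n, m))
labels = rng.permutation(np.arange(n) % k)              # a partition with no empty cluster

R = np.zeros((n, k)); R[np.arange(n), labels] = 1
Dr_inv = np.diag(1.0 / R.sum(axis=0))                   # D_r^{-1}: inverse cluster sizes
S = (Dr_inv @ R.T @ A).T                                # optimal summary, S^T = D_r^{-1} R^T A
Rcal = R @ np.sqrt(Dr_inv)                              # scaled membership, Rcal = R D_r^{-1/2}
Pi = Rcal @ Rcal.T

print(np.isclose(np.linalg.norm(A - R @ S.T) ** 2,
                 np.linalg.norm(A - Pi @ A) ** 2))      # criterion (1) equals (4)
print(np.allclose(Pi @ np.ones(n), np.ones(n)),         # Pi 1 = 1
      np.allclose(Pi @ Pi, Pi),                         # idempotence
      np.isclose(np.trace(Pi), k))                      # trace(Pi) = k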

3.2 BMA Formulation

Let Π = ℛℛ^⊤ be a bi-stochastic similarity matrix. Before giving the BMA formulation of the k-means variants, we first need to spell out the useful properties of Π. Indeed, by construction from ℛ, Π has at least the following properties, which can easily be proven:

Π ≥ 0,  Π^⊤ = Π,  Π1 = 1,  Trace(Π) = k,  ΠΠ^⊤ = Π,  Rank(Π) = k     (8)

Given a data matrix A and k row clusters, we can hope to discover the cluster structure of A from Π. Notice that from (8), Π is nonnegative, symmetric, bi-stochastic (doubly stochastic) and idempotent. By setting k-means in the BMA framework, the problem of clustering is reformulated as the learning of a structured bi-stochastic similarity matrix Π by minimizing the following k-means variant objectives,

J𝐵𝑀 𝐴−𝑘 𝑀 (𝚷) = || 𝐴 − 𝚷𝐴|| 2 , (9)

J𝐵𝑀 𝐴−𝐹 𝐾 𝑀 (𝚷) = || 𝐴𝑄 − 𝚷𝐴𝑄|| 2 , (10)

J𝐵𝑀 𝐴−𝑅𝐾 𝑀 (𝚷) = || 𝐴 − 𝚷𝐴𝑄𝑄 > || 2 , (11)


with respect to the following constraints on 𝚷

𝚷 ≥ 0, 𝚷 = 𝚷> , 𝚷1 = 1, 𝑇𝑟 (𝚷) = 𝑘, 𝚷𝚷> = 𝚷 (12)

and 𝑄 > 𝑄 = 𝐼 for equations (10) and (11).


In the rest of the paper, we will consider only non-negativity, symmetry and bi-
stochastic constraints.

3.3 The Equivalence Between BMA and 𝒌 -Means

The theorem below demonstrates that the optimization of the 𝑘-means objective and
the BMA objective under some suitable constraints are equivalent. The equation
(13) establishes the equivalence between 𝑘-means and the BMA formulation. Then,
solving the BMA objective function (9) is equivalent to finding a global solution of
the 𝑘-means criterion (1).
Theorem 1

arg min_{R,S} ||A - RS^⊤||^2  ⇔  arg min_{Π ≥ 0, Π = Π^⊤, Π1 = 1, Tr(Π) = k, ΠΠ^⊤ = Π} ||A - ΠA||^2     (13)

The proof of this equivalence is given in the appendix. Note that this new formulation
gives some interesting highlights on 𝑘-means and its variants:
• First, this shows that 𝑘-means is equivalent to learning a structured bi-stochastic
similarity matrix which is normalized bi-stochastic matrix with block diagonal
structure.
• Secondly, it establishes very interesting connections of k-means to many state-of-the-art subspace clustering methods [10, 5]. Moreover, this formulation merges into a single step the traditional two-step process used by subspace clustering methods, which consists in first constructing an affinity matrix between data points and then applying spectral clustering to this affinity. This allows joint learning of a similarity matrix that better reflects the clustering structure through its block-diagonal shape.
• Finally, it allows the spirit of k-means to be applied to graph or similarity data.

4 BMA Clustering Algorithm

First, we establish the relationship between our objective function and the one used in [12, 11]. From ||A - ΠA||^2 = Trace(AA^⊤) + Trace(ΠAA^⊤Π) - 2Trace(AA^⊤Π) and using the idempotence property ΠΠ^⊤ = Π, we can show that

arg min_Π ||A - ΠA||^2  ⇔  arg min_Π ||AA^⊤ - Π||^2  ⇔  arg max_Π Trace(AA^⊤Π).

The algorithm for learning the similarity matrix is summarized in Algorithm 1, as in [12, 11]. Once the bi-stochastic similarity matrix Π is obtained, the basic idea of BMA consists of the following steps:

Algorithm 1 : Learning similarity matrix

Input: data A
Output: similarity matrix Π
Initialize: t = 0 and Π^(0) = AA^⊤
repeat
  Π^(t+1) ← Π^(t) + (1/n) [ (I - Π^(t) + (1^⊤Π^(t)1/n) I) 11^⊤ - 11^⊤Π^(t) ]
until the convergence condition is satisfied

1. Estimating A iteratively by applying at each step the matrix Π to the current estimate, using the update Â^(t+1) = Π Â^(t) (with Â^(0) = A). This process converges to an equilibrium (steady) state. Let k be the multiplicity of the eigenvalue of Π equal to 1; Â is then composed of k << n quasi-similar rows, where each row is represented by its prototype.
2. Extracting the first left singular vector π of Â using the power method [4]; it is a well-known technique for computing the leading left eigenvector of a data matrix. The numerical computation of the leading left singular vector of Â consists in starting from an arbitrary vector π^(0) and repeatedly performing updates of π until stabilization, as follows: π^(t+1) = Â Â^⊤ π^(t), followed by the normalization π^(t+1) ← π^(t+1)/||π^(t+1)||. We stop the power method when |γ^(t+1) - γ^(t)| ≃ ε, where γ^(t+1) ← ||π^(t+1) - π^(t)||.
Why does this work? At first glance, this process might seem uninteresting, since it eventually leads, for any starting vector, to a vector whose entries all coincide. However, our practical experience shows that the entries of π first collapse very quickly into row blocks, and these blocks then move towards each other relatively slowly. If we stop the power method at this point, the algorithm has a potential application for data visualization and clustering. The structure of π during this short-run stabilization makes the discovery of the row ordering straightforward. The key is to look for values of π that are approximately equal and to reorder the rows (and columns) of the data accordingly. The BMA algorithm thus involves a reorganization of the rows of the data matrix Â according to the sorted π. It also allows us to locate the points corresponding to an abrupt change in the curve of the first left singular vector π, and hence to assess the number of clusters and the rows belonging to each cluster.
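A plain NumPy sketch of the whole procedure is given below: the projection step of Algorithm 1, the smoothing iterations Â^(t+1) = ΠÂ^(t), and the power method on Â. The iteration counts, the tolerance and the clamping of negative entries (in the spirit of the bi-stochastic normalisations of [11, 12]) are choices of ours.

import numpy as np

def learn_bistochastic(A, n_iter=100, tol=1e-9):
    # Algorithm 1: starting from AA^T, repeatedly apply the projection enforcing
    # Pi 1 = 1 and symmetry; clamping negative entries is our addition (cf. [11, 12]).
    n = A.shape[0]
    one = np.ones((n, 1))
    Pi = A @ A.T
    for _ in range(n_iter):
        c = float(one.T @ Pi @ one) / n
        Pi_new = Pi + ((np.eye(n) - Pi + c * np.eye(n)) @ (one @ one.T)
                       - one @ one.T @ Pi) / n
        Pi_new = np.maximum(Pi_new, 0.0)
        if np.linalg.norm(Pi_new - Pi) < tol:
            return Pi_new
        Pi = Pi_new
    return Pi

def bma_order(A, Pi, n_smooth=30, n_power=500, eps=1e-8, seed=0):
    # Step 1: drive A towards its equilibrium state with A_hat <- Pi A_hat.
    A_hat = A.copy()
    for _ in range(n_smooth):
        A_hat = Pi @ A_hat
    # Step 2: leading left singular vector of A_hat by the power method.
    pi_vec = np.random.default_rng(seed).random(A.shape[0])
    gamma_old = np.inf
    for _ in range(n_power):
        pi_new = A_hat @ (A_hat.T @ pi_vec)
        pi_new /= np.linalg.norm(pi_new)
        gamma = np.linalg.norm(pi_new - pi_vec)
        pi_vec = pi_new
        if abs(gamma - gamma_old) < eps:
            break
        gamma_old = gamma
    order = np.argsort(pi_vec)   # near-equal values of pi reveal the row clusters
    return pi_vec, order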

5 Experiments Analysis

In this section, we first ran our algorithm on two real-world data sets. The 16 Townships data consist of the characteristics (rows) of 16 townships (columns); each cell indicates the presence (1) or absence (0) of a characteristic in a township. This example was used by Niermann [7] for a data ordering task, where the author aims to reveal a block-diagonal form. The second data set, called Mero, comes from archaeological data on Merovingian buckles found in north-eastern France; this data matrix consists of 59 buckles characterized by 26 descriptive attributes (see Marcotorchino [6] for more details). Figure 1 shows, in order, A, Â and S_R = AA^⊤ reorganized according to the sorted π, and the sorted π plot, for both data sets.

Fig. 1 Left: 16 Townships data; right: Mero data.

We also evaluated the performances of BMA on some challenging real datasets described in Table 1.


We compared the performance of BMA with spectral co-clustering (SpecCo) [2], Non-negative Matrix Factorization (NMF) and Orthogonal Non-negative Matrix Tri-Factorization (ONMTF) [3], using two evaluation metrics: accuracy (ACC), corresponding to the percentage of well-classified elements, and normalized mutual information (NMI) [8]. In Table 1, we observe that BMA outperforms all compared algorithms on all tested datasets.

Table 1 Clustering Accuracy and Normalized Mutual Information (%).


datasets # samples # features # classes per 𝑘 -means NMF ONMTF SpecCO BMA
Classic3 3891 4303 3 ACC 88.6 73.33 70.10 97.89 98.30
NMI 74.9 51.46 51.46 91.17 91.91
CSTR 476 1000 4 ACC 76.3 75.30 77.41 80.21 90.73
NMI 65.4 66.40 67.30 66.36 77.86
Webkb4 4199 1000 4 ACC 60.10 66.30 67.10 61.68 68.8
NMI 45.7 42.70 45.36 48.64 49
Leukemia 38 5000 3 ACC 72.2 89.21 90.32 94.73 97.36
NMI 19.4 75.42 80.50 82 90.69

6 Conclusion

In this paper we have presented a new reformulation of some variants of k-means in a unified BMA framework and established the equivalence between k-means and BMA under suitable constraints. In doing so, k-means leads to learning a structured bi-stochastic matrix, which is beneficial for the clustering task. The proposed approach not only learns a similarity matrix from the data matrix, but uses this matrix in an iterative process that converges to a matrix Â in which each row is represented by its prototype. The clustering solution is given by the first left singular vector of Â, without requiring the number of clusters to be known in advance. For future work, we plan to integrate the idempotence and trace constraints on Π so that the approximated similarity matrix best fits a block-diagonal structure.

Appendix
From the BMA’s formulation, we know that one can easily construct a feasible solu-
tion for 𝑘-means from a feasible solution of BMA’s formulation. Therefore, it remains
to show that from a global solution of BMA’s formulation, we can obtain a feasible
solution of 𝑘-means. In order to show the equivalence between the optimization of 𝑘-
means formulation and the BMA formulation, we first consider the following lemma.

Lemma. If Π is a symmetric positive semi-definite matrix, then we have

(a) π_ii' ≤ √(π_ii π_i'i')   (geometric mean)   ∀i, i'
(b) π_ii' ≤ (1/2)(π_ii + π_i'i')   (arithmetic mean)   ∀i, i'
(c) max_{ii'} π_ii' = max_i π_ii
(d) π_ii = 0 ⇒ π_ii' = π_i'i = 0   ∀i, i'

Proposition. Any positive semi-definite matrix Π satisfying the constraints

π_ii' = π_i'i   ∀i, i'   (symmetry)
π_ii' = Σ_{i''} π_ii'' π_i'i''   ∀i, i'   (idempotence)
Σ_{i'} π_ii' = 1   ∀i
Σ_i π_ii = k

is a matrix partitioned into k blocks, Π = diag(Π^1, . . . , Π^l, . . . , Π^k), with Π^l = (1/n_l) 1_l 1_l^⊤, trace(Π^l) = 1 ∀l and Σ_{l=1}^{k} n_l = n; 1_l denotes the vector of appropriate dimension with all its values equal to 1.
Proof. Since Π is idempotent (Π^2 = Π) and symmetric, we have π_ii = Σ_{i'} π_ii'^2 for all i. From the Lemma above, we know that there exists i_0 ∈ {1, 2, . . . , n} such that max_{ii'} π_i'i = π_{i_0 i_0} > 0. Consider the set A_{i_0} defined by A_{i_0} = {i | π_{i_0 i} > 0}; we can then write, ∀i ∈ A_{i_0}, π_ii = Σ_{i' ∈ A_{i_0}} π_i'i^2,

∀i ∈ A_{i_0}:   Σ_{i' ∈ A_{i_0}} π_i'i = Σ_{i' ∈ I} π_i'i = 1     (14)

and

Σ_{i ∈ A_{i_0}} Σ_{i' ∈ A_{i_0}} π_i'i = Σ_{i ∈ A_{i_0}} π_i. = Σ_{i ∈ A_{i_0}} 1 = |A_{i_0}|     (15)

∀i:  π_ii = Σ_{i'} π_ii'^2  ⇒  ∀i ∈ A_{i_0}:   Σ_{i' ∈ A_{i_0}} π_ii'^2 / π_ii = Σ_{i' ∈ A_{i_0}} (π_ii'/π_ii) π_ii' = 1.     (16)

From (14) and (16), we deduce that ∀i ∈ A_{i_0}: Σ_{i' ∈ A_{i_0}} π_i'i = Σ_{i' ∈ A_{i_0}} (π_ii'/π_ii) π_ii', implying that π_ii' = π_ii, ∀i, i' ∈ A_{i_0}. Substituting π_ii' by π_ii in (15) for all i, i' ∈ A_{i_0} leads to Σ_{i' ∈ A_{i_0}} π_ii' = Σ_{i' ∈ A_{i_0}} π_ii = |A_{i_0}| π_ii = 1, ∀i ∈ A_{i_0}. From this we can deduce that π_ii = π_ii' = 1/|A_{i_0}|, ∀i, i' ∈ A_{i_0}. We can therefore rewrite the matrix Π, after a suitable permutation of its rows and columns, in the form of a block-diagonal matrix

Π = [ Π^0   0
      0    Π̄^0 ]

where Π^0 is a block matrix whose general term is Π^0_{ii'} = 1/|A_{i_0}|, ∀i, i' ∈ A_{i_0}, and trace(Π^0) = 1. The matrix Π̄^0 is a positive semi-definite matrix which also verifies the constraints (Π̄^0)^⊤ = Π̄^0, Π̄^0 1 = 1, (Π̄^0)^2 = Π̄^0 and trace(Π̄^0) = k - 1. By repeating the same process k - 1 times, we get the block-diagonal form of Π: Π = diag(Π^0, Π^1, . . . , Π^l, . . . , Π^{k-1}), with Π^l = (1/n_l) 1_l 1_l^⊤, trace(Π^l) = 1 ∀l and Σ_{l=0}^{k-1} n_l = n.

References

1. De Soete, G., Carroll, J. D.: K-means clustering in a low-dimensional euclidean space. In:
E. Diday et al. (eds.) New Approaches in Classification and Data Analysis, pp. 212–219.
Springer-Verlag Berlin (1994)
2. Dhillon, I. S.: Co-clustering documents and words using bipartite spectral graph partitioning.
In SIGKDD, pp. 269-274 (2001)
3. Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix trifactorizations for
clustering. In SIGKDD, pp. 126-135 (2006)
4. Golub, G. H., van Loan, C. F.: Matrix Computations (3rd ed.). Johns Hopkins University Press
(1996)
5. Lim, D., Vidal, R., Haeffele, B. D.: Doubly stochastic subspace clustering. ArXiv,
abs/2011.14859, 2020. Available via ArXiv. https://arxiv.org/abs/2011.14859
6. Marcotorchino, J. F.: Seriation problems: an overview. Appl. Stoch. Model. D. A., 7(2),
139–151 (1991)
7. Niermann, S.: Optimizing the ordering of tables with evolutionary computation. American
Statistician. 59(1), 41-46 (2005)
8. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining mul-
tiple partitions. Journal of Machine Learning Research, 3, 583-617 (2002)
9. Vichi, M., Kiers, H. A.: Factorial 𝑘-means analysis for two-way data. CSDA, 37(1), 49-64
(2001)
10. Vidal, R.: Subspace clustering. IEEE Signal Processing Magazine 28(2), 52-68 (2011)
11. Wang, F., Li, P., König, A. C.: Improving clustering by learning a bi-stochastic data similarity
matrix. Knowl. Inf. Syst. 32(2), 351-382 (2012)
12. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In ICCV, pp.
294-301 (2005)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Clustering Adolescent Female Physical Activity
Levels with an Infinite Mixture Model on
Random Effects

Amy LaLonde, Tanzy Love, Deborah R. Young, and Tongtong Wu

Abstract Physical activity trajectories from the Trial of Activity in Adolescent Girls (TAAG) capture the various exercise habits over female adolescence. Previous analyses of these longitudinal data from the University of Maryland field site examined the effect of various individual-, social-, and environmental-level factors on the change in physical activity levels from 14 to 23 years of age. We aimed to understand the differences in physical activity levels after controlling for these factors. Using a Bayesian linear mixed model incorporating a model-based clustering procedure for the random deviations that does not specify the number of groups a priori, we find that physical activity levels are starkly different for about 5% of the study sample. These young girls exercise on average 23 more minutes per day.

Keywords: Bayesian methodology, Markov chain Monte Carlo, mixture model,


reversible jump, split-merge procedures

1 Introduction

Physical activity and diet are arguably the two main controllable factors having the
greatest impact on our health. Whereas we have little to no control over factors like
our genetic predisposition to disease or exposure to environmental toxins, we have

Amy LaLonde
University of Rochester, NY, USA, e-mail: [email protected]
Tanzy Love ( )
University of Rochester, NY, USA, e-mail: [email protected]
Deborah Rohm Young
University of Maryland, MD, USA, e-mail: [email protected]
Tongtong Wu
University of Rochester, NY, USA, e-mail: [email protected]


much greater control over our diet and activity levels. Despite our ability to choose to
engage in healthy behaviors such as exercising and eating a healthy diet, these choices
are plagued with the complexity of human psychology and the modern demands and
distractions that pervade our lives today. Several factors influence levels of physical
activity; we explore the factors impacting female adolescents using longitudinal data.
The University of Maryland, one of the six initial university field centers of the Trial of Activity in Adolescent Girls (TAAG), selected to follow its 2006 8th grade cohort for two additional time points over adolescence: 11th grade and 23 years of age. The females were therefore measured roughly at ages 14, 17, and 23. In these
waves, there was no intervention as this observational longitudinal study aimed at
exploring the patterns of physical activity levels and associated factors over time.
The model presented in Wu et al. [1] motivates the current work. We fit a similar
linear mixed model controlling for the same variables. Rather than cluster the raw
physical activity trajectories to identify groups, we cluster the females within the
model-fitting procedure based on the values of the subject-specific deviations from
the adjusted physical activity levels. Fitting a Bayesian linear mixed model, we
simultaneously explore the subject groups through the use of reversible jump Markov
chain Monte Carlo (MCMC) applied to the random effects. Bayesian model-based
clustering methods have been applied within linear mixed models to identify groups
by clustering the fitted values of the dependent variable. For example, [2] fits cluster-
specific linear mixed models to the gene expression outcome using an EM algorithm
and [3] clusters gene expression in a similar fashion, except using Bayesian methods.
In contrast, we perform the clustering on the random effects, which allows us to
investigate the variability that is unexplained by the covariates of interest. This
methodology is advantageous because of its ability to jointly estimate all effects,
while also exploring the infinite space of group arrangements.

2 Bayesian Mixture Models for Heterogeneity of Random Effects

Let y𝑖 = (𝑦 𝑖,1 , . . . , 𝑦 𝑖,𝑇 ) be the 𝑖 𝑡 ℎ subject’s average daily moderate-to-vigorous


physical activity (MVPA) at each of the 𝑇 = 3 time points. The MVPA was collected
from ActiGraph accelerometers (Manufacturing Technologies Inc. Health Systems,
Model 7164, Shalimar, FL) worn for seven consecutive days. Accelerometers offered
a great alternative to self-report for tracking physical activity levels, and measuring
over seven days helped to account for differences in activity patterns during weekdays
and weekends. Wu et al. [1] analyzed this cohort using mixed models that accounted
for the subject-specific variability. We let X𝑖 represent the 𝑖 𝑡 ℎ subject’s values for
covariates.
Furthermore, let r = (𝑟 1 , . . . , 𝑟 𝑛 ) represent the subject-specific random effects for
the 𝑛 subjects. The simple linear mixed model is written in terms of each subject as

y𝑖 = X𝑖 𝜷 + 𝑟 𝑖 1𝑇 + 𝝐 𝑖 (1)

where 𝜷 represents the coefficients for the covariate effects and 𝝐 𝑖 = (𝜖𝑖,1 , . . . , 𝜖 𝑖,𝑇 )
are the residuals. We assume independence and normality in the residuals and the
random effects; hence, 𝑟 𝑖 ∼ 𝑁 (0, 𝜎𝑟2 ) and 𝝐 𝑖 ∼ 𝑁 (0, 𝜎𝜖2 I𝑇 ) for 𝑖 = 1, . . . , 𝑛.
Fitting the mixed model demonstrates substantial heterogeneity in the residuals: the variability increases as the fitted values increase. A traditional approach to fixing
this violation would re-fit the model to the log-transformed MVPA values. Plots
of residuals versus fitted values in this model approach also exhibited evidence of
heterogeneity in the model; thus, still violating a core assumption of the regression
framework. Given the changes adolescents experience as they grow into young adults,
we expect to see heterogeneity in the physical activity patterns across this duration of
follow-up time. However, the inability of the model to capture such changes over time
at these higher levels of physical activity suggests the need for model improvements.
The purpose of this analysis is to present our adjustments to previous analyses in
order to investigate underlying characteristics across different groups of females
formed based on deviations from adjusted physical activity levels.

Fig. 1 The plot on the left depicts the residuals versus fitted values for the linear mixed model
in Eq. (1); they demonstrate severe heteroscedasticity. The variance increases as the fitted values
increase. The plot on the right depicts the distribution of the random effects.

We fit the mixed model in Eq. (1) to the sample of female adolescents. The
heteroscedasticity depicted in Figure 1 reveals an increase in variance with predicted
minutes of moderate-to-vigorous physical activity, which we would expect. The plot
on the right in Figure 1 demonstrates that the distribution of the random effects do
not appear to follow our assumption of normally distributed and centered around
zero. The random effects do appear to follow a normal distribution over the lower
range of deviations with a subset of the subjects having larger positive deviations
from the estimated adjusted physical activity levels.
To capture the heterogeneity and allow the random effects to follow a non-normal
distribution, we assign the random effects a Gaussian mixture distribution. Before
introducing the model for heterogeneity, we note the likelihood distribution for the
observed outcomes, Y = (y1 , . . . , y𝑇 ) 0. The moderate-to-vigorous physical activity
distribution is

p(Y | β, r, σ_ε^2) = ∏_{i=1}^{n} ∏_{t=1}^{T} (2πσ_ε^2)^{-1/2} exp{ -(1/(2σ_ε^2)) (y_{i,t} - X_{i,t}β - r_i)^2 }.     (2)

Then to account for the heterogeneity across subjects, the probability density for
the subject-specific deviations in physical activity is expressed as a mixture of one-
dimensional normal densities,
p(r_i | μ, σ_r^2) = Σ_{g=1}^{G} π_g (2πσ_{r,g}^2)^{-1/2} exp{ -(1/(2σ_{r,g}^2)) (r_i - μ_g)^2 }.     (3)

Here, μ = (μ_1, . . . , μ_G)′ defines the group-specific mean deviations, σ_r^2 = (σ_{r,1}^2, . . . , σ_{r,G}^2)′ characterizes the variances of the group-specific deviations, and π = (π_1, . . . , π_G)′ is the probability of membership in each group g.
The model in Eqs. (2) and (3) can be fit using either an EM or Bayesian MCMC
procedures. Both require specification of a fixed number of 𝐺-groups. While we
may hypothesize that there are only two groups–one that is normally distributed and
centered at zero and another that is normally distributed and centered at a larger
mean–the assumption hinges on what we have seen from plots like those in Figure
1. The random effects in the aforementioned histogram, however, are being shrunk
towards zero by assumption; while a mixture model will allow the data to more
accurately depict the deviations observed in the girl’s physical activity levels. The
assumption of 𝐺 groups can strongly influence the results of our model fitting. To
circumvent the issues associated with selecting 𝐺 in either an EM algorithm or a
Bayesian finite mixture model framework, we implement a Bayesian mixture model
that incorporates 𝐺 as an additional unknown parameter.

2.1 Bayesian Mixed Models With Clustering

Richardson and Green [4] adapts the reversible jump methodology to univariate nor-
mal mixture models. In addition to being able to characterize the distribution of 𝐺,
this Bayesian framework has the ability to simultaneously explore the posterior dis-
tribution for the covariate effects of interest. Furthermore, we will have the posterior
distributions of the group-defining parameters rather than just point estimates. Since
we are interested in the physical activity differences in subjects when controlling for
these covariates, we use Eq. (1) as the basis of our model.
The foundation of our clustering model is a finite mixture model on the random effects, r_i, as shown in Eq. (3). For all i = 1, . . . , n and g = 1, . . . , G,

r_i | c_i, μ ∼ F_r(μ_{c_i}, σ_{r,c_i}^2),   (c_i = g) | π, G ∼ Categorical(π_1, . . . , π_G),   μ_g | τ ∼ N(μ_0, τ),
σ_{r,g}^2 | c, δ ∼ IG(c, δ),   π | G ∼ Dirichlet(α, . . . , α),   G ∼ Uniform[1, G_max],

where c_i is the latent grouping variable tracking the assignment of r_i to any one of the G clusters. The likelihood function for these subject-specific deviations, given the group assignment c_i, is simply p(r_i | c_i = g, μ_g, σ_{r,g}^2) = (2πσ_{r,g}^2)^{-1/2} exp{ -(1/(2σ_{r,g}^2)) (r_i - μ_g)^2 }.

This replaces the typical independent and identically distributed assumption of r_i ∼ N(0, σ_r^2) for all i with a normal distribution that is now conditional on group assignment. The remainder of the model formulation closely follows the framework constructed in [4], except that we have an additional layer of unknown parameters defining the linear mixed model in Eq. (1).
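To make the conditional structure concrete, the sketch below performs one Gibbs-type sweep over the group labels, group means and mixing probabilities given the current random effects r and a fixed number of groups G. It is purely illustrative: the full sampler also updates β, the random effects, the variances and, through the reversible jump moves, G itself, and the hyper-parameter values shown are arbitrary.

import numpy as np

def gibbs_sweep(r, mu, sigma2, pi, mu0=0.0, tau=100.0, alpha=1.0, rng=None):
    # One schematic sweep of the conditional updates for the mixture on the
    # random effects r, with G fixed. Hyper-parameter values are illustrative.
    if rng is None:
        rng = np.random.default_rng()
    G = mu.size
    # 1. group labels: p(c_i = g | ...) is proportional to pi_g * N(r_i | mu_g, sigma2_g)
    log_p = (np.log(pi)[None, :]
             - 0.5 * np.log(2 * np.pi * sigma2)[None, :]
             - 0.5 * (r[:, None] - mu[None, :]) ** 2 / sigma2[None, :])
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    c = np.array([rng.choice(G, p=row) for row in p])
    # 2. group means: conjugate normal update given the prior N(mu0, tau)
    for g in range(G):
        rg = r[c == g]
        v = 1.0 / (1.0 / tau + rg.size / sigma2[g])
        m = v * (mu0 / tau + rg.sum() / sigma2[g])
        mu[g] = rng.normal(m, np.sqrt(v))
    # (the paper additionally keeps the mu_g ordered to avoid label switching)
    # 3. mixing probabilities: Dirichlet(alpha + n_1, ..., alpha + n_G)
    counts = np.bincount(c, minlength=G)
    pi = rng.dirichlet(alpha + counts)
    return c, mu, pi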
We select conjugate priors so that the posterior distributions of the unknown
parameters are analytically tractable. The prior on the mixing probabilities, 𝝅, is a
symmetric Dirichlet distribution, reflecting the prior belief that belonging to any one
cluster is equally likely. To use the sampling methods of [4], we select a discrete
uniform prior on 𝐺 that reflects our uncertainty on the number of groups, and impose
an a priori ordering of the 𝜇𝑔 , such that for any given value 𝐺, 𝜇1 < 𝜇2 < · · · < 𝜇𝐺 ,
to remove label switching. Thus, in the prior for the clustering parameters,
p(μ) = G! ∏_{g=1}^{G} (2πτ)^{-1/2} exp{ -(1/(2τ)) (μ_g - μ_0)^2 }

p(σ_{r,g}^2) = (δ^c / Γ(c)) (σ_{r,g}^2)^{-c-1} exp{ -δ / σ_{r,g}^2 }

p(G) = (1/G_max) 1{G ∈ [1, G_max]},
where 𝐺 𝑚𝑎𝑥 is set to be reasonably large and 1{𝐺 ∈ [1, 𝐺 𝑚𝑎𝑥 ]} is a discrete
indicator function, equal to 1 on the interval [1, 𝐺 𝑚𝑎𝑥 ] and 0 elsewhere.
The capacity of our sampler to move between dimensions is essential to our
ability to explore the grouping of the observations while simultaneously exploring
the parameters describing the relationships between the covariates and the outcome.
This means that we can allow the number of components of our mixture model on the
random effects to increase or decrease at each state of our MCMC chain. Such changes
impact the dimension of the parameters of the mixture model, 𝜽 = ( 𝝁, 𝝈𝑟2 , 𝐺, 𝝅, c).
Let 𝜽 denote the current state of the parameters (𝝁, 𝝈𝑟2 , 𝐺, 𝝅, c) when propos-
ing move 𝑚 where 𝑚 ∈ {𝑆, 𝑀, 𝐵, 𝐷} corresponds to a split, merge, birth and
death, respectively. Given the current state, 𝜽, and move 𝑚, we propose a new
state, θ^m, under move m. The acceptance probability is written as

acc_m(θ^m, θ) = min{ 1, [p(θ^m | r) q(θ | m^{-1})] / [p(θ | r) q(θ^m | m)] · |J| },

where p(·) and q(·) denote the target and proposal distributions, respectively. In our case, the target distribution is the posterior distribution
of our group-specific parameters, (𝝁, 𝝈𝑟2 , 𝝅, c), given the data, r, which are the ran-
dom effects. Each proposed move changes the dimension of the parameters in 𝜽 by 1,
adding or deleting group-specific parameters. The ratio 𝑞(𝜽 𝑚 |𝑚 −1 )/𝑞(𝜽 |𝑚) ensures
"dimension balancing", as explained in [4]. For moves increasing in dimension, the
Jacobian, |𝐽 |, is computed as |𝛿𝜽 𝑚 /𝛿(𝜽, u)| because moving from 𝜽 to 𝜽 𝑚 will re-
quire additional parameters, u to appropriately match dimensions. The opposite is
true for moves decreasing in dimension. This is what we refer to as the reversible
jump mechanism; each time a split is proposed, we must also design the reversible
move that would result in the currently merged component, and vice versa.

Split and merge moves are implemented for our model. These moves update 𝜋,
𝜇, and 𝜎 for two adjacent groups or create two adjacent groups using three Beta-
distributed additional parameters, 𝑢, for dimension balancing in a similar way to
[4]. Within our context of random effects, births and deaths are not appropriate.
A singleton causes issues of identifiability because the 𝑟 𝑖 is no longer defined as
random. We do not allow for birth and death moves in our reversible jump methods.
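As an illustration of what such a split looks like, the following sketch applies the moment-matching construction used for univariate normal mixtures in [4] to a single component, with three Beta-distributed variables u1, u2, u3; the Beta parameters shown are commonly used choices, the corresponding merge move and acceptance ratio are omitted, and details of the sampler used in this chapter may differ.

import numpy as np

def split_component(w, mu, var, rng=None):
    # Split one normal component (weight w, mean mu, variance var) into two,
    # preserving the component's weight, mean and variance (cf. [4]).
    if rng is None:
        rng = np.random.default_rng()
    u1, u2, u3 = rng.beta(2, 2), rng.beta(2, 2), rng.beta(1, 1)
    w1, w2 = w * u1, w * (1 - u1)
    sd = np.sqrt(var)
    mu1 = mu - u2 * sd * np.sqrt(w2 / w1)
    mu2 = mu + u2 * sd * np.sqrt(w1 / w2)
    var1 = u3 * (1 - u2 ** 2) * var * w / w1
    var2 = (1 - u3) * (1 - u2 ** 2) * var * w / w2
    # zeroth, first and second moments are preserved:
    #   w1 + w2 = w,  w1*mu1 + w2*mu2 = w*mu,
    #   w1*(mu1**2 + var1) + w2*(mu2**2 + var2) = w*(mu**2 + var)
    return (w1, mu1, var1), (w2, mu2, var2)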

3 Trial of Activity in Adolescent Girls (TAAG) and Model Results

Our analysis focuses only on these girls from the University of Maryland site of the
TAAG study who were measured at all three follow-up time points, beginning in
2006. After excluding girls with missing outcomes, the final sample consisted of 428
girls measured in 2006, 2009, and 2014. Missing covariate values were imputed for
four subjects using the values from the nearest time point.
We determine the group assignments using an MCMC sampler having 10,000 it-
erations, with a burn-in of 500 draws. The posterior distribution for 𝐺 was extremely
peaked at 𝐺 = 2. Summarization of the posterior distribution of the group assign-
ments via the least squares clustering method delivers the final arrangement, ĉ 𝐿𝑆 , of
girls into two groups describing their physical activity levels [5]. Since our sampler
explores several models for which group assignments and 𝐺 can vary, we sample
additional draws from the posterior distribution of the remaining parameters of in-
terest using an MCMC sampler with the model specification of Eq. (1) with groups
fixed at our posterior assignment, ĉ 𝐿𝑆 , for the subject-specific random effects. This
additional chain was run for 10,000 iterations with a burn-in of 500 draws, yielding the
results summarized below. Convergence diagnostics indicated that 10,000 iterations
sufficiently met the effective sample size threshold for estimating the coefficients for
the covariate effects, 𝜷, and the group-specific means, 𝝁, describing the deviations
of the girls’ physical activity levels [6].
After controlling for covariates believed to best describe the variation in the
physical activity levels of females, our method finds that there is a small subset of the
females who are much more active than the remainder of the sample. Every subject
in the more active group has fitted trajectories above the recommended 30 minutes
of exercise. Most of the population does not get the recommended allowance of daily
physical activity and this is well-supported in our analysis. All but two subjects in the
less active group have fitted trajectories that never pass the recommended 30 minutes
of exercise. The random effects from this model better fit a normal distribution (not
centered at 0) for each of the two groups and do not show as much heteroscedasticity
over time as the one group model depicted in Figure 1.
Given these differences are observed even after controlling for the aforementioned
variables, we would like to further examine the characteristics that may set these
highly active females apart from the rest of the girls in our sample. To do this, we
look at a number of other covariates that were either excluded during the variable
selection process or were not measured at all time points. We use simple Wilcoxon

tests on the available time points of the additional variables and on all time points
for covariates we adjusted for in the initial model.
We first note that the median BMI of the subset of highly active girls is sig-
nificantly lower than that of the remaining girls consistently at each TAAG wave.
Similarly, mother’s education level is also consistently significant at each time point.
These values are measured at each time point to reflect changes as the mother pursues
additional education, or as the girls become more aware of their mother’s education.
The majority of the highly active girls have mother’s who have completed college
or higher (75% or higher at each time point); whereas, the remainder of the sample
has mother’s with a range of education levels (less than high school through college
or more). The number of parks within a one-mile radius of the home is significantly
different among the high and low groups in the middle school and high school years,
when the girls are likely to be living at home. This variable may be an indicator of so-
cioeconomic status as families with more money may live in neighborhoods nearer to
parks. Finally, in the high school and college-aged years, the self-management strate-
gies among the highly active girls are significantly higher rated than the remainder
of the population.
In high school, the subset of highly active girls tend to have better self-described
health, participate in more sports teams, have access to more physical education
classes, and have been older at the time of their first menstrual period. At the college
age, these girls still have higher self-described health; however, the higher levels
of the global physical activity score and self-esteem scores are now significantly
improved in the subset of highly active females.

4 Discussion

We extended the mixed models of [1] with the application still focused on the same
428 girls from the TAAG, TAAG 2, and TAAG 3 studies. Within the Bayesian
linear mixed model, we implemented a clustering procedure aimed at clustering
girls into groups based on deviations from the adjusted physical activity levels.
These groups reflected the tendency for small subsets of females to be highly active.
Not surprisingly, only 24 girls (5% of our sample) were classified as highly active.
This group of highly active girls differs in several ways. These girls are more
active, and thus we expect that the age at first menstrual period will be higher. We
may also expect that the highly active girls are involved in more sports teams and
that they will have higher global physical activity scores. Some other interesting
characteristics of these girls, however, is their increased self-management strategies,
self-esteem scores, and self-described health. This may suggest that interventions
focusing on time management and emphasizing self-efficacy could impact adolescent
female physical activity levels. In doing so, we could aim to increase self-esteem and
self-described health.
The ability to account for heterogeneity in the subject-specific deviations from
an adjusted model allows us to keep the outcome on the original scale while still
improving model assumptions. Our model estimates the model parameters while identifying groups of observations with differing activity levels. In contrast, a frequentist approach could be taken using the EM algorithm; however, we would lose the ability of the data to provide statistical inference on the appropriate number of groups and to incorporate posterior samples with different numbers of groups into the estimated class label.
The current analysis looks only at identifying groups based on deviations from
the overall adjusted minutes of MVPA for the females. A natural extension would
be to look at clustering on the slope for time to begin to understand the various
patterns we observe among adolescent females over time. Furthermore, we may
want to incorporate a variable selection procedure into the fixed portion of the
model. The groups we find by clustering on subject-specific intercepts and/or
slopes would be sensitive to the covariates selected, depending on the variability
captured by this fixed portion of the model. Physical activity, like most human
behavior, varies widely for a multitude of reasons, many of which we may not think
to or are unable to measure. Identifying groups when a traditional mixed model
constructed using standard variable selection methods suggests lack of fit can be a
useful step towards better understanding differences through post-hoc analyses of
the groups’ characteristics.

Acknowledgements Research reported in this publication was supported by the National Institutes
of Health (NIH) under award numbers T32ES007271 and R01HL119058. The content is solely the
responsibility of the authors and does not necessarily represent the official views of the NIH.

References

1. Young, D. R., Mohan, Y. D., Saksvig, B. I., Sidell, M., Wu, T. T., Cohen, D.: Longitudinal
predictors of moderate to vigorous physical activity among adolescent girls and young women.
Under review. (2017)
2. Ng, S. K., McLachlan, G. J., Wang, K., Ben-Tovim Jones, L., Ng, S. W.: A mixture model with
random-effects components for clustering correlated gene-expression profiles. Bioinformatics
22(14), 1745 (2006)
3. Zhou, C., Wakefield, J.: A Bayesian mixture model for partitioning gene expression data.
Biometrics 62(2), 515–525 (2006)
4. Richardson, S., Green, P. J.: On Bayesian analysis of mixtures with an unknown number of
components (with discussion). J. Roy. Stat. Soc. B 59(4), 731–792 (1997)
5. Dahl, D. B.: Model-based clustering for expression data via a Dirichlet process mixture model.
Bayesian Inference for Gene Expression and Proteomics. 201–218 (2006)
6. Flegal, J. M., Hughes, J., Vats, D.: mcmcse: Monte Carlo standard errors for MCMC. R
package version 1.2-1 (2016)
Unsupervised Classification of Categorical Time
Series Through Innovative Distances

Ángel López-Oriona, José A. Vilar, and Pierpaolo D’Urso

Abstract In this paper, two novel distances for nominal time series are introduced.
Both of them are based on features describing the serial dependence patterns between
each pair of categories. The first dissimilarity employs the so-called association
measures, whereas the second computes correlation quantities between indicator
processes whose uniqueness is guaranteed from standard stationary conditions. The
metrics are used to construct crisp algorithms for clustering categorical series. The
approaches are able to group series generated from similar underlying stochastic
processes, achieve accurate results with series coming from a broad range of mod-
els and are computationally efficient. An extensive simulation study shows that the
devised clustering algorithms outperform several alternative procedures proposed in
the literature. Specifically, they achieve better results than approaches based on max-
imum likelihood estimation, which take advantage of knowing the real underlying
procedures. Both innovative dissimilarities could be useful for practitioners in the
field of time series clustering.

Keywords: categorical time series, clustering, association measures, indicator processes

1 Introduction

Clustering of time series concerns the challenge of splitting a set of unlabeled time
series into homogeneous groups, which is a pivotal problem in many knowledge
discovery tasks [1]. Categorical time series (CTS) are a particular class of time
series exhibiting a qualitative range which consists of a finite number of categories.
Most of the classical statistical tools used for real-valued time series (e.g., the
autocorrelation function) are not useful in the categorical case, so different types
of measures than the standard ones are needed for a proper analysis of CTS. CTS

Ángel López-Oriona ( ), José A. Vilar


Research Group MODES, Research Center for Information and Communication Technologies
(CITIC), University of A Coruña, Spain,
e-mail: [email protected];[email protected]
Pierpaolo D’Urso
Department of Social Sciences and Economics, Sapienza University of Rome, Italy,
e-mail: [email protected]


arise in an extensive assortment of fields [2, 3, 7, 8, 9]. Since only a few works have
addressed the problem of CTS clustering [4, 5], the main goal of this paper is to
introduce novel clustering algorithms for CTS.

2 Two Novel Feature-based Approaches for Categorical Time Series Clustering

Consider a set of 𝑠 categorical time series S = {𝑋𝑡(1) , . . . , 𝑋𝑡(𝑠) }, where the 𝑗-th element 𝑋𝑡( 𝑗) is a 𝑇𝑗 -length partial realization from any categorical stochastic process
(𝑋𝑡 )𝑡 ∈Z taking values on a number 𝑟 of unordered qualitative categories, which are
coded from 1 to 𝑟 so that the range of the process can be seen as V = {1, . . . , 𝑟 }.
We suppose that the process (𝑋𝑡 )𝑡 ∈Z is bivariate stationary, i.e., the pairwise joint
distribution of (𝑋𝑡−𝑘 , 𝑋𝑡 ) is invariant in 𝑡. Our goal is to perform clustering on the
elements of S in such a way that the series assumed to be generated from identical
stochastic processes are placed together. To that aim, we propose two distance metrics
which are based on feature extraction.

2.1 Descriptive Features for Categorical Processes

Let {𝑋𝑡 , 𝑡 ∈ Z} be a bivariate stationary categorical stochastic process with range V = {1, . . . , 𝑟 }. Denote by 𝝅 = (𝜋1 , . . . , 𝜋𝑟 ) the marginal distribution of 𝑋𝑡 , that is, 𝑃(𝑋𝑡 = 𝑗) = 𝜋 𝑗 > 0, 𝑗 = 1, . . . , 𝑟. For fixed 𝑙 ∈ N, we use the notation 𝑝 𝑖 𝑗 (𝑙) =
𝑃(𝑋𝑡 = 𝑖, 𝑋𝑡−𝑙 = 𝑗), with 𝑖, 𝑗 ∈ V, for the lagged bivariate probability and the
notation 𝑝 𝑖 | 𝑗 (𝑙) = 𝑃(𝑋𝑡 = 𝑖|𝑋𝑡−𝑙 = 𝑗) = 𝑝 𝑖 𝑗 (𝑙)/𝜋 𝑗 for the conditional bivariate
probability.
To extract suitable features characterizing the serial dependence of a given CTS,
we start by defining the concepts of perfect serial independence and dependence
for a categorical process. We have perfect serial independence at lag 𝑙 ∈ N if and
only if 𝑝 𝑖 𝑗 (𝑙) = 𝜋𝑖 𝜋 𝑗 for any 𝑖, 𝑗 ∈ V. On the other hand, we have perfect serial
dependence at lag 𝑙 ∈ N if and only if the conditional distribution 𝑝 · | 𝑗 (𝑙) is a
one-point distribution for any 𝑗 ∈ V. There are several association measures which
describe the serial dependence structure of a categorical process at lag 𝑙. One of
such measures is the so-called Cramér's 𝑣, which is defined as
$$v(l) = \sqrt{\frac{1}{r-1}\sum_{i,j=1}^{r}\frac{\big(p_{ij}(l)-\pi_i\pi_j\big)^2}{\pi_i\pi_j}}. \qquad (1)$$

Cramér's 𝑣 summarizes the serial dependence patterns of a categorical process for every pair (𝑖, 𝑗) and 𝑙 ∈ N. However, this quantity is not appropriate for characterizing a given stochastic process, since two different processes can have the same value of 𝑣(𝑙). A better way to characterize the process 𝑋𝑡 is by considering the matrix $\boldsymbol{V}(l) = \big(V_{ij}(l)\big)_{1\le i,j\le r}$, where $V_{ij}(l) = \frac{(p_{ij}(l)-\pi_i\pi_j)^2}{\pi_i\pi_j}$. The elements of the matrix
𝑽 (𝑙) give information about the so-called unsigned dependence of the process.
However, it is often useful to know whether a process tends to stay in the state it has
reached or, on the contrary, the repetition of the same state after 𝑙 steps is infrequent.
This motivates the concept of signed dependence, which arises as an analogy of the
autocorrelation function of a numerical process, since such quantity can take either
positive or negative values. Provided that perfect serial dependence holds, we have
perfect positive (negative) serial dependence if 𝑝 𝑖 |𝑖 (𝑙) = 1 (𝑝 𝑖 |𝑖 (𝑙) = 0) for all 𝑖 ∈ V.
Since 𝑽 (𝑙) does not shed light on the signed dependence structure, it would
be valuable to complement the information contained in 𝑽 (𝑙) by adding features
describing signed dependence. In this regard, a common measure of signed serial
dependence at lag 𝑙 is Cohen's 𝜅, which takes the form
$$\kappa(l) = \frac{\sum_{j=1}^{r}\big(p_{jj}(l)-\pi_j^2\big)}{1-\sum_{j=1}^{r}\pi_j^2}. \qquad (2)$$
Proceeding as with 𝑣(𝑙), the quantity 𝜅(𝑙) can be decomposed in order to obtain
a complete representation of the signed dependence pattern of the process. In this
way, we consider the vector K (𝑙) = (K1 (𝑙), . . . , K𝑟 (𝑙)), where each K𝑖 is defined
as
$$\mathcal{K}_i(l) = \frac{p_{ii}(l)-\pi_i^2}{1-\sum_{j=1}^{r}\pi_j^2}, \quad i = 1, \ldots, r. \qquad (3)$$
In practice, the matrix 𝑽 (𝑙) and the vector K (𝑙) must be estimated from a 𝑇-length realization of the process, {𝑋1 , . . . , 𝑋𝑇 }. To this aim, we consider the estimators $\widehat\pi_i = N_i/T$ of $\pi_i$ and $\widehat p_{ij}(l) = N_{ij}(l)/(T-l)$ of $p_{ij}(l)$, where 𝑁𝑖 is the number of variables 𝑋𝑡 equal to 𝑖 in the realization {𝑋1 , . . . , 𝑋𝑇 }, and 𝑁𝑖 𝑗 (𝑙) is the number of pairs (𝑋𝑡 , 𝑋𝑡−𝑙 ) = (𝑖, 𝑗) in the realization {𝑋1 , . . . , 𝑋𝑇 }. Hence, the estimates $\widehat{\boldsymbol V}(l)$ of 𝑽 (𝑙) and $\widehat{\mathcal K}(l)$ of K (𝑙) can be obtained by plugging the estimates $\widehat\pi_i$ and $\widehat p_{ij}(l)$ into the corresponding expressions. This leads directly to estimates of 𝑣(𝑙) and 𝜅(𝑙), denoted by $\widehat v(l)$ and $\widehat\kappa(l)$.
An alternative way of describing the dependence structure of the process
{𝑋𝑡 , 𝑡 ∈ Z} is to take into consideration its equivalent representation as a multi-
variate binary process. The so-called binarization of {𝑋𝑡 , 𝑡 ∈ Z} is constructed as
follows. Let 𝒆 1 , . . . , 𝒆𝑟 ∈ {0, 1}𝑟 be unit vectors such that 𝒆 𝑘 has all its entries
equal to zero except for a one in the 𝑘-th position, 𝑘 = 1, . . . , 𝑟. Then, the binary
representation of {𝑋𝑡 , 𝑡 ∈ Z} is given by the process {𝒀 𝑡 = (𝑌𝑡 ,1 , . . . , 𝑌𝑡 ,𝑟 ) > , 𝑡 ∈ Z}
such that 𝒀 𝑡 = 𝒆 𝑗 if 𝑋𝑡 = 𝑗. Fixed 𝑙 ∈ N and 𝑖, 𝑗 ∈ V, consider the correlation
𝜙𝑖 𝑗 (𝑙) = 𝐶𝑜𝑟𝑟 (𝑌𝑡 ,𝑖 , 𝑌𝑡−𝑙, 𝑗 ), which measures linear dependence between the 𝑖-th and
𝑗-th categories with respect to the lag 𝑙. The following proposition provides some
properties of the quantity 𝜙𝑖 𝑗 (𝑙).

Proposition 1 Let {𝑋𝑡 , 𝑡 ∈ Z} be a bivariate stationary categorical process with range V = {1, . . . , 𝑟 }. Then the following properties hold:

1. For every 𝑖, 𝑗 ∈ V, the function $\phi_{ij} : \mathbb{N} \to [-1, 1]$ given by $l \mapsto \phi_{ij}(l) = Corr(Y_{t,i}, Y_{t-l,j})$ is well-defined.
2. $\phi_{ij}(l) = 0 \Leftrightarrow p_{ij}(l) = \pi_i \pi_j$.
3. $\phi_{ij}(l) = \pm 1 \Leftrightarrow p_{ij}(l) = \pm\sqrt{\pi_i(1-\pi_i)\pi_j(1-\pi_j)} + \pi_i \pi_j$.
4. $\phi_{ij}(l) = \sqrt{\pi_j(1-\pi_i)\big/\big(\pi_i(1-\pi_j)\big)} \Leftrightarrow p_{i|j}(l) = 1$.

The proof of Proposition 1 is quite straightforward and it is not shown in the manuscript for the sake of brevity. According to Proposition 1, the quantity 𝜙𝑖 𝑗 (𝑙)
can be used to explain both types of dependence, signed and unsigned, within the
underlying process. In fact, in the case of perfect unsigned independence at lag 𝑙,
we have that 𝑝 𝑖 𝑗 (𝑙) = 𝜋𝑖 𝜋 𝑗 for all 𝑖, 𝑗 ∈ V so that 𝜙𝑖 𝑗 (𝑙) = 0 for all 𝑖, 𝑗 ∈ V in
accordance with Property 2 of Proposition 1. Under perfect positive dependence at
lag 𝑙, 𝑝 𝑖 |𝑖 (𝑙) = 1 for all 𝑖 ∈ V. Then 𝜙𝑖𝑖 (𝑙) = 1 for all 𝑖 ∈ V by following Property 4 of
Proposition 1. The same property allows us to conclude that 𝜙𝑖𝑖 (𝑙) = −𝜋𝑖 /(1−𝜋𝑖 ) for all
𝑖 ∈ V in the case of perfect negative dependence. In sum, 𝜙𝑖 𝑗 (𝑙) evaluates unsigned
dependence when 𝑖 ≠ 𝑗 and signed dependence when 𝑖 = 𝑗. The previous quantities
can be encapsulated in a matrix $\boldsymbol{\Phi}(l) = (\phi_{ij}(l))_{1\le i,j\le r}$, which can be directly estimated by means of $\widehat{\boldsymbol{\Phi}}(l) = (\widehat\phi_{ij}(l))_{1\le i,j\le r}$, where each $\widehat\phi_{ij}(l)$ is computed as
$$\widehat\phi_{ij}(l) = \frac{\widehat p_{ij}(l)-\widehat\pi_i\,\widehat\pi_j}{\sqrt{\widehat\pi_i(1-\widehat\pi_i)\,\widehat\pi_j(1-\widehat\pi_j)}}$$
(this is derived from the proof of Proposition 1).
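As an illustration of how these features can be computed in practice (a sketch of our own, not the authors' code; the function name cts_features and its arguments are hypothetical), the estimators above translate directly into a few lines of R:

```r
# Sketch: estimate the serial-dependence features of one categorical series.
# x is an integer vector coded in 1..r; lag is the lag l.
cts_features <- function(x, lag, r = max(x)) {
  T_len  <- length(x)
  pi_hat <- tabulate(x, nbins = r) / T_len                 # marginal probabilities
  # lagged pair counts N_ij(l): pairs (X_t, X_{t-l}) = (i, j)
  N <- table(factor(x[(lag + 1):T_len], levels = 1:r),
             factor(x[1:(T_len - lag)], levels = 1:r))
  p_hat <- as.matrix(N) / (T_len - lag)                    # estimates of p_ij(l)
  V_hat <- (p_hat - outer(pi_hat, pi_hat))^2 / outer(pi_hat, pi_hat)
  K_hat <- (diag(p_hat) - pi_hat^2) / (1 - sum(pi_hat^2))
  Phi_hat <- (p_hat - outer(pi_hat, pi_hat)) /
    sqrt(outer(pi_hat * (1 - pi_hat), pi_hat * (1 - pi_hat)))
  list(pi = pi_hat, V = V_hat, K = K_hat, Phi = Phi_hat)
}
```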

2.2 Two Innovative Dissimilarities Between CTS

In this section we introduce two distance measures between categorical series based
on the features described above. Suppose we have a pair of CTS 𝑋𝑡(1) and 𝑋𝑡(2) , and
consider a set of 𝐿 lags, L = {𝑙1 , . . . , 𝑙 𝐿 }. A dissimilarity based on Cramer’s 𝑣 and
Cohen's 𝜅, so-called 𝑑𝐶𝐶 , is defined as
$$d_{CC}\big(X_t^{(1)}, X_t^{(2)}\big) = \sum_{k=1}^{L}\Big[\big\|vec\big(\widehat{\boldsymbol V}(l_k)^{(1)}\big) - vec\big(\widehat{\boldsymbol V}(l_k)^{(2)}\big)\big\|^2 + \big\|\widehat{\mathcal K}(l_k)^{(1)} - \widehat{\mathcal K}(l_k)^{(2)}\big\|^2\Big] + \big\|\widehat{\boldsymbol\pi}^{(1)} - \widehat{\boldsymbol\pi}^{(2)}\big\|^2,$$

where the superscripts (1) and (2) are used to indicate that the corresponding
estimations are obtained with respect to the realizations 𝑋𝑡(1) and 𝑋𝑡(2) , respectively.
An alternative distance measure relying on the binarization of the processes,
so-called 𝑑 𝐵 , is defined as
$$d_{B}\big(X_t^{(1)}, X_t^{(2)}\big) = \sum_{k=1}^{L}\big\|vec\big(\widehat{\boldsymbol\Phi}(l_k)^{(1)}\big) - vec\big(\widehat{\boldsymbol\Phi}(l_k)^{(2)}\big)\big\|^2 + \big\|\widehat{\boldsymbol\pi}^{(1)} - \widehat{\boldsymbol\pi}^{(2)}\big\|^2.$$

For a given set of categorical series, the distances 𝑑𝐶𝐶 and 𝑑 𝐵 can be used
as input for traditional clustering algorithms. In this manuscript we consider the
Partitioning Around Medoids (PAM) algorithm.
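For instance, assuming a feature-extraction helper such as cts_features above is available, one possible way to build a pairwise dissimilarity matrix based on 𝑑 𝐵 and feed it to the pam function of the R package cluster is sketched below (our own illustration, not the authors' implementation):

```r
library(cluster)  # provides pam()

# Sketch: d_B between two coded series, relying on cts_features() above.
d_B <- function(x1, x2, lags = 1, r = max(c(x1, x2))) {
  total <- 0
  for (l in lags) {
    f1 <- cts_features(x1, l, r)
    f2 <- cts_features(x2, l, r)
    total <- total + sum((f1$Phi - f2$Phi)^2)   # || vec(Phi1) - vec(Phi2) ||^2
  }
  total + sum((f1$pi - f2$pi)^2)                # marginal-distribution term
}

# Pairwise dissimilarity matrix for a list of coded series, then PAM with K medoids.
pam_cts <- function(series, K, lags = 1) {
  s <- length(series)
  D <- matrix(0, s, s)
  for (i in seq_len(s - 1)) {
    for (j in (i + 1):s) {
      D[i, j] <- D[j, i] <- d_B(series[[i]], series[[j]], lags)
    }
  }
  pam(as.dist(D), k = K, diss = TRUE)
}
```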

3 Partitioning Around Medoids Clustering of CTS

In this section we examine the performance of both metrics 𝑑𝐶𝐶 and 𝑑 𝐵 in the
context of hard clustering (i.e., each series is assigned to exactly one cluster) of CTS
through a simulation study.

3.1 Experimental Design

The simulated scenarios encompass a broad variety of generating processes. In particular, three setups were considered, namely clustering of (i) Markov Chains (MC),
(ii) Hidden Markov Models (HMM) and (iii) New Discrete ARMA (NDARMA) pro-
cesses. The generating models with respect to each class of processes are given below.

Scenario 1. Clustering of MC. Consider four three-state MC, so-called MC1 , MC2 ,
MC3 and MC4 , with respective transition matrices 𝑷11 , 𝑷12 , 𝑷13 and 𝑷14 given by

𝑷11 = 𝑀𝑎𝑡 3 (0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2),
𝑷12 = 𝑀𝑎𝑡 3 (0.1, 0.8, 0.1, 0.6, 0.3, 0.1, 0.6, 0.2, 0.2),
𝑷13 = 𝑀𝑎𝑡 3 (0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05),
𝑷14 = 𝑀𝑎𝑡 3 (1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3),

where the operator 𝑀𝑎𝑡 𝑘 , 𝑘 ∈ N transforms a vector into a square matrix of order 𝑘
by sequentially placing the corresponding numbers by rows.
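For illustration, a realization from MC1 can be simulated with a few lines of R (our own sketch; simulate_mc is a hypothetical helper, and the matrix call with byrow = TRUE plays the role of the 𝑀𝑎𝑡 3 operator):

```r
# Sketch: simulate a T-length realization of a Markov chain with transition
# matrix P (rows summing to one) and a uniform initial distribution.
simulate_mc <- function(T_len, P) {
  r <- nrow(P)
  x <- integer(T_len)
  x[1] <- sample.int(r, 1)
  for (t in 2:T_len) x[t] <- sample.int(r, 1, prob = P[x[t - 1], ])
  x
}

P11 <- matrix(c(0.1, 0.8, 0.1,
                0.5, 0.4, 0.1,
                0.6, 0.2, 0.2), nrow = 3, byrow = TRUE)  # Mat_3(...) by rows
x <- simulate_mc(200, P11)
```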
Scenario 2. Clustering of HMM. Consider the bivariate process (𝑋𝑡 , 𝑄 𝑡 )𝑡 ∈Z , where
𝑄 𝑡 stands for the hidden states and 𝑋𝑡 for the observable random variables. Process
(𝑄 𝑡 )𝑡 ∈Z constitutes a homogeneous MC. Both (𝑋𝑡 )𝑡 ∈Z and (𝑄 𝑡 )𝑡 ∈Z are assumed
to be count processes with range {1, . . . , 𝑟 }. Process (𝑋𝑡 , 𝑄 𝑡 )𝑡 ∈Z is assumed to
verify the three classical assumptions of a HMM. Based on previous considerations,
let HMM1 , HMM2 , HMM3 and HMM4 be four three-state HMM with respective
transition matrices 𝑷21 , 𝑷22 , 𝑷23 and 𝑷24 and emission matrices 𝑬 21 , 𝑬 22 , 𝑬 23 and 𝑬 24
given by
𝑷21 = 𝑀𝑎𝑡 3 (0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05), 𝑷22 = 𝑷21 ,
𝑷23 = 𝑀𝑎𝑡 3 (0.1, 0.7, 0.2, 0.4, 0.4, 0.2, 0.4, 0.3, 0.3),
𝑷24 = 𝑀𝑎𝑡 3 (1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3), 𝑬 21 = 𝑷21 ,
𝑬 22 = 𝑀𝑎𝑡 3 (0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2), 𝑬 23 = 𝑬 22 ,
𝑬 24 = 𝑀𝑎𝑡 3 (1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3).

Scenario 3. Clustering of NDARMA processes. Let (𝑋𝑡 )𝑡 ∈Z and (𝜖 𝑡 )𝑡 ∈Z be two count processes with range {1, . . . , 𝑟 } following the equation

𝑋𝑡 = 𝛼𝑡 ,1 𝑋𝑡−1 + . . . + 𝛼𝑡 , 𝑝 𝑋𝑡− 𝑝 + 𝛽𝑡 ,0 𝜖 𝑡 + . . . + 𝛽𝑡 ,𝑞 𝜖 𝑡−𝑞 ,

where (𝜖 𝑡 )𝑡 ∈Z is i.i.d with 𝑃(𝜖 𝑡 = 𝑖) = 𝜋𝑖 , independent of (𝑋𝑠 )𝑠<𝑡 , and the i.i.d
multinomial random vectors

(𝛼𝑡 ,1 , . . . , 𝛼𝑡 , 𝑝 , 𝛽𝑡 ,0 , . . . , 𝛽𝑡 ,𝑞 ) ∼ MULT(1; 𝜙1 , . . . , 𝜙 𝑝 , 𝜑0 , . . . , 𝜑𝑞 ),

are independent of (𝜖 𝑡 )𝑡 ∈Z and (𝑋𝑠 )𝑠<𝑡 . The considered models are three three-state
NDARMA(2,0) processes and one three-state NDARMA(1,0) process with marginal
distribution 𝝅 3 = (2/3, 1/6, 1/6), and corresponding probabilities in the multinomial
distribution given by

(𝜙1 , 𝜙2 , 𝜑0 )13 = (0.7, 0.2, 0.1), (𝜙1 , 𝜙2 , 𝜑0 )23 = (0.1, 0.45, 0.45),
(𝜙1 , 𝜙2 , 𝜑0 )33 = (0.5, 0.25, 0.25), (𝜙1 , 𝜑0 )43 = (0.2, 0.8).

The simulation study was carried out as follows. For each scenario, 5 CTS of
length 𝑇 ∈ {200, 600} were generated from each process in order to execute the
clustering algorithms twice, thus allowing us to analyze the impact of the series length.
The resulting clustering solution produced by each considered algorithm was stored.
The simulation procedure was repeated 500 times for each scenario and value of
𝑇. The computation of 𝑑𝐶𝐶 and 𝑑 𝐵 was carried out by considering L = {1} in
Scenarios 1 and 2, and L = {1, 2} in Scenario 3. This way, we adapted the distances
to the maximum number of significant lags existing in each setting.

3.2 Alternative Metrics and Assessment Criteria

To better analyze the performance of both metrics 𝑑𝐶𝐶 and 𝑑 𝐵 , we also obtained
partitions by using alternative techniques for clustering of categorical series. The
considered procedures are described below.

• Model-based approach using maximum likelihood estimation (MLE). The distance between two CTS is defined as the squared Euclidean distance between the corresponding vectors of fitted coefficients via MLE (𝑑 𝑀 𝐿𝐸 ).
• Model-based approach using mixtures. [4] propose to group a set of CTS by using a mixture of first-order Markov models via the EM algorithm (𝑑𝐶 𝑍 ).
• A hybrid framework for clustering CTS. [6] presents a dissimilarity between categorical series which evaluates both closeness between raw categorical values and proximity between dynamic patterns (𝑑 𝑀 𝑉 ).
Note that the approach based on the distance 𝑑 𝑀 𝐿𝐸 can be seen as a strict
benchmark in the evaluation task. The effectiveness of the clustering approaches
was assessed by comparing the clustering solution produced by the algorithms with
the true clustering partition, so-called ground truth. The latter consisted of 𝐶 = 4
clusters in all scenarios, each group including the five CTS generated from the same
process. The value 𝐶 = 4 was provided as input parameter to the PAM algorithm
in the case of 𝑑𝐶𝐶 , 𝑑 𝐵 , 𝑑 𝑀 𝐿𝐸 and 𝑑 𝑀 𝑉 . As for the approach 𝑑𝐶 𝑍 , a number of 4
components were considered for the mixture model. Experimental and true partitions
were compared by using three well-known external clustering quality indexes, the
Adjusted Rand Index (ARI), the Jaccard Index (JI) and the Fowlkes-Mallows index
(FMI).

3.3 Results and Discussion

Average values of the quality indexes by taking into account the 500 simulation trials
are given in Tables 1, 2 and 3 for Scenarios 1, 2 and 3, respectively.

Table 1 Average results for Scenario 1.


𝑇 = 200 𝑇 = 600
Method ARI JI FMI ARI JI FMI
𝑑𝐶𝐶 0.774 0.710 0.830 0.916 0.886 0.935
𝑑𝐵 0.729 0.661 0.792 0.861 0.878 0.893
𝑑 𝑀 𝐿𝐸 0.704 0.633 0.772 0.841 0.792 0.876
𝑑𝐶 𝑍 0.712 0.648 0.786 0.915 0.886 0.934
𝑑 𝑀𝑉 0.406 0.363 0.665 0.379 0.363 0.650

The results in Table 1 indicate that the dissimilarity 𝑑𝐶𝐶 is the best performing one
when dealing with MC, outperforming the MLE-based metric 𝑑 𝑀 𝐿𝐸 . The distance
𝑑 𝐵 is also superior to 𝑑 𝑀 𝐿𝐸 . The measure 𝑑𝐶 𝑍 attains results similar to those of 𝑑𝐶𝐶 in Scenario 1, especially for 𝑇 = 600. The good performance of 𝑑𝐶 𝑍 was expected,
since the assumption of first order Markov models considered by this metric is
fulfilled in Scenario 1. Table 2 shows a completely different picture, indicating that
the metrics 𝑑𝐶𝐶 and 𝑑 𝐵 exhibit a significantly better effectiveness than the rest
of the dissimilarities. Finally, the quantities in Table 3 reveal that the model-based
distance 𝑑 𝑀 𝐿𝐸 attains the best results when 𝑇 = 200, but is defeated by 𝑑 𝐵 when

Table 2 Average results for Scenario 2.


𝑇 = 200 𝑇 = 600
Method ARI JI FMI ARI JI FMI
𝑑𝐶𝐶 0.707 0.639 0.777 0.856 0.810 0.888
𝑑𝐵 0.760 0.701 0.812 0.963 0.949 0.971
𝑑 𝑀 𝐿𝐸 0.354 0.342 0.512 0.299 0.310 0.478
𝑑𝐶 𝑍 0.645 0.577 0.739 0.703 0.638 0.779
𝑑 𝑀𝑉 0.089 0.175 0.323 0.062 0.175 0.301

Table 3 Average results for Scenario 3.


𝑇 = 200 𝑇 = 600
Method ARI JI FMI ARI JI FMI
𝑑𝐶𝐶 0.627 0.563 0.715 0.875 0.837 0.903
𝑑𝐵 0.680 0.612 0.754 0.925 0.901 0.941
𝑑 𝑀 𝐿𝐸 0.727 0.656 0.788 0.872 0.828 0.900
𝑑𝐶 𝑍 0.586 0.562 0.693 0.647 0.577 0.738
𝑑 𝑀𝑉 0.035 0.167 0.292 -0.028 0.138 0.251

𝑇 = 600. The metric 𝑑𝐶 𝑍 suffers again from model misspecification. In summary, the numerical experiments carried out throughout this section show the excellent ability of both measures 𝑑𝐶𝐶 and 𝑑 𝐵 to discriminate between a broad variety of categorical processes. Specifically, these metrics either outperform or behave similarly to distances based on estimated model coefficients, which take advantage of knowing the true underlying models.
It is worth highlighting that the methods proposed in this paper could have promising applications in fields such as the clustering of genetic data sequences.

References

1. Liao, T. W.: Clustering of time series data: A survey. Pattern Recogn. 38, 1857-1874 (2005)
2. Churchill, G. A.: Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51,
79-94 (1989)
3. Fokianos, K., Kedem, B.: Regression theory for categorical time series. Stat. Sci. 18, 357-376
(2003)
4. Cadez, I., Heckerman, D., Meek, C., Smyth, P., White, S.: Model-based clustering and visu-
alization of navigation patterns on a web site. Data Min. Knowl. Discov. 7, 399-424 (2003)
5. Fruhwirth-Schnatter, S., Pamminger, C.: Model-based clustering of categorical time series.
Bayesian Analysis. 5, 345-368 (2010)
6. García Magariños, M., Vilar, J. A.: A framework for dissimilarity-based partitioning clustering
of categorical time series. Data Min. Knowl. Discov. 29, 466-502 (2015)
7. Elzinga, C. H.: Combinatorial representations of token sequences. J. Classif. 22, 87-118 (2005)
8. Elzinga, C. H.: Sequence similarity: a nonaligning technique. Socio. Meth. Res. 32, 3-22
(2003)
9. Elzinga, C. H.: Sequence analysis: Metric representations of categorical time series. Socio.
Meth. Res. (2006)

Fuzzy Clustering by Hyperbolic Smoothing

David Masís, Esteban Segura, Javier Trejos, and Adilson Xavier

Abstract We propose a novel method for building fuzzy clusters of large data sets,
using a smoothing numerical approach. The usual sum-of-squares criterion is relaxed
so the search for good fuzzy partitions is made on a continuous space, rather than a
combinatorial space as in classical methods [8]. The smoothing allows a conversion
from a strongly non-differentiable problem into differentiable subproblems of op-
timization without constraints of low dimension, by using a differentiable function
of infinite class. For the implementation of the algorithm, we used the statistical
software 𝑅 and the results obtained were compared to the traditional fuzzy 𝐶–means
method, proposed by Bezdek [1].

Keywords: clustering, fuzzy sets, numerical smoothing

1 Introduction

Methods for making groups from data sets are usually based on the idea of disjoint
sets, such as the classical crisp clustering. The most well known are hierarchical
and 𝑘-means [8], whose resulting clusters are sets with no intersection. However,
this restriction may not be natural for some applications, where the condition for

David Masís
Costa Rica Institute of Technology, Cartago, Costa Rica, e-mail: [email protected]
Esteban Segura
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica,
e-mail: [email protected]
Javier Trejos ( )
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica,
e-mail: [email protected]
Adilson E. Xavier
Universidade Federal de Rio de Janeiro, Brazil, e-mail: [email protected]


some objects may be that they belong to two or more clusters, rather than to only one. Several
methods for constructing overlapping clusters have been proposed in the literature
[4, 5, 8]. Since Zadeh introduced the concept of fuzzy sets [17], the principle of
belonging to several clusters has been used in the sense of a degree of membership
to such clusters. In this direction, Bezdek [1] introduced a fuzzy clustering method
that became very popular since it solved the problem of representation of clusters
with centroids and the assignment of objects to clusters, by the minimization of
a well-stated numerical criterion. Several methods for fuzzy clustering have been
proposed in the literature; a survey of these methods can be found in [16].
In this paper we propose a new fuzzy clustering method based on the numerical
principle of hyperbolic smoothing [15]. Fuzzy 𝐶-Means method is presented in
Section 2 and our proposed Hyperbolic Smoothing Fuzzy Clustering method in
Section 3. Comparative results between these two methods are presented in Section
4. Finally, Section 5 is devoted to the concluding remarks.

2 Fuzzy Clustering

The best known method for fuzzy clustering is Bezdek's original 𝐶-means method [1], which is based on the same principles as 𝑘-means or dynamical clusters [2], that is, iterations over two main steps: i) class representation by the optimization of a numerical criterion, and ii) assignment to the closest class representative in order to construct clusters; these iterations are repeated until convergence to a local minimum of the overall quality criterion is reached.
Let us introduce the notation that will be used and the numerical criterion for
optimization. Let X be an 𝑛 × 𝑝 data matrix containing 𝑝 numerical variables observed on 𝑛 objects. We look for a 𝐾 × 𝑝 matrix G that represents the centroids of 𝐾 clusters of the 𝑛 objects and an 𝑛 × 𝐾 membership matrix U with elements 𝜇𝑖𝑘 ∈ [0, 1], such that the following criterion is minimized:
$$W(\mathbf{X}, \mathbf{U}, \mathbf{G}) = \sum_{i=1}^{n}\sum_{k=1}^{K} (\mu_{ik})^m \,\|\mathbf{x}_i - \mathbf{g}_k\|^2 \qquad (1)$$
$$\text{subject to } \sum_{k=1}^{K}\mu_{ik} = 1 \text{ for all } i \in \{1, 2, \ldots, n\}, \qquad 0 < \sum_{i=1}^{n}\mu_{ik} < n \text{ for all } k \in \{1, 2, \ldots, K\},$$

where x𝑖 is the 𝑖-th row of X and g 𝑘 is the 𝑘-th row of G, representing in R 𝑝 the
centroid of the 𝑘-th cluster.
The parameter 𝑚 > 1 in (1) controls the fuzziness of the clusters. According to
the literature [16], it is usual to take 𝑚 = 2, since greater values of 𝑚 tend to give
very low values of 𝜇𝑖𝑘 , tending to the usual crisp partitions such as in 𝑘-means. We
also assume that the number of clusters, 𝐾, is fixed.
Minimization of (1) represents a nonlinear optimization problem with constraints,
which can be solved using Lagrange multipliers as presented in [1]. The solution,
for each row of the centroids matrix, given a matrix U, is:

$$\mathbf{g}_k = \sum_{i=1}^{n} (\mu_{ik})^m\, \mathbf{x}_i \Big/ \sum_{i=1}^{n} (\mu_{ik})^m. \qquad (2)$$

The solution for the membership matrix, given a centroid matrix G, is [1]:
$$\mu_{ik} = \left[\sum_{j=1}^{K}\left(\frac{\|\mathbf{x}_i - \mathbf{g}_k\|^2}{\|\mathbf{x}_i - \mathbf{g}_j\|^2}\right)^{1/(m-1)}\right]^{-1}. \qquad (3)$$
The following pseudo-code shows the main steps of Bezdek's Fuzzy 𝐶-Means
method [1].

Bezdek’s Fuzzy c-Means (FCM) Algorithm


1. Initialize fuzzy membership matrix U = [𝜇𝑖𝑘 ]𝑛×𝐾
2. Compute centroids for fuzzy clusters according to (2)
3. Update membership matrix U according to (3)
4. If improvement in the criterion is less than a threshold, then stop; otherwise go
to Step 2.

Fuzzy 𝐶-Means method starts from an initial partition that is improved in each
iteration, according to (1), applying Steps 2 and 3 of the algorithm. It is clear that
this procedure may lead to local optima of (1) since iterative improvement in (2) and
(3) is made by a local search strategy.
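For reference, a minimal R sketch of this iterative scheme (our own illustration with hypothetical names, not the fclust implementation used later in the comparisons) is given below.

```r
# Sketch: Bezdek's fuzzy C-means using the update equations (2) and (3).
# X: n x p numeric matrix, K: number of clusters, m: fuzziness parameter.
fuzzy_cmeans <- function(X, K, m = 2, max_iter = 100, tol = 1e-6) {
  n <- nrow(X)
  U <- matrix(runif(n * K), n, K)
  U <- U / rowSums(U)                        # Step 1: random fuzzy memberships
  W_old <- Inf
  for (iter in seq_len(max_iter)) {
    G  <- t(U^m) %*% X / colSums(U^m)        # Step 2: centroids, equation (2)
    D2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, G[k, ])^2))
    D2 <- pmax(D2, 1e-12)                    # guard against zero distances
    U  <- D2^(-1 / (m - 1))
    U  <- U / rowSums(U)                     # Step 3: memberships, equation (3)
    W  <- sum(U^m * D2)                      # criterion (1)
    if (W_old - W < tol) break               # Step 4: stop on small improvement
    W_old <- W
  }
  list(U = U, G = G, W = W)
}

res <- fuzzy_cmeans(as.matrix(iris[, 1:4]), K = 3)  # e.g. Fisher's iris
```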

3 Algorithm for Hyperbolic Smoothing Fuzzy Clustering

For the clustering problem of the 𝑛 rows of the data matrix X in 𝐾 clusters, we can seek the minimum distance between every x𝑖 and its class center g 𝑘 :
$$z_i^2 = \min_{\mathbf{g}_k \in \mathbf{G}} \|\mathbf{x}_i - \mathbf{g}_k\|_2^2,$$

where $\|\cdot\|_2$ is the Euclidean norm. The minimization can be stated as a sum-of-squares:
$$\min \sum_{i=1}^{n} \min_{\mathbf{g}_k \in \mathbf{G}} \|\mathbf{x}_i - \mathbf{g}_k\|_2^2 = \min \sum_{i=1}^{n} z_i^2,$$

leading to the following constrained problem:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad z_i = \min_{\mathbf{g}_k \in \mathbf{G}} \|\mathbf{x}_i - \mathbf{g}_k\|_2, \quad i = 1, \ldots, n.$$

This is equivalent to the following minimization problem:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad z_i - \|\mathbf{x}_i - \mathbf{g}_k\|_2 \le 0, \quad i = 1, \ldots, n, \; k = 1, \ldots, K.$$
Considering the function $\varphi(y) = \max(0, y)$, we obtain the problem:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \varphi\big(z_i - \|\mathbf{x}_i - \mathbf{g}_k\|_2\big) = 0, \quad i = 1, \ldots, n.$$

That problem can be re-stated as the following one:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \varphi\big(z_i - \|\mathbf{x}_i - \mathbf{g}_k\|_2\big) > 0, \quad i = 1, \ldots, n.$$
Given a perturbation $\epsilon > 0$, it leads to the problem:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \varphi\big(z_i - \|\mathbf{x}_i - \mathbf{g}_k\|_2\big) \ge \epsilon, \quad i = 1, \ldots, n.$$

It should be noted that the function $\varphi$ is not differentiable. Therefore, we apply a smoothing procedure in order to formulate a differentiable function and proceed with the minimization by a numerical method. For that, consider the function $\psi(y, \tau) = \frac{y + \sqrt{y^2 + \tau^2}}{2}$, for all $y \in \mathbb{R}$, $\tau > 0$, and the function $\theta(\mathbf{x}_i, \mathbf{g}_k, \gamma) = \sqrt{\sum_{j=1}^{p} (x_{ij} - g_{kj})^2 + \gamma^2}$, for $\gamma > 0$. Hence, the minimization problem is transformed into:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) \ge \epsilon, \quad i = 1, \ldots, n.$$

Finally, according to the Karush–Kuhn–Tucker conditions [10, 11], all the con-
straints are active and the final formulation of the problem is:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad h_i(z_i, \mathbf{G}) = \sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) - \epsilon = 0, \quad i = 1, \ldots, n, \qquad (4)$$
with $\epsilon, \tau, \gamma > 0$.

Considering (4), [15] stated the Hyperbolic Smoothing Clustering Method presented in the following algorithm.

Hyperbolic Smoothing Clustering Method (HSCM) Algorithm


1. Initialize cluster membership matrix U = [𝜇𝑖𝑘 ]𝑛×𝐾
2. Choose initial values: G0 , 𝛾 1 , 𝜏 1 , 𝜖 1
3. Choose values: 0 < 𝜌1 < 1, 0 < 𝜌2 < 1, 0 < 𝜌3 < 1
4. Let 𝑙 = 1
5. Repeat steps 6 and 7 until a stopping condition is reached:
6. Solve problem (P): min $f(\mathbf{G}) = \sum_{i=1}^{n} z_i^2$ with $\gamma = \gamma^l$, $\tau = \tau^l$ and $\epsilon = \epsilon^l$, with $\mathbf{G}^{l-1}$ being the initial value and $\mathbf{G}^l$ the obtained solution
7. Let 𝛾 𝑙+1 = 𝜌1 𝛾 𝑙 , 𝜏 𝑙+1 = 𝜌2 𝜏 𝑙 , 𝜖 𝑙+1 = 𝜌3 𝜖 𝑙 and 𝑙 = 𝑙 + 1.

The most relevant task in the hyperbolic smoothing clustering method is finding the zeroes of the functions $h_i(z_i, \mathbf{G}) = \sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) - \epsilon = 0$ for $i = 1, \ldots, n$. In this paper, we used the Newton-Raphson method for finding these zeroes [3], particularly the BFGS procedure [12]. Convergence of the Newton-Raphson method was successful, mainly thanks to a good choice of initial solutions. In our implementation, these initial approximations were generated by calculating the minimum distance between the 𝑖-th object and the 𝑘-th centroid for a given partition. Once the zeroes $z_i$ of the functions $h_i$ are obtained, the hyperbolic smoothing is implemented. The final solution of this method consists of solving a finite number of optimization subproblems corresponding to problem (P) in Step 6 of the HSCM algorithm. Each one of these subproblems was solved with the R routine optim [13], a useful tool for solving nonlinear programming problems. As far as we know, there is no closed-form solution for this step. In the future, we may consider writing our own routine, but for this paper we use this R routine.
Since we have that $\sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) = \epsilon$, each entry $\mu_{ik}$ of the membership matrix is given by $\mu_{ik} = \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big)/\epsilon$. It is worth noting that fuzziness is controlled by the parameter $\epsilon$.
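A direct R transcription of these smoothed quantities (a sketch with our own function names; we use the base root finder uniroot here for simplicity, whereas the implementation described above relies on Newton-Raphson) could be:

```r
# Sketch: the smoothed functions psi and theta, and the constraint h_i of (4).
psi   <- function(y, tau)          (y + sqrt(y^2 + tau^2)) / 2
theta <- function(x_i, g_k, gamma) sqrt(sum((x_i - g_k)^2) + gamma^2)

h_i <- function(z, x_i, G, gamma, tau, eps) {
  sum(apply(G, 1, function(g_k) psi(z - theta(x_i, g_k, gamma), tau))) - eps
}

# Solve h_i(z_i, G) = 0 for one observation with a bracketing root finder
# (bracketing assumes tau much smaller than eps, as in the settings used below).
solve_z <- function(x_i, G, gamma, tau, eps) {
  d_max <- max(apply(G, 1, function(g_k) theta(x_i, g_k, gamma)))
  uniroot(h_i, interval = c(-d_max - eps, d_max + eps),
          x_i = x_i, G = G, gamma = gamma, tau = tau, eps = eps)$root
}
```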
The following algorithm contains the main steps of the Hyperbolic Smoothing
Fuzzy Clustering (HSFC) method.

Hyperbolic Smoothing Fuzzy Clustering (HSFC) Algorithm


1. Set 𝜖 > 0
2. Choose initial values for: G0 (centroids matrix), 𝛾 1 , 𝜏 1 and 𝑁 (maximum number
of iterations)
3. Choose values: 0 < 𝜌1 < 1, 0 < 𝜌2 < 1
4. Set 𝑙 = 1
5. While 𝑙 ≤ 𝑁:
6. Solve the problem (P): minimize $f(\mathbf{G}) = \sum_{i=1}^{n} z_i^2$ with $\gamma = \gamma^{(l)}$ and $\tau = \tau^{(l)}$, with initial point $\mathbf{G}^{(l-1)}$ and $\mathbf{G}^{(l)}$ being the obtained solution
7. Set $\gamma^{(l+1)} = \rho_1 \gamma^{(l)}$, $\tau^{(l+1)} = \rho_2 \tau^{(l)}$, and $l = l + 1$
8. Set $\mu_{ik} = \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big)/\epsilon$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.
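Putting the pieces together, a compact sketch of the HSFC loop in R (again with hypothetical names, relying on the psi, theta and solve_z helpers above and on optim with the BFGS method for Step 6; not the authors' exact settings) might read:

```r
# Sketch: HSFC outer loop; X is an n x p numeric matrix, K the number of clusters.
hsfc <- function(X, K, N_iter = 10, eps = 0.01, gamma = 0.001, tau = 0.001,
                 rho1 = 0.25, rho2 = 0.25) {
  p <- ncol(X)
  G <- X[sample(nrow(X), K), , drop = FALSE]          # initial centroids
  f_obj <- function(g_vec, gamma, tau) {              # f(G) = sum_i z_i^2
    G_mat <- matrix(g_vec, K, p)
    sum(apply(X, 1, function(x_i) solve_z(x_i, G_mat, gamma, tau, eps))^2)
  }
  for (l in seq_len(N_iter)) {                        # Steps 5-7
    opt   <- optim(as.vector(G), f_obj, gamma = gamma, tau = tau, method = "BFGS")
    G     <- matrix(opt$par, K, p)
    gamma <- rho1 * gamma
    tau   <- rho2 * tau
  }
  z <- apply(X, 1, function(x_i) solve_z(x_i, G, gamma, tau, eps))
  U <- t(sapply(seq_len(nrow(X)), function(i)         # Step 8: memberships
    sapply(seq_len(K), function(k) psi(z[i] - theta(X[i, ], G[k, ], gamma), tau) / eps)))
  list(G = G, U = U)
}
```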

4 Comparative Results

Performance of the HSFC method was studied on a data table well known from the literature, Fisher's iris [7], and on 16 simulated data tables built from a semi-Monte Carlo procedure [14].
For comparing FCM and HSFC, we used the implementation of FCM in the R package fclust [6]. This comparison was made upon the within-class sum-of-squares $W(P) = \sum_{k=1}^{K}\sum_{i=1}^{n} \mu_{ik}\,\|\mathbf{x}_i - \mathbf{g}_k\|^2$. Both methods were applied 50 times and the best value of 𝑊 is reported. For simplicity here, for HSFC we used the following parameters: $\rho_1 = \rho_2 = \rho_3 = 0.25$, $\epsilon = 0.01$ and $\gamma = \tau = 0.001$ as initial values. In
Table 1 the results for Fisher’s iris are shown, in which case HSFC performs slightly
better. It contains the Adjusted Rand Index (ARI) [9] between HSFC and the best
FCM result among 100 runs; ARI compares fuzzy membership matrices crisped into
hard partitions.

Table 1 Minimum sum-of-squares (SS) reported for the Fisher’s iris data table with HSFC and
FCM, 𝐾 being the number of clusters, ARI comparing both methods. In bold best method.
Table 𝐾 SS for HSFC SS for FCM ARI

2 152.348 152.3615 1
Fisher’s iris 3 78.85567 78.86733 0.994
4 57.26934 57.26934 0.980

Simulated data tables were generated in a controlled experiment as in [14], with random numbers following a Gaussian distribution. Factors of the experiment were:
• The number of objects (with 2 levels, 𝑛 = 105 and 𝑛 = 525).
• The number of clusters (with levels 𝐾 = 3 and 𝐾 = 7).
• Cardinality (card) of clusters, with levels i) all with the same number of objects
(coded as card(=)), and ii) one large cluster with 50% of objects and the rest with
the same number (coded as card(≠)).
• Standard deviation of clusters, with levels i) all Gaussian random variables with
standard deviation (SD) equal to one (coded as SD(=)), and ii) one cluster with
SD=3 and the rest with SD=1 (coded as SD(≠)).
Table 2 contains the codes and characteristics of the simulated data tables.
Table 3 contains the minimum values of the sum-of-squares obtained for our HSFC and Bezdek's FCM methods; the best solution of 100 random applications of FCM is presented together with one run of HSFC. It also contains the ARI values comparing the HSFC solution with that best FCM solution. It can be seen that, generally, the HSFC method tends to obtain better results than FCM, with only a few exceptions. In 23 cases HSFC obtains better results, FCM is better in 5 cases, and the results are the same in 17 cases. However, ARI shows that the partitions obtained with both methods tend to be very similar.

Table 2 Codes and characteristics of simulated data tables; 𝑛: number of objects, 𝐾 : number of
clusters, card: cardinality, DS: standard deviation.
Table Characteristics Table Characteristics

T1 𝑛 = 525, 𝐾 = 3, card(=), SD(=) T9 𝑛 = 525, 𝐾 = 3, card(≠), DS(=)


T2 𝑛 = 525, 𝐾 = 7, card(=), SD(=) T10 𝑛 = 525, 𝐾 = 7, card(≠), DS(=)
T3 𝑛 = 105, 𝐾 = 3, card(=), SD(=) T11 𝑛 = 105, 𝐾 = 3, card(≠), DS(=)
T4 𝑛 = 105, 𝐾 = 7, card(=), SD(=) T12 𝑛 = 105, 𝐾 = 7, card(≠), DS(=)
T5 𝑛 = 525, 𝐾 = 3, card(=), SD(≠) T13 𝑛 = 525, 𝐾 = 3, card(≠), DS(≠)
T6 𝑛 = 525, 𝐾 = 7, card(=), SD(≠) T14 𝑛 = 525, 𝐾 = 7, card(≠), DS(≠)
T7 𝑛 = 105, 𝐾 = 3, card(=), SD(≠) T15 𝑛 = 105, 𝐾 = 3, card(≠), DS(≠)
T8 𝑛 = 105, 𝐾 = 7, card(=), SD(≠) T16 𝑛 = 105, 𝐾 = 7, card(≠), DS(≠)

Table 3 Minimum sum-of-squares (SS) reported for HSFC and FCM methods on the simulated
data tables. Best method in bold.
Table 𝐾 SS for SS for ARI Table 𝐾 SS for SS for ARI
HSFC FCM HSFC FCM

2 7073.402 7073.814 0.780 2 12524.31 12524.31 0.900


T1 3 3146.119 3146.119 1 T9 3 9269.361 9269.611 1
4 2983.651 2983.651 1 4 6298.47 6298.368 1
2 16987.19 16987.71 0.764 2 5466.893 5466.912 0.890
T2 3 11653.22 11653.22 1 T10 3 2977.58 2977.58 1
4 7776.855 7777.396 1 4 2745.721 2746.671 1
2 3923.051 3923.062 0.763 2 2969.247 2969.32 0.860
T3 3 2917.13 2917.13 0.754 T11 3 1912.323 1912.323 1
4 2287.523 2256.298 0.993 4 1401.394 1401.394 1
2 1720.365 1720.374 0.992 2 1816.056 1816.056 1
T4 3 569.3112 569.3112 1 T12 3 525.7118 525.7118 1
4 535.5491 535.3541 1 4 477.0593 477.2696 1
2 15595.67 15595.67 0.910 2 12804.03 12805.05 0.920
T5 3 11724.93 11725.28 1 T13 3 8816.805 8817.702 1
4 8409.738 8409.738 0.984 4 6293.774 6293.951 1
2 11877.96 11877.96 0.970 2 16228.07 16228.98 0.920
T6 3 8299.779 8300.718 1 T14 3 7255.113 7255.423 1
4 7212.611 7213.725 1 4 6427.313 6427.313 1
2 4336.261 4336.507 0.955 2 2616.286 2616.943 1
T7 3 3041.076 3041.076 1 T15 3 1978.017 1978.233 1
4 2395.683 2421.333 1 4 1526.895 1526.953 1
2 1767.43 1767.43 1 2 2226.923 2226.212 0.962
T8 3 1380.766 1381.019 1 T16 3 1232.074 1232.124 1
4 1215.302 1211.235 1 4 982.7074 982.9721 1

5 Concluding Remarks

In hyperbolic smoothing, the parameters 𝜏, 𝛾 and 𝜖 tend to zero, so the constraints in the subproblems make problem (P) tend toward the original problem (1). The parameter 𝜖 controls the degree of fuzziness in the clustering; the larger it is, the fuzzier the solution becomes; the smaller it is, the crisper the clustering. In order to compare results and efficiency of the HSFC method, the zeroes of the functions ℎ𝑖 can be obtained with any method for solving equations in one variable or with a predefined routine. According to the results we have obtained so far with our implementation of hyperbolic smoothing for fuzzy clustering, we can conclude that, generally, the HSFC method performs slightly better than Bezdek's original FCM on small real and simulated data tables. Further research is required for testing the performance of the HSFC method on very large data sets, with measures of efficiency, quality of solutions and running time. We are also considering further comparisons between HSFC and FCM with different indices, and writing our own program for solving Step 6 of the HSFC algorithm, that is, the minimization of 𝑓 (G), instead of using the optim routine in R.

References

1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
New York (1981)
2. Bock, H.-H.: Origins and extensions of the k-means algorithm in cluster analysis. Electronic
Journal for History of Probability and Statistics 4 (2008)
3. Burden, R., Faires, D.: Numerical analysis, 9th ed. Brooks/Cole, Pacific Grove (2011)
4. Diday, E.: Orders and overlapping clusters by pyramids. In J.De Leeuw et al. (eds.) Multidi-
mensional Data Analysis, DSWO Press, Leiden (1986)
5. Dunn, J. C.: A fuzzy relative of the ISODATA process and its use in detecting compact, well
separated clusters. J. Cybernetics 3, 32–57 (1974)
6. Ferraro, M. B., Giordani, P., Serafini, A.: fclust: An R Package for Fuzzy Clustering. The R
Journal 11(1), 198-210 (2019) doi: 10.32614/RJ-2019-017
7. Fisher, R. A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics
7: 179-188 (1936)
8. Hartigan, J. A.: Clustering Algorithms. Wiley, New York, NY (1975)
9. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193-218 (1985)
10. Karush, W.: Minima of Functions of Several Variables with Inequalities as Side Constraints.
Master’s Thesis, Dept. of Mathematics, University of Chicago, Chicago, Illinois (1939)
11. Kuhn, H., Tucker, A.: Nonlinear programming, Proc. 2nd Berkeley Symposium on Mathemat-
ical Statistics and Probability, University of California Press, Berkeley, pp. 481-492 (1951)
12. Li, D., Fukushima, M.: On the global convergence of the BFGS method for nonconvex
unconstrained optimization problems. SIAM J. Optim. 11, 1054-1064 (2001)
13. R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2021)
14. Trejos, J., Villalobos, M. A.: Partitioning by particle swarm optimization. In: Brito, P. Bertrand,
P., Cucumel G., de Carvalho, F. (eds.) Selected Contributions in Data Analysis and Classifi-
cation, pp. 235-244. Springer, Berlin (2007)
15. Xavier, A.: The hyperbolic smoothing clustering method, Pattern Recognit. 43, 731-737 (2010)
16. Yang, M. S.: A survey of fuzzy clustering. Math. Comput. Modelling 18, 1-16 (1993)
17. Zadeh, L. A.: Fuzzy sets. Information and Control 8(3), 338-353 (1965)

Stochastic Collapsed Variational Inference for
Structured Gaussian Process Regression
Networks

Rui Meng, Herbert K. H. Lee, and Kristofer Bouchard

Abstract This paper presents an efficient variational inference framework for a family of structured Gaussian process regression network (SGPRN) models. We
incorporate auxiliary inducing variables in latent functions and jointly treat both
the distributions of the inducing variables and hyper-parameters as variational pa-
rameters. Then we take advantage of the collapsed representation of the model and
propose structured variational distributions, which enables the decomposability of a
tractable variational lower bound and leads to stochastic optimization. Our inference
approach is able to model data in which outputs do not share a common input set, and
with a computational complexity independent of the size of the inputs and outputs
to easily handle datasets with missing values. Finally, we illustrate our approach on
both synthetic and real data.

Keywords: stochastic optimization, Gaussian process, variational inference, multivariate time series, time-varying correlation

1 Introduction

Multi-output regression problems arise in various fields. Often, the processes that
generate such datasets are nonstationary. Modern instrumentation has resulted in
increasing numbers of observations, as well as the occurrence of missing values.
This motivates the development of scalable methods for forecasting in such datasets.
Multi-output Gaussian process models or multivariate Gaussian process models
(MGP) generalise the powerful Gaussian process predictive model to vector-valued

Rui Meng ( ) · Kristofer Bouchard


Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, USA,
e-mail: [email protected];[email protected]
Herbert K. H. Lee
University of California, Santa Cruz, USA, e-mail: [email protected]


random fields [1]. Those models demonstrate improved prediction performance com-
pared with independent univariate Gaussian processes (GP) because MGPs express
correlations between outputs. Since the correlation information of data is encoded in
the covariance function, modeling the flexible and computationally efficient cross-
covariance function is of interest. In the literature of multivariate processes, many
approaches are proposed to build valid cross-covariance functions including the
linear model of coregionalization (LMC) [2], kernel convolution techniques [3], B-
spline based coherence functions [4]. However, most of these models are designed
for modelling low-dimensional stationary processes, and require Monte Carlo sim-
ulations, making inference in large datasets computationally intractable.
Modelling the complicated temporal dependencies across variables is addressed in [5, 6] by several adaptations of stochastic LMC. Such models can handle input-varying correlation across multivariate outputs. Especially for multivariate time series, [6] propose an SGPRN that captures time-varying scale, correlation, and smoothness. However, the inference in [6] is difficult in applications where either the number of observations or the dimension is large, or where missing data exist.
Here, we propose an efficient variational inference approach for the SGPRN by
employing the inducing variable framework on all latent processes [7], taking ad-
vantage of its collapsed representation where nuisance parameters are marginalized
out [8] and proposing a tractable variational bound amenable to doubly stochastic
variational inference. We call our approach variational SGPRN (VSGPRN). This
variational framework allows the model to handle missing data without increasing
the computational complexity of inference. We numerically provide evidence of the
benefits of simultaneously modeling time-varying correlation, scale and smoothness
in both a synthetic experiment and a real-world problem.
The main contributions of this work are threefold:
• Learning structured Gaussian process regression networks using inducing vari-
ables on both mixing coefficients and latent functions.
• Employing doubly stochastic variational inference for structured Gaussian pro-
cess regression networks by taking advantage of its collapsed representation and
constructing a tractable lower bound of the loglikelihood, making it suitable for
mini-batching learning.
• Demonstrating that our proposed algorithm succeeds in handling time-varying
correlation on missing data under different scenarios in both synthetic data and
real data.

2 Model

Assume y ( x) ∈ R𝐷 is a vector-valued function of x ∈ R 𝑃 , where 𝐷 is the dimension size of the outputs and 𝑃 is the dimension size of the inputs. SGPRN
assumes that noisy observations y ( x) are the linear combination of latent variables
g ( x) ∈ R𝐷 , corrupted by Gaussian noise 𝜖 ( x). The coefficients L ( x) ∈ R𝐷×𝐷
of the latent functions are assumed to be a stochastic lower triangular matrix with

Fig. 1 Graphical model of VSGPRN. Left: Illustration of the generative model. Right: Illustration
of the variational structure. The dashed (red) block means that we marginalize out those latent
variables in the variational inference framework.

positive values on the diagonal for model identification [9, 6]. Thus, the SGPRN is defined by the generative model of Figure 1 as
$$\mathbf{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \boldsymbol{\epsilon}(\mathbf{x}), \qquad \mathbf{f}(\mathbf{x}) = \mathbf{L}(\mathbf{x})\,\mathbf{g}(\mathbf{x}),$$
with independent white noise $\boldsymbol{\epsilon}(\mathbf{x}) \overset{iid}{\sim} \mathcal{N}(0, \sigma_{err}^2 I)$. We note that each latent function $g_d$ in $\mathbf{g}$ is independently sampled from a GP with a nonstationary kernel $K^g$, and the stochastic coefficients are modeled via a structured GP-based prior as proposed in [9] with a stationary kernel $K^l$ such that
$$g_d \overset{iid}{\sim} GP(0, K^g), \; d = 1, \ldots, D, \qquad \text{and} \qquad l_{ij} \sim \begin{cases} GP(0, K^l), & i > j,\\ \mathrm{logGP}(0, K^l), & i = j, \end{cases}$$
where logGP denotes the log Gaussian process [10]. $K^g$ is modelled as a Gibbs correlation function
$$K^g(\mathbf{x}, \mathbf{x}') = \sqrt{\frac{2\,\ell(\mathbf{x})\,\ell(\mathbf{x}')}{\ell(\mathbf{x})^2 + \ell(\mathbf{x}')^2}}\; \exp\!\left( - \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\ell(\mathbf{x})^2 + \ell(\mathbf{x}')^2} \right), \qquad \ell \sim \mathrm{logGP}(0, K^\ell),$$
where $\ell$ determines the input-dependent length scale of the shared correlations in $K^g$ for all latent functions $g_d$. The varying length-scale process $\ell$ plays an important role in modelling nonstationary time series as illustrated in [11, 6].
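To make the nonstationary covariance concrete, the Gibbs correlation function for scalar inputs can be coded in a few lines of R (a sketch; the length-scale profile ell below is an arbitrary choice for illustration, whereas in the model ℓ is a draw from the logGP prior):

```r
# Sketch: Gibbs nonstationary correlation function K^g(x, x') for scalar inputs,
# given an input-dependent length-scale function ell(x) > 0.
gibbs_kernel <- function(x1, x2, ell) {
  l1 <- ell(x1)
  l2 <- ell(x2)
  sqrt(2 * l1 * l2 / (l1^2 + l2^2)) * exp(-(x1 - x2)^2 / (l1^2 + l2^2))
}

ell    <- function(x) 0.5 * exp(-x)                       # arbitrary length-scale profile
x_grid <- seq(0, 1, length.out = 100)
Kg     <- outer(x_grid, x_grid, gibbs_kernel, ell = ell)  # 100 x 100 correlation matrix
```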
Let X = {x𝑖 , 𝑖 = 1, . . . , 𝑁 } be the set of observed inputs and Y = {y𝑖 , 𝑖 = 1, . . . , 𝑁 } be the set of observed outputs. Denote by 𝜂 the concatenation of all coefficients and all log length-scale parameters, i.e., 𝜂 = ( l, ℓ̃), evaluated at the training inputs X. Here, l is a vector including the entries below the main diagonal and the entries on the diagonal in log scale, and ℓ̃ = log ℓ contains the length-scale parameters in log scale. Also, denote 𝜃 = (𝜃 𝑙 , 𝜃 ℓ , 𝜎²𝑒𝑟𝑟 ) as all hyper-parameters, where 𝜃 𝑙 and 𝜃 ℓ are the hyper-parameters in the kernels 𝐾 𝑙 and 𝐾 ℓ . We note that directly inferring the posterior of the latent variables 𝑝(𝜂 | Y, 𝜃) ∝ 𝑝( Y | 𝜂, 𝜎²𝑒𝑟𝑟 ) 𝑝(𝜂 | 𝜃 𝑙 , 𝜃 ℓ ) is computationally intractable in general because the computational complexity of evaluating 𝑝(𝜂 | Y, 𝜃) is O (𝑁³𝐷³). To overcome this issue, we propose an efficient variational inference approach to significantly reduce the computational burden in the next section.

3 Inference

We introduce a shared set of inducing inputs Z = {z𝑚 }𝑚=1 𝑀 that lie in the same space

as the inputs X and a set of shared inducing variables w𝑑 for each latent function
𝑔 𝑑 evaluated at the inducing inputs Z. Likewise, we consider inducing variables u𝑖𝑖
for the function log 𝐿 𝑖𝑖 when 𝑖 = 𝑗, u𝑖 𝑗 for function 𝐿 𝑖 𝑗 when 𝑖 > 𝑗, and inducing
variables v for function log ℓ( x) evaluated at inducing inputs Z. We denote those
collective variables as l = {l𝑖 𝑗 }𝑖 ≥ 𝑗 , u = {u𝑖 𝑗 }𝑖 ≥ 𝑗 , g = {g𝑑 , 𝑑 = 1, . . . , 𝐷}, w = {w𝑑 , 𝑑 = 1, . . . , 𝐷}, ℓ and v. Then we redefine the model parameters 𝜂 = ( l, u, g, w, ℓ, v), and the prior of those model parameters is 𝑝(𝜂) = 𝑝( l | u) 𝑝( u) 𝑝( g | w, ℓ, v) 𝑝( w) 𝑝(ℓ | v) 𝑝( v).
The core assumption of inducing point-based sparse inference is that the inducing
variables are sufficient statistics for the training and testing data in the sense that the
training and testing data are conditionally independent given the inducing variables.
In the context of our model, this means that the posterior processes of 𝐿, 𝑔 and ℓ are
sufficiently determined by the posterior distribution of u, w and v. We propose a
structured variational distribution and its corresponding variational lower bound. Due
to the nonconjugacy of this model, instead of doing expectation in the evidence lower
bound (ELBO), as is normally done in the literature, we perform the marginalization
on inducing variables u, w and g, and then use the reparameterization trick to
apply end-to-end training with stochastic gradient descent. We will also discuss a
procedure for missing data inference and prediction.
To capture the posterior dependency between the latent functions, we propose a
structured variational distribution of the model parameters 𝜂 used to approximate its
posterior distribution as 𝑞(𝜂) = 𝑝( l | u) 𝑝( g | w, ℓ, v) 𝑝(ℓ| v)𝑞( u, w, v) . This varia-
tional structure is illustrated in Figure 1. The variational distribution of the inducing
variables 𝑞( u, w, v) fully characterizes the distribution of q (𝜂). Thus, the inference
of 𝑞( u, w, v) is of interest. We assume the parameters u, w, and v are Gaussian
and mutually independent.
Given the definition of Gaussian process priors for the SGPRN, the conditional
distributions 𝑝( l | u), 𝑝( g | w, ℓ̃, v), and 𝑝(ℓ | v) have closed-form expressions and all
are Gaussian, except for 𝑝(ℓ| v), which is log Gaussian. The ELBO of the log like-
lihood of observations under our structured variational distribution 𝑞(𝜂) is derived
using Jensen’s inequality as:

𝑝( Y | g, l) 𝑝( u) 𝑝( w) 𝑝( v)
  
log 𝑝( Y) ≥ 𝐸 𝑞 ( 𝜂) log = 𝑅 + 𝐴,
𝑞( u, w, v)
(1)

𝑑=1 𝐸 𝑞 ( g𝑛 , l𝑛 ) log( 𝑝(𝑦 𝑛𝑑 | g𝑛 , l𝑛 )) is the reconstruction term and


Í 𝑁 Í𝐷
where 𝑅 = 𝑛=1
𝐴 = KL(𝑞( u)|| 𝑝( u)) + KL(𝑞( w)|| 𝑝( w)) + KL(𝑞( v)|| 𝑝( v)) is the regularization
term. g𝑛 = {𝑔 𝑑𝑛 = ( g𝑑 )𝑛 }𝑑=1
𝐷 and l = {𝑙
𝑛 𝑖 𝑗𝑛 = ( l𝑖 𝑗 )𝑛 }𝑖 ≥ 𝑗 are latent variables.
The structured decomposition trick for 𝑞(𝜂) has also been used by [12] to derive
variational inference for the multivariate output case. The benefit of this structure
is that all conditional distributions in 𝑞(𝜂) can be cancelled in the derivation of the
lower bound in (1), which alleviates the computational burden of inference. Because
of the conditional independence of the reconstruction term in (1) given g and l, the

lower bound decomposes across both inputs and outputs and this enables the use
of stochastic optimization methods. Moreover, due to the Gaussian assumption in
the prior and variational distributions of the inducing variables, all KL divergence
terms in the regularization term 𝐴 are analytically tractable. Next, instead of directly
computing expectation, we leverage stochastic inference [13].
Stochastic inference requires sampling of l and g from the joint variational
posterior 𝑞(𝜂). Directly sampling them would introduce much uncertainty from
intermediate variables and thus make inference inefficient. To tackle this is-
sue, we marginalize unnecessary intermediate variables u and w and obtain the
marginal distributions $q(\mathbf{l}) = \prod_{i=j} \mathrm{log}\mathcal{N}(\mathbf{l}_{ii}\mid\tilde{\mu}^l_{ii}, \tilde{\Sigma}^l_{ii}) \prod_{i>j} \mathcal{N}(\mathbf{l}_{ij}\mid\tilde{\mu}^l_{ij}, \tilde{\Sigma}^l_{ij})$ and $q(\mathbf{g}\mid\boldsymbol{\ell}, \mathbf{v}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{g}_d\mid\tilde{\mu}^g_d, \tilde{\Sigma}^g_d)$ with a joint distribution $q(\boldsymbol{\ell}, \mathbf{v}) = p(\boldsymbol{\ell}\mid\mathbf{v})\, q(\mathbf{v})$,

where the conditional mean and covariance matrix are easily derived. The corre-
sponding marginal distributions 𝑞( l𝑛 ) and 𝑞( g𝑛 |ℓ, v) at each 𝑛 are also easy to
derive. Moreover, we conduct collapsed inference by marginalizing the latent vari-
ables g𝑛 , so then the individual expectation is

$$E_{q(\mathbf{g}_n,\mathbf{l}_n)}\big[\log p(y_{nd}\mid\mathbf{g}_n,\mathbf{l}_n)\big] = \int L_{nd}\; q(\boldsymbol{\ell}_n, \mathbf{v})\; q(\mathbf{l}_{d\cdot n})\; d(\mathbf{l}_{d\cdot n}, \boldsymbol{\ell}_n, \mathbf{v}), \qquad (2)$$
where $L_{nd} = \log \mathcal{N}\big(y_{nd} \,\big|\, \sum_{j=1}^{D} l_{djn}\,\tilde{\mu}^g_{jn},\; \sigma^2_{err}\big) - \frac{1}{2\sigma^2_{err}} \sum_{j=1}^{D} l^2_{djn}\, \tilde{\sigma}^{g2}_{jn}$ measures the reconstruction performance for observation $y_{nd}$.
Directly evaluating the ELBO is still challenging due to the non-linearities in-
troduced by our structured prior. Recent progress in black box variational inference
[13] avoids this difficulty by computing noisy unbiased estimates of the gradient of
ELBO, via approximating the expectations with unbiased Monte Carlo estimates and
relying on either score function estimators [14] or reparameterization gradients [13]
to differentiate through a sampling process. Here we leverage the reparameterization
gradients for stochastic optimization for model parameters. We note that evaluating
ELBO (1) involves two sources of stochasticity from Monte Carlo sampling in (2)
and from data sub-sampling stochasticity [15]. The prediction procedure is based on
Bayes’ rule and replaces the posterior distribution by the inferred variational distribu-
tion. In the case of missing data, the only modification in (1) is in the reconstruction
term, where we sum up the likelihoods of observed data instead of complete data.
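As an illustration of these two sources of stochasticity, the following minimal sketch computes a reparameterized Monte Carlo estimate of an ELBO with data sub-sampling for a deliberately simplified model (one Gaussian latent per observation with a Gaussian likelihood). It is not the SGPRN itself; the model, batch size and noise level are illustrative assumptions.

```python
# Minimal sketch (not the SGPRN model): reparameterized Monte Carlo estimate of
# an ELBO with data sub-sampling, for y_n ~ N(z_n, sigma_err^2), z_n ~ N(0, 1)
# and a mean-field Gaussian q(z_n). Backpropagating gives the "noisy unbiased"
# gradients used by black box / doubly stochastic variational inference.
import math
import torch

torch.manual_seed(0)
N, B, sigma_err = 1000, 50, 0.5
y = torch.randn(N)                           # toy observations

mu = torch.zeros(N, requires_grad=True)      # variational means
log_s = torch.zeros(N, requires_grad=True)   # variational log std. deviations

idx = torch.randint(0, N, (B,))              # source 1: data sub-sampling
eps = torch.randn(B)                         # source 2: Monte Carlo sampling
z = mu[idx] + torch.exp(log_s[idx]) * eps    # reparameterization z = mu + s * eps

# Reconstruction term, rescaled so the minibatch sum is unbiased for all N terms.
log_lik = -0.5 * math.log(2 * math.pi * sigma_err**2) \
          - 0.5 * ((y[idx] - z) / sigma_err) ** 2
recon = (N / B) * log_lik.sum()

# Analytic KL(q(z_n) || N(0, 1)) on the same minibatch, rescaled likewise.
s2 = torch.exp(2 * log_s[idx])
kl = (N / B) * 0.5 * (s2 + mu[idx] ** 2 - 1.0 - 2 * log_s[idx]).sum()

elbo = recon - kl
(-elbo).backward()                           # noisy, unbiased gradient estimates
print(float(elbo))
```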

4 Experiments

This section illustrates the performance of our model on multivariate time series. We
first show that our approach can model the time-varying correlation and smoothness
of outputs on 2D synthetic datasets in three scenarios with respect to different types of
frequencies but the same missing data mechanism. Then, we compare the imputation
performance on missing data with other inducing-variable based sparse multivariate
Gaussian process models on a real dataset.

We conduct experiments on three synthetic time series with low frequency (LF),
high frequency (HF) and varying frequency (VF) respectively. They are generated
from the system of equations
$$y_1(t) = 5\cos(2\pi w t^{s}) + \epsilon_1(t), \qquad y_2(t) = 5(1-t)\cos(2\pi w t^{s}) - 5t\cos(2\pi w t^{s}) + \epsilon_2(t),$$
where $\{\epsilon_i(t)\}_{i=1}^{2}$ are independent standard white noise processes. The value of 𝑤 refers to the frequency and the value of
𝑠 characterizes the smoothness. The LF and HF datasets use the same 𝑠 = 1, imply-
ing the smoothness is invariant across time. But they employ different frequencies,
𝑤 = 2 for LF and 𝑤 = 5 for HF (i.e., two periods and five periods in a unit time
interval respectively). The VF dataset takes 𝑠 = 2 and 𝑤 = 5, so that the frequency
of the function is gradually increasing as time increases. For all three datasets, the
system shows that as time 𝑡 increases from 0 to 1, the correlation between 𝑦 1 (𝑡) and
𝑦 2 (𝑡) gradually varies from positive to negative. Within each dataset, we randomly
select 200 training data points, in which 100 time stamps are sampled on the interval
(0, 0.8) for the first dimension and the other 100 time stamps sampled on the interval
(0.2, 1) for the second dimension. For the test inputs, we randomly select 100 time
stamps on the interval (0, 1) for each dimension.
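A short sketch of this generating mechanism is given below; the random seed and the exact way the time stamps are drawn are illustrative assumptions.

```python
# Sketch of the synthetic LF, HF and VF data generation described above.
import numpy as np

rng = np.random.default_rng(0)

def simulate(t, w, s):
    """Evaluate y1, y2 at time stamps t, with independent standard white noise."""
    y1 = 5 * np.cos(2 * np.pi * w * t**s) + rng.standard_normal(t.size)
    y2 = (5 * (1 - t) * np.cos(2 * np.pi * w * t**s)
          - 5 * t * np.cos(2 * np.pi * w * t**s)
          + rng.standard_normal(t.size))
    return y1, y2

# 100 training stamps on (0, 0.8) for the first output, 100 on (0.2, 1) for the
# second, and 100 test stamps on (0, 1) per output, as described above.
t1, t2 = rng.uniform(0, 0.8, 100), rng.uniform(0.2, 1.0, 100)
t_test = rng.uniform(0, 1, 100)

settings = {"LF": dict(w=2, s=1), "HF": dict(w=5, s=1), "VF": dict(w=5, s=2)}
data = {}
for name, cfg in settings.items():
    y1, _ = simulate(t1, **cfg)   # first output observed at t1
    _, y2 = simulate(t2, **cfg)   # second output observed at t2
    data[name] = dict(t1=t1, y1=y1, t2=t2, y2=y2, t_test=t_test)
```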

Table 1 Prediction measurements on three synthetic datasets and different models. LF, HF and VF
refer to low-frequency, high-frequency, and time-varying datasets. Three prediction measures are
root mean square error (RMSE), average length of confidence interval (ALCI), and coverage rate
(CR). All three measurements are summarized by the mean and standard deviation across 10 runs
with different random initializations.
Data Model RMSE ALCI CR
LF IGPR [16] 2.25(1.33e-13) 2.18(1.88e-13) 0.835(0)
LF ICM [17] 2.26(2.54e-5) 2.18(1.22e-5) 0.835(0)
LF CMOGP [12] 1.43(6.12e-2) 1.36(1.98e-1) 0.651(3.00e-2)
LF VGPRN [18] 1.01(0.31) - -
LF VSGPRN 1.00(1.43e-1) 2.21(6.56e-2) 0.892(1.63e-2)
HF IGPR [16] 1.51(6.01e-14) 3.17(1.30e-13) 0.915(2.22e-16)
HF ICM [17] 1.52(1.01e-5) 3.17(1.19e-5) 0.910(0)
HF CMOGP [12] 1.29(3.04e-2) 2.34(3.31e-1) 0.729(3.07e-2)
HF VGPRN [18] 1.11(0.25) - -
HF VSGPRN 1.10(1.98e-1) 2.74(7.94e-2) 0.930(1.14e-2)
VF IGPR [16] 1.64(8.17e-14) 3.19(3.02e-13) 0.875(0)
VF ICM [17] 1.66(2.37e-3) 3.16(1.49e-3) 0.880(1.50e-3)
VF CMOGP [12] 2.24(3.08e-1) 2.56(9.29e-1) 0.697(1.56e-1)
VF VGPRN [18] 1.04(0.67) - -
VF VSGPRN 1.24(1.33e-1) 2.92(1.21e-1) 0.887(9.80e-3)

We quantify the model performance in terms of root mean square error (RMSE),
average length of confidence interval (ALCI), and coverage rate (CR) on the test set.
A smaller RMSE corresponds to better predictive performance of the model, and
a smaller ALCI implies a smaller predictive uncertainty. As for CR, the better the
model prediction performance is, the closer CR is to the percentile of the credible
band. Those results are reported by the mean and standard deviation with 10 differ-
ent random initializations of model parameters. Quantitative comparisons relating
to all three datasets are in Table 1. We compare with independent Gaussian process
regression (IGPR) [16], the intrinsic coregionalization model (ICM) [17], Collab-
orative Multi-Output Gaussian Processes (CMOGP) [12] and variational inference
of Gaussian process regression networks [18] on three synthetic datasets. In both
CMOGP and VSGPRN approaches, we use 20 inducing variables. We further exam-
ined model predictive performance on a real-world dataset, the PM2.5 dataset from
the UCI Machine Learning Repository [19]. This dataset tracks the concentration of
fine inhalable particles hourly in five cities in China, along with meteorological data,
from Jan 1st, 2010 to Dec 31st, 2015. We compare our model with two sparse Gaus-
sian process models, i.e., independent sparse Gaussian process regression (ISGPR)
[20] and the sparse linear model of coregionalization (SLMC) [17]. In the dataset,
we consider six important attributes and use 20% of the first 5000 standardized
multivariate observations for training and the others for testing. The RMSEs on the testing data
are shown in Table 2, illustrating that VSGPRN had better prediction performance
compared with ISGPR and SLMC, even when using fewer inducing points.

Table 2 Empirical results for PM2.5 dataset. Each model’s performance is summarized by its
RMSE on the testing data. The number of equi-spaced inducing points is given in parentheses.
Data ISGPR (100) [20] SLMC (100) [17] VSGPRN (50) VSGPRN (100) VSGPRN (200)
PM2.5 0.994 0.948 0.840 0.708 0.625
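For reference, the three evaluation metrics used in this section can be computed from predictive means and credible intervals as in the following sketch; the 95% level and the Gaussian interval construction are assumptions, not necessarily the exact choices made here.

```python
# Sketch of the evaluation metrics: RMSE, average length of the (assumed 95%)
# credible interval (ALCI), and its empirical coverage rate (CR) on the test set.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def alci(lower, upper):
    return np.mean(upper - lower)

def coverage_rate(y_true, lower, upper):
    return np.mean((y_true >= lower) & (y_true <= upper))

# Example with a Gaussian predictive distribution (mean mu, std dev sd):
# lower, upper = mu - 1.96 * sd, mu + 1.96 * sd
```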

5 Conclusions

We propose a novel variational inference approach for structured Gaussian process


regression networks named the variational structured Gaussian process regression
network, VSGPRN. We introduce inducing variables and propose a structured varia-
tional distribution to reduce the computational burden. Moreover, we take advantage
of the collapsed representation of our model and construct a tractable lower bound of
the log likelihood to make it suitable for doubly stochastic inference and easy to han-
dle missing data. In our method, the computational complexity is independent of the
size of the inputs and the outputs. We illustrate the superior predictive performance
for both synthetic and real data.
Our inference approach, VSGPRN can be widely used for high dimensional time
series to model complicated time-varying dependence across multivariate outputs.
Moreover, due to its scalability and flexibility, it can be applied to large, irregularly
sampled and incomplete datasets that exist in various research fields
including healthcare, environmental science and geoscience.

References

1. Álvarez, M., Lawrence, N.: Computationally efficient convolved multiple output Gaussian
processes. J. Mach. Learn. Res. 12, 1459-1500 (2011)
2. Goulard, M., Voltz, M.: Linear coregionalization model: tools for estimation and choice of
cross-variogram matrix. Math. Geol. 24, 269-286 (1992)
3. Gneiting, T., Kleiber, W., Schlather, M.: Matérn cross-covariance functions for multivariate
random fields. J. Am. Stat. Assoc. 105, 1167-1177 (2010)
4. Qadir, G., Sun, Y.: Semiparametric estimation of cross-covariance functions for multivariate
random fields. Biom. 77, 547-560 (2021)
5. Gelfand, A., Schmidt, A., Banerjee, S., Sirmans, C.: Nonstationary multivariate process mod-
eling through spatially varying coregionalization. Test. 13, 263-312 (2004)
6. Meng, R., Soper, B., Lee, H., Liu, V., Greene, J., Ray, P.: Nonstationary multivariate Gaussian
processes for electronic health records. J. Biom. Inform. 117, 103698 (2021)
7. Titsias, M., Lawrence, N.: Bayesian Gaussian process latent variable model. Int. Conf. Artif.
Intell. Stat. 844-851 (2010)
8. Teh, Y., Newman, D., Welling, M.: A collapsed variational Bayesian inference algorithm
for latent Dirichlet allocation. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in
Neural Information Processing Systems 19, (2006)
9. Guhaniyogi, R., Finley, A., Banerjee, S., Kobe, R.: Modeling complex spatial dependencies:
Low-rank spatially varying cross-covariances with application to soil nutrient data. J. Agric.
Biol. Environ. Stat. 18, 274-298 (2013)
10. Møller, J., Syversveen, A., Waagepetersen, R.: Log Gaussian Cox processes. Scand. J. Stat.
25, 451-482 (1998)
11. Remes, S., Heinonen, M., Kaski, S.: Non-stationary spectral kernels. Adv. Neural Inf. Process.
Syst. 30 (2017), https://proceedings.neurips.cc/paper/2017/file/c65d7bd70fe3e5e3a2f3de681edc193d-Paper.pdf
12. Nguyen, T., Bonilla, E., et al.: Collaborative multi-output Gaussian processes. Uncertain.
Artif. Intell. 643-652 (2014)
13. Titsias, M., Lázaro-Gredilla, M.: Doubly stochastic variational Bayes for non-conjugate infer-
ence. Int. Conf. Mach. Learn. 1971-1979 (2014)
14. Ranganath, R., Gerrish, S., Blei, D.: Black box variational inference. Int. Conf. Artif. Intell.
Stat. 814-822 (2014)
15. Hoffman, M., Blei, D., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn.
Res. 14, 1303-1347 (2013)
16. Rasmussen, C., Kuss, M.: Gaussian processes in reinforcement learning. Adv. Neural Inf.
Process. Syst. 751-759 (2004)
17. Wackernagel, H.: Multivariate geostatistics: an introduction with applications. Springer Sci-
ence & Business Media (2013)
18. Nguyen, T., Bonilla, E.: Efficient variational inference for Gaussian process regression net-
works. Int. Conf. Artif. Intell. Stat. 472-480 (2013)
19. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., Chen, S.: Assessing
Beijing’s PM2.5 pollution: severity, weather impact, APEC and winter heating. Proc. R. Soc.
A: Math. Phys. Eng. Sci. 471, 20150257 (2015)
https://royalsocietypublishing.org/doi/abs/10.1098/rspa.2015.0257
20. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. Adv. Neural
Inf. Process. Syst. 1257-1264 (2006), http://papers.nips.cc/paper/2857-sparse-gaussian-processes-using-pseudo-inputs.pdf

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
An Online Minorization-Maximization
Algorithm

Hien Duy Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé

Abstract Modern statistical and machine learning settings often involve high data
volume and data streaming, which require the development of online estimation
algorithms. The online Expectation–Maximization (EM) algorithm extends the pop-
ular EM algorithm to this setting, via a stochastic approximation approach. We show
that an online version of the Minorization–Maximization (MM) algorithm, which in-
cludes the online EM algorithm as a special case, can also be constructed in a similar
manner. We demonstrate our approach via an application to the logistic regression
problem and compare it to existing methods.

Keywords: expectation-maximization, minorization-maximization, parameter esti-


mation, online algorithms, stochastic approximation

1 Introduction

Expectation–Maximization (EM) [6, 17] and Minorization–Maximization (MM)


algorithms [15] are important classes of optimization procedures that allow for
the construction of estimation routines for many data analytic models, including

Hien Duy Nguyen ( )


School of Mathematics and Physics, University of Queensland, St. Lucia, 4067 QLD, Australia,
e-mail: [email protected]
Florence Forbes
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000, Grenoble, France,
e-mail: [email protected]
Gersende Fort
Institut de Mathématiques de Toulouse, CNRS, Toulouse, France,
e-mail: [email protected],
Olivier Cappé
ENS Paris, Universite PSL, CNRS, INRIA, France, e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_29

many finite mixture models. The benefit of such algorithms comes from the use of
computationally simple surrogates in place of difficult optimization objectives.
Driven by high volume of data and streamed nature of data acquisition, there
has been a rapid development of online and mini-batch algorithms that can be used
to estimate models without requiring data to be accessed all at once. Online and
mini-batch versions of EM algorithms can be constructed via the classic Stochastic
Approximation framework (see, e.g., [2, 13]) and examples of such algorithms
include those of [3, 7, 8, 10, 11, 12, 19]. Via numerical assessments, many of the
algorithms above have been demonstrated to be effective in mixture model estimation
problems. Online and mini-batch versions of MM algorithms on the other hand
have largely been constructed following convex optimization methods (see, e.g.,
[9, 14, 23]) and examples of such algorithms include those of [4, 16, 18, 22].
In this work, we provide a stochastic approximation construction of an online
MM algorithm using the framework of [3]. The main advantage of our approach is
that we do not make convexity assumptions and instead replace them with oracle
assumptions regarding the surrogates. Compared to the online EM algorithm of [3]
that this work is based upon, the Online MM algorithm extends the approach to allow
for surrogate functions that do not require latent variable stochastic representations,
which is especially useful for constructing estimation algorithms for mixture of
experts (MoE) models (see, e.g. [20]). We demonstrate the Online MM algorithm
via an application to the MoE-related logistic regression problem and compare it to
competing methods.
Notation. By convention, vectors are column vectors. For a matrix 𝐴, 𝐴⊤ denotes
its transpose. The Euclidean scalar product is denoted by ⟨𝑎, 𝑏⟩. For a continuously
differentiable function 𝜃 ↦→ ℎ(𝜃) (resp. twice continuously differentiable), ∇𝜃 ℎ (or
simply ∇ when there is no confusion) is its gradient (resp. ∇²𝜃𝜃 ℎ is its Hessian). We
denote the vectorization operator that converts matrices to column vectors by vec.

2 The Online MM Algorithm

Consider the optimization problem

$$\arg\max_{\theta \in \mathrm{T}} \; \mathrm{E}\left[ f(\theta; X) \right], \qquad (1)$$

where T is a measurable open subset of R 𝑝 , X is a topological space endowed


with its Borel sigma-field, 𝑓 : T × X → R is a measurable function and 𝑋 is a
X-valued random variable on the probability space (Ω, F , P). In this paper, we are
interested in the setting when the expectation E [ 𝑓 (𝜃; 𝑋)] has no closed form, and
the optimization problem is solved by an MM-based algorithm.
Following the terminology of [15], we say that 𝑔 : T × X × T → R, (𝜃, 𝑥, 𝜏) ↦→ 𝑔(𝜃, 𝑥; 𝜏),
is a minorizer of 𝑓 , if for any 𝜏 ∈ T and for any (𝜃, 𝑥) ∈ T × X, it holds that

𝑓 (𝜃; 𝑥) − 𝑓 (𝜏; 𝑥) ≥ 𝑔(𝜃, 𝑥; 𝜏) − 𝑔(𝜏, 𝑥; 𝜏). (2)


In our work, we consider the case when the minorizer function 𝑔 has the following
structure:
A1 The minorizer surrogate 𝑔 is of the form:

$$g(\theta, x; \tau) = -\psi(\theta) + \left\langle \bar{S}(\tau; x), \phi(\theta) \right\rangle, \qquad (3)$$

where 𝜓 : T → R, 𝜙 : T → R𝑑 and 𝑆¯ : T × X → R𝑑 are measurable functions.


In addition, 𝜙 and 𝜓 are continuously differentiable on T.
We also make the following assumptions:
A2 There exists a measurable open and convex set S ⊆ R𝑑 such that for any 𝑠 ∈ S,
𝛾 ∈ [0, 1) and any (𝜏, 𝑥) ∈ T × X:

$$s + \gamma\left(\bar{S}(\tau; x) - s\right) \in S.$$

A3 The expectation E[S̄(𝜃; 𝑋)] exists, is in S, and is finite whatever 𝜃 ∈ T, but it
may have no closed form. Online independent oracles {𝑋𝑛 , 𝑛 ≥ 0}, with the same
distribution as 𝑋, are available.
A4 For any 𝑠 ∈ S, there exists a unique root to 𝜃 ↦→ −∇𝜓(𝜃) + ∇𝜙(𝜃)⊤ 𝑠, which
is the unique maximum on T of the function 𝜃 ↦→ −𝜓(𝜃) + ⟨𝑠, 𝜙(𝜃)⟩. This root is
denoted by θ̄(𝑠).
Seen as a function of 𝜃, 𝑔(·, 𝑥; 𝜏) is the sum of two functions: −𝜓 and a linear
combination of the components of 𝜙 = (𝜙1 , . . . , 𝜙 𝑑 ). Assumption A1 implies that
the minorizer surrogate is in a functional space spanned by these (𝑑 + 1) functions.
By (2) and A1–A3, it follows that
 
$$\mathrm{E}[f(\theta; X)] - \mathrm{E}[f(\tau; X)] \geq \psi(\tau) - \psi(\theta) + \left\langle \mathrm{E}\big[\bar{S}(\tau; X)\big], \phi(\theta) - \phi(\tau) \right\rangle, \qquad (4)$$

thus providing a minorizer function for the objective function 𝜃 ↦→ E [ 𝑓 (𝜃; 𝑋)].
By A4, the usual MM algorithm would iteratively define the sequence
θ_{n+1} = θ̄(E[S̄(θ_n; X)]). Since the expectation may not have closed form but infinite datasets
are available (see A3), we propose a novel Online MM algorithm. It defines the
sequence {𝑠 𝑛 , 𝑛 ≥ 0} as follows: given positive step sizes {𝛾𝑛+1 , 𝑛 ≥ 1} in (0, 1) and
an initial value 𝑠0 ∈ S, set for 𝑛 ≥ 0:
 
$$s_{n+1} = s_n + \gamma_{n+1} \left( \bar{S}\big(\bar{\theta}(s_n); X_{n+1}\big) - s_n \right). \qquad (5)$$

The update mechanism (5) is a Stochastic Approximation iteration, which defines


an S-valued sequence (see A2). It consists of the construction of a sequence of
minorizer functions through the definition of their parameter 𝑠 𝑛 in the functional
space spanned by −𝜓, 𝜙1 , . . . , 𝜙 𝑑 .  
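In plain terms, iteration (5) can be sketched as follows, where S_bar and theta_bar are placeholders for the model-specific maps S̄ and θ̄ (the logistic-regression versions are given in Section 3); this is only a schematic restatement of (5), assuming the statistic is stored as a flat array.

```python
# Generic Online MM iteration (5):
# s_{n+1} = s_n + gamma_{n+1} * (S_bar(theta_bar(s_n), x_{n+1}) - s_n)
import numpy as np

def online_mm(s0, S_bar, theta_bar, data_stream, step_sizes):
    s = np.asarray(s0, dtype=float)
    for gamma, x in zip(step_sizes, data_stream):
        s = s + gamma * (S_bar(theta_bar(s), x) - s)
    return theta_bar(s), s
```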
If our algorithm (5) converges, any limiting point s★ satisfies E[S̄(θ̄(s★); X)] =
s★. Hence, our algorithm is designed to approximate the intractable expectation,
evaluated at 𝜃¯ (𝑠★), where 𝑠★ satisfies a fixed point equation. The following lemma
establishes the relation between the limiting points of (5) and the optimization prob-
lem (1) at hand. Namely, it implies that any limiting value 𝑠★ provides a stationary
point 𝜃★ := 𝜃¯ (𝑠★) of the objective function E [ 𝑓 (𝜃; 𝑋)] (i.e., 𝜃★ is a root of the
derivative of the objective function). The proof follows the technique of [3]. Set
 
$$h(s) := \mathrm{E}\big[\bar{S}\big(\bar{\theta}(s); X\big)\big] - s, \qquad \Gamma := \{ s \in S : h(s) = 0 \}.$$

Lemma 1 Assume that θ ↦→ E[f(θ; X)] is continuously differentiable on T and
denote by L the set of its stationary points. If s★ ∈ Γ, then θ̄(s★) ∈ L. Conversely,
if θ★ ∈ L, then s★ := E[S̄(θ★; X)] ∈ Γ.
Proof A4 implies that

$$-\nabla\psi(\bar{\theta}(s)) + \nabla\phi(\bar{\theta}(s))^{\top} s = 0, \qquad s \in S. \qquad (6)$$

Use (2) and A1, and apply the expectation w.r.t. 𝑋 (under A3). This yields (4),
which is available for any 𝜃, 𝜏 ∈ T. This inequality provides a minorizer function for
𝜃 ↦→ E [ 𝑓 (𝜃; 𝑋)]: the difference is nonnegative and minimal (i.e. equal to zero) at
𝜃 = 𝜏. Under the assumptions and A1, this yields

$$\nabla \mathrm{E}[f(\cdot; X)]\big|_{\theta=\tau} + \nabla\psi(\tau) - \nabla\phi(\tau)^{\top} \mathrm{E}\big[\bar{S}(\tau; X)\big] = 0. \qquad (7)$$
Let 𝑠★ ∈ Γ and apply (7) with 𝜏 ← 𝜃¯ (𝑠★). It then follows that

$$\nabla \mathrm{E}[f(\cdot; X)]\big|_{\theta=\bar{\theta}(s_{\star})} + \nabla\psi(\bar{\theta}(s_{\star})) - \nabla\phi(\bar{\theta}(s_{\star}))^{\top} s_{\star} = 0,$$

which implies 𝜃¯ (𝑠★) ∈ L by (6). Conversely, if 𝜃★ ∈ L, then by (7), we have

$$\nabla\psi(\theta_{\star}) - \nabla\phi(\theta_{\star})^{\top} \mathrm{E}\big[\bar{S}(\theta_{\star}; X)\big] = 0,$$
which, by A3 and A4, implies that θ★ = θ̄(E[S̄(θ★; X)]) = θ̄(s★). By definition of
s★, this yields s★ = E[S̄(θ̄(s★); X)]; i.e. s★ ∈ Γ. □
By applying the results of [5] regarding the asymptotic convergence of Stochastic
Approximation algorithms, additional regularity assumptions on 𝜙, 𝜓, θ̄ imply that
the algorithm (5) possesses a continuously differentiable Lyapunov function 𝑉
defined on S and given by 𝑉 : s ↦→ E[f(θ̄(s); X)], satisfying ⟨∇𝑉(s), h(s)⟩ ≤ 0,
where the inequality is strict outside the set Γ (see [3, Prop. 2]). In addition to
Lemma 1, assumptions on the distribution of 𝑋 and on the stability of the sequence
{s_n, n ≥ 0} are provided in [5, Thm. 2 and Lem. 1], which, combined with the usual
conditions on the step sizes, ∑_n γ_n = +∞ and ∑_n γ_n² < ∞, yield the almost-sure
convergence of the sequence {s_n, n ≥ 0} to the set Γ, and the almost-sure convergence
of the sequence {θ̄(s_n), n ≥ 0} to the set L of the stationary points of the
objective function θ ↦→ E[f(θ; X)]. Due to limited space, the exact statement
of these convergence results for our Online MM framework is omitted.

3 Example Application

As an example, we consider the logistic regression problem, where we solve (1) with

$$f(\theta; x) := y\, w^{\top}\theta - \log\big(1 + \exp(w^{\top}\theta)\big), \qquad x := (y, w),$$

where 𝑦 ∈ {0, 1}, 𝑤 ∈ R 𝑝 , and 𝜃 ∈ T := R 𝑝 . Here, we assume that 𝑋 = (𝑌 , 𝑊) is a


random variable such that E [ 𝑓 (𝜃; 𝑋)] exists for each 𝜃.
Denote by 𝜆 the standard logistic function 𝜆 (·) := exp {·} /(1+exp {·}). Following
[1], (2) and A1 are verified by taking
   
$$\psi(\theta) := 0, \qquad \phi(\theta) := \begin{pmatrix} \theta \\ \mathrm{vec}(\theta\theta^{\top}) \end{pmatrix}, \qquad \bar{S}(\tau; x) = \begin{pmatrix} \bar{s}_1(\tau; x) \\ \mathrm{vec}\big(\bar{S}_2(\tau; x)\big) \end{pmatrix},$$

where

$$\bar{s}_1(\tau; x) := \big(y - \lambda(\tau^{\top} w)\big) w + \tfrac{1}{4}\, w w^{\top} \tau, \qquad \bar{S}_2(\tau; x) = -\tfrac{1}{8}\, w w^{\top}.$$

With S := {(s₁, vec(S₂)) : s₁ ∈ R^p and S₂ ∈ R^{p×p} is symmetric negative definite},
it follows that θ̄(s) := −(2S₂)⁻¹ s₁.
Online MM. Let s_n = (s_{1,n}, S_{2,n}) ∈ S. The corresponding Online MM recursion
is then
$$s_{1,n+1} = s_{1,n} + \gamma_{n+1} \left( \big(Y_{n+1} - \lambda(\bar{\theta}(s_n)^{\top} W_{n+1})\big) W_{n+1} + \tfrac{1}{4}\, W_{n+1} W_{n+1}^{\top} \bar{\theta}(s_n) - s_{1,n} \right) \qquad (8)$$
$$S_{2,n+1} = S_{2,n} + \gamma_{n+1} \left( -\tfrac{1}{8}\, W_{n+1} W_{n+1}^{\top} - S_{2,n} \right), \qquad (9)$$

where {(𝑌𝑛+1 , 𝑊𝑛+1 ), 𝑛 ≥ 0} are i.i.d. pairs with the same distribution as 𝑋 = (𝑌 , 𝑊).
Parameter estimates can then be deduced by setting 𝜃 𝑛+1 := 𝜃¯ (𝑠 𝑛+1 ).
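A direct transcription of (8)–(9) on simulated data matching the numerical illustration below (θ₀ = (3, −3), W = (1, U) with U ∼ N(0, 1)) is sketched here; the initialization of s and the step-size indexing are simplifying assumptions rather than the exact choices of this chapter.

```python
# Sketch of the Online MM recursion (8)-(9) for logistic regression.
import numpy as np

rng = np.random.default_rng(1)
p, n_max = 2, 10**5
theta0 = np.array([3.0, -3.0])                 # true parameter of the simulation

def lam(t):                                    # standard logistic function
    return 1.0 / (1.0 + np.exp(-t))

def theta_bar(s1, S2):                         # theta_bar(s) = -(2 S2)^{-1} s1
    return -np.linalg.solve(2.0 * S2, s1)

s1 = np.zeros(p)                               # assumed initialization
S2 = -np.eye(p) / 8.0                          # (paper: s_0 built from 2 samples)

for n in range(1, n_max + 1):
    W = np.array([1.0, rng.standard_normal()])
    Y = rng.binomial(1, lam(theta0 @ W))
    gamma = (n + 1) ** (-0.6)                  # step size kept strictly in (0, 1)
    theta = theta_bar(s1, S2)
    s1 += gamma * ((Y - lam(theta @ W)) * W + 0.25 * np.outer(W, W) @ theta - s1)
    S2 += gamma * (-np.outer(W, W) / 8.0 - S2)

print(theta_bar(s1, S2))                       # should be close to (3, -3)
```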
For comparison, we also consider two Stochastic Approximation schemes directly
on 𝜃 in the parameter-space: a stochastic gradient (SG) algorithm and a Stochastic
Newton Raphson (SNR) algorithm.
Stochastic gradient. SG requires the gradient of 𝑓 (𝜃; 𝑥) with respect to 𝜃:
∇f(θ; x) = {y − λ(θ⊤w)} w, which leads to the recursion
$$\hat{\theta}_{n+1} = \hat{\theta}_n + \gamma_{n+1} \big\{ Y_{n+1} - \lambda(\hat{\theta}_n^{\top} W_{n+1}) \big\} W_{n+1}. \qquad (10)$$

Stochastic Newton-Raphson. In addition, SNR requires the Hessian with respect
to θ, given by ∇²_θθ f(θ; x) = −λ(θ⊤w){1 − λ(θ⊤w)} ww⊤. The SNR recursion is then
$$\hat{A}_{n+1} = \hat{A}_n + \gamma_{n+1} \big( \nabla^2_{\theta\theta} f(\hat{\theta}_n; X_{n+1}) - \hat{A}_n \big) \qquad (11)$$
$$G_{n+1} = -\hat{A}_{n+1}^{-1} \qquad (12)$$
$$\hat{\theta}_{n+1} = \hat{\theta}_n + \gamma_{n+1}\, G_{n+1} \big\{ Y_{n+1} - \lambda(\hat{\theta}_n^{\top} W_{n+1}) \big\} W_{n+1}. \qquad (13)$$

Equation (12) assumes that 𝐴ˆ 𝑛+1 is invertible. In this logistic example, we can
guarantee this by choosing 𝐴ˆ 0 to be invertible. Otherwise 𝐴ˆ 𝑛 is invertible after
some 𝑛 sufficiently large, with probability one. Again in the logistic case, observe
that, from the structure of ∇2𝜃 𝜃 𝑓 and from the Woodbury matrix identity, Equations
(11–12) can be replaced by
$$G_{n+1} = \frac{G_n}{1 - \gamma_{n+1}} - \frac{\gamma_{n+1}\, a_{n+1}\, G_n W_{n+1} W_{n+1}^{\top} G_n}{(1 - \gamma_{n+1})\big( (1 - \gamma_{n+1}) + \gamma_{n+1}\, a_{n+1}\, W_{n+1}^{\top} G_n W_{n+1} \big)},$$

where $a_{n+1} := \lambda(\hat{\theta}_n^{\top} W_{n+1}) \big\{ 1 - \lambda(\hat{\theta}_n^{\top} W_{n+1}) \big\}$.
It appears that the Online MM recursion in the 𝑠-space defined by (8) and (9) is
equivalent to the SNR recursion above (i.e., (11)–(13)) when the Hessian ∇²_θθ f(θ; x)
is replaced by the lower bound −(1/4) ww⊤. This observation holds whenever 𝑔 is
quadratic in (𝜃 − 𝜏).
Polyak averaging. In practice, for Online MM, SG, and SNR recursions, it is
common to consider Polyak averaging [21], starting from some iteration 𝑛0 , chosen
such as to avoid the initial highly volatile estimates. Set θ̂^A_{n₀} := 0, and for n ≥ n₀,
$$\hat{\theta}^{A}_{n+1} = \hat{\theta}^{A}_{n} + \alpha_{n - n_0 + 1} \big( \hat{\theta}_n - \hat{\theta}^{A}_{n} \big), \qquad (14)$$
where α_n is usually set to α_n := n^{-1}.
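Equivalently, (14) is a running mean of the iterates from n₀ onwards; a short sketch follows, where thetas is a hypothetical list of stored iterates θ̂_n.

```python
# Polyak averaging (14): running mean of the stored iterates starting at n0,
# with alpha_m = 1/m. 'thetas' is a hypothetical list of iterates theta_hat_n.
import numpy as np

def polyak_average(thetas, n0):
    theta_avg = np.zeros_like(thetas[n0], dtype=float)
    for m, th in enumerate(thetas[n0:], start=1):
        theta_avg += (th - theta_avg) / m
    return theta_avg
```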


Numerical illustration. We now demonstrate the performance of the Online
MM algorithm for logistic regression, defined by (5) and the derivations above. To
do so, a sequence {X_i = (Y_i, W_i), i ∈ {1, ..., n_max}} of n_max = 10^5 i.i.d. replicates
of X = (Y, W) is simulated: W = (1, U), where U ∼ N(0, 1) and [Y | W = w] ∼
Ber(λ(θ_0^⊤ w)), where θ_0 = (3, −3). Online MM is run using the learning rate γ_n =
n^{-0.6}, as suggested in [3]. The algorithm is initialized with θ̂_0 = (0, 0) and
s_0 = ∑_{i=1}^{2} S̄(θ̂_0; X_i)/2.
For comparison, we also show, on Figure 1, the SG, SNR estimates and their
Polyak averaged values in 𝜃-space. As is usually recommended with Stochastic Ap-
proximation, the first few volatile estimations are discarded. Similarly, for Polyak
averaging, we set n_0 = 10^3. As expected, we observe that the Online MM and the
SNR recursions are very close but with the SNR showing more variability. Their com-
parison after Polyak averaging shows very close trajectories while the SG trajectory
is clearly different and shows more bias. Final estimates [Polyak averaged estimates]
of 𝜃 0 from the SG, SNR, and Online MM algorithms are respectively: (2.67, −2.66)
[(2.51, −2.48)], (3.03, −3.03) [(2.99, −3.03)], and (3.01, −3.03) [(2.98, −3.02)],
which we can compare to the batch maximum likelihood estimate (3.00, −3.05)
(obtained via the glm function in R). Notice the remarkable closeness between the
online MM and batch estimates.

Fig. 1 Logistic regression example: the first row shows Online MM (black), SG (blue), and SNR
(red) recursions. The second row shows the respective Polyak averaging recursions. The estimates
of the first (first column) and the second (second column) components of 𝜃 are plotted starting
from n = 10^3 for readability.

4 Final Remarks

Remark 1 For a parametric statistical model indexed by θ, let f(θ; x) be the
log-density of a random variable X with stochastic representation f(θ; x) =
log ∫_Y p_θ(x, y) μ(dy), where p_θ(x, y) is the joint density of (X, Y) with respect
to the positive measure μ for some latent variable Y ∈ Y. Then, via [15, Sec. 4.2],
we recover the Online EM algorithm by using the minorizer function 𝑔:
$$g(\theta, x; \tau) := \int_{\mathsf{Y}} \log\big(p_{\theta}(x, y)\big)\, p_{\tau}(x, y) \exp(-f(\tau; x))\, \mu(\mathrm{d}y).$$

Remark 2 Via the minorization approach of [1] (as used in Section 3) and the mixture
representation from [19], we can construct an Online MM algorithm for MoE models,
analogous to the MM algorithm of [20]. We shall provide exposition on such an
algorithm in future work.

Acknowledgements Part of the work by G. Fort is funded by the Fondation Simone et Cino Del
Duca, Institut de France. H. Nguyen is funded by ARC Grant DP180101192. The work is supported
by Inria project LANDER.

References

1. Bohning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. (1992)
2. Borkar, V.S.: Stochastic approximation: A dynamical systems viewpoint. Springer (2009)
3. Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models.
J. Roy. Stat. Soc. B Stat. Meth. 71, 593–613 (2009)
4. Cui, Y., Pang, J.: Modern nonconvex nondifferentiable optimization. SIAM, Philadelphia
(2022)
5. Delyon, B., Lavielle, M., Moulines, E.: Convergence of a stochastic approximation version of
the EM algorithm. Ann. Stat. 27, 94–128 (1999)
6. Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via
the EM algorithm. J. Roy. Stat. Soc. B Stat. Meth. 39, 1–38 (1977)
7. Fort, G., Gach, P., Moulines, E.: Fast incremental expectation maximization for finite-sum
optimization: nonasymptotic convergence. Stat. Comput. 31, 1–24 (2021)
8. Fort, G., Moulines, E., Wai, H. T.: A stochastic path-integrated differential estimator expecta-
tion maximization algorithm. In: Proceedings of the 34th Conference on Neural Information
Processing Systems (NeurIPS) (2020)
9. Hazan, E.: Introduction to online convex optimization. Foundations and Trends in Optimization.
2 (2015)
10. Karimi, B., Miasojedow, B., Moulines, E., Wai, H. T.: Non-asymptotic analysis of biased
stochastic approximation scheme. Proceedings of Machine Learning Research. 99, 1–31 (2019)
11. Karimi, B., Wai, H. T., Moulines, R., Lavielle, M.: On the global convergence of (fast) incre-
mental expectation maximization methods. In: Proceedings of the 33rd Conference on Neural
Information Processing Systems (NeurIPS) (2019)
12. Kuhn, E., Matias, C., Rebafka, T.: Properties of the stochastic approximation EM algorithm
with mini-batch sampling. Stat. Comput. 30, 1725–1739 (2020)
13. Kushner, H. J., Yin, G. G.: Stochastic Approximation and Recursive Algorithms and Applica-
tions. Springer, New York (2003)
14. Lan, G.: First-order and Stochastic Optimization Methods for Machine Learning. Springer,
Cham (2020)
15. Lange, K.: MM Optimization Algorithms. SIAM, Philadelphia (2016)
16. Mairal, J.: Stochastic majorization-minimization algorithm for large-scale optimization. In:
Advances in Neural Information Processing Systems, pp. 2283–2291 (2013)
17. McLachlan, G. J., Krishnan, T.: The EM Algorithm And Extensions. Wiley, New York (2008)
18. Mokhtari, A., Koppel, A.: High-dimensional nonconvex stochastic optimization by doubly
stochastic successive convex approximation. IEEE Trans. Signal Process. 68, 6287–6302
(2020)
19. Nguyen, H.D., Forbes, F., McLachlan, G. J.: Mini-batch learning of exponential family finite
mixture models. Stat. Comput. 30, 731–748 (2020)
20. Nguyen, H. D., McLachlan, G. J.: Laplace mixture of linear experts. Comput. Stat. Data Anal.
93, 177–191 (2016)
21. Polyak, B. T., Juditsky, A. B.: Acceleration of stochastic approximation by averaging. SIAM
J. Contr. Optim. 30, 838–855 (1992)
22. Razaviyayn, M., Sanjabi, M., Luo, Z.: A stochastic successive minimization method for non-
smooth nonconvex optimization with applications to transceiver design in wireless communi-
cation networks. Math. Program. Series B. 515–545 (2016)
23. Shalev-Shwartz, S.: Online learning and online convex optimization. Foundations and Trends
in Machine Learning. 4, 107–194 (2011)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Detecting Differences in Italian Regional Health
Services During Two Covid-19 Waves

Lucio Palazzo and Riccardo Ievoli

Abstract During the first two waves of Covid-19 pandemic, territorial healthcare sys-
tems have been severely stressed in many countries. The availability (and complexity)
of data requires proper comparisons for understanding differences in performance
of health services. We apply a three-steps approach to compare the performance of
Italian healthcare system at territorial level (NUTS 2 regions), considering daily time
series regarding both intensive care units and ordinary hospitalizations of Covid-19
patients. Changes between the two waves at a regional level emerge from the main
results, allowing to map the pressure on territorial health services.

Keywords: regional healthcare, time series, multidimensional scaling, cluster anal-


ysis, trimmed 𝑘-means

1 Introduction

During the Covid-19 pandemic, the evaluation of similarities and differences between
territorial health services [23] is relevant for decision makers and should guide the
governance of countries [15] through the so-called “waves”. This type of analysis
becomes even more crucial in countries where the National healthcare system is
regionally-based, which is the case of Italy (or Spain) among others. Italy is one of
the countries in Europe which has been mostly affected by the pandemic, and the
pressure on Regional Health Services (RHS) has been producing dramatic effects
also in the economic [2] and the social [3] spheres. Regional Covid-19-related health

Lucio Palazzo ( )
Department of Political Sciences, University of Naples Federico II, via Leopoldo Rodinò 22 - 80138
Napoli, Italy, e-mail: [email protected]
Riccardo Ievoli
Department of Chemical, Pharmaceutical and Agricultural Sciences, University of Ferrara, via
Luigi Borsari 46 - 44121 Ferrara, Italy, e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_30

indicators are extremely relevant for monitoring the pandemic territorial widespread
[21], and to impose (or relax) restrictions in accordance with the level of health risk.
The aim of this work is to exploit the potential of Multidimensional Scaling (MDS)
to detect the main imbalances that occurred in the RHS, observing the hospital admission
dynamics of patients with Covid-19 disease. Both daily time series regarding patients
treated in Intensive Care (IC) units and individuals hospitalized in other hospital
wards are used to evaluate and compare the reaction to healthcare pressure in 21
geographical areas (NUTS 2 Italian regions), considering the first two waves [4] of
pandemic. Indeed, territorial imbalances in terms of RHS’ performance [24] should
be firstly driven by the geographical propagation flows of the virus (first wave). Then,
different reactions to pandemic shock may be provided by RHSs, and changes of
imbalances can be observed in the second wave.
Our proposal consists of three subsequent steps. Firstly, a matrix of distances
between regional time series is obtained through a dissimilarity metric [29]. Then,
we apply a (weighted) MDS [19, 22] to map similarity patterns in a reduced
space, with a weighting scheme based on the number of neighbouring
regions. Finally, we perform a cluster analysis to identify groups according to RHS
performance in the two waves.
The paper is organized as follows: Section 2 describes the methodological ap-
proach used to compare and cluster time series, while Section 3 introduces data and
descriptive analysis. Results regarding RHSs are depicted and discussed in Section
4, while Section 5 concludes with some remarks and possible advances.

2 Time Series Clustering

Given a 𝑇 × 𝑛 matrix, where 𝑇 represents the days and 𝑛 the number of regions, our
methodological approach consists of three subsequent steps:
Step 1. Compute a dissimilarity matrix 𝐷 based on a given measure;
Step 2. Apply a weighted multidimensional scaling (wMDS) procedure, storing
the coordinates of the first two components;
Step 3. Perform cluster analysis on the MDS reduced space to identify groups
between the 𝑛 regions.
In the first step, a dissimilarity measure is computed for each pair of regional time
series. The objective is to obtain a dissimilarity matrix 𝐷 (with elements 𝑑𝑖, 𝑗 ) for
estimating synthetic measures of the differences between regions. There are different
alternatives to compare time series, some comprehensive overviews are in [29, 13].
A reasonable choice is the Fourier dissimilarity 𝑑_𝐹(x, y), which applies the
𝑛-point Discrete Fourier Transform [1] to the two time series, allowing the comparison
of two time sequences after converting them into a combination of
structural elements, such as trend and/or cycle.
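For illustration, a Fourier-type dissimilarity of this kind can be computed as sketched below; the number of retained coefficients and the absence of any normalization are assumptions rather than the exact settings used in this work.

```python
# Sketch of a Fourier-type dissimilarity between two equal-length series:
# Euclidean distance between the first p discrete Fourier coefficients.
import numpy as np

def fourier_dissimilarity(x, y, p=20):
    fx = np.fft.fft(np.asarray(x, dtype=float))[:p]
    fy = np.fft.fft(np.asarray(y, dtype=float))[:p]
    return np.sqrt(np.sum(np.abs(fx - fy) ** 2))

# Pairwise dissimilarity matrix D for a (T x n) array 'series' of n regions:
# D[i, j] = fourier_dissimilarity(series[:, i], series[:, j])
```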

In the second step, we implement a multidimensional scaling [31]. Due to its


flexibility, MDS has been introduced also in time series analysis [25] and recently
applied to different topics [30, 9, 16].
Since our aim is to take into account the degree of proximity between regions, we
also employ a weighted multidimensional scaling technique (wMDS) [17, 14]. The
L2 norm is multiplied by a set of weights 𝝎 = (𝜔1 , . . . , 𝜔 𝑛 ) such that high weights
have a stronger influence on the result than low weights.
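One possible implementation is sketched below, minimizing a stress in which each pair of regions is weighted by the product of the two regions' weights; this is a generic formulation and not necessarily the exact wMDS variant adopted here.

```python
# Sketch of a weighted MDS via minimization of a weighted stress criterion.
import numpy as np
from scipy.optimize import minimize

def weighted_mds(D, weights, dim=2, seed=0):
    n = D.shape[0]
    w = np.outer(weights, weights)          # pairwise weights

    def stress(flat):
        X = flat.reshape(n, dim)
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
        return np.sum(w * (D - dist) ** 2)

    x0 = np.random.default_rng(seed).normal(size=n * dim)
    res = minimize(stress, x0, method="L-BFGS-B")
    return res.x.reshape(n, dim)            # 2-D coordinates of the n regions
```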
The reduced space generated by MDS can be used as starting point for subsequent
analyses. Then, a cluster algorithm can be performed on the coordinates (of the
reduced space) of MDS [18]. Different procedures should be suitable to perform a
cluster analysis on the wMDS coordinates map. For an overview of modern clustering
techniques in time series, see e.g. [26].
In our case, both the geographical spread of the pandemic and population density
can determine remarkable differences in terms of hospitalization rates [12]. To
mitigate the risk of regional outliers in the data, generating potential spurious clusters,
we employ the trimmed 𝑘-means algorithm [8, 11]. A relevant topic in cluster analysis
is related to the choice of the 𝑘 number of groups. Our strategy is purely data-driven
and it is based on the minimization of the within-cluster variance.
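A minimal sketch of the trimmed k-means step on the two wMDS coordinates follows; the trimming proportion, the Lloyd-type iteration and the random initialization are illustrative choices, not the exact implementation used in this work.

```python
# Sketch of trimmed k-means: at each iteration the fraction alpha of points
# farthest from their nearest centroid is discarded before recomputing centroids.
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    keep = int(np.ceil((1 - alpha) * len(X)))
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        nearest = d.min(axis=1)
        retained = np.argsort(nearest)[:keep]      # drop the alpha farthest points
        for j in range(k):
            members = retained[labels[retained] == j]
            if members.size:
                centers[j] = X[members].mean(axis=0)
    labels = labels.copy()
    labels[np.argsort(nearest)[keep:]] = -1        # trimmed points flagged as -1
    return labels, centers
```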

3 Data and Descriptive Statistics

Daily regional time series reporting a) the number of patients treated in IC units
and b) the number of patients admitted in the other hospital wards are retrieved
through the official website of Italian Civil Protection1. All patients were positive
for the Covid-19 test (nasal and oropharyngeal swab). To take into account the
different sizes in terms of inhabitants, both a) and b) are normalized according to the
population of each territorial unit (estimated at 2020/01/01). The rates of patients
treated in IC units and hospitalized (HO) patients in other hospital wards, are then
multiplied by 100,000.
The whole dataset contains two identified waves2 of Covid-19, as follows:
Wave 1 (W1): 𝑇 = 109 days from February 24 to June 11, 2020
Wave 2 (W2): 𝑇 = 109 days from September 14 to December 31, 2020
The observed dates and trends may also depend on external factors, such as the implementation of
restrictive measures introduced by the Italian Government [27, 6], which influenced
the observed differences between W1 and W2. We have to remark that a full national
lockdown was held between March 9th and May 18th 2020.
Figure 1 shows the time series for HO and IC (rows), according to the two waves
of Covid-19 (columns). The anomaly of the small Italian region (Valle D’Aosta)
emerges both in the first (in particular concerning IC) and second waves (also for

1 Source: www.dati-covid.italia.it
2 Refer to [7] for further details.

[Figure: IC rates (top row) and HO rates (bottom row) over time for each region, for waves W1 (left column) and W2 (right column), with one line per region.]
Fig. 1 Time series distributions of Italian regions.

HO), while Lombardia, which is the largest and most populous region, dominates
other territories especially when considering HO in W1. The upper panels of Figure
1 help to understand differences between the two waves in terms of admissions to
intensive care: while regions with high, medium and low IC rates can be identified
directly by eyeballing the series during W1, in W2 more homogeneity
is observed. Furthermore, with the exception of Valle D'Aosta, the IC rate always
remains below 10 for all considered observations.
Concerning the HO rate (lower panels of Figure 1), Lombardia reaches values
greater than 100 in W1 (especially in April), while during W2 this threshold was
exceeded by Valle D'Aosta and Piemonte (both in November). Again, whereas W1 contrasts
regions with high and (moderately) low HO rates, in W2 the following situation
arises: a) Valle D'Aosta and Piemonte reach values over 100, b) four regions (Liguria,
Lazio, P.A. Trento and P.A. Bolzano) present values over 75, and c) the majority of
territories share similar trends with peaks always lower than 75.

4 Grouping Regions by Clustering and Discussion

In order to confirm and deepen the descriptive results of Section 3, we perform a


cluster analysis following the scheme proposed in Section 2. We compute wMDS
equipped with the Fourier distance3, using a set of weights 𝝎 proportional to the
number of neighbouring regions of each region, which introduces a spatial feature into the model.
Figure 2 displays the main results of wMDS, distinguishing between four levels
of critical issues experienced by the RHS. Outlying performances are coloured in
Violet. A first cluster (in Red) includes "critical" regions, while a group depicted in
Orange contains territories with high pressure on their RHS. Regions in the
Green cluster experienced moderate pressure on their RHS, while colour Blue indicates
territories suffering from low pressure. These clusters may also be interpreted
as a ranking of the health service risk.
As regards the HO during W1, leaving apart the two outliers (Lombardia and P.A.
Bolzano) the “red” cluster is composed by three Northern regions (Piemonte, Valle
d’Aosta and Emilia-Romagna). The group of high pressure is composed by Liguria,
Marche and P.A. Trento, while the green cluster involves Lazio, Abruzzo and Toscana
(from the centre of Italy) and Veneto. The last group includes nine regions, 7 of which
are located in the southern Italy. In W2 the clustering procedure Piemonte and Valle
d’Aosta are identified as outliers, while the high-pressure group is composed by two
autonomous provinces (Trendo and Bolzano), Lombardia and Liguria. The “orange”
group is constituted by regions located in the North-East (Friuli-Venezia Giulia,
Emilia-Romagna and Veneto), along with Abruzzo and Lazio. Southern regions
are allocated in the “green” coloured group (together with Umbria, Toscana and
Marche), while Molise, Calabria and Basilicata remain in the low-pressure cluster.
Regarding IC rates, during W1 Lombardia and Valle d’Aosta are considered
as outliers while the “red” cluster is composed by four northern Italian regions
(Emilia-Romagna, P.A. of Trento, Piemonte and Liguria), and Marche (located in
the centre). The “orange” cluster contains Toscana, Veneto and P.A. Bolzano, while
the moderate-pressure cluster involves three areas of centre Italy (Lazio and Umbria),
among with the Friuli-Venezia Giulia (from the north-east) and Abruzzo. The last
cluster includes only regions from the south. According to the bottom right panel of
Figure 2, apart from Valle D’Aosta, the procedure identifies Calabria as an outlier.
The “red” group acquires two observations from the Centre of Italy such as Toscana
and Umbria, while the majority of regions are classified in the moderately pressured
group. Only three Southern Italian areas are allocated in the last group (in green).
While the geography of the disease appears fundamental in W1, especially regarding
the territories adjoining Lombardia, in W2 this effect is less evident. Thus, regions
improving (e.g. Emilia-Romagna) or worsening (such as Lazio and Abruzzo) their
clustering “ranking” can be easily observed. As mentioned, the differences of re-
strictive measures imposed by the Government in the two periods may have a role
on these results.

3 We remark that other distance measures have been applied. Moreover, a) the Fourier one shows
better performance in terms of goodness of fit; b) the results are not sensitive with respect to the
choice of distance.

[Figure: four maps of Italy, panels HO W1 and HO W2 (top) and IC W1 and IC W2 (bottom), with regions shaded by cluster level (1-4) or marked as outliers.]
Fig. 2 Map of the identified regional clusters.



5 Concluding Remarks

The Covid-19 pandemic has put a strain on the Italian healthcare system. The reac-
tions of RHS play a relevant role to mitigate the health crisis at territorial level and
to guarantee an equitable access to healthcare.
This work helps to understand similarities and divergences between the Italian re-
gions in relation to the health pressure of the first two waves of the virus. Considering
crucial measures such as HO and IC rates, the comparison between two waves allows
to understand differences in the reactions of RHSs to pandemic shocks. Although
northern Italy represented the epicentre of the Covid-19 spread in the first wave,
some regions (e.g. Veneto and Friuli-Venezia Giulia) seem to have succeeded in
avoiding hospital overcrowding, while Southern regions (and Islands) clearly
suffered less pressure. Furthermore, in the second wave the differences appear
somewhat smoothed and the cluster sizes seem more homogeneous. Moreover, there
are some exceptions, such as Emilia-Romagna, which seems to have been less
affected by the second wave compared to the other regions. The detection of clusters
represents a starting point for the improvement of health governance and can be used
to monitor potential imbalances in future unfortunate waves.
Further analyses may employ other dedicated indicators coming, for instance,
from the Italian National Institute of Statistics4, or use different proposals for combining
wMDS with dissimilarity measures and clustering [28]. Following a different
methodological approach, the recent method proposed in [10] could be applied to
these data to include more complex spatial relationships between territories.

References

1. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In:
International Conference on Foundations of Data Organization and Algorithms, pp. 69-84.
Springer, Berlin (1993)
2. Ascani, A., Faggian, A., Montresor, S.: The geography of COVID-19 and the structure of local
economies: The case of Italy. Journal of Regional Science, 61(2), 407-441 (2021)
3. Beria, P., Lunkar, V.: Presence and mobility of the population during the first wave of Covid-19
outbreak and lockdown in Italy. Sustainable Cities and Society, 65, 102616 (2021)
4. Bontempi, E.; The Europe second wave of COVID-19 infection and the Italy “strange” situa-
tion. Environmental Research, 193, 110476 (2021)
5. Capolongo, S., Gola, M., Brambilla, A., Morganti, A., Mosca, E. I., Barach, P.: COVID-19
and Healthcare facilities: A decalogue of design strategies for resilient hospitals. Acta Bio
Medica: Atenei Parmensis, 91(9-S), 50 (2020)
6. Chirico, F., Sacco, A., Nucera, G., Magnavita, N.: Coronavirus disease 2019: the second wave
in Italy. Journal of Health Research (2021).
7. Cicchetti, A., Damiani, G., Specchia, M. L., Basile, M., Di Bidino, R., Di Brino, E., Tattoli,
A.: Analisi dei modelli organizzativi di risposta al Covid-19. ALTEMS (2020). https://altems.unicatt.it/altems-report47.pdf
8. Cuesta-Albertos, J. A., Gordaliza, A., Matrán, C.: Trimmed 𝑘-means: An attempt to robustify
quantizers. The Annals of Statistics, 25(2), 553-576 (1997).

4 see for example the BES indicators of the domain “Health” and “Quality of services”
https://www.istat.it/it/files//2021/03/BES_2020.pdf
9. Di Iorio, F., Triacca, U.: Distance between VARMA models and its application to spatial differ-
ences analysis in the relationship GDP-unemployment growth rate in Europe. In: International
Work-Conference on Time Series Analysis, pp. 203-215. Springer, Cham (2017)
10. D’Urso, P., De Giovanni, L., Disegna, M., Massari, R.: Fuzzy clustering with spatial-temporal
information. Spatial Statistics, 30, 71-102 (2019)
11. Garcia-Escudero, L. A., Gordaliza, A.: Robustness properties of 𝑘-means and trimmed
𝑘-means. Journal of the American Statistical Association, 94(447), 956–969 (1999) doi:
10.2307/2670010
12. Giuliani, D., Dickson, M. M., Espa, G., Santi, F.: Modelling and predicting the spatio-temporal
spread of COVID-19 in Italy. BMC infectious diseases, 20(1), 1-10 (2020)
13. Górecki, T., Piasecki, P.: A comprehensive comparison of distance measures for time series
classification. In: Steland, A., Rafajłowicz, E., Okhrin, O. (Eds.) Workshop on Stochastic
Models, Statistics and their Application, pp. 409-428. Springer, Nature (2019)
14. Greenacre, M.: Weighted metric multidimensional scaling. In: New developments in Classi-
fication and Data Analysis, pp. 141-149. Springer, Berlin, Heidelberg (2005)
15. Han, E., Tan, M. M. J., Turk, E., Sridhar, D., Leung, G. M., Shibuya, K., Legido-Quigley, H.:
Lessons learnt from easing COVID-19 restrictions: an analysis of countries and regions in
Asia Pacific and Europe. The Lancet, 396(10261), 1525–1534 (2020)
16. He, J., Shang, P., Xiong, H.: Multidimensional scaling analysis of financial time series based on
modified cross-sample entropy methods. Physica A: Statistical Mechanics and its Applications,
500, 210-221 (2018)
17. Kent, J. T., Bibby, J., Mardia, K. V.: Multivariate Analysis. Amsterdam: Academic Press
(1979)
18. Kruskal, J.: The relationship between multidimensional scaling and clustering. In: Classifica-
tion and Clustering, pp. 17-44. Academic Press (1977)
19. Kruskal, J. B.: Multidimensional Scaling (No. 11). Sage (1978)
20. Mardia, K. V.: Some properties of classical multi-dimensional scaling. Communications in
Statistics-Theory and Methods, 7(13), 1233-1241 (1978)
21. Marziano, V., Guzzetta, G., Rondinone, B. M., Boccuni, F., Riccardo, F., Bella, A., Merler,
S.: Retrospective analysis of the Italian exit strategy from COVID-19 lockdown. Proceedings
of the National Academy of Sciences, 118(4) (2021)
22. Mead, A.: Review of the development of multidimensional scaling methods. Journal of the
Royal Statistical Society: Series D (The Statistician), 41(1), 27-39 (1992)
23. Pecoraro, F., Luzi, D., Clemente, F.: Analysis of the different approaches adopted in the Italian
regions to care for patients affected by COVID-19. International Journal of Environmental
Research and Public Health, 18(3), 848 (2021)
24. Pecoraro, F., Clemente, F., Luzi, D.: The efficiency in the ordinary hospital bed management
in Italy: An in-depth analysis of intensive care unit in the areas affected by COVID-19 before
the outbreak. PLoS One, 15(9), e0239249 (2020)
25. Piccolo, D.: Una rappresentazione multidimensionale per modelli statistici dinamici. In: Atti
della XXXII Riunione Scientifica della SIS, 2, pp. 149-160 (1984)
26. Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Lin, C. T.: A review of
clustering techniques and developments. Neurocomputing, 267, 664-681 (2017)
27. Sebastiani, G., Massa, M., Riboli, E.: Covid-19 epidemic in Italy: evolution, projections and
impact of government measures. European Journal of Epidemiology, 35(4), 341-345 (2020)
28. Shang, D., Shang, P., Liu, L.: Multidimensional scaling method for complex time series feature
classification based on generalized complexity-invariant distance. Nonlinear Dynamics, 95(4),
2875-2892 (2019)
29. Studer, M., Ritschard, G.: What matters in differences between life trajectories: A comparative
review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 179(2), 481-511 (2016)
30. Tenreiro Machado, J. A., Lopes, A. M., Galhano, A. M.: Multidimensional scaling visualiza-
tion using parametric similarity indices. Entropy, 17(4), 1775-1794 (2015)
31. Torgerson, W. S.: Multidimensional scaling: I. Theory and method. Psychometrika, 17(4),
401-419 (1952)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Political and Religion Attitudes in Greece:
Behavioral Discourses

Georgia Panagiotidou and Theodore Chadjipadelis

Abstract The research presented in this paper attempts to explore the relationship be-
tween religious and political attitudes. More specifically we investigate how religious
behavior, in terms of belief intensity and practice frequency, is related to specific
patterns of political behavior such as ideology, the understanding of democracy and one's set
of moral values. The analysis is based on the use of multivariate methods, more
specifically Hierarchical Cluster Analysis and Multiple Correspondence Analysis applied in
two steps. The findings are based on a survey implemented in 2019 on a sample of
506 respondents in the wider area of Thessaloniki, Greece. The aim of the research is
to highlight the role of people’s religious practice intensity in shaping their political
views by displaying the profiles resulting from the analysis and linking individual
religious and political characteristics as measured with various variables. The final
output of the analysis is a map where all variable categories are visualized, bringing
forward models of political behavior as associated together with other factors such
as religion, moral values and democratic attitudes.

Keywords: political behavior, religion, democracy, multivariate methods, data anal-


ysis

1 Introduction

In this research we present the analysis results of a survey, which was implemented
in April 2019 to 506 respondents in Thessaloniki, focusing on their religious profile
as well as their political attitudes, their moral profile and the way they comprehend
democracy. The aim of the analysis is to investigate and highlight the role of religious
practice in shaping political behavior. In the political behavior analysis field, religion

Georgia Panagiotidou ( )
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
Theodore Chadjipadelis
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]

© The Author(s) 2023
P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_31

and more specifically church practice has emerged as one of the main pillars that form
the political attitudes of voters. Religious habits seem to have a decisive influence
on electoral choices, as derives from Lazarsfeld’s research at Columbia University
in 1944 [3], followed by the work of Butler and Stokes in 1969 [1] and the research
of Michelat and Simon in France [6]. More specifically in the comparative study
of Rose in 1974 [9], it turns out that the more religious voters appear to be more
conservative by choosing to place themselves on the right side of the ideological
“left-right” axis, while the non-religious voters opt for the left political parties.
The research and analysis of Michelat and Simon [6] brings to the surface two
opposing cultural models: on the one hand we have the deeply religious voters, who
belong to the middle and upper classes, residing in the cities or in the countryside,
while on the other hand we have the non-religious left voters with working class
characteristics. The first framework is articulated around religion and those who
belong to it identifying themselves as religious people, is inspired by a conservative
value system, put before the value of the individual, the family, the ancestral heritage
and tradition. The second cultural context is articulated around class rivalries and
socio-economic realities; those who belong to this context identify themselves as
“us workers towards others”. They believe in the values of collective action, vote
for left-wing parties, participate actively in unions and defend the interests of the
working class. To measure the influence of religious practice on political behavior,
applied research uses measurement scales about the intensity of religious beliefs and
the frequency of church service practice as an indicator of the level of one’s religious
integration.
To measure the level of religious intensity, variables are used such as how often one attends church services, how strongly one believes in the existence of God, in the afterlife, in the dogmas of the church, and so on. Since the 1990s there has been a rapid decline in the frequency with which the population attends church services or self-identifies strongly in terms of religiousness. Nevertheless, the correlation between electoral preference and religious practice remains strong [5]. The most significant change for non-religious
people is that the left is losing its universal influence as many of these voters expand
also to the center. Strongly religious people continue to support the right more and, in
some cases, strengthen the far right. In this paper, apart from attempting to explore
and verify the existing literature over the effect of religion on political behavior,
focusing on the Greek case, the approach exploits methods used to achieve the
visualization of all existing relationships between different sets of variables. To link
together numerous variables and their categories to construct a model of religious and
political behavior, multiple applications of Hierarchical Cluster Analysis (HCA) are made, followed by Multiple Correspondence Analysis (MCA) for the emerging
clusters. In this way, a semantic map is constructed [7], which visualizes discourses
of political and religious behavior and the inner antagonisms between the behavioral
profiles.

2 Methodology

For the implementation of the research a poll was conducted on a random sample
of 506 people in the greater area of Thessaloniki in Greece, during April 2019.
A questionnaire was used as the research tool, administered on-site to randomly approached respondents. The questionnaire consisted of three sections:
a) the first section included seven questions for demographic data of the respondent
such as gender, age, educational level, marital status, household income, occupation
and social class to which the respondent considers belonging; b) the second part
contained seven questions, ordinal variables, related to the religious practice and
beliefs of the respondent: i) how often does one go to church? ii) how often does one
pray? iii) how close does one feel to God, Virgin Mary (or to another seven religious
concepts) during church service? iv) how strongly does one have seven different
feelings during church service? v) does one believe or not in the saints, miracles,
prophecies (and another six religious concepts)? Two more questions investigating their profile in terms of what is taught in the Christian dogma were included: vi) one asking whether one can progress only by being an ethical person and vii) another asking whether they agree with the pain/righteousness scheme, that is, whether one who suffers in this life will be rewarded later or in the afterlife; c) questions concerning the political
profile of the respondent are developed in the third part of the questionnaire: i)
one’s self-positioning on the ideological left-right axis, ii) a set of nine ordinal
variables requiring one’s agreement or disagreement level on sentences that reflect
the dimensions of liberalism-authoritarianism and left-right iii) this last section
also includes two different sets of pictures, used as symbolic representation for the
“democratic self” and the “moral self” [4]. The first set of twelve pictures represents various conceptualizations of democracy, and one is asked to select three pictures that represent democracy. The second set of pictures represents moral values in life, and one is asked to choose three pictures that represent one's set of personal values. Variables are ordinal, using a five-point Likert scale, apart from the question regarding whether or not one believes in prophecies, magic, etc., and the last two questions with the pictures, where a binary yes-no (one-zero) scale is used, with one for a selected picture and zero for a non-selected picture.
Data analysis was implemented with the use of M.A.D software (Méthodes
d’Analyse des Données), developed by Professor Dimitris Karapistolis (more about
M.A.D software at www.pylimad.gr). Firstly, Hierarchical Cluster Analysis (HCA)
using chi-square distance and Ward's linkage, assigns subjects to distinct groups based on their response patterns. This first step produces a cluster membership variable, assigning each subject to a group. In addition, the behavior typology of each group is examined by assessing the connection of each variable level to each cluster using a two-proportion z-test (significance level set at 0.05) between respondents belonging to cluster 𝑖 and those who do not belong to cluster 𝑖, for each variable level. The
number of clusters is determined by using the empirical criterion of the change in the
ratio of between-cluster inertia to total inertia, when moving from a partition with 𝑟
clusters to a partition with 𝑟 − 1 clusters [8]. In the second step of the analysis, the
cluster membership variable is analyzed together with the existing variables using

MCA on the Burt table [2]. All associations among the variable categories are given
on a set of orthogonal axes, with the least possible loss of the information contained in the original Burt table. Next, we apply HCA to the coordinates of the variable cat-
egories on the total number of dimensions of the reduced space resulting from the
MCA. In this way we cluster the variable categories, as we previously clustered the subjects. By clustering the variable response categories, we detect the various discourses of behavior, where each cluster of categories stands as a behavioral profile linked with a set of responses and characteristics. To produce the final output, the semantic map, we create a table containing the output variables of the questionnaire, together with the demographic and political behavior variables. Applying the same two-step HCA and MCA procedure to this final table, the semantic map is constructed, positioning the variable categories on a biplot created by the first two dimensions of the MCA.
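
As an illustration only, the two-step procedure can be sketched in R with the open-source FactoMineR and stats packages instead of M.A.D; the data frame survey of factor-coded responses, the chosen numbers of clusters and the use of Euclidean distances on the full set of MCA coordinates (as a surrogate for the chi-square metric) are all assumptions of this sketch, not the authors' actual implementation.

# Illustrative sketch (not M.A.D): two-step HCA and MCA on categorical survey data.
library(FactoMineR)   # MCA()

# Step 1: cluster the respondents and store the cluster membership variable.
mca_subj <- MCA(survey, graph = FALSE)
hc_subj  <- hclust(dist(mca_subj$ind$coord), method = "ward.D2")   # Ward's linkage
survey$cluster <- factor(cutree(hc_subj, k = 4))   # k chosen via the inertia-ratio criterion

# Step 2: MCA on all variables (including the membership variable), then HCA on the
# coordinates of the variable categories over all retained dimensions.
mca_all <- MCA(survey, graph = FALSE)
hc_cat  <- hclust(dist(mca_all$var$coord), method = "ward.D2")
category_clusters <- cutree(hc_cat, k = 3)

# The first two MCA dimensions provide the coordinates used to draw the semantic map.
map_coords <- mca_all$var$coord[, 1:2]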

3 Results

In the first step of the analysis, we apply HCA for each set of variables in each
question. In the question: “How close do you feel during the service 1-To God, 2-To
the Virgin, 3-To Christ, 4-To some Saint, Angel, 5-To the other churchgoers, 6-To
Paradise, 7-To Hell, 8-To the divine service, 9-To his preaching priest”, we get four
clusters (Figure 1).

Fig. 1 Four clusters on how close the respondents feel during church service.

For the question: “How strongly you feel after the end of the service 1-The Grace
of God in me, 2-Power of the soul, 3-Forgiveness for those who have hurt me, 4-
Forgiveness for my sins, 5-Peace, 6-Relief it is over”, we get six clusters (Figure
2).

Fig. 2 Six clusters on how the respondents feel at the end of church service.

Five clusters (Figure 3) for the question: “Do you believe in 1-Bad (magic influ-
ence) 2-Magic? 3- Destiny? 4-Miracles? 5-Prophecies of the Saints? 6- Do you have
pictures of holy figures in your house? 7-in your workplace? 8-Do you have a family
Saint?”.

Fig. 3 Five clusters on the beliefs of the respondents on various aspects of the Christian faith.

Six clusters are detected (Figure 4) for the question: “How do you feel when you
come face to face with a religious image 1-Peace, 2-Awe, 3-The presence of God,
4-Emotion, 5-The need to pray, 6-Contact with the person in the picture”.

Fig. 4 Six clusters on how the respondents feel when facing a religious image.

We proceed with the clustering of the replies on political views and we get seven
clusters of political profiles (Figure 5).

Fig. 5 Seven clusters according to the political views- profile of the respondents.

For the symbolic representation of the democratic self, when choosing three
pictures that represent democracy for the respondent, we find eight clusters (Fig-
ure 6), and eight clusters for the symbolic representation of the moral self for the
respondents, as show in Figure 7.

Fig. 6 Eight clusters on how the respondents understand democracy.

Fig. 7 Eight clusters on the different sets of moral values of the respondents.

In the second step of the analysis, we jointly process the cluster membership
variables. MCA produces the coordinates of each variable category, which are now positioned in a two-dimensional map as seen in Figure 9. HCA is then applied again to the coordinates of the categories, which brings forward three main clusters modeling political and religious behavior. In Figure 8, cluster 77 is connected to the center and moderate religious behavior, cluster 78 reflects the voters of the right, with strong
religious habits and beliefs, individualistic attitudes and more authoritarian and
nationalistic political views, whereas cluster 79 represents the leftists, non-religious
voters, closer to revolutionary political views and collective goods. Examining the
antagonisms on the behavioral map (Figure 9), the first (horizontal) axis, which explains 22.8% of the total inertia, is created by the antithesis between right political ideology with strong religious behavior and left political ideology with no religious behavior (cluster 78 opposite to cluster 79). The second (vertical) axis accounts for 7% of the inertia and is explained as the opposition of the center (moderate religious behavior) against both the left and the right (cluster 77 opposite to clusters 78 and 79).

Fig. 8 Three main behavioral discourses linking all variable categories together.

Fig. 9 The semantic map visualizing the behavioral profiles of voters, and the inner antagonisms.

4 Discussion

The analysis uncovers the strong existing relationship between religious habits and
political views in the Greek case. The semantic map indicates the main antagonistic cultural discourses, each combining religious, political and moral characteristics. The first discourse (cluster 77) is described by moderately religious practice and beliefs,
connected to the ideological center. These voters have political attitudes that belong
to the space between the center-left and the center-right. They associate democracy with money, direct democracy and electronic democracy. Their moral
set of values is naturalistic and individualistic. The next behavioral discourse (cluster
78) describes the voters of right ideology, with strong religious beliefs and frequent
religious practice. They appear as very ethical and believe in the concept of pain
and righteousness. Regarding their political attitudes, these more religious voters are against violence and hold more authoritarian and nationalistic positions. They view democracy as parliamentary and representative, associating it with ancient Greece but also with the church, while their moral set of values appears clearly naturalistic, Christian and nationalistic.
Cluster 79 reflects the exact opposite discourse compared to 78. These voters
belong to the left ideology and are non-religious. They do not adopt the ideas of
the ethical person, or the scheme of pain and righteousness as mentioned in the
Christian dogma. In terms of political attitudes, they are pro-welfare state. These
non-religious and left voters understand democracy as direct, with the need for revolution, protest and riot, and they support collective goods. Interpreting further the antagonisms as visualized on the semantic map, the main competition exists between the “right political ideology - strong religious behavior - individualism” discourse and the “left political ideology - no religious behavior - collectivism” discourse. A secondary opposition is found between the “center ideology - moderate religious behavior” discourse and the left and right extreme positions.

References

1. Butler, D., Stokes, D.: Political Change in Britain. Macmillan, London (1969)
2. Greenacre, M.: Correspondence Analysis in Practice. Chapman and Hall/CRC Press, Boca
Raton (2007)
3. Lazarsfeld, P. F., Berelson, B., Gaudet, H.: The People’s Choice. Columbia University Press
(1944)
4. Marangudakis, M., Chadjipadelis, T.: The Greek Crisis and its Cultural Origins. Palgrave-
Macmillan, New York (2019)
5. Mayer, N.: Les Modèles Explicatifs du Vote. L’Harmatan, Paris (1997)
6. Michelat, G., Simon, M.: Classe, Religion et Comportement Politique. PFNSP-Editions So-
ciales, Paris (1977)
7. Panagiotidou, G., Chadjipadelis, T.: First-time voters in Greece: views and attitudes of youth on
Europe and democracy. In T. Chadjipadelis, B. Lausen, A. Markos, T. R. Lee, A. Montanari and
R. Nugent (Eds.), Data Analysis and Rationality in a Complex World, Studies in Classification,
Data Analysis and Knowledge Organization, pp. 415-429. Springer (2020)
8. Papadimitriou, G., Florou, G.: Contribution of the Euclidean and chi-square metrics to de-
termining the most ideal clustering in ascending hierarchy (in Greek). In Annals in Honor of
Professor I. Liakis, 546-581. University of Macedonia, Thessaloniki (1996)
9. Rose, R.: Electoral Behavior: a Comparative Handbook. Free Press, New York (1974)

Supervised Classification via Neural Networks
for Replicated Point Patterns

Kateřina Pawlasová, Iva Karafiátová, and Jiří Dvořák

Abstract A spatial point pattern is a collection of points observed in a bounded
region of R𝑑 , 𝑑 ≥ 2. Individual points represent, e.g., observed locations of cell
nuclei in a tissue (𝑑 = 2) or centers of undesirable air bubbles in industrial materials
(𝑑 = 3). The main goal of this paper is to show the possibility of solving the su-
pervised classification task for point patterns via neural networks with general input
space. To predict the class membership for a newly observed pattern, we compute
an empirical estimate of a selected functional characteristic (e.g., the pair correla-
tion function). Then, we consider this estimated function to be a functional variable
that enters the input layer of the network. A short simulation example illustrates
the performance of the proposed classifier in the situation where the observed pat-
terns are generated from two models with different spatial interactions. In addition,
the proposed classifier is compared with convolutional neural networks (with point
patterns represented by binary images) and kernel regression. Kernel regression
classifiers for point patterns have been studied in our previous work, and we consider
them a benchmark in this setting.

Keywords: spatial point patterns, pair correlation function, supervised classification, neural networks, functional data

Kateřina Pawlasová ( )
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]
Iva Karafiátová
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]
Jiří Dvořák
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]


1 Introduction

Spatial point processes have recently received increasing attention in a broad range
of scientific disciplines, including biology, statistical physics, or material science
[9]. They are used to model the locations of objects or events randomly occurring
in R𝑑 , 𝑑 ≥ 2. We distinguish between the stochastic model (point process) and its
realization observed in a bounded observation window (point pattern).
Typically, analyzing spatial point pattern data means working with just one pat-
tern, which comes from a specific physical measurement. In this paper, we take
another perspective: we suppose that a collection of patterns, which are independent
realizations of some underlying stochastic models, is to be analyzed simultaneously.
These independent realizations are then referred to as replicated point patterns.
Recently, this type of data has become more frequent, encouraging the adaptation
of methods such as supervised classification to the point pattern setting.
Since we are talking about supervised classification, our task is to predict the la-
bel variable (indicating class membership) for a newly observed point pattern, using
the knowledge about a sample collection of patterns with known labels (training
data). In the literature, this problem has been studied to a limited extent. Properties
of a classifier constructed specifically for the situation where the observed patterns
were generated by inhomogeneous Poisson point processes with different intensity
functions are discussed in [5]. However, this method is based on the special proper-
ties of the Poisson point process, and its use is thus limited to a small class of models.
On the other hand, no assumptions about the underlying stochastic models are made
in [12], where the task for replicated point patterns is transformed, with the help
of multidimensional scaling [16], to the classification task in R2 . In [10, 11], the ker-
nel regression classifier for functional data [4] is adapted for replicated point patterns.
Instead of classifying the patterns themselves, a selected functional characteristic
(e.g. the pair correlation function) is estimated for each pattern. These estimated
values are considered functional observations, and the classification is performed
in the context of functional data. The idea of linking point patterns to functional
data also appears in [12] – the dissimilarity matrix needed for the multidimensional
scaling is based on the same type of dissimilarity measure that is used for the ker-
nel regression classifier in [10, 11]. Finally, [17] briefly discusses the model-based
supervised classification. Unsupervised classification is explored in [2].
In this paper, our goal is to discuss the use of classifiers based on artificial neu-
ral networks in the context of replicated point patterns. We pay special attention
to the procedure described in [14], where both functional and scalar observations
enter the input layer. Hence, similarly as in [10, 11], each pattern can be represented
by estimated values of a selected functional characteristic and the classification is per-
formed in the context of functional data. The resulting decision about class member-
ship is based on the spatial properties of the observed patterns that can be described
by the selected characteristic. Therefore, with a thoughtfully chosen characteristic,
this method has great potential within a wide range of possible classification scenar-
ios. Moreover, it can be used without assuming stationarity of the underlying point

processes, and it can be easily extended to more complicated settings (e.g., point
patterns in non-Euclidean spaces or realizations of random sets).
We present a short simulation experiment that illustrates the behaviour of the neu-
ral network described in [14]. Binary classification is performed on realizations
of two different point process models – the Thomas process (model for attractive
interactions among pairs of points) and the Poisson point process (benchmark model
for no interactions among points). This approach is then compared to the classifica-
tion based on convolutional neural networks (CNNs) [8], where each pattern enters
the network as a binary image. Finally, both methods based on artificial neural net-
works are compared to the kernel regression classifier studied in [10, 11] which can
be considered a benchmark in the context of replicated point patterns.
This paper is organized as follows. Section 2 provides a brief theoretical back-
ground on spatial point processes and their functional characteristics, including
the definition of the pair correlation function, which plays a crucial role in the se-
quel. Section 3 summarizes the methodology introduced in [14] about neural network
models with general input space. Section 4 is devoted to a short simulation example.

2 Point Processes and Point Patterns

This section presents the necessary definitions from the point process theory. Our ex-
position closely follows the book [13]. For detailed explanation of the theoretical
foundations, see, e.g., [7]. Throughout the paper, a simple point process 𝑋 is defined
as a random locally finite subset of R𝑑 , 𝑑 ≥ 2, where each point 𝑥 ∈ 𝑋 corresponds
to a specific object or event occurring at the location 𝑥 ∈ R𝑑 . In applications, 𝑋 can
be used as a mathematical tool to model random locations of cell nuclei in a tissue
(with 𝑑 = 2) or centers of undesirable air bubbles in industrial materials (𝑑 = 3).
We distinguish between the mathematical model 𝑋, which is called a point process,
and its observed realization X, which is called a point pattern. Examples of four
different point patterns are given in Figure 1.
Before proper definition of the pair correlation function, a functional characteristic
that plays a key role in the sequel, we need to define some moment properties of 𝑋.
The intensity function 𝜆(·) is a non-negative measurable function on R𝑑 such that
𝜆(𝑥) d𝑥 corresponds to the probability of observing a point of 𝑋 in a neighborhood
of 𝑥 with an infinitesimally small area d𝑥. If 𝑋 is stationary (its distribution is
translation invariant in R𝑑 ), then 𝜆(·) = 𝜆 is a constant function and the constant 𝜆 is
called the intensity of 𝑋. In this case, 𝜆 is interpreted as the expected number of points
of 𝑋 that occur in a set with unit 𝑑-dimensional volume. Similarly, the second-order
product density 𝜆 (2) (· , ·) is a non-negative measurable function on R𝑑 × R𝑑 such
that 𝜆 (2) (𝑥, 𝑦) d𝑥 d𝑦 corresponds to the probability of observing two points of 𝑋
that occur jointly at the neighborhoods of 𝑥 and 𝑦 with infinitesimally small areas
d𝑥 and d𝑦.
Assuming the existence of 𝜆 and 𝜆 (2) , the pair correlation function 𝑔(𝑥, 𝑦) is de-
fined as 𝜆 (2) (𝑥, 𝑦)/(𝜆(𝑥)𝜆(𝑦)), for 𝜆(𝑥)𝜆(𝑦) > 0. If 𝜆(𝑥) = 0 or 𝜆(𝑦) = 0, we set

𝑔(𝑥, 𝑦) = 0. We write 𝑔(𝑥, 𝑦) = 𝑔(𝑥 − 𝑦) when 𝑔 is translation invariant and
𝑔(𝑥, 𝑦) = 𝑔(‖𝑥 − 𝑦‖) when 𝑔 is also isotropic (invariant under rotations around
the origin). For the Poisson point process, a model for complete spatial randomness,
𝜆 (2) (𝑥, 𝑦) = 𝜆(𝑥)𝜆(𝑦) and 𝑔 ≡ 1. Thus, 𝑔(𝑥, 𝑦) quantifies how likely it is to observe
two points in 𝑋 jointly occurring in infinitesimally small neighbourhoods of 𝑥 and 𝑦,
relative to the "no interactions" benchmark.
A large variety of characteristics (both functional and numerical) have been de-
veloped to capture various hypotheses about the stochastic models that generated
the observed point patterns at hand. We have focused on the pair correlation function
𝑔 mainly because of its widespread use in practical applications and ease of interpre-
tation. Other popular characteristics are based on 𝑔, e.g., its cumulative counterpart,
traditionally called the 𝐾-function. Others are based on inter-point distances, such as
the nearest-neighbor distance distribution function 𝐺 and the spherical contact dis-
tribution function 𝐹. A comprehensive summary of commonly used characteristics,
including the list of possible empirical estimators, is presented in [9, 13]. Estimators
of 𝑔, 𝐾, 𝐺, and 𝐹 are implemented in the R package spatstat [3].
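
For readers less familiar with these tools, a minimal sketch of how such empirical estimates are obtained with spatstat is given below; the intensity, window and random seed are arbitrary choices of this illustration.

# Illustrative sketch: empirical summary characteristics of a single point pattern.
library(spatstat)
set.seed(1)
X <- rpoispp(lambda = 400)     # Poisson pattern with intensity 400 on the unit square (default window)
g_hat <- pcf(X, divisor = "d") # pair correlation function
K_hat <- Kest(X)               # K-function (cumulative counterpart of g)
G_hat <- Gest(X)               # nearest-neighbour distance distribution function
F_hat <- Fest(X)               # spherical contact distribution function
# g_hat is an 'fv' object: g_hat$r is the distance grid and the remaining columns hold
# edge-corrected estimates that can be treated as one functional observation.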

3 Neural Networks with General Input Space

This section prepares the theoretical background for the supervised classification
of replicated point patterns via artificial neural networks. The recent approach
of [14, 15] is the cornerstone of our proposed classifier, and hence we focus on
its description in the following paragraphs. On the other hand, the approach based
on CNNs is more established in the literature. We use it primarily for comparison
and thus we refer the reader to [8] for a detailed description.
Following the setup in [14], let us assume that we want to build a neural network
such that it takes 𝐾 ∈ N functional variables and 𝐽 ∈ N scalar variables as input.
In detail, suppose that we have 𝑓 𝑘 : 𝜏𝑘 −→ R, 𝑘 = 1, 2, . . . , 𝐾 (𝜏𝑘 are possibly
different intervals in R), and $z_j^{(1)} \in \mathbb{R}$, $j = 1, 2, \ldots, J$. Furthermore, suppose that
the first layer of the network contains 𝑛1 ∈ N neurons. We then want the 𝑖-th neuron
of the first layer to transfer the value

$$z_i^{(2)} = g\left( \sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t)\, f_k(t)\, \mathrm{d}t \;+\; \sum_{j=1}^{J} w_{ij}^{(1)} z_j^{(1)} + b_i^{(1)} \right), \qquad i = 1, 2, \ldots, n_1,$$

where $b_i^{(1)} \in \mathbb{R}$ is the bias and $g: \mathbb{R} \rightarrow \mathbb{R}$ is the activation function. Two
types of weights appear in the formula: the functional weights $\{\beta_{ik}: \tau_k \rightarrow \mathbb{R}\}$,
and the scalar weights $\{w_{ij}^{(1)}, b_i^{(1)}\}$. The optimal value of all these weights should
be found during the training of the network. To overcome the difficulty of find-
ing the optimal weight functions 𝛽𝑖𝑘 , we can express 𝛽𝑖𝑘 as a linear combina-
tion of 𝜙1 , . . . , 𝜙 𝑚𝑘 , where 𝜙1 , . . . , 𝜙 𝑚𝑘 are the basis functions
(from the Fourier or $B$-spline basis) and $m_k$ is chosen by the user. The sum
$\sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t) f_k(t)\, \mathrm{d}t$ can be expressed as
$\sum_{k=1}^{K} \sum_{l=1}^{m_k} c_{ilk} \int_{\tau_k} \phi_l(t) f_k(t)\, \mathrm{d}t$, where the integrals
$\int_{\tau_k} \phi_l(t) f_k(t)\, \mathrm{d}t$ can be calculated a priori and the coefficients of the linear
combination of the basis functions $\{c_{ilk}\}$ act as scalar weights of the first layer and are
learned by the network.

Fig. 1 Theoretical values of the pair correlation function 𝑔 for the Poisson point process
and the Thomas process with different values of the model parameter 𝜎. For these models, 𝑔 is
translation invariant and isotropic. A single realization of the Poisson point process and the Thomas
process with parameter 𝜎 set to 0.1, 0.05 and 0.02, respectively, is illustrated in the right part
of the figure.

The scalar values $z_i^{(2)}$, $i = 1, \ldots, n_1$, then propagate through the next fully connected
layers as usual. An in-depth analysis of the computational point of view is provided
in [14]. In the software R, neural networks with general input space are covered by
the package FuncNN [15] built over the packages keras [6] and tensorflow [1].
The last two packages are used to handle CNNs.
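
To make the role of the precomputed integrals concrete, the short base-R sketch below evaluates Fourier-basis integrals for one functional input observed on a grid; the grid, the placeholder function and the number of basis functions are assumptions of this illustration, and FuncNN performs the analogous computation internally.

# Illustrative sketch: precomputing basis integrals for one functional input on tau = (0, 0.25).
tau <- c(0, 0.25)
tt  <- seq(tau[1], tau[2], length.out = 200)
f   <- rep(1, length(tt))                      # placeholder for an estimated functional characteristic
period <- diff(tau)
phi <- cbind(1,                                # first few Fourier basis functions on tau
             sin(2 * pi * (tt - tau[1]) / period), cos(2 * pi * (tt - tau[1]) / period),
             sin(4 * pi * (tt - tau[1]) / period), cos(4 * pi * (tt - tau[1]) / period))
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # trapezoidal rule
integrals <- apply(phi, 2, function(p) trapz(tt, p * f))
# These scalars are what the learned coefficients c_{ilk} of the first layer multiply.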

4 Simulation Example

This section presents a simple simulation experiment in which we illustrate the per-
formance of the classification rule based on the neural network with general input
space. Binary classification is considered, where the group membership indicates
whether a point pattern was generated by a stationary Poisson point process or a sta-
tionary Thomas process, the latter exhibiting attractive interactions among pairs
of points [13]. The sample realizations can be seen in Figure 1.
We consider the Thomas process to be a model with one parameter 𝜎. Small values
of 𝜎 indicate strong, attractive short-range interactions between points, while larger
values of 𝜎 result in looser clusters of points. Attractive interactions between the
points of a Thomas process result in the values of the pair correlation function being
greater than the constant 1, which corresponds to the Poisson case. The effect of
𝜎 on the shape of the theoretical pair correlation function of the Thomas process
(which is translation invariant and isotropic) is illustrated in Figure 1.

Since the model parameter 𝜎 affects the strength and range of attractive interac-
tions between points of the Thomas process, the complexity of the binary classifica-
tion task described above increases with increasing values of 𝜎 [10, 11]. Therefore,
this experiment focuses on the situation where 𝜎 is set to 0.1, and all realizations
are observed on the unit square [0, 1] 2 . We fix the intensity of the two models to 400
(in spatial statistics, patterns with several hundreds of points are standard nowadays).
In this framework, we expect the classification task to be challenging enough to ob-
serve differences in the performance of the considered classifiers. On the other hand,
it is still reasonable to distinguish (w.r.t. the chosen observation window) the realiza-
tions of the model with attractive interactions from the realizations corresponding
to the complete spatial randomness.
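
A hedged sketch of how such training patterns can be simulated with spatstat is given below; the parent intensity and mean number of offspring are one possible combination giving overall intensity 400, not necessarily the exact values used in the experiment.

# Illustrative sketch: generating labelled replicated point patterns for the two classes.
library(spatstat)
set.seed(42)
n_per_class <- 100
# Thomas process with scale sigma = 0.1; kappa * mu = 400 gives the required intensity.
thomas_patterns  <- replicate(n_per_class, rThomas(kappa = 100, scale = 0.1, mu = 4),
                              simplify = FALSE)
# Stationary Poisson process with intensity 400 (complete spatial randomness).
poisson_patterns <- replicate(n_per_class, rpoispp(lambda = 400), simplify = FALSE)
patterns <- c(thomas_patterns, poisson_patterns)
labels   <- rep(c("Thomas", "Poisson"), each = n_per_class)
# Functional inputs for the classifiers: estimated pair correlation functions.
pcf_estimates <- lapply(patterns, pcf, divisor = "d")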
Two different collections of labelled point patterns are considered as training sets.
The first, referred to as Training data 1, is composed of 1 000 patterns per group.
The second, called Training data 2, is then composed of 100 patterns per group.
The test and validation sets have the same size and composition as the Training
data 2. Table 1 presents the accuracy of three classification rules (described below)
with respect to the test set. For the first two rules, the accuracy is in fact averaged
over five runs corresponding to different settings of initial weights in the underlying
neural network. Concerning the network architecture, we fix the ReLU function to be
the activation function for all layers, except the output one. The output layer consists
of one neuron with sigmoid activation function. The loss function is the binary
cross-entropy. A detailed description of the individual layers is given below.
Rule 1 is based on the neural network with general input space. We set 𝐾 and
𝐽 from Sect. 3 to be 1 and 0, respectively, and 𝜏1 = (0, 0.25). The value 0.25 is
related to the observation window of the point patterns at hand being [0, 1] 2 . Then,
𝑓1 is the vector of the estimated values of the pair correlation function 𝑔 (estimated
by the function pcf.ppp from the package spatstat [3] with default settings but
the option divisor set to d), considered as a functional observation. Furthermore,
we set 𝑚 1 = 29, and consider the Fourier basis. The data preparation (estimation of 𝑔,
computation of integrals from Sect. 3) takes 740 s of elapsed time (w.r.t. the Training
data 1, on a standard personal computer). To tune the hyperparameters of the final
neural network (number of hidden layers, number of neurons per hidden layers,
dropout, etc.), we performed a rough grid search (models with various combinations
of the hyperparameters were trained on Training data 1 and we used the loss function
and the accuracy computed on the validation set to compare the performances).
The resulting network consists of one hidden layer with 128 neurons followed by
a dropout layer with a rate of 0.3. We use the Adam optimizer, and the learning rate is
decaying exponentially, with initial value 0.001 and decay parameter 0.05. In total,
the network has 3 969 trainable parameters. To train the network, we perform 50
epochs with an average elapsed time of 200 ms per epoch (w.r.t. Training data 1).
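Since the basis integrals are computed a priori, the functional first layer acts on them as ordinary dense weights; a schematic keras version of this reduced architecture is sketched below, where the matrix x_train of precomputed integrals and the 0/1 label vector y_train are assumed inputs, and the optimizer settings are simplified with respect to the decaying learning rate described above.

# Illustrative sketch of the Rule 1 architecture acting on precomputed basis integrals.
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 29) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = "accuracy")
history <- model %>% fit(x_train, y_train, epochs = 50, validation_split = 0.2, verbose = 0)
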
Rule 2 uses CNNs. Similarly to the previous case, our decision about the network
architecture is based on a rough grid search. The final network has two convolutional
layers, each of them with 8 filters, a squared kernel matrix with 36 (first layer) or 16
rows (second layer), and a following average pooling layer with the pool size fixed
at 2 × 2. We add a dropout layer after the pooling, with a rate of 0.3 (after the first
pooling) and 0.2 (after the second pooling). The batch size is set to 32. We use
the Adam optimizer, and the learning rate is decaying exponentially, with initial
value 0.001 and decay parameter 0.1. The total number of trainable parameters is
equal to 32 785 and we perform 50 epochs with the average elapsed time per epoch
(w.r.t. Training data 1) equal to 930 s. Data preparation (converting point patterns
to binary images) takes less than 10 s of the elapsed time (w.r.t. Training data 1).

Table 1 Accuracy for the three presented classification rules w.r.t. the testing set. For Rule 1
and Rule 2, the accuracy is averaged over five runs corresponding to five different choices of initial
weights in the underlying neural networks. In addition, the standard deviation computed from
the five accuracy values is reported. Values close to 1 indicate a nearly perfect classification.

                 Rule 1         Rule 2         Rule 3
Training data 1  0.947 ±0.003   0.934 ±0.032   0.935
Training data 2  0.895 ±0.010   0.512 ±0.028   0.925
Rule 3 is the kernel regression classifier studied in [10, 11]. We use the Epanech-
nikov kernel together with an automatic procedure for the selection of the smoothing
parameter. The underlying dissimilarity measure for point patterns is constructed
as the integrated squared difference of the corresponding estimates of the pair cor-
relation function 𝑔; for more details, see [10]. The elapsed time needed to compute
the upper triangle of the dissimilarity matrix (containing dissimilarities between
every pair of patterns from Training data 1) is equal to 390 s. To predict the class
membership for the testing set (w.r.t. Training data 1), 206 s elapsed. During the clas-
sification procedure, no random initialization of any weights is needed. Thus, there
is no reason to average the accuracy in Table 1 over multiple runs.
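
For completeness, the dissimilarity measure underlying Rule 3 can be sketched as the integrated squared difference between two estimated pair correlation functions, approximated by the trapezoidal rule on their common distance grid; this is our own illustration of the measure described in [10], and the name of the edge-corrected estimate column is an assumption.

# Illustrative sketch: integrated squared difference between two estimated pcf curves.
# g1 and g2 are 'fv' objects returned by pcf() on the same distance grid.
pcf_dissimilarity <- function(g1, g2, column = "iso") {
  r  <- g1$r
  d2 <- (g1[[column]] - g2[[column]])^2
  ok <- is.finite(d2)                 # drop NA/Inf values that can occur near r = 0
  sum(diff(r[ok]) * (head(d2[ok], -1) + tail(d2[ok], -1)) / 2)
}
# Pairwise dissimilarities over a list of estimates, e.g. pcf_estimates from the sketch above:
# D <- sapply(pcf_estimates, function(a) sapply(pcf_estimates, pcf_dissimilarity, g2 = a))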
For Training data 1, Table 1 shows that the highest accuracy was achieved for the
neural network with general input space. The standard deviation of the five different
accuracy values is significantly higher for CNN which has almost ten times more
trainable parameters than the network with general input space. For Training data
2, the kernel regression method achieved the highest accuracy. In this situation,
the performance of the classifier is stable even in the case of small training data.
For the first two rules, the neural network models chosen with the help of the grid
search (where the networks were trained w.r.t. the bigger training set) are now
trained w.r.t. the smaller training set. The resulting accuracy is still around 0.90
for the network with general input space, but it drops to 0.5 (random assignment
of labels) for CNN. The size of Training data 2 seems to be too small to successfully
optimize the large amount of trainable parameters of the convolutional network.
To conclude, our simulation example suggests that the classifier based on CNN
(using information about the precise configuration of points) is in the presented sit-
uation outperformed by the classifiers based on the estimated values of the pair cor-
relation function (using information about the interactions between pairs of points).
The high number of trainable parameters of the CNN makes its use rather demanding
with respect to computational time. The approach based on neural networks with

general input space proved to be competitive with or even outperform the current
benchmark method (kernel regression classifier), especially for large datasets. Also,
it has the lowest demands regarding computational time. In the case of a small
dataset, the low number of hyperparameters speaks in favor of kernel regression.
Finally, in the simple classification scenario that we have presented, the choice
of the pair correlation function was adequate. In practical applications, a problem-
specific characteristic should be constructed to achieve satisfactory performance.

Acknowledgements The work of Kateřina Pawlasová and Iva Karafiátová has been supported from
the Grant schemes at Charles University, project no. CZ.02.2.69/0.0/0.0/19 073/0016935. The work
of Jiří Dvořák has been supported by the Czech Grant Agency, project no. 19-04412S.

References

1. Allaire, J. J., Eddelbuettel, D., Golding, N., Tang, Y.: tensorflow: R Interface to TensorFlow
(2016) Available at GitHub. https://github.com/rstudio/tensorflow. Cited 10 Jan 2022
2. Ayala, G., Epifanio, I., Simo, A., Zapater, V.: Clustering of spatial point patterns. Comput.
Stat. Data. Anal. 50, 1016–1032 (2006)
3. Baddeley, A., Rubak, E., Turner, R.: Spatial Point Patterns: Methodology and Applications
with R. Chapman & Hall/CRC Press, Boca Raton (2015)
4. Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis. Theory and Practice.
Springer-Verlag, New York (2006)
5. Cholaquidis, A., Forzani, L., Llop, P., Moreno, L.: On the classification problem for Poisson
point processes. J. Multivar. Anal. 153, 1–15 (2017)
6. Chollet, F., Allaire, J. J. and others: R Interface to Keras (2017) Available via GitHub.
https://github.com/rstudio/keras. Cited 10 Jan 2022
7. Daley, D., Vere-Jones, D.: An Introduction to the Theory of Point Processes. Vol II., 2nd edn.
Springer-Verlag, New York (2008)
8. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
9. Illian, J., Penttinen, A., Stoyan, H., Stoyan, D.: Statistical Analysis and Modelling of Spatial
Point Patterns. Wiley, Chichester (2004)
10. Koňasová, K., Dvořák, J.: Techniques from functional data analysis adaptable for spatial
point patterns (2021) In: Proceedings of the 22nd European Young Statisticians Meeting.
https://www.eysm2021.panteion.gr/publications.html. Cited 10 Jan 2022
11. Koňasová, K., Dvořák, J.: Supervised nonparametric classification in the context of replicated
point patterns. Submitted (2021)
12. Mateu, J., Schoenberg, F. P., Diez, D. M., González, J. A., Lu, W.: On measures of dissimilarity
between point patterns: classification based on prototypes and multidimensional scaling.
Biom. J. 57, 340–358 (2015)
13. Møller, J., Waagepetersen, R.: Statistical Inference and Simulation for Spatial Point Processes.
Chapman & Hall/CRC, Boca Raton (2004)
14. Thind, B., Multani, K., Cao, J.: Deep Learning with Functional Inputs (2020) Available via
arxiv. https://arxiv.org/pdf/2006.09590.pdf. Cited 10 Jan 2022
15. Thind, B., Wu, S., Groenewald, R., Cao, J.: FuncNN: An R Package to Fit Deep Neural
Networks Using Generalized Input Spaces (2020) Available via arxiv.
https://arxiv.org/pdf/2009.09111.pdf. Cited 10 Jan 2022
16. Torgerson, W.: Multidimensional Scaling: I. Theory and Method. Psychometrika. 17, 401–419
(1952)
17. Vo, B. N., Dam, N., Phung, D., Tran, Q. N., Vo, B. T.: Model-based learning for point pattern
data. Pattern Recognit. 84, 136–151 (2018)

Parsimonious Mixtures of Seemingly Unrelated
Contaminated Normal Regression Models

Gabriele Perrone and Gabriele Soffritti

Abstract In recent years, the research into linear multivariate regression based on
finite mixture models has been intense. With such an approach, it is possible to
perform regression analysis for a multivariate response by taking account of the
possible presence of several unknown latent homogeneous groups, each of which is
characterised by a different linear regression model. For a continuous multivariate
response, mixtures of normal regression models are usually employed. However, in
real data, it is not unusual to observe mildly atypical observations that can negatively
affect the estimation of the regression parameters under a normal distribution in
each mixture component. Furthermore, in some fields of research, a multivariate
regression model with a different vector of covariates for each response should be
specified, based on some prior information to be conveyed in the analysis. To take
account of all these aspects, mixtures of contaminated seemingly unrelated normal
regression models have been recently developed. A further extension of such an
approach is presented here so as to ensure parsimony, which is obtained by imposing
constraints on the group-covariance matrices of the responses. A description of the
resulting parsimonious mixtures of seemingly unrelated contaminated regression
models is provided together with the results of a numerical study based on the
analysis of a real dataset, which illustrates their practical usefulness.

Keywords: contaminated normal distribution, ECM algorithm, mixture of regression models, model-based cluster analysis, seemingly unrelated regression

Gabriele Perrone ( )
Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna,
Italy, e-mail: [email protected]
Gabriele Soffritti
Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna,
Italy, e-mail: [email protected]


1 Introduction

Seemingly unrelated (SU) regression equations are usually employed in a multivariate regression analysis whenever the dependence of a vector Y = (𝑌1 , . . . , 𝑌𝑀 )′ of 𝑀 continuous variables on a vector X = (𝑋1 , . . . , 𝑋 𝑃 )′ of 𝑃 regressors has to be
modelled by allowing the error terms in the different equations to be correlated and,
thus, the regression parameters of the 𝑀 equations have to be jointly estimated [14].
With such an approach, the researcher is also enabled to convey prior information
on the phenomenon under study into the specification of the regression equations
by defining a different vector of regressors for each dependent variable. This latter
feature is particularly useful in any situation in which different regressors are ex-
pected to be relevant in the prediction of different responses, such as in [3, 6, 16].
This approach has been recently embedded into the framework of Gaussian mixture
models, leading to multivariate SU normal regression mixtures [7]. In these models,
the effect of the regressors on the dependent variables changes with some unknown
latent sub-populations composing the population that has generated the sample of
observations to be analysed. Thus, when the sample is characterised by unobserved
heterogeneity, model-based cluster analysis is simultaneously carried out.
Another source of complexity which could affect the data and make the prediction
of Y a difficult task to perform is represented by mildly atypical observations [13].
Robust methods of parameter estimation insensitive to the presence of such obser-
vations in a sample characterised by unobserved heterogeneity have been introduced
in [9], where the conditional distribution Y|X = x is modelled through a mixture of
𝐾 multivariate contaminated normal models, where 𝐾 is the number of the latent
sub-populations. A limitation associated with these latter models is that the same
vector of regressors has to be specified for the prediction of all the dependent vari-
ables. To overcome this limitation while preserving all the features mentioned above,
a more flexible approach which employs mixtures of multivariate SU contaminated
normal regression models has been recently introduced in [11]. These latter models
are able to capture the linear effects of the regressors on the dependent variables
from sample observations coming from heterogeneous populations. The researcher
is also enabled to specify a different vector of regressors for each dependent variable.
Finally, a robust estimation of the regression parameters and the detection of mild
outliers in the data are ensured.
In the presence of many responses and many latent sub-populations, analyses
based on these latter models can become unfeasible in practical applications because
of a large number of model parameters. In order to keep this number as low as
possible, an approach due to [4], based on the spectral decompositions of the 𝐾
covariance matrices of Y|X = x, is exploited here so as to obtain fourteen different
covariance structures. The resulting parsimonious mixtures of SU contaminated
regression models are described in Section 2. The usefulness of these new models is
illustrated through a study aiming at determining the effect of prices and promotional
activities on sales of canned tuna in the US market. A summary of the obtained results
is provided in Section 3.

2 Parsimonious SU Contaminated Normal Regression Mixtures


In a system of 𝑀 SU regression equations for modelling the linear dependence of
Y on X, let X𝑚 = (𝑋𝑚1 , 𝑋𝑚2 , . . . , 𝑋𝑚𝑃𝑚 )′ be the 𝑃𝑚 -dimensional sub-vector of X
composed of the 𝑃𝑚 regressors expected to be relevant for the explanation of 𝑌𝑚 ,
for 𝑚 = 1, . . . , 𝑀. Furthermore, let X∗𝑚 = (1, X′𝑚 )′. The mixture of 𝐾 SU normal

regression models described in [7] can be defined as follows:



$$\mathbf{Y} = \begin{cases} \tilde{\mathbf{X}}^{*\prime} \boldsymbol{\beta}^*_1 + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N_M(\mathbf{0}_M, \boldsymbol{\Sigma}_1) \text{ with probability } \pi_1, \\ \;\;\vdots & \\ \tilde{\mathbf{X}}^{*\prime} \boldsymbol{\beta}^*_K + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N_M(\mathbf{0}_M, \boldsymbol{\Sigma}_K) \text{ with probability } \pi_K, \end{cases} \qquad (1)$$

where 𝜋 𝑘 is the prior probability of the 𝑘th latent sub-population, with 𝜋 𝑘 > 0 for
𝑘 = 1, . . . , 𝐾 and $\sum_{k=1}^{K} \pi_k = 1$; X̃∗ is the following (𝑃∗ + 𝑀) × 𝑀 partitioned matrix:

$$\tilde{\mathbf{X}}^* = \begin{pmatrix} \mathbf{X}^*_1 & \mathbf{0}_{P_1+1} & \cdots & \mathbf{0}_{P_1+1} \\ \mathbf{0}_{P_2+1} & \mathbf{X}^*_2 & \cdots & \mathbf{0}_{P_2+1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{P_M+1} & \mathbf{0}_{P_M+1} & \cdots & \mathbf{X}^*_M \end{pmatrix},$$

with 0 𝑃𝑚 +1 denoting the (𝑃𝑚 + 1)-dimensional null vector; $P^* = \sum_{m=1}^{M} P_m$;
$\boldsymbol{\beta}^*_k = (\boldsymbol{\beta}^{*\prime}_{k1}, \ldots, \boldsymbol{\beta}^{*\prime}_{km}, \ldots, \boldsymbol{\beta}^{*\prime}_{kM})'$ is the (𝑃∗ + 𝑀)-dimensional vector containing all the
linear effects on the 𝑀 responses in the 𝑘th latent sub-population, with $\boldsymbol{\beta}^*_{km} = (\beta_{0k,m}, \boldsymbol{\beta}'_{km})'$,
for 𝑚 = 1, . . . , 𝑀; 𝝐 = (𝜖1 , . . . , 𝜖 𝑀 )′ is the vector of the errors, which
are supposed to be independent and identically distributed; 𝑁 𝑀 (0 𝑀 , 𝚺 𝑘 ) denotes
the 𝑀-dimensional normal distribution with mean vector 0 𝑀 and positive-definite
covariance matrix 𝚺 𝑘 . From now on, this mixture regression model is denoted as
MSUN. When X𝑚 = X ∀𝑚 (the 𝑃 regressors are employed in all the 𝑀 equations),
model (1) reduces to the mixtures of 𝐾 normal (MN) regression models (see [8]).
When the data are contaminated by the presence of mild outliers, departures from
the normal distribution could be observed within any of the 𝐾 latent sub-populations.
A model able to manage this situation has been recently introduced in [11]. It
has been obtained from equation (1) by replacing the normal distribution with
the contaminated normal distribution. Under this latter distribution, the probability
density function (p.d.f.) of 𝝐 within the 𝑘th sub-population is equal to ℎ (𝝐; 𝝑 𝑘 ) =
𝛼 𝑘 𝜙 𝑀 (𝝐; 0 𝑀 , 𝚺 𝑘 ) + (1 − 𝛼 𝑘 )𝜙 𝑀 (𝝐; 0 𝑀 , 𝜂 𝑘 𝚺 𝑘 ), where 𝜙 𝑀 (·; 𝝁, 𝚺) denotes the
p.d.f. of the distribution 𝑁 𝑀 (𝝁, 𝚺), 𝛼 𝑘 ∈ (0.5, 1) and 𝜂 𝑘 > 1 are the proportion
of typical observations within the 𝑘th sub-population and a parameter that inflates the
elements of 𝚺 𝑘 , respectively, and 𝝑 𝑘 = (𝛼 𝑘 , 𝜂 𝑘 , 𝚺 𝑘 ). As a consequence, a mixture
of 𝐾 SU contaminated normal (MSUCN) regression models is given by:


$$\mathbf{Y} = \begin{cases} \tilde{\mathbf{X}}^{*\prime} \boldsymbol{\beta}^*_1 + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim CN_M(\alpha_1, \eta_1, \mathbf{0}_M, \boldsymbol{\Sigma}_1) \text{ with probability } \pi_1, \\ \;\;\vdots & \\ \tilde{\mathbf{X}}^{*\prime} \boldsymbol{\beta}^*_K + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim CN_M(\alpha_K, \eta_K, \mathbf{0}_M, \boldsymbol{\Sigma}_K) \text{ with probability } \pi_K, \end{cases} \qquad (2)$$

where 𝐶𝑁 𝑀 (𝛼 𝑘 , 𝜂 𝑘 , 0 𝑀 , 𝚺 𝑘 ) denotes the 𝑀-dimensional contaminated normal distribution described by the p.d.f. ℎ (𝝐; 𝝑 𝑘 ). The parameter vector of model (2) is
𝝍 = (𝝍 1 , . . . , 𝝍 𝑘 , . . . , 𝝍 𝐾 ), where 𝝍 𝑘 = (𝜋 𝑘 , 𝜽 𝑘 ), 𝜽 𝑘 = ( 𝜷∗𝑘 , 𝝑 𝑘 ). The number of
free elements of 𝝍 is 𝑛𝝍 = 3𝐾 − 1 + 𝐾 (𝑃∗ + 𝑀) + 𝑛𝝈 , where 𝑛𝝈 denotes the total
number of free variances and covariances, with $n_{\boldsymbol{\sigma}} = K n_{\boldsymbol{\Sigma}}$ and $n_{\boldsymbol{\Sigma}} = M(M+1)/2$. When
X𝑚 = X ∀𝑚, model (2) coincides with the mixture of 𝐾 contaminated normal (MCN)
regression models described in [9]. For 𝛼 𝑘 → 1 or 𝜂 𝑘 → 1 ∀𝑘, model (2) reduces
to model (1). Conditions ensuring identifiability of models (2) are provided in [11].
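As a small numerical illustration of the contaminated normal density ℎ (𝝐; 𝝑 𝑘 ) defined above, the following sketch evaluates it with the mvtnorm package; the dimension, the parameter values and the function name are arbitrary choices of this illustration.

# Illustrative sketch: density of the M-variate contaminated normal distribution,
# h(e) = alpha * N_M(0, Sigma) + (1 - alpha) * N_M(0, eta * Sigma).
library(mvtnorm)
dcontnorm <- function(e, alpha, eta, Sigma) {
  alpha * dmvnorm(e, mean = rep(0, length(e)), sigma = Sigma) +
    (1 - alpha) * dmvnorm(e, mean = rep(0, length(e)), sigma = eta * Sigma)
}
Sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)       # arbitrary example with M = 2
dcontnorm(c(0.5, -0.2), alpha = 0.9, eta = 10, Sigma = Sigma)
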
The ML estimation of 𝝍 in equation (2) can be carried out by means of a sample
S = {(x1 , y1 ), . . . , (x𝐼 , y𝐼 )} of 𝐼 independent observations drawn from model (2) and
an expectation-conditional maximisation (ECM) algorithm [10]. Details about this
algorithm, including strategies for the initialisation of 𝝍 and convergence criteria, are
illustrated in [11]. In practical applications, the value of 𝐾 is generally unknown and
has to be properly chosen. This task can be carried out by resorting to model selec-
tion criteria, such as the Bayesian information criterion [15]: $BIC = 2\ell(\hat{\boldsymbol{\psi}}) - n_{\boldsymbol{\psi}} \ln I$,
where $\hat{\boldsymbol{\psi}}$ is the maximum likelihood estimator of $\boldsymbol{\psi}$. Another commonly used
information criterion is the integrated completed likelihood [2], which admits two
slightly different formulations: $ICL_1 = BIC + 2\sum_{i=1}^{I}\sum_{k=1}^{K} \mathrm{MAP}(\hat{z}_{ik}) \ln \hat{z}_{ik}$ and
$ICL_2 = BIC + 2\sum_{i=1}^{I}\sum_{k=1}^{K} \hat{z}_{ik} \ln \hat{z}_{ik}$, where $\hat{z}_{ik}$ is the estimated posterior probability
that the 𝑖th sample observation comes from the 𝑘th sub-population (for further
details see [11]), MAP( 𝑧ˆ𝑖𝑘 ) = 1 if maxℎ { 𝑧ˆ𝑖ℎ } occurs when ℎ = 𝑘 (MAP( 𝑧ˆ𝑖𝑘 ) = 0
otherwise). Whenever the specification of the subvectors X𝑚 , 𝑚 = 1, . . . , 𝑀, to be
considered in the 𝑀 equations of the multivariate regression model is questionable,
such criteria can also be employed to perform subset selection.
As the number of free parameters 𝑛𝝍 incresases quadratically with 𝑀, analyses
based on model (2) can become unfeasible in real applications. A way to man-
age this problem can be based on the introduction of suitable constraints on the
elements of 𝚺 𝑘 , 𝑘 = 1, . . . , 𝐾, based on the following eigen-decomposition [4]:
$\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}'_k$, where $\lambda_k = |\boldsymbol{\Sigma}_k|^{1/M}$, $\mathbf{A}_k$ is a diagonal matrix with entries
(sorted in decreasing order) proportional to the eigenvalues of 𝚺 𝑘 (with the con-
straint |A 𝑘 | = 1) and D 𝑘 is an 𝑀 × 𝑀 orthogonal matrix of the eigenvectors of 𝚺 𝑘
(ordered according to the eigenvalues). This decomposition allows one to obtain vari-
ances and covariances in 𝚺 𝑘 from 𝜆 𝑘 , A 𝑘 and D 𝑘 . From a geometrical point of view,
𝜆 𝑘 determines the volume, A 𝑘 the shape and D 𝑘 the orientation of the 𝑘th cluster of
sample observations detected by the fitted model. By constraining 𝜆 𝑘 , A 𝑘 and D 𝑘 to
be equal or variable across the 𝐾 clusters, a class of fourteen mixtures of 𝐾 SUCN
regression models is obtained (see Table 1). With variable volumes, shapes and ori-
entations (VVV in Table 1), the resulting model coincides with (2). When 𝐾 > 1, the
other covariance structures allow to obtain thirteen different parsimonious mixtures
of 𝐾 SUCN regression models (i.e.: with a reduced 𝑛𝝈 ). When 𝐾 = 1, the possible
covariance structures for 𝚺1 are: diagonal with different entries, diagonal with the
same entries and fully unconstrained. The ML estimation of 𝝍 under model (2) with
any of these parameterisations can be carried out through an ECM algorithm in
which the CM-step update for 𝚺 𝑘 can be computed either in closed form or using
iterative procedures, depending on the parameterisation to be employed (see [4]).
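
The decomposition itself is straightforward to compute; the base-R sketch below (our illustration) recovers the volume, shape and orientation components from a given covariance matrix and verifies the reconstruction; the example matrix is arbitrary.

# Illustrative sketch: eigen-decomposition Sigma = lambda * D %*% A %*% t(D).
decompose_cov <- function(Sigma) {
  M      <- nrow(Sigma)
  lambda <- det(Sigma)^(1 / M)          # volume
  eig    <- eigen(Sigma)                # eigenvalues are returned in decreasing order
  A      <- diag(eig$values / lambda)   # shape matrix, |A| = 1 by construction
  D      <- eig$vectors                 # orientation (orthogonal matrix of eigenvectors)
  list(lambda = lambda, A = A, D = D)
}
Sigma <- matrix(c(2.0, 0.6, 0.6, 1.0), nrow = 2)    # arbitrary example
dec   <- decompose_cov(Sigma)
dec$lambda * dec$D %*% dec$A %*% t(dec$D)           # reconstruction, equal to Sigma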

Table 1 Features of the parameterisations for the covariance matrices 𝚺 𝑘 , 𝑘 = 1, . . . , 𝐾 (𝐾 > 1).
Acronym Covariance structure Volume Shape Orientation CM step 𝑛𝝈
EEE 𝜆DAD0 Equal Equal Equal Closed 𝑛𝚺
VVV 𝜆 𝑘 D 𝑘 A 𝑘 D0𝑘 Variable Variable Variable Closed 𝐾 𝑛𝚺
EII 𝜆I Equal Spherical − Closed 1
VII 𝜆𝑘 I Variable Spherical − Closed 𝐾
EEI 𝜆A Equal Equal Axis-aligned Closed 𝑀
VEI 𝜆𝑘 A Variable Equal Axis-aligned Iterative 𝑀 +𝐾 −1
EVI 𝜆A 𝑘 Equal Variable Axis-aligned Closed 𝑀 𝐾 − (𝐾 − 1)
VVI 𝜆𝑘 A𝑘 Variable Variable Axis-aligned Closed 𝑀𝐾
EEV 𝜆D 𝑘 AD0𝑘 Equal Equal Variable Iterative 𝐾 𝑛𝚺 − (𝐾 − 1) 𝑀
VEV 𝜆 𝑘 D 𝑘 AD0𝑘 Variable Equal Variable Iterative 𝐾 𝑛𝚺 − (𝐾 − 1) ( 𝑀 − 1)
EVE 𝜆DA 𝑘 D0 Equal Variable Equal Iterative 𝑛𝚺 − (𝐾 − 1) ( 𝑀 − 1)
VVE 𝜆 𝑘 DA 𝑘 D0 Variable Variable Equal Iterative 𝑛𝚺 − (𝐾 − 1) 𝑀
VEE 𝜆 𝑘 DAD0 Variable Equal Equal Iterative 𝑛𝚺 − (𝐾 − 1)
EVV 𝜆D 𝑘 A 𝑘 D0𝑘 Equal Variable Variable Iterative 𝐾 𝑛𝚺 − (𝐾 − 1)

3 Analysis of U.S. Canned Tuna Sales

The models illustrated in Section 2 have been fitted to a dataset [5] containing the
volume of sales (Move), a measure of the display activity (Nsale) and the log price
(Lprice) for seven of the top 10 U.S. brands in the canned tuna product category in
the 𝐼 = 338 weeks between September 1989 and May 1997. The goal of the analysis
is to study the dependence of canned tuna sales on prices and promotional activities
for two products: Star Kist 6 oz. (SK) and Bumble Bee Solid 6.12 oz. (BBS). To this
end, the following vectors have been considered: Y′ = (𝑌1 = Lmove SK, 𝑌2 = Lmove
BBS), X′ = (𝑋1 = Nsale SK, 𝑋2 = Lprice SK, 𝑋3 = Nsale BBS, 𝑋4 = Lprice
BBS), where Lmove denotes the logarithm of Move. The analysis has been carried
out using all the parameterisations of the MSUN, MN, MSUCN and MCN models
for each 𝐾 ∈ {1, 2, 3, 4, 5, 6}. Furthermore, MSUN and MSUCN models have been
fitted by considering all possible subvectors of X as vectors X𝑚 , 𝑚 = 1, 2, for each
𝐾. In this way, best subset selections for Lmove SK and Lmove BBS have been
included in the analysis both with and without contamination. The overall number of
fitted models is 37376, including the fully unconstrained models (i.e., with the VVV
parameterisation) previously employed in [11] to perform the same analysis.
Table 2 reports some information about the nine models which best fit the analysed
dataset according to the three model selection criteria over the six examined values
of 𝐾 within each model class. An analysis based on a single linear regression model
(𝐾 = 1), both with and without contamination, appears to be inadequate according to
all criteria. All the examined criteria indicate that the overall best model for studying
the effect of prices and promotional activities on sales of SK and BBS tuna is a
parsimonious mixture of two SU contaminated Gaussian linear regression models
with the EVE parameterisation for the covariance matrices in which the log unit sales
of SK tuna are regressed on the log prices and the promotional activities of the same
brand, while the regressors selected for the BBS log unit sales are the log prices of

both brands and the promotional activites of BBS. Thus, the analysis suggests that
two sources of complexity affect the analysed dataset: unobserved heterogeneity over
time (𝐾 = 2 clusters of weeks have been detected) and the presence of mildly atypical
observations. Since the two estimated proportions of typical observations are quite
similar (see the values of 𝛼ˆ 𝑘 in Table 3), contamination seems to characterise the
two clusters of weeks detected by the model almost in the same way. As far as the
strength of the contaminating effects on the conditional variances and covariances
of Y|X = x is concerned, it appears to be stronger in the first cluster, where the
estimated inflation parameter is larger (𝜂ˆ1 = 15.70). By focusing the attention on the
other estimates, it appears that also some of the estimated regression coefficients,
variances and covariances are affected by heterogeneity over time. Sales of SK tuna
turn out to be negatively affected by prices and positively affected by promotional
activities of the same brand within both clusters detected by the model, but with
effects which are sligthly stronger in the first cluster of weeks. A similar behavior is
detected for the estimated regression equation for Lmove BBS, which also highlights
that Lmove BBS are positively affected by the log prices of SK tuna, especially in
the first cluster of weeks. Furthermore, typical weeks in the first cluster show values
of Lmove SK which are more homogeneous than those of Lmove BBC; the opposite
holds true for the typical weeks belonging to the second cluster. Also the correlation
between log sales of SK and BBS products results to be affected by heterogeneity
over time: while in the largest cluster of weeks this correlation has been estimated
to be slightly positive (0.200), the first cluster is characterised by a mild estimated
negative correlation (−0.151). An interesting feature of this latter cluster is that 17
out of the 20 weeks which have been assigned to this cluster are consecutive from
week no. 58 to week no. 74, which correspond to the period from mid-October 1990
to mid-February 1991 characterised by a worldwide boycott campaign encouraging
consumers not to buy Bumble Bee tuna because Bumble Bee was found to be buying
yellow-fin tuna caught by dolphin-unsafe techniques [1]. Such events could represent
one of the sources of the unobserved heterogeneity detected by the model. According
to the overall best model, some weeks have been detected as mild outliers. In the
first cluster, this has happened for week no. 60 (immediately after Halloween 1990)
and week no. 73 (two weeks immediately before Presidents' Day 1991). The analysis

of the estimated sample residuals y𝑖 − 𝝁ˆ 1 (x𝑖 ; 𝜷ˆ 1 ) for the 20 weeks belonging to the
first cluster (see the scatterplot on the left side of Figure 1) clearly shows that weeks
60 and 73 noticeably deviate from the other weeks. Among the 318 weeks of the
second cluster, 32 have turned out to be mild outliers, most of which are associated
with holidays and special events that took place between September 1989 and mid-
October 1990 or between mid-February and May 1997 (see the scatterplot on the
right side of Figure 1). These results are almost identical to those obtained using the best
overall fully unconstrained fitted model in the analysis presented in [11]. However,
the EVE parameterisation for the MSUCN model has made it possible to obtain a better trade-
off among fit, model complexity and the uncertainty of the estimated partition
of the weeks; furthermore, it has led to a slightly lower number of mild outliers in
the second cluster of weeks.

Table 2 Maximised log-likelihood ℓ(ψ̂) and values of BIC, ICL_1 and ICL_2 for nine models
selected from the classes MSUCN, MCN, MSUN and MN in the analysis of tuna sales.
Model class   K   Acronym   X_1   X_2   ℓ(ψ̂)   n_ψ   BIC   ICL_1   ICL_2
MSUCN 2 EVE 𝑋1 , 𝑋2 𝑋2 , 𝑋3 , 𝑋4 −242.9 23 −619.8 −625.7 −635.8
MCN 2 EVI X X −239.6 28 −642.2 −648.9 −663.2
MCN 2 EEV X X −240.8 29 −650.6 −650.8 −652.0
MCN 3 EVI 𝑋1 , 𝑋2 , 𝑋4 𝑋1 , 𝑋2 , 𝑋4 −214.2 36 −638.0 −703.1 −788.6
MSUN 2 VEV 𝑋1 , 𝑋2 𝑋3 , 𝑋4 −279.3 18 −663.4 −673.1 −692.1
MSUN 3 EEV 𝑋2 , 𝑋3 𝑋2 , 𝑋3 , 𝑋4 −259.8 28 −682.7 −684.7 −688.0
MSUN 5 VVV 𝑋2 , 𝑋3 𝑋1 , 𝑋4 −167.4 49 −620.0 −701.1 −780.3
MN 3 EEV 𝑋2 , 𝑋3 , 𝑋4 𝑋2 , 𝑋3 , 𝑋4 −258.7 31 −697.9 −699.6 −702.1
MN 4 VVE 𝑋2 , 𝑋4 𝑋2 , 𝑋4 −216.6 36 −642.9 −725.3 −832.9
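As a reading aid, the BIC column appears to be on the "larger is better" scale 2ℓ(ψ̂) − n_ψ log I, with I = 338 weeks, since this reproduces the tabulated values up to rounding of ℓ(ψ̂); the snippet below (Python, added here only for illustration) checks two rows.

```python
import math

I = 338  # number of weeks in the dataset

def bic(loglik, n_psi, n_obs=I):
    """BIC on the 'larger is better' scale consistent with Table 2."""
    return 2 * loglik - n_psi * math.log(n_obs)

# MSUCN, EVE, K = 2: tabulated BIC is -619.8 (the small gap is rounding of loglik)
print(round(bic(-242.9, 23), 1))
# MCN, EVI, K = 2: tabulated BIC is -642.2
print(round(bic(-239.6, 28), 1))
```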

Table 3 Parameter estimates of the overall best model for the analysis of tuna sales.
𝝍ˆ 𝑘=1 𝑘=2
𝜋ˆ 𝑘 0.062 0.938
𝛼ˆ 𝑘 0.810 0.844
𝜂ˆ 𝑘 15.70 6.94
β̂′*_k1    (8.87, 0.56, −4.70)              (8.64, 0.27, −3.09)
β̂′*_k2    (15.04, 3.92, 2.83, −17.76)      (9.98, 0.25, 0.12, −3.83)
Σ̂_k       (0.034, −0.009; −0.009, 0.105)   (0.121, 0.012; 0.012, 0.030)
(rows of each 2 × 2 matrix Σ̂_k are separated by a semicolon)


Fig. 1 Scatterplots of the estimated residuals for the weeks assigned to the first (left) and second
(right) clusters detected by the overall best model. Points in the first scatterplot are labelled with
the numbers of the corresponding weeks. Black circles and red triangles in the second scatterplot
correspond to typical and outlying weeks, respectively.

4 Conclusions

The parsimonious mixtures of seemingly unrelated linear regression models for


contaminated data introduced here can account for heterogeneous regression data
both in the presence of mild outliers and of multivariate correlated dependent variables,
each of which is regressed on a different vector of covariates. Models from this
class allow for simultaneous robust clustering and detection of mild outliers in
multivariate regression analysis. They encompass several other types of Gaussian
mixture-based linear regression models previously proposed in the literature, such
as the ones illustrated in [7, 8, 9], providing a robust and flexible tool for modelling
data in practical applications where different regressors are considered to be relevant
for the prediction of different dependent variables. Previous research (see [9, 11])
demonstrated that BIC and ICL could be effectively employed to select a proper
value for K in the presence of mildly contaminated data. Thanks to the imposition of
an eigen-decomposed structure on the K variance-covariance matrices of Y|X = x,
the presented models are characterised by a reduced number of variance-covariance
parameters to be included in the analysis, thus improving the flexibility, usefulness and
effectiveness of approaches to multivariate linear regression analysis based on finite
Gaussian mixture models in real data applications.

References

1. Baird, I. G., Quastel, N.: Dolphin-safe tuna from California to Thailand: localisms in environ-
mental certification of global commodity networks. Ann. Assoc. Am. Geogr. 101, 337–355
(2011)
2. Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the
integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000)
3. Cadavez, V. A. P., Henningsen, A.: The use of seemingly unrelated regression (SUR) to
predict the carcass composition of lambs. Meat Sci. 92, 548–553 (2012)
4. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28,
781–793 (1995)
5. Chevalier, J. A., Kashyap, A. K., Rossi, P. E.: Why don’t prices rise during periods of peak
demand? Evidence from scanner data. Am. Econ. Rev. 93, 15–37 (2003)
6. Disegna, M., Osti, L.: Tourists’ expenditure behaviour: the influence of satisfaction and the
dependence of spending categories. Tour. Econ. 22, 5–30 (2016)
7. Galimberti, G., Soffritti, G.: Seemingly unrelated clusterwise linear regression. Adv. Data
Anal. Classif. 14, 235–260 (2020)
8. Jones, P. N., McLachlan, G. J.: Fitting finite mixture models in a regression context. Aust.
New Zeal. J. Stat. 34, 233–240 (1992)
9. Mazza, A., Punzo, A.: Mixtures of multivariate contaminated normal regression models. Stat.
Pap. 61, 787–822 (2020)
10. Meng, X. L., Rubin, D. B.: Maximum likelihood estimation via the ECM algorithm: A general
framework. Biometrika. 80, 267–278 (1993)
11. Perrone, G., Soffritti, G.: Seemingly unrelated clusterwise linear regression for contaminated
data. Under review (2021)
12. R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2022) http://www.R-project.org
13. Ritter, G.: Robust cluster analysis and variable selection. Chapman & Hall, Boca Raton (2015)
14. Srivastava, V. K., Giles, D. E. A.: Seemingly unrelated regression equations models. Marcel
Dekker, New York (1987)
15. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
16. White, E. N., Hewings, G. J. D.: Space-time employment modelling: some results using
seemingly unrelated regression estimators. J. Reg. Sci. 22, 283–302 (1982)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Penalized Model-based Functional Clustering: a
Regularization Approach via Shrinkage Methods

Nicola Pronello, Rosaria Ignaccolo, Luigi Ippoliti, and Sara Fontanella

Abstract With the advance of modern technology, and with data being recorded
continuously, functional data analysis has gained a lot of popularity in recent years.
Working in a mixture model-based framework, we develop a flexible functional
clustering technique achieving dimensionality reduction schemes through a 𝐿 1 pe-
nalization. The proposed procedure results in an integrated modelling approach
where shrinkage techniques are applied to enable sparse solutions in both the means
and the covariance matrices of the mixture components, while preserving the under-
lying clustering structure. This leads to an entirely data-driven methodology suitable
for simultaneous dimensionality reduction and clustering. Preliminary experimental
results, both from simulation and real data, show that the proposed methodology is
worth considering within the framework of functional clustering.

Keywords: functional data analysis, 𝐿 1 -penalty, silhouette width, graphical LASSO,


mixture model

Nicola Pronello ( )
Department of Neurosciences, Imaging and Clinical Sciences, University of Chieti-Pescara, Chieti,
Italy, e-mail: [email protected]
Rosaria Ignaccolo
Department of Economics and Statistics "Cognetti de Martiis", University of Torino, Torino, Italy,
e-mail: [email protected]
Luigi Ippoliti
Department of Economics, University of Chieti-Pescara, Pescara, Italy,
e-mail: [email protected]
Sara Fontanella
National Heart and Lung Institute, Imperial College London, London, United Kingdom,
e-mail: [email protected]

© The Author(s) 2023 313


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_34
314 N. Pronello et al.

1 Introduction

In recent decades, technological innovations have produced data that are increasingly
complex, high dimensional, and structured. A large amount of these data can be
characterized as functions defined on some continuous domain and their statistical
analysis has attracted the interest of many researchers. This surge of interest is
explained by the ubiquitous examples of functional data that can be found in different
application fields (see for example [2], and references therein for specific examples).
With functions as the basic units of observation, the analysis of functional data
poses significant theoretical and practical challenges to statisticians. Despite these
difficulties, methodology for clustering functional data has advanced rapidly during
the past years; recent surveys of functional data clustering are presented in [7] and
[2]. Popular approaches have extended classical clustering concepts for vector-valued
multivariate data to functional data.
In this paper, we consider a finite mixture as a flexible model for clustering.
In particular, applying a functional model-based clustering algorithm with an 𝐿 1 -
penalty function on a set of projection coefficients, we extend the results of [8]
and [9] for vector-valued multivariate data to a functional data framework. This
approach appears particularly appealing in all cases in which the functions are
spatially heterogeneous, meaning that some parts of the function can be smoother
than in other parts, or that there may be distant parts of the function that are correlated
with each other. Furthermore, the introduction of a shrinkage penalty makes it possible to look
for directions in the feature space (which is now the space of expansion/projection
coefficients) that are the most useful in separating the underlying groups without
first applying dimensionality reduction techniques.
In Section 2 we first present the methodology, along with some details on model
estimation (Subsection 2.2). Then, in Section 3, we perform a validation study
with simulated and real data for which the classes are known a priori.

2 Shrinkage Method for Model-based Clustering for Functional


Data

Here we consider the problem of clustering a set of 𝑛 observed curves into 𝐾


homogeneous groups (or clusters). To this end, we propose a flexible model based
on a finite mixture of Gaussian distributions, with an L1-penalized likelihood, which
we name Penalized model-based Functional Clustering (PFC-𝐿 1 ).

2.1 Model Definition

We consider a set of 𝑛 observed curves, 𝑥1 , . . . , 𝑥 𝑛 , that are independent realizations


of a continuous stochastic process 𝑋 = {𝑋 (𝑡)}𝑡 ∈ [0,𝑇 ] taking values in 𝐿 2 [0, 𝑇]. In
practice, such curves/trajectories are available only at a discrete set of the domain
points {𝑡 𝑖𝑠 : 𝑖 = 1, . . . , 𝑛, 𝑠 = 1, . . . , 𝑚 𝑖 } and the 𝑛 curves need to be reconstructed.
To this goal, it is common to assume that the curves belong to a finite dimensional
space spanned by a basis of functions, so that given a basis of functions 𝚽 =
{𝜓1 , ..., 𝜓 𝑝 } each curve 𝑥𝑖 (𝑡) admits the following decomposition:
x_i(t) = \sum_{j=1}^{p} \beta_{j,i} \psi_j(t), \quad i = 1, \ldots, n;    (2.1)

that is the stochastic process 𝑋 admits a corresponding truncated basis expansion


X(t) = \sum_{j=1}^{p} \beta_j(X) \psi_j(t),

where 𝜷 = {β_1(X), . . . , β_p(X)} is a random vector in R^p. By considering observations
with a sampling error, such that

x_i^{obs}(t) = x_i(t) + \epsilon_i, \quad i = 1, \ldots, n,    (2.2)

with \epsilon_i \sim N(0, \sigma_\epsilon^2), the realizations of the random coefficients \beta_{j,i}, for j = 1, \ldots, p,
describing each curve can be obtained via least squares as \hat{\boldsymbol\beta}_i = (\boldsymbol\Theta_i' \boldsymbol\Theta_i)^{-1} \boldsymbol\Theta_i' \mathbf{X}_i^{obs},
where \boldsymbol\Theta_i = (\psi_j(t_{is})), 1 \le j \le p, 1 \le s \le m_i, contains the basis functions evaluated
at the fixed domain points and \mathbf{X}_i^{obs} = (x_i^{obs}(t_{i1}), \ldots, x_i^{obs}(t_{i m_i}))' is the vector of
observed values of the i-th curve.
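The following minimal numpy sketch (not from the paper; the Fourier basis and the equally spaced grid are arbitrary illustrative choices) shows this least-squares projection step.

```python
import numpy as np

def fourier_basis(t, p):
    """Design matrix Theta with p Fourier basis functions evaluated at times t."""
    t = np.asarray(t)
    cols = [np.ones_like(t)]
    for j in range(1, (p + 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * j * t))
        cols.append(np.cos(2 * np.pi * j * t))
    return np.column_stack(cols)[:, :p]

def project_curve(t_obs, x_obs, p=25):
    """beta_hat = (Theta' Theta)^{-1} Theta' x_obs, computed via lstsq for stability."""
    theta = fourier_basis(t_obs, p)
    beta_hat, *_ = np.linalg.lstsq(theta, x_obs, rcond=None)
    return beta_hat

t_grid = np.linspace(0, 1, 100)
x_obs = np.sin(2 * np.pi * t_grid) + np.random.normal(scale=0.5, size=t_grid.size)
beta_hat = project_curve(t_grid, x_obs)   # one coefficient vector per observed curve
```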
With the goal of dividing the observed curves x_1, . . . , x_n into K homogeneous groups,
let us assume that there exists an unobservable grouping variable Z =
(Z_1, ..., Z_K) ∈ {0, 1}^K indicating the cluster membership: z_{i,k} = 1 if x_i belongs to
cluster k, 0 otherwise (and z_{i,k} is indeed what we want to predict for each curve).
In adopting a model-based clustering approach, we denote with 𝜋 𝑘 the (a-priori)
probabilities of belonging to a group:

𝜋 𝑘 = P(𝑍 𝑘 = 1), 𝑘 = 1, . . . , 𝐾,
such that \sum_{k=1}^{K} \pi_k = 1 and \pi_k > 0 for each k, and we assume that, conditionally on
𝑍, the random vector 𝜷 follows a multivariate Gaussian distribution, that is for each
cluster
𝜷|(𝑍 𝑘 = 1) = 𝜷 𝑘 ∼ N ( 𝝁 𝑘 , 𝚺 𝑘 )
where 𝝁 𝑘 = (𝜇1,𝑘 , . . . , 𝜇 𝑝,𝑘 )𝑇 and 𝚺 𝑘 are respectively the mean vector and
the covariance matrix of the 𝑘-th group. Then the marginal distribution of 𝜷 =
{𝛽1 , . . . , 𝛽 𝑝 } can be written as a finite mixture with mixing proportions 𝜋 𝑘 as
p(\boldsymbol\beta) = \sum_{k=1}^{K} \pi_k f(\boldsymbol\beta_k; \boldsymbol\mu_k, \boldsymbol\Sigma_k),

where 𝑓 is the multivariate Gaussian density function. The log-likelihood function


can then be written as
l(\boldsymbol\theta; \boldsymbol\beta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\boldsymbol\beta_i; \boldsymbol\mu_k, \boldsymbol\Sigma_k),

where 𝜽 = {𝜋1 , . . . , 𝜋 𝐾 ; 𝝁1 , . . . , 𝝁 𝐾 ; 𝚺1 , . . . , 𝚺 𝐾 } is the vector of parameters to be


estimated and 𝜷𝑖 = (𝛽1,𝑖 , . . . , 𝛽 𝑝,𝑖 )𝑇 is the vector of projection coefficients of the
𝑖-th curve.
In this modeling framework, we consider a very general situation without intro-
ducing any kind of constraints neither for cluster means nor for covariance matrices,
that can be different in each cluster. This flexibility, however, leads to overparame-
terization and, as an alternative to any kind of constraints, we consider a penalty that
allows regularized parameters’ estimation.
To define a suitable penalty term, we follow the penalized approach introduced
by Zhou et al. [9] in the high-dimensional setting, and so we consider a penalty
composed by two terms: the first one on the mean vector of each cluster 𝝁 𝑘 , and
the second one on the inverse of the covariance matrix in each group W 𝑘 = 𝚺−1 𝑘 ,
otherwise said “precision” matrix, with elements 𝑊 𝑘; 𝑗,𝑙 . The proposed penalized
log-likelihood function, given the projection coefficients 𝜷𝑖 , is
l_P(\boldsymbol\theta; \boldsymbol\beta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\boldsymbol\beta_i; \boldsymbol\mu_k, \boldsymbol\Sigma_k) - \lambda_1 \sum_{k=1}^{K} \|\boldsymbol\mu_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|,

where \|\boldsymbol\mu_k\|_1 = \sum_{j=1}^{p} |\mu_{k,j}|, and \lambda_1 > 0 and \lambda_2 > 0 are penalty parameters to be suitably
chosen.
The penalty term on the cluster mean vectors allows for component selection
in the functional data framework (whereas it would be variable selection in the
multivariate case), considering that when the j-th component in the basis expansion
is not useful in separating groups it has a common mean across groups, that is
\mu_{1,j} = \ldots = \mu_{K,j} = 0. Then, to realize component selection, the considered term is
\sum_{k=1}^{K} \|\boldsymbol\mu_k\|_1.
The second part of the penalty, namely \sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|, imposes a shrinkage on
the elements of the precision matrices, thus avoiding possible singularity problems
and facilitating the estimation of large and sparse covariance matrices.
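To make the objective concrete, a plain numpy/scipy sketch evaluating l_P for given parameter values is reported below; the variable shapes and names, and the toy data, are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def penalized_loglik(B, pi, mus, Sigmas, lam1, lam2):
    """l_P: mixture log-likelihood minus L1 penalties on means and precisions.
    B: (n, p) projection coefficients; pi: (K,); mus: (K, p); Sigmas: (K, p, p)."""
    K = len(pi)
    log_dens = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(B, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])                                           # (n, K) component log-densities
    loglik = logsumexp(log_dens, axis=1).sum()
    pen_means = lam1 * np.abs(mus).sum()
    precisions = np.array([np.linalg.inv(S) for S in Sigmas])
    pen_prec = lam2 * np.abs(precisions).sum()
    return loglik - pen_means - pen_prec

# toy check: K = 2 clusters in p = 3 dimensions
rng = np.random.default_rng(1)
B = rng.normal(size=(50, 3))
print(penalized_loglik(B, pi=np.array([0.5, 0.5]), mus=np.zeros((2, 3)),
                       Sigmas=np.stack([np.eye(3)] * 2), lam1=1.0, lam2=1.0))
```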

2.2 Model Estimation via E-M Algorithm

Since the membership of each observation to a cluster is unobservable, data related


to the grouping variable Z is inevitably missing and the maximum penalized log-
likelihood estimator can be obtained by means of the E-M algorithm [4], that iterates
over two steps: expectation (E) of the complete data (penalized) log-likelihood by
considering the unknown parameters equal to those obtained at the previous iteration

(with initialization values), and maximization (M) of a lower bound of the obtained
expected value with respect to the unknown parameters.
In particular, at the 𝑑-th iteration, given a current estimate 𝜽 (𝑑) , the lower bound
after the E-step assumes the following form:
Q_P(\boldsymbol\theta; \boldsymbol\theta^{(d)}) = \sum_{k=1}^{K} \sum_{i=1}^{n} \tau_{k,i}^{(d)} [\log \pi_k + \log f(\boldsymbol\beta_i; \boldsymbol\mu_k, \boldsymbol\Sigma_k)] - \lambda_1 \sum_{k=1}^{K} \|\boldsymbol\mu_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|,

where \tau_{k,i} = P(Z_k = 1 | X = x_i) is the posterior probability that observation i belongs


to group 𝑘. The M-step maximizes the function 𝑄 𝑃 in order to update the estimate
of 𝜽.
As suggested by [9], it is possible to maximize each of the K terms using
a “graphical lasso” (GLASSO) algorithm (first proposed by [5]), thanks
to the close connection between fitting Gaussian mixture models and Gaussian
graphical models. Indeed, in GLASSO the objective function looks like
\log\det(\mathbf{W}) - \mathrm{tr}(\mathbf{S}\mathbf{W}) - \lambda \sum_{j,l}^{p} |W_{j,l}|, so that the algorithm implemented in the R
package “glasso” can be used with \mathbf{W} = \mathbf{W}_k, \mathbf{S} = \tilde{\mathbf{S}}_k and \lambda = 2\lambda_2 / \sum_{i=1}^{n} \tau_{k,i}^{(d)} for each k
to obtain the elements \hat{W}_{k;j,l}^{(d+1)} of the precision matrices.
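A minimal sketch of this M-step update is given below, using scikit-learn's graphical_lasso as a stand-in for the R package “glasso”; the construction of the responsibility-weighted scatter matrices S̃_k shown here is an assumption made only for illustration.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def update_precisions(B, tau, mus, lam2):
    """One M-step update of the precision matrices, cluster by cluster.
    B: (n, p) projection coefficients; tau: (n, K) E-step responsibilities;
    mus: (K, p) current mean estimates; lam2: penalty parameter."""
    K = tau.shape[1]
    precisions = []
    for k in range(K):
        w = tau[:, k]
        R = B - mus[k]
        S_k = (R * w[:, None]).T @ R / w.sum()   # assumed weighted scatter matrix S~_k
        alpha = 2 * lam2 / w.sum()               # GLASSO penalty, as in the text
        _, W_k = graphical_lasso(S_k, alpha=alpha)
        precisions.append(W_k)
    return np.array(precisions)
```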

2.3 Model Selection via Silhouette Profile

A fundamental, and probably unsolved, problem in cluster analysis is determining


the “true” number of groups in a dataset. To this purpose, for simplicity, here we
approach the problem choosing the number of groups as cluster validation problem
and use the average silhouette width index as a model selection heuristic. The
silhouette value for curve 𝑖 is given by

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

where 𝑎(𝑖) is the average distance of curve 𝑖 to all other curves ℎ assigned to the
same cluster (if 𝑖 is the only observation in its cluster, then 𝑠(𝑖) = 0), and 𝑏(𝑖) is
the minimum average distance of curve 𝑖 to observations ℎ which are assigned to
a different cluster. This definition ensures that 𝑠(𝑖) takes values in [−1, 1], where
values close to one indicate “better” clustering solutions. Conditional on 𝐾 and a pair
of values (𝜆1 , 𝜆2 ), we thus assess the overall cluster solution using the total average
of silhouette values
S(K, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^{n} s(i).
In particular, by doing a grid search for the triple (𝐾, 𝜆1 , 𝜆2 ), the best cluster
solution is obtained by looking for the largest value of the average silhouette width
(ASW) index. Note that, to evaluate 𝑠(𝑖), 𝑖 = 1, . . . , 𝑛, and then the objective function
𝑆(𝐾, 𝜆1 , 𝜆2 ), we need to compute a distance between pairs of curves 𝑋𝑖 and 𝑋ℎ . One
possibility is to compute the Euclidean distance

d_E(i, h) = \int \| X_i(t) - X_h(t) \|_2^2 \, dt.
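As an illustration of how the ASW can drive the choice of (K, λ1, λ2), the sketch below performs the grid search with scikit-learn's silhouette_score on a precomputed matrix of pairwise curve distances; fit_pfc_l1 is a hypothetical placeholder for the actual PFC-L1 fitting routine, not an existing function.

```python
import numpy as np
from itertools import product
from sklearn.metrics import silhouette_score

def curve_distance_matrix(curves, t):
    """Pairwise (squared L2) distances between curves sampled on a common grid t."""
    diff = curves[:, None, :] - curves[None, :, :]     # (n, n, m)
    return np.trapz(diff ** 2, t, axis=-1)

def select_model(curves, t, Ks, lam1s, lam2s, fit_pfc_l1):
    """Return the largest ASW and the corresponding (K, lambda1, lambda2)."""
    D = curve_distance_matrix(curves, t)
    best = (-np.inf, None)
    for K, l1, l2 in product(Ks, lam1s, lam2s):
        labels = fit_pfc_l1(curves, K=K, lam1=l1, lam2=l2)   # hypothetical fit
        if len(np.unique(labels)) < 2:
            continue                                          # ASW needs >= 2 clusters
        asw = silhouette_score(D, labels, metric="precomputed")
        if asw > best[0]:
            best = (asw, (K, l1, l2))
    return best
```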

3 Experimental Results

3.1 Simulation

We present here a simulated scenario in order to investigate the effectiveness of


the 𝐿 1 regularization in removing noise while preserving dominant local features,
accommodating for spatial heterogeneity of the curves.
The statistical analysis is illustrated for data simulated by means of a finite mixture
of multivariate Gaussian distributions. In particular, based on equations (2.1) and
(2.2), the curves are simulated using a combination of 𝑝 = 25 Fourier basis functions
defined over a one-dimensional regular grid with 100 observations. We consider a
mixture of four (𝐾 = 4) multivariate Gaussian distributions with isotropic covariance
matrices, i.e.

𝜷 𝑘 ∼ N ( 𝝁 𝑘 ; I𝑘 ) where 𝜖𝑖 ∼ N (0; 0.5), 𝑘 = 1, . . . , 4.

With the exclusion of 3 entries per group, the means 𝝁 𝑘 are all zero mean vectors.
Under this scenario, the simulated curves (25 per group) and the non-zero group
expansion coefficients are represented in Figure 1. For this simple simulation setting,
estimation results suggest that, using the Euclidean distance to compute the ASW, the
grid search procedure is always able to correctly select the cluster-relevant basis
functions. This is confirmed by Figure 2, which shows both the distribution (over 100
replications) of the selected basis functions and the data projected on these bases, which
clearly highlight the identification of 4 clusters. Under this scenario, the quality of
the estimated clusters thus appears very good, as the analysis of the misclassification
rate suggests 100% accuracy in all the replicated datasets.
Similar results hold for more complex simulation designs, where we consider
different structures of the covariance matrices in the data generating process.
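A compact sketch of this data-generating mechanism is reported below (plain numpy); the placement and size of the three non-zero mean entries, and the reading of N(0; 0.5) as a variance of 0.5, are illustrative assumptions.

```python
import numpy as np

# Simulate K = 4 groups of curves: coefficients drawn from N(mu_k, I), with only
# three non-zero mean entries per group, mapped to curves via p = 25 Fourier bases.
rng = np.random.default_rng(0)
K, p, n_per_group, m = 4, 25, 25, 100
t = np.linspace(0, 1, m)
js = np.arange(1, p // 2 + 1)
Phi = np.column_stack(
    [np.ones(m)] + [f(2 * np.pi * j * t) for j in js for f in (np.sin, np.cos)]
)[:, :p]                                            # (m, p) Fourier design matrix

mus = np.zeros((K, p))
for k in range(K):
    idx = rng.choice(p, size=3, replace=False)       # assumed placement of the 3 entries
    mus[k, idx] = rng.uniform(2.0, 4.0, size=3)      # assumed magnitude of the entries

curves, labels = [], []
for k in range(K):
    betas = rng.normal(loc=mus[k], scale=1.0, size=(n_per_group, p))
    x = betas @ Phi.T + rng.normal(scale=np.sqrt(0.5), size=(n_per_group, m))
    curves.append(x)
    labels.extend([k] * n_per_group)
curves, labels = np.vstack(curves), np.array(labels)
```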

3.2 Performance on Real Data Sets

We evaluate the PFC-𝐿 1 model on a well-known benchmark data set, namely the
electrocardiogram (ECG) data set (data can be found at the UCR Time Series
Classification Archive [3]).
The ECG data set comprises a set of 200 electrocardiograms from 2 groups of
patients, myocardial infarction and healthy, sampled at 96 time instants.

Fig. 1 Left: 25 simulated curves for each group. Right: Vector of expansion coefficients for each
group, with only three non-zero coefficients corresponding to basis functions with specific period-
icities (Hertz values).

Fig. 2 Left: Data projected on cluster specific functional subspace generated by the selected basis
functions. Right: Distribution (over 100 replications) of the selected basis functions shown for pairs
of sine and cosine basis functions, according to the Hertz values.

This data set was previously used to compare the performance of several func-
tional clustering models in [1]. The results in Table 5 of [1] show that the FunFEM
models, compared to other state of the art methodologies, achieved the best perfor-
mances in terms of accuracy. Hence, here, we limit the comparison to the results
obtained with the PFC-𝐿 1 and the FunFEM models. Although FunFEM models rely
on a mixture of Gaussian distributions describing the likelihood of the data, similarly
to our proposal, they differ in facing the intrinsic high dimension of the problem
by estimating a latent discriminant subspace in parallel with the steps of an EM
algorithm.
For all the data, we reconstruct the functional form from the sampled curves,
arbitrarily choosing 20 cubic spline basis functions. We tested the PFC-𝐿 1 models
considering five different values for the number of clusters, 𝐾 = {2, 3, 4, 5, 6}, and
six values for 𝜆1 = {0.5, 1, 5, 10, 15, 20}.
Considering that the GLASSO penalty parameter 𝜆 depends linearly on 𝜆2 ,
the choice of 𝜆 2 has to provide suitable values for 𝜆. A practical approach is to
choose values avoiding convergence problems with GLASSO. Here 𝜆 2 was set to
{5, 7.5, 10, 12, 15, 20} for the ECG data. Both PFC-𝐿 1 and FunFEM algorithms were
initialized using a 𝐾-means procedure.

The clustering accuracies, computed with respect to the known labels, are 69% for
FunFEM DFM[𝛼𝑘 𝑗 𝛽𝑘 ] (choosing among 12 different model parameterizations with the
BIC index), and 75% for PFC-𝐿 1 [𝜆1 = 0.5, 𝜆2 = 5] (values of the tuning parameters
chosen by the ASW index). Thus PFC-𝐿 1 achieves good performance, with an increase
in accuracy of about 9%.

4 Discussion

In this paper we tried to investigate the potential of shrinkage methods for clustering
functional data. Our numerical examples show the advantages of performing clus-
tering with feature selection, such as uncovering interesting structures underlying the
data while preserving good clustering accuracy. To the best of our knowledge, this is
the first proposal that considers a penalty for both means and covariances of mixture
components in functional model-based clustering. In the model selection section we
defined a heuristic criterion to choose among different model parameterizations
based on the average silhouette index. It may be interesting to evaluate different dis-
tances (i.e., non-Euclidean ones) to compute this index in future research. Moreover, we
will consider more complex simulation designs to investigate the robustness of the
proposal and extend the comparison with the state of the art methodologies on more
benchmark datasets.

References

1. Bouveyron, C., Côme, E., Jacques, J.: The discriminative functional mixture model for a
comparative analysis of bike sharing systems. Ann. Appl. Stat. 9, 1726–1760 (2015)
2. Chamroukhi, F., Nguyen, H.: Model-based clustering and classification of functional data.
Wiley Interdiscip. Rev.: Data Min. and Knowl. Discov. 9, e1298, 1–36 (2019)
3. Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana,
C. A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., Hexagon-ML: The
UCR Time Series Classification Archive (October 2018)
https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
4. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM
algorithm. J. R. Stat. Soc. Ser. B (Methodol.). 39, 1–38 (1977)
5. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical
lasso. Biostat. 9, 432–41 (2008)
6. Friedman, J., Hastie, T., Tibshirani, R.: glasso: Graphical Lasso: Estimation of Gaussian
Graphical Models, R package version 1.11 (2019).
https://CRAN.R-project.org/package=glasso
7. Jacques, J., Preda, C.: Functional data clustering: A survey. Adv. Data Anal. Classif. 8, 231–
255 (2013)
8. Pan, W., Shen, X.: Penalized model-based clustering with application to variable selection. J.
Mach. Learn. Res. 8, 1145–1164 (2007)
9. Zhou, H., Pan, W., Shen, X.: Penalized model-based clustering with unconstrained covariance
matrices. Electron. J. Stat. 3, 1473–1496 (2009)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Emotion Classification Based on Single
Electrode Brain Data: Applications for Assistive
Technology

Duarte Rodrigues, Luis Paulo Reis, and Brígida Mónica Faria

Abstract This research case focused on the development of an emotion classification


system aimed to be integrated in projects committed to improve assistive technolo-
gies. An experimental protocol was designed to acquire an electroencephalogram
(EEG) signal that translated a certain emotional state. To trigger this stimulus, a set
of clips were retrieved from an extensive database of pre-labeled videos. Then, the
signals were properly processed, in order to extract valuable features and patterns
to train the machine and deep learning models.There were suggested 3 hypotheses
for classification: recognition of 6 core emotions; distinguishing between 2 different
emotions and recognising if the individual was being directly stimulated or merely
processing the emotion. Results showed that the first classification task was a chal-
lenging one, because of sample size limitation. Nevertheless, good results were
achieved in the second and third case scenarios (70% and 97% accuracy scores,
respectively) through the application of a recurrent neural network.

Keywords: emotions, brain-computer interface, EEG, supervised learning, machine


and deep learning

Duarte Rodrigues
Faculty of Engineering of University of Porto (FEUP), Rua Dr. Roberto Frias, s/n 4200-465 Porto,
Portugal, e-mail: [email protected]
Luis Paulo Reis
Faculty of Engineering of University of Porto (FEUP) and Artificial Intelligence and Computer
Science Laboratory (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal,
e-mail: [email protected]
Brígida Mónica Faria ( )
School of Health, Polytechnic of Porto (ESS-P.PORTO) and Artificial Intelligence and Computer
Science (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal,
e-mail: [email protected]

© The Author(s) 2023 323


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_35
324 D. Rodrigues et al.

1 Introduction

Emotions are a part of our lives: as humans, we know how to identify the tiniest
of microexpressions to unveil what someone is feeling, but also how to use them
to express our hearts. From the youngest of ages we see and interact with others
and build a database of patterns of, for example, what joy is and how different it is
from fear or sadness. Computers, on the other hand, do not have any idea of what an
emotion is or how to recognize it. Or do they?
The Artificial Intelligence and Computer Science Laboratory (LIACC) estab-
lished 2 projects where emotion recognition can be of the utmost importance. The
first project, the "IntellWheels 2.0" [1], intends to develop an interactive and in-
telligent electric wheelchair. This innovative equipment will have a diverse set of
features, such as an adaptive control system (through eye gaze, a brain-computer
interface, hand orientation, among others) and a personalized multi-modal interface
which will allow communication to multiple devices both from the patients and the
caregivers. In this case, having information about the mood of the patient is very
beneficial, because the interface can give updates to the nursing staff of the emotional
condition of the patient. The second project, the "Sleep at the Wheel" [2], focuses on
the research of an interface that can sense and predict a driver’s drowsiness state, be-
ing able to detect if he fell asleep while driving and, consequently, support an alarm
system to provide safer routing and driving. Here the state of mind of the driver
is a very important aspect, as different emotions, like anger or fear, can provoke
dangerous situations or unpredictable scenarios, making the driver less attentive to
his surroundings.
In this work, emotions will be sensed through a brain-computer interface (BCI).
These are commercial devices that allow to acquire a surface electroencephalo-
gram (EEG). This signal is used to measure the electrical activity of the brain, which
fluctuates according to the firing of the neurons and is quantified in
microvolts. In this research, the BCI used was the "NeuroSky MindWave2", which
possesses one single electrode on the forehead, from which it collects a signal from
the activity of the frontal lobe. This brain area is responsible for the higher executive
functions, including emotional regulation, planning, reasoning and problem solving
[3].
The study of emotion recognition started with psychologist Paul Ekman, who
defined, based on a cross-cultural study, six core emotions - Fear, Anger, Happiness,
Sadness, Surprise and Disgust [4]. Later, psychologist Robert Plutchik established a
model called "Wheel of Emotions", a diagram where every emotion can be derived
from the core 6.
It is also important to have a way to measure what someone is feeling or what
emotion they are experiencing. An easy way to do this is through the "Discrete Emo-
tion Questionnaire", a psychological validated questionnaire to verify the intensity
of a certain emotion. This assessment presents the 6 core emotions to the subjects
asking them to rate the intensity they felt, from 1 to 7 [5].
As a first approach in this area, the current work aims to be able to identify the
core emotions using EEG signals collected with the BCI.

2 Experimental Methodology

In order to correctly identify the core emotions, the first step is to trigger them in
an efficient way, so that the brain data collected are as informative as possible. To do
so, the emotions were prompted via a set of video clips that lasted 5-7 seconds.
These videos were selected from a certified database, where the videos were labeled
according to the intensity and kind of emotion they caused in the subjects [6]. For each
of the 6 core emotions, the 4 videos classified with the biggest intensity were selected
to be presented to the participants of this research work.
For each of the 24 video clips (4 videos per each of the 6 emotions), 3 EEG
samples are collected. The first is before the display of the video, where a fixation
cross is presented, in order to collect the idle/blank state of the user, where he
is asked to relax. The second sample is the EEG during the video (active visual
stimulus); and the third sample is after the video finishes where the volunteer is
processing the emotion triggered (higher level thinking), while getting back to the
initial relaxed state, where the fixation cross is presented again. To confirm that the
volunteers experience the same emotion defined in the pre-determined label, they
are prompted to answer the “Discrete Emotion Questionnaire”, after the 3 EEG
samples are collected.
Regarding the physiological signal processing, this step is important because the
raw EEG signal that comes directly from the BCI has a low signal-to-noise ratio,
as well as many surrounding artifacts that contaminate the readings, especially eye
blinks and facial movements triggered by the various emotions. These interfering
signals caused by the latter, denominated electromyograms (EMG), are characterized
by high frequencies (50-150 Hz) that make the underlying signal very noisy. Every
time a person blinks, the EEG signal shows a very high peak with a very low
frequency (<1 Hz). To remove these muscle artifacts, a 5th order Butterworth bandpass
filter (this type of filter was chosen because it has the flattest frequency response,
which leads to less signal distortion) with cut-off frequencies at 1 Hz and 50 Hz
was applied [7]. The attenuation of very low frequencies is important to remove the eye blink
artifacts. Considering the top cut-off frequency, it is very convenient to use 50 Hz
since it mitigates the effects of the power line noise and the EMG artifacts. Like
this, no important brain data is lost. At this step, the EEG was segmented in the
brain waves of interest, i.e., the alpha and beta brain waves. The best way to perform
this is to apply bandpass filters (same filter type as before) in the corresponding
bandwidths, 8-13Hz and 13-32 Hz, to have alpha and beta bands, respectively.
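A minimal scipy sketch of this preprocessing step is shown below; the 512 Hz sampling rate, the stand-in signal and the zero-phase (forward-backward) filtering are illustrative assumptions rather than specifications taken from the study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 512  # assumed sampling rate of the headset, for illustration only

def bandpass(x, low, high, fs=FS, order=5):
    """Zero-phase Butterworth bandpass filter of the EEG trace x."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

raw = np.random.randn(10 * FS)       # stand-in for a 10 s raw EEG segment
eeg = bandpass(raw, 1, 50)           # artifact/power-line attenuation (1-50 Hz)
alpha = bandpass(eeg, 8, 13)         # alpha band
beta = bandpass(eeg, 13, 32)         # beta band
```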
At this stage, the EEG signals possess the exposed "emotional data", allowing
the features to be extracted. To do so, multiple mathematical equations were applied to
obtain relevant information from the signals. Feature extraction methods depend
on the domain, as will be seen ahead [8]. Most strategies to extract features from
the EEG are formulas applied in the time domain, such as, the common statistical
equations, the Hjorth statistical parameters, the mean and zero crossings (number of
times the signal crosses these 2 thresholds) [8]. Besides these, more advanced
feature extraction methods were applied, based on fractal dimensions and entropy
analysis (methods to assess the complexity, or irregularity, of a time series) [9].
Regarding frequency domain approaches, these features can only be calculated in


the filtered EEG and not in the brain waves, as their spectrum is very narrow. In
terms of the pure frequency band, the only feature computed was the Power Spectral
Density (PSD), based on the Welch method. These domains can be combined, creating
the time-frequency domain and leading to more sophisticated methods, like the Hilbert-
Huang Transform, where the original signal is decomposed into intrinsic mode
functions (IMFs) [10].
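As an illustration of two of the feature families mentioned above, the sketch below computes the Hjorth parameters and Welch-based band powers; the sampling rate and the stand-in signal are assumptions made only to keep the example self-contained, and this is not the authors' exact feature set.

```python
import numpy as np
from scipy.signal import welch

FS = 512                                  # assumed sampling rate (illustration only)
eeg = np.random.randn(10 * FS)            # stand-in for a filtered EEG segment

def hjorth_parameters(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx, ddx = np.diff(x), np.diff(np.diff(x))
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def band_power(x, fs, band):
    """Average Welch power spectral density of x within the band (f_low, f_high)."""
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 2 * fs))
    mask = (f >= band[0]) & (f <= band[1])
    return pxx[mask].mean()

features = [*hjorth_parameters(eeg),
            band_power(eeg, FS, (8, 13)),    # alpha-band power
            band_power(eeg, FS, (13, 32))]   # beta-band power
```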
The resulting number of features is too high to compute machine learning models,
because the correlation between most of the features is very low, which means that
between different classes the information is virtually the same. This would introduce
uncertainty in the weights for each class in the models, thus the number of features
needs to be reduced. To do this the "Min Redundancy Max Relevance" (MRMR)
method was applied, with the objective of finding the optimal number of features
to have a higher inter-class variability, in order to find distinct patterns between
emotions [11]. The features were used raw, normalized or standardized to train the
models.
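A simplified greedy mRMR pass is sketched below (F-statistic relevance minus mean absolute correlation redundancy); this is only one possible formulation of mRMR, shown to make the selection idea concrete, and may differ from the implementation used in the study.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def mrmr_select(X, y, n_select=30):
    """Greedy mRMR: maximise F-score relevance minus mean |corr| with chosen features."""
    relevance, _ = f_classif(X, y)                      # relevance of each feature
    corr = np.abs(np.corrcoef(X, rowvar=False))         # feature-feature redundancy
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        redundancy = corr[np.ix_(candidates, selected)].mean(axis=1)
        scores = relevance[candidates] - redundancy
        selected.append(candidates[int(np.argmax(scores))])
    return selected
```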
In this study, all the models implemented are based on supervised learning and
fully depend on the data that is inputted. Concerning emotion classification there is
not a specific machine learning approach that is optimal, thus 9 different types of
models were implemented to verify which has the best performance. These models
are designed to be able to adapt to various kinds of input data, through the definition
of hyper-parameters. Hence, to tune them to the best possible configuration, a
GridSearchCV was performed. This method exhaustively searches over a given list of
possible parameters applying cross validation between them. In the end, the model
with the best performance is chosen to be trained with the resulting feature matrix.
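A minimal scikit-learn sketch of this tuning step for one of the model families is given below; the SVC pipeline, the parameter grid and the stand-in data are illustrative assumptions, not the grids or data used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in feature matrix and labels (the real ones come from the EEG pipeline).
X = np.random.randn(120, 30)
y = np.random.randint(0, 6, size=120)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```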
A deep learning model was also implemented, based on a recurrent neural network
(RNN), a very common architecture in classification problems using EEG. A par-
ticularity of this network is that it has a GRU, i.e., a layer that helps to mitigate the
problem of vanishing gradients (a common issue in artificial neural networks), giving
long-term memory to the model [12].
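A sketch of such a GRU-based classifier in Keras is shown below; the layer sizes, optimiser and input shape are illustrative assumptions and do not reproduce the exact architecture used in the study.

```python
import tensorflow as tf

# Minimal GRU-based recurrent classifier over EEG sequences:
# input shape (time steps, channels); 6 output classes for the core emotions.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1)),     # variable-length single-channel EEG
    tf.keras.layers.GRU(64),                    # GRU layer provides gated memory
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```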

3 Evaluation and Discussion of Results

In this experiment, 12 subjects volunteered to participate. Each EEG recording is


labeled according to the emotion registered in the original database, as well as whether it
was recorded before, during or after the video. The answers to the “Discrete Emotion
Questionnaire” were used to validate if the emotion triggered by the video was as
expected and, if so, the data was used. With this dataset structure, 3 hypotheses were
tested and their results are discussed ahead.
An important aspect to take into consideration is that the EEG collected while the
subject is relaxing, i.e., while the fixation cross is presented before the video, does not
have relevant cognitive information regarding emotions. Therefore, these segments
were not considered to train any of the models.

3.1 Core Emotions Classification

This first hypothesis describes the main goal of the project where a model was
developed to classify 6 emotions.
First, the feature extraction was computed. At this step, the optimal number of
features to be selected was tested, iterating from 5 to 50, 5 at a time. The best
number found was 30, which gave the best accuracies with a balanced computation
time and power. This value was chosen for the 3 feature matrices (raw, normalized
and standardized). The dataset was then divided into training and testing sets with an
80% ratio, fully independent of one another. Each model was then trained and
assessed by computing the accuracy on the test dataset. Table 1 presents the results
for each model.

Table 1 Results of the 6 Core Emotions Classification.


Classification Models Raw Features Normalized Features Standardized features
Accuracy (%)
Gaussian Naïve Bayes Classifier 12.07 12.93 10.34
Support Vector Classifier 12.07 12.93 16.38
Decision Tree Classifier 18.96 18.10 18.10
Random Forest Classifier 24.13 18.10 20.69
K Nearest Neighbors 21.55 18.96 16.38
Logistic Regression 25.00 14.66 18.10
Linear Discriminant Analysis 24.13 14.65 18.96
Linear Support Vector Classifier 18.10 13.79 19.82
Multi-Layer Perceptron 20.69 13.79 12.93
Recurrent Neural Network 13.79 20.69 23.27

When comparing the various models, the average accuracy is around 16-18%,
logically due to the number of classes in the problem (100%/6 = 16.6%). Despite
this, the best result reached was 25% accuracy, with the features in their raw state,
since the magnitude information was not lost, so patterns in different emotions could
be more easily identified due to the high discrepancy in the values. These results are
not discouraging since the main objective of the study is very ambitious, as we are
trying to create a model to define universally what an emotion is. There is no work
more subjective or abstract, and the only way to achieve this universal standardization
would be with a sample population as wide and diverse as possible with different
beliefs, nationalities, age groups, etc. Although this is an initial study, it shows that
it is possible to register and identify differences in the electrical changes of the
prefrontal cortex and, with that information, categorize what someone is feeling.

3.2 One vs One – Dual Emotion Classification

As the results in the previous hypothesis could not precisely identify an emotion when
compared to the other 5, the problem was narrowed down and a new hypothesis was
tested, to continue the proposed research. In this experiment, the model was trained
to discern between only 2 emotions, decided a priori. For demonstration purposes,
a concrete example can be seen in Table 2, which compares "fear" vs "surprise".

Table 2 Results of "Fear vs Surprise" Classification.


Classification Models Raw Features Normalized Features Standardized features
Accuracy (%)
Gaussian Naïve Bayes Classifier 48.27 55.17 53.44
Support Vector Classifier 51.72 51.72 53.44
Decision Tree Classifier 56.89 50.00 44.83
Random Forest Classifier 48.27 50.00 60.34
K Nearest Neighbors 46.55 44.82 50.00
Logistic Regression 50.00 53.45 53.45
Linear Discriminant Analysis 50.00 48.28 53.44
Linear Support Vector Classifier 50.00 51.72 55.17
Multi-Layer Perceptron 50.00 50.00 58.62
Recurrent Neural Network 69.23 51.23 56.21

In this case, most of the machine learning algorithms have accuracies in the
order of 50-53%. These results are not ideal, as they are no better than a random
choice between the two classes; however, this can be justified by the small population
sample, which is not large enough to bring to the surface concrete patterns in the
features. Regarding the deep learning approach, the RNN has an advantage in this
case, giving a final accuracy of 69%. This result shows that this model is reliable, and
in the majority of the cases the 2 emotions can be distinguished. In this particular
case, the facial expressions and their muscle activity, can induce big artifacts in
the EEG. Someone who feels surprised has the tendency to raise their eyebrows
and open the mouth. These movements can lead to a difference in the EEG and,
consequently, in the patterns of the features, making the distinction between surprise
and fear more noticeable. The same thinking applies to other emotions that trigger
facial movements, like laughing, frowning, and others.

3.3 Stimulus vs No Stimulus Classification

Besides the good results presented in the last premise, one last hypothesis was
assessed, regarding the difference between experiencing the emotion while watching
the video (direct stimulus), and after, when the fixation cross is presented, while the
volunteer is simply thinking and cognitively processing the emotion.

Table 3 summarizes the results of the various models.

Table 3 Results of Stimulus vs No Stimulus classification.


Classification Models Raw Features Normalized Features Standardized features
Accuracy (%)
Gaussian Naïve Bayes Classifier 61.20 58.62 85.34
Support Vector Classifier 58.62 58.62 91.37
Decision Tree Classifier 39.65 58.62 89.65
Random Forest Classifier 39.65 58.62 91.37
K Nearest Neighbors 37.93 58.62 89.65
Logistic Regression 34.48 58.62 87.06
Linear Discriminant Analysis 29.31 37.06 80.17
Linear Support Vector Classifier 34.48 58.62 87.06
Multi-Layer Perceptron 31.03 58.62 88.79
Recurrent Neural Network 96.55 61.20 88.79

As can be seen, for this experiment, most models did fairly well using the
standardized features, with all accuracies higher than 80%. However, when testing
the deep learning approach, this architecture proved to fit almost perfectly to the
testing data, with an accuracy higher than 96%. This hypothesis is the proof of
concept that the characteristics of the signal collected during the stimulus itself
are very different from the ones from a signal obtained when the person is simply
thinking and cognitively processing the emotion (this change would be obvious if
the EEG was collected from the occipital lobe, which is responsible for the visual
perception, but is remarkable when spotted in the prefrontal cortex).

4 Conclusions

In conclusion, as a first approach, the results achieved are very satisfactory and
reveal a high potential to be greatly efficient in the proposed applications, both in
the "IntellWheels 2.0" and the "Sleep at the Wheel" projects. Nevertheless, by collecting
more data the models will become more generalized, resulting in more realistic patterns
and, consequently, higher prediction accuracies.
Compared to the literature, using simple visual stimuli to distinguish six emo-
tions, in a relaxed state, is a novel tactic. Most studies complement the stimulus with
forced facial expressions, introducing different characteristics to the signal, leading
to better results. Other studies use BCIs with more electrodes (channels), covering a
wider cranial surface and, consequently, getting more EEG and information, which
leads to more robust results.
As future work, the preprocessing of the data could be polished, improving the
removal of artifacts and enhancing the underlying information of the EEGs. To obtain
better results, a transfer learning approach could also be used, by pre-training the
models on other emotion-related EEG databases.

Acknowledgements This work was financially supported by Base Funding - UIDB/00027/2020 of


the Artificial Intelligence and Computer Science Laboratory – LIACC - funded by national funds
through the FCT/MCTES (PIDDAC), Sono ao Volante 2.0 - Information system for predicting
sleeping while driving and detecting disorders or chronic sleep deprivation (NORTE-01-0247-
FEDER-039720), and Intellwheels 2.0 - IntellWheels2.0 – Intelligent Wheelchair with Flexible
Multimodal Interface and Realistic Simulator (POCI-01-0247-FEDER-39898), supported by Norte
Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partner-
ship Agreement.

References

1. IntellWheels2.0 – Intelligent Wheelchair with Flexible Multimodal Interface and Realistic


Simulator. Optimizer, Lda, FEUP, UA, Rehapoint, GroundControl. Available at
http://www.intellwheels.com/en/client/skins/geral.php?id=25 Cited 24 May
2021
2. Sono ao Volante 2.0 - Information system for predicting sleeping while driving and detecting
disorders or chronic sleep deprivation. Optimizer, Lda, FEUP, IS, IPCA. Available at
http://sonoaovolante.com/en/client/skins/geral.php?id=25 Cited 24 May
2021
3. Lobes of the Brain. UQ-Queensland Brain Institute (2018). Available at
https://qbi.uq.edu.au/brain/brain-anatomy/lobes-brain Cited 26 May 2021
4. Ekman, P.: Facial Expressions of Emotion: New Findings, New Questions. In: Psychological
Science, 34-38. Sage Journals (1992)
5. Harmon-Jones, C., Bastian, B., Harmon-Jones, E.: The Discrete Emotions Questionnaire: A
New Tool for Measuring State Self-Reported Emotions. In: PLoS One 11(8), e0159915 (2016)
doi: 10.1371/journal.pone.0159915.
6. Cowen, A., Keltner, D.: Self-report captures 27 distinct categories of emotion bridged by
continuous gradients. In: Proceedings of the National Academy of Sciences of the United
States of America 114(38), E7900-E7909 (2017) doi: 10.1073/pnas.1702247114.
7. López-Gil, J.-M., Virgili-Gomá, J., Gil, R., Guilera, T., Batalla, I., Soler-González, J., García,
R.: Method for Improving EEG Based Emotion Recognition by Combining It with Synchro-
nized Biometric and Eye Tracking Technologies in a Non-invasive and Low Cost Way. In:
Frontiers in Computational Neuroscience 10, 85 (2016) doi: 10.3389/fncom.2016.00119
8. Jenke, R., Peer, A., Buss, M.: Feature Extraction and Selection for Emotion Recognition
from EEG. In: IEEE Transactions on Affective Computing, 5(3), 327-339, (2014) doi:
10.1109/TAFFC.2014.2339834
9. Richman, J. S., Moorman, J. R.: Physiological time-series analysis using approximate entropy
and sample entropy. In: American Journal of Physiology-Heart and Circulatory Physiology (2000)
doi: 10.1152/ajpheart.2000.278.6.H2039
10. Junsheng, C., Dejie, Y., Yu, Y.: Research on the intrinsic mode function (IMF) criterion in
EMD method In: Mechanical Systems and Signal Processing, 20(4), 817-824. (2006) doi:
10.1016/j.ymssp.2005.09.011
11. Ding, C., Peng, H.: Minimum redundancy feature selection from microarray gene
expression data. In: Bioinformatics Computation Biol., 3(2) 185–205, (2003) doi:
10.1142/S0219720005001004
12. Zain, M. A.: Predicting Emotions Using EEG Data with Recurrent Neural Networks. Geek
Culture (2021) Available at
https://medium.com/geekculture/predicting-emotions-using-eeg-data-with
-recurrent-neural-networks-8acf384896f5 Cited 19 May 2021

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
The Death Process in Italy Before and During
the Covid-19 Pandemic: a Functional
Compositional Approach

Riccardo Scimone, Alessandra Menafoglio, Laura M. Sangalli, and Piercesare


Secchi

Abstract In this talk, based on [1], we propose a spatio-temporal analysis of daily


death counts in Italy, collected by ISTAT (Italian Statistical Institute), in Italian
provinces and municipalities. While in [1] the focus was on the elderly class (70+
years old), we here focus on the middle age class (50-69 years old), carrying out anal-
ogous analyses and comparative observations. We analyse historical provincial data
starting from 2011 up to 2020, year in which the impacts of the Covid-19 pan-
demic on the overall death process are assessed and analysed. The cornerstone of
our analysis pipeline is a novel functional compositional representation for the death
counts during each calendar year: specifically, we work with mortality densities over
the calendar year, embedding them in the Bayes space 𝐵2 of probability density
functions. This Hilbert space embedding allows for the formulation of functional
linear models, which are used to split each yearly realization of the mortality density
process in a predictable and an unpredictable component, based on the mortality
in previous years. The unpredictable components of the mortality density are then
spatially analysed in the framework of Object Oriented Spatial Statistics. Via spa-
tial downscaling of the results obtained at the provincial level, we obtain smooth
predictions at the fine scale of Italian municipalities; this also enables us to perform

Riccardo Scimone ( )
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: [email protected]
Alessandra Menafoglio
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: [email protected]
Laura M. Sangalli
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: [email protected]
Piercesare Secchi
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: [email protected]

© The Author(s) 2023 333


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_36
334 R. Scimone et al.

anomaly detection, identifying municipalities which behave unusually with respect


to the surroundings.

Keywords: COVID-19, O2S2, functional data analysis, spatial downscaling

1 Introduction and Data Presentation

At the dawn of the third year of global pandemic, we can affirm that no aspect of
people’s everyday life has been left untouched by the consequences of Covid-19.
The virus, in addition to exacting a heavy death toll, has caused great upheavals
in the global economy, education systems, technological development and in countless
other aspects of human life. Given this global reach, we deem it appropriate to anal-
yse death counts from all causes, and not just those directly attributed to Covid-19, as
a proxy of how Italian administrative units, be they municipalities or provinces, have
been affected by the pandemic. This choice is driven by the following considerations:
• Death counts from all causes are, on many levels, high quality data: they have a
very fine spatial and temporal granularity, being collected daily in each Italian
municipality, they are finely stratified in many age classes, and they are not affected
by errors due to incorrect attribution of the cause of death, as may happen, for
example, in deciding whether or not a given death is due to Covid-19;
• They incorporate any possible shock, be it direct or indirect, which the natural
death process underwent: fewer deaths from road accidents due to restrictive poli-
cies, more deaths from other pathologies which are left untreated because of the
unnatural stress on the welfare systems, and so on;
• They are made freely available by ISTAT1, with a substantial amount of historical
data; in particular, in the following analysis we consider data starting from the
beginning of 2011 up to the end of 2020.
The purpose of the analysis of such data is twofold: (1) to study the correlation
structure of the death process in Italy before and during the pandemic, assessing
possible perturbations caused by its outbreak, and (2) to assess local anomalies at
the municipality level (i.e., identifying municipalities which behave unusually with
respect to the surroundings). This talk will be entirely devoted to presenting data and
results concerning people aged between 50 and 69 years. The elderly class was the
focus of [1], while analyses focusing on younger age classes can be freely examined
at https://github.com/RiccardoScimone/Mortality-densities-italy
-analysis.git.
Daily death counts for the 107 Italian provinces, in the time interval spanning
from 2017 to 2020, are shown in Fig. 1: for each province, we draw death counts
along the year in light blue. The black solid line is the weighted mean number of
deaths, where each province has a weight proportional to its population. We also

1 https://www.istat.it/it/archivio/240401

highlight four provinces with colours: Rome, Milan, Naples, and Bergamo. By a
visual inspection, it is easy to see that, during the years 2017, 2018 and 2019, the
mortality in this age class has an almost uniform behaviour, with only a very slight
increase in deaths during winter for some provinces. Conversely, 2020 presents
an abnormal behaviour in many provinces, due to the pandemic outbreak: see, for
example, the double peak for Milan, hit by both pandemic waves, or the single,
dramatically sharp peak of Bergamo, which reached, during the first wave, higher
death counts than provinces several times bigger, such as Rome or Naples. By
comparison with the plots in [1], one can see that all these peaks are less sharp than
for the elderly class: this is perfectly reasonable, since people aged more than 70
years are much more susceptible to death by Covid-19.


Fig. 1 Daily death counts during the last four years, for the Italian provinces. The plots refer to
people aged between 50 and 69 years. For each province, death counts along the year are plotted in
light blue: curves are overlaid one on top of the other to visualize their variability. The black solid
line is the weighted mean number of deaths, where each province has a weight proportional to its
population, while some selected provinces are highlighted in colour.

To set some notation, we denote the available death counts data as 𝑑𝑖𝑦𝑡 , where
𝑖 is a geographical index, identifying provinces or municipalities, 𝑦 is the year and
𝑡 is the day within year 𝑦. Moreover, we denote by 𝑇𝑖𝑦 the absolutely continuous
random variable time of death along the calendar year, that models the instant of
death of a person living in area 𝑖 and passing away during year 𝑦. We hence consider
the empirical discrete probability density of this random variable,

$$p_{iyt} = \frac{d_{iyt}}{\sum_t d_{iyt}} \qquad \text{for } t = 1, \ldots, 365$$

for each area 𝑖 and year 𝑦. The family {𝑝 𝑖𝑦 }𝑖𝑦 is the main focus of our analysis: we
show these discrete densities in Fig. 2, with the same color choices of Fig. 1. It is

clear that using densities provides a natural alignment of areas whose population
differs significantly, providing complementary insights with respect to the absolute
number of death counts: greater emphasis is given to the temporal structure of the
phenomenon. For example, the astonishing behaviour of the province of Bergamo
during the first pandemic wave in 2020 is now much more visible.
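As a purely illustrative aid (not part of the original analysis pipeline), the following Python sketch shows how such empirical densities can be computed from an array of daily death counts; the array dimensions and the synthetic Poisson counts are assumptions made only for the example.

```python
# Toy computation of empirical mortality densities p[i, y, t] = d[i, y, t] / sum_t d[i, y, t].
import numpy as np

rng = np.random.default_rng(0)

n_areas, n_years, n_days = 5, 10, 365          # toy dimensions (107 provinces in the paper)
d = rng.poisson(lam=3.0, size=(n_areas, n_years, n_days)).astype(float)

totals = d.sum(axis=2, keepdims=True)          # yearly death totals per area
p = np.divide(d, totals, out=np.zeros_like(d), where=totals > 0)

# Each p[i, y] sums to one, so areas with very different populations become comparable.
assert np.allclose(p.sum(axis=2), 1.0)
```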


Fig. 2 Empirical densities of daily mortality, for people aged between 50 and 69 years, at the
provincial scale. For each province, the empirical density of the daily mortality is plotted in light
blue: densities are overlaid one on top of the other to visualize their variability. The black solid line
is the weighted mean density, where the weight for each province has been set to be proportional to
its population; some selected provinces are highlighted in colour.

In this talk, we will show results obtained by embedding a smoothed version


of the {𝑝 𝑖𝑦 }𝑖𝑦 , i.e., an estimate { 𝑓𝑖𝑦 }𝑖𝑦 of the continuous density functions of the
{𝑇𝑖𝑦 }𝑖𝑦 , in the Hilbert space 𝐵2 (Θ), called Bayes space [2, 4, 3], where Θ denotes
the calendar year. This is the set (of equivalence classes) of functions

$$B^2(\Theta) = \{\, f : \Theta \to \mathbb{R}^{+} \ \text{s.t.}\ f > 0,\ \log(f) \in L^2(\Theta)\,\}$$

where the equivalence relation in 𝐵2 (Θ) is defined among proportional functions,


i.e., $f =_{B^2} g$ if $f = \alpha g$ for a constant $\alpha > 0$. In [1], we also propose a preliminary
exploration of the {𝑝 𝑖𝑦 }𝑖𝑦 based on the Wasserstein space embedding, a very regular
metric space of probability measures with a straightforward physical interpretation
[5]. For the sake of brevity, we here focus on the analysis in 𝐵2 (Θ), which constitutes
our main contribution.
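To make the Bayes-space operations concrete, the following hedged Python sketch handles densities discretised over the calendar year through the centred log-ratio (clr) transform, a standard device for mapping Bayes-space densities to square-integrable functions; the grid, the toy densities and all function names are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

n_days = 365
t = (np.arange(n_days) + 0.5) / n_days        # mid-points of the 365 "day" cells
dt = 1.0 / n_days

def integral(g):
    return (g * dt).sum()

def clr(f):
    """Centred log-ratio: log f minus its mean over the year (domain of length 1)."""
    logf = np.log(f)
    return logf - integral(logf)

def clr_inv(z):
    """Back-transform to a unit-integral density."""
    f = np.exp(z - z.max())                   # stabilise before normalising
    return f / integral(f)

def b2_mean(F):
    """(Unweighted) B2 mean of a family of densities given as rows of F."""
    return clr_inv(np.mean([clr(f) for f in F], axis=0))

# two toy mortality densities: one flat, one with a sharp "first wave" bump
f_flat = clr_inv(np.zeros(n_days))
f_wave = clr_inv(5.0 * np.exp(-((t - 0.25) / 0.03) ** 2))
f_bar = b2_mean([f_flat, f_wave])

# B2 distance = L2 distance between clr transforms
d_b2 = np.sqrt(integral((clr(f_flat) - clr(f_wave)) ** 2))
print(f"B2 distance between the two toy densities: {d_b2:.3f}")
```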
𝐵2 (Θ) is equipped with a Hilbert geometry, constituted by appropriate operations
of sum, multiplication by a scalar, and inner product, which make it the infinite-
dimensional counterpart of the Aitchison simplex used in standard compositional
analysis [6, 7]: for this reason, this space is considered the best-suited Hilbert
embedding for positive continuous density functions. The smoothed densities { 𝑓𝑖𝑦 }𝑖𝑦


Fig. 3 Smooth estimates of the mortality densities over the 107 Italian provinces. The usual pattern
of mortality is visible till 2019, while the functional process is completely different in 2020, with
the two pandemic waves clearly captured by the estimated densities. The black thick lines represent
the mean density, computed in 𝐵 2 , with weights proportional to the population in each area.

are shown in Fig. 3: they are obtained by smoothing the {𝑝 𝑖𝑦 }𝑖𝑦 via compositional
splines [8, 9]. It is easy to see, by comparison with Fig. 2, how smoothing filters out
a good amount of noise, much more than in the case of the elderly class: this is fairly
reasonable, since the death process is usually noisier for younger age classes.
From now on, the { 𝑓𝑖𝑦 }𝑖𝑦 are analysed as a spatio-temporal functional random sample
taking values in 𝐵2 (Θ). We briefly anticipate the results of such analysis:
1. The { 𝑓𝑖𝑦 }𝑖𝑦 are decomposed, by means of a linear model formulated in 𝐵2 (Θ)
[10], in a predictable and an unpredictable part, on the basis of mortality during
previous years;
2. The unpredictable part is then analysed spatially in order to infer the main
spatial correlation characteristics of the process; in particular, the impacts of
the pandemic are investigated via functional variography [13, 14, 11, 12] and
Principal Component Analysis in the 𝐵2 space (SFPCA, [16]);
3. The results obtained at the provincial level are reduced to the municipality scale
by spatial downscaling [15] techniques, obtaining smooth density estimates
for each municipality. This provides continuous densities at the municipality
level, without directly smoothing the corresponding daily death process, which
is quite irregular due to the reduced population of many municipalities. The
spatial downscaling estimates, which are exclusively based on provincial data, are
then compared with the actual measurements on municipalities, allowing for the
identification of local anomalies.
Points 1 and 2 above are detailed in Section 2, while point 3 will be discussed during
the talk. The reader is referred to [1] for full details on the analysis pipeline.

2 Some Results

The first step of the analysis of the random sample { 𝑓𝑖𝑦 }𝑖𝑦 , where 𝑖 is indexing the
107 Italian provinces, is the formulation of a family of function-on-function linear
models in 𝐵2 (Θ), extending classical models formulated in the 𝐿 2 case [17], namely

$$f_{iy}(t) = \beta_{0y}(t) + \langle \beta_y(\cdot, t), \bar{f}_{iy} \rangle_{B^2} + \epsilon_{iy}(t), \qquad i = 1, \ldots, 107,\ t \in \Theta, \qquad (1)$$

where $\bar{f}_{iy} = \frac{1}{4}\sum_{r=y-4}^{y-1} f_{ir}$ is the $B^2$ mean of the observed densities in the four years
preceding year 𝑦, functional parameters 𝛽0𝑦 (𝑡), 𝛽 𝑦 (𝑠, 𝑡) are defined in the 𝐵2 sense,
as well as the residual terms 𝜖 𝑖𝑦 (𝑡) and all operations of summation and multiplication
by a scalar. Model (1) is trying to explain the realization of the mortality density
𝑓𝑖𝑦 for a year 𝑦 in a province 𝑖 as a linear function of what happened in the same
province during the preceding years. It is thus interesting to look at the following
functional prediction errors:

$$\delta_{iy} = f_{iy} - \hat{f}_{iy} \qquad (2)$$

where

$$\hat{f}_{iy}(t) := \hat{\beta}_{0,y-1}(t) + \langle \hat{\beta}_{y-1}(\cdot, t), \bar{f}_{iy} \rangle_{B^2}. \qquad (3)$$
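For illustration only, a short Python sketch of how a prediction error such as (2) can be evaluated once the densities are available: in the Bayes space the difference of two densities is their normalised ratio, and its $B^2$ norm is the $L^2$ norm of the difference of their clr transforms. The toy densities below are assumptions, not real mortality data.

```python
import numpy as np

n_days = 365
dt = 1.0 / n_days
t = (np.arange(n_days) + 0.5) / n_days

def clr(f):
    logf = np.log(f)
    return logf - (logf * dt).sum()

def normalise(f):
    return f / ((f * dt).sum())

f_obs = normalise(1.0 + 4.0 * np.exp(-((t - 0.25) / 0.03) ** 2))   # observed, with a "first wave" peak
f_hat = normalise(1.0 + 0.5 * np.sin(2.0 * np.pi * t))             # prediction built from past years

# delta = f_obs "minus" f_hat in the Bayes space: the normalised ratio of the two densities
delta = normalise(f_obs / f_hat)
delta_norm = np.sqrt((((clr(f_obs) - clr(f_hat)) ** 2) * dt).sum())
print(f"||delta||_B2 = {delta_norm:.3f}")   # large values flag poorly predictable years (cf. Fig. 4)
```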
The 𝛿𝑖𝑦 are not the estimates 𝜖ˆ𝑖𝑦 of the residuals of model (1): rather, they represent


Fig. 4 First four panels, from the left: heatmaps of the 𝐵 2 norm of the prediction errors 𝛿𝑖𝑦 , in
logarithmic scale, for people aged between 50 and 69 years. In 2020 the pandemic diffusion is clearly visible in northern
Italy, while the prediction errors are generally higher on all provinces. Last panel: result of a
𝐾 -mean 𝐵 2 functional clustering (𝐾 = 3) on the 𝛿𝑖𝑦 , during 2020.

the error committed in forecasting 𝑓𝑖𝑦 using the model fitted at year 𝑦 − 1. Thus,
we can look at the densities 𝛿𝑖𝑦 as the unpredictable component of 𝑓𝑖𝑦 , i.e., as a
proxy of what happened at year 𝑦 which could not be predicted by information
available in the previous years, and analyse them from the spatial viewpoint. For
example, we can look at the spatial heatmaps of the 𝐵2 norms of the 𝛿𝑖𝑦 , which are
shown in Fig. 4. It is clear, by looking at the magnitude of the error norms, that what
happened during 2020 was to a large extent unpredictable, since almost all Italian

provinces are characterized by higher errors with respect to previous years. More
significantly, in 2020 a clear spatial pattern can be noticed, at least during the first
wave in northern Italy: a diffusive process, having at its core the provinces most
gravely hit by the first pandemic wave, seems to take place in northern Italy. This
pattern is, as is reasonable, slightly less evident than in the case of the elderly
class analysed in [1]. Going in this direction, we also show in Fig. 4 the result of
a K-means functional clustering, set in the 𝐵2 space, of the 𝛿𝑖𝑦 for the year 2020.
We clearly identify provinces hit by the first wave (blue cluster), while the other two
clusters behave irregularly: this is a clear contrast with people aged more than 70
years, where each cluster clearly identifies different kinds of pandemic behaviour
(see [1]). For a more precise investigation of the spatial correlation structure of the


Fig. 5 Empirical trace-semivariograms for the prediction errors 𝛿𝑖𝑦 , in people aged between 50
and 69 years. The purple lines are the corresponding fitted exponential models. Distances on the
x-axes are expressed in kilometers. The last panel shows the 2020 severe perturbation of the spatial
dependence structure of the process generating the prediction errors.

process across different years, from the 𝛿𝑖𝑦 we compute a functional trace variogram
for each year: we show them for 2017 up to 2020 in Figure 5. Without entering into
the details of the mathematical definition of variograms, we can look at the fitted
curves in Figure 5 as follows. Distances are on the x-axis, while on the y-axis we
have a function of the spatial correlation of the process: when the curve reaches its
horizontal asymptote, it has reached the total variance of the process and we are
beyond the maximum correlation length. In this perspective, it is immediate to infer
that not only the total variance of the functional process 𝛿𝑖𝑦 has sharply increased

in 2020, but also that a significant spatial correlation has emerged, consistent with
the presence of a pandemic. In the main work [1], we further deepen the connection
between the pandemic and the upheavals in the spatial structure by means of Principal
Component Analysis of the 𝛿𝑖𝑦 in the Bayes space (SFPCA, [16]).

References

1. Scimone, R., Menafoglio, A., Sangalli, L. M., Secchi, P.: A look at the spatio-temporal
mortality patterns in Italy during the COVID-19 pandemic through the lens of mortality
densities. Spatial Stat. (2021) doi:10.1016/j.spasta.2021.100541
2. Egozcue, J., Díaz–Barrero, J., Pawlowsky-Glahn, V.: Hilbert space of probability density
functions based on Aitchison geometry. Acta Mathematica Sinica. 22, 1175-1182 (2006)
3. Pawlowsky-Glahn, V., Egozcue, J., Boogaart, K.: Bayes Hilbert spaces. Aust. New Zeal. J.
Stat. 56, 171-194 (2014)
4. Boogaart, K., Egozcue, J., Pawlowsky-Glahn, V.: Bayes linear spaces. SORT. 34, 201-222
(2010)
5. Villani, C.: Topics in Optimal Transportation. American Mathematical Society (2003)
6. Aitchison, J.: The statistical analysis of compositional data. J. Roy. Stat. Soc. B Stat. Meth.
44, 139-177 (1982)
7. Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, London
(1986)
8. Machalová, J., Hron, K., Monti, G.: Preprocessing of centred logratio transformed density
functions using smoothing splines. J. Appl. Stat. 43 (2015)
9. Machalová, J., Talská, R., Hron, K., Gába, A.: Compositional splines for representation of
density functions. Comput. Stat. 36, 1031-1064 (2021)
10. Talská, R., Menafoglio, A., Machalová, J., Hron, K., Fišerová, E.: Compositional regression
with functional response. Comput. Stat. Data Anal. 123, 66-85 (2018)
11. Menafoglio, A., Petris, G.: Kriging for Hilbert-space valued random fields: The operatorial
point of view. J. Multivariate Anal. 146 (2015)
12. Menafoglio, A., Grujic, O., Caers, J.: Universal kriging of functional data: Trace-variography
vs cross-variography? Application to gas forecasting in unconventional shales. Spatial Stat.
15, 39-55 (2016)
13. Nerini, D., Monestiez, P., Manté, C.: Cokriging for spatial functional data. J. Multivariate
Anal. 101, 409-418 (2010)
14. Menafoglio, A., Secchi, P., Dalla Rosa, M.: A universal kriging predictor for spatially depen-
dent functional data of a Hilbert space. Electronic Journal of Statistics 7, 2209-2240 (2013)
15. Goovaerts, P.: Kriging and semivariogram deconvolution in the presence of irregular geo-
graphical units. Mathematical Geosciences. 40, 101-128 (2008)
16. Hron, K., Menafoglio, A., Templ, M., Hrůzová, K., Filzmoser, P.: Simplicial principal com-
ponent analysis for density functions in Bayes spaces. Comput. Stat. Data Anal. 94, 330-350
(2016)
17. Ramsay, J., Silverman, B.: Functional Data Analysis. Springer (2005)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Clustering Validation in the Context of
Hierarchical Cluster Analysis:
an Empirical Study

Osvaldo Silva, Áurea Sousa, and Helena Bacelar-Nicolau

Abstract The evaluation of clustering structures is a crucial step in cluster analysis.


This study presents the main results of the hierarchical cluster analysis of variables
concerning a real dataset in the context of Higher Education. The goal of this
research is to find a typology of some relevant items taking into account both the
homogeneity and the isolation of the clusters. Two similarity measures, namely the
standard affinity coefficient and Spearman’s correlation coefficient, were used, and
combined with three probabilistic (AVL, AVB and AV1) aggregation criteria, from
a parametric family in the scope of the VL (Validity Link) methodology. The best
partitions were selected based on some validation indices, namely the global STAT
levels statistics and the measures P(I2, Σ) and 𝛾, adapted to the case of similarity
coefficients. In order to evaluate the clusters and identify their most representative
elements, the Mann and Whitney U statistics and the silhouette plot were also used.

Keywords: clustering validation, affinity coefficient, Spearman correlation coeffi-


cient, VL methodology

Osvaldo Silva ( )
Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, 9500-321, Portugal, e-mail:
[email protected]
Áurea Sousa
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
Helena Bacelar-Nicolau
Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health
(ISAMB/FM-UL), Portugal, e-mail: [email protected]

© The Author(s) 2023 343


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_37

1 Introduction

Cluster analysis or unsupervised classification usually concerns exploratory multi-


variate data analysis methods and techniques for grouping either a set of data units
or an associated set of descriptive variables in such a way that elements in the same
group (cluster) are more similar to each other than elements in different clusters [6].
Therefore, it is important to validate the results obtained, bearing in mind that, in
an ideal situation, the clusters should be internally homogeneous and externally well
separated or isolated. Thus, according to Silva et al. ([15], p. 136), there are some
important questions, such as: “i) How to compare partitions obtained using different
cluster algorithms? ii) Is it possible to join information from several approaches in
the decision-making process of choosing the most representative partition?”
This paper presents the main results of a hierarchical cluster analysis of variables
concerning a real dataset in the field of Higher Education, in order to find a typology
taking into account relevant validation measures. Two similarity measures (standard
affinity coefficient and Spearman’s correlation coefficient) were used, and combined
with a parametric family aggregation criteria in the scope of the VL methodology
(e.g., [10, 11, 17]).
With regard to the validation of clustering structures, some validation indices
were used for the evaluation of partitions and the clusters that integrate them, which
are referred to in Section 2. The main results are presented and discussed in Section
3. Section 4 contains some final remarks.

2 Data and Methods

Data were obtained from a questionnaire administered to three hundred and fifty
students who were attending Higher Education in a public university, after their
informed consent. The questionnaire contains, among others, eleven questions related
to academic life and the respective courses.
Several algorithms of hierarchical cluster analysis of variables were applied
on the data matrix. The variables (items) are: T1-Participation, T2-Interest, T3-
Expectations, T4-Accomplishment, T5-Job Outlook, T6- Teachers’ Professional
Competence, T7-Distribution of Curricular Units, T8- Number of weekly hours
of lessons, T9-Number of hours of daily study, T10-School Outcomes and T11-
Assessment Methods, which were evaluated based on a Likert scale from 1 to 5
(1-Totally disagree, 2- Partially disagree, 3- Neither disagree nor agree, 4- Partially
agree, 5- Totally agree).
The Ascendant Hierarchical Cluster Analysis (AHCA) was based on the standard
affinity coefficient [1, 17] and Spearman’s correlation coefficient. In this paper both
measures of comparison were combined with three probabilistic aggregation criteria
(AVL, AVB and AV1), issued from the VL parametric family. This methodology, in the
scope of Cluster Analysis, uses probabilistic comparison functions, between pairs of
elements, which correspond to random variables following a unit uniform distribu-

tion. Besides, this approach considers probabilistic aggregation criteria, which can
be interpreted as distribution functions of statistics of independent random variables,
that are i.i.d. uniform on [0, 1] (e.g., [17]).
Let A and B be two clusters with cardinals, respectively, 𝛼 and 𝛽, and let 𝛾 𝑥 𝑦
be a similarity measure between pairs of elements, 𝑥, 𝑦 ∈ 𝐸 (set of elements to
classify). Concerning the family I of AVL methods (e.g., SL, AV1, AVB, and AVL),
the comparison functions between clusters can be summarized by the following
conjoined formula:

$$\Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (1)$$

where $\alpha = Card\ A$, $\beta = Card\ B$, $p_{AB} = \max[\gamma_{ab} : (a, b) \in A \times B]$, and $1 \leq g(\alpha, \beta) \leq \alpha\beta$; the exponent $g(\alpha, \beta)$ establishes a bridge between the SL and the AVL methods, the latter having a braking effect on the formation of chains. For example, $g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = (\alpha + \beta)/2$ for AV1, $g(\alpha, \beta) = \sqrt{\alpha\beta}$ for AVB, and $g(\alpha, \beta) = \alpha\beta$ for AVL (see [3, 17]).
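A minimal Python sketch of formula (1), assuming a toy probabilistic similarity matrix; it is meant only to illustrate how the choice of $g(\alpha, \beta)$ differentiates the SL, AV1, AVB and AVL criteria.

```python
import numpy as np

def gamma_vl(sim, A, B, method="AVL"):
    """Comparison function between clusters A and B (lists of element indices)."""
    alpha, beta = len(A), len(B)
    p_ab = max(sim[a, b] for a in A for b in B)      # maximal pairwise similarity
    g = {"SL": 1.0,
         "AV1": (alpha + beta) / 2.0,
         "AVB": np.sqrt(alpha * beta),
         "AVL": alpha * beta}[method]
    return p_ab ** g

# toy probabilistic similarity matrix (values in [0, 1], symmetric)
sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])

A, B = [0, 1], [2, 3]
for m in ("SL", "AV1", "AVB", "AVL"):
    print(m, round(gamma_vl(sim, A, B, m), 4))
```

The larger the exponent $g(\alpha, \beta)$, the smaller $\Gamma(A, B)$ becomes for the same $p_{AB}$, which is precisely the braking effect on chaining mentioned above.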
The application of the two measures of comparison between elements (Spearman
correlation coefficient and standard affinity coefficient), combined with the afore-
mentioned aggregation criteria, aims to find a typology of items corresponding to
the best partition among the best partitions obtained by the several algorithms, in
order to verify if there are any substantial changes in the results. Therefore, some
validation indices based on the values of the corresponding proximity matrices were
used, namely the global levels statistics (STAT) [1, 10, 11] and the indices P(I2mod,
Σ) and 𝛾 [8], adapted to this type of matrices [16], so that the choice of the best
partition is judicious and based on the desirable properties (e.g., isolation and homo-
geneity of the clusters). Concerning the best partitions, the respective clusters and
the identification of their most representative elements were based on appropriate
adaptations of the Mann and Whitney U statistics [8] and of the silhouette plots [14]
to the case of similarity measures.
Each level of a dendrogram corresponds to a stage in the constitution of the
partitions hierarchy. Therefore, the study of the most relevant partition(s) is strictly
related to the choice of the best cut-off levels (e.g., [6, 5]).
According to Bacelar-Nicolau [1, 2], the global levels statistics (STAT) values
must be calculated for each of the $k = 1, \ldots, nivmax$ levels of the corresponding
dendrograms, designating them by $STAT(k)$. At each level $k$, $STAT(k)$ is the global
statistics that measures the total information given by the pre-order associated to
the corresponding partition, in relation to the initial pre-order associated with the
similarity or dissimilarity measure. A “significant” level is considered to be one that
corresponds to a partition for which the global statistics undergoes a significant in-
crease in relation to the information provided by neighbouring levels, that is, a local
maximum of the differences $DIF(k) = STAT(k) - STAT(k-1)$, $k = 1, \ldots, nivmax$.
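As a hedged illustration (the STAT values below are invented purely to show the rule), candidate cut-off levels can be located as local maxima of these successive differences:

```python
import numpy as np

stat = np.array([0.0, 1.2, 2.1, 4.0, 4.3, 5.3, 5.4, 5.45, 5.5])    # STAT(k), k = 1..nivmax
dif = np.diff(stat, prepend=stat[0])                                 # DIF(k) = STAT(k) - STAT(k-1)

# local maxima of DIF(k), reported as 1-based level numbers
significant = [k + 1 for k in range(1, len(dif) - 1)
               if dif[k] > dif[k - 1] and dif[k] >= dif[k + 1]]
print("Candidate cut-off levels:", significant)
```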

2.1 Adaptation of the P (I2, 𝚺)

To evaluate the partitions, an appropriate adaptation of the index P (I2, Σ) [8] for the
case of similarity measures was used, given by the following formula:

$$P(I2mod, \Sigma) = \frac{1}{c} \sum_{r=1}^{c} \frac{\sum_{i \in C_r} \sum_{j \notin C_r} s_{ij}}{n_r \times (N - n_r)} \qquad (2)$$
where 𝑐 is the number of clusters of the partition and 𝑠𝑖 𝑗 is the value of the similarity
measure between the element 𝑖 belonging to cluster 𝐶𝑟 and the element 𝑗 belonging
to another cluster. This index takes into account the number of clusters and the
number of elements in each of the clusters and evaluates the isolation of clusters
belonging to a given partition.
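A small illustrative sketch of (2), assuming a toy similarity matrix and partition (all values are invented):

```python
import numpy as np

def p_i2mod(sim, partition):
    """partition: list of lists of element indices; sim: N x N similarity matrix."""
    n = sim.shape[0]
    total = 0.0
    for cluster in partition:
        inside = np.zeros(n, dtype=bool)
        inside[cluster] = True
        between = sim[np.ix_(inside, ~inside)].sum()   # similarities leaving the cluster
        total += between / (len(cluster) * (n - len(cluster)))
    return total / len(partition)

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
print(round(p_i2mod(sim, [[0, 1], [2, 3]]), 4))   # low values indicate well-isolated clusters
```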

2.2 Goodman and Kruskal Index (𝜸 )

The 𝛾 index, proposed by Goodman and Kruskal [7], has been widely used in cluster
validation [9]. Comparisons are developed between all within-cluster similarities,
𝑠𝑖 𝑗 and all between-cluster similarities 𝑠 𝑘𝑙 [18]. A comparison is judged concordant
(respectively discordant) if 𝑠𝑖 𝑗 is strictly greater (respectively, smaller) than 𝑠 𝑘𝑙 . The
𝛾 index is defined by:
𝛾 = (𝑆+ − 𝑆− )/(𝑆+ + 𝑆− ), (3)
where 𝑆+ (or 𝑆− ) is the number of concordant (respectively, discordant) comparisons.
This index is a global stopping rule and it evaluates the fit of the partition in c clusters
based on the homogeneity (high similarity between the elements within the clusters)
and the isolation (low similarity of the elements between the clusters) of the clusters.
Note that the higher the value of this index, the better the fit of that partition.
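An illustrative sketch of (3) for similarity data (the similarity matrix and labels are toy assumptions):

```python
import numpy as np
from itertools import combinations

def gk_gamma(sim, labels):
    """Goodman-Kruskal gamma from within- vs. between-cluster similarities."""
    labels = np.asarray(labels)
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(sim[i, j])
    s_plus = sum(w > b for w in within for b in between)
    s_minus = sum(w < b for w in within for b in between)
    return (s_plus - s_minus) / (s_plus + s_minus)

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
print(gk_gamma(sim, [0, 0, 1, 1]))   # 1.0: every within similarity exceeds every between similarity
```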
The use of STAT, 𝛾 and P(I2mod, Σ) indices can help identifying the most
significant levels of a dendrogram, taking into account both the homogeneity and
the isolation of the clusters [15].

2.3 U Statistics (Mann and Whitney)

U statistics [12] are relevant for assessing the suitability of a cluster, combining the
concepts of compactness and isolation. Thus, the “best” cluster is the one with the
lowest values of global U-index, 𝑈𝐺 , and local U-index, 𝑈 𝐿 [8]. In the present paper
we used an appropriate adaptation of these indices to the case of similarity measures
(for details, see [19]). Moreover, the clusters considered “ideal” are those for which
𝑈𝐺 and 𝑈 𝐿 both take the value zero. Mann and Whitney’s U statistics are useful in

decision making, in situations of uncertainty, both for the evaluation of the clusters
and partitions.

2.4 Silhouette Plots

We also used an appropriate adaptation of the silhouette plots [14], which allows
the assessment of compactness and relative isolation of clusters. The adaptation of
this measure for the case of similarity measures, 𝑆𝑖𝑙 (𝑖), considers the average of the
similarities between an element i belonging to cluster 𝐶𝑟 , which contains 𝑛𝑟 (≥ 2)
elements, and all other elements that do not belong to this cluster (see [19]). The
values of this measure {𝑆𝑖𝑙 (𝑖) : 𝑖 ∈ 𝐶𝑟 } lie between −1 and +1, with “values near +1
indicating that element strongly belongs to the cluster in which it has been placed”
([8], p. 205). In the case of a singleton cluster, 𝑆𝑖𝑙 (𝑖) assumes the value zero [8] in
the corresponding algorithm.
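As a hedged illustration of this kind of adaptation (the exact definition used in [19] may differ in its details), a silhouette-type measure for similarity data can be sketched with $a_i$ the average similarity of element $i$ to its own cluster and $b_i$ the largest average similarity to any other cluster:

```python
import numpy as np

def sil_similarity(sim, labels):
    labels = np.asarray(labels)
    sil = np.zeros(len(labels))
    for i in range(len(labels)):
        own = (labels == labels[i])
        own[i] = False
        if not own.any():                     # singleton clusters get Sil(i) = 0
            continue
        a_i = sim[i, own].mean()
        b_i = max(sim[i, labels == g].mean() for g in np.unique(labels) if g != labels[i])
        sil[i] = (a_i - b_i) / max(a_i, b_i)
    return sil

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
print(np.round(sil_similarity(sim, [0, 0, 1, 1]), 3))
```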

3 Results and Discussion

The best partitions provided by the dendrograms are shown in Table 1.

Table 1 The best partitions concerning the dendrograms.


Coefficient Method The best partition Validation indices

Affinity AVL (T1, T3, T4, T5 ,T6, T7, T8, T10, T11), (T2, T9) STAT=5.1301
𝛾= 0.8589
P(I2mod,Σ)=0.2077

AV1/AVB (T1, T3, T4 , T5, T6, T7, T8, T10, T11), (T2), (T9) STAT=5.3453
𝛾= 0.8830
P(I2mod,Σ)=0.2049
Spearman AVL (T3, T4 ,T2 , T9) (T7, T11, T8), (T6, T10), (T1), (T5) STAT=4.0152
𝛾= 0.8178
P(I2mod,Σ)=0.3896

AV1/AVB (T3, T4 ,T2 , T9, T6 ) (T7, T11, T8), (T1, T10), (T5) STAT=4.05751
𝛾= 0.7317
P(I2mod,Σ)=0.38177

Figure 1 shows the dendrograms obtained, respectively, by the standard affinity coefficient (left side) and Spearman's correlation coefficient (right side), both combined with the AVL method.

Fig. 1 Dendrograms based on standard affinity coefficient (left side) and Spearman’s correlation
coefficient (right side) - AVL.

The “best” partition obtained using the affinity coefficient and the AVL method is
the partition into two clusters (level 9 of the aggregation process). The first cluster
consists of nine items that highlight the importance of the teachers’ professional
competence, the structuring/content of the course and the future perspectives in
relation to the career opportunities, mostly factors exogenous to the students. The
second one is composed of two items (T2 and T9), which emphasize the role of
interest in the study of Mathematics.
The algorithms in which the standard affinity coefficient was used are the ones that
provided the best partitions and their hierarchies are the ones that remained closest
to the initial pre-orders. In fact, in the case of Spearman correlation coefficient the
values of STAT and 𝛾 indices are clearly lower than the previous ones. Moreover,
the cluster {T1, T3, T4, T5, T6, T7, T8, T10, T11}, corresponding to the best
partition provided by the combination of the standard affinity coefficient with the
aggregation criteria AVL, AV1 and AVB, presents 𝑈𝐺 = 39 and 𝑈 𝐿 = 4, both lower than
those obtained for the cluster {T3, T4, T2, T9, T6} (𝑈𝐺 =65 and 𝑈 𝐿 =26) provided
by the Spearman correlation coefficient combined, respectively, with AV1 and AVB
methods.
Focusing the attention on the two first partitions of Table 1, the only difference
between them is that while the best partition provided by AV1 and AVB methods
contains the singletons T2 and T9, the best partition given by AVL joins these two
singletons in the same cluster. The values of the numerical validation indices shown
in Table 1 indicate that the best partition is the one provided by AV1 and AVB
methods. This conclusion is reinforced by the observation of the silhouette plot (see
Figure 2), which indicates that the cluster joining T2 and T9, given by the AVL method, includes the elements which have the two lowest values of Sil, and Sil(T2) is negative (i.e., T2 does not fit very well in this cluster). Note that the silhouette plot cannot be used for the best partition, since it does not apply to singletons.

Fig. 2 Silhouette plot - standard affinity coefficient and AVL method.

4 Final Remarks

This research was useful for the identification of relevant partitions of items
in the context of Higher Education. In the cases where the affinity and the Spearman
correlation coefficients were used, it was concluded that the probabilistic criteria AV1
and AVB showed a higher agreement regarding the hierarchies of partitions obtained
than the AVL method.
The validation measures STAT, 𝛾 and P(I2mod, Σ) help us to determine the best
cut-off levels of a hierarchy of clusters, taking into account both the homogeneity
and the isolation of the clusters. It should also be noted that if there is no absolute
consensus between these three measures, the Mann and Whitney U statistics and the
silhouette plot prove to be very useful, as we have seen with the application of this
methodology to evaluate both the clusters and the partitions obtained.

Acknowledgements Funding. This work is financed by national funds through FCT – Founda-
tion for Science and Technology, I.P., within the scope of the project «UIDB/04647/2020» of
CICS.NOVA – Centro de Ciências Sociais da Universidade Nova de Lisboa.

References

1. Bacelar-Nicolau, H.: Analyse d'un Algorithme de Classification Automatique. Thèse de 3ème Cycle. ISUP, Paris VI (1972)
2. Bacelar-Nicolau, H.: Contributions to the Study of Comparison Coefficients in Cluster Anal-
ysis (in Portuguese). Univ. Lisbon (1980)
3. Bacelar-Nicolau, H.: On the distribution equivalence in cluster analysis. In: P. A., Devijver,
& J. Kittler (eds.) Pattern Recognition Theory and Applications, NATO ASI Series, Series F.
Computer and Systems Sciences, vol. 30, pp. 73-79. Springer - Verlag, New York (1987)
4. Bacelar-Nicolau, H., Nicolau, F. C., Sousa, Á., Bacelar-Nicolau, L.: Clustering of variables
with a three-way approach for health sciences.Testing, Psychometrics, Methodology in Applied
Psychology (TPM) (2014) doi: 10.4473/TPM21.4.56
5. Benzécri, J. P.: Analyse Factorielle des Proximités. Publication de l’Institut de Statistique de
l’ Universite de Paris (ISUP), XIII et XIV (1965)
6. De La Vega, W.: Techniques de la classification automatique utilisant un índice de ressem-
blance. Revue Française de Sociologie. VIII, 506–520 (1967)
7. Goodman, L. A., Kruskal, W. H.: Measures of association for cross-classifications. Journal of
the American Statistical Association. 49, 732–764 (1954)
8. Gordon, A. D.: Classification, 2nd Ed. Chapman & Hall, London (1999)
9. Hubert, L. J.: Some applications of graph theory to clustering. Psychometrika 39(3), 283–309
(1974) doi: 10.1007/BF02291704
10. Lerman, I. C.: Classification et Analyse Ordinale des Données. Dunod, Paris (1981)
11. Lerman, I. C.: Foundations and Methods in Combinatorial and Statistical Data Analysis and
Clustering. Series: Advanced Information and Knowledge Processing. Springer-Verlag, Boston
(2016)
12. Mann, H. B., Whitney, D. R.: On a test of whether one of two random variables is stochastically
larger than the other. Annals of Mathematical Statistics, 50–60 (1947)
13. Nicolau, F. C., Bacelar-Nicolau, H.: Some trends in the classification of variables. In: Hayashi
et al. (eds.) Data Science, Classification and Related Methods, pp. 89-98. Springer,Tokyo
(1998)
14. Rousseeuw, P. J.: Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. Journal of Computation and Applied Mathematics. 20, 53–65 (1987)
15. Silva, O., Bacelar-Nicolau, H.; Nicolau, F.: A global approach to the comparison of clustering
results. Biometrical Letters 49(2), 135–147 (2013) doi: 10.2478/bile-2013-0010
16. Silva, O., Bacelar-Nicolau, H., Nicolau, F. C., Sousa, Á.: Probabilistic approach for comparing
partitions. In: Manca, R., McClean, S., Skiadas, C. H.(eds.) New Trends in Stochastic Modeling
and Data Analysis, pp. 113-122. ISAST (International Society for the Advancement of Science
and Technology), Athens (2015)
17. Sousa, Á., Silva, O., Bacelar-Nicolau, H., Nicolau, F. C.: Distribution of the affinity coefficient
between variables based on the Monte Carlo simulation method. Asian Journal of Applied
Sciences. 1(5), 236–245 (2013a)
18. Sousa, Á., Tomás, L., Silva, O., Bacelar-Nicolau, H.: Symbolic data analysis for the assessment
of user satisfaction: an application to reading rooms services. European Scientific Journal
(ESJ). Special/Edition 3, 39–48 (2013b)
19. Sousa, Á., Nicolau, F., Bacelar-Nicolau, H., Silva, O.: Cluster analysis using affinity coefficient
in order to identify religious beliefs profiles. European Scientific Journal (ESJ). Special/Edition
3, 252–261 (2014)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
An MML Embedded Approach for Estimating
the Number of Clusters

Cláudia Silvestre, Margarida G. M. S. Cardoso, and Mário Figueiredo

Abstract Assuming that the data originate from a finite mixture of multinomial
distributions, we study the performance of an integrated Expectation Maximization
(EM) algorithm considering the Minimum Message Length (MML) criterion to select
the number of mixture components. The referred EM-MML approach, rather than
selecting one among a set of pre-estimated candidate models (which requires run-
ning EM several times), seamlessly integrates estimation and model selection in a
single algorithm. Comparisons are provided with EM combined with well-known
information criteria – e.g., the Bayesian Information Criterion. We resort to synthetic
data examples and a real application. The EM-MML computation time is a clear ad-
vantage of this method; also, the real data solution it provides is more parsimonious,
which reduces the risk of model order overestimation and improves interpretability.

Keywords: finite mixture model, EM algorithm, model selection, minimum mes-


sage length, categorical data

1 Introduction

Clustering is a technique commonly used in several research and application areas.


Most of the clustering techniques are focused on numerical data. In fact, clustering

Cláudia Silvestre ( )
Escola Superior de Comunicação Social, Campus de Benfica do IPL 1549-014 Lisboa, Portugal,
e-mail: [email protected]
Margarida G. M. S. Cardoso
BRU-UNIDE, ISCTE-IUL, Av. das Forças Armadas, 1649-026 Lisboa, Portugal,
e-mail: [email protected]
Mário Figueiredo
Instituto de Telecomunicações, Portugal, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal,
e-mail: [email protected]

© The Author(s) 2023 353


P. Brito et al. (eds.), Classification and Data Science in the Digital Age,
Studies in Classification, Data Analysis, and Knowledge Organization,
https://doi.org/10.1007/978-3-031-09034-9_38

methods for categorical data are more challenging [12] and there are fewer techniques
available [11].
In order to determine the number of clusters, model-based approaches commonly
resort to information-based criteria e.g., the Bayesian Information Criterion (BIC)
[15] or the Akaike Information Criterion (AIC) [1]. These criteria look for a balance
between the model’s fit to the data (which corresponds to maximizing the likelihood
function) and parsimony (using penalties associated with measures of model com-
plexity), thus trying to avoid over-fitting. The use of information criteria follows the
estimation of candidate finite mixture models for which a predetermined number
of clusters is indicated, generally resorting to an EM (Expectation Maximization)
algorithm [7]. In this work, we focus on determining the number of clusters while
clustering categorical data, using an EM embedded approach to estimate the number
of clusters. This approach does not rely on selecting among a set of pre-estimated
candidate models, but rather integrates estimation and model selection in a single
algorithm. Our new implementation to deal with categorical variables by estimating
a finite mixture of multinomials, follows a previous version described in [16]. We
capitalized on the work of Figueiredo and Jain [9] for clustering continuous data and
extended it for dealing with categorical data. The embedded method is thus based on
a Minimum Message Length (MML) criterion to select the number of clusters and
on an EM algorithm to estimate the model parameters.

2 Clustering with Finite Mixture Models

The literature on finite mixture models and their application is vast, including some
books covering theory, geometry, and applications [8, 13, 3]. When applying finite
mixture models to social sciences, the analyst is often confronted with the need to
uncover sub-populations based on qualitative indicators.

2.1 Definitions and Concepts

Let $\mathcal{Y} = \{y_i, i = 1, \ldots, n\}$ be a set of $n$ independent and identically distributed (i.i.d.) observations of a random vector $Y = [Y_1, \ldots, Y_L]'$. We assume $Y$ follows a mixture of $K$ component densities, $f(y|\theta_k)$ $(k = 1, \ldots, K)$, with probabilities $\{\alpha_1, \ldots, \alpha_K\}$, where $\theta_k$ are the distributional parameters defining the $k$-th component and $\Theta = \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$ is the set of all the parameters of the model. The $\alpha$ values, also called mixing probabilities, are subject to the usual constraints: $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \geq 0$, $k = 1, \ldots, K$. The log-likelihood of the observed set of sample observations is

$$\log f(\mathcal{Y}|\Theta) = \log \prod_{i=1}^{n} f(y_i|\Theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k\, f(y_i|\theta_k). \qquad (1)$$

In clustering, the identity of the component that generated each sample observa-
tion is unknown. The observed data Y is therefore regarded as incomplete, where
the missing data is a set of indicator variables Z = {𝑧1 , ..., 𝑧 𝑛 }, each taking the
form $z_i = [z_{i1}, \ldots, z_{iK}]'$, where $z_{ik}$ is a binary indicator: $z_{ik}$ takes the value 1 if the observation $y_i$ was generated by the $k$-th component, and 0 otherwise. It is usually assumed that the $\{z_i, i = 1, \ldots, n\}$ are i.i.d., following a multinomial distribution of $K$ categories, with probabilities $\{\alpha_1, \ldots, \alpha_K\}$. The log-likelihood of the complete data $\{\mathcal{Y}, \mathcal{Z}\}$ is given by

$$\log f(\mathcal{Y}, \mathcal{Z}|\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log\left[\alpha_k\, f(y_i|\theta_k)\right]. \qquad (2)$$

2.2 Discrete Finite Mixture Models

Consider that each variable in Y, 𝑌𝑙 (𝑙 = 1, . . . , 𝐿) can take one of 𝐶𝑙 categories.


Conditionally on having been generated by the k-th component of the mixture,
each 𝑌𝑙 is thus modeled by a multinomial distribution with 𝑛𝑙 trials, 𝐶𝑙 categories,
and non-negative parameters $\theta_{kl} = \{\theta_{klc}, c = 1, \ldots, C_l\}$, with $\sum_{c=1}^{C_l} \theta_{klc} = 1$. For a sample $y_{il}$ $(i = 1, \ldots, n)$ of $Y_l$, we denote as $y_{ilc}$ the number of outcomes in category $c$, which is a sufficient statistic; naturally, $\sum_{c=1}^{C_l} y_{ilc} = n_l$. Thus, with $\theta_k = \{\theta_{k1}, \ldots, \theta_{kL}\}$ and $\Theta = \{\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K\}$, we obtain the log-likelihood function for a set of observations from a discrete finite mixture model (a mixture of multinomials). This log-likelihood can be seen as corresponding to a missing-data problem, where the missing data has exactly the same meaning and structure as above. The log-likelihood of the complete data $\{\mathcal{Y}, \mathcal{Z}\}$ is thus given by

$$\log p(\mathcal{Y}, \mathcal{Z}|\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log\left( \alpha_k \prod_{l=1}^{L}\left[ n_l!\, \prod_{c=1}^{C_l} \frac{(\theta_{klc})^{y_{ilc}}}{y_{ilc}!}\right]\right). \qquad (3)$$
To obtain a maximum-likelihood (ML) or maximum a posteriori (MAP) estimate
of the parameters of a multinomial mixture, the well-known EM algorithm is usually
the tool of choice [7].

3 Model Selection for Categorical Data

Model selection is an important problem in statistical analysis [6]. In model-based


clustering, the term model selection usually refers to the problem of determining
the number of clusters, although it may also refer to the problem of selecting the
structure of the clusters. Model-based clustering provides a statistical framework to
solve this problem usually resorting to information criteria. Among the best-known
information criteria we find BIC and AIC, their modifications - namely the consistent

AIC, (CAIC) and the Modified AIC (MAIC) - and also the Integrated Completed
Likelihood (ICL) [14, 4]. They are all easily implemented, the final model being
selected according to a compromise between its fit to data and its complexity.
In this work, we use the Minimum Message Length (MML) criterion to choose
the number of components of a mixture of multinomials. MML is based on the
information-theoretic view of estimation and model selection, according to which an
adequate model is one that allows a short description of the observations. MML-type
criteria evaluate statistical models according to their ability to compress a message
containing the data, looking for a balance between choosing a simple model and
one that describes the data well. According to Shannon’s information theory, if 𝑌 is
some random variable with probability distribution 𝑝(𝑦|Θ), the optimal code-length
(in an expected value sense) for an outcome 𝑦 is 𝑙 (𝑦|Θ) = − log2 𝑝(𝑦|Θ), measured
in bits (from the base-2 logarithm). If Θ is unknown, the total code-length function
has two parts: 𝑙 (𝑦, Θ) = 𝑙 (𝑦|Θ) + 𝑙 (Θ); the first part encodes the outcome 𝑦, while
the second part encodes the parameters of the model. The first part corresponds to the
fit of the model to the data (better fit corresponds to higher compression), while the
second part represents the complexity of the model. The message length function
for a mixture of distributions (as developed in [2]) is:
$$l(y, \Theta) = -\log p(\Theta) - \log p(y|\Theta) + \frac{1}{2}\log |I(\Theta)| + \frac{C}{2}\left(1 - \log(12)\right), \qquad (4)$$
where $p(\Theta)$ is a prior distribution over the parameters, $p(y|\Theta)$ is the likelihood function of the mixture, $|I(\Theta)| \equiv \big| -E\big[\frac{\partial^2}{\partial \Theta^2} \log p(Y|\Theta)\big] \big|$ is the determinant of the expected Fisher information matrix, and $C$ is the number of parameters of the model that need to be estimated. For example, for the mixture of $K$ multinomial distributions presented in (3), $C = (K - 1) + K \sum_{l=1}^{L} (C_l - 1)$. The expected Fisher
information matrix of a mixture leads to a complex analytical form of MML which
cannot be easily computed. To overcome this difficulty, Figueiredo and Jain [9]
replace the expected Fisher information matrix by its complete-data counterpart $I_c(\Theta) \equiv -E\big[\frac{\partial^2}{\partial \theta^2} \log p(Y, Z|\Theta)\big]$. Also, they adopt independent Jeffreys' priors for the mixture parameters, which are proportional to the square root of the determinant of the Fisher information matrix. The resulting message length function is

$$l(y, \Theta) = \frac{M}{2} \sum_{k:\, \alpha_k > 0} \log\left(\frac{n\,\alpha_k}{12}\right) + \frac{k_{nz}}{2}\log\frac{n}{12} + \frac{k_{nz}(M + 1)}{2} - \log p(y, \Theta) \qquad (5)$$

where 𝑀 is the number of parameters specifying each component (the dimension


of each 𝜃 𝑘 ) and 𝑘 𝑛𝑧 the number of components with non-zero probability (for more
details on the derivation of (5), see [9, 2]).
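The following short Python sketch merely evaluates formula (5), taking the log-likelihood as an input; the sample size, the per-component dimension $M$ and the log-likelihood value are toy assumptions.

```python
import numpy as np

def mml_length(alphas, n, M, loglik):
    """Message length (5): alphas = mixing probabilities, n = sample size,
    M = free parameters per component, loglik = log-likelihood of the mixture."""
    alphas = np.asarray(alphas, dtype=float)
    nz = alphas[alphas > 0]
    k_nz = len(nz)
    return (M / 2.0 * np.sum(np.log(n * nz / 12.0))
            + k_nz / 2.0 * np.log(n / 12.0)
            + k_nz * (M + 1) / 2.0
            - loglik)

# toy example: two surviving components, three binary items, so M = 3 free parameters each
print(round(mml_length([0.6, 0.4, 0.0], n=500, M=3, loglik=-950.0), 2))
```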

4 The MML Based EM Algorithm

In order to estimate a mixture of multinomials, we use a variant of the EM algorithm


(herein termed EM-MML), which integrates both estimation and model selection, by
directly minimizing (5). The algorithm results from observing that (5) contains, in
addition to the log-likelihood term, an explicit penalty on the number of components
(the two terms proportional to 𝑘 𝑛𝑧 ), and a term (the first one) that can be seen as a
log-prior on the 𝛼 𝑘 parameters of Θ, that will directly affect the M-step.
E-step: The E-step of the EM-MML is precisely the same as in the case of ML or
MAP estimation, since the generative model for the data is the same. Since we are
dealing with a multinomial mixture, we simply have to plug the corresponding
multinomial probability function, yielding

$$\bar{z}_{ik}^{(t)} = \frac{\alpha_k \prod_{l=1}^{L}\left[ n_l!\, \prod_{c=1}^{C_l} \frac{(\hat{\theta}_{klc}^{(t)})^{y_{ilc}}}{y_{ilc}!}\right]}{\sum_{j=1}^{K} \alpha_j \prod_{l=1}^{L}\left[ n_l!\, \prod_{c=1}^{C_l} \frac{(\hat{\theta}_{jlc}^{(t)})^{y_{ilc}}}{y_{ilc}!}\right]}, \qquad (6)$$

for 𝑖 = 1, . . . , 𝑛 and 𝑘 = 1, . . . , 𝐾.
M-step: For the M-step, noticing that the first term in (5) can be seen as the negative log-prior $-\log p(\alpha_k) = \frac{C-K+1}{2K}\log \alpha_k$ (plus a constant), and enforcing the conditions that $\alpha_k \geq 0$, for $k = 1, \ldots, K$, and that $\sum_{k=1}^{K} \alpha_k = 1$, yields the following updates for the estimates of the $\alpha_k$ parameters:

$$\hat{\alpha}_k^{(t+1)} = \frac{\max\left\{0,\ \sum_{i=1}^{n} \bar{z}_{ik}^{(t)} - \frac{C-K+1}{2K}\right\}}{\sum_{j=1}^{K} \max\left\{0,\ \sum_{i=1}^{n} \bar{z}_{ij}^{(t)} - \frac{C-K+1}{2K}\right\}}, \qquad (7)$$

for $k = 1, \ldots, K$. Notice that some $\hat{\alpha}_k^{(t+1)}$ may be zero; in that case, the $k$-th component is excluded from the mixture model. The multinomial parameters corresponding to components with $\hat{\alpha}_k^{(t+1)} = 0$ need not be further calculated, since these components do not contribute to the likelihood. For the components with non-zero probability, $\hat{\alpha}_k^{(t+1)} > 0$, the estimates of the multinomial parameters are updated to their standard weighted ML estimates:

$$\hat{\theta}_{klc}^{(t+1)} = \frac{\sum_{i=1}^{n} \bar{z}_{ik}^{(t)}\, y_{ilc}}{n_l \sum_{i=1}^{n} \bar{z}_{ik}^{(t)}}, \qquad (8)$$

for $k = 1, \ldots, K$, $l = 1, \ldots, L$, and $c = 1, \ldots, C_l$. Notice that, in accordance with the meaning of the $\theta_{klc}$ parameters, $\sum_{c=1}^{C_l} \hat{\theta}_{klc}^{(t+1)} = 1$.
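To fix ideas, the following much-simplified Python sketch runs plain EM updates with the annihilation rule of (7) on toy one-hot categorical data (each variable observed once per case, so the multinomial coefficients in (6) cancel). It is an illustration under our own assumptions, not the authors' implementation, which includes further refinements described in [9].

```python
import numpy as np

rng = np.random.default_rng(1)

# toy one-hot data generated from 2 "true" components
n, L, C, K = 300, 4, 3, 6                     # cases, variables, categories, initial components
true_theta = rng.dirichlet(np.ones(C), size=(2, L))
z_true = rng.integers(0, 2, size=n)
y = np.zeros((n, L, C))
for i in range(n):
    for l in range(L):
        y[i, l, rng.choice(C, p=true_theta[z_true[i], l])] = 1.0

M = L * (C - 1)   # free multinomial parameters per component; M/2 = (C-K+1)/(2K) in (7), since C = (K-1) + K*M
alpha = np.full(K, 1.0 / K)
theta = rng.dirichlet(np.ones(C), size=(K, L))

for _ in range(300):
    # E-step, eq. (6): responsibilities from the current alpha and theta
    log_r = np.log(np.clip(alpha, 1e-300, None)) + np.einsum('ilc,klc->ik', y, np.log(theta))
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step, eq. (7): the max(0, . - M/2) rule can annihilate weakly supported components
    w = np.maximum(0.0, r.sum(axis=0) - M / 2.0)
    alpha = w / w.sum()

    # M-step, eq. (8): weighted ML update of theta for surviving components only
    for k in np.flatnonzero(alpha > 0):
        num = np.einsum('i,ilc->lc', r[:, k], y) + 1e-10
        theta[k] = num / num.sum(axis=1, keepdims=True)

print("mixing probabilities:", np.round(alpha, 3))
print("surviving components:", int(np.count_nonzero(alpha)))
```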

5 Data Analysis and Results

First, we evaluate the performance of the EM-MML algorithm on 10 synthetic data


sets, over 50 runs. The data sets were generated from a mixture of 3 categorical
variables (with 2, 3 and 4 levels) and 2 components. The corresponding Silhouette
index values illustrate the diversity of the structures: 0.099; 0.216; 0.217; 0.230; 0.713;
0.733; 0.746; 0.778; 0.805; 0.817. The obtained results are compared with those
obtained from a standard EM algorithm combined with BIC, AIC, CAIC, MAIC,
and ICL criteria.
The comparison resorts to a cohesion-separation measure and a concordance
measure: the Fuzzy Silhouette index [5] of the clustering structure obtained and the
Adjusted Rand index [10] between the same clustering structure and the original one. In
Table 1 we can verify there are no significant differences between the EM-MML and
the other criteria, except ICL which only recovers the very well separated structures.
Regarding the number of clusters, EM-MML and MAIC are tied, recovering this
number correctly for all data sets. The same is not true for the other criteria: AIC
identifies 3 clusters in 3 data sets and 4 clusters once; in addition, BIC and CAIC
could not find any cluster structure once and ICL was unable to do it for 4 data sets. In
terms of computation time, since EM-MML does not require a sequential approach,
it becomes clearly faster than the other criteria (Friedman test yields 𝜒2 (5)=2500
and p-value<0.01; Post hoc tests, with Bonferroni correction, only reveal statistically
significant differences between the EM-MML and the other criteria).

Table 1 Criteria performance.


Criterion Number of Fuzzy Silhouette: 95% CI Adjusted Rand: 95% CI
data sets Lower ; Upper Limits𝑎 Lower ; Upper Limits𝑎

AIC 10 0.430 ; 0.741 0.545 ; 0.867


BIC 9 0.622 ; 0.935 0.728 ; 1.000
CAIC 9 0.616 ; 0.931 0.732 ; 1.000
ICL 6 0.917 ; 0.948 1.000 ; 1.000
MAIC 10 0.568 ; 0.887 0.623 ; 0.950
EM-MML 10 0.561 ; 0.891 0.594 ; 0.955
𝑎 1000 bootstrap samples were used to estimate the Confidence Intervals (CI).

Additional insight into the performance of EM-MML is obtained by applying it


to a real data set referring to the 6th European Working Conditions Survey (2015),
conducted by Eurofound. Note that these are the most recent data available.
For the purpose of our experiment, we consider the aggregate data referring to
305 European regions and the answers to the following questions: Are you able to choose or change: a) your order of tasks; b) your methods of work; c) your speed or rate of work. Do you work in a group or team that has common tasks and can plan its work?

Fig. 1 Clusters' profile and their dimensions (𝑛).
EM-MML selected 7 clusters, which is a smaller number than for the remaining
criteria (ICL, BIC, CAIC, AIC and MAIC select 10, 12, 12, 15 and 15 respectively).
This fact avoids estimation problems associated with very small segments and also
improves the interpretability of the clustering solution.
The segments selected by the EM-MML criterion are presented in Figure 1. Workers
with slightly above average autonomy (cluster 7) live in several countries, but Ireland
stands out, as well as Belgium, Germany, Netherlands, Switzerland, and the UK
regions. Denmark, Estonia, Malta, and Norway are the countries where the most
independent workers are found (cluster 3). The smallest cluster, cluster 6, includes Sweden
and a region of Greece, as well as Kriti and Açores, a Greek and a Portuguese region,
respectively. Cluster 5, where workers claim they have no autonomy, includes
regions from many countries.

6 Discussion and Perspectives

In this work, a model selection criterion and method for finite mixture models of
categorical observations was studied - EM-MML. This algorithm simultaneously
performs model estimation and selects the number of components/clusters. When
compared to information criteria, which are commonly associated with the use of
the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily
recovers the true number of clusters in synthetic data sets with various degrees of

separation; 2) its computation times are significantly lower than those required
by standard approaches resorting to the sequential use of EM and an information
criterion; 3) when applied to a real data set it produces a more parsimonious solution,
thus easier to interpret. An additional advantage of this approach that stems from
obtaining more parsimonious solutions is that such solutions have a higher number
of observations per cluster, thus helping to overcome eventual estimation problems.
The performance of the EM-MML is encouraging for selecting the number of
clusters, and the same criterion was already used for feature selection [17]. However,
future research is required, namely considering data sets with different numbers of
clusters and high dimensional data.

Acknowledgements This work was supported by Fundação para a Ciência e Tecnologia, grant
UIDB /00315/2020.

References

1. Akaike, H.: Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 60, 255–265 (1973)
2. Baxter, R. A., Olivier, J. J.: Finding overlapping components with MML. Stat. Comput. 10(1),
5–16 (2000)
3. Bouguila, N., Fan, W.: Mixture Models and Applications (Unsupervised and Semi-Supervised
Learning). Springer Nature Switzerland AG, Switzerland (2020)
4. Bozdogan, H.: Mixture-model cluster analysis using model selection criteria and a new infor-
mational measure of complexity. In: Bozdogan, H. (eds.) Proceedings of the First US/Japan
Conf. Frontiers of Stat. Modeling, pp.69–113. Boston: Kluwer Academic Publishers (1994)
5. Campello, R. J., Hruschka, E. R.: A fuzzy extension of the silhouette width criterion for cluster
analysis. Fuzzy Set. Syst., 157(21), 2858–2875 (2006)
6. Celeux, G., Martin-Magniette, M. L., Maugis-Rabusseau, C., Raftery, A. E.: Comparing model
selection and regularization approaches to variable selection in model-based clustering. J. Soc.
Fr. Statistique. 155(2), 57–71 (2014)
7. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood estimation from incomplete data via
the EM Algorithm. J. R. Stat. Soc. 39, 1–38 (1977)
8. Everitt, B. S., Hand, D.: Finite Mixture Distributions. Chapman and Hall, New York (1981)
9. Figueiredo, M. A. T., Jain, A. K.: Unsupervised learning of finite mixture models. IEEE T.
Pattern Anal. 24, 381–396 (2002)
10. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985).
11. Kumar, P., Kanavalli, A.: A similarity based K-means clustering technique for categorical data
in data mining application. Int. J. Intell. Eng. Syst. (2021) doi: 10.22266/ijies2021.0430.05
12. Lee, C., Jung, U.: Context-based geodesic dissimilarity measure for clustering categorical
data. Appl. Sciences (2021) doi: 10.3390/app11188416
13. McLachlan, G. J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
14. Novais, L., Faria, S.: Selection of the number of components for finite mixtures of linear mixed
models. J. Int. Math. 24(8), 2237–2268 (2021)
15. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
16. Silvestre, C., Cardoso, M. G. M. S. and Figueiredo, M.: A clustering view on ESS measures
of political interest: an EM-MML approach. NTTS - New Techniques and Technologies for
Statistics (2017).
17. Silvestre, C., Cardoso, M. G. M. S. and Figueiredo, M.: Feature selection for clustering
categorical data with an embedded modeling approach. Expert Syst. 32(3), 444–453 (2014).

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Typology of Motivation Factors for Employees in
the Banking Sector: An Empirical Study Using
Multivariate Data Analysis Methods

Áurea Sousa, Osvaldo Silva, M. Graça Batista, Sara Cabral, and Helena
Bacelar-Nicolau

Abstract Leadership has been considered a competitive advantage for organiza-


tions, contributing to their success and effective and efficient performance. Motiva-
tion, on the other hand, is assumed as a basic competence of leadership. Therefore,
the main purpose of this paper is to know the perceptions of bank employees on
the main motivational factors in the organizational context. Data analysis was per-
formed based on several statistical methods, among which the Categorical Principal
Component Analysis (CatPCA) and some agglomerative hierarchical clustering al-
gorithms from VL (V for Validity, L for Linkage) parametrical family, applied to the
items that aim to assess the aspects most valued by bankers in the work context. The
CatPCA allowed to extract four principal components which explain almost 70%
of the total data variance. The dendrograms provided by the hierarchical clustering
algorithms over the same data, exhibit four main branches, which are associated with
different main motivational factors. Moreover, CatPCA and clustering results show
an important correspondence concerning the main motivations in this sector.

Keywords: leadership, welfare, motivational factors, CatPCA, cluster analysis

Áurea Sousa ( )
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, 9500-321, Portugal,
e-mail: [email protected]
Osvaldo Silva
Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
M. Graça Batista
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
Sara Cabral
Universidade dos Açores, Rua da Mãe de Deus, Portugal, e-mail: [email protected]
Helena Bacelar-Nicolau
Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health
(ISAMB/FM-UL), Portugal, e-mail: [email protected]


1 Introduction

Motivation has always been a subject of analysis by the scientific community, and


numerous definitions have emerged. For Robbins and Judge ([21], p. 184), motivation
is defined as “the processes that account for an individual’s intensity, direction, and
persistence of effort toward attaining a goal”. These three indicators are assumed
to be key-factors of motivation: intensity describes the individual’s effort to achieve
the proposed goals; this effort should go in a direction that benefits the organization;
and, finally, the persistence with which the individual is able to maintain that effort.
In this context, the individual’s behavior is determined by what motivates them,
which is why their performance results not only from ability and skills, but also
from motivation. Moreover, motivation is complex and influenced by innumerable
variables, considering the diverse needs and expectations that individuals try to
satisfy in different ways [15]. In addition, different leadership practices may lead to
better or worse motivational responses from employees.
The main purpose of this paper is to analyse the perceptions of bank employees
who work in the banks that operate in the Autonomous Region of the Azores on
the main motivational factors in the organizational context. Our study also intends
to perform a reduction of the dimensionality of the data and to find a typology of a
set of items that was used to evaluate the latent variable “Motivation”, regarding the
most valued aspects in the work context. Thus, Section 2 concerns the materials and
methods of research. Section 3 presents and discusses the main results of this study.
Finally, Section 4 contains the main conclusions.

2 Materials and Methods

This study was based on a quantitative approach, using a validated questionnaire,


which can be found in Cabral [7]. The sample consists of 202 bank employees (51.0
% male and 49.0 % female) of the Autonomous Region of the Azores (response
rate: 6.4%). Most respondents are 36 years old or older (60.9%) and have higher
education (56.7%).
The present study refers to a subset of twenty-seven items used to evaluate the
latent variable “Motivation” in work context, namely: 1 - The opportunity for career
advancement, 2 - Have greater responsibility, 3 - The feeling of being involved
in decision making, 4 - A job that gives you prestige and status, 5 - Have an
interesting and challenging job, 6 - The recognition and appreciation of others for
the accomplished work, 7 - Have a good relationship with your colleagues, 8 - Have
a good relationship with your superiors, 9 - A work environment where there is trust
and respect, 10 - The loyalty of superiors towards the collaborators, 11 - Team spirit,
12 - Sense of belonging to the organization, 13 - An adequate discipline, 14 - There
is equality of treatment and opportunities between the various employees, 15 - Earn
respect and esteem of your colleagues and superiors, 16 - Professional development,
17 - Salary appropriate to the professional functions, 18 - A stable job that gives

you security, 19 - Good working conditions, 20 - Balance between personal and


professional life, 21 - Being able to express your opinion and ideas without fear of
reprisals, 22 - Availability to solve problems/personal situations, 23 - Have a fair
and adequate system of objectives and incentives, 24 - Being rewarded for overtime
work, 25 - Being pressured to achieve the proposed objectives, 26 - Ability to handle
pressure at work, and 27 - Appropriate training to the professional functions.
For each item, respondents could pick only one of six response categories according to their level of agreement or disagreement with the items that assess motivation: Totally disagree; Disagree most of the time; Slightly disagree; Slightly agree; Agree most of the time; and Totally agree. In this study, Categorical Principal Components Analysis (CatPCA), using the Varimax rotation method with Kaiser Normalization, and some agglomerative hierarchical clustering algorithms (AHCA) were used. Data analysis was performed using the packages IBM SPSS Statistics 26 and CLUST11 [19].
Principal Components Analysis (PCA) aims to reduce the dimensionality of the
original data so that “the first few dimensions account for as much of the available
information as possible” ([9], p. 83), assuming linear relationships among numeric
variables. Each principal component is uncorrelated with all others, and it is ex-
pressed as a linear combination of the original variables. CatPCA optimally quanti-
fies categorical (ordinal or nominal) variables and can handle and discover nonlinear
relationships between variables (e.g., [12]). In the present study, we applied the
CatPCA due to the ordinal nature of the items under analysis.
The goal of a clustering algorithm is to obtain a partition, where the elements
within a cluster are similar and elements (objects/individuals/groups of individuals or
variables) in different groups are dissimilar, identifying natural clustering structures
in a data set (e.g., [8]). Agglomerative clustering algorithms usually start with each element in its own separate cluster of size 1 (a singleton). At each step,
the algorithms find the two “closest” clusters, taking into account the aggregation
criterion, and join them. The process continues until a cluster containing all elements
to classify is obtained. The AHCA of the set of items was based on the affinity
coefficient as a measure of comparison between elements, combined with two classic (Single-Linkage (SL) and Complete-Linkage (CL)) and a family of probabilistic VL (V for Validity, L for Linkage) aggregation criteria (e.g., [1, 2, 3, 10, 11, 16, 17, 18,
22]).
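To make the generic agglomerative scheme concrete, the following short Python sketch (illustrative only; it is independent of CLUST11 and of the probabilistic VL criteria, and the similarity matrix is invented) repeatedly merges the two most similar clusters, starting from singletons, until a single cluster remains.

```python
import numpy as np

def agglomerate(sim, linkage="complete"):
    """Generic agglomerative clustering on a symmetric similarity matrix.

    Starts from singleton clusters and, at each step, merges the two clusters
    with the highest between-cluster similarity; returns the merge history."""
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sims = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                # single linkage: most similar pair; complete linkage: least similar pair
                s = max(pair_sims) if linkage == "single" else min(pair_sims)
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Toy 4x4 similarity matrix (e.g., affinity coefficients between four items).
S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
print(agglomerate(S))   # merges {0,1} first, then {2,3}, then both
```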
According to Ng et al. ([20], p. 849), “the task of finding good clusters has been
the focus of considerable research in machine learning and pattern recognition“.
However, the identification of the best partitions using validation indices is also
of crucial importance. Therefore, a pertinent question arises: “How well does the
partition fit the data?” ([8], p. 505). As far as the validation of results is concerned, the
identification of the best partitions in the present study was based on the global
level statistics, STAT [1, 10, 11]. The global maximum STAT value indicates the best
cut-off level of a dendrogram and the local maxima STAT differences indicate the
most significant levels.
The affinity coefficient between two distribution functions was introduced by
Matusita in 1951 (e.g., [13, 14]). Bacelar-Nicolau extended it to the non-supervised

classification field as a similarity measure between profiles. Let $V$ be a set of $p$ variables describing a set $D$ of $N$ statistical data units (individuals), so that each of the $N \times p$ cells of the corresponding data table $X$ contains one single non-negative real value $x_{ik}$ ($i = 1, \ldots, N$; $k = 1, \ldots, p$), which denotes the value of the $k$-th variable on the $i$-th individual. The standard affinity coefficient $a(k, k')$ between a pair of variables, $V_k$ and $V_{k'}$ ($k, k' = 1, \ldots, p$), is given by formula (1), where $x_{\cdot k} = \sum_{i=1}^{N} x_{ik}$ and $x_{\cdot k'} = \sum_{i=1}^{N} x_{ik'}$:

$$a(k, k') = \sum_{i=1}^{N} \sqrt{\frac{x_{ik}}{x_{\cdot k}} \cdot \frac{x_{ik'}}{x_{\cdot k'}}} \qquad (1)$$
The coefficient (1) is a symmetric similarity coefficient which takes values in
[0,1] (1 for equal or proportional vectors and 0 for orthogonal vectors). Note that
its mathematical formula corresponds to the inner product between the square root
column profiles associated with those variables and measures a monotone tendency
between column profiles. In the particular case of binary variables, the affinity
coefficient coincides with the well-known Ochiai coefficient. Furthermore (e.g.,
[4, 6]), it is related to the Hellinger distance $d$ by the relation $d^2 = 2(1 - a)$, which
has been used in the context of spherical factor analysis by Michel Volle. Later on,
the standard affinity coefficient was extended to the clustering of statistical data units
or variables, mainly in a three-way approach (e.g., [3, 4, 5, 6]). The computation of
the standard affinity coefficient between individuals can be performed by previously
transposing the data matrix and then applying formula (1).
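As an illustration of formula (1), the following Python sketch (not part of the original study; the toy values are invented) computes the standard affinity coefficient between two non-negative variables from their column profiles.

```python
import numpy as np

def affinity(x_k, x_kprime):
    """Standard affinity coefficient a(k, k') of formula (1):
    the inner product of the square-root column profiles."""
    x_k = np.asarray(x_k, dtype=float)
    x_kprime = np.asarray(x_kprime, dtype=float)
    profile_k = x_k / x_k.sum()             # column profile of variable k
    profile_kp = x_kprime / x_kprime.sum()  # column profile of variable k'
    return float(np.sum(np.sqrt(profile_k * profile_kp)))

# Three variables observed on three individuals (non-negative values).
v1 = [2.0, 1.0, 1.0]
v2 = [4.0, 2.0, 2.0]   # proportional to v1, so the affinity equals 1
v3 = [0.0, 3.0, 1.0]

print(affinity(v1, v2))   # 1.0
print(affinity(v1, v3))   # a value in [0, 1]
```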
The probabilistic aggregation criteria on the scope of VL methodology can be
interpreted as distribution functions of statistics of independent random variables,
that are i.i.d. uniform on [0, 1] (e.g., [3, 17]). The SL aggregation criterion can lead
to very long clusters (chaining effect). On the other hand, the AVL (Aggregation
Validity Link) has a tendency to form equicardinal clusters with an even number of
elements. The comparison functions between a pair of clusters, A and B, concerning
the family I of AVL methods can be generated by the following conjoined formula
(e.g., [17, 10, 11]):

$$\Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (2)$$
with $\alpha = \mathrm{Card}\,A$, $\beta = \mathrm{Card}\,B$, and $p_{AB} = \max\,[\gamma_{ab} : (a, b) \in A \times B]$, where $1 \le g(\alpha, \beta) \le \alpha\beta$ and $\gamma_{xy}$ is a similarity measure between pairs of elements, $x$ and $y$, of the set of elements to classify (e.g., $g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = \alpha\beta$ for AVL).
Note that varying 𝑔(𝛼, 𝛽) with 1 < 𝑔(𝛼, 𝛽) < 𝛼𝛽, a sort of compromise can be
built between SL and AVL methods (e.g., 𝑔(𝛼, 𝛽)=(𝛼 + 𝛽)/2 for AV1). Thus, Γ( 𝐴, 𝐵)
will be “more polluted by the chain effect when 𝑔(𝛼, 𝛽) remains near 1, and more
contaminated by the symmetry effect as long as 𝑔(𝛼, 𝛽) is in the neighbourhood of
𝛼𝛽“ ( [17], p. 95). Among the criteria that establish a compromise between AVL and
SL methods, stands out the AV1 method, whose behavior is very similar to that of
AVL and often provides, at its cut-off level, a partition better adjusted to the preorder
than the “best” classification obtained by AVL.
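The comparison functions of formula (2) can be sketched in a few lines of Python (an illustration only; the similarity values are invented, and the choice g(α, β) = (α + β)/2 for AV1 follows the text above).

```python
def gamma(cluster_a, cluster_b, sim, method="AVL"):
    """Probabilistic VL comparison function of formula (2).

    sim[x][y] is a similarity in [0, 1] between elements x and y;
    p_AB is the largest pairwise similarity across the two clusters."""
    alpha, beta = len(cluster_a), len(cluster_b)
    p_ab = max(sim[x][y] for x in cluster_a for y in cluster_b)
    g = {"SL": 1.0,                     # g = 1 reproduces Single-Linkage
         "AV1": (alpha + beta) / 2.0,   # compromise between SL and AVL
         "AVL": alpha * beta}[method]   # g = alpha * beta
    return p_ab ** g

# Similarities between elements of cluster {0, 1} and cluster {2, 3}.
sim = {0: {2: 0.7, 3: 0.4}, 1: {2: 0.9, 3: 0.5}}
print(gamma([0, 1], [2, 3], sim, method="AV1"))   # 0.9 ** 2 = 0.81
```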

3 Main Results and Discussion

Concerning the CatPCA, the best solution comprises four principal components,
and the percentage of variance accounted for (PVAF) across these components is
almost 70% (about 69%) of the data’s total variance. All extracted components have
eigenvalues above 1. Moreover, the first three main components have a very good
internal consistency and the fourth component has an acceptable internal consistency,
as shown by the values of the Cronbach’s Alpha coefficient (see Table 1).

Table 1 Rotated component loadings of the 4-component solution - Motivational factors.


Items PC1 PC2 PC3 PC4

M1 0.213 0.351 0.699 0.166


M2 0.197 0.044 0.794 0.211
M3 0.248 0.148 0.763 -0.018
M4 -0.028 0.098 0.482 0.442
M5 0.354 0.219 0.674 0.037
M6 0.522 0.214 0.425 0.095
M7 0.837 0.110 0.193 -0.114
M8 0.774 0.151 0.244 0.099
M9 0.778 0.227 0.183 -0.125
M10 0.783 0.269 0.227 -0.043
M11 0.757 0.259 0.223 -0.103
M12 0.798 0.155 0.227 -0.035
M13 0.708 0.213 0.341 0.070
M14 0.486 0.511 0.372 -0.257
M15 0.775 0.263 0.252 0.041
M16 0.432 0.364 0.665 0.035
M17 0.289 0.708 0.410 -0.046
M18 0.462 0.641 0.097 -0.247
M19 0.548 0.532 0.211 -0.034
M20 0.503 0.609 0.074 -0.223
M21 0.684 0.401 0.070 0.074
M22 0.678 0.399 0.019 0.054
M23 0.295 0.770 0.284 0.102
M24 0.174 0.835 0.176 -0.011
M25 0.019 -0.012 0.233 0.864
M26 -0.038 -0.146 0.035 0.896
M27 0.543 0.458 0.230 0.227
Eigenvalue (VAF) 7.988 4.417 4.066 2.138
Percentage accounted (PVAF) 29.59 16.36 15.06 7.92
Cronbach’s Alpha 0.950 0.934 0.919 0.610

The most important items for the first dimension are items M6, M7, M8, M9,
M10, M11, M12, M13, M15, M19, M21, M22, and M27, which are related to human
relationships/interactions with colleagues and hierarchical superiors, so it is called

“Psychological well-being/Interpersonal relationships”. This dimension explains the


highest proportion of data variance (29.59%).
Concerning the second dimension, the items M14, M17, M18, M20, M23, and
M24 are the most important, so this dimension was designated “Remuneration,
job stability and incentive system”. The most relevant items regarding the third
dimension are M1, M2, M3, M4, M5, and M16; so, this dimension was called
“Career progression/Professional achievement”. Finally, the most important items
for the fourth dimension are M25 and M26 related to “Fulfilment of the proposed
objectives and the timings to achieve them”.
Regarding the AHCA of the same set of items, and considering the best cut-off
levels, the results of the present study are summarized in Table 2.

Table 2 The best partition - Standard affinity coefficient.


Method The best partition STAT Cut-off
level

SL/CL {M1, M2, M3, M5, M8, M10, M11, M12, M13, M15, M14, 15.8858 20
M16, M18, M19, M22, M20, M6, M23, M27, M24, M21};
{M4}; {M9}; {M7}; {M25}; {M26}; {M17}

AV1 {M1, M2, M3, M6, M27, M21, M5, M23, M24, M8, M15, 15.6490 22
M14, M16, M10, M13, M11, M12, M18, M19, M20, M22};
{M4, M25, M26}; {M7}; {M9}; {M17}

According to the STAT values, the best partitions were obtained by the classic
SL/CL and the probabilistic AV1 methods (see Table 2). All dendrograms highlighted
four main branches, which are associated with different motivational factors ("Career
progression"; "Psychological well-being / Interpersonal relationships"; "Organiza-
tional environment and working conditions"; "Conformity with objectives and time
to reach them"), bringing new information, and identifying some singletons, as
shown in Figure 1.

4 Conclusion

Organizations and their leaders have become increasingly aware of the importance
of their employees' well-being and of the fact that negative feelings can adversely affect pro-
ductivity. Thus, it is essential to ensure the well-being of employees, taking into
account the main motivational factors identified in this study. CatPCA made it pos-
sible to extract four principal components (dimensions), which explain almost 70%
of the total variance of the data and which were designated, respectively, by “Psy-
chological well-being/Interpersonal relationships”; “Remuneration, job stability and
incentive system”; “Career progression/Professional achievement”; and “Fulfilment
of objectives and timings to achieve them”. Regarding the AHCA of the items that

[Dendrograms of the 27 motivation items; the text rendering of the figure is omitted.]
Fig. 1 Dendrogram - Standard affinity coefficient + AV1.

assess motivation, the dendrograms highlight four main branches, which are associ-
ated with different motivational factors called "Career progression"; "Psychological
well-being / Interpersonal relationships"; "Organizational environment and working
conditions"; and "Conformity with objectives and time to reach them". They carried
new information and identify some singletons as well. Comparing the dendrograms,
we conclude that the clusters referring to the best partitions are quite similar, with
observed differences mainly concerning the few singletons. Moreover, the effec-
tive and fruitful correspondence between the AHCA and the CatPCA results may
help to better understand the main types of factors identified. In fact, the four main
branches of all dendrograms are related to motivational factors whose corresponding interpretation is in consonance with those identified through CatPCA.

Acknowledgements This paper is financed by Portuguese national funds through FCT – Fundação
para a Ciência e a Tecnologia, I.P., project number UIDB/00685/2020.

References

1. Bacelar-Nicolau, H.: Contributions to the Study of Comparison Coefficients in Cluster Anal-


ysis (in Portuguese). Univ. Lisbon (1980)
2. Bacelar-Nicolau, H.: The affinity coefficient in cluster analysis. In: Bekmann, M. J. et al.
(eds.). Methods of Operations Research, pp. 507-512. Verlag Anton Hain, Munchen (1985)

3. Bacelar-Nicolau, H.: Two probabilistic models for classification of variables in frequency


tables. In: Bock, H.-H. (ed.) Classification and Related Methods of Data Analysis, pp. 181-
186. Elsevier Sciences Publishers B.V., North Holland (1988)
4. Bacelar-Nicolau, H.: The affinity coefficient. In: Bock, H.-H. and Diday, E. (eds.) Analysis of
symbolic data: Exploratory methods for extracting statistical information from complex data,
Series: Studies in Classification, Data Analysis, and Knowledge Organization, pp. 160-165.
Springer-Verlag, Berlin (2000) doi: 10.1007/978-3-642-57155-8
5. Bacelar-Nicolau, H., Nicolau, F.C., Sousa, Á., Bacelar-Nicolau, L.: Measuring similarity of
complex and heterogeneous data in clustering of large data sets. Biocybernetics and Biomed-
ical Engineering, PWN-Polish Scientific Publishers Warszawa. 29(2), 9-18 (2009)
6. Bacelar-Nicolau, H., Nicolau, F. C., Sousa, Á., Bacelar-Nicolau, L.: Clustering of variables
with a three-way approach for health sciences. Testing, Psychometrics, Methodology in Ap-
plied Psychology (TPM) (2014) doi: 10.4473/TPM21.4.56
7. Cabral, S.: O Impacto da Liderança na Motivação dos Colaboradores do Setor Bancário na
Região Autónoma dos Açores. Universidade dos Açores, Ponta Delgada (2018)
8. Gurrutxaga, I., Muguerza, J., Arbelaitz, O., Pérez, J.M., Martín, J.I.: Towards a standard
methodology to evaluate internal cluster validity indices. Pattern Recognit. Lett. 32, 505-515
(2011)
9. Lattin, J. M., Carroll, J. D., Green, P. E.: Analyzing Multivariate Data (1st ed.). Thomson
Brooks, Cole (2003)
10. Lerman, I. C.: Classification et Analyse Ordinale des Données. Dunod, Paris (1981)
11. Lerman, I. C.: Foundations and methods in combinatorial and statistical data analysis and
clustering. Series: Advanced Information and Knowledge Processing. Springer-Verlag, Boston
(2016)
12. Linting, M., Meulman, J. J., Groenen, P. J., van der Koojj, A. J.: Nonlinear principal com-
ponents analysis: Introduction and application. Psychol. Meth. 12(3), 336–358 (2007) doi:
10.1037/1082-989X.12.3.336
13. Matusita, K.: On the theory of statistical decision functions. Ann. Inst. Stat. Math. 3, 1-30
(1951)
14. Matusita, K.: Decision rules, based on distance for problems of fit, two samples and estimation.
Ann. Inst. Stat. Math. 26, 631-640 (1955)
15. Mullins, L. J.: Management and Organizational Behavior. Prentice Hall, England (2005)
16. Nicolau, F. C., Bacelar-Nicolau, H.: Nouvelles méthodes d’agrégation basées sur la fonction de
répartition. In: Collection Séminaires INRIA 1981, Classification Automatique et Perception
par Ordinateur, pp. 45-60. Rocquencourt (1982)
17. Nicolau, F. C., Bacelar-Nicolau, H.: Some trends in the classification of variables. In: Hayashi
et al. (eds.). Data Science, Classification and Related Methods, pp. 89-98. Springer, Tokyo
(1998)
18. Nicolau, F. C., Bacelar-Nicolau, H.: Teaching and learning hierarchical clustering probabilistic
models for categorical data. In: Proc. 54th Session of the International Statistical Institute
(IASE at ISI, IPM-71). Berlin, Germany (2003)
https://iase-web.org/documents/papers/isi54/3654.pdf. Cited 25 Jan 2022
19. Nicolau, F. C., Bacelar-Nicolau, H., Sousa, F., Sousa, Á., Silva, O.: CLUST11: Cluster Analysis
Software - Standard and VL Probabilistic Approaches. LEAD, FP-UL (2011)
20. Ng, A. Y., Jordan, M. I., Weiss, Y.: On spectral clustering: Analysis and an algorithm (2002). https://proceedings.neurips.cc/paper/2001/file/801272ee79cfde7fa5960571fee36b9b-Paper.pdf. Cited 25 Jan 2022
21. Robbins, S. P., Judge, T. A.: Organizational Behavior. Pearson Education, Inc., New Jersey
(2015)
22. Sousa, Á., Silva, O., Bacelar-Nicolau, H., Nicolau, F. C.: Distribution of the affinity coefficient
between variables based on the Monte Carlo simulation method. Asian. J. Appl. Sci. 1(5),
236-245 (2013)

A Proposal for Formalization and Definition of
Anomalies in Dynamical Systems

Jan Michael Spoor, Jens Weber, and Jivka Ovtcharova

Abstract Although many scientists strongly focus on anomaly detection in different


applications and domains, there currently exists no universally accepted definition
of anomalies and outliers. Using an approach based on control theory and dynamical
systems, as well as a definition for anomalies as described by philosophy of science,
the authors propose a generalized framework viewing anomalies as key drivers
of progress for a better understanding of the dynamical systems around us. By
mathematically defining anomalies and delimiting deviations that fall within expectations from completely unforeseen instances, this paper aims to contribute to establishing a universally accepted definition of anomalies and outliers.

Keywords: anomaly detection, outlier analysis, dynamical systems

1 Introduction

Anomalies, often interchangeably called outliers [1], are of key interest in explorative
data analysis. Therefore, anomaly detection finds application in many different sci-
entific fields, e.g., in social science, economics, engineering, and medical science [2].
In particular, research in these domains regarding databases, data mining, machine
learning or statistics focuses strongly on anomaly detection [3]. Despite the wide

Jan Michael Spoor ( )


Institut für Informationsmanagement im Ingenieurwesen (IMI), Karlsruhe Institute of Technology,
Karlsruhe, Germany, e-mail: [email protected]
Jens Weber
Team Digital Factory Sindelfingen, Mercedes-Benz Group AG, Sindelfingen, Germany,
e-mail: [email protected]
Jivka Ovtcharova
Institut für Informationsmanagement im Ingenieurwesen (IMI), Karlsruhe Institute of Technology,
Karlsruhe, Germany, e-mail: [email protected]


range of anomaly detection, there is currently no universally accepted definition of


what an outlier or anomaly is [2], and the mathematical definition depends on the
selected method to find these anomalies [4].
The authors previously proposed an applied framework to formalize anomalies
within the context of control theory and dynamical systems [5]. In this publication,
the idea is discussed in more depth, and a generalization of the framework is proposed
to extend its application area to more domains since dynamical systems are relevant
in engineering and science [6] as well as in management science and economics [7].
Furthermore, the proposed definition of anomalies should also be applicable outside
of the context of control theory and aims to contribute to establishing a universally accepted definition of anomalies and outliers.
When controlling or simulating dynamical systems, a measurement and prediction
process is used. Anomalies occur in this process as substantial deviations of a
measured system state (an actual value) from an expected system state (a planned
value) [5]. Despite simulation and planning effort, these deviations still occur. While
some deviations fall within an acceptable range and within the expectations of normal
system behavior, other anomalies are completely unforeseen and do not fit the set-up
and expectations of the system. Three sequential questions are derived to further
investigate the nature of anomalies within dynamical systems:
1. What distinguishes unforeseen system states from regular system behavior?
2. How can unforeseen system states or errors occur despite simulation?
3. How can unforeseen system states be analyzed and transferred to a standard
model of a system’s behavior?

2 Definition of Anomalies for Dynamical Systems

2.1 Definitions of Anomalies and Outliers

In general, it is assumed that anomalies are somehow visible within the data of
the observed systems. This is also clearly stated by the definition of an outlier or
anomaly as data points with a substantial deviation from the norm since this requires
a normal state of the system and a measurable deviation [8]. Furthermore, the
anomaly detection requires existence and knowledge of a normal state, a definition
of a deviation, a metric, and a threshold measure of distance. This threshold measure
of distance uses the selected metric. All deviations of the data points from the norm which are above (in the case of distance measures) or below (in the case of similarity measures) the defined threshold are considered substantial.
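As a minimal sketch of this threshold-based view (the normal state, metric, and threshold below are arbitrary example choices, not taken from the cited works), an observation is flagged once its distance from the normal state exceeds the chosen threshold.

```python
import numpy as np

def is_anomaly(observation, normal_state, threshold, metric=None):
    """Flag an observation whose deviation from the normal state exceeds
    the chosen threshold, i.e. whose deviation is considered substantial."""
    if metric is None:
        metric = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
    return metric(observation, normal_state) > threshold

# Example with an assumed normal operating point and a Euclidean metric.
normal = [20.0, 1.0]
print(is_anomaly([20.5, 1.1], normal, threshold=2.0))   # False: within expectations
print(is_anomaly([27.0, 3.0], normal, threshold=2.0))   # True: substantial deviation
```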
Therefore, in addition, the selection of an appropriate metric becomes an impor-
tant tool to accurately describe an anomaly. Some authors claim that, in a practical
application, the selection of a suitable metric might be more important than the
algorithm itself. For example, if clusters are clearly separated within the examined
dataset in context of the selected metric, clusters will be found independently of

the used method or algorithm [9]. Other authors claim that the selected method for
investigating clusters is of importance [10].
To summarize, there is no trivial definition of a normal state, a deviation, and when
a deviation might be substantial. Some authors therefore describe the usefulness of
an analysis only within the context of the goals of the analysis [11]. Outlier detection
becomes more of a technical target than an actual scientific finding of something
novel since the novelty is always defined within the technical target of the analysis.
Alternatively, the normal model of the data defines an anomaly [1].
This results, for example, in approaches of regression diagnostics to exclude
outliers and anomalous data prior to an analysis or to conduct the analysis along the
standard model in a more robust way, which is less affected by anomalies [12]. Both
approaches result in maintaining the normal model, treating anomalies as if they were less adequate or not at all representative of the data set.
Since anomalies are only relevant within a context, a typology of anomalies within
different dataset contexts can be created. Thus, Foorthuis [13] proposes a typology
along the following dimensions: types of data (qualitative, quantitative or mixed),
anomaly level (atomic or aggregated) and cardinality of relationship (univariate or
multivariate). Anomalies are, within this kind of typology, always dependent on
the dataset and behave differently along the measured features, which have been
classified as relevant for the specific analysis. The anomaly detection becomes a
detection of unfitting, surprising values while maintaining the normal model.

2.2 Definition by Philosophy of Science

If the assumptions regarding normal states, deviation, and substantiality are dropped,
it is possible to discuss anomalies on a more fundamental level for understanding
our surroundings and the observations of them.
To do this, anomalies have to be placed in the historic context of science and
research. Since anomaly detection as a discipline of data science is placed within
the scientific context [14], anomaly detection can also be analyzed as part of the
scientific method and therefore a comparison with the historical understanding of
anomalies in the context of science becomes relevant. By definition of Kuhn [15],
anomalies play an important role in the scientific discovery of novelties:
Discovery commences with the awareness of anomaly, i.e., with the recognition that nature
has somehow violated the paradigm-induced expectations that govern normal science. It
then continues with a (...) exploration of the area of anomaly. And it closes only when the
paradigm theory has been adjusted so that the anomalous has become the expected.

This statement describes scientific progress as a stepwise discovery and the place-
ment of anomalies within a normal state by science. The discussed normal state is
therefore dictated by current scientific knowledge, which encompasses the predic-
tions of the currently available and widely used models and theories. An anomaly
violates the normal state by violating the predictions of these models. The steps of
scientific progress are then as follows:

1. Knowledge of the anomaly.


2. Stepwise acknowledgement of observations and conceptual nature of the
anomaly.
3. Change of paradigm and methods to include the anomaly in the new models,
often under resistance by the scientific community itself.
Therefore, different states of an anomaly exist as follows:
1. The anomaly is completely unknown.
2. The anomaly is neither described nor modeled but was observed.
3. The anomaly is not commonly recognized and placed within the standard model.
The states of anomalies correspond to the initially defined questions in the in-
troduction regarding the delimitation of anomalous states from normal states, the
exploration of the causes for anomalies, and the modeling and planning with the
now known anomalies. If the states of anomalies are used to describe practical errors
in engineering, error states of systems are not anomalies. This is the case because
if error states are classified as such beforehand, they are already known and
described. This corresponds to the idea that outliers or anomalies are created by a
different underlying mechanism [16] and therefore imply an unknown system behav-
ior, which needs modeling to better describe the system. In addition, this follows the
assumption of a normal state in which anomalies simply derive from a normal model
[1] since they are not part of the normal model. Also, this idea relates strongly to the
discussion of the relation between novelty and anomaly detection [17].
To follow the definitions by Kuhn [15], science is driven by internal progress, lim-
ited by the current methods and available resources, while external targets, defined by
stakeholders, e.g., society or companies, drive technicians. This description matches
the idea that the usefulness of an analysis should be evaluated within the context of
its goals [11] and distinguishes two types of anomalies: "Scientific" anomalies of a
novel observation and "technical" anomalies as deviations from a predefined norm
using a predefined measurement of substantiality.
"Scientific" anomalies might still result in unwanted system states, which then
can result in some kind of error or critical system state. Nevertheless, not every
"scientific" anomaly inevitably results in an error state and not every error state is
a "scientific" anomaly. An anomaly is not a "scientific" anomaly if the error state
is already documented or can be described by the standard model. In this case, the
anomaly becomes a "technical" anomaly.
Using the philosophy of science definition of anomalies, the normal state is the
prediction by the system model, the deviation is the difference between the prediction
of the system state and the measured actual state of the system, and the substantiality
is defined by the noise and precision of our predictions and measurement tools.

3 Proposed Framework for a Formalization of Anomalies

To separate "scientific" and "technical" anomalies, a formerly proposed framework


[5] is generalized as illustrated in Fig. 1 and mathematically defined in this section.

Fig. 1 Formalization of "scientific" and "technical" anomalies and system states.

Definition 1 (System State) There exists a multivariate description 𝑥𝑖 of a state 𝑖


with a finite number of features. For each feature 𝑗 of state 𝑖 a value 𝑥𝑖 𝑗 exists, which
is a realization of the feature space 𝑅 𝑗 . The value 𝑥𝑖 𝑗 is the actual and precise state
description of feature 𝑗 at state 𝑖. Although there exists only a single true value
𝑥𝑖 𝑗 , the value itself does not necessarily have to be a single data point but can be a
multivariate or symbolic data value and can be of any data type.

∀𝑖 ∀ 𝑗 ∃! 𝑥𝑖 𝑗 , 𝑥𝑖 𝑗 ∈ 𝑅 𝑗 (1)

The set 𝐶 of all combinations of system state values with 𝐽 features is given by:

𝐶 = {𝑥𝑖 | ∀ 𝑗 ∃ 𝑥 𝑖 𝑗 ∈ 𝑅 𝑗 } = 𝑅1 × ... × 𝑅 𝐽 (2)

Definition 2 (Operation) An operation is an analytical function 𝑓 which changes


the system state from state 𝑖 to the following state 𝑖 + 1. Both states belong to the set
of all combinations of system states 𝐶.

𝑓 : 𝐶 → 𝐶, 𝑓 (𝑥𝑖 ) = 𝑥𝑖+1 (3)

There exists a finite set 𝐹 of functions of endogenous state transformations. This


set of functions is the scope of operations that can be performed. These functions
are the fundamental functionality of a system, which can be performed without any
external involvement. For all functions the following expression is applied:

𝑔∈𝐹∧ 𝑓 ∈𝐹 :𝑔◦ 𝑓 ∈𝐹 (4)

Using the defined function space, a restriction of reachable system states via all
functions from 𝐹 is defined, resulting in the set of physically possible system states.

Definition 3 (Physically Possible System States) The relation 𝑓 spans the complete
space of state changes of a system using the entire scope of operations. The resulting
space is the set of all possible system states. The physically possible system states

are the possible realizations of 𝑥𝑖 based on a starting point and if only functions from
𝐹 are applied. The set 𝑃 is a group with a neutral element of operations.

𝑃 = {𝑥𝑖 | ∀ 𝑓 ∈ 𝐹 : 𝑓 (𝑥𝑖 ) ∈ 𝑃} ⊆ 𝐶 (5)

Definition 4 (Observed System States) Of the 𝐽 existing features of the system state, only 𝐷 of the features are known, with 𝐷 ≤ 𝐽. Since not all
system states can be measured, a function 𝑧 transforms the real system states and
real operations of the system into observable system states and operations.

𝑧 : 𝐶 → 𝑀, 𝑧(𝑥𝑖 ) = 𝑥 𝑖∗ (6)

Therefore, the set 𝑀 = 𝑅1 × ... × 𝑅 𝐷 is the space of all observable and known
system states. Function 𝑧 is the measurement process.

Definition 5 (Observed Operations) Not all functions of the whole set of functions
𝐹 are known or observable when planning and operating a system.

𝐹′ ⊆ 𝐹 (7)

Additionally, only observable system states are modeled when operating a system. The observed operations of systems are therefore projections of a subset of the known operations of 𝐹 and operate within the observed and known system states.

𝐹∗ = 𝑧(𝐹′) (8)

The actual conducted operations 𝑓 are always from the set of operations 𝐹, but the
expectation and prediction utilize, due to lack of system knowledge, only 𝑓 ∗ ∈ 𝐹 ∗ .

𝑓 ∗ : 𝑀 → 𝑀, 𝑓 ∗ (𝑥𝑖∗ ) = 𝑥𝑖+1∗ (9)

Therefore, all states applied in operation 𝑓 ∗ are defined as expected system states.

Definition 6 (Expected System States) The system states, which are possible if
only the observed and known operations of the set 𝐹 ∗ are applied to all system states
𝑥𝑖∗ ∈ 𝐸, are the expected system behavior.

𝐸 = {𝑥𝑖∗ | ∀ 𝑓 ∗ ∈ 𝐹 ∗ : 𝑓 ∗ (𝑥𝑖∗ ) ∈ 𝐸 } ⊆ 𝑀 (10)

The expected system states can be further split into desired system states, where
the system is running most beneficially for its usage, a critical system state, where
a possible error or rare system states are measured, and error states, which are
system faults with operational risks involved as defined by Basel III [18]. Applied
in engineering, this definition is compatible with the definition of DIN EN 13306
since the system is at risk of being unable to perform a certain range of functions
without necessarily being completely inoperable [19]. All kinds of errors, warnings
and non-beneficial system states are the "technical" anomalies within the contextual
analysis of the data set.

Definition 7 (Unforeseen System States) The set of unforeseen system states 𝑈 is
therefore all measurable system states within the realm of observable system states
but not within the expected system states:

𝑈 = 𝑀/𝐸 (11)

"Scientific" anomalies in unforeseen system states are measured if the real oper-
ation 𝑓 differs from 𝑓 ∗ such that a prediction error occurs:

𝑓 ∗ (𝑥𝑖∗ ) ∈ 𝐸, 𝑓 ∗ (𝑥 𝑖∗ ) ≠ 𝑧( 𝑓 (𝑥 𝑖 )) ∉ 𝐸 (12)

"Scientific" anomalies are part of the unforeseen system states. Another reason for
unforeseen system states is a measurement of an impossible system state. Anomalies
originating from physically impossible system states are to be distinguished from "scien-
tific" anomalies since the reason for their occurrence follows a different mechanism.
Thus, they are assigned to the "technical" anomalies.
Definition 8 (Physically Impossible System States) Physically impossible system
states 𝐼 are combinations of states in set 𝐶 which are not reachable using function 𝑓 :

𝐼 = 𝐶/𝑃 (13)

Definition 9 (External Influence) When changes are applied to the system, the feature


space also changes. Consequently, the space of the physically possible system states
changes. Previously impossible system states become possible system states.
Definition 10 (Faulty Data Points) If a measurement is conducted incorrectly, the
measured values could be within the impossible system states. Faulty data points are
therefore neither measurement noise nor imprecision, but should be systematically
excluded. Note that faulty data points could be within the possible system space but
need to be excluded either way.
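To make the set relations of Definitions 3–10 concrete, the following minimal Python sketch (the states and sets are invented stand-ins; in a real system they would be induced by the operations F, the measurement z, and the observed model F*) classifies a measured state.

```python
# Tiny stand-ins for the sets C, P, M and E of the framework.
C = {"s0", "s1", "s2", "s3", "s4"}   # all combinations of feature values
P = {"s0", "s1", "s2", "s3"}         # physically possible system states
M = {"s0", "s1", "s2", "s4"}         # observable (measurable) system states
E = {"s0", "s1"}                     # expected system states under F*

I = C - P    # physically impossible system states, Definition 8
U = M - E    # unforeseen system states, Definition 7

def classify(measured_state):
    """Label a measured state according to the proposed framework."""
    if measured_state in I:
        return "impossible state: faulty data point ('technical' anomaly)"
    if measured_state in E:
        return "expected system state"
    if measured_state in U:
        return "unforeseen but possible state ('scientific' anomaly candidate)"
    return "outside the observable model"

for s in ["s0", "s2", "s4"]:
    print(s, "->", classify(s))
```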

4 Conclusion

It is concluded that the anomaly concept is often loosely defined and heavily depends
on assumptions of a normal state, deviation, and substantiality. These definitions are
often case-specific and influenced by the conducting researchers’ choice. Therefore,
a rigorous definition of anomalies is capable of further streamlining the discourse
and increasing a common understanding of what kind of anomaly is described.
Using "technical" and "scientific" anomalies, further research will be conducted
to set up models detecting both types of anomalies separately. Differences between
observed and real system states and operations are a focus of further research to
more precisely analyze the hidden processes of the "scientific" anomaly generation.
Also, a more fundamental discussion of the philosophical definition of anomalies
within the philosophy of science and its applications to anomaly detection in general
should be conducted to further gain insight into the true nature of anomalies.

The authors plan to validate the concept by using the proposed definition and
framework in exemplary applications within industrial processes. Furthermore,
anomaly detection methods designed for applications in dynamical systems using
the proposed framework are planned to be developed.
Acknowledgements The Mercedes-Benz Group AG funds this research. The research was prepared
within the framework of the doctoral program of the Institut für Informationsmanagement im
Ingenieurwesen (IMI) at the Karlsruhe Institute of Technology (KIT).

References

1. Aggarwal, C. C.: Outlier Analysis. Springer Science+Business Media, New York (2013)
2. Hodge, V. J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22,
85-126 (2004)
3. Aggarwal, C. C., Sathe, S.: Outlier Ensembles. Springer, Cham (2017)
4. Wang, X., Wang, X., Wilkes M.: New Developments in Unsupervised Outlier Detection -
Algorithms and Applications. Springer, Singapore (2021)
5. Spoor, J. M., Weber, J., Ovtcharova, J.: A definition of anomalies, measurements and predic-
tions in dynamical engineering systems for streamlined novelty detection. Accepted for the
8th International Conference on Control, Decision and Information Technologies (CoDIT),
Istanbul (2022)
6. Åström, K. J., Murray, R. M.: Feedback Systems - An Introduction for Scientists and Engineers.
Princeton University Press, Princeton, New Jersey (2008)
7. Sethi, S. P., Thompson, G. L.: Optimal Control Theory - Applications to Management Science
and Economics. Springer Science+Business Media, Boston, MA (2000)
8. Mehrotra, K. G., Mohan, C., Huang, H.: Anomaly Detection - Principles and Algorithms.
Springer International Publishing, Cham (2017)
9. Skiena, S. S.: The Data Science Design Manual. Springer International Publishing, Cham
(2017)
10. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning.
Springer Science+Business Media, New York (2013)
11. Fahrmeir, L., Hamerle, A., Tutz, G. (eds.): Multivariate Statistische Verfahren. de Gruyter,
Berlin (1996)
12. Rousseeuw, P. J., Leroy, A. M.: Robust Regression and Outlier Detection. John Wiley & Sons,
Inc (1987)
13. Foorthuis, R.: On the nature and types of anomalies: A review of deviations in data. Int. J.
Data Sci. Anal. 12, 297-331 (2021)
14. Cuadrado-Gallego, J. J., Demchenko, Y.: The Data Science Framework: A View from the
EDISON Project. Springer Nature Switzerland AG, Cham (2020)
15. Kuhn, T.: The Structure of Scientific Revolutions. 2nd ed. The University of Chicago Press,
Chicago (1970)
16. Hawkins, D.: Identification of Outliers. Chapman and Hall (1980)
17. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv.
41(3) 15 (2009)
18. Bank for International Settlements: Basel Committee on Banking Supervision: International
Convergence of Capital Measurement and Capital Standards (2006)
19. DIN Deutsches Institut für Normung e. V.: DIN EN 13306: Instandhaltung - Begriffe der
Instandhaltung. Beuth Verlag GmbH, Berlin (2010)

New Metrics for Classifying Phylogenetic Trees
Using 𝑲-means and the Symmetric Difference
Metric

Nadia Tahiri and Aleksandr Koshkarov

Abstract The 𝑘-means method can be adapted to any type of metric space and is sometimes linked to median procedures. This is the case for the symmetric difference (or Robinson and Foulds) metric in phylogeny, where it can lead to median trees as well as to Euclidean embedding. We show how a specific version of the popular 𝑘-means clustering algorithm, based on interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data are homogeneous) or several (when the data are heterogeneous) clusters of trees. We have adapted the popular Silhouette and Gap cluster validity indices to tree clustering with 𝑘-means. In this article, we show results of this new approach on a real dataset (aminoacyl-tRNA synthetases). The new version of phylogenetic tree clustering makes the method well suited for the analysis of large genomic datasets.

Keywords: clustering, symmetric difference metrics, 𝑘-means, phylogenetic trees,


cluster validity indices

1 Introduction

In biology, one of the most significant organizing principles is the "Tree of Life"
(ToL) [12]. In genetic studies, there is evidence of an enormous number of branches,
but even a rough estimate of the total size of the tree remains difficult. Many recent

Nadia Tahiri ( )
Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada,
e-mail: [email protected]
Aleksandr Koshkarov
Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada;
Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056, Russia,
e-mail: [email protected]


representations of ToL have emphasized either the existence of deep evolutionary


relationships [7] or the knowledge of a large and diverse variety of life, with an
emphasis on Eukaryotes [8]. These approaches do not consider the dramatic evolution
in our understanding of the diversity of life due to genomic sampling of previously
unexplored environments.
As a result, Maddison in 1991 [11] was the first to formulate the idea of multiple
consensus trees when he described his phylogenetic island method. He observed
that island consensus trees can differ significantly from each other and are generally
better resolved than the species-wide consensus tree. The most intuitive approach to
discovering and clustering genes that share similar evolutionary histories is to cluster
their genetic phylogenies. In this context, Stockham et al. in 2002 [18] proposed
a tree clustering algorithm based on 𝑘-means [4, 9, 10] and the Robinson and
Foulds quadratic distance [15]. Their clustering algorithm aims to infer a set of
strict consensus trees, minimizing information loss. They proceed by determining
the consensus trees for each set of clusters in all intermediate partitioning solutions
tested by 𝑘-means. This makes the Stockham et al. algorithm very expensive in
terms of execution time. More recently, Tahiri et al. in 2018 [19] proposed a fast and
accurate tree clustering method based on 𝑘-medoids. Finally, Silva and Wilkinson
in 2021 [17] introduced a revised definition of tree islands based on any tree-to-tree
metric that usefully extends this notion to any set or multiset of trees and provided
an interesting discussion of biological applications of their method.
In this context, the use of a method that infers multiple supertrees (i.e., a supertree
clustering method) would help discover and cluster alternative evolutionary scenarios
for several ToL subtrees.
The paper is structured as follows. In the next section, we introduce a new metric
for the 𝑘-means algorithm based on the Robinson and Foulds distance. Section 3 presents the simulation results (on a real dataset) obtained with our algorithm compared to other clustering methods. Finally, we discuss our contributions in Section 4.

2 Methods

The 𝑘-means algorithm [9, 10] is a very common algorithm for data partitioning. From a set of 𝑁 observations 𝑥1 , . . . , 𝑥 𝑁 , each one being described by 𝑀 variables, this
algorithm creates a partition in 𝑘 homogeneous classes or clusters. Each observation
corresponds to a point in an 𝑀-dimensional space and the proximity between two
points is measured by the distance between them. In the framework of 𝑘-means, the
most commonly used distances are the Euclidean distance, Manhattan distance, and
Minkowski distance [4]. To be precise, the objective of the algorithm is to find the
partition of the 𝑁 points into 𝑘 clusters in such a way that the sum of the squares of the
distances of the points to the center of gravity of the group to which they are assigned
is minimal. To the best of our knowledge, finding an optimal partition according to
the 𝑘-means least-squares criterion is known to be NP-hard [13]. Considering this

fact, several polynomial-time heuristics were developed, most of which have the time
complexity of O (𝐾 𝑁 𝐼 𝑀) for finding an approximate partitioning solution, where
𝐾 is the maximum possible number of clusters, 𝑁 is the number of objects (for
example, phylogenetic trees), 𝐼 is the number of iterations in the 𝑘-means algorithm,
and 𝑀 is the number of variables characterizing each of the 𝑁 objects.
A well-known metric of comparing two tree topologies in computational biology
is the Robinson-Foulds distance (𝑅𝐹), also known as the symmetric-difference dis-
tance [15]. Moreover, the distance 𝑅𝐹 is a topological distance, which means that
it does not consider the length of the edges of the tree. The formula of 𝑅𝐹 distance
can be described as (𝑛1 (𝑇1 ) + 𝑛2 (𝑇2 )), where 𝑛1 (𝑇1 ) is the number of partitions of
data implied by the tree 𝑇1 , but not the tree 𝑇2 and 𝑛2 (𝑇2 ) is the number of partitions
of data implied by the tree 𝑇2 but not the tree 𝑇1 . According to Barthélemy and
Monjardet [1], the majority-rule consensus tree of a set of trees is the median tree of
this set. This fact makes the use of tree clustering possible.
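A small, self-contained Python sketch of this symmetric-difference computation is given below (illustrative only; in practice the bipartitions would be extracted from Newick strings with a phylogenetics library, and the example trees are invented).

```python
def robinson_foulds(splits_t1, splits_t2):
    """Robinson-Foulds (symmetric-difference) distance between two trees,
    each represented by its set of non-trivial bipartitions of the leaves."""
    return len(splits_t1 ^ splits_t2)   # size of the symmetric difference

def bipartition(side, leaves):
    """Canonical representation of a split: the unordered pair {side, complement}."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})

leaves = {"A", "B", "C", "D", "E"}
# T1 = ((A,B),(C,(D,E))) and T2 = ((A,C),(B,(D,E))), both seen as unrooted trees.
t1 = {bipartition({"A", "B"}, leaves), bipartition({"D", "E"}, leaves)}
t2 = {bipartition({"A", "C"}, leaves), bipartition({"D", "E"}, leaves)}

print(robinson_foulds(t1, t2))   # 2: one split unique to each tree
```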

2.1 Silhouette Index Adapted for Tree Clustering

The first popular cluster validity index we consider in our study is the Silhouette
width (𝑆𝐻) [16]. Traditionally, the Silhouette width of the cluster 𝑘 is defined as
follows:

$$s(k) = \frac{1}{N_k}\left[\sum_{i=1}^{N_k} \frac{b(i) - a(i)}{\max(a(i), b(i))}\right], \qquad (1)$$

where $N_k$ is the number of objects belonging to cluster $k$, $a(i)$ is the average distance between object $i$ and all other objects belonging to cluster $k$, and $b(i)$ is the smallest, over all clusters $k'$ different from cluster $k$, of all average distances between $i$ and all the objects of cluster $k'$.
We used Equations (2) and (4) for calculating 𝑎(𝑖) and 𝑏(𝑖), respectively, in
our tree clustering algorithm (see also [19]). For instance, the quantity 𝑎(𝑖) can be
calculated as follows:

$$a(i) = \left[\sum_{j=1}^{N_k} \frac{RF(T_{ki}, T_{kj})}{2\,n(T_{ki}, T_{kj}) - 6} + \xi\right] / N_k, \qquad (2)$$

where 𝑁 𝑘 is the number of trees in cluster 𝑘, 𝑇𝑘𝑖 and 𝑇𝑘 𝑗 are, respectively, trees 𝑖
and 𝑗 in cluster 𝑘, 𝑛(𝑇𝑘𝑖 ) is the number of leaves in tree 𝑇𝑘𝑖 , 𝑛(𝑇𝑘 𝑗 ) is the number of
leaves in tree 𝑇𝑘 𝑗 , and 𝜉 is a penalty function which is defined as follows:

$$\xi = \alpha \times \frac{\min(n(T_{ki}), n(T_{kj})) - n(T_{ki}, T_{kj})}{\min(n(T_{ki}), n(T_{kj}))}, \qquad (3)$$

where 𝛼 is the penalization (tuning) parameter, taking values between 0 and 1, used to prevent trees that have only a small percentage of leaves in common from being placed in the same cluster, and 𝑛(𝑇𝑘𝑖 , 𝑇𝑘 𝑗 ) is the number of common leaves in trees 𝑇𝑘𝑖 and 𝑇𝑘 𝑗 .
The formula for $b(i)$ is as follows:

$$b(i) = \min_{1 \le k' \le K,\, k' \ne k} \left[\sum_{j=1}^{N_{k'}} \frac{RF(T_{ki}, T_{k'j})}{2\,n(T_{ki}, T_{k'j}) - 6} + \xi\right] / N_{k'}, \qquad (4)$$

where $T_{k'j}$ is the tree $j$ of cluster $k'$, such that $k' \ne k$, and $N_{k'}$ is the number of trees in cluster $k'$.
The optimal number of clusters, 𝐾, corresponds to the maximum average Silhou-
ette width, 𝑆𝐻, which is calculated as follows:
$$SH = s(K) = \left[\sum_{k=1}^{K} s(k)\right] / K. \qquad (5)$$

The value of the Silhouette index defined by Equation (5) ranges from -1 to +1.
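The adapted index can be sketched as follows (an illustration only: the dissimilarity matrix is invented and stands for the penalised, normalised RF values of Equations (2) and (4)).

```python
def tree_silhouette(dist, clusters):
    """Average Silhouette width of Equations (1)-(5) for a partition of trees.

    dist[i][j] is the normalised RF dissimilarity (penalty included) between
    trees i and j; clusters maps each cluster id to the list of its tree ids."""
    widths = []
    for k, members in clusters.items():
        s_k = 0.0
        for i in members:
            a_i = sum(dist[i][j] for j in members) / len(members)      # Eq. (2)
            b_i = min(sum(dist[i][j] for j in other) / len(other)      # Eq. (4)
                      for kk, other in clusters.items() if kk != k)
            s_k += (b_i - a_i) / max(a_i, b_i)
        widths.append(s_k / len(members))                              # Eq. (1)
    return sum(widths) / len(widths)                                   # Eq. (5)

# Toy dissimilarities for 4 trees and the partition {0, 1} / {2, 3}.
D = [[0.0, 0.1, 0.8, 0.9],
     [0.1, 0.0, 0.7, 0.8],
     [0.8, 0.7, 0.0, 0.2],
     [0.9, 0.8, 0.2, 0.0]]
print(tree_silhouette(D, {1: [0, 1], 2: [2, 3]}))   # close to +1: well separated
```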

2.2 Gap Statistic Adapted for Tree Clustering

It is worth noting that the 𝑆𝐻 cluster validity index (Equations (1) to (5)) does not allow
comparing the solution consisting of a single consensus tree (𝐾 = 1; the calculation of
𝑆𝐻 is impossible in this case) with clustering solutions involving multiple consensus
trees or supertrees (𝐾 ≥ 2). This can be considered as an important disadvantage
of the 𝑆𝐻-based classifications because a good tree clustering method should be
able to recover a single consensus tree or supertree when the input set of trees is
homogeneous (e.g. for a set of gene trees that share the same evolutionary history).
The Gap statistic was first used by Tibshirani et al. [20] to estimate the number of
clusters provided by partitioning algorithms. The formulas proposed by Tibshirani
et al. were based on the properties of the Euclidean distance. In the context of tree
clustering, the Gap statistic can be defined as follows. Consider a clustering of 𝑁
trees into 𝐾 non-empty clusters, where 𝐾 ≥ 1. First, we define the total intracluster
distance, 𝐷 𝑘 , characterizing the cohesion between the trees belonging to the same
cluster $k$:

$$D_k = \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} \left[\frac{RF(T_{ki}, T_{kj})}{2\,n(T_{ki}, T_{kj}) - 6} + \xi\right]. \qquad (6)$$

Then, the sum of the average total intracluster distances, 𝑉𝐾 , can be calculated
using this formula:
$$V_K = \sum_{k=1}^{K} \frac{1}{2 N_k} D_k. \qquad (7)$$

Finally, the Gap statistic, which reflects the quality of a given clustering solution
including 𝐾 clusters, can be defined as follows:

$$Gap_N(K) = E_N^*\left[\log(V_K)\right] - \log(V_K). \qquad (8)$$

where 𝐸 ∗𝑁 denotes expectation under a sample of size 𝑁 from the reference distri-
bution. The following formula [20] for the expectation of 𝑙𝑜𝑔(𝑉𝐾 ) was used in our
algorithm:
$$E_N^*\left[\log(V_K)\right] = \log(Nn/12) - (2/n)\log(K), \qquad (9)$$
where 𝑛 is the number of tree leaves.
The largest value of the Gap statistic corresponds to the best clustering.
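The Gap computation of Equations (6)–(9) can be sketched in the same spirit (illustrative Python, reusing the kind of normalised RF dissimilarity matrix used in the Silhouette example; the number of leaves is an invented value).

```python
import math

def tree_gap(dist, clusters, n_leaves):
    """Gap statistic of Equations (6)-(9) for a clustering of N trees."""
    N = sum(len(m) for m in clusters.values())
    K = len(clusters)
    V_K = 0.0
    for members in clusters.values():
        D_k = sum(dist[i][j] for i in members for j in members)            # Eq. (6)
        V_K += D_k / (2 * len(members))                                    # Eq. (7)
    expected = math.log(N * n_leaves / 12) - (2 / n_leaves) * math.log(K)  # Eq. (9)
    return expected - math.log(V_K)                                        # Eq. (8)

D = [[0.0, 0.1, 0.8, 0.9],
     [0.1, 0.0, 0.7, 0.8],
     [0.8, 0.7, 0.0, 0.2],
     [0.9, 0.8, 0.2, 0.0]]
print(tree_gap(D, {1: [0, 1], 2: [2, 3]}, n_leaves=8))   # K = 2: larger Gap
print(tree_gap(D, {1: [0, 1, 2, 3]}, n_leaves=8))        # K = 1: smaller Gap
```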

3 Results - A Biological Example

To illustrate the methods described above, we used a dataset from Woese et al. [22].
The aminoacyl-tRNA synthetases (aaRSs) are enzymes that attach the appropriate
amino acid onto its cognate transfer RNA. The structure-function aspect of aaRSs
has long attracted the attention of biologists [22, 6]. Moreover, the relationship of
aaRSs to the genetic code is observed from the evolutionary view (the central role
played by the aaRSs in translation would suggest that their histories and that of the
genetic code are somehow intertwined [22]). The novel domain additions to aaRSs
genes play an important role in the inference of the ToL.
We encoded 20 original aminoacyl-tRNA synthetase trees from Woese et al. [22]
in Newick format and then split some of them into sub-trees to account for cases
where the same species appeared more than once in the original tree. Our approach
cannot handle data that includes multiple instances of the same species in the input
trees. Thus, 36 aaRS trees with different numbers of leaves (including 72 species
in total) were used as input of our algorithm (their Newick strings are available at:
https://github.com/tahiri-lab/PhyloClust). Our approach was applied
with the 𝛼 parameter set to 1.
First, we implemented our new approach with the Gap statistic cluster validity index, which indicated the presence of 7 clusters of trees in the data, thus suggesting a heterogeneous scenario of their evolution. Then, we conducted the computation using
the 𝑆𝐻 cluster validity index and obtained 2 clusters of trees each of which could
be represented by its own supertree. The first cluster obtained using 𝑆𝐻 included 19
trees for a total of 56 organisms, whereas the second cluster included 17 trees for
a total of 61 organisms. The supertrees (see Figure 1) for the two obtained clusters
of trees were inferred using the CLANN program [5]. Further, we decided to infer
the most common horizontal gene transfers which characterized the evolution of
gene trees included in the two obtained tree clusters. The method of [3], reconciling
the species and gene phylogenies to infer transfers, was used for this purpose. The
species phylogenies followed the NCBI taxonomic classification. These phylogenies
were not fully resolved (the species phylogeny in Figure 1a contains 9 internal nodes
with a degree higher than 3 and the species phylogeny in Figure 1b contains 10 internal nodes with a degree higher than 3).

Fig. 1 Nonbinary species trees corresponding to the NCBI taxonomic classification, with the leaves grouped into the Archaea, Eukaryota and Bacteria clades: (a) 56 species for cluster 1; the 4 HGTs (indicated by arrows) were found by the SH index for the first cluster; (b) 61 species for cluster 2; the 2 HGTs (indicated by arrows) were found by the SH index with α equal to 1 for the second cluster. We applied the Most Similar Supertree method (dfit) [5] implemented in the CLANN software with the mrp criterion, which is a matrix representation employing parsimony.
We used the version of the HGT (Horizontal Gene Transfer) algorithm available
on the T-Rex web site [2] to identify the scenarios of HGT events that reconcile the
species tree and each of the supertrees. We chose the same root for the species trees and the supertrees: the root separating Bacteria from the clade of Eukaryota and Archaea.
For the first cluster composed of 56 species, we obtained 40 transfers with 22
regular and 18 trivial HGTs. Trivial HGTs are necessary to transform a non-binary
tree into a binary tree. We removed the trivial HGTs and selected among the regular HGTs. The non-trivial HGTs with low representation are most likely due to tree reconstruction artefacts. In Figure 1a, we illustrated only those HGTs that are most
represented in the dataset.
We followed the same procedure for the second cluster composed of 61 species
and obtained 42 transfers with 28 regular and 14 trivial HGTs that are not represented
here. We selected only the most popular HGTs in the dataset. All other transfers are
represented in Figure 1b.
The transfer link of P. horikoshii to the clade of spirochetes (i.e. B. burgdorferi and T. pallidum) was found by [3, 14]. The transfers of P. horikoshii to P. aerophilum were also found by [14]. These results confirmed the existing HGT findings of [3, 14].

4 Discussion

Many research groups are estimating trees containing several thousands to hundreds
of thousands of species, toward the eventual goal of the estimation of the Tree of Life,
containing perhaps several million leaves. These phylogenetic estimations present
enormous computational challenges, and current computational methods are likely to
fail to run even with datasets on the low end of this range. One approach to estimate a
large species tree is to use phylogenetic estimation methods (such as maximum like-
lihood) on a supermatrix produced by concatenating multiple sequence alignments
for a collection of markers; however, the most accurate of these phylogenetic estima-
tion methods are extremely computationally intensive for datasets with more than
a few thousand sequences. Supertree methods, which assemble phylogenetic trees
from a collection of trees on subsets of the taxa, are important tools for phylogeny
estimation where phylogenetic analyses based upon maximum likelihood (ML) are
infeasible.
In this article, we described a new algorithm for partitioning a set of phylogenetic trees into several clusters in order to infer multiple supertrees, where the input trees have different, but mutually overlapping, sets of leaves. We presented new formulas
that allow the use of the popular Silhouette and Gap statistic cluster validity indices
along with the Robinson and Foulds topological distance in the framework of tree
clustering based on the popular 𝑘-means algorithm. The new algorithm can be used
to address a number of important issues in bioinformatics, such as the identification
of genes having similar evolutionary histories, e.g. those that underwent the same
horizontal gene transfers or those that were affected by the same ancient duplication
events. It can also be used for the inference of multiple subtrees of the Tree of Life. In
order to compute the Robinson and Foulds topological distance between such pairs
of trees, we can first reduce them to a common set of leaves. After this reduction,
the Robinson and Foulds distance is normalized by its maximum value, which is
equal to 2𝑛 − 6 for two binary trees with 𝑛 leaves. Overall, the good performance
achieved by the new algorithm in both clustering quality and running time makes it
well suited for analyzing large genomic and phylogenetic datasets. A C++ program,
called PhyloClust (Phylogenetic trees Clustering), implementing the discussed tree
partitioning algorithm is freely available at https://github.com/tahiri-lab/
PhyloClust.
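As an illustration of the distance underlying the method, the sketch below computes the normalised Robinson and Foulds distance between two trees after reducing them to their common leaves. It relies on the dendropy library, which is only an assumption made for this example; the published PhyloClust program is a C++ implementation and does not use it.

import dendropy
from dendropy.calculate import treecompare

def normalised_rf(newick1, newick2):
    tns = dendropy.TaxonNamespace()
    t1 = dendropy.Tree.get(data=newick1, schema="newick", taxon_namespace=tns)
    t2 = dendropy.Tree.get(data=newick2, schema="newick", taxon_namespace=tns)
    # reduce both trees to their common set of leaves
    common = set(l.taxon.label for l in t1.leaf_node_iter()) & \
             set(l.taxon.label for l in t2.leaf_node_iter())
    t1.retain_taxa_with_labels(common)
    t2.retain_taxa_with_labels(common)
    t1.encode_bipartitions()
    t2.encode_bipartitions()
    rf = treecompare.symmetric_difference(t1, t2)
    n = len(common)
    # 2n - 6 is the maximum RF value for two binary trees with n leaves
    return rf / (2 * n - 6)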

Acknowledgements We would like to thank Andrey Veriga and Boris Morozov for helping us with
the analysis of Aminoacyl-tRNA synthetases data. We also thank Compute Canada for providing
access to high-performance computing facilities. This work was supported by Fonds de Recherche
sur la Santé of Québec and University of Sherbrooke grant.

References

1. Barthelemy, J., Monjardet, B.: The median procedure in cluster analysis and social choice
theory. Math. Soc. Sci. 1, 235-267 (1981)
2. Boc, A., Legendre, P., Makarenkov, V.: An efficient algorithm for the detection and classifi-
cation of horizontal gene transfer events and identification of mosaic genes. Algorithms From
And For Nature And Life. pp. 253-260 (2013)
3. Boc, A., Philippe, H., Makarenkov, V.: Inferring and validating horizontal gene transfer events
using bipartition dissimilarity. Syst. Biol. 59, 195-211 (2010)
4. Bock, H.: Clustering methods: a history of k-means algorithms. Selected Contributions In
Data Analysis And Classification. pp. 161-172 (2007)
5. Creevey, C., Fitzpatrick, D., Philip, G., Kinsella, R., O’Connell, M., Pentony, M., Travers,
S., Wilkinson, M., McInerney, J.: Does a tree–like phylogeny only exist at the tips in the
prokaryotes?. Proc. Roy. Soc. Lond. B Biol. Sci. 271, 2551-2558 (2004)
6. Godwin, R., Macnamara, L., Alexander, R., Salsbury Jr, F.: Structure and dynamics of tRNA-
met containing core substitutions. ACS Omega. 3, 10668-10678 (2018)
7. Gouy, R., Baurain, D., Philippe, H.: Rooting the tree of life: the phylogenetic jury is still out.
Phil. Trans. Biol. Sci. 370, 20140329 (2015)
8. Hinchliff, C., Smith, S., Allman, J., Burleigh, J., Chaudhary, R., Coghill, L., Crandall, K., Deng,
J., Drew, B., Gazis, R. et al.: Synthesis of phylogeny and taxonomy into a comprehensive tree
of life. Proc. Natl. Acad. Sci. Unit. States Am. 112, 12764-12769 (2015)
9. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theor. 28, 129-137 (1982)
10. MacQueen, J. et al.: Some methods for classification and analysis of multivariate observations.
Proceedings of the Fifth Berkeley Symposium On Mathematical Statistics and Probability. 1,
281-297 (1967)
11. Maddison, D.: The discovery and importance of multiple islands of most-parsimonious trees.
Syst. Biol. 40, 315-328 (1991)
12. Maddison, D., Schulz, K., Maddison, W. et al.: The tree of life web project. Zootaxa. 1668,
19-40 (2007)
13. Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The planar k-means problem is NP-hard.
International Workshop On Algorithms And Computation. pp. 274-285 (2009)
14. Makarenkov, V., Boc, A., Delwiche, C., Philippe, H. et al.: New efficient algorithm for
modeling partial and complete gene transfer scenarios. Data Science And Classification.
341-349 (2006)
15. Robinson, D., Foulds, L.: Comparison of phylogenetic trees. Math. Biosci. 53, 131-147 (1981)
16. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math. 20, 53-65 (1987)
17. Silva, A., Wilkinson, M.: On defining and finding islands of trees and mitigating large island bias. Syst. Biol. 70(6), 1282-1294 (2021)
18. Stockham, C., Wang, L., Warnow, T.: Statistically based postprocessing of phylogenetic anal-
ysis by clustering. Bioinformatics. 18, S285-S293 (2002)
19. Tahiri, N., Willems, M., Makarenkov, V.: A new fast method for inferring multiple consensus
trees using k-medoids. BMC Evol. Biol. 18, 1-12 (2018)
20. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the
gap statistic. J. Roy. Stat. Soc. B Stat. Meth. 63, 411-423 (2001)
21. Whidden, C., Zeh, N., Beiko, R.: Supertrees based on the subtree prune-and-regraft distance.
Syst. Biol. 63, 566-581 (2014)
22. Woese, C., Olsen, G., Ibba, M., Soll, D.: Aminoacyl-tRNA synthetases, the genetic code, and
the evolutionary process. Microbiol. Mol. Biol. Rev. 64, 202-236 (2000)

On Parsimonious Modelling via Matrix-variate t
Mixtures

Salvatore D. Tomarchio

Abstract Mixture models for matrix-variate data have become increasingly popular in recent years. One issue of these models is the potentially high
number of parameters. To address this concern, parsimonious mixtures of matrix-
variate normal distributions have been recently introduced in the literature. However,
when data contains groups of observations with longer-than-normal tails or atypi-
cal observations, the use of the matrix-variate normal distribution for the mixture
components may affect the fitting of the resulting model. Therefore, we consider a
more robust approach based on the matrix-variate 𝑡 distribution for modeling the
mixture components. To introduce parsimony, we use the eigen-decomposition of the
components scale matrices and we allow the degrees of freedom to be equal across
groups. This produces a family of 196 parsimonious matrix-variate 𝑡 mixture mod-
els. Parameter estimation is obtained by using an AECM algorithm. The use of our
parsimonious models is illustrated via a real data application, where parsimonious
matrix-variate normal mixtures are also fitted for comparison purposes.

Keywords: matrix-variate, mixture models, clustering, parsimonious models

1 Introduction

The matrix-variate model-based clustering literature has been expanding over the last few years, as confirmed by the high number of contributions using finite mixture models for the modelling of matrix-variate data [1, 2, 3, 4, 5, 6, 7, 8]. This
kind of data is arranged in three-dimensional arrays, and depending on the entities
indexed in each of the three layers, different data examples might be considered
[9]. In many of these applications, we observe a p × r matrix for each statistical observation. Thus, from a model-based clustering perspective, the challenge is to suitably cluster realizations coming from random matrices.

Salvatore D. Tomarchio
University of Catania, Department of Economics and Business, Catania, Italy,
e-mail: [email protected]
One problem of matrix-variate mixture models is the potentially high number of
parameters. To cope with this issue, [5] have recently proposed a family of parsimo-
nious mixtures based on the matrix-variate normal (MVN) distribution. Nevertheless,
for many datasets, the tails of the MVN distribution are often shorter than required.
This has several consequences for parameter estimation as well as for the proper classification of the data [4, 7]. Therefore, in this paper we relax the normality assumption of the
mixture components by using (in a parsimonious setting) the matrix-variate 𝑡 (MVT)
distribution. The MVT distribution has been used within the finite mixture model
paradigm by [10] in an unconstrained framework. Here, to introduce parsimony in
this model, (i) we use the eigen-decomposition of the two scale matrices of each
mixture component and (ii) we allow the degrees of freedom to be tied across the
groups. This produces the family of 196 parsimonious matrix-variate MVT mixture
models (MVT-Ms) discussed in Section 2. Parameter estimation is implemented by
using an alternating expectation-conditional maximization (AECM) algorithm [12].
In Section 3, our parsimonious MVT-Ms, along with parsimonious matrix-variate
MVN mixture models (MVN-Ms) for comparison purposes, are fitted to a Swedish
municipalities expenditure dataset. The differences in terms of fitting among the two
families of models are illustrated. The estimated parameters and the data partition
of the overall best fitting model are also commented. Finally, some conclusions are
drawn in Section 4.

2 Methodology

2.1 Parsimonious Mixtures of Matrix-variate 𝒕 Distributions

The probability density function (pdf) of a p × r random matrix X coming from a finite mixture model is

f_{\mathrm{MIXT}}(X; \Omega) = \sum_{g=1}^{G} \pi_g\, f(X; \Theta_g),    (1)

where \pi_g is the gth mixing proportion, such that \pi_g > 0 and \sum_{g=1}^{G} \pi_g = 1, f(X; \Theta_g) is the gth component pdf with parameter \Theta_g, and \Omega contains all of the parameters of the mixture. In this paper, for the gth component of model (1), we adopt the MVT distribution having pdf

f_{\mathrm{MVT}}(X; \Theta_g) = \frac{|\Sigma_g|^{-\frac{r}{2}}\, |\Psi_g|^{-\frac{p}{2}}\, \Gamma\!\left(\frac{pr+\nu_g}{2}\right)}{(\pi\nu_g)^{\frac{pr}{2}}\, \Gamma\!\left(\frac{\nu_g}{2}\right)} \left[ 1 + \frac{\delta_g\!\left(X; M_g, \Sigma_g, \Psi_g\right)}{\nu_g} \right]^{-\frac{pr+\nu_g}{2}},    (2)

where \delta_g(X; M_g, \Sigma_g, \Psi_g) = \mathrm{tr}\!\left[\Sigma_g^{-1}(X - M_g)\Psi_g^{-1}(X - M_g)'\right], M_g is the p × r component mean matrix, \Sigma_g is the p × p component row scale matrix, \Psi_g is the r × r component column scale matrix and \nu_g > 0 is the component degrees of freedom.
It is interesting to recall that the pdf in (2) can be hierarchically obtained via the matrix-variate normal scale mixture model when the mixing random variable W follows a gamma distribution with shape and rate parameters set to \nu_g/2 [10]. Specifically, a hierarchical representation of the MVT distribution can be given as follows:

1. W \sim \mathcal{G}(\nu_g/2, \nu_g/2),
2. X \mid W = w \sim \mathcal{N}(M_g, \Sigma_g/w, \Psi_g),
where G (·) is a gamma distribution and N (·) denotes the MVN distribution. This
representation will be convenient for parameter estimation presented in Section 2.2.
As discussed in Section 1, the mixture model in (1) may be characterized by a
potentially high number of parameters. To address this concern, we firstly use the
eigen-decomposition of the components scale matrices 𝚺𝑔 and 𝚿𝑔 . In detail, we
recall that a generic 𝑞 × 𝑞 scale matrix 𝚽𝑔 can be decomposed as [11]

𝚽𝑔 = 𝜆 𝑔 𝚪𝑔 𝚫𝑔 𝚪𝑔0 , (3)

where 𝜆 𝑔 = |𝚽𝑔 | 1/𝑞 , 𝚪𝑔 is a 𝑞 × 𝑞 orthogonal matrix whose columns are the


normalized eigenvectors of 𝚽𝑔 , and 𝚫𝑔 is the scaled (|𝚫𝑔 | = 1) diagonal matrix of
the eigenvalues of 𝚽𝑔 . By constraining the three components in (3), the following
family of 14 parsimonious structures is obtained: EII, VII, EEI, VEI, EVI, VVI,
EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV, where “E” stands for equal, “V”
means varying and “I” denotes the identity matrix.
If we apply the decomposition in (3) to 𝚺𝑔 and 𝚿𝑔 , we obtain 14 × 14 = 196
parsimonious structures. However, to solve a well-known identifiability issue related
to the scale matrices of matrix-variate distributions [1, 3, 5], we impose the restriction
|𝚿𝑔 | = 1, which makes the parameter 𝜆 𝑔 unnecessary, and reduces the number of
parsimonious structures related to 𝚿𝑔 from 14 to 7: II, EI, VI, EE, VE, EV, VV.
Thus, we have 14×7 = 98 parsimonious structures for the component scale matrices.
To further increase the parsimony of model (1), we also consider the option
of constraining the component degrees of freedom \nu_g. The nomenclature used is the same as that adopted for the scale matrices. This option, combined with that
discussed above for the scale matrices, allows us to produce a total of 98 × 2 = 196
parsimonious MVT-Ms.
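The nomenclature of the family can be made explicit with a small sketch that enumerates all combinations; only the structure labels come from the text above, and the code is purely illustrative.

from itertools import product

# 14 parsimonious structures for the row scale matrices (Eq. 3), 7 for the
# column scale matrices (|Psi_g| = 1 removes the volume parameter), and the
# degrees of freedom either Equal across groups or Varying.
SIGMA = ["EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE",
         "VEE", "EVE", "VVE", "EEV", "VEV", "EVV", "VVV"]
PSI = ["II", "EI", "VI", "EE", "VE", "EV", "VV"]
DF = ["E", "V"]

models = ["-".join(m) for m in product(SIGMA, PSI, DF)]
print(len(models))            # 14 * 7 * 2 = 196
print("VVE-EE-V" in models)   # the structure selected in Section 3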

2.2 An AECM Algorithm for Parameter Estimation

To estimate the parameters of our family of mixture models, we implement an AECM algorithm. By using the hierarchical representation of Section 2.1, our complete data are S_c = \{X_i, z_i, w_i\}_{i=1}^{N}, where z_i = (z_{i1}, \ldots, z_{iG})', such that z_{ig} = 1 if observation i belongs to group g and z_{ig} = 0 otherwise, and w_i is the realization of W. Therefore, the complete-data log-likelihood can be written as
\ell_c(\Omega; S_c) = \ell_{1c}(\pi; S_c) + \ell_{2c}(\Xi; S_c) + \ell_{3c}(\vartheta; S_c),    (4)

where

\ell_{1c}(\pi; S_c) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln \pi_g ,

\ell_{2c}(\Xi; S_c) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big[ -\frac{pr}{2}\ln(2\pi) + \frac{pr}{2}\ln w_{ig} - \frac{r}{2}\ln|\Sigma_g| - \frac{p}{2}\ln|\Psi_g| - \frac{w_{ig}\,\delta_g(X_i; M_g, \Sigma_g, \Psi_g)}{2} \Big],    (5)

\ell_{3c}(\vartheta; S_c) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \Big\{ \frac{\nu_g}{2}\ln\frac{\nu_g}{2} - \ln\Gamma\Big(\frac{\nu_g}{2}\Big) + \Big(\frac{\nu_g}{2} - 1\Big)\ln w_{ig} - \frac{\nu_g}{2} w_{ig} \Big\},

with \pi = \{\pi_g\}_{g=1}^{G}, \Xi = \{M_g, \Sigma_g, \Psi_g\}_{g=1}^{G} and \vartheta = \{\nu_g\}_{g=1}^{G}.
Our AECM algorithm then proceeds as follows (notice that the parameters marked with one dot are the updates of the previous iteration, while those marked with two dots are the updates at the current iteration):
E-step At the E-step we have to compute the following quantities

\ddot{z}_{ig} = \frac{\dot{\pi}_g\, f_{\mathrm{MVT}}(X_i; \dot{\Theta}_g)}{\sum_{h=1}^{G} \dot{\pi}_h\, f_{\mathrm{MVT}}(X_i; \dot{\Theta}_h)} \qquad \text{and} \qquad \ddot{w}_{ig} = \frac{pr + \dot{\nu}_g}{\dot{\nu}_g + \dot{\delta}_g(X_i; \dot{M}_g, \dot{\Sigma}_g, \dot{\Psi}_g)} .    (6)

There is no need to compute the expected value of ln 𝑊𝑖𝑔 , given that we do not
use this quantity to update 𝜈𝑔 .
CM-step 1 At the first CM-step, we have the following updates

\ddot{\pi}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig}}{N} \qquad \text{and} \qquad \ddot{M}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig}\,\ddot{w}_{ig}\, X_i}{\sum_{i=1}^{N} \ddot{z}_{ig}\,\ddot{w}_{ig}} .
Because of space constraints, we cannot report here the updates of each par-
simonious structure related to 𝚺𝑔 and 𝚿𝑔 . However, they can be obtained by
generalizing the results in [5]. The only differences consist in the updates of the
row and column scatter matrices of the gth component, which are here defined as

\ddot{W}_g^{R} = \sum_{i=1}^{N} \ddot{z}_{ig}\,\ddot{w}_{ig}\, (X_i - \ddot{M}_g)\, \dot{\Psi}_g^{-1} (X_i - \ddot{M}_g)' ,

\ddot{W}_g^{C} = \sum_{i=1}^{N} \ddot{z}_{ig}\,\ddot{w}_{ig}\, (X_i - \ddot{M}_g)'\, \ddot{\Sigma}_g^{-1} (X_i - \ddot{M}_g) .
CM-step 2 At the second CM-step, we firstly define the "partial" complete-data log-likelihood function according to the following specification

\ell_{pc}(\Omega; S_{pc}) = \ell_{1c}(\pi; S_{pc}) + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln f_{\mathrm{MVT}}(X_i; \Theta_g),    (7)

where "partial" refers to the fact that the complete data are now defined as S_{pc} = \{X_i, z_i\}_{i=1}^{N}. Then, \ddot{\nu}_g is determined by maximizing

\sum_{i=1}^{N} \ddot{z}_{ig} \ln f_{\mathrm{MVT}}(X_i; \ddot{\Theta}_g) \qquad \text{or} \qquad \sum_{i=1}^{N} \sum_{g=1}^{G} \ddot{z}_{ig} \ln f_{\mathrm{MVT}}(X_i; \ddot{\Theta}_g),

over \nu_g ∈ (0, 100), depending on the parsimonious structure selected, i.e. V or E, respectively. Notice that a higher upper bound could also have been selected for the maximization problem but, with the chosen value, the differences between an estimated MVT distribution and the nested MVN distribution would be negligible. Furthermore, when a heavy-tailed distribution approaches normality, the precision of the estimated tailedness parameters is unreliable [4].
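A schematic Python sketch of the E-step of Equation (6) and of the first CM-step for the unconstrained model is given below; the parsimonious updates of the scale matrices and the numerical maximisation over the degrees of freedom are omitted, and all names are illustrative rather than taken from an existing implementation.

import numpy as np
from scipy.special import gammaln

def delta(X, M, Sigma_inv, Psi_inv):
    # Mahalanobis-type distance delta_g of Equation (2)
    R = X - M
    return np.trace(Sigma_inv @ R @ Psi_inv @ R.T)

def log_mvt(X, M, Sigma, Psi, nu):
    # log-density of the matrix-variate t distribution, Equation (2)
    p, r = X.shape
    d = delta(X, M, np.linalg.inv(Sigma), np.linalg.inv(Psi))
    return (-0.5 * r * np.linalg.slogdet(Sigma)[1]
            - 0.5 * p * np.linalg.slogdet(Psi)[1]
            + gammaln((p * r + nu) / 2) - gammaln(nu / 2)
            - 0.5 * p * r * np.log(np.pi * nu)
            - 0.5 * (p * r + nu) * np.log1p(d / nu))

def e_step(Xs, pis, Ms, Sigmas, Psis, nus):
    # Equation (6): posterior probabilities z_ig and weights w_ig
    G = len(pis)
    p, r = Xs[0].shape
    logd = np.array([[np.log(pis[g]) + log_mvt(X, Ms[g], Sigmas[g], Psis[g], nus[g])
                      for g in range(G)] for X in Xs])
    z = np.exp(logd - logd.max(axis=1, keepdims=True))
    z /= z.sum(axis=1, keepdims=True)
    w = np.array([[(p * r + nus[g]) /
                   (nus[g] + delta(X, Ms[g], np.linalg.inv(Sigmas[g]), np.linalg.inv(Psis[g])))
                   for g in range(G)] for X in Xs])
    return z, w

def cm_step1(Xs, z, w):
    # first CM-step: updates of pi_g and M_g
    N, G = z.shape
    pis = z.sum(axis=0) / N
    Ms = [sum(z[i, g] * w[i, g] * Xs[i] for i in range(N)) / (z[:, g] * w[:, g]).sum()
          for g in range(G)]
    return pis, Ms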

3 Real Data Application

Here, we analyze the Municipalities dataset contained in the AER package [13]
for the R statistical software. It consists of expenditure information for 𝑁 = 265
Swedish municipalities over 𝑟 = 9 years (1979–1987). For each municipality, we
measure the following 𝑝 = 3 variables: (i) total expenditures, (ii) total own-source
revenues and (iii) intergovernmental grants received.
We fitted parsimonious MVT-Ms and MVN-Ms for 𝐺 ∈ {1, 2, 3, 4, 5} to the data,
and for each family of models the Bayesian information criterion (BIC) [14] is used
to select the best fitting model. According to our results, we found that the best among
MVN-Ms has a BIC of -82362.61, a VVV-EE structure and 𝐺 = 4 groups, while
the best among MVT-Ms has a BIC of -82701.59, a VVE-EE-V structure and 𝐺 = 3
groups. Thus, the overall best fitting model is that selected for MVT-Ms. The MVN-
Ms seem to overfit the data, given that an additional group is detected. This is not an
unusual behavior, given that the tails of normal mixture models cannot adequately
accommodate deviations from normality, and additional groups are consequently
found in the data [4, 7, 15]. Anyway, the best fitting models of the two families agree
in finding varying volumes and shapes in the components row scale matrices and
equal shapes and orientations in the components column scale matrices.
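As an illustration of the model-selection step, a fitted family of candidates can be ranked as in the sketch below; the chapter does not state the sign convention adopted for the BIC, so the sketch assumes the common form -2 log L + m log N, to be minimised, and the fit results are placeholders.

import numpy as np

def bic(loglik, n_params, n_obs):
    # assumed convention: smaller is better
    return -2.0 * loglik + n_params * np.log(n_obs)

def select_by_bic(fits, n_obs):
    # fits: dict mapping (structure, G) -> (maximised log-likelihood, number of parameters)
    scores = {key: bic(ll, m, n_obs) for key, (ll, m) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# e.g. select_by_bic({("VVE-EE-V", 3): (ll1, m1), ("VVV-EE", 4): (ll2, m2)}, n_obs=265)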
Fig. 1 Parallel coordinate plots (three panels: expenditures, revenues and grants over the years, for Groups 1 to 3) of the data partition obtained by the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated means.

Figure 1 illustrates the parallel coordinate plots of the data partition detected by the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated mean for that variable, across time, in that group. We notice that the first group contains municipalities having, on average, slightly higher expenditures, an intermediate
level of revenues and higher levels of intergovernmental grants than the other two
groups. Furthermore, it seems to cluster several outlying observations, as confirmed
by the estimated degree of freedom 𝜈1 = 3.75, which implies quite heavy tails
for this mixture component. The second group shows the lowest average levels of
expenditures and revenues, but a similar amount of received grants to that of the
third group. Interestingly, this group does not present many outlying observations,
as also supported by the estimated degree of freedom 𝜈2 = 10.95. Lastly, the third
group has the highest levels of revenues but, as already said, it is similar to the other
two groups in the other variables. Also in this case, we have a moderately heavy tail
behavior given that the estimated degree of freedom is 𝜈3 = 6.05.
To evaluate the correlations of the variables with each other and over time, for the
three groups, we now report the correlation matrices R_{(·)} related to the covariance matrices associated with \Sigma_g and \Psi_g:

1.00 0.48 0.14  1.00 0.55 0.18  1.00 0.73 0.22 


     
R𝚺1 = 0.48 1.00 −0.06 , R𝚺2 = 0.55 1.00 −0.07 , R𝚺3 = 0.73 1.00 −0.02 ,
   
0.14 −0.06 1.00  0.18 −0.07 1.00  0.22 −0.02 1.00 
     
1.00 0.80 0.72 0.67 0.65 0.59 0.58 0.55 0.52
 
0.80 1.00 0.79 0.73 0.69 0.62 0.62 0.57 0.54

0.72 0.79 1.00 0.80 0.73 0.69 0.66 0.63 0.60

0.67 0.73 0.80 1.00 0.79 0.73 0.71 0.67 0.64

R𝚿1 = R𝚿2 = R𝚿3 = 0.65 0.69 0.73 0.79 1.00 0.83 0.80 0.73 0.71 .
0.59 0.62 0.69 0.73 0.83 1.00 0.80 0.76 0.73

0.58 0.62 0.66 0.71 0.80 0.80 1.00 0.81 0.78

0.55 0.57 0.63 0.67 0.73 0.76 0.81 1.00 0.79

0.52 0.54 0.60 0.64 0.71 0.73 0.78 0.79 1.00

When R_{\Sigma_1}, R_{\Sigma_2} and R_{\Sigma_3} are considered, we notice that, as might reasonably be expected, the correlations between total expenditures and total own-source revenues or intergovernmental grants received are positive, and they increase as we move from the first to the third group. Conversely, there exists a slightly negative correlation between total own-source revenues and intergovernmental grants received; however, there are no great differences among the groups in this case. As concerns R_{\Psi_1}, R_{\Psi_2} and R_{\Psi_3}, we observe that the correlation among the columns, i.e. between time points, decreases as the temporal distance increases. Furthermore, considering the dimensionality of these column matrices, the benefit of an EE parsimonious structure with respect to a fully unconstrained model, in terms of the number of parameters to be estimated, is readily understandable.
Finally, we analyze the uncertainty of the detected classification. This can be
computed, for each observation, by subtracting the probability 𝑧𝑖𝑔 of the most likely
group from 1 [16]. The lower the uncertainty is, the stronger the assignment becomes.
The quantiles of the obtained uncertainties can be used to measure the quality of
the classification. In this regard, we noticed that 75% of the observations have an
uncertainty equal or lower than 0.05. However, we observed a maximum value of
0.50. This happens when groups intersect, since uncertain classifications are expected
in the overlapping regions [17]. Relatedly, a more detailed information can be gained
by looking at the uncertainty plot illustrated in Figure 2, which reports the (sorted)
uncertainty values of all the municipalities. We see that the municipalities clustered in the first group, excluding a couple of cases, have practically null uncertainties. This applies to a lesser extent to the municipalities in the other two groups, given the slightly higher number of exceptions. For example, there are 15 observations (approximately 5% of the total sample size) that have uncertainty values greater than 0.3. However, as said above, this is due to the closeness between the groups, which can be confirmed by looking at the parallel plots in Figure 1.

Fig. 2 Uncertainty plot for the Municipalities dataset (observations in order of increasing uncertainty, by group).
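A minimal sketch of the uncertainty computation described above, assuming a matrix z of posterior group probabilities (one row per municipality) returned by the fitted model; the names are illustrative.

import numpy as np

def classification_uncertainty(z):
    # uncertainty of each observation: 1 minus the posterior probability
    # of its most likely group
    return 1.0 - z.max(axis=1)

# quantiles used to summarise the quality of the partition
# unc = classification_uncertainty(z_hat)
# print(np.quantile(unc, [0.25, 0.50, 0.75, 1.00]))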

4 Conclusions

One serious concern of matrix-variate mixture models is the potentially high number of parameters. Furthermore, many real datasets require models with heavier-than-normal tails. To address both aspects, in this paper a family of 196 parsimonious mixture models, based on the matrix-variate t distribution, is introduced. The eigen-decomposition of the components scale matrices, as well as constraints on the components degrees of freedom, are used to attain parsimony. An AECM algorithm for parameter estimation has been presented. Our family of models has been fitted to a real dataset along with parsimonious mixtures of matrix-variate normal distributions. The results show the better fit of our models and the overfitting tendency of matrix-variate normal mixtures. Lastly, the estimated parameters and data partition for the best of our models have been reported and commented on.

Acknowledgements This work was supported by the University of Catania grant PIACERI/CRASI
(2020).

References

1. Gallaugher, M. P. B., McNicholas P. D.: Finite mixtures of skewed matrix variate distributions.
Pattern Recognit. 80, 83–93 (2018)
2. Melnykov, V., Zhu, X.: On model-based clustering of skewed matrix data. J. Multivar. Anal.
167, 181–194 (2018)
3. Melnykov, V., Zhu, X.: Studying crime trends in the USA over the years 2000–2012. Adv.
Data Anal. Classif. 13(1), 325–341 (2019)
4. Tomarchio, S. D., Punzo, A., Bagnato, L.: Two new matrix-variate distributions with applica-
tion in model-based clustering. Comput. Stat. Data Anal. 152, 107050 (2020)
5. Sarkar, S., Zhu, X., Melnykov, V., Ingrassia, S.: On parsimonious models for modeling matrix
data. Comput. Stat. Data Anal. 142, 106822 (2020)
6. Tomarchio, S. D., McNicholas, P. D., Punzo, A.: Matrix normal cluster-weighted models. J.
Classif. 38(3), 556–575 (2021)
7. Tomarchio, S. D., Gallaugher, M. P. B., Punzo, A., McNicholas, P. D.: Mixtures of matrix-
variate contaminated normal distributions. J. Comput. Gr. Stat. 1–9 (2022)
8. Tomarchio, S. D., Ingrassia, S., Melnykov, V.: Modelling students’ career indicators via
mixtures of parsimonious matrix-normal distributions. Aust. N. Z. J. Stat. 1–16 (2022)
9. Viroli, C.: Model based clustering for three-way data structures. Bayesian Anal. 6(4), 573–602
(2011)
10. Doğru, F. Z., Bulut, Y. M., Arslan, O.: Finite mixtures of matrix variate t distributions. Gazi
Univ. J. Sci. 29(2), 335–341 (2016)
11. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28(5),
781–793 (1995)
12. Meng, X. L., Van Dyk, D.: The EM algorithm-an old folk-song sung to a fast new tune. J.
Royal Stat. Soc. B. 59(3), 511–567 (1997)
13. Kleiber, C., Zeileis, A.: Applied Econometrics with R. Springer-Verlag, New York (2008)
14. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
15. Gallaugher, M. P. B., Tomarchio, S. D., McNicholas, P. D., Punzo, A.: Multivariate cluster
weighted models using skewed distributions. Adv. Data Anal. Classif. 1–32 (2021)
16. Fraley, C., Raftery, A. E.: Enhanced model-based clustering, density estimation, and discrim-
inant analysis software: MCLUST. J. Classif., 20(2), 263–286 (2003)
17. Tomarchio, S. D., Punzo, A.: Dichotomous unimodal compound models: application to the
distribution of insurance losses. J. Appl. Stat. 47(13-15), 2328–2353 (2020)

Evolution of Media Coverage on Climate Change
and Environmental Awareness: an Analysis of
Tweets from UK and US Newspapers

Gianpaolo Zammarchi, Maurizio Romano, and Claudio Conversano

Abstract Climate change represents one of the biggest challenges of our time.
Newspapers might play an important role in raising awareness on this problem and
its consequences. We collected all tweets posted by six UK and US newspapers in
the last decade to assess whether 1) the space given to this topic has grown, 2) any
breakpoint can be identified in the time series of tweets on climate change, and 3) any
main topic can be identified in these tweets. Overall, the number of tweets posted on
climate change increased for all newspapers during the last decade. Although a sharp
decrease in 2020 was observed due to the pandemic, for most newspapers climate
change coverage started to rise again in 2021. While different breakpoints were
observed, for most newspapers 2019 was identified as a key year, which is plausible
based on the coverage received by activities organized by the Fridays for Future
movement. Finally, using different topic modeling approaches, we observed that,
while unsupervised models partly capture relevant topics for climate change, such
as the ones related to politics, consequences for health or pollution, semi-supervised
models might be of help to reach higher informativeness of words assigned to the
topics.

Keywords: climate change, Twitter, environment, time series, topic modeling

Gianpaolo Zammarchi ( )
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy,
e-mail: [email protected]
Maurizio Romano
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy,
e-mail: [email protected]
Claudio Conversano
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy, e-mail: [email protected]


1 Introduction

Climate change is one of the biggest challenges for our society. Its consequences, which include, among others, glaciers melting, warming oceans, rising sea levels, and shifting weather or rainfall patterns, are already impacting our health and im-
posing costs on society. Without drastic action aimed at reducing or preventing
human-induced emissions of greenhouse gasses, these consequences are expected
to intensify in the next years. Despite its global and severe impacts, individuals may
perceive climate change as an abstract problem [1]. It is also a well-known fact that
the level of information plays a crucial role in the awareness about a topic (e.g.
healthy food [3] and smoking [2]). Media represent a crucial source of information
and can exert substantial effects on public opinion, thus helping to raise the awareness
on climate change. For instance, media can explain climate change consequences as
well as portraying actions that governments, communities and single individuals can
take. For this reason, it is important to distinguish themes that might have gained
popularity from those that may have seen a decrease of interest. Nowadays, social
media have become a reliable and popular source of information for people from
all around the world. Twitter is one of the most popular microblogging services and
is used by many traditional newspapers on a daily basis. While we can hypothesize
that in the last few years the media coverage on climate change might have risen,
due for instance to international climate strike movements, the recent emergence of
the coronavirus disease 2019 (COVID-19) pandemic might have led to a decrease of
attention on other relevant topics.
Aims of this work were to: (1) assess trends in media coverage on climate change
using tweets posted by main international newspapers based in United Kingdom
(UK) and United States (US), and (2) identify the main topics discussed in these
tweets using topic modeling.

2 Dataset and Methods

We downloaded all tweets posted from January 1st, 2012 to December 31st, 2021 from the official Twitter accounts of six widely known newspapers based in the UK (The Guardian, The Independent and The Mirror) or the US (The New York Times, The Washington Post and The Wall Street Journal), leading to a collection of 3,275,499 tweets.
Next, we determined which tweets were related to climate change and environmental
awareness based on the presence of at least one of the following keywords: “climate
change”, “sustainability”, “earth day”, “plastic free”, “global warming”, “pollution”,
“environmentally friendly” or “renewable energy”. We plotted the number of tweets
on climate change posted by each newspaper during each year using R v. 4.1.2 [4].
We analyzed the association between the number of tweets on climate change and
the whole number of tweets posted by each newspaper using Spearman’s correlation
analysis. For each year and for each newspaper, we computed and plotted the differ-
ences in the number of posted tweets compared to the previous year, for either (a)
tweets related to climate change and (b) all tweets. Finally, we used the changepoint
R package [5] to conduct an analysis aimed at identifying structural breaks, i.e. unex-
pected changes in a time series. In many applications, it is reasonable to believe that
there might be m breakpoints (especially if some exogenous event occurs) in which a
shift in mean value is observed. The changepoint package estimates the breakpoints
using several penalty criteria such as the Bayesian Information Criterion (BIC) or the
Akaike Information Criterion (AIC). We estimated the breakpoints using the Binary
Segmentation (BinSeg) method [6] implemented in the package.
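The breakpoint search itself was carried out with the changepoint R package; purely to illustrate the idea behind binary segmentation, a toy Python version for shifts in the mean of a yearly count series could look as follows (this is not the BinSeg implementation used in the paper, and the penalty is only indicative).

import numpy as np

def binseg_mean_shifts(y, max_breaks=4, penalty=None):
    # toy binary segmentation for changes in mean
    y = np.asarray(y, dtype=float)
    n = len(y)
    if penalty is None:
        penalty = 2.0 * np.log(n)      # rough BIC-type penalty per extra breakpoint

    def cost(lo, hi):                  # squared error of one segment around its mean
        seg = y[lo:hi]
        return ((seg - seg.mean()) ** 2).sum() if hi > lo else 0.0

    breaks, segments = [], [(0, n)]
    for _ in range(max_breaks):
        best = None
        for (lo, hi) in segments:
            base = cost(lo, hi)
            for t in range(lo + 1, hi):
                gain = base - cost(lo, t) - cost(t, hi)
                if gain > penalty and (best is None or gain > best[0]):
                    best = (gain, t, (lo, hi))
        if best is None:
            break
        _, t, seg = best
        segments.remove(seg)
        segments += [(seg[0], t), (t, seg[1])]
        breaks.append(t)
    return sorted(breaks)

# e.g. binseg_mean_shifts(yearly_counts_for_one_newspaper)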
Lastly, we used tweets posted by The Guardian to perform topic modeling, a
method for classification of text into topics. Preprocessing (including lemmatization,
removal of stopwords and creation of the document term matrix) was conducted with
tm [7] and quanteda [8] in R. We used two different approaches: 1) Latent Dirichlet
Allocation (LDA) implemented in the textmineR R package [9]; and 2) Correlation
Explanation (CorEx), an approach alternative to LDA that allows both unsupervised
as well as semi-supervised topic modeling [10].
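The LDA analysis was run with the textmineR R package; a rough Python analogue of that step, using scikit-learn, is sketched below (it mirrors the general workflow, not the authors' preprocessing or implementation).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_top_terms(tweets, n_topics=10, n_terms=5):
    # document-term matrix after basic preprocessing
    vec = CountVectorizer(stop_words="english", lowercase=True, min_df=2)
    dtm = vec.fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(dtm)
    vocab = vec.get_feature_names_out()
    # highest-weight terms of each topic
    return [[vocab[i] for i in comp.argsort()[::-1][:n_terms]] for comp in lda.components_]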

3 Results

3.1 Analysis of Tweet Trends and Breakpoints

Among 3,275,499 collected tweets, we identified 11,155 tweets related to climate change and environmental awareness. Figure 1A shows the number of tweets on climate change posted by each of the analyzed newspapers from 2012 to 2021, while Figure 1B shows the total number of tweets posted by each newspaper.

Fig. 1 Number of tweets on climate change (A) or total number of tweets (B) posted by the six
newspapers from 2012 to 2021.

For the majority of newspapers, the number of tweets on climate change increased from 2014 to 2019, saw a sharp decrease in 2020, in correspondence with the emergence of the COVID-19 pandemic, and rose again in 2021. On the other hand, the number of tweets on climate change posted by The Guardian showed a peak during 2015 and a subsequent decrease. However, it must be noted that The Guardian is also the newspaper that showed a more pronounced decrease in the overall number of tweets.

Fig. 2 Year-over-year percentage changes of overall tweets and tweets on climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F: The Wall Street Journal.
The number of tweets on climate change was significantly positively correlated
with the overall number of tweets posted from 2012 to 2021 for four newspapers (The
Guardian, Spearman’s rho = 0.95, 𝑝 < 0.001; The Mirror, Spearman’s rho = 0.95, 𝑝
< 0.001; The Independent, Spearman’s rho = 0.76, 𝑝 = 0.016; The Washington Post,
Spearman’s rho = 0.70, 𝑝 = 0.031) but not for The New York Times (Spearman’s
rho = 0.18, 𝑝 = 0.63) or The Wall Street Journal (Spearman’s rho = 0.49, 𝑝 = 0.15).
Year-over-year percentage changes among either tweets related to climate change or
all posted tweets can be observed in Figure 2.
Looking at Figure 2, we can observe a great variability in the posted number of tweets during the years, both for the total number of tweets and for the number of tweets on climate change. While the analysis aimed at identifying structural changes in the time series comprising tweets on climate change identified three or four breakpoints for all newspapers, wide variability was observed regarding the specific year in which these structural changes were identified (Figure 3). Despite the great variability, Figure 3 shows that even if a common breakpoint cannot be identified, 2019 was a key year for five out of six newspapers (except for The Independent).

Fig. 3 Structural changes in the time series of tweets related to climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F: The Wall Street Journal. The red line represents the years between two breakpoints.

3.2 Topic Modeling

Finally, we exploited the topic modeling approach to identify and analyze the main
topics discussed by newspapers in their tweets. Due to space limitations, we focus
only on The Guardian since this newspaper showed a trend in contrast with the
others. The data consist of 2,916 tweets posted by The Guardian, analyzed using LDA and CorEx. For LDA, a range of 5-20 unsupervised topics was tested, with the most

interpretable results obtained with 10 topics (Table 1). The topic coherence ranged
from 0.01 to 0.34 (mean: 0.13). For each topic, bi-gram topic labels were assigned
with the labeling algorithm implemented in textmineR. We can observe that topics
are related to politics or leaders (Topics 3, 7 and 10), environmental scientists or
climate journalists (Topics 1 and 5), energy sources (Topics 4 and 8) and effects
of climate change (Topics 2, 6 and 9). The intertopic distance map obtained with
LDAvis is shown in Figure 4. The area of each circle is proportional to the relative
prevalence of that topic in the corpus, while inter-topic distances are computed based
on Jensen-Shannon divergence.

Table 1 Top terms for the ten topics identified with LDA.
dana_nuccitelli air_pollution barack_obama renewable_energy john_abraham

dana pollution fight energy john


dana_nuccitelli air obama renewable trump
nuccitelli air_pollution trump renewable_energy australia
live study plan uk tackle
trump finds battle sustainability abraham
air_pollution donald_trump fossil_fuel extreme_weather pope_francis
pollution trump report world pollution
air schoolstrike fossil paris study
air_pollution school ipcc leaders tackling
uk great warns talks pope
tackle donald stop deal scientists

Fig. 4 Intertopic distance map.



Finally, we conducted a semi-supervised topic modeling analysis based on anchored words using CorEx. When anchoring a word to a topic, CorEx maximizes the
mutual information between that word and the topic, thus guiding the topic model
towards specific subsets of words. A model with 5 topics and three anchored words
for each topic (Table 2) showed a total correlation (i.e. the measure maximized by
CorEx when constructing the topic model) of 4.36. This value was higher compared
to the one observed with an unsupervised CorEx analysis with the same number of
topics (total correlation = 0.97, topics not shown due to space limits). Topics related
to politics (Topic 3) and science (Topic 5) were found to be the most informative in
our dataset based on the total correlation metric.

Table 2 Topics with anchored words and examples of tweets.

Topic 1. Topic words: school, strike, march, schoolstrik, climatestrikeuk, ukschoolstrik, schoolstrikeclim, climatemarch, arabia, saudi. Example tweet: "EPA wipes its climate change site day before march on Washington".
Topic 2. Topic words: ocean, ice, environment, john, dana, nuccitelli, air, abraham, sea, reed. Example tweet: "Chasing Ice filmmakers plumb the 'bottomless' depths of climate change - new clip from @GuardianEco".
Topic 3. Topic words: trump, obama, lead, donald, barack, ivanka, brighton, repli, administr, pick. Example tweet: "Trump administration pollution rule strikes final blow against environment".
Topic 4. Topic words: plastic, fuel, oil, fossil, compani, pictur, wast, big, bay, photo. Example tweet: "Engaging with oil companies on climate change is futile".
Topic 5. Topic words: studi, scientist, research, find, link, say, show, death, prematur, speci. Example tweet: "Microplastic pollution revealed 'absolutely everywhere' by new research".
The anchored words are reported in bold.
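A sketch of the anchored analysis, assuming the interface of the corextopic Python package that accompanies [10]; the anchor lists and the anchor_strength value are given only as an indication.

import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

def anchored_corex(tweets, anchors, n_topics=5):
    vec = CountVectorizer(stop_words="english", binary=True)
    X = ss.csr_matrix(vec.fit_transform(tweets))
    vocab = list(vec.get_feature_names_out())
    model = ct.Corex(n_hidden=n_topics, seed=1)
    model.fit(X, words=vocab, anchors=anchors, anchor_strength=3)
    return model.get_topics(), model.tc   # per-topic words and total correlation

# illustrative anchor lists, three words per topic, e.g.
# anchors = [["school", "strike", "march"], ["trump", "obama", "lead"], ...]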

4 Discussion

The present study aims to evaluate how some of the most relevant British and
American newspapers have given space to the topic of climate change on their
Twitter page in the last decade. Apart from The Guardian, which shows a decreasing
trend in the number of tweets related to climate change, all the other newspapers
showed an overall growing trend, except during 2020. During this year, the number
of tweets related to climate change declined for all six newspapers. This was most
probably due to the COVID-19 outbreak that was massively covered by all media.
By analyzing the breakpoints in Figure 3, it is possible to observe that 2019 was
a relevant year for climate change. This is plausible considering that, starting from
the end of 2018, the strikes launched by the Fridays for Future movement to raise
awareness on the issue of climate change gained high media coverage.

Our topic modeling analysis showed that the main topics defined using unsuper-
vised models such as LDA are mostly related to politics, environmental scientists,
energy sources and effects of climate change. While unsupervised models capture
relevant topics, using CorEx we found a semi-supervised model to be able to reach
a higher total correlation, which is a measure of informativeness of the topics,
compared to an unsupervised model with the same number of topics.
As future developments, we plan to extend our analyses to newspapers from other
countries. We believe our work to be useful to gain more knowledge and awareness
about the climate change topic and on how much space relevant newspapers have
given to this issue on social media. Increasing the knowledge about the nature of the
topics covered by newspapers will lay the basis for future studies aimed at evaluating
public awareness on this highly relevant challenge.

References

1. Van Lange, P. A. M., Huckelba, A. L.: Psychological distance: How to make climate change
less abstract and closer to the self. Curr. Opin. Psychol. 42, 49–53 (2021)
2. Wakefield, M., Flay B., Nichter M., Giovino G.: Role of the media in influencing trajectories
of youth smoking. Addiction, 98, 79-103 (2003)
3. Dumanovsky, T., Huang, C. Y., Bassett, M. T., Silver, L. D.: Consumer awareness of fast-food
calorie information in New York City after implementation of a menu labeling regulation.
American Journal of Public Health, 12, 2520-2525 (2010)
4. R Core Team. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2020). Available via http://www.R-project.org
5. Killick, R., Eckley, I. A.: changepoint: An R Package for changepoint analysis. J. Stat. Softw.
58, 1–19 (2014)
6. Scott, A.J., Knott, M.: A cluster analysis method for grouping means in the analysis of variance,
Biometrics, 30, 507-512 (1974)
7. Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25, 1–54
(2008)
8. Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., Matsuo, A.: quanteda:
An R package for the quantitative analysis of textual data. J. Open Source Softw. 3, 774 (2018)
9. Jones, T., Doane, W.: Package ‘textmineR’. Functions for Text Mining and Topic Modeling
(2021). Retreived from
https://cran.r-project.org/web/packages/textmineR/textmineR.pdf
10. Gallagher, R., Reing, K., Kale, D., Ver Steeg, G.: Anchored correlation explanation: Topic
modeling with minimal domain knowledge. Trans. Assoc. Comput. 5, 529-542 (2017)

Index

Symbols Boutalbi, R., 73


k-means, 213 brain-computer interface, 323
3-way network, 147
C
A Cabral, S., 363
Abdesselam, R., 1 Campos, J., 53
adjacency matrix, 1 Cappé, O., 263
affinity coefficient, 343 Cardoso, M. G. M. S., 353
anomaly detection, 373 categorical data, 353
Anton, C., 11 categorical time series, 233
Antonazzo, F., 21 CatPCA, 363
Arnone, E., 29 Chabane, N., 83
Ascari, R., 35 Chadjipadelis, T., 93, 283
Aschenbruck, R., 43 Champagne Gareau, J., 101
Ashofteh, A., 53 classification of textual documents, 121
association measures, 233 climate change, 403
AUC, 176 cluster analysis, 273, 363
automated planning, 101 cluster stability, 43, 183
cluster validation, 43
B cluster validity indices, 383
Bacelar-Nicolau, H., 343, 363 clustering, 53, 83, 203, 233, 243, 383, 393
Batagelj, V., 63 clustering validation, 343
Batista, M. G., 363 clustering with relational constraints, 63
Bayesian inference, 35 clusterwise, 73
Bayesian methodology, 223 co-clustering, 73
Beaudry., É, 101 community detection, 147
bi-stochastic matrix, 213 complex network, 147
blockmodeling, 63 constraints, 139
bootstrap method, 176 contaminated normal distribution, 11, 303
Bouaoune, M. A., 83 Conversano, C., 403
Bouchard, K., 253 correspondence analysis, 93


count data, 35 H
COVID-19, 93, 334 Hayashi, K., 175
Cunial, E., 29 Hennig, C., 183
hierarchical cluster analysis, 93
D hierarchical clustering, 1
D’Urso, P., 233 Hoshino, E., 175
data analysis, 283 hyperparameter tuning, 131
data mining, 73 hyperquadrics, 21
decision boundaries, 21
democracy, 283 I
dependancy chains, 101 Ievoli, R., 273
Di Nuzzo, C., 111 Ignaccolo, R., 313
dimensionality reduction, 83 image processing, 193
Dobša, J., 121 indicator processes, 233
Dvorák, J., 293 Ingrassia, S., 21, 111
dynamical systems, 373 intelligent shopping list, 83
Ippoliti, L., 313
E item classification, 53
ECM algorithm, 303
EEG, 323 J
EM algorithm, 11, 353 Janácek, P., 193
emotions, 323
environment, 403 K
evidence-based policy making, 93 k:-means, 383
expectation-maximization, 263 Kalina, J., 193
Karafiátová, I., 293
F kernel density estimation, 155
factorial k-means, 213 kernel function, 111
Faria, B. M., 323 Kiers, H. A. L., 121
Figueiredo, M., 353 Koshkarov, A., 383
finite mixture model, 353
Fontanella, S., 313 L
Forbes, F., 263 L1-penalty, 313
Fort, G., 263 López-Oriona., Á, 233
fraud detection, 131 Labiod, L., 73, 203, 213
functional data, 11, 293 LaLonde, A., 223
functional data analysis, 29, 313, 334 leadership, 363
fuzzy sets, 243 learning from data streams, 131
leave-one-out cross-validation, 176
G Lee, H. K. H., 253
Gama, J., 131 Love, T., 223
García-Escudero, L. A., 139 low-energy replacements, 193
Gaussian mixture model, 183 LSA, 121
Gaussian process, 253
Genova, V. G., 147 M
Giordano, G., 147 machine and deep learning, 323
Giubilei, R., 155 machine learning, 83
graph clustering, 155 Magopoulou, S., 93
graphical LASSO, 313 Makarenkov, V., 83, 101
grocery shopping recommendation, 83 Markov chain Monte Carlo, 223
Górecki, T., 165 Markov decision process, 101
Masís, D., 243

matrix-variate, 393 P
Mayo-Iscar, A., 139 pair correlation function, 293
Mazoure, B., 83 Palazzo, L., 273
measurement error, 53 Panagiotidou, G., 283
Menafoglio, A., 333 parallel computing, 101
Meng, R., 253 parameter estimation, 263
Migliorati, S., 35 parsimonious models, 393
minimum message length, 353 Pawlasová, K., 293
minorization-maximization, 263 Perrone, G., 303
mixed-mode official surveys, 53 phylogenetic trees, 383
mixed-type data, 43 Piasecki, P., 165
mixture model, 35, 223, 313 political behavior, 283
mixture modelling, 139 projection matrix, 176
mixture models, 393 Pronello, N., 313
mixture of regression models, 303 proximity measure, 1
mixtures of regressions, 21
mobility data, 147 R
mode-based clustering, 155 Ragozini, G., 147
model based clustering, 139 random forest, 165
model selection, 353 rare disease, 176
model-based cluster analysis, 303 recommender systems, 83
model-based clustering, 11, 21 reduced k-means, 121, 213
Morelli, G., 139 regional healthcare, 273
motivational factors, 363 Reis, L. P., 323
multidimensional scaling, 273 religion, 283
multivariate data analysis, 1 representation learning, 203
multivariate methods, 283 reversible jump, 223
multivariate regression, 35 Riani, M., 139
multivariate time series, 253 robustness, 193
Rodrigues, D., 323
N Romano, M., 403
Nadif, M., 73, 203, 213
Nakanishi, E., 175 S
neighborhood graph, 1 Sakai, K., 175
network analysis, 63 Sangalli, L. M., 29, 333
networked data, 203 Scimone, R., 333
networks, 155 Secchi, P., 333
neural networks, 293 seemingly unrelated regression, 303
Nguyen, H. D., 263 Segura, E., 243
noise component, 183 semiparametric regression with roughness
nonparametric statistics, 155 penalty, 29
number of clusters, 183 sensitivity and specificity, 176
numerical smoothing, 243 silhouette width, 313
Silva, O., 343, 363
O Silvestre, C., 353
O2S2, 334 similarity forest, 165
Obatake, M., 175 Smith, I., 11
online algorithms, 263 social networks, 63
optimized centroids, 193 Soffritti, G., 303
outlier analysis, 373 Sousa., Á, 343, 363
Ovtcharova, J., 373 sparsity, 193

spatial data analysis, 29 trimmed k-means, 273


spatial downscaling, 334 trimming, 183
spatial point patterns, 293 Twitter, 403
Spearman correlation coefficient, 343
spectral clustering, 111 U
spectral rotation, 203 user tuning, 183
split-merge procedures, 223
Spoor, J. M., 373 V
stochastic approximation, 263 variational inference, 253
stochastic optimization, 253 Vilar. J. A., 233
strongly connected components, 101 Vitale, M. P., 147
supervised classification, 293 VL methodology, 343
supervised learning, 323
Suzuki, M., 175 W
symbolic data analysis, 63 Weber, J., 373
symmetric difference metrics, 383 weighting methods, 53
Szepannek, G., 43 welfare, 363
Wilhelm, A. F. X., 43
T Wu, T., 223
Tahiri, N., 83, 383
tensor, 73 X
tertiary education, 147 Xavier, A., 243
three-way data, 111
Tighilt, R. A. S., 83 Y
time series, 165, 273, 403 Young, D. R., 223
time series classification, 165
time-varying correlation, 253 Z
Tomarchio, S. D., 393 Zammarchi, G., 403
topic modeling, 403 Łuczak, T., 165
Trejos, J., 243
