Classification and Data Science in the Digital Age

Studies in Classification, Data Analysis, and Knowledge Organization

Editors
Paula Brito
Faculty of Economics, University of Porto, Porto, Portugal
INESC TEC, Centre for Artificial Intelligence and Decision Support (LIAAD), Porto, Portugal

José G. Dias
Business Research Unit, University Institute of Lisbon, Lisbon, Portugal

Berthold Lausen
Department of Mathematical Sciences, University of Essex, Colchester, UK

Angela Montanari
Department of Statistical Sciences “Paolo Fortunati”, University of Bologna, Bologna, Italy

Rebecca Nugent
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book's Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book's
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
“Classification and Data Science in the Digital Age”, the 17th Conference of the International Federation of Classification Societies (IFCS), is held in Porto, Portugal, from July 19th to July 23rd 2022, locally organised by the Faculty of Economics of the University of Porto and the Portuguese Association for Classification and Data Analysis, CLAD.
Keynote lectures are addressed by Genevera Allen (Rice University, USA), Charles Bouveyron (Université Côte d’Azur, Nice, France), Dianne Cook (Monash University, Melbourne, Australia), and João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal). The conference program includes two tutorials: “Analysis of Data Streams” by João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal) and “Categorical Data Analysis and Visualization” by Rosaria Lombardo (Università degli Studi della Campania Luigi Vanvitelli, Italy) and Eric Beh (University of Newcastle, Australia). IFCS 2022 has highlighted topics which led to Semi-Plenary Invited Sessions. The conference program also includes Thematic Tracks on specific areas, as well as free contributed sessions on different topics (both oral communications and posters).
The papers included in this volume present new developments in relevant topics of Data Science and Classification, constituting a valuable collection of methodological and applied papers that represent current research in these fast-developing areas. Combining new methodological advances with a wide variety of real applications, this volume is certainly of great value for Data Science researchers and practitioners alike.
First of all, the organisers of the Conference and the editors would like to thank all authors for their cooperation and commitment. We are especially grateful to all colleagues who served as reviewers, and whose work was decisive to the scientific quality of these proceedings. We also thank all those who have contributed to the design and production of this Book of Proceedings at Springer, in particular Veronika Rosteck, for her help concerning all aspects of publication.
The organisers would like to express their gratitude to the Portuguese Association
for Classification and Data Analysis, CLAD, as well as to the Faculty of Economics
of the University of Porto (FEP–UP), who enthusiastically supported the Conference
from the very start, and contributed to its success. We cordially thank all members
of the Local Organising Committee – Adelaide Figueiredo, Carlos Ferreira, Carlos
Marcelo, Conceição Rocha, Fernanda Figueiredo, Fernanda Sousa, Jorge Pereira,
M. Eduarda Silva, Paulo Teles, Pedro Campos, Pedro Duarte Silva, and Sónia Dias
– and all people at FEP–UP who worked actively for the conference organisation,
and whose work is much appreciated. We are very grateful to all our sponsors, for
their generous support. Finally, we thank all authors and participants, who made the
conference possible.
The Editors are extremely grateful to the reviewers, whose work was decisive to the scientific quality of these proceedings. They were, in alphabetical order:
Partners & Sponsors
Sponsors
Banco de Portugal
Berd
Indie Campers
INESC/TEC
PSE
Unilabs
Universidade do Porto
Partners
Springer
Organisation
Contents
The Death Process in Italy Before and During the Covid-19 Pandemic: A Functional Compositional Approach
Riccardo Scimone, Alessandra Menafoglio, Laura M. Sangalli, and Piercesare Secchi
A Topological Clustering of Individuals

Rafik Abdesselam
Abstract The clustering of objects-individuals is one of the most widely used approaches to exploring multidimensional data. The two common unsupervised clustering strategies are Hierarchical Ascending Clustering (HAC) and 𝑘-means partitioning, both used to identify groups of similar objects in a dataset in order to divide it into homogeneous groups. The proposed Topological Clustering of Individuals, or TCI, studies a homogeneous set of individual rows of a data table, based on the notion of neighborhood graphs; the columns-variables are more or less correlated or linked according to whether the variables are of quantitative or qualitative type. TCI enables a topological analysis of the clustering of individuals whose variables may be quantitative, qualitative or a mixture of the two. It first analyzes the correlations or associations observed between the variables in a topological context of principal component analysis (PCA) or multiple correspondence analysis (MCA), depending on the type of variables, and then classifies the individuals into homogeneous groups relative to the structure of the variables considered. The proposed TCI method is presented and illustrated here using a real dataset with quantitative variables, but it can also be applied to qualitative or mixed variables.
1 Introduction
The objective of this article is to propose a topological method of data analysis in the
context of clustering. The proposed approach, Topological Clustering of Individuals
Rafik Abdesselam ( )
University of Lyon, Lyon 2, ERIC - COACTIS Laboratories
Department of Economics and Management, 69365 Lyon, France,
e-mail: [email protected]
(TCI) is different from those that already exist and with which it is compared. There
are approaches specifically devoted to the clustering of individuals, for example, the
Cluster procedure implemented in SAS software, but as far as we know, none of
these approaches has been proposed in a topological context.
Proximity measures play an important role in many areas of data analysis [16, 5, 9].
The results of any operation involving structuring, clustering or classifying objects
are strongly dependent on the proximity measure chosen.
This study proposes a method for the topological clustering of individuals, whatever the type of variables considered: quantitative, qualitative or a mixture of both. Any associations or correlations between the variables depend partly on the database being used, and the results can change according to the selected proximity measure. A proximity measure is a function which measures the similarity or dissimilarity between two objects or variables within a set.
Several topological data analysis studies have been proposed, both in the context of factorial analyses (discriminant analysis [4], simple and multiple correspondence analyses [3], principal component analysis [2]) and in the context of the clustering of variables [1] and the clustering of individuals [10], to which this proposed TCI approach belongs.
This paper is organized as follows. In Section 2, we briefly recall the basic
notion of neighborhood graphs, we define and show how to construct an adjacency
matrix associated with a proximity measure within the framework of the analysis
of the correlation structure of a set of quantitative variables, and we present the
principles of TCI according to continuous data. This is illustrated in Section 3 using
an example based on real data. The TCI results are compared with those of the well-
known classical clustering of individuals. Finally, Section 4 presents the concluding
remarks on this work.
2 Topological Context
For any given proximity measure $u$, we can construct the associated binary symmetric adjacency matrix $V_u$ of order $p$, where all pairs of neighboring variables in $E$ satisfy the following RNG (Relative Neighborhood Graph) property:

$$V_u(x^k, x^l) = \begin{cases} 1 & \text{if } u(x^k, x^l) \leq \max\!\left[\,u(x^k, x^t),\, u(x^t, x^l)\,\right], \;\forall x^t \in E,\; t \neq k, l \\ 0 & \text{otherwise.} \end{cases}$$

This generates a topological structure based on the objects in $E$, which is completely described by the adjacency binary matrix $V_u$.
Three topological factorial approaches are described in [1] according to the type of
variables considered: quantitative, qualitative or a mixture of both. We consider here
the case of a set of quantitative variables.
We assume that we have at our disposal a set 𝐸 = {𝑥 𝑗 ; 𝑗 = 1, · · · , 𝑝} of 𝑝
quantitative variables and 𝑛 individuals-objects. The objective here is to analyze, in a topological way, the structure of the correlations between the variables considered [2], from which the clustering of individuals will then be established.
We construct the reference adjacency matrix named 𝑉𝑢★ from the correlation
matrix. Expressions of suitable adjacency reference matrices for cases involving
qualitative variables or mixed variables are given in [1].
To examine the correlation structure between the variables, we look at the significance of their linear correlation. The reference adjacency matrix $V_{u^\star}$, associated with the reference measure $u^\star$, can be written using Student's t-test of the Bravais-Pearson linear correlation coefficient $\rho$:
$$V_{u^\star}(x^k, x^l) = \begin{cases} 1 & \text{if } p\text{-value} = P\left[\,|T_{n-2}| > \text{t-value}\,\right] \leq \alpha \\ 0 & \text{otherwise,} \end{cases} \qquad \forall k, l = 1, \ldots, p,$$
where the $p$-value is the significance of the two-sided test of the null and alternative hypotheses $H_0: \rho(x^k, x^l) = 0$ vs. $H_1: \rho(x^k, x^l) \neq 0$, and $T_{n-2}$ is a Student $t$-distributed random variable with $\nu = n - 2$ degrees of freedom. The null hypothesis is rejected if the $p$-value is less than or equal to a chosen significance level $\alpha$, for example $\alpha = 5\%$: a very small $p$-value means that the null hypothesis is very unlikely to be correct, and consequently we can reject it.
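As an illustration, this construction of $V_{u^\star}$ from pairwise correlation tests can be sketched in a few lines of base R (the function name and arguments are ours, not from the paper):

```r
# Sketch: build the reference adjacency matrix V_u* from pairwise Pearson
# correlation tests, following the definition above.
# X: an n x p numeric data matrix; alpha: the chosen significance level.
build_Vustar <- function(X, alpha = 0.05) {
  p <- ncol(X)
  V <- diag(1, p)                                  # each variable neighbors itself
  for (k in 1:(p - 1)) {
    for (l in (k + 1):p) {
      pval <- cor.test(X[, k], X[, l])$p.value     # two-sided t-test of rho = 0
      V[k, l] <- V[l, k] <- as.numeric(pval <= alpha)
    }
  }
  V
}
```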
Whatever the type of variables considered, the reference adjacency matrix $V_{u^\star}$ thus built is associated with an unknown reference proximity measure $u^\star$. The robustness of the results depends on the $\alpha$ risk chosen for the null hypothesis (no linear correlation in the case of quantitative variables; positive deviation from independence in the case of qualitative variables) and can be studied by varying this significance threshold, in order to analyze the sensitivity of the results. The numerical results will certainly change, but probably not their interpretation.
We assume that we have at our disposal a set $\{x^k; k = 1, \ldots, p\}$ of $p$ homogeneous quantitative variables measured on $n$ individuals. We will use the following notations:

- $X_{(n,p)}$ is the data matrix with $n$ rows-individuals and $p$ columns-variables,
- $V_{u^\star}$ is the symmetric adjacency matrix of order $p$, associated with the reference measure $u^\star$ which best structures the correlations of the variables,
- $\widehat{X}_{(n,p)} = X V_{u^\star}$ is the projected data matrix with $n$ individuals and $p$ variables,
- $M_p$ is the matrix of distances of order $p$ in the space of individuals,
- $D_n = \frac{1}{n} I_n$ is the diagonal matrix of weights of order $n$ in the space of variables.
We first analyze, in a topological way, the correlation structure of the variables using a Topological PCA, which consists of carrying out the standardized PCA [6, 8] of the triplet $(\widehat{X}, M_p, D_n)$ of the projected data matrix $\widehat{X} = X V_{u^\star}$ and, for comparison, the duality diagram of the classical standardized PCA triplet $(X, M_p, D_n)$ of the initial data matrix $X$. We then proceed with a clustering of individuals based on the significant principal components of the previous Topological PCA.

Definition 2 TCI consists of performing a HAC, based on the Ward criterion [15], on the significant factors of the standardized PCA of the triplet $(\widehat{X}, M_p, D_n)$.
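Definition 2 translates almost directly into base R; the sketch below (using prcomp, hclust and cutree; the function name, the number of retained factors and the number of clusters are our choices, not the author's code) is one possible reading:

```r
# Sketch of TCI as in Definition 2: PCA of the projected data, then Ward HAC.
# X: standardized n x p data matrix; V: adjacency matrix from build_Vustar().
tci <- function(X, V, nfact = 2, k = 5) {
  Xhat   <- X %*% V                                   # projected data matrix X V_u*
  pca    <- prcomp(Xhat, scale. = TRUE)               # standardized PCA
  scores <- pca$x[, 1:nfact]                          # significant principal components
  hc     <- hclust(dist(scores), method = "ward.D2")  # HAC with Ward's criterion
  cutree(hc, k = k)                                   # partition into k clusters
}
```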
3 Illustrative Example
The data used [13] to illustrate the TCI approach concern the renewable electricity (RE) of the 13 French regions in 2017, described by 7 quantitative variables relating to RE. The growth of renewable energy in France is significant. Some French regions have expertise in this area; however, the regions' profiles appear to differ.
The objective is to specify regional disparities in terms of RE by applying topological clustering to the French regions, in order to identify which were the country's greenest regions in 2017. Statistics relating to the variables are displayed in Table 1.
The adjacency matrix $V_{u^\star}$, associated with the proximity measure $u^\star$ adapted to the data considered, is built from the correlation matrix (Table 2) according to Definition 1. Note that in this case, which uses quantitative variables, two positively correlated variables are considered related, and two negatively correlated variables are considered related but remote. We therefore take the sign of the correlations between variables into account in the adjacency matrix.
We first carry out a Topological PCA to identify the correlation structure of the
variables. A HAC, according to Ward’s criterion, is then applied to the significant
principal components of the PCA of the projected data. We then compare the results
of a topological and a classical PCA.
Figure 2 presents, for comparison on the first factorial plane, the correlations
between principal components-factors and the original variables.
We can see that these correlations are slightly different, as are the percentages of inertia explained on the first principal planes of the Topological and Classical PCA. The first two factors of the Topological PCA explain 57.89% and 26.11% respectively, together accounting for 83.99% of the total variation in the data set, whereas the first two factors of the Classical PCA add up to 75.20%. Thus, the first two factors provide an adequate synthesis of the data, that is, of RE in the French regions. We restrict the comparison to the first significant factorial axes.
For comparison, Figure 3 shows dendrograms of the Topological and Classical clusterings of the French regions according to their RE. Note that the chosen partitions into 5 clusters are appreciably different, as much in composition as in characterization. The percentage of explained variance produced by the TCI approach, $R^2 = 86.42\%$, is higher than that of the classical approach, $R^2 = 84.15\%$, indicating that the clusters produced via the TCI approach are more homogeneous than those generated by the Classical one. Based on the TCI analysis, the Corse region alone constitutes the fourth cluster, and the Nouvelle-Aquitaine region is found in the second cluster with the Grand-Est, Occitanie and Provence-Alpes-Côte-d'Azur (PACA) regions; in the Classical clustering, however, these two regions - Corse and Nouvelle-Aquitaine - together constitute the third cluster.
Figure 4 summarizes the significant profiles (+) and anti-profiles (-) of the two
typologies; with a risk of error less than or equal to 5%, they are quite different.
The first cluster produced via the TCI approach, consisting of a single region, Auvergne-Rhône-Alpes (AURA), is characterized by a high share of hydroelectricity, a high level of coverage of regional consumption, and high RE production and con-
sumption. The second cluster - which groups together the four regions of Grand-Est,
Occitanie, Provence-Alpes-Côte-d’Azur (PACA) and Nouvelle-Aquitaine - is consid-
ered a homogeneous cluster, which means that none of the seven RE characteristics
differ significantly from the average of these characteristics across all regions. This
cluster can therefore be considered to reflect the typical picture of RE in France.
4 Conclusion
This paper proposes a new topological approach to the clustering of individuals which can enrich classical data analysis methods within the framework of the clustering of objects. The results of the topological clustering approach, based on the notion of a neighborhood graph, are as good as - or even better than, according to the R-squared results - those of the existing classical method. The TCI approach can easily be programmed using the PCA and HAC procedures of SAS, SPAD or R software. Future work will involve extending this topological approach to other methods of data analysis, in particular in the context of evolutionary data analysis.
References
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Model Based Clustering of Functional Data with Mild Outliers
1 Introduction
Recently, model-based clustering for functional data has received a lot of attention.
Real data are often contaminated by outliers that affect the estimates of the model
parameters. Here we propose a method for clustering functional data with mild
outliers. Mild outliers are usually sampled from a population different from the
Cristina Anton ( )
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada,
e-mail: [email protected]
Iain Smith
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada,
e-mail: [email protected]
2 The Model
Here we assume that the dimension 𝑝 is fixed and known. We consider a model based
on a mixture of multivariate contaminated normal distributions for the coefficients
vectors {𝛾1 , . . . , 𝛾𝑛 } ⊂ R 𝑝 , 𝛾𝑖 = (𝛾𝑖1 , . . . , 𝛾𝑖 𝑝 ) > ∈ R 𝑝 , 𝑖 = 1, . . . , 𝑛.
We suppose that there exists two unobserved random variables 𝑍 = (𝑍1 , . . . , 𝑍 𝐾 ),
Υ = (Υ1 , . . . , Υ𝐾 ) ∈ {0, 1} 𝐾 where 𝑍 indicates the cluster membership and Υ
Here $\alpha_k$ defines the proportion of uncontaminated data in the $k$-th cluster and $\eta_k$ represents the degree of contamination. We can see $\eta_k$ as an inflation parameter that measures the increase in variability due to the bad observations.
Each curve $x_i$ has a basis expansion with coefficients $\gamma_i$, where $\gamma_i$ is a random vector whose distribution is a mixture of contaminated Gaussians with density

$$p(\gamma; \theta) = \sum_{k=1}^{K} \pi_k f(\gamma; \theta_k). \qquad (3)$$
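For intuition, the mixture density in (3), with contaminated normal components as in the complete-data likelihood below, can be evaluated as follows (a sketch assuming the mvtnorm package; all object names are ours):

```r
library(mvtnorm)  # for the multivariate normal density dmvnorm()

# One contaminated normal component: alpha * phi(.; mu, Sigma) +
# (1 - alpha) * phi(.; mu, eta * Sigma), with eta > 1 inflating the variance.
dcontnorm <- function(x, mu, Sigma, alpha, eta) {
  alpha * dmvnorm(x, mu, Sigma) + (1 - alpha) * dmvnorm(x, mu, eta * Sigma)
}

# K-component mixture with weights pi_k, as in equation (3)
dmixcn <- function(x, pis, mus, Sigmas, alphas, etas) {
  dens <- 0
  for (k in seq_along(pis))
    dens <- dens + pis[k] * dcontnorm(x, mus[[k]], Sigmas[[k]], alphas[k], etas[k])
  dens
}
```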
3 Model Inference
To fit the models we use the ECM algorithm [3], a variant of the EM algorithm in which the M-step of the EM algorithm is replaced by two simpler CM-steps given by the partition of the parameter set $\theta = \{\Psi_1, \Psi_2\}$, where $\Psi_1 = \{\pi_k, \alpha_k, \mu_k, a_{kj}, b_k, q_{kj},\ k = 1, \ldots, K,\ j = 1, \ldots, d_k\}$, $\Psi_2 = \{\eta_k,\ k = 1, \ldots, K\}$, and $q_{kj}$ is the $j$-th column of $Q_k$.
We have two sources of missing data: the cluster labels and the type of observation (good or bad). Thus the complete data are given by $S = \{\gamma_i, z_i, \nu_i\}_{i=1,\ldots,n}$, and the complete-data likelihood is

$$L_c(\theta; S) = \prod_{i=1}^{N} \prod_{k=1}^{K} \Big\{ \pi_k \left[\alpha_k\, \phi(\gamma_i; \mu_k, \Sigma_k)\right]^{\nu_{ik}} \left[(1-\alpha_k)\, \phi(\gamma_i; \mu_k, \eta_k \Sigma_k)\right]^{1-\nu_{ik}} \Big\}^{z_{ik}}$$
4 Applications

Fig. 1 Smooth data simulated without outliers (a), according to scenario A (b), scenario B (c), and scenario C (d), coloured by group for one simulation.
Table 1 Mean (and standard deviation) of ARI for the BIC-best model on 100 simulations. Bold values indicate the highest value for each method.

Method       𝛼∗      𝜖      ARI
FunHDDC      –       0.01   0.68
FunHDDC      –       0.05   0.64
FunHDDC      –       0.1    0.59
FunHDDC      –       0.2    0.57
CFunHDDC     0.85    0.01   0.67
CFunHDDC     0.85    0.05   0.70
CFunHDDC     0.85    0.1    0.70
CFunHDDC     0.85    0.2    0.6
CNmixt       0.5     –      0.67
CNmixt       0.75    –      0.66
CNmixt       0.85    –      0.67
CNmixt       0.9     –      0.66

The quality of the estimated partitions obtained using FunHDDC and CFunHDDC is evaluated using the Adjusted Rand Index (ARI) [3], and the results are included in Table 1. For FunHDDC we use the library funHDDC in R. We run both algorithms for 𝐾 = 3 with all 6 sub-models, and the best solution in terms of the highest BIC value over all those sub-models is returned. The initialization is done with the 𝑘-means strategy with 50 repetitions, and the maximum number of iterations for the stopping criterion is 200. We use 𝜖 ∈ {0.05, 0.1, 0.2} in the Cattell test.
We notice that CFunHDDC outperforms FunHDDC, and it gives excellent results
even in Scenario C. For CFunHDDC the best results are obtained for 𝜖 = 0.2 in the
Cattell test, and the values of the ARI are close to 1.
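One evaluation run can be sketched as follows, assuming the funHDDC package mentioned above and mclust for the ARI (argument names follow those packages' documented interfaces and may differ across versions; fd_data and truth stand for the simulated smoothed curves and the true labels):

```r
library(funHDDC)  # model-based clustering of functional data
library(mclust)   # provides adjustedRandIndex()

res <- funHDDC(fd_data, K = 3, init = "kmeans",
               threshold = 0.2,          # epsilon of the Cattell test
               itermax = 200)            # maximum number of iterations
adjustedRandIndex(res$class, truth)      # ARI of estimated vs. true partition
```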
Next, we consider the NOx data available in the fda.usc library in R and repre-
senting daily curves of Nitrogen Oxides (NOx) emissions in the neighborhood of
the industrial area of Poblenou, Barcelona (Spain). The measurements of NOx (in
𝜇g/m3 ) were taken hourly resulting in 76 curves for “working days” and 39 curves
for “non-working days” (see Figure 2 a). Since NOx is a contaminant agent, the
detection of outlying emission is useful for environmental protection. This data set
has been used for testing methods for the detection of outliers and to illustrate robust
clustering based on trimming for functional data [4].
We apply CFunHDDC, FunHDDC, and CNmixt to the NOx data. Curves are smoothed using a basis of 8 Fourier functions, and we run the algorithms for 𝐾 = 2 clusters. For CFunHDDC and FunHDDC we use 𝜖 ∈ {0.001, 0.05, 0.1, 0.2} in the Cattell test, and the rest of the settings are the same as in the simulation study. We run CNmixt for all 14 models from the ContaminatedMixt R library, based on the coefficients in the Fourier basis, with 1000 iterations for the stopping criterion, and initialization done with the 𝑘-means method. The correct classification rates (CCR) are reported in Table 2.
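The preprocessing step can be sketched as follows, assuming the fda.usc package (which ships the poblenou dataset) and the fda package for the Fourier basis; this is illustrative, not the authors' code:

```r
library(fda.usc)  # provides the NOx data (poblenou)
library(fda)      # provides Fourier bases and smoothing

data(poblenou)
argv   <- poblenou$nox$argvals                    # hourly time grid
basis  <- create.fourier.basis(rangeval = range(argv), nbasis = 8)
nox_fd <- smooth.basis(argv, t(poblenou$nox$data), basis)$fd  # 115 smoothed curves
```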
The CCR for CFunHDDC are slightly better than the ones for FunHDDC and
CNmixt, and are comparable with the ones reported in Table 1 in [4] for Funclust,
Fig. 2 a. Daily NOx curves for 115 days; b, c. Clustering obtained with CFunHDDC, 𝜖 = 0.05, 𝛼∗ = 0.85; non-working days (blue), working days (red), outliers (green).
RFC, and TrimK. In Figure 2 b, c we present the clusters and the detected outliers
for 𝜖 = 0.05 and 𝛼∗ = 0.85. The curves that are detected as outliers (green lines)
exhibit different patterns from the rest of the curves.
One of the advantages of extending FunHDDC to CFunHDDC is outlier detection. For 𝛼∗ = 0.85 and 𝜖 = 0.05, CFunHDDC detects 16 outliers, which are the same as the outliers mentioned in [4]. For the data without outliers, CFunHDDC becomes equivalent to FunHDDC, and for the trimmed data the CCR increases to 0.79.
5 Conclusion
We propose a new method, CFunHDDC, that extends the FunHDDC functional clustering method to data with mild outliers. Unlike other robust functional clustering algorithms, CFunHDDC does not involve trimming the data. CFunHDDC is based on a mixture of contaminated multivariate normal distributions, which makes parameter estimation more difficult than for FunHDDC, so we use an ECM instead of an EM algorithm. The clustering and outlier detection performance of CFunHDDC is tested on simulated data and the NOx data, and it always outperforms FunHDDC. Moreover, CFunHDDC has a performance comparable to robust functional clustering methods based on trimming, such as RFC and TrimK, and similar or better performance when compared to a two-step method based on CNmixt. Although there are several model-based methods for multivariate data with outliers that can be used to construct two-step methods for functional data, as observed in [1], these two-step methods always suffer from the difficulty of choosing the best discretization. CFunHDDC can be extended to multivariate functional data; recently, and independently of our work, a similar approach was followed in [5], but without considering the parsimonious models and the value 𝛼∗.
References
1. Bouveyron, C., Jacques, J.: Model-based clustering of time series in group-specific functional
subspaces. Adv. Data. Anal. Classif. 5(4), 281–300 (2011)
2. Jacques, J., Preda, C.: Funclust: a curves clustering method using functional random variables
density approximation. Neurocomputing 112, 164–171 (2013)
3. Punzo, A., McNicholas, P. D.: Parsimonious mixtures of multivariate contaminated normal
distributions. Biom. J. 58, 1506–1537 (2016)
4. Rivera-Garcia, D., Garcia-Escudero, L. A., Mayo-Iscar, A., Ortega, J.: Robust clustering for
functional data based on trimming and constraints. Adv. Data Anal. Classif. 13, 201–225
(2019)
5. Amovin-Assagba, M., Gannaz, I., Jacques, J.: Outlier detection in multivariate functional data through a contaminated mixture model. (2021) https://doi.org/10.48550/arXiv.2106.07222
A Trivariate Geometric Classification of Decision Boundaries for Mixtures of Regressions
1 Introduction
Filippo Antonazzo ( )
Inria, Université de Lille, CNRS, Laboratoire de mathématiques Painlevé 59650 Villeneuve d’Ascq,
France, e-mail: [email protected]
Salvatore Ingrassia
Dipartimento di Economia e Impresa, Università di Catania, Corso Italia 55, 95129 Catania, Italy,
e-mail: [email protected]
2 Mixtures of Regressions

MRFC. Mixtures of regressions with fixed covariates have the following density:

$$p(y \mid \mathbf{x}; \psi) = \sum_{g=1}^{G} \pi_g f(y \mid \mathbf{x}; \theta_g), \qquad (1)$$

where $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$. In mixtures of regressions with concomitant variables (MRCV), the mixing weights are functions of the covariates through a multinomial logistic model,

$$p(\Omega_g \mid \mathbf{x}; \boldsymbol{\alpha}) = \frac{\exp(\alpha_{g0} + \boldsymbol{\alpha}_{g1}^t \mathbf{x})}{\sum_{j=1}^{G} \exp(\alpha_{j0} + \boldsymbol{\alpha}_{j1}^t \mathbf{x})}.$$

Furthermore, the mixture of regressions with random covariates (MRRC) is totally parametrized by the vector $\psi = (\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G, \xi_1, \ldots, \xi_G)$, where each $\theta_g$ indexes the conditional density $f(y \mid \mathbf{x}, \theta_g)$, while each $\xi_g$ refers to the density of $\mathbf{X}$ in the group $\Omega_g$, denoted by $p(\mathbf{x}; \xi_g)$.
In particular, under Gaussian assumptions it results that $Y \mid \mathbf{x}, \Omega_g \sim N(\beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)$, where each $\beta_g = (\beta_{g0}, \beta_{g1})$ is a vector of real parameters. Only for the MRRC model will we further assume $\mathbf{X} \mid \Omega_g \sim N(\mu_g, \Sigma_g)$ for all $g = 1, \ldots, G$, where $\mu_g$ denotes the mean of the Gaussian distribution, while $\Sigma_g$ is its covariance matrix. Denoting by $\phi(\cdot)$ the Gaussian density function, equations (1)-(3) can be rewritten, respectively, as

$$p(y \mid \mathbf{x}; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \pi_g, \qquad (4)$$

$$p(y \mid \mathbf{x}; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, p(\Omega_g \mid \mathbf{x}; \boldsymbol{\alpha}), \qquad (5)$$

$$p(\mathbf{x}, y; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \phi(\mathbf{x}; \mu_g, \Sigma_g)\, \pi_g. \qquad (6)$$
Maximum likelihood estimates for $\psi$ are usually obtained with the Expectation-Maximization (EM) algorithm. The final estimate is then used to build classifiers which group observations into $G$ disjoint classes.
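For a concrete sense of the difference between MRFC and MRCV, both can be fitted with the flexmix package [2]; the toy data below (with parameter values loosely echoing Table 1, but a single covariate) are ours:

```r
library(flexmix)
set.seed(123)
x   <- runif(200, -2, 2)
cls <- rbinom(200, 1, 0.7) + 1                    # latent class, pi = (0.3, 0.7)
y   <- ifelse(cls == 1, 1 + 2 * x, 1 - 4 * x) + rnorm(200, sd = sqrt(0.5))
df  <- data.frame(x = x, y = y)

mrfc <- flexmix(y ~ x, data = df, k = 2)          # MRFC: fixed mixing weights
mrcv <- flexmix(y ~ x, data = df, k = 2,
                concomitant = FLXPmultinom(~ x))  # MRCV: logistic concomitant model
table(clusters(mrfc))                             # MAP classification of observations
```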
There are different ways to build classifiers. One of the best known is the method of
discriminant functions. The aim of this procedure is to define 𝐺 functions 𝐷 𝑔 (x, 𝑦; 𝜓)
and a decision rule to divide the real space R𝑑+1 into 𝐺 decision regions, named
These results show that models with more flexibility, i.e. with more parameters,
can generate more varieties of decision boundaries. In the following section, we will
extend these statements to dimension 𝑑 = 3.
Analyzing the provided results, we can note that they perfectly match the hierarchy established in dimension 𝑑 = 2. Indeed, an MRFC can generate only degenerate hyperquadrics of rank 3; the surfaces generated by an MRCV, which has more parameters, are still degenerate but of higher rank (equal to 4), depending on the same mathematical condition of Proposition 2; finally, an MRRC, the most flexible model in terms of number of parameters, can give rise to various hyperquadrics, as in 𝑑 = 2.
for robustness reasons. It is shown that the generated decision boundaries are more flexible than their Gaussian counterparts, as they can assume more varied shapes, although these surfaces can be calculated only numerically. In this section, we continue the exploration of the 𝑡-distribution case adding one more variable, thus 𝑑 = 2. Under these more general assumptions, discriminant functions (8)-(10) become:
$$\mathit{MRFC}\text{-}t: \quad D_g(\mathbf{x}, y; \psi) = \ln\!\left[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g)\, \pi_g\right], \qquad (11)$$

$$\mathit{MRCV}\text{-}t: \quad D_g(\mathbf{x}, y; \psi) = \ln\!\left[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g) \exp(\alpha_{g0} + \boldsymbol{\alpha}_{g1}^t \mathbf{x})\right], \qquad (12)$$

$$\mathit{MRRC}\text{-}t: \quad D_g(\mathbf{x}, y; \psi) = \ln\!\left[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g)\, q(\mathbf{x}; \mu_g, \mathbf{\Sigma}_g, \nu_g)\, \pi_g\right], \qquad (13)$$
Table 1 Parameters used in Figures 1-2. MRRC: covariance matrices 𝚺1 and 𝚺2 are equal to the identity matrix I2.

Model  Group  𝜋𝑔    𝛽𝑔0  𝛽𝑔1      𝜎𝑔2  𝛼𝑔0  𝛼𝑔1       𝜇𝑔       𝜈𝑔
MRFC   1      0.3   1    (2,-3)   0.5
       2      0.7   1    (-4,3)   0.5
MRCV   1      0.3   1    (2,-3)   0.5
       2      0.7   1    (-4,3)   0.5  1    (-1,0.5)
MRRC   1      0.3   1    (2,-3)   0.5                 (1,2)    5
       2      0.7   1    (-4,3)   0.5  1    (-1,0.5)  (-1,-2)  5
6 Conclusions

This work has provided a trivariate geometric classification of the decision boundaries generated by mixtures of regressions in the presence of two classes. Under Gaussian assumptions, our results confirmed the same hierarchy that was shown in 𝑑 = 2: MRRC turns out to exhibit a huge variety of decision boundaries, while the other models generate only degenerate surfaces. This is coherent with its high degree of flexibility, given by its very general parametrization.
Fig. 1 Decision boundaries under assumptions of Gaussian (in blue) and 𝑡-distributed variables with 𝜂1 = 𝜂2 = 3 (in orange) for the three considered mixtures of regressions.

Fig. 2 Decision boundaries under assumptions of Gaussian (in blue) and 𝑡-distributed variables with 𝜂1 = 𝜂2 = 10 (in red) for the three considered mixtures of regressions.

The provided results could help to select the right model depending on the shape of the data. For example,
if in a descriptive analysis the data turn out to be approximately separated by a simple degenerate hyperquadric, it will be better to estimate an MRFC or an MRCV instead of a complex MRRC. On the contrary, if the separation surface seems to be non-degenerate, then it will be preferable to fit a general MRRC. Moreover, this work also showed that the degree of flexibility (and thus the variety of possible decision boundaries) can be enhanced by going beyond Gaussianity, assuming, for example, 𝑡-distributed variables. This encourages additional extensions in which more general distributional assumptions are adopted.
References
1. DeSarbo, W. S., Cron, W. L.: A maximum likelihood methodology for clusterwise linear
regression. J. Classif. 5, 249–282 (1988)
2. Grun, B., Leisch, F.: FlexMix version 2: finite mixtures with concomitant variables and varying
and constant parameters. J. Stat. Softw. 28, 1–35 (2008)
3. Hennig, C.: Identifiablity of models for clusterwise linear regression. J. Classif. 17, 273–296
(2000)
4. Ingrassia, S., Minotti, S. C., Vittadini, G.: Local Statistical Modeling via a Cluster-Weighted
Approach with Elliptical Distributions. J. Classif. 29, 363-401 (2012)
5. Ingrassia, S., Punzo, A.: Decision boundaries for mixtures of regressions. J. Korean Stat. Soc.
45, 295-306 (2016)
6. Wedel, M.: Concomitant variables in finite mixture models. Stat. Neerl. 56, 362–375 (2002)
Generalized Spatio-temporal Regression with PDE Penalization
Abstract We develop a novel generalised linear model for the analysis of data dis-
tributed over space and time. The model involves a nonparametric term 𝑓 , a smooth
function over space and time. The estimation is carried out by the minimization of an
appropriate penalized negative log-likelihood functional, with a roughness penalty
on 𝑓 that involves space and time differential operators, in a separable fashion, or
an evolution partial differential equation. The model can include covariate informa-
tion in a semi-parametric setting. The functional is discretized by means of finite
elements in space, and B-splines or finite differences in time. Thanks to the use of
finite elements, the proposed method is able to efficiently model data sampled over
irregularly shaped spatial domains, with complicated boundaries. To illustrate the proposed model, we present an application studying criminality in the city of Portland from 2015 to 2020.
Eleonora Arnone ( )
Dipartimento di Scienze Statistiche, Università di Padova, Via Cesare Battisti, 241, 35121 Padova,
Italy, e-mail: [email protected]
Elia Cunial
Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano,
Italy, e-mail: [email protected]
Laura M. Sangalli
Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano,
Italy, e-mail: [email protected]
1 Introduction
In this work we develop a novel generalised linear model for the analysis of data
distributed over space and time. Let 𝑌 be a real-valued variable of interest, and W a
vector of 𝑞 covariates, observed in 𝑛 spatio-temporal locations {p𝑖 , 𝑡𝑖 }𝑖=1,...,𝑛 ∈ Ω×𝑇,
where Ω ⊂ R2 is a bounded spatial domain, and 𝑇 ⊂ R a temporal interval. We
assume that the expected value of $Y$, conditional on the covariates and the location of observation, can be modeled as

$$g\big(\mathbb{E}[\,Y \mid \mathbf{W} = \mathbf{w}, (\mathbf{p}, t)\,]\big) = \mathbf{w}^\top \boldsymbol{\beta} + f(\mathbf{p}, t),$$

where $g$ is a known monotone link function, chosen on the basis of the stochastic nature of $Y$, $\boldsymbol{\beta} \in \mathbb{R}^q$ is an unknown vector of regression coefficients, and $f: \Omega \times T \to \mathbb{R}$ is an unknown deterministic function which captures the spatio-temporal variation of the phenomenon under study. Starting from the values $\{y_i, \mathbf{w}_i\}_{i=1,\ldots,n}$ of the observed response variable and covariates, we estimate $\boldsymbol{\beta}$ and $f$ in a semiparametric fashion. In particular, following the approach in [9], which considers a similar problem for data scattered over space only, we minimize the functional

$$\ell\big(\{y_i, \mathbf{w}_i, \mathbf{p}_i, t_i\}_{i=1,\ldots,n};\, \boldsymbol{\beta}, f\big) + P(f),$$

where $\ell$ is the negative log-likelihood and the roughness penalty $P(f)$ may take the separable form

$$P(f) = \lambda_T \int_{\Omega}\!\int_{T} \left(\frac{\partial^2 f}{\partial t^2}\right)^2 + \lambda_S \int_{\Omega}\!\int_{T} \left(\Delta f\right)^2,$$
where the first term accounts for the regularity of the function in time, while the
second accounts for the regularity of the function in space; the importance of each
term is controlled by two smoothing parameters 𝜆𝑇 and 𝜆 𝑆 . Alternatively, as in
[2], we may consider a single penalty which accounts for the spatial and temporal
regularity:
$$P(f) = \lambda \int_{\Omega}\!\int_{0}^{T} \left(\frac{\partial f}{\partial t} + Lf - u\right)^2.$$
Differently from the models in [2, 3, 4], the estimation functional to be minimized is not quadratic. This poses increased difficulties from the computational point of view.
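The structure of the minimization can be conveyed by a deliberately simplified sketch (all names are ours): a Poisson negative log-likelihood penalized by a generic quadratic roughness term, optimized numerically. The basis matrix B and penalty matrix P below are placeholders; the actual method discretizes f with finite elements in space and B-splines or finite differences in time:

```r
set.seed(1)
n <- 100; q <- 1
X <- cbind(rnorm(n))                  # one covariate (e.g. population)
B <- matrix(rnorm(n * 10), n, 10)     # placeholder basis evaluations of f
P <- crossprod(diff(diag(10), differences = 2))   # toy roughness penalty matrix
y <- rpois(n, lambda = exp(0.5 * X[, 1]))
lambda <- 1

pen_negloglik <- function(par) {      # penalized negative log-likelihood
  beta <- par[1:q]; f <- par[-(1:q)]
  eta  <- X %*% beta + B %*% f        # linear predictor, log link (Poisson)
  -sum(dpois(y, exp(eta), log = TRUE)) + lambda * c(t(f) %*% P %*% f)
}
fit <- optim(rep(0, q + 10), pen_negloglik, method = "BFGS")
```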
This section describes the Portland criminality data, which will be used to illustrate the proposed methodology. We present a Poisson model for the counts of crimes in the city, and study their evolution from April 2015 to November 2020. In addition, we consider as a covariate the population of the city neighborhoods. The crime data are publicly available on the website of the city's Police Bureau.
The crime counts are aggregated by trimester and at the neighborhood level. Figure 1 shows the city neighborhoods, each colored according to its total population. The bottom part of the same figure shows the temporal evolution of the crimes in each neighborhood: each curve corresponds to a neighborhood and is colored according to the neighborhood's population. In both panels, the three neighborhoods with the highest numbers of crimes are indicated by the numbers 1, 2 and 3. The figure highlights some correlation between neighborhood population and the number of crimes. However, criminality is not fully explained by population: for instance, neighborhoods 1 and 3 present a high number of crimes with a moderate population. This raises the interest towards a semiparametric
generalized linear model, as the one introduced in Section 1, with a nonparametric
term accounting for the spatio-temporal variability in the phenomenon, that cannot
be explained by population or other census quantities. Figure 2 shows the same data
for four different trimesters on the Portland map. As already pointed out, the three
area with the highest number of crimes are in the city center, and in the Hazelwood
neighborhood, in the east part of the city.
From Figures 1 and 2 we can see that the shape of the domain is complicated: the city is crossed by a river, with few bridges connecting the two parts, most of them placed downtown. Therefore, neighborhoods on opposite sides of the river and far from the center, where most bridges are located, are close in Euclidean distance but far apart in reality. This particular morphology influences the phenomenon under study; for example, in the north of the city, the east side of the river is characterized by a higher number of crimes with respect to the west side. Due to these characteristics of the data and the domain, it is of crucial importance to take into account the shape
Fig. 1 Top: the city of Portland divided into neighborhoods, each neighborhood colored according
to the total population. Bottom: the total crimes over time for each neighborhood; each curve
corresponds to a neighborhood and is colored according to the neighborhood’s population. The
three neighborhoods with the highest number of crimes are indicated by numbers 1, 2 and 3.
Fig. 2 Total crime counts per neighborhood per trimester; green indicates lower number of crimes,
red indicates a higher number of crimes.
of the domain during the estimation process. For this reason, estimation based on classical semiparametric models, such as those based on thin-plate splines, would give poor results, while the proposed method, being able to comply with the nontrivial form of the domain, is particularly well suited.
References
1. Aguilera-Morillo, M. C., Durbán, M., Aguilera, A. M.: Prediction of functional data with
spatial dependence: a penalized approach. Stoch. Environ. Res. Risk Assess. 31, 7–22 (2017)
2. Arnone, E., Azzimonti, L., Nobile, F., Sangalli, L. M.: Modeling spatially dependent functional
data via regression with differential regularization. J. Multivariate Anal. 170, 275–295 (2019)
3. Arnone, E., Sangalli, L. M., Vicini, A.: Smoothing spatio-temporal data with complex missing
data patterns. Stat. Model. Int. J. (2021)
4. Bernardi, M. S., Sangalli, L. M., Mazza, G., Ramsay, J. O.: A penalized regression model for
spatial functional data with application to the analysis of the production of waste in Venice
province. Stoch. Environ. Res. Risk Assess. 31, 23–38 (2017)
5. Marra, G., Miller, D. L., Zanin, L.: Modelling the spatiotemporal distribution of the incidence
of resident foreign population. Statistica Neerlandica 66(2) 133–160 (2012)
6. Sangalli, L. M.: Spatial regression with partial differential equation regularization. Int. Stat.
Rev. 89(3), 505–531 (2021)
7. Ugarte, M. D., Goicoa, T., Militino, A. F., Durbán, M.: Spline smoothing in small area trend
estimation and forecasting. Comput. Stat. Data Anal. 53(10), 3616–3629 (2009)
8. Ugarte, M. D., Goicoa, T., Militino, A. F.: Spatio-temporal modeling of mortality risks using
penalized splines. Environmetrics 21, 270–289 (2010)
9. Wilhelm M., Sangalli L. M.: Generalized spatial regression with differential regularization. J.
Stat. Comput. Simulat. 86(13), 2497–2518 (2016)
A New Regression Model for the Analysis of Microbiome Data
1 Introduction
The human microbiome is defined as the set of genes associated with the micro-
biota, i.e. the microbial community living in the human body, including bacteria,
viruses and some unicellular eukaryotes [1, 8]. The mutualistic relationship between microbiota and human beings is often beneficial, though it can sometimes
Roberto Ascari ( )
Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca,
Milan, Italy, e-mail: [email protected]
Sonia Migliorati
Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca,
Milan, Italy, e-mail: [email protected]
become detrimental for several health outcomes. For example, changes in the gut microbiome composition can be associated with diabetes, cardiovascular disease, obesity, autoimmune disease, anxiety and many other factors impacting human health [1, 5, 12, 14]. Moreover, the development of next-generation sequencing technologies nowadays allows the microbiome composition to be surveyed using direct DNA sequencing of either marker genes or the whole metagenome, without the need for isolation and culturing. These are the two main reasons for the recent explosion of research on the microbiome, and they highlight the importance of understanding the association between microbiome composition and biological and environmental covariates.
A widespread distribution for handling microbiome data is the Dirichlet-multinomial (DM) (e.g., see [4, 16]), a generalization of the multinomial distribution obtained by assuming that, instead of being fixed, the underlying taxa proportions come from a Dirichlet distribution. This allows modeling overdispersed count data, that is, data showing a variance much larger than that predicted by the multinomial model. Despite its popularity, the DM distribution is often inadequate for real microbiome datasets due to the strict covariance structure imposed by its parameterization, which hinders the description of co-occurrence and co-exclusion relationships between microbial taxa.
The aim of this work is to propose a new distribution that generalizes the DM, namely the flexible Dirichlet-multinomial (FDM), and a regression model based on it. The new model provides a better fit to real microbiome data while preserving a clear interpretation of its parameters. Moreover, being a finite mixture with DM components, it can account for the latent group structure of the data, and thus identify clusters sharing similar biota compositions.
In this section, we define a new distribution for multivariate counts and a regression model based on it, which allows microbiome abundances to be linked with covariates. Note that, once the DNA sequence reads have been aligned to the reference microbial
genomes, the abundances of microbial taxa can be quantified. Thus, microbiome data
represent the count composition of 𝐷 bacterial taxa in a specific biological sample,
and a microbiome dataset is a sequence of 𝐷-dimensional vectors Y1 , Y2 , . . . , Y 𝑁 ,
where 𝑌𝑖𝑟 counts the number of occurrences of taxon 𝑟 in the 𝑖-th sample (𝑖 =
1, . . . , 𝑁 and 𝑟 = 1, . . . , 𝐷). Since the 𝑖-th sample contains a number 𝑛𝑖 of bacteria,
microbiome observations are subject to a fixed-sum constraint, that is $\sum_{r=1}^{D} Y_{ir} = n_i$.
$$f_{\mathrm{Dir}}(\boldsymbol{\pi}; \boldsymbol{\mu}, \alpha^+) = \frac{\Gamma(\alpha^+)}{\prod_{r=1}^{D} \Gamma(\alpha^+ \mu_r)} \prod_{r=1}^{D} \pi_r^{\alpha^+ \mu_r - 1},$$

$$f_{\mathrm{DM}}(\mathbf{y}; n, \boldsymbol{\mu}, \alpha^+) = \frac{n!\, \Gamma(\alpha^+)}{\Gamma(\alpha^+ + n)} \prod_{r=1}^{D} \frac{\Gamma(\alpha^+ \mu_r + y_r)}{y_r!\, \Gamma(\alpha^+ \mu_r)}.$$
$$f_{\mathrm{FD}}(\boldsymbol{\pi}; \boldsymbol{\mu}, \alpha^+, w, \mathbf{p}) = \sum_{j=1}^{D} p_j\, f_{\mathrm{Dir}}\!\left(\boldsymbol{\pi}; \boldsymbol{\lambda}_j, \frac{\alpha^+}{1-w}\right), \qquad (2)$$

where

$$\boldsymbol{\lambda}_j = \boldsymbol{\mu} - w\mathbf{p} + w\mathbf{e}_j \qquad (3)$$

is the mean vector of the $j$-th component, $\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{\Pi}] \in \mathcal{S}^D$, $\alpha^+ > 0$, $\mathbf{p} \in \mathcal{S}^D$, $0 < w < \min\left\{1, \min_{r \in \{1,\ldots,D\}} \frac{\mu_r}{p_r}\right\}$, and $\mathbf{e}_j$ is a vector with all elements equal to zero except for the $j$-th, which is equal to one.
Equation (2) shows that the Dirichlet components have different mean vectors and a common precision parameter, the latter being determined by $\alpha^+$ and $w$. In particular, inspecting Equation (3), it is easy to observe that any two vectors $\boldsymbol{\lambda}_r$ and $\boldsymbol{\lambda}_h$, $r \neq h$, coincide in all elements except the $r$-th and the $h$-th.
If $\boldsymbol{\Pi}$ is supposed to be FD distributed, a new discrete distribution for count vectors can be defined, which we shall call the flexible Dirichlet-multinomial (FDM). The p.m.f. of the FDM can be expressed as

$$f_{\mathrm{FDM}}(\mathbf{y}; n, \boldsymbol{\mu}, \alpha^+, \mathbf{p}, w) = \sum_{j=1}^{D} p_j\, f_{\mathrm{DM}}\!\left(\mathbf{y}; n, \boldsymbol{\lambda}_j, \frac{\alpha^+}{1-w}\right) \qquad (4)$$

$$= \sum_{j=1}^{D} p_j\, \frac{n!\, \Gamma\!\left(\frac{\alpha^+}{1-w}\right)}{\Gamma\!\left(\frac{\alpha^+}{1-w} + n\right)} \prod_{r=1}^{D} \frac{\Gamma\!\left(\frac{\alpha^+}{1-w}\lambda_{jr} + y_r\right)}{y_r!\, \Gamma\!\left(\frac{\alpha^+}{1-w}\lambda_{jr}\right)},$$
$$\mathbb{E}[\mathbf{Y}] = n\boldsymbol{\mu}, \qquad \mathbb{V}[\mathbf{Y}] = n\mathbf{M}\left(1 + \frac{n-1}{\phi+1}\right) + n\,\frac{(n-1)\,\phi\, w^2}{\phi+1}\,\mathbf{P}, \qquad (5)$$
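The p.m.f. (4) is straightforward to evaluate numerically; the following base-R sketch (function and object names are ours) works on the log scale via lgamma for stability:

```r
# log density of a Dirichlet-multinomial DM(n, lam, a)
ldm <- function(y, lam, a) {
  n <- sum(y)
  lgamma(n + 1) + lgamma(a) - lgamma(a + n) +
    sum(lgamma(a * lam + y) - lgamma(y + 1) - lgamma(a * lam))
}

# FDM p.m.f. as the D-component mixture in (4); 'alpha' stands for alpha+
dfdm <- function(y, mu, alpha, p, w) {
  a <- alpha / (1 - w)
  sum(sapply(seq_along(y), function(j) {
    lam <- mu - w * p                 # lambda_j = mu - w p + w e_j, as in (3)
    lam[j] <- lam[j] + w
    exp(log(p[j]) + ldm(y, lam, a))
  }))
}

dfdm(c(3, 1, 0), mu = c(0.5, 0.3, 0.2), alpha = 2, p = rep(1/3, 3), w = 0.2)
```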
We can define the FDM regression (FDMReg) and the DM regression (DMReg) models by assuming that $\mathbf{Y}_i$ follows an FDM$(n_i, \boldsymbol{\mu}_i, \alpha^+, \mathbf{p}, \tilde{w})$ or a DM$(n_i, \boldsymbol{\mu}_i, \alpha^+)$ distribution, respectively. Even if the FDM and DM distributions do not belong to the dispersion-exponential family, we can follow a GLM-type approach [6], linking the parameter $\boldsymbol{\mu}_i$ to the linear predictor through a proper link function such as the multinomial logit:

$$g(\mu_{ir}) = \log\left(\frac{\mu_{ir}}{\mu_{iD}}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}_r, \qquad r = 1, \ldots, D-1, \qquad (7)$$

where $\boldsymbol{\beta}_r = (\beta_{r0}, \beta_{r1}, \ldots, \beta_{rK})^\top$ is a vector of regression coefficients for the $r$-th element of $\boldsymbol{\mu}_i$. Note that the last category has been conventionally chosen as the baseline category, thus $\boldsymbol{\beta}_D = \mathbf{0}$.
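The inverse of the link (7) is a softmax with the last category as baseline; a small base-R sketch (the function name and the coefficient values are ours, for illustration only):

```r
# Recover mu_i from a covariate vector x and the (D-1) x (K+1) matrix of
# regression coefficients B (rows beta_1, ..., beta_{D-1}; beta_D = 0).
mu_from_x <- function(x, B) {
  eta <- c(B %*% c(1, x), 0)          # linear predictors; 0 for the baseline
  exp(eta) / sum(exp(eta))            # softmax: (mu_i1, ..., mu_iD)
}

B <- rbind(c(2.2, -0.04), c(0.5, 0.1))  # illustrative values, D = 3, K = 1
mu_from_x(x = 0.3, B = B)
```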
The parameterization of the FDMReg based on $\boldsymbol{\mu}$, $\mathbf{p}$, $\alpha^+$, and $\tilde{w}$ defines a variation-independent parameter space, meaning that no constraints exist among the parameters. In a Bayesian framework, this allows us to assume prior independence and, consequently, to specify a prior distribution for each parameter separately. In order to induce minimal impact on the posterior distribution, we select weakly informative priors: (i) $\boldsymbol{\beta}_r \sim N_{K+1}(\mathbf{0}, \Sigma)$, where $\mathbf{0}$ is the $(K+1)$-vector with zero elements and $\Sigma$ is a diagonal matrix with 'large' variance values; (ii) $\alpha^+ \sim \mathit{Gamma}(g_1, g_2)$ for small values of $g_1$ and $g_2$; (iii) $\tilde{w} \sim \mathit{Unif}(0, 1)$; and (iv) a uniform prior on the simplex for $\mathbf{p}$.
Inferential issues are dealt with in a Bayesian framework through a Hamiltonian Monte Carlo (HMC) algorithm [10], a popular generalization of the Metropolis-Hastings algorithm. The Stan modeling language [13] allows implementing an HMC method to obtain a simulated sample from the posterior distribution. To compare the fit of the models we use the Watanabe-Akaike information criterion (WAIC) [15, 17], a fully Bayesian criterion that balances goodness-of-fit against the complexity of a model: lower values of WAIC indicate a better fit.
In this section, we fit the DM and the FDM regression models to a microbiome dataset
analyzed by Xia et al. [19] and previously proposed by Wu et al. [18]. They collected
gut microbiome data on 98 healthy volunteers. In particular, the counts of three
bacterial genera were recorded, namely Bacteroides, Prevotella, and Ruminococcus.
Arumugam et al. [2] used these three bacteria to define three groups they called
enterotypes. These enterotypes provide information about the human body's ability to produce vitamins.
Wu et al. analyzed the same dataset, conducting a cluster analysis via the 'partitioning around medoids' (PAM) approach. They detected only two of the three enterotypes defined in the work by Arumugam et al. Moreover, these two clusters are characterized by different frequencies: 86 out of the 98 samples were allocated to the first enterotype, whereas only 12 samples were clustered into enterotype 2. This is due to the small number of subjects with a high abundance of Prevotella (i.e., only 36 samples showed a Prevotella count greater than 0).
Besides the bacterial data, we also consider 𝐾 = 9 covariates, representing information on micro-nutrients in the habitual long-term diet, collected using a food frequency questionnaire. These 9 additional variables were selected by Xia et al. using an 𝑙1-penalized regression approach.
Table 1 shows the posterior mean and 95% credible set (CS) of each parameter
involved in the DMReg and the FDMReg models. Though the significant covariates
are the same across the models, the FDMReg shows a lower WAIC, thus being the
best model in terms of fit. This is due to the additional set of parameters involved in the mixture structure, which help in capturing information from this dataset.
The mixture structure of the FDMReg model can be exploited to cluster observations into groups through a model-based approach. More specifically, each observation can be allocated to the mixture component that most likely generated it. Indeed, note that the mixing weight estimates (0.637, 0.357 and 0.006, from Table 1) confirm the presence of two out of the three enterotypes defined by Arumugam
et al. [2]. To further illustrate the benefits of the FDMReg model in a microbiome data analysis, we compare the clustering profile obtained by the FDMReg model with the one obtained via the PAM approach used by Wu et al. In particular, Table 2 summarizes this comparison in a confusion matrix. Despite the clustering generated by the FDMReg being based on distributional assumptions (i.e., the response being FDM distributed), it agrees with the one obtained by the PAM algorithm for 84% of the observations. This percentage is obtained using the covariates selected
by Xia et al. in a logistic normal multinomial regression model context. Clearly,
the results could be improved by developing an ad hoc variable selection procedure
for the FDMReg model. The main advantage of considering the FDMReg (that is, a model-based clustering approach) is that, besides the clustering of the data points, it also provides some information on the detected clusters (e.g., their size and a measure of their distance) and on the relationship between the response and the set of covariates. This additional information may increase the insight we can gain from the data.
Table 1 Posterior mean and 95% CS for the parameters of the DMReg and FDMReg models.
Regression coefficients in bold are related to 95% CS’s not containing the zero value.
Bacteroides            DM                          FDM
             Post. Mean   95% CS          Post. Mean   95% CS
Intercept     2.197   (1.844, 2.546)       2.642   (2.215, 3.034)
Proline      -0.039   (-0.344, 0.273)     -0.036   (-0.325, 0.261)
Sucrose      -0.257   (-0.555, 0.039)     -0.208   (-0.471, 0.064)
Table 2 Confusion matrix for clustering based on the FDMReg model compared to the PAM
algorithm.
             FDMReg
              1     2
PAM    1     70    16
       2      0    12
References
1. Amato, K.: An introduction to microbiome analysis for human biology applications. Am. J.
Hum. Biol. 29 (2017)
42 R. Ascari and S. Migliorati
2. Arumugam, M. et al.: Enterotypes of the human gut microbiome. Nature. 473, 174–180 (2011)
3. Ascari, R., Migliorati, S.: A new regression model for overdispersed binomial data accounting
for outliers and an excess of zeros. Stat. Med. 40(17), 3895–3914 (2021)
4. Chen, J., Li, H.: Variable selection for sparse Dirichlet-multinomial regression with an appli-
cation to microbiome data analysis. Ann. Appl. Stat. 7(1), 418–442 (2013)
5. Koeth, R. A. et al.: Intestinal microbiota metabolism of L-carnitine, a nutrient in red meat,
promotes atherosclerosis. Nat. Med. 19(5) (2013)
6. McCullagh, P., Nelder, J. A.: Generalized Linear Models. Chapman & Hall (1989)
7. Migliorati, S., Ongaro, A., Monti, G. S.: A structured Dirichlet mixture model for compositional data: inferential and applicative issues. Stat. Comput. 27(4), 963–983 (2017)
8. Morgan, X. C., Huttenhower, C.: Human microbiome analysis. PloS Computational Biology.
8(12) (2012)
9. Ongaro, A., Migliorati, S.: A generalization of the Dirichlet distribution. J. Multivar. Anal.
114, 412–426 (2013)
10. Neal, R. M.: An improved acceptance procedure for the hybrid Monte Carlo algorithm. Tech.
Rep. (1994)
11. Ongaro, A., Migliorati, S., Ascari, R.: A new mixture model on the simplex. Stat. Comput.
30(4), 749–770 (2020)
12. Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., Liang, S., Zhang, W., Guan, Y., Shen, D.,
Peng, Y.: A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature.
490 (2012)
13. Stan Development Team: Stan Modeling Language Users Guide and Reference Manual (2017)
14. Turnbaugh, P. J. et al.: A core gut microbiome in obese and lean twins. Nature. 457 (2009)
15. Vehtari, A., Gelman, A., Gabry, J.: Practical Bayesian model evaluation using leave-one-out
cross-validation and WAIC. Stat. Comput. 27(5), 1413–1432 (2017)
16. Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A., Van-
nucci, M.: An integrative Bayesian Dirichlet-multinomial regression model for the analysis of
taxonomic abundances in microbiome data. BMC Bioinformatics. 18(94) (2017)
17. Watanabe, S.: A widely applicable Bayesian information criterion. J. Mach. Learn. Res. 14(1), 867–897 (2013)
18. Wu, G. D. et al.: Linking long-term dietary patterns with gut microbial enterotypes. Science. 334, 105–109 (2011)
19. Xia, F., Chen, J., Fung, W. K., Li, H.: A logistic normal multinomial regression model for
microbiome compositional data analysis. Biometrics. 69(4), 1053–1063 (2013)
Stability of Mixed-type Cluster Partitions for Determination of the Number of Clusters

Rabea Aschenbruck ( )
Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: [email protected]
Gero Szepannek
Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: [email protected]
Adalbert F.X. Wilhelm
Jacobs University Bremen, Campus Ring 1, 28759 Bremen, Germany, e-mail: [email protected]
1 Introduction
In cluster analysis practice, it is common to work with mixed-type data (i.e., numerical and categorical variables), while theoretical development is traditionally restricted to numerical data. A comprehensive overview of cluster analysis based on mixed-type data is given in [1]. A popular approach to clustering mixed-type data is the 𝑘-prototypes algorithm, an extension of the well-known 𝑘-means algorithm, as proposed in [2] and implemented in [3].
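The following R sketch illustrates the 𝑘-prototypes implementation [3]; the toy data set is invented for this example:

library(clustMixType)
# toy mixed-type data: one numeric and one categorical variable, two groups
set.seed(42)
dat <- data.frame(num = c(rnorm(50, 0), rnorm(50, 5)),
                  fac = factor(rep(c("a", "b"), each = 50)))
kp <- kproto(dat, k = 2)   # k-prototypes with k = 2 clusters
table(kp$cluster)          # cluster sizes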
As for all partitioning clustering methods, the number of clusters has to be specified in advance. In the past, several validation methods have been identified for the 𝑘-prototypes algorithm to enable the rating of clusters and to determine the index-optimal number of clusters. A brief overview is given in Section 2, followed by an examination of the investigated stability indices to improve the clustering of mixed-type data (the analyzed stability indices will extend the R package clustMixType [4]). In Section 3, a simulation study is presented that compares the performance of the stability indices as well as a newly proposed adjustment, and additionally rates their performance with respect to internal validation indices. Finally, a summary, which does not claim a general superiority of the stability-based approaches over internal validation indices, and an outlook are given in Section 4.
The assessment of cluster quality can be used to compare clusterings resulting from different methods or from the same method with different input parameters, e.g., with a different number of clusters. Especially the latter was already an important issue in partitioning clustering many decades ago [5]. Since then, considerable work has been done on this subject. Hennig [6] points out that nowadays some literature uses the term cluster validation exclusively for methods that decide about the optimal number of clusters, in the following named internal validation. An overview of internal validation indices is given, e.g., in [7] or [8]. In [9], a set of internal cluster validation indices for mixed-type data to determine the number of clusters for the 𝑘-prototypes algorithm was derived and analyzed. In the following, stability indices are presented, before they are compared to each other and additionally to internal validation indices in Section 3. Since cluster stability is a model-agnostic method, the indices are applicable to any clustering algorithm and are not limited to numerical data [10].
A partition $S$ splits the data $Y = \{y_1, \dots, y_n\}$ into $K$ groups $S_1, \dots, S_K \subseteq Y$. The focus of this paper is on the evaluation and rating of cluster partitions with so-called stability indices. To calculate these, as discussed by Dolnicar and Leisch [11] or mentioned by Fang and Wang [12], $b \in \{1, \dots, B\}$ bootstrap samples $Y^b$ (with replacement, see e.g. [13]) are drawn from the original data set $Y$. For every bootstrap sample $Y^b$, a cluster partition $S^b = \{S^b_1, \dots, S^b_{L_b}\}$ is determined. For the validation of the different results of these bootstrap samples, the set of points from the original data set that are also part of the $b$-th bootstrap sample, $X^b = Y \cap Y^b$, is used, where $n_b$ is the size of $X^b$. Furthermore, $C^b = \{S_k \cap X^b \mid k = 1, \dots, K\}$ and $D^b = \{S^b_l \cap X^b \mid l = 1, \dots, L_b\}$, with $B^\star_C$ being the number of bootstrap samples for which $C^b \neq \emptyset$, and $n_{S_k}$, $n_{C^b_k}$, $n_{S^b_l}$, and $n_{D^b_l}$ with $k \in \{1, \dots, K\}$, $l \in \{1, \dots, L_b\}$ being the numbers of objects in the cluster groups $S_k$, $C^b_k$, $S^b_l$, and $D^b_l$, respectively.
In 2002, Ben-Hur et al. [14] presented stability-based methods which can be used to define the optimal number of clusters. In their work, the basis for the calculation of the stability indices is a binary matrix $P^{C^b}$, which represents the cluster partition $C^b$ in the following way:

$$P^{C^b}_{ij} = \begin{cases} 1, & \text{if objects } x^b_i, x^b_j \in X^b \text{ are in the same cluster and } i \neq j,\\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
With $P^{D^b}$ defined analogously, the dot product of the two cluster partitions $C^b$ and $D^b$ is defined as $D(P^{C^b}, P^{D^b}) = \sum_{i,j} P^{C^b}_{ij} P^{D^b}_{ij}$. This leads to a Jaccard coefficient based index of the two cluster partitions $C^b$ and $D^b$:

$$Stab_{\mathrm{J}}(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{D(P^{C^b}, P^{C^b}) + D(P^{D^b}, P^{D^b}) - D(P^{C^b}, P^{D^b})}. \qquad (2)$$
Hennig proposed a so-called local stability measure for every cluster group in a cluster partition, likewise based on the Jaccard coefficient [15]. To obtain one stability value $Stab_{\mathrm{J;cw}}$ for the whole partition, the weighted mean of the cluster-wise values with respect to the size of the cluster groups is determined. Another stability-based index presented by Ben-Hur et al., based on the simple matching coefficient, is called the Rand index [16] and is defined as

$$Stab_{\mathrm{R}}(P^{C^b}, P^{D^b}) = 1 - \frac{1}{n^2}\, \bigl\| P^{C^b} - P^{D^b} \bigr\|^2. \qquad (3)$$
Additionally, they present a stability index based on a similarity measure originally proposed by Fowlkes and Mallows [17]:

$$Stab_{\mathrm{FM}}(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{\sqrt{D(P^{C^b}, P^{C^b})\, D(P^{D^b}, P^{D^b})}}. \qquad (4)$$
For the determination of the number of clusters, Ben-Hur et al. proposed analyzing the distribution of index values calculated between pairs of clustered sub-samples, where high pairwise similarities indicate a stable partition. The authors' suggested aim is to examine the transition from a stable to an unstable clustering state. In the simulation study, this qualitative criterion was numerically approximated by the differences in the areas under these curves. Furthermore, von Luxburg [18] published an approach to obtain the cluster partition stability based on the minimal matching distance, where the minimum is taken over all permutations of the $K$ cluster labels. The distances are then simply averaged to obtain $Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$ and, respectively, $Stab_{\mathrm{L}}(P^{C^b}, P^{D^b}) = 1 - Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$.
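To make the definitions concrete, here is a minimal R sketch of ours (independent of the clustMixType package; all names are illustrative) computing the co-membership matrix of (1) and the indices (2)-(4) for two label vectors on the common points $X^b$:

# co-membership matrix of eq. (1): P_ij = 1 iff i and j share a cluster, i != j
co_membership <- function(labels) {
  P <- outer(labels, labels, "==") * 1
  diag(P) <- 0
  P
}
dot_prod <- function(P, Q) sum(P * Q)   # D(P, Q) = sum_ij P_ij * Q_ij

stab_jaccard <- function(P, Q)          # eq. (2)
  dot_prod(P, Q) / (dot_prod(P, P) + dot_prod(Q, Q) - dot_prod(P, Q))
stab_rand <- function(P, Q)             # eq. (3)
  1 - sum((P - Q)^2) / nrow(P)^2
stab_fm <- function(P, Q)               # eq. (4)
  dot_prod(P, Q) / sqrt(dot_prod(P, P) * dot_prod(Q, Q))

# example: partition C^b vs. bootstrap partition D^b on five common points
P <- co_membership(c(1, 1, 2, 2, 2))
Q <- co_membership(c(1, 1, 1, 2, 2))
c(J = stab_jaccard(P, Q), R = stab_rand(P, Q), FM = stab_fm(P, Q))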
3 Simulation Study
In order to compare the stability indices with each other and, afterwards, with the internal validation indices, a simulation study was conducted. In the following, the setup and execution of this simulation study are briefly presented, starting with the data generation; subsequently, the results are evaluated.
The simulation study is based on artificial data, which are generated for different scenarios. In Table 1, the features that define the data scenarios and their corresponding parameter values are listed. Since a full factorial design is used, there are 120 different data settings in the conducted simulation study (there is no data scenario with two variables and eight cluster groups; moreover, with two variables only the 0.5 ratio between factor and numerical variables is possible). The selection of the considered features follows the characteristics of the simulation study in [19] and was extended with respect to the ratio of the variable types as in [20].
Table 1 Features and the associated feature specifications used to generate the data scenarios.

data parameter                                        feature specification  short
number of clusters                                    2, 4, 8                nC
clusters of equal size (FALSE: randomly drawn sizes)  TRUE, FALSE            symm
number of variables                                   2, 4, 8                nV
ratio of factor to numerical variables                0.25, 0.5, 0.75        fac_prop
overlap between cluster groups                        0, 0.05, 0.1           overlap
The clusters of the 200 observations are defined by the feature settings. Each variable can either be active or inactive. For the numerical variables, active means drawing values from the normal distribution $X_1 \sim N(\mu_1, 1)$, with random $\mu_1 \in \{0, \dots, 20\}$, and inactive means drawing from $X_0 \sim N(\mu_0, 1)$ with $\mu_0 = 2\, q_{1-\frac{v}{2}} - \mu_1$, where $q_\alpha$ is the $\alpha$-quantile of $N(\mu_1, 1)$ and $v \in \{0.05, 0.1\}$. This results in an overlap of $v$ for the two normal distributions. To achieve an overlap of $v = 0$, the inactive variable is drawn from $N(\mu_1 - 10, 1)$. Furthermore, each factor variable has two levels, $l_0$ and $l_1$. For an active variable, the probability of drawing $l_0$ is $v$ and that of $l_1$ is $(1 - v)$; for an inactive variable, the probability of $l_0$ is $(1 - v)$ and that of $l_1$ is $v$.
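A minimal R sketch of ours of this generation scheme for one numerical variable (not the authors' code; the function name is invented):

# draw n active and n inactive values so that the two normals overlap by v
gen_num_var <- function(n, v = 0.05) {
  mu1 <- sample(0:20, 1)
  mu0 <- if (v == 0) mu1 - 10 else 2 * qnorm(1 - v / 2, mean = mu1) - mu1
  list(active = rnorm(n, mu1), inactive = rnorm(n, mu0))
}
str(gen_num_var(100, v = 0.1))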
Below, the structure of the simulation study is outlined. For each of the 120 data scenarios, N = 10 repeated runs were performed; this should mitigate the influence of the random initialization of the 𝑘-prototypes algorithm. For the range of two up to nine cluster groups, the stability indices were determined based on bootstrap samples, as suggested in [21]. In order to rank the performance of the stability-based indices, the internal validation indices were also determined on the same data.
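A schematic R sketch of ours of this loop structure (illustrative only; B and the index choice are placeholders, and co_membership/stab_jaccard are the helper functions sketched earlier):

library(clustMixType)
run_scenario <- function(dat, B = 25, k_range = 2:9) {
  sapply(k_range, function(k) {
    kp <- kproto(dat, k = k, verbose = FALSE)          # partition of the full data
    mean(replicate(B, {
      idx <- sample(nrow(dat), replace = TRUE)         # bootstrap sample Y^b
      kp_b <- kproto(dat[idx, ], k = k, verbose = FALSE)
      common <- unique(idx)                            # common points X^b
      P <- co_membership(kp$cluster[common])
      Q <- co_membership(predict(kp_b, dat[common, ])$cluster)
      stab_jaccard(P, Q)                               # or stab_rand / stab_fm
    }))
  })
}
# index-optimal number of clusters for one run: which.max(run_scenario(dat)) + 1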
Fig. 1 The evaluations of the four stability-based cluster indices are presented. There are ten
repetitions of rating the data situation for 𝑘 clusters in the range of two to nine and the index-
optimal number of clusters is highlighted. The parameters of the underlying data structure are nV
= 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. The number of clusters nC in the
data structure varies row-wise.
Figure 1 shows exemplary results of the simulation study for three different data
scenarios over the 10 repetitions. Each row of the figure shows a different data
scenario and each column shows one of the four stability-based indices. The first row
is related to a data scenario with two clusters (marked by a vertical green line). Each
plot shows the examined number of clusters and the determined index value for the 10
repetitions. The maximum index value for each repetition is highlighted with a larger
dot and marks the index-optimal number of clusters of this repetition. It can be seen
that all of the four different indices detected the two clusters in the underlying data
structure. Rows two and three show the evaluations of data with cluster partitions
of four and eight clusters, respectively. It can be seen that the generated number of
clusters is not always rated as index optimal (for example, with four clusters, two or
three clusters were often also evaluated as optimal). Since the results shown here are representative of all scenarios, the four cluster indices and their interpretation were examined in more detail.
In the left part of Figure 2, different transformations of the index values are pre-
sented. Besides the standard index values (green line), the numerical approximation
of the approach of Ben-Hur et al. mentioned above is also shown (red line). For
the Jaccard-based evaluation, the proposed cluster-wise stability determination by
Hennig is presented in orange. Additionally, we propose an adjustment of the index values (hereinafter referred to as new adjust), similar to [22], that takes into account not only the magnitude of the index but also the local slope: the index value scaled with the geometric mean of the changes relative to the neighboring values is presented in dark green.
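One possible reading of this adjustment as a small R sketch (our interpretation of the description, not the authors' code; the treatment of local minima and of the endpoints is an assumption):

new_adjust <- function(idx) {
  n <- length(idx)
  stopifnot(n >= 3)
  adj <- rep(NA_real_, n)                 # endpoints have only one neighbor
  for (i in 2:(n - 1)) {
    d <- c(idx[i] - idx[i - 1], idx[i] - idx[i + 1])  # changes to both neighbors
    adj[i] <- idx[i] * if (all(d > 0)) sqrt(prod(d)) else 0
  }
  adj
}
new_adjust(c(0.60, 0.62, 0.70, 0.61, 0.59))  # the local peak at position 3 stands out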
Fig. 2 Left: Example of the variations of the index values at an iteration of the data scenario with
the parameters nC = 4, nV = 8, fac_prop = 0.5, overlap = 0.1 and symm = FALSE. Right:
Proportion of correct determinations, partitioned according to the different number of clusters in
the underlying data structure.
Again, for each variation of the indices, the index-optimal value is highlighted. The numerically determined index values according to the approach of Ben-Hur et al. yield no benefit; it can thus be concluded that this quantification is not appropriate for the purpose and that further research is required. The cluster-wise stability determination of the Jaccard index also does not seem to improve the determination of the number of clusters to a large extent. In the example in Figure 2, the local slope for four evaluated cluster groups is clearly strengthened by the new adjustment, which leads to a determination of four cluster groups (the generated number of clusters). Since only one iteration of one data scenario is shown on the left, the numbers of correctly determined cluster counts with respect to the generated number of clusters are shown on the right-hand side of Figure 2. These counts for two, four and eight clusters in the underlying data structure point out the improvement achieved by the proposed adjustment of the index values. Especially for more than two clusters, the rate of correctly determined numbers of clusters can be increased.
Finally, the internal validation indices were comparatively examined. For analyzing the outcome of the simulation study, the determined index-optimal numbers of clusters are shown in Table 2. While the comparison for two clusters in the underlying data shows a slight advantage for the stability-based indices, for eight clusters in particular the internal validation indices are preferable. To gain a better understanding of the mean success rate of determining the correct number of clusters for each data scenario, Figure 3 further shows the results of a linear regression on the various data parameters. It can be seen that in most cases there is not too much difference between the considered methods. The stability-based indices do a better job of determining the number of clusters for data with equally large cluster groups. Obviously, a larger number of variables leads to a better determination of the number of clusters. The largest variation in the influence on the proportion of correct determinations can be seen for the parameter number of clusters: the more cluster groups there are in the underlying data structure, the worse the determination becomes (especially for the stability-based indices and the indices Ptbiserial and Tau).
Table 2 Determined number of clusters for all data scenarios with nC ∈ {2, 4, 8}, summarized by the stability-based as well as internal validation indices and the evaluated number of clusters.

                        nC = 2            |            nC = 4            |             nC = 8
clusters     2   3  4  5  6  7  8   9 |   2   3   4  5  6  7  8   9 |   2   3   4   5   6   7    8    9
Jnewadj    403  17  0  0  0  0  0   0 |  47  74 298  1  0  0  0   0 |  90  70  27  16  16  26  104   11
Rnewadj    391  18  5  0  1  1  1   3 |  56  99 258  3  2  0  0   2 |  38  68  22  17  16  32  133   34
FMnewadj   402  17  1  0  0  0  0   0 |  50  80 289  1  0  0  0   0 |  88  71  26  16  15  26  106   12
Lnewadj    394  21  5  0  0  0  0   0 |  53  83 282  2  0  0  0   0 | 100  97  31  20  16  16   76    4
CIndex     313  13  2  2  1  4 18  67 |   7  27 344 13  3  2  5  19 |   2   0   2   4  22  28  211   91
Dunn       386  24  4  2  0  1  1   2 |  39  56 307  8  7  3  0   0 |  19   9  17   7  37  53  190   28
Gamma      343   9  1  0  1  2 14  50 |   9  16 356 15  3  1  5  15 |   2   1   4   4  16  16  198  119
GPlus      319   8  1  0  0  0  9  83 |   6  10 319 12  5  2 15  51 |   2   1   1   4  14  12  175  151
McClain     71   3  1  1  5 12 57 270 |   0   0  17  4  4 13 87 295 |   0   0   0   0   0   9   34  317
Ptbiserial 400  11  6  0  3  0  0   0 |  72 120 225  3  0  0  0   0 |  31  62  79  65  55  39   26    3
Silhouette 388   3  1  4  4  5  8   7 |  14  37 348  7  0  0  8   6 |   6   0   3   1  12  46  220   72
Tau        391  16  9  0  4  0  0   0 |  68 144 205  3  0  0  0   0 |  33  82 119  68  40  14    3    1
Fig. 3 Linear regression coefficients for the parameters of the five data set features; coefficients whose confidence intervals contain 0 are displayed transparently.
4 Conclusion
The aim of this study was to investigate the determination of the optimal number of clusters based on stability indices. Several variations of analysis methods for stability-based index values were presented and comparatively analyzed in a simulation study. The proposed adjustment of the index values, with respect not only to their magnitude but also to the local slope, was able to improve the standard stability indices, especially for a smaller number of clusters. The simulation study did not show any general superiority of stability-based approaches over internal validation indices.
In the future, the various methods of analyzing the stability-based index values
should be examined in more detail, e.g., taking into account the Adjusted Rand
Index. For this purpose, further research may address the characteristics of the
evaluated curves more precisely, or further extend the approach of Ben-Hur et al. as
a quantitative determination method, which has not been done yet.
References
1. Ahmad, A., Khan, S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE
Access, 31883–31902 (2019)
2. Huang, Z.: Extension to the k-Means algorithm for clustering large data sets with categorical
values. Data Min. Knowl. Discov. 2(6), 283–304 (1998)
3. Szepannek, G.: clustMixType: User-friendly clustering of mixed-type data in R. The R J.
10(2), 200–208 (2018)
4. Szepannek, G., Aschenbruck, R.: clustMixType: k-prototypes clustering for mixed variable-
type data. R package version 0.2-15 (2021)
https://CRAN.R-project.org/package=clustMixType
5. Thorndike, R. L.: Who belongs in the family. Psychometrika 18(4), 267–276 (1953)
6. Hennig, C.: Clustering strategy and method selection. In: Hennig, C., Meila, M. , Murtagh, F.,
Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 703–730. Chapman and Hall/CRC, New
York (2015)
7. Halkidi, M., Vazirgiannis, M., Hennig, C.: Method-independent indices for cluster validation and estimating the number of clusters. In: Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 595–618. Chapman and Hall/CRC, New York (2015)
8. Desgraupes, B.: clusterCrit: clustering indices. R package version 1.2.8 (2018)
https://CRAN.R-project.org/package=clusterCrit
9. Aschenbruck, R., Szepannek, G.: Cluster validation for mixed-type data. Arch. Data Sci., Ser.
A 6(1), 1–12 (2020)
10. Lange, T., Roth, V., Braun, M. L., Buhmann, J. M.: Stability-based validation of clustering
solutions. Neural. Comput. 16(6), 1299–1323 (2004)
11. Dolnicar, S., Leisch, F.: Evaluation of structure and reproducibility of cluster solutions using
bootstrap. Mark. Lett. 21, 83–101 (2010)
12. Fang, Y., Wang, J.: Selection of the number of clusters via the bootstrap method. Comput.
Stat. Data Anal. 56(3), 468–477 (2012)
13. Mucha, H.-J., Bartel, H.-G.: Validation of k-means clustering: why is bootstrapping better
than subsampling. Arch. Data Sci., Ser. A 2(1), 1–14 (2017)
14. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in
clustered data. In: Pac. Symp. Biocomput. 2002, 6–17 (2001)
15. Hennig, C.: Cluster-wise assessment of cluster stability. Comput. Stat. Data Anal. 52(1),
258–271 (2007)
16. Rand, W. M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc.
66(336) 846–850 (1971)
17. Fowlkes, E. B., Mallows, C. L.: A method for comparing two hierarchical clusterings. J. Am.
Stat. Assoc. 78(383) 553–569 (1983)
18. von Luxburg, U.: Clustering stability: an overview. Found. Trends® Mach. Learn. 2(3), 235–
274 (2010)
19. Dangl, R., Leisch, F.: Effects of resampling in determining the number of clusters in a data
set. J. Classif. 37(3), 558–583 (2020)
20. Jimeno, J., Roy, M., Tortora, C.: Clustering mixed-type data: a benchmark study on KAMILA
and k-prototypes. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A.,
Nugent, R. (eds.) Data Analysis and Rationality in a Complex World, 83–91, Springer Inter-
national Publishing, Cham (2021)
21. Leisch, F.: Resampling methods for exploring cluster stability. In: Hennig, C., Meila, M.,
Murtagh, F., Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 637–652. Chapman and
Hall/CRC, New York (2015)
22. Ilies, J., Wilhelm, A. F. X.: Projection-based partitioning for large, high-dimensional datasets.
J. Comp. Graph. Stat. 19(2), 474–492 (2010)
A Review on Official Survey Item Classification
for Mixed-Mode Effects Adjustment
Abstract The COVID-19 pandemic has had a direct impact on the development, production, and dissemination of official statistics. This situation led National Statistics Institutes (NSIs) to make methodological and practical choices for survey collection without the need for the direct contact of interviewing staff (i.e., remote survey data collection). Mixing telephone interviews (CATI) and computer-assisted web interviewing (CAWI) without the direct contact of interviewing staff constitutes a new way of data collection at the time of the COVID-19 crisis. This paper presents a literature review to summarize the role of statistical classification and design weights in controlling coverage errors and non-response bias in mixed-mode questionnaire design. We identified 289 research articles with a computerized search over two databases, Scopus and Web of Science. It was found that, although employing mixed-mode surveys could be considered a substitute for traditional face-to-face interviews (CAPI), proper statistical classification of survey items and responders is important to control the nonresponse rates and the coverage error risk.
Afshin Ashofteh ( )
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de
Infomação) and NOVA Information Management School (NOVA IMS) and MagIC, Universidade
Nova de Lisboa, Lisboa, Portugal, e-mail: [email protected]
Pedro Campos
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de
Infomação) and Faculty of Economics, Universidade do Porto, and LIAAD INESC TEC, Portugal,
e-mail: [email protected]
1 Introduction
solving the problem. Therefore, we could expect two approaches. First, we could ignore classification, simply because we consider the groups to be homogeneous, so that weighting alone can be recommended to adjust for the COVID-19 pandemic situation and non-observation errors. Second, the groups of responders are different and we need categorical variables. In this case, the non-observation errors of CATI and CAWI cannot be covered by changing only the weights, and we have to recommend CAPI to collect categorical information and to apply both clustering and weighting together to achieve reasonable coverage by mixed modes.
This study conducts a systematic literature review on this topic, guided by the following question: what is the best methodology or modified estimation strategy to mitigate the mode-effect problems based on design weighting and classification? To answer this question, we performed a systematic review analysis limited to the following databases: Web of Science, Scopus, and working papers from NSIs. We only considered papers written in English. This article is organized as follows: Section 2 presents the research methodology, covering keyword identification, search databases, and bibliometric analysis. In Section 3, we present the results, identifying the PRISMA flow diagram, the characteristics of the articles, and the author co-authorship analysis, as well as the keyword occurrence over the years. In Section 4, we discuss the content analysis. Section 5 presents the main conclusions and, finally, in Section 6, the main research gaps and future works are outlined.
2 Methods
To accomplish the research, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology was adopted. The algorithm of the paper selection from the databases (Scopus and WOS) was based on a screening that started with the search keywords ((mixed-mode* OR "Mode effect*") AND (weighting OR weight* OR classification) AND ("Measurement error*" OR "Non-response bias" OR "Data quality" OR "response rate*") AND (capi OR "Computer Assisted Personal Interview*" OR cawi OR "Assisted Web Interview*" OR cati OR "Computer Assisted Telephone Interview*" OR "web survey*" OR "mail survey*" OR "telephone survey*")), after which the result was filtered by "official statistics". The results of the two databases were merged, and duplicates were removed. For the bibliometric analysis, the Mendeley open-source tool was used to extract metadata and eliminate duplicates. For the network analysis, the VOSviewer open-source tool was applied to visualize the information extracted from the data set and to obtain the quantitative and qualitative outcomes. After assessing eligibility, books and review papers were omitted from the results and the relevant articles were picked up from the databases. The final dataset was selected according to the visual abstract in Figure 2, which shows detailed information about this systematic literature review.
Fig. 2 Density visualization analysis of the 22 leader authors who have at least 3 papers.
3 Results
The 28 leader authors who had at least 4 papers are presented in Figure 2. The author occurrence analysis was performed by applying the VOSviewer research tool for network analysis. The top three leader authors were Mick P. Couper with 14 articles, Barry Schouten with 14 articles, and Roger Tourangeau with 11 articles. With the help of VOSviewer, a keyword analysis was also accomplished. We analyzed the co-occurrence of author keywords with the full counting method. In the first step, we set the minimum occurrence of a keyword to one, which resulted in 711 keywords. We could see the application of keywords over the years (Figure 3). Some of the keywords were not exactly the same, but their use and meaning were the same. We decided to match similar words to make the output clearer. Choosing the full counting method resulted in a total of 592 authors meeting the threshold.
4 Content Analysis
The studies emphasize the dramatic change in mixed-mode strategies in the last decades, based on design-based and model-assisted survey sampling, time series methods, and small area estimation [6], and a high expectation of further changes, especially after the substantial experience gained by NSIs trying new modes after the COVID-19 pandemic [7].
The problem concerns mixed-mode effects and calibration; briefly, we could follow several approaches, such as design weighting to find sampling weights, non-response weighting adjustment, and calibration. The design weight of a unit may be interpreted as the number of population units represented by a specific sample unit. Most surveys, if not all, suffer from item or unit nonresponse. Auxiliary information can be used to improve the quality of design-weighted estimates. An auxiliary variable must have at least two characteristics to be considered in calibration: (i) it must be available for all sample units; and (ii) its population total must be known.
The categorical variables from the demographic information of nonrespondents
such as education level, age, income, location, language, and marital status could
help the survey methodologists to categorize the target population and recognize
the best sequence of the modes [8]. Van Berkel et al. [9] considered nine strata in their classification tree, using age, ethnicity, urbanization, and income as explanatory variables. A re-interview design combined with the inverse regression estimator (IREG) is among the best approaches to reduce measurement bias by using related auxiliary information [10].
The focus of this approach is on the weights of estimators rather than on the bias from the measurements. For an estimator, we could consider $y_{i,m}$ the measurement obtained from unit $i$ through mode $m$. The $y_{i,m}$ consists of $u_i$, the true value for respondent $i$, an additive mode-dependent measurement bias $b_m$, and a mode-dependent measurement error $\varepsilon_{i,m}$ with an expected value equal to zero. Equation (4) shows the measurement error model:
$$E\bigl(\hat t_y\bigr) = E\left(\sum_{i=1}^{n} \omega_i\, y_{i,m}\right) = \sum_{i=1}^{n} \omega_i\, u_i + b_m \sum_{i=1}^{n} \omega_i\, \delta_{i,m} + \sum_{i=1}^{n} \omega_i\, \delta_{i,m}\, E(\varepsilon_{i,m}) \qquad (4)$$
stating that the expected total of the survey estimate for $Y$ consists of the estimated true total of $U$ plus the true total of $b_m$ from the data collected through mode $m$. Since $b_m$ is an unobserved mode-dependent measurement bias, the term $\sum_{i=1}^{n} \omega_i\, \delta_{i,m}\, b_m$ in Equation (4) indicates the existence of an unknown mode-dependent bias for the estimation of $t_y$. According to Equation (4), there is an unknown measurement bias in sequential mixed-mode designs that might be adjusted by different estimators. Data obtained in a re-interview design can be used in the inverse regression estimator:
$$\hat y^{\,ireg}_{r,m_m} = \frac{1}{\hat N_{m_1} + \hat N_{m_2}} \left( \sum_{i=1}^{n_{m_b}} d_i\, y_i^{m_b} + \sum_{i=1}^{n_{m_j}} d_i \left( \hat y^{\,m_b}_{re} - \frac{1}{\hat\beta_1} \bigl( \hat y^{\,m_j}_{re} - y^{m_j}_i \bigr) \right) \right), \quad j = 1, 2;\; b \neq j \qquad (7)$$
where $\hat y^{\,m_j}_{re}$ and $\hat y^{\,m_b}_{re}$ are the respondent means of the focal and benchmark mode outcomes in the re-interview, and $d_i$ denotes the design weight of the sample design. For a detailed presentation and discussion of the methods see Chapter 8.5 in [12].
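As a small numerical illustration of the measurement error model (4) (a sketch of ours, not taken from the reviewed papers; all values are invented), the design-weighted total picks up an additive bias term from the units observed through a biased mode:

set.seed(1)
n <- 1000
u <- rnorm(n, mean = 50, sd = 10)            # true values u_i
omega <- rep(2, n)                           # design weights (constant here)
mode <- sample(c("CAWI", "CATI"), n, TRUE)   # mode through which unit i responds
b <- c(CAWI = 0, CATI = 3)                   # mode-dependent measurement bias b_m
y <- u + b[mode] + rnorm(n, 0, 2)            # y_{i,m} = u_i + b_m + eps_{i,m}
t_hat <- sum(omega * y)                      # survey estimate of the total
c(estimate = t_hat,
  true_total = sum(omega * u),
  expected_bias = sum(omega[mode == "CATI"]) * b[["CATI"]])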
However, for longitudinal studies with different modes at different time points, the effect of time on the respondents makes it difficult to estimate the pure mixed-mode effect, especially for volatile classification variables such as the address of immigrants. A solution could be to conduct the survey on parallel or separate samples to evaluate the time effect and the mode effect separately.
In practice, Statistics Portugal has been using the available information of a sampling frame that is part of the FNA (the national dwellings register database) at the time of COVID-19. Telephone numbers were linked to a sample drawn from a population register in the FNA for the samples of CATI rotation-scheme surveys such as the Labour Force Survey. In 2020, the Labour Force Survey (LFS) in Portugal, a mandatory survey for the member states within the EU, was adjusted for undercoverage of the percentage of households with a listed landline telephone. As a result, the comparison of these surveys before and after COVID-19 shows the usefulness of the discussed methodologies. In 2021, the successful CAWI-mode census by Statistics Portugal showed that respondents tend to favor the web-based questionnaire, avoiding the risk of COVID-19 infection associated with a face-to-face interview. This shows the potential change in the mode preference of responders.
5 Conclusions
The COVID-19 crisis led to new solutions for item classification for mixed-mode effects adjustment, such as applying mode calibration to population subgroups defined by categorical variables such as gender, region, and age group. Studies suggest a sequential mixed-mode design starting with CAWI, as the cheapest mode, supported by an initial postal or telephone contact and a possible cash incentive; with a lag, the non-respondents are followed up and given a choice between CAPI and CATI according to their specific classification group and demographic information, such as education level, age, income, location, language, and marital status. This is fruitful for reducing the cost and increasing the accuracy simultaneously.
This study showed that sample frames might need updates for the necessary categorical information, which is often based on choices made several years ago. Additionally, more research seems necessary on ethics concerns, privacy regulations, and standards for using categorical variables and classification information in social mixed-mode surveys and official statistics.
References
1. Ashofteh, A., Bravo, J. M.: A study on the quality of novel coronavirus (COVID-19) official
datasets. Stat. J. IAOS, 36(2), 291–301, (2020) doi: 10.3233/SJI-200674
2. Ashofteh, A., Bravo, J. M.: Data science training for official statistics: A new scientific
paradigm of information and knowledge development in national statistical systems. Stat. J.
IAOS, 37(3), 771–789, (2021) doi: 10.3233/SJI-200674
3. Te Braak, P., Minnen, J., Glorieux, I.: The representativeness of online time use surveys.
Effects of individual time use patterns and survey design on the timing of survey dropout. J.
Off. Stat., 36(4), 887–906, (2020)
4. Szymkowiak, M., Wilak, K.: Repeated weighting in mixed-mode censuses. Econ. Bus. Rev.,
7(1), 26–46, (2021)
5. Zax, M., Takahashi, S.: Cultural influences on response style: comparisons of Japanese and
American college students. J. Soc. Psychol., 71(1), 3–10, (1967)
6. Pfeffermann, D.: New important developments in small area estimation. Stat. Sci., 28(1),
40–68, (2013)
7. Toepoel, V., de Leeuw, E., Hox, J.: Single- and Mixed-Mode Survey Data Collection. SAGE
Res. Methods Found, (2020) doi: 10.4135/9781526421036876933
8. Kim, S., Couper, M. P.: Feasibility and quality of a national RDD smartphone web survey:
comparison with a cell phone CATI survey. Soc. Sci. Comput. Rev., 39(6), 1218–1236, (2021)
9. Van Berkel, K., Van Der Doef, S., Schouten, B.: Implementing adaptive survey design with an
application to the Dutch health survey. J. Off. Stat., 36(3), 609–629, (2020) doi: 10.2478/jos-
2020-0031
10. Klausch, T., Schouten, B., Buelens, B., van den Brakel, J.: Adjusting measurement bias in
sequential mixed-mode surveys using re-interview data. J. Surv. Stat. Methodol., 5(4), 409–
432, (2017) doi: 10.1093/jssam/smx022
11. Särndal, C. E., Lundström, S.: Estimation in Surveys with Nonresponse. John Wiley (2005) doi: 10.1002/0470011351
12. Schouten, B., Brakel, J. van den, Buelens, B., Giesen, D., Luiten, A., Meertens, V.: Mixed-
Mode Official Surveys. Chapman and Hall/CRC (2021) doi: 10.1201/9780429461156
Clustering and Blockmodeling Temporal
Networks – Two Indirect Approaches
Vladimir Batagelj ( )
IMFM, Jadranska 19, 1000 Ljubljana, Slovenia & IAM UP, Muzejski trg 2, 6000 Koper, Slovenia & HSE, 11 Pokrovsky Bulvar, 101000 Moscow, Russian Federation, e-mail: [email protected]

1 Temporal Networks
3 BM of Temporal Networks
For an early attempt at temporal network BM, see [2, 5]. To the traditional BM scheme we add the time dimension. We assume that the network is described using temporal quantities (TQs) [2] for the activity/presence of nodes and links, and for some node properties and link weights. The BM partition $\pi$ is then also described, for each node $v$, by a temporal quantity $\pi(v, t)$: $\pi(v, t) = i$ means that at time $t$ node $v$ belongs to cluster $i$. The structure and activity of the clusters $C_i(t) = \{v : \pi(v, t) = i\}$ can change through time, but they preserve their identity.
For the BM $\mu$, the clusters are mapped into BM nodes, $\mu : C_i \to [i]$. To determine the BM we still have to specify how the links from $C_i$ to $C_j$ are represented in the BM; in general, for the model arc $([i], [j])$, we have to specify two TQs: its weight $a_{ij}$ and, in the case of generalized BM, its type $\tau_{ij}$. The weight can be an object of a different type than the weights of the block links in the original temporal network. We assume that in a temporal network $\mathcal{N} = (\mathcal{V}, \mathcal{L}, \mathcal{T}, \mathcal{P}, \mathcal{W})$ the link weights are described by a TQ $w \in \mathcal{W}$. In the following years we intend to develop BM methods case by case:

1. constant partition – nodes stay in the same cluster all the time:
   a. indirect approach based on clustering of TQs: $p(v) = \sum_{u \in N(v)} w(v, u)$, hierarchical clustering and leaders;
   b. indirect approach by conversion to the clustering with relational constraints (CRC);
   c. direct approach by (local) optimization of the criterion function $P$ over $\Phi$;
2. dynamic partition – nodes can move between clusters through time. The details are still to be elaborated.

In this paper, we present approaches for cases 1.a and 1.b.
In the literature there exist other approaches to BM of temporal networks. A
recent overview is available in the book [12].
In [8] we adapted the traditional leaders [13, 10] and agglomerative hierarchical [14, 1] clustering methods for the clustering of modal-valued symbolic data. They can be applied almost directly for clustering units described by variables whose values are temporal quantities.

For a unit $X_i$, each variable $V_j$ is described by a size $h_{ij}$ and a temporal quantity $\mathbf{x}_{ij}$, i.e. $X_{ij} = (h_{ij}, \mathbf{x}_{ij})$. In our algorithms we use normalized values of the temporal variables, $V' = (h, \mathbf{p})$, where

$$\mathbf{p} = [(s_r, f_r, p_r) : r = 1, 2, \dots, k] \quad \text{and} \quad p_r = \frac{v_r}{h}.$$

In the case when $h = \mathrm{total}(\mathbf{x})$, the normalized TQ $\mathbf{p}$ is essentially a probability distribution.
Both methods create cluster representatives that are represented in the same way.
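A minimal R sketch of ours of the normalization above (not taken from the Clamix or TQ code; we assume unit-length time intervals, so that total(x) equals the sum of the values v_r):

normalize_tq <- function(tq) {   # tq: data frame with columns s (start), f (finish), v (value)
  h <- sum(tq$v)                 # size h = total(x) under the unit-interval assumption
  cbind(tq, p = tq$v / h)        # p_r = v_r / h, a probability distribution over time
}
normalize_tq(data.frame(s = 1:3, f = 2:4, v = c(2, 5, 3)))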
To use the CRC in the construction of a node partition we have to define a dissimilarity measure $d(u, v)$ (or a similarity $s(u, v)$) between nodes. An obvious solution is $s(u, v) = f(w(u, v))$, for example:

1. Total activity: $s_1(u, v) = \mathrm{total}(w(u, v))$
2. Average activity: $s_2(u, v) = \mathrm{average}(w(u, v))$
3. Maximal activity: $s_3(u, v) = \max(w(u, v))$

We can transform a similarity $s(u, v)$ into a dissimilarity by $d(u, v) = \frac{1}{s(u, v)}$ or $d(u, v) = S - s(u, v)$, where $S > \max_{u,v} s(u, v)$. In this way, we transform the temporal network partitioning problem into a clustering with relational constraints problem [6, 360–369]. It can be solved efficiently also for large sparse networks.
Having the partition $\pi$, to produce a BM we have to specify the values on its links. There are different options for the model link weights $a(([i], [j]))$; a small computational sketch follows the list:

1. Temporal quantities: $a(([i], [j])) = \mathrm{activity}(C_i, C_j) = \sum_{u \in C_i, v \in C_j} w(u, v)$ for $i \neq j$, and $a(([i], [i])) = \frac{1}{2}\, \mathrm{activity}(C_i, C_i)$.
2. Total intensities: $a_t(([i], [j])) = \mathrm{total}(a(([i], [j])))$.
3. Geometric average intensities: $a_g(([i], [j])) = \dfrac{a_t(([i], [j]))}{\sqrt{|C_i| \cdot |C_j|}}$.
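A minimal R sketch of ours of these aggregations (names are illustrative, not from the TQ library; we assume an undirected network with each edge stored once, so each within-cluster edge enters $a(([i],[i]))$ once, matching the $\frac{1}{2}$ correction of the double-counted sum):

# edges: data frame with columns u, v, day, w; part[v] gives the cluster of node v
block_weights <- function(edges, part, days = max(edges$day)) {
  g <- max(part)
  a <- array(0, dim = c(g, g, days))   # a[i, j, t]: activity between C_i and C_j on day t
  for (r in seq_len(nrow(edges))) {
    i <- part[edges$u[r]]; j <- part[edges$v[r]]; t <- edges$day[r]
    a[i, j, t] <- a[i, j, t] + edges$w[r]
    if (i != j) a[j, i, t] <- a[j, i, t] + edges$w[r]   # keep the model symmetric
  }
  a
}
total_intensity <- function(a) apply(a, c(1, 2), sum)                      # a_t
geo_avg_intensity <- function(a_t, sizes) a_t / sqrt(outer(sizes, sizes))  # a_g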
The Reuters Terror News network was obtained from the CRA (Centering Resonance Analysis) networks produced by Steve Corman and Kevin Dooley at Arizona State University. The network is based on all the stories released during 66 consecutive days by the news agency Reuters concerning the September 11 attack on the U.S., beginning at 9:00 AM EST on 9/11/01. The nodes, $n = 13332$, of this network are important words (terms). For a given day, there is an edge between two words iff they appear in the same utterance (for details see the paper [9]). The network has $m = 243447$ edges. The weight of an edge is its daily frequency. There are no loops in the network. The Terror News network is undirected, so its BM will be undirected as well.

The Reuters Terror News network was used as a case network for the Viszards visualization session at the XXII International Sunbelt Social Network Conference, New Orleans, USA, 13-17 February 2002. It is available at http://vlado.fmf.uni-lj.si/pub/networks/data/CRA/terror.htm .
We transformed the Pajek version of the network into the NetsJSON format used in the libraries TQ and Nets. As a temporal description of each node/word for clustering we took its activity (the sum of all TQs on the edges adjacent to a given node $v$):

$$\mathrm{act}(v) = \sum_{u \in N(v)} w(v : u).$$
Our leaders’ and hierarchical clustering methods are compatible – they are based
on the same clustering error criterion function. Usually, the leaders’ method is used
to reduce a large clustering problem to up to some hundred units. With hierarchical
clustering of the leaders of the obtained clusters, we afterward determine the "right"
number of clusters and their representatives.
To cluster all 13332 words (nodes) in Terror News we used the adapted leaders’
method searching for 100 clusters. We continued with the hierarchical clustering of
the obtained 100 leaders. The result is presented in the dendrogram in Figure 2.
[Figure 2: dendrogram of the hierarchical clustering of the 100 leaders; of the graphic, only word labels such as country, pres_bush, bin_laden, world_trade_ctr, white_house, attack, war, united_states, and musharraf survive extraction.]
To get an insight into the content of a selected cluster we draw the corresponding
word cloud based on the cluster’s leader. In Figure 3 the word clouds for clusters
𝐶58 and 𝐶81 (|𝐶58| = 1396, |𝐶81| = 2226 ) are presented.
We can also compare the activities of pairs of clusters by considering the overlap
of p-components (probability distributions) of their leaders. In Figure 4, we com-
pare cluster 𝐶58 with cluster 𝐶81, and cluster 𝐿96 with cluster 𝐶66. In the right
diagram some values are outside the display area: 𝐿96[15] = 0.3524, 𝐶66[4] =
0.1961, 𝐶66[5] = 0.2917.
[Figure 4: two panels of daily activity distributions over the 66 days, left panel C58:C81, right panel L96:C66.]
Fig. 4 Comparing activities of clusters (blue – first cluster, red – second cluster, violet – overlap).
 i  cluster      1      2      3      4      5
 1  C94      23.85  12.23   2.26   1.57   1.42
 2  C88              3.58   0.33   0.22   0.19
 3  C95                     0.56   0.07   0.07
 4  L43                            0.38   0.08
 5  L74                                   0.39
    size      6018   5109    954    535    716
A more detailed BM is presented by the activities ($p$-components) image matrix in Figure 6.

[Figure 6: image matrix of the $p$-components of the link activities between the five clusters C94, C88, C95, L43, and L74 over the 66 days (sep-11 to nov-15); of the graphic, only the axis labels survive extraction.]

Fig. 6 BM represented as $p$-components of temporal activities of links between pairs of clusters.

A more compact representation of a temporal BM is a heatmap display of this matrix in Figure 7. Because of some relatively very large values, it turns out that the display of the matrix with logarithmic values provides much more information.

[Figure 7: heatmap of the model link weights for all pairs of clusters (C94-C94, C94-C88, ..., L74-L74) on a logarithmic scale from 0.00 to about 13.44.]

To the Terror News network we also applied the clustering with relational constraints approach. Because of the limited space available for each paper, we cannot present it here. A description of the analysis with the corresponding code is available at https://github.com/bavla/TQ/wiki/BMRC .
5 Conclusions
The presented research is work in progress. It deals only with the two simplest cases of temporal blockmodeling. We provided some answers to the problem of normalizing the model weight TQs when comparing them, and some ways to present/display temporal BMs.

We used different tools (R, Python, and Pajek) to obtain the results. We intend to provide the software support in a single tool – probably in Julia. We also intend to create a collection of interesting and well-documented temporal networks for testing and demonstrating the developed software.
Acknowledgements The paper contains an elaborated version of ideas presented in my talks at the
XXXX Sunbelt Social Networks Conference (on Zoom), July 13-17, 2020 and at the EUSN 2021
– 5th European Conference on Social Networks, Naples (on Zoom), September 6-10, 2021.
This work is supported in part by the Slovenian Research Agency (research program P1-0294
and research projects J1-9187, J1-2481, and J5-2557), and prepared within the framework of the
HSE University Basic Research Program.
References
1. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, New York (1973)
2. Batagelj, V., Praprotnik, S.: An algebraic approach to temporal network analysis based on
temporal quantities. Soc. Netw. Anal. Min. 6(1), 1–22 (2016)
3. Batagelj, V., Ferligoj, A.: Clustering relational data. In: Gaul, W., Opitz, O., Schader, M. (Eds.)
Data Analysis / Scientific Modeling and Practical Application, pp. 3–15. Springer (2000)
4. Batagelj, V.: Generalized Ward and related clustering problems. In: Bock H.-H. (ed) Classifi-
cation and Related Methods of Data Analysis, pp. 67–74. North-Holland, Amsterdam (1988)
5. Batagelj, V., Ferligoj, A., Doreian, P.: Indirect blockmodeling of 3-way networks. In: Brito,
P., Bertrand, P., Cucumel, G., de Carvalho, F. (eds.) Selected Contributions in Data Analysis
and Classification, pp. 151–159. Springer (2007)
6. Batagelj, V., Doreian, P., Ferligoj, A., Kejžar, N.: Understanding Large Temporal Networks
and Spatial Networks: Exploration, Pattern Searching, Visualization and Network Evolution.
Wiley (2014)
7. Batagelj, V., Kejžar, N.: Clamix – Clustering Symbolic Objects (2010) Program in R
https://r-forge.r-project.org/projects/clamix/
8. Kejžar, N., Korenjak-Černe, S., Batagelj, V.: Clustering of modal-valued symbolic data. Adv. Data Anal. Classif. 15, 513–541 (2021)
9. Corman, S. R., Kuhn, T., McPhee, R. D., Dooley, K. J.: Studying complex discursive systems:
Centering resonance analysis of communication. Hum. Commun. Res. 28(2), 157–206 (2002)
10. Diday, E.: Optimisation en Classification Automatique. Tome 1.,2. INRIA, Rocquencourt (in
French) (1979)
11. Doreian, P., Batagelj, V., Ferligoj, A.: Generalized Blockmodeling. Structural Analysis in the
Social Sciences. Cambridge University Press (2005)
12. Doreian, P., Batagelj, V., Ferligoj, A. (Eds.) Advances in Network Clustering and Blockmod-
eling. Wiley (2020)
13. Hartigan, J. A.: Clustering Algorithms. Wiley-Interscience, New York (1975)
14. Ward, J. H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963)
Latent Block Regression Model

Rafika Boutalbi ( )
Institute for Parallel and Distributed Systems, Analytic Computing, University of Stuttgart, Germany, e-mail: [email protected]
Lazhar Labiod · Mohamed Nadif
Centre Borelli UMR 9010, Université Paris Cité, France, e-mail: [email protected];[email protected]
Abstract When dealing with high-dimensional sparse data, such as in recommender systems, co-clustering turns out to be more beneficial than one-sided clustering, even if one is interested in clustering along one dimension only. Thereby, co-clusterwise regression is a natural extension of clusterwise regression. Unfortunately, none of the existing approaches considers covariates on both dimensions of a data matrix. In this paper, we propose a Latent Block Regression Model (LBRM) overcoming this limit. For inference, we propose an algorithm performing co-clustering and regression simultaneously, where a linear regression model characterizes each block. Placing the estimation of the model parameters under the maximum likelihood approach, we derive a Variational Expectation-Maximization (VEM) algorithm for estimating the model's parameters. The effectiveness of the proposed VEM-LBRM is illustrated through simulated datasets.
1 Introduction
The cluster-wise linear regression algorithm CLR (or Latent Regression Model) is a finite mixture of regressions and one of the most commonly used methods for simultaneous learning and clustering [14, 5]. It aims to find clusters of entities that minimize the overall sum of squared errors of the regressions performed over these clusters. Specifically, $\mathbf{X} = [x_{ij}] \in \mathbb{R}^{n \times v}$ is the covariate matrix and $\mathbf{Y} \in \mathbb{R}^{n \times 1}$ the response vector. The cluster-wise method aims to find $g$ clusters $C_1, \dots, C_g$ and regression coefficients $\boldsymbol{\beta}^{(k)} \in \mathbb{R}^{v \times 1}$ by minimizing the objective function $\sum_{k=1}^{g} \sum_{i \in C_k} \bigl( y_i - \sum_{j=1}^{v} \beta^{(k)}_j x_{ij} + b_k \bigr)^2$, where $b_k$ is a cluster-specific offset.
Various adjustments have been made to this model to improve its performance in terms of clustering and prediction. In our contribution, we propose to embed co-clustering in the model.

Co-clustering is a simultaneous clustering of both dimensions of a data matrix that has proven to be more beneficial than traditional one-sided clustering, especially when dealing with sparse data. When dealing with high-dimensional data, sparse or not, co-clustering turns out to be more valuable than one-sided clustering [1, 13], even if one is interested in clustering along one dimension only. In [4] the authors proposed the SCOAL approach (Simultaneous Co-clustering and Learning model), leading to co-clustering and prediction for binary data; they generalized the model to continuous data. However, this model does not take into account the sparsity of the data, in the sense that it does not lead to homogeneous blocks. The obtained results in terms of Mean Square Error (MSE) are good, but in terms of co-clustering (homogeneity of co-clusters) no analysis has been presented. This model is also related to the soft PDLF (Predictive Discrete Latent Factor) model [2], where the value of the response $y_{ij}$ in each co-cluster is modeled as a sum $\beta^\top x_{ij} + \delta_{k\ell}$, where $\beta$ is a global regression model and $\delta_{k\ell}$ is a co-cluster-specific offset. More recently, in [17] the authors proposed an algorithm taking into account only row covariate information to realize co-clustering and regression simultaneously. To this end, the authors build on the latent block model [8]. In our contribution, we also rely on this model, but consider both row and column covariates.
The proposed Latent Block Regression Model (LBRM) is an extension of finite
mixtures of regression models in which co-clustering is embedded. It allows
us to deal with co-clustering and regression simultaneously while taking covariates
into account. To estimate the parameters, we rely on a Variational Expectation-Maximization
algorithm [7], referred to as VEM-LBRM.
The joint posterior probability of the row and column assignments is assumed to factorize as
$P(z_{ik} = 1 \mid \mathbf{X}) \, P(w_{j\ell} = 1 \mid \mathbf{X})$. From this hypothesis, we then consider the latent
block model, where the two sets $I$ and $J$ are considered as random samples and the
row and column labels become latent variables. Therefore, the parameter of the
latent block model is $\boldsymbol{\Theta} = (\boldsymbol{\pi}, \boldsymbol{\rho}, \boldsymbol{\alpha})$, with $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_g)$ and $\boldsymbol{\rho} = (\rho_1, \ldots, \rho_m)$,
where $\pi_k = P(z_{ik} = 1)$, $k = 1, \ldots, g$, and $\rho_\ell = P(w_{j\ell} = 1)$, $\ell = 1, \ldots, m$, are
the mixing proportions, and $\boldsymbol{\alpha} = (\alpha_{k\ell};\ k = 1, \ldots, g,\ \ell = 1, \ldots, m)$, where $\alpha_{k\ell}$
is the parameter of the distribution of block $k\ell$. Considering that the complete
data are given by the triplet $(\mathbf{X}, \mathbf{z}, \mathbf{w})$, i.e., assuming that the latent variables $\mathbf{z}$ and $\mathbf{w}$
are known, the resulting complete-data log-likelihood of the latent block model,
$L_C(\mathbf{X}, \mathbf{z}, \mathbf{w}, \boldsymbol{\Theta}) = \log f(\mathbf{X}, \mathbf{z}, \mathbf{w}; \boldsymbol{\Theta})$, can be written as follows:
$$\sum_{k=1}^{g} z_{.k} \log \pi_k + \sum_{\ell=1}^{m} w_{.\ell} \log \rho_\ell + \sum_{i=1}^{n} \sum_{j=1}^{d} \sum_{k=1}^{g} \sum_{\ell=1}^{m} z_{ik} w_{j\ell} \log \boldsymbol{\phi}_{k\ell}(x_{ij}; \boldsymbol{\alpha}_{k\ell}),$$
where the $\pi_k$'s and $\rho_\ell$'s denote the proportions of row and column clusters,
respectively; see for instance [8]. Note that the complete-data log-likelihood breaks
into three terms: the first depends on the proportions of row clusters, the second on the
proportions of column clusters, and the third on the pdf of each block or co-cluster.
The objective is then to maximize the function $L_C(\mathbf{z}, \mathbf{w}, \boldsymbol{\Theta})$.
For co-clustering of continuous data, the Gaussian latent block model can be used. For
instance, it is easy to show that the minimization of the well-known criterion
$$\|\mathbf{X} - \mathbf{z}\boldsymbol{\mu}\mathbf{w}^\top\|^2 = \sum_{k=1}^{g} \sum_{\ell=1}^{m} \sum_{i | z_{ik}=1} \sum_{j | w_{j\ell}=1} (x_{ij} - \mu_{k\ell})^2,$$
where $\mathbf{z} \in \{0,1\}^{n\times g}$, $\mathbf{w} \in \{0,1\}^{d\times m}$ and $\boldsymbol{\mu} \in \mathbb{R}^{g\times m}$, is associated with the Gaussian latent block model with
$\boldsymbol{\alpha}_{k\ell} = (\mu_{k\ell}, \sigma^2_{k\ell})$ when the proportions of row clusters and column clusters are equal and,
in addition, the variances of the blocks are identical [9]. Note that 1) a characteristic
of the latent block model is that the rows and the columns are treated symmetrically, and
2) the estimation of the parameters requires a variational approximation [7, 17]. In
the sequel, we show how a regression model can be integrated. Hereafter, we propose a
novel Latent Block Regression model for simultaneous co-clustering and learning.
The model considers the response matrix $\mathbf{Y} = [y_{ij}] \in \mathbb{R}^{n\times d}$ and the covariate tensor
$\mathbf{X} = [1, \mathbf{x}_{ij}] \in \mathbb{R}^{n\times d\times v}$, where $n$ is the number of rows, $d$ the number of columns, and
$v$ the number of covariates. Figure 1 presents the data structure for the proposed
LBRM model.
In the following, we propose integrating a mixture of regressions [5] per block
into the Latent Block Model (LBM), considering the distribution $\Phi(y_{ij}|\mathbf{x}_{ij}; \lambda_{k\ell})$. We
assume in the following the normality of $\Phi$:

$$\Phi(y_{ij}|\mathbf{x}_{ij}; \lambda_{k\ell}) = p(y_{ij}|\mathbf{x}_{ij}, \boldsymbol{\beta}_{k\ell}, \sigma^2_{k\ell}) = (2\pi\sigma^2_{k\ell})^{-0.5} \exp\left\{ -\frac{1}{2\sigma^2_{k\ell}} \big( y_{ij} - \boldsymbol{\beta}^\top_{k\ell}\mathbf{x}_{ij} \big)^2 \right\}.$$
With the LBRM model, the parameter $\boldsymbol{\Omega}$ is composed of the row and column proportions
$\boldsymbol{\pi}$ and $\boldsymbol{\rho}$ respectively, $\boldsymbol{\beta} = \{\boldsymbol{\beta}_{11}, \ldots, \boldsymbol{\beta}_{gm}\}$ with $\boldsymbol{\beta}^\top_{k\ell} = (\beta^{k\ell}_0, \beta^{k\ell}_1, \ldots, \beta^{k\ell}_v)$, where $\beta^{k\ell}_0$
represents the regression intercept, and $\boldsymbol{\sigma} = \{\sigma_{11}, \ldots, \sigma_{gm}\}$. The classification
log-likelihood can be written:
$$\sum_{i,k} z_{ik} \log \pi_k + \sum_{j,\ell} w_{j\ell} \log \rho_\ell - \frac{1}{2} \sum_{k,\ell} z_{.k} w_{.\ell} \log(\sigma^2_{k\ell}) - \sum_{i,j,k,\ell} \frac{z_{ik} w_{j\ell}}{2\sigma^2_{k\ell}} \big( y_{ij} - \boldsymbol{\beta}^\top_{k\ell}\mathbf{x}_{ij} \big)^2,$$

with $z_{.k} = \sum_i z_{ik}$ and $w_{.\ell} = \sum_j w_{j\ell}$.
3 Variational EM Algorithm
The maximization of $F_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$ can be reached by alternating the three following
optimizations: update $\tilde{\mathbf{z}}$ by $\arg\max_{\tilde{\mathbf{z}}} F_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$, update $\tilde{\mathbf{w}}$ by $\arg\max_{\tilde{\mathbf{w}}} F_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$,
and update $\boldsymbol{\Omega}$ by $\arg\max_{\boldsymbol{\Omega}} F_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$. In what follows, we detail the Expectation
(E) and Maximization (M) steps of the Variational EM algorithm for tensor data.
The proposed algorithm for tensor data, referred to as VEM-LBRM, alternates the two
previously described Expectation and Maximization steps. At convergence, a hard
co-clustering is deduced from the posterior probabilities.
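The following numpy sketch shows how one VEM-LBRM iteration could look under the notation above; it is our schematic reading of the E and M steps (softmax-normalized variational posteriors, then weighted least squares per block), not the authors' implementation. Shapes follow the paper: Y is n x d, X is n x d x (v+1) with a leading column of ones, z (n x g) and w (d x m) hold the variational posteriors.

```python
# A schematic one-iteration sketch of VEM-LBRM; all function names are ours.
import numpy as np

def log_phi(Y, X, beta, sigma2):
    # Cell-wise Gaussian log-density under each block (k,l): shape (n, d, g, m)
    mu = np.einsum('ndv,klv->ndkl', X, beta)       # block-wise predictions
    return -0.5 * (np.log(2 * np.pi * sigma2)
                   + (Y[..., None, None] - mu) ** 2 / sigma2)

def vem_step(Y, X, z, w, pi, rho, beta, sigma2):
    lp = log_phi(Y, X, beta, sigma2)               # (n, d, g, m)
    # Variational E-steps: row posteriors given w, then column posteriors given z
    lz = np.log(pi) + np.einsum('dl,ndkl->nk', w, lp)
    z = np.exp(lz - lz.max(1, keepdims=True)); z /= z.sum(1, keepdims=True)
    lw = np.log(rho) + np.einsum('nk,ndkl->dl', z, lp)
    w = np.exp(lw - lw.max(1, keepdims=True)); w /= w.sum(1, keepdims=True)
    # M-step: proportions, then per-block weighted least squares
    pi, rho = z.mean(0), w.mean(0)
    g, m, v1 = beta.shape
    Xf, yf = X.reshape(-1, v1), Y.ravel()
    for k in range(g):
        for l in range(m):
            u = np.outer(z[:, k], w[:, l]).ravel()         # cell weights
            A = Xf.T @ (u[:, None] * Xf) + 1e-8 * np.eye(v1)  # small ridge for stability
            beta[k, l] = np.linalg.solve(A, Xf.T @ (u * yf))
            r = yf - Xf @ beta[k, l]
            sigma2[k, l] = (u * r ** 2).sum() / u.sum()
    return z, w, pi, rho, beta, sigma2
```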
4 Experimental Results
First, we evaluate the proposed VEM-LBRM on three synthetic datasets in terms of co-clustering
and regression. We compare VEM-LBRM with several clustering and regression
methods, namely Global model (a single multiple linear regression
model fitted on all observations), K-means, Clusterwise, Co-clustering
and SCOAL. We retain two widely used measures to assess the quality of clustering,
namely the Normalized Mutual Information (NMI) [16] and the Adjusted Rand Index
(ARI) [15]. Intuitively, NMI quantifies how informative the estimated clustering is
about the true clustering. The ARI metric is related to clustering
accuracy and measures the degree of agreement between an estimated clustering and
a reference clustering. Both NMI and ARI are equal to 1 if the resulting clustering
is identical to the true one. For prediction, we use the RMSE (Root MSE) and MAE
(Mean Absolute Error) metrics: RMSE is a loss function suitable for Gaussian noise,
whereas MAE uses the absolute value and is therefore less sensitive to extreme values.
We generated tensor data $\mathbf{X}$ of size $200 \times 200 \times 2$ according to a Gaussian
model per block. In the simulation study, we considered three scenarios obtained by varying
the regression parameters; the examples have blocks with different regression
collinearity and different co-cluster structure complexity. The parameters for each
example are reported in Table 1. Figures 2 and 3 depict the true regression
planes and the true simulated response matrix $\mathbf{Y}$.
Table 1 Parameter generation for the examples. For all examples, 𝝅 = [0.35, 0.35, 0.3] and 𝝆 = [0.55, 0.45].

                 Example 1               Example 2               Example 3
𝜎                5                       7                       7
𝚺                [1 0; 0 1]              [2 0.3; 0.3 2]          [1 2; 2 1]
Co-clusters      𝜷ₖℓ          𝝁ₖℓ        𝜷ₖℓ          𝝁ₖℓ        𝜷ₖℓ          𝝁ₖℓ
Cluster (1,1)    [1, -10, 1]  [5,20]     [1, -10, 1]  [5,20]     [1, -10, 1]  [5,20]
Cluster (1,2)    [10, 4, 13]  [5,10]     [1, -10, 1]  [5,10]     [1, -10, 1]  [5,10]
Cluster (2,1)    [3, 20, -2]  [10,20]    [1, -10, 1]  [10,20]    [1, -10, 1]  [5,30]
Cluster (2,2)    [-5, -2, -6] [10,10]    [7, 5, -10]  [10,10]    [7, 5, -10]  [20,10]
Cluster (3,1)    [-10, 20, 10][20,20]    [7, 5, -10]  [20,20]    [7, 5, -10]  [20,20]
Cluster (3,2)    [7, 5, -10]  [20,10]    [7, 5, -10]  [20,10]    [7, 5, -10]  [20,30]
Fig. 2 Synthetic data: true regression planes according to the chosen parameters.
In [10, 11], the authors add to the classical clusterwise model a probability $\Phi_0(\mathbf{x}_i|\boldsymbol{\Omega}_k)$ to model the covariates,
whereas the classical cluster-wise approach models the output only, using $\Phi(y_i|\mathbf{x}_i; \lambda_k)$. They
prove that sufficient conditions for model identifiability hold under a suitable
assumption of Gaussian covariates [10]. We can include in LBRM a joint probability
$\Phi_0(\mathbf{x}_{ij}|\boldsymbol{\Omega}_{k\ell})$, where $\boldsymbol{\Omega}_{k\ell} = [\boldsymbol{\mu}_{k\ell}, \boldsymbol{\Sigma}_{k\ell}]$, to evaluate its impact in terms of clustering
and regression. Figure 4 presents the graphical model of LBRM and its extension.
The first experiments on real datasets give encouraging results.
5 Conclusion
Inspired by the flexibility of the latent block model (LBM), we proposed extending it
to tensor data, aiming at both tasks: co-clustering and prediction. This model (LBRM)
gives rise to a variational EM algorithm for co-clustering and prediction, referred to
as VEM-LBRM. This algorithm, which can be viewed as a co-clusterwise algorithm,
can easily deal with sparse data. Empirical results on synthetic data showed that
VEM-LBRM gives more encouraging results for clustering and regression than
several algorithms devoted to one or both of these tasks. For future
work, we plan to develop extensions of LBRM and to apply the proposed models to
the recommender system task.
Acknowledgements Our work is funded by the German Federal Ministry of Education and Re-
search under Grant Agreement Number 01IS19084F (XAPS).
References
1. Affeldt, S., Labiod, L., Nadif, M.: Regularized bi-directional co-clustering. Statistics and
Computing, 31(3), 1-17 (2021)
2. Agarwal, D., and Merugu, S.: Predictive discrete latent factor models for large scale dyadic
data. In: SIGKDD, pp. 26–35 (2007)
3. Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39(1),
1–22 (1977)
4. Deodhar, M., Ghosh, J.: A framework for simultaneous co-clustering and learning from
complex data. In: SIGKDD, pp. 250–259 (2007)
5. DeSarbo, W. S., and Cron, W. L.: A maximum likelihood methodology for clusterwise linear
regression. Journal of Classification, 5(2), 249–282 (1988)
6. Govaert, G., Nadif, M.: Clustering with block mixture models. Pattern Recognition, 36, 463-
473, (2003)
7. Govaert, G., Nadif, M.: An EM algorithm for the block mixture model. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 27(4), 643–647 (2005)
8. Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: Comparison of
different approaches. Computational Statistics & Data Analysis, 3233–3245 (2008)
9. Govaert, G., Nadif, M.: Co-clustering: Models, Algorithms and Applications. John Wiley &
Sons (2013)
10. Ingrassia, S., Minotti, S. C., Punzo, A.: Model-based clustering via linear cluster-weighted
models. Computational Statistics & Data Analysis, 71, 159–182 (2014)
11. Ingrassia, S., Minotti, S. C., Vittadini, G.: Local statistical modeling via a cluster-weighted
approach with elliptical distributions. Journal of Classification, 29(3), 363–401 (2012)
12. Neal, R. M., Hinton, G. E.: A view of the EM algorithm that justifies incremental, sparse, and
other variants. In Learning in Graphical Models, pp. 355–368. Springer (1998)
13. Salah, A., Nadif, M.: Directional co-clustering. Advances in Data Analysis and Classification,
13(3), 591-620 (2019)
14. Späth, H.: Algorithm 39 clusterwise linear regression. Computing, 22(4), 367–373 (1979)
15. Steinley, D.: Properties of the Hubert–Arabie Adjusted Rand Index. Psychological Methods,
9(3), 386 (2004)
16. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining mul-
tiple partitions. Journal of Machine Learning Research, 3, 583–617 (2002)
17. Vu, D., Aitkin, M.: Variational algorithms for biclustering models. Computational Statistics
& Data Analysis, 89, 12–24 (2015)
Using Clustering and Machine Learning
Methods to Provide Intelligent Grocery
Shopping Recommendations
Nail Chabane, Mohamed Achraf Bouaoune, Reda Amir Sofiane Tighilt, Bogdan
Mazoure, Nadia Tahiri, and Vladimir Makarenkov
Abstract Nowadays, grocery lists are part of the shopping habits of many customers.
With the popularity of e-commerce and the plethora of products and promotions
available in online stores, it can become increasingly difficult for customers to identify
products that both satisfy their needs and represent the best deals overall. In this
paper, we present a grocery recommender system based on traditional
machine learning methods that aims at assisting customers with the creation of their
grocery lists on the MyGroceryTour platform, which displays weekly grocery deals in
Canada. Our recommender system relies on individual user purchase histories,
as well as on the available product and store features, to compose intelligent weekly
grocery lists. The use of clustering prior to supervised machine learning
allowed us to identify customer profiles and to reduce the choice of potential products
of interest for each customer, thus improving the prediction results. The highest
average F-score of 0.499 for the considered dataset of 826 Canadian customers was
obtained using the Random Forest prediction model, which was compared to the
Decision Tree, Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost,
Support Vector Machine and Naive Bayes models in our study.
1 Introduction
Grocery shopping is a common activity that involves different factors such as budget
and impulse-purchasing pressure [1]. Customers typically rely on a mental or digital
list to facilitate their grocery trips. Many of them show a favorable interest towards
tools and applications that help them manage their grocery lists while keeping
them updated with special offers, coupons and promotions [2, 3]. Major retailers
throughout the world typically offer discounts on different products every week in
order to improve sales and attract new customers. This very common practice leads
to thousands of items going on special simultaneously across different
retailers in a given week. The resulting information overload often makes it difficult
for customers to quickly identify the deals that best suit their needs, which can become
a source of frustration [4]. To address this problem, many grocery stores have taken
advantage of the popularity of e-commerce to set up their own websites featuring
various functionalities, including recommender systems, to assist customers during
the shopping process.
Recommender Systems (RSs) [5] are tools and techniques that offer personalized
suggestions to users based on several parameters (e.g., their past behavior). RSs have
recently become a field of interest for researchers and retailers, as many e-commerce sites,
online bookstores and streaming platforms have started to offer this service on their
websites (e.g., Amazon, Netflix and Spotify). Here, we recall some recent works in
this field. Faggioli et al. [6] used the popular Collaborative Filtering (CF) approach
to predict the customer's next basket in a grocery shopping context, taking into
account a recency parameter. When comparing their model with the CF baseline
models, Faggioli et al. observed a consistent improvement of their prediction results.
Che et al. [7] used attention-based recurrent neural networks to capture both inter-
and intra-basket relationships, thus modelling users' long-term preferences and dynamic
short-term decisions.
Content-based recommendation has also proven efficient in the literature, as
demonstrated by Xia et al. [8], who proposed a tree-based model for coupon recommendation.
By processing their data with undersampling methods, the authors were
able to increase the estimated click rate from 1.20% to 7.80%, as well as to significantly
improve the F-score results using a Random Forest classifier and the recall results
using XGBoost. Dou [9] presented a statistical model to predict whether or not a user will
buy an item using Yandex's CatBoost method [10]. Dou relied on contextual
and temporal features, as well as on some session features, such as the time of
visit of specific web pages, to demonstrate the efficiency of CatBoost in this context.
Finally, Tahiri et al. [11] used recurrent and feedforward neural networks (RNNs and
FFNs) in combination with non-negative matrix factorization and gradient boosting
trees to create intelligent weekly grocery baskets to be recommended to the users
of MyGroceryTour. Tahiri et al. considered features characterizing the users of
MyGroceryTour different from ours to provide their predictions, with the best
F-score of 0.37 obtained on the augmented dataset.
Features were standardized via the z-score transformation $z_f = (x_f - \mu_f)/\sigma_f$ (1),
where $x_f$ is the original value of the observation at feature $f$, $\mu_f$ is the mean and
$\sigma_f$ is the standard deviation of $f$.
For each user $u$, a fidelity ratio $FR_u$ was computed as

$$FR_u = \frac{X_{max,u} - \frac{1}{n-1}\sum_{i=2}^{n} X_{i,u}}{X_{total,u}}, \qquad (2)$$
where $X_{max,u}$ is the total number of products bought by user $u$ at the store where
he/she made most of his/her purchases, $n$ ($n > 1$) is the total number of stores visited by
user $u$, and $X_{total,u}$ ($X_{total,u} = X_{max,u} + \sum_{i=2}^{n} X_{i,u}$) is the total number of products
purchased by user $u$ over all the stores he/she visited. A high fidelity ratio means that
user $u$ buys most of his/her products at the same store, whereas a low fidelity ratio
indicates that user $u$ buys his/her products at different stores. When user $u$ purchases
all of his/her products at the same store ($X_{max,u} = X_{total,u}$ and $n = 1$), the fidelity
ratio equals 1. It equals 0 when he/she purchases the same number of products at
different stores.
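A direct transcription of Eq. (2) in Python; `purchases`, an array of per-store purchase counts for a single user, is an illustrative input.

```python
# Fidelity ratio of Eq. (2); the favourite store is the one with most purchases.
import numpy as np

def fidelity_ratio(purchases):
    x = np.sort(np.asarray(purchases, dtype=float))[::-1]  # descending counts
    n, total = len(x), x.sum()
    if n == 1:                        # all purchases at a single store
        return 1.0
    return (x[0] - x[1:].sum() / (n - 1)) / total

print(fidelity_ratio([40, 5, 5]))     # loyal user -> high ratio (0.7)
print(fidelity_ratio([10, 10]))       # equal split across stores -> 0.0
```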
The $K$-means [14] and DBSCAN [15] algorithms were used to perform clustering.
Here we present the results of DBSCAN, as the clusters provided by DBSCAN
had less entity overlap than those provided by $K$-means. The main advantage of
DBSCAN is that this density-based algorithm is able to capture clusters of any
shape.
Fig. 1 Davies-Bouldin cluster validity index variation with respect to the number of clusters.
We used the Davies-Bouldin (DB) [17] cluster validity index to determine the
number of clusters in our dataset. The Davies-Bouldin index is the average similarity
between each cluster $C_i$, $i = 1, \ldots, k$, and its most similar counterpart $C_j$. It is
calculated as follows:

$$DB = \frac{1}{k}\sum_{i=1}^{k} \max_{i \neq j} R_{ij}, \qquad (3)$$
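In practice, the DB index can be computed with scikit-learn [16], scanning candidate numbers of clusters and keeping the minimum. The feature matrix and the scanned range below are placeholders, and the K-means scan is only illustrative (the final clustering in the paper is obtained with DBSCAN).

```python
# Hedged sketch: pick the number of clusters minimizing the Davies-Bouldin index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

F = np.random.default_rng(0).normal(size=(826, 5))   # stand-in user features
scores = {k: davies_bouldin_score(F, KMeans(k, n_init=10, random_state=0)
                                  .fit_predict(F)) for k in range(2, 11)}
best_k = min(scores, key=scores.get)                  # lower DB is better
print(best_k, scores[best_k])
```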
Fig. 2 Clustering results: clustering obtained with DBSCAN with the best number of clusters
according to the Davies-Bouldin index. Data reduction was performed using t-SNE. The 6 clusters
of customers found by DBSCAN are represented by different symbols.
We have noticed that the users in Cluster 1 (see Fig. 2) are fairly sensitive to
specials and have a high fidelity score, the users in Cluster 2 mostly purchase
products on special in different stores, the users in Cluster 3 seem to be sensitive
to the total price of their shopping baskets, Cluster 4 includes the users who are
sensitive to specials but have a low fidelity score, Cluster 5 includes the users who
are not very attracted by specials but are rather loyal to their favorite store(s), and
the users in Cluster 6 tend to buy products on special and have high fidelity scores.
To predict the products to be recommended for the current weekly basket, we used
the following supervised machine learning methods: Decision Tree, Random Forest,
Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost, Support Vector
Machine and Naive Bayes.
Table 1 F-scores provided by the ML methods without and with clustering of MyGroceryTour
users (columns: machine learning method; F-score without clustering; F-score with clustering).
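The protocol behind Table 1 can be sketched as follows: one classifier per customer cluster, evaluated by the F-score. The helper below uses Random Forest as an example; the variable names and the simple train/test split are assumptions, not the paper's exact evaluation setup.

```python
# Illustrative "clustering before classification" evaluation: one Random
# Forest per customer cluster, scored with the (binary) F-score.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def per_cluster_f1(X, y, clusters):
    # X: feature matrix, y: binary buy/not-buy labels, clusters: numpy array
    scores = {}
    for c in set(clusters):
        m = clusters == c
        Xtr, Xte, ytr, yte = train_test_split(X[m], y[m], random_state=0)
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
        scores[c] = f1_score(yte, model.predict(Xte))
    return scores
```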
4 Conclusion
References
1. Vincent-Wayne, M., Aylott, R.: An exploratory study of grocery shopping stressors. Int. J.
Retail. Distrib. Manag. 26, 362–373 (1998)
2. Newcomb, E., Pashley, T., Stasko, J.: Mobile computing in the retail arena. In: Proceedings
of the SIGCHI conference on Human factors in computing systems, pp. 337-344. Association
for Computing Machinery, New York (2003)
3. Sourav, B., Floréen, P., Forsblom, A., Hemminki, S., Myllymäki, P., Nurmi, P., Pulkkinen,
T., Salovaara, A.: An Intelligent Mobile Grocery Assistant. In: 2012 Eighth International
Conference on Intelligent Environments, pp. 165-172. IEEE, Guanajuato (2012)
4. Park, Y., Chang, K.: Individual and group behavior-based customer profile model for person-
alized product recommendation. Expert Systems with Applications 36(2), 1932-1939 (2009)
5. Ricci, F., Rokach, L., Shapira, B.: Recommender Systems: Introduction and Challenges. In:
Recommender Systems Handbook, pp. 1-34. Springer, Boston (2015)
6. Faggioli, G., Polato, M., Aiolli, F.: Recency aware collaborative filtering for next basket
recommendation. In: Proceedings of the 28th ACM Conference on User Modeling, Adaptation
and Personalization, pp. 80-87. Association for Computing Machinery, New York (2020)
7. Che et al.: Inter-basket and intra-basket adaptive attention network for next basket recommen-
dation. IEEE Access 7, 80644-80650 (2019)
8. Xia, Y. Giuseppe, D. F., Shikhar, V., Ankur, D.: A content-based recommender system for e-
commerce offers and coupons. In: Proceedings of the SIGIR 2017 eCom workshop. eCOM@
SIGIR, Tokyo (2017)
9. Dou, X.: Online purchase behavior prediction and analysis using ensemble learning. In : 2020
IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA),
pp. 532-536. IEEE (2020)
10. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A.: CatBoost: unbiased
boosting with categorical features (2017). Available via arXiv.
11. Tahiri, N., Mazoure, B. and Makarenkov, V.: An intelligent shopping list based on the appli-
cation of partitioning and machine learning algorithms. In: Proceedings of the 18th Python in
Science Conference (SCIPY 2019), pp. 85-92. Austin, Texas (2019)
12. Kotsiantis, S. B., Kanellopoulos, D., Pintelas, P. E.: Data preprocessing for supervised learning.
Int. J. Comput. Sci. 2, 111-117 (2006)
13. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer, Cham,
Switzerland (2015)
14. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol.
1, no. 14, pp. 281-297 (1967)
15. Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd International
Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 226–231. AAAI
Press, Portland, Oregon (1996)
16. Pedregosa et al.: Scikit-learn: Machine Learning in Python. JMLR 12, 2825-2830 (2011)
17. Davies, D. L., Bouldin, D. W.: A Cluster Separation Measure. In: IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224-227 (1979)
18. Van der Maaten, L. J. P., Hinton, G.: Visualizing High-Dimensional Data Using t-SNE. J.
Mach. Learn. Res. 9, 2579-2605 (2008)
COVID-19 Pandemic: a Methodological Model
for the Analysis of Government’s Preventing
Measures and Health Data Records
Abstract The study aims to investigate the associations between the government’s
response measures during the COVID-19 pandemic and weekly incidence data (pos-
itivity rate, mortality rate and testing rate) in Greece. The study focuses on the
period from the detection of the first case in the country (26th February 2020) to the
first week of 2022 (8th January 2022). Data analysis was based on Correspondence
Analysis on a fuzzy-coded contingency table, followed by Hierarchical Cluster Anal-
ysis (HCA) on the factor scores. Results revealed distinct time periods during which
interesting interactions took place between control measures and incidence data.
1 Introduction
The present study focuses on the period of the COVID-19 pandemic in Greece, from
the detection of the first case of COVID-19 to the first week of 2022. This period
can be divided into five distinct phases. The first phase extends from the beginning
of 2020 until the first lockdown, i.e., from the first case reported in Greece until
the end of the first quarantine period in May 2020. The second phase concerns the
interim period from June to October 2020, when the pandemic indices improved,
and policies were loosened for the opening of tourism. The third phase concerns the
second lockdown and the evolution of the pandemic in the country from November
2020 to April 2021, when the first vaccination period of the adult population took
place. The fourth phase includes the interim period from May 2021 to October
2021, where a general stabilization of the number of cases occurred, while the last
period refers to a significant increase in the number of cases from November 2021
to January 2022.
Theodore Chadjipadelis ( )
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
Sofia Magopoulou
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
Overall, from March 2020 to January 2022, a total of 1.79 million cases of COVID-19
(Figure 1) and a total of 22,635 deaths were recorded in Greece. As of January 2022,
vaccination coverage exceeds 65% of the country's population, i.e., 7,241,468
fully vaccinated citizens.
2 Methodology
2.1 Data
For the study purposes, data were obtained from the Oxford Covid-19 Government
Response Tracker (OxCGRT) and were combined with self-collected Covid-19 data
for Greece [3], updated daily in Greek. The Oxford Covid-19 Government Response
Tracker (OxCGRT) collects publicly available information reflecting government
responses in 180 countries since 1 January 2020 [4]. The tracker is based on data for
23 indicators. In this study, two groups of indicators were considered for the case of
Greece: Containment & Closure and Health Systems. The first group of indicators refers
to "collective"-level policies and measures, such as school closures and restrictions in
mobility, while the second refers to "individual"-level policies and measures, such as
testing and vaccination. Specifically, the collective-level indicators refer to policies
taken by the government that reflect on society at a collective level: school
closing, workplace closing, cancelation of public events, restrictions on gatherings,
closure of public transport, stay-at-home requirements, internal movement restrictions
and international travel controls. The health system policies primarily touch
upon the individual level and specifically refer to: public information campaigns,
testing, contact tracing, healthcare facilities, investment in vaccines, facial coverings,
vaccination and protection of the elderly. All collective-level indicators (C1 to
C8) were summed to yield a total score (ranging from 0 to 16). Similarly, individual-level
indicators (H1 to H3 and H6 to H8) were summed to compute a total score
(ranging from 0 to 12).
The self-collected data refer to positive cases, the number of Covid-19-related deaths,
the number of tests and the total number of vaccinations administered. These data have been
recorded daily since March 2020 from public announcements by official and verified
sources. A total of 94 time points were considered in the present study, corresponding
to weekly data (Monday was used as a reference). Three quantitative indicators were
derived: a positivity index (#cases / #tests), a mortality index (#deaths / #cases) and a
testing index (#tests / #people). The number of vaccinations is not used in the present
study because the vaccination process began in January 2021 and the administration
of the booster dose began in September 2021. The final data set consisted of five
indicators: two ordinal total scores and three quantitative indices.
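A minimal pandas sketch of how the three weekly indices could be derived from the daily records; the file name, column names and the population constant are assumptions.

```python
# Daily records -> weekly (Monday-anchored) indices; all names are hypothetical.
import pandas as pd

daily = pd.read_csv("greece_covid_daily.csv", parse_dates=["date"])
weekly = daily.resample("W-MON", on="date").sum(numeric_only=True)
weekly["positivity"] = weekly["cases"] / weekly["tests"]    # #cases / #tests
weekly["mortality"] = weekly["deaths"] / weekly["cases"]    # #deaths / #cases
weekly["testing"] = weekly["tests"] / 10_700_000            # #tests / #people (approx. population, assumed)
```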
A four-step data analysis strategy was adopted. In the first step, the three quantitative
variables (positivity rate, mortality rate and testing rate) were transformed into
ordinal variables via a method proposed in [7] for transforming continuous
variables into ordinal categorical variables with minimum information loss. Three
ordinal variables were derived. In the second step, the five ordinal variables (i.e., the
three recoded variables and the two ordinal total scores) were fuzzy-coded into three
categories each, using the barycentric coding scheme proposed in [7]. This scheme
has recently been evaluated in the context of hierarchical clustering in [7] and was
applied with the DIAS Excel add-in [6]. Barycentric coding allows us to convert an
m-point ordinal variable into an n-point fuzzy-coded variable [6, 7]. In other words,
the transformation of the three quantitative variables into ordinal variables resulted
in a generalized 0-1 matrix (fuzzy-coded matrix), where for each variable we obtain
the estimated probability of each category. A drawback of the proposed approach is
that the ordinal information in the 5 ordinal variables is lost.
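As a loose stand-in for the barycentric scheme of [6, 7], the sketch below fuzzy-codes an ordinal variable into three categories using triangular memberships that sum to one; the exact published coding may differ, so the formula here should be read as an assumption rather than the scheme itself.

```python
# Generic triangular-membership fuzzy coding of an ordinal variable into
# three categories (low / medium / high); not necessarily the scheme of [6, 7].
import numpy as np

def fuzzy_code_3(levels, n_levels):
    t = (np.asarray(levels) - 1) / (n_levels - 1)     # ordinal level -> [0, 1]
    low = np.clip(1 - 2 * t, 0, 1)                    # membership to "low"
    high = np.clip(2 * t - 1, 0, 1)                   # membership to "high"
    mid = 1 - low - high                              # rows sum to one
    return np.column_stack([low, mid, high])

print(fuzzy_code_3([1, 3, 5], n_levels=5))  # vertices map to (1,0,0), (0,1,0), (0,0,1)
```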
The third step involved the application of Correspondence Analysis (CA) on
the fuzzy-coded table with the 94 weeks as rows and the fifteen fuzzy categories
as columns (see [1] for a similar approach). The number of significant axes was
determined based on the percentage of inertia explained, and the significant points on each
axis were determined based on the values of two statistics that accompany standard
CA output: a quality of representation (COR) greater than 200 and a contribution (CTR)
greater than 1000/(n + 1), where n is the total number of categories (i.e., 15 in our
case). In the final step, Hierarchical Cluster Analysis (HCA) using Benzécri's chi-squared
distance and Ward's linkage criterion [2, 8] was employed to cluster the
94 points (weeks) on the CA axes obtained in the previous step. The number of
clusters was determined using the empirical criterion of the change in the ratio of
between-cluster inertia to total inertia when moving from a partition with r clusters
to a partition with r − 1 clusters [8]. Lastly, we interpreted the clusters after determining
the contribution of each indicator to each cluster. All analyses were conducted with
the M.A.D. [Méthodes de l'Analyse des Données] software [5].
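Steps three and four can be mirrored outside M.A.D. [5] with a few lines of numpy/scipy: CA row factor scores via the SVD of the standardized residuals of the fuzzy-coded table, then Ward clustering of the 94 weekly score vectors. Because CA factor scores embed Benzécri's chi-squared distances into Euclidean space, Ward on the scores matches the stated criterion. The variable `fuzzy_table` is assumed to come from the previous step.

```python
# Correspondence analysis row scores + Ward HCA; a sketch, not the M.A.D. code.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ca_row_scores(N, n_axes=4):
    P = N / N.sum()                      # correspondence matrix
    r, c = P.sum(1), P.sum(0)            # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]  # principal coords

F = ca_row_scores(fuzzy_table)           # fuzzy_table: 94 weeks x 15 categories
labels = fcluster(linkage(F, method="ward"), t=7, criterion="maxclust")
```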
3 Results
Fig. 3 Category coordinates on the four CA axes (#G), quality of representation (COR) and
contribution (CTR). COR values greater than 200 and CTR values greater than 1000 / 16 = 62.5
are shown in yellow. Positive coordinates are shown in green and negative in pink.
Hierarchical Cluster Analysis on the factor scores resulted in seven clusters using
the empirical criterion for cluster determination (see Section 2.2). The corresponding
dendrogram is shown in Figure 4. The seven nodes in the figure that correspond to
the seven clusters are 182, 181, 175, 177, 171, 133 and 179. Cluster content
reflects the different periods (phases) presented in the introductory section.
The first cluster (182) combines data points from March 2020, the onset of the
pandemic, with data points from the period following the summer of 2020 (October
and November). This cluster is characterized by high positivity rate, low testing
rate, high levels of “collective” measures (containment & closure) and low levels of
“individual” measures (health system). The second cluster (181) contains data points
from April and May 2020 and is characterized by low positivity rate, average to high
mortality rate, low testing rate, high levels of “collective” measures (containment &
closure) and average levels of “individual” measures (health system). The third clus-
ter (175) combines summer months of 2020 and 2021. This cluster is characterized
by low positivity rate, low testing rate and average levels of “collective” measures
(containment & closure). The fourth cluster (177) marks the period of December
2020 and the period of spring of 2021, with average positivity rate and high levels of
“collective” measures (containment & closure). The fifth cluster (171) refers to the
period from December 2020 to February 2021, but also includes August 2021, with
high levels of “collective” measures (containment & closure). The sixth cluster (133)
refers to the period following the summer of 2021 (September and October 2021).
In this cluster, average positivity rates were observed but also strict containment and
closure measures.
Lastly, the seventh cluster (179) refers to November and December 2021, including
also January 2022, with high positivity and high testing rates, while high levels of
containment and closure and health system measures were observed. Figure 5 shows
the contributions of each indicator in each cluster.
Fig. 5 Cluster description (contribution values of the indicators in each cluster - node).
4 Discussion
Based on the study results, we can argue that, when it comes to measures and
real-time data tracking a situation such as the pandemic, a “chicken and egg”
dilemma arises. The question is whether “collective” and “individual” measures
affect daily incidence data or the inverse (i.e., that the daily data lead to measures).
We conclude that in fact the two should be perceived as working in conjunction
and not independently from one another. The analysis showed that lower positivity
rate is accompanied by average levels of measures from the government at both
the “individual” and the “collective” level. Furthermore, higher positivity rate is
accompanied by higher levels of measures, as a response. With regard to mortality
rate, we observed that higher mortality invokes higher levels of “collective” measures
and average levels of “individual” measures, whereas average levels of “collective”
measures are associated with higher mortality rate.
It is therefore evident that, when it comes to decision making in crisis situations, the
systematic collection, analysis and use of data is linked to a more effective government
response overall. Evidence-based policy making should therefore be linked to crisis
management. This paper presents a first attempt to capture an ongoing phenomenon,
and it is therefore crucial that the collection and analysis of data be continued
until the end of the phenomenon.
References
1. Aşan, Z., Greenacre, M.: Biplots of fuzzy coded data. Fuzzy Sets and Systems, 183(1), 57–71
(2011)
2. Benzécri, J. P.: L’Analyse des Données. 2. L’Analyse des Correspondances. Dunod, Paris
(1973)
3. Chadjipadelis, T.: Facebook profile (2022).
https://www.facebook.com/theodore.chadjipadelis
4. Hale, T., Petherick, A., Phillips, T., Webster, S.: Variation in government responses to COVID-
19. Blavatnik School of Government Working Paper, 31, 2020-11 (2020)
5. Karapistolis, D.: Software Method of Data Analysis MAD. (2010)
http://www.pylimad.gr/
6. Markos, A., Moschidis, O., Chadjipadelis, T.: Hierarchical clustering of mixed-type data based
on barycentric coding (2022) https://arxiv.org/submit/4142768
7. Moschidis, O., Chadjipadelis, T.: A method for transforming ordinal variables. In: Palumbo,
F., Montanari, A., Vichi, M. (eds) Data Science, pp. 285-294. Studies in Classification, Data
Analysis, and Knowledge Organization. Springer, Cham. (2017)
https://doi.org/10.1007/978-3-319-55723-6\_22
8. Papadimitriou, G., Florou, G.: Contribution of the Euclidean and chi-square metrics to de-
termining the most ideal clustering in ascending hierarchy (in Greek). In Annals in Honor of
Professor I. Liakis, 546-581. University of Macedonia, Thessaloniki (1996)
pcTVI: Parallel MDP Solver Using a
Decomposition into Independent Chains
Abstract Markov Decision Processes (MDPs) are useful to solve real-world probabilistic
planning problems. However, finding an optimal solution in an MDP can take
an unreasonable amount of time when the number of states in the MDP is large. In
this paper, we present a way to decompose an MDP into Strongly Connected
Components (SCCs) and to find dependency chains for these SCCs. We then propose a
variant of the Topological Value Iteration (TVI) algorithm, called parallel chained
TVI (pcTVI), which is able to solve independent chains of SCCs in parallel, leveraging
modern multicore computer architectures. The performance of our algorithm
was measured by comparing it to the baseline TVI algorithm on a new probabilistic
planning domain introduced in this study. Our pcTVI algorithm led to a speedup
factor of 20, compared to traditional TVI (on a computer having 32 cores).
1 Introduction
Markov Decision Processes (MDPs) are generally used to solve real-world probabilistic
planning problems, leading to probabilistic models of applicable actions [2].
In probabilistic planning, a solution is generally a policy, i.e., a mapping specifying
which action should be executed in each observed state to achieve an objective.
Usually, dynamic programming algorithms such as Value Iteration (VI) are used to
find an optimal policy [3]. Since VI is time-expensive, many improvements have
been proposed to find an optimal policy faster, using for example the Topological
Value Iteration (TVI) algorithm [4]. However, very large domains often remain out
of reach. One unexplored way to reduce the computation time of TVI is by taking
advantage of the parallel architecture of modern computers and by decomposing an
MDP into independent parts which could be solved concurrently.
In this paper, we show that state-of-the-art MDP planners such as TVI can run
an order of magnitude faster when considering task-level parallelism of modern
computers. Our main contributions are as follows:
• An improved version of the TVI algorithm, parallel-chained TVI (pcTVI), which
decomposes MDPs into independent chains of strongly connected components
and solves them concurrently.
• A new parametric planning domain, chained-MDP, and an evaluation of pcTVI’s
performance on many instances of this domain compared to the VI, LRTDP [5]
and TVI algorithms.
2 Related Work
Many MDP solvers are based on the Value Iteration (VI) algorithm [3], or more
precisely on asynchronous variants of VI. In asynchronous VI, MDP states can be
backed up in any order and do not need to be considered the same number of times.
One way to take advantage of this is by assigning a priority to every state and by
considering them in priority order.
Several state-of-the-art MDP algorithms have been proposed to increase the speed
of computation. Many of them are able to focus on the most promising parts of the MDP
through heuristic search, as in LRTDP [5] or LAO* [6]. Other
MDP algorithms use partitioning methods to decompose the state space into smaller
parts. For example, the P3VI (Partitioned, Prioritized, Parallel Value Iteration)
algorithm partitions the state space, uses a priority metric to order the partitions in an
approximate best solving order, and solves them in parallel [7]. The biggest disadvantage
of P3VI is that the partitioning is done on a case-by-case basis depending on
the planning domain, i.e., P3VI does not include a general state-space decomposition
method. The inter-process communication between the solving threads also incurs
an overhead on the computation time. The more recent TVI (Topological Value
Iteration) algorithm [4] also decomposes the state space, but does so by considering the
topological structure of the underlying graph of the MDP, making it more general
than P3VI. Unfortunately, to the best of our knowledge, no parallel version of TVI
has been proposed in the literature.
3 Problem Definition
The expression between square brackets is called the Q-value of a state-action pair:

$$Q(s, a) = C(s, a) + \sum_{s' \in S} T(s, a, s')\, V(s').$$
Most MDP solvers are based on dynamic programming algorithms like Value
Iteration (VI), which iteratively update an arbitrarily initialized value function until
convergence with a given precision $\epsilon$. In the worst case, VI needs to do $|S|$ sweeps of
the state space, where one sweep consists in updating the value estimate of every state
using the Bellman Optimality Equations. Hence, the number of state updates (each called
a backup) is $O(|S|^2)$. When the MDP is acyclic, most of these backups are wasteful,
since the MDP can in this situation be solved using only $|S|$ backups (ordered in
reverse topological order), thus allowing one to find an optimal policy in $O(|S|)$ [8].
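For reference, a minimal synchronous VI loop with the cost-based Bellman backup above; the data layout (dense cost array, sparse transition lists) is an assumption, and the paper's implementation is in C++.

```python
# Minimal Value Iteration with cost-minimizing Bellman backups.
import numpy as np

def value_iteration(C, T, n_states, n_actions, eps=1e-6):
    # C[s, a]: action cost; T[s][a]: list of (next_state, probability) pairs.
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [C[s, a] + sum(p * V[s2] for s2, p in T[s][a])
                 for a in range(n_actions)]        # Q-value of every action
            best = min(q)                           # cost-minimizing MDP
            delta = max(delta, abs(best - V[s]))
            V[s] = best                             # in-place backup
        if delta < eps:                             # stop at eps-optimal values
            return V
```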
4 Parallel-chained TVI
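In outline, pcTVI as described in the abstract can be read as follows: build the SCC condensation of the MDP graph (a DAG), solve each dependency chain of SCCs in reverse topological order (successor components first, as in TVI), and run independent chains on separate workers. The sketch below takes the chain partition as a given input, since the chain-finding procedure is specific to this section, and leans on networkx for the graph machinery; it is a schematic reading under those assumptions, not the authors' C++ implementation.

```python
# Schematic pcTVI: independent SCC chains solved concurrently.
from concurrent.futures import ThreadPoolExecutor
import networkx as nx

def pctvi(mdp_graph, chains, solve_scc):
    # `chains`: lists of condensation-node ids forming independent chains;
    # `solve_scc(states)`: VI restricted to the states of one SCC.
    dag = nx.condensation(mdp_graph)                # DAG over the SCCs
    def solve_chain(chain):
        sub = dag.subgraph(chain)
        for comp in reversed(list(nx.topological_sort(sub))):
            solve_scc(dag.nodes[comp]["members"])   # successor SCCs solved first
    with ThreadPoolExecutor() as pool:              # one worker per chain
        list(pool.map(solve_chain, chains))
```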
5 Empirical Evaluation
LRTDP was guided by the admissible $h_{\min}$ heuristic:

$$h_{\min}(s) = \begin{cases} 0, & \text{if } s \in G, \\ \min_{a \in A_s} \Big[ C(s, a) + \min_{s' \in succ_a(s)} V(s') \Big], & \text{otherwise,} \end{cases}$$
where 𝐴𝑠 denotes the set of applicable actions in state 𝑠 and 𝑠𝑢𝑐𝑐 𝑎 (𝑠) is the set of
successors when applying action 𝑎 at state 𝑠. The four competing algorithms (VI,
TVI, LRTDP and pcTVI) were implemented in C++ by the authors of this paper and
compiled using the GNU g++ compiler (version 11.2). All tests were performed on a
computer equipped with four Intel Xeon E5-2620V4 processors (each of them having
8 cores at 2.1 GHz, for a total of 32 cores). For every test domain, we measured
the running time of the four compared algorithms carried out until convergence to
an $\epsilon$-optimal value function (we used $\epsilon = 10^{-6}$). Every domain was tested 15 times
with randomly generated MDP instances. To minimize random factors, we report
the median values obtained over these 15 MDP instances.
Since there is no standard MDP domain in the scientific literature suitable for
benchmarking a parallel MDP solver, we propose a new general parametric MDP
domain that we use to evaluate the algorithms. This domain, which we call chained-MDP,
uses 5 parameters: (1) $k$, the number of independent chains $\{c_1, c_2, \ldots, c_k\}$ in
the MDP; (2) $n_{scc}$, the number of SCCs $\{scc_{i,1}, scc_{i,2}, \ldots, scc_{i,n_{scc}}\}$ in every chain
$c_i$; (3) $n_{sps}$, the number of states per SCC; (4) $n_a$, the number of applicable actions
per state; and (5) $n_e$, the number of probabilistic effects per action. The possible
successors $succ(s)$ of a state $s$ in $scc_{i,j}$ are the states in $scc_{i,j}$ and either the states
in $scc_{i,j+1}$ if it exists, or the goal state otherwise. When generating the transition
function of a state-action pair $(s, a)$, we sampled $n_e$ states uniformly from $succ(s)$
with random probabilities. In each of our tests, we used $n_{scc} = 2$, $n_a = 5$ and $n_e = 5$.
A representation of a chained-MDP instance is shown in Figure 1.
Fig. 1 A chained-MDP instance where $n_c = 3$ and $n_{scc} = 4$: three parallel chains $c_1, c_2, c_3$ of
four SCCs each, linking the start state S to the goal A. Each ellipse represents a strongly
connected component.
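An illustrative generator for this parametric domain could look as follows; the state-numbering scheme and the single shared goal state are encoding assumptions on top of the description above.

```python
# Hypothetical chained-MDP generator: k chains of n_scc SCCs with n_sps
# states each; every state has n_a actions with n_e probabilistic effects.
import numpy as np

def make_chained_mdp(k=32, n_scc=2, n_sps=100, n_a=5, n_e=5, seed=0):
    rng = np.random.default_rng(seed)
    goal = k * n_scc * n_sps                       # single extra goal state
    sid = lambda c, j, s: (c * n_scc + j) * n_sps + s
    T = {}
    for c in range(k):
        for j in range(n_scc):
            for s in range(n_sps):
                own = [sid(c, j, t) for t in range(n_sps)]
                nxt = ([sid(c, j + 1, t) for t in range(n_sps)]
                       if j + 1 < n_scc else [goal])
                succ = own + nxt                   # succ(s) as described above
                for a in range(n_a):
                    eff = rng.choice(succ, size=min(n_e, len(succ)), replace=False)
                    p = rng.random(len(eff)); p /= p.sum()  # random probabilities
                    T[(sid(c, j, s), a)] = list(zip(eff.tolist(), p.tolist()))
    return T, goal
```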
Figure 2 presents the results obtained for the chained-MDP domain when varying
the number of states and fixing the number of chains (32). We can observe that when
the number of states is small, pcTVI does not provide an important advantage over
the existing algorithms, since the overhead of creating and managing the threads eats up
most of the possible gains. However, as the number of states increases, the gap
in running time between pcTVI and the three other algorithms widens. This
indicates that pcTVI is particularly useful on very large MDPs, which are usually
needed when considering real-world domains.
Figure 3 presents the results obtained for the same chained-MDP domain when
varying the number of chains and fixing the number of states (1M). When the
number of chains increases, the total number of SCCs implicitly increases (which
also implies that the number of states per SCC decreases). This explains why each tested
Fig. 2 Average running times (in s) of VI, LRTDP, TVI and pcTVI for the chained-MDP domain
with a varying number of states (×100,000) and a fixed number of chains (32).
Fig. 3 Average running times (in s) of VI, LRTDP, TVI and pcTVI for the chained-MDP domain
with a varying number of chains and a fixed number of states (1M).
algorithm becomes faster (TVI becomes faster by design, since it solves SCCs
one by one without doing useless state backups, while VI and LRTDP become faster
due to an increased locality of the considered states in memory, which improves
cache performance). The performance of pcTVI increases as the number of chains
increases (for the same reasons as the other algorithms, but also due to increased
parallelization opportunities). We can also observe that for domains with only 4 chains,
pcTVI still clearly outperforms the other methods. This means that pcTVI does
not need a highly parallel server CPU and can be used on a standard 4-core computer.
6 Conclusion
The main contributions of this paper are two-fold. First, we presented a new algorithm,
pcTVI, which is, to the best of our knowledge, the first MDP solver that takes
into account both the topological structure of the MDP (as in TVI) and the parallel
capacities of modern computers (as in P3VI). Second, we introduced a new parametric
planning domain, chained-MDP, which models any situation where different
strategies (each corresponding to a chain) can reach a goal, but where, once committed
to a strategy, it is not possible to switch to a different one. This domain is ideal for
evaluating the parallel performance of an MDP solver. Our experiments indicate that
pcTVI outperforms the other competing methods (VI, LRTDP, and TVI) on every
tested instance of the chained-MDP domain. Moreover, pcTVI is particularly effective
when the considered MDP has many SCC chains (for increased parallelization
opportunities) of large size (for a decreased overhead of assigning small tasks to the
threads). As future work, we plan to investigate ways of pruning provably suboptimal
actions, which would allow more SCCs to be found. While this paper focuses
on the automated planning side of MDPs, the proposed optimization and parallel
computing approaches could also be applied when using MDPs with Reinforcement
Learning and other ML algorithms.
Acknowledgements This research has been supported by the Natural Sciences and Engineering
Research Council of Canada (NSERC) and the Fonds de Recherche du Québec — Nature et
Technologies (FRQNT).
References
1. Champagne Gareau, J., Beaudry E., Makarenkov, V.: A fast electric vehicle planner using
clustering. In: Stud. in Classif., Data Anal., and Knowl. Organ., 5, 17-25. Springer (2021)
2. Mausam, Kolobov, A.: Planning with Markov Decision Processes: An AI Perspective. Morgan
& Claypool (2012)
3. Bellman, R.: Dynamic Programming. Prentice Hall (1957)
4. Dai, P., Mausam, Weld, D. S., Goldsmith, J.: Topological value iteration algorithms. J. Artif.
Intell. Res., 42, 181-209 (2011)
5. Bonet, B., Geffner, H.: Labeled RTDP: Improving the convergence of real-time dynamic
programming. In: Proc. of ICAPS, pp. 12-21 (2003)
6. Hansen, E., Zilberstein, S.: LAO*: A heuristic search algorithm that finds solutions with loops.
Artif. Intell., 129(1-2), 35-62 (2001)
7. Wingate, D., Seppi, K.: P3VI: A partitioned, prioritized, parallel value iterator. In: Proc. of
the Int. Conf. on Mach. Learn. (ICML), 863-870 (2004)
8. Bertsekas, D.: Dynamic Programming and Optimal Control, vol. 2. Athena scientific Belmont,
MA (2001)
Three-way Spectral Clustering
1 Introduction
Spectral clustering methods are based on graph theory: the units are
represented by the vertices of an undirected graph and the edges are weighted by
the pairwise similarities coming from a suitable kernel function, so that the clustering
problem is reformulated as a graph partition problem; see e.g. [16, 6]. The spectral
clustering algorithm is a very powerful method for finding non-convex clusters of
data; moreover, it is a handy approach for handling high-dimensional data, since it
works on a transformation of the raw data that has a smaller dimension than the space
of the original data.
Cinzia Di Nuzzo ( )
Department of Statistics, University of Roma La Sapienza, Piazzale Aldo Moro, 5, 00185 Roma,
Italy, e-mail: [email protected]
Salvatore Ingrassia
Department of Economics and Business, University of Catania, Piazza Università, 2, 95131 Catania,
Italy, e-mail: [email protected]
2 Spectral Clustering
The spectral clustering algorithm for two-way data has been described in [8, 16, 6]. Here,
we summarize its main steps.
Let $V = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ be a set of points in $\mathcal{X} \subseteq \mathbb{R}^p$. In order to group the data
$V$ into $K$ clusters, the first step concerns the definition of a symmetric and continuous
function $\kappa : \mathcal{X} \times \mathcal{X} \to [0, \infty)$, called the kernel function. Afterwards, a similarity
matrix $W = (w_{ij})$ can be assigned by setting $w_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j) \geq 0$, for $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$,
and finally the normalized graph Laplacian matrix $L_{sym} \in \mathbb{R}^{n\times n}$ is introduced,
$L_{sym} = I - D^{-1/2}WD^{-1/2}$, where $D$ is the diagonal degree matrix with $d_{ii} = \sum_j w_{ij}$.
The data are then embedded via the first $K$ eigenvectors $\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_K$ of the Laplacian:
$\Phi_{\boldsymbol{\Gamma}}(\mathbf{x}_i) = (\gamma_{1i}, \ldots, \gamma_{Ki})$, $i = 1, \ldots, n$.
The selection of $K$ and $h$ is based on graphical features of the spectral decomposition:
the number of blocks of the Laplacian matrix; the maximum gap between
consecutive eigenvalues; the scatter plot of the mapped data in the feature space and,
in particular, the number of spikes counted in the embedded data space.
We remark that we cannot analyze all possible values of $h \in \{1, 2, \ldots, n-1\}$ and
hence we choose a suitable subset $\mathcal{H} \subset \{1, 2, \ldots, n-1\}$; in particular, we choose
$\mathcal{H} = \{1\%, 2\%, 5\%, 10\%, 15\%, 20\%\} \times n$ and select $h \in \mathcal{H}$ (see the following
procedure for details).
In this section, we propose a spectral approach for clustering three-way data. Three-way
data consist of a data set referring to the same sets of units and variables
observed in different situations, i.e., a set of multivariate matrices, that can be
organized in three modes: $n$ units, $p$ variables, and $t$ situations. We are therefore given
$n$ matrices representing the vertices of the graph, each composed of $p$
columns representing the variables and $t$ rows representing time or another
feature. We thus have a tensor of dimension $n \times t \times p$, i.e., the dataset is a tensor $\{X\}_{isk}$
for $i = 1, \ldots, n$, $s = 1, \ldots, t$, $k = 1, \ldots, p$.
We define a distance function $\delta_M$ between two matrices $A, B \in \mathbb{R}^{t\times p}$, with
$\delta_M : \mathbb{R}^{t\times p} \times \mathbb{R}^{t\times p} \to [0, +\infty)$ given by
$$\delta_M(A, B) := \|A - B\|_F = \sqrt{\sum_{s=1}^{t} \sum_{k=1}^{p} |a_{sk} - b_{sk}|^2}, \qquad (3)$$
where $\|\cdot\|_F$ is the Frobenius norm. Thus the distance between two units in the matrix
data $\mathbf{X}$ is equal to

$$\delta_M(X_{i_1 sk}, X_{i_2 sk}) = \sqrt{\sum_{s=1}^{t} \sum_{k=1}^{p} |X_{i_1 sk} - X_{i_2 sk}|^2}, \quad \text{for } i_1, i_2 = 1, \ldots, n. \qquad (4)$$
The kernel is then defined, analogously to (2), as $\kappa(i_1, i_2) = \exp\big(-\delta_M(X_{i_1}, X_{i_2})^2 / (\epsilon_{i_1}\epsilon_{i_2})\big)$, where $\epsilon_{i_1}$ and $\epsilon_{i_2}$ need to be selected as in the kernel defined in (2).
Afterwards, we compute the similarity matrix $W$ given by $w_{i_1 i_2} = \kappa(i_1, i_2)$, so that
we can apply the spectral clustering algorithm.
Finally, we point out that, differently from approaches based on mixtures of
matrix-variate data, the number of variables of the data set is not a critical issue,
because the spectral clustering algorithm is based on distance measures.
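Putting the pieces together, a compact sketch of the whole pipeline: pairwise Frobenius distances between the unit matrices (Eq. (4)), a self-tuning affinity with local scales set to the distance to each unit's h-th neighbour (our reading of the kernel, following [17]), then standard normalized spectral clustering. The final K-means step on the embedded points is a common choice rather than a prescription of the paper.

```python
# Three-way spectral clustering sketch; the tensor X has shape (n, t, p).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def three_way_spectral(X, K, h):
    n = X.shape[0]
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=(2, 3)))  # Eq. (4)
    eps = np.sort(D, axis=1)[:, h]                 # local scale: h-th neighbour
    W = np.exp(-D ** 2 / np.outer(eps, eps))       # self-tuning affinity
    np.fill_diagonal(W, 0)
    d = W.sum(1)
    L = np.eye(n) - W / np.sqrt(np.outer(d, d))    # normalized Laplacian L_sym
    _, vecs = eigh(L, subset_by_index=[0, K - 1])  # K smallest eigenvectors
    emb = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return KMeans(K, n_init=10, random_state=0).fit_predict(emb)

# e.g. labels = three_way_spectral(X, K=2, h=round(0.05 * X.shape[0]))
```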
We apply three-way spectral clustering to the analysis of the Insurance data set,
available in the splm R package. This dataset was initially introduced by [7] and
has recently been analyzed by [12]. The goal is to study the consumption of non-life
insurance during the years 1998-2002 in the 103 Italian provinces, so $t = 5$ and
$n = 103$. As regards the number of variables, we consider all the variables contained
in the data set, so $p = 11$. Thus, we have 103 matrices of dimension $5 \times 11$.
The 103 Italian provinces are divided into north-west (24 provinces), north-east
(22 provinces), center (21 provinces), south (23 provinces), and islands (13
provinces).
As regards the choice of $K$ and $h$, we consider the graphical approach introduced
in Section 3. In Figure 1, the geometric features of spectral clustering are plotted
as $h$ varies. From the number of blocks of the Laplacian matrix (Figure 1 a), the
first maximum eigengap (Figure 1 b) and the number of spikes in the feature space
(Figure 1 c), we deduce that the number of clusters is $K = 2$. For the selection of
Fig. 1 Insurance data. Spectral clustering features for each $h \in \mathcal{H}$ ($h = 1, 2, 5, 10, 15, 21$, i.e.,
1%, 2%, 5%, 10%, 15%, 20% of $n$): a) plot of the Laplacian matrix in greyscale; b) plot of the
first eight eigengap values; c) scatterplot of the embedded data along the directions $(\boldsymbol{\gamma}_1, \boldsymbol{\gamma}_2)$.
6 Conclusion
In this paper, a spectral approach to cluster three-way data has been proposed. The
data are organized in a tensor, and the vertices of the graph are represented by
matrices of dimension $t \times p$. In order to weigh the matrices in the graph, a
kernel function based on the Frobenius norm of the matrix difference has been
introduced. The performance of the spectral clustering algorithm has been shown on
one real three-way data set. Our method is competitive with respect to other clustering
methods proposed in the literature for matrix-data clustering. Finally, as a suggestion
for future research, other kernel functions could be introduced by considering
distances other than the Frobenius norm.
Acknowledgements This work was supported by the University of Catania grant PIACERI/CRASI
(2020).
References
1. Bocci, L., Vicari, D.: ROOTCLUS: Searching for "ROOT CLUSters" in Three-Way Proximity
Data. Psychometrika. 84, 941–985 (2019)
2. Di Nuzzo, C., Ingrassia, S.: A mixture model approach to spectral clustering and application
to textual data. Stat. Meth. Appl. Forthcoming (2022)
3. Di Nuzzo, C.: Model selection and mixture approaches in the spectral clustering algorithm.
Ph.D. thesis, Economics, Management and Statistics, University of Messina (2021)
4. Di Nuzzo, C., Ingrassia, S.: A joint graphical approach for model selection in the spectral
clustering algorithm. Tech. Rep. (2022)
5. Garcia Trillos, N., Hoffman, F., Hosseini, B.: Geometric structure of graph Laplacian embed-
dings. arXiv preprint arXiv:1901.10651. (2019)
6. Meila, M.: Spectral clustering. In Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.).
Handbook of Cluster Analysis. Chapman and Hall/CRC (2015)
7. Millo, G., Carmeci, G.: Non-life insurance consumption in Italy: A sub-regional panel data
analysis. J. Geogr. Syst. 12, 1–26 (2011)
8. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. Adv. Neural
Inf. Process. Syst. 14 (2002)
9. Sarkar, S., Zhu, X., Melnykov, V., Ingrassia, S.: On parsimonious models for modeling matrix
data. Comput. Stat. Data Anal. 142, 106822 (2020)
10. Schiebinger, G., Wainwright, M. J., Yu, B.: The geometry of kernelized spectral clustering.
Ann. Stat. 43(2), 819–846 (2015)
11. Tomarchio, S. D., Punzo, A., Bagnato, L.: Two new matrix-variate distributions with applica-
tion in model-based clustering. Comput. Stat. Data Anal. 152, 107050 (2020)
12. Tomarchio, S. D., McNicholas, P., Punzo, A.: Matrix normal cluster-weighted models. J.
Classif. 38, 556-575 (2021)
13. Tomarchio, S. D., Ingrassia, S., Melnykov, V.: Modeling students’ career indicators via mix-
tures of parsimonious matrix-normal distributions. Aust. New Zeal. J. Stat. Forthcoming
(2022)
14. Vichi, M., Rocci, R., Kiers, H. A. L.: Simultaneous component and clustering models for
three-way data: Within and between approaches. J. Classif. 24, 71–98 (2007)
15. Viroli, C.: Finite mixtures of matrix normal distributions for classifying three-way data. Stat.
Comput. 21, 511–522 (2011)
16. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
17. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. Adv. Neural Inf. Process. Syst.
17 (2004)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Improving Classification of Documents by
Semi-supervised Clustering in a Semantic Space
Jasminka Dobša ( )
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 40000 Varaždin,
Croatia, e-mail: [email protected]
Henk A. L. Kiers
Department of Psychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen,
The Netherlands, e-mail: [email protected]
1 Introduction
There are two main families of methods that deal with the representation of documents
and the words that index them: global matrix factorization methods, such as Latent
Semantic Analysis (LSA) [2], and local context window methods, such as the continuous
bag of words (CBOW) model and the continuous skip-gram model [8]. The latter
use neural networks to learn representations of words and have lately been intensively
explored in the scientific community, since the development of fast processors
has enabled the processing of huge amounts of data, which has resulted in performance
improvements across a wide spectrum of text mining and natural language tasks. However,
representing words solely by context window methods has a drawback, due to
the neglect of information about global corpus statistics [9].
In this paper we propose a method for the representation of documents based on
applying a penalized version of the Reduced 𝑘-means (RKM) method [4] to a
term-document matrix. The corpus of textual documents is represented by a sparse
term-document matrix in which entry (i, j) is equal to the weight of the i-th index
term for the j-th document. Term weights are given by the TfIdf weighting, which
combines local information about the frequency of the i-th term in the j-th document
with global information about the usage of the i-th term in the entire collection. A
benchmark method that applies global matrix factorization to term-document matrices
is LSA [2], which uses the truncated singular value decomposition (SVD) to represent
terms and documents in a lower-dimensional semantic space. SVD does not capture
the clustering structure of the data, which motivates the application of RKM.
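As an illustration, one common TfIdf variant can be computed as below; the exact weighting scheme used in the paper may differ, so this form is an assumption.

    # Hedged sketch: TfIdf weighting of a term-document matrix of raw counts.
    # Each term's local frequency is multiplied by log(N / document frequency).
    tfidf <- function(tf) {              # tf: terms x documents
      df  <- rowSums(tf > 0)             # number of documents containing each term
      idf <- log(ncol(tf) / df)          # global (inverse document) weight
      sweep(tf, 1, idf, "*")             # local weight times global weight
    }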
The rest of the paper is organized as follows: the second section describes related
work on the representation of documents and words, and methods of dimensionality
reduction related to RKM. The third section describes the modified RKM method
with penalization, while the fourth section describes an experiment on the Reuters-21578
data set. In the last section, conclusions and directions for further work are given.
2 Related Work
A benchmark method among those that utilize matrix factorization for the representation
of textual documents is LSA, introduced in 1990 [2]. With LSA, a sparse
term-document matrix is transformed via SVD into a dense matrix of the same
term-document type, with representations of words (index terms) and documents
in a lower-dimensional space. The idea is to map similar documents, i.e. those that
describe the same topics, close to each other, regardless of the terms used in them.
A very efficient application of LSA is in cross-lingual information retrieval, where
relevant documents for a query in one language are retrieved from a set of documents
in another language [7]. To our knowledge, the application
of methods that simultaneously cluster objects and extract factors in the field of text
mining is very limited. In [6] a method is proposed for cross-lingual information
retrieval based on the RKM method.
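For concreteness, LSA via the truncated SVD can be sketched in a few lines of R; this is a minimal illustration, not the code used in the paper.

    # Minimal sketch: truncated SVD of a term-document matrix A (terms x docs).
    # Documents are embedded as rows of V_k S_k and terms as rows of U_k S_k.
    lsa_embed <- function(A, k) {
      s  <- svd(A, nu = k, nv = k)
      Sk <- diag(s$d[1:k], k, k)
      list(terms = s$u %*% Sk, docs = s$v %*% Sk)
    }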
In [11], the RKM and FKM methods are compared using simulations and theoretically,
in order to identify the cases in which each should be applied. Timmerman and
associates also propose the method of Subspace 𝑘-means [12], which gives insight
into cluster characteristics in terms of the relative positions of the clusters, given by
the centroids, and the shape of the clusters, given by the within-cluster residuals.
in the least squares sense. The dimension of the lower-dimensional space must be
less than or equal to the number of clusters. The modified RKM with penalization
minimizes the loss function
4 Experiment
Experiments are conducted for classification on the Reuters-21578 data set, specifically
using the ModApte split, which assigns Reuters reports from April 7, 1987 and
earlier to the training set, and later reports, until the end of 1987, to the test set. It
consists of 9603 training and 3299 test documents. The collection has 90 classes that
contain at least one training and one test document. Documents are represented by a
bag-of-words representation. The list of index terms is formed from terms that appear
in at least four documents of the collection, which results in a list of 9867 index terms.
Classification is conducted by logistic regression (LR) and by the SVM algorithm. The
basic model is the bag-of-words representation (full representation), while representations
in the lower-dimensional space are obtained by SVD (Latent Semantic Analysis),
RKM, and RKM with penalization (𝜆 = 0.1, 0.2, 0.4, 0.6). For RKM and RKM with
penalization, representations are obtained by applying the matrix factorization to the
term-document matrix of the training documents, and by projecting the test documents
on the factors given by matrix A in the factorization. RKM is computed for 90 clusters
(which corresponds to the number of classes in the collection) using 𝑘 = 85 as the
dimension of the lower-dimensional space, and the truncated SVD is computed for
𝑘 = 85 as well. The RKM and RKM-with-penalization algorithms are run 10 times
(with different starting estimates), and the representation and factorization with the
minimal loss function are chosen. The optimal cost parameter for LR and SVM is
chosen by a grid search over the set of values 0.1, 0.5, 1, 10, 100, and 1000. For the
classification methods, the LiblineaR library in R is used, while the RKM and
RKM-with-penalization algorithms are implemented in Matlab.
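A minimal R sketch of this projection step follows, under the assumption that the factor matrix A (terms × 𝑘) has orthonormal columns; it is an illustration, not the authors' implementation.

    # Hedged sketch: project test documents (columns of D_test) onto the factors
    # in A learned on the training matrix; with orthonormal columns of A, the
    # least-squares coordinates reduce to t(A) %*% D_test.
    project_docs <- function(A, D_test) {   # A: terms x k; D_test: terms x n_test
      crossprod(A, D_test)                  # k x n_test document coordinates
    }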
4.2 Results
Results are given in terms of precision, recall, and the 𝐹1 measure of the classification.
Recall is the proportion of correctly classified samples among all positive samples
(i.e., samples actually belonging to the class, according to the expert), while precision
is the proportion of correctly classified samples among all samples classified as positive
by the model. Figures 1 and 2 show the average 𝐹1 measures of classification for
groups of 5 classes sorted in descending order by size, i.e. by the number of training
documents (which is 2877 to 389 for classes 1-5, 369 to 181 for classes 6-10,
140 to 111 for classes 11-15, 101 to 75 for classes 16-20, 75 to 55 for classes 21-25,
50 to 41 for classes 26-30, 40 to 37 for classes 31-35, 35 to 24 for classes 36-40,
23 to 19 for classes 41-45, 18 to 16 for classes 46-50, 16 to 13 for classes 51-55,
and 13 to 10 for classes 56-60). Figure 1 shows the results for classification by LR,
while Figure 2 shows those for classification by SVM. Only the 60 largest classes are observed
since smaller classes (less than 10 training documents) are not interesting for the
research, because for those classes recall is low, and it can be expected that the full
bag-of-words representation will result in better recognition, since such classes can
possibly be recognized by key words but not by transformed representations. It can be
seen that the 𝐹1 measures are comparable for the full representation and the various
representations by RKM with penalization, for both classification algorithms, for the
biggest 25 classes. For smaller classes, the results for the representation by RKM with
penalization are unstable, although for some classes they were better than for the basic
representation (in the case of LR). Classification with the representations obtained by
SVD and by RKM without penalization resulted in lower 𝐹1 measures for all class sizes.
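For reference, the per-class measures defined above can be computed as in the following minimal R sketch (a hypothetical helper, assuming binary class labels coded 0/1).

    # Minimal sketch: per-class precision, recall, and F1 from predicted and
    # true binary labels.
    prf <- function(pred, truth) {
      tp <- sum(pred == 1 & truth == 1)
      precision <- tp / sum(pred == 1)
      recall    <- tp / sum(truth == 1)
      c(precision = precision, recall = recall,
        F1 = 2 * precision * recall / (precision + recall))
    }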
Table 1 shows the average precision, recall, and 𝐹1 measures for the 25 largest
classes, for both classification algorithms and all observed representations. In the case
of classification by LR, the average recall is improved for the representation by RKM
with penalization (for 𝜆 = 0.4) by approximately 1% compared to the basic full
representation. For classification by SVM, the average precision is improved for the
representation by RKM with penalization (for 𝜆 = 0.6) by almost 6%, and the 𝐹1
measure is improved for the representation by RKM with penalization (𝜆 = 0.4) by 2%
compared to the basic full representation. The best results are obtained for classification
by the SVM algorithm and the representation by RKM with penalization with 𝜆 = 0.2,
for which precision is improved by 5% with a similar level of recall as in the basic
representation.
Fig. 2 Average 𝐹1 measure of classification by SVM for 5 classes sorted by their size.
Table 1 Average precision, recall, and 𝐹1 measure of classification for the 25 largest classes.
The original RKM method without the proposed modification does not have the same effect
on classification performance; it has a similar effect as the LSA method.
The proposed representation method can improve the precision and recall of classification
for sufficiently large classes, i.e. those that have enough training documents
to enable capturing the semantic relations and characteristics of the classes. The more
important effect can be observed in the improvement of precision.
In the future we plan to investigate hybrid models using representations of words by
neural language models, and applications in different domains, such as the classification
of images.
References
1. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model.
Journal of Machine Learning Research 3, 1137-1155 (2003)
2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R. A.: Indexing
by latent semantic analysis. Journal of the American Society for Information Science 41(6),
381-407 (1990)
3. De Sarbo, W. S., Jedidi, K., Cool, K., Schendel, D.: Simultaneous multidimensional unfolding
and cluster analysis: an investigation of strategic groups. Marketing Letters, 2, 129-146 (1990)
4. De Soete, G., Carroll, J. D.: 𝐾 -means clustering in a low-dimensional Euclidean space. In:
Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., Burtschy, B. (eds.) New Approaches
in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge
Organization, pp. 212-219. Springer, Heidelberg (1994)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional
transformers for language understanding. In: Proceedings of Annual Conference of the North
American Chapter of the Association for Computation Linguistic, pp. 4171-4186, Association
for Computational Linguistic (2019)
6. Dobša, J., Mladenić, D., Rupnik, J., Radošević, D., Magdalenić, I.: Cross-language information
retrieval by Reduced 𝑘-means, International Journal of Computer Information Systems and
Industrial Management Applications, 10, 314-322 (2018)
7. Dumais, S., Letsche, T., Littman, M., Landauer, T.: Automatic cross-language retrieval using
latent semantic indexing. In: Proceedings of the AAAI Spring Symposium on Cross-Language
Text and Speech Retrieval, pp. 15-21. American Association for Artificial Intelligence (1997)
8. Mikolov, T., Chen, K., Corrado, G. S., Dean, J.: Efficient estimation of word representations
in vector space (2013). Available via arXiv.org
https://arxiv.org/abs/1301.3781. Cited 21 Jan 2022
9. Pennington, J., Socher, R., Manning, C. D.: GloVe: Global vectors for word representation. In:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 1532-1543, Association for Computational Linguistics, (2014)
10. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep
contextualized word representations. In: Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
1:2227-2237 (2018)
11. Timmerman, M. E., Ceulemans, E., Kiers, H. A. L., Vichi, M.: Factorial and Reduced 𝑘-means
reconsidered. Computational Statistics & Data Analysis, 54, 1856-1871 (2010)
12. Timmerman, M. E., Ceulemans, E., De Roover, K., Van Leeuwen, K.: Subspace 𝑘-means
clustering. Behavior Research Methods, 45, 1011-1023 (2013)
13. Vichi, M., Kiers, H. A. L.: Factorial 𝑘-means analysis for two-way data, Computational
Statistics & Data Analysis, 37, 49-64 (2001)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Trends in Data Stream Mining
João Gama
Abstract Learning from data streams is a hot topic in machine learning and data
mining. This article presents our recent work on the topic of learning from data
streams. We focus on emerging topics, including fraud detection and hyper-parameter
tuning for streaming data. The first study is a case study on interconnected by-pass
fraud. This is a real-world problem from high-speed telecommunications data that
clearly illustrates the need for online data stream processing. In the second study,
we present an optimization algorithm for online hyper-parameter tuning from non-
stationary data streams.
João Gama ( )
FEP-University of Porto and INESC TEC
R. Dr. Roberto Frias, Porto, Portugal, e-mail: [email protected]

1 Introduction
The high asymmetry of international termination rates with regard to domestic ones,
where international calls have higher charges applied by the operator where the call
terminates, is fertile ground for the appearance of fraud in Telecommunications.
There are several types of fraud that exploit this type of rate differential, with
Interconnect Bypass Fraud being one of the most significant [1, 3].
In this type of fraud, one of the several intermediaries responsible for delivering the
calls forwards the traffic over a low-cost IP connection, reintroducing the call into the
destination network as a local call by using VoIP gateways. This way, the entity that
sent the traffic is charged the amount corresponding to the delivery of international
traffic but, since the call is illegally delivered as national traffic, it does not have to
pay the international termination fee, appropriating this amount.
Traditionally, telecom operators analyze the calls of these gateways to detect fraud
patterns and, once the gateways are identified, have their SIM cards blocked. The
constant evolution of the technology adopted in these gateways allows them to work
like real SIM farms, capable of manipulating identifiers, simulating standard call
patterns similar to those of regular users, and even being mounted on vehicles to
complicate detection based on location information.
Interconnect bypass fraud detection algorithms typically consume a stream 𝑆 of
events, where 𝑆 contains information about the origin number (A-Number), the
destination number (B-Number), the associated timestamp, and the status of the call
(accomplished or not). The expected output of this type of algorithm is a set of
potentially fraudulent A-Numbers that require validation by the telecom operator.
This process is not fully automated, to avoid blocking legitimate A-Numbers and
incurring penalties. In interconnect bypass fraud, we can observe three different types
of abnormal behavior:
1. bursts of calls, i.e. A-Numbers that produce enormous quantities of calls (above
the number of calls of all other A-Numbers) during a specific time window 𝑊,
whose size is typically small;
2. repetitions, i.e. the repetition of some pattern of calls produced by an A-Number
during consecutive time windows 𝑊;
3. mirror behaviors, i.e. two distinct A-Numbers (typically from the same country)
that produce the same pattern of calls during a time window 𝑊.
Figures 1 and 2 present the evolving top-10 most active phone numbers. Figure 1
presents the top-10 cumulative counts, while Figure 2 presents the top-10 counts
with forgetting.
[Fig. 1 Evolving top-10 phone numbers, cumulative counts, June 2019. Fig. 2 Evolving top-10 phone numbers, counts with forgetting, June 2019.]
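The counts with forgetting in Figure 2 can be obtained, for instance, with an exponentially decayed counter; the following R sketch is one plausible scheme and an assumption, since the exact counting method is described in [5].

    # Hedged sketch: a frequency count with exponential forgetting. On each new
    # call of a number, the old count is decayed by alpha^(elapsed time) and
    # then incremented; alpha is an illustrative forgetting factor.
    decay_count <- function(count, t_last, t_now, alpha = 0.999) {
      count * alpha^(t_now - t_last) + 1
    }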
the closest to the best vertex, and the worst (𝑊); 𝑀 is a mid vertex (auxiliary model).
The bottom panel in Figure 3 describes the four operations: Contraction, Reflection,
Expansion, and Shrink.
For each Nelder-Mead operation, it is necessary to compute an additional set of
vertices (midpoint 𝑀, reflection 𝑅, expansion 𝐸, contraction 𝐶 and shrinkage 𝑆)
and to verify whether the computed vertices belong to the search space. First, the
algorithm computes the midpoint (𝑀) of the best face of the shape, as well as the
reflection point (𝑅). After this initial step, it determines whether to reflect or expand
based on a set of heuristics.
The dynamic sample size, which is based on the RMSE metric, attempts to identify
significant changes in the streamed data. Whenever such a change is detected, the
Nelder-Mead algorithm compares the performance of the 𝑛 + 1 models under analysis
to choose the most promising model. The sample size 𝑆size is given by Equation (1),
where 𝜎 represents the standard deviation of the RMSE and 𝑀 the desired error
margin; we use 𝑀 = 95%:

$S_{\mathrm{size}} = \dfrac{4\sigma^2}{M^2}$  (1)

Fig. 3 SPT working modes: Exploration and Deployment. The bottom panel illustrates the Nelder & Mead operators.
However, to avoid using small samples, which imply error estimates with large
variance, we defined a lower bound of 30 samples. The adaptation of the Nelder-Mead
algorithm to online scenarios relies extensively on parallel processing. The main
thread launches the 𝑛 + 1 model threads and starts a continuous event-processing
loop. This loop dispatches the incoming events to the model threads and, whenever
it reaches the sample size interval, assesses the running models and calculates the
new sample size. The model assessment involves ordering the 𝑛 + 1 models by
RMSE value and applying the Nelder-Mead algorithm to substitute the worst model.
The parallel Nelder-Mead implementation creates a dedicated thread per Nelder-Mead
operator, totaling seven threads. Each operator thread generates a new model and
calculates the incremental RMSE using the instances of the last sample size interval.
The worst model is then substituted by the operator thread model with the lowest RMSE.
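For concreteness, the dynamic sample size rule can be written as a short R function; this is a minimal sketch of Equation (1) with the 30-observation lower bound, not the authors' implementation.

    # Minimal sketch: sample size from Equation (1), bounded below by 30.
    # sigma: standard deviation of the RMSE; M: desired error margin.
    sample_size <- function(sigma, M = 0.95) {
      max(30, ceiling(4 * sigma^2 / M^2))
    }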
Figure 4 presents the critical difference diagram [2] of three hyper-parameter
tuning strategies (SPT, grid search, and default parameter values) on four benchmark
classification datasets. The diagram clearly illustrates the good performance of SPT.

Fig. 4 Critical difference diagram comparing self hyperparameter tuning, grid hyperparameter tuning, and default parameters in 4 classification problems.

4 Conclusions

This paper reviews our recent work on learning from data streams. The two works
present different approaches to dealing with high-speed and time-evolving data:
from applied research in fraud detection to fundamental research on hyperparameter
optimization for streaming algorithms. The first work identifies bursts of activity
in phone calls, using approximate counting with forgetting. The last work presents a
streaming optimization method to find the minimum of a function, and applies it to
finding the hyper-parameter values that minimize the error. We believe that the
two works reported here will have an impact on the work of other researchers.
Acknowledgements I would like to thank my collaborators Bruno Veloso and Rita P. Ribeiro, who
contributed to this work.
References
1. Ali, M. A., Azad, M. A., Centeno, M. P., Hao, F., van Moorsel, A.: Consumer-facing technology
fraud: Economics, attack methods and potential solutions. Future Generation Computer Systems,
100, 408–427 (2019)
2. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research, 7(Jan), 1–30 (2006)
3. Laleh, N., Azgomi, M. A.: A taxonomy of frauds and fraud detection techniques. In International
Conference on Information Systems, Technology and Management, pp. 256–267. Springer
(2009)
4. Veloso, B., Gama, J., Malheiro, B., Vinagre, J.: Hyperparameter self-tuning for data streams.
Information Fusion, 76, 75–86 (2021)
5. Veloso, B., Tabassum, S., Martins, C., Espanha, R., Azevedo, R., Gama, J.: Interconnect bypass
fraud detection: a case study. Annales des Télécommunications, 75(9), 583–596 (2020)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Old and New Constraints in Model Based
Clustering
L. A. García-Escudero
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain,
e-mail: [email protected]
A. Mayo-Iscar
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain,
e-mail: [email protected]
G. Morelli
Department of Economics and Management and Interdepartmental Centre of Robust Statistics,
University of Parma, Italy, e-mail: [email protected]
M. Riani ( )
Department of Economics and Management and Interdepartmental Centre of Robust Statistics,
University of Parma, Italy, e-mail: [email protected]
1 Introduction
On the other hand, in the mixture likelihood approach, we seek the maximization
of
$\sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} \pi_j \, \phi(x_i; \mu_j, \Sigma_j) \right)$,  (2)
with similar notation and conditions on the parameters as above. In this second
approach, a partition into 𝑘 groups can also be obtained from the fitted mixture
model, by assigning each observation to the cluster-component with the highest
posterior probability.
Unfortunately, it is well known that the maximization of “log-likelihoods” like
(1) and (2) without constraints on the Σ 𝑗 matrices is a mathematically ill-posed
problem [1, 2]. To see this unboundedness issue, we can just take 𝜇1 = 𝑥1, 𝜋1 > 0
and |Σ1 | → 0, making (2) diverge to infinity, or (1) diverge as well with 𝐻1 = {1}.
This lack of boundedness can be solved by just focusing on local maxima of
the likelihood target functions. However, many local maxima are often found and
it is difficult to know which are the most interesting ones. See [3] for a detailed
discussion of this issue. In fact, non-interesting local maxima denoted as “spurious"
solutions, which consist of a few, almost collinear, observations, are often detected
by the Classification EM algorithm (CEM), traditionally applied when maximizing
(1), and by the EM algorithm, traditionally applied when maximizing (2). A recent
review of approaches for dealing with this lack of boundedness and for reducing the
detection of spurious solutions can be found in [4].
It is also common to enforce constraints on the Σ 𝑗 scatter matrices when maxi-
mizing (1) or (2). Among them, the use of “parsimonious” models [5, 6] is one of
the most popular and widely applied approaches in practice. These parsimonious
models follow from a decomposition of the Σ 𝑗 scatter matrices as
Σ 𝑗 = 𝜆 𝑗 Ω 𝑗 Γ 𝑗 Ω0𝑗 , (3)
𝑝
where {𝜆𝑙 (Σ 𝑗 )}𝑙=1 are the set of eigenvalues of the Σ 𝑗 matrix, 𝑗 = 1, ..., 𝑘.
With this eigenvalue-ratio approach, we need a very high 𝑐∗ value to be close to
affine equivariance. Unfortunately, such a high 𝑐∗ value does not always successfully
prevent us from running into spurious solutions.
García-Escudero et al. [13] have recently introduced three different types of con-
straints on the Σ 𝑗 matrices which depend on three constants 𝑐 det , 𝑐 shw and 𝑐 shb all of
them being greater than or equal to 1.
The first type of constraint serves to control the maximal ratio among determinants
and, consequently, the maximum allowed difference between component volumes:
“deter”:  $\dfrac{\max_{j=1,\ldots,k} |\Sigma_j|}{\min_{j=1,\ldots,k} |\Sigma_j|} = \dfrac{\max_{j=1,\ldots,k} \lambda_j^{\,p}}{\min_{j=1,\ldots,k} \lambda_j^{\,p}} \leq c_{\mathrm{det}}$.  (5)
The second type of constraint controls departures from sphericity “within” each
component:
shape-“within”:  $\dfrac{\max_{l=1,\ldots,p} \gamma_{jl}}{\min_{l=1,\ldots,p} \gamma_{jl}} \leq c_{\mathrm{shw}}$  for $j = 1, \ldots, k$.  (6)
This provides a set of 𝑘 constraints which, in the most constrained case 𝑐 shw = 1,
impose Γ1 = ... = Γ𝑘 = I 𝑝 , where I 𝑝 is the identity matrix of size 𝑝, i.e., sphericity
of the components.
Note that the new determinant-and-shape constraints (based on 𝑐 det > 1 and
𝑐 shw = 1) in (4) allow us to deal with spherical “heteroscedastic" cases, whereas the
eigenvalue ratio constraint with 𝑐∗ = 1 can only handle the spherical “homoscedastic"
case. Constraints (5) and (6) were the basis for the “deter-and-shape” constraints in
[14]. These two constraints alone resulted in mathematically well-defined constrained
maximizations of the likelihoods in (1) and (2). However, although highly operative
in many cases, they do not include, as limit cases, all the already mentioned 14
parsimonious models. For instance, we may be interested in the same (or not very
different) Γ 𝑗 or Σ 𝑗 matrices for all the mixture components and these cannot be
obtained as limit cases from the “deter-and-shape” constraints.
The third constraint serves to control the maximum allowed difference between
shape elements “between” components:
shape-“between”:  $\dfrac{\max_{j=1,\ldots,k} \gamma_{jl}}{\min_{j=1,\ldots,k} \gamma_{jl}} \leq c_{\mathrm{shb}}$  for $l = 1, \ldots, p$.  (7)
This new type of constraint allows us to impose “similar” shape matrices for the
components and, consequently, to enforce Γ1 = ... = Γ𝑘 in the most constrained
case 𝑐 shb = 1.
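To make the three constraints concrete, the following R sketch (not from the paper) computes the three ratios in (5), (6) and (7) from a list of scatter matrices, so that candidate values of 𝑐 det , 𝑐 shw and 𝑐 shb can be checked.

    # Hedged sketch: the "deter", shape-"within" and shape-"between" ratios for
    # a list Sigma of p x p scatter matrices, following the decomposition (3).
    constraint_ratios <- function(Sigma) {
      ev   <- sapply(Sigma, function(S)
                sort(eigen(S, symmetric = TRUE)$values, decreasing = TRUE))  # p x k
      dets <- apply(ev, 2, prod)                 # |Sigma_j|
      lam  <- dets^(1 / nrow(ev))                # volume lambda_j = |Sigma_j|^(1/p)
      Gam  <- sweep(ev, 2, lam, "/")             # shape elements gamma_jl
      list(deter = max(dets) / min(dets),
           shw   = max(apply(Gam, 2, function(g) max(g) / min(g))),
           shb   = max(apply(Gam, 1, function(g) max(g) / min(g))))
    }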
Figure 1 shows an example based on three groups. The data have been generated
imposing equal determinants (𝑐 det = 1), a noticeable departure from sphericity “within”
each component (𝑐 shw = 40), and a very moderate difference “between” the shape
elements of the components (𝑐 shb = 1.3). No constraint has been imposed on the
rotation matrices. Finally, an average overlap of 0.10 has been imposed. These data
sets have been generated through the MixSim method of [15], as extended by [16] and
incorporated into the FSDA Matlab toolbox [17]. The overlap is defined as a sum of
pairwise misclassification probabilities; see [16] for more details.
Fig. 1 An example with simulated data with 3 clusters in two dimensions. The average overlap is 0.10. The data have been generated using equal determinants, a moderate difference between shape elements “between” components, and a noticeable departure from sphericity “within” each component.

The application of the traditional tclust approach, with a maximum ratio between
eigenvalues (𝑐∗) respectively equal to 128 and $10^{10}$, produces the classifications
shown in the left panels of Figure 2. In fact, the results in the top left panel would be
exactly the same for any choice of 𝑐∗ within the interval [16, 128]. This means that
an apparently higher value of 𝑐∗ would be needed to detect those two almost parallel
clusters shown in Figure 1. However, choosing a greater value for 𝑐∗ may destroy the
desired protection against spurious solutions provided by the constraints. For example,
we see in the lower left panel how the choice 𝑐∗ = $10^{10}$ results in the detection
of a spurious group consisting of a single observation.
The panels on the right, on the other hand, show the partitions resulting from the
3 new constraints imposed on the component covariance matrices. The top right
panel shows the result of applying the 3 new restrictions with values of the tuning
constants very close to the real values used to generate the data set. We can see
that, in this case, it is possible to recover the real structure of the data-generating
process. Moreover, the real cluster structure is also recovered in the lower right panel
by choosing larger values of these tuning constants, but not too large, precisely to
avoid the detection of spurious solutions. Some guidelines on how to choose these
tuning constants can be found in [13].
Fig. 2 Comparison between the traditional (left panels) and new tclust procedure (right panels).
References
1. Kiefer, J., Wolfowitz, J.: Consistency of the maximum likelihood estimator in the presence of
infinitely many incidental parameters. Ann. Math. Stat. 27, 887-906 (1956)
2. Day, N. E.: Estimating the components of a mixture of normal distributions. Biometrika, 56,
463-474 (1969)
3. McLachlan, G., Peel, D. A.: Finite Mixture Models. Wiley Series in Probability and Statistics,
New York (2000)
4. García-Escudero, L. A., Gordaliza, A., Greselin, F., Ingrassia, S., Mayo-Iscar, A.: Eigenvalues
and constraints in mixture modeling: geometric and computational issues. Adv. Data Anal.
Classif. 12, 203-233 (2018)
5. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28, 781-
793 (1995)
6. Banfield, J. D., Raftery, A. E.: Model-based Gaussian and non-Gaussian clustering. Biometrics
49, 803-821 (1993)
7. Hathaway, R. J.: A constrained formulation of maximum likelihood estimation for normal
mixture distributions. Ann. Stat. 13, 795-800 (1985)
8. Ingrassia, S., Rocci, R.: Constrained monotone EM algorithms for finite mixture of multivariate
Gaussians. Comput. Stat. Data Anal. 51, 5339-5351 (2007)
9. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming
approach to robust cluster analysis. Ann. Stat. 36, 1324-1345 (2008)
10. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Exploring the number of
groups in robust model-based clustering. Stat. Comput. 21, 585-599 (2011)
11. García-Escudero, L. A., Gordaliza, A., Mayo-Iscar, A.: A constrained robust proposal for
mixture modeling avoiding spurious solutions. Adv. Data Anal. Classif. 8, 27-43 (2014)
12. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Avoiding spurious local
maximizers in mixture modeling. Stat. Comput. 25, 619-633 (2015)
13. García-Escudero, L. A., Mayo-Iscar, A., Riani, M.: Constrained parsimonious model-based
clustering. Stat. Comput. 32 (2022)
14. García-Escudero, L. A., Mayo-Iscar, A., Riani, M.: Model-based clustering with determinant-
and-shape constraint. Stat. Comput. 25, 1-18 (2020)
15. Maitra, R., Melnykov, V.: Simulating data to study performance of finite mixture modeling
and clustering algorithms. J. Comput. Graph. Stat. 19, 354-376 (2010)
16. Riani, M., Cerioli, A., Perrotta, D., Torti, F.: Simulating mixtures of multivariate data with
fixed cluster overlap in FSDA library. Adv. Data Anal. Classif. 9, 461-481 (2015)
17. Riani, M., Perrotta, D., Torti, F.: FSDA: a Matlab toolbox for robust analysis and interactive
data exploration. Chemometr. Intell. Lab. Syst. 116, 17-32 (2012)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Clustering Student Mobility Data in 3-way
Networks
1 Introduction
1 Database MOBYSU.IT [Mobilità degli Studi Universitari in Italia], research protocol MUR -
Universities of Cagliari, Palermo, Siena, Torino, Sassari, Firenze, Cattolica and Napoli Federico II,
Scientific Coordinator Massimo Attanasio (UNIPA), Data Source ANS-MUR/CINECA.
The Infomap algorithm searches for the partition M into 𝑚 modules that minimizes the map equation

$L(M) = q_{\curvearrowright} H(Q) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(P^{i})$.  (1)

In equation (1), $q_{\curvearrowright} H(Q)$ represents the entropy of the movement between modules, weighted by the probability that the random walker switches modules on any given step ($q_{\curvearrowright}$), and $\sum_{i=1}^{m} p_{\circlearrowright}^{i} H(P^{i})$ is the entropy of movements within modules, weighted by the fraction of within-module movements that occur in module $i$, plus the probability of exiting module $i$ ($p_{\circlearrowright}^{i}$), such that $\sum_{i=1}^{m} p_{\circlearrowright}^{i} = 1 + q_{\curvearrowright}$ [9].
In our case, the Infomap algorithm is adopted to discover communities of students
characterized by similar mobility patterns. Indeed, to analyse mobility data, where
links represent patterns of student movement among territorial units and universities,
flow-based approaches are likely to identify the most important features. Finally, in
our student mobility network, in order to focus only on relevant student flows, a
filtering procedure is adopted, based on the Empirical Cumulative Distribution
Function (ECDF) of the link-weight distribution.
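Although the exact pipeline is not reproduced here, Infomap is available in the igraph R package; the sketch below is a minimal illustration, and the data frame edges (with columns from, to and weight) is an assumption.

    # Minimal sketch: Infomap on a weighted directed network built from an
    # assumed edge list "edges" with columns from, to, weight.
    library(igraph)
    g  <- graph_from_data_frame(edges, directed = TRUE)
    cl <- cluster_infomap(g, e.weights = E(g)$weight)
    table(membership(cl))   # size of each detected community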
Students’ cohorts enrolled in Italian universities in four academic years (a.y.) 2008–
09, 2011–12, 2014–15, and 2017–18 are analysed. The number of nodes for the sets
V𝑃 (107 provinces), V𝑈 (79-80 universities), and V𝐸 (45 educational programmes),
and the number of students involved in the four cohorts are quite stable over time
(Table 1). Furthermore, the percentage of movers (i.e., students enrolled in a univer-
sity outside of their region of residence) increased, from 16.4% in the a.y. 2008–09
to 20.6% in the a.y. 2017–18, and it is higher for males than females.
Table 1 Percentage of students according to their mobility status (Stayers% and Movers%) by cohort and gender.
Following the network simplification approach, the tripartite networks (one for
each cohort) are simplified into bipartite networks, and the four ECDFs of the link
weights are considered in order to filter relevant flows. The distributions show that
more than 50% of the links between pairs of nodes have a weight equal to 1 (i.e.,
flows of only one student), and that about 95% of the links correspond to single-digit
flows. Thus, the networks holding links with a weight greater than or equal to 10 are
further analysed.
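A minimal R sketch of this filtering step follows; the data frame edges with a weight column w is an assumption about the data layout.

    # Minimal sketch: inspect the ECDF of link weights, then keep flows >= 10.
    Fw <- ecdf(edges$w)        # empirical cumulative distribution of weights
    Fw(1)                      # share of links with weight <= 1 (over 50% here)
    edges_f <- edges[edges$w >= 10, ]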
To reveal groups of universities and educational programmes attracting students,
the Infomap community detection algorithm is applied. Looking at Table 2, we
notice a reduction in the number of communities from the first to the last student
cohort, suggesting a sort of stabilization in the trajectories of movers towards brand
universities of the center-north, along with an increase in north-north mobility [20],
and a relevant dichotomy between scientific and humanistic educational programmes.
Network visualizations by groups (Figures 1 and 2) confirm that the most attractive
universities are located in the north of Italy, especially for educational programmes
in economics and engineering (the Bocconi University, the Polytechnic of Turin and
the Cattolica University).
Table 2 Number of communities, codelength, and relative saving in codelength per cohort.

Cohort    Communities   Codelength   Relative saving codelength
2008–09        14           0.96            83%
2011–12        17           1.72            70%
2014–15         3           5.23            12%
2017–18         3           1.00            83%
3 Concluding Remarks
The proposed network simplification strategy on tripartite graphs defined for student
mobility data provides interesting insights into the phenomenon under analysis. The
main attractive destinations remain the northern universities, for educational
programmes such as engineering and business. Besides the well-known south-to-north
route, other interregional routes within the northern area appear. In addition, the
reduction in the number of communities suggests a sort of stabilization in the mobility
routes of movers towards brand universities, highlighting student university destination
choices close to the labor market demand.
Hyper-graphs and multipartite networks remain very active areas of research and
challenging tasks for scholars interested in discovering the complexities underlying
these kinds of data. Specific tools for such complex network structures should be
designed by combining network analysis with other statistical techniques. As future
lines of research, we will compare community detection algorithms that better represent
the structural constraints of the phenomena under analysis, and we will assess
other backbone approaches to filter the significant links.
Acknowledgements The contribution has been supported from Italian Ministerial grant PRIN 2017
“From high school to job placement: micro-data life course analysis of university student mobility
and its impact on the Italian North-South divide", n. 2017 HBTK5P - CUP B78D19000180001.
References
1. Agresti, A.: Categorical Data Analysis (Vol. 482). John Wiley & Sons, New York (2003)
2. Barber, M. J.: Modularity and community detection in bipartite networks. Phys. Rev. E, 76,
066102 (2007)
3. Batagelj, V., Ferligoj, A., Doreian, P.: Indirect Blockmodeling of 3-Way Networks. In: Brito
P., Cucumel G., Bertrand P., de Carvalho F. (eds) Selected Contributions in Data Analysis
and Classification. Studies in Classification, Data Analysis, and Knowledge Organization, pp.
151–159. Springer, Berlin, Heidelberg (2007)
4. Blöcker, C., Rosvall, M.: Mapping flows on bipartite networks. Phys. Rev. E, 102, 052305
(2020)
5. Blondel, V. D., Guillaume, J. L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities
in large networks. J. Stat. Mech.-Theory E, 10, P10008 (2008)
6. Borgatti, S. P., Everett, M. G.: Regular blockmodels of multiway, multimode matrices. Soc.
Networks, 14, 91–120 (1992)
7. Columbu, S., Porcu, M., Primerano, I., Sulis, I., Vitale, M.P.: Geography of Italian student
mobility: A network analysis approach. Socio. Econ. Plan. Sci. 73, 100918 (2021)
8. Columbu, S., Porcu, M., Primerano, I., Sulis, I., Vitale, M. P.: Analysing the determinants of
Italian university student mobility pathways. Genus, 77, 34 (2021)
9. Edler, D., Bohlin, L., Rosvall, M.: Mapping higher-order network flows in memory and
multilayer networks with infomap. Algorithms, 10, 112 (2017)
10. Everett, M. G., Borgatti, S.: Partitioning multimode networks. In: Doreian, P., Batagelj, V.,
Ferligoj, A. (eds.) Advances in Network Clustering and Blockmodeling, pp. 251-265, John
Wiley & Sons, Hoboken, USA (2020)
11. Fararo, T. J., Doreian, P.: Tripartite structural analysis: Generalizing the Breiger-Wilson for-
malism. Soc. Networks, 6, 141–175 (1984)
12. Genova, V. G., Tumminello, M., Aiello, F., Attanasio, M.: Student mobility in higher educa-
tion: Sicilian outflow network and chain migrations. Electronic Journal of Applied Statistical
Analysis, 12, 774–800 (2019)
13. Genova, V. G., Tumminello, M., Aiello, F., Attanasio, M.: A network analysis of student
mobility patterns from high school to master’s. Stat. Method. Appl., 30, 1445–1464 (2021)
14. Ikematsu, K., Murata, T.: A fast method for detecting communities from tripartite networks.
In: International Conference on Social Informatics, pp. 192-205. Springer, Cham (2013)
15. Melamed, D., Breiger, R. L., West, A. J.: Community structure in multi-mode networks:
Applying an eigenspectrum approach. Connections, 33, 18–23 (2013)
16. Murata, T.: Detecting communities from tripartite networks. In: Proceedings of the 19th
international conference on world wide web, pp. 1159-1160. (2010)
17. Neubauer, N., Obermayer, K.: Tripartite community structure in social bookmarking data.
New Rev. Hypermedia M., 17, 267-294 (2011)
18. Newman, M. E., Girvan, M.: Finding and evaluating community structure in networks. Phys.
Rev. E, 69, 026113 (2004)
19. Newman, M. E.: Modularity and community structure in networks. Proceedings of the National
Academy of Sciences, 103, 8577-8582 (2006)
20. Rizzi, L., Grassetti, L. Attanasio, M.: Moving from North to North: how are the students’
university flows? Genus 77, 1–22 (2021)
21. Santelli, F., Scolorato, C., Ragozini, G.: On the determinants of student mobility in an inter-
regional perspective: A focus on Campania region. Statistica Applicata - Italian Journal of
Applied Statistics, 31, 119–142 (2019)
22. Santelli, F., Ragozini, G., Vitale, M. P.: Assessing the effects of local contexts on the mobility
choices of university students in Campania region in Italy. Genus, 78, 5 (2022)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Clustering Brain Connectomes Through a
Density-peak Approach
Riccardo Giubilei ( )
Luiss Guido Carli, Rome, Italy, e-mail: [email protected]

1 Introduction
Clustering is the task of grouping elements from a set in such a way that elements
in the same group, also defined as cluster, are in some sense similar to each other,
and dissimilar to those from other groups. Mode-based clustering is a nonparametric
approach that works by first estimating the density, and then identifying in some
way its modes and the corresponding clusters. An effective method to find modes
and clusters is through the density-peak (DP) algorithm [12], which has drawn
considerable attention since its introduction in 2014. One of the striking advantages
of DP is that it does not require data to be embedded in vector spaces, implying that
it can be applied to arbitrary data types, provided that a proper distance is defined.
In this work, we focus on its application to clustering graph-structured data objects.
The expression graph clustering can refer either to within-graph clustering or
to between-graph clustering. In the first case, the elements to be grouped are the
vertices of a single graph; in the second, the objects are distinct graphs. Here, graph
clustering is intended as between-graph clustering. Between-graph clustering is an
emerging but increasingly important task due to the growing need of analyzing and
comparing multiple graphs [10, 4]. Potential applications include clustering: brain
networks of different people for ability assessment, disease prevention, or disease
evaluation; online social ego networks of different users to find people with similar
social structures; different snapshots of the same network evolving over time to
identify similar patterns, cycles, or abrupt changes.
Heretofore, the task of between-graph clustering has not been exhaustively in-
vestigated in the literature, implying a substantial lack of well-established methods.
The goal of this work is to improve and adapt the density-peak algorithm to define a
fairly general method for between-graph clustering. For validation and comparison
purposes, the resulting procedure and its main competitors are applied to grouping
brain connectomes of different people to distinguish between patients affected by
schizophrenia and healthy controls.
2 Related Work
Existing techniques for between-graph clustering can be divided into two main
categories: 1) transforming graph-structured data objects into Euclidean feature
vectors in order to apply standard clustering algorithms; 2) using the distances
between the original graphs in distance-based clustering methods.
The most common technique within the first category is the use of classical
clustering techniques on the vectorized adjacency matrices [10]. Nonetheless, more
advanced numerical summaries have been proposed to better capture the structural
properties of the graphs and to decrease feature dimensionality. Examples include:
shell distribution [1], traces of powers of the adjacency matrix [10], and graph
embeddings such as graph2vec [11]; see [4] for a longer list. Techniques from the
first category share an important drawback: the transformation into feature vectors
necessarily implies loss of information. Additionally, methods for extracting features
may be domain-specific.
The second category features Partitioning Around Medoids (PAM) [7], or k-
medoids, which finds representative observations by iteratively minimizing a cost
function based on the distances between data objects, and assigns other observations
to the closest medoid. PAM’s main limitations are that it requires the number of
clusters to be specified in advance.
3 Methods
In this section, we first describe the original DP approach; then, we introduce the
DP-KDE method, which is partly named after Kernel Density Estimation; finally,
we discuss how to employ it for graph clustering.
3.1 Original DP
The density-peak algorithm [12] is based on a simple idea: since cluster centers are
identified as the distribution’s modes, they must be 1) surrounded by neighbors with
lower density, and 2) at a relatively large distance from points with higher density.
Consequently, two quantities are computed for each observation 𝑥𝑖 : the local density
𝜌𝑖 , and the minimum distance 𝛿𝑖 from other data points with higher density. The
local density 𝜌𝑖 of 𝑥𝑖 is defined as:
$\rho_i = \sum_{j} I(d_{ij} - d_c)$,  (1)

where $d_{ij}$ is the distance between $x_i$ and $x_j$, $d_c$ is a cutoff distance, and $I(x) = 1$ if $x < 0$ and $0$ otherwise. The minimum distance is then

$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$.  (2)

By convention, the point with the highest density has $\delta_i = \max_j (d_{ij})$.
of 𝛿𝑖 reflects the algorithm’s core idea: data points that are not local or global maxima
have their 𝛿𝑖 constrained by other points within the same cluster, hence cluster centers
have large values of 𝛿𝑖 . However, this is not sufficient: they also need to have a large 𝜌𝑖
because otherwise the point could be merely distant from any other. After identifying
cluster centers, other observations are assigned to the same cluster as their nearest
neighbor of higher density.
The density-peak algorithm has many favorable properties: it manages to detect
nonspherical clusters, it does not require the number of clusters in advance or data to
be embedded in vector spaces, it is computationally fast because it does not maximize
explicitly each data point’s density field and it performs cluster assignment in a single
step, it estimates a clear population quantity, and it has only one tuning parameter
(the cutoff distance 𝑑 𝑐 ).
3.2 DP-KDE
The density-peak approach also has drawbacks. Over the last few years, many articles
have proposed improvements to overcome two main critical points: the unstable
density estimation and the absence of an automatic procedure for selecting cluster
centers. In this work, we explicitly tackle these two aspects.
The unstable density estimation induced by Equation (1) has been widely shown
[9, 16, 15]. Although many solutions have been proposed, we espouse the research
line suggesting the use of Kernel Density Estimation (KDE) to compute 𝜌𝑖 [9, 15]:
$\rho_i = \dfrac{1}{nh} \sum_{j=1}^{n} K\!\left(\dfrac{x_i - x_j}{h}\right)$.  (3)
First, since 𝛿𝑖 and 𝜌𝑖 are not defined over the same scale, results could be misleading;
second, it implicitly assumes that 𝛿𝑖 and 𝜌𝑖 shall be given the same weight in the
decision. We overcome these two limitations by first normalizing both 𝛿𝑖 and 𝜌𝑖
between 0 and 1, and then giving them different weights based on their informativeness.
We measure the latter using the Gini coefficient of the two (normalized) quantities,
under the assumption that the least concentrated distribution of the two is the most
informative. Specifically, each observation is given a measure of importance defined as:
$\gamma_i^G = \delta_{01,i}^{\,G(\delta_{01})} \; \rho_{01,i}^{\,G(\rho_{01})}$,  (5)

where $\delta_{01}$ and $\rho_{01}$ are the normalized versions of $\delta$ and $\rho$ respectively, $\delta_{01,i}$ and $\rho_{01,i}$ are the corresponding $i$-th values, and $G(x)$ denotes the Gini coefficient of $x$. Then, the selected cluster centers are the top $k$ observations in terms of $\gamma_i^G$. Assigning the remaining observations to the same cluster as their nearest neighbor of higher density concludes the DP-KDE method.
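For illustration, 𝛾𝑖𝐺 can be computed as in the following R sketch (a hypothetical helper, not the author's code).

    # Hedged sketch of the center score in Equation (5): delta and rho are
    # normalized to [0, 1], and each is weighted by the Gini coefficient of
    # its normalized distribution.
    gini <- function(x) {                  # Gini coefficient of a nonnegative vector
      x <- sort(x)
      n <- length(x)
      sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
    }
    gamma_G <- function(delta, rho) {
      d01 <- (delta - min(delta)) / (max(delta) - min(delta))
      r01 <- (rho - min(rho)) / (max(rho) - min(rho))
      d01^gini(d01) * r01^gini(r01)        # gamma_i^G as in (5)
    }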
3.3 DP-KDE for Graph Clustering

To apply DP-KDE to graph-structured data objects, a distance between graphs is needed; here we use the Edge Difference Distance,

$d_{ED}(x_i, x_j) = \| \mathbf{A}_i - \mathbf{A}_j \|_F$,  (6)

where $\mathbf{A}_i$ and $\mathbf{A}_j$ are the adjacency matrices of $x_i$ and $x_j$ respectively, and $\| \cdot \|_F$ denotes the Frobenius norm.
Consequently, the two fundamental quantities of the DP-KDE method are computed as:

$\rho_i = \sum_{j=1}^{n} K\!\left(\dfrac{d_{ED}(x_i, x_j)}{h}\right)$,  (7)
where $K(\cdot)$ is the Epanechnikov kernel defined in Equation (4), the normalizing constant being omitted because we are simply interested in the ranking of the densities, and:

$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ED}(x_i, x_j)$.  (8)
Finally, cluster centers are selected as the observations with the largest values
of 𝛾𝑖𝐺 , as defined in Equation (5), and other observations are assigned to the same
cluster as their nearest neighbor in terms of 𝛿𝑖 .
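Putting the pieces together, the following hedged R sketch implements the DP-KDE pipeline for graphs described above: the Epanechnikov-kernel density (7) on Edge Difference Distances, 𝛿𝑖 as in (8), center selection by 𝛾𝑖𝐺 (reusing the gamma_G helper from the earlier sketch), and single-step assignment. It is an illustration under the stated assumptions, not the author's code.

    # Hedged end-to-end sketch of DP-KDE for graphs sharing a common node set.
    dp_kde_graphs <- function(adj_list, h, k = 2) {
      n <- length(adj_list)
      D <- matrix(0, n, n)                 # Edge Difference Distances, as in (6)
      for (i in seq_len(n)) for (j in seq_len(n))
        D[i, j] <- norm(adj_list[[i]] - adj_list[[j]], type = "F")
      epan <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
      rho  <- rowSums(epan(D / h))         # unnormalized: only the ranking matters
      delta <- sapply(seq_len(n), function(i) {
        higher <- which(rho > rho[i])
        if (length(higher) == 0) max(D[i, ]) else min(D[i, higher])
      })
      centers <- order(gamma_G(delta, rho), decreasing = TRUE)[1:k]
      cl <- rep(NA_integer_, n)
      cl[centers] <- seq_len(k)
      for (i in order(rho, decreasing = TRUE)) {   # assign by decreasing density
        if (is.na(cl[i])) {
          higher <- which(rho > rho[i])
          if (length(higher) == 0) higher <- centers   # safeguard at the mode
          cl[i] <- cl[higher[which.min(D[i, higher])]]
        }
      }
      cl
    }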
4 Empirical Analysis
1 https://doi.org/10.5281/zenodo.3758534.
2 This exact figure is not included in the article, but the analysis is fully reproducible since the authors
made their source code available at https://github.com/leoguti85/BiomarkersSCHZ.
The approach we adopt in this work is rather different. First, graphs are analyzed
in their original form, without any simplification to numeric variables, resulting in
only one graph-structured variable. Observations are 54, each one representing the
functional connectome of a different individual. We tackle the problem with an un-
supervised classification approach seeking to cluster connectomes into two groups:
schizophrenic and healthy. To this end, we use the DP-KDE method for graph cluster-
ing described in Section 3.3. Starting from the 54 connectomes, each observation’s
local density 𝜌𝑖 and minimum distance 𝛿𝑖 are computed using Equations (7) and
(8), respectively. The centers of the two clusters are those whose 𝛾𝑖𝐺 is largest.
Then, other observations are assigned to the same cluster as their nearest neighbor
of higher density. Finally, the clustering performance is evaluated by comparing
the algorithm’s assignment to the ground truth. The DP-KDE method achieves an
accuracy of 70.37%, which is more than 2% higher than the one obtained in [5].
Table 1 includes the performance in terms of accuracy of both the DP-KDE
and the SVM-RFE methods, as well as that of other graph clustering competitors.
Specifically, we consider: the classical DP algorithm on the original data objects,
with the same cutoff distance as in DP-KDE and manually selected cluster centers;
k-means clustering on the 3403 numeric variables obtained from vectorizing the
adjacency matrices; DBSCAN on the original data objects, with parameters 𝜀 = 20.2
and 15 as the minimum number of points required to form a dense region; PAM and
𝑘-groups on the original data objects. In all these cases, the number of clusters has
been kept fixed to 𝑘 = 2. The method that yields the best accuracy in the specific
problem is the DP-KDE.
5 Concluding Remarks
After explaining the importance of graph clustering and briefly reviewing some
existing methods to perform this task, we have considered the possibility of adopting
a density-peak approach. We have improved the original DP algorithm by using
a more robust definition of the density 𝜌𝑖 , and by automatically selecting cluster
centers based on the quantity 𝛾𝑖𝐺 we have introduced. We have also selected a proper
distance between graphs, namely, the Edge Difference Distance. Finally, we have
used the resulting method in an empirical analysis with the goal of clustering brain
connectomes to distinguish between schizophrenic patients and healthy controls.
Our method outperforms another one treating the specific task as supervised, and it
is by far the best one with respect to many graph clustering competitors.
An initial idea for future work is the search for the optimal number of clusters.
This may be achieved either by fixing a threshold for 𝛾𝑖𝐺 or by selecting all the data
points after the largest increase in terms of 𝛾𝑖𝐺 . Also the cutoff distance could be
tuned, possibly maximizing in some way the dispersion of points in the bivariate
distribution of 𝜌 and 𝛿. Then, the DP-KDE method needs to be extended beyond the
univariate case. Finally, other distances between graphs could be considered to better
reflect alternative application-specific needs, e.g., when graphs are not defined over
the same set of nodes.
Acknowledgements The author would like to thank Pierfrancesco Alaimo Di Loro, Federico Carlini,
Marco Perone Pacifico, and Marco Scarsini for several engaging and stimulating discussions.
References
1. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of Internet topology using
k-shell decomposition. Proc. Natl. Acad. Sci. 104, 11150–11154 (2007)
2. Epanechnikov, V.: Non-parametric estimation of a multivariate probability density. Theory
Probab. Its Appl. 14, 153–158 (1969)
3. Ester, M., Kriegel, H., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. KDD-96, pp. 226–231 (1996)
4. Gutiérrez-Gómez, L., Delvenne, J.: Unsupervised network embeddings with node identity
awareness. Appl. Netw. Sci. 4, 1–21 (2019)
5. Gutiérrez-Gómez, L., Vohryzek, J., Chiêm, B., Baumann, P., Conus, P., Do Cuenod, K.,
Hagmann, P., Delvenne, J.: Stable biomarker identification for predicting schizophrenia in the
human connectome. NeuroImage Clin. 27, 102316 (2020)
6. Hammond, D., Gur, Y., Johnson, C.: Graph diffusion distance: A difference measure for
weighted graphs based on the graph Laplacian exponential kernel. IEEE GlobalSIP 2013, pp.
419–422 (2013)
7. Kaufmann, L., Rousseeuw, P.: Clustering by means of medoids. Proc. of the Statistical Data
Analysis based on the L1 Norm Conference, Neuchatel, Switzerland, pp. 405–416 (1987)
8. Li, S., Rizzo, M.: K-groups: A generalization of k-means clustering. ArXiv Preprint
ArXiv:1711.04359 (2017)
9. Mehmood, R., Zhang, G., Bie, R., Dawood, H., Ahmad, H.: Clustering by fast search and find
of density peaks via heat diffusion. Neurocomputing. 208, 210–217 (2016)
10. Mukherjee, S., Sarkar, P., Lin, L.: On clustering network-valued data. In: NIPS 2017, pp. 7074–
7084 (2017)
11. Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., Jaiswal, S.: graph2vec:
Learning distributed representations of graphs. ArXiv Preprint ArXiv:1707.05005 (2017)
12. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344,
1492–1496 (2014)
13. Shimada, Y., Hirata, Y., Ikeguchi, T., Aihara, K.: Graph distance for complex networks. Sci.
Rep. 6, 1–6 (2016)
14. Székely, G., Rizzo, M.: The energy of data. Annu. Rev. Stat. Appl. 4, 447–479 (2017)
15. Wang, X., Xu, Y.: Fast clustering using adaptive density peak detection. Stat. Methods Med.
Res. 26, 2800–2811 (2017)
16. Xie, J., Gao, H., Xie, W., Liu, X., Grant, P.: Robust clustering by detecting density peaks and
assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Similarity Forest for Time Series Classification
Abstract The idea of similarity forest comes from Sathe and Aggarwal [19] and is derived from random forest. Over more than 20 years of existence, random forests have proved to be among the best-performing methods, showing top performance across a vast array of domains while remaining simple, time-efficient, and interpretable. However, their usage is limited to data with a multidimensional feature representation. Similarity forest does not require such a representation; it only needs similarities between observations to be computable. Thus, it may be applied to data for which a multidimensional representation is not available. In this paper, we propose an implementation of similarity forest for time series classification. We investigate two distance measures, Euclidean and dynamic time warping (DTW), as the underlying measure for the algorithm. We compare the performance of similarity forest with 1-nearest neighbor and random forest on the UCR (University of California, Riverside) benchmark database. We show that similarity forest with DTW, taking into account mean ranks, outperforms the other classifiers. The comparison is enriched with statistical analysis.
Keywords: time series, time series classification, random forest, similarity forest
Tomasz Górecki ( )
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]
Maciej Łuczak
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]
Paweł Piasecki
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poz-
nańskiego 4, Poznań, Poland, e-mail: [email protected]
1 Introduction
Similarity forest requires only the ability to compute similarities between observations. We would like to implement and tune the method for time series data. We investigate the performance of the model using two distance measures (the algorithm's hyper-parameter): Euclidean and DTW. We also provide a comparison with other selected time series classifiers, comparing its performance against 1NN-ED, 1NN-DTW, and RF.
The rest of the paper is structured as follows. In Section 2, we describe the similarity forest and the random forest, and discuss how the two are related. Section 3 describes the data sets that we used and the comparison methodology. The corresponding results are presented in Section 4. Finally, in Section 5 we give a brief summary of our research.
In the paper, we compare the standard random forest and the similarity forest with two distance measures: ED (Euclidean distance) and DTW (dynamic time warping distance). As benchmark methods, we also use the nearest neighbor method (1NN) with the distance measures ED and DTW. 1NN-ED and 1NN-DTW are very common methods for time series classification [2]. For a review of these methods refer to [14].
Random forest consists of random decision trees. For the construction of a random
forest we usually take decision trees as simple as possible — without special criteria
for stopping, pruning, etc.
When building a decision tree, we start at a node 𝑁, which contains the entire data set (bootstrap sample). Then, according to an established criterion, we split the node 𝑁 into two subnodes 𝑁1 and 𝑁2, each of which contains a subset of the data from node 𝑁. We make this split in a way that is optimal for a given split method, and in each node we record how the split occurred. Then, proceeding recursively, we split subsequent nodes into subnodes until a stopping criterion is met. In our case we take the simplest such criterion: we stop splitting a node when it contains only elements of the same class. We call such a node a leaf and assign it the label shared by its elements.
Having built a tree, we can now use it (in the testing phase) to classify a new
observation. We pass this observation through the trained tree, starting from the node 𝑁 and selecting each time one of the subnodes according to the condition stored in the node. We do this until we reach one of the leaves, and then we assign the test
observation to the class of the leaf.
Now, constructing the random forest, we collect a certain number of decision trees, train them independently according to the above method and, in the test phase, use each of the trees to classify the new observation. Thus, each tree assigns a label to the test observation. The final label (for the entire forest) is obtained by voting: we choose the most frequently appearing label among the decision trees.
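As a sketch, the voting step can be written as follows; `tree.predict` stands for any per-tree classification routine (the name is illustrative, not a specific library API):

```python
from collections import Counter

def forest_predict(trees, x):
    """Each trained tree labels x; the forest returns the most
    frequent label among the trees (majority voting)."""
    votes = [tree.predict(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```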
To create a (classical) random tree and a random forest [4], we proceed as described
above using the following node split method:
To obtain split conditions for a single tree, we randomly select a certain number of features (√𝑘 for classification, where 𝑘 is the number of features), and for each feature we create a feature vector (column, variable) made of all elements of the data set (bootstrap sample). For a given feature vector (variable), we determine a threshold vector. First, we sort the values of the feature vector uniquely (without repeating values). Let us name this sorted feature vector 𝑉 = (𝑉1, 𝑉2, . . .). Then we take the values of the split as means of successive values of the vector 𝑉:
$$v_i = \frac{V_i + V_{i+1}}{2}, \qquad i = 1, 2, \ldots \tag{1}$$
Each splitting value divides the data set in node 𝑁 into two subsets — the one (left)
in which we have elements with feature values smaller than 𝑣𝑖 and the second (right)
with other elements. Then we check the quality of such a split.
The splitting point is chosen such that it minimizes the Gini index of the children nodes. If $p_1, p_2, \ldots, p_c$ are the fractions of data points belonging to the $c$ different classes in node $N$, then the Gini index of that node is given by $G(N) = 1 - \sum_{i=1}^{c} p_i^2$. Then, if the node $N$ is split into two children nodes $N_1$ and $N_2$, with $n_1$ and $n_2$ points, respectively, the Gini quality of the children nodes is given by
$$GQ(N_1, N_2) = \frac{n_1 G(N_1) + n_2 G(N_2)}{n_1 + n_2}.$$
The quality of the split is given by $GQ(N) = G(N) - GQ(N_1, N_2)$.
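In code, the Gini index and the split quality can be sketched as follows (`labels` holds the class labels of the elements falling into a node):

```python
import numpy as np

def gini(labels):
    """Gini index G(N) = 1 - sum_i p_i^2 for the labels in node N."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_quality(labels_left, labels_right):
    """G(N) minus the weighted Gini of the children; larger is better."""
    n1, n2 = len(labels_left), len(labels_right)
    parent = np.concatenate([labels_left, labels_right])
    gq = (n1 * gini(labels_left) + n2 * gini(labels_right)) / (n1 + n2)
    return gini(parent) - gq
```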
The similarity forest [19] differs from the ordinary (classical) random forest only in the way we split the nodes of trees. Instead of selecting a certain number of features, we randomly select a pair of elements 𝑒1, 𝑒2 with different classes. Then, for each element 𝑒 of the subset of elements in a given node, we calculate the difference of the squared distances to the elements 𝑒1 and 𝑒2:
$$w(e) = d(e, e_1)^2 - d(e, e_2)^2,$$
where 𝑑 is any fixed distance measure on the elements of the data set. We sort the vector 𝑤 uniquely (without duplicates), creating the vector 𝑉, and continue as for the classical decision tree: we calculate the split values 𝑣𝑖 via (1), calculate the quality of each candidate split using the Gini index described above, and choose the best split. In the learning phase, in each node we write down how the optimal split occurred: the elements 𝑒1, 𝑒2, and the threshold on 𝑤(𝑒).
The difference between a classical random tree and a similarity tree is that instead of selecting √𝑘 of the features, we select only one pair of elements 𝑒1, 𝑒2. Generally, we have much fewer possible node splits, which has a very good effect on the computation time.
The second important difference is that in the similarity tree we use any distance
measure between elements of the data set. Therefore, we can use distance measures
specific to a data set. For example, for time series we can use the DTW distance, which is much better suited to comparing time series than the Euclidean distance.
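Putting the pieces together, a single similarity-tree split can be sketched as below, reusing `split_quality` from the sketch above; `X` is a list of time series, `y` a NumPy array of their labels, and `dist` any distance on the raw observations, e.g., the Euclidean distance or a DTW routine (names are illustrative).

```python
import numpy as np

def best_similarity_split(X, y, e1, e2, dist):
    """Project each element onto w(e) = d(e, e1)^2 - d(e, e2)^2 and
    scan the midpoint thresholds v_i = (V_i + V_{i+1}) / 2 from (1)."""
    w = np.array([dist(e, e1) ** 2 - dist(e, e2) ** 2 for e in X])
    V = np.unique(w)                      # sorted, without repeated values
    best_v, best_q = None, -np.inf
    for v in (V[:-1] + V[1:]) / 2.0:
        q = split_quality(y[w < v], y[w >= v])
        if q > best_q:
            best_v, best_q = v, q
    return best_v, best_q                 # stored in the node during learning
```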
3 Experimental Setup
4 Results
The error rates for each classifier can be found on the accompanying website1. In Table 1 we show a short summary of the results, including the number of wins (a draw is not counted as a win) and mean ranks. Taking mean ranks into account, SF-DTW is the best classifier, slightly ahead of RF (mean ranks equal to 2.64 and 2.69, respectively).
1 https://github.com/ppias/similarity_forest_for_tsc
Table 1 Number of wins (clear wins only) and mean ranks for the examined methods.

Method     1NN-ED   1NN-DTW   RF     SF-ED   SF-DTW
Wins         12       28      38      10      31
Mean rank   3.59     2.89     2.69    3.19    2.64
Figure 1 compares the error rates and ranks of the classifiers. These results lead to the conclusion that even though there is no clear winner, the top positions are dominated by RF and SF-based classifiers. Figure 2 shows scatter plots of errors for pairs of classifiers.
[Fig. 2: Scatter plots of error rates for pairs of classifiers (panels compare RF, SF-ED, and SF-DTW against each other); both axes range from 0.00 to 1.00.]
We complement the comparison with post-hoc tests in order to detect significant pairwise differences among all of the classifiers.
Demšar [8] proposes the use of the Nemenyi test [16], which compares all the algorithms pairwise. For a significance level 𝛼, the test determines the critical difference (CD). If the difference between the average rankings of two algorithms is greater than the CD, the null hypothesis that the algorithms have the same performance is rejected. Additionally, Demšar [8] introduces a plot to visually check the differences, the CD plot. In the plot, algorithms that are not joined by a line can be regarded as different.
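For reference, the critical difference is CD = q_α √(k(k+1)/(6N)), where k is the number of classifiers, N the number of data sets, and q_α the critical value tabulated by Demšar [8]. The sketch below uses q_0.05 = 2.728 for k = 5 and reproduces the CD ≈ 0.54 used in the text.

```python
import math

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# five classifiers over the 128 UCR data sets, alpha = 0.05
print(round(nemenyi_cd(5, 128, 2.728), 2))  # 0.54
```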
In our case, with a significance level of 𝛼 = 0.05, any two algorithms with a difference in mean rank above 0.54 are regarded as not equal (Figure 3). We can see that there are three groups of methods: the first contains SF-DTW, RF and 1NN-DTW; the second RF, 1NN-DTW and SF-ED; and the last SF-ED and 1NN-ED. Unfortunately, the groups are not disjoint. The first group is the one with the highest classification accuracy. Hence, SF-DTW does not statistically outperform RF. However, we can recommend it over RF because of its statistically equivalent quality and much better computational properties.
[Fig. 3: Critical difference plot; the mean-rank axis runs from 2 to 4, with SF-DTW and RF at the best ranks and 1NN-ED at the worst.]
5 Conclusions
Our contribution is to implement similarity forest for time series classification using two distance measures: Euclidean and DTW. A comparison based on the recently updated UCR data repository (128 data sets) was presented. We showed that SF-DTW outperforms the other classifiers, including 1NN-DTW, which has for years been considered a strong baseline that is hard to beat. The statistical comparison showed that RF and SF-DTW are not significantly different; however, taking mean ranks into account, the latter is the best one.
There are many improvements that could be applied to the implementation that we propose. For example, we could test other distance measures such as LCSS [21] or ERP [5] that have been successfully used in time series tasks. Another idea could be to investigate the usage of a boosting algorithm.
Acknowledgements The research work was supported by grant No. 2018/31/N/ST6/01209 of the
National Science Centre.
References
1. Bagnall, A., Lines, J., Hills, J., Bostrom A.: Time-series classification with COTE: The
collective of transformation-based ensembles. IEEE Trans. on Knowl. and Data Eng. 27,
2522–2535 (2015)
2. Bagnall, A., Lines, J., Bostrom, A., Large J., Keogh, E.: The great time series classification
bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. and
Knowl. Discov. 31, 606–660 (2017)
3. Berndt, D. J., Clifford, J.: Using dynamic time warping to find patterns in time series. Proc.
of the 3rd Int. Conf. on Knowl. Discov. and Data Min., pp. 359–370 (1994)
4. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
5. Chen, L., Ng, R.: On the marriage of 𝐿 𝑝 -norms and edit distance. Proc. of the 30th Int. Conf.
on Very Large Data Bases 30, pp. 792–803 (2004)
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. on Inf. Theor. 13,
21–27 (1967)
7. Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., Yanping, C., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., Hexagon-ML: The UCR time series classification archive (2019) https://www.cs.ucr.edu/~eamonn/time_series_data_2018
8. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. of Mach. Learn.
Res. 7, 1–30 (2006).
9. Dua, D., Graff, C.: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
10. Fernandez-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of
classifiers to solve real world classification problems?. J. of Mach. Learn. Res. 15, 3133–3181
(2014)
11. Fix, E, Hodges, J. L.: Discriminatory analysis: nonparametric discrimination, consistency
properties. Techn. Rep. 4, (1951)
12. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple
comparisons in the design of experiments in computational intelligence and data mining:
Experimental Analysis of Power. Inf. Sci. 180, 2044–2064 (2010)
13. Lines, J., Taylor S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of transfor-
mation based ensembles for time series classification. IEEE Int. Conf. on Data Min., pp.
1041–1046 (2016)
14. Maharaj, E. A., D’Urso, P., Caiado, J.: Time Series Clustering and Classification. Chapman
and Hall/CRC. (2019)
15. Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.: HIVE-COTE 2.0:
a new meta ensemble for time series classification. (2021)
https://arxiv.org/abs/2104.07551
16. Nemenyi, P.: Distribution-free multiple comparisons. PhD thesis, Princeton University (1963)
17. Pavlyshenko, B. M.: Machine-learning models for sales time series forecasting. Data 4, 15
(2019)
18. Rastogi, V., Srivastava, S., Mishra, M., Thukral, R.: Predictive maintenance for SME in
industry 4.0. 2020 Glob. Smart Ind. Conf., pp. 382–390 (2020)
19. Sathe, S., Aggarwal, C. C.: Similarity forests. Proc. of the 23rd ACM SIGKDD, pp. 395–403
(2017)
20. Tang, J., Chen, X.: Stock market prediction based on historic prices and news titles. Proc. of
the 2018 Int. Conf. on Mach. Learn. Techn., pp. 29–34 (2018)
21. Vlachos, M., Kollios, G., Gunopulos, D.: Discovering similar multidimensional trajectories.
Proc. 18th Int. Conf. on Data Eng., pp. 673–684 (2002)
22. Wuest, T., Irgens, C., Thoben, K. D.: An approach to quality monitoring in manufacturing
using supervised machine learning on product state data. J. of Int. Man. 25, 1167–1180 (2014)
Detection of the Biliary Atresia Using Deep
Convolutional Neural Networks Based on
Statistical Learning Weights via Optimal
Similarity and Resampling Methods
Abstract Recently, artificial intelligence methods have been applied in several fields, and their usefulness is attracting attention. These methods correspond to models using batch and online processes. Because of advances in computational power, as represented by parallel computing, online techniques with several tuning parameters are widely accepted and demonstrate good results. Neural networks are representative online models for prediction and discrimination. Many online methods require large training data to attain sufficient convergence. Thus, online models may not converge effectively for small and noisy training datasets. For such cases, to realize effective learning convergence in online models, we introduce statistical insights into an existing method to set the initial weights of deep convolutional neural networks. Using an optimal similarity and resampling method, we propose an initial weight configuration approach for neural networks. For a practical example, the identification of biliary atresia (a rare disease), we verified the usefulness
Kuniyoshi Hayashi ( )
Graduate School of Public Health, St. Luke’s International University, 3-6 Tsukiji, Chuo-ku, Tokyo,
Japan, 104-0045, e-mail: [email protected]
Eri Hoshino · Kotomi Sakai
Research Organization of Science and Technology, Ritsumeikan University, 90-94 Chudoji Awat-
acho, Shimogyo Ward, Kyoto, Japan, 600-8815,
e-mail: [email protected];[email protected]
Mitsuyoshi Suzuki
Department of Pediatrics, Juntendo University Graduate School of Medicine, 2-1-1 Hongo, Bunkyo-
ku, Tokyo, Japan, 113-8421, e-mail: [email protected]
Erika Nakanishi
Department of Palliative Nursing, Health Sciences, Tohoku University Graduate School of
Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan, 980-8575,
e-mail: [email protected]
Masayuki Obatake
Department of Pediatric Surgery, Kochi Medical School, 185-1 Kohasu, Oko-cho, Nankoku-shi,
Kochi, Japan, 783-8505, e-mail: [email protected]
of the proposed method by comparing it with existing methods that also set the initial weights of neural networks.
1 Introduction
The core technique in deep learning corresponds to neural networks, including the
convolutional process. Since 2012, deep learning architectures have been frequently
used for image classification [1, 2]. Moreover, deep convolutional neural networks (DCNNs) are representative nonlinear classification methods for pattern recognition. The DCNN technique is used as a powerful framework throughout image processing [3]. The clinical medicine field presents many opportunities to perform
diagnoses using imaging data from patients. Therefore, DCNN techniques are ap-
plied to enhance diagnostic quality, e.g., applying a DCNN to a chest X-ray dataset
to classify pneumonia [2] and detecting breast cancer [4]. However, DCNN architec-
tures involve many parameters to be learned using training data. Therefore, effective and efficient model development requires good learning convergence for these parameters. Notably, setting suitable initial parameter values is important for achieving better learning convergence. Several methods have been proposed to set initial parameter values in the artificial intelligence (AI) field [5, 6]. However,
there are no clear guidelines for determining which existing methods should be used
in different situations. Thus, we propose an efficient initial weight approach using
existing methods from the viewpoints of optimal similarity and resampling methods.
Using a real-world clinical biliary atresia (BA) dataset, we evaluate the performance
of the proposed method compared with existing DCNNs. Additionally, we show the
usefulness of the proposed method in terms of learning convergence and prediction
accuracy.
2 Background
BA is a rare disease that occurs in children and is fatal unless treated early. Previous
studies have investigated models to identify BA by applying neural networks to pa-
tient data [7] and using an ensemble deep learning model to detect BA [8]. However,
these models were essentially for use in medical institutions, e.g., hospitals. Gener-
ally, certain stool colors in infants and children are highly correlated with BA [9]. In
Japan, the maternal and child health handbook includes a stool color card so parents
can compare their child’s stool color to the information on the card. Such fecal color
cards are widely used to detect BA because of their easy accessibility outside the
clinical environments. However, this stool color card screening approach for BA is
subjective; thus, accurate and objective diagnoses are not always possible. Previ-
ously, we developed a mobile application to classify BA and non-BA stools using
baby stool images captured using an iPhone [10]. Here, a batch type classification
method was used, i.e., the subspace method, originating from the pattern recognition
field. Since BA is a rare disease, the number of events in the case group is generally small. Thus, when the explanatory variables of an observation are set as the pixel values of a target image, the number of explanatory variables is much higher than the number of observations, especially in the disease group. With the subspace method, we
can efficiently discriminate such high-dimensional small-sample data. For example,
our previous study using the subspace method to classify BA and non-BA stools
showed that BA could be discriminated with reasonable accuracy by applying the
proposed method to the pixel data of stool images captured by a mobile phone [10]. This application was an automated version of the stool color card from
the maternal and child health handbook. Unlike previous studies by [7, 8], the appli-
cation is widely available outside hospital environments. As described previously,
DCNNs are useful for image classification, including the automatic classification of
stool images for early BA detection.
3 Proposed Method
Dimension reduction and discrimination processing can be realized using the sub-
space method and DCNN techniques. In DCNN, layers based on padding, convo-
lution, and pooling correspond to the dimension reduction functions, and the affine
layer performs the discrimination. The primary motivation of this study is to propose
a method that properly sets the initial weights of the parameters in a DCNN using
statistical approaches. Our secondary motivation is to apply the proposed method to
real-world, high-dimensional, and small-sample clinical data.
For image discrimination in pattern recognition and machine learning fields, the pixel
values of the image data are set as the explanatory variables for the target outcome.
Here, the data to be classified correspond to a high-dimensional observation. To
improve efficiency and demonstrate the feasibility of discriminant processing, the
dimensionality must be reduced to a manageable size before classification. The most
representative dimensionality reduction method is convolution in pattern recognition
and machine learning, which involves padding, convolution, and pooling operations.
After converting the input image to a pixel data matrix, the matrix is padded by surrounding it with zeros. Using a convolution filter, we reconstruct the pixel data matrix while considering pixel adjacency information. Generally, the filter size and convolution filter type are parameters that need optimization to realize sufficient performance.
We denote the input pattern matrices comprising numerical pixel values in hue (H), saturation (S), and value (V) as X_H (∈ R^{p×q}), X_S (∈ R^{p×q}), and X_V (∈ R^{p×q}), respectively. First, we performed padding on the input pattern matrices in H, S, and V, and then performed a convolution on each signal pattern matrix using a convolution filter. Next, we applied max pooling to each pattern matrix after convolution. We denote the pattern matrices after padding, convolution, and max pooling as X̃_H (∈ R^{p′×q′}), X̃_S (∈ R^{p′×q′}), and X̃_V (∈ R^{p′×q′}), respectively, where p′ and q′ are less than p and q. We then combine the component values of these pattern matrices into a single pattern matrix by simply adding them together. The combined pattern matrix after applying the feature selection layer is expressed as X̃ (∈ R^{p′×q′}). Next, we applied convolution and max pooling to the combined pattern matrix k times. The input vector after performing the convolution and max pooling k times is denoted by x (∈ R^{ℓ×1}), and the output of the DCNN and the label vector are denoted y (∈ R^{1×1}) and t (∈ R^{1×1}), respectively. In this study, we evaluated the difference between y and t according to the mean square error function, i.e., $L(\mathbf{y}, \mathbf{t}) = \frac{1}{\ell}\,\|\mathbf{t} - \mathbf{y}\|_2^2$.
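The feature-selection layer just described can be sketched as follows; the 3×3 averaging filter and 2×2 pooling window are illustrative placeholders rather than the optimized choices discussed above.

```python
import numpy as np

def conv2d(X, F):
    """Valid 2-D convolution of pattern matrix X with filter F."""
    fh, fw = F.shape
    out = np.empty((X.shape[0] - fh + 1, X.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + fh, j:j + fw] * F)
    return out

def max_pool(X, s=2):
    """Non-overlapping s x s max pooling."""
    h, w = X.shape[0] // s, X.shape[1] // s
    return X[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

def feature_selection_layer(X_H, X_S, X_V, F=np.full((3, 3), 1 / 9)):
    """Pad, convolve, and max-pool each HSV channel, then sum the
    three results into the combined pattern matrix X-tilde."""
    return sum(max_pool(conv2d(np.pad(X, 1), F)) for X in (X_H, X_S, X_V))
```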
Here, we consider a simple neural network with three layers. Concretely, between
the first and second layers, we perform a linear transformation using W1 (∈ R2×ℓ )
and b1 (∈ R2×1 ). Then, a linear transformation is performed using W2 (∈ R1×2 ) and
b2 (∈ R^{1×1}) between the second and third layers. Next, we define 𝑓1(x) and 𝑓2(x) as W1x + b1 and W2 𝑓1(x) + b2, respectively. Note that we assume 𝜂2 is a nonlinear transformation between the second and third layers, and we calculate the output y as 𝜂2(𝑓2 ◦ 𝑓1(x)). Generally, y is calculated as a continuous value. For example, with classification and regression tree methods, we can determine the optimal cutoff point of y from a prediction perspective.
We assume that 𝜂2 is the sigmoid function. Then, $\frac{\partial \mathbf{y}}{\partial \mathbf{u}_2}$ is calculated as $\eta_2(\mathbf{u}_2)(1 - \eta_2(\mathbf{u}_2))$. Therefore, we obtain
$$\frac{\partial L}{\partial \mathbf{W}_2^T} = -\frac{2}{\ell}(\mathbf{t} - \mathbf{y})\,\eta_2(\mathbf{u}_2)\big(1 - \eta_2(\mathbf{u}_2)\big)\,\mathbf{u}_1.$$
With the learning coefficient $\gamma_2$, we update $\mathbf{W}_2^T$ to $\mathbf{W}_2^T - \gamma_2 \frac{\partial L}{\partial \mathbf{W}_2^T}$. Then, when performing the partial derivative of $L(\mathbf{y}, \mathbf{t})$ with respect to $\mathbf{W}_1$, we can write
$$\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \mathbf{y}}\,\frac{\partial \mathbf{y}}{\partial \mathbf{u}_2}\,\frac{\partial \mathbf{u}_2}{\partial \mathbf{u}_1}\,\frac{\partial \mathbf{u}_1}{\partial \mathbf{W}_1},$$
where $\frac{\partial L}{\partial \mathbf{y}} = -\frac{2}{\ell}(\mathbf{t} - \mathbf{y})$, $\frac{\partial \mathbf{y}}{\partial \mathbf{u}_2} = \eta_2(\mathbf{u}_2)(1 - \eta_2(\mathbf{u}_2))$, $\frac{\partial \mathbf{u}_2}{\partial \mathbf{u}_1} = \mathbf{W}_2^T$, and $\frac{\partial \mathbf{u}_1}{\partial \mathbf{W}_1} = 2\mathbf{x}^T$. Thus, we obtain
$$\frac{\partial L}{\partial \mathbf{W}_1} = -\frac{4}{\ell}(\mathbf{t} - \mathbf{y})\,\eta_2(\mathbf{u}_2)\big(1 - \eta_2(\mathbf{u}_2)\big)\,\mathbf{W}_2^T \mathbf{x}^T.$$
With the learning coefficient $\gamma_1$, we update $\mathbf{W}_1$ to $\mathbf{W}_1 - \gamma_1 \frac{\partial L}{\partial \mathbf{W}_1}$.
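A minimal sketch of one forward and backward pass of this three-layer network follows. It applies the standard chain rule for the sigmoid output (the additional factor appearing in the chapter's ∂u1/∂W1 is not reproduced) and, for brevity, updates only the weight matrices, not the biases.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_step(x, t, W1, b1, W2, b2, gamma1, gamma2, ell):
    """One gradient step for y = eta2(W2 (W1 x + b1) + b2) under the
    mean square error loss L(y, t) = (1/ell) * ||t - y||^2."""
    u1 = W1 @ x + b1                    # first linear transformation
    u2 = W2 @ u1 + b2                   # second linear transformation
    y = sigmoid(u2)                     # eta2, the output nonlinearity
    dL_dy = -(2.0 / ell) * (t - y)
    dL_du2 = dL_dy * y * (1.0 - y)      # sigmoid: eta2(u2)(1 - eta2(u2))
    dW2 = dL_du2 @ u1.T                 # gradient with respect to W2
    dW1 = (W2.T @ dL_du2) @ x.T         # chain rule down to W1
    return y, W1 - gamma1 * dW1, W2 - gamma2 * dW2
```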
where ℓ′_0 takes values 1 to ℓ in Equation (1). Similarly, we calculated the autocorrelation matrix with the observations belonging to 𝑆1. Then, with eigenvalues ($\hat{\lambda}_{s_1}$) and eigenvectors ($\hat{\mathbf{u}}_{s_1}$) of the autocorrelation matrix, we calculate the following projection matrix:
$$\hat{P}_1 := \sum_{s_1=1}^{\ell'_1} \hat{\mathbf{u}}_{s_1} \hat{\mathbf{u}}_{s_1}^T, \tag{2}$$
where ℓ′_1 takes values 1 to ℓ in Equation (2). Here, if the value of $\mathbf{x}^T(\hat{P}_1 - \hat{P}_0)\mathbf{x} > 0$, we classify x into 𝑆1; otherwise, we classify x into 𝑆0.
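A sketch of this subspace classification rule follows, with class-wise projection matrices built from the leading eigenvectors of each class's autocorrelation matrix (rows of `X_class` are the observation vectors; names are illustrative).

```python
import numpy as np

def projection_matrix(X_class, n_comp):
    """P-hat = sum of u u^T over the leading eigenvectors of the
    class autocorrelation matrix (1/n) * sum_j x_j x_j^T."""
    A = X_class.T @ X_class / X_class.shape[0]
    eigvals, eigvecs = np.linalg.eigh(A)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:n_comp]]
    return U @ U.T

def classify(x, P0, P1):
    """Assign x to S1 if x^T (P1 - P0) x > 0, otherwise to S0."""
    return 1 if x @ (P1 - P0) @ x > 0 else 0
```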
From a prediction perspective, using leave-one-out cross-validation [11], we determined the optimal $\hat{\ell}'_0$ and $\hat{\ell}'_1$, which are the minimum values satisfying $\tau < \big(\sum_{s_0=1}^{\ell'_0} \hat{\lambda}_{s_0}\big)/\big(\sum_{s_0=1}^{\ell} \hat{\lambda}_{s_0}\big)$ and $\tau < \big(\sum_{s_1=1}^{\ell'_1} \hat{\lambda}_{s_1}\big)/\big(\sum_{s_1=1}^{\ell} \hat{\lambda}_{s_1}\big)$, respectively. Here, $\tau$ is a tuning parameter to be optimized using leave-one-out cross-validation. In the second step, based on $\hat{P}_1$, we estimated $\hat{y}_j$ as $\mathbf{x}_j^T \hat{P}_1 \mathbf{x}_j$. In the third step, using existing approaches [5, 6], we generated normal random numbers and set an initial matrix, vector, and scalar as Ŵ2, b̂1, and b̂2, respectively. Next, we extracted
𝑚 observations randomly using the bootstrap method [12]. Using Ŵ2, b̂1, b̂2, and a bootstrap sample of size 𝑚, we estimated W2W1 as follows:
$$\hat{\mathbf{W}}_2\hat{\mathbf{W}}_1 = \frac{1}{m}\sum_{i=1}^{m}\Big(\eta_2^{-1}(\hat{y}_i) - \big(\hat{\mathbf{W}}_2\hat{\mathbf{b}}_1 + \hat{\mathbf{b}}_2\big)\Big)\,\mathbf{x}_i^T\big(\mathbf{x}_i\mathbf{x}_i^T\big)^{-1}, \tag{3}$$
where we estimate the inverse of $\mathbf{x}_i\mathbf{x}_i^T$ in Equation (3) using the naive approach based on the diagonal elements of $\mathbf{x}_i\mathbf{x}_i^T$. Additionally, using the generalized inverse approach, we obtained Ŵ1 on the basis of Ŵ2 and Ŵ2Ŵ1. Finally, b̂1, b̂2, Ŵ1, and Ŵ2 were used as initial vectors and matrices to update the parameters of the convolutional neural network.
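The third step can be sketched as follows, under the assumptions stated above: normal random initial values, a bootstrap sample of size m, the naive diagonal approximation of (x_i x_i^T)^{-1}, and a pseudoinverse to recover Ŵ1 from Ŵ2 and the product Ŵ2Ŵ1. It additionally assumes the ŷ_j have been rescaled into (0, 1) so that the sigmoid inverse is defined.

```python
import numpy as np

def init_weights(X, y_hat, m, ell, rng=np.random.default_rng()):
    """Initialize b1, b2, W2 randomly and estimate W1 via Equation (3).
    Columns of X are the input vectors x_i; y_hat holds the y-hat_j."""
    W2 = rng.normal(size=(1, 2))
    b1 = rng.normal(size=(2, 1))
    b2 = rng.normal(size=(1, 1))
    acc = np.zeros((1, ell))
    for i in rng.integers(0, X.shape[1], size=m):        # bootstrap sample
        x = X[:, i:i + 1]                                # column vector
        eta_inv = np.log(y_hat[i] / (1.0 - y_hat[i]))    # sigmoid inverse
        G_inv = np.diag(1.0 / (np.diag(x @ x.T) + 1e-12))  # naive diagonal inverse
        acc += (eta_inv - (W2 @ b1 + b2)) @ x.T @ G_inv
    W2W1 = acc / m
    W1 = np.linalg.pinv(W2) @ W2W1                       # generalized inverse
    return b1, b2, W1, W2
```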
In this paper, all analyses were performed using R version 4.1.2 (R Foundation for Statistical Computing). We applied the proposed method to a real BA dataset, using stool images in which other objects, such as diapers, were partially photographed. In this numerical experiment, we randomly divided the 35 images into 15 training and 20 test images. We then compared the proposed and existing methods with respect to learning convergence on the training data and prediction accuracy on the test data. We set both learning coefficients 𝛾1 and 𝛾2 to 0.1, prepared a single feature selection layer, and performed the convolution and max pooling process seven times. Each time an initial value was set randomly, learning was performed for 1000 iterations on the 15 training images, and learning was judged to have converged when the sum of the absolute differences between ŷ𝑗 and t𝑗, divided by 1000, became less than 0.01.
We repeated the random division of the 35 images into 15 training and 20 test images five times, creating five datasets. For each dataset, the sensitivity, specificity, and AUC values on the training and test data were calculated using the parameters (b̂1, b̂2, Ŵ1, and Ŵ2) at the time learning first converged for the existing and proposed methods. Figure 1 shows the average, over the five repetitions, of the absolute difference between the correct label and the predicted value at each step until learning first converged for each method. We can observe that the error decreased more steadily for the proposed method than for the existing methods. When the model was constructed using the weights at the learning convergence point and applied to the 15 training images, the average sensitivity and specificity were 100.0% and the average AUC was 1.000 for all methods. However, differences were observed among the compared methods on the test data. For the method of [5], the average sensitivity, specificity, and AUC on the test data were 83.3%, 42.5%, and 0.629, respectively. For that of [6], they were 85.0%, 40.0%, and 0.625, respectively. With the proposed method, the average sensitivity, specificity, and AUC on the test data were 85.0%, 67.5%, and 0.763, respectively.
Acknowledgements We thank Shinsuke Ito, Takashi Taguchi, Dr. Yusuke Yamane, Ms. Saeko
Hishinuma, and Dr. Saeko Hirai for their advice. In addition, we acknowledge the biliary atresia
patients’ community (BA no kodomowo mamorukai) for their generous support of this project. This
work was supported by the Mitsubishi Foundation.
References
1. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a compre-
hensive review. Neural Comput. 29(9), 2352–2449 (2017)
2. Yadav, S. S., Jadhav, S.M.: Deep convolutional neural network based medical image classifi-
cation for disease diagnosis. J. Big Data (2019) doi: 10.1186/s40537-019-0276-2
3. Huang, J., Xu, Z.: Cell detection with deep learning accelerated by sparse kernel. In: Lu, L.
et al. (eds.) Advances in Computer Vision and Pattern Recognition, pp. 137-157. Springer,
Switzerland (2017)
4. Abdelhafiz, D., Yang, C., Ammar, R., Nabavi, S.: Deep convolutional neural networks
for mammography: advances, challenges and applications. BMC Bioinform. (2019) doi:
10.1186/s12859-019-2823-4
5. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural
networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics, pp. 249-256. (2010)
6. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level
performance on ImageNet classification. In: Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pp. 1026-1034. (2015)
7. Liu, J., Dai, S., Chen, G., Sun, S., Jiang, J., Zheng, S., Zheng, Y., Dong, R.: Diagnostic value
and effectiveness of an artificial neural network in biliary atresia. Front. Pediatr. (2020) doi:
10.3389/fped.2020.00409
8. Zhou, W., Yang, Y., Yu, C., Liu, J., Duan, X., Weng, Z., Chen, D., Liang, Q., Fang, Q., Zhou,
J., Ju, H., Luo, Z., Guo, W., Ma, X., Xie, X., Wang, R., Zhou, L.: Ensembled deep learning
model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder
images. Nat. Commun. (2021) doi: 10.1038/s41467-021-21466-z
9. Gu, Y.H., Yokoyama, K., Mizuta, K., Tsuchioka, T., Kudo, T., Sasaki, H., Nio, M., Tang, J.,
Ohkubo, T., Matsui, A.: Stool color card screening for early detection of biliary atresia and
long-term native liver survival: a 19-year cohort study in Japan. J. Pediatr. 166(4), 897–902
(2015)
10. Hoshino, E., Hayashi, K., Suzuki, M., Obatake, M., Urayama, K.Y., Nakano, S., Taura, Y.,
Nio, M., Takahashi, O.: An iPhone application using a novel stool color detection algorithm
for biliary atresia screening. Pediatr. Surg. Int. 33(10), 1115–1121 (2017)
11. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition. Springer, New York (2009)
12. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)
Some Issues in Robust Clustering
Christian Hennig
Abstract Some key issues in robust clustering are discussed with a focus on Gaussian mixture model based clustering, namely the formal definition of outliers,
ambiguity between groups of outliers and clusters, the interaction between robust
clustering and the estimation of the number of clusters, the essential dependence
of (not only) robust clustering on tuning decisions, and shortcomings of existing
measurements of cluster stability when it comes to outliers.
1 Introduction
Cluster analysis is about finding groups in data. Robust statistics is about methods
that are not affected strongly by deviations from the statistical model assumptions or
moderate changes in a data set. Particular attention has been paid in the robustness
literature to the effect of outliers. Outliers and other model deviations can have a
strong effect on cluster analysis methods as well. There is now much work on robust
cluster analysis, see [1, 19, 9] for overviews.
There are standard techniques of assessing robustness such as the influence func-
tion and the breakdown point [15] as well as simulations involving outliers, and these
have been applied to robust clustering as well [19, 9].
Here I will argue that due to the nature of the cluster analysis problem, there are
issues with the standard reasoning regarding robustness and outliers.
The starting point will be clustering based on the Gaussian mixture model, for
details see [3]. For this approach, 𝑛 observations are assumed i.i.d. with density
Christian Hennig ( )
Dipartimento di Scienze Statistiche “Paolo Fortunati”, University of Bologna, Via delle Belle Arti
41, 40126 Bologna, Italy, e-mail: [email protected]
$$f_\eta(x) = \sum_{k=1}^{K} \pi_k\, \varphi_{\mu_k, \Sigma_k}(x),$$
where 𝜋1, . . . , 𝜋𝐾 are mixing proportions summing to 1 and 𝜑𝜇,Σ denotes the density of the Gaussian distribution with mean 𝜇 and covariance matrix Σ.
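For illustration, this mixture density can be evaluated directly (a small sketch using scipy's Gaussian density for 𝜑𝜇,Σ):

```python
from scipy.stats import multivariate_normal

def mixture_density(x, pis, mus, Sigmas):
    """f_eta(x) = sum_k pi_k * phi_{mu_k, Sigma_k}(x)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))
```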
2 Outliers vs Clusters
It is well known that the sample mean and sample covariance matrix as estimators
of the parameters of a single Gaussian distribution can be driven to breakdown by
a single outlier [15]. Under a Gaussian mixture model with fixed 𝐾, an outlier must
be assigned to a mixture component 𝑘 and will break down the estimators of 𝜇 𝑘 , Σ 𝑘
(which are weighted sample means and covariance matrices) for that component in
the same manner; the same holds for a cluster mean in 𝑘-means clustering.
Addressing this issue, and dealing with more outliers in order to achieve a high
breakdown point, is a starting point for robust clustering. Central ideas are trimming
a proportion of observations [7], adding a “noise component” with constant density
to catch the outliers [4, 3], mixtures with more robust component-wise estimators
such as mixtures of heavy-tailed distributions (Sec. 7 of [18]).
But cluster analysis is essentially different from estimating a homogeneous popu-
lation. Given a data set with 𝐾 clear Gaussian clusters and standard ML-clustering,
consider adding a single outlier that is far enough away from the clusters. Assuming
a lower bound on covariance matrix eigenvalues, the outlier will form a one-point
cluster, the mean of which will diverge with the added outlier, and the original
clusters will be merged to form 𝐾 − 1 clusters [10].
The same will happen with a group of several outliers being close together,
once more added far enough away from the Gaussian clusters. “Breakdown” of an estimator is usually understood as the estimator becoming useless. It is questionable whether this is the case here. In fact, the “group of outliers” can well be interpreted as a cluster in its own right, and putting all these points together in a cluster could be seen as appropriate rather than as breakdown.
The last item suggests that there is an interplay between outlier identification and the
number of clusters, and that adding clusters might be a way of dealing with outliers;
as long as clusters are assumed to be Gaussian, a single additional component may
not be enough. More generally, concentrating robustness research on the case of
fixed 𝐾 may be seen as unrealistic, because 𝐾 is rarely known, although estimating
𝐾 is a notoriously difficult problem even without worrying about outliers [13].
The classical robustness concepts, breakdown point and influence function, as-
sume parameters from R𝑞 with fixed 𝑞. If 𝐾 is not fixed, the number of parameters
is not fixed either, and the classical concepts do not apply.
As an alternative to the breakdown point, [11] defined a “dissolution point”.
Dissolution is measured in terms of cluster memberships of points rather than in
terms of parameters, and is therefore also applicable to nonparametric clustering
methods. Furthermore, dissolution applies to individual clusters in a clustering;
certain clusters may dissolve, i.e., there may be no sufficiently similar cluster in a
new clustering computed after, e.g., adding an outlier; and others may not dissolve.
This does not require 𝐾 to be fixed; the definition is chosen so that if a clustering
changes from 𝐾 to 𝐿 < 𝐾 clusters, at least 𝐾 − 𝐿 clusters dissolve.
Hennig [10, 11] showed that when estimating 𝐾 using the BIC and standard ML
estimation, reasonably well separated clusters do not dissolve when adding possibly
even a large percentage of outliers (this does not hold for every method to estimate
the number of clusters, see [11]). Furthermore, [11] showed that no method with
fixed 𝐾 can be robust for data in which 𝐾 is misspecified; already [7] had found that robustness features in clustering generally depend on the data.
An implication of these results is that even in the fixed 𝐾 problem, the standard
ML method can be a valid competitor regarding robustness if it comes with a rule
that allows adding one or possibly more clusters that can then be used to fit the
outliers (this is rarely explored in the literature, but [18], Sec. 7.7, show an example
in which adding a single component does not work very well).
An issue with adding clusters to accommodate outliers is that in many applications
it is appropriate to distinguish between meaningful clusters, and observations that
cannot be assigned to such clusters (often referred to as “noise”). Even though adding
clusters of outliers can formally prevent the dissolution of existing clusters, it may
be misleading to interpret the resulting clusters as meaningful, and a classification
as outliers or noise can be more useful. This is provided by the trimming and noise
component approaches to robust clustering. Also some other clustering methods such
as the density-based DBSCAN [5] provide such a distinction. On the other hand,
modelling clusters by heavy-tailed distributions such as in mixtures of t-distributions
will implicitly assign outlying observations to clusters that potentially are quite far
away. For this reason, [18], Sec. 7.7, provide an additional outlier identification
rule on top of the mixture fit. [6] even distinguish between “mild” outliers that are
modelled as having a larger variance around the same mean, and “gross” outliers to
be trimmed. The variety of approaches can be connected to the different meanings
that outliers can have in applications. They can be erroneous, they can be irrelevant
noise, but they can also be caused by unobserved but relevant special conditions (and
would as such qualify as meaningful clusters), or they could be valid observations
legitimately belonging to a meaningful cluster that regularly produces observations
further away from the centre than modelled by a Gaussian distribution.
Even though currently there is no formal robustness property that requires both the
estimation of 𝐾 and an identification or downweighting of outliers, there is demand
for a method that can do both.
Estimating 𝐾 comes with an additional difficulty that is relevant in connection
with robustness. As mentioned before, in clustering based on the Gaussian mixture
model normally every mixture component will be interpreted as a cluster. In reality,
however, meaningful clusters are not perfectly Gaussian. Gaussian mixtures are very
flexible for approximating non-Gaussian distributions. Using a consistent method
for estimating 𝐾 means that for large enough 𝑛 a non-Gaussian cluster will be
approximated by several Gaussian mixture components. The estimated 𝐾 will be
fine for producing a Gaussian mixture density that fits the data well, but it will
overestimate the number of interpretable clusters. The estimation of 𝐾, if interpreted
as the number of clusters, relies on precise Gaussianity of the clusters, and is as such
itself riddled with a robustness problem; in fact slightly non-Gaussian clusters may
even drive the estimated 𝐾 → ∞ if 𝑛 → ∞ [12, 14].
This is connected with the more fundamental problem that there is no unique
definition of a cluster either. The cluster analysis user needs to specify the cluster
concept of interest even before robustness considerations, and arguably different
clustering methods imply different cluster concepts [13]. A Gaussian mixture model
defines clusters by the Gaussian distributional shape (unless mixture components
are merged to form clusters [12]). Although this can be motivated in some real situ-
ations, robustness considerations require that distributional shapes fairly close to the
Gaussian should be accepted as clusters as well, but this requires another specifica-
tion, namely how far from a Gaussian a cluster is allowed to be, or alternatively how
separated Gaussian components have to be in order to count as separated clusters. A
similar problem can also occur in nonparametric clustering; if clusters are associated
with density modes or level sets, the cluster concept depends on how weak a mode
or gap between high level density sets is allowed to be to be treated as meaningful.
Hennig and Coretto [14] propose a parametric bootstrap approach to simultane-
ously estimate 𝐾 and assign outliers to a noise component. This requires two basic
tuning decisions. The first one regards the minimum percentage of observations so
that a researcher is willing to add another cluster if the noise component can be re-
duced by this amount. The second one specifies a tolerance that allows a data subset
to count as a cluster even though it deviates to some extent from what is expected
under a perfectly Gaussian distribution. There is a third tuning parameter that is in
effect for fixed 𝐾 and tunes how much of the tails of a non-Gaussian cluster can be
assigned to the noise in order to improve the Gaussian appearance of the cluster. One
could even see the required constraints on covariance matrix eigenvalues as a further
tuning decision. Default values can be provided, but situations in which matters can
be improved deviating from default values are easy to construct.
User tuning is not popular, as it is often difficult to make appropriate tuning decisions.
Many scientists believe that subjective user decisions threaten scientific objectivity,
and also background knowledge dependent choices cannot be made when investigat-
ing a method’s performance by theory and simulations. The reason why user tuning
is indispensable in robust cluster analysis is that it is required in order to make the
problem well defined. The distinction between clusters and outliers is an interpre-
tative one that no automatic method can make based on the data alone. Regarding
the number of clusters, imagine two well separated clusters (according to whatever
cluster concept of interest), and then imagine them to be moved closer and closer
together. Below what distance are they to be considered a single cluster? This is
essentially a tuning decision that the data cannot make on their own.
There are methods that do not require user tuning. Consider the mclust imple-
mentation of Gaussian mixture model based clustering. The number of clusters is by
default estimated by the BIC. As seen above, this is not really appropriate for large
data sets, but its derivation is essentially asymptotic, so that there is no theoretical
justification for it for small data sets either. Empirically it often but not always works
well, and there is little investigation of whether it tends to make the “right” decision in ambiguous situations, where without user tuning it is not even clear what being “right” means. Covariance matrix constraints in mclust are not governed by a user-specified tuning of eigenvalues or their ratios. Rather, the BIC decides between different covariance matrix models. This can be erratic and unstable, as it depends on whether the EM-algorithm gets caught in a degenerate likelihood maximum or not; in situations where two or more covariance matrix models have similar BIC values (which happens quite often), a tiny change in the data can result in a different covariance matrix model being selected, with substantial changes in the clustering. A
tunable eigenvalue condition can result in much smoother behaviour. When it comes
to outlier identification, mclust offers the addition of a uniform “noise” mixture
component governed by the range of the data, again supposedly without user tuning.
This starts from an initial noise estimation that requires tuning (Sec. 3.1.2 of [3]) and
is less robust in terms of breakdown and dissolution than trimming and the improper
noise component, both of which require tuning [10, 11]. The ICL, an alternative to
the BIC (Sec. 2.6 of [3]), on the other hand, is known to merge different Gaussian
mixture components already at a distance at which they intuitively still seem to
be separated clusters. Similar comments apply to the mixture of t-distributions; it
requires user tuning for identifying outliers, scatter matrix constraints, and it has the
same issues with BIC and ICL as the Gaussian mixture.
Summarising, both the identification of and robustness against outliers and the
estimation of the number of clusters require tuning in order to be well defined
problems; user tuning can only be avoided by taking tuning decisions out of the
user’s hands and making them internally, which will work in some situations and
fail in others, and the impression of automatic data driven decision making that a
user may have is rather an illusion. This, however, does not free method designers
from the necessity to provide default tunings for experimentation and cases in which
the users do not feel able to make the decisions themselves, and tuning guidance for
situations in which more information is available. A decision regarding the smallest
valid size of a cluster is rather well interpretable; a decision regarding admissible
covariance matrix eigenvalues is rather difficult and abstract.
5 Stability Measurement
6 Conclusion
developers need to provide sensible defaults, but also to guide the users regarding
a meaningful interpretation of the tuning decisions.
References
1. Banerjee, A., Davé, R. N.: Robust clustering. WIREs Data Mining Knowl. Discov. 2, 29–59
(2012)
2. Ben-David, S., von Luxburg, U., Pál, D.: A sober look at clustering stability. In: Proceedings
of the 19th annual conference on Learning Theory (COLT’06), pp. 5–19, Springer, Berlin
(2006)
3. Bouveyron, C., Celeux, G., Murphy, T. B., Raftery, A. E.: Model-based clustering and classi-
fication for data science. Cambridge University Press, Cambridge MA (2019)
4. Coretto, P., Hennig, C.: Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. J. Mach. Learn. Res. 18, 1–39 (2017)
5. Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters
in large spatial databases with noise. In: Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, pp. 226-231, AAAI Press, Portland OR (1996)
6. Farcomeni, A., Punzo, A.: Robust model-based clustering with mild and gross outliers. TEST
29, 989–1007 (2020)
7. García-Escudero, L. A., Gordaliza, A.: Robustness properties of k-means and trimmed k-
means. J. Am. Stat. Assoc. 94, 956–969 (1999)
8. García-Escudero, L. A., Gordaliza, A., Greselin, F., Ingrassia, S., Mayo-Iscar, A.: Eigenvalues
and constraints in mixture modeling: geometric and computational issues. Adv. Data Anal.
Classif. 12, 203–233 (2018)
9. García-Escudero, L. A., Gordaliza, A., Matrán, C., Mayo-Iscar, A., Hennig, C.: Robustness
and outliers. In: Hennig, C., Meila, M., Murtagh, F., Rocci, R. (eds.) Handbook of Cluster
Analysis, pp. 653–678. Chapman & Hall/CRC, Boca Raton FL (2016)
10. Hennig, C.: Breakdown points for maximum likelihood estimators of location-scale mixtures.
Ann. Stat. 32, 1313–1340 (2004)
11. Hennig, C.: Dissolution point and isolation robustness: robustness criteria for general cluster
analysis methods. J. Multivariate Anal. 99, 1154–1176 (2008)
12. Hennig, C.: Methods for merging Gaussian mixture components. Adv. Data Anal. Classif. 4, 3–34 (2010)
13. Hennig, C.: Clustering strategy and method selection. In: Hennig, C., Meila, M., Murtagh, F.,
Rocci, R. (eds.) Handbook of Cluster Analysis, pp. 703–730. Chapman & Hall/CRC, Boca
Raton FL (2016)
14. Hennig, C., Coretto, P.: An adequacy approach for deciding the number of clusters for
OTRIMLE robust Gaussian mixture-based clustering. Aust. N. Z. J. Stat. (2021) doi:
10.1111/anzs.12338
15. Huber, P. J., Ronchetti, E. M.: Robust Statistics (2nd ed.). Wiley, Hoboken NJ (2009)
16. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
17. Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B.: Identifying mixtures of mixtures using
Bayesian estimation. J. Comput. Graph. Stat. 26, 285–295 (2017)
18. McLachlan, G. J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
19. Ritter, G.: Robust cluster analysis and variable selection. Chapman & Hall/CRC, Boca Raton
FL (2015)
Robustness Aspects of Optimized Centroids
Abstract Centroids are often used for object localization tasks, supervised segmentation in medical image analysis, or classification in other specific tasks. This paper starts by contributing to the theory of centroids by evaluating the effect of modified illumination on the weighted correlation coefficient. Further, the robustness of various centroid-based tools is investigated in experiments related to mouth localization in non-standardized facial images and to classification of high-dimensional data in a matched pairs design. The most robust results are obtained when the sparse centroid-based method for supervised learning is accompanied by an intrinsic variable selection. Robustness, sparsity, and energy-efficient computation turn out not to conflict with the requirement of optimal performance of the centroids.
1 Introduction
Methods based on centroids (templates, prototypes) are simple yet widely used for
object localization or supervised segmentation in image analysis tasks and also within
other supervised or unsupervised methods of machine learning. This is true e.g. in
various biomedical imaging tasks [1], where researchers typically cannot afford a large number of images [3]. Biomedical applications also benefit from the
interpretability (comprehensibility) of centroids [11].
This paper is focused on the question how are centroid-based methods influenced
by data contamination. Section 2 recalls the main approaches to centroid-based
object localization in images, as well as a recently proposed method of [6] for op-
timizing centroids and their weights. The performance of these methods to data
contamination (non-standard conditions) has not been however sufficiently investi-
gated. Particularly, we are interested in the performance of low-energy replacements
of the optimal centroids and in the effect of posterior variable selection (pixel selec-
tion). Section 2.1 presents novel expressions for images with a changed illumination.
Numerical experiments are presented in Section 3. These are devoted to mouth lo-
calization over raw facial images as well as over artificially modified images; other
experiments are devoted to high-dimensional data in a matched pairs design. The
optimized centroids of [6] and especially their modification proposed here turn out
to have remarkable robustness properties. Section 4 brings conclusions.
or (less frequently)
$$\arg\min_{\mathbf{x} \in E} \|\mathbf{x} - \mathbf{c}\|^2. \qquad (2)$$
Further, the notation $x + a$ with $x = (x_{ij})_{i,j}$ is used to denote the matrix $(x_{ij} + a)_{i,j}$ for a given $a \in \mathbb{R}$. We also use the following notation. The image $x$ is divided into two parts $x = (x_1, x_2)^{T} \in \mathbb{R}^{d}$, where $\sum_{I}$ or $\sum_{II}$ denote the sum over the pixels of the first or the second part, respectively.
The proofs of the formulas are technical but straightforward, exploiting known properties of $r_w$. The theorem reveals $r_w$ to be vulnerable to modified illumination, i.e. all the centroid-based methods of Section 2 may be overly influenced by this data modification.
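For concreteness, a weighted correlation between an image and a centroid can be sketched as below. This is our own minimal illustration using the standard definition of a weighted Pearson correlation; the exact form of $r_w$ used in the paper may differ in details, and the function and variable names are ours.

```python
import numpy as np

def weighted_corr(x, c, w):
    """Weighted Pearson correlation between flattened image x and centroid c,
    with nonnegative pixel weights w (normalized here to sum to one)."""
    w = w / w.sum()
    mx, mc = np.sum(w * x), np.sum(w * c)        # weighted means
    cov = np.sum(w * (x - mx) * (c - mc))        # weighted covariance
    sx = np.sqrt(np.sum(w * (x - mx) ** 2))      # weighted standard deviations
    sc = np.sqrt(np.sum(w * (c - mc) ** 2))
    return cov / (sx * sc)

rng = np.random.default_rng(0)
x, c, w = rng.random(64), rng.random(64), np.ones(64)
print(weighted_corr(x, c, w))  # equals the ordinary correlation for equal weights
```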
3 Experiments
3.1 Data
Three datasets are considered in the experiments. In the first dataset, the task is to localize the mouth in a database containing 212 grey-scale 2D facial images of healthy individuals, each of size 192 × 256 pixels. The database, previously analyzed in [6], was acquired at the Institute of Human Genetics, University of Duisburg-Essen, within research on genetic syndrome diagnostics based on facial images [1] under the projects BO 1955/2-1 and WU 314/2-1 of the German Research Council (DFG). We consider the training dataset to consist of the first 124 images, while the remaining 88 images represent an independent test set acquired later but still under the same standardized conditions, fulfilling the assumptions of an unbiased evaluation. The centroid described below is used with $I = 26$ and $J = 56$.
The methods are always trained on raw training images and applied not only to the raw test set, but also to the test set after being artificially modified using models inspired by Section 2.1. On the whole, five different versions of the test database are considered; the modifications required that we first manually localized the mouths in the test images:
1. Raw images.
2. Images with modified illumination, obtained as
$$f_{ij}^{*} = f_{ij} + \lambda |j - j_0|, \quad i = 1, \ldots, I, \; j = 1, \ldots, J, \qquad (9)$$
3. Images with artificial asymmetry.
4. Rotated images.
5. Images after denoising.
Fig. 2 The average centroid used as the initial choice for the centroid optimization.
3.2 Methods
The following methods are compared in the experiments; standard methods are computed using R, while we use our own C++ implementation of the centroid-based methods. The average centroid is obtained as the average of all mouths of the training set, i.e. the average across all patients. The centroid optimization starts with the average centroid as the initial one, and the optimization of weights starts with equal weights as the initial ones:
D. Centroid-based method (1) with the optimal centroid and equal weights [6].
E. Centroid-based method (1) with the optimal centroid and optimal weights as in [6] (optimizing the centroid and only afterwards the weights), i.e. with posterior variable selection (pixel selection).
F. Centroid-based method (1) as in [6], where however the weights are optimized first, and the centroid afterwards.
G. Centroid-based method (1) as in [6], where however each step of centroid optimization is immediately followed by an optimization of the weights; this method performs (in contrast to [6]) intrinsic variable selection.
H. Centroid-based method (1) as in [6], where however each optimization step proceeds over the 10 worst images (instead of the single worst image).
I. Centroid-based method (1) with the average centroid, where $r_w$ is used as $r_{LWS}$ [7] with the weight function
$$\psi_1(t) = \exp\left(-\frac{t^2}{2\tau^2}\right) \mathbb{1}\left[t < \frac{3}{4}\right], \quad t \in [0, 1], \qquad (12)$$
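As a small illustration of Eq. (12), the sketch below evaluates the weight function; the value of the tuning parameter $\tau$ is an assumption of ours, not taken from the paper.

```python
import numpy as np

def psi1(t, tau=0.5):
    """Weight function of Eq. (12): exp(-t^2 / (2 tau^2)) truncated at t = 3/4."""
    t = np.asarray(t, dtype=float)
    return np.exp(-t**2 / (2 * tau**2)) * (t < 0.75)

print(psi1([0.0, 0.5, 0.8]))  # weights decay with t and vanish for t >= 3/4
```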
Table 1 Classification accuracy for the three datasets. For the mouth localization data, modifications of the test images are described in Section 3: (i) none (raw images); (ii) illumination; (iii) asymmetry; (iv) rotation; (v) image denoising. A detailed description of the methods is given in Section 3.2.

            Mouth localization                          AMI            Simul.
Method   (i)    (ii)   (iii)  (iv)   (v)    (vi)     Raw   Cont.    Raw   Cont.
A        0.90   0.86   0.81   0.88   0.81   0.93     0.73  0.66     0.71  0.67
B        0.93   0.90   0.86   0.92   0.86   0.95     0.76  0.70     0.77  0.70
C        0.89   0.84   0.74   0.89   0.84   0.93     0.72  0.61     0.70  0.64
D        1.00   0.98   0.95   0.99   0.93   0.98     0.85  0.83     0.80  0.77
E        1.00   1.00   0.98   1.00   0.95   0.98     0.87  0.85     0.83  0.80
F        1.00   0.98   0.96   1.00   0.89   0.97     0.86  0.82     0.79  0.73
G        1.00   0.96   0.95   1.00   0.93   0.99     0.88  0.85     0.86  0.82
H        1.00   1.00   0.98   1.00   0.92   0.96     0.86  0.83     0.84  0.79
I        0.96   0.96   0.93   0.99   0.94   0.96     0.77  0.72     0.75  0.71
J        0.94   0.93   0.89   0.95   0.89   0.93     0.74  0.69     0.72  0.66
K        1.00   1.00   0.97   0.95   0.97   0.96     Not meaningful
3.3 Results
The results, as ratios of correctly classified cases, are presented in Table 1. For the mouth localization, the optimized centroids of methods D, F, and H turn out to outperform the simple centroids (A, B, and C); the novel modifications E and G, performing intrinsic variable selection, yield the best results. Simple standard centroids (A, B, and C) are non-robust to data contamination; this follows from Section 2.1 and from analogous considerations for other types of image contamination. On the other hand, the robustness of the optimized centroids is achieved by their optimization (and not by using $r_w$ as such). Methods E and G are even able to outperform methods I and J based on $r_{LWS}$. We recall that $r_{LWS}$ is globally robust in terms of the breakdown point [4], is computationally very demanding, and does not seem to allow any feasible optimization. Other results reported previously in [6] revealed that numerous standard machine learning methods are also too vulnerable (non-robust) with respect to data contamination if the similarity is measured by $r$ or $r_w$.
For the AMI dataset, methods E and G with variable selection yield the best results for the raw as well as the contaminated datasets. For the simulated data, method G yields the best results, with method E only slightly behind as the second best method.
4 Conclusions
We focus on small datasets, for which CNNs cannot be used [10]. This paper investigated the performance of centroid-based object localization over small databases with non-standardized images, which commonly appear e.g. in medical image analysis.
The requirements on robustness with respect to modifications of the images turn out not to contradict the requirements on optimality of the centroids. The method G, applying an intrinsic variable selection to the optimal centroid and weights [6], can be interpreted within a broader framework of robust dimensionality reduction (see [8] for an overview) or low-energy approximate computation. Additional results not presented here reveal the method based on optimized centroids to be robust also to small shifts. Neither the theoretical part of this paper nor the experiments exploit any specific properties of faces. The presented robust method thus also has potential for various other applications, e.g. deep fake detection by centroids, robust template matching by CNNs [9], or applying filters in convolutional layers of CNNs.
Acknowledgements The research was supported by the grant 22-02067S of the Czech Science
Foundation.
References
1. Böhringer, S., de Jong, M. A.: Quantification of facial traits. Frontiers in Genetics 10, 397
(2019)
2. Delaigle, A., Hall, P.: Achieving near perfect classification for functional data. Journal of the
Royal Statistical Society 74, 267–286 (2012)
3. Gao, B., Spratling, M. W.: Robust template matching via hierarchical convolutional features
from a shape biased CNN. ArXiv:2007.15817 (2021)
4. Jurečková, J., Picek, J., Schindler, M.: Robust Statistical Methods with R, 2nd edn. CRC Press, Boca Raton (2019)
5. Kalina, J.: A robust pre-processing of BeadChip microarray images. Biocybernetics and
Biomedical Engineering 38, 556–563 (2018)
6. Kalina, J., Matonoha, C.: A sparse pair-preserving centroid-based supervised learning method
for high-dimensional biomedical data or images. Biocybernetics and Biomedical Engineering
40, 774–786 (2020)
7. Kalina, J., Schlenker, A.: A robust supervised variable selection for noisy high-dimensional
data. BioMed Research International 2015, 320385 (2015)
8. Rousseeuw, P. J., Hubert, M.: Anomaly detection by robust statistics. WIREs Data Mining
and Knowledge Discovery 8, e1236 (2018)
9. Sun, L., Sun, H., Wang, J., Wu, S., Zhao, Y., Xu, Y.: Breast mass detection in mammography based on image template matching and CNN. Sensors 21, 2855 (2021)
10. Sze, V., Chen, Y. H., Yang, T. J., Emer, J. S.: Efficient Processing of Deep Neural Networks. Morgan & Claypool Publishers, San Rafael (2020)
11. Watanuki, S.: Watershed brain regions for characterizing brand equity-related mental pro-
cesses. Brain Sciences 11, 1619 (2021)
12. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the
wild. IEEE Conference on Computer Vision and Pattern Recognition 2012. IEEE, New York,
pp. 2879–2886 (2012)
Data Clustering and Representation Learning
Based on Networked Data
Abstract To deal simultaneously with attributed network embedding and clustering, we propose a new model exploiting both content and structure information. The proposed model relies on the approximation of the relaxed continuous embedding solution by the true discrete clustering. Thereby, we show that incorporating an embedding representation provides simpler and more easily interpretable solutions. Experimental results demonstrate that the proposed algorithm performs better, in terms of clustering, than the state-of-the-art algorithms, including deep learning methods devoted to similar tasks.
1 Introduction
In recent years, Networks [4] and Attributed Networks (AN) [8] have been used to model a large variety of real-world systems, such as academic and health care networks, where both node links and attributes/features are available for analysis. Unlike plain networks, in which only node links and dependencies are observed, with AN each node is associated with a valuable set of features. In other words, we have W obtained/available independently of X. More recently, representation learning has received a significant amount of attention as an important aim in many applications, including social networks, academic citation networks and protein-protein interaction networks. Hence, Attributed Network Embedding (ANE) [2] aims to seek a continuous low-dimensional matrix representation for the nodes of a network, such that the original network topological structure and node attribute proximity can be preserved in the new low-dimensional embedding.
Although many approaches have emerged for Network Embedding (NE), research on ANE still remains to be explored [3]. Unlike NE, which learns from plain networks, ANE aims to capitalize on both the proximity information of the network and the affinity of the node attributes. Note that, due to the heterogeneity of the two information sources, it is difficult for existing NE algorithms to be directly applied to ANE. To sum up, the learned representation has been shown to be helpful in many learning tasks such as network clustering [13]. ANE is therefore a challenging research problem due to the high dimensionality, sparsity and non-linearity of the graph data.
The paper is organized as follows. In Section 2 we formulate the objective function
to be optimized, describe the different matrices used, and present a Simultaneous
Attributed Network Embedding and Clustering (SANEC) framework for embedding
and clustering. Section 3 is devoted to numerical experiments. Finally, the conclusion
summarizes the advantages of our contribution.
2 Proposed Method
In this section, we describe the SANEC method. We present the formulation of an objective function and an effective algorithm for data embedding and clustering. But first, we show how to construct two matrices S and M, integrating both types of information (content and structure), to reach our goal.
The matrix M is designed so that nodes obtain similar representations if they have similar social relations and similar node attributes. This idea is inspired by the fact that the labels are strongly influenced by both content and structure information and are inherently correlated to both these information sources. Thereby the new data representation, referred to as $M = (m_{ij})$ of size $(n \times d)$, can be considered as a multiplicative integration of both W and X, replacing each node by the centroid (barycenter) of its neighborhood: i.e., $m_{ij} = \sum_{k=1}^{n} w_{ik} x_{kj}$, $\forall i, j$, or $M = WX$. In this way, given a graph $G$, graph clustering aims to partition the nodes of $G$ into $k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$, so that: (1) nodes within the same cluster are close to each other while nodes in different clusters are distant in terms of graph structure; and (2) nodes within the same cluster are more likely to have similar attribute values.
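The construction of M can be illustrated in a few lines of code. The sketch below is our own minimal example, assuming a row-normalized link matrix W so that the product WX averages (rather than sums) the neighbors' attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.random((n, d))                        # node attributes (n x d)
A = (rng.random((n, n)) < 0.4).astype(float)  # a toy adjacency matrix
np.fill_diagonal(A, 1.0)                      # keep each node's own attributes
W = A / A.sum(axis=1, keepdims=True)          # row-normalize the links
M = W @ X                                     # m_ij = sum_k w_ik x_kj
```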
Let $k$ be the number of clusters and also the number of components into which the data are embedded. With M and S, the SANEC method that we propose aims to obtain the maximally informative embedding according to the clustering structure in the attributed network data. Therefore, we propose to optimize the criterion (1). The optimization relies on the following decomposition:
$$\|S - GZB^{\top}\|^2 = \|S - BB^{\top}S\|^2 + \|SB - GZ\|^2. \qquad (2)$$
Proof. We first expand the matrix norm of the left-hand term of (2):
$$\|S - GZB^{\top}\|^2 = \|S\|^2 + \|GZB^{\top}\|^2 - 2\,\mathrm{Tr}(SGZB^{\top}). \qquad (3)$$
In a similar way, we obtain for the two terms of the right-hand side of (2)
$$\|S - SBB^{\top}\|^2 = \|S\|^2 - \|SB\|^2 \quad \text{due to } B^{\top}B = I, \qquad (4)$$
and
$$\|SB - GZ\|^2 = \|SB\|^2 + \|GZ\|^2 - 2\,\mathrm{Tr}(SGZB^{\top}). \qquad (5)$$
Summing the two terms (4) and (5) leads to the left-hand term of (2):
$$\|S\|^2 + \|GZ\|^2 - 2\,\mathrm{Tr}(SGZB^{\top}) = \|S - GZB^{\top}\|^2 \quad \text{due to } \|GZ\|^2 = \|GZB^{\top}\|^2.$$
which can be reduced to $\max_{Z} \mathrm{Tr}(G^{\top}SBZ)$ s.t. $Z^{\top}Z = I$. As proved on page 29 of [1], letting $U\Sigma V^{\top}$ be the SVD of $G^{\top}SB$, we get $Z = UV^{\top}$.
Compute B: we first set
$$Q = M^{\top}B. \qquad (7)$$
In the same manner as for the computation of Z, letting $\hat{U}\hat{\Sigma}\hat{V}^{\top}$ be the SVD of $(MQ + \lambda SGZ)$, we get
$$B = \hat{U}\hat{V}^{\top}. \qquad (8)$$
It is important to emphasize that, at each step, B exploits the information from the matrices Q, G, and Z. This highlights one of the aspects of the simultaneity of embedding and clustering.
Compute G: Finally, given B, Q and Z, problem (1) is equivalent to $\min_{G} \|SB - GZ\|^2$. As G is a cluster membership matrix, its computation is done as follows: we fix Q, Z, B, let $\tilde{B} = SB$, and assign each row $\tilde{b}_i$ of $\tilde{B}$ to the cluster $g$ minimizing $\|\tilde{b}_i - z_g\|^2$, where $z_g$ denotes the $g$-th row of Z.
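To make the alternating scheme concrete, the following sketch iterates the three updates. It is our own illustration rather than the authors' implementation: the matrix shapes and the nearest-representative rule for G are assumptions consistent with the derivation above.

```python
import numpy as np

def sanec_step(S, M, B, G, Z, lam):
    """One round of the alternating updates for Z, Q, B and G.

    S: (n, n) structure/similarity matrix; M: (n, d) representation (e.g. WX);
    B: (n, k) with B^T B = I; G: (n, k) binary memberships; Z: (k, k)."""
    # Z-update: max_Z Tr(G^T S B Z) s.t. Z^T Z = I, via the SVD of G^T S B.
    U, _, Vt = np.linalg.svd(G.T @ S @ B, full_matrices=False)
    Z = U @ Vt
    # Q- and B-updates, following Eqs. (7)-(8).
    Q = M.T @ B
    Uh, _, Vht = np.linalg.svd(M @ Q + lam * S @ G @ Z, full_matrices=False)
    B = Uh @ Vht
    # G-update: assign each row of SB to its nearest representative (row of Z).
    Bt = S @ B
    labels = ((Bt[:, None, :] - Z[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    G = np.eye(Z.shape[0])[labels]
    return B, G, Z
```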
3 Numerical Experiments
In the following, we compare SANEC with some competitive methods described below. The performances of all clustering methods are evaluated on challenging real-world datasets, commonly used to test ANE, where the clusters are known. Specifically, we consider three public citation network datasets, Citeseer, Cora and Wiki, which contain a sparse bag-of-words feature vector for each document and a list of citation links between documents. Each document has a class label. We treat documents as nodes and the citation links as edges. The characteristics of the datasets used are summarized in Table 1. The balance coefficient is defined as the ratio of the number of documents in the smallest class to the number of documents in the largest class, while nz denotes the percentage of sparsity.
Table 1 Description of datasets (#: the cardinality).

Dataset    # Nodes   # Attributes   # Edges   # Classes   nz (%)   Balance
Cora       2708      1433           5294      7           98.73    0.22
Citeseer   3312      3703           4732      6           99.14    0.35
Wiki       2405      4973           17981     17          86.46    0.02
In our comparison we include standard methods as well as recent deep learning methods; these differ in the way they use the available information. Some of them (such as k-means) use only X as a baseline, while others are more recent algorithms based on X and W. The compared methods include TADW [14], DeepWalk [7] and Spectral Clustering [11]; using X and W, we evaluated GAE and VGAE [5], ARVGA [6], AGC [15] and DAEGC [12].
With the SANEC model, the parameter $\lambda$ controls the role of the second term $\|S - GZB^{\top}\|^2$ in (1). To measure its impact on the clustering performance of SANEC$_S$, we vary $\lambda$ in $\{0, 10^{-6}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{3}\}$. Through many experiments, as illustrated in Figure 2, we choose $\lambda = 10^{-3}$. The choice of $\lambda$ warrants in-depth evaluation.
Fig. 2 Clustering performance (%) of SANEC$_S$ when varying $\lambda$.
In our experiments, the clustering performance with respect to the true available clusters is assessed by accuracy (ACC), normalized mutual information (NMI) and the adjusted Rand index (ARI). We repeat the experiments 50 times with different random initializations, and the averages are reported in Table 2; the best performance for each dataset is highlighted in bold.
First, we observe the high performance of methods integrating information from W. For instance, RTM and RMSC are better than classical methods using only either X or W. On the other hand, all methods relying on both X and W, including the deep learning algorithms, are better yet. Regarding SANEC, in both versions, relying on W (referred to as SANEC$_W$) or on S (referred to as SANEC$_S$), we note high performances for all the datasets; with SANEC$_S$ we remark the impact of $W_X$: it learns low-dimensional representations that suit the clustering structure.
To go further in our investigation, and given the sparsity of X, we applied a tf-idf standardization followed by $L_2$ normalization, as often used to process document-term matrices (see e.g. [9, 10]), while in the construction of $W_X$ we used the cosine metric. The results, reported in Figure 3, show a slight improvement.
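The preprocessing can be sketched as follows; this is our own minimal example, assuming that $W_X$ is a cosine-similarity matrix built on the normalized rows of X (the exact construction used in the paper may differ).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(100, 500))  # toy sparse document-term counts

# tf-idf standardization followed by L2 normalization of the rows.
X_norm = normalize(TfidfTransformer().fit_transform(X), norm="l2")

# Content-based affinity W_X using the cosine metric.
W_X = cosine_similarity(X_norm)
```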
Fig. 3 Evaluation of SANEC$_S$ using tf-idf normalization of X and the cosine metric for $W_X$; the bars compare SANEC_l2 and SANEC_tfidf in terms of Acc, NMI and ARI.
4 Conclusion
This paper focused on simultaneous representation learning and clustering for attributed networks. The obtained results show the advantages of combining both tasks over other approaches. SANEC$_S$ outperforms all recent methods devoted to the same tasks, including deep learning methods, which require pretraining of deep models. However, there are other points that warrant in-depth evaluation, such as the choice of $\lambda$ and the complexity of the algorithm in terms of network size.
The proposed framework offers several perspectives for investigation. We have noted that the construction of M and S is important; it highlights the introduction of W. As for $W_X$, we have observed that it is fundamental, as it makes it possible to link the information from X to the network; this has been verified by many experiments. First, we would like to be able to measure the impact of each matrix W and $W_X$ in the construction of S by considering two different weights for W and $W_X$ as follows: $S = \alpha W + \beta W_X$. Finally, as we have stressed that Q is an embedding of attributes, this suggests also considering simultaneous ANE and co-clustering.
References
1. Ten Berge, J. M. F.: Least Squares Optimization in Multivariate Analysis. DSWO Press, Leiden (1993)
2. Cai, H. Y., Zheng, V. W., Chang, K. C. C.: A comprehensive survey of graph embedding:
Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30(9), 1616-1637
(2018)
3. Chang, S., Han, W., Qi, G. J., Aggarwal, C. C., Huang, T.S.: Heterogeneous network embedding
via deep architectures. In SIGKDD, pp. 119–128 (2015)
4. Doreian, P., Batagelj, V., Ferligoj, A.: Advances in network clustering and blockmodeling.
John Wiley & Sons (2020)
5. Kipf, T. N., Welling, M.: Variational graph auto-encoders. In NIPS Workshop on Bayesian
Deep Learning, (2016)
6. Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., Zhang, C.: Adversarially regularized graph
autoencoder for graph embedding. In IJCAI, pp. 2609-2615, (2018)
7. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In
SIGKDD, pp. 701-710 (2014)
8. Qi, G.J., Aggarwal, C. C., Tian, Q., Ji, H., Huang, T. S.: Exploring context and content links in
social media: A latent space method. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 850-862
(2012)
9. Salah, A., Nadif, M.: Model-based von Mises-Fisher co-clustering with a conscience. In SDM,
pp. 246–254. SIAM (2017)
10. Salah, A., Nadif, M.: Directional co-clustering. Data Analysis and Classification. 13(3), 591-
620 (2019)
11. Tang, L., Liu, H.: Leveraging social media networks for classification. Data mining and
knowledge discovery. 23(3), 447-478 (2011)
12. Wang, C., Pan, S., Hu, R., Long, G., Jiang, J., Zhang, C.: Attributed graph clustering: A deep attentional embedding approach. arXiv preprint arXiv:1906.06532 (2019). Available via arXiv. https://arxiv.org/pdf/1906.06532.pdf
13. Wang, C., Pan, S., Long, G., Zhu,X., Jiang, J.: Mgae: Marginalized graph autoencoder for
graph clustering. In CIKM, pp. 889-898, (2017)
14. Yang, C., Liu, Z., Zhao, D., Sun, M., Chang, E. Y.: Network representation learning with rich
text information. In IJCAI, pp. 2111-2117 (2015)
15. Zhang, X., Liu, H., Li, Q., Wu, X. M.: Attributed graph clustering via adaptive graph convolution. arXiv preprint arXiv:1906.01210 (2019). Available via arXiv. https://arxiv.org/pdf/1906.01210.pdf
Towards a Bi-stochastic Matrix Approximation
of 𝒌-means and Some Variants
Abstract The $k$-means algorithm and some of its variants have been shown to be useful and effective for tackling the clustering problem. In this paper we embed $k$-means variants in a bi-stochastic matrix approximation (BMA) framework. We then derive from the $k$-means objective function a new formulation of the criterion. In particular, we show that some $k$-means variants are equivalent to an algebraic problem of bi-stochastic matrix approximation under suitable constraints. For optimizing the derived objective function, we develop two algorithms; the first one consists in learning a bi-stochastic similarity matrix, while the second seeks the optimal partition, which is the equilibrium state of a Markov chain process. Numerical experiments on real datasets demonstrate the interest of our approach.
1 Introduction
Over the last decades, unsupervised learning, and specifically clustering, has received a significant amount of attention as an important problem with many applications in data science. Let $A = (a_{ij})$ be an $n \times m$ continuous data matrix, where the set of rows (objects, individuals) is denoted by $I$ and the set of columns (attributes, features) by $J$. Many clustering methods, hierarchical or not, aim to construct an optimal partition of $I$ or, sometimes, of $J$.
In this paper we show how some $k$-means variants can be presented as a bi-stochastic matrix approximation problem under suitable constraints generated by the properties of the reached solution. To reach this goal, we first demonstrate that some variants of $k$-means are equivalent to learning a bi-stochastic similarity matrix having a diagonal block structure. Based on this formulation, referred to as BMA, we derive two iterative algorithms; the first algorithm learns a bi-stochastic $n \times n$ similarity matrix, while the second directly seeks an optimal clustering solution. Our main contribution is to establish the theoretical connection of conventional $k$-means and some of its variants to the BMA framework. The implications of the reformulation of $k$-means as a BMA problem are manifold:
• It makes connections with recent clustering methods like spectral clustering and subspace clustering.
• It learns a well-normalized (bi-stochastic) similarity matrix, beneficial for spectral clustering [12].
• Unlike existing spectral and subspace methods, which combine in a sequential way the steps of similarity learning and clustering derivation, our proposed method jointly learns a block-diagonal bi-stochastic affinity matrix which naturally expresses a clustering structure.
The rest of the paper is organized as follows. Section 2 introduces some variants of $k$-means. Section 3 provides Matrix Factorization (MF) and BMA formulations of the $k$-means variants. Section 4 discusses the BMA clustering algorithm and Section 5 is devoted to numerical experiments. Finally, the conclusion summarizes the interest of our contribution.
2 Variants of 𝒌-Means
Given a data matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$, the aim of clustering is to cluster the rows or the columns of $A$, so as to optimize the difference between $A = (a_{ij})$ and the clustered matrix revealing a significant block structure. More formally, we seek to partition the set of rows $I = \{1, \ldots, n\}$ into $k$ clusters $C = \{C_1, \ldots, C_l, \ldots, C_k\}$. The partitioning naturally induces a cluster indicator matrix $R = (r_{il}) \in \mathbb{R}^{n \times k}$, defined as a binary classification matrix such that $r_{il} = 1$ if row $a_i \in C_l$, and 0 otherwise. On the other hand, we denote by $S \in \mathbb{R}^{m \times k}$ a reduced matrix specifying the cluster representation. The detection of homogeneous clusters of objects can be achieved by looking for the two matrices $R$ and $S$ minimizing the total squared residue measure
$$\mathcal{J}_{KM}(R, S) = \|A - RS^{\top}\|^2. \qquad (1)$$
The term $RS^{\top}$ characterizes the information of $A$ that can be described by the cluster structure. The clustering problem can thus be formulated as a matrix approximation problem, where the clustering aims to minimize the approximation error between the original data $A$ and the reconstructed matrix based on the cluster structure.
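As an illustration of this matrix-approximation view, the sketch below alternates between the two matrices of Eq. (1): given the partition, S collects the cluster means, and given S, each row is reassigned to its nearest representative. This is a plain k-means written in factorization form, not the authors' algorithm.

```python
import numpy as np

def kmeans_mf(A, k, n_iter=50, seed=0):
    """k-means viewed as minimizing ||A - R S^T||^2 over (R, S)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=A.shape[0])
    for _ in range(n_iter):
        R = np.eye(k)[labels]                      # binary indicator (n x k)
        counts = np.maximum(R.sum(axis=0), 1)      # guard against empty clusters
        S = (R.T @ A / counts[:, None]).T          # cluster means, (m x k)
        d = ((A[:, None, :] - S.T[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                  # nearest representative
    R = np.eye(k)[labels]
    return R, S, np.linalg.norm(A - R @ S.T) ** 2  # J_KM of Eq. (1)
```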
Factorial $k$-means analysis (FKM) [9] and Reduced $k$-means analysis (RKM) [1] are clustering methods that aim at simultaneously achieving a clustering of the objects and a dimension reduction of the features. The advantage of these methods is that both the clustering of objects and the low-dimensional subspace capturing the cluster structure are obtained simultaneously. To achieve this objective, RKM is defined by the minimization of the following criterion
On the other hand, it is easy to verify that the approximation $RR^{\top}A$ of $A$ is formed by the same value in each block $A_l$, $l = 1, \ldots, k$. Specifically, the matrix $R^{\top}A$, equal to $S^{\top}$, plays the role of a summary of $A$ and absorbs the different scales of $A$ and $R$. Finally, $RR^{\top}A$ gives the row-cluster mean vectors. Note that it is easy to show that $R$ verifies the following properties
Let $\Pi = RR^{\top}$ be a bi-stochastic similarity matrix. Before giving the BMA formulation of the $k$-means variants, we first need to spell out the good properties of $\Pi$. Indeed, by construction from $R$, $\Pi$ has at least the following properties, which can be easily proven.
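These properties are easy to check numerically. The sketch below is our own illustration: it builds $\Pi = RR^{\top}$ from a partition, assuming R is the column-normalized indicator matrix ($r_{il} = 1/\sqrt{n_l}$ for $i \in C_l$), and verifies symmetry, bi-stochasticity and idempotence.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])
n, k = labels.size, labels.max() + 1
R = np.eye(k)[labels]
R = R / np.sqrt(R.sum(axis=0))          # column-normalized indicator matrix

Pi = R @ R.T                            # block diagonal, blocks (1/n_l) 1 1^T
assert np.allclose(Pi, Pi.T)            # symmetric
assert np.allclose(Pi.sum(axis=1), 1)   # rows (and columns) sum to one
assert np.allclose(Pi @ Pi, Pi)         # idempotent
```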
Given a data matrix $A$ and $k$ row clusters, we can hope to discover the cluster structure of $A$ from $\Pi$. Notice that from (8), $\Pi$ is nonnegative, symmetric, bi-stochastic (doubly stochastic) and idempotent. By setting $k$-means in the BMA framework, the problem of clustering is reformulated as the learning of a structured bi-stochastic similarity matrix $\Pi$ by minimizing the following $k$-means variants objective,
The theorem below demonstrates that the optimization of the $k$-means objective and of the BMA objective under suitable constraints are equivalent. Equation (13) establishes the equivalence between $k$-means and the BMA formulation. Then, solving the BMA objective function (9) is equivalent to finding a global solution of the $k$-means criterion (1).
Theorem 1
The proof of this equivalence is given in the appendix. Note that this new formulation gives some interesting insights into $k$-means and its variants:
• First, it shows that $k$-means is equivalent to learning a structured bi-stochastic similarity matrix, i.e. a normalized bi-stochastic matrix with a block-diagonal structure.
• Secondly, it establishes very interesting connections of $k$-means to many state-of-the-art subspace clustering methods [10, 5]. Moreover, this formulation merges the traditional two-step process used by subspace clustering methods, which consists in first constructing an affinity matrix between data points and then applying spectral clustering to this affinity. This allows the joint learning of a similarity matrix that better reflects the clustering structure through its block-diagonal shape.
• Finally, it allows applying the spirit of $k$-means to graph or similarity data.
First, we establish the relationship between our objective function and that used in [12, 11]. From $\|A - \Pi A\|^2 = \mathrm{Trace}(AA^{\top}) + \mathrm{Trace}(\Pi AA^{\top}\Pi) - 2\,\mathrm{Trace}(AA^{\top}\Pi)$ and by using the idempotence property $\Pi\Pi^{\top} = \Pi$, we can show that
$$\arg\min_{\Pi} \|A - \Pi A\|^2 \;\Leftrightarrow\; \arg\min_{\Pi} \|AA^{\top} - \Pi\|^2 \;\Leftrightarrow\; \arg\max_{\Pi} \mathrm{Trace}(AA^{\top}\Pi).$$
5 Experimental Analysis
In this section we first run our algorithm on two real-world datasets. The 16 townships data consists of the characteristics (rows) of 16 townships (columns); each cell indicates the presence (1) or absence (0) of a characteristic in a township. This example was used by Niermann [7] for a data ordering task, where the author aims to reveal a block-diagonal form. The second dataset, called Mero, comes from archaeological data on Merovingian buckles found in north-eastern France. This data matrix consists of 59 buckles characterized by 26 descriptive attributes (see Marcotorchino [6] for more details). Figure 1 shows, in order, $A$, $\hat{A}$, $S_R = AA^{\top}$ reorganized according to the sorted $\pi$, and the sorted $\pi$ plot for both datasets.
6 Conclusion
Appendix
From the BMA formulation, we know that one can easily construct a feasible solution for $k$-means from a feasible solution of the BMA formulation. Therefore, it remains to show that from a global solution of the BMA formulation, we can obtain a feasible solution of $k$-means. In order to show the equivalence between the optimization of the $k$-means formulation and the BMA formulation, we first consider the following lemma.
and,
$$\sum_{i' \in A_{i_0}} \sum_{i \in A_{i_0}} \pi_{i'i} = \sum_{i \in A_{i_0}} \pi_{i\cdot} = \sum_{i \in A_{i_0}} 1 = |A_{i_0}|, \qquad (15)$$
$$\forall i,\ \pi_{ii} = \sum_{i'} \pi_{ii'}^2 \;\Rightarrow\; \forall i \in A_{i_0};\ \sum_{i' \in A_{i_0}} \frac{\pi_{ii'}^2}{\pi_{ii}} = \sum_{i' \in A_{i_0}} \left(\frac{\pi_{ii'}}{\pi_{ii}}\right)\pi_{ii'} = 1. \qquad (16)$$
From (14) and (16), we deduce that $\forall i \in A_{i_0};\ \sum_{i' \in A_{i_0}} \pi_{i'i} = \sum_{i' \in A_{i_0}} \left(\frac{\pi_{ii'}}{\pi_{ii}}\right)\pi_{ii'}$, implying that $\pi_{ii'} = \pi_{ii}$, $\forall i, i' \in A_{i_0}$. Substituting in (15) $\pi_{ii'}$ by $\pi_{ii}$ for all $i, i' \in A_{i_0}$ leads to $\sum_{i' \in A_{i_0}} \pi_{ii'} = \sum_{i' \in A_{i_0}} \pi_{ii} = |A_{i_0}|\pi_{ii} = 1$, $\forall i \in A_{i_0}$. From this we can deduce that $\pi_{ii} = \pi_{ii'} = \frac{1}{|A_{i_0}|}$, $\forall i, i' \in A_{i_0}$. We can therefore rewrite the matrix $\Pi$ in the form of a block-diagonal matrix
$$\Pi = \begin{pmatrix} \Pi^0 & 0 \\ 0 & \bar{\Pi}^0 \end{pmatrix},$$
where $\Pi^0$ is a block matrix whose general term is defined by $\Pi^0_{ii'} = \frac{1}{|A_{i_0}|}$, $\forall i, i' \in A_{i_0}$, and $\mathrm{trace}(\Pi^0) = 1$. The matrix $\bar{\Pi}^0$ is a positive semi-definite matrix which also verifies the constraints $(\bar{\Pi}^0)^{\top} = \bar{\Pi}^0$, $\bar{\Pi}^0 \mathbf{1} = \mathbf{1}$, $(\bar{\Pi}^0)^2 = \bar{\Pi}^0$ and $\mathrm{trace}(\bar{\Pi}^0) = k - 1$.
By repeating the same process $k - 1$ times, we get the block-diagonal form of $\Pi$: $\Pi = \mathrm{diag}(\Pi^0, \Pi^1, \ldots, \Pi^l, \ldots, \Pi^{k-1})$ with $\Pi^l = \frac{1}{n_l}\mathbf{1}_l\mathbf{1}_l^{\top}$, $\mathrm{trace}(\Pi^l) = 1$ $\forall l$, and $\sum_{l=0}^{k-1} n_l = n$.
References
1. De Soete, G., Carroll, J. D.: K-means clustering in a low-dimensional Euclidean space. In: E. Diday et al. (eds.) New Approaches in Classification and Data Analysis, pp. 212–219. Springer-Verlag, Berlin (1994)
2. Dhillon, I. S.: Co-clustering documents and words using bipartite spectral graph partitioning.
In SIGKDD, pp. 269-274 (2001)
3. Ding, C., Li, T., Peng, W., Park, H.: Orthogonal nonnegative matrix trifactorizations for
clustering. In SIGKDD, pp. 126-135 (2006)
4. Golub, G. H., van Loan, C. F.: Matrix Computations (3rd ed.). Johns Hopkins University Press
(1996)
5. Lim, D., Vidal, R., Haeffele, B. D.: Doubly stochastic subspace clustering. arXiv preprint arXiv:2011.14859 (2020). Available via arXiv. https://arxiv.org/abs/2011.14859
6. Marcotorchino, J. F.: Seriation problems: an overview. Appl. Stoch. Model. D. A., 7(2),
139–151 (1991)
7. Niermann, S.: Optimizing the ordering of tables with evolutionary computation. American
Statistician. 59(1), 41-46 (2005)
8. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining mul-
tiple partitions. Journal of Machine Learning Research, 3, 583-617 (2002)
9. Vichi, M., Kiers, H. A.: Factorial 𝑘-means analysis for two-way data. CSDA, 37(1), 49-64
(2001)
10. Vidal, R.: Subspace clustering. IEEE Signal Processing Magazine 28(2), 52-68 (2011)
11. Wang, F., Li, P., König, A. C.: Improving clustering by learning a bi-stochastic data similarity
matrix. Knowl. Inf. Syst. 32(2), 351-382 (2012)
12. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In ICCV, pp.
294-301 (2005)
Clustering Adolescent Female Physical Activity
Levels with an Infinite Mixture Model on
Random Effects
Abstract Physical activity trajectories from the Trial of Activity in Adolescent Girls (TAAG) capture various exercise habits over female adolescence. Previous analyses of this longitudinal data from the University of Maryland field site examined the effect of various individual-, social-, and environmental-level factors impacting the change in physical activity levels from 14 to 23 years of age. We aimed to understand the differences in physical activity levels after controlling for these factors. Using a Bayesian linear mixed model incorporating a model-based clustering procedure for the random deviations, which does not specify the number of groups a priori, we find that physical activity levels are starkly different for about 5% of the study sample. These young girls exercise on average 23 more minutes per day.
1 Introduction
Physical activity and diet are arguably the two main controllable factors having the
greatest impact on our health. Whereas we have little to no control over factors like
our genetic predisposition to disease or exposure to environmental toxins, we have
much greater control over our diet and activity levels. Despite our ability to choose to
engage in healthy behaviors such as exercising and eating a healthy diet, these choices
are plagued with the complexity of human psychology and the modern demands and
distractions that pervade our lives today. Several factors influence levels of physical
activity; we explore the factors impacting female adolescents using longitudinal data.
The University of Maryland, one of the six initial university field centers of the Trial of Activity in Adolescent Girls (TAAG), chose to follow its 2006 8th grade cohort for two additional time points over adolescence: 11th grade and 23 years of age. The females were therefore measured at roughly ages 14, 17, and 23. In these waves there was no intervention, as this observational longitudinal study aimed at exploring the patterns of physical activity levels and associated factors over time.
The model presented in Wu et al. [1] motivates the current work. We fit a similar
linear mixed model controlling for the same variables. Rather than cluster the raw
physical activity trajectories to identify groups, we cluster the females within the
model-fitting procedure based on the values of the subject-specific deviations from
the adjusted physical activity levels. Fitting a Bayesian linear mixed model, we
simultaneously explore the subject groups through the use of reversible jump Markov
chain Monte Carlo (MCMC) applied to the random effects. Bayesian model-based
clustering methods have been applied within linear mixed models to identify groups
by clustering the fitted values of the dependent variable. For example, [2] fits cluster-
specific linear mixed models to the gene expression outcome using an EM algorithm
and [3] clusters gene expression in a similar fashion, except using Bayesian methods.
In contrast, we perform the clustering on the random effects, which allows us to
investigate the variability that is unexplained by the covariates of interest. This
methodology is advantageous because of its ability to jointly estimate all effects,
while also exploring the infinite space of group arrangements.
$$\mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + r_i \mathbf{1}_T + \boldsymbol{\epsilon}_i \qquad (1)$$
where $\boldsymbol{\beta}$ represents the coefficients for the covariate effects and $\boldsymbol{\epsilon}_i = (\epsilon_{i,1}, \ldots, \epsilon_{i,T})$ are the residuals. We assume independence and normality of the residuals and the random effects; hence, $r_i \sim N(0, \sigma_r^2)$ and $\boldsymbol{\epsilon}_i \sim N(\mathbf{0}, \sigma_\epsilon^2 \mathbf{I}_T)$ for $i = 1, \ldots, n$.
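For intuition, data from model (1) can be simulated in a few lines; the dimensions and variance values below are illustrative assumptions, not the study's estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, p = 428, 3, 4                      # subjects, time points, covariates
beta = rng.normal(size=p)
X = rng.normal(size=(n, T, p))           # covariates X_i (T x p per subject)
r = rng.normal(0.0, 1.0, size=n)         # random intercepts r_i ~ N(0, sigma_r^2)
eps = rng.normal(0.0, 2.0, size=(n, T))  # residuals eps_i ~ N(0, sigma_eps^2 I_T)
y = X @ beta + r[:, None] + eps          # y_i = X_i beta + r_i 1_T + eps_i
```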
Fitting the mixed model demonstrates substantial heterogeneity in the residuals: the variability increases as the fitted values increase. A traditional approach to fixing this violation would re-fit the model to the log-transformed MVPA (moderate-to-vigorous physical activity) values. Plots of residuals versus fitted values for this model approach also exhibited evidence of heterogeneity, thus still violating a core assumption of the regression framework. Given the changes adolescents experience as they grow into young adults, we expect to see heterogeneity in the physical activity patterns across this duration of follow-up time. However, the inability of the model to capture such changes over time at the higher levels of physical activity suggests the need for model improvements. The purpose of this analysis is to present our adjustments to previous analyses in order to investigate underlying characteristics across different groups of females formed based on deviations from adjusted physical activity levels.
Fig. 1 The plot on the left depicts the residuals versus fitted values for the linear mixed model
in Eq. (1); they demonstrate severe heteroscedasticity. The variance increases as the fitted values
increase. The plot on the right depicts the distribution of the random effects.
We fit the mixed model in Eq. (1) to the sample of female adolescents. The heteroscedasticity depicted in Figure 1 reveals an increase in variance with the predicted minutes of moderate-to-vigorous physical activity, which we would expect. The plot on the right in Figure 1 demonstrates that the distribution of the random effects does not appear to follow our assumption of being normally distributed and centered around zero. The random effects do appear to follow a normal distribution over the lower range of deviations, with a subset of the subjects having larger positive deviations from the estimated adjusted physical activity levels.
To capture the heterogeneity and allow the random effects to follow a non-normal distribution, we assign the random effects a Gaussian mixture distribution. Before introducing the model for heterogeneity, we note the likelihood for the observed outcomes, $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)^{\prime}$. The moderate-to-vigorous physical activity distribution is
$$p(\mathbf{Y} \mid \boldsymbol{\beta}, \mathbf{r}, \sigma_\epsilon^2) = \prod_{i=1}^{n} \prod_{t=1}^{T} \left(2\pi\sigma_\epsilon^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma_\epsilon^2}\left(y_{i,t} - X_{i,t}\boldsymbol{\beta} - r_i\right)^2\right\}. \qquad (2)$$
Then, to account for the heterogeneity across subjects, the probability density for the subject-specific deviations in physical activity is expressed as a mixture of one-dimensional normal densities,
$$p(r_i \mid \boldsymbol{\mu}, \boldsymbol{\sigma}_r^2) = \sum_{g=1}^{G} \pi_g \left(2\pi\sigma_{r,g}^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma_{r,g}^2}\left(r_i - \mu_g\right)^2\right\}. \qquad (3)$$
Richardson and Green [4] adapt the reversible jump methodology to univariate normal mixture models. In addition to being able to characterize the distribution of $G$, this Bayesian framework has the ability to simultaneously explore the posterior distribution for the covariate effects of interest. Furthermore, we obtain the posterior distributions of the group-defining parameters rather than just point estimates. Since we are interested in the physical activity differences between subjects when controlling for these covariates, we use Eq. (1) as the basis of our model.
The foundation of our clustering model is a finite mixture model on the random effects, $r_i$, as shown in Eq. (3). For all $i = 1, \ldots, n$ and $g = 1, \ldots, G$: $r_i \mid c_i, \boldsymbol{\mu} \sim F_r(\mu_{c_i}, \sigma_{r,c_i}^2)$, $(c_i = g) \mid \boldsymbol{\pi}, G \sim \mathrm{Categorical}(\pi_1, \ldots, \pi_G)$, $\mu_g \mid \tau \sim N(\mu_0, \tau)$, $\sigma_{r,g}^2 \mid c, \delta \sim IG(c, \delta)$, $\boldsymbol{\pi} \mid G \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha)$, $G \sim \mathrm{Uniform}[1, G_{max}]$, where $c_i$ is the latent grouping variable tracking the assignment of $r_i$ to any one of the $G$ clusters. The likelihood function for these subject-specific deviations, given the group assignment $c_i$, is simply $p(r_i \mid c_i = g, \mu_g, \sigma_{r,g}^2) = \left(2\pi\sigma_{r,g}^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma_{r,g}^2}(r_i - \mu_g)^2\right\}$.
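One fixed-dimension step of such a sampler is easy to sketch: conditional on the current mixture parameters, each label $c_i$ is drawn from its categorical full conditional, proportional to $\pi_g\, N(r_i \mid \mu_g, \sigma_{r,g}^2)$. The code below is our own illustration of this single update, not the full reversible jump sampler.

```python
import numpy as np

def sample_labels(r, pi, mu, sigma2, rng):
    """Draw c_i with p(c_i = g | r_i) proportional to pi_g * N(r_i | mu_g, sigma2_g)."""
    logp = (np.log(pi)[None, :]
            - 0.5 * np.log(2 * np.pi * sigma2)[None, :]
            - 0.5 * (r[:, None] - mu[None, :]) ** 2 / sigma2[None, :])
    p = np.exp(logp - logp.max(axis=1, keepdims=True))  # stabilize, then normalize
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(pi), p=row) for row in p])

rng = np.random.default_rng(1)
r = np.concatenate([rng.normal(0, 1, 95), rng.normal(5, 1, 5)])  # toy deviations
c = sample_labels(r, pi=np.array([0.9, 0.1]), mu=np.array([0.0, 5.0]),
                  sigma2=np.array([1.0, 1.0]), rng=rng)
```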
A proposed move from the current parameter vector $\boldsymbol{\theta}$ to $\boldsymbol{\theta}_m$ is accepted with probability of the form
$$\min\left\{1, \frac{p(\boldsymbol{\theta}_m \mid \mathbf{r})\, q(\boldsymbol{\theta} \mid m)}{p(\boldsymbol{\theta} \mid \mathbf{r})\, q(\boldsymbol{\theta}_m \mid m^{-1})}\, |J|\right\},$$
where $p(\cdot)$ and $q(\cdot)$ denote the target and proposal distribution, respectively. In our case, the target distribution is the posterior distribution of our group-specific parameters, $(\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2, \boldsymbol{\pi}, \mathbf{c})$, given the data, $\mathbf{r}$, which are the random effects. Each proposed move changes the dimension of the parameters in $\boldsymbol{\theta}$ by 1, adding or deleting group-specific parameters. The ratio $q(\boldsymbol{\theta}_m \mid m^{-1})/q(\boldsymbol{\theta} \mid m)$ ensures "dimension balancing", as explained in [4]. For moves increasing in dimension, the Jacobian, $|J|$, is computed as $|\partial\boldsymbol{\theta}_m / \partial(\boldsymbol{\theta}, \mathbf{u})|$, because moving from $\boldsymbol{\theta}$ to $\boldsymbol{\theta}_m$ requires additional parameters $\mathbf{u}$ to appropriately match dimensions. The opposite is true for moves decreasing in dimension. This is what we refer to as the reversible jump mechanism; each time a split is proposed, we must also design the reversible move that would result in the currently merged component, and vice versa.
Split and merge moves are implemented for our model. These moves update $\pi$, $\mu$, and $\sigma$ for two adjacent groups, or create two adjacent groups using three Beta-distributed additional parameters $u$ for dimension balancing, in a similar way to [4]. Within our context of random effects, births and deaths are not appropriate: a singleton causes issues of identifiability because its $r_i$ is no longer defined as random. We therefore do not allow birth and death moves in our reversible jump methods.
Our analysis focuses only on the girls from the University of Maryland site of the TAAG study who were measured at all three follow-up time points, beginning in 2006. After excluding girls with missing outcomes, the final sample consisted of 428 girls measured in 2006, 2009, and 2014. Missing covariate values were imputed for four subjects using the values from the nearest time point.
We determine the group assignments using an MCMC sampler with 10,000 iterations and a burn-in of 500 draws. The posterior distribution for $G$ was extremely peaked at $G = 2$. Summarization of the posterior distribution of the group assignments via the least squares clustering method delivers the final arrangement, $\hat{\mathbf{c}}_{LS}$, of girls into two groups describing their physical activity levels [5]. Since our sampler explores several models for which the group assignments and $G$ can vary, we sample additional draws from the posterior distribution of the remaining parameters of interest using an MCMC sampler with the model specification of Eq. (1) and the groups fixed at our posterior assignment, $\hat{\mathbf{c}}_{LS}$, for the subject-specific random effects. This additional chain was run for 10,000 iterations with a burn-in of 500 draws, yielding the results summarized below. Convergence diagnostics indicated that 10,000 iterations sufficiently met the effective sample size threshold for estimating the coefficients for the covariate effects, $\boldsymbol{\beta}$, and the group-specific means, $\boldsymbol{\mu}$, describing the deviations of the girls' physical activity levels [6].
After controlling for the covariates believed to best describe the variation in the physical activity levels of females, our method finds that there is a small subset of females who are much more active than the remainder of the sample. Every subject in the more active group has fitted trajectories above the recommended 30 minutes of exercise. Most of the population does not get the recommended allowance of daily physical activity, and this is well supported in our analysis. All but two subjects in the less active group have fitted trajectories that never pass the recommended 30 minutes of exercise. The random effects from this model better fit a normal distribution (not centered at 0) for each of the two groups and do not show as much heteroscedasticity over time as the one-group model depicted in Figure 1.
Given that these differences are observed even after controlling for the aforementioned variables, we would like to further examine the characteristics that may set these highly active females apart from the rest of the girls in our sample. To do this, we look at a number of other covariates that were either excluded during the variable selection process or were not measured at all time points. We use simple Wilcoxon
tests on the available time points of the additional variables and on all time points
for covariates we adjusted for in the initial model.
We first note that the median BMI of the subset of highly active girls is significantly lower than that of the remaining girls, consistently at each TAAG wave. Similarly, mother's education level is also consistently significant at each time point. These values are measured at each time point to reflect changes as the mother pursues additional education, or as the girls become more aware of their mother's education. The majority of the highly active girls have mothers who have completed college or higher (75% or more at each time point), whereas the remainder of the sample has mothers with a range of education levels (less than high school through college or more). The number of parks within a one-mile radius of the home is significantly different between the high and low groups in the middle school and high school years, when the girls are likely to be living at home. This variable may be an indicator of socioeconomic status, as families with more money may live in neighborhoods nearer to parks. Finally, in the high school and college-aged years, the self-management strategies of the highly active girls are rated significantly higher than those of the remainder of the population.
In high school, the subset of highly active girls tend to have better self-described health, participate in more sports teams, have access to more physical education classes, and were older at the time of their first menstrual period. At the college age, these girls still have higher self-described health; moreover, the global physical activity score and the self-esteem scores are now significantly higher in the subset of highly active females.
4 Discussion
We extended the mixed models of [1], with the application still focused on the same 428 girls from the TAAG, TAAG 2, and TAAG 3 studies. Within the Bayesian linear mixed model, we implemented a clustering procedure aimed at grouping girls based on their deviations from the adjusted physical activity levels. These groups reflected the tendency of a small subset of females to be highly active. Not surprisingly, only 24 girls (5% of our sample) were classified as highly active.
This group of highly active girls differs in several ways. These girls are more active, and thus we expect that the age at first menstrual period will be higher. We may also expect that the highly active girls are involved in more sports teams and that they will have higher global physical activity scores. Some other interesting characteristics of these girls, however, are their increased self-management strategies, self-esteem scores, and self-described health. This may suggest that interventions focusing on time management and emphasizing self-efficacy could impact adolescent female physical activity levels. In doing so, we could aim to increase self-esteem and self-described health.
The ability to account for heterogeneity in the subject-specific deviations from an adjusted model allows us to keep the outcome on the original scale while still improving the model assumptions. Our model estimates the model parameters while identifying groups of observations with differing activity levels. In contrast, a frequentist approach could be taken using the EM algorithm; however, we would lose the ability of the data to give statistical inference on the appropriate number of groups, and to incorporate posterior samples with different numbers of groups into the estimated class label.
The current analysis looks only at identifying groups based on deviations from the overall adjusted minutes of MVPA for the females. A natural extension would be to cluster on the slope for time, to begin to understand the various patterns we observe among adolescent females over time. Furthermore, we may want to incorporate a variable selection procedure into the fixed portion of the model. The groups found by clustering on subject-specific intercepts and/or slopes would be sensitive to the covariates selected, depending on the variability captured by this fixed portion of the model. Physical activity, like most human behavior, varies widely for a multitude of reasons, many of which we may not think to measure or are unable to measure. Identifying groups when a traditional mixed model, constructed using standard variable selection methods, suggests lack of fit can be a useful step towards better understanding differences through post-hoc analyses of the groups' characteristics.
Acknowledgements Research reported in this publication was supported by the National Institutes
of Health (NIH) under award numbers T32ES007271 and R01HL119058. The content is solely the
responsibility of the authors and does not necessarily represent the official views of the NIH.
References
1. Young, D. R., Mohan, Y. D., Saksvig, B. I., Sidell, M., Wu, T. T., Cohen, D.: Longitudinal
predictors of moderate to vigorous physical activity among adolescent girls and young women.
Under review. (2017)
2. Ng, S. K., McLachlan, G. J., Wang, K., Ben-Tovim Jones, L., Ng, S. W.: A mixture model with
random-effects components for clustering correlated gene-expression profiles. Bioinformatics
22(14), 1745 (2006)
3. Zhou, C., Wakefield, J.: A Bayesian mixture model for partitioning gene expression data.
Biometrics 62(2), 515–525 (2006)
4. Richardson, S., Green, P. J.: On Bayesian analysis of mixtures with an unknown number of
components (with discussion). J. Roy. Stat. Soc. B 59(4), 731–792 (1997)
5. Dahl, D. B.: Model-based clustering for expression data via a Dirichlet process mixture model.
Bayesian Inference for Gene Expression and Proteomics. 201–218 (2006)
6. Flegal, J. M., Hughes, J., Vats, D.: mcmcse: Monte Carlo standard errors for MCMC. R
package version 1.2-1 (2016)
Unsupervised Classification of Categorical Time
Series Through Innovative Distances
Abstract In this paper, two novel distances for nominal time series are introduced. Both of them are based on features describing the serial dependence patterns between each pair of categories. The first dissimilarity employs the so-called association measures, whereas the second computes correlation quantities between indicator processes whose uniqueness is guaranteed under standard stationarity conditions. The metrics are used to construct crisp algorithms for clustering categorical series. The approaches are able to group series generated from similar underlying stochastic processes, achieve accurate results with series coming from a broad range of models, and are computationally efficient. An extensive simulation study shows that the devised clustering algorithms outperform several alternative procedures proposed in the literature. Specifically, they achieve better results than approaches based on maximum likelihood estimation, which take advantage of knowing the real underlying processes. Both innovative dissimilarities could be useful for practitioners in the field of time series clustering.
1 Introduction
Clustering of time series concerns the challenge of splitting a set of unlabeled time
series into homogeneous groups, which is a pivotal problem in many knowledge
discovery tasks [1]. Categorical time series (CTS) are a particular class of time
series exhibiting a qualitative range which consists of a finite number of categories.
Most of the classical statistical tools used for real-valued time series (e.g., the
autocorrelation function) are not useful in the categorical case, so different types
of measures than the standard ones are needed for a proper analysis of CTS. CTS
arise in an extensive assortment of fields [2, 3, 7, 8, 9]. Since only a few works have
addressed the problem of CTS clustering [4, 5], the main goal of this paper is to
introduce novel clustering algorithms for CTS.
Consider a set of $s$ categorical time series $\mathcal{S} = \{X_t^{(1)}, \ldots, X_t^{(s)}\}$, where the $j$-th element $X_t^{(j)}$ is a $T_j$-length partial realization of a categorical stochastic process $(X_t)_{t \in \mathbb{Z}}$ taking values on a number $r$ of unordered qualitative categories, coded from 1 to $r$ so that the range of the process can be seen as $\mathcal{V} = \{1, \ldots, r\}$. We suppose that the process $(X_t)_{t \in \mathbb{Z}}$ is bivariate stationary, i.e., the pairwise joint distribution of $(X_{t-k}, X_t)$ is invariant in $t$. Our goal is to perform clustering on the elements of $\mathcal{S}$ in such a way that series assumed to be generated from identical stochastic processes are placed together. To that aim, we propose two distance metrics based on feature extraction.
The matrix $\boldsymbol{V}(l)$ gives information about the so-called unsigned dependence of the process. However, it is often useful to know whether a process tends to stay in the state it has reached or whether, on the contrary, repetition of the same state after $l$ steps is infrequent. This motivates the concept of signed dependence, which arises as an analogue of the autocorrelation function of a numerical process, since such a quantity can take either positive or negative values. Provided that perfect serial dependence holds, we have perfect positive (negative) serial dependence if $p_{i|i}(l) = 1$ ($p_{i|i}(l) = 0$) for all $i \in \mathcal{V}$.
Since $\boldsymbol{V}(l)$ does not shed light on the signed dependence structure, it is valuable to complement the information contained in $\boldsymbol{V}(l)$ with features describing signed dependence. In this regard, a common measure of signed serial dependence at lag $l$ is Cohen's $\kappa$, which takes the form
$$\kappa(l) = \frac{\sum_{j=1}^{r} \big( p_{jj}(l) - \pi_j^2 \big)}{1 - \sum_{j=1}^{r} \pi_j^2}. \qquad (2)$$
Proceeding as with $v(l)$, the quantity $\kappa(l)$ can be decomposed in order to obtain a complete representation of the signed dependence pattern of the process. In this way, we consider the vector $\mathcal{K}(l) = (\mathcal{K}_1(l), \ldots, \mathcal{K}_r(l))$, where each $\mathcal{K}_i(l)$ is defined as
$$\mathcal{K}_i(l) = \frac{p_{ii}(l) - \pi_i^2}{1 - \sum_{j=1}^{r} \pi_j^2}, \qquad i = 1, \ldots, r. \qquad (3)$$
In practice, the matrix $\boldsymbol{V}(l)$ and the vector $\mathcal{K}(l)$ must be estimated from a $T$-length realization $\{X_1, \ldots, X_T\}$ of the process. To this aim, we consider the estimators $\hat{\pi}_i = N_i/T$ of $\pi_i$ and $\hat{p}_{ij}(l) = N_{ij}(l)/(T - l)$ of $p_{ij}(l)$, where $N_i$ is the number of variables $X_t$ equal to $i$ in the realization and $N_{ij}(l)$ is the number of pairs $(X_t, X_{t-l}) = (i, j)$ in the realization. Hence, estimates $\hat{\boldsymbol{V}}(l)$ and $\hat{\mathcal{K}}(l)$ can be obtained by plugging the estimates $\hat{\pi}_i$ and $\hat{p}_{ij}(l)$ into (2) and (3), respectively. This leads directly to estimates of $v(l)$ and $\kappa(l)$, denoted by $\hat{v}(l)$ and $\hat{\kappa}(l)$.
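As a minimal illustration (our own sketch, not the authors' code; the function name and interface are hypothetical), the following R snippet computes $\hat{\pi}_i$, $\hat{p}_{ij}(l)$, $\hat{\mathcal{K}}(l)$ and $\hat{\kappa}(l)$ for a single CTS coded in $\{1, \ldots, r\}$.

# Sketch: estimate pi_i, p_ij(l), K(l) and kappa(l) from a CTS x coded in 1..r
estimate_features <- function(x, l, r) {
  Tlen <- length(x)
  pi_hat <- tabulate(x, nbins = r) / Tlen          # estimates of pi_i
  N <- matrix(0, r, r)
  for (t in (l + 1):Tlen)                          # count pairs (X_t, X_{t-l}) = (i, j)
    N[x[t], x[t - l]] <- N[x[t], x[t - l]] + 1
  p_hat <- N / (Tlen - l)                          # estimates of p_ij(l)
  denom <- 1 - sum(pi_hat^2)
  K_hat <- (diag(p_hat) - pi_hat^2) / denom        # vector K(l), cf. Eq. (3)
  kappa_hat <- sum(diag(p_hat) - pi_hat^2) / denom # Cohen's kappa, cf. Eq. (2)
  list(pi = pi_hat, p = p_hat, K = K_hat, kappa = kappa_hat)
}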
An alternative way of describing the dependence structure of the process $(X_t)_{t \in \mathbb{Z}}$ is to consider its equivalent representation as a multivariate binary process. The so-called binarization of $(X_t)_{t \in \mathbb{Z}}$ is constructed as follows. Let $\boldsymbol{e}_1, \ldots, \boldsymbol{e}_r \in \{0, 1\}^r$ be unit vectors such that $\boldsymbol{e}_k$ has all its entries equal to zero except for a one in the $k$-th position, $k = 1, \ldots, r$. Then, the binary representation of $(X_t)_{t \in \mathbb{Z}}$ is given by the process $\{\boldsymbol{Y}_t = (Y_{t,1}, \ldots, Y_{t,r})^\top, t \in \mathbb{Z}\}$ such that $\boldsymbol{Y}_t = \boldsymbol{e}_j$ if $X_t = j$. For fixed $l \in \mathbb{N}$ and $i, j \in \mathcal{V}$, consider the correlation $\phi_{ij}(l) = \mathrm{Corr}(Y_{t,i}, Y_{t-l,j})$, which measures linear dependence between the $i$-th and $j$-th categories with respect to the lag $l$. The following proposition provides some properties of the quantity $\phi_{ij}(l)$.
Proposition 1
In this section we introduce two distance measures between categorical series based on the features described above. Suppose we have a pair of CTS $X_t^{(1)}$ and $X_t^{(2)}$, and consider a set of $L$ lags, $\mathcal{L} = \{l_1, \ldots, l_L\}$. A dissimilarity based on Cramér's $v$ and Cohen's $\kappa$, the so-called $d_{CC}$, is defined as
$$d_{CC}\big(X_t^{(1)}, X_t^{(2)}\big) = \sum_{k=1}^{L}\Big[\big\|\mathrm{vec}\big(\hat{\boldsymbol{V}}(l_k)^{(1)} - \hat{\boldsymbol{V}}(l_k)^{(2)}\big)\big\|^2 + \big\|\hat{\mathcal{K}}(l_k)^{(1)} - \hat{\mathcal{K}}(l_k)^{(2)}\big\|^2\Big] + \big\|\hat{\boldsymbol{\pi}}^{(1)} - \hat{\boldsymbol{\pi}}^{(2)}\big\|^2,$$
where the superscripts $(1)$ and $(2)$ indicate that the corresponding estimates are obtained from the realizations $X_t^{(1)}$ and $X_t^{(2)}$, respectively.
An alternative distance measure relying on the binarization of the processes, the so-called $d_B$, is defined as
$$d_{B}\big(X_t^{(1)}, X_t^{(2)}\big) = \sum_{k=1}^{L}\big\|\mathrm{vec}\big(\hat{\boldsymbol{\Phi}}(l_k)^{(1)} - \hat{\boldsymbol{\Phi}}(l_k)^{(2)}\big)\big\|^2 + \big\|\hat{\boldsymbol{\pi}}^{(1)} - \hat{\boldsymbol{\pi}}^{(2)}\big\|^2.$$
For a given set of categorical series, the distances $d_{CC}$ and $d_B$ can be used as input for traditional clustering algorithms. In this manuscript we consider the Partitioning Around Medoids (PAM) algorithm; a sketch of this pipeline is given below.
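The following R sketch (illustrative only; the handling of categories with zero variance and all names are our assumptions) estimates the matrices $\hat{\boldsymbol{\Phi}}(l)$ from the binarized series, builds the pairwise $d_B$ matrix, and feeds it to PAM through the cluster package.

library(cluster)  # for pam()

# lagged correlations between indicator processes: r x r matrix phi_ij(l)
phi_hat <- function(x, l, r) {
  Y <- outer(x, 1:r, "==") * 1                    # T x r binarization of the CTS
  A <- Y[(l + 1):nrow(Y), , drop = FALSE]         # Y_t
  B <- Y[1:(nrow(Y) - l), , drop = FALSE]         # Y_{t-l}
  C <- suppressWarnings(cor(A, B))                # NA if a category never varies
  C[is.na(C)] <- 0
  C
}

d_B <- function(x1, x2, lags, r) {
  pi1 <- tabulate(x1, nbins = r) / length(x1)
  pi2 <- tabulate(x2, nbins = r) / length(x2)
  sum(sapply(lags, function(l) sum((phi_hat(x1, l, r) - phi_hat(x2, l, r))^2))) +
    sum((pi1 - pi2)^2)
}

# given a list `series` of CTS coded in 1..r:
# D <- outer(seq_along(series), seq_along(series),
#            Vectorize(function(i, j) d_B(series[[i]], series[[j]], lags = 1, r = 3)))
# clustering <- pam(as.dist(D), k = 4, diss = TRUE)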
In this section we examine the performance of both metrics 𝑑𝐶𝐶 and 𝑑 𝐵 in the
context of hard clustering (i.e., each series is assigned to exactly one cluster) of CTS
through a simulation study.
Scenario 1. Clustering of MCs. Consider four three-state Markov chains, denoted MC$_1$, MC$_2$, MC$_3$ and MC$_4$, with respective transition matrices $\boldsymbol{P}^1_1$, $\boldsymbol{P}^1_2$, $\boldsymbol{P}^1_3$ and $\boldsymbol{P}^1_4$ given by
$\boldsymbol{P}^1_1 = \mathrm{Mat}_3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2)$,
$\boldsymbol{P}^1_2 = \mathrm{Mat}_3(0.1, 0.8, 0.1, 0.6, 0.3, 0.1, 0.6, 0.2, 0.2)$,
$\boldsymbol{P}^1_3 = \mathrm{Mat}_3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05)$,
$\boldsymbol{P}^1_4 = \mathrm{Mat}_3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$,
where the operator $\mathrm{Mat}_k$, $k \in \mathbb{N}$, transforms a vector into a square matrix of order $k$ by sequentially placing the corresponding numbers by rows.
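To make the setup concrete, a brief R sketch (ours, not the authors') reproduces the $\mathrm{Mat}_3$ operator and simulates a $T$-length realization from MC$_1$:

Mat <- function(v, k) matrix(v, nrow = k, byrow = TRUE)   # Mat_k: fill by rows
P11 <- Mat(c(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2), 3)

sim_mc <- function(P, Tlen, init = sample(nrow(P), 1)) {
  x <- integer(Tlen); x[1] <- init
  for (t in 2:Tlen)                       # draw next state from row x[t-1] of P
    x[t] <- sample(nrow(P), 1, prob = P[x[t - 1], ])
  x
}
set.seed(1)
x1 <- sim_mc(P11, 200)                    # a T = 200 realization from MC_1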
Scenario 2. Clustering of HMMs. Consider the bivariate process $(X_t, Q_t)_{t \in \mathbb{Z}}$, where $Q_t$ stands for the hidden states and $X_t$ for the observable random variables. The process $(Q_t)_{t \in \mathbb{Z}}$ constitutes a homogeneous MC. Both $(X_t)_{t \in \mathbb{Z}}$ and $(Q_t)_{t \in \mathbb{Z}}$ are assumed to be count processes with range $\{1, \ldots, r\}$. The process $(X_t, Q_t)_{t \in \mathbb{Z}}$ is assumed to verify the three classical assumptions of a HMM. Based on the previous considerations, let HMM$_1$, HMM$_2$, HMM$_3$ and HMM$_4$ be four three-state HMMs with respective transition matrices $\boldsymbol{P}^2_1, \ldots, \boldsymbol{P}^2_4$ and emission matrices $\boldsymbol{E}^2_1, \ldots, \boldsymbol{E}^2_4$ given by
$\boldsymbol{P}^2_1 = \mathrm{Mat}_3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05)$, $\boldsymbol{P}^2_2 = \boldsymbol{P}^2_1$,
$\boldsymbol{P}^2_3 = \mathrm{Mat}_3(0.1, 0.7, 0.2, 0.4, 0.4, 0.2, 0.4, 0.3, 0.3)$,
$\boldsymbol{P}^2_4 = \mathrm{Mat}_3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$, $\boldsymbol{E}^2_1 = \boldsymbol{P}^2_1$,
$\boldsymbol{E}^2_2 = \mathrm{Mat}_3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2)$, $\boldsymbol{E}^2_3 = \boldsymbol{E}^2_2$,
$\boldsymbol{E}^2_4 = \mathrm{Mat}_3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3)$.
Scenario 3. Clustering of NDARMA processes. Consider the NDARMA($p$, $q$) model
$$X_t = \alpha_{t,1} X_{t-1} + \cdots + \alpha_{t,p} X_{t-p} + \beta_{t,0}\,\epsilon_t + \cdots + \beta_{t,q}\,\epsilon_{t-q},$$
where $(\epsilon_t)_{t \in \mathbb{Z}}$ is i.i.d. with $P(\epsilon_t = i) = \pi_i$, independent of $(X_s)_{s<t}$, and the i.i.d. multinomial random vectors
$$(\alpha_{t,1}, \ldots, \alpha_{t,p}, \beta_{t,0}, \ldots, \beta_{t,q}) \sim \mathrm{MULT}(1; \phi_1, \ldots, \phi_p, \varphi_0, \ldots, \varphi_q)$$
are independent of $(\epsilon_t)_{t \in \mathbb{Z}}$ and $(X_s)_{s<t}$. The considered models are three three-state NDARMA(2,0) processes and one three-state NDARMA(1,0) process with marginal distribution $\boldsymbol{\pi}^3 = (2/3, 1/6, 1/6)$, and corresponding probabilities in the multinomial distribution given by
$(\phi_1, \phi_2, \varphi_0)^3_1 = (0.7, 0.2, 0.1)$, $(\phi_1, \phi_2, \varphi_0)^3_2 = (0.1, 0.45, 0.45)$,
$(\phi_1, \phi_2, \varphi_0)^3_3 = (0.5, 0.25, 0.25)$, $(\phi_1, \varphi_0)^3_4 = (0.2, 0.8)$.
The simulation study was carried out as follows. For each scenario and each length $T \in \{200, 600\}$, 5 CTS were generated from each process, and the clustering algorithms were executed for both lengths, allowing us to analyze the impact of the series length. The resulting clustering solution produced by each considered algorithm was stored. The simulation procedure was repeated 500 times for each scenario and value of $T$. The computation of $d_{CC}$ and $d_B$ was carried out by considering $\mathcal{L} = \{1\}$ in Scenarios 1 and 2, and $\mathcal{L} = \{1, 2\}$ in Scenario 3. This way, we adapted the distances to the maximum number of significant lags existing in each setting.
To better analyze the performance of both metrics 𝑑𝐶𝐶 and 𝑑 𝐵 , we also obtained
partitions by using alternative techniques for clustering of categorical series. The
considered procedures are described below.
Average values of the quality indexes over the 500 simulation trials are given in Tables 1, 2 and 3 for Scenarios 1, 2 and 3, respectively.
The results in Table 1 indicate that the dissimilarity $d_{CC}$ is the best performing one when dealing with MCs, outperforming the MLE-based metric $d_{MLE}$. The distance $d_B$ is also superior to $d_{MLE}$. In Scenario 1 the measure $d_{CZ}$ attains results similar to those of $d_{CC}$, especially for $T = 600$. The good performance of $d_{CZ}$ was expected, since the assumption of first-order Markov models considered by this metric is fulfilled in Scenario 1. Table 2 shows a completely different picture, indicating that the metrics $d_{CC}$ and $d_B$ are significantly more effective than the remaining dissimilarities. Finally, the quantities in Table 3 reveal that the model-based distance $d_{MLE}$ attains the best results when $T = 200$, but is outperformed by $d_B$ when $T = 600$.
References
1. Liao, T. W.: Clustering of time series data: A survey. Pattern Recogn. 38, 1857-1874 (2005)
2. Churchill, G. A.: Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51,
79-94 (1989)
3. Fokianos, K., Kedem, B.: Regression theory for categorical time series. Stat. Sci. 18, 357-376
(2003)
4. Cadez, I., Heckerman, D., Meek, C., Smyth, P., White, S.: Model-based clustering and visu-
alization of navigation patterns on a web site. Data Min. Knowl. Discov. 7, 399-424 (2003)
5. Frühwirth-Schnatter, S., Pamminger, C.: Model-based clustering of categorical time series. Bayesian Anal. 5, 345-368 (2010)
6. García Magariños, M., Vilar, J. A.: A framework for dissimilarity-based partitioning clustering
of categorical time series. Data Min. Knowl. Discov. 29, 466-502 (2015)
7. Elzinga, C. H.: Combinatorial representations of token sequences. J. Classif. 22, 87-118 (2005)
8. Elzinga, C. H.: Sequence similarity: a nonaligning technique. Socio. Meth. Res. 32, 3-22
(2003)
9. Elzinga, C. H.: Sequence analysis: Metric representations of categorical time series. Socio.
Meth. Res. (2006)
Fuzzy Clustering by Hyperbolic Smoothing
Abstract We propose a novel method for building fuzzy clusters of large data sets, using a smoothing numerical approach. The usual sum-of-squares criterion is relaxed so that the search for good fuzzy partitions is made on a continuous space, rather than on a combinatorial space as in classical methods [8]. The smoothing converts a strongly non-differentiable problem into low-dimensional differentiable unconstrained optimization subproblems, by using an infinitely differentiable ($C^\infty$) function. For the implementation of the algorithm we used the statistical software R, and the results obtained were compared to the traditional fuzzy $C$-means method proposed by Bezdek [1].
1 Introduction
Methods for making groups from data sets are usually based on the idea of disjoint
sets, such as the classical crisp clustering. The most well known are hierarchical
and 𝑘-means [8], whose resulting clusters are sets with no intersection. However,
this restriction may not be natural for some applications, where the condition for some objects may be to belong to two or more clusters, rather than to only one. Several methods for constructing overlapping clusters have been proposed in the literature [4, 5, 8]. Since Zadeh introduced the concept of fuzzy sets [17], the principle of belonging to several clusters has been used in the sense of a degree of membership in such clusters. In this direction, Bezdek [1] introduced a fuzzy clustering method that became very popular, since it solved the problem of representing clusters with centroids and assigning objects to clusters by minimizing a well-stated numerical criterion. Several methods for fuzzy clustering have been proposed in the literature; a survey of these methods can be found in [16].
In this paper we propose a new fuzzy clustering method based on the numerical principle of hyperbolic smoothing [15]. The fuzzy $C$-means method is presented in Section 2 and our proposed Hyperbolic Smoothing Fuzzy Clustering method in Section 3. Comparative results between these two methods are presented in Section 4. Finally, Section 5 is devoted to the concluding remarks.

David Masís
Costa Rica Institute of Technology, Cartago, Costa Rica, e-mail: [email protected]
Esteban Segura
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica, e-mail: [email protected]
Javier Trejos (✉)
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica, e-mail: [email protected]
Adilson E. Xavier
Universidade Federal do Rio de Janeiro, Brazil, e-mail: [email protected]
2 Fuzzy Clustering
The most well-known method for fuzzy clustering is Bezdek's original $C$-means method [1]. It is based on the same principles as $k$-means or dynamical clusters [2], that is, iterations over two main steps: i) class representation by the optimization of a numerical criterion, and ii) assignment to the closest class representative in order to construct clusters; these iterations are repeated until convergence to a local minimum of the overall quality criterion is reached.
Let us introduce the notation and the numerical criterion for optimization. Let $\mathbf{X}$ be an $n \times p$ data matrix containing $p$ numerical observations over $n$ objects. We look for a $K \times p$ matrix $\mathbf{G}$ that represents the centroids of $K$ clusters of the $n$ objects and an $n \times K$ membership matrix $\mathbf{U}$ with elements $\mu_{ik} \in [0, 1]$, such that the following criterion is minimized:
$$W(\mathbf{X}, \mathbf{U}, \mathbf{G}) = \sum_{i=1}^{n}\sum_{k=1}^{K} (\mu_{ik})^m \|\mathbf{x}_i - \mathbf{g}_k\|^2 \qquad (1)$$
subject to $\sum_{k=1}^{K} \mu_{ik} = 1$ for all $i \in \{1, 2, \ldots, n\}$ and $0 < \sum_{i=1}^{n} \mu_{ik} < n$ for all $k \in \{1, 2, \ldots, K\}$,
where x𝑖 is the 𝑖-th row of X and g 𝑘 is the 𝑘-th row of G, representing in R 𝑝 the
centroid of the 𝑘-th cluster.
The parameter $m \neq 1$ in (1) controls the fuzziness of the clusters. According to the literature [16], it is usual to take $m = 2$, since greater values of $m$ tend to give very low values of $\mu_{ik}$, tending to the usual crisp partitions such as in $k$-means. We also assume that the number of clusters, $K$, is fixed.
Minimization of (1) is a nonlinear optimization problem with constraints, which can be solved using Lagrange multipliers as presented in [1]. The solution, for each row of the centroid matrix, given a matrix $\mathbf{U}$, is:
$$\mathbf{g}_k = \sum_{i=1}^{n} (\mu_{ik})^m\, \mathbf{x}_i \Big/ \sum_{i=1}^{n} (\mu_{ik})^m. \qquad (2)$$
The solution for the membership matrix, given a centroid matrix $\mathbf{G}$, is [1]:
$$\mu_{ik} = \left[\sum_{j=1}^{K}\left(\frac{\|\mathbf{x}_i - \mathbf{g}_k\|^2}{\|\mathbf{x}_i - \mathbf{g}_j\|^2}\right)^{1/(m-1)}\right]^{-1}. \qquad (3)$$
Bezdek's fuzzy $C$-means (FCM) method [1] starts from an initial partition that is improved at each iteration, according to (1), by alternating the updates (2) and (3); a sketch of these main steps is given below. It is clear that this procedure may lead to local optima of (1), since the iterative improvement in (2) and (3) is made by a local search strategy.
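A compact R sketch of these alternating updates follows (illustrative only; a production implementation is available in the fclust package [6] used in Section 4):

fcm <- function(X, K, m = 2, max_iter = 100, tol = 1e-6) {
  n <- nrow(X)
  U <- matrix(runif(n * K), n, K)
  U <- U / rowSums(U)                              # random initial fuzzy partition
  for (it in 1:max_iter) {
    Um <- U^m
    G <- sweep(t(Um) %*% X, 1, colSums(Um), "/")   # centroid update, Eq. (2)
    D2 <- as.matrix(dist(rbind(G, X)))[-(1:K), 1:K]^2  # n x K squared distances
    W <- pmax(D2, 1e-12)^(-1 / (m - 1))            # guard against zero distances
    U_new <- W / rowSums(W)                        # membership update, Eq. (3)
    if (max(abs(U_new - U)) < tol) { U <- U_new; break }
    U <- U_new
  }
  list(U = U, G = G)
}
# example: res <- fcm(as.matrix(iris[, 1:4]), K = 3)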
For the problem of clustering the $n$ rows of the data matrix $\mathbf{X}$ into $K$ clusters, we can seek, for every $\mathbf{x}_i$, the minimum distance to its class centres, $z_i = \min_{k} \|\mathbf{x}_i - \mathbf{g}_k\|$. Smoothing the minimum and the distances by means of the functions $\psi$ (with smoothing parameter $\tau$) and $\theta$ (with parameter $\gamma$) of [15], the problem is transformed into:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) \ge \epsilon, \quad i = 1, \ldots, n.$$
Finally, according to the Karush–Kuhn–Tucker conditions [10, 11], all the constraints are active and the final formulation of the problem is:
$$\min \sum_{i=1}^{n} z_i^2 \quad \text{subject to} \quad h_i(z_i, \mathbf{G}) = \sum_{k=1}^{K} \psi\big(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau\big) - \epsilon = 0, \quad i = 1, \ldots, n, \qquad (4)$$
with $\epsilon, \tau, \gamma > 0$.
Based on (4), the Hyperbolic Smoothing Clustering Method (HSCM) was stated in [15] and is presented in the following algorithm.
The most relevant task in the hyperbolic smoothing clustering method is finding the zeroes of the functions $h_i(z_i, \mathbf{G}) = \sum_{k=1}^{K} \psi(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau) - \epsilon$ for $i = 1, \ldots, n$. In this paper we used the Newton–Raphson method [3] for finding these zeroes, together with the BFGS procedure [12]. Convergence of the Newton–Raphson method was successful mainly thanks to a good choice of initial solutions; in our implementation, these initial approximations were generated by calculating the minimum distance between the $i$-th object and the $k$-th centroid for a given partition. Once the zeroes $z_i$ of the functions $h_i$ are obtained, the hyperbolic smoothing is implemented. The final solution consists of solving a finite number of optimization subproblems, corresponding to problem (P) in Step 6 of the HSCM algorithm. Each of these subproblems was solved with the R routine optim [13], a useful tool for nonlinear programming problems; as far as we know there is no closed-form solution for this step, so we rely on this routine here and leave a dedicated implementation for future work.
Since $\sum_{k=1}^{K} \psi(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau) = \epsilon$, each entry $\mu_{ik}$ of the membership matrix is given by $\mu_{ik} = \psi(z_i - d_k, \tau)/\epsilon$. It is worth noting that fuzziness is controlled by the parameter $\epsilon$.
The main steps of the Hyperbolic Smoothing Fuzzy Clustering (HSFC) method combine the zero-finding of the $h_i$ with the smoothed optimization subproblems; an illustrative sketch is given below.
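The sketch below illustrates the zero-finding and membership steps in R. It assumes the standard hyperbolic smoothing functions of [15], $\psi(y, \tau) = \big(y + \sqrt{y^2 + \tau^2}\big)/2$ and $\theta(\mathbf{x}, \mathbf{g}, \gamma) = \sqrt{\|\mathbf{x} - \mathbf{g}\|^2 + \gamma^2}$; these definitions, like all names in the code, are our assumptions rather than the authors' exact implementation.

psi   <- function(y, tau) (y + sqrt(y^2 + tau^2)) / 2          # smoothed max(y, 0)
theta <- function(x, g, gamma) sqrt(sum((x - g)^2) + gamma^2)  # smoothed distance

dists <- function(x, G, gamma) apply(G, 1, function(g) theta(x, g, gamma))

h  <- function(z, d, tau, eps) sum(psi(z - d, tau)) - eps      # h_i(z, G)
dh <- function(z, d, tau) sum((1 + (z - d) / sqrt((z - d)^2 + tau^2)) / 2)

# Newton-Raphson zero of h_i, initialized at the minimum smoothed distance
zero_h <- function(x, G, gamma, tau, eps, iters = 50) {
  d <- dists(x, G, gamma)
  z <- min(d)
  for (it in 1:iters) z <- z - h(z, d, tau, eps) / dh(z, d, tau)
  z
}
# memberships then follow as mu_ik = psi(z_i - d_k, tau) / eps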
4 Comparative Results
The performance of the HSFC method was studied on a well-known data table from the literature, Fisher's iris data [7], and on 16 simulated data tables built from a semi-Monte Carlo procedure [14].
For comparing FCM and HSFC, we used the implementation of FCM in the R package fclust [6]. The comparison was made upon the within-class sum-of-squares $W(P) = \sum_{k=1}^{K}\sum_{i=1}^{n} \mu_{ik} \|\mathbf{x}_i - \mathbf{g}_k\|^2$. Both methods were applied 50 times and the best value of $W$ is reported. For simplicity, for HSFC we used the following parameters: $\rho_1 = \rho_2 = \rho_3 = 0.25$, $\epsilon = 0.01$ and $\gamma = \tau = 0.001$ as initial values. Table 1 shows the results for Fisher's iris, in which case HSFC performs slightly better. It also contains the Adjusted Rand Index (ARI) [9] between HSFC and the best FCM result among 100 runs; the ARI compares the fuzzy membership matrices after crisping them into hard partitions.
Table 1 Minimum sum-of-squares (SS) reported for the Fisher's iris data table with HSFC and FCM, $K$ being the number of clusters, ARI comparing both methods. Best method in bold.

Table          K   SS for HSFC   SS for FCM   ARI
Fisher's iris  2   152.348       152.3615     1
               3   78.85567      78.86733     0.994
               4   57.26934      57.26934     0.980
Table 2 Codes and characteristics of the simulated data tables; $n$: number of objects, $K$: number of clusters, card: cardinality, SD: standard deviation.
Table 3 Minimum sum-of-squares (SS) reported for HSFC and FCM methods on the simulated data tables; $K$ is the number of clusters and the ARI compares both methods. Best method in bold.
5 Concluding Remarks
References
1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
New York (1981)
2. Bock, H.-H.: Origins and extensions of the k-means algorithm in cluster analysis. Electronic Journal for History of Probability and Statistics 4 (2008)
3. Burden, R., Faires, D.: Numerical Analysis, 9th ed. Brooks/Cole, Pacific Grove (2011)
4. Diday, E.: Orders and overlapping clusters by pyramids. In: De Leeuw, J. et al. (eds.) Multidimensional Data Analysis, DSWO Press, Leiden (1986)
5. Dunn, J. C.: A fuzzy relative of the ISODATA process and its use in detecting compact, well
separated clusters. J. Cybernetics 3, 32–57 (1974)
6. Ferraro, M. B., Giordani, P., Serafini, A.: fclust: An R Package for Fuzzy Clustering. The R
Journal 11(1), 198-210 (2019) doi: 10.32614/RJ-2019-017
7. Fisher, R. A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics
7: 179-188 (1936)
8. Hartigan, J. A.: Clustering Algorithms. Wiley, New York, NY (1975)
9. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193-218 (1985)
10. Karush, W.: Minima of Functions of Several Variables with Inequalities as Side Constraints.
Master’s Thesis, Dept. of Mathematics, University of Chicago, Chicago, Illinois (1939)
11. Kuhn, H., Tucker, A.: Nonlinear programming, Proc. 2nd Berkeley Symposium on Mathemat-
ical Statistics and Probability, University of California Press, Berkeley, pp. 481-492 (1951)
12. Li, D., Fukushima, M.: On the global convergence of the BFGS method for nonconvex
unconstrained optimization problems. SIAM J. Optim. 11, 1054-1064 (2001)
13. R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2021)
14. Trejos, J., Villalobos, M. A.: Partitioning by particle swarm optimization. In: Brito, P. Bertrand,
P., Cucumel G., de Carvalho, F. (eds.) Selected Contributions in Data Analysis and Classifi-
cation, pp. 235-244. Springer, Berlin (2007)
15. Xavier, A.: The hyperbolic smoothing clustering method, Pattern Recognit. 43, 731-737 (2010)
16. Yang, M. S.: A survey of fuzzy clustering. Math. Comput. Modelling 18, 1-16 (1993)
17. Zadeh, L. A.: Fuzzy sets. Information and Control 8(3), 338-353 (1965)
Stochastic Collapsed Variational Inference for
Structured Gaussian Process Regression
Networks
1 Introduction
Multi-output regression problems arise in various fields, and the processes that generate such datasets are often nonstationary. Modern instrumentation has resulted in increasing numbers of observations, as well as in the occurrence of missing values. This motivates the development of scalable methods for forecasting in such datasets.
Multi-output Gaussian process models, or multivariate Gaussian process (MGP) models, generalise the powerful Gaussian process predictive model to vector-valued random fields [1]. These models demonstrate improved prediction performance compared with independent univariate Gaussian processes (GPs) because MGPs express correlations between outputs. Since the correlation information of the data is encoded in the covariance function, modelling flexible and computationally efficient cross-covariance functions is of interest. In the literature on multivariate processes, many approaches have been proposed to build valid cross-covariance functions, including the linear model of coregionalization (LMC) [2], kernel convolution techniques [3], and B-spline based coherence functions [4]. However, most of these models are designed for modelling low-dimensional stationary processes and require Monte Carlo simulations, making inference in large datasets computationally intractable.
Modelling complicated temporal dependencies across variables is addressed in [5, 6] by several adaptations of the stochastic LMC. Such models can handle input-varying correlation across multivariate outputs. In particular, for multivariate time series, [6] propose a structured Gaussian process regression network (SGPRN) that captures time-varying scale, correlation, and smoothness. However, the inference in [6] is difficult to handle in applications where either the number of observations or the dimension is large, or where missing data exist.
Here, we propose an efficient variational inference approach for the SGPRN by employing the inducing variable framework on all latent processes [7], taking advantage of its collapsed representation where nuisance parameters are marginalized out [8], and proposing a tractable variational bound amenable to doubly stochastic variational inference. We call our approach variational SGPRN (VSGPRN). This variational framework allows the model to handle missing data without increasing the computational complexity of inference. We provide numerical evidence of the benefits of simultaneously modeling time-varying correlation, scale and smoothness in both a synthetic experiment and a real-world problem.
The main contributions of this work are threefold:
• Learning structured Gaussian process regression networks using inducing vari-
ables on both mixing coefficients and latent functions.
• Employing doubly stochastic variational inference for structured Gaussian process regression networks by taking advantage of its collapsed representation and constructing a tractable lower bound of the log-likelihood, making it suitable for mini-batch learning.
• Demonstrating that our proposed algorithm succeeds in handling time-varying correlation in the presence of missing data, under different scenarios in both synthetic and real data.
2 Model
Fig. 1 Graphical model of VSGPRN. Left: Illustration of the generative model. Right: Illustration
of the variational structure. The dashed (red) block means that we marginalize out those latent
variables in the variational inference framework.
The mixing matrix $\mathbf{L}(\mathbf{x})$ is lower triangular with positive values on the diagonal for model identification [9, 6]. Thus, the SGPRN is defined by the generative model of Figure 1, namely $\mathbf{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \boldsymbol{\epsilon}(\mathbf{x})$, $\mathbf{f}(\mathbf{x}) = \mathbf{L}(\mathbf{x})\,\mathbf{g}(\mathbf{x})$, with independent white noise $\boldsymbol{\epsilon}(\mathbf{x}) \overset{iid}{\sim} \mathcal{N}(0, \sigma^2_{err} I)$. The process $\ell$ determines the input-dependent length scale of the shared correlations in $K_g$ for all latent functions $g_d$; this varying length-scale process plays an important role in modelling nonstationary time series, as illustrated in [11, 6].
Let $X = \{\mathbf{x}_i\}_{i=1}^{N}$ be the set of observed inputs and $Y = \{\mathbf{y}_i\}_{i=1}^{N}$ be the set of observed outputs. Denote by $\eta$ the concatenation of all coefficients and all log length-scale parameters, i.e., $\eta = (\mathbf{l}, \tilde{\ell})$ evaluated at the training inputs $X$. Here, $\mathbf{l}$ is a vector including the entries below the main diagonal and the entries on the diagonal in the log scale, and $\tilde{\ell} = \log \ell$ collects the length-scale parameters in log scale. Also, denote by $\theta = (\theta_l, \theta_\ell, \sigma^2_{err})$ all hyper-parameters, where $\theta_l$ and $\theta_\ell$ are the hyper-parameters of the kernels $K_l$ and $K_\ell$. We note that directly inferring the posterior of the latent variables, $p(\eta \mid Y, \theta) \propto p(Y \mid \eta, \sigma^2_{err})\, p(\eta \mid \theta_l, \theta_\ell)$, is computationally intractable in general, because the computational complexity of $p(\eta \mid Y, \theta)$ is $\mathcal{O}(N^3 D^3)$. To overcome this issue, we propose an efficient variational inference scheme that significantly reduces the computational burden in the next section.
3 Inference
We introduce a shared set of inducing inputs $Z = \{\mathbf{z}_m\}_{m=1}^{M}$ that lie in the same space as the inputs $X$, and a set of shared inducing variables $\mathbf{w}_d$ for each latent function $g_d$ evaluated at the inducing inputs $Z$. Likewise, we consider inducing variables $\mathbf{u}_{ii}$ for the function $\log L_{ii}$ when $i = j$, $\mathbf{u}_{ij}$ for the function $L_{ij}$ when $i > j$, and inducing variables $\mathbf{v}$ for the function $\log \ell(\mathbf{x})$, all evaluated at the inducing inputs $Z$. We denote these collective variables by $\mathbf{l} = \{\mathbf{l}_{ij}\}_{i \ge j}$, $\mathbf{u} = \{\mathbf{u}_{ij}\}_{i \ge j}$, $\mathbf{g} = \{\mathbf{g}_d\}_{d=1}^{D}$, $\mathbf{w} = \{\mathbf{w}_d\}_{d=1}^{D}$, $\ell$ and $\mathbf{v}$. We then redefine the model parameters as $\eta = (\mathbf{l}, \mathbf{u}, \mathbf{g}, \mathbf{w}, \ell, \mathbf{v})$, with prior $p(\eta) = p(\mathbf{l} \mid \mathbf{u})\, p(\mathbf{u})\, p(\mathbf{g} \mid \mathbf{w}, \ell, \mathbf{v})\, p(\mathbf{w})\, p(\ell \mid \mathbf{v})\, p(\mathbf{v})$. The core assumption of inducing-point-based sparse inference is that the inducing variables are sufficient statistics for the training and testing data, in the sense that the training and testing data are conditionally independent given the inducing variables. In the context of our model, this means that the posterior processes of $L$, $g$ and $\ell$ are sufficiently determined by the posterior distributions of $\mathbf{u}$, $\mathbf{w}$ and $\mathbf{v}$. We propose a
structured variational distribution and a corresponding variational lower bound. Due to the nonconjugacy of this model, instead of computing expectations in the evidence lower bound (ELBO) as is normally done in the literature, we marginalize the inducing variables $\mathbf{u}$, $\mathbf{w}$ and the latent functions $\mathbf{g}$, and then use the reparameterization trick to apply end-to-end training with stochastic gradient descent. We also discuss a procedure for inference and prediction with missing data.
To capture the posterior dependency between the latent functions, we propose a structured variational distribution of the model parameters $\eta$, used to approximate the posterior, of the form $q(\eta) = p(\mathbf{l} \mid \mathbf{u})\, p(\mathbf{g} \mid \mathbf{w}, \ell, \mathbf{v})\, p(\ell \mid \mathbf{v})\, q(\mathbf{u}, \mathbf{w}, \mathbf{v})$. This variational structure is illustrated in Figure 1. The variational distribution of the inducing variables, $q(\mathbf{u}, \mathbf{w}, \mathbf{v})$, fully characterizes $q(\eta)$, so the inference of $q(\mathbf{u}, \mathbf{w}, \mathbf{v})$ is of interest. We assume the parameters $\mathbf{u}$, $\mathbf{w}$ and $\mathbf{v}$ are Gaussian and mutually independent.
Given the Gaussian process priors of the SGPRN, the conditional distributions $p(\mathbf{l} \mid \mathbf{u})$, $p(\mathbf{g} \mid \mathbf{w}, \tilde{\ell}, \mathbf{v})$ and $p(\ell \mid \mathbf{v})$ have closed-form expressions and all are Gaussian, except for $p(\ell \mid \mathbf{v})$, which is log-Gaussian. The ELBO of the log-likelihood of the observations under our structured variational distribution $q(\eta)$ is derived using Jensen's inequality as
$$\log p(Y) \ge E_{q(\eta)}\left[\log \frac{p(Y \mid \mathbf{g}, \mathbf{l})\, p(\mathbf{u})\, p(\mathbf{w})\, p(\mathbf{v})}{q(\mathbf{u}, \mathbf{w}, \mathbf{v})}\right] = R + A, \qquad (1)$$
where $R$ denotes the reconstruction term and $A$ the regularization term. The lower bound decomposes across both inputs and outputs, which enables the use of stochastic optimization methods. Moreover, due to the Gaussian assumptions in the prior and in the variational distributions of the inducing variables, all KL divergence terms in the regularization term $A$ are analytically tractable. Next, instead of computing the expectation directly, we leverage stochastic inference [13].
Stochastic inference requires sampling $\mathbf{l}$ and $\mathbf{g}$ from the joint variational posterior $q(\eta)$. Directly sampling them would introduce much uncertainty from intermediate variables and thus make inference inefficient. To tackle this issue, we marginalize the unnecessary intermediate variables $\mathbf{u}$ and $\mathbf{w}$ and obtain the marginal distributions
$$q(\mathbf{l}) = \prod_{i=j} \log\mathcal{N}\big(\mathbf{l}_{ii} \mid \tilde{\mu}^l_{ii}, \tilde{\Sigma}^l_{ii}\big) \prod_{i>j} \mathcal{N}\big(\mathbf{l}_{ij} \mid \tilde{\mu}^l_{ij}, \tilde{\Sigma}^l_{ij}\big), \qquad q(\mathbf{g} \mid \ell, \mathbf{v}) = \prod_{d=1}^{D} \mathcal{N}\big(\mathbf{g}_d \mid \tilde{\mu}^g_d, \tilde{\Sigma}^g_d\big),$$
with the joint distribution $q(\ell, \mathbf{v}) = p(\ell \mid \mathbf{v})\, q(\mathbf{v})$, where the conditional means and covariance matrices are easily derived. The corresponding marginal distributions $q(\mathbf{l}_n)$ and $q(\mathbf{g}_n \mid \ell, \mathbf{v})$ at each $n$ are also easy to derive. Moreover, we conduct collapsed inference by marginalizing the latent variables $\mathbf{g}_n$, so that the individual expectation becomes
$$E_{q(\mathbf{g}_n, \mathbf{l}_n)}\big[\log p(y_{nd} \mid \mathbf{g}_n, \mathbf{l}_n)\big] = \int L_{nd}\; q(\ell_n, \mathbf{v})\, q(\mathbf{l}_{d\cdot n})\, d(\mathbf{l}_{d\cdot n}, \ell_n, \mathbf{v}), \qquad (2)$$
where $L_{nd} = \log \mathcal{N}\big(y_{nd} \mid \sum_{j=1}^{D} l_{djn}\, \tilde{\mu}^g_{jn},\, \sigma^2_{err}\big) - \frac{1}{2\sigma^2_{err}} \sum_{j=1}^{D} l^2_{djn}\, \tilde{\sigma}^{g2}_{jn}$ measures the reconstruction performance for the observation $y_{nd}$.
Directly evaluating the ELBO is still challenging due to the non-linearities in-
troduced by our structured prior. Recent progress in black box variational inference
[13] avoids this difficulty by computing noisy unbiased estimates of the gradient of
ELBO, via approximating the expectations with unbiased Monte Carlo estimates and
relying on either score function estimators [14] or reparameterization gradients [13]
to differentiate through a sampling process. Here we leverage the reparameterization
gradients for stochastic optimization for model parameters. We note that evaluating
ELBO (1) involves two sources of stochasticity from Monte Carlo sampling in (2)
and from data sub-sampling stochasticity [15]. The prediction procedure is based on
Bayes’ rule and replaces the posterior distribution by the inferred variational distribu-
tion. In the case of missing data, the only modification in (1) is in the reconstruction
term, where we sum up the likelihoods of observed data instead of complete data.
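As a toy illustration of the reparameterization trick (a self-contained sketch, not part of the VSGPRN code), the following R snippet maximizes $E_{z \sim \mathcal{N}(\mu, \sigma^2)}[-(z - 3)^2]$ using Monte Carlo gradients obtained by differentiating through $z = \mu + \sigma\epsilon$:

set.seed(1)
mu <- 0; log_sigma <- 0; lr <- 0.05
for (it in 1:2000) {
  eps   <- rnorm(32)                        # base noise, independent of parameters
  sigma <- exp(log_sigma)
  z     <- mu + sigma * eps                 # reparameterized samples
  df_dz <- -2 * (z - 3)                     # gradient of the integrand at z
  grad_mu <- mean(df_dz)                    # chain rule: dz/dmu = 1
  grad_ls <- mean(df_dz * eps * sigma)      # dz/dlog_sigma = sigma * eps
  mu <- mu + lr * grad_mu                   # gradient ascent step
  log_sigma <- log_sigma + lr * grad_ls
}
# mu converges to 3 and sigma shrinks towards 0

The same mechanism, applied to the Gaussian variational factors above, gives unbiased gradients of (1) with respect to the variational parameters.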
4 Experiments
This section illustrates the performance of our model on multivariate time series. We
first show that our approach can model the time-varying correlation and smoothness
of outputs on 2D synthetic datasets in three scenarios with respect to different types of
frequencies but the same missing data mechanism. Then, we compare the imputation
performance on missing data with other inducing-variable based sparse multivariate
Gaussian process models on a real dataset.
The synthetic outputs are driven by standard white noise processes. The value of $w$ refers to the frequency and the value of
𝑠 characterizes the smoothness. The LF and HF datasets use the same 𝑠 = 1, imply-
ing the smoothness is invariant across time. But they employ different frequencies,
𝑤 = 2 for LF and 𝑤 = 5 for HF (i.e., two periods and five periods in a unit time
interval respectively). The VF dataset takes 𝑠 = 2 and 𝑤 = 5, so that the frequency
of the function is gradually increasing as time increases. For all three datasets, the
system shows that as time 𝑡 increases from 0 to 1, the correlation between 𝑦 1 (𝑡) and
𝑦 2 (𝑡) gradually varies from positive to negative. Within each dataset, we randomly
select 200 training data points, in which 100 time stamps are sampled on the interval
(0, 0.8) for the first dimension and the other 100 time stamps sampled on the interval
(0.2, 1) for the second dimension. For the test inputs, we randomly select 100 time
stamps on the interval (0, 1) for each dimension.
Table 1 Prediction measurements on three synthetic datasets and different models. LF, HF and VF
refer to low-frequency, high-frequency, and time-varying datasets. Three prediction measures are
root mean square error (RMSE), average length of confidence interval (ALCI), and coverage rate
(CR). All three measurements are summarized by the mean and standard deviation across 10 runs
with different random initializations.
Data  Model        RMSE             ALCI             CR
LF    IGPR [16]    2.25 (1.33e-13)  2.18 (1.88e-13)  0.835 (0)
      ICM [17]     2.26 (2.54e-5)   2.18 (1.22e-5)   0.835 (0)
      CMOGP [12]   1.43 (6.12e-2)   1.36 (1.98e-1)   0.651 (3.00e-2)
      VGPRN [18]   1.01 (0.31)      -                -
      VSGPRN       1.00 (1.43e-1)   2.21 (6.56e-2)   0.892 (1.63e-2)
HF    IGPR [16]    1.51 (6.01e-14)  3.17 (1.30e-13)  0.915 (2.22e-16)
      ICM [17]     1.52 (1.01e-5)   3.17 (1.19e-5)   0.910 (0)
      CMOGP [12]   1.29 (3.04e-2)   2.34 (3.31e-1)   0.729 (3.07e-2)
      VGPRN [18]   1.11 (0.25)      -                -
      VSGPRN       1.10 (1.98e-1)   2.74 (7.94e-2)   0.930 (1.14e-2)
VF    IGPR [16]    1.64 (8.17e-14)  3.19 (3.02e-13)  0.875 (0)
      ICM [17]     1.66 (2.37e-3)   3.16 (1.49e-3)   0.880 (1.50e-3)
      CMOGP [12]   2.24 (3.08e-1)   2.56 (9.29e-1)   0.697 (1.56e-1)
      VGPRN [18]   1.04 (0.67)      -                -
      VSGPRN       1.24 (1.33e-1)   2.92 (1.21e-1)   0.887 (9.80e-3)
We quantify the model performance in terms of root mean square error (RMSE),
average length of confidence interval (ALCI), and coverage rate (CR) on the test set.
A smaller RMSE corresponds to better predictive performance of the model, and a smaller ALCI implies a smaller predictive uncertainty. As for CR, the better the model prediction performance is, the closer CR is to the nominal percentile of the credible band. These results are reported as the mean and standard deviation over 10 different random initializations of the model parameters. Quantitative comparisons for all three datasets are given in Table 1. We compare with independent Gaussian process
regression (IGPR) [16], the intrinsic coregionalization model (ICM) [17], Collab-
orative Multi-Output Gaussian Processes (CMOGP) [12] and variational inference
of Gaussian process regression networks [18] on three synthetic datasets. In both
CMOGP and VSGPRN approaches, we use 20 inducing variables. We further exam-
ined model predictive performance on a real-world dataset, the PM2.5 dataset from
the UCI Machine Learning Repository [19]. This dataset tracks the concentration of
fine inhalable particles hourly in five cities in China, along with meteorological data,
from Jan 1st, 2010 to Dec 31st, 2015. We compare our model with two sparse Gaus-
sian process models, i.e., independent sparse Gaussian process regression (ISGPR)
[20] and the sparse linear model of corregionalization (SLMC) [17]. In the dataset,
we consider six important attributes and use 20% of the first 5000 standardized mul-
tivaritate for training and use the others for testing. The RMSEs on the testing data
are shown in Table 2, illustrating that VSGPRN had better prediction performance
compared with ISGPR and SLMC, even when using fewer inducing points.
Table 2 Empirical results for PM2.5 dataset. Each model’s performance is summarized by its
RMSE on the testing data. The number of equi-spaced inducing points is given in parentheses.
Data ISGPR (100) [20] SLMC (100) [17] VSGPRN (50) VSGPRN (100) VSGPRN (200)
PM2.5 0.994 0.948 0.840 0.708 0.625
5 Conclusions
References
1. Álvarez, M., Lawrence, N.: Computationally efficient convolved multiple output Gaussian
processes. J. Mach. Learn. Res. 12, 1459-1500 (2011)
2. Goulard, M., Voltz, M.: Linear coregionalization model: tools for estimation and choice of
cross-variogram matrix. Math. Geol. 24, 269-286 (1992)
3. Gneiting, T., Kleiber, W., Schlather, M.: Matérn cross-covariance functions for multivariate
random fields. J. Am. Stat. Assoc. 105, 1167-1177 (2010)
4. Qadir, G., Sun, Y.: Semiparametric estimation of cross-covariance functions for multivariate
random fields. Biom. 77, 547-560 (2021)
5. Gelfand, A., Schmidt, A., Banerjee, S., Sirmans, C.: Nonstationary multivariate process mod-
eling through spatially varying coregionalization. Test. 13, 263-312 (2004)
6. Meng, R., Soper, B., Lee, H., Liu, V., Greene, J., Ray, P.: Nonstationary multivariate Gaussian
processes for electronic health records. J. Biom. Inform. 117, 103698 (2021)
7. Titsias, M., Lawrence, N.: Bayesian Gaussian process latent variable model. Int. Conf. Artif.
Intell. Stat. 844-851 (2010)
8. Teh, Y., Newman, D., Welling, M.: A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural Information Processing Systems 19 (2006)
9. Guhaniyogi, R., Finley, A., Banerjee, S., Kobe, R.: Modeling complex spatial dependencies:
Low-rank spatially varying cross-covariances with application to soil nutrient data. J. Agric.
Biol. Environ. Stat. 18, 274-298 (2013)
10. Møller, J., Syversveen, A., Waagepetersen, R.: Log Gaussian Cox processes. Scand. J. Stat.
25, 451-482 (1998)
11. Remes, S., Heinonen, M., Kaski, S.: Non-stationary spectral kernels. Adv. Neural Inf. Process. Syst. 30 (2017), https://proceedings.neurips.cc/paper/2017/file/c65d7bd70fe3e5e3a2f3de681edc193d-Paper.pdf
12. Nguyen, T., Bonilla, E., et al.: Collaborative multi-output Gaussian processes. Uncertain.
Artif. Intell. 643-652 (2014)
13. Titsias, M., Lázaro-Gredilla, M.: Doubly stochastic variational Bayes for non-conjugate infer-
ence. Int. Conf. Mach. Learn. 1971-1979 (2014)
14. Ranganath, R., Gerrish, S., Blei, D.: Black box variational inference. Int. Conf. Artif. Intell.
Stat. 814-822 (2014)
15. Hoffman, M., Blei, D., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn.
Res. 14, 1303-1347 (2013)
16. Rasmussen, C., Kuss, M.: Gaussian processes in reinforcement learning. Adv. Neural Inf.
Process. Syst. 751-759 (2004)
17. Wackernagel, H.: Multivariate geostatistics: an introduction with applications. Springer Sci-
ence & Business Media (2013)
18. Nguyen, T., Bonilla, E.: Efficient variational inference for Gaussian process regression net-
works. Int. Conf. Artif. Intell. Stat. 472-480 (2013)
19. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., Chen, S.: Assessing
Beijing’s PM2.5 pollution: severity, weather impact, APEC and winter heating. Proc. R. Soc.
A: Math. Phys. Eng. Sci. 471, 20150257 (2015)
https://royalsocietypublishing.org/doi/abs/10.1098/rspa.2015.0257
20. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. Adv. Neural Inf. Process. Syst. 1257-1264 (2006), http://papers.nips.cc/paper/2857-sparse-gaussian-processes-using-pseudo-inputs.pdf
An Online Minorization-Maximization
Algorithm
Hien Duy Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé
Abstract Modern statistical and machine learning settings often involve high data
volume and data streaming, which require the development of online estimation
algorithms. The online Expectation–Maximization (EM) algorithm extends the pop-
ular EM algorithm to this setting, via a stochastic approximation approach. We show
that an online version of the Minorization–Maximization (MM) algorithm, which in-
cludes the online EM algorithm as a special case, can also be constructed in a similar
manner. We demonstrate our approach via an application to the logistic regression
problem and compare it to existing methods.
1 Introduction
EM and MM algorithms are widely used to estimate models, including many finite mixture models. The benefit of such algorithms comes from the use of computationally simple surrogates in place of difficult optimization objectives.
Driven by high data volumes and the streamed nature of data acquisition, there has been a rapid development of online and mini-batch algorithms that can estimate models without requiring the data to be accessed all at once. Online and mini-batch versions of EM algorithms can be constructed via the classic stochastic approximation framework (see, e.g., [2, 13]); examples of such algorithms include those of [3, 7, 8, 10, 11, 12, 19]. Via numerical assessments, many of the algorithms above have been demonstrated to be effective in mixture model estimation problems. Online and mini-batch versions of MM algorithms, on the other hand, have largely been constructed following convex optimization methods (see, e.g., [9, 14, 23]); examples include those of [4, 16, 18, 22].
In this work, we provide a stochastic approximation construction of an online
MM algorithm using the framework of [3]. The main advantage of our approach is
that we do not make convexity assumptions and instead replace them with oracle
assumptions regarding the surrogates. Compared to the online EM algorithm of [3]
that this work is based upon, the Online MM algorithm extends the approach to allow
for surrogate functions that do not require latent variable stochastic representations,
which is especially useful for constructing estimation algorithms for mixture of
experts (MoE) models (see, e.g. [20]). We demonstrate the Online MM algorithm
via an application to the MoE-related logistic regression problem and compare it to
competing methods.
Notation. By convention, vectors are column vectors. For a matrix $A$, $A^\top$ denotes its transpose. The Euclidean scalar product is denoted by $\langle a, b \rangle$. For a continuously differentiable function $\theta \mapsto h(\theta)$ (resp. twice continuously differentiable), $\nabla_\theta h$ (or simply $\nabla$ when there is no confusion) is its gradient (resp. $\nabla^2_{\theta\theta} h$ is its Hessian). We denote the vectorization operator that converts matrices to column vectors by $\mathrm{vec}$.
In our work, we consider the case when the minorizer function 𝑔 has the following
structure:
A1 The minorizer surrogate 𝑔 is of the form:
thus providing a minorizer function for the objective function 𝜃 ↦→ E [ 𝑓 (𝜃; 𝑋)].
By A4, the usual MM algorithm would iteratively define the sequence $\theta_{n+1} = \bar{\theta}\big(E[\bar{S}(\theta_n; X)]\big)$. Since the expectation may not have a closed form while infinite datasets are available (see A3), we propose a novel Online MM algorithm. It defines the sequence $\{s_n, n \ge 0\}$ as follows: given positive step sizes $\{\gamma_{n+1}, n \ge 1\}$ in $(0, 1)$ and an initial value $s_0 \in \mathcal{S}$, set for $n \ge 0$:
$$s_{n+1} = s_n + \gamma_{n+1}\big(\bar{S}(\bar{\theta}(s_n); X_{n+1}) - s_n\big). \qquad (5)$$
The iterates can be shown to converge to a stationary point $\theta_\star := \bar{\theta}(s_\star)$ of the objective function $E[f(\theta; X)]$ (i.e., $\theta_\star$ is a root of the derivative of the objective function). The proof follows the technique of [3]. Set
$$\mathrm{h}(s) := E\big[\bar{S}(\bar{\theta}(s); X)\big] - s, \qquad \Gamma := \{s \in \mathcal{S} : \mathrm{h}(s) = 0\}.$$
Using (2) and A1 and taking the expectation w.r.t. $X$ (under A3) yields (4), which holds for any $\theta, \tau \in \mathcal{T}$. This inequality provides a minorizer function for $\theta \mapsto E[f(\theta; X)]$: the difference is nonnegative and minimal (i.e., equal to zero) at $\theta = \tau$.
3 Example Application
As an example, we consider the logistic regression problem, where we solve (1) with $f(\theta; x) := y\,\theta^\top w - \log\big(1 + \exp(\theta^\top w)\big)$ for $x = (y, w)$, with $\lambda(u) := (1 + e^{-u})^{-1}$ the logistic function, so that
$$\bar{s}_1(\tau; x) := \big(y - \lambda(\tau^\top w)\big)\, w + \tfrac{1}{4}\, w w^\top \tau, \qquad \bar{S}_2(\tau; x) := -\tfrac{1}{8}\, w w^\top.$$
With $\mathcal{S} := \big\{(s_1, \mathrm{vec}(S_2)) : s_1 \in \mathbb{R}^p \text{ and } S_2 \in \mathbb{R}^{p \times p} \text{ is symmetric negative definite}\big\}$, it follows that $\bar{\theta}(s) := -(2 S_2)^{-1} s_1$.
Online MM. Let $s_n = (s_{1,n}, S_{2,n}) \in \mathcal{S}$. The corresponding Online MM recursion is then
$$s_{1,n+1} = s_{1,n} + \gamma_{n+1}\Big(\big(Y_{n+1} - \lambda(\bar{\theta}(s_n)^\top W_{n+1})\big)\, W_{n+1} + \tfrac{1}{4} W_{n+1} W_{n+1}^\top \bar{\theta}(s_n) - s_{1,n}\Big), \qquad (8)$$
$$S_{2,n+1} = S_{2,n} + \gamma_{n+1}\Big(-\tfrac{1}{8} W_{n+1} W_{n+1}^\top - S_{2,n}\Big), \qquad (9)$$
where $\{(Y_{n+1}, W_{n+1}), n \ge 0\}$ are i.i.d. pairs with the same distribution as $X = (Y, W)$. Parameter estimates can then be deduced by setting $\theta_{n+1} := \bar{\theta}(s_{n+1})$.
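A hedged R sketch of recursion (8)–(9) on a simulated stream follows; the step-size choice $\gamma_n = (n + 1)^{-0.6}$ and all names are our assumptions, not the exact experimental settings.

lambda <- function(u) 1 / (1 + exp(-u))             # logistic function

online_mm_logistic <- function(y, W, gamma = function(n) (n + 1)^(-0.6)) {
  p <- ncol(W)
  s1 <- rep(0, p)
  S2 <- -diag(p) / 8                                # negative definite initialization
  theta <- rep(0, p)
  for (n in seq_along(y)) {
    w <- W[n, ]; g <- gamma(n)
    ww <- tcrossprod(w)                             # w w^T
    s1 <- s1 + g * ((y[n] - lambda(sum(w * theta))) * w +
                    drop(ww %*% theta) / 4 - s1)    # Eq. (8)
    S2 <- S2 + g * (-ww / 8 - S2)                   # Eq. (9)
    theta <- drop(solve(-2 * S2, s1))               # theta_{n+1} = theta_bar(s_{n+1})
  }
  theta
}

# simulated stream:
set.seed(1)
N <- 5e4; theta_true <- c(1, -2)
W <- cbind(1, rnorm(N))
y <- rbinom(N, 1, lambda(W %*% theta_true))
online_mm_logistic(y, W)                            # approaches theta_true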
For comparison, we also consider two stochastic approximation schemes operating directly on $\theta$ in the parameter space: a stochastic gradient (SG) algorithm and a stochastic Newton–Raphson (SNR) algorithm.
Stochastic gradient. SG requires the gradient of $f(\theta; x)$ with respect to $\theta$, $\nabla f(\theta; x) = \{y - \lambda(\theta^\top w)\}\, w$, which leads to the recursion
$$\theta_{n+1} = \theta_n + \gamma_{n+1}\,\{Y_{n+1} - \lambda(\theta_n^\top W_{n+1})\}\, W_{n+1}.$$
Equation (12) assumes that $\hat{A}_{n+1}$ is invertible. In this logistic example, we can guarantee this by choosing $\hat{A}_0$ to be invertible; otherwise $\hat{A}_n$ is invertible for all sufficiently large $n$, with probability one. Again in the logistic case, observe that, from the structure of $\nabla^2_{\theta\theta} f$ and from the Woodbury matrix identity, Equations (11)–(12) can be replaced by
$$G_{n+1} = \frac{G_n}{1 - \gamma_{n+1}} - \frac{\gamma_{n+1}}{1 - \gamma_{n+1}}\, \frac{a_{n+1}\, G_n W_{n+1} W_{n+1}^\top G_n}{(1 - \gamma_{n+1}) + \gamma_{n+1}\, a_{n+1}\, W_{n+1}^\top G_n W_{n+1}}.$$
Fig. 1 Logistic regression example: the first row shows the Online MM (black), SG (blue), and SNR (red) recursions. The second row shows the respective Polyak-averaged recursions. The estimates of the first (first column) and second (second column) components of $\theta$ are plotted from $n = 10^3$ onwards for readability.
4 Final Remarks
Remark 2 Via the minorization approach of [1] (as used in Section 3) and the mixture
representation from [19], we can construct an Online MM algorithm for MoE models,
analogous to the MM algorithm of [20]. We shall provide exposition on such an
algorithm in future work.
Acknowledgements Part of the work by G. Fort is funded by the Fondation Simone et Cino Del
Duca, Institut de France. H. Nguyen is funded by ARC Grant DP180101192. The work is supported
by Inria project LANDER.
270 H. D. Nguyen et al.
References
1. Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. (1992)
2. Borkar, V.S.: Stochastic approximation: A dynamical systems viewpoint. Springer (2009)
3. Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models.
J. Roy. Stat. Soc. B Stat. Meth. 71, 593–613 (2009)
4. Cui, Y., Pang, J.: Modern nonconvex nondifferentiable optimization. SIAM, Philadelphia
(2022)
5. Delyon, B., Lavielle, M., Moulines, E.: Convergence of a stochastic approximation version of
the EM algorithm. Ann. Stat. 27, 94–128 (1999)
6. Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via
the EM algorithm. J. Roy. Stat. Soc. B Stat. Meth. 39, 1–38 (1977)
7. Fort, G., Gach, P., Moulines, E.: Fast incremental expectation maximization for finite-sum
optimization: nonasymptotic convergence. Stat. Comput. 31, 1–24 (2021)
8. Fort, G., Moulines, E., Wai, H. T.: A stochastic path-integrated differential estimator expecta-
tion maximization algorithm. In: Proceedings of the 34th Conference on Neural Information
Processing Systems (NeurIPS) (2020)
9. Hazan, E.: Introduction to online convex optimization. Foundations and Trends in Optimization.
2 (2015)
10. Karimi, B., Miasojedow, B., Moulines, E., Wai, H. T.: Non-asymptotic analysis of biased
stochastic approximation scheme. Proceedings of Machine Learning Research. 99, 1–31 (2019)
11. Karimi, B., Wai, H. T., Moulines, R., Lavielle, M.: On the global convergence of (fast) incre-
mental expectation maximization methods. In: Proceedings of the 33rd Conference on Neural
Information Processing Systems (NeurIPS) (2019)
12. Kuhn, E., Matias, C., Rebafka, T.: Properties of the stochastic approximation EM algorithm with mini-batch sampling. Stat. Comput. 30, 1725–1739 (2020)
13. Kushner, H. J., Yin, G. G.: Stochastic Approximation and Recursive Algorithms and Applica-
tions. Springer, New York (2003)
14. Lan, G.: First-order and Stochastic Optimization Methods for Machine Learning. Springer,
Cham (2020)
15. Lange, K.: MM Optimization Algorithms. SIAM, Philadelphia (2016)
16. Mairal, J.: Stochastic majorization-minimization algorithm for large-scale optimization. In:
Advances in Neural Information Processing Systems, pp. 2283–2291 (2013)
17. McLachlan, G. J., Krishnan, T.: The EM Algorithm And Extensions. Wiley, New York (2008)
18. Mokhtari, A., Koppel, A.: High-dimensional nonconvex stochastic optimization by doubly
stochastic successive convex approximation. IEEE Trans. Signal Process. 68, 6287–6302
(2020)
19. Nguyen, H.D., Forbes, F., McLachlan, G. J.: Mini-batch learning of exponential family finite
mixture models. Stat. Comput. 30, 731–748 (2020)
20. Nguyen, H. D., McLachlan, G. J.: Laplace mixture of linear experts. Comput. Stat. Data Anal.
93, 177–191 (2016)
21. Polyak, B. T., Juditsky, A. B.: Acceleration of stochastic approximation by averaging. SIAM
J. Contr. Optim. 30, 838–855 (1992)
22. Razaviyayn, M., Sanjabi, M., Luo, Z.: A stochastic successive minimization method for non-
smooth nonconvex optimization with applications to transceiver design in wireless communi-
cation networks. Math. Program. Series B. 515–545 (2016)
23. Shalev-Shwartz, S.: Online learning and online convex optimization. Foundations and Trends
in Machine Learning. 4, 107–194 (2011)
Detecting Differences in Italian Regional Health
Services During Two Covid-19 Waves
Abstract During the first two waves of the Covid-19 pandemic, territorial healthcare systems were severely stressed in many countries. The availability (and complexity) of data requires proper comparisons for understanding differences in the performance of health services. We apply a three-step approach to compare the performance of the Italian healthcare system at the territorial level (NUTS 2 regions), considering daily time series regarding both intensive care units and ordinary hospitalizations of Covid-19 patients. The main results reveal changes between the two waves at the regional level, allowing us to map the pressure on territorial health services.
1 Introduction
During the Covid-19 pandemic, the evaluation of similarities and differences between territorial health services [23] is relevant for decision makers and should guide the governance of countries [15] through the so-called "waves". This type of analysis becomes even more crucial in countries where the national healthcare system is regionally based, which is the case of Italy (or Spain), among others. Italy is one of the European countries most affected by the pandemic, and the pressure on Regional Health Services (RHS) has been producing dramatic effects also in the economic [2] and social [3] spheres. Regional Covid-19-related health indicators are extremely relevant for monitoring the territorial spread of the pandemic [21] and for imposing (or relaxing) restrictions in accordance with the level of health risk.

Lucio Palazzo (✉)
Department of Political Sciences, University of Naples Federico II, via Leopoldo Rodinò 22 - 80138 Napoli, Italy, e-mail: [email protected]
Riccardo Ievoli
Department of Chemical, Pharmaceutical and Agricultural Sciences, University of Ferrara, via Luigi Borsari 46 - 44121 Ferrara, Italy, e-mail: [email protected]
The aim of this work is to exploit the potential of Multidimensional Scaling (MDS) to detect the main imbalances that occurred in the RHSs, observing the hospital admission dynamics of patients with Covid-19. Daily time series of patients treated in Intensive Care (IC) units and of individuals hospitalized in other hospital wards are used to evaluate and compare the reaction to healthcare pressure in 21 geographical areas (NUTS 2 Italian regions), considering the first two waves of the pandemic [4]. Indeed, territorial imbalances in terms of RHS performance [24] should first be driven by the geographical propagation flows of the virus (first wave). Then, different reactions to the pandemic shock may be provided by the RHSs, and changes in the imbalances can be observed in the second wave.
Our proposal consists of three subsequent steps. First, a matrix of distances between regional time series is obtained through a dissimilarity metric [29]. Then, we apply a weighted MDS [19, 22] to map similarity patterns in a reduced space, adding a weighting scheme based on the number of neighbouring regions. Finally, we perform a cluster analysis to identify groups of regions according to RHS performance in the two waves.
The paper is organized as follows: Section 2 describes the methodological ap-
proach used to compare and cluster time series, while Section 3 introduces data and
descriptive analysis. Results regarding RHSs are depicted and discussed in Section
4, while Section 5 concludes with some remarks and possible advances.
2 Methodological Approach

Given a $T \times n$ matrix, where $T$ represents the days and $n$ the number of regions, our methodological approach consists of three subsequent steps:
Step 1. Compute a dissimilarity matrix $D$ based on a given measure;
Step 2. Apply a weighted multidimensional scaling (wMDS) procedure, storing the coordinates of the first two components;
Step 3. Perform a cluster analysis on the reduced MDS space to identify groups among the $n$ regions.
In the first step, a dissimilarity measure is computed for each pair of regional time
series. The objective is to obtain a dissimilarity matrix 𝐷 (with elements 𝑑𝑖 𝑗 ) that provides
synthetic measures of the differences between regions. There are several
alternatives for comparing time series; comprehensive overviews can be found in [29, 13].
A reasonable choice is the Fourier dissimilarity 𝑑 𝐹 (x, y), which applies the
𝑛-point Discrete Fourier Transform [1] to two time series, allowing their similarity
to be compared after converting them into a combination of structural elements,
such as trend and/or cycle.
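As a rough sketch of how Steps 1-3 can be prototyped (our own illustration, not the authors' code: the object ts_mat, the number of retained Fourier coefficients and the choice of hierarchical clustering are all assumptions, and the neighbour-based weighting of the wMDS is omitted since cmdscale is unweighted):

```r
# Sketch of the three-step pipeline. `ts_mat` is assumed to be a T x n matrix
# of regional rates (one column per region) for a given indicator and wave.
fourier_dist <- function(x, y, n_coef = 20) {
  # Fourier dissimilarity: Euclidean distance between the first n_coef
  # coefficients of the Discrete Fourier Transform of each series [1]
  fx <- fft(x)[seq_len(n_coef)]
  fy <- fft(y)[seq_len(n_coef)]
  sqrt(sum(Mod(fx - fy)^2))
}

n <- ncol(ts_mat)
D <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    D[i, j] <- D[j, i] <- fourier_dist(ts_mat[, i], ts_mat[, j])  # Step 1
  }
}

coords <- cmdscale(as.dist(D), k = 2)              # Step 2 (unweighted MDS)
cl <- hclust(dist(coords), method = "ward.D2")     # Step 3: cluster regions
groups <- cutree(cl, k = 4)                        # e.g. four pressure levels
```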
3 Data and Descriptive Analysis
Daily regional time series reporting a) the number of patients treated in IC units
and b) the number of patients admitted to other hospital wards are retrieved
from the official website of the Italian Civil Protection1. All patients were positive
for the Covid-19 test (nasal and oropharyngeal swab). To take into account the
different population sizes, both a) and b) are normalized by the
population of each territorial unit (estimated at 2020/01/01). The rates of patients
treated in IC units and of hospitalized (HO) patients in other hospital wards are then
multiplied by 100,000.
The whole dataset contains two identified waves2 of Covid-19, as follows:
Wave 1 (W1): 𝑇 = 109 days from February 24 to June 11, 2020
Wave 2 (W2): 𝑇 = 109 days from September 14 to December 31, 2020
The observed dynamics may also depend on external factors, such as the restrictive
measures introduced by the Italian Government [27, 6], which influenced
the observed differences between W1 and W2. We remark that a full national
lockdown was in force between March 9th and May 18th, 2020.
1 Source: www.dati-covid.italia.it
2 Refer to [7] for further details.
[Fig. 1: daily IC (upper panels) and HO (lower panels) rates per 100,000 inhabitants for the Italian regions, during waves W1 (left) and W2 (right).]
Figure 1 shows the time series for HO and IC (rows) according to the two waves
of Covid-19 (columns). The anomaly of the small region Valle D'Aosta
emerges both in the first wave (in particular concerning IC) and in the second (also for
HO), while Lombardia, which is the largest and most populous region, dominates
other territories, especially when considering HO in W1. The upper panel of Figure
1 helps to understand the differences between the two waves in terms of admissions to
intensive care: while regions with high, medium and low IC rates can be directly
identified by visual inspection of the series during W1, more homogeneity
is observed in W2. Furthermore, with the exception of Valle D'Aosta, the IC rate
always remains below 10 for all considered observations.
As concerns the HO rate (lower panels of Figure 1), Lombardia reaches values
greater than 100 in W1 (especially in April), while during W2 this threshold was
exceeded by Valle D'Aosta and Piemonte (both in November). Again, if W1 opposes
regions with high and (moderately) low HO rates, in W2 the following situation
arises: a) Valle D'Aosta and Piemonte reach values over 100, b) four regions (Liguria,
Lazio, P.A. Trento and P.A. Bolzano) present values over 75, and c) the majority of
territories share similar trends, with peaks always lower than 75.
4 Results
A wMDS is applied to the dissimilarity matrix 𝐷 equipped with the Fourier distance3,
using a set of weights 𝝎 proportional to the number of neighbouring regions of each
region, which introduces a spatial feature into the model.
Figure 2 displays the main results of the wMDS, distinguishing between four levels
of critical issues experienced by the RHSs. Outlying performances are coloured in
violet. A first cluster (in red) includes “critical” regions, while a group depicted in
orange contains territories with high pressure on their RHS. Regions in the
green cluster experienced moderate pressure on their RHS, while blue indicates
territories experiencing low pressure. These clusters may also be interpreted
as a ranking of health service risk.
As regards HO during W1, leaving aside the two outliers (Lombardia and P.A.
Bolzano), the “red” cluster is composed of three Northern regions (Piemonte, Valle
d'Aosta and Emilia-Romagna). The high-pressure group is composed of Liguria,
Marche and P.A. Trento, while the green cluster involves Lazio, Abruzzo and Toscana
(from the centre of Italy) and Veneto. The last group includes nine regions, seven of which
are located in southern Italy. In W2, the clustering procedure identifies Piemonte and Valle
d'Aosta as outliers, while the high-pressure group is composed of the two
autonomous provinces (Trento and Bolzano), Lombardia and Liguria. The “orange”
group consists of regions located in the North-East (Friuli-Venezia Giulia,
Emilia-Romagna and Veneto), along with Abruzzo and Lazio. Southern regions
are allocated to the “green” group (together with Umbria, Toscana and
Marche), while Molise, Calabria and Basilicata remain in the low-pressure cluster.
Regarding IC rates, during W1 Lombardia and Valle d'Aosta are considered
outliers, while the “red” cluster is composed of four northern Italian regions
(Emilia-Romagna, P.A. Trento, Piemonte and Liguria) and Marche (located in
the centre). The “orange” cluster contains Toscana, Veneto and P.A. Bolzano, while
the moderate-pressure cluster involves two areas of central Italy (Lazio and Umbria),
along with Friuli-Venezia Giulia (from the north-east) and Abruzzo. The last
cluster includes only regions from the south. According to the bottom right panel of
Figure 2, apart from Valle D'Aosta, the procedure identifies Calabria as an outlier in W2.
The “red” group acquires two observations from the centre of Italy, namely Toscana
and Umbria, while the majority of regions are classified in the moderately pressured
group. Only three southern Italian areas are allocated to the last group (in green).
If the geography of the disease appears fundamental in W1, especially regarding
the territories adjoining Lombardia, in W2 this effect is less evident. Thus, regions
improving (e.g. Emilia-Romagna) or worsening (such as Lazio and Abruzzo) their
clustering “ranking” can easily be observed. As mentioned, the differences in the
restrictive measures imposed by the Government in the two periods may play a role
in these results.
3 We remark that other distance measures have also been applied: a) the Fourier distance shows the
best performance in terms of goodness of fit; b) the results are not sensitive to the choice of
distance.
[Fig. 2: maps of the Italian regions resulting from the wMDS-based clustering, for HO W1, HO W2, IC W1 and IC W2; colours denote the four levels of pressure (1-4) on the RHS, with outliers marked separately.]
5 Concluding Remarks
The Covid-19 pandemic has put a strain on the Italian healthcare system. The reactions
of the RHSs play a relevant role in mitigating the health crisis at the territorial level and
in guaranteeing equitable access to healthcare.
This work helps to understand similarities and divergences between the Italian regions
in relation to the health pressure of the first two waves of the virus. Considering
crucial measures such as the HO and IC rates, the comparison between the two waves
makes it possible to understand differences in the reactions of the RHSs to pandemic
shocks. Although northern Italy represented the epicentre of the Covid-19 spread in the
first wave, some regions (e.g. Veneto and Friuli-Venezia Giulia) seem to have succeeded in
avoiding hospital overcrowding, while Southern regions (and the Islands) clearly
experienced less pressure. Furthermore, in the second wave, the differences appear
somewhat attenuated and the cluster sizes seem more homogeneous. There are,
however, some exceptions, such as Emilia-Romagna, which seems to have been less
affected by the second wave compared to the other regions. The detection of clusters
represents a starting point for the improvement of health governance and can be used
to monitor potential imbalances in future waves.
Further analysis may employ other dedicated indicators coming, for instance,
from the Italian National Institute of Statistics4, or different proposals for combining
wMDS with dissimilarity measures and clustering [28]. Following a different
methodological approach, the recent method proposed in [10] could be applied to
these data to include more complex spatial relationships between territories.
References
1. Agrawal, R., Faloutsos, C., Swami, A.: Efficient similarity search in sequence databases. In:
International Conference on Foundations of Data Organization and Algorithms, pp. 69-84.
Springer, Berlin (1993)
2. Ascani, A., Faggian, A., Montresor, S.: The geography of COVID-19 and the structure of local
economies: The case of Italy. Journal of Regional Science, 61(2), 407-441 (2021)
3. Beria, P., Lunkar, V.: Presence and mobility of the population during the first wave of Covid-19
outbreak and lockdown in Italy. Sustainable Cities and Society, 65, 102616 (2021)
4. Bontempi, E.: The Europe second wave of COVID-19 infection and the Italy “strange” situa-
tion. Environmental Research, 193, 110476 (2021)
5. Capolongo, S., Gola, M., Brambilla, A., Morganti, A., Mosca, E. I., Barach, P.: COVID-19
and Healthcare facilities: A decalogue of design strategies for resilient hospitals. Acta Bio
Medica: Atenei Parmensis, 91(9-S), 50 (2020)
6. Chirico, F., Sacco, A., Nucera, G., Magnavita, N.: Coronavirus disease 2019: the second wave
in Italy. Journal of Health Research (2021).
7. Cicchetti, A., Damiani, G., Specchia, M. L., Basile, M., Di Bidino, R., Di Brino, E., Tattoli,
A.: Analisi dei modelli organizzativi di risposta al Covid-19. ALTEMS (2020). link: https:
//altems.unicatt.it/altems-report47.pdf
8. Cuesta-Albertos, J. A., Gordaliza, A., Matrán, C.: Trimmed 𝑘-means: An attempt to robustify
quantizers. The Annals of Statistics, 25(2), 553-576 (1997).
4 See, for example, the BES indicators of the domains “Health” and “Quality of services”,
https://www.istat.it/it/files//2021/03/BES_2020.pdf
9. Di Iorio, F., Triacca, U.: Distance between VARMA models and its application to spatial differ-
ences analysis in the relationship GDP-unemployment growth rate in Europe. In: International
Work-Conference on Time Series Analysis, pp. 203-215. Springer, Cham (2017)
10. D’Urso, P., De Giovanni, L., Disegna, M., Massari, R.: Fuzzy clustering with spatial-temporal
information. Spatial Statistics, 30, 71-102 (2019)
11. Garcia-Escudero, L. A., Gordaliza, A.: Robustness properties of 𝑘-means and trimmed
𝑘-means. Journal of the American Statistical Association, 94(447), 956–969 (1999) doi:
10.2307/2670010
12. Giuliani, D., Dickson, M. M., Espa, G., Santi, F.: Modelling and predicting the spatio-temporal
spread of COVID-19 in Italy. BMC infectious diseases, 20(1), 1-10 (2020)
13. Górecki, T., Piasecki, P.: A comprehensive comparison of distance measures for time series
classification. In: Steland, A., Rafajłowicz, E., Okhrin, O. (Eds.) Workshop on Stochastic
Models, Statistics and their Application, pp. 409-428. Springer Nature (2019)
14. Greenacre, M.: Weighted metric multidimensional scaling. In: New developments in Classi-
fication and Data Analysis, pp. 141-149. Springer, Berlin, Heidelberg (2005)
15. Han, E., Tan, M. M. J., Turk, E., Sridhar, D., Leung, G. M., Shibuya, K., Legido-Quigley, H.:
Lessons learnt from easing COVID-19 restrictions: an analysis of countries and regions in
Asia Pacific and Europe. The Lancet, 396(10261), 1525–1534 (2020)
16. He, J., Shang, P., Xiong, H.: Multidimensional scaling analysis of financial time series based on
modified cross-sample entropy methods. Physica A: Statistical Mechanics and its Applications,
500, 210-221 (2018)
17. Kent, J. T., Bibby, J., Mardia, K. V.: Multivariate Analysis. Amsterdam: Academic Press
(1979)
18. Kruskal, J.: The relationship between multidimensional scaling and clustering. In: Classifica-
tion and Clustering, pp. 17-44. Academic Press (1977)
19. Kruskal, J. B.: Multidimensional Scaling (No. 11). Sage (1978)
20. Mardia, K. V.: Some properties of classical multi-dimensional scaling. Communications in
Statistics-Theory and Methods, 7(13), 1233-1241 (1978)
21. Marziano, V., Guzzetta, G., Rondinone, B. M., Boccuni, F., Riccardo, F., Bella, A., Merler,
S.: Retrospective analysis of the Italian exit strategy from COVID-19 lockdown. Proceedings
of the National Academy of Sciences, 118(4) (2021)
22. Mead, A.: Review of the development of multidimensional scaling methods. Journal of the
Royal Statistical Society: Series D (The Statistician), 41(1), 27-39 (1992)
23. Pecoraro, F., Luzi, D., Clemente, F.: Analysis of the different approaches adopted in the Italian
regions to care for patients affected by COVID-19. International Journal of Environmental
Research and Public Health, 18(3), 848 (2021)
24. Pecoraro, F., Clemente, F., Luzi, D.: The efficiency in the ordinary hospital bed management
in Italy: An in-depth analysis of intensive care unit in the areas affected by COVID-19 before
the outbreak. PLoS One, 15(9), e0239249 (2020)
25. Piccolo, D.: Una rappresentazione multidimensionale per modelli statistici dinamici. In: Atti
della XXXII Riunione Scientifica della SIS, 2, pp. 149-160 (1984)
26. Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Lin, C. T.: A review of
clustering techniques and developments. Neurocomputing, 267, 664-681 (2017)
27. Sebastiani, G., Massa, M., Riboli, E.: Covid-19 epidemic in Italy: evolution, projections and
impact of government measures. European Journal of Epidemiology, 35(4), 341-345 (2020)
28. Shang, D., Shang, P., Liu, L.: Multidimensional scaling method for complex time series feature
classification based on generalized complexity-invariant distance. Nonlinear Dynamics, 95(4),
2875-2892 (2019)
29. Studer, M., Ritschard, G.: What matters in differences between life trajectories: A comparative
review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 179(2), 481-511 (2016)
30. Tenreiro Machado, J. A., Lopes, A. M., Galhano, A. M.: Multidimensional scaling visualiza-
tion using parametric similarity indices. Entropy, 17(4), 1775-1794 (2015)
31. Torgerson, W. S.: Multidimensional scaling: I. Theory and method. Psychometrika, 17(4),
401-419 (1952)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Political and Religion Attitudes in Greece:
Behavioral Discourses
Abstract The research presented in this paper attempts to explore the relationship between
religious and political attitudes. More specifically, we investigate how religious
behavior, in terms of belief intensity and practice frequency, is related to specific
patterns of political behavior such as ideology, the understanding of democracy and one's
set of moral values. The analysis is based on multivariate methods, more
specifically Hierarchical Cluster Analysis and Multiple Correspondence Analysis applied in
two steps. The findings are based on a survey implemented in 2019 on a sample of
506 respondents in the wider area of Thessaloniki, Greece. The aim of the research is
to highlight the role of people's religious practice intensity in shaping their political
views by displaying the profiles resulting from the analysis and linking individual
religious and political characteristics as measured by various variables. The final
output of the analysis is a map where all variable categories are visualized, bringing
forward models of political behavior associated with other factors such as religion,
moral values and democratic attitudes.
Georgia Panagiotidou ( )
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
Theodore Chadjipadelis
Aristotle University of Thessaloniki, Greece, e-mail: [email protected]
1 Introduction
In this research we present the results of a survey implemented
in April 2019 on 506 respondents in Thessaloniki, focusing on their religious profile
as well as their political attitudes, their moral profile and the way they comprehend
democracy. The aim of the analysis is to investigate and highlight the role of religious
practice in shaping political behavior. In the field of political behavior analysis, religion,
and more specifically church practice, has emerged as one of the main pillars that form
the political attitudes of voters. Religious habits seem to have a decisive influence
on electoral choices, as emerges from Lazarsfeld's research at Columbia University
in 1944 [3], followed by the work of Butler and Stokes in 1969 [1] and the research
of Michelat and Simon in France [6]. More specifically, the comparative study
of Rose in 1974 [9] shows that more religious voters appear to be more
conservative, choosing to place themselves on the right side of the ideological
“left-right” axis, while non-religious voters opt for the left political parties.
The research and analysis of Michelat and Simon [6] brings to the surface two
opposing cultural models: on the one hand we have the deeply religious voters, who
belong to the middle and upper classes, residing in the cities or in the countryside,
while on the other hand we have the non-religious left voters with working class
characteristics. The first framework is articulated around religion and those who
belong to it identifying themselves as religious people, is inspired by a conservative
value system, put before the value of the individual, the family, the ancestral heritage
and tradition. The second cultural context is articulated around class rivalries and
socio-economic realities; those who belong to this context identify themselves as
“us workers towards others”. They believe in the values of collective action, vote
for left-wing parties, participate actively in unions and defend the interests of the
working class. To measure the influence of religious practice on political behavior,
applied research uses measurement scales about the intensity of religious beliefs and
the frequency of church service practice as an indicator of the level of one’s religious
integration.
To measure the level of religious intensity, variables are used such as how often one
attends the service, how strongly one believes in the existence of God, the afterlife, the
dogmas of the church, and so on. Since the 1990s there has been a rapid decline in the
frequency with which the population attends church services or self-identifies strongly
in terms of religiousness. Nevertheless, the correlation between electoral preference and
religious practice remains strong [5]. The most significant change for non-religious
people is that the left is losing its universal influence, as many of these voters shift
also toward the center. Strongly religious people continue to support the right more and, in
some cases, strengthen the far right. In this paper, apart from attempting to explore
and verify the existing literature on the effect of religion on political behavior,
focusing on the Greek case, the approach exploits methods used to achieve the
visualization of all existing relationships between different sets of variables. To link
together numerous variables and their categories to construct a model of religious and
political behavior, multiple applications of Hierarchical Cluster Analysis (HCA) are
made, followed by Multiple Correspondence Analysis (MCA) for the emerging
clusters. In this way, a semantic map is constructed [7], which visualizes discourses
of political and religious behavior and the inner antagonisms between the behavioral
profiles.
2 Methodology
For the implementation of the research a poll was conducted on a random sample
of 506 people in the greater area of Thessaloniki in Greece, during April 2019.
A questionnaire was used as the research tool, distributed on-site to randomly
approached respondents. The questionnaire consisted of three sections:
a) the first section included seven questions for demographic data of the respondent
such as gender, age, educational level, marital status, household income, occupation
and social class to which the respondent considers belonging; b) the second part
contained seven questions, ordinal variables, related to the religious practice and
beliefs of the respondent: i) how often does one go to church? ii) how often does one
pray? iii) how close does one feel to God, Virgin Mary (or to another seven religious
concepts) during church service? iv) how strongly does one have seven different
feelings during church service? v) does one believe or not in the saints, miracles,
prophecies (and another six religious concepts)? Two more questions investigating
their profile in terms of what is taught in the Christian dogma were included: vi)
one asking whether one can progress only by being an ethical person, and vii) another
asking whether they agree with the pain/righteousness scheme, that is, whether one who
suffers in this life will be rewarded later or in the afterlife; c) questions concerning the political
profile of the respondent are developed in the third part of the questionnaire: i)
one’s self-positioning on the ideological left-right axis, ii) a set of nine ordinal
variables requiring one’s agreement or disagreement level on sentences that reflect
the dimensions of liberalism-authoritarianism and left-right iii) this last section
also includes two different sets of pictures, used as symbolic representation for the
“democratic self” and the “moral self” [4]. The first set of twelve pictures represent
various conceptualizations of democracy, and one is asked to select three pictures
that represent democracy. The second set of pictures represent moral values in
life, and one is asked to choose three pictures that represent one’s set of personal
values. Variables are ordinal, using a five-point Likert scale, apart from the question
regarding whether one believes or not in prophecies, magic, etc., and the two last
questions with the pictures, where we use a binary yes-no (one-zero) scale,
with zero standing for a non-selected picture and one for a selected picture.
Data analysis was implemented with the use of M.A.D software (Méthodes
d’Analyse des Données), developed by Professor Dimitris Karapistolis (more about
M.A.D software at www.pylimad.gr). Firstly, Hierarchical Cluster Analysis (HCA),
using the chi-square distance and Ward's linkage, assigns subjects to distinct groups
based on their response patterns. This first step produces a cluster membership
variable, assigning each subject to a group. In addition, the behavior typology of
each group is examined by checking the connection of each variable level to each
cluster using a two-proportion 𝑧 test (significance level set at 0.05) between respondents
belonging to cluster 𝑖 and those not belonging to cluster 𝑖, for each variable level. The
number of clusters is determined by using the empirical criterion of the change in the
ratio of between-cluster inertia to total inertia, when moving from a partition with 𝑟
clusters to a partition with 𝑟 − 1 clusters [8]. In the second step of the analysis, the
cluster membership variable is analyzed together with the existing variables using
286 G. Panagiotidou and T. Chadjipadelis
MCA on the Burt table [2]. All associations among the variable categories are represented
on a set of orthogonal axes, with the least possible loss of the information contained
in the original Burt table. Next, we apply HCA to the coordinates of the variable
categories on the total number of dimensions of the reduced space resulting from the
MCA. In this way we cluster the variable categories, as we previously clustered the subjects.
By clustering the variable response categories, we detect the various discourses of
behavior, where each cluster of categories stands as a behavioral profile linked with
a set of responses and characteristics. To produce the final output, the semantic map,
we created a table including the output variables of the questionnaire, together with
demographics and variables for political behavior. Applying the same two-step procedure
(HCA and MCA) to this final table, the semantic map is constructed, positioning
the variable categories on a biplot created by the first two dimensions of the MCA.
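The analysis itself was run in M.A.D; purely as an illustration, a rough R analogue of the two-step scheme (our own sketch: the data frame survey and the cluster counts are placeholders, FactoMineR stands in for M.A.D, and clustering subjects on their MCA coordinates approximates the chi-square metric used in the paper) might look as follows.

```r
library(FactoMineR)  # MCA for categorical survey data

# `survey` is assumed to be a data.frame of factor variables (responses).
# Step 1: cluster the subjects on their response patterns.
mca_subj <- MCA(survey, graph = FALSE)
hc_subj  <- hclust(dist(mca_subj$ind$coord), method = "ward.D2")
survey$cluster <- factor(cutree(hc_subj, k = 7))  # cluster membership variable

# Step 2: MCA on the data augmented with the membership variable, then
# cluster the *category* coordinates to obtain behavioral discourses.
mca_all <- MCA(survey, graph = FALSE)
hc_cat  <- hclust(dist(mca_all$var$coord), method = "ward.D2")
discourses <- cutree(hc_cat, k = 3)

# Semantic map: variable categories on the first two MCA dimensions.
plot(mca_all$var$coord[, 1:2], type = "n")
text(mca_all$var$coord[, 1:2], labels = rownames(mca_all$var$coord),
     col = discourses)
```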
3 Results
In the first step of the analysis, we apply HCA for each set of variables in each
question. In the question: “How close do you feel during the service 1-To God, 2-To
the Virgin, 3-To Christ, 4-To some Saint, Angel, 5-To the other churchgoers, 6-To
Paradise, 7-To Hell, 8-To the divine service, 9-To the preaching priest”, we get four
clusters (Figure 1).
Fig. 1 Four clusters on how close the respondents feel during church service.
For the question: “How strongly do you feel after the end of the service 1-The Grace
of God in me, 2-Power of the soul, 3-Forgiveness for those who have hurt me, 4-
Forgiveness for my sins, 5-Peace, 6-Relief it is over”, we get six clusters (Figure
2).
Fig. 2 Six clusters on how the respondents feel at the end of church service.
We get five clusters (Figure 3) for the question: “Do you believe in 1-Bad (magic influ-
ence) 2-Magic? 3- Destiny? 4-Miracles? 5-Prophecies of the Saints? 6- Do you have
pictures of holy figures in your house? 7-in your workplace? 8-Do you have a family
Saint?”.
Fig. 3 Five clusters on the beliefs of the respondents on various aspects of the Christian faith.
Six clusters are detected (Figure 4) for the question: “How do you feel when you
come face to face with a religious image 1-Peace, 2-Awe, 3-The presence of God,
4-Emotion, 5-The need to pray, 6-Contact with the person in the picture”.
Fig. 4 Six clusters on how the respondents feel when facing a religious image.
We proceed with the clustering of the replies on political views and we get seven
clusters of political profiles (Figure 5).
Fig. 5 Seven clusters according to the political views- profile of the respondents.
For the symbolic representation of the democratic self, when choosing three
pictures that represent democracy for the respondent, we find eight clusters (Fig-
ure 6), and eight clusters for the symbolic representation of the moral self for the
respondents, as shown in Figure 7.
Fig. 7 Eight clusters on the different sets of moral values of the respondents.
In the second step of the analysis, we jointly process the cluster membership
variables. MCA produces the coefficients of each variable category which are now
positioned in a two-dimensional map as seen in Figure 9. HCA is then applied again
to the coefficients of the items, which bring forward three main clusters, modeling
political and religious behavior. In Figure 8, Cluster 77 is connected to centre and
moderate religious behaviour, cluster 78 reflects the voters of the right, with strong
religious habits and beliefs, individualistic attitudes and more authoritarian and
nationalistic political views, whereas cluster 79 represents the leftists, non-religious
voters, closer to revolutionary political views and collective goods. Examining the
antagonisms on the behavioral map (Figure 9), the first (horizontal) axis, which explains
22.8% of the total inertia, is created by the antithesis between right political ideology
with strong religious behavior and left political ideology with no religious behavior (cluster
78 opposite to cluster 79). The second (vertical) axis accounts for 7% of the inertia,
and is explained as the opposition between the center (moderate religious behavior)
and the left and right (cluster 77 opposite to both clusters 78 and 79).
Fig. 8 Three main behavioral discourses linking all variable categories together.
Fig. 9 The semantic map visualizing the behavioral profiles of voters, and the inner antagonisms.
4 Discussion
The analysis uncovers the strong relationship between religious habits and
political views in the Greek case. The semantic map indicates two main antagonistic
cultural discourses, encompassing religious, political and moral characteristics. The
first discourse (cluster 77) is described by moderately religious practice and beliefs,
connected to the ideological center. These voters have political attitudes that belong
to the space between the center-left and the center-right. They understand democracy
in connection with money, direct democracy and electronic democracy. Their moral
set of values is naturalistic and individualistic. The next behavioral discourse (cluster
78) describes the voters of right ideology, with strong religious beliefs and frequent
religious practice. They appear as very ethical and believe in the concept of pain
and righteousness. Regarding their political attitudes, these more religious voters
are against violence and hold more authoritarian and nationalistic positions. They view
democracy as parliamentary and representative, linked to ancient Greece but also to the
church, while their moral set of values appears clearly naturalistic, Christian and nationalistic.
Cluster 79 reflects the exact opposite discourse compared to 78. These voters
belong to the left ideology and are non-religious. They do not adopt the ideas of
the ethical person, or the scheme of pain and righteousness as mentioned in the
Christian dogma. In terms of political attitudes, they are pro-welfare state. These
non-religious and left voters understand democracy as direct, with the need for
revolution, protest and riot, and they support collective goods. Interpreting further the
antagonisms as visualized on the semantic map, the main competition exists between
the “right political ideology, strong religious behavior, individualism” discourse
and the “left political ideology, no religious behavior, collectivism” discourse. A
secondary opposition is found between the “center ideology, moderate religious
behavior” discourse and the extreme positions of the left and the right.
References
1. Butler, D., Stokes, D.: Political Change in Britain. Macmillan, London (1969)
2. Greenacre, M.: Correspondence Analysis in Practice. Chapman and Hall/CRC Press, Boca
Raton (2007)
3. Lazarsfeld, P. F., Berelson, B., Gaudet, H.: The People’s Choice. Columbia University Press
(1944)
4. Marangudakis, M., Chadjipadelis, T.: The Greek Crisis and its Cultural Origins. Palgrave-
Macmillan, New York (2019)
5. Mayer, N.: Les Modèles Explicatifs du Vote. L’Harmatan, Paris (1997)
6. Michelat, G., Simon, M.: Classe, Religion et Comportement Politique. PFNSP-Editions So-
ciales, Paris (1977)
7. Panagiotidou, G., Chadjipadelis, T.: First-time voters in Greece: views and attitudes of youth on
Europe and democracy. In T. Chadjipadelis, B. Lausen, A. Markos, T. R. Lee, A. Montanari and
R. Nugent (Eds.), Data Analysis and Rationality in a Complex World, Studies in Classification,
Data Analysis and Knowledge Organization, pp. 415-429. Springer (2020)
8. Papadimitriou, G., Florou, G.: Contribution of the Euclidean and chi-square metrics to de-
termining the most ideal clustering in ascending hierarchy (in Greek). In Annals in Honor of
Professor I. Liakis, 546-581. University of Macedonia, Thessaloniki (1996)
9. Rose, R.: Electoral Behavior: a Comparative Handbook. Free Press, New York (1974)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Supervised Classification via Neural Networks
for Replicated Point Patterns
Kateřina Pawlasová ( )
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]
Iva Karafiátová
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]
Jiří Dvořák
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech
Republic, e-mail: [email protected]
1 Introduction
Spatial point processes have recently received increasing attention in a broad range
of scientific disciplines, including biology, statistical physics and material science
[9]. They are used to model the locations of objects or events randomly occurring
in R𝑑 , 𝑑 ≥ 2. We distinguish between the stochastic model (point process) and its
realization observed in a bounded observation window (point pattern).
Typically, analyzing spatial point pattern data means working with just one pat-
tern, which comes from a specific physical measurement. In this paper, we take
another perspective: we suppose that a collection of patterns, which are independent
realizations of some underlying stochastic models, is to be analyzed simultaneously.
These independent realizations are then referred to as replicated point patterns.
Recently, this type of data has become more frequent, encouraging the adaptation
of methods such as supervised classification to the point pattern setting.
Since we are talking about supervised classification, our task is to predict the la-
bel variable (indicating class membership) for a newly observed point pattern, using
the knowledge about a sample collection of patterns with known labels (training
data). In the literature, this problem has been studied to a limited extent. Properties
of a classifier constructed specifically for the situation where the observed patterns
were generated by inhomogeneous Poisson point processes with different intensity
functions are discussed in [5]. However, this method is based on the special proper-
ties of the Poisson point process, and its use is thus limited to a small class of models.
On the other hand, no assumptions about the underlying stochastic models are made
in [12], where the task for replicated point patterns is transformed, with the help
of multidimensional scaling [16], to the classification task in R2 . In [10, 11], the ker-
nel regression classifier for functional data [4] is adapted for replicated point patterns.
Instead of classifying the patterns themselves, a selected functional characteristic
(e.g. the pair correlation function) is estimated for each pattern. These estimated
values are considered functional observations, and the classification is performed
in the context of functional data. The idea of linking point patterns to functional
data also appears in [12] – the dissimilarity matrix needed for the multidimensional
scaling is based on the same type of dissimilarity measure that is used for the ker-
nel regression classifier in [10, 11]. Finally, [17] briefly discusses the model-based
supervised classification. Unsupervised classification is explored in [2].
In this paper, our goal is to discuss the use of classifiers based on artificial neu-
ral networks in the context of replicated point patterns. We pay special attention
to the procedure described in [14], where both functional and scalar observations
enter the input layer. Hence, similarly to [10, 11], each pattern can be represented
by estimated values of a selected functional characteristic and the classification is per-
formed in the context of functional data. The resulting decision about class member-
ship is based on the spatial properties of the observed patterns that can be described
by the selected characteristic. Therefore, with a thoughtfully chosen characteristic,
this method has great potential within a wide range of possible classification scenar-
ios. Moreover, it can be used without assuming stationarity of the underlying point
processes, and it can be easily extended to more complicated settings (e.g., point
patterns in non-Euclidean spaces or realizations of random sets).
We present a short simulation experiment that illustrates the behaviour of the neu-
ral network described in [14]. Binary classification is performed on realizations
of two different point process models – the Thomas process (model for attractive
interactions among pairs of points) and the Poisson point process (benchmark model
for no interactions among points). This approach is then compared to the classifica-
tion based on convolutional neural networks (CNNs) [8], where each pattern enters
the network as a binary image. Finally, both methods based on artificial neural net-
works are compared to the kernel regression classifier studied in [10, 11] which can
be considered a benchmark in the context of replicated point patterns.
This paper is organized as follows. Section 2 provides a brief theoretical back-
ground on spatial point processes and their functional characteristics, including
the definition of the pair correlation function, which plays a crucial role in the se-
quel. Section 3 summarizes the methodology introduced in [14] about neural network
models with general input space. Section 4 is devoted to a short simulation example.
2 Spatial Point Processes
This section presents the necessary definitions from point process theory. Our ex-
position closely follows the book [13]. For detailed explanation of the theoretical
foundations, see, e.g., [7]. Throughout the paper, a simple point process 𝑋 is defined
as a random locally finite subset of R𝑑 , 𝑑 ≥ 2, where each point 𝑥 ∈ 𝑋 corresponds
to a specific object or event occurring at the location 𝑥 ∈ R𝑑 . In applications, 𝑋 can
be used as a mathematical tool to model random locations of cell nuclei in a tissue
(with 𝑑 = 2) or centers of undesirable air bubbles in industrial materials (𝑑 = 3).
We distinguish between the mathematical model 𝑋, which is called a point process,
and its observed realization X, which is called a point pattern. Examples of four
different point patterns are given in Figure 1.
Before properly defining the pair correlation function, a functional characteristic
that plays a key role in the sequel, we need to introduce some moment properties of 𝑋.
The intensity function 𝜆(·) is a non-negative measurable function on R𝑑 such that
𝜆(𝑥) d𝑥 corresponds to the probability of observing a point of 𝑋 in a neighborhood
of 𝑥 with an infinitesimally small area d𝑥. If 𝑋 is stationary (its distribution is
translation invariant in R𝑑 ), then 𝜆(·) = 𝜆 is a constant function and the constant 𝜆 is
called the intensity of 𝑋. In this case, 𝜆 is interpreted as the expected number of points
of 𝑋 that occur in a set with unit 𝑑-dimensional volume. Similarly, the second-order
product density 𝜆 (2) (· , ·) is a non-negative measurable function on R𝑑 × R𝑑 such
that 𝜆 (2) (𝑥, 𝑦) d𝑥 d𝑦 corresponds to the probability of observing two points of 𝑋
that occur jointly at the neighborhoods of 𝑥 and 𝑦 with infinitesimally small areas
d𝑥 and d𝑦.
Assuming the existence of 𝜆 and 𝜆(2), the pair correlation function 𝑔(𝑥, 𝑦) is defined
as 𝜆(2)(𝑥, 𝑦)/(𝜆(𝑥)𝜆(𝑦)) for 𝜆(𝑥)𝜆(𝑦) > 0. If 𝜆(𝑥) = 0 or 𝜆(𝑦) = 0, we set 𝑔(𝑥, 𝑦) = 0.
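For concreteness, the two models used later and the estimation of 𝑔 can be reproduced with spatstat (a sketch of ours, not from the paper; the values of kappa and mu are chosen so that kappa × mu matches the intensity 400 used in the simulation study below):

```r
library(spatstat)  # point pattern simulation and summary statistics

set.seed(1)
win <- square(1)  # unit square observation window

# Poisson process with intensity 400 (complete spatial randomness).
pp_pois <- rpoispp(lambda = 400, win = win)

# Thomas process: parent intensity kappa, cluster s.d. (scale) sigma = 0.1,
# mu offspring per parent on average; kappa * mu = 400 matches the intensity.
pp_thom <- rThomas(kappa = 40, scale = 0.1, mu = 10, win = win)

# Nonparametric estimates of the pair correlation function g(r);
# for the Poisson pattern the estimate should fluctuate around 1,
# while attraction pushes it above 1 at short range.
g_pois <- pcf(pp_pois, divisor = "d")
g_thom <- pcf(pp_thom, divisor = "d")

plot(g_thom, main = "Estimated pair correlation functions")
plot(g_pois, add = TRUE, col = "grey")
```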
3 Neural Networks with General Input Space
This section prepares the theoretical background for the supervised classification
of replicated point patterns via artificial neural networks. The recent approach
of [14, 15] is the cornerstone of our proposed classifier, and hence we focus on
its description in the following paragraphs. On the other hand, the approach based
on CNNs is more established in the literature. We use it primarily for comparison
and thus we refer the reader to [8] for a detailed description.
Following the setup in [14], let us assume that we want to build a neural network
such that it takes 𝐾 ∈ N functional variables and 𝐽 ∈ N scalar variables as input.
In detail, suppose that we have $f_k : \tau_k \to \mathbb{R}$, $k = 1, 2, \ldots, K$ ($\tau_k$ are possibly different intervals in $\mathbb{R}$), and $z_j^{(1)} \in \mathbb{R}$, $j = 1, 2, \ldots, J$. Furthermore, suppose that the first layer of the network contains $n_1 \in \mathbb{N}$ neurons. We then want the $i$-th neuron of the first layer to transfer the value
$$
z_i^{(2)} = g\left( \sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t)\, f_k(t) \,\mathrm{d}t + \sum_{j=1}^{J} w_{ij}^{(1)} z_j^{(1)} + b_i^{(1)} \right), \qquad i = 1, 2, \ldots, n_1.
$$
Assuming that each functional weight $\beta_{ik}(t)$ can be written as a linear combination of $m_k$ basis functions $\phi_1, \ldots, \phi_{m_k}$, the first term in the argument of $g$ can be expressed as $\sum_{k=1}^{K} \sum_{l=1}^{m_k} c_{ilk} \int_{\tau_k} \phi_l(t) f_k(t) \,\mathrm{d}t$, where the integrals $\int_{\tau_k} \phi_l(t) f_k(t) \,\mathrm{d}t$ can be calculated a priori and the coefficients of the linear combination of the basis functions $\{c_{ilk}\}$ act as scalar weights of the first layer and are learned by the network. The scalar values $z_i^{(2)}$, $i = 1, \ldots, n_1$, then propagate through the next fully connected layers as usual. An in-depth analysis of the computational point of view is provided in [14]. In the software R, neural networks with general input space are covered by the package FuncNN [15], built on top of the packages keras [6] and tensorflow [1]. The last two packages are used to handle CNNs.
Fig. 1 Theoretical values of the pair correlation function 𝑔 for the Poisson point process and the Thomas process with different values of the model parameter 𝜎. For these models, 𝑔 is translation invariant and isotropic. A single realization of the Poisson point process and of the Thomas process with parameter 𝜎 set to 0.1, 0.05 and 0.02, respectively, is shown in the right part of the figure.
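To illustrate how the precomputed integrals can be obtained in practice (a minimal sketch of ours, not taken from [14] or FuncNN; the grid r, the estimates g_hat and the rectangle-rule integration are assumptions):

```r
# Approximate integrals of basis functions against an estimated pcf.
# `g_hat` is assumed to be a numeric vector of pcf estimates on a grid `r`
# over tau = (0, 0.25); m Fourier basis functions are used (m = 29 below
# matches the choice reported for Rule 1 in the next section).
fourier_basis <- function(r, m, a = 0, b = 0.25) {
  # Columns: constant, then sin/cos pairs on [a, b], up to m functions.
  t <- (r - a) / (b - a)
  B <- matrix(1, nrow = length(r), ncol = m)
  for (l in seq_len((m - 1) %/% 2)) {
    B[, 2 * l] <- sin(2 * pi * l * t)
    if (2 * l + 1 <= m) B[, 2 * l + 1] <- cos(2 * pi * l * t)
  }
  B
}

basis_integrals <- function(r, g_hat, m = 29) {
  B  <- fourier_basis(r, m)
  dr <- c(diff(r), 0)                # crude rectangle-rule weights
  as.numeric(t(B) %*% (g_hat * dr))  # one scalar feature per basis function
}
```

The resulting m scalar features per pattern can then be fed to an ordinary dense network, as sketched after the description of Rule 1 below.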
4 Simulation Example
This section presents a simple simulation experiment in which we illustrate the per-
formance of the classification rule based on the neural network with general input
space. Binary classification is considered, where the group membership indicates
whether a point pattern was generated by a stationary Poisson point process or a sta-
tionary Thomas process, the latter exhibiting attractive interactions among pairs
of points [13]. The sample realizations can be seen in Figure 1.
We consider the Thomas process as a model with one parameter 𝜎. Small values
of 𝜎 indicate strong, attractive short-range interactions between points, while larger
values of 𝜎 result in looser clusters of points. Attractive interactions between the
points of a Thomas process result in the values of the pair correlation function being
greater than the constant 1, which corresponds to the Poisson case. The effect of
𝜎 on the shape of the theoretical pair correlation function of the Thomas process
(which is translation invariant and isotropic) is illustrated in Figure 1.
Since the model parameter 𝜎 affects the strength and range of attractive interac-
tions between points of the Thomas process, the complexity of the binary classifica-
tion task described above increases with increasing values of 𝜎 [10, 11]. Therefore,
this experiment focuses on the situation where 𝜎 is set to 0.1, and all realizations
are observed on the unit square [0, 1] 2 . We fix the intensity of the two models to 400
(in spatial statistics, patterns with several hundred points are standard nowadays).
In this framework, we expect the classification task to be challenging enough to ob-
serve differences in the performance of the considered classifiers. On the other hand,
it is still reasonable to distinguish (w.r.t. the chosen observation window) the realiza-
tions of the model with attractive interactions from the realizations corresponding
to complete spatial randomness.
Two different collections of labelled point patterns are considered as training sets.
The first, referred to as Training data 1, is composed of 1 000 patterns per group.
The second, called Training data 2, is then composed of 100 patterns per group.
The test and validation sets have the same size and composition as the Training
data 2. Table 1 presents the accuracy of three classification rules (described below)
with respect to the test set. For the first two rules, the accuracy is in fact averaged
over five runs corresponding to different settings of initial weights in the underlying
neural network. Concerning the network architecture, we fix the ReLU function to be
the activation function for all layers, except the output one. The output layer consists
of one neuron with sigmoid activation function. The loss function is the binary
cross-entropy. A detailed description of the individual layers is given below.
Rule 1 is based on the neural network with general input space. We set 𝐾 and
𝐽 from Sect. 3 to be 1 and 0, respectively, and 𝜏1 = (0, 0.25). The value 0.25 is
related to the observation window of the point patterns at hand being [0, 1] 2 . Then,
𝑓1 is the vector of the estimated values of the pair correlation function 𝑔 (estimated
by the function pcf.ppp from the package spatstat [3] with default settings but
the option divisor set to d), considered as a functional observation. Furthermore,
we set 𝑚 1 = 29, and consider the Fourier basis. The data preparation (estimation of 𝑔,
computation of integrals from Sect. 3) takes 740 s of elapsed time (w.r.t. the Training
data 1, on a standard personal computer). To tune the hyperparameters of the final
neural network (number of hidden layers, number of neurons per hidden layer,
dropout, etc.), we performed a rough grid search (models with various combinations
of the hyperparameters were trained on Training data 1 and we used the loss function
and the accuracy computed on the validation set to compare the performances).
The resulting network consists of one hidden layer with 128 neurons followed by
a dropout layer with a rate of 0.3. We use the Adam optimizer, and the learning rate
decays exponentially, with initial value 0.001 and decay parameter 0.05. In total,
the network has 3 969 trainable parameters. To train the network, we perform 50
epochs with an average elapsed time of 200 ms per epoch (w.r.t. Training data 1).
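A minimal keras sketch consistent with the Rule 1 description (ours; x_train and y_train are assumed to hold the 29 basis integrals per pattern and the 0/1 labels, the separate validation set is replaced by validation_split, and the exponential learning-rate decay is omitted for brevity):

```r
library(keras)

# `x_train`: matrix of precomputed basis integrals (one row per pattern),
# `y_train`: 0/1 labels (Poisson vs. Thomas). Both are assumed to exist.
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history <- model %>% fit(x_train, y_train, epochs = 50,
                         validation_split = 0.1)
```

With 29 inputs, this architecture has (29 × 128 + 128) + (128 + 1) = 3 969 trainable parameters, matching the count reported above.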
Rule 2 uses CNNs. Similarly to the previous case, our decision about the network
architecture is based on a rough grid search. The final network has two convolutional
layers, each of them with 8 filters, a squared kernel matrix with 36 (first layer) or 16
rows (second layer), and a following average pooling layer with the pool size fixed
at 2 × 2. We add a dropout layer after the pooling, with a rate of 0.3 (after the first
pooling) and 0.2 (after the second pooling). The batch size is set to 32. We use
the Adam optimizer, and the learning rate decays exponentially, with initial
value 0.001 and decay parameter 0.1. The total number of trainable parameters is
32 785, and we perform 50 epochs with an average elapsed time per epoch
(w.r.t. Training data 1) of 930 s. Data preparation (converting point patterns
to binary images) takes less than 10 s of elapsed time (w.r.t. Training data 1).
Table 1 Accuracy for the three presented classification rules w.r.t. the testing set. For Rule 1
and Rule 2, the accuracy is averaged over five runs corresponding to five different choices of initial
weights in the underlying neural networks. In addition, the standard deviation computed from
the five accuracy values is reported. Values close to 1 indicate a nearly perfect classification.
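A keras sketch in the spirit of the Rule 2 description (ours; the image resolution is not stated in the text, so the 128 × 128 input is an assumption, and the sketch therefore does not reproduce the reported total of 32 785 trainable parameters):

```r
library(keras)

# `x_img`: array of binary images, dim = c(n_patterns, 128, 128, 1) (assumed).
cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 8, kernel_size = c(36, 36), activation = "relu",
                input_shape = c(128, 128, 1)) %>%
  layer_average_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_conv_2d(filters = 8, kernel_size = c(16, 16), activation = "relu") %>%
  layer_average_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

cnn %>% compile(optimizer = optimizer_adam(learning_rate = 0.001),
                loss = "binary_crossentropy", metrics = "accuracy")
```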
Rule 3 is the kernel regression classifier studied in [10, 11]. We use the Epanech-
nikov kernel together with an automatic procedure for the selection of the smoothing
parameter. The underlying dissimilarity measure for point patterns is constructed
as the integrated squared difference of the corresponding estimates of the pair cor-
relation function 𝑔; for more details, see [10]. The elapsed time needed to compute
the upper triangle of the dissimilarity matrix (containing dissimilarities between
every pair of patterns from Training data 1) is equal to 390 s. To predict the class
membership for the testing set (w.r.t. Training data 1), 206 s elapsed. During the clas-
sification procedure, no random initialization of any weights is needed. Thus, there
is no reason to average the accuracy in Table 1 over multiple runs.
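A bare-bones version of Rule 3 (our sketch; the automatic bandwidth selection of [10] is replaced by a fixed smoothing parameter h, and a common grid r with pcf estimates g1, g2 is assumed):

```r
# Dissimilarity: integrated squared difference of two pcf estimates on a
# common grid r (rectangle rule), as used for the kernel classifier in [10].
pcf_dissim <- function(r, g1, g2) {
  sum((g1 - g2)^2 * c(diff(r), 0))
}

epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Nadaraya-Watson-type rule: `d_train` holds dissimilarities between the new
# pattern and each training pattern, `y_train` the 0/1 training labels.
kernel_classify <- function(d_train, y_train, h) {
  w <- epanechnikov(d_train / h)
  if (sum(w) == 0) return(NA)        # no training pattern within bandwidth
  p1 <- sum(w * y_train) / sum(w)    # estimated probability of class 1
  as.integer(p1 >= 0.5)
}
```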
For Training data 1, Table 1 shows that the highest accuracy was achieved for the
neural network with general input space. The standard deviation of the five different
accuracy values is significantly higher for the CNN, which has almost ten times more
trainable parameters than the network with general input space. For Training data
2, the kernel regression method achieved the highest accuracy. In this situation,
the performance of the classifier is stable even in the case of small training data.
For the first two rules, the neural network models chosen with the help of the grid
search (where the networks were trained w.r.t. the bigger training set) are now
trained w.r.t. the smaller training set. The resulting accuracy is still around 0.90
for the network with general input space, but it drops to 0.5 (random assignment
of labels) for CNN. The size of Training data 2 seems to be too small to successfully
optimize the large amount of trainable parameters of the convolutional network.
To conclude, our simulation example suggests that the classifier based on the CNN
(using information about the precise configuration of points) is, in the presented
situation, outperformed by the classifiers based on the estimated values of the pair
correlation function (using information about the interactions between pairs of points).
The high number of trainable parameters of the CNN makes its use rather demanding
with respect to computational time. The approach based on neural networks with
general input space proved to be competitive with, or even to outperform, the current
benchmark method (kernel regression classifier), especially for large datasets. Also,
it has the lowest demands regarding computational time. In the case of a small
dataset, the low number of hyperparameters speaks in favor of kernel regression.
Finally, in the simple classification scenario that we have presented, the choice
of the pair correlation function was adequate. In practical applications, a problem-
specific characteristic should be constructed to achieve satisfactory performance.
Acknowledgements The work of Kateřina Pawlasová and Iva Karafiátová has been supported by
the Grant schemes at Charles University, project no. CZ.02.2.69/0.0/0.0/19 073/0016935. The work
of Jiří Dvořák has been supported by the Czech Grant Agency, project no. 19-04412S.
References
1. Allaire, J. J., Eddelbuettel, D., Golding, N., Tang, Y.: tensorflow: R Interface to TensorFlow
(2016) Available via GitHub. https://github.com/rstudio/tensorflow. Cited 10 Jan 2022
2. Ayala, G., Epifanio, I., Simo, A., Zapater, V.: Clustering of spatial point patterns. Comput.
Stat. Data. Anal. 50, 1016–1032 (2006)
3. Baddeley, A., Rubak, E., Turner, R.: Spatial Point Patterns: Methodology and Applications
with R. Chapman & Hall/CRC Press, Boca Raton (2015)
4. Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis. Theory and Practice.
Springer-Verlag, New York (2006)
5. Cholaquidis, A., Forzani, L., Llop, P., Moreno, L.: On the classification problem for Poisson
point processes. J. Multivar. Anal. 153, 1–15 (2017)
6. Chollet, F., Allaire, J. J. and others: R Interface to Keras (2017) Available via GitHub.
https://github.com/rstudio/keras. Cited 10 Jan 2022
7. Daley, D., Vere-Jones, D.: An Introduction to the Theory of Point Processes. Vol II., 2nd edn.
Springer-Verlag, New York (2008)
8. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
9. Illian, J., Penttinen, A., Stoyan, H., Stoyan, D.: Statistical Analysis and Modelling of Spatial
Point Patterns. Wiley, Chichester (2004)
10. Koňasová, K., Dvořák, J.: Techniques from functional data analysis adaptable for spatial
point patterns (2021) In: Proceedings of the 22nd European Young Statisticians Meeting.
https://www.eysm2021.panteion.gr/publications.html. Cited 10 Jan 2022
11. Koňasová, K., Dvořák, J.: Supervised nonparametric classification in the context of replicated
point patterns. Submitted (2021)
12. Mateu, J., Schoenberg, F. P., Diez, D. M., González, J. A., Lu, W.: On measures of dissimilarity
between point patterns: classification based on prototypes and multidimensional scaling.
Biom. J. 57, 340–358 (2015)
13. Møller, J., Waagepetersen, R.: Statistical Inference and Simulation for Spatial Point Processes.
Chapman & Hall/CRC, Boca Raton (2004)
14. Thind, B., Multani, K., Cao, J.: Deep Learning with Functional Inputs (2020) Available via
arxiv. https://arxiv.org/pdf/2006.09590.pdf. Cited 10 Jan 2022
15. Thind, B., Wu, S., Groenewald, R., Cao, J.: FuncNN: An R Package to Fit Deep Neural
Networks Using Generalized Input Spaces (2020) Available via arxiv.
https://arxiv.org/pdf/2009.09111.pdf. Cited 10 Jan 2022
16. Torgerson, W.: Multidimensional Scaling: I. Theory and Method. Psychometrika. 17, 401–419
(1952)
17. Vo, B. N., Dam, N., Phung, D., Tran, Q. N., Vo, B. T.: Model-based learning for point pattern
data. Pattern Recognit. 84, 136–151 (2018)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Parsimonious Mixtures of Seemingly Unrelated
Contaminated Normal Regression Models
Abstract In recent years, the research into linear multivariate regression based on
finite mixture models has been intense. With such an approach, it is possible to
perform regression analysis for a multivariate response by taking account of the
possible presence of several unknown latent homogeneous groups, each of which is
characterised by a different linear regression model. For a continuous multivariate
response, mixtures of normal regression models are usually employed. However, in
real data, it is not unusual to observe mildly atypical observations that can negatively
affect the estimation of the regression parameters under a normal distribution in
each mixture component. Furthermore, in some fields of research, a multivariate
regression model with a different vector of covariates for each response should be
specified, based on some prior information to be conveyed in the analysis. To take
account of all these aspects, mixtures of contaminated seemingly unrelated normal
regression models have been recently developed. A further extension of such an
approach is presented here so as to ensure parsimony, which is obtained by imposing
constraints on the group-covariance matrices of the responses. A description of the
resulting parsimonious mixtures of seemingly unrelated contaminated regression
models is provided together with the results of a numerical study based on the
analysis of a real dataset, which illustrates their practical usefulness.
Gabriele Perrone ( )
Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna,
Italy, e-mail: [email protected]
Gabriele Soffritti
Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna,
Italy, e-mail: [email protected]
1 Introduction
The mixture of $K$ seemingly unrelated normal regression models for an $M$-dimensional response $\mathbf{Y}$ considered in this chapter can be written as

$$\mathbf{Y} = \begin{cases} \tilde{\mathbf{X}}^{*\prime}\boldsymbol{\beta}^*_1 + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N_M(\mathbf{0}_M, \boldsymbol{\Sigma}_1) \text{ with probability } \pi_1,\\ \quad\vdots \\ \tilde{\mathbf{X}}^{*\prime}\boldsymbol{\beta}^*_K + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N_M(\mathbf{0}_M, \boldsymbol{\Sigma}_K) \text{ with probability } \pi_K, \end{cases} \tag{1}$$

where $\pi_k$ is the prior probability of the $k$th latent sub-population, with $\pi_k > 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K} \pi_k = 1$; $\tilde{\mathbf{X}}^*$ is the following $(P + M) \times M$ partitioned matrix:

$$\tilde{\mathbf{X}}^* = \begin{pmatrix} \mathbf{X}^*_1 & \mathbf{0}_{P_1+1} & \cdots & \mathbf{0}_{P_1+1}\\ \mathbf{0}_{P_2+1} & \mathbf{X}^*_2 & \cdots & \mathbf{0}_{P_2+1}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0}_{P_M+1} & \mathbf{0}_{P_M+1} & \cdots & \mathbf{X}^*_M \end{pmatrix},$$
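As an illustration of model (1), the following minimal Python sketch simulates draws from a two-component mixture of seemingly unrelated normal regressions, where each response has its own covariate vector. All dimensions and parameter values here are hypothetical, chosen only for the example; they are not the estimates discussed later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: M = 2 responses, regressor vectors of length
# P_1 + 1 = 3 and P_2 + 1 = 2 (intercepts included), K = 2 components.
beta = [  # beta[k][m] = coefficients of response m in component k
    [np.array([1.0, 0.5, -2.0]), np.array([0.3, 1.2])],
    [np.array([-1.0, 0.2, 0.8]), np.array([2.0, -0.5])],
]
Sigma = [np.array([[1.0, 0.3], [0.3, 1.0]]),
         np.array([[0.5, -0.1], [-0.1, 0.5]])]
pi = np.array([0.4, 0.6])

def draw(x1, x2):
    """Draw one observation Y from the mixture, given the two
    response-specific covariate vectors x1 (len 3) and x2 (len 2)."""
    k = rng.choice(2, p=pi)                       # latent component
    mean = np.array([x1 @ beta[k][0], x2 @ beta[k][1]])
    return rng.multivariate_normal(mean, Sigma[k])

y = draw(np.array([1.0, 0.2, -0.4]), np.array([1.0, 0.7]))
print(y)
```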
Table 1 Features of the parameterisations for the covariance matrices 𝚺𝑘, 𝑘 = 1, . . . , 𝐾 (𝐾 > 1).

Acronym  Covariance structure  Volume    Shape      Orientation   CM step    𝑛𝝈
EEE      𝜆DAD′                 Equal     Equal      Equal         Closed     𝑛𝚺
VVV      𝜆𝑘D𝑘A𝑘D′𝑘             Variable  Variable   Variable      Closed     𝐾𝑛𝚺
EII      𝜆I                    Equal     Spherical  −             Closed     1
VII      𝜆𝑘I                   Variable  Spherical  −             Closed     𝐾
EEI      𝜆A                    Equal     Equal      Axis-aligned  Closed     𝑀
VEI      𝜆𝑘A                   Variable  Equal      Axis-aligned  Iterative  𝑀 + 𝐾 − 1
EVI      𝜆A𝑘                   Equal     Variable   Axis-aligned  Closed     𝑀𝐾 − (𝐾 − 1)
VVI      𝜆𝑘A𝑘                  Variable  Variable   Axis-aligned  Closed     𝑀𝐾
EEV      𝜆D𝑘AD′𝑘               Equal     Equal      Variable      Iterative  𝐾𝑛𝚺 − (𝐾 − 1)𝑀
VEV      𝜆𝑘D𝑘AD′𝑘              Variable  Equal      Variable      Iterative  𝐾𝑛𝚺 − (𝐾 − 1)(𝑀 − 1)
EVE      𝜆DA𝑘D′                Equal     Variable   Equal         Iterative  𝑛𝚺 + (𝐾 − 1)(𝑀 − 1)
VVE      𝜆𝑘DA𝑘D′               Variable  Variable   Equal         Iterative  𝑛𝚺 + (𝐾 − 1)𝑀
VEE      𝜆𝑘DAD′                Variable  Equal      Equal         Iterative  𝑛𝚺 + 𝐾 − 1
EVV      𝜆D𝑘A𝑘D′𝑘              Equal     Variable   Variable      Iterative  𝐾𝑛𝚺 − (𝐾 − 1)
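To make the counts in the last column of Table 1 concrete, here is a small Python helper, restricted to a few of the parameterisations. It assumes (our reading of the table's notation) that 𝑛𝚺 = 𝑀(𝑀+1)/2, the number of free parameters of one unconstrained covariance matrix; this is an illustrative sketch, not code from the chapter.

```python
def n_sigma(acronym: str, M: int, K: int) -> int:
    """Number of free covariance parameters for some of the
    eigen-decomposition parameterisations in Table 1."""
    n_S = M * (M + 1) // 2  # parameters of one unconstrained Sigma
    counts = {
        "EEE": n_S,                  # one common full matrix
        "VVV": K * n_S,              # K unconstrained matrices
        "EII": 1,                    # one spherical matrix
        "VII": K,                    # K spherical matrices
        "EEI": M,                    # one diagonal matrix
        "VVI": M * K,                # K diagonal matrices
        "EVI": M * K - (K - 1),      # shared volume, varying shapes
    }
    return counts[acronym]

# Example: M = 2 responses, K = 2 components.
for acr in ("EEE", "VVV", "EII", "VVI"):
    print(acr, n_sigma(acr, M=2, K=2))
```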
The models illustrated in Section 2 have been fitted to a dataset [5] containing the
volume of sales (Move), a measure of display activity (Nsale) and the log price
(Lprice) for seven of the top 10 U.S. brands in the canned tuna product category in
the 𝐼 = 338 weeks between September 1989 and May 1997. The goal of the analysis
is to study the dependence of canned tuna sales on prices and promotional activities
for two products: Star Kist 6 oz. (SK) and Bumble Bee Solid 6.12 oz. (BBS). To this
end, the following vectors have been considered: Y′ = (𝑌1 = Lmove SK, 𝑌2 = Lmove
BBS), X′ = (𝑋1 = Nsale SK, 𝑋2 = Lprice SK, 𝑋3 = Nsale BBS, 𝑋4 = Lprice
BBS), where Lmove denotes the logarithm of Move. The analysis has been carried
out using all the parameterisations of the MSUN, MN, MCSUN and MCN models
for each 𝐾 ∈ {1, 2, 3, 4, 5, 6}. Furthermore, MSUN and MCSUN models have been
fitted by considering all possible subvectors of X as vectors X𝑚, 𝑚 = 1, 2, for each
𝐾. In this way, best subset selections for Lmove SK and Lmove BBS have been
included in the analysis, both with and without contamination. The overall number of
fitted models is 37376, including the fully unconstrained models (i.e., with the VVV
parameterisation) previously employed in [11] to perform the same analysis.
Table 2 reports some information about the nine models which best fit the analysed
dataset according to the three model selection criteria over the six examined values
of 𝐾 within each model class. An analysis based on a single linear regression model
(𝐾 = 1), both with and without contamination, appears to be inadequate according to
all criteria. All the examined criteria indicate that the overall best model for studying
the effect of prices and promotional activities on sales of SK and BBS tuna is a
parsimonious mixture of two SU contaminated Gaussian linear regression models
with the EVE parameterisation for the covariance matrices, in which the log unit sales
of SK tuna are regressed on the log prices and the promotional activities of the same
brand, while the regressors selected for the BBS log unit sales are the log prices of
both brands and the promotional activities of BBS. Thus, the analysis suggests that
two sources of complexity affect the analysed dataset: unobserved heterogeneity over
time (𝐾 = 2 clusters of weeks have been detected) and the presence of mildly atypical
observations. Since the two estimated proportions of typical observations are quite
similar (see the values of $\hat{\alpha}_k$ in Table 3), contamination seems to characterise the
two clusters of weeks detected by the model in almost the same way. As far as the
strength of the contaminating effects on the conditional variances and covariances
of Y|X = x is concerned, it appears to be stronger in the first cluster, where the
estimated inflation parameter is larger ($\hat{\eta}_1 = 15.70$). Focusing on the
other estimates, some of the estimated regression coefficients, variances and
covariances also appear to be affected by heterogeneity over time. Sales of SK tuna
turn out to be negatively affected by prices and positively affected by promotional
activities of the same brand within both clusters detected by the model, but with
effects which are slightly stronger in the first cluster of weeks. A similar behaviour is
detected for the estimated regression equation for Lmove BBS, which also highlights
that Lmove BBS is positively affected by the log prices of SK tuna, especially in
the first cluster of weeks. Furthermore, typical weeks in the first cluster show values
of Lmove SK which are more homogeneous than those of Lmove BBS; the opposite
holds true for the typical weeks belonging to the second cluster. The correlation
between log sales of SK and BBS products also turns out to be affected by heterogeneity
over time: while in the largest cluster of weeks this correlation has been estimated
to be slightly positive (0.200), the first cluster is characterised by a mild estimated
negative correlation (−0.151). An interesting feature of this latter cluster is that 17
out of the 20 weeks which have been assigned to it are consecutive, from
week no. 58 to week no. 74, corresponding to the period from mid-October 1990
to mid-February 1991 characterised by a worldwide boycott campaign encouraging
consumers not to buy Bumble Bee tuna, because Bumble Bee was found to be buying
yellow-fin tuna caught by dolphin-unsafe techniques [1]. Such events could represent
one of the sources of the unobserved heterogeneity detected by the model. According
to the overall best model, some weeks have been detected to be mild outliers. In the
first cluster, this has happened for week no. 60 (immediately after Halloween 1990)
and week no. 73 (two weeks before Presidents Day 1991). The analysis
of the estimated sample residuals $\mathbf{y}_i - \hat{\boldsymbol{\mu}}_1(\mathbf{x}_i; \hat{\boldsymbol{\beta}}^*_1)$ for the 20 weeks belonging to the
first cluster (see the scatterplot on the left side of Figure 1) clearly shows that weeks
60 and 73 noticeably deviate from the other weeks. Among the 318 weeks of the
second cluster, 32 have turned out to be mild outliers, most of which are associated
with holidays and special events that took place between September 1989 and mid-
October 1990 or between mid-February and May 1997 (see the scatterplot on the
right side of Figure 1). These results are almost equal to those obtained using the
overall best fully unconstrained fitted model in the analysis presented in [11]. However,
the EVE parameterisation for the MSUCN model has made it possible to obtain a better
trade-off among fit, model complexity and the uncertainty of the estimated partition
of the weeks; furthermore, it has led to a slightly lower number of mild outliers in
the second cluster of weeks.
Table 3 Parameter estimates of the overall best model for the analysis of tuna sales.

ψ̂         𝑘 = 1                          𝑘 = 2
π̂_𝑘       0.062                          0.938
α̂_𝑘       0.810                          0.844
η̂_𝑘       15.70                          6.94
β̂′*_𝑘1    (8.87, 0.56, −4.70)            (8.64, 0.27, −3.09)
β̂′*_𝑘2    (15.04, 3.92, 2.83, −17.76)    (9.98, 0.25, 0.12, −3.83)
Σ̂_𝑘       [0.034 −0.009; −0.009 0.105]   [0.121 0.012; 0.012 0.030]
[Figure 1 appears here: two scatterplots of estimated residuals, one per cluster; visible axis label: "residuals for Lmove BBS". See caption below.]
Fig. 1 Scatterplots of the estimated residuals for the weeks assigned to the first (left) and second
(right) clusters detected by the overall best model. Points in the first scatterplot are labelled with
the numbers of the corresponding weeks. Black circles and red triangles in the second scatterplot
correspond to typical and outlying weeks, respectively.
4 Conclusions
A class of parsimonious mixtures of seemingly unrelated contaminated normal regression
models has been presented, which can be employed both in the presence of mild outliers
and of multivariate correlated dependent variables, each of which is regressed on a
different vector of covariates. Models from this class allow for simultaneous robust
clustering and detection of mild outliers in multivariate regression analysis. They
encompass several other types of Gaussian mixture-based linear regression models
previously proposed in the literature, such as the ones illustrated in [7, 8, 9],
providing a robust and flexible tool for modelling data in practical applications where
different regressors are considered to be relevant for the prediction of different
dependent variables. Previous research (see [9, 11]) demonstrated that BIC and ICL
could be effectively employed to select a proper value for 𝐾 in the presence of mildly
contaminated data. Thanks to the imposition of an eigen-decomposed structure on the
𝐾 variance-covariance matrices of Y|X = x, the presented models are characterised by
a reduced number of variance-covariance parameters to be included in the analysis,
thus improving the flexibility, usefulness and effectiveness of an approach to
multivariate linear regression analysis based on finite Gaussian mixture models in
real data applications.
References
1. Baird, I. G., Quastel, N.: Dolphin-safe tuna from California to Thailand: localisms in environ-
mental certification of global commodity networks. Ann. Assoc. Am. Geogr. 101, 337–355
(2011)
2. Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the
integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000)
3. Cadavez, V. A. P., Henningsen, A.: The use of seemingly unrelated regression (SUR) to
predict the carcass composition of lambs. Meat Sci. 92, 548–553 (2012)
4. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28,
781–793 (1995)
5. Chevalier, J. A., Kashyap, A. K., Rossi, P. E.: Why don’t prices rise during periods of peak
demand? Evidence from scanner data. Am. Econ. Rev. 93, 15–37 (2003)
6. Disegna, M., Osti, L.: Tourists’ expenditure behaviour: the influence of satisfaction and the
dependence of spending categories. Tour. Econ. 22, 5–30 (2016)
7. Galimberti, G., Soffritti, G.: Seemingly unrelated clusterwise linear regression. Adv. Data
Anal. Classif. 14, 235–260 (2020)
8. Jones, P. N., McLachlan, G. J.: Fitting finite mixture models in a regression context. Aust.
New Zeal. J. Stat. 34, 233–240 (1992)
9. Mazza, A., Punzo, A.: Mixtures of multivariate contaminated normal regression models. Stat.
Pap. 61, 787–822 (2020)
10. Meng, X. L., Rubin, D. B.: Maximum likelihood estimation via the ECM algorithm: A general
framework. Biometrika. 80, 267–278 (1993)
11. Perrone, G., Soffritti, G.: Seemingly unrelated clusterwise linear regression for contaminated
data. Under review (2021)
12. R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2022) http://www.R-project.org
13. Ritter, G.: Robust cluster analysis and variable selection. Chapman & Hall, Boca Raton (2015)
14. Srivastava, V. K., Giles, D. E. A.: Seemingly unrelated regression equations models. Marcel
Dekker, New York (1987)
15. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
16. White, E. N., Hewings, G. J. D.: Space-time employment modelling: some results using
seemingly unrelated regression estimators. J. Reg. Sci. 22, 283–302 (1982)
Penalized Model-based Functional Clustering: a
Regularization Approach via Shrinkage Methods
Abstract With the advance of modern technology, and with data being recorded
continuously, functional data analysis has gained a lot of popularity in recent years.
Working in a mixture model-based framework, we develop a flexible functional
clustering technique achieving dimensionality reduction through an 𝐿1 penalization.
The proposed procedure results in an integrated modelling approach where shrinkage
techniques are applied to enable sparse solutions in both the means and the covariance
matrices of the mixture components, while preserving the underlying clustering
structure. This leads to an entirely data-driven methodology suitable for simultaneous
dimensionality reduction and clustering. Preliminary experimental results, both on
simulated and real data, show that the proposed methodology is worth considering
within the framework of functional clustering.
Nicola Pronello ( )
Department of Neurosciences, Imaging and Clinical Sciences, University of Chieti-Pescara, Chieti,
Italy, e-mail: [email protected]
Rosaria Ignaccolo
Department of Economics and Statistics "Cognetti de Martiis", University of Torino, Torino, Italy,
e-mail: [email protected]
Luigi Ippoliti
Department of Economics, University of Chieti-Pescara, Pescara, Italy,
e-mail: [email protected]
Sara Fontanella
National Heart and Lung Institute, Imperial College London, London, United Kingdom,
e-mail: [email protected]
1 Introduction
In recent decades, technological innovations have produced data that are increasingly
complex, high dimensional, and structured. A large amount of these data can be
characterized as functions defined on some continuous domain, and their statistical
analysis has attracted the interest of many researchers. This surge of interest is
explained by the ubiquitous examples of functional data that can be found in different
application fields (see for example [2], and references therein, for specific examples).
With functions as the basic units of observation, the analysis of functional data
poses significant theoretical and practical challenges to statisticians. Despite these
difficulties, methodology for clustering functional data has advanced rapidly during
the past years; recent surveys of functional data clustering are presented in [7] and
[2]. Popular approaches have extended classical clustering concepts for vector-valued
multivariate data to functional data.
In this paper, we consider a finite mixture as a flexible model for clustering.
In particular, applying a functional model-based clustering algorithm with an 𝐿1-
penalty function on a set of projection coefficients, we extend the results of [8]
and [9] for vector-valued multivariate data to a functional data framework. This
approach appears particularly appealing in all cases in which the functions are
spatially heterogeneous, meaning that some parts of the function can be smoother
than other parts, or that there may be distant parts of the function that are correlated
with each other. Furthermore, the introduction of a shrinkage penalty makes it possible
to look for directions in the feature space (that is now the space of expansion/projection
coefficients) that are the most useful in separating the underlying groups, without
first applying dimensionality reduction techniques.
In Section 2 we first present the methodology, along with some details on model
estimation (Subsection 2.2). Then, in Section 3, we perform a validation study
with simulated and real data for which the classes are known a priori.
In practice, such curves/trajectories are available only at a discrete set of the domain
points {𝑡𝑖𝑠 : 𝑖 = 1, . . . , 𝑛, 𝑠 = 1, . . . , 𝑚𝑖} and the 𝑛 curves need to be reconstructed.
To this goal, it is common to assume that the curves belong to a finite-dimensional
space spanned by a basis of functions, so that given a basis 𝚽 = {𝜓1, . . . , 𝜓𝑝} each
curve 𝑥𝑖(𝑡) admits the following decomposition:

$$x_i(t) = \sum_{j=1}^{p} \beta_{j,i}\,\psi_j(t), \qquad i = 1, \ldots, n; \tag{2.1}$$

assuming observations contaminated with noise $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, the realizations of the random coefficients $\beta_{j,i}$, $j = 1, \ldots, p$,
describing each curve can be obtained via least squares as $\hat{\boldsymbol{\beta}}_i = (\boldsymbol{\Theta}_i' \boldsymbol{\Theta}_i)^{-1} \boldsymbol{\Theta}_i' \mathbf{X}_i^{obs}$,
where $\boldsymbol{\Theta}_i = (\psi_j(t_{is}))$, $1 \le j \le p$, $1 \le s \le m_i$, contains the basis functions evaluated
at the fixed domain points and $\mathbf{X}_i^{obs} = (x_i^{obs}(t_{i1}), \ldots, x_i^{obs}(t_{im_i}))'$ is the vector of
observed values of the 𝑖-th curve.
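A short Python sketch of the least-squares reconstruction (2.1), using a Fourier-type basis evaluated at the observation points; the basis choice and the grid are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_coefficients(t_obs, x_obs, p):
    """Least-squares basis coefficients for one observed curve.

    t_obs : (m,) observation points in [0, 1]
    x_obs : (m,) observed curve values
    p     : number of basis functions (odd: constant + sine/cosine pairs)
    """
    Theta = np.ones((len(t_obs), p))
    for j in range(1, (p - 1) // 2 + 1):
        Theta[:, 2 * j - 1] = np.sin(2 * np.pi * j * t_obs)
        Theta[:, 2 * j] = np.cos(2 * np.pi * j * t_obs)
    # beta_hat = (Theta' Theta)^{-1} Theta' x_obs, computed via lstsq
    beta_hat, *_ = np.linalg.lstsq(Theta, x_obs, rcond=None)
    return beta_hat

t = np.linspace(0, 1, 100)
x = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(100)
beta = fit_coefficients(t, x, p=7)
```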
With the goal of dividing the observed curves 𝑥1, . . . , 𝑥𝑛 into 𝐾 homogeneous groups,
let us assume that there exists an unobservable grouping variable Z =
(𝑍1, . . . , 𝑍𝐾) ∈ {0, 1}^𝐾 indicating the cluster membership: 𝑧𝑖,𝑘 = 1 if 𝑥𝑖 belongs to
cluster 𝑘, and 0 otherwise (𝑧𝑖,𝑘 is indeed what we want to predict for each curve).
In adopting a model-based clustering approach, we denote by 𝜋𝑘 the (a priori)
probabilities of belonging to a group:

$$\pi_k = \mathbb{P}(Z_k = 1), \qquad k = 1, \ldots, K,$$

such that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k > 0$ for each 𝑘, and we assume that, conditionally on
𝑍, the random vector 𝜷 follows a multivariate Gaussian distribution, that is, for each
cluster

$$\boldsymbol{\beta} \mid (Z_k = 1) = \boldsymbol{\beta}_k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where $\boldsymbol{\mu}_k = (\mu_{1,k}, \ldots, \mu_{p,k})^T$ and $\boldsymbol{\Sigma}_k$ are, respectively, the mean vector and
the covariance matrix of the 𝑘-th group. Then the marginal distribution of $\boldsymbol{\beta} =
(\beta_1, \ldots, \beta_p)$ can be written as a finite mixture with mixing proportions $\pi_k$:

$$p(\boldsymbol{\beta}) = \sum_{k=1}^{K} \pi_k\, f(\boldsymbol{\beta}_k; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$
where $\|\boldsymbol{\mu}_k\|_1 = \sum_{j=1}^{p} |\mu_{k,j}|$, and $\lambda_1 > 0$ and $\lambda_2 > 0$ are penalty parameters to be suitably
chosen.
The penalty term on the cluster mean vectors allows for component selection
in the functional data framework (whereas it would be variable selection in the
multivariate case), considering that when the 𝑗-th component in the basis expansion
is not useful in separating groups it has a common mean across groups, that is,
$\mu_{1,j} = \ldots = \mu_{K,j} = 0$. To realize component selection, the considered term is
$\sum_{k=1}^{K} \|\boldsymbol{\mu}_k\|_1$.
The second part of the penalty, namely $\sum_{k=1}^{K} \sum_{j,l} |W_{k;j,l}|$, imposes a shrinkage on
the elements of the precision matrices, thus avoiding possible singularity problems
and facilitating the estimation of large and sparse covariance matrices.
(with initialization values), and maximization (M) of a lower bound of the obtained
expected value with respect to the unknown parameters.
In particular, at the 𝑑-th iteration, given a current estimate $\boldsymbol{\theta}^{(d)}$, the lower bound
after the E-step assumes the following form:

$$Q_P(\boldsymbol{\theta}; \boldsymbol{\theta}^{(d)}) = \sum_{k=1}^{K} \sum_{i=1}^{n} \tau_{k,i}^{(d)} \left[\log \pi_k + \log f(\boldsymbol{\beta}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right] - \lambda_1 \sum_{k=1}^{K} \|\boldsymbol{\mu}_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l} |W_{k;j,l}|,$$
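A minimal Python sketch of a penalized M-step of the kind implied by the lower bound above, assuming the responsibilities τ have already been computed in the E-step. The lasso update of the means is written here as a simple soft-thresholding step, and the precision matrices are updated with the graphical lasso from scikit-learn; this is an illustration of the general scheme under these assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def penalized_m_step(B, tau, lam1, lam2):
    """One simplified penalized M-step.

    B   : (n, p) matrix of basis coefficients (one row per curve)
    tau : (n, K) responsibilities from the E-step
    Returns mixing proportions, sparse means, covariance matrices.
    """
    n, p = B.shape
    K = tau.shape[1]
    pi = tau.mean(axis=0)
    mu = np.zeros((K, p))
    Sigma = np.zeros((K, p, p))
    for k in range(K):
        nk = tau[:, k].sum()
        mu_tilde = tau[:, k] @ B / nk
        # L1 penalty on the means -> soft-thresholding (illustrative form)
        mu[k] = np.sign(mu_tilde) * np.maximum(np.abs(mu_tilde) - lam1 / nk, 0.0)
        R = B - mu[k]
        emp_cov = (tau[:, k, None] * R).T @ R / nk
        # L1 penalty on the precision matrix -> graphical lasso
        Sigma[k], _ = graphical_lasso(emp_cov, alpha=lam2 / nk)
    return pi, mu, Sigma
```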
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where 𝑎(𝑖) is the average distance of curve 𝑖 to all other curves ℎ assigned to the
same cluster (if 𝑖 is the only observation in its cluster, then 𝑠(𝑖) = 0), and 𝑏(𝑖) is
the minimum average distance of curve 𝑖 to the observations ℎ which are assigned to
a different cluster. This definition ensures that 𝑠(𝑖) takes values in [−1, 1], where
values close to one indicate "better" clustering solutions. Conditional on 𝐾 and a pair
of values (𝜆1, 𝜆2), we thus assess the overall cluster solution using the total average
of silhouette values

$$S(K, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^{n} s(i).$$

In particular, by doing a grid search over the triple (𝐾, 𝜆1, 𝜆2), the best cluster
solution is obtained by looking for the largest value of the average silhouette width
(ASW) index. Note that, to evaluate 𝑠(𝑖), 𝑖 = 1, . . . , 𝑛, and then the objective function
𝑆(𝐾, 𝜆1, 𝜆2), we need to compute a distance between pairs of curves 𝑋𝑖 and 𝑋ℎ. One
such choice, used in the experiments below, is the Euclidean distance.
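As a sketch of this model selection strategy, the following Python fragment runs a grid search over (𝐾, 𝜆1, 𝜆2) and keeps the configuration with the largest average silhouette width; `fit_pfc_l1` stands for a hypothetical routine fitting the penalized mixture and returning cluster labels, which we take as given.

```python
from itertools import product
import numpy as np
from sklearn.metrics import silhouette_score

def select_model(B, Ks, lam1s, lam2s, fit_pfc_l1):
    """Grid search maximizing the average silhouette width (ASW).

    B          : (n, p) basis-coefficient matrix, one row per curve
    fit_pfc_l1 : callable (B, K, lam1, lam2) -> labels, assumed given
    """
    best = (None, -np.inf)
    for K, l1, l2 in product(Ks, lam1s, lam2s):
        labels = fit_pfc_l1(B, K, l1, l2)
        if len(set(labels)) < 2:      # silhouette needs >= 2 clusters
            continue
        asw = silhouette_score(B, labels, metric="euclidean")
        if asw > best[1]:
            best = ((K, l1, l2), asw)
    return best
```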
3 Experimental Results
3.1 Simulation
Apart from 3 entries per group, the mean vectors 𝝁𝑘 are all zero. Under this
scenario, the simulated curves (25 per group) and the non-zero group expansion
coefficients are represented in Figure 1. For this simple simulation setting,
estimation results suggest that, using the Euclidean distance to compute the ASW, the
grid search procedure is always able to correctly select the cluster-relevant basis
functions. This is confirmed by Figure 2, which shows both the distribution (over 100
replications) of the selected basis functions and the data projected on these bases,
which clearly highlight the identification of 4 clusters. Under this scenario, the
quality of the estimated clusters thus appears very good, as the analysis of the
misclassification rate suggests 100% accuracy in all the replicated datasets.
Similar results hold for more complex simulation designs, where we consider
different structures of the covariance matrices in the data generating process.
We evaluate the PFC-𝐿1 model on a well-known benchmark data set, namely the
electrocardiogram (ECG) data set (data can be found at the UCR Time Series
Classification Archive [3]).
The ECG data set comprises 200 electrocardiograms from 2 groups of
patients, myocardial infarction and healthy, sampled at 96 time instants.
Fig. 1 Left: 25 simulated curves for each group. Right: Vector of expansion coefficients for each
group, with only three non-zero coefficients corresponding to basis functions with specific period-
icities (Hertz values).
Fig. 2 Left: Data projected on cluster specific functional subspace generated by the selected basis
functions. Right: Distribution (over 100 replications) of the selected basis functions shown for pairs
of sine and cosine basis functions, according to the Hertz values.
This data set was previously used to compare the performance of several functional
clustering models in [1]. The results in Table 5 of [1] show that the FunFEM
models, compared to other state-of-the-art methodologies, achieved the best
performances in terms of accuracy. Hence, here, we limit the comparison to the results
obtained with the PFC-𝐿1 and the FunFEM models. Although FunFEM models rely
on a mixture of Gaussian distributions describing the likelihood of the data, similarly
to our proposal, they differ in how they face the intrinsic high dimension of the
problem, by estimating a latent discriminant subspace in parallel with the steps of
an EM algorithm.
For all the data, we reconstruct the functional form from the sampled curves,
arbitrarily choosing a basis of 20 cubic splines. We tested the PFC-𝐿1 models
considering five different values for the number of clusters, 𝐾 ∈ {2, 3, 4, 5, 6}, and
six values for 𝜆1 ∈ {0.5, 1, 5, 10, 15, 20}.
Considering that the GLASSO penalty parameter 𝜆 depends linearly on 𝜆2,
the choice of 𝜆2 has to provide suitable values for 𝜆. A practical approach is to
choose values avoiding convergence problems with GLASSO. Here 𝜆2 was set to
{5, 7.5, 10, 12, 15, 20} for the ECG data. Both the PFC-𝐿1 and FunFEM algorithms were
initialized using a 𝐾-means procedure.
The clustering accuracies, computed with respect to the known labels, are 69% for
FunFEM DFM[𝛼𝑘𝑗 𝛽𝑘] (choosing among 12 different model parameterizations with the
BIC index), and 75% for PFC-𝐿1 [𝜆1 = 0.5, 𝜆2 = 5] (values of the tuning parameters
chosen by the ASW index). Thus PFC-𝐿1 achieves good performance, with a relative
increase in accuracy of about 9%.
4 Discussion
In this paper we tried to investigate the potential of shrinkage methods for clustering
functional data. Our numerical examples show the advantages of performing clus-
tering with features selection, such as uncover interesting structures underlying the
data while preserving good clustering accuracy. To the best of our knowledge, this is
the first proposal that considers a penalty for both means and covariances of mixture
components in functional model-based clustering. In the model selection section we
defined an heuristic criterion to choose among different model parameterizations
based on average silhouette index. It may be interesting to evaluate different dis-
tances (i.e. not euclidean) to compute this index in future research. Moreover, we
will consider more complex simulation designs to investigate the robustness of the
proposal and extend the comparison with the state of the art methodologies on more
benchmark datasets.
References
1. Bouveyron, C., Côme, E., Jacques, J.: The discriminative functional mixture model for a
comparative analysis of bike sharing systems. Ann. Appl. Stat. 9, 1726–1760 (2015)
2. Chamroukhi, F., Nguyen, H.: Model-based clustering and classification of functional data.
Wiley Interdiscip. Rev.: Data Min. and Knowl. Discov. 9, e1298, 1–36 (2019)
3. Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana,
C. A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., Hexagon-ML: The
UCR Time Series Classification Archive (October 2018)
https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
4. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM
algorithm. J. R. Stat. Soc. Ser. B (Methodol.). 39, 1–38 (1977)
5. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical
lasso. Biostat. 9, 432–441 (2008)
6. Friedman, J., Hastie, T., Tibshirani, R.: glasso: Graphical Lasso: Estimation of Gaussian
Graphical Models, R package version 1.11 (2019).
https://CRAN.R-project.org/package=glasso
7. Jacques, J., Preda, C.: Functional data clustering: A survey. Adv. Data Anal. Classif. 8, 231–
255 (2013)
8. Pan, W., Shen, X.: Penalized model-based clustering with application to variable selection. J.
Mach. Learn. Res. 8, 1145–1164 (2007)
9. Zhou, H., Pan, W., Shen, X.: Penalized model-based clustering with unconstrained covariance
matrices. Electron. J. Stat. 3, 1473–1496 (2009)
Emotion Classification Based on Single
Electrode Brain Data: Applications for Assistive
Technology
Duarte Rodrigues
Faculty of Engineering of University of Porto (FEUP), Rua Dr. Roberto Frias, s/n 4200-465 Porto,
Portugal, e-mail: [email protected]
Luis Paulo Reis
Faculty of Engineering of University of Porto (FEUP) and Artificial Intelligence and Computer
Science Laboratory (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal,
e-mail: [email protected]
Brígida Mónica Faria ( )
School of Health, Polytechnic of Porto (ESS-P.PORTO) and Artificial Intelligence and Computer
Science (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal,
e-mail: [email protected]
1 Introduction
Emotions are a part of our lives, as humans we know how to identify the tiniest
of microexpressions to unveil what someone is feeling, but also how to use them
to express our hearts. From the youngest of ages we see and interact with others
and build a database of patterns of, for example, what joy is and how different it is
from fear or sadness. Computers, on the other hand, do not have any idea of what an
emotion is or how to recognize it. Or do they?
The Artificial Intelligence and Computer Science Laboratory (LIACC) established
two projects where emotion recognition can be of the utmost importance. The
first project, "IntellWheels 2.0" [1], intends to develop an interactive and intelligent
electric wheelchair. This innovative equipment will have a diverse set of
features, such as an adaptive control system (through eye gaze, a brain-computer
interface, hand orientation, among others) and a personalized multi-modal interface
which will allow communication with multiple devices by both the patients and the
caregivers. In this case, having information about the mood of the patient is very
beneficial, because the interface can give the nursing staff updates on the emotional
condition of the patient. The second project, "Sleep at the Wheel" [2], focuses on
the research of an interface that can sense and predict a driver's drowsiness state,
detecting whether the driver has fallen asleep while driving and, consequently,
supporting an alarm system to provide safer routing and driving. Here the state of
mind of the driver is a very important aspect, as different emotions, like anger or
fear, can provoke dangerous situations or unpredictable scenarios, making the driver
less attentive to the surroundings.
In this work, emotions are sensed through a brain-computer interface (BCI).
BCIs are commercial devices that make it possible to acquire a surface
electroencephalogram (EEG). This signal measures the electrical activity of the brain,
which fluctuates according to the firing of neurons and is quantified in microvolts.
In this research, the BCI used was the NeuroSky MindWave 2, which
has a single electrode on the forehead, from which it collects a signal from
the activity of the frontal lobe. This brain area is responsible for the higher executive
functions, including emotional regulation, planning, reasoning and problem solving
[3].
The study of emotion recognition started with psychologist Paul Ekman, who
defined, based on a cross-cultural study, six core emotions: Fear, Anger, Happiness,
Sadness, Surprise and Disgust [4]. Later, psychologist Robert Plutchik established a
model called the "Wheel of Emotions", a diagram where every emotion can be derived
from the six core ones.
It is also important to have a way to measure what someone is feeling or what
emotion they are experiencing. An easy way to do this is through the "Discrete
Emotion Questionnaire", a psychologically validated questionnaire to verify the
intensity of a certain emotion. This assessment presents the 6 core emotions to the
subjects, asking them to rate the intensity they felt, from 1 to 7 [5].
As a first approach in this area, the current work aims to identify the
core emotions using EEG signals collected with the BCI.
2 Experimental Methodology
In order to correctly identify the core emotions, the first step is to trigger them in
an efficient way, so that the brain data collected are as informative as possible. To do
so, the emotions were prompted via a set of video clips lasting 5–7 seconds.
These videos were selected from a certified database, where the videos were labeled
according to the intensity and kind of emotion they caused in the subjects [6]. For each
of the 6 core emotions, the 4 videos classified with the highest intensity were selected
to be presented to the participants of this research work.
For each of the 24 video clips (4 videos for each of the 6 emotions), 3 EEG
samples are collected. The first is before the display of the video, when a fixation
cross is presented, in order to collect the idle/blank state of the user, who
is asked to relax. The second sample is the EEG during the video (active visual
stimulus); and the third sample is after the video finishes, while the volunteer is
processing the emotion triggered (higher-level thinking) and getting back to the
initial relaxed state, with the fixation cross presented again. To confirm that the
volunteers experience the same emotion defined in the pre-determined label, they
are prompted to answer the "Discrete Emotion Questionnaire" after the 3 EEG
samples are collected.
Regarding the physiological signal processing, this step is important because the
raw EEG signal that comes directly from the BCI has a low signal-to-noise ratio,
as well as many surrounding artifacts that contaminate the readings, especially eye
blinks and facial movements triggered by the various emotions. The interfering
signals caused by the latter, denominated electromyograms (EMG), are characterized
by high frequencies (50–150 Hz) that make the underlying signal very noisy. Every
time a person blinks, the EEG signal shows a very high peak with a very low
frequency (<1 Hz). To remove these muscle artifacts, a 5th-order Butterworth bandpass
filter was applied (this type of filter was chosen because it has the flattest frequency
response, which leads to less signal distortion), with cut-off frequencies of 1 Hz and
50 Hz [7]. The attenuation of very low frequencies is important to remove the eye-blink
artifacts. Considering the top cut-off frequency, it is very convenient to use 50 Hz,
since it mitigates the effects of power-line noise and EMG artifacts. In this
way, no important brain data are lost. At this step, the EEG was segmented into the
brain waves of interest, i.e., the alpha and beta brain waves. The best way to perform
this is to apply bandpass filters (same filter type as before) in the corresponding
bandwidths, 8–13 Hz and 13–32 Hz, to obtain the alpha and beta bands, respectively.
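A minimal Python/SciPy sketch of this filtering pipeline; the sampling rate used below is an assumption for illustration (the chapter does not state it), and the random array merely stands in for a raw EEG segment.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 512  # assumed raw EEG sampling rate in Hz (not stated in the chapter)

def bandpass(signal, low, high, fs=FS, order=5):
    """Zero-phase Butterworth bandpass filter."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

eeg = np.random.randn(10 * FS)          # placeholder raw EEG segment
clean = bandpass(eeg, 1.0, 50.0)        # remove blinks, EMG and power-line noise
alpha = bandpass(clean, 8.0, 13.0)      # alpha band
beta = bandpass(clean, 13.0, 32.0)      # beta band
```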
At this stage, the EEG signals expose the "emotional data", allowing features to
be extracted. To do so, multiple mathematical formulas were applied to
obtain relevant information from the signals. Feature extraction methods depend
on the domain, as will be seen ahead [8]. Most strategies to extract features from
the EEG are formulas applied in the time domain, such as the common statistical
equations, the Hjorth statistical parameters, and the mean and zero crossings (the number
of times the signal crosses these 2 thresholds) [8]. Besides these, more advanced
feature extraction methods were applied, based on fractal dimensions and entropy
analysis (methods to assess the complexity, or irregularity, of a time series) [9].
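For instance, the Hjorth parameters and threshold-crossing counts mentioned above can be computed as follows; this is a sketch of the standard definitions, not necessarily the authors' exact feature set.

```python
import numpy as np

def hjorth(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def crossings(x, threshold=0.0):
    """Number of times the signal crosses a given threshold."""
    s = np.sign(x - threshold)
    return int(np.sum(s[:-1] * s[1:] < 0))

x = np.random.randn(512)  # placeholder EEG segment
features = (*hjorth(x), crossings(x), crossings(x, x.mean()))
```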
This first hypothesis describes the main goal of the project, where a model was
developed to classify 6 emotions.
First, the feature extraction was computed. At this step, the optimal number of
features to be selected was tested, iterating from 5 to 50 in steps of 5. The best
number found was 30, which gave the best accuracies with a balanced computation
time and power. This value was chosen for the 3 feature matrices (raw, normalized
and standardized). The dataset was then divided into training and testing sets with an
80% ratio, fully independent of one another. Each model was then trained and
assessed by computing the accuracy on the test dataset. Table 1 presents the results
for each model.
When comparing the various models, the average accuracy is around 16–18%,
logically due to the number of classes in the problem (100%/6 ≈ 16.6%). Despite
this, the best result reached was 25% accuracy, with the features in their raw state,
since the magnitude information was not lost, so patterns in different emotions could
be more easily identified due to the high discrepancy in the values. These results are
not discouraging, since the main objective of the study is very ambitious, as we are
trying to create a model that defines universally what an emotion is. There is no task
more subjective or abstract, and the only way to achieve this universal standardization
would be with a sample population as wide and diverse as possible, with different
beliefs, nationalities, age groups, etc. Although this is an initial study, it shows that
it is possible to register and identify differences in the electrical changes of the
prefrontal cortex and, with that information, categorize what someone is feeling.
As the results in the previous hypothesis could not precisely identify an emotion when
compared to the other 5, the problem was narrowed down and a new hypothesis was
tested, to continue the proposed research. In this experiment, the model was trained
to discern between only 2 emotions, decided a priori. For demonstration purposes,
a concrete example can be seen in Table 2, which compares "fear" vs "surprise".
In this case, most of the machine learning algorithms have accuracies in the
order of 50–53%. These results are not ideal, as they are no better than a random
choice between the two classes; however, this can be justified by the small population
sample, which is not large enough to bring concrete patterns in the features to the
surface. Regarding the deep learning approach, the RNN has an advantage in this
case, giving a final accuracy of 69%. This result shows that this model is reliable, and
in the majority of cases the 2 emotions can be distinguished. In this particular
case, the facial expressions and their muscle activity can induce big artifacts in
the EEG. Someone who feels surprised has the tendency to raise the eyebrows
and open the mouth. These movements can lead to a difference in the EEG and,
consequently, in the patterns of the features, making the distinction between surprise
and fear more noticeable. The same thinking applies to other emotions that trigger
facial movement, like laughing, frowning, among others.
Besides the good results presented in the last premise, one last hypothesis was
assessed, regarding the difference between experiencing the emotion while watching
the video (direct stimulus) and afterwards, when the fixation cross is presented, while
the volunteer is simply thinking and cognitively processing the emotion.
As can be seen, for this experiment most models did fairly well using the
standardized features, with all accuracies higher than 80%. However, when testing
the deep learning approach, this architecture proved to fit almost perfectly to the
testing data, with an accuracy higher than 96%. This hypothesis is the proof of
concept that the characteristics of the signal collected during the stimulus itself
are very different from those of a signal obtained when the person is simply
thinking and cognitively processing the emotion (this change would be obvious if
the EEG were collected from the occipital lobe, which is responsible for visual
perception, but it is remarkable when spotted in the prefrontal cortex).
4 Conclusions
In conclusion, as a first approach, the results achieved are very satisfactory and
reveal a high potential for the proposed applications in both the
"IntellWheels 2.0" and "Sleep at the Wheel" projects. Nevertheless, by collecting
more data, the models will become more general, resulting in more realistic patterns
and, consequently, increased prediction accuracy.
Compared to the literature, using simple visual stimuli to distinguish six emotions,
in a relaxed state, is a novel tactic. Most studies complement the stimulus with
forced facial expressions, introducing different characteristics to the signal and leading
to better results. Other studies use BCIs with more electrodes (channels), covering a
wider cranial surface and, consequently, capturing more EEG information, which
leads to more robust results.
As future work, the preprocessing of the data could be polished, improving the
removal of artifacts and enhancing the underlying information of the EEGs. To obtain
better results, a transfer learning approach could also be used, pre-training the
models with other emotion-related EEG databases.
References
The Death Process in Italy Before and During
the Covid-19 Pandemic: a Functional
Compositional Approach
Riccardo Scimone ( )
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: [email protected]
Alessandra Menafoglio
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: [email protected]
Laura M. Sangalli
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy,
e-mail: [email protected]
Piercesare Secchi
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and
Society, Human Technopole, Milano, Italy, e-mail: [email protected]
At the dawn of the third year of the global pandemic, we can affirm that no aspect of
people's everyday life has been left untouched by the consequences of Covid-19.
The virus, in addition to exacting a heavy death toll, has caused great upheavals
in the global economy, education systems, technological development and in countless
other aspects of human life. Given this global reach, we deem it appropriate to analyse
death counts from all causes, and not just those directly attributed to Covid-19, as
a proxy of how Italian administrative units, be they municipalities or provinces, have
been affected by the pandemic. This choice is driven by the following considerations:
• Death counts from all causes are, on many levels, high-quality data: they have a
very fine spatial and temporal granularity, being collected daily in each Italian
municipality; they are finely stratified in many age classes; and they are not affected
by errors due to incorrect attribution of the cause of death, as may happen, for
example, in deciding whether or not a given death is due to Covid-19;
• They incorporate any possible shock, be it direct or indirect, which the natural
death process underwent: fewer deaths from road accidents due to restrictive
policies, more deaths from other pathologies which are left untreated because of the
unnatural stress on the welfare systems, and so on;
• They are made freely available by ISTAT¹, with a substantial amount of historical
data; in particular, in the following analysis we consider data starting from the
beginning of 2011 up to the end of 2020.
The purpose of the analysis of such data is twofold: (1) to study the correlation
structure of the death process in Italy before and during the pandemic, assessing
possible perturbations caused by its outbreak, and (2) to assess local anomalies at
the municipality level (i.e., identifying municipalities which behave unusually with
respect to their surroundings). This talk will be entirely devoted to presenting data and
results concerning people aged between 50 and 69 years. The elderly class was the
focus of [1], while analyses focusing on younger age classes can be freely examined
at https://github.com/RiccardoScimone/Mortality-densities-italy-analysis.git.
Daily death counts for the 107 Italian provinces, in the time interval spanning
from 2017 to 2020, are shown in Fig. 1: for each province, we draw death counts
along the year in light blue. The black solid line is the weighted mean number of
deaths, where each province has a weight proportional to its population. We also
¹ https://www.istat.it/it/archivio/240401
highlight four provinces in colour: Rome, Milan, Naples, and Bergamo. By
visual inspection, it is easy to see that, during the years 2017, 2018 and 2019, the
mortality in this age class has an almost uniform behaviour, with only a very slight
increase in deaths during winter for some provinces. Conversely, 2020 presents
an abnormal behaviour in many provinces, due to the pandemic outbreak: look for
example at the double peak for Milan, hit by both pandemic waves, or the single,
dramatically sharp peak of Bergamo, which reached, during the first wave, higher
death counts than those associated with provinces several times bigger, such as
Rome or Naples. By comparison with the plots in [1], one can see that all these peaks
are less sharp than for the elderly class: this is perfectly reasonable, since
people aged more than 70 years are much more susceptible to death by Covid-19.
Fig. 1 Daily death counts during the last four years, for the Italian provinces. The plots refer to
people aged between 50 and 69 years. For each province, death counts along the year are plotted in
light blue: curves are overlaid one on top of the other to visualize their variability. The black solid
line is the weighted mean number of deaths, where each province has a weight proportional to its
population, while some selected provinces are highlighted in colour.
To set some notation, we denote the available death count data by 𝑑𝑖𝑦𝑡, where
𝑖 is a geographical index, identifying provinces or municipalities, 𝑦 is the year and
𝑡 is the day within year 𝑦. Moreover, we denote by 𝑇𝑖𝑦 the absolutely continuous
random variable time of death along the calendar year, which models the instant of
death of a person living in area 𝑖 and passing away during year 𝑦. We hence consider
the empirical discrete probability density of this random variable,

$$p_{iyt} = \frac{d_{iyt}}{\sum_{t} d_{iyt}} \qquad \text{for } t = 1, \ldots, 365,$$

for each area 𝑖 and year 𝑦. The family {𝑝𝑖𝑦}𝑖𝑦 is the main focus of our analysis: we
show these discrete densities in Fig. 2, with the same colour choices as Fig. 1. It is
clear that using densities provides a natural alignment of areas whose populations
differ significantly, providing complementary insights with respect to the absolute
number of death counts: greater emphasis is given to the temporal structure of the
phenomenon. For example, the astonishing behaviour of the province of Bergamo
during the first pandemic wave in 2020 is now much more visible.
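In code, obtaining the empirical densities 𝑝𝑖𝑦 from a counts array is a one-line normalisation. The sketch below assumes the counts are stored in an array indexed by (area, year, day); the Poisson draws merely stand in for the real ISTAT data.

```python
import numpy as np

# d[i, y, t]: death counts for area i, year y, day t (t = 0, ..., 364);
# random integers here stand in for the ISTAT data.
d = np.random.poisson(lam=3.0, size=(107, 10, 365))

# Empirical density of the day of death within each area-year:
# p[i, y, t] = d[i, y, t] / sum_t d[i, y, t]
p = d / d.sum(axis=2, keepdims=True)

assert np.allclose(p.sum(axis=2), 1.0)
```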
Fig. 2 Empirical densities of daily mortality, for people aged between 50 and 69 years, at the
provincial scale. For each province, the empirical density of the daily mortality is plotted in light
blue: densities are overlaid one on top of the other to visualize their variability. The black solid line
is the weighted mean density, where the weight for each province has been set to be proportional to
its population; some selected provinces are highlighted in colour.
Fig. 3 Smooth estimates of the mortality densities over the 107 Italian provinces. The usual pattern
of mortality is visible till 2019, while the functional process is completely different in 2020, with
the two pandemic waves clearly captured by the estimated densities. The black thick lines represent
the mean density, computed in 𝐵 2 , with weights proportional to the population in each area.
The smooth density estimates { 𝑓𝑖𝑦 }𝑖𝑦 are shown in Fig. 3: they are obtained by smoothing
the {𝑝𝑖𝑦}𝑖𝑦 via compositional splines [8, 9]. It is easy to see, by comparison with Fig. 2,
how smoothing filters out a good amount of noise, much more than in the case of the
elderly class: this is fairly reasonable, since the death process is usually noisier for
younger age classes. From now on, the { 𝑓𝑖𝑦 }𝑖𝑦 are analysed as a spatio-temporal
functional random sample taking values in 𝐵2(Θ). We briefly anticipate the results
of such analysis:
1. The { 𝑓𝑖𝑦 }𝑖𝑦 are decomposed, by means of a linear model formulated in 𝐵2 (Θ)
[10], in a predictable and an unpredictable part, on the basis of mortality during
previous years;
2. The unpredictable part is then analysed spatially in order to infer the main
spatial correlation characteristics of the process; in particular, the impacts of
the pandemic are investigated via functional variography [13, 14, 11, 12] and
Principal Component Analysis in the 𝐵2 space (SFPCA, [16]);
3. The results obtained at the provincial level are reduced to the municipality scale
by spatial downscaling [15] techniques, obtaining smooth density estimates
for each municipality. This provides continuous density at the municipality
level, without directly smoothing the corresponding daily death process, which
is quite irregular due to the reduced population of many municipalities. The
spatial downscaling estimates, that are exclusively based on provincial data, are
then compared with the actual measurements on municipalities, allowing for the
identification of local anomalies.
Points 1 and 2 above are detailed in Section 2, while point 3 will be discussed during
the talk. The reader is referred to [1] for full details on the analysis pipeline.
338 R. Scimone et al.
2 Some Results
The first step of the analysis of the random sample { 𝑓𝑖𝑦 }𝑖𝑦 , where 𝑖 is indexing the
107 Italian provinces, is the formulation of a family of function-on-function linear
models in 𝐵2 (Θ), extending classical models formulated in the 𝐿 2 case [17], namely
Fig. 4 First four panels, from the left: heatmaps of the 𝐵2 norm of the prediction errors 𝛿𝑖𝑦, on a
logarithmic scale, for the elderly class. In 2020 the pandemic diffusion is clearly visible in northern
Italy, while the prediction errors are generally higher in all provinces. Last panel: result of a
𝐾-means 𝐵2 functional clustering (𝐾 = 3) on the 𝛿𝑖𝑦 during 2020.
the error committed in forecasting 𝑓𝑖𝑦 using the model fitted at year 𝑦 − 1. Thus,
we can look at the densities 𝛿𝑖𝑦 as the unpredictable component of 𝑓𝑖𝑦, i.e., as a
proxy of what happened at year 𝑦 which could not be predicted from the information
available in the previous years, and analyse them from the spatial viewpoint. For
example, we can look at the spatial heatmaps of the 𝐵2 norms of the 𝛿𝑖𝑦, which are
shown in Fig. 4. It is clear, by looking at the magnitude of the error norms, that what
happened during 2020 was to a large extent unpredictable, since almost all Italian
provinces are characterized by higher errors with respect to previous years. More
significantly, in 2020 a clear spatial pattern can be noticed, at least during the first
wave: a diffusive process, having at its core the provinces most gravely hit by the
first pandemic wave, seems to take place in northern Italy. This pattern is, as is
reasonable, slightly less evident than in the case of the elderly class analysed in [1].
Going in this direction, we also show in Fig. 4 the result of a 𝐾-means functional
clustering, set in the 𝐵2 space, of the 𝛿𝑖𝑦 for the year 2020. We clearly identify the
provinces hit by the first wave (blue cluster), while the other two clusters behave
irregularly: this is a neat distinction from people aged more than 70 years, where
each cluster clearly identifies a different kind of pandemic behaviour (see [1]).
For a more precise investigation of the spatial correlation structure of the
Fig. 5 Empirical trace-semivariograms for the prediction errors 𝛿𝑖𝑦 , in people aged between 50
and 69 years. The purple lines are the corresponding fitted exponential models. Distances on the
x-axes are expressed in kilometers. The last panel shows the 2020 severe perturbation of the spatial
dependence structure of the process generating the prediction errors.
process across different years, from the 𝛿𝑖𝑦 we compute a functional trace-variogram
for each year; we show them for 2017 up to 2020 in Figure 5. Without entering into
the details of the mathematical definition of variograms, we can read the fitted
curves in Figure 5 as follows. Distances are on the x-axis, while on the y-axis we
have a function of the spatial correlation of the process: when the curve reaches its
horizontal asymptote, it has reached the total variance of the process and we are
beyond the maximum correlation length. In this perspective, it is immediate to infer
that not only has the total variance of the functional process 𝛿𝑖𝑦 sharply increased
in 2020, but also a significant spatial correlation has emerged, compatibly with
the presence of a pandemic. In the main work [1], we further deepen the connection
between the pandemic and the upheavals in the spatial structure by means of Principal
Component Analysis of the 𝛿𝑖𝑦 in the Bayes space (SFPCA, [16]).
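For intuition, an empirical trace-semivariogram of the kind shown in Figure 5 can be computed from pairwise squared norms of the functional residuals, binned by geographical distance, following the usual estimator γ̂(h) = (1/2|N(h)|) Σ_{(i,j)∈N(h)} ‖δ_i − δ_j‖². The sketch below is our own illustration, with the 𝐵2 norm replaced by a plain 𝐿2 norm on a common evaluation grid.

```python
import numpy as np

def trace_semivariogram(deltas, coords, bins):
    """Empirical trace-semivariogram.

    deltas : (n, T) functional residuals evaluated on a common grid
    coords : (n, 2) spatial coordinates of the n areas
    bins   : 1-D array of distance-bin edges
    """
    n = len(deltas)
    dists, sqnorms = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
            sqnorms.append(np.sum((deltas[i] - deltas[j]) ** 2))
    dists, sqnorms = np.array(dists), np.array(sqnorms)
    gamma = np.full(len(bins) - 1, np.nan)
    for b in range(len(bins) - 1):
        mask = (dists >= bins[b]) & (dists < bins[b + 1])
        if mask.any():
            gamma[b] = 0.5 * sqnorms[mask].mean()  # half mean squared norm
    return gamma
```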
References
1. Scimone, R., Menafoglio, A., Sangalli, L. M., Secchi, P.: A look at the spatio-temporal
mortality patterns in Italy during the COVID-19 pandemic through the lens of mortality
densities. Spatial Stat. (2021) doi:10.1016/j.spasta.2021.100541
2. Egozcue, J., Díaz-Barrero, J., Pawlowsky-Glahn, V.: Hilbert space of probability density
functions based on Aitchison geometry. Acta Math. Sin. 22, 1175–1182 (2006)
3. Pawlowsky-Glahn, V., Egozcue, J., Boogaart, K.: Bayes Hilbert spaces. Aust. New Zeal. J.
Stat. 56, 171–194 (2014)
4. Boogaart, K., Egozcue, J., Pawlowsky-Glahn, V.: Bayes linear spaces. SORT 34, 201–222
(2010)
5. Villani, C.: Topics in Optimal Transportation. American Mathematical Society (2003)
6. Aitchison, J.: The statistical analysis of compositional data. J. Roy. Stat. Soc. B Stat. Meth.
44, 139–177 (1982)
7. Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, London
(1986)
8. Machalová, J., Hron, K., Monti, G.: Preprocessing of centred logratio transformed density
functions using smoothing splines. J. Appl. Stat. 43 (2015)
9. Machalová, J., Talská, R., Hron, K., Gába, A.: Compositional splines for representation of
density functions. Comput. Stat. 36, 1031–1064 (2021)
10. Talská, R., Menafoglio, A., Machalová, J., Hron, K., Fišerová, E.: Compositional regression
with functional response. Comput. Stat. Data Anal. 123, 66–85 (2018)
11. Menafoglio, A., Petris, G.: Kriging for Hilbert-space valued random fields: The operatorial
point of view. J. Multivariate Anal. 146 (2015)
12. Menafoglio, A., Grujic, O., Caers, J.: Universal kriging of functional data: Trace-variography
vs cross-variography? Application to gas forecasting in unconventional shales. Spatial Stat.
15, 39–55 (2016)
13. Nerini, D., Monestiez, P., Manté, C.: Cokriging for spatial functional data. J. Multivariate
Anal. 101, 409–418 (2010)
14. Menafoglio, A., Secchi, P., Dalla Rosa, M.: A universal kriging predictor for spatially dependent
functional data of a Hilbert space. Electron. J. Stat. 7, 2209–2240 (2013)
15. Goovaerts, P.: Kriging and semivariogram deconvolution in the presence of irregular
geographical units. Math. Geosci. 40, 101–128 (2008)
16. Hron, K., Menafoglio, A., Templ, M., Hrůzová, K., Filzmoser, P.: Simplicial principal
component analysis for density functions in Bayes spaces. Comput. Stat. Data Anal. 94, 330–350
(2016)
17. Ramsay, J., Silverman, B.: Functional Data Analysis. Springer (2005)
Clustering Validation in the Context of
Hierarchical Cluster Analysis:
An Empirical Study
Osvaldo Silva ( )
Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, 9500-321, Portugal, e-mail:
[email protected]
Áurea Sousa
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
Helena Bacelar-Nicolau
Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health
(ISAMB/FM-UL), Portugal, e-mail: [email protected]
1 Introduction
Data were obtained from a questionnaire administered, after informed consent, to three hundred and fifty students attending higher education at a public university. The questionnaire contains, among other items, eleven questions related to academic life and the respective courses.
Several algorithms of hierarchical cluster analysis of variables were applied to the data matrix. The variables (items) are: T1-Participation, T2-Interest, T3-Expectations, T4-Accomplishment, T5-Job Outlook, T6-Teachers' Professional Competence, T7-Distribution of Curricular Units, T8-Number of Weekly Hours of Lessons, T9-Number of Hours of Daily Study, T10-School Outcomes, and T11-Assessment Methods, evaluated on a Likert scale from 1 to 5 (1-Totally disagree, 2-Partially disagree, 3-Neither disagree nor agree, 4-Partially agree, 5-Totally agree).
The Ascendant Hierarchical Cluster Analysis (AHCA) was based on the standard affinity coefficient [1, 17] and on Spearman's correlation coefficient. In this paper, both measures of comparison were combined with three probabilistic aggregation criteria (AVL, AVB and AV1) derived from the VL parametric family. This methodology, in the scope of Cluster Analysis, uses probabilistic comparison functions between pairs of elements, which correspond to random variables following a unit uniform distribution. In addition, this approach considers probabilistic aggregation criteria, which can be interpreted as distribution functions of statistics of independent random variables that are i.i.d. uniform on [0, 1] (e.g., [17]).
Let $A$ and $B$ be two clusters with cardinals $\alpha$ and $\beta$, respectively, and let $\gamma_{xy}$ be a similarity measure between pairs of elements $x, y \in E$ (the set of elements to classify). Concerning the family of AVL methods (e.g., SL, AV1, AVB, and AVL), the comparison functions between clusters can be summarized by the following conjoined formula:

$$\Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (1)$$

where $\alpha = \operatorname{Card} A$, $\beta = \operatorname{Card} B$, $p_{AB} = \max\{\gamma_{ab} : (a, b) \in A \times B\}$, and $1 \leq g(\alpha, \beta) \leq \alpha\beta$. The exponent $g(\alpha, \beta)$ establishes a bridge between the SL and AVL methods, the latter having a braking effect on the formation of chains. For example, $g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = (\alpha + \beta)/2$ for AV1, $g(\alpha, \beta) = \sqrt{\alpha\beta}$ for AVB, and $g(\alpha, \beta) = \alpha\beta$ for AVL (see [3, 17]).
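To make the role of the exponent concrete, the following minimal sketch (ours, not from the paper) evaluates Eq. (1) for the four aggregation criteria from a matrix of pairwise similarities in [0, 1]; all names are illustrative.

```python
import numpy as np

def gamma_cluster(sim, A, B, method="AVL"):
    """Comparison function Gamma(A, B) = p_AB ** g(alpha, beta) of Eq. (1).

    sim    -- symmetric matrix of pairwise similarities gamma_xy in [0, 1]
    A, B   -- index lists of the two clusters being compared
    method -- one of "SL", "AV1", "AVB", "AVL"
    """
    alpha, beta = len(A), len(B)
    p_ab = max(sim[a, b] for a in A for b in B)
    g = {"SL": 1.0,                      # no braking: single linkage
         "AV1": (alpha + beta) / 2.0,    # compromise criterion
         "AVB": np.sqrt(alpha * beta),   # geometric-mean compromise
         "AVL": float(alpha * beta)}[method]
    return p_ab ** g

# With similarities below 1, a larger exponent g shrinks Gamma(A, B),
# which is the "braking effect" on chain formation mentioned above.
rng = np.random.default_rng(0)
S = rng.uniform(0.2, 0.9, (6, 6)); S = (S + S.T) / 2
for m in ("SL", "AV1", "AVB", "AVL"):
    print(m, round(gamma_cluster(S, [0, 1, 2], [3, 4, 5], m), 4))
```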
The application of the two measures of comparison between elements (Spearman's correlation coefficient and the standard affinity coefficient), combined with the aforementioned aggregation criteria, aims to find a typology of items corresponding to the best among the best partitions obtained by the several algorithms, in order to verify whether there are any substantial changes in the results. To this end, validation indices based on the values of the corresponding proximity matrices were used, namely the global levels statistics (STAT) [1, 10, 11] and the indices P(I2mod, Σ) and $\gamma$ [8], adapted to this type of matrices [16], so that the choice of the best partition is judicious and based on desirable properties (e.g., isolation and homogeneity of the clusters). Concerning the best partitions, the respective clusters and the identification of their most representative elements were based on appropriate adaptations of the Mann-Whitney U statistics [8] and of the silhouette plots [14] to the case of similarity measures.
Each level of a dendrogram corresponds to a stage in the construction of the hierarchy of partitions. Therefore, the study of the most relevant partition(s) is strictly related to the choice of the best cut-off levels (e.g., [6, 5]).
According to Bacelar-Nicolau [1, 2], the global levels statistics (STAT) values must be calculated for each of the levels $k = 1, \ldots, nivmax$ of the corresponding dendrograms, and are denoted by $STAT(k)$. At each level $k$, $STAT(k)$ is the global statistic measuring the total information given by the pre-order associated with the corresponding partition, relative to the initial pre-order associated with the similarity or dissimilarity measure. A "significant" level is one corresponding to a partition for which the global statistic undergoes a significant increase relative to the information provided by neighbouring levels, that is, a local maximum of the differences $DIF(k) = STAT(k) - STAT(k-1)$, $k = 1, \ldots, nivmax$.
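As a small illustration of this rule, the sketch below (ours; it assumes the STAT(k) values are already computed, one per dendrogram level) returns the levels at which DIF(k) attains a local maximum.

```python
def significant_levels(stat):
    """Levels whose increase DIF(k) = STAT(k) - STAT(k-1) is a local maximum.

    stat -- sequence of STAT(k) values, stat[0] corresponding to level 1.
    Returns 1-based level numbers, following the rule described above.
    """
    dif = [stat[k] - stat[k - 1] for k in range(1, len(stat))]
    peaks = []
    for i, d in enumerate(dif):
        left = dif[i - 1] if i > 0 else float("-inf")
        right = dif[i + 1] if i + 1 < len(dif) else float("-inf")
        if d > left and d > right:
            peaks.append(i + 2)  # dif[i] is the increase from level i+1 to i+2
    return peaks

print(significant_levels([1.0, 1.2, 2.5, 2.6, 4.0, 4.1]))  # -> [3, 5]
```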
To evaluate the partitions, an appropriate adaptation of the index P(I2, Σ) [8] to the case of similarity measures was used, given by the following formula:

$$P(I2mod, \Sigma) = \frac{1}{c} \sum_{r=1}^{c} \frac{\sum_{i \in C_r} \sum_{j \notin C_r} s_{ij}}{n_r \times (N - n_r)} \qquad (2)$$

where $c$ is the number of clusters of the partition and $s_{ij}$ is the value of the similarity measure between the element $i$ belonging to cluster $C_r$ and the element $j$ belonging to another cluster. This index takes into account the number of clusters and the number of elements in each cluster, and evaluates the isolation of the clusters belonging to a given partition.
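Equation (2) is straightforward to transcribe; the sketch below is our hypothetical implementation, assuming a symmetric similarity matrix and integer cluster labels.

```python
import numpy as np

def p_i2mod(sim, labels):
    """P(I2mod, Sigma) of Eq. (2): for each cluster, sum the similarities
    between its elements and all outside elements, normalize by
    n_r * (N - n_r), and average over the c clusters.  Lower values
    indicate better-isolated clusters."""
    labels = np.asarray(labels)
    N = len(labels)
    clusters = np.unique(labels)
    total = 0.0
    for r in clusters:
        inside = labels == r
        n_r = inside.sum()
        total += sim[np.ix_(inside, ~inside)].sum() / (n_r * (N - n_r))
    return total / len(clusters)

# Two well-separated blocks give a low (good) value.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(p_i2mod(S, [0, 0, 1, 1]))  # 0.15
```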
The $\gamma$ index, proposed by Goodman and Kruskal [7], has been widely used in cluster validation [9]. Comparisons are made between all within-cluster similarities $s_{ij}$ and all between-cluster similarities $s_{kl}$ [18]. A comparison is judged concordant (respectively, discordant) if $s_{ij}$ is strictly greater (respectively, smaller) than $s_{kl}$. The $\gamma$ index is defined by:

$$\gamma = (S_{+} - S_{-})/(S_{+} + S_{-}), \qquad (3)$$

where $S_{+}$ (respectively, $S_{-}$) is the number of concordant (respectively, discordant) comparisons. This index is a global stopping rule that evaluates the fit of the partition into $c$ clusters based on the homogeneity (high similarity between the elements within the clusters) and the isolation (low similarity between the elements of different clusters) of the clusters. The higher the value of this index, the better the adjustment of the partition.
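The concordance counting behind Eq. (3) can be spelled out directly; the following sketch (ours, using a brute-force comparison of every within-cluster similarity with every between-cluster one, so only practical for small matrices) returns the $\gamma$ value.

```python
import numpy as np

def goodman_kruskal_gamma(sim, labels):
    """Gamma of Eq. (3): (S+ - S-) / (S+ + S-), where S+ (S-) counts
    within-cluster similarities strictly greater (smaller) than
    between-cluster ones."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)
    same = labels[i] == labels[j]
    within, between = sim[i, j][same], sim[i, j][~same]
    s_plus = (within[:, None] > between[None, :]).sum()
    s_minus = (within[:, None] < between[None, :]).sum()
    return (s_plus - s_minus) / (s_plus + s_minus)

S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(goodman_kruskal_gamma(S, [0, 0, 1, 1]))  # 1.0: perfect concordance
```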
The use of the STAT, $\gamma$ and P(I2mod, Σ) indices can help identify the most significant levels of a dendrogram, taking into account both the homogeneity and the isolation of the clusters [15].
The U statistics [12] are relevant for assessing the suitability of a cluster, combining the concepts of compactness and isolation. Thus, the "best" cluster is the one with the lowest values of the global U-index, $U_G$, and the local U-index, $U_L$ [8]. In the present paper we used an appropriate adaptation of these indices to the case of similarity measures (for details, see [19]). Moreover, the clusters considered "ideal" are those for which $U_G$ and $U_L$ both take the value zero. The Mann-Whitney U statistics are useful for decision making under uncertainty, both for the evaluation of the clusters and of the partitions.
We also used an appropriate adaptation of the silhouette plots [14], which allows the assessment of the compactness and relative isolation of clusters. The adaptation of this measure to the case of similarity measures, $Sil(i)$, considers the average of the similarities between an element $i$ belonging to cluster $C_r$, which contains $n_r (\geq 2)$ elements, and all other elements that do not belong to this cluster (see [19]). The values of this measure $\{Sil(i) : i \in C_r\}$ lie between $-1$ and $+1$, with "values near +1 indicating that element strongly belongs to the cluster in which it has been placed" ([8], p. 205). In the case of a singleton cluster, $Sil(i)$ takes the value zero [8] in the corresponding algorithm.
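The exact adapted formulas are given in [19]; the sketch below is only one plausible reading of the description above, assuming at least two clusters: for similarities, the roles of the within- and between-cluster averages are swapped relative to the distance-based silhouette, and singletons receive the value zero.

```python
import numpy as np

def sil_similarity(sim, labels):
    """Silhouette adapted to similarities (one plausible reading):
    a(i) = mean similarity of i to its own cluster,
    b(i) = largest mean similarity of i to any other cluster,
    Sil(i) = (a(i) - b(i)) / max(a(i), b(i)); singletons get 0."""
    labels = np.asarray(labels)
    out = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():          # singleton cluster: Sil(i) = 0
            continue
        a = sim[i, own].mean()
        b = max(sim[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        out[i] = (a - b) / max(a, b)
    return out
```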
Table 1 Best partitions obtained with each comparison measure and aggregation criterion, and the corresponding validation indices.

Affinity, AVL: (T1, T3, T4, T5, T6, T7, T8, T10, T11), (T2, T9); STAT = 5.1301, γ = 0.8589, P(I2mod, Σ) = 0.2077

Affinity, AV1/AVB: (T1, T3, T4, T5, T6, T7, T8, T10, T11), (T2), (T9); STAT = 5.3453, γ = 0.8830, P(I2mod, Σ) = 0.2049

Spearman, AVL: (T3, T4, T2, T9), (T7, T11, T8), (T6, T10), (T1), (T5); STAT = 4.0152, γ = 0.8178, P(I2mod, Σ) = 0.3896

Spearman, AV1/AVB: (T3, T4, T2, T9, T6), (T7, T11, T8), (T1, T10), (T5); STAT = 4.05751, γ = 0.7317, P(I2mod, Σ) = 0.38177
Fig. 1 Dendrograms based on standard affinity coefficient (left side) and Spearman’s correlation
coefficient (right side) - AVL.
The "best" partition obtained using the affinity coefficient and the AVL method is the partition into two clusters (level 9 of the aggregation process). The first cluster consists of nine items that highlight the importance of the teachers' professional competence, the structuring/content of the course, and the future perspectives regarding career opportunities, mostly factors exogenous to the students. The second one is composed of two items (T2 and T9) which emphasize the role of interest in the study of Mathematics.
The algorithms using the standard affinity coefficient are the ones that provided the best partitions, and their hierarchies remained closest to the initial pre-orders. In fact, in the case of the Spearman correlation coefficient, the values of the STAT and $\gamma$ indices are clearly lower than the previous ones. Moreover, the cluster {T1, T3, T4, T5, T6, T7, T8, T10, T11}, corresponding to the best partition provided by the combination of the standard affinity coefficient with the aggregation criteria AVL, AV1 and AVB, presents $U_G = 39$ and $U_L = 4$, both lower than those obtained for the cluster {T3, T4, T2, T9, T6} ($U_G = 65$ and $U_L = 26$) provided by the Spearman correlation coefficient combined with the AV1 and AVB methods.
Focusing on the first two partitions of Table 1, the only difference between them is that, while the best partition provided by the AV1 and AVB methods contains the singletons T2 and T9, the best partition given by AVL joins these two singletons in the same cluster. The values of the numerical validation indices shown in Table 1 indicate that the best partition is the one provided by the AV1 and AVB methods. This conclusion is reinforced by the silhouette plot (see Figure 2), which indicates that the cluster joining T2 and T9, given by the AVL method, includes the elements with the two lowest values of $Sil$, and $Sil(T2)$ is negative (i.e., T2 does not fit very well in this cluster). Note that the silhouette plot cannot be used for the best partition, since it does not apply to singletons.
4 Final Remarks
This research proved useful for the identification of relevant partitions of items in the context of Higher Education. In the cases where the affinity and Spearman correlation coefficients were used, the probabilistic criteria AV1 and AVB showed higher agreement regarding the hierarchies of partitions obtained than the AVL method.

The validation measures STAT, $\gamma$ and P(I2mod, Σ) help to determine the best cut-off levels of a hierarchy of clusters, taking into account both the homogeneity and the isolation of the clusters. It should also be noted that when there is no absolute consensus between these three measures, the Mann-Whitney U statistics and the silhouette plot prove very useful, as we have seen with the application of this methodology to evaluate both the clusters and the partitions obtained.
Acknowledgements Funding. This work is financed by national funds through FCT – Founda-
tion for Science and Technology, I.P., within the scope of the project «UIDB/04647/2020» of
CICS.NOVA – Centro de Ciências Sociais da Universidade Nova de Lisboa.
References
An MML Embedded Approach for Estimating
the Number of Clusters
Abstract Assuming that the data originate from a finite mixture of multinomial distributions, we study the performance of an integrated Expectation-Maximization (EM) algorithm that uses the Minimum Message Length (MML) criterion to select the number of mixture components. The referred EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. Comparisons are provided with EM combined with well-known information criteria, e.g. the Bayesian Information Criterion. We resort to synthetic data examples and a real application. The EM-MML computation time is a clear advantage of this method; also, the real-data solution it provides is more parsimonious, which reduces the risk of model order overestimation and improves interpretability.
Cláudia Silvestre ( )
Escola Superior de Comunicação Social, Campus de Benfica do IPL, 1549-014 Lisboa, Portugal,
e-mail: [email protected]
Margarida G. M. S. Cardoso
BRU-UNIDE, ISCTE-IUL, Av. das Forças Armadas, 1649-026 Lisboa, Portugal,
e-mail: [email protected]
Mário Figueiredo
Instituto de Telecomunicações, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal,
e-mail: [email protected]
1 Introduction
methods for categorical data are more challenging [12] and there are fewer techniques
available [11].
In order to determine the number of clusters, model-based approaches commonly resort to information-based criteria, e.g. the Bayesian Information Criterion (BIC) [15] or the Akaike Information Criterion (AIC) [1]. These criteria seek a balance between the model's fit to the data (which corresponds to maximizing the likelihood function) and parsimony (using penalties associated with measures of model complexity), thus trying to avoid over-fitting. The use of information criteria follows the estimation of candidate finite mixture models for which a predetermined number of clusters is indicated, generally resorting to an EM (Expectation-Maximization) algorithm [7]. In this work, we focus on determining the number of clusters while clustering categorical data, using an EM-embedded approach to estimate the number of clusters. This approach does not rely on selecting among a set of pre-estimated candidate models, but rather integrates estimation and model selection in a single algorithm. Our new implementation, which deals with categorical variables by estimating a finite mixture of multinomials, follows a previous version described in [16]. We capitalized on the work of Figueiredo and Jain [9] for clustering continuous data and extended it to deal with categorical data. The embedded method is thus based on a Minimum Message Length (MML) criterion to select the number of clusters and on an EM algorithm to estimate the model parameters.
The literature on finite mixture models and their application is vast, including some
books covering theory, geometry, and applications [8, 13, 3]. When applying finite
mixture models to social sciences, the analyst is often confronted with the need to
uncover sub-populations based on qualitative indicators.
In clustering, the identity of the component that generated each sample observation is unknown. The observed data $\mathbf{Y}$ is therefore regarded as incomplete, the missing data being a set of indicator variables $\mathbf{Z} = \{z_1, \ldots, z_n\}$, each taking the form $z_i = [z_{i1}, \ldots, z_{iK}]'$, where $z_{ik}$ is a binary indicator: $z_{ik}$ takes the value 1 if the observation $y_i$ was generated by the $k$-th component, and 0 otherwise. It is usually assumed that the $\{z_i, i = 1, \ldots, n\}$ are i.i.d., following a multinomial distribution of $K$ categories with probabilities $\{\alpha_1, \ldots, \alpha_K\}$. The log-likelihood of the complete data $\{\mathbf{Y}, \mathbf{Z}\}$ is given by

$$\log f(\mathbf{Y}, \mathbf{Z}|\Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log\left[\alpha_k f(y_i|\theta_k)\right]. \qquad (2)$$
AIC, the Consistent AIC (CAIC) and the Modified AIC (MAIC), and also the Integrated Completed Likelihood (ICL) [14, 4]. They are all easily implemented, the final model being selected according to a compromise between its fit to the data and its complexity.
In this work, we use the Minimum Message Length (MML) criterion to choose the number of components of a mixture of multinomials. MML is based on the information-theoretic view of estimation and model selection, according to which an adequate model is one that allows a short description of the observations. MML-type criteria evaluate statistical models according to their ability to compress a message containing the data, looking for a balance between choosing a simple model and one that describes the data well. According to Shannon's information theory, if $Y$ is a random variable with probability distribution $p(y|\Theta)$, the optimal code-length (in an expected value sense) for an outcome $y$ is $l(y|\Theta) = -\log_2 p(y|\Theta)$, measured in bits (from the base-2 logarithm). If $\Theta$ is unknown, the total code-length function has two parts: $l(y, \Theta) = l(y|\Theta) + l(\Theta)$; the first part encodes the outcome $y$, while the second part encodes the parameters of the model. The first part corresponds to the fit of the model to the data (a better fit corresponds to higher compression), while the second part represents the complexity of the model. The message length function for a mixture of distributions (as developed in [2]) is:
$$l(y, \Theta) = -\log p(\Theta) - \log p(y|\Theta) + \frac{1}{2}\log|I(\Theta)| + \frac{C}{2}\left(1 - \log(12)\right), \qquad (4)$$

where $p(\Theta)$ is a prior distribution over the parameters, $p(y|\Theta)$ is the likelihood function of the mixture, $|I(\Theta)| \equiv \left|-E\left[\frac{\partial^2}{\partial \Theta^2} \log p(Y|\Theta)\right]\right|$ is the determinant of the expected Fisher information matrix, and $C$ is the number of parameters of the model that need to be estimated. For example, for the mixture of $K$ multinomial distributions presented in (3), $C = (K-1) + K\sum_{l=1}^{L}(C_l - 1)$. The expected Fisher information matrix of a mixture leads to a complex analytical form of MML which cannot be easily computed. To overcome this difficulty, Figueiredo and Jain [9] replace the expected Fisher information matrix by its complete-data counterpart $I_c(\Theta) \equiv -E\left[\frac{\partial^2}{\partial \theta^2} \log p(Y, Z|\Theta)\right]$. Also, they adopt independent Jeffreys' priors for the mixture parameters, each proportional to the square root of the determinant of the corresponding Fisher information matrix. The resulting message length function is

$$l(y, \Theta) = \frac{M}{2}\sum_{k:\,\alpha_k>0} \log\frac{n\alpha_k}{12} + \frac{n_z}{2}\log\frac{n}{12} + \frac{n_z(M+1)}{2} - \log p(y|\Theta), \qquad (5)$$

where $n_z$ denotes the number of components with nonzero weight $\alpha_k$ and $M$ the number of parameters per component.
for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.
M-step: For the M-step, noticing that the first term in (5) can be seen as the negative log-prior $-\log p(\alpha_k) = \frac{C-K+1}{2K}\log\alpha_k$ (plus a constant), and enforcing the conditions $\alpha_k \geq 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K}\alpha_k = 1$, yields the following updates for the estimates of the $\alpha_k$ parameters:

$$\hat{\alpha}_k^{(t+1)} = \frac{\max\left\{0,\ \sum_{i=1}^{n}\bar{z}_{ik}^{(t)} - \frac{C-K+1}{2K}\right\}}{\sum_{j=1}^{K}\max\left\{0,\ \sum_{i=1}^{n}\bar{z}_{ij}^{(t)} - \frac{C-K+1}{2K}\right\}}, \qquad (7)$$
for $k = 1, \ldots, K$. Notice that some $\hat{\alpha}_k^{(t+1)}$ may be zero; in that case, the $k$-th component is excluded from the mixture model. The multinomial parameters corresponding to components with $\hat{\alpha}_k^{(t+1)} = 0$ need not be further calculated, since these components do not contribute to the likelihood. For the components with nonzero probability, $\hat{\alpha}_k^{(t+1)} > 0$, the estimates of the multinomial parameters are updated to their standard weighted ML estimates:

$$\hat{\theta}_{klc}^{(t+1)} = \frac{\sum_{i=1}^{n}\bar{z}_{ik}^{(t)}\, y_{ilc}}{n_l \sum_{i=1}^{n}\bar{z}_{ik}^{(t)}}, \qquad (8)$$
for $k = 1, \ldots, K$, $l = 1, \ldots, L$, and $c = 1, \ldots, C_l$. Notice that, in accordance with the meaning of the $\theta_{klc}$ parameters, $\sum_{c=1}^{C_l}\hat{\theta}_{klc}^{(t+1)} = 1$.
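For concreteness, here is a minimal sketch of one M-step, our reading of Eqs. (7) and (8) for a single categorical variable with a constant number of multinomial trials per observation; z_bar, y, C and n_l are illustrative names for the E-step responsibilities, the category counts, the total number of free parameters, and the trials per observation.

```python
import numpy as np

def m_step(z_bar, y, C, n_l):
    """One EM-MML M-step: Eq. (7) prunes components whose (penalized)
    responsibility mass drops to zero; Eq. (8) re-estimates the
    multinomial parameters of the surviving components.

    z_bar -- (n, K) responsibilities from the E-step
    y     -- (n, C_l) category counts, each row summing to n_l
    C     -- number of free parameters of the model
    n_l   -- number of multinomial trials per observation
    """
    n, K = z_bar.shape
    penalty = (C - K + 1) / (2.0 * K)
    num = np.maximum(0.0, z_bar.sum(axis=0) - penalty)
    alpha = num / num.sum()                         # Eq. (7)
    theta = np.zeros((K, y.shape[1]))
    for k in range(K):
        if alpha[k] > 0:                            # pruned components skipped
            w = z_bar[:, k]
            theta[k] = (w @ y) / (n_l * w.sum())    # Eq. (8)
    return alpha, theta

rng = np.random.default_rng(1)
z = rng.dirichlet(np.ones(3), size=8)               # toy responsibilities
y = rng.multinomial(5, [0.2, 0.3, 0.5], size=8)     # 5 trials per observation
print(m_step(z, y, C=14, n_l=5))
```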
choose or change: a) your order of tasks; b) your methods of work; c) your speed or
rate of work. Do you work in a group or team that has common tasks and can plan
its work?
EM-MML selected 7 clusters, a smaller number than the remaining criteria (ICL, BIC, CAIC, AIC and MAIC select 10, 12, 12, 15 and 15 clusters, respectively). This avoids estimation problems associated with very small segments and also improves the interpretability of the clustering solution.
The segments selected by the EM-MML criterion are presented in Figure 1. Workers with slightly above-average autonomy (cluster 7) live in several countries, but Ireland stands out, as well as Belgium, Germany, the Netherlands, Switzerland, and the UK regions. Denmark, Estonia, Malta, and Norway are the countries where the most independent workers are found (cluster 3). The smallest cluster, 6, includes Sweden as well as Kriti and Açores, a Greek and a Portuguese region, respectively. Cluster 5, where workers claim they have no autonomy, includes regions from many countries.
In this work, a model selection criterion and method for finite mixture models of categorical observations was studied: EM-MML. This algorithm simultaneously performs model estimation and selects the number of components/clusters. When compared to the information criteria commonly associated with the use of the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily recovers the true number of clusters in synthetic data sets with various degrees of separation; 2) its computation times are significantly lower than those required by standard approaches resorting to the sequential use of EM and an information criterion; 3) when applied to a real data set, it produces a more parsimonious solution, which is thus easier to interpret. An additional advantage stemming from more parsimonious solutions is that such solutions have a higher number of observations per cluster, thus helping to overcome potential estimation problems.

The performance of EM-MML in selecting the number of clusters is encouraging, and the same criterion has already been used for feature selection [17]. However, future research is required, namely considering data sets with different numbers of clusters and high-dimensional data.
Acknowledgements This work was supported by Fundação para a Ciência e Tecnologia, grant UIDB/00315/2020.
References
Typology of Motivation Factors for Employees in
the Banking Sector: An Empirical Study Using
Multivariate Data Analysis Methods
Áurea Sousa, Osvaldo Silva, M. Graça Batista, Sara Cabral, and Helena
Bacelar-Nicolau
Áurea Sousa ( )
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, 9500-321, Portugal,
e-mail: [email protected]
Osvaldo Silva
Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
M. Graça Batista
Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal,
e-mail: [email protected]
Sara Cabral
Universidade dos Açores, Rua da Mãe de Deus, Portugal, e-mail: [email protected]
Helena Bacelar-Nicolau
Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health
(ISAMB/FM-UL), Portugal, e-mail: [email protected]
1 Introduction
$$\Gamma(A, B) = (p_{AB})^{g(\alpha, \beta)} \qquad (2)$$

with $\alpha = \operatorname{Card} A$, $\beta = \operatorname{Card} B$, $p_{AB} = \max\{\gamma_{ab} : (a, b) \in A \times B\}$, $1 \leq g(\alpha, \beta) \leq \alpha\beta$, and $\gamma_{xy}$ a similarity measure between pairs of elements $x$ and $y$ of the set of elements to classify (e.g., $g(\alpha, \beta) = 1$ for SL, $g(\alpha, \beta) = \alpha\beta$ for AVL). Note that by varying $g(\alpha, \beta)$ with $1 < g(\alpha, \beta) < \alpha\beta$, a sort of compromise can be built between the SL and AVL methods (e.g., $g(\alpha, \beta) = (\alpha + \beta)/2$ for AV1). Thus, $\Gamma(A, B)$ will be "more polluted by the chain effect when $g(\alpha, \beta)$ remains near 1, and more contaminated by the symmetry effect as long as $g(\alpha, \beta)$ is in the neighbourhood of $\alpha\beta$" ([17], p. 95). Among the criteria that establish a compromise between the AVL and SL methods, the AV1 method stands out; its behavior is very similar to that of AVL and it often provides, at its cut-off level, a partition better adjusted to the preorder than the "best" classification obtained by AVL.
Concerning the CatPCA, the best solution comprises four principal components, which account for almost 70% (about 69%) of the data's total variance. All extracted components have eigenvalues above 1. Moreover, the first three principal components have very good internal consistency and the fourth component has acceptable internal consistency, as shown by the values of Cronbach's Alpha coefficient (see Table 1).

The most important items for the first dimension are items M6, M7, M8, M9, M10, M11, M12, M13, M15, M19, M21, M22, and M27, which are related to human relationships/interactions with colleagues and hierarchical superiors, so it is called "Psychological well-being/Interpersonal relationships".
Table 2 Best partitions obtained by the SL/CL and AV1 methods, with STAT values and cut-off levels.

SL/CL: {M1, M2, M3, M5, M8, M10, M11, M12, M13, M15, M14, M16, M18, M19, M22, M20, M6, M23, M27, M24, M21}; {M4}; {M9}; {M7}; {M25}; {M26}; {M17} (STAT = 15.8858, level 20)

AV1: {M1, M2, M3, M6, M27, M21, M5, M23, M24, M8, M15, M14, M16, M10, M13, M11, M12, M18, M19, M20, M22}; {M4, M25, M26}; {M7}; {M9}; {M17} (STAT = 15.6490, level 22)
According to the STAT values, the best partitions were obtained by the classic
SL/CL and the probabilistic AV1 methods (see Table 2). All dendrograms highlighted
four main branches, which are associated with different motivational factors ("Career
progression"; "Psychological well-being / Interpersonal relationships"; "Organiza-
tional environment and working conditions"; "Conformity with objectives and time
to reach them"), bringing new information, and identifying some singletons, as
shown in Figure 1.
4 Conclusion
Organizations and their leaders have become increasingly aware of the importance of employee well-being and of the fact that negative feelings can harm productivity. Thus, it is essential to ensure the well-being of employees, taking into account the main motivational factors identified in this study. CatPCA made it possible to extract four principal components (dimensions), which explain almost 70% of the total variance of the data and which were designated, respectively, "Psychological well-being/Interpersonal relationships"; "Remuneration, job stability and incentive system"; "Career progression/Professional achievement"; and "Fulfilment
of objectives and timings to achieve them".

Fig. 1 Dendrograms of the motivation items M1-M27.

Regarding the AHCA of the items that
assess motivation, the dendrograms highlight four main branches, which are associated with different motivational factors called "Career progression"; "Psychological well-being / Interpersonal relationships"; "Organizational environment and working conditions"; and "Conformity with objectives and time to reach them". They carry new information and also identify some singletons. Comparing the dendrograms, we conclude that the clusters referring to the best partitions are quite similar, with the observed differences mainly concerning the few singletons. Moreover, the effective and fruitful correspondence between the AHCA and the CatPCA results may help to better understand the main types of factors identified. In fact, the four main branches of all dendrograms are related to motivational factors whose interpretation is in consonance with those identified through CatPCA.
Acknowledgements This paper is financed by Portuguese national funds through FCT – Fundação
para a Ciência e a Tecnologia, I.P., project number UIDB/00685/2020.
References
A Proposal for Formalization and Definition of
Anomalies in Dynamical Systems
1 Introduction
Anomalies, often interchangeably called outliers [1], are of key interest in explorative data analysis. Therefore, anomaly detection finds application in many different scientific fields, e.g., in social science, economics, engineering, and medical science [2]. In particular, research in these domains regarding databases, data mining, machine learning or statistics focuses strongly on anomaly detection [3]. Despite the wide
In general, it is assumed that anomalies are somehow visible within the data of the observed systems. This is also clearly stated by the definition of an outlier or anomaly as a data point with a substantial deviation from the norm, since this requires a normal state of the system and a measurable deviation [8]. Furthermore, anomaly detection requires the existence and knowledge of a normal state, a definition of a deviation, a metric, and a distance threshold expressed in that metric. All deviations between the norm and the data points that fall below the defined threshold (in the case of distance measures) or above it (in the case of similarity measures) are assumed to be non-substantial.
The selection of an appropriate metric therefore also becomes important for accurately describing an anomaly. Some authors claim that, in a practical application, the selection of a suitable metric might be more important than the algorithm itself. For example, if clusters are clearly separated within the examined dataset in the context of the selected metric, they will be found independently of the method or algorithm used [9]. Other authors claim that the selected method for investigating clusters is of importance [10].
To summarize, there is no trivial definition of a normal state, of a deviation, or of when a deviation is substantial. Some authors therefore describe the usefulness of an analysis only within the context of the goals of that analysis [11]. Outlier detection becomes more of a technical target than an actual scientific finding of something novel, since novelty is always defined within the technical target of the analysis. Alternatively, the normal model of the data defines an anomaly [1].

This results, for example, in approaches of regression diagnostics that exclude outliers and anomalous data prior to an analysis, or that conduct the analysis along the standard model in a more robust way that is less affected by anomalies [12]. Both approaches maintain the normal model by treating anomalies as if they were less adequate or not at all representative of the data set.
Since anomalies are only relevant within a context, a typology of anomalies within different dataset contexts can be created. Thus, Foorthuis [13] proposes a typology along the following dimensions: type of data (qualitative, quantitative or mixed), anomaly level (atomic or aggregated) and cardinality of relationship (univariate or multivariate). Within this kind of typology, anomalies are always dependent on the dataset and behave differently along the measured features that have been classified as relevant for the specific analysis. Anomaly detection becomes a detection of unfitting, surprising values while maintaining the normal model.
If the assumptions regarding normal states, deviation, and substantiality are dropped,
it is possible to discuss anomalies on a more fundamental level for understanding
our surroundings and the observations of them.
To do this, anomalies have to be placed in the historic context of science and research. Since anomaly detection as a discipline of data science is placed within the scientific context [14], it can also be analyzed as part of the scientific method, and a comparison with the historical understanding of anomalies in science becomes relevant. According to Kuhn [15], anomalies play an important role in the scientific discovery of novelties:
Discovery commences with the awareness of anomaly, i.e., with the recognition that nature
has somehow violated the paradigm-induced expectations that govern normal science. It
then continues with a (...) exploration of the area of anomaly. And it closes only when the
paradigm theory has been adjusted so that the anomalous has become the expected.
This statement describes scientific progress as a stepwise discovery and the place-
ment of anomalies within a normal state by science. The discussed normal state is
therefore dictated by current scientific knowledge, which encompasses the predic-
tions of the currently available and widely used models and theories. An anomaly
violates the normal state by violating the predictions of these models. The steps of
scientific progress are then as follows:
$$\forall i\ \forall j\ \exists!\ x_{ij}, \quad x_{ij} \in R_j \qquad (1)$$
The set 𝐶 of all combinations of system state values with 𝐽 features is given by:
Using the defined function space, a restriction of the reachable system states via all functions from $F$ is defined, resulting in the set of physically possible system states.

Definition 3 (Physically Possible System States) The relation $f$ spans the complete space of state changes of a system using the entire scope of operations. The resulting space is the set of all possible system states. The physically possible system states are the possible realizations of $x_i$ based on a starting point, when only functions from $F$ are applied. The set $P$ is a group with a neutral element of operations.
$$z : C \to M, \quad z(x_i) = x_i^* \qquad (6)$$

Therefore, the set $M = R_1 \times \ldots \times R_D$ is the space of all observable and known system states. The function $z$ is the measurement process.
Definition 5 (Observed Operations) Not all functions of the whole set of functions $F$ are known or observable when planning and operating a system:

$$F' \subseteq F \qquad (7)$$

Additionally, only observable system states are modeled when operating a system. The observed operations of systems are therefore projections of a subset of the known operations of $F$ and operate within the observed and known system states:

$$F^* = z(F') \qquad (8)$$

The actually conducted operations $f$ are always from the set of operations $F$, but the expectation and prediction utilize, due to lack of system knowledge, only $f^* \in F^*$. Therefore, all states applied in operation $f^*$ are defined as expected system states.
Definition 6 (Expected System States) The system states which are possible if only the observed and known operations of the set $F^*$ are applied to the system states $x_i^* \in E$ are the expected system behavior.
The expected system states can be further split into desired system states, where
the system is running most beneficially for its usage, a critical system state, where
a possible error or rare system states are measured, and error states, which are
system faults with operational risks involved as defined by Basel III [18]. Applied
in engineering, this definition is compatible with the definition of DIN EN 13306
since the system is at risk of being unable to perform a certain range of functions
without necessarily being completely inoperable [19]. All kinds of errors, warnings
and non-beneficial system states are the "technical" anomalies within the contextual
analysis of the data set.
Definition 7 (Unforeseen System States) The set of unforeseen system states $U$ consists of all measurable system states within the realm of observable system states but not within the expected system states:

$$U = M \setminus E \qquad (11)$$

"Scientific" anomalies in unforeseen system states are measured if the real operation $f$ differs from $f^*$ such that a prediction error occurs:

$$f^*(x_i^*) \in E, \quad f^*(x_i^*) \neq z(f(x_i)) \notin E \qquad (12)$$
"Scientific" anomalies are part of the unforeseen system states. Another reason for
unforeseen system states is a measurement of an impossible system state. Anomalies
originated by physically impossible system states are to be distinguished from "scien-
tific" anomalies since the reason for their occurrence follows a different mechanism.
Thus, they are assigned to the "technical" anomalies.
Definition 8 (Physically Impossible System States) Physically impossible system
states 𝐼 are combinations of states in set 𝐶 which are not reachable using function 𝑓 :
𝐼 = 𝐶/𝑃 (13)
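Since Definitions 3-8 are plain set relations, they can be made tangible with finite toy sets; everything in the sketch below (the concrete sets C, P, M and E) is an invented example, not the authors' system model.

```python
# Toy illustration of the set relations; states are integer pairs (i, j).
C = {(i, j) for i in range(4) for j in range(4)}  # all state combinations
P = {s for s in C if s[0] >= s[1]}                # physically possible states
M = {s for s in C if s[0] + s[1] < 6}             # observable/known states
E = {s for s in M & P if s[0] <= 2}               # expected states (F* only)

U = M - E   # Definition 7: unforeseen system states
I = C - P   # Definition 8: physically impossible system states

def classify(x):
    if x in E:
        return "expected"
    if x in U and x in I:
        return "'technical' anomaly (impossible state measured)"
    if x in U:
        return "unforeseen (candidate 'scientific' anomaly)"
    return "outside the observable states"

for state in [(1, 1), (3, 0), (0, 3)]:
    print(state, "->", classify(state))
```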
4 Conclusion
It is concluded that the anomaly concept is often loosely defined and depends heavily on assumptions of a normal state, a deviation, and substantiality. These definitions are often case-specific and influenced by the conducting researchers' choices. A rigorous definition of anomalies can therefore further streamline the discourse and increase the common understanding of what kind of anomaly is being described.
Using "technical" and "scientific" anomalies, further research will be conducted
to set up models detecting both types of anomalies separately. Differences between
observed and real system states and operations are a focus of further research to
more precisely analyze the hidden processes of the "scientific" anomaly generation.
Also, a more fundamental discussion of the philosophical definition of anomalies
within the philosophy of science and its applications to anomaly detection in general
should be conducted to further gain insight into the true nature of anomalies.
The authors plan to validate the concept by using the proposed definition and framework in exemplary applications within industrial processes. Furthermore, anomaly detection methods designed for applications in dynamical systems using the proposed framework are to be developed.
Acknowledgements The Mercedes-Benz Group AG funds this research. The research was prepared
within the framework of the doctoral program of the Institut für Informationsmanagement im
Ingenieurwesen (IMI) at the Karlsruhe Institute of Technology (KIT).
References
1. Aggarwal, C. C.: Outlier Analysis. Springer Science+Business Media, New York (2013)
2. Hodge, V. J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85-126 (2004)
3. Aggarwal, C. C., Sathe, S.: Outlier Ensembles. Springer, Cham (2017)
4. Wang, X., Wang, X., Wilkes, M.: New Developments in Unsupervised Outlier Detection -
Algorithms and Applications. Springer, Singapore (2021)
5. Spoor, J. M., Weber, J., Ovtcharova, J.: A definition of anomalies, measurements and predic-
tions in dynamical engineering systems for streamlined novelty detection. Accepted for the
8th International Conference on Control, Decision and Information Technologies (CoDIT),
Istanbul (2022)
6. Åström, K. J., Murray, R. M.: Feedback Systems - An Introduction for Scientists and Engineers.
Princeton University Press, Princeton, New Jersey (2008)
7. Sethi, S. P., Thompson, G. L.: Optimal Control Theory - Applications to Management Science
and Economics. Springer Science+Business Media, Boston, MA (2000)
8. Mehrotra, K. G., Mohan, C., Huang, H.: Anomaly Detection - Principles and Algorithms.
Springer International Publishing, Cham (2017)
9. Skiena, S. S.: The Data Science Design Manual. Springer International Publishing, Cham
(2017)
10. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning.
Springer Science+Business Media, New York (2013)
11. Fahrmeir, L., Hamerle, A., Tutz, G. (eds.): Multivariate Statistische Verfahren. de Gruyter,
Berlin (1996)
12. Rousseeuw, P. J., Leroy, A. M.: Robust Regression and Outlier Detection. John Wiley & Sons,
Inc (1987)
13. Foorthuis, R.: On the nature and types of anomalies: A review of deviations in data. Int. J.
Data Sci. Anal. 12, 297-331 (2021)
14. Cuadrado-Gallego, J. J., Demchenko, Y.: The Data Science Framework: A View from the
EDISON Project. Springer Nature Switzerland AG, Cham (2020)
15. Kuhn, T.: The Structure of Scientific Revolutions. 2nd ed. The University of Chicago Press,
Chicago (1970)
16. Hawkins, D.: Identification of Outliers. Chapman and Hall (1980)
17. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv.
41(3) 15 (2009)
18. Bank for International Settlements: Basel Committee on Banking Supervision: International
Convergence of Capital Measurement and Capital Standards (2006)
19. DIN Deutsches Institut für Normung e. V.: DIN EN 13306: Instandhaltung - Begriffe der
Instandhaltung. Beuth Verlag GmbH, Berlin (2010)
New Metrics for Classifying Phylogenetic Trees Using K-means and the Symmetric Difference Metric

Nadia Tahiri ( )
Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada, e-mail: [email protected]
Aleksandr Koshkarov
Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada; Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056, Russia, e-mail: [email protected]
Abstract The $k$-means method can be adapted to any type of metric space and is sometimes linked to median procedures. This is the case for the symmetric difference (or Robinson and Foulds) distance in phylogeny, where it can lead to median trees as well as to Euclidean embedding. We show how a specific version of the popular $k$-means clustering algorithm, based on interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data is homogeneous) or several (when the data is heterogeneous) clusters of trees. We have adapted the popular Silhouette and Gap cluster validity indices to tree clustering with $k$-means. In this article, we show results of this new approach on a real dataset (aminoacyl-tRNA synthetases). The new version of phylogenetic tree clustering makes the method well suited for the analysis of large genomic datasets.
1 Introduction
In biology, one of the most significant organizing principles is the "Tree of Life"
(ToL) [12]. In genetic studies, there is evidence of an enormous number of branches,
but even a rough estimate of the total size of the tree remains difficult. Many recent
2 Methods
The 𝑘-means algorithm [9, 10] is a very common algorithm for data parsing. From
a set of 𝑁 observations 𝑥𝑖 , . . . , 𝑥 𝑁 each one being described by 𝑀 variables, this
algorithm creates a partition in 𝑘 homogeneous classes or clusters. Each observation
corresponds to a point in a 𝑀-dimensional space and the proximity between two
points is measured by the distance between them. In the framework of 𝑘-means, the
most commonly used distances are the Euclidean distance, Manhattan distance, and
Minkowski distance [4]. To be precise, the objective of the algorithm is to find the
partition of the 𝑁 points into 𝑘 clusters in such a way that the sum of the squares of the
distances of the points to the center of gravity of the group to which they are assigned
is minimal. To the best of our knowledge, finding an optimal partition according to
the 𝑘-means least-squares criterion is known to be NP-hard [13]. Considering this
New Metrics to Classify Phylogenetic Trees 385
fact, several polynomial-time heuristics were developed, most of which have the time
complexity of O (𝐾 𝑁 𝐼 𝑀) for finding an approximate partitioning solution, where
𝐾 is the maximum possible number of clusters, 𝑁 is the number of objects (for
example, phylogenetic trees), 𝐼 is the number of iterations in the 𝑘-means algorithm,
and 𝑀 is the number of variables characterizing each of the 𝑁 objects.
A well-known metric for comparing two tree topologies in computational biology is the Robinson-Foulds ($RF$) distance, also known as the symmetric-difference distance [15]. The $RF$ distance is a topological distance: it does not consider the lengths of the edges of the trees. It can be written as $n_1(T_1) + n_2(T_2)$, where $n_1(T_1)$ is the number of partitions of the data implied by the tree $T_1$ but not by the tree $T_2$, and $n_2(T_2)$ is the number of partitions of the data implied by the tree $T_2$ but not by the tree $T_1$. According to Barthélemy and Monjardet [1], the majority-rule consensus tree of a set of trees is the median tree of this set. This fact makes the use of tree clustering possible.
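For two trees on the same leaf set, the RF distance is simply the size of the symmetric difference of their bipartition sets; a minimal sketch (ours, assuming the non-trivial splits have already been extracted and canonically oriented) follows.

```python
def robinson_foulds(splits1, splits2):
    """RF distance n1(T1) + n2(T2): splits in one tree but not the other.
    Each tree is a set of frozensets, one canonical side per bipartition."""
    return len(splits1 - splits2) + len(splits2 - splits1)

# Two 4-leaf topologies: ((a,b),(c,d)) versus ((a,c),(b,d)).
t1 = {frozenset({"a", "b"})}
t2 = {frozenset({"a", "c"})}
print(robinson_foulds(t1, t2))  # 2, i.e. the maximum 2n - 6 for n = 4
```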
The first popular cluster validity index we consider in our study is the Silhouette width ($SH$) [16]. Traditionally, the Silhouette width of the cluster $k$ is defined as follows:

$$s(k) = \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{b(i) - a(i)}{\max(a(i), b(i))}, \qquad (1)$$

where $N_k$ is the number of objects belonging to cluster $k$, $a(i)$ is the average distance between object $i$ and all other objects belonging to cluster $k$, and $b(i)$ is the smallest, over all clusters $k'$ different from cluster $k$, of all average distances between $i$ and all the objects of cluster $k'$.
We used Equations (2) and (4) for calculating $a(i)$ and $b(i)$, respectively, in our tree clustering algorithm (see also [19]). For instance, the quantity $a(i)$ can be calculated as follows:

$$a(i) = \left[\sum_{j=1}^{N_k} \frac{RF(T_{ki}, T_{kj})}{2n(T_{ki}, T_{kj}) - 6} + \xi\right] / N_k, \qquad (2)$$

where $N_k$ is the number of trees in cluster $k$, $T_{ki}$ and $T_{kj}$ are, respectively, trees $i$ and $j$ in cluster $k$, $n(T_{ki})$ is the number of leaves in tree $T_{ki}$, $n(T_{kj})$ is the number of leaves in tree $T_{kj}$, and $\xi$ is a penalty function depending on a penalization (tuning) parameter $\alpha$, taking values between 0 and 1, used to prevent putting in the same cluster trees having small percentages of leaves in common; $n(T_{ki}, T_{kj})$ is the number of common leaves in trees $T_{ki}$ and $T_{kj}$. The formula for $b(i)$ is as follows:

$$b(i) = \min_{1 \leq k' \leq K,\ k' \neq k} \left[\sum_{j=1}^{N_{k'}} \frac{RF(T_{ki}, T_{k'j})}{2n(T_{ki}, T_{k'j}) - 6} + \xi\right] / N_{k'}, \qquad (4)$$

where $T_{k'j}$ is the tree $j$ of the cluster $k'$, such that $k' \neq k$, and $N_{k'}$ is the number of trees in the cluster $k'$.
The optimal number of clusters, $K$, corresponds to the maximum average Silhouette width, $SH$, which is calculated as follows:

$$SH = s(K) = \left[\sum_{k=1}^{K} s(k)\right] / K. \qquad (5)$$
The value of the Silhouette index defined by Equation (5) ranges from $-1$ to $+1$. It is worth noting that the $SH$ cluster validity index (Equations (1) to (5)) does not allow comparing the solution consisting of a single consensus tree ($K = 1$; the calculation of $SH$ is impossible in this case) with clustering solutions involving multiple consensus trees or supertrees ($K \geq 2$). This can be considered an important disadvantage of $SH$-based classifications, because a good tree clustering method should be able to recover a single consensus tree or supertree when the input set of trees is homogeneous (e.g. for a set of gene trees that share the same evolutionary history).
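Putting Equations (1), (2), (4) and (5) together, a compact sketch of the tree-adapted computation might look as follows; it is our illustration, assuming $K \geq 2$ clusters, pairs of trees sharing more than three leaves, and a constant penalty $\xi$ standing in for the full penalty function of Eq. (3).

```python
import numpy as np

def tree_silhouette(rf, common, labels, xi=0.0):
    """Average Silhouette width SH for a partition of trees.

    rf     -- matrix of RF(T_i, T_j) distances
    common -- matrix n(T_i, T_j) of common-leaf counts (all > 3 here)
    labels -- cluster index of each tree; xi -- constant penalty term
    """
    labels = np.asarray(labels)
    d = rf / (2 * common - 6)                 # normalized RF distances
    ks = np.unique(labels)
    s = []
    for k in ks:
        in_k = np.where(labels == k)[0]
        vals = []
        for i in in_k:
            a = (d[i, in_k].sum() + xi) / len(in_k)               # Eq. (2)
            b = min((d[i, labels == k2].sum() + xi)
                    / (labels == k2).sum() for k2 in ks if k2 != k)  # Eq. (4)
            vals.append((b - a) / max(a, b))
        s.append(np.mean(vals))               # Eq. (1), per cluster
    return float(np.mean(s))                  # Eq. (5)
```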
The Gap statistic was first used by Tibshirani et al. [20] to estimate the number of
clusters provided by partitioning algorithms. The formulas proposed by Tibshirani
et al. were based on the properties of the Euclidean distance. In the context of tree
clustering, the Gap statistic can be defined as follows. Consider a clustering of 𝑁
trees into 𝐾 non-empty clusters, where 𝐾 ≥ 1. First, we define the total intracluster
distance, $D_k$, characterizing the cohesion between the trees belonging to the same cluster $k$:

$$D_k = \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} \left[\frac{RF(T_{ki}, T_{kj})}{2n(T_{ki}, T_{kj}) - 6} + \xi\right]. \qquad (6)$$
Then, the sum of the average total intracluster distances, $V_K$, can be calculated using this formula:

$$V_K = \sum_{k=1}^{K} \frac{1}{2N_k} D_k. \qquad (7)$$
Finally, the Gap statistic, which reflects the quality of a given clustering solution including $K$ clusters, can be defined as follows:

$$Gap_N(K) = E_N^*[\log(V_K)] - \log(V_K), \qquad (8)$$

where $E_N^*$ denotes expectation under a sample of size $N$ from the reference distribution. The following formula [20] for the expectation of $\log(V_K)$ was used in our algorithm:

$$E_N^*[\log(V_K)] = \log(Nn/12) - (2/n)\log(K), \qquad (9)$$

where $n$ is the number of tree leaves.
The largest value of the Gap statistic corresponds to the best clustering.
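Equations (6)-(9) translate just as directly; the sketch below (ours, assuming the normalized RF distances, penalty included, are given as a matrix, that no cluster is empty, and that at least one intracluster distance is positive) scores a candidate partition.

```python
import numpy as np

def gap_statistic(d, labels, n_leaves):
    """Gap(K) of Eq. (8), using the expectation term of Eq. (9).

    d        -- matrix of normalized RF distances with the penalty folded in
    labels   -- cluster index of each of the N trees
    n_leaves -- number n of tree leaves used in Eq. (9)
    """
    labels = np.asarray(labels)
    N, K = len(labels), len(np.unique(labels))
    v = 0.0
    for k in np.unique(labels):
        in_k = np.where(labels == k)[0]
        D_k = d[np.ix_(in_k, in_k)].sum()       # Eq. (6)
        v += D_k / (2 * len(in_k))              # Eq. (7)
    expected = np.log(N * n_leaves / 12) - (2 / n_leaves) * np.log(K)  # Eq. (9)
    return expected - np.log(v)                 # Eq. (8)
```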
To illustrate the methods described above, we used a dataset from Woese et al. [22]. The aminoacyl-tRNA synthetases (aaRSs) are enzymes that attach the appropriate amino acid onto its cognate transfer RNA. The structure-function aspect of aaRSs has long attracted the attention of biologists [22, 6]. Moreover, the relationship of aaRSs to the genetic code has been examined from an evolutionary viewpoint (the central role played by the aaRSs in translation suggests that their histories and that of the genetic code are somehow intertwined [22]). Novel domain additions to aaRS genes play an important role in the inference of the ToL.
We encoded the 20 original aminoacyl-tRNA synthetase trees from Woese et al. [22] in Newick format and then split some of them into subtrees to account for cases where the same species appeared more than once in the original tree, since our approach cannot handle multiple instances of the same species in the input trees. Thus, 36 aaRS trees with different numbers of leaves (including 72 species in total) were used as input to our algorithm (their Newick strings are available at: https://github.com/tahiri-lab/PhyloClust). Our approach was applied with the $\alpha$ parameter set to 1.
First, we applied our new approach with the Gap statistic cluster validity index, which suggested the presence of 7 clusters of trees in the data, indicating a heterogeneous scenario of their evolution. Then, we conducted the computation using the $SH$ cluster validity index and obtained 2 clusters of trees, each of which could be represented by its own supertree. The first cluster obtained using $SH$ included 19 trees for a total of 56 organisms, whereas the second cluster included 17 trees for a total of 61 organisms. The supertrees (see Figure 1) for the two obtained clusters of trees were inferred using the CLANN program [5]. Further, we decided to infer the most common horizontal gene transfers (HGTs) characterizing the evolution of the gene trees included in the two obtained tree clusters. The method of [3], which reconciles the species and gene phylogenies to infer transfers, was used for this purpose. The species phylogenies followed the NCBI taxonomic classification. These phylogenies were not fully resolved (the species phylogeny in Figure 1a contains 9 internal nodes
with a degree higher than 3, and the species phylogeny in Figure 1b contains 10 internal nodes with a degree higher than 3).

Fig. 1 Nonbinary species trees corresponding to the NCBI taxonomic classification, with (a) 56 species for cluster 1 and the 4 HGTs (indicated by arrows) found using the $SH$ index; (b) 61 species for cluster 2 and the 2 HGTs (indicated by arrows) found using the $SH$ index with $\alpha$ equal to 1. We applied the Most Similar Supertree method ($dfit$) [5] implemented in the CLANN software with the $mrp$ criterion, a matrix representation employing the parsimony criterion.
We used the version of the HGT (Horizontal Gene Transfer) algorithm available on the T-Rex web site [2] to identify the scenarios of HGT events that reconcile the species tree and each of the supertrees. We chose the same root for the species trees and the supertrees: the root separating Bacteria from the clade of Eukaryota and Archaea.
For the first cluster, composed of 56 species, we obtained 40 transfers, with 22 regular and 18 trivial HGTs. Trivial HGTs are necessary to transform a non-binary tree into a binary tree. We removed the trivial HGTs and selected among the regular HGTs. The non-trivial HGTs with low representation are most likely due to tree reconstruction artefacts. In Figure 1a, we illustrate only those HGTs that are most represented in the dataset.

We followed the same procedure for the second cluster, composed of 61 species, and obtained 42 transfers, with 28 regular and 14 trivial HGTs (the latter are not represented here). We selected only the most popular HGTs in the dataset; these transfers are represented in Figure 1b.

The transfer linking P. horikoshii to the clade of spirochetes (i.e. B. burgdorferi and T. pallidum) was found by [3, 14]. The transfer of P. horikoshii to P. aerophilum was also found by [14]. These results confirm the existing HGTs of [3, 14].
4 Discussion
Many research groups are estimating trees containing several thousands to hundreds
of thousands of species, toward the eventual goal of the estimation of the Tree of Life,
containing perhaps several million leaves. These estimations present enormous computational challenges, and current methods are likely to fail even on datasets at the low end of this range. One approach to estimating a
large species tree is to use phylogenetic estimation methods (such as maximum like-
lihood) on a supermatrix produced by concatenating multiple sequence alignments
for a collection of markers; however, the most accurate of these phylogenetic estima-
tion methods are extremely computationally intensive for datasets with more than
a few thousand sequences. Supertree methods, which assemble phylogenetic trees
from a collection of trees on subsets of the taxa, are important tools for phylogeny
estimation where phylogenetic analyses based upon maximum likelihood (ML) are
infeasible.
In this article, we described a new algorithm for partitioning a set of phylogenetic trees into several clusters in order to infer multiple supertrees, where the input trees have different, but mutually overlapping, sets of leaves. We presented new formulas
that allow the use of the popular Silhouette and Gap statistic cluster validity indices
along with the Robinson and Foulds topological distance in the framework of tree
clustering based on the popular 𝑘-means algorithm. The new algorithm can be used
to address a number of important issues in bioinformatics, such as the identification
of genes having similar evolutionary histories, e.g. those that underwent the same
horizontal gene transfers or those that were affected by the same ancient duplication
events. It can also be used for the inference of multiple subtrees of the Tree of Life. In
order to compute the Robinson and Foulds topological distance between such pairs
of trees, we can first reduce them to a common set of leaves. After this reduction,
the Robinson and Foulds distance is normalized by its maximum value, which is
equal to 2𝑛 − 6 for two binary trees with 𝑛 leaves. Overall, the good performance
achieved by the new algorithm in both clustering quality and running time makes it
well suited for analyzing large genomic and phylogenetic datasets. A C++ program,
called PhyloClust (Phylogenetic trees Clustering), implementing the discussed tree
partitioning algorithm is freely available at https://github.com/tahiri-lab/
PhyloClust.
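As an illustration of this reduction-and-normalization step, here is a minimal R sketch (an assumed example using the ape and phangorn packages, not the actual PhyloClust C++ implementation; normalized_rf is a hypothetical helper name):

library(ape)
library(phangorn)

# Normalized RF distance between two trees reduced to their common leaves
normalized_rf <- function(tree1, tree2) {
  common <- intersect(tree1$tip.label, tree2$tip.label)  # shared leaf set
  t1 <- keep.tip(tree1, common)                          # reduce to common leaves
  t2 <- keep.tip(tree2, common)
  n <- length(common)
  RF.dist(t1, t2) / (2 * n - 6)  # 2n - 6 is the maximum RF value for two binary trees
}

# Toy example: two random trees with overlapping leaf sets
set.seed(42)
t1 <- rtree(10, tip.label = paste0("s", 1:10))
t2 <- rtree(8,  tip.label = paste0("s", 3:10))
normalized_rf(t1, t2)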
Acknowledgements We would like to thank Andrey Veriga and Boris Morozov for helping us with the analysis of the aminoacyl-tRNA synthetase data. We also thank Compute Canada for providing access to high-performance computing facilities. This work was supported by the Fonds de Recherche sur la Santé of Québec and a University of Sherbrooke grant.
References
1. Barthelemy, J., Monjardet, B.: The median procedure in cluster analysis and social choice
theory. Math. Soc. Sci. 1, 235-267 (1981)
2. Boc, A., Legendre, P., Makarenkov, V.: An efficient algorithm for the detection and classifi-
cation of horizontal gene transfer events and identification of mosaic genes. Algorithms From
And For Nature And Life. pp. 253-260 (2013)
3. Boc, A., Philippe, H., Makarenkov, V.: Inferring and validating horizontal gene transfer events
using bipartition dissimilarity. Syst. Biol. 59, 195-211 (2010)
4. Bock, H.: Clustering methods: a history of k-means algorithms. Selected Contributions In
Data Analysis And Classification. pp. 161-172 (2007)
5. Creevey, C., Fitzpatrick, D., Philip, G., Kinsella, R., O’Connell, M., Pentony, M., Travers, S., Wilkinson, M., McInerney, J.: Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc. Roy. Soc. Lond. B Biol. Sci. 271, 2551-2558 (2004)
6. Godwin, R., Macnamara, L., Alexander, R., Salsbury Jr, F.: Structure and dynamics of tRNA-
met containing core substitutions. ACS Omega. 3, 10668-10678 (2018)
7. Gouy, R., Baurain, D., Philippe, H.: Rooting the tree of life: the phylogenetic jury is still out.
Phil. Trans. Biol. Sci. 370, 20140329 (2015)
8. Hinchliff, C., Smith, S., Allman, J., Burleigh, J., Chaudhary, R., Coghill, L., Crandall, K., Deng,
J., Drew, B., Gazis, R. et al.: Synthesis of phylogeny and taxonomy into a comprehensive tree
of life. Proc. Natl. Acad. Sci. Unit. States Am. 112, 12764-12769 (2015)
9. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theor. 28, 129-137 (1982)
10. MacQueen, J. et al.: Some methods for classification and analysis of multivariate observations.
Proceedings of the Fifth Berkeley Symposium On Mathematical Statistics and Probability. 1,
281-297 (1967)
11. Maddison, D.: The discovery and importance of multiple islands of most-parsimonious trees.
Syst. Biol. 40, 315-328 (1991)
12. Maddison, D., Schulz, K., Maddison, W. et al.: The tree of life web project. Zootaxa. 1668,
19-40 (2007)
13. Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The planar k-means problem is NP-hard.
International Workshop On Algorithms And Computation. pp. 274-285 (2009)
14. Makarenkov, V., Boc, A., Delwiche, C., Philippe, H. et al.: New efficient algorithm for
modeling partial and complete gene transfer scenarios. Data Science And Classification.
341-349 (2006)
15. Robinson, D., Foulds, L.: Comparison of phylogenetic trees. Math. Biosci. 53, 131-147 (1981)
16. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math. 20, 53-65 (1987)
17. Silva, A., Wilkinson, M.: On defining and finding islands of trees and mitigating large island bias. Syst. Biol. 70(6), 1282-1294 (2021)
18. Stockham, C., Wang, L., Warnow, T.: Statistically based postprocessing of phylogenetic anal-
ysis by clustering. Bioinformatics. 18, S285-S293 (2002)
19. Tahiri, N., Willems, M., Makarenkov, V.: A new fast method for inferring multiple consensus
trees using k-medoids. BMC Evol. Biol. 18, 1-12 (2018)
20. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the
gap statistic. J. Roy. Stat. Soc. B Stat. Meth. 63, 411-423 (2001)
21. Whidden, C., Zeh, N., Beiko, R.: Supertrees based on the subtree prune-and-regraft distance.
Syst. Biol. 63, 566-581 (2014)
22. Woese, C., Olsen, G., Ibba, M., Soll, D.: Aminoacyl-tRNA synthetases, the genetic code, and
the evolutionary process. Microbiol. Mol. Biol. Rev. 64, 202-236 (2000)
On Parsimonious Modelling via Matrix-variate t
Mixtures
Salvatore D. Tomarchio
Abstract Mixture models for matrix-variate data have become increasingly popular in recent years. One issue with these models is their potentially high number of parameters. To address this concern, parsimonious mixtures of matrix-variate normal distributions have recently been introduced in the literature. However, when the data contain groups of observations with longer-than-normal tails or atypical observations, the use of the matrix-variate normal distribution for the mixture components may affect the fit of the resulting model. Therefore, we consider a more robust approach based on the matrix-variate 𝑡 distribution for modeling the mixture components. To introduce parsimony, we use the eigen-decomposition of the component scale matrices and we allow the degrees of freedom to be equal across groups. This produces a family of 196 parsimonious matrix-variate 𝑡 mixture models. Parameter estimation is carried out using an AECM algorithm. The use of our parsimonious models is illustrated via a real data application, where parsimonious matrix-variate normal mixtures are also fitted for comparison purposes.
Salvatore D. Tomarchio ( )
University of Catania, Department of Economics and Business, Catania, Italy,
e-mail: [email protected]

1 Introduction

2 Methodology
A finite mixture model for matrix-variate data can be written as

$$f(\mathbf{X};\boldsymbol{\Omega})=\sum_{g=1}^{G}\pi_g\, f(\mathbf{X};\boldsymbol{\Theta}_g), \qquad (1)$$

where $\pi_g$ is the $g$th mixing proportion, such that $\pi_g>0$ and $\sum_{g=1}^{G}\pi_g=1$, $f(\mathbf{X};\boldsymbol{\Theta}_g)$ is the $g$th component pdf with parameter $\boldsymbol{\Theta}_g$, and $\boldsymbol{\Omega}$ contains all of the parameters of the mixture. In this paper, for the $g$th component of model (1), we adopt the MVT distribution having pdf

$$f_{\mathrm{MVT}}(\mathbf{X};\boldsymbol{\Theta}_g)=\frac{|\boldsymbol{\Sigma}_g|^{-\frac{r}{2}}\,|\boldsymbol{\Psi}_g|^{-\frac{p}{2}}\,\Gamma\!\left(\frac{pr+\nu_g}{2}\right)}{(\pi\nu_g)^{\frac{pr}{2}}\,\Gamma\!\left(\frac{\nu_g}{2}\right)}\left[1+\frac{\delta_g\!\left(\mathbf{X};\mathbf{M}_g,\boldsymbol{\Sigma}_g,\boldsymbol{\Psi}_g\right)}{\nu_g}\right]^{-\frac{pr+\nu_g}{2}}, \qquad (2)$$
where $\delta_g(\mathbf{X};\mathbf{M}_g,\boldsymbol{\Sigma}_g,\boldsymbol{\Psi}_g)=\operatorname{tr}\!\left[\boldsymbol{\Sigma}_g^{-1}(\mathbf{X}-\mathbf{M}_g)\boldsymbol{\Psi}_g^{-1}(\mathbf{X}-\mathbf{M}_g)'\right]$, $\mathbf{M}_g$ is the $p\times r$ component mean matrix, $\boldsymbol{\Sigma}_g$ is the $p\times p$ component row scale matrix, $\boldsymbol{\Psi}_g$ is the $r\times r$ component column scale matrix and $\nu_g>0$ is the component degrees of freedom.
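As an illustration, the following R sketch (an assumed helper, not code from the paper; the name dMVT_log is hypothetical) evaluates the logarithm of the pdf in (2) for a single p × r observation:

# Log-density of the matrix-variate t distribution in (2).
# X, M: p x r matrices; Sigma: p x p; Psi: r x r; nu: degrees of freedom.
dMVT_log <- function(X, M, Sigma, Psi, nu) {
  p <- nrow(X); r <- ncol(X)
  E <- X - M
  # delta = tr[ Sigma^{-1} E Psi^{-1} E' ]
  delta <- sum(diag(solve(Sigma, E) %*% solve(Psi, t(E))))
  as.numeric(
    -(r / 2) * determinant(Sigma, logarithm = TRUE)$modulus -
    (p / 2) * determinant(Psi, logarithm = TRUE)$modulus +
    lgamma((p * r + nu) / 2) - (p * r / 2) * log(pi * nu) - lgamma(nu / 2) -
    ((p * r + nu) / 2) * log(1 + delta / nu)
  )
}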
It is worth recalling that the pdf in (2) can be obtained hierarchically via a matrix-variate normal scale mixture model in which the mixing random variable 𝑊 follows a gamma distribution with shape and rate parameters both set to 𝜈𝑔/2 [10]. Specifically, a hierarchical representation of the MVT distribution is
1. 𝑊 ∼ G(𝜈𝑔/2, 𝜈𝑔/2),
2. X | 𝑊 = 𝑤 ∼ N(M𝑔, 𝚺𝑔/𝑤, 𝚿𝑔),
where G(·) denotes the gamma distribution and N(·) denotes the MVN distribution. This representation is convenient for the parameter estimation presented in Section 2.2.
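To make the hierarchical representation concrete, here is a minimal R sketch (an assumed illustration; rmvt_matrix is a hypothetical name) that draws one MVT observation by first sampling W and then a matrix-variate normal:

# Draw one p x r observation from the MVT distribution via its
# gamma scale-mixture representation.
rmvt_matrix <- function(M, Sigma, Psi, nu) {
  p <- nrow(M); r <- ncol(M)
  w <- rgamma(1, shape = nu / 2, rate = nu / 2)  # step 1: mixing variable
  Z <- matrix(rnorm(p * r), p, r)                # iid standard normals
  A <- t(chol(Sigma / w))                        # row factor: A %*% t(A) = Sigma / w
  B <- chol(Psi)                                 # column factor: t(B) %*% B = Psi
  M + A %*% Z %*% B                              # step 2: matrix-variate normal draw
}

set.seed(1)
X <- rmvt_matrix(M = matrix(0, 3, 9), Sigma = diag(3), Psi = diag(9), nu = 5)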
As discussed in Section 1, the mixture model in (1) may be characterized by a potentially high number of parameters. To address this concern, we first use the eigen-decomposition of the component scale matrices 𝚺𝑔 and 𝚿𝑔. In detail, we recall that a generic $q\times q$ scale matrix $\boldsymbol{\Phi}_g$ can be decomposed as [11]

$$\boldsymbol{\Phi}_g=\lambda_g\boldsymbol{\Gamma}_g\boldsymbol{\Delta}_g\boldsymbol{\Gamma}_g', \qquad (3)$$
where $\lambda_g=|\boldsymbol{\Phi}_g|^{1/q}$ governs the volume, $\boldsymbol{\Gamma}_g$ is the orthogonal matrix of eigenvectors of $\boldsymbol{\Phi}_g$ (orientation), and $\boldsymbol{\Delta}_g$ is a diagonal matrix of scaled eigenvalues, with $|\boldsymbol{\Delta}_g|=1$ (shape).

The complete-data log-likelihood underlying the AECM algorithm can be written as the sum $\ell_{1c}(\boldsymbol{\pi};\mathcal{S}_c)+\ell_{2c}(\boldsymbol{\Xi};\mathcal{S}_c)+\ell_{3c}(\boldsymbol{\vartheta};\mathcal{S}_c)$, where

$$\ell_{1c}(\boldsymbol{\pi};\mathcal{S}_c)=\sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\ln\pi_g,$$

$$\ell_{2c}(\boldsymbol{\Xi};\mathcal{S}_c)=\sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\left[-\frac{pr}{2}\ln(2\pi)+\frac{pr}{2}\ln w_{ig}-\frac{r}{2}\ln|\boldsymbol{\Sigma}_g|-\frac{p}{2}\ln|\boldsymbol{\Psi}_g|-\frac{w_{ig}\,\delta_g(\mathbf{X}_i;\mathbf{M}_g,\boldsymbol{\Sigma}_g,\boldsymbol{\Psi}_g)}{2}\right], \qquad (5)$$

$$\ell_{3c}(\boldsymbol{\vartheta};\mathcal{S}_c)=\sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\left\{\frac{\nu_g}{2}\ln\frac{\nu_g}{2}-\ln\Gamma\!\left(\frac{\nu_g}{2}\right)+\left[\frac{\nu_g}{2}-1\right]\ln w_{ig}-\frac{\nu_g}{2}w_{ig}\right\},$$

with $\boldsymbol{\pi}=\{\pi_g\}_{g=1}^{G}$, $\boldsymbol{\Xi}=\{\mathbf{M}_g,\boldsymbol{\Sigma}_g,\boldsymbol{\Psi}_g\}_{g=1}^{G}$ and $\boldsymbol{\vartheta}=\{\nu_g\}_{g=1}^{G}$.
Our AECM algorithm then proceeds as follows (note that parameters marked with one dot are the updates from the previous iteration, while those marked with two dots are the updates at the current iteration):

E-step At the E-step, we compute the quantities

$$\ddot{z}_{ig}=\frac{\dot{\pi}_g\, f_{\mathrm{MVT}}(\mathbf{X}_i;\dot{\boldsymbol{\Theta}}_g)}{\sum_{h=1}^{G}\dot{\pi}_h\, f_{\mathrm{MVT}}(\mathbf{X}_i;\dot{\boldsymbol{\Theta}}_h)} \quad\text{and}\quad \ddot{w}_{ig}=\frac{pr+\dot{\nu}_g}{\dot{\nu}_g+\delta_g(\mathbf{X}_i;\dot{\mathbf{M}}_g,\dot{\boldsymbol{\Sigma}}_g,\dot{\boldsymbol{\Psi}}_g)}. \qquad (6)$$

There is no need to compute the expected value of $\ln W_{ig}$, given that we do not use this quantity to update $\nu_g$.
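A minimal R sketch of update (6), reusing the hypothetical dMVT_log helper above (Xs is an assumed list of N observation matrices; pars is an assumed list with elements pi, M, Sigma, Psi, nu holding the current parameter values):

# E-step: posterior membership probabilities and weights, as in (6).
e_step <- function(Xs, pars) {
  N <- length(Xs); G <- length(pars$pi)
  logf <- matrix(0, N, G)
  for (i in 1:N) for (g in 1:G)
    logf[i, g] <- log(pars$pi[g]) +
      dMVT_log(Xs[[i]], pars$M[[g]], pars$Sigma[[g]], pars$Psi[[g]], pars$nu[g])
  z <- exp(logf - apply(logf, 1, max))  # subtract row maxima for numerical stability
  z <- z / rowSums(z)
  p <- nrow(Xs[[1]]); r <- ncol(Xs[[1]])
  w <- matrix(0, N, G)
  for (i in 1:N) for (g in 1:G) {
    E <- Xs[[i]] - pars$M[[g]]
    delta <- sum(diag(solve(pars$Sigma[[g]], E) %*% solve(pars$Psi[[g]], t(E))))
    w[i, g] <- (p * r + pars$nu[g]) / (pars$nu[g] + delta)
  }
  list(z = z, w = w)
}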
CM-step 1 At the first CM-step, we have the following updates:

$$\ddot{\pi}_g=\frac{\sum_{i=1}^{N}\ddot{z}_{ig}}{N} \quad\text{and}\quad \ddot{\mathbf{M}}_g=\frac{\sum_{i=1}^{N}\ddot{z}_{ig}\,\ddot{w}_{ig}\,\mathbf{X}_i}{\sum_{i=1}^{N}\ddot{z}_{ig}\,\ddot{w}_{ig}}.$$
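In R, these two updates could look as follows (a sketch under the same assumed data structures as above; cm_step1 is a hypothetical name):

# CM-step 1: update mixing proportions and mean matrices.
cm_step1 <- function(Xs, z, w) {
  G <- ncol(z)
  pi_new <- colMeans(z)                       # mixing proportions
  M_new <- vector("list", G)
  for (g in 1:G) {
    a <- z[, g] * w[, g]                      # combined weights per observation
    num <- Reduce(`+`, Map(function(X, ai) ai * X, Xs, a))
    M_new[[g]] <- num / sum(a)                # weighted mean matrix
  }
  list(pi = pi_new, M = M_new)
}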
Because of space constraints, we cannot report here the updates of each par-
simonious structure related to 𝚺𝑔 and 𝚿𝑔 . However, they can be obtained by
generalizing the results in [5]. The only differences consist in the updates of the
row and column scatter matrices of the 𝑔th component, that are here defined as
$$\ddot{\mathbf{W}}_g^{R}=\sum_{i=1}^{N}\ddot{z}_{ig}\,\ddot{w}_{ig}\left(\mathbf{X}_i-\ddot{\mathbf{M}}_g\right)\dot{\boldsymbol{\Psi}}_g^{-1}\left(\mathbf{X}_i-\ddot{\mathbf{M}}_g\right)',$$

$$\ddot{\mathbf{W}}_g^{C}=\sum_{i=1}^{N}\ddot{z}_{ig}\,\ddot{w}_{ig}\left(\mathbf{X}_i-\ddot{\mathbf{M}}_g\right)'\ddot{\boldsymbol{\Sigma}}_g^{-1}\left(\mathbf{X}_i-\ddot{\mathbf{M}}_g\right).$$
CM-step 2 At the second CM-step, we update the degrees of freedom by maximizing the expected partial complete-data log-likelihood, where “partial” refers to the fact that the complete data are now defined as $\mathcal{S}_{pc}=\{\mathbf{X}_i,\mathbf{z}_i\}_{i=1}^{N}$. Then, $\ddot{\nu}_g$ is determined by maximizing

$$\sum_{i=1}^{N}\ddot{z}_{ig}\ln f_{\mathrm{MVT}}(\mathbf{X}_i;\ddot{\boldsymbol{\Theta}}_g) \quad\text{or}\quad \sum_{i=1}^{N}\sum_{g=1}^{G}\ddot{z}_{ig}\ln f_{\mathrm{MVT}}(\mathbf{X}_i;\ddot{\boldsymbol{\Theta}}_g),$$

depending on whether the degrees of freedom are allowed to vary across groups or are constrained to be equal.

3 Real Data Application
Here, we analyze the Municipalities dataset contained in the AER package [13]
for the R statistical software. It consists of expenditure information for 𝑁 = 265
Swedish municipalities over 𝑟 = 9 years (1979–1987). For each municipality, we
measure the following 𝑝 = 3 variables: (i) total expenditures, (ii) total own-source
revenues and (iii) intergovernmental grants received.
We fitted parsimonious MVT-Ms and MVN-Ms with 𝐺 ∈ {1, 2, 3, 4, 5} to the data and, for each family of models, the Bayesian information criterion (BIC) [14] was used to select the best fitting model. According to our results, the best among the MVN-Ms has a BIC of -82362.61, a VVV-EE structure and 𝐺 = 4 groups, while the best among the MVT-Ms has a BIC of -82701.59, a VVE-EE-V structure and 𝐺 = 3 groups. Thus, the overall best fitting model is the one selected among the MVT-Ms. The MVN-Ms seem to overfit the data, given that an additional group is detected. This is not unusual behavior: the tails of normal mixture models cannot adequately accommodate deviations from normality, so additional groups are found in the data [4, 7, 15]. In any case, the best fitting models of the two families agree in finding varying volumes and shapes in the component row scale matrices and equal shapes and orientations in the component column scale matrices.
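A sketch of this selection step (an assumed workflow, not the authors' code; fit_mvt_mixture is a hypothetical fitting routine, and we use the lower-is-better convention BIC = -2ℓ + m ln N):

# Fit candidate models for G = 1,...,5 and keep the one with the smallest BIC.
candidates <- lapply(1:5, function(G) fit_mvt_mixture(Xs, G))  # hypothetical fitting routine
N <- length(Xs)
bic <- sapply(candidates, function(fit) -2 * fit$loglik + fit$npar * log(N))
best <- candidates[[which.min(bic)]]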
Fig. 1 Parallel coordinate plots of the data partition obtained by the VVE-EE-V MVT-Ms, with one panel per variable (Expenditures, Revenues, Grants), each split by group. The dashed lines correspond to the estimated means.

Figure 1 illustrates the parallel coordinate plots of the data partition detected by the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated mean for each variable, across time, in each group. We notice that the first group contains municipalities having, on average, slightly higher expenditures, an intermediate
level of revenues and higher levels of intergovernmental grants than the other two groups. Furthermore, it seems to cluster several outlying observations, as confirmed by the estimated degrees of freedom 𝜈1 = 3.75, which implies quite heavy tails for this mixture component. The second group shows the lowest average levels of expenditures and revenues, but an amount of received grants similar to that of the third group. Interestingly, this group does not present many outlying observations, as also supported by the estimated degrees of freedom 𝜈2 = 10.95. Lastly, the third group has the highest levels of revenues but, as already said, it is similar to the other two groups in the remaining variables. Also in this case, we observe moderately heavy-tailed behavior, given that the estimated degrees of freedom are 𝜈3 = 6.05.
To evaluate the correlations of the variables with each other and over time, for the three groups, we report the correlation matrices R(·) related to the covariance matrices associated with 𝚺𝑔 and 𝚿𝑔.

[Figure: observations plotted in order of increasing classification uncertainty; uncertainty values range from 0.0 to 0.5.]

The municipalities in the first group, excluding a couple of cases, have practically null uncertainties.
This applies to a lesser extent to the municipalities in the other two groups, given
the slightly higher number of exceptions. For example, there are 15 observations
(approximately 5% of the total sample size) that have uncertainty values greater than
0.3. However, and as said above, this is due to the closeness between the groups,
which can be confirmed by looking at the parallel plots in Figure 1.
4 Conclusions
One serious concern with matrix-variate mixture models is their potentially high number of parameters. Furthermore, many real datasets require models with heavier-than-normal tails. To address both aspects, in this paper a family of 196 parsimonious mixture models, based on the matrix-variate 𝑡 distribution, is introduced. The eigen-decomposition of the component scale matrices, as well as constraints on the component degrees of freedom, are used to attain parsimony. An AECM algorithm for parameter estimation has been presented. Our family of models has been fitted to a real dataset along with parsimonious mixtures of matrix-variate normal distributions. The results demonstrate the superior fit of our models and the overfitting tendency of matrix-variate normal mixtures. Lastly, the estimated parameters and data partition for the best of our models have been reported and commented upon.
Acknowledgements This work was supported by the University of Catania grant PIACERI/CRASI
(2020).
References
1. Gallaugher, M. P. B., McNicholas, P. D.: Finite mixtures of skewed matrix variate distributions. Pattern Recognit. 80, 83–93 (2018)
2. Melnykov, V., Zhu, X.: On model-based clustering of skewed matrix data. J. Multivar. Anal.
167, 181–194 (2018)
3. Melnykov, V., Zhu, X.: Studying crime trends in the USA over the years 2000–2012. Adv.
Data Anal. Classif. 13(1), 325–341 (2019)
4. Tomarchio, S. D., Punzo, A., Bagnato, L.: Two new matrix-variate distributions with applica-
tion in model-based clustering. Comput. Stat. Data Anal. 152, 107050 (2020)
5. Sarkar, S., Zhu, X., Melnykov, V., Ingrassia, S.: On parsimonious models for modeling matrix
data. Comput. Stat. Data Anal. 142, 106822 (2020)
6. Tomarchio, S. D., McNicholas, P. D., Punzo, A.: Matrix normal cluster-weighted models. J.
Classif. 38(3), 556–575 (2021)
7. Tomarchio, S. D., Gallaugher, M. P. B., Punzo, A., McNicholas, P. D.: Mixtures of matrix-
variate contaminated normal distributions. J. Comput. Gr. Stat. 1–9 (2022)
8. Tomarchio, S. D., Ingrassia, S., Melnykov, V.: Modelling students’ career indicators via
mixtures of parsimonious matrix-normal distributions. Aust. N. Z. J. Stat. 1–16 (2022)
9. Viroli, C.: Model based clustering for three-way data structures. Bayesian Anal. 6(4), 573–602
(2011)
10. Doğru, F. Z., Bulut, Y. M., Arslan, O.: Finite mixtures of matrix variate t distributions. Gazi
Univ. J. Sci. 29(2), 335–341 (2016)
11. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28(5),
781–793 (1995)
12. Meng, X. L., Van Dyk, D.: The EM algorithm-an old folk-song sung to a fast new tune. J.
Royal Stat. Soc. B. 59(3), 511–567 (1997)
13. Kleiber, C., Zeileis, A.: Applied Econometrics with R. Springer-Verlag, New York (2008)
14. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
15. Gallaugher, M. P. B., Tomarchio, S. D., McNicholas, P. D., Punzo, A.: Multivariate cluster
weighted models using skewed distributions. Adv. Data Anal. Classif. 1–32 (2021)
16. Fraley, C., Raftery, A. E.: Enhanced model-based clustering, density estimation, and discrim-
inant analysis software: MCLUST. J. Classif., 20(2), 263–286 (2003)
17. Tomarchio, S. D., Punzo, A.: Dichotomous unimodal compound models: application to the
distribution of insurance losses. J. Appl. Stat. 47(13-15), 2328–2353 (2020)
Evolution of Media Coverage on Climate Change
and Environmental Awareness: an Analysis of
Tweets from UK and US Newspapers
Abstract Climate change represents one of the biggest challenges of our time.
Newspapers might play an important role in raising awareness on this problem and
its consequences. We collected all tweets posted by six UK and US newspapers in
the last decade to assess whether 1) the space given to this topic has grown, 2) any
breakpoint can be identified in the time series of tweets on climate change, and 3) any
main topic can be identified in these tweets. Overall, the number of tweets posted on
climate change increased for all newspapers during the last decade. Although a sharp
decrease in 2020 was observed due to the pandemic, for most newspapers climate
change coverage started to rise again in 2021. While different breakpoints were
observed, for most newspapers 2019 was identified as a key year, which is plausible
based on the coverage received by activities organized by the Fridays for Future
movement. Finally, using different topic modeling approaches, we observed that, while unsupervised models partly capture topics relevant to climate change, such as those related to politics, health consequences or pollution, semi-supervised models can help increase the informativeness of the words assigned to the topics.
Gianpaolo Zammarchi ( )
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy,
e-mail: [email protected]
Maurizio Romano
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy,
e-mail: [email protected]
Claudio Conversano
University of Cagliari, Viale Sant’Ignazio 17, 09123, Cagliari, Italy, e-mail: [email protected]
1 Introduction
Climate change is one of the biggest challenges for our society. Its consequences, which include, among others, melting glaciers, warming oceans, rising sea levels, and shifting weather and rainfall patterns, are already impacting our health and imposing costs on society. Without drastic action aimed at reducing or preventing human-induced emissions of greenhouse gasses, these consequences are expected to intensify in the coming years. Despite its global and severe impacts, individuals may perceive climate change as an abstract problem [1]. It is also well known that the level of information plays a crucial role in the awareness of a topic (e.g. healthy food [3] and smoking [2]). Media are a crucial source of information and can exert substantial effects on public opinion, thus helping to raise awareness of climate change. For instance, media can explain the consequences of climate change as well as portray actions that governments, communities and single individuals can take. For this reason, it is important to distinguish themes that might have gained popularity from those that may have seen a decrease in interest. Nowadays, social media have become a popular source of information for people from all around the world. Twitter is one of the most popular microblogging services and is used by many traditional newspapers on a daily basis. While we can hypothesize that in the last few years media coverage of climate change might have risen, due for instance to international climate strike movements, the recent emergence of the coronavirus disease 2019 (COVID-19) pandemic might have drawn attention away from other relevant topics.
The aims of this work were to: (1) assess trends in media coverage of climate change using tweets posted by major international newspapers based in the United Kingdom (UK) and the United States (US), and (2) identify the main topics discussed in these tweets using topic modeling.
2 Data and Methods

We downloaded all tweets posted from January 1st, 2012 to December 31st, 2021 from the official Twitter accounts of six widely known newspapers based in the UK (The Guardian, The Independent and The Mirror) or the US (The New York Times, The Washington Post and The Wall Street Journal), leading to a collection of 3,275,499 tweets.
Next, we determined which tweets were related to climate change and environmental
awareness based on the presence of at least one of the following keywords: “climate
change”, “sustainability”, “earth day”, “plastic free”, “global warming”, “pollution”,
“environmentally friendly” or “renewable energy”. We plotted the number of tweets on climate change posted by each newspaper during each year using R v. 4.1.2 [4]. We analyzed the association between the number of tweets on climate change and the total number of tweets posted by each newspaper using Spearman’s correlation analysis. For each year and each newspaper, we computed and plotted the differences in the number of posted tweets compared to the previous year, for both (a) tweets related to climate change and (b) all tweets. Finally, we used the changepoint R package [5] to conduct an analysis aimed at identifying structural breaks, i.e. unexpected changes in a time series. In many applications, it is reasonable to believe that there might be 𝑚 breakpoints (especially if some exogenous event occurs) at which a shift in mean value is observed. The changepoint package estimates the breakpoints using several penalty criteria, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). We estimated the breakpoints using the Binary Segmentation (BinSeg) method [6] implemented in the package, as sketched below.
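A minimal sketch of this step (an assumed, simplified workflow; tweets is a hypothetical data frame with columns text and year):

library(changepoint)

keywords <- c("climate change", "sustainability", "earth day", "plastic free",
              "global warming", "pollution", "environmentally friendly",
              "renewable energy")
pattern <- paste(keywords, collapse = "|")

# Flag climate-related tweets and count them per year
is_climate <- grepl(pattern, tolower(tweets$text))
counts <- table(tweets$year[is_climate])

# Breakpoints in the yearly series via Binary Segmentation with a BIC penalty
fit <- cpt.mean(as.numeric(counts), method = "BinSeg", penalty = "BIC", Q = 4)
cpts(fit)  # estimated breakpoint positions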
Lastly, we used tweets posted by The Guardian to perform topic modeling, a
method for classification of text into topics. Preprocessing (including lemmatization,
removal of stopwords and creation of the document term matrix) was conducted with
tm [7] and quanteda [8] in R. We used two different approaches: 1) Latent Dirichlet Allocation (LDA), implemented in the textmineR R package [9]; and 2) Correlation Explanation (CorEx), an alternative to LDA that allows both unsupervised and semi-supervised topic modeling [10]. A sketch of the LDA step is given below.
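A minimal sketch of the LDA step with textmineR (an assumed, simplified call sequence; guardian_tweets is a hypothetical character vector of preprocessed tweet texts):

library(textmineR)

# Document-term matrix from the preprocessed tweets
dtm <- CreateDtm(doc_vec = guardian_tweets,
                 doc_names = seq_along(guardian_tweets))

# Fit a 10-topic LDA model and inspect coherence and top terms
lda <- FitLdaModel(dtm = dtm, k = 10, iterations = 500, burnin = 200,
                   calc_coherence = TRUE)
summary(lda$coherence)
GetTopTerms(phi = lda$phi, M = 10)  # top 10 terms per topic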
3 Results
Fig. 1 Number of tweets on climate change (A) or total number of tweets (B) posted by the six
newspapers from 2012 to 2021.
For the majority of newspapers, the number of tweets on climate change increased from 2014 to 2019, saw a sharp decrease in 2020, coinciding with the emergence of the COVID-19 pandemic, and rose again in 2021. On the other hand, the
Fig. 2 Year-over-year percentage changes of overall tweets and tweets on climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F: The Wall Street Journal.
number of tweets on climate change posted by The Guardian showed a peak during
2015 and a subsequent decrease. However, it must be noted that The Guardian is
also the newspaper that showed a more pronounced decrease in the overall number
of tweets.
The number of tweets on climate change was significantly positively correlated
with the overall number of tweets posted from 2012 to 2021 for four newspapers (The
Guardian, Spearman’s rho = 0.95, 𝑝 < 0.001; The Mirror, Spearman’s rho = 0.95, 𝑝
< 0.001; The Independent, Spearman’s rho = 0.76, 𝑝 = 0.016; The Washington Post,
Spearman’s rho = 0.70, 𝑝 = 0.031) but not for The New York Times (Spearman’s
rho = 0.18, 𝑝 = 0.63) or The Wall Street Journal (Spearman’s rho = 0.49, 𝑝 = 0.15).
Year-over-year percentage changes among either tweets related to climate change or
all posted tweets can be observed in Figure 2.
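The correlation step can be reproduced along these lines (a sketch; yearly_climate and yearly_total are hypothetical vectors of yearly counts for one newspaper):

# Spearman correlation between yearly climate-related and total tweet counts
ct <- cor.test(yearly_climate, yearly_total, method = "spearman")
ct$estimate  # Spearman's rho
ct$p.value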
Looking at Figure 2, we can observe a great variability in the posted number of
tweets during the years, both for the total number of tweets and for the number of
tweets on climate change. While the analysis aimed at identifying structural changes
Fig. 3 Structural changes in the time series of tweets related to climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F: The Wall Street Journal. The red line represents the years between two breakpoints.
in the time series comprising tweets on climate change identified three or four
breakpoints for all newspapers, wide variability was observed regarding the specific
year in which these structural changes were identified (Figure 3). Despite the great
variability, Figure 3 shows that even if a common breakpoint cannot be identified,
2019 was a key year for five out of six newspapers (except for The Independent).
Finally, we used topic modeling to identify and analyze the main topics discussed by newspapers in their tweets. Due to space limitations, we focus only on The Guardian, since this newspaper showed a trend in contrast with the others. The data come from 2,916 tweets posted by The Guardian, analyzed using LDA and CorEx. For LDA, a range of 5-20 unsupervised topics was tested, with the most
interpretable results obtained with 10 topics (Table 1). The topic coherence ranged
from 0.01 to 0.34 (mean: 0.13). For each topic, bi-gram topic labels were assigned
with the labeling algorithm implemented in textmineR. We can observe that topics
are related to politics or leaders (Topics 3, 7 and 10), environmental scientists or
climate journalists (Topics 1 and 5), energy sources (Topics 4 and 8) and effects
of climate change (Topics 2, 6 and 9). The intertopic distance map obtained with
LDAvis is shown in Figure 4. The area of each circle is proportional to the relative
prevalence of that topic in the corpus, while inter-topic distances are computed based
on Jensen-Shannon divergence.
Table 1 Top terms for the topics identified with LDA, each paired with a representative tweet.

Topic 1: school, strike, march, schoolstrik, climatestrikeuk, ukschoolstrik, schoolstrikeclim, climatemarch, arabia, saudi. Tweet: “EPA wipes its climate change site day before march on Washington”
Topic 2: ocean, ice, environment, john, dana, nuccitelli, air, abraham, sea, reed. Tweet: “Chasing Ice filmmakers plumb the ’bottomless’ depths of climate change - new clip from @GuardianEco”
Topic 3: trump, obama, lead, donald, barack, ivanka, brighton, repli, administr, pick. Tweet: “Trump administration pollution rule strikes final blow against environment”
Topic 4: plastic, fuel, oil, fossil, compani, pictur, wast, big, bay, photo. Tweet: “Engaging with oil companies on climate change is futile”
Topic 5: studi, scientist, research, find, link, say, show, death, prematur, speci. Tweet: “Microplastic pollution revealed ‘absolutely everywhere’ by new research”

Anchored words (reported in bold in the original table): dana_nuccitelli, air_pollution, barack_obama, renewable_energy, john_abraham.
4 Discussion
The present study evaluates how some of the most relevant British and American newspapers have given space to the topic of climate change on their Twitter pages over the last decade. Apart from The Guardian, which shows a decreasing trend in the number of tweets related to climate change, all the other newspapers showed an overall growing trend, except during 2020. During that year, the number of tweets related to climate change declined for all six newspapers, most probably due to the COVID-19 outbreak, which was massively covered by all media. By analyzing the breakpoints in Figure 3, it is possible to observe that 2019 was a relevant year for climate change coverage. This is plausible considering that, starting from the end of 2018, the strikes launched by the Fridays for Future movement to raise awareness of climate change gained high media coverage.
Our topic modeling analysis showed that the main topics identified using unsupervised models such as LDA are mostly related to politics, environmental scientists, energy sources and the effects of climate change. While unsupervised models capture relevant topics, using CorEx we found that a semi-supervised model reaches a higher total correlation, a measure of the informativeness of the topics, than an unsupervised model with the same number of topics.
As future developments, we plan to extend our analyses to newspapers from other countries. We believe our work is useful for gaining more knowledge and awareness about climate change and about how much space relevant newspapers have given to this issue on social media. Increasing the knowledge about the nature of the topics covered by newspapers will lay the basis for future studies aimed at evaluating public awareness of this highly relevant challenge.
References
1. Van Lange, P. A. M., Huckelba, A. L.: Psychological distance: How to make climate change
less abstract and closer to the self. Curr. Opin. Psychol. 42, 49–53 (2021)
2. Wakefield, M., Flay B., Nichter M., Giovino G.: Role of the media in influencing trajectories
of youth smoking. Addiction, 98, 79-103 (2003)
3. Dumanovsky, T., Huang, C. Y., Bassett, M. T., Silver, L. D.: Consumer awareness of fast-food
calorie information in New York City after implementation of a menu labeling regulation.
American Journal of Public Health, 12, 2520-2525 (2010)
4. R Core Team. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2020). Available via http://www.R-project.org
5. Killick, R., Eckley, I. A.: changepoint: An R Package for changepoint analysis. J. Stat. Softw.
58, 1–19 (2014)
6. Scott, A.J., Knott, M.: A cluster analysis method for grouping means in the analysis of variance,
Biometrics, 30, 507-512 (1974)
7. Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25, 1–54
(2008)
8. Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., Matsuo, A.: quanteda:
An R package for the quantitative analysis of textual data. J. Open Source Softw. 3, 774 (2018)
9. Jones, T., Doane, W.: Package ‘textmineR’. Functions for Text Mining and Topic Modeling (2021). Retrieved from https://cran.r-project.org/web/packages/textmineR/textmineR.pdf
10. Gallagher, R., Reing, K., Kale, D., Ver Steeg, G.: Anchored correlation explanation: Topic
modeling with minimal domain knowledge. Trans. Assoc. Comput. 5, 529-542 (2017)
Index
count data, 35
COVID-19, 93, 334
Cunial, E., 29

D
D’Urso, P., 233
data analysis, 283
data mining, 73
decision boundaries, 21
democracy, 283
dependency chains, 101
Di Nuzzo, C., 111
dimensionality reduction, 83
Dobša, J., 121
Dvorák, J., 293
dynamical systems, 373

E
ECM algorithm, 303
EEG, 323
EM algorithm, 11, 353
emotions, 323
environment, 403
evidence-based policy making, 93
expectation-maximization, 263

F
factorial k-means, 213
Faria, B. M., 323
Figueiredo, M., 353
finite mixture model, 353
Fontanella, S., 313
Forbes, F., 263
Fort, G., 263
fraud detection, 131
functional data, 11, 293
functional data analysis, 29, 313, 334
fuzzy sets, 243

G
Gama, J., 131
García-Escudero, L. A., 139
Gaussian mixture model, 183
Gaussian process, 253
Genova, V. G., 147
Giordano, G., 147
Giubilei, R., 155
Górecki, T., 165
graph clustering, 155
graphical LASSO, 313
grocery shopping recommendation, 83

H
Hayashi, K., 175
Hennig, C., 183
hierarchical cluster analysis, 93
hierarchical clustering, 1
Hoshino, E., 175
hyperparameter tuning, 131
hyperquadrics, 21

I
Ievoli, R., 273
Ignaccolo, R., 313
image processing, 193
indicator processes, 233
Ingrassia, S., 21, 111
intelligent shopping list, 83
Ippoliti, L., 313
item classification, 53

J
Janácek, P., 193

K
k-means, 383
Kalina, J., 193
Karafiátová, I., 293
kernel density estimation, 155
kernel function, 111
Kiers, H. A. L., 121
Koshkarov, A., 383

L
L1-penalty, 313
Labiod, L., 73, 203, 213
LaLonde, A., 223
leadership, 363
learning from data streams, 131
leave-one-out cross-validation, 176
Lee, H. K. H., 253
López-Oriona, Á., 233
Love, T., 223
low-energy replacements, 193
LSA, 121

M
machine and deep learning, 323
machine learning, 83
Magopoulou, S., 93
Makarenkov, V., 83, 101
Markov chain Monte Carlo, 223
Markov decision process, 101
Masís, D., 243
matrix-variate, 393
Mayo-Iscar, A., 139
Mazoure, B., 83
measurement error, 53
Menafoglio, A., 333
Meng, R., 253
Migliorati, S., 35
minimum message length, 353
minorization-maximization, 263
mixed-mode official surveys, 53
mixed-type data, 43
mixture model, 35, 223, 313
mixture modelling, 139
mixture models, 393
mixture of regression models, 303
mixtures of regressions, 21
mobility data, 147
mode-based clustering, 155
model based clustering, 139
model selection, 353
model-based cluster analysis, 303
model-based clustering, 11, 21
Morelli, G., 139
motivational factors, 363
multidimensional scaling, 273
multivariate data analysis, 1
multivariate methods, 283
multivariate regression, 35
multivariate time series, 253

N
Nadif, M., 73, 203, 213
Nakanishi, E., 175
neighborhood graph, 1
network analysis, 63
networked data, 203
networks, 155
neural networks, 293
Nguyen, H. D., 263
noise component, 183
nonparametric statistics, 155
number of clusters, 183
numerical smoothing, 243

O
O2S2, 334
Obatake, M., 175
online algorithms, 263
optimized centroids, 193
outlier analysis, 373
Ovtcharova, J., 373

P
pair correlation function, 293
Palazzo, L., 273
Panagiotidou, G., 283
parallel computing, 101
parameter estimation, 263
parsimonious models, 393
Pawlasová, K., 293
Perrone, G., 303
phylogenetic trees, 383
Piasecki, P., 165
political behavior, 283
projection matrix, 176
Pronello, N., 313
proximity measure, 1

R
Ragozini, G., 147
random forest, 165
rare disease, 176
recommender systems, 83
reduced k-means, 121, 213
regional healthcare, 273
Reis, L. P., 323
religion, 283
representation learning, 203
reversible jump, 223
Riani, M., 139
robustness, 193
Rodrigues, D., 323
Romano, M., 403

S
Sakai, K., 175
Sangalli, L. M., 29, 333
Scimone, R., 333
Secchi, P., 333
seemingly unrelated regression, 303
Segura, E., 243
semiparametric regression with roughness penalty, 29
sensitivity and specificity, 176
silhouette width, 313
Silva, O., 343, 363
Silvestre, C., 353
similarity forest, 165
Smith, I., 11
social networks, 63
Soffritti, G., 303
Sousa, Á., 343, 363
sparsity, 193