Weighted Clusterwise Linear Regression Based On Adaptive Quadratic Form Distance
Weighted Clusterwise Linear Regression Based On Adaptive Quadratic Form Distance
Keywords: The standard approach to Clusterwise Regression is the Clusterwise Linear Regression method. This approach
Clusterwise regression can lead to data over-fitting, and it is not able to distinguish linear relationships in groups of observations
Quadratic form distance well separated in the space of explanatory variables. This paper presents a Weighted Clusterwise Linear
Adaptive distances
Regression method to obtain homogeneous clusters of observations while maintaining a proper fitting for
Clustering
the response variable, by the minimization of an optimization criterion that combines a k-means-like criterion
(based on an adaptive quadratic form dissimilarity) in x-space and the criterion of minimum squared residuals
of Regression Analysis. The adaptive metric allows automatic weighing or take into account the correlation
between explanatory variables under multiple constraints types. We explore six constraints types. Experiments
with synthetic and benchmark datasets corroborate the usefulness of the proposed method.
1. Introduction Clustering methods (Bock, 1994; Diday & Simon, 1976) partition
a larger unlabeled (unclassified), heterogeneous,1 dataset into smaller,
The search for hidden patterns and natural groups of observa- homogeneous, groups (clusters) of observations, that are easily man-
tions contained in a data sample may lead to information gains on a aged, independently modeled and analyzed. Observations within each
problem, reveal valuable insights, and allow the discovery of meaning- cluster should be as similar as possible, and as dissimilar as possi-
ful relationships between variables of interest. Through mathematical ble from observations of other clusters, assuming the existence of a
(statistical) models and computational algorithms, it is possible to measure of dissimilarity (or similarity) between data objects.
investigate how the behavior of some (independent) explanatory vari- Clusterwise Regression (CR) consists of a collection of methodolo-
ables affect a distinct (dependent) response variable. Data description gies whose goal is to partition large heterogeneous datasets into smaller
(or explanation), parameter estimation, and prediction are common homogeneous groups (clusters) of observations while, simultaneously,
goals of Regression methods (Montgomery, Peck, & Vining, 2001). fitting a regression model for each group. Grouping in this way makes
Traditionally, a regression method fits a single functional (statistical) it easier to consider and understand the regression relationships pre-
model to a data sample. The estimated model is then applied to describe sented in large heterogeneous datasets. These methods enhance the
(or summarize) a dataset, and, perhaps, make predictions for new development of mathematical models to summarize and describe a
observations. dataset. Similarly to traditional Regression modeling, a Clusterwise
Although for some real-life datasets this single-model approach
Regression model can be a much more convenient and useful summary
results in proper regression modeling, the data sample may be com-
of a dataset than a table or graph.
posed of several unknown smaller groups of observations, in which
Clusterwise Regression can be studied as a finite mixture model
each group behaves according to a local functional model between
that uses maximum likelihood estimation. DeSarbo and Cron (1988),
response and explanatory variables (Späth, 1979; Vicari & Vichi, 2013).
and Wedel and DeSarbo (1995), studied a conditional mixture, maxi-
Therefore, each smaller data group must be individually fitted by a
mum likelihood methodology for performing clusterwise linear regres-
regression model to describe all relationships presented in the dataset
sion. Hennig (2000) investigates the identifiability of the parameters
accurately.
∗ Corresponding author.
E-mail addresses: [email protected] (R.A.M. da Silva), [email protected] (F.A.T. de Carvalho).
1
To avoid misinterpretations of other uses of the term ‘‘heterogeneous’’ in Statistics, in our context, it means that the sampled data comes from several
unlabeled groups, where each group follows its own (true) relationship model.
https://doi.org/10.1016/j.eswa.2021.115609
Received 4 April 2020; Received in revised form 14 March 2021; Accepted 12 July 2021
Available online 17 July 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
of models for data generated by different linear regression distribu- time-consuming and can fail when the clusters are not homogeneous re-
tions with Gaussian errors. García-Escudero, Gordaliza, Mayo-Iscar, garding the explanatory variables. Another considered approach when
and Martín (2010) proposed a robust clusterwise regression based on the clusters are homogeneous and that is less time consuming than the
trimming that allows different scatters for the regression errors together 𝑘-nearest neighbors rule, is to assign the data points to the clusters
with different group weights. Schlittgen (2011) presented a weighted through the nearest cluster center.
least-squares approach to clusterwise regression in which the residuals In addition to the previous concerns, the optimization process of the
for all regressions are used to assign observations to the different CLR method, following the minimum squared residuals criterion, can
groups. Wu and Yao (2016) introduced a semi-parametric mixture of often assign dissimilar objects (in the space of explanatory variables)
quantile regressions model aiming to improve robustness to skewness to the same cluster, because they have the smallest squared residual
and outliers. Mari, Rocci, and Gattone (2017) presented a clusterwise among all the regression models. This can leads to the production of
linear regression that uses a data-dependent soft constrained method overlapping clusters of observations regarding the explanatory vari-
to provide the parameter maximum likelihood estimates. The proposed ables, even when the clusters of the dataset originally do not overlap.
approach imposes soft scale bounds based on the homoscedastic vari- Therefore, often the regression models provided by the CLR method are
ance and a cross-validated tuning parameter. More recently, Mazza not fitted from homogeneous clusters of observations in the space of
and Punzo (2020) introduced mixtures of multivariate contaminated explanatory variables (Brusco et al., 2008; Manwani & Sastry, 2015;
normal regression models that allow for simultaneous robust clustering Vicari & Vichi, 2013). Regression models fitted from homogeneous
and detection of mild outliers. clusters in the space of explanatory variables can be more efficient to
Despite their usefulness, a traditional finite mixture based cluster- achieve a good performance in the prediction task.
wise regression models assume that the independent variables are fixed, Aiming to obtain regression models fitted from homogeneous clus-
only the response variable is random. However, as pointed out by Hen- ters in the space of explanatory variables, Manwani and Sastry (2015)
nig (2000), regarding the model fitting these assumptions imply that proposed the so-called 𝐾-plane method for piecewise linear regression.
the assignment of the data points to the clusters is based exclusively on In this method, the partition, the centroids and the corresponding linear
the response variable and does not take into account the independent regression models for each group are obtained simultaneously though
variables. The assignment independence on the independent variable the iterative optimization of a suitable objective function that combines
is generally untrue and can make the mixture based clusterwise re- a second term of 𝐾-means type with the CLR loss function.
Despite its usefulness, because the 𝐾-plane method uses a term of
gression models with fixed independent variables less suitable in real
traditional 𝐾-means type to obtain homogeneous clusters in the space
applications. A mixture of regression models with random independent
of explanatory variables, it implicitly consider that all variables are
variables, as the cluster-weighted model (CWM; Gershenfeld, 1997) and
equally important as they have the same weight in the construction
related approaches (see, e.g., Dang, Punzo, McNicholas, Ingrassia, &
of the cluster centroids and assignment of the data points to the
Browne, 2017; Ingrassia, Minotti, & Punzo, 2014; Ingrassia, Minotti,
clusters. However, it is well known that some variables are irrelevant
& Vittadini, 2012 and references therein), attempt to overcome this
while others have different degrees of relevance to the clustering task.
problem. In these approaches, the assignment of the data points to the
Moreover, each cluster may have its own different set of relevant
clusters can be affected by the independent variables.
variables (de Carvalho, Tenorio, & Junior, 2006; Chan, Ching, Ng, &
From the point of view of exploratory data analysis, the stan-
Huang, 2004; Diday & Govaert, 1977; Gustafson & Kessel, 1978; Huang,
dard approach for CR is the Clusterwise Linear Regression (CLR)
Ng, Rong, & Li, 2005; Modha & Spangler, 2003). Besides, because it
method (Charles, 1977; Diday et al., 1979; Späth, 1979, 1982). The
uses the standard Euclidean distance to the comparison between data
CLR method provides a partition of the data sample into a previously
points and centroids, it does not take into account the correlation
fixed number 𝐾 of clusters and a regression model for each cluster in a
between the explanatory variables.
way that minimizes the sum of squared residuals of each within-cluster
In this paper, we propose a crisp Clusterwise Regression method,
regression model. The CLR approach merges two main research fields,
that provides clusters taking into account either the correlation be-
Cluster Analysis (unsupervised learning), and Regression Analysis (su- tween the variables or the weight of relevance of the variables whose
pervised learning), where, unlike conventional clustering methods, goal is to obtain homogeneous clusters of observations for the ex-
there is a supervised cluster modeling. In addition to the former ap- planatory variables, which are well-fitted for the response variable.
proaches, other general optimization algorithms have been proposed to We achieve this aim through the use of an adaptive quadratic form
solve the CLR problem. DeSarbo, Oliver, and Rangaswamy (1989) pro- distance, in the space of the explanatory variables, combined with the
posed a simulated annealing algorithm to a multiple response variables traditional criterion of minimum Residual Sum of Squares criterion of
optimization problems. Brusco, Cradit, and Tashchian (2003) gave a Regression Analysis. The rationale is that the clustering procedure can
simulated annealing algorithm for CR market segmentation. Aurifeille adapt in response to the shapes found in the data.
(2000), Aurifeille and Medlin (2001) and Aurifeille and Quester (2003) The proposed method have a basis on the work of Manwani and
proposed a bio-inspired genetic algorithm to solve the CLR problem. Sastry (2015) and Ryoke, Nakamori, and Suzuki (1995), and in our
More recent exploratory clusterwise regression approaches are Bagirov, previous work, da Silva and de Carvalho (2017). It extends the pa-
Ugon, and Mirzayeva (2015), Beck, Azzag, Bougeard, Lebbah, and per da Silva and de Carvalho (2017) using adaptive quadratic distances
Niang (2018), Bougeard, Abdi, Saporta, and Niang (2018), Preda and defined by a positive definite matrix to compare the data points and the
Saporta (2007) and Vicari and Vichi (2013), and references therein. cluster centroids in the 𝐾-means type term of the objective function.
Despite its usefulness, the CLR approach can lead to data over- The matrices of these quadratic distances are computed interactively
fitting (Brusco, Cradit, Steinley, & Fox, 2008; Manwani & Sastry, 2015; under suitable constraints, change at each iteration and can be the same
Vicari & Vichi, 2013). Moreover, additional difficulty concerns the for all clusters (in this case they are named global adaptive quadratic
prediction task. As in the traditional finite mixture clusterwise regres- distances) or different from one cluster to another (in this case they are
sion approach, the assignment of the data points to the clusters is named local adaptive quadratic distances).
based exclusively on the response variable that has not been observed. An often used constraint is to set the determinant of the matrix
Therefore, approaches to select the most suitable (among 𝐾) regres- associated with a given quadratic distance be one (de Carvalho et al.,
sion model to achieve the prediction task based on the explanatory 2006; Diday & Govaert, 1977; Gustafson & Kessel, 1978). The method
variables are needed. Charles (1977) and Preda and Saporta (2005) presented in da Silva and de Carvalho (2017) uses a global adaptive
propose the 𝑘-nearest neighbors rule to assign the data points to the quadratic distance with the restriction that the associated matrix is
clusters and predicting a new outcome. However, this approach can be diagonal with determinant equal to 1. As we will highlight later, this
2
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
is the same as using a weighted Euclidean distance with the constraint Moreover, the error terms are assumed to have the following prop-
that the product of the variable weights is set to 1. Finally, in this paper, erties: (i) (𝜖𝑖(𝑘) ) = 0; (ii) 𝑉 𝑎𝑟(𝜖𝑖(𝑘) ) = 𝜎𝑘2 ; and (iii) 𝐶𝑜𝑣(𝜖𝑖(𝑘) , 𝜖𝑙(𝑘) ) = 0, 𝑖 ≠
we consider also the case of the weighted Euclidean distance with the 𝑙, 𝑒𝑖 , 𝑒𝑙 ∈ 𝐶𝑘 . In matricial notation:
constraint that the sum of the variable weights is set to 1 (Chan et al.,
2004; Huang et al., 2005). All combinations of constraints and metrics 𝐲𝑘 = 𝝁𝑘 + 𝝐 𝑘 = 𝐗̃ 𝑘 𝐛𝑘 + 𝝐 𝑘 (1 ≤ 𝑘 ≤ 𝐾) (3)
add up to a total of six variations of a common optimization criterion. where 𝐛𝑘 is as previous defined, 𝐗̃ = (𝟏|𝐗) is the augmented input
To summarize, the main contribution of this paper is to provide a ( )
matrix, 𝐕𝑘 = diag [𝑉1𝑘 , … , 𝑉𝑛𝑘 ] is a diagonal 𝑛 × 𝑛 matrix of all crisp 𝑛
Clusterwise Regression method that aims to deal with heterogeneous membership values for each cluster 𝑘, 𝐲𝑘 = 𝐕𝑘 𝐲, 𝝁𝑘 = 𝐗̃ 𝑘 𝐛𝑘 , 𝐗̃ 𝑘 = 𝐕𝑘 𝐗̃
datasets, in a context of linear regression, defining an optimization and 𝝐 𝑘 = [𝜖1(𝑘) , … , 𝜖𝑛(𝑘) ]⊤ is the vector of error components. Besides,
criterion that automatically and adaptively either provides a weight the vector 𝝐 𝑘 is assumed to have the following properties: (i) (𝝐 𝑘 ) = 𝟎;
of relevance to each explanatory variable or take into account the and (ii) 𝐶𝑜𝑣(𝝐 𝑘 ) = (𝝐 𝑘 𝝐 ⊤ ) = 𝜎𝑘2 𝐕𝑘 .
𝑘
correlation between the explanatory variables on the clustering process The Clusterwise Linear Regression problem can be stated as follows.
in the dataset, fitting the best linear models within homogeneous Given a dataset (𝐗, 𝐲), and a number 𝐾 ∈ N>0 of cluster; find a K-
clusters.
partition 𝐕 of (𝐗, 𝐲), and 𝐾 regression hyperplanes that minimizes
This paper is organized as follows. In Section 2 we discuss the
relevant related CR methods: Clusterwise Linear Regression, and K-
plane Regression. We then propose the Weighted Clusterwise Linear ∑
𝐾 ∑
𝑛
𝐽𝑐𝑙𝑟 = 𝑉𝑖𝑘 (𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (4)
Regression method in Section 3. Section 4 presents the experimental 𝑘=1 𝑖=1
evaluation comparing all methods. Finally, in Section 5, we present the
final remarks and conclusion. Definition 3. Each vector of regression coefficients 𝐛𝑘 (𝑘 = 1, 2, … , 𝐾)
defines a regression hyperplane, such that
2. Related work
∑
𝑝
𝜇𝑖(𝑘) = 𝐱̃ 𝑖⊤ 𝐛𝑘 = 𝛽0𝑘 + 𝛽𝑗𝑘 𝑥𝑖𝑗 (5)
In this section, we present two CR approaches which are closely
𝑗=1
related to the Weighted Clusterwise Linear Regression method proposed
in the next section. is the 𝑖th predicted value provided by the 𝑘th linear regression model
𝐱̃ 𝑖⊤ 𝐛𝑘 .
2.1. Clusterwise linear regression The combinatorial optimization problem defined by Eq. (4) is com-
putationally intractable. The number of possible K-partitions is too high
The research on Clusterwise Linear Regression (CLR) methods dates
to enumerate them within reasonable computing time (Späth, 1982).
back to the early 70’s through the work of Hans-Hermann Bock (Bock,
Thus, there are some heuristics algorithms (Bock, 2008) proposed in
1969), Christian Charles (Charles, 1977), Helmuth Späth (Späth, 1979,
the clustering literature that leads to useful, acceptable solutions. Späth
1982, 2014); and Edwin Diday (Diday et al., 1979). The CLR method
(1979, 1982) gives an exchange-based algorithm that is dependent on
is a variation of the discrete Generalized Minimum Variance criterion
its initialization, typically converging to a local minimum of the opti-
(also known as Sum of Squares (SSQ) criterion Bock, 2008), where each
mization criterion 𝐽𝑐𝑙𝑟 (Eq. (4)). DeSarbo and Cron (1988), and Wedel
cluster is represented by a prototype hyperplane.
and DeSarbo (1995), studied a conditional mixture, maximum like-
lihood methodology for performing clusterwise linear regression. Fi-
Definition 1. Let 𝑋1 , 𝑋2 , … , 𝑋𝑝 be 𝑝 explanatory variables, and 𝑌
nally, Preda and Saporta (2007) compare a CR method with Principal
be a response variable. Let 𝐸 = {𝑒1 , … , 𝑒𝑛 } a set of individuals, where
Components and Partial Least Squares as regularization methods for a
each individual 𝑒𝑖 ∈ 𝐸 (1 ≤ 𝑖 ≤ 𝑛) is represented by a pair (𝐱𝑖 , 𝑦𝑖 ), with
functional linear regression model.
𝐱𝑖 = [𝑥𝑖1 , … , 𝑥𝑖𝑝 ]⊤ ∈ R𝑝 , where 𝑥𝑖𝑗 ∈ R (1 ≤ 𝑗 ≤ 𝑝) and 𝑦𝑖 ∈ R are the
observed values of 𝑋𝑗 and 𝑌 , respectively, for the 𝑖th observation. A
2.2. K-plane regression
dataset is represented by a pair (𝐗, 𝐲), where 𝐗 = [𝑥𝑖𝑗 ]𝑛×𝑝 is a matrix
with 𝑛 rows (observations) and 𝑝 columns (predictors), and 𝐲 = [𝑦𝑖 ]𝑛×1
Manwani and Sastry (2015) presented the ‘‘Modified’’ K-plane Re-
is the vector of the 𝑛 observed values of the response variable.
gression (KPLANE) method for piecewise linear regression. It estimates
a nonlinear regression function approaching the target function by a
Definition 2. Given a dataset (𝐗, 𝐲), and a number 𝐾 ∈ N>0 of
piecewise linear function, where the input space is partitioned into
clusters; a crisp (hard) K-partition is a disjoint collection of 𝐾 non-
quite distinct regions, and for each group, a linear regression model
empty subsets of (𝐗, 𝐲) whose union is (𝐗, 𝐲). A matrix 𝐕 = [𝑉𝑖𝑘 ]𝑛×𝐾
is fitted. The method faced some issues of CLR method by combining a
denotes a crisp K-partition, such that
{ second term (K-means-like term) with the CLR loss function (Eq. (4)),
1 if the 𝑖𝑡ℎ observation belongs to the 𝑘𝑡ℎ cluster estimating the centroids and the corresponding linear regression mod-
𝑉𝑖𝑘 = (1)
0 otherwise els for each group simultaneously. This additional term tries to ensure
∑ that all observations in a cluster are closer to each other in the x-space.
is subject to a non-emptiness constraint 𝑛𝑖=1 𝑉𝑖𝑘 > 0 (𝑘 = 1, 2, … , 𝐾) The ‘‘Modified’’ K-plane Regression make an attempt to solve the
∑𝐾
and to a disjointness constraint 𝑘=1 𝑉𝑖𝑘 = 1 (𝑖 = 1, 2, … , 𝑛). Let following problem. Given a dataset (𝐗, 𝐲), and a number 𝐾 ∈ N>0 of
𝐾 = (𝐶1 , … , 𝐶𝐾 ) the crisp partition of 𝐸 in 𝐾 clusters and let the
∑ cluster; find a K-partition 𝐕 of (𝐗, 𝐲), 𝐾 regression hyperplanes, and 𝐾
cluster 𝐶𝑘 = {𝑒𝑖 ∈ 𝐸 ∶ 𝑉𝑖𝑘 = 1}, with 𝑛𝑘 = |𝐶𝑘 | = 𝑛𝑖=1 𝑉𝑖𝑘 . prototype centroids that minimizes
For each observation 𝑒𝑖 belonging to cluster 𝐶𝑘 , it is assumed that ∑
𝐾 ∑
𝑛
[ ]
the dependent variable 𝑌 and the independent variables 𝑋𝑗 are related 𝐽𝑘𝑝𝑙𝑎𝑛𝑒 = 𝑉𝑖𝑘 (𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 + 𝛾‖𝐱𝑖 − 𝐠𝑘 ‖2 (6)
by the following linear regression relationships: 𝑘=1 𝑖=1
∑
𝑝 where 𝐱̃ 𝑖⊤ and 𝐛𝑘 are as previously defined and 𝐠⊤ 𝑘
= [𝑔1𝑘 , … , 𝑔𝑝𝑘 ] ∈ R𝑝
𝑦𝑖 = 𝜇𝑖(𝑘) + 𝜖𝑖(𝑘) = 𝐱̃ 𝑖⊤ 𝐛𝑘 + 𝜖𝑖(𝑘) = 𝛽0𝑘 + 𝛽𝑗𝑘 𝑥𝑖𝑗 + 𝜖𝑖(𝑘) (2) defines the centroid for the 𝑘th cluster; ‖ ⋅ ‖ is the Euclidean norm;
𝑗=1
and 𝛾 is a user-defined hyper-parameter that decides the relative weight
where, 𝐱̃ 𝑖⊤ = [1, 𝑥𝑖1 , … , 𝑥𝑖𝑝 ] ∈ R(𝑝+1) is the 𝑖th augmented row of matrix between the two components of the 𝐽𝑘𝑝𝑙𝑎𝑛𝑒 criterion (Eq. (6)).
𝐗; 𝐛⊤ = [𝛽0𝑘 , 𝛽1𝑘 , … , 𝛽𝑝𝑘 ] ∈ R(𝑝+1) is the vector of regression coefficients In KPLANE method each crisp cluster 𝑘 is characterized by a pair
𝑘 ∑
for the 𝑘th linear model; and 𝜇𝑖(𝑘) = 𝐱̃ 𝑖⊤ 𝐛𝑘 = 𝛽0𝑘 + 𝑝𝑗=1 𝛽𝑗𝑘 𝑥𝑖𝑗 . (𝐠𝑘 , 𝐛𝑘 ). The centroids 𝐠1 , 𝐠2 , … , 𝐠𝐾 can be used to predict the value
3
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
of the response variable of a new observation, similarly as in K-means 3.2. The optimization algorithm
clustering algorithm, choosing the cluster ℎ with the closest center 𝐠ℎ ,
and using the ℎth fitted regression hyperplane 𝐛ℎ in order to predict Find the best crisp K-partition for a given dataset (𝐗, 𝐲) and number
the value of the response variable of a new observation. of clusters, 𝐾, is a hard combinatorial problem. We proceed with an
The K-plane Regression algorithm is an extension of the batch extension of a heuristic algorithm typically used for the (Lloyd) K-
(Lloyd) K-means clustering (Bock, 2008), where, from an initial K- means method. In this way, we give a WCLR algorithm that tries to
partition, the cluster centers, the regression coefficients of each cluster, approximate an optimum crisp K-partition by the minimization of op-
and the new K-partition are computed iteratively until convergence. timization criterion 𝐽𝑤𝑐𝑙𝑟 (Eq. (7)), using an iterative procedure of four
Although the K-plane Regression method was proposed to fit piece- ‘‘alternating minimization’’ steps (representation, weighting, modeling,
wise linear functions, being an extension of the CLR method, it can and assignment) in turn, until convergence.
be used to model datasets, in the context of Clusterwise Regression, The proposed algorithm automatically adjusts the weight matrices,
when the goal is to obtain dense clusters of observations in the space 𝐖1 , 𝐖2 , … , 𝐖𝐾 , at the ‘‘weighting’’ step, to the current K-partition
of explanatory variables. state for each crisp variation of the WCLR approach and fitting local
variations on the explanatory variables relationships for each cluster.
The weight matrices can also be the same for all clusters, that is,
3. Weighted clusterwise linear regression based on adaptive 𝐖1 = 𝐖2 = ⋯ = 𝐖𝐾 = 𝐖 modeling one global weight matrix, 𝐖, and
quadratic form distance reflecting the relationship between explanatory variables within all 𝐾
clusters. We give both variants.
The aim of the Weighted Clusterwise Linear Regression approach In order to prevent the WCLR optimization problem to ending in
is to achieve simultaneously dense clusters of observations, concerning trivial solutions, the values of the 𝐾 weight matrices 𝐖1 , 𝐖2 , … , 𝐖𝐾 of
the explanatory variables, while maintaining a proper fitting for the the adaptive metric must follow a constraint. In this paper, we explore
response variable. this feature and extend our previous work (da Silva & de Carvalho,
2017) with two metrics types: a quadratic distance metric with two
3.1. The optimization problem types of constraints; and the Euclidean distance with four kinds of
constraints. We describe all six constraints types for WCLR method.
The 𝐽𝑤𝑐𝑙𝑟 criterion (Eq. (7)) can be rewritten as follows.
The Weighted Clusterwise Linear Regression (WCLR) approach can
provide crisp (hard) partition of a dataset. The WCLR method is a 𝐾 ∑
∑ 𝑛 [ ]
2
combination of the standard Clusterwise Linear Regression (Charles, 𝐽𝑤𝑐𝑙𝑟 = 𝑉𝑖𝑘 𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (9)
𝑘
𝑘=1 𝑖=1
1977; Diday et al., 1979; Späth, 1979, 1982) and K-means-like (Mac-
∑𝐾 ∑ 𝑛
[ ]
Queen, 1967; Mao & Jain, 1996) clustering with quadratic distance and = 𝑉𝑖𝑘 (𝐱𝑖 − 𝐠𝑘 )⊤ (𝐖𝑘 )(𝐱𝑖 − 𝐠𝑘 )
automatic computation of weights matrices 𝐖1 , 𝐖2 , … , 𝐖𝐾 . 𝑘=1 𝑖=1
The quadratic form distance is defined by 𝐾 symmetric positive ∑
𝐾
[ ]
definite matrices, 𝐖1 , 𝐖2 , … , 𝐖𝐾 . The advantage of using an adaptive + 𝛼 (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )⊤ (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )
𝑘=1
quadratic distance over traditional Euclidean distance is that the clus-
tering algorithm can model clusters with non-spherical shapes (de Car- 3.2.1. Representation step
valho et al., 2006; Diday & Govaert, 1977; Gustafson & Kessel, 1978) in This step provides the optimal minimizer w.r.t the cluster represen-
order to find better data partition and model fitting. Besides, the benefit tatives (prototypes) 𝐠𝑘 . For any fixed K-partition 𝐕, 𝐾 weights matrices
of using a combined criterion is that its derivation as an optimization 𝐖1 , 𝐖2 , … , 𝐖𝐾 , and 𝐾 regression coefficients vectors 𝐛1 , 𝐛2 , … , 𝐛𝐾 , the
problem falls into already known optimization problems, simplifying criterion 𝐽𝑤𝑐𝑙𝑟 (Eq. (9)) is minimized w.r.t. the cluster centroid 𝐠𝑘 by
the interpretation of the estimated intra-cluster regression hyperplanes the estimated centroid 𝐠̂ 𝑘 , such that 𝐽𝑤𝑐𝑙𝑟 (𝐠𝑘 ) ≥ 𝐽𝑤𝑐𝑙𝑟 (̂𝐠𝑘 ).
models and prototypes centroids of each cluster. From
Given a dataset (𝐗, 𝐲), and a number 𝐾 ∈ N>0 of cluster; find a
𝜕 ∑∑
𝐾 𝑛
𝜕𝐽𝑤𝑐𝑙𝑟 [ ]
K-partition 𝐕 of (𝐗, 𝐲) (Definition 2), 𝐾 regression hyperplanes, 𝐾 pro- = 𝑉 (𝐱 − 𝐠𝑘 )⊤ (𝐖𝑘 )(𝐱𝑖 − 𝐠𝑘 ) (10)
totype centroids, and 𝐾 weight matrices, such that the 𝐽𝑤𝑐𝑙𝑟 criterion 𝜕𝐠𝑘 𝜕𝐠𝑘 𝑘=1 𝑖=1 𝑖𝑘 𝑖
is minimized 𝜕𝐽
and by setting the partial derivative to zero, 𝜕𝐠𝑤𝑐𝑙𝑟 = 0, and after some
∑
𝐾 ∑
𝑛 [ ] 𝑘
algebra, the 𝑘th estimated cluster prototype 𝐠̂ 𝑘 is obtained in the same
2
𝐽𝑤𝑐𝑙𝑟 = 𝑉𝑖𝑘 𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (7)
𝑘 way as in the ordinary K-means (MacQueen, 1967) method,
𝑘=1 𝑖=1
∑𝑛
∑𝐾 ∑ 𝑛
[ ] 𝑉𝑖𝑘 𝐱𝑖
= 𝑉𝑖𝑘 (𝐱𝑖 − 𝐠𝑘 )⊤ (𝐖𝑘 )(𝐱𝑖 − 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 𝐠̂ 𝑘 = ∑𝑖=1
𝑛 (𝑘 = 1, 2, … , 𝐾) (11)
𝑘=1 𝑖=1 𝑖=1 𝑉𝑖𝑘
∑
where 𝑛𝑖=1 𝑉𝑖𝑘 denotes the cardinality of the 𝑘th cluster.
where 𝐱̃ 𝑖⊤ , 𝐛𝑘 and 𝐠𝑘 are as previously defined and
2
𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) = (𝐱𝑖 − 𝐠𝑘 )⊤ (𝐖𝑘 )(𝐱𝑖 − 𝐠𝑘 ) (8) 3.2.2. Weighting step
𝑘
This step provides the optimal minimizer w.r.t. to the weight ma-
is a suitable adaptive quadratic distance between the 𝑖th observation 𝐱𝑖 trices 𝐖𝑘 . For any fixed K-partition 𝐕, prototype centroid system
and the 𝑘th cluster prototype 𝐠𝑘 parameterized by the matrix 𝐖𝑘 . 𝐠1 , 𝐠2 , … , 𝐠𝐾 , and regression coefficients vectors 𝐛1 , 𝐛2 , … , 𝐛𝐾 , the cri-
In addition to the number of clusters, 𝐾, the WCLR method has an terion 𝐽𝑤𝑐𝑙𝑟 is partially minimized w.r.t. the weights matrix 𝐖𝑘 by the
other hyper-parameter that must be defined a-priori by the user, 𝛼 ∈ estimated weight matrix 𝐖 ̂𝑘 , such that 𝐽𝑤𝑐𝑙𝑟 (𝐖𝑘 ) ≥ 𝐽𝑤𝑐𝑙𝑟 (𝐖
̂𝑘 ).
R>0 , that decides the relative weight between the two components of Quadratic Distances
𝐽𝑤𝑐𝑙𝑟 criterion. However, in the case of the variant of WCLR where the Following de Carvalho et al. (2006), Diday and Govaert (1977) and
comparison between the data points and the centroids uses a weighted Gustafson and Kessel (1978), first we consider a method variant (named
Euclidean distance with the constraint that the sum of the variable WCLRqpl) with local adaptive quadratic distances. The weight matrices
weights are set to 1, there is still 𝜃 ∈ R≥1 , an additional smoothing of these quadratic distances are computed interactively under suitable
hyper-parameter for the weights of the variables. constraints, change at each iteration and are different from one cluster
4
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
within cluster 𝑘. We can observe that the local weight matrix 𝐖 ̂𝑘 is 𝑤𝑗𝑘 > 0 and 𝑤𝑗𝑘 = 1 (21)
𝑗=1
the inverse of the matrix of variance–covariances 𝐐𝑘 times the 𝑝th root
of its determinant. and obtain
However, it is not always necessary to have different weight matri- ( )
∑
𝐾 ∏
𝑝
ces 𝐖1 , 𝐖2 , … , 𝐖𝐾 for each cluster to obtain good partitioning of the 𝐿𝑤𝑐𝑙𝑟𝑒𝑝𝑙 = 𝐽𝑤𝑐𝑙𝑟 + 𝜆𝑘 1− 𝑤𝑗𝑘 (22)
dataset, and excess of weights can lead to data overfitting. 𝑘=1 𝑗=1
Therefore, we consider a method variant (named WCLRqpg) with then, we compute the partial derivatives of 𝐿𝑤𝑐𝑙𝑟𝑒𝑝𝑙 with respect to
global adaptive quadratic distances. The weight matrices of these 𝑤𝑗𝑘 and 𝜆𝑘 . By setting the partial derivatives to zero, and after some
quadratic distances are computed interactively under suitable con- algebra, we obtain
straints, change at each iteration and are the same for all clusters.
{∏𝑝 [∑𝑛 ]} 1
In such case, we can define a global weight matrix 𝐖 for explanatory ℎ=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖ℎ − 𝑔ℎ𝑘 )2 𝑝
∑
𝐾
[ ]
where + 𝛼 (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )⊤ (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )
∑
𝐾 ∑
𝑛
[ ] 𝑘=1
𝐐= 𝑉𝑖𝑘 (𝐱𝑖 − 𝐠𝑘 )(𝐱𝑖 − 𝐠𝑘 )⊤ (18) where
𝑘=1 𝑖=1
∑
𝑝
is the combined matrix of variance–covariances within the clusters. 2
𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) = (𝑤𝑗 )(𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2 (25)
We can observe that the global weight matrix 𝐖̂ is the inverse of the 𝑗=1
matrix of the combined variance–covariances 𝐐 times the 𝑝th root of is a suitable weighted Euclidean distance between the 𝑖th observation 𝐱𝑖
its determinant.
and the 𝑘th cluster prototype 𝐠𝑘 parameterized by the diagonal matrix
Weighted Euclidean Distances ( )
𝐖 = diag 𝑤1 , … , 𝑤𝑝 .
If the weight matrices 𝐖𝑘 of the adaptive distance metric is diag-
Following the same methodology as above, we compute the set of
onal, the adaptive quadratic distance becomes the weighted Euclidean
weights as
distance and the 𝐽𝑤𝑐𝑙𝑟 criterion becomes
{∏ [∑ ]} 1
[ ] 𝑝 𝐾 ∑𝑛 𝑝
∑
𝐾 ∑
𝑛
ℎ=1 𝑘=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖ℎ − 𝑔ℎ𝑘 )2
2
𝐽𝑤𝑐𝑙𝑟 = 𝑉𝑖𝑘 𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (19) 𝑤̂ 𝑗 = ∑𝐾 ∑𝑛 (26)
𝑘
𝑘=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2
[ ] 𝑘=1 𝑖=1
∑𝐾 ∑ 𝑛 ∑
𝑝
= 𝑉𝑖𝑘 (𝑤𝑗𝑘 )(𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2 (1 ≤ 𝑗 ≤ 𝑝)
𝑘=1 𝑖=1 𝑗=1
We named this variant WCLRepg, where epg stands for ‘‘Euclidean
∑
𝐾
[ ] Product Global’’. We can observe that the weight 𝑤̂ 𝑗 of relevance of
+ 𝛼 (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 ) (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )
⊤
variable 𝑗 is inversely proportional to the combined variance of 𝑗 into
𝑘=1
5
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
the clusters. Relevant variables into the whole cluster partition have is a suitable weighted Euclidean distance between the 𝑖th observation 𝐱𝑖
𝑤̂ 𝑗 > 1, whereas less relevant variables have 𝑤̂ 𝑗 < 1. and the 𝑘th cluster prototype 𝐠𝑘 parameterized by the diagonal matrix
( )
Besides the previous weighted Euclidean distance under local 𝐖 = diag (𝑤1 )𝜃 , … , (𝑤𝑝 )𝜃 and by the smoothing hyper-parameter
and global ‘‘product-to-one’’ restrictions, following (Chan et al., 2004; 𝜃 ∈ R>1 (Chan et al., 2004) for the weights of the variables.
Huang et al., 2005), here we consider weighted Euclidean distances Following the same methodology as above, we compute the set of
under ‘‘sum-to-one’’restrictions to compare the data points and the weights as follows:
cluster representatives.
−1
We propose other two variants of the WCLR method, w.r.t the ⎡ 𝑝 ( ∑𝐾 ∑𝑛 ) 1 ⎤
⎢∑
2 𝜃−1
weights of Euclidean distance under the ‘‘sum-to-one’’restriction. The 𝑘=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖𝑗 − 𝑔𝑗𝑘 ) ⎥
𝑤̂ 𝑗 = ⎢ ∑𝐾 ∑𝑛 ⎥ (34)
2
first is the WCLResl variant, where esl stands for ‘‘Euclidean Sum ⎢ℎ=1 𝑘=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖ℎ − 𝑔ℎ𝑘 ) ⎥
⎣ ⎦
Local’’. In this case, the 𝐽𝑤𝑐𝑙𝑟 criterion becomes
∑
𝐾 ∑
𝑛 [ ] (1 ≤ 𝑗 ≤ 𝑝)
2
𝐽𝑤𝑐𝑙𝑟 = 𝑉𝑖𝑘 𝑑(𝐖 (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (27)
𝑘 ,𝜃) We can observe that the weight 𝑤̂ 𝑗 of relevance of variable 𝑗 is
𝑘=1 𝑖=1
[ 𝑝 ] inversely proportional to the (𝜃 − 1)th root of the combined variance of
∑𝐾 ∑ 𝑛 ∑
= 𝑉𝑖𝑘 (𝑤𝑗𝑘 )𝜃 (𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2 𝑗 into the clusters. Relevant variables into the whole cluster partition
𝑘=1 𝑖=1 𝑗=1 have 𝑤̂ 𝑗 > 1𝑝 , whereas less relevant variables have 𝑤̂ 𝑗 < 1𝑝 .
∑
𝐾
[ ]
+ 𝛼 (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )⊤ (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 ) Remark 1. The ‘‘sum-to-one’’ restriction applied on the 𝐾-means-
𝑘=1
like term of the objective function permits weights of variables equal
where to zero, the ‘‘product-to-one’’ restriction does not. As a consequence,
∑
𝑝
‘‘sum-to-one’’ restriction is able to make variable selection in the clus-
2
𝑑(𝐖 (𝐱𝑖 , 𝐠𝑘 ) = (𝑤𝑗𝑘 )𝜃 (𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2 (28)
𝑘 ,𝜃) tering optimization step (not variable selection in regression), at the
𝑗=1
cost of one more hyper-parameter 𝜃 ∈ R>1 , that needs to be fixed
is a suitable weighted Euclidean distance between the 𝑖th observation
either in advance by the user or in the framework of a cross-validation
𝐱𝑖 and the 𝑘th cluster
( prototype 𝐠𝑘 parameterized
) by the diagonal
scheme.
matrix 𝐖𝑘 = diag (𝑤1𝑘 )𝜃 , … , (𝑤𝑝𝑘 )𝜃 and by a smoothing hyper-
parameter 𝜃 ∈ R>1 (Chan et al., 2004) for the weights of the variables.
This parameter is needed to find optimal weights using the Lagrange 3.2.3. Modeling step
Multipliers method. It must be fixed either in advance by the user or This step provides the optimal minimizer w.r.t the regression coeffi-
in the framework of a cross-validation scheme. cients vectors 𝐛𝑘 . For any fixed K-partition 𝐕, prototype centroid system
We use the method of Lagrange Multipliers, under constraints 𝐠1 , 𝐠2 , … , 𝐠𝐾 , and 𝐾 weights matrices 𝐖1 , 𝐖2 , … , 𝐖𝐾 , the criterion
𝐽𝑤𝑐𝑙𝑟 (Eq. (9)) is minimized w.r.t. 𝐛𝑘 by the estimated regression
∑
𝑝
𝑤𝑗𝑘 ≥ 0 and 𝑤𝑗𝑘 = 1 (29) coefficients vector ̂𝐛𝑘 , such that 𝐽𝑤𝑐𝑙𝑟 (𝐛𝑘 ) ≥ 𝐽𝑤𝑐𝑙𝑟 (̂𝐛𝑘 ). From
𝑗=1
𝜕 ∑[
𝐾
𝜕𝐽𝑤𝑐𝑙𝑟 ]
and obtain =𝛼 (𝐲 − 𝐗̃ 𝑘 𝐛𝑘 )⊤ (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 ) (35)
( ) 𝜕𝐛𝑘 𝜕𝐛𝑘 𝑘=1 𝑘
∑
𝐾 ∑
𝑝
𝐿𝑤𝑐𝑙𝑟𝑒𝑠𝑙 = 𝐽𝑤𝑐𝑙𝑟 + 𝜆𝑘 1− 𝑤𝑗𝑘 (30) 𝜕𝐽𝑤𝑐𝑙𝑟
and by setting the partial derivative to zero 𝜕𝐛𝑘
= 0; and after some
𝑘=1 𝑗=1
algebra, the 𝑘th estimated regression coefficients vector ̂𝐛𝑘 is computed
Then, we compute the partial derivatives of 𝐿𝑤𝑐𝑙𝑟𝑒𝑠𝑙 with respect to the
similarly as in the Clusterwise Linear Regression (Späth, 1979) method
𝑤𝑗𝑘 and 𝜆𝑘 , and by setting the partial derivatives to zero, and after some
as
algebra, we obtain
( ∑𝑛 ) 1 −1 ̂𝐛𝑘 = (𝐗̃ ⊤ 𝐗̃ 𝑘 )−1 𝐗̃ ⊤ 𝐲𝑘 (1 ≤ 𝑘 ≤ 𝐾) (36)
⎡∑𝑝
− 𝑔𝑗𝑘 )2 𝜃−1 ⎤ 𝑘 𝑘
𝑖=1 𝑉𝑖𝑘 (𝑥𝑖𝑗
𝑤̂ 𝑗𝑘 =⎢ ∑𝑛 ⎥ (31) Also, in order to ensure a necessary condition for the optimization
⎢ℎ=1 𝑖=1 𝑉𝑖𝑘 (𝑥𝑖ℎ − 𝑔ℎ𝑘 )2 ⎥
⎣ ⎦ problem in Eq. (36) to have a solution, we must add the side con-
∑
(1 ≤ 𝑗 ≤ 𝑝; 1 ≤ 𝑘 ≤ 𝐾) dition (Montgomery et al., 2001; Späth, 1979) 𝑛𝑖=1 𝑉𝑖𝑘 > 𝑝 + 1(𝑘 =
1, 2, … , 𝐾) to the already defined K-partition (Definition 2) constraints.
We can observe that the weight 𝑤̂ 𝑗𝑘 of relevance of variable 𝑗 into
cluster 𝑘 is inversely proportional to the (𝜃 − 1)th root of the variance
of 𝑗 within cluster 𝑘. Relevant variables into cluster 𝑘 have 𝑤̂ 𝑗𝑘 > 1𝑝 , 3.2.4. Assignment step
1 This step provides the optimal minimizer w.r.t the K-partition 𝐕. For
the less relevant variables have 𝑤̂ 𝑗𝑘 < 𝑝
.
any fixed prototype centroid system 𝐠1 , … , 𝐠𝐾 , and 𝐾 weights matrices
We can also define a global set of weights for each explanatory
𝐖1 , … , 𝐖𝐾 , and regression coefficients vectors 𝐛1 , … , 𝐛𝐾 , the criterion
variable, so that 𝑤𝑗1 = 𝑤𝑗2 = ⋯ = 𝑤𝑗𝐾 = 𝑤𝑗 . This leads to the WCLResg
variant and in this case the 𝐽𝑤𝑐𝑙𝑟 criterion becomes 𝐽𝑤𝑐𝑙𝑟 provided by Eqs. (9) or (19) can be rewritten as
(𝐾 )
[ ] ∑𝑛 ∑ [ ]
∑
𝐾 ∑
𝑛
2
𝐽𝑤𝑐𝑙𝑟 = 2
𝑉𝑖𝑘 𝑑(𝐖,𝜃) (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (32) 𝐽𝑤𝑐𝑙𝑟 = 𝑉𝑖𝑘 𝑑𝐖 (𝐱𝑖 , 𝐠𝑘 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑘 )2 (37)
𝑘
𝑘=1 𝑖=1 𝑖=1 𝑘=1
[ 𝑝 ]
∑𝐾 ∑ 𝑛 ∑ where 𝑑𝐖 2 (𝐱 , 𝐠 ) is computed, respectively, by Eqs. (8) and (20).
𝑖 𝑘
𝜃 2 𝑘
= 𝑉𝑖𝑘 (𝑤𝑗 ) (𝑥𝑖𝑗 − 𝑔𝑗𝑘 ) Stated this way, the objective function above is minimized w.r.t. 𝐕
𝑘=1 𝑖=1 𝑗=1 ̂ such that 𝐽𝑤𝑐𝑙𝑟 (𝐕) ≥ 𝐽𝑤𝑐𝑙𝑟 (𝐕),
̂ if the
by a minimum-cost K-partition 𝐕,
∑
𝐾
[ ] WCLR cost function (Eqs. (9) or (19)) is minimized for each observation
+ 𝛼 (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )⊤ (𝐲𝑘 − 𝐗̃ 𝑘 𝐛𝑘 )
𝑘=1 (𝐱𝑖 , 𝑦𝑖 ). This is achieved when
where ⎧ [ ]
2 (𝐱 , 𝐠 ) + 𝛼(𝑦 − 𝐱
̃ 𝑖⊤ 𝐛𝑙 )2
∑
𝑝
̂ ⎪1 if 𝑘 = argmin1≤𝑙≤𝐾 𝑑𝐖 𝑖 𝑙 𝑖
2 𝑉𝑖𝑘 = ⎨ 𝑙 (38)
𝑑(𝐖,𝜃) (𝐱𝑖 , 𝐠𝑘 ) = (𝑤𝑗 )𝜃 (𝑥𝑖𝑗 − 𝑔𝑗𝑘 )2 (33) ⎪0 otherwise
𝑗=1 ⎩
6
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
computed for all 1 ≤ 𝑖 ≤ 𝑛 and 1 ≤ 𝑘 ≤ 𝐾 values of K-partition matrix Algorithm 1 WCLR Algorithm
𝐕. 1: INPUT
In the same way, the minimizers of the variants of WCLR with cost 2: Given a dataset (𝐗, 𝐲), set 𝐾 ∈ N>0 , and 𝛼 ∈ R>0
functions provided by Eqs. (15) and (24) are 3: For 𝑊 𝐶𝐿𝑅𝑒𝑝𝑔 , 𝑊 𝐶𝐿𝑅𝑒𝑝𝑙 , 𝑊 𝐶𝐿𝑅𝑚𝑝𝑔 and 𝑊 𝐶𝐿𝑅𝑚𝑝𝑙 , set 𝜃 ← 1
{ [ 2 ] 4: For 𝑊 𝐶𝐿𝑅𝑒𝑠𝑔 and 𝑊 𝐶𝐿𝑅𝑒𝑠𝑙 , set 𝜃 ∈ R>1
1 if 𝑘 = argmin1≤𝑙≤𝐾 𝑑𝐖 (𝐱𝑖 , 𝐠𝑙 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑙 )2 5: OUTPUT
𝑉̂𝑖𝑘 = (39)
0 otherwise 6: The centroids 𝐠1 , 𝐠2 , … , 𝐠𝐾 , and coefficients vectors 𝐛1 , 𝐛2 , … , 𝐛𝐾
7: For 𝑊 𝐶𝐿𝑅𝑒𝑝𝑔 , 𝑊 𝐶𝐿𝑅𝑒𝑠𝑔 and 𝑊 𝐶𝐿𝑅𝑞𝑝𝑔 , the weight matrix 𝐖
where 𝑑𝐖2 (𝐱 , 𝐠 ) is computed, respectively, according to Eqs. (16) and 8: For 𝑊 𝐶𝐿𝑅𝑒𝑝𝑙 , 𝑊 𝐶𝐿𝑅𝑒𝑠𝑙 and 𝑊 𝐶𝐿𝑅𝑞𝑝𝑙 , the weight matrices
𝑖 𝑙
(25). 𝐖1 , 𝐖2 , … , 𝐖𝐾
9: The K-partition matrix 𝐕
Finally, the minimizers of the variants of WCLR with cost functions
10: {Initialization}
provided by Eqs. (27) and (32) are, respectively,
11: Randomly initialize the K-partition 𝐕(0) according to Definition 2
⎧ [ ] 12: Set the maximum number of iterations, 𝑇 ∈ N≥1
2 (𝐱𝑖 , 𝐠𝑙 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑙 )2
⎪ 1 if 𝑘 = argmin1≤𝑙≤𝐾 𝑑(𝐖 13: {Iterative Steps}
𝑉̂𝑖𝑘 = ⎨ 𝑙 ,𝜃) (40) 14: Set 𝑡 ← 0
⎪0 otherwise
⎩ 15: repeat
16: Set 𝑡 ← 𝑡 + 1
and 17: Representation. Compute 𝐠(𝑡) according to Eq. (11) for each 𝑘 =
𝑘
⎧ [ ] 1, 2, … , 𝐾.
2 (𝐱𝑖 , 𝐠𝑙 ) + 𝛼(𝑦𝑖 − 𝐱̃ 𝑖⊤ 𝐛𝑙 )2
⎪ 1 if 𝑘 = argmin1≤𝑙≤𝐾 𝑑(𝐖,𝜃) 18: Weighting . For 𝑊 𝐶𝐿𝑅𝑞𝑝𝑙 , 𝑊 𝐶𝐿𝑅𝑒𝑝𝑙 and 𝑊 𝐶𝐿𝑅𝑒𝑠𝑙 , compute 𝐖(𝑡)
𝑉̂𝑖𝑘 = ⎨ (41) 𝑘
according to Eqs. (13), (23) and (31), respectively, for each 𝑘 = 1, 2, … , 𝐾.
⎪0 otherwise
⎩ (𝑡)
For 𝑊 𝐶𝐿𝑅𝑞𝑝𝑔 , 𝑊 𝐶𝐿𝑅𝑒𝑝𝑔 and 𝑊 𝐶𝐿𝑅𝑒𝑠𝑔 , compute 𝐖 according to Eqs.
2 2
(17), (26) and (34), respectively.
where 𝑑(𝐖 (𝐱𝑖 , 𝐠𝑙 ) and 𝑑(𝐖,𝜃) (𝐱𝑖 , 𝐠𝑙 ) are computed, respectively, accord- 19: Modeling . Compute 𝐛(𝑡) according to Eq. (36) for each 𝑘 = 1, 2, … , 𝐾.
𝑙 ,𝜃) 𝑘
ing to Eqs. (28) and (33). 20: Assignment . Set 𝑢𝑝𝑑𝑎𝑡𝑒𝑑 ← 0. Update 𝑉𝑖𝑘(𝑡) according to Eq. (38) for
each 𝑖 = 1, 2, … , 𝑛 and 𝑘 = 1, 2, … , 𝐾. If 𝑉𝑖𝑘(𝑡) ≠ 𝑉𝑖𝑘(𝑡−1) then set 𝑢𝑝𝑑𝑎𝑡𝑒𝑑 ←
𝑢𝑝𝑑𝑎𝑡𝑒𝑑 + 1.
3.2.5. WCLR algorithm
21: until 𝑢𝑝𝑑𝑎𝑡𝑒𝑑 = 0 or 𝑡 > 𝑇
The WCLR algorithm is summarized as in Algorithm 1. The time
complexity of (LLoyd) K-means algorithm, with ‘‘Representation Step’’
(Eq. (11)) and ‘‘Assignment Step’’ (Eq. (38)), is 𝑂(𝐾𝑛𝑝) for each
iteration of the algorithm (Hartigan & Wong, 1979). Solving ‘‘Modeling Fitted values. They are defined as 𝝁̂ 𝑘 = 𝐗̃ 𝑘 ̂𝐛𝑘 = 𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘
𝐲 =
𝑘 𝑘
Step’’ (Eq. (36)) by using some traditional methods, such as House- ̃ ̃ ⊤ ̃ −1 ̃ ⊤
𝐇𝑘 𝐲𝑘 , where 𝐇𝑘 = 𝐗𝑘 (𝐗𝑘 𝐗𝑘 ) 𝐗𝑘 is the Hat matrix which is: i)
holder transformation or QR decomposition, requires 𝑂(𝑛2 𝑝) arithmetic symmetric (i.e, 𝐇⊤ 𝑘
= 𝐇𝑘 ); ii) idempotent (i.e., 𝐇𝑘 𝐇𝑘 = 𝐇𝑘 ) and iii)
operations (Li, 1996). The computation of determinant in Eq. (17), 𝑡𝑟(𝐇𝑘 ) = 𝑡𝑟(𝐗̃ 𝑘 (𝐗̃ ⊤
𝑘
𝐗̃ 𝑘 )−1 𝐗̃ ⊤ ) = 𝑡𝑟(𝐗̃ ⊤ 𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ 𝑘 )−1 ) = 𝑡𝑟(𝐈𝑝+1 ) = 𝑝 + 1.
𝑘 𝑘 𝑘
‘‘Weighting Step’’, by LU decomposition or Bareiss algorithm, and the From 𝐗̃ ⊤ 𝐗̃ ̂𝐛 = 𝐗̃ ⊤
𝑘 𝑘 𝑘
𝐲 , we have 𝐗̃ ⊤
𝑘 𝑘
𝝐̂ = 𝟎, since 𝐲𝑘 = 𝝁̂ 𝑘 + 𝝐̂ 𝑘 . Because
𝑘 𝑘 ∑
𝑛
inverse of matrix 𝐐, by Strassen algorithm or Coppersmith–Winograd the first line of 𝐗̃ ⊤ is a line of ones, 𝑖=1 𝑉𝑖𝑘 (𝑦𝑖 − 𝜇̂ 𝑖(𝑘) ) = 𝝐̂ ⊤ 𝑘 𝟏 = 0.
∑ [ ]
algorithm, requires 𝑂(𝑝3 ) operations each. This leads to an overall Moreover, 𝑛𝑖=1 𝑉𝑖𝑘 𝜇̂ 𝑖(𝑘) (𝑦𝑖 − 𝜇̂ 𝑖(𝑘) ) = 𝝁̂ ⊤ 𝑘 𝝐
̂ 𝑘 = ̂𝐛⊤ 𝐗̃ ⊤ 𝝐̂ 𝑘 = 0. Finally,
𝑘 𝑘
complexity of 𝑂(𝑝3 + 𝑛2 𝑝 + 𝐾𝑛𝑝) ∼ 𝑂(𝑝3 ) for each iteration of the 𝐶𝑜𝑣(𝝁̂ 𝑘 ) = 𝐶𝑜𝑣(𝐗̃ 𝑘 ̂𝐛𝑘 ) = 𝐗̃ 𝑘 𝐶𝑜𝑣(̂𝐛𝑘 )𝐗̃ ⊤ 𝑘
= 𝜎𝑘2 𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘 𝑘
= 𝜎𝑘2 𝐇𝑘 .
WCLR algorithm. However, the WCLRepg, WCLRepl, WCLResg and Expected value and co-variance of the residuals. They are defined as
WCLResl variants requires 𝑂(𝑛𝐾 + 𝑝) operations for the computation of 𝝐̂ 𝑘 = 𝐲𝑘 − 𝝁̂ 𝑘 = 𝐲𝑘 − 𝐇𝑘 𝐲𝑘 = (𝐈 − 𝐇𝑘 )𝐲𝑘 = 𝐇 ̄ 𝑘 𝐲𝑘 , where 𝐇 ̄ 𝑘 = 𝐈 − 𝐇𝑘
the weights, this leads to an overall complexity of 𝑂(𝑛𝐾+𝑝+𝑛2 𝑝+𝐾𝑛𝑝) ∼ which is i) symmetric; ii) idempotent and iii) 𝑡𝑟(𝐇 ̄ 𝑘 ) = 𝑡𝑟(𝐈) − 𝑡𝑟(𝐇𝑘 ) =
𝑂(𝑛2 𝑝) for these variants. 𝑛 − (𝑝 + 1). Moreover, 𝝐̂ 𝑘 = 𝐇 ̄ 𝑘 𝐲𝑘 = (𝐈 − 𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ 𝑘 )−1 𝐗̃ ⊤ )(𝐗̃ 𝑘 𝐛𝑘 + 𝝐 𝑘 ) =
𝑘 𝑘
As the optimization algorithm is an iterative minimization, the 𝐗̃ 𝑘 (𝐗̃ ⊤
𝑘
𝐗̃ 𝑘 )−1 𝐗̃ ⊤ 𝝐 𝑘 = (𝐈 − 𝐇𝑘 )𝝐 𝑘 . Therefore, 𝐸(𝝐̂ 𝑘 ) = 𝐇
𝑘
̄ 𝑘 𝐸(𝝐 𝑘 ) = 𝟎.
number of all possible K-partitions 𝐕 of (𝐗, 𝐲) is finite, and the objective Besides, 𝐶𝑜𝑣(𝝐̂ 𝑘 ) = 𝐸(𝝐̂ 𝑘 𝝐̂ ⊤ ⊤
𝑘 ) = 𝐸[(𝐈 − 𝐇𝑘 )𝝐 𝑘 𝝐 𝑘 (𝐈 − 𝐇𝑘 ) ] = (𝐈 −
⊤
𝝐̂ ⊤
𝑘𝝐̂𝑘
3.3. Some properties of least squares estimators 𝜎̂ 𝑘2 = (42)
𝑛𝑘 − (𝑝 + 1)
where 𝑛𝑘 is the cardinality of cluster 𝐶𝑘 , since 𝐸(𝝐̂ ⊤ 𝑘𝝐̂ 𝑘 ) = 𝐸[𝝐 ⊤
𝑘
(𝐈 −
This section provides some properties of the least squares estimator 2
𝐇𝑘 )⊤ (𝐈 − 𝐇𝑘 )𝝐 𝑘 ] = 𝐸[𝑡𝑟(𝝐 ⊤ 𝑘
(𝐈 − 𝐇 )𝝐
𝑘 𝑘 )] = 𝑡𝑟[(𝐈 − 𝐇 𝑘 )𝐸(𝝐 ⊤
𝑘 𝑘 )] = 𝜎𝑘 𝑡𝑟[𝐕𝑘 −
𝝐
and residuals. 𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤ ] = 𝜎𝑘2 𝑡𝑟(𝐕𝑘 ) − 𝜎𝑘2 𝑡𝑟[(𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤ 𝐗̃ ] = 𝜎𝑘2 (𝑛𝑘 − (𝑝 + 1)).
𝑘 𝑘 𝑘 𝑘 𝑘 𝑘 𝑘
Expected value and co-variance of the least squares estimator. First, ̂𝐛𝑘 = Prediction. Let a new observation where 𝑦𝑜 is the value of the re-
(𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘
𝐲 = 𝐛𝑘 + (𝐗̃ ⊤
𝑘 𝑘
𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘
𝝐 . Then, 𝐸(̂𝐛𝑘 ) = 𝐛𝑘 since 𝐸(𝝐 𝑘 ) = 𝟎,
𝑘 𝑘 sponse variable when the explanatory variables have values 𝐱𝑜⊤ =
̂
i.e., 𝐛𝑘 is an unbiased estimator of 𝐛𝑘 . Moreover, the estimator error [1, 𝑥𝑥𝑜1 , … , 𝑥𝑜𝑝 ]. According to the 𝑘th regression model 𝑦𝑜 = 𝜇𝑜 +
is ̂𝐛𝑘 − 𝐛𝑘 = (𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘
𝝐 . Therefore, 𝐸(̂𝐛𝑘 − 𝐛𝑘 ) = 𝟎. Finally,
𝑘 𝑘 𝜖𝑜 = 𝐱𝑜⊤ 𝐛𝑘 + 𝜖𝑜 . The least squares estimate of 𝜇𝑜 is 𝜇̂ 𝑜 = 𝐱𝑜⊤ ̂𝐛𝑘 with
𝐶𝑜𝑣(̂𝐛𝑘 ) = 𝐸[(̂𝐛𝑘 − 𝐛𝑘 )(̂𝐛𝑘 − 𝐛𝑘 )⊤] = (𝐗̃ ⊤ 𝐗̃ )−1 𝐗̃ ⊤
𝑘 𝑘 𝑘
𝐸(𝝐 𝑘 𝝐 ⊤
𝑘
)𝐗̃ 𝑘 (𝐗̃ ⊤ 𝐗̃ )−1 =
𝑘 𝑘 𝐸(𝜇̂ 𝑜 ) = 𝐱𝑜⊤ 𝐸(̂𝐛𝑘 ) = 𝐱𝑜⊤ 𝐛𝑘 and 𝑉 𝑎𝑟(𝜇̂ 𝑜 ) = 𝑉 𝑎𝑟(𝐱𝑜⊤ ̂𝐛𝑘 ) = 𝐱𝑜⊤ 𝐶𝑜𝑣(̂𝐛𝑘 )𝐱𝑜 =
̃ ⊤ ̃ −1 ̃ ⊤ ̃ ̃ ⊤ ̃ −1 ̃ ⊤ ̃
𝜎𝑘 (𝐗𝑘 𝐗𝑘 ) 𝐗𝑘 𝐕𝑘 𝐗𝑘 (𝐗𝑘 𝐗𝑘 ) = 𝜎𝑘 (𝐗𝑘 𝐗𝑘 ) , because 𝐸(𝝐 𝑘 𝝐 𝑘 ) = 𝜎𝑘2 𝐕𝑘 ,
2 2 −1 ⊤
𝜎𝑘2 (𝐱𝑜⊤ (𝐗̃ ⊤ 𝐗̃ )−1 𝐱𝑜 ). Moreover, from Eq. (42), 𝑉̂𝑎𝑟(𝜇̂ 𝑜 ) = 𝜎̂ 𝑘2 (𝐱𝑜⊤ (𝐗̃ ⊤ 𝐗̃ )−1
𝑘 𝑘 𝑘 𝑘
𝐕⊤ = 𝐕𝑘 = 𝐕𝑘 𝐕𝑘 , 𝐗̃ ⊤ 𝐕𝑘 = 𝐗̃ ⊤ and 𝐗̃ 𝑘 𝐕𝑘 = 𝐗̃ 𝑘 .
𝑘 𝑘 𝑘
𝐱𝑜 ). Besides, the variance of the forecast error is 𝑉 𝑎𝑟(𝑦𝑜 − 𝜇̂ 𝑜 ) = 𝑉 𝑎𝑟(𝜖𝑜 )+
7
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
8
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
Adjusted Rand Index components of the 𝐽𝑤𝑐𝑙𝑟 criterion, on the other hand it is responsible for
The Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) a similarity balancing the different scales found in both terms of the cost function.
measure between two clusters, adjusted for chance. The ARI can yield The hyper-parameter 𝛾 ∈ R>0 of the KPLANE method has a similar
values from −1 to 1. Values close to 0 (zero) means random label- role of 𝛼 in the WCLR variants. The hyper-parameter 𝜃 is useful only
ing, while value 1 means the clusters are identical. The ARI index is in variations of the WCLR method that adopt the Euclidean metric
computed as: with the summation constraint (methods WCLResg and WCLResl) for
𝑅𝐼 − 𝐸[𝑅𝐼] the weights attributed to the explanatory variables, it is an additional
𝐴𝑅𝐼 = (48) smoothing hyper-parameter for the weights of the variables.
𝑚𝑎𝑥(𝑅𝐼) − 𝐸[𝑅𝐼]
Both hyper-parameters must be fine-tuned according to the specific
where RI is the Rand Index (Rand, 1971). According to (Hubert &
objectives of the analysis being made. In general, an external criterion
Arabie, 1985) the ARI index is not sensitive to the number of classes in
is used to select the best set of hyper-parameters for the analysis. In this
the partitions or to the distributions of the items in the clusters.
paper, except for the CWM method, we use the lowest predicted RMSE
Index 𝛷
value as the external criterion to select the hyper-parameters for each
According to Brusco et al. (2008), in the framework of cluster-
wise linear regression, a convenient normalization of SSres, which is method. Besides, we test each synthetic dataset for all combination of
analogous to the Determination Coefficient (𝑅2 ), is obtained via the hyper-parameters levels:
In addition to the number of clusters, 𝐾, the WCLR method has where 𝐠̂ ℎ is the estimated centroid and 𝐖−1 ℎ
is the estimated
two other hyper-parameters that must be defined a-priori by the user: inverse of the covariance matrix of cluster h given by the variant
𝛼 ∈ R>0 , and 𝜃 ∈ R≥1 . The first has a twofold role, on the one hand it is of CWM that provided the best fit, among the 14 available in the R
responsible for differentiating the relative contribution between the two package flexCWM (Mazza et al., 2018), according to BIC criterion.
9
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
2 (𝐱, 𝐠
𝑦3 = −1 + 𝑋1 + 𝑋2 + 𝜖 (62)
where 𝑑 ̂ ̂ ℎ ) is computed, respectively, by Eqs. (8) and (20);
𝐖ℎ
where 𝜖 ∼ (0, 1) is a Gaussian variable with a mean of 0 and a
• if the cost function is provided either by Eq. (15) or by Eq. (24),
variance of 1.
then
𝐾 Except for the number of clusters, that is assumed to be equal to
2
𝑘 = argmin 𝑑 ̂ (𝐱, 𝐠̂ ℎ ) (57) the number of a priori classes, we use the predicted RMSE value as the
ℎ=1 𝐖
external criterion so select the best set of hyper-parameters for KPLANE
2 (𝐱, 𝐠
where 𝑑 ̂ ̂ ℎ ) is computed, respectively, by Eqs. (16) and (25); and all 6 WCLR variants. For each synthetic dataset we generate 15
𝐖
• if the cost function is given by Eq. (27), then pairs of training and test datasets from a stratified 3 times 5-fold
𝐾
cross-validation resampling strategy. For each pair we fit a clusterwise
𝑘 = argmin 𝑑 2̂ (𝐱, 𝐠̂ ℎ ) (58) regression model and compute the average and standard-deviation of
ℎ=1 (𝐖ℎ ,𝜃)
ARI index and fitted RMSE using the train data, and predicted RMSE
where 𝑑 2̂ (𝐱, 𝐠̂ ℎ ) is computed by Eq. (28); using the test data. Moreover, for each dataset we set the maximum
(𝐖ℎ ,𝜃)
number of algorithm iterations as 100, with 100 restarts for all method,
• Finally, if the cost function is provided by Eq. (32), then
selecting the final model with lowest value of the objective function.
𝐾
𝑘 = argmin 𝑑 2̂ (𝐱, 𝐠̂ ℎ ) (59) We compute the average and standard-deviation of all sampled ARI
ℎ=1 (𝐖,𝜃)
and fitted RMSE (referred to as explanation), and predicted RMSE
where 𝑑 2̂ (𝐱, 𝐠̂ ℎ ) is computed by Eq. (33). (referred to as prediction) and present the results in the summary
(𝐖,𝜃)
Tables 1, 3, 5, 7 and 9 for synthetic datasets 1, 2, 3, 4, and 5,
4.4. Synthetic datasets respectively. The RMSE values computed directly in a training dataset
offer an optimistic value of its estimate, as a reference, we also compute
In this section we present some empirical results to show the the RMSE values in a forecast task. Each test dataset provides one
effectiveness of the Weighted Clusterwise Linear Regression method estimate of the predicted RMSE.
Table 2
Synthetic Dataset 1: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 1.0515 1.0052 −0.9854 WCLRepl 1 0.9969 0.9974 −0.9902
Xnorm = EII 2 1.0020 −0.9872 0.9889 𝛼 = 1e−05 2 1.0000 −0.9999 0.9912
3 −0.9887 0.9968 1.0027 3 −0.9978 0.9973 1.0012
CLR 1 1.0095 0.01987 0.02009 WCLRqpg 1 0.9969 0.9974 −0.9902
2 −0.4294 −0.09619 0.4421 𝛼 = 1e−05 2 1.0000 −0.9999 0.9912
3 2.2574 −0.5556 0.2051 3 −0.9978 0.9973 1.0012
KMEANS+LM 1 0.9969 0.9974 −0.9902 WCLRqpl 1 0.9969 0.9974 −0.9902
2 1.0000 −0.9999 0.9912 𝛼 = 1e−05 2 1.0000 −0.9999 0.9912
3 −0.9978 0.9973 1.0012 3 −0.9978 0.9973 1.0012
KPLANE 1 0.9969 0.9974 −0.9902 WCLResg 1 0.9969 0.9974 −0.9902
𝛾 = 1e+02 2 1.0000 −0.9999 0.9912 𝛼 = 1e−05 2 1.0000 −0.9999 0.9912
3 −0.9978 0.9973 1.0012 𝜃 = 2.5 3 −0.9978 0.9973 1.0012
WCLRepg 1 0.9969 0.9974 −0.9902 WCLResl 1 0.9969 0.9974 −0.9902
𝛼 = 1e−05 2 1.0000 −0.9999 0.9912 𝛼 = 0.01 2 1.0000 −0.9999 0.9912
3 −0.9978 0.9973 1.0012 𝜃= 3 3 −0.9978 0.9973 1.0012
10
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
Table 3 Table 5
Synthetic Dataset 2: summary of experimental assessment. Synthetic Dataset 3: summary of experimental assessment.
Method Explanation Prediction Method Explanation Prediction
ARI (𝑎𝑣𝑔 ± 𝑠𝑑) RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) ARI (𝑎𝑣𝑔 ± 𝑠𝑑) RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.9793 ± 0.01004 0.9711 ± 0.02323 1.0322 ± 0.1357 CWM 1.0000 ± 0.0000 0.9520 ± 0.02495 0.9873 ± 0.09822
CLR 0.3896 ± 0.04123 0.8263 ± 0.04303 1.9522 ± 0.5823 CLR 0.4372 ± 0.05076 0.7917 ± 0.02073 2.0694 ± 0.3325
KMEANS+LM 0.4535 ± 0.03248 1.7912 ± 0.07052 1.8547 ± 0.3144 KMEANS+LM 0.5422 ± 0.06358 1.8928 ± 0.06217 1.9657 ± 0.2921
KPLANE 0.5568 ± 0.01641 1.2232 ± 0.0408 1.3253 ± 0.2017 KPLANE 0.8450 ± 0.07876 1.2825 ± 0.1158 1.3777 ± 0.2751
WCLRqpg 0.9760 ± 0.009819 0.9788 ± 0.02861 1.0327 ± 0.1350 WCLRqpg 0.6084 ± 0.09398 1.5918 ± 0.2061 1.9058 ± 0.1746
WCLRqpl 0.9597 ± 0.01072 0.9865 ± 0.04592 1.0705 ± 0.1709 WCLRqpl 1.0000 ± 0.0000 0.9520 ± 0.02495 0.9873 ± 0.09822
WCLRepg 0.9751 ± 0.008046 0.9819 ± 0.0327 1.0329 ± 0.1359 WCLRepg 0.6293 ± 0.1001 1.5633 ± 0.2226 1.8606 ± 0.2613
WCLRepl 0.9662 ± 0.01491 0.9863 ± 0.03449 1.0333 ± 0.1378 WCLRepl 1.0000 ± 0.0000 0.9520 ± 0.02495 0.9873 ± 0.09822
WCLResg 0.9702 ± 0.009039 0.9754 ± 0.02684 1.1167 ± 0.2276 WCLResg 0.6162 ± 0.1174 1.6274 ± 0.3135 2.0630 ± 0.5564
WCLResl 0.9670 ± 0.008843 0.9755 ± 0.02683 1.1179 ± 0.2283 WCLResl 0.9975 ± 0.009701 0.9507 ± 0.02387 1.1157 ± 0.3258
4.4.1. Synthetic dataset 1 variants of CWM are fitted on the entire dataset, the best one (among
The synthetic dataset 1 is a baseline problem. It is generated from 14) selected according to the BIC criterion was EII, where EII means
the same covariance matrix for all three classes. There are no differ- that the mixture of the CWM model comes from a spherical family
ences between the variance of explanatory variables 𝑋1 and 𝑋2 (with with Equal volume and Spherical shape (Mazza et al., 2018). CLR and
zero covariance). Thus, the differences between classes are defined by KMEANS+LM are fitted on the entire dataset. KPLANE and the WCLR
their different means vectors and regression models of the response variants have hyperparameters. They are selected as follows. The entire
variable. The synthetic
[ ] dataset 1 is generated from covariance matrices: dataset is split on learning set (80% of the examples) and validation
𝜮 1 = 𝜮 2 = 𝜮 3 = 10 01 , with mean vectors defined as previously. set (20% of the examples) using stratified sampling. KPLANE and the
Table 1 summarizes the results for the Synthetic Dataset 1. In this WCLR variants are fitted on the learning set according to a fixed value
Table, all the variants of CWM are fitted on the 4 learning folds, the of the hyperparameters. The best hyperparameter is selected from a
best one (among 14) is selected according to the BIC criterion and specific grid (see Eqs. (52) and (53)) according to the minimum RMSE
used to compute the indexes ARI, fitted RMSE and predicted RMSE computed on the validation set. Once selected the hyperparameter
(on the test fold). CLR and KMEANS+LM are fitted on the 4 learning values, KPLANE and the WCLR variants are fitted again on the entire
folds and the fitted models are used to compute the indexes ARI, dataset.
fitted RMSE and predicted RMSE. KPLANE and the WCLR variants We can observe that the CLR method obtained a poor performance
have hyperparameters. They are selected as follows. The 5 folds are recovering both the true classes and models contained in the data,
split on learning set (3 folds), validation set (1 fold) and test set even with the lowest average RMSE value being the CLR method,
(1 fold). KPLANE and the WCLR variants are fitted on the 3 folds in the learning set, but with the greatest in the test set. It merged
learning set according to a fixed value of the hyperparameters. The best different classes of observations in the same clusters. This is due to
hyperparameter is selected from a specific grid (see Eqs. (52) and (53)) the distribution of explanatory variables and the error hidden in the
according to the minimum RMSE computed on the 1 fold validation set. response variable.
Once selected the hyperparameter values, the 3 folds learning set and The other methods performed similarly with average ARI values
the 1 fold validation are merged on a new 4 folds learning set. KPLANE greater than 0.98. They also presented a good recovering of the true
and the WCLR variants are then fitted on the new 4 folds learning models. The good performance of the KMEANS+LM method indicates
set and the fitted models are used to compute the indexes ARI, fitted that the true classes can be easily identified from the cluster struc-
RMSE and predicted RMSE. This procedure is repeated 15 times and the ture contained in the data in relation to the explanatory variables.
average and standard deviation of the indexes are computed. Note that They were outperformed by CLR in terms of fitted RMSE, but they
from one repetition to another, the best variant of the CWM method outperformed CLR in the test set.
and the selected hyperparameters of KPLANE and WCLR variants can
be different. 4.4.2. Synthetic dataset 2
Table 2 provides the regression coefficients of the fitted method In the synthetic dataset 2, the covariance matrices are the same for
on each clusters using all the data available. In this Table, all the all classes, but the explanatory variable 𝑋2 has a greater variance than
Table 4
Synthetic Dataset 2: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 0.9802 1.0015 −0.9949 WCLRepl 1 0.9916 1.0037 −0.9947
Xnorm = EEI 2 0.9996 −0.9930 1.0014 𝛼 = 0.01 2 1.0234 −0.9062 1.0035
3 −0.9125 0.9727 0.9949 3 −1.2597 1.0435 0.9896
CLR 1 0.2711 0.8543 −1.0151 WCLRqpg 1 0.9916 1.0037 −0.9947
2 −0.03954 0.3912 0.9395 𝛼 = 1.0 2 0.9950 −0.9957 1.0055
3 2.2868 0.3403 0.9581 3 −0.9525 0.9842 0.9974
KMEANS+LM 1 −4.6360 −0.2934 −0.9912 WCLRqpl 1 0.9916 1.0037 −0.9947
2 0.4138 0.6069 1.0810 𝛼 = 0.01 2 1.0272 −0.8977 1.0015
3 −0.2984 0.9151 1.0426 3 −1.1585 1.0186 0.9877
KPLANE 1 0.7444 0.9550 −1.0016 WCLResg 1 0.9916 1.0037 −0.9947
𝛾 = 1.0 2 1.0657 0.3047 1.0447 𝛼 = 1.0 2 0.9986 −0.9932 0.9992
3 1.1063 0.5792 1.0458 𝜃 = 2.5 3 −0.8424 0.9546 0.9928
WCLRepg 1 0.9916 1.0037 −0.9947 WCLResl 1 0.9916 1.0037 −0.9947
𝛼 = 1.0 2 0.9950 −0.9957 1.0055 𝛼 = 1.0 2 0.9986 −0.9932 0.9992
3 −0.9525 0.9842 0.9974 𝜃= 3 3 −0.8424 0.9546 0.9928
11
R.A.M. da Silva and F.A.T. de Carvalho Expert Systems With Applications 185 (2021) 115609
Table 6
Synthetic Dataset 3: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 0.9960 1.0035 −1.0060 WCLRepl 1 0.9960 1.0035 −1.0060
Xnorm = VVI 2 1.0044 −1.0038 1.0064 𝛼 = 1e−05 2 1.0044 −1.0038 1.0064
3 −0.9948 1.0068 1.0029 3 −0.9956 1.0068 1.0027
CLR 1 −3.8513 −0.2279 −1.0204 WCLRqpg 1 −3.4909 −0.07748 −1.0232
2 1.3713 −0.9474 0.5557 𝛼 = 1.0 2 0.8175 0.6439 −0.3828
3 −0.4136 1.0230 1.1612 3 −10.1301 1.0166 −1.2669
KMEANS+LM 1 −18.6989 −3.4831 −1.2057 WCLRqpl 1 0.9960 1.0035 −1.0060
2 0.5217 0.5046 0.2082 𝛼 = 01e−05 2 1.0044 −1.0038 1.0064
3 −10.0444 1.0156 −1.2463 3 −0.9956 1.0068 1.0027
KPLANE 1 −18.3901 −3.7734 −1.0188 WCLResg 1 −18.6765 −3.3986 −1.2429
𝛾 = 1.0 2 0.9560 −0.9979 1.2166 𝛼 = 1e−05 2 0.8110 0.6544 −0.3186
3 2.2934 1.0085 1.8306 𝜃 = 3.0 3 −10.1152 1.0176 −1.2612
WCLRepg 1 −3.4909 −0.07748 −1.0232 WCLResl 1 0.9960 1.0035 −1.0060
𝛼 = 1.0 2 0.8175 0.6439 −0.3828 𝛼 = 0.01 2 1.0044 −1.0038 1.0064
3 −10.1301 1.0166 −1.2669 𝜃 = 2.5 3 −0.9956 1.0068 1.0027
Table 7
Synthetic Dataset 4: summary of experimental assessment.
Method ARI (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 1.0000 ± 0.0000 0.9291 ± 0.02363 0.9710 ± 0.09791
CLR 0.4923 ± 0.0229 0.7465 ± 0.01413 2.4708 ± 0.6380
KMEANS+LM 0.3009 ± 0.1814 2.9860 ± 0.4662 3.1323 ± 0.5276
KPLANE 0.6518 ± 0.0311 1.2992 ± 0.09301 1.3847 ± 0.3476
WCLRqpg 1.0000 ± 0.0000 0.9291 ± 0.02363 0.9710 ± 0.09791
WCLRqpl 1.0000 ± 0.0000 0.9291 ± 0.02363 0.9710 ± 0.09791
WCLRepg 0.5035 ± 0.09736 1.1640 ± 0.8725 3.5140 ± 1.8576
WCLRepl 0.4834 ± 0.09439 1.1906 ± 0.9204 3.1622 ± 0.4575
WCLResg 0.4859 ± 0.01483 2.2341 ± 0.2266 2.6328 ± 0.7736
WCLResl 0.4858 ± 0.01333 2.0834 ± 0.3773 2.4871 ± 0.4877

𝑋1 (with zero covariance). The synthetic dataset 2 is generated from the covariance matrices 𝜮1 = 𝜮2 = 𝜮3 = [[1, 0], [0, 15]].
Table 3 was obtained in the same way as Table 1. The results summarized in Table 3 show the relatively good performance of the CWM method and all WCLR variants in comparison with the poor performance of the CLR, KMEANS+LM and KPLANE methods in terms of partition quality (ARI index).
We can observe that the addition of the unsupervised terms based on quadratic (with a full covariance matrix) and weighted Euclidean distances into the WCLR variants was able to manage the difference in relative variance between the variables 𝑋1 and 𝑋2 globally for all classes. As pointed out in Section 3.2.2, in the WCLRepg and WCLResg variants, the variable with relatively high variance (𝑋2) is mapped to small weight values in all clusters, whereas the variable with relatively low variance (𝑋1) is mapped to a high weight. Therefore, in the assignment of the objects to the clusters, a large difference in variable 𝑋2 is reduced whereas a small difference in variable 𝑋1 is amplified. Moreover, because the WCLRepg and WCLResg variants have fewer parameters, they are preferred among the WCLR variants for this dataset configuration.
The CLR method was the best in terms of fitted RMSE, but it was the worst, together with KMEANS+LM, in the test sets. The CWM method was the best in terms of RMSE in the test set, closely followed by all WCLR variants.
Table 4 was obtained in the same way as Table 2. One can observe that the best variant of the CWM method was EII. Besides, the CWM method presented the best recovery of the true models in comparison with the other methods.

4.4.3. Synthetic dataset 3
For synthetic dataset 3 we chose three different covariance matrices, one for each class. In class 1, the variance of variable 𝑋2 is much more important than that of variable 𝑋1. In class 2, variables 𝑋1 and 𝑋2 have the same variance. Finally, in class 3, the variance of variable 𝑋1 is much more important than that of variable 𝑋2. The synthetic dataset 3 is generated from the covariance matrices 𝜮1 = [[0.01, 0], [0, 15]], 𝜮2 = [[1, 0], [0, 1]], and 𝜮3 = [[15, 0], [0, 0.01]]. In all cases the covariance is zero.
Table 5 was obtained in the same way as Table 1. Table 5 shows how the CWM method and the WCLR variants based on local metrics
Table 8
Synthetic Dataset 4: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 1.0024 1.0082 −0.9996 WCLRepl 1 −5.4018 −0.0385 −1.5398
Xnorm = EEE 2 0.9992 −1.0029 1.0052 𝛼 = 1e+02 2 1.0527 1.1819 2.1415
3 −0.9918 1.0020 1.0036 3 2.4623 −1.0416 −0.1027
CLR 1 −5.4395 −0.04248 −1.5409 WCLRqpg 1 1.0024 1.0082 −0.9996
2 1.0721 1.0974 2.0943 𝛼 = 1e−05 2 0.9992 −1.0029 1.0052
3 2.4164 −1.0437 −0.1068 3 −0.9918 1.0020 1.0036
KMEANS+LM 1 −0.7875 0.01249 −0.5382 WCLRqpl 1 1.0024 1.0082 −0.9996
2 3.5811 1.8174 1.5824 𝛼 = 1e−05 2 0.9992 −1.0029 1.0052
3 0.9019 −1.7501 −0.6655 3 −0.9918 1.0020 1.0036
KPLANE 1 1.0024 1.0082 −0.9996 WCLResg 1 20.4864 4.1513 0.7478
𝛾 = 1.0 2 1.2921 −0.4111 1.0902 𝛼 = 0.01 2 1.8102 −2.4720 −0.5262
3 −5.1296 3.6087 2.5025 𝜃 = 1.5 3 −5.4756 3.6389 2.4800
WCLRepg 1 −5.4069 −0.03755 −1.5408 WCLResl 1 20.4864 4.1513 0.7478
𝛼 = 1e+02 2 1.0702 1.0944 2.0928 𝛼 = 0.01 2 1.8102 −2.4720 −0.5262
3 2.4302 −1.0442 −0.1082 𝜃 = 2.5 3 −5.4756 3.6389 2.4800
Table 9
Synthetic Dataset 5: summary of experimental assessment.
Method ARI (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 1.0000 ± 0.0000 0.9598 ± 0.02035 0.9947 ± 0.08376
CLR 0.3661 ± 0.03263 0.7510 ± 0.01749 3.8765 ± 0.5466
KMEANS+LM 0.4785 ± 0.01644 3.2338 ± 0.1995 3.2409 ± 0.7321
KPLANE 0.6486 ± 0.2663 1.1381 ± 0.1723 1.7408 ± 0.6539
WCLRqpg 0.5194 ± 0.05445 2.4770 ± 0.7651 2.7963 ± 1.0832
WCLRqpl 0.9967 ± 0.005734 0.9566 ± 0.01958 0.9968 ± 0.08402
WCLRepg 0.4395 ± 0.05687 1.8749 ± 1.2464 3.6132 ± 0.7558
WCLRepl 0.4489 ± 0.04269 2.3811 ± 1.2069 3.4428 ± 0.9760
WCLResg 0.4419 ± 0.05232 3.0509 ± 0.6624 3.3738 ± 0.7234
WCLResl 0.4373 ± 0.1681 2.2723 ± 1.4346 4.6180 ± 3.2461

Table 11
Iris Data: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.8927 ± 0.01085 0.2697 ± 0.01471 0.3450 ± 0.05846
CLR 0.9561 ± 0.002814 0.1727 ± 0.006026 0.3924 ± 0.05636
KMEANS+LM 0.8698 ± 0.008866 0.2972 ± 0.009812 0.3150 ± 0.04268
KPLANE 0.8698 ± 0.008866 0.2972 ± 0.009812 0.3150 ± 0.04268
WCLRqpg 0.8694 ± 0.008696 0.2976 ± 0.009688 0.3162 ± 0.0428
WCLRqpl 0.8694 ± 0.008696 0.2976 ± 0.009688 0.3162 ± 0.0428
WCLRepg 0.8694 ± 0.008696 0.2976 ± 0.009688 0.3162 ± 0.0428
WCLRepl 0.8694 ± 0.008696 0.2976 ± 0.009688 0.3162 ± 0.0428
WCLResg 0.8704 ± 0.008963 0.2967 ± 0.008622 0.3135 ± 0.0388
WCLResl 0.8710 ± 0.01119 0.2958 ± 0.01156 0.3141 ± 0.04091
(WCLRepl, WCLResl and WCLRqpl) were capable of recovering the true clusters and obtained better results in terms of partition quality (ARI index) compared to all other methods, including the WCLR methods that use global metrics (WCLRepg, WCLResg and WCLRqpg).
As pointed out in Section 3.2.2, in the WCLRepl and WCLResl variants, the variable with relatively high variance in a given cluster is mapped to small weight values, whereas the variable with relatively low variance in a particular cluster is mapped to a high weight. Therefore, in the assignment of the objects to the clusters, a large difference in variables with relatively high variance is reduced whereas a small difference in variables with relatively low variance is amplified. Moreover, because the WCLRepl and WCLResl variants have fewer parameters, they are the preferred WCLR variants for this dataset configuration.
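To make this weighting behavior concrete, the sketch below computes, for a single cluster, variable weights that are inversely related to the within-cluster dispersion of each variable under a product-equal-one constraint, in the spirit of adaptive distances (Diday & Govaert, 1977). It is only an illustrative approximation: the exact weight and matrix updates used by the WCLR variants are those derived in Section 3, and the function names here are ours.

```python
import numpy as np

def adaptive_weights(cluster_points, prototype):
    """Illustrative per-variable weights for one cluster: inversely proportional to
    the dispersion of each variable around the prototype, normalized so that the
    product of the weights equals one (a common adaptive-distance constraint)."""
    disp = ((cluster_points - prototype) ** 2).sum(axis=0)   # dispersion per variable
    p = disp.size
    return (np.prod(disp) ** (1.0 / p)) / disp               # product of weights == 1

def weighted_sq_distance(x, prototype, weights):
    """Weighted squared Euclidean distance used for cluster assignment."""
    return float(np.sum(weights * (x - prototype) ** 2))

# Toy check: the high-variance variable receives the small weight.
rng = np.random.default_rng(0)
pts = rng.normal(0.0, [1.0, np.sqrt(15.0)], size=(200, 2))   # variances roughly (1, 15)
print(adaptive_weights(pts, pts.mean(axis=0)))               # first weight >> second
```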
The CLR method was the best in terms of fitted RMSE, but it was the worst, together with KMEANS+LM, in the test sets. The CWM method and all WCLR variants based on local metrics were the best in terms of RMSE in the test set.
Table 6 was obtained in the same way as Table 2. One can observe that the best variant of the CWM method was VVI, where VVI means that the mixture of the CWM model comes from a diagonal family with Variable volume, Variable shape and Axis-Aligned orientation (Mazza et al., 2018). Besides, the CWM method and all WCLR variants based on local metrics presented a better recovery of the true models in comparison with the other methods.

4.4.4. Synthetic dataset 4
The synthetic dataset 4 has the same covariance matrix for all three classes, with a differentiation between the variances of the variables 𝑋1 and 𝑋2. However, in this example, we have a non-zero covariance between the explanatory variables. We consider a negative covariance value between both variables. The synthetic dataset 4 is generated from the covariance matrices 𝜮1 = 𝜮2 = 𝜮3 = [[4.30, −8.27], [−8.27, 15.89]].
Table 7 was obtained in the same way as Table 1. Table 7 shows that, in the case of non-zero covariance between the variables, CWM and the WCLR methods based on the quadratic metric (WCLRqpg and WCLRqpl) are able to perfectly recover the true classes contained in the dataset. All other methods, including the methods that use weighted Euclidean distances, obtained inferior results in relation to the CWM, WCLRqpg and WCLRqpl methods. Besides, the CWM, WCLRqpg and WCLRqpl methods were also the best in terms of RMSE in the test sets.
Table 8 was obtained in the same way as Table 2. One can observe that the best variant of the CWM method was EEE, where EEE means that the mixture of the CWM model comes from a general family with Equal volume, Equal shape and Equal orientation (Mazza et al., 2018). Moreover, the CWM, WCLRqpg and WCLRqpl methods were the best in recovery of the true models. Finally, WCLRqpg is the preferred WCLR variant for this dataset configuration because it has fewer parameters.

4.4.5. Synthetic dataset 5
The synthetic dataset 5 has different covariance matrices for each class, similar to the synthetic dataset 3. However, in this example we add a non-zero covariance value between the explanatory variables 𝑋1 and 𝑋2. The synthetic dataset 5 is generated from the covariance matrices 𝜮1 = [[4.30, −8.27], [−8.27, 15.89]], 𝜮2 = [[1, −1], [−1, 1]], and 𝜮3 = [[15.89, 8.27], [8.27, 4.30]].
comparison with the others methods. Table 9 was obtained in the same way as Table 1. In the synthetic
dataset 5, all covariance matrices are different. In this example, only
4.4.4. Synthetic dataset 4 the CWM and the WCLRqpl methods were able to recover the classes
The synthetic dataset 4 has the same covariance matrix for all contained in this dataset. All other methods, including the WCLRqpg
three classes, with a differentiation between variables variances of 𝑋1 method, have far lower results. Besides, CWM and WCLRqpl were also
and 𝑋2 . However, in this example, we have a non-zero covariance the best in terms of RSME in the test sets.
between explanatory variables. We consider a negative covariance Table 10 was obtained in the same way as Table 2. One can observe
value between both variables. The synthetic
[ 4.30 dataset
] 4 is generated from that the best variant of CWM method was VEV, where VEV means
covariance matrices: 𝜮 1 = 𝜮 2 = 𝜮 3 = −8.27 −8.27 . that the mixture of the CWM model comes from a general family with
15.89
Table 10
Synthetic Dataset 5: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 1.0007 1.0080 −0.9918 WCLRepl 1 −4.5814 0.09227 −1.4708
Xnorm = VEV 2 0.9931 −0.9988 0.9919 𝛼 = 1e+02 2 0.4835 0.8114 1.4273
3 −1.0070 0.9974 1.0018 3 2.6511 0.7543 1.5275
CLR 1 −4.5539 0.09724 −1.4686 WCLRqpg 1 −4.5539 0.09724 −1.4686
2 0.4828 0.8085 1.4310 𝛼 = 1e+05 2 0.4828 0.8085 1.4310
3 2.6511 0.7543 1.5275 3 2.6511 0.7543 1.5275
KMEANS+LM 1 −17.5720 −2.0734 −2.3059 WCLRqpl 1 1.0007 1.0080 −0.9918
2 0.9188 1.0185 −0.1579 𝛼 = 1.0 2 0.9931 −0.9988 0.9919
3 −5.0351 1.3290 0.3040 3 −1.0070 0.9974 1.0018
KPLANE 1 183.2879 30.9761 14.6114 WCLResg 1 −16.3115 −1.8634 −2.1863
𝛾 = 1.0 2 1.0052 −74.3537 −72.3941 𝛼 = 1e−05 2 0.9670 1.0263 −0.1733
3 −7.3485 1.5384 −0.04432 𝜃 = 2.5 3 −3.5994 1.2060 0.5402
WCLRepg 1 −4.5814 0.09227 −1.4708 WCLResl 1 −4.5539 0.09724 −1.4686
𝛼 = 1e+02 2 0.4835 0.8114 1.4273 𝛼 = 1e+05 2 0.4828 0.8085 1.4310
𝜃 = 3 3 2.6511 0.7543 1.5275
Table 12
Iris Data: model selection and regression coefficients.
Method Cluster Regression coefficients Method Clusters Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽3 𝛽0 𝛽1 𝛽2 𝛽3
CWM 1 0.2981 0.3092 0.9970 −0.09034 WCLRepl 1 1.8787 0.4788 0.8145 −0.5893
Xnorm = VEE 2 1.6258 0.2902 1.0336 −0.6752 𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521
3 2.2632 0.7632 0.4737 −0.4474
4 2.1182 0.7133 0.2729 0.1500
5 0.8690 0.3544 0.9224 −0.2259
CLR 1 1.9723 0.5586 0.6744 −0.5328 WCLRqpg 1 1.8787 0.4788 0.8145 −0.5893
2 2.2988 0.5683 0.7204 −0.5469 𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521
KMEANS+LM 1 1.8346 0.4805 0.8256 −0.5998 WCLRqpl 1 1.8787 0.4788 0.8145 −0.5893
2 2.3324 0.6456 0.2659 0.2946 𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521
KPLANE 1 1.8346 0.4805 0.8256 −0.5998 WCLResg 1 1.8787 0.4788 0.8145 −0.5893
𝛾 = 1.0 2 2.3324 0.6456 0.2659 0.2946 𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521
𝜃 = 2.0
WCLRepg 1 1.8787 0.4788 0.8145 −0.5893 WCLResl 1 1.8787 0.4788 0.8145 −0.5893
𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521 𝛼 = 1e−05 2 2.3519 0.6548 0.2376 0.2521
𝜃 = 2.0
Table 13
Student Data: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.6163 ± 0.06905 5.4720 ± 0.4865 6.2457 ± 0.7481
CLR 0.8357 ± 0.009596 3.5961 ± 0.1057 7.5929 ± 0.7815
KMEANS+LM 0.5514 ± 0.01947 5.9428 ± 0.1394 6.0203 ± 0.6015
KPLANE 0.5514 ± 0.01947 5.9428 ± 0.1394 6.0203 ± 0.6015
WCLRqpg 0.5534 ± 0.01991 5.9294 ± 0.1422 6.0211 ± 0.5852
WCLRqpl 0.5539 ± 0.01959 5.9260 ± 0.1386 6.0015 ± 0.5915
WCLRepg 0.5523 ± 0.01954 5.9369 ± 0.1408 6.0398 ± 0.5970
WCLRepl 0.5522 ± 0.01943 5.9375 ± 0.1407 6.0412 ± 0.5950
WCLResg 0.5582 ± 0.01795 5.8980 ± 0.1575 6.0259 ± 0.6484
WCLResl 0.5573 ± 0.01802 5.9039 ± 0.1536 6.0368 ± 0.6478

Variable volume, Equal shape and Variable orientation (Mazza et al., 2018). Moreover, CWM and WCLRqpl were the best in recovery of the true models.
In summary, this section considered different configurations of synthetic datasets aiming to provide insights about the usage of a particular variant of the WCLR method. Synthetic dataset 1 highlighted the good performance of the CWM, KMEANS+LM, KPLANE and WCLR variants in comparison with the CLR method when the explanatory variables are not correlated and have almost the same variance in all a priori classes. In synthetic dataset 2, where the variables were not correlated and have the same set of relevant (low variance) and irrelevant (high variance) variables in all a priori classes, the CWM method and the WCLR variants were the best. Because the WCLRepg and WCLResg variants have fewer parameters, they are the first choice among the WCLR variants for this dataset configuration. Synthetic dataset 3 showed the good performance of the CWM method and the WCLRepl, WCLResl and WCLRqpl variants, in comparison with the other methods, when the explanatory variables are not correlated and each a priori class has its own set of relevant (low variance) and irrelevant (high variance) variables. Because the WCLRepl and WCLResl variants have fewer parameters, they are the first choice among these variants for this dataset configuration. In synthetic dataset 4, where the a priori classes have similar covariance matrices of the explanatory variables, CWM, WCLRqpg and WCLRqpl were the best in terms of partition quality, recovery of the true classes, and RMSE in the test sets. Because WCLRqpg has fewer parameters, it is the first choice for this dataset configuration. Synthetic dataset 5 showed the good performance of CWM and WCLRqpl in comparison with the other methods, when each a priori class has a specific covariance matrix of the explanatory variables. Finally, overall, the CLR and KMEANS+LM methods perform the worst compared to the CWM and all WCLR variants.

4.5. Benchmark data

This section provides a study with some benchmark real datasets often used in previous work using a mixture of regression models with random independent variables (Dang et al., 2017; Ingrassia et al., 2014, 2012; Mari et al., 2017; Mazza & Punzo, 2020).
These experiments were performed using the same experimental setup as the evaluation of the synthetic datasets. For CLR, KMEANS+LM, KPLANE, and all WCLR variants, we used the second difference as an index to select the number of clusters in the model. For all benchmark datasets considered here, the number of clusters is selected between 1 and 𝐾𝑀𝐴𝑋 = 5.
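The second-difference index can be implemented in several ways; the sketch below assumes it is the discrete second difference of the clustering criterion evaluated for successive numbers of clusters, with the selected K maximizing its magnitude (in the spirit of Evanno et al., 2005). The criterion values in the usage line are hypothetical.

```python
import numpy as np

def select_k_by_second_difference(criterion_values, k_values):
    """criterion_values[i] is the clustering criterion obtained with k_values[i]
    clusters (e.g., K = 1..5). Returns the interior K whose second difference
    |f(K-1) - 2 f(K) + f(K+1)| is largest."""
    f = np.asarray(criterion_values, dtype=float)
    second_diff = np.abs(f[:-2] - 2.0 * f[1:-1] + f[2:])
    return k_values[1 + int(np.argmax(second_diff))]

# Usage with hypothetical criterion values for K = 1..5:
print(select_k_by_second_difference([40.0, 18.0, 9.0, 8.2, 7.9], [1, 2, 3, 4, 5]))  # -> 2
```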
Table 14
Student Data: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 −54.9582 0.8991 −0.1400 WCLRepl 1 −53.6406 0.6681 0.06619
Xnorm = EVE 2 −58.4102 0.7972 −0.03935 𝛼 = 1e−05 2 −44.4621 0.7698 −0.0820
CLR 1 −82.0544 0.7467 0.1761 WCLRqpg 1 −54.1299 0.6526 0.08472
2 −85.0396 0.7409 0.1426 𝛼 = 1e−05 2 −48.5961 0.8459 −0.1277
KMEANS+LM 1 −42.8878 0.7013 −0.02839 WCLRqpl 1 −47.7794 0.8233 −0.1118
2 −50.4509 0.6570 0.05951 𝛼 = 1e−05 2 −54.4070 0.6457 0.09338
KPLANE 1 −42.8878 0.7013 −0.02839 WCLResg 1 −54.3189 0.6976 0.03724
𝛾 = 1e+02 2 −50.4509 0.6570 0.05951 𝛼 = 1e−05 2 −94.5661 0.7101 0.2682
𝜃 = 3.0
WCLRepg 1 −56.1754 0.6851 0.06318 WCLResl 1 −65.6944 0.7122 0.09632
𝛼 = 1e−05 2 −46.8040 0.7986 −0.09486 𝛼 = 1e−05 2 −45.5034 0.6533 0.02937
𝜃 = 2.5 3 −63.3205 0.6877 0.09541
Table 15
Crab Data 1: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.9914 ± 0.002904 0.3184 ± 0.05186 0.6277 ± 0.1572
CLR 0.9921 ± 0.00167 0.3065 ± 0.03481 0.5983 ± 0.09603
KMEANS+LM 0.9784 ± 0.002485 0.5087 ± 0.0252 0.5248 ± 0.08267
KPLANE 0.9784 ± 0.002485 0.5087 ± 0.0252 0.5248 ± 0.08267
WCLRqpg 0.9787 ± 0.002547 0.5050 ± 0.02488 0.5181 ± 0.09729
WCLRqpl 0.9788 ± 0.002645 0.5044 ± 0.02673 0.5151 ± 0.09912
WCLRepg 0.9784 ± 0.002433 0.5087 ± 0.02431 0.5207 ± 0.08303
WCLRepl 0.9785 ± 0.002463 0.5085 ± 0.02455 0.5213 ± 0.08272
WCLResg 0.9783 ± 0.0026 0.5098 ± 0.02282 0.5308 ± 0.07949
WCLResl 0.9785 ± 0.002345 0.5077 ± 0.02074 0.5361 ± 0.07602

Table 17
Crab Data 2: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.9813 ± 0.005133 0.3463 ± 0.04429 1.3736 ± 0.1868
CLR 0.9745 ± 0.002127 0.4081 ± 0.01231 1.4520 ± 0.1858
KMEANS+LM 0.8166 ± 0.01732 1.0929 ± 0.03211 1.1105 ± 0.1186
KPLANE 0.8167 ± 0.01734 1.0925 ± 0.03247 1.1112 ± 0.1193
WCLRqpg 0.8113 ± 0.02077 1.1080 ± 0.04194 1.1441 ± 0.1643
WCLRqpl 0.8118 ± 0.02038 1.1065 ± 0.04096 1.1432 ± 0.1660
WCLRepg 0.8160 ± 0.01718 1.0947 ± 0.03217 1.1093 ± 0.1255
WCLRepl 0.8160 ± 0.01728 1.0947 ± 0.03228 1.1134 ± 0.1311
WCLResg 0.8199 ± 0.01515 1.0829 ± 0.02939 1.1372 ± 0.1238
WCLResl 0.8181 ± 0.01857 1.0877 ± 0.03644 1.1414 ± 0.1181
4.5.1. Iris data
We use the well-known Iris dataset (https://archive.ics.uci.edu/ml/index.html) as an illustrative case for a real dataset. This dataset has three classes with 50 observations per class. We select ''Sepal Length'' as the response variable, and ''Sepal Width'', ''Petal Length'' and ''Petal Width'' as predictor variables.
Table 11 summarizes the results for the Iris dataset. In this table, for a fixed number of clusters, all the variants of CWM are fitted on the four learning folds, and the best combination (among the 14 variants and the number of clusters) is selected according to the BIC criterion and used to compute the indexes 𝛷(𝐾) and fitted RMSE (both on the learning folds) and predicted RMSE (on the test fold). The hyperparameters of the CLR, KMEANS+LM, KPLANE, and the WCLR variants are selected as follows. The five folds are split between the learning set (3 folds), the validation set (1 fold), and the test set (1 fold). CLR, KMEANS+LM, KPLANE, and the WCLR variants are fitted on the 3-fold learning set according to a fixed value of the hyperparameters. The best hyperparameter combination is selected (number of clusters between 1 and 5, and from a specific grid for 𝛾, 𝛼 and 𝜃; see Eqs. (52) and (53)) according to the minimum RMSE computed on the 1-fold validation set. Once the hyperparameter values are selected, the 3-fold learning set and the 1-fold validation set are merged into a new 4-fold learning set. CLR, KMEANS+LM, KPLANE, and the WCLR variants are then fitted on the new 4-fold learning set and the fitted models are used to compute the indexes 𝛷(𝐾), fitted RMSE, and predicted RMSE. This procedure is repeated 15 times and the average and standard deviation of the indexes are computed.
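The selection and evaluation loop just described can be summarized schematically as below. Here fit_method and predict_method are hypothetical placeholders for any of the compared methods and their prediction rule, and the hyperparameter grid is a placeholder as well; the snippet mirrors the protocol only, not the code actually used in the experiments.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeated_holdout(X, y, fit_method, predict_method, grid, n_repeats=15, seed=0):
    """Protocol sketch: 3 folds for learning, 1 fold for validation (hyperparameters
    chosen by minimum validation RMSE), refit on learning+validation, 1 fold for test."""
    rng = np.random.default_rng(seed)
    fitted, predicted = [], []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(len(y)), 5)
        learn, valid, test = np.concatenate(folds[:3]), folds[3], folds[4]
        # Pick the hyperparameters with the smallest validation RMSE.
        best = min(grid, key=lambda hp: rmse(
            y[valid], predict_method(fit_method(X[learn], y[learn], **hp), X[valid])))
        refit_idx = np.concatenate([learn, valid])
        model = fit_method(X[refit_idx], y[refit_idx], **best)
        fitted.append(rmse(y[refit_idx], predict_method(model, X[refit_idx])))
        predicted.append(rmse(y[test], predict_method(model, X[test])))
    return (np.mean(fitted), np.std(fitted)), (np.mean(predicted), np.std(predicted))
```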
The CWM method obtained the second-best fitting performance, just behind the CLR method, and also the second-worst performance in terms of forecasting quality, ahead of the CLR method. All other methods obtained similar results, being the best in the task of data prediction (see Table 11).
Table 12 provides the regression coefficients of the fitted methods on each cluster using all the available data. In this table, all the variants of CWM are fitted on the entire dataset, and the best combination among the number of clusters and the 14 variants is selected according to the BIC criterion. The hyperparameters of the CLR, KMEANS+LM, KPLANE, and the WCLR variants are selected as follows. The entire dataset is split into the learning set (80% of the examples) and the validation set (20% of the examples) using stratified sampling. CLR, KMEANS+LM, KPLANE, and the WCLR variants are fitted on the learning set according to a fixed value of the hyperparameters. The best hyperparameter combination is selected (number of clusters between 1 and 5, and from a specific grid for 𝛾, 𝛼 and 𝜃; see Eqs. (52) and (53)) according to the minimum RMSE computed on the validation set. Once the hyperparameter values are selected, CLR, KMEANS+LM, KPLANE, and the WCLR variants are fitted again on the entire dataset.
The CWM method selected 5 clusters and the VEE variant, where VEE means that the mixture of the CWM model comes from a general family with Variable volume, Equal shape, and Equal orientation (Mazza et al., 2018). The other methods selected only 2 clusters. One can observe that the models fitted with KMEANS+LM, KPLANE, and the WCLR variants have similar regression coefficients.

4.5.2. Student data
The Student dataset (http://docenti.unict.it/punzo) is a two-class problem with 99 observations in class ''M'' and 171 observations in class ''F''. We select ''WEIGHT'' as the response variable, and ''HEIGHT'' and ''HEIGHT.F'' as predictors.
Table 13 was obtained in the same way as Table 11. As expected, the CLR method achieved the best performance in terms of quality of fit of the regression models (minimum fitted RMSE), with a high value of 𝛷(𝐾). However, as the CLR method is not able to cluster observations into homogeneous groups, the assignment technique by the nearest cluster center for an unknown observation (prediction) obtained the worst result among all the tested methods. CWM was the second-best in terms of explanation but the second-worst in terms of prediction. KMEANS+LM, KPLANE, and all WCLR variants achieved
Table 16
Crab Data 1: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 1.0805 0.3837 0.1149 WCLRepl 1 2.1329 0.01786 0.9293
Xnorm = EEV 2 0.7022 0.4512 0.06253 𝛼 = 1e−05 2 0.8454 0.2004 0.5898
CLR 1 0.7020 0.1117 0.7794 WCLRqpg 1 1.0465 0.3061 0.3095
2 0.2779 0.1899 0.6911 𝛼 = 1.0 2 0.8287 0.2300 0.5365
KMEANS+LM 1 2.0484 0.02125 0.9264 WCLRqpl 1 1.0465 0.3061 0.3095
2 0.7898 0.2038 0.5873 𝛼 = 1.0 2 0.8287 0.2300 0.5365
KPLANE 1 2.0484 0.02125 0.9264 WCLResg 1 2.1336 0.01025 0.9462
𝛾 = 1e+02 2 0.7898 0.2038 0.5873 𝛼 = 1.0 2 0.8592 0.2080 0.5705
𝜃 = 2.0
WCLRepg 1 2.1329 0.01786 0.9293 WCLResl 1 2.1336 0.01025 0.9462
𝛼 = 1e−05 2 0.8454 0.2004 0.5898 𝛼 = 1.0 2 0.8592 0.2080 0.5705
𝜃 = 2.0
Table 18
Crab Data 2: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 2.8862 0.1637 0.2508 WCLRepl 1 6.2738 −0.03783 0.5847
Xnorm = VEV 2 −0.4891 0.4530 −0.01442 𝛼 = 1e−05 2 1.2132 0.3869 −0.04914
3 1.0394 0.4215 −0.05511
4 3.5174 0.3465 −0.2163
5 −32.9230 2.5000 −3.1757
CLR 1 2.6414 0.2569 0.06259 WCLRqpg 1 1.6844 0.7183 −0.8167
2 0.9972 0.3976 0.002037 𝛼 = 1.0 2 3.5927 0.1148 0.3851
KMEANS+LM 1 6.5544 −0.03851 0.5709 WCLRqpl 1 1.7346 0.6723 −0.7194
2 1.2995 0.3821 −0.04662 𝛼 = 1e−05 2 3.5883 0.05646 0.5278
KPLANE 1 6.5544 −0.03851 0.5709 WCLResg 1 5.5552 −0.0411 0.6312
𝛾 = 1e+02 2 1.2995 0.3821 −0.04662 𝛼 = 0.01 2 1.0071 0.4014 −0.06275
𝜃 = 2.0
WCLRepg 1 6.2738 −0.03783 0.5847 WCLResl 1 5.5552 −0.0411 0.6312
𝛼 = 1e−05 2 1.2132 0.3869 −0.04914 𝛼 = 0.01 2 1.0071 0.4014 −0.06275
𝜃 = 2.0
Table 20
Crab Data 3: model selection and regression coefficients.
Method Cluster Regression coefficients Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽0 𝛽1 𝛽2
CWM 1 1.4641 1.6692 −1.2561 WCLRepl 1 1.5911 1.3773 −0.6523
Xnorm = EVV 2 0.9151 1.4454 −0.7952 𝛼 = 1.0 2 0.3145 1.4773 −0.8135
3 0.2653 0.8876 0.4568
4 0.6236 0.9909 0.3415
5 0.1139 1.3960 −0.5446
CLR 1 0.3792 1.3783 −0.6175 WCLRqpg 1 0.6743 1.0756 0.131
2 0.7128 1.2995 −0.3853 𝛼 = 1e−05 2 0.1796 1.3414 −0.5015
KMEANS+LM 1 1.5817 1.3736 −0.6436 WCLRqpl 1 0.2675 1.3568 −0.5404
2 0.3581 1.4748 −0.8120 𝛼 = 1e−05 2 0.6802 1.0725 0.1389
KPLANE 1 1.5817 1.3736 −0.6436 WCLResg 1 1.2626 1.3965 −0.6771
𝛾 = 1e+02 2 0.3581 1.4748 −0.8120 𝛼 = 1e−05 2 0.2740 1.4682 −0.7881
𝜃 = 2.5
WCLRepg 1 1.5911 1.3773 −0.6523 WCLResl 1 1.2626 1.3965 −0.6771
𝛼 = 1.0 2 0.3145 1.4773 −0.8135 𝛼 = 1e−05 2 0.2740 1.4682 −0.7881
𝜃 = 2.5
Table 21
AIS Data 1: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.9212 ± 0.0197 0.1272 ± 0.01517 0.2147 ± 0.0305
CLR 0.9521 ± 0.003798 0.09978 ± 0.003422 0.2217 ± 0.02381
KMEANS+LM 0.8671 ± 0.01219 0.1661 ± 0.004527 0.1860 ± 0.01751
KPLANE 0.8671 ± 0.01219 0.1661 ± 0.004527 0.1860 ± 0.01751
WCLRqpg 0.8686 ± 0.0121 0.1651 ± 0.004062 0.1834 ± 0.01699
WCLRqpl 0.8681 ± 0.01379 0.1653 ± 0.005076 0.1874 ± 0.01992
WCLRepg 0.8689 ± 0.01212 0.1649 ± 0.0041 0.1823 ± 0.01652
WCLRepl 0.8692 ± 0.01208 0.1647 ± 0.004027 0.1821 ± 0.01657
WCLResg 0.8705 ± 0.01493 0.1637 ± 0.005827 0.1888 ± 0.02855
WCLResl 0.8735 ± 0.01457 0.1619 ± 0.006577 0.1906 ± 0.03049

Table 23
AIS Data 2: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.4910 ± 0.1605 1.2642 ± 0.2218 2.6636 ± 0.7901
CLR 0.6906 ± 0.01887 0.9977 ± 0.03655 2.1138 ± 0.3122
KMEANS+LM 0.1416 ± 0.02406 1.6627 ± 0.04565 1.7994 ± 0.2030
KPLANE 0.1416 ± 0.02406 1.6627 ± 0.04565 1.7994 ± 0.2030
WCLRqpg 0.1404 ± 0.02596 1.6639 ± 0.04929 1.8164 ± 0.2150
WCLRqpl 0.1700 ± 0.03292 1.6347 ± 0.04789 1.7756 ± 0.1791
WCLRepg 0.1512 ± 0.02915 1.6533 ± 0.05194 1.8066 ± 0.2125
WCLRepl 0.1527 ± 0.02867 1.6518 ± 0.05038 1.7953 ± 0.2139
WCLResg 0.1922 ± 0.09865 1.6112 ± 0.1192 1.9057 ± 0.2579
WCLResl 0.1860 ± 0.09043 1.6179 ± 0.1125 1.8948 ± 0.2595
Table 22
AIS Data 1: model selection and regression coefficients.
Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽3 𝛽4 𝛽5 𝛽6 𝛽7
CWM 1 −1.8726 0.08597 0.08929 −0.006909 0.1217 0.08503 0.007047 −0.08299
Xnorm = EEE 2 −0.3124 0.1696 −0.0272 −0.002989 −0.03182 −0.09632 −0.003581 0.07232
3 −1.8886 −0.01817 0.2814 0.01853 0.0006676 0.1179 0.01729 −0.1164
4 −2.3053 0.1561 −0.06705 8.773e−05 0.05366 0.05058 −0.001377 −0.03653
5 2.4496 0.2114 −0.3246 0.0008711 −0.08269 −0.1160 0.0009783 0.08062
CLR 1 −1.4790 0.1239 0.003101 0.004875 −0.01317 0.01121 0.005014 −0.01495
2 −0.8254 0.1070 0.02513 0.007988 −0.03089 0.01544 0.005976 −0.02012
KMEANS+LM 1 −0.8096 0.1062 0.03827 0.002648 −0.00229 0.01816 0.002369 −0.01844
2 1.6354 0.1053 0.004454 0.001653 −0.07691 −0.08076 −0.0008948 0.06475
KPLANE 1 −0.8096 0.1062 0.03827 0.002648 −0.00229 0.01816 0.002369 −0.01844
𝛾 = 0.01 2 1.6354 0.1053 0.004454 0.001653 −0.07691 −0.08076 −0.0008948 0.06475
WCLRqpg 1 −1.2760 0.09899 0.05919 −0.004881 0.07353 0.04488 0.003158 −0.04245
𝛼 = 1e−05 2 0.1069 0.1044 0.01188 0.004344 −0.03276 −0.01619 0.001609 0.01076
WCLRqpl 1 0.3575 0.09882 0.007946 0.003454 −0.01923 0.003649 −0.0004965 −0.001892
𝛼 = 1e+02 2 −1.8409 0.1018 0.0686 0.00155 0.04012 0.05927 0.006048 −0.05897
WCLRepg 1 −0.8655 0.0956 0.06036 −0.005106 0.06633 0.03647 0.002261 −0.03513
𝛼 = 1e+02 2 0.2272 0.1100 −0.006818 0.003389 −0.02491 −0.01721 0.0008822 0.01121
WCLRepl 1 −0.8655 0.0956 0.06036 −0.005106 0.06633 0.03647 0.002261 −0.03513
𝛼 = 1e+02 2 0.2272 0.1100 −0.006818 0.003389 −0.02491 −0.01721 0.0008822 0.01121
WCLResg 1 −0.6516 0.1081 0.02484 0.002332 0.006826 0.02275 0.001464 −0.02172
𝛼 = 1e−05 2 0.7554 0.09279 0.00561 0.002064 −0.01955 −0.01476 9.672e−06 0.01154
𝜃 = 3.0
WCLResl 1 0.5981 0.1005 −0.03753 −0.0004531 0.02417 0.01691 0.001243 −0.01978
𝛼 = 1e−05 2 −1.3812 0.1162 0.05077 −0.006741 −0.00223 −0.04336 0.004048 0.0380
𝜃 = 3.0 3 −1.7343 0.1048 0.1101 0.0110 −0.04279 0.01519 0.002644 −0.01721
Table 24
AIS Data 2: model selection and regression coefficients.
Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽3 𝛽4 𝛽5 𝛽6 𝛽7
CWM 1 22.4021 −0.04721 0.5821 −0.01756 −0.9571 −1.4842 −0.01283 1.2103
Xnorm = VEE 2 2.7545 0.2817 −0.7167 −0.003953 0.3019 −0.2194 0.003012 0.1983
3 −4.0303 −0.2086 1.0413 0.006569 1.0368 1.3213 −0.04334 −1.1392
4 −1.7865 0.4945 −1.2349 −0.04589 0.8064 1.4413 −0.03246 −1.2137
5 −3.1901 0.2909 −0.5221 0.04226 −0.3404 −0.2886 0.0381 0.2313
CLR 1 −3.0512 0.3463 −0.6446 −0.003909 −0.1011 −0.2034 0.04151 0.1457
2 0.6939 0.4744 −1.1211 0.02823 −0.4178 −0.3618 0.06927 0.2568
KMEANS+LM 1 −0.6753 0.1801 −0.1992 0.02401 −0.2228 −0.3391 0.02939 0.2845
2 −7.5449 0.06663 0.2845 0.01711 0.6525 0.8240 −0.04525 −0.6419
KPLANE 1 −0.6753 0.1801 −0.1992 0.02401 −0.2228 −0.3391 0.02939 0.2845
𝛾 = 1e+02 2 −7.5449 0.06663 0.2845 0.01711 0.6525 0.8240 −0.04525 −0.6419
WCLRqpg 1 3.5434 0.1118 0.06781 0.003758 0.02777 −0.008548 −0.01966 0.01596
𝛼 = 1.0 2 −1.4491 0.1692 −0.3116 0.01894 0.3967 0.2403 0.02028 −0.2486
WCLRqpl 1 7.6301 0.05851 0.006868 −0.02161 −0.8241 −1.6768 0.02767 1.5299
𝛼 = 1.0 2 −1.7292 0.0141 0.4310 0.00203 0.3278 0.2846 −0.02699 −0.2225
WCLRepg 1 1.3640 0.1508 −0.3201 0.01761 0.3566 0.1917 0.01364 −0.2063
𝛼 = 1e−05 2 4.8616 0.1481 −0.05725 −0.003174 0.07681 −0.03557 −0.02452 0.02967
WCLRepl 1 1.3640 0.1508 −0.3201 0.01761 0.3566 0.1917 0.01364 −0.2063
𝛼 = 1e−05 2 4.8616 0.1481 −0.05725 −0.003174 0.07681 −0.03557 −0.02452 0.02967
WCLResg 1 2.0163 0.2600 −0.4227 0.0204 −0.2107 −0.2486 0.00758 0.2170
𝛼 = 1e−05 2 −1.0726 0.08747 0.2832 0.01144 0.03791 0.05069 −0.007165 −0.0363
𝜃 = 2.5
WCLResl 1 −1.8261 0.1119 0.2681 0.01973 0.02282 0.08536 −0.00715 −0.07153
𝛼 = 1e−05 2 2.3358 0.2322 −0.3617 0.008123 −0.1516 −0.2526 0.006286 0.2246
𝜃 = 1.5
Table 25
AIS Data 3: summary of experimental assessment.
Method 𝛷(𝐾) (𝑎𝑣𝑔 ± 𝑠𝑑) Fitted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑) Predicted RMSE (𝑎𝑣𝑔 ± 𝑠𝑑)
CWM 0.5455 ± 0.1573 31.3970 ± 6.3623 66.9237 ± 22.1533
CLR 0.7618 ± 0.01578 23.0617 ± 1.0631 54.0273 ± 10.0379
KMEANS+LM 0.2063 ± 0.01906 42.1410 ± 1.9972 46.7919 ± 7.5112
KPLANE 0.2006 ± 0.01079 42.2923 ± 1.9426 46.9290 ± 7.7526
WCLRqpg 0.2794 ± 0.0374 40.1356 ± 2.0030 46.7654 ± 7.9094
WCLRqpl 0.2617 ± 0.05048 40.6265 ± 2.4143 47.9950 ± 8.6583
WCLRepg 0.2677 ± 0.02999 40.4669 ± 1.9313 45.4533 ± 7.7454
WCLRepl 0.2775 ± 0.02724 40.1899 ± 1.7687 45.4962 ± 7.4284
WCLResg 0.3780 ± 0.2021 36.2885 ± 8.9642 58.3707 ± 20.3415
WCLResl 0.3663 ± 0.2063 36.6122 ± 9.0213 65.5152 ± 35.3903

deciliter; and ''Ferr'', plasma ferritins in ng per deciliter. This dataset was split into three regression problems.

AIS Data 1
We select ''RCC'' as response variable and ''Hc'', ''Hg'', ''SSF'', ''Bfat'', ''LBM'', ''Ht'' and ''Wt'' as predictors.
Table 21 was obtained in the same way as Table 11. CLR and CWM were the best and second-best in terms of fitting but they were the worst and second-worst in terms of prediction. KMEANS+LM, KPLANE, and the WCLR variants obtained close results, with a better forecast for the WCLRepg and WCLRepl methods. The KMEANS+LM and KPLANE methods obtained the same average performance.
Table 22 was obtained in the same way as Table 12. Except for CWM and WCLResl, all other methods were fitted with 2 clusters. The CWM method was fitted with 5 clusters and selected the EEE variant, where EEE means that the mixture of the CWM model comes from a general family with Equal volume, Equal shape, and Equal orientation (Mazza et al., 2018). One can observe that the models fitted with the WCLRepg and WCLRepl variants have equal regression coefficients.

AIS Data 2
For AIS Data 2, we select ''WCC'' as response variable and ''Hc'', ''Hg'', ''SSF'', ''Bfat'', ''LBM'', ''Ht'' and ''Wt'' as predictors.
Table 23 was obtained in the same way as Table 11. CLR and CWM were the best and second-best in terms of fitting. CWM and CLR were the worst and second-worst in terms of prediction. KMEANS+LM, KPLANE, and the WCLR variants obtained close results, with a better forecast for the WCLRqpl method. The KMEANS+LM and KPLANE methods obtained the same average performance.
Table 24 was obtained in the same way as Table 12. All the methods but CWM were fitted with 2 clusters. The CWM method was fitted with 5 clusters and selected the VEE variant, where VEE means that the mixture of the CWM model comes from a general family with Variable volume, Equal shape, and Equal orientation (Mazza et al., 2018). One can observe that the models fitted with, respectively, KMEANS+LM and KPLANE, and WCLRepg and WCLRepl, have equal regression coefficients.

AIS Data 3
For AIS Data 3, we select ''Fe'' as response variable and ''Hc'', ''Hg'', ''SSF'', ''Bfat'', ''LBM'', ''Ht'' and ''Wt'' as predictors.
Table 25 was obtained in the same way as Table 11. CWM and CLR were the best in terms of 𝛷(𝐾) and RMSE of adjustment, but they were the worst in terms of prediction, together with WCLResg and WCLResl. The KMEANS+LM and KPLANE methods obtained similar performances. WCLRepg and WCLRepl were the best on prediction.
Table 26 was obtained in the same way as Table 12. CWM, WCLResg, and WCLResl were fitted with, respectively, 5, 4, and 3 clusters. All other methods were fitted with 2 clusters. The CWM method selected the VEE variant (Mazza et al., 2018). One can observe that the models fitted with, respectively, KMEANS+LM and KPLANE, and WCLRepg and WCLRepl, have equal regression coefficients.
In summary, although we did not observe a clear difference in the performance of the WCLR variants with respect to the other methods on the benchmark datasets, the WCLR variants proved to be competitive: their performance, in terms of both forecasting and adjustment, was not inferior to that of the competing methods, and they obtained better results in most test sets.
Table 26
AIS Data 3: model selection and regression coefficients.
Method Cluster Regression coefficients
𝛽0 𝛽1 𝛽2 𝛽3 𝛽4 𝛽5 𝛽6 𝛽7
CWM 1 95.3861 −5.9083 19.1190 0.4258 −1.8817 −0.6512 −0.2146 0.1437
Xnorm = VEE 2 362.2367 −2.7865 31.5458 −2.1829 42.6451 29.9094 −5.8134 −24.4450
3 350.4530 −83.5478 223.8450 −0.5946 −22.6161 −25.6711 −1.9691 30.8524
4 795.4112 −10.8173 −6.6222 −0.5483 −8.8517 −17.0596 −1.0194 17.6964
5 318.4206 6.7171 −18.9990 3.3953 −8.9709 13.3782 −2.8236 −10.2648
CLR 1 286.8765 1.2208 −2.3358 1.0352 −2.9650 5.0949 −2.1293 −3.0856
2 709.6924 0.1574 4.1381 1.4209 −11.6871 −0.7588 −4.3744 3.3547
KMEANS+LM 1 189.0091 −2.6484 8.9885 1.1910 −4.8741 2.3278 −1.2100 −0.9894
2 164.5936 −4.9930 18.7558 0.3524 0.3932 3.3630 −1.7094 −1.2604
KPLANE 1 189.0091 −2.6484 8.9885 1.1910 −4.8741 2.3278 −1.2100 −0.9894
𝛾 = 1e+05 2 164.5936 −4.9930 18.7558 0.3524 0.3932 3.3630 −1.7094 −1.2604
WCLRqpg 1 472.1501 −4.2624 5.1860 −0.8107 15.3003 8.7647 −2.4601 −6.7866
𝛼 = 1e−05 2 130.3879 −4.1024 12.3689 0.5679 −0.5553 3.1081 −0.6969 −2.4077
WCLRqpl 1 130.7031 −8.7351 28.8769 0.5532 −2.5788 1.2476 −1.0157 −0.05699
𝛼 = 1e−05 2 331.5308 −2.6581 7.4627 0.6263 −9.6701 −8.6197 −1.7272 9.5890
WCLRepg 1 470.3235 −4.3123 5.1932 −0.8755 16.2272 9.5283 −2.4814 −7.4455
𝛼 = 1e−05 2 144.7105 −4.6483 14.6424 0.7572 −2.5410 1.9994 −0.6315 −1.7303
WCLRepl 1 470.3235 −4.3123 5.1932 −0.8755 16.2272 9.5283 −2.4814 −7.4455
𝛼 = 1e−05 2 144.7105 −4.6483 14.6424 0.7572 −2.5410 1.9994 −0.6315 −1.7303
WCLResg 1 −2.1024 −2.5334 9.8099 0.4940 3.2146 7.0427 −0.2325 −6.1638
𝛼 = 1e−05 2 389.5083 −3.8116 −2.6780 −1.7496 30.8219 18.0838 −2.0804 −14.9244
𝜃 = 1.5 3 317.4094 −3.6790 16.8683 1.0139 −16.2887 −10.9006 −1.4219 10.4371
4 −132.2264 −5.1323 34.8764 0.8758 1.2749 9.6835 −1.2212 −7.4334
WCLResl 1 74.5388 −3.7464 10.7340 0.4363 4.7831 8.3819 −0.6801 −7.0670
𝛼 = 1e−05 2 233.6988 −6.7972 10.1108 −0.9406 18.6829 8.5457 −0.6867 −7.4701
𝜃 = 3.0 3 212.4582 −3.6722 18.4050 0.6021 −3.7430 2.7774 −1.9463 −0.8880
This paper presented the Weighted Clusterwise Linear Regression method. It considered differences in relevance among the explanatory variables, aiming to obtain meaningful homogeneous clusters for the explanatory variables and improved regression models. In order to achieve this aim, the WCLR method combines the Clusterwise Linear Regression (Späth, 1979) and K-means (MacQueen, 1967) methods. The proposed method may automatically and adaptively assign a relevance weight to each explanatory variable or take into account the correlation between the explanatory variables. Because the WCLR variants simultaneously learn a prototype and a linear regression model for each cluster, they can provide clusters of observations that differ in their sets of regression coefficients and have a reasonable prediction capability for the response variable.
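A highly simplified sketch of this simultaneous learning is given below, assuming that the combined criterion adds a squared distance to the cluster prototype in x-space, scaled by the balance parameter 𝛼, to the squared regression residual, and that prediction assigns a new observation to its nearest prototype. For brevity the sketch uses a plain Euclidean distance, whereas the WCLR variants use the adaptive weighted or quadratic-form distances defined in Section 3; it is an illustration of the idea, not the exact algorithm.

```python
import numpy as np

def fit_clusterwise(X, y, K=3, alpha=1.0, n_iter=100, seed=0):
    """Alternate (i) cluster assignment with the combined cost, (ii) prototype
    updates, and (iii) per-cluster least-squares regression fits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
    labels = rng.integers(0, K, size=n)
    for _ in range(n_iter):
        protos, betas = [], []
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size == 0:                  # reseed an empty cluster
                members = rng.integers(0, n, size=2)
            protos.append(X[members].mean(axis=0))
            betas.append(np.linalg.lstsq(Xd[members], y[members], rcond=None)[0])
        protos, betas = np.stack(protos), np.stack(betas)
        dist = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        resid = (y[:, None] - Xd @ betas.T) ** 2                         # (n, K)
        new_labels = np.argmin(alpha * dist + resid, axis=1)             # combined cost
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return protos, betas

def predict_clusterwise(protos, betas, X_new):
    """Assign each new point to its nearest prototype, then apply that cluster's model."""
    k = np.argmin(((X_new[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2), axis=1)
    Xd = np.column_stack([np.ones(len(X_new)), X_new])
    return np.einsum('ij,ij->i', Xd, betas[k])
```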
The performance and usefulness of the WCLR method were shown with synthetic datasets and several selected benchmark datasets with a varied number of instances and variables. Overall, the proposed methods behaved competitively relative to the previously proposed methods CWM, CLR, and KPLANE and the baseline strategy KMEANS+LM considered in this paper. The synthetic datasets considered different configurations aiming to provide insights about the preference for a particular variant of the WCLR method. The results obtained with the synthetic datasets demonstrate that the inclusion of an additional weighting step produced more meaningful clusters of individuals based on the explanatory variables and thus provided better-fitted regression models. The results with the benchmark datasets showed the good performance of the WCLR variants, especially in the forecasting task.
A major challenge found in the optimization of the WCLR objective function is the tuning of the balance hyper-parameter 𝛼. Future work in this direction is relevant to the method described in this paper, as well as to other methods that use a similar configuration. In addition to this improvement, one can also study the robustness of the proposed method, investigating robust metrics and analyzing their behavior within the method proposed here.

CRediT authorship contribution statement

Ricardo A.M. da Silva: Conceptualization, Methodology, Software, Data curation, Writing – review & editing. Francisco de A.T. de Carvalho: Supervision, Conceptualization, Methodology, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous referees for their careful revision, and the Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq, Brazil (311164/2020-0), and the Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco - FACEPE, Brazil (IBPG-0041-17), for their partial financial support of this work.

References

Aurifeille, J.-M. (2000). A bio-mimetic approach to marketing segmentation: Principles and comparative analysis. European Journal of Economic and Social Systems, 14(1), 93–108.
Aurifeille, J.-M., & Medlin, C. J. (2001). A dyadic segmentation approach to business partnerships. European Journal of Economic and Social Systems, 15(2), 3–16.
Aurifeille, J.-M., & Quester, P. G. (2003). Predicting business ethical tolerance in international markets: A concomitant clusterwise regression analysis. International Business Review, 12(2), 253–272.
Bagirov, A. M., Ugon, J., & Mirzayeva, H. G. (2015). An algorithm for clusterwise linear regression based on smoothing techniques. Optimization Letters, 9, 375–390.
Beck, G., Azzag, H., Bougeard, S., Lebbah, M., & Niang, N. (2018). A new micro-batch approach for partial least square clusterwise regression. Procedia Computer Science, 144, 239–250.
Bock, H. H. (1969). The equivalence of two extremal problems and its application to the iterative classification of multivariate data. In Workshop Medizinische Statistik.
Bock, H. H. (1994). Classification and clustering: Problems for the future. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, & B. Burtschy (Eds.), New approaches in classification and data analysis (pp. 3–24). Berlin, Heidelberg: Springer Berlin Heidelberg.
Bock, H.-H. (2008). Origins and extensions of the k-means algorithm in cluster analysis. Journal Electronique d'Histoire des Probabilités et de la Statistique, 4(2).
Bougeard, S., Abdi, H., Saporta, G., & Niang, N. (2018). Clusterwise analysis for multiblock component methods. Advances in Data Analysis and Classification, 12, 285–313.
Brusco, M. J., Cradit, J. D., Steinley, D., & Fox, G. L. (2008). Cautionary remarks on the use of clusterwise regression. Multivariate Behavioral Research, 43(1), 29–49.
Brusco, M. J., Cradit, J. D., & Tashchian, A. (2003). Multicriterion clusterwise regression for joint segmentation settings: An application to customer value. Journal of Marketing Research, 40(2), 225–234.
de Carvalho, F. A. T., Tenorio, C. P., & Junior, N. L. C. (2006). Partitional fuzzy clustering methods based on adaptive quadratic distances. Fuzzy Sets and Systems, 157(21), 2833–2857.
Chan, E., Ching, W., Ng, M., & Huang, J. (2004). An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition, 37(5), 943–952.
Charles, C. (1977). Régression typologique et reconnaissance des formes (Ph.D. thesis), Université Paris IX.
Dang, U., Punzo, A., McNicholas, P., Ingrassia, S., & Browne, R. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34(1), 4–34.
DeSarbo, W. S., & Cron, W. L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5(2), 249–282.
DeSarbo, W. S., Oliver, R. L., & Rangaswamy, A. (1989). A simulated annealing methodology for clusterwise linear regression. Psychometrika, 54(4), 707–736.
Diday, E., & Govaert, G. (1977). Classification automatique avec distances adaptatives. RAIRO Informatique Computer Science, 11(4), 329–349.
Diday, E., & Simon, J. C. (1976). Clustering analysis. In K. S. Fu (Ed.), Digital pattern recognition (pp. 47–94). Berlin, Heidelberg: Springer Berlin Heidelberg.
Diday, E., et al. (1979). Optimisation en classification automatique (tomes 1, 2). Rocquencourt: INRIA, (in French).
Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: A simulation study. Molecular Ecology, 14, 2611–2620.
García-Escudero, L., Gordaliza, A., Mayo-Iscar, A., & Martín, R. S. (2010). Robust clusterwise linear regression through trimming. Computational Statistics & Data Analysis, 54(12), 3057–3069.
Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. The Annals of the New York Academy of Sciences, 808(1), 18–24.
Gustafson, D. E., & Kessel, W. C. (1978). Fuzzy clustering with a fuzzy covariance matrix. In 1978 IEEE conference on decision and control including the 17th symposium on adaptive processes (pp. 761–766).
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.
Hennig, C. (2000). Identifiability of models for clusterwise linear regression. Journal of Classification, 17(2), 273–296.
Huang, J., Ng, M., Rong, H., & Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Ingrassia, S., Minotti, S., & Punzo, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis, 71, 159–182.
Ingrassia, S., Minotti, S., & Vittadini, G. (2012). Local statistical modeling via the cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3), 363–401.
Li, L. (1996). A new complexity bound for the least-squares problem. Computers & Mathematics with Applications, 31(12), 15–16.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1: Statistics (pp. 281–297). Berkeley, Calif.: University of California Press.
Manwani, N., & Sastry, P. (2015). K-plane regression. Information Sciences, 292, 39–56.
Mao, J., & Jain, A. K. (1996). A self-organizing network for hyper-ellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7(1), 16–29.
Mari, R. D., Rocci, R., & Gattone, S. A. (2017). Clusterwise linear regression modeling with soft scale constraints. International Journal of Approximate Reasoning, 91, 160–178.
Mazza, A., & Punzo, A. (2020). Mixtures of multivariate contaminated normal regression models. Statistical Papers, 61, 787–822.
Mazza, A., Punzo, A., & Ingrassia, S. (2018). flexCWM: A flexible framework for cluster-weighted models. Journal of Statistical Software, 86, 1–30.
Modha, D. S., & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217–237.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2001). Introduction to linear regression analysis. Wiley-Interscience.
Preda, C., & Saporta, G. (2005). Clusterwise PLS regression on a stochastic process. Computational Statistics & Data Analysis, 49, 99–108.
Preda, C., & Saporta, G. (2007). PCR and PLS for clusterwise regression on functional data. In P. Brito, G. Cucumel, P. Bertrand, & F. de Carvalho (Eds.), Selected contributions in data analysis and classification (pp. 589–598). Berlin, Heidelberg: Springer Berlin Heidelberg.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Ryoke, M., Nakamori, Y., & Suzuki, K. (1995). Adaptive fuzzy clustering and fuzzy prediction models. In Proceedings of the 1995 IEEE international conference on fuzzy systems and the second international fuzzy engineering symposium (vol. 4) (pp. 2215–2220). IEEE.
Saporta, G. (2017). Clusterwise methods, past and present. In ISI 2017 61st world statistics congress. Marrakech, Morocco. URL: https://hal-cnam.archives-ouvertes.fr/hal-02473529.
Schlittgen, R. (2011). A weighted least-squares approach to clusterwise regression. Advances in Statistical Analysis, 95, 205–217.
Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
da Silva, R. A. M., & de Carvalho, F. A. T. (2017). On combining clusterwise linear regression and K-means with automatic weighting of the explanatory variables. In A. Lintas, S. Rovetta, P. F. Verschure, & A. E. Villa (Eds.), Lecture notes in computer science: vol. 10614, Artificial neural networks and machine learning (pp. 402–410). Cham, Berlin et al.: Springer.
Späth, H. (1979). Algorithm 39: Clusterwise linear regression. Computing, 22(4), 367–373.
Späth, H. (1982). Algorithm 48: A fast algorithm for clusterwise linear regression. Computing, 29(2), 175–181.
Späth, H. (2014). Mathematical algorithms for linear regression. Academic Press.
Vicari, D., & Vichi, M. (2013). Multivariate linear regression for heterogeneous data. Journal of Applied Statistics, 40(6), 1209–1230.
Wedel, M., & DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. Journal of Classification, 12(1), 21–55.
Wu, Q., & Yao, W. (2016). Mixtures of quantile regressions. Computational Statistics & Data Analysis, 93, 162–176.