Forward Selection Component Analysis: Algorithms and Applications
Puggini, L., & McLoone, S. (2017). Forward Selection Component Analysis: Algorithms and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2395-2408. https://doi.org/10.1109/TPAMI.2017.2648792
Abstract—Principal Component Analysis (PCA) is a powerful and widely used tool for dimensionality reduction. However, the principal
components generated are linear combinations of all the original variables and this often makes interpreting results and root-cause
analysis difficult. Forward Selection Component Analysis (FSCA) is a recent technique that overcomes this difficulty by performing
variable selection and dimensionality reduction at the same time. This paper provides, for the first time, a detailed presentation of the
FSCA algorithm, and introduces a number of new variants of FSCA that incorporate a refinement step to improve performance. We
then show different applications of FSCA and compare the performance of the different variants with PCA and Sparse PCA. The results
demonstrate the efficacy of FSCA as a low information loss dimensionality reduction and variable selection technique and the improved
performance achievable through the inclusion of a refinement step.
• L. Puggini is with the Department of Electronic Engineering, National University of Ireland, Maynooth. E-mail: [email protected], [email protected]
• S. McLoone is with the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast. E-mail: [email protected]
Manuscript received June 15, 2015; revised May 23, 2016; accepted December 30, 2016.

1 INTRODUCTION
The need to analyse large volumes of multivariate data is an increasingly common occurrence in many areas of science, engineering and business. In order to build more interpretable models or to reduce the cost of data collection it is important to discover good compact representations of high-dimensional datasets. This leads to the fundamental problem of dimensionality reduction. Many methods have been developed to perform supervised dimensionality reduction [1], [2], [3], [4]. Given an input matrix X ∈ R^{m×v} (containing m measurements of v variables) and an output value y ∈ R^m, these methods try to identify which subset of variables, or derived features of X, optimally explains y. Dimensionality reduction can also be defined as an unsupervised problem. In this case we look for the subset of variables/derived features that retains the maximum information content with respect to the original set of variables, in the sense of being able to reconstruct the full data matrix X. Different unsupervised dimensionality reduction techniques have been proposed. Some of them, such as [5], [6], [7], [8], have been developed with the goal of maximising performance when used as a pre-processing step in clustering or classification algorithms, while others, such as [9], [10], [11], [12], have been developed in order to obtain the optimal reconstruction of the full dataset. Among this latter group Principal Component Analysis (PCA) is the best known and most widely used technique [11]. PCA provides the most efficient linear transformation of data to a lower dimensional space and is relatively straightforward to compute. It has found many applications in chemometrics and other fields where datasets are encountered involving large numbers of variables with significant levels of inter-variable correlation and hence redundancy (see for example [13] and [14]). However, while PCA provides a compact representation of a multivariate dataset, it does not lend itself to identification of the most representative subset of variables within the data. This is a consequence of the fact that the latent variables (principal components) generated by PCA are a linear combination of all original variables, making the most significant variables difficult to determine [15]. This is especially true in the case of highly correlated datasets due to the grouping effect, whereby the contribution of a group of highly correlated variables to a given principal component is distributed evenly across all variables in the group. While this characteristic is beneficial in terms of noise suppression, it means that the contribution of individual variables can be small, making important variables appear insignificant. Hence, tasks such as identification of key variables, root-cause analysis and model interpretation can be challenging using PCA.

Consequently, various approaches have been developed to obtain sparse approximations of PCA. The simplest strategy is to manually set to 0 the values of the principal components (PCs) that are smaller than a given threshold, but this can lead to significant variables being missed if they are part of a group of highly correlated variables [16], [15]. More sophisticated approaches such as SCoTLASS [17], DSPCA [18], sparse PCA [19], sPCA-rSVD [20], SOCA [21] and [22] use a lasso-like L1 or L0 penalty or are formulated as constrained maximization problems in order to encourage sparsity in the PCA loadings. However, these methods are generally computationally intensive and difficult to use and interpret due to the need to establish the appropriate level of sparsity for each PC computed.

These challenges motivated the second author to develop a technique called Forward Selection Component Analysis (FSCA) which seeks to identify a small number
of key variables that are representative of the observed variance across all variables. FSCA was initially introduced in the context of Optical Emission Spectroscopy (OES) data analysis of plasma etch processes [23], where isolating a small number of wavelengths is important for understanding the underlying plasma chemistry. More recently, FSCA has been found to be a particularly effective tool for optimising measurement site selection for spatial wafer metrology in semiconductor manufacturing [24]. The method works by iteratively deriving a set of orthogonal components which are a function of only a subset of the original variables, and which sequentially maximize the explained variance. At one level FSCA can be regarded as the unsupervised counterpart of Forward Selection Regression in that it returns a set of Forward Selected Variables (FSVs), but equally it retains some of the characteristics and utility of PCA in that it also returns a set of Forward Selection Components (FSCs) which form an orthogonal basis. This allows, for example, the contributions of individual components to be easily isolated.

In this paper, we present for the first time a complete description of the FSCA procedure and algorithms, drawing parallels with PCA as appropriate. In addition, motivated by the success of a two-stage algorithm proposed in [25], [26] for stepwise regression based methods, we propose a number of new variants of FSCA that incorporate a similar backward refinement step. Specifically, we introduce four backward refinement variants, which we call Single- and Multi-pass Backward Refinement FSCA (denoted as SPBR-FSCA and MPBR-FSCA); and Recursive Single- and Recursive Multi-pass Backward Refinement FSCA (denoted as R-SPBR-FSCA and R-MPBR-FSCA). Then, with the aid of a number of case studies, we demonstrate the utility of FSCA, the enhanced performance obtained with the new variants, and provide comparisons with PCA and Sparse PCA.

The remainder of the paper is organised as follows. Related work on variable selection techniques is identified in Section 2, followed by the algorithmic descriptions of PCA and FSCA in Section 3. Section 4 introduces the new backward refinement FSCA algorithms. Then comparative results and analysis are provided in Sections 5 and 6 for simulated and real world case studies, respectively. Finally, conclusions are presented in Section 7.

2 RELATED WORK
A variety of variable selection methods have been developed based on making comparisons with or extracting information from a PCA decomposition of the data matrix (e.g. [27], [28], [29], [30] and [31]). Other approaches employ clustering of features using a suitable feature similarity metric as the basis for variable selection (e.g. [5], [7] and [12]). Recently [32] proposed a novel L1 regularised formulation for the unsupervised variable selection problem which has a similar philosophy to sparse PCA and can be thought of as the unsupervised counterpart of LASSO [1].

In addition to FSCA, two other techniques which can be considered as performing direct variable selection are the algorithms by Whitley et al. [33] and Wei and Billings [34], both of which employ orthogonalisation procedures. In the former, variables are selected based on sequentially finding the variables in the dataset that are most uncorrelated with linear combinations of the variables already selected, while in the latter the criterion used for variable selection is the maximum average squared correlation with all other variables in the dataset. Wei and Billings's algorithm, which they refer to as Forward Orthogonal Search (FOS), is similar in character to FSCA and, as will be discussed in Section 3.2, yields identical results to FSCA if the data is appropriately pre-scaled. Recently [35] introduced a kernel extension of variable selection that enables non-linear relationships between variables to be taken into account, while [36] developed an efficient parallel implementation for data parallel distributed computing that scales well for large problems. Both these algorithms are equivalent to FSCA in terms of the sequence of variables selected, but operate directly in the variable space rather than producing FSCs.

While FSCA and the proposed backward refinement variants are specifically designed for unsupervised variable selection, we note that they have a number of characteristics in common with more general machine learning and signal processing dictionary selection, sparse representation and supervised variable selection problems with regard to the use of greedy forward selection and backward refinement steps. In [37], for example, a supervised dictionary selection framework is developed for sparse representation of a set of signals where the dictionary elements are recursively selected from a candidate set D using a greedy forward selection algorithm. FSCA can be regarded as a special case of this framework, where the signals to be represented are also the set of candidate dictionary elements D.

In the supervised context, in particular, various backward refinement algorithms have been proposed to address the non-optimality of greedy selection. Two notable examples are the FoBa algorithm by Zhang [38] for learning sparse representations and the Orthogonal Matching Pursuit with Replacement (OMPR) algorithm by Jain et al. [39] for compressed sensing. In FoBa, following each greedy selection step a backward variable elimination step is performed to remove variables that are not contributing to the model. The backward step is aggressive in that it allows a small increase in error when a variable is removed. This is limited to be half the error reduction in the corresponding forward selection step such that algorithm convergence is guaranteed. In OMPR an initial set of k variables is randomly selected and then a refinement step is repeatedly performed in which variables are replaced with new ones from the candidate set using a gradient based update procedure.

The backward refinement FSCA algorithms presented in this paper share some similarities with both FoBa and OMPR. FoBa is essentially a recursive multi-pass backward refinement procedure, but differs from R-MPBR-FSCA in that variables are removed rather than replaced in the refinement step. In addition, while our refinement algorithms are strictly hill climbing, i.e. variables are only replaced if they lead to an improvement in performance, they can easily be adapted to employ the more aggressive backward refinement step of FoBa. Both OMPR and backward refinement FSCA employ a variable replacement strategy, and while OMPR differs substantially from FSCA algorithmically, it can be regarded as a multi-pass backward refinement approach, but with the k components initialized randomly,
rather than being obtained as the output of a forward selection procedure.

More generally, convex relaxations of the variable selection problem are a particular case of optimization over convex hulls of an atomic set [40], which can be effectively solved using greedy Frank-Wolfe (aka conditional gradient) type algorithms [41]. In this context [42] have developed an optimization procedure called CoGEnT for general atomic-norm regularization problems which incorporates both a greedy forward selection step and an aggressive backward refinement step and is effectively a generalization of FoBa to atomic norms.

3 DATA DECOMPOSITION AND RECONSTRUCTION
Given a matrix X ∈ R^{m×v} representing a dataset with m measurements of v variables, where each variable can be considered without loss of generality to be normalised to have zero mean, different techniques can be used to obtain a more compact representation of X. Our aim is to estimate a matrix S ∈ R^{m×k}, where k < rank(X) ≤ min(m, v), such that it is possible to obtain a good reconstruction of X by linear regression on S:

X̂ = SΘ. (1)

In general, given a regressor matrix S, the optimal least square error linear reconstruction of the original signal X is given by

Θ = argmin_{Θ̃} ||SΘ̃ − X||²_F, (2)

where || · ||_F is the Frobenius norm. The solution to (2) is the well known least-squares solution:

Θ = (S^T S)^{-1} S^T X. (3)

Hence, defining the projection matrix

Φ(S) = S(S^T S)^{-1} S^T, (4)

for a given matrix S, the optimal linear reconstruction of X can be expressed as X̂ = Φ(S)X.

3.1 Principal Component Analysis
Principal Component Analysis (PCA) is one of the most common and widely used dimensionality reduction techniques. The PCA decomposition of X is defined as

X̂_k^{PCA} = T_k P_k^T = Σ_{i=1}^{k} t_i p_i^T, (7)

where P_k ∈ R^{v×k} is an orthonormal matrix, computed as the first k ordered eigenvectors of the data covariance matrix X^T X (in descending eigenvalue order), and T_k ∈ R^{m×k} is the geometric projection of X on the columns of P_k, that is:

T_k = XP_k. (8)

Here, p_i and t_i are the i-th columns of P_k and T_k, respectively. If k = r = rank(X) then the PCA decomposition is exact and

X = X̂_r^{PCA} = T_r P_r^T. (9)

Otherwise, if k < rank(X), PCA provides the best rank k approximation to X [11], that is, P_k is the solution to the optimisation problem

argmax_{P_k} V_X(XP_k P_k^T). (10)

Consequently, it follows that when S is restricted to k columns, the optimal choice is S = T_k, in which case Θ = P_k^T.

Various algorithms exist for computing the PCA decomposition. Among these the most popular are the singular value decomposition (SVD) and the Nonlinear Iterative Partial Least Squares (NIPALS) [43] algorithms. While SVD is more numerically robust and efficient when a full PCA decomposition is required, the advantage of NIPALS is that it computes the PCA decomposition iteratively, one principal component (PC) at a time, in descending order. This makes it highly efficient in high dimensional problems where typically only a small number of PCs need to be computed. For completeness, and to facilitate comparison with FSCA presented in the next section, a description of the NIPALS algorithm is provided in Algorithm 1.
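Before turning to the algorithm listings, the reconstruction machinery of (1)-(4) can be illustrated with a short NumPy sketch (ours, not the authors' code; S is assumed to have full column rank):

import numpy as np

def reconstruct(X, S):
    # Optimal linear reconstruction of X from S (eqts. (1)-(4)):
    # Theta = (S^T S)^{-1} S^T X and X_hat = S @ Theta = Phi(S) X.
    theta = np.linalg.solve(S.T @ S, S.T @ X)   # least squares coefficients
    return S @ theta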
Algorithm 1: [Pk, Tk] = NIPALS(X, k)
Require: Data matrix X, number of PCs k
1: Set R = X
2: Initialise P0 = T0 = ∅
3: Set ε = 10^{-6} (convergence threshold)
4: Initialise t to a non-zero column of X
5: for j = 1 to k do
6:   Set tnew = t and told = 10t
7:   while ||told − tnew||_2 ≥ ε do
8:     told = tnew
9:     p = R^T t / (t^T t)
10:    p = p / sqrt(p^T p)
11:    t = Rp
12:    tnew = t
13:  end while
14:  Pj = [Pj−1 p]
15:  Tj = [Tj−1 t]
16:  R = R − t p^T
17: end for
18: return Pk, Tk

Algorithm 2: [Zk, Mk, Uk] = FSCA(X, k)
Require: Data matrix X, number of FSCs k
1: Set R = X    # Notation: R = [r1, . . . , rp]
2: Initialise Z0 = M0 = U0 = ∅
3: for j = 1 to k do
4:   i = argmax_{ri∈R} VR(Φ(ri)R)
5:   m = ri
6:   z = xi
7:   u = R^T ri / (ri^T ri)
8:   Mj = [Mj−1 m]
9:   Zj = [Zj−1 z]
10:  Uj = [Uj−1 u]
11:  R = R − Φ(m)R
12: end for
13: return Zk, Mk, Uk
Fig. 2. FSCA Algorithm
Algorithm 3: [Zk, Mk, Uk] = FSVA(X, k)
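To make the procedure of Fig. 2 concrete, here is a minimal NumPy sketch of the greedy selection loop (ours, not the authors' code); it scores candidates with an equivalent sum-of-squared-inner-products form of the V_R(Φ(r_i)R) criterion (cf. eqt. (16) later in the paper) and deflates the residual after each selection, returning the selected column indices and the corresponding orthogonal components:

import numpy as np

def fsca(X, k):
    # Greedy FSCA sketch (cf. Fig. 2): select k columns of the zero-mean
    # data matrix X that sequentially maximise the explained variance.
    R = X.copy()                                  # residual matrix
    selected, components = [], []
    for _ in range(k):
        norms = np.sum(R**2, axis=0)
        norms[norms == 0] = np.finfo(float).eps   # guard exhausted columns
        # score_i = sum_j (r_i^T r_j)^2 / (r_i^T r_i)
        scores = np.sum((R.T @ R)**2, axis=0) / norms
        i = int(np.argmax(scores))
        m = R[:, [i]]                             # new orthogonal component
        selected.append(i)
        components.append(m)
        R = R - m @ (m.T @ R) / (m.T @ m)         # deflation: R = R - Phi(m)R
    return selected, np.hstack(components)

# usage (illustrative): idx, M = fsca(X - X.mean(axis=0), k=4)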
… maximising the explained variance is equivalent to maximising the Rayleigh Quotient of XX^T (see Appendix A), that is

argmax_{xi∈X} V_X(Φ(xi)X) ≡ argmax_{xi∈X} (xi^T XX^T xi) / (xi^T xi), (15)

which can be computed efficiently as

argmax_{xi∈X} Σ_{j=1}^{v} (xi^T xj)² / (xi^T xi). (16)

It is interesting to note that the corresponding expression for maximising the average squared correlation metric employed in the FOS algorithm proposed by Wei and Billings [34] is

argmax_{xi∈X} Σ_{j=1}^{v} (xi^T xj)² / ((xi^T xi)(xj^T xj)). (17)

Hence, while the FOS optimization objective has an additional scaling factor in the denominator, both algorithms will in fact yield identical results provided the columns of the data matrix X are normalised so that they are all the same length (xj^T xj will then be invariant with respect to j).

It is also worth noting that if xi is not constrained to be a column of X the solution to (15) is the largest eigenvector of XX^T, but this is simply the direction of the score vector t1 corresponding to the first PC of the data. This suggests an alternative variable selection approach, as proposed by Cui and Dy [31], whereby the first variable selected is the one that is most closely correlated with the first PC. In subsequent steps the selected variables are the closest to the first PC of the corresponding residual matrix. The approach, referred to as Orthogonal Principal Feature Selection (OPFS), is in general only an approximation to FSCA. This can be deduced as follows. Recalling the definition of the PCA decomposition in equations (7)-(9), the FSCA optimization objective (eqt. 15) can be expressed as

argmax_{xi∈X} Σ_{j=1}^{r} (xi^T tj)² / (xi^T xi) (18)

or equivalently as

argmax_{xi∈X} Σ_{j=1}^{r} λj corr(xi, tj)², (19)

where λj (= tj^T tj) is the variance contribution of the j-th PC. In contrast the OPFS optimization objective corresponds to

argmax_{xi∈X} corr(xi, t1)². (20)

Thus, while OPFS selects variables based on their squared correlation with the first PC, FSCA selects them based on the variance-weighted average squared correlation with all PCs. Hence, the sequence of variables selected by OPFS will in general differ from, and explain less variance than, the variables selected by FSCA.

If FSCA is computed using the FSVA implementation (Algorithm 3) an efficient solution can be obtained by noting that the combinatorial optimisation problem in equation (14) is equivalent to

argmax_{xi∈X} Σ_{j=1}^{v} xj^T Φ([Zk−1 xi]) xj. (21)

Recalling the definition of Φ (eqt. 4) this can be recast as

argmax_{xi∈X} Σ_{j=1}^{v} qj(i)^T (Z(i)^T Z(i))^{-1} qj(i), (22)

where qj(i) = Z(i)^T xj, and Z(i) = [Zk−1 xi]. Hence, determining the optimum xi requires v² evaluations of the vector terms qj(i) and v evaluations of the matrix inverse term (Z(i)^T Z(i))^{-1}. We can take two steps to substantially reduce the computation time for these terms. Firstly, as proposed in [35], [36], an O(k²) complexity recursive computation of the matrix inverse can be obtained by taking advantage of the fact that

Z(i)^T Z(i) = [ Zk−1^T Zk−1   r(i) ; r(i)^T   a(i) ], (23)

where r(i) = Zk−1^T xi, and a(i) = xi^T xi, and applying block matrix inversion algebra to obtain an expression for (Z(i)^T Z(i))^{-1} in terms of (Zk−1^T Zk−1)^{-1}, which has already been computed in the previous iteration, that is:

(Z(i)^T Z(i))^{-1} = [ (Zk−1^T Zk−1)^{-1} + w b b^T   −w b ; −w b^T   w ], (24)

b = (Zk−1^T Zk−1)^{-1} r(i),   w = (a(i) − r(i)^T b)^{-1}.

In contrast, direct calculation of the matrix inverse has O(k³) computational complexity.

Secondly, evaluating the terms qj(i), r(i) and a(i) all involve computing vector products xi^T xj many times, with substantial repetition both within each variable selection iteration and between iterations. This repetition can be eliminated by precomputing the covariance matrix C = X^T X, where cij = xi^T xj, at the cost of O(v²m) floating point operations (flops) and O(v²) additional memory.

Table 1 shows the estimated complexity in terms of floating point operations for computing k FSCs with the FSCA and FSVA algorithms, with and without the covariance matrix precomputed, while Fig. 4 shows how complexity varies as a function of v and m for specific combinations of the other dimensions. As can be seen, the reduction in complexity is of the order O(k²) for FSVA. Precomputing C is also beneficial for FSCA, but the impact is less significant since it has to be re-computed at each iteration due to the deflation step. That said, a factor of two reduction in computational complexity is achieved for FSCA. All algorithms scale quadratically with v and linearly with m, but differ in how they behave with respect to k, with FSCA implementations growing linearly and FSVA implementations growing cubically. If precomputing the covariance matrix is not an issue, the preferred algorithm is FSVA when k < √(1.5m) (approx.) and FSCA otherwise. FSCA is substantially superior to FSVA when the covariance matrix is not precomputed and also outperforms precomputed FSVA when k > √(3m).
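A sketch of the recursive update in (23)-(24) in NumPy (illustrative, not the authors' implementation): given (Zk−1^T Zk−1)^{-1} together with r(i) = Zk−1^T xi and a(i) = xi^T xi, the bordered inverse (Z(i)^T Z(i))^{-1} is assembled without a fresh O(k³) inversion.

import numpy as np

def bordered_inverse(ZtZ_inv, r, a):
    # (Z_(i)^T Z_(i))^{-1} from (Z_{k-1}^T Z_{k-1})^{-1}, r and a, per eqt. (24).
    b = ZtZ_inv @ r
    w = 1.0 / (a - r @ b)                 # inverse Schur complement
    top = np.hstack([ZtZ_inv + w * np.outer(b, b), (-w * b)[:, None]])
    bottom = np.hstack([-w * b, [w]])
    return np.vstack([top, bottom[None, :]])

# sanity check (illustrative):
# Z = np.random.randn(50, 4); x = np.random.randn(50)
# full = np.linalg.inv(np.block([[Z.T @ Z, (Z.T @ x)[:, None]],
#                                [(Z.T @ x)[None, :], np.array([[x @ x]])]]))
# assert np.allclose(full, bordered_inverse(np.linalg.inv(Z.T @ Z), Z.T @ x, x @ x))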
When the computational burden of FSCA becomes prohibitive, OPFS may offer an attractive compromise due to its significantly lower computational complexity. In OPFS the PC needed at each step can be computed efficiently using NIPALS, yielding an algorithm with O((4α + 8)mv) complexity per selected variable, compared to order O(mv²) with FSCA. Here, α denotes the average number of iterations per selected variable for the NIPALS algorithm to converge. It is a function of the spread of the eigenvalues of the covariance matrix C, and hence problem dependent.

Fig. 4. Complexity of FSCA algorithms as a function of v and m: Plots show FSCA (blue) and FSVA (green) implementations with precomputed covariance matrices (dashed lines) and without (solid lines). (Axes: log(flops) versus log(v) and log(m); panel titles include k = 30, m = 200 and k = 30, v = 200.)

4 FSCA WITH BACKWARD REFINEMENT
Ideally we would like to find the subset of k columns of X (variables) that can optimally reconstruct X, that is

argmax_{Zk∈X} V_X(Φ(Zk)X). (25)

However, this is an NP hard combinatorial optimisation problem (it requires the evaluation of all C(v, k) = v!/((v − k)!k!) possible combinations of the variables). In general FSCA and other greedy local search approaches are sub-optimal (i.e. they are not guaranteed to find the optimal subset of variables according to the defined optimization criteria), but they represent a pragmatic solution, as searching over all possible combinations is computationally intractable. To mitigate this sub-optimality, a backward refinement step is introduced in which each previously selected variable zj is revisited and considered for replacement: denoting by Zk^{(j)}(xi) the selection Zk with its j-th variable zj replaced by candidate xi (eqt. (26)), zj is retained if

V_X(Φ(Zk)X) ≥ max_{xi∈X/Zk} V_X(Φ(Zk^{(j)}(xi))X), (27)

and is otherwise replaced by the maximising candidate. This backward refinement step can be performed either at step 5 or step 7 of the FSV implementation of the FSCA algorithm, as highlighted in Algorithm 3. When placed at step 7 the refinement step is only undertaken once, after the FSV algorithm has completed. In contrast, the refinement step is executed following the addition of each new variable if placed at step 5. We will refer to this latter implementation as recursive backward refinement.

There are also two flavours of the refinement step itself. In the first, referred to as Single-Pass Backward Refinement (SPBR) (summarised in Algorithm 4), the relevance of each variable is evaluated in turn, moving sequentially through the variables from the oldest to the newest. In the second, to take account of the fact that variables that are initially relevant may become irrelevant following refinements to variables later in the sequence, the process is repeated until a complete pass occurs without any refinements taking place. This version of the algorithm (summarised in Algorithm 5) is referred to as Multi-Pass Backward Refinement (MPBR).

Note that by virtue of the sequencing of operations in each algorithm it follows that

V_X(X̂k^{FSCA}) ≤ V_X(X̂k^{SPBR}) ≤ V_X(X̂k^{MPBR}). (28)

However, no such statement can be made with regard to R-SPBR or R-MPBR as they may follow different 'hill climbing' solution paths and hence it is possible for the solutions to be inferior to the non-recursive implementations when k > 2. One of the side-effects of employing the backward refinement step is that it breaks the ordering of selected variables in terms of variance explained. If recovering this ordering is desirable, an additional modified FSV step can be performed on Zk with respect to X after the refinement process has been completed (i.e. between Step 7 and 8 in Algorithm 3). As summarised in Algorithm 6, this involves recursively selecting the variables in Zk based on how much of the variance of X they explain.
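For very small v and k the combinatorial optimum in (25) can be evaluated directly, which is a useful yardstick for the greedy and refined solutions; a sketch (assuming V_X is the percentage of variance explained by the least squares reconstruction):

import numpy as np
from itertools import combinations

def vx(X, S):
    # Percentage of the variance of X explained by least squares on S.
    theta = np.linalg.lstsq(S, X, rcond=None)[0]
    return 100.0 * (1.0 - np.sum((X - S @ theta)**2) / np.sum(X**2))

def best_subset(X, k):
    # Exhaustive solution of (25): all C(v, k) column subsets are scored.
    return max(combinations(range(X.shape[1]), k),
               key=lambda idx: vx(X, X[:, list(idx)]))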
Algorithm 4: [Zk, rc] = SPBR(X, Zk)
Require: Forward selected variables Zk, data matrix X
1: rc = 0 (refinement count)
2: for j = 1 to k − 1 do
3:   ij = argmax_{xi∈X} V_X(Φ(Zk^{(j)}(xi))X)
4:   if zj ≠ xij then
5:     rc = rc + 1 (increment refinement count)
6:     Zk = Zk^{(j)}(xij) (i.e. replace zj with xij)
7:   end if
8: end for
9: if rc > 0 then
10:  Repeat steps 3-7 for j = k
11: end if
12: return Zk, rc
Fig. 5. Single-Pass Backward Refinement Algorithm

Algorithm 5: [Zk] = MPBR(X, Zk)
Require: Forward selected variables Zk, data matrix X
1: rc = 1 (refinement flag)
2: while rc > 0 do
3:   [Zk, rc] = SPBR(X, Zk)
4: end while
5: return Zk
Fig. 6. Multi-Pass Backward Refinement Algorithm

4.1 Computational complexity of backward refinement
The inclusion of backward refinement has major implications for the complexity of FSCA. The lowest complexity implementation is SPBR, which involves a combinatorial search of similar complexity to the basic FSV algorithm (eqt. 22), the only difference being that Z is now a fixed size matrix, that is Z(i) → Zk^{(j)}(xi), where Zk^{(j)}(xi) is as defined in eqt. (26). Since the covariance matrix, and hence the qj(i) terms, will have already been precomputed for the forward selection step, the only concern is the development of an efficient recursive update procedure for the inverse matrix (Zk^{(j)}(xi)^T Zk^{(j)}(xi))^{-1}. This can be achieved by noting that

Zk^{(j)}(xi)^T Zk^{(j)}(xi) = Zk^T Zk + gj(i) ej^T + ej hj(i)^T, (29)

where

gj(i) = Zk^T (xi − zj), (30)
hj(i) = gj(i) + (xi − zj)^T (xi − zj) ej. (31)

It then follows, by application of the matrix inversion lemma [44], specifically the Sherman-Morrison formula [45], that

(Zk^{(j)}(xi)^T Zk^{(j)}(xi))^{-1} = Aj(i) − (Aj(i) ej hj(i)^T Aj(i)) / (1 + hj(i)^T Aj(i) ej), (32)

where

Aj(i) = (Zk^T Zk)^{-1} − ((Zk^T Zk)^{-1} gj(i) ej^T (Zk^T Zk)^{-1}) / (1 + ej^T (Zk^T Zk)^{-1} gj(i)). (33)

This recursive inverse update can be computed in O(8k² + 4k + 6) flops and hence has O(8k²) complexity, which compares favourably to the O(4k²) complexity of the forward step inverse (eqt. 24). The overall additional complexity of executing SPBR is then O(2v²k³ + 8vk³). In contrast, the recursive SPBR implementation contributes O(0.5v²k⁴ + 2vk⁴) additional complexity.

Since repetition of the MPBR loop is dependent on refinements taking place in the previous pass, the number of repetitions, and hence the overall algorithm complexity of the multi-pass implementations, cannot be determined a priori. If we denote the average number of repetitions as λ then their complexity can be expressed as λ times the complexity of the corresponding SPBR and recursive SPBR implementations. The optional reordering step has O(2vk³) complexity. Hence, the overall algorithm complexity of FSVA with backward refinement is

O(v²mk² + (2λ + 2/3) v²k³) (34)

for non-recursive implementations and

O(v²mk² + (λ/2) v²k⁴ + (2/3) v²k³ + 2λvk⁴) (35)

for recursive implementations, where λ = 1 corresponds to SPBR and λ > 1 to MPBR.

Algorithm 6: [Zok] = ReOrder(X, Zk)
Require: Forward selected variables Zk, data matrix X
1: Zo0 = ∅
2: for j = 1 to k do
3:   ij = argmax_{zi∈Zk/Zoj−1} V_X(Φ([Zoj−1 zi])X)
4:   Zoj = [Zoj−1 zij]
5: end for
6: return Zok
Fig. 7. Modified FSV procedure for reordering variables following backward refinement

5 SIMULATED DATASETS
In this section various simulated datasets are used to highlight the differences between FSCA and FSCA with backward refinement. The algorithms considered are:
• FSCA: Forward Selection Component Analysis (Algorithm 2 or 3)
• SPBR: Single-Pass Backward Refinement (Algorithm 3 with Algorithm 4 employed at step 7)
• MPBR: Multi-Pass Backward Refinement (Algorithm 3 with Algorithm 5 employed at step 7)
• R-SPBR: Recursive Single-Pass Backward Refinement (Algorithm 3 with Algorithm 4 employed at step 5)
• R-MPBR: Recursive Multi-Pass Backward Refinement (Algorithm 3 with Algorithm 5 employed at step 5)
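To make the refinement step concrete, the following is a minimal sketch of a single backward refinement pass in the spirit of Algorithm 4 (not the authors' implementation; it assumes V_X is the percentage of variance explained by the least squares reconstruction and, for simplicity, revisits every position by brute force rather than using the recursive inverse updates of Section 4.1):

import numpy as np

def explained_variance(X, Z):
    # Percentage of the variance of X explained by least squares on Z
    # (our working definition of V_X).
    theta = np.linalg.lstsq(Z, X, rcond=None)[0]
    return 100.0 * (1.0 - np.sum((X - Z @ theta)**2) / np.sum(X**2))

def spbr(X, idx):
    # Single-pass backward refinement sketch (cf. Algorithm 4): revisit each
    # selected column index and replace it if another candidate column
    # increases the explained variance.
    idx = list(idx)
    refinements = 0
    for j in range(len(idx)):
        best_i, best_v = idx[j], -np.inf
        for i in range(X.shape[1]):
            trial = idx[:j] + [i] + idx[j + 1:]
            if len(set(trial)) < len(trial):      # skip duplicate selections
                continue
            v = explained_variance(X, X[:, trial])
            if v > best_v:
                best_i, best_v = i, v
        if best_i != idx[j]:
            idx[j] = best_i
            refinements += 1
    return idx, refinements

The multi-pass flavour (Algorithm 5) simply repeats such a pass until it completes without any refinements; the paper's implementations avoid the brute-force least squares used here by means of the recursive inverse updates of Section 4.1.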
5.1 Example 1: Four Distinct Variables
As a first example we define four base variables w0, x0, y0, z0 ∼ N(0, 1), 20 noise variables ε1, . . . , ε20 ∼ N(0, 0.1) and two larger noise variables ε21, ε22 ∼ N(0, 0.4). These variables are used to generate a subset of variables similar to w0: {wi = w0 + εi}, i = 1, . . . , 5, a subset of variables similar to x0: {xi = x0 + εi+5}, i = 1, . . . , 5, a subset of variables similar to y0: {yi = y0 + εi+10}, i = 1, . . . , 5, a subset of variables similar to z0: {zi = z0 + εi+15}, i = 1, . . . , 5, and two additional redundant variables defined as h1 = w0 + x0 + ε21 and h2 = y0 + z0 + ε22. The complete dataset is then defined as X = [w0, . . . , w5, x0, . . . , x5, y0, . . . , y5, z0, . . . , z5, h1, h2], with X ∈ R^{m×26}. Hence, by design the dataset is highly redundant, with only 4 of the 26 variables independent. As such, the information it contains can be optimally summarized by the 4 base variables (w0, x0, y0, z0).

Table 2 shows the variance explained by PCA, OPFS, FSCA and the 4 backward refinement FSCA enhancements as a function of the number of selected variables k for an instance of the dataset with m = 1000, while Table 3 shows the sets of variables selected by FSCA, SPBR and OPFS for each value of k. Results for MPBR, R-SPBR and R-MPBR are omitted from Table 3 as they are identical to SPBR.

TABLE 2
Example 1: Percentage of variance explained as a function of the number of selected variables k

k   PCA    OPFS   FSCA   SPBR   MPBR   R-SPBR  R-MPBR
1   30.41  25.88  25.88  25.88  25.88  25.88   25.88
2   56.68  54.34  54.34  54.34  54.34  54.34   54.34
3   80.47  75.72  75.82  78.11  78.11  78.11   78.11
4   98.60  93.62  93.80  98.22  98.22  98.22   98.22
5   99.03  96.36  96.56  98.78  98.78  98.78   98.78
6   99.43  96.40  99.31  99.31  99.31  99.31   99.31

TABLE 3
Example 1: Variables selected at each step by FSCA, SPBR and OPFS

k   FSCA                       SPBR                       OPFS
1   {h1}                       {h1}                       {h1}
2   {h1, h2}                   {h1, h2}                   {h1, h2}
3   {h1, h2, x0}               {w0, h2, x0}               {h1, h2, w0}
4   {h1, h2, x0, z0}           {w0, y0, x0, z0}           {h1, h2, w0, z0}
5   {h1, h2, x0, z0, w0}       {y0, h1, x0, z0, w0}       {h1, h2, w0, z0, y0}
6   {h1, h2, x0, z0, w0, y0}   {x0, y0, w0, z0, h1, h2}   {h1, h2, w0, z0, y0, w2}

When FSCA is applied to this dataset, in general the first FSV will be h1 and the second will be h2, or vice versa, as dictated by the noise realization. Subsequent selections are then from among the base variables until at k = 6 all 4 base variables are selected. Hence, the initial selections become redundant as additional variables are added. As expected, the backward refinement algorithms explain greater variance than FSCA when k = 3, 4 and 5. Note that at k = 4 the refinement of the FSCA solution by SPBR identifies the optimum set of variables (i.e. the 4 base variables), hence there is no scope for further improvement by the more advanced refinement algorithms.

For comparison purposes OPFS results are also presented in the tables. As can be seen, the variable selections for k = 1 and 2 are the same as FSCA, but thereafter OPFS takes a different path, and ends with a suboptimal solution at k = 6 (explained variance of 96% versus 99%). The performance of OPFS varies considerably for different instances of the dataset. This is illustrated in Figure 8, which shows the variation in performance of each method over 200 instances of the dataset.

Fig. 8. Variation in V for each method (Num. components = 4).

5.2 Example 2
In this example the dataset X = [X0, X1] consists of u independent variables X0 and v − u perturbed redundant variables X1 generated as a linear combination of the variables in X0. In particular we define:
• X0 ∈ R^{n×u}: X0i,j ∼ N(0, 1)
• φ ∈ R^{u×(v−u)}: φi,j ∼ N(0, 1)
• ε ∈ R^{n×(v−u)}: εi,j ∼ N(0, 0.1)
• X1 = X0 · φ + ε
• X = [X0, X1]

We generated 1000 instances of this dataset for different values of u and v for n = 200 and in each case computed k = u components with FSCA, SPBR, R-SPBR, MPBR and R-MPBR. The variance explained by the k components selected by each algorithm, averaged over the 1000 repetitions, is reported in Table 4. The table also reports Sc, the percentage of true variables selected by each method, defined as

Sc = |{z1, . . . , zu} ∩ {x1, . . . , xu}|/u × 100. (36)
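As a point of reference, the Example 2 data and the Sc metric of (36) can be reproduced along the following lines (a sketch with illustrative function names; the N(0, ·) parameters are treated as scale parameters here):

import numpy as np

def make_example2(n=200, u=10, v=30, seed=0):
    # u independent variables X0 plus v-u noisy linear combinations:
    # X1 = X0 @ phi + eps, X = [X0, X1].
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.0, 1.0, size=(n, u))
    phi = rng.normal(0.0, 1.0, size=(u, v - u))
    eps = rng.normal(0.0, 0.1, size=(n, v - u))
    return np.hstack([X0, X0 @ phi + eps])

def true_variable_score(selected, u):
    # Sc of eqt. (36): percentage of the u 'true' variables (the columns
    # of X0, i.e. the first u columns of X) present in the selection.
    return 100.0 * len(set(selected) & set(range(u))) / u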
TABLE 4
Example 2: Percentage variance explained (VX) and percentage of true variables selected (Sc) with FSCA and its backward refinement variants (averaged over 1000 repetitions)

Percentage variance explained (VX)
u    v    FSCA   SPBR   MPBR   R-SPBR  R-MPBR
10   30   99.75  99.87  99.89  99.88   99.89
15   50   99.77  99.89  99.92  99.90   99.92
20   75   99.78  99.90  99.94  99.90   99.94
25   100  99.78  99.91  99.94  99.91   99.94

Percentage of true variables selected (Sc)
u    v    FSCA   SPBR   MPBR   R-SPBR  R-MPBR
10   30   22.38  48.80  70.11  47.73   66.30
15   50   16.03  43.73  74.20  41.98   72.06
20   75   14.70  41.60  81.66  45.11   79.82
25   100  12.85  34.74  71.46  38.21   71.95

As expected, the introduction of a refinement step consistently increases the explained variance relative to FSCA, with SPBR reducing unexplained variance by 50%-64% and MPBR reducing it by 58%-73%. There is no appreciable difference between the performance of the recursive and non-recursive implementations of each algorithm. A similar pattern is observed with respect to the number of true variables selected by each method, FSCA: 12%-22%, SPBR/R-SPBR: 35%-49% and MPBR/R-MPBR: 66%-82%.

Noting that PCA provides an upper bound on achievable explained variance for a given number of components, Figure 9 shows the variance explained by FSCA and the various backward refinement algorithms with k = 1, 2, ..., 12 selected components for the case where u = 10, v = 30 and n = 1000, expressed as a percentage of the variance explained by the equivalent number of PCs obtained using PCA. As can be seen, SPBR consistently provides improved performance over FSCA for k > 1. For values of k in the vicinity of the true dimensionality of the data, MPBR is marginally superior to SPBR (57.5% versus 49.2% reduction in unexplained variance at k = 10, 11.8% versus 9.7% at k = 8 and 27.0% versus 23.9% at k = 12). In general the improvement due to backward refinement decreases rapidly as k increases beyond the true dimensionality of the data.

Fig. 9. Example 2: The percentage of variance explained (100 × V*/VPCA) as a function of the number of selected components for u = 10, v = 30.

5.3 Example 3: Sparse PCA Dataset
This example is a simulated dataset used in [19] to assess the performance of the sparse PCA algorithm introduced therein. The dataset is generated by 3 hidden variables v1, v2, v3:
• v1 ∼ N(0, 290), v2 ∼ N(0, 300)
• v3 = −0.3v1 + 0.952v2 + ε, where ε ∼ N(0, 1)
and 10 observed variables
• xi = v1 + ε1i where ε1i ∼ N(0, 1) for i = 1, . . . , 4
• xi = v2 + ε2i where ε2i ∼ N(0, 1) for i = 5, . . . , 8
• xi = v3 + ε3i where ε3i ∼ N(0, 1) for i = 9, 10

The final data matrix X ∈ R^{n×10} is then defined as X = [x1, . . . , x10], where n = 1000 is the number of samples.

TABLE 5
Example 3: The 1st and 2nd loading generated by PCA and SPCA (λ = 20) and the 1st and 2nd FSC obtained with FSCA and SPBR

i    PC1    PC2    SPC1    SPC2    FSC1   FSC2   SPBR1   SPBR2
1    -0.13  0.48   -80.00
2    -0.13  0.48   -80.00
3    -0.13  0.48   -80.00  1       1
4    -0.13  0.48   -80.00
5    0.39   0.16   79.61   1       1
6    0.39   0.16   79.61
7    0.39   0.16   79.61
8    0.39   0.16   79.61
9    0.41   0.01   77.43   3.09    1      1
10   0.41   0.01   77.43   3.09
VX   60.80  99.99  59.43   99.99   60.75  99.99  58.22   99.99

The results reported in Table 5, which are for a single realization of the dataset, show that both FSCA and SPBR with 2 components explain more than 99% of the total variance. In general, for other realizations FSCA will always select either x9 or x10 as one of the two variables. SPBR and MPBR will instead select one variable from the group generated by v1 and one from the group generated by v2. PCA and SPCA assign similar values to similar variables due to the grouping effect. However, as a result they do not omit redundant variables. Thus, while SPCA yields sparse solutions it is not the optimal choice if the objective is to select a compact set of variables to represent the data.

6 APPLICATION EXAMPLES
6.1 Pitprops Dataset
The pitprops dataset, originally introduced by [46] as a PCA case study, is a widely used benchmark problem for evaluating the performance of PCA and SPCA like methods. The dataset consists of 180 samples of 13 variables describing properties of timber and was used by the British Forestry Commission in a study to establish if home-grown timber had sufficient strength to be used to provide roof support struts ('Pitprops') for mines. Using the correlation matrix for the dataset provided in [46] we generated 180 samples of a multivariate normal distribution to synthesise an approximation of the original dataset.
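The synthetic Pitprops data described above can be generated along these lines (a sketch; pitprops_corr is assumed to hold the 13 × 13 correlation matrix tabulated in [46]):

import numpy as np

def synthesise_pitprops(pitprops_corr, n=180, seed=0):
    # n samples from a zero-mean multivariate normal whose covariance is
    # the published Pitprops correlation matrix.
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(pitprops_corr.shape[0]),
                                   pitprops_corr, size=n)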
Fig. 11. Plasma Etch Process OES dataset: The first 4 PCA and FSCA components and scores (panels: FSCA variables, FSCA components, FSCA loadings, PCA scores, PCA loadings).
Fig. 13 shows the clusters obtained with each of the FSCA algorithms for k = 4 and k = 8. The clusters are represented by markers of different colour and/or shape. Of particular note is the variation in spatial consistency of clusters. It is apparent that the refinement steps yield much better spatial consistency of clusters than FSCA, with the biggest improvements occurring with SPBR. Using FSCA 50% of the clusters are fragmented for both k = 4 and k = 8, while using SPBR only 1 cluster (25%) is fragmented when k = 4 and none are fragmented when k = 8. MPBR, R-SPBR and R-MPBR yield similar results to SPBR with some minor variation at the boundaries between clusters. It is also noteworthy that in the FSCA plot for k = 8 the 'black star' site is clustered with only one other site, which is in a spatially unrelated area of the wafer. These anomalies are a consequence of the sites initially selected by FSCA becoming redundant as additional sites are selected, as discussed in Section 4. This issue, which detracts from the interpretability of clusters, is addressed through the introduction of the backward refinement step.

7 DISCUSSION AND CONCLUSIONS
This paper has sought to provide a comprehensive presentation of Forward Selection Component Analysis, as the unsupervised counterpart of Forward Selection Regression and an alternative to PCA for dimensionality reduction and variable selection in large highly correlated datasets. A number of alternative FSCA algorithm implementations have been described, namely FSCA and FSVA, with and without pre-computation of the covariance matrix, and their computational complexity analysed. In particular, this analysis reveals that: (1) all algorithms scale linearly with the number of measurements m and quadratically with the number of variables v; (2) FSCA implementations grow linearly with the number of selected variables k, while FSVA implementations grow cubically with k; and (3) the optimum choice of implementation is dependent on the ratio k/√m. In general, it is computationally advantageous to pre-compute and store the covariance matrix when using either FSCA or FSVA, with FSVA the most computationally efficient implementation provided k/√m < √1.5. FSCA without pre-computation of the covariance matrix is the superior implementation when k/√m > √3.

A number of novel backward refinement variants of FSCA have also been proposed and efficient algorithm implementations developed. Results from simulated and application case studies confirm that the refinements yield improvements in performance relative to FSCA in terms of variance explained for a given number of components/variables selected, better variable selection and, in the case of the wafer site optimisation problem, more coherent FSCA clusters. Overall the key observations are that MPBR is superior to SPBR, which is in turn superior to FSCA, and that there is little, if any, benefit to be gained from employing the recursive formulations (R-SPBR and R-MPBR) over their non-recursive counterparts. Indeed, in some instances the recursive implementations can yield poorer results.

In terms of computational complexity the ordering is FSCA < SPBR < MPBR < R-SPBR < R-MPBR, with SPBR having the same asymptotic complexity as FSCA. It is also noteworthy that the largest relative improvement in performance in terms of variance explained occurs with the change from FSCA to SPBR. As such, for most practical applications SPBR is recommended as it provides a good balance between complexity and quality of results.

In all the case studies presented in the paper algorithm performance is considered in an unsupervised context, as this is the natural framework for comparing unsupervised variable selection techniques. The interested reader is referred to Appendix B for a supplementary linear regression case study where performance is also assessed in a supervised context. Specifically, FSCA and its variants are employed to select the model inputs from a large candidate set, and performance is evaluated in terms of the prediction capability of the resulting models.

ACKNOWLEDGMENTS
The authors would like to thank Adrian Johnston, Seagate Technology and Niall MacGearailt, Intel Ireland for the provision of use cases and Maynooth University for the financial support provided. The authors would also like to acknowledge the anonymous reviewers for their valuable suggestions which have greatly enhanced the paper.
Fig. 13. The FSCA clusters obtained with FSCA, SPBR, MPBR, R-SPBR and R-MPBR for k = 4 and k = 8. In each case the FSCA selected sites are indicated by circles and the associated clusters by markers of different colour and/or shape. The percentage variance explained by the different algorithms is reported under each plot.

APPENDIX A
For a given regressor matrix S the least squares solution (3) also maximises the explained variance, i.e. Θ = argmax_{Θ̃} V_X(SΘ̃), so maximising V_X(Φ(xi)X) over the candidate columns xi of X is equivalent to minimising the reconstruction error ||X − X̂||²_F with X̂ = Φ(xi)X. Noting that Φ(xi) = Φ(xi)^T and, by application of the cyclic property of the trace operator and the fact that Φ(xi)² = Φ(xi), that tr(Φ(xi)XX^T Φ(xi)) = tr(Φ(xi)Φ(xi)XX^T) = tr(Φ(xi)XX^T), we can write

||X − X̂||²_F = tr(XX^T) − 2tr(Φ(xi)XX^T) + tr(Φ(xi)XX^T Φ(xi)) = tr(XX^T) − tr(Φ(xi)XX^T),

so minimising ||X − X̂||²_F is equivalent to maximising tr(Φ(xi)XX^T). Finally, by application of the definition of Φ(·) in eqt. (4) and the properties of trace, the latter can be rewritten to give

argmax_{xi∈X} V_X(Φ(xi)X) = argmax_{xi∈X} tr((xi xi^T)/(xi^T xi) XX^T) = argmax_{xi∈X} (xi^T XX^T xi)/(xi^T xi).
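A quick numerical check of this equivalence (a sketch, assuming V_X is the percentage of variance explained by the least squares reconstruction):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X -= X.mean(axis=0)

def vx(X, S):
    theta = np.linalg.lstsq(S, X, rcond=None)[0]
    return 100.0 * (1.0 - np.sum((X - S @ theta)**2) / np.sum(X**2))

ev = [vx(X, X[:, [i]]) for i in range(X.shape[1])]         # V_X(Phi(x_i)X)
rq = [X[:, i] @ X @ X.T @ X[:, i] / (X[:, i] @ X[:, i])    # Rayleigh quotient
      for i in range(X.shape[1])]
assert int(np.argmax(ev)) == int(np.argmax(rq))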
TABLE 11
Mean and (standard deviation) of the CV-NMSE (%) achieved with the regression models generated using PCA, FSCA, SPBR, MPBR, R-SPBR and R-MPBR for input selection

k        5            10           15
PCA      1.40 (0.05)  1.26 (0.04)  1.13 (0.04)
FSCA     1.45 (0.08)  1.25 (0.04)  1.10 (0.05)
SPBR     1.36 (0.05)  1.26 (0.06)  1.11 (0.03)
MPBR     1.35 (0.05)  1.22 (0.09)  1.12 (0.04)
R-SPBR   1.36 (0.05)  1.20 (0.07)  1.13 (0.04)
R-MPBR   1.36 (0.05)  1.19 (0.06)  1.13 (0.04)

TABLE 12
Percentage of variance in X explained by the variables/features selected by PCA, FSCA, and its backward refinement variants

k        5      10     15
PCA      97.44  99.31  99.81
FSCA     96.16  98.80  99.57
SPBR     96.74  99.04  99.68
MPBR     96.74  99.07  99.70
R-SPBR   96.74  99.06  99.70
R-MPBR   96.74  99.07  99.70
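For context, a cross-validated NMSE of the kind reported in Table 11 could be computed along the following lines (a sketch only; the exact normalisation and fold scheme used in Appendix B are not reproduced here, and the helper below simply normalises by the variance of y):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_nmse(X, y, n_splits=10):
    # Mean squared prediction error normalised by var(y), averaged
    # over folds and expressed as a percentage.
    errs = []
    for train, test in KFold(n_splits=n_splits, shuffle=True,
                             random_state=0).split(X):
        model = LinearRegression().fit(X[train], y[train])
        resid = y[test] - model.predict(X[test])
        errs.append(np.mean(resid**2) / np.var(y))
    return 100.0 * float(np.mean(errs))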
[18] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Review, vol. 49, no. 3, pp. 434-448, 2007.
[19] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265-286, 2006.
[20] H. Shen and J. Z. Huang, "Sparse principal component analysis via regularized low rank matrix approximation," Journal of Multivariate Analysis, vol. 99, no. 6, pp. 1015-1034, 2008.
[21] D. M. Witten, R. Tibshirani, and T. Hastie, "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis," Biostatistics, p. kxp008, 2009.
[22] R. Jenatton, G. Obozinski, and F. Bach, "Structured sparse principal component analysis," arXiv preprint arXiv:0909.1440, 2009.
[23] E. Ragnoli, S. McLoone, S. Lynn, J. Ringwood, and N. Macgearailt, "Identifying key process characteristics and predicting etch rate from high-dimension datasets," in Advanced Semiconductor Manufacturing Conference, 2009. ASMC '09. IEEE/SEMI, May 2009, pp. 106-111.
[24] P. Prakash, A. Johnston, B. Honari, and S. McLoone, "Optimal wafer site selection using forward selection component analysis," in Advanced Semiconductor Manufacturing Conference (ASMC), 2012 23rd Annual SEMI. IEEE, 2012, pp. 91-96.
[25] K. Li, J.-X. Peng, and E.-W. Bai, "A two-stage algorithm for identification of nonlinear dynamic systems," Automatica, vol. 42, no. 7, pp. 1189-1197, 2006.
[26] K. Li, J.-X. Peng, and E.-W. Bai, "Two-stage mixed discrete-continuous identification of radial basis function (RBF) neural models for nonlinear systems," Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 56, no. 3, pp. 630-643, 2009.
[27] I. T. Jolliffe, "Discarding variables in a principal component analysis. I: Artificial data," Applied Statistics, pp. 160-173, 1972.
[28] W. Krzanowski, "Selection of variables to preserve multivariate data structure, using principal components," Applied Statistics, pp. 22-33, 1987.
[29] K. Mao, "Identifying critical variables of principal components for unsupervised feature selection," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 35, no. 2, pp. 339-344, 2005.
[30] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian, "Feature selection using principal feature analysis," in Proceedings of the 15th International Conference on Multimedia. ACM, 2007, pp. 301-304.
[31] Y. Cui and J. G. Dy, "Orthogonal principal feature selection," in Sparse Optimization and Variable Selection Workshop at the International Conference on Machine Learning, Helsinki, Finland, July 2008.
[32] M. Masaeli, Y. Yan, Y. Cui, G. Fung, and J. G. Dy, "Convex principal feature selection," in SDM. SIAM, 2010, pp. 619-628.
[33] D. C. Whitley, M. G. Ford, and D. J. Livingstone, "Unsupervised forward selection: a method for eliminating redundant variables," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1160-1168, 2000.
[34] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 1, pp. 162-166, 2007.
[35] R. Liu, R. Rallo, and Y. Cohen, "Unsupervised feature selection using incremental least squares," International Journal of Information Technology and Decision Making, vol. 10, no. 06, pp. 967-987, 2011.
[36] Z. Zhao, R. Zhang, J. Cox, D. Duling, and W. Sarle, "Massively parallel feature selection: an approach based on variance preservation," Machine Learning, vol. 92, no. 1, pp. 195-220, 2013.
[37] V. Cevher and A. Krause, "Greedy dictionary selection for sparse representation," Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 5, pp. 979-988, 2011.
[38] T. Zhang, "Adaptive forward-backward greedy algorithm for learning sparse representations," Information Theory, IEEE Transactions on, vol. 57, no. 7, pp. 4689-4708, 2011.
[39] P. Jain, A. Tewari, and I. S. Dhillon, "Orthogonal matching pursuit with replacement," in Advances in Neural Information Processing Systems, 2011, pp. 1215-1223.
[40] M. Jaggi, "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 427-435.
[41] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95-110, 1956.
[42] N. Rao, P. Shah, and S. Wright, "Forward-backward greedy algorithms for signal demixing," in Signals, Systems and Computers, 2014 48th Asilomar Conference on. IEEE, 2014, pp. 437-441.
[43] H. Wold, "Nonlinear estimation by iterative least square procedures," in Research Papers in Statistics, F. David, Ed. Wiley, New York, 1966, pp. 411-444.
[44] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[45] M. S. Bartlett, "An inverse matrix adjustment arising in discriminant analysis," Ann. Math. Statist., vol. 22, no. 1, pp. 107-111, 1951.
[46] J. Jeffers, "Two case studies in the application of principal component analysis," Applied Statistics, pp. 225-236, 1967.
[47] H. Yue, S. Qin, R. Markle, C. Nauert, and M. Gatto, "Fault detection of plasma etchers using optical emission spectra," Semiconductor Manufacturing, IEEE Transactions on, vol. 13, no. 3, pp. 374-385, Aug 2000.
[48] D. Zeng and C. Spanos, "Virtual metrology modeling for plasma etch operations," Semiconductor Manufacturing, IEEE Transactions on, vol. 22, no. 4, pp. 419-431, Nov 2009.
[49] L. Puggini and S. McLoone, "Extreme learning machines for virtual metrology and etch rate prediction," in Signals and Systems Conference (ISSC), 2015 26th Irish. IEEE, 2015, pp. 1-6.

Seán McLoone received an M.E. degree in Electrical and Electronic Engineering and a PhD in Control Engineering from Queen's University Belfast (QUB), Belfast, U.K. in 1992 and 1996, respectively. Following appointments as a Postdoctoral Research Fellow (1996-1997) and Lecturer at QUB (1998-2002) he joined the Department of Electronic Engineering at the National University of Ireland Maynooth in 2002, where he served as Senior Lecturer (2005-2012) and Head of Department (2009-2012). He is currently a Professor and Director of the Energy Power and Intelligent Control (EPIC) Research Cluster at Queen's University Belfast. His research interests include computational intelligence techniques, data analytics, system identification and control, with a particular focus on smart grid and advanced manufacturing informatics applications.

Luca Puggini was born in Rome (Italy) in 1989. He obtained the Laurea Magistrale in Pure and Applied Mathematics from the University of Tor Vergata in 2013. He worked on his thesis at Statistics for Innovation in Oslo, Norway. In September 2013 he joined the Department of Electronic Engineering at Maynooth University as a PhD student on a collaborative research project with Intel Ireland. His research interests include statistics, big data, machine learning, and computational intelligence techniques.