0% found this document useful (0 votes)

15 views

Forward Selection Component Analysis Algorithms and Applications

Uploaded by

ccampbell.work.sg

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views

Forward Selection Component Analysis Algorithms and Applications

Uploaded by

ccampbell.work.sg

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Forward Selection Component Analysis: Algorithms and Applications

Puggini, L., & McLoone, S. (2017). Forward Selection Component Analysis: Algorithms and Applications. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2395-2408.
https://doi.org/10.1109/TPAMI.2017.2648792

Published in:
IEEE Transactions on Pattern Analysis and Machine Intelligence

Document Version:
Peer reviewed version

Queen's University Belfast - Research Portal:

Link to publication record in Queen's University Belfast Research Portal

Publisher rights
Copyright 2017 IEEE.
This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.

General rights
Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other
copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated
with these rights.

Take down policy

The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to
ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the
Research Portal that you believe breaches copyright or violates any law, please contact [email protected].

Open Access
This research has been made openly available by Queen's academics and its Open Research team. We would love to hear how access to
this research benefits you. – Share your feedback with us: http://go.qub.ac.uk/oa-feedback

Download date:29. Oct. 2024

Forward Selection Component Analysis:

Algorithms and Applications
Luca Puggini, Seán McLoone, Senior Member, IEEE

Abstract—Principal Component Analysis (PCA) is a powerful and widely used tool for dimensionality reduction. However, the principal
components generated are linear combinations of all the original variables and this often makes interpreting results and root-cause
analysis difficult. Forward Selection Component Analysis (FSCA) is a recent technique that overcomes this difficulty by performing
variable selection and dimensionality reduction at the same time. This paper provides, for the first time, a detailed presentation of the
FSCA algorithm, and introduces a number of new variants of FSCA that incorporate a refinement step to improve performance. We
then show different applications of FSCA and compare the performance of the different variants with PCA and Sparse PCA. The results
demonstrate the efficacy of FSCA as a low information loss dimensionality reduction and variable selection technique and the improved
performance achievable through the inclusion of a refinement step.

Index Terms—Unsupervised dimensionality reduction, subset selection, feature selection

1 I NTRODUCTION
The need to analyse large volumes of multivariate data and other fields where datasets are encountered involving
is an increasingly common occurrence in many areas of large numbers of variables with significant levels of inter-
science, engineering and business. In order to build more variable correlation and hence redundancy (see for example
interpretable models or to reduce the cost of data collection [13] and [14]). However, while PCA provides a compact rep-
it is important to discover good compact representations of resentation of a multivariate dataset, it does not lend itself to
high-dimensional datasets. This leads to the fundamental identification of the most representative subset of variables
problem of dimensionality reduction. Many methods have within the data. This is a consequence of the fact that the
been developed to perform supervised dimensionality re- latent variables (principal components) generated by PCA
duction [1], [2], [3], [4] . Given an input matrix X ∈ Rm×v are a linear combination of all original variables, making the
(containing m measurements of v variables) and an output most significant variables difficult to determine [15]. This
value y ∈ Rm these methods try to understand what is especially true in the case of highly correlated datasets
subset of variables, or derived features of X optimally due to the grouping effect, whereby the contribution of a
explain y. Dimensionality reduction can also be defined group of highly correlated variables to a given principal
as an unsupervised problem. In this case we look for the component is distributed evenly across all variables in the
subset of variables/derived features that retain the maxi- group. While this characteristic is beneficial in terms of noise
mum information content with respect to the original set suppression, it means that the contribution of individual
of variables, in the sense of being able to reconstruct the variables can be small making important variables appear
full data matrix X. Different unsupervised dimensionality insignificant. Hence, tasks such as identification of key
reduction techniques have been proposed. Some of them, variables, root-cause analysis and model interpretation can
such as [5], [6], [7], [8], have been developed with the goal be challenging using PCA.
of maximising performance when used as a pre-processing Consequently, various approaches have been developed
step in clustering or classification algorithms, while others, to obtain sparse approximations of PCA. The simplest strat-
such as [9], [10], [11], [12], have been developed in order to egy is to manually set to 0 the values of the principal com-
obtain the optimal reconstruction of the full dataset. Among ponents (PCs) that are smaller than a given threshold but
this latter group Principal Component Analysis (PCA) is this can lead to significant variables being missed if they are
the best known and most widely used technique [11]. PCA part of a group of highly correlated variables [16], [15]. More
provides the most efficient linear transformation of data to sophisticated approaches such as SCoTLASS [17], DSPCA
a lower dimensional space and is relatively straight forward [18], sparse PCA [19], sPCA-rSVD [20], SOCA [21] and [22]
to compute. It has found many applications in chemometrics use a lasso like L1 or L0 penalty or are formulated as
constrained maximization problems in order to encourage
• L. Puggini is with the Department of Electronic Engineering, Na- sparsity in the PCA loadings. However, these methods are
tional University of Ireland, Maynooth. E-mail: [email protected], generally computationally intensive and difficult to use and
[email protected] interpret due to the need to establish the appropriate level
• S. McLoone is with the School of Electronics, Electrical Engi-
neering and Computer Science, Queen’s University Belfast. E-mail: of sparsity for each PC computed.
[email protected] These challenges motivated the second author to de-
Manuscript received June 15, 2015; revised May 23, 2016; accepted December velop a technique called Forward Selection Component
30, 2016. Analysis (FSCA) which seeks to identify a small number
2

of key variables that are representative of the observed the variables in the dataset that are most uncorrelated with
variance across all variables. FSCA was initially introduced linear combinations of the variables already selected, while
in the context of Optical Emission Spectroscopy (OES) data in the latter the criterion used for variable selection is
analysis of plasma etch processes [23] where isolating a the maximum average squared correlation with all other
small number of wavelengths is important for understand- variables in the dataset. Wei and Billings’s algorithm, which
ing the underlying plasma chemistry. More recently, FSCA they refer to as Forward Orthogonal Search (FOS), is similar
has been found to be a particularly effective tool for optimis- in character to FSCA and, as will be discussed in Section 3.2,
ing measurement site selection for spatial wafer metrology yields identical results to FSCA if the data is appropriately
in semiconductor manufacturing [24]. The method works by pre-scaled. Recently [35] introduced a kernel extension of
iteratively deriving a set of orthogonal components which variable selection that enables non-linear relationships be-
are a function of only a subset of the original variables, and tween variables to be taken in account, while [36] devel-
which sequentially maximize the explained variance. At one oped an efficient parallel implementation for data parallel
level FSCA can be regarded as the unsupervised counterpart distributed computing that scales well for large problems.
of Forward Selection Regression in that it returns a set of Both these algorithms are equivalent to FSCA in terms of
Forward Selected Variables (FSVs), but equally it retains the sequence of variables selected, but operate directly in
some of the characteristics and utility of PCA in that it the variable space rather than producing FSCs.
also returns a set of Forward Selection Components (FSCs) While FSCA and the proposed backward refinement
which form an orthogonal basis. This allows, for example, variants are specifically designed for unsupervised variable
the contributions of individual components to be easily selection, we note that they have a number of characteristics
isolated. in common with more general machine learning and signal
In this paper, we present for the first time, a complete processing dictionary selection, sparse representation and
description of the FSCA procedure and algorithms, drawing supervised variable selection problems with regard to the
parallels with PCA as appropriate. In addition, motivated use of greedy forward selection and backward refinement
by the success of a two stage algorithm proposed in [25], steps. In [37], for example, a supervised dictionary selection
[26] for stepwise regression based methods, we propose a framework is developed for spare representation of a set
number of new variants of FSCA that incorporate a similar of signals where the dictionary elements are recursively
backward refinement step. Specifically, we introduce four selected from a candidate set D using a greedy forward
backward refinement variants, which we call Single, and selection algorithm. FSCA can be regarded as special case
Multi-pass Backward Refinement FSCA (denoted as SPBR- of this framework, where the signals to be represented are
FSCA and MPBR-FSCA); and Recursive Single, and Recur- also the set of candidate dictionary elements D.
sive Multi-pass Backward Refinement FSCA (denoted as R- In the supervised context, in particular, various back-
SPBR-FSCA and R-MPBR-FSCA). Then, with the aid of a ward refinement algorithms have been proposed to ad-
number of case studies, we demonstrate the utility of FSCA, dress the non-optimality of greedy selection. Two notable
the enhanced performance obtained with the new variants, examples are the FoBa algorithm by Zhang [38] for learn-
and provide comparisons with PCA and Sparse PCA. ing sparse representations and the Orthogonal Matching
The remainder of the paper is organised as follows. Pursuit with Replacement (OMPR) algorithm by Jain et
Related work on variable selection techniques is identified al. [39] for compressed sensing. In FoBa following each
in Section 2, followed by the algorithmic descriptions of greedy selection step a backward variable elimination step
PCA and FSCA in Section 3. Section 4 introduces the new is performed to remove variables that are not contributing to
backward refinement FSCA algorithms. Then comparative the model. The backward step is aggressive in that it allows
results and analysis are provided in Section 5 and 6 for a small increase in error when a variable is removed. This is
simulated and real world case studies, respectively. Finally, limited to be half the error reduction in the corresponding
conclusions are presented in Section 7. forward selection step such that algorithm convergence is
guaranteed. In OMPR an initial set of variables k is ran-
domly selected and then a refinement step is repeatedly per-
2 R ELATED W ORK formed in which variables are replaced with new ones from
A variety of variable selection methods have been devel- the candidate set using a gradient based update procedure.
oped based on making comparisons with or extracting The backward refinement FSCA algorithms presented
information from a PCA decomposition of the data matrix in this paper share some similarities with both FoBa and
(e.g. [27], [28], [29], [30] and [31]). Other approaches employ OMPR. FoBa is essentially a recursive multi-pass backward
clustering of features using a suitable feature similarity refinement procedure, but differs from R-MPBR-FSCA in
metric as the basis for variable selection (e.g. [5], [7] and that variables are removed rather than replaced in the re-
[12]). Recently [32] proposed a novel L1 regularised for- finement step. In addition, while our refinement algorithms
mulation for the unsupervised variable selection problem are strictly hill climbing, i.e. variables are only replaced if
which has a similar philosophy to sparse PCA and can be they lead to an improvement in performance, they can easily
thought of as the unsupervised counterpart of LASSO [1]. be adapted to employ the more aggressive backward refine-
In addition to FSCA, two other techniques which can be ment step of FoBa. Both OMPR and backward refinement
considered as performing direct variable selection are the FSCA employ a variable replacement strategy, and while
algorithms by Whitley et al. [33] and Wei and Billings [34], OMPR differs substantially from FSCA algorithmically, it
both of which employ orthogonalisation procedures. In the can be regarded as a multi-pass backward refinement ap-
former, variables are selected based on sequentially finding proach, but with the k components initialized randomly,
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 3

rather than being obtained as the output of a forward 3.1 Principal Component Analysis
selection procedure. Principal Component Analysis (PCA) is one of the most
More generally, convex relaxations of the variable se- common and widely used dimensionality reduction tech-
lection problem are a particular case of optimization over niques. The PCA decomposition of X is defined as
convex hulls of an atomic set [40], which can be effectively
k
solved using greedy Frank-Wolfe (aka conditional gradient) X
X̂kP CA = Tk PT
k = ti pT
i, (7)
type algorithms [41]. In this context [42] have developed an
i=1
optimization procedure called CoGEnT for general atomic-
norm regularization problems which incorporates both a where Pk ∈ Rv×k is an orthonormal matrix, computed as
greedy forward selection step and an aggressive backward the first k ordered eigenvectors of the data covariance matrix
refinement step and is effectively a generalization of FoBa XT X (in descending eigenvalue order), and Tk ∈ Rm×k is
to atomic norms. the geometric projection of X on the columns of Pk , that is:
Tk = XPk . (8)
3 DATA D ECOMPOSITION AND R ECONSTRUCTION Here, pi and ti are the i-th column’s of Pk and Tk , respec-
Given a matrix X ∈ Rm×v representing a dataset with m tively. If k = r = rank(X) then the PCA decomposition is
measurements of v variables, where each variable can be exact and
considered without loss of generality to be normalised to X = X̂rP CA = Tr PTr. (9)
have zero mean, different techniques can be used to obtain Otherwise, if k < rank(X) PCA provides the best rank k
a more compact representation of X. Our aim is to estimate approximation to X [11], that is, Pk is the solution to the
a matrix S ∈ Rm×k , where k < rank(X) ≤ min(m, v), optimisation problem
such that it is possible to obtain a good reconstruction of X
by linear regression on S: argmax VX (XPk PT
k ). (10)
Pk
X̂ = SΘ. (1) Consequently, it follows that when S is restricted to k
columns, the optimal choice is S = Tk , in which case
In general, given a regressor matrix S, the optimal least
Θ = PT k.
square error linear reconstruction of the original signal X
Various algorithms exist for computing the PCA decom-
is given by
position. Among these the most popular are the singular
Θ = argmink SΘ̃ − X k2F , (2) value decomposition (SVD) and the Nonlinear Iterative
Θ̃
Partial Least Squares (NIPALS) [43] algorithms. While SVD
where (k · kF ) is the Frobenius norm. The solution to (2) is is more numerically robust and efficient when a full PCA
the well known least-squares solution: decomposition is required, the advantage of NIPALS is
that it computes the PCA decomposition iteratively, one
Θ = (ST S)−1 ST X. (3)
principal component (PC) at a time, in descending order.
Hence, defining the projection matrix This makes it highly efficient in high dimensional problems
where typically only a small number of PCs need to be
Φ(S) = S(ST S)−1 ST , (4) computed. For completeness, and to facilitate comparison
with FSCA presented in the next section, a description of
for a given matrix S, the optimal linear reconstruction of X the NIPALS algorithm is provided in Algorithm 1.
can be expressed as

X̂ = Φ(S)X. (5) 3.2 Foward Selection Component Analysis

In contrast to PCA which produces a reduced set of new
Different metrics can be used to quantify the approxima-
variables (PCs) that are linear combinations of all existing
tion error between the matrix X and its reconstruction X̂ variables, Forward Selection Component Analysis (FSCA)
(such as element-wise or induced L1 , L2 and L∞ norms of derives a set of new variables (FSCs) that are a function
X̂ − X). The metric that is often considered when working of only a subset of the original variables that maximise
with Principal Component Analysis, and the one adopted the explained variance. This is achieved using the iterative
here, is the percentage of explained variance, that is, the procedure detailed in Algorithm 2. The FSCA algorithm
percentage of the variance observed in X explained by X̂. returns:
Noting that the columns of X have been defined as having
zero mean, this can be expressed in terms of the Frobenius • A matrix Zk composed of a subset of the columns
norm as of X (FSVs) ranked according to how well they
! contribute to the reconstruction of X.
k X̂ − X k2F • A matrix of FSCA components (FSCs) Mk . The first
VX (X̂) = 100 × 1 − . (6)
k X k2F column of Mk is equivalent to the first column of Zk .
The second column of Mk is a function of the first
While VX (X̂) is unbounded in the negative direction for and the second column of Zk and so on. In general
arbitrary X̂, when X̂ is computed as a projection of X onto the k -th FSC will be defined as a function of itself
the subspace spanned by S, as given by eqt. (5), VX (X̂) ≥ 0 and of the previous k − 1 components and so is a
for arbitrary S. (A proof is provided in Appendix A.) function of the first k selected variables.
4

Algorithm 1: [Pk , Tk ]=NIPALS(X, k ) Algorithm 2: [Zk , Mk , Uk ]=FSCA(X, k )

Require: Data matrix X, number of PCs k Require: Data matrix X, number of FSCs k
1: Set R = X 1: Set R = X # Notation: R = [r1 , . . . , rp ]
2: Initialise P0 = T0 = ∅ 2: Initialise Z0 = M0 = U0 = ∅
3: Set = 10−6 (convergence threshold) 3: for j = 1 to k do
4: Initialise t to a non-zero column of X 4: i = argmax VR (Φ(ri )R)
5: for j = 1 to k do ri ∈R
5: m = ri
6: Set tnew = t and told = 10t 6: z = xi
7: while k told − tnew k2 ≥ do u = RT ri /(rT
7: i ri )
8: told = tnew 8: Mj = [Mj−1 m]
9: p = RTp t/tT t 9: Zj = [Zj−1 z]
10: p = p/ pT p 10: Uj = [Uj−1 u]
11: t = Rp 11: R = R − Φ(m)R
12: tnew = t 12: end for
13: end while 13: return Zk , Mk , Uk
14: Pj = [Pj−1 p]
15: Tj = [Tj−1 t]
16: R = R − tpT Fig. 2. FSCA Algorithm
17: end for
18: return Pk , Tk
Algorithm 3: [Zk , Mk , Uk ]=FSVA(X,k )

Fig. 1. PCA NIPALS Algorithm

Require: Data matrix X, number of FSVs k
1: Z0 = ∅
2: for j = 1 to k do
• A matrix of FSCA loadings Uk . 3: ij = argmax VX (Φ([Zj−1 xi ])X)
xi ∈X
FSCA leads to a decomposition of X of the form 4: Zj = [Zj−1 xij ]
k
5: Optional refinement step (recursive)
6: end for
X
X̂kF SC = Mk UT
k = mi uT
i. (11)
7: Optional refinement step
i=1
8: Mk = GramSchmidt(Zk )
where Mk is an orthogonal matrix of Forward Selection 9: Uk = XT Mk (MT −1
k Mk )
Components (FSCs). Equivalently we can express the de- 10: return Zk , Mk , Uk
composition directly in terms of the Forward Selection Vari-
ables (FSVs), Zk as
k
X Fig. 3. Direct FSV implementation of FSCA
X̂kF SV = Zk BT
k = zi bT
i. (12)
i=1
alternative FSV based implementation of FSCA is as given in
Here Zk = [z1 , ..., zk ] = [xi1 , ..., xik ] ⊂ X and BT
k are the Algorithm 3. Note that steps 5 and 7 are place holders for the
corresponding least squares regression coefficients, that is
refinement step which will be introduced in Section 4 and
−1
Bk = XT Zk (ZT
k Zk ) . (13) are not part of the basic algorithm, which we will refer to as
the FSV algorithm (FSVA). If we are only interested in FSVs
In a similar fashion to NIPALS, the FSCs generated by steps 8 and 9 can be omitted, in which case the algorithm
FSCA are ordered in descending order in terms of the corresponds to the feature selection methods presented in
variance of X explained. Furthermore, by virtue of their [35] and [36].
orthogonality the variance contribution of the i-th FSC is Since PCA provides the optimal representation in terms
simply obtained as (mT T
i mi )(ui ui ). A similar expression maximising the variance explained (eq. 10) it follows that
holds for PCA, but since Pk is an orthonormal matrix this
VX (X̂kF SC ) ≤ VX (X̂kP CA ). Hence PCA can be regarded as
reduces to tT
i ti . providing an upper bound on the performance achievable
The FSV decomposition could be computed directly,
with FSCA with a given number of variables k , or equiva-
rather than as a by-product of the FSC computation by
lently a lower bound on the number of variables needed to
solving
achieve a desired reconstruction accuracy.
ik = argmax VX (Φ([Zk−1 xi ])X) (14)
xi ∈X

and setting Zk = [Zk−1 xik ], with Z0 = ∅. However, the

3.3 Computational Complexity of FSCA
results are equivalent for a given number of components,
that is X̂kF SC = X̂kF SV . In particular, once Zk has been com- The computation time of the FSCA algorithm (Algorithm 2)
puted, the corresponding FSC matrix Mk can be obtained is dominated by the combinatorial optimisation problem in
as the Gram-Schmidt orthogonalization of Zk . Thus an step 4. It is relatively straight forward to show that maximis-
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 5

ing the explained variance is equivalent to maximising the Recalling the definition of Φ (eqt. 4) this can be recast as
Rayleigh Quotient of XXT (see Appendix A), that is v
X
−1
xT T argmax qT T
j(i) (Z(i) Z(i) ) qj(i) , (22)
i XX xi xi ∈X
argmax VX (Φ(xi )X) ≡ argmax T
, (15) j=1
xi ∈X xi ∈X xi xi
where qj(i) = ZT (i) xj , and Z(i) = [Zk−1 xi ]. Hence, deter-
which can be computed efficiently as
mining the optimum xi requires v 2 evaluations of the vector
v
X (xT
i xj )
2
terms qj(i) and v evaluations of the matrix inverse term
argmax . (16) −1
xi ∈X
T
xi xi (ZT(i) Z(i) ) . We can take two steps to substantially reduce
j=1
the computation time for these terms. Firstly, as proposed in
It is interesting to note that the corresponding expression [35], [36], an O(k 2 ) complexity recursive computation of the
for maximising the average squared correlation metric em- matrix inverse can be obtained by taking advantage of the
ployed in the FOS algorithm proposed by Wei and Billing fact that " T #
[34] is T
Zk−1 Zk−1 r(i)
v Z(i) Z(i) = , (23)
X (xT
i xj )
2
rT a(i)
argmax T T
. (17) (i)
xi ∈X j=1
(xi xi )(xj xj )
where r(i) = Zk−1 xi , and a(i) = xT i xi , and applying
Hence, while the FOS optimization objective has an addi- block matrix inversion algebra to obtain an expression for
tional scaling factor in the denominator, both algorithms (ZT −1
in terms of (ZT −1
(i) Z(i) ) k−1 Zk−1 ) which has already
will in fact yield identical results provided the columns of been computed in the previous iteration, that is:
the data matrix X are normalised so that they are all the " #
same length (xT j xj will then be invariant with respect to j ). T −1 (ZT
k−1 Zk−1 )
−1
+ wbbT −wb
(Z(i) Z(i) ) = , (24)
It is also worth noting that if xi is not constrained to be −wbT w
a column of X the solution to (15) is the largest eigenvector
of XXT , but this is simply the direction of the score vector b = (ZTk−1 Zk−1 )−1 r(i) , w = (a(i) − rT(i) b)−1 .
t1 corresponding to the first PC of the data. This suggests
In contrast direct calculation of the matrix inverse has O(k 3 )
an alternative variable selection approach, as proposed by
computational complexity.
Cui and Dy [31], whereby the first variable selected is
Secondly, evaluating the terms qj(i) , r(i) and a(i) all
the one that is most closely correlated with the first PC.
involve computing vector products xT i xj many times, with
In subsequent steps the selected variables are the closest
substantial repetition both within each variable selection
to the first PC of the corresponding residual matrix. The
iteration and between iterations. This repetition can be elim-
approach, referred to as Orthogonal Principal Feature Selec-
inated by precomputing the covariance matrix C = XT X,
tion (OPFS), is in general only an approximation to FSCA.
where cij = xT 2
i xj , at the cost of O(v m) floating point
This can be deduced as follows. Recalling the definition 2
operations (flops) and O(v ) additional memory.
of the PCA decomposition in equations (7)-(9), the FSCA
Table 1 shows the estimated complexity in terms of float-
optimization objective (eqt. 15) can be expressed as
ing point operations for computing k FSCs with the FSCA
r
X (xT
i tj )
2 and FSVA algorithms, with and without the covariance
argmax T
(18) matrix precomputed, while Fig. 4 shows how complexity
xi ∈X j=1
xi xi
varies as a function of v and m for specific combinations
or equivalently as of the other dimensions. As can be seen, the reduction in
r
X complexity is of the order O(k 2 ) for FSVA. Precomputing C
argmax λj corr(xi , tj )2 , (19) is also beneficial for FSCA, but the impact is less significant
xi ∈X j=1 since it has to be re-computed at each iteration due to the
deflation step. That said, a factor of two reduction in com-
where λj (= tTj tj ) is the variance contribution of the j -th PC. putational complexity is achieved for FSCA. All algorithms
In contrast the OPFS optimization objective corresponds to scale quadratically with v and linearly with m, but differ in
argmax corr(xi , t1 )2 . (20) how they behave with respect to k , with FSCA implementa-
xi ∈X tions growing linearly and FSVA implementations growing
Thus, while OPFS selects variables based on their squared cubically. If precomputing the covariance matrix is √ not an
correlation with the first PC, FSCA selects them based on issue, the preferred algorithm is FSVA when k < 1.5m
the variance-weighted average squared correlation with all (approx.) and FSCA otherwise. FSCA is substantially supe-
PCs. Hence, the sequence of variables selected by OPFS will rior to FSVA when the covariance matrix is not precomputed
√
in general differ from, and explain less variance than, the and also outperforms precomputed FSVA when k > 3m.
variables selected by FSCA. When the computational burden of FSCA becomes pro-
If FSCA is computed using the FSVA implementation hibitive OPFS may offer an attractive compromise due to
(Algorithm 3) an efficient solution can be obtained by noting its significantly lower computational complexity. In OPFS
that the combinatorial optimisation problem in equation (14) the PC needed at each step can be computed efficiently
is equivalent to using NIPALS yielding and algorithm with O((4α + 8)mv)
v complexity per selected variable, compared to order O(mv 2 )
with FSCA. Here, α denotes the average number of iter-
X
argmax xT
j Φ([Zk−1 xi ])xj . (21)
xi ∈X j=1 ations per selected variable for the NIPALS algorithm to
6

TABLE 1 all possible subsets quickly becomes computationally in-

Flop count and asymptotic complexity for computing k FSCs of tractable with increasing problem dimension.
X ∈ Rm×v with different FSCA algorithm implementations
As a consequence of the greedy strategy adopted by
FSCA variables that are selected in early iterations of the
Method Floating point operation count Complexity algorithm can become redundant as other variables are
k << v included in later iterations. This can result in sub-optimal
2 2 solutions and can also be detrimental to the performance of
O(2v mk + 6vmk + v k + 2mk −
FSCA O(2v 2 mk) some applications, for example, the clustering application
vk − k)
which will be presented in Section 6.3. To overcome this
FSCA
O(v 2 mk + 2vmk + 2v 2 k) O(v 2 mk) weakness, we propose introducing a backward refinement
(PC)
2 step similar to that presented in [25], [26] for forward
O(2v 2 mk̄ + 2v 2 k̄2 + 4vmk̄ + 4v k̄2 + O(v 2 mk2 + v 2 k3 ) selection regression applications, where, following comple-
FSVA
4vmk − 2vk + 2mk2 + 4mk) 3
tion of the forward selection process, selected variables are
FSVA O(v 2 m + v 2 (2k̄2 + k̄) + 2mk2 + 2 2 2 3 reviewed to see if they are still relevant and replaced if they
O(v m + v k )
(PC) 2vmk+4mk−vk+v(4k̄2 −6k̄+3k)) 3 are not.
(j)
Notation: k̄ = k(k − 1)/2; k̄2 = k(k + 1)(2k + 1)/6; and P C denotes Denoting Zk (xi ) as matrix Zk with its j -th column
implementation with precomputed covariance matrix. replaced by xi , that is:
(j)
Zk (xi ) = Zk + (xi − zj )eT
j, (26)
16
k=10,m=200 11
k=10,v=200
10 10
where ej is a vector with its j -th element equal to 1 and all
14
10
10
10 others elements equal to zero, we define zj ∈ Zk as relevant
if
log(flops)

log(flops)

12 9
10 10
(j)
10 8
VX (Φ(Zk )X) ≥ max VX (Φ(Zk (xi ))X). (27)
10 10 xi ∈X/Zk
8
10
7
10 This backward refinement step can be performed either at
6 6
step 5 or step 7 of the FSV implementation of the FSCA
10 10
2
10
3
10
4
10
5
10
1
10
2
10
3
10
4
10 algorithm, as highlighted Algorithm 3. When placed at step
log(v) log(m)
7 the refinement step is only undertaken once after the FSV
16
k=30,m=200 12
k=30,v=200 algorithm has completed. In contrast, the refinement step
10 10
is executed following the addition of each new variable if
14
10
11
10 placed at step 5. We will refer to this latter implementation
as recursive backward refinement.
log(flops)

log(flops)

12 10
10 10
There are also two flavours of the refinement step itself.
10
10
9
10 In the first, referred to as Single-Pass Backward Refinement
8 8
(SPBR) (summarised in Algorithm 4), the relevance of each
10 10
variable is evaluated in turn moving sequentially through
6
10 2 3 4 5
7
10 1 2 3 4
the variables from the oldest to the newest. In the second,
10 10 10 10 10 10 10 10
log(v) log(m) to take account of the fact that variables that are initially
relevant may become irrelevant following refinements to
Fig. 4. Complexity of FSCA algorithms as a function of v and m: Plots variables later in the sequence, the process is repeated until a
show FSCA (blue) and FSVA (green) implementations with precom- complete pass occurs without any refinements taking place.
puted covariance matrices (dashed lines) and without (solid lines). This version of the algorithm (summarised in Algorithm 5)
is referred to as Multi-Pass Backward Refinement (MPBR).
Note that by virtue of the sequencing of operations in
converge. It is a function of the spread of the eigenvalues
each algorithm it follows that
of the covariance matrix C, and hence problem dependent.
VX (X̂kF SCA ) ≤ VX (X̂kSP BR ) ≤ VX (X̂kM P BR ). (28)
4 FSCA WITH BACKWARD R EFINEMENT However, no such statement can be made with regard to R-
Ideally we would like to find the subset of k columns of X SPBR or R-MPBR as they may follow different ’hill climbing’
(variables) that can optimally reconstruct X, that is solution paths and hence it is possible for the solutions to be
inferior to the non-recursive implementations when k > 2.
argmax VX (Φ(Zk )X). (25) One of the side-effects of employing the backward refine-
Zk ∈X
ment step is that it breaks the ordering of selected variables
However, this is an NP hard combinatorial optimisation in terms of variance explained. If recovering this ordering is
problem (requires the evaluation of kv = v!/((v − k)!k!)

desirable, an additional modified FSV step can be performed
possible combinations of the variables). In general FSCA on Zk with respect to X after the refinement process has
and other greedy local search approaches are sub optimal been completed (i.e. between Step 7 and 8 in Algorithm 3).
(i.e. they are not guaranteed to find the optimal subset of As summarised in Algorithm 6, this involves recursively
variables according to the defined optimization criteria), selecting the variables in Zk based on how much of the
but they represent a pragmatic solution as searching over variance of X that they explain.
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 7
(j) (j)
Algorithm 4: [Zk , rc ]=SPBR(X,Zk ) (Zk (xi )T Zk (xi ))−1 . This can be achieved by noting that
(j) (j)
Require: Forward selected variables Zk , data matrix X Zk (xi )T Zk (xi ) = ZT T T
k Zk + gj(i) ej + ej hj(i) , (29)
1: rc = 0 (refinement count) where
2: for j = 1 to k − 1 do
3:
(j)
ij = argmax VX (Φ(Zk (xi ))X) gj(i) = ZT
k (xi − zj ), (30)
xi ∈X T
4: if zj 6= xij then hj(i) = gj(i) + (xi − zj ) (xi − zj )ej . (31)
5: rc = rc + 1 (increment refinement count) It then follows, by application of the matrix inversion lemma
(j)
6: Zk = Zk (xij ) (i.e. replace zj with xij ) [44], specifically the Sherman-Morrison formula [45], that
7: end if
8: end for (j) (j)
Aj(i) ej hT
j(i) Aj(i)
(Zk (xi )T Zk (xi ))−1 = Aj(i) − , (32)
9: if rc > 0 then 1 + hT
j(i) Aj(i) ej
10: Repeat steps 3-7 for j = k
where
11: end if
−1 −1
12: return Zk , rc −1
(ZT
k Zk ) gj(i) eT T
j (Zk Zk )
Aj(i) = (ZT
k Zk ) − T T
. (33)
1 + ej (Zk Zk )−1 gj(i)

Fig. 5. Single-Pass Backward Refinement Algorithm This recursive inverse update can be computed in O(8k 2 +
4k + 6) flops and hence has O(8k 2 ) complexity, which com-
pares favourably to the O(4k 2 ) complexity of the forward
Algorithm 5: [Zk ]=MPBR(X,Zk )
step inverse (eqt. 24). The overall additional complexity of
executing SPBR is then O(2v 2 k 3 + 8vk 3 ). In contrast, the re-
Require: Forward selected variables Zk , data matrix X cursive SPBR implementation contributes O(0.5v 2 k 4 +2vk 4 )
1: rc = 1 (refinement flag)
additional complexity.
2: while rc > 0 do
Since repetition of the MPBR loop is dependent on
3: [Zk , rc ]=SPBR(X,Zk ) refinements taking place in the previous pass, the number
4: end while
of repetitions and hence overall algorithm complexity of the
5: return Zk
multi-pass implementations cannot be determined a priori.
If we denote the average number of repetitions as λ then
their complexity can be expressed as λ times the complexity
Fig. 6. Multi-Pass Backward Refinement Algorithm
of the corresponding SPBR and recursive SPBR implementa-
tions. The optional reordering step has O(2vk 3 ) complexity.
Hence, the overall algorithm complexity of FSVA with Back-
4.1 Computational complexity of backward refinement ward refinement is
2
The inclusion of backward refinement has major implica- O(v 2 mk 2 + (2λ + )v 2 k 3 ) (34)
tions for the complexity of FSCA. The lowest complexity 3
implementation is SPBR, which involves a combinatorial for non-recursive implementations and
search of similar complexity to the basic FSV algorithm (eqt. λ 2 4 2 2 3
22), the only difference being that Z is now a fixed size O(v 2 mk 2 +
v k + v k + 2λvk 4 ) (35)
(j) (j) 2 3
matrix, that is Z(i) → Zk (xi ), where Zk (xi ) is as defined
for recursive implementations, where λ = 1 corresponds to
in eqt. (26). Since the covariance matrix, and hence the qj(i)
SPBR and λ > 1 MPBR.
terms, will have already been precomputed for the forward
selection step, the only concern is the development of an
efficient recursive update procedure for the inverse matrix 5 S IMULATED DATASETS
In this section various simulated datasets are used to high-
light the differences between FSCA and FSCA with back-
Algorithm 6: [Zok ]=ReOrder(X,Zk ) ward refinement. The algorithms considered are:
• FSCA: Forward Selection Component Analysis (Al-
Require: Forward selected variables Zk , data matrix X
gorithm 2 or 3)
1: Zo0 = ∅
• SPBR: Single-Pass Backward Refinement (Algorithm
2: for j = 1 to k do
3 with Algorithm 4 employed at step 7)
3: ij = argmax VX (Φ([Zoj−1 zi ])X)
zi ∈Zk /Zo • MPBR: Multi-Pass Backward Refinement (Algorithm
j−1
4: Zoj = [Zoj−1 zij ] 3 with Algorithm 5 employed at step 7)
5: end for • R-SPBR: Recursive Single-Pass Backward Refine-
6: return Zok ment (Algorithm 3 with Algorithm 4 employed at
step 5)
• R-MPBR: Recursive Multi-Pass Backward Refine-
Fig. 7. Modified FSV procedure for reordering variables following back- ment (Algorithm 3 with Algorithm 5 employed at
ward refinement step 5)
8

Comparisons are also made with PCA, the Orthogonal TABLE 2

Principal Feature Selection approximation to FSCA (OPFS), Example 1: The percentage of explained variance achieved with PCA,
OPFS, FSCA, and its backward refinement variants, for different
and Sparse PCA where appropriate. numbers of selected components

5.1 Example 1: Four Distinct Variables k PCA OPFS FSCA SPBR MPBR R-SPBR R-MPBR
As a first example we define four base variables 1 30.41 25.88 25.88 25.88 25.88 25.88 25.88
w0 , x0 , y0 , z0 ∼ N (0, 1), 20 noise variables 1 , . . . , 20 ∼ 2 56.68 54.34 54.34 54.34 54.34 54.34 54.34
N (0, 0.1) and two larger noise variables 21 , 22 ∼ N (0, 0.4). 3 80.47 75.72 75.82 78.11 78.11 78.11 78.11
4 98.60 93.62 93.80 98.22 98.22 98.22 98.22
These variables are used to generate a subset of variables 5 99.03 96.36 96.56 98.78 98.78 98.78 98.78
similar to w0 : {wi = w0 + i }i=1,...,5 , a subset of variables 6 99.43 96.40 99.31 99.31 99.31 99.31 99.31
similar to x0 : {xi = x0 + i+5 }i=1,...,5 , a subset of variables
similar to y0 : {yi = y0 + i+10 }i=1,...,5 , a subset of variables TABLE 3
similar to z0 : {zi = z0 + i+15 }i=1,...,5 and two additional Example 1: Variables selected at each step by FSCA, SPBR and OPFS
redundant variables defined as h1 = w0 + x0 + 21 and
h2 = y0 + z0 + 22 . The complete dataset is then defined as k FSCA SPBR OPFS
X = [w0 , . . . , w5 , x0 , . . . , x5 , y0 , . . . , y5 , z0 , . . . , z5 , h1 , h2 ],
1 {h1 } {h1 } {h1 }
with X ∈ Rm×26 . Hence, by design the dataset is highly 2 {h1 , h2 } {h1 , h2 } {h1 , h2 }
redundant, with only 4 of the 26 variables independent. As 3 {h1 , h2 , x0 } {w0 , h2 , x0 } {h1 , h2 , w0 }
such, the information it contains can be optimally summa- 4 {h1 , h2 , x0 , z0 } {w0 , y0 , x0 , z0 } {h1 , h2 , w0 , z0 }
5 {h1 , h2 , x0 , z0 , {y0 , h1 , x0 , z0 , {h1 , h2 , w0 , z0 ,
rized by the 4 base variables (w0 , x0 , y0 , z0 ). w0 } w0 } y0 }
Table 2 shows the variance explained by PCA, OPFS, 6 {h1 , h2 , x0 , z0 , {x0 , y0 , w0 , z0 , {h1 , h2 , w0 , z0 ,
FSCA and the 4 backward refinement FSCA enhancements w0 , y0 } h1 , h2 } y0 , w 2 }
as a function of the number of selected variables k for an
instance of the dataset with m = 1000, while Table 3 shows
the sets of variables selected by FSCA, SPBR and OPFS for perturbed redundant variables X1 generated as a linear
each value of k . Results for MPBR, R-SPBR and R-MPBR are combination of the variables in X0 . In particular we define:
omitted from Table 3 as they are identical to SPBR. • X0 ∈ Rn×u : X0i,j ∼ N (0, 1)
When FSCA is applied to this dataset in general the first • φ ∈ Ru×(v−u) : φi,j ∼ N (0, 1)
FSV will be h1 and the second will be h2 , or viva versa,
• ∈ Rn×(v−u) : i,j ∼ N (0, 0.1)
as dictated by the noise realization. Subsequent selections
• X1 = X0 · φ +
are then from among the base variables until at k = 6 all
• X = [X0 , X1 ]
4 based variables are selected. Hence, the initial selections
become redundant as additional variables are added. As ex- We generated 1000 instances of this dataset for different
pected the backward refinement algorithms explain greater values of u and v for n = 200 and in each case computed
variance than FSCA when k = 3, 4 and 5. Note, that at k = 4 a k = u components FSCA, SPBR, R-SPBR, MPBR and R-
the refinement of the FSCA solution by SPBR identifies the MPBR. The variance explained by the k components se-
optimum set of variables (i.e. the 4 base variables), hence lected by each algorithm averaged over the 1000 repetitions
there is no scope for further improvement by the more is reported in Table 4. The table also reports Sc , the per-
advanced refinement algorithms. centage of true variables selected by each method, defined
For comparison purposes OPFS results are also pre- as
sented in the tables. As can be seen, the variable selections Sc = | {z1 , . . . zu } ∩ {x1 , . . . , xu } |/u × 100. (36)
for k = 1 and 2 are the same as FSCA, but thereafter
OPFS takes a different path, and ends with a suboptimal
solution at k = 6 (explained variance of 96% versus 99%). 100 Num. components =4
The performance of OPFS varies considerably for different 99
98
instances of the dataset. This is illustrated in Figure 8, which 97
96
shows the variation in performance of each method over 200 95
V

different dataset realizations with m = 100 when selecting 94

93
k = 4 and 6 components. FSCA also shows considerable 92
91
variation in performance but is in general superior to OPFS. PCA OPFS FSCA SPBR MPBR R-SPBR R-MPBR
A pairwise comparison of the variance explained by OPFS 99.6 Num. components =6
and FSCA over 1000 repetitions of the dataset shows that
99.4
OPFS only outperforms FSCA 2% of the time. In contrast,
99.2
SPBR and the other refinement algorithms are consistently
V

superior to FSCA and OPFS and show little variation in 99.0

performance over the different dataset realizations. 98.8
98.6
PCA OPFS FSCA SPBR MPBR R-SPBR R-MPBR
5.2 Example 2: Block Redundancy
In this example the dataset consists of a block of indepen- Fig. 8. Example 1: Boxplots showing variation in performance of each
dent variables X0 augmented by a second block of noise method for k = 4 and 6 components, over 200 Monte Carlo repetitions
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 9

TABLE 4 100
Example 2: Percentage variance explained (VX ) and percentage of
true variables selected (Sc ) with FSCA and its backward refinement 98
variants (averaged over 1000 repetitions)
96

Percentage of variance explained (VX ) 94

100 ×V ∗/VPCA
u v FSCA SPBR MPBR R-SPBR R-MPBR
92
10 30 99.75 99.87 99.89 99.88 99.89
15 50 99.77 99.89 99.92 99.90 99.92 90
20 75 99.78 99.90 99.94 99.90 99.94 FSCA
25 100 99.78 99.91 99.94 99.91 99.94 88 SPBR
MPBR
Percentage of true variables selected (Sc ) 86 R-SPBR
u v FSCA SPBR MPBR R-SPBR R-MPBR R-MPBR
84
1 2 3 4 5 6 7 8 9 10 11 12
10 30 22.38 48.80 70.11 47.73 66.30 Number of Components
15 50 16.03 43.73 74.20 41.98 72.06
20 75 14.70 41.60 81.66 45.11 79.82
25 100 12.85 34.74 71.46 38.21 71.95 Fig. 9. Example 2: The percentage of variance explained as a function
of the number of selected components for u = 10, v = 30

TABLE 5
As expected, the introduction of a refinement step consis- Example 3: The 1st and 2nd loading generated by PCA and SPCA (λ =
tently increases the explained variance relative to FSCA 20) and the 1st and 2nd FSC obtained with FSCA and SPBR
with SPBR reducing unexplained variance by 50%-64% and
MPBR reducing it by 58%-73%. There is no appreciable i PC1 PC2 SPC1 SPC2 FSC1 FSC2 SPBR1 SPBR2
difference between the performance of the recursive and 1 -0.13 0.48 -80.00
non-recursive implementations of each algorithm. A similar 2 -0.13 0.48 -80.00
3 -0.13 0.48 -80.00 1 1
pattern is observed with respect to the number of true vari- 4 -0.13 0.48 -80.00
ables selected by each method, FSCA: 12%-22%, SPBR/R- 5 0.39 0.16 79.61 1 1
SPBR: 35%-49% and MPBR/R-MPBR: 66%-82%. 6 0.39 0.16 79.61
7 0.39 0.16 79.61
Noting that PCA provides an upper bound on achievable 8 0.39 0.16 79.61
explained variance for a given number of components, 9 0.41 0.01 77.43 3.09 1 1
10 0.41 0.01 77.43 3.09
Figure 9 shows the variance explained by FSCA and the var-
VX 60.80 99.99 59.43 99.99 60.75 99.99 58.22 99.99
ious backward refinement algorithms with k = 1, 2, ..., 12
selected components for the case where u = 10, v = 30
and n = 1000, expressed as a percentage of the variance
explained by the equivalent number of PCs obtained using The results reported in Table 5, which are for a single
PCA. As can be seen, SPBR consistently provides improved realization of the dataset, show that both FSCA and SPBR
performance over FSCA for k > 1. For values of k in the with 2 components explain more than 99% of the total
vicinity of the true dimensionality of the data, MPBR is variance. In general, for other realizations FSCA will always
marginally superior to SPBR (57.5% versus 49.2% reduction select either x9 or x10 as one of the two variables. SPBR
in unexplained variance at k = 10, 11.8% versus 9.7% at and MPBR will instead select one variable from the group
k = 8 and 27.0% versus 23.9% at k = 12). In general the generated by v1 and one of from the group generated by
improvement due to backward refinement decreases rapidly v2 . PCA and SPCA assign similar values to similar variables
as k increases beyond the true dimensionality of the data. due to the grouping effect. However, as a result they do not
omit redundant variables. Thus, while SPCA yields sparse
solutions it is not the optimal choice if the objective is to
5.3 Example 3: Sparse PCA Dataset select a compact set of variables to represent the data.
This example is a simulated dataset used in [19] to assess
the performance of the sparse PCA algorithm introduced 6 A PPLICATION E XAMPLES
therein. The dataset is generated by 3 hidden variables 6.1 Pitprops Dataset
v1 , v2 , v3
The pitprops dataset, originally introduced by [46] as a
• v1 ∼ N (0, 290) v2 ∼ N (0, 300). PCA case study, is a widely used benchmark problem for
• v3 = −0.3v1 + 0.952v2 + where ∼ N (0, 1). evaluating the performance of PCA and SPCA like meth-
ods. The dataset consists of 180 samples of 13 variables
and 10 observed variables describing properties of timber and was used by the British
• xi = v1 + 1i where 1i ∼ N (0, 1) f or i = 1, . . . , 4 Forestry Commission in a study to establish if home-grown
• xi = v2 + 2i where 2i ∼ N (0, 1) f or i = 5, . . . , 8 timber had sufficient strength to be used to provide roof
• xi = v3 + 3i where 3i ∼ N (0, 1) f or i = 9, 10 support struts ’Pitprops’ for mines. Using the correlation
matrix for the dataset provided in [46] we generated 180
The final data matrix X ∈ Rn×10 is then defined as X = samples of a multivariate normal distribution to synthesise
[x1 , . . . , x10 ], where n = 1000 is the number of samples. an approximation of the original dataset.
10

TABLE 6 6.2 Plasma Etch OES Analysis

Pitprops dataset: Percentage of explained variance as a function of the
number of variables selected for each algorithm In semiconductor manufacturing Optimal Emission Spec-
troscopy (OES) is increasingly used to monitor plasma etch
k PCA SPCA FSCA SPBR MPBR R-SPBR R-MPBR
processes. Due to the high dimensionality and correlated
nature of OES data dimensionality reduction techniques
3 67.41 36.16 55.98 60.04 60.04 60.04 60.04
4 75.52 60.86 67.13 67.13 67.13 68.33 68.33 such as PCA are usually employed as a pre-processing step
5 82.13 65.40 74.07 75.76 75.76 75.76 75.76 (see for example [12], [23], [47] and [48]). Here we employ
6 88.00 71.65 80.18 82.38 82.38 82.38 82.38
7 92.33 72.59 85.73 86.67 87.89 87.89 87.89 a sample OES dataset collected during the processing of a
7 92.33 77.43 85.73 86.67 87.89 87.89 87.89 single wafer as a case study for comparing the orthogonal
9 97.63 85.31 94.52 95.87 95.87 95.87 95.87
11 99.38 93.05 98.79 98.79 98.79 98.79 98.79
decompositions generated by FSCA and PCA. The OES
11 99.38 95.11 98.79 98.79 98.79 98.79 98.79 spectrum in question, plotted in Fig. 10, consists of optical
12 99.72 96.89 99.41 99.41 99.41 99.41 99.41 emission intensity time series data for each of the 2000
13 100.0 100.0 100.0 100.0 100.0 100.0 100.0
active spectrometer channels (each channel corresponds to a
different optical wavelength in the range 192-875 nm). Each
time series has 55 samples, hence the resulting dataset is a
TABLE 7 matrix X ∈ R55×2000 of intensity values.
Pitprops dataset: The 6 variables selected by each algorithm and the
optimum set of 6 variables (BEST) in terms of maximizing the
percentage of explained variance in the dataset

FSCA SPBR MPBR R-SPBR R-MPBR BEST

whorls ringbut ringbut ringbut ringbut ringbut
moist moist moist moist moist moist
length length length length length length
ringtop clear clear clear clear clear
clear bowmax bowmax bowmax bowmax bowmax
ovensg ovensg ovensg ovensg ovensg ovensg
VX
80.18 82.38 82.38 82.38 82.38 82.38

For the synthesised dataset six SPCA components were

computed for a range of different values of the penalty
λ. For each value of λ the number of uniquely selected
variables was identified and then the corresponding number
of FSVs computed with FSCA, SPBR, R-SPBR, MPBR and R-
MPBR. The number of variables selected and the percentage
of explained variance are reported in Table 6. In order to Fig. 10. Plasma Etch Process OES Spectrum
provide an upper bound for the percentage of explained
variance that can be achieved we have also reported the
TABLE 8
corresponding values obtained with PCA. From the results Plasma Etch: Accumulative variance explained by PCA, FSCA and the
it can be observed that SPCA is the method that explains the four backward refinement variants of FSCA for different values of k
least variance for a given number of selected variables.
k PCA FSVA SPBR MPBR R-SPBR R-MPBR
In SPCA the number of selected variables is indirectly
chosen through a penalty parameter λ. In some situations 1 77.04 76.63 76.63 76.63 76.63 76.63
2 96.53 96.00 96.37 96.37 96.37 96.37
it can be difficult to choose λ to select a specific number of 3 98.20 98.00 98.14 98.14 98.14 98.14
variables. In our case, for example, it has not been possible 4 99.47 99.27 99.42 99.42 99.42 99.42
to select 1, 2, 8 or 13 variables using SPCA. In particular, 5 99.69 99.50 99.65 99.65 99.65 99.65
6 99.81 99.66 99.75 99.79 99.79 99.79
observe that for 7 and 11 selected variables SPCA returns 7 99.88 99.79 99.85 99.86 99.87 99.86
2 different results. This is due to the fact that two different 8 99.93 99.89 99.91 99.92 99.92 99.92
subsets of 7/11 variables were returned for two different 9 99.95 99.93 99.94 99.95 99.95 99.95
values of λ. Table 7 lists the variables selected by a 6 com-
ponents FSCA, SPBR, MPBR, R-SPBR and R-MPBR together The highly correlated nature of the data is evident from
with the set of 6 variables that maximize the percentage of Table 8, which shows that the first 4 PCs and the first 4
explained variance (BEST). This set has been determined by variables selected by FSCA and its backward refinement
evaluating all possible subsets of 6 variables. Observe that variants explain more than 99% of the variation in data.
MPBR, R-SPBR and R-MPBR select the same variables as In Fig. 11 the PCA loadings and scores are compared with
BEST while SPBR replaces the variable ’moist’ with ’testsg’. the FSVs and FSCs obtained with FSCA. This reveals that
In contrast, FSCA selects only 3 variables in common with the etch process can be analysed using either PCA scores
BEST. or FSCA components. In particular, observe that the FSCA
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 11

components and PCA scores tend to have similar trends. As TABLE 10

noted previously, the PCA scores are obtained as a linear Wafer sites: The percentage of variance explained by the various
methods for different values of k, the number of selected wafer sites
combination of all 2000 original variables (as defined by the
PCA loadings), while the four FSCs can be expressed as a
linear combination of just 4 original variables (the FSVs). k PCA FSCA SPBR MPBR R-SPBR R-MPBR
The benefit of being able to trace process variably back 1 42.69 38.84 38.84 38.84 38.84 38.84
to a small number of OES wavelengths is that individual 2 68.69 64.44 67.01 67.01 67.01 67.01
wavelengths map to specific chemical species present in the 3 85.72 82.59 84.57 85.08 84.79 85.08
plasma, enabling process engineers to gain insight into the 4 98.48 96.37 97.40 97.58 97.58 97.58
5 99.12 97.59 98.63 98.74 98.72 98.73
underlying drivers of process variability. 6 99.47 98.76 99.19 99.20 99.17 99.16
The computation time in seconds for each algorithm 7 99.67 99.22 99.45 99.46 99.42 99.39
in reported in Table 9 for different numbers of computed 8 99.75 99.47 99.60 99.60 99.59 99.58
components/variables selected. As expected, computational 9 99.81 99.64 99.71 99.71 99.71 99.69
time grows rapidly with increasing k for the more complex 10 99.86 99.72 99.77 99.80 99.78 99.79
refinement algorithms. For example, while SPBR is only
twice as computationally intensive as FSVA at k = 9 R-
MPBR is more than 20 times more computationally inten- The PCA results, which are also recorded in Table 10, show
sive. that the lower bound on the number of sites required is 5.
A plot of the variance explained by each of the FSCA
TABLE 9 based methods as a percentage of the variance explained
Plasma Etch: Computation time (in seconds) for PCA, FSVA and the
four backward refinement variants of FSCA for different values of k
by PCA is given in Fig. 12. Again in this example we
observe that, as expected, SPBR outperforms FSCA and
k PCA FSVA SPBR MPBR R-SPBR R-MPBR
MPRB outperforms SBBR (to a lesser extent), but that unlike
the previous examples, R-SPBR and R-MPBR are sometimes
1 0.03 0.05 0.10 0.10 0.10 0.10
2 0.04 0.10 0.20 0.31 0.25 0.36
marginally inferior to their non-recursive counterparts (i.e.
3 0.05 0.16 0.32 0.50 0.47 0.75 for k ≥ 5).
4 0.07 0.22 0.46 0.96 0.80 1.32
5 0.09 0.28 0.60 1.25 1.16 2.32
6 0.10 0.36 0.76 2.40 1.67 3.25 100
7 0.13 0.42 0.93 3.54 2.22 5.85
8 0.14 0.49 1.09 2.91 2.90 8.35
9 0.15 0.56 1.29 3.47 3.63 11.86 98
100 ×V ∗/VPCA

6.3 Wafer Site Optimisation

As a final application example, we evaluate the performance 94
of SPBR, MPBR, R-SPBR and R-MPBR as alternatives to FSCA
FSCA for the semiconductor wafer metrology site optimi- 92 SPBR
sation methodology developed in [24]. The pertinent details MPBR
R-SPBR
are as follows. The objective of the methodology is to use R-MPBR
historical metrology data for a set of candidate measurement 90
1 2 3 4 5 6 7 8 9 10
sites to determine the minimum set of sites that need to be Number of wafer sites selected
measured in order to accurately reconstruct wafer profiles.
The case study dataset consists of production metrology Fig. 12. Wafer sites: The variance explained by the metrology sites
data for a deposition process used in the manufacture of selected by FSCA and the 4 backward refinement algorithms for different
read-write heads, a key component of hard disk drives. The values of k expressed as a percentage of the variance explained by the
equivalent number of PCs
dataset, which was collected over several weeks from a sin-
gle production tool for the process, contains measurements
of 50 candidate sites for 316 wafers. Hence, X ∈ R316×50 and It is also interesting to observe how representative the
the site selection problem equates to selecting the subset of FSCA selected sites are of the full wafer surface. This
columns of X that best describe X. For a detailed descrip- can be visualised by clustering the unmeasured sites in k
tion of the problem statement, solution methodology and clusters Cz1 , . . . , Czk according to their similarity to the k
case study dataset the reader is referred to [24]. FSCA selected sites. Here, the similarly between an FSCA
site, zi , and an unmeasured site, xj , is defined in terms
Here, FSCA and the newly proposed backward refine-
of the impact on reconstruction accuracy of replacing zi
ment variants are employed to determine the optimum
with xj . Specifically, denoting the k selected variables as
subset of wafer sites. The percentage of variance explained (i)
by each method for different numbers of selected sites (k ) Zk = {z1 , . . . , zk }, and noting the definition of Zk (xj )
is reported in Table 10. Defining 99% variance explained as given in eqt. (26), sites are assigned to clusters according to
the minimum reconstruction accuracy threshold, it follows the rule
that 7 sites are needed when using FSC, while 6 sites are (p)
sufficient when using SPBR, MPBR, R-SPBR and R-MPBR.
xj ∈ Czi if i = argmax VX (Φ(Zk (xj ))X). (37)
p=1...k
12 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

FSCA variables FSCA components FSCA loadings PCA scores PCA loadings
584.9 0.23 12193 52738 0.05
125.2 0.05 5955 10047 0.03
334.4 0.13 32643
282 0.00
2305 0.22 12444 28008 0.10
318 0.04 4101 2551 0.03
1669 0.13 22906
4242 0.04
2570 0.60 3916 20182 0.08
583 0.22 1313 6201 0.01
1404 0.16 1290 7780 0.06
1077 0.39 3013 11076 0.12
243 0.09 213 2866 0.01
592 0.22 3439 5344 0.09
0 13 27 41 55 0 13 27 41 55 0 500 1000 1500 2000 0 13 27 41 55 0 500 1000 1500 2000

Fig. 11. Plasma Etch Process OES dataset: The first 4 PCA and FSCA components and scores

Fig. 13 shows the clusters obtained with each of the without pre-computation of the √ covariance
√ matrix is the
FSCA algorithms for k = 4 and k = 8. The clusters are superior implementation when k/ m > 3.
represented by markers of different colour and/or shape. A number of novel backward refinement variants of
Of particular note is the variation in spatial consistency of FSCA have also been proposed and efficient algorithm
clusters. It is apparent that the refinement steps yield much implementations developed. Results from simulated and
better spatial consistency of clusters than FSCA, with the application case studies confirm that the refinements yield
biggest improvements occurring with SPBR. Using FSCA improvements in performance relative to FSCA in terms
50% of the clusters are fragmented for both k = 4 and k = 8, of variance explained for a given number of compo-
while using SPBR only 1 cluster (25%) is fragmented when nents/variables selected, better variable selection and, in the
k = 4 and none are fragmented when k = 8. MPBR, R- case of the wafer site optimisation problem, more coherent
SPBR and R-MPBR yield similar results to SPBR with some FSCA clusters. Overall the key observations are that MPBR
minor variation at the boundaries between clusters. It is is superior to SPBR, which is in turn superior to FSCA, and
also noteworthy that in the FSCA plot for k = 8 the ’black that there is little, if any, benefit to be gained from employ-
star’ site is clustered with only one other site which is in a ing the recursive formulations (R-SPBR and R-MPBR) over
spatially unrelated area of the wafer. These anomalies are a their non-recursive counterparts. Indeed, in some instances
consequence of the sites initially selected by FSCA becoming the recursive implementations can yield poorer results.
redundant as additional sites are selected, as discussed in In terms of computational complexity the ordering is
Section 4. This issue, which detracts from the interpretability FSCA < SPBR < MPBR < R-SPBR < R-MPBR, with SPBR
of clusters, is addressed through the introduction of the having the same asymptotic complexity as FSCA. It is also
backward refinement step. noteworthy that the largest relative improvement in per-
formance in terms of variance explained occurs with the
change from FSCA to SPBR. As such, for most practical
7 D ISCUSSION AND C ONCLUSIONS applications SPBR is recommended as it provides a good
balance between complexity and quality of results.
This paper has sought to provide a comprehensive pre- In all the case studies presented in the paper algorithm
sentation of Forward Selection Component Analysis, as the performance is considered in an unsupervised context, as
unsupervised counterpart of Forward Selection Regression this is the natural framework for comparing unsupervised
and an alternative to PCA for dimensionality reduction variable selection techniques. The interested reader is re-
and variable selection in large highly correlated datasets. ferred to Appendix B for a supplementary linear regres-
A number of alternative FSCA algorithm implementations sion case study where performance is also assessed in a
have been described, namely FSCA and FSVA, with and supervised context. Specifically, FSCA and its variants are
without pre-computation of the covariance matrix, and their employed to select the model inputs from a large candidate
computational complexity analysed. In particular, this anal- set, and performance is evaluated in terms of the prediction
ysis reveals that: (1) all algorithms scale linearly with the capability of the resulting models.
number of measurements m and quadratically with the
number of variables v ; (2) FSCA implementations grow
linearly with the number of selected variables k , while ACKNOWLEDGMENTS
FSVA implementations grow cubically with k ; and (3) the The authors would like to thank Adrian Johnston, Seagate
optimum √ choice of implementation is dependent on the Technology and Niall MacGearailt, Intel Ireland for the
ratio k/ m. In general, it is computationally advantageous provision of use cases and Maynooth University for the
to pre-compute and store the covariance matrix when using financial support provided. The authors would also like to
either FSCA or FSVA, with FSVA computationally
√ √ the most acknowledge the anonymous reviewers for their valuable
efficient implementation provided k/ m < 1.5. FSCA suggestions which have greatly enhanced the paper.
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 13

A PPENDIX A: P ROPERTIES OF VX (.)

k =4 k =8 Theorem 1. Given projection matrix Φ(S) as defined in eqt.

1.0 1.0 (4) and VX (.) as defined in eqt. (6), if X̂ = Φ(S)X then
VX (X̂) ≥ 0 ∀S ∈ Rn×k .
Proof: Choosing X̂ = Φ(S)X is equivalent to choos-
ing X̂ = SΘ, where Θ is the solution to the convex
FSCA

0.5 0.5

minimization problem in eqt. (2). By definition, the Θ

that minimizes eqt. (2) maximizes VX (X̂), that is:

0.0 0.0
Θ = argmaxVX (SΘ̃)
0.0 0.5 1.0 0.0 0.5 1.0
V =96.37 V =99.47 Θ̃

1.0 1.0 It then follows that VX (SΘ) ≥ VX (SΘ̃) ∀Θ̃. Choosing

Θ̃ as the zero matrix 0 gives X̂ = 0 and hence VX (0) =
0. Therefore VX (SΘ) ≥ VX (0) = 0.
SPBR

0.5 0.5

Theorem 2. If VX (.) is defined as in eqt. (6) and Φ(.) defined

as in eqt. (4) then
0.0 0.0

0.0 0.5
V =97.40
1.0 0.0 0.5
V =99.60
1.0
xi XXT xi
argmax VX (Φ(xi )X) = argmax .
xi ∈X xi ∈X xTi xi
1.0 1.0

Proof: It immediately follows from eqt.(6) that

argmax VX (X̂) = argmin k X − X̂ k2F ,

MPBR

0.5 0.5
xi ∈X xi ∈X

where X̂ = Φ(xi )X. Expressing the F-norm in terms of

0.0 0.0
the trace operator gives
0.0 0.5 1.0 0.0 0.5 1.0
V =97.58 V =99.60
k X − X̂ k2F = tr((X − X̂)(X − X̂)T )
1.0 1.0 = tr(XXT ) + tr(X̂X̂T ) − 2tr(XX̂)
= tr(XXT ) + tr(Φ(xi )XXT Φ(xi )) − 2tr(Φ(xi )XXT ),
R-SPBR

0.5 0.5
where the last equivalence is obtained by replacing
X̂ with Φ(xi )X and noting that Φ(xi ) = Φ(xi )T .
By application of the cyclic property of the trace op-
erator and observing that Φ(xi )2 = Φ(xi ) we can
0.0 0.0
write tr(Φ(xi )XXT Φ(xi )) = tr(Φ(xi )Φ(xi )XXT ) =
0.0 0.5
V =97.58
1.0 0.0 0.5
V =99.59
1.0
tr(XXT ), and therefore

1.0 1.0
k X − X̂ k2F = 2tr(XXT ) − 2tr(Φ(xi )XXT ).

It immediately follows that

R-MPBR

0.5 0.5

argmin k X − X̂ kF = argmax tr(Φ(xi )XXT ).

xi ∈X xi ∈X

0.0 0.0
Finally, by application of the definition of Φ(.) in eqt. (4)
0.0 0.5 1.0 0.0 0.5 1.0
and the properties of trace, the r.h.s. can be rewritten as
V =97.58 V =99.58
xi xTi T xTi
Fig. 13. The FSCA clusters obtained with FSCA, SPBR, MPBR, R- argmax tr( XX ) = argmax tr( XXT xi )
SPBR and R-MPBR for k = 4 and k = 8. In each case the FSCA xi ∈X xTi xi xi ∈X xi xTi
selected sites are indicated by circles and the associated clusters by xT XXT xi
markers of different colour and/or shape. The percentage variance = argmax i T .
explained by the different algorithms is reported under each plot. xi ∈X xi xi
14

TABLE 11 TABLE 12
Mean and (standard deviation) of the CV-NMSE (%) achieved with the Percentage of variance in X explained by the variables/features
regression models generated using PCA, FSCA, SPBR, MPBR, selected by PCA, FSCA, and its backward refinement variants
R-SPBR and R-MPBR for input selection
k 5 10 15
k 5 10 15
PCA 97.44 99.31 99.81
PCA 1.40 (0.05) 1.26 (0.04) 1.13 (0.04) FSCA 96.16 98.80 99.57
FSCA 1.45 (0.08) 1.25 (0.04) 1.10 (0.05) SPBR 96.74 99.04 99.68
SPBR 1.36 (0.05) 1.26 (0.06) 1.11 (0.03) MPBR 96.74 99.07 99.70
MPBR 1.35 (0.05) 1.22 (0.09) 1.12 (0.04) R-SPBR 96.74 99.06 99.70
R-SPBR 1.36 (0.05) 1.20 (0.07) 1.13 (0.04) R-MPBR 96.74 99.07 99.70
R-MPBR 1.36 (0.05) 1.19 (0.06) 1.13 (0.04)

no clear winner among the methods. This can be attributed

A PPENDIX B: A REGRESSION EXAMPLE to the absence of causality between unsupervised input
As a supplementary example we consider the application variable/feature selection and selecting inputs that yield the
of the various FSCA techniques to input selection in a re- best regression results. Overall, we can conclude that FSCA
gression problem and benchmark their performance against and its backward refinement variants are as effective as PCA
PCA. Since, in general, unsupervised variable or feature for this application, with the added advantage of providing
selection techniques are not guaranteed to identify good an easy to interpret model.
predictors of a target output, and it is straightforward to
design pathological examples where each method will per-
form poorly, to provide a fair comparison we select a dataset R EFERENCES
for a practical application where PCA yields good results. [1] R. Tibshirani, “Regression shrinkage and selection via the lasso,”
The application in question is the prediction of etch rate of Journal of the Royal Statistical Society. Series B (Methodological), pp.
a plasma etch process from optical emission spectroscopy 267–288, 1996.
(OES) measurements recorded from the etching chamber [2] A. Miller, Subset selection in regression. CRC Press, 2012.
[3] R. Dı́az-Uriarte and S. A. De Andres, “Gene selection and classifi-
during wafer processing. The associated dataset, described cation of microarray data using random forest,” BMC Bioinformat-
fully in [49], is defined by an input matrix X ∈ R2194×200 of ics, vol. 7, no. 1, p. 3, 2006.
OES summary statistics and an output vector y ∈ R2194 of [4] X. Geng, D.-C. Zhan, and Z.-H. Zhou, “Supervised nonlinear
dimensionality reduction for visualization and classification,” Sys-
the recorded etch rate for 2194 production wafers. tems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,
A Monte Carlo study was undertaken in which the data vol. 35, no. 6, pp. 1098–1107, 2005.
was randomly split into training and test data, correspond- [5] P. Mitra, C. Murthy, and S. K. Pal, “Unsupervised feature selection
ing respectively to 70% and 30% of the data. Using the using feature similarity,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.
training data, k inputs were generated in turn with PCA, [6] J. G. Dy and C. E. Brodley, “Feature selection for unsupervised
FSCA, SPBR, MPBR, R-SPBR, and R-MPBR, and in each learning,” The Journal of Machine Learning Research, vol. 5, pp. 845–
case a linear regression model estimated. The performance 889, 2004.
of the models was evaluated by cross-validation on the test [7] Z. Zhao and H. Liu, “Spectral feature selection for supervised
and unsupervised learning,” in Proceedings of the 24th international
data. This process was repeated 20 times for k = 5, 10 and conference on Machine Learning. ACM, 2007, pp. 1151–1157.
15, and the mean and standard deviation of the normalized [8] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection
mean squared prediction error (denoted as the CV-NMSE) techniques in bioinformatics,” Bioinformatics, vol. 23, no. 19, pp.
2507–2517, 2007.
computed in each case. These values are reported in Table [9] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction
11. by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–
For completeness, the percentage of the variance in 2326, 2000.
the input data explained by the selected input fea- [10] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality
reduction and data representation,” Neural Computation, vol. 15,
tures/variables (as computed on the training data set) are no. 6, pp. 1373–1396, 2003.
reported in Table 12. The pattern is the same as observed [11] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.
in previous examples. PCA yields the highest variance [12] B. Flynn and S. McLoone, “Max separation clustering for feature
explained and FSCA the lowest among the methods consid- extraction from optical emission spectroscopy data,” Semiconductor
Manufacturing, IEEE Transactions on, vol. 24, no. 4, pp. 480–488,
ered for each value of k . The backward refinement methods 2011.
fall between these two extremes in terms of their perfor- [13] L. Mariey, J. Signolle, C. Amiel, and J. Travert, “Discrimination,
mance, with SPBR yielding most improvement relative to classification, identification of microorganisms using ftir spec-
troscopy and chemometrics,” Vibrational Spectroscopy, vol. 26, no. 2,
FSCA. pp. 151–159, 2001.
Reviewing Table 11 it is evident that the inputs selected [14] J. Trygg, E. Holmes, and T. Lundstedt, “Chemometrics in metabo-
by all methods yield good prediction models with the CV- nomics,” Journal of Proteome Research, vol. 6, no. 2, pp. 469–479,
NSME ranging from 1.35% to 1.45% when k = 5, 1.19% 2007.
[15] G. P. McCabe, “Principal variables,” Technometrics, vol. 26, no. 2,
to 1.26% when k = 10, and 1.10% to 1.13% when k = 15. pp. 137–144, 1984.
It is interesting to note that while PCA explains the most [16] J. Cadima and I. T. Jolliffe, “Loading and correlations in the in-
variance in terms of the input data, MPBR yields the best terpretation of principle compenents,” Journal of Applied Statistics,
vol. 22, no. 2, pp. 203–214, 1995.
regression model when k = 5, R-MPBR when k = 10 and
[17] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, “A modified principal
FSCA when k = 15. While the differences in CV-NMSE are component technique based on the lasso,” Journal of Computational
statistically significant at a 95% confidence level, there is and Graphical Statistics, vol. 12, no. 3, pp. 531–547, 2003.
L. PUGGINI AND S. MCLOONE: FORWARD SELECTION COMPONENT ANALYSIS: ALGORITHMS AND APPLICATIONS 15

[18] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, [42] N. Rao, P. Shah, and S. Wright, “Forwardbackward greedy algo-
“A direct formulation for sparse pca using semidefinite program- rithms for signal demixing,” in Signals, Systems and Computers,
ming,” SIAM review, vol. 49, no. 3, pp. 434–448, 2007. 2014 48th Asilomar Conference on. IEEE, 2014, pp. 437–441.
[19] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component [43] H. Wold, “Nonlinear estimation by iterative least square proce-
analysis,” Journal of Computational and Graphical Statistics, vol. 15, dures,” in Research Papers in Statistics, F. David, Ed. Wiley, New
no. 2, pp. 265–286, 2006. York, 1966, pp. 411–444.
[20] H. Shen and J. Z. Huang, “Sparse principal component analysis [44] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press,
via regularized low rank matrix approximation,” Journal of Multi- 2012, vol. 3.
variate Analysis, vol. 99, no. 6, pp. 1015–1034, 2008. [45] M. S. Bartlett, “An inverse matrix adjustment arising in discrim-
[21] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix inant analysis,” Ann. Math. Statist, vol. 22, no. 1, pp. 107–111, 03
decomposition, with applications to sparse principal components 1951.
and canonical correlation analysis,” Biostatistics, p. kxp008, 2009. [46] J. Jeffers, “Two case studies in the application of principal compo-
nent analysis,” Applied Statistics, pp. 225–236, 1967.
[22] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse princi-
[47] H. Yue, S. Qin, R. Markle, C. Nauert, and M. Gatto, “Fault
pal component analysis,” arXiv preprint arXiv:0909.1440, 2009.
detection of plasma etchers using optical emission spectra,” Semi-
[23] E. Ragnoli, S. McLoone, S. Lynn, J. Ringwood, and N. Macgearailt, conductor Manufacturing, IEEE Transactions on, vol. 13, no. 3, pp.
“Identifying key process characteristics and predicting etch rate 374–385, Aug 2000.
from high-dimension datasets,” in Advanced Semiconductor Manu- [48] D. Zeng and C. Spanos, “Virtual metrology modeling for plasma
facturing Conference, 2009. ASMC ’09. IEEE/SEMI, May 2009, pp. etch operations,” Semiconductor Manufacturing, IEEE Transactions
106–111. on, vol. 22, no. 4, pp. 419–431, Nov 2009.
[24] P. Prakash, A. Johnston, B. Honari, and S. McLoone, “Optimal [49] L. Puggini and S. McLoone, “Extreme learning machines for
wafer site selection using forward selection component analysis,” virtual metrology and etch rate prediction,” in Signals and Systems
in Advanced Semiconductor Manufacturing Conference (ASMC), 2012 Conference (ISSC), 2015 26th Irish. IEEE, 2015, pp. 1–6.
23rd Annual SEMI. IEEE, 2012, pp. 91–96.
[25] K. Li, J.-X. Peng, and E.-W. Bai, “A two-stage algorithm for
identification of nonlinear dynamic systems,” Automatica, vol. 42,
no. 7, pp. 1189–1197, 2006.
[26] K. Li, J.-X. Peng, and Bai, “Two-stage mixed discrete–continuous Seán McLoone received an M.E. degree in
identification of radial basis function (rbf) neural models for Electrical and Electronic Engineering and a PhD
nonlinear systems,” Circuits and Systems I: Regular Papers, IEEE in Control Engineering from Queens University
Transactions on, vol. 56, no. 3, pp. 630–643, 2009. Belfast (QUB), Belfast, U.K. in 1992 and 1996,
[27] I. T. Jolliffe, “Discarding variables in a principal component anal- respectively. Following appointments as a Post-
ysis. i: Artificial data,” Applied statistics, pp. 160–173, 1972. doctoral Research Fellow (1996-1997) and Lec-
[28] W. Krzanowski, “Selection of variables to preserve multivariate turer at QUB (1998-2002) he joined the Depart-
data structure, using principal components,” Applied Statistics, pp. ment of Electronic Engineering at the National
22–33, 1987. University of Ireland Maynooth in 2002, where
[29] K. Mao, “Identifying critical variables of principal components he served as Senior Lecturer (2005-2012) and
for unsupervised feature selection,” Systems, Man, and Cybernetics, Head of Department (2009-2012). He is cur-
Part B: Cybernetics, IEEE Transactions on, vol. 35, no. 2, pp. 339–344, rently a Professor and Director of the Energy Power and Intelligent
2005. Control (EPIC) Research Cluster at Queens University Belfast. His
[30] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian, “Feature selection using research interests include computational intelligence techniques, data
principal feature analysis,” in Proceedings of the 15th international analytics, system identification and control, with a particular focus on
conference on Multimedia. ACM, 2007, pp. 301–304. smart grid and advanced manufacturing informatics applications.
[31] Y. Cui and J. G. Dy, “Orthogonal principal feature selection,” in
Sparse Optimization and Variable Selection Workshop at the Interna-
tional Conference on Machine Learning. Helsinki, Finland, July 2008.
[32] M. Masaeli, Y. Yan, Y. Cui, G. Fung, and J. G. Dy, “Convex principal
feature selection.” in SDM. SIAM, 2010, pp. 619–628. Luca Puggini was born in Rome (Italy) in
[33] D. C. Whitley, M. G. Ford, and D. J. Livingstone, “Unsupervised 1989. He obtained the Laurea Magistrale in Pure
forward selection: a method for eliminating redundant variables,” and Applied Mathematics from the University of
Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, Tor Vergata in 2013. He worked on his thesis
pp. 1160–1168, 2000. at Statistics for Innovation in Oslo, Norway. In
[34] H.-L. Wei and S. A. Billings, “Feature subset selection and ranking September 2013 he joined the Department of
for data dimensionality reduction,” Pattern Analysis and Machine Electronic Engineering at Maynooth University
Intelligence, IEEE Transactions on, vol. 29, no. 1, pp. 162–166, 2007. as a PhD student on a collaborative research
project with Intel Ireland. His research interests
[35] R. Liu, R. Rallo, and Y. Cohen, “Unsupervised feature selection
include statistics, big data, machine learning,
using incremental least squares,” International Journal of Information
and computational intelligence techniques.
Technology and Decision Making, vol. 10, no. 06, pp. 967–987, 2011.
[36] Z. Zhao, R. Zhang, J. Cox, D. Duling, and W. Sarle, “Massively
parallel feature selection: an approach based on variance preser-
vation,” Machine Learning, vol. 92, no. 1, pp. 195–220, 2013.
[37] V. Cevher and A. Krause, “Greedy dictionary selection for sparse
representation,” Selected Topics in Signal Processing, IEEE Journal of,
vol. 5, no. 5, pp. 979–988, 2011.
[38] T. Zhang, “Adaptive forward-backward greedy algorithm for
learning sparse representations,” Information Theory, IEEE Trans-
actions on, vol. 57, no. 7, pp. 4689–4708, 2011.
[39] P. Jain, A. Tewari, and I. S. Dhillon, “Orthogonal matching pursuit
with replacement,” in Advances in Neural Information Processing
Systems, 2011, pp. 1215–1223.
[40] M. Jaggi, “Revisiting frank-wolfe: Projection-free sparse convex
optimization,” in Proceedings of the 30th International Conference on
Machine Learning (ICML-13), 2013, pp. 427–435.
[41] M. Frank and P. Wolfe, “An algorithm for quadratic program-
ming,” Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110,
1956.