Simple Poisson PCA: an algorithm for (sparse) feature extraction
https://doi.org/10.1007/s00180-019-00903-0
ORIGINAL PAPER
Received: 11 July 2018 / Accepted: 6 June 2019 / Published online: 11 June 2019
© The Author(s) 2019
Abstract
Dimension reduction tools offer a popular approach to the analysis of high-dimensional big data. In this paper, we propose an algorithm for sparse Principal Component Analysis for non-Gaussian data. Since our interest in the algorithm stems from applications in text data analysis, we focus on the Poisson distribution, which has been used extensively in analysing text data. In addition to sparsity, our algorithm is able to effectively determine the desired number of principal components in the model (order determination). The good performance of our proposal is demonstrated with both synthetic and real data examples.
1 Introduction
Principal Component Analysis (PCA), and its variants, are popular and well-
established methods for unsupervised dimension reduction through feature extraction.
These methods attempt to construct low-dimensional representations of a dataset
which minimise the reconstruction error for the data, or for its associated parame-
ters. An attractive property of PCA is that it produces a linear transformation of the
data which allows reconstructions to be calculated using only matrix multiplication. It
was developed initially for Gaussian-distributed data, and has since been extended to
include other distributions, as well as extended for use in a Bayesian framework. Exam-
ples of such extensions include Sparse PCA (SPCA) (Zou et al. 2006), Probabilistic
PCA (PPCA) (Tipping and Bishop 1999), Bayesian PCA (BPCA) (Bishop 1999),
Multinomial PCA (Buntine 2002), Bayesian Exponential Family PCA (Mohamed
Luke Smallman
[email protected]
et al. 2009), Simple Exponential Family PCA (SePCA) (Li and Tao 2013), Gener-
alised PCA (GPCA) (Landgraf and Lee 2015) and Sparse Generalised PCA (SGPCA)
(Smallman et al. 2018). The wide variety of PCA extensions can be attributed to a
combination of the very widespread use of PCA in high-dimensional applications and
to its unsuitability for problems not well-approximated by the Gaussian distribution.
In this paper, we present a method for Simple Poisson PCA (SPPCA) based on
the SePCA (Li and Tao 2013), an extension of PCA which (through the prescription
of a simple probabilistic model) can be applied to data from any exponential family
distribution. Firstly, by focusing on the Poisson distribution we provide an alterna-
tive inferential algorithm to the original paper which is simpler and makes use of
gradient-based methods of optimisation. This simple algorithm then facilitates our
second extension: the application of a sparsity-inducing adaptive L 0 penalty to the
loadings matrix which improves the ability of the algorithm when dealing with high-
dimensional data with uninformative components. We call our sparse extension Sparse
Simple Poisson PCA (SSPPCA). We note that similarly one can focus on any other
exponential family distribution and create the respective gradient-based optimisation
for that distribution. We discuss only the Poisson distribution in this work due to the
fact that we focus on text data which are usually modelled using a Poisson distribu-
tion. We also note that in the simulation studies and real-world example which we will
investigate in this work, we do not use particularly high-dimensional data. While our
method should work the same with such data, the necessary computational adaptations
to work with very high-dimensional data are beyond the scope of this paper.
A common feature of high-dimensional data is the presence of uninformative dimen-
sions. In text data, observed data is often represented by a document-term matrix X
whose i jth entry is the number of times the ith document contains the jth term. In
practice, a considerable number of these terms are irrelevant to classification, cluster-
ing or other analysis. Such examples have led to the development of many methods
for sparse dimension reduction, such as Sparse Principal Component Analysis (Zou
et al. 2006), Joint Sparse Principal Component Analysis (Yi et al. 2017), Sparse Gen-
eralised Principal Component Analysis (Smallman et al. 2018) and more. In this paper,
our sparsifying procedure for SPPCA uses the adaptive L 0 penalty to induce sparsity.
We will present the method in general terms for any exponential family distribution
and illustrate with a case study using the Poisson distribution and text data. We stress
that although this method does not give a classification algorithm, dimension reduc-
tion methods are often used as a preprocessing step before classification or clustering
techniques; as such these tasks provide both important applications and methods of
validation for this work. As such, we will spend some time investigating the ability
of our proposed methods to provide dimension reduction transformations which leave
the data amenable to such applications.
One difficulty associated with PCA algorithms is the need to choose the desired
number of principal components, a process known generally in dimension reduction
frameworks as order determination. This is in practice a difficult task, particularly with
data involving large numbers of features, and often necessitates multiple experiments to
determine the best number. For suitable models, Automatic Relevance Determination
(ARD) (Mackay 1995), provides an automatic method for order determination; like
SePCA, SPPCA is able to make use of ARD, as is our sparse adaptation. We will
investigate the behaviour of this order determination for both the original SPPCA and
for Sparse SPPCA.
In Sect. 2 we discuss the exponential family of distributions and the previous work
on SePCA, as well as outline the process of Automatic Relevance Determination.
Section 2.3 defines Simple Poisson Principal Component Analysis (SPPCA), with
details of the techniques involved for numerical computation, and reconstruction of the
data. Section 3 introduces sparsity, giving Sparse Simple Poisson Principal Component
Analysis (SSPPCA). We give a detailed estimation procedure for both the SPPCA and
SSPPCA in Sect. 4.1. We investigate numerical performance in Sect. 5, focusing on
artificial data sets in Sects. 5.1 and 5.3, the performance of the order determination
via ARD in Sect. 5.2, and a healthcare dataset in Sect. 5.4. Finally, we will discuss
implications and plans for future work in Sect. 6. To improve readability, certain results
and derivations are included as appendices to the text.
2 SePCA
Since our work is based on Simple Exponential PCA by Li and Tao (2013), we will
introduce in this section the general method for the exponential family of distributions.
We first introduce the most frequently used notation from Li and Tao (2013). We let N be the sample size, D the number of features and d the number of principal components. Moreover, X is the D × N data matrix, x_n ∈ R^D is the nth observation, Y is the d × N scores matrix and y_n ∈ R^d is the score vector of the nth observation. Furthermore, W is the D × d loadings matrix, w_j ∈ R^D is the loadings vector of the jth principal component, Θ is the D × N parameter matrix and θ_n ∈ R^D is the parameter vector of the nth observation. Finally, we denote the joint posterior likelihood of X, Y, W, α by P(X, Y, W, α).
Then we state the definition of the exponential family of distributions in terms of
its conditional distribution to motivate our work.
Definition 1 The exponential family distribution has a probability function which is conditional on a single vector parameter θ, and takes the canonical form

p(x|θ) = exp{ x^T θ + g(θ) + h(x) }

where g : R^D → R and h : R^D → R.
In SePCA, Li and Tao (2013) proposed modelling the sampling process of x_n by the distribution p(x_n|W, y_n) = Exp(x_n|θ_n), where Exp(x_n|θ_n) is the conditional distribution of the exponential family as defined above, and θ_n = W y_n is the natural parameter vector for the exponential family distribution that generates x_n. On y_n they place a Gaussian prior: p(y_n) = N(y_n|0_d, I_d), where 0_d is the d-dimensional vector with all entries zero and I_d the d × d identity matrix. Finally, on each principal component w_j, j = 1, . . . , d, they place another Gaussian prior:

p(W|α) = \prod_{j=1}^{d} N(w_j | 0_D, α_j^{-1} I_D)

where α = {α_1, . . . , α_d} is a set of precision hyperparameters.
These precisions are estimated by

α_j^{MP} ≈ D / ||w_j^{MP}||_2^2        (1)

where w_j^{MP} is the maximum a posteriori (MAP) estimate of w_j. In essence this implies
an iterative procedure as the posterior estimate of W depends on α and vice versa.
Here, the value of α j , j = 1, . . . , d, indicates whether a principal component w j
should be kept or ignored. This is done using Automatic Relevance Determination
(ARD) which was introduced in Mackay (1995). If α j > M where M is sufficiently
large then we may infer that all components of w j are within a small neighbourhood of
0 with high probability; thus we may safely discard them. In practice, we have usually
found M ≈ 100 to be sufficient.
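To make the pruning rule concrete, the following R sketch (the function name ard_prune and its arguments are ours, for illustration only; W is the current D × d loadings estimate) computes the precision estimates of Eq. (1) and drops the components whose precision exceeds the threshold M:

# Illustrative sketch of the ARD pruning step, not the authors' released code.
# W: current D x d loadings estimate; M: pruning threshold (M ~ 100 in the paper).
ard_prune <- function(W, M = 100) {
  D <- nrow(W)
  alpha <- D / colSums(W^2)          # Eq. (1): alpha_j = D / ||w_j||_2^2
  keep  <- alpha <= M                # alpha_j > M => component effectively zero
  list(W = W[, keep, drop = FALSE], alpha = alpha[keep], kept = which(keep))
}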
To make inference on W and Y we use MAP estimation of the log posterior, which has the form:

log P(X, Y, W, α) = log p(X|W, Y, α) + log p(Y) + log p(W|α) + constant
                  = \sum_{n=1}^{N} [ x_n^T W y_n + g(W y_n) ] - (1/2) tr(Y^T Y) - (1/2) tr(W^T W Diag(α))
where Diag(α) is the d × d diagonal matrix with entries α. In Li and Tao (2013),
the authors based their estimation procedure on the fact that the conditional dis-
tribution p(X|W , Y ) is some general exponential family distribution. Therefore,
they suggested approximating the log-likelihood with a lower bound and adopting an
expectation-maximisation (EM) approach for optimisation. This led to a rather com-
plicated inference procedure on W and Y .
In this paper we propose a different inference procedure for W and Y. Although in Li and Tao (2013) the conditional distribution p(X|W, Y) is some general exponential family distribution, in specific problems the distribution is fully specified; for example, in their simulated data they used a binomial distribution, while for the text data considered in this paper we use the Poisson distribution. Therefore, we suggest that the
estimation of W and Y is done using simple gradient based methods for optimisation to
find the MAP estimates of P(X, Y , W , α). We will illustrate the details of our method
by example in the next section, where we discuss Simple Poisson PCA (SPPCA).
2.3 Simple Poisson PCA (SPPCA)

As we said earlier, in this paper we are interested in a PCA algorithm which is appropriate for text data. Text data is usually transformed into numeric vectors where we measure the number of times a word appears in a document (or a sentence or a paragraph). The usual distribution for counts of this kind is the Poisson distribution, and therefore here we present the special version of SePCA where the Poisson distribution is used in place of the general exponential family distribution.
The Poisson distribution is a discrete probability distribution, and is a member of the exponential family of distributions. It has probability mass function, conditional on λ,

p(x|λ) = λ^x e^{-λ} / x!

The joint distribution of D independent Poisson variates with means λ = (λ_1, . . . , λ_D) is also a member of the exponential family, with mass function

p(x|λ) = \prod_{i=1}^{D} p(x_i|λ_i) = exp{ \sum_{i=1}^{D} x_i log(λ_i) - \sum_{i=1}^{D} λ_i - \sum_{i=1}^{D} log(x_i!) }

This is in canonical form with θ_i = log(λ_i), g(θ) = -\sum_{i=1}^{D} e^{θ_i} and h(x) = -\sum_{i=1}^{D} log(x_i!), where g and h are defined in Definition 1.
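As a quick numerical check of this canonical form (an illustrative sketch, not part of the original derivation), one can verify in R that exp{x^T θ + g(θ) + h(x)} reproduces the product of independent Poisson probability masses:

# Verify the exponential-family form of the independent Poisson joint pmf.
set.seed(1)
D      <- 5
lambda <- runif(D, 0.5, 3)
x      <- rpois(D, lambda)

theta <- log(lambda)            # canonical parameter theta_i = log(lambda_i)
g     <- -sum(exp(theta))       # g(theta) = -sum_i exp(theta_i)
h     <- -sum(lgamma(x + 1))    # h(x) = -sum_i log(x_i!)

canonical <- exp(sum(x * theta) + g + h)
direct    <- prod(dpois(x, lambda))
all.equal(canonical, direct)    # TRUE (up to numerical error)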
To run SPPCA, we need to define a number of distributions, as was the case with SePCA. The prior distributions are the same as those defined for SePCA. Since the forms of g and h are now determined, the likelihood of X|W, Y can be explicitly stated:

p(X|W, Y) ∝ \prod_{n=1}^{N} exp{ x_n^T W y_n + g(W y_n) }
          ∝ exp{ \sum_{n=1}^{N} [ x_n^T W y_n - \sum_{i=1}^{D} e^{(W y_n)_i} ] }
          ∝ exp{ tr(X^T W Y) - \sum e^{WY} }

where \sum indicates the sum over all entries and e^{WY} is the component-wise exponential (and not the matrix exponential).
Similarly, the joint log-posterior of W, Y | X, α is, up to addition of a constant:

log P(X, Y, W, α) = log p(X|W, Y) + log p(Y) + log p(W|α) + const
                  = tr(X^T W Y) - \sum e^{WY} - (1/2) tr(Y^T Y) - (1/2) tr(W^T W Diag(α))
Now the above can be used to directly make inference on W and Y using MAP estimates from the log-posterior. As was mentioned earlier, the fact that the exponential family distribution is fully specified makes the estimation procedure much easier, and there is no need to rely on an EM approach for inference. For completeness we mention that for the estimation of α we use the same equation as in SePCA, shown in Eq. (1).
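Under this notation, a minimal R sketch of the log-posterior (function and argument names are ours for illustration; X is the D × N data matrix, W the D × d loadings, Y the d × N scores and alpha the length-d vector of precisions) is:

# Joint log-posterior of (W, Y) for SPPCA, up to an additive constant (sketch only).
sppca_logpost <- function(W, Y, X, alpha) {
  Theta <- W %*% Y                        # D x N matrix of natural parameters
  sum(X * Theta) -                        # tr(X^T W Y)
    sum(exp(Theta)) -                     # component-wise exponential, summed over all entries
    0.5 * sum(Y^2) -                      # (1/2) tr(Y^T Y)
    0.5 * sum(sweep(W^2, 2, alpha, `*`))  # (1/2) tr(W^T W Diag(alpha))
}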
The algorithm which we developed alternates between parameter estimation for α
and inference on W , Y . We will give more details on the estimation after we introduce
the Sparse SPPCA as the estimation algorithms for the two are similar.
3 Introducing sparsity

When we perform feature extraction for dimension reduction, the extracted features are often a function of all the original variables in our model. In most cases, though, many of the coefficients for the original variables are close to zero; we then expect that these variables are not significant in the feature construction and would like to remove them by setting their coefficients exactly to zero. Sparse PCA algorithms have been proposed over the years [such as Zou et al. (2006) and Yi et al. (2017)] to address sparsity in the classic PCA setting for Gaussian data. In the generalised setting where the data is not Gaussian
there has been limited effort. To the best of our knowledge, only recently has there been
interest in developing sparse algorithms for non-Gaussian PCA settings (Smallman
et al. 2018). In Smallman et al. (2018) the authors propose the use of SCAD (Fan and
Li 2001) and LASSO (Tibshirani 1996) penalties (or a combination of the two) to be
applied to a generalised PCA algorithm proposed in Landgraf and Lee (2015). In this
work, we propose an algorithm which has advantages over the work in Smallman et al.
(2018). First, we use a penalty proposed in Frommlet and Nuel (2016) which allows for
a simpler computational algorithm than the one proposed before. More importantly,
this algorithm can automatically detect the working dimension d of the problem at the
same time as estimating the principal components (as was the case with SePCA). To the
best of our knowledge, this is the first sparse and non-Gaussian based PCA algorithm
that simultaneously achieves this. In this section, we discuss how one can introduce
sparsity to SePCA and then focus on introducing sparsity in the SPPCA framework.
It is known in the literature that the LASSO and SCAD penalties lead to computationally complex problems and are computationally expensive in extracting sparse features. Therefore, to maintain the simplicity of our estimation algorithm and not add unnecessary complexity, we propose the use of an iterative approximation to the L0 norm penalty (see Frommlet and Nuel 2016) on W:

||W||_0 = \sum_{i=1}^{D} \sum_{j=1}^{d} 1(W_{ij} ≠ 0) ≈ \sum_{i=1}^{D} \sum_{j=1}^{d} (W_{ij})^2 / ((W^0_{ij})^2 + δ)

where W^0 is the previous value of W, and δ > 0 a very small value. The sparsity penalty is weighted by a constant k:
S = k \sum_{i=1}^{D} \sum_{j=1}^{d} (W_{ij})^2 / ((W^0_{ij})^2 + δ)
Here we note that the optimal value of k is data specific and should be estimated using
cross-validation while the value of δ does not affect the performance of the algorithm
as was demonstrated by Frommlet and Nuel (2016) (as long as it is small compared
to the entries of matrix W 0 ).
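As an illustration, the weighted penalty S can be written as a one-line R function (a sketch with our own names; W0 denotes the previous iterate of W):

# Adaptive (approximate) L0 penalty of Frommlet and Nuel (2016) on the loadings W.
adaptive_l0 <- function(W, W0, k, delta = 1e-8) {
  k * sum(W^2 / (W0^2 + delta))   # component-wise ratio, summed over all entries
}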
Although Frommlet and Nuel (2016) suggested an iterative procedure which
approximates the L0 norm penalty, successively minimising a penalised objective
function then recalculating the weights for that penalty, we will instead recalculate the
weights within each penalised objective function minimisation. We will discuss the
precise differences in Sect. 4.2.
As was mentioned before, in this work we focus specifically on the Poisson distribution as a result of its use in modelling text data. Therefore, we move one step further and define Sparse Simple Poisson PCA, which achieves sparsity under the assumption that the general exponential family distribution is replaced with a Poisson distribution. The objective function which will be used for the inference on W and Y is the following:

P_ps = tr(X^T W Y) - \sum e^{WY} - (1/2) tr(Y^T Y) - (1/2) tr(W^T W Diag(α)) - k \sum_{i=1}^{D} \sum_{j=1}^{d} (W_{ij})^2 / ((W^0_{ij})^2 + δ)
4 Estimation

In this section we present the necessary steps to take so that we are ready to run our estimation algorithm for SPPCA. We then discuss what changes for the SSPPCA algorithm. It is important to make clear here that this estimation algorithm will work if, instead of the Poisson distribution, we use any other exponential family distribution.
To run our estimation algorithms we need the derivatives of the objective function with respect to W and Y to aid with optimisation. Using matrix algebra gives the following:

∂P/∂W = X Y^T - e^{WY} Y^T - W Diag(α)
∂P/∂Y = W^T X - W^T e^{WY} - Y        (2)
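These matrix derivatives translate directly into R. The sketch below (with the same illustrative argument names as the sppca_logpost sketch above, which are ours and not the authors') is one way to write them:

# Gradients of the SPPCA log-posterior P with respect to W and Y, Eq. (2) (sketch only).
sppca_grad_W <- function(W, Y, X, alpha) {
  E <- exp(W %*% Y)                                # component-wise exponential of WY
  X %*% t(Y) - E %*% t(Y) - sweep(W, 2, alpha, `*`)  # X Y^T - e^{WY} Y^T - W Diag(alpha)
}

sppca_grad_Y <- function(W, Y, X, alpha) {
  E <- exp(W %*% Y)
  t(W) %*% X - t(W) %*% E - Y                      # W^T X - W^T e^{WY} - Y
}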
The algorithm is similar for the sparse version of the algorithm, namely the SSPPCA. The only things which change are the objective function P_ps and its derivatives, which take the form:

∂P_ps/∂W = X Y^T - e^{WY} Y^T - W Diag(α) - 2k W / ((W^0)^2 + δ)
∂P_ps/∂Y = W^T X - W^T e^{WY} - Y
where (W^0)^2, e^{WY} and the division in the penalty term are component-wise operations. As in the previous section, the element-wise derivatives from which these were derived are relegated to Appendix B. The rest of the steps are similar to the algorithm for SPPCA, with the only difference
being the need to define δ which is a tuning parameter for the adaptive L 0 norm penalty
we are using to induce sparsity.
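For SSPPCA the only changes relative to the sketch for SPPCA are the extra penalty term and its W-gradient. A sketch reusing the illustrative functions above (W0 again denotes the previous iterate of W) might look like:

# SSPPCA objective P_ps and its W-gradient; the Y-gradient is unchanged from SPPCA (sketch only).
ssppca_objective <- function(W, Y, X, alpha, W0, k, delta = 1e-8) {
  sppca_logpost(W, Y, X, alpha) - k * sum(W^2 / (W0^2 + delta))
}

ssppca_grad_W <- function(W, Y, X, alpha, W0, k, delta = 1e-8) {
  sppca_grad_W(W, Y, X, alpha) - 2 * k * W / (W0^2 + delta)
}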
It is very important to clarify here that the gradient ∂P_ps/∂W only exists because we are approximating the L0 norm on W by a differentiable function. This means that if other penalties, e.g. LASSO or SCAD, were to be used, the estimation algorithm would have been computationally more complex.
Finally, we note that we pass this penalty into R's optim function on each iteration, so that we provide a unified framework of sparse and non-sparse feature extraction. One could instead achieve sparsity in a way which follows the idea of Frommlet and Nuel (2016) more closely, but this would not allow us to use the simple computational algorithm for sparse feature extraction. In simulation studies not presented here, we found that implementing the L0 penalty as Frommlet and Nuel (2016) suggest provides a statistically insignificant gain of approximately 1% in average Euclidean silhouette on classed data over our combined method. We deemed that this did not merit the more computationally intensive implementation or the de-unification of the sparse and non-sparse algorithms.
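Putting the pieces together, the sketch below shows one possible form of the alternating scheme for SPPCA: optimise W and Y jointly with optim, re-estimate α via Eq. (1), and prune components whose precision exceeds the threshold (with a more lenient threshold in early iterations, as described in Sect. 5.1). All function and variable names are ours for illustration, reusing the sppca_logpost and gradient sketches above; initialisation, convergence checks and the Gaussian-PCA warm start mentioned in Sect. 6 are simplified away. For SSPPCA one would substitute ssppca_objective and ssppca_grad_W and carry W0 between iterations.

# Illustrative alternating estimation loop for SPPCA (not the authors' released code).
sppca_fit <- function(X, d, n_iter = 50, M = 100, M_early = 500, early = 10) {
  D <- nrow(X); N <- ncol(X)
  W <- matrix(rnorm(D * d, sd = 0.1), D, d)     # random start; Gaussian PCA also possible
  Y <- matrix(rnorm(d * N, sd = 0.1), d, N)
  alpha <- rep(1, d)

  for (it in seq_len(n_iter)) {
    par0 <- c(W, Y)                             # flatten (W, Y) for optim
    neg_f <- function(p) {
      Wp <- matrix(p[1:(D * d)], D, d)
      Yp <- matrix(p[-(1:(D * d))], d, N)
      -sppca_logpost(Wp, Yp, X, alpha)
    }
    neg_g <- function(p) {
      Wp <- matrix(p[1:(D * d)], D, d)
      Yp <- matrix(p[-(1:(D * d))], d, N)
      -c(sppca_grad_W(Wp, Yp, X, alpha), sppca_grad_Y(Wp, Yp, X, alpha))
    }
    opt <- optim(par0, neg_f, neg_g, method = "BFGS")
    W <- matrix(opt$par[1:(D * d)], D, d)
    Y <- matrix(opt$par[-(1:(D * d))], d, N)

    alpha  <- D / colSums(W^2)                  # re-estimate precisions, Eq. (1)
    thresh <- if (it <= early) M_early else M   # avoid removing components too early
    keep   <- alpha <= thresh
    if (!all(keep)) {                           # automatic relevance determination
      W <- W[, keep, drop = FALSE]
      Y <- Y[keep, , drop = FALSE]
      alpha <- alpha[keep]
      d <- sum(keep)
    }
  }
  list(W = W, Y = Y, alpha = alpha, d = d)
}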
5 Numerical studies
In this section, we will investigate the performance of SPPCA and SSPPCA and
compare their performances against those of PCA, SPCA (Zou et al. 2006), GPCA
(Landgraf and Lee 2015) and SGPCA (Smallman et al. 2018). We compare with PCA
to demonstrate that an algorithm based on Gaussian data is not going to work as well
in this setting, and with SPCA to show that even adding sparsity will not counteract
this problem. We also compare with GPCA which is another exponential family PCA
algorithm and with SGPCA which (to the best of our knowledge) is the only other
sparse PCA algorithm for exponential family distributions. In Sect. 5.1 we will work
with synthetic data drawn from a Poisson hidden-factor model. This model will then be
extended to a two-class hidden-factor model in Sect. 5.3. Finally, we will investigate
a real-world healthcare dataset in Sect. 5.4.
All investigations in this section will use the same basic model, with some small
adaptations. We will use the two hidden factors
[(v1i + εi1 ), (v1i + εi2 ), (2v1i + εi3 ), . . . , (2v1i + εi10 )]T (4)
[(v1i + εi1), (v1i + εi2), (v2i + εi3), (v2i + εi4), (v1i + 3v2i + εi5), . . . , (v1i + 3v2i + εi10)]^T   (5)

[(v1i + εi1), (v1i + εi2), (v2i + εi3), (v2i + εi4), (v3i + εi5), (v3i + εi6), (3v1i + 2v2i + 2v3i + εi7), . . . , (3v1i + 2v2i + 2v3i + εiD)]^T   (6)
The first analysis will use two datasets, with “true” dimensions 1 and 2 respectively,
which we will refer to as X1D and X2D. Each consists of 100 observations of a random
vector of length 10, but the construction of that vector differs. For the component
selection procedure we set M = 100, except for the first 10 iterations where M = 500 (as was mentioned in Sect. 4.1, we do this to avoid removing components too early). Also, for the SSPPCA algorithm we set δ = 10^{-8}.
To construct X1D, let v1i , i = 1, . . . , 100 be independently observed values of
V1 and let εi j , i = 1, . . . , 100, j = 1, . . . , 10 be independently observed values of
E. Then the ith observation in X1D has its first two components equal to v1i plus error, and the remaining eight components equal to 2v1i plus error. Formally, each observation has the form given in (4). To give a bit more insight here, one should expect that a good dimension reduction in this case will identify that we need exactly one component, with larger coefficients for variables 3–10 and smaller coefficients for variables 1 and 2.
Similarly, the ith observation in X2D has its first two components equal to an
observed value v1i of V1 plus independent errors, its second two components equal
to an observed value v2i of V2 plus independent errors, and its final six components equal to v1i + 3v2i plus independent errors, as given in (5).
To both of these datasets we applied each of SPPCA, SSPPCA, PCA, SPCA, GPCA
and SGPCA. For the latter three we needed to specify the dimension; for SPPCA
and SSPPCA the automatic relevance determination criterion successfully identified
the true dimension. The loadings for the one-dimensional data are given in Table 1;
SPPCA, SSPPCA, PCA and SPCA all give very similar results qualitatively, giving
equal weighting to components three through ten (corresponding to the 2v1 term) and
slightly smaller values to the first two components corresponding to the v1 term. Out
of these four, PCA has arguably the best performance, with the loadings accurately
capturing the data generation model. GPCA gives approximately equal weighting to
all the terms. SGPCA, on the other hand, gives considerably more sporadic loadings.
This is perhaps due to the lack of sparsity of the underlying data.
In Table 2 we give the two loadings for the two-dimension data. Here, the first
SPPCA loading gives roughly equal weight to the first two and last six components,
corresponding to the v1 and v1 + 3v2 terms respectively, and a slightly lower loading
to the second two components (corresponding to the v2 terms). The second SPPCA
loading gives most weight to the last six components, with small weights for the
second pair of components and the lowest weights to the first pair of components. The
performance of SSPPCA is more easily interpretable; the first loading gives highest
weighting to the last six components, with smaller weight for the first four; the second
loading strongly identifies the first two components with near-zero weighting given
to all other terms. PCA’s first loading primarily identifies the v1 + 3v2 term, with its
second primarily identifying the v1 term; SPCA does similarly with sparser loadings.
GPCA’s first loading gives approximately equal weighting to all terms (except for
the very first component), with its second primarily emphasising the v1 components.
Finally, SGPCA’s first loading identifies a combination of the v1 and v1 + 3v2 terms,
while its second fairly strongly identifies the v1 components. Of all the loadings, the
most successful at identifying the hidden factors are the second loadings of SSPPCA,
PCA, SPCA, GPCA and SGPCA, with SSPPCA, SPCA and SGPCA arguably slightly
better as the other components are driven closer to 0.
Although SPPCA and SSPPCA are not supervised methods, it is instructive to see
whether, given data arising from two or more classes, they are able to find principal
components which are able to distinguish between these classes. This gives some
(a) SPPCA
 25    94    24    18
 50    82    62    26
100    62    24    26
200    24    16    14
(b) SSPPCA
 25     2     8     4
 50    42    10     8
100    82    60    18
200    78    70    50
indication of their suitability for use as a step before applying a clustering or clas-
sification algorithm (depending on whether labels are available or not). To this end,
we construct two sets of classed data; the first having observations from two classes
Fig. 1 Scores from X2C. The (red) outline-only squares represent data drawn from the first class, while the
(black) filled triangles represent data drawn from the second class (colour figure online)
with equal sample sizes from both, the second having three classes with imbalanced
sample sizes.
We will use again the hidden factors from (3) and both datasets have dimension
D = 10 and total sample size N = 100. We will denote the two-class data by X2C
and the three-class data by X3C. The first class for both datasets will have its first
two components equal to observations v2 of V2 with independent error E and the
remaining eight components equal to 3v2 with independent error. The second class
for both will have first two components equal to 2v3 with independent error and the
remaining eight components equal to v3 , where the v3 are observations of V3 . The
third class will have all components equal to observations from V1 with independent
error. The two-class data X2C has 50 observations from the first class and 50 from
the second. The three-class data X3C is divided between 25 observations of the first
class, 25 observations of the second class, and 50 observations of the third class.
The scores from applying SPPCA, SSPPCA, GPCA, SGPCA, PCA and SPCA
to X2C are given in Fig. 1. For GPCA, SGPCA, PCA and SPCA we must specify
a dimension: as both SPPCA and SSPPCA choose d = 2 we use that value. All six
algorithms achieve good separation of the two classes, although it is worth noting
that GPCA achieves much worse separation using only the first principal component
than the other methods. Visually, it appears that SPPCA and SSPPCA (in Fig. 1a, b
respectively) give the best clustering of the two classes. Note, though, that all of
the algorithms except GPCA separate the data (except for a single outlying point in
Fig. 2 Scores from X3C. The (red) outline-only squares represent data drawn from the first class, the (black)
filled triangles represent data drawn from the second class, and the (blue) + symbols represent data drawn
from the third (majority) class (colour figure online)
SSPPCA) with only the first direction, which is encouraging, especially for PCA and
SPCA which are not specialised to this situation. We use the method of silhouettes put
forward by Rousseeuw (1987) to analyse the performance further, using the Euclidean
distance metric and clusters found using k-medoid clustering. The silhouette of the ith observation is given by (b(i) − a(i)) / max{a(i), b(i)}, where a(i) is the average dissimilarity of
the ith observation to the other members of its cluster and b(i) is the lowest average
dissimilarity of the ith observation to any other cluster. We can thus interpret the
silhouette as a measure of how well a data point is assigned to its cluster; the average
silhouette over a dataset gives a measure for how well clustered the data is. Average
silhouette values range between −1 and 1; the closer to 1 the better the clustering. In
Table 4 we give average silhouettes for X2C for each of the six algorithms. Our visual
intuition that SPPCA and SSPPCA give the best clustering is confirmed, differing
from PCA by a little over 25%. The superior performance of SPPCA and SSPPCA is
Fig. 3 The resulting principal components from applying SPPCA, SSPPCA, GPCA, SGPCA and PCA to
the healthcare data
continued in the three-class study (Fig. 2), though the gap does narrow, as we can see from the silhouettes. We note that none of the tested methods perform poorly; were one to achieve, on a real-world dataset, the separation that, for example, PCA achieves on this synthetic example, it would be a significant success. However, real-world data is rarely as amenable as a synthetic example like this, and we suggest that the performance gain from the SPPCA and SSPPCA methods on a real-world dataset may well be crucial to providing a workable dimension reduction.
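For reference, average Euclidean silhouettes such as those reported in Tables 4 and 5 can be computed with R's cluster package. The sketch below (with our own illustrative names; scores is an N × d matrix of component scores from any of the methods) follows the k-medoid construction described above:

# Average Euclidean silhouette after k-medoid clustering of the scores (sketch only).
library(cluster)

avg_silhouette <- function(scores, k = 2) {
  pm  <- pam(scores, k)                           # k-medoid clustering
  sil <- silhouette(pm$clustering, dist(scores))  # Euclidean distances by default
  mean(sil[, "sil_width"])                        # average silhouette over the dataset
}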
We will now examine the efficacy of SPPCA and SSPPCA in reducing the dimension
of a real-world dataset. The data is a sample of 100 observations from a lexicon
classifier dataset used by Cardiff and Vale University Health Board in the analysis of
letters sent from consultants at a hospital to general practitioners about outpatients.
Broadly, the data falls into two classes: discharge letters and follow-up appointment
letters. Due to the nature of these letters, there is a heavy imbalance between the two
classes. However, in order to better illustrate the performance of the methods in this
manuscript, we have randomly selected an equal sample size from each class. This
leaves us with 100 observations of dimension D = 55.
In Fig. 3 we show the results of applying SPPCA, SSPPCA, GPCA, SGPCA and PCA to this dataset. Discharge data points are shown with crosses and follow-up points are shown with circles. For both SPPCA and SSPPCA we used M = 40. For SSPPCA
we also used k = 0.07. From Fig. 3a we can see that SPPCA estimated d as 2; on the
other hand, from Fig. 3b we see that SSPPCA chose d = 3. Based on this, we chose
d = 3 for GPCA, SGPCA and PCA, which require a fixed value.
There is evidence of class separation in all the principal component diagrams, even
in just the pairs of dimensions for the 3-dimensional methods. However, it is unclear
just from these visualisations which of the methods has the best performance. In order
to better quantify the clustering, we give the average (Euclidean) silhouettes in Table 5.
Based on this performance metric, SPPCA and SSPPCA are the best performers,
performing significantly better than previous methods, including PCA which is the
default method in practice.
6 Discussion
In this paper we have developed a Poisson-based PCA algorithm, which we call SPPCA and which is based on SePCA (Li and Tao 2013). We use a different algorithm for inference on W and Y than SePCA, and we have illustrated this in the specific case where the distribution is Poisson by developing the SPPCA algorithm. We have also introduced an approximate L0 sparsity penalty in this context to allow for Sparse
SPPCA. In a more general framework this can be seen as a unified way of achieving
sparse or non-sparse feature extraction from a Poisson-based PCA algorithm. At the
same time this algorithm should be easily extendable to other distributions in the
exponential family by modifying appropriately the formulas.
The sparse algorithm performs particularly well, both in latent dimension discovery and in class separation for multi-class Poisson data. Computation times are acceptable for small samples (N ≤ 500), but become slightly more burdensome for larger samples. It is worth noting that there exist multiple solutions or local maxima. This is also dealt with simply, by evaluating multiple optima using the fully specified probability model upon which SePCA is based; for more details on this model we direct the reader to Li and Tao (2013). In practice, we have found that this has not been necessary; the maxima obtained starting from the Gaussian PCA solution have performed perfectly well.
There is scope for extension of this work. First, it would be interesting to introduce different, more complex sparsity penalties, such as the L1 or SCAD penalties, and compare their performance. Another possible extension is the development of nonlinear feature extraction methods, as well as sparse nonlinear feature extraction methods, in the generalised PCA setting for non-Gaussian data.
Acknowledgements The second author would like to thank St John’s College in Oxford for part-funding
this project and Cardiff University’s School of Mathematics for hosting him for 8 weeks. Support from both
schools was vital to the successful completion of this project.
The authors would like to thank Cardiff and Vale University Health Board for graciously providing the
healthcare dataset.
The authors express their gratitude for the constructive comments received from the editor and two reviewers
which were instrumental in improving the quality of this manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Interna-
tional License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if changes were made.
Appendix A

Following on from Sect. 4.1, we will now derive the element-wise derivatives of P:

P = tr(X^T W Y) - \sum_{i=1}^{D} \sum_{j=1}^{N} (e^{WY})_{ij} - (1/2) tr(Y^T Y) - (1/2) tr(W^T W Diag(α))
  = \sum_{i=1}^{N} \sum_{j=1}^{d} \sum_{k=1}^{D} X_{ki} W_{kj} Y_{ji} - \sum_{i=1}^{D} \sum_{j=1}^{N} exp( \sum_{k=1}^{d} W_{ik} Y_{kj} )
    - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{d} Y_{ji}^2 - (1/2) \sum_{i=1}^{d} \sum_{j=1}^{D} W_{ji}^2 α_i
Differentiating element-wise with respect to W_{ab} then gives

∂P/∂W_{ab} = (X Y^T)_{ab} - (e^{WY} Y^T)_{ab} - (W Diag(α))_{ab}
Appendix B

As needed for the estimation algorithm for SSPPCA, described in Sect. 4.2, we will now derive the gradients of P_ps element-wise:

P_ps = tr(X^T W Y) - \sum_{i=1}^{D} \sum_{j=1}^{N} (e^{WY})_{ij} - (1/2) tr(Y^T Y) - (1/2) tr(W^T W Diag(α)) - k \sum_{i=1}^{D} \sum_{j=1}^{d} (W_{ij})^2 / ((W^0_{ij})^2 + δ)
     = \sum_{i=1}^{N} \sum_{j=1}^{d} \sum_{k=1}^{D} X_{ki} W_{kj} Y_{ji} - \sum_{i=1}^{D} \sum_{j=1}^{N} exp( \sum_{k=1}^{d} W_{ik} Y_{kj} )
       - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{d} Y_{ji}^2 - (1/2) \sum_{i=1}^{d} \sum_{j=1}^{D} W_{ji}^2 α_i - k \sum_{i=1}^{D} \sum_{j=1}^{d} (W_{ij})^2 / ((W^0_{ij})^2 + δ)

The gradient with respect to W_{ab} is

∂P_ps/∂W_{ab} = (X Y^T)_{ab} - (e^{WY} Y^T)_{ab} - (W Diag(α))_{ab} - 2k W_{ab} / ((W^0_{ab})^2 + δ)

and the gradient with respect to Y (though we note it is identical to that of P above) is

∂P_ps/∂Y_{ab} = \sum_{k=1}^{D} X_{kb} W_{ka} - \sum_{i=1}^{D} W_{ia} exp( \sum_{k=1}^{d} W_{ik} Y_{kb} ) - Y_{ab}
References
Bishop CM (1999) Bayesian PCA. In: Advances in neural information processing systems. vol 11, pp
382–388
Buntine W (2002) Variational extensions to EM and multinomial PCA. Mach Learn ECML 2002:23–34
Collins M, Dasgupta S, Schapire RE (2002) A generalization of principal components analysis to the
exponential family. Adv Neural Inf Process Syst 14:617–624
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am
Stat Assoc 96(456):1348–1360
Frommlet F, Nuel G (2016) An adaptive ridge procedure for L0 regularization. PLoS ONE 11(2):1–23
Landgraf AJ, Lee Y (2015) Generalized principal component analysis: projection of saturated model parameters. Ohio State University Statistics Department Technical Report (890). Available from: http://www.stat.osu.edu/~yklee/mss/tr890.pdf
Li J, Tao D (2013) Simple exponential family PCA. IEEE Trans. Neural Netw. Learn. Syst. 24(3):485–497
Mackay DJC (1995) Probable networks and plausible predictions–a review of practical Bayesian methods
for supervised neural networks. Netw Comput Neural Syst 6(3):469–505
Mohamed S, Heller K, Ghahramani Z (2009) Bayesian exponential family PCA. Adv Neural Inf Process
Syst 21:1089–1096
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos
Mag J Sci 2(1):559–572
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J
Comput Appl Math 20:53–65
Smallman L, Artemiou A, Morgan J (2018) Sparse generalised principal component analysis. Pattern
Recognit 83:443–455
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol)
58(1):267–288
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc Ser B (Stat
Methodol) 61(3):611–622
Yi S, Lai Z, He Z, Cheung Y, Liu Y (2017) Joint sparse principal component analysis. Pattern Recognit
61:524–536
Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 2:265–286
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.