Dimension Reduction by Local Principal Component Analysis
Nandakishore Kambhatla
Todd K. Leen
Department of Computer Science and Engineering, Oregon Graduate Institute
of Science and Technology, Portland, Oregon 97291-1000, U.S.A.
1 Introduction
detection (Petsche et al., 1996; Japkowicz, Myers, & Gluck, 1995). In these
schemes, high-dimensional sensor signals are projected onto a subspace that
best describes the signals obtained during normal operation of the moni-
tored system. New signals are categorized as normal or abnormal according
to the distance between the signal and its projection.1
The classic technique for linear dimension reduction is principal compo-
nent analysis (PCA). In PCA, one performs an orthogonal transformation to
the basis of correlation eigenvectors and projects onto the subspace spanned
by those eigenvectors corresponding to the largest eigenvalues. This trans-
formation decorrelates the signal components, and the projection along the
high-variance directions maximizes variance and minimizes the average
squared residual between the original signal and its dimension-reduced
approximation.
A neural network implementation of one-dimensional PCA by Hebbian
learning was introduced by Oja (1982) and expanded to hierarchical,
multidimensional PCA by Sanger (1989), Kung and Diamantaras (1990),
and Rubner and Tavan (1989). A fully parallel (nonhierarchical) design that
extracts orthogonal vectors spanning an m-dimensional PCA subspace was
given by Oja (1989). Concurrently, Baldi and Hornik (1989) showed that the
error surface for linear, three-layer autoassociators with hidden layers of
width m has global minima corresponding to input weights that span the
m-dimensional PCA subspace.
Despite its widespread use, the PCA transformation is crippled by its
reliance on second-order statistics. Though uncorrelated, the principal com-
ponents can be highly statistically dependent. When this is the case, PCA
fails to find the most compact description of the data. Geometrically, PCA
models the data as a hyperplane embedded in the ambient space. If the
data components have nonlinear dependencies, PCA will require a larger-
dimensional representation than would be found by a nonlinear technique.
This simple realization has prompted the development of nonlinear alter-
natives to PCA.
Hastie (1984) and Hastie and Stuetzle (1989) introduced their principal
curves as a nonlinear alternative to one-dimensional PCA. Their parameterized
curves f(λ): R → R^n are constructed to satisfy a self-consistency
requirement. Each data point x is projected to the closest point on the curve,
λ_f(x) = argmin_µ ‖x − f(µ)‖, and the expectation of all data points that
project to the same parameter value λ is required to lie on the curve. Thus
f(λ) = E_x[ x | λ_f(x) = λ ]. This mathematical statement reflects the desire
that principal curves pass through the middle of the data.
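As a small illustration of this projection step, here is a minimal numpy sketch (our own; the half-circle curve and all names are hypothetical) that computes λ_f(x) for a curve sampled at discrete parameter values:

import numpy as np

def projection_index(x, curve_params, curve_points):
    # lambda_f(x) = argmin_mu ||x - f(mu)||, with the curve given by
    # samples f(curve_params[k]) = curve_points[k].
    dists = np.linalg.norm(curve_points - x, axis=1)
    return curve_params[np.argmin(dists)]

# Example: a half circle in R^2, f(mu) = (cos mu, sin mu), mu in [0, pi].
mu = np.linspace(0.0, np.pi, 200)
f_mu = np.stack([np.cos(mu), np.sin(mu)], axis=1)
print(projection_index(np.array([0.3, 1.2]), mu, f_mu))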
1 These schemes also use a sigmoidal contraction map following the projection so that
new signals that are close to the subspace, yet far from the training data used to construct
the subspace, can be properly tagged as outliers.
Hastie and Stuetzle (1989) prove that the curve f (λ) is a principal curve iff
it is a critical point (with respect to variations in f (λ)) of the mean squared
distance between the data and their projection onto the curve. They also
show that if the principal curves are lines, they correspond to PCA. Finally,
they generalize their definitions from principal curves to principal surfaces.
Neural network approximators for principal surfaces are realized by five-
layer, autoassociative networks. Independent of Hastie and Stuetzle’s work,
several researchers (Kramer, 1991; Oja, 1991; DeMers & Cottrell, 1993; Usui,
Nakauchi, & Nakano, 1991) have suggested such networks for nonlinear
dimension reduction. These networks have linear first (input) and fifth (out-
put) layers, and sigmoidal nonlinearities on the second- and fourth-layer
nodes. The input and output layers have width n. The third layer, which
carries the dimension-reduced representation, has width m < n. We will
refer to this layer as the representation layer. Researchers have used both lin-
ear and sigmoidal response functions for the representation layer. Here we
consider only linear response in the representation layer.
The networks are trained to minimize the mean squared distance be-
tween the input and output and, because of the middle-layer bottleneck,
build a dimension-reduced representation of the data. In view of Hastie and
Stuetzle’s critical point theorem, and the mean square error (MSE) training
criteria for five-layer nets, these networks can be viewed as approximators
of principal surfaces.2
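To make the architecture concrete, the following is a minimal sketch of such a five-layer autoassociator in PyTorch (a modern stand-in for the simulators discussed later; the widths n, m, and h are illustrative, not taken from the paper):

import torch
import torch.nn as nn

n, m, h = 50, 5, 30   # input dimension, representation width, sigmoidal-layer width

# Linear first and fifth layers, sigmoidal second and fourth layers,
# and a linear representation (third) layer of width m < n.
autoassociator = nn.Sequential(
    nn.Linear(n, h), nn.Sigmoid(),   # layer 2
    nn.Linear(h, m),                 # layer 3: representation layer (linear)
    nn.Linear(m, h), nn.Sigmoid(),   # layer 4
    nn.Linear(h, n),                 # layer 5: linear output
)

optimizer = torch.optim.SGD(autoassociator.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def train_step(batch):               # batch: tensor of shape (batch_size, n)
    optimizer.zero_grad()
    loss = loss_fn(autoassociator(batch), batch)  # targets are the inputs
    loss.backward()
    optimizer.step()
    return loss.item()

Training such a network to convergence can be slow; this is one motivation for the locally linear alternative developed below.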
Recently Hecht-Nielsen (1995) extended the application of five-layer au-
toassociators from dimension reduction to encoding. The third layer of his
replicator networks has a staircase nonlinearity. In the limit of infinitely
steep steps, one obtains a quantization of the middle-layer activity, and
hence a discrete encoding of the input signal.3
In this article, we propose a locally linear approach to nonlinear dimen-
sion reduction that is much faster to train than five-layer autoassociators
and, in our experience, provides superior solutions. Like five-layer autoas-
sociators, the algorithm attempts to minimize the MSE between the original
2 There is, as several researchers have pointed out, a fundamental difference between
the representation constructed by Hastie’s principal surfaces and the representation con-
structed by five-layer autoassociators. Specifically, autoassociators provide a continuous
parameterization of the embedded manifold, whereas the principal surfaces algorithm
does not constrain the parameterization to be continuous.
3 Perfectly sharp staircase hidden-layer activations are, of course, not trainable by gra-
dient methods, and the plateaus of a rounded staircase will diminish the gradient signal
available for training. However, with parameterized hidden unit activations h(x; a) with
h(x; 0) = x and h(x; 1) a sigmoid staircase, one can envision starting training with linear
activations and gradually shifting toward a sharp (but finite-sloped) staircase, thereby
obtaining an approximation to Hecht-Nielsen’s replicator networks. The changing activa-
tion function will induce both smooth changes and bifurcations in the set of cost function
minima. Practical and theoretical issues in such homotopy methods are discussed in Yang
and Yu (1993) and Coetzee and Stonick (1995).
E = E_x[ ‖x − f(g(x))‖² ].
The variance in the original vectors provides a useful normalization scale for
the squared residuals, so our experimental results are quoted as normalized
reconstruction error:
E_norm = E_x[ ‖x − f(g(x))‖² ] / E_x[ ‖x − E_x[x]‖² ].   (2.1)
This may be regarded as the noise-to-signal ratio for the dimension reduc-
tion, with the signal strength defined as its variance.
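For reference, a minimal numpy sketch of equation 2.1 (the function name is ours), assuming the data and their reconstructions are stored as rows of two arrays:

import numpy as np

def normalized_reconstruction_error(x, x_hat):
    # E_norm = E||x - f(g(x))||^2 / E||x - E[x]||^2   (equation 2.1)
    residual = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    variance = np.mean(np.sum((x - x.mean(axis=0)) ** 2, axis=1))
    return residual / variance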
Neural network implementations of these maps are provided by au-
toassociators, layered, feedforward networks with equal input and output
dimension n. During training, the output targets are set to the inputs; thus,
autoassociators are sometimes called self-supervised networks. When used
to perform dimension reduction, the networks have a hidden layer of width
m < n. This hidden, or representation, layer is where the low-dimensional
representation is extracted. In terms of the maps defined above, processing
from the input to the representation layer corresponds to the projection g,
and processing from the representation to the output layer corresponds to
the immersion f .
[Figure graphics omitted: two panels, (a) and (b), relating the original high-dimensional representation to its low-dimensional encoding; axis tick labels are not reproduced.]
Figure 2: Coordinates built by local PCA for data distributed on the surface of
a hemisphere. The solid lines are the two principal eigendirections for the data
in each region. One of the regions is shown shaded for emphasis.
x̂ = r^(i) + Σ_{j=1}^{m} z_j e_j^(i).   (3.2)
where P^(i) is the (n − m) × n matrix whose rows are the trailing eigenvectors of
the covariance matrix Σ^(i). The matrix Π^(i) is the projection orthogonal to
the local m-dimensional PCA subspace.
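One consistent way to form this projector (a sketch with our own naming, relying on the orthonormality of the eigenvectors) is directly from the trailing eigenvectors:

import numpy as np

def orthogonal_projector(P_trailing):
    # Pi = P^T P projects onto the span of the trailing eigenvectors,
    # i.e., orthogonally to the local m-dimensional PCA subspace.
    # P_trailing: ((n - m), n) array whose rows are the trailing eigenvectors.
    return P_trailing.T @ P_trailing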
4 The number of bits required to specify the region is small (between four and seven bits
in all the experiments presented here) with respect to the number of bits used to express
the double-precision coordinates within each region. In this respect, the specification of
the region is nearly free.
[Figure graphics omitted: two regions with reference vectors r^(1), r^(2), leading eigendirections e_1^(1), e_1^(2), and a data point x.]

Figure 3: Assignment of the data point x to one of two regions based on (left)
Euclidean distance and (right) the reconstruction distance. The reference vectors
r^(i) and leading eigenvectors e_1^(i) are shown for each of the two regions (i = 1, 2). See
text for explanation.
The reconstruction distance d(x, r^(i)) is the squared projection of the difference
vector x − r^(i) on the trailing eigenvectors of the covariance matrix
for region R^(i).5 Equivalently, it is the squared Euclidean distance to the
linear manifold that is defined by the local m-dimensional PCA in the ith
local region. Clustering with respect to the reconstruction distance directly
minimizes the expected reconstruction error E_recon.
Figure 3 illustrates the difference between Euclidean distance and the
reconstruction distance, with the latter intended for a one-dimensional local
PCA. Suppose we want to determine to which of two regions the data point
x belongs. For Euclidean clustering, the distances between the point x and
the two centroids r^(1) and r^(2) are compared, and the data point is assigned to
the cluster whose centroid is closest—in this case, region 1. For clustering
by the reconstruction distance, the distances from the point to the two one-
dimensional subspaces (the principal subspaces of the two
regions) are compared, and the data point is assigned to the region whose
principal subspace is closest—in this case, region 2. Data points that lie on
the intersection of hyperplanes are assigned to the region with lower index.
Thus the membership in regions defined by the reconstruction distance
can be different from that defined by Euclidean distance. This is because
the reconstruction distance does not count the distance along the leading
eigendirections. Neglecting the distance along the leading eigenvectors is
exactly what is required, since we retain all the information in the leading
directions during the PCA projection. Notice too that, unlike the Euclidean
Voronoi regions, the regions arising from the reconstruction distance may
not be connected sets.
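A minimal numpy sketch of this assignment rule (all names are ours), assuming each region is summarized by a reference vector r^(i) and an n × m matrix whose columns are its leading eigenvectors:

import numpy as np

def reconstruction_distance(x, r, E_lead):
    # Squared distance from x to the local hyperplane r + span(E_lead).
    # With orthonormal eigenvectors this equals the squared projection of
    # x - r onto the trailing eigendirections.
    d = x - r
    return np.sum(d ** 2) - np.sum((E_lead.T @ d) ** 2)

def assign_region(x, refs, eig_leads):
    dists = [reconstruction_distance(x, r, E) for r, E in zip(refs, eig_leads)]
    return int(np.argmin(dists))   # ties go to the lower-indexed region, as in the text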
Since the reconstruction distance (see equation 3.4) depends on the eigenvectors
of Σ^(i), an online algorithm for clustering would be prohibitively
5 Note that when the target dimension m equals 0, the representation is reduced to the
reference vector r(i) with no local coordinates. The distortion measure then reduces to the
Euclidean distance.
r^(i) = argmin_r (1/N_i) Σ_{x ∈ R^(i)} (x − r)^T Π^(i) (x − r),   (3.6)
where x̄ is the mean of the data in R(i) . Thus any vector r whose
projection along the trailing eigenvectors equals the projection of x̄
along the trailing eigenvectors is a generalized centroid of R(i) . For
convenience, we take r = x̄. Next compute the covariance matrices
Σ^(i) = (1/N_i) Σ_{x ∈ R^(i)} (x − r^(i)) (x − r^(i))^T
6 In deriving the centroid equations, care must be exercised to take into account the
dependence of e_j^(i) (and hence Π^(i)) on r^(i).
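The resulting generalized Lloyd iteration is sketched below in numpy, with our own naming and illustrative defaults: partition the data by reconstruction distance, then recompute each region's reference vector, covariance, and leading eigenvectors (taking r^(i) = x̄ for convenience, as above):

import numpy as np

def vqpca_fit(X, num_regions, m, num_iters=20, seed=0):
    X = np.asarray(X, dtype=float)
    N, n = X.shape
    rng = np.random.default_rng(seed)
    refs = X[rng.choice(N, num_regions, replace=False)].copy()   # initial r^(i)
    eigs = [np.eye(n)[:, :m] for _ in range(num_regions)]        # initial subspaces

    for _ in range(num_iters):
        # Partition: assign each point to the region whose local
        # m-dimensional PCA hyperplane is closest (reconstruction distance).
        dists = []
        for r, E in zip(refs, eigs):
            d = X - r
            dists.append((d ** 2).sum(axis=1) - ((d @ E) ** 2).sum(axis=1))
        labels = np.stack(dists).argmin(axis=0)

        # Update: r^(i) = region mean (a generalized centroid), then the
        # local covariance and its m leading eigenvectors.
        for i in range(num_regions):
            Xi = X[labels == i]
            if len(Xi) == 0:
                continue                     # leave an empty region unchanged
            refs[i] = Xi.mean(axis=0)
            w, V = np.linalg.eigh(np.cov(Xi.T, bias=True))   # ascending eigenvalues
            eigs[i] = V[:, ::-1][:, :m]                      # m leading eigenvectors
    return refs, eigs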
The encoding of x consists of the index i and the m local principal compo-
nents z. The decoding, or reconstruction, of the vector x is
x̂ = r^(i) + Σ_{j=1}^{m} z_j e_j^(i).   (3.9)
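Continuing the notation of the sketches above (all names ours), encoding and decoding reduce to a few lines:

import numpy as np

def vqpca_encode(x, refs, eigs):
    # Encode x as (region index i, local principal components z),
    # choosing the region by reconstruction distance.
    dists = [np.sum((x - r) ** 2) - np.sum(((x - r) @ E) ** 2)
             for r, E in zip(refs, eigs)]
    i = int(np.argmin(dists))
    z = eigs[i].T @ (x - refs[i])        # z_j = e_j^(i) . (x - r^(i))
    return i, z

def vqpca_decode(i, z, refs, eigs):
    # x_hat = r^(i) + sum_j z_j e_j^(i)   (equation 3.9)
    return refs[i] + eigs[i] @ z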
4 Experimental Results
E_norm ≡ E[ ‖x − x̂‖² ] / E[ ‖x − E[x]‖² ],   (4.1)
We computed the global PCA for the speech data by diagonalizing the
covariance matrices using Householder reduction followed by QL.7 The
image training set has fewer data points than the input dimension. Thus the
covariance matrix is singular, and we cannot use Householder reduction.
Instead we computed the PCA using singular-value decomposition applied
to the data matrix.
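A minimal sketch of this SVD route (our own naming), for the case where the number of samples is smaller than the input dimension:

import numpy as np

def pca_via_svd(X, m):
    # The right singular vectors of the centered data matrix are the
    # eigenvectors of the (possibly singular) covariance matrix, ordered
    # by decreasing singular value. X: (N, n) data; returns the mean and
    # an (n, m) matrix of leading eigenvectors.
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:m].T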
All of the architecture selection was carried out by monitoring perfor-
mance on a validation (or holdout) set. To limit the space of architectures,
the autoassociators have an equal number of nodes in the second and fourth
(sigmoidal) layers. These were varied from 5 to 50 in increments of 5. The
networks were regularized by early stopping on a validation data set.
For VQPCA, we varied the number of local regions for the unconstrained
VQ from 5 to 50 in increments of 5. The multistage VQ examples used
two levels, with the number of regions at each level varied from 5 to 50 in
increments of 5. The branching factor of the tree-structured VQ was varied
from 2 to 9.
For all experiments and all architectures, we report only the results on
those architectures that obtained the lowest reconstruction error on the val-
idation data set. In this sense, we report the best results we obtained.
7 The Householder algorithm reduces the covariance matrix to tridiagonal form, which is then diagonalized by the QL algorithm.
In order to test variability of these results across different training and test
sets, we reshuffled and repartitioned the data into new training, validation,
and test sets of the same size as those above. The new data sets gave results
very close to those reported here (Kambhatla, 1995).
Table 4: Encoding and Decoding Times (FLOPs) for the Image Data.
cent lower error than the best autoassociator. There is little distinction be-
tween the Euclidean and reconstruction distance clustering for these data.
The VQPCA trains significantly faster than the autoassociators. Although
the conjugate gradient algorithm is relatively quick, it generates encodings
inferior to those obtained with the stochastic gradient descent and BFGS
simulators.
Table 4 shows the encode and decode times for the different algorithms.
We again note that VQPCA algorithms using reconstruction distance clus-
tering require many more floating-point operations (FLOPs) to encode an
input vector than does the Euclidean distance algorithm or the five-layer
networks. However, as before, the decode times are much less for VQPCA.
As before, shuffling and repartitioning the data into training, validation,
and test data sets and repeating the experiments returned results very close
to those given here.
Finally, in order to compare directly with DeMers and Cottrell’s (1993)
results, we also conducted experiments training with all the data (no separa-
tion into validation and test sets). This is essentially a model fitting problem,
with no influence from statistical sampling. We show results only for the
autoassociators trained with SGD, since these returned lower error than the
conjugate gradient simulators, and the memory requirements for BFGS were
prohibitive. We report the results from those architectures that provided the
lowest reconstruction error on the training data.
The results are shown in Table 5. Both nonlinear techniques produce en-
codings with lower error than PCA, indicating significant nonlinear struc-
ture in the data. For the same data and using a five-layer autoassociator with
30 nodes in each of the second and fourth layers, DeMers and Cottrell (1993)
obtain a reconstruction error Enorm = 0.1317.9 This is comparable to our re-
sults. We note that the VQPCA algorithms train two orders of magnitude
faster than the networks while obtaining encodings with about one-third
the reconstruction error.
It is useful to examine the images obtained from the encodings for the
various algorithms. Figure 4 shows two sample images from the data set
along with their reconstructions from five-dimensional encodings. The al-
gorithms correspond to those reported in Table 5. These two images were
selected because their reconstruction error closely matched the average. The
left-most column shows the images as reconstructed from the 50 principal
components. The second column shows the reconstruction from 5 principal
components. The third column is the reconstruction from the five-layer
autoassociator, and the last two columns are the reconstructions from the
Euclidean and reconstruction distance VQPCA.
9 DeMers and Cottrell report half the MSE per output node, E = (1/2) ∗ (1/50) ∗ MSE =
The five-dimensional PCA has grossly reduced resolution, and gray-scale
distortion (e.g., the hair in the top image). All of the nonlinear algorithms
produce superior results, as indicated by the reconstruction error. The lower
image shows a subtle difference between the autoassociator and the two
VQPCA reconstructions; the posture of the mouth is correctly recovered in
the latter.
5 Discussion
representation and then scalar quantizes the new representation. This pro-
duces a product code for the data. Standard approaches include prepro-
cessing by PCA or discrete cosine transform, followed by scalar quantiza-
tion (Wallace, 1991). As discussed in the introduction, the nonlinear trans-
forms considered here provide more accurate representations than PCA and
should provide for better transform coding.
This work suggests a full implementation of transform coding, with
comparisons between PCA, autoassociators, and VQPCA in terms of rate
distortion curves. Transform coding using VQPCA with the reconstruc-
tion distance clustering requires additional algorithm development. The
reconstruction distance distortion function depends explicitly on the target
dimension, which in turn depends on the allocation of transform bits
among the new coordinates. Consequently, a proper transform coding
scheme needs to couple the bit allocation to the clustering, an enhancement
that we are developing.
The second potential application is in novelty detection. Recently several
authors have used three-layer autoassociators to build models of normal
equipment function (Petsche et al., 1996; Japkowicz et al., 1995). Equipment
faults are then signaled by the failure of the model to reconstruct the new sig-
nal accurately. The nonlinear models provided by VQPCA should provide
more accurate models of the normal data, and hence improve the sensitivity
and specificity for fault detection.
Acknowledgments
This work was supported in part by grants from the Air Force Office of
Scientific Research (F49620-93-1-0253) and the Electric Power Research In-
stitute (RP8015-2). We thank Gary Cottrell and David DeMers for supplying
image data and the Center for Spoken Language Understanding at the Ore-
gon Graduate Institute of Science and Technology for speech data. We are
grateful for the reviewer’s careful reading and helpful comments.
References
Baldi, P., & Hornik, K. (1989). Neural networks and principal component analy-
sis: Learning from examples without local minima. Neural Networks, 2, 53–58.
Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons
and singular value decomposition. Biol. Cyb., 59, 291–294.
Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using
manifold learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in
neural information processing systems 7. Cambridge, MA: MIT Press.
Broomhead, D. S. (1991, July). Signal processing for nonlinear systems. In
S. Haykin (Ed.), Adaptive Signal Processing, SPIE Proceedings (pp. 228–243).
Bellingham, WA: SPIE.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward net-
works are universal approximators. Neural Networks, 2, 359–368.
Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to
classification. In Proceedings of IJCAI.
Kambhatla, N. (1995). Local models and Gaussian mixture models for statistical data
processing. Unpublished doctoral dissertation, Oregon Graduate Institute.
Kambhatla, N., & Leen, T. K. (1994). Fast non-linear dimension reduction. In
J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information
processing systems 6. San Mateo, CA: Morgan Kaufmann.
Kramer, M. A. (1991). Nonlinear principal component analysis using autoasso-
ciative neural networks. AIChE Journal, 37, 233–243.
Kung, S. Y., & Diamantaras, K. I. (1990). A neural network learning algorithm for
adaptive principal component extraction (APEX). In Proceedings of the IEEE
International Conference on Acoustics Speech and Signal Processing (pp. 861–864).
Malthouse, E. C. (1996). Some theoretical results on non-linear principal compo-
nents analysis (Unpublished research report). Evanston, IL: Kellogg School
of Management, Northwestern University.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J.
Math. Biology, 15, 267–273.
Oja, E. (1989). Neural networks, principal components, and subspaces. Interna-
tional Journal of Neural Systems, 1, 61–68.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feed-
forward neural networks. In Artificial neural networks (pp. 737–745). Amster-
dam: Elsevier Science Publishers.
Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso,
I. (1996). A neural network autoassociator for induction motor failure predic-
tion. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in
neural information processing systems 8. Cambridge, MA: MIT Press.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1987). Numer-
ical recipes—the art of scientific computing. Cambridge: Cambridge University
Press.
Rognvaldsson, T. (1994). On Langevin updating in multilayer perceptrons. Neu-
ral Computation, 6, 916–926.
Rubner, J., & Tavan, P. (1989). A self-organizing network for principal component
analysis. Europhysics Lett., 20, 693–698.
Sanger, T. (1989). An optimality principle for unsupervised learning. In
D. S. Touretzky (ed.), Advances in neural information processing systems 1. San
Mateo, CA: Morgan Kaufmann.
Usui, S., Nakauchi, S., & Nakano, M. (1991). Internal color representation ac-
quired by a five-layer neural network. In O. Simula, T. Kohonen, K. Makisara,
& J. Kangas (Eds.), Artificial neural networks. Amsterdam: Elsevier Science
Publishers, North-Holland.
Wallace, G. K. (1991). The JPEG still picture compression standard. Communications
of the ACM, 34, 31–44.
Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic cri-
teria. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-33(2),
387–392.
Yang, L., & Yu, W. (1993). Backpropagation with homotopy. Neural Computation,
5, 363–366.