Appendix: Tutorial on PCA

Yuan Zhe Ma
Schlumberger Limited
The transform is defined in such a way that the first PC captures as much of the variability in the data as possible, subject to the condition of orthogonality between any pair of components. Each succeeding component in turn has the highest variance possible that is not accounted for by the preceding PCs, under the same orthogonality condition. Hence, the PCs are uncorrelated with each other.
PCA is mathematically defined as a linear transform that converts the data to a new coordinate
system such that the first PC lies on the coordinate that has the largest variance by projection of
the data, the second PC lies on the coordinate with the second largest variance, and so on. The
procedure includes several steps:
(1) Calculating the (multivariate) covariance or correlation matrix from the sample data,
(2) Computing eigenvalues and eigenvectors of the covariance or correlation matrix, and
(3) Generating the PCs; each PC is a linear combination of optimally weighted original
variables:

Pi = bi1 X1 + bi2 X2 + ... + bik Xk    (1)

where Pi is the ith PC and bik is the weight (sometimes called the regression coefficient) for the variable Xk. It is often convenient to standardize all the variables, Xk, to zero mean and one standard deviation.
The weights, bik, are calculated from the covariance or correlation matrix. Because the covariance or correlation matrix is symmetric and positive semidefinite, it yields an orthogonal basis of eigenvectors, each of which has a nonnegative eigenvalue. These eigenvectors, applied to the original inputs as in Equation 1, give the PCs, and the eigenvalues are proportional to the variances explained by the PCs. For more mathematical insight into PCA, readers can refer to Basilevsky (1994), Everitt and Dunn (2002), and Abdi and Williams (2010).
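The three-step procedure above can be sketched with NumPy; the two-variable synthetic data and all variable names below are illustrative assumptions, not taken from the tutorial's logs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples of k correlated variables (rows = samples).
n, k = 500, 2
x1 = rng.normal(size=n)
X = np.column_stack([x1, 0.8 * x1 + 0.6 * rng.normal(size=n)])

# Step 1: standardize the variables and compute the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)

# Step 2: eigenvalues/eigenvectors (eigh is for symmetric matrices);
# sort so the first eigenvector explains the most variance.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: each PC is a weighted sum of the standardized variables
# (Equation 1); the weights b_ik are the eigenvector entries.
P = Z @ eigvecs

# The PC variances equal the eigenvalues, and the PCs are uncorrelated.
print(np.var(P, axis=0, ddof=0))            # close to eigvals
print(np.corrcoef(P, rowvar=False)[0, 1])   # close to 0
```

This uses correlation-matrix PCA, matching the standardization suggested in the text; covariance-matrix PCA would skip the standardization step.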
PCA is a non-parametric statistical method that provides analytical solutions based on linear algebra; statistical moments, such as the mean and covariance, are simply calculated from the data without any distributional assumption. Because of its efficiency in removing redundancy and its capability of extracting interpretable information, PCA has a wide range of applications spanning nearly all industries, from computer vision to neuroscience, from medical data analysis to psychology, and from chemical research to seismic data analysis, among others. In fact, PCA is one of the most widely used multivariate statistical tools; with the explosion of data in modern society, its application is ever increasing.
A simple bivariate example with two petrophysical variables, neutron and density (RHOB), is
presented here to illustrate the method. The two PCs from PCA of neutron and RHOB logs are
overlain on the neutron-RHOB crossplots (Figs. 1a and 1b). The first PC (PC1) represents the major axis that describes the maximum variability of the data, and the second PC (PC2) represents the minor axis that describes the remaining variability not accounted for by the first PC. In this example, the major axis, PC1, approximately represents porosity, and the minor axis, PC2, approximately represents the lithology. This explains why lithofacies clustering by ANN or statistical clustering methods using PC1 is not good (Fig. 1c), whereas lithofacies clustered using PC2 are more consistent with the benchmark chart (Fig. 1d). In other cases, major PCs, such as PC1,
are important; sometimes, lithofacies classification using PC1 alone is good enough (Ma, 2011;
Ma et al., 2011, 2014).
PCs can be rotated to align with a physically more meaningful variable. This can be illustrated
with a bivariate example, in which the two original variables are equally weighted in the PCs
before rotation. In the neutron-RHOB analysis, neutron and RHOB contribute equally to both PC1 and PC2. However, if, for example, neutron is more important than RHOB for porosity determination, PC1 can be rotated so that it correlates more highly with neutron. Fig. 1e shows a rotated component that has an increased correlation to neutron and a decreased correlation to RHOB (Table 1). Similarly, if RHOB is more important than neutron in determining lithofacies, PC2 can be rotated to reflect that. Fig. 1f shows a component rotated from PC2 that has an increased correlation to RHOB and a decreased correlation to neutron (Table 1). The two rotated components do not have to be orthogonal, as shown in this example. The main criterion of rotation is to make a component physically meaningful.
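Such a rotation can be sketched as follows; the synthetic two-variable data and the 20-degree rotation angle are illustrative assumptions. Rotating the PC1 loading vector toward the first variable increases the component's correlation with that variable and decreases it with the other, as in the discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative standardized data: two positively correlated variables,
# stand-ins for the two logs in the bivariate example.
n = 1000
t = rng.normal(size=n)
X = np.column_stack([t + 0.5 * rng.normal(size=n),
                     t + 0.5 * rng.normal(size=n)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
w1 = eigvecs[:, np.argmax(eigvals)]
if w1[0] < 0:           # fix the arbitrary eigenvector sign
    w1 = -w1
pc1 = Z @ w1            # PC1 weights both variables equally here

# Rotate the PC1 loading vector 20 degrees toward the first variable.
theta = np.deg2rad(-20.0)
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
pc1_rot = Z @ (Rot @ w1)

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
# The rotated component correlates more with variable 1, less with variable 2.
print(corr(pc1, Z[:, 0]), corr(pc1_rot, Z[:, 0]))
```

Because each component can be rotated by its own angle, a rotated pair need not remain orthogonal, consistent with the remark above.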
Fig. A-1 Illustration of two principal components from PCA of neutron and density (RHOB) on neutron-RHOB or their PC1-PC2 crossplots. (a) Overlay of PC1 on the neutron-RHOB crossplot (arrow indicates the coordinate on which PC1 is defined). (b) Overlay of PC2 on the neutron-RHOB crossplot (arrow indicates the coordinate on which PC2 is defined). (c) PC1-PC2 crossplot (their correlation is zero). (d) Overlay of lithofacies clustered by ANN using PC1 on the neutron-RHOB crossplot (red: sandstone, green: limestone, blue: dolostone). (e) Overlay of lithofacies clustered by ANN using PC2 on the neutron-RHOB crossplot. (f) Overlay of a rotated PC2 on the neutron-RHOB crossplot.
Table 1 Correlation matrix between pairs of six variables: neutron (NPHI), density (RHOB), their PCs (PC1 and PC2), and the two rotated components (PC1_rotated and PC2_rotated).
The original data can be reconstructed from the principal components. The general equation of
reconstructing the original data can be expressed as the following matrix formulation:
D = P Ct Σ + u Mt    (2)

where D is the reconstructed data matrix of size n×k (n being the number of samples, k the number of variables), P is the matrix of principal components, standardized to unit variance, of size n×q (q being the number of retained PCs, equal to or less than k), C is the matrix of correlation coefficients between the PCs and the variables, of size k×q, Σ is the diagonal matrix of size k×k that contains the standard deviations of the variables, u is a unit (all-ones) column vector of size n, M is the vector of size k that contains the mean values of the variables, and t denotes the matrix transpose.
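Equation 2 can be checked numerically. The sketch below assumes correlation-matrix PCA with the PCs standardized to unit variance, so that C contains the PC-variable correlations; the synthetic data are illustrative. With all PCs retained (q = k), the reconstruction is exact.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data matrix D0: n samples of k correlated variables.
n, k = 400, 3
base = rng.normal(size=n)
D0 = np.column_stack([base + 0.3 * rng.normal(size=n) for _ in range(k)])

M = D0.mean(axis=0)                      # means of the variables
stds = D0.std(axis=0)                    # standard deviations
Z = (D0 - M) / stds                      # standardized data

# PCA on the correlation matrix; standardize the PCs to unit variance.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
P = (Z @ eigvecs) / np.sqrt(eigvals)     # standardized PCs, n x k

# C: correlations between each PC and each variable, k x q.
C = eigvecs * np.sqrt(eigvals)

# Equation 2 with all PCs retained (q = k): D = P C^t Sigma + u M^t.
Sigma = np.diag(stds)
u = np.ones((n, 1))
D = P @ C.T @ Sigma + u @ M[None, :]

print(np.allclose(D, D0))                # exact when all PCs are used
```

Dropping columns of P and C (taking q < k) gives the approximate reconstruction discussed next.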
When the data are highly correlated, a small number of PCs out of all the PCs can reconstruct the data quite well. PCA is highly efficient in removing redundancy, which is highlighted by the following seismic amplitude versus offset (AVO) example (Fig. 2a). Consider the different offsets as variables and the common midpoints as observations or samples. The first PC (Fig. 2b) explains more than 99.6% of the variance and can be used to reconstruct the original data. This is done simply by vector multiplication of PC1 (Fig. 2b) and its correlation coefficients to each offset, normalized by the respective standard deviation and mean of each offset (Fig. 2c). The result is very similar to the original AVO data (compare Figs. 2a and 2d).
In the AVO example above, q is set to 1 because PC1 represents more than 99% of the information in the data. This explains how the 2D map is reconstructed (Fig. 2d) simply by the vector multiplication of two 1D functions of different sizes (Figs. 2b and 2c), with normalization by the standard deviations and means.
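The rank-1 reconstruction can be imitated with synthetic data standing in for the AVO gather; the number of CMPs and offsets, the amplitude decay, and the noise level are assumptions for illustration. One PC reproduces the highly correlated data almost exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the AVO gather: n_cmp samples (CMPs) of
# n_off highly correlated variables (offsets).
n_cmp, n_off = 300, 20
signal = np.cumsum(rng.normal(size=n_cmp))      # smooth common signal
scale = np.linspace(1.0, 0.5, n_off)            # amplitude decay with offset
A = signal[:, None] * scale[None, :] + 0.01 * rng.normal(size=(n_cmp, n_off))

M = A.mean(axis=0)
stds = A.std(axis=0)
Z = (A - M) / stds

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
i = np.argmax(eigvals)
explained = eigvals[i] / eigvals.sum()          # fraction of variance in PC1

# One standardized PC and its correlations with each offset.
p1 = (Z @ eigvecs[:, i]) / np.sqrt(eigvals[i])  # length n_cmp, like Fig. 2b
c1 = eigvecs[:, i] * np.sqrt(eigvals[i])        # length n_off, like Fig. 2c

# Rank-1 reconstruction: outer product of the two 1-D vectors,
# rescaled by the standard deviations and shifted by the means.
A_rec = np.outer(p1, c1) * stds + M

print(explained)                 # close to 1 for highly correlated data
print(np.abs(A - A_rec).max())   # small reconstruction error
```

This is Equation 2 with q = 1: the 2D map comes from the outer product of two 1D vectors plus the normalization terms.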
Fig. 2 PCA of AVO data and reconstruction of the AVO data using one PC. (a) Original AVO data. (b) PC1 (as a function of common midpoint, or CMP). (c) Correlations between PC1 and each offset. (d) The reconstructed AVO data using PC1, i.e., the vector multiplication of (b) and (c), normalized by the respective standard deviations and means of each offset (see Equation 2).
References

Abdi H, Williams LJ (2010) Principal component analysis. Statistics & Data Mining Series, Vol. 2, John Wiley & Sons, p. 433-459.
Basilevsky A (1994) Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley Series in Probability and Mathematical Statistics.
Everitt BS, Dunn G (2002) Applied Multivariate Data Analysis, 2nd edn. Arnold, London.
Ma YZ (2011) Lithofacies clustering using principal component analysis and neural network: applications to wireline logs. Mathematical Geosciences 43(4):401-419.
Ma YZ, Gomez E, Young TL, Cox DL, Luneau B, Iwere F (2011) Integrated reservoir modeling of a Pinedale tight-gas reservoir in the Greater Green River Basin, Wyoming. In: Ma YZ, LaPointe P (eds) Uncertainty Analysis and Reservoir Modeling, AAPG Memoir 96, Tulsa.
Ma YZ, Wang H, Sitchler J, et al. (2014) Mixture decomposition and lithofacies clustering using wireline logs. Journal of Applied Geophysics 102:10-20. doi:10.1016/j.jappgeo.2013.12.011
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(11):559-572.