PCA Biology
Data Reduction
Summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables.
Residual variation is the information in the original data that is not retained in the k derived variables. Data reduction is therefore a balancing act between clarity of representation and ease of understanding on the one hand, and oversimplification (loss of important or relevant information) on the other.
PCA is probably the most widely used and well-known of the standard multivariate methods. It was invented by Pearson (1901) and Hotelling (1933), and first applied in ecology by Goodall (1954) under the name factor analysis ("principal factor analysis" is a synonym of PCA).
PCA takes a data matrix of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components, or principal axes) that are linear combinations of the original p variables. The first k components display as much as possible of the variation among objects.
Geometrically, the objects are represented as a cloud of n points in a multidimensional space with one axis for each of the p variables; the centroid of the points is defined by the mean of each variable.
The variance of each variable is the average squared deviation of its n values around the mean of that variable:

V_i = \frac{1}{n-1} \sum_{m=1}^{n} (X_{im} - \bar{X}_i)^2

The covariance of variables i and j sums, over all n objects, the product of each object's deviations from the two means, where X_{im} is the value of variable i in object m and \bar{X}_i is the mean of variable i:

C_{ij} = \frac{1}{n-1} \sum_{m=1}^{n} (X_{im} - \bar{X}_i)(X_{jm} - \bar{X}_j)
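These two formulas can be checked directly in NumPy; the five data values below are made up for illustration and are not the example data used later in these notes:

```python
import numpy as np

# Two variables measured on n = 5 objects (made-up illustration data)
X1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X2 = np.array([1.0, 3.0, 2.0, 6.0, 5.0])
n = len(X1)

# Variance: average squared deviation around the mean (n - 1 denominator)
V1 = np.sum((X1 - X1.mean()) ** 2) / (n - 1)

# Covariance of variables 1 and 2
C12 = np.sum((X1 - X1.mean()) * (X2 - X2.mean())) / (n - 1)

# The same quantities via NumPy's built-ins
assert np.isclose(V1, np.var(X1, ddof=1))
assert np.isclose(C12, np.cov(X1, X2)[0, 1])
```

Note the n − 1 (rather than n) denominator, which matches NumPy's `ddof=1` option.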
2D Example of PCA
Variables X1 and X2 have positive covariance and each has a similar variance:
X̄1 = 8.35, X̄2 = 4.91
V1 = 6.67, V2 = 6.24, C1,2 = 3.42
[Figure: scatterplot of variable X2 against variable X1 for the example data]
Configuration is Centered
Each variable is adjusted to a mean of zero (by subtracting the mean from each value).
[Figure: the centered scatterplot of X2 against X1, with the origin at the centroid]
PC 1 has the highest possible variance (9.88); PC 2 has a variance of 3.03. PC 1 and PC 2 have zero covariance.
[Figure: the same points plotted on the PC 1 and PC 2 axes]
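The rotation to PC axes can be sketched in NumPy; the correlated 2-D data below are generated at random (the mixing matrix and sample size are arbitrary illustration choices, not the slide example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: 200 objects, 2 variables
A = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 1.0]])

Xc = A - A.mean(axis=0)                 # center each variable
S = (Xc.T @ Xc) / (len(Xc) - 1)         # variance-covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

scores = Xc @ eigvecs                   # coordinates of objects on PC 1, PC 2

# Variance of the scores on each axis equals the eigenvalue for that axis
assert np.allclose(np.var(scores, axis=0, ddof=1), eigvals)
# ...and the scores on PC 1 and PC 2 have zero covariance
assert np.isclose(np.cov(scores.T)[0, 1], 0.0, atol=1e-9)
```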
Generalization to p-dimensions
PC 3 is in the direction of the next-highest variance, subject to the constraint that it has zero covariance with both PC 1 and PC 2, and so on, up to PC p.
Each principal axis is a linear combination of the original variables:

PC_i = a_{i1} Y_1 + a_{i2} Y_2 + \dots + a_{ip} Y_p

where the a_{ij} are the coefficients for axis i, each multiplied by the measured value of variable j.
[Figure: PC 1 and PC 2 axes drawn through the centered cloud of points]
The PC axes are a rigid rotation of the original variables. PC 1 is simultaneously the direction of maximum variance and a least-squares line of best fit: the squared distances of points away from PC 1 are minimized.
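The least-squares property can be verified numerically. This sketch (hypothetical random data, not the slide example) compares the sum of squared perpendicular distances to PC 1 against every other direction through the centroid:

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
Xc -= Xc.mean(axis=0)                      # centered cloud of points

S = (Xc.T @ Xc) / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
u1 = eigvecs[:, -1]                        # PC 1 = direction of max variance

def sq_resid(direction):
    """Sum of squared perpendicular distances to a line through the centroid."""
    along = Xc @ direction                 # coordinates along the line
    return np.sum(np.sum(Xc**2, axis=1) - along**2)

# No direction beats PC 1 as a least-squares line of best fit
for theta in np.linspace(0.0, np.pi, 181):
    d = np.array([np.cos(theta), np.sin(theta)])
    assert sq_resid(u1) <= sq_resid(d) + 1e-9
```

In 2-D the minimized residual equals (n − 1) times the smaller eigenvalue, i.e. the variance not captured by PC 1.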
If we take the first k principal components, they define the k-dimensional hyperplane of best fit to the cloud of points.
Of the total variance of all p variables, PCs 1 to k represent the maximum possible proportion that can be displayed in k dimensions; i.e. the squared Euclidean distances among points, calculated from their coordinates on PCs 1 to k, are the best possible representation of their squared Euclidean distances in the full p dimensions.
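Because the full set of p PCs is only a rigid rotation, squared Euclidean distances among objects are preserved exactly; the first k axes then give the best k-dimensional approximation to those distances. A NumPy check of the exact-preservation part, on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(50, 4))            # hypothetical data: n = 50, p = 4
Xc -= Xc.mean(axis=0)

S = (Xc.T @ Xc) / (len(Xc) - 1)
_, eigvecs = np.linalg.eigh(S)
scores = Xc @ eigvecs                    # coordinates on all p principal axes

def sq_dists(M):
    """Matrix of squared Euclidean distances among all pairs of rows."""
    g = np.sum(M**2, axis=1)
    return g[:, None] + g[None, :] - 2.0 * (M @ M.T)

# A rigid rotation leaves all inter-object distances unchanged
assert np.allclose(sq_dists(scores), sq_dists(Xc))
```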
Covariance vs Correlation
Using covariances among variables only makes sense if they are measured in the same units, and even then, variables with high variances will dominate the principal components. These problems are generally avoided by standardizing each variable to unit variance and zero mean:
X'_{im} = \frac{X_{im} - \bar{X}_i}{SD_i}

where SD_i is the standard deviation of variable i.
The covariances between the standardized variables are correlations. After standardization, each variable has a variance of 1.000. Correlations can also be calculated from the variances and covariances:

r_{ij} = \frac{C_{ij}}{\sqrt{V_i V_j}}

where r_{ij} is the correlation between variables i and j, C_{ij} is their covariance, and V_i is the variance of variable i.
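Plugging the worked example's variances and covariance into this formula reproduces the 0.5297 correlation between X1 and X2 quoted in these notes:

```python
import numpy as np

# Variances and covariance of X1 and X2 from the worked example
V1, V2, C12 = 6.6707, 6.2384, 3.4170

r12 = C12 / np.sqrt(V1 * V2)
assert abs(r12 - 0.5297) < 1e-4   # matches the correlation matrix entry
```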
The first step is to calculate the cross-products matrix of variances and covariances (or correlations) among every pair of the p variables. This is a square, symmetric matrix.
Variance-covariance matrix:

        X1      X2
X1  6.6707  3.4170
X2  3.4170  6.2384

Correlation matrix:

        X1      X2
X1  1.0000  0.5297
X2  0.5297  1.0000

In matrix notation, S = X'X / (n − 1), where X is the n × p data matrix with each variable centered (and also divided by its SD if using correlations).
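The matrix formula can be checked against NumPy's built-in `cov` and `corrcoef`; the data here are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 3))             # hypothetical: n = 30 objects, p = 3

Xc = A - A.mean(axis=0)                  # center each variable
S = (Xc.T @ Xc) / (len(Xc) - 1)          # S = X'X / (n - 1)
assert np.allclose(S, np.cov(A.T))       # matches the variance-covariance matrix

Z = Xc / A.std(axis=0, ddof=1)           # also divide by SD (standardize)
R = (Z.T @ Z) / (len(Z) - 1)
assert np.allclose(R, np.corrcoef(A.T))  # matches the correlation matrix
```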
Manipulating Matrices
Transposing changes the columns to rows and the rows to columns:

X = | 10  0  4 |        X' = | 10  7 |
    |  7  1  2 |             |  0  1 |
                             |  4  2 |

Multiplying matrices: the premultiplicand matrix must have the same number of columns as the postmultiplicand matrix has rows.
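The transposition and multiplication rules, using the small matrix from the example above:

```python
import numpy as np

X = np.array([[10, 0, 4],
              [ 7, 1, 2]])                 # 2 x 3

Xt = X.T                                   # transpose: 3 x 2, rows <-> columns
assert (Xt == np.array([[10, 7], [0, 1], [4, 2]])).all()

# Columns of the premultiplicand (3) match rows of the postmultiplicand (3)
P = X @ Xt                                 # (2 x 3)(3 x 2) -> 2 x 2
assert (P == np.array([[116, 78], [78, 54]])).all()
```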
The sum of the diagonal elements of the variance-covariance matrix is called the trace. It represents the total variance in the data: it is the mean squared Euclidean distance between each object and the centroid in p-dimensional space.
        X1      X2
X1  6.6707  3.4170
X2  3.4170  6.2384

Trace = 6.6707 + 6.2384 = 12.9091
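Both readings of the trace (sum of the variances, and mean squared distance from the centroid) can be verified; the random data in the second check are illustrative only:

```python
import numpy as np

S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])           # example variance-covariance matrix
assert np.isclose(np.trace(S), 12.9091)    # total variance

# With any data, trace of the covariance matrix = sum of squared
# distances from the centroid, divided by n - 1
rng = np.random.default_rng(4)
A = rng.normal(size=(20, 3))
Xc = A - A.mean(axis=0)
assert np.isclose(np.trace(np.cov(A.T)), np.sum(Xc**2) / (len(A) - 1))
```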
Finding the principal axes involves eigenanalysis of the cross-products matrix S. The eigenvalues (latent roots) of S are the solutions (\lambda) to the characteristic equation:

|S - \lambda I| = 0

The eigenvalues \lambda_1, \lambda_2, \dots, \lambda_p are the variances of the coordinates on each principal component axis; the sum of all p eigenvalues equals the trace of S (the sum of the variances of the original variables).
        X1      X2
X1  6.6707  3.4170
X2  3.4170  6.2384

\lambda_1 = 9.8783, \lambda_2 = 3.0308
Note: \lambda_1 + \lambda_2 = 12.9091 = Trace
Each eigenvector consists of p values which represent the contribution of each variable to the principal component axis. Eigenvectors are uncorrelated (orthogonal): their cross-products are zero.
Eigenvectors:

        u1       u2
X1  0.7291  -0.6844
X2  0.6844   0.7291

Orthogonality check: 0.7291 × (−0.6844) + 0.6844 × 0.7291 = 0
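The eigenanalysis of the example matrix can be reproduced with NumPy's `eigh` (for symmetric matrices); small differences from the quoted values reflect rounding of the matrix entries:

```python
import numpy as np

# Eigenanalysis of the example variance-covariance matrix
S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])

eigvals, eigvecs = np.linalg.eigh(S)          # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Close to the quoted 9.8783 and 3.0308
assert np.allclose(eigvals, [9.878, 3.031], atol=5e-3)
assert np.isclose(eigvals.sum(), np.trace(S))  # eigenvalues sum to the trace

u1, u2 = eigvecs[:, 0], eigvecs[:, 1]
assert np.isclose(abs(u1 @ u2), 0.0, atol=1e-12)  # orthogonal eigenvectors
```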
The coordinates of each object m on the kth principal axis, known as the scores on PC k, are computed from the centered data and the kth eigenvector:

s_{mk} = \sum_{i=1}^{p} u_{ki} (X_{mi} - \bar{X}_i)
The variance of the scores on each PC axis is equal to the corresponding eigenvalue for that axis: the eigenvalue represents the variance displayed ("explained" or "extracted") by the kth axis. The sum of the first k eigenvalues is the variance explained by the k-dimensional ordination.
\lambda_1 = 9.8783, \lambda_2 = 3.0308, Trace = 12.9091
A worked example: data from research on habitat definition in the endangered Baw Baw frog (Philoria frosti). Sixteen environmental and structural variables were measured at each of 124 sites; the correlation matrix was used because the variables have different units.
Eigenvalues
Axis  Eigenvalue  % of Variance  Cumulative %
 1      5.855        36.60          36.60
 2      3.420        21.38          57.97
 3      1.122         7.01          64.98
 4      1.116         6.97          71.95
 5      0.982         6.14          78.09
 6      0.725         4.53          82.62
 7      0.563         3.52          86.14
 8      0.529         3.31          89.45
 9      0.476         2.98          92.42
10      0.375         2.35          94.77
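The percentage columns follow directly from the eigenvalues: with 16 standardized variables the trace of the correlation matrix is 16, so each axis explains eigenvalue/16 of the total variance. A quick check against the table (small discrepancies are rounding in the printed eigenvalues):

```python
import numpy as np

# First ten eigenvalues from the Baw Baw frog analysis
eigvals = np.array([5.855, 3.420, 1.122, 1.116, 0.982,
                    0.725, 0.563, 0.529, 0.476, 0.375])

total = 16.0                       # trace of a 16-variable correlation matrix
pct = 100.0 * eigvals / total      # % of variance per axis
cum = np.cumsum(pct)               # cumulative % of variance

assert np.allclose(pct[:2], [36.60, 21.38], atol=0.05)
assert np.allclose(cum[:2], [36.60, 57.97], atol=0.05)
```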
Interpreting Eigenvectors
The correlations between the variables and the principal axes are known as loadings; each element of an eigenvector represents the contribution of a given variable to a component.

Variable    Axis 1   Axis 2   Axis 3
Altitude    0.3842   0.0659  -0.1177
pH         -0.1159   0.1696  -0.5578
Cond       -0.2729  -0.1200   0.3636
TempSurf    0.0538  -0.2800   0.2621
Relief     -0.0765   0.3855  -0.1462
maxERht     0.0248   0.4879   0.2426
avERht      0.0599   0.4568   0.2497
%ER         0.0789   0.4223   0.2278
%VEG        0.3305  -0.2087  -0.0276
%LIT       -0.3053   0.1226   0.1145
%LOG       -0.3144   0.0402  -0.1067
%W         -0.0886  -0.0654  -0.1171
H1Moss      0.1364  -0.1262   0.4761
DistSWH    -0.3787   0.0101   0.0042
DistSW     -0.3494  -0.1283   0.1166
DistMF      0.3899   0.0586  -0.0175
[Figure: scree plot of eigenvalue against PC axis number]
If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data.
In community ecology, PCA is useful for summarizing variables whose relationships are approximately linear, or at least monotonic; e.g. a PCA of many soil properties might be used to extract a few components that summarize the main dimensions of soil variation.
Ambiguity of Absence
[Figure: species abundance plotted along an environmental gradient]
The Horseshoe Effect
[Figure: PCA ordination of community data, Axis 1 vs Axis 2, showing an arched configuration]
The curvature of the gradient and the degree of infolding of the extremes increase with beta diversity. PCA ordinations are not useful summaries of community data except when beta diversity is very low. Using correlation generally does better than covariance; this is because standardization by species improves the correlation between Euclidean distance and environmental distance.
PCA should NOT be used with community data (except perhaps when beta diversity is very low).