Sess03 Dimension Reduction Methods
Principal Components & Factor Analysis
Need for Dimension Reduction
• The databases typically used in data mining may have millions of records
and thousands of variables
• It is unlikely that all of the variables are independent, with no correlation
structure among them
• Data analysts need to guard against multicollinearity, a condition where
some of the predictor variables are strongly correlated with each other
• Multicollinearity leads to instability in the solution space, which can produce incoherent results; for example, in multiple regression, a multicollinear set of predictors can yield a regression that is significant overall even when none of the individual predictors is significant
• Even if such instability is avoided, inclusion of variables which are highly
correlated tends to overemphasize a particular component of the model,
as the component is essentially being double counted
Need for Dimension Reduction
• Statisticians have noted that the sample size needed to fit a multivariate
function grows exponentially with the number of variables - higher-
dimension spaces are inherently sparse
• For example, the empirical rule tells us that, in 1-d, about 68% of normally
distributed variates lie within one standard deviation of the mean; while, for a 10-d
multivariate normal distribution, only 2% of the data lies within the
analogous hypersphere
• The use of too many predictor variables to model a relationship with a
response variable can unnecessarily complicate the interpretation of the
analysis, and violates the principle of parsimony
• Also, retaining too many variables may lead to overfitting, in which the
generality of the findings is hindered because new data do not behave the
same as the training data for all the variables
Need for Dimension Reduction
• Further, analysis solely at the variable-level might miss the fundamental
underlying relationships among the predictors
• Several predictors might fall naturally into a single group (a factor or a
component), which addresses a single aspect of the data
• For example, the variables savings account balance, checking account
balance, home equity, stock portfolio value, and 401k balance might all fall
together under the single component, assets
• In some applications, such as image analysis, retaining full dimensionality
would make most problems intractable. For example, a face classification
system based on 256 × 256 pixel images could potentially require vectors
of dimension 65,536
Goals
• To reduce the number of predictor items
• To help ensure that these predictor items are independent
• To provide a framework for interpretability of the results
• But how?
• Here, we consider Principal Components Analysis (PCA) and, later, Factor Analysis
Principal Components Analysis (PCA)
• How do we create the covariance matrix?
• Covariance measures are not scale-free: a change in the unit of measurement changes the value of the covariance. The correlation matrix is therefore developed by standardizing each covariance, $r_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$, where $\sigma_{ij}$ is the covariance between the $i$th and $j$th variables and $\sigma_i$, $\sigma_j$ are their standard deviations
• The correlation matrix has the following structure:
$\rho = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1m} \\ r_{21} & 1 & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & 1 \end{pmatrix}$
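As a quick illustration of the relationship just described, the following sketch rescales the covariance matrix into the correlation matrix and checks the result against cor(). The built-in mtcars data is used purely as a stand-in for the slides' data set.

```r
# Illustration: correlation matrix as the scaled covariance matrix.
# mtcars is a stand-in data set, not the houses data from the slides.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]  # a few numeric predictors

S <- cov(X)                       # covariance matrix (unit-dependent)
D <- diag(1 / sqrt(diag(S)))      # 1 / standard deviations on the diagonal
R <- D %*% S %*% D                # r_ij = sigma_ij / (sigma_i * sigma_j)
dimnames(R) <- dimnames(S)

all.equal(R, cor(X))              # TRUE: identical to the correlation matrix
round(R, 3)                       # 1s on the diagonal, correlations elsewhere
```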
Principal Components Analysis (PCA)
• Again consider the standardized data, i.e. $Z_i = \dfrac{X_i - \mu_i}{\sigma_i}$, collected in the standardized data matrix $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_m)$
• As each variable has been standardized, $E(Z_i) = 0$ and $\mathrm{Var}(Z_i) = 1$,
i.e. for a standardized data set, the covariance and correlation matrices are the same
• Then the $i$th principal component (PC) can be computed for the standardized data
matrix as $Y_i = \mathbf{e}_i^{T}\mathbf{Z}$, where $\mathbf{e}_i^{T}$ denotes the transpose of the $i$th eigenvector of the correlation matrix $\rho$
• The PCs are linear combinations of the standardized variables such that
• The variances of the PCs are as large as possible
• The PCs are uncorrelated
• The first PC is $Y_1 = \mathbf{e}_1^{T}\mathbf{Z} = e_{11}Z_1 + e_{12}Z_2 + \cdots + e_{1m}Z_m$, which has greater
variability than any other possible linear combination of the Z variables
Principal Components Analysis (PCA)
• The first PC is the linear combination that maximizes $\mathrm{Var}(Y_1) = \mathbf{e}_1^{T}\rho\,\mathbf{e}_1$
• The second PC is the linear combination that maximizes $\mathrm{Var}(Y_2) = \mathbf{e}_2^{T}\rho\,\mathbf{e}_2$, and it is
independent of the first PC
• In general, the $i$th PC is the linear combination that is independent of all
the other PCs and maximizes $\mathrm{Var}(Y_i) = \mathbf{e}_i^{T}\rho\,\mathbf{e}_i$
• Eigenvalues: If B is an m×m matrix and I is the m×m identity matrix, then the scalars
$\lambda_1, \lambda_2, \ldots, \lambda_m$ satisfying $\left|\mathbf{B} - \lambda\mathbf{I}\right| = 0$ are called the eigenvalues of B
• Eigenvectors: For the above matrix and its eigenvalues, a nonzero m×1 vector $\mathbf{e}$ is
called an eigenvector corresponding to the eigenvalue $\lambda$ if $\mathbf{B}\mathbf{e} = \lambda\mathbf{e}$
• Two important properties hold when the eigenvectors are computed on the covariance
(or correlation) matrix: the eigenvalues sum to the total variance, $\sum_{i=1}^{m}\lambda_i = \mathrm{trace}(\mathbf{B})$,
and the eigenvectors are mutually orthogonal, $\mathbf{e}_i^{T}\mathbf{e}_j = 0$ for $i \ne j$ (with $\mathbf{e}_i^{T}\mathbf{e}_i = 1$ when normalized)
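To make these definitions concrete, here is a sketch that carries out PCA directly from the eigendecomposition of the correlation matrix, again on the stand-in mtcars data; it checks that the PC variances equal the eigenvalues and that the PCs are uncorrelated.

```r
# PCA "by hand" via the eigendecomposition of the correlation matrix.
# mtcars is used only as a stand-in data set for illustration.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
Z <- scale(X)                          # standardize: mean 0, sd 1

rho <- cor(Z)                          # = cov(Z) for standardized data
eig <- eigen(rho)                      # eigenvalues (decreasing) and eigenvectors
E   <- eig$vectors                     # columns e_1, ..., e_m
Y   <- Z %*% E                         # PC scores: Y_i = e_i' Z

round(eig$values, 3)                   # variances of the PCs
round(apply(Y, 2, var), 3)             # matches the eigenvalues
round(cor(Y), 3)                       # ~identity matrix: PCs are uncorrelated

# Compare with R's built-in PCA on the correlation matrix
pca <- prcomp(X, scale. = TRUE)
round(pca$sdev^2, 3)                   # same eigenvalues
```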
Results on PCA
• The total variability in the standardized set of predictors equals the sum of the
variances of the Z-vectors, which equals the sum of the variances of the
components, which equals the sum of the eigenvalues, which equals the number of
predictors
• The partial correlation between a given component and a given predictor variable is
a function of an eigenvector and an eigenvalue. Specifically, $\mathrm{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}$, for
$i, j = 1, 2, \ldots, m$, where $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_m, \mathbf{e}_m)$ are the eigenvalue-eigenvector pairs of the correlation matrix $\rho$
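Both results are easy to verify numerically; the sketch below continues the eigendecomposition sketch above (Z, eig, E, and Y are assumed to be already defined).

```r
# Numerical check of the two PCA results above
# (continues the earlier sketch: Z, eig, E, Y already defined).
m <- ncol(Z)

sum(eig$values)     # total variability = sum of the eigenvalues ...
m                   # ... = number of standardized predictors

# Corr(Y_i, Z_j) = e_ij * sqrt(lambda_i): build the matrix of these values
comp_var_corr <- E %*% diag(sqrt(eig$values))   # rows = variables, cols = components
all.equal(unname(cor(Z, Y)), comp_var_corr)     # TRUE
```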
Proportion of Variance Explained Criterion
• First, the analyst specifies how much of the total variability he or she would like
the principal components to account for
• Then, the analyst simply selects the components one by one until the desired
proportion of variability explained is attained
• For example, suppose we would like our components to explain 85% of the
variability in the variables. Then, from Table, we would choose components 1–3,
which together explain 86.057% of the variability. However, if we wanted our
components to explain 90% or 95% of the variability, then we would need to include
component 4 along with components 1–3, which together would explain 96.368% of
the variability
• Again, as with the eigenvalue criterion, how large a proportion is enough?
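A sketch of this criterion in R, again on a stand-in data set: compute the cumulative proportion of variance explained from prcomp() and keep the smallest number of components that meets the analyst's target.

```r
# Proportion of variance explained criterion (illustrative stand-in data).
pca <- prcomp(mtcars, scale. = TRUE)      # PCA on the correlation matrix

eig_vals <- pca$sdev^2                    # eigenvalues
prop_var <- eig_vals / sum(eig_vals)      # proportion of variance per component
cum_var  <- cumsum(prop_var)              # cumulative proportion

round(cbind(eigenvalue = eig_vals,
            proportion = prop_var,
            cumulative = cum_var), 3)

target <- 0.85                            # e.g., "explain 85% of the variability"
k <- which(cum_var >= target)[1]          # smallest k meeting the target
k
```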
Scree Plot Criterion
• A scree plot is a graphical plot of the eigenvalues against the component number
• Scree plots are useful for finding an upper bound (maximum) for the number of
components that should be retained
• The scree plot criterion is this: The maximum number of components that should
be extracted is just before where the plot first begins to straighten out into a
horizontal line
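A minimal scree plot sketch, assuming the prcomp object from the previous sketch; the point just before the curve flattens into a roughly horizontal line is the suggested maximum number of components.

```r
# Scree plot: eigenvalues against component number (uses pca from the sketch above).
eig_vals <- pca$sdev^2
plot(eig_vals, type = "b", xlab = "Component number",
     ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)        # optional reference line for the eigenvalue criterion

# Equivalent built-in helper:
screeplot(pca, type = "lines", main = "Scree plot (screeplot)")
```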
Comparison of Three Criteria
• To summarize, the recommendations of our criteria are as
follows:
• The Eigenvalue Criterion:
o Retain components 1–3, but do not throw away component 4 yet
• The Proportion of Variance Explained Criterion
o Components 1–3 account for a solid 86% of the variability, and tacking on
component 4 gives us a superb 96% of the variability
• The Scree Plot Criterion
o Do not extract more than four components
Comparison of Three Criteria
• In a case like this, where there is no clear-cut best solution, why not try it both ways
(three components and four components) and see what happens?
• The component weights smaller than 0.15 are suppressed to ease the component
interpretation
• Note that the first three components are each exactly the same in both cases, and
each is the same as when we extracted all eight components
• This is because each component extracts its portion of the variability sequentially, so
that later component extractions do not affect the earlier ones.
Communalities
• PCA does not extract all the variance from the variables, but only that proportion of
the variance that is shared by several variables
• Communality represents the proportion of variance of a particular variable that is
shared with other variables
• The communalities represent the overall importance of each of the variables in the
PCA as a whole
• For example, a variable with a communality much smaller than the other variables
indicates that this variable shares much less of the common variability among the
variables, and contributes less to the PCA solution
• Communalities that are very low for a particular variable should be an indication to
the analyst that the particular variable might not participate in the PCA solution
• Overall, large communality values indicate that the principal components have
successfully extracted a large proportion of the variability in the original variables,
while small communality values show that there is still much variation in the data
set that has not been accounted for by the principal components
Communalities
• Communality values are calculated as the sum of squared component weights for a
given variable (a minimal sketch follows this list)
• We are trying to determine whether to retain component 4, the “housing age”
component
• Thus, we calculate the communality value for the variable housing median age, using the
component weights for this variable (hage_z) from Table
• Two communality values for housing median age are calculated, one for retaining three
components, and the other for retaining four components
• Communalities less than 0.5 can be considered too low, since the variable then shares less
than half of its variability in common with the other variables
• Suppose that for some reason we wanted or needed to keep the variable housing median
age as an active part of the analysis. Then, extracting only three components would not
be adequate, as housing median age shares only 35% of its variance with the other
variables.
• If we wanted to keep this variable in the analysis, we would need to extract the fourth
component, which lifts the communality for housing median age over the 50% threshold
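The communality arithmetic above can be reproduced in a couple of lines. The weights below are hypothetical placeholders for the hage_z row of the slides' component matrix, so only the mechanics (not the numbers) carry over.

```r
# Communality = sum of squared component weights (loadings) for a variable.
# The weights below are hypothetical placeholders standing in for the hage_z
# row of the slides' component matrix; substitute the real table values.
hage_weights <- c(comp1 = 0.25, comp2 = 0.10, comp3 = 0.50, comp4 = 0.75)

communality_3 <- sum(hage_weights[1:3]^2)   # retaining three components
communality_4 <- sum(hage_weights[1:4]^2)   # retaining four components
communality_3
communality_4

# For a whole loading matrix L (variables in rows, components in columns):
# communalities <- rowSums(L^2)
```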
Comparison of Four Selection Criteria
• The Eigenvalue Criterion recommended three components, but did not absolutely reject
the fourth component. Also, for small numbers of variables, this criterion can
underestimate the best number of components to extract
• The Proportion of Variance Explained Criterion stated that we needed to use
four components if we wanted to account for that superb 96% of the variability.
As our ultimate goal is to substitute these components for the original data and
use them in further modeling downstream, being able to explain so much of the
variability in the original data is very attractive
• The Scree Plot Criterion said not to exceed four components
• The Minimum Communality Criterion stated that, if we wanted to keep housing
median age in the analysis, we had to extract the fourth component. As we
intend to substitute the components for the original data, then we need to keep
this variable, and therefore we need to extract the fourth component
Validation of PCs
• Recall that the original data set was divided into a training data set and a test data set
• In order to validate the principal components uncovered here, we now perform PCA on the
standardized variables for the test data set
• The resulting component matrix is shown in Table, with component weights smaller than
±0.50 suppressed
• Although the component weights do not exactly equal those of the training set, nevertheless
the same four components were extracted, with a one-to-one correspondence in terms of
which variables are associated with which component
• If the split sample method described here does not successfully provide validation, then the
analyst should take this as an indication that the results (for the data set as a whole) are not
generalizable, and the results should not be reported as valid.
• If the lack of validation stems from a subset of the variables, then the analyst may consider
omitting these variables, and performing the PCA again
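A hedged sketch of this split-sample validation, using a random split of a stand-in data set (mtcars) rather than the slides' houses data: fit PCA on each half and compare which variables load on which components.

```r
# Split-sample validation sketch: compare PCA loadings on training vs. test data.
# mtcars again stands in for the slides' data set.
set.seed(7)
idx   <- sample(nrow(mtcars), size = round(0.5 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

pca_train <- prcomp(train, scale. = TRUE)
pca_test  <- prcomp(test,  scale. = TRUE)

k <- 4  # number of components being validated
round(pca_train$rotation[, 1:k], 2)   # training loadings (component weights)
round(pca_test$rotation[, 1:k], 2)    # test loadings: look for the same
                                      # variable-to-component pattern (signs may flip)
```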
Factor Analysis
• Related to Principal Components Analysis but has different goals
• PCA seeks to identify orthogonal linear combinations of the variables, to be used either for
descriptive purposes or to substitute a smaller number of uncorrelated components for the
original variables
• In contrast, factor analysis represents a model for the data, and as such is more elaborate,
especially when factor rotation is used
• The factor analysis model hypothesizes that the response vector can be modeled as linear
combinations of a smaller set of unobserved latent random variables, called common factors,
along with an error term, in the following way: $\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon}$, where $\mathbf{L}$ is the $m \times k$ matrix of
factor loadings, $\mathbf{F}$ is the $k \times 1$ vector of common factors, and $\boldsymbol{\varepsilon}$ is the error term
Note that the correlations, although statistically significant in several cases, are overall much weaker than the correlations from the
houses data set. A weaker correlation structure should pose more of a challenge for the dimension-reduction method
Application
• To allow us to view the results using a scatter plot, we decide a priori to extract only two
factors
• The following factor analysis is performed using the principal axis factoring option with an
iterative procedure used to estimate the communalities and the factor solution
• This analysis required 152 such iterations before reaching convergence
• The eigenvalues and the proportions of the variance explained by each factor are shown
• Note that the first two factors extract less than half of the total variability in the variables, as
contrasted with the houses data set, where the first two components extracted over 72% of
the variability; this reflects the weaker correlation structure of the adult data set
Using R
• The factor loadings $\mathbf{L}_{m \times k}$ are shown in Table. Factor loadings are analogous to the component
weights in PCA, and represent the correlation between the $i$th variable and the $j$th factor
• Notice that the factor loadings are much weaker than the previous houses example, again due
to the weaker correlations among the standardized variables
• The communalities are also much weaker than the houses example: the low communality
values reflect the fact that there is not much shared correlation among the variables
• Note that the factor extraction increases the shared correlation
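The slide title refers to R but no code is shown. Below is a hedged sketch of how a two-factor, principal axis solution could be obtained; it uses the psych package's fa() function (an assumption, since the slides do not name the function; base R's factanal() is an alternative but uses maximum likelihood rather than principal axis factoring) and a stand-in data set, so the loadings will differ from the referenced table.

```r
# Two-factor solution by iterated principal axis factoring, as described on the
# "Application" slide. psych::fa() is one way to do this; treat this as a sketch.
library(psych)                 # install.packages("psych") if needed

# Stand-in data: any numeric data frame of standardized predictors would do here.
adult_z <- scale(mtcars)

fa_unrotated <- fa(adult_z, nfactors = 2, fm = "pa", rotate = "none")

print(fa_unrotated$loadings, cutoff = 0.15)   # factor loadings L (m x k)
round(fa_unrotated$communality, 3)            # communalities per variable
round(fa_unrotated$values, 3)                 # eigenvalues
```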
Factor Rotation
• To assist in the interpretation of the factors, factor rotation may be performed
• Corresponds to a transformation (usually orthogonal) of the coordinate axes, leading to a
different set of factor loadings
• Analogous to a scientist attempting to elicit greater contrast and detail by adjusting the focus
of a microscope: the sharpest focus occurs when each variable has high factor loadings on a
single factor, with low-to-moderate loadings on the other factors
• For the Houses example, this sharp focus occurred already on the unrotated factor loadings,
so rotation was not necessary
• However, the table of factor loadings for the adult data set shows that we should perhaps try
factor rotation to improve our interpretation
Factor Rotation
• Note that most vectors do not closely follow the coordinate axes, which means that there is
poor “contrast” among the variables for each factor, thereby reducing interpretability
• Next, a varimax rotation was applied to the matrix of factor loadings, resulting in the new set
of factor loadings
• Note that the contrast has been increased for most variables
• Figure shows that the factor loadings have been rotated along the axes of maximum
variability, represented by Factor 1 and Factor 2
• Often, the first factor extracted represents a “general factor,” and accounts for much of the
total variability
Factor Rotation
• The effect of factor rotation is to redistribute this first factor’s variability explained among the
second, third, and subsequent factors
• The sum of squared loadings for Factor 1 in the unrotated case represents 10.7% of the total
variability, and about 61% of the variance explained by the first two factors
• For the rotated case, Factor 1's influence has been partially redistributed to Factor 2, with
Factor 1 now accounting for 9.6% of the total variability and about 55% of the variance
explained by the first two factors (as sketched below)
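Assuming the unrotated solution from the sketch under "Using R", the varimax rotation and the before/after sums of squared loadings can be computed as follows; the 10.7% and 9.6% figures quoted above come from the adult data, so the stand-in numbers here will differ.

```r
# Varimax rotation of the unrotated loadings (fa_unrotated from the earlier sketch),
# followed by the sums of squared loadings before and after rotation.
L_unrot <- unclass(fa_unrotated$loadings)      # m x 2 matrix of loadings

rot   <- varimax(L_unrot)                      # orthogonal varimax rotation (base stats)
L_rot <- unclass(rot$loadings)

ss_unrot <- colSums(L_unrot^2)                 # sums of squared loadings per factor
ss_rot   <- colSums(L_rot^2)

m <- nrow(L_unrot)                             # total variability = m (standardized vars)
round(rbind(unrotated = ss_unrot / m,          # share of total variability
            rotated   = ss_rot   / m), 3)
round(rbind(unrotated = ss_unrot / sum(ss_unrot),   # share of the variance explained
            rotated   = ss_rot   / sum(ss_rot)), 3) # by the first two factors
```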
Goals of Factor Rotation
• There are three methods for orthogonal rotation, in which the axes are rigidly maintained at
90°
• The goal when rotating the matrix of factor loadings is to ease interpretability by
simplifying the rows and columns of the factor loading matrix
• We assume that the columns in a matrix of factor loadings represent the factors,
and that the rows represent the variables
• Simplifying the rows of this matrix would entail maximizing the loading of a
particular variable on one particular factor, and keeping the loadings for this variable
on the other factors as low as possible (ideal: row of zeroes and ones)
• Similarly, simplifying the columns of this matrix would entail maximizing the loading
of a particular factor on one particular variable, and keeping the loadings for this
factor on the other variables as low as possible (ideal: column of zeroes and ones)
• Three types of Rotation
• Quartimax Rotation
• Varimax Rotation
• Equimax Rotation
Types of Factor Rotation
• Quartimax Rotation seeks to simplify the rows of a matrix of factor loadings. It
tends to rotate the axes so that the variables have high loadings for the first factor,
and low loadings thereafter. The difficulty is that it can generate a strong “general”
first factor, in which many variables have high loadings
• Varimax Rotation prefers to simplify the columns of the factor loading matrix. It
maximizes the variability in the loadings for the factors, with a goal of working
toward the ideal column of zeroes and ones for each variable. The rationale for
varimax rotation is that we can best interpret the factors when they are strongly
associated with some variables and strongly not associated with other variables
• Researchers have shown varimax rotation to be more invariant than quartimax rotation
• Equimax Rotation seeks to compromise between simplifying the columns and the
rows
• The researcher may prefer to avoid the requirement that the rotated factors remain
orthogonal (independent)
• In this case, oblique rotation methods are available, in which the factors may be
correlated with each other
• This rotation method is called oblique because the axes are no longer required to be
orthogonal (at 90° to each other)
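As a minimal sketch of an oblique alternative (assuming L_unrot from the previous sketch), base R's promax() drops the orthogonality requirement; the implied factor correlation matrix shows that the rotated factors may be correlated.

```r
# Oblique rotation sketch: promax() (base stats) allows correlated factors.
# Assumes L_unrot, the unrotated m x 2 loading matrix from the earlier sketch.
obl <- promax(L_unrot)

round(unclass(obl$loadings), 2)          # pattern loadings after oblique rotation

# One common way to recover the implied factor correlation matrix from the
# returned rotation matrix (identity only if the rotation stayed orthogonal):
phi <- solve(t(obl$rotmat) %*% obl$rotmat)
round(phi, 2)
```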