GIS320 Lecture6 Principal Components Analysis


2023/08/14

Lecture 6

Principal components analysis (PCA)

Associate Professor Gregory Breetzke


[email protected]
Room 1-19, Geography Building

disclaimer


what is PCA?
• Principal component analysis (PCA) is a method used to reduce the dimensionality of large data sets. How?

• E.g., change 20 variables into 4 variables/factors/components

• Reducing the number of variables in a data set comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity

• To sum up, the idea of PCA is simple: reduce the number of variables in a data set while preserving as much information as possible

concepts in PCA
• Conceptually, with two variables (data sets), the transformation of the data is accomplished as follows:

– The data is plotted in a scatterplot

– An ellipse is calculated to bound the points in the scatterplot


concepts in PCA
• The major axis of the ellipse is determined

• The major axis becomes the new x-axis, the first principal component (PC1). PC1 depicts the greatest variation because it is the longest transect that can be drawn through the ellipse

• I.e., greatest variation = the line that captures the most information in the data

• The direction of PC1 is the eigenvector, and its magnitude (the variance along that direction) is the eigenvalue.

• The angle between the original x-axis and PC1 is the angle of rotation used in the transformation.

concepts in PCA
• A line perpendicular (orthogonal) to PC1 is calculated.

• This line is the second principal component (PC2) and the new axis replacing the original y-axis.

• The new axis describes the greatest variance not described by PC1.

• What happens if there are more than two datasets/variables?
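
The ellipse picture can be sketched numerically. The example below is a minimal illustration, assuming NumPy and a made-up two-variable data set (the values are not from the lecture): the eigenvectors of the covariance matrix give the directions of PC1 and PC2, the eigenvalues give the variance along each, and the angle of rotation follows from the PC1 direction.

```python
import numpy as np

# Hypothetical two-variable data set (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
data = np.column_stack([x, y])

# Centre the data, then compute the 2 x 2 covariance matrix.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pc1, pc2 = eigenvectors[:, 0], eigenvectors[:, 1]

# Angle between the original x-axis and PC1, in degrees.
angle = np.degrees(np.arctan2(pc1[1], pc1[0]))

# Rotating the centred data onto the new axes gives the component scores.
scores = centred @ eigenvectors
```

The same calculation works unchanged with more than two variables: a p-variable data set gives a p x p covariance matrix and p mutually perpendicular components.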


steps in a PCA?
STEP 1: STANDARDISATION

• The aim of this step is to standardize the range of the continuous initial
variables so that each one of them contributes equally to the analysis.

• Why?

• Calculate z-score. Why?

• Once the standardization is done, all the variables will be transformed to the same scale
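
A minimal sketch of the z-score calculation, assuming NumPy and the sample standard deviation (ddof=1). The function name `standardise` and the input values are hypothetical.

```python
import numpy as np

def standardise(values):
    """Return column-wise z-scores: (value - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

# Hypothetical variables on very different scales,
# e.g. elevation in metres and slope in degrees.
raw = np.array([[1200.0, 5.0],
                [1350.0, 12.0],
                [1100.0, 3.0],
                [1500.0, 20.0]])
z = standardise(raw)
```

After this step every column has a mean of 0 and a standard deviation of 1, so a variable measured in large units cannot dominate the analysis.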

Steps in a PCA?
STEP 2: CORRELATION MATRIX COMPUTATION

• The aim of this step is to see if there is any relationship between the
variables.

• Sometimes variables are so highly correlated that they contain redundant information. To identify these correlations, we compute a correlation matrix.

• What is correlation?
– Correlation measures both the strength and direction of the linear relationship between two variables.
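
As an illustration, a correlation matrix can be computed with NumPy's `corrcoef`. The three variables below are made up: `b` is built to be strongly related to `a`, while `c` is unrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(scale=0.5, size=100)  # strongly related to a
c = rng.normal(size=100)                       # unrelated to a and b
variables = np.column_stack([a, b, c])

# Each column is a variable, so rowvar=False.
corr = np.corrcoef(variables, rowvar=False)
```

The matrix is symmetric with 1s on the diagonal. A value near +1 or -1 off the diagonal (here between `a` and `b`) points to redundant information.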


Steps in a PCA?
STEP 3: COMPUTE THE PRINCIPAL COMPONENTS

• Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.

• These combinations are done in such a way that the new variables
(i.e., principal components) are uncorrelated and most of the
information within the initial variables is squeezed or compressed
into the first components.

Steps in a PCA?
STEP 3: COMPUTE THE PRINCIPAL COMPONENTS

• Principal components are less interpretable than the initial variables and have no direct real-world meaning, since they are constructed as linear combinations of the initial variables.

• Principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set


Steps in a PCA?
STEP 3: COMPUTE THE PRINCIPAL COMPONENTS

• The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance

• This continues until a total of p principal components have been calculated, equal to the original number of variables.
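
Steps 1 to 3 can be put together in a short sketch. It assumes NumPy, a correlation-matrix PCA, and a made-up four-variable data set; the function name `pca` is hypothetical and this is not the routine used by any particular software package.

```python
import numpy as np

def pca(data):
    """Standardise, build the correlation matrix, and extract all p components."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # step 1
    corr = np.corrcoef(z, rowvar=False)                         # step 2
    eigenvalues, eigenvectors = np.linalg.eigh(corr)            # step 3
    order = np.argsort(eigenvalues)[::-1]                       # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = z @ eigenvectors          # the new variables (principal components)
    return eigenvalues, eigenvectors, scores

# Hypothetical data: four variables, some of them strongly related.
rng = np.random.default_rng(2)
base = rng.normal(size=(150, 2))
data = np.column_stack([base[:, 0],
                        base[:, 0] + 0.2 * rng.normal(size=150),
                        base[:, 1],
                        base[:, 1] + 0.5 * rng.normal(size=150)])

eigenvalues, eigenvectors, scores = pca(data)
score_corr = np.corrcoef(scores, rowvar=False)
```

With p = 4 input variables there are 4 components, the eigenvalues sum to 4, and `score_corr` is the identity matrix up to rounding, which is what "uncorrelated" means here.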

PCA in raster GIS


• Principal component analysis catches redundancy between data sets.

• What about aspect, slope, and hillshade data? Is there redundancy in these three data sets? If so, how much?


PCA in raster GIS


Step 1. Run the “Composite Bands” tool in ArcPro

• The Composite Bands tool combines the aspect, hillshade, and slope rasters into a single 3-band raster. Use the following rasters as inputs:

– ASPECT: Band 1
– HILLSHADE: Band 2
– SLOPE: Band 3

• Output the new raster as Composite

PCA in raster GIS


Step 2. Execute the “Principal Components” tool

• Using the Spatial Analyst extension in ArcPro, execute the “Principal Components” tool with the following criteria:

– INPUT RASTER: Composite
– OUTPUT RASTER: PCA
– NUMBER OF PRINCIPAL COMPONENTS: 3
– OUTPUT DATA FILE: PrincipalComponents.txt

• The result will be a 3-channel PCA composite and a data file showing the amount of redundancy.


PCA in raster GIS


• The “percent of eigenvalues” shows how much of the total variance each principal component accounts for (the magnitude of variance).

• This table shows that the first component accounts for 67.1% of the variance (or ‘information’ of the 3 rasters collectively)

• Adding the second component brings the cumulative total to 98.1% of the ‘information’. The third component adds little extra information (1.9%) and is largely redundant with principal components 1 and 2.
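
The percentages in such a table come straight from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues. A minimal sketch, assuming NumPy; the function name and the eigenvalues are hypothetical and are not the values from the lecture's output file.

```python
import numpy as np

def percent_of_eigenvalues(eigenvalues):
    """Return (percent, cumulative percent) of total variance per component."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    return percent, np.cumsum(percent)

# Hypothetical eigenvalues, already sorted from largest to smallest.
percent, cumulative = percent_of_eigenvalues([6.0, 3.0, 1.0])
```

The cumulative column is the one to read when deciding how many components to keep.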

PCA in remote sensing


• Running a principal component analysis on three bands is useful
because we found the third component did not add much information.

• What about a 10-band multispectral image? Or even 100 or 200 bands (hyperspectral imagery)?

• This is where PCA is really useful: multispectral and hyperspectral analysis.

• For example, if most of the variance (the largest eigenvalues) is found in principal components one, two, and three, it’s only necessary to use these three principal components. For land cover classification, it is much easier to work with three bands than with all 10 bands.

• In summary, PCA identifies duplicate data over multiple channels, reduces redundancy, and speeds up the processing time. This is the key benefit of PCA in image processing.
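
The band-reduction idea can be sketched with NumPy on a synthetic image. This is an illustration only: it assumes a covariance-based PCA, the function name `pca_bands` is hypothetical, and the 10-band image is simulated from three underlying patterns rather than read from real imagery.

```python
import numpy as np

def pca_bands(image, n_components):
    """PCA on an image shaped (bands, rows, cols); returns the first components."""
    bands, rows, cols = image.shape
    pixels = image.reshape(bands, -1).T.astype(float)   # one row per pixel
    centred = pixels - pixels.mean(axis=0)
    cov = np.cov(centred, rowvar=False)                  # bands x bands
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scores = centred @ eigenvectors[:, order]
    return scores.T.reshape(n_components, rows, cols), eigenvalues[order]

# Simulated 10-band image built from 3 underlying patterns plus a little noise.
rng = np.random.default_rng(3)
signal = rng.normal(size=(3, 40, 50))
mixing = rng.normal(size=(10, 3))
image = np.tensordot(mixing, signal, axes=1) + 0.01 * rng.normal(size=(10, 40, 50))

components, kept_eigenvalues = pca_bands(image, n_components=3)
```

Because the 10 simulated bands are mixtures of only three patterns, the three-band output retains almost all of the variance of the original image.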


PCA in raster (ArcPro)


• The input raster bands.
• They can be integer or floating point type.

PCA in raster (ArcPro)

• The output multiband raster dataset.
• If all of the input bands are integer type, the output raster bands will be integer. If any of the input bands are floating point, the output will be floating point.


PCA in raster (ArcPro)

• Number of principal components.
• The number must be greater than zero and less than or equal to the total number of input raster bands.
• The default is the total number of rasters in the input.

PCA in raster (ArcPro)

• Output ASCII data file storing principal component parameters.
• The output data file records the correlation and covariance matrices, the eigenvalues and eigenvectors, the percent variance each eigenvalue captures, and the cumulative variance described by the eigenvalues.
• The extension for the output file can be .txt or .asc.


PCA in raster (ArcPro)


• The result of the tool is a multiband raster with the same number of bands as
the specified number of components (one band per axis or component in the
new multivariate space).

• The first principal component will have the greatest variance, the second will
show the second most variance not described by the first, and so forth.

• The first three or four rasters of the multiband raster produced by the Principal Components tool will typically describe more than 95 percent of the variance. The remaining individual raster bands can be dropped.

• Since the new multiband raster contains fewer bands, and more than 95 percent of the variance of the original multiband raster is intact, the computations will be faster and accuracy is largely maintained.
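
One way to decide how many bands to keep is to count components until the cumulative share of variance reaches the 95 percent mark. A minimal sketch, assuming NumPy; the function name `components_needed` and the eigenvalues in the example are hypothetical.

```python
import numpy as np

def components_needed(eigenvalues, target=0.95):
    """Smallest number of components whose cumulative variance share reaches target."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # searchsorted finds the first position where cumulative >= target.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(eigenvalues))

# Hypothetical eigenvalues for a five-band raster.
k = components_needed([5.0, 3.0, 1.6, 0.3, 0.1])
```

Here the cumulative shares are 50, 80, 96, 99 and 100 percent, so the first three components pass the 95 percent mark and the last two bands could be dropped.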

PCA in raster (ArcPro)


PCA in vector (GeoDa)


• REMEMBER THESE STEPS
1. Come up with a list of possible x (independent) variables that may be helpful in
estimating y (dependent variable)
2. Collect data on the y variable and your x variables from step 1
3. Check the relationships between each x (independent) variable and y (using
scatterplots and correlations), and use the results to eliminate those variables
that aren’t strongly related to y
4. Look at the possible relationships between the x (independent) variables to
make sure you aren’t being redundant (avoid multicollinearity)
5. Use those x variables (from step 4) in a multiple OLS regression analysis to
find the best-fitting model for your data
6. Use the best-fitting model (from step 5) to predict y for given x-values by
plugging those x-values into the model

• REMEMBER THESE ASSUMPTIONS


1. Linear relationship between dependent and independent variable(s)
2. Outliers
3. Non-stationarity
4. Multicollinearity
5. Spatially autocorrelated residuals
6. Normal distribution bias

PCA in vector (GeoDa)


• Sometimes variables are so highly correlated that one variable duplicates information found in another.

• Principal component analysis identifies duplicate data over several datasets. Then, PCA aggregates only the essential information into groups called “principal components”.


PCA in vector (GeoDa)


• Assumption of regression

Table: Correlations for the independent variables

                                                      x1    x2    x3    x4    x5    x6    x7    x8    x9
x1: % Unemployed                                       1
x2: NZDep                                            .81     1
x3: % Males                                         -.15  -.12     1
x4: % Aged 15-29                                     .80   .25   .03     1
x5: % Resided for less than five years               .07  -.00  -.07   .89     1
x6: % Renting                                        .53   .60   .02   .76   .51     1
x7: Index of Concentration of the Extremes (ICE)    -.35  -.65   .01   .02   .09  -.32     1
x8: Diversity Index (DI)                             .67   .58  -.16   .78   .25   .79  -.15     1
x9: % Foreign born                                   .11  -.13  -.15   .39   .72   .24   .41   .43     1

• Rule of thumb: a correlation of 0.70 or higher (in absolute value) between two independent variables signals multicollinearity
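
The 0.70 rule can be applied to the table mechanically. The sketch below assumes NumPy, copies the lower triangle of the correlation table, and lists every pair of variables whose absolute correlation is 0.70 or higher; the variable names in the code are shortened to x1 to x9.

```python
import numpy as np

names = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"]

# Lower triangle of the correlation table, one row per variable.
lower = [
    [1.0],
    [0.81, 1.0],
    [-0.15, -0.12, 1.0],
    [0.80, 0.25, 0.03, 1.0],
    [0.07, -0.00, -0.07, 0.89, 1.0],
    [0.53, 0.60, 0.02, 0.76, 0.51, 1.0],
    [-0.35, -0.65, 0.01, 0.02, 0.09, -0.32, 1.0],
    [0.67, 0.58, -0.16, 0.78, 0.25, 0.79, -0.15, 1.0],
    [0.11, -0.13, -0.15, 0.39, 0.72, 0.24, 0.41, 0.43, 1.0],
]

# Fill a full symmetric matrix from the lower triangle.
n = len(names)
corr = np.eye(n)
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        corr[i, j] = corr[j, i] = value

threshold = 0.70
flagged = [(names[j], names[i], corr[i, j])
           for i in range(n) for j in range(i)
           if abs(corr[i, j]) >= threshold]

for first, second, r in flagged:
    print(f"{first} and {second}: r = {r:.2f}")
```

Seven pairs exceed the threshold, and x4 (% Aged 15-29) is involved in four of them, which is the kind of redundancy PCA is meant to absorb.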

PCA in vector (GeoDa)


PCA in vector (GeoDa)

Factor loadings

PCA in vector (GeoDa)

Factor labelling


PCA in vector (GeoDa)

Factor labelling

Which variables are ‘loaded’ onto PC1?

Which variables are ‘loaded’ onto PC2? etc.

Write a descriptive label for PC1, PC2, PC3, PC4, PC5, PC6

PC1 = Unemployed mover; PC2 = Female foreigner
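
To see how variables end up ‘loaded’ onto a component, the sketch below computes loadings as the correlation between each standardised variable and each component (eigenvector times the square root of the eigenvalue). It assumes NumPy and a correlation-matrix PCA; the five variables are simulated, their names are hypothetical, and the exact quantities a given package reports as loadings may differ.

```python
import numpy as np

# Simulated data: three variables driven by one factor, two by another.
rng = np.random.default_rng(4)
f1, f2 = rng.normal(size=(2, 1000))
data = np.column_stack([
    f1 + 0.3 * rng.normal(size=1000),
    f1 + 0.3 * rng.normal(size=1000),
    f1 + 0.3 * rng.normal(size=1000),
    f2 + 0.3 * rng.normal(size=1000),
    f2 + 0.3 * rng.normal(size=1000),
])
names = ["var_a", "var_b", "var_c", "var_d", "var_e"]

corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Rows are variables, columns are components PC1, PC2, ...
loadings = eigenvectors * np.sqrt(eigenvalues)

# For each variable, the component (1-based) with the largest absolute loading.
strongest = {name: int(np.argmax(np.abs(loadings[i]))) + 1
             for i, name in enumerate(names)}
```

Variables that share the same strongest component are the ones to read together when writing a descriptive label for that component.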

PCA in vector (GeoDa)

Principal components are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables.

Used in regression if a number of variables are correlated

Used to create indices if an analyst would like to create a composite indicator of a concept
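
The regression use can be sketched as follows: correlated x variables are replaced by their first few component scores, which are uncorrelated by construction, before fitting an OLS model. This is a minimal illustration assuming NumPy and simulated data; it is not the GeoDa workflow itself.

```python
import numpy as np

# Simulated data: x2 is nearly a copy of x1, so x1 and x2 are multicollinear.
rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(scale=0.5, size=n)

# PCA on the standardised x variables; keep the first two components.
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
scores = z @ eigenvectors[:, order[:2]]

# OLS regression of y on the component scores (with an intercept column).
design = np.column_stack([np.ones(n), scores])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
predicted = design @ coefficients
r_squared = 1.0 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)
```

The same scores could serve as a composite index: a single component score summarises several related variables in one number per observation.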


uses of PCA?
• Principal Component Analysis (or PCA) is being applied in:

• Biomedical industry
– Drug discovery programmes

• Healthcare industry

• Retail industry
– Customer profiling

• Image compression
