Lab 04 - Correlation and Regression
MSIM4311
Topic 4
CORRELATION ANALYSIS USING PROC CORR
Correlation Analysis Basics
The correlation coefficient measures the linear relationship between two
quantitative variables measured on the same entity.
The correlation is a unitless quantity ranging from -1 to +1, where ρ = -1 and
ρ = +1 correspond to perfect negative and positive linear relationships,
respectively, and ρ = 0 indicates no linear relationship.
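As an illustration of how such a coefficient is obtained with PROC CORR, here is a minimal sketch. The data set name work.demo_corr, the variables x and y, and the data values are made up for this example and are not from the lab.

```sas
/* Hypothetical data: two quantitative variables measured on the same entities */
DATA work.demo_corr;
  INPUT x y;
  DATALINES;
1 2.1
2 3.9
3 6.2
4 7.8
5 10.1
;
RUN;

/* PEARSON requests the usual linear correlation coefficient */
PROC CORR DATA=work.demo_corr PEARSON;
  VAR x y;
  TITLE "Pearson correlation of x and y (illustrative data)";
RUN;
```

Because y rises almost linearly with x in these made-up values, the reported coefficient should be close to +1.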
Note that the MODEL statement is used to tell SAS which variables to use in
the analysis. The MODEL statement has the following form:
MODEL dependentvar = independentvar;
MODEL TASK=CREATE;
TITLE "Example simple linear regression";
RUN;
QUIT;
A QUIT statement is recommended to end the analysis: PROC REG is an
interactive procedure and stays active after RUN until it is told to quit.
Fitted regression equation: TASK = 2.16 + 0.0625*CREATE
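The fragment above omits the PROC REG statement and the data. The following is a sketch of a complete run, under stated assumptions: the data set work.demo_reg and its five rows are hypothetical, so its estimates will not match the equation shown on the slide. The last step simply evaluates the slide's equation for one made-up value, CREATE = 20.

```sas
/* Hypothetical scores; NOT the data behind the slide's fitted equation */
DATA work.demo_reg;
  INPUT CREATE TASK;
  DATALINES;
10 2.7
15 3.2
20 3.3
25 3.8
30 4.0
;
RUN;

/* Simple linear regression: TASK is the dependent variable */
PROC REG DATA=work.demo_reg;
  MODEL TASK=CREATE;
  TITLE "Example simple linear regression";
RUN;
QUIT;

/* Evaluate the equation reported on the slide for a new value */
DATA work.predict;
  CREATE = 20;
  PRED_TASK = 2.16 + 0.0625*CREATE;   /* 2.16 + 1.25 = 3.41 */
RUN;

PROC PRINT DATA=work.predict;
RUN;
```

In the PROC REG output, the intercept and slope appear in the Parameter Estimates table; those two numbers are what the fitted equation is built from.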
Principal Component Analysis
Working through the eigendecomposition of the covariance matrix gives a deeper appreciation of PCA. There are several steps in computing PCA:
1. Standardise the features. We standardise each feature to have a mean of 0 and a variance of 1. Features whose values
   differ by orders of magnitude would otherwise dominate the variance and prevent PCA from computing the best
   principal components.
2. Compute the covariance matrix. The covariance matrix is a square matrix of d x d dimensions, where d is the number of
   dimensions (features, or columns if our data is tabular). Each entry is the covariance between a pair of features;
   because the features are standardised, these values equal the pairwise correlations.
3. Calculate the eigendecomposition of the covariance matrix. We calculate the eigenvectors (unit vectors) of the
   covariance matrix and their associated eigenvalues (the scalars by which the eigenvectors are multiplied).
4. Sort the eigenvectors from the highest eigenvalue to the lowest. The eigenvector with the highest eigenvalue is the first
   principal component. Higher eigenvalues correspond to greater amounts of shared variance explained.
5. Select the number of principal components. Select the top N eigenvectors (based on their eigenvalues) to become the N
   principal components. The optimal number of principal components is both subjective and problem-dependent. Usually, we
   look at the cumulative amount of shared variance explained by the principal components and keep the smallest number
   of components that together still explain a large share of that variance.
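The steps above can be sketched directly in SAS/IML. This is only an illustration under stated assumptions: it requires a SAS/IML licence, the 5 x 3 matrix X is made-up data, and keeping two components is an arbitrary choice for the example.

```sas
proc iml;
  /* Hypothetical data: 5 observations (rows), 3 features (columns) */
  X = {2 10 100,
       4 14 180,
       6 15 310,
       8 21 390,
       10 24 520};

  /* Step 1: standardise each column to mean 0 and variance 1 */
  Z = (X - mean(X)) / std(X);

  /* Step 2: d x d covariance matrix of the standardised features */
  S = cov(Z);

  /* Steps 3-4: eigendecomposition; for a symmetric matrix CALL EIGEN
     returns the eigenvalues already sorted in descending order */
  call eigen(evals, evecs, S);

  /* Step 5: proportion and cumulative proportion of variance explained */
  prop    = evals / sum(evals);
  cumprop = cusum(prop);

  /* Scores on the first two components (N = 2 is an arbitrary choice) */
  scores = Z * evecs[, 1:2];

  print evals prop cumprop, evecs, scores;
quit;
```

The cumprop column is the quantity usually inspected when deciding how many components to keep.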
data SocioEconomics;
input Population School Employment Services HouseValue;
datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
Factor 1 = 0.34 × Population + 0.45 × School + 0.40 × Employment + 0.55 × Services + 0.47 × HouseValue
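Coefficients like those in the Factor 1 equation are what PROC PRINCOMP reports in its Eigenvectors table. A minimal sketch, assuming the complete SocioEconomics data set has already been created (only its first rows are shown in these slides); the output data set name work.pc_scores and the choice N=2 are assumptions made for illustration.

```sas
/* By default PROC PRINCOMP analyses the correlation matrix, which is
   equivalent to standardising the variables first */
proc princomp data=SocioEconomics out=work.pc_scores n=2;
  var Population School Employment Services HouseValue;
  title "Principal components of the SocioEconomics data";
run;
```

The OUT= data set contains the original variables plus the component scores Prin1 and Prin2, and the Eigenvalues table lists the proportion and cumulative proportion of variance for each component.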
These slides are based on the book:
Introduction to SAS Essentials: Mastering SAS for Data Analytics, 2nd Edition
By Alan C. Elliott and Wayne A. Woodward
These slides are provided for you to use to teach SAS using this book. Feel free to
modify them for your own needs. Please send comments about errors in the slides
(or suggestions for improvements) to [email protected]. Thanks.