
Lab 04 - Correlation and Regression

The document covers statistical tools in data mining, focusing on correlation analysis and regression techniques using SAS. It explains the correlation coefficient, PROC CORR for correlation analysis, and PROC REG for both simple and multiple linear regression. Additionally, it introduces Principal Component Analysis (PCA) and its application in data analysis, along with relevant SAS code examples.

Uploaded by

fwchu1111

Statistical tools in data mining

MSIM4311
Topic 4
CORRELATION ANALYSIS USING PROC CORR
• Correlation Analysis Basics
• The correlation coefficient measures the linear relationship between two
quantitative variables measured on the same entity.
• The correlation $\rho$ is a unitless quantity ranging from -1 to +1, where $\rho = -1$ and
$\rho = +1$ correspond to perfect negative and positive linear relationships,
respectively, and $\rho = 0$ indicates no linear relationship.

2 SAS ESSENTIALS -- Elliott & Woodward


Pearson’s r
• The correlation coefficient $\rho$ is typically estimated from data using the Pearson
correlation coefficient, usually denoted r.
• PROC CORR in SAS provides a test of the hypotheses $H_0: \rho = 0$ versus $H_a: \rho \ne 0$, designed to
determine whether the estimated correlation coefficient, r, is significantly
different from zero.
• The syntax for the PROC CORR procedure is:

PROC CORR <options>; <statements>;
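For reference, the Pearson estimate and the usual test statistic can be written as follows. This is the standard textbook form, assuming n paired observations $(x_i, y_i)$; the symbols n, $\bar{x}$, $\bar{y}$ and t are notation introduced here, not taken from the slides.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}
\quad \text{under } H_0: \rho = 0 .
```

A small p-value for t leads to rejecting $H_0$, that is, to concluding that r differs significantly from zero.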



Common Options for PROC CORR
Option              Explanation
DATA=datasetname    Specifies which data set to use.
SPEARMAN            Requests Spearman rank correlations.
NOSIMPLE            Suppresses display of descriptive statistics.
NOPROB              Suppresses the display of p-values.
PLOTS=              PLOTS=MATRIX requests a scatterplot matrix and
                    PLOTS=SCATTER requests individual scatterplots.
OUTP=               Specifies an output data set containing Pearson
                    correlations.
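
As an illustration of these options, the sketch below combines several of them in one call. It assumes the SASHELP.CLASS sample data set that ships with SAS (not the course data), and WORK.PEARSON_OUT is a hypothetical output name chosen here.

```sas
/* Assumption: SASHELP.CLASS (Age, Height, Weight) stands in for the course data. */
proc corr data=sashelp.class pearson spearman nosimple outp=work.pearson_out;
  var age height weight;
run;

/* OUTP= stores the Pearson correlations as a data set that can be printed or reused */
proc print data=work.pearson_out;
run;
```

NOSIMPLE drops the descriptive statistics table, so the listing shows only the Pearson and Spearman correlation tables.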



Common Statements for PROC CORR
Statement                   Explanation
VAR variable list;          All possible pairwise correlations are calculated for
                            the variables listed and displayed in a table.
WITH variable(s);           All possible correlations are obtained between the
                            variables in the VAR list and variables in the WITH list.
MODEL depvar=indvar(s);     Not a PROC CORR statement. MODEL specifies dependent and
                            independent variables in regression procedures such as
                            PROC REG, covered later in this topic.
BY, FORMAT, LABEL, WHERE    Statements common to most procedures, and may be used here.



PROC CORR Code for Correlations
PROC CORR DATA="C:\SASDATA\SOMEDATA";
  VAR AGE TIME1 TIME2;   /* specifies which variables to include in the output correlation table */
  TITLE "Example using PROC CORR";
RUN;

Output (partial) from this program: in each cell the top number is the
correlation and the bottom is the p-value for testing the hypothesis that
the correlation is zero.



Producing a Matrix of Scatterplots

PROC CORR DATA="C:\SASDATA\SOMEDATA" PLOTS=MATRIX;
  VAR AGE TIME1 TIME2;
  TITLE 'Example using PROC CORR';
RUN;

PLOTS=MATRIX requests a matrix of scatterplots. Notice that this option is
part of the PROC CORR statement, so it appears before the first semicolon.



Graphical Results of the PLOTS=MATRIX option



Change the option to PLOTS=MATRIX(HISTOGRAM)
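
A minimal sketch of this variant follows. It assumes the SASHELP.CLASS sample data set that ships with SAS in place of the course data, and that ODS Graphics is available in the session.

```sas
ods graphics on;   /* statistical graphics must be enabled for PLOTS= to take effect */

/* Assumption: SASHELP.CLASS stands in for the course data. */
proc corr data=sashelp.class plots=matrix(histogram);
  var age height weight;
run;
```

With the HISTOGRAM suboption, the diagonal cells of the scatterplot matrix show a histogram of each variable.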



Calculating Correlations Using the WITH Statement
PROC CORR DATA="C:\SASDATA\SOMEDATA";
  VAR TIME1-TIME4;
  WITH AGE;   /* the WITH statement limits the size of the correlation table */
RUN;

Output using the WITH statement



SIMPLE LINEAR REGRESSION
• Simple linear regression is used to predict the value of a dependent variable
from the value of an independent variable.
• The following SAS PROC REG code produces a simple linear regression
equation:
PROC REG;
  MODEL FVC=ASB;   /* create an equation that predicts FVC from values of ASB */
RUN;

• Note that the MODEL statement is used to tell SAS which variables to use in
the analysis. The MODEL statement has the following form:
MODEL dependentvar = independentvar;



The Simple Linear Regression MODEL Statement
MODEL dependentvar = independentvar;

• This statement syntax indicates the dependent variable (dependentvar) as the
measure you are trying to predict and the independent variable
(independentvar) as your predictor.



The Simple Linear Regression Model
• The regression line is an estimate of a theoretical line describing the
relationship between the independent variable (X) and the dependent
variable (Y):
$$Y = \alpha + \beta x + \varepsilon$$
• where $\alpha$ is the y-intercept, $\beta$ is the slope, and $\varepsilon$ is an error term
that is normally distributed with zero mean and constant variance. $\beta = 0$
indicates that there is no linear relationship between X and Y. A simple linear
regression analysis is used to develop an equation for predicting the
dependent variable given a value (x) of the independent variable. The
regression line calculated by SAS is given by
$$\hat{Y} = a + bx$$
• where a and b are the least-squares estimates of $\alpha$ and $\beta$.
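
For completeness, the least-squares estimates have a closed form. The expressions below are the standard textbook formulas, assuming n observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$; this notation is introduced here and does not appear on the slides.

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x} .
```

These are the values of a and b that minimise the sum of squared residuals $\sum_{i=1}^{n} (y_i - a - b x_i)^2$.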



Using SAS PROC REG for Simple Linear Regression
The general syntax for PROC REG is as follows:
PROC REG <Options>; <Statements>;
Common Options for PROC REG
Option           Explanation
DATA=dataname    Specifies which data set to use.
SIMPLE           Displays descriptive statistics.
CORR             Displays a correlation matrix for variables listed in
                 the MODEL and VAR statements.
PLOTS=option     PLOTS=NONE suppresses graphs. Otherwise several
                 diagnostic graphs are produced by default.
NOPRINT          Suppresses output when you want to capture results
                 but not display them.
ALPHA=p          Sets significance levels for confidence and prediction
                 intervals.



Common Statements for PROC REG
Statement                       Explanation
MODEL dependentvar =            Specifies the variable to be predicted (dependentvar)
  independentvar </ options>;   and the variable that is the predictor (independentvar).
OUTPUT OUT=dataname             Specifies output data set information. For example,
                                  MODEL Y=A1 B1;
                                  OUTPUT OUT=OUTREG P=YHAT R=YRESID;
                                creates the variables YHAT for predicted values (P) and
                                YRESID for residual values (R). Other handy variables
                                include LCL and UCL (confidence limits on individual
                                values) and LCLM and UCLM (confidence limits on the mean).
PLOTS=option(s)                 Requests plots. Some options include COOKD, LCL,
                                LCLM, UCL, UCLM, RESIDUALS. See SAS documentation
                                for others.
BY, FORMAT, LABEL, WHERE        These statements are common to most procedures,
                                and may be used here.
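
The OUTPUT statement described above can be sketched as follows. The example assumes the SASHELP.CLASS sample data set that ships with SAS (not the course data); WORK.OUTREG and the output variable names are hypothetical names chosen here.

```sas
/* Assumption: SASHELP.CLASS stands in for the course data. */
proc reg data=sashelp.class plots=none;
  model weight = height;
  /* P= predicted, R= residual, LCL/UCL individual limits, LCLM/UCLM limits on the mean */
  output out=work.outreg p=yhat r=yresid
         lcl=lower_ind ucl=upper_ind lclm=lower_mean uclm=upper_mean;
run;
quit;

/* Inspect the first few rows of the captured results */
proc print data=work.outreg(obs=5);
  var name height weight yhat yresid lower_ind upper_ind lower_mean upper_mean;
run;
```

The limits on individual values are wider than the limits on the mean, because they also account for the scatter of single observations around the regression line.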



Simple Linear Regression Example

PROC REG;
  MODEL TASK=CREATE;   /* the MODEL statement defines the linear regression equation you are calculating */
  TITLE "Example simple linear regression";
RUN;
QUIT;   /* a QUIT statement is recommended for PROC REG to end the analysis */



The final exam will ask what these numbers mean (you need to explain them)!

Selected Output from PROC REG

R-Squared is a measure of the strength of the association.

The regression equation from this analysis is

TASK = 2.16 + 0.0625*CREATE

The parameter estimates are the estimates of alpha (Intercept) and beta (slope/CREATE).
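
As a quick check on how the fitted equation is used, consider a hypothetical CREATE score of 10 (a value chosen here for illustration, not taken from the slides):

```latex
\widehat{\text{TASK}} = 2.16 + 0.0625 \times 10 = 2.16 + 0.625 = 2.785
```

The slope means that each additional CREATE point raises the predicted TASK score by 0.0625, while the intercept 2.16 is the predicted TASK score when CREATE is 0.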



Graphical Results of Regression Analysis
The shaded area represents a 95% confidence interval for the average
TASK score for a given CREATE score.



Diagnostic Plots for Linear Regression
In the Residual by Predicted Value plot (upper left), we want to see a
random scatter of points above and below the 0 line, which is the case
here. A nonrandom pattern of dots could indicate an inadequate model.



Diagnostic Plots for Linear Regression
The RStudent by Predicted Value plot indicates whether any Studentized
residuals fall beyond two standard deviations, which would indicate
unusual values. In this case, none fall outside the ±2 limits.



Diagnostic Plots for Linear Regression
The RStudent by Leverage plot attempts to locate observations that might
have unusual influence (leverage) on the calculation of the regression
coefficients. In this case, there is possibly one observation that has
undue influence. We'll identify this observation later.



Diagnostic Plots for Linear Regression

In the Residual by Quantile plot, a tight and random scatter along the
diagonal line indicates an adequate fit to the model.



Diagnostic Plots for Linear Regression

The Dependent Variable (TASK) by Predicted Value plot visualizes
variability in the prediction, so if there is a pattern (e.g., variability
increases as the predicted value increases) it indicates a nonconstant
variance of the error.



Diagnostic Plots for Linear Regression

The Cook's D plot is designed to identify outliers or leverage points. In
this case, it appears that observations 5 and 6 are suspect.



Diagnostic Plots for Linear Regression

The Residuals by Percent plot assesses the normality of the residuals.



Diagnostic Plots for Linear Regression

The Proportion Less plot (spread plot) plots the proportion of the data by
the rank for two or more categories. If the vertical spread (based on
ranked data) is about the same, it means that there is about the same
variance in both the fitted and residual values.
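
The diagnostics discussed on the preceding slides can be requested explicitly and also captured as numbers. The sketch below assumes the SASHELP.CLASS sample data set that ships with SAS (not the TASK/CREATE data); WORK.DIAG and the output variable names are hypothetical names chosen here.

```sas
ods graphics on;

/* Assumption: SASHELP.CLASS stands in for the course data. */
proc reg data=sashelp.class plots(only)=(diagnostics cooksd);
  model weight = height;
  /* Save Cook's D, Studentized residuals and leverage for each observation */
  output out=work.diag cookd=cooks_d rstudent=rstud h=leverage;
run;
quit;

/* List observations whose Studentized residual falls outside the +/-2 limits */
proc print data=work.diag;
  where abs(rstud) > 2;
  var name height weight rstud cooks_d leverage;
run;
```

Printing the flagged rows is one way to identify which observations the plots mark as unusual, rather than reading observation numbers off the graph.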



MULTIPLE LINEAR REGRESSION USING PROC REG
• Multiple Linear Regression (MLR) is an extension of simple linear regression. In
MLR, there is a single dependent variable (Y) and more than one independent
variable ($X_i$). As with simple linear regression, the multiple regression equation
calculated by SAS is a sample-based version of a theoretical equation
describing the relationship between the k independent variables and the
dependent variable Y.

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$



Using SAS PROC REG for Multiple Linear Regression
• As mentioned previously, the general syntax for PROC REG is

PROC REG <Options>; <Statements>;



Additional Statement Options for the PROC REG MODEL statement (options
follow /), relevant to Multiple Linear Regression
Option              Explanation
P                   Requests a table containing predicted values from the model.
R                   Requests that the residuals be analyzed.
CLM                 Prints the 95 percent upper and lower confidence limits for
                    the mean (expected) value of the dependent variable.
CLI                 Requests the 95 percent upper and lower confidence limits
                    for an individual value.
INCLUDE=k           Includes the first k variables in the variable list in the
                    model (for automated selection procedures).
SELECTION=option    Specifies automated variable selection procedure:
                    BACKWARD, FORWARD, STEPWISE, etc.
SLSTAY=p            Specifies the maximum p-value for a variable to stay in a
                    model during automated model selection.
SLENTRY=p           Specifies the maximum p-value for a variable to enter a
                    model for forward or stepwise selection.
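
The MODEL options in the table above can be sketched as follows. The example assumes the SASHELP.CLASS sample data set that ships with SAS (not the course data), and the 0.15 cutoffs are illustrative choices made here, not values from the slides.

```sas
/* Assumption: SASHELP.CLASS stands in for the course data. */
proc reg data=sashelp.class plots=none;
  /* Options follow the slash: predicted values, residual analysis, CLM and CLI limits */
  model weight = height age / p r clm cli;
run;
quit;

proc reg data=sashelp.class plots=none;
  /* Stepwise selection: a variable enters if p < SLENTRY and stays if p < SLSTAY */
  model weight = height age / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;
```

Running the two steps separately keeps the full-model output apart from the selection summary, which makes the listings easier to compare.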
Principal Component Analysis
• Principal Component Analysis (PCA) is one of the most commonly used
unsupervised machine learning algorithms across various applications:
exploratory data analysis, dimensionality reduction, information compression,
data de-noising, and more.

• For a 2-dimensional dataset, the dataset can be presented as a scatterplot.

Principal Component Analysis
• Working through the eigendecomposition of the covariance matrix gives a deeper appreciation of PCA. There are several steps in computing PCA:

• Feature standardisation. We standardise each feature to have a mean of 0 and a variance of 1. Features with values on
different orders of magnitude prevent PCA from computing the best principal components.

• Covariance matrix computation. The covariance matrix is a square matrix, of d x d dimensions, where d stands for
“dimension” (or feature or column, if our data is tabular). It shows the pairwise covariance between features; for
standardised features this equals the pairwise correlation.

• Eigendecomposition of the covariance matrix. We calculate the eigenvectors (unit vectors) and their associated
eigenvalues (scalars by which we multiply the eigenvector) of the covariance matrix.

• Sort the eigenvectors from the highest eigenvalue to the lowest. The eigenvector with the highest eigenvalue is the first
principal component. Higher eigenvalues correspond to greater amounts of shared variance explained.

• Select the number of principal components. Select the top N eigenvectors (based on their eigenvalues) to become the N
principal components. The optimal number of principal components is both subjective and problem-dependent. Usually, we
look at the cumulative amount of shared variance explained by the combination of principal components and pick the number
of components which still significantly explain the shared variance.
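
The steps above can be summarised in matrix notation. This is a sketch in symbols introduced here (they do not appear on the slides): $x_{ij}$ is the value of feature j for observation i, $s_j$ is the sample standard deviation of feature j, Z is the n x d standardised data matrix, $\Sigma$ is its covariance matrix, and $\lambda_j$, $v_j$ are the eigenvalues and eigenvectors.

```latex
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad
\Sigma = \frac{1}{n-1}\, Z^{\top} Z, \\
\Sigma\, v_j &= \lambda_j\, v_j, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \\
\text{scores} &= Z \begin{bmatrix} v_1 & \cdots & v_N \end{bmatrix}, \qquad
\text{cumulative variance explained} = \frac{\sum_{j=1}^{N} \lambda_j}{\sum_{j=1}^{d} \lambda_j}.
\end{aligned}
```

Because each standardised feature has variance 1, the eigenvalues sum to d, so the denominator of the last ratio is simply the number of features.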

Principal Component Analysis
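data SocioEconomics;
  input Population School Employment Services HouseValue;
  datalines;
5700 12.8 2500 270 25000
1000 10.9 600 10 10000
;
run;

/* Only the first two data lines appear on the slide; the remaining observations are not shown. */
proc factor data=SocioEconomics;
run;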
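
Because the slide shows only part of the SocioEconomics data, the following runnable sketch swaps in the SASHELP.CLASS sample data set that ships with SAS, and uses PROC PRINCOMP rather than the PROC FACTOR call above. Both substitutions are assumptions made here for illustration; WORK.PC_SCORES is a hypothetical output name.

```sas
/* Assumption: SASHELP.CLASS stands in for the full SocioEconomics data. */
proc princomp data=sashelp.class out=work.pc_scores n=2;
  var age height weight;   /* the correlation matrix is analysed by default */
run;

/* OUT= adds the component scores Prin1 and Prin2 to the input data */
proc print data=work.pc_scores(obs=5);
  var name prin1 prin2;
run;
```

The PROC PRINCOMP listing reports the eigenvalues with their proportion and cumulative proportion of variance, which is the information used to choose the number of components.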

Principal Component Analysis

Population = 0.58 × Factor 1 + 0.80642 × Factor 2

Principal Component Analysis

Factor 1 = 0.34 × Population + 0.45 × School + 0.40 × Employment + 0.55 × Services + 0.47 × HouseValue

These slides are based on the book:

Introduction to SAS Essentials
Mastering SAS for Data Analytics, 2nd Edition
By Alan C. Elliott and Wayne A. Woodward
Paperback: 512 pages
Publisher: Wiley; 2nd edition (August 3, 2015)
Language: English
ISBN-10: 111904216X
ISBN-13: 978-1119042167

These slides are provided for you to use to teach SAS using this book. Feel free to
modify them for your own needs. Please send comments about errors in the slides
(or suggestions for improvements) to [email protected]. Thanks.



End of Topic 4
