Lab 04 - Correlation and Regression

The document covers statistical tools in data mining, focusing on correlation analysis and regression techniques using SAS. It explains the correlation coefficient, PROC CORR for correlation analysis, and PROC REG for both simple and multiple linear regression. Additionally, it introduces Principal Component Analysis (PCA) and its application in data analysis, along with relevant SAS code examples.

Uploaded by

fwchu1111

Statistical tools in data mining

MSIM4311
Topic 4
CORRELATION ANALYSIS USING PROC CORR
• Correlation Analysis Basics
• The correlation coefficient measures the linear relationship between two quantitative variables measured on the same entity.
• The correlation $\rho$ is a unitless quantity ranging from -1 to +1, where $\rho = -1$ and $\rho = +1$ correspond to perfect negative and positive linear relationships, respectively, and $\rho = 0$ indicates no linear relationship.

2 SAS ESSENTIALS -- Elliott & Woodward

Pearson’s r
• The correlation coefficient $\rho$ is typically estimated from data using the Pearson correlation coefficient, usually denoted r.
• PROC CORR in SAS provides a test of the hypotheses $H_0: \rho = 0$ versus $H_a: \rho \neq 0$, designed to determine whether the estimated correlation coefficient, r, is significantly different from zero.
• The syntax for the PROC CORR procedure is:

PROC CORR <options>; <statements>;
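
As a reminder of what r and its significance test compute, the standard textbook definitions can be written as follows for n paired observations $(x_i, y_i)$. The notation here is an assumption added for illustration and does not come from the slides.

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad
t = r\sqrt{\frac{n-2}{1-r^2}} \;\sim\; t_{n-2}
\quad \text{under } H_0:\ \rho = 0 .
```

A small p-value for t is what leads to the conclusion that r differs significantly from zero.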

Common Options for PROC CORR

Option              Explanation
DATA=datasetname    Specifies which data set to use.
SPEARMAN            Requests Spearman rank correlations.
NOSIMPLE            Suppresses the display of descriptive statistics.
NOPROB              Suppresses the display of p-values.
PLOTS=              PLOTS=MATRIX requests a scatterplot matrix, and
                    PLOTS=SCATTER requests individual scatterplots.
OUTP=               Specifies an output data set containing Pearson
                    correlations.
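
Several of these options can be combined in a single PROC CORR statement. The sketch below assumes the SOMEDATA data set and the variables used elsewhere in these slides; CORROUT is a made-up output data set name. PEARSON is listed explicitly because, once SPEARMAN is requested, Pearson correlations are no longer displayed by default.

```sas
/* Sketch only: CORROUT is a hypothetical output data set name. */
PROC CORR DATA="C:\SASDATA\SOMEDATA" PEARSON SPEARMAN NOSIMPLE OUTP=CORROUT;
   VAR AGE TIME1 TIME2;
RUN;

/* Inspect the saved Pearson correlation matrix. */
PROC PRINT DATA=CORROUT;
RUN;
```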

Common Statements for PROC CORR

Statement                   Explanation
VAR variable(s);            All possible pairwise correlations are calculated
                            for the variables listed and displayed in a table.
WITH variable(s);           All possible correlations are obtained between the
                            variables in the VAR list and the variables in the
                            WITH list.
MODEL depvar=indvar(s);     Specifies dependent and independent variables for
                            the analysis. MODEL is a statement of regression
                            procedures such as PROC REG, not of PROC CORR;
                            more explanation follows in the regression section.
BY, FORMAT, LABEL, WHERE    Statements common to most procedures; they may be
                            used here.

PROC CORR Code for Correlations
PROC CORR DATA="C:\SASDATA\SOMEDATA";
   VAR AGE TIME1 TIME2;   /* Specifies which variables to include in the output correlation table. */
   TITLE "Example using PROC CORR";
RUN;

Output (partial) from this program.

In each cell, the top number is the correlation and the bottom number is the p-value testing the previously described hypothesis.

Producing a Matrix of Scatterplots

PROC CORR DATA="C:\SASDATA\SOMEDATA" PLOTS=MATRIX;
   VAR AGE TIME1 TIME2;
   TITLE 'Example using PROC CORR';
RUN;

PLOTS=MATRIX requests a matrix of scatterplots. Notice that this option appears before the first semicolon, that is, inside the PROC CORR statement.

Graphical Results of the PLOTS=MATRIX option

Change the option to PLOTS=MATRIX(HISTOGRAM)
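
A sketch of the modified step is shown here. Only the PLOTS= option differs from the earlier scatterplot-matrix example; the HISTOGRAM suboption adds a histogram of each variable along the diagonal of the matrix. The ODS GRAPHICS ON statement is an assumption for sessions where ODS Graphics is not already enabled.

```sas
ODS GRAPHICS ON;   /* assumed necessary only if ODS Graphics is not already on */

PROC CORR DATA="C:\SASDATA\SOMEDATA" PLOTS=MATRIX(HISTOGRAM);
   VAR AGE TIME1 TIME2;
   TITLE 'Example using PROC CORR';
RUN;
```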

Calculating Correlations Using the WITH Statement
PROC CORR DATA="C:\SASDATA\SOMEDATA";
   VAR TIME1-TIME4;
   WITH AGE;   /* The WITH statement limits the size of the correlation table. */
RUN;

Output using the WITH statement

SIMPLE LINEAR REGRESSION
• Simple linear regression is used to predict the value of a dependent variable from the value of an independent variable.
• The following SAS PROC REG code produces a simple linear regression equation:

PROC REG;
   MODEL FVC=ASB;   /* Requests an equation that predicts FVC from values of ASB. */
RUN;

• Note that the MODEL statement is used to tell SAS which variables to use in the analysis. The MODEL statement has the following form:

MODEL dependentvar = independentvar;

The Simple Linear Regression MODEL Statement
MODEL dependentvar = independentvar;

• This statement syntax identifies the dependent variable (dependentvar) as the measure you are trying to predict and the independent variable (independentvar) as your predictor.

The Simple Linear Regression Model
• The regression line is an estimate of a theoretical line describing the relationship between the independent variable (X) and the dependent variable (Y):

$$Y = \alpha + \beta x + \varepsilon$$

• where $\alpha$ is the y-intercept, $\beta$ is the slope, and $\varepsilon$ is an error term that is normally distributed with zero mean and constant variance. $\beta = 0$ indicates that there is no linear relationship between X and Y. A simple linear regression analysis is used to develop an equation for predicting the dependent variable given a value (x) of the independent variable. The regression line calculated by SAS is given by

$$\hat{Y} = a + bx$$

• where a and b are the least-squares estimates of $\alpha$ and $\beta$.
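
For reference, the least-squares estimates have the standard closed forms below for n observations $(x_i, y_i)$. The symbols $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ (sample means and standard deviations) and r (the Pearson correlation) are notation assumed here, not taken from the slides.

```latex
b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
  = r\,\frac{s_y}{s_x},
\qquad
a = \bar{y} - b\,\bar{x}.
```

The second form of b shows the link to the correlation section: the fitted slope is zero exactly when r is zero.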

Using SAS PROC REG for Simple Linear Regression
The general syntax for PROC REG is as follows:

PROC REG <options>; <statements>;

Common Options for PROC REG

Option           Explanation
DATA=dataname    Specifies which data set to use.
SIMPLE           Displays descriptive statistics.
CORR             Displays a correlation matrix for the variables listed
                 in the MODEL and VAR statements.
PLOTS=option     PLOTS=NONE suppresses graphs. Otherwise, several
                 diagnostic graphs are produced by default.
NOPRINT          Suppresses output when you want to capture results but
                 not display them.
ALPHA=p          Sets the significance level for confidence and
                 prediction intervals.

Common Statements for PROC REG

Statement                       Explanation
MODEL dependentvar =            Specifies the variable to be predicted (dependentvar)
  independentvar </ options>;   and the variable that is the predictor (independentvar).
OUTPUT OUT=dataname             Specifies output data set information. For example,
                                  MODEL Y=A1 B1;
                                  OUTPUT OUT=OUTREG P=YHAT R=YRESID;
                                creates the variables YHAT for predicted values (P=)
                                and YRESID for residual values (R=). Other handy
                                variables include LCL and UCL (confidence limits on
                                individual values) and LCLM and UCLM (confidence
                                limits on the mean).
PLOTS=option(s)                 Requests plots. Some options include COOKD, LCL, LCLM,
                                UCL, UCLM, and RESIDUALS. See the SAS documentation
                                for others.
BY, FORMAT, LABEL, WHERE        These statements are common to most procedures and
                                may be used here.
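
The options and statements above might be combined as in the following sketch. The DEMO data set and its values are invented purely so the step is self-contained; OUTREG, YHAT, YRESID, LOWMEAN, and UPMEAN are illustrative names, and TASK and CREATE are borrowed from the example that follows in these slides.

```sas
/* Hypothetical data, typed in only so the example can run on its own. */
DATA DEMO;
   INPUT CREATE TASK;
   DATALINES;
10 2.5
14 3.1
18 3.0
22 3.9
26 3.6
30 4.2
;
RUN;

PROC REG DATA=DEMO ALPHA=0.05 PLOTS=NONE;
   MODEL TASK=CREATE;
   /* P= predicted values, R= residuals,
      LCLM=/UCLM= lower and upper confidence limits on the mean */
   OUTPUT OUT=OUTREG P=YHAT R=YRESID LCLM=LOWMEAN UCLM=UPMEAN;
RUN;
QUIT;

/* Show the original variables alongside the new output variables. */
PROC PRINT DATA=OUTREG;
RUN;
```

Writing the results to a data set with OUTPUT keeps the predicted values and residuals available for later steps such as plotting or filtering.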

Simple Linear Regression Example

PROC REG;
   MODEL TASK=CREATE;   /* The MODEL statement defines the linear regression equation you are calculating. */
   TITLE "Example simple linear regression";
RUN;
QUIT;   /* A QUIT statement is recommended for PROC REG to end the analysis. */

Note: the final exam will ask what these numbers mean (you must be able to explain them)!

Selected Output from PROC REG

R-Squared is a measure of the strength of the association: the proportion of the variation in TASK that is explained by CREATE.

The regression equation from this analysis is

TASK = 2.16 + 0.0625*CREATE

The parameter estimates are the estimates of alpha (Intercept) and beta (slope/CREATE).
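
To illustrate how the fitted equation is used, take a CREATE score of 10. This value is chosen only for illustration and is not claimed to come from the data set. The predicted TASK score is then:

```latex
\widehat{\text{TASK}} = 2.16 + 0.0625 \times 10 = 2.16 + 0.625 = 2.785
```

The slope also has a direct reading: each additional point of CREATE raises the predicted TASK score by 0.0625.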

Graphical Results of Regression Analysis
The shaded area represents a 95% confidence interval for the average TASK score for a given CREATE score.

Diagnostic Plots for Linear Regression
In the Residual by Predicted Value plot (upper left), we want to see a random scatter of points above and below the 0 line, which is the case here. A nonrandom pattern of dots could indicate an inadequate model.

Diagnostic Plots for Linear Regression
The RStudent by Predicted Value plot indicates whether any Studentized residuals fall beyond two standard deviations, which would indicate unusual values. In this case, none fall outside the ±2 limits.

Diagnostic Plots for Linear Regression
The RStudent by Leverage plot attempts to locate observations that might have unusual influence (leverage) on the calculation of the regression coefficients. In this case, there is possibly one observation that has undue influence. We'll identify this observation later.
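
One possible way to flag such observations numerically, not necessarily the approach these slides take later, is to save the diagnostic statistics with an OUTPUT statement and then filter them. The DEMO2 data set and its values are invented for illustration, as are the names DIAG, RSTUD, LEVERAGE, and COOKSD.

```sas
/* Hypothetical data; the last point is placed far from the others on purpose. */
DATA DEMO2;
   INPUT CREATE TASK;
   DATALINES;
10 2.5
12 2.9
14 3.1
16 3.0
18 3.4
40 2.0
;
RUN;

PROC REG DATA=DEMO2 PLOTS=NONE;
   MODEL TASK=CREATE;
   /* RSTUDENT= studentized deleted residuals, H= leverage, COOKD= Cook's D */
   OUTPUT OUT=DIAG RSTUDENT=RSTUD H=LEVERAGE COOKD=COOKSD;
RUN;
QUIT;

/* List observations whose studentized residual falls outside the +/-2 limits. */
PROC PRINT DATA=DIAG;
   WHERE ABS(RSTUD) > 2;
RUN;
```

Printing DIAG without the WHERE statement shows the leverage and Cook's D values for every observation, which helps when deciding whether a flagged point also has undue influence.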