Statistics 1 - Introduction To ANOVA, Regression, and Logistic Regression
Course Notes
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression Course Notes was developed by
Marc Huber and Danny Modlin. Additional contributions were made by Lee Bennett, Chris Daman, Tarek
Elnaccash, Bob Lucas, Diane K. Michelson, Mike Patetta, and Catherine Truxillo, and artwork by Stanley
Goldman. Editing and production support was provided by the Curriculum Development and Support
Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression Course Notes
Copyright © 2015 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E70544, course code LWST141/ST141, prepared date 18Nov2015. LWST141_002
ISBN 978-1-62960-132-8
For Your Information iii
Table of Contents
Course Description .................................................................................................................... viii
Prerequisites ................................................................................................................................ ix
Exercises.................................................................................................................. 5-17
Exercises.................................................................................................................. 7-56
E.1 Writing and Submitting SAS Programs in SAS Enterprise Guide.................................. E-3
Demonstration: Adding a SAS Program to a Project ..............................................E-11
Course Description
This introductory course is for SAS software users who perform statistical analyses using SAS/STAT
software. The focus is on t tests, ANOVA, and linear regression, and includes a brief introduction to
logistic regression. This course (or equivalent knowledge) is a prerequisite to many of the courses in the
statistical analysis curriculum.
A more advanced treatment of ANOVA and regression occurs in the &st241 course. A more advanced
treatment of logistic regression occurs in the &cdal41 course and the &pmlr41 course.
To learn more…
For information about other courses in the curriculum, contact the SAS
Education Division at 1-800-333-7660, or send e-mail to [email protected].
You can also find this information on the web at http://support.sas.com/training/
as well as in the Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
course notes, USA customers can contact the SAS Publishing Department
at 1-800-727-3228 or send e-mail to [email protected]. Customers outside
the USA, please contact your local SAS office.
Also, see the SAS Bookstore on the web at http://support.sas.com/publishing/
for a complete list of books and a convenient order form.
Prerequisites
Before attending this course, you should
– have completed the equivalent of an undergraduate course in statistics covering p-values, hypothesis
  testing, analysis of variance, and regression
– be able to execute SAS programs and create SAS data sets. You can gain this experience by completing
  the SAS® Programming 1: Essentials course.
Chapter 1 Course Overview and
Review of Concepts
1.1 Course Overview ..........................................................................................................1-3
Demonstration: Ames Home Sales Data Set Exploration .................................................... 1-10
Copyright © 2015, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
1.1 Course Overview 1-3
Objectives
– Give an overview of the models presented in the course.
– Compare inferential statistics with predictive modeling.
– Introduce the Ames Home Sales data set.
– Decide what tasks to complete before analyzing the data.
– Produce descriptive statistics for both categorical and interval-level variables.
This course deals with statistical modeling. The type of modeling depends on the level of measurement
of two types of variables.
The first type of variable is called Response. These are the variables that generally are the focus
of business or research. They are also known as outcome variables or target variables or (in designed
experiments) dependent variables.
The second type of variable is referred to as predictor variables. These are the measures that are
theoretically associated with the response variables. They can therefore be used to “predict” the value
of the response variables. They are also known as independent variables in analysis of data from designed
experiments.
Categorical data analysis is concerned with categorical responses, regardless of whether the predictor
variables are categorical or continuous. Categorical responses have a measurement scale consisting
of a set of categories.
Continuous data analysis is concerned with the analysis of continuous responses, regardless of whether
the predictor variables are categorical or continuous.
Overview of Models
General Linear Models
Y = b0 + b1X1 + … + bkXk + e
– Analysis of Variance (ANOVA)
– Regression
[Figure: an S-shaped curve of predicted probabilities (0.0 to 1.0) plotted against X, illustrating logistic regression]
Models in this course can be most generally categorized as Generalized Linear Models. In every case,
there is a response (or target) variable, which is the variable of interest, and the explanatory (or predictor)
variable(s), which are used to model and predict the level of the response variable.
When the response variable is continuous and you can assume a normal distribution of errors, you can
use a General Linear Model to model the relationship between predictor variables and response variables.
You perform ordinary least squares regression, analysis of variance (ANOVA), or analysis of covariance
(ANCOVA), depending on whether the explanatory variables are all continuous, all categorical,
or a combination of continuous and categorical, respectively.
When the response variable is categorical, the variable can be modeled indirectly, through the use
of a link function. When the response variable is binary (it can take on only two values), the link function
is typically the logit, and the analysis is called logistic regression, regardless of the level of measurement
of any of the explanatory variables. This type of modeling is described in greater detail in a later
chapter.
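As a concrete illustration of the logit link (a Python sketch, not the course's SAS code, and with hypothetical coefficient values), the linear predictor is passed through the inverse logit so that the modeled probability always lies between 0 and 1:

```python
import math

def inverse_logit(eta):
    """Map a linear predictor eta = b0 + b1*X1 + ... to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for illustration (not estimates from the Ames data)
b0, b1 = -2.0, 0.8
for x in (0, 2.5, 5):
    p = inverse_logit(b0 + b1 * x)
    print(f"X={x}: predicted probability = {p:.3f}")
```

Whatever value the linear predictor takes, the inverse logit maps it into the unit interval, which is what makes the indirect modeling of a binary response work.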
The defining feature of linear models is the linear function of the explanatory variables. The regression
coefficients are just numbers and they are multiplied by the explanatory variable values. These products
are then summed to get the individual’s predicted value.
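That sum of products is easy to express in code. A minimal Python sketch with hypothetical coefficients (not fitted values):

```python
def predict(b0, coefs, xs):
    """Linear predictor: intercept plus coefficient-times-value products, summed."""
    return b0 + sum(b * x for b, x in zip(coefs, xs))

# Hypothetical example: intercept 40, two explanatory variables
y_hat = predict(40.0, [2.0, -0.5], [10.0, 4.0])  # 40 + 2*10 + (-0.5)*4 = 58
print(y_hat)
```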
Explanatory Modeling typically refers to modeling using inferential statistical methods. These are
the classical statistical methods that most students learn in their first statistics courses. The focus
is on descriptions about the nature of relationships among variables and inference about pre-stated
hypotheses about the data and the relationships among variables. Looking at the model equations,
explanatory modeling focuses on the estimates of the beta coefficients. Those values say something about
the explanatory variables’ relationships with the response variable. Confidence limits and p-values are
analyzed closely to determine confidence about estimates and about decisions about the existence
of nonzero relationships between the predictor variables and the response variable. Distributional
assumptions are vital and the goal is finding the “true relationships” among variables. In explanatory
models, samples are usually small and there are few explanatory variables.
Predictive Modeling typically refers to methods for finding the most accurate predictions for future values
of the response or target variable. Sample sizes are usually quite large, rendering statistical hypothesis
testing virtually useless, as nearly every relationship appears statistically significant. The focus is not
so much on the parameters of the model as on the predictions of observations; in the model equations
shown above, these are represented on the left side of the equation. In predictive models, sample sizes are
typically very large and there are many more explanatory variables (often referred to as predictor
variables or inputs).
Assessment methods of the adequacy of a model differ between explanatory and predictive models.
Whereas the adequacy of explanatory models is usually assessed using classic statistical metrics, such
as p-values and confidence intervals of parameter estimates, the adequacy of predictive models is usually
assessed by comparing observed to predicted values on a holdout sample of data not used to create the
model.
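As a small illustration of holdout assessment (a Python sketch with invented numbers; RMSE is one common observed-versus-predicted summary, though it is not the only one):

```python
import math

# Toy holdout assessment: compare predictions to observed values on data
# that was NOT used to build the model.
observed  = [140.0, 132.0, 155.0, 128.0]   # holdout responses (hypothetical)
predicted = [138.5, 135.0, 150.0, 130.0]   # model predictions for those cases

rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))
print(f"holdout RMSE = {rmse:.3f}")
```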
The data for the instructional demonstrations in this course were collected by Dr. Dean DeCock,
of Truman State University in Kirksville, Missouri, USA. The full description of the data set is provided
in the Journal of Statistics Education. The data set contains information about the sale of individual
residential property in Ames, Iowa, from 2006 to 2010.
The data set STAT1.AmesHousing contains all original data from Dr. DeCock, including 2,930
observations and a large number of explanatory variables involved in assessing home values. In addition,
some summary variables were calculated, as well as a variable calculated as the natural log of the sale
price of the home.
The data set STAT1.AmesHousing2 is a subset of the full data set, for homes with normal sales
conditions (to avoid analyzing foreclosure or distressed sales) and gross living area of 1,500 square feet
or less (to focus on homes of modest size).
The data set STAT1.AmesHousing3 is a random sample of 300 houses, which will be used for all
of the demonstrations in the course.
The UNIVARIATE procedure not only computes descriptive statistics, but also provides greater detail
about the distributions of the variables.
Selected UNIVARIATE procedure statements:
VAR specifies numeric variables to analyze. If no VAR statement appears, then
all numeric variables in the data set are analyzed.
HISTOGRAM creates high-resolution histograms.
INSET places a box or table of summary statistics, called an inset, directly in a graph
created with a CDFPLOT, HISTOGRAM, PPPLOT, PROBPLOT, or
QQPLOT statement. The INSET statement must follow the PLOT statement
that creates the plot that you want to augment.
Change the location of homefolder to the folder where you have stored the program files for this
course.
Partial Log
1 /*st100d05.sas*/
3 %let homefolder=S:\Workshop;
4 %include "&homefolder\st100d01.sas";
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 494:19
NOTE: The data set STAT1.AMESHOUSING has 2930 observations and 98 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
NOTE: There were 1361 observations read from the data set STAT1.AMESHOUSING.
WHERE (Sale_Condition='Normal') and (Gr_Liv_Area<=1500);
NOTE: The data set STAT1.AMESHOUSING2 has 1361 observations and 30 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: The data set STAT1.AMESHOUSING3 has 300 observations and 30 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Style of dwelling
Cumulative Cumulative
House_Style Frequency Percent Frequency Percent
Two and one-half story: 2nd level unfinished 2 0.67 228 76.00
Two story 38 12.67 266 88.67
Split Foyer 13 4.33 279 93.00
Split Level 21 7.00 300 100.00
The categories with “2nd level unfinished” have too few members to analyze, so they will
be merged with One story and Two story in the variable House_Style2.
Overall_Qual and Overall_Cond have many levels with small frequencies. The variables will
be trichotomized into Below Average, Average, and Above Average, in the variables
Overall_Qual2 and Overall_Cond2.
The construction year has more values than is practical to treat as a categorical variable in a
statistical model with only 300 observations.
Number of fireplaces
Cumulative Cumulative
Fireplaces Frequency Percent Frequency Percent
0 195 65.00 195 65.00
1 93 31.00 288 96.00
2 12 4.00 300 100.00
Mo_Sold shows a clear trend toward sales in July and June. Due to small numbers in some
months, the variable Season_Sold was created and used for subsequent analyses. Season 1
is from month 12 to month 2; season 2 is from month 3 to month 5; season 3 is from month
6 to month 8; and season 4 is from month 9 to month 11.
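That month-to-season recoding can be written compactly. A Python illustration (the course itself uses SAS data step logic, which is not shown here):

```python
def season_sold(month):
    """Map month of sale to Season_Sold:
    1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov."""
    return month % 12 // 3 + 1

# December wraps around to season 1 along with January and February
print([(m, season_sold(m)) for m in range(1, 13)])
```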
The value ‘NA’ was used for houses that had no garages. There are three missing values.
Information contained in the variables, Year_Built, Mo_Sold, and Year_Sold was used to create
the variable Age_Sold, which is the age in years of the house when sold. It will be used as an interval
variable.
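One plausible way to combine those three variables (a hypothetical Python sketch; the course's exact formula is not shown here) is to let the month of sale contribute a fractional year:

```python
def age_sold(year_built, year_sold, mo_sold):
    """Age of the house at sale, in years; the month contributes fractionally.
    This is an assumed formula for illustration, not necessarily the course's."""
    return (year_sold + (mo_sold - 1) / 12) - year_built

# A house built in 1976 and sold in July 2008 would be about 32.5 years old
print(age_sold(1976, 2008, 7))
```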
/*st101d01.sas*/ /*Part C*/
/*PROC UNIVARIATE provides summary statistics and plots for */
/*interval variables. The ODS statement specifies that only */
/*the histogram be displayed. The INSET statement requests */
/*summary statistics without having to print out tables.*/
ods select histogram;
proc univariate data=STAT1.ameshousing3 noprint;
var &interval;
histogram &interval / normal kernel;
inset n mean std / position=ne;
title "Interval Variable Distribution Analysis";
run;
Sale price for the entire data set is skewed to the right. Log_Price is relatively normally distributed for
the entire data set. However, for houses 1,500 square feet or less in gross living area, Sale_Price itself is
relatively normally distributed. This can be seen by comparing the normal and kernel density curves.
They are relatively similar in shape.
The number of bedrooms above grade (Bedroom_AbvGr) is discrete with relatively few
observed values. It could be treated as a categorical (ordinal) variable in analysis.
Objective
Define some common terminology related
to hypothesis testing and confidence intervals.
In inferential statistics, the focus is on learning about populations. Examples of populations are all people
with a certain disease, all drivers with a certain level of insurance, or all customers, both current and
potential, at a bank.
Parameters are numerical characteristics of populations. They are generally unknown and must
be estimated through the use of samples. A sample is a group of measurements from a population. In
order for inferences to be valid, the sample should be representative of the population.
1.2 Quick Review of Statistical Concepts 1-25
A sample statistic is a measurement from a sample. You infer information about population parameters
through the use of sample statistics.
A point estimate is a single, best estimate of a population parameter.
Because sampling involves variability, parameter estimates have variability. Often, the variability
of sample statistics is approximately normal. Another name for the normal distribution is the Gaussian
distribution. The normal distribution is bell-shaped, symmetric, and defined by two parameters, μ
(the population mean) and σ (the population standard deviation). The mean locates the midpoint
of the distribution. The standard deviation describes its spread.
The formula for a normal distribution of x around a mean, μ, with standard deviation, σ, is

f(x; μ, σ) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)).
The standard normal curve has μ=0 and σ=1. The area under the curve between any two values can
be calculated. In statistics, think about probabilities related to the normal curve. Given the variability
around the center (the mean, or point estimate of the parameter), you can think about the probability
of sampling a value within some distance, zσ, from the mean. It is the area under the normal probability
density curve in an area ranging from –zσ to zσ. Some well-known values are shown in the slide.
Approximately 68% of the total area lies within 1 standard deviation of the mean. Approximately 95%
of the total area lies within 1.96 standard deviations of the mean. Approximately 99.7% of the area lies
within 3 standard deviations of the mean.
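You can verify these well-known areas with nothing more than the error function from the Python standard library, since the area within z standard deviations of the mean equals erf(z/√2):

```python
import math

def area_within(z):
    """Area under the standard normal curve between -z and +z standard deviations."""
    return math.erf(z / math.sqrt(2.0))

for z in (1.0, 1.96, 3.0):
    print(f"within {z} SD of the mean: {area_within(z):.4f}")
```

The printed areas are approximately 0.68, 0.95, and 0.997, matching the values stated above.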
In statistics, assumptions are often made about the distributions of parameter estimates. A common
assumption is that the sampling distribution of an estimate is normal. This does not necessarily mean that
the units of the population are normally distributed. Even though most statisticians take only one sample
and get one point estimate of each population parameter, it is useful if they can assume normality of that
estimate, because it makes calculations of confidence intervals and p-values relatively easy. The
variability of a parameter estimate is measured by its standard error.
The standard error of the mean is computed as follows:

s_x̄ = s / √n

where
s is the sample standard deviation.
n is the sample size.
The standard error of the mean is a measure of precision of the parameter estimate. The smaller
the standard error, the more precise your estimate.
You can improve the precision of an estimate (reduce the standard error) by increasing
the sample size.
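A quick numeric illustration in Python (the measurements are invented):

```python
import math
import statistics

sample = [98.2, 98.6, 98.4, 99.0, 98.1, 98.7, 98.3, 98.9]  # hypothetical measurements
s = statistics.stdev(sample)       # sample standard deviation
n = len(sample)
se = s / math.sqrt(n)              # standard error of the mean
print(f"s = {s:.4f}, n = {n}, SE = {se:.4f}")

# Quadrupling the sample size halves the standard error (same s assumed):
print(f"SE at 4n = {s / math.sqrt(4 * n):.4f}")
```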
Confidence Intervals
A 95% confidence interval represents a range of values within which you are 95% certain that
the true population mean exists.
– One interpretation is that if 100 different samples were drawn from the same population and 100
  intervals were calculated, approximately 95 of them would contain the population mean.
A confidence interval
– is a range of values that you believe is likely to contain the population parameter of interest
– is defined by an upper and lower bound around a parameter estimate.
To construct a confidence interval, a significance level must be chosen.
A 95% confidence interval is commonly used to assess the variability of the sample mean. In the Ames
housing sales example, you interpret a 95% confidence interval by stating that you are 95% confident that
the interval contains the mean sale price for your population of home sales.
You want to be as confident as possible, but remember that if you increase the confidence level
too much, the width of your interval increases beyond the point where it is informative. For
example, a 100% confidence interval would have confidence bounds of negative and positive
infinity.
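Under a normal approximation, a 95% confidence interval is the point estimate plus or minus 1.96 standard errors. A Python sketch with invented data:

```python
import math
import statistics

data = [132, 128, 141, 150, 126, 138, 135, 144, 129, 137]  # hypothetical sale prices ($1000s)
xbar = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))

z = 1.96                                 # ~95% coverage under normality
lower, upper = xbar - z * se, xbar + z * se
print(f"95% CI for the mean: ({lower:.1f}, {upper:.1f})")
```

Raising the confidence level (for example, to 99%) replaces 1.96 with a larger multiplier and widens the interval, consistent with the note above.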
H0: μ1 − μ2 = 0    Set α
H1: μ1 − μ2 ≠ 0
In inferential statistics, you infer information about population parameters through the use of statistics.
The inferences are not exact; as you have seen, parameter estimates have variability. You phrase
questions as tests of hypotheses about population parameters. The statement being tested is called the
null hypothesis. The answer is typically phrased as the probability of obtaining evidence as extreme as
that observed, calculated under the assumption that the null hypothesis is true. That probability is called
the p-value.
When the p-value is low, it provides doubt about the truth of the null hypothesis. How low does the p-
value need to be before you reject the null hypothesis completely? That depends on you. That threshold
that you choose is called the significance level of your test.
Coin Example
55 heads, 45 tails: p-value = .3682
40 heads, 60 tails: p-value = .0569
37 heads, 63 tails: p-value = .0120
15 heads, 85 tails: p-value < .0001
The effect size refers to the magnitude of the difference between the sampled population and the null
hypothesis. In this example, the null hypothesis of a fair coin implies 50% heads and 50% tails. If the
coin flipped were actually weighted to give 55% heads, the effect size would be 5%.
If you flip a coin 100 times and count the number of heads, you do not doubt that the coin is fair if you
observe exactly 50 heads. However, you might be
– somewhat skeptical that the coin is fair if you observe 40 or 60 heads
– even more skeptical if you observe 37 or 63 heads
– highly skeptical if you observe 15 or 85 heads.
In this situation, as the difference between the number of heads and tails increases, you have more
evidence that the coin is not fair.
A p-value measures the probability of observing a value as extreme as or more extreme than the one
observed, simply by chance, given that the null hypothesis is true. For example, if your null hypothesis
is that the coin is fair and you observe 40 heads (60 tails), the p-value is the probability of observing
a difference in the number of heads and tails of 20 or more from a fair coin tossed 100 times.
A large p-value means that you would often see a test statistic value this large in experiments with a fair
coin. A small p-value means that you would rarely see differences this large from a fair coin. In the latter
situation, you have evidence that the coin is not fair, because if the null hypothesis were true, a random
sample selected from it would not likely have the observed statistic values.
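These p-values follow from the binomial distribution under the null hypothesis of a fair coin. A Python sketch (one common convention: double the smaller tail) that reproduces the values shown in the slide:

```python
from math import comb

def two_sided_p(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (double the smaller tail)."""
    tail = min(heads, flips - heads)
    p_one_tail = sum(comb(flips, i) for i in range(tail + 1)) / 2 ** flips
    return min(1.0, 2 * p_one_tail)

for h in (55, 40, 37, 15):
    print(f"{h} heads in 100 flips: p-value = {two_sided_p(h, 100):.4f}")
```

The same function also reproduces the sample-size comparison on the next slide: two_sided_p(4, 10) ≈ .7539 and two_sided_p(16, 40) ≈ .2682.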
4 heads, 6 tails (10 flips): p-value = .7539
16 heads, 24 tails (40 flips): p-value = .2682
A p-value is affected not only by the effect size, but also by the sample size (the number of coin
flips, k).
For a fair coin, you would expect 50% of k flips to be heads. In this example, in each case, the observed
proportion of heads from k flips was 0.4. This value is different from the 0.5 you would expect under H0.
The evidence is stronger when the number of trials (k) on which the proportion is based increases. As you
saw in the section about confidence intervals, the variability around a mean estimate is smaller when the
sample size is larger. For larger sample sizes, you can measure proportions more precisely. Therefore,
40% heads out of 40 flips would make you more certain that this was not a chance difference from 50%
than would 40% out of 10 flips. The smaller p-value reflects this confidence. The p-value here assesses
the probability that this difference from 50% occurred purely by chance.
Objective
Perform a hypothesis test using the TTEST procedure.
Performing a t-Test
To test the null hypothesis H0: μ = μ0 against H1: μ ≠ μ0, SAS software calculates the value of Student's
t statistic:

t = (x̄ − μ0) / s_x̄

For the Ames home sales price example:

t = (137,525 − 135,000) / 2,172.1 = 1.16

The null hypothesis is rejected when the calculated value is more extreme (either positive or negative)
than would be expected by chance if H0 were true.
As mentioned in a previous section, when you do not know the true population standard deviation, σ,
you must estimate it from the sample. You must then also use Student's t distribution, rather than the
normal distribution, for calculating p-values and confidence limits. Student's t distribution approaches
the normal distribution as the sample size increases.
A one-sample t-test compares the mean calculated from a sample to a hypothesized mean. The null
hypothesis of the test is generally that the difference between the two means is zero.
1.3 One-Sample t-Tests 1-33
For the example, suppose that you would like to know whether the mean sale price for houses in Ames,
Iowa, is $135,000. μ0 is the hypothesized value of 135,000, x̄ is the sample mean of SalePrice,
and s_x̄ is the standard error of the mean.
The Student's t statistic measures how far x̄ is from the null hypothesized mean, in standard error units.
To reject the null hypothesis with this statistic, the t statistic should be much higher or lower than 0 and
have a small corresponding p-value.
The results of this test are valid if the distribution of sample means is normal.
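The slide's arithmetic, reproduced in Python (all numbers are taken from the slide itself):

```python
xbar = 137_525.0   # sample mean sale price
mu0 = 135_000.0    # null hypothesized mean
se = 2_172.1       # standard error of the mean (from the slide)

t = (xbar - mu0) / se
print(f"t = {t:.2f}")

# With 300 observations, the t critical value at alpha = 0.05 is close to the
# normal value 1.96; |t| falls short of it, so H0 is not rejected.
print("reject H0" if abs(t) > 1.96 else "fail to reject H0")
```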
[Figure: t distribution with a 2.5% rejection region in each tail; the observed t = 1.16 falls between the critical values]
For a two-sided test of a hypothesis, the rejection region is contained in both tails of the t distribution.
If the t statistic falls in the rejection region (in the shaded region in the graph above), then you reject
the null hypothesis. Otherwise, you fail to reject the null hypothesis.
The area in each of the tails corresponds to α/2 or 2.5%. The sum of the areas under the tails is 5%, which
is alpha.
The alpha and t distribution mentioned here are the same as those in the section about confidence
intervals. In fact, there is a direct relationship: the rejection region based on α begins at the point
where the (1 − α)·100% confidence interval ends.
The TTEST procedure performs t-tests and computes confidence limits for one sample, paired
observations, two independent samples, and the AB/BA crossover design. With ODS Statistical Graphics,
PROC TTEST can also be used to produce histograms, Quantile-Quantile plots, box plots, and confidence
limit plots.
Selected TTEST procedure statements:
CLASS specifies the two-level variable for the analysis. Only one variable is allowed
in the CLASS statement. If no CLASS statement is included, a one-sample t-test
is performed.
PAIRED PairLists; specifies the PairLists to identify the variables to be compared in paired
comparisons. You can use one or more PairLists.
VAR specifies numeric response variables for the analysis. If the VAR statement is not
specified, PROC TTEST analyzes all numeric variables in the input data set that
are not listed in a CLASS (or BY) statement.
A confidence interval plot is a visual display of the sample statistic value (of the mean, in this case)
and the confidence interval calculated from the data. If there is a null hypothesized value for the
parameter, it can be drawn on the plot as a reference line. In this way, the statistical significance of a test
can be assessed visually. If the (1 − α)·100% confidence interval does not include the null hypothesis
value, then the null hypothesis can be rejected at the α significance level. If the confidence interval
includes the null hypothesis value, then the null hypothesis cannot be rejected at that significance level.
Example: Use the TTEST procedure to test whether the mean of SalePrice is $135,000 in the data set
STAT1.AmesHousing3. The key statements are similar to the following (the H0= option sets the null
hypothesized value, and PLOTS(SHOWNULL)=INTERVAL requests a confidence interval plot with
the null value drawn as a reference line):
/*st101d02.sas*/
ods graphics;

proc ttest data=STAT1.AmesHousing3 h0=135000
           plots(shownull)=interval;
   var SalePrice;
run;
The mean value is $137,525, the associated t value is 1.16, and the p-value is 0.2460. Therefore, you fail
to reject the null hypothesis and conclude that the mean sale price of homes is not statistically different
from $135,000.
[Figure: "Distribution of SalePrice: With 95% Confidence Interval for Mean" — a histogram (percent scale) of SalePrice with normal and kernel density curves, with the 95% confidence interval and the null value marked below]
[Figure: "Mean of SalePrice: With 95% Confidence Interval" — a confidence interval plot of the mean]
The confidence interval plot shows the confidence interval around the mean estimate of sale price. Its
intersection with the $135,000 reference line shows that the mean value in the sample is not statistically
significantly different from $135,000 at an alpha level of 0.05.
The confidence bounds can be changed using an ALPHA= option in the PROC TTEST statement.
Set alpha equal to 1-confidence. For example, for a 99% confidence interval, specify
“ALPHA=0.01”.
[Figure: normal Q-Q plot of SalePrice — sale price in dollars (50,000 to 250,000) plotted against normal quantiles (−3 to 3)]
Neither the histogram nor the Q-Q plot shows extreme departures from normality. Therefore, the Student's
t test is valid.
1.03 Quiz
What is the null hypothesis for a one-sample t-test?
a. H0: μ = μ0
b. H0: μ0 = 0
c. H0: μ − μ0 = 0
d. H0: μ0 − μ0 = 0
Exercises
1. Performing a One-Sample t-Test
The data in STAT1.NormTemp come from an article in the Journal of Statistics Education by Dr.
Allen L. Shoemaker from the Psychology Department at Calvin College. The data are based on an
article in a 1992 edition of JAMA (Journal of the American Medical Association), which questions
the notion that the true mean body temperature is 98.6. There are 65 males and 65 females. There
is also some question about whether mean body temperatures for women are the same as for men.
The variables in the data set are as follows:
ID Identification number
BodyTemp Body temperature (degrees Fahrenheit)
Gender Coded (Male, Female)
HeartRate Heart rate (beats per minute)
a. Look at the distribution of the continuous variables in the data set using PROC UNIVARIATE,
including producing histograms and insets with means, standard deviations and sample size.
b. Perform a one-sample t-test to determine whether the mean of body temperatures (the variable
BodyTemp in STAT1.NormTemp) is 98.6. Produce a confidence interval plot of BodyTemp
with the value 98.6 used as a reference.
1) What is the value of the t statistic and the corresponding p-value?
2) Do you reject or fail to reject the null hypothesis at the 0.05 level that the average temperature
is 98.6 degrees?
1.4 Two-Sample t-Tests 1-41
Objectives
Use the TTEST procedure to analyze the differences
between two population means.
Verify the assumptions of a two-sample t-test.
H0: μ1 − μ2 = 0
Statistical Assumptions:
– independent observations
– normally distributed data in each group
– equal population variances
In a one-sample t-test, the sample’s mean is compared against some hypothesized mean value. For
example, in the previous example, the mean sale price was compared against $135,000. So, the null
hypothesis is H0: μ = 135,000.
If you want to compare the means of two different groups, you can specify the hypothesis in either
of two ways: μ1 = μ2, or μ1 − μ2 = 0.
Before you start the analysis, examine the data to verify that the statistical assumptions are valid.
The assumption of independent observations means that no observations provide any information about
any other observation that you collect. For example, measurements are not repeated on the same subject.
This assumption can be verified during the design stage.
The assumption of normality is satisfied if the data are approximately normally distributed, and it can
be relaxed if enough data are collected, by the central limit theorem. For small samples, this assumption
can be verified by examining plots of the data.
There are several tests for equal variances. If this assumption is not valid, an approximate t-test can
be performed.
If these assumptions are not valid and no adjustments are made, the probability of drawing incorrect
conclusions from the analysis could increase.
H0: σ1² = σ2²    H1: σ1² ≠ σ2²
F = max(s1², s2²) / min(s1², s2²)
To evaluate the assumption of equal variances in each group, you can use the Folded F test for equality
of variances. The null hypothesis for this test is that the variances are equal. The F value is calculated
as a ratio of the greater of the two variances divided by the lesser of the two. Thus, if the null hypothesis
is true, F tends to be close to 1.0 and the p-value for F is statistically nonsignificant (p>0.05).
If you reject the null hypothesis, it is recommended that you use the unequal variance t-test in the
PROC TTEST output for testing the equality of group means.
This test is valid only for independent samples from normal distributions. Normality is required
even for large sample sizes. If your data are not normally distributed, you can use Levene’s test
or the Brown-Forsythe test for homogeneity of variances. These are available as options
HOVTEST=LEVENE and HOVTEST=BF in the MEANS statement in the GLM procedure.
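The folded F computation itself is simple enough to sketch outside of SAS. The following Python example (with made-up samples, not the course data) forms the ratio of the larger to the smaller sample variance and doubles the upper-tail F probability, mirroring what PROC TTEST reports:

```python
import statistics
from scipy import stats

# Two hypothetical samples (made-up numbers, not the course data)
g1 = [23.0, 25.1, 22.8, 26.4, 24.9, 23.7, 25.5, 24.2]
g2 = [30.2, 28.7, 31.9, 29.4, 33.1, 27.8, 30.6, 29.9]

# Sample variances (the usual unbiased, n-1 denominator estimators)
v1 = statistics.variance(g1)
v2 = statistics.variance(g2)

# Folded F: larger variance over smaller, so F >= 1 by construction
F = max(v1, v2) / min(v1, v2)
df_num = (len(g1) if v1 >= v2 else len(g2)) - 1
df_den = (len(g2) if v1 >= v2 else len(g1)) - 1

# Two-sided p-value: twice the upper-tail area of the F distribution
# (capped at 1 in practice when F is very close to 1)
p = 2 * stats.f.sf(F, df_num, df_den)
print(round(F, 3), round(p, 4))
```

Here the p-value is well above 0.05, so these two made-up samples give no evidence against equal variances.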
Check the assumption of equal variances and then use the appropriate test for equal means. Because
the p-value of the test F statistic is 0.7446, there is not enough evidence to reject the null hypothesis
of equal variances.
Therefore, use the equal variance t-test line in the output to test whether the means of the two
populations are equal.
The null hypothesis that the group means are equal is rejected at the 0.05 level. You conclude that there
is a difference between the means of the groups.
The equality of variances F test is found at the bottom of the PROC TTEST output.
Again, check the assumption of equal variances and then use the appropriate test for equal means.
Because the p-value of the test F statistic is less than alpha=0.05, there is enough evidence to reject
the null hypothesis of equal variances.
Therefore, use the unequal variance t-test line in the output to test whether the means of the two
populations are equal.
The null hypothesis that the group means are equal is rejected at the 0.05 level.
If you choose the equal variance t-test, you would not reject the null hypothesis at the 0.05 level.
This shows the importance of choosing the appropriate t-test.
Two-Sample t-Test
Example: Use the TTEST procedure to test whether the mean of SalePrice is the same for homes with
masonry veneer and those without.
/*st101d03.sas*/
ods graphics;
proc ttest data=STAT1.AmesHousing3 plots(shownull)=interval;
   class Masonry_Veneer;
   var SalePrice;
run;
Figure: histograms of SalePrice for homes without (N) and with (Y) masonry veneer (the
Masonry_Veneer variable), with normal and kernel density overlays.
Figure: side-by-side normal Q-Q plots of SalePrice (sale price in dollars) for the two Masonry_Veneer
groups.
The Q-Q plots seem to indicate that the data from each group approximate a normal distribution.
There seems to be one potential outlier in each group at the upper end of the distribution.
If assumptions are not met, you can do an equivalent nonparametric test, which does not make
distributional assumptions. PROC NPAR1WAY is one procedure for performing this type of test.
It is described in an appendix.
The statistical tables for the TTEST procedure are displayed below.
Masonry_Veneer N Mean Std Dev Std Err Minimum Maximum
N 209 130172 37531.7 2596.1 35000.0 290000
Y 89 154705 32239.8 3417.4 75000.0 245000
Diff (1-2) -24533.0 36039.6 4561.6
Masonry_Veneer Method Mean 95% CL Mean Std Dev 95% CL Std Dev
N 130172 125054 135290 37531.7 34245.4 41521.0
Y 154705 147914 161496 32239.8 28099.7 37821.9
Diff (1-2) Pooled -24533.0 -33510.3 -15555.6 36039.6 33355.6 39197.1
Diff (1-2) Satterthwaite -24533.0 -32997.9 -16068.0
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 208 88 1.36 0.1039
In the Statistics table, examine the descriptive statistics for each group and their differences.
Look at the Equality of Variances table that appears at the bottom of the output. The F test for equal
variances has a p-value of 0.1039. Because this value is greater than the alpha level of 0.05, do not
reject the null hypothesis of equal variances. (This is equivalent to saying that there is insufficient
evidence to indicate that the variances are not equal.)
Based on the F test for equal variances, you then look in the t-Tests table at the t-test for the
hypothesis of equal means. Using the equal variance (Pooled) t-test, you reject the null hypothesis
that the group means are equal. The mean difference between no masonry veneer and masonry
veneer is −$24,533. Because the p-value is less than 0.05 (Pr > |t| < .0001), you conclude that there is a
statistically significant difference in the sale price between houses with the two types of veneer.
The 95% confidence interval for the mean difference (-33510.3, -15555.6) does not include
0. This also implies statistical significance at the 0.05 alpha level.
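As a cross-check, the pooled statistics can be reproduced from the group sizes, means, and standard deviations in the summary table alone. A Python sketch (values copied from the output above; small rounding differences are expected):

```python
import math

# Summary statistics from the PROC TTEST output
n1, mean1, sd1 = 209, 130172.0, 37531.7   # no masonry veneer (N)
n2, mean2, sd2 = 89, 154705.0, 32239.8    # masonry veneer (Y)

# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)

# Standard error of the difference and the pooled t statistic
se = sp * math.sqrt(1 / n1 + 1 / n2)
t = (mean1 - mean2) / se

print(round(sp, 1), round(se, 1), round(t, 2))
```

The recomputed pooled standard deviation (about 36,040) and standard error (about 4,562) match the Diff (1-2) row of the table, and the t statistic of about −5.38 is consistent with the reported p-value below .0001.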
Confidence intervals are shown in the output object titled Mean of SalePrice Difference (N – Y). This plot
reflects the values from the confidence interval for the mean differences.
Exercises
1.5 Solutions
Solutions to Exercises
1. Performing a One-Sample t-Test
a. Look at the distribution of the continuous variables in the data set using PROC UNIVARIATE,
including producing histograms and insets with means, standard deviations and sample size.
/*st101s01.sas*/ /*Part A*/
%let interval=BodyTemp HeartRate;
ods graphics;
ods select histogram;
proc univariate data=STAT1.NormTemp noprint;
var &interval;
histogram &interval / normal kernel;
inset n mean std / position=ne;
title "Interval Variable Distribution Analysis";
run;
b. Perform a one-sample t-test to determine whether the mean of body temperatures (the variable
BodyTemp in STAT1.NormTemp) is 98.6. Produce a confidence interval plot of BodyTemp
with the value 98.6 used as a reference.
/*st101s01.sas*/ /*Part B*/
proc ttest data=STAT1.NormTemp h0=98.6
plots(only shownull)=interval;
var BodyTemp;
title 'Testing Whether the Mean Body Temperature=98.6';
run;
title;
Partial Output
Testing Whether the Mean Body Temperature=98.6
The TTEST Procedure
Variable: BodyTemp
Mean     95% CL Mean        Std Dev   95% CL Std Dev
98.2492  98.1220  98.3765   0.7332    0.6536  0.8350
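For reference, the t statistic that the exercise asks for can be recovered from the summary statistics shown above, assuming n = 130 (65 males plus 65 females, as stated in the exercise). A Python sketch of the arithmetic:

```python
import math

# Summary statistics from the PROC TTEST output above
n = 130            # 65 males + 65 females
xbar = 98.2492     # sample mean body temperature
s = 0.7332         # sample standard deviation
mu0 = 98.6         # hypothesized mean

# One-sample t statistic: (xbar - mu0) / (s / sqrt(n))
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))
```

The resulting t of about −5.45 corresponds to a p-value far below 0.05, so the null hypothesis that the mean body temperature is 98.6 is rejected.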
Figure: Mean of BodyTemp with 95% confidence interval, with the null value 98.6 shown as a
reference line.
Figures: Distribution of Change (histograms with normal and kernel density overlays and comparative
box plots for the Control and Treatment groups) and normal Q-Q plots of Change for each group.
Because the p-value for the Equality of Variances test is greater than the alpha level of 0.05,
you would not reject the null hypothesis. This conclusion supports the assumption of equal
variance (the null hypothesis being tested here).
c. Does the new teaching technique seem to result in significantly different change scores compared
with the standard technique?
Group N Mean Std Dev Std Err Minimum Maximum
Control 13 6.9677 8.6166 2.3898 -6.2400 19.4100
Treatment 15 11.3587 14.8535 3.8352 -17.3300 32.9200
Diff (1-2) -4.3910 12.3720 4.6882
Group Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Control 6.9677 1.7607 12.1747 8.6166 6.1789 14.2238
Treatment 11.3587 3.1331 19.5843 14.8535 10.8747 23.4255
Diff (1-2) Pooled -4.3910 -14.0276 5.2457 12.3720 9.7432 16.9550
Diff (1-2) Satterthwaite -4.3910 -13.7401 4.9581
The p-value for the Pooled (Equal Variance) test for the difference between the two means
shows that the two groups are not statistically significantly different. Therefore, there is not
strong enough evidence to say conclusively that the new teaching technique is different from
the old. The Difference Interval plot displays these conclusions graphically.
Figure: interval plot of the mean difference in Change (Control minus Treatment), showing the Pooled
and Satterthwaite 95% confidence intervals.
The confidence interval includes the value zero, indicating a lack of statistical significance
of the mean difference.
Chapter 2 ANOVA and Regression
2.1 Graphical Analysis 2-3
Objectives
Explain what an association is.
Graphically explore associations
in the AmesHousing3 data set.
Associations
An association exists between two variables when
the expected value of one variable differs at different
levels of the other variable.
A linear association between two continuous variables
can be inferred when the general shape of a scatter
plot of the two variables is a straight line.
ANOVA and linear regression test linear associations between predictor and response variables. In linear
regression models, the predictor variable is continuous. In ANOVA, the predictor variable is categorical.
Typically, the categorical predictor is converted into binary dummy variables for purposes of model
calculations. The following slides illustrate associations in ANOVA and regression.
Scatter Plots
Scatter plots are two-dimensional graphs produced by plotting one variable against another within a set
of coordinate axes. The coordinates of each point correspond to the values of the two variables.
Scatter plots are useful to accomplish the following:
explore the relationships between two variables
locate outlying or unusual values
identify possible trends
identify a basic range of Y and X values
communicate data analysis results
The predicted value can be thought of as the best estimate of the value of the response at a given value
of the predictor variable. Scatter plots show graphically the relationship between predictor variables
and response variables. Traditionally, predictor variables are plotted on the x axis and response variables
are plotted on the y-axis. A preliminary analysis of associations involves discovery of the presence
of associations and their nature.
Describing the relationship between two continuous variables is an important first step in any statistical
analysis. The scatter plot is the most important tool that you have in describing these relationships.
The diagrams above illustrate some possible relationships.
1. A straight line describes the relationship.
2. Curvature is present in the relationship.
3. There could be a cyclical pattern in the relationship. You might see this when the predictor is time.
4. There is no clear relationship between the variables.
If the value of x is unknown, then the best prediction of y would be the mean of y. The question becomes
whether knowing x affects the best prediction of y. In regression, “best” is defined as the model (in this
case, the regression line) that minimizes the sum of the squared differences between all actual y values
and the corresponding predicted y values. If there is no association between x and y, then the best
prediction of y will remain the mean value of y, even when x is known. The regression line will be
horizontal at the value of the mean of y.
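The claim that the mean is the best constant prediction under squared-error loss can be illustrated numerically. A minimal Python sketch with made-up values:

```python
# Made-up response values (no predictor information used)
y = [3.0, 5.0, 4.0, 7.0, 6.0]
mean_y = sum(y) / len(y)   # 5.0

def sse(pred):
    """Sum of squared differences between actual values and one constant prediction."""
    return sum((v - pred) ** 2 for v in y)

# The mean beats every other constant guess under squared-error loss
for guess in [3.0, 4.0, 4.9, 5.1, 6.0]:
    assert sse(mean_y) <= sse(guess)
print(mean_y, sse(mean_y))
```

This is exactly why, with no useful predictor, the fitted regression line is horizontal at the mean of y.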
If there is an association between x and y, then the best prediction of y will depend on the value of x.
The regression line will not be horizontal.
Similarly, when the predictor variable is categorical, lack of association with the response variable means
that the best prediction of y is the mean of y, regardless of the value of x. Once again, a horizontal line
through the means of all levels of the categorical predictor will indicate lack of association and therefore
equal means.
Where there is an association between the categorical predictor and the continuous response, a plot will
show a non-horizontal line connecting the category-specific means of y. The category-specific means will
be better predictions than the overall mean of y. In other words, the average squared differences between
the actual y and the predicted y will be smaller when the group-specific mean is the predicted value of y
than when the overall mean is the predicted value of y.
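This comparison can be made concrete with a small numeric sketch (made-up values, not the course data): predicting each observation with its group mean yields a smaller sum of squared errors than predicting with the overall mean.

```python
# Hypothetical two-group data (made-up numbers)
groups = {"low": [10.0, 12.0, 14.0], "high": [20.0, 22.0, 24.0]}
all_y = [v for vals in groups.values() for v in vals]
overall_mean = sum(all_y) / len(all_y)   # 17.0

# Squared error when the overall mean is the prediction for every point
sse_overall = sum((v - overall_mean) ** 2 for v in all_y)

# Squared error when each group's own mean is the prediction
sse_group = 0.0
for vals in groups.values():
    gmean = sum(vals) / len(vals)
    sse_group += sum((v - gmean) ** 2 for v in vals)

print(sse_overall, sse_group)   # group-specific means fit much better
```

The larger the gap between the two totals, the stronger the association between the grouping variable and the response.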
Exploring Associations
Example: Create scatter plots to show relationships between continuous predictors and SalePrice
and comparative box plots to show relationships between categorical predictors and
SalePrice.
/*st102d01.sas*/ /*Part A*/
proc sgscatter data=STAT1.ameshousing3;
plot SalePrice*Gr_Liv_Area / reg;
title "Associations of Above Grade Living Area with Sale Price";
run;
Selected PLOT statement option:
REG Adds a regression fit to the scatter plot.
Figure: scatter plot of SalePrice (sale price in dollars) against Gr_Liv_Area, with a regression fit line.
There does seem to be a nonzero association between above grade living area and sale price.
There seems to be more variability in sale price at higher living area values. This is called
heteroscedasticity. This topic is discussed in detail in the Statistics 2: ANOVA and Regression
course.
options nolabel;
proc sgscatter data=STAT1.ameshousing3;
plot SalePrice*(&interval) / reg;
title "Associations of Interval Variables with Sale Price";
run;
OPTIONS statement option:
NOLABEL Suppresses the display of variable labels in procedure output; variable names are used instead.
Figure: panel of scatter plots of SalePrice against each interval predictor (for example,
Bedroom_AbvGr and Total_Bathroom), each with a regression fit line.
There seems to be some association between each of the predictor variables and SalePrice.
/*st102d01.sas*/ /*Part C*/
proc sgplot data=STAT1.ameshousing3;
vbox SalePrice / category=Central_Air
connect=mean;
title "Sale Price Differences across Central Air";
run;
PROC SGPLOT statement:
VBOX Creates a vertical box plot that shows the distribution of your data.
Figure: comparative box plots of SalePrice for Central_Air=N and Central_Air=Y, with a line
connecting the group means.
Houses with central air sell on average at higher prices than houses without central air.
Therefore, there is a nonzero association between Central_Air and SalePrice.
%macro box(dsn=, response=, charvar=);
   %let i = 1;
   %let var = %scan(&charvar, &i, %str( ));
   %do %while(&var ne );
      /* PROC SGPLOT step shown below */
      %let i = %eval(&i + 1);
      %let var = %scan(&charvar, &i, %str( ));
   %end;
%mend box;
The main part of the macro is the following code:
proc sgplot data=&dsn;
vbox &response / category=&var
grouporder=ascending
connect=mean;
title "&response across Levels of &var";
run;
This is similar to the code run in the previous example, except that macro variables supply the category,
data set name, and response variable. The &charvar macro variable is read by SAS as a string, with
spaces separating the members of the list of variables. The following statement creates a macro variable
called &var, which is set to the ith member of the &charvar string:
%let var = %scan(&charvar,&i,%str( ));
The members are defined to be separated by a space using %str( ), the third argument to the %SCAN
function. The second argument, &i, indexes the position of the member within the string. The %DO
%WHILE loop uses the %SCAN function to check whether there are any more members in the macro
variable &charvar.
The macro is called using the following code:
%box(dsn = STAT1.ameshousing3,
response = SalePrice,
charvar = &categorical);
The output is not displayed here.
Objectives
Use the GLM procedure to analyze the differences
between population means.
Verify the assumptions of analysis of variance.
2.2 One-Way ANOVA 2-13
Overview
Are there any differences among the population means?
With a continuous response variable and a categorical predictor, the appropriate analysis is
one-way ANOVA.
Another way of asking: Does information about group
membership help predict the level of a numeric response?
Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more
groups of observations or treatments. For this type of problem, you have the following:
a continuous dependent variable, or response variable
a discrete independent variable, also called a predictor or explanatory variable.
If you analyze the difference between two means using ANOVA, you reach the same conclusions as you
reach using a pooled, two-group t-test. Performing a two-group mean comparison in PROC GLM gives
you access to graphical and assessment tools different from those available in performing the same
comparison with PROC TTEST.
Figure: three treatment groups to compare (Placebo, Treatment 1, and Treatment 2).
When there are three or more levels for the grouping variable, a simple approach is to run a series
of t-tests between all the pairs of levels. For example, you might be interested in T-cell counts in patients
taking three medications (including one placebo). You could simply run a t-test for each pair of
medications. A more powerful approach is to analyze all the data simultaneously. The mathematical
model is called a one-way analysis of variance (ANOVA), and the test statistic used is the F ratio, rather
than the Student’s t value.
H0: μ1 = μ2 = μ3 = μ4
H1: at least one mean differs from another, that is, μ1 ≠ μ2, μ1 ≠ μ3, μ1 ≠ μ4,
μ2 ≠ μ3, μ2 ≠ μ4, or μ3 ≠ μ4
Small differences between sample means are usually present. The objective is to determine whether these
differences are statistically significant. In other words, is the difference greater than what might be
expected to occur by chance?
Total Variability = Variability between Groups + Variability within Groups
In ANOVA, the Total Variation (as measured by the corrected total sum of squares) is partitioned into two
components, the Between Group Variation (displayed in the ANOVA table as the Model Sum of Squares)
and the Within Group Variation (displayed as the Error Sum of Squares). As its name implies, ANalysis
Of VAriance analyzes, or breaks apart, the variance of the dependent variable to determine whether
the between-group variation is a significant portion of the total variation. ANOVA compares the portion
of variation in the response variable attributable to the grouping variable to the portion of variability that
is unexplained. The test statistic, the F Ratio, is a ratio of the model variance to the error variance. The
calculations are shown below.
Total Variation   the overall variability in the response variable. It is calculated as the sum
of the squared differences between each observed value and the overall mean,
Σ(Yij − Ȳ)². This measure is also referred to as the Total Sum of Squares (SST).
Between Group Variation   the variability explained by the independent variable and therefore
represented by the between-treatment sum of squares. It is calculated as the
weighted (by group size) sum of the squared differences between the mean
for each group and the overall mean, Σ ni(Ȳi − Ȳ)². This measure is also
referred to as the Model Sum of Squares (SSM).
Within Group Variation   the variability within the treatments not explained by the independent
variable, represented by the within-treatment sum of squares. It is calculated
as the sum of the squared differences between each observed value and its
group mean, Σ(Yij − Ȳi)². This measure is also referred to as the Error Sum of
Squares (SSE).
SST = SSM + SSE, meaning that the model sum of squares and the error sum of squares sum
to the total sum of squares.
Sums of Squares
A simple example of the various sums of squares is shown in this set of slides. First, the overall mean
of all data values is calculated.
For example, with an overall mean of 6, the observations 7 and 3 contribute (7 − 6)² and (3 − 6)²
to the total sum of squares.
The total sum of squares, SST, is a measure of the total variability in a response variable. It is calculated
by summing the squared distances from each point to the overall mean. Because it is correcting for
the mean, this sum is sometimes called the corrected total sum of squares.
For example, with group means ȲA = 4 and ȲB = 8, the observations 5 and 7 contribute (5 − 4)²
and (7 − 8)² to the error sum of squares.
The error sum of squares, SSE, measures the random variability within groups; it is the sum of the squared
deviations between observations in each group and that group’s mean. This is often referred to as the
unexplained variation or within-group variation.
For example, with group means ȲA = 4 and ȲB = 8 and an overall mean of 6, the groups contribute
(4 − 6)² and (8 − 6)², each weighted by group size, to the model sum of squares.
The model sum of squares, SSM, measures the variability between groups; it is the sum of the squared
deviations between each group mean and the overall mean, weighted by the number of observations
in each group. This is often referred to as the explained variation. The model sum of squares can also
be calculated by subtracting the error sum of squares from the total sum of squares: SSM = SST − SSE.
In this example, the model explains approximately 85.7%, ((SSM / SST)*100)%, of the variability
in the response. The other 14.3% represents unexplained variability, or process variation. In other words,
the variability due to differences between the groups (the explained variability) makes up a larger
proportion of the total variability than the random error within the groups (the unexplained variability).
The total sum of squares (SST) refers to the overall variability in the response variable. The SST is
computed under the null hypothesis (that the group means are all the same). The error sum of squares
(SSE) refers to the variability within the treatments not explained by the independent variable. The SSE
is computed under the alternative hypothesis (that the model includes nonzero effects). The model sum
of squares (SSM) refers to the variability between the treatments explained by the independent variable.
The basic measures of variation under the two hypotheses are transformed into a ratio of the model and
the error variances, which has a known distribution (Snedecor’s F distribution) under the null hypothesis
that all group means are equal. The F ratio can be used to compute a p-value.
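The partition can be verified numerically. The following Python sketch uses a tiny two-group data set consistent with the slide fragments above (group means 4 and 8, overall mean 6, three observations per group) and reproduces the 85.7% figure:

```python
# Two small groups consistent with the slide example: means 4 and 8, overall mean 6
group_a = [3.0, 4.0, 5.0]
group_b = [7.0, 8.0, 9.0]
all_y = group_a + group_b
grand = sum(all_y) / len(all_y)                      # 6.0

# SST: squared distances from every point to the overall mean
sst = sum((v - grand) ** 2 for v in all_y)

# SSE: squared distances from each point to its own group mean
mean_a = sum(group_a) / len(group_a)                 # 4.0
mean_b = sum(group_b) / len(group_b)                 # 8.0
sse = sum((v - mean_a) ** 2 for v in group_a) + \
      sum((v - mean_b) ** 2 for v in group_b)

# SSM: squared distances from group means to the overall mean, weighted by size
ssm = len(group_a) * (mean_a - grand) ** 2 + len(group_b) * (mean_b - grand) ** 2

print(sst, ssm, sse, ssm / sst)   # partition: SST = SSM + SSE
```

Here SST = 28, SSM = 24, and SSE = 4, so the model explains 24/28, or about 85.7%, of the variability.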
The null hypothesis for analysis of variance is tested using an F statistic. The F statistic is calculated
as the ratio of the Between Group Variance to the Within Group Variance. In the output of PROC GLM,
these values are shown as the Model Mean Square and the Error Mean Square. The mean square values
are calculated as the sum of square value divided by the degrees of freedom.
In general, degrees of freedom (DF) can be thought of as the number of independent pieces
of information.
Model DF is the number of treatments minus 1.
Corrected total DF is the sample size minus 1.
Error DF is the sample size minus the number of treatments (or the difference between the corrected
total DF and the Model DF).
Mean squares are calculated by taking sums of squares and dividing by the corresponding degrees
of freedom. They can be thought of as variances.
Mean square error (MSE) is an estimate of σ², the constant variance assumed for all treatments.
If μi = μj for all i ≠ j, then the mean square for the model (MSM) is also an estimate of σ².
If μi ≠ μj for any i ≠ j, then MSM estimates σ² plus a positive constant.
F = MSM / MSE = (SSM / dfM) / (SSE / dfE)
The p-value for the test is then calculated from the F distribution with appropriate degrees of freedom.
Variance is the traditional measure of precision. Mean square error (MSE) is the traditional
measure of accuracy used by statisticians. MSE is equal to variance plus squared bias. Because
the sample mean x̄ is an unbiased estimate of the population mean (μ), bias = 0 and MSE
measures the variance.
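Putting the pieces together, the F ratio and its p-value can be sketched in Python, using the sums of squares from the small two-group slide example earlier in this section (SSM = 24 and SSE = 4, assuming three observations per group, so dfM = 1 and dfE = 4; scipy.stats.f supplies the F distribution):

```python
from scipy import stats

# Sums of squares and degrees of freedom for a two-group example:
# SSM = 24 with df_M = 2 - 1 = 1; SSE = 4 with df_E = 6 - 2 = 4
ssm, df_m = 24.0, 1
sse, df_e = 4.0, 4

msm = ssm / df_m      # model mean square (between-group variance estimate)
mse = sse / df_e      # error mean square (within-group variance estimate)
f_ratio = msm / mse

# p-value: upper-tail area of Snedecor's F distribution
p = stats.f.sf(f_ratio, df_m, df_e)
print(f_ratio, round(p, 4))
```

With F = 24 on 1 and 4 degrees of freedom, the p-value is below 0.05, so even this tiny example would reject the null hypothesis of equal group means.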
Coefficient of Determination
R² = SSM / SST
“Proportion of variance accounted for by the model”
The coefficient of determination, R², is a measure of the proportion of variability in the response or
dependent variables explained by the explanatory or independent variables in the analysis. This statistic
is calculated as R² = SSM / SST.
The value of R2 is between 0 and 1. The value is
close to 0 if the independent variables do not explain much variability in the data
close to 1 if the independent variables explain a relatively large proportion of variability in the data.
Although values of R2 closer to 1 are preferred, judging the magnitude of R2 depends on the context
of the problem.
SalePrice = Base Level + Central_Air + Unaccounted for Variation
Yik = μ + αi + εik
The model, Yik = μ + αi + εik, is one way of representing the relationship between the dependent
and independent variables in ANOVA.
Yik   the kth value of the response variable for the ith treatment.
μ     the overall population mean of the response, for example, sale price.
αi    the difference between the population mean of the ith treatment and the overall mean, μ. This
      is referred to as the effect of treatment i.
εik   the difference between the observed value of the kth observation in the ith group and the mean
      of the ith group. This is called the error term.
PROC GLM uses a parameterization of categorical variables in its CLASS statement that will
not directly estimate the values of the parameters in the model shown. The correct parameter
estimates can be obtained by adding the SOLUTION option in the MODEL statement in PROC
GLM and then using simple algebra. Parameter estimates and standard errors can also be obtained
using ESTIMATE statements. These issues are discussed in depth in the
Statistics 2: ANOVA and Regression course and in the SAS documentation.
In this data set, the list of categories observed in the categorical variables is exhaustive. In other
words, there are no other levels imagined possible. In some applications this would be considered
a fixed effect. If the observed levels of a categorical variable comprise just a sample of many that
could have been used (for example, if you used neighborhood as an explanatory variable and only
looked at houses in 10 neighborhoods, but were really interested in generalizing to all
communities), the sampling variability of that variable would need to be taken into account
in the model. In that case, the variable would be treated as a random effect. Random effects are
discussed in the Statistics 2: ANOVA and Regression course.
PROC GLM supports RUN-group processing, which means the procedure stays active until
a PROC, DATA, or QUIT statement is encountered. This enables you to submit additional
statements followed by another RUN statement without resubmitting the PROC statement.
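As a sketch of RUN-group processing (using the STAT1.AmesHousing3 data set and the Heating_QC and SalePrice variables that appear later in this section):

```sas
/* RUN-group processing sketch: PROC GLM remains active after the
   first RUN statement, so additional statements can be submitted
   without resubmitting the PROC statement. QUIT ends the procedure. */
proc glm data=STAT1.AmesHousing3;
   class Heating_QC;
   model SalePrice=Heating_QC;
run;                  /* the procedure executes but stays active */
   means Heating_QC;  /* an additional statement for the active run */
run;
quit;                 /* ends PROC GLM */
```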
The CLASS statement creates a set of “design variables” (sometimes referred to as “dummy variables”)
representing the information contained in any categorical variables. Linear regression is then performed
on the design variables. ANOVA can be thought of as linear regression on dummy variables. It is only in
the interpretation of the model that a distinction is made.
Even if categorical variables are represented by numbers such as 1, 2, 3, the CLASS statement tells SAS
to set up design variables to represent the categories. If a numerically coded categorical variable were not
included in the CLASS variable list, then PROC GLM would interpret it as a continuous variable in the
regression calculations.
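For example (the data set and response variable here are hypothetical; IncLevel is the numerically coded categorical variable discussed on the next page):

```sas
/* With IncLevel (coded 1, 2, 3) listed in the CLASS statement,
   PROC GLM creates design variables for its three levels. If the
   CLASS statement were omitted, IncLevel would be treated as a
   continuous predictor. Work.survey and Spending are hypothetical
   names used only for illustration. */
proc glm data=work.survey;
   class IncLevel;
   model Spending=IncLevel;
run;
quit;
```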
For CLASS variable coding in PROC GLM, the number of design variables created is the number
of levels of the CLASS variable. For example, because the variable IncLevel has three levels, three
design variables are created. Each design variable is a binary indicator of membership in a particular level
of the CLASS variable. So, each observation in the data set will be assigned values on all three of these
new variables in PROC GLM.
In this parameterization scheme, however, a third design variable is always redundant when the other two
are included. For example, if you know that IncLevel is not 1 and IncLevel is also not 2, then you do not
need a third variable to tell you that IncLevel is 3. Because the design variables are read in order,
it is the third design variable that is considered redundant.
If you would like to see the regression equation estimates for the design variables, you can add
the SOLUTION option to the MODEL statement in PROC GLM.
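A minimal sketch, using the data set and variables from the example in this section:

```sas
/* The SOLUTION option prints parameter estimates for the design
   variables. In this parameterization, the estimate for the last,
   redundant level is set to zero. */
proc glm data=STAT1.AmesHousing3;
   class Heating_QC;
   model SalePrice=Heating_QC / solution;
run;
quit;
```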
The validity of the p-values depends on the data meeting the assumptions for ANOVA. Therefore,
it is good practice to verify those assumptions in the process of performing the analysis of group
differences.
Independence implies that the error terms ε_ik in the theoretical model are uncorrelated.
The errors are assumed to be normally distributed for every group or treatment.
Approximately equal error variances are assumed across treatments.
Additional tests and remedies for violations of these assumptions are described in the
Statistics 2: ANOVA and Regression course.
The residuals from the ANOVA are calculated as the actual values minus the predicted values (the group
means in ANOVA). Diagnostic plots (including normal quantile-quantile plots of the residuals) can be
used to assess the normality assumption. With a reasonably sized sample and approximately equal groups
(balanced design), only severe departures from normality are considered a problem. Residual values sum
to 0 in ANOVA and ordinary least squares regression.
In ANOVA with more than one predictor variable, the HOVTEST option is unavailable. In those
circumstances, you can plot the residuals against their predicted values to visually assess whether
the variability is constant across groups.
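One way to produce such a plot is to save the residuals and predicted values with an OUTPUT statement and then graph them (a sketch, using the variables from this section's example):

```sas
/* Write residuals and predicted values to a data set, then plot
   residuals against predicted values to check for roughly constant
   spread across groups. */
proc glm data=STAT1.AmesHousing3;
   class Heating_QC;
   model SalePrice=Heating_QC;
   output out=work.check r=residual p=predicted;
run;
quit;

proc sgplot data=work.check;
   scatter y=residual x=predicted;
   refline 0 / axis=y;   /* residuals should scatter evenly around 0 */
run;
```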
The UNPACK option can be used to separate the individual plots in the panel display.
Selected MEANS statement option:
HOVTEST= performs a test of homogeneity (equality) of variances. The null hypothesis for this
test is that the variances are equal. Levene’s test is the default.
Partial Output:
Turn your attention to the first two tables of the output. The first table specifies the number of levels
and the values of the class variable. Because a FORMAT statement was used, the formatted values are
displayed.
One-Way ANOVA with Heating Quality as Predictor
Class Level Information
Class Levels Values
Heating_QC 4 Average/Typical Excellent Fair Good
The second table shows both the number of observations read and the number of observations used.
These values are the same because there are no missing values for any variable in the model. If any
row has missing data for a predictor or response variable, that row is dropped from the analysis.
Number of Observations Read 300
Number of Observations Used 300
The second part of the output contains all of the information that is needed to test the equality
of the treatment means. It is divided into three parts:
The F statistic and corresponding p-value are reported in the Analysis of Variance table. Because
the reported p-value (<.0001) is less than 0.05, you reject the null hypothesis of no difference between the
means.
R-Square Coeff Var Root MSE SalePrice Mean
0.157920 25.23100 34698.90 137524.9
The coefficient of variation (denoted Coeff Var) expresses the root MSE (the estimate of the standard
deviation for all treatments) as a percent of the mean. It is a unitless measure that is useful in comparing
the variability of two sets of data with different units of measurement.
The SalePrice Mean is the mean of all of the data values for the variable SalePrice, without regard
for Heating_QC.
As discussed previously, the R2 value is often interpreted as the “proportion of variance accounted for
by the model.” Therefore, you might say that in this model, Heating_QC explains about 16% of the
variability of SalePrice.
Source DF Type I SS Mean Square F Value Pr > F
Heating_QC 3 66835556221 22278518740 18.50 <.0001
For a one-way analysis of variance (only one classification variable), the information about the
independent variable in the model is an exact duplicate of the model line of the analysis of variance table.
[Panel of ANOVA diagnostic plots: residuals and RStudent residuals versus predicted value, RStudent versus quantile, Cook's D by observation, residual histogram and Q-Q plot, and a fit–mean spread plot. Fit statistics shown in the panel: Parameters 4, Error DF 296, MSE 1.2E9, R-Square 0.1579, Adj R-Square 0.1494.]
The plot in the upper left panel shows the residuals plotted against the fitted values from the ANOVA
model. Essentially, you are looking for a random scatter within each group. Any patterns or trends in this
plot can indicate model misspecification.
To check the normality assumption, look at the residual histogram and Q-Q plot, which are at the bottom
left and middle left, respectively. The histogram is approximately symmetric. The data values in the
quantile-quantile plot stay close to the diagonal reference line and give support to the assumption
of normally distributed errors.
The default plot created with this code is a box plot.
[Box plot: distribution of SalePrice (sale price in dollars) across Heating_QC levels, annotated with F = 18.50, Prob > F < .0001.]
The output above is the result of the HOVTEST option in the MEANS statement. The null hypothesis
is that the variances are equal over all Heating_QC groups. The p-value of 0.6305 is not smaller than
your alpha level of 0.05 and therefore you do not reject the null hypothesis. Another of your assumptions
is met.
At this point, if you determined that the variances were not equal, you could add the WELCH
option to the MEANS statement. This requests Welch’s (1951) variance-weighted one-way
ANOVA. This alternative to the usual ANOVA is robust to the assumption of equal variances.
This is similar to the unequal variance t-test for two populations. See the appendix for more
information.
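Both options can be requested together in one MEANS statement (a sketch using this section's example):

```sas
/* Request Levene's homogeneity-of-variance test and, as a fallback
   for when variances look unequal, Welch's variance-weighted
   one-way ANOVA. */
proc glm data=STAT1.AmesHousing3;
   class Heating_QC;
   model SalePrice=Heating_QC;
   means Heating_QC / hovtest=levene welch;
run;
quit;
```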
Exercises
Example: Montana Gourmet Garlic is a company that grows garlic using organic methods. It specializes
in hardneck varieties. Knowing a little about experimental methods, the owners design
an experiment to test whether growth of the garlic is affected by the type of fertilizer used.
They limit the experimentation to a Rocambole variety named Spanish Roja, and test three
organic fertilizers and one chemical fertilizer (as a control). They blind themselves to the
fertilizer by using containers with numbers 1 through 4. (In other words, they design the
experiment in such a way that they do not know which fertilizer is in which container.)
One acre of farmland is set aside for the experiment. It is divided into 32 beds. They randomly
assign fertilizers to beds. At harvest, they calculate the average weight of garlic bulbs in each
of the beds. The data are in the STAT1.Garlic data set.
These are the variables in the data set:
Fertilizer The type of fertilizer used (1 through 4)
BulbWt The average garlic bulb weight (in pounds) in the bed
BedID A bed identification number
1. Analysis of Variance with Garlic Data
Consider an experiment to study four types of fertilizer, labeled 1, 2, 3, and 4. One fertilizer
is chemical and the rest are organic. You want to see whether the average of weights of garlic bulbs
are significantly different for plants in beds using different fertilizers.
Test the hypothesis that the means are equal. Be sure to check that the assumptions of the analysis
method that you choose are met. What conclusions can you reach at this point in your analysis?
2.3 ANOVA Post Hoc Tests
Objectives
Perform pairwise comparisons among groups after
finding a significant effect of an independent variable
in ANOVA.
Demonstrate graphical features in PROC GLM for
performing post hoc tests.
Interpret a diffogram.
Interpret a control plot.
Groups   Pairwise Comparisons   Experimentwise Error Rate
3        3                      .14
4        6                      .26
5        10                     .40
When you control the comparisonwise error rate (CER), you fix the level of alpha for a single
comparison, without taking into consideration all the pairwise comparisons that you are making.
The experimentwise error rate (EER) uses an alpha that takes into consideration all the pairwise
comparisons that you are making. Presuming no differences exist, the chance that you falsely conclude
that at least one difference exists is much higher when you consider all possible comparisons.
If you want to make sure that the error rate is 0.05 for the entire set of comparisons, use a method that
controls the experimentwise error rate at 0.05.
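Assuming independent comparisons, each made at level α, the experimentwise error rate for c pairwise comparisons can be sketched as:

```latex
\mathrm{EER} \approx 1 - (1 - \alpha)^{c},
\qquad \text{e.g. } \alpha = 0.05,\ c = 10:\quad 1 - 0.95^{10} \approx 0.40
```

This is the calculation behind the table above: 3, 6, and 10 comparisons give rates of about .14, .26, and .40.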
There is some disagreement among statisticians about whether and how to control
the experimentwise error rate.
All of these multiple comparison methods are requested with options in the LSMEANS statement
of PROC GLM.
In order to call for the statistical hypothesis tests for group differences and ODS Statistical Graphics
to support them, turn on ODS Graphics and then:
For Comparisonwise Control LSMEANS / PDIFF=ALL ADJUST=T
For Experimentwise Control LSMEANS / PDIFF=ALL ADJUST=TUKEY or
PDIFF=CONTROL(‘control level’) ADJUST=DUNNETT
Many other available options control the experimentwise error rate. For information about these
options, see the SAS Documentation.
One-tailed tests against a control level can be requested using the CONTROLL (lower tail)
or CONTROLU (upper tail) options in the LSMEANS statement.
A pairwise comparison examines the difference between two treatment means. “All pairwise
comparisons” means all possible combinations of two treatment means.
Tukey’s multiple comparison adjustment is based on conducting all pairwise comparisons and guarantees
that the Type I experimentwise error rate is equal to alpha for this situation. If you choose to do fewer
than all pairwise comparisons, then this method is more conservative.
Diffograms
A diffogram can be used to quickly tell whether the difference between two group means is statistically significant. The point
estimates for the differences between pairs of group means can be found at the intersections of the vertical
and horizontal lines drawn at group mean values. The downward-sloping diagonal lines show the
confidence intervals for the differences. The upward-sloping line is a reference line showing where
the group means would be equal. Intersection of the downward-sloping diagonal line for a pair with
the upward-sloping, broken gray diagonal line implies that the confidence interval includes zero and that
the mean difference between the two groups is not statistically significant. In that case, the diagonal line
for the pair will be broken. If the confidence interval does not include zero, then the diagonal line for
the pair will be solid. With ODS Statistical Graphics, these plots are automatically generated when you
use the PDIFF=ALL option in the LSMEANS statement.
Dunnett’s method is recommended when there is a true control group. When appropriate (when a natural
control category exists, against which all other categories are compared) it is more powerful than methods
that control for all possible comparisons. In order to do a one-sided test, use the option
PDIFF=CONTROLL (for lower-tail tests when the alternative hypothesis states that a group’s mean
is less than the control group’s mean) or PDIFF=CONTROLU (for upper-tail tests when the alternative
hypothesis states that a group’s mean is greater than the control group’s mean).
Control Plots
LS-mean control plots are produced only when you specify PDIFF=CONTROL or ADJUST=DUNNETT
in the LSMEANS statement, and in this case they are produced by default. The value of the control
is shown as a horizontal line. The shaded area is bounded by the UDL and LDL (Upper Decision Limit
and Lower Decision Limit). If the vertical line extends past the shaded area, that means that the group
represented by that line is significantly different from the control group.
Because Heating_QC uses a format, the formatted value, rather than the internal coded value,
must be specified as the control level in the second LSMEANS statement.
Selected PLOTS= options:
CONTROLPLOT requests a display in which least squares means are compared against a reference
level. LS-mean control plots are produced only when you specify
PDIFF=CONTROL or ADJUST=DUNNETT in the LSMEANS statement,
and in this case they are produced by default.
DIFFPLOT modifies the diffogram produced by an LSMEANS statement with the PDIFF=ALL
option (or only PDIFF, because ALL is the default argument). The CENTER option
marks the center point for each comparison. This point corresponds to the
intersection of two least squares means.
Selected LSMEANS statement options:
PDIFF= requests p-values for the differences: the probability of observing a difference
between two means as large as or larger than the observed difference if the two
population means are actually the same. You can request to compare all means using
PDIFF=ALL. You can also specify which means to compare. For details, see
the documentation for LSMEANS under the GLM procedure.
ADJUST= specifies the adjustment method for multiple comparisons. If no adjustment method
is specified, the Tukey method is used by default. The T option asks that no
adjustment be made for multiple comparisons. The TUKEY option uses Tukey's
adjustment method. The DUNNETT option uses Dunnett’s method. For a list
of available methods, check the documentation for LSMEANS under the GLM
procedure.
The MEANS statement can be used for multiple comparisons. However, the results can
be misleading if the groups that are specified have different numbers of observations.
The following output is for the Tukey LSMEANS comparisons.
The GLM Procedure
The first part of the output shows the means for each group. The second part of the output shows p-values
from pairwise comparisons of all possible combinations of means. Notice that row 2/column 4 has the
same p-value as row 4/column 2 because the same two means are compared in each case. Both are
displayed as a convenience to the user. Notice also that row 1/column 1, row 2/column 2, and so on, are
blank, because it does not make any sense to compare a mean to itself.
The only nonsignificant pairwise difference is between Average/Typical and Good.
The Least Square Means are shown graphically in the mean plot. The Tukey-adjusted differences among
the LSMEANS are shown in the diffogram.
[Diffogram: Tukey-adjusted comparisons of SalePrice LS-means for the Heating_QC levels Fair, Average/Typical, Good, and Excellent.]
The solid line denotes significant differences between heating quality levels. (Confidence intervals for
the difference do not cross the diagonal equivalence line.)
The following output is for the Dunnett LSMEANS comparisons:
H0:LSMean=Control
SalePrice
Heating_QC LSMEAN Pr > |t|
Average/Typical 130573.529
Excellent 154919.187 <.0001
Fair 97118.750 0.0010
Good 130844.086 0.9999
In this case, all other quality levels are compared to Average/Typical. Once again, Good is the only
level that is not statistically significantly different from that control level.
[Control plot: SalePrice LS-means with upper and lower decision limits (UDL, LDL); Control = Average/Typical, Mean = 130574.]
This plot corresponds to the tables that were summarized. The horizontal line is drawn at the least squares
mean for Average/Typical, which is 130574. The three other means are represented by the ends
of the vertical lines extending from the horizontal control line. The mean value for Good is so close
to Average/Typical that it cannot be seen here.
Notice that the blue areas of non-significance vary in size. This is because different comparisons
involve different sample sizes. Smaller sample sizes require larger mean differences to reach
statistical significance.
Exercises
a. Conduct pairwise comparisons with an experimentwise error rate of α = 0.05. (Use the Tukey
adjustment.) Which types of fertilizer are significantly different?
b. Use level 4 (the chemical fertilizer) as the control group and perform a Dunnett comparison with
the organic fertilizers to see whether they affected the average weights of garlic bulbs differently
from the control fertilizer.
c. (Extra) Perform unadjusted tests of all pairwise comparisons to see what would have happened
if the multi-test adjustments had not been made. How do the results compare to what you saw
in the Tukey adjusted tests?
2.4 Pearson Correlation
Objectives
Use a scatter plot to examine the linear relationship
between two continuous variables.
Use correlation statistics to quantify the degree
of association between two continuous variables.
Describe potential misuses of the correlation
coefficient.
Use the CORR procedure to obtain Pearson
correlation coefficients.
Recall that you can visualize the relationship between variables using a scatter plot. If both variables are
measured on a continuous scale, you can also see whether you can detect a pattern in the relationship.
Before you perform any statistical analysis, it is important to first understand the nature of the
relationship.
The Pearson product-moment correlation is a measure of the linear relationship between two continuous
variables. The sample correlation coefficient is calculated as

   r_xy = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² )
If a scatter plot of the relationship is distinctly non-linear, then this measure of association is invalid.
As you examine the scatter plot, you can find evidence of the nature of the correlation between
the variables.
Values of the Pearson product-moment correlation are
between −1 and +1
closer to either extreme if there is a high degree of linear association between the two variables
close to 0 if there is no linear association between the two variables
greater than 0 if there is a positive (upward-sloping) linear association
less than 0 if there is a negative (downward-sloping) linear association
The null hypothesis for a test of a correlation coefficient is ρ = 0. Rejecting the null hypothesis only means
that you can be confident that the true population correlation is not 0. Small p-values can occur (as is the
case with many statistics) because of very large sample sizes. For example, even a correlation coefficient
of 0.01 can be statistically significant with a large enough sample size. Therefore, it is important to also
look at the value of r itself to see whether it is meaningfully large.
The next several slides describe in more detail some cautions in interpreting the Pearson correlation
coefficient.
Common errors can be made when you interpret the correlation between variables. One example of this
is using correlation coefficients to conclude a cause-and-effect relationship.
A strong correlation between two variables does not mean change in one variable causes the other
variable to change, or vice versa.
Sample correlation coefficients can be large because of chance or because both variables are affected
by other variables.
“Correlation does not imply causation.”
[Scatter plot: each state's 2013 average total SAT score versus average expenditure per public school student, with states labeled.]
Bivariate (two-variable) correlations describe the measurable degree of linear association between
the variables involved. However, often the relationship is just an artifact of both variables’ relationships
with some third variable. An example of reaching errant conclusions comes from U.S. Department of
Education data from the Scholastic Aptitude Test (SAT) from 2013. The scatter plot above shows each
state’s average total SAT score versus the average state expenditure in 2011 in U.S. dollars per public
school student. The correlation between the two variables is −0.233. Looking at the plot and at this
statistic, you might be led to the non-intuitive conclusion that state spending on education hurts student
performance. While the calculated correlation statistic is factual, the simplicity of the relationship implied
by it is not.
[Scatter plot ("Missing Link"): 2013 average SAT score versus SAT participation rate, by state; r = −0.903.]
The 2013 report did not take into account the differences among the states in the percentage of students
taking the SAT. There are many reasons for the varying participation rates. Some states have lower
participation because their students primarily take the rival ACT standardized test. Others have rules
requiring even non-college-bound students to take the test. In low participating states, often only
the highest performing students choose to take the SAT. Another reported table shows the relationship
between participation rate (percent taking the SAT) and average SAT total score. The correlation is
−0.903, indicating that states with lower participation rates tend to have higher average scores.
[Scatter plot: state average SAT scores and expenditures after adjusting for SAT participation rate, states labeled.]
If you adjust for differences in participation rates, the conclusions about the effect of expenditures might
change. In this case, there seems to be a slight positive linear relationship between expenditures
and average total score on the SAT when you first adjust for participation rates.
Simple correlations often do not tell the whole story.
These types of adjustments are described in greater detail in the sections about multiple
regression.
In the scatter plot, the variables have a fairly low Pearson correlation coefficient. Why?
Pearson correlation coefficients measure linear relationships.
A Pearson correlation coefficient close to 0 indicates that there is not a strong linear relationship
between two variables.
A Pearson correlation coefficient close to 0 does not mean that there is no relationship of any kind
between the two variables.
In this example, there is a curvilinear relationship between the two variables.
[Scatter plots with correlations r = 0.02 and r = 0.82, illustrating the effect of a single unusual point.]
Correlation coefficients are highly affected by a few extreme values on either variable’s range. The scatter
plots show that the degree of linear relationship is mainly determined by one point. If you include the
unusual point in the data set, the correlation is close to 1. If you do not include it, the correlation is close
to 0.
In this situation, follow these steps:
1. Investigate the unusual data point to make sure it is valid.
2. If the data point is valid, collect more data between the unusual data point and the group of data
points to see whether a linear relationship unfolds.
3. Try to replicate the unusual data point by collecting data at a fixed value of x (in this case, x=10).
This determines whether the data point is unusual.
4. Compute two correlation coefficients, one with the unusual data point and one without it. This shows
how influential the unusual data point is in the analysis. In this case, it is greatly influential.
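Step 4 can be sketched as follows (the data set work.points, the variables x and y, and the flag variable unusual are all hypothetical names for illustration):

```sas
/* Compute the correlation with the unusual point included... */
proc corr data=work.points;
   var x y;
run;

/* ...and again with it excluded, to gauge its influence. */
proc corr data=work.points(where=(unusual=0));
   var x y;
run;
```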
You can use the CORR procedure to produce correlation statistics and scatter plots for your data.
By default, PROC CORR produces Pearson correlation coefficients and corresponding p-values.
Selected CORR procedure statements:
VAR specifies variables for which to produce correlations. If a WITH statement is not
specified, correlations are produced for each pair of variables in the VAR
statement. If the WITH statement is specified, the VAR statement specifies
the column variables in the correlation matrix.
WITH produces correlations for each variable in the VAR statement with all variables
in the WITH statement. The WITH statement specifies the row variables in the
correlation matrix.
ID specifies one or more additional tip variables to identify observations in scatter
plots and scatter plot matrices.
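The three statements can be combined as in this sketch (the &interval macro variable is defined later in this section; the ID variable PID is an assumption, used only to illustrate the tooltip feature):

```sas
/* Correlate SalePrice with each predictor in the &interval list,
   identifying observations by PID in the scatter plots. */
ods graphics on;
proc corr data=STAT1.AmesHousing3
          plots(only)=scatter(nvar=all ellipse=none);
   var &interval;
   with SalePrice;
   id PID;        /* hypothetical tip variable */
run;
```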
Exploratory analysis in preparation for multiple regression often involves looking at bivariate scatter plots
and correlations between each of the predictor variables and the response variable. It is not suggested that
exclusion or inclusion decisions be made on the basis of these analyses. The purpose is to explore
the shape of the relationships (because linear regression assumes a linear shape to the relationship)
and to screen for outliers. You will also want to check for multivariate outliers when you test your
multiple regression models later.
Examine the relationships between SalePrice and the continuous predictor variables in the data set.
Use the CORR procedure.
/*st102d04.sas*/ /*Part A*/
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
SalePrice 300 137525 37623 41257460 35000 290000 Sale price in dollars
Gr_Liv_Area 300 1131 232.64939 339222 334.00000 1500 Above grade (ground) living area square feet
Basement_Area 300 882.31000 359.78397 264693 0 1645 Basement area in square feet
Garage_Area 300 369.45333 176.25309 110836 0 902.00000 Size of garage in square feet
Deck_Porch_Area 300 118.26333 132.61169 35479 0 897.00000 Total area of decks and porches in square feet
Lot_Area 300 8294 3324 2488241 1495 26142 Lot size in square feet
Age_Sold 300 45.88667 27.47697 13766 1.00000 135.00000 Age of house when sold, in years
Bedroom_AbvGr 300 2.51333 0.69144 754.00000 0 4.00000 Bedrooms above grade
Total_Bathroom 300 1.70167 0.65707 510.50000 1.00000 4.10000 Total number of bathrooms (half bathrooms counted 10%)
The correlation coefficient between SalePrice and Basement_Area is 0.68956. The p-value is small,
which indicates that the population correlation coefficient (ρ) is likely different from 0. The second largest
correlation coefficient, in absolute value, is for Gr_Liv_Area, at 0.65046.
Notice that there are several houses with basements sized 0 square feet. These are houses without
basements. This mixture of data can affect the correlation coefficient. You will need to take this into
account later when you build a model with basement area as a predictor variable.
If you want to explore an observation further, you can move the cursor over the observation and
information is displayed in a floating box. You can only do this in an HTML file with IMAGEMAP
turned on. The coordinate values, observation number, and ID variable values are displayed.
The correlation and scatter plot analyses indicate that several variables might be good predictors
for SalePrice.
When you prepare to conduct a regression analysis, it is always good practice to examine the correlations
among the potential predictor variables. When you do not specify a WITH statement, you get a matrix
of correlations of all VAR variables. That matrix can be very big and difficult to interpret. To limit
the displayed output to only the strongest correlations, you can use a BEST= option.
/*st102d04.sas*/ /*Part B*/
ods graphics off;
proc corr data=STAT1.AmesHousing3
nosimple
best=3;
var &interval;
title "Correlations and Scatter Plot Matrix of Predictors";
run;
Selected PROC CORR statement option:
NOSIMPLE suppresses printing simple descriptive statistics for each variable.
BEST= prints the n highest correlation coefficients, in absolute value, for each variable, n ≥ 1.
PLOTS(MAXPOINTS=n) The global plot option MAXPOINTS= specifies that plots with elements
that require processing more than n points be suppressed. The default
is MAXPOINTS=5000. This limit is ignored if you specify
MAXPOINTS=NONE.
There are moderately strong correlations between Total_Bathroom and Age_Sold (-0.52889), between
Total_Bathroom and Basement_Area (0.48500), and between Bedroom_AbvGr and Gr_Liv_Area
(0.48431).
Exercises
Important! ODS Graphics in PROC CORR limits you to 10 VAR variables at a time,
so for this exercise, look at the relationships with Age, Weight, and Height separately
from the circumference variables (Neck Chest Abdomen Hip Thigh Knee Ankle
Biceps Forearm Wrist).
This limitation exists only on the graphics obtained from ODS. The correlation table will
display all variables in the VAR statement by default.
1) Can straight lines adequately describe the relationships?
2) Are there any outliers that you should investigate?
2.5 Simple Linear Regression
Objectives
Explain the concepts of simple linear regression.
Fit a simple linear regression using the REG procedure.
Produce predicted values and confidence intervals.
Overview
In the last section, you used correlation analysis to quantify the linear relationships between pairs
of continuous variables. Two pairs of variables can have the same correlation, but very different linear
relationships. In this section, you use simple linear regression to define the linear relationship between
a response variable and a predictor variable.
The response variable is the variable of primary interest.
The predictor variable is used to explain the variability in the response variable.
In simple linear regression, the values of the predictor variable are assumed to be fixed. Thus, you try
to explain the variability of the response variable given the values of the predictor variable.
[Figure: the regression line rises β1 units for each 1-unit increase in the predictor and intersects the vertical axis at β0.]
The relationship between the response variable and the predictor variable can be characterized
by the equation

   yi = β0 + β1xi + εi,   i = 1, …, n

where

yi is the response variable.
xi is the predictor variable.
β0 is the intercept parameter, which corresponds to the value of the response variable when the
predictor is 0.
β1 is the slope parameter, which corresponds to the magnitude of change in the response variable
given a one-unit change in the predictor variable.
εi is the random error term.
[Figure: the unknown population relationship Y = β0 + β1X is estimated by the best fit regression line Ŷ = β̂0 + β̂1X; the residual is the vertical distance Y − Ŷ.]
Because your goal in simple linear regression is usually to characterize the relationship between the
response and predictor variables in your population, you begin with a sample of data. From this sample,
you estimate the unknown population parameters (β0, β1) that define the assumed linear relationship
between your response and predictor variables.
Estimates of the unknown population parameters β0 and β1 are obtained by the method of ordinary least
squares. This method provides the estimates by determining the line that minimizes the sum of the
squared vertical distances between the observations and the fitted line. In other words, the fitted or
regression line is as close as possible to all the data points.
Ordinary least squares produces parameter estimates with certain optimum properties. If the assumptions
of simple linear regression are valid, the least squares estimates are unbiased estimates of the population
parameters and have minimum variance (efficiency). The least squares estimators are often called BLUE
(Best Linear Unbiased Estimators). The term best is used because of the minimum variance property.
Because of these optimum properties, ordinary least squares is used by many data analysts to investigate
the relationship between continuous predictor and response variables.
With a large and representative sample, the fitted regression line should be a good approximation of the
relationship between the response and predictor variables in the population. The estimated parameters
obtained using the method of ordinary least squares should be good approximations of the true population
parameters.
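The least squares estimates have a closed form in simple linear regression: the slope is the corrected cross-product sum divided by the corrected sum of squares of X, and the intercept forces the line through the point of means. A minimal Python sketch (illustrative only, not part of the course's SAS code):

```python
def ols_fit(x, y):
    """Ordinary least squares estimates for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx       # slope: minimizes the sum of squared residuals
    b0 = my - b1 * mx    # intercept: line passes through (x-bar, y-bar)
    return b0, b1

# A perfectly linear toy sample recovers the underlying line exactly.
b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])   # the line y = 1 + 2x
```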
To determine whether the predictor variable explains a significant amount of variability in the response
variable, the simple linear regression model is compared to the baseline model. The fitted regression line
in a baseline model is a horizontal line across all values of the predictor variable. The slope of the
regression line is 0 and the intercept is the sample mean of the response variable, Ȳ.
In a baseline model, there is no association between the response variable and the predictor variable.
Therefore, knowing the value of the predictor variable does not improve predictions of the response over
simply using the unconditional mean (the mean calculated disregarding the predictor variables) of the
response variable.
[Figure: at each point, the total deviation Y − Ȳ splits into an explained part Ŷ − Ȳ and an unexplained part Y − Ŷ around the fitted line Ŷ = β̂0 + β̂1X.]
To determine whether a simple linear regression model is better than the baseline model, compare
the explained variability to the unexplained variability.
Explained variability is related to the difference between the regression line and the mean of the
response variable. The model sum of squares (SSM) is the amount of variability
explained by your model: SSM = Σ(Ŷi − Ȳ)².

Unexplained variability is related to the difference between the observed values and the regression line.
The error sum of squares (SSE) is the amount of variability unexplained by your
model: SSE = Σ(Yi − Ŷi)².

Total variability is related to the difference between the observed values and the mean of the
response variable. The corrected total sum of squares (SST) is the sum of the
explained and unexplained variability: SST = Σ(Yi − Ȳ)².

Remember that the relationship total = unexplained + explained applies for sums
of squares over all observations and not necessarily for any individual observation.
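The decomposition can be verified numerically. The following Python sketch (illustrative toy data, not part of the course's SAS code) fits the least squares line and confirms that SST = SSM + SSE:

```python
def sums_of_squares(x, y):
    """Decompose total variability for a simple OLS fit: SST = SSM + SSE."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    yhat = [b0 + b1 * a for a in x]
    ssm = sum((yh - my) ** 2 for yh in yhat)               # explained
    sse = sum((b - yh) ** 2 for b, yh in zip(y, yhat))     # unexplained
    sst = sum((b - my) ** 2 for b in y)                    # total
    return ssm, sse, sst

# Hypothetical toy sample
ssm, sse, sst = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
```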
If the estimated simple linear regression model does not fit the data better than the baseline model,
you fail to reject the null hypothesis. Thus, you do not have enough evidence to say that the slope
of the regression line in the population differs from zero.
If the estimated simple linear regression model does fit the data better than the baseline model, you reject
the null hypothesis. Thus, you do have enough evidence to say that the slope of the regression line in the
population differs from zero and that the predictor variable explains a significant amount of variability
in the response variable.
One of the assumptions of simple linear regression is that the mean of the response variable is linearly
related to the value of the predictor variable. In other words, a straight line connects the means
of the response variable at each value of the predictor variable.
The other assumptions are the same as the assumptions for ANOVA, that is, the error is normally
distributed and has constant variance across the range of the predictor variable, and observations
are independent.
The REG procedure enables you to fit regression models to your data.
Selected REG procedure statement:
MODEL specifies the response and predictor variables. The variables must be numeric.
PROC REG supports RUN-group processing, which means that the procedure stays active until
a PROC, DATA, or QUIT statement is encountered. This enables you to submit additional
statements followed by another RUN statement without resubmitting the PROC statement.
The predictor most highly correlated with SalePrice is Basement_Area (0.68956). All of the
predictor variables were significantly correlated with SalePrice, and ultimately you will perform
a multiple linear regression analysis. For the following demonstration, one predictor is selected
to illustrate a simple linear regression.
Example: Because there is an apparent linear relationship between SalePrice and Lot_Area, perform
a simple linear regression analysis with SalePrice as the response variable.
/*st102d05.sas*/
ods graphics;
proc reg data=STAT1.AmesHousing3;
   model SalePrice=Lot_Area;
run;
quit;
The Number of Observations Read and the Number of Observations Used are the same, which indicates
that no missing values were detected for either SalePrice or Lot_Area.
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1    27164711173       27164711173      20.44    <.0001
Error             298    3.960588E11       1329056404
Corrected Total   299    4.232235E11
The Analysis of Variance (ANOVA) table provides an analysis of the variability observed in the data
and the variability explained by the regression line.
The ANOVA table for simple linear regression is divided into six columns:
Source labels the source of variability.
DF is the degrees of freedom associated with each source of variability.
Sum of Squares is the amount of variability associated with each source of variability.
Mean Square is the ratio of the sum of squares and the degrees of freedom. This value corresponds
to the amount of variability associated with each degree of freedom for each source
of variation.
F Value is the ratio of the mean square for the model and the mean square for the error.
This ratio compares the variability explained by the regression line to the variability
unexplained by the regression line.
Pr>F is the p-value associated with the F value.
Each of the column measurements is applied to the following sources of variation:
Model is the variability explained by your model (Between Group).
Error is the variability unexplained by your model (Within Group).
Corrected Total is the total variability in the data (Total).
The F value tests whether the slope of the predictor variable is equal to 0. The p-value is small (less than
0.05), so you have enough evidence at the 0.05 significance level to reject the null hypothesis. Thus, you
can conclude that the simple linear regression model fits the data better than the baseline model. In other
words, Lot_Area explains a significant amount of variability in SalePrice.
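As a quick arithmetic check (a Python sketch, not part of the SAS program), the F value is the ratio of the two mean squares reported in the Analysis of Variance table:

```python
# Values copied from the Analysis of Variance table above
ss_model = 27164711173     # Model sum of squares
df_model = 1               # Model degrees of freedom
ms_error = 1329056404      # Mean square for Error

ms_model = ss_model / df_model
f_value = ms_model / ms_error    # about 20.44, matching the F Value column
```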
The third part of the output provides summary measures of fit for the model.
Root MSE 36456 R-Square 0.0642
Dependent Mean 137525 Adj R-Sq 0.0610
Coeff Var 26.50882
Root MSE The root mean square error is an estimate of the standard deviation of the response
variable at each value of the predictor variable. It is the square root of the MSE.
R-Square The R square is the squared value of the correlation that you saw earlier
between Lot_Area and SalePrice (0.25335). This is no coincidence. For
simple linear regression, the R-square value is the square of the value of the
bivariate Pearson correlation coefficient.
Adj R-Sq The adjusted R square is adjusted for the number of parameters in the model.
This statistic is useful in multiple regression and is discussed in a later section.
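Both summary measures can be reproduced from numbers already shown: Root MSE is the square root of the MSE in the ANOVA table, and R-square is the squared Pearson correlation. A Python check (not part of the course's SAS code):

```python
import math

mse = 1329056404    # Mean Square for Error from the ANOVA table
r = 0.25335         # Pearson correlation between Lot_Area and SalePrice

root_mse = math.sqrt(mse)    # about 36456, matching Root MSE
r_square = r ** 2            # about 0.0642, matching R-Square
```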
Parameter Estimates

Variable     Label                      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    Intercept                   1    113740                5666.48352          20.07    <.0001
Lot_Area     Lot size in square feet     1    2.86770               0.63431              4.52    <.0001
The Parameter Estimates table defines the model for your data.
DF represents the degrees of freedom associated with each term in the model.
Parameter Estimate is the estimated value of the parameters associated with each term in the model.
Standard Error is the standard error of each parameter estimate.
t Value is the t statistic, which is calculated by dividing the parameter estimates by their
corresponding standard error estimates.
Pr > |t| is the p-value associated with the t statistic. It tests whether the parameter
associated with each term in the model is different from 0. For this example,
the slope for the predictor variable is statistically different from 0. Thus, you can
conclude that the predictor variable explains a significant portion of variability
in the response variable.
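For example, the t value reported for Lot_Area can be reproduced by dividing its parameter estimate by its standard error (a Python check, not part of the course's SAS code):

```python
estimate = 2.86770     # slope estimate for Lot_Area
std_error = 0.63431    # its standard error

t_value = estimate / std_error    # about 4.52, matching the t Value column
```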
Because the estimates are β̂0=113740 and β̂1=2.86770, the estimated regression equation is given
by SalePrice = 113740 + 2.86770 * Lot_Area.
The model indicates that each additional square foot of lot area is associated with an approximately $2.87
higher sale price.
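The fitted equation can be used directly for prediction within the observed range of Lot_Area. A Python sketch (the example lot sizes below are hypothetical):

```python
def predict_saleprice(lot_area):
    """Estimated regression equation from the Parameter Estimates table.
    Valid only within the observed Lot_Area range (about 1,495-26,142 sq ft)."""
    return 113740 + 2.86770 * lot_area

# Two lots that differ by 1,000 square feet differ by about $2,868
# in predicted sale price.
price_diff = predict_saleprice(10000) - predict_saleprice(9000)
```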
Extrapolation of the model beyond the range of your predictor variables is inappropriate.
You cannot assume that the relationship holds in regions that were not sampled.
The parameter estimates table also shows that the intercept parameter is not equal to 0. However, the test
for the intercept parameter only has practical significance when the range of values for the predictor
variable includes 0. In this example, the test could not have practical significance because Lot_Area=0
(a house with no lot area) is not within the range of observed values.
[Diagnostic panel for the SalePrice on Lot_Area model: residual and RStudent versus predicted value
plots, residual Q-Q plot, Cook's D by observation, residual histogram, and residual-fit spread plot,
followed by a plot of residuals by Lot_Area. Inset fit statistics: Parameters 2, Error DF 298,
MSE 1.33E9, R-Square 0.0642, Adj R-Square 0.061.]
The diagnostics panel and the residuals by Lot_Area plot contain a variety of graphs designed to help
you assess how well the data fulfill the statistical assumptions and to screen for influential outliers.
These plots are explored in detail in a later chapter.
The Fit plot produced by ODS Graphics shows the predicted regression line superimposed over a scatter
plot of the data.
To assess the level of precision around the mean estimates of SalePrice, you can produce confidence
intervals around the means. This is represented in the shaded area in the plot.
A 95% confidence interval for the mean says that you are 95% confident that your interval contains
the population mean of Y for a particular X.
Confidence intervals become wider as you move away from the mean of the independent variable.
This reflects the fact that your estimates become more variable as you move away from the means
of X and Y.
Suppose that the mean SalePrice at a fixed value of Lot_Area is not the focus. If you are interested
in making a prediction for a future single observation, you need a prediction interval. This is represented
by the area between the broken lines in the plot.
A 95% prediction interval is one that you are 95% confident contains a new observation.
Prediction intervals are wider than confidence intervals because single observations have more
variability than sample means.
Printed tables for the confidence and prediction intervals at each observed data point can
be obtained by adding the CLM and CLI options to the MODEL statement.
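The widths of the two intervals come from closely related formulas: both depend on the leverage 1/n + (x0 − x̄)²/Sxx, but the prediction interval adds 1 inside the square root, which is why it is always wider. A Python sketch (the numeric inputs below are illustrative assumptions, not values taken from the output):

```python
import math

def interval_halfwidths(s, n, x0, xbar, sxx, t_crit):
    """Half-widths of the CLM (mean) and CLI (prediction) intervals at x = x0."""
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    clm = t_crit * s * math.sqrt(leverage)        # interval for the mean of Y
    cli = t_crit * s * math.sqrt(1 + leverage)    # interval for a new single Y
    return clm, cli

# Illustrative inputs loosely based on this example (t_crit ~ 1.97 for 298 df);
# sxx is an assumed value, not one reported in the output.
clm, cli = interval_halfwidths(s=36456, n=300, x0=8294, xbar=8294,
                               sxx=3.3e9, t_crit=1.97)
```

The sketch also makes the widening visible: re-evaluating at an x0 far from x̄ increases both half-widths.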
Exercises
2.6 Solutions
Solutions to Exercises
1. Analysis of Variance with Garlic Data
Consider an experiment to study four types of fertilizer, labeled 1, 2, 3, and 4. One fertilizer
is chemical and the rest are organic. You want to see whether the average weights of garlic bulbs
are significantly different for plants in beds using different fertilizers.
Test the hypothesis that the means are equal. Be sure to check that the assumptions of the analysis
method that you choose are met. What conclusions can you reach at this point in your analysis?
The overall F value from the analysis of variance table is associated with a p-value=0.0013.
Presuming that all assumptions of the model are valid, you know that at least one treatment
mean is different from one other treatment mean. At this point, you do not know which
means are significantly different from one another.
R-Square Coeff Var Root MSE BulbWt Mean
0.423291 15.85633 0.033960 0.214172
[Diagnostic panel for the garlic ANOVA model: residual and RStudent versus predicted value plots,
RStudent versus leverage, residual Q-Q plot, Cook's D, BulbWt versus predicted value, residual
histogram, and residual-fit spread plot. Inset fit statistics: Observations 32, Parameters 4,
Error DF 28, MSE 0.0012, R-Square 0.4233, Adj R-Square 0.3615.]
Both the histogram and Q-Q plot show that the residuals seem relatively normally
distributed (one assumption for ANOVA).
Levene's Test for Homogeneity of BulbWt Variance
ANOVA of Squared Deviations from Group Means

Source        DF    Sum of Squares    Mean Square    F Value    Pr > F
Fertilizer     3    9.13E-6           3.043E-6          1.54    0.2257
Error         28    0.000055          1.974E-6
The Levene’s Test for Homogeneity of Variance shows a p-value greater than alpha.
Therefore, do not reject the hypothesis of homogeneity of variances (equal variances across
fertilizer types). This assumption for ANOVA is met.
2. Post Hoc Pairwise Comparisons
Consider again the analysis of the STAT1.Garlic data set. There was a statistically significant
difference among mean bulb weights for the different fertilizers. Perform a post hoc test to look
at the individual differences among means.
a. Conduct pairwise comparisons with an experimentwise error rate of α=0.05. (Use the Tukey
adjustment.) Which types of fertilizer are significantly different?
b. Use level 4 (the chemical fertilizer) as the control group and perform a Dunnett comparison with
the organic fertilizers to see whether they affected the average weights of garlic bulbs differently
from the control fertilizer.
c. (Extra) Perform unadjusted tests of all pairwise comparisons to see what would have happened
if the multi-test adjustments had not been made. How do the results compare to what you saw
in the Tukey adjusted tests?
/*st102s02.sas*/
ods graphics;
[Tukey diffogram comparing the least squares mean of BulbWt for each pair of fertilizers 1 through 4.]
The Tukey comparisons show significant differences between fertilizers 3 and 4 (p=0.0020)
and 1 and 4 (p=0.0058).
H0: LSMean = Control

Fertilizer    BulbWt LSMEAN    Pr > |t|
1             0.23539981       0.0031
2             0.20511406       0.1801
3             0.24240747       0.0011
4             0.17376488
[Control plot of the BulbWt LS-means for fertilizers 1, 2, and 3 against the Dunnett decision limits
(UDL and LDL) around the control: Control = 4, Mean = 0.17376.]
The Dunnett comparisons show the same pairs as significantly different, but with smaller
p-values than with the Tukey comparisons (3 versus 4 p=0.0011, 1 versus 4 p=0.0031). This is
due to the fact that the Tukey adjustment is for more pairwise comparisons than the
Dunnett adjustment.
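The difference in adjustment severity is easy to quantify: with k groups, Tukey controls for all k(k−1)/2 pairwise tests, while Dunnett controls for only the k − 1 comparisons against the control. A Python sketch (not part of the course's SAS code):

```python
def tukey_comparisons(k):
    """Number of pairwise comparisons the Tukey adjustment accounts for."""
    return k * (k - 1) // 2

def dunnett_comparisons(k):
    """Number of comparisons against a single control (Dunnett)."""
    return k - 1

# With 4 fertilizers, Tukey adjusts for 6 tests but Dunnett for only 3,
# which is why the Dunnett-adjusted p-values are smaller here.
```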
Fertilizer    BulbWt LSMEAN    LSMEAN Number
1             0.23539981       1
2             0.20511406       2
3             0.24240747       3
4             0.17376488       4

[Diffogram of the unadjusted pairwise comparisons among the four fertilizer LS-means.]
The unadjusted (t-test) comparisons all have smaller p-values than they had with Tukey
adjustments. One additional comparison has a p-value below 0.05 (2 versus 3).
3. Describing the Relationship between Continuous Variables
Percentage of body fat, age, weight, height, and 10 body circumference measurements (for example,
abdomen) were recorded for 252 men by Dr. Roger W. Johnson of Calvin College in Minnesota. The
data are in the STAT1.BodyFat2 data set. Body fat, one measure of health, was accurately estimated
by an underwater weighing technique. There are two measures of percentage body fat in this data set.
a. Generate scatter plots and correlations for the VAR variables Age, Weight, Height, and the
circumference measures versus the WITH variable, PctBodyFat2.
Important! ODS Graphics in PROC CORR limits you to 10 VAR variables at a time, so
for this exercise, look at the relationships with Age, Weight, and Height separately from
the circumference variables (Neck Chest Abdomen Hip Thigh Knee Ankle Biceps
Forearm Wrist).
/*st102s03.sas*/ /*Part A*/
%let interval=Age Weight Height Neck Chest Abdomen Hip
              Thigh Knee Ankle Biceps Forearm Wrist;
proc corr data=STAT1.BodyFat2;
   var &interval;
   with PctBodyFat2;
run;
1 With Variables:    PctBodyFat2
13 Variables:        Age Weight Height Neck Chest Abdomen Hip Thigh Knee
                     Ankle Biceps Forearm Wrist
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
PctBodyFat2 252 19.15079 8.36874 4826 0 47.50000
Age 252 44.88492 12.60204 11311 22.00000 81.00000
Weight 252 178.92440 29.38916 45089 118.50000 363.15000
Height 252 70.30754 2.60958 17718 64.00000 77.75000
Neck 252 37.99206 2.43091 9574 31.10000 51.20000
Chest 252 100.82421 8.43048 25408 79.30000 136.20000
Abdomen 252 92.55595 10.78308 23324 69.40000 148.10000
Hip 252 99.90476 7.16406 25176 85.00000 147.70000
Thigh 252 59.40595 5.24995 14970 47.20000 87.30000
Knee 252 38.59048 2.41180 9725 33.00000 49.10000
Ankle 252 23.10238 1.69489 5822 19.10000 33.90000
Biceps 252 32.27341 3.02127 8133 24.80000 45.00000
Forearm 252 28.66389 2.02069 7223 21.00000 34.90000
Wrist 252 18.22976 0.93358 4594 15.80000 21.40000
13 Variables: Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps
Forearm Wrist
Weight has a high correlation with nearly every other variable. Hip also is correlated with
most variables.
c. (Advanced) Output the correlation table into a data set. Print out only the correlations whose
absolute values are 0.70 and above or note them with an asterisk in the full correlation table.
Potential solution to printing out the correlation matrix with asterisks for correlations with
absolute values at 0.7 and above:
%let big=0.7;
proc format;
picture correlations &big -< 1 = '009.99' (prefix="*")
-1 <- -&big = '009.99' (prefix="*")
-&big <-< &big = '009.99';
run;
Obs _NAME_ Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist
4 Age 1 0.01 0.24 0.11 0.17 0.23 0.05 0.20 0.01 0.10 0.04 0.08 0.21
5 Weight 0.01 1 0.48 *0.83 *0.89 *0.88 *0.94 *0.86 *0.85 0.61 *0.80 0.63 *0.72
6 Height 0.24 0.48 1 0.32 0.22 0.18 0.37 0.33 0.50 0.39 0.31 0.32 0.39
7 Neck 0.11 *0.83 0.32 1 *0.78 *0.75 *0.73 0.69 0.67 0.47 *0.73 0.62 *0.74
8 Chest 0.17 *0.89 0.22 *0.78 1 *0.91 *0.82 *0.72 *0.71 0.48 *0.72 0.58 0.66
9 Abdomen 0.23 *0.88 0.18 *0.75 *0.91 1 *0.87 *0.76 *0.73 0.45 0.68 0.50 0.61
10 Hip 0.05 *0.94 0.37 *0.73 *0.82 *0.87 1 *0.89 *0.82 0.55 *0.73 0.54 0.63
11 Thigh 0.20 *0.86 0.33 0.69 *0.72 *0.76 *0.89 1 *0.79 0.53 *0.76 0.56 0.55
12 Knee 0.01 *0.85 0.50 0.67 *0.71 *0.73 *0.82 *0.79 1 0.61 0.67 0.55 0.66
13 Ankle 0.10 0.61 0.39 0.47 0.48 0.45 0.55 0.53 0.61 1 0.48 0.41 0.56
14 Biceps 0.04 *0.80 0.31 *0.73 *0.72 0.68 *0.73 *0.76 0.67 0.48 1 0.67 0.63
15 Forearm 0.08 0.63 0.32 0.62 0.58 0.50 0.54 0.56 0.55 0.41 0.67 1 0.58
16 Wrist 0.21 *0.72 0.39 *0.74 0.66 0.61 0.63 0.55 0.66 0.56 0.63 0.58 1
Potential solution to printing out only the correlations whose absolute values are 0.7 and above:
%let big=0.7;
data bigcorr;
set pearson;
array vars{*} &interval;
do i=1 to dim(vars);
if abs(vars{i})<&big then vars{i}=.;
end;
if _type_="CORR";
drop i _type_;
run;
Obs _NAME_ Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist
1 Age 1.00 . . . . . . . . . . . .
2 Weight . 1.00 . 0.83 0.89 0.89 0.94 0.87 0.85 . 0.80 . 0.73
3 Height . . 1.00 . . . . . . . . . .
4 Neck . 0.83 . 1.00 0.78 0.75 0.73 . . . 0.73 . 0.74
5 Chest . 0.89 . 0.78 1.00 0.92 0.83 0.73 0.72 . 0.73 . .
6 Abdomen . 0.89 . 0.75 0.92 1.00 0.87 0.77 0.74 . . . .
7 Hip . 0.94 . 0.73 0.83 0.87 1.00 0.90 0.82 . 0.74 . .
8 Thigh . 0.87 . . 0.73 0.77 0.90 1.00 0.80 . 0.76 . .
9 Knee . 0.85 . . 0.72 0.74 0.82 0.80 1.00 . . . .
10 Ankle . . . . . . . . . 1.00 . . .
11 Biceps . 0.80 . 0.73 0.73 . 0.74 0.76 . . 1.00 . .
12 Forearm . . . . . . . . . . . 1.00 .
13 Wrist . 0.73 . 0.74 . . . . . . . . 1.00
Model: MODEL1
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1    6593.01614        6593.01614      150.03    <.0001
Error             250    10986             43.94389
Corrected Total   251    17579
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1    -12.05158             2.58139             -4.67    <.0001
Weight        1    0.17439               0.01424             12.25    <.0001
[Diagnostic panel and fit plot for the PctBodyFat2 on Weight model: residual and RStudent versus
predicted value plots, residual Q-Q plot, Cook's D, residual histogram, residual-fit spread plot, and
the fit plot of PctBodyFat2 by Weight. Inset fit statistics: Observations 252, Parameters 2,
Error DF 250, MSE 43.944, R-Square 0.3751, Adj R-Square 0.3726.]
a. What is the value of the F statistic and the associated p-value? How would you interpret this with
regard to the null hypothesis?
The value of the F statistic is 150.03 (p<.0001). Therefore, the null hypothesis of a zero
slope for Weight (no linear association) is rejected.
b. Write the predicted regression equation.
The prediction equation is
PctBodyFat2 = -12.05158+0.17439*Weight.
c. What is the value of R-square? How would you interpret this?
The R-square value is 0.3751. Weight explains approximately 37.51% of the variability
in PctBodyFat2.
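Equivalently, R-square is the model sum of squares divided by the corrected total sum of squares from the ANOVA table (a Python check, not part of the course's SAS code):

```python
ss_model = 6593.01614    # Model sum of squares from the ANOVA table
ss_total = 17579         # Corrected Total sum of squares

r_square = ss_model / ss_total    # about 0.3751, matching R-Square
```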
Chapter 3 More Complex Linear Models
3.1 Two-Way ANOVA and Interactions
Objectives
Fit a two-way ANOVA model.
Detect interactions between factors.
Analyze the treatments when there is a significant
interaction.
n-Way ANOVA
Categorical
Predictor
Response
One-Way
ANOVA
1 Predictor
Continuous n-Way
ANOVA
More than
1 Predictor
In the previous chapter, you considered the case where you had one categorical predictor variable. In this
section, consider a case with two categorical predictors. In general, anytime you have more than one
categorical predictor (or explanatory) variable and a continuous response variable, the analysis is called
n-way ANOVA. The n can be replaced with the number of categorical predictor variables.
ANOVA and regression are used to estimate parameters in statistical models, which are simply
the mathematical relationships relating explanatory variables with response variables. The same model
can be expressed in a variety of ways, depending on the desired way of communicating the results.
Much of ANOVA terminology comes from the world of design of experiments (DOE) because ANOVA
is often used to analyze data obtained from designed experiments.
In this course, you will encounter the term effect to mean the magnitude of the expected change
in the response variable presumably caused by the change in value of an explanatory term in the model.
The terms themselves are often referred to as effects in models. Main effects are effects of a single
variable, averaged across the levels of all other explanatory variables.
Sometimes there is evidence that explanatory variables do not seem to relate to the response variable
in an additive fashion, but rather two or more act jointly on the response variable above and beyond the
individual effects of either (any) of the explanatory variables. These effects are called interaction effects
and are explained in more detail in the next slides.
The Model

Yijk = μ + αi + βj + (αβ)ij + εijk
With each additional predictor variable, a new parameter is introduced to the model. The example
in the slide represents a model predicting the change in blood pressure after the administration of a drug.
Notice that the interaction term involves both of the main effects of DrugDose and Disease. This is also
known as a crossed effect in experimental design.
Yijk the observed change in BloodP for each subject from before to after treatment
(αβ)ij the effect of the interaction between the ith Disease and the jth DrugDose
Verifying ANOVA assumptions with more than one variable is discussed in the Statistics 2:
ANOVA and Regression class.
Two-Way ANOVA
Example: Perform a two-way ANOVA of SalePrice with Heating_QC and Season_Sold as predictor
variables.
Before conducting an analysis of variance, you should explore the data.
/*st103d01.sas*/ /*Part A*/
ods graphics off;
proc means data=STAT1.ameshousing3
mean var std nway;
class Season_Sold Heating_QC;
var SalePrice;
format Season_Sold Season.;
title 'Selected Descriptive Statistics';
run;
Selected PROC MEANS statement option:
NWAY When you include CLASS variables, NWAY specifies that the output data set contains only
statistics for the observations with the highest _TYPE_ and _WAY_ values. NWAY
corresponds to the combination of all class variables.
PROC MEANS Output
Analysis Variable : SalePrice Sale price in dollars
Season when house sold   Heating quality and condition   N Obs   Mean   Variance   Std Dev
Winter Ex 6 145583.33 1579141667 39738.42
Fa 3 58100.00 321330000 17925.68
Gd 10 124330.00 935189000 30580.86
TA 16 121312.50 1679295833 40979.21
Spring Ex 41 153765.24 1129742652 33611.64
Fa 7 98657.14 452506190 21272.19
Gd 18 149619.83 1082782633 32905.66
TA 34 129404.41 767370965 27701.46
Summer Ex 45 154279.42 1244833504 35282.20
Fa 5 128800.00 1332825000 36507.88
Gd 22 113727.27 1155184935 33988.01
TA 58 134046.55 1138642444 33743.78
Fall Ex 15 163726.93 2436449681 49360.41
Fa 1 45000.00 . .
Gd 8 143812.50 547495536 23398.62
TA 11 129345.45 462560727 21507.23
The mean sale price is always lowest for houses with fair heating systems. Note that there is only one
house in the data set with a fair heating system sold in the fall.
To further explore the numerous treatments, examine the means graphically.
/*st103d01.sas*/ /*Part B*/
proc sgplot data=STAT1.ameshousing3;
vline Season_Sold / group=Heating_QC
stat=mean
response=SalePrice
markers;
format Season_Sold season.;
run;
VLINE Creates a vertical line chart (the line is horizontal).
Selected VLINE statement options:
GROUP= Specifies a category variable that is used to group the data. A separate plot
is created for each unique value of the grouping variable. The plot elements for
each group value are automatically distinguished by different visual attributes.
STAT= specifies the statistic for the vertical axis.
RESPONSE= Specifies a numeric response variable for the plot. The summarized values
of the response variable are displayed on the vertical axis.
MARKERS adds data point markers to the series plot data points.
[Vertical line plot of mean SalePrice (sale price in dollars) by Season_Sold, grouped by Heating_QC]
The relationship might be clearer in the graph. The season a home is sold does not seem to affect the sale
price very much, except where the heating system is fair. For those homes, the mean sale price seems
markedly lower in the colder seasons. This plot is exploratory, and helps you plan your analysis. Later
you see similar plotting output directly from PROC GLM.
You can use the GLM procedure to discover the effects of both Season_Sold and Heating_QC.
/*st103d01.sas*/ /*Part C*/
ods graphics on;
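The PROC GLM step itself is not reproduced above. A sketch consistent with the output that follows (statement order, options, and the title are assumed) is:

```sas
proc glm data=STAT1.ameshousing3 plots=all;
   class Heating_QC Season_Sold;
   /* main effects only; listing Heating_QC first matches the Type I table shown below */
   model SalePrice = Heating_QC Season_Sold;
   lsmeans Season_Sold;
   format Season_Sold Season.;
   title 'Two-Way ANOVA of SalePrice';
run;
quit;
```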
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 6 72774816066 12129136011 10.14 <.0001
Error 293 350448703445 1196070660.2
Corrected Total 299 423223519511
The global test is of the null hypothesis that all means are equal for all explanatory variables. Notice that
the number of degrees of freedom for this test is 6. Both Season_Sold and Heating_QC account for three
degrees of freedom (number of categories minus 1). The statistically significant p-value indicates that not
all means are equal for all explanatory variables. It does not indicate which mean values differ.
R-Square Coeff Var Root MSE SalePrice Mean
0.171954 25.14764 34584.25 137524.9
The R-Square value of 0.171954 shows that about 17% of the variability in SalePrice is explained
by the two categorical predictors.
Source DF Type I SS Mean Square F Value Pr > F
Heating_QC 3 66835556221 22278518740 18.63 <.0001
Season_Sold 3 5939259845 1979753282 1.66 0.1768
In this case, the test of Heating_QC is an unadjusted test, because no other terms appear above it, whereas
the Season_Sold test adjusts for Heating_QC, which appears before it. The order of effects in the MODEL
statement determines the ordering in this table.
Source DF Type III SS Mean Square F Value Pr > F
Heating_QC 3 60050783038 20016927679 16.74 <.0001
Season_Sold 3 5939259845 1979753282 1.66 0.1768
The Type III sums of squares are commonly called partial sums of squares. The Type III sum of squares
for a particular variable is the increase in the model sum of squares due to adding the variable to a model
that already contains all the other variables in the model. Type III sums of squares, therefore, do not
depend on the order in which the explanatory variables are specified in the model. The Type III SS values
are not generally additive (except in a completely balanced design, where all categories of all inputs
contain exactly the same sample size). The values do not necessarily sum to the Model SS.
In this example, the Type I effects and Type III effects differ only slightly. There seems to be no
significant differences across all levels of the Season_Sold variable, whereas there are differences across
the Heating_QC variable.
You will generally interpret and report results based on the Type III SS.
[Interaction plot of SalePrice (sale price in dollars) by season when house sold (1-4), grouped by Heating_QC (Ex, Fa, Gd, TA), from the main effects model]
This plot differs from the exploratory plot because it imposes a main effects model on the data. In other
words, the effect of each variable is not permitted to differ at different levels of the other variable. That
constraint can be relaxed by adding an interaction term.
Season_Sold   SalePrice LSMEAN   LSMEAN Number
1             117255.605         1
2             131263.281         2
3             128216.231         3
4             133543.394         4
[Plot of SalePrice LS-means by Season_Sold]
[Plot comparing the Season_Sold LS-means against one another]
Interactions
An interaction occurs when the differences between group means on one variable change at different
levels of another variable.
The average blood pressure change over different doses was plotted in mean plots and then connected for
disease A and B.
In the left plot above, different types of disease show the same change across different levels of dose.
In the right plot, however, as the dose increases, average blood pressure decreases for those with disease
A, but increases for those with disease B. This indicates an interaction between the variables DrugDose
and Disease.
When you analyze an n-way ANOVA with interactions, you should first look at any tests for interaction
among factors.
If there is no interaction between the factors, the tests for the individual factor effects (main effects) can
be interpreted as true effects of that factor.
If an interaction exists between any factors, the tests for the individual factor effects might be misleading.
These are known as tests of marginal effects and only tell part of the story about the overall effect of that
variable.
Yijk = μ + αi + βj + εijk
When the interaction is not statistically significant, the main effects can be analyzed with the model
as originally written. This is generally the method used when analyzing designed experiments.
However, even when analyzing designed experiments, some statisticians suggest that if the interaction
is nonsignificant, the interaction effect can be deleted from the model and then the main effects are
analyzed. This increases the power of the main effects tests.
Neter, Kutner, Wasserman, and Nachtsheim (1996) suggest guidelines for when to delete the interaction
from the model:
There are fewer than five degrees of freedom for the error.
The F value for the interaction term is < 2.
When you analyze data from an observational study, it is more common to delete
the non-significant interaction from the model and then analyze the main effects.
PROC GLM …;
   MODEL Y = A B A*B;
or
   MODEL Y = A|B;
RUN;
QUIT;
An interaction term can be specified in PROC GLM using the * operator between two listed variables.
An alternate way of specifying a full factorial model is through the use of the bar operator (|). You can
shorten the specification of a large factorial model by using the bar operator. For example, two ways
of writing the model for a full three-way factorial model follow:
model Y=A B C A*B A*C B*C A*B*C;
model Y=A|B|C;
When the bar (|) is used, the right and left sides become effects, and the cross of them becomes an effect.
Multiple bars are permitted.
You can also specify the maximum number of variables involved in any effect that results from bar
evaluation by specifying that maximum number, preceded by an @ sign, at the end of the bar effect.
For example, the specification A | B | C @2 would result in only those effects that contain 2 or fewer
variables: in this case, A B A*B C A*C and B*C.
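For the blood pressure example used in this chapter's exercise, the two forms below are equivalent. This is a sketch; the data set and variable names are those defined for the STAT1.Drug exercise, and the second MODEL statement is shown as a comment because PROC GLM accepts only one.

```sas
proc glm data=STAT1.drug;
   class DrugDose Disease;
   /* full factorial written out with the * operator */
   model BloodP = DrugDose Disease DrugDose*Disease;
   /* equivalently, with the bar operator:
      model BloodP = DrugDose|Disease;          */
run;
quit;
```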
STORE <OUT=>item-store-name
</ LABEL='label'>;
The STORE statement applies to the following SAS/STAT procedures: GENMOD, GLIMMIX, GLM,
GLMSELECT, LOGISTIC, MIXED, ORTHOREG, PHREG, PROBIT, SURVEYLOGISTIC,
SURVEYPHREG, and SURVEYREG. This statement requests that the procedure save the context
and results of the statistical analysis into an item store. An item store is a binary file format that cannot
be modified by the user. The contents of the item store can be processed with the PLM procedure.
One example of item-store use is to perform a time-consuming analysis and to store its results by using
the STORE statement. At a later time, you can then perform specific statistical analysis tasks based
on the saved results of the previous analysis, without having to fit the model again.
In the STORE statement:
item-store-name is a usual one- or two-level SAS name, similar to the names that are used for SAS data
sets. If you specify a one-level name, then the item store resides in the Work library and
is deleted at the end of the SAS session. Because item stores usually are used to perform
postprocessing tasks, typical usage specifies a two-level name of the form
libname.membername.
label identifies the estimate on the output. A label is optional but must be enclosed
in quotation marks.
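As a sketch of the syntax (the library, item-store name, and label below are illustrative, not from the course programs):

```sas
proc glm data=STAT1.ameshousing3;
   class Heating_QC;
   model SalePrice = Heating_QC;
   /* save the fitted model for later post-processing with PROC PLM */
   store out=stat1.heating_model / label='One-way ANOVA of SalePrice';
run;
quit;
```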
The PLM procedure performs post-fitting statistical analyses and plotting for the contents of a SAS item
store that were previously created with the STORE statement in some other SAS/STAT procedure.
The statements that are available in the PLM procedure are designed to reveal the contents of the source
item store via the Output Delivery System (ODS) and to perform post-fitting tasks.
The use of item stores and PROC PLM enables you to separate common post-processing tasks, such
as testing for treatment differences and predicting new observations under a fitted model, from the
process of model building and fitting. A numerically expensive model fitting technique can be applied
once to produce a source item store. The PLM procedure can then be called multiple times and the results
of the fitted model are analyzed without incurring the model fitting expenditure again.
Selected PROC PLM option:
RESTORE specifies the source item store for processing.
Selected PROC PLM procedure statements:
EFFECTPLOT The EFFECTPLOT statement produces a display of the fitted model and provides
options for changing and enhancing the displays.
LSMEANS computes and compares least squares means (LS-means) of fixed effects.
LSMESTIMATE provides custom hypothesis tests among least squares means.
SHOW uses the Output Delivery System to display contents of the item store. This statement
is useful for verifying that the contents of the item store apply to the analysis and for
generating ODS tables.
SLICE The SLICE statement provides a general mechanism for performing a partitioned
analysis of the LS-means for an interaction. This analysis is also known as an analysis
of simple effects. The SLICE statement uses the same options as the LSMEANS
statement.
WHERE is used in the PLM procedure when the item store contains BY-variable information
and you want to apply the PROC PLM statements to only a subset of the BY groups.
Example: Perform a two-way ANOVA of SalePrice with Heating_QC and Season_Sold as predictor
variables. Include the interaction between the two explanatory variables.
/*st103d02.sas*/ /*Part A*/
ods graphics on;
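The PROC GLM step is again not reproduced above. Based on the output that follows and the Part B PROC PLM step that restores an item store named interact, it was likely similar to this sketch (exact options assumed):

```sas
proc glm data=STAT1.ameshousing3 plots=all;
   class Season_Sold Heating_QC;
   /* main effects plus their interaction */
   model SalePrice = Heating_QC Season_Sold Heating_QC*Season_Sold;
   /* SLICE= tests Season_Sold within each level of Heating_QC */
   lsmeans Heating_QC*Season_Sold / slice=Heating_QC;
   store out=interact;
   format Season_Sold Season.;
run;
quit;
```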
The number of degrees of freedom for the model is now 15. This includes 3 degrees of freedom for each
of the main effects and 3*3=9 degrees of freedom for the interaction term. The model is statistically
significant.
R-Square Coeff Var Root MSE SalePrice Mean
0.230634 24.62130 33860.40 137524.9
The R-Square for this model is 0.230634, which means that about 23% of the variability in SalePrice
is explained.
Source DF Type I SS Mean Square F Value Pr > F
Heating_QC 3 66835556221 22278518740 19.43 <.0001
Season_Sold 3 5939259845 1979753282 1.73 0.1617
Season_So*Heating_QC 9 24835058089 2759450899 2.41 0.0121
The interaction effect is statistically significant at the 0.05 alpha level. This means that the effect
of Season_Sold differs at different levels of Heating_QC and that the effect of Heating_QC differs
at different levels of Season_Sold. In the main effects model, it appeared that Season_Sold was not related
to SalePrice. Now it can be seen that it is related, but in a more complex way.
[Interaction plot of SalePrice (sale price in dollars) by Season_Sold, grouped by Heating_QC (Ex, Fa, Gd, TA), from the interaction model]
The interaction model is able to reflect the data more accurately than the main effects model.
The SLICE= option in the LSMEANS statement enables you to look at the effect of Season_Sold
at all levels of Heating_QC.
Note: To ensure the overall protection level, only probabilities associated with pre-planned comparisons
should be used.
The tests displayed are of Season_Sold within each slice of Heating_QC. There is a larger seasonal
effect for houses with fair and good heating systems. The note reminds you that these p-values are not
adjusted for multiple tests. Adjustment is possible using PROC PLM.
/*st103d02.sas*/ /*Part B*/
proc plm restore=interact plots=all;
slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
effectplot interaction(sliceby=Heating_QC) / clm;
run;
Partial PROC PLM Output
Store Information
Item Store WORK.INTERACT
Data Set Created From STAT1.AMESHOUSING3
Created By PROC GLM
Date Created 02JUL14:10:35:36
Response Variable SalePrice
Class Variables Season_Sold Heating_QC
Model Effects Intercept Heating_QC Season_Sold Season_So*Heating_QC
The slice of excellent heating systems shows that there is no significant effect of season. So, skip
to the analysis of fair heating systems:
Note that the F test is not adjusted. However, the pairwise tests are adjusted.
Simple Differences of Season_So*Heating_QC Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
Slice         Season sold  Season sold  Estimate  Standard Error  DF  t Value  Pr > |t|  Adj P
Heating_QC Fa Winter Spring -40557 23366 284 -1.74 0.0837 0.3071
Heating_QC Fa Winter Summer -70700 24728 284 -2.86 0.0046 0.0235
Heating_QC Fa Winter Fall 13100 39099 284 0.34 0.7378 0.9870
Heating_QC Fa Spring Summer -30143 19827 284 -1.52 0.1295 0.4267
Heating_QC Fa Spring Fall 53657 36198 284 1.48 0.1394 0.4495
Heating_QC Fa Summer Fall 83800 37092 284 2.26 0.0246 0.1102
[Plot of SalePrice LS-means by season for fair (Fa) heating systems]
For fair systems, the only statistically significant pairwise comparison is between summer and winter.
F Test for Season_So*Heating_QC Least
Squares Means Slice
Num Den
Slice DF DF F Value Pr > F
Heating_QC Gd 3 284 4.23 0.0060
[Plot of SalePrice LS-means by season for good (Gd) heating systems]
There was a significant mean sale price difference for houses with good heating systems between
the spring months and the winter months.
F Test for Season_So*Heating_QC Least
Squares Means Slice
Num Den
Slice DF DF F Value Pr > F
Heating_QC TA 3 284 0.62 0.6021
Exercises
Data were collected in an effort to determine whether different dose levels of a given drug have
an effect on blood pressure for people with one of three types of heart disease. The data are
in the STAT1.Drug data set.
The data set contains the following variables:
DrugDose dosage level of drug (1, 2, 3, 4), corresponding to (Placebo, 50 mg, 100 mg, 200 mg)
Disease heart disease category
BloodP change in diastolic blood pressure after 2 weeks treatment
a. Use the SGPLOT procedure to examine the data with a vertical line plot. Put BloodP on the Y
axis, DrugDose on the X axis, and then stratify by Disease. What information can you obtain
from looking at the data?
b. Test the hypothesis that the means are equal, making sure to include an interaction term if the
results from PROC SGPLOT indicate that would be advisable. What conclusions can you reach
at this point in your analysis?
Objectives
Explain the mathematical model for multiple
regression.
Describe the main advantage of multiple regression
versus simple linear regression.
Explain the standard output from the REG procedure.
Describe common pitfalls of multiple linear regression.
In simple linear regression, you can model the relationship between the two variables (two dimensions)
with a line (one dimension).
For the two-variable model, you can model the relationship of three variables (three dimensions) with
a plane (two dimensions).
3.2 Multiple Regression 3-27
[Sketch of a horizontal regression plane for Y over the (X1, X2) plane]
If there is no relationship among Y and X1 and X2, the model is a horizontal plane passing through
the point (Y=β0, X1=0, X2=0).
[Sketch of a sloping regression plane for Y over the (X1, X2) plane]
If there is a relationship among Y and X1 and X2, the model is a sloping plane passing through three
points:
(Y=β0, X1=0, X2=0)
(Y=β0+β1, X1=1, X2=0)
(Y=β0+β2, X1=0, X2=1)
Linear model with only linear effects:  Y = β0 + β1X1 + β2X2 + ε
Linear model with nonlinear effects:    Y = β0 + β1X1 + β2X1² + β3X2 + β4X2² + ε
You investigate the relationship among k+1 variables (k predictors+1 response) using a k-dimensional
surface for prediction.
The multiple general linear model is not restricted to modeling only planar relationships. By using higher
order terms, such as quadratic or cubic powers of the Xs or cross products of one X with another, surfaces
more complex than planes can be modeled. It should be noted, though, that these are still linear models
because the response variable is related to the predictor terms by a linear function.
In the examples in this course, the models are limited to relatively simple surfaces.
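For example, a quadratic surface such as the nonlinear model above can be requested in PROC GLM by crossing a continuous predictor with itself. This is a sketch; the data set and variable names Y, X1, and X2 are placeholders, not from the course data.

```sas
proc glm data=example;
   /* quadratic terms are formed by crossing a continuous
      variable with itself; the model is still linear in the betas */
   model Y = X1 X1*X1 X2 X2*X2;
run;
quit;
```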
The model has p=k+1 parameters (the βs), including the intercept, β0.
Alternative Hypothesis:
The regression model does fit the data better than
the baseline model.
Not all βi equal zero.
If the estimated linear regression model does not fit the data better than the baseline model, you fail
to reject the null hypothesis. Thus, you do not have enough evidence to say that all of the slopes of the
regression in the population differ from zero. The predictor variables do not explain a significant amount
of variability in the response variable.
If the estimated linear regression model does fit the data better than the baseline model, you reject the null
hypothesis. Thus, you do have enough evidence to say that at least one slope of the regression in the
population differs from zero. At least one predictor variable explains a significant amount of variability
in the response variable.
Techniques to evaluate the validity of these assumptions are discussed in a later chapter.
The advantage of performing multiple linear regression over a series of simple linear regression models
far outweighs the disadvantages. In practice, many responses depend on multiple factors that might
interact in some way.
SAS tools help you decide upon a “best” model, a choice that might depend on the purposes
of the analysis, as well as subject-matter expertise.
Even though multiple linear regression enables you to analyze many experimental designs, ranging from
simple to complex, you focus on applications for analytical studies and predictive modeling. Other SAS
procedures, such as GLM, are better suited for analyzing experimental data.
The distinction between using multiple regression for an analytical study and for predictive modeling
is somewhat artificial. A model developed for prediction is probably a good analytical model. Conversely,
a model developed for an analytical study is probably a good prediction model.
Myers (1999) refers to four applications of regression:
prediction
variable screening
model specifications
parameter estimation
The term analytical analysis is similar to Myers’ parameter estimation application and variable screening.
Prediction
The terms in the model, the values of their coefficients,
and their statistical significance are of secondary
importance.
The focus is on producing a model that is the best at
predicting future values of Y as a function of the Xs.
The predicted value of Y is given by this formula:
Most investigators whose main goal is prediction do not ignore the terms in the model (the Xs), the values
of their coefficients (the βs), or their statistical significance (the p-values). They use these statistics
to help choose among models with different numbers of terms and predictive capabilities.
Adjusted R-Square

R²ADJ = 1 - ((n - i)(1 - R²)) / (n - p)

where
i = 1 if there is an intercept and 0 otherwise
n = the number of observations used to fit the model
p = the number of parameters in the model
The R-square always increases or stays the same as you include more terms in the model. Therefore,
choosing the “best” model is not as simple as just making the R-square as large as possible.
The adjusted R-square is a measure similar to R-square, but it takes into account the number of terms
in the model. It can be thought of as a penalized version of R-square with the penalty increasing with each
parameter added to the model.
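To see the penalty at work, the following DATA step recomputes the adjusted R-square from the fit statistics of the two-predictor model shown later in this section (n=300, p=3, R-square=0.4802); this is an illustrative check, not part of the course programs.

```sas
data _null_;
   r_square = 0.4802;            /* model R-square                    */
   n = 300;                      /* observations used to fit the model */
   p = 3;                        /* parameters, including the intercept */
   i = 1;                        /* the model has an intercept          */
   adj_r_square = 1 - ((n - i) * (1 - r_square)) / (n - p);
   put adj_r_square= 6.4;        /* prints 0.4767, matching the output */
run;
```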
Example: Perform a regression model of SalePrice with Lot_Area and Basement_Area as predictor
variables.
/*st103d03.sas*/ /*Part A*/
ods graphics on;
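The PROC REG step itself is not reproduced above. A sketch consistent with the output that follows (the diagnostic panels shown later are produced by default when ODS Graphics is on; the title is assumed):

```sas
proc reg data=STAT1.ameshousing3;
   /* two continuous predictors of sale price */
   model SalePrice = Basement_Area Lot_Area;
   title 'Model with Basement Area and Lot Area';
run;
quit;
```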
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 2.032206E11 1.016103E11 137.17 <.0001
Error 297 2.200029E11 740750509
Corrected Total 299 4.232235E11
The R-Square value is much greater than for the model that included only Lot_Area. It is now 0.4802.
The adjusted R-Square is also greater than in the simple model.
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 69016 5129.52179 13.45 <.0001
Basement_Area Basement area in square feet 1 70.08680 4.54618 15.42 <.0001
Lot_Area Lot size in square feet 1 0.80430 0.49210 1.63 0.1032
In this model, Lot_Area is no longer significant. The parameter estimate for each explanatory variable
adjusts for the other variable in the model. The Lot_Area estimate is notably different from the one in the
simple regression model (2.87 there versus 0.80 here), and its p-value no longer shows statistical
significance.
[PROC REG diagnostic panels: residuals and RStudent versus predicted value, quantile-quantile plot of residuals, Cook's D by observation, residual histogram, and fit-mean plot, with the fit statistics: Observations 300, Parameters 3, Error DF 297, MSE 7.41E8, R-Square 0.4802, Adj R-Square 0.4767]
The quantile-quantile plot of residuals does not indicate problems with the assumption of normally
distributed error.
[Plots of residuals versus each predictor]
The residuals show no pattern, although lot size does show some outliers.
This same model can be run in PROC GLM. Additional plots can be obtained using PROC GLM
and PROC PLM.
/*st103d03.sas*/ /*Part B*/
proc glm data=STAT1.ameshousing3
plots(only)=(contourfit);
model SalePrice=Basement_Area Lot_Area;
store out=multiple;
title "Model with Basement Area and Lot Area";
run;
Selected PLOTS option:
CONTOURFIT modifies the contour fit plot produced by default when you have a model
involving only two continuous predictors. The plot displays a contour plot
of the predicted surface overlaid with a scatter plot of the observed data.
PROC GLM Output
Number of Observations Read 300
Number of Observations Used 300
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 2 203220618262 101610309131 137.17 <.0001
Error 297 220002901249 740750509.26
Corrected Total 299 423223519511
Notice that the ANOVA table contains the same information as in PROC REG.
R-Square Coeff Var Root MSE SalePrice Mean
0.480173 19.79041 27216.73 137524.9
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 69015.61360 5129.521790 13.45 <.0001
Basement_Area 70.08680 4.546183 15.42 <.0001
Lot_Area 0.80430 0.492102 1.63 0.1032
The estimates table gives the same results (within rounding error) as the parameter estimates table
in PROC REG.
The contour plot shows predicted values of SalePrice as gradations of color from blue (low values) to red
(high values). The dots for the actual data are similarly colored. Observations that are perfectly fit would
show the same color within the circle as outside the circle. The blue lines help you read the actual
predictions at even intervals. For example, the circle that is being pointed at in the plot has a basement
area of about 1,500 square feet, a lot size of about 17,000 square feet and a predicted value of over
$180,000 for sale price. Its color shows that its observed sale price is actually closer to about $160,000.
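As a quick arithmetic check, the prediction for the highlighted observation can be reproduced directly from the parameter estimates table above (a Python sketch; the coefficients are copied from the PROC GLM output, and the inputs are the approximate values read from the plot):

```python
# Apply the fitted regression equation from the PROC GLM output above.
# Coefficients are the reported parameter estimates (rounded in the output).
def predict_sale_price(basement_area, lot_area):
    return 69015.61360 + 70.08680 * basement_area + 0.80430 * lot_area

# The highlighted circle: basement area ~1,500 sq ft, lot area ~17,000 sq ft.
pred = predict_sale_price(1500, 17000)
print(round(pred, 2))  # -> 187818.91, i.e., just over $180,000
```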
PROC PLM can use the item store for further analysis.
/*st103d03.sas*/ /*Part C*/
proc plm restore=multiple plots=all;
effectplot contour (y=Basement_Area x=Lot_Area);
effectplot slicefit(x=Lot_Area
sliceby=Basement_Area=250 to 1000 by 250);
run;
Selected EFFECTPLOT options:
CONTOUR Displays a contour plot of predicted values against two continuous covariates.
SLICEFIT Displays a curve of predicted values versus a continuous variable grouped
by the levels of another effect.
Store Information
Item Store WORK.MULTIPLE
Data Set Created From STAT1.AMESHOUSING3
Created By PROC GLM
Date Created 03JUL14:11:40:15
Response Variable SalePrice
Model Effects Intercept Basement_Area Lot_Area
PROC PLM does not display the observed values on the contour plot.
[Slice plot: sale price in dollars (vertical axis) plotted against Lot_Area.]
Another way of displaying the results of a 2-predictor regression model is through the use of a slice plot.
The regression lines represent the slices of Basement_Area that were specified in the SAS code.
Exercises
3.3 Solutions
Solutions to Exercises
1. Performing Two-Way ANOVA
Data were collected in an effort to determine whether different dose levels of a given drug have
an effect on blood pressure for people with one of three types of heart disease. The data are
in the STAT1.Drug data set.
The data set contains the following variables:
DrugDose dosage level of drug (1, 2, 3, 4), corresponding to (Placebo, 50 mg, 100 mg, 200 mg)
Disease heart disease category
BloodP change in diastolic blood pressure after 2 weeks treatment
a. Use the SGPLOT procedure to examine the data with a vertical line plot. Put BloodP on the Y
axis, DrugDose on the X axis, and then stratify by Disease. What information can you obtain
from looking at the data?
/*st103s01.sas*/ /*Part A*/
proc sgplot data=STAT1.drug;
vline DrugDose / group=Disease
stat=mean
response=BloodP
markers;
format DrugDose dosefmt.;
run;
[Line plot of mean BloodP by DrugDose, with separate lines for Disease groups A, B, and C.]
It appears that drug dose affects change in blood pressure. However, that effect is
not consistent across diseases. Higher doses result in increased blood pressure for
patients with disease B, decreased blood pressure for patients with disease A, and
little change in blood pressure for patients with disease C.
b. Test the hypothesis that the means are equal, making sure to include an interaction term
if the results from PROC SGPLOT indicate that would be advisable. What conclusions can you
reach at this point in your analysis?
/*st103s01.sas*/ /*Part B*/
ods graphics on;
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 11 36476.8353 3316.0759 7.66 <.0001
Error 158 68366.4589 432.6991
Corrected Total 169 104843.2941
The global F test indicates a significant difference among the different groups. Because the
interaction is in the model, this is a test of all combinations of DrugDose*Disease against all
other combinations.
R-Square Coeff Var Root MSE BloodP Mean
0.347918 -906.7286 20.80142 -2.294118
The R-Square value implies that about 35% of the variation in BloodP can be explained by
variation in the explanatory variables.
Source DF Type I SS Mean Square F Value Pr > F
DrugDose 3 54.03137 18.01046 0.04 0.9886
Disease 2 19276.48690 9638.24345 22.27 <.0001
DrugDose*Disease 6 17146.31698 2857.71950 6.60 <.0001
[Interaction plot: BloodP plotted against DrugDose (1-4), grouped by Disease (A, B, C).]
BloodP
DrugDose Disease LSMEAN
1 A 1.3333333
1 B -8.1333333
1 C 0.4285714
2 A -9.6875000
2 B 5.4000000
2 C -4.8461538
3 A -26.2307692
3 B 24.7857143
3 C -5.1428571
4 A -22.5555556
4 B 23.2307692
4 C 1.3076923
The sliced table shows the effects of DrugDose at each level of Disease. The effect is
significant for all but disease C.
2. Performing Multiple Regression Using the REG Procedure
a. Using the STAT1.BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age,
Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and
Wrist.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 13 13159 1012.22506 54.50 <.0001
Error 238 4420.06401 18.57170
Corrected Total 251 17579
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -21.35323 22.18616 -0.96 0.3368
Age 1 0.06457 0.03219 2.01 0.0460
Weight 1 -0.09638 0.06185 -1.56 0.1205
Height 1 -0.04394 0.17870 -0.25 0.8060
Neck 1 -0.47547 0.23557 -2.02 0.0447
Chest 1 -0.01718 0.10322 -0.17 0.8679
Abdomen 1 0.95500 0.09016 10.59 <.0001
Hip 1 -0.18859 0.14479 -1.30 0.1940
Thigh 1 0.24835 0.14617 1.70 0.0906
Knee 1 0.01395 0.24775 0.06 0.9552
Ankle 1 0.17788 0.22262 0.80 0.4251
Biceps 1 0.18230 0.17250 1.06 0.2917
Forearm 1 0.45574 0.19930 2.29 0.0231
Wrist 1 -1.65450 0.53316 -3.10 0.0021
1) Compare the ANOVA table with that from the model with only Weight in the previous
exercise. What is different?
There are key differences between the ANOVA table for this model and the Simple
Linear Regression model.
The degrees of freedom for the model are much higher, 13 versus 1.
The Mean Square model and the F ratio are much smaller.
2) How do the R-square and the adjusted R-square compare with these statistics for the Weight
regression demonstration?
Both the R-square and adjusted R-square for the full models are larger than the simple
linear regression. The multiple regression model explains almost 75% of the variation in
the PctBodyFat2 variable versus only about 37.5% explained by the simple linear
regression model.
3) Did the estimate for the intercept change? Did the estimate for the coefficient of Weight
change?
Yes, including the other variables in the model changed the estimates both of the
intercept and the slope for Weight. Also, the p-values for both changed dramatically.
The slope of Weight is now not significantly different from zero.
3. Simplifying the Model
Rerun the preceding model, but eliminate the variable with the highest p-value. Compare the output
with the preceding model.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 12 13159 1096.57225 59.29 <.0001
Error 239 4420.12286 18.49424
Corrected Total 251 17579
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -21.30204 22.12123 -0.96 0.3365
Age 1 0.06503 0.03108 2.09 0.0374
Weight 1 -0.09602 0.06138 -1.56 0.1191
Height 1 -0.04166 0.17369 -0.24 0.8107
Neck 1 -0.47695 0.23361 -2.04 0.0423
Chest 1 -0.01732 0.10298 -0.17 0.8666
Abdomen 1 0.95497 0.08998 10.61 <.0001
Hip 1 -0.18801 0.14413 -1.30 0.1933
Thigh 1 0.25089 0.13876 1.81 0.0719
Ankle 1 0.18018 0.21841 0.82 0.4102
Biceps 1 0.18182 0.17193 1.06 0.2913
Forearm 1 0.45667 0.19820 2.30 0.0221
Wrist 1 -1.65227 0.53057 -3.11 0.0021
The p-value for the model did not change out to four decimal places.
b. Did the R-square and adjusted R-square change notably?
The R-square showed essentially no change. The adjusted R-square increased from 0.7348
to 0.7359. When an adjusted R-square increases by removing a variable from the model,
it implies that the removed variable was not necessary.
c. Did the parameter estimates and their p-values change notably?
Some of the parameter estimates and their p-values changed slightly, none to any large
degree.
4. More Simplifying of the Model
Again, rerun the preceding model, but drop the variable with the highest p-value.
This program reruns the regression with Chest removed, because it is the variable with the highest
p-value in the previous model.
/*st103s02.sas*/ /*Part C*/
proc reg data=STAT1.BodyFat2;
model PctBodyFat2=Age Weight Height
Neck Abdomen Hip Thigh
Ankle Biceps Forearm Wrist;
title 'Regression of PctBodyFat2 on All '
'Predictors, Minus Knee, Chest';
run;
quit;
PROC REG Output
Number of Observations Read 252
Number of Observations Used 252
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 11 13158 1196.21310 64.94 <.0001
Error 240 4420.64572 18.41936
Corrected Total 251 17579
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -23.13736 19.20171 -1.20 0.2294
Age 1 0.06488 0.03100 2.09 0.0374
Weight 1 -0.10095 0.05380 -1.88 0.0618
Height 1 -0.03120 0.16185 -0.19 0.8473
Neck 1 -0.47631 0.23311 -2.04 0.0421
Abdomen 1 0.94965 0.08406 11.30 <.0001
Hip 1 -0.18316 0.14092 -1.30 0.1950
Thigh 1 0.25583 0.13534 1.89 0.0599
Ankle 1 0.18215 0.21765 0.84 0.4035
Biceps 1 0.18055 0.17141 1.05 0.2933
Forearm 1 0.45262 0.19634 2.31 0.0220
Wrist 1 -1.64984 0.52930 -3.12 0.0020
Chapter 4 Model Building and
Effect Selection
4.1 Stepwise Selection Using Significance Level .............................................................4-3
Demonstration: Stepwise Regression .................................................................................. 4-10
Exercises .............................................................................................................................. 4-18
4.1 Stepwise Selection Using Significance Level 4-3
Objectives
Describe forward, backward, and stepwise model
selection methodology.
Explain the GLMSELECT procedure options for model
selection using significance level.
Model Selection
Eliminating one variable at a time manually for small
data sets is a reasonable approach.
However, eliminating one variable at a time manually
for large data sets can take an extreme amount of time.
A process for selecting models might be to start with all the variables in the STAT1.ameshousing3 data
set and eliminate the least significant terms, based on p-values.
For a small data set, a final model can be developed in a reasonable amount of time. If you start with
a large model, however, eliminating one variable at a time can take an extreme amount of time. You
would have to continue this process until only terms with p-values lower than some threshold value, such
as 0.05 or 0.10, remain.
PROC GLMSELECT
PROC GLMSELECT DATA=SAS-data-set <options>;
CLASS variables;
MODEL dependent(s)=regressor(s) </ options>;
RUN;
CHOOSE=<option>
SELECT=<option>
STOP=<option>
PROC GLMSELECT offers many choices for selection techniques and criteria through the usage
of different options.
SELECTION specifies the method used to select the model. Possible methods to choose from include
NONE (specifies no model selection), FORWARD, BACKWARD, STEPWISE, LAR,
LASSO, and ELASTICNET. The default is STEPWISE.
CHOOSE specifies the criterion for choosing the model. The specified criterion is evaluated at each
step of the selection process, and the model that yields the best value of the criterion
is chosen. If CHOOSE= is omitted, the model at the final step in the selection process
is selected.
SELECT specifies the criterion that determines the order in which effects enter or leave at each
step of the specified selection method. The default value is SELECT=SBC. The effect
that is selected to enter or leave at a step of the selection process is the effect whose
addition to or removal from the current model gives the maximum improvement
in the specified criterion.
STOP specifies when to stop the selection process. If you do not specify the STOP= option but
do specify the SELECT= option, this criterion will also be used as the STOP= criterion.
Default is STOP=SBC when neither STOP= nor SELECT= is specified. If you specify
STOP=n, then selection will stop at the first step for which the selected model has
n effects.
Forward Selection
Forward selection starts with an empty model. The method computes an F statistic for each predictor
variable not in the model and examines the largest of these statistics. If it is significant at a specified
significance level (specified by the SLENTRY= option), the corresponding variable is added to the
model. After a variable is entered in the model, it is never removed from the model. The process
is repeated until none of the remaining variables meets the specified level for entry. By default,
SLENTRY=0.50.
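The loop described above can be sketched in a few lines (a Python sketch; the pvalue function is a toy stand-in for the F tests that PROC GLMSELECT actually computes, and the effect names and p-values are illustrative only):

```python
# Minimal sketch of p-value-driven forward selection. The pvalue(model,
# candidate) function stands in for the F test evaluated when `candidate`
# is added to `model`; here it is a toy lookup with hypothetical values.
def forward_selection(candidates, pvalue, slentry=0.50):
    model = []
    remaining = list(candidates)
    while remaining:
        # score every effect not yet in the model
        scored = [(pvalue(model, c), c) for c in remaining]
        best_p, best = min(scored)
        if best_p >= slentry:       # nothing meets the entry level: stop
            break
        model.append(best)          # once entered, an effect never leaves
        remaining.remove(best)
    return model

toy = {"Basement_Area": 0.0001, "Gr_Liv_Area": 0.001, "Ankle": 0.40}
result = forward_selection(toy, lambda model, c: toy[c], slentry=0.05)
print(result)  # -> ['Basement_Area', 'Gr_Liv_Area']
```

With SLENTRY=0.05, the loop admits the two strong effects and stops when the best remaining p-value (0.40) fails the entry criterion.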
Backward Elimination
Backward elimination starts off with the full model. Results of the F test for individual parameter
estimates are examined, and the least significant variable that falls above the specified significance level
(specified by the SLSTAY= option) is removed. After a variable is removed from the model, it remains
excluded. The process is repeated until no other variable in the model meets the specified significance
level for removal. By default, SLSTAY=0.10.
Stepwise Selection
Stepwise selection is similar to forward selection in that it starts with an empty model and incrementally
builds a model one variable at a time. However, the method differs from forward selection in that
variables already in the model do not necessarily remain. The backward component of the method
removes variables from the model that do not meet the significance criterion specified in the SLSTAY=
option. The stepwise selection process terminates if no further variables can be added to the model
or if the variable entered into the model is the only variable removed in the subsequent backward
elimination. By default, SLENTRY=0.15 and SLSTAY=0.15.
Stepwise selection methods (FORWARD, BACKWARD, and STEPWISE) have some serious shortcomings. Simulation
studies (Derksen and Keselman 1992) evaluating variable selection techniques found the following:
1. The degree of collinearity among the predictor variables affected the frequency with which authentic
predictor variables found their way into the final model.
2. The number of candidate predictor variables affected the number of noise variables that gained entry
to the model.
3. The size of the sample was of little practical importance in determining the number of authentic
variables contained in the final model.
One recommendation is to use the variable selection methods to create several candidate models, and then
use subject-matter knowledge to select the variables that result in the best model within the scientific
or business context of the problem. Therefore, you are simply using these methods as a useful tool
in the model-building process (Hosmer and Lemeshow 2000).
Statisticians give warnings and cautions about the appropriate interpretation of p-values from models
chosen using any automated variable selection technique. Refitting many submodels in terms of an
optimum fit to the data distorts the significance levels of conventional statistical tests. However, many
researchers and users of statistical software neglect to report that the models that they ended up with were
chosen using automated methods. They report statistical quantities such as standard errors, confidence
limits, p-values, and R-square as if the resulting model were entirely prespecified. These inferences are
inaccurate, tending to err on the side of overstating the significance of predictors and making predictions
with overly optimistic confidence. This problem is very evident when there are many iterative stages
in model building. When there are many variables and you use stepwise selection to find a small subset
of variables, inferences become less accurate (Chatfield 1995, Raftery 1994, Freedman 1983).
One solution to this problem is to split your data. One part can be used for finding the regression model
and the other part can be used for inference. Another solution is to use bootstrapping methods to obtain
the correct standard errors and p-values. Bootstrapping is a resampling method that tries to approximate
the distribution of the parameter estimates to estimate the standard error.
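The bootstrap idea can be illustrated for the slope of a one-predictor model (a Python sketch with made-up data; a real application would refit the full regression model on each resample):

```python
import random
import statistics

# Bootstrap sketch: resample observations with replacement, refit the model,
# and use the spread of the refit estimates as the standard error.
def slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_se(xs, ys, reps=1000, seed=1):
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        slopes.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.stdev(slopes)

# Hypothetical data: y is roughly 2x plus alternating noise.
xs = list(range(20))
ys = [2 * x + e for x, e in zip(xs, [1, -1] * 10)]
se = bootstrap_se(xs, ys)
print(round(se, 4))
```

The standard deviation of the resampled slopes approximates the sampling variability of the estimate without relying on the distorted post-selection inference described above.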
Stepwise Regression
Example: Select a model for predicting SalePrice in the STAT1.ameshousing3 data set by using
the STEPWISE selection method. Use 0.05 as the significance level for entry into and staying
in the model.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
/*st104d01.sas*/
ods graphics on;
proc glmselect data=STAT1.ameshousing3 plots=all;
STEPWISE: model SalePrice=&interval / selection=stepwise
details=steps select=SL slstay=0.05 slentry=0.05;
title "Stepwise Model Selection for SalePrice - SL 0.05";
run;
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 0 0 . .
Error 299 4.232235E11 1415463276
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 137525 2172.144314 63.31
Recall that STEPWISE selection begins like FORWARD selection with just the intercept. Then, subject
to the criterion specified, an effect enters the model, if possible.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 1 2.012418E11 2.012418E11 270.16
Error 298 2.219817E11 744904950
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 73904 4179.193780 17.68
Basement_Area 1 72.107717 4.387055 16.44
The p-values can be displayed on the Parameter Estimates table by including the
SHOWPVALUES option in the MODEL statement.
Entry Candidates
Rank Effect Log pValue Pr > F
1 Basement_Area -98.8577 <.0001
2 Gr_Liv_Area -84.6132 <.0001
3 Age_Sold -73.5219 <.0001
4 Total_Bathroom -69.1880 <.0001
5 Garage_Area -63.3558 <.0001
6 Deck_Porch_Area -34.3105 <.0001
7 Lot_Area -11.6303 <.0001
8 Bedroom_AbvGr -5.5339 0.0040
[Bar chart of the log p-values for the entry candidates; Basement_Area has the smallest (most negative) value.]
During each step of the selection process, SAS displays both a table and a graph of the entry candidates for
that individual step. In step one, there are several entry candidates whose significance level is displayed
as <.0001. To distinguish between these candidates, SAS also displays the log of the p-value for each
effect. From both the table and the graph, you see that Basement_Area will be first to enter the model.
Upon completion of the selection process, SAS generates a summary table detailing the steps taken
in the development of the model. The F values and p-values shown in this summary table are not
the F and p-values for the selected model. These are statistics from each individual step. Final p-values
and parameter estimates can be found in the table preceding this summary or at the conclusion
of the output.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 7 3.424508E11 48921543221 176.86
Error 292 80772716963 276618894
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 47463 5880.674041 8.07
Gr_Liv_Area 1 65.303724 5.436672 12.01
Basement_Area 1 29.849078 3.345400 8.92
Garage_Area 1 36.309606 6.452405 5.63
Deck_Porch_Area 1 32.052554 7.967677 4.02
Lot_Area 1 0.708127 0.317512 2.23
Age_Sold 1 -447.198682 41.019314 -10.90
Bedroom_AbvGr 1 -5042.766498 1687.928168 -2.99
Adding the SHOWPVALUES option to the MODEL statement will add p-values to the output
within the Parameter Estimates table.
[Coefficient Panel: standardized coefficients for Gr_Liv_Area, Basement_Area, Garage_Area, Deck_Porch_Area, Lot_Area, Bedroom_AbvGr, and Age_Sold plotted against the effect sequence.]
In the Coefficient Panel, PROC GLMSELECT displays a panel of two plots showing how the standardized
coefficients and the criterion used to choose the final model evolved as the selection progressed. In this
image you can monitor the change in the standardized coefficients as each effect is added to or deleted
from the model.
[Criteria Panel: Adj R-Sq, AIC, AICC, and SBC plotted against the effect sequence; a star marks the step with the best criterion value.]
The Criteria Panel displays the progression of the adjusted R-square, AIC, AICC, and SBC, as well as any
other criteria that are named in the CHOOSE=, SELECT=, STOP=, or STATS= option. In this example,
the star denotes the best of the eight models that were tested.
[Plot of average squared error against the effect sequence; the ASE decreases as effects are added.]
The Average Squared Error Plot shows the progression of the average squared error (ASE) evaluated
on the training data. As more effects are added to the model, the ASE decreases for the training data.
When a test or validation data set is provided, this plot also contains information about the ASE
in those data sets. This plot is best used with a hold-out data set to detect overfitting.
Additional code has been included that performs FORWARD and BACKWARD selection with
the selection criterion set as significance level. In both cases, the SLENTRY and SLSTAY criteria
have been changed to 0.05.
proc glmselect data=STAT1.ameshousing3 plots=all;
FORWARD: model SalePrice=&interval / selection=forward
details=steps select=SL slentry=0.05;
title "Forward Model Selection for SalePrice - SL 0.05";
run;
Stepwise Models
FORWARD (slentry=0.05), STEPWISE (slentry=0.05, slstay=0.05), and BACKWARD (slstay=0.05)
all selected the same effects: Basement_Area, Gr_Liv_Area, Age_Sold, Garage_Area,
Deck_Porch_Area, Bedroom_AbvGr, and Lot_Area.
The final models obtained using the SLENTRY=0.05 and SLSTAY=0.05 criteria are displayed for
FORWARD, BACKWARD, and STEPWISE. In this instance, all the selected models matched. It is
important to note that the choice of SLENTRY and SLSTAY levels can greatly affect the final models that
are selected using stepwise methods. Some analysts use larger boundaries to get models down to a
manageable size and then reduce them manually, instead of using low values for SLENTRY and SLSTAY.
Exercises
4.2 Information Criterion and Other Selection Options 4-19
4.01 Poll
The STEPWISE, BACKWARD, and FORWARD
strategies result in the same final model if the same
significance levels are used in all three.
True
False
Objectives
Describe different criteria available within PROC
GLMSELECT to perform model selection.
Compare models provided from PROC GLMSELECT
using different selection criteria.
Information Criteria
Akaike’s information criterion (AIC)
Corrected Akaike’s information criterion (AICC)
Sawa Bayesian information criterion (BIC)
Schwarz Bayesian information criterion (SBC)
Smaller is better.
Beyond significance level, there are several statistics, referred to as information criteria, that can be used
both to evaluate competing models and to direct the selection process within PROC GLMSELECT.
These criteria each search for a model that minimizes the unexplained variability using as few effects
in the model as possible (the most parsimonious model).
Each information criterion begins with n·log(SSE/n). It then adds a penalty representing the complexity
of the model. The penalties are shown below, where n is the number of observations,
p is the number of parameters including the intercept, and σ̂² is the estimate of pure error variance from
fitting the full model. For each of these information criteria, smaller is better.

AIC    2p + n + 2
AICC   n(n + p) / (n − p − 2)
BIC    2(p + 2)q − 2q²
SBC    p·log(n)

In the BIC penalty, q = nσ̂² / SSE.
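Coded directly from these definitions, the criteria can be computed for any candidate model (a Python sketch; the numeric inputs below are arbitrary illustration values, not output from the course data):

```python
import math

# n = observations, p = parameters including the intercept,
# sse = error sum of squares, sigma2 = pure-error variance (full model).
def info_criteria(n, p, sse, sigma2):
    base = n * math.log(sse / n)       # common first term n*log(SSE/n)
    q = n * sigma2 / sse               # q used in the BIC penalty
    return {
        "AIC":  base + 2 * p + n + 2,
        "AICC": base + n * (n + p) / (n - p - 2),
        "BIC":  base + 2 * (p + 2) * q - 2 * q ** 2,
        "SBC":  base + p * math.log(n),
    }

crit = info_criteria(n=300, p=3, sse=2.2e11, sigma2=7.4e8)
for name in ("AIC", "AICC", "BIC", "SBC"):
    print(name, round(crit[name], 1))  # smaller is better for each criterion
```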
R²ADJ = 1 − (n − i)(1 − R²) / (n − p)
Other choices of selection criteria within PROC GLMSELECT include adjusted R-square and Mallows’
Cp.
The R-square always increases or stays the same as you include more terms in the model. Therefore,
choosing the “best” model is not as simple as just making the R-square as large as possible.
The adjusted R-square is a measure similar to R-square, but it takes into account the number of terms
in the model. It can be thought of as a penalized version of R-square with the penalty increasing with each
parameter added to the model. In the equation, i=1 if there is an intercept and 0 otherwise. The number
of observations used to fit the model is n and the number of parameters in the model is p.
More discussion about Mallows' Cp can be found in the self-study section of this chapter.
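The adjusted R-square comparison from the body-fat exercise earlier in this chapter can be reproduced from this formula (a Python sketch using the rounded sums of squares printed in the PROC REG output):

```python
# Adjusted R-square from the formula above: i = 1 when the model
# includes an intercept, n = observations, p = parameters in the model.
def adj_rsquare(r2, n, p, intercept=True):
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r2) / (n - p)

# n = 252 observations; p counts parameters including the intercept.
r2_full    = 1 - 4420.06401 / 17579   # 13 predictors -> p = 14
r2_reduced = 1 - 4420.12286 / 17579   # Knee dropped  -> p = 13
print(round(adj_rsquare(r2_full, 252, 14), 4))     # -> 0.7348
print(round(adj_rsquare(r2_reduced, 252, 13), 4))  # -> 0.7359
```

Dropping Knee barely changes R-square, yet the adjusted value rises, which is the signal that the removed variable was not needed.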
Example: Invoke PROC GLMSELECT four times on the SalePrice variable, regressing on the interval
variables (you can use the macro variable &interval) within STAT1.ameshousing3. For each, request
STEPWISE selection with the SELECTION= option and include DETAILS=STEPS to obtain
step information and the selection summary table. Use SELECT=AIC, SELECT=AICC,
SELECT=BIC, and SELECT=SBC, one in each PROC GLMSELECT invocation, and compare the
selected models from the output.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
/*st104d02.sas*/
ods graphics on;
[Coefficient Panel: standardized coefficients and the AIC plotted against the effect sequence.]
The AIC component of the Coefficient Panel shows larger improvements to the AIC across steps one
through three and moderate improvements across steps four and five. After step five, the AIC
improves only minimally. It is also at step five that the standardized coefficients
have stabilized and do not appear to vary as new effects are added to the model. One could entertain
stopping after five effects and include that as an additional model to be validated using a hold-out
data set.
[Criteria Panel: Adj R-Sq, AIC, AICC, and SBC plotted against the effect sequence; a star marks the selected step.]
The Criteria Panel displays several of the other fit statistics for the model at each step. The AIC and AICC
are minimized and the adjusted R-square is maximized at step eight. The SBC is shown to minimize
at step six.
Effects: Intercept Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold
Bedroom_AbvGr Total_Bathroom
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 8 3.431321E11 42891512314 155.84
Error 291 80091420996 275228251
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 44347 6191.271944 7.16
Gr_Liv_Area 1 63.197764 5.585739 11.31
Basement_Area 1 28.692184 3.417034 8.40
Garage_Area 1 35.754191 6.445840 5.55
Deck_Porch_Area 1 31.370539 7.959436 3.94
Lot_Area 1 0.699495 0.316761 2.21
Age_Sold 1 -420.815037 44.219144 -9.52
Bedroom_AbvGr 1 -4834.848748 1688.858227 -2.86
Total_Bathroom 1 3022.124723 1920.839066 1.57
As with the AIC output, the selection process did not stop before all effects were added to the model.
[Coefficient Panel: standardized coefficients for Gr_Liv_Area, Basement_Area, Garage_Area, Deck_Porch_Area, Lot_Area, Bedroom_AbvGr, and Age_Sold, with the selection criterion, plotted by effect sequence]
The Coefficient Panel also mimics the patterns that you noticed from the AIC selection setup.
[Criteria Panel: AIC, AICC, SBC, BIC, and Adjusted R-Square plotted by effect sequence, with each plot's best criterion value and selected step marked]
Recall that the Criteria Panel, by default, contains AIC, AICC, SBC, and adjusted R-squared. With
the inclusion of SELECT=BIC, the BIC plot is included in the output. Again, you see similarities between
this panel and the one generated by the AIC selection.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 8 3.431321E11 42891512314 155.84
Error 291 80091420996 275228251
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 44347 6191.271944 7.16
Gr_Liv_Area 1 63.197764 5.585739 11.31
Basement_Area 1 28.692184 3.417034 8.40
Garage_Area 1 35.754191 6.445840 5.55
Deck_Porch_Area 1 31.370539 7.959436 3.94
Lot_Area 1 0.699495 0.316761 2.21
Age_Sold 1 -420.815037 44.219144 -9.52
Bedroom_AbvGr 1 -4834.848748 1688.858227 -2.86
Total_Bathroom 1 3022.124723 1920.839066 1.57
[Coefficient Panel: standardized coefficients for Gr_Liv_Area, Basement_Area, Garage_Area, Deck_Porch_Area, Lot_Area, Bedroom_AbvGr, and Age_Sold, with the selection criterion, plotted by effect sequence]
The results and plots from the AICC selection again mimic those of both AIC and BIC selection.
The parameter estimates from the AICC selection match those from both AIC and BIC selection.
[Coefficient Panel: standardized coefficients for Gr_Liv_Area, Basement_Area, Garage_Area, Deck_Porch_Area, Bedroom_AbvGr, and Age_Sold, with SBC, plotted by effect sequence]
The Coefficient Panel shows similarities to those seen earlier. Larger improvements in SBC occur across
steps one through three, while comparatively minimal improvements occur over steps four through six.
As in the previous plots, the standardized coefficients appear to stabilize after step four. One could
entertain a four-variable model as an option.
[Criteria Panel: AIC, AICC, SBC, and Adjusted R-Square plotted by effect sequence (six steps), with each plot's best criterion value and selected step marked]
The Criteria Panel shows that, considering only the models viewed, the optimal fit statistics were
obtained at step six.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 6 3.410749E11 56845818595 202.75
Error 293 82148607939 280370676
Corrected Total 299 4.232235E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 48620 5897.324643 8.24
Gr_Liv_Area 1 65.097413 5.472624 11.90
Basement_Area 1 31.279351 3.305546 9.46
Garage_Area 1 38.728785 6.403565 6.05
Deck_Porch_Area 1 32.487956 8.019119 4.05
Age_Sold 1 -434.199118 40.877494 -10.62
Bedroom_AbvGr 1 -4189.095026 1655.065743 -2.53
The decision of which selection methods and criteria to use typically depends on the area of research
to which the problem is applied. Consider the standards and practices that are common in your
individual research or work area. When multiple methods yield different models, honest assessment
with a hold-out data set is encouraged.
The model selection strategies discussed over the past two sections generate a list of possible models
from which you can choose. To aid in the decision of which model is “better,” you can consult a subject
matter expert. Another option is to perform honest assessment on the models in question using
a hold-out data set. This option is discussed in a later chapter of this course.
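In SAS, a hold-out assessment of this kind can be sketched with the PARTITION statement in PROC GLMSELECT. The validation fraction, the seed value, and the CHOOSE=VALIDATE suboption below are illustrative assumptions, not part of the demonstration above:

```sas
/* Sketch: reserve a random 30% of the data for validation and let
   GLMSELECT choose the step with the smallest validation error.
   SEED= makes the random split reproducible. */
proc glmselect data=STAT1.ameshousing3 seed=27513;
   partition fraction(validate=0.3);
   model SalePrice=Gr_Liv_Area Basement_Area Garage_Area
                   Deck_Porch_Area Lot_Area Age_Sold
                   Bedroom_AbvGr Total_Bathroom
         / selection=stepwise(select=aic choose=validate);
run;
quit;
```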
Exercises
4.3 All Possible Selection (Self-Study) 4-35
Objectives
Explain the REG procedure options for all possible
model selection.
Describe model selection options and interpret
output to evaluate the fit of several models.
Model Selection
Data set contains eight interval variables as potential
predictors.
A process for selecting models might be to start with all the interval variables in the
STAT1.ameshousing3 data set and invoke some form of stepwise selection discussed in previous
sections. This could be done by hand or with the assistance of SAS.
An alternative option is to explore all possible models that can be formed from the predictor variables
provided and determine which is “best.” This method of all-possible selection can be performed using
PROC REG.
Number of Variables    Number of Possible Models
0                      1
1                      2
2                      4
3                      8
4                      16
5                      32
In the STAT1.ameshousing3 data set, there are eight possible independent variables. Therefore, there
are 2^8 = 256 possible regression models. There are eight possible one-variable models, 28 possible two-
variable models, 56 possible three-variable models, and so on.
You can choose to look at only the best n models (as measured by the model R-square for k=1, 2, 3, …, 7)
by using the BEST= option in the MODEL statement. The BEST= option only reduces the output.
All regressions are still calculated.
If there were 20 possible independent variables, there would be more than 1,000,000 models.
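The growth in the number of candidate models can be tabulated with a short DATA step; the data set name here is illustrative:

```sas
/* With k predictors, each predictor is either in or out of a model,
   so there are 2**k candidate models. */
data model_counts;
   do k = 1 to 20;
      n_models = 2**k;   /* k=8 gives 256; k=20 gives 1,048,576 */
      output;
   end;
run;

proc print data=model_counts noobs;
run;
```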
Mallows’ Cp
Mallows’ Cp is a simple indicator of effective
variable selection within a model.
Look for models with Cp ≤ p, where p equals
the number of parameters in the model, including
the intercept.
Mallows recommends choosing the first (fewest
variables) model where Cp approaches p.
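The notes do not state the formula; a commonly used definition of Mallows’ Cp (an assumption here, drawn from standard references) is:

```latex
% Mallows' C_p for a candidate model with p parameters (including
% the intercept), relative to the full model's error variance
C_p = \frac{SSE_p}{MSE_{\text{full}}} - (n - 2p)
```

where SSE_p is the candidate model’s error sum of squares, MSE_full is the mean squared error of the full model, and n is the number of observations. A model with little bias has Cp close to p.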
Hocking suggested the use of the Cp statistic, but with alternative criteria, depending on the purpose of the
analysis. His suggestion of Cp ≤ 2p − pfull + 1 is included in the REG procedure’s calculations of criteria
reference plots for best models.
Example: Invoke PROC REG to produce a regression of SalePrice on all the other interval variables
in the STAT1.ameshousing3 data set.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
There are many models to compare. It would be unwieldy to try to determine the best model by viewing
the output tables. Therefore, it is advisable to look at the ODS plots.
[Plot: R-Square versus Number of Parameters for all candidate models]
The R-square plot compares all models based on their R-square values. As noted earlier, adding variables
to a model always increases R-square, and therefore the full model is always best. Therefore, you can
only use the R-square value to compare models of equal numbers of parameters.
[Plot: Adjusted R-Square versus Number of Parameters for all candidate models]
The adjusted R-square does not have the problem that the R-square has. You can compare models of
different sizes. In this case, it is difficult to see which model has the higher adjusted R-square, the starred
model for seven parameters or eight parameters.
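For reference, the adjustment penalizes additional parameters; a standard definition (not stated in the notes) is:

```latex
% Adjusted R-square for a model with p parameters (including the
% intercept) fit to n observations
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p}
```

so that adding a variable increases the adjusted R-square only when the gain in R-square outweighs the loss of an error degree of freedom.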
[Plot: Mallows' C(p) versus Number of Parameters, with reference lines for the Mallows and Hocking criteria]
The line Cp = p is plotted to help you identify models that satisfy the criterion Cp ≤ p for prediction.
The lower line is plotted to help identify which models satisfy Hocking's criterion Cp ≤ 2p − pfull + 1 for
parameter estimation.
Use the graph and review the output to select a relatively short list of models that satisfy the criterion
appropriate for your objective. It is often the case that the best model is difficult to see because of the
range of Cp values at the high end. These models are clearly not the best and therefore you can focus on
the models near the bottom of the range of Cp.
/*st104d03.sas*/ /*Part B*/
proc reg data=STAT1.ameshousing3 plots(only)=(cp);
ALLPOSS: model SalePrice=&interval / selection=cp rsquare adjrsq
best=20;
title "Best Models Using All Possible Selection for SalePrice";
run;
quit;
Selected MODEL statement options:
BEST=n limits the output to only the best n models.
Model Index   Number in Model   C(p)   R-Square   Adjusted R-Square   Variables in Model
1 8 9.0000 0.8108 0.8056 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
Total_Bathroom
2 7 9.4754 0.8091 0.8046 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
3 7 11.8765 0.8076 0.8030 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Age_Sold Bedroom_AbvGr
Total_Bathroom
4 6 12.4745 0.8059 0.8019 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Age_Sold Bedroom_AbvGr
5 7 15.1956 0.8054 0.8008 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Lot_Area Age_Sold Total_Bathroom
6 6 15.7530 0.8038 0.7997 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Age_Sold Total_Bathroom
7 6 16.4459 0.8033 0.7993 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Lot_Area Age_Sold
8 5 17.0005 0.8017 0.7983 Gr_Liv_Area Basement_Area Garage_Area
Deck_Porch_Area Age_Sold
9 7 22.5339 0.8007 0.7959 Gr_Liv_Area Basement_Area Garage_Area Lot_Area
Age_Sold Bedroom_AbvGr Total_Bathroom
10 6 23.7403 0.7986 0.7944 Gr_Liv_Area Basement_Area Garage_Area Lot_Area
Age_Sold Bedroom_AbvGr
11 6 25.8313 0.7972 0.7931 Gr_Liv_Area Basement_Area Garage_Area Age_Sold
Bedroom_AbvGr Total_Bathroom
12 5 27.1943 0.7950 0.7915 Gr_Liv_Area Basement_Area Garage_Area Age_Sold
Bedroom_AbvGr
13 6 32.9173 0.7926 0.7884 Gr_Liv_Area Basement_Area Garage_Area Lot_Area
Age_Sold Total_Bathroom
14 5 33.3028 0.7911 0.7875 Gr_Liv_Area Basement_Area Garage_Area Age_Sold
Total_Bathroom
15 5 35.3618 0.7897 0.7861 Gr_Liv_Area Basement_Area Garage_Area Lot_Area
Age_Sold
16 4 35.7387 0.7882 0.7853 Gr_Liv_Area Basement_Area Garage_Area Age_Sold
17 7 37.7677 0.7907 0.7857 Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area
Age_Sold Bedroom_AbvGr Total_Bathroom
18 6 39.3019 0.7885 0.7841 Gr_Liv_Area Basement_Area Deck_Porch_Area Lot_Area
Age_Sold Bedroom_AbvGr
19 6 45.8708 0.7842 0.7798 Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold
Bedroom_AbvGr Total_Bathroom
20 5 47.7363 0.7817 0.7780 Gr_Liv_Area Basement_Area Deck_Porch_Area Age_Sold
Bedroom_AbvGr
[Plot: Mallows' C(p) versus Number of Parameters, zoomed to the lower range of C(p)]
In this example the number of parameters in the full model, pfull, equals 9 (eight variables plus the
intercept).
The smallest model that falls under the Hocking line has p=9, the full model. This model also has a Cp
value exactly equal to p, falling directly on the Mallows line. From this information, your full model
appears to be a potential model for both prediction and variable explanation. This result is likely to change
if additional continuous predictors are included in the analysis.
If multiple models, sharing the same number of parameters, fall below these lines, there are several
options that can be used to make a decision. First, the analyst can appeal to a subject matter expert who
could potentially provide previous experiences that could “break the tie.” Secondly, other fit statistics
could be used as a comparison between the models. Perhaps one of the models has a higher adjusted
R-square value. Thirdly, the models in question could be compared using a hold-out data set, especially
when the focus is prediction.
Exercises
4.4 Solutions
Solutions to Exercises
1. Using Significance Level Model Selection Techniques
Use the STAT1.BodyFat2 data set to identify a set of “best” models.
a. With the SELECTION=STEPWISE option, use SELECT=SL to identify a set of candidate
models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck,
Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Use the default
values for SLENTRY and SLSTAY.
/*st104s01.sas*/ /*Part A*/
ods graphics on;
proc glmselect data=STAT1.bodyfat2 plots=all;
STEPWISESL: model PctBodyFat2=Age Weight Height Neck Chest
Abdomen Hip Thigh Knee Ankle Biceps Forearm
Wrist / SELECTION=STEPWISE SELECT=SL;
title 'SL STEPWISE Selection with PctBodyFat2';
run;
quit;
Partial PROC GLMSELECT Output
Stepwise Selection Summary
Effect Effect Number
Step Entered Removed Effects In F Value Pr > F
0 Intercept 1 0.00 1.0000
1 Abdomen 2 488.93 <.0001
2 Weight 3 50.58 <.0001
3 Wrist 4 8.15 0.0047
4 Forearm 5 6.78 0.0098
5 Neck 6 2.73 0.1000
6 Age 7 2.58 0.1098
7 Thigh 8 3.66 0.0569
Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has
SLS < 0.15.
The STEPWISE selection process, using significance level criteria, appears to select an eight-effect model
(including the intercept).
[Coefficient Panel: standardized coefficients for Abdomen, Thigh, Wrist, and Weight, with entry p-values, plotted by selected step]
The Coefficient Panel shows that the standardized coefficients do not vary greatly as additional
effects are added to the model.
[Fit Criteria Panel: AIC, AICC, SBC, and Adjusted R-Square plotted by effect sequence (Intercept, 1+Abdomen, 2+Weight, 3+Wrist, 4+Forearm, 5+Neck, 6+Age, 7+Thigh)]
The Fit Panel indicates that the best model, according to AIC, AICC, and Adjusted R-square,
is the final model viewed during the selection process. SBC shows a minimum at step four.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 7 13087 1869.59160 101.56
Error 244 4491.84861 18.40922
Corrected Total 251 17579
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 -33.257991 9.006812 -3.69
Age 1 0.068166 0.030792 2.21
Weight 1 -0.119441 0.034025 -3.51
Neck 1 -0.403802 0.220620 -1.83
Abdomen 1 0.917885 0.069499 13.21
Thigh 1 0.221960 0.116013 1.91
Forearm 1 0.553139 0.184788 2.99
Wrist 1 -1.532401 0.510415 -3.00
The parameter estimates from the selected model are presented in the Parameter Estimates table.
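Because SBC reached its minimum at step four while selection continued to step seven, you could also ask PROC GLMSELECT to return the SBC-optimal model. The parenthesized CHOOSE= suboption below is a sketch, not part of the exercise solution:

```sas
/* Sketch: enter and remove effects by significance level, but choose
   as the final model the step where SBC is smallest. */
proc glmselect data=STAT1.bodyfat2;
   model PctBodyFat2=Age Weight Height Neck Chest Abdomen Hip Thigh
                     Knee Ankle Biceps Forearm Wrist
         / selection=stepwise(select=sl choose=sbc);
   title 'SL STEPWISE Selection with Model Chosen by SBC';
run;
quit;
```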
b. Try FORWARD.
/*st104s01.sas*/ /*Part B*/
proc glmselect data=STAT1.bodyfat2 plots=all;
FORWARDSL: model PctBodyFat2=Age Weight Height Neck Chest
Abdomen Hip Thigh Knee Ankle Biceps Forearm
Wrist / SELECTION=FORWARD SELECT=SL;
title 'SL FORWARD Selection with PctBodyFat2';
run;
quit;
[Coefficient Panel: standardized coefficients for Abdomen, Thigh, Biceps, Wrist, and Weight, with entry p-values, plotted by effect sequence]
The Coefficient Panel shows that the standardized coefficients do not vary greatly as additional
effects are added to the model.
[Fit Criteria Panel: AIC, AICC, SBC, and Adjusted R-Square plotted by effect sequence through step 10]
The Fit Panel indicates that the best models, according to AIC, AICC, Adjusted R-square, and SBC,
occur at various steps in the selection progression.
Effects: Intercept Age Weight Neck Abdomen Hip Thigh Ankle Biceps Forearm Wrist
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 10 13158 1315.76595 71.72
Error 241 4421.33035 18.34577
Corrected Total 251 17579
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 -25.999624 12.153156 -2.14
Age 1 0.065093 0.030919 2.11
Weight 1 -0.107396 0.042068 -2.55
Neck 1 -0.467490 0.228115 -2.05
Abdomen 1 0.957721 0.072760 13.16
Hip 1 -0.179124 0.139083 -1.29
Thigh 1 0.259259 0.133892 1.94
Ankle 1 0.184526 0.216864 0.85
Biceps 1 0.186171 0.168580 1.10
Forearm 1 0.453031 0.195932 2.31
Wrist 1 -1.656662 0.527061 -3.14
The parameter estimates from the selected model are presented in the Parameter Estimates table.
c. How many variables would result from a model using FORWARD selection and a significance
level for entry criterion of 0.05, instead of the default SLENTRY of 0.50?
/*st104s01.sas*/ /*Part C*/
proc glmselect data=STAT1.bodyfat2 plots=all;
FORWARDSL: model PctBodyFat2=Age Weight Height Neck Chest
Abdomen Hip Thigh Knee Ankle Biceps Forearm
Wrist / SELECTION=FORWARD SELECT=SL
SLENTRY=0.05;
title 'SL FORWARD (0.05) Selection with PctBodyFat2';
run;
quit;
When the SLENTRY is changed from default to 0.05, the number of effects in the selected model
reduces to five (including the intercept).
2. Using Other Model Selection Techniques
Use the STAT1.BodyFat2 data set to identify a set of “best” models.
a. With the SELECTION=STEPWISE option, use SELECT=SBC to identify a set of candidate
models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck,
Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
/*st104s02.sas*/ /*Part A*/
ods graphics on;
proc glmselect data=STAT1.bodyfat2 plots=all;
STEPWISESBC: model PctBodyFat2=Age Weight Height Neck Chest
Abdomen Hip Thigh Knee Ankle Biceps Forearm
Wrist / SELECTION=STEPWISE SELECT=SBC;
title 'SBC STEPWISE Selection with PctBodyFat2';
run;
quit;
The STEPWISE selection process, using SELECT=SBC appears to select a five effect model
(including the intercept).
[Coefficient Panel: standardized coefficients for Abdomen, Forearm, Wrist, and Weight, with SBC, plotted by selected step]
The Coefficient Panel shows that the standardized coefficients do not vary greatly as additional
effects are added to the model.
[Fit Criteria Panel: AIC, AICC, SBC, and Adjusted R-Square plotted by effect sequence (Intercept, 1+Abdomen, 2+Weight, 3+Wrist, 4+Forearm)]
The Fit Panel indicates that the best model, according to AIC, AICC, Adjusted R-square and SBC,
is the final model viewed during the selection process. Remember that this statement is made
comparing only the models that were viewed in these steps of the selection process.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 4 12921 3230.18852 171.28
Error 247 4658.23577 18.85925
Corrected Total 251 17579
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 -34.854074 7.245005 -4.81
Weight 1 -0.135631 0.024748 -5.48
Abdomen 1 0.995751 0.056066 17.76
Forearm 1 0.472928 0.181661 2.60
Wrist 1 -1.505562 0.442666 -3.40
The parameter estimates from the selected model are presented in the Parameter Estimates table.
b. Try SELECT=AIC.
/*st104s02.sas*/ /*Part B*/
proc glmselect data=STAT1.bodyfat2 plots=all;
STEPWISEAIC: model PctBodyFat2=Age Weight Height Neck Chest
Abdomen Hip Thigh Knee Ankle Biceps Forearm
Wrist / SELECTION=STEPWISE SELECT=AIC;
title 'AIC STEPWISE Selection with PctBodyFat2';
run;
quit;
Using SELECT=AIC, the selected model contains nine effects (including the intercept).
3. Using All-Regression Techniques
a. With the SELECTION=CP option, use an all-possible regression technique to identify a set
of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight,
Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.
Hint: Select only the best 60 models based on Cp to compare.
/*st104s03.sas*/ /*Part A*/
ods graphics / imagemap=on;
The plot indicates that the best model according to Mallows’ criterion is an eight-parameter
(seven variables plus an intercept) model. The best model according to Hocking’s criterion
has 10 parameters (including the intercept).
A partial table of the 60 models, their C(p) values, and the numbers of variables in the
models is displayed.
Chapter 5 Model Post-Fitting for
Inference
5.1 Examining Residuals ....................................................................................................5-3
Demonstration: Residual Plots ............................................................................................. 5-10
Exercises .............................................................................................................................. 5-17
5.1 Examining Residuals 5-3
Objectives
Review the assumptions of linear regression.
Examine the assumptions with scatter plots
and residual plots.
Unknown Relationship
Y = b0 + b1X
Recall that the model for the linear regression has the form Y = b0 + b1X + ε. When you perform a regression
analysis, several assumptions about the error terms must be met to provide valid tests of hypothesis
and confidence intervals. The assumptions are that the error terms
have a mean of 0 at each value of the predictor variable
are normally distributed at each value of the predictor variable
have the same variance at each value of the predictor variable
are independent.
5.01 Poll
Predictor variables are assumed to be normally
distributed in linear regression models.
True
False
To illustrate the importance of plotting data, four examples were developed by Anscombe (1973).
In each example, the scatter plot of the data values is different. However, the regression equation,
Y=3.0+0.5X, and the R-square statistic, 0.67, are the same.
In the first plot, a regression line adequately describes the data.
In the second plot, a simple linear regression model is not appropriate because you are fitting a straight
line through a curvilinear relationship.
In the third plot, there seems to be an outlying data value that is affecting the regression line. This outlier
is an influential data value in that it is substantially changing the fit of the regression line.
In the fourth plot, the outlying data point dramatically changes the fit of the regression line. In fact,
the slope would be undefined without the outlier.
The four plots illustrate that relying on the regression output to describe the relationship between your
variables can be misleading. The regression equations and the R-square statistics are the same even
though the relationships between the two variables are different. Always produce a scatter plot before
you conduct a regression analysis.
Verifying Assumptions
To verify the assumptions for regression, you can use the residual values from the regression analysis
as your best estimates of the error terms. Residuals are defined as follows:

   ri = Yi − Ŷi

where Ŷi is the predicted value for the ith value of the dependent variable.
Remedy is to analyze
using PROC AUTOREG.
The graphs above are plots of residual values versus predicted values or predictor variable values for
four models fit to different sets of data. If model assumptions are valid, then the residual values should
be randomly scattered about a reference line at 0. Any patterns or trends in the residuals might indicate
problems in the model.
1. The model form appears to be adequate because the residuals are randomly scattered about
a reference line at 0 and no patterns appear in the residual values.
2. The model form is incorrect. The plot indicates that the model should take into account curvature
in the data. One possible solution is to add a quadratic term as one of the predictor variables.
3. The variance is not constant. As you move from left to right, the variance increases. One possible
solution is to transform your dependent variable. Another possible solution is to use either
PROC GENMOD or PROC GLIMMIX, and choose a model that does not assume equal variances.
4. The observations are not independent. For this graph, the residuals tend to be followed by residuals
with the same sign, which is called autocorrelation. This problem can occur when you have
observations that were collected over time. A possible solution is to use the AUTOREG procedure
in SAS/ETS software.
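As a sketch of remedy 2 above (curvature), a quadratic term can be added directly in the MODEL statement of PROC GLM; the data set and variable names are hypothetical:

```sas
/* Fit y on x plus a quadratic term to capture curvature. PROC GLM
   accepts the crossed effect x*x in the MODEL statement. */
proc glm data=mydata;
   model y = x x*x;
run;
quit;
```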
Detecting Outliers
Besides verifying assumptions, it is also important to check for outliers. Observations that are far away
from the bulk of your data are outliers. These observations are often data errors or reflect unusual
circumstances. In either case, it is good statistical practice to detect these outliers and find out why they
occurred.
Residual Plots
Example: Invoke the REG procedure, noting the default graphics. Then use a PLOTS= option to
produce full-sized ODS residual plots and diagnostic plots for the model that includes all interval
predictor variables.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
Partial Output
Residual and diagnostic plots are produced in the DIAGNOSTICS panel plot. (Several of these are
discussed in more detail later in the chapter.)
[Panel of plots: residuals versus each interval predictor (Gr_Liv_Area, Basement_Area, Garage_Area, Deck_Porch_Area, Lot_Area, Age_Sold, Bedroom_AbvGr, Total_Bathroom)]
The plot of the residuals versus the values of the interval predictor variables is shown above. They show
no obvious trends or patterns in the residuals. Recall that independence of residual errors (no trends)
is an assumption for linear regression, as is constant variance across all levels of all predictor variables
(and across all levels of the predicted values, which is seen earlier).
When visually inspecting residual plots, whether a pattern exists is a matter of judgment
for the viewer. If there is any question about the presence of a pattern, investigate possible
causes further.
Hint: If you want to view the DIAGNOSTICS panel plots separately, specify
PLOTS=DIAGNOSTICS(UNPACK) in the PROC REG statement. You can also specify each plot
individually by name. Individual plots are produced full sized.
/*st105d01.sas*/ /*Part B*/
proc reg data=STAT1.ameshousing3
plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
CONTINUOUS: model SalePrice = &interval;
title 'SalePrice Model - Plots of Diagnostic Statistics';
run;
quit;
Selected REG statement PLOTS= options:
PLOTS(ONLY)= produces only the plots listed and suppresses printing of default plots.
QQ produces a residual Quantile-Quantile plot to assess the normality
of the residual errors.
RESIDUALBYPREDICTED produces residuals by predicted values.
RESIDUALS produces residuals by predictor variable values.
You can also use the R option in the MODEL statement of PROC REG to obtain residual
diagnostics. Output from the R option includes the values of the response variable, the predicted
values of the response variable, the standard error of the predicted values, the residuals,
the standard error of the residuals, the studentized residuals, and a summary of the studentized
residuals in tabular rather than graphic form.
The plots of the residuals by predicted values of SalePrice and by each of the predictor variables
are shown below. The residual values appear to be randomly scattered about the reference line
at 0. There are no apparent trends or patterns in the residuals.
[Residual plots of the SalePrice model: residuals by predicted value and by each predictor variable; vertical axis: Residual]
The plot of the residuals against the normal quantiles is shown below. If the residuals are normally
distributed, the plot should appear to be a straight, diagonal line. If the plot deviates substantially from
the reference line, then there is evidence against normality.
The plot below shows little deviation from the expected pattern. Thus, you can conclude that the residuals
do not significantly violate the normality assumption. If the residuals did violate the normality
assumption, then a transformation of the response variable or a different model might be warranted.
PROC REG Output (Continued)
[Quantile-Quantile plot of the residuals versus normal quantiles]
Exercises
1. Examining Residuals
Assess the model obtained from the final forward stepwise selection of predictors for the
STAT1.BodyFat2 data set. Run a regression of PctBodyFat2 on Abdomen, Weight, Wrist,
and Forearm. Create plots of the residuals by the four regressors and by the predicted values
and a normal Quantile-Quantile plot.
a. Do the residual plots indicate any problems with the constant variance assumption?
b. Are there any outliers indicated by the evidence in any of the residual plots?
c. Does the Quantile-Quantile plot indicate any problems with the normality assumption?
Objectives
Use statistics to identify potentially influential
observations.
Influential Observations
5.2 Influential Observations 5-19
Recall in the previous section that you saw examples of data sets where the simple linear regression
model fits were essentially the same. However, plotting the data revealed that the model fits were
different.
One of the examples showed a highly influential observation similar to the example above.
Identifying influential observations in multiple linear regression is more complex because you have more
predictors to consider.
The REG procedure has options to calculate statistics to identify influential observations.
Diagnostic Statistics
Statistics that help identify influential observations
are the following:
Studentized residuals
RSTUDENT residuals
Cook’s D
DFFITS
DFBETAS
The R option in the MODEL statement prints the studentized residuals and the Cook’s D, as well as
others discussed previously. The INFLUENCE option in the MODEL statement prints the RSTUDENT,
DFFITS, and DFBETAS, as well as several others.
One way to check for outliers is to use the studentized residuals. These are calculated by dividing
the residual values by their standard errors. For a model that fits the data well and has no outliers,
you can expect about 68% of the studentized residuals to fall within [-1, 1]. In general, studentized
residuals with an absolute value less than 2.0 could easily occur by chance. Studentized residuals with
an absolute value between 2.0 and 3.0 occur infrequently and could be outliers. Studentized residuals
with an absolute value larger than 3.0 occur rarely by chance alone and should be investigated.
Studentized residuals are often referred to as “standardized residuals.” The cutoff values are
chosen based on the tail probabilities from the normal probability distribution.
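These notes use SAS, but the cutoff logic itself is simple arithmetic. The following Python sketch (not part of the course programs; all values are illustrative) buckets studentized residuals by the rule-of-thumb cutoffs above:

```python
def classify_studentized(residuals, std_errors):
    """Bucket studentized residuals by the rule-of-thumb cutoffs.
    The inputs are parallel lists of illustrative values."""
    buckets = {"ok": [], "possible_outlier": [], "investigate": []}
    for i, (r, se) in enumerate(zip(residuals, std_errors)):
        t = r / se                      # studentized residual
        if abs(t) > 3.0:                # rare by chance alone
            buckets["investigate"].append(i)
        elif abs(t) > 2.0:              # infrequent; could be an outlier
            buckets["possible_outlier"].append(i)
        else:                           # easily occurs by chance
            buckets["ok"].append(i)
    return buckets

example = classify_studentized(
    residuals=[1.2, -5.1, 0.3, 7.4],
    std_errors=[1.0, 2.0, 1.0, 2.0],
)
```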
RSTUDENT
Studentized residuals are the ordinary residuals divided by their standard errors. The RSTUDENT
residuals are similar to the studentized residuals except that they are calculated after deleting the ith
observation. In other words, the RSTUDENT residual is the difference between the observed Y
and the predicted value of Y excluding this observation from the regression.
Note that the labels used for these statistics differ between SAS and SAS Enterprise Guide.
Cook’s D Statistic
Cook’s D statistic is a measure of the simultaneous
change in the parameter estimates when the ith
observation is deleted from the analysis.
A suggested cutoff is Cook's D_i > 4 / n.
To detect influential observations, you can use Cook’s D statistic. This statistic measures the change
in the parameter estimates that results from deleting each observation.
Cook's D_i = (1 / (p * s^2)) * (b - b_(i))' (X'X) (b - b_(i))
where
p      the number of regression parameters
s^2    the mean squared error of the regression model
b      the vector of parameter estimates
b_(i)  the vector of parameter estimates obtained after deleting the ith observation
X'X    the corrected sum of squares and cross-products matrix
Identify observations above the cutoff and investigate the reasons that they occurred.
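The 4/n cutoff can be sketched in a few lines of Python (illustrative values, not the course data sets):

```python
def flag_cooks_d(cooks_d):
    """Return indices of observations whose Cook's D exceeds 4/n,
    where n is the number of observations (illustrative values)."""
    cutoff = 4.0 / len(cooks_d)
    return [i for i, d in enumerate(cooks_d) if d > cutoff]

# With n = 4 the cutoff is 1.0, so only the last observation is flagged.
flagged = flag_cooks_d([0.01, 0.2, 0.9, 1.5])
```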
DFFITS
DFFITSi measures the impact that the ith observation
has on the predicted value.
A suggested cutoff is |DFFITS_i| > 2 * sqrt(p / n).
DFFITS_i = (Ŷ_i - Ŷ_(i)) / s(Ŷ_(i))
where Ŷ_(i) is the ith predicted value when the ith observation is deleted.
Belsley, Kuh, and Welsch (1980) provide this suggested cutoff: |DFFITS_i| > 2 * sqrt(p / n), where p
is the number of terms in the current model, including the intercept, and n is the sample size.
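The DFFITS cutoff is likewise easy to compute by hand. A Python sketch (illustrative values, not course data):

```python
import math

def flag_dffits(dffits, p, n):
    """Return indices with |DFFITS| above the Belsley-Kuh-Welsch cutoff
    2*sqrt(p/n); p counts model terms including the intercept.
    Values are illustrative."""
    cutoff = 2.0 * math.sqrt(p / n)
    return [i for i, d in enumerate(dffits) if abs(d) > cutoff]

# p = 8 terms, n = 300 observations -> cutoff is about 0.327
flagged = flag_dffits([0.31, -0.40, 0.05, 0.33], p=8, n=300)
```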
DFBETAS
Measure of change in the jth parameter estimate
with deletion of the ith observation
One DFBETA per parameter per observation
Helpful for identifying the parameter estimates on which the influence most lies
A suggested cutoff for influence is |DFBETA_ij| > 2 / sqrt(n).
DFBETAS is abbreviated from Difference in Betas. They contain the standardized difference for each
individual coefficient estimate resulting from the omission of the ith observation. They are identified
by column headings with the name of the corresponding predictor in the Output window and also
by plots, if requested in the PROC REG statement. Because there are many DFBETAS, it might be useful
to examine only those corresponding to a large Cook’s D. Large DFBETAS indicate which predictor(s)
might be the cause of the influence.
DFBETA_ij = (b_j - b_(i)j) / s(b_j)
where
b_j     the jth regression parameter estimate
b_(i)j  the jth regression parameter estimate with observation i deleted
s(b_j)  the standard error of b_j
Belsley, Kuh, and Welsch (1980) recommend 2 as a general cutoff value to indicate influential
observations and 2 / sqrt(n) as a size-adjusted cutoff.
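Because there is one DFBETA per parameter per observation, flagging produces (observation, parameter) pairs. A Python sketch of the size-adjusted cutoff (illustrative values only):

```python
import math

def flag_dfbetas(dfbetas_rows, n):
    """Return (observation, parameter) pairs whose |DFBETA| exceeds the
    size-adjusted cutoff 2/sqrt(n). Each row holds one DFBETA per
    parameter for one observation (illustrative values)."""
    cutoff = 2.0 / math.sqrt(n)
    return [
        (i, j)
        for i, row in enumerate(dfbetas_rows)
        for j, d in enumerate(row)
        if abs(d) > cutoff
    ]

# n = 100 -> cutoff = 0.2; only observation 1's second parameter exceeds it.
hits = flag_dfbetas([[0.05, -0.10], [0.02, 0.35]], n=100)
```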
The ODS OUTPUT statement along with the PLOTS= option outputs the data from the influence plots
into separate data sets.
PROC REG Output
Number of Observations Read 300
Number of Observations Used 300
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               7       3.424508E11    48921543221     176.86    <.0001
Error             292       80772716963      276618894
Corrected Total   299       4.232235E11
Parameter Estimates

Variable         Label                                            DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept        Intercept                                         1            47463         5880.67404       8.07     <.0001
Gr_Liv_Area      Above grade (ground) living area square feet      1         65.30372            5.43667      12.01     <.0001
Basement_Area    Basement area in square feet                      1         29.84908            3.34540       8.92     <.0001
Garage_Area      Size of garage in square feet                     1         36.30961            6.45241       5.63     <.0001
Deck_Porch_Area  Total area of decks and porches in square feet    1         32.05255            7.96768       4.02     <.0001
Lot_Area         Lot size in square feet                           1          0.70813            0.31751       2.23     0.0265
Age_Sold         Age of house when sold, in years                  1       -447.19868           41.01931     -10.90     <.0001
Bedroom_AbvGr    Bedrooms above grade                              1      -5042.76650         1687.92817      -2.99     0.0031
[Plot: RStudent residuals by predicted value, with reference lines at ±2; observations beyond the lines are labeled with their observation numbers]
The RStudent plot shows sixteen observations beyond two standard errors from the mean of 0. Those are
identified with their observation numbers. Because you expect 5% of values to be beyond two standard
errors from the mean (remember that RStudent residuals are assumed to be normally distributed), the fact
that you have sixteen that far outside the primary cluster gives no cause for concern. (Five percent of 300
is 15 expected observations.)
Other observations are also labeled in this plot for other reasons such as high leverage.
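The 5% figure is a rule of thumb; the exact two-sided normal tail beyond 2 standard errors is about 4.55%. A quick check in Python (not part of the course programs):

```python
import math

def expected_beyond(k, n):
    """Expected count of residuals beyond k standard errors among n
    observations, assuming a standard normal distribution."""
    phi_k = 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))  # P(Z <= k)
    return n * 2.0 * (1.0 - phi_k)                      # two-sided tail

exact = expected_beyond(2, 300)   # about 13.7 under the exact normal tail
rule_of_thumb = 0.05 * 300        # the 5% shortcut used in the text: 15
```

Either way, sixteen labeled observations out of 300 is about what chance alone predicts.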
[Plot: DFFITS by observation with cutoff reference lines; labeled observations include 151, 218, 227, 240, and 292]
Once again, several observations are flagged as influential points based on DFFITS.
At this point, it might be helpful to see which parameters these observations influence most.
DFBETAS provides that information.
[Panel of DFBETAS plots, one per model parameter, with cutoff reference lines; influential observations such as 123, 151, 218, and 240 are labeled]
Detection of outliers or influential observations with plots is convenient for relatively small data sets, but
for larger data sets, like the housing data, it can be very difficult to discern one observation from another.
One method for extracting only the influential observations from a data set is to output the ODS plots data
into data sets and then subset the influential observations.
The next part of the program prints the influential observations in the influence diagnostic data sets that
were produced using ODS OUTPUT.
/*st105d02.sas*/ /*Part B*/
title;
proc print data=Rstud;
run;
Partial Output
Obs Model Dependent RStudent PredictedValue outLevLabel Observation
1 SigLimit SalePrice 1.73092 185283.46 . 1
2 SigLimit SalePrice 0.67964 180284.34 . 2
3 SigLimit SalePrice 0.63948 104541.46 . 3
4 SigLimit SalePrice -0.58261 169597.56 . 4
The variable outLevLabel is nonmissing only for an observation labeled for any reason on the RStudent
plot.
proc print data=Cook;
run;
Partial Output
Obs Model Dependent CooksD Observation CooksDLabel
1 SigLimit SalePrice 0.01260 1 .
2 SigLimit SalePrice 0.00102 2 .
3 SigLimit SalePrice 0.00186 3 .
4 SigLimit SalePrice 0.00092 4 .
The variable CooksDLabel identifies observations that are deemed influential due to high
Cook’s D values; these are observations having influence on all the estimated parameters as a group.
proc print data=Dffits;
run;
Partial Output
Obs Model Dependent Observation DFFITS DFFITSOUT
1 SigLimit SalePrice 1 0.31861 .
2 SigLimit SalePrice 2 0.09029 .
3 SigLimit SalePrice 3 0.12177 .
4 SigLimit SalePrice 4 -0.08573 .
The variable DFFITSOUT identifies observations that are deemed influential due to high DFFITS
values; these are observations having influence on the predictions.
proc print data=Dfbs;
run;
Partial Output
Obs Model Dependent Observation _DFBETAS1 _DFBETASOUT1 _DFBETAS2 _DFBETASOUT2
1 SigLimit SalePrice 1 -0.00567 . 0.10783 .
2 SigLimit SalePrice 2 0.00677 . 0.02128 .
3 SigLimit SalePrice 3 0.05218 . -0.04403 .
4 SigLimit SalePrice 4 -0.03621 . 0.00052 .
The variables _DFBETASOUT1 through _DFBETASOUT8 identify the observations whose DFBETA
values exceed the threshold for influence. _DFBETASOUT1 represents the value for the intercept.
The other seven variables show influential outliers on each of the predictor variables in the MODEL
statement in PROC REG.
As the number of predictor variables increases, additional panels are required to show all the
information from DFBETAS. This will need to be handled prior to merging data sets. With the
multiple panels for DFBETAS, the DFBS data set is effectively split. The first 300 observations
display the DFBETAS information for the first panel, which includes the first six effects in the
model (including the intercept). The information for the second panel, which includes the final
two effects, is missing. Beginning at observation 301, this is reversed. The following code block
splits the DFBS data set into two parts and combines them into one new data set (DFBS2) using
the UPDATE statement.
data Dfbs01;
set Dfbs (obs=300);
run;
data Dfbs02;
set Dfbs (firstobs=301);
run;
data Dfbs2;
update Dfbs01 Dfbs02;
by Observation;
run;
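The UPDATE statement overlays the two partial data sets, letting nonmissing values from the second fill in the missing values of the first. A rough Python analog of that semantics (field names here are illustrative, not the actual ODS variable names):

```python
def update_by_key(master_rows, transaction_rows, key="Observation"):
    """Rough analog of the DATA step UPDATE statement: start from the
    master row for each key and overwrite only with non-None transaction
    values, so missing values never clobber existing ones.
    Field names are illustrative, not the actual ODS variable names."""
    merged = {row[key]: dict(row) for row in master_rows}
    for row in transaction_rows:
        target = merged.setdefault(row[key], {key: row[key]})
        for field, value in row.items():
            if value is not None:       # missing values do not overwrite
                target[field] = value
    return [merged[k] for k in sorted(merged)]

# Panel 1 carried the first effects; panel 2 carried the remaining ones.
panel1 = [{"Observation": 1, "dfb_Intercept": 0.1, "dfb_Age_Sold": None}]
panel2 = [{"Observation": 1, "dfb_Intercept": None, "dfb_Age_Sold": -0.3}]
combined = update_by_key(panel1, panel2)
```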
The next DATA step merges the four data sets containing the influence data and outputs only
the observations that exceeded the respective influence cutoff levels. The cutoff for RStudent
has been raised to 3 and -3.
The results are then displayed.
data influential;
   /* Merge data sets from above. */
   merge Rstud
         Cook
         Dffits
         Dfbs2;
   by observation;
   /* Flag observations that exceed any influence cutoff. */
   if (abs(Rstudent)>3) or (CooksDLabel ne .) or (DFFITSOUT ne .) then flag=1;
   array dfbetas{*} _dfbetasout: ;
   do i=2 to dim(dfbetas);
      if dfbetas{i} then flag=1;
   end;
   /* Keep only the flagged observations. */
   if flag=1;
   drop i flag;
run;
title;
proc print data=influential;
id observation;
var Rstudent CooksD Dffitsout _dfbetasout:;
run;
PROC PRINT Output
Observation RStudent CooksD DFFITSOUT _DFBETASOUT1 _DFBETASOUT2 _DFBETASOUT3
1 . . . . . 0.11744
5 . . . . 0.12008 -0.21199
7 . 0.01782 0.37928 . 0.11635 .
9 . . . . . .
This table is a summary of the plots displayed previously. From this output, flagged observations can then
be investigated to try to determine what makes these points influential. As always, this would be after
determining that the point was valid and not erroneous data.
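The subsetting logic can be pictured outside SAS as well. This Python sketch (with hypothetical field names standing in for the ODS OUTPUT variables, and None playing the role of a SAS missing value) keeps an observation when any diagnostic was flagged:

```python
def influential_subset(rows, rstudent_cut=3.0):
    """Keep the observation numbers of rows where any influence
    diagnostic was flagged. Each row is one observation merged across
    the diagnostic data sets; None stands in for a SAS missing value,
    and the field names are hypothetical stand-ins."""
    kept = []
    for row in rows:
        flagged = (
            (row.get("RStudent") is not None
             and abs(row["RStudent"]) > rstudent_cut)
            or row.get("CooksDLabel") is not None
            or row.get("DFFITSOUT") is not None
            or any(v is not None
                   for k, v in row.items() if k.startswith("_DFBETASOUT"))
        )
        if flagged:
            kept.append(row["Observation"])
    return kept

rows = [
    {"Observation": 1, "RStudent": 1.7, "CooksDLabel": None, "DFFITSOUT": None},
    {"Observation": 7, "RStudent": 2.1, "CooksDLabel": None, "DFFITSOUT": 0.379},
]
kept = influential_subset(rows)
```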
If the unusual data are erroneous, correct the errors and reanalyze the data.
(In this course, time does not permit discussion of higher order models in any depth. This discussion
is in Statistics 2: ANOVA and Regression.)
Another possibility is that the observation, although valid, could be unusual. If you had a larger sample
size, there might be more observations similar to the unusual ones.
You might have to collect more data to confirm the relationship suggested by the influential observation.
In general, do not exclude data. In many circumstances, some of the unusual observations contain
important information.
If you do choose to exclude some observations, include a description of the types of observations that you
exclude and provide an explanation. Also discuss the limitation of your conclusions, given the exclusions,
as part of your report or presentation.
Exercises
2. Generating Potential Outliers
Using the STAT1.BodyFat2 data set, run a regression model of PctBodyFat2 on Abdomen,
Weight, Wrist, and Forearm.
a. Use plots to identify potential influential observations based on the suggested cutoff values.
b. Output residuals to a data set, subset the data set to keep only the observations that are
potentially influential outliers, and print the results.
5.3 Collinearity
Objectives
Determine whether collinearity exists in a model.
Generate output to evaluate the strength
of the collinearity and identify which variables
are involved.
Determine methods that can minimize collinearity
in a model.
Illustration of Collinearity
[Slide: observations plotted over the (X1, X2) plane with the fitted regression plane]
The goal of multiple linear regression is to find the best fit plane through the data to predict the response
variable. Here is an example in three dimensions, two predictor variables, and a response variable. You
can picture that the prediction plane that you are trying to build is similar to a tabletop, where the
observations guide the angle of the tabletop, relative to the floor, in the same way as the legs for the table.
If the legs line up with one another, then the plane built on top of it tends to be unstable.
Where should the prediction plane be placed? The slopes of the prediction plane relative to each X
and the Y are the parameter coefficient estimates.
X1 and X2 almost follow a straight line, that is, X1=X2 in the (X1, X2) plane.
Why is this a problem? Two reasons exist.
1. Neither might appear to be significant when both are in the model. However, either might be
significant when only one is in the model. Thus, collinearity can hide significant effects. (The reverse
can be true as well. Collinearity can increase the apparent statistical significance of effects.)
2. Collinearity tends to increase the variance of parameter estimates and consequently increase
prediction error.
[Slides: the same illustration after one data point is removed or moved; the refitted plane (shown lighter) differs sharply from the original]
However, the removal of only one data point (or only moving the data point) results in a very different
prediction plane (as represented by the lighter plane). This illustrates the variability of the parameter
estimates when there is extreme collinearity.
When collinearity is a problem, the estimates of the coefficients are unstable. This means that they have
a large variance. Consequently, the true relationship between Y and the Xs might be quite different from
that suggested by the magnitude and sign of the coefficients.
Collinearity is not a violation of the assumptions of linear regression.
Collinearity Diagnostics
PROC REG offers these tools that help quantify
the magnitude of the collinearity problems and identify
the subset of Xs that is collinear:
VIF
COLLIN
COLLINOINT
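Of these diagnostics, the VIF is the simplest: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other predictors. A Python sketch of the two-predictor case, where R^2 reduces to the squared correlation (not part of the course programs; data are illustrative):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation (pure-Python helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF_j = 1 / (1 - R_j^2). With only two predictors, the R^2 from
    regressing one on the other is just their squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.0]   # nearly a straight line against x1
vif = vif_two_predictors(x1, x2)  # a value well above 10 signals trouble
```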
Example of Collinearity
Example: Another research group has been working with the same Ames housing data but for different
purposes. It has been brought to your attention that they found a variable useful in their
analysis and are willing to share this new variable, score. The records can be matched using
the PID variable. The new information can be found in STAT1.amesaltuse. Sort both
STAT1.amesaltuse and STAT1.ameshousing3 by PID and merge these two data sets into a
single data set, amescombined. Using PROC CORR, investigate the correlations between the
variable score and the other interval variables (using the macro variable &interval).
/*st105d03.sas*/ /*Part A*/
proc sort data=STAT1.ameshousing3;
by PID;
run;
proc sort data=STAT1.amesaltuse;
by PID;
run;
data amescombined;
merge STAT1.ameshousing3 STAT1.amesaltuse;
by PID;
run;
title;
proc corr data=amescombined nosimple;
var &interval;
with score;
run;
The new variable score appears to be significantly correlated with all of the interval variables, but focus
your attention on the actual correlations. Recall that values closer to 1 or -1 imply a stronger correlation
within the pairing. Score appears to be most correlated with Basement_Area, Gr_Liv_Area, and
Total_Bathroom. These significant correlations alone are not enough to diagnose collinearity.
Collinearity Diagnostics
Example: Invoke PROC REG and use the VIF option to assess the magnitude of the collinearity problem
and identify the terms involved in the problem.
/*st105d03.sas*/ /*Part B*/
proc reg data=amescombined;
model SalePrice = &interval score / vif;
title 'Collinearity Diagnostics';
run;
quit;
Partial PROC REG Output

Parameter Estimates

Variable         Label                                              DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept        Intercept                                           1         -4254871           3419274       -1.24     0.2144          0
Gr_Liv_Area      Above grade (ground) living area square feet        1        923.25717         684.04818        1.35     0.1782      27569
Basement_Area    Basement area in square feet                        1       2178.35638        1709.68175        1.27     0.2036     411868
Garage_Area      Size of garage in square feet                       1         35.01213           6.46640        5.41     <.0001    1.41398
Deck_Porch_Area  Total area of decks and porches in square feet      1         30.64725           7.97228        3.84     0.0001    1.21667
Lot_Area         Lot size in square feet                             1          0.69964           0.31644        2.21     0.0278    1.20422
Age_Sold         Age of house when sold, in years                    1       -422.21228          44.18905       -9.55     <.0001    1.60476
Bedroom_AbvGr    Bedrooms above grade                                1      -4888.35244        1687.71153       -2.90     0.0041    1.48233
Total_Bathroom   Total number of bathrooms (half bathrooms           1       3047.94315        1919.03449        1.59     0.1133    1.73073
                 counted 10%)
score                                                                1        429.97552         341.96962        1.26     0.2096     533085
Some of the VIFs are much larger than 10. A severe collinearity problem is present. At this point there
are many ways to proceed. However, it is always a good idea to use some subject-matter expertise. When
subject-matter expertise is not available, another option is to systematically remove variables starting with
the highest VIF and re-run the analysis. Much like p-values, the VIF values will need to be updated with
each successive variable removal.
After reaching out to the researchers who provided the score variable, you learn that score is a
composite variable.
score=round(10000-(2*Gr_Liv_Area + 5*Basement_Area),10);
The researchers, on the basis of prior literature, created a composite variable, which is a weighted
function of the two variables Gr_Liv_Area and Basement_Area. This is not an uncommon occurrence
and illustrates an important point. If a composite variable is included in a model along with some
or all of its component measures, there is bound to be collinearity.
If the composite variable has meaning, it can be used as a stand-in measure for both components and you
can remove the variables Gr_Liv_Area and Basement_Area from the analysis.
Composite measures have the disadvantage of losing some information about the individual variables.
If this is of concern, then remove score from the analysis.
A decision was made to remove score from the analysis. Another check of collinearity is warranted.
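To see why a composite plus its components guarantees collinearity: before the rounding step, score is an exact linear function of Gr_Liv_Area and Basement_Area, so it is perfectly correlated with that weighted combination. A Python sketch with illustrative values (not the actual Ames records):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation (pure-Python helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only, not the Ames records.
gr_liv_area   = [900.0, 1100.0, 1350.0, 1500.0]
basement_area = [800.0,  950.0, 1000.0, 1200.0]

# Before rounding, the composite is an exact linear function of its parts,
# so it is perfectly (negatively) correlated with that weighted combination.
combo = [2 * g + 5 * b for g, b in zip(gr_liv_area, basement_area)]
score = [10000 - c for c in combo]
r = pearson_r(score, combo)
```

With the rounding applied, the dependence is no longer exact, but the correlation stays close enough to 1 in magnitude that the VIFs explode, as seen in the output above.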
5.05 Poll
If there is no correlation among the predictor variables,
can there still be collinearity in the model?
Yes
No
[Flowchart: the model-building process: (1) Preliminary Analysis → (2) Collinearity Detection → (3) Candidate Model Selection → (4) Assumption Validation and Influential Observation Detection → (5) Model Revision, looping back to (3) and (4) if needed → (6) Prediction Testing]
(1) Preliminary Analysis: This step includes the use of descriptive statistics, graphs, and correlation
analysis.
(2) Collinearity Detection: This step includes the use of the VIF statistic, condition indices, and
variation proportions.
(3) Candidate Model Selection: This step uses the numerous selection options in PROC REG or
PROC GLMSELECT to identify one or more candidate models.
(4) Assumption Validation and Influential Observation Detection: The former includes the plots of
residuals and graphs of the residuals versus the predicted values. It also includes a test for equal
variances. The latter includes the examination of R-Student residuals, Cook’s D statistic, DFFITS,
and DFBETAS statistics.
(5) Model Revision: If steps (3) and (4) indicate the need for model revision, generate a new model
by returning to these two steps.
(6) Prediction Testing: If possible, validate the model with data not used to build the model.
Exercises
3. Assessing Collinearity
Using the STAT1.BodyFat2 data set, run a regression of PctBodyFat2 on all the other numeric
variables in the file.
a. Determine whether there is a collinearity problem.
b. If so, decide what you would like to do about that. Will you remove any variables?
Why or why not?
5.4 Solutions
Solutions to Exercises
1. Examining Residuals
Assess the model obtained from the final forward stepwise selection of predictors for the
STAT1.BodyFat2 data set. Run a regression of PctBodyFat2 on Abdomen, Weight, Wrist,
and Forearm. Create plots of the residuals by the four regressors and by the predicted values
and a normal Quantile-Quantile plot.
/*st105s01.sas*/
ods graphics / imagemap=on;
proc reg data=STAT1.BodyFat2
plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
FORWARD: model PctBodyFat2=
Abdomen Weight Wrist Forearm;
id Case;
title 'FORWARD Model - Plots of Diagnostic Statistics';
run;
quit;
a. Do the residual plots indicate any problems with the constant variance assumption?
It does not appear that the data violate the assumption of constant variance. Also, the
residuals show nice random scatter and indicate no problem with model specification.
b. Are there any outliers indicated by the evidence in any of the residual plots?
There are a few (x-space) outliers for Wrist and Forearm and one clear outlier in each of
Abdomen and Weight values.
c. Does the Quantile-Quantile plot indicate any problems with the normality assumption?
a. Use plots to identify potential influential observations based on the suggested cutoff values.
/*st105s02.sas*/ /*Part A*/
ods graphics on;
ods output RSTUDENTBYPREDICTED=Rstud
COOKSDPLOT=Cook
DFFITSPLOT=Dffits
DFBETASPANEL=Dfbs;
proc reg data=STAT1.BodyFat2
plots(only label)=
(RSTUDENTBYPREDICTED
COOKSD
DFFITS
DFBETAS);
FORWARD: model PctBodyFat2=
Abdomen Weight Wrist Forearm;
id Case;
title 'FORWARD Model - Plots of Diagnostic Statistics';
run;
quit;
There are only a modest number of observations farther than two standard error units from
the mean of 0.
There are 10 labeled outliers, but observation 39 is clearly the most extreme.
DFBETAS are particularly high for observation 39 on the parameters for weight and forearm
circumference.
b. Output residuals to a data set, subset the data set to keep only the observations that are
potentially influential outliers, and print the results.
/* st105s02.sas */ /* Part B */
data influential;
   /* Merge data sets from above. */
   merge Rstud Cook Dffits Dfbs;
   by observation;
   /* Flag and keep observations exceeding any influence cutoff. */
   if (abs(Rstudent)>3) or (CooksDLabel ne .) or (DFFITSOUT ne .) then flag=1;
   array dfbetas{*} _dfbetasout: ;
   do i=2 to dim(dfbetas);
      if dfbetas{i} then flag=1;
   end;
   if flag=1;
   drop i flag;
run;
3. Assessing Collinearity
Using the STAT1.BodyFat2 data set, run a regression of PctBodyFat2 on all the other numeric
variables in the file.
a. Determine whether there is a collinearity problem.
/*st105s03.sas*/ /*Part A*/
ods graphics off;
proc reg data=STAT1.BodyFat2;
FULLMODL: model PctBodyFat2=
Age Weight Height
Neck Chest Abdomen Hip Thigh
Knee Ankle Biceps Forearm Wrist
/ vif;
title 'Collinearity -- Full Model';
run;
quit;
ods graphics on;
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              13             13168     1012.88783      54.65    <.0001
Error             238        4411.44804       18.53550
Corrected Total   251             17579
Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1        -18.18849           17.34857      -1.05     0.2955          0
Age           1          0.06208            0.03235       1.92     0.0562    2.25045
Weight        1         -0.08844            0.05353      -1.65     0.0998   33.50932
Height        1         -0.06959            0.09601      -0.72     0.4693    1.67459
Neck          1         -0.47060            0.23247      -2.02     0.0440    4.32446
Chest         1         -0.02386            0.09915      -0.24     0.8100    9.46088
Abdomen       1          0.95477            0.08645      11.04     <.0001   11.76707
Hip           1         -0.20754            0.14591      -1.42     0.1562   14.79652
Thigh         1          0.23610            0.14436       1.64     0.1033    7.77786
Knee          1          0.01528            0.24198       0.06     0.9497    4.61215
Ankle         1          0.17400            0.22147       0.79     0.4329    1.90796
Biceps        1          0.18160            0.17113       1.06     0.2897    3.61974
Forearm       1          0.45202            0.19913       2.27     0.0241    2.19249
Wrist         1         -1.62064            0.53495      -3.03     0.0027    3.37751
There seems to be high collinearity associated with Weight, Hip, and Abdomen, whose VIF values
exceed the cutoff of 10. Chest and Thigh are below that cutoff but are larger than the remaining
variables, whose VIF values do not exceed 5.
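The arithmetic behind the VIF column can be sketched outside SAS. This hypothetical Python illustration (synthetic data, not the body fat measurements) regresses each predictor on the remaining predictors and computes 1/(1 − R²):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: regress each column of X on the
    remaining columns and compute 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
print([round(v, 1) for v in vif(X)])   # x1 and x3 get large VIFs; x2 stays near 1
```

A predictor that is nearly a linear combination of the others yields an R-square near 1 and therefore a large VIF, which is what drives the Weight, Hip, and Abdomen values above.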
b. If so, decide what you would like to do about that. Will you remove any variables? Why or why
not?
The answer is not so easy. Weight is collinear with some set of the other variables, but as
you saw before in your model-building process, Weight is a relatively significant predictor
in the “best” models. The answer is for a subject-matter expert to determine.
If you want to remove Weight, simply run the model again without that variable.
/*st105s03.sas*/ /*Part B*/
ods graphics off;
proc reg data=STAT1.BodyFat2;
NOWT: model PctBodyFat2=
Age Height
Neck Chest Abdomen Hip Thigh
Knee Ankle Biceps Forearm Wrist
/ vif;
title 'Collinearity -- No Weight';
run;
quit;
ods graphics on;
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 12 13117 1093.07775 58.55 <.0001
Error 239 4462.05682 18.66969
Corrected Total 251 17579
Parameter Estimates
Parameter Standard Variance
Variable DF Estimate Error t Value Pr > |t| Inflation
Intercept 1 7.54528 7.67169 0.98 0.3263 0
Age 1 0.07316 0.03176 2.30 0.0221 2.15369
Height 1 -0.14157 0.08586 -1.65 0.1005 1.32980
Neck 1 -0.58279 0.22314 -2.61 0.0096 3.95560
Chest 1 -0.09077 0.09083 -1.00 0.3187 7.88319
Abdomen 1 0.92587 0.08497 10.90 <.0001 11.28546
Hip 1 -0.33792 0.12318 -2.74 0.0065 10.46928
Thigh 1 0.22264 0.14465 1.54 0.1251 7.75310
Knee 1 -0.08666 0.23483 -0.37 0.7124 4.31235
Ankle 1 0.10688 0.21850 0.49 0.6252 1.84379
Biceps 1 0.13168 0.16905 0.78 0.4368 3.50690
Forearm 1 0.44842 0.19984 2.24 0.0258 2.19223
Wrist 1 -1.74681 0.53138 -3.29 0.0012 3.30871
Some collinearity still exists in the model. If Abdomen, the remaining variable with the highest
VIF, is also removed, the R-square (and adjusted R-square) value is reduced by approximately
0.13.
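The cost of removing a predictor can be illustrated with a small simulation (hypothetical data, not the body fat measurements). In nested ordinary least squares models, dropping a predictor can never increase training R-square, and the size of the drop measures that predictor's contribution:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)                  # strong predictor (think Abdomen)
x2 = rng.normal(size=n)                  # weak predictor
y = 2.0 * x1 + 0.3 * x2 + rng.normal(size=n)

r2_full = r_squared(np.column_stack([x1, x2]), y)
r2_reduced = r_squared(x2.reshape(-1, 1), y)
print(round(r2_full, 3), round(r2_reduced, 3))   # the drop is the price of removal
```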
Chapter 6 Model Building and Scoring for Prediction
6.1 Brief Introduction to Predictive Modeling
Demonstration: Predictive Model Building
Exercises
6.1 Brief Introduction to Predictive Modeling
Objectives
Explain the concepts of predictive modeling.
Illustrate the modeling essentials of a predictive
model.
Explain the importance of data partitioning.
Strategy
[Slide: a fact-based strategy uses past behavior to make predictions.]
There are many business applications of predictive modeling. Database marketing uses customer
databases to improve sales promotions and product loyalty. In target marketing, the cases are customers,
the inputs are attributes such as previous purchase history and demographics, and the target is often
a binary variable indicating a response to a past promotion. The aim is to find segments of customers that
are likely to respond to some offer so that they can be targeted. Historical customer databases can also
be used to predict who is likely to switch brands or cancel services (churn). Loyalty promotions can then
be targeted at new cases that are at risk.
Credit scoring is used to decide whether to extend credit to applicants. The cases are past applicants. Most
input variables come from the credit application or credit reports. A relevant binary target is whether
the case defaulted (charged off) or the debt was paid. The aim is to reduce defaults and serious
delinquencies on new applicants for credit.
In fraud detection, the cases are transactions (for example, telephone calls and credit card purchases)
or insurance claims. The inputs are the particulars and circumstances of the transaction. The binary target
is whether that case was fraudulent. The aim is to anticipate fraud or abuse on new transactions or claims
so that they can be investigated or impeded.
Predictive modeling starts with a training data set. The observations in a training data set are known
as training cases (also known as examples, instances, or records). The variables are called inputs (also
known as predictors, features, explanatory variables, or independent variables) and targets (also
known as responses, outcomes, or dependent variables). For a given case, the inputs reflect your state of
knowledge before measuring the target.
The measurement scale of the inputs and the target can be varied. The inputs and the target can
be numeric variables, such as income. They can be nominal variables, such as occupation. They are often
binary variables, such as a positive or negative response concerning home ownership.
Model Complexity
[Figure: two curves fit to the same data points, one with just the right flexibility and one that is too flexible.]
Fitting a model to data requires searching through the space of possible models. Constructing a model
with good generalization requires choosing the right complexity. For regression, including more terms
in the model increases complexity.
Selecting model complexity involves a tradeoff between bias and variance. An insufficiently complex
model might not be flexible enough. This leads to underfitting – that is, systematically missing the signal
(the true relationships). This leads to biased inferences, which are inferences that are not the true ones
in the population.
A naive modeler might assume that the most complex model should always outperform the others, but
this is not the case. An overly complex model might be too flexible. This leads to overfitting – that is,
accommodating nuances of the random noise (chance relationships) in the particular sample. This leads
to models that have higher variance when applied to a population. A model with just enough flexibility
gives the best generalization.
The strategy for choosing model complexity in data mining is to use honest assessment. With honest
assessment, you select the model that performs best on a validation data set, which is not used to fit
the model. Assessing performance on the same data set that was used to develop the model leads
to selecting too complex a model (overfitting).
The classic example of this is selecting linear regression models based on R square.
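This tendency is easy to verify. In ordinary least squares, adding any input, even pure noise, can never decrease R square on the training data, so selecting on training R square always favors the biggest model. A hypothetical Python sketch:

```python
import numpy as np

def r2(X, y):
    """Training R^2 of an OLS fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
n = 100
signal = rng.normal(size=(n, 1))
y = signal[:, 0] + rng.normal(size=n)    # one real input plus noise

X = signal
scores = [r2(X, y)]
for _ in range(10):                      # add pure-noise inputs one at a time
    X = np.column_stack([X, rng.normal(size=n)])
    scores.append(r2(X, y))

# Training R^2 can only go up, even though every new input is noise.
print([round(s, 3) for s in scores])
```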
In predictive modeling, the standard strategy for honest assessment of model performance is data
splitting. A portion is used for fitting the model – that is, the training data set. The remaining data are
separated for empirical validation.
The validation data set is used for monitoring and tuning the model to improve its generalization.
The tuning process usually involves selecting among models of different types and complexities.
The tuning process optimizes the selected model on the validation data.
Because the validation data are used to select from a set of related models, reported performance
will be overstated on average. Consequently, a further holdout sample is needed for a final,
unbiased assessment. The test data set has only one use, which is to give a final honest estimate
of generalization. Cases in the test set must be treated in the same way that new data would be
treated. The cases cannot be involved in any way in the determination of the fitted prediction
model. In practice, many analysts see no need for a final honest assessment of generalization.
An optimal model is chosen using the validation data, and the model assessment measured
on the validation data is reported as an upper bound on the performance expected when the model
is deployed.
With small or moderate data sets, data splitting is inefficient; the reduced sample size can severely
degrade the fit of the model. Computer-intensive methods, such as the cross-validation and bootstrap
methods, were developed so that all the data can be used for both fitting and honest assessment. However,
data mining usually has the luxury of massive data sets.
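A sketch of the splitting idea (hypothetical Python, with a made-up 60/30/10 allocation):

```python
import random

def partition(n_rows, frac_valid=0.3, frac_test=0.1, seed=27513):
    """Randomly assign row indices to training/validation/test sets."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_test = round(n_rows * frac_test)
    n_valid = round(n_rows * frac_valid)
    test = set(idx[:n_test])
    valid = set(idx[n_test:n_test + n_valid])
    train = set(idx[n_test + n_valid:])
    return train, valid, test

train, valid, test = partition(300)
print(len(train), len(valid), len(test))   # 180 90 30
```

Only the training rows touch the fitting algorithm; the validation rows tune and select, and the test rows are held back for the final honest estimate.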
[Slide: model-selection diagram. Candidate models of increasing complexity (2 through 5) are fit to the training data, and each model's performance is rated using the validation data.]
Using performance on the training data set usually leads to selecting a model that is too complex.
(The classic example is selecting linear regression models based on R square.) To avoid this problem,
PROC GLMSELECT can select the model based on validation data performance, from the sequence
of models selected based on training data measures.
Model Selection
[Slide: the training data (inputs and target) are used to fit candidate models of increasing complexity (2 through 5), and the validation data are used to assess each one. Select the simplest model with the highest validation assessment.]
In keeping with Occam’s razor, the best model is the simplest model with the highest validation
performance.
PROC GLMSELECT can perform model building with honest assessment using a holdout (validation)
data set in two ways. If a holdout data set has already been created, it can be named in the
PROC GLMSELECT statement with the VALDATA= option. If there is only one data set, it can be randomly
partitioned into training and validation data (as well as test data, if required) using a FRACTION option
in the PARTITION statement. A nonzero seed in the PROC GLMSELECT statement ensures that the
partition is reproducible.
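The FRACTION idea can be mimicked conceptually. This hypothetical sketch (not SAS's actual random-number generator) sends each row to the validation role with the requested probability, and a fixed nonzero seed makes the partition repeatable:

```python
import random

def assign_rows(n_rows, validate=1/3, seed=8675309):
    """Per-row random assignment in the spirit of
    PARTITION FRACTION(VALIDATE=...): each row goes to the
    validation role with the given probability."""
    rng = random.Random(seed)
    return ["validate" if rng.random() < validate else "train"
            for _ in range(n_rows)]

a = assign_rows(300)
b = assign_rows(300)          # same nonzero seed -> identical partition
print(a == b, a.count("validate"))
```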
ods graphics;
Dimensions
Number of Effects 20
Number of Parameters 43
Stop Details
For        Effect             Candidate SBC   Compare SBC
Removal    Deck_Porch_Area    5718.6683       > 5717.9317
[Coefficient Progression plot: standardized coefficients for the selected effects (Gr_Liv_Area, Overall_Qual2, Overall_Cond2, Fireplaces 2, Bedroom_AbvGr, Age_Sold, and others) and the validation ASE, plotted against the effect sequence.]
The Coefficient Progression plot shows the history of the model selection. The vertical reference
line at 5-Central_Air shows that the model at that step had the lowest validation ASE (average
squared error) of any model in the progression.
…
Selected Model
The selected model, based on Validation ASE, is the model at Step 5.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 19 3.566452E11 18770797693 90.06
Error 274 57107246191 208420607
Corrected Total 293 4.137524E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 51207 7079.121457 7.23
Overall_Qual2 5 1 6782.080263 3104.469941 2.18
Overall_Qual2 6 1 13659 3414.565419 4.00
Overall_Qual2 4 0 0 . .
Overall_Cond2 5 1 8996.618020 4137.937302 2.17
Overall_Cond2 6 1 15909 4025.283609 3.95
Overall_Cond2 4 0 0 . .
Fireplaces 1 1 9716.205925 2044.560791 4.75
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Fireplaces 2 1 7235.661619 4540.159269 1.59
Fireplaces 0 0 0 . .
Heating_QC Fa 1 -11668 4315.812370 -2.70
Heating_QC Gd 1 -3178.918390 2496.841385 -1.27
Heating_QC TA 1 -6689.247126 2133.424223 -3.14
Heating_QC Ex 0 0 . .
Masonry_Veneer Y 1 -3369.652622 2079.343731 -1.62
Masonry_Veneer N 0 0 . .
Lot_Shape_2 Regular 1 -4507.715447 2036.544994 -2.21
Lot_Shape_2 Irregular 0 0 . .
Gr_Liv_Area 1 42.972194 5.709351 7.53
Basement_Area 1 25.491273 3.170869 8.04
Garage_Area 1 29.698556 5.913131 5.02
Deck_Porch_Area 1 20.952561 7.235245 2.90
Lot_Area 1 1.199858 0.307660 3.90
Age_Sold 1 -422.187733 47.675825 -8.86
Bedroom_AbvGr 1 -4541.124997 1523.500120 -2.98
Total_Bathroom 1 3806.351237 1714.333548 2.22
Exercises
6.2 Scoring Predictive Models
Objectives
Explain the concepts of scoring.
Score new data using both PROC GLMSELECT
and PROC PLM.
Scoring
[Slide: model development is followed by model deployment, which is scoring.]
The predictive modeling task is not completed when a model and allocation rule are determined.
The model must be practically applied to new cases. This process is called scoring.
In database marketing, this process can be tremendously burdensome because the data to be scored might
be many times more massive than the data used to develop the model. Moreover, the data might be stored
in a different format on a different system using different software.
In other applications, such as fraud detection, the model might need to be integrated into an online
monitoring system.
Scoring Recipe
The model results in a formula or rules.
The data require modifications:
– Derived inputs
– Transformations
– Missing value imputation
The scoring code is deployed. To score, you do not rerun the algorithm; you apply the score code
(equations) obtained from the final model to the scoring data.
Any modifications that you make to the training data (imputing missing values, transformations,
standardization) should be applied to the validation and the scoring data in the same way. This means that
if you have subtracted the mean of x(training) from the training data, then the mean of x(training) should
also be subtracted from the validation and the scoring data. This practice keeps the different data sets
comparable.
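A minimal sketch of this practice, assuming a simple mean-centering transformation and made-up values:

```python
import statistics

train = [12.0, 15.0, 11.0, 14.0, 18.0]   # hypothetical training values of x
valid = [13.0, 16.0]                     # validation values of the same x
score = [20.0, 10.0]                     # new data to be scored

# Derive the centering constant from the TRAINING data only...
mu_train = statistics.mean(train)        # 14.0

# ...and apply that same constant to every data set.
center = lambda xs: [x - mu_train for x in xs]
print(center(train), center(valid), center(score))
```

The validation and scoring data are never allowed to contribute to the constant itself; they only have it applied to them.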
It might seem strange to go through two or three steps to score new data when there is a way to do it
in one step. PROC GLMSELECT has a SCORE statement, so you can produce a model and score data
in one step. However, this method is inefficient if you want to score more than once or if you model
using a large data set. Instead, you can score with PROC PLM by using the item store created by a
STORE statement in PROC GLMSELECT. One potential problem with this method is that others might
not be able to use the item store with earlier versions of SAS, or you might not want to share the entire
item store. If so, you can produce detailed scoring code by using the CODE statement in PROC PLM.
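The scoring step itself is only arithmetic. A hypothetical sketch, with an intercept and two slopes rounded from the selected model's parameter estimates shown earlier, purely for illustration:

```python
# Hypothetical score code exported from a fitted model. The intercept and
# slopes below are rounded from the selected model's estimates and are for
# illustration only.
COEF = {"Intercept": 51207.0, "Gr_Liv_Area": 42.97, "Basement_Area": 25.49}

def score(row):
    """Apply the scoring equation to one new observation:
    no refitting, just arithmetic on the stored coefficients."""
    return COEF["Intercept"] + sum(COEF[name] * value
                                   for name, value in row.items())

new_house = {"Gr_Liv_Area": 1500.0, "Basement_Area": 800.0}
print(round(score(new_house)))   # predicted SalePrice, about 136054
```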
Partial Output
The COMPARE Procedure
(Method=RELATIVE(2.22E-10), Criterion=0.0001)
Variables Summary
Observation Summary
First Obs 1 1
Use of the DATA step code can result in a small reduction in precision. However, the predictions are
essentially the same.
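The spirit of that comparison can be sketched as follows (hypothetical numbers; PROC COMPARE's actual method has more detail than this):

```python
def close(a, b, criterion=1e-4):
    """Relative comparison in the spirit of PROC COMPARE's
    METHOD=RELATIVE: the difference is scaled by the magnitude
    of the values being compared."""
    scale = max(abs(a), abs(b), 1e-12)
    return abs(a - b) / scale < criterion

plm_pred = 136054.0              # hypothetical prediction from PROC PLM
data_step = 136054.0000001       # same prediction from DATA step score code
print(close(plm_pred, data_step))     # True: agrees within tolerance
print(close(136054.0, 137000.0))      # False: a real disagreement
```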
Exercises
6.3 Solutions
Solutions to Exercises
1. Predictive Model Building Using PROC GLMSELECT to Partition
Build a model predicting SalePrice starting with all of the variables used in the previous
demonstration. Partition the AmesHousing3 data set into a training data set of approximately 2/3
and a validation data set of approximately 1/3. Use stepwise selection with AIC as the selection
criterion and validation average squared error for the model choice criterion.
/*st106s01.sas*/
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces
Season_Sold Garage_Type_2 Foundation_2 Heating_QC
Masonry_Veneer Lot_Shape_2 Central_Air;
ods graphics;
proc glmselect data=STAT1.ameshousing3
plots=all
seed=8675309;
class &categorical / param=reference ref=first;
model SalePrice=&categorical &interval /
selection=stepwise
select=aic
choose=validate;
partition fraction(validate=0.3333);
title "Selecting the Best Model using Honest Assessment";
run;
Your results will likely differ somewhat, depending on the seed that you choose.
Stop Details
For        Effect            Candidate AIC   Compare AIC
Entry      Masonry_Veneer    3959.1313       > 3958.4479
Removal    Total_Bathroom    3961.4810       > 3958.4479
[Coefficient Progression plot: standardized coefficients (Overall_Cond2, Garage_Area, Fireplaces 1, Total_Bathroom, Heating_QC, Age_Sold, and others) and the validation ASE, plotted against the effect sequence. A reference line marks the selected step.]
The selected model, based on Validation ASE, is the model at Step 10.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value
Model 18 2.231765E11 12398695049 65.49
Error 178 33699428801 189322634
Corrected Total 196 2.568759E11
Parameter Estimates
Standard
Parameter DF Estimate Error t Value
Intercept 1 27334 10120 2.70
House_Style2 1Story 1 12267 4203.159135 2.92
House_Style2 2Story 1 2456.477699 4386.235156 0.56
House_Style2 SFoyer 1 20779 7050.033468 2.95
House_Style2 SLvl 1 17117 5527.649598 3.10
House_Style2 1.5Fin 0 0 . .
Overall_Qual2 5 1 7841.596393 3417.138088 2.29
Overall_Qual2 6 1 14024 3806.928311 3.68
Overall_Qual2 4 0 0 . .
Overall_Cond2 5 1 12475 4949.669709 2.52
Overall_Cond2 6 1 17766 4841.031305 3.67
Overall_Cond2 4 0 0 . .
Fireplaces 1 1 5832.276234 2471.249968 2.36
Fireplaces 2 1 10886 4999.141012 2.18
Fireplaces 0 0 0 . .
Heating_QC Fa 1 -13782 5544.767861 -2.49
Heating_QC Gd 1 -3687.706899 2867.792984 -1.29
Heating_QC TA 1 -5944.139856 2467.507946 -2.41
Heating_QC Ex 0 0 . .
Gr_Liv_Area 1 54.360524 6.486247 8.38
Basement_Area 1 18.329197 3.964241 4.62
Garage_Area 1 33.820604 6.692579 5.05
Deck_Porch_Area 1 27.291527 8.243101 3.31
Age_Sold 1 -379.483707 54.640384 -6.95
Your results will likely differ somewhat, depending on the seed that you choose.
Partial Output
The COMPARE Procedure
(Method=RELATIVE(2.22E-10), Criterion=0.0001)
Variables Summary
Observation Summary
First Obs 1 1
Chapter 7 Categorical Data Analysis
7.1 Describing Categorical Data
Objectives
Examine the distribution of categorical variables.
Do preliminary examinations of associations between
variables.
No Association
Is your manager’s mood associated with the weather?
[Slide: a two-way table in which each weather condition shows the same row percentages, 72% and 28%.]
There appears to be no association between your manager’s mood and the weather here because the row
percentages are the same in each column.
Association
Is your manager’s mood associated with the weather?
[Slide: a two-way table in which the row percentages differ by weather condition, 82% versus 18% in one row and 60% versus 40% in the other.]
There appears to be an association here because the row percentages are different in each column.
Frequency Tables
A frequency table shows the number of observations
that occur in certain categories or intervals. A one-way
frequency table examines one variable.
Income    Frequency    Percent    Cumulative Frequency    Cumulative Percent
Typically, there are four types of frequency measures included in a frequency table:
Frequency is the number of times the value appears in the data set.
Percent represents the percentage of the data that has this value.
Cumulative Frequency accumulates the frequency of each of the values by adding the second
frequency to the first, and so on.
Cumulative Percent accumulates the percentage by adding the second percentage to the first,
and so on.
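The four measures can be computed directly. A hypothetical sketch with made-up income categories:

```python
from collections import Counter

# Hypothetical income categories for six observations
values = ["Low", "High", "Low", "Medium", "Low", "High"]
counts = Counter(values)

total, cum_freq = len(values), 0
print(f"{'Income':8} {'Freq':>4} {'Pct':>7} {'CumFreq':>8} {'CumPct':>7}")
for level, freq in counts.most_common():
    cum_freq += freq          # running total of the frequencies
    print(f"{level:8} {freq:4} {100 * freq / total:7.2f} "
          f"{cum_freq:8} {100 * cum_freq / total:7.2f}")
```

The last row's cumulative frequency always equals the total count, and its cumulative percent always equals 100.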
Crosstabulation Tables
A crosstabulation table shows the number of observations
for each combination of the row and column variables.
PROC FREQ can generate large volumes of output as the number of variables or the number
of variable levels (or both) increases.
Examining Distributions
Example: Invoke PROC FREQ and create one-way frequency tables for the variables Bonus,
Fireplaces, and Lot_Shape_2 and create two-way frequency tables for the variables Bonus
by Fireplaces, and Bonus by Lot_Shape_2. For the continuous variable, Basement_Area,
create histograms for each level of Bonus. Use a CLASS statement in PROC UNIVARIATE.
Use the FORMAT procedure to format the values of Bonus.
/*st107d01.sas*/
title;
proc format;
value bonusfmt 1="Bonus Eligible"
0="Not Bonus Eligible"
;
run;
proc freq data=STAT1.ameshousing3;
tables Bonus Fireplaces Lot_Shape_2
Fireplaces*Bonus Lot_Shape_2*Bonus /
plots(only)=freqplot(scale=percent);
format Bonus bonusfmt.;
run;
FREQPLOT(<suboptions>) requests a frequency plot. Frequency plots are available for frequency
and crosstabulation tables. For multiway tables, PROC FREQ provides
a two-way frequency plot for each stratum.
Number of fireplaces
Cumulative Cumulative
Fireplaces Frequency Percent Frequency Percent
0 195 65.00 195 65.00
1 93 31.00 288 96.00
2 12 4.00 300 100.00
There seem to be no unusual data values that could be due to coding errors for any of the categorical
variables.
The requested two-way frequency tables follow. You can get a preliminary idea whether there
are associations between the outcome variable, Bonus, and the predictor variables, Fireplaces
and Lot_Shape_2, by examining the distribution of Bonus at each value of the predictors.
With the unequal group sizes, the row percentages might not clearly show whether Fireplaces
is associated with Bonus.
Table of Lot_Shape_2 by Bonus
Lot_Shape_2(Regular Bonus(Sale Price >
or irregular lot shape) $175,000)
Frequency
Percent Not
Row Pct Bonus Bonus
Col Pct Eligible Eligible Total
Irregular 62 31 93
20.74 10.37 31.10
66.67 33.33
24.31 70.45
Regular 193 13 206
64.55 4.35 68.90
93.69 6.31
75.69 29.55
Total 255 44 299
85.28 14.72 100.00
Frequency Missing = 1
There seems to be an association between Bonus and Lot_Shape_2, with a greater chance of not being
bonus eligible when lot shape is regular.
The plot below shows the distribution of the continuous variable, Basement_Area, by bonus status.
The distribution for houses that are not bonus eligible appears to be more variable, as is evident from
the larger standard deviation. The mean basement area of houses that are not bonus eligible is more
than 400 square feet smaller than that of houses that are bonus eligible.
7.2 Tests of Association
Objectives
Perform a chi-square test for association.
Examine the strength of the association.
Perform a Mantel-Haenszel chi-square test.
Overview
[Slide: a table classifying analyses by type of response and type of predictors, where the predictors are categorical, continuous, or continuous and categorical.]
Introduction
Table of Lot_Shape_2 by Bonus (row percentages)
             Not Bonus Eligible   Bonus Eligible   Total
Irregular    66.67%               33.33%           N=93
Regular      93.69%               6.31%            N=206
Total        N=255                N=44             N=299
There appears to be an association between Lot_Shape_2 and Bonus because the row probabilities
are different in each column. To test for this association, you assess whether the difference between
the probabilities of irregular lots being bonus eligible (33.33%) and regular lots being bonus eligible
(6.31%) is greater than would be expected by chance.
Null Hypothesis
There is no association between Lot_Shape_2
and Bonus.
The probability of a home sale being bonus eligible
is the same regardless of lot shape.
Alternative Hypothesis
There is an association between Lot_Shape_2
and Bonus.
The probability of a home sale being bonus eligible
is not the same for irregular and regular lot shapes.
Chi-Square Test
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
A commonly used test that examines whether there is an association between two categorical variables
is the Pearson chi-square test. The chi-square test measures the difference between the observed cell
frequencies and the cell frequencies that are expected if there is no association between the variables.
If you have a significant chi-square statistic, there is strong evidence that an association exists between
your variables.
Under the null hypothesis of no association between the row and column variables, the expected
proportion in any cell is equal to the proportion in that cell’s row (R/T) times the proportion in that
cell’s column (C/T). The expected count is simply that expected proportion times the total sample
size: expected count = (R/T)*(C/T)*T = (R*C)/T.
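The expected-count formula can be checked directly against the Lot_Shape_2 by Bonus table shown earlier (the column ordering below is chosen for illustration):

```python
# Observed counts from the Lot_Shape_2 * Bonus crosstabulation
# (rows: Irregular, Regular; columns: Bonus Eligible, Not Eligible)
observed = [[31, 62],
            [13, 193]]

row_tot = [sum(r) for r in observed]          # R = 93, 206
col_tot = [sum(c) for c in zip(*observed)]    # C = 44, 255
T = sum(row_tot)                              # T = 299

# Expected count in each cell under no association: (R * C) / T
expected = [[r * c / T for c in col_tot] for r in row_tot]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 2))   # about 37.28 on this table
```

Note that the expected counts in each row still add up to that row's observed total; only the split across columns changes.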
Chi-Square Tests
Chi-square tests and the corresponding p-values determine whether an association exists.
The p-value for the chi-square test only indicates how confident you can be that the null hypothesis
of no association is false. It does not tell you the magnitude of an association. The value of the chi-square
statistic also does not tell you the magnitude of the association. If you double the size of your sample
by duplicating each observation, you double the value of the chi-square statistic, even though the strength
of the association does not change.
Measures of Association
CRAMER’S V
One measure of the strength of the association between two nominal variables is Cramer’s V statistic.
It has a range of -1 to 1 for 2-by-2 tables and 0 to 1 for larger tables. Values farther from 0 indicate
stronger association. Cramer’s V statistic is derived from the Pearson chi-square statistic.
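Both points, and Cramer's V itself, can be sketched together (hypothetical code using the unsigned form of V and the Lot_Shape_2 by Bonus counts): doubling every count doubles the chi-square statistic but leaves Cramer's V unchanged.

```python
import math

def chi2_and_v(observed):
    """Pearson chi-square and (unsigned) Cramer's V for a table."""
    rows, cols = len(observed), len(observed[0])
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    T = sum(row_tot)
    chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / T) ** 2
               / (row_tot[i] * col_tot[j] / T)
               for i in range(rows) for j in range(cols))
    v = math.sqrt(chi2 / (T * min(rows - 1, cols - 1)))
    return chi2, v

table = [[31, 62], [13, 193]]
doubled = [[2 * x for x in row] for row in table]   # duplicate every observation

c1, v1 = chi2_and_v(table)
c2, v2 = chi2_and_v(doubled)
print(round(c1, 2), round(c2, 2))   # the chi-square statistic doubles
print(round(v1, 3), round(v2, 3))   # Cramer's V is unchanged
```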
Odds Ratios
An odds ratio indicates how much more likely, with
respect to odds, a certain event occurs in one group
relative to its occurrence in another group.
Example: How do the odds of irregular lot shapes being
bonus eligible compare to those of regular lot
shapes?
Odds = p(event) / (1 - p(event))
The odds ratio can be used as a measure of the strength of association for 2 * 2 tables. Do not mistake
odds for probability. Odds are calculated from probabilities as shown in the next slides.
There is a 90% probability of having the outcome in group B. What is the probability of having
the outcome in group A?
Probability of Yes in Group B = 0.90. Probability of No in Group B = 0.10.
The odds of an outcome are the ratio of the expected probability that the outcome will occur
to the expected probability that the outcome will not occur. The odds for group B are 9, which indicate
that you expect nine times as many occurrences as non-occurrences in group B.
What are the odds of having the outcome in group A?
Odds Ratio
Outcome
Yes No Total
Group A 60 20 80
Group B 90 10 100
Odds of Yes in Group A = 3. Odds of Yes in Group B = 9.
The odds ratio of group A to group B equals 1/3, or 0.3333, which indicates that the odds of getting
the outcome in group A are one third those in group B. If you were interested in the odds ratio of group
B to group A, you would simply take the multiplicative inverse (or reciprocal) of 1/3 to arrive at 3.
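The odds arithmetic above is easy to verify outside SAS. A minimal Python sketch (illustrative only, not part of the course code) reproduces the odds for each group and the odds ratio:

```python
# Probabilities of the outcome ("Yes") in each group, from the 2x2 table:
# Group A: 60 Yes out of 80; Group B: 90 Yes out of 100.
p_a = 60 / 80          # 0.75
p_b = 90 / 100         # 0.90

# Odds = p_event / (1 - p_event)
odds_a = p_a / (1 - p_a)   # 3.0
odds_b = p_b / (1 - p_b)   # 9.0 (up to floating-point rounding)

# Odds ratio of group A to group B; the B-to-A ratio is its reciprocal.
or_a_vs_b = odds_a / odds_b
print(round(odds_a, 4), round(odds_b, 4), round(or_a_vs_b, 4))
```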
The odds ratio shows the strength of the association between the predictor variable and the outcome
variable. If the odds ratio is 1, then there is no association between the predictor variable and the
outcome. If the odds ratio is greater than 1, then group A, the numerator group, is more likely to have
the outcome. If the odds ratio is less than 1, then group B, the denominator group, is more likely to have
the outcome.
Chi-Square Test
Example: Use the FREQ procedure to test for an association between the variables Lot_Shape_2
and Bonus as well as Fireplaces and Bonus. Generate the expected cell frequencies
and the cell’s contribution to the total chi-square statistic.
/*st107d02.sas*/
ods graphics off;
proc freq data=STAT1.ameshousing3;
tables (Lot_Shape_2 Fireplaces)*Bonus
/ chisq expected cellchi2 nocol nopercent
relrisk;
format Bonus bonusfmt.;
title 'Associations with Bonus';
run;
ods graphics on;
Selected TABLES statement options:
CHISQ produces the chi-square test of association and the measures of association based
on the chi-square statistic.
EXPECTED prints the expected cell frequencies under the hypothesis of no association.
CELLCHI2 prints each cell’s contribution to the total chi-square statistic.
NOCOL suppresses printing the column percentages.
NOPERCENT suppresses printing the cell percentages.
RELRISK prints a table with risk ratios (probability ratios) and odds ratios.
The frequency table is shown below.
Table of Lot_Shape_2 by Bonus
Lot_Shape_2(Regular Bonus(Sale Price >
or irregular lot shape) $175,000)
Frequency
Expected Not
Cell Chi-Square Bonus Bonus
Row Pct Eligible Eligible Total
Irregular 62 31 93
79.314 13.686
3.7797 21.905
66.67 33.33
Regular 193 13 206
175.69 30.314
1.7064 9.8893
93.69 6.31
Total 255 44 299
Frequency Missing = 1
It appears that the cell for Lot_Shape_2=Irregular and Bonus=1 (Bonus Eligible) contributes
the most to the chi-square statistic. The Cell Chi-Square value is 21.905.
Below is the table that shows the chi-square test and Cramer’s V.
Statistic DF Value Prob
Chi-Square 1 37.2807 <.0001
Likelihood Ratio Chi-Square 1 34.4226 <.0001
Continuity Adj. Chi-Square 1 35.1587 <.0001
Mantel-Haenszel Chi-Square 1 37.1561 <.0001
Phi Coefficient -0.3531
Contingency Coefficient 0.3330
Cramer's V -0.3531
Because the p-value for the chi-square statistic is <.0001, which is below 0.05, you reject the null
hypothesis at the 0.05 level and conclude that there is evidence of an association between Lot_Shape_2
and Bonus. Cramer’s V of -0.3531 indicates that the association detected with the chi-square test
is relatively weak.
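Cramer's V can be recomputed from the Pearson chi-square in the table above. For an r-by-c table, V = sqrt(chi-square / (n * min(r-1, c-1))); for 2*2 tables, SAS attaches the sign of the phi coefficient. A quick Python check (illustrative only, not part of the course code):

```python
import math

# Values taken from the output above.
chi2 = 37.2807   # Pearson chi-square
n = 299          # nonmissing observations (one frequency missing)

# 2x2 table, so min(r-1, c-1) = 1.
v = math.sqrt(chi2 / (n * min(2 - 1, 2 - 1)))
print(round(v, 4))   # matches the reported magnitude |V| = 0.3531
```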
Fisher's Exact Test
Cell (1,1) Frequency (F) 62
Left-sided Pr <= F <.0001
Right-sided Pr >= F 1.0000
Exact tests are often useful where asymptotic distributional assumptions are not met. The usual guideline
for the asymptotic chi-square test is at least 20-25 total observations for a 2*2 table, with 80%
of the table cells having expected counts greater than 5. Fisher's Exact Test is provided by PROC FREQ when tests
of association are requested for 2*2 tables. Otherwise, the exact test must be requested using an EXACT
statement.
Odds Ratio and Relative Risks
Statistic Value 95% Confidence Limits
Odds Ratio 0.1347 0.0664 0.2735
Relative Risk (Column 1) 0.7116 0.6137 0.8251
Relative Risk (Column 2) 5.2821 2.9002 9.6202
The Odds Ratio and Relative Risk table shows another measure of strength of association.
The odds ratio is shown in the first row of the table, along with the 95% confidence limits. To interpret
the odds ratio, refer to the contingency table at the beginning of the output. The top row (Irregular,
in this case) is the numerator of the ratio while the bottom row (Regular) is the denominator. The
interpretation is stated in relation to the left column of the contingency table (Not Bonus Eligible).
The value of 0.1347 says that an irregular lot has about 13.5% of the odds of not being bonus eligible,
compared with a regular lot. This is equivalent to saying that a regular lot has about 13.5% of the odds
of being bonus eligible, compared with an irregular lot.
Relative Risk estimates for each column are interpreted as probability ratios, rather than odds ratios. You
get a choice of assessing probabilities of the left column (Column1) or the right column (Column2). For
example, the Column1 Relative Risk shows the ratio of the probabilities of irregular lots to regular lots
being in the left column (66.67/93.69=0.7116).
It is often easier to report odds ratios by first transforming the decimal value to a percent difference value.
The formula for doing that is (OR-1) * 100. In the example, you have (0.1347-1)*100=-86.53%. In other
words, regular lots have 86.53 percent lower odds of being bonus eligible compared with irregular lots.
The 95% odds ratio confidence interval goes from 0.0664 to 0.2735. That interval does not include 1.
This confirms the statistically significant (at alpha=0.05) result of the Pearson chi-square test of
association. A confidence interval that included the value 1 (equality of odds) would be a non-significant
result.
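All of the statistics in the Odds Ratio and Relative Risk table can be reproduced directly from the four cell counts of the Lot_Shape_2 by Bonus table. A short Python sketch (illustrative only, not part of the course code):

```python
# Rows: Irregular (top), Regular (bottom).
# Columns: Not Bonus Eligible, Bonus Eligible.
a, b = 62, 31     # Irregular: Not Eligible, Eligible
c, d = 193, 13    # Regular:   Not Eligible, Eligible

odds_ratio = (a * d) / (b * c)                 # 0.1347
pct_diff = (odds_ratio - 1) * 100              # about -86.53 percent
rr_col1 = (a / (a + b)) / (c / (c + d))        # Column 1 relative risk: 0.7116
rr_col2 = (b / (a + b)) / (d / (c + d))        # Column 2 relative risk: 5.2821
print(round(odds_ratio, 4), round(pct_diff, 2),
      round(rr_col1, 4), round(rr_col2, 4))
```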
Table of Fireplaces by Bonus
Fireplaces(Number Bonus(Sale Price >
of fireplaces) $175,000)
Frequency
Expected Not
Cell Chi-Square Bonus Bonus
Row Pct Eligible Eligible Total
0 177 18 195
165.75 29.25
0.7636 4.3269
90.77 9.23
1 68 25 93
79.05 13.95
1.5446 8.7529
73.12 26.88
2 10 2 12
10.2 1.8
0.0039 0.0222
83.33 16.67
Total 255 45 300
There also seems to be an association between Fireplaces and Bonus (Chi-Square(2 df)=15.4141,
p=0.0004). Cramer’s V for that association is 0.2267.
Is Fireplaces associated with Bonus?
You already saw that Bonus and Fireplaces have a significant general association. Another question that
you can ask is whether Bonus and Fireplaces have a significant ordinal association. The appropriate test
for ordinal associations is the Mantel-Haenszel chi-square test.
The Mantel-Haenszel chi-square statistic is more powerful than the general association chi-square statistic
for detecting an ordinal association. The reasons are that
- all of the Mantel-Haenszel statistic's power is concentrated toward that objective
- the power of the general association statistic is dispersed over a greater number of alternatives.
Rank Association
To measure the strength of the ordinal association, you can use the Spearman correlation statistic.
This statistic
- has a range between -1 and 1
- has values close to 1 if there is a relatively high degree of positive correlation
- has values close to -1 if there is a relatively high degree of negative correlation
- is appropriate only if both variables are ordinal scaled and the values are in a logical order.
The Spearman statistic can be interpreted as the Pearson correlation between the ranks on variable
X and the ranks on variable Y.
For character values, SAS assigns, by default, a 1 to column 1, a 2 to column 2, and so on. You can
change the default with the SCORES= option in the TABLES statement.
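The "Pearson correlation of ranks" interpretation can be sketched in a few lines of Python (illustrative only, not part of the course code; ties are handled by averaging ranks, as rank scores do in SAS):

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging the ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # 0.8 up to rounding
```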
Example: Use PROC FREQ to test whether an ordinal association exists between Bonus and Fireplaces.
/*st107d03.sas*/
ods graphics off;
proc freq data=STAT1.ameshousing3;
tables Fireplaces*Bonus / chisq measures cl;
format Bonus bonusfmt.;
title 'Ordinal Association between FIREPLACES and BONUS?';
run;
ods graphics on;
Selected TABLES statement options:
CHISQ produces the Pearson chi-square, the likelihood-ratio chi-square, and the
Mantel-Haenszel chi-square. It also produces measures of association based
on chi-square such as the phi coefficient, the contingency coefficient, and Cramer’s V.
MEASURES produces the Spearman correlation statistic along with other measures of association.
CL produces confidence bounds for the MEASURES statistics.
The crosstabulation is shown below.
Table of Fireplaces by Bonus
Fireplaces(Number Bonus(Sale Price >
of fireplaces) $175,000)
Frequency
Percent Not
Row Pct Bonus Bonus
Col Pct Eligible Eligible Total
0 177 18 195
59.00 6.00 65.00
90.77 9.23
69.41 40.00
1 68 25 93
22.67 8.33 31.00
73.12 26.88
26.67 55.56
2 10 2 12
3.33 0.67 4.00
83.33 16.67
3.92 4.44
Total 255 45 300
85.00 15.00 100.00
Because the p-value of the Mantel-Haenszel chi-square is 0.0010, you can conclude at the 0.05
significance level that there is evidence of an ordinal association between Bonus and Fireplaces.
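For a two-way table with ordinal scores, the Mantel-Haenszel chi-square reduces to (n - 1) times the squared Pearson correlation of the row and column scores. Using the Pearson correlation of 0.1896 from the measures table that follows, a quick Python check (illustrative only, not part of the course code) recovers a p-value near 0.0010:

```python
import math

n = 300
r = 0.1896          # Pearson correlation of the table scores
qmh = (n - 1) * r ** 2

# p-value for a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(qmh / 2))
print(round(qmh, 2), round(p, 4))   # roughly 10.75 and 0.001
```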
The Spearman correlation statistic and the 95% confidence bounds are shown below.
Statistic Value ASE 95% Confidence Limits
Gamma 0.4964 0.1111 0.2786 0.7143
Kendall's Tau-b 0.2072 0.0585 0.0926 0.3218
Stuart's Tau-c 0.1449 0.0433 0.0600 0.2298
Somers' D C|R 0.1510 0.0451 0.0626 0.2395
Somers' D R|C 0.2842 0.0786 0.1301 0.4383
Pearson Correlation 0.1896 0.0591 0.0737 0.3054
Spearman Correlation 0.2107 0.0594 0.0943 0.3272
Lambda Asymmetric C|R 0.0000 0.0000 0.0000 0.0000
Lambda Asymmetric R|C 0.0667 0.0603 0.0000 0.1849
Lambda Symmetric 0.0467 0.0424 0.0000 0.1298
Uncertainty Coefficient C|R 0.0571 0.0298 0.0000 0.1156
Uncertainty Coefficient R|C 0.0313 0.0167 0.0000 0.0640
Uncertainty Coefficient Symmetric 0.0404 0.0213 0.0000 0.0823
The Spearman Correlation (0.2107) indicates that there is a moderate, positive ordinal relationship
between Fireplaces and Bonus (that is, as Fireplaces levels increase, Bonus tends to increase).
The ASE is the asymptotic standard error (0.0594), which is an appropriate measure of the standard error
for larger samples.
Because the 95% confidence interval (0.0943, 0.3272) for the Spearman correlation statistic does not
contain 0, the relationship is significant at the 0.05 significance level.
The confidence bounds are valid only if your sample size is large. A general guideline is to have a sample
size of at least 25 for each degree of freedom in the Pearson chi-square statistic.
Exercises
a. Invoke the FREQ procedure and create one-way frequency tables for the categorical variables.
1) What is the measurement scale of each variable?
Variable Measurement Scale
Unsafe
Type
Region
Weight
Size
2) What is the proportion of cars made in North America?
3) For the variables Unsafe, Size, Region, and Type, are there any unusual data values that
warrant further investigation?
b. Use PROC FREQ to examine the crosstabulation of the variables Region by Unsafe. Generate
a temporary format to clearly identify the values of Unsafe. Along with the default output,
generate the expected frequencies, the chi-square test of association, and the odds ratio.
Use the following code for the format:
proc format;
value safefmt 0='Average or Above'
1='Below Average';
run;
1) For the cars made in Asia, what percentage had a below-average safety score?
2) For the cars with an average or above safety score, what percentage was made in North
America?
3) Do you see a statistically significant (at the 0.05 level) association between Region
and Unsafe?
4) What does the odds ratio compare and what does this one say about the difference in odds
between Asian and North American cars?
c. Use the variable named Size. Examine the ordinal association between Size and Unsafe.
Use PROC FREQ.
1) What statistic should you use to detect an ordinal association between Size and Unsafe?
2) Do you reject or fail to reject the null hypothesis at the 0.05 level?
3) What is the strength of the ordinal association between Size and Unsafe?
4) What is the 95% confidence interval around that statistic?
Objectives
- Define the concepts of logistic regression.
- Fit a binary logistic regression model using the LOGISTIC procedure.
- Describe the standard output from the LOGISTIC procedure with one continuous predictor variable.
- Read and interpret odds ratio tables and plots.
Overview
Analyses can be organized by the type of response variable and the type of predictors (categorical, continuous, or continuous and categorical).
7.3 Introduction to Logistic Regression 7-37
Overview

Response      Analysis
Continuous    Linear Regression
Categorical   Logistic Regression
Regression analysis enables you to characterize the relationship between a response variable
and one or more predictor variables. In linear regression, the response variable is continuous.
In logistic regression, the response variable is categorical.
If the response variable is dichotomous (two categories), the appropriate logistic regression model
is binary logistic regression.
If you have more than two categories (levels) within the response variable, then there are two possible
logistic regression models:
1. If the response variable is nominal, you fit a nominal logistic regression model.
2. If the response variable is ordinal, you fit an ordinal logistic regression model.
You might be tempted to analyze a regression model with a binary response variable using PROC
GLMSELECT, PROC REG, or PROC GLM. However, there are problems with that. Besides the arbitrary
nature of the coding, there is the problem that the predicted values will take on values that have no
intrinsic meaning, with regard to your response variable. There is also the mathematical inconvenience of
not being able to assume normality and constant variance when the response variable has only two values.
Instead of modeling the zeros and ones directly, another way of thinking about modeling a binary variable
is to model the probability of either the zero or the one. If you can model the probability of the one
(called p), then you have also modeled the probability of the zero, which is (1–p). Probabilities are truly
continuous, so this line of thinking might sound compelling at first.
One problem is that the predicted values from a linear model can assume, theoretically, any value.
However, probabilities are by definition bounded between 0 and 1.
Another problem is that the relationship between the probability of the outcome and a predictor variable
is usually nonlinear rather than linear. In fact, the relationship often resembles an S-shaped curve
(a “sigmoidal” relationship).
Probabilities do not have a random normal error associated with them, but rather a binomial error
of p*(1-p). That error is greatest at probabilities close to 0.5 and lowest near 0 and 1.
As mentioned above, probabilities have a binomial error of the form p*(1-p)=(p-p2). Taking
the derivative of this expression with respect to p yields the expression 1-2*p. Setting the
derivative equal to zero and solving for p returns a value of 0.5. This binomial error equation
is a downward facing parabola, which means that the greatest value is at 0.5 and lowest values
are near 0 and 1.
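A brute-force check of this claim (illustrative Python, not part of the course code) evaluates the binomial error p*(1-p) over a grid of probabilities:

```python
# p*(1-p) is a downward-facing parabola that peaks at p = 0.5
# and falls toward 0 near p = 0 and p = 1.
grid = [i / 100 for i in range(1, 100)]
var = [p * (1 - p) for p in grid]
p_max = grid[var.index(max(var))]
print(p_max, max(var))   # 0.5 and 0.25
```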
Finally, there is no such thing as an “observed probability” and therefore least squares methods cannot
be used. The response variable is always either 0 or 1 and therefore the probability of the event is either
0% or 100%. This is another reason why it is untenable to assume a normal distribution of error.
pi = 1 / (1 + e^-(β0 + β1X1i))
This plot shows a model of the relationship between a continuous predictor and the probability
of an event or outcome. The linear model clearly does not fit if this is the true relationship between
X and the probability. In order to model this relationship directly, you must use a nonlinear function.
One such function is displayed. The S-shape of the function is known as a sigmoid.
The rate of change parameter of this function (β1) determines the rate of increase or decrease of the curve.
When the parameter value is greater than 0, the probability of the outcome increases as the predictor
variable values increase. When the parameter is less than 0, the probability decreases as the predictor
variable values increase. As the absolute value of the parameter increases, the curve has a steeper rate
of change. When the parameter value is equal to 0, the curve can be represented by a straight, horizontal
line that shows an equal probability of the event for everyone.
The values for this model cannot be estimated in PROC GLMSELECT, PROC REG, or PROC GLM
because this is not a linear model.
Logit Transformation
Logistic regression models transformed probabilities, called logits:

logit(pi) = ln( pi / (1 - pi) )

where
- i indexes all cases (observations)
- pi is the probability that the event (for example, a sale) occurs in the ith case
- ln is the natural log (to the base e).
A logistic regression model applies a logit transformation to the probabilities. Two of the problems that
you saw with modeling the probability directly were that probabilities were bounded between 0 and 1,
and that there was not likely a straight line relationship between predictors and probabilities.
First, deal with the problem of restricted range of the probability. What about the range of a logit?
As p approaches its maximum value of 1, the value ln(p/(1–p)) goes to infinity. As p approaches its
minimum value of 0, p/(1–p) approaches 0. The natural log of something approaching 0 is something that
goes to negative infinity. So, the logit has no upper or lower bounds.
If you can model the logit, then simple algebra enables you to model the odds or the probability.
The logit transformation ensures that the model generates estimated probabilities between 0 and 1.
The logit is the natural log of the odds. The odds and odds ratios were discussed in a previous section.
This relationship between the odds and the logit will become important later in this section.
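The unbounded range of the logit, and the fact that its inverse always returns a value strictly between 0 and 1, can be demonstrated with a short Python sketch (illustrative only, not part of the course code):

```python
import math

def logit(p):
    # Natural log of the odds p / (1 - p).
    return math.log(p / (1 - p))

def inv_logit(z):
    # Maps any real value back into the interval (0, 1).
    return 1 / (1 + math.exp(-z))

print(logit(0.5))                  # 0.0: even odds
print(logit(0.999), logit(0.001))  # about +6.91 and -6.91: no bounds
print(inv_logit(-50), inv_logit(50))
```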
Assumption
Assumption in logistic regression: The logit has a linear relationship with the predictor variables.
If the hypothesized nature of the direct relationship between X and p is correct, then the logit has a linear
relationship with X through the parameters. In other words, a linear function of X, additive in relation
to the parameters, can be used to model the logit. In that way, you can indirectly model the probability.
To verify this assumption, it would be useful to plot the logits by the predictor variable. (Logit plots are
illustrated in the appendix.)
logit(pi) = β0 + β1X1i + ... + βkXki

where
- logit(pi) = logit of the probability of the event
- β0 = intercept of the regression equation
- βk = parameter estimate of the kth predictor variable
For a binary response variable, the linear logistic model with one predictor variable has the form above.
Unlike linear regression, the logit is not normally distributed and the variance is not constant. Therefore,
logistic regression requires a more computationally complex estimation method, named the Method of
Maximum Likelihood, to estimate the parameters. This method finds the values of the parameters that
make the observed data most likely. This is accomplished by maximizing the likelihood function that
expresses the probability of the observed data as a function of the unknown parameters.
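The likelihood being maximized can be sketched in Python (illustrative only; the toy data here are hypothetical, and PROC LOGISTIC maximizes this quantity iteratively with Fisher's scoring rather than by trying candidate values):

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Binomial log-likelihood: sum of y*ln(p) + (1-y)*ln(1-p)."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Hypothetical toy data: larger x tends to go with y = 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

# A slope in the right direction makes the data more likely
# than a flat (intercept-only at 0) model:
print(log_likelihood(-3.5, 1.0, xs, ys) > log_likelihood(0.0, 0.0, xs, ys))
```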
logit(pi) = ln( pi / (1 - pi) )
LOGISTIC Procedure
General form of the LOGISTIC procedure:
Example: Fit a binary logistic regression model in PROC LOGISTIC. Select Bonus as the outcome
variable and Basement_Area as the predictor variable. Use the EVENT= option to model the
probability of being bonus eligible and request profile likelihood confidence intervals around
the estimated odds ratios.
/*st107d04.sas*/
ods graphics on;
proc logistic data=STAT1.ameshousing3 alpha=.05
plots(only)=(effect oddsratio);
model Bonus(event='1')=Basement_Area / clodds=pl;
title 'LOGISTIC MODEL (1):Bonus=Basement_Area';
run;
Selected PLOTS options:
EFFECT requests a plot of the predicted probability on the Y axis by the predictor on the X axis.
If there is more than one predictor variable in the model, the partial effect plot can be
requested using the option (X=<variable>).
ODDSRATIO requests a plot of the odds ratios, along with its (1-ALPHA) confidence limits. The width
of the confidence limits can be changed from the default of 95% using an ALPHA=
option in the PROC LOGISTIC statement. The chosen alpha level applies to all
confidence intervals produced in all tables and plots in that run of PROC LOGISTIC.
Selected MODEL statement options:
(EVENT=) specifies the event category for the binary response model. PROC LOGISTIC models
the probability of the event category. You can specify the value (formatted if a format is
applied) of the event category in quotation marks or you can specify one of the following
keywords. The default is EVENT=FIRST.
FIRST designates the first ordered category as the event.
LAST designates the last ordered category as the event.
CLODDS=PL requests profile likelihood confidence intervals for the odds ratios of all predictor
variables, which are desirable for small sample sizes. The CLODDS= option also enables
production of the ODDSRATIO plot.
SAS Output
Model Information
Data Set STAT1.AMESHOUSING3
Response Variable Bonus Sale Price > $175,000
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
The Model Information table describes the data set, the response variable, the number of response levels,
the type of model, the algorithm used to obtain the parameter estimates, and the number of observations
read and used.
The Optimization Technique is the iterative numerical technique that PROC LOGISTIC uses to estimate
the model parameters.
The model is assumed to be “binary logit” when there are exactly two response levels.
Number of Observations Read 300
Number of Observations Used 300
The Number of Observations Used is the count of all observations that are nonmissing for all variables
specified in the MODEL statement.
Response Profile
Ordered Total
Value Bonus Frequency
1 0 255
2 1 45
The Response Profile table shows the response variable values listed according to their ordered values.
By default, PROC LOGISTIC orders the response variable alphanumerically so that it bases the logistic
regression model on the probability of the smallest value. Because you used the EVENT=option in this
example, the model is based on the probability of being bonus eligible (Bonus=1). The Response Profile
table also shows frequencies of response values.
Probability modeled is Bonus=1.
It is advisable to check that the modeled response level is the one that you intended.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
The Model Convergence Status simply informs you that the convergence criterion was met. There are
a number of options to control the convergence criterion.
The optimization technique does not always converge to a maximum likelihood solution. When this
is the case, the output after this point cannot be trusted. Always check to see that the Convergence
criterion is satisfied.
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 255.625 161.838
SC 259.329 169.246
-2 Log L 253.625 157.838
The Testing Global Null Hypothesis: BETA=0 table provides three statistics to test the null hypothesis
that all regression coefficients of the model are 0.
A significant p-value for these tests provides evidence that at least one of the regression coefficients for
an explanatory variable is significantly different from 0. In this way, they are similar to the overall F test
in linear regression. The Likelihood Ratio Chi-Square is calculated as the difference between the -2 Log L
value of the baseline model (Intercept Only) and the -2 Log L value of the hypothesized model (Intercept
and Covariates). The statistic is asymptotically distributed as chi-square with degrees of freedom equal
to the difference in number of parameters between the hypothesized model and the baseline model. The
Score and Wald tests are also used to test whether all the regression coefficients are 0. The likelihood ratio
test is the most reliable, especially for small sample sizes (Agresti 1996). All three tests are asymptotically
equivalent and often give very similar values.
Wald statistics (p-values and confidence limits) require fewer computations to perform
and are therefore the default for most output in PROC LOGISTIC.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -9.7854 1.2896 57.5758 <.0001
Basement_Area 1 0.00739 0.00107 48.0617 <.0001
The Analysis of Maximum Likelihood Estimates table lists the estimated model parameters, their standard
errors, Wald Chi-Square values, and p-values.
The parameter estimates are the estimated coefficients of the fitted logistic regression model. The logistic
regression equation is logit(p̂) = -9.7854 + 0.00739*Basement_Area for this example.
The Wald chi-square and its associated p-value tests whether the parameter estimate is significantly
different from 0. For this example, the p-value for the variable Basement_Area is significant at the 0.05
significance level (p<.0001).
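Plugging the parameter estimates into the inverse logit gives a predicted probability for any basement area. A Python sketch (illustrative only; small differences from the slide values arise from rounding the coefficients):

```python
import math

def p_bonus(basement_area):
    # Fitted model: logit(p) = -9.7854 + 0.00739 * Basement_Area
    z = -9.7854 + 0.00739 * basement_area
    return 1 / (1 + math.exp(-z))

print(round(p_bonus(800), 3))    # about 0.02
print(round(p_bonus(1200), 3))   # about 0.285
```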
The estimated model is displayed on the probability scale in the Effect plot. The observed values are
plotted at probabilities 1.00 and 0.00.
The above tables and plots are described in detail in the next slides.
The odds ratio for a continuous predictor calculates the estimated relative odds for subjects that are one
unit apart on the continuous measure. For example, in the Housing Bonus example, Basement_Area
is the continuous measure. If you remember, the logit is the natural log of the odds. Because you can
calculate an estimated logit from the logistic model, the odds can be calculated by simply exponentiating
that value. An odds ratio for a one-unit difference is then the ratio of the exponentiated predicted logits for
two people who are one unit apart.
The odds ratio for Basement_Area indicates that the odds of being bonus eligible increase by about 0.7% for each additional square foot of basement area.
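This follows because the odds ratio for a one-unit change in a continuous predictor is the exponentiated slope. A quick Python check (illustrative only, not part of the course code):

```python
import math

beta = 0.00739   # slope estimate for Basement_Area

# One-unit (1 square foot) odds ratio: exp(beta).
print(round(math.exp(beta), 4))          # 1.0074, about a 0.7% increase

# For a 100-square-foot difference, exponentiate 100 * beta instead:
print(round(math.exp(100 * beta), 4))    # about 2.09, roughly doubled odds
```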
The 95% confidence limits indicate that you are 95% confident that the true odds ratio is between 1.005
and 1.010. Because the 95% confidence interval does not include 1.000, the odds ratio is significant
at the 0.05 alpha level.
If you want a different significance level for the confidence intervals, you can use the ALPHA=
option in the MODEL statement. The value must be between 0 and 1. The default value of 0.05
results in the calculation of a 95% confidence interval.
The profile likelihood confidence intervals are different from the Wald-based confidence intervals.
This difference is because the Wald confidence intervals use a normal error approximation, whereas the
profile likelihood confidence intervals are based on the value of the log-likelihood. These likelihood-ratio
confidence intervals require a much greater number of computations, but are generally preferred to the
Wald confidence intervals, especially for sample sizes less than 50 (Allison 1999).
The Odds Ratio plot displays the results of the Odds Ratio table graphically. A reference line shows
the null hypothesis. When the confidence interval crosses the reference line, the effect of the variable
is not significant.
Comparing Pairs
To find concordant, discordant, and tied pairs, compare
houses that had the outcome of interest against houses
that did not.
Concordant Pair
Compare a 1200 square foot basement that was bonus
eligible with an 800 square foot basement that was not.
Not Eligible, 800 sqft: P(Eligible)=.0204    Bonus Eligible, 1200 sqft: P(Eligible)=.2865
For all pairs of observations with different values of the response variable, a pair is concordant
if the observation with the outcome has a higher predicted outcome probability (based on the model)
than the observation without the outcome.
Discordant Pair
Compare a 1400 square foot basement that was bonus
eligible with a 1600 square foot basement that was not.
Not Eligible, 1600 sqft: P(Eligible)=.8855    Bonus Eligible, 1400 sqft: P(Eligible)=.6379
A pair is discordant if the observation with the outcome has a lower predicted outcome probability
than the observation without the outcome.
Tied Pair
Compare two 1350 square foot basements. One was
bonus eligible and the other not.
Not Eligible, 1350 sqft: P(Eligible)=.5490. Bonus Eligible, 1350 sqft: P(Eligible)=.5490.
A pair is tied if it is neither concordant nor discordant. (The probabilities are the same.)
The Association of Predicted Probabilities and Observed Responses table lists several measures
of association to help you assess the predictive ability of the logistic model.
The number of pairs used to calculate the values of this table is equal to the product of the counts
of observations with positive responses and negative responses. In this example, that value is
255*45=11,475.
You can use these percentages as goodness-of-fit measures to compare one model to another. In general,
higher percentages of concordant pairs and lower percentages of discordant pairs indicate a more
desirable model.
The four rank correlation indices (Somers' D, Gamma, Tau-a, and c) are computed from the numbers
of concordant, discordant, and tied pairs of observations. In general, a model with higher values for
these indices has better predictive ability than a model with lower values for these indices.
The c (concordance) statistic estimates the probability of an observation with the outcome having a higher
predicted probability than an observation without the outcome. It is calculated as the percent concordant
plus one half the percent tied. The range of possible values is 0.500 (no better predictive power than
flipping a fair coin) to 1.000 (perfect prediction). The value of 0.895 shows a very strong ability of
Basement_Area to discriminate between houses that were bonus eligible and houses that were not.
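The pair counting described above can be sketched in a few lines. This is a minimal illustration of the arithmetic SAS performs, using only the predicted probabilities from the three slide examples (a toy set, not the full set of pairs from the actual model):

```python
from itertools import product

def concordance(event_probs, nonevent_probs):
    """Count concordant, discordant, and tied pairs, and compute c
    as the proportion concordant plus half the proportion tied."""
    conc = disc = tied = 0
    for p_event, p_nonevent in product(event_probs, nonevent_probs):
        if p_event > p_nonevent:
            conc += 1
        elif p_event < p_nonevent:
            disc += 1
        else:
            tied += 1
    pairs = len(event_probs) * len(nonevent_probs)
    return conc, disc, tied, (conc + 0.5 * tied) / pairs

# predicted probabilities from the three slide examples
eligible = [0.2865, 0.6379, 0.5490]      # bonus-eligible houses
not_eligible = [0.0204, 0.8855, 0.5490]  # houses that were not
conc, disc, tied, c = concordance(eligible, not_eligible)
```

On this toy set, 9 pairs are formed (3 × 3), matching the product-of-counts rule stated above.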
Exercises
7.4 Logistic Regression with Categorical Predictors 7-57
Objectives
State how a logistic model with categorical predictors
does and does not differ from one with continuous
predictors.
Describe what a CLASS statement does.
Define the standard output from the LOGISTIC
procedure with categorical predictor variables.
Overview
Analyses can be classified by the type of response (continuous or categorical) and the type of predictors (continuous, categorical, or both). This section addresses models for a categorical response with categorical predictors.
The CLASS statement creates a set of “design variables” representing the information contained in any
categorical variables. These design variables, rather than the original categorical variables, are
incorporated into the model calculations. Character variables cannot be used as-is in the model;
SAS cannot adequately use a variable with values such as 'yes' or 'no' when determining a model.
Even if categorical variables are represented by numbers such as 1, 2, 3, the CLASS statement tells SAS
to set up design variables to represent the categories. This is necessary because the numeric values that
are assigned to the levels of the categorical variable are generally arbitrary and might not truly reflect
distances between levels.
For effect coding (also called deviation from the mean coding), the number of design variables created
is the number of levels of the CLASS variable minus 1. For example, consider a variable IncLevel, which
has three levels. In this case, two design variables are created. For the last level of the CLASS variable
(High Income), all the design variables have a value of -1. Parameter estimates of the CLASS main
effects using this coding scheme estimate the difference between the effect of each level and the average
effect over all levels.
If you use Effect Coding for a CLASS variable, then the parameter estimates and p-values reflect
differences from the mean logit value over all levels. So, for IncLevel, the Estimate shows the estimated
difference in logit values between IncLevel=1 (Low Income) and the average logit across all income
levels.
For reference cell coding, parameter estimates of the CLASS main effects estimate the difference between
the effect of each level and the last level, called the reference level. For example, the effect for the level
Low estimates the logit difference between Low and High. You can choose the reference level in the
CLASS statement.
Notice the difference between this table and the previous parameter estimates table. Because you used
Reference Cell Coding, instead of Effect Coding, the meanings of the parameter estimates and p-values
are different. Now, the parameter estimate and p-value for IncLevel=1 reflect the difference between
IncLevel=1 and Inclevel=3 (the reference level).
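The two coding schemes for the three-level IncLevel variable can be sketched as follows. This is a simplified illustration of the design variables the CLASS statement builds (reference level taken as the last level, as with the REF=LAST default), not SAS internals:

```python
def design_rows(levels, scheme="reference"):
    """Design-variable row for each level of a CLASS variable,
    with the last level as the reference level."""
    k = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
        elif scheme == "effect":
            rows[level] = [-1] * (k - 1)  # effect coding: last level is all -1
        else:
            rows[level] = [0] * (k - 1)   # reference coding: last level is all 0
    return rows

inc_levels = ["Low", "Medium", "High"]  # the IncLevel example
effect = design_rows(inc_levels, "effect")
reference = design_rows(inc_levels, "reference")
```

Both schemes produce two design variables; only the row for the last level (High Income) differs, which is why the parameter estimates carry different interpretations.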
It is important to know what type of parameterization you are using in order to interpret
and report the results of this table.
Odds ratios for categorical predictors are reported as pairwise (level versus level) comparisons in
PROC LOGISTIC, no matter which parameterization is chosen. Thus, even if Effect Coding is selected for the Gender
variable, the odds ratio tables display odds comparisons between females and males (and not females
versus the average of both). The same holds true for variables with more than two levels; comparisons
will not be group versus the average of all.
Each design variable is assigned its own beta value. The number of parameters in the logistic model takes
into account the intercept, the number of continuous predictors, and the number of design variables
assigned to CLASS variables.
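That count can be written as a one-line rule. A minimal sketch, applied to the model in the example that follows (Basement_Area continuous, Fireplaces with 3 levels, Lot_Shape_2 with 2 levels):

```python
def n_parameters(n_continuous, class_levels):
    """Intercept + one slope per continuous predictor
    + (levels - 1) design variables per CLASS variable."""
    return 1 + n_continuous + sum(k - 1 for k in class_levels)

# Basement_Area (continuous), Fireplaces (3 levels), Lot_Shape_2 (2 levels)
p = n_parameters(1, [3, 2])
```

The result, 5, matches the rows of the Analysis of Maximum Likelihood Estimates table below (Intercept, Basement_Area, Fireplaces 1, Fireplaces 2, Lot_Shape_2 Irregular).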
Example: Fit a binary logistic regression model in PROC LOGISTIC. Select Bonus as the outcome
variable and Basement_Area, Fireplaces, and Lot_Shape_2 as the predictor variables.
Specify reference cell coding and specify Regular as the reference group for Lot_Shape_2
and 0 as the reference level for Fireplaces. Also use the EVENT= option to model the
probability of bonus eligibility and request profile likelihood confidence intervals around
the estimated odds ratios.
/*st107d05.sas*/
ods graphics on;
proc logistic data=STAT1.ameshousing3 plots(only)=(effect oddsratio);
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area Fireplaces Lot_Shape_2 / clodds=pl;
   units Basement_Area=100;
   title 'LOGISTIC MODEL (2):Bonus= Basement_Area Fireplaces Lot_Shape_2';
run;
Selected PROC LOGISTIC statement:
UNITS enables you to specify units of change for the continuous explanatory variables so that
customized odds ratios can be estimated.
Selected CLASS statement options:
(REF='level') specifies the event category chosen as the reference level when using Reference or Effect
parameterization. You can specify the value (formatted if a format is applied) of the
reference category in quotation marks or you can specify one of the following keywords.
The default is REF=LAST.
FIRST designates the first ordered category as the reference level.
LAST designates the last ordered category as the reference level.
PARAM= specifies the parameterization. This value can be specified for each CLASS variable
by typing it within parentheses after the variable name, or for all CLASS variables,
by typing it after the options slash (/) at the end of the list of CLASS variables.
If there are numerous levels in the CLASS variable, you might want to use subject-matter
knowledge to reduce the number of levels. This is especially important when the levels have
few or no observations.
Model Information
Data Set STAT1.AMESHOUSING3
Response Variable Bonus Sale Price > $175,000
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value Bonus Frequency
1 0 255
2 1 44
The Class Level Information table includes the predictor variables in the CLASS statement. Because you
used the PARAM=REF and REF='Regular' options, this table reflects your choice of
Lot_Shape_2='Regular' as the reference level. The design variable is 1 when
Lot_Shape_2='Irregular' and 0 when Lot_Shape_2='Regular'. The reference level for Fireplaces
is 0, so there are two design variables, each coded 0 for observations where Fireplaces=0.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
The SC value in the Basement_Area-only model was 169.246. Here it is 159.001. Recalling that smaller
values imply better fit, you can conclude that this model fits better.
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 119.3133 4 <.0001
Score 91.7250 4 <.0001
Wald 49.8671 4 <.0001
This model is statistically significant, indicating that at least one of the predictors in the model
is useful in predicting Bonus.
Type 3 Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
Basement_Area 1 38.1356 <.0001
Fireplaces 2 5.2060 0.0741
Lot_Shape_2 1 16.9421 <.0001
The Type 3 Analysis of Effects table is generated when a predictor variable is used in the CLASS
statement. This analysis is similar to the individual tests in the GLMSELECT procedure parameter
estimates table. Just as in PROC GLMSELECT and PROC REG, these are adjusted effects.
Fireplaces is not statistically significant at the 0.05 level, while the remaining predictors are
statistically significant.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -11.0882 1.5384 51.9467 <.0001
Basement_Area 1 0.00744 0.00120 38.1356 <.0001
Fireplaces 1 1 0.8810 0.4658 3.5770 0.0586
Fireplaces 2 1 -0.7683 0.9654 0.6335 0.4261
Lot_Shape_2 Irregular 1 1.9025 0.4622 16.9421 <.0001
For CLASS variables, effects are displayed for each of the design variables. Because reference cell
coding was used, each effect is measured against the reference level. For example, the estimate for
Lot_Shape_2 | Irregular shows the difference in logits between houses with irregular and regular lot
shapes. Fireplaces | 1 shows the logit difference between houses with 1 fireplace and 0 fireplaces while
Fireplaces | 2 shows the difference in logits between houses with 2 fireplaces and 0 fireplaces. Not
all of these contrasts are statistically significant.
The c (Concordance) statistic value is 0.930 for this model, indicating that 93% of the positive and
negative response pairs are correctly sorted using Basement_Area, Fireplaces, and Lot_Shape_2.
Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
Basement_Area 100.0 2.105 1.696 2.727
Fireplaces 1 vs 0 1.0000 2.413 0.973 6.127
Fireplaces 2 vs 0 1.0000 0.464 0.054 2.703
Lot_Shape_2 Irregular vs Regular 1.0000 6.703 2.786 17.301
The odds ratios show that, adjusting for the other predictor variables, houses with irregular lot shapes
had 6.703 times the odds of being bonus eligible compared with houses with regular lot shapes. Houses
with 1 fireplace had nearly 2.5 times the odds (2.413) of houses with 0 fireplaces, and houses with
2 fireplaces had 53.6% lower odds than houses with 0 fireplaces. The UNITS statement applies to the
odds ratio table requested by the CLODDS=PL option. The table shows that a 100 square foot larger
basement is associated with a 110.5% increase in the odds of bonus eligibility. The ODDSRATIO plot
displays these values graphically.
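Each reported odds ratio is the exponentiated parameter estimate, with the UNITS value multiplying the slope for Basement_Area. A quick check of that arithmetic, using the estimates from the table above (small rounding differences versus the printed table are expected):

```python
from math import exp

# estimates from the Analysis of Maximum Likelihood Estimates table
b_basement  = 0.00744   # per square foot
b_fire1     = 0.8810    # 1 fireplace vs 0
b_fire2     = -0.7683   # 2 fireplaces vs 0
b_lot_irreg = 1.9025    # Irregular vs Regular

or_basement_100 = exp(100 * b_basement)  # UNITS Basement_Area=100
or_fire1 = exp(b_fire1)
or_fire2 = exp(b_fire2)
or_lot = exp(b_lot_irreg)
```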
Finally, the Effects plot shows the probability of bonus eligibility across all combinations of categories
and levels of all three predictor variables.
This plot is obtained by applying the parameter estimates from the logistic model to values
of the predictors and then converting the predictions to the probability scale.
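A minimal sketch of that conversion, using the parameter estimates reported above. The specific house scored here is a hypothetical example, not a point read off the plot:

```python
from math import exp

def predicted_probability(basement_area, fireplaces, irregular_lot):
    """Apply the reported parameter estimates, then invert the logit."""
    logit = (-11.0882
             + 0.00744 * basement_area
             + (0.8810 if fireplaces == 1 else 0.0)
             + (-0.7683 if fireplaces == 2 else 0.0)
             + (1.9025 if irregular_lot else 0.0))
    return 1.0 / (1.0 + exp(-logit))

# hypothetical house: 1200 sqft basement, 1 fireplace, irregular lot
p = predicted_probability(1200, 1, irregular_lot=True)
```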
Exercises
7.5 Stepwise Selection with Interactions and Predictions 7-69
Objectives
Fit a multiple logistic regression model with main
effects and interactions using the backward
elimination method.
Explain interactions using graphs.
Calculate predictions in a logistic setting using
PROC PLM.
Overview
Analyses can be classified by the type of response (continuous or categorical) and the type of predictors (continuous, categorical, or both). This section continues with models for a categorical response using both continuous and categorical predictors.
If you are doing exploratory analysis and want to find a best subset model, PROC LOGISTIC provides
the three stepwise methods that are available in PROC REG or PROC GLMSELECT. However, the
default selection criteria are not the same. Remember that you can always change the selection criteria
using the SLENTRY= and SLSTAY= options in the MODEL statement.
If you have a large number of variables, you might first need to try a variable reduction method such
as variable clustering.
Model hierarchy refers to the requirement that, for any term to be in the model, all effects contained
in the term must be present in the model. For example, in order for the interaction X2*X4 to enter the
model, the main effects X2 and X4 must be in the model. Likewise, neither effect X2 nor X4 can leave
the model while the interaction X2*X4 is in the model.
When you use the backward elimination method with interactions in the model, PROC LOGISTIC begins
by fitting the full model with all the main effects and interactions. PROC LOGISTIC then eliminates the
nonsignificant interactions one at a time, starting with the least significant interaction (the one with the
largest p-value). Next, PROC LOGISTIC eliminates the nonsignificant main effects not involved in any
significant interactions. The final model should consist of only significant interactions, the main effects
involved in those interactions, and any other significant main effects.
For a more customized analysis, the HIERARCHY= option specifies whether the hierarchy
is maintained and whether a single effect or multiple effects are allowed to enter or leave
the model in one step for forward, backward, and stepwise selection.
The default is HIERARCHY=SINGLE, meaning that SAS will not remove a main effect before first
removing all interactions involving that main effect. You can change this behavior by inserting
the HIERARCHY= option in the MODEL statement. See the SAS/STAT® 9.4 User’s Guide in the SAS
online documentation for more information about using this option.
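The hierarchy rule reduces to a subset test on the terms of each effect. A sketch of that check under stated assumptions (a hypothetical helper, not SAS internals), applied to the X2/X4 example from the text:

```python
def removable(effect, model_effects):
    """Under HIERARCHY=SINGLE, an effect can leave the model only when
    no higher-order interaction in the model still contains it."""
    terms = set(effect.split("*"))
    for other in model_effects:
        # terms < set(...) tests for a proper subset, i.e. nesting
        if other != effect and terms < set(other.split("*")):
            return False
    return True

# the X2/X4 example from the text
model = ["X2", "X4", "X2*X4"]
```

Here neither X2 nor X4 can leave while X2*X4 remains, but the interaction itself is free to go.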
Example: Fit a multiple logistic regression model using the backward elimination method. The full
model should include all the main effects and two-way interactions.
/*st107d06.sas*/ /*Part A*/
proc logistic data=STAT1.ameshousing3 plots(only)=(effect oddsratio);
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Fireplaces|Lot_Shape_2 @2 /
         selection=backward clodds=pl slstay=0.10;
   units Basement_Area=100;
   title 'LOGISTIC MODEL (3): Backward Elimination '
         'Bonus=Basement_Area|Fireplaces|Lot_Shape_2';
run;
The bar notation with the @2 constructs a model with all the main effects and the two-factor
interactions. If you increase it to @3, then you construct a model with all of the main effects,
the two-factor interactions, and the three-factor interaction. However, the three-factor interaction
might be more difficult to interpret.
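The expansion performed by the bar notation can be sketched with a hypothetical helper (an illustration of what the @n truncation keeps, not SAS syntax):

```python
from itertools import combinations

def bar_expansion(factors, max_order):
    """Expand the bar notation A|B|C @n into all crossings
    of the factors up to the given order."""
    return ["*".join(combo)
            for order in range(1, max_order + 1)
            for combo in combinations(factors, order)]

two_way = bar_expansion(["Basement_Area", "Fireplaces", "Lot_Shape_2"], 2)
three_way = bar_expansion(["Basement_Area", "Fireplaces", "Lot_Shape_2"], 3)
```

With @2 the model has three main effects and three two-factor interactions; raising it to @3 adds the single three-factor interaction.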
Selected MODEL statement option:
SELECTION= specifies the method to select the variables in the model. BACKWARD requests
backward elimination, FORWARD requests forward selection, NONE fits the
complete model specified in the MODEL statement, STEPWISE requests stepwise
selection, and SCORE requests best subset selection. The default is NONE.
Model Information
Data Set STAT1.AMESHOUSING3
Response Variable Bonus Sale Price > $175,000
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value Bonus Frequency
1 0 255
2 1 44
Note: 1 observation was deleted due to missing values for the response or explanatory variables.
All information to this point is the same as that from the previous model.
Backward Elimination Procedure
Class Level Information
Design
Class Value Variables
Fireplaces 0 0 0
1 1 0
2 0 1
Lot_Shape_2 Irregular 1
Regular 0
The Model Fit Statistics and Testing Global Null Hypothesis tables at Step 0 are presented.
Step 0. The following effects were entered:
Intercept Basement_Area Fireplaces Basement_Area*Fireplaces Lot_Shape_2
Basement_Area*Lot_Shape_2 Fireplaces*Lot_Shape_2
No (additional) effects met the 0.1 significance level for removal from the model.
The procedure stops after the two interactions involving Fireplaces are removed.
Summary of Backward Elimination
Effect Number Wald Variable
Step Removed DF In Chi-Square Pr > ChiSq Label
1 Fireplace*Lot_Shape_ 2 5 3.2305 0.1988
2 Basement_*Fireplaces 2 4 1.7237 0.4224
Joint Tests
Wald
Effect DF Chi-Square Pr > ChiSq
Basement_Area 1 18.2896 <.0001
Fireplaces 2 4.7171 0.0946
Lot_Shape_2 1 5.0247 0.0250
Basement_*Lot_Shape_ 1 3.1127 0.0777
Note: Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The joint test for
an effect is a test that all the parameters associated with that effect are zero. Such joint tests might
not be equivalent to Type 3 effect tests under GLM parameterization.
Notice that when a CLASS statement is used, new rows are added to the parameter estimates table. These
represent design variables that SAS creates in order to test the interactions.
As described in the ANOVA chapter, an interaction between two variables means that the effect of one
variable is different at different values of the other variable. This makes the model more complex to
interpret.
Association of Predicted Probabilities and
Observed Responses
Percent Concordant 93.8 Somers' D 0.876
Percent Discordant 6.2 Gamma 0.876
Percent Tied 0.1 Tau-a 0.221
Pairs 11220 c 0.938
The c value is a slight improvement over the previous model (c=0.930) that only included the main
effects.
Odds ratios are not calculated for effects involved in interactions. Any single odds ratio for
Basement_Area or for Lot_Shape_2 would be misleading because the effects vary for each at different
levels of the other variable.
Odds Ratio Estimates and Profile-Likelihood Confidence
Intervals
Effect Unit Estimate 95% Confidence Limits
Fireplaces 1 vs 0 1.0000 2.153 0.865 5.500
Fireplaces 2 vs 0 1.0000 0.390 0.047 2.251
In order to estimate and plot odds ratios for the simple effects of variables involved in an interaction,
an ODDSRATIO statement with the AT= option can be used. An EFFECTPLOT statement can help
display the interaction, as well.
/*st107d06.sas*/ /*Part B*/
proc logistic data=STAT1.ameshousing3 plots(only)=oddsratio(range=clip);
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Lot_Shape_2 Fireplaces;
   units Basement_Area=100;
   oddsratio Basement_Area / at (Lot_Shape_2=ALL) cl=pl;
   oddsratio Lot_Shape_2 / at (Basement_Area=1000 1500) cl=pl;
   title 'LOGISTIC MODEL (3.1): Bonus=Basement_Area|Lot_Shape_2 Fireplaces';
run;
Selected PROC LOGISTIC statement PLOTS option:
RANGE= with suboptions (<min><,max>) | CLIP, specifies the range of the displayed odds
ratio axis. The RANGE=CLIP option has the same effect as specifying the minimum
odds ratio as min and the maximum odds ratio as max. By default, all odds ratio
confidence intervals are displayed. This option is helpful when one or more odds ratio
confidence intervals are so large that the smaller ones become difficult to see on the
scale required to show the larger ones.
Selected statement:
ODDSRATIO produces odds ratios for a variable even when the variable is involved in interactions
with other covariates, and for classification variables that use any parameterization.
You can also specify variables on which constructed effects are based, in addition
to the names of COLLECTION or MULTIMEMBER effects.
Selected options for the ODDSRATIO statement:
AT specifies fixed levels of the interacting covariates. If a specified covariate does not
interact with the variable, then its AT list is ignored. For continuous interacting
covariates, you can specify one or more numbers in the value-list. For classification
covariates, you can specify one or more formatted levels of the covariate enclosed
in single quotation marks (for example, A='cat' 'dog'), you can specify the keyword
REF to select the reference-level, or you can specify the keyword ALL to select all
levels of the classification variable. By default, continuous covariates are set to their
means, while CLASS covariates are set to ALL.
Partial PROC LOGISTIC Output
Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Odds Ratio Estimate 95% Confidence Limits
Basement_Area units=100 at Lot_Shape_2=Irregular 1.791 1.421 2.396
Basement_Area units=100 at Lot_Shape_2=Regular 2.960 1.932 5.315
Lot_Shape_2 Irregular vs Regular at Basement_Area=1000 20.278 4.623 146.987
Lot_Shape_2 Irregular vs Regular at Basement_Area=1500 1.643 0.283 9.145
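The two Lot_Shape_2 odds ratios differ because, with the interaction in the model, the log odds ratio for Irregular versus Regular is linear in Basement_Area. The coefficients below are back-derived from the two reported odds ratios, so treat them as approximations rather than values printed in the output:

```python
from math import exp, log

# back-derived (approximate) coefficients from the two reported
# Lot_Shape_2 odds ratios above
or_1000, or_1500 = 20.278, 1.643
b_int = (log(or_1500) - log(or_1000)) / (1500 - 1000)  # interaction slope
b_lot = log(or_1000) - b_int * 1000                    # main-effect term

def lot_shape_or(basement_area):
    """Odds ratio for Irregular vs Regular at a given basement area."""
    return exp(b_lot + b_int * basement_area)
```

The negative interaction slope is what the plot shows: the lot shape effect shrinks as basement area grows.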
Notice the effect of the RANGE=CLIP suboption. The Odds Ratio axis is clipped just beyond the odds
ratio estimate of Lot_Shape_2 Irregular versus Regular at Basement_Area=1000. The upper
bound of the associated 95% confidence interval is 146.987.
From this plot it is clear that the lot shape effect is different at different values of basement area.
Example: Using the model selected from backward selection including main effects and two-way
interactions, generate predictions for bonus eligibility for new data.
/*st107d07.sas*/
ods select none;
proc logistic data=STAT1.ameshousing3;
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Lot_Shape_2 Fireplaces;
   units Basement_Area=100;
   store out=isbonus;
run;
ods select all;

data newhouses;
   length Lot_Shape_2 $9;
   input Fireplaces Lot_Shape_2 $ Basement_Area;
   datalines;
0 Regular 1060
2 Regular 775
2 Irregular 1100
1 Irregular 975
1 Regular 800
;
run;
Store Information
Distribution Binary
Class Variables Fireplaces Lot_Shape_2 Bonus
Model Effects Intercept Basement_Area Lot_Shape_2 Basement_*Lot_Shape_ Fireplaces
The PROC PLM output shows that the house with the highest predicted probability (0.306) of being
bonus eligible has an irregular lot shape, 1 fireplace, and a basement area of 975 square feet. The house
with the lowest predicted probability (0.0004) has a regular lot shape, 2 fireplaces, and a basement area
of 775 square feet.
Care should be taken to ensure that predictions are made only for new data records that fall within
the range of the training data. If not, predictions could be invalid due to extrapolation.
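One hedged way to guard against extrapolation is a simple range check before scoring. This is a sketch under stated assumptions; the training range for Basement_Area shown here is hypothetical, not taken from the data set:

```python
def within_training_range(new_row, training_ranges):
    """Flag rows whose continuous predictors fall outside the
    min/max observed in the training data."""
    return all(lo <= new_row[var] <= hi
               for var, (lo, hi) in training_ranges.items())

# hypothetical training range for Basement_Area
ranges = {"Basement_Area": (500.0, 1800.0)}
ok = within_training_range({"Basement_Area": 975}, ranges)
bad = within_training_range({"Basement_Area": 3000}, ranges)
```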
Exercises
The variable Size is coded (1, 2, 3), but the applied format requires that the formatted value
be used in the CLASS statement for the REF= category.
value sizefmt 1='Small'
2='Medium'
3='Large';
7.6 Solutions
Solutions to Exercises
1. Performing Tests and Measures of Association
An insurance company wants to relate the safety of vehicles to several other variables. A score
is given to each vehicle model, using the frequency of insurance claims as a basis. The data are
in the STAT1.safety data set.
a. Invoke the FREQ procedure and create one-way frequency tables for the categorical variables.
/*st107s01.sas*/ /*Part A*/
ods graphics off;
proc freq data=STAT1.safety;
   tables Unsafe Type Region Size;
   title "Safety Data Frequencies";
run;
ods graphics on;
Cumulative Cumulative
Unsafe Frequency Percent Frequency Percent
0 66 68.75 66 68.75
1 30 31.25 96 100.00
Cumulative Cumulative
Type Frequency Percent Frequency Percent
Large 16 16.67 16 16.67
Medium 29 30.21 45 46.88
Small 20 20.83 65 67.71
Sport/Utility 16 16.67 81 84.38
Sports 15 15.63 96 100.00
Cumulative Cumulative
Region Frequency Percent Frequency Percent
Asia 35 36.46 35 36.46
N America 61 63.54 96 100.00
Cumulative Cumulative
Size Frequency Percent Frequency Percent
1 35 36.46 35 36.46
2 29 30.21 64 66.67
3 32 33.33 96 100.00
1) For the cars made in Asia, what percentage had a below-average safety score?
Region is a row variable, so look at the Row Pct value in the Below Average cell
of the Asia row. That value is 42.86.
2) For the cars with an average or above safety score, what percentage was made in North
America?
The Col Pct value for the cell for North America in the column for Average
or Above is 69.70.
3) Do you see a statistically significant (at the 0.05 level) association between Region
and Unsafe?
The association is not statistically significant at the 0.05 alpha level. The p-value
is 0.0631.
4) What does the odds ratio compare and what does this one say about the difference in odds
between Asian and North American cars?
The odds ratio compares the odds of below average safety for North America versus
Asia. The odds ratio of 0.4348 means that cars made in North America have 56.52
percent lower odds for being unsafe than cars made in Asia.
Recall that the odds ratios given in the Estimates of Relative Risk table are calculated
as row1/row2 for column1. In this problem, that comparison is Asia versus N America
for the outcome Average or Above in safety. The value 0.4348 means that the odds
of an Average or Above safety score for cars made in Asia are 0.4348 times the odds
for cars made in North America. To compare N America to Asia, still using Average
or Above for safety, invert the odds ratio: 1/0.4348, or approximately 2.3. That is,
cars made in North America have 2.3 times the odds of being safe compared with cars
made in Asia. A single inversion also gives the odds ratio for comparing Asia to N
America for a Below Average safety score. To compare N America to Asia using
Below Average, you would invert the odds ratio twice, returning to the value 0.4348.
c. Use the variable named Size. Examine the ordinal association between Size and Unsafe.
Use PROC FREQ.
/*st107s01.sas*/ /*Part C*/
proc freq data=STAT1.safety;
   tables Size*Unsafe / chisq measures cl;
   format Unsafe safefmt.;
   title "Association between Unsafe and Size";
run;
95%
Statistic Value ASE Confidence Limits
Gamma -0.8268 0.0796 -0.9829 -0.6707
Kendall's Tau-b -0.5116 0.0726 -0.6540 -0.3693
Stuart's Tau-c -0.5469 0.0866 -0.7166 -0.3771
Somers' D C|R -0.4114 0.0660 -0.5408 -0.2819
Somers' D R|C -0.6364 0.0860 -0.8049 -0.4678
Pearson Correlation -0.5401 0.0764 -0.6899 -0.3903
Spearman Correlation -0.5425 0.0769 -0.6932 -0.3917
Lambda Asymmetric C|R 0.3667 0.1569 0.0591 0.6743
Lambda Asymmetric R|C 0.2951 0.0892 0.1203 0.4699
Lambda Symmetric 0.3187 0.0970 0.1286 0.5088
Uncertainty Coefficient C|R 0.2735 0.0836 0.1096 0.4374
Uncertainty Coefficient R|C 0.1551 0.0490 0.0590 0.2512
Uncertainty Coefficient Symmetric 0.1979 0.0615 0.0773 0.3186
1) What statistic should you use to detect an ordinal association between Size and Unsafe?
The Mantel-Haenszel Chi-Square
2) Do you reject or fail to reject the null hypothesis at the 0.05 level?
Reject
3) What is the strength of the ordinal association between Size and Unsafe?
The Spearman correlation is -0.5425.
4) What is the 95% confidence interval around that statistic?
The CI is (-0.6932, -0.3917).
2. Performing a Logistic Regression Analysis
Fit a simple logistic regression model using STAT1.safety with Unsafe as the outcome variable and
Weight as the predictor variable. Use the EVENT= option to model the probability of below-average
safety scores. Request Profile Likelihood confidence limits and an odds ratio plot along with an effect
plot.
/*st107s02.sas*/
ods graphics on;
proc logistic data=STAT1.safety plots(only)=(effect oddsratio);
   model Unsafe(event='1')=Weight / clodds=pl;
   title 'LOGISTIC MODEL (1):Unsafe=Weight';
run;
Model Information
Data Set STAT1.SAFETY
Response Variable Unsafe
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value Unsafe Frequency
1 0 66
2 1 30
Copyright © 2015, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
7.6 Solutions 7-91
[Odds ratio and effect plots for the Weight model]
a. Do you reject or fail to reject the global null hypothesis that all regression coefficients
of the model are 0?
The p-value for the Likelihood Ratio test is <.0001 and therefore the global null hypothesis
is rejected.
b. Write the logistic regression equation.
The regression equation is as follows:
Logit(Unsafe)=3.5422 + (-1.3901)*Weight.
c. Interpret the odds ratio for Weight.
The odds ratio for Weight (0.249) says that the odds for being unsafe (having a below
average safety rating) are 75.1% lower for each thousand pound increase in weight.
The confidence interval (0.102, 0.517) does not contain 1, indicating that the odds ratio
is statistically significant.
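The arithmetic behind this interpretation can be checked outside of SAS. The following Python sketch is illustrative only; the intercept and slope are taken from the solution above, and the odds ratio is the exponentiated slope:

```python
import math

# Fitted coefficients from the solution above (intercept, slope for Weight)
b0, b1 = 3.5422, -1.3901

def predicted_probability(weight):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*weight)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * weight)))

# Odds ratio for a one-unit (thousand-pound) increase in Weight
odds_ratio = math.exp(b1)
print(round(odds_ratio, 3))              # 0.249
print(round((1 - odds_ratio) * 100, 1))  # 75.1 (% lower odds per unit increase)
```

Because the slope is negative, the predicted probability of being unsafe decreases as Weight increases.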
a. Do you reject or fail to reject the null hypothesis that all regression coefficients of the model
are 0?
You reject the null hypothesis with a p<.0001.
Type 3 Analysis of Effects
Wald
Effect DF Chi-Square Pr > ChiSq
Weight 1 2.1176 0.1456
Region 1 0.4506 0.5020
Size 2 15.3370 0.0005
b. If you do reject the global null hypothesis, then which predictors significantly predict safety
outcome?
Only Size is significantly predictive of Unsafe.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 0.0500 1.8008 0.0008 0.9778
Weight 1 -0.6678 0.4589 2.1176 0.1456
Region N America 1 -0.3775 0.5624 0.4506 0.5020
Size 1 1 2.6783 0.8810 9.2422 0.0024
Size 2 1 0.6582 0.9231 0.5085 0.4758
[Odds ratio plot: Weight, Size 1 vs 3, and Size 2 vs 3, with odds ratios on a scale of 0 to 100]
[Effect plot: predicted probability of Unsafe versus Weight, by Region * Size
(Asia 1, Asia 2, Asia 3, N America 1, N America 2, N America 3)]
/*st107s04.sas*/
ods graphics on;
proc logistic data=STAT1.safety plots(only)=(effect oddsratio);
class Region (param=ref ref='Asia')
Size (param=ref ref='Small');
model Unsafe(event='1') = Weight Region Size
/ clodds=pl selection=backward;
units Weight = -1;
store isSafe;
format Size sizefmt.;
title 'Logistic Model: Backwards Elimination';
run;
a. Which terms appear in the final model? Only Size appears in the final model.
Summary of Backward Elimination
Effect Number Wald
Step Removed DF In Chi-Square Pr > ChiSq
1 Region 1 2 0.4506 0.5020
2 Weight 1 1 2.1565 0.1420
b. Do you think this is a better model than the one fit with only Region? Comparing the model fit
statistics, you see that the AIC (92.629) and SC (100.322) for the model chosen by backward
elimination are both smaller than those for the Region-only model (119.854 and 124.982,
respectively). This indicates that the Size-only model is doing better than the Region-only model.
Using the c statistic, you can also see improvement beyond the Region-only model: 0.818 versus
0.598 previously.
c. Using the final model, chosen by backward elimination, and the STORE statement, generate
predictive probabilities for the cars in the following DATA step code.
data checkSafety;
length Region $9.;
input Weight Size Region $ 5-13;
datalines;
4 1 N America
3 1 Asia
5 3 Asia
5 2 N America
;
run;
[Chapter summary slides. Key formula:]
logit( pi ) = ln( pi / (1 - pi) )
Appendix A References
A.1 References
Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: John Wiley & Sons.
Allison, P. 1999. Logistic Regression Using the SAS® System: Theory and Application. Cary, NC:
SAS Institute Inc.
Anscombe, F. 1973. “Graphs in Statistical Analysis.” The American Statistician 27:17–21.
Belsey, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity. New York: John Wiley & Sons.
Chatfield, C. 1995. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of the Royal
Statistical Society 158:419–466.
Findley, D.F. and E. Parzen. 1995. "A Conversation with Hirotugu Akaike." Statistical Science Vol. 10,
No. 1:104–117.
Freedman, D. A. 1983. “A Note on Screening Regression Equations.” The American Statistician
37:152–155.
Hocking, R. R. 1976. “The Analysis and Selection of Variables in Linear Regression.” Biometrics
32:1–49
Hosmer, D. W., and S. Lemeshow. 2000. Applied Logistic Regression, Second Edition. New York:
John Wiley & Sons.
Johnson, R. W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics
Education, Vol. 4, No. 1.
Mallows, C. L. 1973. “Some Comments on Cp.” Technometrics 15:661–675.
Marquardt, D. W. 1980. “You Should Standardize the Predictor Variables in Your Regression Models.”
Journal of the American Statistical Association 75:74–103.
Myers, R. H. 1990. Classical and Modern Regression with Applications, Second Edition. Boston:
Duxbury Press.
Neter, J., M. H. Kutner, W. Wasserman, and C. J. Nachtsheim. 1996. Applied Linear Statistical Models,
Fourth Edition. New York: WCB McGraw Hill.
Raftery, A. E. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology.
Rawlings, J. O. 1988. Applied Regression Analysis: A Research Tool. Pacific Grove, CA:
Wadsworth & Brooks.
Santner, T.J. and D. E. Duffy. 1989. The Statistical Analysis of Discrete Data. New York: Springer-Verlag.
Shoemaker, A. L. 1996. “What's Normal? – Temperature, Gender, and Heart Rate.” Journal of Statistics
Education, Vol. 4, No. 2.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Welch, B. L. 1951. "On the Comparison of Several Mean Values: An Alternative Approach." Biometrika
38:330–336.
Appendix B Sampling from SAS Data Sets
B.1 Random Samples
The SURVEYSELECT procedure selects a random sample from a SAS data set.
Part A shows how to select a certain sample size using the SAMPSIZE= option.
/* st10bd01.sas */ /*Part A*/
proc surveyselect
data= STAT1.Safety /* sample from data table */
seed=31475 /* recommended that you use this option */
method=srs /* simple random sample */
sampsize=12 /* sample size */
out=work.SafetySample /* sample stored in this data set */
;
run;
If you do not provide a seed, you will not be able to reproduce the sample. It is recommended that
you always include a seed when using PROC SURVEYSELECT.
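The role of the seed can be illustrated outside of SAS. This Python sketch mimics the idea with a hypothetical population of 96 rows (the size of the safety data); two generators created with the same seed draw exactly the same simple random sample:

```python
import random

population = list(range(1, 97))  # hypothetical: 96 row IDs, like STAT1.Safety

# Two samplers built from the same seed draw identical samples
s1 = random.Random(31475).sample(population, 12)
s2 = random.Random(31475).sample(population, 12)
print(s1 == s2)  # True

# Without a fixed seed, repeated runs generally draw different samples,
# and the result cannot be reproduced later.
```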
Part B shows how to select a certain percentage of the original sample using the SAMPRATE= option.
/* st10bd01.sas */ /*Part B*/
proc surveyselect
data= STAT1.Safety /* sample from data table */
seed=31475 /* recommended that you use this option */
method=srs /* simple random sample */
samprate=0.05 /* sampling rate (proportion of rows) */
out=work.SafetySample /* sample stored in this data set */
;
run;
proc print data=work.SafetySample;
run;
Appendix C Additional Topics
C.1 Paired t-Tests
Paired Samples
[Diagram: sales measured on the same subjects before and after an advertising treatment]
For many types of data, repeat measurements are taken on the same subject throughout a study.
The simplest form of this study is often referred to as the paired t-test.
In this study design,
subjects are exposed to a treatment, for example, an advertising strategy
a measurement is taken of the subjects before and after the treatment
the subjects, on average, respond the same way to the treatment, although there might be differences
among the subjects.
The assumptions of this test are that
the subjects are selected randomly.
the distribution of the sample mean differences is normal. The central limit theorem can be applied
for large samples.
The hypotheses of this test are the following: H0: the mean difference between the paired
measurements is zero; H1: the mean difference is not zero.
Paired t-Test
Example: Dollar values of sales were collected both before and after a particular advertising campaign.
You are interested in determining the effect of the campaign on sales. You collected data from
30 different randomly selected regions. The level of sales both before (pre) and after (post)
the campaign were recorded and are shown below.
/*st10cd01.sas*/ /*Part A*/
proc print data=STAT1.market (obs=20);
title;
run;
The PAIRED statement below tests whether the mean of post-sales is significantly different
from the mean of pre-sales.
/*st10cd01.sas*/ /*Part B*/
proc ttest data= STAT1.market plots(showh0)=interval;
paired post*pre;
run;
The Mean in this table refers to the difference of Post minus Pre. That ordering is specified
in the PAIRED statement.
Mean     95% CL Mean        Std Dev   95% CL Std Dev
0.9463   0.6001  1.2925     0.9271    0.7384  1.2464
The T Tests table provides the requested analysis. The p-value for the difference post–pre is less than
0.0001. Assuming that you want a 0.01 level of significance, you reject the null hypothesis and conclude
that there is a change in the average sales after the advertising campaign. Also, based on the fact that
the mean is positive 0.9463, there appears to be an increase in the average sales after the advertising
campaign.
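The paired t statistic itself is simple to compute: t = d-bar / (s_d / sqrt(n)), where d-bar and s_d are the mean and standard deviation of the differences. This Python sketch uses a small set of hypothetical post-minus-pre differences (not the market data):

```python
import math
from statistics import mean, stdev

# Hypothetical post-minus-pre differences for six regions (illustrative only)
diffs = [0.5, 1.0, 1.5, 0.8, 1.2, 0.6]

n = len(diffs)
d_bar = mean(diffs)                # mean difference
se = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
t = d_bar / se                     # paired t statistic for H0: mean difference = 0
print(round(t, 2))                 # 6.05
```

A large positive t, as here, points toward an increase after the treatment.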
[Histogram of the paired differences, with the 95% confidence interval for the mean displayed]
[Paired profiles plot: post versus pre for each observation, with the pair of means]
The Paired Profiles plot shows each observation pair as well as the pair of means.
[Agreement plot: pre versus post with a diagonal reference line]
The agreement plot shows most pairs lie to the lower left of the diagonal reference line,
representing equality between pre and post measurements. This plot shows that not only
is the mean greater for post than for pre, but that relationship holds true in most pairs.
[Normal quantile-quantile plot of the differences]
C.2 One-Sided t-Tests
In many situations, you might decide that rejection on only one side of the mean is important.
For example, a drug company might want to test only for positive differences between a new drug
and a placebo, not negative differences. One-sided tests are a way of doing this.
In the exercise data, the researcher might have been interested only in seeing the improvement
in change scores due to the intervention, without considering the possibility that the intervention
could actually harm student performance.
H0: µ1-µ2 ≤ 0
For two-sample upper-tail t-tests, the null hypothesis covers not only equality of the two means,
but also any difference in the lower direction. If you believe that the mean change of the treatment
group is strictly greater than the mean change of the control group, this implies that you believe
that the difference between the mean changes for (Treatment-Control) is strictly greater than zero.
That would then be your alternative hypothesis, H1: µ1-µ2 > 0. The null hypothesis is then
H0: µ1-µ2 ≤ 0. Only t values above zero can achieve statistical significance. The critical t value for
significance on the upper end is smaller than it would have been in a two-sided test. Therefore, if you
are correct about the direction of the true difference, you have more power to detect that difference
using the one-sided test. Confidence intervals for one-sided upper-tail tests always have an upper
bound of infinity (no upper bound).
The H0= option in PROC TTEST allows other values for the null hypothesis.
One-Sided t-Test
/*st10bd02.sas*/
proc ttest data=STAT1.German
plots(only showh0)=interval h0=0 sides=L;
class Group;
var Change;
title "One-Sided t-Test Comparing Treatment to Control";
run;
H0=0 is the default, but it is written here explicitly for completeness. SIDES=L declares this
to be a lower one-sided t-test. Because Control comes before Treatment in the alphabet,
the difference score in PROC TTEST is Control minus Treatment by default, so the alternative
hypothesis is that this difference is less than zero.
Group N Mean Std Dev Std Err Minimum Maximum
Control 13 6.9677 8.6166 2.3898 -6.2400 19.4100
Treatment 15 11.3587 14.8535 3.8352 -17.3300 32.9200
Diff (1-2) -4.3910 12.3720 4.6882
Group Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Control 6.9677 1.7607 12.1747 8.6166 6.1789 14.2238
Treatment 11.3587 3.1331 19.5843 14.8535 10.8747 23.4255
Diff (1-2) Pooled -4.3910 -Infty 3.6052 12.3720 9.7432 16.9550
Diff (1-2) Satterthwaite -4.3910 -Infty 3.3545
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 14 12 2.97 0.0660
Notice that the confidence limits for the difference between Control and Treatment differ from
those in the exercise, even though the mean difference is exactly the same. The lower confidence
bound for the difference is now -Infty (negative infinity). For upper one-sided tests, the upper
bound would instead be infinite in the positive direction. The p-value for the pooled variance test
of the difference between Control and Treatment is now 0.1788, which is half of what it was in the
two-sided test.
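The halving of the p-value is a general property when the observed statistic falls in the hypothesized direction. A Python sketch using the standard normal distribution as an illustration (the t distribution behaves the same way):

```python
from statistics import NormalDist

# When the observed statistic falls in the hypothesized direction,
# the one-sided p-value is half of the two-sided p-value.
z = -1.0  # e.g., Control minus Treatment below zero, as hypothesized
nd = NormalDist()
p_lower = nd.cdf(z)                      # one-sided (lower-tail) p-value
p_two = 2 * nd.cdf(-abs(z))              # two-sided p-value
print(abs(p_lower - p_two / 2) < 1e-12)  # True
```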
[Difference interval plot: pooled and Satterthwaite intervals, with arrows extending toward
negative infinity]
The Difference Interval plot reflects the one-sided nature of the analysis. The arrows pointing left
represent the infinite confidence bound.
The determination of whether to perform a one-sided test or a two-sided test should be made
before any analysis, or even a glance at the data, and should be based on subject-matter
considerations, not on statistical power considerations.
C.3 Nonparametric ANOVA
Nonparametric Analysis
Nonparametric analyses are those that rely only
on the assumption that the observations are independent.
A nonparametric test is appropriate when
the data contains valid outliers
the data is skewed
Nonparametric tests are most often used when the normality assumption required for analysis of variance
is in question. Although ANOVA is robust with regard to minor departures from normality, extreme
departures can make the test less sensitive to differences between means. Therefore, when the data is
markedly skewed or there are extreme outliers, nonparametric methods might be more appropriate.
In addition, when the data follows a count measurement scale instead of interval, nonparametric methods
should be used.
Rank Scores
Treatment      A                    B
Response       2  5  7  8  10       6  9  11  13  15
Rank Score     1  2  4  5   7       3  6   8   9  10
               Sum = 19             Sum = 36
In nonparametric analysis, the rank of each data point is used instead of the raw data.
The illustrated ranking system ranks the data from smallest to largest. In the case of ties, the ranks are
averaged. The sums of the ranks for each of the treatments are used to test the hypothesis that the
populations are identical. For two populations, the Wilcoxon rank-sum test is performed. For any number
of populations, a Kruskal-Wallis test is used.
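The tie-averaged ranking described above can be sketched in Python. The data and rank sums match the Rank Scores slide:

```python
def tie_averaged_ranks(values):
    """Rank from smallest to largest, averaging the ranks of tied values."""
    ordered = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(ordered, start=1):
        first.setdefault(v, pos)        # first 1-based position of each value
        count[v] = count.get(v, 0) + 1  # number of ties for each value
    # value v occupies positions first..first+count-1; its rank is their average
    return [first[v] + (count[v] - 1) / 2 for v in values]

a = [2, 5, 7, 8, 10]     # treatment A responses from the slide
b = [6, 9, 11, 13, 15]   # treatment B responses
ranks = tie_averaged_ranks(a + b)
print(sum(ranks[:5]), sum(ranks[5:]))  # 19.0 36.0
```

These rank sums (19 and 36) are the quantities the Wilcoxon and Kruskal-Wallis tests compare against their expected values under the null hypothesis.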
Median Scores
Treatment      A                    B
Response       2  5  7  8  10       6  9  11  13  15
Median Score   0  0  0  0   1       0  1   1   1   1
Median = 8.5   Sum = 1              Sum = 4
Copyright © 2015, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
C-16 Appendix C Additional Topics
Recall that the median is the 50th percentile, which is the middle of your data values.
When calculating median scores, a score of
0 is assigned if the data value is less than or equal to the median
1 is assigned if the data value is above the median.
The sums of the median scores are used to conduct the Median test for two populations
or the Brown-Mood test for any number of populations.
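A Python check of the median scoring rule on the slide data (note that the pooled median of these ten values is 8.5):

```python
from statistics import median

a = [2, 5, 7, 8, 10]    # treatment A responses from the slide
b = [6, 9, 11, 13, 15]  # treatment B responses
med = median(a + b)     # pooled median of all ten values

# 0 if the value is at or below the median, 1 if above
score = lambda v: 0 if v <= med else 1
print(med, sum(map(score, a)), sum(map(score, b)))  # 8.5 1 4
```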
Hypotheses of Interest
Nonparametric tests compare the probability distributions of sampled populations rather than specific
parameters of these populations.
In general, with no assumptions about the distributions of the data, you are testing these hypotheses:
H0: all populations are identical with respect to shape and location
H1: all populations are not identical with respect to shape and location.
Thus, if you reject the null hypothesis, you conclude that the population distributions are different, but
you did not identify the reason for the difference. The difference could be because of different variances,
skewness, kurtosis, or means.
Hospice Example
Are there different effects of a marketing visit, in terms
of increasing the number of referrals to the hospice,
among the various specialties of physicians?
Consider a study done by Kathryn Skarzynski to determine whether there was a change in the number
of referrals received from physicians after a visit by a hospice marketing nurse. One of her study
questions was, “Are there different effects of the marketing visits, in terms of increasing the number
of referrals, among the various specialties of physicians?”
Veneer Example
Are there differences between the durability of brands
of wood veneer?
Consider another experiment where the goal of the experiment is to compare the durability of three brands
of synthetic wood veneer. This type of veneer is often used in office furniture and on kitchen cabinets.
To determine durability, four samples of each of three brands are subjected to a friction test. The amount
of veneer material that is worn away due to the friction is measured. The resulting wear measurement
is recorded for each sample. Brands that have a small wear measurement are desirable.
Example: A portion of Ms. Skarzynski’s data about the hospice marketing visits is in the STAT1.hosp data
set. The variables in the data set are as follows:
id the ID number of the physician’s office visited
visit the type of visit, to the physician or to the physician’s staff
code the medical specialty of the physician
ref3p the number of referrals three months before the visit
ref2p the number of referrals two months before the visit
ref1p the number of referrals one month before the visit
ref3a the number of referrals three months after the visit
ref2a the number of referrals two months after the visit
ref1a the number of referrals one month after the visit
In addition, the following variables have been calculated:
avgprior the average number of referrals per month for the three months before the visit
diff1 the difference between the number of referrals one month after the visit and the average
number of referrals before the visit
diff2 the difference between the number of referrals two months after the visit and the average
number of referrals before the visit
diff3 the difference between the number of referrals three months after the visit
and the average number of referrals before the visit
diffbys1 the difference between the number of referrals one month after the visit and the number
of referrals three months before the visit
diffbys2 the difference between the number of referrals two months after the visit and the number
of referrals three months before the visit
diffbys3 the difference between the number of referrals three months after the visit and the number
of referrals three months before the visit.
Print a subset of the variables for the first 10 observations in the data set.
One of the analyses to answer the research question is to compare diffbys3 (the number of referrals three
months after the visit minus the number three months before the visit) for the different specialties.
Initially, you want to examine the distribution of the data.
/*st10cd03.sas*/ /*Part B*/
proc univariate data=STAT1.hosp noprint;
class code;
var diffbys3;
histogram diffbys3 / normal kernel ncols=3;
inset mean std skewness kurtosis
normal(adpval="Anderson-Darling P"
cvmpval="Cramer von Mises P"
ksdpval="Kolmogorov-Smirnov P");
probplot diffbys3 / normal ncols=3;
inset mean std skewness kurtosis;
title 'Descriptive Statistics for Hospice Data';
format code spcfmt.;
run;
Examine the histograms and normal probability plots for each group.
Partial Output
Based on skewness and kurtosis, the oncologists and family practice doctors data might not be normal.
All three goodness-of-fit tests reject the null hypothesis that the data is normal.
…
[Normal probability plots of diffbys3 (# refs 3 months after minus # 3 months prior) by specialty]
Internal medicine doctors appear to have only three values: 0, 1, and 2. The plots indicate that the data
is not normal.
Family practice doctors appear to have mostly 0 values.
Both family practice doctors and oncologists have highly kurtotic distributions.
/*st10cd03.sas*/ /*Part C*/
proc sgplot data=STAT1.hosp;
vbox diffbys3 / category=code;
format code spcfmt.;
run;
[Box plots of diffbys3 (# refs 3 months after minus # 3 months prior) by specialty]
The box plots strongly support the conclusion that the data is not normal. Remember that the data values
of diffbys3 are actually counts and therefore ordinal. This suggests that a nonparametric analysis would
be more appropriate.
For illustrative purposes, use the WILCOXON option to perform a rank sum test and the MEDIAN option
to perform the Median test. This data was actually analyzed using the Rank Sum test.
/*st10cd03.sas*/ /*Part D*/
proc npar1way data=STAT1.hosp wilcoxon median;
class code;
var diffbys3;
format code spcfmt.;
run;
Selected PROC NPAR1WAY statement options:
WILCOXON requests an analysis of the rank scores. The output includes the Wilcoxon two-sample test
and the Kruskal-Wallis test for two or more populations.
MEDIAN requests an analysis of the median scores. The output includes the median two-sample
test and the median one-way analysis test for two or more populations.
Kruskal-Wallis Test
Chi-Square 4.2304
DF 2
Pr > Chi-Square 0.1206
The PROC NPAR1WAY output from the WILCOXON option shows the actual sums of the rank scores
and the expected sums of the rank scores if the null hypothesis is true. From the Kruskal-Wallis test
(chi-square approximation), the p-value is .1206. Therefore, at the 5% level of significance, you do not
reject the null hypothesis. There is not enough evidence to conclude that the distributions of change
in hospice referrals for the different groups of physicians are significantly different.
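With no ties, the Kruskal-Wallis statistic can be computed directly from the rank sums as H = 12/(N(N+1)) * sum(Rj^2/nj) - 3(N+1). PROC NPAR1WAY does this for you; the Python sketch below just applies the formula to the two-group slide example (rank sums 19 and 36):

```python
# Kruskal-Wallis statistic from rank sums, assuming no ties:
# H = 12 / (N*(N+1)) * sum(R_j**2 / n_j) - 3*(N + 1)
rank_sums = {"A": 19, "B": 36}   # rank sums from the two-group slide example
n = {"A": 5, "B": 5}             # group sizes
N = sum(n.values())

H = 12 / (N * (N + 1)) * sum(rank_sums[g] ** 2 / n[g] for g in n) - 3 * (N + 1)
print(round(H, 4))  # 3.1527
```

H is compared to a chi-square distribution with (number of groups - 1) degrees of freedom.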
Median One-Way
Analysis
Chi-Square 3.8515
DF 2
Pr > Chi-Square 0.1458
Again, based on the p-value of .1458, at the 5% level of significance, you do not reject the null
hypothesis. There is not enough evidence to conclude that there are differences between specialists.
PROC NPAR1WAY produces a box plot similar to the one you created for exploratory data analysis.
In addition, when you specify the MEDIAN option, a mosaic plot is generated and shows the number
of observations above and below the median for each group:
Example: For an experiment to compare the durability of three brands of synthetic wood veneer, perform
nonparametric one-way ANOVA. The data is stored in the STAT1.ven data set.
Because there is a sample size of only four for each brand of veneer, the usual PROC NPAR1WAY
Wilcoxon test p-values might be inaccurate. Instead, the EXACT statement should be added to the PROC
NPAR1WAY code. This provides exact p-values for the simple linear rank statistics based on the
Wilcoxon scores rather than estimated p-values based on continuous approximations.
Exact analysis is available for both the WILCOXON and MEDIAN options in PROC NPAR1WAY. You
can specify which of these scores you want to use to compute the exact p-values by adding either one or
both of these options to the EXACT statement. If no options are listed in the EXACT statement, exact
p-values are computed for all the linear rank statistics requested in the PROC NPAR1WAY statement.
You should exercise care when choosing to use the EXACT statement with PROC NPAR1WAY.
Computational time can be prohibitive depending on the number of groups, the number of distinct
response variables, the total sample size, and the speed and memory available on your computer. You can
terminate exact computations and exit PROC NPAR1WAY at any time by pressing the Break button in
the SAS windowing environment or the Stop button in SAS Enterprise Guide, and choosing to stop
computations.
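The idea behind an exact rank-sum p-value can be sketched by brute-force enumeration: under the null hypothesis, every assignment of ranks to groups is equally likely, so the exact p-value is the fraction of assignments at least as extreme as the one observed. A Python sketch (illustrative, assuming no ties; the function name and data are hypothetical):

```python
from itertools import combinations

def exact_lower_p(group_a, group_b):
    """Exact one-sided p-value for the rank sum of group_a (no ties assumed):
    the proportion of equally likely rank assignments whose rank sum is
    at or below the observed one."""
    combined = sorted(group_a + group_b)
    rank = {v: i for i, v in enumerate(combined, start=1)}
    observed = sum(rank[v] for v in group_a)
    all_ranks = range(1, len(combined) + 1)
    assignments = list(combinations(all_ranks, len(group_a)))
    as_extreme = sum(1 for c in assignments if sum(c) <= observed)
    return as_extreme / len(assignments)

# Perfectly separated groups: the observed rank sum is the smallest possible,
# so the exact p-value is 1 / C(6, 3) = 1/20
print(exact_lower_p([1.1, 2.2, 3.3], [4.4, 5.5, 6.6]))  # 0.05
```

The number of assignments grows combinatorially with the sample sizes, which is why exact computations can become prohibitively expensive.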
/*st10cd04.sas*/ /*Part A*/
proc print data=STAT1.ven;
title 'Wood Veneer Wear Data';
run;
Obs brand wear
1 Acme 2.3
2 Acme 2.1
Kruskal-Wallis Test
Chi-Square 5.8218
DF 2
Asymptotic Pr > Chi-Square 0.0544
Exact Pr >= Chi-Square 0.0480
In the PROC NPAR1WAY output shown above, the exact p-value is .0480, which is significant at α=.05.
Notice the difference between the exact p-value and the (asymptotic) p-value based on the chi-square
approximation.
C.4 Partial Regression Plots
A partial regression plot is a graphical method for visualizing the test of significance for the parameter
estimates in the full model. It is a plot of the residuals from two regression analyses.
Partial regression leverage plots are graphical methods that enable you to see the effect of a single
variable in a multiple regression setting, controlling for the effect of all other variables.
Partial regression plots are produced automatically with ODS Statistical Graphics when you
specify the PLOTS=PARTIAL option in the PROC REG statement.
In the example shown, there are three partial regression plots, one for each independent variable.
In general terms, for a partial regression plot of the independent variable Xr,
the vertical axis is the residuals from a regression of Y regressed on all Xs except Xr
the horizontal axis is the residuals from a regression of Xr regressed on all other Xs.
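The defining property of this construction, that the slope of the residual-on-residual regression equals the multiple regression coefficient of Xr (the Frisch-Waugh-Lovell result), can be verified on toy data. A Python sketch with made-up values where y = 2*x1 + 3*x2 exactly:

```python
from statistics import mean

def slope(y, x):
    """Least-squares slope of a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def resid(y, x):
    """Residuals from a simple least-squares regression of y on x."""
    b = slope(y, x)
    a = mean(y) - b * mean(x)
    return [v - (a + b * u) for u, v in zip(x, y)]

# Toy data where y = 2*x1 + 3*x2 exactly (illustrative)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 * u + 3 * v for u, v in zip(x1, x2)]

# Partial regression for x1: residuals of y on x2 versus residuals of x1 on x2.
# The slope of this residual-on-residual regression equals the multiple
# regression coefficient of x1 (here, exactly 2).
b1 = slope(resid(y, x2), resid(x1, x2))
print(round(b1, 10))  # 2.0
```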
Example: Generate and interpret partial regression plots for the full model and compare them to the fit
plot from the simple regression model with Spend2011.
/*st10cd05.sas*/ /*Part A*/
proc reg data=STAT1.SAT
plots(only)=fitplot(nolimits stats=none);
model CombinedSAT2013 = Spend2011;
title 'Simple Regression';
run;
quit;
Selected PLOTS= options:
NOLIMITS
suppresses the display of confidence and prediction limits.
STATS=NONE
suppresses the display of the model statistics box.
Partial Output
Simple Regression
Model: MODEL1
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 1672.65384 72.34052 23.12 <.0001
Spend2011 1 -0.00782 0.00636 -1.23 0.2243
[Fit plot: CombinedSAT2013 versus Spend2011]
On its own, Spend2011 is not a significant predictor of CombinedSAT2013 with a p-value 0.2243.
The observations seem highly variable about the regression line and the slope is negative (-0.00782).
Next run the same model, except add the Participation2013 variable, which is the proportion of eligible
high school seniors who have taken the SAT, as a regressor.
/*st10cd05.sas*/ /*Part B*/
proc reg data=STAT1.SAT
plots(only)=partial(unpack);
model CombinedSAT2013=Spend2011 Participation2013 / partial;
title 'Partial Regression Plots';
run;
quit;
Selected MODEL statement option:
PARTIAL generates partial regression plots for all predictor variables in the model. If you also
specify PLOTS=PARTIAL in the PROC REG statement, ODS Graphics are produced.
Partial Output
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 1693.46454 31.37863 53.97 <.0001
Spend2011 1 0.00375 0.00287 1.31 0.1967
Participation2013 1 -366.95940 25.14546 -14.59 <.0001
The parameter estimate for Spend2011 is not significant in this model, either. The p-value is 0.1967.
However, the sign of the parameter estimate has changed.
[Partial regression plot: partial residual for CombinedSAT2013 versus partial residual for Spend2011]
The plot shows this relationship graphically. The Y axis is now the partial residuals from regressing
CombinedSAT2013 on Participation2013. The X axis is the partial residuals from regressing Spend2011
on Participation2013. The variance is much greater for observations around the partial regression line
than for the simple regression line shown previously.
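The construction of a partial regression plot can be sketched numerically (Python, illustration only; the data below are made up and merely stand in for the SAT variables). By the Frisch-Waugh result, the slope through the partial residuals equals the coefficient of that predictor in the multiple regression:

```python
# Illustration only: a partial regression plot's slope equals the
# multiple-regression coefficient (Frisch-Waugh). All data are made up.

def simple_fit(x, y):
    """OLS intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def coef_x_in_multiple(x, z, y):
    """Coefficient of x in y = b0 + bx*x + bz*z (centered normal equations)."""
    n = len(y)
    cx = [v - sum(x) / n for v in x]
    cz = [v - sum(z) / n for v in z]
    cy = [v - sum(y) / n for v in y]
    sxx = sum(a * a for a in cx)
    szz = sum(a * a for a in cz)
    sxz = sum(a * b for a, b in zip(cx, cz))
    sxy = sum(a * b for a, b in zip(cx, cy))
    szy = sum(a * b for a, b in zip(cz, cy))
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz * sxz)

# Hypothetical stand-ins for Spend2011 (x), Participation2013 (z),
# and CombinedSAT2013 (y)
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

rx = residuals(z, x)               # partial residuals of x, adjusted for z
ry = residuals(z, y)               # partial residuals of y, adjusted for z
_, partial_slope = simple_fit(rx, ry)
print(partial_slope, coef_x_in_multiple(x, z, y))  # the two values agree
```

The X and Y axes of the partial regression plot are rx and ry; fitting a line through those residuals recovers the adjusted coefficient.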
In addition to enabling you to visualize the adjusted relationships in a multiple regression model,
partial residual plots can help you detect potential outliers. For example, a potential influential
outlier can be seen in the upper right corner of the plot. None was seen in the simple regression
fit plot.
If you use the IMAGEMAP option and an ID statement, you can place your mouse over a data
point to see the value of Name displayed as a tag on the plot.
C.5 Exact Tests for Contingency Tables C-35
There are times when the chi-square test might not be appropriate. In fact, when more than 20%
of the cells have expected cell frequencies of less than 5, the chi-square test might not be valid. This
is because the p-values are based on the assumption that the test statistic follows a particular distribution
when the sample size is sufficiently large. Therefore, when the sample sizes are small, the asymptotic
(large sample) p-values might not be valid.
The criterion for the chi-square test is based on the expected values, not the observed values. In the slide
above, 1 out of 9, or 11% of the cells, has an observed count less than 5. However, 4 out of 9, or 44%,
of the cells have expected counts less than 5. Therefore, the chi-square test might not be valid.
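The 20% rule above is easy to check numerically. The following Python sketch (illustration only, not part of the course's SAS code; the table values are hypothetical) computes expected counts from the margins and flags a table whose asymptotic chi-square test is questionable:

```python
# Illustration only: flag a table when more than 20% of cells have
# EXPECTED counts below 5. The table values are hypothetical.

def expected_counts(table):
    """Expected count for cell (i, j) = (row total * column total) / grand total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    return [[r * c / total for c in col_tot] for r in row_tot]

def chi_square_suspect(table, cutoff=5, max_frac=0.20):
    """True if more than max_frac of the expected counts fall below cutoff."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < cutoff for e in cells) / len(cells) > max_frac

# Only one OBSERVED cell is below 5, but two of the nine EXPECTED counts
# are, which exceeds 20%, so the asymptotic chi-square test is questionable.
table = [[4, 6, 6],
         [5, 6, 6],
         [5, 20, 20]]
print(chi_square_suspect(table))   # True
```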
The EXACT statement provides exact p-values for many tests in the FREQ procedure. Exact p-values
are useful when the sample size is small. In this case, the asymptotic p-values might not be useful.
However, large data sets (in terms of sample size, number of rows, and number of columns) can require
a prohibitive amount of time and memory for computing exact p-values. For large data sets, consider
whether exact p-values are needed or whether asymptotic p-values might be quite close to the exact
p-values.
Observed Table          Expected Table
 0   3 | 3              0.86  2.14 | 3
 2   2 | 4              1.14  2.86 | 4
 2   5 | 7              2     5    | 7
A p-value gives the probability of the χ² statistic being as extreme as or more extreme than the one
observed, just by chance.
Could the underlined sample values occur just by chance?
Consider the table at left above. With such a small sample size, the asymptotic p-values would not
be valid, because the accuracy of those p-values depends on large enough expected values in all cells.
Exact p-values reflect the probability of observing a table with at least as much evidence
of an association as the one actually observed, given there is no association between the variables.
Recall that the expected count within each cell is calculated as expected count = (R × C)/T, where R is
the row total, C is the column total, and T is the grand total.
A key assumption behind the computation of exact p-values is that the column totals and row totals are
fixed. There are only three possible tables, including the observed table, given the fixed marginal totals.
Possible Table 2 is most like the Expected Table on the previous slide. So, the probability (0.571) that
its cell values would occur in a table, given these row and column totals, is the greatest of any possible
table that could occur by chance.
Observed Table      Possible Table 2     Possible Table 3
 0  3 | 3            1  2 | 3             2  1 | 3
 2  2 | 4            1  3 | 4             0  4 | 4
 2  5 | 7            2  5 | 7             2  5 | 7
To compute an exact p-value for this example, examine the chi-square value for each table and the
probability that the table would occur by chance if the null hypothesis of no association were true.
(The probabilities add up to 1.)
Remember the definition of a p-value. It is the probability, if the null hypothesis is true, that you would
obtain a sample statistic as great as or greater than the one you observed just by chance.
In this example, this means the probability of obtaining a table with a χ² value as great as or greater than
the 2.100 for the Observed Table. The probability associated with every table with a χ² value of 2.100
or higher would be summed to compute the two-sided exact p-value.
The exact p-value would be 0.286 (Observed Table)+0.143 (Possible Table 3)=0.429. This means you
have a 42.9% chance of obtaining a table with at least as much of an association as the observed table
simply by random chance.
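The enumeration above can be verified numerically. The following Python sketch (illustration only, not part of the course's SAS code) generates every 2×2 table with the fixed margins, computes each table's hypergeometric probability and chi-square value, and sums the probabilities of tables at least as extreme as the observed one:

```python
from math import comb

# Illustration only: exact p-value for the 2x2 example with fixed margins.
# Row totals 3 and 4, first-column total 2, grand total 7.
R1, R2, C1 = 3, 4, 2
N = R1 + R2

def chi_square(a):
    """Pearson chi-square of the table whose upper-left cell is a."""
    obs = [a, R1 - a, C1 - a, R2 - (C1 - a)]
    exp = [R1 * C1 / N, R1 * (N - C1) / N, R2 * C1 / N, R2 * (N - C1) / N]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def table_prob(a):
    """Hypergeometric probability of the table, given the fixed margins."""
    return comb(R1, a) * comb(R2, C1 - a) / comb(N, C1)

crit = chi_square(0)    # the observed table has a = 0; chi-square = 2.100
p_exact = sum(table_prob(a) for a in range(C1 + 1)
              if chi_square(a) >= crit - 1e-9)
print(round(crit, 3), round(p_exact, 3))   # 2.1 0.429
```

The tables with cell (1,1) equal to 0 and 2 have chi-square values of at least 2.100, and their probabilities (0.286 and 0.143) sum to the exact p-value of 0.429.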
The warning tells you that you should not trust the reported p-value in this table.
The Two-sided Pr <= P value is the one that you will report. Notice the difference between the exact
p-value (0.4286) and the asymptotic p-value (0.1473) in the Pearson chi-square test table. The exact
p-values are larger. Exact tests tend to be more conservative than asymptotic tests.
For tables larger than 2×2, an EXACT statement must be submitted to obtain exact p-values.
For large tables, this can take a long time and use a great deal of computational resources.
C.6 Empirical Logit Plots C-41
Objectives
Explain the concept of empirical logit plots.
Plot empirical logits for continuous and ordinal
predictor variables.
For continuous data, a recommended step before building a regression model is to analyze the bivariate
relationships between the regressors and the response variables. The goal is not only to detect outliers, but
also to analyze the shape of the relationships to determine whether there might be some nonlinear trend
that should be modeled in the analysis. For binary response variables, a scatter plot contributes little to
these ends.
The logistic model asserts a linear relationship with the logit (not with the actual binary values).
However, the logit for a single observation is infinite in either the positive or negative direction
(ln(p/(1–p)) = ln(1/0) or ln(0/1)). Therefore, a recommendation is to group the data into approximately
equally sized bins, based on the values of the predictor variable. The bin size should be adequate in
number of observations to reduce the sample variability of the logits. You can then assume that the
average probability within each bin is approximately the value of the proportion in the bin with the event.
The estimated logit is then approximately equal to ln(proportion/(1–proportion)).
If the predictor variable is a nominal variable, then there is no need to create a logit plot.
If the standard logistic regression model adequately fits the data, the logit plots should be fairly linear.
The above graph shows a predictor variable that meets the assumption of linearity in the logit.
The logit plot can also show serious nonlinearities between the outcome variable and the predictor
variable. The above graph reveals a quadratic relationship between the outcome and predictor variables.
Adding a polynomial term or binning the predictor variable into three groups (two dummy variables
would model the quadratic relationship) and treating it as a classification variable can improve the model
fit.
A common approach when computing logits is to take the log of the odds. The path from the definition
of a logit to the formula above is shown below. C represents the total number in the bin and E represents
the total number of positive events in the bin.
P_i = E_i / C_i

logit_i = ln( P_i / (1 - P_i) )
        = ln( (E_i / C_i) / ((C_i - E_i) / C_i) )
        = ln( E_i / (C_i - E_i) )
The logit is undefined for any bin in which the outcome rate is 100% or 0%. To eliminate this problem
and reduce the variability of the logits, a common recommendation is to add a small constant
to the numerator and denominator of the formula that computes the logit (Santner and Duffy 1989).
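The effect of the adjustment can be seen in a short numeric sketch (Python, illustration only; the bin counts are hypothetical):

```python
import math

def empirical_logit(events, cases, c=0.5):
    """Adjusted empirical logit: ln((E + c) / (C - E + c)).
    With c = 0 this is the raw logit, which is undefined for bins
    whose outcome rate is 0% or 100%."""
    return math.log((events + c) / (cases - events + c))

print(empirical_logit(10, 20))   # a 50% event rate gives logit 0.0
print(empirical_logit(0, 20))    # a 0% bin still yields a finite logit
```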
Example: Plot the estimated logits of the outcome variable Bonus versus the predictor variable
Fireplaces. To construct the estimated logits, the number of bonus eligible houses and the
total number of houses by each level of Fireplaces must be computed.
/*st10cd07.sas*/ /*Part A*/
proc means data=STAT1.ameshousing3 noprint nway;
class Fireplaces;
var Bonus;
output out=bins sum(Bonus)=NEvents n(Bonus)=NCases;
run;
data bins;
set bins;
Logit=log((NEvents+0.5)/(NCases-NEvents+0.5));
run;
[Empirical logit plot: Logit (about -2.0 to -1.0) versus number of fireplaces (0, 1, 2)]
The trend shown in the empirical logit plot for this ordinal variable does not appear linear.
In some cases, when a linear pattern is detected in a logit plot for an ordinal variable, the variable
can be removed from the CLASS statement, implying that it would be considered the same as a
continuous variable. The statistical advantage of doing so would be to increase model power, due
to obtaining almost the same information using fewer degrees of freedom. However, theoretical
justifications should always supersede such data-driven considerations.
Example: Plot the estimated logits of the outcome variable Bonus versus the predictor variable
Basement_Area. Because Basement_Area is a continuous variable, bin the observations into
15 groups to ensure that an adequate number of observations is used to compute each
estimated logit.
/*st10cd07.sas*/ /*Part B*/
proc rank data=STAT1.ameshousing3 groups=15 out=Ranks;
var Basement_Area;
ranks Rank;
run;
proc means data=Ranks noprint nway;
   class Rank;
   var Bonus;
   output out=bins sum(Bonus)=NEvents n(Bonus)=NCases;
run;

data bins;
   set bins;
   Logit=log((NEvents+0.5)/(NCases-NEvents+0.5));
run;
In the case of Basement_Area, you do not have a made-to-order bin variable, so you must create one.
You can use the RANK procedure for this purpose. You have 300 observations, and it is recommended
that you have approximately 20 to 30 observations per bin. If you divide the sample size by the desired
per-bin count, you can estimate the value to use for the GROUPS= option of PROC RANK. In this case,
with 20 observations per bin, the number of bins should be 300/20 = 15.
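The grouping that PROC RANK performs can be sketched as follows (Python, illustration only; PROC RANK's exact rounding and tie-handling rules differ slightly, and the values here are hypothetical):

```python
from collections import Counter

def assign_bins(values, groups):
    """Bin observations by ascending rank: bin = floor(rank * groups / n),
    with 0-based ranks. Ties are ignored for simplicity."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * groups // n
    return bins

# 300 hypothetical Basement_Area values split into 15 groups of 20
sizes = Counter(assign_bins(list(range(300)), 15))
print(len(sizes), sorted(set(sizes.values())))   # 15 [20]
```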
[Empirical logit plot: Logit versus binned Basement_Area, with regression and loess fits]
The empirical logit plot shows a deviation from linearity. One possibility is to add a quadratic (squared)
term for Basement_Area.
The empirical logit plot is univariate and therefore can be misleading in the presence of interactions
and partial associations in a logistic regression model. (Association between the response variable and the
predictor variable changes with the addition of another predictor variable in the model.) If an interaction
is suspected, a model with the interaction term and main effects should be evaluated before any variable
is eliminated. Estimated logit plots should never be used to eliminate variables from consideration for
a multiple logistic regression model.
Appendix D Percentile Definitions
D.1 Calculating Percentiles D-3
Example: Calculate the 25th percentile for the following data using the five definitions available
in PROC UNIVARIATE:
1 3 7 11 14
For all of these calculations (except definition 4), you use the value np = (5) (0.25) = 1.25. This can
be viewed as an observation number. However, there is obviously no observation 1.25.
Definition 1 returns a weighted average. The value returned is 25% of the distance between
observations 1 and 2. (The value of 25% is the fractional part of 1.25 expressed
as a percentage.)
percentile = 1 + (0.25)(3 – 1) = 1.5
Definition 2 rounds to the nearest observation number. Thus, the value 1.25 is rounded to 1 and
the first observation, 1, is taken as the 25th percentile. If np were 1.5, then the second
observation would be selected as the 25th percentile.
Definition 3 always rounds up. Thus, 1.25 rounds up to 2 and the second data value, 3, is taken
as the 25th percentile.
Definition 4 is a weighted average similar to definition 1, except instead of using np, definition
4 uses (n + 1) p = 1.5.
percentile = 1 + (0.5)(3 – 1) = 2
Definition 5 rounds up to the next observation number unless np is an integer. In that case,
an average of the observations represented by np and (np + 1) is calculated. In this
example, definition 5 rounds up, and the 25th percentile is 3.
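The five definitions can be collected into one function for comparison (a Python sketch, illustration only; it ignores ties and percentiles at the extremes of the data, and the half-rounding in definition 2 follows the description above):

```python
import math

def pctl(x, p, definition):
    """SAS-style percentile definitions 1-5 for sorted data x (a sketch;
    edge cases such as p near 0 or 1 are not handled)."""
    n = len(x)
    t = n * p                    # the "observation number" np
    j, g = int(t), t - int(t)    # integer and fractional parts
    if definition == 1:          # weighted average between obs j and j+1
        return x[j - 1] + g * (x[j] - x[j - 1]) if g > 0 else x[j - 1]
    if definition == 2:          # nearest observation number (halves round up)
        return x[math.floor(t + 0.5) - 1]
    if definition == 3:          # always round up
        return x[math.ceil(t) - 1]
    if definition == 4:          # weighted average using (n + 1) * p
        t = (n + 1) * p
        j, g = int(t), t - int(t)
        return x[j - 1] + g * (x[j] - x[j - 1])
    if definition == 5:          # round up; average the two if np is an integer
        return (x[j - 1] + x[j]) / 2 if g == 0 else x[j]

data = [1, 3, 7, 11, 14]
print([pctl(data, 0.25, d) for d in range(1, 6)])   # [1.5, 1, 3, 2.0, 3]
```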
Appendix E Writing and Submitting
SAS® Programs in SAS® Enterprise
Guide®
E.1 Writing and Submitting SAS Programs in SAS Enterprise Guide ............................ E-3
Demonstration: Adding a SAS Program to a Project ........................................................... E-11
E.1 Writing and Submitting SAS Programs in SAS Enterprise Guide E-3
Objectives
Create and submit new SAS programs.
Insert existing programs into a project.
List programming statements to avoid.
Generate a combined project program and log.
When you insert code, a shortcut to the file is added in the project, which means that changes made
to the code in the project are also saved to the .sas file that you inserted. Also, if you make changes
to the .sas file outside of SAS Enterprise Guide, the changes are reflected when you open or run
the project again.
If SAS is available on multiple servers, you can select Select Server and designate the server on which
the program should run.
If the data for a task is located on a server that is different from the server where the SAS code is run,
then SAS Enterprise Guide copies the data to the server where the code actually runs. Because moving
large amounts of data over a network can be time- and resource-intensive, it is recommended that
the server that you choose to process the code be the same server on which the data resides.
The Analyze Program button enables you to select one of these two options:
Analyze Program Flow: SAS Enterprise Guide can create a process flow from a program. Using
this process flow, you can quickly identify the different parts of the program and see how the
parts are related.

Analyze Program for Grid Computing: When analyzing a program for grid computing, SAS Enterprise
Guide identifies the parts of the program that are not dependent on one another. These parts can run
simultaneously on multiple computers, which means that SAS Enterprise Guide returns the results
more quickly. When SAS analyzes a program, lines of SAS/CONNECT code are added to your
original program. Therefore, you must have a license for SAS Grid Manager or SAS/CONNECT
to analyze a program for grid computing.
Both options run the code behind the scenes to complete the analysis. If a data set is open
in the SAS Enterprise Guide session, the analysis might fail. To view and close any open data
sets, select Tools → View Open Data Sets.
Using Autocomplete
In SAS Enterprise Guide 4.3, the Program Editor includes
an autocomplete feature. The editor can suggest
SAS statements
procedures
macro programs
macro variables
functions
formats
librefs
The autocomplete feature automatically suggests appropriate keywords. You can also manually open
the Autocomplete window by using the following shortcut keys:
Ctrl + spacebar     Open the Autocomplete window for the keyword on which the cursor is currently
                    positioned. In a blank program, this shortcut displays a list of global statements.
Ctrl + L            Open the Autocomplete window that contains a list of the SAS libraries that are
                    available with the current server connection.
Ctrl + D            Open the Autocomplete window that contains a list of data sets that were created
                    by using the DATA statement.
Ctrl + Shift + F1   Open the Autocomplete window that contains a list of SAS functions.
Ctrl + Shift + F2   Open the Autocomplete window that contains a list of macro functions.
Ctrl + Shift + F    Open the Autocomplete window that contains a list of SAS formats.
Ctrl + Shift + I    Open the Autocomplete window that contains a list of SAS informats.
Ctrl + Shift + K    Open the Autocomplete window that contains a list of statistics keywords.
Ctrl + Shift + C    Open the Autocomplete window that contains a list of SAS colors.
Ctrl + Shift + F4   Open the Autocomplete window that contains a list of style attributes.
Ctrl + Shift + F3   Open the Autocomplete window that contains a list of style elements.
Rearranging Windows
To modify the rules for formatting code, select Program → Editor Options → Indenter.
4. To execute the SAS program, select Run on the toolbar. A report is generated and lists the
observations in the sales data set. Twenty-one items are added to the project. Because TestScores
was the first data set created, it is automatically placed on a new tab. All other data sets are
accessible from the process flow.
Obs Purchase Gender Income Age
1 0 Female Low 40
2 0 Female Low 46
3 1 Female Low 41
5. To include a frequency report to analyze the distribution of Purchase in the Sales data set, use
the FREQ procedure in the SAS program. At the end of the program, type pr. A list of keywords
is provided. Press the spacebar to select the word PROC for the program.
6. A list of procedure names is automatically provided. Type fr and press the spacebar again to select
freq for the program. Next, a list of valid options for the PROC FREQ statement is provided.
Type d and press the spacebar to select data=.
7. A list of data sets in the project and defined libraries is provided. Select STAT1, press the spacebar
and then select SALES for the data set.
8. The list of valid options for the PROC FREQ statement appears again. Type o, select order=,
and press the spacebar. Type fr to select FREQ and then enter a semicolon to complete the statement
that appears as follows:
proc freq data=STAT1.sales order=freq;
9. Continue to use the autocomplete feature to write the remainder of the step:
proc freq data=STAT1.sales order=freq;
tables Gender*Purchase / chisq relrisk;
run;
10. Highlight the PROC FREQ step in the program and select Run → Run Selection. Select Yes when
you are prompted to replace the results.
Partial Results
Table of Gender by Purchase
Gender Purchase
Frequency
Percent
Row Pct
Col Pct 0 1 Total
Female 139 101 240
32.25 23.43 55.68
57.92 42.08
51.67 62.35
Male 130 61 191
30.16 14.15 44.32
68.06 31.94
48.33 37.65
Total 269 162 431
62.41 37.59 100.00
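As a numeric cross-check (Python, illustration only, not part of the SAS demonstration), the Pearson chi-square statistic requested by the CHISQ option can be recomputed from the frequency counts in the table above:

```python
# Illustration only: recompute Pearson's chi-square from the crosstab counts.

def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(table))
               for j in range(len(table[0])))

counts = [[139, 101],   # Female: Purchase = 0, 1
          [130, 61]]    # Male:   Purchase = 0, 1
print(round(pearson_chi_square(counts), 3))   # approximately 4.67
```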
11. The program now includes three steps and creates multiple data sets and reports. To better
visualize the flow of the program, return to the Program tab and select Analyze Program →
Analyze Program Flow.
12. In the Analyze SAS Program window, select Begin analysis. Then type Analysis of Associations
in the Name of process flow to create field and select Create process flow → Close.
If a data set is open in the SAS Enterprise Guide session, the analysis might fail.
To view and close any open data sets, select Tools → View Open Data Sets….
A new process flow is added to the project, and illustrates the flow of the steps in the program.
To delete a process flow, right-click on the process flow in the Project Tree and select Delete.
13. The Program Editor also includes syntax tooltips. Double-click on the st10dd01 program in the
Project Tree or Process Flow window. Hold the mouse pointer over any keyword in the program.
A tooltip displays syntax details for that particular step or statement.
You can view syntax tooltips by holding the mouse pointer over items in the autocomplete
windows.
14. Save the modified program by returning to the Program tab and selecting Save → Save As….
Save the program as st100d05s and select Save.
Exporting Code
All SAS code within a project can be exported to a file that
can be edited and executed in other SAS environments.
Select File → Export → Export All Code in Project.
Project Log
The project log can be used to maintain and export
an aggregated log of all code submitted for the project.